diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000000000000000000000000000000000000..10be5421fda98877b22feb1e7332979fb4057d9e --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,157 @@ +**简体中文**🀄 | [English🌎](.github/CONTRIBUTING_en.md) + +# Contributing to PaddleNLP + +我们非常欢迎并希望您对`PaddleNLP`做出开源贡献。在您开始提交您的贡献之前,请先行签署[PaddlePaddle 贡献者许可协议](https://cla-assistant.io/PaddlePaddle/PaddleNLP)。 +本文接下来将介绍我们的开发与贡献流程: + +## 贡献方式 + +我们欢迎不同的向`PaddleNLP`做出贡献的方式,例如: + +- 修复已知的Issue +- 提交新的Issue,例如提出功能需求或者bug报告 +- 实现新的模型结构 + +如果您不知道从哪里开始,请查看Issues板块中的`Good First Issue`标签。它为您提供一个对初学者友好的已知Issue列表,可以降低贡献的门槛,帮助您开始为开源做出贡献。您只需在您想处理的Issue中告知我们您想负责此Issue即可。 + +## 开发流程 + +PaddleNLP 使用 [Git 分支模型](http://nvie.com/posts/a-successful-git-branching-model/)。对于常见的开源贡献,我们有以下的贡献流程: + +#### 1. Fork + + 因为PaddleNLP的开发社区一直在发展,如果每位贡献者都直接向官方Repo提交commit将会难以管理。因此,请从您的分支中提交 Pull Requests。建议您通过GitHub的[“Fork”按钮](https://help.github.com/articles/fork-a-repo/)来创建您的Fork分支。 + +#### 2. Clone + + 请运行一下命令将您的分支clone到本地 + + ```bash + git clone https://github.com//PaddleNLP + cd PaddleNLP + ``` + +#### 3. 创建本地开发分支 + + 对于添加新功能或修复错误等日常工作,请在开发前创建您的本地开发分支: + + ```bash + git checkout -b my-cool-feature + ``` + +#### 4. 配置开发环境 + 在开始编码之前,您需要设置开发环境。我们强烈建议您在虚拟环境中进行所有开发,例如[venv](https://docs.python.org/3/library/venv.html)或[conda](https://docs.conda.io/en/latest/)。 + 请您设置并激活虚拟环境后,运行以下命令: + + ```bash + make install + ``` + + 这将设置 `PaddleNLP` 的所有依赖以及 [`pre-commit`](http://pre-commit.com/) 工具。 + + 如果您需要开发 `examples` 或 `applications` 模块并加载 `PaddleNLP`,请确保以可编辑模式(`-e`)安装 `PaddleNLP`。 + 如果在虚拟环境中已经安装 `PaddleNLP` ,请使用 `pip uninstall paddlenlp` 将其删除,然后以可编辑模式重新安装它 + `pip install -e .` + + +#### 5. 开发 + + 当您开发时,请确保您新增的代码会被单元测试所覆盖。我们所有的单元测试都可以在 `tests` 目录下找到。 + 您可以修改现有单元测试以覆盖新功能,也可以从头开始创建新测试。 + 当您完成代码时,您应该确保相关的单元测试可以通过。您可以像这样运行受更改影响的测试: + + ```bash + pytest tests/.py + ``` + +#### 6. 
Commit + + 我们使用 [`pre-commit`](http://pre-commit.com/)工具(包括[black](https://black.readthedocs.io/en/stable/)、[isort](https:/ /pycqa.github.io/isort/) 和 + [flake8](https://flake8.pycqa.org/en/latest/))来检查每次提交中的代码和文档的风格。当你运行 `git commit` 时,你会看到 + 类似于以下内容: + + ``` + ➜ (my-virtual-env) git commit -m "commiting my cool feature" + black....................................................................Passed + isort....................................................................Passed + flake8...................................................................Passed + check for merge conflicts................................................Passed + check for broken symlinks............................(no files to check)Skipped + detect private key.......................................................Passed + fix end of files.....................................(no files to check)Skipped + trim trailing whitespace.............................(no files to check)Skipped + CRLF end-lines checker...............................(no files to check)Skipped + CRLF end-lines remover...............................(no files to check)Skipped + No-tabs checker......................................(no files to check)Skipped + Tabs remover.........................................(no files to check)Skipped + copyright_checker........................................................Passed + ``` + + 但大多数时候事情并没有那么顺利。当您的代码或文档不符合标准时,`pre-commit` 检查将失败。 + ``` + ➜ (my-virtual-env) git commit -m "commiting my cool feature" + black....................................................................Passed + isort....................................................................Failed + - hook id: isort + - files were modified by this hook + + Fixing examples/information_extraction/waybill_ie/run_ernie_crf.py + + flake8...................................................................Passed + check for merge conflicts................................................Passed + check for broken symlinks............................(no files to check)Skipped + detect private key.......................................................Passed + fix end of files.....................................(no files to check)Skipped + trim trailing whitespace.............................(no files to check)Skipped + CRLF end-lines checker...............................(no files to check)Skipped + CRLF end-lines remover...............................(no files to check)Skipped + No-tabs checker......................................(no files to check)Skipped + Tabs remover.........................................(no files to check)Skipped + copyright_checker........................................................Passed + ``` + + 我们的工具将自动修复大部分样式错误,但是有些错误需要手动解决。幸运的是,错误信息一般通俗易懂,很容易修复。 + 解决错误后,您可以再次运行 git add 和 git commit ,这将再次触发 pre-commit 。 + 一旦 pre-commit 检查通过,您就可以推送代码了。 + + [Google][http://google.com/] 或 [StackOverflow](https://stackoverflow.com/) 是帮助您了解代码风格错误的好工具。 + 如果您仍然无法弄清楚,请不要担心。您可以使用 `git commit -m "style error" --no-verify` 提交,我们很乐意在您创建 Pull Request 后帮助您。 + +#### 7. git pull与代码冲突 + + 有经验的 Git 用户经常从官方Repo中git pull。因为这样子他们会及早注意到与其他人的代码冲突,并且让代码冲突更容易解决 + + ```bash + git remote add upstream https://github.com/PaddlePaddle/PaddleNLP + git pull upstream develop + ``` + +#### 8. 
git push与提交Pull Request + + 您可以将您的本地开发分支中的工作 push 到您的fork的分支中: + + ```bash + git push origin my-cool-stuff + ``` + + git push之后,您可以提交Pull Request,请求[官方repo](https://github.com/PaddlePaddle/PaddleNLP) 采纳您的开发工作。请您依照[这些步骤](https://help.github.com/articles/creating-a-pull-request/)创建Pull Request。 + +#### 9. 删除已经合入的本地和远程分支 + + 为了保持您本地的工作区和fork分支的干净整洁,建议您在Pull Request合入之后删除本地的残余分支: + + ```bash + git push origin my-cool-stuff + git checkout develop + git pull upstream develop + git branch -d my-cool-stuff + ``` + +## 代码Review + +- 在您的Pull Request能够顺利通过本地测试以及CI的情况下,您可以在Pull Request中 @ 相关的Reviewer,提醒他们尽快对您的Pull Request进行Review。 + +- 请处理Reviewer的每一条评论。如果您已按照评论修改,请回复“完成”;否则,可以在评论下展开讨论。 + +- 如果您不希望您的Reviewer被电子邮件通知淹没,您可以[批量回复](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/)。 diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..744bf2ba7ca0e0fc6d2e30faa4e9bafd7b949e63 --- /dev/null +++ b/LICENSE @@ -0,0 +1,203 @@ +Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
diff --git a/Makefile b/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..5b58a1e38ec783775a113a1a5f2e837634c8da16 --- /dev/null +++ b/Makefile @@ -0,0 +1,77 @@ +# Makefile for PaddleNLP +# +# GitHb: https://github.com/PaddlePaddle/PaddleNLP +# Author: Paddle Team https://github.com/PaddlePaddle +# + +.PHONY: all +all : lint test +check_dirs := applications examples model_zoo paddlenlp pipelines ppdiffusers scripts tests +# # # # # # # # # # # # # # # Format Block # # # # # # # # # # # # # # # + +format: + pre-commit run isort + pre-commit run black + +# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # + +# # # # # # # # # # # # # # # Lint Block # # # # # # # # # # # # # # # + +.PHONY: lint +lint: + $(eval modified_py_files := $(shell python scripts/get_modified_files.py $(check_dirs))) + @if test -n "$(modified_py_files)"; then \ + echo ${modified_py_files}; \ + pre-commit run --files ${modified_py_files}; \ + else \ + echo "No library .py files were modified"; \ + fi + +# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # + +# # # # # # # # # # # # # # # Test Block # # # # # # # # # # # # # # # + +.PHONY: test +test: unit-test + +unit-test: + PYTHONPATH=$(shell pwd) pytest -v \ + -n auto \ + --durations 20 \ + --cov paddlenlp \ + --cov-report xml:coverage.xml + +# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # + +.PHONY: install +install: + pip install -r requirements-dev.txt + pip install -r requirements.txt + pip install -r paddlenlp/experimental/autonlp/requirements.txt + pre-commit install + + +.PHONY: deploy-ppdiffusers +deploy-ppdiffusers: + cd ppdiffusers && make install && make + +.PHONY: deploy-paddle-pipelines +deploy-paddle-pipelines: + cd pipelines && make install && make + +.PHONY: deploy-paddlenlp +deploy-paddlenlp: + # install related package + make install + # build + python3 setup.py sdist bdist_wheel + # upload + twine upload --skip-existing dist/* + +.PHONY: regression-all +release: + bash ./scripts/regression/run_release.sh 0 0,1 all + +.PHONY: regression-key +key: + bash ./scripts/regression/run_release.sh 0 0,1 p0 diff --git a/README.md b/README.md index 84ab048bd88560a9cd9ff89dada6d1742654612a..d714a3de5daf34e45f27874af4ee326a2e366069 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,225 @@ -# LLAMA_paddle +# LLAMA -llama-13b pretrain example for paddle \ No newline at end of file +## 论文 + +`LLaMA: Open and Efficient Foundation Language Models` + +- [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971) + +## 模型结构 + +LLaMA,这是一个基础语言模型的集合,参数范围从7B到65B。在数万亿的tokens上训练出的模型,并表明可以专门使用公开可用的数据集来训练最先进的模型,而不依赖于专有的和不可访问的数据集。特别是,llama 13B在大多数基准测试中优于GPT-3 (175B), LLaMA 65B与最好的模型Chinchilla-70B和PaLM-540B具有竞争力。LLAMA网络基于 Transformer 架构。提出了各种改进,并用于不同的模型,例如 PaLM。 + +llama模型结构.png + +以下是llama-13B的主要网络参数配置: + +``` +{ + "architectures": [ + "LlamaForCausalLM" + ], + "bos_token_id": 1, + "eos_token_id": 2, + "hidden_size": 5120, + "initializer_range": 0.02, + "intermediate_size": 13824, + "max_position_embeddings": 2048, + "model_type": "llama", + "num_attention_heads": 40, + "num_hidden_layers": 40, + "pad_token_id": 0, + "paddlenlp_version": null, + "rms_norm_eps": 1e-06, + "use_recompute": false, + "vocab_size": 32000 +} +``` + +## 算法原理 + +llama算法原理.png + +以下是与原始 Transformer 架构的主要区别: + +**预归一化**。为了提高训练稳定性,对每个transformer 子层的输入进行归一化,而不是对输出进行归一化。使用 RMSNorm 归一化函数。 + +**SwiGLU 激活函数 [PaLM]**。使用 SwiGLU 激活函数替换 ReLU 非线性以提高性能。使用 2 /3 4d 的维度而不是 PaLM 中的 4d。 + +**旋转嵌入**。移除了绝对位置嵌入,而是添加了旋转位置嵌入 
(RoPE),在网络的每一层。 + +## 数据集 + +数据详细制作流程可参考[此处](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md),例:OpenWebText2预训练数据制作参考[此处](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md) + +为了方便用户运行测试本模型,本项目提供了处理好的100k条doc的训练样本: + + cd ./llm/llama/ + mkkdir data && cd data + wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_ids.npy + wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_idx.npz + cd .. && tree data + data + ├── llama_openwebtext_100k_ids.npy + └── llama_openwebtext_100k_idx.npz + +## 环境配置 + +### Docker + +推荐使用docker方式运行,提供拉取的docker镜像,关于本项目所需新版本 DTK 等均可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装,docker中默认使用dtk-23.04.1: + +``` +docker pull registry.baidubce.com/device/paddle-dcu:dtk23.04.1-centos79-x86_64-gcc73 + +docker run -it --network=host --name=paddle_llama --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 -v `pwd`:/home registry.baidubce.com/device/paddle-dcu:dtk23.04.1-centos79-x86_64-gcc73 /bin/bash + +# 替换DTK-23.10 + +pip install paddlenlp==2.6.1 -i http://mirrors.aliyun.com/pypi/simple/ +wget http://10.6.10.68:8000/customized/paddle/llama/paddlepaddle_dtk2310-2.5.1-cp39-cp39-linux_x86_64.whl +pip3 install paddlepaddle_dtk2310-2.5.1-cp39-cp39-linux_x86_64.whl +pip3 install tool_helpers visualdl==2.5.3 -i http://mirrors.aliyun.com/pypi/simple/ +``` + +## 训练 + +权重链接 + +13B:[https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-13b](https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-13b) + +7B:[https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-7b](https://bj.bcebos.com/paddlenlp/models/community/facebook/llama-7b) + +该训练脚本需要1节点,每节点8张DCU-Z100L-32G。 + +并行配置采用TP 8,PP 1,使用fp16精度微调,配置如下: + +``` +--max_seq_length 2048 \ +--per_device_train_batch_size 1 \ +--gradient_accumulation_steps 2 \ +--per_device_eval_batch_size 2 \ +--use_flash_attention 0 \ +--use_fused_rms_norm 0 \ +--fp16 \ +--fp16_opt_level "O2" \ +--scale_loss 512 \ +--tensor_parallel_degree 8 \ +--learning_rate 0.00001 \ +--min_learning_rate 0.000001 \ +--max_steps 10000 \ +--save_steps 5000 \ +--weight_decay 0.01 \ +--warmup_ratio 0.01 \ +--max_grad_norm 1.0 \ +--logging_steps 10 \ +--dataloader_num_workers 1 \ +--eval_steps 1000 \ +--report_to "visualdl" \ +--sharding "stage1" \ +--disable_tqdm true \ +--continue_training 1 \ +--recompute 1 \ +--recompute_granularity full \ +--do_train \ +--do_eval \ +--device "gpu" \ +--distributed_dataloader 1 +``` + +微调命令: + +``` +cd ./llm/llama/ +bash run_trainer_tp8.sh +``` + +## result + +### 精度 + +训练数据:[https://bj.bcebos.com/paddlenlp/models/transformers/llama/data](https://bj.bcebos.com/paddlenlp/models/transformers/llama/data) + +使用的GPGPU:8张DCU-Z100L-32G。 + +模型精度(max_sequence_length: 2048): +| 卡数 | 分布式工具 | 收敛性 | +| :------: | :------: |:------: | +| 8 | Paddle | | +### input + +```plaintext +>>>冬天,中国哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者 +``` + +### output + +```plaintext +>>>回答:避寒,当然是去海南呀!海南的冬天,阳光明媚,温度适宜,而且空气清新,没有雾霾,没有沙尘暴,没有雾霾,没有雾霾! 
+``` + +## benchmark + +### 训练benchmark + +数据集使用[tatsu-lab/alpaca · Datasets at Hugging Face](https://huggingface.co/datasets/tatsu-lab/alpaca),将数据集放置在./examples/benchmark/peft/paddle下: + +``` +$tree tatsu-lab +tatsu-lab/ +└── alpaca + └── data + └── train-00000-of-00001-a09b74b3ef9c3b56.parquet +``` + +训练benchmark测试命令: + +``` +cd ./examples/benchmark/peft/paddle + +RCCL_NCHANNELS=8 HSA_FORCE_FINE_GRAIN_PCIE=1 python3 -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" benchmark.py --model_name_or_path facebook/llama-13b --english --train_data_size 1000 --intokens --intokens_length 1024 --num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 2 --evaluation_strategy no --save_strategy no --fp16 --fp16_opt_level O2 --recompute --tensor_parallel_degree 8 --logging_steps 50 --output_dir outputs +``` + +### 推理benchmark + +``` +cd ./examples/benchmark/peft/paddle +python3 inference_benchmark.py --model_name_or_path facebook/llama-13b --dtype float16 --do_forward --do_generate +``` + +### LAMBADA推理评估 + +``` +cd ./examples/benchmark/lambada +wget https://paddlenlp.bj.bcebos.com/data/benchmark/lambada_test.jsonl +``` + +验证LAMBADA数据集,运行以下脚本: + +``` +python3 eval.py \ +--model_name_or_path facebook/llama-13b \ +--batch_size 4 \ +--eval_path lambada_test.jsonl \ +--tensor_parallel_degree 1 \ +--cloze_eval +``` + +## 应用场景 + +### 算法类别 + +`自然语言处理` + +### 热点应用行业 + +`医疗,教育,科研,金融` + +## 源码仓库及问题反馈 + +- [https://developer.hpccube.com/codes/modelzoo/llama_paddle](https://developer.hpccube.com/codes/modelzoo/llama_paddle) + +## 参考 + +* https://huggingface.co/decapoda-research/llama-13b-hf +* [https://github.com/PaddlePaddle/PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) \ No newline at end of file diff --git a/README_en.md b/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..35b5a2dfb94ce42ce31f2e05cdda25c7ec68c14a --- /dev/null +++ b/README_en.md @@ -0,0 +1,356 @@ + +[简体中文🀄](./README.md) | **English🌎** + +

+ +------------------------------------------------------------------------------------------ + +

+ + + + + + + + + +

+ +

Features | Installation | Quick Start | API Reference | Community + +**PaddleNLP** is a NLP library that is both **easy to use** and **powerful**. It aggregates high-quality pretrained models in the industry and provides a **plug-and-play** development experience, covering a model library for various NLP scenarios. With practical examples from industry practices, PaddleNLP can meet the needs of developers who require **flexible customization**. + +## News 📢 + +* **2023.6.12: [Release of PaddleNLP v2.6rc](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.6.0rc)** + * 🔨 LLM Tools:Introduces comprehensive examples of open-source LLM training and inference, including [Bloom](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/bloom), [ChatGLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/chatglm), [GLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/glm), [Llama](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/llama) and [OPT](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/opt). Added Tensor Parallel capability to [Trainer API](./docs/trainer.md) for distributed LLM trainin. Also released [Parameter-Efficient Finetuning](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/peft),which enables training LLMs on consumer hardware. + +* **2023.1.12: [Release of PaddleNLP v2.5]()** + + * 🔨 NLP Tools: [PPDiffusers](./ppdiffusers), our cross-modal diffusion model toolbox based on PaddlePaddle, has been released! It provides a complete training process for diffusion models, and supports FastDeploy inference acceleration and multi-hardware deployment (supports Ascend chips and Kunlun core deployment). + * 💎 Industrial Applications: Information extraction, text classification, sentiment analysis, and intelligent question answering have all been newly upgraded. New releases include document information extraction [UIE-X](./applications/information_extraction/document), unified text classification [UTC](./applications/zero_shot_text_classification), unified sentiment analysis [UIE-Senta](./applications/sentiment_analysis/unified_sentiment_extraction) , and [unsupervised QA application](./applications/question_answering/unsupervised_qa). At the same time, the [ERNIE 3.0 Tiny v2](./model_zoo/ernie-tiny) series of pretrained small models have been released, which are more effective with low-resource and foreign data. They provide open-source end-to-end deployment solutions such as model pruning, model quantization, FastDeploy inference acceleration, and edge-side deployment to reduce the difficulty of pretrained model deployment. + * 💪 Framework Upgrade: Pretrained model [parameter configuration unification](./paddlenlp/transformers/configuration_utils.py), saving and loading custom parameter configurations no longer requires additional development; [Trainer API](./docs/trainer.md) has added BF16 training, recompute recalculations, sharding, and other distributed capabilities. Large-scale pre-training model training can easily be accomplished through simple configuration. [Model Compression API](./docs/compression.md) supports quantization training, vocabulary compression, and other functions. The compressed model has smaller accuracy loss, and the memory consumption of model deployment is greatly reduced. 
[Data Augmentation API](./docs/dataaug.md) has been comprehensively upgraded to support three granularities of data augmentation strategy: character, word, and sentence, making it easy to customize data augmentation strategies. + * 🤝 Community: 🤗Huggingface hub officially supports PaddleNLP pretrained models, supporting PaddleNLP Model and Tokenizer downloads and uploads directly from the 🤗Huggingface hub. Everyone is welcome to try out PaddleNLP pretrained models on the 🤗Huggingface hub [here](https://huggingface.co/PaddlePaddle). + +* **September 6, 2022: [Release of PaddleNLP v2.4]()** + + * 🔨 NLP Tools: [NLP Pipeline System Pipelines](./pipelines) has been released, supporting the rapid construction of search engines and question-answering systems, and can be extended to support various NLP systems, making it easy, flexible, and efficient to solve NLP tasks like building blocks! + * 💎 Industrial Applications: A new [text classification full-process application solution](./applications/text_classification) has been added, covering various scenarios such as multi-classification, multi-label, and hierarchical classification, supporting small-sample learning and TrustAI trustworthy computing model training and tuning. + * 🍭 AIGC: The SOTA model [CodeGen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/code_generation/codegen) for code generation in various programming languages has been added. + * 💪 Framework Upgrade: [Automatic Model Compression API](./docs/compression.md) has been released, which automatically cuts and quantizes models, greatly reducing the threshold for using model compression technology. [Few-shot Prompt](./applications/text_classification/multi_class/few-shot) capability has been released, integrating classic algorithms such as PET, P-Tuning, and RGL. + + + + + + +## Features + +#### 📦 Out-of-Box NLP Toolset + +#### 🤗 Awesome Chinese Model Zoo + +#### 🎛️ Industrial End-to-end System + +#### 🚀 High Performance Distributed Training and Inference + + +### Out-of-Box NLP Toolset + +Taskflow aims to provide off-the-shelf NLP pre-built task covering NLU and NLG technique, in the meanwhile with extremely fast inference satisfying industrial scenario. + +![taskflow1](https://user-images.githubusercontent.com/11793384/159693816-fda35221-9751-43bb-b05c-7fc77571dd76.gif) + +For more usage please refer to [Taskflow Docs](./docs/model_zoo/taskflow.md). + +### Awesome Chinese Model Zoo + +#### 🀄 Comprehensive Chinese Transformer Models + +We provide **45+** network architectures and over **500+** pretrained models. Not only includes all the SOTA model like ERNIE, PLATO and SKEP released by Baidu, but also integrates most of the high-quality Chinese pretrained model developed by other organizations. Use `AutoModel` API to **⚡SUPER FAST⚡** download pretrained models of different architecture. We welcome all developers to contribute your Transformer models to PaddleNLP! + +```python +from paddlenlp.transformers import * + +ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh') +bert = AutoModel.from_pretrained('bert-wwm-chinese') +albert = AutoModel.from_pretrained('albert-chinese-tiny') +roberta = AutoModel.from_pretrained('roberta-wwm-ext') +electra = AutoModel.from_pretrained('chinese-electra-small') +gpt = AutoModelForPretraining.from_pretrained('gpt-cpm-large-cn') +``` + +Due to the computation limitation, you can use the ERNIE-Tiny light models to accelerate the deployment of pretrained models. 
+```python +# 6L768H +ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh') +# 6L384H +ernie = AutoModel.from_pretrained('ernie-3.0-mini-zh') +# 4L384H +ernie = AutoModel.from_pretrained('ernie-3.0-micro-zh') +# 4L312H +ernie = AutoModel.from_pretrained('ernie-3.0-nano-zh') +``` +Unified API experience for NLP task like semantic representation, text classification, sentence matching, sequence labeling, question answering, etc. + +```python +import paddle +from paddlenlp.transformers import * + +tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') +text = tokenizer('natural language processing') + +# Semantic Representation +model = AutoModel.from_pretrained('ernie-3.0-medium-zh') +sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']])) +# Text Classificaiton and Matching +model = AutoModelForSequenceClassification.from_pretrained('ernie-3.0-medium-zh') +# Sequence Labeling +model = AutoModelForTokenClassification.from_pretrained('ernie-3.0-medium-zh') +# Question Answering +model = AutoModelForQuestionAnswering.from_pretrained('ernie-3.0-medium-zh') +``` + +#### Wide-range NLP Task Support + +PaddleNLP provides rich examples covering mainstream NLP task to help developers accelerate problem solving. You can find our powerful transformer [Model Zoo](./model_zoo), and wide-range NLP application [examples](./examples) with detailed instructions. + +Also you can run our interactive [Notebook tutorial](https://aistudio.baidu.com/aistudio/personalcenter/thirdview/574995) on AI Studio, a powerful platform with **FREE** computing resource. + +
PaddleNLP Transformer model summary (click to show details)
+ +| Model | Sequence Classification | Token Classification | Question Answering | Text Generation | Multiple Choice | +| :----------------- | ----------------------- | -------------------- | ------------------ | --------------- | --------------- | +| ALBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| BART | ✅ | ✅ | ✅ | ✅ | ❌ | +| BERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| BigBird | ✅ | ✅ | ✅ | ❌ | ✅ | +| BlenderBot | ❌ | ❌ | ❌ | ✅ | ❌ | +| ChineseBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| ConvBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| CTRL | ✅ | ❌ | ❌ | ❌ | ❌ | +| DistilBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| ELECTRA | ✅ | ✅ | ✅ | ❌ | ✅ | +| ERNIE | ✅ | ✅ | ✅ | ❌ | ✅ | +| ERNIE-CTM | ❌ | ✅ | ❌ | ❌ | ❌ | +| ERNIE-Doc | ✅ | ✅ | ✅ | ❌ | ❌ | +| ERNIE-GEN | ❌ | ❌ | ❌ | ✅ | ❌ | +| ERNIE-Gram | ✅ | ✅ | ✅ | ❌ | ❌ | +| ERNIE-M | ✅ | ✅ | ✅ | ❌ | ❌ | +| FNet | ✅ | ✅ | ✅ | ❌ | ✅ | +| Funnel-Transformer | ✅ | ✅ | ✅ | ❌ | ❌ | +| GPT | ✅ | ✅ | ❌ | ✅ | ❌ | +| LayoutLM | ✅ | ✅ | ❌ | ❌ | ❌ | +| LayoutLMv2 | ❌ | ✅ | ❌ | ❌ | ❌ | +| LayoutXLM | ❌ | ✅ | ❌ | ❌ | ❌ | +| LUKE | ❌ | ✅ | ✅ | ❌ | ❌ | +| mBART | ✅ | ❌ | ✅ | ❌ | ✅ | +| MegatronBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| MobileBERT | ✅ | ❌ | ✅ | ❌ | ❌ | +| MPNet | ✅ | ✅ | ✅ | ❌ | ✅ | +| NEZHA | ✅ | ✅ | ✅ | ❌ | ✅ | +| PP-MiniLM | ✅ | ❌ | ❌ | ❌ | ❌ | +| ProphetNet | ❌ | ❌ | ❌ | ✅ | ❌ | +| Reformer | ✅ | ❌ | ✅ | ❌ | ❌ | +| RemBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| RoBERTa | ✅ | ✅ | ✅ | ❌ | ✅ | +| RoFormer | ✅ | ✅ | ✅ | ❌ | ❌ | +| SKEP | ✅ | ✅ | ❌ | ❌ | ❌ | +| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| T5 | ❌ | ❌ | ❌ | ✅ | ❌ | +| TinyBERT | ✅ | ❌ | ❌ | ❌ | ❌ | +| UnifiedTransformer | ❌ | ❌ | ❌ | ✅ | ❌ | +| XLNet | ✅ | ✅ | ✅ | ❌ | ✅ | + +
+ +For more pretrained model usage, please refer to [Transformer API Docs](./docs/model_zoo/index.rst). + +### Industrial End-to-end System + +We provide high-value industrial scenarios including information extraction, semantic retrieval, and question answering. + +For more details on industrial cases, please refer to [Applications](./applications). + + +#### 🔍 Neural Search System + +
+ +
+ + +For more details please refer to [Neural Search](./applications/neural_search). + +#### ❓ Question Answering System + +We provide question answering pipeline which can support FAQ system, Document-level Visual Question answering system based on [🚀RocketQA](https://github.com/PaddlePaddle/RocketQA). + +
+ +
+ + +For more details please refer to [Question Answering](./applications/question_answering) and [Document VQA](./applications/document_intelligence/doc_vqa). + + +#### 💌 Opinion Extraction and Sentiment Analysis + +We build an opinion extraction system for product review and fine-grained sentiment analysis based on [SKEP](https://arxiv.org/abs/2005.05635) Model. + +
+ +
+ + +For more details please refer to [Sentiment Analysis](./applications/sentiment_analysis). + +#### 🎙️ Speech Command Analysis + +Integrated ASR Model, Information Extraction, we provide a speech command analysis pipeline that show how to use PaddleNLP and [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) to solve Speech + NLP real scenarios. + +
+ +
+ + +For more details please refer to [Speech Command Analysis](./applications/speech_cmd_analysis). + +### High Performance Distributed Training and Inference + +#### ⚡ FastTokenizer: High Performance Text Preprocessing Library + +
+ +
+ +```python +AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) +``` + +Set `use_fast=True` to use C++ Tokenizer kernel to achieve 100x faster on text pre-processing. For more usage please refer to [FastTokenizer](./fast_tokenizer). + +#### ⚡ FastGeneration: High Performance Generation Library + +
+ +
+ +```python +model = GPTLMHeadModel.from_pretrained('gpt-cpm-large-cn') +... +outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search', + use_fast=True) +``` + +Set `use_fast=True` to achieve 5x speedup for Transformer, GPT, BART, PLATO, UniLM text generation. For more usage please refer to [FastGeneration](./fast_generation). + +#### 🚀 Fleet: 4D Hybrid Distributed Training + +
+ +
+ + +For more super large-scale model pre-training details please refer to [GPT-3](./examples/language_model/gpt-3). + + +## Installation + +### Prerequisites + +* python >= 3.7 +* paddlepaddle >= 2.3 + +More information about PaddlePaddle installation please refer to [PaddlePaddle's Website](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/conda/linux-conda.html). + +### Python pip Installation + +``` +pip install --upgrade paddlenlp +``` + +or you can install the latest develop branch code with the following command: + +```shell +pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html +``` + +## Quick Start + +**Taskflow** aims to provide off-the-shelf NLP pre-built task covering NLU and NLG scenario, in the meanwhile with extremely fast inference satisfying industrial applications. + +```python +from paddlenlp import Taskflow + +# Chinese Word Segmentation +seg = Taskflow("word_segmentation") +seg("第十四届全运会在西安举办") +>>> ['第十四届', '全运会', '在', '西安', '举办'] + +# POS Tagging +tag = Taskflow("pos_tagging") +tag("第十四届全运会在西安举办") +>>> [('第十四届', 'm'), ('全运会', 'nz'), ('在', 'p'), ('西安', 'LOC'), ('举办', 'v')] + +# Named Entity Recognition +ner = Taskflow("ner") +ner("《孤女》是2010年九州出版社出版的小说,作者是余兼羽") +>>> [('《', 'w'), ('孤女', '作品类_实体'), ('》', 'w'), ('是', '肯定词'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('的', '助词'), ('小说', '作品类_概念'), (',', 'w'), ('作者', '人物类_概念'), ('是', '肯定词'), ('余兼羽', '人物类_实体')] + +# Dependency Parsing +ddp = Taskflow("dependency_parsing") +ddp("9月9日上午纳达尔在亚瑟·阿什球场击败俄罗斯球员梅德韦杰夫") +>>> [{'word': ['9月9日', '上午', '纳达尔', '在', '亚瑟·阿什球场', '击败', '俄罗斯', '球员', '梅德韦杰夫'], 'head': [2, 6, 6, 5, 6, 0, 8, 9, 6], 'deprel': ['ATT', 'ADV', 'SBV', 'MT', 'ADV', 'HED', 'ATT', 'ATT', 'VOB']}] + +# Sentiment Analysis +senta = Taskflow("sentiment_analysis") +senta("这个产品用起来真的很流畅,我非常喜欢") +>>> [{'text': '这个产品用起来真的很流畅,我非常喜欢', 'label': 'positive', 'score': 0.9938690066337585}] +``` + +## API Reference + +- Support [LUGE](https://www.luge.ai/) dataset loading and compatible with Hugging Face [Datasets](https://huggingface.co/datasets). For more details please refer to [Dataset API](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_list.html). +- Using Hugging Face style API to load 500+ selected transformer models and download with fast speed. For more information please refer to [Transformers API](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html). +- One-line of code to load pre-trained word embedding. For more usage please refer to [Embedding API](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/embeddings.html). + +Please find all PaddleNLP API Reference from our [readthedocs](https://paddlenlp.readthedocs.io/). + +## Community + +### Slack + +To connect with other users and contributors, welcome to join our [Slack channel](https://paddlenlp.slack.com/). + +### WeChat + +Scan the QR code below with your Wechat⬇️. You can access to official technical exchange group. Look forward to your participation. + +
+ +
+ + + +## Citation + +If you find PaddleNLP useful in your research, please consider cite +``` +@misc{=paddlenlp, + title={PaddleNLP: An Easy-to-use and High Performance NLP Library}, + author={PaddleNLP Contributors}, + howpublished = {\url{https://github.com/PaddlePaddle/PaddleNLP}}, + year={2021} +} +``` + +## Acknowledge + +We have borrowed from Hugging Face's [Transformers](https://github.com/huggingface/transformers)🤗 excellent design on pretrained models usage, and we would like to express our gratitude to the authors of Hugging Face and its open source community. + +## License + +PaddleNLP is provided under the [Apache-2.0 License](./LICENSE). diff --git a/README_origin.md b/README_origin.md new file mode 100644 index 0000000000000000000000000000000000000000..aefd03450188d7594e6443066015caa57977088d --- /dev/null +++ b/README_origin.md @@ -0,0 +1,345 @@ +**简体中文**🀄 | [English🌎](./README_en.md) + +

+ +

+ +------------------------------------------------------------------------------------------ + +

+ + + + + + + + + +

+ + +

+ 安装 | + 快速开始 | + 特性 | + 社区交流 +

+ +**PaddleNLP**是一款**简单易用**且**功能强大**的自然语言处理和大语言模型(LLM)开发库。聚合业界**优质预训练模型**并提供**开箱即用**的开发体验,覆盖NLP多场景的模型库搭配**产业实践范例**可满足开发者**灵活定制**的需求。 + +## News 📢 + +* **2023.8.15 [PaddleNLP v2.6](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.6.0)**: 发布[全流程大模型工具链](./llm),涵盖预训练,精调,压缩,推理以及部署等各个环节,为用户提供端到端的大模型方案和一站式的开发体验;内置[4D并行分布式Trainer](./docs/trainer.md),[高效微调算法LoRA/Prefix Tuning](./llm#33-lora), [自研INT8/INT4量化算法](./llm#6-量化)等等;全面支持[LLaMA 1/2](./llm/llama), [BLOOM](.llm/bloom), [ChatGLM 1/2](./llm/chatglm), [GLM](./llm/glm), [OPT](./llm/opt)等主流大模型 + + +## 安装 + +### 环境依赖 + +- python >= 3.7 +- paddlepaddle >= 2.5.1 +- 如需大模型功能,请使用 paddlepaddle-gpu >= 2.5.1 + +### pip安装 + +```shell +pip install --upgrade paddlenlp +``` + +或者可通过以下命令安装最新 develop 分支代码: + +```shell +pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html +``` + +更多关于PaddlePaddle和PaddleNLP安装的详细教程请查看[Installation](./docs/get_started/installation.rst)。 + +## 快速开始 + + +### 大模型文本生成 + +PaddleNLP提供了方便易用的Auto API,能够快速的加载模型和Tokenizer。这里以使用 `linly-ai/chinese-llama-2-7b` 大模型做文本生成为例: + +```python +>>> from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM +>>> tokenizer = AutoTokenizer.from_pretrained("linly-ai/chinese-llama-2-7b") +>>> model = AutoModelForCausalLM.from_pretrained("linly-ai/chinese-llama-2-7b", dtype="float16") +>>> input_features = tokenizer("你好!请自我介绍一下。", return_tensors="pd") +>>> outputs = model.generate(**input_features, max_length=128) +>>> tokenizer.batch_decode(outputs[0]) +['\n你好!我是一个AI语言模型,可以回答你的问题和提供帮助。'] +``` + +### 一键UIE预测 + +PaddleNLP提供[一键预测功能](./docs/model_zoo/taskflow.md),无需训练,直接输入数据即可开放域抽取结果。这里以信息抽取-命名实体识别任务,UIE模型为例: + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction +>>> ie = Taskflow('information_extraction', schema=schema) +>>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) +[{'时间': [{'end': 6, + 'probability': 0.9857378532924486, + 'start': 0, + 'text': '2月8日上午'}], + '赛事名称': [{'end': 23, + 'probability': 0.8503089953268272, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + '选手': [{'end': 31, + 'probability': 0.8981548639781138, + 'start': 28, + 'text': '谷爱凌'}]}] +``` + +更多PaddleNLP内容可参考: +- [大模型全流程工具链](./llm),包含主流中文大模型的全流程方案。 +- [精选模型库](./model_zoo),包含优质预训练模型的端到端全流程使用。 +- [多场景示例](./examples),了解如何使用PaddleNLP解决NLP多种技术问题,包含基础技术、系统应用与拓展应用。 +- [交互式教程](https://aistudio.baidu.com/aistudio/personalcenter/thirdview/574995),在🆓免费算力平台AI Studio上快速学习PaddleNLP。 + + +## 特性 + +#### 📦 开箱即用的NLP工具集 + +#### 🤗 丰富完备的中文模型库 + +#### 🎛️ 产业级端到端系统范例 + +#### 🚀 高性能分布式训练与推理 + + +### 开箱即用的NLP工具集 + +Taskflow提供丰富的**📦开箱即用**的产业级NLP预置模型,覆盖自然语言理解与生成两大场景,提供**💪产业级的效果**与**⚡️极致的推理性能**。 + +![taskflow1](https://user-images.githubusercontent.com/11793384/159693816-fda35221-9751-43bb-b05c-7fc77571dd76.gif) + +更多使用方法可参考[Taskflow文档](./docs/model_zoo/taskflow.md)。 +### 丰富完备的中文模型库 + +#### 🀄 业界最全的中文预训练模型 + +精选 45+ 个网络结构和 500+ 个预训练模型参数,涵盖业界最全的中文预训练模型:既包括文心NLP大模型的ERNIE、PLATO等,也覆盖BERT、GPT、RoBERTa、T5等主流结构。通过`AutoModel` API一键⚡**高速下载**⚡。 + +```python +from paddlenlp.transformers import * + +ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh') +bert = AutoModel.from_pretrained('bert-wwm-chinese') +albert = AutoModel.from_pretrained('albert-chinese-tiny') +roberta = AutoModel.from_pretrained('roberta-wwm-ext') +electra = AutoModel.from_pretrained('chinese-electra-small') +gpt = AutoModelForPretraining.from_pretrained('gpt-cpm-large-cn') +``` + 
+针对预训练模型计算瓶颈,可以使用API一键使用文心ERNIE-Tiny全系列轻量化模型,降低预训练模型部署难度。 + +```python +# 6L768H +ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh') +# 6L384H +ernie = AutoModel.from_pretrained('ernie-3.0-mini-zh') +# 4L384H +ernie = AutoModel.from_pretrained('ernie-3.0-micro-zh') +# 4L312H +ernie = AutoModel.from_pretrained('ernie-3.0-nano-zh') +``` + +对预训练模型应用范式如语义表示、文本分类、句对匹配、序列标注、问答等,提供统一的API体验。 + +```python +import paddle +from paddlenlp.transformers import * + +tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') +text = tokenizer('自然语言处理') + +# 语义表示 +model = AutoModel.from_pretrained('ernie-3.0-medium-zh') +sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']])) +# 文本分类 & 句对匹配 +model = AutoModelForSequenceClassification.from_pretrained('ernie-3.0-medium-zh') +# 序列标注 +model = AutoModelForTokenClassification.from_pretrained('ernie-3.0-medium-zh') +# 问答 +model = AutoModelForQuestionAnswering.from_pretrained('ernie-3.0-medium-zh') +``` + +#### 💯 全场景覆盖的应用示例 + +覆盖从学术到产业的NLP应用示例,涵盖NLP基础技术、NLP系统应用以及拓展应用。全面基于飞桨核心框架2.0全新API体系开发,为开发者提供飞桨文本领域的最佳实践。 + +精选预训练模型示例可参考[Model Zoo](./model_zoo),更多场景示例文档可参考[examples目录](./examples)。更有免费算力支持的[AI Studio](https://aistudio.baidu.com)平台的[Notbook交互式教程](https://aistudio.baidu.com/aistudio/personalcenter/thirdview/574995)提供实践。 + +
PaddleNLP预训练模型适用任务汇总(点击展开详情
+ +| Model | Sequence Classification | Token Classification | Question Answering | Text Generation | Multiple Choice | +| :----------------- | ----------------------- | -------------------- | ------------------ | --------------- | --------------- | +| ALBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| BART | ✅ | ✅ | ✅ | ✅ | ❌ | +| BERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| BigBird | ✅ | ✅ | ✅ | ❌ | ✅ | +| BlenderBot | ❌ | ❌ | ❌ | ✅ | ❌ | +| ChineseBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| ConvBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| CTRL | ✅ | ❌ | ❌ | ❌ | ❌ | +| DistilBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| ELECTRA | ✅ | ✅ | ✅ | ❌ | ✅ | +| ERNIE | ✅ | ✅ | ✅ | ❌ | ✅ | +| ERNIE-CTM | ❌ | ✅ | ❌ | ❌ | ❌ | +| ERNIE-Doc | ✅ | ✅ | ✅ | ❌ | ❌ | +| ERNIE-GEN | ❌ | ❌ | ❌ | ✅ | ❌ | +| ERNIE-Gram | ✅ | ✅ | ✅ | ❌ | ❌ | +| ERNIE-M | ✅ | ✅ | ✅ | ❌ | ❌ | +| FNet | ✅ | ✅ | ✅ | ❌ | ✅ | +| Funnel-Transformer | ✅ | ✅ | ✅ | ❌ | ❌ | +| GPT | ✅ | ✅ | ❌ | ✅ | ❌ | +| LayoutLM | ✅ | ✅ | ❌ | ❌ | ❌ | +| LayoutLMv2 | ❌ | ✅ | ❌ | ❌ | ❌ | +| LayoutXLM | ❌ | ✅ | ❌ | ❌ | ❌ | +| LUKE | ❌ | ✅ | ✅ | ❌ | ❌ | +| mBART | ✅ | ❌ | ✅ | ❌ | ✅ | +| MegatronBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| MobileBERT | ✅ | ❌ | ✅ | ❌ | ❌ | +| MPNet | ✅ | ✅ | ✅ | ❌ | ✅ | +| NEZHA | ✅ | ✅ | ✅ | ❌ | ✅ | +| PP-MiniLM | ✅ | ❌ | ❌ | ❌ | ❌ | +| ProphetNet | ❌ | ❌ | ❌ | ✅ | ❌ | +| Reformer | ✅ | ❌ | ✅ | ❌ | ❌ | +| RemBERT | ✅ | ✅ | ✅ | ❌ | ✅ | +| RoBERTa | ✅ | ✅ | ✅ | ❌ | ✅ | +| RoFormer | ✅ | ✅ | ✅ | ❌ | ❌ | +| SKEP | ✅ | ✅ | ❌ | ❌ | ❌ | +| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| T5 | ❌ | ❌ | ❌ | ✅ | ❌ | +| TinyBERT | ✅ | ❌ | ❌ | ❌ | ❌ | +| UnifiedTransformer | ❌ | ❌ | ❌ | ✅ | ❌ | +| XLNet | ✅ | ✅ | ✅ | ❌ | ✅ | + +
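在上表选定任务对应的 `AutoModelFor*` 任务头之后,推理调用方式是一致的。下面给出一个文本分类方向的最小推理示意(模型、类别数与输入文本均为演示假设,未在具体任务上微调前,输出的概率不具业务含义):

```python
import paddle
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")
# num_classes 为假设的类别数,实际使用前需在自己的分类数据上微调
model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=2)
model.eval()

inputs = tokenizer("这家店的服务很贴心", return_tensors="pd")
with paddle.no_grad():
    logits = model(**inputs)  # 形状为 [batch_size, num_classes]
probs = paddle.nn.functional.softmax(logits, axis=-1)
print(probs.numpy())
```

序列标注、问答等其他任务头的输出形状不同,但加载与前向调用方式与上面一致。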
+ +可参考[Transformer 文档](/docs/model_zoo/index.rst) 查看目前支持的预训练模型结构、参数和详细用法。 + +### 产业级端到端系统范例 + +PaddleNLP针对信息抽取、语义检索、智能问答、情感分析等高频NLP场景,提供了端到端系统范例,打通*数据标注*-*模型训练*-*模型调优*-*预测部署*全流程,持续降低NLP技术产业落地门槛。更多详细的系统级产业范例使用说明请参考[Applications](./applications)。 + +#### 🔍 语义检索系统 + +针对无监督数据、有监督数据等多种数据情况,结合SimCSE、In-batch Negatives、ERNIE-Gram单塔模型等,推出前沿的语义检索方案,包含召回、排序环节,打通训练、调优、高效向量检索引擎建库和查询全流程。 + +
+ +
+ + +更多使用说明请参考[语义检索系统](./applications/neural_search)。 + +#### ❓ 智能问答系统 + +基于[🚀RocketQA](https://github.com/PaddlePaddle/RocketQA)技术的检索式问答系统,支持FAQ问答、说明书问答等多种业务场景。 + +
+ +
+ + +更多使用说明请参考[智能问答系统](./applications/question_answering)与[文档智能问答](./applications/document_intelligence/doc_vqa) + +#### 💌 评论观点抽取与情感分析 + +基于情感知识增强预训练模型SKEP,针对产品评论进行评价维度和观点抽取,以及细粒度的情感分析。 + +
+ +
+ +更多使用说明请参考[情感分析](./applications/sentiment_analysis)。 + +#### 🎙️ 智能语音指令解析 + +集成了[PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech)和[百度开放平台](https://ai.baidu.com/)的语音识别和[UIE](./model_zoo/uie)通用信息抽取等技术,打造智能一体化的语音指令解析系统范例,该方案可应用于智能语音填单、智能语音交互、智能语音检索等场景,提高人机交互效率。 + +
+ +
+ +更多使用说明请参考[智能语音指令解析](./applications/speech_cmd_analysis)。 + +### 高性能分布式训练与推理 + +#### ⚡ FastTokenizer:高性能文本处理库 + +
+ +
+ +```python +AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) +``` + +为了实现更极致的模型部署性能,安装FastTokenizers后只需在`AutoTokenizer` API上打开 `use_fast=True`选项,即可调用C++实现的高性能分词算子,轻松获得超Python百余倍的文本处理加速,更多使用说明可参考[FastTokenizer文档](./fast_tokenizer)。 + +#### ⚡️ FastGeneration:高性能生成加速库 + +
+ +
+ +```python +model = GPTLMHeadModel.from_pretrained('gpt-cpm-large-cn') +... +outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search', + use_fast=True) +``` + +简单地在`generate()`API上打开`use_fast=True`选项,轻松在Transformer、GPT、BART、PLATO、UniLM等生成式预训练模型上获得5倍以上GPU加速,更多使用说明可参考[FastGeneration文档](./fast_generation)。 + +#### 🚀 Fleet:飞桨4D混合并行分布式训练技术 + +
+ +
+ + +更多关于千亿级AI模型的分布式训练使用说明可参考[GPT-3](./examples/language_model/gpt-3)。 + +## 社区交流 + +- 微信扫描二维码并填写问卷,回复小助手关键词(NLP)之后,即可加入交流群领取福利 + + - 与众多社区开发者以及官方团队深度交流。 + - 10G重磅NLP学习大礼包! + +
+ +
+ +## Citation + +如果PaddleNLP对您的研究有帮助,欢迎引用 + +``` +@misc{=paddlenlp, + title={PaddleNLP: An Easy-to-use and High Performance NLP Library}, + author={PaddleNLP Contributors}, + howpublished = {\url{https://github.com/PaddlePaddle/PaddleNLP}}, + year={2021} +} +``` + +## Acknowledge + +我们借鉴了Hugging Face的[Transformers](https://github.com/huggingface/transformers)🤗关于预训练模型使用的优秀设计,在此对Hugging Face作者及其开源社区表示感谢。 + +## License + +PaddleNLP遵循[Apache-2.0开源协议](./LICENSE)。 diff --git a/applications/README.md b/applications/README.md new file mode 100644 index 0000000000000000000000000000000000000000..650edeec997df86e868603283933fbdf5f50e24c --- /dev/null +++ b/applications/README.md @@ -0,0 +1,130 @@ +# 产业级端到端系统范例 + +## 1、简介 + +PaddleNLP 从预训练模型库出发,提供了经典预训练模型在主流 NLP 任务上丰富的[应用示例](../examples),满足了大量开发者的学习科研与基础应用需求。 + +针对更广泛的产业落地需求、更复杂的 NLP 场景任务,PaddleNLP 推出**产业级端到端系统范例库**(下文简称产业范例),提供单个模型之上的产业解决方案。 + +- 最强模型与实践———产业范例针对具体业务场景,提供最佳模型(组合),兼顾模型精度与性能,降低开发者模型选型成本; +- 全流程———打通数据标注-模型训练-模型调优-模型压缩—预测部署全流程,帮助开发者更低成本得完成产业落地。 + +## 2、基于 Pipelines 构建产业范例,加速落地 + +在面向不同场景任务建设一系列产业方案的过程中,不难发现,从技术基础设施角度看: + +(1)NLP系统都可以抽象为由多个基础组件串接而成的流水线系统; +(2)多个NLP流水线系统可共享使用相同的基础组件。 + +因此,PaddleNLP 逐渐孵化出了一套 NLP 流水线系统 [Pipelines](../pipelines),将各个 NLP 复杂系统的通用模块抽象封装为标准组件,支持开发者通过配置文件对标准组件进行组合,仅需几分钟即可定制化构建智能系统,让解决NLP任务像搭积木一样便捷、灵活、高效。同时,Pipelines 中预置了前沿的预训练模型和算法,在研发效率、模型效果和性能方面提供多重保障。因此,Pipelines 能够大幅加快开发者使用飞桨落地的效率。 + + +
+ +
+ +
+ +**PaddleNLP 提供了多个版本的产业范例:** + +- 如果你希望快速体验、直接应用、从零搭建一套完整系统,推荐使用 **Pipelines 版本**。这里集成了训练好的模型,无需关心模型训练细节;提供 Docker 环境,可快速一键部署端到端系统;打通前端 Demo 界面,便于直观展示、分析、调试效果。 +- 如果你希望使用自己的业务数据进行二次开发,推荐使用`./applications`目录下的**可定制版本**,训练好的模型可以直接集成进 Pipelines 中进行使用。 +- 也可以使用 [AI Studio](https://aistudio.baidu.com/aistudio/index) 在线 Jupyter Notebook 快速体验,有 GPU 算力哦。 + +| 场景任务 | Pipelines版本地址 | 可定制版本地址 | Notebook | +| :--------------- | ------- | ------- | ------- | +| **检索**| [字面+语义检索](../pipelines/examples/semantic-search) | [语义检索](./neural_search) | [基于Pipelines搭建检索系统](https://aistudio.baidu.com/aistudio/projectdetail/4442670)
[二次开发语义检索](https://aistudio.baidu.com/aistudio/projectdetail/3351784) | +| **问答** | [FAQ问答](../pipelines/examples/FAQ/)
[无监督检索式问答](../pipelines/examples/unsupervised-question-answering)
[有监督检索式问答](../pipelines/examples/question-answering) | [FAQ问答](./question_answering/supervised_qa)
[无监督检索式问答](./question_answering/unsupervised_qa) | [基于Pipelines搭建FAQ问答系统](https://aistudio.baidu.com/aistudio/projectdetail/4465498)
[基于Pipelines搭建抽取式问答系统](https://aistudio.baidu.com/aistudio/projectdetail/4442857)
[FAQ政务问答](https://aistudio.baidu.com/aistudio/projectdetail/3678873)
[FAQ保险问答](https://aistudio.baidu.com/aistudio/projectdetail/3882519) | +| **文本分类**| 暂无 | [文本分类](./text_classification) | [对话意图识别](https://aistudio.baidu.com/aistudio/projectdetail/2017202)
[法律文本多标签分类](https://aistudio.baidu.com/aistudio/projectdetail/3996601)
[层次分类](https://aistudio.baidu.com/aistudio/projectdetail/4568985) | +| **通用文本分类** | 暂无 | [通用文本分类](./zero_shot_text_classification) | | +| **通用信息抽取** | 暂无 | [通用信息抽取](./information_extraction) | [UIE快速体验](https://aistudio.baidu.com/aistudio/projectdetail/3914778)
[UIE微调实体抽取](https://aistudio.baidu.com/aistudio/projectdetail/4038499)
[UIE微调关系抽取](https://aistudio.baidu.com/aistudio/projectdetail/4371345)
[UIE-X快速体验](https://aistudio.baidu.com/aistudio/projectdetail/5017442)
[UIE-X微调](https://aistudio.baidu.com/aistudio/projectdetail/5261592) | +| **情感分析** | [情感分析](../pipelines/examples/sentiment_analysis) | [情感分析](./sentiment_analysis) | [情感分析](https://aistudio.baidu.com/aistudio/projectdetail/5318177)| +| **文档智能** | [文档抽取问答](../pipelines/examples/document-intelligence) | [跨模态文档问答](./document_intelligence/doc_vqa)| [文档抽取问答](https://aistudio.baidu.com/aistudio/projectdetail/4881278)
[汽车说明书问答](https://aistudio.baidu.com/aistudio/projectdetail/4049663) | +| **文生图** | [文生图系统](../pipelines/examples/text_to_image) | 可参考[PPDiffusers](../ppdiffusers) | | +| **语音指令解析** | 暂无 | [语音指令解析](./speech_cmd_analysis) | [语音指令解析](https://aistudio.baidu.com/aistudio/projectdetail/4399703) | +| **文本摘要** | 暂无 | [文本摘要](./text_summarization) | [文本摘要](https://aistudio.baidu.com/aistudio/projectdetail/4903667) | + +## 3、典型范例介绍 + +#### 📄 通用信息抽取系统 + +- 首个产业级通用信息抽取方案 UIE,面向纯文本,实现多任务统一建模,提供强大的零样本抽取和少样本快速迁移能力; +- 首个兼具文本及文档抽取能力、多语言、开放域的信息抽取方案 UIE-X,基于 [ERNIE-Layout](../model_zoo/ernie-layout) 跨模态布局增强预训练模型,集成 [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) 的 PP-OCR、PP-Structure 版面分析能力,小样本文档信息抽取效果领先。 + +
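下面给出一个纯文本 UIE 零样本抽取的最小示例(基于 `paddlenlp.Taskflow`,schema 与输入文本仅作演示):

```python
from pprint import pprint
from paddlenlp import Taskflow

# 通过 schema 声明要抽取的目标字段,无需任何训练即可零样本抽取
schema = ["时间", "选手", "赛事名称"]
ie = Taskflow("information_extraction", schema=schema)
pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!"))
```

文档场景(UIE-X)的输入与调用方式请以[通用信息抽取系统](./information_extraction)中的说明为准。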
+ +
+ + +详细使用说明请参考[通用信息抽取系统](./information_extraction),更多:[UIE 解读](https://mp.weixin.qq.com/s/-hHz8knHIKKqKCBTke7i5A)、[UIE-X 解读](https://zhuanlan.zhihu.com/p/592422623)。 + +#### 🔍 语义检索系统 + +- 前沿算法———基于 SimCSE、In-batch Negatives、ERNIE Pairwise、RocketQA Pointwise 等提供针对无监督、有监督等多种数据情况的多样化方案; +- 全流程———覆盖召回、排序环节,集成主流 ANN 引擎,同时兼容 ElasticSearch 字面检索模式,提供多路召回方案。打通训练、调优、高效向量检索引擎建库和查询全流程。 + +
+ +
+ +详细使用说明请参考[语义检索系统](./neural_search)。 + +#### ❓ 智能问答系统 + +- 端到端问答技术 [🚀RocketQA](https://github.com/PaddlePaddle/RocketQA),首个中文端到端问答模型,基于知识增强的预训练模型ERNIE和百万量级的人工标注数据集DuReader训练得到,效果优异; +- 覆盖有监督(如 FAQ 问答)、无监督(自动生成 QA 对,生成的问答对语料可以通过无监督的方式构建检索式问答系统)等多种情况,适用各类业务场景。 + +
+ +
 + +详细使用说明请参考[智能问答系统](./question_answering)与[文档智能问答](./document_intelligence/doc_vqa)。 + +#### 📚 通用文本分类 + +- 基于“任务架构统一、通用能力共享”的通用文本分类技术 UTC,实现了良好的零/少样本迁移能力,能够以统一模型覆盖诸多任务的开放域分类,可支持情感分析、意图识别、语义匹配、蕴含推理等各种可转换为分类问题的 NLU 任务(调用示例见下文)。 + +
+ +
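一个零样本分类的最小调用示意如下(基于 `paddlenlp.Taskflow`;任务名与候选标签均为示例性假设,实际接口请以[通用文本分类](./zero_shot_text_classification)目录中的文档为准):

```python
from paddlenlp import Taskflow

# 候选标签即为 schema,无需针对该任务训练即可零样本分类(任务名与标签为假设示例)
cls = Taskflow("zero_shot_text_classification", schema=["负向", "正向"])
print(cls("房间干净明亮,服务也很周到"))
```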
+ +
+ +详细使用说明请参考[通用文本分类](./zero_shot_text_classification),更多:[文章解读](https://mp.weixin.qq.com/s/VV-nYv4y1r7oipJnURRL5w)。 + + +#### 🗂 文本分类 + +- 场景方案全覆盖––––开源预训练模型-微调、提示学习、基于语义索引等多种分类技术方案,满足不同场景需求,涵盖多分类(multi-class)、多标签(multi-label)、层次分类(hierarchical)三类任务; +- 模型高效调优––––强强结合数据增强能力与可信增强技术,解决脏数据、标注数据欠缺、数据不平衡等问题,大幅提升模型效果。 + +
+ +
+ +
+ +详细使用说明请参考[文本分类](./text_classification),更多:[文章解读](https://mp.weixin.qq.com/s/tas7yM8vapxwtlJt-MRZdg)。 + +#### 💌 评论观点抽取与情感分析 + +- 经典方案:基于情感知识增强预训练模型SKEP,两阶段式抽取和分类,首先通过序列标注的方式定位属性词和观点词,然后进行属性级情感分类; +- 前沿方案:基于UIE的情感分析方案采用 Prompt Learning 的方式进行情感信息抽取,精度更高。支持语句级和属性级情感分析,解决同义属性聚合、隐性观点抽取难点,并提供可视化分析能力(语句级调用示例见下文)。 + +
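语句级情感分析可以直接用 `paddlenlp.Taskflow` 体验(以下调用方式来自 Taskflow 预置能力,输出仅作演示;属性级方案请参考下方链接的详细文档):

```python
from paddlenlp import Taskflow

# 语句级情感分析,开箱即用
senta = Taskflow("sentiment_analysis")
print(senta("这个产品用起来真的很流畅,我非常喜欢"))
# 输出形如:[{'text': '...', 'label': 'positive', 'score': 0.99}]
```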
+ +
+
+ +详细使用说明请参考[情感分析](./sentiment_analysis),更多:[文章解读](https://mp.weixin.qq.com/s/QAHjIRG9zxpYfM6YPRQ-9w)。 + +#### 🎙️ 智能语音指令解析 + +- 集成了[PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech)和[百度开放平台](https://ai.baidu.com/)的语音识别能力,以及[UIE](../model_zoo/uie)通用信息抽取等技术,打造智能一体化的语音指令解析系统范例,该方案可应用于智能语音填单、智能语音交互、智能语音检索等场景,提高人机交互效率。 + +
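+
+该范例的核心环节是对语音识别得到的文本做结构化信息抽取。下面给出一个跳过语音识别、直接对假设的识别结果文本用 UIE 抽取关键信息的最小示意(schema 与文本均为假设示例,完整的“语音识别+信息抽取”流程请参考本应用目录):
+
+```python
+>>> from pprint import pprint
+>>> from paddlenlp import Taskflow
+
+>>> # 假设语音识别环节已输出如下文本,这里只演示后续的信息抽取
+>>> asr_text = "帮我查一下明天上午十点从北京到上海的高铁"
+>>> schema = ["时间", "出发地", "目的地"]
+>>> ie = Taskflow("information_extraction", schema=schema)
+>>> pprint(ie(asr_text))
+```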
+ +
+ +详细使用说明请参考[智能语音指令解析](./applications/speech_cmd_analysis)。 diff --git a/applications/document_intelligence/README.md b/applications/document_intelligence/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bb6464aafd71bc0eef04f60a638846b6383dcdaf --- /dev/null +++ b/applications/document_intelligence/README.md @@ -0,0 +1,188 @@ +# 文档智能应用 + +**目录** +- [1. 文档智能应用简介](#文档智能应用简介) +- [2. 技术特色介绍](#技术特色介绍) + - [2.1 多语言跨模态训练基座](#多语言跨模态训练基座) + - [2.2 多场景覆盖](#多场景覆盖) +- [3. 快速开始](#快速开始) + - [3.1 开箱即用](#开箱即用) + - [3.2 产业级流程方案](#产业级流程方案) + +## 1. 文档智能应用简介 + +文档智能(DI, Document Intelligence)主要指**对于网页、数字文档或扫描文档所包含的文本以及丰富的排版格式等信息,通过人工智能技术进行理解、分类、提取以及信息归纳**的过程。文档智能技术广泛应用于金融、保险、能源、物流、医疗等行业,常见的应用场景包括财务报销单、招聘简历、企业财报、合同文书、动产登记证、法律判决书、物流单据等多模态文档的关键信息抽取、文档解析、文档比对等。 + +在实际应用中,需要解决文档格式繁杂、布局多样、信息模态多样、需求开放、业务数据少等多重难题。针对文档智能领域的痛点和难点,PaddleNLP将持续开源一系列产业实践范例,解决开发者们实际应用难题。 + +
+ 文档智能技术一般流程 +
+ + + +## 2. 技术特色介绍 + + + +### 2.1 多语言跨模态训练基座 + +近期,百度文心文档智能,基于多语言跨模态布局增强的文档智能大模型[ERNIE-Layout](http://arxiv.org/abs/2210.06155),刷新了五类11项文档智能任务效果。依托文心ERNIE大模型,基于布局知识增强技术,融合文本、图像、布局等信息进行联合建模,能够对多模态文档(如文档图片、PDF 文件、扫描件等)进行深度理解与分析,为各类上层应用提供SOTA模型底座。 + +
+ +
+ + + +### 2.2 多场景覆盖 + +以下是文档智能技术的一些应用场景展示: + +- 发票抽取问答 + +
+ +
+ +- 海报抽取问答 + +
+ +
+ +- 网页抽取问答 + +
+ +
+ + +- 表格抽取问答 + +
+ +
+ + +- 试卷抽取问答 + +
+ +
+ + +- 英文票据多语种(中、英、日、泰、西班牙、俄语)抽取问答 + +
+ +
+ +- 中文票据多语种(中简、中繁、英、日、法语)抽取问答 + +
+ +
+ +- Demo图片可在此[下载](https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/demo.zip) + + + +## 3. 快速开始 + + + +### 3.1 开箱即用 + +开源DocPrompt开放文档抽取问答模型,以ERNIE-Layout为底座,可精准理解图文信息,推理学习附加知识,准确捕捉图片、PDF等多模态文档中的每个细节。 + +🧾 通过[Huggingface网页](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout)体验DocPrompt功能: + +
+ +
+ +#### Taskflow + +通过``paddlenlp.Taskflow``三行代码调用DocPrompt功能,具备多语言文档抽取问答能力,部分应用场景展示如下: + +- 输入格式 + +``` +[ + {"doc": "./invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}, + {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]} +] +``` + +默认使用PaddleOCR进行OCR识别,同时支持用户通过``word_boxes``传入自己的OCR结果,格式为``List[str, List[float, float, float, float]]``。 + +``` +[ + {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes} +] +``` + +- 支持单条、批量预测 + + - 支持本地图片路径输入 + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence") + >>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}])) + [{'prompt': '五百丁本次想要担任的是什么职位?', + 'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]}, + {'prompt': '五百丁是在哪里上的大学?', + 'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]}, + {'prompt': '大学学的是什么专业?', + 'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}] + ``` + + - http图片链接输入 + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence") + >>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}])) + [{'prompt': '发票号码是多少?', + 'result': [{'end': 2, 'prob': 0.74, 'start': 2, 'value': 'No44527206'}]}, + {'prompt': '校验码是多少?', + 'result': [{'end': 233, + 'prob': 1.0, + 'start': 231, + 'value': '01107 555427109891646'}]}] + ``` + +- 可配置参数说明 + * `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 + * `lang`:选择PaddleOCR的语言,`ch`可在中英混合的图片中使用,`en`在英文图片上的效果更好,默认为`ch`。 + * `topn`: 如果模型识别出多个结果,将返回前n个概率值最高的结果,默认为1。 + + + +### 3.2 产业级流程方案 + +针对文档智能领域的痛点和难点,PaddleNLP将持续开源一系列文档智能产业实践范例,解决开发者们实际应用难题。 + +- 👉 [汽车说明书跨模态智能问答](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/document_intelligence/doc_vqa#readme) + +更多:百度TextMind智能文档分析平台可提供包括文档信息抽取、文本内容审查、企业文档管理、文档格式解析、文档内容比对等全方位一站式的文档智能服务,已形成一套完整的企业文档场景化解决方案,满足银行、券商、法律、能源、传媒、通信、物流等不同行业和场景的文档处理需求,以AI助力企业的办公智能化升级和数字化转型。欢迎深度交流与商业合作,了解详情:https://ai.baidu.com/tech/nlp/Textanalysis + +## References + +- [文档智能:数据集、模型和应用](http://jcip.cipsc.org.cn/CN/abstract/abstract3331.shtml) + +- [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](http://arxiv.org/abs/2210.06155) diff --git a/applications/document_intelligence/doc_vqa/.gitignore b/applications/document_intelligence/doc_vqa/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..cc8c76c471fb1e2ed206c2bbaacebd04f4f29068 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/.gitignore @@ -0,0 +1,17 @@ +OCR_process/*.json +*.png +*.json +answers/* +checkpoints/* +__pycache__/* +OCR_process/demo_pics/* +Rerank/log/* +Rerank/checkpoints/* +Rerank/data/* +Rerank/output/* +Rerank/__pycache__/* +Extraction/log/* +Extraction/checkpoints/* +Extraction/data/* +Extraction/output/* +Extraction/__pycache__/* diff --git a/applications/document_intelligence/doc_vqa/Extraction/change_to_mrc.py b/applications/document_intelligence/doc_vqa/Extraction/change_to_mrc.py new file mode 100644 index 0000000000000000000000000000000000000000..bb388b16605562c8adbca96292c008e3010342df --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/change_to_mrc.py @@ -0,0 +1,51 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import sys +import json +import numpy as np + + +def get_top1_from_ranker(path): + with open(path, "r", encoding="utf-8") as f: + scores = [float(line.strip()) for line in f.readlines()] + top_id = np.argmax(scores) + + return top_id + + +def get_ocr_result_by_id(path, top_id): + with open(path, "r", encoding="utf-8") as f: + reses = f.readlines() + res = reses[top_id] + return json.loads(res) + + +def write_to_file(doc, path): + with open(path, "w", encoding="utf-8") as f: + json.dump(doc, f, ensure_ascii=False) + f.write("\n") + + +if __name__ == "__main__": + question = sys.argv[1] + ranker_result_path = "../Rerank/data/demo.score" + ocr_result_path = "../OCR_process/demo_ocr_res.json" + save_path = "data/demo_test.json" + top_id = get_top1_from_ranker(ranker_result_path) + doc = get_ocr_result_by_id(ocr_result_path, top_id) + doc["question"] = question + doc["img_id"] = str(top_id + 1) + + write_to_file(doc, save_path) diff --git a/applications/document_intelligence/doc_vqa/Extraction/docvqa.py b/applications/document_intelligence/doc_vqa/Extraction/docvqa.py new file mode 100644 index 0000000000000000000000000000000000000000..0d3c576bc296790a570c9493e1571b1c825d26c4 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/docvqa.py @@ -0,0 +1,359 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import json +import sys + +import numpy as np +import paddle +from paddle.io import Dataset +from tqdm import tqdm + +sys.path.insert(0, "../") + + +class DocVQAExample(object): + def __init__(self, question, doc_tokens, doc_boxes=[], answer=None, labels=None, image=None): + self.question = question + self.doc_tokens = doc_tokens + self.doc_boxes = doc_boxes + self.image = image + self.answer = answer + self.labels = labels + + +class DocVQAFeatures(object): + """A single set of features of data.""" + + def __init__(self, example_index, input_ids, input_mask, segment_ids, boxes=None, label=None): + self.example_index = example_index + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.boxes = boxes + self.label = label + + +class DocVQA(Dataset): + def __init__( + self, args, tokenizer, label2id_map, max_seq_len=512, max_query_length=20, max_doc_length=512, max_span_num=1 + ): + super(DocVQA, self).__init__() + self.tokenizer = tokenizer + self.label2id_map = label2id_map + self.max_seq_len = max_seq_len + self.max_query_length = max_query_length + self.max_doc_length = max_doc_length + self.max_span_num = max_span_num + self.sample_list = None + self.args = args + + self.docvqa_inputs = self.docvqa_input() + + def check_is_max_context(self, doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + # Because of the sliding window approach taken to scoring documents, a single + # token can appear in multiple documents. E.g. 
+ # Doc: the man went to the store and bought a gallon of milk + # Span A: the man went to the + # Span B: to the store and bought + # Span C: and bought a gallon of + # ... + # + # Now the word 'bought' will have two scores from spans B and C. We only + # want to consider the score with "maximum context", which we define as + # the *minimum* of its left and right context (the *sum* of left and + # right context will always be the same, of course). + # + # In the example the maximum context for 'bought' would be span C since + # it has 1 left context and 3 right context, while span B has 4 left context + # and 0 right context. + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + continue + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + + return cur_span_index == best_span_index + + def convert_examples_to_features( + self, examples, tokenizer, label_map, max_seq_length, max_span_num, max_doc_length, max_query_length + ): + + if "[CLS]" in self.tokenizer.get_vocab(): + start_token = "[CLS]" + end_token = "[SEP]" + else: + start_token = "" + end_token = "" + + features = [] + for (example_index, example) in enumerate(examples): + query_tokens = tokenizer.tokenize(example.question) + if len(query_tokens) > max_query_length: + query_tokens = query_tokens[0:max_query_length] + + all_doc_tokens = example.doc_tokens + all_doc_boxes_tokens = example.doc_boxes + + cls_token_box = [0, 0, 0, 0] + sep_token_box = [1000, 1000, 1000, 1000] + pad_token_box = [0, 0, 0, 0] + ques_token_box = [0, 0, 0, 0] + + # The -3 accounts for [CLS], [SEP] and [SEP] + max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 + + # We can have documents that are longer than the maximum sequence length. + # To deal with this we do a sliding window approach, where we take chunks + # of the up to our max length with a stride of `doc_stride`. 
+ _DocSpan = collections.namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += length + + spans_input_ids = [] + spans_input_mask = [] + spans_segment_ids = [] + spans_boxes_tokens = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + if doc_span_index == max_span_num: + break + tokens = [] + boxes_tokens = [] + token_is_max_context = {} + segment_ids = [] + tokens.append(start_token) + boxes_tokens.append(cls_token_box) + segment_ids.append(0) + for token in query_tokens: + tokens.append(token) + boxes_tokens.append(ques_token_box) + segment_ids.append(0) + tokens.append(end_token) + boxes_tokens.append(sep_token_box) + segment_ids.append(0) + for i in range(doc_span.length): + split_token_index = doc_span.start + i + is_max_context = self.check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[len(tokens)] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + boxes_tokens.append(all_doc_boxes_tokens[split_token_index]) + segment_ids.append(0) + + tokens.append(end_token) + boxes_tokens.append(sep_token_box) + segment_ids.append(0) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + # Zero-pad up to the sequence length. + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + boxes_tokens.append(pad_token_box) + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + assert len(boxes_tokens) == max_seq_length + + spans_input_ids.append(input_ids) + spans_input_mask.append(input_mask) + spans_segment_ids.append(segment_ids) + spans_boxes_tokens.append(boxes_tokens) + + # Padding + # padding spans + # max_span_num: max_seg_num + # spans_input_ids: the tokens in each segment + if len(spans_input_ids) > max_span_num: + spans_input_ids = spans_input_ids[0:max_span_num] + spans_input_mask = spans_input_mask[0:max_span_num] + spans_segment_ids = spans_segment_ids[0:max_span_num] + spans_boxes_tokens = spans_boxes_tokens[0:max_span_num] + while len(spans_input_ids) < max_span_num: + tokens = [] + boxes_tokens = [] + segment_ids = [] + tokens.append(start_token) + boxes_tokens.append(cls_token_box) + segment_ids.append(0) + for token in query_tokens: + tokens.append(token) + boxes_tokens.append(ques_token_box) + segment_ids.append(0) + tokens.append(end_token) + boxes_tokens.append(sep_token_box) + segment_ids.append(0) + tokens.append(end_token) + boxes_tokens.append(sep_token_box) + segment_ids.append(0) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + input_mask = [1] * len(input_ids) + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + boxes_tokens.append(pad_token_box) + spans_input_ids.append(input_ids) + spans_input_mask.append(input_mask) + spans_segment_ids.append(segment_ids) + spans_boxes_tokens.append(boxes_tokens) + + # padding labels + labels = example.labels + sep_id = tokenizer.convert_tokens_to_ids(end_token) + labels = ["O"] * (spans_input_ids[0].index(sep_id) + 1) + labels + if 
len(labels) > 512: + labels = labels[:512] + + if len(labels) < 512: + labels += ["O"] * (512 - len(labels)) + assert len(spans_input_ids[0]) == len(labels) + + label_ids = [] + for lid, l in enumerate(labels): + if l not in label_map: + label_ids.append(0) + else: + label_ids.append(label_map[l]) + + feature = DocVQAFeatures( + example_index=example_index, + input_ids=spans_input_ids, + input_mask=spans_input_mask, + segment_ids=spans_segment_ids, + boxes=spans_boxes_tokens, + label=label_ids, + ) + features.append(feature) + return features + + def create_examples(self, data, is_test=False): + """Creates examples for the training and dev sets.""" + examples = [] + for sample in tqdm(data, total=len(data)): + question = sample["question"] + doc_tokens = sample["document"] + doc_boxes = sample["document_bbox"] + labels = sample["labels"] if not is_test else [] + + x_min, y_min = min(doc_boxes, key=lambda x: x[0])[0], min(doc_boxes, key=lambda x: x[2])[2] + x_max, y_max = max(doc_boxes, key=lambda x: x[1])[1], max(doc_boxes, key=lambda x: x[3])[3] + width = x_max - x_min + height = y_max - y_min + + if max(width, height) < 1000: + scale_x = 1 + scale_y = 1 + else: + scale_x = 1000 / max(width, height) + scale_y = 1000 / max(width, height) + + scaled_doc_boxes = [ + [ + round((b[0] - x_min) * scale_x), + round((b[2] - y_min) * scale_y), + round((b[1] - x_min) * scale_x), + round((b[3] - y_min) * scale_y), + ] + for b in doc_boxes + ] + + for box, oribox in zip(scaled_doc_boxes, doc_boxes): + if box[0] < 0: + print(box, oribox) + if box[2] - box[0] < 0: + print(box, oribox) + if box[3] - box[1] < 0: + print(box, oribox) + for pos in box: + if pos > 1000: + print(width, height, box, oribox) + + example = DocVQAExample( + question=question, doc_tokens=doc_tokens, doc_boxes=scaled_doc_boxes, labels=labels + ) + examples.append(example) + return examples + + def docvqa_input(self): + data = [] + if self.args.do_train: + dataset = self.args.train_file + elif self.args.do_test: + dataset = self.args.test_file + with open(dataset, "r", encoding="utf8") as f: + for index, line in enumerate(f): + data.append(json.loads(line.strip())) + + # read the examples from train/test xlm files + examples = self.create_examples(data, is_test=self.args.do_test) + + features = self.convert_examples_to_features( + examples, + self.tokenizer, + self.label2id_map, + max_seq_length=self.max_seq_len, + max_doc_length=self.max_doc_length, + max_span_num=self.max_span_num, + max_query_length=self.max_query_length, + ) + + all_input_ids = paddle.to_tensor([f.input_ids for f in features], dtype="int64") + all_input_mask = paddle.to_tensor([f.input_mask for f in features], dtype="int64") + all_segment_ids = paddle.to_tensor([f.segment_ids for f in features], dtype="int64") + all_bboxes = paddle.to_tensor([f.boxes for f in features], dtype="int64") + all_labels = paddle.to_tensor([f.label for f in features], dtype="int64") + self.sample_list = [ + np.array(all_input_ids), + np.array(all_input_mask), + np.array(all_segment_ids), + np.array(all_bboxes), + np.array(all_labels), + ] + + def __getitem__(self, idx): + return ( + self.sample_list[0][idx], + self.sample_list[1][idx], + self.sample_list[2][idx], + self.sample_list[3][idx], + self.sample_list[4][idx], + ) + + def __len__( + self, + ): + return self.sample_list[0].shape[0] diff --git a/applications/document_intelligence/doc_vqa/Extraction/model.py b/applications/document_intelligence/doc_vqa/Extraction/model.py new file mode 100644 index 
0000000000000000000000000000000000000000..71647d96bf2424ddba1479e66266da3bfe3a806f --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/model.py @@ -0,0 +1,206 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.fluid as fluid +import paddle.nn as nn + +from paddlenlp.transformers import LayoutXLMPretrainedModel + + +class Crf_decoding(paddle.fluid.dygraph.Layer): + def __init__(self, param_attr, size=None, is_test=True, dtype="float32"): + super(Crf_decoding, self).__init__() + + self._dtype = dtype + self._size = size + self._is_test = is_test + self._param_attr = param_attr + self._transition = self.create_parameter( + attr=self._param_attr, shape=[self._size + 2, self._size], dtype=self._dtype + ) + + @property + def weight(self): + return self._transition + + @weight.setter + def weight(self, value): + self._transition = value + + def forward(self, input, label=None, length=None): + + viterbi_path = self._helper.create_variable_for_type_inference(dtype=self._dtype) + this_inputs = {"Emission": [input], "Transition": self._transition, "Label": label} + if length is not None: + this_inputs["Length"] = [length] + self._helper.append_op( + type="crf_decoding", + inputs=this_inputs, + outputs={"ViterbiPath": [viterbi_path]}, + attrs={ + "is_test": self._is_test, + }, + ) + return viterbi_path + + +class Chunk_eval(paddle.fluid.dygraph.Layer): + def __init__(self, num_chunk_types, chunk_scheme, excluded_chunk_types=None): + super(Chunk_eval, self).__init__() + self.num_chunk_types = num_chunk_types + self.chunk_scheme = chunk_scheme + self.excluded_chunk_types = excluded_chunk_types + + def forward(self, input, label, seq_length=None): + + precision = self._helper.create_variable_for_type_inference(dtype="float32") + recall = self._helper.create_variable_for_type_inference(dtype="float32") + f1_score = self._helper.create_variable_for_type_inference(dtype="float32") + num_infer_chunks = self._helper.create_variable_for_type_inference(dtype="int64") + num_label_chunks = self._helper.create_variable_for_type_inference(dtype="int64") + num_correct_chunks = self._helper.create_variable_for_type_inference(dtype="int64") + + this_input = {"Inference": [input], "Label": [label]} + if seq_length is not None: + this_input["SeqLength"] = [seq_length] + + self._helper.append_op( + type="chunk_eval", + inputs=this_input, + outputs={ + "Precision": [precision], + "Recall": [recall], + "F1-Score": [f1_score], + "NumInferChunks": [num_infer_chunks], + "NumLabelChunks": [num_label_chunks], + "NumCorrectChunks": [num_correct_chunks], + }, + attrs={ + "num_chunk_types": self.num_chunk_types, + "chunk_scheme": self.chunk_scheme, + "excluded_chunk_types": self.excluded_chunk_types or [], + }, + ) + return (precision, recall, f1_score, num_infer_chunks, num_label_chunks, num_correct_chunks) + + +class Linear_chain_crf(paddle.fluid.dygraph.Layer): + def __init__(self, param_attr, size=None, 
is_test=False, dtype="float32"): + super(Linear_chain_crf, self).__init__() + + self._param_attr = param_attr + self._dtype = dtype + self._size = size + self._is_test = is_test + self._transition = self.create_parameter( + attr=self._param_attr, shape=[self._size + 2, self._size], dtype=self._dtype + ) + + @property + def weight(self): + return self._transition + + @weight.setter + def weight(self, value): + self._transition = value + + def forward(self, input, label, length=None): + + alpha = self._helper.create_variable_for_type_inference(dtype=self._dtype) + emission_exps = self._helper.create_variable_for_type_inference(dtype=self._dtype) + transition_exps = self._helper.create_variable_for_type_inference(dtype=self._dtype) + log_likelihood = self._helper.create_variable_for_type_inference(dtype=self._dtype) + this_inputs = {"Emission": [input], "Transition": self._transition, "Label": [label]} + if length is not None: + this_inputs["Length"] = [length] + self._helper.append_op( + type="linear_chain_crf", + inputs=this_inputs, + outputs={ + "Alpha": [alpha], + "EmissionExps": [emission_exps], + "TransitionExps": transition_exps, + "LogLikelihood": log_likelihood, + }, + attrs={ + "is_test": self._is_test, + }, + ) + return log_likelihood + + +class LayoutXLMForTokenClassification_with_CRF(LayoutXLMPretrainedModel): + def __init__(self, layoutxlm, num_classes, dropout=None): + super(LayoutXLMForTokenClassification_with_CRF, self).__init__() + self.num_classes = num_classes + self.layoutxlm = layoutxlm + self.dropout = nn.Dropout(dropout if dropout is not None else self.layoutxlm.config["hidden_dropout_prob"]) + self.emission_classifier = nn.Linear(self.layoutxlm.config["hidden_size"], self.num_classes) + self.emission_classifier.apply(self.init_weights) + self.linear_chain_crf = Linear_chain_crf( + size=self.num_classes, param_attr=paddle.fluid.ParamAttr(name="liner_chain_crfw") + ) + self.crf_decoding = Crf_decoding(param_attr=paddle.fluid.ParamAttr(name="crfw_decode"), size=self.num_classes) + self.crf_decoding.weight = self.linear_chain_crf.weight + self.crfw = fluid.layers.create_parameter( + shape=[self.num_classes + 2, self.num_classes], dtype="float32", name="crfw" + ) + self.mask_crfw = fluid.layers.create_parameter( + shape=[self.num_classes + 2, self.num_classes], dtype="float32", name="mask_matrix" + ) + + def get_input_embeddings(self): + return self.layoutxlm.embeddings.word_embeddings + + def forward( + self, + input_ids=None, + bbox=None, + attention_mask=None, + token_type_ids=None, + labels=None, + image=None, + position_ids=None, + head_mask=None, + is_train=False, + ): + + input_ids = input_ids.squeeze(axis=1) + bbox = bbox.squeeze(axis=1) + attention_mask = attention_mask.squeeze(axis=1) + token_type_ids = token_type_ids.squeeze(axis=1) + outputs = self.layoutxlm( + input_ids=input_ids, + bbox=bbox, + image=image, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + ) + seq_length = input_ids.shape[1] + # sequence out and image out + sequence_logits, _ = outputs[0][:, :seq_length], outputs[0][:, seq_length:] + emission = self.emission_classifier(sequence_logits) + length = paddle.sum(attention_mask, axis=1) + labels = labels.reshape([-1, seq_length, 1]) + + # standard crf loss + crf_cost = self.linear_chain_crf(input=emission, label=labels, length=length) + crf_decode = self.crf_decoding(input=emission, length=length) + if is_train: + return [crf_cost] + else: + return [crf_cost, crf_decode] diff 
--git a/applications/document_intelligence/doc_vqa/Extraction/run_docvqa.py b/applications/document_intelligence/doc_vqa/Extraction/run_docvqa.py new file mode 100644 index 0000000000000000000000000000000000000000..a0c1e5670fc4cbbf4b8270439d793f799c17fc07 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/run_docvqa.py @@ -0,0 +1,457 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import logging +import os +import random +import warnings +from collections import Counter + +import numpy as np +import paddle +from docvqa import DocVQA +from model import LayoutXLMForTokenClassification_with_CRF + +from paddlenlp.transformers import LayoutXLMModel, LayoutXLMTokenizer + +warnings.filterwarnings("ignore") +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser() + # yapf: disable + parser.add_argument("--model_name_or_path", default=None, type=str, required=True) + parser.add_argument("--do_train", default=False, type=bool, required=False) + parser.add_argument("--do_test", default=False, type=bool, required=False) + parser.add_argument("--test_file", default=None, type=str, required=False) + parser.add_argument("--train_file", default=None, type=str, required=False) + parser.add_argument("--output_dir", default=None, type=str, required=True) + parser.add_argument("--max_seq_len", default=512, type=int) + parser.add_argument("--max_query_length", default=20, type=int) + parser.add_argument("--max_doc_length", default=512, type=int) + parser.add_argument("--max_span_num", default=1, type=int) + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for eval.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--eval_steps", type=int, default=10, help="eval every X updates steps.") + parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument("--init_checkpoint", type=str, default=None, help="the initialized checkpoint") + parser.add_argument("--save_path", type=str, default=None, help="the initialized checkpoint") + # yapf: enable + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed) + 
np.random.seed(args.seed) + paddle.seed(args.seed) + + +def get_label_maps(): + labels = ["O", "I-ans", "B-ans", "E-ans"] + label2id_map = {label: idx for idx, label in enumerate(labels)} + id2label_map = {idx: label for idx, label in enumerate(labels)} + return label2id_map, id2label_map + + +def main(args): + os.makedirs(args.output_dir, exist_ok=True) + logging.basicConfig( + filename=os.path.join(args.output_dir, "train.log") if paddle.distributed.get_rank() == 0 else None, + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO if paddle.distributed.get_rank() == 0 else logging.WARN, + ) + + ch = logging.StreamHandler() + ch.setLevel(logging.DEBUG) + logger.addHandler(ch) + + label2id_map, id2label_map = get_label_maps() + pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index + + # dist mode + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = LayoutXLMTokenizer.from_pretrained(args.model_name_or_path) + + if args.do_test: + model = LayoutXLMForTokenClassification_with_CRF.from_pretrained(args.init_checkpoint) + evaluate(args, model, tokenizer, label2id_map, id2label_map, pad_token_label_id, global_step=0) + exit(0) + + if args.init_checkpoint: + logger.info("Init checkpoint from {}".format(args.init_checkpoint)) + model = LayoutXLMForTokenClassification_with_CRF.from_pretrained(args.init_checkpoint) + else: + base_model = LayoutXLMModel.from_pretrained(args.model_name_or_path) + model = LayoutXLMForTokenClassification_with_CRF(base_model, num_classes=len(label2id_map), dropout=None) + + # dist mode + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_dataset = DocVQA( + args, + tokenizer, + label2id_map, + max_seq_len=args.max_seq_len, + max_query_length=args.max_query_length, + max_doc_length=args.max_doc_length, + max_span_num=args.max_span_num, + ) + + train_sampler = paddle.io.DistributedBatchSampler( + train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=False + ) + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size()) + + train_dataloader = paddle.io.DataLoader( + train_dataset, batch_sampler=train_sampler, num_workers=0, use_shared_memory=True, collate_fn=None + ) + + t_total = len(train_dataloader) * args.num_train_epochs + # build linear decay with warmup lr sch + lr_scheduler = paddle.optimizer.lr.PolynomialDecay( + learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0 + ) + if args.warmup_steps > 0: + lr_scheduler = paddle.optimizer.lr.LinearWarmup( + lr_scheduler, args.warmup_steps, start_lr=0, end_lr=args.learning_rate + ) + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + epsilon=args.adam_epsilon, + weight_decay=args.weight_decay, + ) + + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info( + " Total train batch size (w. 
parallel, distributed) = %d", + args.train_batch_size * paddle.distributed.get_world_size(), + ) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss = 0.0 + set_seed(args) + for epoch_id in range(args.num_train_epochs): + print("epoch id:{}".format(epoch_id)) + for step, batch in enumerate(train_dataloader): + model.train() + input_ids, input_mask, segment_ids, bboxes, labels = batch + if input_ids.shape[0] != args.per_gpu_train_batch_size: + continue + outputs = model( + input_ids=input_ids, + bbox=bboxes, + attention_mask=input_mask, + token_type_ids=segment_ids, + labels=labels, + is_train=True, + ) + # model outputs are always tuple in paddlenlp (see doc) + loss = outputs[0] + loss = loss.mean() + if global_step % 50 == 0: + logger.info( + "[epoch {}/{}][iter: {}/{}] lr: {:.5f}, train loss: {:.5f}, ".format( + epoch_id, + args.num_train_epochs, + step, + len(train_dataloader), + lr_scheduler.get_lr(), + float(loss), + ) + ) + + loss.backward() + tr_loss += loss.item() + optimizer.step() + lr_scheduler.step() # Update learning rate schedule + optimizer.clear_grad() + global_step += 1 + + if paddle.distributed.get_rank() == 0 and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step)) + os.makedirs(output_dir, exist_ok=True) + if paddle.distributed.get_rank() == 0: + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info("Saving model checkpoint to %s", output_dir) + + +def _tokenize_chinese_chars(text): + """ + :param text: input text, unicode string + :return: + tokenized text, list + """ + + def _is_chinese_char(cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. 
+ if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + output = [] + buff = "" + for char in text: + cp = ord(char) + if _is_chinese_char(cp) or char == "=": + if buff != "": + output.append(buff) + buff = "" + output.append(char) + else: + buff += char + + if buff != "": + output.append(buff) + + return output + + +def fast_f1(text1, text2): + common_char = Counter(text1) & Counter(text2) + len_seq1 = len(text1) + len_seq2 = len(text2) + len_common = sum(common_char.values()) + if len_common == 0: + return 0.0 + precision = 1.0 * len_common / len_seq2 + recall = 1.0 * len_common / len_seq1 + return (2.0 * precision * recall) / (precision + recall) + + +def _normalize(in_str): + """ + normalize the input unicode string + """ + in_str = in_str.lower() + sp_char = [ + ":", + "_", + "`", + ",", + "。", + ":", + "?", + "!", + "(", + ")", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + ",", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + "|", + ] + out_segs = [] + for char in in_str: + if char in sp_char: + continue + else: + out_segs.append(char) + return "".join(out_segs) + + +def calc_f1_score(answer, prediction): + ans_segs = _tokenize_chinese_chars(_normalize(answer)) + prediction_segs = _tokenize_chinese_chars(_normalize(prediction)) + f1 = fast_f1(prediction_segs, ans_segs) + return f1 + + +def decode(tokenizer, res): + sep_id = tokenizer._convert_token_to_id("") + text_res = [] + all_f1 = [] + save_f1 = [] + for i in range(len(res)): + input_ids, label_ids, predict_ids, bbox = res[i] + remove_pos = ( + len(" ".join([str(x) for x in input_ids]).split("2 6 ")[0].strip(" ").split(" ")) + 2 + ) # remove the question bbox and sep bbox + start_pos = input_ids.index(sep_id) + query_text = [] + for idx in range(1, start_pos): + input_id = input_ids[idx] + query_text.append(tokenizer._convert_id_to_token(int(input_id))) + + # label texts and predict texts + text_label, text_predict = [], [] + label_bbox_index, predict_bbox_index = [], [] + for idx in range(start_pos + 1, len(input_ids)): + input_id, label_id, predict_id = input_ids[idx], label_ids[idx], predict_ids[idx] + + if label_id in [1, 2, 3]: + text_label.append(tokenizer._convert_id_to_token(int(input_id))) + label_bbox_index.append(idx - remove_pos + 1) + if predict_id in [1, 2, 3]: + text_predict.append(tokenizer._convert_id_to_token(int(input_id))) + predict_bbox_index.append(idx - remove_pos + 1) + text_res.append( + ["".join(query_text), "".join(text_label), "".join(text_predict), label_bbox_index, predict_bbox_index] + ) + + f1 = calc_f1_score("".join(text_label), "".join(text_predict)) + save_f1.append(f1) + + if len("".join(text_label)) > 10: + all_f1.append(f1) + if len(all_f1) > 0: + print("F1: ", sum(all_f1) / len(all_f1)) + + assert len(text_res) == len(save_f1) + return text_res + + +def evaluate(args, model, tokenizer, label2id_map, id2label_map, pad_token_label_id, prefix="", global_step=0): + + eval_dataset = DocVQA( + args, tokenizer, label2id_map, max_seq_len=512, max_query_length=20, max_doc_length=512, max_span_num=1 + ) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, paddle.distributed.get_world_size()) + + eval_dataloader = paddle.io.DataLoader( + eval_dataset, 
batch_size=args.eval_batch_size, num_workers=0, use_shared_memory=True, collate_fn=None + ) + + # Eval! + logger.info("***** Running evaluation %s *****", prefix) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + model.eval() + res = [] + for idx, batch in enumerate(eval_dataloader): + with paddle.no_grad(): + input_ids, input_mask, segment_ids, bboxes, labels = batch + + if input_ids.shape[0] != args.eval_batch_size: + continue + outputs = model( + input_ids=input_ids, + bbox=bboxes, + attention_mask=input_mask, + token_type_ids=segment_ids, + labels=labels, + is_train=False, + ) + labels = labels.numpy() + crf_decode = outputs[1].numpy() + bboxes = bboxes.squeeze().numpy() + input_ids = input_ids.squeeze(axis=1).numpy() + + for index in range(input_ids.shape[0]): + res.append([list(input_ids[index]), list(labels[index]), list(crf_decode[index]), bboxes[index]]) + + origin_inputs = [] + with open(args.test_file, "r", encoding="utf8") as f: + for line in f: + line = json.loads(line.strip()) + origin_inputs.append( + { + "img_name": line["img_name"], + "question": line["question"], + "bboxes": line["document_bbox"], + "img_id": line["img_id"], + } + ) + + text_res = decode(tokenizer, res) + + with open(args.save_path, "w", encoding="utf8") as f: + for line_res, line_text, line_label in zip(res, text_res, origin_inputs): + line_json = {} + line_json["img_name"] = line_label["img_name"] + line_json["img_id"] = line_label["img_id"] + line_json["question"] = line_label["question"] + line_json["label_answer"] = line_text[1] + line_json["predict_answer"] = line_text[2] + label_bbox_index, predict_bbox_index = line_text[3], line_text[4] + label_bboxes, predict_bboxes = [], [] + for i in range(len(line_label["bboxes"])): + if i in label_bbox_index: + label_bboxes.append(line_label["bboxes"][i]) + if i in predict_bbox_index: + predict_bboxes.append(line_label["bboxes"][i]) + line_json["label_bboxes"] = label_bboxes + line_json["predict_bboxes"] = predict_bboxes + json.dump(line_json, f, ensure_ascii=False) + f.write("\n") + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + main(args) diff --git a/applications/document_intelligence/doc_vqa/Extraction/run_test.sh b/applications/document_intelligence/doc_vqa/Extraction/run_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..f2e0df79e5f74638f721e07d6bdad540b1f2edd8 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/run_test.sh @@ -0,0 +1,38 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +export CUDA_VISIBLE_DEVICES=0 + +QUESTION=$1 + +python3 change_to_mrc.py ${QUESTION} + +python3 ./run_docvqa.py \ + --model_name_or_path "layoutxlm-base-uncased" \ + --max_seq_len 512 \ + --do_test true \ + --test_file "data/demo_test.json" \ + --num_train_epochs 100 \ + --eval_steps 6000 \ + --save_steps 6000 \ + --output_dir "output/" \ + --save_path "data/decode_res.json" \ + --init_checkpoint "./checkpoints/layoutxlm/" \ + --learning_rate 3e-5 \ + --warmup_steps 12000 \ + --per_gpu_train_batch_size 4 \ + --per_gpu_eval_batch_size 1 \ + --seed 2048 + +python3 view.py diff --git a/applications/document_intelligence/doc_vqa/Extraction/run_train.sh b/applications/document_intelligence/doc_vqa/Extraction/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..1a5370643aad91a357062f558fa1da1a3c48a1f4 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/run_train.sh @@ -0,0 +1,32 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python3 ./run_docvqa.py \ + --model_name_or_path "layoutxlm-base-uncased" \ + --max_seq_len 512 \ + --train_file "data/train.json" \ + --init_checkpoint "checkpoints/base_model" \ + --do_train true \ + --num_train_epochs 50 \ + --eval_steps 24000 \ + --save_steps 40 \ + --output_dir "output" \ + --save_path "data/decode_res.json" \ + --learning_rate 3e-5 \ + --warmup_steps 40 \ + --per_gpu_train_batch_size 4 \ + --per_gpu_eval_batch_size 4 \ + --seed 2048 diff --git a/applications/document_intelligence/doc_vqa/Extraction/view.py b/applications/document_intelligence/doc_vqa/Extraction/view.py new file mode 100644 index 0000000000000000000000000000000000000000..8a3f35a068db2357a957990ac7e3ba0c903d62d8 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Extraction/view.py @@ -0,0 +1,67 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import cv2 +import json +import numpy as np + + +def view_ocr_result(img_path, bboxes, opath): + image = cv2.imread(img_path) + for char_bbox in bboxes: + x_min, x_max, y_min, y_max = char_bbox + cv2.rectangle(image, (x_min, y_min), (x_max, y_max), (0, 0, 255), 1) + cv2.imwrite(opath, image) + + +def _highlight_bbox(img, bbox): + x = bbox[0] + w = bbox[1] - x + y = bbox[2] + h = bbox[3] - y + sub_img = img[y : y + h, x : x + w] + colored_rect = np.zeros(sub_img.shape, dtype=np.uint8) + colored_rect[:, :, 2] = 255 + colored_rect[:, :, 1] = 255 + res = cv2.addWeighted(sub_img, 0.5, colored_rect, 0.5, 1.0) + img[y : y + h, x : x + w] = res + + +def highlight_ans(source_img_path, output_img_path, ans_bbox): + image = cv2.imread(source_img_path) + for bbox in ans_bbox: + _highlight_bbox(image, bbox) + cv2.imwrite(output_img_path, image) + + +def highlight_img(source_img_path, output_img_path): + image = cv2.imread(source_img_path) + height = image.shape[0] + width = image.shape[1] + bbox = [0, width - 1, 0, height - 1] + _highlight_bbox(image, bbox) + cv2.imwrite(output_img_path, image) + + +if __name__ == "__main__": + res_path = "./data/decode_res.json" + result = {} + with open(res_path, "r", encoding="utf-8") as f: + line = f.readline() + result = json.loads(line.strip()) + + img_path = "../OCR_process/demo_pics/demo_{}.png".format(result["img_id"]) + img_save_path = "../answer.png" + highlight_ans(img_path, img_save_path, result["predict_bboxes"]) + print("extraction result has been saved to answer.png") diff --git a/applications/document_intelligence/doc_vqa/OCR_process/ocr_process.py b/applications/document_intelligence/doc_vqa/OCR_process/ocr_process.py new file mode 100644 index 0000000000000000000000000000000000000000..0444e559d580c422effd408f0a1632146235c599 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/OCR_process/ocr_process.py @@ -0,0 +1,272 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import os +import re + +from paddleocr import PaddleOCR + +from paddlenlp.transformers import LayoutXLMTokenizer + +tokenizer = LayoutXLMTokenizer.from_pretrained("layoutxlm-base-uncased") + + +def get_all_chars(tokenizer): + all_chr = [] + for i in range(30000): + tok_chr = tokenizer.tokenize(chr(i)) + tok_chr = [tc.replace("▁", "") for tc in tok_chr] + while "" in tok_chr: + tok_chr.remove("") + tok_chr = "".join(tok_chr) + if len(tok_chr) != 1: + all_chr.append(i) + return all_chr + + +def merge_bbox(tok_bboxes): + min_gx = min([box[0] for box in tok_bboxes]) + max_gx = max([box[1] for box in tok_bboxes]) + min_gy = min([box[2] for box in tok_bboxes]) + max_gy = max([box[3] for box in tok_bboxes]) + height_g = max_gy - min_gy + width_g = max_gx - min_gx + height_m = 0 + width_m = 0 + for box in tok_bboxes: + x_min, x_max, y_min, y_max = box + height_m += y_max - y_min + width_m += x_max - x_min + height_m = height_m / len(tok_bboxes) + if (height_g - height_m) < 0.5 * height_m and width_g - width_m < 0.1 * width_m: + return False, [min_gx, max_gx, min_gy, max_gy] + else: + return True, tok_bboxes[0] + + +def xlm_parse(ocr_res, tokenizer): + doc_bboxes = [] + all_chr = get_all_chars(tokenizer) + + try: + new_tokens, new_token_boxes = [], [] + for item in ocr_res: + new_tokens.extend(item["tokens"]) + new_token_boxes.extend(item["token_box"]) + + # get layoutxlm tokenizer results and get the final results + temp_span_text = "".join(new_tokens) + temp_span_bbox = new_token_boxes + span_text = "" + span_bbox = [] + # drop blank space + for text, bbox in zip(temp_span_text, temp_span_bbox): + if text == " ": + continue + else: + span_text += text + span_bbox += [bbox] + + # span_tokens starts with "_" + span_tokens = tokenizer.tokenize(span_text) + span_tokens[0] = span_tokens[0].replace("▁", "") + while "" in span_tokens: + span_tokens.remove("") + + doc_bboxes = [] + i = 0 + for tid, tok in enumerate(span_tokens): + tok = tok.replace("▁", "") + if tok == "": + doc_bboxes.append(span_bbox[i]) + continue + if tok == "": + if tid + 1 == len(span_tokens): + tok_len = 1 + else: + if span_tokens[tid + 1] == "": + tok_len = 1 + else: + for j in range(i, len(span_text)): + if span_text[j].lower() == span_tokens[tid + 1][0]: + break + tok_len = j - i + elif ord(span_text[i]) in all_chr: + if tid + 1 == len(span_tokens): + tok_len = 1 + elif "°" in tok and "C" in span_tokens[tid + 1]: + tok_len = len(tok) - 1 + if tok_len == 0: + doc_bboxes.append(span_bbox[i]) + continue + elif span_text[i] == "ⅱ": + if tok == "ii": + if span_text[i + 1] != "i": + tok_len = len(tok) - 1 + else: + tok_len = len(tok) + elif tok == "i": + tok_len = len(tok) - 1 + if tok_len == 0: + doc_bboxes.append(span_bbox[i]) + continue + elif "m" in tok and "2" == span_tokens[tid + 1][0]: + tok_len = len(tok) - 1 + if tok_len == 0: + doc_bboxes.append(span_bbox[i]) + continue + elif ord(span_text[i + 1]) in all_chr: + tok_len = 1 + else: + for j in range(i, len(span_text)): + if span_text[j].lower() == span_tokens[tid + 1][0]: + break + if span_text[j].lower() == "," and span_tokens[tid + 1][0] == ",": + break + if span_text[j].lower() == ";" and span_tokens[tid + 1][0] == ";": + break + if span_text[j].lower() == ")" and span_tokens[tid + 1][0] == ")": + break + if span_text[j].lower() == "(" and span_tokens[tid + 1][0] == "(": + break + if span_text[j].lower() == "¥" and span_tokens[tid + 1][0] == "¥": + break + + tok_len = j - i + + else: + if "�" == span_text[i]: + tok_len = len(tok) + 1 + elif tok == "......" 
and "…" in span_text[i : i + 6]: + tok_len = len(tok) - 2 + elif "ⅱ" in span_text[i + len(tok) - 1]: + if tok == "i": + tok_len = 1 + else: + tok_len = len(tok) - 1 + elif "°" in tok and "C" in span_tokens[tid + 1]: + tok_len = len(tok) - 1 + else: + tok_len = len(tok) + + assert i + tok_len <= len(span_bbox) + tok_bboxes = span_bbox[i : i + tok_len] + _, merged_bbox = merge_bbox(tok_bboxes) + + doc_bboxes.append(merged_bbox) + i = i + tok_len + except Exception: + print("Error") + span_tokens = ["▁"] * 512 + doc_bboxes = [[0, 0, 0, 0]] * 512 + + return span_tokens, doc_bboxes + + +def tokenize_ocr_res(ocr_reses): + """ + input: + ocr_res: the ocr result of the image + return: + new_reses: { + pid: { + "text": all text in each ocr_res, + "bounding_box": the bounding box of the ocr_res, + "tokens": all chars in ocr_res, + "token_box: bounding box of each chars in ocr_res + } + } + """ + new_reses = [] + for img_name, ocr_res in ocr_reses: + new_res = [] + for para in ocr_res: + text = para["text"] + text_box = para["bbox"] + x_min, y_min = [int(min(idx)) for idx in zip(*text_box)] + x_max, y_max = [int(max(idx)) for idx in zip(*text_box)] + text_chars = list(text.lower()) + char_num = 0 + for char in text_chars: + if re.match("[^\x00-\xff]", char): + char_num += 2 + else: + char_num += 1 + width = x_max - x_min + shift = x_min + new_token_boxes, new_tokens = [], [] + for char in text_chars: + if re.match("[^\x00-\xff]", char): + tok_x_max = shift + width / char_num * 2 + else: + tok_x_max = shift + width / char_num * 1 + tok_x_min = shift + tok_y_min = y_min + tok_y_max = y_max + + shift = tok_x_max + new_token_boxes.append([round(tok_x_min), round(tok_x_max), tok_y_min, tok_y_max]) + new_tokens.append(char) + new_res.append( + { + "text": para["text"], + "bounding_box": para["bbox"], + "tokens": new_tokens, + "token_box": new_token_boxes, + } + ) + new_reses.append((img_name, new_res)) + return new_reses + + +def process_input(ocr_reses, tokenizer, save_ocr_path): + ocr_reses = tokenize_ocr_res(ocr_reses) + + examples = [] + for img_name, ocr_res in ocr_reses: + doc_tokens, doc_bboxes = xlm_parse(ocr_res, tokenizer) + doc_tokens.insert(0, "▁") + doc_bboxes.insert(0, doc_bboxes[0]) + example = {"img_name": img_name, "document": doc_tokens, "document_bbox": doc_bboxes} + examples.append(example) + + with open(save_ocr_path, "w", encoding="utf8") as f: + for example in examples: + json.dump(example, f, ensure_ascii=False) + f.write("\n") + + print(f"ocr parsing results has been save to: {save_ocr_path}") + + +def ocr_preprocess(img_dir): + ocr = PaddleOCR(use_angle_cls=True, lang="ch", use_gpu=True) + ocr_reses = [] + img_names = sorted(os.listdir(img_dir), key=lambda x: int(x.split("_")[1].split(".")[0])) + for img_name in img_names: + img_path = os.path.join(img_dir, img_name) + parsing_res = ocr.ocr(img_path, cls=True)[0] + ocr_res = [] + for para in parsing_res: + ocr_res.append({"text": para[1][0], "bbox": para[0]}) + ocr_reses.append((img_name, ocr_res)) + + return ocr_reses + + +if __name__ == "__main__": + img_dir = "./demo_pics" + save_path = "./demo_ocr_res.json" + ocr_results = ocr_preprocess(img_dir) + process_input(ocr_results, tokenizer, save_path) diff --git a/applications/document_intelligence/doc_vqa/README.md b/applications/document_intelligence/doc_vqa/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a2ebdfc42f3018b3a0aaf5da1e04537ab22662cb --- /dev/null +++ b/applications/document_intelligence/doc_vqa/README.md @@ -0,0 +1,131 @@ +# 
汽车说明书跨模态智能问答 + +## 1. 项目说明 + +**跨模态文档问答** 是跨模态的文档抽取任务,要求文档智能模型在文档中抽取能够回答文档相关问题的答案,需要模型在抽取和理解文档中文本信息的同时,还能充分利用文档的布局、字体、颜色等视觉信息,这比单一模态的信息抽取任务更具挑战性。 + +这种基于跨模态文档阅读理解技术的智能问答能力,可以深度解析非结构化文档中排版复杂的图文/图表内容,直接定位问题答案。 + +本项目将基于跨模态文档问答技术实现**汽车说明书问答系统**,该系统能够对用户提出的问题,自动从汽车说明书中寻找答案并进行回答。 + +如下图所示, 用户提出问题:"如何更换前风窗玻璃的刮水片",跨模态文档问答引擎将从库中寻找相关的文档,然后通过跨模态阅读理解模型抽取出相应的答案,并进行了高亮展示。 + +
image
+ +通过使用汽车说明书问答系统,能够极大地缓解传统汽车售后的压力: +- 用户:用户没有耐心查阅说明书,打客服电话需要等待 +- 售后客服:需要配置大量客服人员,且客服专业知识培训周期长 +- 构建问题库:需要投入大量人力整理常见问题库,并且固定的问题库难以覆盖灵活多变的提问 + +对于用户来说,汽车说明书问答系统能够支持通过车机助手/APP/小程序为用户提供即问即答的功能。对于常见问题,用户不再需要查阅说明书,也无需打客服电话,从而缓解了人工客服的压力。 + +对于客服来讲,汽车说明书问答系统帮助客服人员快速定位答案,高效查阅文档,提高客服的专业水平,同时也能够缩短客服的培训周期。 + +## 2. 安装说明 + +#### 环境要求 + +- paddlepaddle == 2.3.2 +- paddlenlp == 2.5.2 +- paddleocr == 2.6.1.3 + +安装相关问题可参考[PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)和[PaddleNLP](https://paddlenlp.readthedocs.io/zh/latest/get_started/installation.html)文档。 + + +## 3. 整体流程 + +汽车说明书问答系统针对用户提出的汽车使用相关问题,智能化地在汽车说明书中找出对应答案,并返回给用户。本项目提供的汽车说明书问答系统的使用流程如下图所示,主要包括 3 个模块:OCR处理模块、排序模块和跨模态阅读理解模块。 + +在使用汽车说明书问答模型回答问题之前,需要先使用PaddleOCR对离线提供的汽车说明书文档进行解析,并将解析结果保存下来,以备后续排序模块使用。 + +用户提出的问题首先会被传入排序模块,排序模块会针对该问题对解析的文档进行排序打分,其结果将会被传入跨模态阅读理解模块。阅读理解模块将从分数最高的说明书文档中,抽取用户问题的答案,并返回给用户。 + +
image
+ +下面将具体介绍各个模块的功能。 + +## 4. OCR处理模块 + +本项目提供了包含10张图片的汽车说明书,为方便后续处理,首先需要通过 PaddleOCR 对汽车说明书进行识别,记录汽车说明书上的文字和文字布局信息, 以方便后续使用计算机视觉和自然语言处理方面的技术进行问答任务。 + +本项目提供的汽车说明书图片可点击[这里](https://paddlenlp.bj.bcebos.com/images/applications/automobile.tar.gz)进行下载,下载后解压放至 `./OCR_process/demo_pics` 目录下,然后通过如下命令,使用 PaddleOCR 对图片进行解析。 + +```shell +cd OCR_process/ +python3 ocr_process.py +cd .. +``` + +解析后的结果存放至 `./OCR_process/demo_ocr_res.json` 中。 + +## 5. 排序模块 +对于用户提出的问题,如果从所有的汽车说明书图片中去寻找答案会比较耗时且耗费资源。因此这里使用了一个基于[RocketQA](https://arxiv.org/pdf/2010.08191.pdf)的排序模块,该模块将根据用户提出的问题对汽车说明书的不同图片进行打分排序,这样便可以获取和问题最相关的图片,并使用跨模态阅读理解模块在该问题上进行抽取答案。 + +本项目提供了140条汽车说明书相关的训练样本,用于排序模型的训练, 同时也提供了一个基于RocketQA的预先训练好的基线模型 base_model。 本模块可以使用 base_model 在汽车说明书训练样本上进一步微调。 + +其中,汽车说明书的训练集可点击[这里](https://paddlenlp.bj.bcebos.com/data/automobile_rerank_train.tsv) 进行下载,下载后将其重命名为 `train.tsv` ,存放至 `./Rerank/data/` 目录下。 + +同时,base_model 是 [Dureader retrieval](https://arxiv.org/abs/2203.10232) 数据集训练的排序模型, 可点击[这里](https://paddlenlp.bj.bcebos.com/models/base_ranker.tar.gz) 进行下载,解压后可获得包含模型的目录 `base_model`,将其放至 `./Rerank/checkpoints` 目录下。 + +可使用如下代码进行训练: + +```shell +cd Rerank +bash run_train.sh ./data/train.tsv ./checkpoints/base_model 50 1 +cd .. +``` +其中,参数依次为训练数据地址,base_model 地址,训练轮次,节点数。 + +在模型训练完成后,可将模型重命名为 `ranker` 存放至 `./checkpoints/` 目录下,接下来便可以使用如下命令,根据给定的汽车说明书相关问题,对汽车说明书的图片进行打分。代码如下: + +```shell +cd Rerank +bash run_test.sh 后备箱怎么开 +cd .. +``` + +其中,后一项为用户问题,命令执行完成后,分数文件将会保存至 `./Rerank/data/demo.score` 中。 + + +## 6. 跨模态阅读理解模块 +本项目首先获取排序模块输出的结果中评分最高的图片,然后将会使用跨模态的语言模型 LayoutXLM 从该图片中去抽取用户提问的答案。在获取答案后,将会对答案在该图片中进行高亮显示并返回用户。 + +本项目提供了28条汽车说明书相关的训练样本,用于跨模态阅读理解模型的训练, 同时也提供了一个预先训练好的基线模型 base_model。 本模块可以使用 base_model 在汽车说明书训练样本上进一步微调,增强模型对汽车说明书领域的理解。 + +其中,汽车说明书的阅读理解训练集可点击[这里](https://paddlenlp.bj.bcebos.com/data/automobile_mrc_train.json) 进行下载,下载后将其重命名为 `train.json`,存放至 `./Extraction/data/` 目录下。 + +同时,base_model 是 [Dureader VIS](https://aclanthology.org/2022.findings-acl.105.pdf) 数据集训练的跨模态阅读理解模型, 可点击[这里](https://paddlenlp.bj.bcebos.com/models/base_mrc.tar.gz) 进行下载,解压后可获得包含模型的目录 `base_model`,将其放至 `./Extraction/checkpoints` 目录下。 + +可使用如下代码进行训练: + +```shell +cd Extraction +bash run_train.sh +cd .. +``` + +在模型训练完成后,可将模型重命名为 `layoutxlm` 存放至 `./checkpoints/` 目录下,接下来便可以使用如下命令,根据给定的汽车说明书相关问题,从得分最高的汽车说明书图片中抽取答案。代码如下: + +```shell +cd Extraction +bash run_test.sh 后备箱怎么开 +cd .. +``` + +其中,后一项为用户问题,命令执行完成后,最终结果将会保存至 `./answer.png` 中。 + +## 7. 全流程预测 +本项目提供了全流程预测的功能,可通过如下命令进行一键式预测: + +```shell +bash run_test.sh 后备箱怎么开 +``` + +其中,后一项参数为用户问题,最终结果将会保存至 `./answer.png` 中。 + +**备注**:在运行命令前,请确保已使用第4节介绍的命令对原始汽车说明书图片完成了文档解析。 + + +下图展示了用户提问的三个问题:"后备箱怎么开","钥匙怎么充电" 和 "NFC解锁注意事项", 可以看到,本项目的汽车说明书问答系统能够精准地找到答案并进行高亮显示。 + +
diff --git a/applications/document_intelligence/doc_vqa/Rerank/change_to_rerank.py b/applications/document_intelligence/doc_vqa/Rerank/change_to_rerank.py new file mode 100644 index 0000000000000000000000000000000000000000..7136d15327244e4f7f1984d55c6f999e3a5373be --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/change_to_rerank.py @@ -0,0 +1,33 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +import json + +question = sys.argv[1] + +with open("../OCR_process/demo_ocr_res.json", "r", encoding="utf8") as f: + paras = [] + for line in f: + line = json.loads(line.strip()) + document = line["document"] + para = [] + for token in document: + token = token.replace("▁", "") + para.append(token) + paras.append("".join(para)) + +with open("./data/demo.tsv", "w", encoding="utf8") as f: + for para in paras: + f.write("{}\t\t{}\t0\n".format(question, para)) diff --git a/applications/document_intelligence/doc_vqa/Rerank/config/ernie_base_1.0_CN/vocab.txt b/applications/document_intelligence/doc_vqa/Rerank/config/ernie_base_1.0_CN/vocab.txt new file mode 100644 index 0000000000000000000000000000000000000000..5db20b3b96fb86ef2aec3b783e12e17041a02d45 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/config/ernie_base_1.0_CN/vocab.txt @@ -0,0 +1,17964 @@ +[PAD] +[CLS] +[SEP] +[MASK] +, +的 +、 +一 +人 +有 +是 +在 +中 +为 +和 +了 +不 +年 +学 +大 +国 +生 +以 +“ +” +作 +业 +个 +上 +用 +, +地 +会 +成 +发 +工 +时 +于 +理 +出 +行 +要 +. 
+等 +他 +到 +之 +这 +可 +后 +家 +对 +能 +公 +与 +》 +《 +主 +方 +分 +经 +来 +全 +其 +部 +多 +产 +自 +文 +高 +动 +进 +法 +化 +: +我 +面 +) +( +实 +教 +建 +体 +而 +长 +子 +下 +现 +开 +本 +力 +定 +性 +过 +设 +合 +小 +同 +机 +市 +品 +水 +新 +内 +事 +也 +种 +及 +制 +入 +所 +心 +务 +就 +管 +们 +得 +展 +重 +民 +加 +区 +物 +者 +通 +天 +政 +三 +电 +关 +度 +第 +名 +术 +最 +系 +月 +外 +资 +日 +代 +员 +如 +间 +位 +并 +书 +科 +村 +应 +量 +道 +前 +当 +无 +里 +相 +平 +从 +计 +提 +保 +任 +程 +技 +都 +研 +十 +基 +特 +好 +被 +或 +目 +将 +使 +山 +二 +说 +数 +点 +明 +情 +元 +着 +收 +组 +然 +美 +各 +由 +场 +金 +形 +农 +期 +因 +表 +此 +色 +起 +还 +立 +世 +安 +活 +专 +质 +1 +规 +社 +万 +信 +西 +统 +结 +路 +利 +次 +南 +式 +意 +级 +常 +师 +校 +你 +育 +果 +究 +司 +服 +门 +海 +导 +流 +项 +她 +总 +处 +两 +传 +东 +正 +省 +院 +户 +手 +具 +2 +原 +强 +北 +向 +先 +但 +米 +城 +企 +件 +风 +军 +身 +更 +知 +已 +气 +战 +至 +单 +口 +集 +创 +解 +四 +标 +交 +比 +商 +论 +界 +题 +变 +花 +3 +改 +类 +运 +指 +型 +调 +女 +神 +接 +造 +受 +广 +只 +委 +去 +共 +治 +达 +持 +条 +网 +头 +构 +县 +些 +该 +又 +那 +想 +样 +办 +济 +5 +格 +责 +车 +很 +施 +求 +己 +光 +精 +林 +完 +爱 +线 +参 +少 +积 +清 +看 +优 +报 +王 +直 +没 +每 +据 +游 +效 +感 +五 +影 +别 +获 +领 +称 +选 +供 +乐 +老 +么 +台 +问 +划 +带 +器 +源 +织 +放 +深 +备 +视 +白 +功 +取 +装 +营 +见 +记 +环 +队 +节 +准 +石 +它 +回 +历 +负 +真 +增 +医 +联 +做 +职 +容 +士 +包 +义 +观 +团 +病 +4 +府 +息 +则 +考 +料 +华 +州 +语 +证 +整 +让 +江 +史 +空 +验 +需 +支 +命 +给 +离 +认 +艺 +较 +土 +古 +养 +才 +境 +推 +把 +均 +图 +际 +斯 +近 +片 +局 +修 +字 +德 +权 +步 +始 +复 +转 +协 +即 +打 +画 +投 +决 +何 +约 +反 +quot +费 +议 +护 +极 +河 +房 +查 +布 +思 +干 +价 +儿 +非 +马 +党 +奖 +模 +故 +编 +音 +范 +识 +率 +存 +引 +客 +属 +评 +采 +尔 +配 +镇 +室 +再 +案 +监 +习 +注 +根 +克 +演 +食 +族 +示 +球 +状 +青 +号 +张 +百 +素 +首 +易 +热 +阳 +今 +园 +防 +版 +太 +乡 +英 +6 +材 +列 +便 +写 +住 +置 +层 +助 +确 +试 +难 +承 +象 +居 +10 +黄 +快 +断 +维 +却 +红 +速 +连 +众 +0 +细 +态 +话 +周 +言 +药 +培 +血 +亩 +龙 +越 +值 +几 +边 +读 +未 +曾 +测 +算 +京 +景 +余 +站 +低 +温 +消 +必 +切 +依 +随 +且 +志 +卫 +域 +照 +许 +限 +著 +销 +落 +足 +适 +争 +策 +8 +控 +武 +按 +7 +初 +角 +核 +死 +检 +富 +满 +显 +审 +除 +致 +亲 +占 +失 +星 +章 +善 +续 +千 +叶 +火 +副 +告 +段 +什 +声 +终 +况 +走 +木 +益 +戏 +独 +纪 +植 +财 +群 +六 +赛 +远 +拉 +亚 +密 +排 +超 +像 +课 +围 +往 +响 +击 +疗 +念 +八 +云 +险 +律 +请 +革 +诗 +批 +底 +压 +双 +男 +训 +例 +汉 +升 +拥 +势 +酒 +眼 +官 +牌 +油 +曲 +友 +望 +黑 +歌 +筑 +础 +香 +仅 +担 +括 +湖 +严 +秀 +剧 +九 +举 +执 +充 +兴 +督 +博 +草 +般 +李 +健 +喜 +授 +普 +预 +灵 +突 +良 +款 +罗 +9 +微 +七 +录 +朝 +飞 +宝 +令 +轻 +劳 +距 +异 +简 +兵 +树 +序 +候 +含 +福 +尽 +留 +20 +丰 +旅 +征 +临 +破 +移 +篇 +抗 +典 +端 +苏 +奇 +止 +康 +店 +毛 +觉 +春 +售 +络 +降 +板 +坚 +母 +讲 +早 +印 +略 +孩 +夫 +藏 +铁 +害 +互 +帝 +田 +融 +皮 +宗 +岁 +载 +析 +斗 +须 +伤 +12 +介 +另 +00 +半 +班 +馆 +味 +楼 +卡 +射 +述 +杀 +波 +绿 +免 +兰 +绝 +刻 +短 +察 +输 +择 +综 +杂 +份 +纳 +父 +词 +银 +送 +座 +左 +继 +固 +宣 +厂 +肉 +换 +补 +税 +派 +套 +欢 +播 +吸 +圆 +攻 +阿 +购 +听 +右 +减 +激 +巴 +背 +够 +遇 +智 +玉 +找 +宽 +陈 +练 +追 +毕 +彩 +软 +帮 +股 +荣 +托 +予 +佛 +堂 +障 +皇 +若 +守 +似 +届 +待 +货 +散 +额 +30 +尚 +穿 +丽 +骨 +享 +差 +针 +索 +稳 +宁 +贵 +酸 +液 +唐 +操 +探 +玩 +促 +笔 +库 +救 +虽 +久 +闻 +顶 +床 +港 +鱼 +亿 +登 +11 +永 +毒 +桥 +冷 +魔 +秘 +陆 +您 +童 +归 +侧 +沙 +染 +封 +紧 +松 +川 +刘 +15 +雄 +希 +毫 +卷 +某 +季 +菜 +庭 +附 +逐 +夜 +宫 +洲 +退 +顾 +尼 +胜 +剂 +纯 +舞 +遗 +苦 +梦 +挥 +航 +愿 +街 +招 +矿 +夏 +盖 +献 +怎 +茶 +申 +39 +吧 +脑 +亦 +吃 +频 +宋 +央 +威 +厚 +块 +冲 +叫 +熟 +礼 +厅 +否 +渐 +笑 +钱 +钟 +甚 +牛 +丝 +靠 +岛 +绍 +盘 +缘 +聚 +静 +雨 +氏 +圣 +顺 +唱 +刊 +阶 +困 +急 +饰 +弹 +庄 +既 +野 +阴 +混 +饮 +损 +齐 +末 +错 +轮 +宜 +鲜 +兼 +敌 +粉 +祖 +延 +100 +钢 +辑 +欧 +硬 +甲 +诉 +册 +痛 +订 +缺 +晚 +衣 +佳 +脉 +gt +盛 +乎 +拟 +贸 +扩 +船 +仪 +谁 +警 +50 +停 +席 +竞 +释 +庆 +汽 +仍 +掌 +诸 +仙 +弟 +吉 +洋 +奥 +票 +危 +架 +买 +径 +塔 +休 +付 +恶 +雷 +怀 +秋 +借 +巨 +透 +誉 +厘 +句 +跟 +胞 +婚 +幼 +烈 +峰 +寻 +君 +汇 +趣 +纸 +假 +肥 +患 +杨 +雅 +罪 +谓 +亮 +脱 +寺 +烟 +判 +绩 +乱 +刚 +摄 +洞 +践 +码 +启 +励 +呈 +曰 +呢 +符 +哥 +媒 +疾 +坐 +雪 +孔 +倒 +旧 +菌 +岩 +鼓 +亡 +访 +症 +暗 +湾 +幸 +池 +讨 +努 +露 +吗 +繁 +途 +殖 +败 +蛋 +握 +刺 +耕 +洗 +沉 +概 +哈 +泛 +凡 +残 +隐 +虫 +朋 +虚 +餐 +殊 +慢 +询 +蒙 +孙 +谈 +鲁 +裂 +贴 +污 +漫 +谷 +违 +泉 +拿 +森 +横 +扬 +键 +膜 +迁 +尤 +涉 +净 +诚 +折 +冰 +械 +拍 +梁 +沿 +避 +吴 +惊 +犯 +灭 +湿 +迷 +姓 +阅 +灯 +妇 +触 +冠 +答 +俗 +档 +尊 +谢 +措 +筹 +竟 +韩 +签 +剑 +鉴 +灾 +贯 +迹 +洛 +沟 +束 +翻 +巧 +坏 +弱 +零 +壁 +枝 +映 +恩 +抓 +屋 +呼 +脚 +绘 +40 +淡 +辖 +2010 +伊 +粒 +欲 +震 +伯 +私 
+蓝 +甘 +储 +胡 +卖 +梅 +16 +耳 +疑 +润 +伴 +泽 +牧 +烧 +尾 +累 +糖 +怪 +唯 +莫 +粮 +柱 +18 +竹 +灰 +岸 +缩 +井 +伦 +柔 +盟 +珠 +丹 +amp +皆 +哪 +迎 +颜 +衡 +啊 +塑 +寒 +13 +紫 +镜 +25 +氧 +误 +伍 +彻 +刀 +览 +炎 +津 +耐 +秦 +尖 +潮 +描 +浓 +召 +禁 +阻 +胶 +译 +腹 +泰 +乃 +盐 +潜 +鸡 +诺 +遍 +2000 +纹 +冬 +牙 +麻 +辅 +猪 +弃 +楚 +羊 +晋 +14 +鸟 +赵 +洁 +谋 +隆 +滑 +60 +2008 +籍 +臣 +朱 +泥 +墨 +辆 +墙 +浪 +姐 +赏 +纵 +2006 +拔 +倍 +纷 +摩 +壮 +苗 +偏 +塞 +贡 +仁 +宇 +卵 +瓦 +枪 +覆 +殿 +刑 +贫 +妈 +幅 +幕 +忆 +丁 +估 +废 +萨 +舍 +详 +旗 +岗 +洪 +80 +贝 +2009 +迅 +凭 +勇 +雕 +奏 +旋 +杰 +煤 +阵 +乘 +溪 +奉 +畜 +挑 +昌 +硕 +庙 +惠 +薄 +逃 +爆 +哲 +浙 +珍 +炼 +栏 +暴 +币 +隔 +吨 +倾 +嘉 +址 +陶 +绕 +诊 +遭 +桃 +魂 +兽 +豆 +闲 +箱 +拓 +燃 +裁 +晶 +掉 +脂 +溶 +顿 +肤 +虑 +鬼 +2007 +灌 +徐 +龄 +陵 +恋 +侵 +坡 +寿 +勤 +磨 +妹 +瑞 +缓 +轴 +麦 +羽 +咨 +凝 +默 +驻 +敢 +债 +17 +浮 +幻 +株 +浅 +敬 +敏 +陷 +凤 +坛 +虎 +乌 +铜 +御 +乳 +讯 +循 +圈 +肌 +妙 +奋 +忘 +闭 +墓 +21 +汤 +忠 +2005 +跨 +怕 +振 +宾 +跑 +屏 +坦 +粗 +租 +悲 +伟 +拜 +24 +妻 +赞 +兄 +宿 +碑 +貌 +勒 +罚 +夺 +偶 +截 +纤 +2011 +齿 +郑 +聘 +偿 +扶 +豪 +慧 +跳 +the +疏 +莱 +腐 +插 +恐 +郎 +辞 +挂 +娘 +肿 +徒 +伏 +磁 +杯 +丛 +旨 +琴 +19 +炮 +醒 +砖 +替 +辛 +暖 +锁 +杜 +肠 +孤 +饭 +脸 +邮 +贷 +lt +俄 +毁 +荷 +谐 +荒 +肝 +链 +2004 +2012 +尺 +尘 +援 +a +疫 +崇 +恢 +扎 +伸 +幽 +抵 +胸 +谱 +舒 +迫 +200 +畅 +泡 +岭 +喷 +70 +窗 +捷 +宏 +肯 +90 +狂 +铺 +骑 +抽 +券 +俱 +徽 +胆 +碎 +邀 +褐 +斤 +涂 +赋 +署 +颗 +2003 +渠 +仿 +迪 +炉 +辉 +涵 +耗 +22 +返 +邻 +斑 +董 +魏 +午 +娱 +浴 +尿 +曼 +锅 +柳 +舰 +搭 +旁 +宅 +趋 +of +凉 +赢 +伙 +爷 +廷 +戴 +壤 +奶 +页 +玄 +驾 +阔 +轨 +朗 +捕 +肾 +稿 +惯 +侯 +乙 +渡 +稍 +恨 +脏 +2002 +姆 +腔 +抱 +杆 +垂 +赴 +赶 +莲 +辽 +荐 +旦 +妖 +2013 +稀 +驱 +沈 +役 +晓 +亭 +仲 +澳 +500 +炸 +绪 +28 +陕 +and +23 +恒 +堡 +纠 +仇 +懂 +焦 +搜 +s +忍 +贤 +添 +i +艾 +赤 +犹 +尝 +锦 +稻 +撰 +填 +衰 +栽 +邪 +粘 +跃 +桌 +胃 +悬 +c +翼 +彼 +睡 +曹 +刷 +摆 +悉 +锋 +26 +摇 +抢 +乏 +廉 +鼠 +盾 +瓷 +抑 +埃 +邦 +遂 +寸 +渔 +祥 +胎 +牵 +壳 +甜 +卓 +瓜 +袭 +遵 +巡 +逆 +玛 +韵 +2001 +桑 +酷 +赖 +桂 +郡 +肃 +仓 +寄 +塘 +瘤 +300 +碳 +搞 +燕 +蒸 +允 +忽 +斜 +穷 +郁 +囊 +奔 +昆 +盆 +愈 +递 +1000 +黎 +祭 +怒 +辈 +腺 +滚 +暂 +郭 +璃 +踪 +芳 +碍 +肺 +狱 +冒 +阁 +砂 +35 +苍 +揭 +踏 +颇 +柄 +闪 +孝 +葡 +腾 +茎 +鸣 +撤 +仰 +伐 +丘 +於 +泪 +荡 +扰 +纲 +拼 +欣 +纽 +癌 +堆 +27 +菲 +b +披 +挖 +寓 +履 +捐 +悟 +乾 +嘴 +钻 +拳 +吹 +柏 +遥 +抚 +忧 +赠 +霸 +艰 +淋 +猫 +帅 +奈 +寨 +滴 +鼻 +掘 +狗 +驶 +朴 +拆 +惜 +玻 +扣 +萄 +蔬 +宠 +2014 +缴 +赫 +凯 +滨 +乔 +腰 +葬 +孟 +吾 +枚 +圳 +忙 +扫 +杭 +凌 +1998 +梯 +丈 +隶 +1999 +剪 +盗 +擅 +疆 +弯 +携 +拒 +秒 +颁 +醇 +割 +浆 +姑 +爸 +螺 +穗 +缝 +慈 +喝 +瓶 +漏 +悠 +猎 +番 +孕 +伪 +漂 +腿 +吐 +坝 +滤 +函 +匀 +偷 +浩 +矛 +僧 +辨 +俊 +棉 +铸 +29 +诞 +丧 +夹 +to +姿 +睛 +淮 +阀 +姜 +45 +尸 +猛 +1997 +芽 +账 +旱 +醉 +弄 +坊 +烤 +萧 +矣 +雾 +倡 +榜 +弗 +氨 +朵 +锡 +袋 +拨 +湘 +岳 +烦 +肩 +熙 +炭 +婆 +棋 +禅 +穴 +宙 +汗 +艳 +儒 +叙 +晨 +颈 +峡 +拖 +烂 +茂 +戒 +飘 +氛 +蒂 +撞 +瓣 +箭 +叛 +1996 +31 +鞋 +劲 +祝 +娜 +饲 +侍 +诱 +叹 +卢 +弥 +32 +鼎 +厦 +屈 +慕 +魅 +m +厨 +嫁 +绵 +逼 +扮 +叔 +酶 +燥 +狼 +滋 +汁 +辐 +怨 +翅 +佩 +坑 +旬 +沃 +剩 +蛇 +颖 +篮 +锐 +侠 +匹 +唤 +熊 +漠 +迟 +敦 +雌 +谨 +婴 +浸 +磷 +筒 +2015 +滩 +埋 +框 +弘 +吕 +碰 +纺 +硫 +堪 +契 +蜜 +蓄 +1995 +阐 +apos +傲 +碱 +晰 +狭 +撑 +叉 +卧 +劫 +闹 +赐 +邓 +奴 +溉 +浦 +蹈 +辣 +遣 +耀 +耶 +翠 +t +叠 +迈 +霍 +碧 +恰 +脊 +昭 +摸 +饱 +赔 +泄 +哭 +讼 +逝 +逻 +廊 +擦 +渗 +彰 +you +卿 +旺 +宪 +36 +顷 +妆 +陪 +葛 +仔 +淀 +翰 +悦 +穆 +煮 +辩 +弦 +in +串 +押 +蚀 +逢 +贺 +焊 +煌 +缔 +惑 +鹿 +袁 +糊 +逸 +舟 +勃 +侦 +涯 +蔡 +辟 +涌 +枯 +痕 +疼 +莉 +柴 +1993 +眉 +1992 +罢 +催 +衔 +秉 +妃 +鸿 +傅 +400 +辰 +聪 +咸 +1994 +扇 +盈 +勘 +佐 +泊 +抛 +搬 +牢 +宴 +牲 +贾 +摘 +姻 +慎 +帕 +忌 +卒 +夕 +卜 +惟 +挺 +崖 +炒 +爵 +冻 +椒 +鳞 +祸 +潭 +腊 +蒋 +缠 +寂 +眠 +冯 +芯 +槽 +吊 +33 +150 +聊 +梗 +嫩 +凶 +铭 +爽 +筋 +韦 +脾 +铝 +肢 +栋 +勾 +萌 +渊 +掩 +狮 +撒 +漆 +骗 +禽 +38 +蕴 +坪 +洒 +冶 +兹 +椭 +喻 +泵 +哀 +翔 +1990 +棒 +芝 +x +扑 +3000 +毅 +衍 +惨 +疯 +欺 +贼 +肖 +轰 +巢 +臂 +轩 +扁 +淘 +犬 +宰 +祠 +挡 +厌 +帐 +蜂 +狐 +垃 +昂 +圾 +秩 +芬 +瞬 +枢 +舌 +唇 +棕 +1984 +霞 +霜 +艇 +侨 +鹤 +硅 +靖 +哦 +削 +泌 +奠 +d +吏 +夷 +咖 +彭 +窑 +胁 +肪 +120 +贞 +劝 +钙 +柜 +鸭 +75 +庞 +兔 +荆 +丙 +纱 +34 +戈 +藤 +矩 +泳 +惧 +铃 +渴 +胀 +袖 +丸 +狠 +豫 +茫 +1985 +浇 +菩 +氯 +啡 +1988 +葱 +37 +梨 +霉 +脆 +氢 +巷 +丑 +娃 +锻 +愤 +贪 +蝶 +1991 +厉 +闽 +浑 +斩 +栖 +l +茅 +昏 +龟 +碗 +棚 +滞 +慰 +600 +2016 +斋 +虹 +屯 +萝 +饼 +窄 +潘 +绣 +丢 +芦 +鳍 +42 +裕 +誓 +腻 +48 +95 +锈 +吞 +蜀 +啦 +扭 +5000 +巩 +髓 +1987 +劣 +拌 +谊 +涛 +勋 +郊 
+莎 +痴 +窝 +驰 +1986 +跌 +笼 +挤 +溢 +1989 +隙 +55 +鹰 +诏 +帽 +65 +芒 +爬 +凸 +牺 +熔 +吻 +竭 +瘦 +冥 +800 +搏 +屡 +昔 +萼 +愁 +捉 +翁 +怖 +汪 +烯 +疲 +缸 +溃 +85 +泼 +剖 +涨 +橡 +谜 +悔 +嫌 +盒 +苯 +凹 +绳 +畏 +罐 +虾 +柯 +邑 +馨 +兆 +帖 +陌 +禄 +垫 +壶 +逊 +骤 +祀 +晴 +蓬 +e +苞 +煎 +菊 +堤 +甫 +拱 +氮 +罕 +舶 +伞 +姚 +弓 +嵌 +1983 +1982 +馈 +琼 +噪 +雀 +呵 +汝 +焉 +陀 +胺 +惩 +沼 +枣 +桐 +酱 +遮 +孢 +钝 +呀 +锥 +妥 +酿 +巫 +闯 +沧 +崩 +蕊 +酬 +匠 +躲 +43 +喊 +98 +琳 +46 +绎 +喉 +凰 +抬 +93 +膨 +盲 +剥 +喂 +庸 +奸 +n +钩 +冈 +募 +苑 +杏 +杉 +辱 +隋 +薪 +绒 +1980 +99 +欠 +尉 +r +攀 +抹 +巾 +1958 +渣 +苹 +猴 +悄 +屠 +41 +颂 +湛 +魄 +颠 +1949 +呆 +粤 +岂 +娇 +暑 +44 +56 +52 +鹅 +筛 +膏 +樱 +p +缆 +襄 +瑟 +恭 +泻 +匪 +兮 +恼 +吟 +仕 +蔽 +骄 +蚕 +斥 +椅 +姬 +谦 +for +椎 +搅 +卸 +沫 +怜 +坎 +瑰 +1978 +钦 +h +拾 +厕 +後 +逾 +薯 +衬 +钾 +崔 +稽 +蛮 +殷 +晒 +47 +菇 +臭 +弧 +擎 +粹 +纬 +1500 +焰 +玲 +竣 +咒 +歇 +糕 +诵 +茨 +妮 +酯 +麟 +卑 +浏 +咽 +罩 +舱 +酵 +晕 +顽 +赁 +咬 +枫 +冀 +贮 +艘 +亏 +薛 +瀑 +篆 +膀 +沸 +雍 +咳 +尹 +愉 +烹 +坠 +勿 +钠 +64 +坤 +甸 +墅 +闸 +藻 +韧 +鄂 +58 +51 +91 +j +瑶 +舆 +夸 +54 +蕾 +栗 +咏 +丞 +抄 +鹏 +弊 +檐 +骂 +仆 +峻 +爪 +赚 +帆 +娶 +嘛 +钓 +澄 +猜 +1979 +裔 +抒 +铅 +卉 +彦 +f +删 +衷 +禹 +寡 +蒲 +砌 +on +棱 +72 +拘 +堵 +雁 +仄 +荫 +53 +k +1981 +祈 +49 +奢 +赌 +寇 +3d +隧 +摊 +雇 +卦 +婉 +敲 +挣 +皱 +虞 +亨 +懈 +挽 +珊 +饶 +滥 +锯 +闷 +it +酮 +虐 +兑 +僵 +傻 +62 +沦 +巅 +鞭 +梳 +赣 +锌 +庐 +薇 +庵 +57 +96 +慨 +肚 +妄 +g +仗 +绑 +2017 +枕 +牡 +000 +胖 +沪 +垒 +捞 +捧 +竖 +蜡 +桩 +厢 +孵 +黏 +拯 +63 +谭 +68 +诈 +灿 +釉 +1956 +裹 +钮 +俩 +o +灶 +彝 +蟹 +涩 +醋 +110 +匙 +歧 +刹 +玫 +棘 +橙 +凑 +桶 +刃 +伽 +4000 +硝 +怡 +籽 +敞 +淳 +矮 +镶 +戚 +幢 +涡 +66 +尧 +膝 +is +哉 +肆 +畔 +溯 +97 +媚 +烘 +01 +67 +窃 +焚 +澜 +愚 +棵 +乞 +86 +78 +佑 +76 +iphone +暨 +敷 +饥 +俯 +蔓 +v +05 +88 +暮 +砍 +邵 +仑 +毗 +剿 +馀 +180 +锤 +刮 +1950 +梭 +摧 +250 +掠 +躯 +诡 +匈 +侣 +胚 +疮 +59 +裙 +windows +裸 +08 +塌 +吓 +俘 +糙 +藩 +楷 +羞 +with +鲍 +帘 +裤 +宛 +憾 +桓 +痰 +寞 +骚 +惹 +笋 +萃 +92 +栓 +61 +挫 +矢 +垦 +09 +垄 +绸 +凄 +your +镀 +熏 +钉 +1945 +led +粪 +缅 +洽 +鞘 +蔗 +82 +迄 +沐 +凿 +勉 +昨 +喘 +700 +爹 +屑 +耻 +沥 +庶 +涅 +腕 +袍 +懒 +阜 +嗜 +朔 +1200 +蒜 +沛 +坟 +轿 +喀 +笛 +狄 +饿 +蓉 +泣 +窟 +130 +豹 +屿 +73 +崛 +迦 +诠 +贬 +腥 +83 +钥 +嗣 +瑜 +07 +倦 +萎 +拦 +冤 +讽 +潇 +谣 +趁 +1960 +妨 +84 +贩 +74 +萍 +窦 +纂 +缀 +矫 +淑 +墩 +梵 +沾 +淫 +乖 +汰 +莞 +81 +旷 +浊 +挚 +撼 +69 +87 +氟 +焕 +06 +庚 +掀 +诀 +kg +盼 +71 +疹 +窖 +匆 +厥 +轧 +89 +淹 +94 +160 +亥 +鸦 +棍 +谅 +歼 +汕 +挪 +蚁 +敛 +魁 +畴 +炫 +丫 +奎 +菱 +沂 +撕 +阎 +詹 +03 +蛛 +77 +靡 +瞻 +咱 +愧 +烷 +畸 +灸 +眸 +that +觅 +芜 +1955 +廓 +斌 +躁 +麓 +摔 +1970 +烛 +睹 +孜 +缚 +堕 +昼 +睿 +琪 +琉 +贱 +6000 +渝 +跋 +1959 +茄 +1957 +舜 +1976 +诛 +1952 +捣 +芙 +04 +1961 +倚 +1938 +酰 +澈 +慌 +帜 +颤 +陇 +1962 +02 +颌 +昧 +佣 +眷 +徙 +禾 +逮 +1948 +79 +莹 +碟 +梢 +朽 +粥 +喇 +1964 +榆 +驳 +楔 +1965 +啸 +肋 +dna +踢 +1975 +1937 +u +傍 +桔 +肴 +呕 +旭 +埠 +贿 +曝 +杖 +俭 +栩 +1953 +斧 +镁 +匾 +踩 +橘 +颅 +1963 +囚 +蛙 +1946 +膳 +坞 +琐 +荧 +瘟 +涤 +胰 +衫 +噬 +皖 +邱 +埔 +汀 +羡 +睐 +葵 +耿 +糟 +厄 +秧 +黔 +蹄 +140 +漳 +鞍 +谏 +腋 +簇 +梧 +戎 +1977 +榴 +诣 +宦 +苔 +揽 +簧 +狸 +阙 +扯 +耍 +棠 +脓 +烫 +翘 +芭 +躺 +羁 +藉 +拐 +1966 +陡 +1954 +漓 +棺 +钧 +琅 +扔 +寝 +绚 +熬 +驿 +邹 +杠 +1972 +w +绥 +窥 +晃 +渭 +1947 +樊 +鑫 +祁 +陋 +哺 +堰 +祛 +y +梓 +崎 +1968 +孽 +蝴 +蔚 +抖 +苟 +肇 +溜 +绅 +妾 +1940 +跪 +沁 +q +1973 +莽 +虏 +be +瞄 +砸 +稚 +僚 +崭 +迭 +皂 +彬 +雏 +ip +羲 +缕 +绞 +俞 +簿 +耸 +廖 +嘲 +can +1969 +翌 +榄 +裴 +槐 +1939 +洼 +睁 +1951 +灼 +啤 +臀 +啥 +濒 +醛 +峨 +葫 +悍 +笨 +嘱 +1935 +稠 +360 +韶 +1941 +陛 +峭 +1974 +酚 +翩 +舅 +8000 +寅 +1936 +蕉 +阮 +垣 +戮 +me +趾 +犀 +巍 +re +霄 +1942 +1930 +饪 +sci +秆 +朕 +驼 +肛 +揉 +ipad +楠 +岚 +疡 +帧 +柑 +iso9001 +赎 +逍 +滇 +璋 +礁 +黛 +钞 +邢 +涧 +劈 +瞳 +砚 +驴 +1944 +锣 +恳 +栅 +吵 +牟 +沌 +瞩 +咪 +毯 +炳 +淤 +盯 +芋 +粟 +350 +栈 +戊 +盏 +峪 +拂 +暇 +酥 +汛 +900 +pc +嚣 +2500 +轼 +妒 +匿 +1934 +鸽 +蝉 +cd +痒 +宵 +瘫 +1927 +1943 +璧 +汲 +1971 +冢 +碌 +琢 +磅 +卤 +105 +剔 +谎 +圩 +酌 +捏 +渺 +媳 +1933 +穹 +谥 +骏 +哨 +骆 +乒 +10000 +摹 +兜 +柿 +喧 +呜 +捡 +橄 +逗 +瑚 +呐 +檀 +辜 +妊 +祯 +1931 +苷 +don +衙 +笃 +芸 +霖 +荔 +闺 +羌 +芹 +dvd +哼 +糯 +吼 +蕃 +嵩 +矶 +绽 +坯 +娠 +1928 +祷 +锰 +qq +by +瘀 +108 +岐 +1932 +茵 +筝 +斐 +肽 +歉 +1929 +嗽 +恤 +汶 +聂 +樟 +擒 +鹃 +拙 +鲤 +絮 +鄙 +彪 +ipod +z +嗓 +墟 +骼 +渤 +僻 +豁 +谕 +荟 +姨 +婷 +挠 +哇 
+炙 +220 +诅 +娥 +哑 +阱 +嫉 +圭 +乓 +橱 +歪 +禧 +甩 +坷 +晏 +驯 +讳 +泗 +煞 +my +淄 +倪 +妓 +窍 +竿 +襟 +匡 +钛 +侈 +ll +侄 +铲 +哮 +厩 +1967 +亢 +101 +辕 +瘾 +辊 +狩 +掷 +潍 +240 +伺 +嘿 +弈 +嘎 +陨 +娅 +1800 +昊 +犁 +屁 +蜘 +170 +寥 +滕 +毙 +as +涝 +谛 +all +郝 +痹 +溺 +汾 +脐 +馅 +蠢 +珀 +腌 +扼 +敕 +莓 +峦 +铬 +谍 +炬 +龚 +麒 +睦 +磺 +吁 +掺 +烁 +靶 +or +圃 +饵 +褶 +娟 +滔 +挨 +android +褒 +胱 +cpu +晖 +脖 +垢 +抉 +冉 +茧 +from +渲 +癫 +125 +de +悼 +嫂 +瞒 +纶 +肘 +炖 +瀚 +皋 +姊 +颐 +1600 +俏 +颊 +gps +讶 +札 +奕 +磊 +镖 +遐 +眺 +腑 +boss +琦 +蚊 +窜 +渍 +嗯 +102 +1926 +touch +夯 +1300 +笙 +蘑 +翡 +碘 +卯 +啼 +靓 +辍 +莺 +躬 +猿 +杞 +眩 +虔 +凋 +遁 +泾 +岔 +羟 +弛 +娄 +茸 +皓 +峙 +逅 +邂 +苇 +楹 +蹲 +拢 +甄 +鳃 +104 +邯 +捆 +勺 +450 +酉 +荚 +唑 +臻 +辗 +绰 +徊 +榨 +苛 +赦 +盔 +壬 +恍 +缉 +2020 +熨 +7000 +澡 +桨 +匣 +兢 +106 +驭 +x1 +镍 +孰 +绮 +馏 +蝇 +佼 +鲸 +128 +哎 +裳 +蜕 +嚼 +嘻 +web +庇 +绢 +倩 +钵 +ii +恪 +帷 +莆 +柠 +藕 +砾 +115 +绊 +喙 +坂 +徘 +荀 +瞧 +蛾 +1925 +晦 +ph +mm +铎 +107 +紊 +锚 +酪 +稷 +聋 +闵 +熹 +冕 +诫 +珑 +曦 +篷 +320 +迥 +蘖 +胤 +103 +檬 +瑾 +钳 +遏 +辄 +嬉 +隅 +ps +秃 +112 +帛 +聆 +芥 +诬 +1100 +挟 +宕 +2018 +鹊 +琶 +膛 +mv +兀 +gb +懿 +碾 +叮 +863 +蠕 +譬 +缮 +烽 +妍 +榕 +260 +1920 +邃 +焙 +倘 +210 +戌 +茹 +豚 +晾 +浒 +玺 +醚 +祐 +炽 +this +缪 +凛 +噩 +溅 +毋 +槛 +ei +are +嫡 +蝠 +娴 +稣 +禀 +壑 +殆 +敖 +cm +ios +倭 +挛 +侃 +蚌 +咀 +盎 +殉 +岑 +浚 +谬 +狡 +1924 +癸 +280 +逛 +耽 +俺 +璨 +巳 +茜 +郸 +蒴 +琵 +we +230 +叩 +泸 +塾 +one +稼 +reg +侮 +锂 +曙 +3500 +up +薰 +婿 +惶 +拭 +篱 +恬 +淌 +烙 +袜 +徵 +慷 +夭 +噶 +莘 +135 +鸳 +殡 +蚂 +1900 +憎 +喃 +佚 +龛 +潢 +烃 +at +岱 +潺 +109 +衢 +璀 +5cm +1400 +鹭 +揣 +痢 +know +厮 +氓 +怠 +no +nbsp +痘 +硒 +镌 +乍 +咯 +惬 +not +桦 +骇 +枉 +蜗 +睾 +淇 +耘 +娓 +弼 +鳌 +嗅 +gdp +狙 +箫 +朦 +椰 +胥 +丐 +陂 +唾 +鳄 +柚 +谒 +journal +戍 +1912 +刁 +鸾 +缭 +骸 +铣 +酋 +蝎 +掏 +耦 +怯 +娲 +拇 +汹 +胧 +疤 +118 +硼 +恕 +哗 +眶 +痫 +凳 +鲨 +擢 +歹 +樵 +瘠 +app +茗 +翟 +黯 +蜒 +壹 +殇 +伶 +辙 +an +瑕 +町 +孚 +痉 +铵 +搁 +漾 +戟 +镰 +鸯 +猩 +190 +蔷 +缤 +叭 +垩 +113 +曳 +usb +奚 +毓 +ibm +颓 +汐 +靴 +china +傣 +尬 +濮 +赂 +媛 +懦 +扦 +111 +韬 +like +戳 +java +雯 +114 +蜿 +116 +1923 +笺 +裘 +尴 +侗 +mba +3g +钨 +1919 +苓 +1922 +寰 +蛊 +扳 +搓 +涟 +睫 +淬 +5mm +123 +ve +121 +赈 +恺 +瞎 +蝙 +1921 +枸 +萱 +颚 +憩 +秽 +秸 +拷 +阑 +貂 +粱 +煲 +隘 +暧 +惕 +沽 +time +菠 +1911 +趟 +磋 +偕 +涕 +邸 +so +踞 +惫 +122 +阪 +鞠 +饺 +汞 +颍 +氰 +屹 +蛟 +跻 +哟 +have +126 +臼 +熄 +绛 +弩 +褪 +117 +渎 +亟 +匮 +撇 +internet +霆 +攒 +舵 +扛 +彤 +nba +蛤 +婢 +偃 +胫 +姥 +睑 +love +iso +pk +诙 +what +诲 +锭 +悚 +扒 +洱 +劾 +惰 +篡 +瓯 +徇 +铀 +骋 +flash +1918 +out +筷 +渚 +踵 +俨 +ceo +榻 +糜 +捻 +釜 +哩 +萤 +270 +蛹 +隽 +垮 +鸠 +鸥 +漕 +瑙 +礴 +憧 +殴 +潼 +悯 +砺 +拽 +钗 +ct +酣 +镂 +mp3 +膺 +楞 +竺 +迂 +嫣 +忱 +cad +哄 +疣 +鹦 +1700 +枭 +憬 +疱 +will +婪 +沮 +1914 +怅 +119 +筱 +扉 +瞰 +linux +旌 +蔑 +铠 +瀛 +vip +琥 +750 +127 +懵 +谴 +捍 +蟾 +漩 +1913 +拣 +汴 +university +刨 +叱 +曜 +妞 +澎 +镑 +翎 +瞪 +sh +倔 +芍 +璞 +瓮 +驹 +芷 +寐 +擂 +丕 +蟠 +诃 +悸 +亘 +溴 +宸 +廿 +恃 +棣 +1917 +荼 +筠 +羚 +慑 +唉 +纣 +麼 +蹦 +锄 +145 +international +124 +淆 +甙 +132 +蚜 +椿 +禺 +绯 +冗 +168 +葩 +厝 +媲 +蒿 +痪 +650 +菁 +炊 +wifi +俑 +new +讥 +min +桀 +祺 +129 +吡 +迩 +do +john +箔 +皿 +缎 +萦 +剃 +霓 +酝 +mg +诰 +茉 +just +get +飙 +湍 +蜥 +箕 +蘸 +550 +4500 +柬 +韭 +溥 +but +熠 +鹉 +咐 +剌 +138 +悖 +瞿 +槟 +娩 +闾 +pvc +遴 +咫 +20000 +孺 +彷 +茬 +211 +蓟 +li +if +憨 +袅 +佬 +炯 +erp +1910 +啶 +昙 +蚩 +136 +痔 +蕨 +瓢 +夔 +毡 +赃 +鳖 +沅 +wang +go +饷 +165 +臧 +掖 +褚 +羹 +ic +勐 +tv +谚 +畦 +眨 +贻 +攸 +涎 +弑 +咎 +铂 +瑛 +1905 +矗 +虱 +more +133 +秤 +谟 +漱 +俸 +夙 +1915 +br +game +雉 +螨 +恣 +斛 +175 +谙 +隍 +131 +奄 +480 +yy +1916 +壕 +髻 +155 +鄱 +嘶 +磕 +濡 +赘 +荞 +讹 +猕 +痞 +鬓 +铮 +腱 +幡 +榭 +爻 +5m +涓 +晤 +咕 +惭 +钼 +匕 +ok +撮 +庾 +笠 +窘 +癖 +365 +垛 +窒 +畲 +甬 +彗 +缨 +湮 +寮 +et +衅 +谪 +156 +绫 +9000 +152 +兖 +疽 +磐 +380 +菏 +沱 +骁 +嫔 +盂 +娆 +钊 +蟒 +忏 +谤 +148 +137 +server +2200 +晟 +ng +15000 +google +痈 +耆 +谧 +簪 +134 +ml +疟 +扈 +脍 +琛 +咋 +胄 +142 +144 +葆 +轶 +桢 +973 +攘 +was +邕 +拧 +茯 +205 +摒 +1908 +intel +傀 +祚 +嘟 +帼 +1906 +wto +筵 +when +馒 +疚 +璇 +砧 +merge +槃 +microsoft +犷 +exe +腓 +煜 +弋 +疸 +濑 +310 +201 +麝 +嗟 +忻 +愣 +facebook +斓 +吝 +咧 +矾 +愫 +151 +158 +漪 +珂 +rna +逞 +146 +206 +糠 +璐 +藓 +昕 
+妩 +屌 +疵 +excel +嘘 +he +plc +袂 +2400 +139 +稃 +剁 +侏 +掐 +猾 +匍 +2800 +坳 +黜 +邺 +闫 +猥 +湃 +斟 +癣 +1904 +185 +匐 +粳 +sql +330 +141 +cp +1909 +叟 +俾 +儡 +莒 +12000 +骥 +跤 +耙 +矜 +翱 +zhang +ms +赡 +1907 +浣 +栾 +拈 +science +420 +螟 +aaa +桧 +坍 +睢 +趴 +id +伎 +2100 +婺 +霹 +痊 +膊 +眯 +豌 +202 +驮 +骈 +850 +iii +嶂 +淞 +143 +腮 +髅 +炀 +啄 +亳 +麾 +147 +筐 +叨 +徨 +跷 +ac +楂 +郴 +绶 +hp +羔 +xp +ieee +咤 +now +there +靳 +they +屎 +雳 +瘘 +蹬 +2300 +惮 +acid +涪 +阖 +煽 +蹊 +225 +栉 +153 +俟 +涸 +辫 +锢 +佟 +176 +皎 +cctv +啮 +钰 +螂 +dc +啪 +绷 +204 +闰 +畿 +2d +覃 +2600 +惘 +贰 +154 +碉 +卞 +酐 +枷 +葺 +芪 +207 +蕙 +192 +咚 +籁 +pro +钴 +162 +冽 +玮 +骷 +啃 +焖 +猝 +榈 +滁 +拮 +跗 +讷 +蝗 +208 +蠡 +world +烨 +been +hd +gmp +256 +脯 +歙 +泠 +刍 +掳 +pe +his +僳 +340 +1902 +螯 +胳 +髦 +粽 +戾 +祜 +178 +186 +岷 +懋 +馥 +昵 +踊 +湄 +郢 +斡 +迢 +ce +photoshop +嗪 +about +裨 +1903 +羧 +膈 +翊 +lcd +鲫 +163 +螃 +沓 +疝 +笈 +ktv +榔 +157 +诘 +autocad +195 +颉 +蛀 +鸢 +焯 +囧 +make +梆 +npc +潞 +戛 +see +system +149 +佗 +艮 +chinese +let +霾 +鬟 +215 +net +玖 +1898 +腭 +喔 +172 +罔 +佥 +粑 +visual +舷 +泯 +m2 +198 +has +203 +sd +泓 +炜 +谗 +烬 +跆 +rpg +傩 +飓 +浔 +钤 +惚 +胭 +踝 +镯 +ep +221 +臆 +196 +蜚 +揪 +觞 +皈 +dj +183 +api +迸 +匝 +筏 +167 +醴 +黍 +洮 +滦 +侬 +甾 +290 +way +3200 +188 +diy +2cm +com +澧 +阈 +袱 +迤 +衮 +166 +濂 +娑 +砥 +砷 +铨 +缜 +箴 +30000 +逵 +猖 +159 +蛰 +箍 +侥 +2mm +搂 +纨 +裱 +枋 +嫦 +敝 +挝 +贲 +潦 +235 +撩 +惺 +铰 +f1 +忒 +咆 +哆 +莅 +164 +炕 +抨 +涿 +龈 +猷 +got +b1 +182 +2m +212 +遒 +缥 +vs +捂 +俐 +la +瘙 +搐 +牍 +isbn +馍 +our +痿 +袤 +峥 +184 +栎 +罹 +燎 +喵 +209 +1901 +璜 +飒 +蔼 +珞 +澹 +奘 +岖 +芡 +簸 +杵 +甥 +骊 +216 +悴 +173 +惆 +5mg +殃 +1895 +呃 +161 +5g +祗 +3600 +髋 +169 +liu +who +幔 +down +榛 +犊 +霁 +芮 +520 +牒 +佰 +her +狈 +薨 +co +吩 +鳝 +嵘 +濠 +呤 +纫 +3mm +檄 +214 +浜 +370 +189 +缙 +缢 +煦 +蓦 +揖 +拴 +缈 +218 +褥 +铿 +312 +燮 +life +锵 +174 +荥 +187 +忿 +4s +僖 +婶 +171 +chen +芾 +镐 +痣 +research +眈 +460 +祇 +邈 +翳 +碣 +遨 +鳗 +诂 +never +岫 +焘 +3cm +co2 +茱 +tcp +only +255 +gsm +say +洵 +晁 +right +噢 +she +over +偈 +旖 +david +181 +232 +蚓 +柘 +珐 +遽 +岌 +桅 +213 +唔 +222 +鄞 +雹 +michael +驸 +苻 +恻 +鬃 +玑 +磬 +崂 +304 +祉 +荤 +淼 +560 +264 +肱 +呗 +pp +b2 +骡 +囱 +10cm +佞 +back +1890 +226 +耒 +伫 +嚷 +粼 +aa +歆 +佃 +旎 +惋 +殁 +杳 +their +阡 +red +畈 +蔺 +os +177 +map +巽 +cbd +昱 +啰 +吠 +179 +199 +嗔 +涮 +238 +奂 +1896 +撷 +301 +袒 +720 +爰 +捶 +赭 +蜓 +姗 +蔻 +垠 +193 +gis +噻 +ab +峒 +皙 +want +245 +憔 +帚 +office +xx +杷 +蟆 +iso14001 +觐 +钒 +岙 +2700 +1899 +栀 +幄 +啧 +癜 +擀 +轲 +铆 +them +讴 +樽 +霏 +mtv +肮 +枳 +骞 +诧 +瘢 +虬 +拗 +play +219 +蕲 +316 +茁 +唆 +technology +word +沭 +毂 +蛎 +芊 +銮 +瞥 +呱 +223 +羿 +吒 +傥 +髯 +濯 +蜻 +皴 +802 +430 +邳 +燧 +1860 +獭 +垭 +祟 +217 +虢 +how +枇 +abs +鹫 +194 +颞 +1894 +333 +皑 +脲 +197 +舔 +魇 +霭 +org +坨 +郧 +baby +椽 +舫 +228 +oh +305 +荠 +琊 +溟 +1897 +煨 +265 +谯 +粲 +罂 +gonna +屉 +佯 +郦 +亵 +诽 +芩 +嵇 +蚤 +哒 +315 +啬 +ain +嚎 +玥 +twitter +191 +隼 +唢 +铛 +cause +壅 +藜 +won +吱 +rom +楣 +璟 +锆 +憋 +罡 +al +咙 +1850 +腈 +oslash +job +233 +廪 +堑 +into +诩 +b2c +溧 +鹑 +讫 +哌 +铢 +蜴 +1ml +稹 +噜 +镉 +224 +愕 +桁 +晔 +琰 +陲 +疙 +667 +崮 +need +540 +8mm +html +颛 +through +asp +桡 +钜 +580 +take +谑 +仞 +咦 +珪 +揍 +鱿 +阉 +3800 +瘩 +410 +槌 +滓 +茴 +tft +泮 +涣 +atm +pci +柞 +渥 +飨 +孪 +沔 +谲 +桉 +vcd +慵 +318 +oem +other +俚 +paul +跖 +纭 +恙 +which +fi +佘 +236 +荃 +咄 +鞅 +叁 +james +恽 +m3 +253 +炔 +萘 +钺 +6500 +1880 +ccd +楫 +塬 +钡 +琮 +苄 +950 +325 +275 +1g +day +o2o +960 +music +骰 +偎 +粕 +amd +咔 +鹄 +瓒 +阆 +捅 +嬴 +adobe +箨 +name +390 +680 +640 +氦 +倜 +b2b +觊 +xml +婕 +229 +jar +锑 +撬 +chem +掰 +嗷 +5500 +1cm +饯 +蓓 +234 +good +鼬 +spa +佤 +5a +ss +蚯 +挞 +臾 +where +atp +227 +嶙 +幂 +饬 +闱 +live +high +煅 +嘧 +1mm +蹭 +sun +abc +瞭 +顼 +箐 +here +徉 +231 +骜 +302 +嗨 +邛 +庑 +柩 +饕 +俎 +4mm +15g +嘌 +50000 +颏 +cssci +椁 +崧 +锉 +籼 +1870 +狞 +弁 +6mm +羯 +踹 +糅 +248 +1840 +砼 +263 +嫖 +tmp +252 +mac +285 +豉 +啉 +榷 +嘈 +en +俪 +痂 +308 +inf +630 +儋 +4a +芎 +ai +man +繇 +1889 +bt +239 +meta +蹇 +242 
+530 +诋 +bbc +煸 +峋 +淙 +324 +management +1885 +泱 +徜 +crm +4cm +free +汩 +纥 +246 +蝼 +囿 +uv +暹 +谆 +蹂 +鞣 +3c +mr +螳 +cs +馗 +幺 +鞑 +贽 +268 +istp +243 +漯 +237 +牦 +淖 +engineering +dr +囤 +than +gprs +sp +440 +晗 +1888 +258 +忡 +懊 +呋 +埂 +pcb +307 +first +321 +robert +鲈 +sup2 +阕 +3m +幌 +cg +303 +鳅 +勰 +find +8cm +萸 +剽 +蚝 +wi +绔 +pdf +1250 +262 +php +辇 +10mg +use +ie +麋 +1884 +陟 +宥 +oracle +锺 +喽 +620 +1892 +1893 +淅 +熵 +荨 +247 +忤 +american +266 +seo +轭 +嗦 +荪 +also +骠 +鹘 +p2p +4g +聿 +绾 +诶 +985 +怆 +244 +喋 +恸 +湟 +睨 +翦 +fe +蜈 +1875 +褂 +娼 +1886 +羸 +觎 +470 +瘁 +306 +蚣 +呻 +241 +1882 +昶 +谶 +猬 +荻 +school +286 +酗 +unit +肄 +躏 +膑 +288 +2g +嗡 +273 +iv +cam +510 +庠 +崽 +254 +搪 +pcr +胯 +309 +铉 +峤 +郯 +藐 +舂 +come +蓼 +some +薏 +窿 +羣 +氽 +徕 +冼 +rs +阂 +欤 +殒 +窈 +脘 +780 +篝 +yang +1861 +3300 +iso9000 +麸 +砭 +max +砰 +骶 +豺 +lg +窠 +獒 +think +腴 +苕 +any +its +缇 +骅 +劭 +college +卅 +ups +揆 +垅 +na +6cm +琏 +镗 +苜 +胛 +1881 +black +珏 +吮 +抠 +搔 +276 +rock +251 +槎 +4200 +323 +掣 +pet +1887 +ap +琨 +餮 +375 +舛 +give +si +痤 +us +311 +278 +埭 +english +peter +1891 +820 +胪 +喹 +妲 +婀 +帙 +10g +oa +7500 +箩 +灏 +霎 +logo +袄 +dsp +bl +镭 +蓿 +power +long +墉 +too +嵊 +1862 +girl +堇 +king +蟋 +610 +叽 +249 +钎 +30cm +fm +録 +group +1883 +郓 +瘴 +vol +丶 +呦 +邬 +頫 +272 +馁 +hiv +鄢 +257 +1876 +ordm +蛭 +322 +愍 +锲 +槿 +珈 +best +4800 +mri +1080 +fda +10mm +261 +nt +660 +super +1m +center +ui +335 +蜃 +298 +拎 +鎏 +裟 +沏 +np +螭 +7mm +觑 +墒 +捺 +轸 +micro +榫 +based +319 +怔 +ram +618 +昀 +even +泷 +1864 +ca +凫 +唠 +狰 +鲛 +氐 +呛 +绀 +碛 +茏 +盅 +蟀 +洙 +off +訇 +蠹 +auml +dos +20cm +267 +棂 +18000 +蚴 +篾 +two +靛 +暄 +show +1868 +泞 +cdma +mark +vc +洄 +赓 +麽 +25000 +篓 +孑 +860 +烩 +980 +design +颢 +钣 +var +髂 +蹴 +wanna +筮 +蝌 +醮 +home +菖 +fun +cmos +獗 +friends +business +岘 +570 +鼐 +1865 +姣 +national +1874 +蟑 +袈 +葶 +掬 +most +vga +emba +躇 +30g +鹌 +city +踌 +282 +钹 +蚪 +颧 +001 +13000 +鹳 +274 +km +345 +1050 +stop +328 +then +鲲 +驷 +潴 +295 +386 +焱 +稔 +悌 +mpeg +st +suv +vista +a1 +vi +283 +help +basic +唏 +11000 +苒 +蹙 +house +heart +ouml +281 +氩 +bug +mobile +宓 +service +dll +綦 +苎 +application +疃 +methyl +攫 +rfid +100g +287 +掾 +1871 +徭 +490 +舀 +逶 +嗤 +760 +0m +ge +1872 +people +hr +蜷 +茔 +512 +疳 +迳 +罄 +瓠 +100mg +讪 +psp +av +傈 +ppp +杲 +灞 +氲 +鬲 +獠 +柒 +骧 +1848 +away +william +326 +搀 +珩 +绦 +1879 +嚏 +710 +镛 +喱 +倏 +馋 +茭 +擘 +斫 +284 +1mg +怂 +hdmi +唧 +犍 +谩 +赊 +317 +271 +wu +鬻 +禛 +15cm +259 +840 +feel +485 +圻 +10m +蹶 +5kg +1877 +1873 +缄 +瘿 +黠 +甑 +矸 +嘀 +il +蹼 +jack +lee +269 +叼 +di +313 +旻 +auc +502 +1350 +鹜 +289 +fc +稗 +336 +999 +association +many +293 +雒 +george +td +赉 +style +馔 +颦 +ul +ld50 +1867 +颔 +掇 +1863 +each +赅 +桎 +inc +痧 +dv +谄 +孛 +笆 +鲶 +铳 +3100 +mc +tell +4m +blue +327 +299 +bios +龋 +385 +盱 +笏 +2030 +窕 +苴 +314 +big +1866 +296 +萋 +355 +辘 +琬 +cu +梏 +much +蚧 +3400 +1280 +镳 +24h +own +670 +studio +瞅 +keep +6g +ppt +conference +around +information +睬 +1878 +class +偌 +鲵 +惦 +1830 +蜍 +mp4 +why +靼 +1851 +332 +阗 +菟 +黝 +1650 +control +挈 +嵴 +剡 +358 +楸 +dha +氤 +m1 +vr +呎 +珲 +5ml +馄 +滂 +338 +蹉 +蓑 +锷 +297 +279 +啜 +1644 +sm +婵 +well +鬣 +7cm +钿 +bbs +晌 +蛆 +隗 +酞 +枞 +352 +work +always +9g +戬 +獾 +镕 +star +easy +饨 +娣 +缰 +邾 +334 +8m +ni +鹗 +277 +425 +end +had +嗒 +苋 +薮 +棹 +type +richard +880 +6m +拄 +air +埕 +勖 +鹞 +殚 +鲢 +pop +a4 +1750 +ftp +16000 +啖 +ad +沣 +501 +靥 +葭 +诿 +htc +鸪 +007 +饴 +t1 +疖 +抟 +睽 +770 +access +tcl +稞 +吋 +谀 +澍 +杈 +妤 +sata +part +峄 +systems +漉 +40000 +ever +気 +368 +咲 +qs +ta +璘 +ltd +mol +media +萜 +僭 +朐 +742 +1855 +cc +圜 +癞 +藿 +555 +珉 +isp +set +1450 +陉 +him +僮 +292 +膻 +1853 +薹 +810 +汊 +still +锗 +昉 +pvp +猗 +http +1859 +3700 +strong +3a +锶 +real +跛 +art +1869 +331 +1368 +嘹 +337 +瓤 +402 +衄 +1856 +1820 +1150 +matlab +豕 +吆 +腆 +thomas +a2 +294 
+le +366 +using +356 +bb +喆 +smith +different +莴 +401 +谌 +ci +珙 +疥 +kw +鲑 +405 +玷 +蛔 +砀 +361 +zh +nasa +materials +329 +nature +1h +谔 +睥 +ch +20mg +2mg +du +mail +data +every +蹑 +诒 +逋 +372 +while +姝 +刈 +婧 +going +喳 +镞 +铌 +291 +712 +辎 +鹧 +檩 +740 +扪 +10ml +霰 +ar +裆 +ol +嬷 +0mm +ufo +charles +20mm +tvb +apple +刎 +iec +project +sbs +嵋 +342 +690 +悱 +920 +嘤 +jean +篁 +荸 +瞑 +殓 +搽 +50mg +343 +橇 +include +eva +雎 +弭 +獐 +haccp +恿 +video +cf +vpn +society +眦 +730 +铐 +song +尕 +捎 +诟 +institute +痨 +cn +369 +笞 +756 +version +des +sns +趺 +590 +award +唬 +苣 +css +lte +xu +fbi +啾 +瘪 +垸 +357 +橹 +after +濛 +曷 +level +樾 +very +汨 +仟 +姒 +1858 +again +怦 +荏 +tom +诤 +苡 +吭 +830 +dm +before +406 +崆 +氡 +young +脩 +lan +胝 +钏 +3ds +cr +arm +pos +night +屐 +395 +忐 +彧 +拚 +鏖 +344 +100ml +525 +孳 +1024 +yu +忑 +384 +邝 +穰 +403 +摈 +庖 +351 +鸵 +398 +hello +矽 +354 +鲟 +said +381 +768 +発 +762 +sap +1854 +msn +菅 +book +353 +true +339 +javascript +348 +2900 +圪 +蹋 +衾 +簋 +璎 +367 +噎 +911 +嬗 +346 +肼 +362 +359 +跎 +滟 +little +4300 +701 +戦 +嵬 +look +仝 +phys +club +惇 +纾 +times +14000 +炁 +382 +xyz +number +ak +mind +huang +闳 +骐 +秣 +眙 +谘 +碓 +iso9002 +疔 +412 +恂 +am +top +master +鳕 +green +鸱 +int +爨 +镊 +404 +were +4600 +em +better +钯 +圮 +楽 +堀 +1852 +408 +sat +1857 +378 +422 +膘 +705 +噗 +347 +start +486 +锹 +505 +杼 +酊 +same +376 +white +挎 +箸 +郗 +垌 +sa +溏 +martin +蔫 +偻 +364 +妫 +飚 +625 +601 +辔 +濬 +666 +ds +瑄 +621 +觚 +5600 +nhk +415 +express +铍 +bit +跚 +9mm +翕 +煊 +these +50mm +gpu +b6 +hip +耄 +铋 +篦 +zhou +阇 +骛 +nvidia +莪 +吲 +youtube +唁 +870 +箧 +503 +tm +8500 +really +珅 +潋 +迨 +哽 +without +砦 +model +缗 +hey +謇 +呸 +mrna +垓 +糍 +park +wap +璠 +妣 +狎 +攥 +396 +闇 +york +蛉 +瑁 +joe +腼 +蹒 +great +review +200mg +chris +www +嶷 +online +莠 +沤 +哚 +475 +遑 +v1 +such +跺 +膦 +蹿 +unix +hard +40cm +50cm +nothing +郫 +zhao +玳 +ma +boy +埚 +url +432 +network +aaaa +衿 +371 +try +醪 +full +挹 +raid +bg +绡 +汜 +digital +mb +c1 +坩 +ccc +旃 +5200 +607 +itunes +powerpoint +鸨 +between +407 +翈 +1842 +1844 +435 +838 +抡 +chemistry +team +party +die +晞 +place +care +盥 +藁 +蓖 +383 +cv +臊 +made +state +465 +羰 +388 +1620 +sas +楝 +噱 +ji +饽 +苌 +soho +褓 +佶 +mp +581 +years +1260 +1680 +hop +稜 +瞠 +仡 +25mm +605 +423 +341 +363 +374 +627 +text +development +518 +伉 +襁 +ug +change +713 +涞 +1849 +蜇 +抿 +瑗 +pda +418 +un +line +958 +孱 +懑 +416 +von +373 +淦 +赝 +core +dns +747 +427 +387 +would +ipo +醌 +551 +缫 +蠲 +alt +嚓 +鲷 +湫 +捋 +1845 +咩 +裏 +avi +犒 +2050 +墀 +yeah +god +445 +lesson +硐 +蔸 +399 +758 +pu +computer +456 +钽 +1847 +麂 +brown +store +蒡 +鼹 +绻 +1821 +錾 +仃 +515 +篙 +蕤 +589 +applied +737 +930 +c3 +1841 +铤 +billboard +apec +槁 +牖 +螈 +mary +俦 +family +笄 +color +啻 +対 +jsp +郤 +next +iq +645 +506 +hbv +闼 +a3 +349 +value +413 +igg +411 +426 +醺 +赍 +檗 +usa +裾 +head +噫 +掸 +mike +箓 +usb2 +things +5800 +5v +o2 +妪 +乂 +蝈 +砻 +胍 +220v +392 +cba +397 +535 +idc +analysis +25mg +蜱 +ti +2h +聃 +雠 +碚 +椤 +缯 +昴 +890 +缱 +祎 +der +缬 +ex +508 +铙 +cnc +pentium +孀 +533 +advanced +mpa +yl +笳 +蘇 +愆 +685 +榉 +old +氙 +call +alex +燹 +撂 +菽 +583 +箬 +蛄 +瘸 +嬛 +495 +橐 +could +60000 +something +纡 +刽 +辂 +hong +377 +law +蒯 +邨 +1846 +1550 +r2 +1837 +赀 +player +414 +跸 +phone +邙 +hold +rgb +421 +henry +2025 +黟 +409 +磴 +1815 +mode +1843 +闿 +504 +letters +1780 +428 +垟 +389 +t2 +london +528 +jpeg +嵯 +钚 +steve +跄 +30min +527 +潸 +h2 +35000 +崴 +eric +379 +run +three +rf +left +455 +恁 +open +楮 +556 +bc +476 +腧 +458 +plus +1812 +1839 +胨 +b12 +4d +芫 +america +est +dream +碴 +隰 +杓 +md +ya +global +436 +15mm +2ml +貉 +欹 +sup3 +侑 +ea +鳜 +910 +ben +铄 +椴 +昇 +醍 +1020 +798 +midi +肓 +features +lc +brian +akb48 +缂 +1835 +test +铡 +light +978 +s1 +1799 +key +sim +1795 +simple +energy +蹠 
+徂 +west +725 +body +豢 +424 +face +蒽 +lin +805 +1120 +479 +菡 +bill +433 +衲 +阚 +believe +brt +pa +last +芗 +hu +sam +wei +adsl +602 +mk +痍 +玠 +1832 +523 +晷 +604 +jj +468 +淝 +1560 +鄯 +ck +473 +糗 +耨 +榧 +394 +940 +eq +498 +used +sc +胴 +c2 +蕈 +screen +镬 +635 +鼾 +431 +education +wwe +摭 +鸮 +cl +5400 +fpga +恚 +419 +実 +asia +534 +552 +砝 +100mm +pid +741 +珣 +under +603 +寤 +埙 +mbc +tc +xxx +didn +478 +mn +p1 +锏 +simon +ansi +438 +hi +615 +喟 +蘅 +骺 +cell +捭 +study +586 +393 +莜 +should +xi +缶 +f2 +games +0g +1760 +mini +johnson +jones +yes +锟 +1825 +叵 +cm3 +炷 +1580 +stay +675 +another +6800 +鲧 +1736 +ps2 +胼 +517 +査 +岬 +2019 +1640 +rose +鹂 +牯 +珥 +entertainment +448 +und +496 +莼 +software +970 +邠 +5300 +h1n1 +488 +da +眇 +卟 +変 +20m +may +417 +lady +galaxy +4100 +惴 +1789 +846 +801 +渑 +907 +put +蚱 +gone +606 +t3 +company +632 +454 +516 +998 +548 +391 +4700 +瞌 +ide +瘰 +7200 +佝 +together +street +旸 +626 +衽 +郅 +奁 +731 +30mg +mvp +1370 +60cm +12cm +魑 +1828 +628 +everything +612 +san +937 +缛 +2gb +lu +angel +20ml +576 +颙 +sony +790 +press +镫 +hall +簌 +beautiful +豇 +711 +453 +pm +姹 +thing +442 +邋 +alpha +leave +暝 +441 +30mm +chapter +507 +100000 +526 +directx +511 +9cm +words +釐 +619 +洹 +444 +frank +咿 +eyes +483 +俳 +522 +蜊 +醐 +541 +water +499 +聩 +non +bob +坻 +532 +757 +545 +毽 +oo +喾 +alone +scott +744 +辋 +river +zhu +倌 +媪 +蛳 +滹 +哙 +nc +20g +阊 +gs +queen +趸 +1130 +1645 +祢 +4mg +1814 +girls +544 +e1 +籀 +1210 +1573 +徼 +ipv6 +訾 +髁 +1a +jackson +砜 +1836 +les +4gb +撸 +瓘 +1790 +缁 +镓 +sars +eps +519 +sod +bp +1810 +year +縻 +sound +617 +菀 +1125 +598 +酢 +桠 +466 +emc +撵 +怏 +429 +1838 +ready +渌 +546 +taylor +452 +news +1180 +568 +2a +af +538 +list +hot +1380 +etc +1796 +摞 +mo +槲 +levels +ht +浠 +诜 +魉 +韫 +daniel +亓 +盤 +pv +瑭 +魍 +1831 +emi +襞 +social +dreamweaver +爿 +kbs +565 +613 +990 +浃 +樯 +jb +讵 +揩 +physics +耋 +帏 +lng +崃 +bs +457 +enough +shy +521 +596 +ec +451 +鸩 +遢 +turn +臃 +available +4400 +585 +粿 +1010 +禳 +hand +439 +536 +桫 +link +side +earth +mx +髹 +7m +482 +诳 +472 +1140 +707 +622 +wcdma +513 +must +492 +462 +踉 +40mg +948 +cmax +郃 +1320 +v2 +542 +email +493 +嗖 +sup +讧 +cnn +446 +碁 +17000 +湎 +30m +529 +653 +531 +575 +阏 +sr +united +pm2 +mt +媾 +443 +様 +aac +806 +哔 +舸 +vb +611 +曩 +821 +gre +gl +cisco +忝 +峁 +掂 +464 +葳 +487 +437 +including +715 +鄄 +558 +both +谵 +463 +jim +608 +m4 +5100 +彊 +锴 +war +郜 +money +481 +葖 +1824 +tnt +蓇 +瓴 +鳟 +橼 +5s +louis +434 +鲇 +邗 +el +犄 +秭 +3900 +records +view +chemical +1001 +1mol +dance +668 +dl +槭 +缵 +que +624 +rt +1823 +1805 +005 +1826 +巯 +sgs +user +龊 +qc +狍 +island +language +space +擞 +saint +2n +pt +share +瞽 +hotel +christian +557 +栲 +撅 +2b +1801 +447 +1822 +瑀 +smt +hk +1834 +戢 +825 +50ml +朓 +逖 +general +椹 +nm +洺 +cae +484 +艏 +wma +zn +苁 +single +599 +c4 +滘 +777 +铧 +侪 +ocirc +1kg +684 +豳 +skf +12mm +489 +hla +竦 +貔 +ld +being +562 +圄 +van +gm +688 +655 +special +呷 +edition +1s +jiang +131108 +514 +1792 +ncaa +1833 +旄 +遛 +jr +program +656 +467 +ing +901 +755 +509 +芈 +kong +rp +砣 +桷 +audio +icp +happy +龌 +done +疬 +japan +ts +mit +p2 +524 +looking +miss +缟 +582 +洌 +35mm +494 +grand +跏 +those +joseph +ctrl +547 +1040 +686 +蝮 +lp +cod +菰 +sio2 +txt +1770 +1060 +帑 +767 +north +fcc +怙 +ester +718 +story +edi +634 +1360 +豸 +1660 +lh +雩 +1230 +magic +誊 +549 +臬 +4k +op +1662 +651 +镣 +箇 +616 +title +sciences +25cm +踱 +s2 +t4 +钍 +648 +100m +543 +588 +苫 +554 +蝽 +r1 +3mg +amino +1776 +浯 +609 +772 +ca2 +vlan +469 +500mg +単 +road +亶 +636 +metal +device +40mm +囹 +穑 +1730 +佻 +1818 +绌 +12g +537 +诔 +pve +autodesk +477 +v8 +ray +gp +span +gc +size +716 +鹬 +ssl +crt +1670 +925 +髌 +pn +1127 +702 +658 +services 
+support +1802 +蒌 +coming +experience +nbc +鳏 +631 +638 +ace +0cm +ems +9001 +殄 +yen +soc +ethyl +怛 +tf +筌 +刳 +studies +theory +1030 +578 +radio +翮 +卍 +畹 +471 +704 +because +1610 +箜 +save +燔 +赳 +553 +1809 +篌 +窨 +翥 +785 +炅 +钕 +lett +803 +1827 +academy +ed +629 +sf +pr +hill +explorer +future +food +莳 +662 +567 +dcs +忖 +戡 +1086 +1190 +1829 +bad +es +15m +order +spring +沢 +south +497 +025 +move +狒 +1630 +圉 +abb +449 +learn +l0 +d2 +5d +wav +琯 +邰 +cis +quality +odm +926 +acta +root +smart +1661 +苾 +cm2 +photos +l2 +via +sk +犸 +623 +邡 +feeling +572 +郏 +襦 +python +bmw +888 +guo +epa +williams +沆 +813 +bot +read +function +wilson +1723 +enterprise +玟 +50hz +s26 +fire +engineer +tony +1819 +濉 +rh +洎 +莨 +氘 +pb +咛 +1720 +佺 +1460 +815 +cbs +腩 +beta +鳔 +1735 +yan +1gb +x2 +剜 +秕 +牝 +芨 +din +関 +del +sms +649 +pal +1369 +far +maya +654 +拊 +812 +595 +竑 +50m +圹 +close +eos +颡 +1420 +6300 +1816 +wrong +break +573 +765 +file +friend +002 +摺 +683 +nx +沩 +蜉 +please +1170 +ro +6400 +筚 +nick +acm +愔 +ati +point +肟 +766 +俶 +fast +ata +d1 +678 +geforce +1710 +yahoo +堃 +绉 +mysql +1793 +奭 +gap +iso14000 +uk +astm +h2o +n2 +film +method +1804 +罅 +so2 +嗳 +665 +adam +uc +蜢 +1806 +1775 +photo +疠 +474 +image +200mm +sure +561 +帔 +髡 +643 +黥 +1813 +proceedings +褛 +柰 +beyond +royal +else +eda +808 +ddr +gif +鏊 +l1 +痼 +571 +waiting +堞 +code +652 +rss +learning +嗝 +461 +beijing +娉 +566 +577 +708 +1520 +689 +kevin +human +661 +539 +875 +1811 +ssci +6600 +戕 +587 +735 +3s +铱 +耜 +觥 +867 +镒 +584 +呓 +1522 +904 +case +1101 +491 +1080p +history +蒹 +栱 +im +564 +f4 +卮 +琚 +salt +jason +rohs +12v +hydroxy +逦 +modem +font +酩 +蓍 +cry +65536 +health +虺 +1798 +tonight +small +谠 +1570 +1220 +jane +against +597 +751 +459 +bd +鼋 +焗 +udp +process +1070 +1807 +children +8g +eb +62mm +22000 +add +1440 +褴 +rm +25g +ccedil +706 +714 +5l +砒 +赧 +蛏 +709 +蚬 +1530 +瘕 +5h +559 +jay +iga +020 +fall +scsi +顗 +isdn +death +563 +today +愠 +dvi +勣 +wait +1642 +飕 +徳 +滢 +琇 +鳙 +db +瞟 +尻 +force +400mg +澶 +荽 +舐 +arts +ha +east +lost +effects +1628 +album +harry +633 +dark +public +2250 +soul +826 +659 +exo +侂 +733 +se +黼 +icu +4h +market +潟 +7800 +绂 +瘗 +ngc +1794 +crazy +蓥 +竽 +濞 +igm +scdma +6200 +cb +835 +699 +骖 +偁 +bmp +809 +1270 +oled +応 +1160 +1621 +锜 +g3 +ova +cheng +614 +匏 +thinkpad +赑 +fps +create +kim +讦 +1480 +诨 +1540 +rev +1v1 +罘 +fans +巖 +1740 +ag +嫘 +1649 +ps3 +908 +颀 +g1 +703 +岿 +v3 +虻 +936 +fl +c2c +罴 +environmental +paris +594 +hear +囗 +jump +communications +溆 +talk +噤 +824 +骝 +003 +咂 +695 +728 +e2 +nec +iptv +1797 +kelly +500ml +锛 +721 +rc +1808 +ldl +1240 +槊 +radeon +676 +啕 +tang +plant +50g +驽 +professional +凇 +698 +s36 +lord +search +alan +籴 +pd +1403 +硖 +1791 +816 +1636 +3h +gsp +811 +sky +1632 +铯 +christmas +怿 +笥 +matter +574 +噙 +倨 +effect +647 +779 +1803 +657 +sorry +awards +igbt +pwm +坭 +醅 +sos +976 +592 +滏 +10min +682 +cs3 +悻 +did +mater +579 +聒 +1724 +feng +low +mhz +836 +722 +枥 +726 +昺 +bank +memory +rap +975 +663 +ips +酆 +2kg +787 +簟 +睇 +轫 +溱 +骢 +榘 +642 +珺 +跹 +677 +series +nlp +raquo +蚶 +stone +1672 +1817 +1646 +827 +驺 +ko +security +perfect +alexander +746 +tt +check +804 +饧 +15mg +sir +moon +doesn +591 +inside +tim +672 +641 +噼 +儆 +1w +氚 +646 +哧 +1783 +旒 +鸬 +1648 +夥 +ev +1688 +score +standard +玦 +723 +貅 +揄 +戗 +fx +938 +璩 +fu +1654 +剐 +010 +cpi +垴 +蘼 +hz +1521 +1067 +727 +ah +lv +916 +裒 +639 +han +躅 +1715 +唳 +form +second +嗑 +荦 +674 +霈 +jin +缦 +啭 +pi +1788 +rx +隈 +gao +sdk +zheng +悫 +745 +href +593 +ngo +multi +d3 +彀 +637 +1276 +悭 +found +jis +5700 +焓 +1234 +80cm +磔 +aim +1778 +蓊 +act +569 +xiao +郾 +717 +786 +return +5min +1582 +etf +1590 
+action +1625 +sarah +yourself +枧 +鹚 +10kg +80000 +検 +775 +818 +stephen +gui +屃 +644 +9500 +v6 +馑 +wlan +hs +2048 +area +1616 +andrew +8226 +6mg +1567 +1763 +1470 +嗲 +pps +铟 +rca +pierre +687 +null +manager +738 +sdh +828 +薤 +60g +300mg +jun +1685 +favorite +making +playing +summer +754 +692 +涔 +樗 +664 +忾 +収 +绺 +945 +h2s +bis +self +300mm +烊 +opengl +912 +acute +螫 +黩 +996 +magazine +edward +su +elisa +hdl +cyp3a4 +鞫 +foundation +alice +ddr3 +915 +923 +tbs +andy +field +date +transactions +limited +during +1126 +鲠 +1057 +fan +嘭 +缣 +845 +681 +rw +mean +1566 +become +economic +852 +johnny +蒺 +unique +黒 +tu +boys +1330 +885 +getting +cj +1072 +nh +ne +band +cool +724 +771 +骘 +氖 +content +842 +镝 +俅 +谮 +te +9600 +drive +phenyl +1275 +屦 +cao +menu +823 +摁 +氪 +蘧 +active +sb +appl +988 +1622 +伝 +1725 +zero +1008 +3kg +腠 +叡 +hit +鲂 +mi +0kg +748 +lite +enjoy +local +789 +続 +1506 +seen +s3 +1765 +european +讣 +gold +1279 +736 +965 +pl +button +耷 +1430 +986 +763 +toefl +燊 +鸷 +jimmy +dota +955 +861 +猊 +732 +xbox +days +dan +673 +833 +囡 +崤 +4c +economics +23000 +agent +html5 +points +ryan +shi +砬 +湜 +reading +918 +mine +adc +917 +1592 +1781 +翚 +峯 +909 +once +exchange +choose +current +symbian +ts16949 +dave +machine +鲎 +qos +蕖 +1785 +9m +cia +until +cs4 +759 +f3 +903 +24000 +968 +8mg +lewis +鹈 +凼 +snh48 +866 +泫 +荑 +黻 +牂 +1722 +鄣 +篑 +ho +1110 +1784 +髭 +陬 +寔 +dt +shanghai +疴 +邽 +987 +45000 +1042 +喏 +彖 +sl +saas +814 +28000 +a5 +彘 +赟 +819 +foxpro +shit +822 +盹 +诮 +鸫 +per +does +150mm +products +camp +select +capital +茕 +corporation +26000 +铖 +954 +dd +闩 +string +page +ba +671 +読 +782 +鄜 +漈 +盍 +dlp +729 +甭 +愎 +outlook +wii +ue +1787 +festival +communication +channel +gary +1755 +1774 +8600 +copy +150mg +魃 +dragon +1056 +c5 +炆 +track +hdpe +liang +鍊 +1800mhz +1619 +蛐 +995 +21000 +薜 +win +1394 +1786 +rain +楯 +table +鲀 +逡 +itu +applications +mmorpg +嘞 +s7 +696 +侔 +1069 +觇 +lbs +0mg +car +wave +糸 +踮 +狷 +1552 +1627 +latest +step +886 +761 +菘 +783 +寳 +esp +扃 +865 +jazz +k1 +fine +child +kind +anna +60mg +997 +maria +nk +792 +raw +late +soa +905 +cai +ttl +delphi +prince +1340 +禊 +synthesis +喑 +rmb +miller +patrick +933 +running +50kg +1398 +ast +752 +location +dead +塍 +chateau +allows +forget +tg +921 +栝 +5w +kiss +1690 +691 +arthur +瓿 +index +csa +rmvb +msc +廨 +cas +known +h1 +tj +j2ee +asian +841 +1227 +g20 +cross +cos +ntilde +719 +貘 +dnf +california +france +modern +pacific +769 +1066 +turbo +753 +795 +669 +1764 +868 +馕 +僰 +union +1772 +2150 +1063 +哏 +double +fight +858 +math +bo +瑷 +men +sea +6700 +sem +697 +疎 +882 +note +qi +uml +902 +1637 +tp +1290 +1085 +776 +蝣 +怵 +阃 +dps +1687 +弢 +镲 +hcl +al2o3 +js +auto +螅 +1683 +v5 +culture +935 +吖 +edge +碲 +voice +1007 +bridge +855 +008 +夼 +茌 +battle +嗬 +靺 +dp +ae +1090 +895 +1012 +1162 +bi +778 +髀 +1575 +pcm +15min +1598 +铊 +secret +739 +200m +6h +matt +谡 +card +mic +癔 +ecu +16mm +984 +镠 +5km +dhcp +1753 +巻 +秾 +living +gn +1643 +framework +菪 +679 +赜 +1782 +four +铈 +1777 +british +shell +santa +yuan +20ma +fly +927 +qu +nds +qaq +bar +髙 +arp +1667 +1773 +693 +main +鲳 +1510 +1002 +2022 +cdna +box +珰 +100km +004 +畋 +bring +泅 +959 +hpv +makes +cmv +鲅 +tmd +1762 +854 +泚 +ghost +short +mcu +1768 +cat +963 +1757 +1206 +1207 +puzzle +793 +central +859 +飏 +walter +60hz +anderson +1727 +thought +屍 +仨 +864 +molecular +856 +dong +financial +1728 +surface +g2 +mf +葚 +叻 +solidworks +res +speed +1195 +咻 +ascii +1404 +784 +jeff +衩 +1371 +land +biology +1655 +郄 +otc +sio +1310 +1605 +蹩 +mems +1618 +m16 +complete +industrial +acs +1603 +kids +tour +u2 +allen +1756 +743 +嬖 +踽 +davis +柽 
+鞨 +65279 +7600 +30ml +957 +0l +734 +p450 +956 +ir +麴 +500mm +casio +1038 +roger +library +015 +1652 +薙 +within +hands +874 +ntsc +钇 +whole +jq +氵 +垆 +post +sweet +wall +898 +cs5 +feo +9800 +cms +1390 +since +medical +犟 +1492 +罍 +stand +justin +lake +i5 +1729 +bell +ruby +important +bout +images +lab +962 +1759 +rj +cache +nb +production +経 +807 +1771 +doing +粜 +tnf +ws +guide +bim +events +1626 +1016 +焜 +performance +ra +zl +牀 +1568 +1647 +埝 +洧 +1615 +shift +788 +shen +1588 +60mm +覧 +tuv +1673 +electronic +mos +蓣 +8kg +862 +echo +1572 +section +981 +甯 +sg +1664 +understand +hsk +delta +x86 +eap +block +1578 +er +xl +蒐 +馐 +nox +畑 +ib +trying +ann +1635 +apache +naoh +12345 +缑 +礽 +1624 +694 +瞋 +1601 +浍 +983 +773 +1000m +someone +15kg +25m +847 +袢 +桕 +1037 +jerry +843 +picture +919 +e3 +printf +3gs +marie +853 +rj45 +侩 +913 +896 +lose +unicode +100cm +1711 +charlie +詈 +戸 +1689 +room +烝 +beat +堌 +伋 +hplc +9300 +110kv +nfc +倬 +764 +iis +圯 +solo +碇 +ef +round +chang +1366 +781 +1585 +982 +socket +df +892 +1536 +831 +ren +6kg +4900 +纰 +object +forever +832 +951 +qr +1023 +8800 +4kg +磾 +泔 +1131 +纮 +蓁 +971 +building +1021 +铗 +939 +弇 +挲 +crystal +艉 +smtp +鱬 +cims +fang +1265 +trans +pan +1745 +1604 +泺 +橛 +817 +796 +袴 +cosplay +1154 +1189 +749 +794 +1068 +881 +hc +hope +1410 +couldn +1638 +992 +along +age +250mg +clear +aps +1631 +1011 +provides +1123 +1701 +36000 +csf +韪 +n1 +works +籓 +967 +ptc +贶 +1111 +1651 +棰 +1726 +sar +1666 +qvga +hf +coreldraw +possible +趵 +1629 +943 +marc +luo +樨 +848 +county +944 +tb +dts +junior +vba +lot +傕 +玕 +毎 +direct +839 +繸 +2350 +774 +劵 +fsh +wmv +镧 +秫 +1094 +osi +1602 +邶 +猞 +dior +1766 +1623 +廛 +栌 +钲 +镦 +1607 +psa +spss +xy +1769 +cells +1465 +1577 +gon +send +vision +thinking +imf +嘏 +carl +蝰 +32000 +bay +928 +is09001 +镏 +20kg +淠 +imax +novel +qt +1684 +荇 +逄 +au +author +mod +80mm +1748 +849 +1612 +yet +嘅 +929 +6l +karl +6100 +students +gmat +myself +kate +jpg +979 +1752 +829 +2450 +914 +876 +祕 +瑠 +48h +mpv +1734 +mis +1565 +walk +941 +1075 +1235 +natural +k2 +977 +炝 +杪 +4050 +1669 +p3 +1004 +fn +埴 +1555 +vmware +chloride +942 +steven +1078 +獬 +966 +1135 +country +947 +柢 +捱 +跣 +887 +涑 +75mm +1278 +1583 +western +watch +撃 +伢 +堠 +1045 +12m +museum +1215 +document +marketing +952 +卽 +猁 +usb3 +906 +厣 +physical +辏 +1668 +旆 +agp +茆 +1488 +pg +乜 +deep +1082 +961 +踯 +1526 +# +[ +yam +lofter +##s +##0 +##a +##2 +##1 +##3 +##e +##8 +##5 +##6 +##4 +##9 +##7 +##t +##o +##d +##i +##n +##m +##c +##l +##y +##r +##g +##p +##f +pixnet +cookies +tripadvisor +##er +##k +##h +##b +##x +##u +##w +##ing +ctrip +##on +##v +llc +##an +##z +blogthis +##le +##in +##mm +##00 +ig +##ng +##us +##te +##ed +ncc +blog +##10 +##al +##ic +##ia +##q +##ce +##en +##is +##ra +##es +##j +##cm +tw +##ne +##re +##tion +pony +##2017 +##ch +##or +##na +cafe +pinterest +pixstyleme3c +##ta +##2016 +##ll +##20 +##ie +##ma +##17 +##ion +##th +##st +##se +##et +##ck +##ly +web885 +##ge +xd +##ry +##11 +0fork +##12 +##ter +##ar +##la +##os +##30 +##el +##50 +##ml +tue +posted +##at +##man +##15 +ago +##it +##me +##de +##nt +##mb +##16 +##ve +##da +##ps +##to +https +momo +##son +##ke +##80 +ebd +apk +##88 +##um +wiki +brake +mon +po +june +##ss +fb +##as +leonardo +safari +##60 +wed +win7 +kiehl +##co +##go +vfm +kanye +##90 +##2015 +##id +##ey +##sa +##ro +##am +##no +thu +fri +##sh +##ki +comments +##pe +##ine +uber +##mi +##ton +wordpress +##ment +win10 +##ld +##li +gmail +##rs +##ri +##rd +##21 +##io +##99 +paypal +policy +##40 +##ty +##18 +##01 +##ba +taiwan +##ga +privacy +agoda +##13 +##ny +##24 +##22 
+##by +##ur +##hz +##ang +cookie +netscape +##ka +##ad +nike +survey +##016 +wikia +##32 +##017 +cbc +##tor +##kg +##rt +##14 +campaign +##ct +##ts +##ns +##ao +##nd +##70 +##ya +##il +##25 +0020 +897 +##23 +hotels +##ian +6606 +##ers +##26 +##day +##ay +##line +##be +talk2yam +yamservice +coco +##dy +##ies +##ha +instagram +##ot +##va +##mo +##land +ltxsw +##ation +##pa +##ol +tag +##ue +##31 +oppo +##ca +##om +chrome +##ure +lol +##19 +##bo +##100 +##way +##ko +##do +##un +##ni +herme +##28 +##up +##06 +##ds +admin +##48 +##015 +##35 +##ee +tpp +##ive +##cc +##ble +##ity +##ex +##ler +##ap +##book +##ice +##km +##mg +##ms +ebay +##29 +ubuntu +##cy +##view +##lo +##oo +##02 +step1 +july +##net +##ls +##ii +##05 +##33 +step2 +ios9 +##box +##ley +samsung +pokemon +##ent +##les +s8 +atom +##said +##55 +##2014 +##66 +adidas +amazon +##ber +##ner +visa +##77 +##der +connectivity +##hi +firefox +skip +##27 +##ir +##61 +##ai +##ver +cafe2017 +##ron +##ster +##sk +##ft +longchamp +ssd +##ti +reply +##my +apr +##ker +source +##one +##2013 +##ow +goods +##lin +##ip +##ics +##45 +##03 +##ff +##47 +ganji +##nce +##per +faq +comment +##ock +##bs +##ah +##lv +##mp +##000 +melody +17life +##au +##71 +##04 +##95 +##age +tips +##68 +##ting +##ung +wonderland +##ction +mar +article +##db +##07 +##ore +##op +##78 +##38 +##ong +##73 +##08 +##ica +##36 +##wa +##64 +homemesh +##85 +##tv +##di +macbook +##ier +##si +##75 +##ok +goris +lock +##ut +carol +##vi +##ac +anti +jan +tags +##98 +##51 +august +##86 +##fs +##sion +jordan +##tt +##lt +##42 +##bc +vivi +##rry +##ted +##rn +usd +##t00 +##58 +##09 +##34 +goo +##ui +##ary +item +##pm +##41 +##za +##2012 +blogabstract +##ger +##62 +##44 +gr2 +asus +cindy +##hd +esc +##od +booking +##53 +fed +##81 +##ina +chan +distribution +steam +pk10 +##ix +##65 +##91 +dec +##ana +icecat +00z +##46 +##ji +##ard +oct +##ain +jp +##ze +##bi +cio +##56 +h5 +##39 +##port +curve +##nm +##dia +utc +12345678910 +##52 +chanel +##and +##im +##63 +vera +vivo +##ei +2756 +##69 +msci +##po +##89 +##bit +##out +##zz +##97 +##67 +opec +##96 +##tes +##ast +##ling +##ory +##ical +kitty +##43 +step3 +##cn +win8 +iphone7 +beauty +##87 +dollars +##ys +##oc +pay +##2011 +##lly +##ks +download +sep +##board +##37 +##lan +winrar +##que +##ua +##com +ettoday +##54 +##ren +##via +##72 +##79 +##tch +##49 +##ial +##nn +step4 +2765 +gov +##xx +mandy +##ser +copyright +fashion +##ist +##art +##lm +##ek +##ning +##if +##ite +iot +##84 +##2010 +##ku +october +##ux +trump +##hs +##ide +##ins +april +##ight +##83 +protected +##fe +##ho +ofo +gomaji +march +##lla +##pp +##ec +6s +720p +##rm +##ham +##92 +fandom +##ell +info +##82 +sina +4066 +##able +##ctor +rights +jul +##76 +mall +##59 +donald +sodu +##light +reserved +htm +##han +##57 +##ise +##tions +##shi +doc +055 +##ram +shopping +aug +##pi +##well +wam +##hu +##gb +##93 +mix +##ef +##uan +bwl +##plus +##res +##ess +tea +hktvmall +##ate +##ese +feb +inn +nov +##ci +pass +##bet +##nk +coffee +airbnb +##ute +woshipm +skype +##fc +##www +##94 +##ght +##gs +##ile +##wood +##uo +icon +##em +says +##king +##tive +blogger +##74 +##ox +##zy +##red +##ium +##lf +nokia +claire +##ding +november +lohas +##500 +##tic +##cs +##che +##ire +##gy +##ult +january +ptt +##fa +##mer +pchome +udn +##time +##tte +garden +eleven +309b +bat +##123 +##tra +kindle +##ern +xperia +ces +travel +##ous +##int +edu +cho +##car +##our +##ant +rends +##jo +mastercard +##2000 +kb +##min +##ino +##ris +##ud +##set +##her +##ou +taipei +##fi +##ill +aphojoy +december +meiki +##ick 
+tweet +##av +iphone6 +##dd +views +##mark +##ash +##ome +koreanmall +##ak +q2 +##200 +mlb +##lle +##watch +##und +##tal +##less +4399 +##rl +update +shop +##mhz +##house +##key +##001 +##hy +##web +##2009 +##gg +##wan +##val +2021 +##ons +doi +trivago +overdope +##ance +573032185 +wx17house +##so +audi +##he +##rp +##ake +beach +cfa +ps4 +##800 +##link +##hp +ferragamo +##eng +##style +##gi +i7 +##ray +##max +##pc +september +##ace +vps +february +pantos +wp +lisa +jquery +offer +##berg +##news +fks +##all +##rus +##888 +##works +blogtitle +loftpermalink +ling +##ja +outlet +##ea +##top +##ness +salvatore +##lu +swift +##ul +week +##ean +##300 +##gle +##back +powered +##tan +##nes +canon +##zi +##las +##oe +##sd +##bot +##world +##zo +top100 +pmi +##vr +ball +vogue +ofweek +##list +##ort +##lon +##tc +##of +##bus +##gen +nas +##lie +##ria +##coin +##bt +nata +vive +cup +##ook +##sy +msg +3ce +##word +ebooks +r8 +nice +months +rewards +##ther +0800 +##xi +##sc +gg +blogfp +daily +##bb +##tar +##ky +anthony +##yo +##ara +##aa +##rc +##tz +##ston +gear +##eo +##ade +##win +##ura +##den +##ita +##sm +png +rakuten +whatsapp +##use +pad +gucci +##ode +##fo +chicago +##hone +io +sogo +be2 +##ology +cloud +##con +##ford +##joy +##kb +##rade +##ach +docker +##ful +##ase +ford +##star +edited +##are +##mc +siri +##ella +bloomberg +##read +pizza +##ison +##vm +node +18k +##play +##cer +##yu +##ings +asr +##lia +step5 +##cd +pixstyleme +##600 +##tus +tokyo +##rial +##life +##ae +tcs +##rk +##wang +##sp +##ving +premium +netflix +##lton +##ple +##cal +021 +##sen +##ville +nexus +##ius +##mah +tila +##tin +resort +##ws +p10 +report +##360 +##ru +bus +vans +##est +links +rebecca +##dm +azure +##365 +##mon +moto +##eam +blogspot +##ments +##ik +##kw +##bin +##ata +##vin +##tu +##ula +station +##ature +files +zara +hdr +top10 +s6 +marriott +avira +tab +##ran +##home +oculus +##ral +rosie +##force +##ini +ice +##bert +##nder +##mber +plurk +##sis +00kg +##ence +##nc +##name +log +ikea +malaysia +##ncy +##nie +##ye +##oid +##chi +xuehai +##1000 +##orm +##rf +##ware +##pro +##era +##ub +##2008 +8891 +scp +##zen +qvod +jcb +##hr +weibo +##row +##ish +github +mate +##lot +##ane +##tina +ed2k +##vel +##900 +final +ns +bytes +##ene +##cker +##2007 +##px +topapp +helpapp +14k +g4g +ldquo +##fork +##gan +##zon +##qq +##google +##ism +##zer +toyota +category +##labels +restaurant +##md +posts +##ico +angelababy +123456 +sports +candy +##new +##here +swissinfo +dram +##ual +##vice +##wer +sport +q1 +ios10 +##mll +wan +##uk +x3 +0t +##ming +e5 +##3d +h7n9 +worldcat +##vo +##led +##580 +##ax +##ert +polo +##lr +##hing +##chat +##ule +hotmail +##pad +bbq +##ring +wali +2k +costco +switch +##city +philips +##mann +panasonic +##cl +##vd +##ping +##rge +##lk +css3 +##ney +##ular +##400 +##tter +lz +##tm +##yan +##let +coach +##pt +a8 +follow +##berry +##ew +##wn +##og +##code +##rid +villa +git +r11 +##cket +error +##anonymoussaid +##ag +##ame +##gc +qa +##lis +##gin +vmalife +##cher +wedding +##tis +demo +bye +##rant +orz +acer +##ats +##ven +macd +yougou +##dn +##ano +##urt +##rent +continue +script +##wen +##ect +paper +##chel +##cat +x5 +fox +##blog +loading +##yn +##tp +kuso +799 +vdc +forest +prime +ultra +##rmb +square +##field +##reen +##ors +##ju +##air +##map +cdn +##wo +m8 +##get +opera +##base +##ood +vsa +##aw +##ail +count +##een +##gp +vsc +tree +##eg +##ose +##ories +##shop +alphago +v4 +fluke62max +zip +##sta +bas +##yer +hadoop +##ube +##wi +0755 +hola +##low +centre +##fer +##750 +##media +##san +##bank 
+q3 +##nge +##mail +##lp +client +event +vincent +##nse +sui +adchoice +##stry +##zone +ga +apps +##ab +##rner +kymco +##care +##pu +##yi +minkoff +annie +collection +kpi +playstation +bh +##bar +armani +##xy +iherb +##ery +##share +##ob +volvo +##ball +##hk +##cp +##rie +##ona +##sl +gtx +rdquo +jayz +##lex +##rum +namespace +##ale +##atic +##erson +##ql +##ves +##type +enter +##168 +##mix +##bian +a9 +ky +##lc +movie +##hc +tower +##ration +##mit +##nch +ua +tel +prefix +##o2 +##point +ott +##http +##ury +baidu +##ink +member +##logy +bigbang +nownews +##js +##shot +##tb +eba +##tics +##lus +spark +##ama +##ions +##lls +##down +##ress +burberry +day2 +##kv +related +edit +##ark +cx +32gb +g9 +##ans +##tty +s5 +##bee +thread +xr +buy +spotify +##ari +##verse +7headlines +nego +sunny +dom +positioning +fit +##tton +alexa +##ties +##llow +amy +##du +##rth +##lar +2345 +##des +sidebar +site +##cky +##kit +##ime +##009 +season +##fun +gogoro +a7 +lily +twd600 +##vis +##cture +friday +yi +##tta +##tel +##lock +economy +tinker +8gb +##app +oops +##right +edm +##cent +supreme +##its +##asia +dropbox +##tti +books +##tle +##ller +##ken +##more +##boy +sex +##dom +##ider +##unch +##put +##gh +ka +amoled +div +##tr +##n1 +port +howard +##tags +ken +##nus +adsense +buff +thunder +##town +##ique +##body +pin +##erry +tee +##the +##013 +udnbkk +16gb +##mic +miui +##tro +##alk +##nity +s4 +##oa +docomo +##tf +##ack +fc2 +##ded +##sco +##014 +##rite +linkedin +##ada +##now +##ndy +ucbug +sputniknews +legalminer +##ika +##xp +##bu +q10 +##rman +cheese +ming +maker +##gm +nikon +##fig +ppi +jchere +ted +fgo +tech +##tto +##gl +##len +hair +img +##pper +##a1 +acca +##ition +##ference +suite +##ig +##mond +##cation +##pr +101vip +##999 +64gb +airport +##over +##ith +##su +town +piece +##llo +no1 +##qi +focus +reader +##admin +##ora +false +##log +##ces +##ume +motel +##oper +flickr +netcomponents +##af +pose +##ound +##cg +##site +##iko +con +##ath +##hip +##rey +cream +##cks +012 +##dp +facebooktwitterpinterestgoogle +sso +shtml +swiss +##mw +lumia +xdd +tiffany +insee +russell +dell +##ations +camera +##vs +##flow +##late +classic +##nter +##ever +##lab +##nger +qe +##cing +editor +##nap +sunday +##ens +##700 +##bra +acg +sofascore +mkv +##ign +jonathan +build +labels +##oto +tesla +moba +gohappy +ajax +##test +##urs +wps +fedora +##ich +mozilla +##480 +##dr +urn +##lina +grace +##die +##try +##ader +elle +##chen +price +##ten +uhz +##ough +##hen +states +push +session +balance +wow +##cus +##py +##ward +##ep +34e +wong +prada +##cle +##ree +q4 +##ctive +##ool +##ira +##163 +rq +buffet +e6 +##ez +##card +##cha +day3 +eye +##end +adi +tvbs +##ala +nova +##tail +##ries +##ved +base +##ways +hero +hgih +profile +fish +mu +ssh +##wd +click +cake +##ond +pre +##tom +kic +pixel +##ov +##fl +product +6a +##pd +dear +##gate +yumi +##sky +bin +##ture +##ape +isis +nand +##101 +##load +##ream +a6 +##post +##we +zenfone +##ike +gd +forum +jessica +##ould +##ious +lohasthree +##gar +##ggle +##ric +##own +eclipse +##side +061 +##other +##tech +##ator +engine +##ged +plaza +##fit +westbrook +reuters +##ily +contextlink +##hn +##cil +##cel +cambridge +##ize +##aid +##data +frm +##head +butler +##sun +##mar +puma +pmid +kitchen +##lic +day1 +##text +##page +##rris +pm1 +##ket +trackback +##hai +display +##hl +idea +##sent +airmail +##ug +##men +028 +##lution +schemas +asics +wikipedia +##tional +##vy +##dget +##ein +contact +pepper +##uel +##ument +##hang +q5 +##sue +##ndi +swatch +##cept +popular +##ste +##tag +trc 
+##west +##live +honda +ping +messenger +##rap +v9 +unity +appqq +leo +##tone +##ass +uniqlo +##010 +moneydj +##tical +12306 +##m2 +coc +miacare +##mn +tmt +##core +vim +kk +##may +target +##2c +##ope +omega +pinkoi +##rain +##ement +p9 +rd +##tier +##vic +zone +isofix +cpa +kimi +##lay +lulu +##uck +050 +weeks +##hop +##ear +eia +##fly +korea +boost +##ship +eur +valley +##iel +##ude +rn +##ena +feed +5757 +qqmei +##thing +aws +pink +##ters +##kin +board +##vertisement +wine +##ien +##dge +##tant +##twitter +##3c +cool1 +##012 +##150 +##fu +##iner +googlemsn +pixnetfacebookyahoo +x7 +##uce +sao +##ev +##file +9678 +xddd +shirt +##rio +##hat +givenchy +bang +##lio +monday +##abc +ubuntuforumwikilinuxpastechat +##vc +##rity +7866 +##ost +imsean +tiger +##fet +dji +##come +##beth +##aft +##don +3p +emma +##khz +x6 +##face +pptv +x4 +##mate +sophie +##jing +fifa +##mand +sale +inwedding +##gn +##mmy +##pmlast +nana +##wu +note7 +##340 +##bel +window +##dio +##ht +##ivity +domain +neo +##isa +##lter +5k +f5 +##cts +ft +zol +##act +mwc +nbapop +eds +##room +previous +tomtom +##ets +5t +chi +##hg +fairmont +gay +1b +##raph +##ils +i3 +avenue +##host +##bon +##tsu +message +navigation +fintech +h6 +##ject +##vas +##firm +credit +##wf +xxxx +##nor +##space +huawei +plan +json +sbl +##dc +wish +##120 +##sol +windows7 +washington +##nsis +lo +##sio +##ym +##bor +planet +##wt +gpa +##tw +##oka +connect +##rss +##work +##atus +chicken +##times +fa +##ather +##cord +009 +##eep +hitachi +##pan +disney +##press +wind +frigidaire +##tl +hsu +##ull +expedia +archives +##wei +cut +ins +6gb +brand +cf1 +##rip +##nis +128gb +3t +##oon +quick +15058 +wing +##bug +##cms +##dar +##oh +zoom +trip +##nba +rcep +aspx +080 +gnu +##count +##url +##ging +8591 +am09 +shadow +##cia +emily +##tation +host +ff +techorz +##mini +##mporary +##ering +##next +cma +##mbps +##gas +##ift +##dot +amana +##ros +##eet +##ible +##aka +##lor +maggie +##011 +##iu +##gt +1tb +articles +##burg +##iki +database +fantasy +##rex +##cam +dlc +dean +##you +path +gaming +victoria +maps +##lee +##itor +overchicstoretvhome +##xt +##nan +x9 +install +##ann +##ph +##rcle +##nic +##nar +metro +chocolate +##rian +##table +skin +##sn +mountain +##0mm +inparadise +7x24 +##jia +eeworld +creative +g5 +parker +ecfa +village +sylvia +hbl +##ques +##onsored +##x2 +##v4 +##tein +ie6 +##stack +ver +##ads +##baby +bbe +##110 +##lone +##uid +ads +022 +gundam +006 +scrum +match +##ave +##470 +##oy +##talk +glass +lamigo +##eme +##a5 +wade +kde +##lace +ocean +tvg +##covery +##r3 +##ners +##rea +##aine +cover +##ision +##sia +##bow +msi +##love +soft +z2 +##pl +mobil +##uy +nginx +##oi +##rr +6221 +##mple +##sson +##nts +91tv +comhd +crv3000 +##uard +gallery +##bia +rate +spf +redis +traction +icloud +011 +jose +##tory +sohu +899 +kicstart2 +##hia +##sit +##walk +##xure +500g +##pact +xa +carlo +##250 +##walker +##can +cto +gigi +pen +##hoo +ob +##yy +13913459 +##iti +mango +##bbs +sense +oxford +walker +jennifer +##ola +course +##bre +##pus +##rder +lucky +075 +ivy +##nia +sotheby +##ugh +joy +##orage +##ush +##bat +##dt +r9 +##2d +##gio +wear +##lax +##moon +seven +lonzo +8k +evolution +##kk +kd +arduino +##lux +arpg +##rdon +cook +##x5 +five +##als +##ida +sign +##nda +##posted +fresh +##mine +##skip +##form +##ssion +##tee +dyson +stage +##jie +##night +epson +pack +##ppy +wd +##eh +##rence +##lvin +golden +discovery +##trix +##n2 +loft +##uch +##dra +##sse +1mdb +welcome +##urn +gaga +##lmer +teddy +##160 +##f2016 +##sha +rar +holiday +074 +##vg +##nos 
+##rail +gartner +gi +6p +##dium +kit +b3 +eco +sean +##stone +nu +##np +f16 +write +029 +m5 +##ias +##dk +fsm +52kb +##xxx +##cake +lim +ru +1v +##ification +published +angela +16g +analytics +##nel +gmt +##icon +##bby +ios11 +waze +9985 +##ust +##007 +delete +52sykb +wwdc +027 +##fw +1389 +##xon +brandt +##ses +##dragon +vetements +anne +monte +official +##ere +##nne +##oud +etnews +##a2 +##graphy +##rtex +##gma +mount +archive +morning +tan +ddos +e7 +day4 +factory +bruce +##ito +guest +##lling +n3 +mega +women +dac +church +##jun +singapore +##facebook +6991 +starbucks +##tos +##stin +##shine +zen +##mu +tina +request +##gence +q7 +##zzi +diary +##tore +##ead +cst +##osa +canada +va +##jiang +##lam +##nix +##sday +g6 +##master +bing +##zl +nb40 +thai +ln284ct +##itz +##2f +bonnie +##food +##lent +originals +##stro +##lts +##bscribe +ntd +yesstyle +hmv +##tment +d5 +##pn +topios9 +lifestyle +virtual +##ague +xz +##deo +muji +024 +unt +##nnis +faq1 +##ette +curry +##pop +release +##cast +073 +##ews +5c +##stle +ios7 +##ima +dog +lenovo +##r4 +013 +vornado +##desk +##ald +9595 +##van +oil +common +##jy +##lines +g7 +twice +ella +nano +belle +##mes +##self +##note +benz +##ova +##wing +kai +##hua +##rect +rainer +##unge +##0m +guestname +##uma +##kins +##zu +tokichoi +##price +##med +##mus +rmk +address +vm +openload +##group +##hin +##iginal +amg +urban +##oz +jobs +##public +##sch +##dden +##bell +hostel +##drive +##rmin +boot +##370 +##fx +##nome +##ctionary +##oman +##lish +##cr +##hm +##how +francis +c919 +b5 +evernote +##uc +##3000 +coupe +##urg +##cca +##uality +019 +##ett +##ani +##tax +##rma +leonnhurt +##jin +ict +bird +notes +##dical +##lli +result +iu +ee +smap +gopro +##last +yin +pure +32g +##dan +##rame +mama +##oot +bean +##hur +2l +bella +sync +xuite +##ground +discuz +##getrelax +##ince +##bay +##5s +apt +##pass +jing +##rix +rich +niusnews +##ello +bag +##eting +##mobile +##ience +details +universal +silver +dit +private +ddd +u11 +kanshu +##ified +fung +##nny +dx +##520 +tai +023 +##fr +##lean +##pin +##rin +ly +rick +##bility +banner +##baru +##gion +vdf +qualcomm +bear +oldid +ian +jo +##tors +population +##ernel +##mv +##bike +ww +##ager +exhibition +##del +##pods +fpx +structure +##free +##tings +kl +##rley +##copyright +##mma +orange +yoga +4l +canmake +honey +##anda +nikkie +dhl +publishing +##mall +##gnet +e88 +##dog +fishbase +### +##[ +。 +! +? +! +? +; +: +; +##, +##的 +##、 +##一 +##人 +##有 +##是 +##在 +##中 +##为 +##和 +##了 +##不 +##年 +##学 +##大 +##国 +##生 +##以 +##“ +##” +##作 +##业 +##个 +##上 +##用 +##, +##地 +##会 +##成 +##发 +##工 +##时 +##于 +##理 +##出 +##行 +##要 +##. 
+##等 +##他 +##到 +##之 +##这 +##可 +##后 +##家 +##对 +##能 +##公 +##与 +##》 +##《 +##主 +##方 +##分 +##经 +##来 +##全 +##其 +##部 +##多 +##产 +##自 +##文 +##高 +##动 +##进 +##法 +##化 +##: +##我 +##面 +##) +##( +##实 +##教 +##建 +##体 +##而 +##长 +##子 +##下 +##现 +##开 +##本 +##力 +##定 +##性 +##过 +##设 +##合 +##小 +##同 +##机 +##市 +##品 +##水 +##新 +##内 +##事 +##也 +##种 +##及 +##制 +##入 +##所 +##心 +##务 +##就 +##管 +##们 +##得 +##展 +##重 +##民 +##加 +##区 +##物 +##者 +##通 +##天 +##政 +##三 +##电 +##关 +##度 +##第 +##名 +##术 +##最 +##系 +##月 +##外 +##资 +##日 +##代 +##员 +##如 +##间 +##位 +##并 +##书 +##科 +##村 +##应 +##量 +##道 +##前 +##当 +##无 +##里 +##相 +##平 +##从 +##计 +##提 +##保 +##任 +##程 +##技 +##都 +##研 +##十 +##基 +##特 +##好 +##被 +##或 +##目 +##将 +##使 +##山 +##二 +##说 +##数 +##点 +##明 +##情 +##元 +##着 +##收 +##组 +##然 +##美 +##各 +##由 +##场 +##金 +##形 +##农 +##期 +##因 +##表 +##此 +##色 +##起 +##还 +##立 +##世 +##安 +##活 +##专 +##质 +##规 +##社 +##万 +##信 +##西 +##统 +##结 +##路 +##利 +##次 +##南 +##式 +##意 +##级 +##常 +##师 +##校 +##你 +##育 +##果 +##究 +##司 +##服 +##门 +##海 +##导 +##流 +##项 +##她 +##总 +##处 +##两 +##传 +##东 +##正 +##省 +##院 +##户 +##手 +##具 +##原 +##强 +##北 +##向 +##先 +##但 +##米 +##城 +##企 +##件 +##风 +##军 +##身 +##更 +##知 +##已 +##气 +##战 +##至 +##单 +##口 +##集 +##创 +##解 +##四 +##标 +##交 +##比 +##商 +##论 +##界 +##题 +##变 +##花 +##改 +##类 +##运 +##指 +##型 +##调 +##女 +##神 +##接 +##造 +##受 +##广 +##只 +##委 +##去 +##共 +##治 +##达 +##持 +##条 +##网 +##头 +##构 +##县 +##些 +##该 +##又 +##那 +##想 +##样 +##办 +##济 +##格 +##责 +##车 +##很 +##施 +##求 +##己 +##光 +##精 +##林 +##完 +##爱 +##线 +##参 +##少 +##积 +##清 +##看 +##优 +##报 +##王 +##直 +##没 +##每 +##据 +##游 +##效 +##感 +##五 +##影 +##别 +##获 +##领 +##称 +##选 +##供 +##乐 +##老 +##么 +##台 +##问 +##划 +##带 +##器 +##源 +##织 +##放 +##深 +##备 +##视 +##白 +##功 +##取 +##装 +##营 +##见 +##记 +##环 +##队 +##节 +##准 +##石 +##它 +##回 +##历 +##负 +##真 +##增 +##医 +##联 +##做 +##职 +##容 +##士 +##包 +##义 +##观 +##团 +##病 +##府 +##息 +##则 +##考 +##料 +##华 +##州 +##语 +##证 +##整 +##让 +##江 +##史 +##空 +##验 +##需 +##支 +##命 +##给 +##离 +##认 +##艺 +##较 +##土 +##古 +##养 +##才 +##境 +##推 +##把 +##均 +##图 +##际 +##斯 +##近 +##片 +##局 +##修 +##字 +##德 +##权 +##步 +##始 +##复 +##转 +##协 +##即 +##打 +##画 +##投 +##决 +##何 +##约 +##反 +##费 +##议 +##护 +##极 +##河 +##房 +##查 +##布 +##思 +##干 +##价 +##儿 +##非 +##马 +##党 +##奖 +##模 +##故 +##编 +##音 +##范 +##识 +##率 +##存 +##引 +##客 +##属 +##评 +##采 +##尔 +##配 +##镇 +##室 +##再 +##案 +##监 +##习 +##注 +##根 +##克 +##演 +##食 +##族 +##示 +##球 +##状 +##青 +##号 +##张 +##百 +##素 +##首 +##易 +##热 +##阳 +##今 +##园 +##防 +##版 +##太 +##乡 +##英 +##材 +##列 +##便 +##写 +##住 +##置 +##层 +##助 +##确 +##试 +##难 +##承 +##象 +##居 +##黄 +##快 +##断 +##维 +##却 +##红 +##速 +##连 +##众 +##细 +##态 +##话 +##周 +##言 +##药 +##培 +##血 +##亩 +##龙 +##越 +##值 +##几 +##边 +##读 +##未 +##曾 +##测 +##算 +##京 +##景 +##余 +##站 +##低 +##温 +##消 +##必 +##切 +##依 +##随 +##且 +##志 +##卫 +##域 +##照 +##许 +##限 +##著 +##销 +##落 +##足 +##适 +##争 +##策 +##控 +##武 +##按 +##初 +##角 +##核 +##死 +##检 +##富 +##满 +##显 +##审 +##除 +##致 +##亲 +##占 +##失 +##星 +##章 +##善 +##续 +##千 +##叶 +##火 +##副 +##告 +##段 +##什 +##声 +##终 +##况 +##走 +##木 +##益 +##戏 +##独 +##纪 +##植 +##财 +##群 +##六 +##赛 +##远 +##拉 +##亚 +##密 +##排 +##超 +##像 +##课 +##围 +##往 +##响 +##击 +##疗 +##念 +##八 +##云 +##险 +##律 +##请 +##革 +##诗 +##批 +##底 +##压 +##双 +##男 +##训 +##例 +##汉 +##升 +##拥 +##势 +##酒 +##眼 +##官 +##牌 +##油 +##曲 +##友 +##望 +##黑 +##歌 +##筑 +##础 +##香 +##仅 +##担 +##括 +##湖 +##严 +##秀 +##剧 +##九 +##举 +##执 +##充 +##兴 +##督 +##博 +##草 +##般 +##李 +##健 +##喜 +##授 +##普 +##预 +##灵 +##突 +##良 +##款 +##罗 +##微 +##七 +##录 +##朝 +##飞 +##宝 +##令 +##轻 +##劳 +##距 +##异 +##简 +##兵 +##树 +##序 +##候 +##含 +##福 +##尽 +##留 +##丰 +##旅 +##征 +##临 +##破 +##移 +##篇 +##抗 +##典 +##端 +##苏 +##奇 +##止 +##康 +##店 +##毛 +##觉 +##春 +##售 +##络 +##降 +##板 +##坚 +##母 +##讲 +##早 +##印 +##略 +##孩 +##夫 +##藏 +##铁 +##害 +##互 +##帝 +##田 +##融 +##皮 +##宗 +##岁 +##载 +##析 +##斗 +##须 
+##伤 +##介 +##另 +##半 +##班 +##馆 +##味 +##楼 +##卡 +##射 +##述 +##杀 +##波 +##绿 +##免 +##兰 +##绝 +##刻 +##短 +##察 +##输 +##择 +##综 +##杂 +##份 +##纳 +##父 +##词 +##银 +##送 +##座 +##左 +##继 +##固 +##宣 +##厂 +##肉 +##换 +##补 +##税 +##派 +##套 +##欢 +##播 +##吸 +##圆 +##攻 +##阿 +##购 +##听 +##右 +##减 +##激 +##巴 +##背 +##够 +##遇 +##智 +##玉 +##找 +##宽 +##陈 +##练 +##追 +##毕 +##彩 +##软 +##帮 +##股 +##荣 +##托 +##予 +##佛 +##堂 +##障 +##皇 +##若 +##守 +##似 +##届 +##待 +##货 +##散 +##额 +##尚 +##穿 +##丽 +##骨 +##享 +##差 +##针 +##索 +##稳 +##宁 +##贵 +##酸 +##液 +##唐 +##操 +##探 +##玩 +##促 +##笔 +##库 +##救 +##虽 +##久 +##闻 +##顶 +##床 +##港 +##鱼 +##亿 +##登 +##永 +##毒 +##桥 +##冷 +##魔 +##秘 +##陆 +##您 +##童 +##归 +##侧 +##沙 +##染 +##封 +##紧 +##松 +##川 +##刘 +##雄 +##希 +##毫 +##卷 +##某 +##季 +##菜 +##庭 +##附 +##逐 +##夜 +##宫 +##洲 +##退 +##顾 +##尼 +##胜 +##剂 +##纯 +##舞 +##遗 +##苦 +##梦 +##挥 +##航 +##愿 +##街 +##招 +##矿 +##夏 +##盖 +##献 +##怎 +##茶 +##申 +##吧 +##脑 +##亦 +##吃 +##频 +##宋 +##央 +##威 +##厚 +##块 +##冲 +##叫 +##熟 +##礼 +##厅 +##否 +##渐 +##笑 +##钱 +##钟 +##甚 +##牛 +##丝 +##靠 +##岛 +##绍 +##盘 +##缘 +##聚 +##静 +##雨 +##氏 +##圣 +##顺 +##唱 +##刊 +##阶 +##困 +##急 +##饰 +##弹 +##庄 +##既 +##野 +##阴 +##混 +##饮 +##损 +##齐 +##末 +##错 +##轮 +##宜 +##鲜 +##兼 +##敌 +##粉 +##祖 +##延 +##钢 +##辑 +##欧 +##硬 +##甲 +##诉 +##册 +##痛 +##订 +##缺 +##晚 +##衣 +##佳 +##脉 +##盛 +##乎 +##拟 +##贸 +##扩 +##船 +##仪 +##谁 +##警 +##停 +##席 +##竞 +##释 +##庆 +##汽 +##仍 +##掌 +##诸 +##仙 +##弟 +##吉 +##洋 +##奥 +##票 +##危 +##架 +##买 +##径 +##塔 +##休 +##付 +##恶 +##雷 +##怀 +##秋 +##借 +##巨 +##透 +##誉 +##厘 +##句 +##跟 +##胞 +##婚 +##幼 +##烈 +##峰 +##寻 +##君 +##汇 +##趣 +##纸 +##假 +##肥 +##患 +##杨 +##雅 +##罪 +##谓 +##亮 +##脱 +##寺 +##烟 +##判 +##绩 +##乱 +##刚 +##摄 +##洞 +##践 +##码 +##启 +##励 +##呈 +##曰 +##呢 +##符 +##哥 +##媒 +##疾 +##坐 +##雪 +##孔 +##倒 +##旧 +##菌 +##岩 +##鼓 +##亡 +##访 +##症 +##暗 +##湾 +##幸 +##池 +##讨 +##努 +##露 +##吗 +##繁 +##途 +##殖 +##败 +##蛋 +##握 +##刺 +##耕 +##洗 +##沉 +##概 +##哈 +##泛 +##凡 +##残 +##隐 +##虫 +##朋 +##虚 +##餐 +##殊 +##慢 +##询 +##蒙 +##孙 +##谈 +##鲁 +##裂 +##贴 +##污 +##漫 +##谷 +##违 +##泉 +##拿 +##森 +##横 +##扬 +##键 +##膜 +##迁 +##尤 +##涉 +##净 +##诚 +##折 +##冰 +##械 +##拍 +##梁 +##沿 +##避 +##吴 +##惊 +##犯 +##灭 +##湿 +##迷 +##姓 +##阅 +##灯 +##妇 +##触 +##冠 +##答 +##俗 +##档 +##尊 +##谢 +##措 +##筹 +##竟 +##韩 +##签 +##剑 +##鉴 +##灾 +##贯 +##迹 +##洛 +##沟 +##束 +##翻 +##巧 +##坏 +##弱 +##零 +##壁 +##枝 +##映 +##恩 +##抓 +##屋 +##呼 +##脚 +##绘 +##淡 +##辖 +##伊 +##粒 +##欲 +##震 +##伯 +##私 +##蓝 +##甘 +##储 +##胡 +##卖 +##梅 +##耳 +##疑 +##润 +##伴 +##泽 +##牧 +##烧 +##尾 +##累 +##糖 +##怪 +##唯 +##莫 +##粮 +##柱 +##竹 +##灰 +##岸 +##缩 +##井 +##伦 +##柔 +##盟 +##珠 +##丹 +##皆 +##哪 +##迎 +##颜 +##衡 +##啊 +##塑 +##寒 +##紫 +##镜 +##氧 +##误 +##伍 +##彻 +##刀 +##览 +##炎 +##津 +##耐 +##秦 +##尖 +##潮 +##描 +##浓 +##召 +##禁 +##阻 +##胶 +##译 +##腹 +##泰 +##乃 +##盐 +##潜 +##鸡 +##诺 +##遍 +##纹 +##冬 +##牙 +##麻 +##辅 +##猪 +##弃 +##楚 +##羊 +##晋 +##鸟 +##赵 +##洁 +##谋 +##隆 +##滑 +##籍 +##臣 +##朱 +##泥 +##墨 +##辆 +##墙 +##浪 +##姐 +##赏 +##纵 +##拔 +##倍 +##纷 +##摩 +##壮 +##苗 +##偏 +##塞 +##贡 +##仁 +##宇 +##卵 +##瓦 +##枪 +##覆 +##殿 +##刑 +##贫 +##妈 +##幅 +##幕 +##忆 +##丁 +##估 +##废 +##萨 +##舍 +##详 +##旗 +##岗 +##洪 +##贝 +##迅 +##凭 +##勇 +##雕 +##奏 +##旋 +##杰 +##煤 +##阵 +##乘 +##溪 +##奉 +##畜 +##挑 +##昌 +##硕 +##庙 +##惠 +##薄 +##逃 +##爆 +##哲 +##浙 +##珍 +##炼 +##栏 +##暴 +##币 +##隔 +##吨 +##倾 +##嘉 +##址 +##陶 +##绕 +##诊 +##遭 +##桃 +##魂 +##兽 +##豆 +##闲 +##箱 +##拓 +##燃 +##裁 +##晶 +##掉 +##脂 +##溶 +##顿 +##肤 +##虑 +##鬼 +##灌 +##徐 +##龄 +##陵 +##恋 +##侵 +##坡 +##寿 +##勤 +##磨 +##妹 +##瑞 +##缓 +##轴 +##麦 +##羽 +##咨 +##凝 +##默 +##驻 +##敢 +##债 +##浮 +##幻 +##株 +##浅 +##敬 +##敏 +##陷 +##凤 +##坛 +##虎 +##乌 +##铜 +##御 +##乳 +##讯 +##循 +##圈 +##肌 +##妙 +##奋 +##忘 +##闭 +##墓 +##汤 +##忠 +##跨 +##怕 +##振 +##宾 +##跑 +##屏 +##坦 +##粗 +##租 +##悲 +##伟 +##拜 +##妻 +##赞 +##兄 +##宿 +##碑 +##貌 +##勒 +##罚 +##夺 +##偶 +##截 +##纤 +##齿 +##郑 +##聘 +##偿 +##扶 +##豪 +##慧 +##跳 +##疏 +##莱 +##腐 +##插 +##恐 +##郎 +##辞 +##挂 
+##娘 +##肿 +##徒 +##伏 +##磁 +##杯 +##丛 +##旨 +##琴 +##炮 +##醒 +##砖 +##替 +##辛 +##暖 +##锁 +##杜 +##肠 +##孤 +##饭 +##脸 +##邮 +##贷 +##俄 +##毁 +##荷 +##谐 +##荒 +##肝 +##链 +##尺 +##尘 +##援 +##疫 +##崇 +##恢 +##扎 +##伸 +##幽 +##抵 +##胸 +##谱 +##舒 +##迫 +##畅 +##泡 +##岭 +##喷 +##窗 +##捷 +##宏 +##肯 +##狂 +##铺 +##骑 +##抽 +##券 +##俱 +##徽 +##胆 +##碎 +##邀 +##褐 +##斤 +##涂 +##赋 +##署 +##颗 +##渠 +##仿 +##迪 +##炉 +##辉 +##涵 +##耗 +##返 +##邻 +##斑 +##董 +##魏 +##午 +##娱 +##浴 +##尿 +##曼 +##锅 +##柳 +##舰 +##搭 +##旁 +##宅 +##趋 +##凉 +##赢 +##伙 +##爷 +##廷 +##戴 +##壤 +##奶 +##页 +##玄 +##驾 +##阔 +##轨 +##朗 +##捕 +##肾 +##稿 +##惯 +##侯 +##乙 +##渡 +##稍 +##恨 +##脏 +##姆 +##腔 +##抱 +##杆 +##垂 +##赴 +##赶 +##莲 +##辽 +##荐 +##旦 +##妖 +##稀 +##驱 +##沈 +##役 +##晓 +##亭 +##仲 +##澳 +##炸 +##绪 +##陕 +##恒 +##堡 +##纠 +##仇 +##懂 +##焦 +##搜 +##忍 +##贤 +##添 +##艾 +##赤 +##犹 +##尝 +##锦 +##稻 +##撰 +##填 +##衰 +##栽 +##邪 +##粘 +##跃 +##桌 +##胃 +##悬 +##翼 +##彼 +##睡 +##曹 +##刷 +##摆 +##悉 +##锋 +##摇 +##抢 +##乏 +##廉 +##鼠 +##盾 +##瓷 +##抑 +##埃 +##邦 +##遂 +##寸 +##渔 +##祥 +##胎 +##牵 +##壳 +##甜 +##卓 +##瓜 +##袭 +##遵 +##巡 +##逆 +##玛 +##韵 +##桑 +##酷 +##赖 +##桂 +##郡 +##肃 +##仓 +##寄 +##塘 +##瘤 +##碳 +##搞 +##燕 +##蒸 +##允 +##忽 +##斜 +##穷 +##郁 +##囊 +##奔 +##昆 +##盆 +##愈 +##递 +##黎 +##祭 +##怒 +##辈 +##腺 +##滚 +##暂 +##郭 +##璃 +##踪 +##芳 +##碍 +##肺 +##狱 +##冒 +##阁 +##砂 +##苍 +##揭 +##踏 +##颇 +##柄 +##闪 +##孝 +##葡 +##腾 +##茎 +##鸣 +##撤 +##仰 +##伐 +##丘 +##於 +##泪 +##荡 +##扰 +##纲 +##拼 +##欣 +##纽 +##癌 +##堆 +##菲 +##披 +##挖 +##寓 +##履 +##捐 +##悟 +##乾 +##嘴 +##钻 +##拳 +##吹 +##柏 +##遥 +##抚 +##忧 +##赠 +##霸 +##艰 +##淋 +##猫 +##帅 +##奈 +##寨 +##滴 +##鼻 +##掘 +##狗 +##驶 +##朴 +##拆 +##惜 +##玻 +##扣 +##萄 +##蔬 +##宠 +##缴 +##赫 +##凯 +##滨 +##乔 +##腰 +##葬 +##孟 +##吾 +##枚 +##圳 +##忙 +##扫 +##杭 +##凌 +##梯 +##丈 +##隶 +##剪 +##盗 +##擅 +##疆 +##弯 +##携 +##拒 +##秒 +##颁 +##醇 +##割 +##浆 +##姑 +##爸 +##螺 +##穗 +##缝 +##慈 +##喝 +##瓶 +##漏 +##悠 +##猎 +##番 +##孕 +##伪 +##漂 +##腿 +##吐 +##坝 +##滤 +##函 +##匀 +##偷 +##浩 +##矛 +##僧 +##辨 +##俊 +##棉 +##铸 +##诞 +##丧 +##夹 +##姿 +##睛 +##淮 +##阀 +##姜 +##尸 +##猛 +##芽 +##账 +##旱 +##醉 +##弄 +##坊 +##烤 +##萧 +##矣 +##雾 +##倡 +##榜 +##弗 +##氨 +##朵 +##锡 +##袋 +##拨 +##湘 +##岳 +##烦 +##肩 +##熙 +##炭 +##婆 +##棋 +##禅 +##穴 +##宙 +##汗 +##艳 +##儒 +##叙 +##晨 +##颈 +##峡 +##拖 +##烂 +##茂 +##戒 +##飘 +##氛 +##蒂 +##撞 +##瓣 +##箭 +##叛 +##鞋 +##劲 +##祝 +##娜 +##饲 +##侍 +##诱 +##叹 +##卢 +##弥 +##鼎 +##厦 +##屈 +##慕 +##魅 +##厨 +##嫁 +##绵 +##逼 +##扮 +##叔 +##酶 +##燥 +##狼 +##滋 +##汁 +##辐 +##怨 +##翅 +##佩 +##坑 +##旬 +##沃 +##剩 +##蛇 +##颖 +##篮 +##锐 +##侠 +##匹 +##唤 +##熊 +##漠 +##迟 +##敦 +##雌 +##谨 +##婴 +##浸 +##磷 +##筒 +##滩 +##埋 +##框 +##弘 +##吕 +##碰 +##纺 +##硫 +##堪 +##契 +##蜜 +##蓄 +##阐 +##傲 +##碱 +##晰 +##狭 +##撑 +##叉 +##卧 +##劫 +##闹 +##赐 +##邓 +##奴 +##溉 +##浦 +##蹈 +##辣 +##遣 +##耀 +##耶 +##翠 +##叠 +##迈 +##霍 +##碧 +##恰 +##脊 +##昭 +##摸 +##饱 +##赔 +##泄 +##哭 +##讼 +##逝 +##逻 +##廊 +##擦 +##渗 +##彰 +##卿 +##旺 +##宪 +##顷 +##妆 +##陪 +##葛 +##仔 +##淀 +##翰 +##悦 +##穆 +##煮 +##辩 +##弦 +##串 +##押 +##蚀 +##逢 +##贺 +##焊 +##煌 +##缔 +##惑 +##鹿 +##袁 +##糊 +##逸 +##舟 +##勃 +##侦 +##涯 +##蔡 +##辟 +##涌 +##枯 +##痕 +##疼 +##莉 +##柴 +##眉 +##罢 +##催 +##衔 +##秉 +##妃 +##鸿 +##傅 +##辰 +##聪 +##咸 +##扇 +##盈 +##勘 +##佐 +##泊 +##抛 +##搬 +##牢 +##宴 +##牲 +##贾 +##摘 +##姻 +##慎 +##帕 +##忌 +##卒 +##夕 +##卜 +##惟 +##挺 +##崖 +##炒 +##爵 +##冻 +##椒 +##鳞 +##祸 +##潭 +##腊 +##蒋 +##缠 +##寂 +##眠 +##冯 +##芯 +##槽 +##吊 +##聊 +##梗 +##嫩 +##凶 +##铭 +##爽 +##筋 +##韦 +##脾 +##铝 +##肢 +##栋 +##勾 +##萌 +##渊 +##掩 +##狮 +##撒 +##漆 +##骗 +##禽 +##蕴 +##坪 +##洒 +##冶 +##兹 +##椭 +##喻 +##泵 +##哀 +##翔 +##棒 +##芝 +##扑 +##毅 +##衍 +##惨 +##疯 +##欺 +##贼 +##肖 +##轰 +##巢 +##臂 +##轩 +##扁 +##淘 +##犬 +##宰 +##祠 +##挡 +##厌 +##帐 +##蜂 +##狐 +##垃 +##昂 +##圾 +##秩 +##芬 +##瞬 +##枢 +##舌 +##唇 +##棕 +##霞 +##霜 +##艇 +##侨 +##鹤 +##硅 +##靖 +##哦 +##削 +##泌 +##奠 +##吏 +##夷 +##咖 +##彭 +##窑 +##胁 +##肪 +##贞 +##劝 +##钙 +##柜 +##鸭 +##庞 +##兔 +##荆 +##丙 +##纱 +##戈 +##藤 +##矩 +##泳 +##惧 +##铃 +##渴 
+##胀 +##袖 +##丸 +##狠 +##豫 +##茫 +##浇 +##菩 +##氯 +##啡 +##葱 +##梨 +##霉 +##脆 +##氢 +##巷 +##丑 +##娃 +##锻 +##愤 +##贪 +##蝶 +##厉 +##闽 +##浑 +##斩 +##栖 +##茅 +##昏 +##龟 +##碗 +##棚 +##滞 +##慰 +##斋 +##虹 +##屯 +##萝 +##饼 +##窄 +##潘 +##绣 +##丢 +##芦 +##鳍 +##裕 +##誓 +##腻 +##锈 +##吞 +##蜀 +##啦 +##扭 +##巩 +##髓 +##劣 +##拌 +##谊 +##涛 +##勋 +##郊 +##莎 +##痴 +##窝 +##驰 +##跌 +##笼 +##挤 +##溢 +##隙 +##鹰 +##诏 +##帽 +##芒 +##爬 +##凸 +##牺 +##熔 +##吻 +##竭 +##瘦 +##冥 +##搏 +##屡 +##昔 +##萼 +##愁 +##捉 +##翁 +##怖 +##汪 +##烯 +##疲 +##缸 +##溃 +##泼 +##剖 +##涨 +##橡 +##谜 +##悔 +##嫌 +##盒 +##苯 +##凹 +##绳 +##畏 +##罐 +##虾 +##柯 +##邑 +##馨 +##兆 +##帖 +##陌 +##禄 +##垫 +##壶 +##逊 +##骤 +##祀 +##晴 +##蓬 +##苞 +##煎 +##菊 +##堤 +##甫 +##拱 +##氮 +##罕 +##舶 +##伞 +##姚 +##弓 +##嵌 +##馈 +##琼 +##噪 +##雀 +##呵 +##汝 +##焉 +##陀 +##胺 +##惩 +##沼 +##枣 +##桐 +##酱 +##遮 +##孢 +##钝 +##呀 +##锥 +##妥 +##酿 +##巫 +##闯 +##沧 +##崩 +##蕊 +##酬 +##匠 +##躲 +##喊 +##琳 +##绎 +##喉 +##凰 +##抬 +##膨 +##盲 +##剥 +##喂 +##庸 +##奸 +##钩 +##冈 +##募 +##苑 +##杏 +##杉 +##辱 +##隋 +##薪 +##绒 +##欠 +##尉 +##攀 +##抹 +##巾 +##渣 +##苹 +##猴 +##悄 +##屠 +##颂 +##湛 +##魄 +##颠 +##呆 +##粤 +##岂 +##娇 +##暑 +##鹅 +##筛 +##膏 +##樱 +##缆 +##襄 +##瑟 +##恭 +##泻 +##匪 +##兮 +##恼 +##吟 +##仕 +##蔽 +##骄 +##蚕 +##斥 +##椅 +##姬 +##谦 +##椎 +##搅 +##卸 +##沫 +##怜 +##坎 +##瑰 +##钦 +##拾 +##厕 +##後 +##逾 +##薯 +##衬 +##钾 +##崔 +##稽 +##蛮 +##殷 +##晒 +##菇 +##臭 +##弧 +##擎 +##粹 +##纬 +##焰 +##玲 +##竣 +##咒 +##歇 +##糕 +##诵 +##茨 +##妮 +##酯 +##麟 +##卑 +##浏 +##咽 +##罩 +##舱 +##酵 +##晕 +##顽 +##赁 +##咬 +##枫 +##冀 +##贮 +##艘 +##亏 +##薛 +##瀑 +##篆 +##膀 +##沸 +##雍 +##咳 +##尹 +##愉 +##烹 +##坠 +##勿 +##钠 +##坤 +##甸 +##墅 +##闸 +##藻 +##韧 +##鄂 +##瑶 +##舆 +##夸 +##蕾 +##栗 +##咏 +##丞 +##抄 +##鹏 +##弊 +##檐 +##骂 +##仆 +##峻 +##爪 +##赚 +##帆 +##娶 +##嘛 +##钓 +##澄 +##猜 +##裔 +##抒 +##铅 +##卉 +##彦 +##删 +##衷 +##禹 +##寡 +##蒲 +##砌 +##棱 +##拘 +##堵 +##雁 +##仄 +##荫 +##祈 +##奢 +##赌 +##寇 +##隧 +##摊 +##雇 +##卦 +##婉 +##敲 +##挣 +##皱 +##虞 +##亨 +##懈 +##挽 +##珊 +##饶 +##滥 +##锯 +##闷 +##酮 +##虐 +##兑 +##僵 +##傻 +##沦 +##巅 +##鞭 +##梳 +##赣 +##锌 +##庐 +##薇 +##庵 +##慨 +##肚 +##妄 +##仗 +##绑 +##枕 +##牡 +##胖 +##沪 +##垒 +##捞 +##捧 +##竖 +##蜡 +##桩 +##厢 +##孵 +##黏 +##拯 +##谭 +##诈 +##灿 +##釉 +##裹 +##钮 +##俩 +##灶 +##彝 +##蟹 +##涩 +##醋 +##匙 +##歧 +##刹 +##玫 +##棘 +##橙 +##凑 +##桶 +##刃 +##伽 +##硝 +##怡 +##籽 +##敞 +##淳 +##矮 +##镶 +##戚 +##幢 +##涡 +##尧 +##膝 +##哉 +##肆 +##畔 +##溯 +##媚 +##烘 +##窃 +##焚 +##澜 +##愚 +##棵 +##乞 +##佑 +##暨 +##敷 +##饥 +##俯 +##蔓 +##暮 +##砍 +##邵 +##仑 +##毗 +##剿 +##馀 +##锤 +##刮 +##梭 +##摧 +##掠 +##躯 +##诡 +##匈 +##侣 +##胚 +##疮 +##裙 +##裸 +##塌 +##吓 +##俘 +##糙 +##藩 +##楷 +##羞 +##鲍 +##帘 +##裤 +##宛 +##憾 +##桓 +##痰 +##寞 +##骚 +##惹 +##笋 +##萃 +##栓 +##挫 +##矢 +##垦 +##垄 +##绸 +##凄 +##镀 +##熏 +##钉 +##粪 +##缅 +##洽 +##鞘 +##蔗 +##迄 +##沐 +##凿 +##勉 +##昨 +##喘 +##爹 +##屑 +##耻 +##沥 +##庶 +##涅 +##腕 +##袍 +##懒 +##阜 +##嗜 +##朔 +##蒜 +##沛 +##坟 +##轿 +##喀 +##笛 +##狄 +##饿 +##蓉 +##泣 +##窟 +##豹 +##屿 +##崛 +##迦 +##诠 +##贬 +##腥 +##钥 +##嗣 +##瑜 +##倦 +##萎 +##拦 +##冤 +##讽 +##潇 +##谣 +##趁 +##妨 +##贩 +##萍 +##窦 +##纂 +##缀 +##矫 +##淑 +##墩 +##梵 +##沾 +##淫 +##乖 +##汰 +##莞 +##旷 +##浊 +##挚 +##撼 +##氟 +##焕 +##庚 +##掀 +##诀 +##盼 +##疹 +##窖 +##匆 +##厥 +##轧 +##淹 +##亥 +##鸦 +##棍 +##谅 +##歼 +##汕 +##挪 +##蚁 +##敛 +##魁 +##畴 +##炫 +##丫 +##奎 +##菱 +##沂 +##撕 +##阎 +##詹 +##蛛 +##靡 +##瞻 +##咱 +##愧 +##烷 +##畸 +##灸 +##眸 +##觅 +##芜 +##廓 +##斌 +##躁 +##麓 +##摔 +##烛 +##睹 +##孜 +##缚 +##堕 +##昼 +##睿 +##琪 +##琉 +##贱 +##渝 +##跋 +##茄 +##舜 +##诛 +##捣 +##芙 +##倚 +##酰 +##澈 +##慌 +##帜 +##颤 +##陇 +##颌 +##昧 +##佣 +##眷 +##徙 +##禾 +##逮 +##莹 +##碟 +##梢 +##朽 +##粥 +##喇 +##榆 +##驳 +##楔 +##啸 +##肋 +##踢 +##傍 +##桔 +##肴 +##呕 +##旭 +##埠 +##贿 +##曝 +##杖 +##俭 +##栩 +##斧 +##镁 +##匾 +##踩 +##橘 +##颅 +##囚 +##蛙 +##膳 +##坞 +##琐 +##荧 +##瘟 +##涤 +##胰 +##衫 +##噬 +##皖 +##邱 +##埔 +##汀 +##羡 +##睐 +##葵 +##耿 +##糟 +##厄 +##秧 +##黔 +##蹄 +##漳 +##鞍 +##谏 +##腋 +##簇 +##梧 +##戎 +##榴 +##诣 +##宦 +##苔 +##揽 +##簧 +##狸 +##阙 +##扯 
+##耍 +##棠 +##脓 +##烫 +##翘 +##芭 +##躺 +##羁 +##藉 +##拐 +##陡 +##漓 +##棺 +##钧 +##琅 +##扔 +##寝 +##绚 +##熬 +##驿 +##邹 +##杠 +##绥 +##窥 +##晃 +##渭 +##樊 +##鑫 +##祁 +##陋 +##哺 +##堰 +##祛 +##梓 +##崎 +##孽 +##蝴 +##蔚 +##抖 +##苟 +##肇 +##溜 +##绅 +##妾 +##跪 +##沁 +##莽 +##虏 +##瞄 +##砸 +##稚 +##僚 +##崭 +##迭 +##皂 +##彬 +##雏 +##羲 +##缕 +##绞 +##俞 +##簿 +##耸 +##廖 +##嘲 +##翌 +##榄 +##裴 +##槐 +##洼 +##睁 +##灼 +##啤 +##臀 +##啥 +##濒 +##醛 +##峨 +##葫 +##悍 +##笨 +##嘱 +##稠 +##韶 +##陛 +##峭 +##酚 +##翩 +##舅 +##寅 +##蕉 +##阮 +##垣 +##戮 +##趾 +##犀 +##巍 +##霄 +##饪 +##秆 +##朕 +##驼 +##肛 +##揉 +##楠 +##岚 +##疡 +##帧 +##柑 +##赎 +##逍 +##滇 +##璋 +##礁 +##黛 +##钞 +##邢 +##涧 +##劈 +##瞳 +##砚 +##驴 +##锣 +##恳 +##栅 +##吵 +##牟 +##沌 +##瞩 +##咪 +##毯 +##炳 +##淤 +##盯 +##芋 +##粟 +##栈 +##戊 +##盏 +##峪 +##拂 +##暇 +##酥 +##汛 +##嚣 +##轼 +##妒 +##匿 +##鸽 +##蝉 +##痒 +##宵 +##瘫 +##璧 +##汲 +##冢 +##碌 +##琢 +##磅 +##卤 +##剔 +##谎 +##圩 +##酌 +##捏 +##渺 +##媳 +##穹 +##谥 +##骏 +##哨 +##骆 +##乒 +##摹 +##兜 +##柿 +##喧 +##呜 +##捡 +##橄 +##逗 +##瑚 +##呐 +##檀 +##辜 +##妊 +##祯 +##苷 +##衙 +##笃 +##芸 +##霖 +##荔 +##闺 +##羌 +##芹 +##哼 +##糯 +##吼 +##蕃 +##嵩 +##矶 +##绽 +##坯 +##娠 +##祷 +##锰 +##瘀 +##岐 +##茵 +##筝 +##斐 +##肽 +##歉 +##嗽 +##恤 +##汶 +##聂 +##樟 +##擒 +##鹃 +##拙 +##鲤 +##絮 +##鄙 +##彪 +##嗓 +##墟 +##骼 +##渤 +##僻 +##豁 +##谕 +##荟 +##姨 +##婷 +##挠 +##哇 +##炙 +##诅 +##娥 +##哑 +##阱 +##嫉 +##圭 +##乓 +##橱 +##歪 +##禧 +##甩 +##坷 +##晏 +##驯 +##讳 +##泗 +##煞 +##淄 +##倪 +##妓 +##窍 +##竿 +##襟 +##匡 +##钛 +##侈 +##侄 +##铲 +##哮 +##厩 +##亢 +##辕 +##瘾 +##辊 +##狩 +##掷 +##潍 +##伺 +##嘿 +##弈 +##嘎 +##陨 +##娅 +##昊 +##犁 +##屁 +##蜘 +##寥 +##滕 +##毙 +##涝 +##谛 +##郝 +##痹 +##溺 +##汾 +##脐 +##馅 +##蠢 +##珀 +##腌 +##扼 +##敕 +##莓 +##峦 +##铬 +##谍 +##炬 +##龚 +##麒 +##睦 +##磺 +##吁 +##掺 +##烁 +##靶 +##圃 +##饵 +##褶 +##娟 +##滔 +##挨 +##褒 +##胱 +##晖 +##脖 +##垢 +##抉 +##冉 +##茧 +##渲 +##癫 +##悼 +##嫂 +##瞒 +##纶 +##肘 +##炖 +##瀚 +##皋 +##姊 +##颐 +##俏 +##颊 +##讶 +##札 +##奕 +##磊 +##镖 +##遐 +##眺 +##腑 +##琦 +##蚊 +##窜 +##渍 +##嗯 +##夯 +##笙 +##蘑 +##翡 +##碘 +##卯 +##啼 +##靓 +##辍 +##莺 +##躬 +##猿 +##杞 +##眩 +##虔 +##凋 +##遁 +##泾 +##岔 +##羟 +##弛 +##娄 +##茸 +##皓 +##峙 +##逅 +##邂 +##苇 +##楹 +##蹲 +##拢 +##甄 +##鳃 +##邯 +##捆 +##勺 +##酉 +##荚 +##唑 +##臻 +##辗 +##绰 +##徊 +##榨 +##苛 +##赦 +##盔 +##壬 +##恍 +##缉 +##熨 +##澡 +##桨 +##匣 +##兢 +##驭 +##镍 +##孰 +##绮 +##馏 +##蝇 +##佼 +##鲸 +##哎 +##裳 +##蜕 +##嚼 +##嘻 +##庇 +##绢 +##倩 +##钵 +##恪 +##帷 +##莆 +##柠 +##藕 +##砾 +##绊 +##喙 +##坂 +##徘 +##荀 +##瞧 +##蛾 +##晦 +##铎 +##紊 +##锚 +##酪 +##稷 +##聋 +##闵 +##熹 +##冕 +##诫 +##珑 +##曦 +##篷 +##迥 +##蘖 +##胤 +##檬 +##瑾 +##钳 +##遏 +##辄 +##嬉 +##隅 +##秃 +##帛 +##聆 +##芥 +##诬 +##挟 +##宕 +##鹊 +##琶 +##膛 +##兀 +##懿 +##碾 +##叮 +##蠕 +##譬 +##缮 +##烽 +##妍 +##榕 +##邃 +##焙 +##倘 +##戌 +##茹 +##豚 +##晾 +##浒 +##玺 +##醚 +##祐 +##炽 +##缪 +##凛 +##噩 +##溅 +##毋 +##槛 +##嫡 +##蝠 +##娴 +##稣 +##禀 +##壑 +##殆 +##敖 +##倭 +##挛 +##侃 +##蚌 +##咀 +##盎 +##殉 +##岑 +##浚 +##谬 +##狡 +##癸 +##逛 +##耽 +##俺 +##璨 +##巳 +##茜 +##郸 +##蒴 +##琵 +##叩 +##泸 +##塾 +##稼 +##侮 +##锂 +##曙 +##薰 +##婿 +##惶 +##拭 +##篱 +##恬 +##淌 +##烙 +##袜 +##徵 +##慷 +##夭 +##噶 +##莘 +##鸳 +##殡 +##蚂 +##憎 +##喃 +##佚 +##龛 +##潢 +##烃 +##岱 +##潺 +##衢 +##璀 +##鹭 +##揣 +##痢 +##厮 +##氓 +##怠 +##痘 +##硒 +##镌 +##乍 +##咯 +##惬 +##桦 +##骇 +##枉 +##蜗 +##睾 +##淇 +##耘 +##娓 +##弼 +##鳌 +##嗅 +##狙 +##箫 +##朦 +##椰 +##胥 +##丐 +##陂 +##唾 +##鳄 +##柚 +##谒 +##戍 +##刁 +##鸾 +##缭 +##骸 +##铣 +##酋 +##蝎 +##掏 +##耦 +##怯 +##娲 +##拇 +##汹 +##胧 +##疤 +##硼 +##恕 +##哗 +##眶 +##痫 +##凳 +##鲨 +##擢 +##歹 +##樵 +##瘠 +##茗 +##翟 +##黯 +##蜒 +##壹 +##殇 +##伶 +##辙 +##瑕 +##町 +##孚 +##痉 +##铵 +##搁 +##漾 +##戟 +##镰 +##鸯 +##猩 +##蔷 +##缤 +##叭 +##垩 +##曳 +##奚 +##毓 +##颓 +##汐 +##靴 +##傣 +##尬 +##濮 +##赂 +##媛 +##懦 +##扦 +##韬 +##戳 +##雯 +##蜿 +##笺 +##裘 +##尴 +##侗 +##钨 +##苓 +##寰 +##蛊 +##扳 +##搓 +##涟 +##睫 +##淬 +##赈 +##恺 +##瞎 +##蝙 +##枸 +##萱 +##颚 +##憩 +##秽 +##秸 +##拷 +##阑 +##貂 +##粱 +##煲 +##隘 +##暧 +##惕 +##沽 +##菠 +##趟 +##磋 +##偕 +##涕 +##邸 +##踞 +##惫 +##阪 +##鞠 +##饺 +##汞 
+##颍 +##氰 +##屹 +##蛟 +##跻 +##哟 +##臼 +##熄 +##绛 +##弩 +##褪 +##渎 +##亟 +##匮 +##撇 +##霆 +##攒 +##舵 +##扛 +##彤 +##蛤 +##婢 +##偃 +##胫 +##姥 +##睑 +##诙 +##诲 +##锭 +##悚 +##扒 +##洱 +##劾 +##惰 +##篡 +##瓯 +##徇 +##铀 +##骋 +##筷 +##渚 +##踵 +##俨 +##榻 +##糜 +##捻 +##釜 +##哩 +##萤 +##蛹 +##隽 +##垮 +##鸠 +##鸥 +##漕 +##瑙 +##礴 +##憧 +##殴 +##潼 +##悯 +##砺 +##拽 +##钗 +##酣 +##镂 +##膺 +##楞 +##竺 +##迂 +##嫣 +##忱 +##哄 +##疣 +##鹦 +##枭 +##憬 +##疱 +##婪 +##沮 +##怅 +##筱 +##扉 +##瞰 +##旌 +##蔑 +##铠 +##瀛 +##琥 +##懵 +##谴 +##捍 +##蟾 +##漩 +##拣 +##汴 +##刨 +##叱 +##曜 +##妞 +##澎 +##镑 +##翎 +##瞪 +##倔 +##芍 +##璞 +##瓮 +##驹 +##芷 +##寐 +##擂 +##丕 +##蟠 +##诃 +##悸 +##亘 +##溴 +##宸 +##廿 +##恃 +##棣 +##荼 +##筠 +##羚 +##慑 +##唉 +##纣 +##麼 +##蹦 +##锄 +##淆 +##甙 +##蚜 +##椿 +##禺 +##绯 +##冗 +##葩 +##厝 +##媲 +##蒿 +##痪 +##菁 +##炊 +##俑 +##讥 +##桀 +##祺 +##吡 +##迩 +##箔 +##皿 +##缎 +##萦 +##剃 +##霓 +##酝 +##诰 +##茉 +##飙 +##湍 +##蜥 +##箕 +##蘸 +##柬 +##韭 +##溥 +##熠 +##鹉 +##咐 +##剌 +##悖 +##瞿 +##槟 +##娩 +##闾 +##遴 +##咫 +##孺 +##彷 +##茬 +##蓟 +##憨 +##袅 +##佬 +##炯 +##啶 +##昙 +##蚩 +##痔 +##蕨 +##瓢 +##夔 +##毡 +##赃 +##鳖 +##沅 +##饷 +##臧 +##掖 +##褚 +##羹 +##勐 +##谚 +##畦 +##眨 +##贻 +##攸 +##涎 +##弑 +##咎 +##铂 +##瑛 +##矗 +##虱 +##秤 +##谟 +##漱 +##俸 +##夙 +##雉 +##螨 +##恣 +##斛 +##谙 +##隍 +##奄 +##壕 +##髻 +##鄱 +##嘶 +##磕 +##濡 +##赘 +##荞 +##讹 +##猕 +##痞 +##鬓 +##铮 +##腱 +##幡 +##榭 +##爻 +##涓 +##晤 +##咕 +##惭 +##钼 +##匕 +##撮 +##庾 +##笠 +##窘 +##癖 +##垛 +##窒 +##畲 +##甬 +##彗 +##缨 +##湮 +##寮 +##衅 +##谪 +##绫 +##兖 +##疽 +##磐 +##菏 +##沱 +##骁 +##嫔 +##盂 +##娆 +##钊 +##蟒 +##忏 +##谤 +##晟 +##痈 +##耆 +##谧 +##簪 +##疟 +##扈 +##脍 +##琛 +##咋 +##胄 +##葆 +##轶 +##桢 +##攘 +##邕 +##拧 +##茯 +##摒 +##傀 +##祚 +##嘟 +##帼 +##筵 +##馒 +##疚 +##璇 +##砧 +##槃 +##犷 +##腓 +##煜 +##弋 +##疸 +##濑 +##麝 +##嗟 +##忻 +##愣 +##斓 +##吝 +##咧 +##矾 +##愫 +##漪 +##珂 +##逞 +##糠 +##璐 +##藓 +##昕 +##妩 +##屌 +##疵 +##嘘 +##袂 +##稃 +##剁 +##侏 +##掐 +##猾 +##匍 +##坳 +##黜 +##邺 +##闫 +##猥 +##湃 +##斟 +##癣 +##匐 +##粳 +##叟 +##俾 +##儡 +##莒 +##骥 +##跤 +##耙 +##矜 +##翱 +##赡 +##浣 +##栾 +##拈 +##螟 +##桧 +##坍 +##睢 +##趴 +##伎 +##婺 +##霹 +##痊 +##膊 +##眯 +##豌 +##驮 +##骈 +##嶂 +##淞 +##腮 +##髅 +##炀 +##啄 +##亳 +##麾 +##筐 +##叨 +##徨 +##跷 +##楂 +##郴 +##绶 +##羔 +##咤 +##靳 +##屎 +##雳 +##瘘 +##蹬 +##惮 +##涪 +##阖 +##煽 +##蹊 +##栉 +##俟 +##涸 +##辫 +##锢 +##佟 +##皎 +##啮 +##钰 +##螂 +##啪 +##绷 +##闰 +##畿 +##覃 +##惘 +##贰 +##碉 +##卞 +##酐 +##枷 +##葺 +##芪 +##蕙 +##咚 +##籁 +##钴 +##冽 +##玮 +##骷 +##啃 +##焖 +##猝 +##榈 +##滁 +##拮 +##跗 +##讷 +##蝗 +##蠡 +##烨 +##脯 +##歙 +##泠 +##刍 +##掳 +##僳 +##螯 +##胳 +##髦 +##粽 +##戾 +##祜 +##岷 +##懋 +##馥 +##昵 +##踊 +##湄 +##郢 +##斡 +##迢 +##嗪 +##裨 +##羧 +##膈 +##翊 +##鲫 +##螃 +##沓 +##疝 +##笈 +##榔 +##诘 +##颉 +##蛀 +##鸢 +##焯 +##囧 +##梆 +##潞 +##戛 +##佗 +##艮 +##霾 +##鬟 +##玖 +##腭 +##喔 +##罔 +##佥 +##粑 +##舷 +##泯 +##泓 +##炜 +##谗 +##烬 +##跆 +##傩 +##飓 +##浔 +##钤 +##惚 +##胭 +##踝 +##镯 +##臆 +##蜚 +##揪 +##觞 +##皈 +##迸 +##匝 +##筏 +##醴 +##黍 +##洮 +##滦 +##侬 +##甾 +##澧 +##阈 +##袱 +##迤 +##衮 +##濂 +##娑 +##砥 +##砷 +##铨 +##缜 +##箴 +##逵 +##猖 +##蛰 +##箍 +##侥 +##搂 +##纨 +##裱 +##枋 +##嫦 +##敝 +##挝 +##贲 +##潦 +##撩 +##惺 +##铰 +##忒 +##咆 +##哆 +##莅 +##炕 +##抨 +##涿 +##龈 +##猷 +##遒 +##缥 +##捂 +##俐 +##瘙 +##搐 +##牍 +##馍 +##痿 +##袤 +##峥 +##栎 +##罹 +##燎 +##喵 +##璜 +##飒 +##蔼 +##珞 +##澹 +##奘 +##岖 +##芡 +##簸 +##杵 +##甥 +##骊 +##悴 +##惆 +##殃 +##呃 +##祗 +##髋 +##幔 +##榛 +##犊 +##霁 +##芮 +##牒 +##佰 +##狈 +##薨 +##吩 +##鳝 +##嵘 +##濠 +##呤 +##纫 +##檄 +##浜 +##缙 +##缢 +##煦 +##蓦 +##揖 +##拴 +##缈 +##褥 +##铿 +##燮 +##锵 +##荥 +##忿 +##僖 +##婶 +##芾 +##镐 +##痣 +##眈 +##祇 +##邈 +##翳 +##碣 +##遨 +##鳗 +##诂 +##岫 +##焘 +##茱 +##洵 +##晁 +##噢 +##偈 +##旖 +##蚓 +##柘 +##珐 +##遽 +##岌 +##桅 +##唔 +##鄞 +##雹 +##驸 +##苻 +##恻 +##鬃 +##玑 +##磬 +##崂 +##祉 +##荤 +##淼 +##肱 +##呗 +##骡 +##囱 +##佞 +##耒 +##伫 +##嚷 +##粼 +##歆 +##佃 +##旎 +##惋 +##殁 +##杳 +##阡 +##畈 +##蔺 +##巽 +##昱 +##啰 +##吠 +##嗔 +##涮 +##奂 +##撷 +##袒 +##爰 +##捶 +##赭 +##蜓 +##姗 +##蔻 +##垠 +##噻 +##峒 +##皙 +##憔 +##帚 +##杷 +##蟆 +##觐 +##钒 
+##岙 +##栀 +##幄 +##啧 +##癜 +##擀 +##轲 +##铆 +##讴 +##樽 +##霏 +##肮 +##枳 +##骞 +##诧 +##瘢 +##虬 +##拗 +##蕲 +##茁 +##唆 +##沭 +##毂 +##蛎 +##芊 +##銮 +##瞥 +##呱 +##羿 +##吒 +##傥 +##髯 +##濯 +##蜻 +##皴 +##邳 +##燧 +##獭 +##垭 +##祟 +##虢 +##枇 +##鹫 +##颞 +##皑 +##脲 +##舔 +##魇 +##霭 +##坨 +##郧 +##椽 +##舫 +##荠 +##琊 +##溟 +##煨 +##谯 +##粲 +##罂 +##屉 +##佯 +##郦 +##亵 +##诽 +##芩 +##嵇 +##蚤 +##哒 +##啬 +##嚎 +##玥 +##隼 +##唢 +##铛 +##壅 +##藜 +##吱 +##楣 +##璟 +##锆 +##憋 +##罡 +##咙 +##腈 +##廪 +##堑 +##诩 +##溧 +##鹑 +##讫 +##哌 +##铢 +##蜴 +##稹 +##噜 +##镉 +##愕 +##桁 +##晔 +##琰 +##陲 +##疙 +##崮 +##颛 +##桡 +##钜 +##谑 +##仞 +##咦 +##珪 +##揍 +##鱿 +##阉 +##瘩 +##槌 +##滓 +##茴 +##泮 +##涣 +##柞 +##渥 +##飨 +##孪 +##沔 +##谲 +##桉 +##慵 +##俚 +##跖 +##纭 +##恙 +##佘 +##荃 +##咄 +##鞅 +##叁 +##恽 +##炔 +##萘 +##钺 +##楫 +##塬 +##钡 +##琮 +##苄 +##骰 +##偎 +##粕 +##咔 +##鹄 +##瓒 +##阆 +##捅 +##嬴 +##箨 +##氦 +##倜 +##觊 +##婕 +##锑 +##撬 +##掰 +##嗷 +##饯 +##蓓 +##鼬 +##佤 +##蚯 +##挞 +##臾 +##嶙 +##幂 +##饬 +##闱 +##煅 +##嘧 +##蹭 +##瞭 +##顼 +##箐 +##徉 +##骜 +##嗨 +##邛 +##庑 +##柩 +##饕 +##俎 +##嘌 +##颏 +##椁 +##崧 +##锉 +##籼 +##狞 +##弁 +##羯 +##踹 +##糅 +##砼 +##嫖 +##豉 +##啉 +##榷 +##嘈 +##俪 +##痂 +##儋 +##芎 +##繇 +##蹇 +##诋 +##煸 +##峋 +##淙 +##泱 +##徜 +##汩 +##纥 +##蝼 +##囿 +##暹 +##谆 +##蹂 +##鞣 +##螳 +##馗 +##幺 +##鞑 +##贽 +##漯 +##牦 +##淖 +##囤 +##晗 +##忡 +##懊 +##呋 +##埂 +##鲈 +##阕 +##幌 +##鳅 +##勰 +##萸 +##剽 +##蚝 +##绔 +##辇 +##麋 +##陟 +##宥 +##锺 +##喽 +##淅 +##熵 +##荨 +##忤 +##轭 +##嗦 +##荪 +##骠 +##鹘 +##聿 +##绾 +##诶 +##怆 +##喋 +##恸 +##湟 +##睨 +##翦 +##蜈 +##褂 +##娼 +##羸 +##觎 +##瘁 +##蚣 +##呻 +##昶 +##谶 +##猬 +##荻 +##酗 +##肄 +##躏 +##膑 +##嗡 +##庠 +##崽 +##搪 +##胯 +##铉 +##峤 +##郯 +##藐 +##舂 +##蓼 +##薏 +##窿 +##羣 +##氽 +##徕 +##冼 +##阂 +##欤 +##殒 +##窈 +##脘 +##篝 +##麸 +##砭 +##砰 +##骶 +##豺 +##窠 +##獒 +##腴 +##苕 +##缇 +##骅 +##劭 +##卅 +##揆 +##垅 +##琏 +##镗 +##苜 +##胛 +##珏 +##吮 +##抠 +##搔 +##槎 +##掣 +##琨 +##餮 +##舛 +##痤 +##埭 +##胪 +##喹 +##妲 +##婀 +##帙 +##箩 +##灏 +##霎 +##袄 +##镭 +##蓿 +##墉 +##嵊 +##堇 +##蟋 +##叽 +##钎 +##録 +##郓 +##瘴 +##丶 +##呦 +##邬 +##頫 +##馁 +##鄢 +##蛭 +##愍 +##锲 +##槿 +##珈 +##蜃 +##拎 +##鎏 +##裟 +##沏 +##螭 +##觑 +##墒 +##捺 +##轸 +##榫 +##怔 +##昀 +##泷 +##凫 +##唠 +##狰 +##鲛 +##氐 +##呛 +##绀 +##碛 +##茏 +##盅 +##蟀 +##洙 +##訇 +##蠹 +##棂 +##蚴 +##篾 +##靛 +##暄 +##泞 +##洄 +##赓 +##麽 +##篓 +##孑 +##烩 +##颢 +##钣 +##髂 +##蹴 +##筮 +##蝌 +##醮 +##菖 +##獗 +##岘 +##鼐 +##姣 +##蟑 +##袈 +##葶 +##掬 +##躇 +##鹌 +##踌 +##钹 +##蚪 +##颧 +##鹳 +##鲲 +##驷 +##潴 +##焱 +##稔 +##悌 +##唏 +##苒 +##蹙 +##氩 +##宓 +##綦 +##苎 +##疃 +##攫 +##掾 +##徭 +##舀 +##逶 +##嗤 +##蜷 +##茔 +##疳 +##迳 +##罄 +##瓠 +##讪 +##傈 +##杲 +##灞 +##氲 +##鬲 +##獠 +##柒 +##骧 +##搀 +##珩 +##绦 +##嚏 +##镛 +##喱 +##倏 +##馋 +##茭 +##擘 +##斫 +##怂 +##唧 +##犍 +##谩 +##赊 +##鬻 +##禛 +##圻 +##蹶 +##缄 +##瘿 +##黠 +##甑 +##矸 +##嘀 +##蹼 +##叼 +##旻 +##鹜 +##稗 +##雒 +##赉 +##馔 +##颦 +##颔 +##掇 +##赅 +##桎 +##痧 +##谄 +##孛 +##笆 +##鲶 +##铳 +##龋 +##盱 +##笏 +##窕 +##苴 +##萋 +##辘 +##琬 +##梏 +##蚧 +##镳 +##瞅 +##睬 +##偌 +##鲵 +##惦 +##蜍 +##靼 +##阗 +##菟 +##黝 +##挈 +##嵴 +##剡 +##楸 +##氤 +##呎 +##珲 +##馄 +##滂 +##蹉 +##蓑 +##锷 +##啜 +##婵 +##鬣 +##钿 +##晌 +##蛆 +##隗 +##酞 +##枞 +##戬 +##獾 +##镕 +##饨 +##娣 +##缰 +##邾 +##鹗 +##嗒 +##苋 +##薮 +##棹 +##拄 +##埕 +##勖 +##鹞 +##殚 +##鲢 +##啖 +##沣 +##靥 +##葭 +##诿 +##鸪 +##饴 +##疖 +##抟 +##睽 +##稞 +##吋 +##谀 +##澍 +##杈 +##妤 +##峄 +##漉 +##気 +##咲 +##璘 +##萜 +##僭 +##朐 +##圜 +##癞 +##藿 +##珉 +##陉 +##僮 +##膻 +##薹 +##汊 +##锗 +##昉 +##猗 +##锶 +##跛 +##嘹 +##瓤 +##衄 +##豕 +##吆 +##腆 +##喆 +##莴 +##谌 +##珙 +##疥 +##鲑 +##玷 +##蛔 +##砀 +##谔 +##睥 +##蹑 +##诒 +##逋 +##姝 +##刈 +##婧 +##喳 +##镞 +##铌 +##辎 +##鹧 +##檩 +##扪 +##霰 +##裆 +##嬷 +##刎 +##嵋 +##悱 +##嘤 +##篁 +##荸 +##瞑 +##殓 +##搽 +##橇 +##雎 +##弭 +##獐 +##恿 +##眦 +##铐 +##尕 +##捎 +##诟 +##痨 +##笞 +##趺 +##唬 +##苣 +##啾 +##瘪 +##垸 +##橹 +##濛 +##曷 +##樾 +##汨 +##仟 +##姒 +##怦 +##荏 +##诤 +##苡 +##吭 +##崆 +##氡 +##脩 +##胝 +##钏 +##屐 +##忐 +##彧 +##拚 +##鏖 +##孳 +##忑 +##邝 +##穰 +##摈 +##庖 +##鸵 +##矽 +##鲟 +##発 +##菅 +##圪 +##蹋 +##衾 +##簋 
+##璎 +##噎 +##嬗 +##肼 +##跎 +##滟 +##戦 +##嵬 +##仝 +##惇 +##纾 +##炁 +##闳 +##骐 +##秣 +##眙 +##谘 +##碓 +##疔 +##恂 +##鳕 +##鸱 +##爨 +##镊 +##钯 +##圮 +##楽 +##堀 +##膘 +##噗 +##锹 +##杼 +##酊 +##挎 +##箸 +##郗 +##垌 +##溏 +##蔫 +##偻 +##妫 +##飚 +##辔 +##濬 +##瑄 +##觚 +##铍 +##跚 +##翕 +##煊 +##耄 +##铋 +##篦 +##阇 +##骛 +##莪 +##吲 +##唁 +##箧 +##珅 +##潋 +##迨 +##哽 +##砦 +##缗 +##謇 +##呸 +##垓 +##糍 +##璠 +##妣 +##狎 +##攥 +##闇 +##蛉 +##瑁 +##腼 +##蹒 +##嶷 +##莠 +##沤 +##哚 +##遑 +##跺 +##膦 +##蹿 +##郫 +##玳 +##埚 +##衿 +##醪 +##挹 +##绡 +##汜 +##坩 +##旃 +##鸨 +##翈 +##抡 +##晞 +##盥 +##藁 +##蓖 +##臊 +##羰 +##楝 +##噱 +##饽 +##苌 +##褓 +##佶 +##稜 +##瞠 +##仡 +##伉 +##襁 +##涞 +##蜇 +##抿 +##瑗 +##孱 +##懑 +##淦 +##赝 +##醌 +##缫 +##蠲 +##嚓 +##鲷 +##湫 +##捋 +##咩 +##裏 +##犒 +##墀 +##硐 +##蔸 +##钽 +##麂 +##蒡 +##鼹 +##绻 +##錾 +##仃 +##篙 +##蕤 +##铤 +##槁 +##牖 +##螈 +##俦 +##笄 +##啻 +##対 +##郤 +##闼 +##醺 +##赍 +##檗 +##裾 +##噫 +##掸 +##箓 +##妪 +##乂 +##蝈 +##砻 +##胍 +##蜱 +##聃 +##雠 +##碚 +##椤 +##缯 +##昴 +##缱 +##祎 +##缬 +##铙 +##孀 +##笳 +##蘇 +##愆 +##榉 +##氙 +##燹 +##撂 +##菽 +##箬 +##蛄 +##瘸 +##嬛 +##橐 +##纡 +##刽 +##辂 +##蒯 +##邨 +##赀 +##跸 +##邙 +##黟 +##磴 +##闿 +##垟 +##嵯 +##钚 +##跄 +##潸 +##崴 +##恁 +##楮 +##腧 +##胨 +##芫 +##碴 +##隰 +##杓 +##貉 +##欹 +##侑 +##鳜 +##铄 +##椴 +##昇 +##醍 +##肓 +##缂 +##铡 +##蹠 +##徂 +##豢 +##蒽 +##菡 +##衲 +##阚 +##芗 +##痍 +##玠 +##晷 +##淝 +##鄯 +##糗 +##耨 +##榧 +##胴 +##蕈 +##镬 +##鼾 +##摭 +##鸮 +##恚 +##実 +##砝 +##珣 +##寤 +##埙 +##锏 +##喟 +##蘅 +##骺 +##捭 +##莜 +##缶 +##锟 +##叵 +##炷 +##鲧 +##胼 +##査 +##岬 +##鹂 +##牯 +##珥 +##莼 +##邠 +##眇 +##卟 +##変 +##惴 +##渑 +##蚱 +##瞌 +##瘰 +##佝 +##旸 +##衽 +##郅 +##奁 +##魑 +##缛 +##颙 +##镫 +##簌 +##豇 +##姹 +##邋 +##暝 +##釐 +##洹 +##咿 +##俳 +##蜊 +##醐 +##聩 +##坻 +##毽 +##喾 +##辋 +##倌 +##媪 +##蛳 +##滹 +##哙 +##阊 +##趸 +##祢 +##籀 +##徼 +##訾 +##髁 +##砜 +##撸 +##瓘 +##缁 +##镓 +##縻 +##菀 +##酢 +##桠 +##撵 +##怏 +##渌 +##摞 +##槲 +##浠 +##诜 +##魉 +##韫 +##亓 +##盤 +##瑭 +##魍 +##襞 +##爿 +##浃 +##樯 +##讵 +##揩 +##耋 +##帏 +##崃 +##鸩 +##遢 +##臃 +##粿 +##禳 +##桫 +##髹 +##诳 +##踉 +##郃 +##嗖 +##讧 +##碁 +##湎 +##阏 +##媾 +##様 +##哔 +##舸 +##曩 +##忝 +##峁 +##掂 +##葳 +##鄄 +##谵 +##彊 +##锴 +##郜 +##葖 +##蓇 +##瓴 +##鳟 +##橼 +##鲇 +##邗 +##犄 +##秭 +##槭 +##缵 +##巯 +##龊 +##狍 +##擞 +##瞽 +##栲 +##撅 +##瑀 +##戢 +##朓 +##逖 +##椹 +##洺 +##艏 +##苁 +##滘 +##铧 +##侪 +##豳 +##竦 +##貔 +##圄 +##呷 +##旄 +##遛 +##芈 +##砣 +##桷 +##龌 +##疬 +##缟 +##洌 +##跏 +##蝮 +##菰 +##帑 +##怙 +##豸 +##雩 +##誊 +##臬 +##镣 +##箇 +##踱 +##钍 +##苫 +##蝽 +##浯 +##単 +##亶 +##囹 +##穑 +##佻 +##绌 +##诔 +##鹬 +##髌 +##蒌 +##鳏 +##殄 +##怛 +##筌 +##刳 +##翮 +##卍 +##畹 +##箜 +##燔 +##赳 +##篌 +##窨 +##翥 +##炅 +##钕 +##莳 +##忖 +##戡 +##沢 +##狒 +##圉 +##琯 +##邰 +##苾 +##犸 +##邡 +##郏 +##襦 +##沆 +##玟 +##濉 +##洎 +##莨 +##氘 +##咛 +##佺 +##腩 +##鳔 +##剜 +##秕 +##牝 +##芨 +##関 +##拊 +##竑 +##圹 +##颡 +##摺 +##沩 +##蜉 +##筚 +##愔 +##肟 +##俶 +##堃 +##绉 +##奭 +##罅 +##嗳 +##蜢 +##疠 +##帔 +##髡 +##黥 +##褛 +##柰 +##鏊 +##痼 +##堞 +##嗝 +##娉 +##戕 +##铱 +##耜 +##觥 +##镒 +##呓 +##蒹 +##栱 +##卮 +##琚 +##逦 +##酩 +##蓍 +##虺 +##谠 +##鼋 +##焗 +##褴 +##砒 +##赧 +##蛏 +##蚬 +##瘕 +##顗 +##愠 +##勣 +##飕 +##徳 +##滢 +##琇 +##鳙 +##瞟 +##尻 +##澶 +##荽 +##舐 +##侂 +##黼 +##潟 +##绂 +##瘗 +##蓥 +##竽 +##濞 +##骖 +##偁 +##応 +##锜 +##匏 +##赑 +##讦 +##诨 +##罘 +##巖 +##嫘 +##颀 +##岿 +##虻 +##罴 +##囗 +##溆 +##噤 +##骝 +##咂 +##锛 +##槊 +##啕 +##驽 +##凇 +##籴 +##硖 +##铯 +##怿 +##笥 +##噙 +##倨 +##坭 +##醅 +##滏 +##悻 +##聒 +##枥 +##昺 +##酆 +##簟 +##睇 +##轫 +##溱 +##骢 +##榘 +##珺 +##跹 +##蚶 +##驺 +##饧 +##噼 +##儆 +##氚 +##哧 +##旒 +##鸬 +##夥 +##玦 +##貅 +##揄 +##戗 +##璩 +##剐 +##垴 +##蘼 +##裒 +##躅 +##唳 +##嗑 +##荦 +##霈 +##缦 +##啭 +##隈 +##悫 +##彀 +##悭 +##焓 +##磔 +##蓊 +##郾 +##枧 +##鹚 +##検 +##屃 +##馑 +##嗲 +##铟 +##薤 +##涔 +##樗 +##忾 +##収 +##绺 +##烊 +##螫 +##黩 +##鞫 +##鲠 +##嘭 +##缣 +##蒺 +##黒 +##骘 +##氖 +##镝 +##俅 +##谮 +##屦 +##摁 +##氪 +##蘧 +##伝 +##腠 +##叡 +##鲂 +##続 +##讣 +##耷 +##燊 +##鸷 +##猊 +##囡 +##崤 +##砬 +##湜 +##翚 +##峯 +##鲎 +##蕖 +##鹈 +##凼 +##泫 +##荑 +##黻 +##牂 +##鄣 +##篑 +##髭 +##陬 +##寔 +##疴 +##邽 +##喏 
+##彖 +##彘 +##赟 +##盹 +##诮 +##鸫 +##茕 +##铖 +##闩 +##読 +##鄜 +##漈 +##盍 +##甭 +##愎 +##魃 +##炆 +##鍊 +##蛐 +##薜 +##楯 +##鲀 +##逡 +##嘞 +##侔 +##觇 +##糸 +##踮 +##狷 +##菘 +##寳 +##扃 +##禊 +##喑 +##塍 +##栝 +##瓿 +##廨 +##貘 +##馕 +##僰 +##哏 +##瑷 +##疎 +##蝣 +##怵 +##阃 +##弢 +##镲 +##螅 +##吖 +##碲 +##夼 +##茌 +##嗬 +##靺 +##髀 +##铊 +##谡 +##癔 +##镠 +##巻 +##秾 +##菪 +##赜 +##铈 +##髙 +##鲳 +##珰 +##畋 +##泅 +##鲅 +##泚 +##飏 +##屍 +##仨 +##葚 +##叻 +##咻 +##衩 +##郄 +##蹩 +##嬖 +##踽 +##柽 +##鞨 +##麴 +##薙 +##钇 +##氵 +##垆 +##犟 +##罍 +##経 +##粜 +##焜 +##牀 +##埝 +##洧 +##覧 +##蓣 +##甯 +##蒐 +##馐 +##畑 +##缑 +##礽 +##瞋 +##浍 +##袢 +##桕 +##侩 +##詈 +##戸 +##烝 +##堌 +##伋 +##倬 +##圯 +##碇 +##纰 +##磾 +##泔 +##纮 +##蓁 +##铗 +##弇 +##挲 +##艉 +##鱬 +##泺 +##橛 +##袴 +##韪 +##籓 +##贶 +##棰 +##趵 +##樨 +##傕 +##玕 +##毎 +##繸 +##劵 +##镧 +##秫 +##邶 +##猞 +##廛 +##栌 +##钲 +##镦 +##嘏 +##蝰 +##镏 +##淠 +##荇 +##逄 +##嘅 +##祕 +##瑠 +##炝 +##杪 +##埴 +##獬 +##柢 +##捱 +##跣 +##涑 +##撃 +##伢 +##堠 +##卽 +##猁 +##厣 +##辏 +##旆 +##茆 +##乜 +##踯 +##。 +##? +##! +##? +##; +[UNK] diff --git a/applications/document_intelligence/doc_vqa/Rerank/run_test.sh b/applications/document_intelligence/doc_vqa/Rerank/run_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..48b03fe4dc5ad2358d97ffca5a02dca4f00cddca --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/run_test.sh @@ -0,0 +1,45 @@ +#!/bin/bash + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +QUESTION=$1 + +if [ ! -d output ]; then + mkdir output +fi +if [ ! -d log ]; then + mkdir log +fi + +python3 change_to_rerank.py ${QUESTION} + +python3 -u ./src/train_ce.py \ + --use_cuda true \ + --verbose true \ + --do_train false \ + --do_val false \ + --do_test true \ + --batch_size 128 \ + --init_checkpoint "./checkpoints/ranker" \ + --test_set "./data/demo.tsv" \ + --test_save "data/demo.score" \ + --max_seq_len 384 \ + --for_cn true \ + --vocab_path "config/ernie_base_1.0_CN/vocab.txt" \ + --ernie_config_path "config/ernie_base_1.0_CN/ernie_config.json" + 1>>log/train.log 2>&1 + diff --git a/applications/document_intelligence/doc_vqa/Rerank/run_train.sh b/applications/document_intelligence/doc_vqa/Rerank/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..18cffa7761eeb335f9997df3903b703bca5c4803 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/run_train.sh @@ -0,0 +1,72 @@ +#!/bin/bash + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+export CUDA_VISIBLE_DEVICES=0
+
+if [ $# != 4 ];then
+    echo "USAGE: sh run_train.sh \$TRAIN_SET \$MODEL_PATH \$epoch \$nodes_count"
+    exit 1
+fi
+
+TRAIN_SET=$1
+MODEL_PATH=$2
+epoch=$3
+node=$4
+
+CHECKPOINT_PATH=output
+if [ ! -d output ]; then
+    mkdir output
+fi
+if [ ! -d log ]; then
+    mkdir log
+fi
+
+lr=1e-5
+batch_size=32
+train_exampls=`cat $TRAIN_SET | wc -l`
+save_steps=$[$train_exampls/$batch_size/$node]
+data_size=$[$save_steps*$batch_size*$node]
+new_save_steps=$[$save_steps*$epoch/2]
+
+python3 -m paddle.distributed.launch \
+    --log_dir log \
+    ./src/train_ce.py \
+    --use_cuda true \
+    --verbose true \
+    --do_train true \
+    --do_val false \
+    --do_test false \
+    --use_mix_precision false \
+    --train_data_size ${data_size} \
+    --batch_size ${batch_size} \
+    --init_pretraining_params ${MODEL_PATH} \
+    --train_set ${TRAIN_SET} \
+    --save_steps ${new_save_steps} \
+    --validation_steps ${new_save_steps} \
+    --checkpoints ${CHECKPOINT_PATH} \
+    --weight_decay 0.01 \
+    --warmup_proportion 0.0 \
+    --epoch $epoch \
+    --max_seq_len 384 \
+    --for_cn true \
+    --vocab_path config/ernie_base_1.0_CN/vocab.txt \
+    --ernie_config_path config/ernie_base_1.0_CN/ernie_config.json \
+    --learning_rate ${lr} \
+    --skip_steps 10 \
+    --num_iteration_per_drop_scope 1 \
+    --num_labels 2 \
+    --random_seed 1
+
diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/batching.py b/applications/document_intelligence/doc_vqa/Rerank/src/batching.py
new file mode 100644
index 0000000000000000000000000000000000000000..10901b81f1c36b25b8640b0944e333647828c944
--- /dev/null
+++ b/applications/document_intelligence/doc_vqa/Rerank/src/batching.py
@@ -0,0 +1,69 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Mask, padding and batching."""
+
+import numpy as np
+
+
+def pad_batch_data(
+    insts,
+    pad_idx=0,
+    return_pos=False,
+    return_input_mask=False,
+    return_max_len=False,
+    return_num_token=False,
+    return_seq_lens=False,
+):
+    """
+    Pad the instances to the max sequence length in batch, and generate the
+    corresponding position data and attention bias.
+    """
+    return_list = []
+    max_len = max(len(inst) for inst in insts)
+    # Any token included in dict can be used to pad, since the paddings' loss
+    # will be masked out by weights and make no effect on parameter gradients.
+
+    inst_data = np.array([inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
+    return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
+
+    # position data
+    if return_pos:
+        inst_pos = np.array([list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts])
+
+        return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
+
+    if return_input_mask:
+        # This is used to avoid attention on paddings.
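+        # Build the mask from the unpadded lengths: 1 for real tokens and 0 for
+        # padding, with a trailing axis so the final shape is
+        # [batch_size, max_len, 1] in float32.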
+ input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts]) + input_mask_data = np.expand_dims(input_mask_data, axis=-1) + return_list += [input_mask_data.astype("float32")] + + if return_max_len: + return_list += [max_len] + + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + + if return_seq_lens: + seq_lens = np.array([len(inst) for inst in insts]) + return_list += [seq_lens.astype("int64").reshape([-1, 1])] + + return return_list if len(return_list) > 1 else return_list[0] + + +if __name__ == "__main__": + pass diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/cross_encoder.py b/applications/document_intelligence/doc_vqa/Rerank/src/cross_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..700b70407ce09832a4b7709d8e83a49cefb525a7 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/cross_encoder.py @@ -0,0 +1,333 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Model for classifier.""" + +import logging +import time + +import numpy as np +import paddle.fluid as fluid +from model.ernie import ErnieModel +from scipy.stats import pearsonr, spearmanr + +log = logging.getLogger(__name__) + + +def create_model(args, pyreader_name, ernie_config, is_prediction=False, task_name=""): + pyreader = fluid.layers.py_reader( + capacity=50, + shapes=[ + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, args.max_seq_len, 1], + [-1, 1], + [-1, 1], + ], + dtypes=["int64", "int64", "int64", "int64", "float32", "int64", "int64"], + lod_levels=[0, 0, 0, 0, 0, 0, 0], + name=task_name + "_" + pyreader_name, + use_double_buffer=True, + ) + + (src_ids, sent_ids, pos_ids, task_ids, input_mask, labels, qids) = fluid.layers.read_file(pyreader) + + def _model(is_noise=False): + ernie = ErnieModel( + src_ids=src_ids, + position_ids=pos_ids, + sentence_ids=sent_ids, + task_ids=task_ids, + input_mask=input_mask, + config=ernie_config, + is_noise=is_noise, + ) + + cls_feats = ernie.get_pooled_output() + if not is_noise: + cls_feats = fluid.layers.dropout(x=cls_feats, dropout_prob=0.1, dropout_implementation="upscale_in_train") + logits = fluid.layers.fc( + input=cls_feats, + size=args.num_labels, + param_attr=fluid.ParamAttr( + name=task_name + "_cls_out_w", initializer=fluid.initializer.TruncatedNormal(scale=0.02) + ), + bias_attr=fluid.ParamAttr(name=task_name + "_cls_out_b", initializer=fluid.initializer.Constant(0.0)), + ) + """ + if is_prediction: + probs = fluid.layers.softmax(logits) + feed_targets_name = [ + src_ids.name, sent_ids.name, pos_ids.name, input_mask.name + ] + if ernie_version == "2.0": + feed_targets_name += [task_ids.name] + return pyreader, probs, feed_targets_name + """ + + num_seqs = fluid.layers.create_tensor(dtype="int64") + # add focal loss + ce_loss, probs = 
fluid.layers.softmax_with_cross_entropy(logits=logits, label=labels, return_softmax=True) + loss = fluid.layers.mean(x=ce_loss) + accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs) + graph_vars = { + "loss": loss, + "probs": probs, + "accuracy": accuracy, + "labels": labels, + "num_seqs": num_seqs, + "qids": qids, + } + return graph_vars + + if not is_prediction: + graph_vars = _model(is_noise=True) + old_loss = graph_vars["loss"] + token_emb = fluid.default_main_program().global_block().var("word_embedding") + token_emb.stop_gradient = False + token_gradient = fluid.gradients(old_loss, token_emb)[0] + token_gradient.stop_gradient = False + epsilon = 1e-8 + norm = fluid.layers.sqrt(fluid.layers.reduce_sum(fluid.layers.square(token_gradient)) + epsilon) + gp = (0.01 * token_gradient) / norm + gp.stop_gradient = True + fluid.layers.assign(token_emb + gp, token_emb) + graph_vars = _model() + fluid.layers.assign(token_emb - gp, token_emb) + else: + graph_vars = _model() + + return pyreader, graph_vars + + +def evaluate_mrr(preds): + last_qid = None + total_mrr = 0.0 + qnum = 0.0 + rank = 0.0 + correct = False + for qid, score, label in preds: + if qid != last_qid: + rank = 0.0 + qnum += 1 + correct = False + last_qid = qid + + rank += 1 + if not correct and label != 0: + total_mrr += 1.0 / rank + correct = True + + return total_mrr / qnum + + +def evaluate( + exe, test_program, test_pyreader, graph_vars, eval_phase, use_multi_gpu_test=False, metric="simple_accuracy" +): + train_fetch_list = [graph_vars["loss"].name, graph_vars["accuracy"].name, graph_vars["num_seqs"].name] + + if eval_phase == "train": + if "learning_rate" in graph_vars: + train_fetch_list.append(graph_vars["learning_rate"].name) + outputs = exe.run(fetch_list=train_fetch_list, program=test_program) + ret = {"loss": np.mean(outputs[0]), "accuracy": np.mean(outputs[1])} + if "learning_rate" in graph_vars: + ret["learning_rate"] = float(outputs[3][0]) + return ret + + test_pyreader.start() + total_cost = 0.0 + total_acc = 0.0 + total_num_seqs = 0.0 + total_label_pos_num = 0.0 + total_pred_pos_num = 0.0 + total_correct_num = 0.0 + qids, labels, scores, preds = [], [], [], [] + time_begin = time.time() + + fetch_list = [ + graph_vars["loss"].name, + graph_vars["accuracy"].name, + graph_vars["probs"].name, + graph_vars["labels"].name, + graph_vars["num_seqs"].name, + graph_vars["qids"].name, + ] + while True: + try: + if use_multi_gpu_test: + np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids = exe.run(fetch_list=fetch_list) + else: + np_loss, np_acc, np_probs, np_labels, np_num_seqs, np_qids = exe.run( + program=test_program, fetch_list=fetch_list + ) + total_cost += np.sum(np_loss * np_num_seqs) + total_acc += np.sum(np_acc * np_num_seqs) + total_num_seqs += np.sum(np_num_seqs) + labels.extend(np_labels.reshape((-1)).tolist()) + if np_qids is None: + np_qids = np.array([]) + qids.extend(np_qids.reshape(-1).tolist()) + scores.extend(np_probs[:, 1].reshape(-1).tolist()) + np_preds = np.argmax(np_probs, axis=1).astype(np.float32) + preds.extend(np_preds) + total_label_pos_num += np.sum(np_labels) + total_pred_pos_num += np.sum(np_preds) + total_correct_num += np.sum(np.dot(np_preds, np_labels)) + except fluid.core.EOFException: + test_pyreader.reset() + break + time_end = time.time() + cost = total_cost / total_num_seqs + elapsed_time = time_end - time_begin + + evaluate_info = "" + if metric == "acc_and_f1": + ret = acc_and_f1(preds, labels) + evaluate_info = "[%s evaluation] ave loss: %f, ave_acc: 
%f, f1: %f, data_num: %d, elapsed time: %f s" % ( + eval_phase, + cost, + ret["acc"], + ret["f1"], + total_num_seqs, + elapsed_time, + ) + elif metric == "matthews_corrcoef": + ret = matthews_corrcoef(preds, labels) + evaluate_info = "[%s evaluation] ave loss: %f, matthews_corrcoef: %f, data_num: %d, elapsed time: %f s" % ( + eval_phase, + cost, + ret, + total_num_seqs, + elapsed_time, + ) + elif metric == "pearson_and_spearman": + ret = pearson_and_spearman(scores, labels) + evaluate_info = ( + "[%s evaluation] ave loss: %f, pearson:%f, spearman:%f, corr:%f, data_num: %d, elapsed time: %f s" + % (eval_phase, cost, ret["pearson"], ret["spearman"], ret["corr"], total_num_seqs, elapsed_time) + ) + elif metric == "simple_accuracy": + ret = simple_accuracy(preds, labels) + evaluate_info = "[%s evaluation] ave loss: %f, acc:%f, data_num: %d, elapsed time: %f s" % ( + eval_phase, + cost, + ret, + total_num_seqs, + elapsed_time, + ) + elif metric == "acc_and_f1_and_mrr": + ret_a = acc_and_f1(preds, labels) + preds = sorted(zip(qids, scores, labels), key=lambda elem: (elem[0], -elem[1])) + ret_b = evaluate_mrr(preds) + evaluate_info = "[%s evaluation] ave loss: %f, acc: %f, f1: %f, mrr: %f, data_num: %d, elapsed time: %f s" % ( + eval_phase, + cost, + ret_a["acc"], + ret_a["f1"], + ret_b, + total_num_seqs, + elapsed_time, + ) + else: + raise ValueError("unsupported metric {}".format(metric)) + return evaluate_info + + +def matthews_corrcoef(preds, labels): + preds = np.array(preds) + labels = np.array(labels) + tp = np.sum((labels == 1) & (preds == 1)) + tn = np.sum((labels == 0) & (preds == 0)) + fp = np.sum((labels == 0) & (preds == 1)) + fn = np.sum((labels == 1) & (preds == 0)) + + mcc = ((tp * tn) - (fp * fn)) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + return mcc + + +def f1_score(preds, labels): + preds = np.array(preds) + labels = np.array(labels) + + tp = np.sum((labels == 1) & (preds == 1)) + fp = np.sum((labels == 0) & (preds == 1)) + fn = np.sum((labels == 1) & (preds == 0)) + p = tp / (tp + fp) + r = tp / (tp + fn) + f1 = (2 * p * r) / (p + r + 1e-8) + return f1 + + +def pearson_and_spearman(preds, labels): + preds = np.array(preds) + labels = np.array(labels) + + pearson_corr = pearsonr(preds, labels)[0] + spearman_corr = spearmanr(preds, labels)[0] + return { + "pearson": pearson_corr, + "spearmanr": spearman_corr, + "corr": (pearson_corr + spearman_corr) / 2, + } + + +def acc_and_f1(preds, labels): + preds = np.array(preds) + labels = np.array(labels) + + acc = simple_accuracy(preds, labels) + f1 = f1_score(preds, labels) + return { + "acc": acc, + "f1": f1, + "acc_and_f1": (acc + f1) / 2, + } + + +def simple_accuracy(preds, labels): + preds = np.array(preds) + labels = np.array(labels) + return (preds == labels).mean() + + +def predict(exe, test_program, test_pyreader, graph_vars, dev_count=1): + test_pyreader.start() + qids, probs = [], [] + preds = [] + + fetch_list = [graph_vars["probs"].name, graph_vars["qids"].name] + + while True: + try: + if dev_count == 1: + np_probs, np_qids = exe.run(program=test_program, fetch_list=fetch_list) + else: + np_probs, np_qids = exe.run(fetch_list=fetch_list) + + if np_qids is None: + np_qids = np.array([]) + qids.extend(np_qids.reshape(-1).tolist()) + np_preds = np.argmax(np_probs, axis=1).astype(np.float32) + preds.extend(np_preds) + probs.append(np_probs) + + except fluid.core.EOFException: + test_pyreader.reset() + break + + probs = np.concatenate(probs, axis=0).reshape([len(preds), -1]) + + return qids, preds, probs diff 
--git a/applications/document_intelligence/doc_vqa/Rerank/src/finetune_args.py b/applications/document_intelligence/doc_vqa/Rerank/src/finetune_args.py
new file mode 100644
index 0000000000000000000000000000000000000000..da3193f3167138ff50611d16115fa20d214b2dd3
--- /dev/null
+++ b/applications/document_intelligence/doc_vqa/Rerank/src/finetune_args.py
@@ -0,0 +1,95 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+
+from src.utils.args import ArgumentGroup
+
+# yapf: disable
+parser = argparse.ArgumentParser(__doc__)
+model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
+model_g.add_arg("ernie_config_path", str, None, "Path to the json file for ernie model config.")
+model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
+model_g.add_arg("init_pretraining_params", str, None, "Init pre-training params from which fine-tuning is performed. If the arg 'init_checkpoint' has been set, this argument is not valid.")
+model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.")
+
+model_g.add_arg("is_classify", bool, True, "is_classify")
+model_g.add_arg("is_regression", bool, False, "is_regression")
+model_g.add_arg("task_id", int, 0, "task id")
+
+train_g = ArgumentGroup(parser, "training", "training options.")
+train_g.add_arg("epoch", int, 3, "Number of epochs for fine-tuning.")
+train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
+train_g.add_arg("lr_scheduler", str, "linear_warmup_decay", "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay'])
+train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.")
+train_g.add_arg("warmup_proportion", float, 0.1, "Proportion of training steps to perform linear learning rate warmup for.")
+train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.")
+train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.")
+train_g.add_arg("use_recompute", bool, False, "Whether to use recompute optimizer for training.")
+train_g.add_arg("use_mix_precision", bool, False, "Whether to use mixed-precision optimizer for training.")
+train_g.add_arg("use_cross_batch", bool, False, "Whether to use cross-batch for training.")
+train_g.add_arg("use_lamb", bool, False, "Whether to use LambOptimizer for training.")
+train_g.add_arg("use_dynamic_loss_scaling", bool, True, "Whether to use dynamic loss scaling.")
+
+train_g.add_arg("test_save", str, "./checkpoints/test_result", "test_save")
+train_g.add_arg("metric", str, "simple_accuracy", "metric")
+train_g.add_arg("incr_every_n_steps", int, 100, "Increases loss scaling every n consecutive steps.")
+train_g.add_arg("decr_every_n_nan_or_inf", int, 2, "Decreases loss scaling every n accumulated steps with nan or inf gradients.")
+train_g.add_arg("incr_ratio", float, 2.0, "The multiplier to use when increasing the loss scaling.")
+train_g.add_arg("decr_ratio", float, 0.8, "The less-than-one multiplier to use when decreasing the loss scaling.")
+
+log_g = ArgumentGroup(parser, "logging", "logging related.")
+log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.")
+log_g.add_arg("verbose", bool, False, "Whether to output verbose log.")
+
+data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options")
+data_g.add_arg("tokenizer", str, "FullTokenizer", "ATTENTION: the INPUT must be split into words separated by blanks when using SentencepieceTokenizer or WordsegTokenizer")
+data_g.add_arg("train_set", str, None, "Path to training data.")
+data_g.add_arg("test_set", str, None, "Path to test data.")
+data_g.add_arg("dev_set", str, None, "Path to validation data.")
+data_g.add_arg("vocab_path", str, None, "Vocabulary path.")
+data_g.add_arg("max_seq_len", int, 512, "Number of words of the longest sequence.")
+data_g.add_arg("q_max_seq_len", int, 32, "Number of words of the longest sequence.")
+data_g.add_arg("p_max_seq_len", int, 256, "Number of words of the longest sequence.")
+data_g.add_arg("train_data_size", int, 0, "Number of training data's total examples. Set for distributed training.")
+data_g.add_arg("batch_size", int, 32, "Total number of examples in a batch for training. See also --in_tokens.")
+data_g.add_arg("predict_batch_size", int, None, "Total number of examples in a batch for prediction. See also --in_tokens.")
+data_g.add_arg("in_tokens", bool, False, "If set, the batch size will be the maximum number of tokens in one batch. Otherwise, it will be the maximum number of examples in one batch.")
+data_g.add_arg("do_lower_case", bool, True, "Whether to lower case the input text. Should be True for uncased models and False for cased models.")
+data_g.add_arg("random_seed", int, None, "Random seed.")
+data_g.add_arg("label_map_config", str, None, "label_map_path.")
+data_g.add_arg("num_labels", int, 2, "label number")
+data_g.add_arg("diagnostic", str, None, "GLUE Diagnostic Dataset")
+data_g.add_arg("diagnostic_save", str, None, "GLUE Diagnostic save f")
+data_g.add_arg("max_query_length", int, 64, "Max query length.")
+data_g.add_arg("max_answer_length", int, 100, "Max answer length.")
+data_g.add_arg("doc_stride", int, 128, "When splitting up a long document into chunks, how much stride to take between chunks.")
+data_g.add_arg("n_best_size", int, 20, "The total number of n-best predictions to generate in the nbest_predictions.json output file.")
+data_g.add_arg("chunk_scheme", type=str, default="IOB", choices=["IO", "IOB", "IOE", "IOBES"], help="chunk scheme")
+
+run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
+run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
+run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.")
+run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).")
+run_type_g.add_arg("num_iteration_per_drop_scope", int, 10, "Iteration intervals to drop scope.")
+run_type_g.add_arg("do_train", bool, True, "Whether to perform training.")
+run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation on dev data set.")
+run_type_g.add_arg("do_test", bool, True, "Whether to perform evaluation on test data set.")
+run_type_g.add_arg("output_item", int, 3, "Test output format.")
+run_type_g.add_arg("output_file_name", str, None, "Test output file name")
+run_type_g.add_arg("test_data_cnt", int, 1110000, "total cnt of testset")
+run_type_g.add_arg("use_multi_gpu_test", bool, False, "Whether to perform evaluation using multiple gpu cards") +run_type_g.add_arg("metrics", bool, True, "Whether to perform evaluation on test data set.") +run_type_g.add_arg("shuffle", bool, True, "") +run_type_g.add_arg("for_cn", bool, False, "model train for cn or for other langs.") diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/index_search.py b/applications/document_intelligence/doc_vqa/Rerank/src/index_search.py new file mode 100644 index 0000000000000000000000000000000000000000..d20cdaed497206e6ac3989db049426163aa79a3d --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/index_search.py @@ -0,0 +1,83 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import faiss +import numpy as np + + +def read_embed(file_name, dim=768, bs=3000): + if file_name.endswith("npy"): + i = 0 + emb_np = np.load(file_name) + while i < len(emb_np): + vec_list = emb_np[i : i + bs] + i += bs + yield vec_list + else: + vec_list = [] + with open(file_name) as inp: + for line in inp: + data = line.strip() + vector = [float(item) for item in data.split(" ")] + assert len(vector) == dim + vec_list.append(vector) + if len(vec_list) == bs: + yield vec_list + vec_list = [] + if vec_list: + yield vec_list + + +def load_qid(file_name): + qid_list = [] + with open(file_name) as inp: + for line in inp: + line = line.strip() + qid = line.split("\t")[0] + qid_list.append(qid) + return qid_list + + +def search(index, emb_file, qid_list, outfile, top_k): + q_idx = 0 + with open(outfile, "w") as out: + for batch_vec in read_embed(emb_file): + q_emb_matrix = np.array(batch_vec) + res_dist, res_p_id = index.search(q_emb_matrix.astype("float32"), top_k) + for i in range(len(q_emb_matrix)): + qid = qid_list[q_idx] + for j in range(top_k): + pid = res_p_id[i][j] + score = res_dist[i][j] + out.write("%s\t%s\t%s\t%s\n" % (qid, pid, j + 1, score)) + q_idx += 1 + + +def main(): + part = sys.argv[1] + topk = int(sys.argv[2]) + q_text_file = sys.argv[3] + outfile = "output/res.top%s-part%s" % (topk, part) + + qid_list = load_qid(q_text_file) + + engine = faiss.read_index("output/para.index.part%s" % part) + emb_file = "output/query.emb.npy" + search(engine, emb_file, qid_list, outfile, topk) + + +if __name__ == "__main__": + main() diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/merge.py b/applications/document_intelligence/doc_vqa/Rerank/src/merge.py new file mode 100644 index 0000000000000000000000000000000000000000..ae060b4a01edbe944f8c76ce2067c5155a7a7867 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/merge.py @@ -0,0 +1,61 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +shift = int(sys.argv[1]) +top = int(sys.argv[2]) +total_part = int(sys.argv[3]) + +f_list = [] +for part in range(total_part): + f0 = open("output/res.top%s-part%s" % (top, part)) + f_list.append(f0) + +line_list = [] +for part in range(total_part): + line = f_list[part].readline() + line_list.append(line) + +out = open("output/dev.res.top%s" % top, "w") +last_q = "" +ans_list = {} +while line_list[-1]: + cur_list = [] + for line in line_list: + sub = line.strip().split("\t") + cur_list.append(sub) + + if last_q == "": + last_q = cur_list[0][0] + if cur_list[0][0] != last_q: + rank = sorted(ans_list.items(), key=lambda a: a[1], reverse=True) + for i in range(top): + out.write("%s\t%s\t%s\t%s\n" % (last_q, rank[i][0], i + 1, rank[i][1])) + ans_list = {} + for i, sub in enumerate(cur_list): + ans_list[int(sub[1]) + shift * i] = float(sub[-1]) + last_q = cur_list[0][0] + + line_list = [] + for f0 in f_list: + line = f0.readline() + line_list.append(line) + +rank = sorted(ans_list.items(), key=lambda a: a[1], reverse=True) +for i in range(top): + out.write("%s\t%s\t%s\t%s\n" % (last_q, rank[i][0], i + 1, rank[i][1])) +out.close() + +print("output/dev.res.top%s" % top) diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/model/ernie.py b/applications/document_intelligence/doc_vqa/Rerank/src/model/ernie.py new file mode 100644 index 0000000000000000000000000000000000000000..2dafd972ca822ac8906e9f878559b9b16e3f9752 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/model/ernie.py @@ -0,0 +1,259 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Ernie model.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import json +import logging +from io import open + +import paddle +import paddle.fluid as fluid +import six +from model.transformer_encoder import encoder, pre_process_layer + +log = logging.getLogger(__name__) + + +class ErnieConfig(object): + def __init__(self, config_path): + self._config_dict = self._parse(config_path) + + def _parse(self, config_path): + try: + with open(config_path, "r", encoding="utf8") as json_file: + config_dict = json.load(json_file) + except Exception: + raise IOError("Error in parsing Ernie model config file '%s'" % config_path) + else: + return config_dict + + def __getitem__(self, key): + return self._config_dict.get(key, None) + + def print_config(self): + for arg, value in sorted(six.iteritems(self._config_dict)): + log.info("%s: %s" % (arg, value)) + log.info("------------------------------------------------") + + +class ErnieModel(object): + def __init__( + self, + src_ids, + position_ids, + sentence_ids, + task_ids, + input_mask, + config, + weight_sharing=True, + model_name="", + is_noise=False, + ): + + self._emb_size = config["hidden_size"] + self._n_layer = config["num_hidden_layers"] + self._n_head = config["num_attention_heads"] + self._voc_size = config["vocab_size"] + self._max_position_seq_len = config["max_position_embeddings"] + if config["sent_type_vocab_size"]: + self._sent_types = config["sent_type_vocab_size"] + else: + self._sent_types = config["type_vocab_size"] + + self._use_task_id = config["use_task_id"] + if self._use_task_id: + self._task_types = config["task_type_vocab_size"] + self._hidden_act = config["hidden_act"] + self._prepostprocess_dropout = config["hidden_dropout_prob"] + self._attention_dropout = config["attention_probs_dropout_prob"] + if is_noise: + self._prepostprocess_dropout = 0 + self._attention_dropout = 0 + self._weight_sharing = weight_sharing + self.checkpoints = [] + + self._word_emb_name = "word_embedding" + self._pos_emb_name = "pos_embedding" + self._sent_emb_name = "sent_embedding" + self._task_emb_name = "task_embedding" + self._emb_dtype = "float32" + + # Initialize all weights by truncated normal initializer, and all biases + # will be initialized by constant zero by default. 
+ self._param_initializer = fluid.initializer.TruncatedNormal(scale=config["initializer_range"]) + + self._build_model(model_name, src_ids, position_ids, sentence_ids, task_ids, input_mask) + + def _build_model(self, model_name, src_ids, position_ids, sentence_ids, task_ids, input_mask): + # padding id in vocabulary must be set to 0 + emb_out = fluid.layers.embedding( + input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=model_name + self._word_emb_name, initializer=self._param_initializer), + is_sparse=False, + ) + + position_emb_out = fluid.layers.embedding( + input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=model_name + self._pos_emb_name, initializer=self._param_initializer), + ) + + sent_emb_out = fluid.layers.embedding( + sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=model_name + self._sent_emb_name, initializer=self._param_initializer), + ) + + emb_out = emb_out + position_emb_out + emb_out = emb_out + sent_emb_out + + if self._use_task_id: + task_emb_out = fluid.layers.embedding( + task_ids, + size=[self._task_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr(name=model_name + self._task_emb_name, initializer=self._param_initializer), + ) + + emb_out = emb_out + task_emb_out + + emb_out = pre_process_layer(emb_out, "nd", self._prepostprocess_dropout, name=model_name + "pre_encoder") + + self_attn_mask = paddle.matmul(x=input_mask, y=input_mask, transpose_y=True) + + self_attn_mask = fluid.layers.scale(x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False) + n_head_self_attn_mask = fluid.layers.stack(x=[self_attn_mask] * self._n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + + self._enc_out, self.checkpoints = encoder( + enc_input=emb_out, + attn_bias=n_head_self_attn_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + model_name=model_name, + name=model_name + "encoder", + ) + + def get_sequence_output(self): + return self._enc_out + + def get_cls_output(self): + """Get the first feature of each sequence for classification""" + cls_output = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) + cls_output = fluid.layers.squeeze(cls_output, axes=[1]) + return cls_output + + def get_pooled_output(self): + """Get the first feature of each sequence for classification""" + next_sent_feat = fluid.layers.slice(input=self._enc_out, axes=[1], starts=[0], ends=[1]) + next_sent_feat = fluid.layers.fc( + input=next_sent_feat, + size=self._emb_size, + act="tanh", + param_attr=fluid.ParamAttr(name="pooled_fc.w_0", initializer=self._param_initializer), + bias_attr="pooled_fc.b_0", + ) + return next_sent_feat + + def get_lm_output(self, mask_label, mask_pos): + """Get the loss & accuracy for pretraining""" + + mask_pos = fluid.layers.cast(x=mask_pos, dtype="int32") + + # extract the first token feature in each sentence + self.next_sent_feat = self.get_pooled_output() + reshaped_emb_out = fluid.layers.reshape(x=self._enc_out, 
shape=[-1, self._emb_size]) + # extract masked tokens' feature + mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) + + # transform: fc + mask_trans_feat = fluid.layers.fc( + input=mask_feat, + size=self._emb_size, + act=self._hidden_act, + param_attr=fluid.ParamAttr(name="mask_lm_trans_fc.w_0", initializer=self._param_initializer), + bias_attr=fluid.ParamAttr(name="mask_lm_trans_fc.b_0"), + ) + + # transform: layer norm + mask_trans_feat = fluid.layers.layer_norm( + mask_trans_feat, + begin_norm_axis=len(mask_trans_feat.shape) - 1, + param_attr=fluid.ParamAttr( + name="mask_lm_trans_layer_norm_scale", initializer=fluid.initializer.Constant(1.0) + ), + bias_attr=fluid.ParamAttr( + name="mask_lm_trans_layer_norm_bias", initializer=fluid.initializer.Constant(1.0) + ), + ) + # transform: layer norm + # mask_trans_feat = pre_process_layer( + # mask_trans_feat, 'n', name='mask_lm_trans') + + mask_lm_out_bias_attr = fluid.ParamAttr( + name="mask_lm_out_fc.b_0", initializer=fluid.initializer.Constant(value=0.0) + ) + if self._weight_sharing: + fc_out = paddle.matmul( + x=mask_trans_feat, + y=fluid.default_main_program().global_block().var(self._word_emb_name), + transpose_y=True, + ) + fc_out += fluid.layers.create_parameter( + shape=[self._voc_size], dtype=self._emb_dtype, attr=mask_lm_out_bias_attr, is_bias=True + ) + + else: + fc_out = fluid.layers.fc( + input=mask_trans_feat, + size=self._voc_size, + param_attr=fluid.ParamAttr(name="mask_lm_out_fc.w_0", initializer=self._param_initializer), + bias_attr=mask_lm_out_bias_attr, + ) + + mask_lm_loss = fluid.layers.softmax_with_cross_entropy(logits=fc_out, label=mask_label) + mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) + + return mean_mask_lm_loss + + def get_task_output(self, task, task_labels): + task_fc_out = fluid.layers.fc( + input=self.next_sent_feat, + size=task["num_labels"], + param_attr=fluid.ParamAttr(name=task["task_name"] + "_fc.w_0", initializer=self._param_initializer), + bias_attr=task["task_name"] + "_fc.b_0", + ) + task_loss, task_softmax = fluid.layers.softmax_with_cross_entropy( + logits=task_fc_out, label=task_labels, return_softmax=True + ) + task_acc = fluid.layers.accuracy(input=task_softmax, label=task_labels) + mean_task_loss = fluid.layers.mean(task_loss) + return mean_task_loss, task_acc diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/model/transformer_encoder.py b/applications/document_intelligence/doc_vqa/Rerank/src/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..5fcef783bcded4f861934f49b188700747367647 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/model/transformer_encoder.py @@ -0,0 +1,318 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
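+#
+# Static-graph implementation of the transformer encoder stack used by ErnieModel:
+# multi_head_attention and positionwise_feed_forward are wrapped with
+# pre/post-process layers (residual connection, layer normalization, dropout).
+# encoder() stacks n_layer encoder_layer blocks and returns (enc_output,
+# checkpoints), where checkpoints collects the per-layer feed-forward outputs.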
+"""Transformer encoder.""" + +from __future__ import absolute_import, division, print_function + +from functools import partial + +import paddle +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention( + queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0.0, + cache=None, + param_initializer=None, + name="multi_head_att", +): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError("Inputs: queries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc( + input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + "_query_fc.w_0", initializer=param_initializer), + bias_attr=name + "_query_fc.b_0", + ) + k = layers.fc( + input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + "_key_fc.w_0", initializer=param_initializer), + bias_attr=name + "_key_fc.b_0", + ) + v = layers.fc( + input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + "_value_fc.w_0", initializer=param_initializer), + bias_attr=name + "_value_fc.b_0", + ) + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of input tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape(x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of input tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: + return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. 
+ return layers.reshape(x=trans_x, shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key**-0.5) + product = paddle.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout( + weights, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False + ) + out = paddle.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = cache["k"] = layers.concat([layers.reshape(cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat([layers.reshape(cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc( + input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + "_output_fc.w_0", initializer=param_initializer), + bias_attr=name + "_output_fc.b_0", + ) + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name="ffn"): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc( + input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr(name=name + "_fc_0.w_0", initializer=param_initializer), + bias_attr=name + "_fc_0.b_0", + ) + if dropout_rate: + hidden = layers.dropout( + hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False + ) + out = layers.fc( + input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr(name=name + "_fc_1.w_0", initializer=param_initializer), + bias_attr=name + "_fc_1.b_0", + ) + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.0, name=""): + """ + Add residual connection, layer normalization and dropout to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. 
+ """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out_dtype = out.dtype + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float32") + out = layers.layer_norm( + out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr( + name=name + "_layer_norm_scale", initializer=fluid.initializer.Constant(1.0) + ), + bias_attr=fluid.ParamAttr(name=name + "_layer_norm_bias", initializer=fluid.initializer.Constant(0.0)), + ) + if out_dtype == fluid.core.VarDesc.VarType.FP16: + out = layers.cast(x=out, dtype="float16") + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout( + out, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False + ) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_layer( + enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name="", +): + """The encoder layers that can be stacked to form a deep encoder. + This module consists of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and dropout. + """ + attn_output = multi_head_attention( + pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + "_pre_att"), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + "_multi_head_att", + ) + attn_output = post_process_layer( + enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + "_post_att" + ) + ffd_output = positionwise_feed_forward( + pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + "_pre_ffn"), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + "_ffn", + ) + return ( + post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + "_post_ffn"), + ffd_output, + ) + + +def encoder( + enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + model_name="", + name="", +): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. 
+ """ + checkpoints = [] + for i in range(n_layer): + enc_output, cp = encoder_layer( + enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + "_layer_" + str(i), + ) + checkpoints.append(cp) + enc_input = enc_output + enc_output = pre_process_layer( + enc_output, preprocess_cmd, prepostprocess_dropout, name=model_name + "post_encoder" + ) + + return enc_output, checkpoints diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/optimization.py b/applications/document_intelligence/doc_vqa/Rerank/src/optimization.py new file mode 100644 index 0000000000000000000000000000000000000000..e724f1028f19e469e4759d399f637763c9007db0 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/optimization.py @@ -0,0 +1,121 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Optimization and learning rate scheduling.""" + +import paddle.fluid as fluid +from paddle.fluid.incubate.fleet.collective import fleet + + +def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps): + """Applies linear warmup of learning rate from 0 and decay to 0.""" + with fluid.default_main_program()._lr_schedule_guard(): + lr = fluid.layers.tensor.create_global_var( + shape=[1], value=0.0, dtype="float32", persistable=True, name="scheduled_learning_rate" + ) + + global_step = fluid.layers.learning_rate_scheduler._decay_step_counter() + + with fluid.layers.control_flow.Switch() as switch: + with switch.case(global_step < warmup_steps): + warmup_lr = learning_rate * (global_step / warmup_steps) + fluid.layers.tensor.assign(warmup_lr, lr) + with switch.default(): + decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay( + learning_rate=learning_rate, + decay_steps=num_train_steps, + end_learning_rate=0.0, + power=1.0, + cycle=False, + ) + fluid.layers.tensor.assign(decayed_lr, lr) + + return lr + + +def optimization( + loss, + warmup_steps, + num_train_steps, + learning_rate, + train_program, + startup_prog, + weight_decay, + scheduler="linear_warmup_decay", + use_dynamic_loss_scaling=False, + incr_every_n_steps=1000, + decr_every_n_nan_or_inf=2, + incr_ratio=2.0, + decr_ratio=0.8, + dist_strategy=None, + use_lamb=False, +): + if warmup_steps > 0: + if scheduler == "noam_decay": + scheduled_lr = fluid.layers.learning_rate_scheduler.noam_decay( + 1 / (warmup_steps * (learning_rate**2)), warmup_steps + ) + elif scheduler == "linear_warmup_decay": + scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps, num_train_steps) + else: + raise ValueError("Unknown learning rate scheduler, should be " "'noam_decay' or 'linear_warmup_decay'") + if use_lamb: + optimizer = fluid.optimizer.LambOptimizer(learning_rate=scheduled_lr) + else: + optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr) + else: + scheduled_lr = 
fluid.layers.create_global_var( + name=fluid.unique_name.generate("learning_rate"), + shape=[1], + value=learning_rate, + dtype="float32", + persistable=True, + ) + if use_lamb: + optimizer = fluid.optimizer.LambOptimizer(learning_rate=scheduled_lr) + else: + optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr) + optimizer._learning_rate_map[fluid.default_main_program()] = scheduled_lr + + fluid.clip.set_gradient_clip(clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)) + + def exclude_from_weight_decay(name): + if name.find("layer_norm") > -1: + return True + bias_suffix = ["_bias", "_b", ".b_0"] + for suffix in bias_suffix: + if name.endswith(suffix): + return True + return False + + param_list = dict() + + for param in train_program.global_block().all_parameters(): + param_list[param.name] = param * 1.0 + param_list[param.name].stop_gradient = True + + if dist_strategy is not None: + # use fleet api + optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) + + _, param_grads = optimizer.minimize(loss) + + if weight_decay > 0: + for param, grad in param_grads: + if exclude_from_weight_decay(param.name): + continue + with param.block.program._optimized_guard([param, grad]), fluid.framework.name_scope("weight_decay"): + updated_param = param - param_list[param.name] * weight_decay * scheduled_lr + fluid.layers.assign(output=param, input=updated_param) + + return scheduled_lr diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/reader_ce.py b/applications/document_intelligence/doc_vqa/Rerank/src/reader_ce.py new file mode 100644 index 0000000000000000000000000000000000000000..8c08ade51b892044ed9a977f6ffdb2bb17fcc60d --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/reader_ce.py @@ -0,0 +1,314 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
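+#
+# Data reader for the cross-encoder: ClassifyReader parses tab-separated
+# query \t title \t para \t label examples, tokenizes them with the BERT-style
+# FullTokenizer, and yields padded batches of token/type/position/task ids plus
+# the attention input_mask (and labels/qids when not in inference mode).
+#
+# Minimal usage sketch (illustrative only; file paths and batch settings are
+# placeholders):
+#
+#     reader = ClassifyReader(vocab_path="vocab.txt", max_seq_len=384, for_cn=True)
+#     gen = reader.data_generator("train.tsv", batch_size=32, epoch=1, shuffle=False)
+#     for batch in gen():
+#         token_ids, type_ids, pos_ids, task_ids, input_mask, labels, qids = batch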
+ +import json +import logging +import sys +from collections import namedtuple +from io import open + +import numpy as np +import six +import tokenization +from batching import pad_batch_data + +log = logging.getLogger(__name__) + +if six.PY3: + import io + + sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8") + sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding="utf-8") + + +def csv_reader(fd, delimiter="\t", trainer_id=0, trainer_num=1): + def gen(): + for i, line in enumerate(fd): + if i % trainer_num == trainer_id: + slots = line.rstrip("\n").split(delimiter) + if len(slots) == 1: + yield slots, + else: + yield slots + + return gen() + + +class BaseReader(object): + def __init__( + self, + vocab_path, + label_map_config=None, + max_seq_len=512, + total_num=0, + do_lower_case=True, + in_tokens=False, + is_inference=False, + random_seed=None, + tokenizer="FullTokenizer", + for_cn=True, + task_id=0, + ): + self.max_seq_len = max_seq_len + self.tokenizer = tokenization.FullTokenizer(vocab_file=vocab_path, do_lower_case=do_lower_case) + self.vocab = self.tokenizer.vocab + self.pad_id = self.vocab["[PAD]"] + self.cls_id = self.vocab["[CLS]"] + self.sep_id = self.vocab["[SEP]"] + self.in_tokens = in_tokens + self.is_inference = is_inference + self.for_cn = for_cn + self.task_id = task_id + + np.random.seed(random_seed) + + self.current_example = 0 + self.current_epoch = 0 + self.num_examples = 0 + self.total_num = total_num + + if label_map_config: + with open(label_map_config, encoding="utf8") as f: + self.label_map = json.load(f) + else: + self.label_map = None + + def get_train_progress(self): + """Gets progress for training phase.""" + return self.current_example, self.current_epoch + + def _read_tsv(self, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="utf8") as f: + reader = csv_reader(f) + headers = next(reader) + Example = namedtuple("Example", headers) + + examples = [] + for line in reader: + example = Example(*line) + examples.append(example) + return examples + + def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + + def _convert_example_to_record(self, example, max_seq_length, tokenizer): + """Converts a single `Example` into a single `Record`.""" + + query = tokenization.convert_to_unicode(example.query) + tokens_a = tokenizer.tokenize(query) + tokens_b = None + + title = tokenization.convert_to_unicode(example.title) + tokens_b = tokenizer.tokenize(title) + + para = tokenization.convert_to_unicode(example.para) + tokens_para = tokenizer.tokenize(para) + + tokens_b.extend(tokens_para) + + self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) + + # The convention in BERT/ERNIE is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . 
[SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambiguously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. + # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. + tokens = [] + text_type_ids = [] + tokens.append("[CLS]") + text_type_ids.append(0) + for token in tokens_a: + tokens.append(token) + text_type_ids.append(0) + tokens.append("[SEP]") + text_type_ids.append(0) + + if tokens_b: + for token in tokens_b: + tokens.append(token) + text_type_ids.append(1) + tokens.append("[SEP]") + text_type_ids.append(1) + + token_ids = tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(token_ids))) + + if self.is_inference: + Record = namedtuple("Record", ["token_ids", "text_type_ids", "position_ids"]) + record = Record(token_ids=token_ids, text_type_ids=text_type_ids, position_ids=position_ids) + else: + if self.label_map: + label_id = self.label_map[example.label] + else: + label_id = example.label + + Record = namedtuple("Record", ["token_ids", "text_type_ids", "position_ids", "label_id", "qid"]) + + qid = None + if "qid" in example._fields: + qid = example.qid + + record = Record( + token_ids=token_ids, text_type_ids=text_type_ids, position_ids=position_ids, label_id=label_id, qid=qid + ) + return record + + def _prepare_batch_data(self, examples, batch_size, phase=None): + """generate batch records""" + batch_records, max_len = [], 0 + for index, example in enumerate(examples): + if phase == "train": + self.current_example = index + record = self._convert_example_to_record(example, self.max_seq_len, self.tokenizer) + max_len = max(max_len, len(record.token_ids)) + if self.in_tokens: + to_append = (len(batch_records) + 1) * max_len <= batch_size + else: + to_append = len(batch_records) < batch_size + if to_append: + batch_records.append(record) + else: + yield self._pad_batch_records(batch_records) + batch_records, max_len = [record], len(record.token_ids) + + if batch_records: + yield self._pad_batch_records(batch_records) + + def get_num_examples(self, input_file): + # examples = self._read_tsv(input_file) + # return len(examples) + return self.num_examples + + def data_generator( + self, input_file, batch_size, epoch, dev_count=1, trainer_id=0, trainer_num=1, shuffle=True, phase=None + ): + + if phase == "train": + # examples = examples[trainer_id: (len(examples) //trainer_num) * trainer_num : trainer_num] + self.num_examples_per_node = self.total_num // trainer_num + self.num_examples = self.num_examples_per_node * trainer_num + examples = self._read_tsv( + input_file, trainer_id=trainer_id, trainer_num=trainer_num, num_examples=self.num_examples_per_node + ) + log.info("apply sharding %d/%d" % (trainer_id, trainer_num)) + else: + examples = self._read_tsv(input_file) + + def wrapper(): + all_dev_batches = [] + for epoch_index in range(epoch): + if phase == "train": + self.current_example = 0 + self.current_epoch = epoch_index + if shuffle: + np.random.shuffle(examples) + + for batch_data in self._prepare_batch_data(examples, batch_size, phase=phase): + if len(all_dev_batches) 
< dev_count: + all_dev_batches.append(batch_data) + if len(all_dev_batches) == dev_count: + for batch in all_dev_batches: + yield batch + all_dev_batches = [] + + def f(): + try: + for i in wrapper(): + yield i + except Exception: + import traceback + + traceback.print_exc() + + return f + + +class ClassifyReader(BaseReader): + def _read_tsv(self, input_file, quotechar=None, trainer_id=0, trainer_num=1, num_examples=0): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="utf8") as f: + reader = csv_reader(f, trainer_id=trainer_id, trainer_num=trainer_num) + # headers = next(reader) + headers = "query\ttitle\tpara\tlabel".split("\t") + text_indices = [index for index, h in enumerate(headers) if h != "label"] + Example = namedtuple("Example", headers) + + examples = [] + for cnt, line in enumerate(reader): + if num_examples != 0 and cnt == num_examples: + break + for index, text in enumerate(line): + if index in text_indices: + if self.for_cn: + line[index] = text.replace(" ", "") + else: + line[index] = text + example = Example(*line) + examples.append(example) + return examples + + def _pad_batch_records(self, batch_records): + batch_token_ids = [record.token_ids for record in batch_records] + batch_text_type_ids = [record.text_type_ids for record in batch_records] + batch_position_ids = [record.position_ids for record in batch_records] + + if not self.is_inference: + batch_labels = [record.label_id for record in batch_records] + batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + + if batch_records[0].qid: + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + else: + batch_qids = np.array([]).astype("int64").reshape([-1, 1]) + + # padding + padded_token_ids, input_mask = pad_batch_data(batch_token_ids, pad_idx=self.pad_id, return_input_mask=True) + padded_text_type_ids = pad_batch_data(batch_text_type_ids, pad_idx=self.pad_id) + padded_position_ids = pad_batch_data(batch_position_ids, pad_idx=self.pad_id) + padded_task_ids = np.ones_like(padded_token_ids, dtype="int64") * self.task_id + + return_list = [padded_token_ids, padded_text_type_ids, padded_position_ids, padded_task_ids, input_mask] + if not self.is_inference: + return_list += [batch_labels, batch_qids] + + return return_list diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/tokenization.py b/applications/document_intelligence/doc_vqa/Rerank/src/tokenization.py new file mode 100644 index 0000000000000000000000000000000000000000..549b0126966b8a4992e7eb2a2ea319ad3ed946aa --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/tokenization.py @@ -0,0 +1,400 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
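+#
+# BERT-style tokenization utilities used by the reader: BasicTokenizer handles
+# lower-casing, accent stripping, punctuation and CJK character splitting, and
+# WordpieceTokenizer applies greedy longest-match-first subword splitting.
+#
+# Minimal usage sketch (illustrative only; the vocab path is a placeholder):
+#
+#     tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
+#     tokens = tokenizer.tokenize("NFC咋开门")
+#     ids = tokenizer.convert_tokens_to_ids(tokens)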
+"""Tokenization classes.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from __future__ import absolute_import + +import collections +import unicodedata +from io import open + + +def convert_to_unicode(text): + """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + + +def printable_text(text): + """Returns text encoded in a way suitable for print or `tf.logging`.""" + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + with open(vocab_file, encoding="utf8") as fin: + for num, line in enumerate(fin): + items = convert_to_unicode(line.strip()).split("\t") + if len(items) > 2: + break + token = items[0] + index = items[1] if len(items) == 2 else num + token = token.strip() + vocab[token] = int(index) + return vocab + + +def convert_by_vocab(vocab, items): + """Converts a sequence of [tokens|ids] using the vocab.""" + output = [] + for item in items: + output.append(vocab[item]) + return output + + +def convert_tokens_to_ids(vocab, tokens): + return convert_by_vocab(vocab, tokens) + + +def convert_ids_to_tokens(inv_vocab, ids): + return convert_by_vocab(inv_vocab, ids) + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class FullTokenizer(object): + """Runs end-to-end tokenization.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + split_tokens = [] + for token in self.basic_tokenizer.tokenize(text): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + + return split_tokens + + def convert_tokens_to_ids(self, tokens): + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + return convert_by_vocab(self.inv_vocab, ids) + + +class CharTokenizer(object): + """Runs end-to-end tokenization.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + split_tokens = [] + for token in text.lower().split(" "): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + + return split_tokens + + def convert_tokens_to_ids(self, tokens): + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + return convert_by_vocab(self.inv_vocab, ids) + + +class BasicTokenizer(object): + """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" + + def __init__(self, do_lower_case=True): + """Constructs a BasicTokenizer. + + Args: + do_lower_case: Whether to lower case the input. 
+ """ + self.do_lower_case = do_lower_case + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = convert_to_unicode(text) + text = self._clean_text(text) + + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). + text = self._tokenize_chinese_chars(text) + + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if self.do_lower_case: + token = token.lower() + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text): + """Splits punctuation on a piece of text.""" + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. 
+ if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xFFFD or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenization.""" + + def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """Tokenizes a piece of text into its word pieces. + + This uses a greedy longest-match-first algorithm to perform tokenization + using the given vocabulary. + + For example: + input = "unaffable" + output = ["un", "##aff", "##able"] + + Args: + text: A single token or whitespace separated tokens. This should have + already been passed through `BasicTokenizer. + + Returns: + A list of wordpiece tokens. + """ + + text = convert_to_unicode(text) + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically control characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False + + +def _is_punctuation(char): + """Checks whether `chars` is a punctuation character.""" + cp = ord(char) + # We treat all non-letter/number ASCII as punctuation. + # Characters such as "^", "$", and "`" are not in the Unicode + # Punctuation class but we treat them as punctuation anyways, for + # consistency. 
+ if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126): + return True + cat = unicodedata.category(char) + if cat.startswith("P"): + return True + return False + + +def tokenize_chinese_chars(text): + """Adds whitespace around any CJK character.""" + + def _is_chinese_char(cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + def _is_whitespace(c): + if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: + return True + return False + + output = [] + buff = "" + for char in text: + cp = ord(char) + if _is_chinese_char(cp) or _is_whitespace(char): + if buff != "": + output.append(buff) + buff = "" + output.append(char) + else: + buff += char + + if buff != "": + output.append(buff) + + return output diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/train_ce.py b/applications/document_intelligence/doc_vqa/Rerank/src/train_ce.py new file mode 100644 index 0000000000000000000000000000000000000000..4620782083dd3b907fe8e1c84a52a0a03f29854e --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/train_ce.py @@ -0,0 +1,354 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Finetuning on classification tasks.""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import logging +import multiprocessing +import os +import time +import warnings + +# NOTE(paddle-dev): All of these flags should be +# set before `import paddle`. Otherwise, it would +# not take any effect. 
+os.environ["FLAGS_eager_delete_tensor_gb"] = "0" # enable gc + +import paddle # noqa: E402 +import paddle.fluid as fluid # noqa: E402 + +if hasattr(paddle, "enable_static"): + paddle.enable_static() +import paddle.fluid.incubate.fleet.base.role_maker as role_maker # noqa: E402 +import reader_ce as reader_ce # noqa: E402 +from cross_encoder import create_model, evaluate, predict # noqa: E402 +from finetune_args import parser # noqa: E402 +from model.ernie import ErnieConfig # noqa: E402 +from optimization import optimization # noqa: E402 +from paddle.fluid.incubate.fleet.collective import ( # noqa: E402 + DistributedStrategy, + fleet, +) +from src.utils.args import check_cuda, prepare_logger, print_arguments # noqa: E402 +from src.utils.init import init_checkpoint, init_pretraining_params # noqa: E402 + +warnings.filterwarnings("ignore") +args = parser.parse_args() +log = logging.getLogger() + + +def main(args): + ernie_config = ErnieConfig(args.ernie_config_path) + ernie_config.print_config() + + if args.use_cuda: + dev_list = fluid.cuda_places() + place = dev_list[0] + dev_count = len(dev_list) + else: + place = fluid.CPUPlace() + dev_count = int(os.environ.get("CPU_NUM", multiprocessing.cpu_count())) + exe = fluid.Executor(place) + + reader = reader_ce.ClassifyReader( + vocab_path=args.vocab_path, + label_map_config=args.label_map_config, + max_seq_len=args.max_seq_len, + total_num=args.train_data_size, + do_lower_case=args.do_lower_case, + in_tokens=args.in_tokens, + random_seed=args.random_seed, + tokenizer=args.tokenizer, + for_cn=args.for_cn, + task_id=args.task_id, + ) + + if not (args.do_train or args.do_val or args.do_test): + raise ValueError("For args `do_train`, `do_val` and `do_test`, at " "least one of them must be True.") + + if args.do_test: + assert args.test_save is not None + startup_prog = fluid.Program() + if args.random_seed is not None: + startup_prog.random_seed = args.random_seed + + if args.predict_batch_size is None: + args.predict_batch_size = args.batch_size + + if args.do_train: + role = role_maker.PaddleCloudRoleMaker(is_collective=True) + fleet.init(role) + dev_count = fleet.worker_num() + + train_data_generator = reader.data_generator( + input_file=args.train_set, + batch_size=args.batch_size, + epoch=args.epoch, + dev_count=1, + trainer_id=fleet.worker_index(), + trainer_num=fleet.worker_num(), + shuffle=True, + phase="train", + ) + + num_train_examples = reader.get_num_examples(args.train_set) + + if args.in_tokens: + max_train_steps = args.epoch * num_train_examples // (args.batch_size // args.max_seq_len) // dev_count + else: + max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count + + warmup_steps = int(max_train_steps * args.warmup_proportion) + log.info("Device count: %d" % dev_count) + log.info("Num train examples: %d" % num_train_examples) + log.info("Max train steps: %d" % max_train_steps) + log.info("Num warmup steps: %d" % warmup_steps) + + train_program = fluid.Program() + + # use fleet api + exec_strategy = fluid.ExecutionStrategy() + if args.use_fast_executor: + exec_strategy.use_experimental_executor = True + exec_strategy.num_threads = dev_count + if args.is_distributed: + exec_strategy.num_threads = 3 + + exec_strategy.num_iteration_per_drop_scope = args.num_iteration_per_drop_scope + + dist_strategy = DistributedStrategy() + dist_strategy.exec_strategy = exec_strategy + dist_strategy.nccl_comm_num = 1 + if args.is_distributed: + dist_strategy.nccl_comm_num = 2 + dist_strategy.use_hierarchical_allreduce = True + 
+ if args.use_mix_precision: + dist_strategy.use_amp = True + + with fluid.program_guard(train_program, startup_prog): + with fluid.unique_name.guard(): + train_pyreader, graph_vars = create_model( + args, pyreader_name="train_reader", ernie_config=ernie_config + ) + scheduled_lr = optimization( + loss=graph_vars["loss"], + warmup_steps=warmup_steps, + num_train_steps=max_train_steps, + learning_rate=args.learning_rate, + train_program=train_program, + startup_prog=startup_prog, + weight_decay=args.weight_decay, + scheduler=args.lr_scheduler, + use_dynamic_loss_scaling=args.use_dynamic_loss_scaling, + incr_every_n_steps=args.incr_every_n_steps, + decr_every_n_nan_or_inf=args.decr_every_n_nan_or_inf, + incr_ratio=args.incr_ratio, + decr_ratio=args.decr_ratio, + dist_strategy=dist_strategy, + ) + + if args.verbose: + if args.in_tokens: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, batch_size=args.batch_size // args.max_seq_len + ) + else: + lower_mem, upper_mem, unit = fluid.contrib.memory_usage( + program=train_program, batch_size=args.batch_size + ) + log.info("Theoretical memory usage in training: %.3f - %.3f %s" % (lower_mem, upper_mem, unit)) + + if args.do_val or args.do_test: + test_prog = fluid.Program() + with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard(): + test_pyreader, graph_vars = create_model( + args, pyreader_name="test_reader", ernie_config=ernie_config, is_prediction=True + ) + + test_prog = test_prog.clone(for_test=True) + + train_program = fleet.main_program + + exe = fluid.Executor(place) + exe.run(startup_prog) + + if args.do_train: + if args.init_checkpoint and args.init_pretraining_params: + log.warning( + "WARNING: args 'init_checkpoint' and 'init_pretraining_params' " + "both are set! Only arg 'init_checkpoint' is made valid." 
+ ) + if args.init_checkpoint: + init_checkpoint(exe, args.init_checkpoint, main_program=startup_prog) + elif args.init_pretraining_params: + init_pretraining_params(exe, args.init_pretraining_params, main_program=startup_prog) + elif args.do_val or args.do_test: + if not args.init_checkpoint: + raise ValueError("args 'init_checkpoint' should be set if" "only doing validation or testing!") + init_checkpoint(exe, args.init_checkpoint, main_program=startup_prog) + + if args.do_train: + train_exe = exe + train_pyreader.decorate_tensor_provider(train_data_generator) + else: + train_exe = None + + test_exe = exe + + current_epoch = 0 + steps = 0 + if args.do_train: + train_pyreader.start() + if warmup_steps > 0: + graph_vars["learning_rate"] = scheduled_lr + + ce_info = [] + time_begin = time.time() + last_epoch = 0 + while True: + try: + steps += 1 + + if fleet.worker_index() != 0: + train_exe.run(fetch_list=[], program=train_program) + continue + + if steps % args.skip_steps != 0: + train_exe.run(fetch_list=[], program=train_program) + + else: + outputs = evaluate( + train_exe, train_program, train_pyreader, graph_vars, "train", metric=args.metric + ) + + if args.verbose: + verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size() + verbose += "learning rate: %f" % ( + outputs["learning_rate"] if warmup_steps > 0 else args.learning_rate + ) + log.info(verbose) + + current_example, current_epoch = reader.get_train_progress() + time_end = time.time() + used_time = time_end - time_begin + + log.info( + "epoch: %d, progress: %d/%d, step: %d, ave loss: %f, " + "ave acc: %f, speed: %f steps/s" + % ( + current_epoch, + current_example * dev_count, + num_train_examples, + steps, + outputs["loss"], + outputs["accuracy"], + args.skip_steps / used_time, + ) + ) + ce_info.append([outputs["loss"], outputs["accuracy"], used_time]) + + time_begin = time.time() + + if steps % args.save_steps == 0: + save_path = os.path.join(args.checkpoints, "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, fleet._origin_program) + + if steps % args.validation_steps == 0: + # evaluate dev set + if args.do_val: + evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, current_epoch, steps) + + if args.do_test: + predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, current_epoch, steps) + + if last_epoch != current_epoch: + last_epoch = current_epoch + + except fluid.core.EOFException: + save_path = os.path.join(args.checkpoints, "step_" + str(steps)) + fluid.io.save_persistables(exe, save_path, fleet._origin_program) + train_pyreader.reset() + break + + # final eval on dev set + if args.do_val: + evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, current_epoch, steps) + + # final eval on test set + if args.do_test: + predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars) + + # final eval on diagnostic, hack for glue-ax + if args.diagnostic: + test_pyreader.decorate_tensor_provider( + reader.data_generator(args.diagnostic, batch_size=args.batch_size, epoch=1, dev_count=1, shuffle=False) + ) + + log.info("Final diagnostic") + qids, preds, probs = predict(test_exe, test_prog, test_pyreader, graph_vars) + assert len(qids) == len(preds), "{} v.s. 
{}".format(len(qids), len(preds)) + with open(args.diagnostic_save, "w") as f: + for id, s, p in zip(qids, preds, probs): + f.write("{}\t{}\t{}\n".format(id, s, p)) + + log.info("Done final diagnostic, saving to {}".format(args.diagnostic_save)) + + +def evaluate_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, epoch, steps): + # evaluate dev set + for ds in args.dev_set.split(","): + test_pyreader.decorate_tensor_provider( + reader.data_generator(ds, batch_size=args.predict_batch_size, epoch=1, dev_count=1, shuffle=False) + ) + log.info("validation result of dataset {}:".format(ds)) + evaluate_info = evaluate(exe, test_prog, test_pyreader, graph_vars, "dev", metric=args.metric) + log.info(evaluate_info + ", file: {}, epoch: {}, steps: {}".format(ds, epoch, steps)) + + +def predict_wrapper(args, reader, exe, test_prog, test_pyreader, graph_vars, epoch=None, steps=None): + test_sets = args.test_set.split(",") + save_dirs = args.test_save.split(",") + assert len(test_sets) == len(save_dirs) + + for test_f, save_f in zip(test_sets, save_dirs): + test_pyreader.decorate_tensor_provider( + reader.data_generator(test_f, batch_size=args.predict_batch_size, epoch=1, dev_count=1, shuffle=False) + ) + + if epoch is not None or steps is not None: + save_path = save_f + "." + str(epoch) + "." + str(steps) + else: + save_path = save_f + log.info("testing {}, save to {}".format(test_f, save_path)) + qids, preds, probs = predict(exe, test_prog, test_pyreader, graph_vars) + + save_dir = os.path.dirname(save_path) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + else: + log.warning("save dir exists: %s, will skip saving" % save_dir) + + with open(save_path, "w") as f: + for p in probs: + f.write("{}\n".format(p[1])) + + +if __name__ == "__main__": + prepare_logger(log) + print_arguments(args) + check_cuda(args.use_cuda) + main(args) diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/utils/args.py b/applications/document_intelligence/doc_vqa/Rerank/src/utils/args.py new file mode 100644 index 0000000000000000000000000000000000000000..a00d3d867a2549a4253af785ec14b7422894eaa4 --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/utils/args.py @@ -0,0 +1,71 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Arguments for configuration.""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import logging +import os +import sys + +import paddle.fluid as fluid +import six + +from paddlenlp.trainer.argparser import strtobool + +log = logging.getLogger(__name__) + + +def prepare_logger(logger, debug=False, save_to_file=None): + formatter = logging.Formatter(fmt="[%(levelname)s] %(asctime)s [%(filename)12s:%(lineno)5d]:\t%(message)s") + console_hdl = logging.StreamHandler() + console_hdl.setFormatter(formatter) + logger.addHandler(console_hdl) + if save_to_file is not None and not os.path.exists(save_to_file): + file_hdl = logging.FileHandler(save_to_file) + file_hdl.setFormatter(formatter) + logger.addHandler(file_hdl) + logger.setLevel(logging.DEBUG) + logger.propagate = False + + +class ArgumentGroup(object): + def __init__(self, parser, title, des): + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, positional_arg=False, **kwargs): + prefix = "" if positional_arg else "--" + type = strtobool if type == bool else type + self._group.add_argument( + prefix + name, default=default, type=type, help=help + " Default: %(default)s.", **kwargs + ) + + +def print_arguments(args): + log.info("----------- Configuration Arguments -----------") + for arg, value in sorted(six.iteritems(vars(args))): + log.info("%s: %s" % (arg, value)) + log.info("------------------------------------------------") + + +def check_cuda( + use_cuda, + err="\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \ + Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n", +): + try: + if use_cuda is True and fluid.is_compiled_with_cuda() is False: + log.error(err) + sys.exit(1) + except Exception: + pass diff --git a/applications/document_intelligence/doc_vqa/Rerank/src/utils/init.py b/applications/document_intelligence/doc_vqa/Rerank/src/utils/init.py new file mode 100644 index 0000000000000000000000000000000000000000..8ba377fd83597cdaadde05698000f60737b56cfe --- /dev/null +++ b/applications/document_intelligence/doc_vqa/Rerank/src/utils/init.py @@ -0,0 +1,52 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os + +import paddle.fluid as fluid + +log = logging.getLogger(__name__) + + +def init_checkpoint(exe, init_checkpoint_path, main_program): + assert os.path.exists(init_checkpoint_path), "[%s] cann't be found." 
% init_checkpoint_path + + def existed_persitables(var): + if not fluid.io.is_persistable(var): + return False + if not os.path.exists(os.path.join(init_checkpoint_path, var.name)): + print("Var not exists: [%s]\t%s" % (var.name, os.path.join(init_checkpoint_path, var.name))) + # else: + # print ("Var exists: [%s]" % (var.name)) + return os.path.exists(os.path.join(init_checkpoint_path, var.name)) + + fluid.io.load_vars(exe, init_checkpoint_path, main_program=main_program, predicate=existed_persitables) + log.info("Load model from {}".format(init_checkpoint_path)) + + +def init_pretraining_params(exe, pretraining_params_path, main_program): + assert os.path.exists(pretraining_params_path), "[%s] cann't be found." % pretraining_params_path + + def existed_params(var): + if not isinstance(var, fluid.framework.Parameter): + return False + if not os.path.exists(os.path.join(pretraining_params_path, var.name)): + print("Var not exists: [%s]\t%s" % (var.name, os.path.join(pretraining_params_path, var.name))) + # else: + # print ("Var exists: [%s]" % (var.name)) + return os.path.exists(os.path.join(pretraining_params_path, var.name)) + + fluid.io.load_vars(exe, pretraining_params_path, main_program=main_program, predicate=existed_params) + log.info("Load pretraining parameters from {}.".format(pretraining_params_path)) diff --git a/applications/document_intelligence/doc_vqa/run_test.sh b/applications/document_intelligence/doc_vqa/run_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..2e742a0fa8cbb1a01f266c7bd522e3a67fe1660e --- /dev/null +++ b/applications/document_intelligence/doc_vqa/run_test.sh @@ -0,0 +1,34 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +QUESTION=$1 + +# Question: NFC咋开门 + +if [ $# != 1 ];then + echo "USAGE: sh script/run_cross_encoder_test.sh \$QUESTION" + exit 1 +fi + +# compute scores for QUESTION and OCR parsing results with Rerank module +cd Rerank +bash run_test.sh ${QUESTION} +cd .. + +# extraction answer for QUESTION from the top1 of rank +cd Extraction +bash run_test.sh ${QUESTION} +cd .. diff --git a/applications/document_intelligence/docprompt/README.md b/applications/document_intelligence/docprompt/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b323943aca9e7c08cf378027a105b6d2fbb81150 --- /dev/null +++ b/applications/document_intelligence/docprompt/README.md @@ -0,0 +1 @@ +[ERNIE-Layout](../../../model_zoo/ernie-layout) diff --git a/applications/information_extraction/README.md b/applications/information_extraction/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c85a0912cae48cf688d248449ba1235ee01370fc --- /dev/null +++ b/applications/information_extraction/README.md @@ -0,0 +1,171 @@ +简体中文 | [English](README_en.md) + +# 信息抽取应用 + +**目录** +- [1. 信息抽取应用简介](#1) +- [2. 技术特色](#2) + - [2.1 信息抽取方案全覆盖](#21) + - [2.2 强大的训练基座](#22) + - [2.3 产业级全流程方案](#23) + - [2.4 效果展示](#24) +- [3. 
快速开始](#快速开始) + - [3.1 Taskflow开箱即用](#31) + - [3.2 文本信息抽取](#32) + - [3.3 文档信息抽取](#33) + + + +## 1. 信息抽取应用简介 + +信息抽取应用针对信息抽取一系列高频场景开源了产业级解决方案,**具备多领域、多任务、跨模态的能力**,打通**数据标注-模型训练-模型调优-预测部署全流程**,可快速实现信息抽取产品落地。 + +信息抽取通俗地说就是从给定的文本/图片等输入数据中抽取出结构化信息的过程。在信息抽取的落地过程中通常面临领域多变、任务多样、数据稀缺等许多挑战。针对信息抽取领域的难点和痛点,PaddleNLP信息抽取应用**基于UIE统一建模的思想**,提供了信息抽取产业级应用方案,**除支持纯文本场景实体、关系、事件、观点等不同任务抽取外,还支持文档/图片/表格的端到端信息抽取**。该应用**不限定行业领域和抽取目标**,可实现从产品原型研发、业务POC阶段到业务落地、迭代阶段的无缝衔接,助力开发者实现特定领域抽取场景的快速适配与落地。 + +**信息抽取应用亮点:** + +- **覆盖场景全面🎓:** 覆盖信息抽取各类主流任务,面向纯文本和文档场景,支持多语言,满足开发者多样信息抽取落地需求。 +- **效果领先🏃:** 以在纯文本、多模态上均有突出效果的UIE系列模型作为训练基座,提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **简单易用⚡:** 通过Taskflow实现三行代码可实现无标注数据的情况下进行快速调用,一行命令即可开启信息抽取训练,轻松完成部署上线,降低信息抽取技术落地门槛。 +- **高效调优✊:** 开发者无需机器学习背景知识,即可轻松上手数据标注及模型训练流程。 + + + +## 2. 技术特色 + + + +### 2.1 信息抽取方案全覆盖 + +多模型选择,满足精度、速度,适配不同信息抽取使用场景。 + +| 模型名称 | 使用场景 | 支持任务 | +| :----------------------------------------------------------: | :--------------------------------------------------------- | :--------------------------------------------------- | +| `uie-base`
`uie-medium`
`uie-mini`
`uie-micro`
`uie-nano` | 面向**纯文本**场景的**抽取式**模型,支持**中文** | 具备实体、关系、事件、评论观点等通用信息抽取能力 | +| `uie-base-en` | 面向**纯文本**场景的**抽取式**模型,支持**英文** | 具备实体、关系、事件、评论观点等通用信息抽取能力 | +| `uie-m-base`
`uie-m-large` | 面向**纯文本**场景的**抽取式**模型,支持**中英** | 具备实体、关系、事件、评论观点等通用信息抽取能力 | +| `uie-x-base` | 面向**纯文本**和**文档**场景的**抽取式**模型,支持**中英** | 支持纯文本场景的全部功能,还支持文档/图片/表格的端到端信息抽取 | + + + +### 2.2 强大的训练基座 + +信息抽取应用使用ERNIE 3.0轻量级模型作为预训练模型,同时在大量信息抽取数据上进行了二次预训练,从而让模型适配固定prompt。 + +- 中文文本数据集实验效果 + +我们在互联网、医疗、金融三大垂类文本自建测试集上进行了实验: + + +
+| 模型 | 金融 0-shot | 金融 5-shot | 医疗 0-shot | 医疗 5-shot | 互联网 0-shot | 互联网 5-shot |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| uie-base (12L768H) | 46.43 | 70.92 | 71.83 | 85.72 | 78.33 | 81.86 |
+| uie-medium (6L768H) | 41.11 | 64.53 | 65.40 | 75.72 | 78.32 | 79.68 |
+| uie-mini (6L384H) | 37.04 | 64.65 | 60.50 | 78.36 | 72.09 | 76.38 |
+| uie-micro (4L384H) | 37.53 | 62.11 | 57.04 | 75.92 | 66.00 | 70.22 |
+| uie-nano (4L312H) | 38.94 | 66.83 | 48.29 | 76.74 | 62.86 | 72.35 |
+| uie-m-large (24L1024H) | 49.35 | 74.55 | 70.50 | 92.66 | 78.49 | 83.02 |
+| uie-m-base (12L768H) | 38.46 | 74.31 | 63.37 | 87.32 | 76.27 | 80.13 |
+| 🧾 🎓uie-x-base (12L768H) | 48.84 | 73.87 | 65.60 | 88.81 | 79.36 | 81.65 |
+ +0-shot表示无训练数据直接通过```paddlenlp.Taskflow```进行预测,5-shot表示每个类别包含5条标注数据进行模型微调。**实验表明UIE在垂类场景可以通过少量数据(few-shot)进一步提升效果**。 + +- 多模态数据集实验效果 + +我们在通用、金融、医疗三大场景自建多模态测试集上对UIE-X的零样本效果进行了实验: + + +
+| 模型 | 通用 | 金融 | 医疗 |
+| :---: | :---: | :---: | :---: |
+| 🧾 🎓uie-x-base (12L768H) | 65.03 | 73.51 | 84.24 |
+ +通用测试集包含了不同领域的复杂样本,抽取难度最大。 + + + +### 2.3 产业级全流程方案 + +**调研阶段** + +- 该阶段目标需求开放且缺少数据积累。我们提供Taskflow三行代码极简调用的方式,无需标注数据即可在业务场景上快速验证效果。 + - [文本抽取 Taskflow使用指南](./taskflow_text.md) + - [文档抽取 Taskflow使用指南](./taskflow_doc.md) + +**数据准备阶段** + +- 我们推荐在实际的业务场景中定制自己的信息抽取模型。我们提供了不同抽取场景的Label Studio标注解决方案,可基于该方案实现从数据标注到训练数据构造的无缝衔接,大大降低了数据标注、模型定制的时间成本。 + - [文本抽取标注指南](./label_studio_text.md) + - [文档抽取标注指南](./label_studio_doc.md)。 + +**模型微调及封闭域蒸馏** + +- 基于UIE优秀的小样本微调能力,实现低成本模型定制适配。同时提供封闭域蒸馏的加速方案,解决抽取速度慢的问题。 + - [文本信息抽取全流程示例](./text/README.md) + - [文档信息抽取全流程示例](./document/README.md) + +**模型部署** + +- 提供HTTP部署方案,快速实现定制模型的部署上线。 + - [文本抽取HTTP部署指南](./text/deploy/simple_serving/README.md) + - [文档抽取HTTP部署指南](./document/deploy/simple_serving/README.md) + + + +### 2.4 效果展示 + +- 🧾 通过[Huggingface网页](https://huggingface.co/spaces/PaddlePaddle/UIE-X)体验UIE-X功能: + +
+ +
+ +- UIE-X端到端文档抽取产业应用示例 + + - 报关单 + +
+ +
+ + - Delivery Note(需微调) + +
+ +
+ + - 增值税发票(需微调) + +
+ +
+ + - 表单(需微调) + +
+ +
+ + + +## 3. 快速开始 + + + +### 3.1 Taskflow开箱即用 + +- 通过Taskflow实现开箱即用 + 👉 [文本抽取 Taskflow使用指南](./taskflow_text.md) + 👉 [文档抽取 Taskflow使用指南](./taskflow_doc.md) + + + +### 3.2 文本信息抽取 + +- 快速开启文本信息抽取 👉 [文本信息抽取指南](./text/README.md) + + + +### 3.3 文档信息抽取 + +- 快速开启文档信息抽取 👉 [文档信息抽取指南](./document/README.md) diff --git a/applications/information_extraction/README_en.md b/applications/information_extraction/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..b26a57aa1dc04cc2a850c77d64f046ac277ebb80 --- /dev/null +++ b/applications/information_extraction/README_en.md @@ -0,0 +1,170 @@ +# Information Extraction Application + +**Table of contents** +- [1. Introduction](#1) +- [2. Features](#2) + - [2.1 Available Models](#21) + - [2.2 Performance](#22) + - [2.3 Full Development Lifecycle](#23) + - [2.4 Demo](#24) +- [3. Quick Start](#3) + - [3.1 Taskflow](#31) + - [3.2 Text Information Extraction](#32) + - [3.3 Document Information Extraction](#33) + + + +## 1. Introduction + +This Information Extraction (IE) guide introduces our open-source industry-grade solution that covers the most widely-used application scenarios of Information Extraction. It features **multi-domain, multi-task, and cross-modal capabilities** and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Information Extraction techniques in your own products or models. + +Information Extraction (IE) is the process of extracting structured information from given input data such as text, pictures or scanned document. While IE brings immense value, applying IE techniques is never easy with challenges such as domain adaptation, heterogeneous structures, lack of labeled data, etc. This PaddleNLP Information Extraction Guide builds on the foundation of our work in [Universal Information Extraction] (https://arxiv.org/abs/2203.12277) and provides an industrial-level solution that not only supports **extracting entities, relations, events and opinions from plain text**, but also supports **cross-modal extraction out of documents, tables and pictures.** Our method features a flexible prompt, which allows you to specify extraction targets with simple natural language. We also provide a few different domain-adapated models specialized for different industry sectors. + +**Highlights:** + +- **Comprehensive Coverage🎓:** Covers various mainstream tasks of information extraction for plain text and document scenarios, supports multiple languages +- **State-of-the-Art Performance🏃:** Strong performance from the UIE model series models in plain text and multimodal datasets. We also provide pretrained models of various sizes to meet different needs +- **Easy to use⚡:** three lines of code to use our `Taskflow` for out-of-box Information Extraction capabilities. One line of command to model training and model deployment +- **Efficient Tuning✊:** Developers can easily get started with the data labeling and model training process without a background in Machine Learning. + + + +## 2. Features + + + +### 2.1 Available Models + +Multiple model selection, satisfying accuracy and speed, and adapting to different information extraction scenarios. + +| Model Name | Usage Scenarios | Supporting Tasks | +| :----------------------------------------------------------: | :--------------------------------------------------------- | :--------------------------------------------------- | +| `uie-base`
`uie-medium`
`uie-mini`
`uie-micro`
`uie-nano` | For **plain text** The **extractive** model of the scene supports **Chinese** | Supports entity, relation, event, opinion extraction | +| `uie-base-en` | An **extractive** model for **plain text** scenarios, supports **English** | Supports entity, relation, event, opinion extraction | +| `uie-m-base`
`uie-m-large` | An **extractive** model for **plain text** scenarios, supporting **Chinese and English** | Supports entity, relation, event, opinion extraction | +| `uie-x-base` | An **extractive** model for **plain text** and **document** scenarios, supports **Chinese and English** | Supports entity, relation, event, opinion extraction on both plain text and documents/pictures/tables | + + + + +### 2.2 Performance + +The UIE model series uses the ERNIE 3.0 lightweight models as the pre-trained language models and was finetuned on a large amount of information extraction data so that the model can be adapted to a fixed prompt. + +- Experimental results on Chinese dataset + +We conducted experiments on the in-house test sets of the three different domains of Internet, medical care, and finance: + + +
+| Model | Finance 0-shot | Finance 5-shot | Healthcare 0-shot | Healthcare 5-shot | Internet 0-shot | Internet 5-shot |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| uie-base (12L768H) | 46.43 | 70.92 | 71.83 | 85.72 | 78.33 | 81.86 |
+| uie-medium (6L768H) | 41.11 | 64.53 | 65.40 | 75.72 | 78.32 | 79.68 |
+| uie-mini (6L384H) | 37.04 | 64.65 | 60.50 | 78.36 | 72.09 | 76.38 |
+| uie-micro (4L384H) | 37.53 | 62.11 | 57.04 | 75.92 | 66.00 | 70.22 |
+| uie-nano (4L312H) | 38.94 | 66.83 | 48.29 | 76.74 | 62.86 | 72.35 |
+| uie-m-large (24L1024H) | 49.35 | 74.55 | 70.50 | 92.66 | 78.49 | 83.02 |
+| uie-m-base (12L768H) | 38.46 | 74.31 | 63.37 | 87.32 | 76.27 | 80.13 |
+| 🧾🎓uie-x-base (12L768H) | 48.84 | 73.87 | 65.60 | 88.81 | 79.36 | 81.65 |
+ +0-shot means that no training data is directly used for prediction through ```paddlenlp.Taskflow```, and 5-shot means that each category contains 5 pieces of labeled data for model fine-tuning. **Experiments show that UIE can further improve the performance with a small amount of data (few-shot)**. + +- Experimental results on multimodal datasets + +We experimented on the zero-shot performance of UIE-X on the in-house multi-modal test sets in three different domains of general, financial, and medical: + + +
+| Model | General | Financial | Medical |
+| :---: | :---: | :---: | :---: |
+| 🧾🎓uie-x-base (12L768H) | 65.03 | 73.51 | 84.24 |
+ +The general test set contains complex samples from different fields and is the most difficult task. + + + +### 2.3 Full Development Lifecycle + +**Research stage** + +- At this stage, the target requirements are open and there is no labeled data. We provide a simple way of using Taskflow out-of-the-box with three lines of code, which allows you to build POC without any labeled data. + - [Text Extraction Taskflow User Guide](./taskflow_text_en.md) + - [Document Extraction Taskflow User Guide](./taskflow_doc_en.md) + +**Data preparation stage** + +- We recommend finetuning your own information extraction model for your use case. We provide Label Studio labeling solutions for different extraction scenarios. Based on this solution, the seamless connection from data labeling to training data construction can be realized, which greatly reduces the time cost of data labeling and model customization. + - [Text Extraction Labeling Guide](./label_studio_text_en.md) + - [Document Extraction and Labeling Guide](./label_studio_doc_en.md). + +**Model fine-tuning and closed domain distillation** + +- Based on UIE's few-shot capabilities, it realizes low-cost model customization and adaptation. At the same time, it provides an acceleration solution for closed domain distillation to solve the problem of slow extraction speed. + - [Example of the whole process of text information extraction](./text/README_en.md) + - [Example of document information extraction process](./document/README_en.md) + +**Model Deployment** + +- Provide an HTTP deployment solution to quickly implement the deployment and launch of customized models. + - [Text Extract HTTP Deployment Guide](./text/deploy/simple_serving/README_en.md) + - [Document Extract HTTP Deployment Guide](./document/deploy/simple_serving/README_en.md) + + + +### 2.4 Demo + +- 🧾Try our UIE-X demo on [🤗 HuggingFace Space](https://huggingface.co/spaces/PaddlePaddle/UIE-X): + +
+ +
+ +- UIE-X end-to-end document extraction industry application example + + - Customs declaration + +
+ +
+ + - Delivery Note (Need fine-tuning) + +
+ +
+ + - VAT invoice (need fine-tuning) + +
+ +
+ + - Form (need fine-tuning) + +
+ +
+ + + +## 3. Quick Start + + + +### 3.1 Taskflow + +- Out of the box with Taskflow + 👉 [Text Extraction Taskflow User Guide](./taskflow_text_en.md) + 👉 [Document Extraction Taskflow User Guide](./taskflow_doc_en.md) + + + +### 3.2 Text Information Extraction + +- Quickly start text information extraction 👉 [Text Information Extraction Guide](./text/README_en.md) + + + +### 3.3 Document Information Extraction + +- Quickly open document information extraction 👉 [Document Information Extraction Guide](./document/README_en.md) diff --git a/applications/information_extraction/document/README.md b/applications/information_extraction/document/README.md new file mode 100644 index 0000000000000000000000000000000000000000..30f4670e954ff390b18898264663b646e9140762 --- /dev/null +++ b/applications/information_extraction/document/README.md @@ -0,0 +1,301 @@ +简体中文 | [English](README_en.md) + +# 文档信息抽取 + +**目录** +- [1. 文档信息抽取应用](#1) +- [2. 快速开始](#2) + - [2.1 代码结构](#代码结构) + - [2.2 数据标注](#数据标注) + - [2.3 模型微调](#模型微调) + - [2.4 模型评估](#模型评估) + - [2.5 定制模型一键预测](#定制模型一键预测) + - [2.6 实验指标](#实验指标) + + + +## 1. 文档信息抽取应用 + +本项目提供基于UIE微调的文档抽取端到端应用方案,打通**数据标注-模型训练-模型调优-预测部署全流程**,可快速实现文档信息抽取产品落地。 + +信息抽取通俗地说就是从给定的文本/图片等输入数据中抽取出结构化信息的过程。在信息抽取的落地过程中通常面临领域多变、任务多样、数据稀缺等许多挑战。针对信息抽取领域的难点和痛点,PaddleNLP信息抽取应用UIE统一建模的思想,提供了文档信息抽取产业级应用方案,支持**文档/图片/表格和纯文本场景下实体、关系、事件、观点等不同任务信息抽取**。该应用**不限定行业领域和抽取目标**,可实现从产品原型研发、业务POC阶段到业务落地、迭代阶段的无缝衔接,助力开发者实现特定领域抽取场景的快速适配与落地。 + +**文档信息抽取应用亮点:** + +- **覆盖场景全面🎓:** 覆盖文档信息抽取各类主流任务,支持多语言,满足开发者多样信息抽取落地需求。 +- **效果领先🏃:** 以在多模态信息抽取上有突出效果的模型UIE-X作为训练基座,具有广泛成熟的实践应用性。 +- **简单易用⚡:** 通过Taskflow实现三行代码可实现无标注数据的情况下进行快速调用,一行命令即可开启信息抽取训练,轻松完成部署上线,降低信息抽取技术落地门槛。 +- **高效调优✊:** 开发者无需机器学习背景知识,即可轻松上手数据标注及模型训练流程。 + + + +## 2. 快速开始 + +对于简单的抽取目标可以直接使用```paddlenlp.Taskflow```实现零样本(zero-shot)抽取,对于细分场景我们推荐使用定制功能(标注少量数据进行模型微调)以进一步提升效果。 + + + +### 2.1 代码结构 + +```shell +. 
+├── deploy # 部署目录 +│ └── simple_serving # 基于PaddleNLP SimpleServing 服务化部署 +├── utils.py # 数据处理工具 +├── finetune.py # 模型微调、压缩脚本 +├── evaluate.py # 模型评估脚本 +└── README.md +``` + + + +### 2.2 数据标注 +我们推荐使用 [Label Studio](https://labelstud.io/) 进行文档信息抽取数据标注,本项目打通了从数据标注到训练的通道,也即Label Studio导出数据可以通过 [label_studio.py](../label_studio.py) 脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考 [Label Studio数据标注指南](../label_studio_doc.md)。 + +这里我们提供预先标注好的`增值税发票数据集`的文件,可以运行下面的命令行下载数据集,我们将展示如何使用数据转化脚本生成训练/验证/测试集文件,并使用UIE-X模型进行微调。 + +下载增值税发票数据集: +```shell +wget https://paddlenlp.bj.bcebos.com/datasets/tax.tar.gz +tar -zxvf tax.tar.gz +mv tax data +rm tax.tar.gz +``` + +生成训练/验证集文件: +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.2 0 \ + --task_type ext +``` + +生成训练/验证集文件,可以使用PP-Structure的布局分析优化OCR结果的排序: +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.2 0\ + --task_type ext \ + --layout_analysis True +``` + +更多不同类型任务(含实体抽取、关系抽取、文档分类等)的标注规则及参数说明,请参考[Label Studio数据标注指南](../label_studio_doc.md)。 + + + +### 2.3 模型微调 + +推荐使用 [Trainer API ](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md) 对模型进行微调。只需输入模型、数据集等就可以使用 Trainer API 高效快速地进行预训练、微调和模型压缩等任务,可以一键启动多卡训练、混合精度训练、梯度累积、断点重启、日志显示等功能,Trainer API 还针对训练过程的通用训练配置做了封装,比如:优化器、学习率调度等。 + +使用下面的命令,使用 `uie-x-base` 作为预训练模型进行模型微调,将微调后的模型保存至`./checkpoint/model_best`: + +单卡启动: + +```shell +python finetune.py \ + --device gpu \ + --logging_steps 5 \ + --save_steps 25 \ + --eval_steps 25 \ + --seed 42 \ + --model_name_or_path uie-x-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 10 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +如果在GPU环境中使用,可以指定gpus参数进行多卡训练: + +```shell +python -u -m paddle.distributed.launch --gpus "0" finetune.py \ + --device gpu \ + --logging_steps 5 \ + --save_steps 25 \ + --eval_steps 25 \ + --seed 42 \ + --model_name_or_path uie-x-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 10 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +该示例代码中由于设置了参数 `--do_eval`,因此在训练完会自动进行评估。 + +可配置参数说明: +* `device`: 训练设备,可选择 'cpu'、'gpu'、'npu' 其中的一种;默认为 GPU 训练。 +* `logging_steps`: 训练过程中日志打印的间隔 steps 数,默认10。 +* `save_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `eval_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `seed`:全局随机种子,默认为 42。 +* `model_name_or_path`:进行 few shot 训练使用的预训练模型。默认为 "uie-x-base"。 +* `output_dir`:必须,模型训练或压缩后保存的模型目录;默认为 `None` 。 +* `train_path`:训练集路径;默认为 `None` 。 +* `dev_path`:开发集路径;默认为 `None` 。 +* `max_seq_len`:文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +* `per_device_train_batch_size`:用于训练的每个 GPU 核心/NPU 核心/CPU 的batch大小,默认为8。 +* `per_device_eval_batch_size`:用于评估的每个 GPU 核心/NPU 核心/CPU 的batch大小,默认为8。 +* 
`num_train_epochs`: 训练轮次,使用早停法时可以选择 100;默认为10。 +* `learning_rate`:训练最大学习率,UIE-X 推荐设置为 1e-5;默认值为3e-5。 +* `label_names`:训练数据标签label的名称,UIE-X 设置为'start_positions' 'end_positions';默认值为None。 +* `do_train`:是否进行微调训练,设置该参数表示进行微调训练,默认不设置。 +* `do_eval`:是否进行评估,设置该参数表示进行评估,默认不设置。 +* `do_export`:是否进行导出,设置该参数表示进行静态图导出,默认不设置。 +* `export_model_dir`:静态图导出地址,默认为None。 +* `overwrite_output_dir`: 如果 `True`,覆盖输出目录的内容。如果 `output_dir` 指向检查点目录,则使用它继续训练。 +* `disable_tqdm`: 是否使用tqdm进度条。 +* `metric_for_best_model`:最优模型指标,UIE-X 推荐设置为 `eval_f1`,默认为None。 +* `load_best_model_at_end`:训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用,默认为False。 +* `save_total_limit`:如果设置次参数,将限制checkpoint的总数。删除旧的checkpoints `输出目录`,默认为None。 + + + +### 2.4 模型评估 + +```shell +python evaluate.py \ + --device "gpu" \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --output_dir ./checkpoint/model_best \ + --label_names 'start_positions' 'end_positions'\ + --max_seq_len 512 \ + --per_device_eval_batch_size 16 +``` +评估方式说明:采用单阶段评价的方式,即关系抽取、事件抽取等需要分阶段预测的任务对每一阶段的预测结果进行分别评价。验证/测试集默认会利用同一层级的所有标签来构造出全部负例。 + +可开启`debug`模式对每个正例类别分别进行评估,该模式仅用于模型调试: + +```shell +python evaluate.py \ + --device "gpu" \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --output_dir ./checkpoint/model_best \ + --label_names 'start_positions' 'end_positions'\ + --max_seq_len 512 \ + --per_device_eval_batch_size 16 \ + --debug True +``` + +输出结果: +```text +[2022-11-14 09:41:18,424] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:18,424] [ INFO] - Num examples = 160 +[2022-11-14 09:41:18,424] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:18,424] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:18,424] [ INFO] - Total prediction steps = 40 +[2022-11-14 09:41:26,451] [ INFO] - -----Evaluate model------- +[2022-11-14 09:41:26,451] [ INFO] - Class Name: ALL CLASSES +[2022-11-14 09:41:26,451] [ INFO] - Evaluation Precision: 0.94521 | Recall: 0.88462 | F1: 0.91391 +[2022-11-14 09:41:26,451] [ INFO] - ----------------------------- +[2022-11-14 09:41:26,452] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:26,452] [ INFO] - Num examples = 8 +[2022-11-14 09:41:26,452] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:26,452] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:26,452] [ INFO] - Total prediction steps = 2 +[2022-11-14 09:41:26,692] [ INFO] - Class Name: 开票日期 +[2022-11-14 09:41:26,692] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000 +[2022-11-14 09:41:26,692] [ INFO] - ----------------------------- +[2022-11-14 09:41:26,693] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:26,693] [ INFO] - Num examples = 8 +[2022-11-14 09:41:26,693] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:26,693] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:26,693] [ INFO] - Total prediction steps = 2 +[2022-11-14 09:41:26,952] [ INFO] - Class Name: 名称 +[2022-11-14 09:41:26,952] [ INFO] - Evaluation Precision: 0.87500 | Recall: 0.87500 | F1: 0.87500 +[2022-11-14 09:41:26,952] [ INFO] - ----------------------------- +... 
+``` + +可配置参数: +* `device`: 评估设备,可选择 'cpu'、'gpu'、'npu' 其中的一种;默认为 GPU 评估。 +* `model_path`: 进行评估的模型文件夹路径,路径下需包含模型权重文件`model_state.pdparams`及配置文件`model_config.json`。 +* `test_path`: 进行评估的测试集文件。 +* `label_names`:训练数据标签label的名称,UIE-X 设置为'start_positions' 'end_positions';默认值为None。 +* `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +* `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +* `per_device_eval_batch_size`:用于评估的每个 GPU 核心/NPU 核心/CPU 的batch大小,默认为8。 +* `debug`: 是否开启debug模式对每个正例类别分别进行评估,该模式仅用于模型调试,默认关闭。 +* `schema_lang`: 选择schema的语言,可选有`ch`和`en`。默认为`ch`,英文数据集请选择`en`。 + + + +### 2.5 定制模型一键预测 + +`paddlenlp.Taskflow`装载定制模型,通过`task_path`指定模型权重文件的路径,路径下需要包含训练好的模型权重文件`model_state.pdparams`。 + +```python +from pprint import pprint +from paddlenlp import Taskflow +from paddlenlp.utils.doc_parser import DocParser + +schema = ['开票日期', '名称', '纳税人识别号', '开户行及账号', '金额', '价税合计', 'No', '税率', '地址、电话', '税额'] +my_ie = Taskflow("information_extraction", model="uie-x-base", schema=schema, task_path='./checkpoint/model_best', precision='fp16') +``` + +我们可以根据设置的`schema`,对指定的`doc_path`文档进行信息抽取并进行可视化: + +```python +doc_path = "./data/images/b199.jpg" +results = my_ie({"doc": doc_path}) +pprint(results) + +# 结果可视化 +DocParser.write_image_with_results( + doc_path, + result=results[0], + save_path="./image_show.png") +``` + +
+ +
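+
+如需在预测过程中更换抽取目标,不必重新构建 Taskflow,可调用 `set_schema` 重设 schema 后再次预测。下面是一个示意用法(沿用上文示例中的 `my_ie` 和 `doc_path`,schema 仅保留部分字段作为示例):
+
+```python
+# 仅抽取部分字段(示例 schema,可按需替换)
+my_ie.set_schema(["开票日期", "价税合计"])
+pprint(my_ie({"doc": doc_path}))
+```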
+ + + +### 2.6 实验指标 + +我们在自标注的增值税数据集上进行实验: + + + | | Precision | Recall | F1 Score | + | :---: | :--------: | :--------: | :--------: | + | 0-shot| 0.44898 | 0.56410 | 0.50000 | + | 5-shot| 0.9000 | 0.9231 | 0.9114 | + | 10-shot| 0.9125 | 0.93590 | 0.9241 | + | 20-shot| 0.9737 | 0.9487 | 0.9610 | + | 30-shot| 0.9744 | 0.9744 | 0.9744 | + | 30-shot+PP-Structure| 1.0 | 0.9625 | 0.9809 | + + +n-shot表示训练集包含n张标注图片数据进行模型微调,实验表明UIE-X可以通过少量数据(few-shot)和PP-Structure的布局分析进一步提升结果。 diff --git a/applications/information_extraction/document/README_en.md b/applications/information_extraction/document/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..acef107e5e07f7134adc4cf32a823a3688340891 --- /dev/null +++ b/applications/information_extraction/document/README_en.md @@ -0,0 +1,297 @@ +# document information extraction + +**Table of contents** +- [1. Introduction](#1) +- [2. Quick Start](#2) + - [2.1 Code Structure](#21) + - [2.2 Data Annotation](#22) + - [2.3 Finetuning](#23) + - [2.4 Evaluation](#24) + - [2.5 Inference](#25) + - [2.6 Experiments](#26) + + + +## 1. Introduction + +This Information Extraction (IE) guide introduces our open-source industry-grade solution that covers the most widely-used application scenarios of Information Extraction. It features **multi-domain, multi-task, and cross-modal capabilities** and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Information Extraction techniques in your own products or models. + +Information Extraction (IE) is the process of extracting structured information from given input data such as text, pictures or scanned document. While IE brings immense value, applying IE techniques is never easy with challenges such as domain adaptation, heterogeneous structures, lack of labeled data, etc. This PaddleNLP Information Extraction Guide builds on the foundation of our work in [Universal Information Extraction](https://arxiv.org/abs/2203.12277) and provides an industrial-level solution that not only supports **extracting entities, relations, events and opinions from plain text**, but also supports **cross-modal extraction out of documents, tables and pictures.** Our method features a flexible prompt, which allows you to specify extraction targets with simple natural language. We also provide a few different domain-adapted models specialized for different industry sectors. + +**Highlights:** + +- **Comprehensive Coverage🎓:** Covers various mainstream tasks of information extraction for plain text and document scenarios, supports multiple languages +- **State-of-the-Art Performance🏃:** Strong performance from the UIE model series models in plain text and multimodal datasets. We also provide pretrained models of various sizes to meet different needs +- **Easy to use⚡:** three lines of code to use our `Taskflow` for out-of-box Information Extraction capabilities. One line of command to model training and model deployment +- **Efficient Tuning✊:** Developers can easily get started with the data labeling and model training process without a background in Machine Learning. + + + +## 2. Quick Start + +For quick start, you can directly use ```paddlenlp.Taskflow``` out-of-the-box, leveraging the zero-shot capability. For production use cases, we recommend labeling a small amount of data for model fine-tuning to further improve the performance. + + + +### 2.1 Code Structure + +```shell +. 
+├── utils.py # data processing tools +├── finetune.py # model fine-tuning, compression script +├── evaluate.py # model evaluation script +└── README.md +``` + + + +### 2.2 Data Annotation + +We recommend using [Label Studio](https://labelstud.io/) for data labeling. We provide an end-to-end pipeline for the labeling -> training process. You can export the labeled data in Label Studio through [label_studio.py](../label_studio.py) script to export and convert the data into the required input form for the model. For a detailed introduction to labeling methods, please refer to [Label Studio Data Labeling Guide](../label_studio_doc_en.md). + +Here we provide the pre-labeled example dataset `VAT invoice dataset`, which you can download by running the following command. We will demonstrate how to use the data conversion script to generate training/validation/test set files for finetuning. + +Download the VAT invoice dataset: +```shell +wget https://paddlenlp.bj.bcebos.com/datasets/tax.tar.gz +tar -zxvf tax.tar.gz +mv tax data +rm tax.tar.gz +``` + +Generate training/validation data files: + +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.2 0 \ + --task_type ext +``` + +Generate training/validation set files, you can use PP-Structure's layout analysis to optimize the sorting of OCR results: + +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.2 0\ + --task_type ext\ + --layout_analysis True +``` + +For more labeling rules and parameter descriptions for different types of tasks (including entity extraction, relationship extraction, document classification, etc.), please refer to [Label Studio Data Labeling Guide](../label_studio_doc_en.md). + + + +### 2.3 Finetuning + +Use the following command to fine-tune the model using `uie-x-base` as the pre-trained model, and save the fine-tuned model to `./checkpoint/model_best`: + +Single GPU: + +```shell +python finetune.py\ + --device gpu \ + --logging_steps 5 \ + --save_steps 25 \ + --eval_steps 25 \ + --seed 42 \ + --model_name_or_path uie-x-base \ + --output_dir ./checkpoint/model_best\ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 10 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best\ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Multiple GPUs: + +```shell +python -u -m paddle.distributed.launch --gpus "0" finetune.py \ + --device gpu \ + --logging_steps 5 \ + --save_steps 25 \ + --eval_steps 25 \ + --seed 42 \ + --model_name_or_path uie-x-base \ + --output_dir ./checkpoint/model_best\ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 10 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best\ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Since the parameter `--do_eval` is set in the sample code, it will be automatically evaluated after training. 
+ +Parameters: + +* `device`: Training device, one of 'cpu', 'gpu' and 'npu' can be selected; the default is GPU training. +* `logging_steps`: The interval steps of log printing during training, the default is 10. +* `save_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `eval_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `seed`: global random seed, default is 42. +* `model_name_or_path`: The pre-trained model used for few shot training. Defaults to "uie-x-base". +* `output_dir`: required, the model directory saved after model training or compression; the default is `None`. +* `train_path`: training set path; defaults to `None`. +* `dev_path`: Development set path; defaults to `None`. +* `max_seq_len`: The maximum segmentation length of the text. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. +* `per_device_train_batch_size`: The batch size of each GPU core/NPU core/CPU used for training, the default is 8. +* `per_device_eval_batch_size`: Batch size per GPU core/NPU core/CPU for evaluation, default is 8. +* `num_train_epochs`: Training rounds, 100 can be selected when using early stopping method; the default is 10. +* `learning_rate`: The maximum learning rate for training, UIE-X recommends setting it to 1e-5; the default value is 3e-5. +* `label_names`: the name of the training data label label, UIE-X is set to 'start_positions' 'end_positions'; the default value is None. +* `do_train`: Whether to perform fine-tuning training, setting this parameter means to perform fine-tuning training, and it is not set by default. +* `do_eval`: Whether to evaluate, setting this parameter means to evaluate, the default is not set. +* `do_export`: Whether to export, setting this parameter means to export static images, and it is not set by default. +* `export_model_dir`: Static map export address, the default is None. +* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use it to continue training. +* `disable_tqdm`: Whether to use tqdm progress bar. +* `metric_for_best_model`: Optimal model metric, UIE-X recommends setting it to `eval_f1`, the default is None. +* `load_best_model_at_end`: Whether to load the best model after training, usually used in conjunction with `metric_for_best_model`, the default is False. +* `save_total_limit`: If this parameter is set, the total number of checkpoints will be limited. Remove old checkpoints `output directory`, defaults to None. + + + +### 2.4 Evaluation + +```shell +python evaluate.py \ + --device "gpu" \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --output_dir ./checkpoint/model_best \ + --label_names 'start_positions' 'end_positions'\ + --max_seq_len 512 \ + --per_device_eval_batch_size 16 +``` +We adopt the single-stage method for evaluation, which means tasks that require multiple stages (e.g. relation extraction, event extraction) are evaluated separately for each stage. By default, the validation/test set uses all labels at the same level to construct the negative examples. +The `debug` mode can be turned on to evaluate each positive category separately. 
This mode is only used for model debugging: + +```shell +python evaluate.py \ + --device "gpu" \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --output_dir ./checkpoint/model_best \ + --label_names 'start_positions' 'end_positions' \ + --max_seq_len 512 \ + --per_device_eval_batch_size 16 \ + --debug True +``` + +Output result: + +```text +[2022-11-14 09:41:18,424] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:18,424] [ INFO] - Num examples = 160 +[2022-11-14 09:41:18,424] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:18,424] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:18,424] [ INFO] - Total prediction steps = 40 +[2022-11-14 09:41:26,451] [ INFO] - -----Evaluate model------- +[2022-11-14 09:41:26,451] [ INFO] - Class Name: ALL CLASSES +[2022-11-14 09:41:26,451] [ INFO] - Evaluation Precision: 0.94521 | Recall: 0.88462 | F1: 0.91391 +[2022-11-14 09:41:26,451] [ INFO] - ----------------------------- +[2022-11-14 09:41:26,452] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:26,452] [ INFO] - Num examples = 8 +[2022-11-14 09:41:26,452] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:26,452] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:26,452] [ INFO] - Total prediction steps = 2 +[2022-11-14 09:41:26,692] [ INFO] - Class Name: 开票日期 +[2022-11-14 09:41:26,692] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000 +[2022-11-14 09:41:26,692] [ INFO] - ----------------------------- +[2022-11-14 09:41:26,693] [ INFO] - ***** Running Evaluation ***** +[2022-11-14 09:41:26,693] [ INFO] - Num examples = 8 +[2022-11-14 09:41:26,693] [ INFO] - Pre device batch size = 4 +[2022-11-14 09:41:26,693] [ INFO] - Total Batch size = 4 +[2022-11-14 09:41:26,693] [ INFO] - Total prediction steps = 2 +[2022-11-14 09:41:26,952] [ INFO] - Class Name: 名称 +[2022-11-14 09:41:26,952] [ INFO] - Evaluation Precision: 0.87500 | Recall: 0.87500 | F1: 0.87500 +[2022-11-14 09:41:26,952] [ INFO] - ----------------------------- +... +``` + +Parameters: + +* `device`: Evaluation device, one of 'cpu', 'gpu' and 'npu' can be selected; the default is GPU evaluation. +* `model_path`: The path of the model folder for evaluation, which must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`. +* `test_path`: The test set file for evaluation. +* `label_names`: the name of the training data label, UIE-X is set to 'start_positions' 'end_positions'; the default value is None. +* `batch_size`: batch size, please adjust according to the machine situation, the default is 16. +* `max_seq_len`: The maximum segmentation length of the text. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. +* `per_device_eval_batch_size`: Batch size per GPU core/NPU core/CPU for evaluation, default is 8. +* `debug`: Whether to enable the debug mode to evaluate each positive category separately. This mode is only used for model debugging and is disabled by default. +* `schema_lang`: Select the language of the schema, optional `ch` and `en`. The default is `ch`, please select `en` for the English dataset. 
+ + + +### 2.5 Inference + +Same with the pretrained models, you can use `paddlenlp.Taskflow` to load your custom model by specifying the path of the model weight file through `task_path` + +```python +from pprint import pprint +from paddlenlp import Taskflow +from paddlenlp.utils.doc_parser import DocParser + +schema = ['开票日期', '名称', '纳税人识别号', '开户行及账号', '金额', '价税合计', 'No', '税率', '地址、电话', '税额'] +my_ie = Taskflow("information_extraction", model="uie-x-base", schema=schema, task_path='./checkpoint/model_best', precision='fp16') +``` + +We specify the extraction targets by setting `schema` and visualize the information of the specified `doc_path` document: + +```python +doc_path = "./data/images/b199.jpg" +results = my_ie({"doc": doc_path}) +pprint(results) + +# Result visualization +DocParser.write_image_with_results( + doc_path, + result=results[0], + save_path="./image_show.png") +``` + +
+ +
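+
+If you need to change the extraction targets later, there is no need to rebuild the Taskflow instance: you can reset the schema with `set_schema` and run prediction again. Below is a minimal sketch that reuses `my_ie` and `doc_path` from the example above and keeps only part of the schema for illustration:
+
+```python
+# Extract only a subset of the fields (illustrative schema, replace as needed)
+my_ie.set_schema(["开票日期", "价税合计"])
+pprint(my_ie({"doc": doc_path}))
+```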
+ + + +### 2.6 Experiments + + | | Precision | Recall | F1 Score | + | :---: | :--------: | :--------: | :--------: | + | 0-shot| 0.44898 | 0.56410 | 0.50000 | + | 5-shot| 0.9000 | 0.9231 | 0.9114 | + | 10-shot| 0.9125 | 0.93590 | 0.9241 | + | 20-shot| 0.9737 | 0.9487 | 0.9610 | + | 30-shot| 0.9744 | 0.9744 | 0.9744 | + | 30-shot+PP-Structure| 1.0 | 0.9625 | 0.9809 | + + +n-shot means that the training set contains n labeled image data for model fine-tuning. Experiments show that UIE-X can further improve the results through a small amount of data (few-shot) and PP-Structure layout analysis. diff --git a/applications/information_extraction/document/deploy/simple_serving/README.md b/applications/information_extraction/document/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e98fc150228a7f787c044b1f9acb8255aaccb844 --- /dev/null +++ b/applications/information_extraction/document/deploy/simple_serving/README.md @@ -0,0 +1,57 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server服务启动](#Server服务启动) +- [Client请求启动](#Client请求启动) +- [服务化自定义参数](#服务化自定义参数) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本(或者最新的develop版本) + +```shell +pip install paddlenlp >= 2.4.4 +``` + + +## Server服务启动 + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + +## Client请求启动 + +```bash +python client.py +``` + +## 服务化自定义参数 + +### Server 自定义参数 +#### schema替换 +```python +# Default schema +schema = ['开票日期', '名称', '纳税人识别号', '开户行及账号', '金额', '价税合计', 'No', '税率', '地址、电话', '税额'] +``` + +#### 设置模型路径 +``` +# Default task_path +uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema) +``` + +#### 多卡服务化预测 +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,下面是示例代码 +``` +uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0) +uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1) +service.register_taskflow('uie', [uie1, uie2]) +``` + +### Client 自定义参数 + +```python +# Changed to image paths you wanted +image_paths = ['../../data/images/b1.jpg'] +``` diff --git a/applications/information_extraction/document/deploy/simple_serving/README_en.md b/applications/information_extraction/document/deploy/simple_serving/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..71c54c406dde07d43801e5d96acff8d5a4264b7a --- /dev/null +++ b/applications/information_extraction/document/deploy/simple_serving/README_en.md @@ -0,0 +1,65 @@ +# Service deployment based on PaddleNLP SimpleServing + +## Table of contents +- [Environment Preparation](#1) +- [Server](#2) +- [Client](#3) +- [Service Custom Parameters](#4) + + + +## Environment Preparation +Use the PaddleNLP version with SimpleServing function (or the latest develop version) + +```shell +pip install paddlenlp >= 2.4.4 +``` + + + +## Server + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + + + +## Client + +```bash +python client.py +``` + + + +## Service custom parameters + +### Server Custom Parameters + +#### schema replacement +```python +# Default schema +schema = ['Billing Date', 'Name', 'Taxpayer Identification Number', 'Account Bank and Account Number', 'Amount', 'Total Price and Tax', 'No', 'Tax Rate', 'Address, Phone', 'tax'] +``` + +#### Set model path +``` +# Default task_path +uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', 
schema=schema) +``` + +#### Doka Service Prediction +PaddleNLP SimpleServing supports multi-card load balancing prediction, mainly during service registration, just register two Taskflow tasks, the following is the sample code +``` +uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0) +uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1) +service. register_taskflow('uie', [uie1, uie2]) +``` + +### Client Custom Parameters + +```python +# Changed to image paths you wanted +image_paths = ['../../data/images/b1.jpg'] +``` diff --git a/applications/information_extraction/document/deploy/simple_serving/client.py b/applications/information_extraction/document/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..dcd2c67f94108b47fe0c870b7f7c71e29bf3544f --- /dev/null +++ b/applications/information_extraction/document/deploy/simple_serving/client.py @@ -0,0 +1,42 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import requests + +from paddlenlp.utils.doc_parser import DocParser + +# Define the document parser +doc_parser = DocParser() + +image_paths = ["../../data/images/b1.jpg"] +image_base64_docs = [] + +# Get the image base64 to post +for image_path in image_paths: + req_dict = {} + doc = doc_parser.parse({"doc": image_path}, do_ocr=False) + base64 = doc["image"] + req_dict["doc"] = base64 + image_base64_docs.append(req_dict) + +url = "http://0.0.0.0:8189/taskflow/uie" +headers = {"Content-Type": "application/json"} +data = {"data": {"text": image_base64_docs}} + +# Post the requests +r = requests.post(url=url, headers=headers, data=json.dumps(data)) +datas = json.loads(r.text) +print(datas) diff --git a/applications/information_extraction/document/deploy/simple_serving/server.py b/applications/information_extraction/document/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..2a6ab1fb7d2d1d2e4190665150e5c0ba07730c49 --- /dev/null +++ b/applications/information_extraction/document/deploy/simple_serving/server.py @@ -0,0 +1,27 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from paddlenlp import SimpleServer, Taskflow + +# The schema changed to your defined schema +schema = ["开票日期", "名称", "纳税人识别号", "开户行及账号", "金额", "价税合计", "No", "税率", "地址、电话", "税额"] +# The task path changed to your best model path +uie = Taskflow( + "information_extraction", + schema=schema, + task_path="../../checkpoint/model_best", +) +# If you want to define the finetuned uie service +app = SimpleServer() +app.register_taskflow("taskflow/uie", uie) diff --git a/applications/information_extraction/document/evaluate.py b/applications/information_extraction/document/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..ca3cdeda1da3192f0f5ab0e50936eab965d3f1f1 --- /dev/null +++ b/applications/information_extraction/document/evaluate.py @@ -0,0 +1,149 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import paddle +from utils import convert_example, reader + +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments +from paddlenlp.transformers import UIEX, AutoTokenizer +from paddlenlp.utils.ie_utils import ( + compute_metrics, + get_relation_type_dict, + uie_loss_func, + unify_prompt_name, +) +from paddlenlp.utils.log import logger + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for evaluation. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + test_path: str = field(default=None, metadata={"help": "The path of test set."}) + + schema_lang: str = field( + default="ch", metadata={"help": "Select the language type for schema, such as 'ch', 'en'"} + ) + + max_seq_len: Optional[int] = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + debug: bool = field( + default=False, + metadata={"help": "Whether choose debug mode."}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
+ """ + + model_path: Optional[str] = field( + default=None, metadata={"help": "The path of saved model that you want to load."} + ) + + +def do_eval(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + tokenizer = AutoTokenizer.from_pretrained(model_args.model_path) + model = UIEX.from_pretrained(model_args.model_path) + + test_ds = load_dataset(reader, data_path=data_args.test_path, max_seq_len=data_args.max_seq_len, lazy=False) + trans_fn = partial(convert_example, tokenizer=tokenizer, max_seq_len=data_args.max_seq_len) + if data_args.debug: + class_dict = {} + relation_data = [] + + for data in test_ds: + class_name = unify_prompt_name(data["prompt"]) + # Only positive examples are evaluated in debug mode + if len(data["result_list"]) != 0: + p = "的" if data_args.schema_lang == "ch" else " of " + if p not in data["prompt"]: + class_dict.setdefault(class_name, []).append(data) + else: + relation_data.append((data["prompt"], data)) + + relation_type_dict = get_relation_type_dict(relation_data, schema_lang=data_args.schema_lang) + test_ds = test_ds.map(trans_fn) + + trainer = Trainer( + model=model, + criterion=uie_loss_func, + args=training_args, + eval_dataset=test_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + eval_metrics = trainer.evaluate() + logger.info("-----Evaluate model-------") + logger.info("Class Name: ALL CLASSES") + logger.info( + "Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" + % (eval_metrics["eval_precision"], eval_metrics["eval_recall"], eval_metrics["eval_f1"]) + ) + logger.info("-----------------------------") + if data_args.debug: + for key in class_dict.keys(): + test_ds = MapDataset(class_dict[key]) + test_ds = test_ds.map(trans_fn) + eval_metrics = trainer.evaluate(eval_dataset=test_ds) + + logger.info("Class Name: %s" % key) + logger.info( + "Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" + % (eval_metrics["eval_precision"], eval_metrics["eval_recall"], eval_metrics["eval_f1"]) + ) + logger.info("-----------------------------") + for key in relation_type_dict.keys(): + test_ds = MapDataset(relation_type_dict[key]) + test_ds = test_ds.map(trans_fn) + eval_metrics = trainer.evaluate(eval_dataset=test_ds) + logger.info("-----------------------------") + if data_args.schema_lang == "ch": + logger.info("Class Name: X的%s" % key) + else: + logger.info("Class Name: %s of X" % key) + logger.info( + "Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" + % (eval_metrics["eval_precision"], eval_metrics["eval_recall"], eval_metrics["eval_f1"]) + ) + + +if __name__ == "__main__": + do_eval() diff --git a/applications/information_extraction/document/finetune.py b/applications/information_extraction/document/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..822fd4e2a788adafe38c08fef5ca0f90fc3b046a --- /dev/null +++ b/applications/information_extraction/document/finetune.py @@ -0,0 +1,177 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from dataclasses import dataclass, field +from functools import partial +from typing import List, Optional + +import paddle +from utils import convert_example, reader + +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + CompressionArguments, + PdArgumentParser, + Trainer, + get_last_checkpoint, +) +from paddlenlp.transformers import UIEX, AutoTokenizer, export_model +from paddlenlp.utils.ie_utils import compute_metrics, uie_loss_func +from paddlenlp.utils.log import logger + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + train_path: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + + dev_path: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + + max_seq_len: Optional[int] = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + dynamic_max_length: Optional[List[int]] = field( + default=None, + metadata={"help": "dynamic max length from batch, it can be array of length, eg: 16 32 64 128"}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: Optional[str] = field(default="uie-x-base", metadata={"help": "Path to pretrained model"}) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.label_names = ["start_positions", "end_positions"] + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." 
+ ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Define model and tokenizer + model = UIEX.from_pretrained(model_args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # Load and preprocess dataset + train_ds = load_dataset(reader, data_path=data_args.train_path, max_seq_len=data_args.max_seq_len, lazy=False) + dev_ds = load_dataset(reader, data_path=data_args.dev_path, max_seq_len=data_args.max_seq_len, lazy=False) + trans_fn = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=data_args.max_seq_len, + dynamic_max_length=data_args.dynamic_max_length, + ) + train_ds = train_ds.map(trans_fn) + dev_ds = dev_ds.map(trans_fn) + + trainer = Trainer( + model=model, + criterion=uie_loss_func, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + trainer.optimizer = paddle.optimizer.AdamW( + learning_rate=training_args.learning_rate, parameters=model.parameters() + ) + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="position_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="attention_mask"), + paddle.static.InputSpec(shape=[None, None, 4], dtype="int64", name="bbox"), + paddle.static.InputSpec(shape=[None, 3, 224, 224], dtype="int64", name="image"), + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/applications/information_extraction/document/utils.py b/applications/information_extraction/document/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..69cfbeefdd69d08830a7a3312adebf75c8ae733f --- /dev/null +++ b/applications/information_extraction/document/utils.py @@ -0,0 +1,374 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import base64 +import json +from typing import List, Optional + +import numpy as np + +from paddlenlp.utils.ie_utils import map_offset, pad_image_data +from paddlenlp.utils.log import logger + + +def reader(data_path, max_seq_len=512): + """ + read json + """ + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + content = json_line["content"].strip() + prompt = json_line["prompt"] + boxes = json_line.get("bbox", None) + image = json_line.get("image", None) + # Model Input is aslike: [CLS] prompt [SEP] [SEP] text [SEP] for UIE-X + if boxes is not None and image is not None: + summary_token_num = 4 + else: + summary_token_num = 3 + if max_seq_len <= len(prompt) + summary_token_num: + raise ValueError("The value of max_seq_len is too small, please set a larger value") + max_content_len = max_seq_len - len(prompt) - summary_token_num + if len(content) <= max_content_len: + yield json_line + else: + result_list = json_line["result_list"] + json_lines = [] + accumulate = 0 + while True: + cur_result_list = [] + for result in result_list: + if result["end"] - result["start"] > max_content_len: + logger.warning( + "result['end'] - result ['start'] exceeds max_content_len, which will result in no valid instance being returned" + ) + if ( + result["start"] + 1 <= max_content_len < result["end"] + and result["end"] - result["start"] <= max_content_len + ): + max_content_len = result["start"] + break + + cur_content = content[:max_content_len] + res_content = content[max_content_len:] + if boxes is not None and image is not None: + cur_boxes = boxes[:max_content_len] + res_boxes = boxes[max_content_len:] + + while True: + if len(result_list) == 0: + break + elif result_list[0]["end"] <= max_content_len: + if result_list[0]["end"] > 0: + cur_result = result_list.pop(0) + cur_result_list.append(cur_result) + else: + cur_result_list = [result for result in result_list] + break + else: + break + + if boxes is not None and image is not None: + json_line = { + "content": cur_content, + "result_list": cur_result_list, + "prompt": prompt, + "bbox": cur_boxes, + "image": image, + } + else: + json_line = { + "content": cur_content, + "result_list": cur_result_list, + "prompt": prompt, + } + json_lines.append(json_line) + + for result in result_list: + if result["end"] <= 0: + break + result["start"] -= max_content_len + result["end"] -= max_content_len + accumulate += max_content_len + max_content_len = max_seq_len - len(prompt) - summary_token_num + if len(res_content) == 0: + break + elif len(res_content) < max_content_len: + if boxes is not None and image is not None: + json_line = { + "content": res_content, + "result_list": result_list, + "prompt": prompt, + "bbox": res_boxes, + "image": image, + } + else: + json_line = {"content": res_content, "result_list": result_list, "prompt": prompt} + + json_lines.append(json_line) + break + else: + content = res_content + boxes = res_boxes + + for json_line in json_lines: + yield json_line + + +def get_dynamic_max_len(examples, default_max_len: int, dynamic_max_length: List[int]) -> int: + """get max_length 
by examples which you can change it by examples in batch""" + cur_length = len(examples[0]["input_ids"]) + max_length = default_max_len + for max_length_option in sorted(dynamic_max_length): + if cur_length <= max_length_option: + max_length = max_length_option + break + return max_length + + +def convert_example( + example, + tokenizer, + max_seq_len, + pad_id=1, + c_sep_id=2, + summary_token_num=4, + dynamic_max_length: Optional[List[int]] = None, +): + + content = example["content"] + prompt = example["prompt"] + bbox_lines = example.get("bbox", None) + image_buff_string = example.get("image", None) + # Text + if bbox_lines is None or image_buff_string is None: + if dynamic_max_length is not None: + temp_encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + max_length = get_dynamic_max_len( + examples=temp_encoded_inputs, default_max_len=max_seq_len, dynamic_max_length=dynamic_max_length + ) + # always pad to max_length + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_length, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + max_seq_len = max_length + else: + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_offsets_mapping=True, + return_dict=False, + ) + + encoded_inputs = encoded_inputs[0] + + inputs_ids = encoded_inputs["input_ids"] + position_ids = encoded_inputs["position_ids"] + attention_mask = encoded_inputs["attention_mask"] + + q_sep_index = inputs_ids.index(2, 1) + c_sep_index = attention_mask.index(0) + + offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]] + + bias = 0 + for index in range(len(offset_mapping)): + if index == 0: + continue + mapping = offset_mapping[index] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + # bias = index + bias = offset_mapping[index - 1][-1] + 1 + + if mapping[0] == 0 and mapping[1] == 0: + continue + offset_mapping[index][0] += bias + offset_mapping[index][1] += bias + + offset_bias = bias + + bbox_list = [[0, 0, 0, 0] for x in range(len(inputs_ids))] + token_type_ids = [ + 1 if token_index <= q_sep_index or token_index > c_sep_index else 0 for token_index in range(max_seq_len) + ] + padded_image = np.zeros([3, 224, 224]) + + # Doc + else: + inputs_ids = [] + prev_bbox = [-1, -1, -1, -1] + this_text_line = "" + q_sep_index = -1 + offset_mapping = [] + last_offset = 0 + for char_index, (char, bbox) in enumerate(zip(content, bbox_lines)): + if char_index == 0: + prev_bbox = bbox + this_text_line = char + continue + + if all([bbox[x] == prev_bbox[x] for x in range(4)]): + this_text_line += char + else: + offset_mapping, last_offset, q_sep_index, inputs_ids = _encode_doc( + tokenizer, + offset_mapping, + last_offset, + prompt, + this_text_line, + inputs_ids, + q_sep_index, + max_seq_len, + ) + this_text_line = char + prev_bbox = bbox + + if len(this_text_line) > 0: + offset_mapping, last_offset, q_sep_index, inputs_ids = _encode_doc( + tokenizer, offset_mapping, last_offset, prompt, this_text_line, inputs_ids, q_sep_index, max_seq_len + ) + + if len(inputs_ids) > max_seq_len: + 
inputs_ids = inputs_ids[: (max_seq_len - 1)] + [c_sep_id] + offset_mapping = offset_mapping[: (max_seq_len - 1)] + [[0, 0]] + else: + inputs_ids += [c_sep_id] + offset_mapping += [[0, 0]] + + offset_bias = offset_mapping[q_sep_index - 1][-1] + 1 + + seq_len = len(inputs_ids) + inputs_ids += [pad_id] * (max_seq_len - seq_len) + token_type_ids = [1] * (q_sep_index + 1) + [0] * (seq_len - q_sep_index - 1) + token_type_ids += [pad_id] * (max_seq_len - seq_len) + + bbox_list = _process_bbox(inputs_ids, bbox_lines, offset_mapping, offset_bias) + + offset_mapping += [[0, 0]] * (max_seq_len - seq_len) + + position_ids = list(range(seq_len)) + + position_ids = position_ids + [0] * (max_seq_len - seq_len) + attention_mask = [1] * seq_len + [0] * (max_seq_len - seq_len) + + image_data = base64.b64decode(image_buff_string.encode("utf8")) + padded_image = pad_image_data(image_data) + + start_ids = np.array([0.0 for x in range(max_seq_len)], dtype="int64") + end_ids = np.array([0.0 for x in range(max_seq_len)], dtype="int64") + + for item in example["result_list"]: + start = map_offset(item["start"] + offset_bias, offset_mapping) + end = map_offset(item["end"] - 1 + offset_bias, offset_mapping) + start_ids[start] = 1.0 + end_ids[end] = 1.0 + + assert len(inputs_ids) == max_seq_len + assert len(token_type_ids) == max_seq_len + assert len(position_ids) == max_seq_len + assert len(attention_mask) == max_seq_len + assert len(bbox_list) == max_seq_len + tokenized_output = { + "input_ids": inputs_ids, + "token_type_ids": token_type_ids, + "position_ids": position_ids, + "attention_mask": attention_mask, + "bbox": bbox_list, + "image": padded_image, + "start_positions": start_ids, + "end_positions": end_ids, + } + return tokenized_output + + +def _process_bbox(tokens, bbox_lines, offset_mapping, offset_bias): + bbox_list = [[0, 0, 0, 0] for x in range(len(tokens))] + + for index, bbox in enumerate(bbox_lines): + index_token = map_offset(index + offset_bias, offset_mapping) + if 0 <= index_token < len(bbox_list): + bbox_list[index_token] = bbox + return bbox_list + + +def _encode_doc(tokenizer, offset_mapping, last_offset, prompt, this_text_line, inputs_ids, q_sep_index, max_seq_len): + if len(offset_mapping) == 0: + content_encoded_inputs = tokenizer( + text=[prompt], + text_pair=[this_text_line], + max_seq_len=max_seq_len, + return_dict=False, + return_offsets_mapping=True, + ) + content_encoded_inputs = content_encoded_inputs[0] + inputs_ids = content_encoded_inputs["input_ids"][:-1] + sub_offset_mapping = [list(x) for x in content_encoded_inputs["offset_mapping"]] + q_sep_index = content_encoded_inputs["input_ids"].index(2, 1) + + bias = 0 + for i in range(len(sub_offset_mapping)): + if i == 0: + continue + mapping = sub_offset_mapping[i] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + bias = sub_offset_mapping[i - 1][-1] + 1 + if mapping[0] == 0 and mapping[1] == 0: + continue + if mapping == sub_offset_mapping[i - 1]: + continue + sub_offset_mapping[i][0] += bias + sub_offset_mapping[i][1] += bias + + offset_mapping = sub_offset_mapping[:-1] + last_offset = offset_mapping[-1][-1] + else: + content_encoded_inputs = tokenizer( + text=this_text_line, max_seq_len=max_seq_len, return_dict=False, return_offsets_mapping=True + ) + inputs_ids += content_encoded_inputs["input_ids"][1:-1] + sub_offset_mapping = [list(x) for x in content_encoded_inputs["offset_mapping"]] + + for i, sub_list in enumerate(sub_offset_mapping[1:-1]): + if i == 0: + org_offset = sub_list[1] + else: + if sub_list[0] != 
org_offset and sub_offset_mapping[1:-1][i - 1] != sub_list: + last_offset += 1 + org_offset = sub_list[1] + offset_mapping += [[last_offset, sub_list[1] - sub_list[0] + last_offset]] + last_offset = offset_mapping[-1][-1] + return offset_mapping, last_offset, q_sep_index, inputs_ids diff --git a/applications/information_extraction/label_studio.py b/applications/information_extraction/label_studio.py new file mode 100644 index 0000000000000000000000000000000000000000..0f0d815f7774d388d6d700f183280f800a2d654b --- /dev/null +++ b/applications/information_extraction/label_studio.py @@ -0,0 +1,139 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import random +import time +from decimal import Decimal + +import numpy as np +import paddle + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger +from paddlenlp.utils.tools import DataConverter + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def do_convert(): + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.label_studio_file): + raise ValueError("Please input the correct path of label studio file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 0 and len(args.splits) != 3: + raise ValueError("Only []/ len(splits)==3 accepted for splits.") + + def _check_sum(splits): + return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1") + + if len(args.splits) == 3 and not _check_sum(args.splits): + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.label_studio_file, "r", encoding="utf-8") as f: + raw_examples = json.loads(f.read()) + + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + index_list = indexes.tolist() + raw_examples = [raw_examples[i] for i in indexes] + + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + + train_ids = index_list[:p1] + dev_ids = index_list[p1:p2] + test_ids = index_list[p2:] + + with open(os.path.join(args.save_dir, "sample_index.json"), "w") as fp: + maps = {"train_ids": train_ids, "dev_ids": dev_ids, "test_ids": test_ids} + fp.write(json.dumps(maps)) + + if raw_examples[0]["data"].get("image"): + anno_type = "image" + else: + anno_type = "text" + + data_converter = DataConverter( + args.label_studio_file, + negative_ratio=args.negative_ratio, + prompt_prefix=args.prompt_prefix, + options=args.options, + separator=args.separator, + layout_analysis=args.layout_analysis, + schema_lang=args.schema_lang, + ocr_lang=args.ocr_lang, + anno_type=anno_type, + ) + + if args.task_type == "ext": + train_examples = data_converter.convert_ext_examples(raw_examples[:p1]) + dev_examples = data_converter.convert_ext_examples(raw_examples[p1:p2], is_train=False) + test_examples = 
data_converter.convert_ext_examples(raw_examples[p2:], is_train=False) + else: + train_examples = data_converter.convert_cls_examples(raw_examples[:p1]) + dev_examples = data_converter.convert_cls_examples(raw_examples[p1:p2]) + test_examples = data_converter.convert_cls_examples(raw_examples[p2:]) + + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + f.write(json.dumps(example, ensure_ascii=False) + "\n") + count += 1 + logger.info("Save %d examples to %s." % (count, save_path)) + + _save_examples(args.save_dir, "train.txt", train_examples) + _save_examples(args.save_dir, "dev.txt", dev_examples) + _save_examples(args.save_dir, "test.txt", test_examples) + + logger.info("Finished! It takes %.2f seconds" % (time.time() - tic_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--label_studio_file", default="./data/label_studio.json", type=str, help="The annotation file exported from label studio platform.") + parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.") + parser.add_argument("--negative_ratio", default=5, type=int, help="Used only for the extraction task, the ratio of positive and negative samples, number of negtive samples = negative_ratio * number of positive samples") + parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. [0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.") + parser.add_argument("--task_type", choices=['ext', 'cls'], default="ext", type=str, help="Select task type, ext for the extraction task and cls for the classification task, defaults to ext.") + parser.add_argument("--options", default=["正向", "负向"], type=str, nargs="+", help="Used only for the classification task, the options for classification") + parser.add_argument("--prompt_prefix", default="情感倾向", type=str, help="Used only for the classification task, the prompt prefix for classification") + parser.add_argument("--is_shuffle", default="True", type=strtobool, help="Whether to shuffle the labeled dataset, defaults to True.") + parser.add_argument("--layout_analysis", default=False, type=bool, help="Enable layout analysis to optimize the order of OCR result.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization") + parser.add_argument("--separator", type=str, default='##', help="Used only for entity/aspect-level classification task, separator for entity label and classification label") + parser.add_argument("--schema_lang", choices=["ch", "en"], default="ch", help="Select the language type for schema.") + parser.add_argument("--ocr_lang", choices=["ch", "en"], default="ch", help="Select the language type for OCR.") + + args = parser.parse_args() + # yapf: enable + + do_convert() diff --git a/applications/information_extraction/label_studio_doc.md b/applications/information_extraction/label_studio_doc.md new file mode 100644 index 0000000000000000000000000000000000000000..57be7dc327b25abfe492a71c5723863bafbeb578 --- /dev/null +++ b/applications/information_extraction/label_studio_doc.md @@ -0,0 +1,272 @@ +# 文档抽取任务Label Studio使用指南 + + **目录** + +- [1. 安装](#1) +- [2. 
文档抽取任务标注](#2) + - [2.1 项目创建](#21) + - [2.2 数据上传](#22) + - [2.3 标签构建](#23) + - [2.4 任务标注](#24) + - [2.5 数据导出](#25) + - [2.6 数据转换](#26) + - [2.7 更多配置](#27) + + + +## 1. 安装 +**以下标注示例用到的环境配置:** + +- Python 3.8+ +- label-studio == 1.6.0 +- paddleocr >= 2.6.0.1 + +在终端(terminal)使用pip安装label-studio: + +```shell +pip install label-studio==1.6.0 +``` + +安装完成后,运行以下命令行: +```shell +label-studio start +``` + +在浏览器打开[http://localhost:8080/](http://127.0.0.1:8080/),输入用户名和密码登录,开始使用label-studio进行标注。 + + + + +## 2. 文档抽取任务标注 + + + +#### 2.1 项目创建 + +点击创建(Create)开始创建一个新的项目,填写项目名称、描述,然后选择``Object Detection with Bounding Boxes``。 + +- 填写项目名称、描述 + +
+ +
+
+- **命名实体识别、关系抽取、事件抽取、实体/评价维度分类**任务选择``Object Detection with Bounding Boxes``
+
+
+ +
+
+- **文档分类**任务选择``Image Classification``
+
+
+ +
+ +- 添加标签(也可跳过后续在Setting/Labeling Interface中添加) + +
+ +
+ +图中展示了Span实体类型标签的构建,其他类型标签的构建可参考[2.3标签构建](#23) + + + +#### 2.2 数据上传 + +先从本地或HTTP链接上传图片,然后选择导入本项目。 + +
+ +
+ + + +#### 2.3 标签构建 + +- Span实体类型标签 + +
+ +
+ + +- Relation关系类型标签 + +
+ +
+ +Relation XML模板: + +```xml + + + + + +``` + + +- 分类类别标签 + +
+ +
+ + + +#### 2.4 任务标注 + +- 实体抽取 + + - 标注示例: + +
+ +
+ + - 该标注示例对应的schema为: + + ```text + schema = ['开票日期', '名称', '纳税人识别号', '地址、电话', '开户行及账号', '金额', '税额', '价税合计', 'No', '税率'] + ``` + +- 关系抽取 + + - Step 1. 标注主体(Subject)及客体(Object) + +
+ +
+ + - Step 2. 关系连线,箭头方向由主体(Subject)指向客体(Object) + +
+ +
+ +
+ +
+ + - Step 3. 添加对应关系类型标签 + +
+ +
+ +
+ +
+ + - Step 4. 完成标注 + +
+ +
+ + + - 该标注示例对应的schema为: + + ```text + schema = { + '名称及规格': [ + '金额', + '单位', + '数量' + ] + } + ``` + +- 文档分类 + + - 标注示例 + +
+ +
+ + - 该标注示例对应的schema为: + + ```text + schema = '文档类别[发票,报关单]' + ``` + + + + +#### 2.5 数据导出 + +勾选已标注图片ID,选择导出的文件类型为``JSON``,导出数据: + +
+ +
+ + + +#### 2.6 数据转换 + +将导出的文件重命名为``label_studio.json``后,放入``./document/data``目录下,并将对应的标注图片放入``./document/data/images``目录下(图片的文件名需与上传到label studio时的命名一致)。通过[label_studio.py](./label_studio.py)脚本可转为UIE的数据格式。 + +- 路径示例 + +```shell +./document/data/ +├── images # 图片目录 +│ ├── b0.jpg # 原始图片(文件名需与上传到label studio时的命名一致) +│ └── b1.jpg +└── label_studio.json # 从label studio导出的标注文件 +``` + +- 抽取式任务 + +```shell +python label_studio.py \ + --label_studio_file ./document/data/label_studio.json \ + --save_dir ./document/data \ + --splits 0.8 0.1 0.1\ + --task_type ext +``` + +- 文档分类任务 + +```shell +python label_studio.py \ + --label_studio_file ./document/data/label_studio.json \ + --save_dir ./document/data \ + --splits 0.8 0.1 0.1 \ + --task_type cls \ + --prompt_prefix "文档类别" \ + --options "发票" "报关单" +``` + + + +#### 2.7 更多配置 + +- ``label_studio_file``: 从label studio导出的数据标注文件。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``negative_ratio``: 最大负例比例,该参数只对抽取类型任务有效,适当构造负例可提升模型效果。负例数量和实际的标签数量有关,最大负例数量 = negative_ratio * 正例数量。该参数只对训练集有效,默认为5。为了保证评估指标的准确性,验证集和测试集默认构造全负例。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``task_type``: 选择任务类型,可选有抽取和分类两种类型的任务。 +- ``options``: 指定分类任务的类别标签,该参数只对分类类型任务有效。默认为["正向", "负向"]。 +- ``prompt_prefix``: 声明分类任务的prompt前缀信息,该参数只对分类类型任务有效。默认为"情感倾向"。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. +- ``separator``: 实体类别/评价维度与分类标签的分隔符,该参数只对实体/评价维度分类任务有效。默认为"##"。 +- ``schema_lang``:选择schema的语言,将会应该训练数据prompt的构造方式,可选有`ch`和`en`。默认为`ch`。 +- ``ocr_lang``:选择OCR的语言,可选有`ch`和`en`。默认为`ch`。 +- ``layout_analysis``:是否使用PPStructure对文档进行布局分析,该参数只对文档类型标注任务有效。默认为False。 + +备注: +- 默认情况下 [label_studio.py](./label_studio.py) 脚本会按照比例将数据划分为 train/dev/test 数据集 +- 每次执行 [label_studio.py](./label_studio.py) 脚本,将会覆盖已有的同名数据文件 +- 在模型训练阶段我们推荐构造一些负例以提升模型效果,在数据转换阶段我们内置了这一功能。可通过`negative_ratio`控制自动构造的负样本比例;负样本数量 = negative_ratio * 正样本数量。 +- 对于从label_studio导出的文件,默认文件中的每条数据都是经过人工正确标注的。 + + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/information_extraction/label_studio_doc_en.md b/applications/information_extraction/label_studio_doc_en.md new file mode 100644 index 0000000000000000000000000000000000000000..e7925d54509c3a43b8ce4ae36abe8154d6fb739f --- /dev/null +++ b/applications/information_extraction/label_studio_doc_en.md @@ -0,0 +1,273 @@ +# Label Studio User Guide - Document Information Extraction + + **Table of contents** + +- [1. Installation](#1) +- [2. Document Extraction Task Annotation](#2) + - [2.1 Project Creation](#21) + - [2.2 Data Upload](#22) + - [2.3 Label Construction](#23) + - [2.4 Task Annotation](#24) + - [2.5 Data Export](#25) + - [2.6 Data Conversion](#26) + - [2.7 More Configuration](#27) + + + +## 1. Installation + +**Environmental configuration used in the following annotation examples:** + +- Python 3.8+ +- label-studio == 1.6.0 +- paddleocr >= 2.6.0.1 + +Use pip to install label-studio in the terminal: + +```shell +pip install label-studio==1.6.0 +``` + +Once the installation is complete, run the following command line: + +```shell +label-studio start +``` + +Open [http://localhost:8080/](http://127.0.0.1:8080/) in the browser, enter the user name and password to log in, and start using label-studio for labeling. + + + +## 2. Document Extraction Task Annotation + + + +#### 2.1 Project Creation + +Click Create to start creating a new project, fill in the project name, description, and select ``Object Detection with Bounding Boxes``. + +- Fill in the project name, description + +
+ +
+
+- For **Named Entity Recognition, Relation Extraction** tasks please select ``Object Detection with Bounding Boxes``
+
+
+ +
+
+- For **Document Classification** task please select ``Image Classification``
+
+
+ +
+ +- Define labels + +
+ +
+ +The figure shows the construction of Span entity type tags. For the construction of other types of tags, please refer to [2.3 Label Construction](#23) + + + +#### 2.2 Data upload + +First upload the picture from a local or HTTP link, and then choose to import this project. + +
+ +
+ + + +#### 2.3 Label Construction + +- Entity Label + +
+ +
+ + +- Relation label + +
+ +
+ +Relation XML template: + +```xml + + + + + +``` + +- Classification label + +
+ +
+ + + +#### 2.4 Task Annotation + +- Entity extraction + + - Callout example: + +
+ +
+ + - The schema corresponding to this annotation example is: + + ```text + schema = ['开票日期', '名称', '纳税人识别号', '地址、电话', '开户行及账号', '金额', '税额', '价税合计', 'No', '税率'] + ``` + +- Relation extraction + + - Step 1. Label the subject and object + +
+ +
+ + - Step 2. Relation line, the direction of the arrow is from the subject to the object + +
+ +
+ +
+ +
+ + - Step 3. Add corresponding relation label + +
+ +
+ +
+ +
+ + - Step 4. Finish labeling + +
+ +
+ + + - The schema corresponding to this annotation example is: + + ```text + schema = { + '名称及规格': [ + '金额', + '单位', + '数量' + ] + } + ``` + +- Document classification + + - Callout example + +
+ +
+ + - The schema corresponding to this annotation example is: + + ```text + schema = '文档类别[发票,报关单]' + ``` + + + + +#### 2.5 Data Export + +Check the marked image ID, select the exported file type as ``JSON``, and export the data: + +
+ +
+ + + + +#### 2.6 Data Conversion + +After renaming the exported file to ``label_studio.json``, put it into the ``./document/data`` directory, and put the corresponding label image into the ``./document/data/images`` directory (The file name of the picture must be the same as the one uploaded to label studio). Through the [label_studio.py](./label_studio.py) script, it can be converted to the data format of UIE. + +- Path example + +```shell +./document/data/ +├── images # image directory +│ ├── b0.jpg # Original picture (the file name must be the same as the one uploaded to label studio) +│ └── b1.jpg +└── label_studio.json # Annotation file exported from label studio +``` + +- Extraction task + +```shell +python label_studio.py \ + --label_studio_file ./document/data/label_studio.json \ + --save_dir ./document/data \ + --splits 0.8 0.1 0.1 \ + --task_type ext +``` + +- Document classification tasks + +```shell +python label_studio.py \ + --label_studio_file ./document/data/label_studio.json \ + --save_dir ./document/data \ + --splits 0.8 0.1 0.1 \ + --task_type cls \ + --prompt_prefix "document category" \ + --options "invoice" "customs declaration" +``` + + + +#### 2.7 More Configuration + +- ``label_studio_file``: Data labeling file exported from label studio. +- ``save_dir``: The storage directory of the training data, which is stored in the ``data`` directory by default. +- ``negative_ratio``: The maximum negative ratio. This parameter is only valid for extraction tasks. Properly constructing negative examples can improve the model effect. The number of negative examples is related to the actual number of labels, the maximum number of negative examples = negative_ratio * number of positive examples. This parameter is only valid for the training set, and the default is 5. In order to ensure the accuracy of the evaluation indicators, the verification set and test set are constructed with all negative examples by default. +- ``splits``: The proportion of training set and validation set when dividing the data set. The default is [0.8, 0.1, 0.1], which means that the data is divided into training set, verification set and test set according to the ratio of ``8:1:1``. +- ``task_type``: Select the task type, there are two types of tasks: extraction and classification. +- ``options``: Specify the category label of the classification task, this parameter is only valid for the classification type task. Defaults to ["positive", "negative"]. +- ``prompt_prefix``: Declare the prompt prefix information of the classification task, this parameter is only valid for the classification type task. Defaults to "Sentimental Tendency". +- ``is_shuffle``: Whether to randomly shuffle the data set, the default is True. +- ``seed``: random seed, default is 1000. +- ``separator``: The separator between entity category/evaluation dimension and classification label. This parameter is only valid for entity/evaluation dimension classification tasks. The default is"##". +- ``schema_lang``: Select the language of the schema, which will be the construction method of the training data prompt, optional `ch` and `en`. Defaults to `ch`. +- ``ocr_lang``: Select the language for OCR, optional `ch` and `en`. Defaults to `ch`. +- ``layout_analysis``: Whether to use PPStructure to analyze the layout of the document. This parameter is only valid for document type labeling tasks. The default is False. 
+ +Note: +- By default the [label_studio.py](./label_studio.py) script will divide the data proportionally into train/dev/test datasets +- Each time the [label_studio.py](./label_studio.py) script is executed, the existing data file with the same name will be overwritten +- In the model training phase, we recommend constructing some negative examples to improve the model performance, and we have built-in this function in the data conversion phase. The proportion of automatically constructed negative samples can be controlled by `negative_ratio`; the number of negative samples = negative_ratio * the number of positive samples. +- For files exported from label_studio, each piece of data in the default file is correctly labeled manually. + + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/information_extraction/label_studio_text.md b/applications/information_extraction/label_studio_text.md new file mode 100644 index 0000000000000000000000000000000000000000..6596940a24e9fad3a33f4b699abfe36e87172d73 --- /dev/null +++ b/applications/information_extraction/label_studio_text.md @@ -0,0 +1,287 @@ +# 文本抽取任务Label Studio使用指南 + + **目录** + +- [1. 安装](#1) +- [2. 文本抽取任务标注](#2) + - [2.1 项目创建](#21) + - [2.2 数据上传](#22) + - [2.3 标签构建](#23) + - [2.4 任务标注](#24) + - [2.5 数据导出](#25) + - [2.6 数据转换](#26) + - [2.7 更多配置](#27) + + + +## 1. 安装 +**以下标注示例用到的环境配置:** + +- Python 3.8+ +- label-studio == 1.6.0 +- paddleocr >= 2.6.0.1 + +在终端(terminal)使用pip安装label-studio: + +```shell +pip install label-studio==1.6.0 +``` + +安装完成后,运行以下命令行: +```shell +label-studio start +``` + +在浏览器打开[http://localhost:8080/](http://127.0.0.1:8080/),输入用户名和密码登录,开始使用label-studio进行标注。 + + + +## 2. 文本抽取任务标注 + + + +#### 2.1 项目创建 + +点击创建(Create)开始创建一个新的项目,填写项目名称、描述,然后选择``Object Detection with Bounding Boxes``。 + +- 填写项目名称、描述 + +
+ +
+
+- **命名实体识别、关系抽取、事件抽取、实体/评价维度分类**任务选择``Relation Extraction``。
+
+
+ +
+ +- **文本分类、句子级情感倾向分类**任务选择``Text Classification``。 + +
+ +
+ +- 添加标签(也可跳过后续在Setting/Labeling Interface中配置) + +
+ +
+ +图中展示了实体类型标签的构建,其他类型标签的构建可参考[2.3标签构建](#23) + + + +#### 2.2 数据上传 + +先从本地上传txt格式文件,选择``List of tasks``,然后选择导入本项目。 + +
+ +
+ + + +#### 2.3 标签构建 + +- Span类型标签 + +
+ +
+ +- Relation类型标签 + +
+ +
+ +Relation XML模板: + +```xml + + + + + +``` + +- 分类类别标签 + +
+ +
+ + + + +#### 2.4 任务标注 + +- 实体抽取 + +标注示例: + +
+ +
+ +该标注示例对应的schema为: + +```text +schema = [ + '时间', + '选手', + '赛事名称', + '得分' +] +``` + +- 关系抽取 + +
+ +
+ +对于关系抽取,其P的类型设置十分重要,需要遵循以下原则 + +“{S}的{P}为{O}”需要能够构成语义合理的短语。比如对于三元组(S, 父子, O),关系类别为父子是没有问题的。但按照UIE当前关系类型prompt的构造方式,“S的父子为O”这个表达不是很通顺,因此P改成孩子更好,即“S的孩子为O”。**合理的P类型设置,将显著提升零样本效果**。 + +该标注示例对应的schema为: + +```text +schema = { + '作品名': [ + '歌手', + '发行时间', + '所属专辑' + ] +} +``` + +- 事件抽取 + +
+ +
+ +该标注示例对应的schema为: + +```text +schema = { + '地震触发词': [ + '时间', + '震级' + ] +} +``` + +- 句子级分类 + +
+ +
+ + +该标注示例对应的schema为: + +```text +schema = '情感倾向[正向,负向]' +``` + +- 实体/评价维度分类 + +
+ +
+ +该标注示例对应的schema为: + +```text +schema = { + '评价维度': [ + '观点词', + '情感倾向[正向,负向]' + ] +} +``` + + + +#### 2.5 数据导出 + +勾选已标注文本ID,选择导出的文件类型为``JSON``,导出数据: + +
+ +
+ + + +#### 2.6 数据转换 + +将导出的文件重命名为``label_studio.json``后,放入``./data``目录下。通过[label_studio.py](./label_studio.py)脚本可转为UIE的数据格式。 + +- 抽取式任务 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --task_type ext +``` + +- 句子级分类任务 + +在数据转换阶段,我们会自动构造用于模型训练的prompt信息。例如句子级情感分类中,prompt为``情感倾向[正向,负向]``,可以通过`prompt_prefix`和`options`参数进行配置。 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --task_type cls \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --prompt_prefix "情感倾向" \ + --options "正向" "负向" +``` + +- 实体/评价维度分类任务 + +在数据转换阶段,我们会自动构造用于模型训练的prompt信息。例如评价维度情感分类中,prompt为``XXX的情感倾向[正向,负向]``,可以通过`prompt_prefix`和`options`参数进行声明。 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --task_type ext \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --prompt_prefix "情感倾向" \ + --options "正向" "负向" \ + --separator "##" +``` + + + +#### 2.7 更多配置 + +- ``label_studio_file``: 从label studio导出的数据标注文件。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``negative_ratio``: 最大负例比例,该参数只对抽取类型任务有效,适当构造负例可提升模型效果。负例数量和实际的标签数量有关,最大负例数量 = negative_ratio * 正例数量。该参数只对训练集有效,默认为5。为了保证评估指标的准确性,验证集和测试集默认构造全负例。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``task_type``: 选择任务类型,可选有抽取和分类两种类型的任务。 +- ``options``: 指定分类任务的类别标签,该参数只对分类类型任务有效。默认为["正向", "负向"]。 +- ``prompt_prefix``: 声明分类任务的prompt前缀信息,该参数只对分类类型任务有效。默认为"情感倾向"。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. +- ``schema_lang``:选择schema的语言,将会应该训练数据prompt的构造方式,可选有`ch`和`en`。默认为`ch`。 +- ``separator``: 实体类别/评价维度与分类标签的分隔符,该参数只对实体/评价维度分类任务有效。默认为"##"。 + +备注: +- 默认情况下 [label_studio.py](./label_studio.py) 脚本会按照比例将数据划分为 train/dev/test 数据集 +- 每次执行 [label_studio.py](./label_studio.py) 脚本,将会覆盖已有的同名数据文件 +- 在模型训练阶段我们推荐构造一些负例以提升模型效果,在数据转换阶段我们内置了这一功能。可通过`negative_ratio`控制自动构造的负样本比例;负样本数量 = negative_ratio * 正样本数量。 +- 对于从label_studio导出的文件,默认文件中的每条数据都是经过人工正确标注的。 + + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/information_extraction/label_studio_text_en.md b/applications/information_extraction/label_studio_text_en.md new file mode 100644 index 0000000000000000000000000000000000000000..8f13d48079c4e8f694f3827d2e5a6f2a59c2c331 --- /dev/null +++ b/applications/information_extraction/label_studio_text_en.md @@ -0,0 +1,288 @@ +# Label Studio User Guide - Text Information Extraction + +**Table of contents** + +- [1. Installation](#1) +- [2. Text Extraction Task Annotation](#2) + - [2.1 Project Creation](#21) + - [2.2 Data Upload](#22) + - [2.3 Label Construction](#23) + - [2.4 Task Annotation](#24) + - [2.5 Data Export](#25) + - [2.6 Data Conversion](#26) + - [2.7 More Configuration](#27) + + + +## 1. Installation + +**Environmental configuration used in the following annotation examples:** + +- Python 3.8+ +- label-studio == 1.6.0 +- paddleocr >= 2.6.0.1 + +Use pip to install label-studio in the terminal: + +```shell +pip install label-studio==1.6.0 +``` + +Once the installation is complete, run the following command line: +```shell +label-studio start +``` + +Open [http://localhost:8080/](http://127.0.0.1:8080/) in the browser, enter the user name and password to log in, and start using label-studio for labeling. + + + +## 2. Text extraction task annotation + + + +#### 2.1 Project Creation + +Click Create to start creating a new project, fill in the project name, description, and select ``Object Detection with Bounding Boxes``. 
+ +- Fill in the project name, description + +
+ +
+
+- For **Named Entity Recognition, Relation Extraction, Event Extraction, Opinion Extraction** tasks please select ``Relation Extraction``.
+
+
+ +
+ +- For **Text classification, Sentence-level sentiment classification** tasks please select ``Text Classification``. + +
+ +
+ +- Define labels + +
+ +
+ +The figure shows the construction of entity type tags, and the construction of other types of tags can refer to [2.3 Label Construction](#23) + + + +#### 2.2 Data upload + +First upload the txt format file locally, select ``List of tasks``, and then choose to import this project. + +
+ +
+ + + +#### 2.3 Label construction + +- Entity label + +
+ +
+ +- Relation label + +
+ +
+ +Relation XML template: + +```xml + + + + + +``` + +- Classification label + +
+ +
+ + + + +#### 2.4 Task annotation + +- Entity extraction + +Callout example: + +
+ +
+ +The schema corresponding to this annotation example is: + +```text +schema = [ + '时间', + '选手', + '赛事名称', + '得分' +] +``` + +- Relation extraction + +
+ +
+ +For relation extraction, the type setting of P is very important, and the following principles need to be followed + +"{P} of {S} is {O}" needs to be able to form a semantically reasonable phrase. For example, for a triple (S, father and son, O), there is no problem with the relation category being father and son. However, according to the current structure of the UIE relation type prompt, the expression "the father and son of S is O" is not very smooth, so it is better to change P to child, that is, "child of S is O". **A reasonable P type setting will significantly improve the zero-shot performance**. + +The schema corresponding to this annotation example is: + +```text +schema = { + '作品名': [ + '歌手', + '发行时间', + '所属专辑' + ] +} +``` + +- Event extraction + +
+ +
+ +The schema corresponding to this annotation example is: + +```text +schema = { + '地震触发词': [ + '时间', + '震级' + ] +} +``` + +- Sentence level classification + +
+ +
+ + +The schema corresponding to this annotation example is: + +```text +schema = '情感倾向[正向,负向]' +``` + +- Opinion Extraction + +
+ +
+ +The schema corresponding to this annotation example is: + +```text +schema = { + '评价维度': [ + '观点词', + '情感倾向[正向,负向]' + ] +} +``` + + + +#### 2.5 Data Export + +Check the marked text ID, select the exported file type as ``JSON``, and export the data: + +
+ +
+ + + +#### 2.6 Data conversion + +Rename the exported file to ``label_studio.json`` and put it in the ``./data`` directory. Through the [label_studio.py](./label_studio.py) script, it can be converted to the data format of UIE. + +- Extraction task + +```shell +python label_studio.py\ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --task_type ext +``` + +- Sentence-level classification tasks + +In the data conversion stage, we will automatically construct prompt information for model training. For example, in sentence-level sentiment classification, the prompt is ``Sentiment Classification [positive, negative]``, which can be configured through `prompt_prefix` and `options` parameters. + +```shell +python label_studio.py\ + --label_studio_file ./data/label_studio.json \ + --task_type cls \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --prompt_prefix "Sentiment Classification" \ + --options "positive" "negative" +``` + +- Opinion Extraction + +In the data conversion stage, we will automatically construct prompt information for model training. For example, in the emotional classification of the evaluation dimension, the prompt is ``Sentiment Classification of xxx [positive, negative]``, which can be declared through the `prompt_prefix` and `options` parameters. + +```shell +python label_studio.py\ + --label_studio_file ./data/label_studio.json \ + --task_type ext \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --prompt_prefix "Sentiment Classification" \ + --options "positive" "negative" \ + --separator "##" +``` + + + +#### 2.7 More Configuration + +- ``label_studio_file``: Data labeling file exported from label studio. +- ``save_dir``: The storage directory of the training data, which is stored in the ``data`` directory by default. +- ``negative_ratio``: The maximum negative ratio. This parameter is only valid for extraction tasks. Properly constructing negative examples can improve the model effect. The number of negative examples is related to the actual number of labels, the maximum number of negative examples = negative_ratio * number of positive examples. This parameter is only valid for the training set, and the default is 5. In order to ensure the accuracy of the evaluation indicators, the verification set and test set are constructed with all negative examples by default. +- ``splits``: The proportion of training set and validation set when dividing the data set. The default is [0.8, 0.1, 0.1], which means that the data is divided into training set, verification set and test set according to the ratio of ``8:1:1``. +- ``task_type``: Select the task type, there are two types of tasks: extraction and classification. +- ``options``: Specify the category label of the classification task, this parameter is only valid for the classification type task. Defaults to ["positive", "negative"]. +- ``prompt_prefix``: Declare the prompt prefix information of the classification task, this parameter is only valid for the classification type task. Defaults to "Sentimental Tendency". +- ``is_shuffle``: Whether to randomly shuffle the data set, the default is True. +- ``seed``: random seed, default is 1000. +- ``schema_lang``: Select the language of the schema, which will be the construction method of the training data prompt, optional `ch` and `en`. Defaults to `ch`. +- ``separator``: The separator between entity category/evaluation dimension and classification label. 
This parameter is only valid for entity/evaluation dimension classification tasks. The default is"##". + +Note: +- By default the [label_studio.py](./label_studio.py) script will divide the data proportionally into train/dev/test datasets +- Each time the [label_studio.py](./label_studio.py) script is executed, the existing data file with the same name will be overwritten +- In the model training phase, we recommend constructing some negative examples to improve the model performance, and we have built-in this function in the data conversion phase. The proportion of automatically constructed negative samples can be controlled by `negative_ratio`; the number of negative samples = negative_ratio * the number of positive samples. +- For files exported from label_studio, each piece of data in the default file is correctly labeled manually. + + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/information_extraction/taskflow_doc.md b/applications/information_extraction/taskflow_doc.md new file mode 100644 index 0000000000000000000000000000000000000000..538a86e12b21aab09fb3843c70a129ecfe9eaa33 --- /dev/null +++ b/applications/information_extraction/taskflow_doc.md @@ -0,0 +1,310 @@ +# UIE Taskflow使用指南 + +**目录** +- [1. 功能简介](#1) +- [2. 文档信息抽取](#2) + - [2.1 实体抽取](#21) + - [2.2 关系抽取](#22) + - [2.3 跨任务使用](#23) + - [2.4 输入说明](#24) + - [2.5 使用技巧](#25) + - [2.6 结果可视化](#26) + - [2.7 更多配置](#27) + + + +## 1. 功能简介 + +```paddlenlp.Taskflow```提供文本及文档的通用信息抽取、评价观点抽取等能力,可抽取多种类型的信息,包括但不限于命名实体识别(如人名、地名、机构名等)、关系(如电影的导演、歌曲的发行时间等)、事件(如某路口发生车祸、某地发生地震等)、以及评价维度、观点词、情感倾向等信息。用户可以使用自然语言自定义抽取目标,无需训练即可统一抽取输入文本或文档中的对应信息。**实现开箱即用,并满足各类信息抽取需求** + + + +## 2. 文档信息抽取 + +本章节主要介绍Taskflow的文档抽取功能,以下示例图片[下载链接](https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/cases.zip)。 + + + +#### 2.1 实体抽取 + +实体抽取,又称命名实体识别(Named Entity Recognition,简称NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。 + +- 报关单 + +
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = ["收发货人", "进口口岸", "进口日期", "运输方式", "征免性质", "境内目的地", "运输工具名称", "包装种类", "件数", "合同协议号"] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base") +>>> pprint(ie({"doc": "./cases/custom.jpeg"})) +[{'件数': [{'bbox': [[826, 1062, 926, 1121]], + 'end': 312, + 'probability': 0.9832498761402597, + 'start': 308, + 'text': '1142'}], +'包装种类': [{'bbox': [[1214, 1066, 1310, 1121]], + 'end': 314, + 'probability': 0.9995648138860567, + 'start': 312, + 'text': '纸箱'}], +'合同协议号': [{'bbox': [[151, 1077, 258, 1117]], + 'end': 319, + 'probability': 0.9984179437542124, + 'start': 314, + 'text': '33035'}], +'境内目的地': [{'bbox': [[1966, 872, 2095, 923]], + 'end': 275, + 'probability': 0.9975541483111243, + 'start': 272, + 'text': '上海市'}], +'征免性质': [{'bbox': [[1583, 770, 1756, 821]], + 'end': 242, + 'probability': 0.9950633161231508, + 'start': 238, + 'text': '一般征税'}], +'收发货人': [{'bbox': [[321, 533, 841, 580]], + 'end': 95, + 'probability': 0.4772132061042136, + 'start': 82, + 'text': '上海新尚实国际贸易有限公司'}, + {'bbox': [[306, 584, 516, 624]], + 'end': 150, + 'probability': 0.33807074572195006, + 'start': 140, + 'text': '31222609K9'}], +'运输工具名称': [{'bbox': [[1306, 672, 1516, 712], [1549, 668, 1645, 712]], + 'end': 190, + 'probability': 0.6692050414718089, + 'start': 174, + 'text': 'E. R. TIANAN004E'}], +'运输方式': [{'bbox': [[1070, 664, 1240, 715]], + 'end': 174, + 'probability': 0.9994416347044179, + 'start': 170, + 'text': '永路运输'}], +'进口口岸': [{'bbox': [[1070, 566, 1346, 617]], + 'end': 120, + 'probability': 0.9945697196994345, + 'start': 111, + 'text': '洋山港区-2248'}], +'进口日期': [{'bbox': [[1726, 569, 1933, 610]], + 'end': 130, + 'probability': 0.9804819494073627, + 'start': 120, + 'text': '2017-02-24'}]}] +``` + +- 证件 + + +
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = ["Name", "Date of birth", "Issue date"] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en") +>>> pprint(ie({"doc": "./cases/license.jpeg"})) +``` + + + +#### 2.2 关系抽取 + +关系抽取(Relation Extraction,简称RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。 + +- 表格 + +
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = {"姓名": ["招聘单位", "报考岗位"]} +>>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base") +>>> pprint(ie({"doc": "./cases/table.png"})) +``` + + + +#### 2.3 跨任务使用 + +- 实体、关系多任务抽取 + +对文档进行实体+关系抽取,schema构造如下: + +```text +schema = [ + "Total GBP", + "No.", + "Date", + "Customer No.", + "Subtotal without VAT", + { + "Description": [ + "Quantity", + "Amount" + ] + } +] +``` + +
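+除了重新创建 Taskflow 实例,也可以在已有实例上通过 `set_schema` 切换抽取目标(该接口在后文 2.5 节的示例中也有使用)。以下仅为示意写法,假设 `ie` 为前文已创建、且 `schema_lang` 等设置与新 schema 匹配的实例:
+
+```python
+>>> # 示意:在同一个 ie 实例上切换为“实体+关系”混合抽取目标
+>>> schema = ["Total GBP", "No.", "Date", "Customer No.", "Subtotal without VAT", {"Description": ["Quantity", "Amount"]}]
+>>> ie.set_schema(schema)
+>>> pprint(ie({"doc": "./cases/delivery_note.png"}))
+```
+
+完整的从零构建实例的示例见下图之后的代码。
+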
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = ["Total GBP", "No.", "Date", "Customer No.", "Subtotal without VAT", {"Description": ["Quantity", "Amount"]}] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en") +>>> pprint(ie({"doc": "./cases/delivery_note.png"})) +``` + + + +#### 2.4 输入说明 + +- 输入格式 + +文档抽取UIE-X支持图片路径、http图片链接、base64的输入形式,支持图片和PDF两种文档格式。文本抽取可以通过`text`指定输入文本。 + +```python +[ + {'text': '2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!'}, + {'doc': './cases/custom.jpg'}, + {'doc': 'https://user-images.githubusercontent.com/40840292/203457719-84a70241-607e-4bb1-ab4c-3d9beee9e254.jpeg'} +] +``` + +**NOTE**: 多页PDF输入目前只抽取第一页的结果,UIE-X比较适合单证文档(如票据、单据等)的信息提取,目前还不适合过长或多页的文档。 + +- 使用自己的layout / OCR作为输入 + +```python +layout = [ + ([68.0, 12.0, 167.0, 70.0], '名次'), + ([464.0, 13.0, 559.0, 67.0], '球员'), + ([833.0, 15.0, 1054.0, 64.0], '总出场时间'), + ...... +] +ie({"doc": doc_path, 'layout': layout}) +``` + + + +#### 2.5 使用技巧 + +- 使用PP-Structure版面分析功能 + +OCR中识别出来的文字会按照左上到右下进行排序,对于分栏、表格内有多行文本等情况我们推荐使用版面分析功能``layout_analysis=True``以优化文字排序并增强抽取效果。以下例子仅举例版面分析功能的使用场景,实际场景一般需要标注微调。 + +
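+在决定是否开启版面分析前,可以先用 `DocParser`(与后文 2.6 节相同的接口)将 OCR 识别结果及其默认排序可视化检查。以下为示意写法,文件路径仅为占位:
+
+```python
+>>> from paddlenlp.utils.doc_parser import DocParser
+
+>>> # 示意:可视化 OCR 识别结果,检查默认的文字排序是否符合阅读顺序
+>>> doc_parser = DocParser(ocr_lang="ch")
+>>> parsed_doc = doc_parser.parse({"doc": "./cases/bid.png"})
+>>> doc_parser.write_image_with_results(
+        "./cases/bid.png",
+        layout=parsed_doc["layout"],
+        save_path="ocr_order.png")
+```
+
+若可视化出的排序与阅读顺序不符,再按下面的示例开启 ``layout_analysis=True``。
+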
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = "中标候选人名称" +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", layout_analysis=True) +>>> pprint(ie({"doc": "https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fwww.xuyiwater.com%2Fwp-content%2Fuploads%2F2021%2F06%2F1-4.jpg&refer=http%3A%2F%2Fwww.xuyiwater.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1672994926&t=2a4a3fedf6999a34ccde190f97bcfa47"})) +``` + +
+ +
+ +```python +>>> schema = "抗血小板药物的用药指征" +>>> ie.set_schema(schema) +>>> pprint(ie({"doc": "./cases/drug.webp"})) +``` + + + +#### 2.6 结果可视化 + +- OCR识别结果可视化: + +```python +>>> from paddlenlp.utils.doc_parser import DocParser + +>>> doc_parser = DocParser(ocr_lang="en") +>>> doc_path = "./cases/business_card.png" +>>> parsed_doc = doc_parser.parse({"doc": doc_path}) +>>> doc_parser.write_image_with_results( + doc_path, + layout=parsed_doc['layout'], + save_path="ocr_result.png") +``` + +
+ +
+ +- 抽取结果可视化: + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> from paddlenlp.utils.doc_parser import DocParser + +>>> doc_path = "./cases/business_card.png" +>>> schema = ["人名", "职位", "号码", "邮箱地址", "网址", "地址", "邮编"] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en") + +>>> results = ie({"doc": doc_path}) + +>>> DocParser.write_image_with_results( + doc_path, + result=results[0], + save_path="image_show.png") +``` + +
+ +
+ + + +#### 2.7 更多配置 + +```python +>>> from paddlenlp import Taskflow + +>>> ie = Taskflow('information_extraction', + schema="", + schema_lang="ch", + ocr_lang="ch", + batch_size=16, + model='uie-x-base', + layout_analysis=False, + position_prob=0.5, + precision='fp32', + use_fast=False) +``` + +* `schema`:定义任务抽取目标,可参考开箱即用中不同任务的调用示例进行配置。 +* `schema_lang`:设置schema的语言,默认为`ch`, 可选有`ch`和`en`。因为中英schema的构造有所不同,因此需要指定schema的语言。 +* `ocr_lang`:选择PaddleOCR的语言,`ch`可在中英混合的图片中使用,`en`在英文图片上的效果更好,默认为`ch`。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为16。 +* `model`:选择任务使用的模型,默认为`uie-base`,可选有`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`和`uie-medical-base`, `uie-base-en`,`uie-x-base`。 +* `layout_analysis`:是否使用PP-Structure对文档进行布局分析以优化布局信息的排序,默认为False。 +* `position_prob`:模型对于span的起始位置/终止位置的结果概率在0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5,span的最终概率输出为起始位置概率和终止位置概率的乘积。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快,支持GPU和NPU硬件环境。如果选择`fp16`,在GPU硬件环境下,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 +* `use_fast`: 使用C++实现的高性能分词算子FastTokenizer进行文本预处理加速。需要通过`pip install fast-tokenizer-python`安装FastTokenizer库后方可使用。默认为`False`。更多使用说明可参考[FastTokenizer文档](../../fast_tokenizer)。 + +## References +- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)** +- **[PP-Structure](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6/ppstructure)** diff --git a/applications/information_extraction/taskflow_doc_en.md b/applications/information_extraction/taskflow_doc_en.md new file mode 100644 index 0000000000000000000000000000000000000000..09fdc7193073b349b7cca5affb206a2911be6f76 --- /dev/null +++ b/applications/information_extraction/taskflow_doc_en.md @@ -0,0 +1,305 @@ +# UIE Taskflow User Guide + +**Table of contents** +- [1. Introduction](#1) +- [2. Document Information Extraction](#2) + - [2.1 Entity Extraction](#21) + - [2.2 Relation Extraction](#22) + - [2.3 Multi-Task Extraction](#23) + - [2.4 Input Format](#24) + - [2.5 Tips](#25) + - [2.6 Visualization](#26) + - [2.7 More Configuration](#27) + + + +## 1. Introduction + +```paddlenlp.Taskflow``` provides general information extraction of text and documents, evaluation opinion extraction and other capabilities, and can extract various types of information, including but not limited to named entities (such as person name, place name, organization name, etc.), relations (such as the director of the movie, the release time of the song, etc.), events (such as a car accident at a certain intersection, an earthquake in a certain place, etc.), and information such as product reviews, opinions, and sentiments. Users can use natural language to customize the extraction target, and can uniformly extract the corresponding information in the input text or document without training. + + + +## 2. Document Information Extraction + +This section introduces the document extraction capability of Taskflow with the following example picture [download link](https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/cases.zip). + + + +#### 2.1 Entity Extraction + +Entity extraction, also known as Named Entity Recognition (NER for short), refers to identifying entities with specific meanings in text. 
UIE adopts the open-domain approach where the entity categories are not fixed and users can define them through natural language.
+
+- Example: Customs Declaration Form
+
+
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = ["收发货人", "进口口岸", "进口日期", "运输方式", "征免性质", "境内目的地", "运输工具名称", "包装种类", "件数", "合同协议号"] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base") +>>> pprint(ie({"doc": "./cases/custom.jpeg"})) +[{'件数': [{'bbox': [[826, 1062, 926, 1121]], + 'end': 312, + 'probability': 0.9832498761402597, + 'start': 308, + 'text': '1142'}], +'包装种类': [{'bbox': [[1214, 1066, 1310, 1121]], + 'end': 314, + 'probability': 0.9995648138860567, + 'start': 312, + 'text': '纸箱'}], +'合同协议号': [{'bbox': [[151, 1077, 258, 1117]], + 'end': 319, + 'probability': 0.9984179437542124, + 'start': 314, + 'text': '33035'}], +'境内目的地': [{'bbox': [[1966, 872, 2095, 923]], + 'end': 275, + 'probability': 0.9975541483111243, + 'start': 272, + 'text': '上海市'}], +'征免性质': [{'bbox': [[1583, 770, 1756, 821]], + 'end': 242, + 'probability': 0.9950633161231508, + 'start': 238, + 'text': '一般征税'}], +'收发货人': [{'bbox': [[321, 533, 841, 580]], + 'end': 95, + 'probability': 0.4772132061042136, + 'start': 82, + 'text': '上海新尚实国际贸易有限公司'}, + {'bbox': [[306, 584, 516, 624]], + 'end': 150, + 'probability': 0.33807074572195006, + 'start': 140, + 'text': '31222609K9'}], +'运输工具名称': [{'bbox': [[1306, 672, 1516, 712], [1549, 668, 1645, 712]], + 'end': 190, + 'probability': 0.6692050414718089, + 'start': 174, + 'text': 'E. R. TIANAN004E'}], +'运输方式': [{'bbox': [[1070, 664, 1240, 715]], + 'end': 174, + 'probability': 0.9994416347044179, + 'start': 170, + 'text': '永路运输'}], +'进口口岸': [{'bbox': [[1070, 566, 1346, 617]], + 'end': 120, + 'probability': 0.9945697196994345, + 'start': 111, + 'text': '洋山港区-2248'}], +'进口日期': [{'bbox': [[1726, 569, 1933, 610]], + 'end': 130, + 'probability': 0.9804819494073627, + 'start': 120, + 'text': '2017-02-24'}]}] +``` + +- Example: Driver's License + +
+ +
+
+```python
+>>> from pprint import pprint
+>>> from paddlenlp import Taskflow
+>>> schema = ["Name", "Date of birth", "Issue date"]
+>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en")
+>>> pprint(ie({"doc": "./cases/license.jpeg"}))
+```
+
+
+
+#### 2.2 Relation Extraction
+
+Relation Extraction refers to identifying entities from text and extracting the semantic relationships between them, obtaining triple information in the form <subject, predicate, object>.
+
+- Example: Extracting relations from a table
+
+
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = {"姓名": ["招聘单位", "报考岗位"]} +>>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base") +>>> pprint(ie({"doc": "./cases/table.png"})) +``` + + + +#### 2.3 Multi-Task Extraction + +To extract entities and relation from documents simultaneously, you may set the schema structure as following: + +```text +schema = [ + "Total GBP", + "No.", + "Date", + "Customer No.", + "Subtotal without VAT", + { + "Description": [ + "Quantity", + "Amount" + ] + } +] +``` + +
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = ["Total GBP", "No.", "Date", "Customer No.", "Subtotal without VAT", {"Description": ["Quantity", "Amount"]}] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en") +>>> pprint(ie({"doc": "./cases/delivery_note.png"})) +``` + + + +#### 2.4 Input Format + +For document information extraction, UIE-X supports image paths, http image links, base64 input form, and image and PDF document formats. In the input dict, `text` indicates text input and `doc` refer to the document input. + +```python +[ + {'text': '2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!'}, + {'doc': './cases/custom.jpg'}, + {'doc': 'https://user-images.githubusercontent.com/40840292/203457719-84a70241-607e-4bb1-ab4c-3d9beee9e254.jpeg'} +] +``` + +**NOTE**: Multi-page PDF input currently only extracts the results of the first page. UIE-X is more suitable for information extraction of document documents (such as bills, receipts, etc.), but it is not suitable for documents that are too long or multi-page. + +- Using custom OCR input + +```python +layout = [ + ([68.0, 12.0, 167.0, 70.0], '名次'), + ([464.0, 13.0, 559.0, 67.0], '球员'), + ([833.0, 15.0, 1054.0, 64.0], '总出场时间'), + ...... +] +ie({"doc": doc_path, 'layout': layout}) +``` + + + +#### 2.5 Tips + +- Using PP-Structure layout analysis function + +The text recognized in OCR will be sorted from top left to bottom right. For cases such as column division and multiple lines of text in the table, we recommend using the layout analysis function ``layout_analysis=True`` to optimize text sorting and enhance the extraction effect. The following example is only an example of the usage scenario of the layout analysis function, and the actual scenario generally needs to be marked and fine-tuned. + +
+ +
+ +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = "中标候选人名称" +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", layout_analysis=True) +>>> pprint(ie({"doc": "https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fwww.xuyiwater.com%2Fwp-content%2Fuploads%2F2021%2F06%2F1-4.jpg&refer=http%3A%2F%2Fwww.xuyiwater.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1672994926&t=2a4a3fedf6999a34ccde190f97bcfa47"})) +``` + +
+ +
+ +```python +>>> schema = "抗血小板药物的用药指征" +>>> ie.set_schema(schema) +>>> pprint(ie({"doc": "./cases/drug.webp"})) +``` + + + +#### 2.6 Visualization + +- Visualization of OCR recognition results: + +```python +>>> from paddlenlp.utils.doc_parser import DocParser + +>>> doc_parser = DocParser(ocr_lang="en") +>>> doc_path = "./cases/business_card.png" +>>> parsed_doc = doc_parser.parse({"doc": doc_path}) +>>> doc_parser.write_image_with_results( + doc_path, + layout=parsed_doc['layout'], + save_path="ocr_result.png") +``` + +
+ +
+ +- Visualization of extraction results: + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> from paddlenlp.utils.doc_parser import DocParser + +>>> doc_path = "./cases/business_card.png" +>>> schema = ["人名", "职位", "号码", "邮箱地址", "网址", "地址", "邮编"] +>>> ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en") + +>>> results = ie({"doc": doc_path}) + +>>> DocParser.write_image_with_results( + doc_path, + result=results[0], + save_path="image_show.png") +``` + +
+ +
+ + + +#### 2.7 More Configuration + +```python +>>> from paddlenlp import Taskflow + +>>> ie = Taskflow('information_extraction', + schema="", + schema_lang="ch", + ocr_lang="ch", + batch_size=16, + model='uie-x-base', + layout_analysis=False, + position_prob=0.5, + precision='fp32', + use_fast=False) +``` + +* `schema`: Define the task extraction target, which can be configured by referring to the calling examples of different tasks in the out-of-the-box. +* `schema_lang`: Set the language of the schema, the default is `ch`, optional `ch` and `en`. Because the structure of the Chinese and English schemas is different, the language of the schema needs to be specified. +* `ocr_lang`: Select the language of PaddleOCR, `ch` can be used in mixed Chinese and English images, `en` works better on English images, the default is `ch`. +* `batch_size`: batch size, please adjust according to the machine situation, the default is 16. +* `model`: select the model used by the task, the default is `uie-base`, optional `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano` ` and `uie-medical-base`, `uie-base-en`, `uie-x-base`. +* `layout_analysis`: Whether to use PP-Structure to analyze the layout of the document to optimize the sorting of layout information, the default is False. +* `position_prob`: The result probability of the model for the start position/end position of the span is between 0 and 1, and the returned result removes the results less than this threshold, the default is 0.5, and the final probability output of the span is the start position probability and end position The product of the position probabilities. +* `precision`: select the model precision, the default is `fp32`, optional `fp16` and `fp32`. `fp16` inference is faster, support GPU and NPU hardware. If you choose `fp16` and GPU hardware, please ensure that the machine is correctly installed with NVIDIA-related drivers and basic software. **Ensure that CUDA>=11.2, cuDNN>=8.1.1**. For the first time use, you need to follow the prompts to install the relevant dependencies. Secondly, it is necessary to ensure that the CUDA Compute Capability of the GPU device is greater than 7.0. Typical devices include V100, T4, A10, A100, GTX 20 series and 30 series graphics cards, etc. For more information about CUDA Compute Capability and precision support, please refer to NVIDIA documentation: [GPU Hardware and Supported Precision Comparison Table](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix). +* `use_fast`: Use the high-performance word segmentation operator FastTokenizer implemented in C++ to accelerate text preprocessing. The FastTokenizer library needs to be installed through `pip install fast-tokenizer-python` before it can be used. Defaults to `False`. For more usage instructions, please refer to [FastTokenizer Documentation](../../fast_tokenizer). + +## References +- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)** +- **[PP-Structure](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6/ppstructure)** diff --git a/applications/information_extraction/taskflow_text.md b/applications/information_extraction/taskflow_text.md new file mode 100644 index 0000000000000000000000000000000000000000..a8fe491775beb1ab8edd50819eba488c253e4cc2 --- /dev/null +++ b/applications/information_extraction/taskflow_text.md @@ -0,0 +1,502 @@ +# 文本抽取任务UIE Taskflow使用指南 + +**目录** +- [1. 功能简介](#1) +- [2. 应用示例](#2) +- [3. 
文本信息抽取](#3) + - [3.1 实体抽取](#31) + - [3.2 关系抽取](#32) + - [3.3 事件抽取](#33) + - [3.4 评论观点抽取](#34) + - [3.5 情感分类](#35) + - [3.6 跨任务抽取](#36) + - [3.7 模型选择](#37) + - [3.8 更多配置](#38) + + + +## 1. 功能简介 + +```paddlenlp.Taskflow```提供纯文本的通用信息抽取、评价观点抽取等能力,可抽取多种类型的信息,包括但不限于命名实体识别(如人名、地名、机构名等)、关系(如电影的导演、歌曲的发行时间等)、事件(如某路口发生车祸、某地发生地震等)、以及评价维度、观点词、情感倾向等信息。用户可以使用自然语言自定义抽取目标,无需训练即可统一抽取输入文本中的对应信息。**实现开箱即用,并满足各类信息抽取需求** + + + +## 2. 应用示例 + +UIE不限定行业领域和抽取目标,以下是一些通过Taskflow实现开箱即用的行业示例: + +- 医疗场景-专病结构化 + +![image](https://user-images.githubusercontent.com/40840292/169017581-93c8ee44-856d-4d17-970c-b6138d10f8bc.png) + +- 法律场景-判决书抽取 + +![image](https://user-images.githubusercontent.com/40840292/169017863-442c50f1-bfd4-47d0-8d95-8b1d53cfba3c.png) + +- 金融场景-收入证明、招股书抽取 + +![image](https://user-images.githubusercontent.com/40840292/169017982-e521ddf6-d233-41f3-974e-6f40f8f2edbc.png) + +- 公安场景-事故报告抽取 + +![image](https://user-images.githubusercontent.com/40840292/169018340-31efc1bf-f54d-43f7-b62a-8f7ce9bf0536.png) + +- 旅游场景-宣传册、手册抽取 + +![image](https://user-images.githubusercontent.com/40840292/169018113-c937eb0b-9fd7-4ecc-8615-bcdde2dac81d.png) + + + +## 3. 文本信息抽取 + + + +#### 3.1 实体抽取 + + 实体抽取,又称命名实体识别(Named Entity Recognition,简称NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。 + + - 例如抽取的目标实体类型是"时间"、"选手"和"赛事名称", schema构造如下: + + ```text + ['时间', '选手', '赛事名称'] + ``` + + 调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction + >>> ie = Taskflow('information_extraction', schema=schema) + >>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) # Better print results using pprint + [{'时间': [{'end': 6, + 'probability': 0.9857378532924486, + 'start': 0, + 'text': '2月8日上午'}], + '赛事名称': [{'end': 23, + 'probability': 0.8503089953268272, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + '选手': [{'end': 31, + 'probability': 0.8981548639781138, + 'start': 28, + 'text': '谷爱凌'}]}] + ``` + + - 例如抽取的目标实体类型是"肿瘤的大小"、"肿瘤的个数"、"肝癌级别"和"脉管内癌栓分级", schema构造如下: + + ```text + ['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级'] + ``` + + 在上例中我们已经实例化了一个`Taskflow`对象,这里可以通过`set_schema`方法重置抽取目标。 + + 调用示例: + + ```python + >>> schema = ['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级'] + >>> ie.set_schema(schema) + >>> pprint(ie("(右肝肿瘤)肝细胞性肝癌(II-III级,梁索型和假腺管型),肿瘤包膜不完整,紧邻肝被膜,侵及周围肝组织,未见脉管内癌栓(MVI分级:M0级)及卫星子灶形成。(肿物1个,大小4.2×4.0×2.8cm)。")) + [{'肝癌级别': [{'end': 20, + 'probability': 0.9243267447402701, + 'start': 13, + 'text': 'II-III级'}], + '肿瘤的个数': [{'end': 84, + 'probability': 0.7538413804059623, + 'start': 82, + 'text': '1个'}], + '肿瘤的大小': [{'end': 100, + 'probability': 0.8341128043459491, + 'start': 87, + 'text': '4.2×4.0×2.8cm'}], + '脉管内癌栓分级': [{'end': 70, + 'probability': 0.9083292325934664, + 'start': 67, + 'text': 'M0级'}]}] + ``` + + - 例如抽取的目标实体类型是"person"和"organization",schema构造如下: + + ```text + ['person', 'organization'] + ``` + + 英文模型调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + >>> schema = ['Person', 'Organization'] + >>> ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en') + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Organization': [{'end': 53, + 'probability': 0.9985840259877357, + 'start': 48, + 'text': 'Apple'}], + 'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'start': 9, + 'text': 'Steve'}]}] + ``` + + + +#### 3.2 关系抽取 + + 关系抽取(Relation Extraction,简称RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。 + + - 
例如以"竞赛名称"作为抽取主体,抽取关系类型为"主办方"、"承办方"和"已举办次数", schema构造如下: + + ```text + { + '竞赛名称': [ + '主办方', + '承办方', + '已举办次数' + ] + } + ``` + + 调用示例: + + ```python + >>> schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']} # Define the schema for relation extraction + >>> ie.set_schema(schema) # Reset schema + >>> pprint(ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办,已连续举办4届,成为全球最热门的中文NLP赛事之一。')) + [{'竞赛名称': [{'end': 13, + 'probability': 0.7825402622754041, + 'relations': {'主办方': [{'end': 22, + 'probability': 0.8421710521379353, + 'start': 14, + 'text': '中国中文信息学会'}, + {'end': 30, + 'probability': 0.7580801847701935, + 'start': 23, + 'text': '中国计算机学会'}], + '已举办次数': [{'end': 82, + 'probability': 0.4671295049136148, + 'start': 80, + 'text': '4届'}], + '承办方': [{'end': 39, + 'probability': 0.8292706618236352, + 'start': 35, + 'text': '百度公司'}, + {'end': 72, + 'probability': 0.6193477885474685, + 'start': 56, + 'text': '中国计算机学会自然语言处理专委会'}, + {'end': 55, + 'probability': 0.7000497331473241, + 'start': 40, + 'text': '中国中文信息学会评测工作委员会'}]}, + 'start': 0, + 'text': '2022语言与智能技术竞赛'}]}] + ``` + + - 例如以"person"作为抽取主体,抽取关系类型为"Company"和"Position", schema构造如下: + + ```text + { + 'Person': [ + 'Company', + 'Position' + ] + } + ``` + + 英文模型调用示例: + + ```python + >>> schema = [{'Person': ['Company', 'Position']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'relations': {'Company': [{'end': 53, + 'probability': 0.9960158209451642, + 'start': 48, + 'text': 'Apple'}], + 'Position': [{'end': 44, + 'probability': 0.8871063806420736, + 'start': 41, + 'text': 'CEO'}]}, + 'start': 9, + 'text': 'Steve'}]}] + ``` + + + +#### 3.3 事件抽取 + + 事件抽取 (Event Extraction, 简称EE),是指从自然语言文本中抽取预定义的事件触发词(Trigger)和事件论元(Argument),组合为相应的事件结构化信息。 + + - 例如抽取的目标是"地震"事件的"地震强度"、"时间"、"震中位置"和"震源深度"这些信息,schema构造如下: + + ```text + { + '地震触发词': [ + '地震强度', + '时间', + '震中位置', + '震源深度' + ] + } + ``` + + 触发词的格式统一为`触发词`或``XX触发词`,`XX`表示具体事件类型,上例中的事件类型是`地震`,则对应触发词为`地震触发词`。 + + 调用示例: + + ```python + >>> schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # Define the schema for event extraction + >>> ie.set_schema(schema) # Reset schema + >>> ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。') + [{'地震触发词': [{'text': '地震', 'start': 56, 'end': 58, 'probability': 0.9987181623528585, 'relations': {'地震强度': [{'text': '3.5级', 'start': 52, 'end': 56, 'probability': 0.9962985320905915}], '时间': [{'text': '5月16日06时08分', 'start': 11, 'end': 22, 'probability': 0.9882578028575182}], '震中位置': [{'text': '云南临沧市凤庆县(北纬24.34度,东经99.98度)', 'start': 23, 'end': 50, 'probability': 0.8551415716584501}], '震源深度': [{'text': '10千米', 'start': 63, 'end': 67, 'probability': 0.999158304648045}]}}]}] + ``` + + - 英文模型**暂不支持事件抽取**,如有需要可使用英文事件数据集进行定制。 + + + +#### 3.4 评论观点抽取 + + 评论观点抽取,是指抽取文本中包含的评价维度、观点词。 + + - 例如抽取的目标是文本中包含的评价维度及其对应的观点词和情感倾向,schema构造如下: + + ```text + { + '评价维度': [ + '观点词', + '情感倾向[正向,负向]' + ] + } + ``` + + 调用示例: + + ```python + >>> schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']} # Define the schema for opinion extraction + >>> ie.set_schema(schema) # Reset schema + >>> pprint(ie("店面干净,很清静,服务员服务热情,性价比很高,发现收银台有排队")) # Better print results using pprint + [{'评价维度': [{'end': 20, + 'probability': 0.9817040258681473, + 'relations': {'情感倾向[正向,负向]': [{'probability': 0.9966142505350533, + 'text': '正向'}], + '观点词': [{'end': 22, + 'probability': 0.957396472711558, + 'start': 21, + 'text': '高'}]}, + 'start': 17, + 'text': '性价比'}, + {'end': 2, + 
'probability': 0.9696849569741168, + 'relations': {'情感倾向[正向,负向]': [{'probability': 0.9982153274927796, + 'text': '正向'}], + '观点词': [{'end': 4, + 'probability': 0.9945318044652538, + 'start': 2, + 'text': '干净'}]}, + 'start': 0, + 'text': '店面'}]}] + ``` + + - 英文模型schema构造如下: + + ```text + { + 'Aspect': [ + 'Opinion', + 'Sentiment classification [negative, positive]' + ] + } + ``` + + 调用示例: + + ```python + >>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en("The teacher is very nice.")) + [{'Aspect': [{'end': 11, + 'probability': 0.4301476415932193, + 'relations': {'Opinion': [{'end': 24, + 'probability': 0.9072940447883724, + 'start': 15, + 'text': 'very nice'}], + 'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685, + 'text': 'positive'}]}, + 'start': 4, + 'text': 'teacher'}]}] + ``` + + + +#### 3.5 情感分类 + + - 句子级情感倾向分类,即判断句子的情感倾向是“正向”还是“负向”,schema构造如下: + + ```text + '情感倾向[正向,负向]' + ``` + + 调用示例: + + ```python + >>> schema = '情感倾向[正向,负向]' # Define the schema for sentence-level sentiment classification + >>> ie.set_schema(schema) # Reset schema + >>> ie('这个产品用起来真的很流畅,我非常喜欢') + [{'情感倾向[正向,负向]': [{'text': '正向', 'probability': 0.9988661643929895}]}] + ``` + + 英文模型schema构造如下: + + ```text + 'Sentiment classification [negative, positive]' + ``` + + 英文模型调用示例: + + ```python + >>> schema = 'Sentiment classification [negative, positive]' + >>> ie_en.set_schema(schema) + >>> ie_en('I am sorry but this is the worst film I have ever seen in my life.') + [{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}] + ``` + + + +#### 3.6 跨任务抽取 + + - 例如在法律场景同时对文本进行实体抽取和关系抽取,schema可按照如下方式进行构造: + + ```text + [ + "法院", + { + "原告": "委托代理人" + }, + { + "被告": "委托代理人" + } + ] + ``` + + 调用示例: + + ```python + >>> schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}] + >>> ie.set_schema(schema) + >>> pprint(ie("北京市海淀区人民法院\n民事判决书\n(199x)建初字第xxx号\n原告:张三。\n委托代理人李四,北京市 A律师事务所律师。\n被告:B公司,法定代表人王五,开发公司总经理。\n委托代理人赵六,北京市 C律师事务所律师。")) # Better print results using pprint + [{'原告': [{'end': 37, + 'probability': 0.9949814024296764, + 'relations': {'委托代理人': [{'end': 46, + 'probability': 0.7956844697990384, + 'start': 44, + 'text': '李四'}]}, + 'start': 35, + 'text': '张三'}], + '法院': [{'end': 10, + 'probability': 0.9221074192336651, + 'start': 0, + 'text': '北京市海淀区人民法院'}], + '被告': [{'end': 67, + 'probability': 0.8437349536631089, + 'relations': {'委托代理人': [{'end': 92, + 'probability': 0.7267121388225029, + 'start': 90, + 'text': '赵六'}]}, + 'start': 64, + 'text': 'B公司'}]}] + ``` + + + +#### 3.7 模型选择 + +- 多模型选择,满足精度、速度要求 + + | 模型 | 结构 | 语言 | + | :---: | :--------: | :--------: | + | `uie-base` (默认)| 12-layers, 768-hidden, 12-heads | 中文 | + | `uie-base-en` | 12-layers, 768-hidden, 12-heads | 英文 | + | `uie-medical-base` | 12-layers, 768-hidden, 12-heads | 中文 | + | `uie-medium`| 6-layers, 768-hidden, 12-heads | 中文 | + | `uie-mini`| 6-layers, 384-hidden, 12-heads | 中文 | + | `uie-micro`| 4-layers, 384-hidden, 12-heads | 中文 | + | `uie-nano`| 4-layers, 312-hidden, 12-heads | 中文 | + | `uie-m-large`| 24-layers, 1024-hidden, 16-heads | 中、英文 | + | `uie-m-base`| 12-layers, 768-hidden, 12-heads | 中、英文 | + + +- `uie-nano`调用示例: + + ```python + >>> from paddlenlp import Taskflow + + >>> schema = ['时间', '选手', '赛事名称'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-nano") + >>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!") + [{'时间': [{'text': '2月8日上午', 'start': 0, 
'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}] + ``` + +- `uie-m-base`和`uie-m-large`支持中英文混合抽取,调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> schema = ['Time', 'Player', 'Competition', 'Score'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-m-base", schema_lang="en") + >>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"])) + [{'Competition': [{'end': 23, + 'probability': 0.9373889907291257, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + 'Player': [{'end': 31, + 'probability': 0.6981119555336441, + 'start': 28, + 'text': '谷爱凌'}], + 'Score': [{'end': 39, + 'probability': 0.9888507878270296, + 'start': 32, + 'text': '188.25分'}], + 'Time': [{'end': 6, + 'probability': 0.9784080036931151, + 'start': 0, + 'text': '2月8日上午'}]}, + {'Competition': [{'end': 35, + 'probability': 0.9851549932171295, + 'start': 18, + 'text': 'French Open Final'}], + 'Player': [{'end': 12, + 'probability': 0.9379371275888104, + 'start': 0, + 'text': 'Rafael Nadal'}]}] + ``` + + + +#### 3.8 更多配置 + +```python +>>> from paddlenlp import Taskflow + +>>> ie = Taskflow('information_extraction', + schema="", + schema_lang="ch", + batch_size=16, + model='uie-base', + position_prob=0.5, + precision='fp32', + use_fast=False) +``` + +* `schema`:定义任务抽取目标,可参考开箱即用中不同任务的调用示例进行配置。 +* `schema_lang`:设置schema的语言,默认为`ch`, 可选有`ch`和`en`。因为中英schema的构造有所不同,因此需要指定schema的语言。该参数只对`uie-x-base`,`uie-m-base`和`uie-m-large`模型有效。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为16。 +* `model`:选择任务使用的模型,默认为`uie-base`,可选有`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`和`uie-medical-base`, `uie-base-en`,`uie-x-base`。 +* `position_prob`:模型对于span的起始位置/终止位置的结果概率在0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5,span的最终概率输出为起始位置概率和终止位置概率的乘积。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快,支持GPU和NPU硬件环境。如果选择`fp16`,在GPU硬件环境下,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 +* `use_fast`: 使用C++实现的高性能分词算子FastTokenizer进行文本预处理加速。需要通过`pip install fast-tokenizer-python`安装FastTokenizer库后方可使用。默认为`False`。更多使用说明可参考[FastTokenizer文档](../../fast_tokenizer)。 diff --git a/applications/information_extraction/taskflow_text_en.md b/applications/information_extraction/taskflow_text_en.md new file mode 100644 index 0000000000000000000000000000000000000000..d488799313f947bd9e2ae7807aaa283512b64bd4 --- /dev/null +++ b/applications/information_extraction/taskflow_text_en.md @@ -0,0 +1,312 @@ +# UIE Taskflow User Guide - Text Information Extraction + +**Table of contents** +- [1. Introduction](#1) +- [2. Examples](#2) +- [3. Text Information Extraction](#3) + - [3.1 Entity Extraction](#31) + - [3.2 Relation Extraction](#32) + - [3.3 Event Extraction](#33) + - [3.4 Opinion Extraction](#34) + - [3.5 Sentiment Classification](#35) + - [3.6 Multi-task Extraction](#36) + - [3.7 Available Models](#37) + - [3.8 More Configuration](#38) + + + +## 1. 
Introduction +```paddlenlp.Taskflow``` provides general information extraction of text and documents, evaluation opinion extraction and other capabilities, and can extract various types of information, including but not limited to named entities (such as person name, place name, organization name, etc.), relations (such as the director of the movie, the release time of the song, etc.), events (such as a car accident at a certain intersection, an earthquake in a certain place, etc.), and information such as product reviews, opinions, and sentiments. Users can use natural language to customize the extraction target, and can uniformly extract the corresponding information in the input text or document without training. + + + +## 2. Examples + +UIE does not limit industry fields and extraction targets. The following are some industry examples implemented out of the box by Taskflow: + +- Medical scenarios - specialized disease structure + +![image](https://user-images.githubusercontent.com/40840292/169017581-93c8ee44-856d-4d17-970c-b6138d10f8bc.png) + +- Legal scene - Judgment extraction + +![image](https://user-images.githubusercontent.com/40840292/169017863-442c50f1-bfd4-47d0-8d95-8b1d53cfba3c.png) + +- Financial scenarios - proof of income, extraction of prospectus + +![image](https://user-images.githubusercontent.com/40840292/169017982-e521ddf6-d233-41f3-974e-6f40f8f2edbc.png) + +- Public security scene - accident report extraction + +![image](https://user-images.githubusercontent.com/40840292/169018340-31efc1bf-f54d-43f7-b62a-8f7ce9bf0536.png) + +- Tourism scene - brochure, manual extraction + +![image](https://user-images.githubusercontent.com/40840292/169018113-c937eb0b-9fd7-4ecc-8615-bcdde2dac81d.png) + + + +## 3. Text information extraction + + + +#### 3.1 Entity Extraction + + Entity extraction, also known as Named Entity Recognition (NER for short), refers to identifying entities with specific meanings in text. In the open domain information extraction, the extracted categories are not limited, and users can define them by themselves. + + - For example, the extracted target entity types are "person" and "organization", and the schema defined as follows: + + ```text + ['person', 'organization'] + ``` + + Example: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + >>> schema = ['Person', 'Organization'] + >>> ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en') + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Organization': [{'end': 53, + 'probability': 0.9985840259877357, + 'start': 48, + 'text': 'Apple'}], + 'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'start': 9, + 'text': 'Steve'}]}] + ``` + + + +#### 3.2 Relationship Extraction + + Relation Extraction refers to identifying entities from text and extracting the semantic relationship between entities, and then obtaining triple information, namely . 
+ + - For example, if "person" is used as the extraction subject, and the extraction relationship types are "Company" and "Position", the schema structure is as follows: + + ```text + { + 'Person': [ + 'Company', + 'Position' + ] + } + ``` + + Example: + + ```python + >>> schema = [{'Person': ['Company', 'Position']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'relations': {'Company': [{'end': 53, + 'probability': 0.9960158209451642, + 'start': 48, + 'text': 'Apple'}], + 'Position': [{'end': 44, + 'probability': 0.8871063806420736, + 'start': 41, + 'text': 'CEO'}]}, + 'start': 9, + 'text': 'Steve'}]}] + ``` + + + +#### 3.3 Event extraction + + Event Extraction refers to extracting predefined event trigger words (Trigger) and event arguments (Argument) from natural language texts, and combining them into corresponding event structured information. + + - The English model** does not support event extraction**, if necessary, it can be customized using the English event dataset. + + + +#### 3.4 Opinion Extraction + + Opinion extraction refers to the extraction of evaluation dimensions and opinion words contained in the text. + + - For example, the target of extraction is the evaluation dimension contained in the text and its corresponding opinion words and emotional tendencies. The schema structure is as follows: + + ```text + { + 'Aspect': [ + 'Opinion', + 'Sentiment classification [negative, positive]' + ] + } + ``` + + Example: + + ```python + >>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en("The teacher is very nice.")) + [{'Aspect': [{'end': 11, + 'probability': 0.4301476415932193, + 'relations': {'Opinion': [{'end': 24, + 'probability': 0.9072940447883724, + 'start': 15, + 'text': 'very nice'}], + 'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685, + 'text': 'positive'}]}, + 'start': 4, + 'text': 'teacher'}]}] + ``` + + + +#### 3.5 Sentiment Classification + + - Sentence-level sentiment classification, that is, to judge whether the emotional orientation of a sentence is "positive" or "negative". 
The schema structure is as follows: + + ```text + 'Sentiment classification [negative, positive]' + ``` + + Example: + + ```python + >>> schema = 'Sentiment classification [negative, positive]' + >>> ie_en.set_schema(schema) + >>> ie_en('I am sorry but this is the worst film I have ever seen in my life.') + [{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}] + ``` + +#### 3.6 Multi-Task Extraction + + - For example, in the legal scene, entity extraction and relation extraction are performed on the text at the same time, and the schema can be constructed as follows: + + ```text + [ + "法院", + { + "原告": "委托代理人" + }, + { + "被告": "委托代理人" + } + ] + ``` + + Example: + + ```python + >>> schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}] + >>> ie.set_schema(schema) + >>> pprint(ie("北京市海淀区人民法院\n民事判决书\n(199x)建初字第xxx号\n原告:张三。\n委托代理人李四,北京市 A律师事务所律师。\n被告:B公司,法定代表人王五,开发公司总经理。\n委托代理人赵六,北京市 C律师事务所律师。")) # Better print results using pprint + [{'原告': [{'end': 37, + 'probability': 0.9949814024296764, + 'relations': {'委托代理人': [{'end': 46, + 'probability': 0.7956844697990384, + 'start': 44, + 'text': '李四'}]}, + 'start': 35, + 'text': '张三'}], + '法院': [{'end': 10, + 'probability': 0.9221074192336651, + 'start': 0, + 'text': '北京市海淀区人民法院'}], + '被告': [{'end': 67, + 'probability': 0.8437349536631089, + 'relations': {'委托代理人': [{'end': 92, + 'probability': 0.7267121388225029, + 'start': 90, + 'text': '赵六'}]}, + 'start': 64, + 'text': 'B公司'}]}] + ``` + + + +#### 3.7 Available Model + +- A variety of models to different accuracy and speed requirements + + | Model | Structure | Language | + | :---: | :--------: | :--------: | + | `uie-base` (default)| 12-layers, 768-hidden, 12-heads | Chinese | + | `uie-base-en` | 12-layers, 768-hidden, 12-heads | English | + | `uie-medical-base` | 12-layers, 768-hidden, 12-heads | Chinese | + | `uie-medium`| 6-layers, 768-hidden, 12-heads | Chinese | + | `uie-mini`| 6-layers, 384-hidden, 12-heads | Chinese | + | `uie-micro`| 4-layers, 384-hidden, 12-heads | Chinese | + | `uie-nano`| 4-layers, 312-hidden, 12-heads | Chinese | + | `uie-m-large`| 24-layers, 1024-hidden, 16-heads | Chinese and English | + | `uie-m-base`| 12-layers, 768-hidden, 12-heads | Chinese and English | + + +- `uie-nano` call example: + + ```python + >>> from paddlenlp import Taskflow + + >>> schema = ['时间', '选手', '赛事名称'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-nano") + >>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!") + [{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}] + ``` + +- `uie-m-base` and `uie-m-large` support extraction of both Chinese and English, call example: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> schema = ['Time', 'Player', 'Competition', 'Score'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-m-base", schema_lang="en") + >>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"])) + [{'Competition': [{'end': 23, + 'probability': 0.9373889907291257, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + 'Player': [{'end': 31, + 'probability': 0.6981119555336441, + 'start': 28, + 'text': '谷爱凌'}], + 'Score': [{'end': 39, + 'probability': 0.9888507878270296, + 'start': 32, + 'text': 
'188.25分'}], + 'Time': [{'end': 6, + 'probability': 0.9784080036931151, + 'start': 0, + 'text': '2月8日上午'}]}, + {'Competition': [{'end': 35, + 'probability': 0.9851549932171295, + 'start': 18, + 'text': 'French Open Final'}], + 'Player': [{'end': 12, + 'probability': 0.9379371275888104, + 'start': 0, + 'text': 'Rafael Nadal'}]}] + ``` + + + +#### 3.8 More Configuration + +```python +>>> from paddlenlp import Taskflow + +>>> ie = Taskflow('information_extraction', + schema="", + schema_lang="ch", + batch_size=16, + model='uie-base', + position_prob=0.5, + precision='fp32', + use_fast=False) +``` + +* `schema`: Define the task extraction target, which can be configured by referring to the calling examples of different tasks in the out-of-the-box. +* `schema_lang`: Set the language of the schema, the default is `ch`, optional `ch` and `en`. Because the structure of the Chinese and English schemas is different, the language of the schema needs to be specified. This parameter is only valid for `uie-x-base`, `uie-m-base` and `uie-m-large` models. +* `batch_size`: batch size, please adjust according to the machine situation, the default is 16. +* `model`: select the model used by the task, the default is `uie-base`, optional `uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano` and `uie-medical-base`, `uie-base-en`, `uie-x-base`. +* `position_prob`: The result probability of the model for the start position/end position of the span is between 0 and 1, and the returned result removes the results less than this threshold, the default is 0.5, and the final probability output of the span is the start position probability and end position The product of the position probabilities. +* `precision`: select the model precision, the default is `fp32`, optional `fp16` and `fp32`. `fp16` inference is faster, support GPU and NPU hardware. If you choose `fp16` and GPU hardware, please ensure that the machine is correctly installed with NVIDIA-related drivers and basic software. **Ensure that CUDA>=11.2, cuDNN>=8.1.1**. For the first time use, you need to follow the prompts to install the relevant dependencies. Secondly, it is necessary to ensure that the CUDA Compute Capability of the GPU device is greater than 7.0. Typical devices include V100, T4, A10, A100, GTX 20 series and 30 series graphics cards, etc. For more information about CUDA Compute Capability and precision support, please refer to NVIDIA documentation: [GPU Hardware and Supported Precision Comparison Table](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix). +* `use_fast`: Use the high-performance word segmentation operator FastTokenizer implemented in C++ to accelerate text preprocessing. The FastTokenizer library needs to be installed through `pip install fast-tokenizer-python` before it can be used. Defaults to `False`. For more usage instructions, please refer to [FastTokenizer Documentation](../../fast_tokenizer). diff --git a/applications/information_extraction/text/README.md b/applications/information_extraction/text/README.md new file mode 100644 index 0000000000000000000000000000000000000000..84cd77fd288cdc5f965289d33db3f93f083db72a --- /dev/null +++ b/applications/information_extraction/text/README.md @@ -0,0 +1,289 @@ +简体中文 | [English](README_en.md) + +# 文本信息抽取 + +**目录** +- [1. 文本信息抽取应用](#1) +- [2. 
快速开始](#2) + - [2.1 代码结构](#代码结构) + - [2.2 数据标注](#数据标注) + - [2.3 模型微调](#模型微调) + - [2.4 模型评估](#模型评估) + - [2.5 定制模型一键预测](#定制模型一键预测) + - [2.6 实验指标](#实验指标) + - [2.7 封闭域蒸馏](#封闭域蒸馏) + + + +## 1. 文本信息抽取应用 + +本项目提供基于UIE微调的纯文本抽取端到端应用方案,打通**数据标注-模型训练-模型调优-预测部署全流程**,可快速实现文档信息抽取产品落地。 + +信息抽取通俗地说就是从给定的文本/图片等输入数据中抽取出结构化信息的过程。在信息抽取的落地过程中通常面临领域多变、任务多样、数据稀缺等许多挑战。针对信息抽取领域的难点和痛点,PaddleNLP信息抽取应用UIE统一建模的思想,提供了文档信息抽取产业级应用方案,支持**文档/图片/表格和纯文本场景下实体、关系、事件、观点等不同任务信息抽取**。该应用**不限定行业领域和抽取目标**,可实现从产品原型研发、业务POC阶段到业务落地、迭代阶段的无缝衔接,助力开发者实现特定领域抽取场景的快速适配与落地。 + +**文本信息抽取应用亮点:** + +- **覆盖场景全面🎓:** 覆盖文本信息抽取各类主流任务,支持多语言,满足开发者多样信息抽取落地需求。 +- **效果领先🏃:** 以在纯文本具有突出效果的UIE系列模型作为训练基座,提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **简单易用⚡:** 通过Taskflow实现三行代码可实现无标注数据的情况下进行快速调用,一行命令即可开启信息抽取训练,轻松完成部署上线,降低信息抽取技术落地门槛。 +- **高效调优✊:** 开发者无需机器学习背景知识,即可轻松上手数据标注及模型训练流程。 + + + +## 2. 快速开始 + +对于简单的抽取目标可以直接使用```paddlenlp.Taskflow```实现零样本(zero-shot)抽取,对于细分场景我们推荐使用定制功能(标注少量数据进行模型微调)以进一步提升效果。 + + + +### 2.1 代码结构 + +```shell +. +├── utils.py # 数据处理工具 +├── finetune.py # 模型微调、压缩脚本 +├── evaluate.py # 模型评估脚本 +└── README.md +``` + + + +### 2.2 数据标注 + +我们推荐使用 [Label Studio](https://labelstud.io/) 进行文本信息抽取数据标注,本项目打通了从数据标注到训练的通道,也即Label Studio导出数据可以通过 [label_studio.py](../label_studio.py) 脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考 [Label Studio数据标注指南](../label_studio_text.md)。 + +这里我们提供预先标注好的`军事关系抽取数据集`的文件,可以运行下面的命令行下载数据集,我们将展示如何使用数据转化脚本生成训练/验证/测试集文件,并使用UIE模型进行微调。 + +下载军事关系抽取数据集: + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/military.tar.gz +tar -xvf military.tar.gz +mv military data +rm military.tar.gz +``` + +生成训练/验证集文件: +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.76 0.24 0 \ + --negative_ratio 3 \ + --task_type ext +``` + +更多不同类型任务(含实体抽取、关系抽取、文档分类等)的标注规则及参数说明,请参考[Label Studio数据标注指南](../label_studio_text.md)。 + + + + +### 2.3 模型微调 + +推荐使用 [Trainer API ](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md) 对模型进行微调。只需输入模型、数据集等就可以使用 Trainer API 高效快速地进行预训练、微调和模型压缩等任务,可以一键启动多卡训练、混合精度训练、梯度累积、断点重启、日志显示等功能,Trainer API 还针对训练过程的通用训练配置做了封装,比如:优化器、学习率调度等。 + +使用下面的命令,使用 `uie-base` 作为预训练模型进行模型微调,将微调后的模型保存至`$finetuned_model`: + +单卡启动: + +```shell +python finetune.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path uie-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +如果在GPU环境中使用,可以指定gpus参数进行多卡训练: + +```shell +python -u -m paddle.distributed.launch --gpus "0,1" finetune.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path uie-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 
1 +``` + +该示例代码中由于设置了参数 `--do_eval`,因此在训练完会自动进行评估。 + +可配置参数说明: +* `device`: 训练设备,可选择 'cpu'、'gpu'、'npu' 其中的一种;默认为 GPU 训练。 +* `logging_steps`: 训练过程中日志打印的间隔 steps 数,默认10。 +* `save_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `eval_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `seed`:全局随机种子,默认为 42。 +* `model_name_or_path`:进行 few shot 训练使用的预训练模型。默认为 "uie-x-base"。 +* `output_dir`:必须,模型训练或压缩后保存的模型目录;默认为 `None` 。 +* `train_path`:训练集路径;默认为 `None` 。 +* `dev_path`:开发集路径;默认为 `None` 。 +* `max_seq_len`:文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +* `per_device_train_batch_size`:用于训练的每个 GPU 核心/CPU 的batch大小,默认为8。 +* `per_device_eval_batch_size`:用于评估的每个 GPU 核心/CPU 的batch大小,默认为8。 +* `num_train_epochs`: 训练轮次,使用早停法时可以选择 100;默认为10。 +* `learning_rate`:训练最大学习率,UIE-X 推荐设置为 1e-5;默认值为3e-5。 +* `label_names`:训练数据标签label的名称,UIE-X 设置为'start_positions' 'end_positions';默认值为None。 +* `do_train`:是否进行微调训练,设置该参数表示进行微调训练,默认不设置。 +* `do_eval`:是否进行评估,设置该参数表示进行评估,默认不设置。 +* `do_export`:是否进行导出,设置该参数表示进行静态图导出,默认不设置。 +* `export_model_dir`:静态图导出地址,默认为None。 +* `overwrite_output_dir`: 如果 `True`,覆盖输出目录的内容。如果 `output_dir` 指向检查点目录,则使用它继续训练。 +* `disable_tqdm`: 是否使用tqdm进度条。 +* `metric_for_best_model`:最优模型指标,UIE-X 推荐设置为 `eval_f1`,默认为None。 +* `load_best_model_at_end`:训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用,默认为False。 +* `save_total_limit`:如果设置次参数,将限制checkpoint的总数。删除旧的checkpoints `输出目录`,默认为None。 + + + +### 2.4 模型评估 + +通过运行以下命令进行模型评估: + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --device gpu \ + --batch_size 16 \ + --max_seq_len 512 +``` + +通过运行以下命令对 UIE-M 进行模型评估: + +``` +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --batch_size 16 \ + --device gpu \ + --max_seq_len 512 \ + --multilingual +``` + +评估方式说明:采用单阶段评价的方式,即关系抽取、事件抽取等需要分阶段预测的任务对每一阶段的预测结果进行分别评价。验证/测试集默认会利用同一层级的所有标签来构造出全部负例。 + +可开启`debug`模式对每个正例类别分别进行评估,该模式仅用于模型调试: + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --debug +``` + +输出打印示例: + +```text +[2022-11-21 12:48:41,794] [ INFO] - ----------------------------- +[2022-11-21 12:48:41,795] [ INFO] - Class Name: 武器名称 +[2022-11-21 12:48:41,795] [ INFO] - Evaluation Precision: 0.96667 | Recall: 0.96667 | F1: 0.96667 +[2022-11-21 12:48:44,093] [ INFO] - ----------------------------- +[2022-11-21 12:48:44,094] [ INFO] - Class Name: X的产国 +[2022-11-21 12:48:44,094] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.99275 | F1: 0.99636 +[2022-11-21 12:48:46,474] [ INFO] - ----------------------------- +[2022-11-21 12:48:46,475] [ INFO] - Class Name: X的研发单位 +[2022-11-21 12:48:46,475] [ INFO] - Evaluation Precision: 0.77519 | Recall: 0.64935 | F1: 0.70671 +[2022-11-21 12:48:48,800] [ INFO] - ----------------------------- +[2022-11-21 12:48:48,801] [ INFO] - Class Name: X的类型 +[2022-11-21 12:48:48,801] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000 +``` + +可配置参数说明: + +- `device`: 评估设备,可选择 'cpu'、'gpu'、'npu' 其中的一种;默认为 GPU 评估。 +- `model_path`: 进行评估的模型文件夹路径,路径下需包含模型权重文件`model_state.pdparams`及配置文件`model_config.json`。 +- `test_path`: 进行评估的测试集文件。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `debug`: 是否开启debug模式对每个正例类别分别进行评估,该模式仅用于模型调试,默认关闭。 +- `multilingual`: 是否是跨语言模型,默认关闭。 +- `schema_lang`: 选择schema的语言,可选有`ch`和`en`。默认为`ch`,英文数据集请选择`en`。 + + + +### 2.5 定制模型一键预测 + +`paddlenlp.Taskflow`装载定制模型,通过`task_path`指定模型权重文件的路径,路径下需要包含训练好的模型权重文件`model_state.pdparams`。 + +```python +>>> from pprint import pprint 
+>>> from paddlenlp import Taskflow + +>>> schema = {"武器名称": ["产国", "类型", "研发单位"]} +# 设定抽取目标和定制化模型权重路径 +>>> my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best') +>>> pprint(my_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。")) +[{'武器名称': [{'end': 14, + 'probability': 0.9998632702221926, + 'relations': {'产国': [{'end': 18, + 'probability': 0.9998815094394331, + 'start': 16, + 'text': '瑞典'}], + '研发单位': [{'end': 25, + 'probability': 0.9995875123178521, + 'start': 18, + 'text': 'FFV军械公司'}], + '类型': [{'end': 14, + 'probability': 0.999877336059086, + 'start': 12, + 'text': '炸弹'}]}, + 'start': 0, + 'text': '威尔哥(Virgo)减速炸弹'}]}] +``` + + + +### 2.6 实验指标 + +军事关系抽取数据集实验指标: + + | | Precision | Recall | F1 Score | + | :---: | :--------: | :--------: | :--------: | + | 0-shot | 0.64634| 0.53535 | 0.58564 | + | 5-shot | 0.89474 | 0.85000 | 0.87179 | + | 10-shot | 0.92793 | 0.85833 | 0.89177 | + | full-set | 0.93103 | 0.90000 | 0.91525 | + + + + +### 2.7 封闭域蒸馏 + +在一些工业应用场景中对性能的要求较高,模型若不能有效压缩则无法实际应用。因此,我们基于数据蒸馏技术构建了UIE Slim数据蒸馏系统。其原理是通过数据作为桥梁,将UIE模型的知识迁移到封闭域信息抽取小模型,以达到精度损失较小的情况下却能达到大幅度预测速度提升的效果。详细介绍请参考[UIE Slim 数据蒸馏](./data_distill/README.md) diff --git a/applications/information_extraction/text/README_en.md b/applications/information_extraction/text/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..4cc36bbcee15fa47b49ef511b97878474a16e41e --- /dev/null +++ b/applications/information_extraction/text/README_en.md @@ -0,0 +1,278 @@ +# Text information extraction + +**Table of contents** +- [1. Text Information Extraction Application](#1) +- [2. Quick Start](#2) + - [2.1 Code Structure](#21) + - [2.2 Data Annotation](#22) + - [2.3 Finetuning](#23) + - [2.4 Evaluation](#24) + - [2.5 Inference](#25) + - [2.6 Experiments](#26) + - [2.7 Closed Domain Distillation](#27) + + + +## 1. Text Information Extraction Application + +This project provides an end-to-end application solution for plain text extraction based on UIE fine-tuning and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Information Extraction techniques in your own products or models.a + +Information Extraction (IE) is the process of extracting structured information from given input data such as text, pictures or scanned document. While IE brings immense value, applying IE techniques is never easy with challenges such as domain adaptation, heterogeneous structures, lack of labeled data, etc. This PaddleNLP Information Extraction Guide builds on the foundation of our work in [Universal Information Extraction](https://arxiv.org/abs/2203.12277) and provides an industrial-level solution that not only supports **extracting entities, relations, events and opinions from plain text**, but also supports **cross-modal extraction out of documents, tables and pictures.** Our method features a flexible prompt, which allows you to specify extraction targets with simple natural language. We also provide a few different domain-adapated models specialized for different industry sectors. + +**Highlights:** +- **Comprehensive Coverage🎓:** Covers various mainstream tasks of information extraction for plain text and document scenarios, supports multiple languages +- **State-of-the-Art Performance🏃:** Strong performance from the UIE model series models in plain text and multimodal datasets. 
We also provide pretrained models of various sizes to meet different needs +- **Easy to use⚡:** three lines of code to use our `Taskflow` for out-of-box Information Extraction capabilities. One line of command to model training and model deployment +- **Efficient Tuning✊:** Developers can easily get started with the data labeling and model training process without a background in Machine Learning. + + + +## 2. Quick start + +For quick start, you can directly use ```paddlenlp.Taskflow``` out-of-the-box, leveraging the zero-shot performance. For production use cases, we recommend labeling a small amount of data for model fine-tuning to further improve the performance. + + + +### 2.1 Code structure + +```shell +. +├── utils.py # data processing tools +├── finetune.py # model fine-tuning, compression script +├── evaluate.py # model evaluation script +└── README.md +``` + + + +### 2.2 Data labeling + +We recommend using [Label Studio](https://labelstud.io/) for data labeling. We provide an end-to-end pipeline for the labeling -> training process. You can export the labeled data in Label Studio through [label_studio.py](../label_studio.py) script to export and convert the data into the required input form for the model. For a detailed introduction to labeling methods, please refer to [Label Studio Data Labeling Guide](../label_studio_text_en.md). + +Here we provide a pre-labeled example dataset `Military Relationship Extraction Dataset`, which you can download with the following command. We will show how to use the data conversion script to generate training/validation/test set files for fine-tuning . + +Download the military relationship extraction dataset: + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/military.tar.gz +tar -xvf military.tar.gz +mv military data +rm military.tar.gz +``` + +Generate training/validation set files: +```shell +python ../label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.76 0.24 0 \ + --negative_ratio 3 \ + --task_type ext +``` + +For more labeling rules and parameter descriptions for different types of tasks (including entity extraction, relationship extraction, document classification, etc.), please refer to [Label Studio Data Labeling Guide](../label_studio_text_en.md). 
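Before starting fine-tuning, it can help to sanity-check the converted files. The sketch below assumes the conversion script wrote `train.txt` and `dev.txt` under `./data` with one JSON record per line (these paths match the fine-tuning commands in the next section); the exact record fields depend on your annotations and the task type:

```python
import json

# Quick sanity check of the files produced by the label_studio.py conversion above;
# the paths follow the --save_dir ./data setting used in the conversion command.
for split in ("train", "dev"):
    path = f"./data/{split}.txt"
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(records)} examples")
    # Show the first record of each split; the exact fields depend on the task type (ext here).
    if records:
        print(json.dumps(records[0], ensure_ascii=False)[:300])
```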
+ + + + +### 2.3 Finetuning + +Use the following command to fine-tune the model using `uie-base` as the pre-trained model, and save the fine-tuned model to `$finetuned_model`: + +Single GPU: + +```shell +python finetune.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path uie-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Multiple GPUs: + +```shell +python -u -m paddle.distributed.launch --gpus "0,1" finetune.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path uie-base \ + --output_dir ./checkpoint/model_best \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --max_seq_len 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model eval_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Parameters: + +* `device`: Training device, one of 'cpu', 'gpu' and 'npu' can be selected; the default is GPU training. +* `logging_steps`: The interval steps of log printing during training, the default is 10. +* `save_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `eval_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `seed`: global random seed, default is 42. +* `model_name_or_path`: The pre-trained model used for few shot training. Defaults to "uie-x-base". +* `output_dir`: required, the model directory saved after model training or compression; the default is `None`. +* `train_path`: training set path; defaults to `None`. +* `dev_path`: Development set path; defaults to `None`. +* `max_seq_len`: The maximum segmentation length of the text. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. +* `per_device_train_batch_size`: The batch size of each GPU core//NPU core/CPU used for training, the default is 8. +* `per_device_eval_batch_size`: Batch size per GPU core/NPU core/CPU for evaluation, default is 8. +* `num_train_epochs`: Training rounds, 100 can be selected when using early stopping method; the default is 10. +* `learning_rate`: The maximum learning rate for training, UIE-X recommends setting it to 1e-5; the default value is 3e-5. +* `label_names`: the name of the training data label, UIE-X is set to 'start_positions' 'end_positions'; the default value is None. +* `do_train`: Whether to perform fine-tuning training, setting this parameter means to perform fine-tuning training, and it is not set by default. +* `do_eval`: Whether to evaluate, setting this parameter means to evaluate, the default is not set. +* `do_export`: Whether to export, setting this parameter means to export static images, and it is not set by default. 
+* `export_model_dir`: Static map export address, the default is None. +* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use it to continue training. +* `disable_tqdm`: Whether to use tqdm progress bar. +* `metric_for_best_model`: Optimal model metric, UIE-X recommends setting it to `eval_f1`, the default is None. +* `load_best_model_at_end`: Whether to load the best model after training, usually used in conjunction with `metric_for_best_model`, the default is False. +* `save_total_limit`: If this parameter is set, the total number of checkpoints will be limited. Remove old checkpoints `output directory`, defaults to None. + + + +### 2.4 Evaluation + +Model evaluation: + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --batch_size 16 \ + --max_seq_len 512 +``` + +Model evaluation for UIE-M: + +``` +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --batch_size 16 \ + --max_seq_len 512 \ + --multilingual +``` + +We adopt the single-stage method for evaluation, which means tasks that require multiple stages (e.g. relation extraction, event extraction) are evaluated separately for each stage. By default, the validation/test set uses all labels at the same level to construct the negative examples. + +The `debug` mode can be turned on to evaluate each positive category separately. This mode is only used for model debugging: + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/dev.txt \ + --debug +``` + +Output print example: + +```text +[2022-11-21 12:48:41,794] [ INFO] - ----------------------------- +[2022-11-21 12:48:41,795] [ INFO] - Class Name: 武器名称 +[2022-11-21 12:48:41,795] [ INFO] - Evaluation Precision: 0.96667 | Recall: 0.96667 | F1: 0.96667 +[2022-11-21 12:48:44,093] [ INFO] - ----------------------------- +[2022-11-21 12:48:44,094] [ INFO] - Class Name: X的产国 +[2022-11-21 12:48:44,094] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.99275 | F1: 0.99636 +[2022-11-21 12:48:46,474] [ INFO] - ----------------------------- +[2022-11-21 12:48:46,475] [ INFO] - Class Name: X的研发单位 +[2022-11-21 12:48:46,475] [ INFO] - Evaluation Precision: 0.77519 | Recall: 0.64935 | F1: 0.70671 +[2022-11-21 12:48:48,800] [ INFO] - ----------------------------- +[2022-11-21 12:48:48,801] [ INFO] - Class Name: X的类型 +[2022-11-21 12:48:48,801] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000 +``` + +Parameters: + +- `device`: Evaluation device, one of 'cpu', 'gpu' and 'npu' can be selected; the default is GPU evaluation. +- `model_path`: The path of the model folder for evaluation, which must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`. +- `test_path`: The test set file for evaluation. +- `batch_size`: batch size, please adjust according to the machine situation, the default is 16. +- `max_seq_len`: The maximum segmentation length of the text. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. +- `debug`: Whether to enable the debug mode to evaluate each positive category separately. This mode is only used for model debugging and is disabled by default. +- `multilingual`: Whether it is a multilingual model, it is turned off by default. +- `schema_lang`: select the language of the schema, optional `ch` and `en`. 
The default is `ch`, please select `en` for the English dataset. + + + +### 2.5 Inference +Same with the pretrained models, you can use `paddlenlp.Taskflow` to load your custom model by specifying the path of the model weight file through `task_path` + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> schema = {"武器名称": ["产国", "类型", "研发单位"]} +# Set the extraction target and the fine-tuned model path +>>> my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best') +>>> pprint(my_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。")) +[{'武器名称': [{'end': 14, + 'probability': 0.9998632702221926, + 'relations': {'产国': [{'end': 18, + 'probability': 0.9998815094394331, + 'start': 16, + 'text': '瑞典'}], + '研发单位': [{'end': 25, + 'probability': 0.9995875123178521, + 'start': 18, + 'text': 'FFV军械公司'}], + '类型': [{'end': 14, + 'probability': 0.999877336059086, + 'start': 12, + 'text': '炸弹'}]}, + 'start': 0, + 'text': '威尔哥(Virgo)减速炸弹'}]}] +``` + + + +### 2.6 Experiments + + | | Precision | Recall | F1 Score | + | :---: | :--------: | :--------: | :--------: | + | 0-shot | 0.64634| 0.53535 | 0.58564 | + | 5-shot | 0.89474 | 0.85000 | 0.87179 | + | 10-shot | 0.92793 | 0.85833 | 0.89177 | + | full-set | 0.93103 | 0.90000 | 0.91525 | + + + + +### 2.7 Closed Domain Distillation + +Some industrial application scenarios have high inference performance requirements and the model cannot go into production without being effectively compressed. We built the [UIE Slim Data Distillation](./data_distill/README_en.md) with knowledge distillation techniques. The principle is to use the data as a bridge to transfer the knowledge of the UIE model to the smaller closed-domain information extraction model in order to achieve speedup inference significantly with minimal loss to accuracy. 
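As a preview of the deployment story, the distilled closed-domain student model can also be loaded through `Taskflow`, mirroring the deployment example in the data distillation guide; the `task_path` below is an illustrative student-checkpoint location, not a fixed path:

```python
>>> from pprint import pprint
>>> from paddlenlp import Taskflow

# The schema is fixed when the student model is trained, so no schema argument is needed here;
# task_path is an illustrative path to the distilled student checkpoint.
>>> student_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="./data_distill/checkpoint/model_best/")
>>> pprint(student_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制。"))
```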
diff --git a/applications/information_extraction/text/data_distill/README.md b/applications/information_extraction/text/data_distill/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6886f591e10d398f511f5acc321ed4d94929c109 --- /dev/null +++ b/applications/information_extraction/text/data_distill/README.md @@ -0,0 +1,156 @@ +# UIE Slim 数据蒸馏 + +在UIE强大的抽取能力背后,同样需要较大的算力支持计算。在一些工业应用场景中对性能的要求较高,若不能有效压缩则无法实际应用。因此,我们基于数据蒸馏技术构建了UIE Slim数据蒸馏系统。其原理是通过数据作为桥梁,将UIE模型的知识迁移到封闭域信息抽取小模型,以达到精度损失较小的情况下却能达到大幅度预测速度提升的效果。 + +#### UIE数据蒸馏三步 + +- **Step 1**: 使用UIE模型对标注数据进行finetune,得到Teacher Model。 + +- **Step 2**: 用户提供大规模无标注数据,需与标注数据同源。使用Taskflow UIE对无监督数据进行预测。 + +- **Step 3**: 使用标注数据以及步骤2得到的合成数据训练出封闭域Student Model。 + +## UIE Finetune + +参考[UIE关系抽取微调](../README.md)完成模型微调,得到``../checkpoint/model_best``。 + +## 离线蒸馏 + +#### 通过训练好的UIE定制模型预测无监督数据的标签 + +```shell +python data_distill.py \ + --data_path ../data \ + --save_dir student_data \ + --task_type relation_extraction \ + --synthetic_ratio 10 \ + --model_path ../checkpoint/model_best +``` + +**NOTE**:schema需要根据标注数据在`data_distill.py`中进行配置,且schema需要包含标注数据中的所有标签类型。 + +可配置参数说明: + +- `data_path`: 标注数据(`doccano_ext.json`)及无监督文本(`unlabeled_data.txt`)路径。 +- `model_path`: 训练好的UIE定制模型路径。 +- `save_dir`: 学生模型训练数据保存路径。 +- `synthetic_ratio`: 控制合成数据的比例。最大合成数据数量=synthetic_ratio*标注数据数量。 +- `platform`: 标注数据的所使用的标注平台,可选有`doccano`,`label_studio`,默认为`label_studio`。 +- `task_type`: 选择任务类型,可选有`entity_extraction`,`relation_extraction`,`event_extraction`和`opinion_extraction`。因为是封闭域抽取,不同任务的后处理逻辑不同,因此需指定任务类型。 +- `seed`: 随机种子,默认为1000。 + +#### 老师模型评估 + +UIE微调阶段针对UIE训练格式数据评估模型效果(该评估方式非端到端评估,非关系抽取或事件抽取的标准评估方式),可通过以下评估脚本进行端到端评估。 + +```shell +python evaluate_teacher.py \ + --task_type relation_extraction \ + --test_path ./student_data/dev_data.json \ + --label_maps_path ./student_data/label_maps.json \ + --model_path ../checkpoint/model_best +``` + +可配置参数说明: + +- `model_path`: 训练好的UIE定制模型路径。 +- `test_path`: 测试数据集路径。 +- `label_maps_path`: 学生模型标签字典。 +- `batch_size`: 批处理大小,默认为8。 +- `max_seq_len`: 最大文本长度,默认为256。 +- `task_type`: 选择任务类型,可选有`entity_extraction`,`relation_extraction`,`event_extraction`和`opinion_extraction`。因为是封闭域信息抽取的评估,需指定任务类型。 + + +#### 学生模型训练 + +```shell +python train.py \ + --task_type relation_extraction \ + --train_path student_data/train_data.json \ + --dev_path student_data/dev_data.json \ + --label_maps_path student_data/label_maps.json \ + --num_epochs 50 \ + --encoder ernie-3.0-mini-zh +``` + +可配置参数说明: + +- `train_path`: 训练集文件路径。 +- `dev_path`: 验证集文件路径。 +- `batch_size`: 批处理大小,默认为16。 +- `learning_rate`: 学习率,默认为3e-5。 +- `save_dir`: 模型存储路径,默认为`./checkpoint`。 +- `max_seq_len`: 最大文本长度,默认为256。 +- `weight_decay`: 表示AdamW优化器中使用的 weight_decay 的系数。 +- `warmup_proportion`: 学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +- `num_epochs`: 训练轮数,默认为100。 +- `seed`: 随机种子,默认为1000。 +- `encoder`: 选择学生模型的模型底座,默认为`ernie-3.0-mini-zh`。 +- `task_type`: 选择任务类型,可选有`entity_extraction`,`relation_extraction`,`event_extraction`和`opinion_extraction`。因为是封闭域信息抽取,需指定任务类型。 +- `logging_steps`: 日志打印的间隔steps数,默认10。 +- `eval_steps`: evaluate的间隔steps数,默认200。 +- `device`: 选用什么设备进行训练,可选cpu或gpu。 +- `init_from_ckpt`: 可选,模型参数路径,热启动模型训练;默认为None。 + +#### 学生模型评估 + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path student_data/dev_data.json \ + --task_type relation_extraction \ + --label_maps_path student_data/label_maps.json \ + --encoder ernie-3.0-mini-zh +``` + +可配置参数说明: + +- `model_path`: 训练好的UIE定制模型路径。 +- `test_path`: 测试数据集路径。 
+- `label_maps_path`: 学生模型标签字典。 +- `batch_size`: 批处理大小,默认为8。 +- `max_seq_len`: 最大文本长度,默认为256。 +- `encoder`: 选择学生模型的模型底座,默认为`ernie-3.0-mini-zh`。 +- `task_type`: 选择任务类型,可选有`entity_extraction`,`relation_extraction`,`event_extraction`和`opinion_extraction`。因为是封闭域信息抽取的评估,需指定任务类型。 + +## Taskflow部署学生模型 + +- 通过Taskflow一键部署封闭域信息抽取模型,`task_path`为学生模型路径。 + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> my_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="checkpoint/model_best/") # Schema is fixed in closed-domain information extraction +>>> pprint(my_ie("威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。")) +[{'武器名称': [{'end': 14, + 'probability': 0.9976037, + 'relations': {'产国': [{'end': 18, + 'probability': 0.9988706, + 'relations': {}, + 'start': 16, + 'text': '瑞典'}], + '研发单位': [{'end': 25, + 'probability': 0.9978277, + 'relations': {}, + 'start': 18, + 'text': 'FFV军械公司'}], + '类型': [{'end': 14, + 'probability': 0.99837446, + 'relations': {}, + 'start': 12, + 'text': '炸弹'}]}, + 'start': 0, + 'text': '威尔哥(Virgo)减速炸弹'}]}] +``` + + +# References + +- **[GlobalPointer](https://kexue.fm/search/globalpointer/)** + +- **[GPLinker](https://kexue.fm/archives/8888)** + +- **[JunnYu/GPLinker_pytorch](https://github.com/JunnYu/GPLinker_pytorch)** + +- **[CBLUE](https://github.com/CBLUEbenchmark/CBLUE)** diff --git a/applications/information_extraction/text/data_distill/README_en.md b/applications/information_extraction/text/data_distill/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..7b1ca61122a86ba7862c3a67471f4d4c95088234 --- /dev/null +++ b/applications/information_extraction/text/data_distill/README_en.md @@ -0,0 +1,154 @@ +# UIE Slim data distillation + +While UIE has powerful zero-shot extraction capabilities, its prompting structure requires significant compute to serve in real time. Some industrial application scenarios have high inference performance requirements and the model cannot go into production without being effectively compressed. We built the UIE Slim Data Distillation with knowledge distillation techniques. The principle is to use the data as a bridge to transfer the knowledge of the UIE model to the smaller closed-domain information extraction model in order to achieve speedup inference significantly with minimal loss to accuracy. + +#### Three steps of UIE data distillation + +- **Step 1**: Finetune the UIE model on the labeled data to get the Teacher Model. + +- **Step 2**: Process the user-provided unlabeled data and run inference with Taskflow UIE. + +- **Step 3**: Use the labeled data and the inference results obtained in step 2 to train a closed-domain Student Model. + +## UIE Finetune + +Refer to [UIE relationship extraction fine-tuning](../README.md) to complete the model fine-tuning and get ``../checkpoint/model_best``. + +## Offline Distillation + +#### Predict the label of unsupervised data through the trained UIE custom model + +```shell +python data_distill.py \ + --data_path ../data \ + --save_dir student_data \ + --task_type relation_extraction \ + --synthetic_ratio 10 \ + --model_path ../checkpoint/model_best +``` + +**NOTE**: The schema needs to be configured in `data_distill.py` according to the label data, and the schema needs to contain all label types in the label data. 
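+
+For example, the schema defined near the bottom of `data_distill.py` in this directory covers the weapon-extraction demo used throughout this README; replace it with a schema that covers all label types in your own annotated data:
+
+```python
+# Define your schema here
+schema = {"武器名称": ["产国", "类型", "研发单位"]}
+```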
+ +Description of configurable parameters: + +- `data_path`: Path to labeled data (`doccano_ext.json`) and unsupervised text (`unlabeled_data.txt`). +- `model_path`: The path of the trained UIE custom model. +- `save_dir`: The path to save the training data of the student model. +- `synthetic_ratio`: Controls the ratio of synthetic data. The maximum number of synthetic data=synthetic_ratio*number of labeled data. +- `platform`: The labeling platform used to label data, optional are `doccano`, `label_studio`, the default is `label_studio`. +- `task_type`: Select the task type, optional are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Because it is a closed-domain extraction, the post-processing logic of different tasks is different, so the task type needs to be specified. +- `seed`: random seed, default is 1000. + +#### Teacher model evaluation + +In the UIE fine-tuning stage, the model performance is evaluated on UIE training format data, which is not a standard end-to-end evaluation method for relation extraction or event extraction. The end-to-end evaluation can be performed through the following evaluation script. + +```shell +python evaluate_teacher.py \ + --task_type relation_extraction \ + --test_path ./student_data/dev_data.json\ + --label_maps_path ./student_data/label_maps.json \ + --model_path ../checkpoint/model_best +``` + +Description of configurable parameters: + +- `model_path`: The path of the trained UIE custom model. +- `test_path`: test dataset path. +- `label_maps_path`: dictionary of student model labels. +- `batch_size`: batch size, default is 8. +- `max_seq_len`: Maximum text length, default is 256. +- `task_type`: Select the task type, optional are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Because it is an evaluation of closed-domain information extraction, the task type needs to be specified. + + +#### Student model training + +```shell +python train.py\ + --task_type relation_extraction \ + --train_path student_data/train_data.json \ + --dev_path student_data/dev_data.json \ + --label_maps_path student_data/label_maps.json \ + --num_epochs 50 \ + --encoder ernie-3.0-mini-zh +``` + +Description of configurable parameters: + +- `train_path`: training set file path. +- `dev_path`: Validation set file path. +- `batch_size`: batch size, default is 16. +- `learning_rate`: Learning rate, default is 3e-5. +- `save_dir`: model storage path, the default is `./checkpoint`. +- `max_seq_len`: Maximum text length, default is 256. +- `weight_decay`: Indicates the coefficient of weight_decay used in the AdamW optimizer. +- `warmup_proportion`: The proportion of the learning rate warmup strategy. If it is 0.1, the learning rate will slowly increase from 0 to learning_rate during the first 10% training step, and then slowly decay. The default is 0.0. +- `num_epochs`: The number of training epochs, the default is 100. +- `seed`: random seed, default is 1000. +- `encoder`: select the model base of the student model, the default is `ernie-3.0-mini-zh`. +- `task_type`: Select the task type, optional are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Because it is closed-domain information extraction, the task type needs to be specified. +- `logging_steps`: The interval steps of log printing, the default is 10. +- `eval_steps`: The interval steps of evaluate, the default is 200. +- `device`: What device to choose for training, optional cpu or gpu. 
+- `init_from_ckpt`: optional, model parameter path, hot start model training; default is None. + +#### Student model evaluation + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path student_data/dev_data.json \ + --task_type relation_extraction \ + --label_maps_path student_data/label_maps.json \ + --encoder ernie-3.0-mini-zh +``` + +Description of configurable parameters: + +- `model_path`: The path of the trained UIE custom model. +- `test_path`: test dataset path. +- `label_maps_path`: dictionary of student model labels. +- `batch_size`: batch size, default is 8. +- `max_seq_len`: Maximum text length, default is 256. +- `encoder`: select the model base of the student model, the default is `ernie-3.0-mini-zh`. +- `task_type`: Select the task type, optional are `entity_extraction`, `relation_extraction`, `event_extraction` and `opinion_extraction`. Because it is an evaluation of closed-domain information extraction, the task type needs to be specified. + +## Student model deployment + +- Fast deployment of the closed-domain information extraction model through Taskflow, `task_path` is the path of the student model. + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> my_ie = Taskflow("information_extraction", model="uie-data-distill-gp", task_path="checkpoint/model_best/") # Schema is fixed in closed-domain information extraction +>>> pprint(my_ie("Virgo deceleration bomb was developed by the Swedish FFV Ordnance Company specially for the attack aircraft of the Swedish Royal Air Force to carry out low-altitude and high-speed bombing. It was developed in 1956 and entered service in 1963. It is equipped on the A32 "Contradiction", A35 "Dragon", and AJ134 "Thunder" attack aircraft are mainly used to attack landing craft, parked aircraft, anti-aircraft artillery, field artillery, light armored vehicles and active forces.")) +[{'weapon name': [{'end': 14, + 'probability': 0.9976037, + 'relations': {'country of origin': [{'end': 18, + 'probability': 0.9988706, + 'relations': {}, + 'start': 16, + 'text': 'Sweden'}], + 'R&D unit': [{'end': 25, + 'probability': 0.9978277, + 'relations': {}, + 'start': 18, + 'text': 'FFV Ordnance Company'}], + 'type': [{'end': 14, + 'probability': 0.99837446, + 'relations': {}, + 'start': 12, + 'text': 'bomb'}]}, + 'start': 0, + 'text': 'Virgo slowing bomb'}]}] +``` + + +# References + +- **[GlobalPointer](https://kexue.fm/search/globalpointer/)** + +- **[GPLinker](https://kexue.fm/archives/8888)** + +- **[JunnYu/GPLinker_pytorch](https://github.com/JunnYu/GPLinker_pytorch** diff --git a/applications/information_extraction/text/data_distill/criterion.py b/applications/information_extraction/text/data_distill/criterion.py new file mode 100644 index 0000000000000000000000000000000000000000..e5e6c2f9c3c07a800605c57240098b5243ff8634 --- /dev/null +++ b/applications/information_extraction/text/data_distill/criterion.py @@ -0,0 +1,56 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle +import paddle.nn as nn + + +class Criterion(nn.Layer): + """Criterion for GPNet""" + + def __init__(self, mask_zero=True): + self.mask_zero = mask_zero + + def _sparse_multilabel_categorical_crossentropy(self, y_true, y_pred, mask_zero=False): + """Sparse multi-label categorical cross entropy + reference to "https://kexue.fm/archives/7359". + """ + zeros = paddle.zeros_like(y_pred[..., :1]) + y_pred = paddle.concat([y_pred, zeros], axis=-1) + if mask_zero: + infs = zeros + 1e12 + y_pred = paddle.concat([infs, y_pred[..., 1:]], axis=-1) + y_pos_2 = paddle.take_along_axis(y_pred, y_true, axis=-1) + y_pos_1 = paddle.concat([y_pos_2, zeros], axis=-1) + if mask_zero: + y_pred = paddle.concat([-infs, y_pred[..., 1:]], axis=-1) + y_pos_2 = paddle.take_along_axis(y_pred, y_true, axis=-1) + + pos_loss = (-y_pos_1).exp().sum(axis=-1).log() + all_loss = y_pred.exp().sum(axis=-1).log() + aux_loss = y_pos_2.exp().sum(axis=-1).log() - all_loss + aux_loss = paddle.clip(1 - paddle.exp(aux_loss), min=0.1, max=1) + neg_loss = all_loss + paddle.log(aux_loss) + return pos_loss + neg_loss + + def __call__(self, y_pred, y_true): + shape = y_pred.shape + y_true = y_true[..., 0] * shape[2] + y_true[..., 1] + # bs, nclass, seqlen * seqlen + y_pred = paddle.reshape(y_pred, shape=[shape[0], -1, np.prod(shape[2:])]) + + loss = self._sparse_multilabel_categorical_crossentropy(y_true, y_pred, self.mask_zero) + return loss.sum(axis=1).mean() diff --git a/applications/information_extraction/text/data_distill/data_collator.py b/applications/information_extraction/text/data_distill/data_collator.py new file mode 100644 index 0000000000000000000000000000000000000000..2c12b98186ab1516c11af8b0a03576625958ddd9 --- /dev/null +++ b/applications/information_extraction/text/data_distill/data_collator.py @@ -0,0 +1,86 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
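+
+# DataCollator pads a batch of tokenized features and converts the sparse
+# entity / relation span annotations into the dense label tensors expected by
+# the GlobalPointer (entity) and GPLinker (relation / opinion) heads.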
+ +from dataclasses import dataclass +from typing import Dict, List, Optional, Union + +import paddle + +from paddlenlp.transformers.tokenizer_utils_base import ( + PaddingStrategy, + PretrainedTokenizerBase, +) + +ignore_list = ["offset_mapping", "text"] + + +@dataclass +class DataCollator: + tokenizer: PretrainedTokenizerBase + padding: Union[bool, str, PaddingStrategy] = True + max_length: Optional[int] = None + label_maps: Optional[dict] = None + task_type: Optional[str] = None + + def __call__(self, features: List[Dict[str, Union[List[int], paddle.Tensor]]]) -> Dict[str, paddle.Tensor]: + labels = [feature["labels"] for feature in features] if "labels" in features[0].keys() else None + new_features = [{k: v for k, v in f.items() if k not in ["labels"] + ignore_list} for f in features] + + batch = self.tokenizer.pad( + new_features, + padding=self.padding, + ) + + batch = [paddle.to_tensor(batch[k]) for k in batch.keys()] + + if labels is None: # for test + if "offset_mapping" in features[0].keys(): + batch.append([feature["offset_mapping"] for feature in features]) + if "text" in features[0].keys(): + batch.append([feature["text"] for feature in features]) + return batch + + bs = batch[0].shape[0] + if self.task_type == "entity_extraction": + # Ensure the dimension is greater or equal to 1 + max_ent_num = max(max([len(lb["ent_labels"]) for lb in labels]), 1) + num_ents = len(self.label_maps["entity2id"]) + batch_entity_labels = paddle.zeros(shape=[bs, num_ents, max_ent_num, 2], dtype="int64") + for i, lb in enumerate(labels): + for eidx, (l, eh, et) in enumerate(lb["ent_labels"]): + batch_entity_labels[i, l, eidx, :] = paddle.to_tensor([eh, et]) + + batch.append([batch_entity_labels]) + else: + # Ensure the dimension is greater or equal to 1 + max_ent_num = max(max([len(lb["ent_labels"]) for lb in labels]), 1) + max_spo_num = max(max([len(lb["rel_labels"]) for lb in labels]), 1) + num_ents = len(self.label_maps["entity2id"]) + if "relation2id" in self.label_maps.keys(): + num_rels = len(self.label_maps["relation2id"]) + else: + num_rels = len(self.label_maps["sentiment2id"]) + batch_entity_labels = paddle.zeros(shape=[bs, num_ents, max_ent_num, 2], dtype="int64") + batch_head_labels = paddle.zeros(shape=[bs, num_rels, max_spo_num, 2], dtype="int64") + batch_tail_labels = paddle.zeros(shape=[bs, num_rels, max_spo_num, 2], dtype="int64") + + for i, lb in enumerate(labels): + for eidx, (l, eh, et) in enumerate(lb["ent_labels"]): + batch_entity_labels[i, l, eidx, :] = paddle.to_tensor([eh, et]) + for spidx, (sh, st, p, oh, ot) in enumerate(lb["rel_labels"]): + batch_head_labels[i, p, spidx, :] = paddle.to_tensor([sh, oh]) + batch_tail_labels[i, p, spidx, :] = paddle.to_tensor([st, ot]) + batch.append([batch_entity_labels, batch_head_labels, batch_tail_labels]) + return batch diff --git a/applications/information_extraction/text/data_distill/data_distill.py b/applications/information_extraction/text/data_distill/data_distill.py new file mode 100644 index 0000000000000000000000000000000000000000..b3564311cd97d30712cbce2a3f416504cbfc352e --- /dev/null +++ b/applications/information_extraction/text/data_distill/data_distill.py @@ -0,0 +1,128 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import math +import os +import random + +from tqdm import tqdm +from utils import anno2distill, schema2label_maps, set_seed, synthetic2distill + +from paddlenlp import Taskflow +from paddlenlp.utils.log import logger + + +def do_data_distill(): + set_seed(args.seed) + + # Generate closed-domain label maps + if not os.path.exists(args.save_dir): + os.mkdir(args.save_dir) + label_maps = schema2label_maps(args.task_type, schema=args.schema) + label_maps_path = os.path.join(args.save_dir, "label_maps.json") + + # Save closed-domain label maps file + with open(label_maps_path, "w", encoding="utf-8") as fp: + fp.write(json.dumps(label_maps, ensure_ascii=False)) + + # Load doccano file and convert to distill format + sample_index = json.loads( + open(os.path.join(args.data_path, "sample_index.json"), "r", encoding="utf-8").readline() + ) + + train_ids = sample_index["train_ids"] + dev_ids = sample_index["dev_ids"] + test_ids = sample_index["test_ids"] + + if args.platform == "label_studio": + with open(os.path.join(args.data_path, "label_studio.json"), "r", encoding="utf-8") as fp: + json_lines = json.loads(fp.read()) + elif args.platform == "doccano": + json_lines = [] + with open(os.path.join(args.data_path, "doccano_ext.json"), "r", encoding="utf-8") as fp: + for line in fp: + json_lines.append(json.loads(line)) + else: + raise ValueError("Unsupported annotation platform!") + + train_lines = [json_lines[i] for i in train_ids] + train_lines = anno2distill(train_lines, args.task_type, label_maps, args.platform) + + dev_lines = [json_lines[i] for i in dev_ids] + dev_lines = anno2distill(dev_lines, args.task_type, label_maps, args.platform) + + test_lines = [json_lines[i] for i in test_ids] + test_lines = anno2distill(test_lines, args.task_type, label_maps, args.platform) + + # Load trained UIE model + uie = Taskflow("information_extraction", schema=args.schema, task_path=args.model_path) + + if args.synthetic_ratio > 0: + # Generate synthetic data + texts = open(os.path.join(args.data_path, "unlabeled_data.txt"), "r", encoding="utf-8").readlines() + + actual_ratio = math.ceil(len(texts) / len(train_lines)) + if actual_ratio <= args.synthetic_ratio or args.synthetic_ratio == -1: + infer_texts = texts + else: + idxs = random.sample(range(0, len(texts)), args.synthetic_ratio * len(train_lines)) + infer_texts = [texts[i] for i in idxs] + + infer_results = [] + for text in tqdm(infer_texts, desc="Predicting: ", leave=False): + infer_results.extend(uie(text)) + + train_synthetic_lines = synthetic2distill(infer_texts, infer_results, args.task_type) + + # Concat origin and synthetic data + train_lines.extend(train_synthetic_lines) + + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + f.write(json.dumps(example, ensure_ascii=False) + "\n") + count += 1 + logger.info("Save %d examples to %s." 
% (count, save_path)) + + _save_examples(args.save_dir, "train_data.json", train_lines) + _save_examples(args.save_dir, "dev_data.json", dev_lines) + _save_examples(args.save_dir, "test_data.json", test_lines) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--data_path", default="../data", type=str, help="The directory for labeled data with doccano format and the large scale unlabeled data.") + parser.add_argument("--model_path", type=str, default="../checkpoint/model_best", help="The path of saved model that you want to load.") + parser.add_argument("--save_dir", default="./distill_task", type=str, help="The path of data that you wanna save.") + parser.add_argument("--synthetic_ratio", default=10, type=int, help="The ratio of labeled and synthetic samples.") + parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization") + parser.add_argument("--platform", choices=['doccano', 'label_studio'], type=str, default="label_studio", help="Select the annotation platform.") + + args = parser.parse_args() + # yapf: enable + + # Define your schema here + schema = {"武器名称": ["产国", "类型", "研发单位"]} + + args.schema = schema + + do_data_distill() diff --git a/applications/information_extraction/text/data_distill/deploy/simple_serving/README.md b/applications/information_extraction/text/data_distill/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..22b4783242750951949a7abaf3072d42a20c74b9 --- /dev/null +++ b/applications/information_extraction/text/data_distill/deploy/simple_serving/README.md @@ -0,0 +1,58 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server服务启动](#Server服务启动) +- [Client请求启动](#Client请求启动) +- [服务化自定义参数](#服务化自定义参数) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本(或者最新的develop版本) + +```shell +pip install paddlenlp >= 2.4.4 +``` + + +## Server服务启动 + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + +## Client请求启动 + +```bash +python client.py +``` + +## 服务化自定义参数 + +### Server 自定义参数 +#### schema替换 +```python +# Default schema +schema = {"武器名称": ["产国", "类型", "研发单位"]} +``` + +#### 设置模型路径 +``` +# Default task_path +uie = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema) +``` + +#### 多卡服务化预测 +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,下面是示例代码 +``` +uie1 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=0) +uie2 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=1) +service.register_taskflow('uie', [uie1, uie2]) +``` + +### Client 自定义参数 + +```python +# Changed to input texts you wanted +texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。'] + +``` diff --git a/applications/information_extraction/text/data_distill/deploy/simple_serving/README_en.md b/applications/information_extraction/text/data_distill/deploy/simple_serving/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..8337a2fbc18f8dfdcc10aa1be16f96950e162d7b --- 
/dev/null +++ b/applications/information_extraction/text/data_distill/deploy/simple_serving/README_en.md @@ -0,0 +1,64 @@
+# Service deployment based on PaddleNLP SimpleServing
+
+- [Environment Preparation](#1)
+- [Server](#2)
+- [Client](#3)
+- [Service Custom Parameters](#4)
+
+
+
+## Environment Preparation
+Use a PaddleNLP version that includes the SimpleServing feature (or the latest develop version)
+
+```shell
+pip install paddlenlp >= 2.4.4
+```
+
+
+
+## Server
+
+```bash
+paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189
+```
+
+
+
+## Client
+
+```bash
+python client.py
+```
+
+
+
+## Service Custom Parameters
+
+### Server Custom Parameters
+
+#### Schema replacement
+```python
+# Default schema
+schema = {"Weapon Name": ["Country of Production", "Type", "R&D Unit"]}
+```
+
+#### Set model path
+```
+# Default task_path
+uie = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema)
+```
+
+#### Multi-card service prediction
+PaddleNLP SimpleServing supports multi-card load-balanced prediction: simply register two Taskflow tasks when registering the service, as in the sample code below.
+```
+uie1 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=0)
+uie2 = Taskflow('information_extraction', model='uie-data-distill-gp', task_path='../../checkpoint/model_best/', schema=schema, device_id=1)
+service.register_taskflow('uie', [uie1, uie2])
+```
+
+### Client Custom Parameters
+
+```python
+# Change to the input texts you want
+texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。']
+```
diff --git a/applications/information_extraction/text/data_distill/deploy/simple_serving/client.py b/applications/information_extraction/text/data_distill/deploy/simple_serving/client.py
new file mode 100644
index 0000000000000000000000000000000000000000..cd2914e22b2b8b1c46db97facc09bdc6b5ac3957
--- /dev/null
+++ b/applications/information_extraction/text/data_distill/deploy/simple_serving/client.py
@@ -0,0 +1,29 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
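+
+# Minimal SimpleServing client: posts the example texts to the Taskflow UIE
+# endpoint exposed by server.py and prints the JSON extraction results.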
+
+import json
+
+import requests
+
+url = "http://0.0.0.0:8189/taskflow/uie"
+
+headers = {"Content-Type": "application/json"}
+texts = [
+    "威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。"
+]
+
+data = {"data": {"text": texts}}
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+datas = json.loads(r.text)
+print(datas)
diff --git a/applications/information_extraction/text/data_distill/deploy/simple_serving/server.py b/applications/information_extraction/text/data_distill/deploy/simple_serving/server.py
new file mode 100644
index 0000000000000000000000000000000000000000..dadb51a6dc04822869bd141a4a62c76c70012692
--- /dev/null
+++ b/applications/information_extraction/text/data_distill/deploy/simple_serving/server.py
@@ -0,0 +1,25 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddlenlp import SimpleServer, Taskflow
+
+# Change the schema to the one defined for your task
+schema = {"武器名称": ["产国", "类型", "研发单位"]}
+# Change task_path to the path of your best model checkpoint
+uie = Taskflow(
+    "information_extraction", model="uie-data-distill-gp", schema=schema, task_path="../../checkpoint/model_best/"
+)
+# Register the fine-tuned closed-domain model as a SimpleServing service
+app = SimpleServer()
+app.register_taskflow("taskflow/uie", uie)
diff --git a/applications/information_extraction/text/data_distill/evaluate.py b/applications/information_extraction/text/data_distill/evaluate.py
new file mode 100644
index 0000000000000000000000000000000000000000..a6cee13f165d2591967c8ee6b97fa7bc7152a153
--- /dev/null
+++ b/applications/information_extraction/text/data_distill/evaluate.py
@@ -0,0 +1,96 @@
+# coding=utf-8
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
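+
+# End-to-end evaluation of the closed-domain student model: loads a trained
+# GlobalPointer / GPLinker checkpoint, runs it over the test set and reports
+# precision / recall / F1 via metric.get_eval.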
+ +import argparse +import os + +import paddle +from metric import get_eval +from tqdm import tqdm +from utils import create_dataloader, get_label_maps, postprocess, reader + +from paddlenlp.datasets import load_dataset +from paddlenlp.layers import ( + GlobalPointerForEntityExtraction, + GPLinkerForRelationExtraction, +) +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, dataloader, label_maps, task_type="relation_extraction"): + model.eval() + all_preds = ([], []) if task_type in ["opinion_extraction", "relation_extraction", "event_extraction"] else [] + for batch in tqdm(dataloader, desc="Evaluating: ", leave=False): + input_ids, attention_masks, offset_mappings, texts = batch + logits = model(input_ids, attention_masks) + batch_outputs = postprocess(logits, offset_mappings, texts, label_maps, task_type) + if isinstance(batch_outputs, tuple): + all_preds[0].extend(batch_outputs[0]) # Entity output + all_preds[1].extend(batch_outputs[1]) # Relation output + else: + all_preds.extend(batch_outputs) + eval_results = get_eval(all_preds, dataloader.dataset.raw_data, task_type) + model.train() + return eval_results + + +def do_eval(): + label_maps = get_label_maps(args.task_type, args.label_maps_path) + + tokenizer = AutoTokenizer.from_pretrained(args.encoder) + encoder = AutoModel.from_pretrained(args.encoder) + if args.task_type == "entity_extraction": + model = GlobalPointerForEntityExtraction(encoder, label_maps) + else: + model = GPLinkerForRelationExtraction(encoder, label_maps) + + if args.model_path: + state_dict = paddle.load(os.path.join(args.model_path, "model_state.pdparams")) + model.set_dict(state_dict) + + test_ds = load_dataset(reader, data_path=args.test_path, lazy=False) + + test_dataloader = create_dataloader( + test_ds, + tokenizer, + max_seq_len=args.max_seq_len, + batch_size=args.batch_size, + label_maps=label_maps, + mode="test", + task_type=args.task_type, + ) + + eval_result = evaluate(model, test_dataloader, label_maps, task_type=args.task_type) + logger.info("Evaluation precision: " + str(eval_result)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument("--test_path", type=str, default=None, help="The path of test set.") + parser.add_argument("--encoder", default="ernie-3.0-mini-zh", type=str, help="Select the pretrained encoder model for GP.") + parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=128, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.") + + args = parser.parse_args() + # yapf: enable + + do_eval() diff --git a/applications/information_extraction/text/data_distill/evaluate_teacher.py b/applications/information_extraction/text/data_distill/evaluate_teacher.py new file mode 100644 index 0000000000000000000000000000000000000000..318c1f9b0d8122f53a17881548fdae566663e651 --- /dev/null +++ 
b/applications/information_extraction/text/data_distill/evaluate_teacher.py @@ -0,0 +1,95 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +from metric import get_eval +from tqdm import tqdm +from utils import create_dataloader, get_label_maps, reader, synthetic2distill + +from paddlenlp import Taskflow +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(uie, dataloader, task_type="relation_extraction"): + all_preds = ([], []) if task_type in ["opinion_extraction", "relation_extraction", "event_extraction"] else [] + + infer_results = [] + all_texts = [] + for batch in tqdm(dataloader, desc="Evaluating: ", leave=False): + _, _, _, texts = batch + all_texts.extend(texts) + infer_results.extend(uie(texts)) + + infer_results = synthetic2distill(all_texts, infer_results, task_type) + + for res in infer_results: + if task_type == "entity_extraction": + all_preds.append(res["entity_list"]) + else: + all_preds[0].append(res["entity_list"]) + all_preds[1].append(res["spo_list"]) + + eval_results = get_eval(all_preds, dataloader.dataset.raw_data, task_type) + return eval_results + + +def do_eval(): + # Load trained UIE model + uie = Taskflow("information_extraction", schema=args.schema, batch_size=args.batch_size, task_path=args.model_path) + + label_maps = get_label_maps(args.task_type, args.label_maps_path) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") + + test_ds = load_dataset(reader, data_path=args.test_path, lazy=False) + + test_dataloader = create_dataloader( + test_ds, + tokenizer, + max_seq_len=args.max_seq_len, + batch_size=args.batch_size, + label_maps=label_maps, + mode="test", + task_type=args.task_type, + ) + + eval_result = evaluate(uie, test_dataloader, task_type=args.task_type) + logger.info("Evaluation precision: " + str(eval_result)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument("--test_path", type=str, default=None, help="The path of test set.") + parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.") + parser.add_argument("--batch_size", type=int, default=8, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=256, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.") + + args = parser.parse_args() + # yapf: enable + + schema = {"武器名称": ["产国", "类型", "研发单位"]} + + args.schema = schema + + do_eval() 
diff --git a/applications/information_extraction/text/data_distill/metric.py b/applications/information_extraction/text/data_distill/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..0e140642de28350bb3f567159d485d2f0d49fb62 --- /dev/null +++ b/applications/information_extraction/text/data_distill/metric.py @@ -0,0 +1,72 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def get_eval(all_preds, raw_data, task_type): + if task_type == "entity_extraction": + ex, ey, ez = 1e-10, 1e-10, 1e-10 + for ent_preds, data in zip(all_preds, raw_data): + pred_ent_set = set([tuple(p.values()) for p in ent_preds]) + gold_ent_set = set([tuple(g.values()) for g in data["entity_list"]]) + ex += len(pred_ent_set & gold_ent_set) + ey += len(pred_ent_set) + ez += len(gold_ent_set) + ent_f1 = round(2 * ex / (ey + ez), 5) if ex != 1e-10 else 0.0 + ent_precision = round(ex / ey, 5) if ey != 1e-10 else 0.0 + ent_recall = round(ex / ez, 5) if ez != 1e-10 else 0.0 + + return { + "entity_f1": ent_f1, + "entity_precision": ent_precision, + "entity_recall": ent_recall, + } + else: + all_ent_preds, all_rel_preds = all_preds + + ex, ey, ez = 1e-10, 1e-10, 1e-10 + for ent_preds, data in zip(all_ent_preds, raw_data): + pred_ent_set = set([tuple(p.values()) for p in ent_preds]) + gold_ent_set = set([tuple(g.values()) for g in data["entity_list"]]) + ex += len(pred_ent_set & gold_ent_set) + ey += len(pred_ent_set) + ez += len(gold_ent_set) + ent_f1 = round(2 * ex / (ey + ez), 5) if ex != 1e-10 else 0.0 + ent_precision = round(ex / ey, 5) if ey != 1e-10 else 0.0 + ent_recall = round(ex / ez, 5) if ez != 1e-10 else 0.0 + + rx, ry, rz = 1e-10, 1e-10, 1e-10 + + for rel_preds, raw_data in zip(all_rel_preds, raw_data): + pred_rel_set = set([tuple(p.values()) for p in rel_preds]) + if task_type == "opinion_extraction": + gold_rel_set = set([tuple(g.values()) for g in raw_data["aso_list"]]) + else: + gold_rel_set = set([tuple(g.values()) for g in raw_data["spo_list"]]) + rx += len(pred_rel_set & gold_rel_set) + ry += len(pred_rel_set) + rz += len(gold_rel_set) + + rel_f1 = round(2 * rx / (ry + rz), 5) if rx != 1e-10 else 0.0 + rel_precision = round(rx / ry, 5) if ry != 1e-10 else 0.0 + rel_recall = round(rx / rz, 5) if rz != 1e-10 else 0.0 + + return { + "entity_f1": ent_f1, + "entity_precision": ent_precision, + "entity_recall": ent_recall, + "relation_f1": rel_f1, + "relation_precision": rel_precision, + "relation_recall": rel_recall, + } diff --git a/applications/information_extraction/text/data_distill/train.py b/applications/information_extraction/text/data_distill/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e52cb84602697b2abd43ba172e0ba240884c2224 --- /dev/null +++ b/applications/information_extraction/text/data_distill/train.py @@ -0,0 +1,190 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time + +import paddle +from criterion import Criterion +from evaluate import evaluate +from utils import ( + create_dataloader, + criteria_map, + get_label_maps, + reader, + save_model_config, + set_seed, +) + +from paddlenlp.datasets import load_dataset +from paddlenlp.layers import ( + GlobalPointerForEntityExtraction, + GPLinkerForRelationExtraction, +) +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup +from paddlenlp.utils.log import logger + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(args.seed) + + label_maps = get_label_maps(args.task_type, args.label_maps_path) + + train_ds = load_dataset(reader, data_path=args.train_path, lazy=False) + dev_ds = load_dataset(reader, data_path=args.dev_path, lazy=False) + tokenizer = AutoTokenizer.from_pretrained(args.encoder) + + train_dataloader = create_dataloader( + train_ds, + tokenizer, + max_seq_len=args.max_seq_len, + batch_size=args.batch_size, + label_maps=label_maps, + mode="train", + task_type=args.task_type, + ) + + dev_dataloader = create_dataloader( + dev_ds, + tokenizer, + max_seq_len=args.max_seq_len, + batch_size=args.batch_size, + label_maps=label_maps, + mode="dev", + task_type=args.task_type, + ) + + encoder = AutoModel.from_pretrained(args.encoder) + if args.task_type == "entity_extraction": + model = GlobalPointerForEntityExtraction(encoder, label_maps) + else: + model = GPLinkerForRelationExtraction(encoder, label_maps) + + model_config = {"task_type": args.task_type, "label_maps": label_maps, "encoder": args.encoder} + + num_training_steps = len(train_dataloader) * args.num_epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
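+    # The exclusion below is a substring match on the parameter name ("bias" / "norm").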
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + criterion = Criterion() + + global_step, best_f1 = 1, 0.0 + tr_loss, logging_loss = 0.0, 0.0 + tic_train = time.time() + for epoch in range(1, args.num_epochs + 1): + for batch in train_dataloader: + input_ids, attention_masks, labels = batch + + logits = model(input_ids, attention_masks) + + loss = sum([criterion(o, l) for o, l in zip(logits, labels)]) / 3 + + loss.backward() + + tr_loss += loss.item() + + lr_scheduler.step() + optimizer.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + loss_avg = (tr_loss - logging_loss) / args.logging_steps + logger.info( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss_avg, args.logging_steps / time_diff) + ) + logging_loss = tr_loss + tic_train = time.time() + + if global_step % args.eval_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + save_model_config(save_dir, model_config) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + + eval_result = evaluate(model, dev_dataloader, label_maps, task_type=args.task_type) + logger.info("Evaluation precision: " + str(eval_result)) + + f1 = eval_result[criteria_map[args.task_type]] + if f1 > best_f1: + logger.info(f"best F1 performance has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + save_dir = os.path.join(args.save_dir, "model_best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + save_model_config(save_dir, model_config) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + tic_train = time.time() + + global_step += 1 + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--train_path", default=None, type=str, help="The path of train set.") + parser.add_argument("--dev_path", default=None, type=str, help="The path of dev set.") + parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") + parser.add_argument("--max_seq_len", default=256, type=int, help="The maximum input sequence length.") + parser.add_argument("--label_maps_path", default="./ner_data/label_maps.json", type=str, help="The file path of the labels dictionary.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay rate for L2 regularizer.") + parser.add_argument("--warmup_proportion", default=0.0, 
type=float, help="Linear warmup proportion over the training process.") + parser.add_argument("--num_epochs", default=100, type=int, help="Number of epoches for training.") + parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization") + parser.add_argument("--encoder", default="ernie-3.0-mini-zh", type=str, help="Select the pretrained encoder model for GP.") + parser.add_argument("--task_type", choices=['relation_extraction', 'event_extraction', 'entity_extraction', 'opinion_extraction'], default="entity_extraction", type=str, help="Select the training task type.") + parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") + parser.add_argument("--eval_steps", default=200, type=int, help="The interval steps to evaluate model performance.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.") + + args = parser.parse_args() + # yapf: enable + + do_train() diff --git a/applications/information_extraction/text/data_distill/utils.py b/applications/information_extraction/text/data_distill/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..1532ee86185a6688c38d4186f3d89f6d788b1447 --- /dev/null +++ b/applications/information_extraction/text/data_distill/utils.py @@ -0,0 +1,554 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
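+
+# Shared helpers for UIE data distillation: schema / label-map construction,
+# conversion of doccano, label_studio and Taskflow prediction outputs into the
+# closed-domain training format, dataloader creation, and decoding of
+# GlobalPointer / GPLinker outputs back into entity / relation dicts.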
+ +import copy +import json +import os +import random + +import numpy as np +import paddle +from data_collator import DataCollator + +from paddlenlp.taskflow.utils import SchemaTree +from paddlenlp.utils.log import logger + +criteria_map = { + "entity_extraction": "entity_f1", + "opinion_extraction": "relation_f1", # (Aspect, Sentiment, Opinion) + "relation_extraction": "relation_f1", # (Subject, Predicate, Object) + "event_extraction": "relation_f1", # (Trigger, Role, Argument) +} + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def reader(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + yield json_line + + +def save_model_config(save_dir, model_config): + model_config_file = os.path.join(save_dir, "model_config.json") + with open(model_config_file, "w", encoding="utf-8") as fp: + fp.write(json.dumps(model_config, ensure_ascii=False, indent=2)) + + +def map_offset(ori_offset, offset_mapping): + """ + map ori offset to token offset + """ + for index, span in enumerate(offset_mapping): + if span[0] <= ori_offset < span[1]: + return index + return -1 + + +def get_label_maps(task_type="relation_extraction", label_maps_path=None): + with open(label_maps_path, "r", encoding="utf-8") as fp: + label_maps = json.load(fp) + if task_type == "entity_extraction": + entity2id = label_maps["entity2id"] + id2entity = {idx: t for t, idx in entity2id.items()} + label_maps["id2entity"] = id2entity + else: + entity2id = label_maps["entity2id"] + relation2id = ( + label_maps["relation2id"] + if task_type in ["relation_extraction", "event_extraction"] + else label_maps["sentiment2id"] + ) + id2entity = {idx: t for t, idx in entity2id.items()} + id2relation = {idx: t for t, idx in relation2id.items()} + label_maps["id2entity"] = id2entity + label_maps["id2relation"] = id2relation + return label_maps + + +def create_dataloader( + dataset, tokenizer, max_seq_len=128, batch_size=1, label_maps=None, mode="train", task_type="relation_extraction" +): + def tokenize_and_align_train_labels(example): + tokenized_inputs = tokenizer( + example["text"], + max_length=max_seq_len, + padding=False, + truncation=True, + return_attention_mask=True, + return_token_type_ids=False, + return_offsets_mapping=True, + ) + offset_mapping = tokenized_inputs["offset_mapping"] + + ent_labels = [] + for e in example["entity_list"]: + _start, _end = e["start_index"], e["start_index"] + len(e["text"]) - 1 + start = map_offset(_start, offset_mapping) + end = map_offset(_end, offset_mapping) + if start == -1 or end == -1: + continue + label = label_maps["entity2id"][e["type"]] + ent_labels.append([label, start, end]) + + outputs = { + "input_ids": tokenized_inputs["input_ids"], + "attention_mask": tokenized_inputs["attention_mask"], + "labels": {"ent_labels": ent_labels, "rel_labels": []}, + } + + if task_type in ["relation_extraction", "event_extraction"]: + rel_labels = [] + for r in example["spo_list"]: + _sh, _oh = r["subject_start_index"], r["object_start_index"] + _st, _ot = _sh + len(r["subject"]) - 1, _oh + len(r["object"]) - 1 + sh = map_offset(_sh, offset_mapping) + st = map_offset(_st, offset_mapping) + oh = map_offset(_oh, offset_mapping) + ot = map_offset(_ot, offset_mapping) + if sh == -1 or st == -1 or oh == -1 or ot == -1: + continue + p = label_maps["relation2id"][r["predicate"]] + rel_labels.append([sh, st, p, oh, ot]) + outputs["labels"]["rel_labels"] = rel_labels + elif task_type == "opinion_extraction": + rel_labels 
= [] + for r in example["aso_list"]: + _ah, _oh = r["aspect_start_index"], r["opinion_start_index"] + _at, _ot = _ah + len(r["aspect"]) - 1, _oh + len(r["opinion"]) - 1 + ah = map_offset(_ah, offset_mapping) + at = map_offset(_at, offset_mapping) + oh = map_offset(_oh, offset_mapping) + ot = map_offset(_ot, offset_mapping) + if ah == -1 or at == -1 or oh == -1 or ot == -1: + continue + + s = label_maps["sentiment2id"][r["sentiment"]] + rel_labels.append([ah, at, s, oh, ot]) + outputs["labels"]["rel_labels"] = rel_labels + return outputs + + def tokenize(example): + tokenized_inputs = tokenizer( + example["text"], + max_length=max_seq_len, + padding=False, + truncation=True, + return_attention_mask=True, + return_offsets_mapping=True, + return_token_type_ids=False, + ) + tokenized_inputs["text"] = example["text"] + return tokenized_inputs + + if mode == "train": + dataset = dataset.map(tokenize_and_align_train_labels) + else: + dataset_copy = copy.deepcopy(dataset) + dataset = dataset.map(tokenize) + + data_collator = DataCollator(tokenizer, label_maps=label_maps, task_type=task_type) + + shuffle = True if mode == "train" else False + batch_sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader( + dataset=dataset, batch_sampler=batch_sampler, collate_fn=data_collator, num_workers=0, return_list=True + ) + if mode != "train": + dataloader.dataset.raw_data = dataset_copy + return dataloader + + +def postprocess(batch_outputs, offset_mappings, texts, label_maps, task_type="relation_extraction"): + if task_type == "entity_extraction": + batch_ent_results = [] + for entity_output, offset_mapping, text in zip(batch_outputs[0].numpy(), offset_mappings, texts): + entity_output[:, [0, -1]] -= np.inf + entity_output[:, :, [0, -1]] -= np.inf + ent_list = [] + for l, start, end in zip(*np.where(entity_output > 0.0)): + start, end = (offset_mapping[start][0], offset_mapping[end][-1]) + ent = {"text": text[start:end], "type": label_maps["id2entity"][l], "start_index": start} + ent_list.append(ent) + batch_ent_results.append(ent_list) + return batch_ent_results + else: + batch_ent_results = [] + batch_rel_results = [] + for entity_output, head_output, tail_output, offset_mapping, text in zip( + batch_outputs[0].numpy(), + batch_outputs[1].numpy(), + batch_outputs[2].numpy(), + offset_mappings, + texts, + ): + entity_output[:, [0, -1]] -= np.inf + entity_output[:, :, [0, -1]] -= np.inf + ents = set() + ent_list = [] + for l, start, end in zip(*np.where(entity_output > 0.0)): + ents.add((start, end)) + start, end = (offset_mapping[start][0], offset_mapping[end][-1]) + ent = {"text": text[start:end], "type": label_maps["id2entity"][l], "start_index": start} + ent_list.append(ent) + batch_ent_results.append(ent_list) + + rel_list = [] + for sh, st in ents: + for oh, ot in ents: + p1s = np.where(head_output[:, sh, oh] > 0.0)[0] + p2s = np.where(tail_output[:, st, ot] > 0.0)[0] + ps = set(p1s) & set(p2s) + for p in ps: + if task_type in ["relation_extraction", "event_extraction"]: + rel = { + "subject": text[offset_mapping[sh][0] : offset_mapping[st][1]], + "predicate": label_maps["id2relation"][p], + "object": text[offset_mapping[oh][0] : offset_mapping[ot][1]], + "subject_start_index": offset_mapping[sh][0], + "object_start_index": offset_mapping[oh][0], + } + else: + rel = { + "aspect": text[offset_mapping[sh][0] : offset_mapping[st][1]], + "sentiment": label_maps["id2relation"][p], + "opinion": text[offset_mapping[oh][0] : 
offset_mapping[ot][1]], + "aspect_start_index": offset_mapping[sh][0], + "opinion_start_index": offset_mapping[oh][0], + } + rel_list.append(rel) + batch_rel_results.append(rel_list) + return (batch_ent_results, batch_rel_results) + + +def build_tree(schema, name="root"): + """ + Build the schema tree. + """ + schema_tree = SchemaTree(name) + for s in schema: + if isinstance(s, str): + schema_tree.add_child(SchemaTree(s)) + elif isinstance(s, dict): + for k, v in s.items(): + if isinstance(v, str): + child = [v] + elif isinstance(v, list): + child = v + else: + raise TypeError( + "Invalid schema, value for each key:value pairs should be list or string" + "but {} received".format(type(v)) + ) + schema_tree.add_child(build_tree(child, name=k)) + else: + raise TypeError("Invalid schema, element should be string or dict, " "but {} received".format(type(s))) + return schema_tree + + +def schema2label_maps(task_type, schema=None): + if schema and isinstance(schema, dict): + schema = [schema] + + label_maps = {} + if task_type == "entity_extraction": + entity2id = {} + for s in schema: + entity2id[s] = len(entity2id) + + label_maps["entity2id"] = entity2id + elif task_type == "opinion_extraction": + schema = ["观点词", {"评价维度": ["观点词", "情感倾向[正向,负向]"]}] + logger.info("Opinion extraction does not support custom schema, the schema is default to %s." % schema) + label_maps["entity2id"] = {"评价维度": 0, "观点词": 1} + label_maps["sentiment2id"] = {"正向": 0, "负向": 1} + else: + entity2id = {} + relation2id = {} + schema_tree = build_tree(schema) + schema_list = schema_tree.children[:] + while len(schema_list) > 0: + node = schema_list.pop(0) + + if node.name not in entity2id.keys() and len(node.children) != 0: + entity2id[node.name] = len(entity2id) + + for child in node.children: + if child.name not in relation2id.keys(): + relation2id[child.name] = len(relation2id) + schema_list.append(child) + + entity2id["object"] = len(entity2id) + label_maps["entity2id"] = entity2id + label_maps["relation2id"] = relation2id + + label_maps["schema"] = schema + return label_maps + + +def anno2distill(json_lines, task_type, label_maps=None, platform="label_studio"): + if platform == "label_studio": + return label_studio2distill(json_lines, task_type, label_maps) + else: + return doccano2distill(json_lines, task_type, label_maps) + + +def label_studio2distill(json_lines, task_type, label_maps=None): + """Convert label-studio to distill format""" + if task_type == "opinion_extraction": + outputs = [] + for json_line in json_lines: + id2ent = {} + text = json_line["data"]["text"] + output = {"text": text} + entity_list = [] + aso_list = [] + annos = json_line["annotations"][0]["result"] + for anno in annos: + if anno["type"] == "labels": + ent_text = text[anno["value"]["start"] : anno["value"]["end"]] + ent_type_gather = anno["value"]["labels"][0].split("##") + if len(ent_type_gather) == 2: + ent_type, ent_senti = ent_type_gather + else: + ent_type = ent_type_gather[0] + ent_senti = None + ent = {"text": ent_text, "type": ent_type, "start_index": anno["value"]["start"]} + id2ent[anno["id"]] = ent + id2ent[anno["id"]]["sentiment"] = ent_senti + entity_list.append(ent) + else: + _aspect = id2ent[anno["from_id"]] + if _aspect["sentiment"]: + _opinion = id2ent[anno["to_id"]] + rel = { + "aspect": _aspect["text"], + "sentiment": _aspect["sentiment"], + "opinion": _opinion["text"], + "aspect_start_index": _aspect["start_index"], + "opinion_start_index": _opinion["start_index"], + } + aso_list.append(rel) + output["aso_list"] = aso_list 
+ output["entity_list"] = entity_list + output["aso_list"] = aso_list + outputs.append(output) + else: + outputs = [] + for json_line in json_lines: + id2ent = {} + text = json_line["data"]["text"] + output = {"text": text} + entity_list = [] + spo_list = [] + annos = json_line["annotations"][0]["result"] + for anno in annos: + if anno["type"] == "labels": + ent_text = text[anno["value"]["start"] : anno["value"]["end"]] + ent_label = anno["value"]["labels"][0] + ent_type = "object" if ent_label not in label_maps["entity2id"].keys() else ent_label + ent = {"text": ent_text, "type": ent_type, "start_index": anno["value"]["start"]} + id2ent[anno["id"]] = ent + entity_list.append(ent) + else: + _subject = id2ent[anno["from_id"]] + _object = id2ent[anno["to_id"]] + rel = { + "subject": _subject["text"], + "predicate": anno["labels"][0], + "object": _object["text"], + "subject_start_index": _subject["start_index"], + "object_start_index": _object["start_index"], + } + spo_list.append(rel) + output["entity_list"] = entity_list + output["spo_list"] = spo_list + outputs.append(output) + return outputs + + +def doccano2distill(json_lines, task_type, label_maps=None): + """Convert doccano to distill format""" + if task_type == "opinion_extraction": + outputs = [] + for json_line in json_lines: + id2ent = {} + text = json_line["text"] + output = {"text": text} + entity_list = [] + entities = json_line["entities"] + for entity in entities: + ent_text = text[entity["start_offset"] : entity["end_offset"]] + ent_type_gather = entity["label"].split("##") + if len(ent_type_gather) == 2: + ent_type, ent_senti = ent_type_gather + else: + ent_type = ent_type_gather[0] + ent_senti = None + ent = {"text": ent_text, "type": ent_type, "start_index": entity["start_offset"]} + id2ent[entity["id"]] = ent + id2ent[entity["id"]]["sentiment"] = ent_senti + entity_list.append(ent) + output["entity_list"] = entity_list + aso_list = [] + relations = json_line["relations"] + for relation in relations: + _aspect = id2ent[relation["from_id"]] + if _aspect["sentiment"]: + _opinion = id2ent[relation["to_id"]] + rel = { + "aspect": _aspect["text"], + "sentiment": _aspect["sentiment"], + "opinion": _opinion["text"], + "aspect_start_index": _aspect["start_index"], + "opinion_start_index": _opinion["start_index"], + } + aso_list.append(rel) + output["aso_list"] = aso_list + outputs.append(output) + else: + outputs = [] + for json_line in json_lines: + id2ent = {} + text = json_line["text"] + output = {"text": text} + entity_list = [] + entities = json_line["entities"] + for entity in entities: + ent_text = text[entity["start_offset"] : entity["end_offset"]] + if entity["label"] not in label_maps["entity2id"].keys(): + if task_type == "entity_extraction": + logger.warning( + "Found undefined label type. The setting of schema should contain all the label types in annotation file export from annotation platform." 
+ ) + continue + else: + ent_type = "object" + else: + ent_type = entity["label"] + ent = {"text": ent_text, "type": ent_type, "start_index": entity["start_offset"]} + id2ent[entity["id"]] = ent + entity_list.append(ent) + output["entity_list"] = entity_list + spo_list = [] + relations = json_line["relations"] + for relation in relations: + _subject = id2ent[relation["from_id"]] + _object = id2ent[relation["to_id"]] + rel = { + "subject": _subject["text"], + "predicate": relation["type"], + "object": _object["text"], + "subject_start_index": _subject["start_index"], + "object_start_index": _object["start_index"], + } + spo_list.append(rel) + output["spo_list"] = spo_list + outputs.append(output) + return outputs + + +def synthetic2distill(texts, infer_results, task_type, label_maps=None): + """Convert synthetic data to distill format""" + if task_type == "opinion_extraction": + outputs = [] + for i, line in enumerate(infer_results): + pred = line + output = {"text": texts[i]} + + entity_list = [] + aso_list = [] + for key1 in pred.keys(): + for s in pred[key1]: + ent = {"text": s["text"], "type": key1, "start_index": s["start"]} + entity_list.append(ent) + + if ( + "relations" in s.keys() + and "观点词" in s["relations"].keys() + and "情感倾向[正向,负向]" in s["relations"].keys() + ): + for o in s["relations"]["观点词"]: + rel = { + "aspect": s["text"], + "sentiment": s["relations"]["情感倾向[正向,负向]"][0]["text"], + "opinion": o["text"], + "aspect_start_index": s["start"], + "opinion_start_index": o["start"], + } + aso_list.append(rel) + + ent = {"text": o["text"], "type": "观点词", "start_index": o["start"]} + entity_list.append(ent) + output["entity_list"] = entity_list + output["aso_list"] = aso_list + outputs.append(output) + else: + outputs = [] + for i, line in enumerate(infer_results): + pred = line + output = {"text": texts[i]} + + entity_list = [] + spo_list = [] + for key1 in pred.keys(): + for s in pred[key1]: + ent = {"text": s["text"], "type": key1, "start_index": s["start"]} + entity_list.append(ent) + if "relations" in s.keys(): + for key2 in s["relations"].keys(): + for o1 in s["relations"][key2]: + if "start" in o1.keys(): + rel = { + "subject": s["text"], + "predicate": key2, + "object": o1["text"], + "subject_start_index": s["start"], + "object_start_index": o1["start"], + } + spo_list.append(rel) + + if "relations" not in o1.keys(): + ent = {"text": o1["text"], "type": "object", "start_index": o1["start"]} + entity_list.append(ent) + else: + ent = {"text": o1["text"], "type": key2, "start_index": o1["start"]} + entity_list.append(ent) + for key3 in o1["relations"].keys(): + for o2 in o1["relations"][key3]: + ent = { + "text": o2["text"], + "type": "object", + "start_index": o2["start"], + } + entity_list.append(ent) + + rel = { + "subject": o1["text"], + "predicate": key3, + "object": o2["text"], + "subject_start_index": o1["start"], + "object_start_index": o2["start"], + } + spo_list.append(rel) + output["entity_list"] = entity_list + output["spo_list"] = spo_list + outputs.append(output) + return outputs diff --git a/applications/information_extraction/text/deploy/simple_serving/README.md b/applications/information_extraction/text/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0624a674e9472083bf9180c488e3bfa127da2aef --- /dev/null +++ b/applications/information_extraction/text/deploy/simple_serving/README.md @@ -0,0 +1,58 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server服务启动](#Server服务启动) +- 
[Client请求启动](#Client请求启动) +- [服务化自定义参数](#服务化自定义参数) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本(或者最新的develop版本) + +```shell +pip install paddlenlp >= 2.4.4 +``` + + +## Server服务启动 + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + +## Client请求启动 + +```bash +python client.py +``` + +## 服务化自定义参数 + +### Server 自定义参数 +#### schema替换 +```python +# Default schema +schema = {"武器名称": ["产国", "类型", "研发单位"]} +``` + +#### 设置模型路径 +``` +# Default task_path +uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema) +``` + +#### 多卡服务化预测 +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,下面是示例代码 +``` +uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0) +uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1) +service.register_taskflow('uie', [uie1, uie2]) +``` + +### Client 自定义参数 + +```python +# Changed to input texts you wanted +texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆>艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。'] + +``` diff --git a/applications/information_extraction/text/deploy/simple_serving/README_en.md b/applications/information_extraction/text/deploy/simple_serving/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..4736bd34e5bf271454d0222f80ba3c51832b59c1 --- /dev/null +++ b/applications/information_extraction/text/deploy/simple_serving/README_en.md @@ -0,0 +1,64 @@ +# Service deployment based on PaddleNLP SimpleServing + +- [Environment Preparation](#1) +- [Server](#2) +- [Client](#3) +- [Service Custom Parameters](#4) + + + +## Environment Preparation +Use the PaddleNLP version with SimpleServing function (or the latest develop version) + +```shell +pip install paddlenlp >= 2.4.4 +``` + + + +## Server + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + + + +## Client + +```bash +python client.py +``` + + + +## Service Custom Parameters + +### Server Custom Parameters + +#### schema replacement +```python +# Default schema +schema = {"Weapon Name": ["Country of Production", "Type", "R&D Unit"]} +``` + +#### Set model path +``` +# Default task_path +uie = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema) +``` + +#### Doka Service Prediction +PaddleNLP SimpleServing supports multi-card load balancing prediction, mainly during service registration, just register two Taskflow tasks, the following is the sample code +``` +uie1 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=0) +uie2 = Taskflow('information_extraction', task_path='../../checkpoint/model_best/', schema=schema, device_id=1) +service. 
register_taskflow('uie', [uie1, uie2]) +``` + +### Client Custom Parameters + +```python +# Changed to input texts you wanted +texts = ['威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。'] +``` diff --git a/applications/information_extraction/text/deploy/simple_serving/client.py b/applications/information_extraction/text/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..cd2914e22b2b8b1c46db97facc09bdc6b5ac3957 --- /dev/null +++ b/applications/information_extraction/text/deploy/simple_serving/client.py @@ -0,0 +1,29 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import requests + +url = "http://0.0.0.0:8189/taskflow/uie" + +headers = {"Content-Type": "application/json"} +texts = [ + "威尔哥(Virgo)减速炸弹是由瑞典FFV军械公司专门为瑞典皇家空军的攻击机实施低空高速轰炸而研制,1956年开始研制,1963年进入服役,装备于A32“矛盾”、A35“龙”、和AJ134“雷”攻击机,主要用于攻击登陆>艇、停放的飞机、高炮、野战火炮、轻型防护装甲车辆以及有生力量。" +] + +data = {"data": {"text": texts}} +r = requests.post(url=url, headers=headers, data=json.dumps(data)) +datas = json.loads(r.text) +print(datas) diff --git a/applications/information_extraction/text/deploy/simple_serving/server.py b/applications/information_extraction/text/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..3bb193e6311a9546cae019262ae5286bc7fc6a5a --- /dev/null +++ b/applications/information_extraction/text/deploy/simple_serving/server.py @@ -0,0 +1,23 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer, Taskflow + +# The schema changed to your defined schema +schema = {"武器名称": ["产国", "类型", "研发单位"]} +# The task path changed to your best model path +uie = Taskflow("information_extraction", schema=schema, task_path="../../checkpoint/model_best/") +# If you want to define the finetuned uie service +app = SimpleServer() +app.register_taskflow("taskflow/uie", uie) diff --git a/applications/information_extraction/text/evaluate.py b/applications/information_extraction/text/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..4e9607583fc6d578f8bc0ce1ea409a579d56b576 --- /dev/null +++ b/applications/information_extraction/text/evaluate.py @@ -0,0 +1,142 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from utils import convert_example, create_data_loader, reader + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, UIEM, AutoTokenizer +from paddlenlp.utils.ie_utils import get_relation_type_dict, unify_prompt_name +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, multilingual=False): + """ + Given a dataset, it evals model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + multilingual(bool): Whether is the multilingual model. + """ + model.eval() + metric.reset() + for batch in data_loader: + if multilingual: + start_prob, end_prob = model(batch["input_ids"], batch["position_ids"]) + else: + start_prob, end_prob = model( + batch["input_ids"], batch["token_type_ids"], batch["position_ids"], batch["attention_mask"] + ) + + start_ids = paddle.cast(batch["start_positions"], "float32") + end_ids = paddle.cast(batch["end_positions"], "float32") + num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids) + metric.update(num_correct, num_infer, num_label) + precision, recall, f1 = metric.accumulate() + model.train() + return precision, recall, f1 + + +def do_eval(): + paddle.set_device(args.device) + + if args.model_path in ["uie-m-base", "uie-m-large"]: + args.multilingual = True + tokenizer = AutoTokenizer.from_pretrained(args.model_path) + if args.multilingual: + model = UIEM.from_pretrained(args.model_path) + else: + model = UIE.from_pretrained(args.model_path) + + test_ds = load_dataset(reader, data_path=args.test_path, max_seq_len=args.max_seq_len, lazy=False) + class_dict = {} + relation_data = [] + if args.debug: + for data in test_ds: + class_name = unify_prompt_name(data["prompt"]) + # Only positive examples are evaluated in debug mode + if len(data["result_list"]) != 0: + p = "的" if args.schema_lang == "ch" else " of " + if p not in data["prompt"]: + class_dict.setdefault(class_name, []).append(data) + else: + relation_data.append((data["prompt"], data)) + + relation_type_dict = get_relation_type_dict(relation_data, schema_lang=args.schema_lang) + else: + class_dict["all_classes"] = test_ds + + trans_fn = partial( + convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len, multilingual=args.multilingual + ) + + for key in class_dict.keys(): + if args.debug: + test_ds = MapDataset(class_dict[key]) + else: + test_ds = class_dict[key] + test_ds = test_ds.map(trans_fn) + + data_collator = DataCollatorWithPadding(tokenizer) + + test_data_loader = create_data_loader(test_ds, mode="test", batch_size=args.batch_size, trans_fn=data_collator) + + metric = 
SpanEvaluator() + precision, recall, f1 = evaluate(model, metric, test_data_loader, args.multilingual) + logger.info("-----------------------------") + logger.info("Class Name: %s" % key) + logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1)) + + if args.debug and len(relation_type_dict.keys()) != 0: + for key in relation_type_dict.keys(): + test_ds = MapDataset(relation_type_dict[key]) + test_ds = test_ds.map(trans_fn) + test_data_loader = create_data_loader( + test_ds, mode="test", batch_size=args.batch_size, trans_fn=data_collator + ) + + metric = SpanEvaluator() + precision, recall, f1 = evaluate(model, metric, test_data_loader) + logger.info("-----------------------------") + if args.schema_lang == "ch": + logger.info("Class Name: X的%s" % key) + else: + logger.info("Class Name: %s of X" % key) + logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument("--test_path", type=str, default=None, help="The path of test set.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--device", type=str, default="gpu", choices=["gpu", "cpu", "npu"], help="Device selected for evaluate.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--debug", action='store_true', help="Precision, recall and F1 score are calculated for each class separately if this option is enabled.") + parser.add_argument("--multilingual", action='store_true', help="Whether is the multilingual model.") + parser.add_argument("--schema_lang", choices=["ch", "en"], default="ch", help="Select the language type for schema.") + + args = parser.parse_args() + # yapf: enable + + do_eval() diff --git a/applications/information_extraction/text/finetune.py b/applications/information_extraction/text/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..342ec6284574be7c949e2c74b5b0437fbd89b7e5 --- /dev/null +++ b/applications/information_extraction/text/finetune.py @@ -0,0 +1,243 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import os +from dataclasses import dataclass, field +from functools import partial +from typing import List, Optional + +import paddle +from utils import convert_example, reader + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.trainer import ( + CompressionArguments, + PdArgumentParser, + Trainer, + get_last_checkpoint, +) +from paddlenlp.transformers import UIE, UIEM, AutoTokenizer, export_model +from paddlenlp.utils.ie_utils import compute_metrics, uie_loss_func +from paddlenlp.utils.log import logger + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + train_path: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + + dev_path: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + + max_seq_length: Optional[int] = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + dynamic_max_length: Optional[List[int]] = field( + default=None, + metadata={"help": "dynamic max length from batch, it can be array of length, eg: 16 32 64 128"}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: Optional[str] = field( + default="uie-base", + metadata={ + "help": "Path to pretrained model, such as 'uie-base', 'uie-tiny', " + "'uie-medium', 'uie-mini', 'uie-micro', 'uie-nano', 'uie-base-en', " + "'uie-m-base', 'uie-m-large', or finetuned model path." + }, + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + multilingual: bool = field(default=False, metadata={"help": "Whether the model is a multilingual model."}) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.label_names = ["start_positions", "end_positions"] + + if model_args.model_name_or_path in ["uie-m-base", "uie-m-large"]: + model_args.multilingual = True + elif os.path.exists(os.path.join(model_args.model_name_or_path, "model_config.json")): + with open(os.path.join(model_args.model_name_or_path, "model_config.json")) as f: + init_class = json.load(f)["init_class"] + if init_class == "UIEM": + model_args.multilingual = True + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. 
+ last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + if model_args.multilingual: + model = UIEM.from_pretrained(model_args.model_name_or_path) + else: + model = UIE.from_pretrained(model_args.model_name_or_path) + + train_ds = load_dataset(reader, data_path=data_args.train_path, max_seq_len=data_args.max_seq_length, lazy=False) + dev_ds = load_dataset(reader, data_path=data_args.dev_path, max_seq_len=data_args.max_seq_length, lazy=False) + + trans_fn = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=data_args.max_seq_length, + multilingual=model_args.multilingual, + dynamic_max_length=data_args.dynamic_max_length, + ) + + train_ds = train_ds.map(trans_fn) + dev_ds = dev_ds.map(trans_fn) + + if training_args.device == "npu": + data_collator = DataCollatorWithPadding(tokenizer, padding="longest") + else: + data_collator = DataCollatorWithPadding(tokenizer) + + trainer = Trainer( + model=model, + criterion=uie_loss_func, + args=training_args, + data_collator=data_collator, + train_dataset=train_ds if training_args.do_train or training_args.do_compress else None, + eval_dataset=dev_ds if training_args.do_eval or training_args.do_compress else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + trainer.optimizer = paddle.optimizer.AdamW( + learning_rate=training_args.learning_rate, parameters=model.parameters() + ) + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + if training_args.device == "npu": + # npu will transform int64 to int32 for internal calculation. + # To reduce useless transformation, we feed int32 inputs. 
+ input_spec_dtype = "int32" + else: + input_spec_dtype = "int64" + if model_args.multilingual: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="position_ids"), + ] + else: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="token_type_ids"), + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="position_ids"), + paddle.static.InputSpec(shape=[None, None], dtype=input_spec_dtype, name="attention_mask"), + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir) + if training_args.do_compress: + + @paddle.no_grad() + def custom_evaluate(self, model, data_loader): + metric = SpanEvaluator() + model.eval() + metric.reset() + for batch in data_loader: + if model_args.multilingual: + logits = model(input_ids=batch["input_ids"], position_ids=batch["position_ids"]) + else: + logits = model( + input_ids=batch["input_ids"], + token_type_ids=batch["token_type_ids"], + position_ids=batch["position_ids"], + attention_mask=batch["attention_mask"], + ) + start_prob, end_prob = logits + start_ids, end_ids = batch["start_positions"], batch["end_positions"] + num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids) + metric.update(num_correct, num_infer, num_label) + precision, recall, f1 = metric.accumulate() + logger.info("f1: %s, precision: %s, recall: %s" % (f1, precision, f1)) + model.train() + return f1 + + trainer.compress(custom_evaluate=custom_evaluate) + + +if __name__ == "__main__": + main() diff --git a/applications/information_extraction/text/utils.py b/applications/information_extraction/text/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..8e1a86f79c520e3951ccb8abbb32097390bde83c --- /dev/null +++ b/applications/information_extraction/text/utils.py @@ -0,0 +1,234 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import random +from typing import List, Optional + +import numpy as np +import paddle + +from paddlenlp.utils.log import logger + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def create_data_loader(dataset, mode="train", batch_size=1, trans_fn=None): + """ + Create dataloader. + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. 
+ Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True) + return dataloader + + +def map_offset(ori_offset, offset_mapping): + """ + map ori offset to token offset + """ + for index, span in enumerate(offset_mapping): + if span[0] <= ori_offset < span[1]: + return index + return -1 + + +def reader(data_path, max_seq_len=512): + """ + read json + """ + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + content = json_line["content"].strip() + prompt = json_line["prompt"] + # Model Input is aslike: [CLS] Prompt [SEP] Content [SEP] + # It include three summary tokens. + if max_seq_len <= len(prompt) + 3: + raise ValueError("The value of max_seq_len is too small, please set a larger value") + max_content_len = max_seq_len - len(prompt) - 3 + if len(content) <= max_content_len: + yield json_line + else: + result_list = json_line["result_list"] + json_lines = [] + accumulate = 0 + while True: + cur_result_list = [] + for result in result_list: + if result["end"] - result["start"] > max_content_len: + logger.warning( + "result['end'] - result ['start'] exceeds max_content_len, which will result in no valid instance being returned" + ) + if ( + result["start"] + 1 <= max_content_len < result["end"] + and result["end"] - result["start"] <= max_content_len + ): + max_content_len = result["start"] + break + + cur_content = content[:max_content_len] + res_content = content[max_content_len:] + + while True: + if len(result_list) == 0: + break + elif result_list[0]["end"] <= max_content_len: + if result_list[0]["end"] > 0: + cur_result = result_list.pop(0) + cur_result_list.append(cur_result) + else: + cur_result_list = [result for result in result_list] + break + else: + break + + json_line = {"content": cur_content, "result_list": cur_result_list, "prompt": prompt} + json_lines.append(json_line) + + for result in result_list: + if result["end"] <= 0: + break + result["start"] -= max_content_len + result["end"] -= max_content_len + accumulate += max_content_len + max_content_len = max_seq_len - len(prompt) - 3 + if len(res_content) == 0: + break + elif len(res_content) < max_content_len: + json_line = {"content": res_content, "result_list": result_list, "prompt": prompt} + json_lines.append(json_line) + break + else: + content = res_content + + for json_line in json_lines: + yield json_line + + +def get_dynamic_max_length(examples, default_max_length: int, dynamic_max_length: List[int]) -> int: + """get max_length by examples which you can change it by examples in batch""" + cur_length = len(examples[0]["input_ids"]) + max_length = default_max_length + for max_length_option in sorted(dynamic_max_length): + if cur_length <= max_length_option: + max_length = max_length_option + break + return max_length + + +def convert_example( + example, tokenizer, max_seq_len, multilingual=False, dynamic_max_length: Optional[List[int]] = None +): + """ + example: { + title + prompt + content + result_list + } + """ + if dynamic_max_length is not None: + temp_encoded_inputs = tokenizer( + text=[example["prompt"]], + 
text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + max_length = get_dynamic_max_length( + examples=temp_encoded_inputs, default_max_length=max_seq_len, dynamic_max_length=dynamic_max_length + ) + # always pad to max_length + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_length, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + start_ids = [0.0 for x in range(max_length)] + end_ids = [0.0 for x in range(max_length)] + else: + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + start_ids = [0.0 for x in range(max_seq_len)] + end_ids = [0.0 for x in range(max_seq_len)] + + encoded_inputs = encoded_inputs[0] + offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]] + bias = 0 + for index in range(1, len(offset_mapping)): + mapping = offset_mapping[index] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + bias = offset_mapping[index - 1][1] + 1 # Includes [SEP] token + if mapping[0] == 0 and mapping[1] == 0: + continue + offset_mapping[index][0] += bias + offset_mapping[index][1] += bias + for item in example["result_list"]: + start = map_offset(item["start"] + bias, offset_mapping) + end = map_offset(item["end"] - 1 + bias, offset_mapping) + start_ids[start] = 1.0 + end_ids[end] = 1.0 + if multilingual: + tokenized_output = { + "input_ids": encoded_inputs["input_ids"], + "position_ids": encoded_inputs["position_ids"], + "start_positions": start_ids, + "end_positions": end_ids, + } + else: + tokenized_output = { + "input_ids": encoded_inputs["input_ids"], + "token_type_ids": encoded_inputs["token_type_ids"], + "position_ids": encoded_inputs["position_ids"], + "attention_mask": encoded_inputs["attention_mask"], + "start_positions": start_ids, + "end_positions": end_ids, + } + return tokenized_output diff --git a/applications/neural_search/README.md b/applications/neural_search/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8ef1ba05960912e59a44fd6db97b2553a379ef6d --- /dev/null +++ b/applications/neural_search/README.md @@ -0,0 +1,501 @@ +# 手把手搭建一个语义检索系统 + +## 1. 场景概述 + +检索系统存在于我们日常使用的很多产品中,比如商品搜索系统、学术文献检索系等等,本方案提供了检索系统完整实现。限定场景是用户通过输入检索词 Query,快速在海量数据中查找相似文档。 +
+ +
+
+所谓语义检索(也称基于向量的检索,如上图所示),是指检索系统不再拘泥于用户 Query 字面本身,而是能精准捕捉到用户 Query 背后的真正意图并以此来搜索,从而更准确地向用户返回最符合需求的结果。通过使用最先进的语义索引模型找到文本的向量表示,在高维向量空间中对它们进行索引,并度量查询向量与索引文档向量的相似程度,从而解决关键词索引带来的缺陷。
+
+例如下面两组文本 Pair,如果基于关键词去计算相似度,两组的相似度是相同的;而从实际语义上看,第一组的相似度高于第二组。
+
+```
+车头如何放置车牌 前牌照怎么装
+车头如何放置车牌 后牌照怎么装
+```
+
+语义检索系统的关键就在于,采用语义而非关键词的方式进行召回,达到更精准、更广泛地召回相似结果的目的。想快速体验搜索的效果,请参考[Pipelines的语义检索实现](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/semantic-search)。
+
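+想直观对比语义相似度与关键词匹配的差别,也可以先用 PaddleNLP 的 `Taskflow` 做一个小实验。以下只是一个极简示意(需要较新版本的 PaddleNLP),加载的是通用文本相似度模型,并非本方案训练得到的召回/排序模型,具体分数仅供参考:
+
+```python
+from paddlenlp import Taskflow
+
+# 通用文本相似度任务,仅用于直观体验,不代表本方案的最终效果
+similarity = Taskflow("text_similarity")
+results = similarity([
+    ["车头如何放置车牌", "前牌照怎么装"],
+    ["车头如何放置车牌", "后牌照怎么装"],
+])
+for r in results:
+    print(r["text1"], r["text2"], r["similarity"])
+# 预期第一组的相似度得分高于第二组,而纯关键词匹配无法区分两者
+```
+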
+ +
+ + +## 2. 产品功能介绍 + +通常检索业务的数据都比较庞大,都会分为召回(索引)、排序两个环节。召回阶段主要是从至少千万级别的候选集合里面,筛选出相关的文档,这样候选集合的数目就会大大降低,在之后的排序阶段就可以使用一些复杂的模型做精细化或者个性化的排序。一般采用多路召回策略(例如关键词召回、热点召回、语义召回结合等),多路召回结果聚合后,经过统一的打分以后选出最优的 TopK 的结果。 + +### 2.1 系统特色 + ++ 低门槛 + + 手把手搭建起检索系统 + + 无需标注数据也能构建检索系统 + + 提供 训练、预测、ANN 引擎一站式能力 + + Pipelines 快速实现语义检索系统 + ++ 效果好 + + 针对多种数据场景的专业方案 + + 仅有无监督数据: SimCSE + + 仅有有监督数据: InBatchNegative + + 兼具无监督数据 和 有监督数据:融合模型 + + 进一步优化方案: 面向领域的预训练 Domain-adaptive Pretraining ++ 性能快 + + Paddle Inference 快速抽取向量 + + Milvus 快速查询和高性能建库 + + Paddle Serving服务化部署 + +### 2.2 功能架构 + +索引环节有两类方法:基于字面的关键词索引;语义索引。语义索引能够较好地表征语义信息,解决字面不相似但语义相似的情形。本系统给出的是语义索引方案,实际业务中可融合其他方案使用。下面就详细介绍整个方案的架构和功能。 + +#### 2.2.1 整体介绍 + + +
+ +
+ +以上是nerual_search的系统流程图,其中左侧为召回环节,核心是语义向量抽取模块;右侧是排序环节,核心是排序模型。召回环节需要用户通过自己的语料构建向量索引库,用户发起query了之后,就可以检索出相似度最高的向量,然后找出该向量对应的文本;排序环节主要是对召回的文本进行重新排序。下面我们分别介绍召回中的语义向量抽取模块,以及排序模型。 + + +#### 2.2.2 召回模块 + +召回模块需要从千万量级数据中快速召回候选数据。首先需要抽取语料库中文本的 Embedding,然后借助向量搜索引擎实现高效 ANN,从而实现候选集召回。 + +我们针对不同的数据情况推出三种语义索引方案,如下图所示,您可以参照此方案,快速建立语义索引: + +| ⭐️ 无监督数据 | ⭐️ 有监督数据 | **召回方案** | +| ------------ | ------------ | ------------ | +| 多 | 无 | SimCSE | +| 无 | 多 | In-batch Negatives| +| 有 | 有 | SimCSE+ In-batch Negatives | + +最基本的情况是只有无监督数据,我们推荐您使用 SimCSE 进行无监督训练;另一种方案是只有有监督数据,我们推荐您使用 In-batch Negatives 的方法进行有监督训练。 + +如果想进一步提升模型效果:还可以使用大规模业务数据,对预训练模型进行 Domain-adaptive Pretraining,训练完以后得到预训练模型,再进行无监督的 SimCSE。 + +此外,如果您同时拥有监督数据和无监督数据,我们推荐将两种方案结合使用,这样能训练出更加强大的语义索引模型。 + +#### 2.2.3 排序模块 + +召回模型负责从海量(千万级)候选文本中快速(毫秒级)筛选出与 Query 相关性较高的 TopK Doc,排序模型会在召回模型筛选出的 TopK Doc 结果基础之上针对每一个 (Query, Doc) Pair 对进行两两匹配计算相关性,排序效果更精准。 + +排序模块有2种选择,第一种基于前沿的预训练模型 ERNIE,训练 Pair-wise 语义匹配模型;第二种是基于RocketQA模型训练的Cross Encoder模形。第一种是Pair-wise的排序算法,基本思路是对样本构建偏序文档对,两两比较,从比较中学习顺序,第二种是Poinet-Wise的算法,只考虑当前Query和每个文档的绝对相关度,并没有考虑其他文档与Query的相关度,但是建模方式比较简单。第一种Pair-wise模型可以说是第二种point-wise模型的改进版本,但对于噪声数据更为敏感,即一个错误的标注会导致多个pair对的错误,用户可以先使用基于Point-wise的Cross Encoder构建一个基础模型,需要进一步优化可以使用Pair-wise的方法优化。 + +## 3. 文献检索实践 + +### 3.1 技术方案和评估指标 + +#### 3.1.1 技术方案 + +**语义索引**:由于我们既有无监督数据,又有有监督数据,所以结合 SimCSE 和 In-batch Negatives 方案,并采取 Domain-adaptive Pretraining 优化模型效果。 + +首先是利用 ERNIE模型进行 Domain-adaptive Pretraining,在得到的预训练模型基础上,进行无监督的 SimCSE 训练,最后利用 In-batch Negatives 方法进行微调,得到最终的语义索引模型,把建库的文本放入模型中抽取特征向量,然后把抽取后的向量放到语义索引引擎 milvus 中,利用 milvus 就可以很方便得实现召回了。 + +**排序**:使用 ERNIE-Gram 的单塔结构/RocketQA的Cross Encoder对召回后的数据精排序。 + +#### 3.1.2 评估指标 + +**模型效果指标** +* 在语义索引召回阶段使用的指标是 Recall@K,表示的是预测的前topK(从最后的按得分排序的召回列表中返回前K个结果)结果和语料库中真实的前 K 个相关结果的重叠率,衡量的是检索系统的查全率。 + +* 在排序阶段使用的指标为AUC,AUC反映的是分类器对样本的排序能力,如果完全随机得对样本分类,那么AUC应该接近0.5。分类器越可能把真正的正样本排在前面,AUC越大,分类性能越好。 + +**性能指标** +* 基于 Paddle Inference 快速抽取向量 + +* 建库性能和 ANN 查询性能快 + +### 3.2 预置数据说明 + +数据集来源于某文献检索系统,既有大量无监督数据,又有有监督数据。 + +(1)采用文献的 query, title,keywords,abstract 四个字段内容,构建无标签数据集进行 Domain-adaptive Pretraining; + +(2)采用文献的 query,title,keywords 三个字段内容,构造无标签数据集,进行无监督召回训练SimCSE; + +(3)使用文献的query, title, keywords,构造带正标签的数据集,不包含负标签样本,基于 In-batch Negatives 策略进行训练; + +(4)在排序阶段,使用点击(作为正样本)和展现未点击(作为负样本)数据构造排序阶段的训练集,进行精排训练。 + +| 阶段 |模型 | 训练集 | 评估集(用于评估模型效果) | 召回库 |测试集 | +| ------------ | ------------ |------------ | ------------ | ------------ | ------------ | +| 召回 | Domain-adaptive Pretraining | 2kw | - | - | - | +| 召回 | 无监督预训练 - SimCSE | 798w | 20000 | 300000| 1000 | +| 召回 | 有监督训练 - In-batch Negatives | 3998 | 20000 |300000 | 1000 | +| 排序 | 有监督训练 - ERNIE-Gram单塔 Pairwise/RocketQA Cross Encoder| 1973538 | 57811 | - | 1000 | + +我们将除 Domain-adaptive Pretraining 之外的其他数据集全部开源,下载地址: + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) +- [literature_search_rank](https://paddlenlp.bj.bcebos.com/applications/literature_search_rank.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. 
# 构建召回库的数据(模拟实际业务线上的语料库,实际语料库远大于这里的规模),用于直观演示相关文献召回效果 +├── recall # 召回阶段数据集 + ├── train_unsupervised.csv # 无监督训练集,用于训练 SimCSE + ├── train.csv # 有监督训练集,用于训练 In-batch Negative + ├── dev.csv # 召回阶段验证集,用于评估召回模型的效果,SimCSE 和 In-batch Negative 共用 + ├── corpus.csv # 构建召回库的数据(模拟实际业务线上的语料库,实际语料库远大于这里的规模),用于评估召回阶段模型效果,SimCSE 和 In-batch Negative 共用 + ├── test.csv # 召回阶段测试数据,预测文本之间的相似度,SimCSE 和 In-batch Negative 共用 +├── data # RocketQA排序数据集 + ├── test.csv # 测试集 + ├── dev_pairwise.csv # 验证集 + └── train.csv # 训练集 +├── sort # 排序阶段数据集 + ├── train_pairwise.csv # 排序训练集 + ├── dev_pairwise.csv # 排序验证集 + └── test_pairwise.csv # 排序测试集 +``` + + +### 3.3 数据格式 + +1. 对于无监督SimCSE的训练方法,格式参考`train_unsupervised.csv`,即一行条文本即可,无需任何标注。对于召回模型训练需要规定格式的本地数据集,需要准备训练集文件`train.csv`,验证集`dev.csv`,召回集文件`corpus.csv`。 + + +训练数据集`train.csv`的格式如下: + +``` +query1 \t 用户点击的title1 +query2 \t 用户点击的title2 +``` +训练集合`train.csv`的文件样例: +``` +从《唐律疏义》看唐代封爵贵族的法律特权 从《唐律疏义》看唐代封爵贵族的法律特权《唐律疏义》,封爵贵族,法律特权 +宁夏社区图书馆服务体系布局现状分析 宁夏社区图书馆服务体系布局现状分析社区图书馆,社区图书馆服务,社区图书馆服务体系 +人口老龄化对京津冀经济 京津冀人口老龄化对区域经济增长的影响京津冀,人口老龄化,区域经济增长,固定效应模型 +英语广告中的模糊语 模糊语在英语广告中的应用及其功能模糊语,英语广告,表现形式,语用功能 +甘氨酸二肽的合成 甘氨酸二肽合成中缩合剂的选择甘氨酸,缩合剂,二肽 +...... +``` + +验证集`dev.csv`的格式如下: + +``` +query1 \t 用户点击的title1 +query2 \t 用户点击的title2 +``` + +验证集合`train.csv`的文件样例: +``` +试论我国海岸带经济开发的问题与前景 试论我国海岸带经济开发的问题与前景海岸带,经济开发,问题,前景 +外语阅读焦虑与英语成绩及性别的关系 外语阅读焦虑与英语成绩及性别的关系外语阅读焦虑,外语课堂焦虑,英语成绩,性别 +加油站风险分级管控 加油站工作危害风险分级研究加油站,工作危害分析(JHA),风险分级管控 +``` +召回集合`corpus.csv`主要作用是检验测试集合的句子对能否被正确召回,它的构造主要是提取验证集的第二列的句子,然后加入很多无关的句子,用来检验模型能够正确的从这些文本中找出测试集合对应的第二列的句子,格式如下: + +``` +2002-2017年我国法定传染病发病率和死亡率时间变化趋势传染病,发病率,死亡率,病死率 +陕西省贫困地区城乡青春期少女生长发育调查青春期,生长发育,贫困地区 +五丈岩水库溢洪道加固工程中的新材料应用碳纤维布,粘钢加固技术,超细水泥,灌浆技术 +...... +``` + +2. 对于排序模型的训练,排序模型目前提供了2种,第一种是Pairwise训练的方式,第二种是RocketQA的排序模型,对于第一种排序模型,需要准备训练集`train_pairwise.csv`,验证集`dev_pairwise.csv`两个文件,除此之外还可以准备测试集文件`test.csv`或者`test_pairwise.csv`。 + +训练数据集`train_pairwise.csv`的格式如下: + +``` +query1 \t 用户点击的title1 \t 用户未点击的title2 +query2 \t 用户点击的title3 \t 用户未点击的title4 +``` + +训练数据集`train_pairwise.csv`的示例如下: + +``` +英语委婉语引起的跨文化交际障碍 英语委婉语引起的跨文化交际障碍及其翻译策略研究英语委婉语,跨文化交际障碍,翻译策略 委婉语在英语和汉语中的文化差异委婉语,文化,跨文化交际 +范迪慧 嘉兴市中医院 滋阴疏肝汤联合八穴隔姜灸治疗肾虚肝郁型卵巢功能低下的临床疗效滋阴疏肝汤,八穴隔姜灸,肾虚肝郁型卵巢功能低下,性脉甾类激素,妊娠 温针灸、中药薰蒸在半月板损伤术后康复中的疗效分析膝损伤,半月板,胫骨,中医康复,温针疗法,薰洗 +...... +``` + +验证数据集`dev_pairwise.csv`的格式如下: + +``` +query1 \t title1 \t label +query2 \t title2 \t label +``` +验证数据集`dev_pairwise.csv`的示例如下: + +``` +作者单位:南州中学 浅谈初中教学管理如何体现人文关怀初中教育,教学管理,人文关怀 1 +作者单位:南州中学 高中美术课堂教学中藏区本土民间艺术的融入路径藏区,传统民间艺术,美术课堂 0 +作者单位:南州中学 列宁关于资产阶级民主革命向 社会主义革命过渡的理论列宁,直接过渡,间接过渡,资产阶级民主革命,社会主义革命 0 +DAA髋关节置换 DAA前侧入路和后外侧入路髋关节置换疗效对比髋关节置换术;直接前侧入路;后外侧入路;髋关节功能;疼痛;并发症 1 +DAA髋关节置换 DAA全髋关节置换术治疗髋关节病变对患者髋关节运动功能的影响直接前侧入路全髋关节置换术,髋关节病变,髋关节运动功能 0 +DAA髋关节置换 护患沟通技巧在急诊输液护理中的应用分析急诊科,输液护理,护理沟通技巧,应用 0 +....... +``` +训练数据集`test_pairwise.csv`的格式如下,其中这个score得分是召回算出来的相似度或者距离,仅供参考,可以忽略: + +``` +query1 \t title1 \t score +query2 \t title2 \t score +``` +训练数据集`test_pairwise.csv`的示例如下: + +``` +中西方语言与文化的差异 中西方文化差异以及语言体现中西方文化,差异,语言体现 0.43203747272491455 +中西方语言与文化的差异 论中西方文化差异在非言语交际中的体现中西方文化,差异,非言语交际 0.4644506871700287 +中西方语言与文化的差异 中西方体态语文化差异跨文化,体态语,非语言交际,差异 0.4917311668395996 +中西方语言与文化的差异 由此便可以发现两种语言以及两种文化的差异。 0.5039259195327759 +....... 
+``` + +对于第二种基于RocketQA的排序模型。 + +训练数据集`train.csv`,验证集`dev_pairwise.csv`的格式如下: + +``` +query1 \t title1 \t label +query2 \t title2 \t label +``` +训练数据集`train.csv`,验证集`dev_pairwise.csv`的示例如下: + +``` +(小学数学教材比较) 关键词:新加坡 新加坡与中国数学教材的特色比较数学教材,教材比较,问题解决 0 +徐慧新疆肿瘤医院 头颈部非霍奇金淋巴瘤扩散加权成像ADC值与Ki-67表达相关性分析淋巴瘤,非霍奇金,头颈部肿瘤,磁共振成像 1 +抗生素关性腹泻 鼠李糖乳杆菌GG防治消化系统疾病的研究进展鼠李糖乳杆菌,腹泻,功能性胃肠病,肝脏疾病,幽门螺杆菌 0 +德州市图书馆 图书馆智慧化建设与融合创新服务研究图书馆;智慧化;阅读服务;融合创新 1 +维生素c 综述 维生素C防治2型糖尿病研究进展维生素C;2型糖尿病;氧化应激;自由基;抗氧化剂 0 +....... +``` + +训练数据集`test.csv`的格式如下,其中这个score得分是召回算出来的相似度或者距离,仅供参考,可以忽略: + +``` +query1 \t title1 \t score +query2 \t title2 \t score +``` +训练数据集`test.csv`的示例如下: + +``` +加强科研项目管理有效促进医学科研工作 科研项目管理策略科研项目,项目管理,实施,必要性,策略 0.32163668 +加强科研项目管理有效促进医学科研工作 关于推进我院科研发展进程的相关问题研究医院科研,主体,环境,信息化 0.32922596 +加强科研项目管理有效促进医学科研工作 深圳科技计划对高校科研项目资助现状分析与思考基础研究,高校,科技计划,科技创新 0.36869502 +加强科研项目管理有效促进医学科研工作 普通高校科研管理模式的优化与创新普通高校,科研,科研管理 0.3688045 +....... +``` + + +### 3.4 运行环境和安装说明 + + +(1)运行环境 + +本实验采用了以下的运行环境进行,详细说明如下,用户也可以在自己 GPU 硬件环境进行: + +a. 软件环境: + + +- python >= 3.6 +- paddlenlp >= 2.2.1 +- paddlepaddle-gpu >=2.2 +- CUDA Version: 10.2 +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) + + +b. 硬件环境: + + +- NVIDIA Tesla V100 16GB x4卡 +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +c. 依赖安装: + +``` +pip install -r requirements.txt +``` + +## 4. Neural Search 快速体验实践 + +PaddleNLP已经基于ERNIE 1.0训练了一个基线模型,如果想快速搭建Neural Search的完整系统,有两种方法,第一种是请参考下面的实现,包含了服务化的完整流程,另一种是使用Pipelines加载,Pipelines已经支持Neural Search训练的模型的载入,可以使用Pipelines的快速的基于Neural Search模型实现检索系统,详情请参考文档[Pipelines-Neural-Search](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/pipelines/examples/semantic-search/Neural_Search.md)。 + +### 4.1. 召回 + +- 召回向量抽取服务的搭建请参考:[In-batch Negatives](./recall/in_batch_negative/), 只需要下载基于ERNIE 1.0的预训练模型,导出成Paddle Serving的格式,然后启动Pipeline Server服务即可 + +- 召回向量检索服务的搭建请参考:[Milvus](./recall/milvus/), 需要搭建Milvus并且插入检索数据的向量 + +【注意】如果使用Neural Search训练好的模型,由于该模型是基于ERNIE 1.0训练的,所以需要把 `model_name_or_path`指定为`ernie 1.0`,向量抽取结果才能正常。 + + +### 4.2. 排序 + +排序服务的搭建请参考 [ernie_matching](./ranking/ernie_matching/),只需要下载基于ERNIE Gram的预训练模型,导出成Paddle Serving的格式,最后需要启动 Pipeline Serving服务 + +【注意】如果使用Neural Search训练好的模型,由于该模型是基于ERNIE Gram训练的,所以需要把 `model_name_or_path`指定为`ernie-gram-zh`,向量抽取结果才能正常。 + +### 4.3. 系统运行 + +以上召回和排序模型都经过Paddle Serving服务化以后,就可以直接使用下面的命令运行体验: + +``` +python3 run_system.py +``` +输出的结果为: + +``` +PipelineClient::predict pack_data time:1656991375.5521955 +PipelineClient::predict before time:1656991375.5529568 +Extract feature time to cost :0.0161135196685791 seconds +Search milvus time cost is 0.8139839172363281 seconds +PipelineClient::predict pack_data time:1656991376.3981335 +PipelineClient::predict before time:1656991376.3983877 +time to cost :0.05616641044616699 seconds +``` +会输出2个文件 `recall_result.csv` 是召回检索的结果,`rank_result.csv` 是排序的结果。csv的示例输出下。 + +召回的结果: + +``` +中西方语言与文化的差异,港台文化对内地中小学生的负面影响,0.055068351328372955 +中西方语言与文化的差异,外来文化在越南的传播与融合,0.05621318891644478 +中西方语言与文化的差异,临终关怀中的“仪式”,0.05705389380455017 +中西方语言与文化的差异,历史的真实与艺术加工,0.05745899677276611 +...... +``` + +排序的结果: + +``` +中西方语言与文化的差异,论中西方教育差异,0.870943009853363 +中西方语言与文化的差异,浅析中西方问候语的差异,0.8468159437179565 +中西方语言与文化的差异,文化认同及其根源,0.8288694620132446 +中西方语言与文化的差异,从历史文化角度分析中西方学校教育的差异,0.8209370970726013 +中西方语言与文化的差异,中西医思维方式的差异,0.8150948882102966 +中西方语言与文化的差异,浅析中韩餐桌文化差异,0.7751647233963013 +...... +``` + + + +## 5. 
从头开始搭建自己的检索系统 + +这里展示了能够从头至尾跑通的完整代码,您使用自己的业务数据,照着跑,能搭建出一个给定 Query,返回 topK 相关文档的小型检索系统。您可以参照我们给出的效果和性能数据来检查自己的运行过程是否正确。 + +### 5.1 召回阶段 + +**召回模型训练** + +我们进行了多组实践,用来对比说明召回阶段各方案的效果: + +| 模型 | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 |策略简要说明| +| ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- | +| 有监督训练 Baseline | 30.077| 43.513| 48.633 | 53.448 |59.632| 标准 pair-wise 训练范式,通过随机采样产生负样本| +| 有监督训练 In-batch Negatives | 51.301 | 65.309| 69.878| 73.996|78.881| In-batch Negatives 有监督训练| +| 无监督训练 SimCSE | 42.374 | 57.505| 62.641| 67.09|72.331| SimCSE 无监督训练| +| 无监督 + 有监督训练 SimCSE + In-batch Negatives | 55.976 | 71.849| 76.363| 80.49|84.809| SimCSE无监督训练,In-batch Negatives 有监督训练| +| Domain-adaptive Pretraining + SimCSE | 51.031 | 66.648| 71.338 | 75.676 |80.144| ERNIE 预训练,SimCSE 无监督训练| +| Domain-adaptive Pretraining + SimCSE + In-batch Negatives| **58.248** | **75.099**| **79.813**| **83.801**|**87.733**| ERNIE 预训练,SimCSE 无监督训训练,In-batch Negatives 有监督训练| + +从上述表格可以看出,首先利用 ERNIE 3.0 做 Domain-adaptive Pretraining ,然后把训练好的模型加载到 SimCSE 上进行无监督训练,最后利用 In-batch Negatives 在有监督数据上进行训练能够获得最佳的性能。[模型下载](https://paddlenlp.bj.bcebos.com/models/inbatch_model_best.zip),模型的使用方式参考[In-batch Negatives](./recall/in_batch_negative/) 。 + + +这里采用 Domain-adaptive Pretraining + SimCSE + In-batch Negatives 方案: + +第一步:无监督训练 Domain-adaptive Pretraining + +训练用时 16hour55min,可参考:[ERNIE 1.0](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-1.0) + +第二步:无监督训练 SimCSE + +训练用时 16hour53min,可参考:[SimCSE](./recall/simcse/) + +第三步:有监督训练 + +几分钟内训练完成,可参考 [In-batch Negatives](./recall/in_batch_negative/) + + +**召回系统搭建** + +召回系统使用索引引擎 Milvus,可参考 [milvus_system](./recall/milvus/)。 +我们展示一下系统的效果,输入的文本如下: + +``` +中西方语言与文化的差异 + +``` +下面是召回的部分结果,第一个是召回的title,第二个数字是计算的相似度距离 + +``` +跨文化中的文化习俗对翻译的影响翻译,跨文化,文化习俗 0.615584135055542 +试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比 0.6155391931533813 +中英文化差异及习语翻译习语,文化差异,翻译 0.6153547763824463 +英语中的中国文化元素英语,中国文化,语言交流 0.6151996850967407 +跨文化交际中的文化误读研究文化误读,影响,中华文化,西方文明 0.6137217283248901 +在语言学习中了解中法文化差异文化差异,对话交际,语言 0.6134252548217773 +从翻译视角看文化差异影响下的中式英语的应对策略文化差异;中式英语现;汉英翻译;动态对等理论 0.6127341389656067 +归化与异化在跨文化传播中的动态平衡归化,异化,翻译策略,跨文化传播,文化外译 0.6127211451530457 +浅谈中西言语交际行为中的文化差异交际用语,文化差异,中国,西方 0.6125463843345642 +翻译中的文化因素--异化与归化文化翻译,文化因素,异化与归化 0.6111845970153809 +历史与文化差异对翻译影响的分析研究历史与文化差异,法汉翻译,翻译方法 0.6107486486434937 +从中、韩、美看跨文化交际中的东西方文化差异跨文化交际,东西方,文化差异 0.6091923713684082 +试论文化差异对翻译工作的影响文化差异,翻译工作,影响 0.6084284782409668 +从归化与异化看翻译中的文化冲突现象翻译,文化冲突,归化与异化,跨文化交际 0.6063553690910339 +中西方问候语的文化差异问候语,文化差异,文化背景 0.6054259538650513 +中英思维方式的差异对翻译的影响中英文化的差异,中英思维方式的差异,翻译 0.6026732921600342 +略论中西方语言文字的特性与差异语言,会意,确意,特性,差异 0.6009351015090942 +...... 
+ +``` + + +### 5.2 排序阶段 + +排序阶段有2种方案,第一种是[ernie_matching](./ranking/ernie_matching/)使用的模型是 ERNIE-3.0-Medium-zh,用时 20h;第二种是基于RocketQA的排序模型[cross_encoder](./ranking/cross_encoder/),训练用时也是20h左右。 + + +排序阶段的效果评估: + +| 模型 | AUC | +| ------------ | ------------ | +| Baseline: In-batch Negatives | 0.582 | +| pairwise ERNIE-Gram |0.801 | +| CrossEncoder:rocketqa-base-cross-encoder |**0.835** | + + +同样输入文本: + +``` +中西方语言与文化的差异 +``` +排序阶段的结果展示如下,第一个是 Title ,第二个数字是计算的概率,显然经排序阶段筛选的文档与 Query 更相关: + +``` +中西方文化差异以及语言体现中西方文化,差异,语言体现 0.999848484992981 +论中西方语言与文化差异的历史渊源中西方语言,中西方文化,差异,历史渊源 0.9998375177383423 +从日常生活比较中西方语言与文化的差异中西方,语言,文化,比较 0.9985846281051636 +试论中西方语言文化教育的差异比较与融合中西方,语言文化教育,差异 0.9972485899925232 +中西方文化差异对英语学习的影响中西方文化,差异,英语,学习 0.9831035137176514 +跨文化视域下的中西文化差异研究跨文化,中西,文化差异 0.9781349897384644 +中西方文化差异对跨文化交际的影响分析文化差异,跨文化交际,影响 0.9735479354858398 +探析跨文化交际中的中西方语言差异跨文化交际,中西方,语言差异 0.9668175578117371 +中西方文化差异解读中英文差异表达中西文化,差异表达,跨文化交际 0.9629314541816711 +中西方文化差异对英语翻译的影响中西方文化差异,英语翻译,翻译策略,影响 0.9538986086845398 +论跨文化交际中的中西方文化冲突跨文化交际,冲突,文化差异,交际策略,全球化 0.9493677616119385 +中西方文化差异对英汉翻译的影响中西方文化,文化差异,英汉翻译,影响 0.9430705904960632 +中西方文化差异与翻译中西方,文化差异,翻译影响,策略方法,译者素质 0.9401137828826904 +外语教学中的中西文化差异外语教学,文化,差异 0.9397934675216675 +浅析西语国家和中国的文化差异-以西班牙为例跨文化交际,西语国家,文化差异 0.9373322129249573 +中英文化差异在语言应用中的体现中英文化,汉语言,语言应用,语言差异 0.9359155297279358 +.... +``` + + +## Reference + +[1] Tianyu Gao, Xingcheng Yao, Danqi Chen: [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821). EMNLP (1) 2021: 6894-6910 + +[2] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906). Preprint 2020. + +[3] Dongling Xiao, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang: +[ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding](https://arxiv.org/abs/2010.12148). NAACL-HLT 2021: 1702-1715 + +[4] Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu: +[ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223). CoRR abs/1904.09223 (2019) diff --git a/applications/neural_search/img/attu.png b/applications/neural_search/img/attu.png new file mode 100644 index 0000000000000000000000000000000000000000..9d152a6efba719f540d7a7ac343c43ca0b8eb4fc Binary files /dev/null and b/applications/neural_search/img/attu.png differ diff --git a/applications/neural_search/img/system_pipeline.png b/applications/neural_search/img/system_pipeline.png new file mode 100644 index 0000000000000000000000000000000000000000..b7ef7972df94c4d6daa9b64e02c0a894b5efeb19 Binary files /dev/null and b/applications/neural_search/img/system_pipeline.png differ diff --git a/applications/neural_search/ranking/cross_encoder/README.md b/applications/neural_search/ranking/cross_encoder/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b5dd873b4313c2f06b7bf5e6ee79f74f23bf5fc1 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/README.md @@ -0,0 +1,399 @@ + + **目录** + +* [背景介绍](#背景介绍) +* [CrossEncoder](#CrossEncoder) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +基于RocketQA的CrossEncoder训练的单塔模型,该模型用于搜索的排序阶段,对召回的结果进行重新排序的作用。 + + + + +# CrossEncoder + + + +## 1. 
技术方案和评估指标 + +### 技术方案 + +加载基于ERNIE 3.0训练过的RocketQA单塔CrossEncoder模型。 + + +### 评估指标 + +(1)采用 AUC 指标来评估排序模型的排序效果。 + +**效果评估** + +| 训练方式 | 模型 | AUC | +| ------------ | ------------ |------------ | +| pairwise| ERNIE-Gram |0.801 | +| CrossEncoder | rocketqa-base-cross-encoder |**0.835** | + + + +## 2. 环境依赖和安装说明 + +**环境依赖** + +* python >= 3.7 +* paddlepaddle >= 2.3.7 +* paddlenlp >= 2.3 +* pandas >= 0.25.1 +* scipy >= 1.3.1 + + + +## 3. 代码结构 + +以下是本项目主要代码结构及说明: + +``` +ernie_matching/ +├── deply # 部署 + ├── cpp + ├── rpc_client.py # RPC 客户端的bash脚本 + ├── http_client.py # http 客户端的bash文件 + └── start_server.sh # 启动C++服务的脚本 + └── python + ├── deploy.sh # 预测部署bash脚本 + ├── config_nlp.yml # Pipeline 的配置文件 + ├── web_service.py # Pipeline 服务端的脚本 + ├── rpc_client.py # Pipeline RPC客户端的脚本 + └── predict.py # python 预测部署示例 +|—— scripts + ├── export_model.sh # 动态图参数导出静态图参数的bash文件 + ├── export_to_serving.sh # 导出 Paddle Serving 模型格式的bash文件 + ├── train_ce.sh # 匹配模型训练的bash文件 + ├── evaluate_ce.sh # 评估验证文件bash脚本 + ├── predict_ce.sh # 匹配模型预测脚本的bash文件 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── export_to_serving.py # 导出 Paddle Serving 模型格式的脚本 +├── data.py # 训练样本的转换逻辑 +├── train_ce.py # 模型训练脚本 +├── evaluate.py # 评估验证文件 +├── predict.py # Pair-wise 模型预测脚本,输出文本对是相似度 + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +样例数据如下: +``` +(小学数学教材比较) 关键词:新加坡 新加坡与中国数学教材的特色比较数学教材,教材比较,问题解决 0 +徐慧新疆肿瘤医院 头颈部非霍奇金淋巴瘤扩散加权成像ADC值与Ki-67表达相关性分析淋巴瘤,非霍奇金,头颈部肿瘤,磁共振成像 1 +抗生素关性腹泻 鼠李糖乳杆菌GG防治消化系统疾病的研究进展鼠李糖乳杆菌,腹泻,功能性胃肠病,肝脏疾病,幽门螺杆菌 0 +德州市图书馆 图书馆智慧化建设与融合创新服务研究图书馆;智慧化;阅读服务;融合创新 1 +维生素c 综述 维生素C防治2型糖尿病研究进展维生素C;2型糖尿病;氧化应激;自由基;抗氧化剂 0 +(白藜芦醇) 关键词:2型糖尿病 2型糖尿病大鼠心肌缺血再灌注损伤转录因子E2相关因子2/血红素氧合酶1信号通路的表达及白藜芦醇的干预研究糖尿病,2型,心肌缺血,再灌注损伤,白藜芦醇 1 +融资偏好 创新型企业产业风险、融资偏好与融资选择融资偏好;产业风险;融资选择 1 +星载激光雷达 星载激光雷达望远镜主镜超轻量化结构设计超轻量化;拓扑优化;集成优化;RMS;有限元仿真 1 +``` + + +### 数据集下载 + + +- [literature_search_rank](https://paddlenlp.bj.bcebos.com/applications/literature_search_rank.zip) + +``` +├── data # 排序数据集 + ├── test.csv # 测试集 + ├── dev_pairwise.csv # 验证集 + └── train.csv # 训练集 +``` + + + +## 5. 模型训练 + +**排序模型下载链接:** + + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[ERNIE-Gram-Sort](https://bj.bcebos.com/v1/paddlenlp/models/ernie_gram_sort.zip)|
epoch:3 lr:5E-5 bs:64 max_len:64<br>|<br>4卡 v100-16g<br>
|d24ece68b7c3626ce6a24baa58dd297d| + + +### 训练环境说明 + + +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡。如果采用单机单卡训练,只需要把`--gpu`参数设置成单卡的卡号即可 + +训练的命令如下: + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir="logs" train_ce.py \ + --device gpu \ + --train_set data/train.csv \ + --test_file data/dev_pairwise.csv \ + --save_dir ./checkpoints \ + --model_name_or_path rocketqa-base-cross-encoder \ + --batch_size 32 \ + --save_steps 10000 \ + --max_seq_len 384 \ + --learning_rate 1E-5 \ + --weight_decay 0.01 \ + --warmup_proportion 0.0 \ + --logging_steps 10 \ + --seed 1 \ + --epochs 3 \ + --eval_step 1000 +``` +也可以运行bash脚本: + +``` +sh scripts/train_ce.sh +``` + + + +## 6. 评估 + + +``` +python evaluate.py --model_name_or_path rocketqa-base-cross-encoder \ + --init_from_ckpt checkpoints/model_80000/model_state.pdparams \ + --test_file data/dev_pairwise.csv +``` +也可以运行bash脚本: + +``` +sh scripts/evaluate_ce.sh +``` + + +成功运行后会输出下面的指标: + +``` +eval_dev auc:0.829 +``` + + + +## 7. 预测 + +### 准备预测数据 + +待预测数据为 tab 分隔的 tsv 文件,每一行为 1 个文本 Pair,和文本pair的语义索引相似度,(该相似度由召回模型算出,仅供参考),部分示例如下: + +``` +中西方语言与文化的差异 第二语言习得的一大障碍就是文化差异。 0.5160342454910278 +中西方语言与文化的差异 跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译 0.5145505666732788 +中西方语言与文化的差异 从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译 0.5141439437866211 +中西方语言与文化的差异 中英文化差异对翻译的影响中英文化,差异,翻译的影响 0.5138794183731079 +中西方语言与文化的差异 浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际 0.5131710171699524 +``` + + + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的rocketqa模型开始计算文本 Pair 的语义相似度: + +```shell +unset CUDA_VISIBLE_DEVICES +python predict.py \ + --device 'gpu' \ + --params_path checkpoints/model_80000/model_state.pdparams \ + --model_name_or_path rocketqa-base-cross-encoder \ + --test_set data/test.csv \ + --topk 10 \ + --batch_size 128 \ + --max_seq_length 384 +``` +也可以直接执行下面的命令: + +``` +sh scripts/predict_ce.sh +``` +得到下面的输出,分别是query,title和对应的预测概率: + +``` +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '高校\\十四五\\规划中学科建设要处理好五对关系\\十四五\\规划,学科建设,科技创新,人才培养', 'pred_prob': 0.7076062} +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '校企科研合作项目管理模式创新校企科研合作项目,管理模式,问题,创新', 'pred_prob': 0.64633846} +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '科研项目管理策略科研项目,项目管理,实施,必要性,策略', 'pred_prob': 0.63166416} +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '高校科研项目经费管理流程优化研究——以z大学为例高校,科研项目经费\\全流程\\管理,流程优化', 'pred_prob': 0.60351866} +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '关于推进我院科研发展进程的相关问题研究医院科研,主体,环境,信息化', 'pred_prob': 0.5688347} +{'text_a': '加强科研项目管理有效促进医学科研工作', 'text_b': '医学临床科研选题原则和方法医学临床,科学研究,选题', 'pred_prob': 0.55190295} +``` + + + +## 8. 
部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py \ + --params_path checkpoints/model_80000/model_state.pdparams \ + --model_name_or_path rocketqa-base-cross-encoder \ + --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference + +使用PaddleInference + +``` +python deploy/python/predict.py --model_dir ./output \ + --input_file data/test.csv \ + --model_name_or_path rocketqa-base-cross-encoder +``` +也可以运行下面的bash脚本: + +``` +sh deploy/python/deploy.sh +``` +得到下面的输出,输出的是样本的query,title以及对应的概率: + +``` +Data: {'query': '加强科研项目管理有效促进医学科研工作', 'title': '科研项目管理策略科研项目,项目管理,实施,必要性,策略'} prob: 0.5479063987731934 +Data: {'query': '加强科研项目管理有效促进医学科研工作', 'title': '关于推进我院科研发展进程的相关问题研究医院科研,主体,环境,信息化'} prob: 0.5151925086975098 +Data: {'query': '加强科研项目管理有效促进医学科研工作', 'title': '深圳科技计划对高校科研项目资助现状分析与思考基础研究,高校,科技计划,科技创新'} prob: 0.42983829975128174 +Data: {'query': '加强科研项目管理有效促进医学科研工作', 'title': '普通高校科研管理模式的优化与创新普通高校,科研,科研管理'} prob: 0.465454638004303 +``` + +### Paddle Serving部署 + +Paddle Serving 的详细文档请参考 [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md)和[Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md),首先把静态图模型转换成Serving的格式: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.pdmodel" \ + --params_filename "inference.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "predict" + +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` +Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,第二种是C++的方式,下面分别介绍这两种方式的用法: + +#### Pipeline方式 + +修改对应预训练模型的`Tokenizer`: + +``` +self.tokenizer = AutoTokenizer.from_pretrained('rocketqa-base-cross-encoder') +``` + +启动 Pipeline Server: + +``` +python web_service.py +``` + +启动客户端调用 Server。 + +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [{"query":"加强科研项目管理有效促进医学科研工作","title":"科研项目管理策略科研项目,项目管理,实施,必要性,策略"}]` +``` +然后运行: +``` +python rpc_client.py +``` +模型的输出为: + +``` +PipelineClient::predict pack_data time:1662354188.422532 +PipelineClient::predict before time:1662354188.423034 +time to cost :0.016808509826660156 seconds +(1,) +[0.5479064] +``` +可以看到客户端发送了1条文本,这条文本的相似的概率值。 + +#### C++的方式 + +启动C++的Serving: + +``` +python -m paddle_serving_server.serve --model serving_server --port 8600 --gpu_id 0 --thread 5 --ir_optim True +``` +也可以使用脚本: + +``` +sh deploy/cpp/start_server.sh +``` +Client 可以使用 http 或者 rpc 两种方式,rpc 的方式为: + +``` +python deploy/cpp/rpc_client.py +``` +运行的输出为: + +``` +I0905 05:38:28.876770 28507 general_model.cpp:490] [client]logid=0,client_cost=158.124ms,server_cost=156.385ms. 
+time to cost :0.15848731994628906 seconds +[0.54790646] +``` +可以看到服务端返回了相似度结果 + +或者使用 http 的客户端访问模式: + +``` +python deploy/cpp/http_client.py +``` +运行的输出为: +``` +time to cost :0.13054680824279785 seconds +0.5479064707850817 +``` +可以看到服务端返回了相似度结果 + + +## Reference + +[1] Xiao, Dongling, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. “ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding.” ArXiv:2010.12148 [Cs]. + +[2] Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, Haifeng Wang: +RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. NAACL-HLT 2021: 5835-5847 diff --git a/applications/neural_search/ranking/cross_encoder/data.py b/applications/neural_search/ranking/cross_encoder/data.py new file mode 100644 index 0000000000000000000000000000000000000000..c03776f4a41c34f8e95fa7525ea1edf9979bcb47 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/data.py @@ -0,0 +1,99 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np +import paddle + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False, is_pair=False): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. And creates a mask from the two sequences passed + to be used in a sequence-pair classification task. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + It returns the first portion of the mask (0's). + + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): List of sequence pair mask. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. 
+ """ + + if is_pair: + text = example["text_a"] + text_pair = example["text_b"] + else: + text = example["text"] + text_pair = None + encoded_inputs = tokenizer(text=text, text_pair=text_pair, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if is_test: + return input_ids, token_type_ids + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def read_data(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8", errors="ignore") as f: + for i, line in enumerate(f): + # Skip column name + if i == 0: + continue + data = line.rstrip("\n").split("\t") + if len(data) != 3: + print(data) + continue + query = data[0] + title = data[1] + label = data[-1] + # breakpoint() + yield {"text_a": query, "text_b": title, "label": int(label)} + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) diff --git a/applications/neural_search/ranking/cross_encoder/deploy/cpp/http_client.py b/applications/neural_search/ranking/cross_encoder/deploy/cpp/http_client.py new file mode 100644 index 0000000000000000000000000000000000000000..40bc903f065bc949dc199bf0b5aadb1a7ea010d2 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/cpp/http_client.py @@ -0,0 +1,68 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
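+
+# HTTP client for the C++ Paddle Serving deployment: it tokenizes a (query, title) pair
+# with the rocketqa-base-cross-encoder tokenizer, sends input_ids/token_type_ids to the
+# server started by start_server.sh, and maps the returned logits to a similarity score
+# with expit (sigmoid).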
+ +import time + +import numpy as np +from paddle_serving_client.httpclient import HttpClient +from scipy.special import expit + +from paddlenlp.transformers import AutoTokenizer + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:8600"] +client = HttpClient() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ + +# 创建tokenizer +tokenizer = AutoTokenizer.from_pretrained("rocketqa-base-cross-encoder") +max_seq_len = 64 + +# 数据预处理 +list_data = [{"query": "加强科研项目管理有效促进医学科研工作", "title": "科研项目管理策略科研项目,项目管理,实施,必要性,策略"}] + +input_ids, token_type_ids = [], [] +for example in list_data: + input_id, token_type_id = convert_example(example, tokenizer, max_seq_length=max_seq_len) + input_ids.append(input_id) + token_type_ids.append(token_type_id) + +feed_dict = {} +feed_dict["input_ids"] = np.array(input_ids) +feed_dict["token_type_ids"] = np.array(token_type_ids) +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +b_end = time.time() +print(result) +print("time to cost :{} seconds".format(b_end - b_start)) +score = result.outputs[0].tensor[0].float_data +sim_score = expit(np.array(score))[1] +print(sim_score) diff --git a/applications/neural_search/ranking/cross_encoder/deploy/cpp/rpc_client.py b/applications/neural_search/ranking/cross_encoder/deploy/cpp/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..13d988aa5d0a61b392848de19791ae68a86b2af8 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/cpp/rpc_client.py @@ -0,0 +1,68 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
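+
+# RPC client for the same C++ Paddle Serving deployment. The flow mirrors http_client.py:
+# tokenize the (query, title) pair, send input_ids/token_type_ids through client.predict,
+# and convert the "predict" logits to positive-class probabilities with expit (sigmoid).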
+ +import time + +import numpy as np +from paddle_serving_client import Client +from scipy.special import expit + +from paddlenlp.transformers import AutoTokenizer + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:8600"] +client = Client() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ + +# 创建tokenizer +tokenizer = AutoTokenizer.from_pretrained("rocketqa-base-cross-encoder") +max_seq_len = 64 + +# 数据预处理 +list_data = [{"query": "加强科研项目管理有效促进医学科研工作", "title": "科研项目管理策略科研项目,项目管理,实施,必要性,策略"}] + +input_ids, token_type_ids = [], [] +for example in list_data: + input_id, token_type_id = convert_example(example, tokenizer, max_seq_length=max_seq_len) + input_ids.append(input_id) + token_type_ids.append(token_type_id) + +feed_dict = {} +feed_dict["input_ids"] = np.array(input_ids) +feed_dict["token_type_ids"] = np.array(token_type_ids) +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +# breakpoint() +sim_score = expit(result["predict"])[:, 1] +b_end = time.time() +print("time to cost :{} seconds".format(b_end - b_start)) +print(sim_score) diff --git a/applications/neural_search/ranking/cross_encoder/deploy/cpp/start_server.sh b/applications/neural_search/ranking/cross_encoder/deploy/cpp/start_server.sh new file mode 100644 index 0000000000000000000000000000000000000000..0197b9a6c223204db1facbecd4b3384a079b95af --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/cpp/start_server.sh @@ -0,0 +1 @@ +python -m paddle_serving_server.serve --model serving_server --port 8600 --gpu_id 0 --thread 5 --ir_optim True \ No newline at end of file diff --git a/applications/neural_search/ranking/cross_encoder/deploy/python/config_nlp.yml b/applications/neural_search/ranking/cross_encoder/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..027ac44eafb7d3ebcd113a6783a3535d0e665b38 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8088 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8089 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + # ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" 
+ # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['predict'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/neural_search/ranking/cross_encoder/deploy/python/deploy.sh b/applications/neural_search/ranking/cross_encoder/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..c20a4d057ab1078d08dfaf266c8a5487e149b81f --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/python/deploy.sh @@ -0,0 +1,3 @@ +python deploy/python/predict.py --model_dir ./output \ + --input_file data/test.csv \ + --model_name_or_path rocketqa-base-cross-encoder \ No newline at end of file diff --git a/applications/neural_search/ranking/cross_encoder/deploy/python/predict.py b/applications/neural_search/ranking/cross_encoder/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..b092fc004ab3132ccdc4c159deddcbb1a994a758 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/python/predict.py @@ -0,0 +1,221 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import numpy as np +import paddle +from paddle import inference +from scipy.special import softmax + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.append(".") + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--input_file", type=str, required=True, help="The test set file.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +parser.add_argument('--model_name_or_path', default="rocketqa-base-cross-encoder", help="The pretrained model used for training") +args = parser.parse_args() +# yapf: enable + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"query": data[0], "title": data[1]} + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + 
config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name=args.model_name_or_path, + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example(text, tokenizer, max_seq_length=self.max_seq_length, is_test=True) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + sim_score = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + sim_score = softmax(sim_score)[:, 1] + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return sim_score + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + test_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + data = [{"query": d["query"], "title": d["title"]} for d in test_ds] + + batches = [data[idx : idx + args.batch_size] for idx in range(0, len(data), args.batch_size)] + + results = [] + for batch_data in batches: + results.extend(predictor.predict(batch_data, tokenizer)) + for idx, text in enumerate(data): + print("Data: {} \t prob: {}".format(text, results[idx])) + if args.benchmark: + predictor.autolog.report() diff --git a/applications/neural_search/ranking/cross_encoder/deploy/python/rpc_client.py b/applications/neural_search/ranking/cross_encoder/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..27c3d6e0697600055caed53f72cc933bb2d86d8c --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/python/rpc_client.py @@ -0,0 +1,34 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time +import numpy as np + +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8089"]) + +list_data = [{"query": "加强科研项目管理有效促进医学科研工作", "title": "科研项目管理策略科研项目,项目管理,实施,必要性,策略"}] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = str(item) + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) +result = np.array(eval(ret.value[0])) +print(result.shape) +print(result) diff --git a/applications/neural_search/ranking/cross_encoder/deploy/python/web_service.py b/applications/neural_search/ranking/cross_encoder/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..bfcbf7fc117084ba5ac6e7fdbed872fc8d16ea4f --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/deploy/python/web_service.py @@ -0,0 +1,74 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
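+
+# Pipeline web service for the cross-encoder ranker. ErnieOp tokenizes the incoming
+# (query, title) pairs in preprocess(), and in postprocess() converts the "predict"
+# logits to positive-class probabilities with softmax. ErnieService wires the op into
+# a pipeline configured by config_nlp.yml.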
+ +import json + +from paddle_serving_server.web_service import Op, WebService +from scipy.special import softmax + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained("rocketqa-base-cross-encoder") + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + example = json.loads(input_dict[str(i)].replace("'", '"')) + input_ids, segment_ids = convert_example(example, self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + sim_score = softmax(fetch_dict["predict"])[:, 1] + new_dict["predict"] = str(sim_score) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/neural_search/ranking/cross_encoder/evaluate.py b/applications/neural_search/ranking/cross_encoder/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..9e7fbe1dd0b3090a82c83b22722fe454000fb113 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/evaluate.py @@ -0,0 +1,110 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
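+
+# Standalone evaluation script: loads a fine-tuned cross-encoder checkpoint
+# (--init_from_ckpt), scores the dev set given by --test_file with softmax probabilities,
+# and reports AUC computed by paddle.metric.Auc.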
+ +import argparse +import os +import random +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_data + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--test_file", type=str, required=True, help="The full path of test file") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--model_name_or_path', default="rocketqa-base-cross-encoder", help="The pretrained model used for training") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model(input_ids=input_ids, token_type_ids=token_type_ids) + + sim_score = F.softmax(pos_probs) + metric.update(preds=sim_score.numpy(), labels=labels) + + print("eval_{} auc:{:.3}".format(phase, metric.accumulate())) + metric.reset() + model.train() + + +def main(): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + dev_ds = load_dataset(read_data, data_path=args.test_file, lazy=False) + + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=2) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func_eval = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_pair=True) + + batchify_fn_eval = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn_eval, trans_fn=trans_func_eval + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + metric = paddle.metric.Auc() + evaluate(model, metric, dev_data_loader, "dev") + + +if __name__ == "__main__": + main() diff --git a/applications/neural_search/ranking/cross_encoder/export_model.py b/applications/neural_search/ranking/cross_encoder/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..a580726b0c07df2094c637c899be79ffac2deca8 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/export_model.py @@ -0,0 +1,52 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
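+
+# Dynamic-to-static export script: loads the fine-tuned checkpoint from --params_path,
+# traces the model with paddle.jit.to_static using (input_ids, token_type_ids) InputSpec,
+# and saves the static-graph inference model under --output_path.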
+ +import argparse +import os + +import paddle + +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument('--model_name_or_path', default="rocketqa-base-cross-encoder", help="The pretrained model used for training") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=2) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/neural_search/ranking/cross_encoder/export_to_serving.py b/applications/neural_search/ranking/cross_encoder/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..e10cd5616a4e2b924fde1f7fa99a1722ca30a068 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/export_to_serving.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.pdiparams', help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. 
Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for fetch vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/neural_search/ranking/cross_encoder/predict.py b/applications/neural_search/ranking/cross_encoder/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..5272713a3c416c55f9a5a662c8faf22c56bfc67b --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/predict.py @@ -0,0 +1,89 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default="checkpoints/model_900/model_state.pdparams", help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", type=int, default=128, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") +parser.add_argument("--test_set", type=str, required=True, help="The full path of test_set.") +parser.add_argument("--topk", type=int, default=10, help="The Topk texts.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--model_name_or_path', default="rocketqa-base-cross-encoder", help="The pretrained model used for training") +args = parser.parse_args() +# yapf: enable + + +@paddle.no_grad() +def predict(model, data_loader): + results = [] + model.eval() + with paddle.no_grad(): + for batch in data_loader: + input_ids, token_type_ids = batch + logits = model(input_ids, token_type_ids) + probs = F.softmax(logits) + probs = probs.numpy() + results.extend(probs[:, 1]) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + test_ds = load_dataset(read_text_pair, data_path=args.test_set, lazy=False) + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=2) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True, is_pair=True + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): [data for data in fn(samples)] + + test_data_loader = create_dataloader( + test_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + results = predict(model, test_data_loader) + test_ds = load_dataset(read_text_pair, data_path=args.test_set, lazy=False) + text_pairs = [] + for idx, prob in enumerate(results): + text_pair = test_ds[idx] + text_pair["pred_prob"] = prob + text_pairs.append(text_pair) + text_pairs = sorted(text_pairs, key=lambda x: x["pred_prob"], reverse=True)[: args.topk] + for item in text_pairs: + print(item) diff --git a/applications/neural_search/ranking/cross_encoder/scripts/evaluate_ce.sh b/applications/neural_search/ranking/cross_encoder/scripts/evaluate_ce.sh new file mode 100644 index 0000000000000000000000000000000000000000..f42491cda38d1fa20dccc8054a50d27195df1a95 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/scripts/evaluate_ce.sh @@ -0,0 +1,3 @@ +python evaluate.py --model_name_or_path rocketqa-base-cross-encoder \ + --init_from_ckpt checkpoints/model_80000/model_state.pdparams \ + --test_file data/dev_pairwise.csv \ No newline at end of file diff --git a/applications/neural_search/ranking/cross_encoder/scripts/export_model.sh b/applications/neural_search/ranking/cross_encoder/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..a6c54ae878d983b24cd5fa77f92840755e3873d3 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/scripts/export_model.sh @@ -0,0 +1,4 @@ +python export_model.py \ + --params_path checkpoints/model_80000/model_state.pdparams \ + 
--model_name_or_path rocketqa-base-cross-encoder \ + --output_path=./output \ No newline at end of file diff --git a/applications/neural_search/ranking/cross_encoder/scripts/export_to_serving.sh b/applications/neural_search/ranking/cross_encoder/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..4a5fe6bfe576184201647b65e69f181a0a25a224 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/scripts/export_to_serving.sh @@ -0,0 +1,7 @@ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.pdmodel" \ + --params_filename "inference.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "predict" diff --git a/applications/neural_search/ranking/cross_encoder/scripts/predict_ce.sh b/applications/neural_search/ranking/cross_encoder/scripts/predict_ce.sh new file mode 100644 index 0000000000000000000000000000000000000000..46f6fc50d099fae712626591b4536eec3cad1f2f --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/scripts/predict_ce.sh @@ -0,0 +1,11 @@ +#!/bin/bash +unset CUDA_VISIBLE_DEVICES +export CUDA_VISIBLE_DEVICES=0 +python predict.py \ + --device 'gpu' \ + --params_path checkpoints/model_80000/model_state.pdparams \ + --model_name_or_path rocketqa-base-cross-encoder \ + --test_set data/test.csv \ + --topk 10 \ + --batch_size 128 \ + --max_seq_length 384 \ No newline at end of file diff --git a/applications/neural_search/ranking/cross_encoder/scripts/train_ce.sh b/applications/neural_search/ranking/cross_encoder/scripts/train_ce.sh new file mode 100644 index 0000000000000000000000000000000000000000..570f528902c6c396a40cfee3097c7bf142d4a5bc --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/scripts/train_ce.sh @@ -0,0 +1,18 @@ +#!/bin/bash +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir="logs" train_ce.py \ + --device gpu \ + --train_set data/train.csv \ + --test_file data/dev_pairwise.csv \ + --save_dir ./checkpoints \ + --model_name_or_path rocketqa-base-cross-encoder \ + --batch_size 32 \ + --save_steps 10000 \ + --max_seq_len 384 \ + --learning_rate 1E-5 \ + --weight_decay 0.01 \ + --warmup_proportion 0.0 \ + --logging_steps 10 \ + --seed 1 \ + --epochs 3 \ + --eval_step 1000 diff --git a/applications/neural_search/ranking/cross_encoder/train_ce.py b/applications/neural_search/ranking/cross_encoder/train_ce.py new file mode 100644 index 0000000000000000000000000000000000000000..bae21c4992e9eeb17b596f17a7c4a816bbac6a38 --- /dev/null +++ b/applications/neural_search/ranking/cross_encoder/train_ce.py @@ -0,0 +1,189 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
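+
+# Fine-tuning script for the cross-encoder ranking model: reads (query, title, label)
+# triples, trains AutoModelForSequenceClassification (2 classes) with cross-entropy loss
+# and AdamW (bias/LayerNorm parameters excluded from weight decay), evaluates AUC on the
+# dev set every --eval_step steps, and saves checkpoints every --save_steps steps.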
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_data + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--train_set", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--test_file", type=str, required=True, help="The full path of test file") + +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") +parser.add_argument("--save_steps", default=100, type=int, help="The interval steps to save checkppoints.") +parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") +parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") +parser.add_argument('--model_name_or_path', default="rocketqa-base-cross-encoder", help="The pretrained model used for training") +parser.add_argument("--eval_step", default=200, type=int, help="Step interval for evaluation.") +args = parser.parse_args() +# yapf: enable + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model(input_ids=input_ids, token_type_ids=token_type_ids) + + sim_score = F.softmax(pos_probs) + + metric.update(preds=sim_score.numpy(), labels=labels) + + print("eval_{} auc:{:.3}".format(phase, metric.accumulate())) + metric.reset() + model.train() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + dev_count = paddle.distributed.get_world_size() + set_seed(args.seed) + + train_ds = load_dataset(read_data, data_path=args.train_set, lazy=False) + dev_ds = load_dataset(read_data, data_path=args.test_file, lazy=False) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=2) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_pair=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.DataParallel(model) + + num_training_examples = len(train_ds) + # 4卡 gpu + max_train_steps = args.epochs * num_training_examples // args.batch_size // dev_count + + warmup_steps = int(max_train_steps * args.warmup_proportion) + + print("Device count: %d" % dev_count) + print("Num train examples: %d" % num_training_examples) + print("Max train steps: %d" % max_train_steps) + print("Num warmup steps: %d" % warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(1.0), + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Auc() + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch + logits = model(input_ids, token_type_ids) + loss = criterion(logits, labels) + probs = F.softmax(logits, axis=1) + acc = paddle.metric.accuracy(input=probs, label=labels) + loss.backward() + + optimizer.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accuracy: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, args.logging_steps / time_diff) + ) + tic_train = time.time() + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, metric, dev_data_loader, "dev") + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + tic_train = time.time() + + # save final checkpoint + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/applications/neural_search/ranking/ernie_matching/README.md b/applications/neural_search/ranking/ernie_matching/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6f912ad680a193c029d5293e8288155b1f6c634d --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/README.md @@ -0,0 +1,409 @@ + + **目录** + +* [背景介绍](#背景介绍) +* [ERNIE-Gram](#ERNIE-Gram) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +基于ERNIE-Gram训练Pair-wise模型。Pair-wise 匹配模型适合将文本对相似度作为特征之一输入到上层排序模块进行排序的应用场景。 + + + + +# ERNIE-Gram + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +双塔模型,使用ERNIE-Gram预训练模型,使用margin_ranking_loss训练模型。 + + +### 评估指标 + +(1)采用 AUC 指标来评估排序模型的排序效果。 + +**效果评估** + +| 模型 | AUC | +| ------------ | ------------ | +| ERNIE-Gram | 0.801 | + + + +## 2. 环境依赖和安装说明 + +**环境依赖** + +* python >= 3.x +* paddlepaddle >= 2.1.3 +* paddlenlp >= 2.2 +* pandas >= 0.25.1 +* scipy >= 1.3.1 + + + +## 3. 
代码结构 + +以下是本项目主要代码结构及说明: + +``` +ernie_matching/ +├── deply # 部署 + ├── cpp + ├── rpc_client.py # RPC 客户端的bash脚本 + ├── http_client.py # http 客户端的bash文件 + └── start_server.sh # 启动C++服务的脚本 + └── python + ├── deploy.sh # 预测部署bash脚本 + ├── config_nlp.yml # Pipeline 的配置文件 + ├── web_service.py # Pipeline 服务端的脚本 + ├── rpc_client.py # Pipeline RPC客户端的脚本 + └── predict.py # python 预测部署示例 +|—— scripts + ├── export_model.sh # 动态图参数导出静态图参数的bash文件 + ├── export_to_serving.sh # 导出 Paddle Serving 模型格式的bash文件 + ├── train_pairwise.sh # Pair-wise 单塔匹配模型训练的bash文件 + ├── evaluate.sh # 评估验证文件bash脚本 + ├── predict_pairwise.sh # Pair-wise 单塔匹配模型预测脚本的bash文件 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── export_to_serving.py # 导出 Paddle Serving 模型格式的脚本 +├── model.py # Pair-wise 匹配模型组网 +├── data.py # Pair-wise 训练样本的转换逻辑 、Pair-wise 生成随机负例的逻辑 +├── train_pairwise.py # Pair-wise 单塔匹配模型训练脚本 +├── evaluate.py # 评估验证文件 +├── predict_pairwise.py # Pair-wise 单塔匹配模型预测脚本,输出文本对是相似度 + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +样例数据如下: +``` +个人所得税税务筹划 基于新个税视角下的个人所得税纳税筹划分析新个税;个人所得税;纳税筹划 个人所得税工资薪金税务筹划研究个人所得税,工资薪金,税务筹划 +液压支架底座受力分析 ZY4000/09/19D型液压支架的有限元分析液压支架,有限元分析,两端加载,偏载,扭转 基于ANSYS的液压支架多工况受力分析液压支架,四种工况,仿真分析,ANSYS,应力集中,优化 +迟发性血管痉挛 西洛他唑治疗动脉瘤性蛛网膜下腔出血后脑血管痉挛的Meta分析西洛他唑,蛛网膜下腔出血,脑血管痉挛,Meta分析 西洛他唑治疗动脉瘤性蛛网膜下腔出血后脑血管痉挛的Meta分析西洛他唑,蛛网膜下腔出血,脑血管痉挛,Meta分析 +氧化亚硅 复合溶胶-凝胶一锅法制备锂离子电池氧化亚硅/碳复合负极材料氧化亚硅,溶胶-凝胶法,纳米颗粒,负极,锂离子电池 负载型聚酰亚胺-二氧化硅-银杂化膜的制备和表征聚酰亚胺,二氧化硅,银,杂化膜,促进传输 +``` + + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + +## 5. 模型训练 + +**排序模型下载链接:** + + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[ERNIE-Gram-Sort](https://bj.bcebos.com/v1/paddlenlp/models/ernie_gram_sort.zip)|
epoch:3 lr:5E-5 bs:64 max_len:64 | 4卡 v100-16g
|d24ece68b7c3626ce6a24baa58dd297d| + + +### 训练环境说明 + + +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡, 基于ERNIE-Gram训练模型,数据量比较大,需要20小时10分钟左右。如果采用单机单卡训练,只需要把`--gpu`参数设置成单卡的卡号即可 + +训练的命令如下: + +``` +python -u -m paddle.distributed.launch --gpus "0,1,2,3" train_pairwise.py \ + --device gpu \ + --save_dir ./checkpoints \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --margin 0.1 \ + --eval_step 100 \ + --train_file data/train_pairwise.csv \ + --test_file data/dev_pairwise.csv +``` +也可以运行bash脚本: + +``` +sh scripts/train_pairwise.sh +``` + + + +## 6. 评估 + + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0" evaluate.py \ + --device gpu \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \ + --test_file data/dev_pairwise.csv +``` +也可以运行bash脚本: + +``` +sh scripts/evaluate.sh +``` + + +成功运行后会输出下面的指标: + +``` +eval_dev auc:0.796 +``` + + + +## 7. 预测 + +### 准备预测数据 + +待预测数据为 tab 分隔的 tsv 文件,每一行为 1 个文本 Pair,和文本pair的语义索引相似度,部分示例如下: + +``` +中西方语言与文化的差异 第二语言习得的一大障碍就是文化差异。 0.5160342454910278 +中西方语言与文化的差异 跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译 0.5145505666732788 +中西方语言与文化的差异 从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译 0.5141439437866211 +中西方语言与文化的差异 中英文化差异对翻译的影响中英文化,差异,翻译的影响 0.5138794183731079 +中西方语言与文化的差异 浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际 0.5131710171699524 +``` + + + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的 ERNIE-Gram模型开始计算文本 Pair 的语义相似度: + +```shell +python -u -m paddle.distributed.launch --gpus "0" \ + predict_pairwise.py \ + --device gpu \ + --params_path "./checkpoints/model_30000/model_state.pdparams"\ + --batch_size 128 \ + --max_seq_length 64 \ + --input_file 'sort/test_pairwise.csv' +``` +也可以直接执行下面的命令: + +``` +sh scripts/predict_pairwise.sh +``` +得到下面的输出,分别是query,title和对应的预测概率: + +``` +{'query': '中西方语言与文化的差异', 'title': '第二语言习得的一大障碍就是文化差异。', 'pred_prob': 0.85112214} +{'query': '中西方语言与文化的差异', 'title': '跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译', 'pred_prob': 0.78629625} +{'query': '中西方语言与文化的差异', 'title': '从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译', 'pred_prob': 0.91767526} +{'query': '中西方语言与文化的差异', 'title': '中英文化差异对翻译的影响中英文化,差异,翻译的影响', 'pred_prob': 0.8601749} +{'query': '中西方语言与文化的差异', 'title': '浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际', 'pred_prob': 0.8944413} +``` + + + +## 8. 
部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/model_30000/model_state.pdparams \ + --output_path=./output \ + --model_name_or_path ernie-3.0-medium-zh +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference + +使用PaddleInference: + +``` +python deploy/python/predict.py --model_dir ./output \ + --input_file sort/test_pairwise.csv \ + --model_name_or_path ernie-3.0-medium-zh +``` +也可以运行下面的bash脚本: + +``` +sh deploy/python/deploy.sh +``` +得到下面的输出,输出的是样本的query,title以及对应的概率: + +``` +Data: {'query': '中西方语言与文化的差异', 'title': '第二语言习得的一大障碍就是文化差异。'} prob: [0.8511221] +Data: {'query': '中西方语言与文化的差异', 'title': '跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译'} prob: [0.7862964] +Data: {'query': '中西方语言与文化的差异', 'title': '从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译'} prob: [0.91767514] +Data: {'query': '中西方语言与文化的差异', 'title': '中英文化差异对翻译的影响中英文化,差异,翻译的影响'} prob: [0.8601747] +Data: {'query': '中西方语言与文化的差异', 'title': '浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际'} prob: [0.8944413] +``` + +### Paddle Serving部署 + +Paddle Serving 的详细文档请参考 [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md)和[Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md),首先把静态图模型转换成Serving的格式: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.predict.pdmodel" \ + --params_filename "inference.predict.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "predict" + +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` +Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,第二种是C++的方式,下面分别介绍这两种方式的用法: + +#### Pipeline方式 + +修改`Tokenizer` + +``` +self.tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') +``` + +启动 Pipeline Server: + +``` +python web_service.py +``` + +启动客户端调用 Server。 + +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [{"query":"中西方语言与文化的差异","title":"第二语言习得的一大障碍就是文化差异。"}]` +``` +然后运行: +``` +python rpc_client.py +``` +模型的输出为: + +``` +PipelineClient::predict pack_data time:1656912047.5986433 +PipelineClient::predict before time:1656912047.599081 +time to cost :0.012039899826049805 seconds +(1, 1) +[[0.85112208]] +``` +可以看到客户端发送了1条文本,这条文本的相似的概率值。 + +#### C++的方式 + +启动C++的Serving: + +``` +python -m paddle_serving_server.serve --model serving_server --port 8600 --gpu_id 0 --thread 5 --ir_optim True +``` +也可以使用脚本: + +``` +sh deploy/cpp/start_server.sh +``` +Client 可以使用 http 或者 rpc 两种方式,rpc 的方式为: + +``` +python deploy/cpp/rpc_client.py +``` +运行的输出为: + +``` +I0704 05:19:00.443437 1987 general_model.cpp:490] [client]logid=0,client_cost=8.477ms,server_cost=6.458ms. 
+time to cost :0.008707761764526367 seconds +{'predict': array([[0.8511221]], dtype=float32)} +``` +可以看到服务端返回了相似度结果 + +或者使用 http 的客户端访问模式: + +``` +python deploy/cpp/http_client.py +``` +运行的输出为: +``` +time to cost :0.006819009780883789 seconds +[0.8511220812797546] +``` +可以看到服务端返回了相似度结果 + +也可以使用curl方式发送Http请求: + +``` +curl -XPOST http://0.0.0.0:8600/GeneralModelService/inference -d ' {"tensor":[{"int64_data":[ 1, 12, 213, 58, 405, 545, 54, 68, 73, + 5, 859, 712, 2, 131, 177, 405, 545, 489, + 116, 5, 7, 19, 843, 1767, 113, 10, 68, + 73, 859, 712, 12043, 2],"elem_type":0,"name":"input_ids","alias_name":"input_ids","shape":[1,32]}, + {"int64_data":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],"elem_type":0,"name":"token_type_ids","alias_name":"token_type_ids","shape":[1,32]} + ], +"fetch_var_names":["sigmoid_2.tmp_0"], +"log_id":0 +}' +``` + + +## Reference + +[1] Xiao, Dongling, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. “ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding.” ArXiv:2010.12148 [Cs]. diff --git a/applications/neural_search/ranking/ernie_matching/data.py b/applications/neural_search/ranking/ernie_matching/data.py new file mode 100644 index 0000000000000000000000000000000000000000..d7fdd67cc36f5cc9691d62f29a5eec431f8f8415 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/data.py @@ -0,0 +1,130 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
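+
+# Shared data utilities for the pair-wise ranking scripts:
+#   - create_dataloader: wraps a dataset with a DistributedBatchSampler for
+#     training (shuffled) and a plain BatchSampler otherwise.
+#   - read_text_pair: yields {"query", "title"} dicts from a tab-separated file.
+#   - convert_pointwise_example / convert_pairwise_example: tokenize the text
+#     pair(s) into input_ids and token_type_ids (plus the label when available).
+#   - gen_pair: builds {"query", "title", "neg_title"} triplets by sampling
+#     random negative titles from a shuffled pool of positive examples.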
+ +import paddle +import numpy as np + +from paddlenlp.datasets import MapDataset + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"query": data[0], "title": data[1]} + + +def convert_pointwise_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +def convert_pairwise_example(example, tokenizer, max_seq_length=512, phase="train"): + + if phase == "train": + query, pos_title, neg_title = example["query"], example["title"], example["neg_title"] + + pos_inputs = tokenizer(text=query, text_pair=pos_title, max_seq_len=max_seq_length) + neg_inputs = tokenizer(text=query, text_pair=neg_title, max_seq_len=max_seq_length) + + pos_input_ids = pos_inputs["input_ids"] + pos_token_type_ids = pos_inputs["token_type_ids"] + neg_input_ids = neg_inputs["input_ids"] + neg_token_type_ids = neg_inputs["token_type_ids"] + + return (pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids) + + else: + query, title = example["query"], example["title"] + + inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = inputs["input_ids"] + token_type_ids = inputs["token_type_ids"] + if phase == "eval": + return input_ids, token_type_ids, example["label"] + elif phase == "predict": + return input_ids, token_type_ids + else: + raise ValueError("not supported phase:{}".format(phase)) + + +def gen_pair(dataset, pool_size=100): + """ + Generate triplet randomly based on dataset + + Args: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: example["query"], example["title"] + pool_size: the number of example to sample negative example randomly + + Return: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. 
+ Each example is composed of 3 texts: example["query"], example["pos_title"]、example["neg_title"] + """ + + if len(dataset) < pool_size: + pool_size = len(dataset) + + new_examples = [] + pool = [] + tmp_examples = [] + + for example in dataset: + label = example["label"] + + # Filter negative example + if label == 0: + continue + + tmp_examples.append(example) + pool.append(example["title"]) + + if len(pool) >= pool_size: + np.random.shuffle(pool) + for idx, example in enumerate(tmp_examples): + example["neg_title"] = pool[idx] + new_examples.append(example) + tmp_examples = [] + pool = [] + else: + continue + return MapDataset(new_examples) diff --git a/applications/neural_search/ranking/ernie_matching/deploy/cpp/http_client.py b/applications/neural_search/ranking/ernie_matching/deploy/cpp/http_client.py new file mode 100644 index 0000000000000000000000000000000000000000..d649943dcf9f46d6ac46eae9bbdcddf2dfd0e09f --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/cpp/http_client.py @@ -0,0 +1,65 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import time + +import numpy as np +from paddle_serving_client.httpclient import HttpClient + +import paddlenlp as ppnlp + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:8600"] +client = HttpClient() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ + +# 创建tokenizer +tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained("ernie-gram-zh") +max_seq_len = 64 + +# 数据预处理 +list_data = [{"query": "中西方语言与文化的差异", "title": "第二语言习得的一大障碍就是文化差异。"}] + +input_ids, token_type_ids = [], [] +for example in list_data: + input_id, token_type_id = convert_example(example, tokenizer, max_seq_length=max_seq_len) + input_ids.append(input_id) + token_type_ids.append(token_type_id) + +feed_dict = {} +feed_dict["input_ids"] = np.array(input_ids) +feed_dict["token_type_ids"] = np.array(token_type_ids) +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +b_end = time.time() +print(result) +print("time to cost :{} seconds".format(b_end - b_start)) +print(result.outputs[0].tensor[0].float_data) diff --git a/applications/neural_search/ranking/ernie_matching/deploy/cpp/rpc_client.py b/applications/neural_search/ranking/ernie_matching/deploy/cpp/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..a3474c2045a7ffbc21117ef9133db8dbc73449fd --- /dev/null +++ 
b/applications/neural_search/ranking/ernie_matching/deploy/cpp/rpc_client.py @@ -0,0 +1,65 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import time + +import numpy as np +from paddle_serving_client import Client + +import paddlenlp as ppnlp + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:8600"] +client = Client() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ + +# 创建tokenizer +tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained("ernie-gram-zh") +max_seq_len = 64 + +# 数据预处理 +list_data = [{"query": "中西方语言与文化的差异", "title": "第二语言习得的一大障碍就是文化差异。"}] + +input_ids, token_type_ids = [], [] +for example in list_data: + input_id, token_type_id = convert_example(example, tokenizer, max_seq_length=max_seq_len) + input_ids.append(input_id) + token_type_ids.append(token_type_id) + +feed_dict = {} +feed_dict["input_ids"] = np.array(input_ids) +feed_dict["token_type_ids"] = np.array(token_type_ids) +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +b_end = time.time() +print("time to cost :{} seconds".format(b_end - b_start)) +print(result) diff --git a/applications/neural_search/ranking/ernie_matching/deploy/cpp/start_server.sh b/applications/neural_search/ranking/ernie_matching/deploy/cpp/start_server.sh new file mode 100644 index 0000000000000000000000000000000000000000..0197b9a6c223204db1facbecd4b3384a079b95af --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/cpp/start_server.sh @@ -0,0 +1 @@ +python -m paddle_serving_server.serve --model serving_server --port 8600 --gpu_id 0 --thread 5 --ir_optim True \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/deploy/python/config_nlp.yml b/applications/neural_search/ranking/ernie_matching/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..027ac44eafb7d3ebcd113a6783a3535d0e665b38 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, 
rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8088 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8089 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + # ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['predict'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/neural_search/ranking/ernie_matching/deploy/python/deploy.sh b/applications/neural_search/ranking/ernie_matching/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..2eeeb719b51438e9e811e63cb77644524481cb36 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/python/deploy.sh @@ -0,0 +1,3 @@ +python deploy/python/predict.py --model_dir ./output \ + --input_file sort/test_pairwise.csv \ + --model_name_or_path ernie-3.0-medium-zh \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/deploy/python/predict.py b/applications/neural_search/ranking/ernie_matching/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..a18e07f0d9db8aa92a1ed1cdadef0781e553fe7f --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/python/predict.py @@ -0,0 +1,219 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import numpy as np +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.append(".") + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--input_file", type=str, required=True, help="The test set file.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="The pretrained model used for training") +args = parser.parse_args() +# yapf: enable + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"query": data[0], "title": data[1]} + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.predict.pdmodel" + params_file = model_dir + "/inference.predict.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + 
elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-tiny", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example(text, tokenizer, max_seq_length=self.max_seq_length, is_test=True) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + sim_score = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return sim_score + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + test_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + data = [{"query": d["query"], "title": d["title"]} for d in test_ds] + + batches = [data[idx : idx + args.batch_size] for idx in range(0, len(data), args.batch_size)] + + results = [] + for batch_data in batches: + results.extend(predictor.predict(batch_data, tokenizer)) + for idx, text in enumerate(data): + print("Data: {} \t prob: {}".format(text, results[idx])) + if args.benchmark: + predictor.autolog.report() diff --git a/applications/neural_search/ranking/ernie_matching/deploy/python/rpc_client.py b/applications/neural_search/ranking/ernie_matching/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..613fe9b9aa3c42eb0210a5ec3e302767ae56c3ae --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/python/rpc_client.py @@ -0,0 +1,34 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time +import numpy as np + +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8089"]) + +list_data = [{"query": "中西方语言与文化的差异", "title": "第二语言习得的一大障碍就是文化差异。"}] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = str(item) + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) +result = np.array(eval(ret.value[0])) +print(result.shape) +print(result) diff --git a/applications/neural_search/ranking/ernie_matching/deploy/python/web_service.py b/applications/neural_search/ranking/ernie_matching/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..64bb3f1ef8970c7283077514453a5b22b8816ab0 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/deploy/python/web_service.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
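+
+# Paddle Serving pipeline service for the pair-wise ranking model. ErnieOp
+# tokenizes every {"query", "title"} pair with the ernie-3.0-medium-zh
+# tokenizer, pads the batch into input_ids / token_type_ids, and converts the
+# "predict" fetch result to a string in postprocess. ErnieService exposes the
+# op as a pipeline configured by config_nlp.yml; start it with
+# `python web_service.py` and query it with deploy/python/rpc_client.py.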
+ +import json + +from paddle_serving_server.web_service import Op, WebService + + +def convert_example(example, tokenizer, max_seq_length=512): + + query, title = example["query"], example["title"] + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return input_ids, token_type_ids + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + example = json.loads(input_dict[str(i)].replace("'", '"')) + input_ids, segment_ids = convert_example(example, self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["predict"] = str(fetch_dict["predict"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/neural_search/ranking/ernie_matching/evaluate.py b/applications/neural_search/ranking/ernie_matching/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..0aaf5caca1efd2617cfe531e9045379e9cff351e --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/evaluate.py @@ -0,0 +1,136 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
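+
+# Offline evaluation for the pair-wise ranking model: reads a tab-separated
+# file with "query", "title" and "label" columns, scores every pair with
+# PairwiseMatching.predict and reports AUC on the dev set, e.g.
+#     python -u -m paddle.distributed.launch --gpus "0" evaluate.py \
+#         --device gpu \
+#         --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \
+#         --test_file data/dev_pairwise.csv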
+ +import argparse +import os +import random +from functools import partial + +import numpy as np +import paddle +import pandas as pd +from data import convert_pairwise_example as convert_example +from data import create_dataloader +from model import PairwiseMatching +from tqdm import tqdm + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--margin", default=0.1, type=float, help="Margin for pos_score and neg_score.") +parser.add_argument("--test_file", type=str, required=True, help="The full path of test file") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="The pretrained model used for training") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
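+        phase(obj:`str`, optional): The split name ("dev" by default); it only
+            affects the "eval_{phase} auc" log line that gets printed.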
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model.predict(input_ids=input_ids, token_type_ids=token_type_ids) + + neg_probs = 1.0 - pos_probs + + preds = np.concatenate((neg_probs, pos_probs), axis=1) + metric.update(preds=preds, labels=labels) + + print("eval_{} auc:{:.3}".format(phase, metric.accumulate())) + metric.reset() + model.train() + + +# 构建读取函数,读取原始数据 +def read(src_path, is_predict=False): + data = pd.read_csv(src_path, sep="\t") + for index, row in tqdm(data.iterrows()): + query = row["query"] + title = row["title"] + neg_title = row["neg_title"] + yield {"query": query, "title": title, "neg_title": neg_title} + + +def read_test(src_path, is_predict=False): + data = pd.read_csv(src_path, sep="\t") + for index, row in tqdm(data.iterrows()): + query = row["query"] + title = row["title"] + label = row["label"] + yield {"query": query, "title": title, "label": label} + + +def main(): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + dev_ds = load_dataset(read_test, src_path=args.test_file, lazy=False) + print(dev_ds[0]) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func_eval = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="eval") + + batchify_fn_eval = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn_eval, trans_fn=trans_func_eval + ) + + model = PairwiseMatching(pretrained_model, margin=args.margin) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + metric = paddle.metric.Auc() + evaluate(model, metric, dev_data_loader, "dev") + + +if __name__ == "__main__": + main() diff --git a/applications/neural_search/ranking/ernie_matching/export_model.py b/applications/neural_search/ranking/ernie_matching/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..0d329bb326ffbc91b3b4d4403d6f2cb32a323950 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/export_model.py @@ -0,0 +1,54 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
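+
+# Export script: loads a trained PairwiseMatching checkpoint, switches it to
+# eval mode, converts it to a static graph with paddle.jit.to_static using two
+# [None, None] int64 InputSpecs (input_ids, token_type_ids), and saves it under
+# <output_path>/inference. The exported inference.predict.* files are the ones
+# that deploy/python/predict.py and scripts/export_to_serving.sh expect, e.g.
+#     python export_model.py --params_path checkpoints/model_30000/model_state.pdparams \
+#         --output_path=./output --model_name_or_path ernie-3.0-medium-zh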
+ +import argparse +import os + +import paddle +from model import PairwiseMatching + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="The pretrained model used for training") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = PairwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/neural_search/ranking/ernie_matching/export_to_serving.py b/applications/neural_search/ranking/ernie_matching/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..1ba681a4dfb14a43a5f91fa9c4cf632b4e6e827e --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/export_to_serving.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. 
Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/neural_search/ranking/ernie_matching/model.py b/applications/neural_search/ranking/ernie_matching/model.py new file mode 100644 index 0000000000000000000000000000000000000000..205148f00524d1cdcb78adefc2f920a7ef8ffd59 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/model.py @@ -0,0 +1,75 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
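+
+# PairwiseMatching: the pair-wise matching network used for ranking. It feeds
+# the concatenated query/title pair through a pretrained encoder, maps the
+# pooled output through Linear(hidden_size, 1) + sigmoid to a similarity score,
+# and trains with margin_ranking_loss between the positive-pair and
+# negative-pair scores (roughly max(0, margin - (pos_sim - neg_sim))).
+# predict() is decorated with paddle.jit.to_static so it can be exported for
+# inference.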
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class PairwiseMatching(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.1): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + self.margin = margin + + # hidden_size -> 1, calculate similarity + self.similarity = nn.Linear(self.ptm.config["hidden_size"], 1) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def predict(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + cls_embedding = self.dropout(cls_embedding) + sim_score = self.similarity(cls_embedding) + sim_score = F.sigmoid(sim_score) + + return sim_score + + def forward( + self, + pos_input_ids, + neg_input_ids, + pos_token_type_ids=None, + neg_token_type_ids=None, + pos_position_ids=None, + neg_position_ids=None, + pos_attention_mask=None, + neg_attention_mask=None, + ): + + _, pos_cls_embedding = self.ptm(pos_input_ids, pos_token_type_ids, pos_position_ids, pos_attention_mask) + + _, neg_cls_embedding = self.ptm(neg_input_ids, neg_token_type_ids, neg_position_ids, neg_attention_mask) + + pos_embedding = self.dropout(pos_cls_embedding) + neg_embedding = self.dropout(neg_cls_embedding) + + pos_sim = self.similarity(pos_embedding) + neg_sim = self.similarity(neg_embedding) + + pos_sim = F.sigmoid(pos_sim) + neg_sim = F.sigmoid(neg_sim) + + labels = paddle.full(shape=[pos_cls_embedding.shape[0]], fill_value=1.0, dtype="float32") + + loss = F.margin_ranking_loss(pos_sim, neg_sim, labels, margin=self.margin) + + return loss diff --git a/applications/neural_search/ranking/ernie_matching/predict_pairwise.py b/applications/neural_search/ranking/ernie_matching/predict_pairwise.py new file mode 100644 index 0000000000000000000000000000000000000000..3abbfeb35589c9ab09419612462f8ae1b3840bcb --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/predict_pairwise.py @@ -0,0 +1,105 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
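+
+# Batch prediction for the pair-wise ranking model: reads tab-separated
+# query/title pairs with read_text_pair, scores each pair via
+# PairwiseMatching.predict and prints every input dict together with its
+# "pred_prob", e.g.
+#     python -u -m paddle.distributed.launch --gpus "0" predict_pairwise.py \
+#         --device gpu \
+#         --params_path "./checkpoints/model_30000/model_state.pdparams" \
+#         --batch_size 128 --max_seq_length 64 --input_file 'sort/test_pairwise.csv'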
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_pairwise_example as convert_example +from data import create_dataloader, read_text_pair +from model import PairwiseMatching + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="The pretrained model used for training") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + batch_probs = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + batch_prob = model.predict(input_ids=input_ids, token_type_ids=token_type_ids).numpy() + + batch_probs.append(batch_prob) + if len(batch_prob) == 1: + batch_probs = np.array(batch_probs) + else: + batch_probs = np.concatenate(batch_probs, axis=0) + + return batch_probs + + +if __name__ == "__main__": + paddle.set_device(args.device) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="predict") + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment_ids + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = PairwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + y_probs = predict(model, valid_data_loader) + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + for idx, prob in enumerate(y_probs): + text_pair = valid_ds[idx] + text_pair["pred_prob"] = prob[0] + print(text_pair) diff --git 
a/applications/neural_search/ranking/ernie_matching/scripts/evaluate.sh b/applications/neural_search/ranking/ernie_matching/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..bfb8c120a4cf852af840acb9a42d7594ac42977a --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/scripts/evaluate.sh @@ -0,0 +1,16 @@ +unset CUDA_VISIBLE_DEVICES +# gpu +python -u -m paddle.distributed.launch --gpus "0" evaluate.py \ + --device gpu \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \ + --test_file sort/dev_pairwise.csv + +# cpu +# python evaluate.py \ +# --device cpu \ +# --batch_size 32 \ +# --learning_rate 2E-5 \ +# --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \ +# --test_file sort/dev_pairwise.csv \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/scripts/export_model.sh b/applications/neural_search/ranking/ernie_matching/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..f6849a95eb80777aed443c4288828cb894ac57d8 --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/scripts/export_model.sh @@ -0,0 +1,3 @@ +python export_model.py --params_path checkpoints/model_30000/model_state.pdparams \ + --output_path=./output \ + --model_name_or_path ernie-3.0-medium-zh \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/scripts/export_to_serving.sh b/applications/neural_search/ranking/ernie_matching/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..c252f811e29c1f741cbe9ba64f368ae5c914900d --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/scripts/export_to_serving.sh @@ -0,0 +1,7 @@ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.predict.pdmodel" \ + --params_filename "inference.predict.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "predict" diff --git a/applications/neural_search/ranking/ernie_matching/scripts/predict_pairwise.sh b/applications/neural_search/ranking/ernie_matching/scripts/predict_pairwise.sh new file mode 100644 index 0000000000000000000000000000000000000000..fe0767e14bfabb6441aafb3829c650b0be42d42d --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/scripts/predict_pairwise.sh @@ -0,0 +1,15 @@ +# gpu +python -u -m paddle.distributed.launch --gpus "0" \ + predict_pairwise.py \ + --device gpu \ + --params_path "./checkpoints/model_30000/model_state.pdparams"\ + --batch_size 128 \ + --max_seq_length 64 \ + --input_file 'sort/test_pairwise.csv' +# cpu +# python predict_pairwise.py \ +# --device gpu \ +# --params_path "./checkpoints/model_30000/model_state.pdparams"\ +# --batch_size 128 \ +# --max_seq_length 64 \ +# --input_file 'sort/test_pairwise.csv' \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/scripts/train_pairwise.sh b/applications/neural_search/ranking/ernie_matching/scripts/train_pairwise.sh new file mode 100644 index 0000000000000000000000000000000000000000..a95169e5b155386bda9202285c7f5ee53fa3fddb --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/scripts/train_pairwise.sh @@ -0,0 +1,21 @@ +# gpu +python -u -m paddle.distributed.launch --gpus="0,1,2,3" train_pairwise.py \ + --device gpu \ + --save_dir ./checkpoints \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --margin 0.1 \ 
+ --eval_step 100 \ + --train_file sort/train_pairwise.csv \ + --test_file sort/dev_pairwise.csv + +# cpu +# python train_pairwise.py \ +# --device cpu \ +# --save_dir ./checkpoints \ +# --batch_size 32 \ +# --learning_rate 2E-5 \ +# --margin 0.1 \ +# --eval_step 100 \ +# --train_file sort/train_pairwise.csv \ +# --test_file sort/dev_pairwise.csv \ No newline at end of file diff --git a/applications/neural_search/ranking/ernie_matching/train_pairwise.py b/applications/neural_search/ranking/ernie_matching/train_pairwise.py new file mode 100644 index 0000000000000000000000000000000000000000..8d51cd55b5ec2d7ae984e1ce5b8911f270cd471b --- /dev/null +++ b/applications/neural_search/ranking/ernie_matching/train_pairwise.py @@ -0,0 +1,209 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import pandas as pd +from data import convert_pairwise_example as convert_example +from data import create_dataloader +from model import PairwiseMatching +from tqdm import tqdm + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--margin", default=0.2, type=float, help="Margin for pos_score and neg_score.") +parser.add_argument("--train_file", type=str, required=True, help="The full path of train file") +parser.add_argument("--test_file", type=str, required=True, help="The full path of test file") +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=200, type=int, help="Step interval for evaluation.") +parser.add_argument('--save_step', default=10000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="The pretrained model used for training") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
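+        phase(obj:`str`, optional): Name of the dataset split being evaluated,
+            used only in the printed "eval_{phase} auc" log line.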
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model.predict(input_ids=input_ids, token_type_ids=token_type_ids) + + neg_probs = 1.0 - pos_probs + + preds = np.concatenate((neg_probs, pos_probs), axis=1) + metric.update(preds=preds, labels=labels) + + print("eval_{} auc:{:.3}".format(phase, metric.accumulate())) + metric.reset() + model.train() + + +# 构建读取函数,读取原始数据 +def read(src_path, is_predict=False): + data = pd.read_csv(src_path, sep="\t") + for index, row in tqdm(data.iterrows()): + query = row["query"] + title = row["title"] + neg_title = row["neg_title"] + yield {"query": query, "title": title, "neg_title": neg_title} + + +def read_test(src_path, is_predict=False): + data = pd.read_csv(src_path, sep="\t") + for index, row in tqdm(data.iterrows()): + query = row["query"] + title = row["title"] + label = row["label"] + yield {"query": query, "title": title, "label": label} + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset(read, src_path=args.train_file, lazy=False) + dev_ds = load_dataset(read_test, src_path=args.test_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func_train = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + trans_func_eval = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="eval") + + batchify_fn_train = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # pos_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # pos_pair_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # neg_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # neg_pair_segment + ): [data for data in fn(samples)] + + batchify_fn_eval = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn_train, trans_fn=trans_func_train + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn_eval, trans_fn=trans_func_eval + ) + + model = PairwiseMatching(pretrained_model, margin=args.margin) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
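+    # `decay_params` keeps the names of all parameters except biases and
+    # LayerNorm weights; `apply_decay_param_fun` below then tells AdamW to
+    # apply weight decay only to parameters whose names appear in this list.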
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + metric = paddle.metric.Auc() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids = batch + + loss = model( + pos_input_ids=pos_input_ids, + neg_input_ids=neg_input_ids, + pos_token_type_ids=pos_token_type_ids, + neg_token_type_ids=neg_token_type_ids, + ) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, metric, dev_data_loader, "dev") + + if global_step % args.save_step == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/applications/neural_search/recall/in_batch_negative/README.md b/applications/neural_search/recall/in_batch_negative/README.md new file mode 100644 index 0000000000000000000000000000000000000000..aa07d6908480146ddc401a92e75b4386f864d89e --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/README.md @@ -0,0 +1,635 @@ +# In-batch Negatives + + **目录** + +* [背景介绍](#背景介绍) +* [In-batch Negatives](#In-batchNegatives) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +语义索引(可通俗理解为向量索引)技术是搜索引擎、推荐系统、广告系统在召回阶段的核心技术之一。语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。语义索引模型的效果直接决定了语义相关的物料能否被成功召回进入系统参与上层排序,从基础层面影响整个系统的效果。 + +在召回阶段,最常见的方式是通过双塔模型,学习Document(简写为Doc)的向量表示,对Doc端建立索引,用ANN召回。我们在这种方式的基础上,引入语义索引策略 [In-batch Negatives](https://arxiv.org/abs/2004.04906),以如下Batch size=4的训练数据为例: + + +``` +我手机丢了,我想换个手机 我想买个新手机,求推荐 +求秋色之空漫画全集 求秋色之空全集漫画 +学日语软件手机上的 手机学日语的软件 +侠盗飞车罪恶都市怎样改车 侠盗飞车罪恶都市怎么改车 +``` + +In-batch Negatives 策略的训练数据为语义相似的 Pair 对,策略核心是在 1 个 Batch 内同时基于 N 个负例进行梯度更新,将Batch 内除自身之外其它所有 Source Text 的相似文本 Target Text 作为负例,例如: 上例中“我手机丢了,我想换个手机” 有 1 个正例(”我想买个新手机,求推荐“),3 个负例(1.求秋色之空全集漫画,2.手机学日语的软件,3.侠盗飞车罪恶都市怎么改车)。 + + + + +# In-batch Negatives + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +双塔模型,在召回训练阶段引入In-batch Negatives 策略,使用hnswlib建立索引库,进行召回测试。 + + +### 评估指标 + +采用 Recall@1,Recall@5 ,Recall@10 ,Recall@20 和 Recall@50 指标来评估语义索引模型的召回效果。 + +Recall@K召回率是指预测的前topK(top-k是指从最后的按得分排序的召回列表中返回前k个结果)结果中检索出的相关结果数和库中所有的相关结果数的比率,衡量的是检索系统的查全率。 + +**效果评估** + +| 策略 | 模型 | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 | +| ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- | +| In-batch Negatives | ernie 1.0 | 51.301 | 65.309| 69.878| 73.996|78.881| +| In-batch Negatives | rocketqa-zh-base-query-encoder | **59.622** | **75.089**| **79.668**| **83.404**|**87.773**| + + + + +## 2. 
环境依赖 + +推荐使用GPU进行训练,在预测阶段使用CPU或者GPU均可。 + +**环境依赖** +* python >= 3.6.2 +* paddlepaddle >= 2.2.3 +* paddlenlp >= 2.2 +* [hnswlib](https://github.com/nmslib/hnswlib) >= 0.5.2 +* visualdl >= 2.2.2 + + + +## 3. 代码结构 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— base_model.py # 语义索引模型基类 +|—— train_batch_neg.py # In-batch Negatives 策略的训练主脚本 +|—— batch_negative + |—— model.py # In-batch Negatives 策略核心网络结构 +|—— ann_util.py # Ann 建索引库相关函数 + + +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— export_model.py # 动态图转换成静态图 +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— predict.sh # 预测 bash 版本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train_batch_neg.sh # 训练 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 +|—— deploy + |—— python + |—— predict.py # PaddleInference + |—— deploy.sh # Paddle Inference 部署脚本 + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 +|—— inference.py # 动态图抽取向量 +|—— export_to_serving.py # 静态图转 Serving + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +我们基于某文献检索平台数据,构造面向语义索引的训练集、测试集、召回库。 + +**训练集** 和 **验证集** 格式一致,训练集4k条,测试集2w条,每行由一对语义相似的文本Pair构成,以tab符分割,第一列是检索query,第二列由相关文献标题(+关键词)构成。样例数据如下: + +``` +宁夏社区图书馆服务体系布局现状分析 宁夏社区图书馆服务体系布局现状分析社区图书馆,社区图书馆服务,社区图书馆服务体系 +人口老龄化对京津冀经济 京津冀人口老龄化对区域经济增长的影响京津冀,人口老龄化,区域经济增长,固定效应模型 +英语广告中的模糊语 模糊语在英语广告中的应用及其功能模糊语,英语广告,表现形式,语用功能 +甘氨酸二肽的合成 甘氨酸二肽合成中缩合剂的选择甘氨酸,缩合剂,二肽 +``` + +**召回库** 用于模拟业务线上的全量语料库,评估模型的召回效果,计算相应的Recall指标。召回库总共30万条样本,每行由一列构成,文献标题(+关键词),样例数据如下: +``` +陕西省贫困地区城乡青春期少女生长发育调查青春期,生长发育,贫困地区 +五丈岩水库溢洪道加固工程中的新材料应用碳纤维布,粘钢加固技术,超细水泥,灌浆技术 +木塑复合材料在儿童卫浴家具中的应用探索木塑复合材料,儿童,卫浴家具 +泡沫铝准静态轴向压缩有限元仿真泡沫铝,准静态,轴向压缩,力学特性 +``` + + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + + +## 5. 模型训练 + +**语义索引训练模型下载链接:** + +以下模型结构参数为: `TrasformerLayer:12, Hidden:768, Heads:12, OutputEmbSize: 256` + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[batch_neg](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip)|
ernie 1.0 margin:0.2 scale:30 epoch:3 lr:5E-5 bs:64 max_len:64 | 4卡 v100-16g
|f3e5c7d7b0b718c2530c5e1b136b2d74| + + +### 训练环境说明 + +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡, 基于 In-batch Negatives 策略训练模型,数据量比较小,几分钟就可以完成。如果采用单机单卡训练,只需要把`--gpus`参数设置成单卡的卡号即可。 + +如果使用CPU进行训练,则需要吧`--gpus`参数去除,然后吧`device`设置成cpu即可,详细请参考train_batch_neg.sh文件的训练设置 + +然后运行下面的命令使用GPU训练,得到语义索引模型: + +``` +root_path=inbatch +python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ + train_batch_neg.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --output_emb_size 256 \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --save_steps 10 \ + --max_seq_length 64 \ + --margin 0.2 \ + --train_set_file recall/train.csv \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --recall_num 50 \ + --similar_text_pair_file "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" +``` + +参数含义说明 + +* `device`: 使用 cpu/gpu 进行训练 +* `save_dir`: 模型存储路径 +* `batch_size`: 训练的batch size的大小 +* `learning_rate`: 训练的学习率的大小 +* `epochs`: 训练的epoch数 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `max_seq_length`: 输入序列的最大长度 +* `margin`: 正样本相似度与负样本之间的目标 Gap +* `train_set_file`: 训练集文件 +* `evaluate`: 是否开启边训练边评估模型训练效果,默认开启 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair_file`: 由相似文本对构成的评估集 +* `corpus_file`: 召回库数据 corpus_file +* `use_recompute`: 使用Recompute策略,用于节省显存,是一种以时间换空间的技术 +* `use_gradient_cache`: 使用Gradient Cache策略,用于节省显存,是一种以时间换空间的技术 +* `chunk_numbers`: 使用Gradient Cache策略的参数,表示的是同一个批次的样本分几次执行 + +也可以使用bash脚本: + +``` +sh scripts/train.sh +``` + + + + +## 6. 评估 + +效果评估分为 4 个步骤: + +a. 获取Doc端Embedding + +基于语义索引模型抽取出Doc样本库的文本向量。 + +b. 采用hnswlib对Doc端Embedding建库 + +使用 ANN 引擎构建索引库(这里基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +c. 获取Query的Embedding并查询相似结果 + +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 2 步中建立的索引库中进行 ANN 查询,召回 Top50 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件。 + +d. 
评估 + +基于评估集 `dev.csv` 和召回结果 `recall_result` 计算评估指标 Recall@k,其中k取值1,5,10,20,50。 + +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" +``` +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `params_path`: 待评估模型的参数文件名 +* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 +* `corpus_file`: 召回库数据 corpus_file + +也可以使用下面的bash脚本: + +``` +sh scripts/run_build_index.sh +``` + +run_build_index.sh还包含cpu和gpu运行的脚本,默认是gpu的脚本 + +成功运行结束后,会在 `./recall_result_dir/` 目录下产出 `recall_result.txt` 文件 + +``` +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 热处理对尼龙6及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响尼龙6,聚酰胺嵌段共聚物,芳香聚酰胺,热处理 0.9831992387771606 +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 热处理方法对高强高模聚乙烯醇纤维性能的影响聚乙烯醇纤维,热处理,性能,热拉伸,热定型 0.8438636660575867 +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 制备工艺对PVC/ABS合金力学性能和维卡软化温度的影响PVC,ABS,正交试验,力学性能,维卡软化温度 0.8130228519439697 +..... +``` + + +接下来,运行如下命令进行效果评估,产出Recall@1, Recall@5, Recall@10, Recall@20 和 Recall@50 指标: +``` +python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 +``` +也可以使用下面的bash脚本: + +``` +sh scripts/evaluate.sh +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +成功运行结束后,会输出如下评估指标: + +``` +recall@1=51.261 +recall@5=65.279 +recall@10=69.848 +recall@20=73.971 +recall@50=78.84 +``` + + + +## 7. 预测 + +我们可以基于语义索引模型预测文本的语义向量或者计算文本 Pair 的语义相似度。 + +### 7.1 功能一:抽取文本的语义向量 + +修改 inference.py 文件里面输入文本 id2corpus 和模型路径 params_path : + +``` +params_path='checkpoints/inbatch/model_40/model_state.pdparams' +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +``` +然后运行: +``` +python inference.py +``` +预测结果为256维的向量: + +``` +[1, 256] +[[ 0.07766181 -0.13780491 0.03388524 -0.14910668 -0.0334941 0.06780092 + 0.0104043 0.03168401 0.02605671 0.02088691 0.05520441 -0.0852212 + ..... 
+``` + +### 7.2 功能二:计算文本 Pair 的语义相似度 + + +### 准备预测数据 + +待预测数据为 tab 分隔的 csv 文件,每一行为 1 个文本 Pair,部分示例如下: +``` +试论我国海岸带经济开发的问题与前景 试论我国海岸带经济开发的问题与前景海岸带,经济开发,问题,前景 +外语阅读焦虑与英语成绩及性别的关系 外语阅读焦虑与英语成绩及性别的关系外语阅读焦虑,外语课堂焦虑,英语成绩,性别 +数字图书馆 智能化图书馆 +网络健康可信性研究 网络成瘾少年 +``` + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的 [In-batch Negatives](https://arxiv.org/abs/2004.04906) 策略语义索引模型开始计算文本 Pair 的语义相似度: +``` +root_dir="checkpoints/inbatch" + +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `params_path`: 预训练模型的参数文件名 +* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `text_pair_file`: 由文本 Pair 构成的待预测数据集 + +也可以运行下面的bash脚本: + +``` +sh scripts/predict.sh +``` +predict.sh文件包含了cpu和gpu运行的脚本,默认是gpu运行的脚本 + +产出如下结果 +``` +0.9717282652854919 +0.9371012449264526 +0.7968897223472595 +0.30377304553985596 +``` + + + +## 8. 部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference预测 + +预测既可以抽取向量也可以计算两个文本的相似度。 + +修改id2corpus的样本: + +``` +# 抽取向量 +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +# 计算相似度 +corpus_list=[['中西方语言与文化的差异','中西方文化差异以及语言体现中西方文化,差异,语言体现'], + ['中西方语言与文化的差异','飞桨致力于让深度学习技术的创新与应用更简单']] + +``` + +然后使用PaddleInference + +``` +python deploy/python/predict.py \ + --model_dir=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +最终输出的是256维度的特征向量和句子对的预测概率: + +``` +(1, 256) +[[-0.0394925 -0.04474756 -0.065534 0.00939134 0.04359895 0.14659195 + -0.0091779 -0.07303623 0.09413272 -0.01255222 -0.08685658 0.02762237 + 0.10138468 0.00962821 0.10888419 0.04553023 0.05898942 0.00694253 + .... 
+ +[0.959269642829895, 0.04725276678800583] +``` + +### Paddle Serving部署 + +Paddle Serving 的详细文档请参考 [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md)和[Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md),首先把静态图模型转换成Serving的格式: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "./serving_server" \ + --client_path "./serving_client" \ + --fetch_alias_names "output_embedding" + +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` + +Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,第二种是C++的方式,下面分别介绍这两种方式的用法: + +#### Pipeline方式 + +修改模型需要用到的`Tokenizer` + +``` +self.tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder") +``` + +然后启动 Pipeline Server: + +``` +cd deploy/python +python web_service.py +``` + +启动客户端调用 Server。 + +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [ + "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据", + "试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比" +] +``` +然后运行: + +``` +python deploy/python/rpc_client.py +``` +模型的输出为: + +``` +{'0': '国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据', '1': '试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比'} +PipelineClient::predict pack_data time:1641450851.3752182 +PipelineClient::predict before time:1641450851.375738 +['output_embedding'] +(2, 256) +[[ 0.07830612 -0.14036864 0.03433796 -0.14967982 -0.03386067 0.06630666 + 0.01357943 0.03531194 0.02411093 0.02000859 0.05724002 -0.08119463 + ...... +``` + +可以看到客户端发送了2条文本,返回了2个 embedding 向量 + +#### C++的方式 + +启动C++的Serving: + +``` +python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_id 2 --thread 5 --ir_optim True --use_trt --precision FP16 +``` +也可以使用脚本: + +``` +sh deploy/cpp/start_server.sh +``` +Client 可以使用 http 或者 rpc 两种方式,rpc 的方式为: + +``` +python deploy/cpp/rpc_client.py +``` +运行的输出为: +``` +I0209 20:40:07.978225 20896 general_model.cpp:490] [client]logid=0,client_cost=395.695ms,server_cost=392.559ms. +time to cost :0.3960278034210205 seconds +{'output_embedding': array([[ 9.01343748e-02, -1.21870913e-01, 1.32834800e-02, + -1.57673359e-01, -2.60387752e-02, 6.98455423e-02, + 1.58108603e-02, 3.89952064e-02, 3.22783105e-02, + 3.49135026e-02, 7.66086206e-02, -9.12970975e-02, + 6.25643134e-02, 7.21886680e-02, 7.03565404e-02, + 5.44054210e-02, 3.25332815e-03, 5.01751155e-02, +...... +``` +可以看到服务端返回了向量 + +或者使用 http 的客户端访问模式: + +``` +python deploy/cpp/http_client.py +``` +运行的输出为: + +``` +(2, 64) +(2, 64) +outputs { + tensor { + float_data: 0.09013437479734421 + float_data: -0.12187091261148453 + float_data: 0.01328347995877266 + float_data: -0.15767335891723633 +...... +``` +可以看到服务端返回了向量 + +## FAQ + +#### 如何基于无监督SimCSE训练出的模型参数作为参数初始化继续做有监督 In-Batch Negative 训练? 
+ ++ 使用 `--init_from_ckpt` 参数加载即可,下面是使用示例: + +``` +python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ + train_batch_neg.py \ + --device gpu \ + --save_dir ./checkpoints/simcse_inbatch_negative \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --output_emb_size 256 \ + --save_steps 10 \ + --max_seq_length 64 \ + --margin 0.2 \ + --train_set_file recall/train.csv \ + --init_from_ckpt simcse/model_20000/model_state.pdparams +``` + + + +## Reference + +[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. diff --git a/applications/neural_search/recall/in_batch_negative/ann_util.py b/applications/neural_search/recall/in_batch_negative/ann_util.py new file mode 100644 index 0000000000000000000000000000000000000000..a76b916a7e300355660aebb3580ae19ff442955a --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/ann_util.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import numpy as np +import hnswlib +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space="ip", dim=args.output_emb_size if args.output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=args.hnsw_max_elements, ef_construction=args.hnsw_ef, M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/applications/neural_search/recall/in_batch_negative/base_model.py b/applications/neural_search/recall/in_batch_negative/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..99466292bccb7cbc99d10547cb5b06eb18782b35 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/base_model.py @@ -0,0 +1,161 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
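+
+# SemanticIndexBase builds a sentence embedding from the pooled [CLS] output
+# of the pretrained model, optionally projects it to `output_emb_size`
+# dimensions, and L2-normalizes it, so the inner product used by the ANN
+# index is equivalent to cosine similarity.
+# SemanticIndexBaseStatic is the paddle.jit.to_static variant of the same
+# model, used when exporting a static graph for inference.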
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + self.ptm.config.hidden_size, output_emb_size, weight_attr=weight_attr + ) + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + self.ptm.config.hidden_size, output_emb_size, weight_attr=weight_attr + ) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + 
with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding diff --git a/applications/neural_search/recall/in_batch_negative/batch_negative/model.py b/applications/neural_search/recall/in_batch_negative/batch_negative/model.py new file mode 100644 index 0000000000000000000000000000000000000000..bf9da27df57a79cdd98ca11fa483648aa31bf569 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/batch_negative/model.py @@ -0,0 +1,109 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
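+
+# The in-batch negatives loss implemented below works on a batch of
+# (query, title) pairs: the normalized query and title embeddings are
+# multiplied into a [batch_size, batch_size] similarity matrix, the margin
+# is subtracted from the diagonal (the positive pairs), the matrix is scaled
+# by `scale`, and cross entropy with labels arange(batch_size) is applied,
+# so every other title in the same batch acts as a negative for each query.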
+ + +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexBatchNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # Substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # Scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss + + +class SemanticIndexCacheNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # Substract margin from all positive samples cosine_sim() + margin_diag = paddle.full(shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=cosine_sim.dtype) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # Scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + return [cosine_sim, labels, query_cls_embedding, title_cls_embedding] diff --git a/applications/neural_search/recall/in_batch_negative/data.py b/applications/neural_search/recall/in_batch_negative/data.py new file mode 100644 index 0000000000000000000000000000000000000000..d35f0985be17ddc518e6473327438b69ff361c9f --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/data.py @@ -0,0 +1,154 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import paddle +from paddlenlp.utils.log import logger + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def read_text_triplet(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"text": data[0], "pos_sample": data[1], "neg_sample": data[2]} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpoint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succeed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using latest ann_data_file:{}".format(latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/applications/neural_search/recall/in_batch_negative/deploy/cpp/http_client.py b/applications/neural_search/recall/in_batch_negative/deploy/cpp/http_client.py new file mode 100644 index 
0000000000000000000000000000000000000000..4115859f993814836d0fc7850e1936e4ba185f05 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/cpp/http_client.py @@ -0,0 +1,73 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import time + +import numpy as np +from paddle_serving_client import HttpClient + +from paddlenlp.transformers import AutoTokenizer + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=True): + list_input_ids = [] + list_token_type_ids = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + list_input_ids.append(input_ids) + list_token_type_ids.append(token_type_ids) + return list_input_ids, list_token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:9393"] +client = HttpClient() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ +print(feed_names) +print(fetch_names) + +# 创建tokenizer +tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder") +max_seq_len = 64 + +# 数据预处理 + +list_data = ["国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据.", "面向生态系统服务的生态系统分类方案研发与应用"] +# for i in range(5): +# list_data.extend(list_data) +# print(len(list_data)) +examples = convert_example(list_data, tokenizer, max_seq_length=max_seq_len) +print(examples) + +feed_dict = {} +feed_dict["input_ids"] = np.array(examples[0]) +feed_dict["token_type_ids"] = np.array(examples[1]) + +print(feed_dict["input_ids"].shape) +print(feed_dict["token_type_ids"].shape) + +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +b_end = time.time() +print(result) +print("time to cost :{} seconds".format(b_end - b_start)) diff --git a/applications/neural_search/recall/in_batch_negative/deploy/cpp/rpc_client.py b/applications/neural_search/recall/in_batch_negative/deploy/cpp/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..8938c8ce32c735e13f8790d48a81f21413a55ea1 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/cpp/rpc_client.py @@ -0,0 +1,71 @@ +# coding:utf-8 +# pylint: disable=doc-string-missing +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import time + +import numpy as np +from paddle_serving_client import Client + +from paddlenlp.transformers import AutoTokenizer + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=True): + list_input_ids = [] + list_token_type_ids = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + list_input_ids.append(input_ids) + list_token_type_ids.append(token_type_ids) + return list_input_ids, list_token_type_ids + + +# 启动python客户端 +endpoint_list = ["127.0.0.1:9393"] +client = Client() +client.load_client_config("serving_client") +client.connect(endpoint_list) +feed_names = client.feed_names_ +fetch_names = client.fetch_names_ +print(feed_names) +print(fetch_names) + +# 创建tokenizer +tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder") +max_seq_len = 64 + +# 数据预处理 + +list_data = ["国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据.", "面向生态系统服务的生态系统分类方案研发与应用"] +# for i in range(5): +# list_data.extend(list_data) +# print(len(list_data)) +examples = convert_example(list_data, tokenizer, max_seq_length=max_seq_len) +print(examples) + +feed_dict = {} +feed_dict["input_ids"] = np.array(examples[0]) +feed_dict["token_type_ids"] = np.array(examples[1]) + +print(feed_dict["input_ids"].shape) +print(feed_dict["token_type_ids"].shape) +# batch设置为True表示的是批量预测 +b_start = time.time() +result = client.predict(feed=feed_dict, fetch=fetch_names, batch=True) +b_end = time.time() +print("time to cost :{} seconds".format(b_end - b_start)) +print(result) diff --git a/applications/neural_search/recall/in_batch_negative/deploy/cpp/start_server.sh b/applications/neural_search/recall/in_batch_negative/deploy/cpp/start_server.sh new file mode 100644 index 0000000000000000000000000000000000000000..55d380d6f87396887675a008c54bb8544ce2a793 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/cpp/start_server.sh @@ -0,0 +1 @@ +python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_id 2 --thread 5 --ir_optim True --use_trt --precision FP16 \ No newline at end of file diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/config_nlp.yml b/applications/neural_search/recall/in_batch_negative/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..1af6298427f4c90c02bfb9c3dc0142002fd58800 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 18082 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, 
grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + #ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '2' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/deploy.sh b/applications/neural_search/recall/in_batch_negative/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..fe8f071e0a47a47f5dc24d84ea4eaaf8e7503c06 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/python/deploy.sh @@ -0,0 +1 @@ +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/predict.py b/applications/neural_search/recall/in_batch_negative/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..7d493b91a8c18ab64d9cd7c2b29edad688266866 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/python/predict.py @@ -0,0 +1,265 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import sys + +import paddle +from paddle import inference +from scipy import spatial + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.append(".") + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="model name.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name=args.model_name_or_path, + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def extract_embedding(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the feature vectors. 
+ """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example(text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return logits + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions probs. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for idx, text in enumerate(data): + input_ids, segment_ids = convert_example({idx: text[0]}, tokenizer) + title_ids, title_segment_ids = convert_example({idx: text[1]}, tokenizer) + examples.append((input_ids, segment_ids, title_ids, title_segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(query_ids) + self.input_handles[1].copy_from_cpu(query_segment_ids) + self.predictor.run() + query_logits = self.output_handle.copy_to_cpu() + + self.input_handles[0].copy_from_cpu(title_ids) + self.input_handles[1].copy_from_cpu(title_segment_ids) + self.predictor.run() + title_logits = self.output_handle.copy_to_cpu() + + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + result = [float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)] + return result + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + # ErnieTinyTokenizer is special for ernie-tiny pretained model. 
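+    # The demo below first extracts a 256-dim embedding for a single corpus
+    # entry via extract_embedding, then scores two text pairs with predict,
+    # which returns the cosine similarity of the two inference outputs.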
+ output_emb_size = 256 + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = {0: "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据"} + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + res = predictor.extract_embedding(corpus_list, tokenizer) + print(res.shape) + print(res) + corpus_list = [["中西方语言与文化的差异", "中西方文化差异以及语言体现中西方文化,差异,语言体现"], ["中西方语言与文化的差异", "飞桨致力于让深度学习技术的创新与应用更简单"]] + res = predictor.predict(corpus_list, tokenizer) + print(res) diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/rpc_client.py b/applications/neural_search/recall/in_batch_negative/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..d46979c75e151522917d66b843f02a0f60a39c75 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/python/rpc_client.py @@ -0,0 +1,36 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time +import numpy as np + +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = ["国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据", "试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比"] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = item + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) + +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/web_service.py b/applications/neural_search/recall/in_batch_negative/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..a4730d721110cfc47531769fabc4d5733e83a131 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/deploy/python/web_service.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
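+
+# ErnieOp tokenizes the texts received in the request dict in preprocess(),
+# feeds input_ids/token_type_ids to the inference model configured in
+# config_nlp.yml, and returns the pooled embeddings as the string field
+# "output_embedding" in postprocess().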
+ +import logging + +from paddle_serving_server.web_service import Op, WebService + +_LOGGER = logging.getLogger() + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder") + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + input_ids, segment_ids = convert_example([input_dict[str(i)]], self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/neural_search/recall/in_batch_negative/evaluate.py b/applications/neural_search/recall/in_batch_negative/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..4a4236220d3bc5cfff6f6dd478a5121d5788f980 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/evaluate.py @@ -0,0 +1,88 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
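+
+# Standalone Recall@N evaluation for the recall results produced by recall.py.
+# Typical invocation (see scripts/evaluate.sh; the paths are examples):
+#
+#     python -u evaluate.py \
+#         --similar_text_pair "recall/dev.csv" \
+#         --recall_result_file "./recall_result_dir/recall_result.txt" \
+#         --recall_num 50
+#
+# Recall@1/5/10/20/50 are printed and appended as a timestamped row to result.tsv.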
+
+import argparse
+import time
+
+import numpy as np
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--similar_text_pair", type=str,
+                    default='', help="The full path of the similar text pair file")
+parser.add_argument("--recall_result_file", type=str,
+                    default='', help="The full path of the recall result file")
+parser.add_argument("--recall_num", type=int, default=10,
+                    help="Number of most similar docs recalled from the corpus per query")
+
+
+args = parser.parse_args()
+
+
+def recall(rs, N=10):
+    """
+    Ratio of queries whose ground truth appears in the top-N recalled docs.
+
+    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
+    >>> recall(rs, N=1)
+    0.333333
+    >>> recall(rs, N=2)
+    0.6666667
+    >>> recall(rs, N=3)
+    1.0
+
+    Args:
+        rs: Iterable of per-query relevance flags (1 if the recalled doc is the ground truth, else 0).
+    Returns:
+        Recall@N
+    """
+
+    recall_flags = [np.sum(r[0:N]) for r in rs]
+    return np.mean(recall_flags)
+
+
+if __name__ == "__main__":
+    text2similar = {}
+    with open(args.similar_text_pair, "r", encoding="utf-8") as f:
+        for line in f:
+            text, similar_text = line.rstrip().split("\t")
+            text2similar[text] = similar_text
+
+    rs = []
+
+    with open(args.recall_result_file, "r", encoding="utf-8") as f:
+        relevance_labels = []
+        for index, line in enumerate(f):
+
+            if index % args.recall_num == 0 and index != 0:
+                rs.append(relevance_labels)
+                relevance_labels = []
+
+            text, recalled_text, cosine_sim = line.rstrip().split("\t")
+            if text2similar[text] == recalled_text:
+                relevance_labels.append(1)
+            else:
+                relevance_labels.append(0)
+        # Keep the labels collected for the last query as well.
+        if relevance_labels:
+            rs.append(relevance_labels)
+
+    recall_N = []
+    recall_num = [1, 5, 10, 20, 50]
+    for topN in recall_num:
+        R = round(100 * recall(rs, N=topN), 3)
+        recall_N.append(str(R))
+
+    res = []
+    timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime())
+    res.append(timestamp)
+    for key, val in zip(recall_num, recall_N):
+        print("recall@{}={}".format(key, val))
+        res.append(str(val))
+    # Append one timestamped row of Recall@N results and close the file properly.
+    with open("result.tsv", "a", encoding="utf-8") as result:
+        result.write("\t".join(res) + "\n")
diff --git a/applications/neural_search/recall/in_batch_negative/export_model.py b/applications/neural_search/recall/in_batch_negative/export_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..648ccbc9672b5a02d93e9667861645c0c43e34f1
--- /dev/null
+++ b/applications/neural_search/recall/in_batch_negative/export_model.py
@@ -0,0 +1,56 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
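+
+# Exports a fine-tuned dynamic-graph checkpoint to a static-graph inference
+# model with paddle.jit.to_static / paddle.jit.save. Example invocation
+# (see scripts/export_model.sh; the checkpoint path is an example):
+#
+#     python export_model.py \
+#         --params_path checkpoints/inbatch/model_40/model_state.pdparams \
+#         --model_name_or_path rocketqa-zh-base-query-encoder \
+#         --output_path ./output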
+ +import argparse +import os + +import paddle +from base_model import SemanticIndexBaseStatic + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, + default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="Select model to train, defaults to rocketqa-zh-base-query-encoder.") +parser.add_argument("--output_path", type=str, default='./output', + help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + output_emb_size = 256 + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=output_emb_size) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/neural_search/recall/in_batch_negative/export_to_serving.py b/applications/neural_search/recall/in_batch_negative/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..1ba681a4dfb14a43a5f91fa9c4cf632b4e6e827e --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/export_to_serving.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. 
Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/neural_search/recall/in_batch_negative/inference.py b/applications/neural_search/recall/in_batch_negative/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..a49d261a3be97650ecb9d34f949b3e7bb8ed4426 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/inference.py @@ -0,0 +1,75 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
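+
+# Minimal dynamic-graph sanity check: loads a fine-tuned SemanticIndexBaseStatic
+# checkpoint and prints the pooled embedding of a single sample corpus entry.
+# params_path, model_name_or_path and the sample text are hard-coded below;
+# edit them to match your own checkpoint before running `python inference.py`.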
+ +from functools import partial +import os + +import paddle +from paddlenlp.data import Tuple, Pad +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +from base_model import SemanticIndexBaseStatic +from data import convert_example, create_dataloader + +if __name__ == "__main__": + device = "gpu" + max_seq_length = 64 + output_emb_size = 256 + batch_size = 1 + params_path = "checkpoints/inbatch/model_40/model_state.pdparams" + id2corpus = {0: "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据"} + model_name_or_path = "rocketqa-zh-base-query-encoder" + paddle.set_device(device) + + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(model_name_or_path) + + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=output_emb_size) + + # Load pretrained semantic model + if params_path and os.path.isfile(params_path): + state_dict = paddle.load(params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + # convert_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + all_embeddings = [] + model.eval() + with paddle.no_grad(): + for batch_data in corpus_data_loader: + input_ids, token_type_ids = batch_data + + text_embeddings = model.get_pooled_embedding(input_ids, token_type_ids) + all_embeddings.append(text_embeddings) + + text_embedding = all_embeddings[0] + print(text_embedding.shape) + print(text_embedding.numpy()) diff --git a/applications/neural_search/recall/in_batch_negative/predict.py b/applications/neural_search/recall/in_batch_negative/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..7390c13851b17c5c47d1bcb5cad40d5258de9fd8 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/predict.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
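+
+# Computes the cosine similarity for every tab-separated text pair in
+# --text_pair_file with a fine-tuned SemanticIndexBase model. Example
+# invocation (see scripts/predict.sh; the paths are examples):
+#
+#     python -u -m paddle.distributed.launch --gpus "0" predict.py \
+#         --device gpu \
+#         --params_path "checkpoints/inbatch/model_40/model_state.pdparams" \
+#         --model_name_or_path rocketqa-zh-base-query-encoder \
+#         --output_emb_size 256 \
+#         --batch_size 128 \
+#         --max_seq_length 64 \
+#         --text_pair_file "recall/test.csv"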
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--text_pair_file", type=str, + required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, + help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--batch_size", default=32, type=int, + help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, + type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", + help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", + help="Whether to pad to max seq length.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. 
+ """ + cosine_sims = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + cosin_sim = predict(model, valid_data_loader) + + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) diff --git a/applications/neural_search/recall/in_batch_negative/recall.py b/applications/neural_search/recall/in_batch_negative/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..d81c2a73540d78b5d97a20f6f60766e76202dc7e --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/recall.py @@ -0,0 +1,134 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
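+
+# Builds an approximate nearest neighbour index (HNSW, via ann_util.build_index)
+# over the corpus embeddings and writes the top --recall_num candidates for each
+# query in --similar_text_pair_file to the recall result file. Example invocation
+# (see scripts/run_build_index.sh; the paths are examples):
+#
+#     python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \
+#         recall.py \
+#         --device gpu \
+#         --recall_result_dir "recall_result_dir" \
+#         --recall_result_file "recall_result.txt" \
+#         --params_path "checkpoints/inbatch/model_40/model_state.pdparams" \
+#         --model_name_or_path rocketqa-zh-base-query-encoder \
+#         --hnsw_m 100 --hnsw_ef 100 \
+#         --batch_size 64 --output_emb_size 256 --max_seq_length 64 \
+#         --recall_num 50 \
+#         --similar_text_pair "recall/dev.csv" \
+#         --corpus_file "recall/corpus.csv"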
+ +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, gen_id2corpus, gen_text_file + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, + help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, + required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', + help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, + default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, + help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, + help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, + type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, + help="Recall number for each query from Ann index.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--hnsw_m", default=100, type=int, + help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, + help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, + type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", + help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in 
id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/neural_search/recall/in_batch_negative/scripts/evaluate.sh b/applications/neural_search/recall/in_batch_negative/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..84d6f162b80ea2bf2d41b1947c61a77503c00264 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/scripts/evaluate.sh @@ -0,0 +1,4 @@ +python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 \ No newline at end of file diff --git a/applications/neural_search/recall/in_batch_negative/scripts/export_model.sh b/applications/neural_search/recall/in_batch_negative/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..99d01c7b5aae4173fd1508f2d74e6f5f7696a7fa --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/scripts/export_model.sh @@ -0,0 +1,3 @@ +python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_path=./output \ No newline at end of file diff --git a/applications/neural_search/recall/in_batch_negative/scripts/export_to_serving.sh b/applications/neural_search/recall/in_batch_negative/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..b0d7a422551fd09eb1a28cfacdf47237a8efc795 --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/scripts/export_to_serving.sh @@ -0,0 +1,7 @@ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git a/applications/neural_search/recall/in_batch_negative/scripts/predict.sh b/applications/neural_search/recall/in_batch_negative/scripts/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..3967bb2c9b5dce94568d1dca6436e87005ab9aef --- /dev/null +++ 
b/applications/neural_search/recall/in_batch_negative/scripts/predict.sh @@ -0,0 +1,22 @@ +# gpu version +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" + + +# cpu +# root_dir="checkpoints/inbatch" +# python predict.py \ +# --device cpu \ +# --params_path "${root_dir}/model_40/model_state.pdparams" \ +# --output_emb_size 256 \ +# --batch_size 128 \ +# --max_seq_length 64 \ +# --text_pair_file "recall/test.csv" diff --git a/applications/neural_search/recall/in_batch_negative/scripts/run_build_index.sh b/applications/neural_search/recall/in_batch_negative/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..9920a045b9dcaeaeeb7a4d2d0155bb3cd607bb1f --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/scripts/run_build_index.sh @@ -0,0 +1,46 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU version +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 64 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" + +# CPU version +# python recall.py \ +# --device cpu \ +# --recall_result_dir "recall_result_dir" \ +# --recall_result_file "recall_result.txt" \ +# --params_path "${root_dir}/model_40/model_state.pdparams" \ +# --hnsw_m 100 \ +# --hnsw_ef 100 \ +# --batch_size 64 \ +# --output_emb_size 256\ +# --max_seq_length 60 \ +# --recall_num 50 \ +# --similar_text_pair "recall/dev.csv" \ +# --corpus_file "recall/corpus.csv" \ No newline at end of file diff --git a/applications/neural_search/recall/in_batch_negative/train_batch_neg.py b/applications/neural_search/recall/in_batch_negative/train_batch_neg.py new file mode 100644 index 0000000000000000000000000000000000000000..2f156f5a39d3efd6abc134fb24c35fff572a0ffd --- /dev/null +++ b/applications/neural_search/recall/in_batch_negative/train_batch_neg.py @@ -0,0 +1,434 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from ann_util import build_index +from batch_negative.model import SemanticIndexBatchNeg, SemanticIndexCacheNeg +from data import ( + convert_example, + create_dataloader, + gen_id2corpus, + gen_text_file, + read_text_pair, +) + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=512, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--output_emb_size", default=256, type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=5E-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="cpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Interval steps to save checkpoint") +parser.add_argument('--log_steps', type=int, default=10, help="Interval steps to print log") +parser.add_argument("--train_set_file", type=str, default='./recall/train.csv', help="The full path of train_set_file.") +parser.add_argument("--dev_set_file", type=str, default='./recall/dev.csv', help="The full path of dev_set_file.") +parser.add_argument("--margin", default=0.2, type=float, help="Margin between pos_sample and neg_samples") +parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss") +parser.add_argument("--corpus_file", type=str, default='./recall/corpus.csv', help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, default='./recall/dev.csv', help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='./recall_result_dir', help="The full path of recall result 
file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_init.txt', help="The file name of recall result") +parser.add_argument("--recall_num", default=50, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--evaluate_result", type=str, default='evaluate_result.txt', help="evaluate_result") +parser.add_argument('--evaluate', action='store_true', help='whether evaluate while training') +parser.add_argument("--max_grad_norm", type=float, default=5.0, help="max grad norm for global norm clip") +parser.add_argument("--use_amp", action="store_true", help="Whether to use AMP.") +parser.add_argument("--amp_loss_scale", default=32768, type=float, help="The value of scale_loss for fp16. This is only used for AMP training.") +parser.add_argument("--use_recompute", action='store_true', help="Using the recompute to scale up the batch size and save the memory.") +parser.add_argument("--use_gradient_cache", action='store_true', help="Using the gradient cache to scale up the batch size and save the memory.") +parser.add_argument("--chunk_numbers", type=int, default=50, help="The number of the chunks for model") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def recall(rs, N=10): + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +@paddle.no_grad() +def evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus): + # Load pretrained semantic model + inner_model = model._layers + final_index = build_index(args, corpus_data_loader, inner_model) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) + text2similar = {} + with open(args.similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + rs = [] + with open(recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text == recalled_text: + continue + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + recall_num = [1, 5, 10, 20, 50] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + evaluate_result_file = os.path.join(args.recall_result_dir, args.evaluate_result) + result = 
open(evaluate_result_file, "a") + res = [] + timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime()) + res.append(timestamp) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") + return float(recall_N[1]) + + +def train( + train_data_loader, + model, + optimizer, + lr_scheduler, + rank, + corpus_data_loader, + query_data_loader, + recall_result_file, + text_list, + id2corpus, + tokenizer, +): + global_step = 0 + best_recall = 0.0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + + global_step += 1 + if global_step % args.log_steps == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.log_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if not args.evaluate: + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + if args.evaluate and rank == 0: + print("evaluating") + recall_5 = evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus) + if recall_5 > best_recall: + best_recall = recall_5 + + save_dir = os.path.join(args.save_dir, "model_best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + with open(os.path.join(save_dir, "train_result.txt"), "a", encoding="utf-8") as fp: + fp.write("epoch=%d, global_step: %d, recall: %s\n" % (epoch, global_step, recall_5)) + + +def gradient_cache_train(train_data_loader, model, optimizer, lr_scheduler, rank, tokenizer): + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.amp_loss_scale) + + if args.batch_size % args.chunk_numbers == 0: + chunk_numbers = args.chunk_numbers + else: + raise Exception( + f" Batch_size {args.batch_size} must divides chunk_numbers {args.chunk_numbers} without producing a remainder " + ) + + def split(inputs, chunk_numbers, axis=0): + if inputs.shape[0] % chunk_numbers == 0: + return paddle.split(inputs, chunk_numbers, axis=0) + else: + return paddle.split(inputs, inputs.shape[0], axis=0) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + # Separate large batches into several sub batches + chunked_x = [split(t, chunk_numbers, axis=0) for t in batch] + sub_batchs = [list(s) for s in zip(*chunked_x)] + + all_grads = [] + all_CUDA_rnd_state = [] + all_query = [] + all_title = [] + + for sub_batch in sub_batchs: + all_reps = [] + all_labels = [] + ( + sub_query_input_ids, + sub_query_token_type_ids, + sub_title_input_ids, + sub_title_token_type_ids, + ) = sub_batch + with 
paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + + with paddle.no_grad(): + sub_CUDA_rnd_state = paddle.framework.random.get_cuda_rng_state() + all_CUDA_rnd_state.append(sub_CUDA_rnd_state) + sub_cosine_sim, sub_label, query_embedding, title_embedding = model( + query_input_ids=sub_query_input_ids, + title_input_ids=sub_title_input_ids, + query_token_type_ids=sub_query_token_type_ids, + title_token_type_ids=sub_title_token_type_ids, + ) + all_reps.append(sub_cosine_sim) + all_labels.append(sub_label) + all_title.append(title_embedding) + all_query.append(query_embedding) + + model_reps = paddle.concat(all_reps, axis=0) + model_title = paddle.concat(all_title) + model_query = paddle.concat(all_query) + + model_title = model_title.detach() + model_query = model_query.detach() + + model_query.stop_gradient = False + model_title.stop_gradient = False + model_reps.stop_gradient = False + + model_label = paddle.concat(all_labels, axis=0) + loss = F.cross_entropy(input=model_reps, label=model_label) + loss.backward() + # Store gradients + all_grads.append(model_reps.grad) + + for sub_batch, CUDA_state, grad in zip(sub_batchs, all_CUDA_rnd_state, all_grads): + + ( + sub_query_input_ids, + sub_query_token_type_ids, + sub_title_input_ids, + sub_title_token_type_ids, + ) = sub_batch + paddle.framework.random.set_cuda_rng_state(CUDA_state) + # Recompute the forward propagation + sub_cosine_sim, sub_label, query_embedding, title_embedding = model( + query_input_ids=sub_query_input_ids, + title_input_ids=sub_title_input_ids, + query_token_type_ids=sub_query_token_type_ids, + title_token_type_ids=sub_title_token_type_ids, + ) + # Chain rule + surrogate = paddle.dot(sub_cosine_sim, grad) + # Backward propagation + if args.use_amp: + scaled = scaler.scale(surrogate) + scaled.backward() + else: + surrogate.backward() + # Update model parameters + if args.use_amp: + scaler.minimize(optimizer, scaled) + else: + optimizer.step() + + global_step += 1 + if global_step % args.log_steps == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.log_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path, enable_recompute=args.use_recompute) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, 
dtype="int64"), # title_segment + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + if args.use_gradient_cache: + model = SemanticIndexCacheNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + else: + model = SemanticIndexBatchNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + batchify_fn_dev = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + + id2corpus = gen_id2corpus(args.corpus_file) + + # convert_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=trans_func + ) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=trans_func + ) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + if args.use_gradient_cache: + gradient_cache_train(train_data_loader, model, optimizer, lr_scheduler, rank, tokenizer) + else: + train( + train_data_loader, + model, + optimizer, + lr_scheduler, + rank, + corpus_data_loader, + query_data_loader, + recall_result_file, + text_list, + id2corpus, + tokenizer, + ) + + +if __name__ == "__main__": + do_train() diff --git a/applications/neural_search/recall/milvus/README.md b/applications/neural_search/recall/milvus/README.md new file mode 100644 index 0000000000000000000000000000000000000000..de3f1666b960c547c5a03697cb9f28cd52f9f87c --- /dev/null +++ b/applications/neural_search/recall/milvus/README.md @@ -0,0 +1,220 @@ + **目录** + +* [背景介绍](#背景介绍) +* [Milvus召回](#Milvus召回) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 向量检索](#向量检索) + + + + +# 背景介绍 + +基于某检索平台开源的数据集构造生成了面向语义索引的召回库。 + + + +# Milvus召回 + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +使用 Milvus 搭建召回系统,然后使用训练好的语义索引模型,抽取向量,插入到 Milvus 中,然后进行检索。 + + + +## 2. 
环境依赖和安装说明 + +**环境依赖** +* python >= 3.6.2 +* paddlepaddle >= 2.2 +* paddlenlp >= 2.2 +* milvus >= 2.1.0 +* pymilvus >= 2.1.0 + + + +## 3. 代码结构 + +## 代码结构: + +``` +|—— scripts + |—— feature_extract.sh 提取特征向量的bash脚本 + |—— search.sh 插入向量和向量检索bash脚本 +├── base_model.py # 语义索引模型基类 +├── config.py # milvus配置文件 +├── data.py # 数据处理函数 +├── milvus_ann_search.py # 向量插入和检索的脚本 +├── inference.py # 动态图模型向量抽取脚本 +├── feature_extract.py # 批量抽取向量脚本 +├── milvus_util.py # milvus的工具类 +└── README.md +``` + + +## 4. 数据准备 + +数据集的样例如下,有两种,第一种是 title+keywords 进行拼接;第二种是一句话。 + +``` +煤矸石-污泥基活性炭介导强化污水厌氧消化煤矸石,污泥,复合基活性炭,厌氧消化,直接种间电子传递 +睡眠障碍与常见神经系统疾病的关系睡眠觉醒障碍,神经系统疾病,睡眠,快速眼运动,细胞增殖,阿尔茨海默病 +城市道路交通流中观仿真研究智能运输系统;城市交通管理;计算机仿真;城市道路;交通流;路径选择 +.... +``` + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + +## 5. 向量检索 + +### 5.1 基于Milvus的向量检索系统搭建 + +数据准备结束以后,我们开始搭建 Milvus 的语义检索引擎,用于语义向量的快速检索,我们使用[Milvus](https://milvus.io/)开源工具进行召回,Milvus 的搭建教程请参考官方教程 [Milvus官方安装教程](https://milvus.io/docs/v2.1.x/install_standalone-docker.md)本案例使用的是 Milvus 的2.1版本,建议使用官方的 Docker 安装方式,简单快捷。 + +Milvus 搭建完系统以后就可以插入和检索向量了,首先生成 embedding 向量,每个样本生成256维度的向量,使用的是32GB的V100的卡进行的提取: + +``` +CUDA_VISIBLE_DEVICES=0 python feature_extract.py \ + --model_dir=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --corpus_file "data/milvus_data.csv" +``` +其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。 + +| 数据量 | 时间 | +| ------------ | ------------ | +|1000万条|3hour40min39s| + +运行结束后会生成 corpus_embedding.npy + +生成了向量后,需要把数据插入到 Milvus 库中,首先修改配置: + +修改 config.py 的配置 ip 和端口,本项目使用的是8530端口,而 Milvus 默认的是19530,需要根据情况进行修改: + +``` +MILVUS_HOST='your milvus ip' +MILVUS_PORT = 8530 +``` + +然后运行下面的命令把向量插入到Milvus库中: + +``` +python milvus_ann_search.py --data_path milvus/milvus_data.csv \ + --embedding_path corpus_embedding.npy \ + --batch_size 100000 \ + --insert +``` +参数含义说明 + +* `data_path`: 数据的路径 +* `embedding_path`: 数据对应向量的路径 +* `index`: 选择检索向量的索引,用于向量检索 +* `insert`: 是否插入向量 +* `search`: 是否检索向量 +* `batch_size`: 表示的是一次性插入的向量的数量 + + +| 数据量 | 时间 | +| ------------ | ------------ | +|1000万条|21min12s| + +另外,Milvus提供了可视化的管理界面,可以很方便的查看数据,安装地址为[Attu](https://github.com/zilliztech/attu). + +![](../../img/attu.png) + + +运行召回脚本: + +``` +python milvus_ann_search.py --data_path milvus/milvus_data.csv \ + --embedding_path corpus_embedding.npy \ + --batch_size 100000 \ + --index 18 \ + --search +``` + +运行以后的结果的输出为: + +``` +hit: (distance: 0.0, id: 18), text field: 吉林铁合金集团资产管理现状分析及对策资产管理;资金控制;应收帐款风险;造价控制;集中化财务控制 +hit: (distance: 0.45325806736946106, id: 7611689), text field: 哈药集团应收账款分析应收账款,流动资产,财务报告 +hit: (distance: 0.5440893769264221, id: 4297885), text field: 宝钢集团负债经营风险控制策略研究钢铁行业;负债经营;风险控制 +hit: (distance: 0.5455711483955383, id: 5661135), text field: 浅谈电网企业固定资产风险管理大数据,固定资产,风险管理 +... 
+``` +返回的是向量的距离,向量的id,以及对应的文本。 + +也可以一键执行上述的过程: + +``` +sh scripts/search.sh +``` + +### 5.2 文本检索 + +首先修改代码的模型路径和样本: + +``` +params_path='checkpoints/model_40/model_state.pdparams' +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +``` + +运行命令 + +``` +python3 inference.py + +``` +运行的输出为,分别是抽取的向量和召回的结果: + +``` +[1, 256] +Tensor(shape=[1, 256], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[ 0.07830613, -0.14036864, 0.03433795, -0.14967985, -0.03386058, + 0.06630671, 0.01357946, 0.03531205, 0.02411086, 0.02000865, + 0.05724005, -0.08119474, 0.06286906, 0.06509133, 0.07193415, + .... +hit: (distance: 0.40141725540161133, id: 2742485), text field: 完善国有企业技术创新投入机制的探讨--基于经济责任审计实践国有企业,技术创新,投 +入机制 +hit: (distance: 0.40258315205574036, id: 1472893), text field: 企业技术创新与组织冗余--基于国有企业与非国有企业的情境研究 +hit: (distance: 0.4121206998825073, id: 51831), text field: 企业创新影响对外直接投资决策—基于中国制造业上市公司的研究企业创新;对外直接投资; +制造业;上市公司 +hit: (distance: 0.42234909534454346, id: 8682312), text field: 政治关联对企业创新绩效的影响——国有企业与民营企业的对比政治关联,创新绩效,国有 +企业,民营企业,双重差分 +hit: (distance: 0.46187296509742737, id: 9324797), text field: 财务杠杆、股权激励与企业创新——基于中国A股制造业经验数据制造业;上市公司;股权激 +励;财务杠杆;企业创新 +.... +``` +## FAQ + +#### 抽取文本语义向量后,利用 Milvus 进行 ANN 检索查询到了完全相同的文本,但是计算出的距离为什么不是 0? + +使用的是近似索引,详情请参考Milvus官方文档,[索引创建机制](https://milvus.io/cn/docs/v2.0.x/index.md) diff --git a/applications/neural_search/recall/milvus/base_model.py b/applications/neural_search/recall/milvus/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..aa9459d843b33f51776947ca83b36b47d7c327f7 --- /dev/null +++ b/applications/neural_search/recall/milvus/base_model.py @@ -0,0 +1,170 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
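+
+# Shared semantic-index encoders for the Milvus recall pipeline.
+# SemanticIndexBase is the abstract dynamic-graph base class (forward() is left
+# to subclasses), while SemanticIndexBaseStatic also implements forward() so the
+# encoder can be exported with paddle.jit.save for inference. Both take the
+# pooled [CLS] representation, optionally project it down to output_emb_size
+# dimensions and L2-normalize it.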
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + self.ptm.config.hidden_size, output_emb_size, weight_attr=weight_attr + ) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + self.ptm.config.hidden_size, output_emb_size, weight_attr=weight_attr + ) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if 
self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding diff --git a/applications/neural_search/recall/milvus/config.py b/applications/neural_search/recall/milvus/config.py new file mode 100644 index 0000000000000000000000000000000000000000..7eada53bf5a5625979d7ed851a3b05b8be08d473 --- /dev/null +++ b/applications/neural_search/recall/milvus/config.py @@ -0,0 +1,32 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +MILVUS_HOST = "10.21.226.175" +MILVUS_PORT = 8530 +data_dim = 256 +top_k = 100 +collection_name = "literature_search" +partition_tag = "partition_2" +embedding_name = "embeddings" + +index_config = { + "index_type": "IVF_FLAT", + "metric_type": "L2", + "params": {"nlist": 1000}, +} + +search_params = { + "metric_type": "L2", + "params": {"nprobe": top_k}, +} diff --git a/applications/neural_search/recall/milvus/data.py b/applications/neural_search/recall/milvus/data.py new file mode 100644 index 0000000000000000000000000000000000000000..2acbc81b6af7261fe4dcf7bb7b001ba1f5404a60 --- /dev/null +++ b/applications/neural_search/recall/milvus/data.py @@ -0,0 +1,156 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import paddle + +from paddlenlp.utils.log import logger + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def read_text_triplet(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"text": data[0], "pos_sample": data[1], "neg_sample": data[2]} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpoint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succeed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using latest ann_data_file:{}".format(latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/applications/neural_search/recall/milvus/feature_extract.py b/applications/neural_search/recall/milvus/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..a2a850449ca2ceb955113de9d29dac740e8860df --- 
/dev/null +++ b/applications/neural_search/recall/milvus/feature_extract.py @@ -0,0 +1,170 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +sys.path.append(".") + +from data import convert_example # noqa E402 + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--model_name_or_path", default='rocketqa-zh-base-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# yapf: enable + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, 
min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in enumerate(tqdm(data)): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) > self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("corpus_embedding", all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, data in enumerate(file.readlines()): + id2corpus[idx] = data.strip() + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = read_text(args.corpus_file) + + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + predictor.predict(corpus_list, tokenizer) diff --git a/applications/neural_search/recall/milvus/inference.py b/applications/neural_search/recall/milvus/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..3b8c2e0c2743bd4ec3ae5c204c05a1f6393e7f20 --- /dev/null +++ b/applications/neural_search/recall/milvus/inference.py @@ -0,0 +1,84 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from functools import partial + +import paddle +from base_model import SemanticIndexBaseStatic +from config import collection_name, embedding_name, partition_tag +from data import convert_example, create_dataloader +from milvus_util import RecallByMilvus + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + + +def search_in_milvus(text_embedding): + recall_client = RecallByMilvus() + result = recall_client.search( + text_embedding.numpy(), + embedding_name, + collection_name, + partition_names=[partition_tag], + output_fields=["pk", "text"], + ) + for hits in result: + for hit in hits: + print(f"hit: {hit}, text field: {hit.entity.get('text')}") + + +if __name__ == "__main__": + device = "gpu" + max_seq_length = 64 + output_emb_size = 256 + batch_size = 1 + params_path = "checkpoints/model_40/model_state.pdparams" + id2corpus = {0: "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据"} + model_name_or_path = "rocketqa-zh-base-query-encoder" + paddle.set_device(device) + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + pretrained_model = AutoModel.from_pretrained(model_name_or_path) + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=output_emb_size) + # Load pretrained semantic model + if params_path and os.path.isfile(params_path): + state_dict = paddle.load(params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + # convert_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + # Need better way to get inner model of DataParallel + all_embeddings = [] + model.eval() + with paddle.no_grad(): + for batch_data in corpus_data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = model.get_pooled_embedding(input_ids, token_type_ids) + all_embeddings.append(text_embeddings) + text_embedding = all_embeddings[0] + print(text_embedding.shape) + print(text_embedding) + search_in_milvus(text_embedding) diff --git a/applications/neural_search/recall/milvus/milvus_ann_search.py b/applications/neural_search/recall/milvus/milvus_ann_search.py new file mode 100644 index 0000000000000000000000000000000000000000..23a5500350035a970d9d275f8e34182f1d73bfb8 --- /dev/null +++ b/applications/neural_search/recall/milvus/milvus_ann_search.py @@ -0,0 +1,91 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time + +import numpy as np +from config import collection_name, embedding_name, partition_tag +from milvus_util import RecallByMilvus, VecToMilvus, text_max_len +from tqdm import tqdm + +parser = argparse.ArgumentParser() +parser.add_argument( + "--data_path", default="milvus/milvus_data.csv", type=str, required=True, help="The data for vector extraction." +) +parser.add_argument( + "--embedding_path", default="corpus_embedding.npy", type=str, required=True, help="The vector path for data." +) +parser.add_argument("--index", default=0, type=int, help="index of the vector for search") +parser.add_argument("--insert", action="store_true", help="whether to insert data") +parser.add_argument("--search", action="store_true", help="whether to search data") +parser.add_argument("--batch_size", default=100000, type=int, help="number of examples to insert each time") +args = parser.parse_args() + + +def read_text(file_path): + file = open(file_path) + id2corpus = [] + for idx, data in enumerate(file.readlines()): + id2corpus.append(data.strip()) + return id2corpus + + +def milvus_data_insert(data_path, embedding_path, batch_size): + corpus_list = read_text(data_path) + embeddings = np.load(embedding_path) + embedding_ids = [i for i in range(embeddings.shape[0])] + client = VecToMilvus() + client.drop_collection(collection_name) + data_size = len(embedding_ids) + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + entities = [ + [j for j in range(i, cur_end, 1)], + [corpus_list[j][: text_max_len - 1] for j in range(i, cur_end, 1)], + batch_emb, # field embeddings, supports numpy.ndarray and list + ] + client.insert( + collection_name=collection_name, entities=entities, index_name=embedding_name, partition_tag=partition_tag + ) + + +def milvus_data_recall(embedding_path, index): + embeddings = np.load(embedding_path) + embedding_ids = [i for i in range(embeddings.shape[0])] + recall_client = RecallByMilvus() + if index > len(embedding_ids): + print("Index should not be larger than embedding size") + return + embeddings = embeddings[np.arange(index, index + 1)] + time_start = time.time() + result = recall_client.search( + embeddings, embedding_name, collection_name, partition_names=[partition_tag], output_fields=["pk", "text"] + ) + time_end = time.time() + sum_t = time_end - time_start + print("time cost", sum_t, "s") + for hits in result: + for hit in hits: + print(f"hit: {hit}, text field: {hit.entity.get('text')}") + + +if __name__ == "__main__": + if args.insert: + milvus_data_insert(args.data_path, args.embedding_path, args.batch_size) + if args.search: + milvus_data_recall(args.embedding_path, args.index) diff --git a/applications/neural_search/recall/milvus/milvus_util.py b/applications/neural_search/recall/milvus/milvus_util.py new file mode 100644 index 
0000000000000000000000000000000000000000..d11bccaf2f2570a28b1dc994ed79ba92c3e384bd --- /dev/null +++ b/applications/neural_search/recall/milvus/milvus_util.py @@ -0,0 +1,169 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import numpy as np +from config import ( + MILVUS_HOST, + MILVUS_PORT, + data_dim, + index_config, + search_params, + top_k, +) +from pymilvus import ( + Collection, + CollectionSchema, + DataType, + FieldSchema, + connections, + utility, +) + +fmt = "\n=== {:30} ===\n" +text_max_len = 1000 +fields = [ + FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False, max_length=100), + FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=text_max_len), + FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=data_dim), +] +schema = CollectionSchema(fields, "Neural Search Index") + + +class VecToMilvus: + def __init__(self): + print(fmt.format("start connecting to Milvus")) + connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT) + self.collection = None + + def has_collection(self, collection_name): + try: + has = utility.has_collection(collection_name) + print(f"Does collection {collection_name} exist in Milvus: {has}") + return has + except Exception as e: + print("Milvus has_table error:", e) + + def creat_collection(self, collection_name): + try: + print(fmt.format("Create collection {}".format(collection_name))) + self.collection = Collection(collection_name, schema, consistency_level="Strong") + except Exception as e: + print("Milvus create collection error:", e) + + def drop_collection(self, collection_name): + try: + utility.drop_collection(collection_name) + except Exception as e: + print("Milvus delete collection error:", e) + + def create_index(self, index_name): + try: + print(fmt.format("Start Creating index")) + self.collection.create_index(index_name, index_config) + print(fmt.format("Start loading")) + self.collection.load() + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, partition_tag): + try: + result = self.collection.has_partition(partition_tag) + return result + except Exception as e: + print("Milvus has partition error: ", e) + + def create_partition(self, partition_tag): + try: + self.collection.create_partition(partition_tag) + print("create partition {} successfully".format(partition_tag)) + except Exception as e: + print("Milvus create partition error: ", e) + + def insert(self, entities, collection_name, index_name, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.creat_collection(collection_name) + self.create_index(index_name) + else: + self.collection = Collection(collection_name) + if (partition_tag is not None) and (not self.has_partition(partition_tag)): + self.create_partition(partition_tag) + + self.collection.insert(entities, partition_name=partition_tag) + print(f"Number of entities in Milvus: {self.collection.num_entities}") # check the num_entites 
+ except Exception as e: + print("Milvus insert error:", e) + + +class RecallByMilvus: + def __init__(self): + print(fmt.format("start connecting to Milvus")) + connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT) + self.collection = None + + def get_collection(self, collection_name): + try: + print(fmt.format("Connect collection {}".format(collection_name))) + self.collection = Collection(collection_name) + except Exception as e: + print("Milvus create collection error:", e) + + def search(self, vectors, embedding_name, collection_name, partition_names=[], output_fields=[]): + try: + self.get_collection(collection_name) + result = self.collection.search( + vectors, + embedding_name, + search_params, + limit=top_k, + partition_names=partition_names, + output_fields=output_fields, + ) + return result + except Exception as e: + print("Milvus recall error: ", e) + + +if __name__ == "__main__": + print(fmt.format("Start inserting entities")) + rng = np.random.default_rng(seed=19530) + num_entities = 3000 + entities = [ + # provide the pk field because `auto_id` is set to False + [i for i in range(num_entities)], + ["第{}个样本".format(i) for i in range(num_entities)], # field text, only supports list + rng.random((num_entities, data_dim)), # field embeddings, supports numpy.ndarray and list + ] + print(entities[-1].shape) + collection_name = "test1" + partition_tag = "partition_1" + embedding_name = "embeddings" + client = VecToMilvus() + client.insert( + collection_name=collection_name, entities=entities, index_name=embedding_name, partition_tag=partition_tag + ) + print(fmt.format("Start searching entities")) + vectors_to_search = entities[-1][-2:] + recall_client = RecallByMilvus() + result = recall_client.search( + vectors_to_search, + embedding_name, + collection_name, + partition_names=[partition_tag], + output_fields=["pk", "text"], + ) + for hits in result: + for hit in hits: + print(f"hit: {hit}, random field: {hit.entity.get('text')}") diff --git a/applications/neural_search/recall/milvus/scripts/feature_extract.sh b/applications/neural_search/recall/milvus/scripts/feature_extract.sh new file mode 100644 index 0000000000000000000000000000000000000000..7f996ac0600a67cafaa09995d007381b7c519380 --- /dev/null +++ b/applications/neural_search/recall/milvus/scripts/feature_extract.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=2 python feature_extract.py \ + --model_dir ./output \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --batch_size 512 \ + --corpus_file "milvus/milvus_data.csv" + diff --git a/applications/neural_search/recall/milvus/scripts/search.sh b/applications/neural_search/recall/milvus/scripts/search.sh new file mode 100644 index 0000000000000000000000000000000000000000..5c4cdeea3536dcf19328ca0c895112daa4670a1d --- /dev/null +++ b/applications/neural_search/recall/milvus/scripts/search.sh @@ -0,0 +1,6 @@ +python milvus_ann_search.py --data_path milvus/milvus_data.csv \ + --embedding_path corpus_embedding.npy \ + --batch_size 100000 \ + --index 18 \ + --insert \ + --search \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/README.md b/applications/neural_search/recall/simcse/README.md new file mode 100644 index 0000000000000000000000000000000000000000..033afd18008f28c0190ac577a5fe26b4c1dfa2a5 --- /dev/null +++ b/applications/neural_search/recall/simcse/README.md @@ -0,0 +1,448 @@ + + **目录** + +* [背景介绍](#背景介绍) +* [SimCSE](#SimCSE) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 
模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +语义索引(可通俗理解为向量索引)技术是搜索引擎、推荐系统、广告系统在召回阶段的核心技术之一。语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。语义索引模型的效果直接决定了语义相关的物料能否被成功召回进入系统参与上层排序,从基础层面影响整个系统的效果。 + +在召回阶段,最常见的方式是通过双塔模型,学习Document(简写为Doc)的向量表示,对Doc端建立索引,用ANN召回。我们在这种方式的基础上,引入无监督预训练策略,以如下训练数据为例: + + +``` +我手机丢了,我想换个手机 我想买个新手机,求推荐 +求秋色之空漫画全集 求秋色之空全集漫画 +学日语软件手机上的 手机学日语的软件 +侠盗飞车罪恶都市怎样改车 侠盗飞车罪恶都市怎么改车 +``` + +SimCSE 模型适合缺乏监督数据,但是又有大量无监督数据的匹配和检索场景。 + + + + +# SimCSE + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +双塔模型,采用ERNIE1.0热启,在召回阶段引入 SimCSE 策略。 + + +### 评估指标 + +(1)采用 Recall@1,Recall@5 ,Recall@10 ,Recall@20 和 Recall@50 指标来评估语义索引模型的召回效果。 + +**效果评估** + +| 策略 | 模型| Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 | +| ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- | +| SimCSE | ernie 1.0 |42.374 | 57.505| 62.641| 67.09|72.331| +| SimCSE | rocketqa-zh-base-query-encoder |**50.108** | **64.005**| **68.288**| **72.306**|**77.306**| + + + +## 2. 环境依赖和安装说明 + +**环境依赖** +* python >= 3.6 +* paddlepaddle >= 2.1.3 +* paddlenlp >= 2.2 +* [hnswlib](https://github.com/nmslib/hnswlib) >= 0.5.2 +* visualdl >= 2.2.2 + + + + + +## 3. 代码结构 + +以下是本项目主要代码结构及说明: + +``` +simcse/ +├── model.py # SimCSE 模型组网代码 +|—— deploy + |—— python + |—— predict.py # PaddleInference + ├── deploy.sh # Paddle Inference的bash脚本 +|—— scripts + ├── export_model.sh # 动态图转静态图bash脚本 + ├── predict.sh # 预测的bash脚本 + ├── evaluate.sh # 召回评估bash脚本 + ├── run_build_index.sh # 索引的构建脚本 + ├── train.sh # 训练的bash脚本 +|—— ann_util.py # Ann 建索引库相关函数 +├── data.py # 无监督语义匹配训练数据、测试数据的读取逻辑 +├── export_model.py # 动态图转静态图 +├── predict.py # 基于训练好的无监督语义匹配模型计算文本 Pair 相似度 +├── evaluate.py # 根据召回结果和评估集计算评估指标 +|—— inference.py # 动态图抽取向量 +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +└── train.py # SimCSE 模型训练、评估逻辑 + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +我们基于开源的语义匹配数据集构造生成了面向语义索引的训练集、评估集、召回库。 + +样例数据如下: +``` +睡眠障碍与常见神经系统疾病的关系睡眠觉醒障碍,神经系统疾病,睡眠,快速眼运动,细胞增殖,阿尔茨海默病 +城市道路交通流中观仿真研究 +城市道路交通流中观仿真研究智能运输系统;城市交通管理;计算机仿真;城市道路;交通流;路径选择 +网络健康可信性研究 +网络健康可信性研究网络健康信息;可信性;评估模式 +脑瘫患儿家庭复原力的影响因素及干预模式雏形 研究 +脑瘫患儿家庭复原力的影响因素及干预模式雏形研究脑瘫患儿;家庭功能;干预模式 +地西他滨与HA方案治疗骨髓增生异常综合征转化的急性髓系白血病患者近期疗效比较 +地西他滨与HA方案治疗骨髓增生异常综合征转化的急性髓系白血病患者近期疗效比较 +个案工作 社会化 +个案社会工作介入社区矫正再社会化研究——以东莞市清溪镇为例社会工作者;社区矫正人员;再社会化;角色定位 +圆周运动加速度角速度 +圆周运动向心加速度物理意义的理论分析匀速圆周运动,向心加速度,物理意义,角速度,物理量,线速度,周期 +``` + +召回集,验证集,测试集与inbatch-negative实验的数据保持一致 + + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + +## 5. 模型训练 + +**语义索引预训练模型下载链接:** + +以下模型结构参数为: `TrasformerLayer:12, Hidden:768, Heads:12, OutputEmbSize: 256` + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[SimCSE](https://bj.bcebos.com/v1/paddlenlp/models/simcse_model.zip)|
ernie 1.0 epoch:3 lr:5E-5 bs:64 max_len:64|4卡 v100-16g
|7c46d9b15a214292e3897c0eb70d0c9f| + +### 训练环境说明 + ++ NVIDIA Driver Version: 440.64.00 ++ Ubuntu 16.04.6 LTS (Docker) ++ Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡, 基于SimCSE训练模型,无监督的数据量比较大,4卡的训练的时长在16个小时左右。如果采用单机单卡训练,只需要把`--gpu`参数设置成单卡的卡号即可。 + +训练的命令如下: + +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus '0,1,2,3' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 2000 \ + --eval_steps 100 \ + --max_seq_length 64 \ + --infer_with_fc_pooler \ + --dropout 0.2 \ + --output_emb_size 256 \ + --train_set_file "./recall/train_unsupervised.csv" \ + --test_set_file "./recall/dev.csv" \ + --model_name_or_path "rocketqa-zh-base-query-encoder" +``` +也可以使用bash脚本: + +``` +sh scripts/train.sh +``` + + + +可支持配置的参数: + +* `infer_with_fc_pooler`:可选,在预测阶段计算文本 embedding 表示的时候网络前向是否会过训练阶段最后一层的 fc; 建议打开模型效果最好。 +* `scale`:可选,在计算 cross_entropy loss 之前对 cosine 相似度进行缩放的因子;默认为 20。 +* `dropout`:可选,SimCSE 网络前向使用的 dropout 取值;默认 0.1。 +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_length`:可选,ERNIE-Gram 模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `epochs`: 训练轮次,默认为1。 +* `warmup_proption`:可选,学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. +* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。 +* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── model_100 +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   └── vocab.txt +└── ... +``` + + + +## 6. 评估 + +效果评估分为 4 个步骤: + +a. 获取Doc端Embedding + +基于语义索引模型抽取出Doc样本库的文本向量, + +b. 采用hnswlib对Doc端Embedding建库 + +使用 ANN 引擎构建索引库(这里基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +c. 获取Query的Embedding并查询相似结果 + +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 2 步中建立的索引库中进行 ANN 查询,召回 Top50 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件 + +d. 评估 + +基于评估集 `dev.csv` 和召回结果 `recall_result` 计算评估指标 Recall@k,其中k取值1,5,10,20,50. 
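+
+其中 Recall@N 的计算方式为:对评估集中的每条 *Source Text*,若其标注的相似文本出现在召回的前 N 条结果中记为 1,否则记为 0,再对全部评估样本取平均。下面给出一个最小的计算示意(逻辑与本目录 `evaluate.py` 中的 `recall` 函数一致,仅用于帮助理解指标含义):
+
+```python
+import numpy as np
+
+def recall_at_n(relevance_lists, n=10):
+    # relevance_lists: 每条 Query 对应一个 0/1 命中标记列表,
+    # 第 i 位表示召回的第 i 条结果是否为该 Query 标注的相似文本
+    return float(np.mean([np.sum(flags[:n]) for flags in relevance_lists]))
+
+# 示例:3 条 Query 的标注相似文本分别出现在召回结果的第 3、2、1 位
+rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
+print(recall_at_n(rs, n=1))  # 0.333...
+print(recall_at_n(rs, n=3))  # 1.0
+```
+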
+ +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +python -u -m paddle.distributed.launch --gpus "6" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_12000/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" +``` +也可以使用下面的bash脚本: + +``` +sh scripts/run_build_index.sh +``` + +run_build_index.sh还包含cpu和gpu运行的脚本,默认是gpu的脚本 + + +接下来,运行如下命令进行效果评估,产出Recall@1, Recall@5, Recall@10, Recall@20 和 Recall@50 指标: +``` +python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 +``` +也可以使用下面的bash脚本: + +``` +bash scripts/evaluate.sh +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +成功运行结束后,会输出如下评估指标: + +``` +recall@1=45.183 +recall@5=60.444 +recall@10=65.224 +recall@20=69.562 +recall@50=74.848 +``` + + + + +## 7. 预测 + +我们可以基于语义索引模型预测文本的语义向量或者计算文本 Pair 的语义相似度。 + +### 7.1 功能一:抽取文本的语义向量 + +修改 inference.py 文件里面输入文本 id2corpus 和模型路径 params_path: + +``` +params_path='checkpoints/model_12000/model_state.pdparams' +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +``` +然后运行 +``` +python inference.py +``` +预测结果位256维的向量: + +``` +[1, 256] +[[-6.70653954e-02 -6.46878220e-03 -6.78317016e-03 1.66617986e-02 + 7.20006675e-02 -9.79134627e-03 -1.38441555e-03 4.37440760e-02 + 4.78116237e-02 1.33881181e-01 1.82927232e-02 3.23656350e-02 + ... +``` + +### 7.2 功能二:计算文本 Pair 的语义相似度 + +### 准备预测数据 + +待预测数据为 tab 分隔的 tsv 文件,每一行为 1 个文本 Pair,部分示例如下: +``` +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 热处理对尼龙6及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响尼龙6,聚酰胺嵌段共聚物,芳香聚酰胺,热处理 +面向生态系统服务的生态系统分类方案研发与应用. 面向生态系统服务的生态系统分类方案研发与应用 +huntington舞蹈病的动物模型 Huntington舞蹈病的动物模型 +试论我国海岸带经济开发的问题与前景 试论我国海岸带经济开发的问题与前景海岸带,经济开发,问题,前景 +``` + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的 SimCSE无监督语义索引模型开始计算文本 Pair 的语义相似度: +``` +root_dir="checkpoints" + +python -u -m paddle.distributed.launch --gpus "3" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_12000/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `params_path`: 预训练模型的参数文件名 +* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化。 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `text_pair_file`: 由文本 Pair 构成的待预测数据集 + +也可以运行下面的bash脚本: + +``` +sh scripts/predict.sh +``` + +产出如下结果 +``` +0.6477588415145874 +0.9698382019996643 +1.0 +0.1787596344947815 +``` + + + +## 8. 
部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/model_12000/model_state.pdparams \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference预测 + +预测既可以抽取向量也可以计算两个文本的相似度。 + +修改id2corpus的样本: + +``` +# 抽取向量 +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +# 计算相似度 +corpus_list=[['中西方语言与文化的差异','中西方文化差异以及语言体现中西方文化,差异,语言体现'], + ['中西方语言与文化的差异','飞桨致力于让深度学习技术的创新与应用更简单']] + +``` +然后使用PaddleInference + +``` +python deploy/python/predict.py --model_dir=./output +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +最终输出的是256维度的特征向量和句子对的预测概率 + +``` +(1, 256) +[[-6.70653731e-02 -6.46873191e-03 -6.78317575e-03 1.66618153e-02 + 7.20006898e-02 -9.79136024e-03 -1.38439541e-03 4.37440872e-02 + 4.78115827e-02 1.33881137e-01 1.82927139e-02 3.23656537e-02 + ....... + +[0.5649663209915161, 0.03284594044089317] +``` +## FAQ + +#### SimCSE模型怎么部署? + ++ SimCSE使用的模型跟 In-batch Negatives 训练出来的模型网络结构是一样的,使用 In-batch Negatives 的部署流程即可,参考[In-batch Negatives](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/recall/in_batch_negative/deploy/python) + +## Reference +[1] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” ArXiv:2104.08821 [Cs], April 18, 2021. http://arxiv.org/abs/2104.08821. diff --git a/applications/neural_search/recall/simcse/ann_util.py b/applications/neural_search/recall/simcse/ann_util.py new file mode 100644 index 0000000000000000000000000000000000000000..55c608d3e58c37c0d9baf884b270178d3ac5da7f --- /dev/null +++ b/applications/neural_search/recall/simcse/ann_util.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import numpy as np +import hnswlib +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space="ip", dim=args.output_emb_size) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=args.hnsw_max_elements, ef_construction=args.hnsw_ef, M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/applications/neural_search/recall/simcse/data.py b/applications/neural_search/recall/simcse/data.py new file mode 100644 index 0000000000000000000000000000000000000000..5e1fc4bf5bf1be42746648df499b3df5a1dfbfda --- /dev/null +++ b/applications/neural_search/recall/simcse/data.py @@ -0,0 +1,146 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example_test(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text + + +def read_simcse_text(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_test is False: + if len(data) != 3: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} + else: + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} diff --git a/applications/neural_search/recall/simcse/deploy/python/deploy.sh b/applications/neural_search/recall/simcse/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..fe8f071e0a47a47f5dc24d84ea4eaaf8e7503c06 --- /dev/null +++ b/applications/neural_search/recall/simcse/deploy/python/deploy.sh @@ -0,0 +1 @@ +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/deploy/python/predict.py b/applications/neural_search/recall/simcse/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..d3a29726d1e64b51becb62f53eeaf9c276b4ba77 --- /dev/null +++ 
b/applications/neural_search/recall/simcse/deploy/python/predict.py @@ -0,0 +1,268 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import paddle +from paddle import inference +from scipy import spatial + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.append(".") + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="model name.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name=args.model_name_or_path, + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def extract_embedding(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the feature vectors. 
+ """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example(text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return logits + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the prediction probs. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for idx, text in enumerate(data): + input_ids, segment_ids = convert_example({idx: text[0]}, tokenizer) + title_ids, title_segment_ids = convert_example({idx: text[1]}, tokenizer) + examples.append((input_ids, segment_ids, title_ids, title_segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(query_ids) + self.input_handles[1].copy_from_cpu(query_segment_ids) + self.predictor.run() + query_logits = self.output_handle.copy_to_cpu() + + self.input_handles[0].copy_from_cpu(title_ids) + self.input_handles[1].copy_from_cpu(title_segment_ids) + self.predictor.run() + title_logits = self.output_handle.copy_to_cpu() + + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + result = [float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)] + return result + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + # ErnieTinyTokenizer is special for ernie-tiny pretained model. 
+ output_emb_size = 256 + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = {0: "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据"} + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + res = predictor.extract_embedding(corpus_list, tokenizer) + print(res.shape) + print(res) + corpus_list = [["中西方语言与文化的差异", "中西方文化差异以及语言体现中西方文化,差异,语言体现"], ["中西方语言与文化的差异", "飞桨致力于让深度学习技术的创新与应用更简单"]] + res = predictor.predict(corpus_list, tokenizer) + print(res) diff --git a/applications/neural_search/recall/simcse/evaluate.py b/applications/neural_search/recall/simcse/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..63d3b2fe16340e2818ec6bd0690387069931a82e --- /dev/null +++ b/applications/neural_search/recall/simcse/evaluate.py @@ -0,0 +1,81 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default='', help="The full path of similat pair file") +parser.add_argument("--recall_result_file", type=str, default='', help="The full path of recall result file") +parser.add_argument("--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query") +args = parser.parse_args() +# yapf: enable + + +def recall(rs, N=10): + """ + Ratio of recalled Ground Truth at topN Recalled Docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + >>> 0.6666667 + >>> recall(rs, N=3) + >>> 1.0 + Args: + rs: Iterator of recalled flag() + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + + rs = [] + + with open(args.recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + recall_num = [1, 5, 10, 20, 50] + result = open("result.tsv", "a") + res = [] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") diff --git a/applications/neural_search/recall/simcse/export_model.py b/applications/neural_search/recall/simcse/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..a9242c24dac0b0d5679a82a128527f42f074fe1b --- /dev/null +++ 
b/applications/neural_search/recall/simcse/export_model.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from model import SimCSE + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-base-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + # If you want to use ernie1.0 model, plesace uncomment the following code + output_emb_size = 256 + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SimCSE(pretrained_model, output_emb_size=output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/neural_search/recall/simcse/inference.py b/applications/neural_search/recall/simcse/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..bb2345fa88c3a24a5317fc0f7a666dbcb13be8df --- /dev/null +++ b/applications/neural_search/recall/simcse/inference.py @@ -0,0 +1,108 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +from functools import partial + +import paddle +from data import create_dataloader +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +if __name__ == "__main__": + device = "gpu" + max_seq_length = 64 + output_emb_size = 256 + batch_size = 1 + params_path = "checkpoints/model_20000/model_state.pdparams" + id2corpus = {0: "国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据"} + model_name_or_path = "rocketqa-zh-base-query-encoder" + paddle.set_device(device) + + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(model_name_or_path) + + model = SimCSE(pretrained_model, output_emb_size=output_emb_size) + + # Load pretrained semantic model + if params_path and os.path.isfile(params_path): + state_dict = paddle.load(params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + all_embeddings = [] + model.eval() + with paddle.no_grad(): + for batch_data in corpus_data_loader: + input_ids, token_type_ids = batch_data + + text_embeddings = model.get_pooled_embedding(input_ids, token_type_ids) + all_embeddings.append(text_embeddings) + + text_embedding = all_embeddings[0] + print(text_embedding.shape) + print(text_embedding.numpy()) diff --git a/applications/neural_search/recall/simcse/model.py b/applications/neural_search/recall/simcse/model.py new file mode 100644 index 0000000000000000000000000000000000000000..0e3613c1e7c73bfd534fc2e7da82f40005ebf318 --- /dev/null +++ 
b/applications/neural_search/recall/simcse/model.py @@ -0,0 +1,140 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SimCSE(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.0, scale=20, output_emb_size=None): + + super().__init__() + + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + self.ptm.config.hidden_size, output_emb_size, weight_attr=weight_attr + ) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, with_pooler=True + ): + + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if with_pooler is False: + cls_embedding = sequence_output[:, 0, :] + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + with_pooler=True, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + 
query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/applications/neural_search/recall/simcse/predict.py b/applications/neural_search/recall/simcse/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..3d3800ad3495d6565c37e6c82cbdc4dbaeee65a6 --- /dev/null +++ b/applications/neural_search/recall/simcse/predict.py @@ -0,0 +1,110 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-base-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SimCSE`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + + cosine_sims = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False, is_test=True) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + model = SimCSE(pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + cosin_sim = predict(model, valid_data_loader) + + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) diff --git a/applications/neural_search/recall/simcse/recall.py b/applications/neural_search/recall/simcse/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..604cc0a5988daa1475e7b73683910767bda41ee2 --- /dev/null +++ b/applications/neural_search/recall/simcse/recall.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from data import convert_example_test, create_dataloader, gen_id2corpus, gen_text_file +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-base-query-encoder', type=str, help='The pretrained model used for training') +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example_test, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + model = SimCSE(pretrained_model, 
output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/neural_search/recall/simcse/scripts/evaluate.sh b/applications/neural_search/recall/simcse/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..a95782c94d3e3bc593a8b41c655d28947bef78ba --- /dev/null +++ b/applications/neural_search/recall/simcse/scripts/evaluate.sh @@ -0,0 +1,4 @@ + python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/scripts/export_model.sh b/applications/neural_search/recall/simcse/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..629440b9b079920e74e916f6b899b27d18aec559 --- /dev/null +++ b/applications/neural_search/recall/simcse/scripts/export_model.sh @@ -0,0 +1,3 @@ +python export_model.py --params_path checkpoints/model_12000/model_state.pdparams \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --output_path=./output \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/scripts/predict.sh b/applications/neural_search/recall/simcse/scripts/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..758e3ecf16967eae0d4daf6a950704cde60b4138 --- /dev/null +++ b/applications/neural_search/recall/simcse/scripts/predict.sh @@ -0,0 +1,21 @@ +# gpu +root_dir="checkpoints" +python -u -m paddle.distributed.launch --gpus "3" \ + predict.py \ + --device gpu \ + --params_path 
"${root_dir}/model_12000/model_state.pdparams" \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --text_pair_file "recall/test.csv" + +# cpu +# root_dir="checkpoints" +# python predict.py \ +# --device cpu \ +# --params_path "${root_dir}/model_20000/model_state.pdparams" \ +# --output_emb_size 256 \ +# --batch_size 128 \ +# --max_seq_length 64 \ +# --text_pair_file "recall/test.csv" \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/scripts/run_build_index.sh b/applications/neural_search/recall/simcse/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..eee1ad3593598279f4ea6568e60dd269a6e12f3d --- /dev/null +++ b/applications/neural_search/recall/simcse/scripts/run_build_index.sh @@ -0,0 +1,31 @@ +# gpu +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_12000/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" + +# cpu +# python recall.py \ +# --device cpu \ +# --recall_result_dir "recall_result_dir" \ +# --recall_result_file "recall_result.txt" \ +# --params_path "checkpoints/model_20000/model_state.pdparams" \ +# --hnsw_m 100 \ +# --hnsw_ef 100 \ +# --batch_size 64 \ +# --output_emb_size 256\ +# --max_seq_length 60 \ +# --recall_num 50 \ +# --similar_text_pair "recall/dev.csv" \ +# --corpus_file "recall/corpus.csv" \ No newline at end of file diff --git a/applications/neural_search/recall/simcse/scripts/train.sh b/applications/neural_search/recall/simcse/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..60817e0ff7b50b705c7075f94f3b78386d4da708 --- /dev/null +++ b/applications/neural_search/recall/simcse/scripts/train.sh @@ -0,0 +1,55 @@ +# simcse gpu +python -u -m paddle.distributed.launch --gpus '1,2,3,4' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 2000 \ + --eval_steps 100 \ + --max_seq_length 64 \ + --infer_with_fc_pooler \ + --dropout 0.2 \ + --output_emb_size 256 \ + --train_set_file "./recall/train_unsupervised.csv" \ + --test_set_file "./recall/dev.csv" \ + --model_name_or_path "rocketqa-zh-base-query-encoder" + +# simcse cpu +# python train.py \ +# --device cpu \ +# --save_dir ./checkpoints/ \ +# --batch_size 64 \ +# --learning_rate 5E-5 \ +# --epochs 3 \ +# --save_steps 2000 \ +# --eval_steps 100 \ +# --max_seq_length 64 \ +# --infer_with_fc_pooler \ +# --dropout 0.2 \ +# --output_emb_size 256 \ +# --train_set_file "./recall/train_unsupervised.csv" \ +# --test_set_file "./recall/dev.csv" +# --model_name_or_path "ernie-3.0-medium-zh" + +# post training + simcse +# python -u -m paddle.distributed.launch --gpus '0,1,2,3' \ +# train.py \ +# --device gpu \ +# --save_dir ./checkpoints/ \ +# --batch_size 64 \ +# --learning_rate 5E-5 \ +# --epochs 3 \ +# --save_steps 2000 \ +# --eval_steps 100 \ +# --max_seq_length 64 \ +# --infer_with_fc_pooler \ +# --dropout 0.2 \ +# --output_emb_size 256 \ +# --train_set_file "./recall/train_unsupervised.csv" \ +# --test_set_file "./recall/dev.csv" +# 
--model_name_or_path "post_ernie" + + + diff --git a/applications/neural_search/recall/simcse/train.py b/applications/neural_search/recall/simcse/train.py new file mode 100644 index 0000000000000000000000000000000000000000..050e79bbcd937f74381156373b617398940e222f --- /dev/null +++ b/applications/neural_search/recall/simcse/train.py @@ -0,0 +1,156 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_simcse_text +from model import SimCSE +from visualdl import LogWriter + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument('--eval_steps', type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--test_set_file", type=str, required=True, help="The full path of test_set_file.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout for pretrained 
model encoder.") +parser.add_argument("--infer_with_fc_pooler", action='store_true', help="Whether use fc layer after cls embedding or not for when infer.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-base-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + writer = LogWriter(logdir="./log/scalar_test/train") + + train_ds = load_dataset(read_simcse_text, data_path=args.train_set_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained( + args.model_name_or_path, hidden_dropout_prob=args.dropout, attention_probs_dropout_prob=args.dropout + ) + print("loading model from {}".format(args.model_name_or_path)) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = SimCSE(pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
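+    # decay_params collects the *names* of the parameters that should be decayed;
+    # AdamW calls apply_decay_param_fun with each parameter name and applies
+    # weight_decay only when it returns True, so bias and LayerNorm weights are skipped.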
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + time_start = time.time() + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + writer.add_scalar(tag="loss", step=global_step, value=loss) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % (global_step)) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + time_end = time.time() + print("totally cost", time_end - time_start) + + +if __name__ == "__main__": + do_train() diff --git a/applications/neural_search/requirements.txt b/applications/neural_search/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..3a500dbca18df48b4e063fbd28972ff28360b350 --- /dev/null +++ b/applications/neural_search/requirements.txt @@ -0,0 +1,11 @@ +pymilvus>=2.1.0 +pandas +paddlenlp>=2.1.1 +paddlepaddle-gpu>=2.2.3 +hnswlib>=0.5.2 +numpy>=1.17.2 +visualdl>=2.2.2 +paddle-serving-app>=0.7.0 +paddle-serving-client>=0.7.0 +paddle-serving-server-gpu>=0.7.0.post102 +pybind11 \ No newline at end of file diff --git a/applications/neural_search/run_system.py b/applications/neural_search/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..1c8af3f096c73392a51fc9dbbf2ae52d2adf2667 --- /dev/null +++ b/applications/neural_search/run_system.py @@ -0,0 +1,86 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
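+
+# Overall flow of this demo script (the recall service on 127.0.0.1:8080 and the
+# ranking service on 127.0.0.1:8089 are assumed to have been started separately
+# with Paddle Serving):
+#   1. recall_result():    send the raw query to the recall service and get its embedding
+#   2. search_in_milvus(): run an ANN search over the Milvus collection with that embedding
+#   3. rerank():           score every (query, recalled text) pair with the ranking service
+#                          and sort the candidates by the returned score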
+ +import sys +import time + +import numpy as np +import pandas as pd +from paddle_serving_server.pipeline import PipelineClient + +sys.path.append("./recall/milvus") # noqa: E402 +from config import collection_name, embedding_name, partition_tag # noqa: E402 +from milvus_util import RecallByMilvus # noqa: E402 + + +def recall_result(list_data): + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = item + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + return result + + +def search_in_milvus(embeddings, query_text): + recall_client = RecallByMilvus() + start_time = time.time() + results = recall_client.search( + embeddings, embedding_name, collection_name, partition_names=[partition_tag], output_fields=["pk", "text"] + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + list_data = [] + for line in results: + for item in line: + # idx = item.id + distance = item.distance + text = item.entity.get("text") + list_data.append([query_text, text, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "text", "distance"]) + df.to_csv("recall_result.csv", index=False) + return df + + +def rerank(df): + client = PipelineClient() + client.connect(["127.0.0.1:8089"]) + list_data = [] + for index, row in df.iterrows(): + example = {"query": row["query_text"], "title": row["text"]} + list_data.append(example) + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = str(item) + + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + df["distance"] = result + df = df.sort_values(by=["distance"], ascending=False) + df.to_csv("rank_result.csv", index=False) + + +if __name__ == "__main__": + list_data = ["中西方语言与文化的差异"] + result = recall_result(list_data) + df = search_in_milvus(result, list_data[0]) + rerank(df) diff --git a/applications/question_answering/README.md b/applications/question_answering/README.md new file mode 100644 index 0000000000000000000000000000000000000000..673a716d53de981c1cc01f8480c98921225861c6 --- /dev/null +++ b/applications/question_answering/README.md @@ -0,0 +1,21 @@ +# 问答系统 + +问答系统(Question Answering System, QA)是信息检索系统的一种高级形式,它能用准确、简洁的自然语言回答用户用自然语言提出的问题。问答系统的应用空间十分包括,包括搜索引擎,小度音响等智能硬件,聊天机器人,以及政府、金融、银行、电信、电商领域的智能客服等。 + +在问答系统中,检索式问答系统是最容易落地的一种,它具有速度快、可控性好、容易拓展等特点。 +检索式问答系统是一种基于问题答案对进行检索匹配的系统,根据是否需要FAQ(Frequently asked questions)可以进一步分为有监督检索式问答系统和无监督检索式问答系统,前者需要用户提供FAQ语料,后者不需要预备问答语料,可通过问题答案对生成的方式自动生成语料。 + +PaddleNLP提供了[有监督检索式问答系统](./supervised_qa)和[无监督检索式问答系统](./unsupervised_qa),开发者可根据实际情况进行选择。 + +关于问答场景应用案例请查阅飞桨新产品[RocketQA](https://github.com/PaddlePaddle/RocketQA)。 + +**有监督检索式问答系统效果展示**: +
+ +
+ + +**无监督检索式问答系统效果展示**: +
+ +
diff --git a/applications/question_answering/supervised_qa/faq_finance/README.md b/applications/question_answering/supervised_qa/faq_finance/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6f6714974de3784e22d913b1265b92be0754ec82 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/README.md @@ -0,0 +1,555 @@ +# 保险智能问答 + + **目录** + +* [1. 项目介绍](#项目介绍) +* [2. 系统特色](#系统特色) +* [3. 保险智能问答系统方案](#保险问答系统方案) +* [4. 动手实践——搭建自己的端到端检索式问答系统](#动手实践——搭建自己的端到端检索式问答系统) +* [5. 模型优化](#模型优化) +* [6. 参考文献](#参考文献) + + + +## 1. 项目介绍 + +智能问答是获取信息和知识的更直接、更高效的方式之一,传统的信息检索方法智能找到相关的文档,而智能问答能够直接找到精准的答案,极大的节省了人们查询信息的时间。问答按照技术分为基于阅读理解的问答和检索式的问答,阅读理解的问答是在正文中找到对应的答案片段,检索式问答则是匹配高频的问题,然后把答案返回给用户。本项目属于检索式的问答,问答的领域用途很广,比如搜索引擎,小度音响等智能硬件,政府,金融,银行,电信,电商领域的智能客服,聊天机器人等。 + +- 本方案是场景的定制化的方案,用户可以使用自己的数据训练一个特定场景的方案。另外,想快速体验FAQ智能问答系统请参考Pipelines的实现[FAQ智能问答](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/FAQ) + +- 本项目的详细教程请参考(包括数据和代码实现)[aistudio教程](https://aistudio.baidu.com/aistudio/projectdetail/3882519) + + + +## 2. 系统特色 + ++ 低门槛 + + 手把手搭建检索式保险智能问答 + + 无需相似 Query-Query Pair 标注数据也能构建保险智能问答 ++ 效果好 + + 业界领先的检索预训练模型: RocketQA Dual Encoder + + 针对无标注数据场景的领先解决方案: 检索预训练模型 + 增强的无监督语义索引微调 + ++ 性能快 + + 基于 Paddle Inference 快速抽取向量 + + 基于 Milvus 快速查询和高性能建库 + + 基于 Paddle Serving 高性能部署 + + + +## 3. 保险智能问答系统方案 + +### 3.1 技术方案和评估指标 + +#### 3.1.1 技术方案 + +**语义索引**:针对保险等金融领域的问答只有问答对的场景,我们提供了一个在SimCSE的基础上融合WR (word reptition)策略,同义词策略,R-Drop策略的无监督的解决方案。 + +#### 3.1.2 评估指标 + +* 该保险智能问答系统使用的指标是 Recall@K,表示的是预测的前topK(从最后的按得分排序的召回列表中返回前K个结果)结果和语料库中真实的前 K 个相关结果的重叠率,衡量的是检索系统的查全率。 + +### 3.2 数据说明 + +#### 3.2.1 预置数据介绍 + +数据集来源于Github开源的保险的问答数据,包括源用户的问题和相应的回复。 + +| 阶段 |模型 | 训练集 | 评估集(用于评估模型效果) | 召回库 | +| ------------ | ------------ |------------ | ------------ | ------------ | +| 召回 | SimCSE | 3030 | 758 | 3788 | + +其中训练集的问题-问题对的构造使用了同义词替换的方法,详情请参考[nlpcda](https://github.com/425776024/nlpcda) + +评估集的问题对的构造使用了中英文回译的方法,数据使用的是百度翻译的API,详情请参考[百度翻译](https://fanyi-api.baidu.com/?fr=simultaneous) + +【注意】:数据集是基于Github开源数据进行了处理得到的,如果有任何侵权问题,请及时联系,我们会第一时间进行删除。 + +``` +├── data # 数据集 + ├── train.csv # 无监督训练集 + ├── train_aug.csv # 同义词替换后构造的训练集 + ├── test_pair.csv # 测试集,用于评估模型的效果 + ├── corpus.csv # 构建召回的数据,用于评估模型的召回效果 + ├── qa_pair.csv # 问答对,问题对应的答案 +``` +数据集的下载链接为: [faq_finance](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/baoxianzhidao/intro.ipynb) + +#### 3.2.2 数据格式 + +训练需要规定格式的本地数据集,需要准备训练集文件`train.csv`或者`train_aug.csv`,测试集`test_pair.csv`,召回集文件`corpus.csv`,问答对 `qa_pair.csv`。 + +用于无监督训练的训练集的格式如下: + +``` +文本1 +文本2 +... +``` +训练集合`train.csv`的文件样例: + +``` +家里有社保,还有必要买重疾险吗? +工地买了建工险,出了事故多长时间上报保险公司有效 +请问下哆啦a保值不值得买呢?不晓得保障多不多 +自由职业办理养老保险是否划算 +工伤七级如果公司不干了,怎么赔我 +普通意外险的保障范围都有哪些? +...... +``` +除此之外,也可以使用数据增强的格式,训练方式是类似有监督的构造句子对。数据增强的文件格式如下: + +``` +文本1 \t 增强文本1 +文本2 \t 增强文本2 +``` +增强数据集`train_aug.csv`的格式如下: + +``` +工伤七级如果公司不干了,怎么赔我 工伤七级如果企业不干了,怎生赔我 +普通意外险的保障范围都有哪些? 一般性意外险的保障范围都有哪些? +重疾险赔付三次和赔付一次的区别 重疾险赔偿三次和赔偿一次的区别 +。。。。。 +``` + +测试集合`test_pair.csv`是问句对,具体格式如下: + +``` +句子1 \t 句子2 +句子3 \t 句子4 +``` +其中句子1和句子2是相似的句子,只是表达方式不同,或者进行了一定程度的变形,但实际表达的语义是一样的。 + +测试集的文件样例: + +``` +车险如何计算 如何计算汽车保险 +农民买养老保险怎么买 农民如何购买养老保险 +车险必买哪几项 你必须购买哪些汽车保险 +... +``` +召回集合`corpus.csv`主要作用是检验测试集合的句子对能否被正确召回,它的构造主要是提取测试集的第二列的句子,然后加入很多无关的句子,用来检验模型能够正确的从这些文本中找出测试集合对应的第二列的句子,格式如下: + +``` +如何办理企业养老保险 +如何为西班牙购买签证保险? +康慧宝需要买多少? +如果另一方对车辆事故负有全部责任,并且拒绝提前支付维修费,该怎么办 +准备清明节去新兴坡旅游,什么样的旅游保险好? +你能从国外账户购买互助基金吗? +什么是海上保险?有哪些海上保险? +.... +``` + +问答对集合`qa_pair.csv`包含的是整个项目的问题和对应的答案,,具体格式如下: + +``` +问题1 \t 答案1 +问题2 \t 答案2 +...... 
+``` +问答对集合示例: + +``` +既然强制运输保险有浮动费率制度,有商业保险吗? 商业车险也有的。关于汽车商业险的费率在全国每个省都是不一样的,在同一地区,费率也会变化。一般1年、2-4年、4-6年、费率都不同。新车第一年的费率会比较高,2-4是相对比较优惠,4-6会再上涨,不同类型的汽车费率也不同。商业车险保费浮动比例与其他公司相比都是差不多的,一般销售保费浮动比例是这样的:上年赔款1次,保费打7折;上年赔款2次,保费打8折;上年赔款3次,保费上浮15%;上年赔款4次,保费上浮51%;上年赔款5次以上,保费上浮69%。该公司的有关人士表示,如果上年赔款次数超过了7次,续保时可能会遭拒。目前的研究意见规定中加大了车险保费与赔款记录相关系数的浮动区间,并与交通违章情况挂钩,若车主少违章少出险则保费最多可打5折,反之则保费最高可上浮至现行标准的4.5倍。 +汇鑫安儿童保险的保费是否也与性别有关 有关系,女宝宝会比男宝宝要多一点。如0岁男宝宝趸交是130.4元,3年期交是43.7元,5年期交是27元;而0岁女宝宝趸交是131.6元,3年期交是44.1元,5年期交是27.2元。 +在中国,哪个品牌的餐饮照明比较好? 一般来说美尔家比较可靠吧,有保障 +...... +``` + + +### 3.3 代码说明 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— model.py # SimCSE模型 +|—— train.py # SimCSE训练主脚本 +|—— ann_util.py # Ann 建索引库相关函数 +|—— config.py # Milvus 配置文件 +|—— evaluate.py # 召回评估文件 +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— export_model.py # 动态图转换成静态图 +|—— export_to_serving.py # 静态图转 Serving +|—— feature_extract.py # 批量提取文本的特征向量 +|—— milvus_util.py # Milvus的插入和召回类 +|—— milvus_ann_search.py # 向 Milvus 引擎插入向量的函数 +|—— run_system.py # Client Server 模式客户端,向 server 发送文本,得到向量后,利用milvus引擎进行检索 +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train.sh # 训练 bash 版本 + |—— feature_extract.sh # 向量抽取 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 +|—— deploy + |—— python + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 +``` + +### 3.3 效果评估 + +以下实验结果使用的是模型是`rocketqa-zh-dureader-query-encoder`: + +| 模型 | Recall@1 |Recall@5 |Recall@10 | +| ------------ | ------------ |--------- |--------- | +| RocketQA + SimCSE | 82.827 | 93.791| 96.169| +| RocketQA + SimCSE + WR | 82.695 | 93.791| 96.301| +| RocketQA + SimCSE + WR + 同义词 | 85.205 | 93.923| 95.509| +| RocketQA + SimCSE + 同义词 + RDrop | **85.469** | **94.716**| **96.433**| + + + +## 4. 动手实践——搭建自己的端到端检索式问答系统 + +### 4.1 环境安装 + +在运行下面的代码之前,安装相关的依赖,运行下面的命令: + +``` +pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +### 4.2 模型训练 + +SimCSE可以使用2种方式进行训练,即有监督训练和无监督训练,区别在于无监督训练不需要标注数据集,有监督训练需要标注好问句对,下面是无监督的执行方式。 + +#### 无监督训练 + +无监督训练执行下面的方式,可以选择`train.csv`,纯无监督文本,或者数据增强的数据`train_aug.csv`,然后执行下面的命令: + +``` +python -u -m paddle.distributed.launch --gpus='0' \ + train.py \ + --device gpu \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 50 \ + --eval_steps 50 \ + --max_seq_length 64 \ + --dropout 0.2 \ + --output_emb_size 256 \ + --dup_rate 0.1 \ + --rdrop_coef 0.1 \ + --train_set_file "./data/train_aug.csv" +``` + +参数含义说明 + +* `device`: 使用 cpu/gpu 进行训练 +* `save_dir`: 模型存储路径 +* `model_name_or_path`: 预训练语言模型名,用于模型的初始化 +* `batch_size`: 训练的batch size的大小 +* `learning_rate`: 训练的学习率的大小 +* `epochs`: 训练的epoch数 +* `is_unsupervised`:是否使用无监督的训练方式 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `max_seq_length`: 输入序列的最大长度 +* `dropout`: SimCSE的dropout参数 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `dup_rate` : SimCSE的 Word reptition 策略的重复率 +* `train_set_file`: 训练集文件 +* `rdrop_coef`: R-Drop的系数 + +也可以使用下面的bash脚本: + +``` +sh scripts/train.sh +``` + +### 4.3 评估 + +效果评估分为 4 个步骤: + +a. 获取Doc端Embedding + +基于语义索引模型抽取出Doc样本库的文本向量。 + +b. 采用hnswlib对Doc端Embedding建库 + +使用 ANN 引擎构建索引库(这里基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +c. 
获取question的Embedding并查询相似结果 + +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 2 步中建立的索引库中进行 ANN 查询,召回 Top10 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件。 + +d. 评估 + +基于评估集 `test.csv` 和召回结果 `recall_result` 计算评估指标 Recall@k,其中k取值1,5,10。 + +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_100/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 64 \ + --recall_num 10 \ + --similar_text_pair "data/test_pair.csv" \ + --corpus_file "data/corpus.csv" +``` +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `model_name_or_path`: 预训练语言模型名,用于模型的初始化 +* `params_path`: 待评估模型的参数文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 +* `corpus_file`: 召回库数据 corpus_file + +也可以使用下面的bash脚本: + +``` +sh scripts/run_build_index.sh +``` + +run_build_index.sh还包含cpu和gpu运行的脚本,默认是gpu的脚本 + +接下来,运行如下命令进行效果评估,产出Recall@1, Recall@5, Recall@10 指标: +``` +python -u evaluate.py \ + --similar_text_pair "data/test_pair.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 10 +``` +也可以使用下面的bash脚本: + +``` +sh scripts/evaluate.sh +``` +输出如下的结果: + +``` +recall@1=84.941 +recall@5=94.452 +recall@10=96.433 +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +### 4.4 模型部署 + +模型部署模块首先要把动态图转换成静态图,然后转换成serving的格式。 + +#### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/model_100/model_state.pdparams \ + --output_path=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +#### 问答检索引擎 + +模型准备结束以后,开始搭建 Milvus 的语义检索引擎,用于语义向量的快速检索,本项目使用[Milvus](https://milvus.io/)开源工具进行向量检索,Milvus 的搭建教程请参考官方教程 [Milvus官方安装教程](https://milvus.io/docs/v2.1.x/install_standalone-docker.md)本案例使用的是 Milvus 的2.1 版本,建议使用官方的 Docker-Compose 安装方式,简单快捷。 + + +Milvus 搭建完系统以后就可以插入和检索向量了,首先生成 embedding 向量,每个样本生成256维度的向量: + +``` +python feature_extract.py \ + --model_dir=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --corpus_file "data/corpus.csv" +``` +其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。 + +也可以运行下面的bash脚本: + +``` +sh scripts/feature_extract.sh +``` + +然后向搭建好的 Milvus 系统插入向量: + +``` +python milvus_ann_search.py --data_path data/qa_pair.csv \ + --embedding_path corpus_embedding.npy \ + --batch_size 100000 \ + --insert +``` + +另外,Milvus提供了可视化的管理界面,可以很方便的查看数据,安装地址为[Attu](https://github.com/zilliztech/attu). 
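+
+向量插入完成后,可以先用 pymilvus 直接查询一次,确认向量检索已经生效。下面是一个最小示例(仅作示意):它复用 `config.py` 中的集合名、字段名与检索参数,并假设上一步 `feature_extract.py` 生成的 `corpus_embedding.npy` 仍在当前目录;`text` 输出字段名以 `milvus_ann_search.py` 实际建表的 schema 为准:
+
+```python
+import numpy as np
+from pymilvus import Collection, connections
+
+from config import MILVUS_HOST, MILVUS_PORT, collection_name, embedding_name, search_params, top_k
+
+# Connect to the standalone Milvus instance configured in config.py
+connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
+collection = Collection(collection_name)
+collection.load()
+
+# Reuse one of the extracted corpus vectors as a query, just for a quick sanity check
+query_embedding = np.load("corpus_embedding.npy")[0]
+
+results = collection.search(
+    data=[query_embedding.tolist()],
+    anns_field=embedding_name,
+    param=search_params,
+    limit=top_k,
+    output_fields=["text"],  # assumes the schema created by milvus_ann_search.py
+)
+for hits in results:
+    for hit in hits:
+        # With the L2 metric, a smaller distance means a more similar text
+        print(hit.distance, hit.entity.get("text"))
+```
+
+完整的端到端检索流程(客户端发送文本、服务端抽取向量、Milvus 召回、排序)请参考下文的 Paddle Serving 部署以及 `run_system.py`。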
+ + +#### Paddle Serving 部署 + +Paddle Serving 的安装可以参考[Paddle Serving 安装文档](https://github.com/PaddlePaddle/Serving#installation)。需要在服务端和客户端安装相关的依赖,用pip安装Paddle Serving的依赖如下: + +``` +pip install paddle-serving-client==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple +pip install paddle-serving-app==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# 如果是CPU部署,只需要安装CPU Server +pip install paddle-serving-server==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# 如果是GPU Server,需要确认环境再选择执行哪一条,推荐使用CUDA 10.2的包 +# CUDA10.2 + Cudnn7 + TensorRT6(推荐) +pip install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple +``` +更详细的安装信息请参考[链接](https://github.com/PaddlePaddle/Serving/blob/v0.9.0/doc/Install_Linux_Env_CN.md),安装完依赖后就可以执行下面的步骤。首先把生成的静态图模型导出为 Paddle Serving的格式,命令如下: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "./serving_server" \ + --client_path "./serving_client" \ + --fetch_alias_names "output_embedding" +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` + +启动 Pipeline Server: + +``` +cd deploy/python/ +python web_service.py --model_name_or_path rocketqa-zh-base-query-encoder +``` + +启动客户端调用 Server, 使用 POST的方式: + +向服务端发送 POST 请求示例: + +``` +curl -X POST -k http://localhost:8090/ernie/prediction -d '{"key": ["0"], "value": ["买了社保,是不是就不用买商业保险了?"]}' +``` + +也可以使用 rpc的方式: + +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [ + "买了社保,是不是就不用买商业保险了?", +] +``` +然后运行: + +``` +python rpc_client.py +``` + +对于Windows用户,启动下面的Pipeline Server: + +``` +python web_service_windows.py --model_name_or_path rocketqa-zh-base-query-encoder +``` + +启动客户端调用 Server, 使用 POST的方式(Windows不支持RPC的调用方式),首先修改http_client.py中需要预测的样本: + +``` +data = {"feed": ["买了社保,是不是就不用买商业保险了?"], "fetch": ["output_embedding"]} +``` +然后运行: +``` +python http_client.py +``` + +### 4.5 问答系统整个流程 + +问答系统使用了Client Server的模式,即抽取向量的模型部署在服务端,然后启动客户端(Client)端去访问。 + + +``` +python run_system.py +``` +代码内置的测试用例为: + +``` +list_data = ["买了社保,是不是就不用买商业保险了?"] +``` + +会输出如下的结果: + +``` +...... +PipelineClient::predict pack_data time:1663127450.1656108 +PipelineClient::predict before time:1663127450.166227 +Extract feature time to cost :0.017495155334472656 seconds + +=== start connecting to Milvus === +=== Connect collection faq_finance === +Search milvus time cost is 0.18691015243530273 seconds +如果你买社会保险,你不需要买商业保险吗? 
社保是基础的,就是我们通常说的“五险”包括:基本养老保险、基本医疗保险、失业保险、工伤保险和生育保险。而商业保险则是保障。 0.32494643330574036 +已有社会保险还需要买商业保险吗 社保是社会保险的简称社会保险是指国家为了预防和分担年老失业疾病以及死亡等社会风险实现社会安全而强制社会多数成员参加的具有所得重分配功能的非营利性的社会安全制度主要包括基本医疗保险基本养老保险工伤保险失业保险生育保险五大类险种,商业保险是社保的一个补充,如果有足够的经济条件可以进行购买。1、社保覆盖面广,不存在拒保问题,但是保障较低,只能满足基本的保障需求。社保中的医疗保险,住院一般可报70%。而且这70%的医疗费,限于扣除起付线标准后。而且,在社保规定用药和规定项目内。许多检查费、专家诊疗、高新尖诊疗技术,社保都是不报的。这就需配合必要的商业保险了。2、另外,社保医疗是出院后报的,商业医保中的重疾险是确诊后就可以给钱,可以弥补很多家庭没钱治的困境;3、商业保险可以选择购买更高的保额,社保则很有限;社保医疗只是补偿医药费,而没有住院期间的收入损失补偿,商业医疗就有住院补贴。总之,建议在有了社保后,再购买适合自己的寿险,加上意外险、住院医疗、重疾医疗保险,就是非常的完善的保障了。 0.38041722774505615 +..... +``` +输出的结果包括特征提取和检索的时间,还包含检索出来的问答对。 + + + + +## 5. 模型优化 + +### 5.1 有监督训练[优化步骤,可选] + +无监督的方式对模型的提升有限,如果需要继续提升模型,则需要标注数据。构造类似`train_aug.csv`中的句子对,只需要构造相似句子对即可,不需要构造不相似的句子对。 + +``` +python -u -m paddle.distributed.launch --gpus='0' \ + train.py \ + --device gpu \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 50 \ + --eval_steps 50 \ + --max_seq_length 64 \ + --dropout 0.2 \ + --output_emb_size 256 \ + --dup_rate 0.1 \ + --rdrop_coef 0.1 \ + --train_set_file "./data/train_aug.csv" +``` + +其他步骤同上,只是使用的数据集是有监督数据。 + + +## 6.参考文献 + +[1] Tianyu Gao, Xingcheng Yao, Danqi Chen: [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821). EMNLP (1) 2021: 6894-6910 diff --git a/applications/question_answering/supervised_qa/faq_finance/ann_util.py b/applications/question_answering/supervised_qa/faq_finance/ann_util.py new file mode 100644 index 0000000000000000000000000000000000000000..4e4983bfc4d6f581e14180804591dda2d0897465 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/ann_util.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import hnswlib +import numpy as np + +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space="ip", dim=args.output_emb_size if args.output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=args.hnsw_max_elements, ef_construction=args.hnsw_ef, M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/applications/question_answering/supervised_qa/faq_finance/config.py b/applications/question_answering/supervised_qa/faq_finance/config.py new file mode 100644 index 0000000000000000000000000000000000000000..365da31198aa0da6c08265a094f51f2df61bb426 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/config.py @@ -0,0 +1,34 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +search_param = {"nprobe": 20} +collection_name = "faq_finance" +partition_tag = "partition_1" + +MILVUS_HOST = "10.21.226.175" +MILVUS_PORT = 8530 +data_dim = 256 +top_k = 10 +embedding_name = "embeddings" + +index_config = { + "index_type": "IVF_FLAT", + "metric_type": "L2", + "params": {"nlist": 1000}, +} + +search_params = { + "metric_type": "L2", + "params": {"nprobe": top_k}, +} diff --git a/applications/question_answering/supervised_qa/faq_finance/data.py b/applications/question_answering/supervised_qa/faq_finance/data.py new file mode 100644 index 0000000000000000000000000000000000000000..c0011cb431f8bc2e5df62b94f6b8c6c0981e6654 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/data.py @@ -0,0 +1,196 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import random + +import numpy as np +import paddle + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +def convert_example_test(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_simcse_text(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_test is True: + if len(data) != 3: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} + else: + if len(data) != 2: + continue + + yield {"text_a": data[0], "text_b": data[1]} + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text + + +def word_repetition(input_ids, token_type_ids, dup_rate=0.32): + """Word Repetition strategy.""" + input_ids = input_ids.numpy().tolist() + token_type_ids = token_type_ids.numpy().tolist() + + batch_size, seq_len = len(input_ids), len(input_ids[0]) + repetitied_input_ids = [] + repetitied_token_type_ids = [] + rep_seq_len = seq_len + for batch_id in range(batch_size): + cur_input_id = input_ids[batch_id] + actual_len = np.count_nonzero(cur_input_id) + dup_word_index = [] + # If sequence length is less than 5, skip it + if actual_len > 5: + dup_len = random.randint(a=0, b=max(2, int(dup_rate * actual_len))) + # Skip cls and sep position + dup_word_index = random.sample(list(range(1, actual_len - 1)), k=dup_len) + + r_input_id = [] + r_token_type_id = [] + for idx, word_id in enumerate(cur_input_id): + # Insert duplicate word + if idx in dup_word_index: + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + after_dup_len = len(r_input_id) + repetitied_input_ids.append(r_input_id) + repetitied_token_type_ids.append(r_token_type_id) + + if after_dup_len > rep_seq_len: + rep_seq_len = after_dup_len + # Padding the data to the same length + for batch_id in range(batch_size): + after_dup_len = len(repetitied_input_ids[batch_id]) + pad_len = rep_seq_len - after_dup_len + repetitied_input_ids[batch_id] += [0] * pad_len + repetitied_token_type_ids[batch_id] += [0] * pad_len + + return paddle.to_tensor(repetitied_input_ids, dtype="int64"), paddle.to_tensor( + repetitied_token_type_ids, dtype="int64" + ) diff --git a/applications/question_answering/supervised_qa/faq_finance/deploy/python/config_nlp.yml b/applications/question_answering/supervised_qa/faq_finance/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..229f66090ebf328b03aaa6b41120277176e97d97 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 
当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8090 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + # ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '2' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/question_answering/supervised_qa/faq_finance/deploy/python/http_client.py b/applications/question_answering/supervised_qa/faq_finance/deploy/python/http_client.py new file mode 100644 index 0000000000000000000000000000000000000000..65ecca248ab44e3de890b9543ca9e426b17af494 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/deploy/python/http_client.py @@ -0,0 +1,31 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import json + +import numpy as np +import requests + +headers = {"Content-type": "application/json"} +url = "http://10.21.226.175:8080/ernie/prediction" # XXX取决于服务端YourService的初始化name参数 + +data = {"feed": ["买了社保,是不是就不用买商业保险了?"], "fetch": ["output_embedding"]} +data = json.dumps(data) +print(data) +r = requests.post(url=url, headers=headers, data=data) +print(r.json()) +json_data = r.json() +data = np.array(json_data["result"]["output_embedding"]) +print(data.shape) diff --git a/applications/question_answering/supervised_qa/faq_finance/deploy/python/rpc_client.py b/applications/question_answering/supervised_qa/faq_finance/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..877b6190408adaf0693d90620e21a1087b1bc959 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/deploy/python/rpc_client.py @@ -0,0 +1,36 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time + +import numpy as np +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = ["买了社保,是不是就不用买商业保险了?"] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = item + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) + +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service.py b/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..fca4f023c7c9a01e1de3b15eccd75fcb97e220dd --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service.py @@ -0,0 +1,83 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
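+
+# This service wraps the exported SimCSE encoder as a Paddle Serving pipeline.
+# ErnieOp tokenizes incoming queries with AutoTokenizer, batches them with
+# Pad/Tuple, feeds input_ids and token_type_ids to the predictor and returns
+# the "output_embedding" field. Ports, device and model path come from
+# config_nlp.yml. The matching clients live in this directory: rpc_client.py
+# calls PipelineClient.predict, and http_client.py posts a JSON body such as
+#   {"feed": ["买了社保,是不是就不用买商业保险了?"], "fetch": ["output_embedding"]}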
+ +import argparse + +from paddle_serving_server.web_service import Op, WebService + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="Select tokenizer name to for model") +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + input_ids, segment_ids = convert_example([input_dict[str(i)]], self.tokenizer) + examples.append((input_ids, segment_ids)) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ), + ): + return fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +if __name__ == "__main__": + ernie_service = ErnieService(name="ernie") + ernie_service.prepare_pipeline_config("config_nlp.yml") + ernie_service.run_service() diff --git a/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service_windows.py b/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service_windows.py new file mode 100644 index 0000000000000000000000000000000000000000..538fe3f58f8dd11058c350f39b18bfbc22bbf41a --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/deploy/python/web_service_windows.py @@ -0,0 +1,80 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
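+
+# A variant of web_service.py intended for Windows: it uses the plain
+# WebService (debugger service) API instead of the pipeline Op API, loads the
+# serving_server config directly, applies the same tokenize-and-pad
+# preprocessing, and returns the fetched embeddings as plain Python lists.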
+ +import argparse + +from paddle_serving_server.web_service import WebService + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="Select tokenizer name to for model") +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieService(WebService): + def init_service(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + def preprocess(self, feed=[], fetch=[]): + from paddlenlp.data import Pad, Tuple + + print("input dict", feed) + batch_size = len(feed) + is_batch = True + examples = [] + for i in range(batch_size): + input_ids, segment_ids = convert_example([feed[i]], self.tokenizer) + examples.append((input_ids, segment_ids)) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ), + ): + return fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, fetch, is_batch + + def postprocess(self, feed=[], fetch=[], fetch_map=None): + for key in fetch_map: + fetch_map[key] = fetch_map[key].tolist() + return fetch_map + + +if __name__ == "__main__": + ernie_service = ErnieService(name="ernie") + ernie_service.load_model_config("../../serving_server") + ernie_service.prepare_server(workdir="workdir", port=8080) + ernie_service.init_service() + ernie_service.run_debugger_service() + ernie_service.run_web_service() diff --git a/applications/question_answering/supervised_qa/faq_finance/evaluate.py b/applications/question_answering/supervised_qa/faq_finance/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..aabeadf5b197e24177e22a6331f8aa9fe4ef2c1b --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/evaluate.py @@ -0,0 +1,83 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
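+
+# Scores the recall results written by recall.py: for each query it checks
+# whether the labelled similar text appears among the top-N recalled
+# candidates and reports Recall@1/5/10 as the mean of these hit flags.
+# For example, for three queries with hit flags [0, 0, 1], [0, 1, 0] and
+# [1, 0, 0], Recall@1 = 1/3 and Recall@3 = 1.0.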
+ +import argparse + +import numpy as np + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default='', help="The full path of similat pair file") +parser.add_argument("--recall_result_file", type=str, default='', help="The full path of recall result file") +parser.add_argument("--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query") + + +args = parser.parse_args() +# yapf: enable + + +def recall(rs, N=10): + """ + Ratio of recalled Ground Truth at topN Recalled Docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + >>> 0.6666667 + >>> recall(rs, N=3) + >>> 1.0 + Args: + rs: Iterator of recalled flag() + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + + rs = [] + + with open(args.recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + recall_N = [] + recall_num = [1, 5, 10] + result = open("result.tsv", "a") + res = [] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") + # print("\t".join(recall_N)) diff --git a/applications/question_answering/supervised_qa/faq_finance/export_model.py b/applications/question_answering/supervised_qa/faq_finance/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..e4cf9e3faaea501629d42e632fc1826c7a00050e --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/export_model.py @@ -0,0 +1,58 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
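+
+# Exports a fine-tuned SimCSE checkpoint (--params_path) to a static inference
+# graph: the model's get_pooled_embedding computation is traced with
+# paddle.jit.to_static using two [None, None] int64 InputSpecs (input_ids and
+# token_type_ids) and saved under --output_path, where feature_extract.py and
+# the Serving export script pick it up.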
+ +import argparse +import os + +import paddle +from model import SimCSE + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, + default='./checkpoint/model_50/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', + help="The path of model parameter in static graph to be saved.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--output_emb_size", default=256, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SimCSE(pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/question_answering/supervised_qa/faq_finance/export_to_serving.py b/applications/question_answering/supervised_qa/faq_finance/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..6cc932da11173e54460642c16fd4226411ba3cfb --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/export_to_serving.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. 
It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/question_answering/supervised_qa/faq_finance/feature_extract.py b/applications/question_answering/supervised_qa/faq_finance/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..3e9a419405f5f79bb7664e85cfe18f93d5ff9d6b --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/feature_extract.py @@ -0,0 +1,203 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') + +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ), + ): + return fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in enumerate(tqdm(data)): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) >= self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("corpus_embedding", all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, data in enumerate(file.readlines()): + id2corpus[idx] = data.strip() + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = read_text(args.corpus_file) + + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + predictor.predict(corpus_list, tokenizer) diff --git a/applications/question_answering/supervised_qa/faq_finance/milvus_ann_search.py b/applications/question_answering/supervised_qa/faq_finance/milvus_ann_search.py new file mode 100644 index 0000000000000000000000000000000000000000..36192fc0bec7c7465ec6481665b376cc2867d2dc --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/milvus_ann_search.py @@ -0,0 +1,93 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time + +import numpy as np +from config import collection_name, embedding_name, partition_tag +from milvus_util import RecallByMilvus, VecToMilvus, text_max_len +from tqdm import tqdm + +parser = argparse.ArgumentParser() +parser.add_argument( + "--data_path", default="data/corpus.csv", type=str, required=True, help="The data for vector extraction." +) +parser.add_argument( + "--embedding_path", default="corpus_embedding.npy", type=str, required=True, help="The vector path for data." 
+) +parser.add_argument("--index", default=0, type=int, help="index of the vector for search") +parser.add_argument("--insert", action="store_true", help="whether to insert data") +parser.add_argument("--search", action="store_true", help="whether to search data") +parser.add_argument("--batch_size", default=100000, type=int, help="number of examples to insert each time") +args = parser.parse_args() + + +def read_text(file_path): + file = open(file_path) + id2corpus = [] + for idx, data in enumerate(file.readlines()): + question, answer = data.strip().split("\t") + id2corpus.append({"question": question, "answer": answer}) + return id2corpus + + +def milvus_data_insert(data_path, embedding_path, batch_size): + corpus_list = read_text(data_path) + embeddings = np.load(embedding_path) + embedding_ids = [i for i in range(embeddings.shape[0])] + client = VecToMilvus() + client.drop_collection(collection_name) + data_size = len(embedding_ids) + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + entities = [ + [j for j in range(i, cur_end, 1)], + [corpus_list[j]["question"][: text_max_len - 1] for j in range(i, cur_end, 1)], + [corpus_list[j]["answer"][: text_max_len - 1] for j in range(i, cur_end, 1)], + batch_emb, # field embeddings, supports numpy.ndarray and list + ] + client.insert( + collection_name=collection_name, entities=entities, index_name=embedding_name, partition_tag=partition_tag + ) + + +def milvus_data_recall(embedding_path, index): + embeddings = np.load(embedding_path) + embedding_ids = [i for i in range(embeddings.shape[0])] + recall_client = RecallByMilvus() + if index > len(embedding_ids): + print("Index should not be larger than embedding size") + return + embeddings = embeddings[np.arange(index, index + 1)] + time_start = time.time() + result = recall_client.search( + embeddings, embedding_name, collection_name, partition_names=[partition_tag], output_fields=["pk", "text"] + ) + time_end = time.time() + sum_t = time_end - time_start + print("time cost", sum_t, "s") + for hits in result: + for hit in hits: + print(f"hit: {hit}, text field: {hit.entity.get('text')}") + + +if __name__ == "__main__": + if args.insert: + milvus_data_insert(args.data_path, args.embedding_path, args.batch_size) + if args.search: + milvus_data_recall(args.embedding_path, args.index) diff --git a/applications/question_answering/supervised_qa/faq_finance/milvus_util.py b/applications/question_answering/supervised_qa/faq_finance/milvus_util.py new file mode 100644 index 0000000000000000000000000000000000000000..92d55ccf132b1cc9cadf78b9401f020c5c198f59 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/milvus_util.py @@ -0,0 +1,170 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
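+
+# Milvus helpers for this application. The collection schema holds an int64
+# primary key, VARCHAR "question" and "answer" fields, and a FLOAT_VECTOR
+# "embeddings" field of dimension data_dim. VecToMilvus creates the
+# collection, index and partition and inserts entities in batches;
+# RecallByMilvus runs the ANN search with the search_params defined in config.py.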
+ + +import numpy as np +from config import ( + MILVUS_HOST, + MILVUS_PORT, + data_dim, + index_config, + search_params, + top_k, +) +from pymilvus import ( + Collection, + CollectionSchema, + DataType, + FieldSchema, + connections, + utility, +) + +fmt = "\n=== {:30} ===\n" +text_max_len = 1000 +fields = [ + FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False, max_length=100), + FieldSchema(name="question", dtype=DataType.VARCHAR, max_length=text_max_len), + FieldSchema(name="answer", dtype=DataType.VARCHAR, max_length=text_max_len), + FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=data_dim), +] +schema = CollectionSchema(fields, "Neural Search Index") + + +class VecToMilvus: + def __init__(self): + print(fmt.format("start connecting to Milvus")) + connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT) + self.collection = None + + def has_collection(self, collection_name): + try: + has = utility.has_collection(collection_name) + print(f"Does collection {collection_name} exist in Milvus: {has}") + return has + except Exception as e: + print("Milvus has_table error:", e) + + def creat_collection(self, collection_name): + try: + print(fmt.format("Create collection {}".format(collection_name))) + self.collection = Collection(collection_name, schema, consistency_level="Strong") + except Exception as e: + print("Milvus create collection error:", e) + + def drop_collection(self, collection_name): + try: + utility.drop_collection(collection_name) + except Exception as e: + print("Milvus delete collection error:", e) + + def create_index(self, index_name): + try: + print(fmt.format("Start Creating index")) + self.collection.create_index(index_name, index_config) + print(fmt.format("Start loading")) + self.collection.load() + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, partition_tag): + try: + result = self.collection.has_partition(partition_tag) + return result + except Exception as e: + print("Milvus has partition error: ", e) + + def create_partition(self, partition_tag): + try: + self.collection.create_partition(partition_tag) + print("create partition {} successfully".format(partition_tag)) + except Exception as e: + print("Milvus create partition error: ", e) + + def insert(self, entities, collection_name, index_name, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.creat_collection(collection_name) + self.create_index(index_name) + else: + self.collection = Collection(collection_name) + if (partition_tag is not None) and (not self.has_partition(partition_tag)): + self.create_partition(partition_tag) + + self.collection.insert(entities, partition_name=partition_tag) + print(f"Number of entities in Milvus: {self.collection.num_entities}") # check the num_entites + except Exception as e: + print("Milvus insert error:", e) + + +class RecallByMilvus: + def __init__(self): + print(fmt.format("start connecting to Milvus")) + connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT) + self.collection = None + + def get_collection(self, collection_name): + try: + print(fmt.format("Connect collection {}".format(collection_name))) + self.collection = Collection(collection_name) + except Exception as e: + print("Milvus create collection error:", e) + + def search(self, vectors, embedding_name, collection_name, partition_names=[], output_fields=[]): + try: + self.get_collection(collection_name) + result = self.collection.search( + vectors, + embedding_name, + 
search_params, + limit=top_k, + partition_names=partition_names, + output_fields=output_fields, + ) + return result + except Exception as e: + print("Milvus recall error: ", e) + + +if __name__ == "__main__": + print(fmt.format("Start inserting entities")) + rng = np.random.default_rng(seed=19530) + num_entities = 3000 + entities = [ + # provide the pk field because `auto_id` is set to False + [i for i in range(num_entities)], + ["第{}个样本".format(i) for i in range(num_entities)], # field text, only supports list + rng.random((num_entities, data_dim)), # field embeddings, supports numpy.ndarray and list + ] + print(entities[-1].shape) + collection_name = "test1" + partition_tag = "partition_1" + embedding_name = "embeddings" + client = VecToMilvus() + client.insert( + collection_name=collection_name, entities=entities, index_name=embedding_name, partition_tag=partition_tag + ) + print(fmt.format("Start searching entities")) + vectors_to_search = entities[-1][-2:] + recall_client = RecallByMilvus() + result = recall_client.search( + vectors_to_search, + embedding_name, + collection_name, + partition_names=[partition_tag], + output_fields=["pk", "text"], + ) + for hits in result: + for hit in hits: + print(f"hit: {hit}, random field: {hit.entity.get('text')}") diff --git a/applications/question_answering/supervised_qa/faq_finance/model.py b/applications/question_answering/supervised_qa/faq_finance/model.py new file mode 100644 index 0000000000000000000000000000000000000000..1850f185a2d1808a7d2ab48f55941ead82ac53a7 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/model.py @@ -0,0 +1,143 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
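+
+# SimCSE semantic-matching model: a pretrained encoder plus an optional linear
+# layer that projects the pooled [CLS] embedding down to output_emb_size
+# (L2-normalized). get_pooled_embedding is exported as a static graph for
+# inference. forward() computes an in-batch-negative cross-entropy loss over
+# the scaled query-title cosine-similarity matrix (with `margin` subtracted
+# from the diagonal) and also returns an R-Drop KL loss, which the training
+# script is expected to weight via --rdrop_coef.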
+ + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +import paddlenlp + + +class SimCSE(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.0, scale=20, output_emb_size=None): + + super().__init__() + + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + self.classifier = nn.Linear(output_emb_size, 2) + self.rdrop_loss = paddlenlp.losses.RDropLoss() + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, with_pooler=True + ): + + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if with_pooler is False: + cls_embedding = sequence_output[:, 0, :] + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + with_pooler=True, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + logits1 = self.classifier(query_cls_embedding) + logits2 = self.classifier(title_cls_embedding) + kl_loss = self.rdrop_loss(logits1, logits2) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + 
margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss, kl_loss diff --git a/applications/question_answering/supervised_qa/faq_finance/recall.py b/applications/question_answering/supervised_qa/faq_finance/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..9bfb71d054175a641e07b758ccb8455ca9fe96d9 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/recall.py @@ -0,0 +1,122 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from data import convert_example_test, create_dataloader, gen_id2corpus, gen_text_file +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial(convert_example_test, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ), + ): + return [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + + model = SimCSE(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + 
row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/question_answering/supervised_qa/faq_finance/requirements.txt b/applications/question_answering/supervised_qa/faq_finance/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..2dfbec02607b44cb34cc42a8f5a0e3e0cab7c743 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/requirements.txt @@ -0,0 +1,8 @@ +pymilvus>=2.1.0 +pandas==0.25.1 +paddlenlp>=2.3.7 +paddlepaddle-gpu>=2.2.3 +hnswlib>=0.5.2 +numpy>=1.17.2 +visualdl>=2.2.2 +pybind11 \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/run_system.py b/applications/question_answering/supervised_qa/faq_finance/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..d095a4d59ce384f397cf441983560638a4db1def --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/run_system.py @@ -0,0 +1,67 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import time + +import numpy as np +import pandas as pd +from config import collection_name, embedding_name, partition_tag +from milvus_util import RecallByMilvus +from paddle_serving_server.pipeline import PipelineClient + + +def recall_result(list_data): + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = item + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + return result + + +def search_in_milvus(embeddings, query_text): + recall_client = RecallByMilvus() + start_time = time.time() + results = recall_client.search( + embeddings, + embedding_name, + collection_name, + partition_names=[partition_tag], + output_fields=["pk", "question", "answer"], + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + list_data = [] + for line in results: + for item in line: + + distance = item.distance + question = item.entity.get("question") + answer = item.entity.get("answer") + print(question, answer, distance) + list_data.append([query_text, question, answer, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "question", "answer", "distance"]) + df.to_csv("faq_result.csv", index=False) + + +if __name__ == "__main__": + list_data = ["买了社保,是不是就不用买商业保险了?"] + result = recall_result(list_data) + df = search_in_milvus(result, list_data[0]) diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/evaluate.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/evaluate.sh new file mode 100644 index 
0000000000000000000000000000000000000000..7b0a901f9e7b6b77c2c832b849847395f675145f --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/evaluate.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + python -u evaluate.py \ + --similar_text_pair "data/test_pair.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 10 \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/export_model.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..7cd26597635a5c7006aa5d53041a0b58d8057346 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/export_model.sh @@ -0,0 +1,17 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py --params_path checkpoints/model_100/model_state.pdparams \ + --output_path=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/export_to_serving.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..7a7337b40b7a7c2d652ce2a837562eaceeba0531 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/export_to_serving.sh @@ -0,0 +1,21 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/feature_extract.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/feature_extract.sh new file mode 100644 index 0000000000000000000000000000000000000000..25862539311dfff26ccd1f7563743dedc3db86fc --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/feature_extract.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python feature_extract.py \ + --model_dir=./output \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --corpus_file "data/corpus.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/run_build_index.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..f235047e3ad3bde4ee580ec9c06a7ef61c9c1e5f --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/run_build_index.sh @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# gpu +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --params_path "checkpoints/model_100/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 64 \ + --recall_num 10 \ + --similar_text_pair "data/test_pair.csv" \ + --corpus_file "data/corpus.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/scripts/train.sh b/applications/question_answering/supervised_qa/faq_finance/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..f1da0dd71e827edd322d768e96a7fee599da013d --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/scripts/train.sh @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -u -m paddle.distributed.launch --gpus='1' \ + train.py \ + --device gpu \ + --model_name_or_path rocketqa-zh-base-query-encoder \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 50 \ + --eval_steps 50 \ + --max_seq_length 64 \ + --dropout 0.2 \ + --output_emb_size 256 \ + --dup_rate 0.1 \ + --rdrop_coef 0.1 \ + --train_set_file "./data/train_aug.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_finance/train.py b/applications/question_answering/supervised_qa/faq_finance/train.py new file mode 100644 index 0000000000000000000000000000000000000000..52fbb51b566f970d2be8f467ee396420cb4b8a62 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_finance/train.py @@ -0,0 +1,222 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import ( + convert_example, + create_dataloader, + read_simcse_text, + read_text_pair, + word_repetition, +) +from model import SimCSE +from scipy import stats + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override ecpochs.") +parser.add_argument('--eval_steps', type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--is_unsupervised", action='store_true', help="Whether to use unsupervised training") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout for pretrained model encoder.") +parser.add_argument("--dup_rate", default=0.32, type=float, help="duplicate rate for word repetition.") +parser.add_argument("--infer_with_fc_pooler", action='store_true', help="Whether use fc layer after cls embedding or not for when infer.") +parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="The pretrained model used for training") +parser.add_argument("--rdrop_coef", default=0.0, type=float, help="The coefficient of KL-Divergence loss in R-Drop paper, for more detail please refer to https://arxiv.org/abs/2106.14448), if rdrop_coef > 0 then R-Drop works") +args = parser.parse_args() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_evaluate(model, tokenizer, data_loader, with_pooler=False): + model.eval() + + total_num = 0 + spearman_corr = 0.0 + sims = [] + labels = [] + + for batch in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, label = batch + total_num += len(label) + + query_cls_embedding = model.get_pooled_embedding( + query_input_ids, query_token_type_ids, with_pooler=with_pooler) + + title_cls_embedding = model.get_pooled_embedding(title_input_ids, title_token_type_ids, with_pooler=with_pooler) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + + sims.append(cosine_sim.numpy()) + labels.append(label.numpy()) + + sims = np.concatenate(sims, axis=0) + labels = np.concatenate(labels, axis=0) + + spearman_corr = stats.spearmanr(labels, 
sims).correlation + model.train() + return spearman_corr, total_num + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + if args.is_unsupervised: + train_ds = load_dataset(read_simcse_text, data_path=args.train_set_file, is_test=False, lazy=False) + else: + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, is_test=False, lazy=False) + + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path, hidden_dropout_prob=args.dropout, attention_probs_dropout_prob=args.dropout) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ), + ): + return [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, + mode='train', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + model = SimCSE( + pretrained_model, + margin=args.margin, + scale=args.scale, + output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len( + train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + if random.random() < 0.2: + title_input_ids, title_token_type_ids = query_input_ids, query_token_type_ids + query_input_ids, query_token_type_ids = word_repetition(query_input_ids, query_token_type_ids, args.dup_rate) + title_input_ids, title_token_type_ids = word_repetition(title_input_ids, title_token_type_ids, args.dup_rate) + + loss, kl_loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids) + + loss = loss + kl_loss * args.rdrop_coef + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, + 10 / (time.time() - tic_train))) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + if args.max_steps > 0 and global_step >= args.max_steps: + return + + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/applications/question_answering/supervised_qa/faq_system/README.md b/applications/question_answering/supervised_qa/faq_system/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7992677f102e5a1e47139f77a484fc2ab473c561 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/README.md @@ -0,0 +1,366 @@ +# 政务问答检索式 FAQ System + + **目录** + +* [1. 场景概述](#场景概述) +* [2. 系统特色](#系统特色) +* [3. 政务问答系统方案](#政务问答系统方案) +* [4. 动手实践——搭建自己的端到端检索式问答系统](#动手实践——搭建自己的端到端检索式问答系统) + + + + +## 1. 场景概述 + +政府工作人员往往要做很多政策解读等工作,费时费力还耗费大量的人力,在政府内部,工作人员往往积累了很多问答对,但是不知道怎么构建一个问答系统来辅助工作人员提升日常工作效率,简化工作流程。 + + + +## 2. 系统特色 + ++ 低门槛 + + 手把手搭建检索式 FAQ System + + 无需相似 Query-Query Pair 标注数据也能构建 FAQ System ++ 效果好 + + 业界领先的检索预训练模型: RocketQA Dual Encoder + + 针对无标注数据场景的领先解决方案: 检索预训练模型 + 增强的无监督语义索引微调 + ++ 性能快 + + 基于 Paddle Inference 快速抽取向量 + + 基于 Milvus 快速查询和高性能建库 + + 基于 Paddle Serving 高性能部署 + + + +## 3. 
政务问答系统方案 + +### 3.1 技术方案和评估指标 + +#### 3.1.1 技术方案 + +**语义索引**:针对政务问答只有问答对的场景,我们提供了一个 融合SimCSE 和 WR (word reptition)策略的无监督的解决方案。 + +#### 3.1.2 评估指标 + +* 该政务问答系统使用的指标是 Recall@K,表示的是预测的前topK(从最后的按得分排序的召回列表中返回前K个结果)结果和语料库中真实的前 K 个相关结果的重叠率,衡量的是检索系统的查全率。 + + +### 3.2 数据说明 + +数据集来源于疫情政务问答比赛数据,包括源文本,问题和答案。 + +| 阶段 |模型 | 训练集 | 评估集(用于评估模型效果) | 召回库 | +| ------------ | ------------ |------------ | ------------ | ------------ | +| 召回 | SimCSE | 4000 | 1000 | 5000 | + +其中评估集的问题对的构造使用了中英文回译的方法,总共有1000条评估集,其中500条数据使用的是百度翻译的API,详情请参考[百度翻译](https://fanyi-api.baidu.com/?fr=simultaneous),另外500条数据使用了SimBERT模型生成的同义句。 + + +``` +├── data # 数据集 + ├── train.csv # 无监督训练集 + ├── test_pair.csv # 测试集,用于评估模型的效果 + ├── corpus.csv # 构建召回的数据,用于评估模型的召回效果 + ├── qa_pair.csv # 问答对,问题对应的答案 +``` +数据集的下载链接为: [faq_data](https://paddlenlp.bj.bcebos.com/applications/faq_data.zip) + +### 3.3 代码说明 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— model.py # SimCSE模型 +|—— train.py # SimCSE训练主脚本 +|—— ann_util.py # Ann 建索引库相关函数 +|—— config.py # Milvus 配置文件 +|—— evaluate.py # 召回评估文件 +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— export_model.py # 动态图转换成静态图 +|—— export_to_serving.py # 静态图转 Serving +|—— feature_extract.py # 批量提取文本的特征向量 +|—— milvus_util.py # Milvus的插入和召回类 +|—— vector_insert.py # 向 Milvus 引擎插入向量的函数 +|—— run_system.py # Client Server 模式客户端,向 server 发送文本,得到向量后,利用milvus引擎进行检索 +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train.sh # 训练 bash 版本 + |—— feature_extract.sh # 向量抽取 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 +|—— deploy + |—— python + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 +``` + +### 3.3 效果评估 + +| 模型 | Recall@1 |Recall@10 | +| ------------ | ------------ |--------- | +| ERNIE1.0 + SimCSE | 68.068 | 85.686| +| RocketQA | 81.381 | 96.997| +| RocketQA + SimCSE | 83.283 | 97.297| +| RocketQA + SimCSE + WR | **83.584** | **97.497**| + + + +## 4. 动手实践——搭建自己的端到端检索式问答系统 + +### 4.1 环境安装 + +在运行下面的代码之前,安装相关的依赖,运行下面的命令: + +``` +pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +### 4.2 无监督训练 + +``` +python -u -m paddle.distributed.launch --gpus '0' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 50 \ + --max_seq_length 64 \ + --dropout 0.2 \ + --output_emb_size 256 \ + --dup_rate 0.3 \ + --train_set_file "./data/train.csv" +``` + +参数含义说明 + +* `device`: 使用 cpu/gpu 进行训练 +* `save_dir`: 模型存储路径 +* `batch_size`: 训练的batch size的大小 +* `learning_rate`: 训练的学习率的大小 +* `epochs`: 训练的epoch数 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `max_seq_length`: 输入序列的最大长度 +* `dropout`: SimCSE的dropout参数 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `dup_rate` : SimCSE的 Word reptition 策略的重复率 +* `train_set_file`: 训练集文件 + +也可以使用下面的bash脚本: + +``` +sh scripts/train.sh +``` + +### 4.3 评估 + +效果评估分为 4 个步骤: + +a. 获取Doc端Embedding + +基于语义索引模型抽取出Doc样本库的文本向量。 + +b. 采用hnswlib对Doc端Embedding建库 + +使用 ANN 引擎构建索引库(这里基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +c. 获取question的Embedding并查询相似结果 + +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 2 步中建立的索引库中进行 ANN 查询,召回 Top10 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件。 + +d. 
评估 + +基于评估集 `test.csv` 和召回结果 `recall_result` 计算评估指标 Recall@k,其中k取值1,5,10。 + +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_150/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 64 \ + --recall_num 10 \ + --similar_text_pair "data/test_pair.csv" \ + --corpus_file "data/corpus.csv" +``` +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `params_path`: 待评估模型的参数文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 +* `corpus_file`: 召回库数据 corpus_file + +也可以使用下面的bash脚本: + +``` +sh scripts/run_build_index.sh +``` + +run_build_index.sh还包含cpu和gpu运行的脚本,默认是gpu的脚本 + +接下来,运行如下命令进行效果评估,产出Recall@1, Recall@5, Recall@10 指标: +``` +python -u evaluate.py \ + --similar_text_pair "data/test_pair.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 10 +``` +也可以使用下面的bash脚本: + +``` +sh scripts/evaluate.sh +``` +输出如下的结果: + +``` +recall@1=83.784 +recall@5=94.995 +recall@10=96.997 +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +## 4.4 模型部署 + +模型部署模块首先要把动态图转换成静态图,然后转换成serving的格式。 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/model_150/model_state.pdparams --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### 问答检索引擎 + +模型准备结束以后,开始搭建 Milvus 的语义检索引擎,用于语义向量的快速检索,本项目使用[Milvus](https://milvus.io/)开源工具进行向量检索,Milvus 的搭建教程请参考官方教程 [Milvus官方安装教程](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md)本案例使用的是 Milvus 的1.1.1 CPU版本,建议使用官方的 Docker 安装方式,简单快捷。 + + +Milvus 搭建完系统以后就可以插入和检索向量了,首先生成 embedding 向量,每个样本生成256维度的向量: + +``` +python feature_extract.py \ + --model_dir=./output \ + --corpus_file "data/corpus.csv" +``` +其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。 + +然后向搭建好的 Milvus 系统插入向量: + +``` +python vector_insert.py +``` + +### Paddle Serving 部署 + +Paddle Serving 的安装可以参考[Paddle Serving 安装文档](https://github.com/PaddlePaddle/Serving#installation)。需要在服务端和客户端安装相关的依赖,安装完依赖后就可以执行下面的步骤。 + + +首先把生成的静态图模型导出为 Paddle Serving的格式,命令如下: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "./serving_server" \ + --client_path "./serving_client" \ + --fetch_alias_names "output_embedding" +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` + +启动 Pipeline Server: + +``` +cd deploy/python/ +python web_service.py +``` + +启动客户端调用 Server, 使用 POST的方式: + 
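除了下文的 curl 命令,也可以用 Python 的 `requests` 库发送同样的 POST 请求。下面是一段示意代码(仅供参考,假设服务按上文配置监听本机 8090 端口,请求体格式与下文 curl 示例一致):

```python
import requests

# 假设:服务地址与下文 curl 示例相同
url = "http://localhost:8090/ernie/prediction"
data = {"key": ["0"], "value": ["宁夏针对哪些人员开通工伤保障绿色通道?"]}

resp = requests.post(url, json=data)
print(resp.json())
```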
+向服务端发送 POST 请求示例: + +``` +curl -X POST -k http://localhost:8090/ernie/prediction -d '{"key": ["0"], "value": ["宁夏针对哪些人员开通工伤保障绿色通道?"]}' +``` + +也可以使用 rpc的方式: + +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [ + "湖北省为什么鼓励缴费人通过线上缴费渠道缴费?", + "佛山市救助站有多少个救助床位" +] +``` +然后运行: + +``` +python rpc_client.py +``` + +## 4.5 问答系统整个流程 + +问答系统使用了Client Server的模式,即抽取向量的模型部署在服务端,然后启动客户端(Client)端去访问。 + + +``` +python run_system.py +``` +代码内置的测试用例为: + +``` +list_data = ["嘉定区南翔镇实行双门长制“门长”要求落实好哪些工作?"] +``` + +会输出如下的结果: + +``` +...... +Extract feature time to cost :0.01161503791809082 seconds +Search milvus time cost is 0.004535675048828125 seconds +嘉定区南翔镇实行双门长制“门长”要求落实好哪些工作? 拦、查、问、测、记 1.2107588152551751e-12 +上海市黄浦区老西门街道建立的党建责任区包干机制内容是什么? 街道工作人员担任楼宇联络员,分片区对接商务楼宇所属的物业公司,引导楼宇企业共同落实严防严控任务 0.4956303834915161 +上海市街道执行“四个统一”具体指什么? 统一由居委会干部在统一时间(每周三、五下午),递交至统一地点(社区事务受理服务中心专设窗口),街道统一收集至後台 0.6684658527374268 +怀柔区城管委在加强监督检查方面是如何落实的? 严格落实四方责任,保证每周2~3次深入环卫、电、气、热、公共自行车、垃圾处置等单位进行巡查,督促企业做好防疫工作,协调复工复产中存在的问题,确保安全复工复产有效落实。 0.7147952318191528 +华新镇“亮牌分批复工”工作方案具体内容是什么? 所有店铺一律先贴“红牌”禁止经营,经相关部门审批後,再换贴“蓝牌”准许复工。 0.7162970900535583 +..... +``` +输出的结果包括特征提取和检索的时间,还包含检索出来的问答对。 diff --git a/applications/question_answering/supervised_qa/faq_system/ann_util.py b/applications/question_answering/supervised_qa/faq_system/ann_util.py new file mode 100644 index 0000000000000000000000000000000000000000..4e4983bfc4d6f581e14180804591dda2d0897465 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/ann_util.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import hnswlib +import numpy as np + +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space="ip", dim=args.output_emb_size if args.output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=args.hnsw_max_elements, ef_construction=args.hnsw_ef, M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/applications/question_answering/supervised_qa/faq_system/config.py b/applications/question_answering/supervised_qa/faq_system/config.py new file mode 100644 index 0000000000000000000000000000000000000000..44bf3a260a31d97107e22f8fec09b141c5b7fe79 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/config.py @@ -0,0 +1,27 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from milvus import IndexType, MetricType + +MILVUS_HOST = "10.21.226.173" +MILVUS_PORT = 8530 + +collection_param = {"dimension": 256, "index_file_size": 256, "metric_type": MetricType.L2} + +index_type = IndexType.IVF_FLAT +index_param = {"nlist": 1000} + +top_k = 100 +search_param = {"nprobe": 20} diff --git a/applications/question_answering/supervised_qa/faq_system/data.py b/applications/question_answering/supervised_qa/faq_system/data.py new file mode 100644 index 0000000000000000000000000000000000000000..bda3a5de2e5dc3ca773aae96be874b5a9dc39abc --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/data.py @@ -0,0 +1,194 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
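For reference, a minimal self-contained sketch of the hnswlib flow used in `build_index` above (random normalized vectors stand in for real model embeddings; the parameter values mirror the defaults used in the scripts):

```python
import hnswlib
import numpy as np

dim = 256  # matches output_emb_size in the scripts
corpus_emb = np.random.rand(1000, dim).astype("float32")
# Normalize so that inner product behaves like cosine similarity
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

index = hnswlib.Index(space="ip", dim=dim)
index.init_index(max_elements=1000, ef_construction=100, M=100)
index.add_items(corpus_emb)
index.set_ef(100)  # higher ef -> better recall, slower queries

query_emb = corpus_emb[:2]  # pretend the first two docs are queries
recalled_idx, distances = index.knn_query(query_emb, k=10)
# For the "ip" space hnswlib returns 1 - inner_product,
# hence the `1.0 - cosine_sims` conversion in recall.py
print(recalled_idx.shape, 1.0 - distances[:, 0])
```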
+ +import random + +import numpy as np +import paddle + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +def convert_example_test(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_simcse_text(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_test is False: + if len(data) != 3: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} + else: + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text + + +def word_repetition(input_ids, token_type_ids, dup_rate=0.32): + """Word Repetition strategy.""" + input_ids = input_ids.numpy().tolist() + token_type_ids = token_type_ids.numpy().tolist() + + batch_size, seq_len = len(input_ids), len(input_ids[0]) + repetitied_input_ids = [] + repetitied_token_type_ids = [] + rep_seq_len = seq_len + for batch_id in range(batch_size): + cur_input_id = input_ids[batch_id] + actual_len = np.count_nonzero(cur_input_id) + dup_word_index = [] + # If sequence length is less than 5, skip it + if actual_len > 5: + dup_len = random.randint(a=0, b=max(2, int(dup_rate * actual_len))) + # Skip cls and sep position + dup_word_index = random.sample(list(range(1, actual_len - 1)), k=dup_len) + + r_input_id = [] + r_token_type_id = [] + for idx, word_id in enumerate(cur_input_id): + # Insert duplicate word + if idx in dup_word_index: + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + after_dup_len = len(r_input_id) + repetitied_input_ids.append(r_input_id) + repetitied_token_type_ids.append(r_token_type_id) + + if after_dup_len > rep_seq_len: + rep_seq_len = after_dup_len + # Padding the data to the same length + for batch_id in range(batch_size): + after_dup_len = len(repetitied_input_ids[batch_id]) + pad_len = rep_seq_len - after_dup_len + repetitied_input_ids[batch_id] += [0] * pad_len + repetitied_token_type_ids[batch_id] += [0] * pad_len + + return paddle.to_tensor(repetitied_input_ids, dtype="int64"), paddle.to_tensor( + repetitied_token_type_ids, dtype="int64" + ) diff --git a/applications/question_answering/supervised_qa/faq_system/deploy/python/config_nlp.yml b/applications/question_answering/supervised_qa/faq_system/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..229f66090ebf328b03aaa6b41120277176e97d97 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 
当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8090 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + # ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '2' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/question_answering/supervised_qa/faq_system/deploy/python/rpc_client.py b/applications/question_answering/supervised_qa/faq_system/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..44e8c3a0d744ee73130f2a728365f0a98a37df2d --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/deploy/python/rpc_client.py @@ -0,0 +1,36 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time + +import numpy as np +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = ["湖北省为什么鼓励缴费人通过线上缴费渠道缴费?", "佛山市救助站有多少个救助床位"] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = item + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) + +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/question_answering/supervised_qa/faq_system/deploy/python/web_service.py b/applications/question_answering/supervised_qa/faq_system/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..aa2185782b17a68419d5d73b04e8241d35a90f29 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/deploy/python/web_service.py @@ -0,0 +1,75 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from paddle_serving_server.web_service import Op, WebService + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + input_ids, segment_ids = convert_example([input_dict[str(i)]], self.tokenizer) + examples.append((input_ids, segment_ids)) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # segment + ), + ): + return fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/question_answering/supervised_qa/faq_system/evaluate.py b/applications/question_answering/supervised_qa/faq_system/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..aabeadf5b197e24177e22a6331f8aa9fe4ef2c1b --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/evaluate.py @@ -0,0 +1,83 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
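A standalone sketch of the tokenize-then-batch pattern used in the `preprocess` step above (assuming `paddlenlp` is installed; the example queries are reused from rpc_client.py):

```python
from paddlenlp.data import Pad, Tuple
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")
texts = ["湖北省为什么鼓励缴费人通过线上缴费渠道缴费?", "佛山市救助站有多少个救助床位"]

examples = []
for text in texts:
    encoded = tokenizer(text=text, max_seq_len=64)
    examples.append((encoded["input_ids"], encoded["token_type_ids"]))

batchify_fn = Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # token_type_ids, padded with 0 as in web_service.py
)
input_ids, token_type_ids = batchify_fn(examples)
print(input_ids.shape, token_type_ids.shape)
```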
+ +import argparse + +import numpy as np + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default='', help="The full path of similat pair file") +parser.add_argument("--recall_result_file", type=str, default='', help="The full path of recall result file") +parser.add_argument("--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query") + + +args = parser.parse_args() +# yapf: enable + + +def recall(rs, N=10): + """ + Ratio of recalled Ground Truth at topN Recalled Docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + >>> 0.6666667 + >>> recall(rs, N=3) + >>> 1.0 + Args: + rs: Iterator of recalled flag() + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + + rs = [] + + with open(args.recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + recall_N = [] + recall_num = [1, 5, 10] + result = open("result.tsv", "a") + res = [] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") + # print("\t".join(recall_N)) diff --git a/applications/question_answering/supervised_qa/faq_system/export_model.py b/applications/question_answering/supervised_qa/faq_system/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..fba6e24c2af66042ea29c6a92e04dfac7d524e14 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/export_model.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
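To make the metric above concrete, here is the docstring example of `recall` worked out as a short runnable snippet (same logic as evaluate.py, restated purely for illustration; each query is assumed to have exactly one relevant document):

```python
import numpy as np

def recall(rs, N=10):
    # rs holds one row of 0/1 relevance flags per query, ordered by retrieval rank
    return np.mean([np.sum(r[0:N]) for r in rs])

rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(recall(rs, N=1))  # 0.333..., only the third query hits at rank 1
print(recall(rs, N=2))  # 0.666...
print(recall(rs, N=3))  # 1.0, every query recovers its ground truth within the top 3
```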
+ +import argparse +import os + +import paddle +from model import SimCSE + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, + default='./checkpoint/model_50/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', + help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + # If you want to use ernie1.0 model, plesace uncomment the following code + output_emb_size = 256 + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + model = SimCSE(pretrained_model, output_emb_size=output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/question_answering/supervised_qa/faq_system/export_to_serving.py b/applications/question_answering/supervised_qa/faq_system/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..6cc932da11173e54460642c16fd4226411ba3cfb --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/export_to_serving.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. 
Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/question_answering/supervised_qa/faq_system/feature_extract.py b/applications/question_answering/supervised_qa/faq_system/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..6d2a292c42de1f2202e14a331eea1e9ad790cbca --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/feature_extract.py @@ -0,0 +1,202 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') + +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ), + ): + return fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in enumerate(tqdm(data)): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) >= self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("corpus_embedding", all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, data in enumerate(file.readlines()): + id2corpus[idx] = data.strip() + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + id2corpus = read_text(args.corpus_file) + + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + predictor.predict(corpus_list, tokenizer) diff --git a/applications/question_answering/supervised_qa/faq_system/milvus_util.py b/applications/question_answering/supervised_qa/faq_system/milvus_util.py new file mode 100644 index 0000000000000000000000000000000000000000..2eee1a88cf1ce7897981212d0710f3fad28e3351 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/milvus_util.py @@ -0,0 +1,109 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
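+# This module wraps the Milvus 1.x client (pymilvus==1.1.1, see requirements.txt):
+# VecToMilvus creates collections/partitions/indexes and inserts embedding vectors,
+# while RecallByMilvus performs approximate nearest neighbor search over them.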
+ +from config import ( + MILVUS_HOST, + MILVUS_PORT, + collection_param, + index_param, + index_type, + search_param, + top_k, +) + +# from milvus import * +from milvus import Milvus + + +class VecToMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def has_collection(self, collection_name): + try: + status, ok = self.client.has_collection(collection_name) + return ok + except Exception as e: + print("Milvus has_table error:", e) + + def creat_collection(self, collection_name): + try: + collection_param["collection_name"] = collection_name + status = self.client.create_collection(collection_param) + print(status) + return status + except Exception as e: + print("Milvus create collection error:", e) + + def create_index(self, collection_name): + try: + status = self.client.create_index(collection_name, index_type, index_param) + print(status) + return status + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, collection_name, partition_tag): + try: + status, ok = self.client.has_partition(collection_name, partition_tag) + return ok + except Exception as e: + print("Milvus has partition error: ", e) + + def create_partition(self, collection_name, partition_tag): + try: + status = self.client.create_partition(collection_name, partition_tag) + print("create partition {} successfully".format(partition_tag)) + return status + except Exception as e: + print("Milvus create partition error: ", e) + + def insert(self, vectors, collection_name, ids=None, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.creat_collection(collection_name) + self.create_index(collection_name) + print("collection info: {}".format(self.client.get_collection_info(collection_name)[1])) + if (partition_tag is not None) and (not self.has_partition(collection_name, partition_tag)): + self.create_partition(collection_name, partition_tag) + status, ids = self.client.insert( + collection_name=collection_name, records=vectors, ids=ids, partition_tag=partition_tag + ) + self.client.flush([collection_name]) + print( + "Insert {} entities, there are {} entities after insert data.".format( + len(ids), self.client.count_entities(collection_name)[1] + ) + ) + return status, ids + except Exception as e: + print("Milvus insert error:", e) + + +class RecallByMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def search(self, vectors, collection_name, partition_tag=None): + try: + status, results = self.client.search( + collection_name=collection_name, + query_records=vectors, + top_k=top_k, + params=search_param, + partition_tag=partition_tag, + ) + return status, results + except Exception as e: + print("Milvus recall error: ", e) diff --git a/applications/question_answering/supervised_qa/faq_system/model.py b/applications/question_answering/supervised_qa/faq_system/model.py new file mode 100644 index 0000000000000000000000000000000000000000..d22cfe8c57979874c4839628577dfd1496e7d40e --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/model.py @@ -0,0 +1,135 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SimCSE(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.0, scale=20, output_emb_size=None): + + super().__init__() + + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, with_pooler=True + ): + + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if with_pooler is False: + cls_embedding = sequence_output[:, 0, :] + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + with_pooler=True, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + 
cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/applications/question_answering/supervised_qa/faq_system/recall.py b/applications/question_answering/supervised_qa/faq_system/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..21b2db729ba2d2fd57a310ca8275745ea337a647 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/recall.py @@ -0,0 +1,124 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from data import convert_example_test, create_dataloader, gen_id2corpus, gen_text_file +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + model_name_or_path = "rocketqa-zh-dureader-query-encoder" + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + trans_func = partial(convert_example_test, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ), + ): + return [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained(model_name_or_path) + + model = SimCSE(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + # print(text_list[:5]) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in 
enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/question_answering/supervised_qa/faq_system/requirements.txt b/applications/question_answering/supervised_qa/faq_system/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..4a2271d321bf25ab1c5bc4e24af6c96a7a1f3a18 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/requirements.txt @@ -0,0 +1,5 @@ +pymilvus==1.1.1 +pandas==0.25.1 +paddlenlp>=2.3.7 +hnswlib>=0.5.2 +pybind11 \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/run_system.py b/applications/question_answering/supervised_qa/faq_system/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..a8390575b19ffcf5cabca0693075448af817762b --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/run_system.py @@ -0,0 +1,63 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import time + +import numpy as np +import pandas as pd +from data import gen_id2corpus +from milvus_util import RecallByMilvus +from paddle_serving_server.pipeline import PipelineClient + + +def search_in_milvus(text_embedding, query_text): + collection_name = "faq_system" + partition_tag = "partition_1" + client = RecallByMilvus() + start_time = time.time() + status, results = client.search( + collection_name=collection_name, vectors=text_embedding, partition_tag=partition_tag + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + + corpus_file = "data/qa_pair.csv" + id2corpus = gen_id2corpus(corpus_file) + list_data = [] + for line in results: + for item in line: + idx = item.id + distance = item.distance + text = id2corpus[idx] + print(text, distance) + list_data.append([query_text, text, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "text", "distance"]) + df = df.sort_values(by="distance", ascending=True) + df.to_csv("data/recall_predict.csv", columns=["text", "distance"], sep="\t", header=None, index=False) + + +if __name__ == "__main__": + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + list_data = ["嘉定区南翔镇实行双门长制“门长”要求落实好哪些工作?"] + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = item + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + search_in_milvus(result, list_data[0]) diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/evaluate.sh b/applications/question_answering/supervised_qa/faq_system/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..7b0a901f9e7b6b77c2c832b849847395f675145f --- /dev/null +++ 
b/applications/question_answering/supervised_qa/faq_system/scripts/evaluate.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + python -u evaluate.py \ + --similar_text_pair "data/test_pair.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 10 \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/export_model.sh b/applications/question_answering/supervised_qa/faq_system/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..cdc97faa49473203422439958c11abc375ab599f --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/scripts/export_model.sh @@ -0,0 +1,15 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py --params_path checkpoints/model_150/model_state.pdparams --output_path=./output \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/export_to_serving.sh b/applications/question_answering/supervised_qa/faq_system/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..7a7337b40b7a7c2d652ce2a837562eaceeba0531 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/scripts/export_to_serving.sh @@ -0,0 +1,21 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
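+# Convert the exported static-graph model in ./output (inference.get_pooled_embedding.*)
+# into Paddle Serving server/client configs under ./serving_server and ./serving_client.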
+ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/feature_extract.sh b/applications/question_answering/supervised_qa/faq_system/scripts/feature_extract.sh new file mode 100644 index 0000000000000000000000000000000000000000..f75e929756718f8adb8906e8bc3cdf4c9f66d343 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/scripts/feature_extract.sh @@ -0,0 +1,17 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python feature_extract.py \ + --model_dir=./output \ + --corpus_file "data/corpus.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/run_build_index.sh b/applications/question_answering/supervised_qa/faq_system/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..ead0a04f57bc1a7fc2b32a067ae08af88d8b60a0 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/scripts/run_build_index.sh @@ -0,0 +1,29 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# gpu +python -u -m paddle.distributed.launch --gpus "4" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_150/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 64 \ + --recall_num 10 \ + --similar_text_pair "data/test_pair.csv" \ + --corpus_file "data/corpus.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/scripts/train.sh b/applications/question_answering/supervised_qa/faq_system/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..0de39022eb15b8af23292fe1817b5c64571385e4 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/scripts/train.sh @@ -0,0 +1,28 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -u -m paddle.distributed.launch --gpus '4' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 50 \ + --eval_steps 50 \ + --max_seq_length 64 \ + --dropout 0.2 \ + --output_emb_size 256 \ + --dup_rate 0.3 \ + --train_set_file "./data/train.csv" \ No newline at end of file diff --git a/applications/question_answering/supervised_qa/faq_system/train.py b/applications/question_answering/supervised_qa/faq_system/train.py new file mode 100644 index 0000000000000000000000000000000000000000..efe0306e5578bb2f8efde9c9b2028a5c02aa7d00 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/train.py @@ -0,0 +1,209 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_simcse_text, word_repetition +from model import SimCSE +from scipy import stats + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override ecpochs.") +parser.add_argument('--eval_steps', type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout for pretrained model encoder.") +parser.add_argument("--dup_rate", default=0.32, type=float, help="duplicate rate for word repetition.") +parser.add_argument("--infer_with_fc_pooler", action='store_true', help="Whether use fc layer after cls embedding or not for when infer.") + +args = parser.parse_args() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_evaluate(model, tokenizer, data_loader, with_pooler=False): + model.eval() + + total_num = 0 + spearman_corr = 0.0 + sims = [] + labels = [] + + for batch in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, label = batch + total_num += len(label) + + query_cls_embedding = model.get_pooled_embedding( + query_input_ids, query_token_type_ids, with_pooler=with_pooler) + + title_cls_embedding = model.get_pooled_embedding(title_input_ids, title_token_type_ids, with_pooler=with_pooler) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + + sims.append(cosine_sim.numpy()) + labels.append(label.numpy()) + + sims = np.concatenate(sims, axis=0) + labels = np.concatenate(labels, axis=0) + + spearman_corr = stats.spearmanr(labels, sims).correlation + model.train() + return spearman_corr, total_num + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + train_ds = load_dataset( + read_simcse_text, data_path=args.train_set_file, lazy=False) + model_name_or_path = 'rocketqa-zh-dureader-query-encoder' + pretrained_model = 
AutoModel.from_pretrained(model_name_or_path, hidden_dropout_prob=args.dropout, attention_probs_dropout_prob=args.dropout) + + tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ) + ): + return [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, + mode='train', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + model = SimCSE( + pretrained_model, + margin=args.margin, + scale=args.scale, + output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len( + train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + if args.dup_rate > 0.0: + query_input_ids, query_token_type_ids = word_repetition(query_input_ids, query_token_type_ids, args.dup_rate) + title_input_ids, title_token_type_ids = word_repetition(title_input_ids, title_token_type_ids, args.dup_rate) + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, + 10 / (time.time() - tic_train))) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + if args.max_steps > 0 and global_step >= args.max_steps: + return + + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == 
"__main__": + do_train() diff --git a/applications/question_answering/supervised_qa/faq_system/vector_insert.py b/applications/question_answering/supervised_qa/faq_system/vector_insert.py new file mode 100644 index 0000000000000000000000000000000000000000..5a32a083a3709e3da3f4eadfe970dd1638d3c610 --- /dev/null +++ b/applications/question_answering/supervised_qa/faq_system/vector_insert.py @@ -0,0 +1,46 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import numpy as np +from milvus_util import VecToMilvus +from tqdm import tqdm + + +def vector_insert(file_path): + embeddings = np.load(file_path) + print(embeddings.shape) + embedding_ids = [i for i in range(embeddings.shape[0])] + print(len(embedding_ids)) + client = VecToMilvus() + collection_name = "faq_system" + partition_tag = "partition_1" + data_size = len(embedding_ids) + batch_size = 100000 + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + status, ids = client.insert( + collection_name=collection_name, + vectors=batch_emb.tolist(), + ids=embedding_ids[i : i + batch_size], + partition_tag=partition_tag, + ) + + +if __name__ == "__main__": + file_path = "corpus_embedding.npy" + vector_insert(file_path) diff --git a/applications/question_answering/unsupervised_qa/README.md b/applications/question_answering/unsupervised_qa/README.md new file mode 100644 index 0000000000000000000000000000000000000000..719d8f640f56d40d490e40770d93760264c9d1b4 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/README.md @@ -0,0 +1,480 @@ +# 无监督检索式问答系统 + +**目录** +- [无监督检索式问答系统](#无监督检索式问答系统) + - [简介](#简介) + - [项目优势](#项目优势) + - [方案介绍](#方案介绍) + - [流程图](#流程图) + - [技术方案](#技术方案) + - [代码结构说明](#代码结构说明) + - [快速体验](#快速体验) + - [运行环境和安装说明](#运行环境和安装说明) + - [数据说明](#数据说明) + - [快速体验无监督检索式问答系统](#快速体验无监督检索式问答系统) + - [可视化无监督检索式问答系统](#可视化无监督检索式问答系统) + - [离线问答对语料构建](#离线问答对语料构建) + - [基于Pipelines构建问答系统](#基于Pipelines构建问答系统) + - [自定义模型](#自定义模型) + - [数据准备](#数据准备) + - [模型微调](#模型微调) + - [答案抽取](#答案抽取) + - [问题生成](#问题生成) + - [过滤模型](#过滤模型) + - [语义索引和召回模型](#语义索引和召回模型) + - [排序模型](#排序模型) + - [References](#References) + +## 简介 +问答(QA)系统中最关键的挑战之一是标记数据的稀缺性,这是因为对目标领域获取问答对或常见问答对(FAQ)的成本很高,需要消耗大量的人力和时间。由于上述制约,这导致检索式问答系统落地困难,解决此问题的一种方法是依据问题上下文或大量非结构化文本自动生成的QA问答对。 + +在此背景下,无监督检索式问答系统(即问答对自动生成智能检索式问答),基于PaddleNLP[问题生成](../../../examples/question_generation/README.md)、[UIE](../../../model_zoo/uie/README.md)、[检索式问答](../supervised_qa/faq_finance/README.md),支持以非结构化文本形式为上下文自动生成QA问答对,生成的问答对语料可以通过无监督的方式构建检索式问答系统。 + +若开发者已有FAQ语料,请参考[supervised_qa](../supervised_qa)。 + +### 项目优势 +具体来说,本项目具有以下优势: + ++ 低成本 + + 可通过自动生成的方式快速大量合成QA语料,大大降低人力成本 + + 可控性好,合成语料和语义检索问答解耦合,可以人工筛查和删除合成的问答对,也可以添加人工标注的问答对 + ++ 低门槛 + + 手把手搭建无监督检索式问答系统 + + 无需相似Query-Query Pair标注数据也能构建问答系统 + ++ 效果好 + + 可通过自动问答对生成提升问答对语料覆盖度,缓解中长尾问题覆盖较少的问题 + + 业界领先的检索预训练模型: RocketQA Dual Encoder + + 针对无标注数据场景的领先解决方案: 检索预训练模型 + 增强的无监督语义索引微调 + ++ 端到端 + + 
提供包括问答语料生成、索引库构建、模型服务部署、WebUI可视化一整套端到端智能问答系统能力 + + 支持对Txt、Word、PDF、Image多源数据上传,同时支持离线、在线QA语料生成和ANN数据库更新 + +## 方案介绍 + +### 流程图 +本项目的流程图如下,对于给定的非结构化文本,我们首先通过答案抽取、问题生成、以及往返过滤模块,得到大量语料相关的问答对。针对这些得到的问答对,用户可以通过可以人工筛查和删除的方式来调整生成的问答对,也可以进一步添加人工标注的问答对。随后开发者就可以通过语义索引模块,来构建向量索引库。在构造完索引库之后,我们就可以通过召回模块和排序模块对问答对进行查询,得到最终的查询结果。 + +
+ （此处为无监督检索式问答系统流程图 image） +
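+下面用一段极简的示意代码帮助理解上图中各模块的衔接方式。注意这只是一个流程草图：其中 extract_answers、generate_questions、roundtrip_filter、encode、rank 均为假设的占位函数，并非本项目的真实接口：
+
+```python
+import numpy as np
+
+
+def build_qa_corpus(contexts, extract_answers, generate_questions, roundtrip_filter):
+    """离线问答对构建：答案抽取 -> 问题生成 -> 往返过滤。"""
+    qa_pairs = []
+    for context in contexts:
+        for answer in extract_answers(context):                   # 答案抽取（UIE）
+            for question in generate_questions(context, answer):  # 问题生成（UNIMO-Text）
+                if roundtrip_filter(context, question, answer):   # 往返过滤
+                    qa_pairs.append({"context": context, "question": question, "answer": answer})
+    return qa_pairs
+
+
+def answer_query(query, qa_pairs, encode, rank, recall_num=10):
+    """在线查询：语义召回 + 精排。"""
+    corpus_vecs = np.stack([encode(p["question"]) for p in qa_pairs])   # 对问答对建立向量索引
+    sims = corpus_vecs @ encode(query)                                   # 召回：向量相似度
+    candidates = [qa_pairs[i] for i in np.argsort(-sims)[:recall_num]]  # 取 top-k 召回结果
+    return sorted(candidates, key=lambda p: rank(query, p["question"]), reverse=True)  # 精排
+```
+
+实际系统中，召回与排序分别由 RocketQA 的 query-encoder 与 cross-encoder 完成，向量索引由 ANN 服务承载，具体见下文“技术方案”。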
+ +### 技术方案 +由于涉及较多的模块,本项目将基于PaddleNLP Pipelines进行模块的组合和项目的构建。PaddleNLP Pipelines是一个端到端NLP流水线系统框架,它可以通过插拔式组件产线化设计来构建一个完整的无监督问答系统。具体来说,我们的技术方案包含以下方面: + +**答案抽取**:我们基于UIE训练了一个答案抽取模型,该答案抽取模型接收“答案”作为提示词,该模型可以用来对潜在的答案信息进行挖掘抽取,我们同时提供了训练好的模型权重`uie-base-answer-extractor`。 + +**问题生成**:我们基于中文预训练语言模型UNIMO-Text、模版策略和大规模多领域问题生成数据集训练了一个通用点问题生成预训练模型`unimo-text-1.0-question-generation`。 + +**往返过滤**:我们采用过生成(overgenerate)的策略生成大量的潜在答案和问题,并通过往返过滤的方式针对生成的过量问答对进行过滤得到最终的问答对。我们的往返过滤模块需要训练一个有条件抽取式问答模型3。 + +**语义索引**:针对给定问答对语料,我们基于RocketQA(即`rocketqa-zh-base-query-encoder`)对问答对进行语义向量化,并通过ElasticSearch的ANN服务构建索引库。 + +**召回排序**:给定用户查询,我们基于RocketQA的query-encoder和cross-encoder分别进行召回和排序操作,得到目标的问答对,从而返回给用户查询结果。 + +**Pipelines**:由于本项目设计的模块较多,我们使用PaddleNLP Pipelines进行模块的组合和项目的构建。大体来说,我们的Pipelines包含两个具体的pipeline和三个服务。两个pipeline分别是qa_generation_pipeline和dense_faq_pipeline;三个服务分别是基于ElasticSearch的ANN在线索引库服务,基于RestAPI的模型后端服务以及基于Streamlit的前端WebUI服务。 + + +## 快速体验 +### 运行环境和安装说明 +基于Pipelines构建问答系统需要安装paddle-pipelines依赖,使用pip安装命令如下: +```bash +# pip一键安装 +pip install --upgrade paddle-pipelines -i https://pypi.tuna.tsinghua.edu.cn/simple +``` +或者进入pipelines目录下,针对源码进行安装: +```bash +# 源码进行安装 +cd PaddleNLP/pipelines/ +pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple +python setup.py install +``` + +### 数据说明 +我们以提供的纯文本文件[source_file.txt](https://paddlenlp.bj.bcebos.com/applications/unsupervised_qa/source_file.txt)为例,系统将每一条都视为一个上下文并基于此生成多个问答对,并基于此构建索引库,该文件可直接下载放入`data`,开发者也可以使用自己的文件。 + +### 快速体验无监督检索式问答系统 +开发者可以通过如下命令快速体验无监督智能检索问答系统的效果,系统将自动根据提供的纯文本文件构建问答对语料库,并基于生成的问答对语料库构造索引库。 +我们建议在GPU环境下运行本示例,运行速度较快,运行命令如下: +```bash +# GPU环境下运行示例 +# 设置1个空闲的GPU卡,此处假设0卡为空闲GPU +export CUDA_VISIBLE_DEVICES=0 +python run_pipelines_example.py --device gpu --source_file data/source_file.txt --doc_dir data/my_data --index_name faiss_index --retriever_batch_size 16 +``` +关键参数释义如下: +- `device`: 使用的设备,默认为'gpu',可选择['cpu', 'gpu']。 +- `source_file`: 源文件路径,指定该路径将自动为其生成问答对至`doc_dir`。 +- `doc_dir`: 生成的问答对语料保存的位置,系统将根据该位置自动构建检索数据库,默认为'data/my_data'。 +- `index_name`: FAISS的ANN索引名称,默认为'faiss_index'。 +- `retriever_batch_size`: 构建ANN索引时的批量大小,默认为16。 + +如果只有CPU机器,可以通过--device参数指定cpu即可, 运行耗时较长,运行命令如下: +```bash +# CPU环境下运行示例 +unset CUDA_VISIBLE_DEVICES +python run_pipelines_example.py --device cpu --source_file data/source_file.txt --doc_dir data/my_data --index_name faiss_index --retriever_batch_size 16 +``` + + + + +## 可视化无监督检索式问答系统 +开发者可以基于Pipelines进一步构建Web可视化的无监督检索式问答系统,其效果如下, +
+ +
+ + + +### 离线问答对语料构建 +这一部分介绍如何离线构建问答对语料,同时我们我们也在Pipeline中集成了在线问答对语料。 +#### 数据说明 +我们以提供的纯文本文件[source_file.txt](https://paddlenlp.bj.bcebos.com/applications/unsupervised_qa/source_file.txt)为例,系统将每一条都视为一个上下文并基于此生成多个问答对,随后系统将根据这些问答对构建索引库,该文件可直接下载放入`data`,开发者也可以使用自己的文件。 + +#### 问答对生成 +对于标准场景的问答对可以直接使用提供的预训练模型实现零样本(zero-shot)问答对生成。对于细分场景开发者可以根据个人需求训练[自定义模型](#自定义模型),加载自定义模型进行问答对生成,以进一步提升效果。 + +生成问答对语料的命令如下: +```shell +export CUDA_VISIBLE_DEVICES=0 +python -u run_qa_pairs_generation.py \ + --source_file_path=data/source_file.txt \ + --target_file_path=data/target_file.json \ + --answer_generation_model_path=uie-base-answer-extractor-v1 \ + --question_generation_model_path=unimo-text-1.0-question-generation \ + --filtration_model_path=uie-base-qa-filter-v1 \ + --batch_size=8 \ + --a_max_answer_candidates=10 \ + --a_prompt='答案' \ + --a_position_prob=0.01 \ + --q_num_return_sequences=3 \ + --q_max_question_length=50 \ + --q_decode_strategy=sampling \ + --q_top_k=5 \ + --q_top_p=1 \ + --do_filtration \ + --f_filtration_position_prob=0.01 \ + --do_debug +``` +关键参数释义如下: +- `source_file_path` 源文件路径,源文件中每一行代表一条待生成问答对的上下文文本。 +- `target_file_path` 目标文件路径,生成的目标文件为json格式。 +- `answer_generation_model_path` 要加载的答案抽取模型的路径,可以是PaddleNLP提供的预训练模型,或者是本地模型checkpoint路径。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + | 可选预训练模型 | + |---------------------------------| + | uie-base-answer-extractor-v1 | + +- `question_generation_model_path` 要加载的问题生成模型的路径,可以是PaddleNLP提供的预训练模型,或者是本地模型checkpoint路径。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + | 可选预训练模型 | + |---------------------------------| + | unimo-text-1.0-question-generation | + | unimo-text-1.0-dureader_qg | + | unimo-text-1.0-question-generation-dureader_qg | + +- `filtration_model_path` 要加载的过滤模型的路径,可以是PaddleNLP提供的预训练模型,或者是本地模型checkpoint路径。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + | 可选预训练模型 | + |---------------------------------| + | uie-base-qa-filter-v1 | + +- `batch_size` 使用taskflow时的批处理大小,请结合机器情况进行调整,默认为8。 +- `a_max_answer_candidates` 答案抽取阶段,每个输入的最大返回答案候选数,默认为5。 +- `a_prompt` 答案抽取阶段,使用的提示词,以","分隔,默认为"答案"。 +- `a_position_prob` 答案抽取阶段,置信度阈值,默认为0.01。 +- `q_num_return_sequences` 问题生成阶段,返回问题候选数,在使用"beam_search"解码策略时它应该小于`q_num_beams`,默认为3。 +- `q_max_question_length` 问题生成阶段,最大解码长度,默认为50。 +- `q_decode_strategy` 问题生成阶段,解码策略,默认为"sampling"。 +- `q_top_k` 问题生成阶段,使用"sampling"解码策略时的top k值,默认为5。 +- `q_top_p` 问题生成阶段,使用"sampling"解码策略时的top p值,默认为0。 +- `q_num_beams` 问题生成阶段,使用"beam_search"解码策略时的beam大小,默认为6。 +- `do_filtration` 是否进行过滤。 +- `f_filtration_position_prob` 过滤阶段,过滤置信度阈值,默认为0.1。 +- `do_debug` 是否进入调试状态,调试状态下将输出过滤掉的生成问答对。 + +#### 语料转换 +执行以下脚本对生成的问答对进行转换,得到语义索引所需要的语料train.csv、dev.csv、q_corpus.csv、qa_pair.csv: +```shell +python -u run_corpus_preparation.py \ + --source_file_path data/target_file.json \ + --target_dir_path data/my_corpus +``` +关键参数释义如下: +- `source_file_path` 指示了要转换的训练数据集文件或测试数据集文件,文件格式要求见从本地文件创建数据集部分。指示了要转换的问答对json文件路径,生成的目标文件为json格式 +- `target_dir_path` 输出数据的目标文件夹,默认为"data/my_corpus"。 +- `test_sample_num` 构建检索系统时保留的测试样本数目,默认为0。 +- `train_sample_num` 构建检索系统时保留的有监督训练样本数目,默认为0。 +- `all_sample_num` 构建检索系统时保留的总样本数目,默认为None,表示保留除了前`test_sample_num`+`train_sample_num`个样本外的所有样本。 + + + + + +### 基于Pipelines构建问答系统 +本项目提供了基于Pipelines的低成本构建问答对自动生成智能检索问答系统的能力。开发者只需要提供非结构化的纯文本,就可以使用本项目预制的问答对生成模块生成大量的问答对,并基于此快速搭建一个针对自己业务的检索问答系统,并可以提供Web可视化产品服务。Web可视化产品服务支持问答检索、在线问答对生成,在线文件上传和解析,在线索引库更新等功能,用户也可根据需要自行调整。具体的构建流程请参考[Pipelines-无监督智能检索问答系统](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/unsupervised-question-answering)。 + + + +## 自定义模型 
+除了使用预置模型外,用户也可以训练并接入自己训练的模型,我们提供了从答案抽取、问题生成、往返过滤的过滤模型,到语义索引、召回、排序各个阶段的定制化训练方案。 +### 数据准备 +这一部分介绍如何准备和预处理答案抽取、问题生成、过滤模块微调所需的数据。关于如何准备通过无监督方式训练自定义语义索引模型所需的问答对数据,见[离线问答对语料构建](#离线问答对语料构建)。 +#### 自定义数据 +在许多情况下,我们需要使用本地数据集来微调模型从而得到定制化的能力,让生成的问答对更接近于理想分布,本项目支持使用固定格式本地数据集文件进行微调。 + +这里我们提供预先标注好的文件样例[train.json](https://paddlenlp.bj.bcebos.com/applications/unsupervised_qa/train.json)和[dev.json](https://paddlenlp.bj.bcebos.com/applications/unsupervised_qa/dev.json),开发者可直接下载放入`data`目录,此外也可自行构建本地数据集,具体来说,本地数据集主要包含以下文件: +```text +data +├── train.json # 训练数据集文件 +├── dev.json # 开发数据集文件 +└── test.json # 可选,待预测数据文件 +``` +本地数据集文件格式如下: +```text +# train.json/dev.json/test.json文件格式: +{ + "context": , + "answer": , + "question": , +} +... +``` +本地数据集文件具体样例如下: +```text +train.json/dev.json/test.json文件样例: +{ + "context": "欠条是永久有效的,未约定还款期限的借款合同纠纷,诉讼时效自债权人主张债权之日起计算,时效为2年。 根据《中华人民共和国民法通则》第一百三十五条:向人民法院请求保护民事权利的诉讼时效期间为二年,法律另有规定的除外。 第一百三十七条:诉讼时效期间从知道或者应当知道权利被侵害时起计算。但是,从权利被侵害之日起超过二十年的,人民法院不予保护。有特殊情况的,人民法院可以延长诉讼时效期间。 第六十二条第(四)项:履行期限不明确的,债务人可以随时履行,债权人也可以随时要求履行,但应当给对方必要的准备时间。", + "answer": "永久有效", + "question": "欠条的有效期是多久" +} +... +``` + +#### 数据预处理 +执行以下脚本对数据集进行数据预处理,得到接下来答案抽取、问题生成、过滤模块模型微调所需要的数据,注意这里答案抽取、问题生成、过滤模块的微调数据来源于相同的数据集。 +```shell +python -u run_data_preprocess.py \ + --source_file_path data/train.json \ + --target_dir data/finetune \ + --do_answer_prompt + +python -u run_data_preprocess.py \ + --source_file_path data/dev.json \ + --target_dir data/finetune \ + --do_answer_prompt +``` +关键参数释义如下: +- `source_file_path` 指示了要转换的训练数据集文件或测试数据集文件,文件格式要求见[自定义数据](#自定义数据)部分。 +- `target_dir` 输出数据的目标文件夹,默认为"data/finetune"。 +- `do_answer_prompt` 表示在构造答案抽取数据时是否添加"答案"提示词。 +- `do_len_prompt` 表示在构造答案抽取数据时是否添加长度提示词。 +- `do_domain_prompt` 表示在构造答案抽取数据时是否添加领域提示词。 +- `domain` 表示添加的领域提示词,在`do_domain_prompt`时有效。 + +**NOTE:** 预处理后的微调用数据将分别位于`target_dir`下的answer_extraction、question_generation、filtration三个子文件夹中。 + +### 模型微调 +#### 答案抽取 +运行如下命令即可在样例训练集上微调答案抽取模型,用户可以选择基于`uie-base-answer-extractor`进行微调,或者基于`uie-base`等从头开始微调。 +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "1,2" --log_dir log/answer_extraction finetune/answer_extraction_and_roundtrip_filtration/finetune.py \ + --train_path=data/finetune/answer_extraction/train.json \ + --dev_path=data/finetune/answer_extraction/dev.json \ + --save_dir=log/answer_extraction/checkpoints \ + --learning_rate=1e-5 \ + --batch_size=16 \ + --max_seq_len=512 \ + --num_epochs=30 \ + --model=uie-base \ + --seed=1000 \ + --logging_steps=100 \ + --valid_steps=100 \ + --device=gpu +``` +关键参数释义如下: +- `train_path`: 训练集文件路径。 +- `dev_path`: 验证集文件路径。 +- `save_dir`: 模型存储路径,默认为`log/answer_extration/checkpoints`。 +- `learning_rate`: 学习率,默认为1e-5。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `num_epochs`: 训练轮数,默认为30。 +- `model`: 选择模型,程序会基于选择的模型进行模型微调,可选有`uie-base-answer-extractor`,`uie-base`,`uie-medium`, `uie-mini`, `uie-micro`和`uie-nano`,默认为`uie-base`。 +- `init_from_ckpt`: 用于初始化的模型参数的路径。 +- `seed`: 随机种子,默认为1000. 
+- `logging_steps`: 日志打印的间隔steps数,默认10。 +- `valid_steps`: evaluate的间隔steps数,默认100。 +- `device`: 选用什么设备进行训练,可选cpu或gpu。 + + +通过运行以下命令在样例验证集上进行模型评估: + +```shell +python finetune/answer_extraction_and_roundtrip_filtration/evaluate.py \ + --model_path=log/answer_extraction/checkpoints/model_best \ + --test_path=data/finetune/answer_extraction/dev.json \ + --batch_size=16 \ + --max_seq_len=512 \ + --limit=0.01 +``` + +关键参数释义如下: +- `model_path`: 进行评估的模型文件夹路径,路径下需包含模型权重文件`model_state.pdparams`及配置文件`model_config.json`。 +- `test_path`: 进行评估的测试集文件。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `model`: 选择所使用的模型,可选有`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`和`uie-nano`,默认为`uie-base`。 +- `debug`: 是否开启debug模式对每个正例类别分别进行评估,该模式仅用于模型调试,默认关闭。 +- `limit`: SpanEvaluator测评指标的`limit`,当概率数组中的最后一个维度大于该值时将返回相应的文本片段;当limit设置为0.01时表示关注模型的召回率,也即答案的覆盖率。 + +#### 问题生成 +运行如下命令即可在样例训练集上微调问题生成模型,并在样例验证集上进行验证。 +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "1,2" --log_dir log/question_generation finetune/question_generation/train.py \ + --train_file=data/finetune/question_generation/train.json \ + --predict_file=data/finetune/question_generation/dev.json \ + --save_dir=log/question_generation/checkpoints \ + --output_path=log/question_generation/predict.txt \ + --dataset_name=dureader_qg \ + --model_name_or_path="unimo-text-1.0" \ + --logging_steps=100 \ + --save_steps=500 \ + --epochs=20 \ + --batch_size=16 \ + --learning_rate=1e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=512 \ + --max_target_len=30 \ + --do_train \ + --do_predict \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --num_return_sequences=1 \ + --template=1 \ + --device=gpu +``` + + +关键参数释义如下: +- `gpus` 指示了训练所用的GPU,使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。 +- `dataset_name` 数据集名称,用来指定数据集格式,默认为`dureader_qg`。 +- `train_file` 本地训练数据地址,数据格式必须与`dataset_name`所指数据集格式相同,默认为None。 +- `predict_file` 本地测试数据地址,数据格式必须与`dataset_name`所指数据集格式相同,默认为None。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一,默认为`unimo-text-1.0`。 + | 可选预训练模型 | + |---------------------------------| + | unimo-text-1.0 | + | unimo-text-1.0-large | + | unimo-text-1.0-question-generation | + +- `save_dir` 表示模型的保存路径。 +- `output_path` 表示预测结果的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_proportion` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例。 +- `max_seq_len` 模型输入序列的最大长度。 +- `max_target_len` 模型训练时标签的最大长度。 +- `min_dec_len` 模型生成序列的最小长度。 +- `max_dec_len` 模型生成序列的最大长度。 +- `do_train` 是否进行训练。 +- `do_predict` 是否进行预测,在验证集上会自动评估。 +- `device` 表示使用的设备,从gpu和cpu中选择。 +- `template` 表示使用的模版,从[0, 1, 2, 3, 4]中选择,0表示不选择模版,1表示使用默认模版。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。 + +**【注意】** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + + +#### 过滤模型 +运行如下命令即可在样例训练集上微调答案抽取模型,用户可以选择基于`uie-base-qa-filter`进行微调,或者基于`uie-base`等从头开始微调。 +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "1,2" --log_dir log/filtration 
finetune/answer_extraction_and_roundtrip_filtration/finetune.py \ + --train_path=data/finetune/filtration/train.json \ + --dev_path=data/finetune/filtration/dev.json \ + --save_dir=log/filtration/checkpoints \ + --learning_rate=1e-5 \ + --batch_size=16 \ + --max_seq_len=512 \ + --num_epochs=30 \ + --model=uie-base \ + --seed=1000 \ + --logging_steps=100 \ + --valid_steps=100 \ + --device=gpu +``` +关键参数释义如下: +- `train_path`: 训练集文件路径。 +- `dev_path`: 验证集文件路径。 +- `save_dir`: 模型存储路径,默认为`log/filtration/checkpoints`。 +- `learning_rate`: 学习率,默认为1e-5。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `num_epochs`: 训练轮数,默认为30。 +- `model`: 选择模型,程序会基于选择的模型进行模型微调,可选有`uie-base-qa-filter`,`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`和`uie-nano`,默认为`uie-base`。 +- `init_from_ckpt`: 用于初始化的模型参数的路径。 +- `seed`: 随机种子,默认为1000. +- `logging_steps`: 日志打印的间隔steps数,默认10。 +- `valid_steps`: evaluate的间隔steps数,默认100。 +- `device`: 选用什么设备进行训练,可选cpu或gpu。 + + +通过运行以下命令在样例验证集上进行模型评估: + +```shell +python finetune/answer_extraction_and_roundtrip_filtration/evaluate.py \ + --model_path=log/filtration/checkpoints/model_best \ + --test_path=data/finetune/filtration/dev.json \ + --batch_size=16 \ + --max_seq_len=512 \ + --limit=0.5 +``` + +关键参数释义如下: +- `model_path`: 进行评估的模型文件夹路径,路径下需包含模型权重文件`model_state.pdparams`及配置文件`model_config.json`。 +- `test_path`: 进行评估的测试集文件。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `model`: 选择所使用的模型,可选有`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`和`uie-nano`,默认为`uie-base`。 +- `debug`: 是否开启debug模式对每个正例类别分别进行评估,该模式仅用于模型调试,默认关闭。 +- `limit`: SpanEvaluator测评指标的`limit`,当概率数组中的最后一个维度大于该值时将返回相应的文本片段。 + +#### 语义索引和召回模型 +我们的语义索引和召回模型是基于RocketQA的QueryEncoder训练的双塔模型,该模型用于语义索引和召回阶段,分别进行语义向量抽取和相似度召回。除使用预置模型外,如果用户想训练并接入自己的模型,模型训练可以参考[FAQ Finance](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/question_answering/supervised_qa/faq_finance)。 + +#### 排序模型 +我们的排序模型是基于RocketQA的CrossEncoder训练的单塔模型,该模型用于搜索的排序阶段,对召回的结果进行重新排序的作用。关于排序的定制训练,可以参考[CrossEncoder](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/ranking/cross_encoder)。 + +## References +[1] Zheng, Chujie, and Minlie Huang. "Exploring prompt-based few-shot learning for grounded dialog generation." arXiv preprint arXiv:2109.06513 (2021). + +[2] Li, Wei, et al. "Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning." arXiv preprint arXiv:2012.15409 (2020). + +[3] Puri, Raul, et al. "Training question answering models from synthetic data." arXiv preprint arXiv:2002.09599 (2020). + +[4] Lewis, Patrick, et al. "Paq: 65 million probably-asked questions and what you can do with them." Transactions of the Association for Computational Linguistics 9 (2021): 1098-1115. + +[5] Alberti, Chris, et al. "Synthetic QA corpora generation with roundtrip consistency." arXiv preprint arXiv:1906.05416 (2019). diff --git a/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/evaluate.py b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..4705384bf37cdeb3d478e51235836ef8674c914a --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/evaluate.py @@ -0,0 +1,94 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from utils import convert_example, reader, unify_prompt_name + +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, AutoTokenizer +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + """ + model.eval() + metric.reset() + for batch in data_loader: + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids) + metric.update(num_correct, num_infer, num_label) + precision, recall, f1 = metric.accumulate() + model.train() + return precision, recall, f1 + + +def do_eval(): + tokenizer = AutoTokenizer.from_pretrained(args.model_path) + model = UIE.from_pretrained(args.model_path) + + test_ds = load_dataset(reader, data_path=args.test_path, max_seq_len=args.max_seq_len, lazy=False) + class_dict = {} + if args.debug: + for data in test_ds: + class_name = unify_prompt_name(data["prompt"]) + # Only positive examples are evaluated in debug mode + if len(data["result_list"]) != 0: + class_dict.setdefault(class_name, []).append(data) + else: + class_dict["all_classes"] = test_ds + for key in class_dict.keys(): + if args.debug: + test_ds = MapDataset(class_dict[key]) + else: + test_ds = class_dict[key] + test_ds = test_ds.map(partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len)) + test_batch_sampler = paddle.io.BatchSampler(dataset=test_ds, batch_size=args.batch_size, shuffle=False) + test_data_loader = paddle.io.DataLoader(dataset=test_ds, batch_sampler=test_batch_sampler, return_list=True) + + metric = SpanEvaluator(args.limit) + precision, recall, f1 = evaluate(model, metric, test_data_loader) + logger.info("-----------------------------") + logger.info("Class Name: %s" % key) + logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument("--test_path", type=str, default=None, help="The path of test set.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after 
tokenization.") + parser.add_argument("--debug", action='store_true', help="Precision, recall and F1 score are calculated for each class separately if this option is enabled.") + parser.add_argument("--limit", type=float, default=0.5, help="The limit when using SpanEvaluator, when the last dimension in probability arrays is greater than the limit, the corresponding span will be returned.") + + args = parser.parse_args() + # yapf: enable + + do_eval() diff --git a/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/finetune.py b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..5e6747075667f67b5a53233628dc0d86e6d0cc62 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/finetune.py @@ -0,0 +1,141 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import paddle +from evaluate import evaluate +from utils import convert_example, reader, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, AutoTokenizer +from paddlenlp.utils.log import logger + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + tokenizer = AutoTokenizer.from_pretrained(args.model) + model = UIE.from_pretrained(args.model) + + train_ds = load_dataset(reader, data_path=args.train_path, max_seq_len=args.max_seq_len, lazy=False) + print("train data loaded successfully.") + dev_ds = load_dataset(reader, data_path=args.dev_path, max_seq_len=args.max_seq_len, lazy=False) + print("dev data loaded successfully.") + + train_ds = train_ds.map(partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len)) + dev_ds = dev_ds.map(partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len)) + + train_batch_sampler = paddle.io.BatchSampler(dataset=train_ds, batch_size=args.batch_size, shuffle=True) + train_data_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, return_list=True) + + dev_batch_sampler = paddle.io.BatchSampler(dataset=dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = paddle.io.DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, return_list=True) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + optimizer = paddle.optimizer.AdamW(learning_rate=args.learning_rate, parameters=model.parameters()) + + criterion = paddle.nn.BCELoss() + metric = 
SpanEvaluator() + + loss_list = [] + global_step = 0 + + best_f1 = 0 + tic_train = time.time() + for epoch in range(1, args.num_epochs + 1): + for batch in train_data_loader: + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + loss_start = criterion(start_prob, start_ids) + loss_end = criterion(end_prob, end_ids) + loss = (loss_start + loss_end) / 2.0 + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_list.append(float(loss)) + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + loss_avg = sum(loss_list) / len(loss_list) + logger.info( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss_avg, args.logging_steps / time_diff) + ) + tic_train = time.time() + + if global_step % args.valid_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + + precision, recall, f1 = evaluate(model, metric, dev_data_loader) + logger.info("Evaluation precision: %.5f, recall: %.5f, F1: %.5f" % (precision, recall, f1)) + if f1 > best_f1: + logger.info(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + save_dir = os.path.join(args.save_dir, "model_best") + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + tic_train = time.time() + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--train_path", default=None, type=str, help="The path of train set.") + parser.add_argument("--dev_path", default=None, type=str, help="The path of dev set.") + parser.add_argument("--save_dir", default='.log/filtration/checkpoints', type=str, help="The output directory where the model checkpoints will be written.") + parser.add_argument("--max_seq_len", default=512, type=int, help="The maximum input sequence length. 
Sequences longer than this will be split automatically.") + parser.add_argument("--num_epochs", default=100, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization") + parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") + parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--model", choices=["uie-base", "uie-tiny", "uie-medium", "uie-mini", "uie-micro", "uie-nano"], default="uie-base", type=str, help="Select the pretrained model for few-shot learning.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.") + + args = parser.parse_args() + # yapf: enable + + do_train() diff --git a/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/span.py b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/span.py new file mode 100644 index 0000000000000000000000000000000000000000..a34c113ed4f0274b656d302ed8eb08a827084041 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/span.py @@ -0,0 +1,103 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddle.metric import Metric + +from paddlenlp.utils.tools import get_bool_ids_greater_than, get_span + + +class SpanEvaluator(Metric): + """ + SpanEvaluator computes the precision, recall and F1-score for span detection. + """ + + def __init__(self, limit=0.5): + super(SpanEvaluator, self).__init__() + self.num_infer_spans = 0 + self.num_label_spans = 0 + self.num_correct_spans = 0 + self.limit = limit + + def compute(self, start_probs, end_probs, gold_start_ids, gold_end_ids): + """ + Computes the precision, recall and F1-score for span detection. 
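+        Positions whose probability (or gold label) is greater than ``self.limit`` are
+        treated as predicted/gold span boundaries; the method returns the number of
+        correct, inferred and labeled spans for the given batch.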
+ """ + pred_start_ids = get_bool_ids_greater_than(start_probs, self.limit) + pred_end_ids = get_bool_ids_greater_than(end_probs, self.limit) + gold_start_ids = get_bool_ids_greater_than(gold_start_ids.tolist(), self.limit) + gold_end_ids = get_bool_ids_greater_than(gold_end_ids.tolist(), self.limit) + num_correct_spans = 0 + num_infer_spans = 0 + num_label_spans = 0 + for predict_start_ids, predict_end_ids, label_start_ids, label_end_ids in zip( + pred_start_ids, pred_end_ids, gold_start_ids, gold_end_ids + ): + [_correct, _infer, _label] = self.eval_span( + predict_start_ids, predict_end_ids, label_start_ids, label_end_ids + ) + num_correct_spans += _correct + num_infer_spans += _infer + num_label_spans += _label + return num_correct_spans, num_infer_spans, num_label_spans + + def update(self, num_correct_spans, num_infer_spans, num_label_spans): + """ + This function takes (num_infer_spans, num_label_spans, num_correct_spans) as input, + to accumulate and update the corresponding status of the SpanEvaluator object. + """ + self.num_infer_spans += num_infer_spans + self.num_label_spans += num_label_spans + self.num_correct_spans += num_correct_spans + + def eval_span(self, predict_start_ids, predict_end_ids, label_start_ids, label_end_ids): + """ + evaluate position extraction (start, end) + return num_correct, num_infer, num_label + input: [1, 2, 10] [4, 12] [2, 10] [4, 11] + output: (1, 2, 2) + """ + pred_set = get_span(predict_start_ids, predict_end_ids) + label_set = get_span(label_start_ids, label_end_ids) + num_correct = len(pred_set & label_set) + num_infer = len(pred_set) + # For the case of overlapping in the same category, + # length of label_start_ids and label_end_ids is not equal + num_label = max(len(label_start_ids), len(label_end_ids)) + return (num_correct, num_infer, num_label) + + def accumulate(self): + """ + This function returns the mean precision, recall and f1 score for all accumulated minibatches. + + Returns: + tuple: Returns tuple (`precision, recall, f1 score`). + """ + precision = float(self.num_correct_spans / self.num_infer_spans) if self.num_infer_spans else 0.0 + recall = float(self.num_correct_spans / self.num_label_spans) if self.num_label_spans else 0.0 + f1_score = float(2 * precision * recall / (precision + recall)) if self.num_correct_spans else 0.0 + return precision, recall, f1_score + + def reset(self): + """ + Reset function empties the evaluation memory for previous mini-batches. + """ + self.num_infer_spans = 0 + self.num_label_spans = 0 + self.num_correct_spans = 0 + + def name(self): + """ + Return name of metric instance. + """ + return "precision", "recall", "f1" diff --git a/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/utils.py b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..95b0fb6d0b65bca840c26a643cd7267da2ce701f --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/answer_extraction_and_roundtrip_filtration/utils.py @@ -0,0 +1,454 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import random +import re + +import numpy as np +import paddle +from tqdm import tqdm + +from paddlenlp.utils.log import logger + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def convert_example(example, tokenizer, max_seq_len): + """ + example: { + title + prompt + content + result_list + } + """ + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + encoded_inputs = encoded_inputs[0] + offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]] + bias = 0 + for index in range(1, len(offset_mapping)): + mapping = offset_mapping[index] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + bias = offset_mapping[index - 1][1] + 1 # Includes [SEP] token + if mapping[0] == 0 and mapping[1] == 0: + continue + offset_mapping[index][0] += bias + offset_mapping[index][1] += bias + start_ids = [0 for x in range(max_seq_len)] + end_ids = [0 for x in range(max_seq_len)] + for item in example["result_list"]: + start = map_offset(item["start"] + bias, offset_mapping) + end = map_offset(item["end"] - 1 + bias, offset_mapping) + start_ids[start] = 1.0 + end_ids[end] = 1.0 + + tokenized_output = [ + encoded_inputs["input_ids"], + encoded_inputs["token_type_ids"], + encoded_inputs["position_ids"], + encoded_inputs["attention_mask"], + start_ids, + end_ids, + ] + tokenized_output = [np.array(x, dtype="int64") for x in tokenized_output] + return tuple(tokenized_output) + + +def map_offset(ori_offset, offset_mapping): + """ + map ori offset to token offset + """ + for index, span in enumerate(offset_mapping): + if span[0] <= ori_offset < span[1]: + return index + return -1 + + +def reader(data_path, max_seq_len=512): + """ + read json + """ + with open(data_path, "r", encoding="utf-8") as f: + i = 0 + j = 0 + for line in f: + json_line = json.loads(line) + content = json_line["content"].strip() + prompt = json_line["prompt"] + # Model Input is aslike: [CLS] Prompt [SEP] Content [SEP] + # It include three summary tokens. 
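+            # Contents longer than max_content_len are split into several windows below,
+            # and the start/end offsets in result_list are shifted so that each split
+            # example stays aligned with its own window.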
+ if max_seq_len <= len(prompt) + 3: + raise ValueError("The value of max_seq_len is too small, please set a larger value") + max_content_len = max_seq_len - len(prompt) - 3 + if len(content) <= max_content_len: + i += 1 + yield json_line + else: + j += 1 + result_list = json_line["result_list"] + json_lines = [] + accumulate = 0 + while True: + cur_result_list = [] + + for result in result_list: + if result["start"] + 1 <= max_content_len < result["end"]: + max_content_len = result["start"] + break + + cur_content = content[:max_content_len] + res_content = content[max_content_len:] + + while True: + if len(result_list) == 0: + break + elif result_list[0]["end"] <= max_content_len: + if result_list[0]["end"] > 0: + cur_result = result_list.pop(0) + cur_result_list.append(cur_result) + else: + cur_result_list = [result for result in result_list] + break + else: + break + + json_line = {"content": cur_content, "result_list": cur_result_list, "prompt": prompt} + json_lines.append(json_line) + + for result in result_list: + if result["end"] <= 0: + break + result["start"] -= max_content_len + result["end"] -= max_content_len + accumulate += max_content_len + max_content_len = max_seq_len - len(prompt) - 3 + if len(res_content) == 0: + break + elif len(res_content) < max_content_len: + json_line = {"content": res_content, "result_list": result_list, "prompt": prompt} + json_lines.append(json_line) + break + else: + content = res_content + + for json_line in json_lines: + yield json_line + + +def unify_prompt_name(prompt): + # The classification labels are shuffled during finetuning, so they need + # to be unified during evaluation. + if re.search(r"\[.*?\]$", prompt): + prompt_prefix = prompt[: prompt.find("[", 1)] + cls_options = re.search(r"\[.*?\]$", prompt).group()[1:-1].split(",") + cls_options = sorted(list(set(cls_options))) + cls_options = ",".join(cls_options) + prompt = prompt_prefix + "[" + cls_options + "]" + return prompt + return prompt + + +def add_negative_example(examples, texts, prompts, label_set, negative_ratio): + negative_examples = [] + positive_examples = [] + with tqdm(total=len(prompts)) as pbar: + for i, prompt in enumerate(prompts): + + redundants_list = list(set(label_set) ^ set(prompt)) + redundants_list.sort() + + num_positive = len(examples[i]) + if num_positive != 0: + actual_ratio = math.ceil(len(redundants_list) / num_positive) + else: + # Set num_positive to 1 for text without positive example + num_positive, actual_ratio = 1, 0 + + if actual_ratio <= negative_ratio or negative_ratio == -1: + idxs = [k for k in range(len(redundants_list))] + else: + idxs = random.sample(range(0, len(redundants_list)), negative_ratio * num_positive) + + for idx in idxs: + negative_result = {"content": texts[i], "result_list": [], "prompt": redundants_list[idx]} + negative_examples.append(negative_result) + positive_examples.extend(examples[i]) + pbar.update(1) + return positive_examples, negative_examples + + +def add_full_negative_example(examples, texts, relation_prompts, predicate_set, subject_goldens): + with tqdm(total=len(relation_prompts)) as pbar: + for i, relation_prompt in enumerate(relation_prompts): + negative_sample = [] + for subject in subject_goldens[i]: + for predicate in predicate_set: + # The relation prompt is constructed as follows: + # subject + "的" + predicate + prompt = subject + "的" + predicate + if prompt not in relation_prompt: + negative_result = {"content": texts[i], "result_list": [], "prompt": prompt} + negative_sample.append(negative_result) + 
examples[i].extend(negative_sample) + pbar.update(1) + return examples + + +def construct_relation_prompt_set(entity_name_set, predicate_set): + relation_prompt_set = set() + for entity_name in entity_name_set: + for predicate in predicate_set: + # The relation prompt is constructed as follows: + # subject + "的" + predicate + relation_prompt = entity_name + "的" + predicate + relation_prompt_set.add(relation_prompt) + return sorted(list(relation_prompt_set)) + + +def generate_cls_example(text, labels, prompt_prefix, options): + random.shuffle(options) + cls_options = ",".join(options) + prompt = prompt_prefix + "[" + cls_options + "]" + + result_list = [] + example = {"content": text, "result_list": result_list, "prompt": prompt} + for label in labels: + start = prompt.rfind(label[0]) - len(prompt) - 1 + end = start + len(label) + result = {"text": label, "start": start, "end": end} + example["result_list"].append(result) + return example + + +def convert_cls_examples(raw_examples, prompt_prefix="情感倾向", options=["正向", "负向"]): + """ + Convert labeled data export from doccano for classification task. + """ + examples = [] + logger.info("Converting doccano data...") + with tqdm(total=len(raw_examples)): + for line in raw_examples: + items = json.loads(line) + # Compatible with doccano >= 1.6.2 + if "data" in items.keys(): + text, labels = items["data"], items["label"] + else: + text, labels = items["text"], items["label"] + example = generate_cls_example(text, labels, prompt_prefix, options) + examples.append(example) + return examples + + +def convert_ext_examples( + raw_examples, negative_ratio, prompt_prefix="情感倾向", options=["正向", "负向"], separator="##", is_train=True +): + """ + Convert labeled data export from doccano for extraction and aspect-level classification task. + """ + + def _sep_cls_label(label, separator): + label_list = label.split(separator) + if len(label_list) == 1: + return label_list[0], None + return label_list[0], label_list[1:] + + def _concat_examples(positive_examples, negative_examples, negative_ratio): + examples = [] + if math.ceil(len(negative_examples) / len(positive_examples)) <= negative_ratio: + examples = positive_examples + negative_examples + else: + # Random sampling the negative examples to ensure overall negative ratio unchanged. + idxs = random.sample(range(0, len(negative_examples)), negative_ratio * len(positive_examples)) + negative_examples_sampled = [] + for idx in idxs: + negative_examples_sampled.append(negative_examples[idx]) + examples = positive_examples + negative_examples_sampled + return examples + + texts = [] + entity_examples = [] + relation_examples = [] + entity_cls_examples = [] + entity_prompts = [] + relation_prompts = [] + entity_label_set = [] + entity_name_set = [] + predicate_set = [] + subject_goldens = [] + + logger.info("Converting doccano data...") + with tqdm(total=len(raw_examples)) as pbar: + for line in raw_examples: + items = json.loads(line) + entity_id = 0 + if "data" in items.keys(): + relation_mode = False + if isinstance(items["label"], dict) and "entities" in items["label"].keys(): + relation_mode = True + text = items["data"] + entities = [] + relations = [] + if not relation_mode: + # Export file in JSONL format which doccano < 1.7.0 + # e.g. {"data": "", "label": [ [0, 2, "ORG"], ... 
]} + for item in items["label"]: + entity = {"id": entity_id, "start_offset": item[0], "end_offset": item[1], "label": item[2]} + entities.append(entity) + entity_id += 1 + else: + # Export file in JSONL format for relation labeling task which doccano < 1.7.0 + # e.g. {"data": "", "label": {"relations": [ {"id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG"}, ... ], "entities": [ {"id": 0, "from_id": 0, "to_id": 1, "type": "foundedAt"}, ... ]}} + entities.extend([entity for entity in items["label"]["entities"]]) + if "relations" in items["label"].keys(): + relations.extend([relation for relation in items["label"]["relations"]]) + else: + # Export file in JSONL format which doccano >= 1.7.0 + # e.g. {"text": "", "label": [ [0, 2, "ORG"], ... ]} + if "label" in items.keys(): + text = items["text"] + entities = [] + for item in items["label"]: + entity = {"id": entity_id, "start_offset": item[0], "end_offset": item[1], "label": item[2]} + entities.append(entity) + entity_id += 1 + relations = [] + else: + # Export file in JSONL (relation) format + # e.g. {"text": "", "relations": [ {"id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG"}, ... ], "entities": [ {"id": 0, "from_id": 0, "to_id": 1, "type": "foundedAt"}, ... ]} + text, relations, entities = items["text"], items["relations"], items["entities"] + texts.append(text) + + entity_example = [] + entity_prompt = [] + entity_example_map = {} + entity_map = {} # id to entity name + for entity in entities: + entity_name = text[entity["start_offset"] : entity["end_offset"]] + entity_map[entity["id"]] = { + "name": entity_name, + "start": entity["start_offset"], + "end": entity["end_offset"], + } + + entity_label, entity_cls_label = _sep_cls_label(entity["label"], separator) + + # Define the prompt prefix for entity-level classification + entity_cls_prompt_prefix = entity_name + "的" + prompt_prefix + if entity_cls_label is not None: + entity_cls_example = generate_cls_example( + text, entity_cls_label, entity_cls_prompt_prefix, options + ) + + entity_cls_examples.append(entity_cls_example) + + result = {"text": entity_name, "start": entity["start_offset"], "end": entity["end_offset"]} + if entity_label not in entity_example_map.keys(): + entity_example_map[entity_label] = { + "content": text, + "result_list": [result], + "prompt": entity_label, + } + else: + entity_example_map[entity_label]["result_list"].append(result) + + if entity_label not in entity_label_set: + entity_label_set.append(entity_label) + if entity_name not in entity_name_set: + entity_name_set.append(entity_name) + entity_prompt.append(entity_label) + + for v in entity_example_map.values(): + entity_example.append(v) + + entity_examples.append(entity_example) + entity_prompts.append(entity_prompt) + + subject_golden = [] # Golden entity inputs + relation_example = [] + relation_prompt = [] + relation_example_map = {} + for relation in relations: + predicate = relation["type"] + subject_id = relation["from_id"] + object_id = relation["to_id"] + # The relation prompt is constructed as follows: + # subject + "的" + predicate + prompt = entity_map[subject_id]["name"] + "的" + predicate + if entity_map[subject_id]["name"] not in subject_golden: + subject_golden.append(entity_map[subject_id]["name"]) + result = { + "text": entity_map[object_id]["name"], + "start": entity_map[object_id]["start"], + "end": entity_map[object_id]["end"], + } + if prompt not in relation_example_map.keys(): + relation_example_map[prompt] = {"content": text, "result_list": [result], "prompt": 
prompt} + else: + relation_example_map[prompt]["result_list"].append(result) + + if predicate not in predicate_set: + predicate_set.append(predicate) + relation_prompt.append(prompt) + + for v in relation_example_map.values(): + relation_example.append(v) + + relation_examples.append(relation_example) + relation_prompts.append(relation_prompt) + subject_goldens.append(subject_golden) + pbar.update(1) + + logger.info("Adding negative samples for first stage prompt...") + positive_examples, negative_examples = add_negative_example( + entity_examples, texts, entity_prompts, entity_label_set, negative_ratio + ) + if len(positive_examples) == 0: + all_entity_examples = [] + elif is_train: + all_entity_examples = _concat_examples(positive_examples, negative_examples, negative_ratio) + else: + all_entity_examples = positive_examples + negative_examples + + all_relation_examples = [] + if len(predicate_set) != 0: + if is_train: + logger.info("Adding negative samples for second stage prompt...") + relation_prompt_set = construct_relation_prompt_set(entity_name_set, predicate_set) + positive_examples, negative_examples = add_negative_example( + relation_examples, texts, relation_prompts, relation_prompt_set, negative_ratio + ) + all_relation_examples = _concat_examples(positive_examples, negative_examples, negative_ratio) + else: + logger.info("Adding negative samples for second stage prompt...") + relation_examples = add_full_negative_example( + relation_examples, texts, relation_prompts, predicate_set, subject_goldens + ) + all_relation_examples = [r for relation_example in relation_examples for r in relation_example] + return all_entity_examples, all_relation_examples, entity_cls_examples diff --git a/applications/question_answering/unsupervised_qa/finetune/question_generation/gen_utils.py b/applications/question_answering/unsupervised_qa/finetune/question_generation/gen_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..10591bf8058fcb2d186363b2db0f00a6324a2a5f --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/question_generation/gen_utils.py @@ -0,0 +1,316 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +from functools import partial + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler + +from paddlenlp.data import Pad + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. 
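+    # Offsetting by the process rank gives every worker a distinct op-level seed,
+    # while the data seeds above stay identical across workers.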
+ paddle.seed(seed + dist.get_rank()) + + +def convert_example( + example, tokenizer, max_seq_len=512, max_target_len=128, max_title_len=256, mode="train", template=0 +): + """Convert all examples into necessary features.""" + if mode == "pretrain" or mode == "pretrain_test": + context = example["context"] + answer = example["answer"] + target = example["target"] + source = "答案:" + answer + tokenizer.sep_token + "上下文:" + context + title = None + + elif mode == "train" or mode == "test": + target = None + title = None + if "source" in example and "title" in example: + source = example["source"] + if "title" in example.keys(): + title = example["title"] + elif "context" in example and "answer" in example: + source = example["context"] + if "answer" in example.keys(): + title = example["answer"] + else: + assert False, "Source and title are not in the input dictionary, nor are context and answer." + if "target" in example.keys(): + target = example["target"] + elif "question" in example.keys(): + target = example["question"] + + if template == 1: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 2: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "在已知答案的前提下,问题:" + target + elif template == 3: + source = "这是一个问题生成任务,根据提供的答案和上下文,来生成问题。" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 4: + prompt_common = example["prompt_common"] + prompt_domain = example["prompt_domain"] + source = ( + prompt_common + + " " + + tokenizer.sep_token + + " " + + "".join( + [" " + tokenizer.cls_token + " " + one + " " + tokenizer.sep_token + " " for one in prompt_domain] + ) + + " " + + tokenizer.cls_token + + " " + + "答案:" + + title + + " " + + tokenizer.sep_token + + " " + + tokenizer.cls_token + + "上下文:" + + source + ) + + title = None + if target: + target = "问题:" + target + + if mode == "train" or mode == "pretrain": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + target=target, + max_seq_len=max_seq_len, + max_target_len=max_target_len, + max_title_len=max_title_len, + return_position_ids=True, + return_length=True, + ) + temp_tokens = tokenizer.convert_ids_to_tokens(tokenized_example["input_ids"]) + index_list = [] + count = tokenized_example["input_ids"].count(tokenizer.cls_token_id) + # If template==4, count must be equal to 7, otherwise count must be equal to 2 + assert count == 7 or count == 2, ( + str(count) + " is not in [2, 7], temp_tokens: " + " ".join(temp_tokens) + "source: " + source + ) + index = -1 + for i in range(0, count): + index = tokenized_example["input_ids"].index(tokenizer.cls_token_id, index + 1) + index_list.append(index) + if template == 4: + tokenized_example["token_type_ids"] = ( + [2] * (index_list[1] - index_list[0]) + + [3] * (index_list[4] - index_list[1]) + + [0] * (index_list[6] - index_list[4]) + + [1] * (len(tokenized_example["input_ids"]) - index_list[6]) + ) + target_start = index_list[-1] + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + if template == 4: + tokenized_example["token_type_ids"] + return tokenized_example + + elif mode == "test" or mode == "pretrain_test": + tokenized_example = 
tokenizer.gen_encode( + source, + title=title, + max_seq_len=max_seq_len, + max_title_len=max_title_len, + add_start_token_for_decoding=True, + return_position_ids=True, + ) + + if template == 4: + # temp_tokens = tokenizer.convert_ids_to_tokens(tokenized_example['input_ids']) + index_list = [] + count = tokenized_example["input_ids"].count(tokenizer.cls_token_id) + assert count == 7, str(count) + " is not in [7]" + index = -1 + for i in range(0, count): + index = tokenized_example["input_ids"].index(tokenizer.cls_token_id, index + 1) + index_list.append(index) + tokenized_example["token_type_ids"] = ( + [2] * (index_list[1] - index_list[0]) + + [3] * (index_list[4] - index_list[1]) + + [0] * (index_list[6] - index_list[4]) + + [1] * (len(tokenized_example["input_ids"]) - index_list[6]) + ) + assert ("target" in example and example["target"]) or ("question" in example and example["question"]), example + if "target" in example and example["target"]: + tokenized_example["target"] = example["target"] + elif "question" in example and example["question"]: + tokenized_example["target"] = example["question"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + if mode == "train" or mode == "pretrain": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + elif mode == "test" or mode == "pretrain_test": + return input_ids, token_type_ids, position_ids, attention_mask + + +def create_data_loader(dataset, tokenizer, args, mode): + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + max_target_len=args.max_target_len, + max_title_len=args.max_title_len, + mode=mode, + template=args.template, + ) + dataset = dataset.map(trans_func, lazy=True) + if mode == "pretrain": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "test" or mode == "pretrain_test": + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, 
pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + special_tokens = ["[UNK]"] + tokens = [token for token in tokens if token not in special_tokens] + return token_ids, tokens + + +def remove_template(instr): + """Remove template prefix of decoded sequence.""" + outstr = instr.strip("问题:") + outstr = outstr.strip("在已知答案的前提下,问题:") + return outstr + + +def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + + target = "".join(pred_tokens) + target = remove_template(target) + + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + + tmp.append([target, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + response = "".join(pred_tokens) + response = remove_template(response) + + # TODO: Support return scores in FT. + tmp.append([response]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + + return results diff --git a/applications/question_answering/unsupervised_qa/finetune/question_generation/predict.py b/applications/question_answering/unsupervised_qa/finetune/question_generation/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..4ec590f45b08016a2642fbb37253aae48e33ee09 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/question_generation/predict.py @@ -0,0 +1,141 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
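+
+# Generates questions with a finetuned UNIMO-text model: examples are read from
+# --predict_file (or the dev split of --dataset_name) and, when --do_predict is set,
+# the decoded results are written to --output_path.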
+ +import argparse +import json +import time + +import paddle +import paddle.distributed as dist +from gen_utils import create_data_loader, print_args, select_sum, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--dataset_name', type=str, default='dureader_qg', help='The name of the dataset to load.') + parser.add_argument('--model_name_or_path', type=str, default='unimo-text-1.0', help='The path or shortcut name of the pre-trained model.') + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--seed', type=int, default=1, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_target_len', type=int, default=30, help='The maximum target sequence length of training.') + parser.add_argument('--max_title_len', type=int, default=30, help='The maximum title sequence length of training.') + parser.add_argument('--max_dec_len', type=int, default=20, help='The maximum sequence length of decoding.') + parser.add_argument('--min_dec_len', type=int, default=3, help='The minimal sequence length of decoding.') + parser.add_argument('--num_return_sequences', type=int, default=1, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='beam_search', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=6, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.2, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--do_predict", action='store_true', help="Whether to eval and predict.") + parser.add_argument("--template", type=int, default=1, help="The template used during training, select from [0, 1, 2, 3, 4].") + + args = parser.parse_args() + return args +# yapf: enable + + +def read_file(file): + with open(file, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip() + if not line: + continue + line = json.loads(line) + yield line + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = 
UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + if args.predict_file: + dev_ds = load_dataset(read_file, file=args.predict_file, lazy=False) + else: + dev_ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + if args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + prediction(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def prediction(model, data_loader, args, tokenizer): + print("\nPred begin...") + model.eval() + pred_ref = [] + time_begin = time.time() + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = time.time() + print("Generation cost time:", time.time() - time_begin) + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/applications/question_answering/unsupervised_qa/finetune/question_generation/train.py b/applications/question_answering/unsupervised_qa/finetune/question_generation/train.py new file mode 100644 index 0000000000000000000000000000000000000000..73e2c1544328e42275af2d0886e44e66a61df7cf --- /dev/null +++ b/applications/question_answering/unsupervised_qa/finetune/question_generation/train.py @@ -0,0 +1,281 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
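+
+# Finetunes UNIMO-text for question generation with label-smoothed cross entropy
+# (--do_train); with --do_predict the dev set is decoded and scored with BLEU-4,
+# and during training the best-scoring checkpoint is also saved as model_best.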
+ +import argparse +import json +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn.functional as F +from gen_utils import create_data_loader, print_args, select_sum, set_seed +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import BLEU +from paddlenlp.transformers import ( + BasicTokenizer, + LinearDecayWithWarmup, + UNIMOLMHeadModel, + UNIMOTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--dataset_name', type=str, default='dureader_qg', help='The name of the dataset to load.') + parser.add_argument('--model_name_or_path', type=str, default='unimo-text-1.0', help='The path or shortcut name of the pre-trained model.') + parser.add_argument("--train_file", type=str, required=False, default=None, help="Train data path.") + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--save_steps', type=int, default=1000, help='Save checkpoint every X updates steps.') + parser.add_argument('--seed', type=int, default=1, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--learning_rate', type=float, default=5e-5, help='The initial learning rate.') + parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') + parser.add_argument('--epochs', type=int, default=3, help='Total number of training epochs to perform.') + parser.add_argument('--warmup_proportion', type=float, default=0.02, help='The number of warmup steps.') + parser.add_argument('--max_grad_norm', type=float, default=1.0, help='The max value of grad norm.') + parser.add_argument('--beta1', type=float, default=0.9, help='beta1') + parser.add_argument('--beta2', type=float, default=0.98, help='beta2') + parser.add_argument('--epsilon', type=float, default=1e-6, help='epsilon') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_target_len', type=int, default=30, help='The maximum target sequence length of training.') + parser.add_argument('--max_title_len', type=int, default=30, help='The maximum title sequence length of training.') + parser.add_argument('--max_dec_len', type=int, default=20, help='The maximum sequence length of decoding.') + parser.add_argument('--min_dec_len', type=int, default=3, help='The minimal sequence length of decoding.') + parser.add_argument('--num_return_sequences', type=int, default=1, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='beam_search', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, 
default=6, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.2, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--do_train", action='store_true', help="Whether to train the model.") + parser.add_argument("--do_predict", action='store_true', help="Whether to eval and predict.") + parser.add_argument("--template", type=int, default=1, help="The template used during training, select from [0, 1, 2, 3, 4].") + + args = parser.parse_args() + return args +# yapf: enable + + +def calc_bleu_n(preds, targets, n_size=4): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. But received {} and {}.".format(len(preds), len(targets)) + ) + bleu = BLEU(n_size=n_size) + tokenizer = BasicTokenizer() + + for pred, target in zip(preds, targets): + pred_tokens = tokenizer.tokenize(pred) + target_token = tokenizer.tokenize(target) + + bleu.add_inst(pred_tokens, [target_token]) + + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("BLEU-" + str(n_size) + ":", bleu.score()) + return bleu.score() + + +def calc_bleu(preds, targets): + calc_bleu_n(preds, targets, 1) + calc_bleu_n(preds, targets, 2) + calc_bleu_n(preds, targets, 3) + bleu4_score = calc_bleu_n(preds, targets, 4) + return bleu4_score + + +def read_file(file): + with open(file, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip() + if not line: + continue + line = json.loads(line) + yield line + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + if args.train_file: + train_ds = load_dataset(read_file, file=args.train_file, lazy=False) + else: + train_ds = load_dataset(args.dataset_name, splits="train", data_files=args.train_file) + if args.predict_file: + dev_ds = load_dataset(read_file, file=args.predict_file, lazy=False) + else: + dev_ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + + train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + if args.do_train: + num_training_steps = args.epochs * len(train_data_loader) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
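+        # Parameters whose names contain "bias" or "norm" are filtered out of
+        # decay_params and therefore skipped by apply_decay_param_fun below.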
+ + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.epsilon, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + step = 0 + total_time = 0.0 + best_bleu4 = 0 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for inputs in train_data_loader: + step += 1 + labels = inputs[-1] + logits = model(*inputs[:-1]) + labels = paddle.nn.functional.one_hot(labels, num_classes=logits.shape[-1]) + labels = paddle.nn.functional.label_smooth(labels) + loss = F.cross_entropy(logits, labels, soft_label=True) + loss.backward() + + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + + if step % args.save_steps == 0 or step >= num_training_steps: + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + print("Saved step {} model.\n".format(step)) + if args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + bleu4 = evaluation(model_eval, dev_data_loader, args, tokenizer) + if bleu4 > best_bleu4: + print("best BLEU-4 performence has been updated: %.5f --> %.5f" % (best_bleu4, bleu4)) + best_bleu4 = bleu4 + save_ckpt(model, tokenizer, args.save_dir, "best") + + batch_start_time = time.time() + + print("\nTraining completed.") + elif args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def evaluation(model, data_loader, args, tokenizer): + print("\nEval begin...") + model.eval() + pred_ref = [] + time_begin = time.time() + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = time.time() + print("Generation cost time:", time.time() - time_begin) + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + with open(args.output_path + ".reference.txt", "w", encoding="utf-8") as fout: + targets = [example["target"] for example in data_loader.dataset] + for target in targets: + 
fout.write(target + "\n") + + print("\nSave inference result into: %s" % args.output_path) + + if "target" in data_loader.dataset[0].keys(): + targets = [example["target"] for example in data_loader.dataset] + bleu4_score = calc_bleu(pred_ref, targets) + + model.train() + return bleu4_score + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/applications/question_answering/unsupervised_qa/run_corpus_preparation.py b/applications/question_answering/unsupervised_qa/run_corpus_preparation.py new file mode 100644 index 0000000000000000000000000000000000000000..56b1f912c020a14745af24376db0332ac50c4c66 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/run_corpus_preparation.py @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--source_file_path', type=str, default=None, help='the source json file path') + parser.add_argument('--target_dir_path', type=str, default=None, help='the target dir path') + parser.add_argument('--test_sample_num', type=int, default=0, help='the test sample number when preparing qa system data') + parser.add_argument('--train_sample_num', type=int, default=0, help='the test sample number when preparing qa system data') + parser.add_argument('--all_sample_num', type=int, default=None, help='the all sample number when preparing qa system data') + args = parser.parse_args() + return args +# yapf: enable + + +def convert_json_to_data(json_file, out_dir, test_sample_num, train_sample_num, all_sample_num=None): + with open(json_file, "r", encoding="utf-8") as rf, open( + os.path.join(out_dir, "qa_pair.csv"), "w", encoding="utf-8" + ) as qa_pair_wf, open(os.path.join(out_dir, "qac_triple.csv"), "w", encoding="utf-8") as qac_triple_wf, open( + os.path.join(out_dir, "train.csv"), "w", encoding="utf-8" + ) as train_wf, open( + os.path.join(out_dir, "q_corpus.csv"), "w", encoding="utf-8" + ) as q_corpus_wf, open( + os.path.join(out_dir, "dev.csv"), "w", encoding="utf-8" + ) as test_wf: + for i, json_line in enumerate(rf.readlines()): + line_dict = json.loads(json_line) + context = line_dict["context"] + if "answer" in line_dict and "question" in line_dict: + answer = line_dict["answer"] + question = line_dict["question"] + elif "synthetic_answer" in line_dict and "synthetic_question" in line_dict: + answer = line_dict["synthetic_answer"] + question = line_dict["synthetic_question"] + + if isinstance(question, list): + question = question[0] + else: + question = question + + if i < test_sample_num: + test_wf.write(question.replace("\n", " ").replace("\t", " ").strip() + "\n") + elif test_sample_num <= i < test_sample_num + train_sample_num: + train_wf.write(question.replace("\n", " ").replace("\t", " ").strip() + "\n") + + if not all_sample_num or i < all_sample_num: + qa_pair_wf.write( + 
question.replace("\n", " ").replace("\t", " ").strip() + + "\t" + + answer.replace("\n", " ").replace("\t", " ").strip() + + "\n" + ) + qac_triple_wf.write( + question.replace("\n", " ").replace("\t", " ").strip() + + "\t" + + answer.replace("\n", " ").replace("\t", " ").strip() + + "\t" + + context + + "\n" + ) + q_corpus_wf.write(question.replace("\n", " ").replace("\t", " ").strip() + "\n") + + +if __name__ == "__main__": + args = parse_args() + convert_json_to_data( + args.source_file_path, args.target_dir_path, args.test_sample_num, args.train_sample_num, args.all_sample_num + ) diff --git a/applications/question_answering/unsupervised_qa/run_data_preprocess.py b/applications/question_answering/unsupervised_qa/run_data_preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..8da9535318359593693a36807fa3899b7fa1fcc7 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/run_data_preprocess.py @@ -0,0 +1,161 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--source_file_path', type=str, default=None, help='the source json file path') + parser.add_argument('--target_dir', type=str, default='data', help='the target file path') + parser.add_argument('--do_answer_prompt', action="store_true", help="is use answer prompt") + parser.add_argument('--do_len_prompt', action="store_true", help="is use length prompt") + parser.add_argument('--do_domain_prompt', action="store_true", help="is use domain prompt") + parser.add_argument('--domain', type=str, default=None, help='the domain of the dataset when using domain prompt') + args = parser.parse_args() + return args +# yapf: enable + + +def convert_from_json_to_answer_extraction_format( + json_file, output_path, domain=None, do_answer_prompt=True, do_len_prompt=False, do_domain_prompt=False +): + with open(json_file, "r", encoding="utf-8") as rf, open(output_path, "w", encoding="utf-8") as wf: + for line in rf: + json_line = json.loads(line) + context = json_line["context"] + + answer = json_line["answer"] + # Cut the abnormally long sample + if len(answer) > 300: + answer = answer[:300] + + begin_id = context.find(answer) + assert begin_id != -1, "'" + answer + "' is not found in " + context + end_id = begin_id + len(answer) + result = {"text": answer, "start": begin_id, "end": end_id} + if do_answer_prompt: + outdict = { + "content": context, + "result_list": [result], + "prompt": "答案", + } + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + if do_len_prompt: + if len(answer) < 10: + len_prompat = "短答案" + elif len(answer) < 20: + len_prompat = "中短答案" + elif len(answer) < 30: + len_prompat = "中长答案" + else: + len_prompat = "长答案" + + len_outdict = { + "content": context, + "result_list": [result], + "prompt": len_prompat, + } + wf.write(json.dumps(len_outdict, ensure_ascii=False) + "\n") + if 
do_domain_prompt and domain: + domain_outdict = { + "content": context, + "result_list": [result], + "prompt": domain, + } + wf.write(json.dumps(domain_outdict, ensure_ascii=False) + "\n") + + +def convert_from_json_to_question_generation_format(json_file, output_path, tokenizer=None): + with open(json_file, "r", encoding="utf-8") as rf, open(output_path, "w", encoding="utf-8") as wf: + for line in rf: + json_line = json.loads(line) + context = json_line["context"] + + answer = json_line["answer"] + # Cut the abnormally long sample + if len(answer) > 300: + answer = answer[:300] + question = json_line["question"] + + outdict = { + "question": question, + "answer": answer, + "context": context, + } + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + + +def convert_from_json_to_filtration_format(json_file, output_path, tokenizer=None): + with open(json_file, "r", encoding="utf-8") as rf, open(output_path, "w", encoding="utf-8") as wf: + for line in rf: + json_line = json.loads(line) + context = json_line["context"] + + answer = json_line["answer"] + # Cut the abnormally long sample + if len(answer) > 300: + answer = answer[:300] + question = json_line["question"] + + prefix = "问题:" + question + "上下文:" + content = prefix + context + + begin_id = context.find(answer) + assert begin_id != -1, "'" + answer + "' is not found in " + context + end_id = begin_id + len(answer) + begin_id += len(prefix) + end_id += len(prefix) + + result = {"text": answer, "start": begin_id, "end": end_id} + outdict = { + "content": content, + "result_list": [result], + "prompt": "答案", + } + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + + +if __name__ == "__main__": + args = parse_args() + answer_extraction_target_file_path = os.path.join( + args.target_dir, "answer_extraction", os.path.basename(args.source_file_path) + ) + if not os.path.exists(os.path.dirname(answer_extraction_target_file_path)): + os.makedirs(os.path.dirname(answer_extraction_target_file_path)) + convert_from_json_to_answer_extraction_format( + json_file=args.source_file_path, + output_path=answer_extraction_target_file_path, + domain=args.domain, + do_answer_prompt=args.do_answer_prompt, + do_len_prompt=args.do_len_prompt, + do_domain_prompt=args.do_domain_prompt, + ) + + question_generation_target_file_path = os.path.join( + args.target_dir, "question_generation", os.path.basename(args.source_file_path) + ) + if not os.path.exists(os.path.dirname(question_generation_target_file_path)): + os.makedirs(os.path.dirname(question_generation_target_file_path)) + convert_from_json_to_question_generation_format( + json_file=args.source_file_path, output_path=question_generation_target_file_path + ) + + filtration_target_file_path = os.path.join(args.target_dir, "filtration", os.path.basename(args.source_file_path)) + if not os.path.exists(os.path.dirname(filtration_target_file_path)): + os.makedirs(os.path.dirname(filtration_target_file_path)) + convert_from_json_to_filtration_format(json_file=args.source_file_path, output_path=filtration_target_file_path) diff --git a/applications/question_answering/unsupervised_qa/run_pipelines_example.py b/applications/question_answering/unsupervised_qa/run_pipelines_example.py new file mode 100644 index 0000000000000000000000000000000000000000..c0b714f799af0e5d48c9a192b737a9ae86ed5ebf --- /dev/null +++ b/applications/question_answering/unsupervised_qa/run_pipelines_example.py @@ -0,0 +1,155 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +from pipelines.document_stores import FAISSDocumentStore +from pipelines.nodes import ( + AnswerExtractor, + DensePassageRetriever, + ErnieRanker, + QAFilter, + QuestionGenerator, +) +from pipelines.pipelines import QAGenerationPipeline, SemanticSearchPipeline +from pipelines.utils import convert_files_to_dicts, print_documents + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to run dense_qa system, defaults to gpu.") +parser.add_argument("--index_name", default='faiss_index', type=str, help="The ann index name of FAISS.") +parser.add_argument("--max_seq_len_query", default=64, type=int, help="The maximum total length of query after tokenization.") +parser.add_argument("--max_seq_len_passage", default=256, type=int, help="The maximum total length of passage after tokenization.") +parser.add_argument("--retriever_batch_size", default=16, type=int, help="The batch size of retriever to extract passage embedding for building ANN index.") +parser.add_argument("--doc_dir", default="data/my_data", type=str, help="The question-answer pairs file to be loaded when building ANN index.") +parser.add_argument("--source_file", default=None, type=str, help="The source raw texts file to be loaded when creating question-answer pairs.") + +args = parser.parse_args() +# yapf: enable + + +def dense_faq_pipeline(): + use_gpu = True if args.device == "gpu" else False + faiss_document_store = "faiss_document_store.db" + if os.path.exists(args.index_name) and os.path.exists(faiss_document_store): + # connect to existed FAISS Index + document_store = FAISSDocumentStore.load(args.index_name) + retriever = DensePassageRetriever( + document_store=document_store, + query_embedding_model="rocketqa-zh-dureader-query-encoder", + passage_embedding_model="rocketqa-zh-dureader-query-encoder", + max_seq_len_query=args.max_seq_len_query, + max_seq_len_passage=args.max_seq_len_passage, + batch_size=args.retriever_batch_size, + use_gpu=use_gpu, + embed_title=False, + ) + else: + dicts = convert_files_to_dicts( + dir_path=args.doc_dir, split_paragraphs=True, split_answers=True, encoding="utf-8" + ) + + if os.path.exists(args.index_name): + os.remove(args.index_name) + if os.path.exists(faiss_document_store): + os.remove(faiss_document_store) + + document_store = FAISSDocumentStore(embedding_dim=768, faiss_index_factory_str="Flat") + document_store.write_documents(dicts) + + retriever = DensePassageRetriever( + document_store=document_store, + query_embedding_model="rocketqa-zh-dureader-query-encoder", + passage_embedding_model="rocketqa-zh-dureader-query-encoder", + max_seq_len_query=args.max_seq_len_query, + max_seq_len_passage=args.max_seq_len_passage, + batch_size=args.retriever_batch_size, + use_gpu=use_gpu, + embed_title=False, + ) + + # update Embedding + document_store.update_embeddings(retriever) + + # save index + 
document_store.save(args.index_name) + + # Ranker + ranker = ErnieRanker(model_name_or_path="rocketqa-zh-dureader-cross-encoder", use_gpu=use_gpu) + + pipe = SemanticSearchPipeline(retriever, ranker) + + pipeline_params = {"Retriever": {"top_k": 50}, "Ranker": {"top_k": 1}} + prediction = pipe.run(query="世界上最早的地雷发明者是谁?", params=pipeline_params) + + print_documents(prediction, print_name=False, print_meta=True) + + +def qa_generation_pipeline(): + answer_extractor = AnswerExtractor( + model="uie-base-answer-extractor", + device=args.device, + schema=["答案"], + max_answer_candidates=3, + position_prob=0.01, + batch_size=1, + ) + + question_generator = QuestionGenerator( + model="unimo-text-1.0-question-generation", + device=args.device, + num_return_sequences=2, + ) + + qa_filter = QAFilter( + model="uie-base-qa-filter", + device=args.device, + schema=["答案"], + position_prob=0.1, + ) + + pipe = QAGenerationPipeline( + answer_extractor=answer_extractor, question_generator=question_generator, qa_filter=qa_filter + ) + pipeline_params = {"QAFilter": {"is_filter": True}} + + # list example + meta = [ + "世界上最早的电影院是美国洛杉矶的“电气剧场”,建于1902年。", + "以脸书为例,2020年时,54%的成年人表示,他们从该平台获取新闻。而现在,这个数字下降到了44%。与此同时,YouTube在过去几年里一直保持平稳,约有三分之一的用户在该平台上获取新闻。", + ] + prediction = pipe.run(meta=meta, params=pipeline_params) + prediction = prediction["filtered_cqa_triples"] + pprint(prediction) + + # file example + if args.source_file: + meta = [] + with open(args.source_file, "r", encoding="utf-8") as rf: + for line in rf: + meta.append(line.strip()) + prediction = pipe.run(meta=meta, params=pipeline_params) + prediction = prediction["filtered_cqa_triples"] + if not os.path.exists(args.doc_dir): + os.makedirs(args.doc_dir) + with open(os.path.join(args.doc_dir, "generated_qa_pairs.txt"), "w", encoding="utf-8") as wf: + for pair in prediction: + wf.write(pair["synthetic_question"].strip() + "\t" + pair["synthetic_answer"].strip() + "\n") + + +if __name__ == "__main__": + qa_generation_pipeline() + dense_faq_pipeline() diff --git a/applications/question_answering/unsupervised_qa/run_qa_pairs_generation.py b/applications/question_answering/unsupervised_qa/run_qa_pairs_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..8a8e57b94c01878456f652ba0c4cdbec4019c60b --- /dev/null +++ b/applications/question_answering/unsupervised_qa/run_qa_pairs_generation.py @@ -0,0 +1,334 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json +import os + +from tqdm import tqdm + +from paddlenlp import Taskflow + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--answer_generation_model_path', type=str, default=None, help='the model path to be loaded for answer extraction') + parser.add_argument('--question_generation_model_path', type=str, default=None, help='the model path to be loaded for question generation') + parser.add_argument('--filtration_model_path', type=str, default=None, help='the model path to be loaded for filtration') + parser.add_argument('--source_file_path', type=str, default=None, help='the source file path') + parser.add_argument('--target_file_path', type=str, default=None, help='the target json file path') + parser.add_argument('--batch_size', type=int, default=1, help='the batch size when using taskflow') + parser.add_argument("--do_debug", action='store_true', help="Whether to do debug") + parser.add_argument('--a_prompt', type=str, default='答案', help='the prompt when using taskflow, separate by ,') + parser.add_argument('--a_position_prob', type=float, default=0.01, help='confidence threshold for answer extraction') + parser.add_argument('--a_max_answer_candidates', type=int, default=5, help='the max number of return answer candidate for each input') + parser.add_argument('--q_num_return_sequences', type=int, default=3, help='the number of return sequences for each input sample, it should be less than num_beams') + parser.add_argument('--q_max_question_length', type=int, default=50, help='the max decoding length') + parser.add_argument('--q_decode_strategy', type=str, default='sampling', help='the decode strategy') + parser.add_argument('--q_num_beams', type=int, default=6, help='the number of beams when using beam search') + parser.add_argument('--q_num_beam_groups', type=int, default=1, help='the number of beam groups when using diverse beam search') + parser.add_argument('--q_diversity_rate', type=float, default=0.0, help='the diversity_rate when using diverse beam search') + parser.add_argument('--q_top_k', type=float, default=5, help='the top_k when using sampling decoding strategy') + parser.add_argument('--q_top_p', type=float, default=1.0, help='the top_p when using sampling decoding strategy') + parser.add_argument('--q_temperature', type=float, default=1.0, help='the temperature when using sampling decoding strategy') + parser.add_argument("--do_filtration", action='store_true', help="Whether to do filtration") + parser.add_argument('--f_filtration_position_prob', type=float, default=0.1, help='confidence threshold for filtration') + args = parser.parse_args() + return args +# yapf: enable + + +def answer_generation_from_paragraphs( + paragraphs, batch_size=16, model=None, max_answer_candidates=5, schema=None, wf=None +): + """Generate answer from given paragraphs.""" + result = [] + buffer = [] + i = 0 + len_paragraphs = len(paragraphs) + for paragraph_tobe in tqdm(paragraphs): + buffer.append(paragraph_tobe) + if len(buffer) == batch_size or (i + 1) == len_paragraphs: + predicts = model(buffer) + paragraph_list = buffer + buffer = [] + for predict_dict, paragraph in zip(predicts, paragraph_list): + answers = [] + probabilitys = [] + for prompt in schema: + if prompt in predict_dict: + answer_dicts = predict_dict[prompt] + answers += [answer_dict["text"] for answer_dict in answer_dicts] + probabilitys += [answer_dict["probability"] for answer_dict in answer_dicts] + else: + answers += [] + probabilitys 
+= [] + candidates = sorted(list(set([(a, p) for a, p in zip(answers, probabilitys)])), key=lambda x: -x[1]) + if len(candidates) > max_answer_candidates: + candidates = candidates[:max_answer_candidates] + outdict = { + "context": paragraph, + "answer_candidates": candidates, + } + if wf: + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + result.append(outdict) + i += 1 + return result + + +def create_fake_question( + json_file_or_pair_list, out_json=None, num_return_sequences=1, all_sample_num=None, batch_size=8 +): + if out_json: + wf = open(out_json, "w", encoding="utf-8") + if isinstance(json_file_or_pair_list, list): + all_lines = json_file_or_pair_list + else: + rf = open(json_file_or_pair_list, "r", encoding="utf-8") + all_lines = [] + for json_line in rf: + line_dict = json.loads(json_line) + all_lines.append(line_dict) + rf.close() + num_all_lines = len(all_lines) + output = [] + context_buffer = [] + answer_buffer = [] + answer_probability_buffer = [] + true_question_buffer = [] + i = 0 + for index, line_dict in enumerate(tqdm(all_lines)): + if "question" in line_dict: + q = line_dict["question"] + else: + q = "" + c = line_dict["context"] + assert "answer_candidates" in line_dict + answers = line_dict["answer_candidates"] + if not answers: + continue + for j, pair in enumerate(answers): + a, p = pair + context_buffer += [c] + answer_buffer += [a] + answer_probability_buffer += [p] + true_question_buffer += [q] + if ( + (i + 1) % batch_size == 0 + or (all_sample_num and (i + 1) == all_sample_num) + or ((index + 1) == num_all_lines and j == len(answers) - 1) + ): + result_buffer = question_generation( + [{"context": context, "answer": answer} for context, answer in zip(context_buffer, answer_buffer)] + ) + context_buffer_temp, answer_buffer_temp, answer_probability_buffer_temp, true_question_buffer_temp = ( + [], + [], + [], + [], + ) + for context, answer, answer_probability, true_question in zip( + context_buffer, answer_buffer, answer_probability_buffer, true_question_buffer + ): + context_buffer_temp += [context] * num_return_sequences + answer_buffer_temp += [answer] * num_return_sequences + answer_probability_buffer_temp += [answer_probability] * num_return_sequences + true_question_buffer_temp += [true_question] * num_return_sequences + result_one_two_buffer = [(one, two) for one, two in zip(result_buffer[0], result_buffer[1])] + for context, answer, answer_probability, true_question, result in zip( + context_buffer_temp, + answer_buffer_temp, + answer_probability_buffer_temp, + true_question_buffer_temp, + result_one_two_buffer, + ): + fake_questions_tokens = [result[0]] + fake_questions_scores = [result[1]] + for fake_questions_token, fake_questions_score in zip( + fake_questions_tokens, fake_questions_scores + ): + out_dict = { + "context": context, + "synthetic_answer": answer, + "synthetic_answer_probability": answer_probability, + "synthetic_question": fake_questions_token, + "synthetic_question_probability": fake_questions_score, + "true_question": true_question, + } + if out_json: + wf.write(json.dumps(out_dict, ensure_ascii=False) + "\n") + output.append(out_dict) + context_buffer = [] + answer_buffer = [] + true_question_buffer = [] + if all_sample_num and (i + 1) >= all_sample_num: + break + i += 1 + if out_json: + wf.close() + return output + + +def filtration(paragraphs, batch_size=16, model=None, schema=None, wf=None, wf_debug=None): + result = [] + buffer = [] + valid_num, invalid_num = 0, 0 + i = 0 + len_paragraphs = len(paragraphs) + for 
paragraph_tobe in tqdm(paragraphs): + buffer.append(paragraph_tobe) + if len(buffer) == batch_size or (i + 1) == len_paragraphs: + model_inputs = [] + for d in buffer: + context = d["context"] + synthetic_question = d["synthetic_question"] + prefix = "问题:" + synthetic_question + "上下文:" + content = prefix + context + model_inputs.append(content) + predicts = model(model_inputs) + paragraph_list = buffer + buffer = [] + for predict_dict, paragraph in zip(predicts, paragraph_list): + context = paragraph["context"] + synthetic_question = paragraph["synthetic_question"] + synthetic_question_probability = paragraph["synthetic_question_probability"] + synthetic_answer = paragraph["synthetic_answer"] + synthetic_answer_probability = paragraph["synthetic_answer_probability"] + + answers = [] + probabilitys = [] + for prompt in schema: + if prompt in predict_dict: + answer_dicts = predict_dict[prompt] + answers += [answer_dict["text"] for answer_dict in answer_dicts] + probabilitys += [answer_dict["probability"] for answer_dict in answer_dicts] + else: + answers += [] + probabilitys += [] + candidates = [ + an for an, pro in sorted([(a, p) for a, p in zip(answers, probabilitys)], key=lambda x: -x[1]) + ] + out_dict = { + "context": context, + "synthetic_answer": synthetic_answer, + "synthetic_answer_probability": synthetic_answer_probability, + "synthetic_question": synthetic_question, + "synthetic_question_probability": synthetic_question_probability, + } + if synthetic_answer in candidates: + if wf: + wf.write(json.dumps(out_dict, ensure_ascii=False) + "\n") + result.append(out_dict) + valid_num += 1 + else: + if wf_debug: + wf_debug.write(json.dumps(out_dict, ensure_ascii=False) + "\n") + invalid_num += 1 + i += 1 + print("valid synthetic question-answer pairs number:", valid_num) + print("invalid synthetic question-answer pairs number:", invalid_num) + return result + + +if __name__ == "__main__": + args = parse_args() + assert args.a_prompt + schema = args.a_prompt.strip().split(",") + answer_generator = Taskflow( + "information_extraction", + schema=schema, + task_path=args.answer_generation_model_path, + batch_size=args.batch_size, + position_prob=args.a_position_prob, + ) + assert args.source_file_path + paragraphs = [] + if args.source_file_path.endswith(".json"): + with open(args.source_file_path, "r", encoding="utf-8") as rf: + for json_line in rf: + line_dict = json.loads(json_line) + assert "context" in line_dict or "content" in line_dict + if "context" in line_dict: + paragraphs.append(line_dict["context"].strip()) + elif "content" in line_dict: + paragraphs.append(line_dict["content"].strip()) + else: + with open(args.source_file_path, "r", encoding="utf-8") as rf: + for line in rf: + paragraphs.append(line.strip()) + + synthetic_context_answer_pairs = answer_generation_from_paragraphs( + paragraphs, + batch_size=args.batch_size, + model=answer_generator, + max_answer_candidates=args.a_max_answer_candidates, + schema=schema, + wf=None, + ) + print("create synthetic answers successfully!") + + question_generation = Taskflow( + "question_generation", + task_path=args.question_generation_model_path, + output_scores=True, + max_length=args.q_max_question_length, + is_select_from_num_return_sequences=False, + num_return_sequences=args.q_num_return_sequences, + batch_size=args.batch_size, + decode_strategy=args.q_decode_strategy, + num_beams=args.q_num_beams, + num_beam_groups=args.q_num_beam_groups, + diversity_rate=args.q_diversity_rate, + top_k=args.q_top_k, + top_p=args.q_top_p, + 
temperature=args.q_temperature, + ) + synthetic_answer_question_pairs = create_fake_question( + synthetic_context_answer_pairs, + None if args.do_filtration else args.target_file_path, + args.q_num_return_sequences, + None, + args.batch_size, + ) + print("create synthetic question-answer pairs successfully!") + + wf = None + wf_debug = None + if args.target_file_path: + if not os.path.exists(os.path.dirname(args.target_file_path)): + os.makedirs(os.path.dirname(args.target_file_path)) + wf = open(args.target_file_path, "w", encoding="utf-8") + if args.do_debug: + wf_debug = open(args.target_file_path + ".debug.json", "w", encoding="utf-8") + if args.do_filtration: + filtration_model = Taskflow( + "information_extraction", + schema=["答案"], + task_path=args.filtration_model_path, + batch_size=args.batch_size, + position_prob=args.f_filtration_position_prob, + ) + filtration( + synthetic_answer_question_pairs, + batch_size=16, + model=filtration_model, + schema=["答案"], + wf=wf, + wf_debug=wf_debug, + ) + print("filter synthetic question-answer pairs successfully!") + rf.close() + wf.close() diff --git a/applications/question_answering/unsupervised_qa/tools/create_synthetic_answer.py b/applications/question_answering/unsupervised_qa/tools/create_synthetic_answer.py new file mode 100644 index 0000000000000000000000000000000000000000..d5408bb48d2b3035cbd741fb1c1b4b41b0863234 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/tools/create_synthetic_answer.py @@ -0,0 +1,105 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json + +from tqdm import tqdm + +from paddlenlp import Taskflow + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_path', type=str, default=None, help='the model path to be loaded for question_generation taskflow') + parser.add_argument('--source_file_path', type=str, default=None, help='the source file path') + parser.add_argument('--target_file_path', type=str, default=None, help='the target json file path') + parser.add_argument('--all_sample_num', type=int, default=None, help='the test sample number when convert_json_to_data') + parser.add_argument('--num_return_sequences', type=int, default=3, help='the number of return sequences for each input sample, it should be less than num_beams') + parser.add_argument('--batch_size', type=int, default=1, help='the batch size when using taskflow') + parser.add_argument('--position_prob', type=float, default=0.01, help='the batch size when using taskflow') + parser.add_argument('--decode_strategy', type=str, default=None, help='the decode strategy') + parser.add_argument('--num_beams', type=int, default=6, help='the number of beams when using beam search') + parser.add_argument('--num_beam_groups', type=int, default=1, help='the number of beam groups when using diverse beam search') + parser.add_argument('--diversity_rate', type=float, default=0.0, help='the diversity_rate when using diverse beam search') + parser.add_argument('--top_k', type=float, default=0, help='the top_k when using sampling decoding strategy') + parser.add_argument('--top_p', type=float, default=1.0, help='the top_p when using sampling decoding strategy') + parser.add_argument('--temperature', type=float, default=1.0, help='the temperature when using sampling decoding strategy') + args = parser.parse_args() + return args +# yapf: enable + + +def answer_generation_from_paragraphs(paragraphs, batch_size=16, model=None, wf=None): + """Generate answer from given paragraphs.""" + result = [] + buffer = [] + for paragraph_tobe in tqdm(paragraphs): + buffer.append(paragraph_tobe) + if len(buffer) == batch_size: + predicts = model(buffer) + paragraph_list = buffer + buffer = [] + for predict_dict, paragraph in zip(predicts, paragraph_list): + if "答案" in predict_dict: + answer_dicts = predict_dict["答案"] + answers = [answer_dict["text"] for answer_dict in answer_dicts] + probabilitys = [answer_dict["probability"] for answer_dict in answer_dicts] + else: + answers = [] + probabilitys = [] + + outdict = { + "context": paragraph, + "answer_candidates": sorted([(a, p) for a, p in zip(answers, probabilitys)], key=lambda x: -x[1]), + } + if wf: + wf.write(json.dumps(outdict, ensure_ascii=False) + "\n") + result.append(outdict) + return result + + +if __name__ == "__main__": + args = parse_args() + schema = ["答案"] + answer_generator = Taskflow( + "information_extraction", + schema=schema, + task_path=args.model_path, + batch_size=args.batch_size, + position_prob=args.position_prob, + ) + assert args.source_file_path + paragraphs = [] + if args.source_file_path.endswith(".json"): + with open(args.source_file_path, "r", encoding="utf-8") as rf: + for json_line in rf: + line_dict = json.loads(json_line) + assert "context" in line_dict or "content" in line_dict + if "context" in line_dict: + paragraphs.append(line_dict["context"].strip()) + elif "content" in line_dict: + paragraphs.append(line_dict["content"].strip()) + else: + with open(args.source_file_path, "r", encoding="utf-8") as rf: + for line in 
rf: + paragraphs.append(line.strip()) + wf = None + if args.target_file_path: + wf = open(args.target_file_path, "w", encoding="utf-8") + + answer_generation_from_paragraphs(paragraphs, batch_size=args.batch_size, model=answer_generator, wf=wf) + rf.close() + wf.close() diff --git a/applications/question_answering/unsupervised_qa/tools/create_synthetic_question.py b/applications/question_answering/unsupervised_qa/tools/create_synthetic_question.py new file mode 100644 index 0000000000000000000000000000000000000000..9d8e1a1b7ed598d65eed21b5c2cf8bb7b358abfc --- /dev/null +++ b/applications/question_answering/unsupervised_qa/tools/create_synthetic_question.py @@ -0,0 +1,119 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +from tqdm import tqdm + +from paddlenlp import Taskflow + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_path', type=str, default=None, help='the model path to be loaded for question_generation taskflow') + parser.add_argument('--max_length', type=int, default=50, help='the max decoding length') + parser.add_argument('--num_return_sequences', type=int, default=3, help='the number of return sequences for each input sample, it should be less than num_beams') + parser.add_argument('--source_file_path', type=str, default=None, help='the souce json file path') + parser.add_argument('--target_file_path', type=str, default=None, help='the target json file path') + parser.add_argument('--all_sample_num', type=int, default=None, help='the test sample number when convert_json_to_data') + parser.add_argument('--batch_size', type=int, default=1, help='the batch size when using taskflow') + parser.add_argument('--decode_strategy', type=str, default=None, help='the decode strategy') + parser.add_argument('--num_beams', type=int, default=6, help='the number of beams when using beam search') + parser.add_argument('--num_beam_groups', type=int, default=1, help='the number of beam groups when using diverse beam search') + parser.add_argument('--diversity_rate', type=float, default=0.0, help='the diversity_rate when using diverse beam search') + parser.add_argument('--top_k', type=float, default=0, help='the top_k when using sampling decoding strategy') + parser.add_argument('--top_p', type=float, default=1.0, help='the top_p when using sampling decoding strategy') + parser.add_argument('--temperature', type=float, default=1.0, help='the temperature when using sampling decoding strategy') + args = parser.parse_args() + return args +# yapf: enable + + +def create_fake_question(json_file, out_json, num_return_sequences, all_sample_num=None, batch_size=8): + with open(json_file, "r", encoding="utf-8") as rf, open(out_json, "w", encoding="utf-8") as wf: + all_lines = rf.readlines() + num_all_lines = len(all_lines) + context_buffer = [] + answer_buffer = [] + true_question_buffer = [] + for i, json_line in enumerate(tqdm(all_lines)): + 
line_dict = json.loads(json_line) + q = line_dict["question"] + a = line_dict["answer"] + c = line_dict["context"] + + context_buffer += [c] + answer_buffer += [a] + true_question_buffer += [q] + if ( + (i + 1) % batch_size == 0 + or (all_sample_num and (i + 1) == all_sample_num or (i + 1)) + or (i + 1) == num_all_lines + ): + result_buffer = question_generation( + [{"context": context, "answer": answer} for context, answer in zip(context_buffer, answer_buffer)] + ) + context_buffer_temp, answer_buffer_temp, true_question_buffer_temp = [], [], [] + for context, answer, true_question in zip(context_buffer, answer_buffer, true_question_buffer): + context_buffer_temp += [context] * num_return_sequences + answer_buffer_temp += [answer] * num_return_sequences + true_question_buffer_temp += [true_question] * num_return_sequences + result_one_two_buffer = [(one, two) for one, two in zip(result_buffer[0], result_buffer[1])] + for context, answer, true_question, result in zip( + context_buffer_temp, answer_buffer_temp, true_question_buffer_temp, result_one_two_buffer + ): + fake_quesitons_tokens = [result[0]] + fake_quesitons_scores = [result[1]] + for fake_quesitons_token, fake_quesitons_score in zip( + fake_quesitons_tokens, fake_quesitons_scores + ): + out_dict = { + "context": context, + "answer": answer, + "question": fake_quesitons_token, + "true_question": true_question, + "score": fake_quesitons_score, + } + wf.write(json.dumps(out_dict, ensure_ascii=False) + "\n") + context_buffer = [] + answer_buffer = [] + true_question_buffer = [] + + if all_sample_num and (i + 1) >= all_sample_num: + break + + +if __name__ == "__main__": + args = parse_args() + question_generation = Taskflow( + "question_generation", + task_path=args.model_path, + output_scores=True, + max_length=args.max_length, + is_select_from_num_return_sequences=False, + num_return_sequences=args.num_return_sequences, + batch_size=args.batch_size, + decode_strategy=args.decode_strategy, + num_beams=args.num_beams, + num_beam_groups=args.num_beam_groups, + diversity_rate=args.diversity_rate, + top_k=args.top_k, + top_p=args.top_p, + temperature=args.temperature, + ) + create_fake_question( + args.source_file_path, args.target_file_path, args.num_return_sequences, args.all_sample_num, args.batch_size + ) diff --git a/applications/question_answering/unsupervised_qa/tools/dev_qq_pair_creation.py b/applications/question_answering/unsupervised_qa/tools/dev_qq_pair_creation.py new file mode 100644 index 0000000000000000000000000000000000000000..9de127a63f7b979f0438f15f984ec81948d018cc --- /dev/null +++ b/applications/question_answering/unsupervised_qa/tools/dev_qq_pair_creation.py @@ -0,0 +1,118 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json +import os + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument("--do_create_test_qq_pair", action='store_true', help="Whether to do create_test_qq_pair") + parser.add_argument('--qq_pair_source_ori_file_path', type=str, default=None, help='the original source file path for qq-pair creating') + parser.add_argument('--qq_pair_source_trans_file_path', type=str, default=None, help='the translated source file path for qq-pair creating') + parser.add_argument('--qq_pair_target_file_path', type=str, default=None, help='the target file path for qq-pair creating') + parser.add_argument('--trans_query_answer_path', type=str, default=None, help='the target query-answer file path for extract_trans_from_fake_question') + parser.add_argument('--dev_sample_num', type=int, default=None, help='the test sample number when convert_json_to_data, if None, treat all lines as dev samples') + args = parser.parse_args() + return args +# yapf: enable + + +def extract_q_from_json_file(json_file, out_file=None, test_sample_num=None, query_answer_path=None): + with open(json_file, "r", encoding="utf-8") as rf: + if out_file: + wf = open(os.path.join(out_file), "w", encoding="utf-8") + if query_answer_path: + qeury_answer_wf = open(query_answer_path, "w", encoding="utf-8") + q_list = [] + for i, json_line in enumerate(rf.readlines()): + line_dict = json.loads(json_line) + if isinstance(line_dict["question"], list): + question = line_dict["question"][0] + else: + question = line_dict["question"] + answer = line_dict["answer"] + if not test_sample_num or i < test_sample_num: + if query_answer_path: + qeury_answer_wf.write( + question.replace("\n", " ").replace("\t", " ").strip() + + "\t" + + answer.replace("\n", " ").replace("\t", " ").strip() + + "\n" + ) + if out_file: + wf.write(question.replace("\n", " ").replace("\t", " ").strip() + "\n") + q_list.append(question.strip()) + else: + break + if query_answer_path: + qeury_answer_wf.close() + if out_file: + wf.colse() + return q_list + + +def create_test_qq_pair( + ori_path=None, trans_path=None, write_path=None, trans_query_answer_path=None, test_sample_num=None +): + assert trans_path + trans_rf = open(trans_path, "r", encoding="utf-8") + wf = open(write_path, "w", encoding="utf-8") + if trans_path.endswith(".json"): + trans_q_list = extract_q_from_json_file(trans_path, None, test_sample_num, trans_query_answer_path) + else: + trans_q_list = [ + line.strip() for i, line in enumerate(trans_rf.readlines()) if not test_sample_num or i < test_sample_num + ] + + if not ori_path or ori_path in ["NONE", "None", "none"]: + origin_q_list = ["-" for _ in range(len(trans_q_list))] + else: + origin_rf = open(ori_path, "r", encoding="utf-8") + if ori_path.endswith(".json"): + origin_q_list = extract_q_from_json_file(ori_path, None, test_sample_num) + else: + origin_q_list = [ + line.strip() + for i, line in enumerate(origin_rf.readlines()) + if not test_sample_num or i < test_sample_num + ] + + for origin, trans in zip(origin_q_list, trans_q_list): + wf.write( + trans.replace("\n", " ").replace("\t", " ").strip() + + "\t" + + origin.replace("\n", " ").replace("\t", " ").strip() + + "\n" + ) + if not ori_path or ori_path in ["NONE", "None", "none"]: + pass + else: + origin_rf.close() + trans_rf.close() + wf.close() + + +if __name__ == "__main__": + args = parse_args() + if args.do_create_test_qq_pair: + create_test_qq_pair( + ori_path=args.qq_pair_source_ori_file_path, + 
trans_path=args.qq_pair_source_trans_file_path, + write_path=args.qq_pair_target_file_path, + trans_query_answer_path=args.trans_query_answer_path, + test_sample_num=args.dev_sample_num, + ) diff --git a/applications/question_answering/unsupervised_qa/tools/json_format_indent.py b/applications/question_answering/unsupervised_qa/tools/json_format_indent.py new file mode 100644 index 0000000000000000000000000000000000000000..84731b2130130fd7f960a546e6d5b6888f711849 --- /dev/null +++ b/applications/question_answering/unsupervised_qa/tools/json_format_indent.py @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + + +def json_format_indent(json_file, output_json): + with open(output_json, "w", encoding="utf-8") as wf: + with open(json_file, "r", encoding="utf-8") as rf: + all_lines = [] + for json_line in rf: + line_dict = json.loads(json_line) + all_lines.append(line_dict) + output_dataset = {"data": all_lines} + json.dump(output_dataset, wf, ensure_ascii=False, indent="\t") + + +if __name__ == "__main__": + json_format_indent("", "") diff --git a/applications/question_answering/unsupervised_qa/tools/question_coverage.py b/applications/question_answering/unsupervised_qa/tools/question_coverage.py new file mode 100644 index 0000000000000000000000000000000000000000..42c74d1eb759ea148ecf251c78830031cf6b84dc --- /dev/null +++ b/applications/question_answering/unsupervised_qa/tools/question_coverage.py @@ -0,0 +1,233 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json +import multiprocessing +import os +import time + +from tqdm import tqdm +from tqdm.contrib import tzip + +from paddlenlp.metrics import BLEU +from paddlenlp.transformers import BasicTokenizer + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--true_file_path', type=str, default=None, help='the source json file path') + parser.add_argument('--generate_file_path', type=str, default=None, help='the target json file path') + parser.add_argument('--num_return_sequences', type=int, default=3, help='the number of return sequences for each input sample, it should be less than num_beams') + parser.add_argument('--all_sample_num', type=int, default=None, help='the number of valid sample') + parser.add_argument('--bleu_n_size', type=int, default=4, help='the bleu n size') + parser.add_argument('--bleu_threshold', type=float, default=0.3, help='the bleu threshold') + parser.add_argument("--do_log_file", action="store_true", help="is log analysis file") + parser.add_argument('--log_dir', type=str, default=None, help='the log dir') + parser.add_argument("--do_multiprocessing", action="store_true", help="is do multiprocessing") + parser.add_argument("--do_map_async", action="store_true", help="is use map_async or apply_async when do multiprocessing") + args = parser.parse_args() + return args +# yapf: enable + + +def calc_bleu_n(preds, targets, n_size=4): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. But received {} and {}.".format(len(preds), len(targets)) + ) + bleu = BLEU(n_size=n_size) + tokenizer = BasicTokenizer() + + for pred, target in zip(preds, targets): + pred_tokens = tokenizer.tokenize(pred) + target_token = tokenizer.tokenize(target) + + bleu.add_inst(pred_tokens, [target_token]) + return bleu.score() + + +def worker_apply_async(true_question, generate_question_group, bleu_n_size, bleu_threshold, i): + first_positive_pair = None + for generate_question in generate_question_group: + bleu_score = calc_bleu_n([generate_question], [true_question], bleu_n_size) + if bleu_score > bleu_threshold: + first_positive_pair = (generate_question, true_question, i) + if first_positive_pair: + return (True, first_positive_pair) + else: + return (False, (generate_question_group[0], true_question)) + + +def worker_map_async(args): + true_question, generate_question_group, bleu_n_size, bleu_threshold, i = args + first_positive_pair = None + for generate_question in generate_question_group: + bleu_score = calc_bleu_n([generate_question], [true_question], bleu_n_size) + if bleu_score > bleu_threshold: + first_positive_pair = (generate_question, true_question, i) + if first_positive_pair: + return (True, first_positive_pair) + else: + return (False, (generate_question_group[0], true_question)) + + +def coverage_rate( + true_file_path, + generate_file_path, + bleu_n_size, + bleu_threshold, + num_return_sequences, + all_sample_num=None, + is_log_file=False, + log_dir=None, + is_multiprocessing=True, + is_map_async=True, +): + true_questions = [] + with open(true_file_path, "r", encoding="utf-8") as rf: + for i, json_line in enumerate(tqdm(rf.readlines())): + if i >= all_sample_num: + break + line_dict = json.loads(json_line) + true_questions.append( + line_dict["question"][0] if isinstance(line_dict["question"], list) else line_dict["question"] + ) + + generate_question_groups = [] + with open(generate_file_path, "r", encoding="utf-8") as rf: + 
group = [] + for i, json_line in enumerate(tqdm(rf.readlines())): + if i >= all_sample_num * num_return_sequences: + break + line_dict = json.loads(json_line) + group.append( + line_dict["question"][0] if isinstance(line_dict["question"], list) else line_dict["question"] + ) + if (i + 1) % num_return_sequences == 0: + generate_question_groups.append(group) + group = [] + print("true_questions", len(true_questions)) + print("generate_question_groups", len(generate_question_groups)) + positive = [] + negative = [] + if is_multiprocessing: + pool = multiprocessing.Pool(processes=30) + pool_results = [] + if is_map_async: + map_async_inputs = [] + i = 0 + bleu_cal_time_start = time.time() + generate_question_groups = [ + [ + generate_question if generate_question.strip() != "" else "none" + for generate_question in generate_question_group + ] + for generate_question_group in generate_question_groups + ] + for true_question, generate_question_group in tzip(true_questions, generate_question_groups): + if is_multiprocessing: + if is_map_async: + map_async_inputs.append((true_question, generate_question_group, bleu_n_size, bleu_threshold, i)) + else: + pool_results.append( + pool.apply_async( + worker_apply_async, + args=(true_question, generate_question_group, bleu_n_size, bleu_threshold, i), + ) + ) + + else: + first_positive_pair = None + best_pair, best_score = None, 0 + for generate_question in generate_question_group: + try: + bleu_score = calc_bleu_n([generate_question], [true_question], bleu_n_size) + except BaseException: + print("generate_question", generate_question) + print("true_question", true_question) + if bleu_score > best_score: + best_pair = (generate_question, true_question) + if bleu_score > bleu_threshold: + first_positive_pair = (generate_question, true_question) + if first_positive_pair: + positive.append((best_pair[0], best_pair[1], best_score)) + else: + negative.append((best_pair[0], best_pair[1], best_score)) + i += 1 + if is_multiprocessing: + if is_map_async: + pool_results = pool.map_async(worker_map_async, map_async_inputs) + pool.close() + pool.join() + for result in pool_results.get(): + is_positive, pair = result + if is_positive: + positive.append(pair) + else: + negative.append(pair) + else: + pool.close() + pool.join() + for result in pool_results: + is_positive, pair = result.get() + if is_positive: + positive.append(pair) + else: + negative.append(pair) + + bleu_cal_time_end = time.time() + print("bleu_cal_time_spend:", bleu_cal_time_end - bleu_cal_time_start) + if is_log_file and log_dir: + with open(os.path.join(log_dir, "positive_pair.txt"), "w", encoding="utf-8") as wf: + for pair in positive: + wf.write( + pair[0] + "\t" + pair[1] + "\n" + if len(pair) == 2 + else pair[0] + "\t" + pair[1] + str(pair[2]) + "\n" + ) + with open(os.path.join(log_dir, "negative_pair.txt"), "w", encoding="utf-8") as wf: + for pair in negative: + wf.write( + pair[0] + "\t" + pair[1] + "\n" + if len(pair) == 2 + else pair[0] + "\t" + pair[1] + str(pair[2]) + "\n" + ) + assert len(positive) + len(negative) == all_sample_num, ( + "the number of positive pairs " + + str(len(positive)) + + " plus the number of negative pairs " + + str(len(negative)) + + " should be equal to all_sample_num" + + str(all_sample_num) + ) + return len(positive) / (len(positive) + len(negative)) + + +if __name__ == "__main__": + args = parse_args() + rate = coverage_rate( + true_file_path=args.true_file_path, + generate_file_path=args.generate_file_path, + bleu_n_size=args.bleu_n_size, + 
bleu_threshold=args.bleu_threshold, + num_return_sequences=args.num_return_sequences, + all_sample_num=args.all_sample_num, + is_log_file=args.do_log_file, + log_dir=args.log_dir, + is_multiprocessing=args.do_multiprocessing, + is_map_async=args.do_map_async, + ) + print("coverage rate is", rate) diff --git a/applications/sentiment_analysis/ASO_analysis/.gitignore b/applications/sentiment_analysis/ASO_analysis/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..7d7a840f05d7620097a1fc1633da9db221c58bf2 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/.gitignore @@ -0,0 +1,2 @@ +checkpoints/* +data/* diff --git a/applications/sentiment_analysis/ASO_analysis/README.md b/applications/sentiment_analysis/ASO_analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1bd17691bcdd4a995e08d002f8a6cd4be9fc58e4 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/README.md @@ -0,0 +1,200 @@ +# 评论观点抽取与情感倾向性分析 + +## 1. 场景概述 + +情感分析旨在对带有情感色彩的主观性文本进行分析、处理、归纳和推理,其广泛应用于消费决策、舆情分析、个性化推荐等领域,具有很高的商业价值。 + +依托百度领先的情感分析技术,食行生鲜自动生成菜品评论标签辅助用户购买,并指导运营采购部门调整选品和促销策略;房天下向购房者和开发商直观展示楼盘的用户口碑情况,并对好评楼盘置顶推荐;国美搭建服务智能化评分系统,客服运营成本减少40%,负面反馈处理率100%。 + +情感分析相关的任务有语句级情感分析、评论对象抽取、观点抽取等等。一般来讲,被人们所熟知的情感分析任务是语句级别的情感分析,该任务是在宏观上去分析整句话的感情色彩,其粒度可能相对比较粗。 + +因为在人们进行评论的时候,往往针对某一产品或服务进行多个属性的评论,对每个属性的评论可能也会褒贬不一,因此针对属性级别的情感分析在真实的场景中会更加实用,同时更能给到企业用户或商家更加具体的建议。例如这句关于薯片的评论。 + +> 这个薯片味道真的太好了,口感很脆,只是包装很一般。 + +可以看到,顾客在口感、包装和味道 三个属性上对薯片进行了评价,顾客在味道和口感两个方面给出了好评,但是在包装上给出了负面的评价。只有通过这种比较细粒度的分析,商家才能更有针对性的发现问题,进而改进自己的产品或服务。 + +基于这样的考虑,本项目提出了一种细粒度的情感分析能力,对于给定的文本,首先会抽取该文本中的评论观点,然后分析不同观点的情感极性。 + + +## 2. 产品功能介绍 + +### 2.1 系统特色 +为了降低技术门槛,方便开发者共享效果领先的情感分析技术,PaddleNLP本次开源的情感分析系统,具备三大亮点: + +- 覆盖任务全 + - 集成评论观点抽取、属性级情感分类等情感分析能力,并开源模型,且打通模型训练、评估、预测部署全流程。 +- 效果领先 + - 集成百度研发的基于情感知识增强的预训练模型SKEP,为各类情感分析任务提供统一且强大的情感语义表示能力。 +- 预测性能强 + - 针对预训练模型预测效率低的问题,开源小模型PP-MiniLM,量化优化策略,预测性能大幅提升。 + +### 2.2 架构&功能 + +本项目提出的情感分析解决方案如图1所示,整个情感分析的过程大致包含两个阶段,依次是评论观点抽取模型,属性级情感分类模型。对于给定的一段文本,首先基于前者抽取出文本语句中潜在的评论属性以及该属性相应的评论观点,然后将评论属性、观点以及原始文本进行拼接,传给属性级情感分类模型以识别出该评论属性的情感极性。 + +这里需要提到的是,由于目前市面上的大多数模型是基于通用语料训练出来的,这些模型可能并不会对情感信息那么敏感。基于这样的考量,本项目使用了百度自研的 SKEP 预训练模型,其在预训练阶段便设计了多种情感信息相关的预训练目标进行训练。作为一种情感专属的模型,其更适合用来做上边提到的评论观点抽取任务,以及属性级情感分类任务。 + +另外,本项目使用的是 Large 版的 SKEP 模型,考虑到企业用户在线上部署时会考虑到模型预测效率,所以本项目专门提供了一个通用版的小模型 [PP-MiniLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression/pp-minilm) 以及一套量化策略,用户可以使用相应情感数据集对 PP-MiniLM 进行微调,然后进行量化,以达到更快的预测效率。 + +
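+
+结合上面的架构说明,下面用一小段 Python 伪代码示意两阶段流程中数据的流转方式(仅作说明;其中 `extract_aspect_opinions` 与 `classify_sentiment` 为示意性占位函数,并非本项目的实际接口,完整的预测流程请直接使用 `predict.py` 或 `demo.py`):
+
+```python
+def analyze(text, extract_aspect_opinions, classify_sentiment):
+    """示意:先抽取评论属性与观点,再逐一判断各属性的情感极性。"""
+    results = []
+    for aspect, opinions in extract_aspect_opinions(text):
+        # 将评论属性与观点词拼接,再与原文组成句对,交给属性级情感分类模型
+        aspect_opinion = aspect + "".join(opinions)
+        polarity = classify_sentiment(aspect_opinion, text)
+        results.append({"aspect": aspect, "opinions": opinions, "sentiment_polarity": polarity})
+    return results
+```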
+
+图1 情感分析系统图
+
+ + +## 3. 情感分析效果展示 +在图1中可以看到,本项目的核心模块为评论观点抽取和属性级情感分类模块,本项目中基于情感专属模型 SKEP 实现了两个模块,并且提供了两者训练和测试的脚本,分别放在 `extraction` 和 `classification` 目录下。 + +下表展示了我们训练的评论观点抽取模型在验证集 `dev` 和测试集 `test` 上的表现: +|Model|数据集|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|dev|0.87095|0.90056|0.88551| +|SKEP-Large|test|0.87125|0.89944|0.88512| + +下表展示了我们训练的属性级情感分类模型在验证集 `dev` 和测试集 `test` 上的表现: +|Model|数据集|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|dev|0.98758|0.99251|0.99004| +|SKEP-Large|test|0.98497|0.99139|0.98817| + +给定一段文本,使用我们提供的全流程预测脚本可以轻松获得情感分析结果,如下所示。 + +- input_text: 蛋糕味道不错,很好吃,店家很耐心,服务也很好,很棒 + - aspect: 蛋糕味道, opinions: ['不错', '好吃'], sentiment_polarity: 正向 + - aspect: 店家, opinions: ['耐心'], sentiment_polarity: 正向 + - aspect: 服务, opinions: ['好', '棒'], sentiment_polarity: 正向 + +如果你想了解更多评论观点抽取模型和属性级情感分类模型的实现细节,请分别点击 [extraction](extraction/README.md) 和 [classification](classification/README.md)。 + + +## 4. 情感分析实践 +以下是本项目运行的完整目录结构以及说明: + +``` +. +├── extraction # 评价观点抽取模型包 +├── classification # 细粒度情感分类模型包 +├── pp_minilm # PP-MiniLM特色小模型包 +├── deploy # 高性能预测部署包 +│ ├── predict.py # 高性能预测脚本 +│   ├── run_predict.py # 高性能预测命令 +├── imgs # 图片目录 +├── demo.py # demo脚本,方便体验预测效果 +├── predict.py # 全流程预测脚本 +├── export_model.py # 动转静模型导出脚本 +├── utils.py # 工具函数脚本 +├── run_demo.sh # 运行demo,快速体验情感分析效果 +├── run_predict.sh # 全流程预测命令 +├── run_export_model.sh # 动转静模型导出命令 +└── README.md +``` + +### 4.1 运行环境和依赖安装 +(1) 环境依赖 + +- python >= 3.6 +- paddlenlp >= 2.2.2 +- paddlepaddle-gpu >= 2.2.1 + +(2) 运行环境准备 +在运行之前,请在本目录下新建目录 `data` 和 `checkpoints`,分别用于存放数据和保存模型。 + +本项目需要训练两个阶段的模型:评论观点抽取模型,属性级情感分类模型。本次针对这抽取和分类模型,我们分别开源了 Demo 数据: [ext_data](https://bj.bcebos.com/v1/paddlenlp/data/ext_data.tar.gz)和[cls_data](https://bj.bcebos.com/v1/paddlenlp/data/cls_data.tar.gz)。 + +用户可分别点击下载,解压后将相应的数据文件依次放入 `./data/ext_data` 和 `./data/cls_data` 目录下即可。 + +### 4.2 使用说明 +本项目开源了训练后的评论观点模型 [ext_model](https://bj.bcebos.com/paddlenlp/models/best_ext.pdparams) 和 属性级情感分类模型 [cls_model](https://bj.bcebos.com/paddlenlp/models/best_cls.pdparams)。如有需要,可点击下载,下载后请将 `ext_model` 和 `cls_model` 重命名为 `best.pdparams`,分别放入 `./checkpoints/ext_checkpoints` 和 `./checkpoints/cls_checkpoints` 中。 + +另外,考虑到不同用户可能有不同的需求,本项目提供了如下的方式学习或使用本项目。 + +**(1)快速体验效果** +如果你想快速体验本项目提供的情感分析能力,可使用本项目提供的 `demo.sh` 脚本以交互式的方式进行体验。 +```shell +sh run_demo.sh +``` + +**备注**:体验之前,请确保下载以上提到的 `ext_model` 和 `cls_model`,重命名后放入相应的目录中。 + +**(2) 文本批量预测** +如果你有一批数据,不方便逐句输入,可使用本项目提供的正式预测脚本 `predict.py`, 以文件的形式进行输入,处理后该脚本会将结果文件保存到与输入文件相同的目录下,默认的结果文件名为 `sentiment_results.json`。 + +本功能在预测时需要传入测试集文件路径,可将测试集文件命名为`test.txt`, 然后放入 `./data` 目录下。需要注意的是,测试集文件每行均为一个待预测的语句,如下所示。 + +- 蛋糕味道不错,很好吃,店家很耐心,服务也很好,很棒 +- 酒店干净整洁,性价比很高 +- 酒店环境不错,非常安静,性价比还可以 +- 房间很大,环境不错 + +通过运行如下命令,便可进行批量文本情感分析预测: +```shell +sh run_predict.sh +``` + +**备注**:体验之前,请确保下载以上提到的 `ext_model` 和 `cls_model`,重命名后放入相应的目录中。 + +**(3)高性能预测** +如果你想将本项目部署到线上环境去运行,那么建议你使用本项目基于 Paddle Inference 实现的高性能推理脚本 `deploy/predict.py`。 + +在使用之前,首先需要将保存的动态图模型转为静态图,通过调用下面的命令,便可将评论观点抽取模型和属性级情感分类模型转为静态图模型: +```shell +sh run_export_model.sh extraction +sh run_export_model.sh classification +``` + +这里需要注意的是,要确保相应的动态图已经下载或者训练生成到 `model_path` 指定的目录中,静态图模型会自动生成到`save_path`指定的地址。 + +同上,高性能预测的默认输入和输出形式也为文件,可分别通过 `test_path` 和 `save_path` 进行指定,通过如下命令便可以基于Paddle Inference 进行高性能预测: +```shell +cd deploy +sh run_predict.sh +``` + +**(4)自定义模型训练** +如果你希望自己尝试进行评论观点抽取模型训练,可使用4.1节中提供的 `ext_data` Demo 数据,或自己业务的标注数据重新训练模型,本项目已将评论观点抽取模型的相关训练和测试代码放入 `extraction` 目录下, 
请到该目录下执行模型训练即可,更多的实现细节和使用方式,请参考[这里](extraction/README.md)。 + +如果你希望自己尝试进行属性级情感分类模型训练,可使用4.1节中提供的 `cls_data` Demo 数据,或自己业务的标注数据重新训练模型,本项目已将属性级情感分类模型的相关训练和测试代码放入 `classification` 目录下,请到该目录下执行模型训练即可,更多的实现细节和使用方式,请参考[这里](classification/README.md)。 + +在训练后,如果需要进行高性能预测,可参考(3)进行动转静,然后基于Paddle Inference 进行高性能预测。 + +### 4.3 数据标注说明 +如果你想标注自己的业务数据,并尝试利用标注的新数据重新训练本项目。本项目推荐使用 [doccano](https://github.com/doccano/doccano) 进行数据标注平台,同时本项目也打通了其从标注到训练的通道,即 doccano 导出的数据后可通过 [doccano.py](./doccano.py) 脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。 为达到这个目的,您需要按以下标注规则在 doccano 平台上标注数据: + +
+
+图2 数据标注样例图
+
+ +- 在doccano平台上,定义标签 Pos-Aspect、 Neg-Aspect 和 Opinion,其中 Pos-Aspect 表示 Aspect 的情感极性为正向;Neg-Aspect 表示 Aspect 的情感极性为负向;Opinion 表示相应的观点词。 +- 使用以上定义的标签开始标注数据,图2展示了一个标注样例。 +- 当标注完成后,在 doccano 平台上导出 `jsonl` 形式的文件,并将其重命名为 `doccano.json` 后,放入 `./data` 目录下。 +- 通过 [doccano.py](./doccano.py) 脚本进行数据形式转换,然后便可以开始进行相应模型训练。 + +```shell +python doccano.py \ + --doccano_file ./data/doccano.json \ + --save_ext_dir ./data/ext_data \ + --save_cls_dir ./data/cls_data +``` + +**备注:** +- 默认情况下 [doccano.py](./doccano.py) 脚本会按照比例将数据划分为 train/dev/test 数据集 +- 每次执行 [doccano.py](./doccano.py) 脚本,将会覆盖已有的同名数据文件 + +## 5. 小模型优化策略 +以上实验中,无论是评论观点抽取模型,还是属性级情感分类模型,使用的均是 Large 版的 SKEP 模型,考虑到企业用户在线上部署时会考虑到模型预测效率,本项目提供了一套基于 [PP-MiniLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression/pp-minilm) 中文特色小模型的解决方案。PP-MiniLM 提供了一套完整的小模型优化方案:首先使用 Task-agnostic 的方式进行模型蒸馏、然后依托于 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 进行模型裁剪、模型量化等模型压缩技术,有效减小了模型的规模,加快了模型运行速度。 + +本项目基于 PP-MiniLM 中文特色小模型进行 fine-tune 属性级情感分类模型,然后使用 PaddleSlim 对训练好的模型进行量化操作。 + +在实验进行后,我们将 SKEP-Large、PP-MiniLM、量化PP-MiniLM 三个模型在性能和效果方面进行了对比,如下表所示。可以看到,三者在本任务数据集上的评估指标几乎相等,但是 PP-MiniLM 小模型运行速度较 SKEP-Large 提高了4倍,量化后的 PP-MiniLM 运行速度较 SKEP-Large 提高了近8倍。更多的详细信息请参考[这里](./pp_minilm/README.md)。 + +|Model|运行时间(s)|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|1.00x|0.98497|0.99139|0.98817| +|PP-MiniLM|4.95x|0.98379|0.98859|0.98618| +|量化 PP-MiniLM|8.93x|0.98312|0.98953|0.98631| + +## 6. 引用 + +[1] H. Tian et al., “SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis,” arXiv:2005.05635 [cs], May 2020, Accessed: Nov. 11, 2021. diff --git a/applications/sentiment_analysis/ASO_analysis/classification/README.md b/applications/sentiment_analysis/ASO_analysis/classification/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3d7bbc5c719cb448c2a9c8c78021d7c8012582eb --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/README.md @@ -0,0 +1,64 @@ +# 细粒度情感分类模型 + +## 1. 方案设计 + +本项目将进行属性级别的情感分类,对于给定的一段文本,我们在基于评论观点抽取模型抽取出不同属性对应的观点后,便可以有针对性地对各个属性判别情感极性。具体来讲,本项目将抽取出的评论属性和观点进行拼接,然后和原始语句进行拼接作为一条独立的训练语句。 + +如图1所示,首先将评论属性和观点词进行拼接为"味道好",然后将"味道好"和原文进行拼接,然后传入SKEP模型,并使用 "CLS" 位置的向量进行细粒度情感倾向。 + +
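+
+为便于理解,下面给出构造分类模型输入的一个简化示意:把"评论属性+观点词"与原文以句对形式编码,这与本目录 `data.py` 中 `convert_example_to_feature` 的做法一致。示例中的预训练权重名称 `skep_ernie_1.0_large_ch` 仅为假设,实际请以训练脚本中的配置为准:
+
+```python
+from paddlenlp.transformers import SkepTokenizer
+
+# 预训练权重名称仅为示意,应与训练时实际加载的 SKEP 模型保持一致
+tokenizer = SkepTokenizer.from_pretrained("skep_ernie_1.0_large_ch")
+
+aspect_text = "味道" + "好"  # 评论属性 + 观点词,拼接为"味道好"
+text = "这个薯片味道真的太好了,口感很脆,只是包装很一般。"
+
+# 以句对形式编码,分类模型使用 "CLS" 位置的向量预测该属性的情感极性
+encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=256)
+print(encoded_inputs["input_ids"][:10])
+```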
+ +

图1 细粒度情感分类模型

+

+ +## 2. 项目结构说明 + +以下是本项目运行的完整目录结构及说明: + +```shell +. +├── data.py # 数据处理脚本 +├── model.py # 模型组网脚本 +├── train.py # 模型训练脚本 +├── evaluate.py # 模型评估脚本 +├── run_train.sh # 模型训练命令 +├── run_evaluate.sh # 模型评估命令 +└── README.md +``` + +## 3. 数据说明 + +本实验中,相应的数据集需要包含3列数据:标签、评论观点和原文,下面给出了一些样本示例。 + +- 1 口味清淡 口味很清淡,价格也比较公道 +- 1 经济实惠 经济实惠,环境好,套餐划算 +- 0 设施一般 房间大,设施一般 + +可点击 [cls_data](https://bj.bcebos.com/v1/paddlenlp/data/cls_data.tar.gz) 进行 Demo 数据下载,将数据解压之后放入父目录的 `data/cls_data/` 文件夹下。 + +## 4. 模型效果展示 + +在分类模型训练过程中,总共训练了10轮,并选择了评估 F1 得分最高的 best 模型,下表展示了训练过程中使用的训练参数。我们同时开源了相应的模型,可点击下表的 `cls_model` 进行下载,下载后将模型重命名为 `best.pdparams`,然后放入父目录的 `checkpoints/cls_checkpoints` 中。 +|Model|训练参数配置|MD5| +| ------------ | ------------ |-----------| +|[cls_model](https://bj.bcebos.com/paddlenlp/models/best_cls.pdparams)|
learning_rate: 3e-5, batch_size: 16, max_seq_len:256, epochs:10
|3de6ddf581e665d9b1d035c29b49778a| + +我们基于训练过程中的 best 模型在验证集 `dev` 和测试集 `test` 上进行了评估测试,模型效果如下表所示: +|Model|数据集|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|dev|0.98758|0.99251|0.99004| +|SKEP-Large|test|0.98497|0.99139|0.98817| + +**备注**: 以上数据是基于全量数据训练和测试结果,并非 Demo 数据集。 + +## 5. 模型训练 +通过运行以下命令进行分类模型训练: +```shell +sh run_train.sh +``` + +## 6. 模型测试 +通过运行以下命令进行分类模型测试: +```shell +sh run_evaluate.sh +``` diff --git a/applications/sentiment_analysis/ASO_analysis/classification/data.py b/applications/sentiment_analysis/ASO_analysis/classification/data.py new file mode 100644 index 0000000000000000000000000000000000000000..fcbaf5110cbda5aac5ab4d22f6d46028595cdf5b --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/data.py @@ -0,0 +1,38 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def load_dict(dict_path): + with open(dict_path, "r", encoding="utf-8") as f: + words = [word.strip() for word in f.readlines()] + word2id = dict(zip(words, range(len(words)))) + id2word = dict((v, k) for k, v in word2id.items()) + + return word2id, id2word + + +def convert_example_to_feature(example, tokenizer, label2id, max_seq_len=512, is_test=False): + example = example["text"].rstrip().split("\t") + if not is_test: + label = int(example[0]) + aspect_text = example[1] + text = example[2] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + encoded_inputs["label"] = label + else: + aspect_text = example[0] + text = example[1] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + + return encoded_inputs diff --git a/applications/sentiment_analysis/ASO_analysis/classification/evaluate.py b/applications/sentiment_analysis/ASO_analysis/classification/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..e12bca18901f957718b0a8cabce0a395c7821036 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/evaluate.py @@ -0,0 +1,77 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +from functools import partial + +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from tqdm import tqdm + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.glue import AccuracyAndF1 +from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer + + +def evaluate(model, data_loader, metric): + + model.eval() + metric.reset() + for batch_data in tqdm(data_loader): + input_ids, token_type_ids, labels = batch_data["input_ids"], batch_data["token_type_ids"], batch_data["labels"] + logits = model(input_ids, token_type_ids=token_type_ids) + correct = metric.compute(logits, labels) + metric.update(correct) + + accuracy, precision, recall, f1, _ = metric.accumulate() + + return accuracy, precision, recall, f1 + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + args = parser.parse_args() + # yapf: enbale + + # load dev data + model_name = "skep_ernie_1.0_large_ch" + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"test": args.test_path}) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len) + test_ds = datasets["test"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorWithPadding(tokenizer) + + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + + # load model + loaded_state_dict = paddle.load(args.model_path) + model = SkepForSequenceClassification.from_pretrained(model_name, num_classes=len(label2id)) + model.load_dict(loaded_state_dict) + + metric = AccuracyAndF1() + + # evaluate on dev data + accuracy, precision, recall, f1 = evaluate(model, test_loader, metric) + print(f'evaluation result: accuracy:{accuracy:.5f} precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}') diff --git a/applications/sentiment_analysis/ASO_analysis/classification/run_evaluate.sh b/applications/sentiment_analysis/ASO_analysis/classification/run_evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..e9d6721c719d52c4fabffdd6bbb87673fbee2d7e --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/run_evaluate.sh @@ -0,0 +1,22 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python evaluate.py \ + --model_path "../checkpoints/cls_checkpoints/best.pdparams" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --batch_size 16 \ + --max_seq_len 256 diff --git a/applications/sentiment_analysis/ASO_analysis/classification/run_train.sh b/applications/sentiment_analysis/ASO_analysis/classification/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..56a4a6755a429b1967f65bf3828aab008f855651 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/run_train.sh @@ -0,0 +1,32 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python train.py \ + --train_path "../data/cls_data/train.txt" \ + --dev_path "../data/cls_data/dev.txt" \ + --label_path "../data/cls_data/label.dict" \ + --num_epochs 5 \ + --batch_size 16 \ + --max_seq_len 256 \ + --learning_rate 3e-5 \ + --weight_decay 0.01 \ + --max_grad_norm 1.0 \ + --warmup_proportion 0.1 \ + --log_steps 50 \ + --eval_steps 100 \ + --seed 1000 \ + --device "gpu" \ + --checkpoints "../checkpoints/cls_checkpoints" diff --git a/applications/sentiment_analysis/ASO_analysis/classification/train.py b/applications/sentiment_analysis/ASO_analysis/classification/train.py new file mode 100644 index 0000000000000000000000000000000000000000..7d87d93c9ce9bad2f9b271dbdabc372d6e81bfd1 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/classification/train.py @@ -0,0 +1,149 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +import random +import warnings +from functools import partial + +import numpy as np +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from evaluate import evaluate + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.glue import AccuracyAndF1 +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + SkepForSequenceClassification, + SkepTokenizer, +) + +warnings.filterwarnings("ignore") + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def train(): + # set running envir + model_name = "skep_ernie_1.0_large_ch" + + paddle.set_device(args.device) + set_seed(args.seed) + + if not os.path.exists(args.checkpoints): + os.mkdir(args.checkpoints) + + # load and process data5 + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"train": args.train_path, "dev": args.dev_path}) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len + ) + + train_ds = datasets["train"].map(trans_func, batched=False, remove_columns=["text"]) + dev_ds = datasets["dev"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_loader = paddle.io.DataLoader(train_ds, batch_sampler=train_batch_sampler, collate_fn=data_collator) + dev_loader = paddle.io.DataLoader(dev_ds, batch_sampler=dev_batch_sampler, collate_fn=data_collator) + + # configure model training + model = SkepForSequenceClassification.from_pretrained(model_name, num_classes=len(label2id)) + + num_training_steps = len(train_loader) * args.num_epochs + lr_scheduler = LinearDecayWithWarmup( + learning_rate=args.learning_rate, total_steps=num_training_steps, warmup=args.warmup_proportion + ) + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + + metric = AccuracyAndF1() + + # start to train model + global_step, best_f1 = 1, 0.0 + model.train() + for epoch in range(1, args.num_epochs + 1): + for batch_data in train_loader(): + input_ids, token_type_ids, labels = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["labels"], + ) + loss, logits = model(input_ids, token_type_ids=token_type_ids, labels=labels) + + loss.backward() + lr_scheduler.step() + optimizer.step() + optimizer.clear_grad() + + if global_step > 0 and global_step % args.log_steps == 0: + print(f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.item():.6f}") + if (global_step > 0 and global_step % args.eval_steps == 0) or global_step == num_training_steps: + accuracy, precision, recall, f1 = evaluate(model, dev_loader, metric) + model.train() + if f1 > best_f1: + print(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + paddle.save(model.state_dict(), f"{args.checkpoints}/best.pdparams") + 
print( + f"evaluation result: accuracy:{accuracy:.5f} precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}" + ) + + global_step += 1 + + paddle.save(model.state_dict(), f"{args.checkpoints}/final.pdparams") + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser(__doc__) + parser.add_argument("--num_epochs", type=int, default=3, help="Number of epoches for training.") + parser.add_argument("--train_path", type=str, default=None, help="The path of train set.") + parser.add_argument("--dev_path", type=str, default=None, help="The path of dev set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--learning_rate", type=float, default=5e-5, help="The initial learning rate for optimizer.") + parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") + parser.add_argument("--max_grad_norm", type=float, default=1.0, help="Max grad norm to clip gradient.") + parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Linear warmup proportion over the training process.") + parser.add_argument("--log_steps", type=int, default=50, help="Frequency of printing log.") + parser.add_argument("--eval_steps", type=int, default=500, help="Frequency of performing evaluation.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--checkpoints", type=str, default=None, help="Directory to save checkpoint.") + + args = parser.parse_args() + # yapf: enable + + train() diff --git a/applications/sentiment_analysis/ASO_analysis/demo.py b/applications/sentiment_analysis/ASO_analysis/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..2b7be71295e941582ce534588d025c1b24ff0a40 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/demo.py @@ -0,0 +1,131 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import re + +import paddle +from utils import decoding, load_dict + +from paddlenlp.transformers import ( + SkepForSequenceClassification, + SkepForTokenClassification, + SkepTokenizer, +) + + +def is_aspect_first(text, aspect, opinion_word): + return text.find(aspect) <= text.find(opinion_word) + + +def concate_aspect_and_opinion(text, aspect, opinion_words): + aspect_text = "" + for opinion_word in opinion_words: + if is_aspect_first(text, aspect, opinion_word): + aspect_text += aspect + opinion_word + "," + else: + aspect_text += opinion_word + aspect + "," + aspect_text = aspect_text[:-1] + + return aspect_text + + +def format_print(results): + for result in results: + aspect, opinions, sentiment = result["aspect"], result["opinions"], result["sentiment_polarity"] + print(f"aspect: {aspect}, opinions: {opinions}, sentiment_polarity: {sentiment}") + print() + + +def predict(args, ext_model, cls_model, tokenizer, ext_id2label, cls_id2label): + + ext_model.eval() + cls_model.eval() + + while True: + input_text = input("input text: \n") + input_text = re.sub(" +", "", input_text.strip()) + if not input_text: + continue + if input_text == "quit" or input_text == "exit": + break + + input_text = input_text.strip().replace(" ", "") + # processing input text + encoded_inputs = tokenizer(list(input_text), is_split_into_words=True, max_seq_len=args.ext_max_seq_len) + input_ids = paddle.to_tensor([encoded_inputs["input_ids"]]) + token_type_ids = paddle.to_tensor([encoded_inputs["token_type_ids"]]) + + # extract aspect and opinion words + logits = ext_model(input_ids, token_type_ids=token_type_ids) + predictions = logits.argmax(axis=2).numpy()[0] + tag_seq = [ext_id2label[idx] for idx in predictions][1:-1] + + aps = decoding(input_text[: args.ext_max_seq_len - 2], tag_seq) + + # predict sentiment for aspect with cls_model + results = [] + for ap in aps: + aspect = ap[0] + opinion_words = list(set(ap[1:])) + aspect_text = concate_aspect_and_opinion(input_text, aspect, opinion_words) + + encoded_inputs = tokenizer( + aspect_text, text_pair=input_text, max_seq_len=args.cls_max_seq_len, return_length=True + ) + input_ids = paddle.to_tensor([encoded_inputs["input_ids"]]) + token_type_ids = paddle.to_tensor([encoded_inputs["token_type_ids"]]) + + logits = cls_model(input_ids, token_type_ids=token_type_ids) + prediction = int(logits.argmax(axis=1)) + + result = {"aspect": aspect, "opinions": opinion_words, "sentiment_polarity": cls_id2label[prediction]} + results.append(result) + + format_print(results) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--ext_model_path", type=str, default=None, help="The path of extraction model path that you want to load.") + parser.add_argument("--cls_model_path", type=str, default=None, help="The path of classification model path that you want to load.") + parser.add_argument("--ext_label_path", type=str, default=None, help="The path of extraction label dict.") + parser.add_argument("--cls_label_path", type=str, default=None, help="The path of classification label dict.") + parser.add_argument("--ext_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for extraction model.") + parser.add_argument("--cls_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for classification model.") + args = parser.parse_args() + # yapf: enbale + + # load dict + model_name = "skep_ernie_1.0_large_ch" + 
ext_label2id, ext_id2label = load_dict(args.ext_label_path) + cls_label2id, cls_id2label = load_dict(args.cls_label_path) + tokenizer = SkepTokenizer.from_pretrained(model_name) + print("label dict loaded.") + + # load ext model + ext_state_dict = paddle.load(args.ext_model_path) + ext_model = SkepForTokenClassification.from_pretrained(model_name, num_classes=len(ext_label2id)) + ext_model.load_dict(ext_state_dict) + print("extraction model loaded.") + + # load cls model + cls_state_dict = paddle.load(args.cls_model_path) + cls_model = SkepForSequenceClassification.from_pretrained(model_name, num_classes=len(cls_label2id)) + cls_model.load_dict(cls_state_dict) + print("classification model loaded.") + + # do predict + predict(args, ext_model, cls_model, tokenizer, ext_id2label, cls_id2label) diff --git a/applications/sentiment_analysis/ASO_analysis/deploy/predict.py b/applications/sentiment_analysis/ASO_analysis/deploy/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..284092e71d998dd1be07fa26d0f3c77e78a48006 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/deploy/predict.py @@ -0,0 +1,361 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import json +import os +import re +from collections import defaultdict +from functools import partial + +import paddle +from datasets import Dataset, load_dataset +from paddle import inference +from seqeval.metrics.sequence_labeling import get_entities + +from paddlenlp.data import DataCollatorForTokenClassification, DataCollatorWithPadding +from paddlenlp.transformers import SkepTokenizer + + +def load_dict(dict_path): + with open(dict_path, "r", encoding="utf-8") as f: + words = [word.strip() for word in f.readlines()] + word2id = dict(zip(words, range(len(words)))) + id2word = dict((v, k) for k, v in word2id.items()) + + return word2id, id2word + + +def read_test_file(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip().replace(" ", "") + yield {"text": line} + + +def decoding(text, tag_seq): + assert len(text) == len(tag_seq), f"text len: {len(text)}, tag_seq len: {len(tag_seq)}" + + puncs = list(",.?;!,。?;!") + splits = [idx for idx in range(len(text)) if text[idx] in puncs] + + prev = 0 + sub_texts, sub_tag_seqs = [], [] + for i, split in enumerate(splits): + sub_tag_seqs.append(tag_seq[prev:split]) + sub_texts.append(text[prev:split]) + prev = split + sub_tag_seqs.append(tag_seq[prev:]) + sub_texts.append((text[prev:])) + + ents_list = [] + for sub_text, sub_tag_seq in zip(sub_texts, sub_tag_seqs): + ents = get_entities(sub_tag_seq, suffix=False) + ents_list.append((sub_text, ents)) + + aps = [] + no_a_words = [] + for sub_tag_seq, ent_list in ents_list: + sub_aps = [] + sub_no_a_words = [] + for ent in ent_list: + ent_name, start, end = ent + if ent_name == "Aspect": + aspect = sub_tag_seq[start : end + 1] + sub_aps.append([aspect]) + if 
len(sub_no_a_words) > 0: + sub_aps[-1].extend(sub_no_a_words) + sub_no_a_words.clear() + else: + ent_name == "Opinion" + opinion = sub_tag_seq[start : end + 1] + if len(sub_aps) > 0: + sub_aps[-1].append(opinion) + else: + sub_no_a_words.append(opinion) + + if sub_aps: + aps.extend(sub_aps) + if len(no_a_words) > 0: + aps[-1].extend(no_a_words) + no_a_words.clear() + elif sub_no_a_words: + if len(aps) > 0: + aps[-1].extend(sub_no_a_words) + else: + no_a_words.extend(sub_no_a_words) + + if no_a_words: + no_a_words.insert(0, "None") + aps.append(no_a_words) + + return aps + + +def convert_example_to_feature_ext(example, tokenizer, label2id, max_seq_len=512, is_test=False): + example = example["text"].rstrip().split("\t") + text = list(example[0]) + if not is_test: + label = example[1].split(" ") + assert len(text) == len(label) + new_text = [] + new_label = [] + for text_ch, label_ch in zip(text, label): + if text_ch.strip(): + new_text.append(text_ch) + new_label.append(label_ch) + new_label = ( + [label2id["O"]] + [label2id[label_term] for label_term in new_label][: (max_seq_len - 2)] + [label2id["O"]] + ) + encoded_inputs = tokenizer(new_text, is_split_into_words="token", max_seq_len=max_seq_len, return_length=True) + encoded_inputs["labels"] = new_label + assert len(encoded_inputs["input_ids"]) == len( + new_label + ), f"input_ids: {len(encoded_inputs['input_ids'])}, label: {len(new_label)}" + else: + new_text = [text_ch for text_ch in text if text_ch.strip()] + encoded_inputs = tokenizer(new_text, is_split_into_words="token", max_seq_len=max_seq_len, return_length=True) + + return encoded_inputs + + +def convert_example_to_feature_cls(example, tokenizer, label2id, max_seq_len=512, is_test=False): + example = example["text"].rstrip().split("\t") + if not is_test: + label = int(example[0]) + aspect_text = example[1] + text = example[2] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + encoded_inputs["label"] = label + else: + aspect_text = example[0] + text = example[1] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + + return encoded_inputs + + +def remove_blanks(example): + example["text"] = re.sub(" +", "", example["text"]) + return example + + +class Predictor(object): + def __init__(self, args): + self.args = args + self.ext_predictor, self.ext_input_handles, self.ext_output_hanle = self.create_predictor(args.ext_model_path) + print(f"ext_model_path: {args.ext_model_path}, {self.ext_predictor}") + self.cls_predictor, self.cls_input_handles, self.cls_output_hanle = self.create_predictor(args.cls_model_path) + self.ext_label2id, self.ext_id2label = load_dict(args.ext_label_path) + self.cls_label2id, self.cls_id2label = load_dict(args.cls_label_path) + self.tokenizer = SkepTokenizer.from_pretrained(args.base_model_name) + + def create_predictor(self, model_path): + model_file = model_path + ".pdmodel" + params_file = model_path + ".pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if self.args.device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": 
inference.PrecisionType.Int8, + } + precision_mode = precision_map[args.precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=self.args.batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif self.args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif self.args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + predictor = paddle.inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handle = predictor.get_output_handle(predictor.get_output_names()[0]) + + return predictor, input_handles, output_handle + + def predict_ext(self, args): + datasets = load_dataset("text", data_files={"test": args.test_path}) + datasets["test"] = datasets["test"].map(remove_blanks) + trans_func = partial( + convert_example_to_feature_ext, + tokenizer=self.tokenizer, + label2id=self.ext_label2id, + max_seq_len=args.ext_max_seq_len, + is_test=True, + ) + test_ds = copy.copy(datasets["test"]).map(trans_func, batched=False, remove_columns=["text"]) + data_collator = DataCollatorForTokenClassification(self.tokenizer, label_pad_token_id=self.ext_label2id["O"]) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + + results = [] + for bid, batch_data in enumerate(test_loader): + input_ids, token_type_ids, seq_lens = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["seq_len"], + ) + self.ext_input_handles[0].copy_from_cpu(input_ids.numpy()) + self.ext_input_handles[1].copy_from_cpu(token_type_ids.numpy()) + self.ext_predictor.run() + logits = self.ext_output_hanle.copy_to_cpu() + + predictions = logits.argmax(axis=2) + for eid, (seq_len, prediction) in enumerate(zip(seq_lens, predictions)): + idx = bid * args.batch_size + eid + tag_seq = [self.ext_id2label[idx] for idx in prediction[:seq_len][1:-1]] + text = datasets["test"][idx]["text"] + aps = decoding(text[: args.ext_max_seq_len - 2], tag_seq) + for aid, ap in enumerate(aps): + aspect, opinions = ap[0], list(set(ap[1:])) + aspect_text = self._concate_aspect_and_opinion(text, aspect, opinions) + results.append( + { + "id": str(idx) + "_" + str(aid), + "aspect": aspect, + "opinions": opinions, + "text": text, + "aspect_text": aspect_text, + } + ) + + return results + + def predict_cls(self, args, ext_results): + text_list = [] + for result in ext_results: + example = result["aspect_text"] + "\t" + result["text"] + text_list.append(example) + ext_results = {"text": text_list} + + dataset = Dataset.from_dict(ext_results) + trans_func = partial( + convert_example_to_feature_cls, + tokenizer=self.tokenizer, + label2id=self.cls_label2id, + max_seq_len=args.cls_max_seq_len, + is_test=True, + ) + + test_ds = dataset.map(trans_func, batched=False, remove_columns=["text"]) + data_collator = DataCollatorWithPadding(self.tokenizer, padding=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, 
batch_sampler=test_batch_sampler, collate_fn=data_collator) + + results = [] + for batch_data in test_loader: + input_ids, token_type_ids = batch_data["input_ids"], batch_data["token_type_ids"] + self.cls_input_handles[0].copy_from_cpu(input_ids.numpy()) + self.cls_input_handles[1].copy_from_cpu(token_type_ids.numpy()) + self.cls_predictor.run() + logits = self.cls_output_hanle.copy_to_cpu() + + predictions = logits.argmax(axis=1).tolist() + results.extend(predictions) + + return results + + def post_process(self, args, ext_results, cls_results): + assert len(ext_results) == len(cls_results) + + collect_dict = defaultdict(list) + for ext_result, cls_result in zip(ext_results, cls_results): + ext_result["sentiment_polarity"] = self.cls_id2label[cls_result] + eid, _ = ext_result["id"].split("_") + collect_dict[eid].append(ext_result) + + sentiment_results = [] + for eid in collect_dict.keys(): + sentiment_result = {} + ap_list = [] + for idx, single_ap in enumerate(collect_dict[eid]): + if idx == 0: + sentiment_result["text"] = single_ap["text"] + ap_list.append( + { + "aspect": single_ap["aspect"], + "opinions": single_ap["opinions"], + "sentiment_polarity": single_ap["sentiment_polarity"], + } + ) + sentiment_result["ap_list"] = ap_list + sentiment_results.append(sentiment_result) + + with open(args.save_path, "w", encoding="utf-8") as f: + for sentiment_result in sentiment_results: + f.write(json.dumps(sentiment_result, ensure_ascii=False) + "\n") + print(f"sentiment analysis results has been saved to path: {args.save_path}") + + def predict(self, args): + ext_results = self.predict_ext(args) + cls_results = self.predict_cls(args, ext_results) + self.post_process(args, ext_results, cls_results) + + def _concate_aspect_and_opinion(self, text, aspect, opinion_words): + aspect_text = "" + for opinion_word in opinion_words: + if text.find(aspect) <= text.find(opinion_word): + aspect_text += aspect + opinion_word + "," + else: + aspect_text += opinion_word + aspect + "," + aspect_text = aspect_text[:-1] + + return aspect_text + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--base_model_name", default='skep_ernie_1.0_large_ch', type=str, help="Base model name, SKEP used by default", ) + parser.add_argument("--ext_model_path", type=str, default=None, help="The path of extraction model path that you want to load.") + parser.add_argument("--cls_model_path", type=str, default=None, help="The path of classification model path that you want to load.") + parser.add_argument("--ext_label_path", type=str, default=None, help="The path of extraction label dict.") + parser.add_argument("--cls_label_path", type=str, default=None, help="The path of classification label dict.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set that you want to predict.") + parser.add_argument('--save_path', type=str, required=True, default=None, help="The saving path of predict results.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--ext_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for extraction model.") + parser.add_argument("--cls_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for classification model.") + parser.add_argument("--use_tensorrt", action='store_true', help="Whether to use inference engin TensorRT.") + 
parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') + parser.add_argument("--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference.") + parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') + parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') + args = parser.parse_args() + # yapf: enbale + + predictor = Predictor(args) + predictor.predict(args) diff --git a/applications/sentiment_analysis/ASO_analysis/deploy/run_predict.sh b/applications/sentiment_analysis/ASO_analysis/deploy/run_predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..21feb414e4f008662e28fa5c6a811546813591d1 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/deploy/run_predict.sh @@ -0,0 +1,27 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python predict.py \ + --base_model_name "skep_ernie_1.0_large_ch" \ + --ext_model_path "../checkpoints/ext_checkpoints/static/infer" \ + --cls_model_path "../checkpoints/cls_checkpoints/static/infer" \ + --ext_label_path "../data/ext_data/label.dict" \ + --cls_label_path "../data/cls_data/label.dict" \ + --test_path "../data/test.txt" \ + --save_path "../data/sentiment_results.json" \ + --batch_size 8 \ + --ext_max_seq_len 512 \ + --cls_max_seq_len 256 diff --git a/applications/sentiment_analysis/ASO_analysis/doccano.py b/applications/sentiment_analysis/ASO_analysis/doccano.py new file mode 100644 index 0000000000000000000000000000000000000000..9880fe0a0e118170eeae6ed81fa62fca0e1317b1 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/doccano.py @@ -0,0 +1,157 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os + +import numpy as np +from utils import concate_aspect_and_opinion, decoding, save_dict, save_examples + + +def doccano2SA(doccano_file, save_ext_dir, save_cls_dir, splits=[0.8, 0.9], is_shuffle=True): + """ + @Description: Consvert doccano file to data format which is suitable to input to this Application. + @Param doccano_file: The annotated file exported from doccano labeling platform. + @Param save_ext_dir: The directory of ext data that you wanna save. 
+ @Param save_cls_dir: The directory of cls data that you wanna save. + @Param splits: Whether to split doccano file into train/dev/test, note: Only []/ len(splits)==2 accepted. + @Param is_shuffle: Whether to shuffle data. + """ + if not os.path.exists(doccano_file): + raise ValueError("Please input the correct path of doccano file.") + + if not os.path.exists(save_ext_dir): + os.makedirs(save_ext_dir) + + if not os.path.exists(save_cls_dir): + os.makedirs(save_cls_dir) + + if len(splits) != 0 and len(splits) != 2: + raise ValueError("Only []/ len(splits)==2 accepted for splits.") + + if splits and ( + splits[0] >= splits[1] or splits[0] >= 1.0 or splits[1] >= 1.0 or splits[0] <= 0.0 or splits[1] <= 0 + ): + raise ValueError("Please set correct splits, the element in it should be in (0,1), and splits[1]>splits[0].") + + def label_ext_with_label_term(ext_label, start, end, tag): + + if tag == "Opinion": + b_tag = "B-Opinion" + i_tag = "I-Opinion" + else: + b_tag = "B-Aspect" + i_tag = "I-Aspect" + + ext_label[start] = b_tag + for i in range(start + 1, end): + ext_label[i] = i_tag + + ext_examples, cls_examples = [], [] + with open(doccano_file, "r", encoding="utf-8") as f: + raw_examples = f.readlines() + # start to label for ext and cls data + for line in raw_examples: + items = json.loads(line) + text, label_terms = items["data"], items["label"] + # label ext data with label_terms + ext_label = ["O"] * len(text) + aspect_mapper = {} + for label_term in label_terms: + start, end, tag = label_term + label_ext_with_label_term(ext_label, start, end, tag) + if tag == "Pos-Aspect": + aspect_mapper[text[start:end]] = "1" + elif tag == "Neg-Aspect": + aspect_mapper[text[start:end]] = "0" + ext_examples.append((text, " ".join(ext_label))) + # label cls data + aps = decoding(text, ext_label) + for ap in aps: + aspect, opinions = ap[0], list(set(ap[1:])) + if aspect not in aspect_mapper: + continue + aspect_text = concate_aspect_and_opinion(text, aspect, opinions) + cls_examples.append((aspect_mapper[aspect], aspect_text, text)) + + # index for saving data + ext_idx = np.arange(len(ext_examples)) + cls_idx = np.arange(len(cls_examples)) + + if is_shuffle: + ext_idx = np.random.permutation(ext_idx) + cls_idx = np.random.permutation(cls_idx) + + if len(splits) == 0: + # save ext data + save_ext_path = os.path.join(save_ext_dir, "doccano.txt") + save_examples(ext_examples, save_ext_path, ext_idx) + print(f"\next: save data to {save_ext_path}.") + # save cls data + save_cls_path = os.path.join(save_cls_dir, "doccano.txt") + save_examples(cls_examples, save_cls_path, cls_idx) + print(f"\ncls: save data to {save_cls_path}.") + + else: + # save ext data + eth1, eth2 = int(len(ext_examples) * splits[0]), int(len(ext_examples) * splits[1]) + save_ext_train_path = os.path.join(save_ext_dir, "train.txt") + save_ext_dev_path = os.path.join(save_ext_dir, "dev.txt") + save_ext_test_path = os.path.join(save_ext_dir, "test.txt") + save_examples(ext_examples, save_ext_train_path, ext_idx[:eth1]) + save_examples(ext_examples, save_ext_dev_path, ext_idx[eth1:eth2]) + save_examples(ext_examples, save_ext_test_path, ext_idx[eth2:]) + print(f"\next: save train data to {save_ext_train_path}.") + print(f"ext: save dev data to {save_ext_dev_path}.") + print(f"ext: save test data to {save_ext_test_path}.") + + # save cls data + cth1, cth2 = int(len(cls_examples) * splits[0]), int(len(cls_examples) * splits[1]) + save_cls_train_path = os.path.join(save_cls_dir, "train.txt") + save_cls_dev_path = os.path.join(save_cls_dir, 
"dev.txt") + save_cls_test_path = os.path.join(save_cls_dir, "test.txt") + save_examples(cls_examples, save_cls_train_path, cls_idx[:cth1]) + save_examples(cls_examples, save_cls_dev_path, cls_idx[cth1:cth2]) + save_examples(cls_examples, save_cls_test_path, cls_idx[cth2:]) + print(f"\ncls: save train data to {save_cls_train_path}.") + print(f"cls: save dev data to {save_cls_dev_path}.") + print(f"cls: save test data to {save_cls_test_path}.") + + # save ext dict + ext_dict_path = os.path.join(save_ext_dir, "label.dict") + cls_dict_path = os.path.join(save_cls_dir, "label.dict") + save_dict(ext_dict_path, "ext") + save_dict(cls_dict_path, "cls") + print(f"\next: save dict to {ext_dict_path}.") + print(f"cls: save dict to {cls_dict_path}.") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--doccano_file", + type=str, + default="./data/doccano.json", + help="The doccano file exported from doccano platform.", + ) + parser.add_argument( + "--save_ext_dir", type=str, default="./data/ext_data1", help="The path of ext data that you wanna save." + ) + parser.add_argument( + "--save_cls_dir", type=str, default="./data/cls_data1", help="The path of cls data that you wanna save." + ) + args = parser.parse_args() + + doccano2SA(args.doccano_file, args.save_ext_dir, args.save_cls_dir, is_shuffle=True) diff --git a/applications/sentiment_analysis/ASO_analysis/export_model.py b/applications/sentiment_analysis/ASO_analysis/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..532698d98c999fb8fcb5eeaa2dcc0219fc206ce7 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/export_model.py @@ -0,0 +1,60 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse + +import paddle + +from paddlenlp.transformers import ( + PPMiniLMForSequenceClassification, + SkepForSequenceClassification, + SkepForTokenClassification, +) + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--model_type", type=str, default="extraction", choices=["extraction", "classification", "pp_minilm"], help="The model type that you wanna export.") + parser.add_argument("--base_model_name", type=str, default="skep_ernie_1.0_large_ch", help="The base model of experiment, skep or ppminilm") + parser.add_argument("--model_path", type=str, default=None, help="The path of model that you want to load.") + parser.add_argument("--save_path", type=str, default=None, help="The path of the exported static model.") + args = parser.parse_args() + # yapf: enable + + # load model with saved state_dict + if args.model_type == "extraction": + model = SkepForTokenClassification.from_pretrained(args.base_model_name, num_classes=5) + elif args.model_type == "classification": + model = SkepForSequenceClassification.from_pretrained(args.base_model_name, num_classes=2) + else: + model = PPMiniLMForSequenceClassification.from_pretrained(args.base_model_name, num_classes=2) + + loaded_state_dict = paddle.load(args.model_path) + model.load_dict(loaded_state_dict) + print(f"Loaded parameters from {args.model_path}") + + model.eval() + # convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec( + shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec( + shape=[None, None], dtype="int64") # token_type_ids + ]) + + # save to static model + paddle.jit.save(model, args.save_path) + print(f"static {args.model_type} model has been to {args.save_path}") diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/README.md b/applications/sentiment_analysis/ASO_analysis/extraction/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2a471a6bd62e18d31e4cbef337f5e374f112052a --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/README.md @@ -0,0 +1,65 @@ +# 评论观点抽取模型 + +## 1. 方案设计 + +在本实验中,我们将采用序列标注的方式进行评论观点抽取,需要注意的是,这里会同时抽取评论的属性和观点,为此我们基于 BIO 的序列标注体系进行了标签的拓展:B-Aspect, I-Aspect, B-Opinion, I-Opinion, O,其中前两者用于标注评论属性,后两者用于标注评论观点。 + +如图1所示,首先将文本串传入 SKEP 模型中,利用 SKEP 模型对该文本串进行语义编码后,然后基于每个位置的输出去预测相应的标签。 + +
+ +

图1 评论观点抽取模型

+

+ +## 2. 项目结构说明 + +以下是本项目运行的完整目录结构及说明: + +```shell +. +├── data.py # 数据处理脚本 +├── model.py # 模型组网脚本 +├── train.py # 模型训练脚本 +├── evaluate.py # 模型评估脚本 +├── run_train.sh # 模型训练命令 +├── run_evaluate.sh # 模型评估命令 +└── README.md +``` + +## 3. 数据说明 + +如上所述,本项目将采用序列标注的方式进行抽取评论属性和观点,所以本项目训练集中需要包含两列数据:文本串和相应的序列标签数据,下面给出了一条样本。 + + +- 服务好,环境好,做出来效果也不错 B-Aspect I-Aspect B-Opinion O B-Aspect I-Aspect B-Opinion O O O O B-Aspect I-Aspect O B-Opinion I-Opinion +- 环境很好,交通便利 B-Aspect I-Aspect O B-Opinion O B-Aspect I-Aspect B-Opinion I-Opinion +- 空气清新,景色优美 B-Aspect I-Aspect B-Opinion I-Opinion O B-Aspect I-Aspect O B-Opinion + + +可点击 [ext_data](https://bj.bcebos.com/v1/paddlenlp/data/ext_data.tar.gz) 进行 Demo 数据下载,将数据解压之后放入父目录的 `data/ext_data/` 文件夹下。 + +## 4. 模型效果展示 +在抽取模型训练过程中,总共训练了10轮,并选择了评估F1得分最高的 best 模型,下表展示了训练过程中使用的训练参数。我们同时开源了相应的模型,可点击下表的 `ext_model` 进行下载,下载后将模型重命名为 `best.pdparams`,然后放入父目录的 `checkpoints/ext_checkpoints` 中。 +|Model|训练参数配置|MD5| +| ------------ | ------------ |-----------| +|[ext_model](https://bj.bcebos.com/paddlenlp/models/best_ext.pdparams)|
learning_rate: 5e-5, batch_size: 8, max_seq_len:512, epochs:10
|e3358632165aa0338225e175b57cb304| + +我们基于训练过程中的 best 模型在验证集 `dev` 和测试集 `test` 上进行了评估测试,模型效果如下表所示: +|Model|数据集|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|dev|0.87095|0.90056|0.88551| +|SKEP-Large|test|0.87125|0.89944|0.88512| + +**备注**:以上数据是基于全量数据训练和测试结果,并非 Demo 数据集。 + +## 5. 模型训练 +通过运行以下命令进行评论观点抽取模型训练: +```shell +sh run_train.sh +``` + +## 6. 模型测试 +通过运行以下命令进行评论观点抽取模型测试: +```shell +sh run_evaluate.sh +``` diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/data.py b/applications/sentiment_analysis/ASO_analysis/extraction/data.py new file mode 100644 index 0000000000000000000000000000000000000000..b0700bc72a108b843765ccc95a741773ce86e9e5 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/data.py @@ -0,0 +1,49 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def load_dict(dict_path): + with open(dict_path, "r", encoding="utf-8") as f: + words = [word.strip() for word in f.readlines()] + word2id = dict(zip(words, range(len(words)))) + id2word = dict((v, k) for k, v in word2id.items()) + + return word2id, id2word + + +def convert_example_to_feature(example, tokenizer, label2id, max_seq_len=512, is_test=False): + example = example["text"].rstrip().split("\t") + text = list(example[0]) + if not is_test: + label = example[1].split(" ") + assert len(text) == len(label) + new_text = [] + new_label = [] + for text_ch, label_ch in zip(text, label): + if text_ch.strip(): + new_text.append(text_ch) + new_label.append(label_ch) + new_label = ( + [label2id["O"]] + [label2id[label_term] for label_term in new_label][: (max_seq_len - 2)] + [label2id["O"]] + ) + encoded_inputs = tokenizer(new_text, is_split_into_words="token", max_seq_len=max_seq_len, return_length=True) + encoded_inputs["labels"] = new_label + assert len(encoded_inputs["input_ids"]) == len( + new_label + ), f"input_ids: {len(encoded_inputs['input_ids'])}, label: {len(new_label)}" + else: + new_text = [text_ch for text_ch in text if text_ch.strip()] + encoded_inputs = tokenizer(new_text, is_split_into_words="token", max_seq_len=max_seq_len, return_length=True) + + return encoded_inputs diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/evaluate.py b/applications/sentiment_analysis/ASO_analysis/extraction/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..0a0e5d62e14469b6280fb7c3159ac55c75ac5922 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/evaluate.py @@ -0,0 +1,84 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from tqdm import tqdm + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import SkepForTokenClassification, SkepTokenizer + + +def evaluate(model, data_loader, metric): + + model.eval() + metric.reset() + for batch_data in tqdm(data_loader): + input_ids, token_type_ids, seq_lens, labels = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["seq_len"], + batch_data["labels"], + ) + logits = model(input_ids, token_type_ids=token_type_ids) + + # count metric + predictions = logits.argmax(axis=2) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(seq_lens, predictions, labels) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + + precision, recall, f1 = metric.accumulate() + + return precision, recall, f1 + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + args = parser.parse_args() + # yapf: enbale + + # load dev data + model_name = "skep_ernie_1.0_large_ch" + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"test": args.test_path}) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len) + test_ds = datasets["test"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=label2id["O"]) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + + # load model + loaded_state_dict = paddle.load(args.model_path) + model = SkepForTokenClassification.from_pretrained(model_name, num_classes=len(label2id)) + model.load_dict(loaded_state_dict) + + metric = ChunkEvaluator(label2id.keys()) + + # evaluate on dev data + precision, recall, f1 = evaluate(model, test_loader, metric) + print(f'evaluation result: precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}') diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/run_evaluate.sh b/applications/sentiment_analysis/ASO_analysis/extraction/run_evaluate.sh new file mode 100644 index 
0000000000000000000000000000000000000000..b57782b3b069a86ef96fcb41c3828cb81cd903f9 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/run_evaluate.sh @@ -0,0 +1,23 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python evaluate.py \ + --model_path "../checkpoints/ext_checkpoints/best.pdparams" \ + --test_path "../data/ext_data/test.txt" \ + --label_path "../data/ext_data/label.dict" \ + --batch_size 16 \ + --max_seq_len 256 + diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/run_train.sh b/applications/sentiment_analysis/ASO_analysis/extraction/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..0680f1df08afd5b01af4c4e527bb90966686d745 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/run_train.sh @@ -0,0 +1,32 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python train.py \ + --train_path "../data/ext_data/train.txt" \ + --dev_path "../data/ext_data/dev.txt" \ + --label_path "../data/ext_data/label.dict" \ + --num_epochs 10 \ + --batch_size 16 \ + --max_seq_len 256 \ + --learning_rate 5e-5 \ + --weight_decay 0.01 \ + --max_grad_norm 1.0 \ + --warmup_proportion 0.1 \ + --log_steps 50 \ + --eval_steps 250 \ + --seed 1000 \ + --device "gpu" \ + --checkpoints "../checkpoints/ext_checkpoints/" diff --git a/applications/sentiment_analysis/ASO_analysis/extraction/train.py b/applications/sentiment_analysis/ASO_analysis/extraction/train.py new file mode 100644 index 0000000000000000000000000000000000000000..6e0a278237271068d1dd73f4631968f514a08950 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/extraction/train.py @@ -0,0 +1,147 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import argparse +import os +import random +import warnings +from functools import partial + +import numpy as np +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from evaluate import evaluate + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + SkepForTokenClassification, + SkepTokenizer, +) + +warnings.filterwarnings("ignore") + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def train(): + # set running envir + model_name = "skep_ernie_1.0_large_ch" + + paddle.set_device(args.device) + set_seed(args.seed) + + if not os.path.exists(args.checkpoints): + os.mkdir(args.checkpoints) + + # load and process data + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"dev": args.dev_path, "train": args.train_path}) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len + ) + train_ds = datasets["train"].map(trans_func, batched=False, remove_columns=["text"]) + dev_ds = datasets["dev"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=label2id["O"]) + + train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_loader = paddle.io.DataLoader(train_ds, batch_sampler=train_batch_sampler, collate_fn=data_collator) + dev_loader = paddle.io.DataLoader(dev_ds, batch_sampler=dev_batch_sampler, collate_fn=data_collator) + + # configure model training + model = SkepForTokenClassification.from_pretrained(model_name, num_classes=len(label2id)) + + num_training_steps = len(train_loader) * args.num_epochs + lr_scheduler = LinearDecayWithWarmup( + learning_rate=args.learning_rate, total_steps=num_training_steps, warmup=args.warmup_proportion + ) + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + + metric = ChunkEvaluator(label2id.keys()) + + # start to train model + global_step, best_f1 = 1, 0.0 + model.train() + for epoch in range(1, args.num_epochs + 1): + for batch_data in train_loader(): + input_ids, token_type_ids, labels = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["labels"], + ) + loss, logits = model(input_ids, token_type_ids=token_type_ids, labels=labels) + + loss.backward() + lr_scheduler.step() + optimizer.step() + optimizer.clear_grad() + + if global_step > 0 and global_step % args.log_steps == 0: + print(f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.item():.6f}") + if (global_step > 0 and global_step % args.eval_steps == 0) or global_step == num_training_steps: + precision, recall, f1 = evaluate(model, dev_loader, metric) + model.train() + if f1 > best_f1: + print(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + paddle.save(model.state_dict(), 
f"{args.checkpoints}/best.pdparams") + print(f"evaluation result: precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}") + + global_step += 1 + + paddle.save(model.state_dict(), f"{args.checkpoints}/final.pdparams") + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser(__doc__) + parser.add_argument("--num_epochs", type=int, default=3, help="Number of epoches for training.") + parser.add_argument("--train_path", type=str, default=None, help="The path of train set.") + parser.add_argument("--dev_path", type=str, default=None, help="The path of dev set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", type=float, default=5e-5, help="The initial learning rate for optimizer.") + parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") + parser.add_argument("--max_grad_norm", type=float, default=1.0, help="Max grad norm to clip gradient.") + parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Linear warmup proportion over the training process.") + parser.add_argument("--log_steps", type=int, default=50, help="Frequency of printing log.") + parser.add_argument("--eval_steps", type=int, default=500, help="Frequency of performing evaluation.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--checkpoints", type=str, default=None, help="Directory to save checkpoint.") + + args = parser.parse_args() + # yapf: enable + + train() diff --git a/applications/sentiment_analysis/ASO_analysis/imgs/design_cls_model.png b/applications/sentiment_analysis/ASO_analysis/imgs/design_cls_model.png new file mode 100644 index 0000000000000000000000000000000000000000..2aac511d38bc4aa6fcf82e05d3aa4c0725859744 Binary files /dev/null and b/applications/sentiment_analysis/ASO_analysis/imgs/design_cls_model.png differ diff --git a/applications/sentiment_analysis/ASO_analysis/imgs/design_ext_model.png b/applications/sentiment_analysis/ASO_analysis/imgs/design_ext_model.png new file mode 100644 index 0000000000000000000000000000000000000000..14e4f72fb775658899b971724414c6f976731986 Binary files /dev/null and b/applications/sentiment_analysis/ASO_analysis/imgs/design_ext_model.png differ diff --git a/applications/sentiment_analysis/ASO_analysis/imgs/labeling_example.png b/applications/sentiment_analysis/ASO_analysis/imgs/labeling_example.png new file mode 100644 index 0000000000000000000000000000000000000000..21a0a2e9c1f36c695da75f576d8d680fa7fd0c88 Binary files /dev/null and b/applications/sentiment_analysis/ASO_analysis/imgs/labeling_example.png differ diff --git a/applications/sentiment_analysis/ASO_analysis/imgs/sentiment_system.png b/applications/sentiment_analysis/ASO_analysis/imgs/sentiment_system.png new file mode 100644 index 0000000000000000000000000000000000000000..bff450c7c60f2e84d84db1ac2084be1b8f918ce3 Binary files /dev/null and b/applications/sentiment_analysis/ASO_analysis/imgs/sentiment_system.png differ diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/README.md 
b/applications/sentiment_analysis/ASO_analysis/pp_minilm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f84515db981c8bbb40d6a74e0452d93e95fab9dc --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/README.md @@ -0,0 +1,136 @@ +# 基于 PP-MiniLM 的小模型优化策略 + +本项目中,无论是评论观点抽取模型,还是属性级情感分类模型,使用的均是 Large 版的 SKEP 模型,考虑到企业用户在线上部署时会考虑到模型预测效率,所以本项目提供了开源小模型 [PP-MiniLM](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression/pp-minilm) 及量化加速方案,大幅提升预测性能。 + +在本项目中,我们基于 PP-MiniLM 中文特色小模型进行 fine-tune 属性级情感分类模型,然后使用 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 进行模型量化,减小模型规模,加快模型预测性能。 + +## 1. 基于 PP-MiniLM 训练属性级情感分类模型 + +本实验的方案设计和基于 SKEP 的细粒度情感分类一样,有需要的同学请移步[这里](./../classification/README.md),这里不再赘述。 + +### 1.1 项目结构说明 +以下是本项目运行的完整目录结构及说明: + +```shell +. +├── data.py # 数据处理脚本 +├── model.py # 模型组网脚本 +├── train.py # 模型训练脚本 +├── evaluate.py # 模型评估脚本 +├── quant_post.py # 模型量化脚本 +├── performance_test.py # 静态图预测脚本 +├── run_train.sh # 模型训练命令 +├── run_evaluate.sh # 模型评估命令 +├── run_quant.sh # 模型量化命令 +├── run_performance_test.sh # 静态图预测命令 +└── README.md +``` + +### 1.2 数据说明 + +本实验数据和基于SKEP的细粒度情感分类实验所用数据是同一份,如果已将数据下载,并放入父目录的`data/cls_data/`目录下,则无需重复下载操作。更多信息请参考[这里](../classification/README.md)。 + +### 1.3 模型效果展示 + +在分类模型训练过程中,总共训练了10轮,并选择了评估 F1 得分最高的 best 模型, 下表展示了训练过程中使用的训练参数。我们同时开源了相应的模型,可点击下表的 `PP-MiniLM_cls` 进行下载,下载后将模型重命名为 `best.pdparams`,然后放入父目录的 `checkpoints/pp_checkpoints` 中。 +|Model|训练参数配置|MD5| +| ------------ | ------------ |-----------| +|[PP-MiniLM_cls](https://bj.bcebos.com/paddlenlp/models/best_mini.pdparams)|
learning_rate: 3e-5, batch_size: 16, max_seq_len: 256, epochs: 10
|643d358620e84879921b42d326f97aae| + +我们基于训练过程中的 best 模型在 `cls_data` 验证集 `dev` 和测试集 `test` 上进行了评估测试,模型效果如下表所示: +|Model|数据集|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|PP-MiniLM|dev_set|0.98668|0.99115|0.98891| +|PP-MiniLM|test_set|0.98263|0.98766|0.98514| + +**备注**:以上数据是基于全量数据训练和测试结果,并非 Demo 数据集。 + +### 1.4 模型训练 +通过运行以下命令进行分类小模型训练,模型训练后会默认保存到父目录的`checkpoints/pp_checkpoints/`文件夹下: +```shell +sh run_train.sh +``` + +### 1.5 模型测试 +通过运行以下命令进行分类小模型测试: +```shell +sh run_evaluate.sh +``` + +## 2. 对 PP-MiniLM 小模型进行量化 +本节将基于 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) ,对训练好的 PP-MiniLM 小模型进行量化。具体来讲,本节采用的是静态离线量化方法,即在训练好的模型基础上,使用少量校准数据计算量化因子,便可快速得到量化模型。量化过程中,默认使用 `avg` 的量化策略,对 `matmul/matmul_v2` 算子进行 `channel_wise_abs_max` 类型的量化。 + +首先,需要先将训练好的动态图模型,转为静态图模型,注意这里需要跳到父目录进行操作: +```shell +cd .. +sh run_export_model.sh pp_minilm +``` + +然后,使用如下命令进行量化生成的静态图模型: +```shell +sh run_quant.sh +``` +执行以上命令时,需要使用 `static_model_dir` 指定待量化的模型目录,量化后,模型将会被保存在 `quant_model_dir` 指定的目录中。 + +最后,对量化后的小模型可使用 `performance_test.py` 进行评估, 该脚本主要用于性能测试,如果需要做评估,需要设置 `--eval`,如下所示: +```shell +python performance_test.py \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "../checkpoints/pp_checkpoints/quant/infer" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --batch_size 16 \ + --max_seq_len 256 \ + --eval +``` + +## 3. 对量化后的小模型进行性能测试 + +### 3.1 环境要求 +本节需要使用安装有 Paddle Inference 预测库的 [PaddlePaddle 2.2.1](https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html) 进行预测,请根据合适的机器环境进行下载安装。若想要得到明显的加速效果,推荐在 NVIDA Tensor Core GPU(如 T4、A10、A100) 上进行测试,若在 V 系列 GPU 卡上测试,由于其不支持 Int8 Tensor Core,将达不到预期的加速效果。 + +**备注**:本项目基于T4进行性能测试。 + +### 3.2 运行方式 +本项目使用了动态 shape 功能 (tuned_dynamic_shape),因此需要设置获取 shape 的范围。Paddle Inference 提供了相应的接口,即首先通过离线输入数据来统计出所有临时 tensor 的 shape 范围,TensorRT 子图的 tensor 输入 shape 范围可直接根据上一步 tune 出来的结果来设置,即可完成自动 shape 范围设置。统计完成后,只需设置统计结果路径,即可启用 tuned_dynamic_shape 功能。 + +在本案例中,进行性能测试的脚本为 `performance_test.py`,需要先设置 `--collect_shape` 参数,然后再取消传入这个参数,再次运行 `performance_test.py`。可通过设置 `--num_epochs` 计算多轮运行时间,然后取平均时间作为最终结果,具体使用方式如下: + +首先,设置 `--collect_shape` 参数,生成 shape range info 文件: +```shell +python performance_test.py \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "../checkpoints/pp_checkpoints/quant/infer" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --num_epochs 1 \ + --batch_size 16 \ + --max_seq_len 256 \ + --use_tensorrt \ + --int8 \ + --collect_shape +``` +然后,开始进行性能测试: +```shell +python performance_test.py \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "../checkpoints/pp_checkpoints/quant/infer" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --num_epochs 10 \ + --batch_size 16 \ + --max_seq_len 256 \ + --use_tensorrt \ + --int8 \ +``` + + +## 4. 
PP-MiniLM 模型效果展示 + +关于 SKEP-Large、PP-MiniLM、量化PP-MiniLM 三个模型在性能和效果方面的对比如下表所示。可以看到,三者在本任务数据集上的评估指标几乎相等,但是 PP-MiniLM 小模型运行速度较 SKEP-Large 提高了4倍,量化后的 PP-MiniLM 运行速度较 SKEP-Large 提高了近8倍。 + +|Model|运行时间(s)|precision|Recall|F1| +| ------------ | ------------ | ------------ |-----------|------------ | +|SKEP-Large|1.00x|0.98497|0.99139|0.98817| +|PP-MiniLM|4.95x|0.98263|0.98766|0.98514| +|量化 PP-MiniLM|8.93x|0.97696|0.98720|0.98205| diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/data.py b/applications/sentiment_analysis/ASO_analysis/pp_minilm/data.py new file mode 100644 index 0000000000000000000000000000000000000000..fcbaf5110cbda5aac5ab4d22f6d46028595cdf5b --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/data.py @@ -0,0 +1,38 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def load_dict(dict_path): + with open(dict_path, "r", encoding="utf-8") as f: + words = [word.strip() for word in f.readlines()] + word2id = dict(zip(words, range(len(words)))) + id2word = dict((v, k) for k, v in word2id.items()) + + return word2id, id2word + + +def convert_example_to_feature(example, tokenizer, label2id, max_seq_len=512, is_test=False): + example = example["text"].rstrip().split("\t") + if not is_test: + label = int(example[0]) + aspect_text = example[1] + text = example[2] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + encoded_inputs["label"] = label + else: + aspect_text = example[0] + text = example[1] + encoded_inputs = tokenizer(aspect_text, text_pair=text, max_seq_len=max_seq_len, return_length=True) + + return encoded_inputs diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/evaluate.py b/applications/sentiment_analysis/ASO_analysis/pp_minilm/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..cefe2bab0cd44ee745fe0959e3c8693765577845 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/evaluate.py @@ -0,0 +1,79 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
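+
+# PP-MiniLM 属性级情感分类模型评估脚本:加载微调后的 PPMiniLMForSequenceClassification 权重,
+# 使用 AccuracyAndF1 在测试集上计算 accuracy / precision / recall / F1。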
+ +import argparse +from functools import partial + +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.glue import AccuracyAndF1 +from paddlenlp.transformers import PPMiniLMForSequenceClassification, PPMiniLMTokenizer + + +def evaluate(model, data_loader, metric): + + model.eval() + metric.reset() + for batch_data in data_loader: + input_ids, token_type_ids, labels = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["labels"], + ) + logits = model(input_ids, token_type_ids=token_type_ids) + correct = metric.compute(logits, labels) + metric.update(correct) + + accuracy, precision, recall, f1, _ = metric.accumulate() + + return accuracy, precision, recall, f1 + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--base_model_name", type=str, default=None, help="The name of base model.") + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + args = parser.parse_args() + # yapf: enbale + + # load dev data + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"test": args.test_path}) + + tokenizer = PPMiniLMTokenizer.from_pretrained(args.base_model_name) + trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len) + test_ds = datasets["test"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + + # load model + loaded_state_dict = paddle.load(args.model_path) + model = PPMiniLMForSequenceClassification.from_pretrained(args.base_model_name, num_classes=len(label2id)) + model.load_dict(loaded_state_dict) + + metric = AccuracyAndF1() + + # evaluate on dev data + accuracy, precision, recall, f1 = evaluate(model, test_loader, metric) + print(f'evaluation result: accuracy:{accuracy:.5f} precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}') diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/performance_test.py b/applications/sentiment_analysis/ASO_analysis/pp_minilm/performance_test.py new file mode 100644 index 0000000000000000000000000000000000000000..89e4e884aa81d13634783ae4eb61457c2ea09825 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/performance_test.py @@ -0,0 +1,173 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from paddle import inference + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics import AccuracyAndF1 +from paddlenlp.transformers import PPMiniLMTokenizer + + +class Predictor(object): + def __init__(self, args): + self.predictor, self.input_handles, self.output_handles = self.create_predictor(args) + + def create_predictor(self, args): + config = paddle.inference.Config(args.model_path + ".pdmodel", args.model_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + paddle.set_device("gpu") + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + paddle.set_device("cpu") + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + if args.use_tensorrt: + if args.int8: + config.enable_tensorrt_engine( + workspace_size=1 << 30, + precision_mode=inference.PrecisionType.Int8, + max_batch_size=args.batch_size, + min_subgraph_size=5, + use_static=False, + use_calib_mode=False, + ) + else: + config.enable_tensorrt_engine( + workspace_size=1 << 30, + precision_mode=inference.PrecisionType.Float32, + max_batch_size=args.batch_size, + min_subgraph_size=5, + use_static=False, + use_calib_mode=False, + ) + print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled())) + if args.collect_shape: + config.collect_shape_range_info( + os.path.join(os.path.dirname(args.model_path), "collect_shape_range_info.pbtxt") + ) + else: + config.enable_tuned_tensorrt_dynamic_shape( + os.path.join(os.path.dirname(args.model_path), "collect_shape_range_info.pbtxt"), True + ) + + predictor = paddle.inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + return predictor, input_handles, output_handles + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + + return output + + def predict(self, data_loader, metric): + + outputs = [] + metric.reset() + for i, data in enumerate(data_loader): + output = self.predict_batch([data[0], data[1]]) + logits = paddle.to_tensor(output).squeeze(0) + correct = metric.compute(logits, paddle.to_tensor(data[3])) + metric.update(correct) + outputs.append(output) + + accuracy, precision, recall, F1, _ = metric.accumulate() + return outputs, accuracy, precision, recall, F1 + + def predict_perf(self, args, data_loader): + start_time = time.time() + for i, data in enumerate(data_loader): + if i < args.perf_warmup_steps: # skip warmup steps. 
+ continue + output = self.predict_batch([data["input_ids"], data["token_type_ids"]]) + paddle.to_tensor(output) + + used_time = time.time() - start_time + return used_time + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--base_model_name", type=str, default=None, help="The name of base model.") + parser.add_argument("--model_path", default='./checkpoints/quant/infer', type=str, required=True, help="The path prefix of inference model to be used.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--num_epochs", type=int, default=0, help="Number of epoches for training.") + parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", default=256, type=int, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--perf_warmup_steps", default=1, type=int, help="Warmup steps for performance test.") + parser.add_argument("--use_tensorrt", action='store_true', help="Whether to use inference engin TensorRT.") + parser.add_argument("--eval", action='store_true', help="Whether to test performance.") + parser.add_argument("--collect_shape", action='store_true', help="Whether collect shape range info.") + parser.add_argument("--int8", action='store_true', help="Whether to use int8 inference.") + parser.add_argument("--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference.") + + args = parser.parse_args() + # yapf: enable + + # set running environment + paddle.seed(42) + + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"test": args.test_path}) + + tokenizer = PPMiniLMTokenizer.from_pretrained(args.base_model_name) + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len, is_test=False + ) + test_ds = datasets["test"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + data_loader = paddle.io.DataLoader( + dataset=test_ds, batch_sampler=batch_sampler, collate_fn=data_collator, num_workers=0, return_list=True + ) + + predictor = Predictor(args) + + if args.num_epochs > 0: + print("start to do performance task.") + times = [] + for epoch_id in range(1, args.num_epochs + 1): + used_time = predictor.predict_perf(args, data_loader) + times.append(used_time) + print(f"epoch {epoch_id}, used_time: {used_time}") + print(f"the avg time of {args.num_epochs} epochs is {np.mean(times)}") + + if args.eval: + print("start to do evaluate task.") + metric = AccuracyAndF1() + outputs, accuracy, precision, recall, F1 = predictor.predict(data_loader, metric) + print( + f"evalute results - accuracy: {accuracy: .5f}, precision: {precision: .5f}, recall: {recall: .5f}, F1: {F1: .5f}" + ) diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/quant_post.py b/applications/sentiment_analysis/ASO_analysis/pp_minilm/quant_post.py new file mode 100644 index 0000000000000000000000000000000000000000..57d7acea768074642cc47b8f42bbc9cf8538f5e5 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/quant_post.py @@ -0,0 +1,92 @@ +# Copyright (c) 2021 
PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from data import convert_example_to_feature, load_dict +from datasets import load_dataset + +from paddlenlp.data import Pad +from paddlenlp.transformers import PPMiniLMTokenizer + +import paddleslim # isort: skip paddleslim needs to be imported last for some overrides to kick in + + +def quant_post(args): + place = paddle.set_device("gpu") + exe = paddle.static.Executor(place) + + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"dev": args.dev_path}) + + tokenizer = PPMiniLMTokenizer.from_pretrained(args.base_model_name) + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len + ) + dev_ds = datasets["dev"].map(trans_func, batched=False, remove_columns=["text"]) + + def batch_generator_func(): + batch_data = [[], []] + for data in dev_ds: + batch_data[0].append(data["input_ids"]) + batch_data[1].append(data["token_type_ids"]) + if len(batch_data[0]) == args.batch_size: + input_ids = Pad(axis=0, pad_val=0, dtype="int64")(batch_data[0]) + segment_ids = Pad(axis=0, pad_val=0, dtype="int64")(batch_data[1]) + yield [input_ids, segment_ids] + batch_data = [[], []] + + paddleslim.quant.quant_post_static( + exe, + args.static_model_dir, + args.quant_model_dir, + save_model_filename=args.save_model_filename, + save_params_filename=args.save_params_filename, + algo=args.algorithm, + hist_percent=0.9999, + batch_generator=batch_generator_func, + model_filename=args.input_model_filename, + params_filename=args.input_param_filename, + quantizable_op_type=["matmul", "matmul_v2"], + weight_bits=8, + weight_quantize_type="channel_wise_abs_max", + batch_nums=1, + ) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--base_model_name", type=str, default="ppminilm-6l-768h", help="The path of ppminilm model.") + parser.add_argument("--static_model_dir", type=str, default="./checkpoints/static", help="Directory of static model that will be quantized.") + parser.add_argument("--quant_model_dir", type=str, default=None, help="Directory of the quantized model that will be written.") + parser.add_argument("--algorithm", type=str, default="avg", help="Quantize algorithm that you want to choice, such as abs_max, avg, mse, hist.") + parser.add_argument('--dev_path', type=str, default=None, help="The path of dev set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--save_model_filename", type=str, default="infer.pdmodel", required=False, help="File name of quantified model.") + 
parser.add_argument("--save_params_filename", type=str, default="infer.pdiparams", required=False, help="File name of quantified model's parameters.") + parser.add_argument("--input_model_filename", type=str, default="infer.pdmodel", required=False, help="File name of float model.") + parser.add_argument("--input_param_filename", type=str, default="infer.pdiparams", required=False, help="File name of float model's parameters.") + + args = parser.parse_args() + # yapf: enable + + # start quantize model + paddle.enable_static() + quant_post(args) + print(f"quantize model done. the quantized model has been saved to {args.quant_model_dir}") diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_evaluate.sh b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..5bf7ad5f777f171eeea392a1684e2cb524c7dc08 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_evaluate.sh @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python evaluate.py \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "../checkpoints/pp_checkpoints/best.pdparams" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --batch_size 16 \ + --max_seq_len 256 + diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_performance_test.sh b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_performance_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..4d720b86329283bedf18b659a64cb062c1459916 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_performance_test.sh @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +export CUDA_VISIBLE_DEVICES=0 + +python performance_test.py \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "../checkpoints/pp_checkpoints/quant/infer" \ + --test_path "../data/cls_data/test.txt" \ + --label_path "../data/cls_data/label.dict" \ + --num_epochs 10 \ + --batch_size 16 \ + --max_seq_len 256 \ No newline at end of file diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_quant.sh b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_quant.sh new file mode 100644 index 0000000000000000000000000000000000000000..e11dd1b1d6d2ecd0a05552146c9fd866ec97dbb6 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_quant.sh @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python quant_post.py \ + --base_model_name "ppminilm-6l-768h" \ + --static_model_dir "../checkpoints/pp_checkpoints/static" \ + --quant_model_dir "../checkpoints/pp_checkpoints/quant" \ + --algorithm "avg" \ + --dev_path "../data/cls_data/dev.txt" \ + --label_path "../data/cls_data/label.dict" \ + --batch_size 4 \ + --max_seq_len 256 \ + --save_model_filename "infer.pdmodel" \ + --save_params_filename "infer.pdiparams" \ + --input_model_filename "infer.pdmodel" \ + --input_param_filename "infer.pdiparams" + diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_train.sh b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..44a1851a01e4803e01fa877bb4dbe2b6f20893a2 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/run_train.sh @@ -0,0 +1,33 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
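+
+# 基于 ppminilm-6l-768h 微调属性级情感分类模型,checkpoint 默认保存至 ../checkpoints/pp_checkpoints/。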
+ +export CUDA_VISIBLE_DEVICES=0 + +python train.py \ + --base_model_name "ppminilm-6l-768h" \ + --train_path "../data/cls_data/train.txt" \ + --dev_path "../data/cls_data/dev.txt" \ + --label_path "../data/cls_data/label.dict" \ + --num_epochs 5 \ + --batch_size 16 \ + --max_seq_len 256 \ + --learning_rate 3e-5 \ + --weight_decay 0.01 \ + --max_grad_norm 1.0 \ + --warmup_proportion 0.1 \ + --log_steps 50 \ + --eval_steps 100 \ + --seed 1000 \ + --device "gpu" \ + --checkpoints "../checkpoints/pp_checkpoints/" diff --git a/applications/sentiment_analysis/ASO_analysis/pp_minilm/train.py b/applications/sentiment_analysis/ASO_analysis/pp_minilm/train.py new file mode 100644 index 0000000000000000000000000000000000000000..0fd724a5a2b2a0ff7bdef8f0232cce0d7cdcb362 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/pp_minilm/train.py @@ -0,0 +1,149 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import warnings +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from data import convert_example_to_feature, load_dict +from datasets import load_dataset +from evaluate import evaluate + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.glue import AccuracyAndF1 +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + PPMiniLMForSequenceClassification, + PPMiniLMTokenizer, +) + +warnings.filterwarnings("ignore") + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def train(): + # set running envir + paddle.set_device(args.device) + set_seed(args.seed) + + if not os.path.exists(args.checkpoints): + os.mkdir(args.checkpoints) + + # load and process data + label2id, id2label = load_dict(args.label_path) + datasets = load_dataset("text", data_files={"train": args.train_path, "dev": args.dev_path}) + + tokenizer = PPMiniLMTokenizer.from_pretrained(args.base_model_name) + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, label2id=label2id, max_seq_len=args.max_seq_len + ) + train_ds = datasets["train"].map(trans_func, batched=False, remove_columns=["text"]) + dev_ds = datasets["dev"].map(trans_func, batched=False, remove_columns=["text"]) + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_loader = paddle.io.DataLoader(train_ds, batch_sampler=train_batch_sampler, collate_fn=data_collator) + dev_loader = paddle.io.DataLoader(dev_ds, batch_sampler=dev_batch_sampler, collate_fn=data_collator) + + # configure model training + model = PPMiniLMForSequenceClassification.from_pretrained(args.base_model_name, num_classes=len(label2id)) + + num_training_steps = len(train_loader) * args.num_epochs + lr_scheduler = 
LinearDecayWithWarmup( + learning_rate=args.learning_rate, total_steps=num_training_steps, warmup=args.warmup_proportion + ) + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + + metric = AccuracyAndF1() + + # start to train model + global_step, best_f1 = 1, 0.0 + model.train() + for epoch in range(1, args.num_epochs + 1): + for batch_data in train_loader(): + input_ids, token_type_ids, labels = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["labels"], + ) + logits = model(input_ids, token_type_ids=token_type_ids) + loss = F.cross_entropy(logits, labels) + + loss.backward() + lr_scheduler.step() + optimizer.step() + optimizer.clear_grad() + + if global_step > 0 and global_step % args.log_steps == 0: + print(f"epoch: {epoch} - global_step: {global_step}/{num_training_steps} - loss:{loss.item():.6f}") + if (global_step > 0 and global_step % args.eval_steps == 0) or global_step == num_training_steps: + accuracy, precision, recall, f1 = evaluate(model, dev_loader, metric) + model.train() + if f1 > best_f1: + print(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + paddle.save(model.state_dict(), f"{args.checkpoints}/best.pdparams") + print( + f"evaluation result: accuracy:{accuracy:.5f} precision: {precision:.5f}, recall: {recall:.5f}, F1: {f1:.5f}" + ) + + global_step += 1 + + paddle.save(model.state_dict(), f"{args.checkpoints}/final.pdparams") + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser(__doc__) + parser.add_argument("--base_model_name", type=str, default=None, help="The name of base model.") + parser.add_argument("--train_path", type=str, default=None, help="The path of train set.") + parser.add_argument("--dev_path", type=str, default=None, help="The path of dev set.") + parser.add_argument("--label_path", type=str, default=None, help="The path of label dict.") + parser.add_argument("--num_epochs", type=int, default=3, help="Number of epoches for fine-tuning.") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--learning_rate", type=float, default=5e-5, help="The initial learning rate for optimizer.") + parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") + parser.add_argument("--max_grad_norm", type=float, default=1.0, help="Max grad norm to clip gradient.") + parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Warmup proportion params for warmup strategy") + parser.add_argument("--log_steps", type=int, default=50, help="Frequency of printing log.") + parser.add_argument("--eval_steps", type=int, default=500, help="Frequency of performing evaluation.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--checkpoints", type=str, default=None, help="Directory to save 
checkpoint.") + + args = parser.parse_args() + # yapf: enable + + train() diff --git a/applications/sentiment_analysis/ASO_analysis/predict.py b/applications/sentiment_analysis/ASO_analysis/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..61c518ea49dd349c39dcb46fab72f7faaf994b9b --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/predict.py @@ -0,0 +1,216 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import json +import re +from collections import defaultdict +from functools import partial + +import paddle +from classification.data import ( + convert_example_to_feature as convert_example_to_feature_cls, +) +from datasets import Dataset, load_dataset +from extraction.data import convert_example_to_feature as convert_example_to_feature_ext +from utils import decoding, load_dict + +from paddlenlp.data import DataCollatorForTokenClassification, DataCollatorWithPadding +from paddlenlp.transformers import ( + SkepForSequenceClassification, + SkepForTokenClassification, + SkepTokenizer, +) + + +def concate_aspect_and_opinion(text, aspect, opinions): + aspect_text = "" + for opinion in opinions: + if text.find(aspect) <= text.find(opinion): + aspect_text += aspect + opinion + "," + else: + aspect_text += opinion + aspect + "," + aspect_text = aspect_text[:-1] + + return aspect_text + + +def remove_blanks(example): + example["text"] = re.sub(" +", "", example["text"]) + return example + + +def predict_ext(args): + # load dict and dataset + model_name = "skep_ernie_1.0_large_ch" + ext_label2id, ext_id2label = load_dict(args.ext_label_path) + datasets = load_dataset("text", data_files={"test": args.test_path}) + datasets["test"] = datasets["test"].map(remove_blanks) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial( + convert_example_to_feature_ext, + tokenizer=tokenizer, + label2id=ext_label2id, + max_seq_len=args.ext_max_seq_len, + is_test=True, + ) + test_ds = copy.copy(datasets["test"]).map(trans_func, batched=False, remove_columns=["text"]) + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=ext_label2id["O"]) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + print("test data loaded.") + + # load ext model + ext_state_dict = paddle.load(args.ext_model_path) + ext_model = SkepForTokenClassification.from_pretrained(model_name, num_classes=len(ext_label2id)) + ext_model.load_dict(ext_state_dict) + print("extraction model loaded.") + + ext_model.eval() + results = [] + for bid, batch_data in enumerate(test_loader): + input_ids, token_type_ids, seq_lens = ( + batch_data["input_ids"], + batch_data["token_type_ids"], + batch_data["seq_len"], + ) + logits = ext_model(input_ids, token_type_ids=token_type_ids) + + predictions = 
logits.argmax(axis=2).numpy() + for eid, (seq_len, prediction) in enumerate(zip(seq_lens, predictions)): + idx = bid * args.batch_size + eid + tag_seq = [ext_id2label[idx] for idx in prediction[:seq_len][1:-1]] + text = datasets["test"][idx]["text"] + aps = decoding(text[: args.ext_max_seq_len - 2], tag_seq) + for aid, ap in enumerate(aps): + aspect, opinions = ap[0], list(set(ap[1:])) + aspect_text = concate_aspect_and_opinion(text, aspect, opinions) + results.append( + { + "id": str(idx) + "_" + str(aid), + "aspect": aspect, + "opinions": opinions, + "text": text, + "aspect_text": aspect_text, + } + ) + + return results + + +def predict_cls(args, ext_results): + # load dict + model_name = "skep_ernie_1.0_large_ch" + cls_label2id, cls_id2label = load_dict(args.cls_label_path) + text_list = [] + for result in ext_results: + example = result["aspect_text"] + "\t" + result["text"] + text_list.append(example) + ext_results = {"text": text_list} + dataset = Dataset.from_dict(ext_results) + + tokenizer = SkepTokenizer.from_pretrained(model_name) + trans_func = partial( + convert_example_to_feature_cls, + tokenizer=tokenizer, + label2id=cls_label2id, + max_seq_len=args.cls_max_seq_len, + is_test=True, + ) + + test_ds = dataset.map(trans_func, batched=False, remove_columns=["text"]) + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_loader = paddle.io.DataLoader(test_ds, batch_sampler=test_batch_sampler, collate_fn=data_collator) + print("test data loaded.") + + # load cls model + cls_state_dict = paddle.load(args.cls_model_path) + cls_model = SkepForSequenceClassification.from_pretrained(model_name, num_classes=len(cls_label2id)) + cls_model.load_dict(cls_state_dict) + print("classification model loaded.") + + cls_model.eval() + + results = [] + for bid, batch_data in enumerate(test_loader): + input_ids, token_type_ids = batch_data["input_ids"], batch_data["token_type_ids"] + logits = cls_model(input_ids, token_type_ids=token_type_ids) + + predictions = logits.argmax(axis=1).numpy().tolist() + results.extend(predictions) + + results = [cls_id2label[pred_id] for pred_id in results] + return results + + +def post_process(ext_results, cls_results): + assert len(ext_results) == len(cls_results) + + collect_dict = defaultdict(list) + for ext_result, cls_result in zip(ext_results, cls_results): + ext_result["sentiment_polarity"] = cls_result + eid, _ = ext_result["id"].split("_") + collect_dict[eid].append(ext_result) + + sentiment_results = [] + for eid in collect_dict.keys(): + sentiment_result = {} + ap_list = [] + for idx, single_ap in enumerate(collect_dict[eid]): + if idx == 0: + sentiment_result["text"] = single_ap["text"] + ap_list.append( + { + "aspect": single_ap["aspect"], + "opinions": single_ap["opinions"], + "sentiment_polarity": single_ap["sentiment_polarity"], + } + ) + sentiment_result["ap_list"] = ap_list + sentiment_results.append(sentiment_result) + + with open(args.save_path, "w", encoding="utf-8") as f: + for sentiment_result in sentiment_results: + f.write(json.dumps(sentiment_result, ensure_ascii=False) + "\n") + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--ext_model_path", type=str, default=None, help="The path of extraction model path that you want to load.") + parser.add_argument("--cls_model_path", type=str, default=None, help="The path of classification model path that you want to load.") + 
parser.add_argument("--ext_label_path", type=str, default=None, help="The path of extraction label dict.") + parser.add_argument("--cls_label_path", type=str, default=None, help="The path of classification label dict.") + parser.add_argument('--test_path', type=str, default=None, help="The path of test set that you want to predict.") + parser.add_argument('--save_path', type=str, required=True, default=None, help="The saving path of predict results.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--ext_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for extraction model.") + parser.add_argument("--cls_max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization for classification model.") + args = parser.parse_args() + # yapf: enable + + # predict with ext model + ext_results = predict_ext(args) + print("predicting with extraction model done!") + + # predict with cls model + cls_results = predict_cls(args, ext_results) + print("predicting with classification model done!") + + # post_process prediction results + post_process(ext_results, cls_results) + print(f"sentiment analysis results has been saved to path: {args.save_path}") diff --git a/applications/sentiment_analysis/ASO_analysis/run_demo.sh b/applications/sentiment_analysis/ASO_analysis/run_demo.sh new file mode 100644 index 0000000000000000000000000000000000000000..ea87c8ffa49433c3a461ac31d1c57a121833e0d1 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/run_demo.sh @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python demo.py \ + --ext_model_path "./checkpoints/ext_checkpoints/best.pdparams" \ + --cls_model_path "./checkpoints/cls_checkpoints/best.pdparams" \ + --ext_label_path "./data/ext_data/label.dict" \ + --cls_label_path "./data/cls_data/label.dict" \ + --ext_max_seq_len 512 \ + --cls_max_seq_len 256 + diff --git a/applications/sentiment_analysis/ASO_analysis/run_export_model.sh b/applications/sentiment_analysis/ASO_analysis/run_export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..177ad4307ebdbe4cb9fd7ec9f9c1008158468751 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/run_export_model.sh @@ -0,0 +1,41 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +model_type=$1 + +if [ ! $model_type ]; then +echo "Please enter the correct export model type, for example: sh run_export extraction" +elif [ $model_type = extraction ]; then +python export_model.py \ + --model_type "extraction" \ + --model_path "./checkpoints/ext_checkpoints/best.pdparams" \ + --save_path "./checkpoints/ext_checkpoints/static/infer" + +elif [ $model_type = classification ]; then +python export_model.py \ + --model_type "classification" \ + --model_path "./checkpoints/cls_checkpoints/best.pdparams" \ + --save_path "./checkpoints/cls_checkpoints/static/infer" + +elif [ $model_type = pp_minilm ]; then +python export_model.py \ + --model_type "pp_minilm" \ + --base_model_name "ppminilm-6l-768h" \ + --model_path "./checkpoints/pp_checkpoints/best.pdparams" \ + --save_path "./checkpoints/pp_checkpoints/static/infer" +else +echo "Three model_types are supported: [extraction, classification, pp_minilm]" +fi diff --git a/applications/sentiment_analysis/ASO_analysis/run_predict.sh b/applications/sentiment_analysis/ASO_analysis/run_predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..8fbb7a624edc3e708a69e89b42c284467cc78ce5 --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/run_predict.sh @@ -0,0 +1,26 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=0 + +python predict.py \ + --ext_model_path "./checkpoints/ext_checkpoints/best.pdparams" \ + --cls_model_path "./checkpoints/cls_checkpoints/best.pdparams" \ + --test_path "./data/test.txt" \ + --ext_label_path "./data/ext_data/label.dict" \ + --cls_label_path "./data/cls_data/label.dict" \ + --save_path "./data/sentiment_results.json" \ + --batch_size 8 \ + --ext_max_seq_len 512 \ + --cls_max_seq_len 256 diff --git a/applications/sentiment_analysis/ASO_analysis/utils.py b/applications/sentiment_analysis/ASO_analysis/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..88264c549ab041fdd1f3937f6b3b039e25a7bfec --- /dev/null +++ b/applications/sentiment_analysis/ASO_analysis/utils.py @@ -0,0 +1,133 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
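+
+# 公共工具函数:随机种子设置、标签词典读写、测试数据读取,
+# 以及序列标注结果解码(decoding)与属性-观点词拼接(concate_aspect_and_opinion)等。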
+ +import random + +import numpy as np +import paddle +from seqeval.metrics.sequence_labeling import get_entities + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def load_dict(dict_path): + with open(dict_path, "r", encoding="utf-8") as f: + words = [word.strip() for word in f.readlines()] + word2id = dict(zip(words, range(len(words)))) + id2word = dict((v, k) for k, v in word2id.items()) + + return word2id, id2word + + +def read_test_file(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip().replace(" ", "") + yield {"text": line} + + +def decoding(text, tag_seq): + assert len(text) == len(tag_seq), f"text len: {len(text)}, tag_seq len: {len(tag_seq)}" + + puncs = list(",.?;!,。?;!") + splits = [idx for idx in range(len(text)) if text[idx] in puncs] + + prev = 0 + sub_texts, sub_tag_seqs = [], [] + for i, split in enumerate(splits): + sub_tag_seqs.append(tag_seq[prev:split]) + sub_texts.append(text[prev:split]) + prev = split + sub_tag_seqs.append(tag_seq[prev:]) + sub_texts.append((text[prev:])) + + ents_list = [] + for sub_text, sub_tag_seq in zip(sub_texts, sub_tag_seqs): + ents = get_entities(sub_tag_seq, suffix=False) + ents_list.append((sub_text, ents)) + + aps = [] + no_a_words = [] + for sub_tag_seq, ent_list in ents_list: + sub_aps = [] + sub_no_a_words = [] + for ent in ent_list: + ent_name, start, end = ent + if ent_name == "Aspect": + aspect = sub_tag_seq[start : end + 1] + sub_aps.append([aspect]) + if len(sub_no_a_words) > 0: + sub_aps[-1].extend(sub_no_a_words) + sub_no_a_words.clear() + else: + ent_name == "Opinion" + opinion = sub_tag_seq[start : end + 1] + if len(sub_aps) > 0: + sub_aps[-1].append(opinion) + else: + sub_no_a_words.append(opinion) + + if sub_aps: + aps.extend(sub_aps) + if len(no_a_words) > 0: + aps[-1].extend(no_a_words) + no_a_words.clear() + elif sub_no_a_words: + if len(aps) > 0: + aps[-1].extend(sub_no_a_words) + else: + no_a_words.extend(sub_no_a_words) + + if no_a_words: + no_a_words.insert(0, "None") + aps.append(no_a_words) + + return aps + + +def concate_aspect_and_opinion(text, aspect, opinions): + aspect_text = "" + for opinion in opinions: + if text.find(aspect) <= text.find(opinion): + aspect_text += aspect + opinion + "," + else: + aspect_text += opinion + aspect + "," + aspect_text = aspect_text[:-1] + + return aspect_text + + +def save_examples(examples, save_path, idxs): + with open(save_path, "w", encoding="utf-8") as f: + for idx in idxs: + line = "\t".join(examples[idx]) + "\n" + f.write(line) + + +def save_dict(dict_path, dict_type): + if dict_type not in ["ext", "cls"]: + raise ValueError("Only ext/cls should be accepted for dict_type.") + + with open(dict_path, "w", encoding="utf-8") as f: + if dict_type == "ext": + label_list = ["O", "B-Aspect", "I-Aspect", "B-Opinion", "I-Opinion"] + else: + label_list = ["负向", "正向"] + + for label in label_list: + f.write(label + "\n") diff --git a/applications/sentiment_analysis/README.md b/applications/sentiment_analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3cb8501e0d39dea82e7c84179a1c12147306b85f --- /dev/null +++ b/applications/sentiment_analysis/README.md @@ -0,0 +1,41 @@ +# 情感分析应用 + +## **1. 
情感分析简介** +情感分析(sentiment analysis)是近年来国内外研究的热点,旨在对带有情感色彩的主观性文本进行分析、处理、归纳和推理。情感分析具有广泛的应用场景,可以被应用于消费决策、舆情分析、个性化推荐等领域。 + +按照分析粒度可以大致分为三类:篇章级的情感分析(Document-Level Sentiment Classification)、语句级的情感分析(Sentence-Level Sentiment Classification)和属性级的情感分析(Aspect-Level Sentiment Classification)。其中属性级的情感分析又包含多项子任务,例如属性抽取(Aspect Term Extraction)、观点抽取(Opinion Term Extraction)、属性级情感分析(Aspect-Based Sentiment Classification)等。 + +
+ +
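+
+例如,对于评论"房间很大,但是隔音不好",属性级的情感分析可以抽取出(房间, 很大, 正向)和(隔音, 不好, 负向)两组"属性-观点-情感倾向"信息。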
+ + + +## **2. 情感分析项目介绍** + +PaddleNLP情感分析应用立足真实企业用户对情感分析方面的需求,同时针对情感分析领域的痛点和难点,基于前沿模型开源了细粒度的情感分析解决方案,助力开发者快速分析业务相关产品或服务的用户感受。针对情感分析应用,本项目不仅提供了基于Taskflow开箱即用的情感分析能力,还提供了从输入数据到情感分析结果可视化的能力,另外考虑到一些企业用户需要针对业务场景进行适配,本项目同时提供了完整的情感分析定制方案:数据标注 - 模型训练 - 模型测试 - 模型部署 - 情感分析可视化。 + +当前PaddleNLP情感分析应用更多聚焦于属性级的情感分析,支持文本评论中关于属性、观点词和情感倾向方面的分析。当前提供了两种情感分析方案:基于通用信息抽取模型UIE的情感分析方案和基于情感知识增强模型SKEP的情感分析方案。 + +基于UIE的情感分析方案采用 Prompt Learning 的方式进行情感信息抽取,该分析方式需要预先定义情感信息抽取的schema,然后通过该schema逐步分析和抽取情感信息。 相比基于SKEP的情感分析方案,UIE方案在测试中表现出了更好的效果。在测试中,通过精确匹配的方式对比抽取的 属性、情感倾向和观点词 三者信息,即当三者全部匹配才算抽取正确,下表展示了此次测试的评测指标: + +| 模型 | 权重 | Precision | Recall | F1 | +| :---: | :--------: | :--------: | :--------: | :--------: | +| `SKEP` | `skep_ernie_1.0_large_ch` | 0.76368 | 0.74710 | 0.75530 | +| `uie` | `uie-senta-base` | 0.89593 | 0.86125 | 0.87825 | + + +基于SKEP的情感分析方案主要采用两阶段式的情感分析抽取,首先通过序列标注的方式定位属性词和观点词,然后通过结合属性词和观点词两者信息进行属性情感极性分类。相比基于UIE的情感分析方案,基于SKEP的情感分析方案具有更快的预测速度。下表展示了在测试集上平均每分钟预测的样本数,可以看到SKEP方案的预测速度显著快于UIE方案。 + +| 模型 | 权重 | 预测样本数/m | +| :---: | :--------: | :--------: | +| `SKEP` | `skep_ernie_1.0_large_ch` | 3428 | +| `uie` | `uie-senta-base` | 1104 | + +备注: 当前只有基于UIE的方案支持情感分析结果可视化能力,基于SKEP的方案暂不支持。 + +## **3. 快速开始** + +- 👉 [基于UIE的情感分析方案](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/sentiment_analysis/unified_sentiment_extraction) + +- 👉 [基于SKEP的情感分析方案](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/sentiment_analysis/ASO_analysis) diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/.gitignore b/applications/sentiment_analysis/unified_sentiment_extraction/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..e21e51f32d9c494cadaf1884d0bed4c8bf95e09b --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/.gitignore @@ -0,0 +1,10 @@ +checkpoint/* +data/* +export/* +images/* +outputs/* +log/* +uie-base/* +SimHei.ttf +*.sh +myhttp.py diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/README.md b/applications/sentiment_analysis/unified_sentiment_extraction/README.md new file mode 100644 index 0000000000000000000000000000000000000000..37e52fc8957e2015f5b68ffc2b887543a107fc0a --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/README.md @@ -0,0 +1,884 @@ +# 通用情感信息抽取 + +## **目录** +## **目录** +- [1. 情感分析应用简介](#1) +- [2. 特色介绍](#2) +- [3. 运行环境](#3) +- [4. 整体功能介绍与Taskflow快速体验](#4) + - [4.1 开箱即用的情感分析能力](#4.1) + - [4.1.1 语句级情感分析](#4.1.1) + - [4.1.2 属性级情感分析](#4.1.2) + - [4.1.3 多版本模型选择](#4.1.3) + - [4.2 批量处理:从数据到情感分析可视化](#4.2) + - [4.2.1 数据描述](#4.2.1) + - [4.2.2 批量情感分析](#4.2.2) + - [4.2.3 情感分析可视化](#4.2.3) + - [4.2.3.1 一键生成情感分析结果](#4.2.3.1) + - [4.2.3.2 情感分析详细展示](#4.2.3.2) +- [5. 更进一步:结合业务分析经验,定制情感分析](#5) + - [5.1 打通数据标注到训练样本构建](#5.1) + - [5.1.1 样本构建:语句级情感分类任务](#5.1.1) + - [5.1.2 样本构建:属性抽取相关任务](#5.1.2) + - [5.1.3 样本构建升级1:加强属性聚合能力](#5.1.3) + - [5.1.4 样本构建升级2:加强隐性观点抽取能力](#5.1.4) + - [5.2 模型训练](#5.2) + - [5.3 模型测试](#5.3) + - [5.4 模型预测及效果展示](#5.4) + - [5.4.1 使用训练后的模型进行预测](#5.4.1) + - [5.4.2 属性聚合预测和分析](#5.4.2) + - [5.4.3 隐性观点词抽取预测和分析](#5.4.3) +- [6. 模型部署](#6) + - [6.1 基于SimpleServer进行服务化部署](#6.1) + - [6.2 基于Pipeline进行部署](#6.2) + + + +## **1. 情感分析应用简介** + +PaddleNLP情感分析应用立足真实企业用户对情感分析方面的需求,针对情感分析领域的痛点和难点,提供基于前沿模型的情感分析解决方案,助力开发者快速分析业务相关产品或服务的用户感受。 + +本项目以通用信息抽取模型UIE为训练底座,提供了语句级情感分析和属性级情感分析能力、覆盖情感分类、属性抽取、观点抽取等常用情感分析能力,如下图所示。同时提供了可视化能力,支持从输入数据到情感分析结果可视化,帮助用户快速分析业务数据。更进一步地,本项目同时支持基于业务数据进行定制训练,同时支持引入业务侧积累的经验和知识,包括同义属性和隐性观点词表,加强模型进行属性聚合和隐性观点抽取的能力,进一步提高模型对于业务场景数据的分析能力。 + +
+ +
+ + + +## **2. 特色介绍** + +- **功能丰富🎓**:提供情感分析训练模型作为底座,支持语句级情感分析和属性级情感分析,覆盖情感分类、属性与观点抽取、同义属性聚合、隐性观点抽取、可视化分析等常见情感分析任务。 +- **效果领先✊**: 以通用信息抽取模型UIE为训练底座,具有较强的零样本预测和小样本微调能力。 +- **开箱即用👶**:打通Taskflow使用流程,3行代码获取分析结果,同时提供了情感分析结果可视化能力。 +- **定制模型🏃**:支持针对特定业务场景进行全流程定制,包括数据标注、样本构建、模型训练和模型测试,同时通过融合业务相关的同义属性词和隐性观点词,可进一步提高模型针对业务场景的情感分析能力。 + + + + +## **3. 运行环境** + +**代码结构** +``` +unified_sentiment_extraction/ +├── batch_predict.py # 以文件的形式输入,进行批量预测的脚本 +├── evaluate.py # 模型评估脚本 +├── finetune.py # 模型微调脚本 +├── label_studio.py # 将label-studio导出数据转换为模型输入数据的脚本 +├── label_studio.md # 将label-studio标注说明 +├── utils.py # 工具函数脚本 +├── visual_analysis.py # 情感分析结果可视化脚本 +└── README.md # 使用说明 +``` + +**安装依赖** + +- python == 3.9.12 +- paddlepaddle == 2.3.2 +- paddlenlp == 2.4.5 +- wordcloud == 1.8.2.2 + +**安装PaddlePaddle**: + +环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 具体可以参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。如下命令可以安装linux系统,CUDA版本为10.2环境下的paddlepaddle,具体版本号为支持GPU的2.3.2版本。 + +```shell +conda install paddlepaddle-gpu==2.3.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ +``` + +**安装PaddleNLP**: +安装PaddleNLP可以开启百度镜像源来加速下载,更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + +**安装wordcloud**: + +```shell +python3 -m pip install wordcloud==1.8.2.2 +``` + + + +## **4. 整体功能介绍与Taskflow快速体验** + +本项目以通用信息抽取模型UIE为训练底座,基于大量情感分析数据进一步训练,增强了模型对于情感知识的处理能力,支持语句级情感分类、属性抽取、观点词抽取、属性级情感分类等基础情感分析能力。下表展示了通用UIE `uie-base` 和情感知识增强的UIE `uie-senta-base` 在测试集上的效果对比。 + +| 模型 | Precision | Recall | F1 | +| :---: | :--------: | :--------: | :--------: | +| `uie-base` | 0.86759 | 0.83696 | 0.85200 | +| `uie-senta-base` | 0.93403 | 0.92795 | 0.93098 | + +另外,为方便用户体验和使用,本项目提供的情感分析能力已经集成到了 Taskflow,可以通过Taskflow开箱即用的能力快速体验情感分析的功能。 + + + +### **4.1 开箱即用的情感分析能力** + + + +#### **4.1.1 语句级情感分析** +整句情感分析功能当前支持二分类:正向和负向,调用示例如下: + +```python +>>> from paddlenlp import Taskflow + +>>> schema = ['情感倾向[正向,负向]'] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema) +>>> print(senta('蛋糕味道不错,店家服务也很好')) + +[ + { + '情感倾向[正向,负向]': [ + { + 'text': '正向', + 'probability': 0.996646058824652 + } + ] + } +] +``` + + + +#### **4.1.2 属性级情感分析** + +除语句级情感分析之外,本项目同时支持属性级情感分析,包括属性抽取(Aspect Term Extraction)、观点抽取(Opinion Term Extraction)、属性级情感分析(Aspect Based Sentiment Classification)等等。可以通过设置相应的schema进行对应信息的抽取,其调用示例如下。 + +```python +>>> from paddlenlp import Taskflow + +>>> # Aspect Term Extraction +>>> # schema = ["评价维度"] +>>> # Aspect - Opinion Extraction +>>> # schema = [{"评价维度":["观点词"]}] +>>> # Aspect - Sentiment Extraction +>>> # schema = [{"评价维度":["情感倾向[正向,负向,未提及]"]}] +>>> # Aspect - Sentiment - Opinion Extraction +>>> schema = [{"评价维度":["观点词", "情感倾向[正向,负向,未提及]"]}] + +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema) +>>> print(senta('蛋糕味道不错,店家服务也很热情')) + +[ + { + '评价维度': [ + { + 'text': '服务', + 'start': 9, + 'end': 11, + 'probability': 0.9709093024793489, + 'relations': { + '观点词': [ + { + 'text': '热情', + 'start': 13, + 'end': 15, + 'probability': 0.9897222206316556 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '正向', + 'probability': 0.9999327669598301 + } + ] + } + }, + { + 'text': '味道', + 'start': 2, + 'end': 4, + 'probability': 0.9105472387838915, + 'relations': { + '观点词': [ + { + 'text': '不错', + 
'start': 4, + 'end': 6, + 'probability': 0.9946981266891619 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '正向', + 'probability': 0.9998829392709467 + } + ] + } + } + ] + } +] +``` + +更进一步地,在某些业务场景中,特别是一些垂域场景,用户可能比较关注固定的某些属性。在这种情况下,可以预先提供相应的属性集合,则本项目将只会在该属性集上进行情感分析,分析和抽取该集合中各个属性的信息。 + +针对固定属性的情感分析示例如下,需要将属性集合传入参数 `aspects` 中,后续将只针对这些属性进行分析。可以看到在示例中,传入了属性 `房间`,`位置` 和 `价格`,针对 `房间` 和 `价格` 均分析到了观点词和情感倾向,但是`位置`由于在样本中并未提及,因此相应观点词为空,情感倾向为 `未提及`。 + +```python +>>> # define schema for pre-defined aspects, schema +>>> schema = ["观点词", "情感倾向[正向,负向,未提及]"] +>>> aspects = ["房间", "位置", "价格"] +>>> # set aspects for Taskflow +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, aspects=aspects) +>>> print(senta("这家店的房间很大,店家服务也很热情,就是价格有点贵")) + +[ + { + '评价维度': [ + { + 'text': '房间', + 'relations': { + '观点词': [ + { + 'text': '大', + 'start': 7, + 'end': 8, + 'probability': 0.9998772175681552 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '正向', + 'probability': 0.9999312170965595 + } + ] + } + }, + { + 'text': '位置', + 'relations': { + '情感倾向[正向,负向,未提及]': [ + { + 'text': '未提及', + 'probability': 0.9999939203353847 + } + ] + } + }, + { + 'text': '价格', + 'relations': { + '观点词': [ + { + 'text': '贵', + 'start': 24, + 'end': 25, + 'probability': 0.998841669863026 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '负向', + 'probability': 0.9997340617174757 + } + ] + } + } + ] + } +] +``` + + + +#### **4.1.3 多版本模型选择** +为方便用户实际业务应用情况,本项目多个版本的模型,可以根据业务对于精度和速度方面的要求进行选择,下表展示了不同版本模型的结构以及在测试集上的指标。 + +| 模型 | 结构 | Precision | Recall | F1 | +| :---: | :--------: | :--------: | :--------: | :--------: | +| `uie-senta-base` (默认) | 12-layers, 768-hidden, 12-heads | 0.93403 | 0.92795 | 0.93098 | +| `uie-senta-medium` | 6-layers, 768-hidden, 12-heads | 0.93146 | 0.92137 | 0.92639 | +| `uie-senta-mini` | 6-layers, 384-hidden, 12-heads | 0.91799 | 0.92028 | 0.91913 | +| `uie-senta-micro` | 4-layers, 384-hidden, 12-heads | 0.91542 | 0.90957 | 0.91248 | +| `uie-senta-nano` | 4-layers, 312-hidden, 12-heads | 0.90817 | 0.90878 | 0.90847 | + +在Taskflow中,可以直接指定相应模型名称进行使用,使用`uie-senta-mini`版本的示例如下: + +```python +>>> from paddlenlp import Taskflow + +>>> schema = [{"评价维度":["观点词", "情感倾向[正向,负向,未提及]"]}] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-mini", schema=schema) +``` + + + +### **4.2 批量处理:从数据到情感分析可视化** + +为方便使用,本项目提供了批量处理的功能,支持以文件形式输入,批量进行情感分析。同时打通了从数据到情感分析结果可视化的流程,帮助用户可以更加快速获取情感分析结果,聚焦于业务分析方面。 + + + +#### **4.2.1 数据描述** +输入数据如下方式进行组织,每行表示一个文本评论。可以点击[这里](https://paddlenlp.bj.bcebos.com/datasets/sentiment_analysis/hotel/test_hotel.tar.gz)下载酒店场景的测试数据进行分析。 + +``` +非常好的酒店 不枉我们爬了近一个小时的山,另外 大厨手艺非常棒 竹筒饭 竹筒鸡推荐入住的客人必须要点, +房间隔音效果不好,楼下KTV好吵的 +酒店的房间很大,干净舒适,服务热情 +怎么说呢,早上办理入住的,一进房间闷热的一股怪味,很臭,不能开热风,好多了,虽然房间小,但是合理范围 +总台服务很差,房间一般 +``` + + + +#### **4.2.2 批量情感分析** + +通过脚本 `batch_predict.py` 批量进行情感分析,通过 `file_path` 指定要进行情感分析的文件路径,处理完后,结果将会保存在 `save_path` 指定的文件中,示例如下: + +```shell +python batch_predict.py \ + --file_path "./data/test_hotel.txt" \ + --save_path "./data/sentiment_analysis.json" \ + --model "uie-senta-base" \ + --schema "[{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}]" \ + --batch_size 4 \ + --max_seq_len 512 +``` + +参数说明: +- ``file_path``: 用于进行情感分析的文件路径。 +- ``save_path``: 情感分析结果的保存路径。 +- ``model``: 进行情感分析的模型名称,可以在这些模型中进行选择:['uie-senta-base', 'uie-senta-medium', 'uie-senta-mini', 'uie-senta-micro', 'uie-senta-nano']。 +- ``load_from_dir``: 指定需要加载的离线模型目录,比如训练后保存的模型,如果不进行指定,则默认根据 `model` 指定的模型名称自动下载相应模型。 +- ``schema``: 基于UIE模型进行信息抽取的Schema描述。 +- ``batch_size``: 预测过程中的批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 16。 +- 
``max_seq_len``: 模型支持处理的最大序列长度,默认为512。 +- ``aspects``: 预先给定的属性,如果设置,模型将只针对这些属性进行情感分析,比如分析这些属性的观点词。 + + + +#### **4.2.3 情感分析可视化** + +在情感分析处理之后,可以根据情感分析的保存结果进行可视化展示,帮助用户更友好地分析业务特点。默认情况下,可视化功能支持围绕属性、观点、属性+观点、属性+情感、指定属性+观点分析功能。在各项分析中,均支持词云和直方图两类图像展示。 + +下面将以酒店场景数据为例进行展示。 + + + +**4.2.3.1 一键生成情感分析结果** + +基于以上生成的情感分析结果,可以使用`visual_analysis.py`脚本对情感分析结果进行可视化,最终可视化结果将会被保存在 `save_dir` 指定的目录下。 使用时需要指定情感分析可视化的结果的任务类型,若是语句级的情感分类,则将task_type指定为``cls``,若是属性级的情感分析,则将task_type指定为``ext``,示例如下: + +``` +python visual_analysis.py \ + --file_path "./outputs/test_hotel.json" \ + --save_dir "./outputs/images" \ + --task_type "ext" +``` + +可配置参数说明: +- ``file_path``: 指定情感分析结果的保存路径。 +- ``save_dir``: 指定图片的保存目录。 +- ``task_type``: 指定任务类型,语句级情感分类请指定为``cls``,属性级情感分析请指定为``ext``,默认为``ext``。 +- ``font_path``: 指定字体文件的路径,用以在生成的wordcloud图片中辅助显示中文,如果为空,则会自动下载黑体字,用以展示中文字体。 + +**备注**:在`visual_analysis.py`脚本启动时,默认会删除当前已经存在的`save_dir`目录以及其中文件,然后在该目录下重新生成相应的可视化图片。 + +下图展示了对酒店场景数据分析后的部分图片: + +
+ +
+
+
+
+
+**4.2.3.2 情感分析详细展示**
+
+**(1) 属性分析**
+通过属性信息,可以查看客户对于产品/服务的重点关注方面。可以通过`plot_aspect_with_frequency`函数对属性进行可视化,当前可通过参数`image_type`指定`wordcloud`或`histogram`,分别以词云和直方图的形式进行可视化。
+
+```python
+# define SentimentResult to process the saved results of sentiment analysis
+sr = SentimentResult(args.file_path, sentiment_name=args.sentiment_name)
+# define VisualSentiment to help visualization
+vs = VisualSentiment(font_path=args.font_path)
+
+# visualization for aspect
+save_path = os.path.join(args.save_dir, "aspect_wc.png")
+vs.plot_aspect_with_frequency(sr.aspect_frequency, save_path, image_type="wordcloud")
+save_path = os.path.join(args.save_dir, "aspect_hist.png")
+vs.plot_aspect_with_frequency(sr.aspect_frequency, save_path, image_type="histogram")
+```
+
+
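+
+**备注**:下文 (2)~(5) 小节中的示例均复用此处创建的 `sr`(情感分析结果)与 `vs`(可视化工具)对象,`SentimentResult` 与 `VisualSentiment` 的具体定义请以 `visual_analysis.py` 脚本的实际实现为准。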
+ +
+
+ +**(2) 观点分析** +通过观点信息,可以查看客户对于产品/服务整体的直观印象。可以通过`plot_opinion_with_frequency`函数对观点进行可视化。 + +```python +# visualization for opinion +save_path = os.path.join(args.save_dir, "opinion_wc.png") +vs.plot_opinion_with_frequency(sr.opinion_frequency, save_path, image_type="wordcloud") +``` + +
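+
+如果还希望以直方图查看高频观点词,可以参考下面的示例写法。此处假设 `plot_opinion_with_frequency` 与 `plot_aspect_with_frequency` 一样支持 `image_type="histogram"`,请以 `visual_analysis.py` 中的实际实现为准:
+
+```python
+# 以直方图形式展示高频观点词(示例写法,假设该函数支持 histogram 类型)
+save_path = os.path.join(args.save_dir, "opinion_hist.png")
+vs.plot_opinion_with_frequency(sr.opinion_frequency, save_path, image_type="histogram")
+```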
+ +
+
+
+**(3) 属性+观点分析**
+结合属性和观点两者信息,可以更加具体地展现客户对于产品/服务的详细观点,分析某个属性的优劣,从而帮助商家更有针对性地改善或提高自己的产品/服务质量。可以通过`plot_aspect_with_opinion`函数对属性+观点进行可视化,并可通过设置参数`sentiment`按照情感倾向展示不同的分析结果:若设置为`all`,则展示正向和负向的所有属性;若为`positive`,则仅展示正向的属性;若为`negative`,则仅展示负向的属性。在绘制直方图时,可以通过设置参数`top_n`,展示频率最高的 top n 个属性。
+
+```python
+# visualization for aspect + opinion
+save_path = os.path.join(args.save_dir, "aspect_opinion_wc.png")
+vs.plot_aspect_with_opinion(sr.aspect_opinion, save_path, image_type="wordcloud", sentiment="all")
+save_path = os.path.join(args.save_dir, "aspect_opinion_hist.png")
+vs.plot_aspect_with_opinion(sr.aspect_opinion, save_path, image_type="histogram", sentiment="all", top_n=8)
+```
+
+
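+
+在排查差评时,也可以结合上文介绍的 `sentiment` 参数,仅查看负向属性及其观点词,示例如下(保存路径为示例写法,可自行指定):
+
+```python
+# 仅展示负向属性及其观点词
+save_path = os.path.join(args.save_dir, "aspect_opinion_negative_wc.png")
+vs.plot_aspect_with_opinion(sr.aspect_opinion, save_path, image_type="wordcloud", sentiment="negative")
+```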
+ +
+
+
+
+**(4) 属性+情感分析**
+挖掘客户对于产品/服务在各个属性上的情感极性,帮助商家直观地查看客户对产品/服务各属性的印象。可以通过`plot_aspect_with_sentiment`函数对属性+情感进行可视化。在绘制直方图时,可以通过设置参数`top_n`,展示频率最高的 top n 个属性。
+
+```python
+# visualization for aspect + sentiment
+save_path = os.path.join(args.save_dir, "aspect_sentiment_wc.png")
+vs.plot_aspect_with_sentiment(sr.aspect_sentiment, save_path, image_type="wordcloud")
+save_path = os.path.join(args.save_dir, "aspect_sentiment_hist.png")
+vs.plot_aspect_with_sentiment(sr.aspect_sentiment, save_path, image_type="histogram", top_n=15, descend_aspects=sr.descend_aspects)
+```
+
+
+ +
+ +**(5) 对给定属性进行观点分析** +通过指定属性,更加细致查看客户对于产品/服务某个属性的观点。可以帮助商家更加细粒度地分析客户对于产品/服务的某个属性的印象。下面图片示例中,展示了客户对于属性"房间"的观点。可以通过`plot_opinion_with_aspect`函数,对给定的属性进行观点分析。默认情况下,不会自动生成该类图像,需要开发者手动调用`plot_opinion_with_aspect`进行可视化分析。 + +```python +aspect = "房间" +save_path = os.path.join(args.save_dir, "opinions_for_aspect_wc.png") +vs.plot_opinion_with_aspect(aspect, sr.aspect_opinion, save_path, image_type="wordcloud") +save_path = os.path.join(args.save_dir, "opinions_for_aspect_hist.png") +vs.plot_opinion_with_aspect(aspect, sr.aspect_opinion, save_path, image_type="histogram") +``` + +
+ +
+ + + + +## **5. 更进一步:结合业务分析经验,定制情感分析** + +考虑到用户在对业务数据进行情感分析时,往往聚焦于某个特定场景或领域,为满足用户更高的情感分析要求,本项目支持从以下方面协助用户,结合业务经验,进一步定制情感分析能力,提高模型对业务数据的理解和分析能力。 + +- 数据层面:打通 label-studio 平台,定制了情感信息的标注规则,支持根据标注数据自动转换为模型输入样本。 +- 属性聚合:结合业务经验,支持传入同义的属性集合,可以增强模型对于数据聚合的能力。 +- 隐性观点抽取:结合业务经验,支持自定义隐性观点词表,可以增强模型对于隐性观点的抽取能力。 + +下面以酒店场景为例,讲解定制酒店垂域的情感分析能力。具体地,将从数据标注及样本构建 - 模型训练 - 模型测试 - 模型预测及效果展示等全流程展开介绍。 + + + +### **5.1 打通数据标注到训练样本构建** + +本项目建议用户使用 label-studio 平台标注数据,同时提供了一套用于情感信息标注的规则,可以参考[情感分析任务Label Studio使用指南](./label_studio.md)获取更多信息,这里不再赘述。同时本项目打通了从 label-studio 标注平台到转换为模型输入形式数据的流程, 即支持用户在基于 label_studio 标注业务侧数据后,通过label-studio 导出标注好的json数据, 然后利用本项目提供的 `label_studio.py` 脚本,可以将导出数据一键转换为模型训练数据。 + +在利用 `label_studio.py` 脚本进行数据转换时,需要考虑任务类型的不同,选择相应的样本构建方式,整体可以分为 `分类` 和 `抽取` 任务。 + +
+ +
+ +为方便用户使用,本项目提供了300+条酒店场景的标注数据,可点击[这里](https://paddlenlp.bj.bcebos.com/datasets/sentiment_analysis/hotel/label_studio.tar.gz)进行下载,请注意该数据仅适合用于 `抽取` 类型的任务。 + + + + +#### **5.1.1 样本构建:语句级情感分类任务** + +对于语句级情感分类任务,默认支持2分类:``正向`` 和 ``负向``,可以通过如下命令构造相关训练数据。 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --task_type cls \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options "正向" "负向" \ + --is_shuffle True \ + --seed 1000 +``` + +参数介绍: +- ``label_studio_file``: 从label studio导出的语句级情感分类的数据标注文件。 +- ``task_type``: 选择任务类型,可选有抽取和分类两种类型的任务,其中前者需要设置为``ext``,后者需要设置为``cls``。由于此处为语句级情感分类任务,因此需要设置为``cls``。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``options``: 情感极性分类任务的选项设置。对于语句级情感分类任务,默认支持2分类:``正向`` 和 ``负向``;对于属性级情感分析任务,默认支持3分类:``正向``, ``负向``和 ``未提及``,其中``未提及``表示要分析的属性在原文本评论中未提及,因此无法分析情感极性。如果业务需要其他情感极性选项,可以通过``options``字段进行设置,需要注意的是,如果定制了``options``,参数``label_studio_file``指定的文件需要包含针对新设置的选项的标注数据。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. + +**备注**:参数``options``可以不进行手动指定,如果这么做,则采用默认的设置。针对语句级情感分类任务,其默认将被设置为:``"正向" "负向"``;对于属性级情感分析任务,默认将被设置为:``"正向" "负向" "未提及"``。 + + + +#### **5.1.2 样本构建:属性抽取相关任务** + +针对抽取式的任务,比如属性-观点抽取、属性-情感极性-观点词抽取、属性分类任务等,可以使用如下命令将label-studio导出数据转换为模型训练数据。 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --task_type ext \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options "正向" "负向" "未提及" \ + --negative_ratio 5 \ + --is_shuffle True \ + --seed 1000 +``` + +重点参数介绍: +- ``label_studio_file``: 从label studio导出的属性抽取相关的数据标注文件。 +- ``task_type``: 选择任务类型,可选有抽取和分类两种类型的任务,其中前者需要设置为``ext``,后者需要设置为``cls``。由于此处为属性抽取相关任务,因此需要设置为``ext``。 +- ``negative_ratio``表示对于一个样本,为每个子任务(属性级的观点抽取,属性级的情感分类)最多生成``negative_ratio``个负样本。如果额外提供了属性同义词标或隐性观点抽取词表,将结合两者信息生成更多的负样本,以增强属性聚合和隐性观点抽取能力。 +其他参数解释同上,这里不再赘述。 + + + +#### **5.1.3 样本构建升级1:加强属性聚合能力** + +在用户对产品或服务进行评论时,对某一些属性可能会有不同的说法,这会在后续对属性分析时可能会带来困扰。如以下示例中的"价格","价钱"和"费用"。 + +``` +蛋糕味道不错,外观很漂亮,而且价格比较便宜 +蛋糕味道不错,外观很漂亮,而且价钱比较便宜 +蛋糕味道不错,外观很漂亮,而且费用比较便宜 +``` + +针对这种情况,针对属性相关任务,本项目同时支持用户结合业务经验,通过设置同义的属性词表,加强模型的属性聚合能力。具体来讲,本项目期望通过以下两点,支持对属性聚合能力的建设。 + +- 支持针对用户给定的属性进行情感分析 +- 支持用户提供同义的属性词表,用以加强模型对用户领域属性同义词的理解能力 + +以下给出了酒店场景的示例,每行代表1类同义词,不同词之间以"空格"隔开。 + +``` +房间 屋子 房子 +位置 地理位置 +隔音 隔声 +价格 价钱 费用 +``` + +可以通过以下命令,使用synonym_file指定凝聚业务经验的同义属性集合,利用同义属性生成对应的数据样本: + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --synonym_file ./data/synonyms.txt \ + --task_type ext \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options "正向" "负向" "未提及" \ + --negative_ratio 5 \ + --is_shuffle True \ + --seed 1000 +``` + + + +#### **5.1.4 样本构建升级2:加强隐性观点抽取能力** + +另外,本项目同时支持加强对隐性观点功能抽取的能力,这里需要说明一点,本项目中定义隐性观点是指没有对应属性的纯观点词,如以下示例中的"比较便宜"便是隐性观点。 + +``` +蛋糕味道不错,外观很漂亮,而且比较便宜 +``` + +本项目支持用户提供一个隐性观点映射文件,用户可以根据自己的业务场景定义隐性观点词,以下给出了酒店场景的示例。其格式为,第1个单词为隐性观点对应的属性,后续按照情感情感倾向对隐性观点词进行了归类,同一类的以"[ ]"方式放到一块。 + +``` +价格, 正向[实惠 便宜 超划算 划算 物超所值 物有所值 不贵], 负向[贵 不便宜 不划算] +卫生, 正向[干净], 负向[很脏 很臭 不干净] +隔音, 负向[好吵] +位置, 负向[不太好找] +``` + +可以通过参数"implicit_file"指定凝聚业务经验的隐性观点词表,生成对应的隐性观点数据样本: + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --implicit_file ./data/implicit_opinions.txt \ + --task_type ext \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options "正向" "负向" "未提及" \ + --negative_ratio 5 \ + --is_shuffle True \ + --seed 1000 +``` + + + +### **5.2 模型训练** +在生成酒店场景的训练数据后,可以通过以下命令启动模型训练: + +```shell +python -u -m paddle.distributed.launch --gpus "0" 
finetune.py \ + --train_path ./data/train.json \ + --dev_path ./data/dev.json \ + --save_dir ./checkpoint \ + --learning_rate 1e-5 \ + --batch_size 16 \ + --max_seq_len 512 \ + --num_epochs 3 \ + --model uie-senta-base \ + --seed 1000 \ + --logging_steps 10 \ + --valid_steps 100 \ + --device gpu +``` + +可配置参数说明: + +* ``train_path``:必须,训练集文件路径。 +* ``dev_path``:必须,验证集文件路径。 +* ``save_dir``:模型 checkpoints 的保存目录,默认为"./checkpoint"。 +* ``learning_rate``:训练最大学习率,UIE 推荐设置为 1e-5;默认值为1e-5。 +* ``batch_size``:训练集训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 16。 +* ``max_seq_len``:模型支持处理的最大序列长度,默认为512。 +* ``num_epochs``:模型训练的轮次,可以视任务情况进行调整,默认为10。 +* ``model``:训练使用的预训练模型。可选择的有`uie-senta-base`, `uie-senta-medium`, `uie-senta-mini`, `uie-senta-micro`, `uie-senta-nano`,默认为`uie-senta-base`。 +* ``logging_steps``: 训练过程中日志打印的间隔 steps 数,默认10。 +* ``valid_steps``: 训练过程中模型评估的间隔 steps 数,默认100。 +* ``seed``:全局随机种子,默认为 42。 +* ``device``: 训练设备,可选择 'cpu'、'gpu' 其中的一种;默认为 GPU 训练。 + + + +### **5.3 模型测试** +通过运行以下命令进行对酒店场景的测试集进行评估: + +``` +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/test.json \ + --batch_size 16 \ + --max_seq_len 512 +``` + +可配置参数说明: + +* ``model_path``:必须,进行评估的模型文件夹路径,路径下需包含模型权重文件model_state.pdparams及配置文件model_config.json。 +* ``test_path``:必须,进行评估的测试集文件。 +* ``batch_size``:训练集训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 16。 +* ``max_seq_len``:文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +* ``debug``: 是否开启debug模式对每个正例类别分别进行评估,该模式仅用于模型调试,默认关闭。 + +在构造样本过程中,如果设置了最大负例比例negative_ratio,会在样本中添加一定数量的负样本,模型测试默认会对正样本和负样本共同进行评估。特别地,当开启debug模式后,会对每个正例类别分别进行评估,该模式仅用于模型调试: +``` +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/test.json \ + --batch_size 16 \ + --max_seq_len 512 \ + --debug +``` + +输出打印示例: +``` +[2022-12-12 05:20:06,152] [ INFO] - ----------------------------- +[2022-12-12 05:20:06,152] [ INFO] - Class Name: 评价维度 +[2022-12-12 05:20:06,152] [ INFO] - Evaluation Precision: 0.89655 | Recall: 0.89655 | F1: 0.89655 +[2022-12-12 05:20:06,553] [ INFO] - ----------------------------- +[2022-12-12 05:20:06,553] [ INFO] - Class Name: 观点词 +[2022-12-12 05:20:06,553] [ INFO] - Evaluation Precision: 0.81159 | Recall: 0.86154 | F1: 0.83582 +[2022-12-12 05:20:07,610] [ INFO] - ----------------------------- +[2022-12-12 05:20:07,611] [ INFO] - Class Name: X的观点词 +[2022-12-12 05:20:07,611] [ INFO] - Evaluation Precision: 0.92222 | Recall: 0.90217 | F1: 0.91209 +[2022-12-12 05:20:08,331] [ INFO] - ----------------------------- +[2022-12-12 05:20:08,331] [ INFO] - Class Name: X的情感倾向[未提及,正向,负向] +[2022-12-12 05:20:08,331] [ INFO] - Evaluation Precision: 0.81481 | Recall: 0.81481 | F1: 0.81481 +``` + + + +### **5.4 模型预测及效果展示** + + + +#### **5.4.1 使用训练后的模型进行预测** +paddlenlp.Taskflow装载定制模型,通过task_path指定模型权重文件的路径,路径下需要包含训练好的模型权重文件model_state.pdparams。 + +```python +>>> # define schema +>>> schema = [{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, task_path="./checkpoint/model_best") +>>> senta("这家点的房间很大,店家服务也很热情,就是房间隔音不好") +[ + { + '评价维度': [ + { + 'text': '服务', + 'start': 11, + 'end': 13, + 'probability': 0.9600759151746807, + 'relations': { + '观点词': [ + { + 'text': '热情', + 'start': 15, + 'end': 17, + 'probability': 0.9995151134519027 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '正向', + 'probability': 0.9998306104766073 + } + ] + } + }, + { + 'text': '隔音', + 'start': 22, + 'end': 24, + 'probability': 0.9993525950520166, + 'relations': { + '观点词': [ + { + 'text': '不好', + 'start': 24, + 'end': 26, + 
'probability': 0.9992370362201655 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '负向', + 'probability': 0.9842680108546062 + } + ] + } + }, + { + 'text': '房间', + 'start': 4, + 'end': 6, + 'probability': 0.9991784415865368, + 'relations': { + '观点词': [ + { + 'text': '很大', + 'start': 6, + 'end': 8, + 'probability': 0.8359714693985723 + } + ], + '情感倾向[正向,负向,未提及]': [ + { + 'text': '正向', + 'probability': 0.997688853839179 + } + ] + } + } + ] + } +] +``` + + + +#### **5.4.2 属性聚合预测和分析** + +下面就 `隔音` 与 `价格` 两个属性进行分析,抽取样本中与这两个属性相关的情感信息,代码如下: + +```python +>>> schema = [{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}] +>>> aspects = ["隔音", "价格"] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, task_path="./checkpoint/model_best", aspects=aspects) +>>> senta("这家点的房间很大,店家服务也很热情,就是房间隔音不好") +``` + +下图展示了关于模型对于属性聚合能力支持的样本,在分析之前设定固定的属性集合`["隔音", "价格"]`,可以看到尽管语料中同时出现了`隔音`、`隔声`、`价格`、`价钱`和`费用`,但是经过定制后的情感分析模型依然能够准确识别出对于属性 `隔音` 和 `价格`的情感信息,从而起到属性聚合的效果。 + +| 样本 | 属性 | 观点词 | 情感倾向 | +| :----: |:----: |:----: |:----: | +|这家店的房间很大,隔音效果不错,而且价格很便宜|隔音|不错|正向| +|房间比较小,隔声也不太好,设施还是挺齐全的|隔音|不太好|负向| +|房间还不错,有免费的矿泉水,而且价格很实惠|价格|实惠|正向| +|房间很大,店家也挺热情,很棒,就是价钱有点贵|价格|贵|负向| +|酒店不错,房间面积大,住的也舒适,而且价格很划算|价格|划算|正向| +|房间好大呀,而且这边还挺安静的,不过整体还是很好的,很宽敞,而且价格很便宜|价格|便宜|正向| + + + +#### **5.4.3 隐性观点词抽取预测和分析** + +下面就`价格` 和 `卫生` 两个属性进行分析隐性观点,抽取样本中与这两个属性相关的情感信息,代码如下: + +对于"价格"的调用示例: +```python +>>> schema = [{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}] +>>> aspects = ["价格"] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema, task_path="./checkpoint/model_best", aspects=aspects) +>>> senta("这家店的房间很大,店家服务也很热情,而且很便宜") +``` + +下图展示了关于模型对于隐性观点抽取的样本,可以看到,虽然以下这些样本中,并未出现`价格` 和 `卫生`,但模型依然正确识别除了这两个属性的情感信息。 + +| 样本 | 属性 | 观点词 | 情感倾向 | +| :----: |:----: |:----: |:----: | +|房间比较大,就是感觉贵了点,不太划算|价格|贵、不太划算|负向| +|这家店的房间很大,店家服务也很热情,而且很便宜|价格|便宜|正向| +|这次来荆州给我的房间小的无语了,所幸比较实惠|价格|实惠|正向| +|酒店不大,有点不干净|卫生|不干净|负向| +|老板人很好,房间虽然很大,但有点脏|卫生|脏|负向| +|房间不大,很温暖,也很干净|卫生|干净|正向| + + + + +## **6. 模型部署** + + + +### **6.1 基于SimpleServer进行服务化部署** +本项目支持基于PaddleNLP SimpleServing进行服务化部署,可以在`deploy`目录下执行以下命令启动服务和请求。 + +**启动服务** +``` +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` +**Client发送请求** + +服务启动后, 通过 `client.py` 脚本发送请求: +``` +python client.py +``` + +**多卡服务化预测** + +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,代码示例如下: + +```python +senta1 = Taskflow("sentiment_analysis", schema=schema, model="uie-senta-base", device_id=0) +senta2 = Taskflow("sentiment_analysis", schema=schema, model="uie-senta-base", device_id=1) + +app.register_taskflow('senta', [senta1, senta2]) +``` + + + +### **6.2 基于Pipeline进行部署** + +本项目支持基于Pipeline的方式进行部署,用户只需要上传测试文件,即可获取对应的情感分析可视化结果,更多信息请参考[情感分析Pipeline](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/pipelines/examples/sentiment_analysis)。 diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/batch_predict.py b/applications/sentiment_analysis/unified_sentiment_extraction/batch_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..87347668d58bad2ef71167a66e6ad83c18e145ea --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/batch_predict.py @@ -0,0 +1,88 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time + +from utils import load_txt, write_json_file + +from paddlenlp import Taskflow +from paddlenlp.utils.log import logger + + +def main(args): + """ + Predict based on Taskflow. + """ + start_time = time.time() + # read file + logger.info("Trying to load dataset: {}".format(args.file_path)) + if not os.path.exists(args.file_path): + raise ValueError("something with wrong for your file_path, it may not exist.") + examples = load_txt(args.file_path) + + # define Taskflow for sentiment analysis + schema = eval(args.schema) + if args.load_from_dir: + senta = Taskflow( + "sentiment_analysis", + model=args.model, + schema=schema, + aspects=args.aspects, + batch_size=args.batch_size, + max_seq_len=args.max_seq_len, + task_path=args.load_from_dir, + ) + else: + senta = Taskflow( + "sentiment_analysis", + model=args.model, + schema=schema, + aspects=args.aspects, + batch_size=args.batch_size, + max_seq_len=args.max_seq_len, + ) + + # predict with Taskflow + logger.info("Start to perform sentiment analysis for your dataset, this may take some time.") + results = senta(examples) + + # save results + save_path = args.save_path + if not save_path: + save_dir = os.path.dirname(args.file_path) + save_path = os.path.join(save_dir, "sentiment_results.json") + write_json_file(results, save_path) + logger.info("The results of sentiment analysis has been saved to: {}".format(save_path)) + logger.info("This run take {} seconds.".format(time.time() - start_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--file_path", type=str, default="./data/test_hotel.txt", help="The file path that you want to perform sentiment analysis on.") + parser.add_argument("--save_path", type=str, default="./data/sentiment_analysis.json", help="The saving path for the results of sentiment analysis.") + parser.add_argument("--model", choices=['uie-senta-base', 'uie-senta-medium', 'uie-senta-mini', 'uie-senta-micro', 'uie-senta-nano'], default="uie-senta-base", help="The model name that you wanna use for sentiment analysis.") + parser.add_argument("--load_from_dir", default=None, type=str, help="The directory path for the finetuned model to predict, if set None, it will download model according to model_name.") + parser.add_argument("--schema", default="[{'评价维度': ['观点词', '情感倾向[正向,负向,未提及]']}]", type=str, help="The schema for UIE to extract infomation.") + parser.add_argument("--aspects", default=None, type=str, nargs="+", help="A list of pre-given aspects, that is to say, Pipeline only perform sentiment analysis on these pre-given aspects if you input it.") + parser.add_argument("--batch_size", type=int, default=4, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + + args = parser.parse_args() + # yapf: enable + + main(args) diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/deploy/client.py b/applications/sentiment_analysis/unified_sentiment_extraction/deploy/client.py 
new file mode 100644 index 0000000000000000000000000000000000000000..1024ff2ce88e8089f49bd590bd01335aad3e0ee8 --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/deploy/client.py @@ -0,0 +1,29 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import requests + +url = "http://0.0.0.0:8189/taskflow/senta" +headers = {"Content-Type": "application/json"} +texts = ["蛋糕味道不错,店家的服务也很热情"] +data = { + "data": { + "text": texts, + } +} +r = requests.post(url=url, headers=headers, data=json.dumps(data)) +datas = json.loads(r.text) +print(datas) diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/deploy/server.py b/applications/sentiment_analysis/unified_sentiment_extraction/deploy/server.py new file mode 100644 index 0000000000000000000000000000000000000000..5bb78ea18f690e10da11efca1e91462527b7a0a5 --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/deploy/server.py @@ -0,0 +1,23 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer, Taskflow + +# The schema changed to your defined schema +schema = [{"评价维度": ["观点词", "情感倾向[正向,负向,未提及]"]}] +# define taskflow to perform sentiment analysis +senta = Taskflow("sentiment_analysis", schema=schema, model="uie-senta-base") +# define your server +app = SimpleServer() +app.register_taskflow("taskflow/senta", senta) diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/evaluate.py b/applications/sentiment_analysis/unified_sentiment_extraction/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..7382b7ddeea0fed863bfa93156414f97b95f10c8 --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/evaluate.py @@ -0,0 +1,122 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
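+
+# Evaluation entry for the finetuned UIE sentiment model: it reads a test set in the
+# prompt-style format and reports span-level precision / recall / F1 with SpanEvaluator
+# (see Section 5.3 of the README for the command-line usage).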
+ +import argparse +import re +from functools import partial + +import paddle +from tqdm import tqdm +from utils import ( + convert_example, + create_data_loader, + get_relation_type_dict, + reader, + unify_prompt_name, +) + +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, AutoTokenizer +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + """ + model.eval() + metric.reset() + for batch in tqdm(data_loader): + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids) + metric.update(num_correct, num_infer, num_label) + precision, recall, f1 = metric.accumulate() + model.train() + return precision, recall, f1 + + +def do_eval(): + tokenizer = AutoTokenizer.from_pretrained(args.model_path) + model = UIE.from_pretrained(args.model_path) + + test_ds = load_dataset(reader, data_path=args.test_path, max_seq_len=args.max_seq_len, lazy=False) + class_dict = {} + relation_data = [] + if args.debug: + for data in test_ds: + class_name = unify_prompt_name(data["prompt"]) + # Only positive examples are evaluated in debug mode + if re.search(r"\[.*?\]$", data["prompt"]) and data["result_list"][0]["text"] == "未提及": + continue + if len(data["result_list"]) != 0: + if "的" not in data["prompt"]: + class_dict.setdefault(class_name, []).append(data) + else: + relation_data.append((data["prompt"], data)) + relation_type_dict = get_relation_type_dict(relation_data) + else: + class_dict["all_classes"] = test_ds + + trans_fn = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len) + + for key in class_dict.keys(): + if args.debug: + test_ds = MapDataset(class_dict[key]) + else: + test_ds = class_dict[key] + + test_data_loader = create_data_loader(test_ds, mode="test", batch_size=args.batch_size, trans_fn=trans_fn) + + metric = SpanEvaluator() + precision, recall, f1 = evaluate(model, metric, test_data_loader) + logger.info("-----------------------------") + logger.info("Class Name: %s" % key) + logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1)) + + if args.debug and len(relation_type_dict.keys()) != 0: + for key in relation_type_dict.keys(): + test_ds = MapDataset(relation_type_dict[key]) + + test_data_loader = create_data_loader(test_ds, mode="test", batch_size=args.batch_size, trans_fn=trans_fn) + + metric = SpanEvaluator() + precision, recall, f1 = evaluate(model, metric, test_data_loader) + logger.info("-----------------------------") + logger.info("Class Name: X的%s" % key) + logger.info("Evaluation Precision: %.5f | Recall: %.5f | F1: %.5f" % (precision, recall, f1)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--model_path", type=str, default=None, help="The path of saved model that you want to load.") + parser.add_argument("--test_path", type=str, default=None, help="The path of test 
set.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum total input sequence length after tokenization.") + parser.add_argument("--debug", action='store_true', help="Precision, recall and F1 score are calculated for each class separately if this option is enabled.") + + args = parser.parse_args() + # yapf: enable + + do_eval() diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/finetune.py b/applications/sentiment_analysis/unified_sentiment_extraction/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..212a61de8a574f0889f455ac4e9fb52eb7805d5a --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/finetune.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import paddle +from evaluate import evaluate +from utils import convert_example, create_data_loader, reader, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, AutoTokenizer +from paddlenlp.utils.log import logger + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + tokenizer = AutoTokenizer.from_pretrained(args.model) + model = UIE.from_pretrained(args.model) + + train_ds = load_dataset(reader, data_path=args.train_path, max_seq_len=args.max_seq_len, lazy=False) + dev_ds = load_dataset(reader, data_path=args.dev_path, max_seq_len=args.max_seq_len, lazy=False) + + trans_fn = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len) + + train_data_loader = create_data_loader(train_ds, mode="train", batch_size=args.batch_size, trans_fn=trans_fn) + dev_data_loader = create_data_loader(dev_ds, mode="dev", batch_size=args.batch_size, trans_fn=trans_fn) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + logger.info("load model from path: {}".format(args.init_from_ckpt)) + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + optimizer = paddle.optimizer.AdamW(learning_rate=args.learning_rate, parameters=model.parameters()) + + criterion = paddle.nn.BCELoss() + metric = SpanEvaluator() + + loss_list = [] + global_step = 0 + best_f1 = 0 + tic_train = time.time() + for epoch in range(1, args.num_epochs + 1): + for batch in train_data_loader: + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + loss_start = 
criterion(start_prob, start_ids) + loss_end = criterion(end_prob, end_ids) + loss = (loss_start + loss_end) / 2.0 + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_list.append(float(loss)) + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + loss_avg = sum(loss_list) / len(loss_list) + logger.info( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss_avg, args.logging_steps / time_diff) + ) + tic_train = time.time() + + if global_step % args.valid_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + + precision, recall, f1 = evaluate(model, metric, dev_data_loader) + logger.info("Evaluation precision: %.5f, recall: %.5f, F1: %.5f" % (precision, recall, f1)) + if f1 > best_f1: + logger.info(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + save_dir = os.path.join(args.save_dir, "model_best") + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(save_dir) + logger.disable() + tokenizer.save_pretrained(save_dir) + logger.enable() + tic_train = time.time() + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--train_path", default=None, type=str, help="The path of train set.") + parser.add_argument("--dev_path", default=None, type=str, help="The path of dev set.") + parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") + parser.add_argument("--max_seq_len", default=512, type=int, help="The maximum input sequence length. 
Sequences longer than this will be split automatically.") + parser.add_argument("--num_epochs", default=100, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization") + parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") + parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--model", choices=["uie-senta-base", "uie-senta-medium", "uie-senta-mini", "uie-senta-micro", "uie-senta-nano"], default="uie-senta-base", type=str, help="Select the pretrained model for few-shot learning.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.") + + args = parser.parse_args() + # yapf: enable + + do_train() diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.md b/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.md new file mode 100644 index 0000000000000000000000000000000000000000..e3a52da0e1a72b057747f0802f9fa43f055d38b3 --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.md @@ -0,0 +1,171 @@ +# 情感分析任务Label Studio使用指南 + + **目录** + +- [1. label-studio 安装](#1) +- [2. label-studio 项目创建](#2) +- [3. 情感分析任务标注](#3) + - [3.1 语句级情感分类任务](#3.1) + - [3.2 属性级情感分析任务](#3.2) + - [3.2.1 属性-情感极性-观点词抽取](#3.2.1) + - [3.2.2 属性-情感极性抽取](#3.2.2) + - [3.2.3 属性-观点词抽取](#3.2.3) + - [3.2.4 属性抽取](#3.2.4) + - [3.2.5 观点词抽取](#3.2.5) +- [4. 导出标注数据](#4) +- [5. References](#5) + + + +## **1. label-studio 安装** +本内容在以下环境进行测试安装: +- python == 3.9.12 +- label-studio == 1.6.0 + +在终端(terminal)使用pip安装label-studio: + +```shell +pip install label-studio==1.6.0 +``` + +安装完成后,运行以下命令行: +```shell +label-studio start +``` + +在浏览器打开[http://localhost:8080/](http://127.0.0.1:8080/),输入用户名和密码登录,开始使用label-studio进行标注。 + + + +## **2. label-studio 项目创建** + +创建项目之前,需要先确定标注的任务类型以及需要标注哪些内容,然后点击创建(Create)开始创建一个新的项目,填写项目名称、描述。 + +
+ +
+ +如果数据已经准备好,可以在此进行导入数据。 + +
+ +
+ + +接下来,根据需要标注的任务类型,选择适合的任务。在本项目中,默认会包含两种类型的任务:语句级情感分类任务和属性级情感分析任务。由于这两者都属于自然语言处理(NLP)任务,因此可以点击 `Natural Language Processing` 选项,在该选项下面进行选择相应的子项任务。 + +- 如果标注语句级情感分类任务,请选择`Text Classification`。 + +
+ +
+ +- 如果标注属性级情感分析任务,比如属性-观点词-情感极性三元组的信息抽取,请选择`Relation Extraction`。 + +
+ +
+ +最后点击保存即可。 + + + +## **3. 情感分析任务标注** + + + +### **3.1 语句级情感分类任务** +这里对应的任务类型为`Text Classification`,在标注之前,需要设定`正向`和`负向`的标签,然后保存即可。 + +
+ +
+ +设定好标签后,即可开始进行标注,选择正向或负向,最后点击提交,便标注好一条数据。 +
+ +
+ + + +### **3.2 属性级情感分析任务** + +在本项目中,属性级的情感分析需要配置的标注任务类型为`Relation Extraction`,包括属性抽取、观点抽取、属性-观点抽取、属性-情感极性抽取、属性-情感极性-观点词三元组抽取等任务。其中属性-情感极-观点词(A-S-O)三元组抽取是最常见的任务之一,下面优先讲解该任务的标注规则。 + + + +#### **3.2.1 属性-情感极性-观点词抽取** +属性-情感极性-观点词(A-S-O)三元组抽取标注内容涉及两类标签:Span 类型标签和 Relation 类型标签。其中Span标签用于定位文本批评中属性、观点词和情感极性三类信息,Relation类型标签用于设置评价维度和观点词、情感倾向之间的关系。 + + +#### **(1)Span类型标签** +这里需要定位属性、情感极性、观点词三类信息,在标注时,需要将属性和情感极性进行组合,形成复合标签。具体来讲,设定`评价维度##正向`用于定位情感倾向为正向的属性,`评价维度##负向`用于定位情感倾向为负向的属性。另外,利用标注标签`观点词`定位语句中的观点词。 + +
+ +
+ +#### **(2)Relation类型标签** +这里只涉及到1中Relation类型标签,即`评价维度`到`观点词`的映射关系。这里可以设置一下两者关系的名称,即点击Code,然后配置关系名称(这里将两者关系设置为`观点词`),最后点击保存即可。 + +
+ +
+ +在设置好Span类型和Relation标签之后,便可以开始进行标注数据了。 + +
+ +
+ + + +#### **3.2.2 属性-情感极性抽取** +如3.2.1所述,本项目中针对属性-情感极性(A-S)抽取任务,采用`Span`的形式进行标注。设定`评价维度##正向`用于定位情感倾向为正向的属性,`评价维度##负向`用于定位情感倾向为负向的属性。下图展示了关于属性-情感极性抽取任务的标注示例。 + +
+ +
+ + + +#### **3.2.3 属性-观点词抽取** +针对属性-观点词(A-O)抽取任务,采用`Relation`的形式进行标注。这需要将属性对应标注标签设定为`评价维度`,观点词设定为`观点词`。下图展示了关于属性-观点词抽取任务的标注示例。 + +
+ +
+ + + +#### **3.2.4 属性抽取** +针对属性(A)抽取任务,采用`Span`的形式进行标注。 这需要将属性对应的标注标签设定为`评价维度`。下图展示了关于属性抽取任务的标注示例。 + +
+ +
+ + + +#### **3.2.4 观点词抽取** +针对观点词(O)抽取任务,采用`Span`的形式进行标注。 这需要将观点词对应的标注标签设定为`观点词`。下图展示了关于观点词抽取任务的标注示例。 + +
+ +
+ + + + +## **4. 导出标注数据** + +勾选已标注文本ID,点击Export按钮,选择导出的文件类型为`JSON`,导出数据: + +
+ +
+ + + +## **5. References** +- **[Label Studio 官网](https://labelstud.io/)** diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.py b/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.py new file mode 100644 index 0000000000000000000000000000000000000000..d03703ef88ddf093c513b11a17fe536195ac4460 --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/label_studio.py @@ -0,0 +1,738 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import json +import os +import random +import time +from decimal import Decimal + +import numpy as np +import paddle +from utils import load_txt + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +PROMPT_ITEMS = { + "aspect_prompt_prefix": "评价维度", + "opinion_prompt": "观点词", + "sentiment_prompt_prefix": "情感倾向", + "separator": "##", + "not_mentioned_option": "未提及", + "positive_option": "正向", + "negative_option": "负向", +} + + +class Convertor(object): + """Convertor to convert data export from annotation platform""" + + def __init__(self, negative_ratio=5): + """Init Data Convertor""" + self.negative_ratio = negative_ratio + self.aspect_prompt_prefix = PROMPT_ITEMS["aspect_prompt_prefix"] + self.opinion_prompt = PROMPT_ITEMS["opinion_prompt"] + self.sentiment_prompt_prefix = PROMPT_ITEMS["sentiment_prompt_prefix"] + self.separator = PROMPT_ITEMS["separator"] + self.not_mentioned_option = PROMPT_ITEMS["not_mentioned_option"] + self.options = PROMPT_ITEMS["options"] + + def process_text_tag(self, line, task_type="ext"): + items = {} + items["text"] = line["data"]["text"] + if task_type == "ext": + items["entities"] = [] + items["relations"] = [] + items["relation_ids"] = set() + result_list = line["annotations"][0]["result"] + for result in result_list: + if result["type"] == "labels": + items["entities"].append( + { + "id": result["id"], + "start_offset": result["value"]["start"], + "end_offset": result["value"]["end"], + "label": result["value"]["labels"][0], + } + ) + else: + items["relations"].append( + { + "id": result["from_id"] + "-" + result["to_id"], + "from_id": result["from_id"], + "to_id": result["to_id"], + "type": result["labels"][0] if result["labels"] else self.opinion_prompt, + } + ) + items["relation_ids"].add(result["from_id"]) + items["relation_ids"].add(result["to_id"]) + + elif task_type == "cls": + items["label"] = line["annotations"][0]["result"][0]["value"]["choices"] + return items + + def convert_cls_examples(self, raw_examples, data_flag="Data"): + """ + Convert labeled data for classification task. 
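+        Each annotated example is turned into the UIE prompt-style format, e.g.
+        {"content": text, "result_list": [...], "prompt": "情感倾向[正向,负向]"}.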
+ """ + examples = [] + logger.info("{0:7} Start to convert annotation data.".format("[" + data_flag + "]")) + for line in raw_examples: + items = self.process_text_tag(line, task_type="cls") + text, labels = items["text"], items["label"] + example = self.generate_cls_example(text, labels, self.sentiment_prompt_prefix, self.options) + examples.append(example) + logger.info("{0:7} End to convert annotation data.\n".format("")) + return examples + + def convert_ext_examples( + self, + raw_examples, + synonyms=None, + implicit_opinion_map=None, + sentiment_map=None, + with_negatives=True, + task_type="ext_aso", + data_flag="Data", + ): + """ + Convert labeled data for extraction task. + """ + + def _sep_cls_label(label, separator): + label_list = label.split(separator) + if len(label_list) == 1: + return label_list[0], None + return label_list[0], label_list[1:] + + texts = [] + # {"content": "", "result_list": [], "prompt": "X"} + entity_examples = [] + # {"content": "", "result_list": [], "prompt": "X的Y"} + relation_examples = [] + # {"content": "", "result_list": [], "prompt": "X的情感倾向[正向,负向]"} + entity_cls_examples = [] + + # entity label set: ["评价维度", "观点词", ... ] + entity_label_set = [] + # predicate set: ["观点词", ... ] + predicate_set = [] + # set of subject entity in relation: ["房间", "价格", ... ] + subject_name_set = [] + + # List[List[str]] + # List of entity prompt for each example + entity_prompt_list = [] + # Golden subject label for each example + subject_golden_list = [] + # List of inverse relation for each example + inverse_relation_list = [] + # List of predicate for each example + predicate_list = [] + + logger.info("{0:7} Start to convert annotation data.".format("[" + data_flag + "]")) + logger.info("{0:7} Trying to generate positive examples...".format("")) + for line in raw_examples: + items = self.process_text_tag(line, task_type="ext") + + text, relations, entities, relation_ids = ( + items["text"], + items["relations"], + items["entities"], + items["relation_ids"], + ) + texts.append(text) + + entity_example = [] + entity_prompt = [] + entity_example_map = {} + implict_example_map = {} + entity_map = {} + subject_golden = [] + for entity in entities: + entity_name = text[entity["start_offset"] : entity["end_offset"]] + entity_map[entity["id"]] = { + "name": entity_name, + "start": entity["start_offset"], + "end": entity["end_offset"], + } + + entity_label, entity_cls_label = _sep_cls_label(entity["label"], self.separator) + + # generate examples for entity-level sentiment classification + if entity_cls_label is not None: + entity_cls_prompt_prefix = entity_name + "的" + self.sentiment_prompt_prefix + entity_cls_example = self.generate_cls_example( + text, entity_cls_label, entity_cls_prompt_prefix, self.options + ) + + entity_cls_examples.append(entity_cls_example) + + # generate examples for entity extraction + result = {"text": entity_name, "start": entity["start_offset"], "end": entity["end_offset"]} + if entity_label not in entity_example_map.keys(): + entity_example_map[entity_label] = { + "content": text, + "result_list": [result], + "prompt": entity_label, + } + else: + entity_example_map[entity_label]["result_list"].append(result) + + if entity_label not in entity_label_set: + entity_label_set.append(entity_label) + entity_prompt.append(entity_label) + + if implicit_opinion_map and entity["id"] not in relation_ids: + maped_entity = entity_map[entity["id"]] + if maped_entity["name"] not in implicit_opinion_map: + continue + + result = { + "text": 
maped_entity["name"], + "start": maped_entity["start"], + "end": maped_entity["end"], + } + aspect = implicit_opinion_map[maped_entity["name"]] + if aspect not in implict_example_map: + implict_example_map[aspect] = [result] + else: + implict_example_map[aspect].append(result) + + if entity_label.startswith(self.aspect_prompt_prefix): + if entity_name not in subject_golden: + if synonyms and entity_name in synonyms: + subject_synonyms = synonyms[entity_name] + subject_golden.extend(subject_synonyms) + else: + subject_golden.append(entity_name) + + if entity_name not in subject_name_set: + subject_name_set.append(entity_name) + + for v in entity_example_map.values(): + entity_example.append(v) + entity_examples.append(entity_example) + entity_prompt_list.append(entity_prompt) + + # generate examples for classification of implicit opinion + if task_type == "ext_as" or task_type == "ext_aso": + for entity_name in implict_example_map.keys(): + prompt = entity_name + "的" + self.sentiment_prompt_prefix + opinions = implict_example_map[entity_name] + sentiment = None + for opinion in opinions: + if opinion["text"] in sentiment_map: + sentiment = sentiment_map[opinion["text"]] + break + if sentiment is None: + continue + implicit_example = self.generate_cls_example(text, [sentiment], prompt, self.options) + entity_cls_examples.append(implicit_example) + + # generate examples for relation extraction + # Golden entity inputs, initializing with implicit subject and it's synonyms + for implicit_subject in implict_example_map.keys(): + subject_golden.append(implicit_subject) + if synonyms and implicit_subject in synonyms: + subject_golden.extend(synonyms[implicit_subject]) + relation_example = [] + relation_example_map = {} + inverse_relation = [] + predicates = [] + + # generate examples for extraction of implicit opinion + for entity_name in implict_example_map.keys(): + prompt = entity_name + "的" + self.opinion_prompt + implicit_example = { + "content": text, + "result_list": implict_example_map[entity_name], + "prompt": prompt, + } + relation_example.append(implicit_example) + + # generate examples for labeled relations + for relation in relations: + predicate = relation["type"] + subject_id = relation["from_id"] + object_id = relation["to_id"] + + prompt = entity_map[subject_id]["name"] + "的" + predicate + inverse_negative = entity_map[object_id]["name"] + "的" + predicate + + result = { + "text": entity_map[object_id]["name"], + "start": entity_map[object_id]["start"], + "end": entity_map[object_id]["end"], + } + + inverse_relation.append(inverse_negative) + predicates.append(predicate) + + if prompt not in relation_example_map.keys(): + relation_example_map[prompt] = {"content": text, "result_list": [result], "prompt": prompt} + else: + relation_example_map[prompt]["result_list"].append(result) + + if predicate not in predicate_set: + predicate_set.append(predicate) + + for v in relation_example_map.values(): + relation_example.append(v) + + relation_examples.append(relation_example) + subject_golden_list.append(subject_golden) + inverse_relation_list.append(inverse_relation) + predicate_list.append(predicates) + + # start to generate negative examples + if with_negatives and task_type in ["ext_as", "ext_ao", "ext_aso"]: + logger.info("{0:7} Trying to generate negative examples...".format("")) + + # generate negative examples according to entity + all_entity_examples = [] + if with_negatives: + positive_examples, negative_examples = self.add_entity_negative_example( + entity_examples, texts, 
entity_prompt_list, entity_label_set + ) + if len(positive_examples) != 0: + all_entity_examples = positive_examples + negative_examples + else: + for i in range(len(entity_examples)): + all_entity_examples.extend(entity_examples[i]) + + # generate negative examples according to relation + all_relation_examples = [] + if with_negatives: + if len(predicate_set) != 0: + positive_examples = [] + negative_examples = [] + per_n_ratio = self.negative_ratio // 3 + + for i, text in enumerate(texts): + negative_example = [] + collects = [] + + # 1. inverse_relation_list + redundants1 = inverse_relation_list[i] + + # 2. subject_name_set - subject_golden_list[i] + redundants2 = [] + if len(predicate_list[i]) != 0: + nonentity_list = list(set(subject_name_set) - set(subject_golden_list[i])) + nonentity_list.sort() + + redundants2 = [ + nonentity + "的" + predicate_list[i][random.randrange(len(predicate_list[i]))] + for nonentity in nonentity_list + ] + + # 3. entity_label_set - entity_prompt_list[i] + redundants3 = [] + if len(subject_golden_list[i]) != 0: + non_ent_label_list = list(set(entity_label_set) - set(entity_prompt_list[i])) + non_ent_label_list.sort() + + redundants3 = [ + subject_golden_list[i][random.randrange(len(subject_golden_list[i]))] + "的" + non_ent_label + for non_ent_label in non_ent_label_list + ] + + redundants_list = [redundants1, redundants2, redundants3] + + for redundants in redundants_list: + added, rest = self.add_relation_negative_example(redundants, texts[i], per_n_ratio) + negative_example.extend(added) + collects.extend(rest) + num_sup = self.negative_ratio - len(negative_example) + if num_sup > 0 and collects: + if num_sup > len(collects): + idxs = [k for k in range(len(collects))] + else: + idxs = random.sample(range(0, len(collects)), num_sup) + for idx in idxs: + negative_example.append(collects[idx]) + positive_examples.extend(relation_examples[i]) + negative_examples.extend(negative_example) + + all_relation_examples = positive_examples + negative_examples + else: + for i in range(len(relation_examples)): + all_relation_examples.extend(relation_examples[i]) + + # generate negative examples according to sentiment polarity + all_cls_examples = entity_cls_examples + if with_negatives: + if task_type == "ext_aso" or task_type == "ext_as" and self.not_mentioned_option in self.options: + cls_negatives_examples = self.add_cls_negative_example(texts, subject_name_set, subject_golden_list) + all_cls_examples += cls_negatives_examples + + # generate examples with synonyms to support aspect aggregation + if synonyms is not None: + synonym_map = {} + for k, vs in synonyms.items(): + for v in vs: + synonym_map[v] = k + relation_synonym_examples = self.change_aspect_with_synonyms(all_relation_examples, synonyms, synonym_map) + all_relation_examples += relation_synonym_examples + cls_synonym_examples = self.change_aspect_with_synonyms(all_cls_examples, synonyms, synonym_map) + all_cls_examples += cls_synonym_examples + + logger.info("{0:7} End to convert annotation data.\n".format("")) + return all_entity_examples + all_relation_examples + all_cls_examples + + def change_aspect_with_synonyms(self, examples, synonyms, synonym_map): + synonym_examples = [] + for example in examples: + prompt = example["prompt"] + aspect, suffix = prompt.split("的", maxsplit=1) + if aspect not in synonym_map.keys(): + continue + synonym_cluster = synonyms[synonym_map[aspect]] + for syn_aspect in synonym_cluster: + if syn_aspect == aspect: + continue + syn_prompt = syn_aspect + "的" + suffix + 
syn_example = copy.deepcopy(example) + syn_example["prompt"] = syn_prompt + synonym_examples.append(syn_example) + return synonym_examples + + def generate_cls_example(self, text, labels, prompt_prefix, options): + random.shuffle(self.options) + cls_options = ",".join(self.options) + prompt = prompt_prefix + "[" + cls_options + "]" + + result_list = [] + example = {"content": text, "result_list": result_list, "prompt": prompt} + + for label in labels: + start = prompt.rfind(label) - len(prompt) - 1 + end = start + len(label) + result = {"text": label, "start": start, "end": end} + example["result_list"].append(result) + return example + + def add_entity_negative_example(self, examples, texts, prompts, label_set): + negative_examples = [] + positive_examples = [] + for i, prompt in enumerate(prompts): + redundants = list(set(label_set) - set(prompt)) + redundants.sort() + + ratio = self.negative_ratio + if ratio > len(redundants): + ratio = len(redundants) + idxs = random.sample(range(0, len(redundants)), ratio) + + for idx in idxs: + negative_result = {"content": texts[i], "result_list": [], "prompt": redundants[idx]} + negative_examples.append(negative_result) + positive_examples.extend(examples[i]) + return positive_examples, negative_examples + + def add_relation_negative_example(self, redundants, text, ratio): + added_example = [] + rest_example = [] + + if ratio > len(redundants): + ratio = len(redundants) + + all_idxs = [k for k in range(len(redundants))] + idxs = random.sample(range(0, len(redundants)), ratio) + rest_idxs = list(set(all_idxs) - set(idxs)) + + for idx in idxs: + negative_result = {"content": text, "result_list": [], "prompt": redundants[idx]} + added_example.append(negative_result) + + for rest_idx in rest_idxs: + negative_result = {"content": text, "result_list": [], "prompt": redundants[rest_idx]} + rest_example.append(negative_result) + + return added_example, rest_example + + def add_cls_negative_example(self, texts, subject_name_set, subject_golden_list): + negative_examples = [] + for i, text in enumerate(texts): + redundants = list(set(subject_name_set) - set(subject_golden_list[i])) + redundants.sort() + + ratio = self.negative_ratio + if ratio > len(redundants): + ratio = len(redundants) + idxs = random.sample(range(0, len(redundants)), ratio) + + for idx in idxs: + subject_name = redundants[idx] + prompt_prefix = subject_name + "的" + self.sentiment_prompt_prefix + negative_example = self.generate_cls_example(text, ["未提及"], prompt_prefix, self.options) + negative_examples.append(negative_example) + return negative_examples + + +def load_synonym(synonym_path): + synonyms = {} + lines = load_txt(synonym_path) + for line in lines: + items = line.split() + synonyms[items[0]] = items + return synonyms + + +def load_implicit_opinion(implicit_opinion_path): + implicit_opinion_map = {} + sentiment_map = {} + lines = load_txt(implicit_opinion_path) + for line in lines: + items = line.split(",") + aspect = items[0].strip() + for item in items[1:]: + item = item.strip() + start = item.find("[") + end = item.find("]") + sentiment = item[0:start] + opinions = item[start + 1 : end].strip().split() + for opinion in opinions: + implicit_opinion_map[opinion] = aspect + sentiment_map[opinion] = sentiment + return implicit_opinion_map, sentiment_map + + +def parse_ext_task_type(raw_examples): + + task_type_dict = {"ext_a": False, "ext_o": False, "ext_ao": False, "ext_as": False, "ext_aso": False} + + def _parse_raw_example(raw_example): + entity_map = {} + relations = [] + 
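+        # Each record exported from Label Studio keeps its annotations under
+        # annotations[0]["result"]: items of type "labels" are entity spans
+        # (aspect / opinion, optionally carrying a "##"-separated sentiment
+        # suffix), and items of type "relation" link an aspect entity to an
+        # opinion entity via ("from_id", "to_id").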
result_list = raw_example["annotations"][0]["result"] + for result in result_list: + if result["type"] == "labels": + entity_id = result["id"] + entity_map[entity_id] = result["value"]["labels"][0] + elif result["type"] == "relation": + relation_pair = (result["from_id"], result["to_id"]) + relations.append(relation_pair) + else: + raise ValueError( + "Unknown entity type [{}], it indicates that your dataset maybe not a aspect-based extraction dataset, please check it.".format( + result["type"] + ) + ) + + for entity_label in entity_map.values(): + if ( + entity_label.startswith(PROMPT_ITEMS["aspect_prompt_prefix"]) + and PROMPT_ITEMS["separator"] in entity_label + ): + task_type_dict["ext_as"] = True + elif entity_label == PROMPT_ITEMS["aspect_prompt_prefix"]: + task_type_dict["ext_a"] = True + elif entity_label == PROMPT_ITEMS["opinion_prompt"]: + task_type_dict["ext_o"] = True + else: + raise ValueError("Unknown prompt: {}".format(entity_label)) + + # relations store the relation between aspect and opinion by default + if relations: + task_type_dict["ext_ao"] = True + + if task_type_dict["ext_ao"] and task_type_dict["ext_as"]: + task_type_dict["ext_aso"] = True + + for raw_example in raw_examples: + # analyze task type + _parse_raw_example(raw_example) + if task_type_dict["ext_aso"]: + return "ext_aso" + elif (not task_type_dict["ext_as"]) and task_type_dict["ext_ao"]: + return "ext_ao" + + if task_type_dict["ext_as"]: + return "ext_as" + elif task_type_dict["ext_o"]: + return "ext_o" + else: + return "ext_a" + + +def do_convert(): + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.label_studio_file): + raise ValueError("Please input the correct path of label studio file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 0 and len(args.splits) != 3: + raise ValueError("Only []/ len(splits)==3 accepted for splits.") + + def _check_sum(splits): + return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1") + + if len(args.splits) == 3 and not _check_sum(args.splits): + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.label_studio_file, "r", encoding="utf-8") as f: + raw_examples = json.loads(f.read()) + + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + raw_examples = [raw_examples[i] for i in indexes] + + # construct options according + if args.options: + PROMPT_ITEMS["options"] = args.options + else: + if args.task_type == "ext": + PROMPT_ITEMS["options"] = [ + PROMPT_ITEMS["positive_option"], + PROMPT_ITEMS["negative_option"], + PROMPT_ITEMS["not_mentioned_option"], + ] + else: + PROMPT_ITEMS["options"] = [PROMPT_ITEMS["positive_option"], PROMPT_ITEMS["negative_option"]] + + # analyze detailed ext task type: ext_a, ext_o, ext_as, ext_ao, ext_aso + if args.task_type == "ext": + args.task_type = parse_ext_task_type(raw_examples) + + logger.info("You are trying perform dataset construction operation for task {}.\n".format(args.task_type)) + + # load synonyms + synonyms = None + if args.synonym_file: + if args.task_type in ["cls", "ext_a", "ext_o"]: + logger.warning( + "The param synonym_file will not work for task, because the task {} that you wanna try does not support synonym_function.".format( + args.task_type + ) + ) + else: + if not os.path.isfile(args.synonym_file): + raise ValueError( + "The path you input is not a file, please input the correct path of synonym file: {}".format( 
+ args.synonym_file + ) + ) + synonyms = load_synonym(args.synonym_file) + + # load implicit opinions + implicit_opinion_map = None + sentiment_map = None + if args.implicit_file: + if args.task_type in ["cls", "ext_a", "ext_o", "ext_as"]: + logger.warning( + "The param implicit_file will not work for task, because the task {} that you wanna try does not support implicit opinion function.".format( + args.task_type + ) + ) + else: + if not os.path.isfile(args.implicit_file): + raise ValueError( + "The path you input is not a file, please input the correct path of implicit opinion file: {}".format( + args.implicit_file + ) + ) + implicit_opinion_map, sentiment_map = load_implicit_opinion(args.implicit_file) + + # split examples into train/dev/test examples + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + + # define Convertor and convert raw examples to model examples + convertor = Convertor(negative_ratio=args.negative_ratio) + + if args.task_type.startswith("ext"): + train_examples = convertor.convert_ext_examples( + raw_examples[:p1], + synonyms=synonyms, + implicit_opinion_map=implicit_opinion_map, + sentiment_map=sentiment_map, + task_type=args.task_type, + data_flag="Train", + ) + dev_examples = convertor.convert_ext_examples( + raw_examples[p1:p2], + synonyms=synonyms, + implicit_opinion_map=implicit_opinion_map, + sentiment_map=sentiment_map, + task_type=args.task_type, + data_flag="Dev", + ) + test_examples = convertor.convert_ext_examples( + raw_examples[p2:], + synonyms=synonyms, + implicit_opinion_map=implicit_opinion_map, + sentiment_map=sentiment_map, + task_type=args.task_type, + data_flag="Test", + ) + else: + train_examples = convertor.convert_cls_examples(raw_examples[:p1], data_flag="Train") + dev_examples = convertor.convert_cls_examples(raw_examples[p1:p2], data_flag="Dev") + test_examples = convertor.convert_cls_examples(raw_examples[p2:], data_flag="Test") + + # save examples + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + f.write(json.dumps(example, ensure_ascii=False) + "\n") + count += 1 + logger.info("Save %d examples to %s." % (count, save_path)) + + _save_examples(args.save_dir, "train.json", train_examples) + _save_examples(args.save_dir, "dev.json", dev_examples) + _save_examples(args.save_dir, "test.json", test_examples) + + logger.info("Finished! 
It takes {:.2f} seconds".format(time.time() - tic_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--label_studio_file", default="./data/label_studio.json", type=str, help="The annotation file exported from label studio platform.") + parser.add_argument("--synonym_file", type=str, help="The synonmy file of aspect to support aspect aggregation.") + parser.add_argument("--implicit_file", type=str, help="The implicit opinion file whose aspect not be mentioned in text, to support extraction of implicit opinion.") + parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.") + parser.add_argument("--negative_ratio", default=5, type=int, help="Worked only for the extraction task, it means that for each task (aspect-based opinion extraction, aspect-based sentiment classicition) of an example, at least negative_ratio negative examples will be generated without considering synonym_file and implicit_file.") + parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. [0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.") + parser.add_argument("--task_type", choices=['ext', 'cls'], default="ext", type=str, help="Two task types [ext, cls] are supported, ext represents the aspect-based extraction task and cls represents the sentence-level classification task, defaults to ext.") + parser.add_argument("--options", type=str, nargs="+", help="Used only for the classification task, the options for classification") + parser.add_argument("--is_shuffle", type=strtobool, default="True", help="Whether to shuffle the labeled dataset, defaults to True.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization") + + args = parser.parse_args() + # yapf: enablecl + logger.info("Parameter Description:\n{}\n".format(args.__dict__)) + + do_convert() diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/utils.py b/applications/sentiment_analysis/unified_sentiment_extraction/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3b52a525fb2a06a9d175232bd196559c0ca6f6ad --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/utils.py @@ -0,0 +1,265 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
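+
+# Shared helpers for the unified sentiment extraction scripts: seeding, txt/JSON
+# IO, dataloader construction, UIE-style example conversion (prompt + content ->
+# token-level start/end pointer labels) and evaluation-time prompt normalization.
+#
+# Illustrative shape of one training example consumed below (values are made up):
+#   {"content": "房间很干净", "prompt": "评价维度",
+#    "result_list": [{"text": "房间", "start": 0, "end": 2}]}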
+ +import argparse +import json +import random +import re + +import numpy as np +import paddle + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def load_txt(file_path): + texts = [] + with open(file_path, "r", encoding="utf-8") as f: + for line in f.readlines(): + texts.append(line.strip()) + return texts + + +def load_json_file(path): + exmaples = [] + with open(path, "r", encoding="utf-8") as f: + for line in f.readlines(): + example = json.loads(line) + exmaples.append(example) + return exmaples + + +def write_json_file(examples, save_path): + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + line = json.dumps(example, ensure_ascii=False) + f.write(line + "\n") + + +def str2bool(v): + """Support bool type for argparse.""" + if v.lower() in ("yes", "true", "t", "y", "1"): + return True + elif v.lower() in ("no", "false", "f", "n", "0"): + return False + else: + raise argparse.ArgumentTypeError("Unsupported value encountered.") + + +def create_data_loader(dataset, mode="train", batch_size=1, trans_fn=None): + """ + Create dataloader. + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True) + return dataloader + + +def convert_example(example, tokenizer, max_seq_len): + """ + example: { + title + prompt + content + result_list + } + """ + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + truncation=True, + max_seq_len=max_seq_len, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + return_offsets_mapping=True, + ) + encoded_inputs = encoded_inputs[0] + offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]] + bias = 0 + for index in range(1, len(offset_mapping)): + mapping = offset_mapping[index] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + bias = offset_mapping[index - 1][1] + 1 # Includes [SEP] token + if mapping[0] == 0 and mapping[1] == 0: + continue + offset_mapping[index][0] += bias + offset_mapping[index][1] += bias + start_ids = [0 for x in range(max_seq_len)] + end_ids = [0 for x in range(max_seq_len)] + for item in example["result_list"]: + # Positioning at char granularity,offset_mapping indicates offset by char. 
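+        # The result offsets are character positions inside `content`; `bias`
+        # shifts them past the "[CLS] prompt [SEP]" prefix so that map_offset()
+        # can translate them into token indices. start_ids/end_ids then get 1.0
+        # at those token positions, i.e. the pointer-style start/end supervision
+        # used to train the extraction model.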
+ start = map_offset(item["start"] + bias, offset_mapping) + end = map_offset(item["end"] - 1 + bias, offset_mapping) + start_ids[start] = 1.0 + end_ids[end] = 1.0 + + tokenized_output = [ + encoded_inputs["input_ids"], + encoded_inputs["token_type_ids"], + encoded_inputs["position_ids"], + encoded_inputs["attention_mask"], + start_ids, + end_ids, + ] + tokenized_output = [np.array(x, dtype="int64") for x in tokenized_output] + return tuple(tokenized_output) + + +def map_offset(ori_offset, offset_mapping): + """ + map ori offset to token offset + """ + for index, span in enumerate(offset_mapping): + if span[0] <= ori_offset < span[1]: + return index + return -1 + + +def reader(data_path, max_seq_len=512): + """ + read json + """ + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + content = json_line["content"].strip() + prompt = json_line["prompt"] + # Model Input is aslike: [CLS] Prompt [SEP] Content [SEP] + # It include three summary tokens. + if max_seq_len <= len(prompt) + 3: + raise ValueError("The value of max_seq_len is too small, please set a larger value") + max_content_len = max_seq_len - len(prompt) - 3 + if len(content) <= max_content_len: + yield json_line + else: + result_list = json_line["result_list"] + json_lines = [] + accumulate = 0 + while True: + cur_result_list = [] + + for result in result_list: + if result["start"] + 1 <= max_content_len < result["end"]: + max_content_len = result["start"] + break + + cur_content = content[:max_content_len] + res_content = content[max_content_len:] + + while True: + if len(result_list) == 0: + break + elif result_list[0]["end"] <= max_content_len: + if result_list[0]["end"] > 0: + cur_result = result_list.pop(0) + cur_result_list.append(cur_result) + else: + cur_result_list = [result for result in result_list] + break + else: + break + + json_line = {"content": cur_content, "result_list": cur_result_list, "prompt": prompt} + json_lines.append(json_line) + + for result in result_list: + if result["end"] <= 0: + break + result["start"] -= max_content_len + result["end"] -= max_content_len + accumulate += max_content_len + max_content_len = max_seq_len - len(prompt) - 3 + if len(res_content) == 0: + break + elif len(res_content) < max_content_len: + json_line = {"content": res_content, "result_list": result_list, "prompt": prompt} + json_lines.append(json_line) + break + else: + content = res_content + + for json_line in json_lines: + yield json_line + + +def unify_prompt_name(prompt): + # The classification labels are shuffled during finetuning, so they need + # to be unified during evaluation. 
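+    # For example (illustrative): "情感倾向[负向,正向]" and "情感倾向[正向,负向]"
+    # normalize to the same prompt, because the options inside the brackets are
+    # de-duplicated and sorted before being re-joined.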
+ if re.search(r"\[.*?\]$", prompt): + prompt_prefix = prompt[: prompt.find("[", 1)] + cls_options = re.search(r"\[.*?\]$", prompt).group()[1:-1].split(",") + cls_options = sorted(list(set(cls_options))) + cls_options = ",".join(cls_options) + prompt = prompt_prefix + "[" + cls_options + "]" + return prompt + return prompt + + +def get_relation_type_dict(relation_data): + def compare(a, b): + a = a[::-1] + b = b[::-1] + res = "" + for i in range(min(len(a), len(b))): + if a[i] == b[i]: + res += a[i] + else: + break + if res == "": + return res + elif res[::-1][0] == "的": + return res[::-1][1:] + return "" + + relation_type_dict = {} + added_list = [] + for i in range(len(relation_data)): + added = False + if relation_data[i][0] not in added_list: + for j in range(i + 1, len(relation_data)): + match = compare(relation_data[i][0], relation_data[j][0]) + if match != "": + match = unify_prompt_name(match) + if relation_data[i][0] not in added_list: + added_list.append(relation_data[i][0]) + relation_type_dict.setdefault(match, []).append(relation_data[i][1]) + added_list.append(relation_data[j][0]) + relation_type_dict.setdefault(match, []).append(relation_data[j][1]) + added = True + if not added: + added_list.append(relation_data[i][0]) + suffix = relation_data[i][0].rsplit("的", 1)[1] + suffix = unify_prompt_name(suffix) + relation_type_dict[suffix] = relation_data[i][1] + return relation_type_dict diff --git a/applications/sentiment_analysis/unified_sentiment_extraction/visual_analysis.py b/applications/sentiment_analysis/unified_sentiment_extraction/visual_analysis.py new file mode 100644 index 0000000000000000000000000000000000000000..bba221429f28381b80448ee73378c1ab1b90669a --- /dev/null +++ b/applications/sentiment_analysis/unified_sentiment_extraction/visual_analysis.py @@ -0,0 +1,663 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import shutil +from collections import defaultdict +from operator import itemgetter + +import matplotlib.pyplot as plt +import wordcloud +from utils import load_json_file + +from paddlenlp.taskflow.utils import download_file +from paddlenlp.utils.log import logger + +# make sure that the font 'SimHei' is installed in system +plt.rcParams["font.sans-serif"] = ["SimHei"] +plt.rcParams["axes.unicode_minus"] = False + +URLS = { + "SimHei": [ + "https://paddlenlp.bj.bcebos.com/applications/sentiment_analysis/SimHei.ttf", + "c9c9de86d3fa7c4af0d3f1269bb2dff2", + ], +} + +PROMPT_ITEMS = { + "aspect_prompt": "评价维度", + "opinion_prompt": "观点词", + "sentiment_prompt_prefix": "情感倾向", + "separator": "##", + "not_mentioned_option": "未提及", + "positive_option": "正向", + "negative_option": "负向", +} + + +class VisualSentiment(object): + """ + A tool class for visualing sentiment analysis results. 
+ """ + + def __init__(self, font_path=None): + if font_path is not None: + if not os.path.isfile(font_path): + raise ValueError("The param font_path passed in may not be a file: {}".format(font_path)) + self.font_path = font_path + else: + default_name = "SimHei" + save_dir = os.path.dirname(__file__) + download_file(save_dir, default_name + ".ttf", URLS[default_name][0], URLS[default_name][1]) + self.font_path = os.path.join(save_dir, default_name + ".ttf") + + self.wc = wordcloud.WordCloud(font_path=self.font_path, background_color="white", width=800, height=400) + plt.figure(figsize=(8, 6)) + + def _plot_wordcloud(self, content_freq, save_path): + """ + plot wordcloud image. + + Args: + content_freq (dict): a content dict with frequency, the key is content and its value is frequency. + save_path (str): path that the image is saved to. + """ + + text_list = [] + for item in content_freq: + text_list.extend([item] * content_freq[item]) + random.shuffle(text_list) + text = " ".join(text_list) + + self.wc.generate(text) + self.wc.to_file(save_path) + + def _plot_histogram( + self, content_freq, save_path, with_line_chart="true", top_n=15, plt_title="", plt_xlabel="", plt_ylabel="" + ): + """ + generate histogram image. one aspect corresponds to one bar. + + Args: + content_freq (dict): a content dict with frequency, the key is content and its value is frequency. + save_path (str): path that the image is saved to. + with_line_chart (bool): Whether to plot line chart, only work when image_type is set be histogram. + top_n (int): show top_n of frequency of contents, only work when image_type is set be histogram. + plt_title (str): the title of image, only work when image_type is set be histogram. + plt_xlabel (str): the 'x' axis label of image, only work when image_type is set be histogram. + plt_ylabel (str): the 'y' axis label of image, only work when image_type is set be histogram. + """ + + content_freq_items = content_freq.items() + content_freq_items = sorted(content_freq_items, key=lambda x: x[1], reverse=True) + content_freq_items = content_freq_items[:top_n] + + x_data = [item[0] for item in content_freq_items] + y_data = [item[1] for item in content_freq_items] + + for i in range(len(x_data)): + plt.bar(x_data[i], y_data[i]) + + if with_line_chart: + plt.plot(x_data, y_data, "-") + plt.title(plt_title) + + plt.xlabel(plt_xlabel) + plt.ylabel(plt_ylabel) + plt.savefig(save_path) + plt.close() + + def _plot_content_with_frequency( + self, + content_freq, + save_path, + image_type="wordcloud", + with_line_chart="true", + top_n=15, + plt_title="", + plt_xlabel="", + plt_ylabel="", + ): + """ + generate image for specified content, such as aspect, opinion and so on. + two types of images are supported: wordcloud and histogram. + + Args: + content_freq (dict): a content dict with frequency, the key is content and its value is frequency. + save_path (str): path that the image is saved to. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + with_line_chart (bool): Whether to plot line chart, only work when image_type is set be histogram. + top_n (int): show top_n of frequency of contents, only work when image_type is set be histogram. + plt_title (str): the title of image, only work when image_type is set be histogram. + plt_xlabel (str): the 'x' axis label of image, only work when image_type is set be histogram. + plt_ylabel (str): the 'y' axis label of image, only work when image_type is set be histogram. 
+ """ + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." + ) + + if image_type == "wordcloud": + self._plot_wordcloud(content_freq, save_path) + else: + self._plot_histogram( + content_freq, + save_path, + with_line_chart=with_line_chart, + top_n=top_n, + plt_title=plt_title, + plt_xlabel=plt_xlabel, + plt_ylabel=plt_ylabel, + ) + + def plot_aspect_with_frequency( + self, aspect_freq, save_path, image_type="wordcloud", with_line_chart="true", top_n=15 + ): + """ + generate image for aspect, two types of images are supported: wordcloud and histogram. + this method can help analyze which aspects of the product/service are more important to customers. + + Args: + aspect_freq (dict): an aspect dict with frequency, the key is aspect and its value is frequency. + save_path (str): path that the image is saved to. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + with_line_chart (bool): Whether to plot line chart, Only work when image_type is set be histogram. + top_n (int): show top_n of frequency of apsects, Only work when image_type is set be histogram. + """ + + if not aspect_freq: + raise ValueError("aspect_freq is empty, please check it.") + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." + ) + + if image_type == "wordcloud": + self._plot_content_with_frequency(aspect_freq, save_path, image_type=image_type) + else: + title = "The histogram of aspect/frequency" + xlabel = "aspect" + ylabel = "frequency" + + self._plot_content_with_frequency( + aspect_freq, + save_path, + image_type=image_type, + with_line_chart=with_line_chart, + top_n=top_n, + plt_title=title, + plt_xlabel=xlabel, + plt_ylabel=ylabel, + ) + + def plot_opinion_with_frequency( + self, opinion_freq, save_path, image_type="wordcloud", with_line_chart="true", top_n=15 + ): + """ + generate image for opinion, two types of images are supported: wordcloud and histogram. + this method can help analyze the whole impression of the product/service. + + Args: + opinion_freq (dict): an opinion dict with frequency, the key is opinion and its value is frequency. + save_path (str): path that the image is saved to. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + with_line_chart (bool): Whether to plot line chart, Only work when image_type is set be histogram. + top_n (int): show top_n of frequency of opinions, Only work when image_type is set be histogram. + """ + + if not opinion_freq: + raise ValueError("opinion_freq is empty, please check it.") + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." 
+ ) + + if image_type == "wordcloud": + self._plot_content_with_frequency(opinion_freq, save_path, image_type=image_type) + else: + title = "The histogram of opinion/frequency" + xlabel = "opinion" + ylabel = "frequency" + + self._plot_content_with_frequency( + opinion_freq, + save_path, + image_type=image_type, + with_line_chart=with_line_chart, + top_n=top_n, + plt_title=title, + plt_xlabel=xlabel, + plt_ylabel=ylabel, + ) + + def plot_aspect_with_opinion( + self, aspect_opinion, save_path, sentiment="all", image_type="wordcloud", with_line_chart="true", top_n=15 + ): + """ + generate image with aspect and opinion, that is, combining apsect with opinion to display the more specifical opinions of aspect. + this method can help you at two aspects: 1. mining custom's overall impression of products/services; 2. analyzing the quality of some aspect and improve it further. + + Args: + aspect_opinion (dict[dict] or dict): when sentiment set be "all", a expected dict containing aspect, opinion and its frequency, the key is aspect and its value is a dict containing the aspect's opinion and frequency. when sentiment set be "positive" or "netative", a expected dict containing aspect with opinion and frequency, the key is aspect with opinion and its value is frequency. + aspect_sentiment (dict[dict]): a dict containing aspect, sentiment and its frequency, the key is aspect and its value is a dict containing the aspect's sentiment and frequency. + save_path (str): path that the image is saved to. + sentiment (str): analyzing aspect with sentiment, Only "all", "positive" and "negative" are received. "positive" only analyzes positive aspects with opinions, "negative" only analyzes negative aspects with opinions, and "all" analyzes all apsects. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + with_line_chart (bool): Whether to plot line chart, Only work when image_type is set be histogram. + top_n (int): show top_n of frequency of opinions, Only work when image_type is set be histogram. + """ + + if not aspect_opinion: + raise ValueError("aspect_opinion is empty, please check it.") + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." + ) + + if sentiment not in ["all", "positive", "negative"]: + raise ValueError( + "Only 'all', 'positive' and 'negative' are received for sentiment, that is, you should set be in [all, positive, negative]." 
+ ) + + if sentiment == "all": + new_aspect_opinion = {} + + for aspect in aspect_opinion: + for opinion in aspect_opinion[aspect]: + key = aspect + opinion + new_aspect_opinion[key] = aspect_opinion[aspect][opinion] + aspect_opinion = new_aspect_opinion + + if image_type == "wordcloud": + self._plot_content_with_frequency(aspect_opinion, save_path, image_type=image_type) + else: + if sentiment == "all": + title = "The histogram of aspect with opinion/frequency" + else: + title = "The histogram of {} aspect with opinion/frequency".format(sentiment) + xlabel = "aspect with opinion" + ylabel = "frequency" + + self._plot_content_with_frequency( + aspect_opinion, + save_path, + image_type=image_type, + with_line_chart=with_line_chart, + top_n=top_n, + plt_title=title, + plt_xlabel=xlabel, + plt_ylabel=ylabel, + ) + + def plot_aspect_with_sentiment( + self, aspect_sentiment, save_path, image_type="wordcloud", top_n=0, descend_aspects=None + ): + """ + generate image with aspect and sentiment, that is, combining apsect and sentiment to display the sentiment of aspect. + This method can help you more intuitively analyze customers' direct impressions of aspects of products/services. + + Args: + aspect_sentiment (dict[dict]): a dict containing aspect, sentiment and its frequency, the key is aspect and its value is a dict containing the aspect's sentiment and frequency. + descend_aspects (dict): an aspect list, sorted by frequency in reverse order. + save_path (str): path that the image is saved to. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + top_n (int): show top_n of frequency of opinions, Only work when image_type is set be histogram. if top_n set be 0, it will plot all aspects in histogram. + """ + + if not aspect_sentiment: + raise ValueError("aspect_sentiment is empty, please check it.") + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." 
+ ) + + if image_type == "wordcloud": + new_aspect_opinion = {} + for aspect in aspect_sentiment: + for sentiment in aspect_sentiment[aspect]: + key = aspect + sentiment + new_aspect_opinion[key] = aspect_sentiment[aspect][sentiment] + self._plot_wordcloud(new_aspect_opinion, save_path) + else: + if top_n != 0 and descend_aspects is None: + raise ValueError("You should input the param descend_aspects when top_n != 0.") + + if top_n != 0: + keep_aspects = set(descend_aspects[:top_n]) + + aspects = [] + positives = [] + negatives = [] + for aspect, sentiment in aspect_sentiment.items(): + if top_n != 0 and aspect not in keep_aspects: + continue + aspects.append(aspect) + if "正向" in sentiment: + positives.append(sentiment["正向"]) + else: + positives.append(0) + if "负向" in sentiment: + negatives.append(sentiment["负向"]) + else: + negatives.append(0) + + total_width, n = 0.8, 2 + width = total_width / n + x_pos = [item - (total_width - width) / 2 for item in range(len(aspects))] + x_neg = [item + width for item in x_pos] + + plt.bar(x_pos, positives, width=width, label="positive") + plt.bar(x_neg, negatives, width=width, label="negative") + plt.title("The histogram of aspect/sentiment") + plt.xlabel("aspect") + plt.ylabel("sentiment frequency") + plt.xticks(x_pos, aspects) + plt.legend() + plt.savefig(save_path) + plt.close() + + def plot_opinion_with_aspect( + self, aspect, aspect_opinion, save_path, image_type="wordcloud", with_line_chart=True, top_n=15 + ): + """ + generate opinion image for given aspect. This method can help you analyzing opinions for given aspects. + + Args: + aspect (str): The set of aspect to analyze. + aspect_opinion (dict[dict] or dict): when sentiment set be "all", a expected dict containing aspect, opinion and its frequency, the key is aspect and its value is a dict containing the aspect's opinion and frequency. when sentiment set be "positive" or "netative", a expected dict containing aspect with opinion and frequency, the key is aspect with opinion and its value is frequency. + save_path (str): path that the image is saved to. + image_type (str): Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]. + with_line_chart (bool): Whether to plot line chart, Only work when image_type is set be histogram. + top_n (int): show top_n of frequency of opinions, Only work when image_type is set be histogram. + """ + + if not aspect_opinion: + raise ValueError("aspect_opinion is empty, please check it.") + + if aspect not in aspect: + raise ValueError("{} not in aspect_opinion, please check it.") + + if image_type not in ["wordcloud", "histogram"]: + raise ValueError( + "Only wordcloud and histogram are supported, that is, you should set be in [wordcloud, histogram]." + ) + + opinions = aspect_opinion[aspect] + opinion_items = sorted(opinions.items(), key=lambda x: x[1], reverse=True) + if top_n is not None: + opinion_items = opinion_items[:top_n] + + opinion_freq = {k: v for k, v in opinion_items} + + if image_type == "wordcloud": + self._plot_wordcloud(opinion_freq, save_path) + else: + title = "The opinion analysis for aspect [{}] ".format(aspect) + xlabel = "opinion" + ylabel = "frequency" + self._plot_histogram( + opinion_freq, + save_path, + with_line_chart=with_line_chart, + top_n=top_n, + plt_title=title, + plt_xlabel=xlabel, + plt_ylabel=ylabel, + ) + + def plot_sentence_sentiment(self, sentence_sentiment, save_path): + """ + generate image for sentence sentiment, only histogram are supported. 
+ this method can help analyze the customers' whole impression for product/service. + + Args: + sentence_sentiment (dict): an sentiment dict with frequency, the key is sentiment polarity and its value is frequency. + save_path (str): path that the image is saved to. + """ + + if not sentence_sentiment: + raise ValueError("sentence_sentiment is empty, please check it.") + + title = "The histogram of sentence sentiment" + xlabel = "sentiment polarity" + ylabel = "frequency" + + self._plot_histogram( + sentence_sentiment, save_path, with_line_chart=False, plt_title=title, plt_xlabel=xlabel, plt_ylabel=ylabel + ) + + +class SentimentResult(object): + """ + load and analyze result of sentiment analysis. + """ + + def __init__(self, file_path): + self.file_path = file_path + self.sentiment_prompt = PROMPT_ITEMS["sentiment_prompt"] + self.sentiment_prompt_prefix = PROMPT_ITEMS["sentiment_prompt_prefix"] + self.options = PROMPT_ITEMS["options"] + self.opinion_prompt = PROMPT_ITEMS["opinion_prompt"] + self.aspect_prompt = PROMPT_ITEMS["aspect_prompt"] + self.not_mentioned_option = PROMPT_ITEMS["not_mentioned_option"] + self.positive_option = PROMPT_ITEMS["positive_option"] + self.negative_option = PROMPT_ITEMS["negative_option"] + self.prompts = set() + # load the result of sentiment analysis + self.results = self._load_sentiment_result(file_path) + # define the parsing middle result for sentiment analysis + self.aspect_frequency = defaultdict(int) + self.opinion_frequency = defaultdict(int) + self.aspect_sentiment = defaultdict(dict) + self.aspect_opinion = defaultdict(dict) + self.aspect_opinion_positives = defaultdict(int) + self.aspect_opinion_negatives = defaultdict(int) + self.descend_aspects = [] + self.sentence_sentiment = defaultdict(int) + # start to parse sentiment result + self.parse_sentiment_result(self.results) + + def _load_sentiment_result(self, file_path): + return load_json_file(file_path) + + def _parse_aspect(self, aspect): + aspect_name = aspect["text"] + self.aspect_frequency[aspect_name] += 1 + if "relations" not in aspect: + return + + sentiment_name = None + if self.sentiment_prompt in aspect["relations"].keys(): + sentiment = aspect["relations"][self.sentiment_prompt][0] + sentiment_name = sentiment["text"] + if sentiment_name == self.not_mentioned_option: + sentiment_name = None + return + if sentiment_name not in self.aspect_sentiment[aspect_name]: + self.aspect_sentiment[aspect_name][sentiment_name] = 1 + else: + self.aspect_sentiment[aspect_name][sentiment_name] += 1 + + if self.opinion_prompt in aspect["relations"].keys(): + opinions = aspect["relations"][self.opinion_prompt] + for opinion in opinions: + opinion_name = opinion["text"] + self.opinion_frequency[opinion_name] += 1 + if opinion_name not in self.aspect_opinion[aspect_name]: + self.aspect_opinion[aspect_name][opinion_name] = 1 + else: + self.aspect_opinion[aspect_name][opinion_name] += 1 + + if sentiment_name is not None: + aspect_opinion_name = aspect_name + opinion_name + if sentiment_name == self.positive_option: + self.aspect_opinion_positives[aspect_opinion_name] += 1 + else: + self.aspect_opinion_negatives[aspect_opinion_name] += 1 + + self.prompts.update(aspect["relations"].keys()) + + def _parse_opinion(self, opinion): + opinion_name = opinion["text"] + self.opinion_frequency[opinion_name] += 1 + + def _parse_sentiment_polarity(self, sentiment): + sentiment_name = sentiment["text"] + self.sentence_sentiment[sentiment_name] += 1 + + def parse_one_result(self, result): + for key in result.keys(): + 
if key == self.aspect_prompt: + for aspect in result[self.aspect_prompt]: + self._parse_aspect(aspect) + elif key == self.opinion_prompt: + for opinion in result[self.opinion_prompt]: + self._parse_opinion(opinion) + elif key == self.sentiment_prompt: + sentiment = result[self.sentiment_prompt][0] + self._parse_sentiment_polarity(sentiment) + else: + raise ValueError( + "Unknown key {} for sentiment analysis, you can check it as follows: 1. whether the parameter task_type is right; 2. whether the sentiment prompt {} created by the parameter options matches with the prompt {} in the file of sentiment analysis results; 3. whether the aspect_prompt, opinion_prompt or sentiment prompt are right.".format( + key, self.sentiment_prompt, key + ) + ) + self.prompts.add(key) + + def parse_sentiment_result(self, results): + for result in results: + if not result: + continue + self.parse_one_result(result) + # parse descend_aspects + descend_aspects_items = sorted(self.aspect_frequency.items(), key=itemgetter(1), reverse=True) + self.descend_aspects = [item[0] for item in descend_aspects_items] + # check whether sentiment prompt is parsed correctly + for prompt in self.prompts: + if prompt.startswith(self.sentiment_prompt_prefix) and prompt != self.sentiment_prompt: + logger.warning( + "The visual images related to sentiment ploarity cannot be generated. Because the sentiment prompt {} created by the opinions you input cannot be match with the one {} in the file of sentiment analysis result.".format( + self.sentiment_prompt, prompt + ) + ) + + +def default_visual_analysis(args): + # checking generating environment + if os.path.exists(args.save_dir): + shutil.rmtree(args.save_dir) + os.makedirs(args.save_dir) + # update sentiment prompt according to task type + if args.options: + PROMPT_ITEMS["options"] = args.options + else: + if args.task_type == "ext": + PROMPT_ITEMS["options"] = [ + PROMPT_ITEMS["positive_option"], + PROMPT_ITEMS["negative_option"], + PROMPT_ITEMS["not_mentioned_option"], + ] + else: + PROMPT_ITEMS["options"] = [PROMPT_ITEMS["positive_option"], PROMPT_ITEMS["negative_option"]] + PROMPT_ITEMS["sentiment_prompt"] = PROMPT_ITEMS["sentiment_prompt_prefix"] + "[{}]".format( + ",".join(PROMPT_ITEMS["options"]) + ) + + # define sr to process the result of sentiment analysis + logger.info("Trying to parse sentiment analysis result: {}".format(args.file_path)) + sr = SentimentResult(args.file_path) + # define vs to visualize sentiment result + vs = VisualSentiment(font_path=args.font_path) + logger.info("Start to generate visual images of sentiment analysis for you.") + # visualize aspect with frequency + if args.task_type == "ext" and sr.aspect_frequency: + save_path = os.path.join(args.save_dir, "aspect_wc.png") + vs.plot_aspect_with_frequency(sr.aspect_frequency, save_path, image_type="wordcloud") + save_path = os.path.join(args.save_dir, "aspect_hist.png") + vs.plot_aspect_with_frequency(sr.aspect_frequency, save_path, image_type="histogram") + # visualize opinion with frequency + if args.task_type == "ext" and sr.opinion_frequency: + save_path = os.path.join(args.save_dir, "opinion_wc.png") + vs.plot_opinion_with_frequency(sr.opinion_frequency, save_path, image_type="wordcloud") + save_path = os.path.join(args.save_dir, "opinion_hist.png") + vs.plot_opinion_with_frequency(sr.opinion_frequency, save_path, image_type="histogram") + # visualize aspect and opinion + if args.task_type == "ext" and sr.aspect_opinion: + save_path = os.path.join(args.save_dir, "aspect_opinion_wc.png") + 
vs.plot_aspect_with_opinion(sr.aspect_opinion, save_path, image_type="wordcloud", sentiment="all") + save_path = os.path.join(args.save_dir, "aspect_opinion_hist.png") + vs.plot_aspect_with_opinion(sr.aspect_opinion, save_path, image_type="histogram", sentiment="all", top_n=8) + # visualize positive aspect and opinion + if args.task_type == "ext" and sr.aspect_opinion_positives: + save_path = os.path.join(args.save_dir, "aspect_opinion_wc_pos.png") + vs.plot_aspect_with_opinion( + sr.aspect_opinion_positives, save_path, image_type="wordcloud", sentiment="positive" + ) + save_path = os.path.join(args.save_dir, "aspect_opinion_hist_pos.png") + vs.plot_aspect_with_opinion( + sr.aspect_opinion_positives, save_path, image_type="histogram", sentiment="positive", top_n=8 + ) + # visualize negative aspect and opinion + if args.task_type == "ext" and sr.aspect_opinion_negatives: + save_path = os.path.join(args.save_dir, "aspect_opinion_wc_neg.png") + vs.plot_aspect_with_opinion( + sr.aspect_opinion_negatives, save_path, image_type="wordcloud", sentiment="negative" + ) + save_path = os.path.join(args.save_dir, "aspect_opinion_hist_neg.png") + vs.plot_aspect_with_opinion( + sr.aspect_opinion_negatives, save_path, image_type="histogram", sentiment="negative", top_n=8 + ) + # visualize aspect and sentiment + if args.task_type == "ext" and sr.aspect_sentiment: + save_path = os.path.join(args.save_dir, "aspect_sentiment_wc.png") + vs.plot_aspect_with_sentiment(sr.aspect_sentiment, save_path, image_type="wordcloud") + save_path = os.path.join(args.save_dir, "aspect_sentiment_hist.png") + vs.plot_aspect_with_sentiment( + sr.aspect_sentiment, save_path, image_type="histogram", top_n=15, descend_aspects=sr.descend_aspects + ) + # visualize sentiment polarity for sentence + if args.task_type == "cls" and sr.sentence_sentiment: + save_path = os.path.join(args.save_dir, "sentence_sentiment.png") + vs.plot_sentence_sentiment(sr.sentence_sentiment, save_path) + + if not os.listdir(args.save_dir): + logger.info( + "Nothing generated for task {}, please check that you input the correct parameter task_type or the result of sentiment analysis.".format( + args.task_type + ) + ) + else: + logger.info("Visual images for sentiment analysis has been saved to: {}".format(args.save_dir)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--file_path", required=True, type=str, help="The result path of sentiment analysis.") + parser.add_argument("--save_dir", default="./images", type=str, help="The saving path of images.") + parser.add_argument("--font_path", default=None, type=str, help="The font Path for showing Chinese in wordcloud.") + parser.add_argument("--task_type", choices=['ext', 'cls'], default="ext", type=str, help="Two task types [ext, cls] are supported, ext represents the aspect-based extraction task and cls represents the sentence-level classification task, defaults to ext.") + parser.add_argument("--options", type=str, nargs="+", help="Used only for the classification task, the options for classification") + + args = parser.parse_args() + # ypdf: enable + + default_visual_analysis(args) diff --git a/applications/speech_cmd_analysis/README.md b/applications/speech_cmd_analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9e56c6a4b30505ec12edae7d4e4ca02cdc52a5e4 --- /dev/null +++ b/applications/speech_cmd_analysis/README.md @@ -0,0 +1,216 @@ +# 智能语音指令解析 (Speech Command Analysis) + +## 1. 
项目说明 + +**智能语音指令解析**集成了业界领先的语音识别(Automatic Speech Recognition, ASR)、信息抽取(Information Extraction, IE)等技术,打造智能一体化的语音指令系统,广泛应用于智能语音填单、智能语音交互、智能语音检索、手机APP语音唤醒等场景,提高人机交互效率。 + +其中,**智能语音填单**允许用户通过**口述**的方式记录信息,利用**算法**解析口述内容中的关键信息,完成**自动信息录入**。 + +#### 场景痛点 + +- 电话分析:边询问边记录,关键信息遗漏。例如,社区疫情防控信息记录员需要边通电话边记录关键信息,重点信息不突出,人工二次审核成本高。 +- 工单生成:特定场景,无法完成文字录入。例如,电力路线巡检工作人员在高空巡检高压电线路,不便即时文字记录,滞后记录可能导致信息遗漏。 +- 信息登记:重复性的工作,效率低易出错。例如,某品牌汽车售后客服话务员每天接听约300通电话,重复性工作耗时长,易出错。 + +针对以上场景,应用Baidu大脑AI开放平台[短语音识别标准版](https://ai.baidu.com/tech/speech/asr)和[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)的信息抽取技术,可以自动识别和抽取语音中的关键信息,帮助相关人员简化记录流程,提高工作效率和质量。 +另外,通过构造小样本优化信息抽取模型,能够获得更加准确的场景定制化效果。 + +#### 方案选型 + +- **语音识别模型** + Baidu大脑AI开放平台[短语音识别标准版](https://ai.baidu.com/tech/speech/asr)采用领先国际的流式端到端语音语言一体化建模方法,融合百度自然语言处理技术,近场中文普通话识别准确率达98%。根据语音内容理解可以将数字序列、小数、时间、分数、基础运算符正确转换为数字格式,使得识别的数字结果更符合使用习惯,直观自然。 + +- **信息抽取模型** + [Universal Information Extraction, UIE](https://arxiv.org/pdf/2203.12277.pdf): Yaojie Lu等人在2022年提出了开放域信息抽取的统一框架,这一框架在实体抽取、关系抽取、事件抽取、情感分析等任务上都有着良好的泛化效果。本应用基于这篇工作的prompt设计思想,提供了以ERNIE为底座的阅读理解型信息抽取模型,用于关键信息抽取。同时,针对不同场景,支持通过构造小样本数据来优化模型效果,快速适配特定的关键信息配置。 + + +## 2. 安装说明 + +#### 环境要求 + +- paddlepaddle >= 2.2.0 +- paddlenlp >= 2.3.0 + +安装相关问题可参考[PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)和[PaddleNLP](https://paddlenlp.readthedocs.io/zh/latest/get_started/installation.html)文档。 + +#### 可选依赖 + +- 若要使用音频文件格式转换脚本,则需安装依赖``ffmpeg``和``pydub``。 + +``` +git clone https://git.ffmpeg.org/ffmpeg.git ffmpeg +cd ffmpeg +./configure +make +make install +pip install pydub +``` + +## 3. 数据准备 + +本应用来自于语音报销工单信息录入场景,即员工向公司报销部门提出交通费报销的口头申请,在传统场景下,报销审核人员需要人工将语音转换为文字信息,并从中抽取记录报销需要的``时间``、``出发地``、``目的地``和``费用``字段,而在本应用可以端到端的完成这一工作。相应的数据集为[语音报销工单数据](https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl),共50条标注数据,用于信息抽取模型在交通费报销场景下的优化,示例数据如下: + +```json +{"id": 39, "text": "10月16日高铁从杭州到上海南站车次d5414共48元", "relations": [], "entities": [{"id": 90, "start_offset": 0, "end_offset": 6, "label": "时间"}, {"id": 77, "start_offset": 9, "end_offset": 11, "label": "出发地"}, {"id": 91, "start_offset": 12, "end_offset": 16, "label": "目的地"}, {"id": 92, "start_offset": 24, "end_offset": 26, "label": "费用"}]} +``` + +其中抽取的目标(schema)表示为: + +```python +schema = ['出发地', '目的地', '费用', '时间'] +``` + +标注数据保存在同一个文本文件中,每条样例占一行且存储为``json``格式,其包含以下字段 +- ``id``: 样本在数据集中的唯一标识ID。 +- ``text``: 语音报销工单的原始文本数据。 +- ``entities``: 数据中包含的实体标签,每个实体标签包含四个字段: + - ``id``: 实体在数据集中的唯一标识ID,不同样本中的相同实体对应同一个ID。 + - ``start_offset``: 实体的起始token在文本中的下标。 + - ``end_offset``: 实体的结束token在文本中下标的下一个位置。 + - ``label``: 实体类型。 +- ``relations``: 数据中包含的关系标签(在语音报销工单应用中无关系标签),每个关系标签包含四个字段: + - ``id``: (关系主语,关系谓语,关系宾语)三元组在数据集中的唯一标识ID,不同样本中的相同三元组对应同一个ID。 + - ``from_id``: 关系主语实体对应的标识ID。 + - ``to_id``: 关系宾语实体对应的标识ID。 + - ``type``: 关系类型。 + +#### BaiduAI开放平台申请使用 + +- 注册账号。在[百度智能云](https://console.bce.baidu.com)注册账号并登陆。 +- 资源申请。平台提供了免费资源用于功能测试,打开[语音识别控制台](https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/overview/index),点击``领取免费资源``,勾选短语音识别后点击下方``0元领取``。 +- 创建应用。打开语音识别控制台,点击[创建应用](https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/app/create),填写必选项后点击``立即创建``。 +- 获取API Key和Secret Key。打开语音识别控制台,点击[管理应用](https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/app/list)即可查看应用对应的API Key和Secret Key。在运行本应用脚本时,设置这两个参数即可调用该平台的语音识别服务。 + +#### 音频格式转换 + +在语音报销工单信息录入的场景下,模型的输入为报销工单相关的音频文件。可以根据设备类型,选取合适的录音软件来录制音频文件,保存格式应为``.wav``数据格式。若音频文件格式不符,可以运行以下脚本进行转换: + +- 单个文件格式转换 + +``` +python audio_to_wav.py 
--audio_file sample.m4a --audio_format m4a --save_dir ./audios_wav/ +``` + +- 指定目录下所有文件格式转换 + +``` +python audio_to_wav.py --audio_file ./audios_raw/ --save_dir ./audios_wav/ +``` + +可配置参数包括 + +- ``audio_file``: 原始音频文件或者所在目录。若设置为目录,则对该目录下所有音频文件进行格式转换。 +- ``audio_format``: 原始音频文件格式(可选),支持``mp3``, ``m4a``。若未设置,则根据文件扩展名对支持的两种音频文件进行格式转换。 +- ``save_dir``: 转换后``.wav``格式文件的存储目录,文件名称与原始音频保持一致。 + +#### 自定义数据标注 + +对于不同的应用场景,关键信息的配置多种多样,直接应用通用信息抽取模型的效果可能不够理想。这时可以标注少量场景相关的数据,利用few-shot learning技术来改进特定场景下的信息抽取效果。在本应用场景中,标注数据为[语音报销工单数据](https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl)。针对其他场景,可使用[doccano](https://github.com/doccano/doccano)平台标注并导出自定义数据。 + + +## 4. 模型训练 + +针对特定场景下的关键信息配置,需要使用标注数据对通用信息抽取模型进行训练以优化抽取效果。 + +#### 代码结构 + +```shell +. +├── audio_to_wav.py # 音频文件格式转换脚本 +├── pipeline.py # 语音指令解析脚本 +├── preprocess.py # 数据预处理脚本 +├── finetune.py # 信息抽取模型 fine-tune 脚本 +├── model.py # 信息抽取模型(UIE)组网脚本 +└── utils.py # 辅助函数 +``` + +#### 数据预处理 + +下载[语音报销工单数据](https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl),存储在``./data/``目录下。执行以下脚本,按设置的比例划分数据集,同时构造负样本用于提升模型的学习效果。 + +```shell +python preprocess.py \ + --input_file ./data/audio-expense-account.jsonl \ + --save_dir ./data/ \ + --negative_ratio 5 \ + --splits 0.2 0.8 0.0 \ + --seed 1000 +``` + +可配置参数包括 + +- ``input_file``: 标注数据文件名。数据格式应与[语音报销工单数据](https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl)一致。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。若``splits``为空,则数据存储在``train.txt``文件,若``splits``为长度为3的列表,则数据存储在目录下的``train.txt``、``dev.txt``、``test.txt``文件。 +- ``negative_ratio``: 负样本与正样本的比例。使用负样本策略可提升模型效果,负样本数量 = negative_ratio * 正样本数量。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. + + +#### 定制化模型训练 + +运行以下命令,使用单卡训练自定义的UIE模型。 + +```shell +CUDA_VISIBLE_DEVICES=0 python finetune.py \ + --train_path ./data/train.txt \ + --dev_path ./data/dev.txt \ + --save_dir ./checkpoint \ + --model uie-base \ + --learning_rate 1e-5 \ + --batch_size 16 \ + --max_seq_len 512 \ + --num_epochs 50 \ + --seed 1000 \ + --logging_steps 10 \ + --valid_steps 10 \ + --device gpu +``` + +可配置参数包括 + +- `train_path`: 训练集文件路径。 +- `dev_path`: 验证集文件路径。 +- `save_dir`: 模型存储路径,默认为`./checkpoint`。 +- ``init_from_ckpt``: 可选,模型参数路径,热启动模型训练。默认为None。 +- `learning_rate`: 学习率,默认为1e-5。 +- `batch_size`: 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `num_epochs`: 训练轮数,默认为100。 +- `model`: 选择模型,程序会基于选择的模型进行模型微调,可选有`uie-base`和`uie-tiny`。 +- `seed`: 随机种子,默认为1000. +- `logging_steps`: 日志打印的间隔steps数,默认为10。 +- `valid_steps`: evaluate的间隔steps数,默认为100。 +- `device`: 模型训练使用的设备,可选cpu或gpu。 + + +## 5. 
模型预测 + +预测时使用的schema应与finetune阶段训练数据的schema保持一致以得到更好的效果。在语音报销工单信息录入场景下, +- 首先准备好``.wav``格式的音频文件,例如下载[sample.wav](https://bj.bcebos.com/paddlenlp/applications/speech-cmd-analysis/sample.wav)放在``./audios_wav/``目录下。 +- 然后在BaiduAI开放平台创建语音识别应用以获取API Key和Secret Key。 +- 最后加载用场景数据finetune后的模型参数,执行语音指令解析脚本即可抽取报销需要的``时间``、``出发地``、``目的地``和``费用``字段。具体命令如下 + +```shell +python pipeline.py \ + --api_key '4E1BG9lTnlSeIf1NQFlrxxxx' \ + --secret_key '544ca4657ba8002e3dea3ac2f5fxxxxx' \ + --audio_file ./audios_wav/sample.wav \ + --uie_model ./checkpoint/model_best/ \ + --schema '时间' '出发地' '目的地' '费用' +``` + +可配置参数包括 + +- ``api_key``: BaiduAI开放平台上创建应用的API Key。 +- ``secret_key``: BaiduAI开放平台上创建应用的Secret Key。 +- ``audio_file``: ``.wav``格式音频文件路径。 +- ``uie_model``: 预测使用的模型参数文件所在路径。默认为None,即使用通用的预训练UIE模型。 +- ``schema``: 关键实体信息配置。默认为语音报销工单场景下的四个关键字段。 + + +## 6. 模型部署 + +在应用中提供了基于Web的部署Demo方案,支持用户在网页录入语音进行预测。用户可根据实际情况参考实现。 + +![demo](https://user-images.githubusercontent.com/25607475/165510522-a7f5f131-cd3f-4855-8932-6d8b6a7bb913.png) diff --git a/applications/speech_cmd_analysis/audio_to_wav.py b/applications/speech_cmd_analysis/audio_to_wav.py new file mode 100644 index 0000000000000000000000000000000000000000..a0f4aa631976e78b9aefc8772897429602c79dc5 --- /dev/null +++ b/applications/speech_cmd_analysis/audio_to_wav.py @@ -0,0 +1,68 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse + +from pydub import AudioSegment + +if __name__ == "__main__": + parser = argparse.ArgumentParser("Convert other audio formats to wav.") + parser.add_argument("--audio_file", type=str, required=True, help="Path to source audio.") + parser.add_argument("--wav_file", type=str, default=None, help="Path to save .wav file.") + parser.add_argument("--audio_format", type=str, default=None, help="The file extension.") + args = parser.parse_args() + + supported = ["mp3", "m4a", "wav"] + if args.audio_format is not None: + if args.audio_format not in supported: + raise ValueError(".%s format file is not supported!" % args.audio_format) + supported = [args.audio_format] + print("All %s format files are converted to .wav format..." % ", ".join(supported)) + + if os.path.isfile(args.audio_file): + src_files = [args.audio_file] + if args.audio_format is not None: + if args.audio_format != args.audio_file.strip().split(".")[-1]: + raise ValueError( + "Ignore audio_format %s! It is not consistent with the format of audio_file %s." + % (args.audio_format, args.audio_file) + ) + elif os.path.isdir(args.audio_file): + src_files = [x for x in os.listdir(args.audio_file) if x.strip().split(".")[-1] in supported] + src_files = [os.path.join(args.audio_file, x) for x in src_files] + else: + raise Exception("%s is neither valid path nor file!" 
% args.audio_file) + + if args.wav_file is None: + wav_files = [os.path.basename(x)[:-3] + "wav" for x in src_files] + elif os.path.isfile(args.wav_file): + if len(src_files) == 1: + wav_files = [args.wav_file] + else: + raise Exception( + "All audios in %s will overwrite the same file %s! \ + Please check it." + % (args.audio_file, args.wav_file) + ) + else: + if not os.path.exists(args.wav_file): + os.makedirs(args.wav_file) + wav_files = [os.path.join(args.wav_file, os.path.basename(x)[:-3] + "wav") for x in src_files] + + for src_file, wav_file in zip(src_files, wav_files): + audio = AudioSegment.from_file(src_file, src_file[-3:]) + wav_audio = audio.export(wav_file, format="wav") + + print("%d files converted!" % len(src_files)) diff --git a/applications/speech_cmd_analysis/finetune.py b/applications/speech_cmd_analysis/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..692c0648a2f289ae6174a369523e381b95fe5c05 --- /dev/null +++ b/applications/speech_cmd_analysis/finetune.py @@ -0,0 +1,127 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import paddle +from utils import convert_example, create_dataloader, evaluate, reader, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import SpanEvaluator +from paddlenlp.transformers import UIE, AutoTokenizer + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + tokenizer = AutoTokenizer.from_pretrained(args.model) + model = UIE.from_pretrained(args.model) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_ds = load_dataset(reader, data_path=args.train_path, max_seq_len=args.max_seq_len, lazy=False) + dev_ds = load_dataset(reader, data_path=args.dev_path, max_seq_len=args.max_seq_len, lazy=False) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len) + + train_data_loader = create_dataloader( + dataset=train_ds, mode="train", batch_size=args.batch_size, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader(dataset=dev_ds, mode="dev", batch_size=args.batch_size, trans_fn=trans_func) + + optimizer = paddle.optimizer.AdamW(learning_rate=args.learning_rate, parameters=model.parameters()) + + criterion = paddle.nn.BCELoss() + metric = SpanEvaluator() + + loss_list = [] + global_step = 0 + best_f1 = 0 + tic_train = time.time() + for epoch in range(1, args.num_epochs + 1): + for batch in train_data_loader: + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = 
paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + loss_start = criterion(start_prob, start_ids) + loss_end = criterion(end_prob, end_ids) + loss = (loss_start + loss_end) / 2.0 + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_list.append(float(loss)) + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + loss_avg = sum(loss_list) / len(loss_list) + print( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss_avg, args.logging_steps / time_diff) + ) + tic_train = time.time() + + if global_step % args.valid_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + model.save_pretrained(save_dir) + + precision, recall, f1 = evaluate(model, metric, dev_data_loader) + print("Evaluation precision: %.5f, recall: %.5f, F1: %.5f" % (precision, recall, f1)) + if f1 > best_f1: + print(f"best F1 performence has been updated: {best_f1:.5f} --> {f1:.5f}") + best_f1 = f1 + save_dir = os.path.join(args.save_dir, "model_best") + model.save_pretrained(save_dir) + tic_train = time.time() + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--train_path", default=None, type=str, help="The path of train set.") + parser.add_argument("--dev_path", default=None, type=str, help="The path of dev set.") + parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") + parser.add_argument("--max_seq_len", default=512, type=int, help="The maximum input sequence length. ") + parser.add_argument("--num_epochs", default=100, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization") + parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") + parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + parser.add_argument("--model", choices=["uie-base", "uie-tiny"], default="uie-base", type=str, help="Select the pretrained model for few-shot learning.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of model parameters for initialization.") + + args = parser.parse_args() + # yapf: enable + + do_train() diff --git a/applications/speech_cmd_analysis/pipeline.py b/applications/speech_cmd_analysis/pipeline.py new file mode 100644 index 0000000000000000000000000000000000000000..7f6f49780b722cc92e9bdb467181d28163dd299f --- /dev/null +++ b/applications/speech_cmd_analysis/pipeline.py @@ -0,0 +1,78 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# ## Task: Speech Command Analysis for Audio Expense Claim +# +# Structured information entry is a common application scenario of speech +# command analysis, where we can extract expected keywords from audios in +# an end-to-end way. This technique can economize on manpower and reduce +# error rates. + +import argparse +import json +import os +import pprint + +from tqdm import tqdm +from utils import mandarin_asr_api + +from paddlenlp import Taskflow + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--audio_file", type=str, required=True, help="The audio file name.") + parser.add_argument("--api_key", type=str, required=True, help="The app key applied on Baidu AI Platform.") + parser.add_argument( + "--secret_key", type=str, required=True, help="The app secret key generated on Baidu AI Platform." + ) + parser.add_argument("--uie_model", type=str, default=None, help="The path to uie model.") + parser.add_argument( + "--schema", + type=str, + nargs="+", + default=["时间", "出发地", "目的地", "费用"], + help="The type of entities expected to extract.", + ) + parser.add_argument( + "--save_file", type=str, default="./uie_results.txt", help="The path to save the recognised text and schemas." + ) + args = parser.parse_args() + + if os.path.isfile(args.audio_file): + audios = [args.audio_file] + elif os.path.isdir(args.audio_file): + audios = [x for x in os.listdir(args.audio_file)] + audios = [os.path.join(args.audio_file, x) for x in audios] + else: + raise Exception("%s is neither valid path nor file!" % args.audio_file) + + audios = [x for x in audios if x.endswith(".wav")] + if len(audios) == 0: + raise Exception("No valid .wav file! Please check %s." % args.audio_file) + + if args.uie_model is None: + parser = Taskflow("information_extraction", schema=args.schema) + else: + parser = Taskflow("information_extraction", schema=args.schema, task_path=args.uie_model) + + with open(args.save_file, "w") as fp: + for audio_file in tqdm(audios): + # automatic speech recognition + text = mandarin_asr_api(args.api_key, args.secret_key, audio_file) + # extract entities according to schema + result = parser(text) + fp.write(text + "\n") + fp.write(json.dumps(result, ensure_ascii=False) + "\n\n") + print(text) + pprint.pprint(result) diff --git a/applications/speech_cmd_analysis/preprocess.py b/applications/speech_cmd_analysis/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..ed46758acfc426aec3173652705c4a35655f0386 --- /dev/null +++ b/applications/speech_cmd_analysis/preprocess.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +import argparse +import json +import numpy as np + +from utils import set_seed, convert_ext_examples + + +def do_convert(): + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.input_file): + raise ValueError("Please input the correct path of doccano file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 0 and len(args.splits) != 3: + raise ValueError("Only []/ len(splits)==3 accepted for splits.") + + if args.splits and sum(args.splits) != 1: + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.input_file, "r", encoding="utf-8") as f: + raw_examples = f.readlines() + + def _create_ext_examples(examples, negative_ratio=0, shuffle=False): + entities, relations = convert_ext_examples(examples, negative_ratio) + examples = [e + r for e, r in zip(entities, relations)] + if shuffle: + indexes = np.random.permutation(len(examples)) + examples = [examples[i] for i in indexes] + return examples + + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + for x in example: + f.write(json.dumps(x, ensure_ascii=False) + "\n") + count += 1 + print("\nSave %d examples to %s." % (count, save_path)) + + if len(args.splits) == 0: + examples = _create_ext_examples(raw_examples, args.negative_ratio, args.is_shuffle) + _save_examples(args.save_dir, "train.txt", examples) + else: + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + raw_examples = [raw_examples[i] for i in indexes] + + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + + train_examples = _create_ext_examples(raw_examples[:p1], args.negative_ratio, args.is_shuffle) + dev_examples = _create_ext_examples(raw_examples[p1:p2]) + test_examples = _create_ext_examples(raw_examples[p2:]) + + _save_examples(args.save_dir, "train.txt", train_examples) + _save_examples(args.save_dir, "dev.txt", dev_examples) + _save_examples(args.save_dir, "test.txt", test_examples) + + print("Finished! It takes %.2f seconds" % (time.time() - tic_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--input_file", default="./data/data.json", type=str, help="The data file exported from doccano platform.") + parser.add_argument("--save_dir", default="./data", type=str, help="The path to save processed data.") + parser.add_argument("--negative_ratio", default=5, type=int, help="Used only for the classification task, the ratio of positive and negative samples, number of negtive samples = negative_ratio * number of positive samples") + parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. 
[0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.") + parser.add_argument("--is_shuffle", default=True, type=bool, help="Whether to shuffle the labeled dataset, defaults to True.") + parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") + + args = parser.parse_args() + # yapf: enable + + do_convert() diff --git a/applications/speech_cmd_analysis/utils.py b/applications/speech_cmd_analysis/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..97f02bc98e977fafef05a9f290ca63872afc1303 --- /dev/null +++ b/applications/speech_cmd_analysis/utils.py @@ -0,0 +1,409 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import random +import time +from urllib.error import URLError +from urllib.parse import urlencode +from urllib.request import Request, urlopen + +import numpy as np +import paddle +from tqdm import tqdm + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +class ASRError(Exception): + pass + + +def mandarin_asr_api(api_key, secret_key, audio_file, audio_format="wav"): + """Mandarin ASR + + Args: + audio_file (str): + Audio file of Mandarin with sampling rate 16000. + audio_format (str): + The file extension of audio_file, 'wav' by default. + + Please refer to https://github.com/Baidu-AIP/speech-demo for more demos. + """ + # Configurations. + TOKEN_URL = "http://aip.baidubce.com/oauth/2.0/token" + ASR_URL = "http://vop.baidu.com/server_api" + SCOPE = "audio_voice_assistant_get" + API_KEY = api_key + SECRET_KEY = secret_key + + # Fetch tokens from TOKEN_URL. + post_data = urlencode( + {"grant_type": "client_credentials", "client_id": API_KEY, "client_secret": SECRET_KEY} + ).encode("utf-8") + + request = Request(TOKEN_URL, post_data) + try: + result_str = urlopen(request).read() + except URLError as error: + print("token http response http code : " + str(error.code)) + result_str = error.read() + result_str = result_str.decode() + + result = json.loads(result_str) + if "access_token" in result.keys() and "scope" in result.keys(): + if SCOPE and (SCOPE not in result["scope"].split(" ")): + raise ASRError("scope is not correct!") + token = result["access_token"] + else: + raise ASRError( + "MAYBE API_KEY or SECRET_KEY not correct: " + "access_token or scope not found in token response" + ) + + # Fetch results by ASR api. + with open(audio_file, "rb") as speech_file: + speech_data = speech_file.read() + length = len(speech_data) + if length == 0: + raise ASRError("file %s length read 0 bytes" % audio_file) + params_query = urlencode({"cuid": "ASR", "token": token, "dev_pid": 1537}) + headers = {"Content-Type": "audio/%s; rate=16000" % audio_format, "Content-Length": length} + + url = ASR_URL + "?" 
+ params_query + request = Request(url, speech_data, headers) + try: + begin = time.time() + result_str = urlopen(request).read() + print("Request time cost %f" % (time.time() - begin)) + except URLError as error: + print("asr http response http code : " + str(error.code)) + result_str = error.read() + result_str = str(result_str, "utf-8") + result = json.loads(result_str) + + return result["result"][0] + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + """ + model.eval() + metric.reset() + for batch in data_loader: + input_ids, token_type_ids, att_mask, pos_ids, start_ids, end_ids = batch + start_prob, end_prob = model(input_ids, token_type_ids, att_mask, pos_ids) + start_ids = paddle.cast(start_ids, "float32") + end_ids = paddle.cast(end_ids, "float32") + num_correct, num_infer, num_label = metric.compute(start_prob, end_prob, start_ids, end_ids) + metric.update(num_correct, num_infer, num_label) + precision, recall, f1 = metric.accumulate() + model.train() + return precision, recall, f1 + + +def convert_example(example, tokenizer, max_seq_len): + """ + example: { + title + prompt + content + result_list + } + """ + encoded_inputs = tokenizer( + text=[example["prompt"]], + text_pair=[example["content"]], + stride=len(example["prompt"]), + truncation=True, + max_seq_len=max_seq_len, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_position_ids=True, + return_dict=False, + ) + encoded_inputs = encoded_inputs[0] + offset_mapping = [list(x) for x in encoded_inputs["offset_mapping"]] + bias = 0 + for index in range(len(offset_mapping)): + if index == 0: + continue + mapping = offset_mapping[index] + if mapping[0] == 0 and mapping[1] == 0 and bias == 0: + bias = index + if mapping[0] == 0 and mapping[1] == 0: + continue + offset_mapping[index][0] += bias + offset_mapping[index][1] += bias + start_ids = [0 for x in range(max_seq_len)] + end_ids = [0 for x in range(max_seq_len)] + for item in example["result_list"]: + start = map_offset(item["start"] + bias, offset_mapping) + end = map_offset(item["end"] - 1 + bias, offset_mapping) + start_ids[start] = 1.0 + end_ids[end] = 1.0 + + tokenized_output = [ + encoded_inputs["input_ids"], + encoded_inputs["token_type_ids"], + encoded_inputs["position_ids"], + encoded_inputs["attention_mask"], + start_ids, + end_ids, + ] + tokenized_output = [np.array(x, dtype="int64") for x in tokenized_output] + return tuple(tokenized_output) + + +def map_offset(ori_offset, offset_mapping): + """ + map ori offset to token offset + """ + for index, span in enumerate(offset_mapping): + if span[0] <= ori_offset < span[1]: + return index + return -1 + + +def reader(data_path, max_seq_len=512): + """ + read json + """ + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + content = json_line["content"] + prompt = json_line["prompt"] + # Model Input is aslike: [CLS] Prompt [SEP] Content [SEP] + # It include three summary tokens. 
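+            # If the content exceeds the remaining budget (max_seq_len - len(prompt) - 3),
+            # the example is split into several shorter chunks below: the split point is
+            # moved back so that no entity span in result_list is cut in half, and the
+            # start/end offsets of the remaining results are re-based for each new chunk.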
+ if max_seq_len <= len(prompt) + 3: + raise ValueError("The value of max_seq_len is too small, please set a larger value") + max_content_len = max_seq_len - len(prompt) - 3 + if len(content) <= max_content_len: + yield json_line + else: + result_list = json_line["result_list"] + json_lines = [] + accumulate = 0 + while True: + cur_result_list = [] + + for result in result_list: + if result["start"] + 1 <= max_content_len < result["end"]: + max_content_len = result["start"] + break + + cur_content = content[:max_content_len] + res_content = content[max_content_len:] + + while True: + if len(result_list) == 0: + break + elif result_list[0]["end"] <= max_content_len: + if result_list[0]["end"] > 0: + cur_result = result_list.pop(0) + cur_result_list.append(cur_result) + else: + cur_result_list = [result for result in result_list] + break + else: + break + + json_line = {"content": cur_content, "result_list": cur_result_list, "prompt": prompt} + json_lines.append(json_line) + + for result in result_list: + if result["end"] <= 0: + break + result["start"] -= max_content_len + result["end"] -= max_content_len + accumulate += max_content_len + max_content_len = max_seq_len - len(prompt) - 3 + if len(res_content) == 0: + break + elif len(res_content) < max_content_len: + json_line = {"content": res_content, "result_list": result_list, "prompt": prompt} + json_lines.append(json_line) + break + else: + content = res_content + + for json_line in json_lines: + yield json_line + + +def add_negative_example(examples, texts, prompts, label_set, negative_ratio): + with tqdm(total=len(prompts)) as pbar: + for i, prompt in enumerate(prompts): + negtive_sample = [] + redundants_list = list(set(label_set) ^ set(prompt)) + redundants_list.sort() + + if len(examples[i]) == 0: + continue + else: + actual_ratio = math.ceil(len(redundants_list) / len(examples[i])) + + if actual_ratio <= negative_ratio: + idxs = [k for k in range(len(redundants_list))] + else: + idxs = random.sample(range(0, len(redundants_list)), negative_ratio * len(examples[i])) + + for idx in idxs: + negtive_result = {"content": texts[i], "result_list": [], "prompt": redundants_list[idx]} + negtive_sample.append(negtive_result) + examples[i].extend(negtive_sample) + pbar.update(1) + return examples + + +def construct_relation_prompt_set(entity_name_set, predicate_set): + relation_prompt_set = set() + for entity_name in entity_name_set: + for predicate in predicate_set: + # The relation prompt is constructed as follows: + # subject + "的" + predicate + relation_prompt = entity_name + "的" + predicate + relation_prompt_set.add(relation_prompt) + return sorted(list(relation_prompt_set)) + + +def convert_ext_examples(raw_examples, negative_ratio): + texts = [] + entity_examples = [] + relation_examples = [] + entity_prompts = [] + relation_prompts = [] + entity_label_set = [] + entity_name_set = [] + predicate_set = [] + + print("Converting doccano data...") + with tqdm(total=len(raw_examples)) as pbar: + for line in raw_examples: + items = json.loads(line) + entity_id = 0 + if "data" in items.keys(): + text = items["data"] + entities = [] + for item in items["label"]: + entity = {"id": entity_id, "start_offset": item[0], "end_offset": item[1], "label": item[2]} + entities.append(entity) + entity_id += 1 + relations = [] + else: + text, relations, entities = items["text"], items["relations"], items["entities"] + texts.append(text) + + entity_example = [] + entity_prompt = [] + entity_example_map = {} + entity_map = {} # id to entity name + for 
entity in entities: + entity_name = text[entity["start_offset"] : entity["end_offset"]] + entity_map[entity["id"]] = { + "name": entity_name, + "start": entity["start_offset"], + "end": entity["end_offset"], + } + + entity_label = entity["label"] + result = {"text": entity_name, "start": entity["start_offset"], "end": entity["end_offset"]} + if entity_label not in entity_example_map.keys(): + entity_example_map[entity_label] = { + "content": text, + "result_list": [result], + "prompt": entity_label, + } + else: + entity_example_map[entity_label]["result_list"].append(result) + + if entity_label not in entity_label_set: + entity_label_set.append(entity_label) + if entity_name not in entity_name_set: + entity_name_set.append(entity_name) + entity_prompt.append(entity_label) + + for v in entity_example_map.values(): + entity_example.append(v) + + entity_examples.append(entity_example) + entity_prompts.append(entity_prompt) + + relation_example = [] + relation_prompt = [] + relation_example_map = {} + for relation in relations: + predicate = relation["type"] + subject_id = relation["from_id"] + object_id = relation["to_id"] + # The relation prompt is constructed as follows: + # subject + "的" + predicate + prompt = entity_map[subject_id]["name"] + "的" + predicate + result = { + "text": entity_map[object_id]["name"], + "start": entity_map[object_id]["start"], + "end": entity_map[object_id]["end"], + } + if prompt not in relation_example_map.keys(): + relation_example_map[prompt] = {"content": text, "result_list": [result], "prompt": prompt} + else: + relation_example_map[prompt]["result_list"].append(result) + + if predicate not in predicate_set: + predicate_set.append(predicate) + relation_prompt.append(prompt) + + for v in relation_example_map.values(): + relation_example.append(v) + + relation_examples.append(relation_example) + relation_prompts.append(relation_prompt) + pbar.update(1) + + print("Adding negative samples for first stage prompt...") + entity_examples = add_negative_example(entity_examples, texts, entity_prompts, entity_label_set, negative_ratio) + if len(predicate_set) != 0: + print("Constructing relation prompts...") + relation_prompt_set = construct_relation_prompt_set(entity_name_set, predicate_set) + + print("Adding negative samples for second stage prompt...") + relation_examples = add_negative_example( + relation_examples, texts, relation_prompts, relation_prompt_set, negative_ratio + ) + return entity_examples, relation_examples + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) diff --git a/applications/text_classification/README.md b/applications/text_classification/README.md new file mode 100644 index 0000000000000000000000000000000000000000..361773805eddafb5c07288b67100801ebe2e3828 --- /dev/null +++ b/applications/text_classification/README.md @@ -0,0 +1,292 @@ +# 文本分类应用 + +**目录** +- [1. 文本分类应用简介](#文本分类应用简介) +- [2. 技术特色介绍](#技术特色介绍) + - [2.1 文本分类方案全覆盖](#文本分类方案全覆盖) + - [2.2 更懂中文的训练基座](#更懂中文的训练基座) + - [2.3 高效模型调优方案](#高效模型调优方案) + - [2.4 产业级全流程方案](#产业级全流程方案) +- [3. 快速开始](#快速开始) +- [4. 
常用中文分类数据集](#常用中文分类数据集) + + + +## 1. 文本分类应用简介 +文本分类应用针对**多分类、多标签、层次分类等高频场景开源了产业级分类应用方案**,打通数据标注-模型训练-模型调优-模型压缩-预测部署全流程,旨在解决细分场景应用的痛点和难点,快速实现文本分类产品落地。 + +文本分类简单来说就是对给定的一个句子或一段文本使用分类模型分类。虽然文本分类在金融、医疗、法律、工业等领域都有广泛的成功实践应用,但如何选择合适的方案和预训练模型、数据标注质量差、效果调优困难、AI入门成本高、如何高效训练部署等问题使部分开发者望而却步。针对文本分类领域的痛点和难点,PaddleNLP文本分类应用提出了多种前沿解决方案,助力开发者简单高效实现文本分类数据标注、训练、调优、上线,降低文本分类落地技术门槛。 + +
+ 文本分类落地难点 +
+ +**文本分类应用技术特色:** + +- **方案全面🎓:** 涵盖多分类、多标签、层次分类等高频分类场景,提供预训练模型微调、提示学习(小样本学习)、语义索引三种端到端全流程分类方案,满足开发者多样文本分类落地需求。 +- **效果领先🏃:** 使用在中文领域内模型效果和模型计算效率有突出效果的ERNIE 3.0 轻量级系列模型作为训练基座,ERNIE 3.0 轻量级系列提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **高效调优✊:** 文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),提供模型分析模块助力开发者实现模型分析,并提供稀疏数据筛选、脏数据清洗、数据增强等多种解决方案。 +- **简单易用👶:** 开发者**无需机器学习背景知识**,仅需提供指定格式的标注分类数据,一行命令即可开启文本分类训练,轻松完成上线部署,不再让技术成为文本分类的门槛。 + + + +## 2. 技术特色介绍 + + + +### 2.1 文本分类方案全覆盖 + +
+ image +
+ +#### 2.1.1 分类场景齐全 + +文本分类应用涵盖多分类(multi class)、多标签(multi label)、层次分类(hierarchical)三种场景,接下来我们将以下图的新闻文本分类为例介绍三种分类场景的区别。 + +
+ image +
+ +- **多分类🚶:** 数据集的标签集含有两个或两个以上的类别,所有输入句子/文本有且只有一个标签。在文本多分类场景中,我们需要预测输入句子/文本最可能来自 `n` 个标签类别中的哪一个类别。以上图多分类中新闻文本为例,该新闻文本的标签为 `娱乐`。快速开启多分类任务参见 👉 [多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme) + +- **多标签👫 :** 数据集的标签集含有两个或两个以上的类别,输入句子/文本具有一个或多个标签。在文本多标签任务中,我们需要预测输入句子/文本可能来自 `n` 个标签类别中的哪几个类别。以上图多标签中新闻文本为例,该新闻文本具有 `相机` 和 `芯片` 两个标签。快速开启多标签任务参见 👉 [多标签指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label#readme) 。 + +- **层次分类👪 :** 数据集的标签集具有多级标签且标签之间具有层级结构关系,输入句子/文本具有一个或多个标签。在文本层次分类任务中,我们需要预测输入句子/文本可能来自于不同级标签类别中的某一个或几个类别。以上图层次分类中新闻文本为例(新闻为根节点),该新闻一级分类标签为 `体育`,二级分类标签为 `足球`。快速开启层次分类任务参见 👉 [层次分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical#readme) 。 + + +#### 2.1.2 多方案满足定制需求 + +#### 方案一:预训练模型微调 + +【方案选择】对于大多数任务,我们推荐使用**预训练模型微调作为首选的文本分类方案**,预训练模型微调提供了数据标注-模型训练-模型分析-模型压缩-预测部署全流程,有效减少开发时间,低成本迁移至实际应用场景。 + +【方案介绍】ERNIE 3.0 轻量级模型不能直接在文本分类任务上使用,预训练模型微调在预训练模型 `[CLS]` 输出向量后接入线性层作为文本分类器,用具体任务数据进行微调训练文本分类器,使预训练模型”更懂”这个任务。 + +【方案效果】下表展示在多标签任务CAIL2019—婚姻家庭要素提取数据集中ERNIE 3.0 系列轻量级模型效果评测。 + + +
+ +
+ + +【快速开始】 +- 快速开启多分类任务参见 👉 [预训练模型微调-多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme) +- 快速开启多标签分类任务参见 👉 [预训练模型微调-多标签分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label#readme) +- 快速开启层次分类任务参见 👉 [预训练模型微调-层次分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical#readme) + +#### 方案二:提示学习 + +【方案选择】提示学习(Prompt Learning)适用于**标注成本高、标注样本较少的文本分类场景**。在小样本场景中,相比于预训练模型微调学习,提示学习能取得更好的效果。对于标注样本充足、标注成本较低的场景,我们仍旧推荐使用充足的标注样本进行文本分类[预训练模型微调](#预训练模型微调)。 + +【方案介绍】**提示学习的主要思想是将文本分类任务转换为构造提示中掩码 `[MASK]` 的分类预测任务**,也即在掩码 `[MASK]`向量后接入线性层分类器预测掩码位置可能的字或词。提示学习使用待预测字的预训练向量来初始化分类器参数(如果待预测的是词,则为词中所有字的预训练向量平均值),充分利用预训练语言模型学习到的特征和标签文本,从而降低样本需求。提示学习同时提供[ R-Drop](https://arxiv.org/abs/2106.14448) 和 [RGL](https://aclanthology.org/2022.findings-naacl.81/) 策略,帮助提升模型效果。 + +我们以下图情感二分类任务为例来具体介绍提示学习流程,分类任务标签分为 `0:负向` 和 `1:正向` 。在文本加入构造提示 `我[MASK]喜欢。` ,将情感分类任务转化为预测掩码 `[MASK]` 的待预测字是 `不` 还是 `很`。具体实现方法是在掩码`[MASK]`的输出向量后接入线性分类器(二分类),然后用`不`和`很`的预训练向量来初始化分类器进行训练,分类器预测分类为 `0:不` 或 `1:很` 对应原始标签 `0:负向` 或 `1:正向`。而预训练模型微调则是在预训练模型`[CLS]`向量接入随机初始化线性分类器进行训练,分类器直接预测分类为 `0:负向` 或 `1:正向`。 + +
+ +
+ +【方案效果】我们比较预训练模型微调与提示学习在多分类、多标签、层次分类小样本场景的模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值),可以看到在样本较少的情况下,提示学习比预训练模型微调有明显优势。 + +
+ 文本分类落地难点 +
+ + + +【快速开始】 + +更多测评和使用细节详见各场景文档: +- 快速开启多分类任务参见 👉 [提示学习(小样本)-多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/few-shot#readme) +- 快速开启多标签分类任务参见 👉 [提示学习(小样本)-多标签分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label/few-shot#readme) +- 快速开启层次分类任务参见 👉 [提示学习(小样本)-层次分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical/few-shot#readme) + +#### 方案三:语义索引 + +【方案选择】基于语义索引的文本分类方案**适用于标签类别不固定的场景**,对于新增标签类别或新的相关分类任务无需重新训练,模型仍然能获得较好预测效果,方案具有良好的拓展性。 + +【方案介绍】语义索引目标是从海量候选召回集中快速、准确地召回一批与输入文本语义相关的文本。基于语义索引的文本分类方法具体来说是将标签集作为召回目标集,召回与输入文本语义相似的标签作为文本的标签类别。 + +
+ +
+ +【快速开始】 +- 快速开启多分类任务参见 👉 [语义索引-多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/retrieval_based#readme) +- 快速开启多标签分类任务参见 👉 [语义索引-多标签分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label/retrieval_based#readme) +- 快速开启层次分类任务参见 👉 [语义索引-层次分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical/retrieval_based#readme) + + + + + +### 2.2 更懂中文的训练基座 + +近年来,大量的研究表明在超大规模的语料采用无监督或者弱监督的方式训练模型,模型能够获得语言相关的知识。预训练模型学习到的文本语义表示能够避免从零开始训练模型,同时有利于下游自然语言处理(NLP)任务。预训练模型与具体的文本分类任务的关系可以直观地理解为,**预训练模型已经懂得了相关句法、语义的语言知识,用具体任务数据训练使得预训练模型”更懂”这个任务**,在预训练过程中学到的知识基础使学习文本分类任务事半功倍。 + +文本分类应用使用**ERNIE 3.0轻量级模型作为预训练模型**,ERNIE 3.0 轻量级模型是文心大模型ERNIE 3.0基础上通过在线蒸馏技术得到的轻量级模型。下面是ERNIE 3.0 效果-时延图,ERNIE 3.0 轻量级模型在精度和性能上的综合表现已全面领先于 UER-py、Huawei-Noah 以及 HFL 的中文模型,具体的测评细节可以见[ERNIE 3.0 效果和性能测评文档](../../model_zoo/ernie-3.0)。 + +
+ +
+ + + + + +### 2.3 高效模型调优方案 + +有这么一句话在业界广泛流传,"数据决定了机器学习的上限,而模型和算法只是逼近这个上限",可见数据质量的重要性。文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)开源了模型分析模块,针对标注数据质量不高、训练数据覆盖不足、样本数量少等文本分类常见数据痛点,提供稀疏数据筛选、脏数据清洗、数据增强三种数据优化方案,解决训练数据缺陷问题,用低成本方式获得大幅度的效果提升。 + + +- **稀疏数据筛选**基于特征相似度的实例级证据分析方法挖掘待预测数据中缺乏证据支持的数据(也即稀疏数据),并进行有选择的训练集数据增强或针对性筛选未标注数据进行标注来解决稀疏数据问题,有效提升模型表现。 +
+ 文本分类落地难点 +
+ +我们采用在多分类、多标签、层次分类场景中评测稀疏数据-数据增强策略和稀疏数据-数据标注策略,下图表明稀疏数据筛选方案在各场景能够有效提高模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值)。 + +
+ 文本分类落地难点 +
+ + +- **脏数据清洗**基于表示点方法的实例级证据分析方法,计算训练数据对模型的影响分数,分数高的训练数据表明对模型影响大,这些数据有较大概率为脏数据(标注错误样本)。脏数据清洗方案通过高效识别训练集中脏数据(也即标注质量差的数据),有效降低人力检查成本。 + +
+ 文本分类落地难点 +
+ +我们采用在多分类、多标签、层次分类场景中评测脏数据清洗方案,实验表明方案能够高效筛选出训练集中脏数据,提高模型表现(多分类精度为准确率,多标签和层次分类精度为Macro F1值)。 + +
+ 文本分类落地难点 +
+ + +- **数据增强**在数据量较少的情况下能够通过增加数据集多样性,提升模型效果。PaddleNLP内置[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),支持词替换、词删除、词插入、词置换、基于上下文生成词(MLM预测)、TF-IDF等多种数据增强策略。数据增强方案提供一行命令,快速完成数据集增强。以CAIL2019—婚姻家庭要素提取数据子集(500条)为例,我们在数据集应用多种数据增强策略,策略效果如下表。 + +
+ 文本分类落地难点 +
+ + +【快速开始】 + +更多使用方法和测评细节详见各场景模型分析模块: + +- 体验模型分析模块 👉 [多分类-模型分析模块](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/analysis) +- 体验模型分析模块 👉 [多标签-模型分析模块](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label/analysis) +- 体验模型分析模块 👉 [层次分类-模型分析模块](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical/analysis) + + + +### 2.4 产业级全流程方案 + +文本分类应用提供了简单易用的数据标注-模型训练-模型调优-模型压缩-预测部署全流程方案,我们将以预训练模型微调方案为例介绍文本分类应用的全流程: + +
+ image +
+
+ + 文本分类应用全流程示意图 + +
+ + +**1.数据准备阶段** + +- 我们根据文本分类任务选择对应的场景目录: [多分类场景目录](./multi_class)、 + [多标签场景目录](./multi_label)、[层次分类场景目录](./hierarchical)。 + +- 如果没有已标注的数据集,我们推荐doccano数据标注工具进行标注,详见[文本分类标注指南](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/doccano.md)。如果已有标注好的本地数据集,我们需要根据不同任务要求将数据集整理为文档要求的格式,详见各分类场景文档。 + +**2.模型训练** + +- 数据准备完成后,开始进行预训练模型微调训练。可以根据实际数据调整可配置参数,选择使用GPU或CPU进行模型训练,脚本默认保存在开发集最佳表现模型参数。 + +- 训练结束后,使用模型分析(analysis)模块对分析模型表现,同时模型分析(analysis)模块依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供稀疏数据筛选、脏数据清洗、数据增强三种优化方案帮助提升模型效果。 + +- 模型训练、调优完成后,可以通过预测脚本加载最佳模型参数,打印模型预测结果。 + +**3.模型部署** + +- 现实部署场景需要同时考虑模型的精度和性能表现,文本分类应用接入PaddleNLP 模型压缩 API 。采用了DynaBERT 中宽度自适应裁剪策略,对预训练模型多头注意力机制中的头(Head )进行重要性排序,保证更重要的头(Head )不容易被裁掉,然后用原模型作为蒸馏过程中的教师模型,宽度更小的模型作为学生模型,蒸馏得到的学生模型就是我们裁剪得到的模型。实验表明模型裁剪能够有效缩小模型体积、减少内存占用、提升推理速度。模型裁剪去掉了部分冗余参数的扰动,增加了模型的泛化能力,在部分任务中预测精度得到提高。 + +
+ image +
+ +- 模型部署需要将保存的最佳模型参数(动态图参数)导出成静态图参数,用于后续的推理部署。p.s.模型裁剪之后会默认导出静态图模型 + +- 文本分类应用提供了离线部署,并且支持在GPU设备使用FP16,在CPU设备使用动态量化的低精度加速推理;同时提供基于Paddle Serving的在线服务化部署,详见各分类场景文档中模型部署介绍。 + + + + +## 3. 快速开始 + +- 快速开启多分类 👉 [多分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class#readme) + +- 快速开启多标签分类 👉 [多标签指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label#readme) + +- 快速开启层次分类 👉 [层次分类指南](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical#readme) + + + +## 4. 常用中文分类数据集 + +**多分类数据集:** + +- [THUCNews新闻分类数据集](http://thuctc.thunlp.org/) + +- [百科问答分类数据集](https://github.com/brightmart/nlp_chinese_corpus#3%E7%99%BE%E7%A7%91%E7%B1%BB%E9%97%AE%E7%AD%94json%E7%89%88baike2018qa) + +- [头条新闻标题数据集TNEWS](https://github.com/aceimnorstuvwxz/toutiao-text-classfication-dataset) + +- [复旦新闻文本数据集](https://www.heywhale.com/mw/dataset/5d3a9c86cf76a600360edd04) + +- [IFLYTEK app应用描述分类数据集](https://storage.googleapis.com/cluebenchmark/tasks/iflytek_public.zip) + +- [CAIL 2022事件检测](https://cloud.tsinghua.edu.cn/d/6e911ff1286d47db8016/) + +**情感分类数据集(多分类):** + +- [亚马逊商品评论情感数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_amazon/intro.ipynb) + +- [财经新闻情感分类数据集](https://github.com/wwwxmu/Dataset-of-financial-news-sentiment-classification) + +- [ChnSentiCorp 酒店评论情感分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/tree/master/datasets/ChnSentiCorp_htl_all) + +- [外卖评论情感分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/waimai_10k/intro.ipynb) + +- [weibo情感二分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/weibo_senti_100k/intro.ipynb) + +- [weibo情感四分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/simplifyweibo_4_moods/intro.ipynb) + +- [商品评论情感分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/online_shopping_10_cats/intro.ipynb) + +- [电影评论情感分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/dmsc_v2/intro.ipynb) + +- [大众点评分类数据集](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/yf_dianping/intro.ipynb) + +**多标签数据集:** + +- [学生评语分类数据集](https://github.com/FBI1314/textClassification/tree/master/multilabel_text_classfication/data) + +- [CAIL2019婚姻要素识别](https://aistudio.baidu.com/aistudio/projectdetail/3996601) + +- [CAIL2018 刑期预测、法条预测、罪名预测](https://cail.oss-cn-qingdao.aliyuncs.com/CAIL2018_ALL_DATA.zip) + +**层次分类数据集:** + +- [头条新闻标题分类-TNEWS的升级版](https://github.com/aceimnorstuvwxz/toutiao-multilevel-text-classfication-dataset) + +- [网页层次分类数据集](https://csri.scu.edu.cn/info/1012/2827.htm) + +- [医学意图数据集(CMID)](https://github.com/liutongyang/CMID) + +- [2020语言与智能技术竞赛事件分类](https://github.com/percent4/keras_bert_multi_label_cls/tree/master/data) diff --git a/applications/text_classification/doccano.md b/applications/text_classification/doccano.md new file mode 100644 index 0000000000000000000000000000000000000000..16c79db92436d3a911b461cffa5b9f76052ecde2 --- /dev/null +++ b/applications/text_classification/doccano.md @@ -0,0 +1,240 @@ +# 文本分类任务doccano使用指南 + + **目录** + +* [1. 安装](#安装) +* [2. 项目创建](#项目创建) +* [3. 数据上传](#数据上传) +* [4. 标签构建](#标签构建) +* [5. 任务标注](#任务标注) +* [6. 数据导出](#数据导出) +* [7. 数据转换](#数据转换) + + + +## 1. 安装 +**以下标注示例用到的环境配置:** + +- Python 3.8+ +- doccano 1.6.2 + +在终端(terminal)使用pip安装doccano: + +```shell +pip install doccano==1.6.2 +``` +安装完成后,运行以下命令行: +```shell +# Initialize database. +doccano init +# Create a super user. 
+doccano createuser --username admin --password pass +# Start a web server. +doccano webserver --port 8000 +``` +在新的终端(terminal)运行如下命令行: +```shell +# Start the task queue to handle file upload/download. +doccano task +``` +在浏览器打开[http://127.0.0.1:8000/](http://127.0.0.1:8000/),输入用户名和密码登录,开始使用doccano进行标注。doccano支持中文版本,可以点击右上角选择ZH(中文)。 + +
+ +
+ +doccano还支持PostgreSQL、Docker、Docker Compose等安装方式,详情请参考[doccano官方文档](https://github.com/doccano/doccano) 完成doccano的安装与初始配置。 + + + + +## 2. 项目创建 + +文本分类支持多分类、多标签、层次分类三种类型的文本分类任务。 + +点击创建(Create)开始创建一个新的项目,选择文本分类,然后填写项目名称、描述、Tags等项目信息。如果是多分类任务或者是单路径层次分类任务,勾选 `Allow single label` ,勾选后标签标注只允许选择一个标签进行标注。点击创建成功创建一个doccano项目。 +
+ +
+ + + + +## 3. 数据上传 + +点击数据集-操作-导入数据集,开始导入本地待标注数据集: +
+ +
+ +doccano支持`TextFile`、`TextLine`、`JSONL`和`CoNLL`四种数据上传格式,文本分类本地数据集定制训练中**统一使用TextLine**这一文件格式,即上传的文件需要为txt等格式,且在数据标注时,该文件的每一行待标注文本显示为一页内容。 +上传的文件为txt等格式,每一行为一条待标注文本,示例: + +```text +黑苦荞茶的功效与作用及食用方法 +交界痣会凸起吗 +检查是否能怀孕挂什么科 +鱼油怎么吃咬破吃还是直接咽下去 +幼儿挑食的生理原因是 +... +``` + +上传数据类型**选择TextLine**,选择待标注文本或拖拽文本导入doccano项目中,点击导入,导入待标注数据集。 + +
+ +
+ + + + +## 4. 标签构建 + +点击标签-操作-创建标签,开始添加分类类别标签: +
+ +
+填入分类类别标签,选择标签颜色,建议不同标签选择不同颜色,最后点击保存或保存并添加下一个,保存标签: +
+ +
+文本分类标签构建示例: +
+ +
+ +**NOTE:** +我们默认层次分类标签不同层的标签之间具有关联性,以下图为例一个样本具有标签美短虎斑,我们默认还包含美国短毛猫和猫两个标签。 + +
+ +
+ +对于层次分类任务的分类标签我们建议使用标签层次结构中**叶结点标签路径作为标签**,以上图的标签结构为例,我们建议使用`##`作为分隔符,分隔不同层之间的标签: + +
+ +
+ + +## 5. 任务标注 + +标注示例,选择对应的分类类别标签,输入回车(Enter)键确认: + +
+ +
+ + + + +## 6. 数据导出 + +选择数据集-操作-导出数据集,将标注好的数据导出,我们默认所有数据集已经标注完成且正确: +
+ +
+ +选择导出的文件类型为``JSONL``,导出数据: + +
+ +
+ +导出数据示例: +```text +{"id": 23, "data": "黑苦荞茶的功效与作用及食用方法", "label": ["功效作用"]} +{"id": 24, "data": "交界痣会凸起吗", "label": ["疾病表述"]} +{"id": 25, "data": "检查是否能怀孕挂什么科", "label": ["就医建议"]} +{"id": 26, "data": "鱼油怎么吃咬破吃还是直接咽下去", "label": ["其他"]} +{"id": 27, "data": "幼儿挑食的生理原因是", "label": ["病因分析"]} +``` + +标注数据保存在同一个文本文件中,每条样例占一行且存储为``jsonl``格式,其包含以下字段 +- ``id``: 样本在数据集中的唯一标识ID。 +- ``data``: 原始文本数据。 +- ``label``: 文本对应类别标签。 + + + +## 7.数据转换 + +该章节详细说明如何通过`doccano.py`脚本对doccano平台导出的标注数据进行转换,一键生成训练/验证/测试集。当标注完成后,在 doccano 平台上导出 `JSON` 形式的文件,并将其重命名为 `doccano.jsonl`。 + + +### 7.1 多分类任务 +通过 [doccano.py](./doccano.py) 脚本进行数据形式转换,然后便可以按照[多分类文本任务指南](multi_class/README.md)中固定格式进行相应模型训练。 + +数据标注转化运行: + +```shell +python doccano.py \ + --doccano_file doccano.jsonl \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --task_type "multi_class" +``` + +稀疏数据识别出的有效标注请增加配置参数`--valid`,脏数据清洗的标注数据(文本中有脏数据标签)请增加配置参数`--dirty`,更多稀疏数据识别和脏数据清洗详见[多分类训练评估与模型优化指南](multi_class/analysis/README.md) + +### 7.2 多标签任务 +通过 [doccano.py](./doccano.py) 脚本进行数据形式转换,然后便可以按照[多标签文本分类任务指南](multi_label/README.md)中固定格式进行相应模型训练。 + +数据标注转化运行: + +```shell +python doccano.py \ + --doccano_file doccano.jsonl \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --task_type "multi_label" +``` + +稀疏数据识别出的有效标注请增加配置参数`--valid`,脏数据清洗的标注数据(文本中有脏数据标签)请增加配置参数`--dirty`,更多稀疏数据识别和脏数据清洗详见[多标签训练评估与模型优化指南](multi_label/analysis/README.md) + +### 7.3 层次分类任务 + +通过 [doccano.py](./doccano.py) 脚本进行数据形式转换,然后便可以按照[层次文本分类任务指南](hierarchical/README.md)中固定格式进行相应模型训练。 + +数据标注转化运行: + +```shell +python doccano.py \ + --doccano_file doccano.jsonl \ + --save_dir ./data \ + --splits 0.8 0.2 \ + --task_type "hierarchical" +``` + +稀疏数据识别出的有效标注请增加配置参数`--valid`,脏数据清洗的标注数据(文本中有脏数据标签)请增加配置参数`--dirty`,更多稀疏数据识别和脏数据清洗详见[层次分类训练评估与模型优化指南](hierarchical/analysis/README.md) +可配置参数说明: + +- ``doccano_file``: 从doccano导出的数据标注文件。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.2]表示按照``8:2``的比例将数据划分为训练集、验证集。 +- ``task_type``: 可选,选择任务类型,有多分类,多标签,层次分类三种类型的任务。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. +- ``separator``: 不同层标签之间的分隔符,该参数只对层次文本分类任务有效。默认为"##"。 +- ``valid``: 是否为稀疏数据筛选的有效标注数据,默认为False. +- ``dirty``: 是否为脏数据清洗策略标注数据,默认为False. + +转化后的doccano标注数据目录结构如下: + +```text +data/ +├── train.txt # 训练数据集文件 +├── dev.txt # 开发数据集文件 +├── test.txt # 测试训练集文件(可选,数据划分为 train/dev/test 数据集) +├── label.txt # 分类标签文件 +└── data.txt # 待预测数据文件 +``` + +备注: +- 默认情况下 [doccano.py](./doccano.py) 脚本会按照比例将数据划分成train/dev 数据集,也可以划分为 train/dev/test 数据集。 +- 脚本会自动生成data.txt,如果数据划分为 train/dev/test 数据集,data.txt则为test数据集无标签数据;如果数据划分为 train/dev 数据集,data.txt为无标签数据。**如果有未标注数据,则用未标注数据文件替换data.txt** +- 每次执行 [doccano.py](./doccano.py) 脚本,将会覆盖已有的同名数据文件 +- 对于从doccano导出的文件,默认文件中的每条数据都是经过人工正确标注的。 + +## References +- **[doccano](https://github.com/doccano/doccano)** diff --git a/applications/text_classification/doccano.py b/applications/text_classification/doccano.py new file mode 100644 index 0000000000000000000000000000000000000000..7c3d096556a084e4a9ebd955abb03a69ac99246f --- /dev/null +++ b/applications/text_classification/doccano.py @@ -0,0 +1,171 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import random +import time +from decimal import Decimal + +import numpy as np +import paddle +from tqdm import tqdm + +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--doccano_file", default="doccano.jsonl", type=str, help="The doccano file exported from doccano platform.") +parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.") +parser.add_argument("--splits", default=[0.8, 0.2], type=float, nargs="*", help="The ratio of samples in datasets. [0.8, 0.2] means 80% samples used for training, 20% for evaluation.") +parser.add_argument("--task_type", choices=['multi_class', 'multi_label', 'hierarchical'], default="multi_label", type=str, help="Select task type, multi_class for multi classification task, multi_label for multi label classification task and hierarchical for hierarchical classification, defaults to multi_label.") +parser.add_argument("--is_shuffle", default=True, type=bool, help="Whether to shuffle the labeled dataset, defaults to True.") +parser.add_argument("--seed", type=int, default=3, help="Random seed for initialization") +parser.add_argument("--separator", type=str, default="##", help="Separator for hierarchical classification") +parser.add_argument("--valid", action='store_true', help="Whether annotate valid data(extracted from sparse strategy)") +parser.add_argument("--dirty", action='store_true', help="Whether annotate dirty data(extracted from dirty data cleaning strategy)") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def do_convert(): + """ + Convert doccano jsonl to fixed format + """ + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.doccano_file): + raise ValueError("Please input the correct path of doccano file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 1 and len(args.splits) != 2 and len(args.splits) != 3: + raise ValueError("Only len(splits)==1 /len(splits)==2 / len(splits)==3 accepted for splits.") + + def _check_sum(splits): + if len(splits) == 2: + return Decimal(str(splits[0])) + Decimal(str(splits[1])) == Decimal("1") + if len(splits) == 3: + return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1") + + if not _check_sum(args.splits): + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.doccano_file, "r", encoding="utf-8") as f: + raw_examples = f.readlines() + f.close() + + examples = [] + label_list = [] + with tqdm(total=len(raw_examples)): + for line in raw_examples: + items = json.loads(line) + # Compatible with doccano >= 1.6.2 + if "data" in items.keys(): + text, labels = items["data"], items["label"] + else: + text, labels = items["text"], items["label"] + labels = list(set(labels)) + for l in labels: + if "," in l: + raise ValueError("There exists comma ',' in 
{}".format(l)) + + if args.task_type == "multi_label" or args.task_type == "multi_class": + if args.dirty: + text = " ".join(text.strip().split("\t")[:-1]) + else: + text = " ".join(text.strip().split("\t")) + example = text + "\t" + ",".join(labels) + "\n" + for l in labels: + if l not in label_list: + label_list.append(l) + if args.task_type == "hierarchical": + label_dict = [] + for label in labels: + level_labels = label.split(args.separator) + for i in range(len(level_labels)): + l = args.separator.join(level_labels[: i + 1]) + if l not in label_dict: + label_dict.append(l) + if l not in label_list: + label_list.append(l) + if args.dirty: + text = " ".join(text.strip().split("\t")[:-1]) + else: + text = " ".join(text.strip().split("\t")) + example = text + "\t" + ",".join(label_dict) + "\n" + examples.append(example) + + if not args.dirty and not args.valid: + save_path = os.path.join(args.save_dir, "label.txt") + with open(save_path, "w", encoding="utf-8") as f: + label_list = sorted(label_list) + for l in label_list: + f.write(l + "\n") + + def _save_examples(save_dir, file_name, examples, is_data=False): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + if is_data: + f.write(example.split("\t")[0] + "\n") + else: + f.write(example) + count += 1 + logger.info("Save %d examples to %s." % (count, save_path)) + + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + raw_examples = [raw_examples[i] for i in indexes] + + if len(args.splits) == 1: + if args.valid: + _save_examples(args.save_dir, "valid.txt", examples) + elif args.dirty: + _save_examples(args.save_dir, "train_dirty.txt", examples) + else: + _save_examples(args.save_dir, "train.txt", examples) + _save_examples(args.save_dir, "data.txt", examples, True) + elif len(args.splits) == 2: + i1, _ = args.splits + p1 = int(len(raw_examples) * i1) + _save_examples(args.save_dir, "train.txt", examples[:p1]) + _save_examples(args.save_dir, "dev.txt", examples[p1:]) + _save_examples(args.save_dir, "data.txt", examples[p1:], True) + elif len(args.splits) == 3: + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + _save_examples(args.save_dir, "train.txt", examples[:p1]) + _save_examples(args.save_dir, "dev.txt", examples[p1:p2]) + _save_examples(args.save_dir, "test.txt", examples[p2:]) + _save_examples(args.save_dir, "data.txt", examples[p2:], True) + logger.info("Finished! It takes %.2f seconds" % (time.time() - tic_time)) + + +if __name__ == "__main__": + do_convert() diff --git a/applications/text_classification/hierarchical/README.md b/applications/text_classification/hierarchical/README.md new file mode 100644 index 0000000000000000000000000000000000000000..98ae356572acc9d4ebe39d72edf838eb1a24e51d --- /dev/null +++ b/applications/text_classification/hierarchical/README.md @@ -0,0 +1,477 @@ +# 层次分类指南 + +**目录** +- [1. 层次分类简介](#层次分类简介) +- [2. 快速开始](#快速开始) + - [2.1 运行环境](#运行环境) + - [2.2 代码结构](#代码结构) + - [2.3 数据准备](#数据准备) + - [2.4 模型训练](#模型训练) + - [2.5 模型部署](#模型部署) + - [2.6 模型效果](#模型效果) + + + +## 1. 层次分类简介 + +本项目提供通用场景下**基于预训练模型微调的层次分类端到端应用方案**,打通数据标注-模型训练-模型调优-模型压缩-预测部署全流程,有效缩短开发周期,降低AI开发落地门槛。 + +层次文本分类任务的中数据样本具有多个标签且标签之间存在特定的层级结构,目标是**预测输入句子/文本可能来自于不同级标签类别中的某一个或几个类别**。以下图新闻文本分类为例,该新闻的一级标签为体育,二级标签为足球,体育与足球之间存在层级关系。在现实场景中,大量的数据如新闻分类、专利分类、学术论文分类等标签集合存在层次化结构,需要利用算法为文本自动标注更细粒度和更准确的标签。 + +
+ +
+
+ +**方案亮点:** + +- **效果领先🏃:** 使用在中文领域内模型效果和模型计算效率有突出效果的ERNIE 3.0 轻量级系列模型作为训练基座,ERNIE 3.0 轻量级系列提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **高效调优✊:** 文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),提供模型分析模块助力开发者实现模型分析,并提供稀疏数据筛选、脏数据清洗、数据增强等多种解决方案。 +- **简单易用👶:** 开发者**无需机器学习背景知识**,仅需提供指定格式的标注分类数据,一行命令即可开启文本分类训练,轻松完成上线部署,不再让技术成为文本分类的门槛。 + +**更多选择:** + +对于大多数层次分类任务,我们推荐使用预训练模型微调作为首选的文本分类方案,层次分类项目中还提供 提示学习(小样本)和语义索引的两种全流程文本分类方案满足不同开发者需求,更多技术细节请参见[文本分类技术特色介绍](../README.md)。 + +- 【标注成本高、标注样本较少的小样本场景】 👉 [提示学习层次分类方案](./few-shot#readme) + +- 【标签类别不固定场景、标签数量众多】 👉 [语义索引层次分类方案](./retrieval_based#readme) + + + +## 2. 快速开始 + +我们以[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取的多标签层次数据集为例,演示层次分类全流程方案使用。下载数据集: +```shell +wget https://paddlenlp.bj.bcebos.com/datasets/baidu_extract_2020.tar.gz +tar -zxvf baidu_extract_2020.tar.gz +mv baidu_extract_2020 data +rm baidu_extract_2020.tar.gz +``` + +
+ image +
+
+ + 层次分类数据标注-模型训练-模型分析-模型压缩-预测部署流程图 + +
+ + + +### 2.1 运行环境 + +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4.8 +- scikit-learn >= 1.0.2 + +**安装PaddlePaddle:** + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +**安装PaddleNLP:** + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(删去 -i https://mirror.baidu.com/pypi/simple),更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + + +**安装sklearn:** +```shell +python3 -m pip install scikit-learn==1.0.2 +``` + + + +### 2.2 代码结构 + +```text +hierarchical/ +├── few-shot # 小样本学习方案 +├── retrieval_based # 语义索引方案 +├── analysis # 分析模块 +├── deploy # 部署 +│   └── predictor # 离线部署 +│ ├── paddle_serving # PaddleServing在线服务化部署 +│   └── triton_serving # Triton在线服务化部署 +├── train.py # 训练评估脚本 +├── predict.py # 预测脚本 +├── export_model.py # 静态图模型导出脚本 +├── utils.py # 工具函数脚本 +├── metric.py # metric脚本 +├── prune.py # 裁剪脚本 +└── README.md # 使用说明 +``` + + + +### 2.3 数据准备 + +训练需要准备指定格式的标注数据集,如果没有已标注的数据集,可以参考 [数据标注指南](../doccano.md) 进行文本分类数据标注。指定格式本地数据集目录结构: + +```text +data/ +├── train.txt # 训练数据集文件 +├── dev.txt # 开发数据集文件 +├── test.txt # 测试数据集文件(可选) +├── label.txt # 分类标签文件 +└── data.txt # 待预测数据文件(可选) +``` + +**训练、开发、测试数据集文件:** 文本与标签类别名用tab符`'\t'`分隔开,标签中多个标签之间用英文逗号`','`分隔开,文本中避免出现tab符`'\t'`。 + +- train.txt/dev.txt/test.txt 文件格式: +```text +<文本>'\t'<标签>','<标签>','<标签> +<文本>'\t'<标签>','<标签> +... +``` + +- train.txt/dev.txt/test.txt 文件样例: +```text +又要停产裁员6000!通用汽车罢工危机再升级股价大跌市值蒸发近300亿! 组织行为,组织行为##罢工,组织关系,组织关系##裁员 +上海一改建厂房坍塌已救出19人其中5人死亡 人生,人生##死亡,灾害/意外,灾害/意外##坍/垮塌 +车闻:广本召回9万余辆;领动上市,10.98万起;艾力绅混动 产品行为,产品行为##召回 +86岁老翁过马路遭挖掘机碾压身亡警方:正在侦办中 灾害/意外,灾害/意外##车祸,人生,人生##死亡 +... +``` + +**分类标签文件:** 包含数据集中所有标签,每个标签一行。 + +- label.txt 文件格式: + +```text +<一级标签> +<一级标签>'##'<二级标签> +<一级标签>'##'<二级标签>'##'<三级标签> +... +``` +- label.txt 文件样例: +```text +人生 +人生##死亡 +灾害/意外 +灾害/意外##坍/垮塌 +灾害/意外##车祸 +产品行为 +产品行为##召回 +... +``` +**待预测数据文件:** 包含需要预测标签的文本数据,每条数据一行。 +- data.txt 文件格式: +```text +<文本> +<文本> +... +``` +- data.txt 文件样例: +```text +金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件 +卡车超载致使跨桥侧翻,没那么简单 +消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了 +... 
+``` + + + +### 2.4 模型训练 + +#### 2.4.1 预训练模型微调 + +使用CPU/GPU训练,默认为GPU训练。使用CPU训练只需将设备参数配置改为`--device cpu`,可以使用`--device gpu:0`指定GPU卡号: +```shell +python train.py \ + --dataset_dir "data" \ + --device "gpu" \ + --max_seq_length 128 \ + --model_name "ernie-3.0-medium-zh" \ + --batch_size 32 \ + --early_stop \ + --epochs 100 +``` + +如果在GPU环境中使用,可以指定`gpus`参数进行单卡/多卡训练。使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。如果设备只有一个GPU卡号默认为0,可使用`nvidia-smi`命令查看GPU使用情况。 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --dataset_dir "data" \ + --device "gpu" \ + --max_seq_length 128 \ + --model_name "ernie-3.0-medium-zh" \ + --batch_size 32 \ + --early_stop \ + --epochs 100 +``` + + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,选择cpu、gpu、xpu、npu。如使用gpu训练,可使用参数--gpus指定GPU卡号;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt,dev.txt和label.txt文件;默认为None。 +* `save_dir`:保存训练模型的目录;默认保存在当前目录checkpoint文件夹下。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据任务复杂度和硬件条件进行选择。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:训练最大学习率;默认为3e-5。 +* `epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 +* `early_stop`:选择是否使用早停法(EarlyStopping),模型在开发集经过一定epoch后精度表现不再上升,训练终止;默认为False。 +* `early_stop_nums`:在设定的早停训练轮次内,模型在开发集上表现不再上升,训练终止;默认为10。 +* `logging_steps`: 训练过程中日志打印的间隔steps数,默认5。 +* `weight_decay`:控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `warmup`:是否使用学习率warmup策略,使用时应设置适当的训练轮次(epochs);默认为False。 +* `warmup_steps`:学习率warmup策略的比例数,如果设为1000,则学习率会在1000steps数从0慢慢增长到learning_rate, 而后再缓慢衰减;默认为0。 +* `init_from_ckpt`: 模型初始checkpoint参数地址,默认None。 +* `seed`:随机种子,默认为3。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存开发集上最佳模型在指定的 `save_dir` 中,保存模型文件结构如下所示: + +```text +checkpoint/ +├── config.json # 模型配置文件,paddlenlp 2.4.5以前为model_config.json +├── model_state.pdparams # 模型参数文件 +├── tokenizer_config.json # 分词器配置文件 +├── vocab.txt +└── ... +``` +**NOTE:** +* 如需恢复模型训练,则可以设置 `--init_from_ckpt checkpoint/model_state.pdparams` 。 +* 如需训练英文文本分类任务,只需更换预训练模型参数 `model_name` 。英文训练任务推荐使用"ernie-2.0-base-en"、"ernie-2.0-large-en"。 +* 英文和中文以外语言的文本分类任务,推荐使用基于96种语言(涵盖法语、日语、韩语、德语、西班牙语等几乎所有常见语言)进行预训练的多语言预训练模型"ernie-m-base"、"ernie-m-large",详情请参见[ERNIE-M论文](https://arxiv.org/pdf/2012.15674.pdf)。 +#### 2.4.2 训练评估与模型优化 + +文本分类预测过程中常会遇到诸如"模型为什么会预测出错误的结果","如何提升模型的表现"等问题。[Analysis模块](./analysis) 提供了**模型评估、可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + +
+ +
+ +**模型评估:** 训练后的模型我们可以使用 [Analysis模块](./analysis) 对每个类别分别进行评估,并输出预测错误样本(bad case),默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: + +```shell +python analysis/evaluate.py --device "gpu" --max_seq_length 128 --batch_size 32 --bad_case_file "bad_case.txt" --dataset_dir "data" --params_path "./checkpoint" +``` + +输出打印示例: + +```text +[2022-08-11 03:10:14,058] [ INFO] - -----Evaluate model------- +[2022-08-11 03:10:14,059] [ INFO] - Dev dataset size: 1498 +[2022-08-11 03:10:14,059] [ INFO] - Accuracy in dev dataset: 89.19% +[2022-08-11 03:10:14,059] [ INFO] - Macro avg in dev dataset: precision: 93.48 | recall: 93.26 | F1 score 93.22 +[2022-08-11 03:10:14,059] [ INFO] - Micro avg in dev dataset: precision: 95.07 | recall: 95.46 | F1 score 95.26 +[2022-08-11 03:10:14,095] [ INFO] - Level 1 Label Performance: Macro F1 score: 96.39 | Micro F1 score: 96.81 | Accuracy: 94.93 +[2022-08-11 03:10:14,255] [ INFO] - Level 2 Label Performance: Macro F1 score: 92.79 | Micro F1 score: 93.90 | Accuracy: 89.72 +[2022-08-11 03:10:14,256] [ INFO] - Class name: 交往 +[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 60(4.0%) | precision: 91.94 | recall: 95.00 | F1 score 93.44 +[2022-08-11 03:10:14,256] [ INFO] - ---------------------------- +[2022-08-11 03:10:14,256] [ INFO] - Class name: 交往##会见 +[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 12(0.8%) | precision: 92.31 | recall: 100.00 | F1 score 96.00 +... +``` + +预测错误的样本保存在bad_case.txt文件中: + +```text +Text Label Prediction +据猛龙随队记者JoshLewenberg报道,消息人士透露,猛龙已将前锋萨加巴-科纳特裁掉。此前他与猛龙签下了一份Exhibit10合同。在被裁掉后,科纳特下赛季大概率将前往猛龙的发展联盟球队效力。 组织关系,组织关系##加盟,组织关系##裁员 组织关系,组织关系##解雇 +冠军射手被裁掉,欲加入湖人队,但湖人却无意,冠军射手何去何从 组织关系,组织关系##裁员 组织关系,组织关系##解雇 +6月7日报道,IBM将裁员超过1000人。IBM周四确认,将裁减一千多人。据知情人士称,此次裁员将影响到约1700名员工,约占IBM全球逾34万员工中的0.5%。IBM股价今年累计上涨16%,但该公司4月发布的财报显示,一季度营收下降5%,低于市场预期。 组织关系,组织关系##裁员 组织关系,组织关系##裁员,财经/交易 +有多名魅族员工表示,从6月份开始,魅族开始了新一轮裁员,重点裁员区域是营销和线下。裁员占比超过30%,剩余员工将不过千余人,魅族的知名工程师,爱讲真话的洪汉生已经从钉钉里退出了,外界传言说他去了OPPO。 组织关系,组织关系##退出,组织关系##裁员 组织关系,组织关系##裁员 +... +``` + +**可解释性分析:** 基于[TrustAI](https://github.com/PaddlePaddle/TrustAI)提供单词和句子级别的模型可解释性分析,帮助理解模型预测结果,用于错误样本(bad case)分析,细节详见[训练评估与模型优化指南](analysis/README.md)。 + +- 单词级别可解释性分析,也即分析待预测样本中哪一些单词对模型预测结果起重要作用。以下图为例,用颜色深浅表示单词对预测结果的重要性。 +
+ +
+ +- 句子级别可解释性分析 ,也即分析对待预测样本的模型预测结果与训练集中中哪些样本有重要关系。下面的例子表明句子级别可解释性分析可以帮助理解待预测样本的预测结果与训练集中样本之间的关联。 +```text +text: 据猛龙随队记者JoshLewenberg报道,消息人士透露,猛龙已将前锋萨加巴-科纳特裁掉。此前他与猛龙签下了一份Exhibit10合同。在被裁掉后,科纳特下赛季大概率将前往猛龙的发展联盟球队效力。 +predict label: 组织关系,组织关系##解雇 +label: 组织关系,组织关系##加盟,组织关系##裁员 +examples with positive influence +support1 text: 尼克斯官方今日宣布,他们已经裁掉了前锋扎克-欧文,后者昨日才与尼克斯签约。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.99357 +support2 text: 活塞官方今日宣布,他们已经签下了克雷格-斯沃德,并且裁掉了托德-威瑟斯。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.98344 +support3 text: 孟菲斯灰熊今年宣布,球队已经签下后卫达斯蒂-汉纳斯(DustyHannahs,版头图)并裁掉马特-穆尼。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.98219 +... +``` + +**数据优化:** 结合[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供了**稀疏数据筛选、脏数据清洗、数据增强**三种优化策略,从多角度优化训练数据提升模型效果,策略细节详见[训练评估与模型优化指南](analysis/README.md)。 + +- 稀疏数据筛选主要是解决数据不均衡、训练数据覆盖不足的问题,通过数据增强和数据标注两种方式解决这一问题。 +- 脏数据清洗可以帮助开发者筛选训练集中错误标注的数据,对这些数据重新进行人工标注,得到标注正确的数据再重新进行训练。 +- 数据增强策略提供多种数据增强方案,可以快速扩充数据,提高模型泛化性和鲁棒性。 + +#### 2.4.3 模型预测 +训练结束后,输入待预测数据(data.txt)和类别标签对照列表(label.txt),使用训练好的模型进行,默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: + +```shell +python predict.py --device "gpu" --max_seq_length 128 --batch_size 32 --dataset_dir "data" +``` + +可支持配置的参数: + +* `device`: 选用什么设备进行预测,可选cpu、gpu、xpu、npu;默认为gpu。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含data.txt和label.txt文件;默认为None。 +* `params_path`:待预测模型的目录;默认为"./checkpoint/"。 +* `max_seq_length`:模型使用的最大序列长度,建议与训练时最大序列长度一致, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `data_file`:本地数据集中未标注待预测数据文件名;默认为"data.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 + + + + +### 2.5 模型部署 + +#### 2.5.1 静态图导出 + +使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,静态图模型将用于**后续的推理部署工作**。具体代码见[静态图导出脚本](export_model.py),静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export +``` + +如果使用多语言模型 ERNIE M作为预训练模型,运行方式: +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export --multilingual +``` + +可支持配置的参数: +* `multilingual`:是否为多语言任务(是否使用ERNIE M作为预训练模型);默认为False。 +* `params_path`:动态图训练保存的参数路径;默认为"./checkpoint/"。 +* `output_path`:静态图图保存的参数路径;默认为"./export"。 + +程序运行时将会自动导出模型到指定的 `output_path` 中,保存模型文件结构如下所示: + +```text +export/ +├── float32.pdiparams +├── float32.pdiparams.info +└── float32.pdmodel +``` + 导出模型之后用于部署,项目提供了基于ONNXRuntime的 [离线部署方案](./deploy/predictor/README.md) 和基于Paddle Serving的 [在线服务化部署方案](./deploy/predictor/README.md)。 +#### 2.5.2 模型裁剪 + +如果有模型部署上线的需求,需要进一步压缩模型体积,可以使用 PaddleNLP 的 [压缩API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/compression.md), 一行命令即可启动模型裁剪。 + +使用裁剪功能需要安装 paddleslim: + +```shell +pip install paddleslim==2.4.1 +``` + +开始模型裁剪训练,默认为GPU训练,使用CPU训练只需将设备参数配置改为`--device "cpu"`: +```shell +python prune.py \ + --device "gpu" \ + --dataset_dir "data" \ + --output_dir "prune" \ + --learning_rate 3e-5 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --num_train_epochs 10 \ + --max_seq_length 128 \ + --logging_steps 5 \ + --save_steps 100 \ + --width_mult_list '3/4' '2/3' '1/2' +``` + + +可支持配置的参数: +* `output_dir`:必须,保存模型输出和中间checkpoint的输出目录;默认为 `None` 。 +* `device`: 选用什么设备进行裁剪,选择cpu、gpu。如使用gpu训练,可使用参数--gpus指定GPU卡号。 +* `per_device_train_batch_size`:训练集裁剪训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `per_device_eval_batch_size`:开发集评测过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:训练最大学习率;默认为5e-5。 +* `num_train_epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 +* `logging_steps`: 训练过程中日志打印的间隔steps数,默认100。 
+* `save_steps`: 训练过程中保存模型checkpoint的间隔steps数,默认100。 +* `seed`:随机种子,默认为3。 +* `width_mult_list`:裁剪宽度(multi head)保留的比例列表,表示对self_attention中的 `q`、`k`、`v` 以及 `ffn` 权重宽度的保留比例,保留比例乘以宽度(multi haed数量)应为整数;默认是None。 +* `dataset_dir`:本地数据集路径,需包含train.txt,dev.txt,label.txt;默认为None。 +* `max_seq_length`:模型使用的最大序列长度,建议与训练过程保持一致, 若出现显存不足,请适当调低这一参数;默认为128。 +* `params_dir`:待预测模型参数文件;默认为"./checkpoint/"。 + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存开发集上最佳模型在指定的 `output_dir` 中,保存模型文件结构如下所示: + +```text +prune/ +├── width_mult_0.75 +│   ├── pruned_model.pdiparams +│   ├── pruned_model.pdiparams.info +│   ├── pruned_model.pdmodel +│   ├── model_state.pdparams +│   └── model_config.json +└── ... +``` + +**NOTE:** + +1. 目前支持的裁剪策略需要训练,训练时间视下游任务数据量而定,且和微调的训练时间是一个量级。 裁剪类似蒸馏过程,方便起见,可以直接使用微调时的超参。为了进一步提升精度,可以对 `per_device_train_batch_size`、`learning_rate`、`num_train_epochs`、`max_seq_length` 等超参进行网格搜索(grid search)。 + +2. 模型裁剪主要用于推理部署,因此裁剪后的模型都是静态图模型,只可用于推理部署,不能再通过 `from_pretrained` 导入继续训练。导出模型之后用于部署,项目提供了基于ONNXRuntime的 [离线部署方案](./deploy/predictor/README.md) 和基于Paddle Serving的 [在线服务化部署方案](./deploy/predictor/README.md)。 + +3. ERNIE Base、Medium、Mini、Micro、Nano的模型宽度(multi head数量)为12,ERNIE Xbase、Large 模型宽度(multi head数量)为16,保留比例`width_mult`乘以宽度(multi haed数量)应为整数。 + +#### 2.5.3 部署方案 + +- 离线部署搭建请参考[离线部署](deploy/predictor/README.md)。 + +- 在线服务化部署搭建请参考 [PaddleNLP SimpleServing部署指南](deploy/simple_serving/README.md) 或 [Triton部署指南](deploy/triton_serving/README.md)。 + + + +### 2.6 模型效果 + +我们在[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)的多标签层次数据集评测模型表现,测试配置如下: + +1. 数据集:2020语言与智能技术竞赛抽取的多标签层次数据集 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.4 + +5. 性能数据指标:latency。latency 测试方法:固定 batch size 为 32,GPU部署运行时间 total_time,计算 latency = total_time / total_samples + +6. 
精度评价指标:Micro F1分数、Macro F1分数 + +| | 模型结构 |Micro F1(%) | Macro F1(%) | latency(ms) | +| -------------------------- | ------------ | ------------ | ------------ |------------ | +|ERNIE 1.0 Large Cw |24-layer, 1024-hidden, 20-heads|96.24|94.24 |5.59 | +|ERNIE 3.0 Xbase |20-layer, 1024-hidden, 16-heads|96.21|94.13| 5.51 | +|ERNIE 3.0 Base |12-layer, 768-hidden, 12-heads|95.68|93.39| 2.01 | +|ERNIE 3.0 Medium| 6-layer, 768-hidden, 12-heads|95.26|93.22| 1.01| +|ERNIE 3.0 Mini|6-layer, 384-hidden, 12-heads|94.72|93.03| 0.36| +|ERNIE 3.0 Micro | 4-layer, 384-hidden, 12-heads|94.24|93.08| 0.24| +|ERNIE 3.0 Nano |4-layer, 312-hidden, 12-heads|93.98|91.25|0.19| +| ERNIE 3.0 Medium + 裁剪(保留比例3/4)|6-layer, 768-hidden, 9-heads| 95.45|93.40| 0.81 | +| ERNIE 3.0 Medium + 裁剪(保留比例2/3)|6-layer, 768-hidden, 8-heads| 95.23|93.27 | 0.74 | +| ERNIE 3.0 Medium + 裁剪(保留比例1/2)|6-layer, 768-hidden, 6-heads| 94.92 | 92.70| 0.61 | diff --git a/applications/text_classification/hierarchical/analysis/README.md b/applications/text_classification/hierarchical/analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0f79bc349a50e970042c619655a7931cdbece3b2 --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/README.md @@ -0,0 +1,427 @@ +# 训练评估与模型优化指南 + +**目录** + * [Analysis模块介绍](#Analysis模块介绍) + * [环境准备](#环境准备) + * [模型评估](#模型评估) + * [可解释性分析](#可解释性分析) + * [单词级别可解释性分析](#单词级别可解释性分析) + * [句子级别可解释性分析](#句子级别可解释性分析) + * [数据优化](#数据优化) + * [稀疏数据筛选方案](#稀疏数据筛选方案) + * [脏数据清洗方案](#脏数据清洗方案) + * [数据增强策略方案](#数据增强策略方案) + +## Analysis模块介绍 + +Analysis模块提供了**模型评估、可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + +- **模型评估:** 对整体分类情况和每个类别分别进行评估,并打印预测错误样本,帮助开发者分析模型表现找到训练和预测数据中存在的问题。 + +- **可解释性分析:** 基于[TrustAI](https://github.com/PaddlePaddle/TrustAI)提供单词和句子级别的模型可解释性分析,帮助理解模型预测结果。 + +- **数据优化:** 结合[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供了**稀疏数据筛选、脏数据清洗、数据增强**三种优化策略,从多角度优化训练数据提升模型效果。 + +
+ +
+ +以下是本项目主要代码结构及说明: + +```text +analysis/ +├── evaluate.py # 评估脚本 +├── sent_interpret.py # 句子级别可解释性分析脚本 +├── word_interpret.py # 单词级别可解释性分析notebook +├── sparse.py # 稀疏数据筛选脚本 +├── dirty.py # 脏数据清洗脚本 +├── aug.py # 数据增强脚本 +└── README.md # 训练评估与模型优化指南 +``` + +## 环境准备 +需要可解释性分析和数据优化需要安装相关环境。 +- trustai >= 0.1.7 +- interpretdl >= 0.7.0 + +**安装TrustAI**(可选)如果使用可解释性分析和数据优化中稀疏数据筛选和脏数据清洗需要安装TrustAI。 +```shell +pip install trustai==0.1.7 +``` + +**安装InterpretDL**(可选)如果使用词级别可解释性分析GradShap方法,需要安装InterpretDL +```shell +pip install interpretdl==0.7.0 +``` + +## 模型评估 + +我们使用训练好的模型计算模型的在开发集的准确率,同时打印每个类别数据量及表现: + +```shell +python evaluate.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --params_path "../checkpoint" \ + --max_seq_length 128 \ + --batch_size 32 \ + --bad_case_file "bad_case.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `bad_case_path`:开发集中预测错误样本保存路径;默认为"/bad_case.txt"。 + + +输出打印示例: + +```text +[2022-08-11 03:10:14,058] [ INFO] - -----Evaluate model------- + +[2022-08-11 03:10:14,059] [ INFO] - Dev dataset size: 1498 +[2022-08-11 03:10:14,059] [ INFO] - Accuracy in dev dataset: 89.19% +[2022-08-11 03:10:14,059] [ INFO] - Macro avg in dev dataset: precision: 93.48 | recall: 93.26 | F1 score 93.22 +[2022-08-11 03:10:14,059] [ INFO] - Micro avg in dev dataset: precision: 95.07 | recall: 95.46 | F1 score 95.26 +[2022-08-11 03:10:14,095] [ INFO] - Level 1 Label Performance: Macro F1 score: 96.39 | Micro F1 score: 96.81 | Accuracy: 94.93 +[2022-08-11 03:10:14,255] [ INFO] - Level 2 Label Performance: Macro F1 score: 92.79 | Micro F1 score: 93.90 | Accuracy: 89.72 +[2022-08-11 03:10:14,256] [ INFO] - Class name: 交往 +[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 60(4.0%) | precision: 91.94 | recall: 95.00 | F1 score 93.44 +[2022-08-11 03:10:14,256] [ INFO] - ---------------------------- +[2022-08-11 03:10:14,256] [ INFO] - Class name: 交往##会见 +[2022-08-11 03:10:14,256] [ INFO] - Evaluation examples in dev dataset: 12(0.8%) | precision: 92.31 | recall: 100.00 | F1 score 96.00 +... +``` + +预测错误的样本保存在bad_case.txt文件中: + +```text +Text Label Prediction +据猛龙随队记者JoshLewenberg报道,消息人士透露,猛龙已将前锋萨加巴-科纳特裁掉。此前他与猛龙签下了一份Exhibit10合同。在被裁掉后,科纳特下赛季大概率将前往猛龙的发展联盟球队效力。 组织关系,组织关系##加盟,组织关系##裁员 组织关系,组织关系##解雇 +冠军射手被裁掉,欲加入湖人队,但湖人却无意,冠军射手何去何从 组织关系,组织关系##裁员 组织关系,组织关系##解雇 +6月7日报道,IBM将裁员超过1000人。IBM周四确认,将裁减一千多人。据知情人士称,此次裁员将影响到约1700名员工,约占IBM全球逾34万员工中的0.5%。IBM股价今年累计上涨16%,但该公司4月发布的财报显示,一季度营收下降5%,低于市场预期。 组织关系,组织关系##裁员 组织关系,组织关系##裁员,财经/交易 +有多名魅族员工表示,从6月份开始,魅族开始了新一轮裁员,重点裁员区域是营销和线下。裁员占比超过30%,剩余员工将不过千余人,魅族的知名工程师,爱讲真话的洪汉生已经从钉钉里退出了,外界传言说他去了OPPO。 组织关系,组织关系##退出,组织关系##裁员 组织关系,组织关系##裁员 +... 
+``` + +## 可解释性分析 +"模型为什么会预测出这个结果?"是文本分类任务开发者时常遇到的问题,如何分析错误样本(bad case)是文本分类任务落地中重要一环,本项目基于TrustAI开源了基于词级别和句子级别的模型可解释性分析方法,帮助开发者更好地理解文本分类模型与数据,有助于后续的模型优化与数据清洗标注。 + +### 单词级别可解释性分析 +本项目开源模型的词级别可解释性分析Notebook,提供LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用Jupyter Notebook进行数据分析。 + +运行 [word_interpret.ipynb](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/hierarchical/analysis/README.md) 代码,即可分析影响样本预测结果的关键词以及可视化所有词对预测结果的贡献情况,颜色越深代表这个词对预测结果影响越大: +
+ +
+ +### 句子级别可解释性分析 +本项目基于特征相似度([FeatureSimilarity](https://arxiv.org/abs/2104.04128))算法,计算对样本预测结果正影响的训练数据,帮助理解模型的预测结果与训练集数据的关系。 + +待分析数据文件`interpret_input_file`应为以下三种格式中的一种: +**格式一:包括文本、标签、预测结果** +```text +<文本>'\t'<标签>'\t'<预测结果> +... +``` + +**格式二:包括文本、标签** +```text +<文本>'\t'<标签> +... +``` + +**格式三:只包括文本** +```text +<文本> +准予原告胡某甲与被告韩某甲离婚。 +... +``` + +我们可以运行代码,得到支持样本模型预测结果的训练数据: +```shell +python sent_interpret.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --params_path "../checkpoint/" \ + --max_seq_length 128 \ + --batch_size 16 \ + --top_k 3 \ + --train_file "train.txt" \ + --interpret_input_file "bad_case.txt" \ + --interpret_result_file "sent_interpret.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `top_k`:筛选支持训练证据数量;默认为3。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `interpret_input_file`:本地数据集中待分析文件名;默认为"bad_case.txt"。 +* `interpret_result_file`:保存句子级别可解释性结果文件名;默认为"sent_interpret.txt"。 + +可解释性结果保存在 `interpret_result_file` 文件中: +```text +text: 据猛龙随队记者JoshLewenberg报道,消息人士透露,猛龙已将前锋萨加巴-科纳特裁掉。此前他与猛龙签下了一份Exhibit10合同。在被裁掉后,科纳特下赛季大概率将前往猛龙的发展联盟球队效力。 +predict label: 组织关系,组织关系##解雇 +label: 组织关系,组织关系##加盟,组织关系##裁员 +examples with positive influence +support1 text: 尼克斯官方今日宣布,他们已经裁掉了前锋扎克-欧文,后者昨日才与尼克斯签约。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.99357 +support2 text: 活塞官方今日宣布,他们已经签下了克雷格-斯沃德,并且裁掉了托德-威瑟斯。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.98344 +support3 text: 孟菲斯灰熊今年宣布,球队已经签下后卫达斯蒂-汉纳斯(DustyHannahs,版头图)并裁掉马特-穆尼。 label: 组织关系,组织关系##加盟,组织关系##解雇 score: 0.98219 +... 
+``` + +## 数据优化 + +### 稀疏数据筛选方案 + +稀疏数据筛选适用于文本分类中**数据不平衡或训练数据覆盖不足**的场景,简单来说,就是由于模型在训练过程中没有学习到足够与待预测样本相似的数据,模型难以正确预测样本所属类别的情况。稀疏数据筛选旨在开发集中挖掘缺乏训练证据支持的数据,通常可以采用**数据增强**或**少量数据标注**的两种低成本方式,提升模型在开发集的预测效果。 + +本项目中稀疏数据筛选基于TrustAI,利用基于特征相似度的实例级证据分析方法,抽取开发集中样本的支持训练证据,并计算支持证据平均分(通常为得分前三的支持训练证据均分)。分数较低的样本表明其训练证据不足,在训练集中较为稀疏,实验表明模型在这些样本上表现也相对较差。更多细节详见[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[实例级证据分析](https://github.com/PaddlePaddle/TrustAI/blob/main/trustai/interpretation/example_level/README.md)。 + + +#### 稀疏数据识别—数据增强 + +这里我们将介绍稀疏数据识别—数据增强流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据增强**:将稀疏数据集在训练集中的支持证据应用数据增强策略,这些数据增强后的训练数据记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别-数据增强,得到支持数据集: + +```shell +python sparse.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --aug_strategy "substitute" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `aug_strategy`:数据增强类型,可选"duplicate","substitute", "insert", "delete", "swap";默认为"substitute"。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +将得到增强支持数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_aug.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_aug.txt +``` + +**方案效果** + +我们在[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取部分训练数据(训练集数据规模:700)进行实验,筛选稀疏数据数量和筛选支持数据数量均设为100条,使用不同的数据增强方法进行评测: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集|90.41|79.16| +|训练集+支持增强集(duplicate) |**90.60**|80.55| +|训练集+支持增强集(substitute) |90.21|80.11| +|训练集+支持增强集(insert) |90.53|**80.61**| +|训练集+支持增强集(delete) |90.56| 80.26| +|训练集+支持增强集(swap) |90.18|80.05| + +#### 稀疏数据识别-数据标注 + +本方案能够有针对性进行数据标注,相比于随机标注数据更好提高模型预测效果。这里我们将介绍稀疏数据识别-数据标注流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据标注**:在未标注数据集中筛选稀疏数据集的支持证据,并进行数据标注,记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别--数据标注,得到待标注数据: + +```shell +python sparse.py \ + --annotate \ + --device "gpu" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 \ + --unlabeled_file "data.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `annotate`:选择稀疏数据识别--数据标注模式;默认为False。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* 
`max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `unlabeled_file`:本地数据集中未标注数据文件名;默认为"data.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +我们将筛选出的支持数据`support.txt`进行标注,可以使用标注工具帮助更快标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已标注数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_annotate.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_annotate.txt +``` + +**方案效果** + +我们在[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取部分训练数据(训练集数据规模:700)进行实验,筛选稀疏数据数量设为100条,筛选待标注数据数量为50和100条。我们比较了使用稀疏数据方案的策略采样和随机采样的效果,下表结果表明使用稀疏数据方案的策略采样能够有效指导训练数据扩充,在标注更少的数据情况下获得更大提升的效果: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ | ------------ | +|训练集|90.41|79.16| +|训练集+策略采样集(50) |90.79|82.37| +|训练集+随机采样集(50) |90.10|79.27| +|训练集+策略采样集(100) |91.12|**84.13**| +|训练集+随机采样集(100) |**91.24**|81.66| + +### 脏数据清洗方案 + +脏数据清洗方案是基于已训练好的文本分类模型,筛选出训练数据集中标注错误的数据,再由人工检查重新标注,获得标注正确的数据集进行重新训练。我们将介绍脏数据清洗流程: + +- **脏数据筛选:** 基于TrustAI中表示点方法,计算训练数据对文本分类模型的影响分数,分数高的训练数据表明对模型影响大,这些数据有较大概率为标注错误样本,记为脏数据集(Dirty Dataset)。 + +- **数据清洗、训练:** 将筛选出的脏数据由人工重新检查,为数据打上正确的标签。将清洗后的训练数据重新放入文本分类模型进行训练。 + +现在我们进行脏数据识别,脏数据保存在`"train_dirty.txt"`,剩余训练数据保存在`"train_dirty_rest.txt"`: + +```shell +python dirty.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --dirty_num 100 \ + --dirty_file "train_dirty.txt" \ + --rest_file "train_dirty_rest.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt和label.txt文件;默认为None。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `dirty_file`:保存脏数据文件名,默认为"train_dirty.txt"。 +* `rest_file`:保存剩余数据(非脏数据)文件名,默认为"train_dirty_rest.txt"。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dirty_threshold`:筛选脏数据用于重新标注的阈值,只选择影响分数大于阈值作为支持数据,默认为0。 + + +我们将筛选出脏数据进行人工检查重新标注,可以将`train_dirty.txt`直接导入标注工具doccano帮助更快重新标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已重新标注的脏数据`train_dirty.txt`与剩余训练集数据`train_dirty_rest.txt`合并得到新的训练集`train_clean.txt`重新进行训练: + +```shell +cat ../data/train_dirty_rest.txt ../data/train_dirty.txt > ../data/train_clean.txt +``` + +**方案效果** + +我们在[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取部分训练数据(训练集数据规模:2000)进行实验,取200条数据进行脏数据处理,也即200条训练数据为标签错误数据,选择不同`dirty_num`应用脏数据清洗策略进行评测: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集(2000)|92.54|86.04| +|训练集(2000,含200条脏数据) |89.11|73.33| +|训练集(2000,含200条脏数据) + 脏数据清洗(50)|90.00|77.67| +|训练集(2000,含200条脏数据) + 脏数据清洗(100)|92.48|**87.83**| +|训练集(2000,含200条脏数据) 
+ 脏数据清洗(150)|**92.55**|83.73| + +### 数据增强策略方案 + +在数据量较少或某些类别样本量较少时,也可以通过数据增强策略的方式,生成更多的训练数据,提升模型效果。 + +```shell +python aug.py \ + --create_n 2 \ + --aug_percent 0.1 \ + --train_path "../data/train.txt" \ + --aug_path "../data/aug.txt" +``` + +可支持配置的参数: + +* `train_path`:待增强训练数据集文件路径;默认为"../data/train.txt"。 +* `aug_path`:增强生成的训练数据集文件路径;默认为"../data/train_aug.txt"。 +* `aug_strategy`:数据增强策略,可选"mix", "substitute", "insert", "delete", "swap","mix"为多种数据策略混合使用;默认为"substitute"。 +* `aug_type`:词替换/词插入增强类型,可选"synonym", "homonym", "mlm",建议在GPU环境下使用mlm类型;默认为"synonym"。 +* `create_n`:生成的句子数量,默认为2。 +* `aug_percent`:生成词替换百分比,默认为0.1。 +* `device`: 选用什么设备进行增强,可选择cpu、gpu、xpu、npu,仅在使用mlm类型有影响;默认为"gpu"。 + +生成的增强数据保存在`"aug.txt"`文件中,与训练集数据`train.txt`合并得到新的训练集`train_aug.txt`重新进行训练: + +```shell +cat ../data/aug.txt ../data/train.txt > ../data/train_aug.txt +``` + +PaddleNLP内置多种数据增强策略,更多数据增强策略使用方法请参考[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)。 + +**方案效果** + +我们在[2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取部分训练数据(训练集数据规模:2000)进行实验,采用不同数据增强策略进行两倍数据增强(每条样本生成两条增强样本): + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集(2000)|92.54|86.04| +|训练集(2000)+数据增强(×2, mix) |93.23|89.69| +|训练集(2000)+支持增强集(×2, substitute) |93.07|89.49| +|训练集(2000)+支持增强集(×2, insert) |**93.63**|**89.69**| +|训练集(2000)+支持增强集(×2, delete) |91.53| 84.47| +|训练集(2000)+支持增强集(×2, swap) |93.24|89.02| diff --git a/applications/text_classification/hierarchical/analysis/aug.py b/applications/text_classification/hierarchical/analysis/aug.py new file mode 100644 index 0000000000000000000000000000000000000000..731454ba556065abce173e8c5394c861466f0dde --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/aug.py @@ -0,0 +1,82 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
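+
+# 数据增强脚本:按 --aug_strategy 选定的词级增强策略(同义词替换/插入/删除/交换,或 mix 混合),
+# 逐行读取 train_path(每行格式:文本\t标签),为每条样本生成 create_n 条增强文本并沿用原标签,
+# 写入 aug_path,供与原训练集合并后重新训练使用。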
+ +import argparse + +import paddle + +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--train_path", type=str, default="../data/train.txt", help="Train dataset file name") +parser.add_argument("--aug_path", type=str, default="../data/aug.txt", help="Aug dataset file name") +parser.add_argument("--aug_strategy", choices=["mix", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--aug_type", choices=["synonym", "homonym", "mlm"], default='synonym', help="Select data augmentation type for substitute and insert") +parser.add_argument("--create_n", type=int, default=2, help="Number of augmented sequences.") +parser.add_argument("--aug_percent", type=float, default=0.1, help="Percentage of augmented words in sequences.") +parser.add_argument('--device', default="gpu", help="Select which device to do data augmentation strategy, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def aug(): + """Do data augmentation""" + if args.aug_strategy in ["mix", "substitute", "insert"] and args.aug_strategy == "mlm": + paddle.set_device(args.device) + + if args.aug_strategy in ["substitute", "insert", "delete", "swap"]: + if args.aug_strategy == "substitute": + aug = WordSubstitute(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=args.create_n, aug_percent=args.aug_percent) + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + + augs = aug.augment(s) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + elif args.aug_strategy in ["mix"]: + aug = [ + WordSubstitute(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordInsert(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordDelete(create_n=1, aug_percent=args.aug_percent), + WordSwap(create_n=1, aug_percent=args.aug_percent), + ] + count = 0 + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + + for i in range(args.create_n): + i = count % len(aug) + augs = aug[i].augment(s) + if not isinstance(augs[0], str): + augs = augs[0] + count += 1 + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + + +if __name__ == "__main__": + aug() diff --git a/applications/text_classification/hierarchical/analysis/dirty.py b/applications/text_classification/hierarchical/analysis/dirty.py new file mode 100644 index 0000000000000000000000000000000000000000..394e597c7e280929f68ad38c6f188acdfa8ded50 --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/dirty.py @@ -0,0 +1,153 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import RepresenterPointModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--dirty_num", type=int, default=100, help="Number of dirty data. default:50") +parser.add_argument("--dirty_file", type=str, default="train_dirty.txt", help="Path to save dirty data.") +parser.add_argument("--rest_file", type=str, default="train_dirty_rest.txt", help="The path of rest data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dirty_threshold", type=float, default="0", help="The threshold to select dirty data.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + sentence, label = line.strip().split("\t") + yield {"text": sentence, "label": label} + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +def get_dirty_data(weight_matrix, dirty_num, threshold=0): + """ + Get index of dirty data from train data + """ + scores = [] + for idx in range(weight_matrix.shape[0]): + weight_sum = 0 + count = 0 + for weight in weight_matrix[idx].numpy(): + if weight > threshold: + count += 1 + weight_sum += weight + scores.append((count, weight_sum)) + sorted_scores = sorted(scores)[::-1] + sorted_idxs = sorted(range(len(scores)), key=lambda idx: scores[idx])[::-1] + + ret_scores = sorted_scores[:dirty_num] + ret_idxs = sorted_idxs[:dirty_num] + + return ret_idxs, ret_scores + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = 
list(batch.values()) + return batch + + +def run(): + """ + Get dirty data + """ + set_seed(args.seed) + paddle.set_device(args.device) + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + train_ds = train_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + rep_point = RepresenterPointModel(model, train_data_loader, classifier_layer_name="classifier") + weight_matrix = rep_point.weight_matrix + + # Save dirty data & rest data + dirty_indexs, _ = get_dirty_data(weight_matrix, args.dirty_num, args.dirty_threshold) + + dirty_path = os.path.join(args.dataset_dir, args.dirty_file) + rest_path = os.path.join(args.dataset_dir, args.rest_file) + + with open(dirty_path, "w") as f1, open(rest_path, "w") as f2: + for idx in range(len(train_ds)): + if idx in dirty_indexs: + f1.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + "\n") + else: + f2.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + "\n") + + f1.close(), f2.close() + + +if __name__ == "__main__": + run() diff --git a/applications/text_classification/hierarchical/analysis/evaluate.py b/applications/text_classification/hierarchical/analysis/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..ef927e7df460b005c2d6aa92e9b34d66a4e68a48 --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/evaluate.py @@ -0,0 +1,202 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
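+
+# 层次分类评估脚本:加载 params_path 下保存的微调模型,在 dev_file 上以 sigmoid + 0.5 阈值做多标签预测,
+# 输出整体与逐层级(标签按 "##" 切分)的 Accuracy、Micro/Macro F1 以及每个类别的指标,
+# 并将预测错误样本写入 bad_case_file,便于后续 bad case 分析。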
+ +import argparse +import functools +import os + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.io import BatchSampler, DataLoader +from sklearn.metrics import accuracy_score, classification_report, f1_score + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to evaluate model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="Local dataset directory should include dev.txt and label.txt") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for evaluation.") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +parser.add_argument("--bad_case_file", type=str, default="./bad_case.txt", help="Bad case saving file path") +args = parser.parse_args() +# yapf: enable + + +def preprocess_function(examples, tokenizer, max_seq_length, label_nums, is_test=False): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + if not is_test: + result["labels"] = [float(1) if i in examples["label"] else float(0) for i in range(label_nums)] + return result + + +def read_local_dataset(path, label_list): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + label = "" + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(",")] + yield {"text": sentence, "label": labels, "label_n": label} + + +@paddle.no_grad() +def evaluate(): + """ + Evaluate the model performance + """ + paddle.set_device(args.device) + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # load and preprocess dataset + label_path = os.path.join(args.dataset_dir, args.label_file) + dev_path = os.path.join(args.dataset_dir, args.dev_file) + + label_list = {} + label_map = {} + label_map_dict = {} + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + label_map[i] = l + for ii, ll in enumerate(l.split("##")): + if ii not in label_map_dict: + label_map_dict[ii] = {} + if ll not in label_map_dict[ii]: + iii = len(label_map_dict[ii]) + label_map_dict[ii][ll] = iii + dev_ds = load_dataset(read_local_dataset, path=dev_path, label_list=label_list, lazy=False) + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length, label_nums=len(label_list) + ) + dev_ds = 
dev_ds.map(trans_func) + + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + model.eval() + probs = [] + labels = [] + for batch in dev_data_loader: + label = batch.pop("labels") + logits = model(**batch) + labels.extend(label.numpy()) + probs.extend(F.sigmoid(logits).numpy()) + probs = np.array(probs) + labels = np.array(labels) + preds = probs > 0.5 + report = classification_report(labels, preds, digits=4, output_dict=True) + accuracy = accuracy_score(labels, preds) + + labels_dict = {ii: [] for ii in range(len(label_map_dict))} + preds_dict = {ii: [] for ii in range(len(label_map_dict))} + for i in range(len(preds)): + for ii in range(len(label_map_dict)): + labels_dict[ii].append([0] * len(label_map_dict[ii])) + preds_dict[ii].append([0] * len(label_map_dict[ii])) + for l in dev_ds.data[i]["label_n"].split(","): + for ii, sub_l in enumerate(l.split("##")): + labels_dict[ii][-1][label_map_dict[ii][sub_l]] = 1 + + pred_n = [label_map[i] for i, pp in enumerate(preds[i]) if pp] + + for l in pred_n: + for ii, sub_l in enumerate(l.split("##")): + preds_dict[ii][-1][label_map_dict[ii][sub_l]] = 1 + + logger.info("-----Evaluate model-------") + logger.info("Dev dataset size: {}".format(len(dev_ds))) + logger.info("Accuracy in dev dataset: {:.2f}%".format(accuracy * 100)) + logger.info( + "Micro avg in dev dataset: precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report["micro avg"]["precision"] * 100, + report["micro avg"]["recall"] * 100, + report["micro avg"]["f1-score"] * 100, + ) + ) + logger.info( + "Macro avg in dev dataset: precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report["macro avg"]["precision"] * 100, + report["macro avg"]["recall"] * 100, + report["macro avg"]["f1-score"] * 100, + ) + ) + for ii in range(len(label_map_dict)): + macro_f1_score = f1_score(labels_dict[ii], preds_dict[ii], average="macro") + micro_f1_score = f1_score(labels_dict[ii], preds_dict[ii], average="micro") + accuracy = accuracy_score(labels_dict[ii], preds_dict[ii]) + logger.info( + "Level {} Label Performance: Macro F1 score: {:.2f} | Micro F1 score: {:.2f} | Accuracy: {:.2f}".format( + ii + 1, macro_f1_score * 100, micro_f1_score * 100, accuracy * 100 + ) + ) + + for i in label_map: + logger.info("Class name: {}".format(label_map[i])) + logger.info( + "Evaluation examples in dev dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report[str(i)]["support"], + 100 * report[str(i)]["support"] / len(dev_ds), + report[str(i)]["precision"] * 100, + report[str(i)]["recall"] * 100, + report[str(i)]["f1-score"] * 100, + ) + ) + logger.info("----------------------------") + bad_case_path = os.path.join(args.dataset_dir, args.bad_case_file) + with open(bad_case_path, "w", encoding="utf-8") as f: + f.write("Text\tLabel\tPrediction\n") + for i in range(len(preds)): + for p, l in zip(preds[i], labels[i]): + if (p and l == 0) or (not p and l == 1): + pred_n = [label_map[i] for i, pp in enumerate(preds[i]) if pp] + f.write(dev_ds.data[i]["text"] + "\t" + dev_ds.data[i]["label_n"] + "\t" + ",".join(pred_n) + "\n") + break + + f.close() + logger.info("Bad case in dev dataset saved in {}".format(bad_case_path)) + + return + + +if __name__ == "__main__": + evaluate() diff --git 
a/applications/text_classification/hierarchical/analysis/sent_interpret.py b/applications/text_classification/hierarchical/analysis/sent_interpret.py new file mode 100644 index 0000000000000000000000000000000000000000..1f0e4a88c190a158f948930a6e6c74c266f9f5f8 --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/sent_interpret.py @@ -0,0 +1,157 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--top_k", type=int, default=3, help="Top K important training data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--interpret_input_file", type=str, default="bad_case.txt", help="interpretation file name") +parser.add_argument("--interpret_result_file", type=str, default="sent_interpret.txt", help="interpreted file name") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if items[0] == "Text": + continue + if len(items) == 3: + yield {"text": items[0], "label": items[1], "predict": items[2]} + elif len(items) == 2: + yield {"text": items[0], "label": items[1], "predict": ""} + elif len(items) == 1: + yield {"text": items[0], "label": "", "predict": ""} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def find_positive_influence_data(): + + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + interpret_path = os.path.join(args.dataset_dir, args.interpret_input_file) + + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + interpret_ds = interpret_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + interpret_batch_sampler = BatchSampler(interpret_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + interpret_data_loader = DataLoader( + dataset=interpret_ds, batch_sampler=interpret_batch_sampler, collate_fn=collate_fn + ) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & 
select sparse data + analysis_result = [] + for batch in interpret_data_loader: + analysis_result += feature_sim(batch, sample_num=args.top_k) + with open(os.path.join(args.dataset_dir, args.interpret_result_file), "w") as f: + for i in range(len(analysis_result)): + f.write("text: " + interpret_ds.data[i]["text"] + "\n") + if "predict" in interpret_ds.data[i]: + f.write("predict label: " + interpret_ds.data[i]["predict"] + "\n") + if "label" in interpret_ds.data[i]: + f.write("label: " + interpret_ds.data[i]["label"] + "\n") + f.write("examples with positive influence\n") + for i, (idx, score) in enumerate(zip(analysis_result[i].pos_indexes, analysis_result[i].pos_scores)): + f.write( + "support{} text: ".format(i + 1) + + train_ds.data[idx]["text"] + + "\t" + + "label: " + + train_ds.data[idx]["label"] + + "\t" + + "score: " + + "{:.5f}".format(score) + + "\n" + ) + f.close() + + +if __name__ == "__main__": + find_positive_influence_data() diff --git a/applications/text_classification/hierarchical/analysis/sparse.py b/applications/text_classification/hierarchical/analysis/sparse.py new file mode 100644 index 0000000000000000000000000000000000000000..80feb4f661348014bd44e28ec69b4d6bcff9d189 --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/sparse.py @@ -0,0 +1,286 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--aug_strategy", choices=["duplicate", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--annotate", action='store_true', help="Select unlabeled data for annotation") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--rationale_num_sparse", type=int, default=3, help="Number of rationales per example for sparse data.") +parser.add_argument("--rationale_num_support", type=int, default=6, help="Number of rationales per example for support data.") +parser.add_argument("--sparse_num", type=int, default=100, help="Number of sparse data.") +parser.add_argument("--support_threshold", type=float, default="0.7", help="The threshold to select support data.") +parser.add_argument("--support_num", type=int, default=100, help="Number of support data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +parser.add_argument("--unlabeled_file", type=str, default="data.txt", help="Unlabeled data filename") +parser.add_argument("--sparse_file", type=str, default="sparse.txt", help="Sparse data file name.") +parser.add_argument("--support_file", type=str, default="support.txt", help="support data file name.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 2: + yield {"text": items[0], "label": items[1]} + elif len(items) == 1: + yield {"text": items[0]} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def get_sparse_data(analysis_result, sparse_num): + """ + Get sparse data + """ + idx_scores = {} + preds = [] + for i in range(len(analysis_result)): + scores = analysis_result[i].pos_scores + idx_scores[i] = sum(scores) / len(scores) + preds.append(analysis_result[i].pred_label) + + idx_socre_list = list(sorted(idx_scores.items(), key=lambda x: x[1]))[:sparse_num] + ret_idxs, ret_scores = list(zip(*idx_socre_list)) + return ret_idxs, ret_scores, preds + + +def find_sparse_data(): + """ + Find sparse data (lack of supports in train dataset) in dev dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + label_path = os.path.join(args.dataset_dir, args.label_file) + train_path = os.path.join(args.dataset_dir, 
args.train_file) + dev_path = os.path.join(args.dataset_dir, args.dev_file) + + label_list = {} + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=dev_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & select sparse data + analysis_result = [] + for batch in dev_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_sparse) + sparse_indexs, sparse_scores, preds = get_sparse_data(analysis_result, args.sparse_num) + + # Save the sparse data + with open(os.path.join(args.dataset_dir, args.sparse_file), "w") as f: + for idx in sparse_indexs: + data = dev_ds.data[idx] + f.write(data["text"] + "\t" + str(data["label"]) + "\n") + f.close() + logger.info("Sparse data saved in {}".format(os.path.join(args.dataset_dir, args.sparse_file))) + logger.info("Average score in sparse data: {:.4f}".format(sum(sparse_scores) / len(sparse_scores))) + return os.path.join(args.dataset_dir, args.sparse_file) + + +def get_support_data(analysis_result, support_num, support_threshold=0.7): + """ + get support data + """ + ret_idxs = [] + ret_scores = [] + rationale_idx = 0 + try: + while len(ret_idxs) < support_num: + for n in range(len(analysis_result)): + score = analysis_result[n].pos_scores[rationale_idx] + if score > support_threshold: + idx = analysis_result[n].pos_indexes[rationale_idx] + if idx not in ret_idxs: + ret_idxs.append(idx) + ret_scores.append(score) + if len(ret_idxs) >= support_num: + break + + rationale_idx += 1 + except IndexError: + logger.error( + f"The index is out of range, please reduce support_num or increase support_threshold. Got {len(ret_idxs)} now." 
+ ) + + return ret_idxs, ret_scores + + +def find_support_data(): + """ + Find support data (which supports sparse data) from candidate dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + if args.annotate: + candidate_path = os.path.join(args.dataset_dir, args.unlabeled_file) + else: + candidate_path = os.path.join(args.dataset_dir, args.train_file) + + sparse_path = os.path.join(args.dataset_dir, args.sparse_file) + support_path = os.path.join(args.dataset_dir, args.support_file) + candidate_ds = load_dataset(read_local_dataset, path=candidate_path, lazy=False) + sparse_ds = load_dataset(read_local_dataset, path=sparse_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + candidate_ds = candidate_ds.map(trans_func) + sparse_ds = sparse_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + candidate_batch_sampler = BatchSampler(candidate_ds, batch_size=args.batch_size, shuffle=False) + sparse_batch_sampler = BatchSampler(sparse_ds, batch_size=args.batch_size, shuffle=False) + candidate_data_loader = DataLoader( + dataset=candidate_ds, batch_sampler=candidate_batch_sampler, collate_fn=collate_fn + ) + sparse_data_loader = DataLoader(dataset=sparse_ds, batch_sampler=sparse_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, candidate_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis + analysis_result = [] + for batch in sparse_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_support) + + support_indexs, support_scores = get_support_data(analysis_result, args.support_num, args.support_threshold) + + # Save the support data + if args.annotate or args.aug_strategy == "duplicate": + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + if "label" in data: + f.write(data["text"] + "\t" + data["label"] + "\n") + else: + f.write(data["text"] + "\n") + f.close() + else: + create_n = 1 + aug_percent = 0.1 + if args.aug_strategy == "substitute": + aug = WordSubstitute("synonym", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert("synonym", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=create_n, aug_percent=aug_percent) + + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + augs = aug.augment(data["text"]) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f.write(a + "\t" + data["label"] + "\n") + f.close() + logger.info("support data saved in {}".format(support_path)) + logger.info("support average scores: {:.4f}".format(float(sum(support_scores)) / len(support_scores))) + + +if __name__ == "__main__": + find_sparse_data() + find_support_data() diff --git a/applications/text_classification/hierarchical/analysis/word_interpret.ipynb 
b/applications/text_classification/hierarchical/analysis/word_interpret.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..70a0da4081f6facf69b33c262895d60dbf549b2d --- /dev/null +++ b/applications/text_classification/hierarchical/analysis/word_interpret.ipynb @@ -0,0 +1,362 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 词级别可解释性分析\n", + "本项目提供模型的词级别可解释性分析,包括LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用此项目进行数据分析。\n", + "\n", + "![image](https://user-images.githubusercontent.com/63761690/195334753-78cc2dc8-a5ba-4460-9fde-3b1bb704c053.png)\n", + " \n", + "\n", + "## 1.导入Python模块与参数配置\n", + "首先我们导入必要的导入必要python模块和设置配置参数,词级别可解释性分析算法支持三种待分析的文本 `INTERPRETER_FILE` 数据文件格式:\n", + "\n", + "**格式一:包括文本、标签、预测结果**\n", + "```text\n", + "<文本>'\\t'<标签>'\\t'<预测结果>\n", + "...\n", + "```\n", + "\n", + "**格式二:包括文本、标签**\n", + "```text\n", + "<文本>'\\t'<标签>\n", + "...\n", + "```\n", + "\n", + "**格式三:只包括文本**\n", + "```text\n", + "<文本>\n", + "...\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import functools\n", + "import random\n", + "import os\n", + "import argparse\n", + "\n", + "import jieba\n", + "import numpy as np \n", + "from trustai.interpretation import VisualizationTextRecord\n", + "from trustai.interpretation import get_word_offset\n", + "import paddle\n", + "from paddle.io import DataLoader, BatchSampler\n", + "from paddlenlp.data import DataCollatorWithPadding\n", + "from paddlenlp.datasets import load_dataset\n", + "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from trustai.interpretation import VisualizationTextRecord\n", + "from trustai.interpretation import get_word_offset\n", + "import paddle\n", + "from paddle.io import DataLoader, BatchSampler\n", + "from paddlenlp.data import DataCollatorWithPadding\n", + "from paddlenlp.datasets import load_dataset\n", + "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# 预先定义配置参数\n", + "\n", + "# 运行环境,可选\"cpu\",\"gpu\",\"gpu:x\"(x为gpu编号)\n", + "DEVICE = \"gpu\"\n", + "# 数据路径\n", + "DATASET_DIR = \"../data\" \n", + "# 训练模型保存路径\n", + "PARAM_PATH = \"../checkpoint/\" \n", + "# tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数\n", + "MAX_LENGTH = 128 \n", + "# 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数\n", + "BATCH_SIZE = 1 \n", + "# 待分析解释的数据\n", + "INTERPRETER_FILE = \"bad_case.txt\"\n", + "# 可选 \"ig\",\"lime\",\"grad\" ,可以根据实际任务效果选择解释器\n", + "# \"grad\":GradShap方法依赖interpretdl\n", + "# !pip install interpretdl\n", + "INTERPRETER = \"ig\"\n", + "# 分析句子中TOP K关键词,K值\n", + "KEY_WORDS_NUM = 5" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "def read_local_dataset(path):\n", + " \"\"\"\n", + " Read dataset file\n", + " \"\"\"\n", + " with open(path, 'r', encoding='utf-8') as f:\n", + " for line in f:\n", + " items = line.strip().split('\\t')\n", + " if items[0] == 'Text':\n", + " continue\n", + " items[0] = items[0][:MAX_LENGTH-2]\n", + " if len(items) == 3:\n", + " yield {'text': items[0], 'label': items[1], 'predict': items[2]}\n", + " elif len(items) == 2:\n", + " yield {'text': items[0], 
'label': items[1], 'predict': ''}\n", + " elif len(items) == 1:\n", + " yield {'text': items[0], 'label': '', 'predict': ''}\n", + " else:\n", + " raise ValueError(\"{} should be in fixed format.\".format(path))\n", + "\n", + "def preprocess_function(examples, tokenizer, max_seq_length):\n", + " \"\"\"\n", + " Preprocess dataset\n", + " \"\"\"\n", + " result = tokenizer(text=examples[\"text\"], max_seq_len=max_seq_length)\n", + " return result\n", + "\n", + "class LocalDataCollatorWithPadding(DataCollatorWithPadding):\n", + " \"\"\"\n", + " Convert the result of DataCollatorWithPadding from dict dictionary to a list\n", + " \"\"\"\n", + "\n", + " def __call__(self, features):\n", + " batch = super().__call__(features)\n", + " batch = list(batch.values())\n", + " return batch" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[32m[2022-10-12 11:45:49,858] [ INFO]\u001b[0m - We are using to load '/workspace/PaddleNLP/applications/text_classification/hierarchical/checkpoint/'.\u001b[0m\n", + "W1012 11:45:49.861358 26086 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2\n", + "W1012 11:45:49.865923 26086 gpu_resources.cc:91] device: 0, cuDNN Version: 8.1.\n", + "\u001b[32m[2022-10-12 11:45:52,912] [ INFO]\u001b[0m - We are using to load '/workspace/PaddleNLP/applications/text_classification/hierarchical/checkpoint/'.\u001b[0m\n" + ] + } + ], + "source": [ + "paddle.set_device(DEVICE)\n", + "\n", + "# Define model & tokenizer\n", + "if os.path.exists(PARAM_PATH):\n", + " model = AutoModelForSequenceClassification.from_pretrained(PARAM_PATH)\n", + " tokenizer = AutoTokenizer.from_pretrained(PARAM_PATH)\n", + "else:\n", + " raise ValueError(\"The {} should exist.\".format(PARAM_PATH))\n", + "\n", + "\n", + "# Prepare & preprocess dataset\n", + "interpret_path = os.path.join(DATASET_DIR, INTERPRETER_FILE)\n", + "\n", + "\n", + "interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False)\n", + "trans_func = functools.partial(preprocess_function,\n", + " tokenizer=tokenizer,\n", + " max_seq_length=MAX_LENGTH)\n", + "\n", + "interpret_ds = interpret_ds.map(trans_func)\n", + "\n", + "# Batchify dataset\n", + "collate_fn = LocalDataCollatorWithPadding(tokenizer)\n", + "interpret_batch_sampler = BatchSampler(interpret_ds,\n", + " batch_size=BATCH_SIZE,\n", + " shuffle=False)\n", + "interpret_data_loader = DataLoader(dataset=interpret_ds,\n", + " batch_sampler=interpret_batch_sampler,\n", + " collate_fn=collate_fn)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Start token level interpretion, it will take some time...\n", + "Building prefix dict from the default dictionary ...\n", + "Loading model from cache /tmp/jieba.cache\n", + "Loading model cost 0.746 seconds.\n", + "Prefix dict has been built successfully.\n", + "Start word level alignment, it will take some time...\n" + ] + } + ], + "source": [ + "# Init an interpreter\n", + "if INTERPRETER == 'ig':\n", + " from trustai.interpretation.token_level import IntGradInterpreter\n", + " interpreter = IntGradInterpreter(model)\n", + "elif INTERPRETER == 'lime':\n", + " from trustai.interpretation.token_level import LIMEInterpreter\n", + " interpreter = LIMEInterpreter(model, unk_id=tokenizer.convert_tokens_to_ids('[UNK]'), 
pad_id=tokenizer.convert_tokens_to_ids('[PAD]'))\n", + "else:\n", + " from trustai.interpretation.token_level import GradShapInterpreter\n", + " interpreter = GradShapInterpreter(model)\n", + "\n", + "# Use interpreter to get the importance scores for all data\n", + "print(\"Start token level interpretion, it will take some time...\")\n", + "analysis_result = []\n", + "for batch in interpret_data_loader:\n", + " analysis_result += interpreter(tuple(batch))\n", + "\n", + "# Add CLS and SEP tags to both original text and standard splited tokens\n", + "contexts = []\n", + "words = []\n", + "for i in range(len(interpret_ds)):\n", + " text = interpret_ds.data[i][\"text\"]\n", + " contexts.append(\"[CLS]\" + text + \"[SEP]\")\n", + " words.append([\"[CLS]\"] + list(jieba.cut(text)) + [\"[SEP]\"])\n", + "\n", + "# Get the offset map of tokenized tokens and standard splited tokens\n", + "print(\"Start word level alignment, it will take some time...\")\n", + "ori_offset_maps = []\n", + "word_offset_maps = []\n", + "for i in range(len(contexts)):\n", + " ori_offset_maps.append(tokenizer.get_offset_mapping(contexts[i]))\n", + " word_offset_maps.append(get_word_offset(contexts[i], words[i]))\n", + "\n", + "align_res = interpreter.alignment(analysis_result, contexts, words, word_offset_maps, ori_offset_maps, special_tokens=[\"[CLS]\", '[SEP]'],rationale_num=KEY_WORDS_NUM)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.core.display import display, HTML\n", + "class Visualization(VisualizationTextRecord):\n", + "\n", + " def __init__(self, interpret_res, true_label=None, pred_label=None, words=None):\n", + " if words is not None:\n", + " self.words = words\n", + " else:\n", + " self.words = interpret_res.words\n", + " self.pred_label = pred_label if pred_label is not None else ''\n", + " self.true_label = true_label if true_label is not None else ''\n", + " self.key_words = \" \".join(set(interpret_res.rationale_tokens))\n", + " word_attributions = interpret_res.word_attributions\n", + " _max = max(word_attributions)\n", + " _min = min(word_attributions)\n", + " self.word_attributions = [(word_imp - _min) / (_max - _min) for word_imp in word_attributions]\n", + "\n", + " def record_html(self):\n", + " \"\"\"change all informations to html\"\"\"\n", + " return \"\".join([\n", + " \"\",\n", + " self._format_class(self.true_label),\n", + " self._format_class(self.pred_label),\n", + " self._format_class(self.key_words),\n", + " self._format_word_attributions(),\n", + " \"\",\n", + " ])\n", + " def _format_class(self, label):\n", + " return '{label}'.format(label=label)\n", + "\n", + "def visualize_text(text_records):\n", + " \"\"\"visualize text\"\"\"\n", + " html = [\"\"]\n", + " rows = [\"\"\n", + " \"\"\n", + " \"\"\n", + " \"\"]\n", + " for record in text_records:\n", + " rows.append(record.record_html())\n", + " html.append(\"\".join(rows))\n", + " html.append(\"
LabelPredictionKey wordsImportant visualization
\")\n", + " html = HTML(\"\".join(html))\n", + " display(html)\n", + " return html.data\n", + "\n", + "\n", + "def visualize(interpret_res, ds):\n", + " records = []\n", + " for i in range(len(interpret_res)):\n", + " records.append(Visualization(interpret_res[i], true_label=ds.data[i][\"label\"], pred_label=ds.data[i][\"predict\"]))\n", + " html = visualize_text(records)\n", + " return html" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Label | Prediction | Key words | Important visualization
组织关系,组织关系##加盟,组织关系##裁员 | 组织关系,组织关系##解雇 | 。 特裁 签下 此前 掉 | [CLS] 猛龙 随队 记者 JoshLewenberg 报道 消息人士 透露 猛龙 前锋 加巴 - 科纳 特裁 此前 猛龙 签下 一份 Exhibit10 合同 裁掉 科纳 特下 赛季 概率 前往 猛龙 发展 联盟 球队 效力 [SEP]
组织关系,组织关系##裁员 | 组织关系,组织关系##解雇 | 加入 湖人队 裁掉 被 何去何从 | [CLS] 冠军 射手 裁掉 加入 湖人队 湖人 无意 冠军 射手 何去何从 [SEP]
组织关系,组织关系##裁员 | 组织关系,组织关系##裁员,财经/交易 | 裁员 超过 1000 将 裁减 | [CLS] 6 7 报道 IBM 裁员 超过 1000 IBM 周四 确认 裁减 一千多 知情 人士 此次 裁员 影响 1700 员工 IBM 全球 34 员工 0.5% IBM 股价 今年 累计 上涨 16% 公司 4 发布 财报 显示 一季度 营收 下降 5% 低于 市场 预期 [SEP]
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# process for vbisualize\n", + "html = visualize(align_res, interpret_ds)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.7.13 64-bit", + "metadata": { + "interpreter": { + "hash": "767d51c1340bd893661ea55ea3124f6de3c7a262a8b4abca0554b478b1e2ff90" + } + }, + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.13-final" + }, + "orig_nbformat": 2 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/README.md b/applications/text_classification/hierarchical/deploy/paddle_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b9caf349b1acf4ad0fb420152fd9d505533422c5 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/README.md @@ -0,0 +1,189 @@ +# 基于Paddle Serving的服务化部署 + +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具搭建层次分类在线服务部署。 + +## 目录 +- [环境准备](#环境准备) +- [模型转换](#模型转换) +- [部署模型](#部署模型) + +## 环境准备 +需要准备PaddleNLP的运行环境和Paddle Serving的运行环境。 + +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4 + +### 安装PaddlePaddle + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +### 安装PaddleNLP + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以删去` -i https://mirror.baidu.com/pypi/simple` ,更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` +### 安装Paddle Serving + +安装client和serving app,用于向服务发送请求: +```shell +pip install paddle_serving_app paddle_serving_client +``` +安装serving,用于启动服务,根据服务器设备选择安装CPU server或GPU server: + +- 安装CPU server +```shell +pip install paddle_serving_server +``` +- 安装GPU server, 注意选择跟本地环境一致的命令 +```shell +# CUDA10.2 + Cudnn7 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +**NOTE:** +- 默认开启国内清华镜像源来加速下载,如果您使用 HTTP 代理可以关闭(-i https://pypi.tuna.tsinghua.edu.cn/simple) +- 更多wheel包请参考[serving官网文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Latest_Packages_CN.md) + +### 安装FastTokenizer文本处理加速库(可选) +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install fast-tokenizer-python +``` + + +## 模型转换 + +使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 + +用已安装的paddle_serving_client将静态图参数模型转换成serving格式。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型地址`dirname`,模型文件和参数名`model_filename`,`params_filename`根据实际填写即可。 + +```shell +python -m paddle_serving_client.convert --dirname ../../export --model_filename float32.pdmodel --params_filename float32.pdiparams 
+``` + +可以通过命令查参数含义: +```shell +python -m paddle_serving_client.convert --help +``` + +转换成功后的目录如下: +``` +paddle_serving/ +├──serving_server +│ ├── float32.pdiparams +│ ├── float32.pdmodel +│ ├── serving_server_conf.prototxt +│ └── serving_server_conf.stream.prototxt +└──serving_client + ├── serving_client_conf.prototxt + └── serving_client_conf.stream.prototxt +``` + +## 部署模型 + +serving目录包含启动pipeline服务和发送预测请求的代码和模型,包括: + +``` +serving/ +├──serving_server +│ ├── float32.pdiparams +│ ├── float32.pdmodel +│ ├── serving_server_conf.prototxt +│ └── serving_server_conf.stream.prototxt +├──config.yml # 层次分类任务启动服务端的配置文件 +├──rpc_client.py # 层次分类任务发送pipeline预测请求的脚本 +└──service.py # 层次分类任务启动服务端的脚本 + +``` + +### 修改配置文件 +目录中的`config.yml`文件解释了每一个参数的含义,可以根据实际需要修改其中的配置。比如: +``` +# 修改模型目录为下载的模型目录或自己的模型目录: +model_config: serving_server => model_config: erine-3.0-tiny/serving_server + +# 修改rpc端口号 +rpc_port: 10231 => rpc_port: 9998 + +# 修改使用GPU推理为使用CPU推理: +device_type: 1 => device_type: 0 + +#开启MKLDNN加速 +#use_mkldnn: False => use_mkldnn: True + +#Fetch结果列表,以serving_client/serving_client_conf.prototxt中fetch_var的alias_name为准 +fetch_list: ["linear_147.tmp_1"] => fetch_list: ["linear_75.tmp_1"] +``` + + +### 分类任务 +#### 启动服务 +修改好配置文件后,执行下面命令启动服务: +```shell +python service.py --max_seq_length 128 --model_name "ernie-3.0-medium-zh" +``` + +可支持配置的参数: +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 + +输出打印如下: +``` +[DAG] Succ init +[PipelineServicer] succ init +...... +--- Running analysis [ir_graph_to_program_pass] +I0727 06:50:34.988327 43126 analysis_predictor.cc:1007] ======= optimize end ======= +I0727 06:50:34.992336 43126 naive_executor.cc:102] --- skip [feed], feed -> token_type_ids +I0727 06:50:34.992357 43126 naive_executor.cc:102] --- skip [feed], feed -> input_ids +I0727 06:50:34.993671 43126 naive_executor.cc:102] --- skip [linear_75.tmp_1], fetch -> fetch +[2022-07-27 06:50:35,954] [ WARNING] - Can't find the fast_tokenizer package, please ensure install fast_tokenizer correctly. You can install fast_tokenizer by `pip install fast-tokenizer-python`. +[2022-07-27 06:50:35,954] [ INFO] - We are using to load 'ernie-3.0-medium-zh'. 
+[2022-07-27 06:50:35,954] [ INFO] - Already cached /root/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt +[OP Object] init success +``` + +#### 启动rpc client测试 +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器) +```shell +python rpc_client.py +``` +输出打印如下: +``` +text: 消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了 +label: 组织关系,组织关系##裁员 +-------------------- +text: 卡车超载致使跨桥侧翻,没那么简单 +label: 灾害/意外,灾害/意外##坍/垮塌 +-------------------- +text: 金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件 +label: 产品行为,产品行为##召回 +-------------------- +``` +#### 启动http client测试 +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器) +```shell +python http_client.py +``` +输出打印如下: +``` +text: 消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了 +label: 组织关系,组织关系##裁员 +-------------------- +text: 卡车超载致使跨桥侧翻,没那么简单 +label: 灾害/意外,灾害/意外##坍/垮塌 +-------------------- +text: 金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件 +label: 产品行为,产品行为##召回 +-------------------- +``` diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/config.yml b/applications/text_classification/hierarchical/deploy/paddle_serving/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..62a1a3056b826619c7c640fcb9c426a2d96fc28f --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/config.yml @@ -0,0 +1,59 @@ +#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 18090 + +#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 9878 + +#worker_num, 最大并发数。 +#当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +#当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 1 + +#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + #op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + + #重试次数 + retry: 1 + + #使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + use_profile: false + tracer: + interval_s: 10 + +op: + seq_cls: + #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + + #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + #client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + + #模型路径 + model_config: serving_server + + #Fetch结果列表,以client_config中fetch_var的alias_name为准 + fetch_list: ["linear_75.tmp_1"] + + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + + #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" + + #开启MKLDNN加速 + use_mkldnn: True + + #thread_num + thread_num: 12 + + #ir_optim + ir_optim: True + + #开启tensorrt后,进行优化的子图包含的最少节点数 + #min_subgraph_size: 10 \ No newline at end of file diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/http_client.py b/applications/text_classification/hierarchical/deploy/paddle_serving/http_client.py new file mode 100644 index 0000000000000000000000000000000000000000..083fe02600c7081b88441901cf6a32a31d549ea4 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/http_client.py @@ -0,0 +1,123 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import json + +import numpy as np +import requests + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.server_url = server_url + + def Run(self, text, label_list): + sentence = np.array([t.encode("utf-8") for t in text], dtype=np.object_) + sentence = sentence.__repr__() + data = {"key": ["sentence"], "value": [sentence]} + data = json.dumps(data) + + ret = requests.post(url=self.server_url, data=data) + ret = ret.json() + for t, l in zip(text, eval(ret["value"][0])): + print("text: ", t) + label = ",".join([label_list[int(ll)] for ll in l.split(",")]) + print("label: ", label) + print("--------------------") + return + + +if __name__ == "__main__": + server_url = "http://127.0.0.1:9878/seq_cls/prediction" + runner = Runner(server_url) + text = ["消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了?", "卡车超载致使跨桥侧翻,没那么简单", "金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件"] + label_list = [ + "交往", + "交往##会见", + "交往##感谢", + "交往##探班", + "交往##点赞", + "交往##道歉", + "产品行为", + "产品行为##上映", + "产品行为##下架", + "产品行为##发布", + "产品行为##召回", + "产品行为##获奖", + "人生", + "人生##产子/女", + "人生##出轨", + "人生##分手", + "人生##失联", + "人生##婚礼", + "人生##庆生", + "人生##怀孕", + "人生##死亡", + "人生##求婚", + "人生##离婚", + "人生##结婚", + "人生##订婚", + "司法行为", + "司法行为##举报", + "司法行为##入狱", + "司法行为##开庭", + "司法行为##拘捕", + "司法行为##立案", + "司法行为##约谈", + "司法行为##罚款", + "司法行为##起诉", + "灾害/意外", + "灾害/意外##地震", + "灾害/意外##坍/垮塌", + "灾害/意外##坠机", + "灾害/意外##洪灾", + "灾害/意外##爆炸", + "灾害/意外##袭击", + "灾害/意外##起火", + "灾害/意外##车祸", + "竞赛行为", + "竞赛行为##夺冠", + "竞赛行为##晋级", + "竞赛行为##禁赛", + "竞赛行为##胜负", + "竞赛行为##退役", + "竞赛行为##退赛", + "组织关系", + "组织关系##停职", + "组织关系##加盟", + "组织关系##裁员", + "组织关系##解散", + "组织关系##解约", + "组织关系##解雇", + "组织关系##辞/离职", + "组织关系##退出", + "组织行为", + "组织行为##开幕", + "组织行为##游行", + "组织行为##罢工", + "组织行为##闭幕", + "财经/交易", + "财经/交易##上市", + "财经/交易##出售/收购", + "财经/交易##加息", + "财经/交易##涨价", + "财经/交易##涨停", + "财经/交易##融资", + "财经/交易##跌停", + "财经/交易##降价", + "财经/交易##降息", + ] + runner.Run(text, label_list) diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/rpc_client.py b/applications/text_classification/hierarchical/deploy/paddle_serving/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..f946f82078a0de296e8c7b5fb7d856bc5f343bb6 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/rpc_client.py @@ -0,0 +1,120 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
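+
+# RPC pipeline client for the hierarchical classification service started by service.py.
+# It connects to the rpc_port configured in config.yml (18090 by default), feeds the raw
+# UTF-8 sentences under the "sentence" key, and maps the comma-separated label ids in the
+# response back to the hierarchical label names defined in label_list below.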
+import numpy as np +from paddle_serving_server.pipeline import PipelineClient + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.client = PipelineClient() + self.client.connect([server_url]) + + def Run(self, data, label_list): + data = np.array([x.encode("utf-8") for x in data], dtype=np.object_) + ret = self.client.predict(feed_dict={"sentence": data}) + for ( + d, + l, + ) in zip(data, eval(ret.value[0])): + print("text: ", d) + label = ",".join([label_list[int(ll)] for ll in l.split(",")]) + print("label: ", label) + print("--------------------") + return + + +if __name__ == "__main__": + server_url = "127.0.0.1:18090" + runner = Runner(server_url) + text = ["消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了?", "卡车超载致使跨桥侧翻,没那么简单", "金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件"] + label_list = [ + "交往", + "交往##会见", + "交往##感谢", + "交往##探班", + "交往##点赞", + "交往##道歉", + "产品行为", + "产品行为##上映", + "产品行为##下架", + "产品行为##发布", + "产品行为##召回", + "产品行为##获奖", + "人生", + "人生##产子/女", + "人生##出轨", + "人生##分手", + "人生##失联", + "人生##婚礼", + "人生##庆生", + "人生##怀孕", + "人生##死亡", + "人生##求婚", + "人生##离婚", + "人生##结婚", + "人生##订婚", + "司法行为", + "司法行为##举报", + "司法行为##入狱", + "司法行为##开庭", + "司法行为##拘捕", + "司法行为##立案", + "司法行为##约谈", + "司法行为##罚款", + "司法行为##起诉", + "灾害/意外", + "灾害/意外##地震", + "灾害/意外##坍/垮塌", + "灾害/意外##坠机", + "灾害/意外##洪灾", + "灾害/意外##爆炸", + "灾害/意外##袭击", + "灾害/意外##起火", + "灾害/意外##车祸", + "竞赛行为", + "竞赛行为##夺冠", + "竞赛行为##晋级", + "竞赛行为##禁赛", + "竞赛行为##胜负", + "竞赛行为##退役", + "竞赛行为##退赛", + "组织关系", + "组织关系##停职", + "组织关系##加盟", + "组织关系##裁员", + "组织关系##解散", + "组织关系##解约", + "组织关系##解雇", + "组织关系##辞/离职", + "组织关系##退出", + "组织行为", + "组织行为##开幕", + "组织行为##游行", + "组织行为##罢工", + "组织行为##闭幕", + "财经/交易", + "财经/交易##上市", + "财经/交易##出售/收购", + "财经/交易##加息", + "财经/交易##涨价", + "财经/交易##涨停", + "财经/交易##融资", + "财经/交易##跌停", + "财经/交易##降价", + "财经/交易##降息", + ] + runner.Run(text, label_list) diff --git a/applications/text_classification/hierarchical/deploy/paddle_serving/service.py b/applications/text_classification/hierarchical/deploy/paddle_serving/service.py new file mode 100644 index 0000000000000000000000000000000000000000..608f0e1f7528942794300c30239caf04fac4b061 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/paddle_serving/service.py @@ -0,0 +1,105 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
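+
+# Pipeline server for the hierarchical classification task: the Op below tokenizes and
+# pads the incoming sentences, the serving runtime runs the exported static graph model
+# configured in config.yml, and postprocess() applies a sigmoid with a 0.5 threshold to
+# the fetched logits, returning comma-separated label ids for the clients to decode.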
+ +import argparse +import logging + +import numpy as np +from paddle_serving_server.web_service import Op, WebService + +from paddlenlp.transformers import AutoTokenizer + +_LOGGER = logging.getLogger() + +FETCH_NAME_MAP = { + "ernie-1.0-large-zh-cw": "linear_291.tmp_1", + "ernie-3.0-xbase-zh": "linear_243.tmp_1", + "ernie-3.0-base-zh": "linear_147.tmp_1", + "ernie-3.0-medium-zh": "linear_75.tmp_1", + "ernie-3.0-mini-zh": "linear_75.tmp_1", + "ernie-3.0-micro-zh": "linear_51.tmp_1", + "ernie-3.0-nano-zh": "linear_51.tmp_1", + "ernie-2.0-base-en": "linear_147.tmp_1", + "ernie-2.0-large-en": "linear_291.tmp_1", + "ernie-m-base": "linear_147.tmp_1", + "ernie-m-large": "linear_291.tmp_1", +} + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +args = parser.parse_args() +# fmt: on + + +class Op(Op): + def init_op(self): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_fast=True) + # Output nodes may differ from model to model + # You can see the output node name in the conf.prototxt file of serving_server + self.fetch_names = [ + FETCH_NAME_MAP[args.model_name], + ] + + def preprocess(self, input_dicts, data_id, log_id): + # Convert input format + ((_, input_dict),) = input_dicts.items() + data = input_dict["sentence"] + if isinstance(data, str) and "array(" in data: + data = eval(data) + else: + _LOGGER.error("input value {}is not supported.".format(data)) + data = [i.decode("utf-8") for i in data] + + # tokenizer + pad + data = self.tokenizer( + data, + max_length=args.max_seq_length, + padding=True, + truncation=True, + return_position_ids=False, + return_attention_mask=False, + ) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], dtype="int64") + return tokenized_data, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + + results = fetch_dict[self.fetch_names[0]] + results = np.array(results) + labels = [] + + for result in results: + label = [] + result = 1 / (1 + (np.exp(-result))) + for i, p in enumerate(result): + if p > 0.5: + label.append(str(i)) + labels.append(",".join(label)) + return {"label": labels}, None, "" + + +class Service(WebService): + def get_pipeline_response(self, read_op): + return Op(name="seq_cls", input_ops=[read_op]) + + +if __name__ == "__main__": + service = Service(name="seq_cls") + service.prepare_pipeline_config("config.yml") + service.run_service() diff --git a/applications/text_classification/hierarchical/deploy/predictor/README.md b/applications/text_classification/hierarchical/deploy/predictor/README.md new file mode 100644 index 0000000000000000000000000000000000000000..26f278d9a44c01257d4df521058bcebeff3d19b0 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/predictor/README.md @@ -0,0 +1,175 @@ +# 基于ONNXRuntime推理部署指南 + +**目录** + * [环境准备](#环境准备) + * [基于GPU部署推理样例](#基于GPU部署推理样例) + * [基于CPU部署推理样例](#基于CPU部署推理样例) + * [性能与精度测试](#性能与精度测试) +## 
环境准备 + +模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime,Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[静态图导出](../../README.md),模型使用裁剪API进行裁剪之后会自动生成静态图模型。 + +如果基于GPU部署,请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +```shell +python -m pip install onnxruntime-gpu onnx onnxconverter-common==1.9.0 psutil paddle2onnx==1.0.5 +``` + +如果基于CPU部署,请使用如下命令安装所需依赖: +```shell +python -m pip install onnxruntime psutil +``` + +安装FastTokenizer文本处理加速库(可选) +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install fast-tokenizer-python +``` + +## 基于GPU部署推理样例 +请使用如下命令进行部署 +``` +python infer.py \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` +多语言模型加上`--multilingual`,裁剪后的模型前缀为`--model_path_prefix ../../prune/width_mult_XXXX/pruned_model`。 +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 +* `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `use_fp16`:选择是否开启FP16进行加速;默认为False。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `device`: 选用什么设备进行训练,可选cpu、gpu。 +* `device_id`: 选择GPU卡号;默认为0。 +* `perf`:选择进行模型性能和精度评估;默认为False。 +* `dataset_dir`:本地数据集地址,需包含data.txt, label.txt, test.txt/dev.txt(可选,如果启动模型性能和精度评估);默认为None。 +* `perf_dataset`:评估数据集,可选'dev'、'test',选择在开发集或测试集评估模型;默认为"dev"。 +型);默认为False。 + +在GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,在包括V100、T4、A10、A100、GTX 20系列和30系列显卡等设备上可以开启FP16进行加速,在CPU或者CUDA计算能力 (CUDA Compute Capability) 小于7.0时开启不会带来加速效果。可以使用如下命令开启ONNXRuntime的FP16进行推理加速: + +``` +python infer.py \ + --use_fp16 \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +可以使用如下命令开启ONNXRuntime推理评估模型的性能和精度: + +``` +python infer.py \ + --perf \ + --perf_dataset 'dev' \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +## 基于CPU部署推理样例 + +请使用如下命令进行部署 +``` +python infer.py \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型;默认为"ernie-3.0-medium-zh",中文数据集推荐使用"ernie-3.0-medium-zh"。 +* `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `use_quantize`:选择是否开启INT8动态量化进行加速;默认为False。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为200。 +* `num_threads`:cpu线程数;默认为cpu的物理核心数量。 +* `device`: 选用什么设备进行训练,可选cpu、gpu。 +* `perf`:选择进行模型性能和精度评估;默认为False。 +* `dataset_dir`:本地数据集地址,需包含data.txt, label.txt, dev.txt/test.txt(可选,如果启动模型性能和精度评估);默认为None。 +* `perf_dataset`:评估数据集,选择在开发集或测试集评估模型;默认为"dev"。 + +可以使用如下命令开启ONNXRuntime的INT8动态量化进行推理加速: + +``` +python infer.py \ + --use_quantize \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + 
--model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +**Note**:INT8动态量化与FP16相比精度损失较大,GPU部署建议使用FP16加速。 + +可以使用如下命令开启ONNXRuntime推理评估模型的性能和精度: + +``` +python infer.py \ + --perf \ + --perf_dataset 'dev' \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +## 性能与精度测试 + + +测试配置如下: + +1. [2020语言与智能技术竞赛:事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)抽取的多标签数据集 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.3.1 + +5. 性能数据指标:latency。latency 测试方法:固定 batch size 为 32,GPU部署运行时间 total_time,计算 latency = total_time / total_samples + +6. 精度评价指标:Micro F1分数、Macro F1分数 + +| | Micro F1(%) | Macro F1(%) | latency(ms) | +| -------------------------- | ------------ | ------------- |------------- | +| ERNIE 3.0 Medium+FP32+GPU | 95.26|93.22| 1.01| +| ERNIE 3.0 Medium+FP16+GPU | 95.26|93.22| 0.38| +| ERNIE 3.0 Medium+FP32+CPU | 95.26|93.22| 18.93 | +| ERNIE 3.0 Medium+INT8+CPU | 95.03 | 92.87| 12.14 | + + +经过FP16转化加速比达到3~4倍左右,精度变化较小,与FP16相比,INT8在线量化精度下降较大,加速比在1.5~2倍左右。 diff --git a/applications/text_classification/hierarchical/deploy/predictor/infer.py b/applications/text_classification/hierarchical/deploy/predictor/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..34a58d91fc99e0fa7dca573499da3fb02d5fbca0 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/predictor/infer.py @@ -0,0 +1,88 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import psutil +from predictor import Predictor + +from paddlenlp.datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") +parser.add_argument("--use_quantize", action='store_true', help="Whether to use quantization for acceleration, only takes effect when deploying on cpu.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu, only takes effect when deploying on cpu.") +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--device_id', default=0, help="Select which gpu device to train model.") +parser.add_argument("--perf", action='store_true', help="Whether to compute the latency and f1 score of the test set.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="The dataset directory including data.txt, taxonomy.txt, test.txt(optional, if evaluate the performance).") +parser.add_argument("--perf_dataset", choices=['dev', 'test'], default='dev', type=str, help="evaluate the performance on dev dataset or test dataset") +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') +args = parser.parse_args() +# yapf: enable + + +def read_local_dataset(path, label_list): + label_list_dict = {label_list[i]: i for i in range(len(label_list))} + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list_dict[l] for l in label.split(",")] + yield {"sentence": sentence, "label": labels} + + +if __name__ == "__main__": + + label_list = [] + label_dir = os.path.join(args.dataset_dir, "label.txt") + with open(label_dir, "r", encoding="utf-8") as f: + lines = f.readlines() + for i, line in enumerate(lines): + label_list.append(line.strip()) + f.close() + + predictor = Predictor(args, label_list) + + if args.perf: + eval_dir = os.path.join(args.dataset_dir, "{}.txt".format(args.perf_dataset)) + eval_ds = load_dataset(read_local_dataset, path=eval_dir, label_list=label_list, lazy=False) + texts, labels = predictor.get_text_and_label(eval_ds) + + # preprocess & evaluate & latency + preprocess_result = predictor.preprocess(texts) + predictor.evaluate(preprocess_result, labels) + predictor.performance(preprocess_result) + else: + data = [] + data_dir = os.path.join(args.dataset_dir, "data.txt") + with open(data_dir, "r", encoding="utf-8") as f: + lines = f.readlines() + for i, line in enumerate(lines): + data.append(line.strip()) + f.close() + predictor.predict(data) diff --git a/applications/text_classification/hierarchical/deploy/predictor/predictor.py b/applications/text_classification/hierarchical/deploy/predictor/predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..28b07ee7da2a7376f59417d6fa6ebc2152e482c6 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/predictor/predictor.py @@ -0,0 +1,233 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import numpy as np +import onnxruntime as ort +import paddle2onnx +from sklearn.metrics import f1_score + +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + + +class InferBackend(object): + def __init__( + self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, use_quantize=False, num_threads=10 + ): + logger.info(">>> [InferBackend] Creating Engine ...") + onnx_model = paddle2onnx.export( + model_file=model_path_prefix + ".pdmodel", + params_file=model_path_prefix + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + infer_model_dir = model_path_prefix.rsplit("/", 1)[0] + float_onnx_file = os.path.join(infer_model_dir, "model.onnx") + with open(float_onnx_file, "wb") as f: + f.write(onnx_model) + + if device == "gpu": + + logger.info(">>> [InferBackend] Use GPU to inference ...") + + if use_fp16: + logger.info(">>> [InferBackend] Use FP16 to inference ...") + import onnx + from onnxconverter_common import float16 + + fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx") + onnx_model = onnx.load_model(float_onnx_file) + trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True) + onnx.save_model(trans_model, fp16_model_file) + onnx_model = fp16_model_file + if use_quantize: + logger.info( + ">>> [InferBackend] use_quantize only takes effect when deploying on cpu, use_fp16 for acceleration when deploying on gpu ..." + ) + sess_options = ort.SessionOptions() + self.predictor = ort.InferenceSession( + onnx_model, + sess_options=sess_options, + providers=["CUDAExecutionProvider"], + provider_options=[{"device_id": device_id}], + ) + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + "The environment for GPU inference is not set properly. " + "A possible cause is that you had installed both onnxruntime and onnxruntime-gpu. " + "Please run the following commands to reinstall: \n " + "1) pip uninstall -y onnxruntime onnxruntime-gpu \n 2) pip install onnxruntime-gpu" + ) + else: + logger.info(">>> [InferBackend] Use CPU to inference ...") + if use_fp16: + logger.info( + ">>> [InferBackend] use_fp16 only takes effect when deploying on gpu, use_quantize for acceleration when deploying on cpu ..." 
+ ) + if use_quantize: + dynamic_quantize_model = os.path.join(infer_model_dir, "int8_model.onnx") + self.dynamic_quantize(float_onnx_file, dynamic_quantize_model) + onnx_model = dynamic_quantize_model + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=["CPUExecutionProvider"] + ) + logger.info(">>> [InferBackend] Engine Created ...") + + def dynamic_quantize(self, input_float_model, dynamic_quantized_model): + from onnxruntime.quantization import quantize_dynamic + + quantize_dynamic(input_float_model, dynamic_quantized_model) + + def infer(self, input_dict: dict): + result = self.predictor.run(None, input_dict) + return result + + +def sigmoid_(x): + """ + compute sigmoid + """ + return 1 / (1 + np.exp(-x)) + + +class Predictor(object): + def __init__(self, args, label_list): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=True) + self.label_list = label_list + self.batch_size = args.batch_size + self.max_seq_length = args.max_seq_length + self.multilingual = args.multilingual + self.inference_backend = InferBackend( + args.model_path_prefix, args.device, args.device_id, args.use_fp16, args.use_quantize, args.num_threads + ) + + def preprocess(self, input_data: list): + + # tokenizer + pad + data = self.tokenizer( + input_data, + max_length=self.max_seq_length, + padding=True, + truncation=True, + return_position_ids=False, + return_attention_mask=False, + return_token_type_ids=not self.multilingual, + ) + tokenized_data = {} + for tokenizer_key in data: + + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], dtype="int64") + return tokenized_data + + def postprocess(self, infer_data): + threshold = 0.5 + + sigmoid = np.vectorize(sigmoid_) + probs = sigmoid(infer_data) + labels = [] + + for prob in probs: + label = [] + + for i, p in enumerate(prob): + if p > threshold: + label.append(self.label_list[i]) + + labels.append(label) + + return labels + + def infer(self, data): + infer_data = self.inference_backend.infer(data) + logits = np.array(infer_data[0]) + return logits + + def infer_batch(self, preprocess_result): + sample_num = len(preprocess_result["input_ids"]) + infer_result = None + for i in range(0, sample_num, self.batch_size): + batch_size = min(self.batch_size, sample_num - i) + preprocess_result_batch = {} + for tokenizer_key in preprocess_result: + preprocess_result_batch[tokenizer_key] = [ + preprocess_result[tokenizer_key][i + j] for j in range(batch_size) + ] + + result = self.infer(preprocess_result_batch) + if infer_result is None: + infer_result = result + else: + infer_result = np.append(infer_result, result, axis=0) + return infer_result + + def printer(self, results, input_data): + for text, labels in zip(input_data, results): + hierarchical_labels = {} + logger.info("text: {}".format(text)) + logger.info("prediction result: {}".format(",".join(labels))) + for label in labels: + for i, l in enumerate(label.split("##")): + if i not in hierarchical_labels: + hierarchical_labels[i] = [] + if l not in hierarchical_labels[i]: + hierarchical_labels[i].append(l) + for d in range(len(hierarchical_labels)): + logger.info("level {} : {}".format(d + 1, ",".join(hierarchical_labels[d]))) + logger.info("--------------------") + + def predict(self, input_data: list): + preprocess_result = self.preprocess(input_data) + infer_result = self.infer_batch(preprocess_result) + result = 
self.postprocess(infer_result) + self.printer(result, input_data) + return + + def performance(self, preprocess_result): + nums = len(preprocess_result["input_ids"]) + + start = time.time() + self.infer_batch(preprocess_result) + total_time = time.time() - start + logger.info("sample nums: %s, time: %.2f, latency: %.2f ms" % (nums, total_time, 1000 * total_time / nums)) + return + + def evaluate(self, preprocess_result, labels): + + infer_result = self.infer_batch(preprocess_result) + sigmoid = np.vectorize(sigmoid_) + probs = sigmoid(infer_result) + preds = probs > 0.5 + micro_f1_score = f1_score(y_pred=preds, y_true=labels, average="micro") + macro_f1_score = f1_score(y_pred=preds, y_true=labels, average="macro") + logger.info("micro f1: %.2f, macro f1: %.2f" % (micro_f1_score * 100, macro_f1_score * 100)) + return + + def get_text_and_label(self, ds): + """ + Return text and label list + """ + all_texts = [] + all_labels = [] + for ii in range(len(ds)): + all_texts.append(ds[ii]["sentence"]) + labels = [float(1) if i in ds[ii]["label"] else float(0) for i in range(len(self.label_list))] + all_labels.append(labels) + return all_texts, all_labels diff --git a/applications/text_classification/hierarchical/deploy/simple_serving/README.md b/applications/text_classification/hierarchical/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..13ec53a993bab48a08a8e6b0f4558cca09151fd1 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/simple_serving/README.md @@ -0,0 +1,42 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 +```shell +pip install paddlenlp --upgrade +``` +## Server服务启动 +### 分类任务启动 +#### 启动 分类 Server 服务 +```bash +paddlenlp server server:app --host 0.0.0.0 --port 8189 +``` +如果是ERNIE-M模型则启动 +```bash +paddlenlp server ernie_m_server:app --host 0.0.0.0 --port 8189 +``` +#### 分类任务发送服务 +```bash +python client.py +``` + + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size`, `prob_limit` 参数 +```python + data = { + 'data': { + 'text': texts, + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size, + 'prob_limit': args.prob_limit + } + } +``` diff --git a/applications/text_classification/hierarchical/deploy/simple_serving/client.py b/applications/text_classification/hierarchical/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..a1eb7fc8357a5d9226c90776c6ceccce33b4492c --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/simple_serving/client.py @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
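+
+# HTTP client for the SimpleServing deployment: it posts a batch of texts to the
+# registered endpoint (models/cls_hierarchical, served on port 8189 by server.py or
+# ernie_m_server.py) together with the max_seq_len, batch_size and prob_limit
+# parameters, and prints the response body.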
+ +import argparse +import requests +import json + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--prob_limit", default=0.5, type=float, help="The limitation of probability for the label.") +args = parser.parse_args() +# yapf: enable + +url = "http://0.0.0.0:8189/models/cls_hierarchical" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = [ + "请问木竭胶囊能同高血压药、氨糖同时服吗?", + "低压100*高压140*头涨,想吃点降压药。谢谢!", + "脑穿通畸形易发人群有哪些", + "幼儿乱吃丙硫氧嘧啶片怎么办,我也不知道她吃了几片", + "如果是可以降血糖的话,血糖值7点多的大概需要吃几个疗程?", + ] + data = { + "data": { + "text": texts, + }, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size, "prob_limit": args.prob_limit}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/applications/text_classification/hierarchical/deploy/simple_serving/ernie_m_server.py b/applications/text_classification/hierarchical/deploy/simple_serving/ernie_m_server.py new file mode 100644 index 0000000000000000000000000000000000000000..40a43f974d549ddcd581d4872c1f1580f8b3c499 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/simple_serving/ernie_m_server.py @@ -0,0 +1,25 @@ +# coding:utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from paddlenlp import SimpleServer +from paddlenlp.server import ERNIEMHandler, MultiLabelClassificationPostHandler + +app = SimpleServer() +app.register( + "models/cls_hierarchical", + model_path="../../export", + tokenizer_name="ernie-m-base", + model_handler=ERNIEMHandler, + post_handler=MultiLabelClassificationPostHandler, +) diff --git a/applications/text_classification/hierarchical/deploy/simple_serving/server.py b/applications/text_classification/hierarchical/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..2965552088fce7654bf350c272c6a2391c948942 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/simple_serving/server.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
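+
+# SimpleServing entry point: registers the exported model in ../../export under the name
+# models/cls_hierarchical, with CustomModelHandler for inference and
+# MultiLabelClassificationPostHandler for multi-label post-processing. Launch it with
+# `paddlenlp server server:app --host 0.0.0.0 --port 8189` as described in the README.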
+ +from paddlenlp import SimpleServer +from paddlenlp.server import CustomModelHandler, MultiLabelClassificationPostHandler + +app = SimpleServer() +app.register( + "models/cls_hierarchical", + model_path="../../export", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=CustomModelHandler, + post_handler=MultiLabelClassificationPostHandler, +) diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/README.md b/applications/text_classification/hierarchical/deploy/triton_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6e61ac5536c7195abf24c86192a26537bb35be90 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/README.md @@ -0,0 +1,186 @@ +# 基于Triton Inference Server的服务化部署指南 + +本文档将介绍如何使用[Triton Inference Server](https://github.com/triton-inference-server/server)工具部署基于ERNIE 2.0英文模型文本层次分类的pipeline在线服务。 + +## 目录 +- [服务端环境准备](#服务端环境准备) +- [模型获取和转换](#模型获取和转换) +- [部署模型](#部署模型) +- [客户端请求](#客户端请求) + +## 服务端环境准备 + +### 安装Triton Server +拉取Triton Server镜像: +```shell +docker pull nvcr.io/nvidia/tritonserver:21.10-py3 +``` +启动容器: +```shell +docker run -it --gpus all --net=host --name triton_server -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash +``` + +**NOTE:** + +1. Triton版本号`21.10`可以根据自己的需求调整,各个Triton版本对应的Driver、CUDA、TRT和ONNX Runtime等后端版本可以参考[官网文档](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)。注意其中的`NVIDIA Driver`行,如果NVIDIA Driver低于文档中要求,在启动运行时会报错。 + +2. 可以使用`--gpus '"device=1"'`来指定GPU卡号,更多GPU指定方式请参见[Nvidia User Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration) + + +### 进入容器并准备PaddleNLP环境 +整个服务的前后处理依赖PaddleNLP,需要在容器内安装相关python包 + +进入容器: +```shell +docker exec -it triton_server bash +``` +安装PaddlePaddle、PaddleNLP +```shell +python3 -m pip install paddlepaddle-gpu paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + +**NOTE:** + +1. 默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(-i https://mirror.baidu.com/pypi/simple) + +2. 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.2, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + +3. 
更多关于PaddleNLP安装的详细教程请查看[Installation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + + +### 安装FastTokenizers文本处理加速库(可选) + +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 + +在容器内安装 fast_tokenizer +```shell +python3 -m pip install fast-tokenizer-python +``` + + +## 模型获取和转换 + +使用Triton做服务化部署时,选择ONNX Runtime后端运行需要先将模型转换成ONNX格式。 + + +首先将保存的动态图参数导出成静态图参数,具体代码见[静态图导出脚本](../../export_model.py)。静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python ../../export_model.py --params_path=../../checkpoint/model_state.pdparams --output_path=./wos_infer_model +``` + +使用Paddle2ONNX将Paddle静态图模型转换为ONNX模型格式的命令如下,以下命令成功运行后,将会在当前目录下生成model.onnx模型文件。 + +```shell +paddle2onnx --model_dir infer_model/ --model_filename float32.pdmodel --params_filename float32.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True +``` + +创建空白目录/seqcls/1和seqcls_model/1,并将将转换好的ONNX模型移动到模型仓库目录 + +```shell +mkdir /models/seqcls/1 +mkdir /models/seqcls_model/1 +mv model.onnx /models/seqcls_model/1 +``` + +Paddle2ONNX的命令行参数说明请查阅:[Paddle2ONNX命令行参数说明](https://github.com/PaddlePaddle/Paddle2ONNX#%E5%8F%82%E6%95%B0%E9%80%89%E9%A1%B9) + +模型下载转换好之后,models目录结构如下: +``` +models +├── seqcls +│   ├── 1 +│   └── config.pbtxt +├── seqcls_model +│   ├── 1 +│   │   └── model.onnx +│   └── config.pbtxt +├── seqcls_postprocess +│   ├── 1 +│   │   └── model.py +│   └── config.pbtxt +└── tokenizer + ├── 1 + │   └── model.py + └── config.pbtxt +``` + +模型配置文件config.pbtxt配置细节请参见[Triton Server Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md) + +## 部署模型 + +triton目录包含启动pipeline服务的配置和发送预测请求的代码,包括: + +``` +models # Triton启动需要的模型仓库,包含模型和服务配置文件 +seqcls_grpc_client.py # 层次分类任务发送pipeline预测请求的脚本 +``` + +### 启动服务端 + +在容器内执行下面命令启动服务,默认启动models下所有模型: +```shell +tritonserver --model-repository=/models +``` +也可以通过设定参数只启动单一任务服务: +```shell +tritonserver --model-repository=/models --model-control-mode=explicit --load-model=seqcls +``` +输出打印如下: + +``` +... +I0619 13:40:51.590901 5127 onnxruntime.cc:1999] TRITONBACKEND_Initialize: onnxruntime +I0619 13:40:51.590938 5127 onnxruntime.cc:2009] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.590947 5127 onnxruntime.cc:2015] 'onnxruntime' TRITONBACKEND API version: 1.6 +I0619 13:40:51.623808 5127 openvino.cc:1193] TRITONBACKEND_Initialize: openvino +I0619 13:40:51.623862 5127 openvino.cc:1203] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.623868 5127 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.6 +I0619 13:40:52.980990 5127 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f14d8000000' with size 268435456 +... +I0619 13:43:33.360018 5127 server.cc:592] ++--------------------+---------+--------+ +| Model | Version | Status | ++--------------------+---------+--------+ +| seqcls | 1 | READY | +| seqcls_model | 1 | READY | +| seqcls_postprocess | 1 | READY | +| tokenizer | 1 | READY | ++--------------------+---------+--------+ +... +I0619 13:43:33.365824 5127 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001 +I0619 13:43:33.366221 5127 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000 +I0619 13:43:33.409775 5127 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 +``` + +**NOTE:** + +启动服务时,Triton Server的每个python后端进程默认申请`64M`内存,默认启动的docker无法启动多个python后端节点。两个解决方案: + +1. 
启动容器时设置`shm-size`参数, 比如:`docker run -it --net=host --name triton_server --shm-size="1g" -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash` + +2. 启动服务时设置python后端的`shm-default-byte-size`参数, 设置python后端的默认内存为10M: `tritonserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760` + +## 客户端请求 + +### 客户端环境准备 +客户端请求有两种方式,可以选择在本地执行脚本请求,或下载官方客户端镜像在容器中执行。 + +方式一:本地执行脚本,需要先安装依赖: +``` +pip install grpcio +pip install tritonclient==2.10.0 +``` + +方式二:拉取官网镜像并启动容器: +``` +docker pull nvcr.io/nvidia/tritonserver:21.10-py3-sdk +docker run -it --net=host --name triton_client -v /path/to/triton:/triton_code nvcr.io/nvidia/tritonserver:21.10-py3-sdk bash +``` + +### 启动客户端测试 +注意执行客户端请求时关闭代理,并根据实际情况修改main函数中的ip地址(启动服务所在的机器) + +``` +python seqcls_grpc_client.py +``` diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls/config.pbtxt b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..82261157aefe68bac9a1865d888c0257d2e905e8 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls/config.pbtxt @@ -0,0 +1,75 @@ +name: "seqcls" +platform: "ensemble" +max_batch_size: 64 +input [ + { + name: "INPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "label" + data_type: TYPE_INT64 + dims: [ 1 ] + }, + { + name: "confidence" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "tokenizer" + model_version: 1 + input_map { + key: "INPUT_0" + value: "INPUT" + } + output_map { + key: "OUTPUT_0" + value: "tokenizer_input_ids" + } + output_map { + key: "OUTPUT_1" + value: "tokenizer_token_type_ids" + } + }, + { + model_name: "seqcls_model" + model_version: 1 + input_map { + key: "input_ids" + value: "tokenizer_input_ids" + } + input_map { + key: "token_type_ids" + value: "tokenizer_token_type_ids" + } + output_map { + key: "linear_75.tmp_1" + value: "OUTPUT_2" + } + }, + { + model_name: "seqcls_postprocess" + model_version: 1 + input_map { + key: "POST_INPUT" + value: "OUTPUT_2" + } + output_map { + key: "POST_label" + value: "label" + } + output_map { + key: "POST_confidence" + value: "confidence" + } + } + ] +} + diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_model/config.pbtxt b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_model/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..0fb1417cba37d4b4497fb1c27aff3ab6e039bd1f --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_model/config.pbtxt @@ -0,0 +1,36 @@ +platform: "onnxruntime_onnx" +max_batch_size: 64 +input [ + { + name: "input_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "token_type_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] +output [ + { + name: "linear_75.tmp_1" + data_type: TYPE_FP32 + dims: [ 74 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_GPU + } +] + +optimization { + graph: {level: -1} +} + +parameters { key: "intra_op_thread_count" value: { string_value: "0" } } +parameters { key: "execution_mode" value: { string_value: "0" } } +parameters { key: "inter_op_thread_count" value: { string_value: "0" } } diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/1/model.py 
b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..5db7ef0c7746db295e9110817db6982704d6ac1b --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/1/model.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel(object): + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.txt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. 
The length of this list must + be the same as `requests` + """ + responses = [] + # print("num:", len(requests), flush=True) + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = 1 / (1 + (np.exp((-data[0])))) + + probs = [] + labels = [] + for l, p in enumerate(data): + if p > 0.5: + labels.append(l) + probs.append(p) + + labels = np.array(labels, dtype=self.output_dtype[0]) + probs = np.array(probs, dtype=self.output_dtype[1]) + # print(labels, probs) + out_tensor1 = pb_utils.Tensor(self.output_names[0], labels) + out_tensor2 = pb_utils.Tensor(self.output_names[1], probs) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..fbeda7129f9247823de9d5918af5ee435613e967 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt @@ -0,0 +1,31 @@ +name: "seqcls_postprocess" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "POST_INPUT" + data_type: TYPE_FP32 + dims: [ 74 ] + } +] + +output [ + { + name: "POST_label" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "POST_confidence" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/1/model.py b/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..b4b0b0547ee5a33278e53bacb5d5649c1b3c562b --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/1/model.py @@ -0,0 +1,105 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + +from paddlenlp.transformers import AutoTokenizer + + +class TritonPythonModel(object): + """Your Python model must use the same class name. 
Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.pbtxt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) + # You must parse model_config. JSON string is not parsed here + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = [i[0].decode("utf-8") for i in data] + data = self.tokenizer(data, max_length=128, padding=True, truncation=True) + input_ids = np.array(data["input_ids"], dtype=self.output_dtype[0]) + token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) + out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. 
+ """ + print("Cleaning up...") diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/config.pbtxt b/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..d35d1f44968ba205b1890899a82568d33e90a999 --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/models/tokenizer/config.pbtxt @@ -0,0 +1,31 @@ +name: "tokenizer" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT_0" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT_0" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "OUTPUT_1" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/hierarchical/deploy/triton_serving/seqcls_grpc_client.py b/applications/text_classification/hierarchical/deploy/triton_serving/seqcls_grpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..caedf81b752ab16a1844e4b3e371e34a1d84126e --- /dev/null +++ b/applications/text_classification/hierarchical/deploy/triton_serving/seqcls_grpc_client.py @@ -0,0 +1,108 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from typing import Optional + +import numpy as np +from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput + +LOGGER = logging.getLogger("run_inference_on_triton") + + +class SyncGRPCTritonRunner: + DEFAULT_MAX_RESP_WAIT_S = 120 + + def __init__( + self, + server_url: str, + model_name: str, + model_version: str, + *, + verbose=False, + resp_wait_s: Optional[float] = None, + ): + self._server_url = server_url + self._model_name = model_name + self._model_version = model_version + self._verbose = verbose + self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s + + self._client = InferenceServerClient(self._server_url, verbose=self._verbose) + error = self._verify_triton_state(self._client) + if error: + raise RuntimeError(f"Could not communicate to Triton Server: {error}") + + LOGGER.debug( + f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " + f"are up and ready!" 
+ ) + + model_config = self._client.get_model_config(self._model_name, self._model_version) + model_metadata = self._client.get_model_metadata(self._model_name, self._model_version) + LOGGER.info(f"Model config {model_config}") + LOGGER.info(f"Model metadata {model_metadata}") + + self._inputs = {tm.name: tm for tm in model_metadata.inputs} + self._input_names = list(self._inputs) + self._outputs = {tm.name: tm for tm in model_metadata.outputs} + self._output_names = list(self._outputs) + self._outputs_req = [InferRequestedOutput(name) for name in self._outputs] + + def Run(self, inputs): + """ + Args: + inputs: list, Each value corresponds to an input name of self._input_names + Returns: + results: dict, {name : numpy.array} + """ + infer_inputs = [] + for idx, data in enumerate(inputs): + data = np.array([[x.encode("utf-8")] for x in data], dtype=np.object_) + infer_input = InferInput(self._input_names[idx], [len(data), 1], "BYTES") + infer_input.set_data_from_numpy(data) + infer_inputs.append(infer_input) + + results = self._client.infer( + model_name=self._model_name, + model_version=self._model_version, + inputs=infer_inputs, + outputs=self._outputs_req, + client_timeout=self._response_wait_t, + ) + results = {name: results.as_numpy(name) for name in self._output_names} + return results + + def _verify_triton_state(self, triton_client): + if not triton_client.is_server_live(): + return f"Triton server {self._server_url} is not live" + elif not triton_client.is_server_ready(): + return f"Triton server {self._server_url} is not ready" + elif not triton_client.is_model_ready(self._model_name, self._model_version): + return f"Model {self._model_name}:{self._model_version} is not ready" + return None + + +if __name__ == "__main__": + model_name = "seqcls" + model_version = "1" + url = "localhost:8001" + runner = SyncGRPCTritonRunner(url, model_name, model_version) + + texts = [["消失的“外企光环”,5月份在华裁员900余人,香饽饽变“臭”了"], ["卡车超载致使跨桥侧翻,没那么简单"], ["金属卡扣安装不到位,上海乐扣乐扣贸易有限公司将召回捣碎器1162件"]] + + for text in texts: + # input format:[input1, input2 ... inputn], n = len(self._input_names) + result = runner.Run([text]) + print(result) diff --git a/applications/text_classification/hierarchical/export_model.py b/applications/text_classification/hierarchical/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..c57dc23372f9b934fbee6686092309cd5ef5b22a --- /dev/null +++ b/applications/text_classification/hierarchical/export_model.py @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
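# 补充说明:export_model.py 将 --params_path 下训练好的动态图权重导出为静态图参数,
# 保存为 <output_path>/float32.pdmodel 与 float32.pdiparams,供 ONNXRuntime / Triton / SimpleServing 等部署使用;
# 指定 --multilingual(如 ERNIE-M)时模型输入仅包含 input_ids,否则同时包含 input_ids 与 token_type_ids。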
+ +import argparse +import os + +import paddle +from paddlenlp.transformers import AutoModelForSequenceClassification + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') +parser.add_argument("--params_path", type=str, default='./checkpoint/', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + model.eval() + if args.multilingual: + input_spec = [paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids")] + else: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + ] + # Convert to static graph with specific input description + model = paddle.jit.to_static(model, input_spec=input_spec) + + # Save in static graph model. + save_path = os.path.join(args.output_path, "float32") + paddle.jit.save(model, save_path) diff --git a/applications/text_classification/hierarchical/few-shot/README.md b/applications/text_classification/hierarchical/few-shot/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ff44c9d173a69b44457b7d0c07481fad01ab2788 --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/README.md @@ -0,0 +1,375 @@ +# 小样本场景下的多标签层次分类任务指南 + +**零样本/小样本文本分类推荐使用 UTC 模型,详情见[目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/zero_shot_text_classification),本项目将会在2.5.2版本下线。** + +## 目录 + +- [1. 项目说明](#项目说明) +- [2. 效果展示](#效果展示) +- [3. 定制训练](#定制训练) + - [3.1 运行环境](#运行环境) + - [3.2 代码结构](#代码结构) + - [3.3 数据标注](#数据标注) + - [3.4 模型训练](#模型训练) + - [3.5 模型评估](#模型评估) + - [3.6 模型部署](#模型部署) +- [4. References](#References) + + +## 1. 项目说明 + +本项目提供了小样本场景下文本多标签层次分类的解决方案,在 ERNIE3.0 的基础上利用提示学习取得比微调更好的分类效果,充分利用标注信息。 + +**多标签层次分类任务** 指自然语言处理任务中,每个样本具有多个标签标记,并且标签集合中标签之间存在预定义的层次结构,多标签层次分类需要充分考虑标签集之间的层次结构关系来预测层次化预测结果。 +在现实场景中,大量的数据如新闻分类、专利分类、学术论文分类等标签集合存在层次化结构,需要利用算法为文本自动标注更细粒度和更准确的标签。 +现有的主流解决方案是在预训练语言模型上进行微调,因为多标签分类任务与预训练阶段的掩码预测任务有着天然的差异,想要取得较好的分类效果往往需要大量数据标注。 + +**提示学习(Prompt Learning)** 的主要思想是将二/多分类任务转换为掩码预测任务,充分利用预训练语言模型学习到的特征,从而降低样本需求。以情感分类任务为例,标签分为`1-正向`,`0-负向`两类,如下图所示,通过提示`我[MASK]喜欢。`,原有`1-正向`,`0-负向`的标签被转化为了预测空格是`很`还是`不`。 + +
(此处原为提示学习示意图)
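下面用一小段 Python 伪代码示意这种"把分类改写成掩码预测"的思路(仅为概念演示,其中文本与映射词均为假设示例,并非 PaddleNLP 的实际接口):

```python
# 概念示意(假设示例,非 PaddleNLP 实际 API):将情感分类样本改写为掩码预测任务
text = "这家店的菜品很合口味"                  # 待分类文本(假设)
prompt = "我[MASK]喜欢。" + text               # 拼接提示模板与原文
verbalizer = {"很": "1-正向", "不": "0-负向"}  # 映射词 -> 原标签

# 预训练模型在 [MASK] 位置给出各候选字的概率,
# 取映射词中概率最高的字,再经 verbalizer 还原为分类标签
```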
+ +微调方法和提示方法的区别如图所示: + +【微调学习】需要学习的参数是以 `[CLS]` 向量为输入,以负向/正向为输出的随机初始化的分类器。 + +【提示学习】通过构造提示,将原有的分类任务转化为掩码预测,即掩盖原句中的某个字,用模型预测该字。此时的分类器不再是随机初始化,而是利用了待预测字的预训练向量来初始化,充分利用了预训练模型学习到的参数。 + +【方案选择】对于标注样本充足的场景可以直接使用[微调学习](../README.md)实现文本多分类,对于尚无标注或者标注样本较少的任务场景我们推荐使用提示学习,以取得更好的效果。 + +### 方案特点 + +- **标注成本低**:以往的微调方式需要大量的数据标注才能保证模型分类效果。提示学习可以降低数据标注依赖,在小样本(few-shot)的场景下取得比微调更好的分类效果。 +- **全流程打通**:提供了从训练到部署的完整解决方案,可以低成本迁移至实际应用场景。 + + + +## 2.效果展示 + +本项目中使用了 ERNIE3.0 模型,对于中文训练任务可以根据需求选择不同的预训练模型参数进行训练,我们测评了 Base 模型在事件类型分类任务上的表现。测试配置如下: + +1. 数据集:2020语言与智能技术竞赛:[事件抽取任务](https://aistudio.baidu.com/aistudio/competition/detail/32/0/introduction)小样本数据集测试集。 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.4rc + +4. PaddleNLP 版本:2.4.3 + +5. 评估设置 + +- 每个 epoch 评估一次,按照验证集上的评价指标,取分数最高的模型参数用于测试集的评估。表格中的最终结果为重复 10 次的均值。 +- 为了避免过拟合,这里使用了早停机制 (Early-stopping)。因为微调方式收敛较慢,且波动较大,我们将微调方式的早停步数增加为 20 步,但仍有一半结果未收敛,表格中的微调结果为 5 次的均值。 +- 测试脚本如下 + - 微调 + + ``` + cd ../ + python train.py --dataset_dir "./data/" --save_dir "./checkpoints" --max_seq_length 128 --model_name "ernie-3.0-base-zh" --batch_size 8 --learning_rate 3e-5 --epochs 100 --logging_steps 5 --early_stop --early_stop_num 20 + ``` + + - 提示学习 + + ``` + python train.py --data_dir ./data/ --output_dir ./checkpoints/ --prompt "这句话描述的事件是" --model_name_or_path ernie-3.0-base-zh --max_seq_length 128 --learning_rate 3e-5 --ppt_learning_rate 3e-4 --do_train --do_eval --num_train_epochs 100 --logging_steps 5 --per_device_eval_batch_size 32 --per_device_train_batch_size 8 --do_predict --metric_for_best_model macro_f1_score --load_best_model_at_end --evaluation_strategy epoch --save_strategy epoch + ``` + +6. 精度评价指标:Micro F1分数、Macro F1分数 + + | model_name | 训练方式 | Micro F1分数 | Macro F1分数 | + | ---------- | ------- | ----------- | ----------- | + | ernie-3.0-base-zh | 微调学习 | 0.7172 | 0.3821 | + | ernie-3.0-base-zh | 提示学习 | 0.8945 | 0.8516 | + + + +## 3.定制训练 + +下边通过事件抽取任务的例子展示如何使用小样本学习来进行文本分类。 + + +### 3.1 运行环境 + +- python >= 3.7 +- paddlepaddle >= 2.4rc +- paddlenlp >= 2.4.3 +- paddle2onnx >= 1.0.3 + + +### 3.2 代码结构 + +```text +. +├── train.py # 模型组网训练脚本 +├── utils.py # 数据处理工具 +├── infer.py # 模型部署脚本 +└── README.md +``` + + +### 3.3 数据标注 + +我们推荐使用数据标注平台[doccano](https://github.com/doccano/doccano)进行自定义数据标注,本项目也打通了从标注到训练的通道,即doccano导出数据后可通过[doccano.py](../../doccano.py)脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考[doccano数据标注指南](../../doccano.md)。 + +**示例数据** + +这里我们使用2020语言与智能技术竞赛:事件抽取任务数据集的子集作为示例数据集。该数据集中原始训练集包括 11958 条标注样本,我们按每条标签随机采样 2 条样本,得到 148 条样本数据作为训练集,剩余训练集数据作为测试集。可点击[这里](https://paddlenlp.bj.bcebos.com/datasets/few-shot/events.tar.gz)下载解压并放入`./data/`文件夹,或者运行以下脚本 + +``` +wget https://paddlenlp.bj.bcebos.com/datasets/few-shot/events.tar.gz +tar zxvf events.tar.gz +mv events data +``` + +**数据格式** + +下边主要介绍多标签分类任务自定义数据集的格式要求,整体目录如下 + +```text +data/ +├── train.txt # 训练数据集 +├── dev.txt # 验证数据集 +├── test.txt # 测试数据集(可选) +├── data.txt # 待预测数据(可选) +└── label.txt # 分类标签集 +``` + +**训练/验证/测试数据** + +对于训练/验证/测试数据集文件,每行数据表示一条样本,包括文本和标签两部分,由tab符`\t`分隔,多个标签以英文逗号`,`分隔,同一标签内不同层级以`##`字符连接。格式如下 +```text +<文本>'\t'<标签>','<标签>','<标签> +<文本>'\t'<标签>','<标签> +... +``` +例如, +``` +紫光圣果副总经理李明雷辞职 组织关系,组织关系##辞/离职 +无理取闹辱骂扶贫干部织金一居民被行拘 司法行为,司法行为##拘捕 +... +``` + +**预测数据** + +对于待预测数据文件,每行包含一条待预测样本,无标签。格式如下 +```text +<文本> +<文本> +... 
+``` +例如, +``` +没白等!大众PoloPlus明日上市,配1.5L全铝发动机 +国家统计局17日发布消息称,国务院第四次全国经济普查领导小组办公室和国家统计局近期对四川省德阳市下辖广汉市第四次全国经济普查违法举报线索进行了立案调查。 +... +``` + +**标签数据** + +对于分类标签集文件,存储了数据集中所有的标签路径集合,每行是一个标签路径,高层的标签指向底层标签,不同层级的标签用'##'连接,本项目选择为标签层次结构中的每一个节点生成对应的标签路径,详见[层次分类任务介绍](../README.md#层次分类任务介绍),标签路径格式如下 + +```text +<一级标签> +<一级标签>'##'<二级标签> +<一级标签>'##'<二级标签>'##'<三级标签> +... +``` +如果需要自定义标签映射用于分类器初始化,则每行需要包括标签名和相应的映射词,由`==`分隔。格式如下 +```text +<一级标签>'=='<映射词> +<一级标签>'##'<二级标签>'=='<映射词> +<一级标签>'##'<二级标签>'##'<三级标签>'=='<映射词> +... +``` +例如,原标签路径`交往##会见`中包括特殊符号`##`,大概率不会在说话或者写作中使用,因此我们将其映射为`会见`或者`见面`。 +``` +交往==交往 +交往##会见==会见 +... +``` + +**Note**: 这里的标签映射词定义遵循的规则是,不同映射词尽可能长度一致,映射词和提示需要尽可能构成通顺的语句。越接近自然语句,小样本下模型训练效果越好。如果原标签名已经可以构成通顺语句,也可以不构造映射词,每行一个标签即可。 + + +### 3.4 模型训练 + +**单卡训练** + +``` +python train.py \ +--data_dir ./data/ \ +--output_dir ./checkpoints/ \ +--prompt "这句话描述的事件是" \ +--model_name_or_path ernie-3.0-base-zh \ +--max_seq_length 128 \ +--learning_rate 3e-5 \ +--ppt_learning_rate 3e-4 \ +--do_train \ +--do_eval \ +--do_predict \ +--do_export \ +--num_train_epochs 100 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--metric_for_best_model macro_f1_score \ +--load_best_model_at_end \ +--eval_steps 100 +``` +**多卡训练** + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus 0,1,2,3 train.py \ +--data_dir ./data/ \ +--output_dir ./checkpoints/ \ +--prompt "这句话描述的事件是" \ +--model_name_or_path ernie-3.0-base-zh \ +--max_seq_length 128 \ +--learning_rate 3e-5 \ +--ppt_learning_rate 3e-4 \ +--do_train \ +--do_eval \ +--do_predict \ +--do_export \ +--num_train_epochs 100 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--metric_for_best_model macro_f1_score \ +--load_best_model_at_end \ +--eval_steps 100 +``` + +可配置参数说明: +- `model_name_or_path`: 内置模型名,或者模型参数配置目录路径。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 训练数据集路径,数据格式要求详见[数据标注](#数据标注)。 +- `output_dir`: 模型参数、训练日志和静态图导出的保存目录。 +- `prompt`: 提示模板。定义了如何将文本和提示拼接结合。 +- `soft_encoder`: 提示向量的编码器,`lstm`表示双向LSTM, `mlp`表示双层线性层, None表示直接使用提示向量。默认为`lstm`。 +- `use_rdrop`: 使用 [R-Drop](https://arxiv.org/abs/2106.14448) 策略。 +- `use_rgl`: 使用 [RGL](https://aclanthology.org/2022.findings-naacl.81/) 策略。 +- `encoder_hidden_size`: 提示向量的维度。若为None,则使用预训练模型字向量维度。默认为200。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `learning_rate`: 预训练语言模型参数基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。 +- `ppt_learning_rate`: 提示相关参数的基础学习率大小,当预训练参数不固定时,与其共用learning rate scheduler。一般设为`learning_rate`的十倍。 +- `do_train`: 是否进行训练。 +- `do_eval`: 是否进行评估。 +- `do_predict`: 是否进行预测。 +- `do_export`: 是否在运行结束时将模型导出为静态图,保存路径为`output_dir/export`。 +- `num_train_epochs`: 训练的最大轮数。 +- `max_steps`: 训练的最大步数。此设置将会覆盖`num_train_epochs`。 +- `save_total_limit`: 模型检查点保存数量。 +- `device`: 使用的设备,默认为`gpu`。 +- `eval_steps`: 评估模型的间隔步数。 +- `logging_steps`: 打印日志的间隔步数。 +- `per_device_train_batch_size`: 每次训练每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `per_device_eval_batch_size`: 每次评估每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `load_best_model_at_end`: 是否在模型训练结束后加载评估指标最优的模型参数。 +- `evaluation_strategy`: 模型评估的间隔策略。若为`epoch`,则每轮训练结束后评估模型。 +- `save_strategy`: 模型保存的间隔策略。若为`epoch`,则每轮训练结束后保存当前模型参数。 + +更多参数介绍可参考[配置文件](https://paddlenlp.readthedocs.io/zh/latest/trainer.html)。 + + + +### 3.5 模型评估 + +在模型训练时开启`--do_predict`,训练结束后直接在测试集上`test.txt`进行评估,也可以在训练结束后,通过运行以下命令加载模型参数进行评估: +``` +python train.py --do_predict --data_dir ./data --output_dir ./predict_checkpoint --resume_from_checkpoint 
./checkpoints/ --max_seq_length 128 +``` + +可配置参数说明: + +- `data_dir`: 测试数据路径。测试数据应存放在该目录下`test.txt`文件中,每行一条待预测文本。 +- `output_dir`: 日志的保存目录。 +- `resume_from_checkpoint`: 训练时模型参数的保存目录,用于加载模型参数。 +- `do_predict`: 是否进行测试集评估。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 + + +### 3.6 模型部署 + +#### 模型导出 + +在训练结束后,需要将动态图模型导出为静态图参数用于部署推理。可以在模型训练时开启`--do_export`在训练结束后直接导出,也可以运行以下命令加载并导出训练后的模型参数,默认导出到在`output_dir`指定的目录下。 +``` +python train.py --do_export --data_dir ./data --output_dir ./export_checkpoint --resume_from_checkpoint ./checkpoints/ +``` + +可配置参数说明: + +- `data_dir`: 标签数据路径。 +- `output_dir`: 静态图模型参数和日志的保存目录。 +- `resume_from_checkpoint`: 训练时模型参数的保存目录,用于加载模型参数。 +- `do_export`: 是否将模型导出为静态图,保存路径为`output_dir/export`。 +- `export_type`: 模型导出的格式,默认为`paddle`,即导出静态图。 + +#### ONNXRuntime部署 + +**运行环境** + +模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime,Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。 + +- 如果基于GPU部署,请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime-gpu onnx onnxconverter-common +``` + +- 如果基于CPU部署,请使用如下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime +``` + +**CPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device cpu +``` + +**GPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device gpu --device_id 0 +``` + +可配置参数说明: + +- `model_path_prefix`: 导出的静态图模型路径及文件前缀。 +- `model_name`: 内置预训练模型名,用于加载tokenizer。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 待推理数据所在路径,数据应存放在该目录下的`data.txt`文件。 +- `max_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `batch_size`: 每次预测的样本数量。 +- `device`: 选择推理设备,包括`cpu`和`gpu`。默认为`gpu`。 +- `device_id`: 指定GPU设备ID。 +- `use_fp16`: 是否使用半精度加速推理。仅在GPU设备上有效。 +- `num_threads`: 设置CPU使用的线程数。默认为机器上的物理内核数。 + +**Note**: 在GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,在包括V100、T4、A10、A100、GTX 20系列和30系列显卡等设备上可以开启FP16进行加速,在CPU或者CUDA计算能力 (CUDA Compute Capability) 小于7.0时开启不会带来加速效果。 + + +## 4. References + +- Liu, Xiao, et al. "GPT understands, too." arXiv preprint arXiv:2103.10385 (2021). [[PDF]](https://arxiv.org/abs/2103.10385) +- Hambardzumyan, Karen, Hrant Khachatrian, and Jonathan May. "Warp: Word-level adversarial reprogramming." arXiv preprint arXiv:2101.00121 (2021). [[PDF]](https://arxiv.org/abs/2101.00121) +- Ding, Ning, et al. "Openprompt: An open-source framework for prompt-learning." arXiv preprint arXiv:2111.01998 (2021). [[PDF]](https://arxiv.org/abs/2111.01998) diff --git a/applications/text_classification/hierarchical/few-shot/infer.py b/applications/text_classification/hierarchical/few-shot/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..eeb30fc27c2d9d61fd060febad20628b59f2f8fa --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/infer.py @@ -0,0 +1,226 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os + +import numpy as np +import onnxruntime as ort +import paddle2onnx +import psutil +import six + +from paddlenlp.prompt import AutoTemplate, PromptDataCollatorWithPadding +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") +parser.add_argument("--model_name", default="ernie-3.0-base-zh", type=str, help="The name of pretrained model.") +parser.add_argument("--data_dir", default=None, type=str, help="The path to the prediction data, including label.txt and data.txt.") +parser.add_argument("--max_length", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") +parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu.") +parser.add_argument("--device", choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--device_id", default=0, help="Select which gpu device to train model.") +args = parser.parse_args() +# yapf: enable + + +class InferBackend(object): + def __init__(self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, num_threads=10): + + if not isinstance(device, six.string_types): + logger.error( + ">>> [InferBackend] The type of device must be string, but the type you set is: ", type(device) + ) + exit(0) + if device not in ["cpu", "gpu"]: + logger.error(">>> [InferBackend] The device must be cpu or gpu, but your device is set to:", type(device)) + exit(0) + + logger.info(">>> [InferBackend] Creating Engine ...") + + onnx_model = paddle2onnx.command.c_paddle_to_onnx( + model_file=model_path_prefix + ".pdmodel", + params_file=model_path_prefix + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + infer_model_dir = model_path_prefix.rsplit("/", 1)[0] + float_onnx_file = os.path.join(infer_model_dir, "model.onnx") + with open(float_onnx_file, "wb", encoding="utf-8") as f: + f.write(onnx_model) + + if device == "gpu": + logger.info(">>> [InferBackend] Use GPU to inference ...") + providers = ["CUDAExecutionProvider"] + if use_fp16: + logger.info(">>> [InferBackend] Use FP16 to inference ...") + import onnx + from onnxconverter_common import float16 + + fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx") + onnx_model = onnx.load_model(float_onnx_file) + trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True) + onnx.save_model(trans_model, fp16_model_file) + onnx_model = fp16_model_file + else: + logger.info(">>> [InferBackend] Use CPU to inference ...") + providers = ["CPUExecutionProvider"] + if use_fp16: + logger.warning( + ">>> [InferBackend] Ignore use_fp16 as it only " + "takes effect when deploying on gpu..." 
+ ) + + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=providers, provider_options=[{"device_id": device_id}] + ) + if device == "gpu": + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + "The environment for GPU inference is not set properly. " + "A possible cause is that you had installed both onnxruntime and onnxruntime-gpu. " + "Please run the following commands to reinstall: \n " + "1) pip uninstall -y onnxruntime onnxruntime-gpu \n 2) pip install onnxruntime-gpu" + ) + logger.info(">>> [InferBackend] Engine Created ...") + + def infer(self, input_dict: dict): + result = self.predictor.run(None, input_dict) + return result + + +class HierachicalPredictor(object): + def __init__(self, args): + self.args = args + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name) + self.model = AutoModelForMaskedLM.from_pretrained(args.model_name) + self.template, self.labels, self.input_handles = self.post_init() + self.collate_fn = PromptDataCollatorWithPadding( + self.tokenizer, padding=True, return_tensors="np", return_attention_mask=True + ) + + self.inference_backend = InferBackend( + self.args.model_path_prefix, + self.args.device, + self.args.device_id, + self.args.use_fp16, + self.args.num_threads, + ) + + def post_init(self): + export_path = os.path.dirname(self.args.model_path_prefix) + template_path = os.path.join(export_path, "template_config.json") + with open(template_path, "r", encoding="utf-8") as fp: + prompt = json.load(fp) + template = AutoTemplate.create_from(prompt, self.tokenizer, self.args.max_length, self.model) + keywords = template.extract_template_keywords(template.prompt) + inputs = ["input_ids", "token_type_ids", "position_ids", "attention_mask", "masked_positions"] + if "soft" in keywords: + inputs.append("soft_token_ids") + if "encoder" in keywords: + inputs.append("encoder_ids") + verbalizer_path = os.path.join(export_path, "verbalizer_config.json") + with open(verbalizer_path, "r", encoding="utf-8") as fp: + label_words = json.load(fp) + labels = sorted(list(label_words.keys())) + + return template, labels, inputs + + def predict(self, input_data: list): + encoded_inputs = self.preprocess(input_data) + infer_result = self.infer_batch(encoded_inputs) + result = self.postprocess(infer_result) + self.printer(result, input_data) + return result + + def _infer(self, input_dict): + infer_data = self.inference_backend.infer(input_dict) + return infer_data + + def infer_batch(self, inputs): + num_sample = len(inputs) + infer_data = None + num_infer_data = None + for index in range(0, num_sample, self.args.batch_size): + left, right = index, index + self.args.batch_size + batch_dict = self.collate_fn(inputs[left:right]) + input_dict = {} + for key in self.input_handles: + value = batch_dict[key] + if key == "attention_mask": + if value.ndim == 2: + value = (1 - value[:, np.newaxis, np.newaxis, :]) * -1e4 + elif value.ndim != 4: + raise ValueError("Expect attention mask with ndim=2 or 4, but get ndim={}".format(value.ndim)) + value = value.astype("float32") + else: + value = value.astype("int64") + input_dict[key] = value + results = self._infer(input_dict) + if infer_data is None: + infer_data = [[x] for x in results] + num_infer_data = len(results) + else: + for i in range(num_infer_data): + infer_data[i].append(results[i]) + for i in range(num_infer_data): + 
infer_data[i] = np.concatenate(infer_data[i], axis=0) + return infer_data + + def preprocess(self, input_data: list): + text = [{"text_a": x} for x in input_data] + inputs = [self.template(x) for x in text] + return inputs + + @staticmethod + def sigmoid(z): + return 1 / (1 + np.exp(-z)) + + def postprocess(self, infer_data): + threshold = 0.5 + probs = self.sigmoid(infer_data[0]) + label_ids = np.argwhere(probs > threshold) + labels = [[] for _ in range(probs.shape[0])] + for idx, label_id in label_ids: + labels[idx].append(self.labels[label_id]) + return {"label": labels} + + def printer(self, result, input_data): + label = result["label"] + for i in range(len(label)): + logger.info("input data: {}".format(input_data[i])) + logger.info("labels: {}".format(", ".join(label[i]))) + logger.info("-----------------------------") + + +if __name__ == "__main__": + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + predictor = HierachicalPredictor(args) + + text_dir = os.path.join(args.data_dir, "data.txt") + with open(text_dir, "r", encoding="utf-8") as f: + text_list = [x.strip() for x in f.readlines()] + + predictor.predict(text_list) diff --git a/applications/text_classification/hierarchical/few-shot/metric.py b/applications/text_classification/hierarchical/few-shot/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..e46d46ed093bce37012d524f3fc0483b7118119a --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/metric.py @@ -0,0 +1,81 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from paddle.metric import Metric +from sklearn.metrics import classification_report, f1_score + +from paddlenlp.utils.log import logger + + +class MetricReport(Metric): + """ + F1 score for hierarchical text classification task. + """ + + def __init__(self, name="MetricReport", average="micro"): + super(MetricReport, self).__init__() + self.average = average + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. 
+ """ + self.y_prob = None + self.y_true = None + + def f1_score(self, y_prob): + """ + Compute micro f1 score and macro f1 score + """ + threshold = 0.5 + self.y_pred = y_prob > threshold + micro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="micro") + macro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="macro") + return micro_f1_score, macro_f1_score + + def update(self, probs, labels): + """ + Update the probability and label + """ + if self.y_prob is not None: + self.y_prob = np.append(self.y_prob, probs.numpy(), axis=0) + else: + self.y_prob = probs.numpy() + if self.y_true is not None: + self.y_true = np.append(self.y_true, labels.numpy(), axis=0) + else: + self.y_true = labels.numpy() + + def accumulate(self): + """ + Returns micro f1 score and macro f1 score + """ + micro_f1_score, macro_f1_score = self.f1_score(y_prob=self.y_prob) + return micro_f1_score, macro_f1_score + + def report(self): + """ + Returns classification report + """ + self.y_pred = self.y_prob > 0.5 + logger.info("classification report:\n" + classification_report(self.y_true, self.y_pred, digits=4)) + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/applications/text_classification/hierarchical/few-shot/requirements_cpu.txt b/applications/text_classification/hierarchical/few-shot/requirements_cpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..bbe76e363f00631d66e0733833813cad5991f009 --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/requirements_cpu.txt @@ -0,0 +1,5 @@ +psutil +paddlepaddle>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime diff --git a/applications/text_classification/hierarchical/few-shot/requirements_gpu.txt b/applications/text_classification/hierarchical/few-shot/requirements_gpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..66454bd8b6b5fe08521215d4a5c2e7242225d869 --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/requirements_gpu.txt @@ -0,0 +1,7 @@ +psutil +paddlepaddle-gpu>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime-gpu +onnx +onnxconverter-common diff --git a/applications/text_classification/hierarchical/few-shot/train.py b/applications/text_classification/hierarchical/few-shot/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ce0d574f8dc85bcfa2909ca591224e1a7f29be55 --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/train.py @@ -0,0 +1,133 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
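# 补充说明:train.py 为提示学习训练入口,整体流程为:
# 解析 Model/Data/PromptTuning 三组参数 -> 由 --prompt 构造 AutoTemplate、
# 由 label.txt(支持 `标签==映射词` 写法)构造 SoftVerbalizer ->
# 使用 PromptTrainer 完成训练、评估、测试集预测与静态图导出(--do_export)。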
+ +import os +from collections import defaultdict +from dataclasses import dataclass, field + +import paddle +import paddle.nn.functional as F +from metric import MetricReport +from utils import load_local_dataset + +from paddlenlp.prompt import ( + AutoTemplate, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + SoftVerbalizer, +) +from paddlenlp.trainer import EarlyStoppingCallback, PdArgumentParser +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + data_dir: str = field(default="./data", metadata={"help": "The dataset dictionary includes train.txt, dev.txt, test.txt, label.txt and data.txt (optional) files."}) + prompt: str = field(default=None, metadata={"help": "The input prompt for tuning."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-3.0-base-zh", metadata={"help": "The build-in pretrained model or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Load the pretrained language model. + model = AutoModelForMaskedLM.from_pretrained(model_args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # Define the template for preprocess and the verbalizer for postprocess. + template = AutoTemplate.create_from(data_args.prompt, tokenizer, training_args.max_seq_length, model=model) + logger.info("Using template: {}".format(template.prompt)) + + label_file = os.path.join(data_args.data_dir, "label.txt") + with open(label_file, "r", encoding="utf-8") as fp: + label_words = defaultdict(list) + for line in fp: + data = line.strip().split("==") + word = data[1] if len(data) > 1 else data[0].split("##")[-1] + label_words[data[0]].append(word) + verbalizer = SoftVerbalizer(label_words, tokenizer, model) + + # Load the few-shot datasets. + train_ds, dev_ds, test_ds = load_local_dataset( + data_path=data_args.data_dir, splits=["train", "dev", "test"], label_list=verbalizer.labels_to_ids + ) + + # Define the criterion. + criterion = paddle.nn.BCEWithLogitsLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds): + metric = MetricReport() + preds = F.sigmoid(paddle.to_tensor(eval_preds.predictions)) + metric.update(preds, paddle.to_tensor(eval_preds.label_ids)) + micro_f1_score, macro_f1_score = metric.accumulate() + return {"micro_f1_score": micro_f1_score, "macro_f1_score": macro_f1_score} + + # Deine the early-stopping callback. + callbacks = [EarlyStoppingCallback(early_stopping_patience=4, early_stopping_threshold=0.0)] + + # Initialize the trainer. 
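    # (补充注释)多标签场景下 criterion 使用 BCEWithLogitsLoss,callbacks 中早停 patience 为 4;
    # compute_metrics 返回 micro/macro F1,与 --metric_for_best_model macro_f1_score 配合选出最优模型。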
+ trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=callbacks, + compute_metrics=compute_metrics, + ) + + # Training. + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=None) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Prediction. + if training_args.do_predict: + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Export static model. + if training_args.do_export: + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/hierarchical/few-shot/utils.py b/applications/text_classification/hierarchical/few-shot/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..2e1dc6f44756ce215cb6b2b638a3d865ac436c91 --- /dev/null +++ b/applications/text_classification/hierarchical/few-shot/utils.py @@ -0,0 +1,53 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +from paddlenlp.datasets import load_dataset + + +def load_local_dataset(data_path, splits, label_list): + """ + Load dataset for hierachical classification from files, where + there is one example per line. Text and label are separated + by '\t', and multiple labels are delimited by ','. + + Args: + data_path (str): + Path to the dataset directory, including label.txt, train.txt, + dev.txt (and data.txt). + splits (list): + Which file(s) to load, such as ['train', 'dev', 'test']. + label_list (dict): + The dictionary that maps labels to indeces. + """ + + def _reader(data_file, label_list): + with open(data_file, "r", encoding="utf-8") as fp: + for idx, line in enumerate(fp): + data = line.strip().split("\t") + if len(data) == 1: + yield {"text_a": data[0]} + else: + text, label = data + label = label.strip().split(",") + label = [float(1) if x in label else float(0) for x in label_list] + yield {"text_a": text, "labels": label} + + split_map = {"train": "train.txt", "dev": "dev.txt", "test": "test.txt"} + datasets = [] + for split in splits: + data_file = os.path.join(data_path, split_map[split]) + datasets.append(load_dataset(_reader, data_file=data_file, label_list=label_list, lazy=False)) + return datasets diff --git a/applications/text_classification/hierarchical/metric.py b/applications/text_classification/hierarchical/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..608e23f2cf733fb813c309a5b5516fd0e58a7a0f --- /dev/null +++ b/applications/text_classification/hierarchical/metric.py @@ -0,0 +1,81 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from sklearn.metrics import f1_score, classification_report + +from paddle.metric import Metric +from paddlenlp.utils.log import logger + + +class MetricReport(Metric): + """ + F1 score for hierarchical text classification task. + """ + + def __init__(self, name="MetricReport", average="micro"): + super(MetricReport, self).__init__() + self.average = average + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. + """ + self.y_prob = None + self.y_true = None + + def f1_score(self, y_prob): + """ + Compute micro f1 score and macro f1 score + """ + threshold = 0.5 + self.y_pred = y_prob > threshold + micro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="micro") + macro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="macro") + return micro_f1_score, macro_f1_score + + def update(self, probs, labels): + """ + Update the probability and label + """ + if self.y_prob is not None: + self.y_prob = np.append(self.y_prob, probs.numpy(), axis=0) + else: + self.y_prob = probs.numpy() + if self.y_true is not None: + self.y_true = np.append(self.y_true, labels.numpy(), axis=0) + else: + self.y_true = labels.numpy() + + def accumulate(self): + """ + Returns micro f1 score and macro f1 score + """ + micro_f1_score, macro_f1_score = self.f1_score(y_prob=self.y_prob) + return micro_f1_score, macro_f1_score + + def report(self): + """ + Returns classification report + """ + self.y_pred = self.y_prob > 0.5 + logger.info("classification report:\n" + classification_report(self.y_true, self.y_pred, digits=4)) + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/applications/text_classification/hierarchical/predict.py b/applications/text_classification/hierarchical/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..8b0c4388dd219ba52c888078ee65459a25956569 --- /dev/null +++ b/applications/text_classification/hierarchical/predict.py @@ -0,0 +1,107 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
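+
+# Predicts hierarchical multi-label categories for the unlabeled texts in
+# data.txt with a fine-tuned AutoModelForSequenceClassification checkpoint:
+# probabilities above 0.5 are kept and every level of the predicted label path
+# is logged. Example invocation (paths are illustrative; arguments are defined
+# in the argparse block below):
+#     python predict.py --dataset_dir ./data --params_path ./checkpoint/ \
+#         --max_seq_length 128 --batch_size 32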
+ +import argparse +import functools +import os + +import paddle +import paddle.nn.functional as F +from paddle.io import BatchSampler, DataLoader +from utils import preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="Local dataset directory should include data.txt and label.txt") +parser.add_argument("--params_path", default="./checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--data_file", type=str, default="data.txt", help="Unlabeled data file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +args = parser.parse_args() +# yapf: enable + + +@paddle.no_grad() +def predict(): + """ + Predicts the data labels. + """ + paddle.set_device(args.device) + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + + label_list = [] + label_path = os.path.join(args.dataset_dir, args.label_file) + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + label_list.append(line.strip()) + + data_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.data_file), is_test=True, lazy=False + ) + + trans_func = functools.partial( + preprocess_function, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + label_nums=len(label_list), + is_test=True, + ) + + data_ds = data_ds.map(trans_func) + + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + data_batch_sampler = BatchSampler(data_ds, batch_size=args.batch_size, shuffle=False) + + data_data_loader = DataLoader(dataset=data_ds, batch_sampler=data_batch_sampler, collate_fn=collate_fn) + + results = [] + model.eval() + for batch in data_data_loader: + logits = model(**batch) + probs = F.sigmoid(logits).numpy() + for prob in probs: + labels = [] + for i, p in enumerate(prob): + if p > 0.5: + labels.append(label_list[i]) + results.append(labels) + + for t, labels in zip(data_ds.data, results): + hierarchical_labels = {} + logger.info("text: {}".format(t["sentence"])) + logger.info("prediction result: {}".format(",".join(labels))) + for label in labels: + for i, l in enumerate(label.split("##")): + if i not in hierarchical_labels: + hierarchical_labels[i] = [] + if l not in hierarchical_labels[i]: + hierarchical_labels[i].append(l) + for d in range(len(hierarchical_labels)): + logger.info("level {} : {}".format(d + 1, ",".join(hierarchical_labels[d]))) + logger.info("--------------------") + return + + +if __name__ == "__main__": + + predict() diff --git a/applications/text_classification/hierarchical/prune.py b/applications/text_classification/hierarchical/prune.py new file mode 100644 index 
0000000000000000000000000000000000000000..e4b9dff397e3f12165af1ebece97d7429dd1df04 --- /dev/null +++ b/applications/text_classification/hierarchical/prune.py @@ -0,0 +1,123 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import functools + +import paddle +import paddle.nn.functional as F +from paddleslim.nas.ofa import OFA +from paddlenlp.utils.log import logger +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import PdArgumentParser, Trainer, CompressionArguments +from paddlenlp.transformers import AutoTokenizer, AutoModelForSequenceClassification +from dataclasses import dataclass, field + +from utils import preprocess_function, read_local_dataset +from metric import MetricReport + + +# yapf: disable +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + dataset_dir: str = field(default=None, metadata={"help": "Local dataset directory should include train.txt, dev.txt and label.txt."}) + max_seq_length: int = field(default=128, metadata={"help": "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + params_dir: str = field(default='./checkpoint/', metadata={"help": "The output directory where the model checkpoints are written."}) +# yapf: enable + + +@paddle.no_grad() +def custom_evaluate(self, model, data_loader): + metric = MetricReport() + model.eval() + metric.reset() + for batch in data_loader: + logits = model(batch["input_ids"], batch["token_type_ids"], attention_mask=[None, None]) + # Supports paddleslim.nas.ofa.OFA model and nn.layer model. 
+ if isinstance(model, OFA): + logits = logits[0] + probs = F.sigmoid(logits) + metric.update(probs, batch["labels"]) + + micro_f1_score, macro_f1_score = metric.accumulate() + logger.info("micro f1 score: %.5f, macro f1 score: %.5f" % (micro_f1_score, macro_f1_score)) + model.train() + return macro_f1_score + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, compression_args = parser.parse_args_into_dataclasses() + paddle.set_device(compression_args.device) + compression_args.strategy = "dynabert" + # Log model and data config + compression_args.print_config(model_args, "Model") + compression_args.print_config(data_args, "Data") + + label_list = {} + label_path = os.path.join(data_args.dataset_dir, "label.txt") + train_path = os.path.join(data_args.dataset_dir, "train.txt") + dev_path = os.path.join(data_args.dataset_dir, "dev.txt") + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + + train_ds = load_dataset(read_local_dataset, path=train_path, label_list=label_list, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=dev_path, label_list=label_list, lazy=False) + + model = AutoModelForSequenceClassification.from_pretrained(model_args.params_dir) + tokenizer = AutoTokenizer.from_pretrained(model_args.params_dir) + + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=data_args.max_seq_length, label_nums=len(label_list) + ) + train_dataset = train_ds.map(trans_func) + dev_dataset = dev_ds.map(trans_func) + + # Define data collector, criterion + data_collator = DataCollatorWithPadding(tokenizer) + criterion = paddle.nn.BCEWithLogitsLoss() + + trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=dev_dataset, + criterion=criterion, + ) # Strategy`dynabert` needs arguments `criterion` + + compression_args.print_config() + + trainer.compress(custom_evaluate=custom_evaluate) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/hierarchical/retrieval_based/README.md b/applications/text_classification/hierarchical/retrieval_based/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8dbbfc0a40f317db3e18645e6ad4124ebc384c09 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/README.md @@ -0,0 +1,466 @@ +# 基于检索的文本分类方法 + + **目录** + +* [1. 基于语义索引的分类任务介绍](#基于语义索引的分类任务介绍) +* [2. 代码结构说明](#代码结构说明) +* [3. 环境准备](#环境准备) +* [4. 数据准备](#数据准备) +* [5. 模型训练](#模型训练) +* [6. 模型预测](#模型预测) +* [7. 模型部署](#模型部署) +* [8. 
分类流程](#分类流程) + + + +# 1.基于语义索引的分类任务介绍 + +以前的分类任务中,标签信息作为无实际意义,独立存在的one-hot编码形式存在,这种做法会潜在的丢失标签的语义信息,本方案把文本分类任务中的标签信息转换成含有语义信息的语义向量,将文本分类任务转换成向量检索和匹配的任务。这样做的好处是对于一些类别标签不是很固定的场景,或者需要经常有一些新增类别的需求的情况非常合适。另外,对于一些新的相关的分类任务,这种方法也不需要模型重新学习或者设计一种新的模型结构来适应新的任务。总的来说,这种基于检索的文本分类方法能够有很好的拓展性,能够利用标签里面包含的语义信息,不需要重新进行学习。这种方法可以应用到相似标签推荐,文本标签标注,金融风险事件分类,政务信访分类等领域。 + +本方案是基于语义索引模型的分类,语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。基于语义索引的分类方法有两种,第一种方法是直接把标签变成召回库,即把输入文本和标签的文本进行匹配,第二种是利用召回的文本带有类别标签,把召回文本的类别标签作为给定输入文本的类别。本方案使用双塔模型,训练阶段引入In-batch Negatives 策略,使用hnswlib建立索引库,并把标签作为召回库,进行召回测试。最后利用召回的结果使用 Accuracy 指标来评估语义索引模型的分类的效果。 + + +**效果评估** + +| 模型 | Accuracy | 策略简要说明| +| ------------ | ------------ |--------- | +| ernie-3.0-medium-zh | 50.580 | ernie-3.0-medium-zh多分类,5个epoch,对于新增类别需要重新训练| +| In-batch Negatives + RocketQA | 49.755 | Inbatch-negative有监督训练,标签当作召回集,对新增类别不需要重新训练| +| In-batch Negatives + RocketQA + 投票| **51.756** | Inbatch-negative有监督训练,训练集当作召回集,对新增类别,需要至少一条的数据放入召回库中| + + + +## 2. 代码结构说明 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— base_model.py # 语义索引模型基类 +|—— train.py # In-batch Negatives 策略的训练主脚本 +|—— model.py # In-batch Negatives 策略核心网络结构 + +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— export_model.py # 动态图转换成静态图 +|—— export_to_serving.py # 静态图转 Serving +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— predict.sh # 预测 bash 版本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train.sh # 训练 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 + |—— run.sh # 构建Milvus向量的 bash 版本 +|—— utils + ├── config.py # Milvus 的配置文件 + ├── feature_extract.py # 向量抽取文件 + ├── milvus_util.py # Milvus 的配置文件 +|—— deploy + |—— python + |—— predict.py # PaddleInference + |—— deploy.sh # Paddle Inference 部署脚本 + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 + +``` + + + +## 3. 环境准备 + +推荐使用GPU进行训练,在预测阶段使用CPU或者GPU均可。 + +**环境依赖** +* python >= 3.6.2 +* paddlepaddle >= 2.3.1 +* paddlenlp >= 2.3.4 +* hnswlib >= 0.5.2 +* visualdl >= 2.2.2 + +``` +pip install -r requirements.txt +``` + + + +## 4. 数据准备 + +训练需要准备指定格式的本地数据集,如果没有已标注的数据集,可以参考[文本分类任务doccano数据标注使用指南](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/doccano.md)进行文本分类数据标注。 + +**指定格式本地数据集目录结构** + +``` +├── data # 数据集目录 + ├── label.txt # 标签集 + ├── dev.txt # 验证集 + ├── train.txt # 训练集 +``` + +**训练、开发、测试数据集** + +train.txt(训练数据集文件), dev.txt(开发数据集文件),test.txt(可选,测试数据集文件),文件中文本与标签类别名用tab符`'\t'`分隔开,层次标签之间用`'##'`号分隔开。训练集指用于训练模型的数据;开发集指用于评测模型表现的数据,可以根据模型在开发集上的精度调整训练参数和模型;测试集用于测试模型表现,没有测试集时可以使用开发集代替。 + +**注意文本中不能包含tab符`'\t'`**。 + +- train.txt/dev.txt/test.txt 文件格式: +```text +<文本>'\t'<标签>'##'<标签>'##'<标签> +<文本>'\t'<标签>'##'<标签> +... +... +``` + +- train.txt/dev.txt/test.txt 文件样例: +```text +请问深入骨髓地喜欢一个人怎么办我不能确定对方是不是喜欢我,我却想我不能确定对方是不是喜欢我,我却想分分秒秒跟他在一起,有谁能告诉我如何能想他少一点 烦恼##恋爱 +我登陆诛仙2时总说我账号密码错误,但是我打的是正确的,就算不对我? 游戏##完美游戏##诛仙 +斩魔仙者称号怎么得来的斩魔仙者称号怎么得来的 游戏##网络游戏 +有哪位好心人上传一份女衬衫的加拿大海关发票给我看一下塞多谢了多谢了 商业/理财##贸易 +... +``` +**分类标签** + +label.txt(层次分类标签文件)记录数据集中所有标签路径集合,层次标签之间用`'##'`连接即可,标签的行先后顺序对结果没有影响。 + +- label.txt 文件格式: + +```text +<一级标签1> +<一级标签1>'##'<二级标签1> +<一级标签1>'##'<二级标签1>'##'<三级标签1> +<一级标签1>'##'<二级标签2> +<一级标签2> +<一级标签2>'##'<二级标签3> +... +``` +- label.txt 文件样例: +```text +教育/科学 +教育/科学##院校信息 +教育/科学##外语学习##英语考试 +教育/科学##理工学科##生物学 +教育/科学##职业教育##会计资格考试 +... +``` + + + +## 5. 
模型训练 + +我们使用百科知识问答的数据来构建训练集,开发集。 + +**训练集(train.txt)** 和 **开发集(dev.txt)** 格式一致,训练集30k条,开发集10k条,每行由文本的标题,内容和类别标签组成,以tab符分割,第一列是问题的标题和问题的描述拼接,剩下的列问题的类别。 +**召回库(label.txt)** 召回库的构建有2种方式,第一种是把所有的类别标签当成召回库,第二种是把训练集当成召回集合,我们以第一种为例。 + +数据集选择的是百科问答数据集的一个子集,问答数据集详情请参考[nlp_chinese_corpus](https://github.com/brightmart/nlp_chinese_corpus) + +- [baike_qa_category](https://paddlenlp.bj.bcebos.com/applications/baike_qa_category.zip) + +``` +wget https://paddlenlp.bj.bcebos.com/applications/baike_qa_category.zip +unzip baike_qa_category.zip +``` + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1 卡;如果采用单机单卡训练,只需要把`--gpus`参数设置成单卡的卡号即可。 + +如果使用CPU进行训练,则需要吧`--gpus`参数去除,然后吧`device`设置成cpu即可,详细请参考train.sh文件的训练设置 + +然后运行下面的命令使用GPU训练,得到语义索引模型: + +``` +root_path=inbatch +data_path=data +python -u -m paddle.distributed.launch --gpus "0,1" \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 24 \ + --learning_rate 5E-5 \ + --epochs 100 \ + --output_emb_size 0 \ + --save_steps 50 \ + --max_seq_length 384 \ + --warmup_proportion 0.0 \ + --margin 0.2 \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --train_set_file ${data_path}/train.txt \ + --corpus_file ${data_path}/label.txt \ + --similar_text_pair_file ${data_path}/dev.txt \ + --evaluate True +``` + +参数含义说明 + +* `device`: 使用 cpu/gpu 进行训练 +* `save_dir`: 模型存储路径 +* `batch_size`: 训练的batch size的大小 +* `learning_rate`: 训练的学习率的大小 +* `epochs`: 训练的epoch数 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `max_seq_length`: 输入序列的最大长度 +* `margin`: 正样本相似度与负样本之间的目标 Gap +* `train_set_file`: 训练集文件 +* `evaluate`: 是否开启边训练边评估模型训练效果,默认开启 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 +* `corpus_file`: 召回库数据 corpus_file + +也可以使用bash脚本: + +``` +sh scripts/train.sh +``` + + + +## 6. 模型预测 + +我们可以基于语义索引模型计算文本和标签的语义相似度。 + + +### 开始预测 + +加载训练的语义索引模型,然后计算文本和标签的语义相似度: + +``` +root_dir="checkpoints/inbatch/model_best" +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_state.pdparams" \ + --output_emb_size 0 \ + --batch_size 128 \ + --max_seq_length 384 \ + --text_pair_file "data/dev.txt" +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `params_path`: 预训练模型的参数文件名 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `text_pair_file`: 由文本 Pair 构成的待预测数据集 + +也可以运行下面的bash脚本: + +``` +sh scripts/predict.sh +``` +predict.sh文件包含了cpu和gpu运行的脚本,默认是gpu运行的脚本 + +产出如下结果 +``` +0.8841502070426941 +0.7834227681159973 +0.04591505229473114 +0.15116563439369202 +...... +``` + + + +## 7. 模型部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/inbatch/model_best/model_state.pdparams --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference预测 + +预测既可以抽取向量也可以计算两个文本的相似度。 + +修改deploy/python/predict.py中的id2corpus和corpus_list的样本: + +``` +# 抽取向量 +id2corpus = { + 0: { + "sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?" 
+ } + } +# 计算文本和类别的相似度 +corpus_list = [{ + "sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?", + 'label': '电脑/网络,硬件' + }, { + "sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?", + 'label': '商业/理财,股票' + }] + +``` + +然后使用PaddleInference + +``` +python deploy/python/predict.py --model_dir=./output +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +最终输出的是256维度的特征向量和句子对的预测概率: + +``` +(1, 768) +[[-0.06491912 -0.0133915 0.00937684 0.01285653 -0.02468005 0.03528611 + 0.0623698 -0.06062918 0.02238894 -0.05348937 0.02161925 0.04480227 + .... + +[0.8100336194038391, -0.05148252472281456] +``` + +### 向量引擎 + +模型准备结束以后,开始搭建 Milvus 的向量检索引擎,用于文本语义向量的快速检索,本项目使用[Milvus](https://milvus.io/)开源工具进行向量检索,Milvus 的搭建教程请参考官方教程 [Milvus官方安装教程](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md)本案例使用的是 Milvus 的1.1.1 CPU版本,建议使用官方的 Docker 安装方式,简单快捷。 + + +Milvus 搭建完系统以后就可以插入和检索向量了,首先生成 embedding 向量,每个样本生成768维度的向量: + +``` +CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \ + --data_name label \ + --model_dir ./output \ + --output_dir data \ + --corpus_file "./data/label.txt" +``` +其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。 + +然后向搭建好的 Milvus 系统插入向量: + +``` +python utils/vector_insert.py \ + --vector_path ./data/label_embedding.npy +``` +也可以直接运行: + +```bash +sh scripts/run.sh +``` + +### Paddle Serving部署 + +Paddle Serving 的详细文档请参考 [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md)和[Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md),首先把静态图模型转换成Serving的格式: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "./serving_server" \ + --client_path "./serving_client" \ + --fetch_alias_names "output_embedding" +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` + +Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,第二种是C++的方式,下面分别介绍这两种方式的用法: + +#### Pipeline方式 + +启动 Pipeline Server: + +``` +cd deploy/python/ +python web_service.py +``` + +启动客户端调用 Server, 使用 POST的方式: + +向服务端发送 POST 请求示例: + +``` +curl -X POST -k http://localhost:8090/ernie/prediction -d '{"key": ["0"], "value": ["{\"sentence\": \"CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?\"}"]}' +``` + +也可以使用 rpc的方式: +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [{ + "sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?" +}] +``` +然后运行: + +``` +python rpc_client.py +``` +模型的输出为: + +``` +PipelineClient::predict pack_data time:1658988633.3673246 +PipelineClient::predict before time:1658988633.3678396 +time to cost :0.014188766479492188 seconds +['output_embedding'] +(1, 768) +[[-0.06491912 -0.0133915 0.00937684 0.01285653 -0.02468005 0.03528611 + 0.0623698 -0.06062918 0.02238894 -0.05348937 0.02161925 0.04480227 + ...... +``` + +可以看到客户端发送了1条文本,返回这个 embedding 向量 + + + +## 8. 
分类流程 + +基于检索的分类系统使用了Client Server的模式,即抽取向量的模型部署在服务端,然后启动客户端(Client)端去访问。 + +``` +python run_system.py +``` +代码内置的测试用例为: + +``` +list_data = [{"sentence": "我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!"}] +``` +会输出如下的结果: + +``` +...... +PipelineClient::predict pack_data time:1658988661.507715 +PipelineClient::predict before time:1658988661.5081818 +Extract feature time to cost :0.02322244644165039 seconds +Search milvus time cost is 0.06801486015319824 seconds +{'sentence': '我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!'} 教育/科学,外语学习 0.17211778461933136 +{'sentence': '我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!'} 教育/科学,外语学习,英语翻译 0.5666656494140625 +{'sentence': '我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!'} 教育/科学,外语学习,法语 0.8530913591384888 +{'sentence': '我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!'} 教育/科学,出国/留学 1.1201119422912598 +{'sentence': '我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!'} 生活,购车养车,汽车养护 1.4068719148635864 +..... +``` +输出的结果包括特征提取和检索的时间,还包含检索出来文本和对应的标签,通过设定阈值等方式可以得到最终的标签。 + +## Reference + +[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. diff --git a/applications/text_classification/hierarchical/retrieval_based/base_model.py b/applications/text_classification/hierarchical/retrieval_based/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..56aa3ba50e189281c35d41e8819014f56d8e53f4 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/base_model.py @@ -0,0 +1,153 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
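+
+# Base encoders for the retrieval-based classifier: SemanticIndexBase (dynamic
+# graph) and SemanticIndexBaseStatic (with a to_static pooled-embedding method)
+# wrap a pretrained transformer, optionally project the 768-d pooled output to
+# `output_emb_size` dimensions, L2-normalize it, and expose cosine similarity
+# between query and title embeddings. Minimal usage sketch (model name is
+# illustrative):
+#     pretrained = AutoModel.from_pretrained("rocketqa-zh-dureader-query-encoder")
+#     encoder = SemanticIndexBaseStatic(pretrained, output_emb_size=0)
+#     embedding = encoder.get_pooled_embedding(input_ids, token_type_ids)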
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + 
input_ids, token_type_ids = batch_data + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding diff --git a/applications/text_classification/hierarchical/retrieval_based/data.py b/applications/text_classification/hierarchical/retrieval_based/data.py new file mode 100644 index 0000000000000000000000000000000000000000..9d440967a21822376a7c0a3c99e4b73933bf5cf8 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/data.py @@ -0,0 +1,224 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import hnswlib +import numpy as np +import paddle +from paddlenlp.utils.log import logger + + +def build_index(corpus_data_loader, model, output_emb_size, hnsw_max_elements, hnsw_ef, hnsw_m): + + index = hnswlib.Index(space="ip", dim=output_emb_size if output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=hnsw_max_elements, ef_construction=hnsw_ef, M=hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + logger.info("start build index..........") + all_embeddings = [] + for text_embeddings in model.get_semantic_embedding(corpus_data_loader): + all_embeddings.append(text_embeddings.numpy()) + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + logger.info("Total index number:{}".format(index.get_current_count())) + return index + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_corpus_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. 
+ token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_label_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + yield {"sentence": data[0], "label": data[1].replace("##", ",")} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using lateset 
ann_data_file:{}".format(latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip().replace("##", ",") + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + splited_line = line.rstrip().split("\t") + text, similar_text = splited_line[0], ",".join(splited_line[1:]) + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/applications/text_classification/hierarchical/retrieval_based/deploy/python/config_nlp.yml b/applications/text_classification/hierarchical/retrieval_based/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..236c3802002e80075ab55ed14461b0bde9fd545c --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8090 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + #ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '2' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/text_classification/hierarchical/retrieval_based/deploy/python/deploy.sh b/applications/text_classification/hierarchical/retrieval_based/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..fe8f071e0a47a47f5dc24d84ea4eaaf8e7503c06 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/deploy/python/deploy.sh @@ -0,0 +1 @@ +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/deploy/python/predict.py b/applications/text_classification/hierarchical/retrieval_based/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..d5f0c6203ec284b565eae187e4a8157404bc2ca9 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/deploy/python/predict.py @@ -0,0 +1,253 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import paddle +from paddle import inference +from scipy import spatial + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +sys.path.append(".") + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_query_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. 
+ + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + encoded_inputs = tokenizer( + text=example["sentence"], + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len, + truncation_strategy="longest_first", + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def extract_embedding(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the feature vectors. 
+ """ + examples = [] + for idx, text in data.items(): + print(text) + input_ids, segment_ids = convert_query_example(text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + return logits + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions probs. + """ + + examples = [] + for idx, text in enumerate(data): + input_ids, segment_ids, title_ids, title_segment_ids = convert_example(text, tokenizer) + + examples.append((input_ids, segment_ids, title_ids, title_segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(query_ids) + self.input_handles[1].copy_from_cpu(query_segment_ids) + self.predictor.run() + query_logits = self.output_handle.copy_to_cpu() + + self.input_handles[0].copy_from_cpu(title_ids) + self.input_handles[1].copy_from_cpu(title_segment_ids) + self.predictor.run() + title_logits = self.output_handle.copy_to_cpu() + + result = [float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)] + return result + + +if __name__ == "__main__": + # Define predictor to do prediction. 
+ predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + output_emb_size = 256 + tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-dureader-query-encoder") + id2corpus = {0: {"sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?"}} + res = predictor.extract_embedding(id2corpus, tokenizer) + print(res.shape) + print(res) + corpus_list = [ + {"sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?", "label": "电脑/网络,硬件"}, + {"sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?", "label": "商业/理财,股票"}, + ] + res = predictor.predict(corpus_list, tokenizer) + print(res) diff --git a/applications/text_classification/hierarchical/retrieval_based/deploy/python/rpc_client.py b/applications/text_classification/hierarchical/retrieval_based/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..c3d5e71e3d09c23a0e1d4f20c3f17c408ae9a4ba --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/deploy/python/rpc_client.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time +import numpy as np + +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = [{"sentence": "CPU每秒执行的指令数CPU频率3.0G,执行一条指令需要1.5,频率3.0G,执行一条指令需要1.5个周期,每秒执行的指令数为?是20亿吗?"}] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = str(item) + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/text_classification/hierarchical/retrieval_based/deploy/python/web_service.py b/applications/text_classification/hierarchical/retrieval_based/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..df054797d51ec195c6f23ad1c144aa4f6aed43d1 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/deploy/python/web_service.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
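+
+# Paddle Serving pipeline service: ErnieOp tokenizes the incoming sentences,
+# feeds input_ids/token_type_ids to the converted serving_server model
+# configured in config_nlp.yml (http port 8090 / rpc port 8080 by default) and
+# returns the pooled vector under the "output_embedding" key. Start the server
+# from this directory with:
+#     python web_service.py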
+ +from paddle_serving_server.web_service import Op, WebService + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer( + text=text["sentence"], max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + model_name_or_path = "rocketqa-zh-dureader-query-encoder" + self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + example = eval(input_dict[str(i)]) + input_ids, segment_ids = convert_example([example], self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/text_classification/hierarchical/retrieval_based/evaluate.py b/applications/text_classification/hierarchical/retrieval_based/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..c315b0d8b129b971f206c3404c79b712a2d57014 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/evaluate.py @@ -0,0 +1,83 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
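+
+# Computes Recall@N (N = 1, 5, 10, 20, 50) for the recall results: every group
+# of --recall_num lines in --recall_result_file is compared against the
+# ground-truth label from --similar_text_pair, and the scores are printed and
+# appended to result.tsv. Example invocation (paths are illustrative):
+#     python evaluate.py --similar_text_pair data/dev.txt \
+#         --recall_result_file recall_result_dir/recall_result.txt --recall_num 50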
+ +import argparse +import time + +import numpy as np + +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default="", help="The full path of the similar text pair file") +parser.add_argument("--recall_result_file", type=str, default="", help="The full path of the recall result file") +parser.add_argument( + "--recall_num", type=int, default=10, help="The number of most similar docs recalled from the corpus for each query" +) +args = parser.parse_args() + + +def recall(rs, N=10): + """ + Ratio of recalled ground truth among the top-N recalled docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + 0.6666667 + >>> recall(rs, N=3) + 1.0 + Args: + rs: Iterable of relevance flag lists, one per query (1 if the doc recalled at that rank is the ground truth, else 0) + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().rsplit("\t", 1) + text2similar[text] = similar_text + + rs = [] + with open(args.recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + text_arr = line.rstrip().split("\t") + text_title, text_para, recalled_title, recalled_para, label, cosine_sim = text_arr + if text2similar["\t".join([text_title, text_para])] == label: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + recall_num = [1, 5, 10, 20, 50] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + result = open("result.tsv", "a") + res = [] + timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime()) + res.append(timestamp) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") diff --git a/applications/text_classification/hierarchical/retrieval_based/export_model.py b/applications/text_classification/hierarchical/retrieval_based/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..d212a633e6006bda1f252687779a009528ddacf8 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/export_model.py @@ -0,0 +1,55 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
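+ +# Exports a fine-tuned dynamic-graph checkpoint to a static-graph inference model under --output_path. +# Example invocation (mirrors scripts/export_model.sh): +# python export_model.py --params_path checkpoints/inbatch/model_best/model_state.pdparams --output_path=./output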
+ +import argparse +import os + +import paddle +from base_model import SemanticIndexBaseStatic + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--output_emb_size", default=0, type=int, help="output_embedding_size") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + # Load the pretrained model and tokenizer specified by --model_name_or_path + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save the static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/text_classification/hierarchical/retrieval_based/export_to_serving.py b/applications/text_classification/hierarchical/retrieval_based/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..1ba681a4dfb14a43a5f91fa9c4cf632b4e6e827e --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/export_to_serving.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. 
If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for fetch vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/text_classification/hierarchical/retrieval_based/model.py b/applications/text_classification/hierarchical/retrieval_based/model.py new file mode 100644 index 0000000000000000000000000000000000000000..fd87c6d8363efc4f54db6c6bd5d7b623ea68ab59 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/model.py @@ -0,0 +1,65 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
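+ +# SemanticIndexBatchNeg implements in-batch negatives training for the dual encoder: every other title +# in the batch serves as a negative for a query, the margin is subtracted from the diagonal (positive) +# similarities, the similarity matrix is scaled, and cross-entropy with the diagonal as the target is the loss.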
+ +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexBatchNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + + self.margin = margin + # Scale cosine similarity to ease convergence + self.scale = scale + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # Subtract the margin from the positive samples' cosine similarity (the diagonal) + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # Scale cosine similarity to ease training convergence + cosine_sim *= self.scale + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/applications/text_classification/hierarchical/retrieval_based/predict.py b/applications/text_classification/hierarchical/retrieval_based/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..906fdf3519c3aff341c81bcb0bb47f4245b91daf --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/predict.py @@ -0,0 +1,100 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max seq length.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + cosine_sims = [] + model.eval() + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + cosine_sims.append(batch_cosine_sim) + cosine_sims = np.concatenate(cosine_sims, axis=0) + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + cosin_sim = predict(model, valid_data_loader) + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) + if idx > 5: + break diff --git a/applications/text_classification/hierarchical/retrieval_based/recall.py b/applications/text_classification/hierarchical/retrieval_based/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..3c185bd8efaa0a20ea0289976e600a62e70f1dcc --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/recall.py @@ -0,0 +1,108 @@ +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from base_model import SemanticIndexBase +from data import convert_corpus_example, create_dataloader, gen_id2corpus, gen_text_file + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(convert_corpus_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + id2corpus = gen_id2corpus(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + # Need better way to get inner model of DataParallel + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = 
len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/text_classification/hierarchical/retrieval_based/requirements.txt b/applications/text_classification/hierarchical/retrieval_based/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..6657b02e7b0c9a430659394b3398f575cae4ea91 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/requirements.txt @@ -0,0 +1,11 @@ +pymilvus==1.1.2 +pandas==0.25.1 +paddlenlp>=2.3.4 +paddlepaddle-gpu>=2.3.0 +hnswlib>=0.5.2 +numpy>=1.17.2 +visualdl>=2.2.2 +paddle-serving-app>=0.7.0 +paddle-serving-client>=0.7.0 +paddle-serving-server-gpu>=0.7.0.post102 +pybind11 \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/run_system.py b/applications/text_classification/hierarchical/retrieval_based/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..464024d976702c8e57bd87eba9c6888dd78403d5 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/run_system.py @@ -0,0 +1,64 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import sys +import time + +import numpy as np +import pandas as pd +from data import gen_id2corpus +from paddle_serving_server.pipeline import PipelineClient + +sys.path.append("utils") +from utils.milvus_util import RecallByMilvus # noqa: E402 + + +def search_in_milvus(text_embedding, corpus_file, query_text): + collection_name = "text" + partition_tag = "partition_2" + client = RecallByMilvus() + start_time = time.time() + status, results = client.search( + collection_name=collection_name, vectors=text_embedding, partition_tag=partition_tag + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + id2corpus = gen_id2corpus(corpus_file) + list_data = [] + for line in results: + for item in line: + idx = item.id + distance = item.distance + text = id2corpus[idx] + list_data.append([query_text, text, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "label", "distance"]) + df = df.sort_values(by="distance", ascending=True) + for index, row in df.iterrows(): + print(row["query_text"], row["label"], row["distance"]) + + +if __name__ == "__main__": + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + corpus_file = "data/label.txt" + list_data = [{"sentence": "我是一个多情善感的小男孩!我想翻译成英文,谢谢!我想成英文,谢谢!"}] + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = str(item) + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + search_in_milvus(result, corpus_file, list_data[0]) diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/evaluate.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..900d7283f76e1af5390be84a448266c1719004f8 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/evaluate.sh @@ -0,0 +1,4 @@ +python -u evaluate.py \ + --similar_text_pair "data/dev.txt" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/export_model.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..bda31ba21ea81d3e431cf464314c89fdcfb04d09 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/export_model.sh @@ -0,0 +1 @@ +python export_model.py --params_path checkpoints/inbatch/model_best/model_state.pdparams --output_path=./output \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/export_to_serving.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..b0d7a422551fd09eb1a28cfacdf47237a8efc795 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/export_to_serving.sh @@ -0,0 +1,7 @@ +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git 
a/applications/text_classification/hierarchical/retrieval_based/scripts/predict.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..67c9eee02da0d8a8beb58791c74a34bd4b68fefb --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/predict.sh @@ -0,0 +1,22 @@ +# gpu version +root_dir="checkpoints/inbatch/model_best" +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_state.pdparams" \ + --output_emb_size 0 \ + --batch_size 128 \ + --max_seq_length 384 \ + --text_pair_file "data/dev.txt" + + +# cpu +# root_dir="checkpoints/inbatch/model_best" +# python -u -m paddle.distributed.launch \ +# predict.py \ +# --device cpu \ +# --params_path "${root_dir}/model_state.pdparams" \ +# --output_emb_size 0 \ +# --batch_size 128 \ +# --max_seq_length 384 \ +# --text_pair_file "data/train.txt" diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/run.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..8bcd987f155ed2b0fd4c2a13faa7eb599d8076ff --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/run.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \ + --data_name label \ + --model_dir ./output \ + --output_dir data \ + --corpus_file "./data/label.txt" + +python utils/vector_insert.py \ + --vector_path ./data/label_embedding.npy \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/run_build_index.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..3fa6030207137bb58e25b97244c038e886e875b2 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/run_build_index.sh @@ -0,0 +1,16 @@ +# GPU version +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_best/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 0 \ + --max_seq_length 384 \ + --recall_num 50 \ + --similar_text_pair "data/dev.txt" \ + --corpus_file "data/train.txt" \ No newline at end of file diff --git a/applications/text_classification/hierarchical/retrieval_based/scripts/train.sh b/applications/text_classification/hierarchical/retrieval_based/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..ea88cfdd53a73078b5045c1c1c34f2dda1d321ea --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/scripts/train.sh @@ -0,0 +1,35 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU training +root_path=inbatch +data_path=data +python -u -m paddle.distributed.launch --gpus "0,1" \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 24 \ + --learning_rate 5E-5 \ + --epochs 100 \ + --output_emb_size 0 \ + --save_steps 50 \ + --max_seq_length 384 \ + --warmup_proportion 0.0 \ + --margin 0.2 \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --train_set_file ${data_path}/train.txt \ + --corpus_file ${data_path}/label.txt \ + --similar_text_pair_file ${data_path}/dev.txt \ + --evaluate True diff --git a/applications/text_classification/hierarchical/retrieval_based/train.py b/applications/text_classification/hierarchical/retrieval_based/train.py new file mode 100644 index 0000000000000000000000000000000000000000..a14114abe9c40d487dde2067c84b4e87af402f53 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/train.py @@ -0,0 +1,281 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import ( + build_index, + convert_example, + create_dataloader, + gen_id2corpus, + gen_text_file, + read_text_pair, +) +from model import SemanticIndexBatchNeg + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, + help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=512, type=int, + help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, + help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=256, + type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=5E-5, type=float, + help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, + help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, + help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, + help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, + help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, + help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="cpu", + help="Select which device to train model, defaults to cpu.") +parser.add_argument('--save_steps', type=int, default=10000, + help="Interval steps to save checkpoint") +parser.add_argument('--log_steps', type=int, default=10, + help="Interval steps to print log") +parser.add_argument("--train_set_file", type=str, + default='./data/train.txt', + help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.2, type=float, + help="Margin between pos_sample and neg_samples") +parser.add_argument("--scale", default=30, type=int, + help="Scale for pair-wise margin_rank_loss") +parser.add_argument("--corpus_file", type=str, default='./data/label.txt', + help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, + default='./data/dev.txt', + help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='./recall_result_dir', + help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, + default='recall_result_init.txt', help="The file name of recall result") +parser.add_argument("--recall_num", default=50, type=int, + help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, + help="The max connections (M) parameter of the HNSW index.") +parser.add_argument("--hnsw_ef", default=100, type=int, + help="The ef parameter of the HNSW index.") +parser.add_argument("--hnsw_max_elements", default=1000000, + type=int, help="The max number of elements the HNSW index can hold.") +parser.add_argument("--evaluate_result", type=str, default='evaluate_result.txt', + help="The file name of the evaluation result.") +parser.add_argument('--evaluate', default=True, type=eval, choices=[True, False], + help='whether to evaluate while training') +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def recall(rs, N=10): + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +@paddle.no_grad() +def evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus): + # Get the inner model wrapped by DataParallel + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + query_embedding = 
inner_model.get_semantic_embedding(query_data_loader) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) + text2similar = {} + with open(args.similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + text_arr = line.rstrip().rsplit("\t") + text, similar_text = text_arr[0], text_arr[1].replace("##", ",") + text2similar[text] = similar_text + rs = [] + with open(recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + text_arr = line.rstrip().rsplit("\t") + text, similar_text, cosine_sim = text_arr + if text2similar[text] == similar_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + recall_num = [1, 5, 10, 20] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + evaluate_result_file = os.path.join(args.recall_result_dir, args.evaluate_result) + result = open(evaluate_result_file, "a") + res = [] + timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime()) + res.append(timestamp) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") + return float(recall_N[0]) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(args.seed) + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + pretrained_model = AutoModel.from_pretrained("rocketqa-zh-dureader-query-encoder") + tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-dureader-query-encoder") + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + model = SemanticIndexBatchNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + model = paddle.DataParallel(model) + batchify_fn_dev = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + if args.evaluate: + eval_func = 
partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + id2corpus = gen_id2corpus(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=eval_func + ) + # convert_corpus_example + query_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + text_list, _ = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=query_func + ) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByNorm(clip_norm=1.0), + ) + global_step = 0 + best_recall = 0.0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + global_step += 1 + if global_step % args.log_steps == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if not args.evaluate and rank == 0: + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + if args.evaluate and rank == 0: + print("evaluating") + recall_5 = evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus) + if recall_5 > best_recall: + best_recall = recall_5 + save_dir = os.path.join(args.save_dir, "model_best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + with open(os.path.join(save_dir, "train_result.txt"), "a", encoding="utf-8") as fp: + fp.write("epoch=%d, global_step: %d, recall: %s\n" % (epoch, global_step, recall_5)) + + +if __name__ == "__main__": + do_train() diff --git a/applications/text_classification/hierarchical/retrieval_based/utils/__init__.py 
b/applications/text_classification/hierarchical/retrieval_based/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/utils/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/applications/text_classification/hierarchical/retrieval_based/utils/config.py b/applications/text_classification/hierarchical/retrieval_based/utils/config.py new file mode 100644 index 0000000000000000000000000000000000000000..5c95294c79ccec9c81348052a9f9b96f3465e8c6 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/utils/config.py @@ -0,0 +1,32 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from milvus import IndexType, MetricType + +MILVUS_HOST = "10.21.226.173" +MILVUS_PORT = 8530 + +output_emb_size = 0 + +collection_param = { + "dimension": output_emb_size if output_emb_size > 0 else 768, + "index_file_size": 256, + "metric_type": MetricType.L2, +} + +index_type = IndexType.FLAT +index_param = {"nlist": 1000} + +top_k = 20 +search_param = {"nprobe": 20} diff --git a/applications/text_classification/hierarchical/retrieval_based/utils/feature_extract.py b/applications/text_classification/hierarchical/retrieval_based/utils/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..f9f67b19e1385961ff5a60876bdcf97c2d70f6d4 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/utils/feature_extract.py @@ -0,0 +1,193 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
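+ +# Extracts embeddings for every line of --corpus_file with the exported static-graph model and saves +# them to <output_dir>/<data_name>_embedding.npy for later insertion into Milvus. +# Example invocation (mirrors scripts/run.sh): +# CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py --data_name label --model_dir ./output --output_dir data --corpus_file "./data/label.txt"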
+ +import argparse +import os + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +import paddlenlp as ppnlp +from paddlenlp.data import Pad, Tuple + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument("--output_dir", type=str, required=True, help="The output path.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--data_name", type=str, required=True, help="The dataset name.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + + encoded_inputs = tokenizer(text=example, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, data_name): + """ + Predicts the data labels. + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in tqdm(data.items()): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) >= self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("./{}/{}_embedding".format(args.output_dir, data_name), all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, line in enumerate(file.readlines()): + id2corpus[idx] = line.rstrip().replace("##", ",") + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + data_name = args.data_name + model_name_or_path = "rocketqa-zh-dureader-query-encoder" + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(model_name_or_path) + id2corpus = read_text(args.corpus_file) + predictor.predict(id2corpus, tokenizer, data_name) diff --git a/applications/text_classification/hierarchical/retrieval_based/utils/milvus_util.py b/applications/text_classification/hierarchical/retrieval_based/utils/milvus_util.py new file mode 100644 index 0000000000000000000000000000000000000000..e6b186c4fa480ab20b888c0cd1376624083da9b9 --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/utils/milvus_util.py @@ -0,0 +1,114 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from config import ( + MILVUS_HOST, + MILVUS_PORT, + collection_param, + index_param, + index_type, + search_param, + top_k, +) +from milvus import Milvus + +# Thin wrappers around the pymilvus 1.x client. Collection, index and search settings come from +# utils/config.py, and a running Milvus service reachable at MILVUS_HOST:MILVUS_PORT is assumed. + + +class VecToMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def has_collection(self, collection_name): + try: + status, ok = self.client.has_collection(collection_name) + return ok + except Exception as e: + print("Milvus has_collection error:", e) + + def create_collection(self, collection_name): + try: + collection_param["collection_name"] = collection_name + status = self.client.create_collection(collection_param) + print(status) + return status + except Exception as e: + print("Milvus create collection error:", e) + + def create_index(self, collection_name): + try: + status = self.client.create_index(collection_name, index_type, index_param) + print(status) + return status + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, collection_name, partition_tag): + try: + status, ok = self.client.has_partition(collection_name, partition_tag) + return ok + except Exception as e: + print("Milvus has partition error: ", e) + + def delete_partition(self, collection_name, partition_tag): + try: + # Drop only the given partition rather than the whole collection + status = self.client.drop_partition(collection_name, partition_tag) + return status + except Exception as e: + print("Milvus delete partition error: ", e) + + def create_partition(self, collection_name, partition_tag): + try: + status = self.client.create_partition(collection_name, partition_tag) + print("create partition {} successfully".format(partition_tag)) + return status + except Exception as e: + print("Milvus create partition error: ", e) + + def insert(self, vectors, collection_name, ids=None, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.create_collection(collection_name) + self.create_index(collection_name) + print("collection info: {}".format(self.client.get_collection_info(collection_name)[1])) + if (partition_tag is not None) and (not self.has_partition(collection_name, partition_tag)): + self.create_partition(collection_name, partition_tag) + status, ids = self.client.insert( + collection_name=collection_name, records=vectors, ids=ids, partition_tag=partition_tag + ) + self.client.flush([collection_name]) + print( + "Insert {} entities, there are {} entities after insert data.".format( + len(ids), self.client.count_entities(collection_name)[1] + ) + ) + return status, ids + except Exception as e: + print("Milvus insert error:", e) + + +class RecallByMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def search(self, vectors, collection_name, partition_tag=None): + try: + status, results = self.client.search( + collection_name=collection_name, + query_records=vectors, + top_k=top_k, + params=search_param, + partition_tag=partition_tag, + ) + return status, results + except Exception as e: + print("Milvus recall error: ", e) diff --git a/applications/text_classification/hierarchical/retrieval_based/utils/vector_insert.py b/applications/text_classification/hierarchical/retrieval_based/utils/vector_insert.py new file mode 100644 index 0000000000000000000000000000000000000000..a58f330c3a0de4adbf496e9dfca7249f86233bcb --- /dev/null +++ b/applications/text_classification/hierarchical/retrieval_based/utils/vector_insert.py @@ -0,0 +1,53 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np +from milvus_util import VecToMilvus +from tqdm import tqdm + +parser = argparse.ArgumentParser() +parser.add_argument("--vector_path", type=str, required=True, help="feature file path.") +args = parser.parse_args() + + +def vector_insert(file_path): + embeddings = np.load(file_path) + print(embeddings.shape) + embedding_ids = [i for i in range(embeddings.shape[0])] + print(len(embedding_ids)) + client = VecToMilvus() + + collection_name = "text" + partition_tag = "partition_2" + if client.has_partition(collection_name, partition_tag): + client.delete_partition(collection_name, partition_tag) + data_size = len(embedding_ids) + batch_size = 50000 + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + status, ids = client.insert( + collection_name=collection_name, + vectors=batch_emb.tolist(), + ids=embedding_ids[i : i + batch_size], + partition_tag=partition_tag, + ) + + +if __name__ == "__main__": + vector_insert(args.vector_path) diff --git a/applications/text_classification/hierarchical/train.py b/applications/text_classification/hierarchical/train.py new file mode 100644 index 0000000000000000000000000000000000000000..5c8ee8872c29cfa44555485fbaf5c75d318f6130 --- /dev/null +++ b/applications/text_classification/hierarchical/train.py @@ -0,0 +1,200 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
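上面的 `milvus_util.py` 封装了 Milvus 的建库、写入(`VecToMilvus`)与召回(`RecallByMilvus`),`vector_insert.py` 则按批把语料向量写入名为 `text` 的 collection 的 `partition_2` 分区。下面给出一个检索结果解析的最小示意(假设在 utils 目录下执行;查询向量为随机占位,维度 256 仅为假设,实际应与 `config.py` 中 `collection_param` 的配置一致):

```python
# 示意:向量写入完成后,用 RecallByMilvus 做检索并解析命中结果(仅为占位示例)
import numpy as np

from milvus_util import RecallByMilvus

client = RecallByMilvus()
query_vectors = np.random.rand(2, 256).tolist()  # 维度需与 collection_param 一致,这里仅为假设
ret = client.search(vectors=query_vectors, collection_name="text", partition_tag="partition_2")
if ret is not None:
    status, results = ret
    for hits in results:
        for hit in hits:
            print(hit.id, hit.distance)  # 命中向量的 id 与距离
```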
+ +import argparse +import functools +import os +import random +import time + +import numpy as np +import paddle +from metric import MetricReport +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from utils import evaluate, preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="Local dataset directory should include train.txt, dev.txt and label.txt") +parser.add_argument("--save_dir", default="./checkpoint", type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument('--early_stop', action='store_true', help='Epoch before early stop.') +parser.add_argument('--early_stop_nums', type=int, default=3, help='Number of epoch before early stop.') +parser.add_argument("--logging_steps", default=5, type=int, help="The interval steps to logging.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument('--warmup', action='store_true', help="whether use warmup strategy") +parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup steps over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """ + Sets random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def args_saving(): + argsDict = args.__dict__ + with open(os.path.join(args.save_dir, "setting.txt"), "w") as f: + f.writelines("------------------ start ------------------" + "\n") + for eachArg, value in argsDict.items(): + f.writelines(eachArg + " : " + str(value) + "\n") + f.writelines("------------------- end -------------------") + + 
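接下来的 `train()` 会从 `--dataset_dir` 读取 `train.txt`、`dev.txt` 与 `label.txt`,解析逻辑见同目录 `utils.py` 中的 `read_local_dataset`:每行为“文本`\t`标签”,多个标签以英文逗号分隔。下面用一条构造样本示意该格式(层次标签用 `##` 连接各级只是常见约定,具体标签体系请以实际数据集与 `label.txt` 为准):

```python
# 示意:hierarchical 任务单条训练样本的行格式(标签仅为假设的示例)
sample_line = "示例新闻文本\t体育,体育##足球"
*text_parts, label_field = sample_line.split("\t")
text = "".join(text_parts)        # read_local_dataset 同样会把前面的字段拼回文本
labels = label_field.split(",")   # ["体育", "体育##足球"],随后映射为 label.txt 中的行号
print(text, labels)
```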
+def train(): + """ + Training a hierarchical classification model + """ + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + args_saving() + set_seed(args.seed) + paddle.set_device(args.device) + + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + # load and preprocess dataset + label_list = {} + with open(os.path.join(args.dataset_dir, args.label_file), "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + train_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.train_file), label_list=label_list, lazy=False + ) + dev_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.dev_file), label_list=label_list, lazy=False + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name) + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length, label_nums=len(label_list) + ) + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + if paddle.distributed.get_world_size() > 1: + train_batch_sampler = DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + else: + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + # define model + model = AutoModelForSequenceClassification.from_pretrained(args.model_name, num_classes=len(label_list)) + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.BCEWithLogitsLoss() + metric = MetricReport() + + global_step = 0 + best_f1_score = 0 + early_stop_count = 0 + tic_train = time.time() + + for epoch in range(1, args.epochs + 1): + + if args.early_stop and early_stop_count >= args.early_stop_nums: + logger.info("Early stop!") + break + + for step, batch in enumerate(train_data_loader, start=1): + + labels = batch.pop("labels") + logits = model(**batch) + loss = criterion(logits, labels) + + loss.backward() + optimizer.step() + if args.warmup: + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + early_stop_count += 1 + micro_f1_score, macro_f1_score = evaluate(model, criterion, metric, dev_data_loader) + + save_best_path = args.save_dir + if not os.path.exists(save_best_path): + os.makedirs(save_best_path) + + # save models + if macro_f1_score > best_f1_score: + early_stop_count = 0 + best_f1_score = macro_f1_score + model._layers.save_pretrained(save_best_path) + tokenizer.save_pretrained(save_best_path) + logger.info("Current best macro f1 score: %.5f" % (best_f1_score)) + logger.info("Final best macro f1 score: %.5f" % (best_f1_score)) + logger.info("Save best macro f1 text classification model in %s" % (args.save_dir)) + + +if __name__ == "__main__": + train() + print(args.train_file) diff --git a/applications/text_classification/hierarchical/utils.py b/applications/text_classification/hierarchical/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9915906c9a2ab9dd8680d95711a4336efd96ec6e --- /dev/null +++ b/applications/text_classification/hierarchical/utils.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +import paddle +import paddle.nn.functional as F +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evaluates model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. 
+ """ + + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + labels = batch.pop("labels") + logits = model(**batch) + loss = criterion(logits, labels) + probs = F.sigmoid(logits) + losses.append(loss.numpy()) + metric.update(probs, labels) + + micro_f1_score, macro_f1_score = metric.accumulate() + logger.info( + "eval loss: %.5f, micro f1 score: %.5f, macro f1 score: %.5f" + % (np.mean(losses), micro_f1_score, macro_f1_score) + ) + model.train() + metric.reset() + + return micro_f1_score, macro_f1_score + + +def preprocess_function(examples, tokenizer, max_seq_length, label_nums, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + examples(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_length(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + label_nums(obj:`int`): The number of the labels. + Returns: + result(obj:`dict`): The preprocessed data including input_ids, token_type_ids, labels. + """ + result = tokenizer(text=examples["sentence"], max_seq_len=max_seq_length) + # One-Hot label + if not is_test: + result["labels"] = [float(1) if i in examples["label"] else float(0) for i in range(label_nums)] + return result + + +def read_local_dataset(path, label_list=None, is_test=False): + """ + Read dataset + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + if is_test: + items = line.strip().split("\t") + sentence = "".join(items) + yield {"sentence": sentence} + else: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(",")] + yield {"sentence": sentence, "label": labels} diff --git a/applications/text_classification/multi_class/README.md b/applications/text_classification/multi_class/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e91ff948a9dfbe4fb75b306a2980f63bbf1b4d7d --- /dev/null +++ b/applications/text_classification/multi_class/README.md @@ -0,0 +1,412 @@ +# 二分类/多分类任务指南 + +**目录** + +- [1. 二分类/多分类简介](#二分类/多分类简介) +- [2. 快速开始](#快速开始) + - [2.1 运行环境](#运行环境) + - [2.2 代码结构](#代码结构) + - [2.3 数据准备](#数据准备) + - [2.4 模型训练](#模型训练) + - [2.5 模型预测](#模型预测) + - [2.6 模型效果](#模型效果) + + + + +## 1. 二分类/多分类简介 + +本项目提供通用场景下**基于预训练模型微调的二分类/多分类端到端应用方案**,打通数据标注-模型训练-模型调优-模型压缩-预测部署全流程,有效缩短开发周期,降低AI开发落地门槛。 + +二分类/多分类数据集的标签集含有两个或两个以上的类别,所有输入句子/文本有且只有一个标签。在文本多分类场景中,我们需要预测**输入句子/文本最可能来自 `n` 个标签类别中的哪一个类别**。在本项目中二分类任务被视为多分类任务中标签集包含两个类别的情况,以下统一称为多分类任务。以下图为例,该新闻文本的最可能的标签为 `娱乐`。多分类任务在商品分类、网页标签、新闻分类、医疗文本分类等各种现实场景中具有广泛的适用性。 + +
+ +
+
+ +**方案亮点:** + +- **效果领先🏃:** 使用在中文领域内模型效果和模型计算效率有突出效果的ERNIE 3.0 轻量级系列模型作为训练基座,ERNIE 3.0 轻量级系列提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **高效调优✊:** 文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),提供模型分析模块助力开发者实现模型分析,并提供稀疏数据筛选、脏数据清洗、数据增强等多种解决方案。 +- **简单易用👶:** 开发者**无需机器学习背景知识**,仅需提供指定格式的标注分类数据,一行命令即可开启文本分类训练,轻松完成上线部署,不再让技术成为文本分类的门槛。 + +**更多选择:** + +对于大多数多分类任务,我们推荐使用预训练模型微调作为首选的文本分类方案,多分类项目中还提供通用文本分类(UTC)和语义索引的两种方案满足不同开发者需求,更多技术细节请参见[文本分类技术特色介绍](../README.md)。 + +- 【零样本、小样本场景】 👉 [通用文本分类(UTC)方案](../../zero_shot_text_classification) + +- 【标签类别不固定、标签类别众多】 👉 [语义索引分类方案](./retrieval_based) + + + + +## 2. 快速开始 + +接下来我们将以CBLUE公开数据集KUAKE-QIC任务为示例,演示多分类全流程方案使用。下载数据集: + +```shell +wget https://paddlenlp.bj.bcebos.com/datasets/KUAKE_QIC.tar.gz +tar -zxvf KUAKE_QIC.tar.gz +mv KUAKE_QIC data +rm -rf KUAKE_QIC.tar.gz +``` + +
+ image +
+
+ + 多分类数据标注-模型训练-模型分析-模型压缩-预测部署流程图 + +
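数据下载解压后,可以先确认 `data/train.txt` 为“文本`\t`标签”的单行格式(格式说明详见下文 2.3 数据准备)。下面是一个最小的查看示意,文件路径以实际解压结果为准:

```python
# 示意:查看 KUAKE-QIC 训练集前几行,确认“文本\t标签”格式
with open("data/train.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        text, label = line.rstrip("\n").split("\t")
        print(text, "->", label)
        if i >= 2:
            break
```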
+ + + +### 2.1 运行环境 + +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.5.1 +- scikit-learn >= 1.0.2 + +**安装PaddlePaddle:** + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +**安装PaddleNLP:** + +```shell +pip install --upgrade paddlenlp +``` + + +**安装sklearn:** +```shell +pip install scikit-learn +``` + + + +### 2.2 代码结构 + +```text +multi_class/ +├── few-shot # 小样本学习方案 +├── retrieval_based # 语义索引方案 +├── analysis # 分析模块 +├── deploy # 部署 +│ ├── simple_serving # SimpleServing服务化部署 +│   └── triton_serving # Triton服务化部署 +├── train.py # 训练、评估、裁剪脚本 +├── utils.py # 工具函数脚本 +└── README.md # 多分类使用说明 +``` + + + +### 2.3 数据准备 + +训练需要准备指定格式的本地数据集,如果没有已标注的数据集,可以参考[文本分类任务doccano数据标注使用指南](../doccano.md)进行文本分类数据标注。指定格式本地数据集目录结构: + +```text +data/ +├── train.txt # 训练数据集文件 +├── dev.txt # 开发数据集文件 +└── label.txt # 分类标签文件 +``` + +**训练、开发、测试数据集** 文件中文本与标签类别名用tab符`'\t'`分隔开,文本中避免出现tab符`'\t'`。 + +- train.txt/dev.txt/test.txt 文件格式: +```text +<文本>'\t'<标签> +<文本>'\t'<标签> +... +``` + +- train.txt/dev.txt/test.txt 文件样例: +```text +25岁已经感觉脸部松弛了怎么办 治疗方案 +小孩的眉毛剪了会长吗? 其他 +172的身高还能长高吗? 其他 +冻疮用三金冻疮酊有效果么? 功效作用 +... +``` + +**分类标签** + +label.txt(分类标签文件)记录数据集中所有标签集合,每一行为一个标签名。 +- label.txt 文件格式: +```text +<标签> +<标签> +... +``` + +- label.txt 文件样例: +```text +病情诊断 +治疗方案 +病因分析 +指标解读 +就医建议 +... +``` + + + +### 2.4 模型训练 + +我们推荐使用 Trainer API 对模型进行微调。只需输入模型、数据集等就可以使用 Trainer API 高效快速地进行预训练、微调和模型压缩等任务,可以一键启动多卡训练、混合精度训练、梯度累积、断点重启、日志显示等功能,Trainer API 还针对训练过程的通用训练配置做了封装,比如:优化器、学习率调度等。 + +#### 2.4.1 预训练模型微调 + + +使用CPU/GPU训练,默认为GPU训练。使用CPU训练只需将设备参数配置改为`--device cpu`,可以使用`--device gpu:0`指定GPU卡号: +```shell +python train.py \ + --do_train \ + --do_eval \ + --do_export \ + --model_name_or_path ernie-3.0-tiny-medium-v2-zh \ + --output_dir checkpoint \ + --device gpu \ + --num_train_epochs 100 \ + --early_stopping True \ + --early_stopping_patience 5 \ + --learning_rate 3e-5 \ + --max_length 128 \ + --per_device_eval_batch_size 32 \ + --per_device_train_batch_size 32 \ + --metric_for_best_model accuracy \ + --load_best_model_at_end \ + --logging_steps 5 \ + --evaluation_strategy epoch \ + --save_strategy epoch \ + --save_total_limit 1 +``` + +如果在GPU环境中使用,可以指定`gpus`参数进行多卡分布式训练。使用多卡训练可以指定多个GPU卡号,例如 --gpus 0,1。如果设备只有一个GPU卡号默认为0,可使用`nvidia-smi`命令查看GPU使用情况: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus 0,1 train.py \ + --do_train \ + --do_eval \ + --do_export \ + --model_name_or_path ernie-3.0-tiny-medium-v2-zh \ + --output_dir checkpoint \ + --device gpu \ + --num_train_epochs 100 \ + --early_stopping True \ + --early_stopping_patience 5 \ + --learning_rate 3e-5 \ + --max_length 128 \ + --per_device_eval_batch_size 32 \ + --per_device_train_batch_size 32 \ + --metric_for_best_model accuracy \ + --load_best_model_at_end \ + --logging_steps 5 \ + --evaluation_strategy epoch \ + --save_strategy epoch \ + --save_total_limit 1 +``` + +主要的配置的参数为: +- `do_train`: 是否进行训练。 +- `do_eval`: 是否进行评估。 +- `debug`: 与`do_eval`配合使用,是否开启debug模型,对每一个类别进行评估。 +- `do_export`: 训练结束后是否导出静态图。 +- `do_compress`: 训练结束后是否进行模型裁剪。 +- `model_name_or_path`: 内置模型名,或者模型参数配置目录路径。默认为`ernie-3.0-tiny-medium-v2-zh`。 +- `output_dir`: 模型参数、训练日志和静态图导出的保存目录。 +- `device`: 使用的设备,默认为`gpu`。 +- `num_train_epochs`: 训练轮次,使用早停法时可以选择100。 +- `early_stopping`: 是否使用早停法,也即一定轮次后评估指标不再增长则停止训练。 +- `early_stopping_patience`: 在设定的早停训练轮次内,模型在开发集上表现不再上升,训练终止;默认为4。 +- `learning_rate`: 预训练语言模型参数基础学习率大小,将与learning rate 
scheduler产生的值相乘作为当前学习率。 +- `max_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `per_device_train_batch_size`: 每次训练每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `per_device_eval_batch_size`: 每次评估每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `max_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `train_path`: 训练集路径,默认为"./data/train.txt"。 +- `dev_path`: 开发集集路径,默认为"./data/dev.txt"。 +- `test_path`: 测试集路径,默认为"./data/dev.txt"。 +- `label_path`: 标签路径,默认为"./data/label.txt"。 +- `bad_case_path`: 错误样本保存路径,默认为"./data/bad_case.txt"。 +- `width_mult_list`:裁剪宽度(multi head)保留的比例列表,表示对self_attention中的 `q`、`k`、`v` 以及 `ffn` 权重宽度的保留比例,保留比例乘以宽度(multi haed数量)应为整数;默认是None。 +训练脚本支持所有`TrainingArguments`的参数,更多参数介绍可参考[TrainingArguments 参数介绍](https://paddlenlp.readthedocs.io/zh/latest/trainer.html#trainingarguments)。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存开发集上最佳模型在指定的 `output_dir` 中,保存模型文件结构如下所示: + +```text +checkpoint/ +├── export # 静态图模型 +├── config.json # 模型配置文件 +├── model_state.pdparams # 模型参数文件 +├── tokenizer_config.json # 分词器配置文件 +├── vocab.txt +└── special_tokens_map.json +``` + +**NOTE:** +* 中文训练任务(文本支持含部分英文)推荐使用"ernie-1.0-large-zh-cw"、"ernie-3.0-tiny-base-v2-zh"、"ernie-3.0-tiny-medium-v2-zh"、"ernie-3.0-tiny-micro-v2-zh"、"ernie-3.0-tiny-mini-v2-zh"、"ernie-3.0-tiny-nano-v2-zh"、"ernie-3.0-tiny-pico-v2-zh"。 +* 英文训练任务推荐使用"ernie-3.0-tiny-mini-v2-en"、 "ernie-2.0-base-en"、"ernie-2.0-large-en"。 +* 英文和中文以外语言的文本分类任务,推荐使用基于96种语言(涵盖法语、日语、韩语、德语、西班牙语等几乎所有常见语言)进行预训练的多语言预训练模型"ernie-m-base"、"ernie-m-large",详情请参见[ERNIE-M论文](https://arxiv.org/pdf/2012.15674.pdf)。 + +#### 2.4.2 训练评估 +训练后的模型我们可以开启`debug`模式,对每个类别分别进行评估,并打印错误预测样本保存在`bad_case.txt`。默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: + +```shell +python train.py \ + --do_eval \ + --debug True \ + --device gpu \ + --model_name_or_path checkpoint \ + --output_dir checkpoint \ + --per_device_eval_batch_size 32 \ + --max_length 128 \ + --test_path './data/dev.txt' +``` + +输出打印示例: + +```text +[2023-02-14 12:35:03,470] [ INFO] - -----Evaluate model------- +[2023-02-14 12:35:03,471] [ INFO] - Dev dataset size: 1955 +[2023-02-14 12:35:03,471] [ INFO] - Accuracy in dev dataset: 81.74% +[2023-02-14 12:35:03,471] [ INFO] - Macro average | precision: 77.39 | recall: 79.89 | F1 score 78.32 +[2023-02-14 12:35:03,471] [ INFO] - Class name: 病情诊断 +[2023-02-14 12:35:03,471] [ INFO] - Evaluation examples in dev dataset: 288(14.7%) | precision: 85.22 | recall: 86.11 | F1 score 85.66 +[2023-02-14 12:35:03,471] [ INFO] - ---------------------------- +[2023-02-14 12:35:03,471] [ INFO] - Class name: 治疗方案 +[2023-02-14 12:35:03,471] [ INFO] - Evaluation examples in dev dataset: 676(34.6%) | precision: 91.72 | recall: 90.09 | F1 score 90.90 +... 
+``` + +文本分类预测过程中常会遇到诸如"模型为什么会预测出错误的结果","如何提升模型的表现"等问题。[Analysis模块](./analysis) 提供了**可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + +#### 2.4.3 模型裁剪(可选) + +如果有模型部署上线的需求,需要进一步压缩模型体积,可以使用 PaddleNLP 的 [压缩API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/compression.md), 一行命令即可启动模型裁剪。 + +使用裁剪功能需要安装 paddleslim: + +```shell +pip install paddleslim == 2.4.1 +``` + +开始模型裁剪训练,默认为GPU训练,使用CPU训练只需将设备参数配置改为`--device "cpu"`: +```shell +python train.py \ + --do_compress \ + --device gpu \ + --data_dir data \ + --model_name_or_path checkpoint \ + --output_dir checkpoint/prune \ + --learning_rate 3e-5 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --num_train_epochs 1 \ + --max_length 128 \ + --logging_steps 5 \ + --save_steps 100 \ + --width_mult_list '3/4' '2/3' '1/2' +``` +保存模型文件结构如下所示: + +```text +checkpoint/prune/ +├── width_mult_0.75 +│   ├── pruned_model.pdiparams +│   ├── pruned_model.pdiparams.info +│   ├── pruned_model.pdmodel +│   ├── model_state.pdparams +│   └── config.json +└── ... +``` +**NOTE:** + +1. 目前支持的裁剪策略需要训练,训练时间视下游任务数据量而定,且和微调的训练时间是一个量级。 裁剪类似蒸馏过程,方便起见,可以直接使用微调时的超参。为了进一步提升精度,可以对 `per_device_train_batch_size`、`learning_rate`、`max_length` 等超参进行网格搜索(grid search)。 + +2. 模型裁剪主要用于推理部署,因此裁剪后的模型都是静态图模型,只可用于推理部署。 + +3. ERNIE Base、Medium、Mini、Micro、Nano的模型宽度(multi head数量)为12,ERNIE Xbase、Large 模型宽度(multi head数量)为16,保留比例`width_mult`乘以宽度(multi haed数量)应为整数。 + + + +### 2.5 模型预测 +我们推荐使用taskflow进行模型预测,请保证paddlenlp版本大于2.5.1。 +``` +from paddlenlp import Taskflow + +# 模型预测 +cls = Taskflow("text_classification", task_path='checkpoint/export', is_static_model=True) +cls(["黑苦荞茶的功效与作用及食用方法","幼儿挑食的生理原因是"]) +# [{'predictions': [{'label': '功效作用', 'score': 0.9683999621710758}], 'text': '黑苦荞茶的功效与作用及食用方法'}, {'predictions': [{'label': '病因分析', 'score': 0.5204789523701855}], 'text': '幼儿挑食的生理原因是'}] + +# 裁剪模型预测 +cls = Taskflow("text_classification", task_path='checkpoint/prune/width_mult_0.67', is_static_model=True) +cls(["黑苦荞茶的功效与作用及食用方法","幼儿挑食的生理原因是"]) +# [{'predictions': [{'label': '功效作用', 'score': 0.964693000149321}], 'text': '黑苦荞茶的功效与作用及食用方法'}, {'predictions': [{'label': '病因分析', 'score': 0.4915921440237312}], 'text': '幼儿挑食的生理原因是'}] +``` + +#### 可配置参数说明 +* `task_path`:自定义任务路径,默认为None。 +* `is_static_model`:task_path中是否为静态图模型参数,默认为False。 +* `max_length`:最长输入长度,包括所有标签的长度,默认为512。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `id2label`:标签映射字典,如果`task_path`中包含id2label.json或加载动态图参数无需定义。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快。如果选择`fp16`,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 + +在线服务化部署搭建请参考: +- 【简单易用】 👉 [Simple Serving部署指南](deploy/simple_serving) +- 【低时延】 👉 [Triton部署指南](deploy/triton_serving)。 + + + + +### 2.6 模型效果 + +PaddleNLP提供ERNIE 3.0 全系列轻量化模型,对于中文训练任务可以根据需求选择不同的预训练模型参数进行训练,我们评测了不同预训练模型在KUAKE-QIC任务的表现,测试配置如下: + +1. 数据集:CBLUE数据集中医疗搜索检索词意图分类(KUAKE-QIC)任务开发集 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.3.1 + +5. 性能数据指标:latency。latency 测试方法:固定 batch size 为 32,GPU部署运行时间 total_time,计算 latency = total_time / total_samples + +6. 
精度评价指标:Accuracy + +| model_name | 模型结构 |Accuracy(%) | latency(ms) | +| -------------------------- | ------------ | ------------ | ------------ | +|ERNIE 1.0 Large Cw |24-layer, 1024-hidden, 20-heads|82.30| 5.62 | +|ERNIE 3.0 Base |12-layer, 768-hidden, 12-heads|82.25| 2.07 | +|ERNIE 3.0 Medium| 6-layer, 768-hidden, 12-heads|81.79| 1.07| +|ERNIE 3.0 Mini |6-layer, 384-hidden, 12-heads|79.80| 0.38| +|ERNIE 3.0 Micro | 4-layer, 384-hidden, 12-heads|79.80| 0.26| +|ERNIE 3.0 Nano |4-layer, 312-hidden, 12-heads|78.57|0.22| +| ERNIE 3.0 Medium + 裁剪(保留比例3/4)|6-layer, 768-hidden, 9-heads| 81.79| 0.83 | +| ERNIE 3.0 Medium + 裁剪(保留比例2/3)|6-layer, 768-hidden, 8-heads| 81.07 | 0.79 | +| ERNIE 3.0 Medium + 裁剪(保留比例1/2)|6-layer, 768-hidden, 6-heads| 81.07 | 0.64 | diff --git a/applications/text_classification/multi_class/analysis/README.md b/applications/text_classification/multi_class/analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cf70e0dc42645923cb52c8f9cbce0f4e049b5e65 --- /dev/null +++ b/applications/text_classification/multi_class/analysis/README.md @@ -0,0 +1,321 @@ +# 训练评估与模型优化指南 + +**目录** + * [Analysis模块介绍](#Analysis模块介绍) + * [环境准备](#环境准备) + * [可解释性分析](#可解释性分析) + * [单词级别可解释性分析](#单词级别可解释性分析) + * [句子级别可解释性分析](#句子级别可解释性分析) + * [数据优化](#数据优化) + * [稀疏数据筛选方案](#稀疏数据筛选方案) + * [脏数据清洗方案](#脏数据清洗方案) + * [数据增强策略方案](#数据增强策略方案) + +## Analysis模块介绍 + +Analysis模块提供了**可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + + +- **可解释性分析:** 基于[TrustAI](https://github.com/PaddlePaddle/TrustAI)提供单词和句子级别的模型可解释性分析,帮助理解模型预测结果。 + +- **数据优化:** 结合[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供了**稀疏数据筛选、脏数据清洗、数据增强**三种优化策略,从多角度优化训练数据提升模型效果。 + +
+ +
+ +以下是本项目主要代码结构及说明: + +```text +analysis/ +├── sent_interpret.ipynb # 句子级别可解释性分析脚本 +├── word_interpret.py # 单词级别可解释性分析notebook +├── sparse.py # 稀疏数据筛选脚本 +├── dirty.py # 脏数据清洗脚本 +├── aug.py # 数据增强脚本 +└── README.md # 训练评估与模型优化指南 +``` + +## 环境准备 +需要可解释性分析和数据优化需要安装相关环境。 +- trustai >= 0.1.12 +- interpretdl >= 0.7.0 + +**安装TrustAI**(可选)如果使用可解释性分析和数据优化中稀疏数据筛选和脏数据清洗需要安装TrustAI。 +```shell +pip install trustai==0.1.12 +``` + +**安装InterpretDL**(可选)如果使用词级别可解释性分析GradShap方法,需要安装InterpretDL +```shell +pip install interpretdl==0.7.0 +``` + +## 可解释性分析 +"模型为什么会预测出这个结果?"是文本分类任务开发者时常遇到的问题,如何分析错误样本(bad case)是文本分类任务落地中重要一环,本项目基于TrustAI开源了基于词级别和句子级别的模型可解释性分析方法,帮助开发者更好地理解文本分类模型与数据,有助于后续的模型优化与数据清洗标注。 + +### 单词级别可解释性分析 +本项目开源模型的词级别可解释性分析Notebook,提供LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用Jupyter Notebook进行数据分析。 + +运行 [word_interpret.ipynb](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/multi_class/analysis/README.md) 代码,即可分析影响样本预测结果的关键词以及可视化所有词对预测结果的贡献情况,颜色越深代表这个词对预测结果影响越大: +
+ +
+ +### 句子级别可解释性分析 +本项目基于特征相似度([FeatureSimilarity](https://arxiv.org/abs/2104.04128))算法,计算对样本预测结果正影响的训练数据,帮助理解模型的预测结果与训练集数据的关系。 + + +我们可以运行代码,得到支持样本模型预测结果的训练数据: +```shell +python sent_interpret.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --params_path "../checkpoint/" \ + --max_seq_length 128 \ + --batch_size 16 \ + --top_k 3 \ + --train_file "train.txt" \ + --interpret_input_file "bad_case.txt" \ + --interpret_result_file "sent_interpret.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `top_k`:筛选支持训练证据数量;默认为3。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `interpret_input_file`:本地数据集中待分析文件名;默认为"bad_case.txt"。 +* `interpret_result_file`:保存句子级别可解释性结果文件名;默认为"sent_interpret.txt"。 + +## 数据优化 + +### 稀疏数据筛选方案 + +稀疏数据筛选适用于文本分类中**数据不平衡或训练数据覆盖不足**的场景,简单来说,就是由于模型在训练过程中没有学习到足够与待预测样本相似的数据,模型难以正确预测样本所属类别的情况。稀疏数据筛选旨在开发集中挖掘缺乏训练证据支持的数据,通常可以采用**数据增强**或**少量数据标注**的两种低成本方式,提升模型在开发集的预测效果。 + +本项目中稀疏数据筛选基于TrustAI,利用基于特征相似度的实例级证据分析方法,抽取开发集中样本的支持训练证据,并计算支持证据平均分(通常为得分前三的支持训练证据均分)。分数较低的样本表明其训练证据不足,在训练集中较为稀疏,实验表明模型在这些样本上表现也相对较差。更多细节详见[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[实例级证据分析](https://github.com/PaddlePaddle/TrustAI/blob/main/trustai/interpretation/example_level/README.md)。 + + +#### 稀疏数据识别—数据增强 + +这里我们将介绍稀疏数据识别—数据增强流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据增强**:将稀疏数据集在训练集中的支持证据应用数据增强策略,这些数据增强后的训练数据记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别-数据增强,得到支持数据集: + +```shell +python sparse.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --aug_strategy "substitute" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `aug_strategy`:数据增强类型,可选"duplicate","substitute", "insert", "delete", "swap";默认为"substitute"。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +将得到增强支持数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_aug.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_aug.txt +``` + +**方案效果** + +我们在KUAKE-QIC数据集部分数据(训练集数据规模:500)进行实验,筛选稀疏数据数量和支持数据数量均设为100条,使用不同的数据增强方法进行评测: +| |Accuracy(%) | +| ---------| ------------ | +|训练集|73.50| +|训练集+支持增强集(duplicate) |73.61| +|训练集+支持增强集(substitute) |**74.32**| +|训练集+支持增强集(insert) |73.81| 
+|训练集+支持增强集(delete) |74.27| +|训练集+支持增强集(swap) |73.66| + +#### 稀疏数据识别-数据标注 + +本方案能够有针对性进行数据标注,相比于随机标注数据更好提高模型预测效果。这里我们将介绍稀疏数据识别-数据标注流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据标注**:在未标注数据集中筛选稀疏数据集的支持证据,并进行数据标注,记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别--数据标注,得到待标注数据: + +```shell +python sparse.py \ + --annotate \ + --device "gpu" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 \ + --unlabeled_file "data.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `annotate`:选择稀疏数据识别--数据标注模式;默认为False。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `unlabeled_file`:本地数据集中未标注数据文件名;默认为"data.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +我们将筛选出的支持数据`support.txt`进行标注,可以使用标注工具帮助更快标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已标注数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_annotate.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_annotate.txt +``` + +**方案效果** + +我们在KUAKE-QIC数据集部分数据(训练集数据规模:500)进行实验,筛选稀疏数据数量设为100条,筛选待标注数据数量为50和100条。我们比较了使用稀疏数据方案的策略采样和随机采样的效果,下表结果表明使用稀疏数据方案的策略采样能够支持指导训练数据扩充,在标注更少的数据情况下获得更大提升的效果: + +| |Accuracy(%) | +| ---------| ------------ | +|训练集|73.50| +|训练集+策略采样集(50) |76.88| +|训练集+随机采样集(50) |74.32| +|训练集+策略采样集(100) |**77.64**| +|训练集+随机采样集(100) |76.37| + +### 脏数据清洗方案 + +脏数据清洗方案是基于已训练好的文本分类模型,筛选出训练数据集中标注错误的数据,再由人工检查重新标注,获得标注正确的数据集进行重新训练。我们将介绍脏数据清洗流程: + +- **脏数据筛选:** 基于TrustAI中表示点方法,计算训练数据对文本分类模型的影响分数,分数高的训练数据表明对模型影响大,这些数据有较大概率为标注错误样本,记为脏数据集(Dirty Dataset)。 + +- **数据清洗、训练:** 将筛选出的脏数据由人工重新检查,为数据打上正确的标签。将清洗后的训练数据重新放入文本分类模型进行训练。 + +现在我们进行脏数据识别,脏数据保存在`"train_dirty.txt"`,剩余训练数据保存在`"train_dirty_rest.txt"`: + +```shell +python dirty.py \ + --device "gpu:3" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 8 \ + --dirty_num 100 \ + --dirty_file "train_dirty.txt" \ + --rest_file "train_dirty_rest.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt和label.txt文件;默认为None。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `dirty_file`:保存脏数据文件名,默认为"train_dirty.txt"。 +* `rest_file`:保存剩余数据(非脏数据)文件名,默认为"train_dirty_rest.txt"。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dirty_threshold`:筛选脏数据用于重新标注的阈值,只选择影响分数大于阈值作为支持数据,默认为0。 + + 
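`dirty.py` 基于表示点方法得到每条训练样本对模型的影响权重后,会按“大于阈值的正影响权重个数、权重和”从大到小排序,取前 `dirty_num` 条作为疑似脏数据。其排序口径可以用下面几行示意(分数为构造值):

```python
# 示意:对每条样本统计 (大于阈值的影响权重个数, 权重和),按该二元组降序取前 dirty_num 条
scores = [(3, 1.20), (5, 0.90), (3, 0.75)]
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(order[:2])  # -> [1, 0],即影响最大的两条样本下标
```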
+我们将筛选出脏数据进行人工检查重新标注,可以将`train_dirty.txt`直接导入标注工具doccano帮助更快重新标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已重新标注的脏数据`train_dirty.txt`与剩余训练集数据`train_dirty_rest.txt`合并得到新的训练集`train_clean.txt`重新进行训练: + +```shell +cat ../data/train_dirty_rest.txt ../data/train_dirty.txt > ../data/train_clean.txt +``` + +**方案效果** + +我们在KUAKE-QIC数据集部分数据(训练集数据规模:500)进行实验,取100条数据进行脏数据处理,也即100条训练数据为标签错误数据,选择不同`dirty_num`应用脏数据清洗策略进行评测: + +| |Accuracy(%) | +| ---------| ------------ | +|训练集(500)|**73.50**| +|训练集(500,含100条脏数据) |65.58| +|训练集(500,含100条脏数据) + 脏数据清洗(50)|68.90| +|训练集(500,含100条脏数据) + 脏数据清洗(100)|69.36| +|训练集(500,含100条脏数据) + 脏数据清洗(150)|73.15| + +### 数据增强策略方案 + +在数据量较少或某些类别样本量较少时,也可以通过数据增强策略的方式,生成更多的训练数据,提升模型效果。 + +```shell +python aug.py \ + --create_n 2 \ + --aug_percent 0.1 \ + --train_path "../data/train.txt" \ + --aug_path "../data/aug.txt" +``` + +可支持配置的参数: + +* `train_path`:待增强训练数据集文件路径;默认为"../data/train.txt"。 +* `aug_path`:增强生成的训练数据集文件路径;默认为"../data/train_aug.txt"。 +* `aug_strategy`:数据增强策略,可选"mix", "substitute", "insert", "delete", "swap","mix"为多种数据策略混合使用;默认为"substitute"。 +* `aug_type`:词替换/词插入增强类型,可选"synonym", "homonym", "mlm",建议在GPU环境下使用mlm类型;默认为"synonym"。 +* `create_n`:生成的句子数量,默认为2。 +* `aug_percent`:生成词替换百分比,默认为0.1。 +* `device`: 选用什么设备进行增强,可选择cpu、gpu、xpu、npu,仅在使用mlm类型有影响;默认为"gpu"。 + +生成的增强数据保存在`"aug.txt"`文件中,与训练集数据`train.txt`合并得到新的训练集`train_aug.txt`重新进行训练: + +```shell +cat ../data/aug.txt ../data/train.txt > ../data/train_aug.txt +``` + +PaddleNLP内置多种数据增强策略,更多数据增强策略使用方法请参考[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)。 diff --git a/applications/text_classification/multi_class/analysis/aug.py b/applications/text_classification/multi_class/analysis/aug.py new file mode 100644 index 0000000000000000000000000000000000000000..dc0f87f2a216e07771061ec225eb6c7271e404c2 --- /dev/null +++ b/applications/text_classification/multi_class/analysis/aug.py @@ -0,0 +1,82 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
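下面的 `aug.py` 基于 PaddleNLP 数据增强 API 对“文本`\t`标签”格式的训练集逐行增强,其核心调用大致如下(示意代码,参数与脚本默认值一致,示例句子取自上文 README 的预测示例;同义词词表等资源可能需要联网下载):

```python
# 示意:同义词替换增强,参数与 aug.py 默认值(synonym, create_n=2, aug_percent=0.1)一致
from paddlenlp.dataaug import WordSubstitute

aug = WordSubstitute("synonym", create_n=2, aug_percent=0.1)
augmented = aug.augment("黑苦荞茶的功效与作用及食用方法")
print(augmented)  # 增强后的句子列表;个别版本会返回嵌套列表,aug.py 中对此做了兼容处理
```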
+ +import argparse + +import paddle + +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--train_path", type=str, default="../data/train.txt", help="Train dataset file name") +parser.add_argument("--aug_path", type=str, default="../data/aug.txt", help="Aug dataset file name") +parser.add_argument("--aug_strategy", choices=["mix", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--aug_type", choices=["synonym", "homonym", "mlm"], default='synonym', help="Select data augmentation type for substitute and insert") +parser.add_argument("--create_n", type=int, default=2, help="Number of augmented sequences.") +parser.add_argument("--aug_percent", type=float, default=0.1, help="Percentage of augmented words in sequences.") +parser.add_argument('--device', default="gpu", help="Select which device to do data augmentation strategy, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def aug(): + """Do data augmentation""" + if args.aug_strategy in ["mix", "substitute", "insert"] and args.aug_strategy == "mlm": + paddle.set_device(args.device) + + if args.aug_strategy in ["substitute", "insert", "delete", "swap"]: + if args.aug_strategy == "substitute": + aug = WordSubstitute(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=args.create_n, aug_percent=args.aug_percent) + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + + augs = aug.augment(s) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + elif args.aug_strategy in ["mix"]: + aug = [ + WordSubstitute(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordInsert(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordDelete(create_n=1, aug_percent=args.aug_percent), + WordSwap(create_n=1, aug_percent=args.aug_percent), + ] + count = 0 + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + + for i in range(args.create_n): + i = count % len(aug) + augs = aug[i].augment(s) + count += 1 + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + + +if __name__ == "__main__": + aug() diff --git a/applications/text_classification/multi_class/analysis/dirty.py b/applications/text_classification/multi_class/analysis/dirty.py new file mode 100644 index 0000000000000000000000000000000000000000..ee0ae746821a230acd374a4815f89dbbc812349f --- /dev/null +++ b/applications/text_classification/multi_class/analysis/dirty.py @@ -0,0 +1,152 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import RepresenterPointModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--dirty_num", type=int, default=100, help="Number of dirty data. default:50") +parser.add_argument("--dirty_file", type=str, default="train_dirty.txt", help="Path to save dirty data.") +parser.add_argument("--rest_file", type=str, default="train_dirty_rest.txt", help="The path of rest data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dirty_threshold", type=float, default="0", help="The threshold to select dirty data.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + sentence, label = line.strip().split("\t") + yield {"text": sentence, "label": label} + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +def get_dirty_data(weight_matrix, dirty_num, threshold=0): + """ + Get index of dirty data from train data + """ + scores = [] + for idx in range(weight_matrix.shape[0]): + weight_sum = 0 + count = 0 + for weight in weight_matrix[idx].numpy(): + if weight > threshold: + count += 1 + weight_sum += weight + scores.append((count, weight_sum)) + sorted_scores = sorted(scores)[::-1] + sorted_idxs = sorted(range(len(scores)), key=lambda idx: scores[idx])[::-1] + + ret_scores = sorted_scores[:dirty_num] + ret_idxs = sorted_idxs[:dirty_num] + + return ret_idxs, ret_scores + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = 
list(batch.values()) + return batch + + +def run(): + """ + Get dirty data + """ + set_seed(args.seed) + paddle.set_device(args.device) + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + train_ds = train_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + rep_point = RepresenterPointModel(model, train_data_loader, classifier_layer_name="classifier") + weight_matrix = rep_point.weight_matrix + + # Save dirty data & rest data + dirty_indexs, _ = get_dirty_data(weight_matrix, args.dirty_num, args.dirty_threshold) + + dirty_path = os.path.join(args.dataset_dir, args.dirty_file) + rest_path = os.path.join(args.dataset_dir, args.rest_file) + + with open(dirty_path, "w") as f1, open(rest_path, "w") as f2: + for idx in range(len(train_ds)): + if idx in dirty_indexs: + f1.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + "\n") + else: + f2.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + "\n") + + f1.close(), f2.close() + + +if __name__ == "__main__": + run() diff --git a/applications/text_classification/multi_class/analysis/sent_interpret.py b/applications/text_classification/multi_class/analysis/sent_interpret.py new file mode 100644 index 0000000000000000000000000000000000000000..1f0e4a88c190a158f948930a6e6c74c266f9f5f8 --- /dev/null +++ b/applications/text_classification/multi_class/analysis/sent_interpret.py @@ -0,0 +1,157 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
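下面的 `sent_interpret.py` 读取的待分析文件(默认 `bad_case.txt`,可由训练脚本开启 debug 评估导出)为制表符分隔的“文本、标注标签、预测标签”三列,若首列为表头 `Text` 会被跳过。以下为一条构造样本的解析示意(标签沿用上文 README 的示例标签):

```python
# 示意:bad_case.txt 单行的解析方式,与下方 read_local_dataset 的三字段分支一致
example_line = "黑苦荞茶的功效与作用及食用方法\t功效作用\t其他"
text, label, predict = example_line.strip().split("\t")
print({"text": text, "label": label, "predict": predict})
```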
+ +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--top_k", type=int, default=3, help="Top K important training data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--interpret_input_file", type=str, default="bad_case.txt", help="interpretation file name") +parser.add_argument("--interpret_result_file", type=str, default="sent_interpret.txt", help="interpreted file name") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if items[0] == "Text": + continue + if len(items) == 3: + yield {"text": items[0], "label": items[1], "predict": items[2]} + elif len(items) == 2: + yield {"text": items[0], "label": items[1], "predict": ""} + elif len(items) == 1: + yield {"text": items[0], "label": "", "predict": ""} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def find_positive_influence_data(): + + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + interpret_path = os.path.join(args.dataset_dir, args.interpret_input_file) + + train_ds = load_dataset(read_local_dataset, path=train_path, 
lazy=False) + interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + interpret_ds = interpret_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + interpret_batch_sampler = BatchSampler(interpret_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + interpret_data_loader = DataLoader( + dataset=interpret_ds, batch_sampler=interpret_batch_sampler, collate_fn=collate_fn + ) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & select sparse data + analysis_result = [] + for batch in interpret_data_loader: + analysis_result += feature_sim(batch, sample_num=args.top_k) + with open(os.path.join(args.dataset_dir, args.interpret_result_file), "w") as f: + for i in range(len(analysis_result)): + f.write("text: " + interpret_ds.data[i]["text"] + "\n") + if "predict" in interpret_ds.data[i]: + f.write("predict label: " + interpret_ds.data[i]["predict"] + "\n") + if "label" in interpret_ds.data[i]: + f.write("label: " + interpret_ds.data[i]["label"] + "\n") + f.write("examples with positive influence\n") + for i, (idx, score) in enumerate(zip(analysis_result[i].pos_indexes, analysis_result[i].pos_scores)): + f.write( + "support{} text: ".format(i + 1) + + train_ds.data[idx]["text"] + + "\t" + + "label: " + + train_ds.data[idx]["label"] + + "\t" + + "score: " + + "{:.5f}".format(score) + + "\n" + ) + f.close() + + +if __name__ == "__main__": + find_positive_influence_data() diff --git a/applications/text_classification/multi_class/analysis/sparse.py b/applications/text_classification/multi_class/analysis/sparse.py new file mode 100644 index 0000000000000000000000000000000000000000..f3809b49a3463d75ecee54927a39f362b6c985d0 --- /dev/null +++ b/applications/text_classification/multi_class/analysis/sparse.py @@ -0,0 +1,291 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
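下面的 `sparse.py` 用 TrustAI 的 `FeatureSimilarityModel` 为每条开发集样本取得分最高的 `rationale_num_sparse` 条训练证据,并以证据得分均值衡量训练支持度,均分越低的样本越“稀疏”。其打分口径可以用下面几行示意(分数为构造值):

```python
# 示意:单条样本的支持度 = top-K 支持证据得分的平均值,均分最低的 sparse_num 条会写入 sparse.txt
pos_scores = [0.62, 0.55, 0.48]  # 假设的 top-3 证据得分(rationale_num_sparse=3)
support_score = sum(pos_scores) / len(pos_scores)
print("support score: {:.4f}".format(support_score))
```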
+ +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--aug_strategy", choices=["duplicate", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--annotate", action='store_true', help="Select unlabeled data for annotation") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--rationale_num_sparse", type=int, default=3, help="Number of rationales per example for sparse data.") +parser.add_argument("--rationale_num_support", type=int, default=6, help="Number of rationales per example for support data.") +parser.add_argument("--sparse_num", type=int, default=100, help="Number of sparse data.") +parser.add_argument("--support_threshold", type=float, default="0.7", help="The threshold to select support data.") +parser.add_argument("--support_num", type=int, default=100, help="Number of support data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +parser.add_argument("--unlabeled_file", type=str, default="data.txt", help="Unlabeled data filename") +parser.add_argument("--sparse_file", type=str, default="sparse.txt", help="Sparse data file name.") +parser.add_argument("--support_file", type=str, default="support.txt", help="support data file name.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 2: + yield {"text": items[0], "label": items[1]} + elif len(items) == 1: + yield {"text": items[0]} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + f.close() + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + 
return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def get_sparse_data(analysis_result, sparse_num): + """ + Get sparse data + """ + idx_scores = {} + preds = [] + for i in range(len(analysis_result)): + scores = analysis_result[i].pos_scores + idx_scores[i] = sum(scores) / len(scores) + preds.append(analysis_result[i].pred_label) + + idx_socre_list = list(sorted(idx_scores.items(), key=lambda x: x[1]))[:sparse_num] + ret_idxs, ret_scores = list(zip(*idx_socre_list)) + return ret_idxs, ret_scores, preds + + +def find_sparse_data(): + """ + Find sparse data (lack of supports in train dataset) in dev dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + label_path = os.path.join(args.dataset_dir, args.label_file) + train_path = os.path.join(args.dataset_dir, args.train_file) + dev_path = os.path.join(args.dataset_dir, args.dev_file) + + label_list = {} + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + f.close() + + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=dev_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & select sparse data + analysis_result = [] + for batch in dev_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_sparse) + sparse_indexs, sparse_scores, preds = get_sparse_data(analysis_result, args.sparse_num) + + # Save the sparse data + is_true = [] + with open(os.path.join(args.dataset_dir, args.sparse_file), "w") as f: + for idx in sparse_indexs: + data = dev_ds.data[idx] + f.write(data["text"] + "\t" + str(data["label"]) + "\n") + is_true.append(1 if str(preds[idx]) == str(label_list[data["label"]]) else 0) + f.close() + logger.info("Sparse data saved in {}".format(os.path.join(args.dataset_dir, args.sparse_file))) + logger.info("Accuracy in sparse data: {:.2f}%".format(100 * sum(is_true) / len(is_true))) + logger.info("Average score in sparse data: {:.4f}".format(sum(sparse_scores) / len(sparse_scores))) + return os.path.join(args.dataset_dir, args.sparse_file) + + +def get_support_data(analysis_result, support_num, 
support_threshold=0.7): + """ + get support data + """ + ret_idxs = [] + ret_scores = [] + rationale_idx = 0 + try: + while len(ret_idxs) < support_num: + for n in range(len(analysis_result)): + score = analysis_result[n].pos_scores[rationale_idx] + if score > support_threshold: + idx = analysis_result[n].pos_indexes[rationale_idx] + if idx not in ret_idxs: + ret_idxs.append(idx) + ret_scores.append(score) + if len(ret_idxs) >= support_num: + break + + rationale_idx += 1 + except IndexError: + logger.error( + f"The index is out of range, please reduce support_num or increase support_threshold. Got {len(ret_idxs)} now." + ) + + return ret_idxs, ret_scores + + +def find_support_data(): + """ + Find support data (which supports sparse data) from candidate dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + if args.annotate: + candidate_path = os.path.join(args.dataset_dir, args.unlabeled_file) + else: + candidate_path = os.path.join(args.dataset_dir, args.train_file) + + sparse_path = os.path.join(args.dataset_dir, args.sparse_file) + support_path = os.path.join(args.dataset_dir, args.support_file) + candidate_ds = load_dataset(read_local_dataset, path=candidate_path, lazy=False) + sparse_ds = load_dataset(read_local_dataset, path=sparse_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + candidate_ds = candidate_ds.map(trans_func) + sparse_ds = sparse_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + candidate_batch_sampler = BatchSampler(candidate_ds, batch_size=args.batch_size, shuffle=False) + sparse_batch_sampler = BatchSampler(sparse_ds, batch_size=args.batch_size, shuffle=False) + candidate_data_loader = DataLoader( + dataset=candidate_ds, batch_sampler=candidate_batch_sampler, collate_fn=collate_fn + ) + sparse_data_loader = DataLoader(dataset=sparse_ds, batch_sampler=sparse_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, candidate_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis + analysis_result = [] + for batch in sparse_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_support) + + support_indexs, support_scores = get_support_data(analysis_result, args.support_num, args.support_threshold) + + # Save the support data + if args.annotate or args.aug_strategy == "duplicate": + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + if "label" in data: + f.write(data["text"] + "\t" + data["label"] + "\n") + else: + f.write(data["text"] + "\n") + f.close() + else: + create_n = 1 + aug_percent = 0.1 + if args.aug_strategy == "substitute": + aug = WordSubstitute("embedding", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert("embedding", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=create_n, 
aug_percent=aug_percent) + + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + augs = aug.augment(data["text"]) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f.write(a + "\t" + data["label"] + "\n") + f.close() + logger.info("support data saved in {}".format(support_path)) + logger.info("support average scores: {:.4f}".format(float(sum(support_scores)) / len(support_scores))) + + +if __name__ == "__main__": + find_sparse_data() + find_support_data() diff --git a/applications/text_classification/multi_class/analysis/word_interpret.ipynb b/applications/text_classification/multi_class/analysis/word_interpret.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..05ee41790fcc82bfae7e20330139a25cfe4fcbae --- /dev/null +++ b/applications/text_classification/multi_class/analysis/word_interpret.ipynb @@ -0,0 +1,354 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 词级别可解释性分析\n", + "本项目提供模型的词级别可解释性分析,包括LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用此项目进行数据分析。\n", + "\n", + "![image](https://user-images.githubusercontent.com/63761690/195086276-6ee16e96-4ec3-4a0f-821f-37546d21746b.png)\n", + " \n", + "\n", + "## 1.导入Python模块与参数配置\n", + "首先我们导入必要的导入必要python模块和设置配置参数,词级别可解释性分析算法支持三种待分析的文本 `INTERPRETER_FILE` 数据文件格式:\n", + "\n", + "**格式一:包括文本、标签、预测结果**\n", + "```text\n", + "<文本>'\\t'<标签>'\\t'<预测结果>\n", + "...\n", + "```\n", + "\n", + "**格式二:包括文本、标签**\n", + "```text\n", + "<文本>'\\t'<标签>\n", + "...\n", + "```\n", + "\n", + "**格式三:只包括文本**\n", + "```text\n", + "<文本>\n", + "...\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import functools\n", + "import random\n", + "import os\n", + "import argparse\n", + "\n", + "import jieba\n", + "import numpy as np \n", + "from trustai.interpretation import VisualizationTextRecord\n", + "from trustai.interpretation import get_word_offset\n", + "import paddle\n", + "from paddle.io import DataLoader, BatchSampler\n", + "from paddlenlp.data import DataCollatorWithPadding\n", + "from paddlenlp.datasets import load_dataset\n", + "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from trustai.interpretation import VisualizationTextRecord\n", + "from trustai.interpretation import get_word_offset\n", + "import paddle\n", + "from paddle.io import DataLoader, BatchSampler\n", + "from paddlenlp.data import DataCollatorWithPadding\n", + "from paddlenlp.datasets import load_dataset\n", + "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# 预先定义配置参数\n", + "\n", + "# 运行环境,可选\"cpu\",\"gpu\",\"gpu:x\"(x为gpu编号)\n", + "DEVICE = \"gpu\"\n", + "# 数据路径\n", + "DATASET_DIR = \"../data\" \n", + "# 训练模型保存路径\n", + "PARAM_PATH = \"../checkpoint/\" \n", + "# tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数\n", + "MAX_LENGTH = 128 \n", + "# 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数\n", + "BATCH_SIZE = 1 \n", + "# 待分析解释的数据\n", + "INTERPRETER_FILE = \"bad_case.txt\"\n", + "# 可选 \"ig\",\"lime\",\"grad\" ,可以根据实际任务效果选择解释器\n", + "# \"grad\":GradShap方法依赖interpretdl\n", + "# !pip install interpretdl\n", + "INTERPRETER = \"ig\"\n", + "# 
分析句子中TOP K关键词,K值\n", + "KEY_WORDS_NUM = 5" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "def read_local_dataset(path):\n", + " \"\"\"\n", + " Read dataset file\n", + " \"\"\"\n", + " with open(path, 'r', encoding='utf-8') as f:\n", + " for line in f:\n", + " items = line.strip().split('\\t')\n", + " if items[0] == 'Text':\n", + " continue\n", + " items[0] = items[0][:MAX_LENGTH-2]\n", + " if len(items) == 3:\n", + " yield {'text': items[0], 'label': items[1], 'predict': items[2]}\n", + " elif len(items) == 2:\n", + " yield {'text': items[0], 'label': items[1], 'predict': ''}\n", + " elif len(items) == 1:\n", + " yield {'text': items[0], 'label': '', 'predict': ''}\n", + " else:\n", + " raise ValueError(\"{} should be in fixed format.\".format(path))\n", + "\n", + "def preprocess_function(examples, tokenizer, max_seq_length):\n", + " \"\"\"\n", + " Preprocess dataset\n", + " \"\"\"\n", + " result = tokenizer(text=examples[\"text\"], max_seq_len=max_seq_length)\n", + " return result\n", + "\n", + "class LocalDataCollatorWithPadding(DataCollatorWithPadding):\n", + " \"\"\"\n", + " Convert the result of DataCollatorWithPadding from dict dictionary to a list\n", + " \"\"\"\n", + "\n", + " def __call__(self, features):\n", + " batch = super().__call__(features)\n", + " batch = list(batch.values())\n", + " return batch" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[32m[2022-10-11 12:17:29,041] [ INFO]\u001b[0m - We are using to load '/workspace/PaddleNLP/applications/text_classification/multi_class/checkpoint/'.\u001b[0m\n", + "W1011 12:17:29.044690 79080 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2\n", + "W1011 12:17:29.051118 79080 gpu_resources.cc:91] device: 0, cuDNN Version: 8.1.\n", + "\u001b[32m[2022-10-11 12:17:32,517] [ INFO]\u001b[0m - We are using to load '/workspace/PaddleNLP/applications/text_classification/multi_class/checkpoint/'.\u001b[0m\n" + ] + } + ], + "source": [ + "paddle.set_device(DEVICE)\n", + "\n", + "# Define model & tokenizer\n", + "if os.path.exists(PARAM_PATH):\n", + " model = AutoModelForSequenceClassification.from_pretrained(PARAM_PATH)\n", + " tokenizer = AutoTokenizer.from_pretrained(PARAM_PATH)\n", + "else:\n", + " raise ValueError(\"The {} should exist.\".format(PARAM_PATH))\n", + "\n", + "# Prepare & preprocess dataset\n", + "interpret_path = os.path.join(DATASET_DIR, INTERPRETER_FILE)\n", + "\n", + "\n", + "interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False)\n", + "trans_func = functools.partial(preprocess_function,\n", + " tokenizer=tokenizer,\n", + " max_seq_length=MAX_LENGTH)\n", + "\n", + "interpret_ds = interpret_ds.map(trans_func)\n", + "\n", + "# Batchify dataset\n", + "collate_fn = LocalDataCollatorWithPadding(tokenizer)\n", + "interpret_batch_sampler = BatchSampler(interpret_ds,\n", + " batch_size=BATCH_SIZE,\n", + " shuffle=False)\n", + "interpret_data_loader = DataLoader(dataset=interpret_ds,\n", + " batch_sampler=interpret_batch_sampler,\n", + " collate_fn=collate_fn)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Start token level interpretion, it will take some time...\n", + "Building prefix dict from the default dictionary ...\n", + "Loading 
model from cache /tmp/jieba.cache\n", + "Loading model cost 1.005 seconds.\n", + "Prefix dict has been built successfully.\n", + "Start word level alignment, it will take some time...\n" + ] + } + ], + "source": [ + "# Init an interpreter\n", + "if INTERPRETER == 'ig':\n", + " from trustai.interpretation.token_level import IntGradInterpreter\n", + " interpreter = IntGradInterpreter(model)\n", + "elif INTERPRETER == 'lime':\n", + " from trustai.interpretation.token_level import LIMEInterpreter\n", + " interpreter = LIMEInterpreter(model, unk_id=tokenizer.convert_tokens_to_ids('[UNK]'), pad_id=tokenizer.convert_tokens_to_ids('[PAD]'))\n", + "else:\n", + " from trustai.interpretation.token_level import GradShapInterpreter\n", + " interpreter = GradShapInterpreter(model)\n", + "\n", + "# Use interpreter to get the importance scores for all data\n", + "print(\"Start token level interpretion, it will take some time...\")\n", + "analysis_result = []\n", + "for batch in interpret_data_loader:\n", + " analysis_result += interpreter(tuple(batch))\n", + "\n", + "# Add CLS and SEP tags to both original text and standard splited tokens\n", + "contexts = []\n", + "words = []\n", + "for i in range(len(interpret_ds)):\n", + " text = interpret_ds.data[i][\"text\"]\n", + " contexts.append(\"[CLS]\" + text + \"[SEP]\")\n", + " words.append([\"[CLS]\"] + list(jieba.cut(text)) + [\"[SEP]\"])\n", + "\n", + "# Get the offset map of tokenized tokens and standard splited tokens\n", + "print(\"Start word level alignment, it will take some time...\")\n", + "ori_offset_maps = []\n", + "word_offset_maps = []\n", + "for i in range(len(contexts)):\n", + " ori_offset_maps.append(tokenizer.get_offset_mapping(contexts[i]))\n", + " word_offset_maps.append(get_word_offset(contexts[i], words[i]))\n", + "\n", + "align_res = interpreter.alignment(analysis_result, contexts, words, word_offset_maps, ori_offset_maps, special_tokens=[\"[CLS]\", '[SEP]'],rationale_num=KEY_WORDS_NUM)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.core.display import display, HTML\n", + "class Visualization(VisualizationTextRecord):\n", + "\n", + " def __init__(self, interpret_res, true_label=None, pred_label=None, words=None):\n", + " if words is not None:\n", + " self.words = words\n", + " else:\n", + " self.words = interpret_res.words\n", + " self.pred_label = pred_label if pred_label is not None else ''\n", + " self.true_label = true_label if true_label is not None else ''\n", + " self.key_words = \" \".join(set(interpret_res.rationale_tokens))\n", + " word_attributions = interpret_res.word_attributions\n", + " _max = max(word_attributions)\n", + " _min = min(word_attributions)\n", + " self.word_attributions = [(word_imp - _min) / (_max - _min) for word_imp in word_attributions]\n", + "\n", + " def record_html(self):\n", + " \"\"\"change all informations to html\"\"\"\n", + " return \"\".join([\n", + " \"\",\n", + " self._format_class(self.true_label),\n", + " self._format_class(self.pred_label),\n", + " self._format_class(self.key_words),\n", + " self._format_word_attributions(),\n", + " \"\",\n", + " ])\n", + " def _format_class(self, label):\n", + " return '{label}'.format(label=label)\n", + "\n", + "def visualize_text(text_records):\n", + " \"\"\"visualize text\"\"\"\n", + " html = [\"\"]\n", + " rows = [\"\"\n", + " \"\"\n", + " \"\"\n", + " \"\"]\n", + " for record in text_records:\n", + " rows.append(record.record_html())\n", + " html.append(\"\".join(rows))\n", 
+ " html.append(\"
LabelPredictionKey wordsImportant visualization
\")\n", + " html = HTML(\"\".join(html))\n", + " display(html)\n", + " return html.data\n", + "\n", + "\n", + "def visualize(interpret_res, ds):\n", + " records = []\n", + " for i in range(len(interpret_res)):\n", + " records.append(Visualization(interpret_res[i], true_label=ds.data[i][\"label\"], pred_label=ds.data[i][\"predict\"]))\n", + " html = visualize_text(records)\n", + " return html" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
(词级别可解释性分析的可视化结果,原输出为 HTML 表格)

| Label | Prediction | Key words | Important visualization |
| --- | --- | --- | --- |
| 其他 | 注意事项 | 月 服用 请问 的 可以 | [CLS] 您好 请问 一岁 三个 孩子 可以 服用 复方 颗粒 [SEP] |
| 其他 | 就医建议 | 输卵管 基本 检查 粘连 的 | [CLS] 输卵管 粘连 基本 检查 [SEP] |
| 其他 | 病情诊断 | 胎动 么 ? 是 会 | [CLS] 胎动 [SEP] |
| 其他 | 病情诊断 | 这是 经常 干呕 了 生病 | [CLS] 经常 干呕 恶心 这是 生病 [SEP] |
| 就医建议 | 治疗方案 | 治 治疗 菏泽 怎么 白癜风 | [CLS] 菏泽 哪个 医院 治疗 白癜风 比较 ? 怎么 [SEP] |
| 其他 | 后果表述 | 左旋 不良反应 吃 的 肉碱 | [CLS] 左旋 肉碱 不良反应 [SEP] |
| 注意事项 | 其他 | 上 出血 吗 做爱 环后 | [CLS] 环后 出血 可以 做爱 [SEP] |
| 病情诊断 | 病因分析 | 感冒 了 呀 怎么 会 | [CLS] 孩子 感冒 怎么 喘息 [SEP] |
| 其他 | 治疗方案 | 孕 周 21 | [CLS] 21 [SEP] |
| 其他 | 指标解读 | 谱 心肌 意义 酶 ? | [CLS] 心肌 五项 意义 [SEP] |
| 病情诊断 | 其他 | 家长 判断 吃 吃饱 怎么 | [CLS] 家长 怎么 判断 孩子 吃饱 怎么 不肯 就是 [SEP] |
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# process for vbisualize\n", + "html = visualize(align_res, interpret_ds)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.7.13 64-bit", + "metadata": { + "interpreter": { + "hash": "767d51c1340bd893661ea55ea3124f6de3c7a262a8b4abca0554b478b1e2ff90" + } + }, + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.13-final" + }, + "orig_nbformat": 2 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/applications/text_classification/multi_class/deploy/simple_serving/README.md b/applications/text_classification/multi_class/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6ec4f03ad2277907e1ff4ebef4c63d9a9a95db73 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/simple_serving/README.md @@ -0,0 +1,23 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 +```shell +pip install paddlenlp >= 2.5.1 +``` +## Server服务启动 +### 分类任务启动 +#### 启动分类 Server 服务 +```bash +paddlenlp server server:app --host 0.0.0.0 --port 8189 +``` + +#### 启动分类 Client 服务 +```bash +python client.py +``` diff --git a/applications/text_classification/multi_class/deploy/simple_serving/client.py b/applications/text_classification/multi_class/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..57389859efd92107080303d54d2b6e311fba96aa --- /dev/null +++ b/applications/text_classification/multi_class/deploy/simple_serving/client.py @@ -0,0 +1,27 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import requests + +url = "http://0.0.0.0:8189/taskflow/cls" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = ["黑苦荞茶的功效与作用及食用方法", "交界痣会凸起吗", "检查是否能怀孕挂什么科", "鱼油怎么吃咬破吃还是直接咽下去", "幼儿挑食的生理原因是"] + data = {"data": {"text": texts}} + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + result_json = json.loads(r.text) + print(result_json["result"]) diff --git a/applications/text_classification/multi_class/deploy/simple_serving/server.py b/applications/text_classification/multi_class/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..75701c1c797694bd5abded708a16a5961523c1c9 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/simple_serving/server.py @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer, Taskflow + +cls = Taskflow("text_classification", task_path="../../checkpoint/export", is_static_model=True) +app = SimpleServer() +app.register_taskflow("taskflow/cls", cls) diff --git a/applications/text_classification/multi_class/deploy/triton_serving/README.md b/applications/text_classification/multi_class/deploy/triton_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7fbcb5c92013441f489e54c75412a3a8c3adf679 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/README.md @@ -0,0 +1,200 @@ +# 基于Triton Inference Server的服务化部署指南 + +本文档将介绍如何使用[Triton Inference Server](https://github.com/triton-inference-server/server)工具部署基于ERNIE 3.0中文模型文本多分类的pipeline在线服务。 + +## 目录 +- [服务端环境准备](#服务端环境准备) +- [模型获取和转换](#模型获取和转换) +- [部署模型](#部署模型) +- [客户端请求](#客户端请求) + +## 服务端环境准备 + +### 安装Triton Server +拉取Triton Server镜像: +```shell +docker pull nvcr.io/nvidia/tritonserver:21.10-py3 +``` +启动容器: +```shell +docker run -it --gpus all --net=host --name triton_server -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash +``` + +**NOTE:** + +1. Triton版本号`21.10`可以根据自己的需求调整,各个Triton版本对应的Driver、CUDA、TRT和ONNX Runtime等后端版本可以参考[官网文档](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)。注意其中的`NVIDIA Driver`行,如果NVIDIA Driver低于文档中要求,在启动运行时会报错。 + +2. 可以使用`--gpus '"device=1"'`来指定GPU卡号,更多GPU指定方式请参见[Nvidia User Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration) + + +### 进入容器并准备PaddleNLP环境 +整个服务的前后处理依赖PaddleNLP,需要在容器内安装相关python包 + +进入容器: +```shell +docker exec -it triton_server bash +``` +安装PaddlePaddle、PaddleNLP +```shell +python3 -m pip install paddlepaddle-gpu paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + +**NOTE:** + +1. 默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(-i https://mirror.baidu.com/pypi/simple) + +2. 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.2, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + +3. 
更多关于PaddleNLP安装的详细教程请查看[Installation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + + +### 安装FastTokenizer文本处理加速库(可选) + +部署环境是Linux,推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 + +在容器内安装 fast_tokenizer +```shell +python3 -m pip install fast-tokenizer-python +``` + + +## 模型获取和转换 + +使用Triton做服务化部署时,选择ONNX Runtime后端运行需要先将模型转换成ONNX格式。使用Paddle2ONNX将Paddle静态图模型转换为ONNX模型格式的命令如下,以下命令成功运行后,将会在当前目录下生成model.onnx模型文件。 +```shell +paddle2onnx --model_dir ../../checkpoint/export --model_filename model.pdmodel --params_filename model.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True +``` +创建空白目录/seqcls/1和seqcls_model/1,并将将转换好的ONNX模型移动到模型仓库目录 +```shell +mkdir /models/seqcls/1 +mkdir /models/seqcls_model/1 +mv model.onnx /models/seqcls_model/1 +``` + +Paddle2ONNX的命令行参数说明请查阅:[Paddle2ONNX命令行参数说明](https://github.com/PaddlePaddle/Paddle2ONNX#%E5%8F%82%E6%95%B0%E9%80%89%E9%A1%B9) + +模型下载转换好之后,models目录结构如下: +``` +models +├── seqcls +│   ├── 1 +│   └── config.pbtxt +├── seqcls_model +│   ├── 1 +│   │   └── model.onnx +│   └── config.pbtxt +├── seqcls_postprocess +│   ├── 1 +│   │   └── model.py +│   └── config.pbtxt +└── tokenizer + ├── 1 + │   └── model.py + └── config.pbtxt +``` + +模型配置文件config.pbtxt配置细节请参见[Triton Server Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md) + +## 部署模型 + +triton目录包含启动pipeline服务的配置和发送预测请求的代码,包括: + +``` +models # Triton启动需要的模型仓库,包含模型和服务配置文件 +seqcls_grpc_client.py # 分类任务发送pipeline预测请求的脚本 +``` + +### 启动服务端 + +在容器内执行下面命令启动服务,默认启动models下所有模型: +```shell +tritonserver --model-repository=/models +``` +也可以通过设定参数只启动单一任务服务: +```shell +tritonserver --model-repository=/models --model-control-mode=explicit --load-model=seqcls +``` + +**NOTE:** + +启动服务时,Triton Server的每个python后端进程默认申请`64M`内存,默认启动的docker无法启动多个python后端节点。两个解决方案: + +1. 启动容器时设置`shm-size`参数, 比如:`docker run -it --net=host --name triton_server --shm-size="1g" -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash` + +2. 启动服务时设置python后端的`shm-default-byte-size`参数, 设置python后端的默认内存为10M: `tritonserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760` + +输出打印如下: + +``` +... +I0619 13:40:51.590901 5127 onnxruntime.cc:1999] TRITONBACKEND_Initialize: onnxruntime +I0619 13:40:51.590938 5127 onnxruntime.cc:2009] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.590947 5127 onnxruntime.cc:2015] 'onnxruntime' TRITONBACKEND API version: 1.6 +I0619 13:40:51.623808 5127 openvino.cc:1193] TRITONBACKEND_Initialize: openvino +I0619 13:40:51.623862 5127 openvino.cc:1203] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.623868 5127 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.6 +I0619 13:40:52.980990 5127 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f14d8000000' with size 268435456 +... +I0619 13:43:33.360018 5127 server.cc:592] ++--------------------+---------+--------+ +| Model | Version | Status | ++--------------------+---------+--------+ +| seqcls | 1 | READY | +| seqcls_model | 1 | READY | +| seqcls_postprocess | 1 | READY | +| tokenizer | 1 | READY | ++--------------------+---------+--------+ +... 
+I0619 13:43:33.365824 5127 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001 +I0619 13:43:33.366221 5127 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000 +I0619 13:43:33.409775 5127 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 +``` + +## 客户端请求 + +### 客户端环境准备 +客户端请求有两种方式,可以选择在本地执行脚本请求,或下载官方客户端镜像在容器中执行。 + +方式一:本地执行脚本,需要先安装依赖: +```shell +pip install grpcio +pip install tritonclient==2.10.0 +``` + +方式二:拉取官网镜像并启动容器: +```shell +docker pull nvcr.io/nvidia/tritonserver:21.10-py3-sdk +docker run -it --net=host --name triton_client -v /path/to/triton:/triton_code nvcr.io/nvidia/tritonserver:21.10-py3-sdk bash +``` + +### 启动客户端测试 +注意执行客户端请求时关闭代理,并根据实际情况修改main函数中的ip地址(启动服务所在的机器) + +```shell +python seqcls_grpc_client.py +``` + +输出打印如下: + +``` +text: 黑苦荞茶的功效与作用及食用方法 +label: 功效作用 +confidence: 0.984 +-------------------- +text: 交界痣会凸起吗 +label: 疾病表述 +confidence: 0.904 +-------------------- +text: 检查是否能怀孕挂什么科 +label: 就医建议 +confidence: 0.969 +-------------------- +text: 幼儿挑食的生理原因是 +label: 病因分析 +confidence: 0.495 +-------------------- +text: 鱼油怎么吃咬破吃还是直接咽下去 +label: 其他 +confidence: 0.850 +-------------------- +``` diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls/config.pbtxt b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..82261157aefe68bac9a1865d888c0257d2e905e8 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls/config.pbtxt @@ -0,0 +1,75 @@ +name: "seqcls" +platform: "ensemble" +max_batch_size: 64 +input [ + { + name: "INPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "label" + data_type: TYPE_INT64 + dims: [ 1 ] + }, + { + name: "confidence" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "tokenizer" + model_version: 1 + input_map { + key: "INPUT_0" + value: "INPUT" + } + output_map { + key: "OUTPUT_0" + value: "tokenizer_input_ids" + } + output_map { + key: "OUTPUT_1" + value: "tokenizer_token_type_ids" + } + }, + { + model_name: "seqcls_model" + model_version: 1 + input_map { + key: "input_ids" + value: "tokenizer_input_ids" + } + input_map { + key: "token_type_ids" + value: "tokenizer_token_type_ids" + } + output_map { + key: "linear_75.tmp_1" + value: "OUTPUT_2" + } + }, + { + model_name: "seqcls_postprocess" + model_version: 1 + input_map { + key: "POST_INPUT" + value: "OUTPUT_2" + } + output_map { + key: "POST_label" + value: "label" + } + output_map { + key: "POST_confidence" + value: "confidence" + } + } + ] +} + diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_model/config.pbtxt b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_model/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..82912ecbe05ee7b8688e5f5e68eafdc138999626 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_model/config.pbtxt @@ -0,0 +1,36 @@ +platform: "onnxruntime_onnx" +max_batch_size: 64 +input [ + { + name: "input_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "token_type_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] +output [ + { + name: "linear_75.tmp_1" + data_type: TYPE_FP32 + dims: [ 11 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_GPU + } +] + +optimization { + graph: {level: -1} +} + +parameters { key: 
"intra_op_thread_count" value: { string_value: "0" } } +parameters { key: "execution_mode" value: { string_value: "0" } } +parameters { key: "inter_op_thread_count" value: { string_value: "0" } } diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/1/model.py b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..ad18d669698eb30f7121f0a89b3e32a8a3d44bfd --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/1/model.py @@ -0,0 +1,101 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel(object): + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.txt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. 
Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + # print("num:", len(requests), flush=True) + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + max_value = np.max(data, axis=1, keepdims=True) + exp_data = np.exp(data - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + probs = probs.max(axis=-1) + out_tensor1 = pb_utils.Tensor(self.output_names[0], data.argmax(axis=-1)) + out_tensor2 = pb_utils.Tensor(self.output_names[1], probs) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..a5d90d3bcbb02b50409e9053cb730f3ea193b9be --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt @@ -0,0 +1,31 @@ +name: "seqcls_postprocess" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "POST_INPUT" + data_type: TYPE_FP32 + dims: [ 11 ] + } +] + +output [ + { + name: "POST_label" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "POST_confidence" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/1/model.py b/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..97a96222075370ebcca531250dc6cc7c0e8fead0 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/1/model.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. 
It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + +from paddlenlp.transformers import AutoTokenizer + + +class TritonPythonModel(object): + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.pbtxt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) + # You must parse model_config. JSON string is not parsed here + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + # print("num:", len(requests), flush=True) + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = [i[0].decode("utf-8") for i in data] + data = self.tokenizer(data, max_length=128, padding=True, truncation=True) + input_ids = np.array(data["input_ids"], dtype=self.output_dtype[0]) + token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) + + # print("input_ids:", input_ids) + # print("token_type_ids:", token_type_ids) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) + out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. 
+ Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/config.pbtxt b/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..d35d1f44968ba205b1890899a82568d33e90a999 --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/models/tokenizer/config.pbtxt @@ -0,0 +1,31 @@ +name: "tokenizer" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT_0" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT_0" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "OUTPUT_1" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/multi_class/deploy/triton_serving/seqcls_grpc_client.py b/applications/text_classification/multi_class/deploy/triton_serving/seqcls_grpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..4144f5484b214de0daf1ab468c88d5b078f9499d --- /dev/null +++ b/applications/text_classification/multi_class/deploy/triton_serving/seqcls_grpc_client.py @@ -0,0 +1,112 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from typing import Optional + +import numpy as np +from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput + +LOGGER = logging.getLogger("run_inference_on_triton") + + +class SyncGRPCTritonRunner: + DEFAULT_MAX_RESP_WAIT_S = 120 + + def __init__( + self, + server_url: str, + model_name: str, + model_version: str, + *, + verbose=False, + resp_wait_s: Optional[float] = None, + ): + self._server_url = server_url + self._model_name = model_name + self._model_version = model_version + self._verbose = verbose + self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s + + self._client = InferenceServerClient(self._server_url, verbose=self._verbose) + error = self._verify_triton_state(self._client) + if error: + raise RuntimeError(f"Could not communicate to Triton Server: {error}") + + LOGGER.debug( + f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " + f"are up and ready!" 
+ ) + + model_config = self._client.get_model_config(self._model_name, self._model_version) + model_metadata = self._client.get_model_metadata(self._model_name, self._model_version) + LOGGER.info(f"Model config {model_config}") + LOGGER.info(f"Model metadata {model_metadata}") + + self._inputs = {tm.name: tm for tm in model_metadata.inputs} + self._input_names = list(self._inputs) + self._outputs = {tm.name: tm for tm in model_metadata.outputs} + self._output_names = list(self._outputs) + self._outputs_req = [InferRequestedOutput(name) for name in self._outputs] + + def Run(self, inputs): + """ + Args: + inputs: list, Each value corresponds to an input name of self._input_names + Returns: + results: dict, {name : numpy.array} + """ + infer_inputs = [] + for idx, data in enumerate(inputs): + data = np.array([[x.encode("utf-8")] for x in data], dtype=np.object_) + infer_input = InferInput(self._input_names[idx], [len(data), 1], "BYTES") + infer_input.set_data_from_numpy(data) + infer_inputs.append(infer_input) + + results = self._client.infer( + model_name=self._model_name, + model_version=self._model_version, + inputs=infer_inputs, + outputs=self._outputs_req, + client_timeout=self._response_wait_t, + ) + results = {name: results.as_numpy(name) for name in self._output_names} + return results + + def _verify_triton_state(self, triton_client): + if not triton_client.is_server_live(): + return f"Triton server {self._server_url} is not live" + elif not triton_client.is_server_ready(): + return f"Triton server {self._server_url} is not ready" + elif not triton_client.is_model_ready(self._model_name, self._model_version): + return f"Model {self._model_name}:{self._model_version} is not ready" + return None + + +if __name__ == "__main__": + model_name = "seqcls" + model_version = "1" + url = "localhost:8001" + runner = SyncGRPCTritonRunner(url, model_name, model_version) + + data = [["黑苦荞茶的功效与作用及食用方法", "交界痣会凸起吗", "检查是否能怀孕挂什么科"], ["幼儿挑食的生理原因是"], ["鱼油怎么吃咬破吃还是直接咽下去"]] + label_list = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] + for texts in data: + # input format:[input1, input2 ... inputn], n = len(self._input_names) + result = runner.Run([texts]) + for i, text in enumerate(texts): + print("text: ", text) + print("label: ", label_list[result["label"][i]]) + print("confidence: ", "{:.3f}".format(result["confidence"][i])) + print("--------------------") diff --git a/applications/text_classification/multi_class/few-shot/README.md b/applications/text_classification/multi_class/few-shot/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b0ec99bdde54dbca9df475a1ca676c5e5ea8b2eb --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/README.md @@ -0,0 +1,360 @@ +# 小样本场景下的二/多分类任务指南 + +**零样本/小样本文本分类推荐使用 UTC 模型,详情见[目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/zero_shot_text_classification),本项目将会在2.5.2版本下线。** + +## 目录 + +- [1. 项目说明](#项目说明) +- [2. 效果展示](#效果展示) +- [3. 定制训练](#定制训练) + - [3.1 运行环境](#运行环境) + - [3.2 代码结构](#代码结构) + - [3.3 数据标注](#数据标注) + - [3.4 模型训练](#模型训练) + - [3.5 模型评估](#模型评估) + - [3.6 模型部署](#模型部署) +- [4. References](#References) + + +## 1. 
项目说明 + +本项目提供了小样本场景下文本二/多分类的解决方案,在 ERNIE3.0 的基础上利用提示学习取得比微调更好的分类效果,充分利用标注信息。 + +### 模型介绍 + +**文本二/多分类** 用于预测样本属于标签候选集中的哪个类别,在商品分类、网页分类、新闻分类、医疗文本分类等现实场景中有着广泛应用。 +现有的主流解决方案是在预训练语言模型上进行微调,因为二/多分类任务与预训练阶段的掩码预测任务有着天然的差异,想要取得较好的分类效果往往需要大量数据标注。 + +**提示学习(Prompt Learning)** 的主要思想是将二/多分类任务转换为掩码预测任务,充分利用预训练语言模型学习到的特征,从而降低样本需求。以情感分类任务为例,标签分为`1-正向`,`0-负向`两类,如下图所示,通过提示`我[MASK]喜欢。`,原有`1-正向`,`0-负向`的标签被转化为了预测空格是`很`还是`不`。 + +
+ (此处为提示学习示意图:提示将分类任务转化为掩码预测)
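为便于理解提示与掩码位置如何与原文本拼接,下面给出一个极简的模板构造示意(仅作说明,并非本项目的训练脚本;`AutoTemplate.create_from` 的用法与本目录 infer.py 中一致,提示语沿用下文训练命令中的示例,实际返回结构以所装版本为准):

```python
from paddlenlp.prompt import AutoTemplate
from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh")

# 提示语会被拼接到原文本之后,并保留一个 [MASK] 位置,
# 由掩码语言模型在该位置预测标签映射词(如“汽车”“文化”)
template = AutoTemplate.create_from("这条新闻标题的主题是", tokenizer, 128, model)
print(template.prompt)  # 查看自动补全后的模板结构(text / prompt / mask 等片段)
```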
+ +微调方法和提示方法的区别如图所示: + +【微调学习】需要学习的参数是以 `[CLS]` 向量为输入,以负向/正向为输出的随机初始化的分类器。 + +【提示学习】通过构造提示,将原有的分类任务转化为掩码预测,即掩盖原句中的某个字,用模型预测该字。此时的分类器不再是随机初始化,而是利用了待预测字的预训练向量来初始化,充分利用了预训练模型学习到的参数。 + +【方案选择】对于标注样本充足的场景可以直接使用[微调学习](../README.md)实现文本多分类,对于尚无标注或者标注样本较少的任务场景我们推荐使用提示学习,以取得更好的效果。 + +### 方案特点 + +- **标注成本低**:以往的微调方式需要大量的数据标注才能保证模型分类效果。提示学习可以降低数据标注依赖,在少样本(few-shot)的场景下取得比微调更好的分类效果。 +- **全流程打通**:提供了从训练到部署的完整解决方案,可以低成本迁移至实际应用场景。 + + +## 2.效果展示 + +本项目中使用了 ERNIE3.0 模型,对于中文训练任务可以根据需求选择不同的预训练模型参数进行训练,我们测评了 Base 模型在新闻分类任务上的表现。测试配置如下: + +1. 数据集:[FewCLUE](https://github.com/CLUEbenchmark/FewCLUE)中的新闻分类(tnews)任务测试集。 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.4rc + +4. PaddleNLP 版本:2.4.3 + +5. 评估设置 + + 每个 epoch 评估一次,按照验证集上的评价指标,取验证集上分数最高的模型参数用于测试集的测试。为了避免过拟合,这里使用了早停机制 (Early-stopping)。表格中的最终结果为重复 10 次的均值。 + +- 微调 + +``` +cd ../ +python train.py --dataset_dir "./data/" --save_dir "./checkpoints" --max_seq_length 128 --model_name "ernie-3.0-base-zh" --batch_size 8 --learning_rate 3e-5 --epochs 100 --logging_steps 5 --early_stop +``` + +- 提示学习 + +``` +python train.py --data_dir ./data/ --output_dir ./checkpoints/ --prompt "这条新闻写的是" --model_name_or_path ernie-3.0-base-zh --max_seq_length 128 --learning_rate 3e-5 --ppt_learning_rate 3e-4 --do_train --do_eval --num_train_epochs 100 --logging_steps 5 --per_device_eval_batch_size 32 --per_device_train_batch_size 8 --do_predict --metric_for_best_model accuracy --load_best_model_at_end --evaluation_strategy epoch --save_strategy epoch --save_total_limit 1 +``` + +6. 精度评价指标:Accuracy + + +| model_name | 训练方式 | Accuracy | +| ---------- | ------- | ---- | +| ernie-3.0-base-zh | 微调学习 | 0.5046 | +| ernie-3.0-base-zh | 提示学习 | 0.5521 | + + + +## 3.定制训练 + +下边通过**新闻分类**的例子展示如何使用小样本学习来进行文本分类。 + + +### 3.1 运行环境 + +- python >= 3.7 +- paddlepaddle >= 2.4rc +- paddlenlp >= 2.4.3 +- paddle2onnx >= 1.0.3 + + +### 3.2 代码结构 + +```text +. +├── train.py # 模型组网训练脚本 +├── utils.py # 数据处理工具 +├── infer.py # 模型部署脚本 +└── README.md +``` + + +### 3.3 数据标注 + +我们推荐使用数据标注平台[doccano](https://github.com/doccano/doccano)进行自定义数据标注,本项目也打通了从标注到训练的通道,即doccano导出数据后可通过[doccano.py](../../doccano.py)脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考[doccano数据标注指南](../../doccano.md)。 + +**示例数据** + +这里我们使用[FewCLUE](https://github.com/CLUEbenchmark/FewCLUE)中的新闻分类tnews数据集后缀为0的子集作为示例数据集,可点击[这里](https://paddlenlp.bj.bcebos.com/datasets/few-shot/tnews.tar.gz)下载解压并放入`./data/`文件夹,或者运行以下脚本 + +``` +wget https://paddlenlp.bj.bcebos.com/datasets/few-shot/tnews.tar.gz +tar zxvf tnews.tar.gz +mv tnews data +``` + +**数据格式** + +下边主要介绍二/多分类任务自定义数据集的格式要求,整体目录如下 + +```text +data/ +├── train.txt # 训练数据集 +├── dev.txt # 验证数据集 +├── test.txt # 测试数据集(可选) +├── data.txt # 待预测数据(可选) +└── label.txt # 分类标签集 +``` + +**训练/验证/测试数据** + +对于训练/验证/测试数据集文件,每行数据表示一条样本,包括文本和标签两部分,由tab符`\t`分隔。格式如下 +```text +<文本>'\t'<标签> +<文本>'\t'<标签> +... +``` +例如,在新闻分类数据集中 +``` +文登区这些公路及危桥将进入封闭施工,请注意绕行! news_car +普洱茶要如何醒茶? news_culture +... +``` + +**预测数据** + +对于待预测数据文件,每行包含一条待预测样本,无标签。格式如下 +```text +<文本> +<文本> +... +``` +例如,在新闻分类数据集中 +``` +互联网时代如何保护个人信息 +清秋暮雨读柳词:忍把浮名,换了浅斟低唱丨周末读诗 +... +``` + +**标签数据** + +对于分类标签集文件,存储了数据集中所有的标签集合,每行为一个标签名。如果需要自定义标签映射用于分类器初始化,则每行需要包括标签名和相应的映射词,由`==`分隔。格式如下 +```text +<标签>'=='<映射词> +<标签>'=='<映射词> +... +``` +例如,对于新闻分类数据集,原标签`news_car`可被映射为中文`汽车`等等。 +``` +news_car==汽车 +news_culture==文化 +... 
+``` +**Note**: 这里的标签映射词定义遵循的规则是,不同映射词尽可能长度一致,映射词和提示需要尽可能构成通顺的语句。越接近自然语句,小样本下模型训练效果越好。如果原标签名已经可以构成通顺语句,也可以不构造映射词,每行一个标签即可。 + + +### 3.4 模型训练 + +**单卡训练** + +``` +python train.py \ +--device gpu \ +--data_dir ./data \ +--output_dir ./checkpoints/ \ +--prompt "这条新闻标题的主题是" \ +--max_seq_length 128 \ +--learning_rate 3e-6 \ +--ppt_learning_rate 3e-5 \ +--do_train \ +--do_eval \ +--use_rdrop \ +--max_steps 1000 \ +--eval_steps 10 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--load_best_model_at_end True \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--do_predict \ +--do_export +``` +**多卡训练** + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus 0,1,2,3 train.py \ +--data_dir ./data \ +--output_dir ./checkpoints/ \ +--prompt "这条新闻标题的主题是" \ +--max_seq_length 128 \ +--learning_rate 3e-6 \ +--ppt_learning_rate 3e-5 \ +--do_train \ +--do_eval \ +--use_rdrop \ +--do_eval \ +--max_steps 1000 \ +--eval_steps 10 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--load_best_model_at_end True \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--do_predict \ +--do_export +``` + +可配置参数说明: +- `model_name_or_path`: 内置模型名,或者模型参数配置目录路径。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 训练数据集路径,数据格式要求详见[数据标注](#数据标注)。 +- `output_dir`: 模型参数、训练日志和静态图导出的保存目录。 +- `prompt`: 提示模板。定义了如何将文本和提示拼接结合。 +- `soft_encoder`: 提示向量的编码器,`lstm`表示双向LSTM, `mlp`表示双层线性层, None表示直接使用提示向量。默认为`lstm`。 +- `use_rdrop`: 使用 [R-Drop](https://arxiv.org/abs/2106.14448) 策略。 +- `use_rgl`: 使用 [RGL](https://aclanthology.org/2022.findings-naacl.81/) 策略。 +- `encoder_hidden_size`: 提示向量的维度。若为None,则使用预训练模型字向量维度。默认为200。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `learning_rate`: 预训练语言模型参数基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。 +- `ppt_learning_rate`: 提示相关参数的基础学习率大小,当预训练参数不固定时,与其共用learning rate scheduler。一般设为`learning_rate`的十倍。 +- `do_train`: 是否进行训练。 +- `do_eval`: 是否进行评估。 +- `do_predict`: 是否进行预测。 +- `do_export`: 是否在运行结束时将模型导出为静态图,保存路径为`output_dir/export`。 +- `max_steps`: 训练的最大步数。此设置将会覆盖`num_train_epochs`。 +- `save_total_limit`: 模型检查点保存数量。 +- `eval_steps`: 评估模型的间隔步数。 +- `device`: 使用的设备,默认为`gpu`。 +- `logging_steps`: 打印日志的间隔步数。 +- `per_device_train_batch_size`: 每次训练每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `per_device_eval_batch_size`: 每次评估每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 + +更多参数介绍可参考[配置文件](https://paddlenlp.readthedocs.io/zh/latest/trainer.html)。 + + +### 3.5 模型评估 + +在模型训练时开启`--do_predict`,训练结束后直接在测试集上`test.txt`进行评估,也可以在训练结束后,通过运行以下命令加载模型参数进行评估: +``` +python train.py --do_predict --data_dir ./data --output_dir ./predict_checkpoint --resume_from_checkpoint ./checkpoints/ --max_seq_length 128 +``` + +可配置参数说明: + +- `data_dir`: 测试数据路径。测试数据应存放在该目录下`test.txt`文件中,每行一条待预测文本。 +- `output_dir`: 日志的保存目录。 +- `resume_from_checkpoint`: 训练时模型参数的保存目录,用于加载模型参数。 +- `do_predict`: 是否进行预测。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 + + +### 3.6 模型部署 + +#### 模型导出 + +在训练结束后,需要将动态图模型导出为静态图参数用于部署推理。可以在模型训练时开启`--do_export`在训练结束后直接导出,也可以运行以下命令加载并导出训练后的模型参数,默认导出到在`output_dir`指定的目录下。 +``` +python train.py --do_export --data_dir ./data --output_dir ./export_checkpoint --resume_from_checkpoint ./checkpoints/ +``` + +可配置参数说明: + +- `data_dir`: 标签数据路径。 +- `output_dir`: 静态图模型参数和日志的保存目录。 +- `resume_from_checkpoint`: 训练时模型参数的保存目录,用于加载模型参数。 +- `do_export`: 是否将模型导出为静态图,保存路径为`output_dir/export`。 +- `export_type`: 模型导出的格式,默认为`paddle`,即导出静态图。 + +#### ONNXRuntime部署 + +**运行环境** + +模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime,Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 
7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。 + +- 如果基于GPU部署,请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime-gpu onnx onnxconverter-common +``` + +- 如果基于CPU部署,请使用如下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime +``` + +**CPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device cpu +``` + +**GPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device gpu --device_id 0 +``` + +可配置参数说明: + +- `model_path_prefix`: 导出的静态图模型路径及文件前缀。 +- `model_name`: 内置预训练模型名,用于加载tokenizer。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 待推理数据所在路径,数据应存放在该目录下的`data.txt`文件。 +- `max_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `batch_size`: 每次预测的样本数量。 +- `device`: 选择推理设备,包括`cpu`和`gpu`。默认为`gpu`。 +- `device_id`: 指定GPU设备ID。 +- `use_fp16`: 是否使用半精度加速推理。仅在GPU设备上有效。 +- `num_threads`: 设置CPU使用的线程数。默认为机器上的物理内核数。 + +**Note**: 在GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,在包括V100、T4、A10、A100、GTX 20系列和30系列显卡等设备上可以开启FP16进行加速,在CPU或者CUDA计算能力 (CUDA Compute Capability) 小于7.0时开启不会带来加速效果。 + + +## 4. References + +- Liu, Xiao, et al. "GPT understands, too." arXiv preprint arXiv:2103.10385 (2021). [[PDF]](https://arxiv.org/abs/2103.10385) +- Hambardzumyan, Karen, Hrant Khachatrian, and Jonathan May. "Warp: Word-level adversarial reprogramming." arXiv preprint arXiv:2101.00121 (2021). [[PDF]](https://arxiv.org/abs/2101.00121) +- Ding, Ning, et al. "Openprompt: An open-source framework for prompt-learning." arXiv preprint arXiv:2111.01998 (2021). [[PDF]](https://arxiv.org/abs/2111.01998) diff --git a/applications/text_classification/multi_class/few-shot/infer.py b/applications/text_classification/multi_class/few-shot/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..6e5b351fdafb916dc1ebed70925158ff187fb106 --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/infer.py @@ -0,0 +1,221 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
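下方 infer.py 中的 `MultiClassPredictor` 封装了模板构造、分词、ONNXRuntime 推理与后处理,其典型调用方式大致如下(仅为示意:路径与参数均为占位,字段名对应脚本的命令行参数,需在 infer.py 同一环境中运行):

```python
from types import SimpleNamespace

# 示意:构造一个与 infer.py 的 argparse 字段对应的参数对象(取值为占位示例)
args = SimpleNamespace(
    model_path_prefix="checkpoints/export/model",  # 导出的静态图前缀,占位路径
    model_name="ernie-3.0-base-zh",
    max_length=128,
    batch_size=32,
    device="cpu",
    device_id=0,
    use_fp16=False,
    num_threads=4,
)

predictor = MultiClassPredictor(args)            # 加载模板、标签词与 ONNX 推理引擎
predictor.predict(["互联网时代如何保护个人信息"])   # 输入为文本列表,内部完成分批与后处理
```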
+
+import argparse
+import json
+import os
+
+import numpy as np
+import onnxruntime as ort
+import paddle2onnx
+import psutil
+import six
+
+from paddlenlp.prompt import AutoTemplate, PromptDataCollatorWithPadding
+from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer
+from paddlenlp.utils.log import logger
+
+# yapf: disable
+parser = argparse.ArgumentParser()
+parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.")
+parser.add_argument("--model_name", default="ernie-3.0-base-zh", type=str, help="The name of pretrained model.")
+parser.add_argument("--data_dir", default=None, type=str, help="The path to the prediction data, including label.txt and data.txt.")
+parser.add_argument("--max_length", default=128, type=int, help="The maximum total input sequence length after tokenization.")
+parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.")
+parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for predicting.")
+parser.add_argument("--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu.")
+parser.add_argument("--device", choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
+parser.add_argument("--device_id", default=0, type=int, help="Select which gpu device to train model.")
+args = parser.parse_args()
+# yapf: enable
+
+
+class InferBackend(object):
+    def __init__(self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, num_threads=10):
+
+        if not isinstance(device, six.string_types):
+            logger.error(
+                ">>> [InferBackend] The type of device must be string, but the type you set is: {}".format(type(device))
+            )
+            exit(1)
+        if device not in ["cpu", "gpu"]:
+            logger.error(">>> [InferBackend] The device must be cpu or gpu, but your device is set to: {}".format(device))
+            exit(1)
+
+        logger.info(">>> [InferBackend] Creating Engine ...")
+
+        onnx_model = paddle2onnx.command.c_paddle_to_onnx(
+            model_file=model_path_prefix + ".pdmodel",
+            params_file=model_path_prefix + ".pdiparams",
+            opset_version=13,
+            enable_onnx_checker=True,
+        )
+        infer_model_dir = model_path_prefix.rsplit("/", 1)[0]
+        float_onnx_file = os.path.join(infer_model_dir, "model.onnx")
+        # The exported ONNX model is a byte string, so write it in binary mode (no encoding argument).
+        with open(float_onnx_file, "wb") as f:
+            f.write(onnx_model)
+
+        if device == "gpu":
+            logger.info(">>> [InferBackend] Use GPU to inference ...")
+            providers = ["CUDAExecutionProvider"]
+            if use_fp16:
+                logger.info(">>> [InferBackend] Use FP16 to inference ...")
+                import onnx
+                from onnxconverter_common import float16
+
+                fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx")
+                onnx_model = onnx.load_model(float_onnx_file)
+                trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True)
+                onnx.save_model(trans_model, fp16_model_file)
+                onnx_model = fp16_model_file
+        else:
+            logger.info(">>> [InferBackend] Use CPU to inference ...")
+            providers = ["CPUExecutionProvider"]
+            if use_fp16:
+                logger.warning(
+                    ">>> [InferBackend] Ignore use_fp16 as it only "
+                    "takes effect when deploying on gpu..."
+ ) + + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=providers, provider_options=[{"device_id": device_id}] + ) + + if device == "gpu": + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + "The environment for GPU inference is not set properly. " + "A possible cause is that you had installed both onnxruntime and onnxruntime-gpu. " + "Please run the following commands to reinstall: \n " + "1) pip uninstall -y onnxruntime onnxruntime-gpu \n 2) pip install onnxruntime-gpu" + ) + logger.info(">>> [InferBackend] Engine Created ...") + + def infer(self, input_dict: dict): + result = self.predictor.run(None, input_dict) + return result + + +class MultiClassPredictor(object): + def __init__(self, args): + self.args = args + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name) + self.model = AutoModelForMaskedLM.from_pretrained(args.model_name) + self.template, self.labels, self.input_handles = self.post_init() + self.collate_fn = PromptDataCollatorWithPadding( + self.tokenizer, padding=True, return_tensors="np", return_attention_mask=True + ) + + self.inference_backend = InferBackend( + self.args.model_path_prefix, + self.args.device, + self.args.device_id, + self.args.use_fp16, + self.args.num_threads, + ) + + def post_init(self): + export_path = os.path.dirname(self.args.model_path_prefix) + template_path = os.path.join(export_path, "template_config.json") + with open(template_path, "r", encoding="utf-8") as fp: + prompt = json.load(fp) + template = AutoTemplate.create_from(prompt, self.tokenizer, self.args.max_length, self.model) + keywords = template.extract_template_keywords(template.prompt) + inputs = ["input_ids", "token_type_ids", "position_ids", "attention_mask"] + if "mask" in keywords: + inputs.append("masked_positions") + if "soft" in keywords: + inputs.append("soft_token_ids") + if "encoder" in keywords: + inputs.append("encoder_ids") + verbalizer_path = os.path.join(export_path, "verbalizer_config.json") + with open(verbalizer_path, "r", encoding="utf-8") as fp: + label_words = json.load(fp) + labels = sorted(list(label_words.keys())) + + return template, labels, inputs + + def predict(self, input_data: list): + encoded_inputs = self.preprocess(input_data) + infer_result = self.infer_batch(encoded_inputs) + result = self.postprocess(infer_result) + self.printer(result, input_data) + return result + + def _infer(self, input_dict): + infer_data = self.inference_backend.infer(input_dict) + return infer_data + + def infer_batch(self, inputs): + num_sample = len(inputs) + infer_data = None + num_infer_data = None + for index in range(0, num_sample, self.args.batch_size): + left, right = index, index + self.args.batch_size + batch_dict = self.collate_fn(inputs[left:right]) + input_dict = {} + for key in self.input_handles: + value = batch_dict[key] + if key == "attention_mask": + if value.ndim == 2: + value = (1 - value[:, np.newaxis, np.newaxis, :]) * -1e4 + elif value.ndim != 4: + raise ValueError("Expect attention mask with ndim=2 or 4, but get ndim={}".format(value.ndim)) + value = value.astype("float32") + else: + value = value.astype("int64") + input_dict[key] = value + results = self._infer(input_dict) + if infer_data is None: + infer_data = [[x] for x in results] + num_infer_data = len(results) + else: + for i in range(num_infer_data): + infer_data[i].append(results[i]) 
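+        # Stitch the per-batch outputs back together so each output tensor covers all samples.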
+ for i in range(num_infer_data): + infer_data[i] = np.concatenate(infer_data[i], axis=0) + return infer_data + + def preprocess(self, input_data: list): + text = [{"text_a": x} for x in input_data] + inputs = [self.template(x) for x in text] + return inputs + + def postprocess(self, infer_data): + preds = np.argmax(infer_data[0], axis=-1) + labels = [self.labels[x] for x in preds] + return {"label": labels} + + def printer(self, result, input_data): + label = result["label"] + for i in range(len(label)): + logger.info("input data: {}".format(input_data[i])) + logger.info("labels: {}".format(label[i])) + logger.info("-----------------------------") + + +if __name__ == "__main__": + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + predictor = MultiClassPredictor(args) + + text_dir = os.path.join(args.data_dir, "data.txt") + with open(text_dir, "r", encoding="utf-8") as f: + text_list = [x.strip() for x in f.readlines()] + + predictor.predict(text_list) diff --git a/applications/text_classification/multi_class/few-shot/requirements_cpu.txt b/applications/text_classification/multi_class/few-shot/requirements_cpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..bbe76e363f00631d66e0733833813cad5991f009 --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/requirements_cpu.txt @@ -0,0 +1,5 @@ +psutil +paddlepaddle>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime diff --git a/applications/text_classification/multi_class/few-shot/requirements_gpu.txt b/applications/text_classification/multi_class/few-shot/requirements_gpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..66454bd8b6b5fe08521215d4a5c2e7242225d869 --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/requirements_gpu.txt @@ -0,0 +1,7 @@ +psutil +paddlepaddle-gpu>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime-gpu +onnx +onnxconverter-common diff --git a/applications/text_classification/multi_class/few-shot/train.py b/applications/text_classification/multi_class/few-shot/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ac06587de017158f9388fb637657c12285cf94ce --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/train.py @@ -0,0 +1,132 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
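+
+# Example invocation (illustrative; it mirrors the single-GPU command documented in this
+# directory's README, which also lists the remaining prompt-tuning arguments):
+#   python train.py --device gpu --data_dir ./data --output_dir ./checkpoints/ \
+#       --prompt "这条新闻标题的主题是" --max_seq_length 128 \
+#       --learning_rate 3e-6 --ppt_learning_rate 3e-5 \
+#       --do_train --do_eval --do_predict --do_export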
+ +import os +from collections import defaultdict +from dataclasses import dataclass, field + +import paddle +from paddle.metric import Accuracy +from utils import load_local_dataset + +from paddlenlp.prompt import ( + AutoTemplate, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + SoftVerbalizer, +) +from paddlenlp.trainer import EarlyStoppingCallback, PdArgumentParser +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + data_dir: str = field(default="./data/", metadata={"help": "Path to a dataset which includes train.txt, dev.txt, test.txt, label.txt and data.txt (optional)."}) + prompt: str = field(default=None, metadata={"help": "The input prompt for tuning."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-3.0-base-zh", metadata={"help": "Build-in pretrained model name or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Load the pretrained language model. + model = AutoModelForMaskedLM.from_pretrained(model_args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # Define the template for preprocess and the verbalizer for postprocess. + template = AutoTemplate.create_from(data_args.prompt, tokenizer, training_args.max_seq_length, model=model) + logger.info("Using template: {}".format(template.prompt)) + + label_file = os.path.join(data_args.data_dir, "label.txt") + with open(label_file, "r", encoding="utf-8") as fp: + label_words = defaultdict(list) + for line in fp: + data = line.strip().split("==") + word = data[1] if len(data) > 1 else data[0].split("##")[-1] + label_words[data[0]].append(word) + verbalizer = SoftVerbalizer(label_words, tokenizer, model) + + # Load the few-shot datasets. + train_ds, dev_ds, test_ds = load_local_dataset( + data_path=data_args.data_dir, splits=["train", "dev", "test"], label_list=verbalizer.labels_to_ids + ) + + # Define the criterion. + criterion = paddle.nn.CrossEntropyLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds): + metric = Accuracy() + correct = metric.compute(paddle.to_tensor(eval_preds.predictions), paddle.to_tensor(eval_preds.label_ids)) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + # Deine the early-stopping callback. + callbacks = [EarlyStoppingCallback(early_stopping_patience=4, early_stopping_threshold=0.0)] + + # Initialize the trainer. + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=callbacks, + compute_metrics=compute_metrics, + ) + + # Traininig. 
+ if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=None) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Prediction. + if training_args.do_predict: + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Export static model. + if training_args.do_export: + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/multi_class/few-shot/utils.py b/applications/text_classification/multi_class/few-shot/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..8a92a9a697eeeeac130b4d6354836471df7a76fd --- /dev/null +++ b/applications/text_classification/multi_class/few-shot/utils.py @@ -0,0 +1,53 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +from paddlenlp.datasets import load_dataset + + +def load_local_dataset(data_path, splits, label_list): + """ + Read datasets from files. + + Args: + data_path (str): + Path to the dataset directory, including label.txt, train.txt, + dev.txt, test.txt (and data.txt). + splits (list): + Which file(s) to load, such as ['train', 'dev', 'test']. + label_list(dict): + A dictionary to encode labels as ids, which should be compatible + with that of verbalizer. + """ + + def _reader(data_file, label_list): + with open(data_file, "r", encoding="utf-8") as fp: + for idx, line in enumerate(fp): + data = line.strip().split("\t") + if len(data) == 1: + yield {"text_a": data[0]} + else: + text, label = data + yield {"text_a": text, "labels": label_list[label]} + + assert isinstance(splits, list) and len(splits) > 0 + + split_map = {"train": "train.txt", "dev": "dev.txt", "test": "test.txt"} + + dataset = [] + for split in splits: + data_file = os.path.join(data_path, split_map[split]) + dataset.append(load_dataset(_reader, data_file=data_file, label_list=label_list, lazy=False)) + return dataset diff --git a/applications/text_classification/multi_class/retrieval_based/README.md b/applications/text_classification/multi_class/retrieval_based/README.md new file mode 100644 index 0000000000000000000000000000000000000000..10602a07b356e37c6c2d8e22929f06252b9ef2e6 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/README.md @@ -0,0 +1,457 @@ +# 基于检索的文本分类方法 + + **目录** + +* [1. 基于语义索引的分类任务介绍](#基于语义索引的分类任务介绍) +* [2. 代码结构说明](#代码结构说明) +* [3. 环境准备](#环境准备) +* [4. 数据准备](#数据准备) +* [5. 模型训练](#模型训练) +* [6. 模型预测](#模型预测) +* [7. 模型部署](#模型部署) +* [8. 
分类流程](#分类流程) + + + +# 1.基于语义索引的分类任务介绍 + +以前的分类任务中,标签信息作为无实际意义,独立存在的one-hot编码形式存在,这种做法会潜在的丢失标签的语义信息,本方案把文本分类任务中的标签信息转换成含有语义信息的语义向量,将文本分类任务转换成向量检索和匹配的任务。这样做的好处是对于一些类别标签不是很固定的场景,或者需要经常有一些新增类别的需求的情况非常合适。另外,对于一些新的相关的分类任务,这种方法也不需要模型重新学习或者设计一种新的模型结构来适应新的任务。总的来说,这种基于检索的文本分类方法能够有很好的拓展性,能够利用标签里面包含的语义信息,不需要重新进行学习。这种方法可以应用到相似标签推荐,文本标签标注,金融风险事件分类,政务信访分类等领域。 + +本方案是基于语义索引模型的分类,语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。基于语义索引的分类方法有两种,第一种方法是直接把标签变成召回库,即把输入文本和标签的文本进行匹配,第二种是利用召回的文本带有类别标签,把召回文本的类别标签作为给定输入文本的类别。本方案使用双塔模型,训练阶段引入In-batch Negatives 策略,使用hnswlib建立索引库,并把标签作为召回库,进行召回测试。最后利用召回的结果使用 Accuracy 指标来评估语义索引模型的分类的效果。 + + + + +## 2. 代码结构说明 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— base_model.py # 语义索引模型基类 +|—— train.py # In-batch Negatives 策略的训练主脚本 +|—— model.py # In-batch Negatives 策略核心网络结构 + +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— export_model.py # 动态图转换成静态图 +|—— export_to_serving.py # 静态图转 Serving +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— predict.sh # 预测 bash 版本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train.sh # 训练 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 + |—— run.sh # 构建Milvus向量的 bash 版本 +|—— utils + ├── config.py # Milvus 的配置文件 + ├── feature_extract.py # 向量抽取文件 + ├── milvus_util.py # Milvus 的配置文件 +|—— deploy + |—— python + |—— predict.py # PaddleInference + |—— deploy.sh # Paddle Inference 部署脚本 + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 + +``` + + + +## 3. 环境准备 + +推荐使用GPU进行训练,在预测阶段使用CPU或者GPU均可。 + +**环境依赖** +* python >= 3.6.2 +* paddlepaddle >= 2.3.1 +* paddlenlp >= 2.3.4 +* hnswlib >= 0.5.2 +* visualdl >= 2.2.2 + +``` +pip install -r requirements.txt +``` + + + +## 4. 数据准备 + +训练需要准备指定格式的本地数据集,如果没有已标注的数据集,可以参考[文本分类任务doccano数据标注使用指南](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/doccano.md)进行文本分类数据标注。 + +**指定格式本地数据集目录结构** + +``` +├── data # 数据集目录 + ├── label.txt # 标签集 + ├── dev.txt # 验证集 + ├── train.txt # 训练集 +``` + +**训练、开发、测试数据集** + +train.txt(训练数据集文件), dev.txt(开发数据集文件),test.txt(可选,测试数据集文件),文件中文本与标签类别名用tab符`'\t'`分隔开。训练集指用于训练模型的数据;开发集指用于评测模型表现的数据,可以根据模型在开发集上的精度调整训练参数和模型;测试集用于测试模型表现,没有测试集时可以使用开发集代替。 + +- train.txt/dev.txt/test.txt 文件格式: +```text +<文本>'\t'<标签> +<文本>'\t'<标签> +... +``` +- train.txt/dev.txt/test.txt 文件样例: +```text +青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的 青岛 +谈谈去西安旅游,哪些地方让你觉得不虚此行? 旅行 +上古卷轴5有哪些奇葩玩法? 单机游戏 +... +``` +**分类标签** + +label.txt(分类标签文件)记录数据集中所有标签集合,每一行为一个标签名。 +- label.txt 文件格式: +```text +<标签> +<标签> +... +``` +- label.txt 文件样例: +```text +上海 +恋爱 +动画电影(Animated film) +狮 +生活 +汽车 +... +``` + + + +## 5. 
模型训练
+
+我们使用百科知识问答的数据来构建训练集、开发集。
+
+**训练集(train.txt)** 和 **开发集(dev.txt)** 格式一致,训练集61510条,开发集6835条,每行由文本的标题、内容和类别标签组成,以tab符分割,第一列是问题的标题和描述拼接,剩下的列是问题的类别,另外,标签有5981个。
+**召回库(label.txt)** 召回库的构建有2种方式,第一种是把所有的类别标签当成召回库,第二种是把训练集当成召回集合,我们以第一种为例。
+
+数据集选择的是百科问答数据集的一个子集,问答数据集详情请参考[nlp_chinese_corpus](https://github.com/brightmart/nlp_chinese_corpus)。
+
+- [webtext2019zh_topic](https://paddlenlp.bj.bcebos.com/applications/webtext2019zh_qa.zip)
+
+```
+wget https://paddlenlp.bj.bcebos.com/applications/webtext2019zh_qa.zip
+unzip webtext2019zh_qa.zip
+```
+
+### 单机单卡训练/单机多卡训练
+
+这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1 卡;如果采用单机单卡训练,只需要把`--gpus`参数设置成单卡的卡号即可。
+
+如果使用CPU进行训练,则需要把`--gpus`参数去除,然后把`device`设置成cpu即可,详细请参考train.sh文件的训练设置。
+
+然后运行下面的命令使用GPU训练,得到语义索引模型:
+
+```
+root_path=inbatch
+data_path=data
+python -u -m paddle.distributed.launch --gpus "0,1" \
+    train.py \
+    --device gpu \
+    --save_dir ./checkpoints/${root_path} \
+    --batch_size 24 \
+    --learning_rate 5E-5 \
+    --epochs 100 \
+    --output_emb_size 0 \
+    --save_steps 50 \
+    --max_seq_length 384 \
+    --warmup_proportion 0.0 \
+    --margin 0.2 \
+    --recall_result_dir "recall_result_dir" \
+    --recall_result_file "recall_result.txt" \
+    --train_set_file ${data_path}/train.txt \
+    --corpus_file ${data_path}/label.txt \
+    --similar_text_pair_file ${data_path}/dev.txt \
+    --evaluate True
+```
+
+参数含义说明
+
+* `device`: 使用 cpu/gpu 进行训练
+* `save_dir`: 模型存储路径
+* `batch_size`: 训练的batch size的大小
+* `learning_rate`: 训练的学习率的大小
+* `epochs`: 训练的epoch数
+* `output_emb_size`: Transformer 顶层输出的文本向量维度
+* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数
+* `max_seq_length`: 输入序列的最大长度
+* `margin`: 正样本相似度与负样本之间的目标 Gap
+* `train_set_file`: 训练集文件
+* `evaluate`: 是否开启边训练边评估模型训练效果,默认开启
+* `recall_result_dir`: 召回结果存储目录
+* `recall_result_file`: 召回结果的文件名
+* `hnsw_m`: hnsw 算法相关参数,保持默认即可
+* `hnsw_ef`: hnsw 算法相关参数,保持默认即可
+* `recall_num`: 对 1 个文本召回的相似文本数量
+* `similar_text_pair`: 由相似文本对构成的评估集
+* `corpus_file`: 召回库数据 corpus_file
+
+也可以使用bash脚本:
+
+```
+sh scripts/train.sh
+```
+
+
+
+## 6. 模型预测
+
+我们可以基于语义索引模型计算文本和标签的语义相似度。
+
+
+### 开始预测
+
+加载训练的语义索引模型,然后计算文本和标签的语义相似度:
+
+```
+root_dir="checkpoints/inbatch/model_best"
+python -u -m paddle.distributed.launch --gpus "0" \
+    predict.py \
+    --device gpu \
+    --params_path "${root_dir}/model_state.pdparams" \
+    --model_name_or_path rocketqa-zh-dureader-query-encoder \
+    --output_emb_size 0 \
+    --batch_size 128 \
+    --max_seq_length 384 \
+    --text_pair_file "data/dev.txt"
+```
+
+参数含义说明
+* `device`: 使用 cpu/gpu 进行训练
+* `params_path`: 预训练模型的参数文件名
+* `output_emb_size`: Transformer 顶层输出的文本向量维度
+* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化。
+* `text_pair_file`: 由文本 Pair 构成的待预测数据集
+
+也可以运行下面的bash脚本:
+
+```
+sh scripts/predict.sh
+```
+predict.sh文件包含了cpu和gpu运行的脚本,默认是gpu运行的脚本。
+
+产出如下结果:
+```
+0.8841502070426941
+0.7834227681159973
+0.04591505229473114
+0.15116563439369202
+......
+```
+
+
+
+## 7. 模型部署
+
+### 动转静导出
+
+首先把动态图模型转换为静态图:
+
+```
+python export_model.py \
+    --params_path checkpoints/inbatch/model_best/model_state.pdparams \
+    --model_name_or_path rocketqa-zh-dureader-query-encoder \
+    --output_path=./output
+```
+也可以运行下面的bash脚本:
+
+```
+sh scripts/export_model.sh
+```
+
+### Paddle Inference预测
+
+预测既可以抽取向量,也可以计算两个文本的相似度。
+
+修改deploy/python/predict.py中的id2corpus和corpus_list的样本:
+
+```
+# 抽取向量
+id2corpus = {
+    0: {
+        "sentence":
+        "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"
+    }
+}
+# 计算文本和类别的相似度
+corpus_list = [{
+    "sentence":
+    "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的?",
+    'label': '青岛'
+}, {
+    "sentence":
+    "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的",
+    'label': '单机游戏'
+}]
+
+```
+
+然后使用 Paddle Inference 进行预测:
+
+```
+python deploy/python/predict.py --model_dir=./output
+```
+也可以运行下面的bash脚本:
+
+```
+sh deploy.sh
+```
+最终输出的是768维的特征向量和句子对的预测概率:
+
+```
+(1, 768)
+(1, 768)
+[[-0.02010613  0.01188739  0.02152571  0.055634   -0.05920463 -0.06134057
+   0.05532542 -0.01561351  0.02738068  0.05340409  0.06015684 -0.01133287
+  ....
+
+[0.6114088296890259, 0.08313259482383728]
+```
+
+### 向量引擎
+
+模型准备结束以后,开始搭建 Milvus 的向量检索引擎,用于文本语义向量的快速检索。本项目使用[Milvus](https://milvus.io/)开源工具进行向量检索,Milvus 的搭建教程请参考 [Milvus官方安装教程](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md)。本案例使用的是 Milvus 的1.1.1 CPU版本,建议使用官方的 Docker 安装方式,简单快捷。
+
+
+Milvus 系统搭建完以后就可以插入和检索向量了。首先生成 embedding 向量,每个样本生成768维度的向量:
+
+```
+CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \
+    --data_name label \
+    --model_dir ./output \
+    --output_dir data \
+    --corpus_file "./data/label.txt"
+```
+其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。
+
+然后向搭建好的 Milvus 系统插入向量:
+
+```
+python utils/vector_insert.py \
+    --vector_path ./data/label_embedding.npy
+```
+也可以直接运行:
+
+```bash
+sh scripts/run.sh
+```
+
+### Paddle Serving部署
+
+Paddle Serving 的详细文档请参考 [Pipeline_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Python_Pipeline/Pipeline_Design_CN.md)和[Serving_Design](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Serving_Design_CN.md),首先把静态图模型转换成Serving的格式:
+
+```
+python export_to_serving.py \
+    --dirname "output" \
+    --model_filename "inference.get_pooled_embedding.pdmodel" \
+    --params_filename "inference.get_pooled_embedding.pdiparams" \
+    --server_path "./serving_server" \
+    --client_path "./serving_client" \
+    --fetch_alias_names "output_embedding"
+```
+
+参数含义说明
+* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。
+* `model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名
+* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None
+* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server
+* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client
+* `fetch_alias_names`: 模型输出的别名设置,比如输出的 output_embedding 等,都可以重新指定成其他名字,默认不指定
+* `feed_alias_names`: 模型输入的别名设置,比如输入的 input_ids 等,都可以重新指定成其他名字,默认不指定
+
+也可以运行下面的 bash 脚本:
+```
+sh scripts/export_to_serving.sh
+```
+
+Paddle Serving的部署有两种方式,第一种方式是Pipeline的方式,第二种是C++的方式,下面分别介绍这两种方式的用法:
+
+#### Pipeline方式
+
+启动 Pipeline Server:
+
+```
+cd deploy/python/
+python web_service.py
+```
+
+启动客户端调用 Server,可以使用 POST 或 rpc 的方式。
+
+向服务端发送 POST 请求示例:
+
+```
+curl -X POST -k http://localhost:8090/ernie/prediction -d '{"key": ["0"], "value": ["{\"sentence\": \"青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的\"}"]}'
+```
+
+也可以使用 rpc 的方式:
+首先修改rpc_client.py中需要预测的样本:
+
+```
+list_data = [{
+    "sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"
+}]
+```
+然后运行:
+
+```
+python rpc_client.py
+```
+模型的输出为:
+
+```
+PipelineClient::predict pack_data time:1658988633.3673246
+PipelineClient::predict before time:1658988633.3678396 +time to cost :0.014188766479492188 seconds +['output_embedding'] +(1, 768) +[[-0.06491912 -0.0133915 0.00937684 0.01285653 -0.02468005 0.03528611 + 0.0623698 -0.06062918 0.02238894 -0.05348937 0.02161925 0.04480227 + ...... +``` + +可以看到客户端发送了1条文本,返回这个 embedding 向量 + + + +## 8. 分类流程 + +基于检索的分类系统使用了Client Server的模式,即抽取向量的模型部署在服务端,然后启动客户端(Client)端去访问。 + +``` +python run_system.py +``` +代码内置的测试用例为: + +``` +list_data = [{"sentence": "谈谈去西安旅游,哪些地方让你觉得不虚此行?"}] +``` +会输出如下的结果: + +``` +...... +PipelineClient::predict pack_data time:1658988661.507715 +PipelineClient::predict before time:1658988661.5081818 +Extract feature time to cost :0.02322244644165039 seconds +Search milvus time cost is 0.06801486015319824 seconds +{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 旅行 0.3969537019729614 +{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 西安 0.7750667333602905 +{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 陕西 0.8064634799957275 +{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 火车上 0.8384211659431458 +{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 山西 0.9251932501792908 +..... +``` +输出的结果包括特征提取和检索的时间,还包含检索出来文本和对应的标签,通过设定阈值等方式可以得到最终的标签。 + +## Reference + +[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. diff --git a/applications/text_classification/multi_class/retrieval_based/base_model.py b/applications/text_classification/multi_class/retrieval_based/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..56aa3ba50e189281c35d41e8819014f56d8e53f4 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/base_model.py @@ -0,0 +1,153 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
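+
+# The classes below implement the shared dual-tower encoder used by the retrieval-based
+# classifier: pool the [CLS] vector, optionally project it to `output_emb_size`, L2-normalize
+# it, and score a query against a title/label text by cosine similarity.
+# Minimal usage sketch (illustrative only; the training and prediction scripts in this
+# directory build on these base classes):
+#   pretrained = AutoModel.from_pretrained("rocketqa-zh-dureader-query-encoder")
+#   encoder = SemanticIndexBaseStatic(pretrained, output_emb_size=0)  # 0 keeps the raw 768-dim vector
+#   embedding = encoder.get_pooled_embedding(input_ids, token_type_ids)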
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + 
input_ids, token_type_ids = batch_data + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding diff --git a/applications/text_classification/multi_class/retrieval_based/data.py b/applications/text_classification/multi_class/retrieval_based/data.py new file mode 100644 index 0000000000000000000000000000000000000000..17081543edaa967725453ecec99e760bb9f94f90 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/data.py @@ -0,0 +1,224 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import hnswlib +import numpy as np +import paddle +from paddlenlp.utils.log import logger + + +def build_index(corpus_data_loader, model, output_emb_size, hnsw_max_elements, hnsw_ef, hnsw_m): + + index = hnswlib.Index(space="ip", dim=output_emb_size if output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=hnsw_max_elements, ef_construction=hnsw_ef, M=hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + logger.info("start build index..........") + all_embeddings = [] + for text_embeddings in model.get_semantic_embedding(corpus_data_loader): + all_embeddings.append(text_embeddings.numpy()) + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + logger.info("Total index number:{}".format(index.get_current_count())) + return index + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_corpus_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. 
+ token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_label_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + yield {"sentence": data[0], "label": data[1]} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succeed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using lateset ann_data_file:{}".format(latest_ann_data_file)) + return 
latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + splited_line = line.rstrip().split("\t") + text, similar_text = splited_line[0], ",".join(splited_line[1:]) + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/applications/text_classification/multi_class/retrieval_based/deploy/python/config_nlp.yml b/applications/text_classification/multi_class/retrieval_based/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..236c3802002e80075ab55ed14461b0bde9fd545c --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8090 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + #ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '2' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/text_classification/multi_class/retrieval_based/deploy/python/deploy.sh b/applications/text_classification/multi_class/retrieval_based/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..fe8f071e0a47a47f5dc24d84ea4eaaf8e7503c06 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/deploy/python/deploy.sh @@ -0,0 +1 @@ +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/applications/text_classification/multi_class/retrieval_based/deploy/python/predict.py b/applications/text_classification/multi_class/retrieval_based/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..a94045fcb31ff04582406cb481389af358a06cb6 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/deploy/python/predict.py @@ -0,0 +1,253 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import paddle +from paddle import inference +from scipy import spatial + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +sys.path.append(".") + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_query_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. 
+ tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + encoded_inputs = tokenizer( + text=example["sentence"], + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len, + truncation_strategy="longest_first", + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def extract_embedding(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the feature vectors. 
+ """ + examples = [] + for idx, text in data.items(): + print(text) + input_ids, segment_ids = convert_query_example(text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + return logits + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions probs. + """ + + examples = [] + for idx, text in enumerate(data): + input_ids, segment_ids, title_ids, title_segment_ids = convert_example(text, tokenizer) + + examples.append((input_ids, segment_ids, title_ids, title_segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(query_ids) + self.input_handles[1].copy_from_cpu(query_segment_ids) + self.predictor.run() + query_logits = self.output_handle.copy_to_cpu() + + self.input_handles[0].copy_from_cpu(title_ids) + self.input_handles[1].copy_from_cpu(title_segment_ids) + self.predictor.run() + title_logits = self.output_handle.copy_to_cpu() + + result = [float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)] + return result + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + output_emb_size = 256 + tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-dureader-query-encoder") + id2corpus = {0: {"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"}} + res = predictor.extract_embedding(id2corpus, tokenizer) + print(res.shape) + print(res) + corpus_list = [ + {"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的?", "label": "青岛"}, + {"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的", "label": "单机游戏"}, + ] + res = predictor.predict(corpus_list, tokenizer) + print(res) diff --git a/applications/text_classification/multi_class/retrieval_based/deploy/python/rpc_client.py b/applications/text_classification/multi_class/retrieval_based/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..afb13b803f65fb995b5c33a5ab5a069bee030717 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/deploy/python/rpc_client.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time +import numpy as np + +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = [{"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"}] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = str(item) + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/text_classification/multi_class/retrieval_based/deploy/python/web_service.py b/applications/text_classification/multi_class/retrieval_based/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..df054797d51ec195c6f23ad1c144aa4f6aed43d1 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/deploy/python/web_service.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
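+
+# Starts the Paddle Serving pipeline service for embedding extraction (run from deploy/python/
+# after exporting the serving model). The HTTP/RPC ports and model path come from config_nlp.yml.
+# Illustrative request once the service is up (same example as in the README):
+#   curl -X POST -k http://localhost:8090/ernie/prediction \
+#       -d '{"key": ["0"], "value": ["{\"sentence\": \"青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的\"}"]}'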
+ +from paddle_serving_server.web_service import Op, WebService + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer( + text=text["sentence"], max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + model_name_or_path = "rocketqa-zh-dureader-query-encoder" + self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + example = eval(input_dict[str(i)]) + input_ids, segment_ids = convert_example([example], self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/text_classification/multi_class/retrieval_based/evaluate.py b/applications/text_classification/multi_class/retrieval_based/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..bff2cfb814aa5c25442b2bda9f13561f72036cae --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/evaluate.py @@ -0,0 +1,83 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
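+
+# Computes Recall@N from the recall results. The recall result file is read in
+# blocks of --recall_num lines (one block per query); a recalled item counts as
+# a hit when its label equals the ground-truth label of that query taken from
+# --similar_text_pair. Recall@{1, 5, 10, 20, 50} is printed and appended, with a
+# timestamp, to result.tsv.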
+
+import argparse
+import time
+
+import numpy as np
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--similar_text_pair", type=str, default="", help="The full path of similar pair file")
+parser.add_argument("--recall_result_file", type=str, default="", help="The full path of recall result file")
+parser.add_argument(
+    "--recall_num", type=int, default=10, help="Most similar number of doc recalled from corpus per query"
+)
+args = parser.parse_args()
+
+
+def recall(rs, N=10):
+    """
+    Ratio of queries whose ground truth appears in the top-N recalled docs.
+    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
+    >>> recall(rs, N=1)
+    0.333333
+    >>> recall(rs, N=2)
+    0.6666667
+    >>> recall(rs, N=3)
+    1.0
+    Args:
+        rs: Iterator of per-query relevance flags (1 if the ground truth is recalled at that rank, else 0)
+    Returns:
+        Recall@N
+    """
+
+    recall_flags = [np.sum(r[0:N]) for r in rs]
+    return np.mean(recall_flags)
+
+
+if __name__ == "__main__":
+    text2similar = {}
+    with open(args.similar_text_pair, "r", encoding="utf-8") as f:
+        for line in f:
+            text, similar_text = line.rstrip().rsplit("\t", 1)
+            text2similar[text] = similar_text
+
+    rs = []
+    with open(args.recall_result_file, "r", encoding="utf-8") as f:
+        relevance_labels = []
+        for index, line in enumerate(f):
+
+            if index % args.recall_num == 0 and index != 0:
+                rs.append(relevance_labels)
+                relevance_labels = []
+            text_arr = line.rstrip().split("\t")
+            text_title, text_para, recalled_title, recalled_para, label, cosine_sim = text_arr
+            if text2similar["\t".join([text_title, text_para])] == label:
+                relevance_labels.append(1)
+            else:
+                relevance_labels.append(0)
+
+    recall_N = []
+    recall_num = [1, 5, 10, 20, 50]
+    for topN in recall_num:
+        R = round(100 * recall(rs, N=topN), 3)
+        recall_N.append(str(R))
+    result = open("result.tsv", "a")
+    res = []
+    timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime())
+    res.append(timestamp)
+    for key, val in zip(recall_num, recall_N):
+        print("recall@{}={}".format(key, val))
+        res.append(str(val))
+    result.write("\t".join(res) + "\n")
diff --git a/applications/text_classification/multi_class/retrieval_based/export_model.py b/applications/text_classification/multi_class/retrieval_based/export_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac06c79a8f971e5cdbeede11c99c9f16d6e59520
--- /dev/null
+++ b/applications/text_classification/multi_class/retrieval_based/export_model.py
@@ -0,0 +1,54 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
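+
+# Exports the trained dynamic-graph checkpoint (--params_path) to a static
+# inference graph via paddle.jit.to_static / paddle.jit.save so it can be used
+# by the deployment and feature-extraction scripts. See scripts/export_model.sh
+# for a reference command.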
+ +import argparse +import os + +import paddle +from base_model import SemanticIndexBaseStatic + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--output_emb_size", default=0, type=int, help="output_embedding_size") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/text_classification/multi_class/retrieval_based/export_to_serving.py b/applications/text_classification/multi_class/retrieval_based/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..1ba681a4dfb14a43a5f91fa9c4cf632b4e6e827e --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/export_to_serving.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. 
It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. Default: None.")
+parser.add_argument("--server_path", type=str, default='./serving_server',
+                    help="The path of server parameter in static graph to be saved.")
+parser.add_argument("--client_path", type=str, default='./serving_client',
+                    help="The path of client parameter in static graph to be saved.")
+parser.add_argument("--feed_alias_names", type=str, default=None,
+                    help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars')
+parser.add_argument("--fetch_alias_names", type=str, default=None,
+                    help='set alias names for fetch vars, split by comma \',\', you should run --show_proto to check the number of fetch vars')
+parser.add_argument("--show_proto", type=bool, default=False,
+                    help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.')
+# yapf: enable
+
+if __name__ == "__main__":
+    args = parser.parse_args()
+    serving_io.inference_model_to_serving(
+        dirname=args.dirname,
+        serving_server=args.server_path,
+        serving_client=args.client_path,
+        model_filename=args.model_filename,
+        params_filename=args.params_filename,
+        show_proto=args.show_proto,
+        feed_alias_names=args.feed_alias_names,
+        fetch_alias_names=args.fetch_alias_names,
+    )
diff --git a/applications/text_classification/multi_class/retrieval_based/model.py b/applications/text_classification/multi_class/retrieval_based/model.py
new file mode 100644
index 0000000000000000000000000000000000000000..25570c8dfa1725213a9ff0b463bde4fcabbd3ab9
--- /dev/null
+++ b/applications/text_classification/multi_class/retrieval_based/model.py
@@ -0,0 +1,65 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
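+
+# In-batch negatives training objective: for a batch of (query, title) pairs the
+# model embeds both sides, builds the [batch_size, batch_size] cosine similarity
+# matrix with a matmul, subtracts `margin` from the diagonal (the positive
+# pairs), multiplies by `scale`, and minimizes a cross-entropy loss whose target
+# for row i is column i, so every other title in the batch serves as a negative
+# sample for query i.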
+ +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexBatchNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # subtract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/applications/text_classification/multi_class/retrieval_based/predict.py b/applications/text_classification/multi_class/retrieval_based/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..906fdf3519c3aff341c81bcb0bb47f4245b91daf --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/predict.py @@ -0,0 +1,100 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max seq length.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + cosine_sims = [] + model.eval() + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + cosine_sims.append(batch_cosine_sim) + cosine_sims = np.concatenate(cosine_sims, axis=0) + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + cosin_sim = predict(model, valid_data_loader) + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) + if idx > 5: + break diff --git a/applications/text_classification/multi_class/retrieval_based/recall.py b/applications/text_classification/multi_class/retrieval_based/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..1d6f49ae9f7c92156fdc5d8d0c338e2339221072 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/recall.py @@ -0,0 +1,113 @@ +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from base_model import SemanticIndexBase +from data import ( + build_index, + convert_corpus_example, + create_dataloader, + gen_id2corpus, + gen_text_file, +) + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(convert_corpus_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + id2corpus = gen_id2corpus(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + # Need better way to get inner model of DataParallel + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = 
len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/text_classification/multi_class/retrieval_based/requirements.txt b/applications/text_classification/multi_class/retrieval_based/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..6657b02e7b0c9a430659394b3398f575cae4ea91 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/requirements.txt @@ -0,0 +1,11 @@ +pymilvus==1.1.2 +pandas==0.25.1 +paddlenlp>=2.3.4 +paddlepaddle-gpu>=2.3.0 +hnswlib>=0.5.2 +numpy>=1.17.2 +visualdl>=2.2.2 +paddle-serving-app>=0.7.0 +paddle-serving-client>=0.7.0 +paddle-serving-server-gpu>=0.7.0.post102 +pybind11 \ No newline at end of file diff --git a/applications/text_classification/multi_class/retrieval_based/run_system.py b/applications/text_classification/multi_class/retrieval_based/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..27e71c6ecc9865e3665af7a065f8e55076119e0b --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/run_system.py @@ -0,0 +1,63 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
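+
+# End-to-end retrieval demo: the query sentence is sent to the Paddle Serving
+# pipeline at 127.0.0.1:8080 to obtain its embedding, the embedding is used to
+# search the Milvus collection / partition configured in utils/config.py, and
+# the recalled labels are printed sorted by ascending distance.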
+ +import sys +import time + +import numpy as np +import pandas as pd +from data import gen_id2corpus +from paddle_serving_server.pipeline import PipelineClient + +sys.path.append("utils") +from utils.config import collection_name, partition_tag # noqa: E402 +from utils.milvus_util import RecallByMilvus # noqa: E402 + + +def search_in_milvus(text_embedding, corpus_file, query_text): + client = RecallByMilvus() + start_time = time.time() + status, results = client.search( + collection_name=collection_name, vectors=text_embedding, partition_tag=partition_tag + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + id2corpus = gen_id2corpus(corpus_file) + list_data = [] + for line in results: + for item in line: + idx = item.id + distance = item.distance + text = id2corpus[idx] + list_data.append([query_text, text, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "label", "distance"]) + df = df.sort_values(by="distance", ascending=True) + for index, row in df.iterrows(): + print(row["query_text"], row["label"], row["distance"]) + + +if __name__ == "__main__": + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + corpus_file = "data/label.txt" + list_data = [{"sentence": "谈谈去西安旅游,哪些地方让你觉得不虚此行?"}] + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = str(item) + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + search_in_milvus(result, corpus_file, list_data[0]) diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/evaluate.sh b/applications/text_classification/multi_class/retrieval_based/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..2da2c025cc7447965b676fb73ec2d548b8e342c2 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/evaluate.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -u evaluate.py \ + --similar_text_pair "data/dev.txt" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 \ No newline at end of file diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/export_model.sh b/applications/text_classification/multi_class/retrieval_based/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..188e3a9bdf383e40f36ba3c7c5bb015ad6cdcddd --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/export_model.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py \ + --params_path checkpoints/inbatch/model_best/model_state.pdparams \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --output_path=./output diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/export_to_serving.sh b/applications/text_classification/multi_class/retrieval_based/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..7a7337b40b7a7c2d652ce2a837562eaceeba0531 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/export_to_serving.sh @@ -0,0 +1,21 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/predict.sh b/applications/text_classification/multi_class/retrieval_based/scripts/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..b5a14d480ae64554125e86462f3632bdbf3a09bd --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/predict.sh @@ -0,0 +1,38 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
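+
+# Computes cosine similarities for the text pairs in data/dev.txt with the best
+# checkpoint (the script prints the first few scores). The GPU command is
+# active below; an equivalent CPU (gloo) variant is kept commented out.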
+ +# gpu version +root_dir="checkpoints/inbatch/model_best" +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --output_emb_size 0 \ + --batch_size 128 \ + --max_seq_length 384 \ + --text_pair_file "data/dev.txt" + + +# cpu +# root_dir="checkpoints/inbatch/model_best" +# python -m paddle.distributed.launch --nproc_per_node 8 --backend "gloo" \ +# predict.py \ +# --device cpu \ +# --params_path "${root_dir}/model_state.pdparams" \ +# --output_emb_size 0 \ +# --model_name_or_path rocketqa-zh-dureader-query-encoder \ +# --batch_size 128 \ +# --max_seq_length 384 \ +# --text_pair_file "data/dev.txt" diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/run.sh b/applications/text_classification/multi_class/retrieval_based/scripts/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..c4c990729c26e0c9fd00e4420ebe1810abd00984 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/run.sh @@ -0,0 +1,22 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \ + --data_name label \ + --model_dir ./output \ + --output_dir data \ + --corpus_file "./data/label.txt" + +python utils/vector_insert.py \ + --vector_path ./data/label_embedding.npy \ No newline at end of file diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/run_build_index.sh b/applications/text_classification/multi_class/retrieval_based/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..7d75a8daad62a9f2ca482c354c7bad54788862c8 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/run_build_index.sh @@ -0,0 +1,31 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
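+
+# Builds an HNSW (hnswlib) index over data/train.txt with the trained encoder
+# and recalls the top 50 candidates for every query in data/dev.txt; results
+# are written to recall_result_dir/recall_result.txt for scripts/evaluate.sh.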
+ +# GPU version +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "1" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_best/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 0 \ + --max_seq_length 384 \ + --recall_num 50 \ + --similar_text_pair "data/dev.txt" \ + --corpus_file "data/train.txt" \ No newline at end of file diff --git a/applications/text_classification/multi_class/retrieval_based/scripts/train.sh b/applications/text_classification/multi_class/retrieval_based/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..2cef4abcddac47f91a0d98ce701375d910879f27 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/scripts/train.sh @@ -0,0 +1,36 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU training +root_path=inbatch +data_path=data +python -u -m paddle.distributed.launch --gpus "0,1" \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 24 \ + --learning_rate 5E-5 \ + --epochs 100 \ + --output_emb_size 0 \ + --save_steps 50 \ + --max_seq_length 384 \ + --warmup_proportion 0.0 \ + --margin 0.2 \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --train_set_file ${data_path}/train.txt \ + --corpus_file ${data_path}/label.txt \ + --similar_text_pair_file ${data_path}/dev.txt \ + --evaluate True \ + --model_name_or_path rocketqa-zh-dureader-query-encoder diff --git a/applications/text_classification/multi_class/retrieval_based/train.py b/applications/text_classification/multi_class/retrieval_based/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e0e1cded718c1df6ce738c785dfc86a3d5fde568 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/train.py @@ -0,0 +1,253 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
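+
+# Trains the semantic index model with the in-batch negatives strategy defined
+# in model.py. With --evaluate True, the dev queries are recalled against the
+# label corpus after every epoch and the checkpoint with the best Recall@1 is
+# kept in <save_dir>/model_best; otherwise a checkpoint is written every
+# --save_steps steps. See scripts/train.sh for a reference command.
+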
+import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import ( + build_index, + convert_example, + create_dataloader, + gen_id2corpus, + gen_text_file, + read_text_pair, +) +from model import SemanticIndexBatchNeg + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=512, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=256, type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=5E-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="cpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Interval steps to save checkpoint") +parser.add_argument('--log_steps', type=int, default=10, help="Interval steps to print log") +parser.add_argument("--train_set_file", type=str, default='./data/train.txt', help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.2, type=float, help="Margin between pos_sample and neg_samples") +parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss") +parser.add_argument("--corpus_file", type=str, default='./data/label.txt', help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, default='./data/dev.txt', help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='./recall_result_dir', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_init.txt', help="The file name of recall result") +parser.add_argument("--recall_num", default=50, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--evaluate_result", type=str, default='evaluate_result.txt', help="evaluate_result") +parser.add_argument('--evaluate', default=True, type=eval, 
choices=[True, False], help='whether evaluate while training') +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def recall(rs, N=10): + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +@paddle.no_grad() +def evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus): + # Load pretrained semantic model + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) + text2similar = {} + with open(args.similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + text_arr = line.rstrip().rsplit("\t") + text, similar_text = text_arr[0], text_arr[1] + text2similar[text] = similar_text + rs = [] + with open(recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + text_arr = line.rstrip().rsplit("\t") + text, similar_text, cosine_sim = text_arr + if text2similar[text] == similar_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + recall_num = [1, 5, 10, 20] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + evaluate_result_file = os.path.join(args.recall_result_dir, args.evaluate_result) + result = open(evaluate_result_file, "a") + res = [] + timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime()) + res.append(timestamp) + for key, val in zip(recall_num, recall_N): + print("recall@{}={}".format(key, val)) + res.append(str(val)) + result.write("\t".join(res) + "\n") + return float(recall_N[0]) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(args.seed) + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # 
title_segment + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + model = SemanticIndexBatchNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + model = paddle.DataParallel(model) + batchify_fn_dev = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + if args.evaluate: + eval_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + id2corpus = gen_id2corpus(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=eval_func + ) + # convert_corpus_example + query_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + text_list, _ = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=query_func + ) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByNorm(clip_norm=1.0), + ) + global_step = 0 + best_recall = 0.0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + global_step += 1 + if global_step % args.log_steps == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if not args.evaluate and rank == 0: + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + if args.evaluate and rank == 0: + print("evaluating") + recall_5 = evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus) + if recall_5 > best_recall: + best_recall = recall_5 + save_dir = os.path.join(args.save_dir, "model_best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + with open(os.path.join(save_dir, "train_result.txt"), "a", encoding="utf-8") as fp: + fp.write("epoch=%d, global_step: %d, recall: %s\n" % (epoch, global_step, recall_5)) + + +if __name__ == "__main__": + do_train() diff --git a/applications/text_classification/multi_class/retrieval_based/utils/__init__.py b/applications/text_classification/multi_class/retrieval_based/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/utils/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
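+
+# Marks utils/ as a package so config.py and milvus_util.py can be imported by
+# the retrieval scripts.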
diff --git a/applications/text_classification/multi_class/retrieval_based/utils/config.py b/applications/text_classification/multi_class/retrieval_based/utils/config.py new file mode 100644 index 0000000000000000000000000000000000000000..0784ba7410aa69f265e511893ce08f74b97088a1 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/utils/config.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from milvus import IndexType, MetricType + +MILVUS_HOST = "10.21.226.173" +MILVUS_PORT = 8530 + +output_emb_size = 0 + +collection_param = { + "dimension": output_emb_size if output_emb_size > 0 else 768, + "index_file_size": 256, + "metric_type": MetricType.L2, +} + +index_type = IndexType.FLAT +index_param = {"nlist": 1000} + +top_k = 20 +search_param = {"nprobe": 20} + +collection_name = "text" +partition_tag = "partition_2" diff --git a/applications/text_classification/multi_class/retrieval_based/utils/feature_extract.py b/applications/text_classification/multi_class/retrieval_based/utils/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..171253b0d1bc1882ec165e7af640e00c8ee1ff68 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/utils/feature_extract.py @@ -0,0 +1,193 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +import paddlenlp as ppnlp +from paddlenlp.data import Pad, Tuple + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument("--output_dir", type=str, required=True, help="The output path.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.")
+parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
+parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
+parser.add_argument("--data_name", type=str, required=True, help="The dataset name.")
+parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.')
+parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.')
+parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training')
+parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.')
+parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.')
+args = parser.parse_args()
+# fmt: on
+
+
+def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False):
+    """
+    Builds model inputs from a sequence.
+
+    A BERT sequence has the following format:
+    - single sequence: ``[CLS] X [SEP]``
+    Args:
+        example(obj:`str`): The raw text to be converted to ids.
+        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
+            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
+        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization.
+            Sequences longer than this will be truncated, sequences shorter will be padded.
+        pad_to_max_seq_len(obj:`bool`, defaults to `False`): Whether to pad every sequence to `max_seq_length`.
+    Returns:
+        input_ids(obj:`list[int]`): The list of query token ids.
+        token_type_ids(obj:`list[int]`): The list of query token type (segment) ids.
+ """ + + result = [] + + encoded_inputs = tokenizer(text=example, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, data_name): + """ + Predicts the data labels. + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in tqdm(data.items()): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) >= self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("./{}/{}_embedding".format(args.output_dir, data_name), all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, line in enumerate(file.readlines()): + id2corpus[idx] = line.rstrip() + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + data_name = args.data_name + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = read_text(args.corpus_file) + predictor.predict(id2corpus, tokenizer, data_name) diff --git a/applications/text_classification/multi_class/retrieval_based/utils/milvus_util.py b/applications/text_classification/multi_class/retrieval_based/utils/milvus_util.py new file mode 100644 index 0000000000000000000000000000000000000000..e6b186c4fa480ab20b888c0cd1376624083da9b9 --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/utils/milvus_util.py @@ -0,0 +1,114 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
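+
+# Illustrative usage only (assumes MILVUS_HOST/MILVUS_PORT and the collection/index/search
+# parameters in config.py are properly configured; names below are examples):
+#   client = VecToMilvus()
+#   client.insert(vectors, collection_name="text_corpus", partition_tag="partition_1")
+#   searcher = RecallByMilvus()
+#   status, results = searcher.search(vectors, collection_name="text_corpus", partition_tag="partition_1")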
+ +from config import ( + MILVUS_HOST, + MILVUS_PORT, + collection_param, + index_param, + index_type, + search_param, + top_k, +) +from milvus import Milvus + + +class VecToMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def has_collection(self, collection_name): + try: + status, ok = self.client.has_collection(collection_name) + return ok + except Exception as e: + print("Milvus has_table error:", e) + + def creat_collection(self, collection_name): + try: + collection_param["collection_name"] = collection_name + status = self.client.create_collection(collection_param) + print(status) + return status + except Exception as e: + print("Milvus create collection error:", e) + + def create_index(self, collection_name): + try: + status = self.client.create_index(collection_name, index_type, index_param) + print(status) + return status + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, collection_name, partition_tag): + try: + status, ok = self.client.has_partition(collection_name, partition_tag) + return ok + except Exception as e: + print("Milvus has partition error: ", e) + + def delete_partition(self, collection_name, partition_tag): + try: + status = self.client.drop_collection(collection_name) + return status + except Exception as e: + print("Milvus has partition error: ", e) + + def create_partition(self, collection_name, partition_tag): + try: + status = self.client.create_partition(collection_name, partition_tag) + print("create partition {} successfully".format(partition_tag)) + return status + except Exception as e: + print("Milvus create partition error: ", e) + + def insert(self, vectors, collection_name, ids=None, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.creat_collection(collection_name) + self.create_index(collection_name) + print("collection info: {}".format(self.client.get_collection_info(collection_name)[1])) + if (partition_tag is not None) and (not self.has_partition(collection_name, partition_tag)): + self.create_partition(collection_name, partition_tag) + status, ids = self.client.insert( + collection_name=collection_name, records=vectors, ids=ids, partition_tag=partition_tag + ) + self.client.flush([collection_name]) + print( + "Insert {} entities, there are {} entities after insert data.".format( + len(ids), self.client.count_entities(collection_name)[1] + ) + ) + return status, ids + except Exception as e: + print("Milvus insert error:", e) + + +class RecallByMilvus: + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def search(self, vectors, collection_name, partition_tag=None): + try: + status, results = self.client.search( + collection_name=collection_name, + query_records=vectors, + top_k=top_k, + params=search_param, + partition_tag=partition_tag, + ) + return status, results + except Exception as e: + print("Milvus recall error: ", e) diff --git a/applications/text_classification/multi_class/retrieval_based/utils/vector_insert.py b/applications/text_classification/multi_class/retrieval_based/utils/vector_insert.py new file mode 100644 index 0000000000000000000000000000000000000000..19ad7628cfeeffd603e4792e5e789393665e8d5d --- /dev/null +++ b/applications/text_classification/multi_class/retrieval_based/utils/vector_insert.py @@ -0,0 +1,52 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np +from config import collection_name, partition_tag +from milvus_util import VecToMilvus +from tqdm import tqdm + +parser = argparse.ArgumentParser() +parser.add_argument("--vector_path", type=str, required=True, help="feature file path.") +args = parser.parse_args() + + +def vector_insert(file_path): + embeddings = np.load(file_path) + print(embeddings.shape) + embedding_ids = [i for i in range(embeddings.shape[0])] + print(len(embedding_ids)) + client = VecToMilvus() + + if client.has_partition(collection_name, partition_tag): + client.delete_partition(collection_name, partition_tag) + data_size = len(embedding_ids) + batch_size = 50000 + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + status, ids = client.insert( + collection_name=collection_name, + vectors=batch_emb.tolist(), + ids=embedding_ids[i : i + batch_size], + partition_tag=partition_tag, + ) + + +if __name__ == "__main__": + vector_insert(args.vector_path) diff --git a/applications/text_classification/multi_class/train.py b/applications/text_classification/multi_class/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ca480b3a1e7b1932db63ae4b490c1351f2056b2d --- /dev/null +++ b/applications/text_classification/multi_class/train.py @@ -0,0 +1,230 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
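+
+# Illustrative launch command only; see the README in this directory for the full argument list.
+# The model name and data paths below are examples, not required values:
+#   python train.py --do_train --do_eval --do_export \
+#       --model_name_or_path ernie-3.0-tiny-medium-v2-zh \
+#       --train_path ./data/train.txt --dev_path ./data/dev.txt --label_path ./data/label.txt \
+#       --output_dir ./checkpoint --device gpu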
+ +import functools +import json +import os +import shutil +from dataclasses import dataclass, field +from pathlib import Path +from typing import Optional + +import numpy as np +import paddle +from sklearn.metrics import ( + accuracy_score, + classification_report, + precision_recall_fscore_support, +) +from utils import log_metrics_debug, preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + CompressionArguments, + EarlyStoppingCallback, + PdArgumentParser, + Trainer, +) +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + export_model, +) +from paddlenlp.utils.log import logger + +SUPPORTED_MODELS = [ + "ernie-1.0-large-zh-cw", + "ernie-1.0-base-zh-cw", + "ernie-3.0-xbase-zh", + "ernie-3.0-base-zh", + "ernie-3.0-medium-zh", + "ernie-3.0-micro-zh", + "ernie-3.0-mini-zh", + "ernie-3.0-nano-zh", + "ernie-3.0-tiny-base-v2-zh", + "ernie-3.0-tiny-medium-v2-zh", + "ernie-3.0-tiny-micro-v2-zh", + "ernie-3.0-tiny-mini-v2-zh", + "ernie-3.0-tiny-nano-v2-zh ", + "ernie-3.0-tiny-pico-v2-zh", + "ernie-2.0-large-en", + "ernie-2.0-base-en", + "ernie-3.0-tiny-mini-v2-en", + "ernie-m-base", + "ernie-m-large", +] + + +# yapf: disable +@dataclass +class DataArguments: + max_length: int = field(default=128, metadata={"help": "Maximum number of tokens for the model."}) + early_stopping: bool = field(default=False, metadata={"help": "Whether apply early stopping strategy."}) + early_stopping_patience: int = field(default=4, metadata={"help": "Stop training when the specified metric worsens for early_stopping_patience evaluation calls"}) + debug: bool = field(default=False, metadata={"help": "Whether choose debug mode."}) + train_path: str = field(default='./data/train.txt', metadata={"help": "Train dataset file path."}) + dev_path: str = field(default='./data/dev.txt', metadata={"help": "Dev dataset file path."}) + test_path: str = field(default='./data/dev.txt', metadata={"help": "Test dataset file path."}) + label_path: str = field(default='./data/label.txt', metadata={"help": "Label file path."}) + bad_case_path: str = field(default='./data/bad_case.txt', metadata={"help": "Bad case file path."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-3.0-tiny-medium-v2-zh", metadata={"help": "Build-in pretrained model name or the path to local model."}) + export_model_dir: Optional[str] = field(default=None, metadata={"help": "Path to directory to store the exported inference model."}) +# yapf: enable + + +def main(): + """ + Training a binary or multi classification model + """ + + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if training_args.do_compress: + training_args.strategy = "dynabert" + if training_args.do_train or training_args.do_compress: + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Define id2label + id2label = {} + label2id = {} + with open(data_args.label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + id2label[i] = l + label2id[l] = i + + # Define model & tokenizer + if os.path.isdir(model_args.model_name_or_path): + model = AutoModelForSequenceClassification.from_pretrained( + model_args.model_name_or_path, label2id=label2id, id2label=id2label + ) 
+ elif model_args.model_name_or_path in SUPPORTED_MODELS: + model = AutoModelForSequenceClassification.from_pretrained( + model_args.model_name_or_path, num_classes=len(label2id), label2id=label2id, id2label=id2label + ) + else: + raise ValueError( + f"{model_args.model_name_or_path} is not a supported model type. Either use a local model path or select a model from {SUPPORTED_MODELS}" + ) + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # load and preprocess dataset + train_ds = load_dataset(read_local_dataset, path=data_args.train_path, label2id=label2id, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=data_args.dev_path, label2id=label2id, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_length=data_args.max_length) + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + + if data_args.debug: + test_ds = load_dataset(read_local_dataset, path=data_args.test_path, label2id=label2id, lazy=False) + test_ds = test_ds.map(trans_func) + + # Define the metric function. + def compute_metrics(eval_preds): + pred_ids = np.argmax(eval_preds.predictions, axis=-1) + metrics = {} + metrics["accuracy"] = accuracy_score(y_true=eval_preds.label_ids, y_pred=pred_ids) + for average in ["micro", "macro"]: + precision, recall, f1, _ = precision_recall_fscore_support( + y_true=eval_preds.label_ids, y_pred=pred_ids, average=average + ) + metrics[f"{average}_precision"] = precision + metrics[f"{average}_recall"] = recall + metrics[f"{average}_f1"] = f1 + return metrics + + def compute_metrics_debug(eval_preds): + pred_ids = np.argmax(eval_preds.predictions, axis=-1) + metrics = classification_report(eval_preds.label_ids, pred_ids, output_dict=True) + return metrics + + # Define the early-stopping callback. 
+ if data_args.early_stopping: + callbacks = [EarlyStoppingCallback(early_stopping_patience=data_args.early_stopping_patience)] + else: + callbacks = None + + # Define Trainer + trainer = Trainer( + model=model, + tokenizer=tokenizer, + args=training_args, + criterion=paddle.nn.loss.CrossEntropyLoss(), + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=callbacks, + data_collator=DataCollatorWithPadding(tokenizer), + compute_metrics=compute_metrics_debug if data_args.debug else compute_metrics, + ) + + # Training + if training_args.do_train: + train_result = trainer.train() + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + for checkpoint_path in Path(training_args.output_dir).glob("checkpoint-*"): + shutil.rmtree(checkpoint_path) + + # Evaluate and tests model + if training_args.do_eval: + if data_args.debug: + output = trainer.predict(test_ds) + log_metrics_debug(output, id2label, test_ds, data_args.bad_case_path) + else: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + # export inference model + if training_args.do_export: + if model.init_config["init_class"] in ["ErnieMForSequenceClassification"]: + input_spec = [paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids")] + else: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir) + tokenizer.save_pretrained(model_args.export_model_dir) + id2label_file = os.path.join(model_args.export_model_dir, "id2label.json") + with open(id2label_file, "w", encoding="utf-8") as f: + json.dump(id2label, f, ensure_ascii=False) + logger.info(f"id2label file saved in {id2label_file}") + + # compress + if training_args.do_compress: + trainer.compress() + for width_mult in training_args.width_mult_list: + pruned_infer_model_dir = os.path.join(training_args.output_dir, "width_mult_" + str(round(width_mult, 2))) + tokenizer.save_pretrained(pruned_infer_model_dir) + id2label_file = os.path.join(pruned_infer_model_dir, "id2label.json") + with open(id2label_file, "w", encoding="utf-8") as f: + json.dump(id2label, f, ensure_ascii=False) + logger.info(f"id2label file saved in {id2label_file}") + + for path in Path(training_args.output_dir).glob("runs"): + shutil.rmtree(path) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/multi_class/utils.py b/applications/text_classification/multi_class/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9e7a8889f6a32050ce1e48fbabd280a551abce1a --- /dev/null +++ b/applications/text_classification/multi_class/utils.py @@ -0,0 +1,86 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +from paddlenlp.utils.log import logger + + +def preprocess_function(examples, tokenizer, max_length, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks + by concatenating and adding special tokens. + """ + result = tokenizer(examples["text"], max_length=max_length, truncation=True) + if not is_test: + result["labels"] = np.array([examples["label"]], dtype="int64") + return result + + +def read_local_dataset(path, label2id=None, is_test=False): + """ + Read dataset. + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + if is_test: + sentence = line.strip() + yield {"text": sentence} + else: + items = line.strip().split("\t") + yield {"text": items[0], "label": label2id[items[1]]} + + +def log_metrics_debug(output, id2label, dev_ds, bad_case_path): + """ + Log metrics in debug mode. + """ + predictions, label_ids, metrics = output + pred_ids = np.argmax(predictions, axis=-1) + logger.info("-----Evaluate model-------") + logger.info("Dev dataset size: {}".format(len(dev_ds))) + logger.info("Accuracy in dev dataset: {:.2f}%".format(metrics["test_accuracy"] * 100)) + logger.info( + "Macro average | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + metrics["test_macro avg"]["precision"] * 100, + metrics["test_macro avg"]["recall"] * 100, + metrics["test_macro avg"]["f1-score"] * 100, + ) + ) + for i in id2label: + l = id2label[i] + logger.info("Class name: {}".format(l)) + i = "test_" + str(i) + if i in metrics: + logger.info( + "Evaluation examples in dev dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + metrics[i]["support"], + 100 * metrics[i]["support"] / len(dev_ds), + metrics[i]["precision"] * 100, + metrics[i]["recall"] * 100, + metrics[i]["f1-score"] * 100, + ) + ) + else: + logger.info("Evaluation examples in dev dataset: 0 (0%)") + logger.info("----------------------------") + + with open(bad_case_path, "w", encoding="utf-8") as f: + f.write("Text\tLabel\tPrediction\n") + for i, (p, l) in enumerate(zip(pred_ids, label_ids)): + p, l = int(p), int(l) + if p != l: + f.write(dev_ds.data[i]["text"] + "\t" + id2label[l] + "\t" + id2label[p] + "\n") + + logger.info("Bad case in dev dataset saved in {}".format(bad_case_path)) diff --git a/applications/text_classification/multi_label/README.md b/applications/text_classification/multi_label/README.md new file mode 100644 index 0000000000000000000000000000000000000000..94896cc8cc14f9ffbd184fc3aee212ec783f8ab0 --- /dev/null +++ b/applications/text_classification/multi_label/README.md @@ -0,0 +1,474 @@ +# 多标签分类指南 + +**目录** +- [1. 多标签分类简介](#多标签分类简介) +- [2. 快速开始](#快速开始) + - [2.1 运行环境](#运行环境) + - [2.2 代码结构](#代码结构) + - [2.3 数据准备](#数据准备) + - [2.4 模型训练](#模型训练) + - [2.5 模型部署](#模型部署) + - [2.6 模型效果](#模型效果) + + + +## 1. 多标签分类简介 + +本项目提供通用场景下**基于预训练模型微调的多标签分类端到端应用方案**,打通数据标注-模型训练-模型调优-模型压缩-预测部署全流程,有效缩短开发周期,降低AI开发落地门槛。 + +多标签数据集的标签集含有两个或两个以上的类别,输入句子/文本具有一个或多个标签,多标签任务的目标是预测**样本属于哪些标签类别,这些类别具有不相互排斥的属性**。文本多标签分类在各种现实场景中具有广泛的适用性,例如商品分类、网页标签、新闻标注、蛋白质功能分类、电影分类、语义场景分类等。以下图为例,该新闻文本具有 `相机` 和 `芯片` 两个标签。 + + +
+（图：示例新闻文本同时具有"相机"与"芯片"两个标签）
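+用一个极简的代码片段(仅为帮助理解的假设示例,标签集合与文本均为虚构)说明多标签任务的输入输出形式:一条文本对应一个与标签集合等长的多热(multi-hot)向量,预测时对每个标签独立判断是否命中。
+
+```python
+# 假设标签集合如下,一条样本可以同时命中多个标签
+labels = ["相机", "芯片", "屏幕"]
+sample = {"text": "XX手机发布:全新影像系统与自研芯片亮相", "labels": ["相机", "芯片"]}
+
+# 训练时通常将标签转换为多热向量,命中的标签位置为 1
+multi_hot = [1.0 if label in sample["labels"] else 0.0 for label in labels]
+print(multi_hot)  # [1.0, 1.0, 0.0]
+```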
+ +**方案亮点:** + +- **效果领先🏃:** 使用在中文领域内模型效果和模型计算效率有突出效果的ERNIE 3.0 轻量级系列模型作为训练基座,ERNIE 3.0 轻量级系列提供多种尺寸的预训练模型满足不同需求,具有广泛成熟的实践应用性。 +- **高效调优✊:** 文本分类应用依托[TrustAI](https://github.com/PaddlePaddle/TrustAI)可信增强能力和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md),提供模型分析模块助力开发者实现模型分析,并提供稀疏数据筛选、脏数据清洗、数据增强等多种解决方案。 +- **简单易用👶:** 开发者**无需机器学习背景知识**,仅需提供指定格式的标注分类数据,一行命令即可开启文本分类训练,轻松完成上线部署,不再让技术成为文本分类的门槛。 + +**更多选择:** + +对于大多数多标签分类任务,我们推荐使用预训练模型微调作为首选的文本分类方案,多标签分类项目中还提供 提示学习(小样本)和语义索引的两种全流程文本分类方案满足不同开发者需求,更多技术细节请参见[文本分类技术特色介绍](../README.md)。 + +- 【标注成本高、标注样本较少的小样本场景】 👉 [提示学习多标签分类方案](./few-shot#readme) +- 【标签类别不固定场景、标签类别众多】 👉 [语义索引多分类方案](./retrieval_based#readme) + + +## 2. 快速开始 + +我们以公开数据集CAIL2019—婚姻家庭要素提取任务为示例,演示多标签分类全流程方案使用。下载数据集: +```shell +wget https://paddlenlp.bj.bcebos.com/datasets/divorce.tar.gz +tar -zxvf divorce.tar.gz +mv divorce data +rm divorce.tar.gz +``` + +
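+数据下载解压后,可以先用下面几行代码确认数据格式(仅为检查用的示意脚本,假设解压后的目录为 `data/`,每行格式为"文本<TAB>标签1,标签2"):
+
+```python
+# 预览训练集前几条样本,确认文本与标签的分隔格式
+with open("data/train.txt", encoding="utf-8") as f:
+    for _ in range(3):
+        text, labels = f.readline().rstrip("\n").split("\t")
+        print(text[:30], "->", labels.split(","))
+```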
+ + + +### 2.1 运行环境 + +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4.8 +- scikit-learn >= 1.0.2 + +**安装PaddlePaddle:** + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +**安装PaddleNLP:** + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(删去 -i https://mirror.baidu.com/pypi/simple),更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + + +**安装sklearn:** +```shell +python3 -m pip install scikit-learn==1.0.2 +``` + + + +### 2.2 代码结构 + +```text +multi_label/ +├── few-shot # 小样本学习方案 +├── analysis # 分析模块 +├── deploy # 部署 +│   └── predictor # 离线部署 +│ ├── paddle_serving # PaddleServing在线服务化部署 +│   └── triton_serving # Triton在线服务化部署 +├── train.py # 训练评估脚本 +├── predict.py # 预测脚本 +├── export_model.py # 静态图模型导出脚本 +├── utils.py # 工具函数脚本 +├── metric.py # metric脚本 +├── prune.py # 裁剪脚本 +└── README.md # 使用说明 +``` + + +### 2.3 数据准备 + +训练需要准备指定格式的标注数据集,如果没有已标注的数据集,可以参考 [数据标注指南](../doccano.md) 进行文本分类数据标注。指定格式本地数据集目录结构: + +```text +data/ +├── train.txt # 训练数据集文件 +├── dev.txt # 开发数据集文件 +├── test.txt # 测试数据集文件(可选) +├── label.txt # 分类标签文件 +└── data.txt # 待预测数据文件(可选) +``` +**训练、开发、测试数据集**文件中文本与标签类别名用tab符`'\t'`分隔开,标签中多个标签之间用`','`逗号分隔开,文本中避免出现tab符`'\t'`。 + +- train.txt/dev.txt/test.txt 文件格式: +```text +<文本>'\t'<标签>','<标签>','<标签> +<文本>'\t'<标签>','<标签> +... +``` + +- train.txt/dev.txt/test.txt 文件样例: + +```text +现在原告已是第二次申请与被告离婚了。 二次起诉离婚 +双方均认可价值6万元。 不动产分割,有夫妻共同财产 +2004年4月,原、被告发生纠纷后,被告离家外出未归,直到现在,双方长期分居生活,十几年间互无联系,夫妻感情已经完全破裂。 婚后分居 +婚生子杨某甲由原告抚养,高中阶段之前的相关费用由原告承担,高中阶段之后的相关费用由双方协商,被告可以随时探望孩子; 婚后有子女,支付抚养费,限制行为能力子女抚养 +... +``` + +**分类标签**包含数据集中所有标签集合,每一行为一个标签名。 + +- label.txt 文件格式: + +```text +<标签> +<标签> +... +``` + +- label.txt 文件样例: +```text +婚后有子女 +限制行为能力子女抚养 +有夫妻共同财产 +支付抚养费 +... +``` + +**待预测数据文件:** 包含需要预测标签的文本数据,每条数据一行。 + +- data.txt 文件格式: + +```text +<文本> +<文本> +... +``` + +- data.txt 文件样例: +```text +原、被告另购置橱柜、碗架、电磁炉、电饭锅各一个归原告王某某所有。 +于是原告到儿子就读的幼儿园进行探望,被告碰见后对原告破口大骂,还不让儿子叫原告妈妈,而叫被告现在的妻子做妈妈。 +6、被告父亲给的房屋装修款2.3万元在原告处,要求依法分割; +由我全额出资购买的联想台式电脑,我均依次放弃。 +... 
+``` + + +### 2.4 模型训练 + +#### 2.4.1 预训练模型微调 + +使用CPU/GPU训练,默认为GPU训练。使用CPU训练只需将设备参数配置改为`--device cpu`,可以使用`--device gpu:0`指定GPU卡号: +```shell +python train.py \ + --dataset_dir "data" \ + --device "gpu" \ + --max_seq_length 128 \ + --model_name "ernie-3.0-medium-zh" \ + --batch_size 32 \ + --early_stop \ + --epochs 100 +``` + +如果在GPU环境中使用,可以指定`gpus`参数进行单卡/多卡训练。使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。如果设备只有一个GPU卡号默认为0,可使用`nvidia-smi`命令查看GPU使用情况。 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --dataset_dir "data" \ + --device "gpu" \ + --max_seq_length 128 \ + --model_name "ernie-3.0-medium-zh" \ + --batch_size 32 \ + --early_stop \ + --epochs 100 +``` +可支持配置的参数: + +* `device`: 选用什么设备进行训练,选择cpu、gpu、xpu、npu。如使用gpu训练,可使用参数--gpus指定GPU卡号;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt,dev.txt和label.txt文件;默认为None。 +* `save_dir`:保存训练模型的目录;默认保存在当前目录checkpoint文件夹下。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh"。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:训练最大学习率;默认为3e-5。 +* `epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 +* `early_stop`:选择是否使用早停法(EarlyStopping),模型在开发集经过一定epoch后精度表现不再上升,训练终止;默认为False。 +* `early_stop_nums`:在设定的早停训练轮次内,模型在开发集上表现不再上升,训练终止;默认为4。 +* `logging_steps`: 训练过程中日志打印的间隔steps数,默认5。 +* `weight_decay`:控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `warmup`:是否使用学习率warmup策略,使用时应设置适当的训练轮次(epochs);默认为False。 +* `warmup_steps`:学习率warmup策略的比例数,如果设为1000,则学习率会在1000steps数从0慢慢增长到learning_rate, 而后再缓慢衰减;默认为0。 +* `init_from_ckpt`: 模型初始checkpoint参数地址,默认None。 +* `seed`:随机种子,默认为3。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存开发集上最佳模型在指定的 `save_dir` 中,保存模型文件结构如下所示: + +```text +checkpoint/ +├── config.json # 模型配置文件,paddlenlp 2.4.5以前为model_config.json +├── model_state.pdparams # 模型参数文件 +├── tokenizer_config.json # 分词器配置文件 +├── vocab.txt +└── ... +``` + +**NOTE:** +* 如需恢复模型训练,则可以设置 `init_from_ckpt` , 如 `init_from_ckpt=checkpoint/model_state.pdparams` 。 +* 如需训练英文文本分类任务,只需更换预训练模型参数 `model_name` 。英文训练任务推荐使用"ernie-2.0-base-en"、"ernie-2.0-large-en"。 +* 英文和中文以外语言的文本分类任务,推荐使用基于96种语言(涵盖法语、日语、韩语、德语、西班牙语等几乎所有常见语言)进行预训练的多语言预训练模型"ernie-m-base"、"ernie-m-large",详情请参见[ERNIE-M论文](https://arxiv.org/pdf/2012.15674.pdf)。 + +#### 2.4.2 训练评估与模型优化 + +文本分类预测过程中常会遇到诸如"模型为什么会预测出错误的结果","如何提升模型的表现"等问题。[Analysis模块](./analysis) 提供了**模型评估、可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + +
+ +**模型评估:** 训练后的模型我们可以使用 [Analysis模块](./analysis) 对每个类别分别进行评估,并输出预测错误样本(bad case),默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: + +```shell +python analysis/evaluate.py --device "gpu" --max_seq_length 128 --batch_size 32 --bad_case_file "bad_case.txt" --dataset_dir "data" --params_path "./checkpoint" +``` + +输出打印示例: + +```text +[2022-08-12 02:24:48,193] [ INFO] - -----Evaluate model------- +[2022-08-12 02:24:48,194] [ INFO] - Dev dataset size: 1611 +[2022-08-12 02:24:48,194] [ INFO] - Accuracy in dev dataset: 74.24% +[2022-08-12 02:24:48,194] [ INFO] - Macro avg in dev dataset: precision: 82.96 | recall: 77.59 | F1 score 79.36 +[2022-08-12 02:24:48,194] [ INFO] - Micro avg in dev dataset: precision: 91.50 | recall: 89.66 | F1 score 90.57 +[2022-08-12 02:24:48,195] [ INFO] - Class name: 婚后有子女 +[2022-08-12 02:24:48,195] [ INFO] - Evaluation examples in dev dataset: 784(48.7%) | precision: 97.07 | recall: 97.32 | F1 score 97.20 +[2022-08-12 02:24:48,195] [ INFO] - ---------------------------- +[2022-08-12 02:24:48,195] [ INFO] - Class name: 限制行为能力子女抚养 +[2022-08-12 02:24:48,195] [ INFO] - Evaluation examples in dev dataset: 492(30.5%) | precision: 88.57 | recall: 88.21 | F1 score 88.39 +... +``` + +预测错误的样本保存在bad_case.txt文件中: + +```text +Text Label Prediction +2014年,王X以其与肖X协议离婚时未分割该套楼房的首付款为由,起诉至法院,要求分得楼房的首付款15万元。 不动产分割,有夫妻共同财产 不动产分割 +但原、被告对已建立起的夫妻感情不够珍惜,因琐事即发生吵闹并最终分居,对夫妻感情造成了严重的影响,现原、被告已分居六年有余,且经人民法院判决不准离婚后仍未和好,夫妻感情确已破裂,依法应准予原、被告离婚。 二次起诉离婚,准予离婚,婚后分居,法定离婚 婚后分居,准予离婚 +婚后生有一女,取名彭某乙,已11岁,现已由被告从铁炉白族乡中心小学转入走马镇李桥小学读书。 婚后有子女 婚后有子女,限制行为能力子女抚养 +... +``` +**可解释性分析:** 基于[TrustAI](https://github.com/PaddlePaddle/TrustAI)提供单词和句子级别的模型可解释性分析,帮助理解模型预测结果,用于错误样本(bad case)分析,细节详见[训练评估与模型优化指南](analysis/README.md)。 + +- 单词级别可解释性分析,也即分析待预测样本中哪一些单词对模型预测结果起重要作用。以下图为例,用颜色深浅表示单词对预测结果的重要性。 +
+（图：词级别可解释性示例,颜色越深代表该词对预测结果影响越大）
+ +- 句子级别可解释性分析 ,也即分析对待预测样本的模型预测结果与训练集中中哪些样本有重要关系。下面的例子表明句子级别可解释性分析可以帮助理解待预测样本的预测结果与训练集中样本之间的关联。 +```text +text: 2015年2月23日,被告将原告赶出家门,原告居住于娘家待产,双方分居至今。 +predict label: 婚后分居 +label: 不履行家庭义务,婚后分居 +examples with positive influence +support1 text: 2014年中秋节原告回了娘家,原、被告分居至今。 label: 婚后分居 score: 0.99942 +support2 text: 原告于2013年8月13日离开被告家,分居至今。 label: 婚后分居 score: 0.99916 +support3 text: 2014年4月,被告外出务工,双方分居至今。 label: 婚后分居 score: 0.99902 +... +``` + +**数据优化:** 结合[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供了**稀疏数据筛选、脏数据清洗、数据增强**三种优化策略,从多角度优化训练数据提升模型效果,策略细节详见[训练评估与模型优化指南](analysis/README.md)。 + +- 稀疏数据筛选主要是解决数据不均衡、训练数据覆盖不足的问题,通过数据增强和数据标注两种方式解决这一问题。 +- 脏数据清洗可以帮助开发者筛选训练集中错误标注的数据,对这些数据重新进行人工标注,得到标注正确的数据再重新进行训练。 +- 数据增强策略提供多种数据增强方案,可以快速扩充数据,提高模型泛化性和鲁棒性。 + + +#### 2.4.3 模型预测 +训练结束后,输入待预测数据(data.txt)和类别标签对照列表(label.txt),使用训练好的模型进行,默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"`: + +```shell +python predict.py --device "gpu" --max_seq_length 128 --batch_size 32 --dataset_dir "data" +``` + +可支持配置的参数: + +* `device`: 选用什么设备进行预测,可选cpu、gpu、xpu、npu;默认为gpu。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含data.txt和label.txt文件;默认为None。 +* `params_path`:待预测模型的目录;默认为"./checkpoint/"。 +* `max_seq_length`:模型使用的最大序列长度,建议与训练时最大序列长度一致, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `data_file`:本地数据集中未标注待预测数据文件名;默认为"data.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 + + + +### 2.5 模型部署 + +#### 2.5.1 静态图导出 + +使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,静态图模型将用于**后续的推理部署工作**。具体代码见[静态图导出脚本](export_model.py),静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export +``` + +如果使用多语言模型 ERNIE M作为预训练模型,运行方式: + +```shell +python export_model.py --params_path ./checkpoint/ --output_path ./export --multilingual +``` + +可支持配置的参数: + +* `multilingual`:是否为多语言任务(是否使用ERNIE M作为预训练模型);默认为False。 +* `params_path`:动态图训练保存的参数路径;默认为"./checkpoint/"。 +* `output_path`:静态图图保存的参数路径;默认为"./export"。 + +程序运行时将会自动导出模型到指定的 `output_path` 中,保存模型文件结构如下所示: + +```text +export/ +├── float32.pdiparams +├── float32.pdiparams.info +└── float32.pdmodel +``` + 导出模型之后用于部署,项目提供了基于ONNXRuntime的 [离线部署方案](./deploy/predictor/README.md) 和基于Paddle Serving的 [在线服务化部署方案](./deploy/predictor/README.md)。 + +#### 2.5.2 模型裁剪 + +如果有模型部署上线的需求,需要进一步压缩模型体积,可以使用 PaddleNLP 的 [压缩API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/compression.md), 一行命令即可启动模型裁剪。 + +使用裁剪功能需要安装 paddleslim: + +```shell +pip install paddleslim==2.4.1 +``` + +开始模型裁剪训练,默认为GPU训练,使用CPU训练只需将设备参数配置改为`--device "cpu"`: +```shell +python prune.py \ + --device "gpu" \ + --dataset_dir "data" \ + --output_dir "prune" \ + --learning_rate 3e-5 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --num_train_epochs 10 \ + --max_seq_length 128 \ + --logging_steps 5 \ + --save_steps 100 \ + --width_mult_list '3/4' '2/3' '1/2' +``` + + +可支持配置的参数: +* `output_dir`:必须,保存模型输出和中间checkpoint的输出目录;默认为 `None` 。 +* `device`: 选用什么设备进行裁剪,选择cpu、gpu。如使用gpu训练,可使用参数--gpus指定GPU卡号。 +* `per_device_train_batch_size`:训练集裁剪训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `per_device_eval_batch_size`:开发集评测过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:训练最大学习率;默认为5e-5。 +* `num_train_epochs`: 训练轮次,使用早停法时可以选择100;默认为10。 +* `logging_steps`: 训练过程中日志打印的间隔steps数,默认100。 +* `save_steps`: 训练过程中保存模型checkpoint的间隔steps数,默认100。 +* `seed`:随机种子,默认为3。 +* `width_mult_list`:裁剪宽度(multi head)保留的比例列表,表示对self_attention中的 `q`、`k`、`v` 以及 `ffn` 
权重宽度的保留比例,保留比例乘以宽度(multi haed数量)应为整数;默认是None。 +* `dataset_dir`:本地数据集路径,需包含train.txt,dev.txt,label.txt;默认为None。 +* `max_seq_length`:模型使用的最大序列长度,建议与训练过程保持一致, 若出现显存不足,请适当调低这一参数;默认为128。 +* `params_dir`:待预测模型参数文件;默认为"./checkpoint/"。 + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存开发集上最佳模型在指定的 `output_dir` 中,保存模型文件结构如下所示: + +```text +prune/ +├── width_mult_0.75 +│   ├── pruned_model.pdiparams +│   ├── pruned_model.pdiparams.info +│   ├── pruned_model.pdmodel +│   ├── model_state.pdparams +│   └── model_config.json +└── ... +``` + +**NOTE:** + +1. 目前支持的裁剪策略需要训练,训练时间视下游任务数据量而定,且和微调的训练时间是一个量级。 裁剪类似蒸馏过程,方便起见,可以直接使用微调时的超参。为了进一步提升精度,可以对 `per_device_train_batch_size`、`learning_rate`、`num_train_epochs`、`max_seq_length` 等超参进行网格搜索(grid search)。 + +2. 模型裁剪主要用于推理部署,因此裁剪后的模型都是静态图模型,只可用于推理部署,不能再通过 `from_pretrained` 导入继续训练。导出模型之后用于部署,项目提供了基于ONNXRuntime的 [离线部署方案](./deploy/predictor/README.md) 和基于Paddle Serving的 [在线服务化部署方案](./deploy/predictor/README.md)。 + +3. ERNIE Base、Medium、Mini、Micro、Nano的模型宽度(multi head数量)为12,ERNIE Xbase、Large 模型宽度(multi head数量)为16,保留比例`width_mult`乘以宽度(multi haed数量)应为整数。 + + +#### 2.5.3 部署方案 + +- 离线部署搭建请参考[离线部署](deploy/predictor/README.md)。 + +- 在线服务化部署搭建请参考 [PaddleNLP SimpleServing部署指南](deploy/simple_serving/README.md) 或 [Triton部署指南](deploy/triton_serving/README.md)。 + + + +### 2.6 模型效果 + +我们在CAIL2019—婚姻家庭要素提取任务数据集评测模型表现,测试配置如下: + +1. 数据集:CAIL2019—婚姻家庭要素提取任务数据集 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.3.1 + +5. 性能数据指标:latency。latency 测试方法:固定 batch size 为 32,GPU部署运行时间 total_time,计算 latency = total_time / total_samples + +6. 精度评价指标:Micro F1分数、Macro F1分数 + +| model_name | 模型结构 |Micro F1(%) | Macro F1(%) | latency(ms) | +| -------------------------- | ------------ | ------------ | ------------ |------------ | +|ERNIE 1.0 Large Cw |24-layer, 1024-hidden, 20-heads|91.14|81.68 |5.66 | +|ERNIE 3.0 Base |12-layer, 768-hidden, 12-heads|90.38|80.14| 2.70 | +|ERNIE 3.0 Medium| 6-layer, 768-hidden, 12-heads|90.57|79.36| 1.46| +|ERNIE 3.0 Mini |6-layer, 384-hidden, 12-heads|89.27|76.78| 0.56| +|ERNIE 3.0 Micro | 4-layer, 384-hidden, 12-heads|89.43|77.20| 0.34| +|ERNIE 3.0 Nano |4-layer, 312-hidden, 12-heads|85.39|75.07|0.32| +| ERNIE 3.0 Medium + 裁剪(保留比例3/4)|6-layer, 768-hidden, 9-heads| 89.94|79.35| 0.81 | +| ERNIE 3.0 Medium + 裁剪(保留比例2/3)|6-layer, 768-hidden, 8-heads| 89.99|79.37 | 0.75 | +| ERNIE 3.0 Medium + 裁剪(保留比例1/2)|6-layer, 768-hidden, 6-heads| 89.19 | 76.35| 0.61 | diff --git a/applications/text_classification/multi_label/analysis/README.md b/applications/text_classification/multi_label/analysis/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9aeba1759089b078d6a9ed69e10864c3d1c14b88 --- /dev/null +++ b/applications/text_classification/multi_label/analysis/README.md @@ -0,0 +1,421 @@ +# 训练评估与模型优化指南 + +**目录** + * [Analysis模块介绍](#Analysis模块介绍) + * [环境准备](#环境准备) + * [模型评估](#模型评估) + * [可解释性分析](#可解释性分析) + * [单词级别可解释性分析](#单词级别可解释性分析) + * [句子级别可解释性分析](#句子级别可解释性分析) + * [数据优化](#数据优化) + * [稀疏数据筛选方案](#稀疏数据筛选方案) + * [脏数据清洗方案](#脏数据清洗方案) + * [数据增强策略方案](#数据增强策略方案) + +## Analysis模块介绍 + +Analysis模块提供了**模型评估、可解释性分析、数据优化**等功能,旨在帮助开发者更好地分析文本分类模型预测结果和对模型效果进行优化。 + +- **模型评估:** 对整体分类情况和每个类别分别进行评估,并打印预测错误样本,帮助开发者分析模型表现找到训练和预测数据中存在的问题。 + +- **可解释性分析:** 基于[TrustAI](https://github.com/PaddlePaddle/TrustAI)提供单词和句子级别的模型可解释性分析,帮助理解模型预测结果。 + +- **数据优化:** 
结合[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)提供了**稀疏数据筛选、脏数据清洗、数据增强**三种优化策略,从多角度优化训练数据提升模型效果。 + +
+ +以下是本项目主要代码结构及说明: + +```text +analysis/ +├── evaluate.py # 评估脚本 +├── sent_interpret.py # 句子级别可解释性分析脚本 +├── word_interpret.py # 单词级别可解释性分析notebook +├── sparse.py # 稀疏数据筛选脚本 +├── dirty.py # 脏数据清洗脚本 +├── aug.py # 数据增强脚本 +└── README.md # 训练评估与模型优化指南 +``` + +## 环境准备 +需要可解释性分析和数据优化需要安装相关环境。 +- trustai >= 0.1.7 +- interpretdl >= 0.7.0 + +**安装TrustAI**(可选)如果使用可解释性分析和数据优化中稀疏数据筛选和脏数据清洗需要安装TrustAI。 +```shell +pip install trustai==0.1.7 +``` + +**安装InterpretDL**(可选)如果使用词级别可解释性分析GradShap方法,需要安装InterpretDL +```shell +pip install interpretdl==0.7.0 +``` + +## 模型评估 + +我们使用训练好的模型计算模型的在开发集的准确率,同时打印每个类别数据量及表现: + +```shell +python evaluate.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --params_path "../checkpoint" \ + --max_seq_length 128 \ + --batch_size 32 \ + --bad_case_file "bad_case.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt、dev.txt和label.txt文件;默认为None。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `train_file`:本地数据集中开发集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `bad_case_path`:开发集中预测错误样本保存路径;默认为"/bad_case.txt"。 + +输出打印示例: + +```text +[2022-08-12 02:24:48,193] [ INFO] - -----Evaluate model------- +[2022-08-12 02:24:48,194] [ INFO] - Dev dataset size: 1611 +[2022-08-12 02:24:48,194] [ INFO] - Accuracy in dev dataset: 74.24% +[2022-08-12 02:24:48,194] [ INFO] - Macro avg in dev dataset: precision: 82.96 | recall: 77.59 | F1 score 79.36 +[2022-08-12 02:24:48,194] [ INFO] - Micro avg in dev dataset: precision: 91.50 | recall: 89.66 | F1 score 90.57 +[2022-08-12 02:24:48,195] [ INFO] - Class name: 婚后有子女 +[2022-08-12 02:24:48,195] [ INFO] - Evaluation examples in dev dataset: 784(48.7%) | precision: 97.07 | recall: 97.32 | F1 score 97.20 +[2022-08-12 02:24:48,195] [ INFO] - ---------------------------- +[2022-08-12 02:24:48,195] [ INFO] - Class name: 限制行为能力子女抚养 +[2022-08-12 02:24:48,195] [ INFO] - Evaluation examples in dev dataset: 492(30.5%) | precision: 88.57 | recall: 88.21 | F1 score 88.39 +... +``` + +预测错误的样本保存在bad_case.txt文件中: + +```text +Text Label Prediction +2014年,王X以其与肖X协议离婚时未分割该套楼房的首付款为由,起诉至法院,要求分得楼房的首付款15万元。 不动产分割,有夫妻共同财产 不动产分割 +但原、被告对已建立起的夫妻感情不够珍惜,因琐事即发生吵闹并最终分居,对夫妻感情造成了严重的影响,现原、被告已分居六年有余,且经人民法院判决不准离婚后仍未和好,夫妻感情确已破裂,依法应准予原、被告离婚。 二次起诉离婚,准予离婚,婚后分居,法定离婚 婚后分居,准予离婚 +婚后生有一女,取名彭某乙,已11岁,现已由被告从铁炉白族乡中心小学转入走马镇李桥小学读书。 婚后有子女 婚后有子女,限制行为能力子女抚养 +... +``` +## 可解释性分析 +"模型为什么会预测出这个结果?"是文本分类任务开发者时常遇到的问题,如何分析错误样本(bad case)是文本分类任务落地中重要一环,本项目基于TrustAI开源了基于词级别和句子级别的模型可解释性分析方法,帮助开发者更好地理解文本分类模型与数据,有助于后续的模型优化与数据清洗标注。 + +### 单词级别可解释性分析 +本项目开源模型的词级别可解释性分析Notebook,提供LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用Jupyter Notebook进行数据分析。 + +运行 [word_interpret.ipynb](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/multi_label/analysis/README.md) 代码,即可分析影响样本预测结果的关键词以及可视化所有词对预测结果的贡献情况,颜色越深代表这个词对预测结果影响越大: +
+（图：词级别重要性可视化示例,颜色越深代表该词对预测结果影响越大）
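+如果暂时不方便运行 Notebook,也可以先用一个非常粗糙的"逐字遮盖"思路直观感受词级别重要性(以下仅为概念示意,并非 LIME/Integrated Gradient/GradShap 的实现;模型目录、标签序号与样本文本均为假设):
+
+```python
+import paddle
+import paddle.nn.functional as F
+
+from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+model = AutoModelForSequenceClassification.from_pretrained("../checkpoint")
+tokenizer = AutoTokenizer.from_pretrained("../checkpoint")
+model.eval()
+
+
+def label_prob(text, label_id=0):
+    # 返回指定标签的 sigmoid 概率
+    encoded = tokenizer(text, max_seq_len=128)
+    input_ids = paddle.to_tensor([encoded["input_ids"]])
+    token_type_ids = paddle.to_tensor([encoded["token_type_ids"]])
+    logits = model(input_ids, token_type_ids)
+    return float(F.sigmoid(logits)[0][label_id])
+
+
+text = "2014年4月,被告外出务工,双方分居至今。"
+base = label_prob(text)
+# 逐个去掉一个字,概率下降越多,说明该字对该标签的预测越重要
+for i, ch in enumerate(text):
+    drop = base - label_prob(text[:i] + text[i + 1:])
+    if drop > 0.05:
+        print(ch, round(drop, 4))
+```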
+ +### 句子级别可解释性分析 +本项目基于特征相似度([FeatureSimilarity](https://arxiv.org/abs/2104.04128))算法,计算对样本预测结果正影响的训练数据,帮助理解模型的预测结果与训练集数据的关系。 + +待分析数据文件`interpret_input_file`应为以下三种格式中的一种: +**格式一:包括文本、标签、预测结果** +```text +<文本>'\t'<标签>'\t'<预测结果> +... +``` + +**格式二:包括文本、标签** +```text +<文本>'\t'<标签> +... +``` + +**格式三:只包括文本** +```text +<文本> +准予原告胡某甲与被告韩某甲离婚。 +... +``` + +我们可以运行代码,得到支持样本模型预测结果的训练数据: +```shell +python sent_interpret.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --params_path "../checkpoint/" \ + --max_seq_length 128 \ + --batch_size 16 \ + --top_k 3 \ + --train_file "train.txt" \ + --interpret_input_file "bad_case.txt" \ + --interpret_result_file "sent_interpret.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `top_k`:筛选支持训练证据数量;默认为3。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `interpret_input_file`:本地数据集中待分析文件名;默认为"bad_case.txt"。 +* `interpret_result_file`:保存句子级别可解释性结果文件名;默认为"sent_interpret.txt"。 + +可解释性结果保存在 `interpret_result_file` 文件中: +```text +text: 2015年2月23日,被告将原告赶出家门,原告居住于娘家待产,双方分居至今。 +predict label: 婚后分居 +label: 不履行家庭义务,婚后分居 +examples with positive influence +support1 text: 2014年中秋节原告回了娘家,原、被告分居至今。 label: 婚后分居 score: 0.99942 +support2 text: 原告于2013年8月13日离开被告家,分居至今。 label: 婚后分居 score: 0.99916 +support3 text: 2014年4月,被告外出务工,双方分居至今。 label: 婚后分居 score: 0.99902 +... +``` + + +## 数据优化 + +### 稀疏数据筛选方案 + +稀疏数据筛选适用于文本分类中**数据不平衡或训练数据覆盖不足**的场景,简单来说,就是由于模型在训练过程中没有学习到足够与待预测样本相似的数据,模型难以正确预测样本所属类别的情况。稀疏数据筛选旨在开发集中挖掘缺乏训练证据支持的数据,通常可以采用**数据增强**或**少量数据标注**的两种低成本方式,提升模型在开发集的预测效果。 + +本项目中稀疏数据筛选基于TrustAI,利用基于特征相似度的实例级证据分析方法,抽取开发集中样本的支持训练证据,并计算支持证据平均分(通常为得分前三的支持训练证据均分)。分数较低的样本表明其训练证据不足,在训练集中较为稀疏,实验表明模型在这些样本上表现也相对较差。更多细节详见[TrustAI](https://github.com/PaddlePaddle/TrustAI)和[实例级证据分析](https://github.com/PaddlePaddle/TrustAI/blob/main/trustai/interpretation/example_level/README.md)。 + + +#### 稀疏数据识别—数据增强 + +这里我们将介绍稀疏数据识别—数据增强流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据增强**:将稀疏数据集在训练集中的支持证据应用数据增强策略,这些数据增强后的训练数据记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别-数据增强,得到支持数据集: + +```shell +python sparse.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --aug_strategy "substitute" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `aug_strategy`:数据增强类型,可选"duplicate","substitute", "insert", "delete", "swap";默认为"substitute"。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* 
`dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +将得到增强支持数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_aug.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_aug.txt +``` + +**方案效果** + +我们在CAIL2019—婚姻家庭要素提取数据集抽取部分训练数据(训练集数据规模:500)进行实验,筛选稀疏数据数量和筛选支持数据数量均设为100条,使用不同的数据增强方法进行评测: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集|84.43|50.01| +|训练集+支持增强集(duplicate) |84.80|**51.78**| +|训练集+支持增强集(substitute) |84.66|50.61| +|训练集+支持增强集(insert) |84.48|49.95| +|训练集+支持增强集(delete) |84.83| 51.04| +|训练集+支持增强集(swap) |**84.84**|51.06| + +#### 稀疏数据识别-数据标注 + +本方案能够有针对性进行数据标注,相比于随机标注数据更好提高模型预测效果。这里我们将介绍稀疏数据识别-数据标注流程: + +- **稀疏数据识别:** 挖掘开发集中的缺乏训练证据支持数据,记为稀疏数据集(Sparse Dataset); + +- **数据标注**:在未标注数据集中筛选稀疏数据集的支持证据,并进行数据标注,记为支持数据集(Support Dataset); + +- **重新训练模型:** 将支持数据集加入到原有的训练集获得新的训练集,重新训练新的文本分类模型。 + +现在我们进行稀疏数据识别--数据标注,得到待标注数据: + +```shell +python sparse.py \ + --annotate \ + --device "gpu" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --sparse_num 100 \ + --support_num 100 \ + --unlabeled_file "data.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含dev.txt和label.txt文件;默认为None。 +* `annotate`:选择稀疏数据识别--数据标注模式;默认为False。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `rationale_num_sparse`:筛选稀疏数据时计算样本置信度时支持训练证据数量;认为3。 +* `rationale_num_support`:筛选支持数据时计算样本置信度时支持训练证据数量,如果筛选的支持数据不够,可以适当增加;默认为6。 +* `sparse_num`:筛选稀疏数据数量,建议为开发集的10%~20%,默认为100。 +* `support_num`:用于数据增强的支持数据数量,建议为训练集的10%~20%,默认为100。 +* `support_threshold`:支持数据的阈值,只选择支持证据分数大于阈值作为支持数据,默认为0.7。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dev_file`:本地数据集中开发集文件名;默认为"dev.txt"。 +* `label_file`:本地数据集中标签集文件名;默认为"label.txt"。 +* `unlabeled_file`:本地数据集中未标注数据文件名;默认为"data.txt"。 +* `sparse_file`:保存在本地数据集路径中稀疏数据文件名;默认为"sparse.txt"。 +* `support_file`:保存在本地数据集路径中支持训练数据文件名;默认为"support.txt"。 + +我们将筛选出的支持数据`support.txt`进行标注,可以使用标注工具帮助更快标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已标注数据`support.txt`与训练集数据`train.txt`合并得到新的训练集`train_sparse_annotate.txt`重新进行训练: + +```shell +cat ../data/train.txt ../data/support.txt > ../data/train_sparse_annotate.txt +``` + +**方案效果** + +我们在CAIL2019—婚姻家庭要素提取数据集抽取部分训练数据(训练集数据规模:500)进行实验,筛选稀疏数据数量设为100条,筛选待标注数据数量为50和100条。我们比较了使用稀疏数据方案的策略采样和随机采样的效果,下表结果表明使用稀疏数据方案的策略采样能够有效指导训练数据扩充,在标注更少的数据情况下获得更大提升的效果: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ | ------------ | +|训练集|84.43|50.01| +|训练集+策略采样集(50) |85.77|**57.13**| +|训练集+随机采样集(50) |84.91|54.40| +|训练集+策略采样集(100) |**86.14**|56.93| +|训练集+随机采样集(100) |84.69|50.76| + +### 脏数据清洗方案 + +脏数据清洗方案是基于已训练好的文本分类模型,筛选出训练数据集中标注错误的数据,再由人工检查重新标注,获得标注正确的数据集进行重新训练。我们将介绍脏数据清洗流程: + +- **脏数据筛选:** 基于TrustAI中表示点方法,计算训练数据对文本分类模型的影响分数,分数高的训练数据表明对模型影响大,这些数据有较大概率为标注错误样本,记为脏数据集(Dirty Dataset)。 + +- **数据清洗、训练:** 将筛选出的脏数据由人工重新检查,为数据打上正确的标签。将清洗后的训练数据重新放入文本分类模型进行训练。 + +现在我们进行脏数据识别,脏数据保存在`"train_dirty.txt"`,剩余训练数据保存在`"train_dirty_rest.txt"`: + +```shell +python dirty.py \ + --device "gpu" \ + --dataset_dir "../data" \ + --max_seq_length 128 \ + --params_path "../checkpoint/" \ + --batch_size 16 \ + --dirty_num 
100 \ + --dirty_file "train_dirty.txt" \ + --rest_file "train_dirty_rest.txt" +``` + +默认在GPU环境下使用,在CPU环境下修改参数配置为`--device "cpu"` + +可支持配置的参数: + +* `dataset_dir`:必须,本地数据集路径,数据集路径中应包含train.txt和label.txt文件;默认为None。 +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `params_path`:保存训练模型的目录;默认为"../checkpoint/"。 +* `device`: 选用什么设备进行训练,可选择cpu、gpu、xpu、npu;默认为"gpu"。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `seed`:随机种子,默认为3。 +* `dirty_file`:保存脏数据文件名,默认为"train_dirty.txt"。 +* `rest_file`:保存剩余数据(非脏数据)文件名,默认为"train_dirty_rest.txt"。 +* `train_file`:本地数据集中训练集文件名;默认为"train.txt"。 +* `dirty_threshold`:筛选脏数据用于重新标注的阈值,只选择影响分数大于阈值作为支持数据,默认为0。 + + +我们将筛选出脏数据进行人工检查重新标注,可以将`train_dirty.txt`直接导入标注工具doccano帮助更快重新标注,详情请参考[文本分类任务doccano数据标注使用指南](../../doccano.md)进行文本分类数据标注。然后将已重新标注的脏数据`train_dirty.txt`与剩余训练集数据`train_dirty_rest.txt`合并得到新的训练集`train_clean.txt`重新进行训练: + +```shell +cat ../data/train_dirty_rest.txt ../data/train_dirty.txt > ../data/train_clean.txt +``` + +**方案效果** + +我们在CAIL2019—婚姻家庭要素提取数据集抽取部分训练数据(训练集数据规模:500)进行实验,取50条数据进行脏数据处理,也即50条训练数据为标签错误数据。选择不同`dirty_num`应用脏数据清洗策略进行评测: + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集(500,含50条脏数据) |82.89|47.83| +|训练集(500,含50条脏数据) + 脏数据清洗(25)|82.42|49.57| +|训练集(500,含50条脏数据) + 脏数据清洗(50)|83.38|50.40| +|训练集(500,含50条脏数据) + 脏数据清洗(100)|84.50|51.28| + + +## 数据增强策略方案 + +在数据量较少或某些类别样本量较少时,也可以通过数据增强策略的方式,生成更多的训练数据,提升模型效果。 + +```shell +python aug.py \ + --create_n 2 \ + --aug_percent 0.1 \ + --train_path "../data/train.txt" \ + --aug_path "../data/aug.txt" +``` + +可支持配置的参数: + +* `train_path`:待增强训练数据集文件路径;默认为"../data/train.txt"。 +* `aug_path`:增强生成的训练数据集文件路径;默认为"../data/train_aug.txt"。 +* `aug_strategy`:数据增强策略,可选"mix", "substitute", "insert", "delete", "swap"为多种数据策略混合使用;默认为"substitute"。 +* `aug_type`:词替换/词插入增强类型,可选"synonym", "homonym", "mlm",建议在GPU环境下使用mlm类型;默认为"synonym"。 +* `create_n`:生成的句子数量,默认为2。 +* `aug_percent`:生成词替换百分比,默认为0.1。 +* `device`: 选用什么设备进行增强,可选择cpu、gpu、xpu、npu,仅在使用mlm类型有影响;默认为"gpu"。 + +生成的增强数据保存在`"aug.txt"`文件中,与训练集数据`train.txt`合并得到新的训练集`train_aug.txt`重新进行训练: + +```shell +cat ../data/aug.txt ../data/train.txt > ../data/train_aug.txt +``` + +PaddleNLP内置多种数据增强策略,更多数据增强策略使用方法请参考[数据增强API](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/dataaug.md)。 + +我们在CAIL2019—婚姻家庭要素提取数据集抽取部分训练数据(训练集数据规模:500)进行实验,采用不同数据增强策略进行两倍数据增强(每条样本生成两条增强样本): + +| |Micro F1(%) | Macro F1(%) | +| ---------| ------------ |------------ | +|训练集(500)|84.43|50.01| +|训练集(500)+数据增强(×2, mix) |84.72|51.86| +|训练集(500)+支持增强集(×2, substitute) |84.50|53.23| +|训练集(500)+支持增强集(×2, insert) |**85.03**|53.54| +|训练集(500)+支持增强集(×2, delete) |84.74| **55.89**| +|训练集(500)+支持增强集(×2, swap) |84.44|52.50| diff --git a/applications/text_classification/multi_label/analysis/aug.py b/applications/text_classification/multi_label/analysis/aug.py new file mode 100644 index 0000000000000000000000000000000000000000..12f1e0fbcdbd33818b532b1443a985bbd9c2c15b --- /dev/null +++ b/applications/text_classification/multi_label/analysis/aug.py @@ -0,0 +1,80 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle + +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--train_path", type=str, default="../data/train.txt", help="Train dataset file name") +parser.add_argument("--aug_path", type=str, default="../data/aug.txt", help="Aug dataset file name") +parser.add_argument("--aug_strategy", choices=["mix", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--aug_type", choices=["synonym", "homonym", "mlm"], default='synonym', help="Select data augmentation type for substitute and insert") +parser.add_argument("--create_n", type=int, default=2, help="Number of augmented sequences.") +parser.add_argument("--aug_percent", type=float, default=0.1, help="Percentage of augmented words in sequences.") +parser.add_argument('--device', default="gpu", help="Select which device to do data augmentation strategy, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def aug(): + """Do data augmentation""" + if args.aug_strategy in ["mix", "substitute", "insert"] and args.aug_strategy == "mlm": + paddle.set_device(args.device) + + if args.aug_strategy in ["substitute", "insert", "delete", "swap"]: + if args.aug_strategy == "substitute": + aug = WordSubstitute(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert(args.aug_type, create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=args.create_n, aug_percent=args.aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=args.create_n, aug_percent=args.aug_percent) + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + augs = aug.augment(s) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + elif args.aug_strategy in ["mix"]: + aug = [ + WordSubstitute(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordInsert(args.aug_type, create_n=1, aug_percent=args.aug_percent), + WordDelete(create_n=1, aug_percent=args.aug_percent), + WordSwap(create_n=1, aug_percent=args.aug_percent), + ] + count = 0 + with open(args.train_path, "r", encoding="utf-8") as f1, open(args.aug_path, "w", encoding="utf-8") as f2: + for line in f1: + s, l = line.strip().split("\t") + for i in range(args.create_n): + i = count % len(aug) + augs = aug[i].augment(s) + if not isinstance(augs[0], str): + augs = augs[0] + count += 1 + for a in augs: + f2.write(a + "\t" + l + "\n") + f1.close(), f2.close() + + +if __name__ == "__main__": + aug() diff --git a/applications/text_classification/multi_label/analysis/dirty.py b/applications/text_classification/multi_label/analysis/dirty.py new file mode 100644 index 0000000000000000000000000000000000000000..394e597c7e280929f68ad38c6f188acdfa8ded50 --- /dev/null +++ 
b/applications/text_classification/multi_label/analysis/dirty.py @@ -0,0 +1,153 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import RepresenterPointModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--dirty_num", type=int, default=100, help="Number of dirty data. 
default:50") +parser.add_argument("--dirty_file", type=str, default="train_dirty.txt", help="Path to save dirty data.") +parser.add_argument("--rest_file", type=str, default="train_dirty_rest.txt", help="The path of rest data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dirty_threshold", type=float, default="0", help="The threshold to select dirty data.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + sentence, label = line.strip().split("\t") + yield {"text": sentence, "label": label} + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +def get_dirty_data(weight_matrix, dirty_num, threshold=0): + """ + Get index of dirty data from train data + """ + scores = [] + for idx in range(weight_matrix.shape[0]): + weight_sum = 0 + count = 0 + for weight in weight_matrix[idx].numpy(): + if weight > threshold: + count += 1 + weight_sum += weight + scores.append((count, weight_sum)) + sorted_scores = sorted(scores)[::-1] + sorted_idxs = sorted(range(len(scores)), key=lambda idx: scores[idx])[::-1] + + ret_scores = sorted_scores[:dirty_num] + ret_idxs = sorted_idxs[:dirty_num] + + return ret_idxs, ret_scores + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def run(): + """ + Get dirty data + """ + set_seed(args.seed) + paddle.set_device(args.device) + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + train_ds = train_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + rep_point = RepresenterPointModel(model, train_data_loader, classifier_layer_name="classifier") + weight_matrix = rep_point.weight_matrix + + # Save dirty data & rest data + dirty_indexs, _ = get_dirty_data(weight_matrix, args.dirty_num, args.dirty_threshold) + + dirty_path = os.path.join(args.dataset_dir, args.dirty_file) + rest_path = os.path.join(args.dataset_dir, args.rest_file) + + with open(dirty_path, "w") as f1, open(rest_path, "w") as f2: + for idx in range(len(train_ds)): + if idx in dirty_indexs: + f1.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + 
"\n") + else: + f2.write(train_ds.data[idx]["text"] + "\t" + train_ds.data[idx]["label"] + "\n") + + f1.close(), f2.close() + + +if __name__ == "__main__": + run() diff --git a/applications/text_classification/multi_label/analysis/evaluate.py b/applications/text_classification/multi_label/analysis/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..fd57a57cdf64891fa9eb136a42940dcb283dc104 --- /dev/null +++ b/applications/text_classification/multi_label/analysis/evaluate.py @@ -0,0 +1,170 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.io import BatchSampler, DataLoader +from sklearn.metrics import accuracy_score, classification_report + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to evaluate model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="Local dataset directory should include dev.txt and label.txt") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for evaluation.") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +parser.add_argument("--bad_case_file", type=str, default="bad_case.txt", help="Bad case saving file name") +args = parser.parse_args() +# yapf: enable + + +def preprocess_function(examples, tokenizer, max_seq_length, label_nums, is_test=False): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + if not is_test: + result["labels"] = [float(1) if i in examples["label"] else float(0) for i in range(label_nums)] + return result + + +def read_local_dataset(path, label_list): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + label = "" + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(",")] + yield {"text": sentence, "label": labels, "label_n": label} + + +@paddle.no_grad() +def evaluate(): + """ + Evaluate the model performance + """ + paddle.set_device(args.device) + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # load and preprocess dataset + label_path = os.path.join(args.dataset_dir, args.label_file) + dev_path = os.path.join(args.dataset_dir, args.dev_file) + + label_list = {} + label_map = {} + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + label_map[i] = l + dev_ds = load_dataset(read_local_dataset, path=dev_path, label_list=label_list, lazy=False) + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length, label_nums=len(label_list) + ) + dev_ds = dev_ds.map(trans_func) + + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + model.eval() + probs = [] + labels = [] + for batch in dev_data_loader: + label = batch.pop("labels") + logits = model(**batch) + labels.extend(label.numpy()) + probs.extend(F.sigmoid(logits).numpy()) + probs = np.array(probs) + labels = np.array(labels) + preds = probs > 0.5 + report = classification_report(labels, preds, digits=4, output_dict=True) + accuracy = accuracy_score(labels, preds) + + logger.info("-----Evaluate model-------") + logger.info("Dev dataset size: {}".format(len(dev_ds))) + logger.info("Accuracy in dev dataset: {:.2f}%".format(accuracy * 100)) + logger.info( + "Micro avg in dev dataset: precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report["micro avg"]["precision"] * 100, + report["micro avg"]["recall"] * 100, + report["micro avg"]["f1-score"] * 100, + ) + ) + logger.info( + "Macro avg in dev dataset: precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report["macro avg"]["precision"] * 100, + 
report["macro avg"]["recall"] * 100, + report["macro avg"]["f1-score"] * 100, + ) + ) + + for i in label_map: + logger.info("Class name: {}".format(label_map[i])) + logger.info( + "Evaluation examples in dev dataset: {}({:.1f}%) | precision: {:.2f} | recall: {:.2f} | F1 score {:.2f}".format( + report[str(i)]["support"], + 100 * report[str(i)]["support"] / len(dev_ds), + report[str(i)]["precision"] * 100, + report[str(i)]["recall"] * 100, + report[str(i)]["f1-score"] * 100, + ) + ) + logger.info("----------------------------") + bad_case_path = os.path.join(args.dataset_dir, args.bad_case_file) + with open(bad_case_path, "w", encoding="utf-8") as f: + f.write("Text\tLabel\tPrediction\n") + for i in range(len(preds)): + for p, l in zip(preds[i], labels[i]): + if (p and l == 0) or (not p and l == 1): + pred_n = [label_map[i] for i, pp in enumerate(preds[i]) if pp] + f.write(dev_ds.data[i]["text"] + "\t" + dev_ds.data[i]["label_n"] + "\t" + ",".join(pred_n) + "\n") + break + + f.close() + logger.info("Bad case in dev dataset saved in {}".format(bad_case_path)) + + return + + +if __name__ == "__main__": + evaluate() diff --git a/applications/text_classification/multi_label/analysis/sent_interpret.py b/applications/text_classification/multi_label/analysis/sent_interpret.py new file mode 100644 index 0000000000000000000000000000000000000000..1f0e4a88c190a158f948930a6e6c74c266f9f5f8 --- /dev/null +++ b/applications/text_classification/multi_label/analysis/sent_interpret.py @@ -0,0 +1,157 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--top_k", type=int, default=3, help="Top K important training data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--interpret_input_file", type=str, default="bad_case.txt", help="interpretation file name") +parser.add_argument("--interpret_result_file", type=str, default="sent_interpret.txt", help="interpreted file name") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if items[0] == "Text": + continue + if len(items) == 3: + yield {"text": items[0], "label": items[1], "predict": items[2]} + elif len(items) == 2: + yield {"text": items[0], "label": items[1], "predict": ""} + elif len(items) == 1: + yield {"text": items[0], "label": "", "predict": ""} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def find_positive_influence_data(): + + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + train_path = os.path.join(args.dataset_dir, args.train_file) + interpret_path = os.path.join(args.dataset_dir, args.interpret_input_file) + + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + interpret_ds = interpret_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + interpret_batch_sampler = BatchSampler(interpret_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + interpret_data_loader = DataLoader( + dataset=interpret_ds, batch_sampler=interpret_batch_sampler, collate_fn=collate_fn + ) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & 
select sparse data + analysis_result = [] + for batch in interpret_data_loader: + analysis_result += feature_sim(batch, sample_num=args.top_k) + with open(os.path.join(args.dataset_dir, args.interpret_result_file), "w") as f: + for i in range(len(analysis_result)): + f.write("text: " + interpret_ds.data[i]["text"] + "\n") + if "predict" in interpret_ds.data[i]: + f.write("predict label: " + interpret_ds.data[i]["predict"] + "\n") + if "label" in interpret_ds.data[i]: + f.write("label: " + interpret_ds.data[i]["label"] + "\n") + f.write("examples with positive influence\n") + for i, (idx, score) in enumerate(zip(analysis_result[i].pos_indexes, analysis_result[i].pos_scores)): + f.write( + "support{} text: ".format(i + 1) + + train_ds.data[idx]["text"] + + "\t" + + "label: " + + train_ds.data[idx]["label"] + + "\t" + + "score: " + + "{:.5f}".format(score) + + "\n" + ) + f.close() + + +if __name__ == "__main__": + find_positive_influence_data() diff --git a/applications/text_classification/multi_label/analysis/sparse.py b/applications/text_classification/multi_label/analysis/sparse.py new file mode 100644 index 0000000000000000000000000000000000000000..80feb4f661348014bd44e28ec69b4d6bcff9d189 --- /dev/null +++ b/applications/text_classification/multi_label/analysis/sparse.py @@ -0,0 +1,286 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import functools +import os +import random + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from trustai.interpretation import FeatureSimilarityModel + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, type=str, help="The dataset directory should include train.txt,dev.txt and test.txt files.") +parser.add_argument("--aug_strategy", choices=["duplicate", "substitute", "insert", "delete", "swap"], default='substitute', help="Select data augmentation strategy") +parser.add_argument("--annotate", action='store_true', help="Select unlabeled data for annotation") +parser.add_argument("--params_path", default="../checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--rationale_num_sparse", type=int, default=3, help="Number of rationales per example for sparse data.") +parser.add_argument("--rationale_num_support", type=int, default=6, help="Number of rationales per example for support data.") +parser.add_argument("--sparse_num", type=int, default=100, help="Number of sparse data.") +parser.add_argument("--support_threshold", type=float, default="0.7", help="The threshold to select support data.") +parser.add_argument("--support_num", type=int, default=100, help="Number of support data.") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +parser.add_argument("--unlabeled_file", type=str, default="data.txt", help="Unlabeled data filename") +parser.add_argument("--sparse_file", type=str, default="sparse.txt", help="Sparse data file name.") +parser.add_argument("--support_file", type=str, default="support.txt", help="support data file name.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """ + Set random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def read_local_dataset(path): + """ + Read dataset file + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 2: + yield {"text": items[0], "label": items[1]} + elif len(items) == 1: + yield {"text": items[0]} + else: + logger.info(line.strip()) + raise ValueError("{} should be in fixed format.".format(path)) + + +def preprocess_function(examples, tokenizer, max_seq_length): + """ + Preprocess dataset + """ + result = tokenizer(text=examples["text"], max_seq_len=max_seq_length) + return result + + +class LocalDataCollatorWithPadding(DataCollatorWithPadding): + """ + Convert the result of DataCollatorWithPadding from dict dictionary to a list + """ + + def __call__(self, features): + batch = super().__call__(features) + batch = list(batch.values()) + return batch + + +def get_sparse_data(analysis_result, sparse_num): + """ + Get sparse data + """ + idx_scores = {} + preds = [] + for i in range(len(analysis_result)): + scores = analysis_result[i].pos_scores + idx_scores[i] = sum(scores) / len(scores) + preds.append(analysis_result[i].pred_label) + + idx_socre_list = list(sorted(idx_scores.items(), key=lambda x: x[1]))[:sparse_num] + ret_idxs, ret_scores = list(zip(*idx_socre_list)) + return ret_idxs, ret_scores, preds + + +def find_sparse_data(): + """ + Find sparse data (lack of supports in train dataset) in dev dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + label_path = os.path.join(args.dataset_dir, args.label_file) + train_path = os.path.join(args.dataset_dir, 
args.train_file) + dev_path = os.path.join(args.dataset_dir, args.dev_file) + + label_list = {} + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + + train_ds = load_dataset(read_local_dataset, path=train_path, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=dev_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, train_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis & select sparse data + analysis_result = [] + for batch in dev_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_sparse) + sparse_indexs, sparse_scores, preds = get_sparse_data(analysis_result, args.sparse_num) + + # Save the sparse data + with open(os.path.join(args.dataset_dir, args.sparse_file), "w") as f: + for idx in sparse_indexs: + data = dev_ds.data[idx] + f.write(data["text"] + "\t" + str(data["label"]) + "\n") + f.close() + logger.info("Sparse data saved in {}".format(os.path.join(args.dataset_dir, args.sparse_file))) + logger.info("Average score in sparse data: {:.4f}".format(sum(sparse_scores) / len(sparse_scores))) + return os.path.join(args.dataset_dir, args.sparse_file) + + +def get_support_data(analysis_result, support_num, support_threshold=0.7): + """ + get support data + """ + ret_idxs = [] + ret_scores = [] + rationale_idx = 0 + try: + while len(ret_idxs) < support_num: + for n in range(len(analysis_result)): + score = analysis_result[n].pos_scores[rationale_idx] + if score > support_threshold: + idx = analysis_result[n].pos_indexes[rationale_idx] + if idx not in ret_idxs: + ret_idxs.append(idx) + ret_scores.append(score) + if len(ret_idxs) >= support_num: + break + + rationale_idx += 1 + except IndexError: + logger.error( + f"The index is out of range, please reduce support_num or increase support_threshold. Got {len(ret_idxs)} now." 
+ ) + + return ret_idxs, ret_scores + + +def find_support_data(): + """ + Find support data (which supports sparse data) from candidate dataset + """ + set_seed(args.seed) + paddle.set_device(args.device) + + # Define model & tokenizer + if os.path.exists(args.params_path): + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + else: + raise ValueError("The {} should exist.".format(args.params_path)) + + # Prepare & preprocess dataset + if args.annotate: + candidate_path = os.path.join(args.dataset_dir, args.unlabeled_file) + else: + candidate_path = os.path.join(args.dataset_dir, args.train_file) + + sparse_path = os.path.join(args.dataset_dir, args.sparse_file) + support_path = os.path.join(args.dataset_dir, args.support_file) + candidate_ds = load_dataset(read_local_dataset, path=candidate_path, lazy=False) + sparse_ds = load_dataset(read_local_dataset, path=sparse_path, lazy=False) + trans_func = functools.partial(preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + candidate_ds = candidate_ds.map(trans_func) + sparse_ds = sparse_ds.map(trans_func) + + # Batchify dataset + collate_fn = LocalDataCollatorWithPadding(tokenizer) + candidate_batch_sampler = BatchSampler(candidate_ds, batch_size=args.batch_size, shuffle=False) + sparse_batch_sampler = BatchSampler(sparse_ds, batch_size=args.batch_size, shuffle=False) + candidate_data_loader = DataLoader( + dataset=candidate_ds, batch_sampler=candidate_batch_sampler, collate_fn=collate_fn + ) + sparse_data_loader = DataLoader(dataset=sparse_ds, batch_sampler=sparse_batch_sampler, collate_fn=collate_fn) + + # Classifier_layer_name is the layer name of the last output layer + feature_sim = FeatureSimilarityModel(model, candidate_data_loader, classifier_layer_name="classifier") + # Feature similarity analysis + analysis_result = [] + for batch in sparse_data_loader: + analysis_result += feature_sim(batch, sample_num=args.rationale_num_support) + + support_indexs, support_scores = get_support_data(analysis_result, args.support_num, args.support_threshold) + + # Save the support data + if args.annotate or args.aug_strategy == "duplicate": + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + if "label" in data: + f.write(data["text"] + "\t" + data["label"] + "\n") + else: + f.write(data["text"] + "\n") + f.close() + else: + create_n = 1 + aug_percent = 0.1 + if args.aug_strategy == "substitute": + aug = WordSubstitute("synonym", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "insert": + aug = WordInsert("synonym", create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "delete": + aug = WordDelete(create_n=create_n, aug_percent=aug_percent) + elif args.aug_strategy == "swap": + aug = WordSwap(create_n=create_n, aug_percent=aug_percent) + + with open(support_path, "w") as f: + for idx in list(support_indexs): + data = candidate_ds.data[idx] + augs = aug.augment(data["text"]) + if not isinstance(augs[0], str): + augs = augs[0] + for a in augs: + f.write(a + "\t" + data["label"] + "\n") + f.close() + logger.info("support data saved in {}".format(support_path)) + logger.info("support average scores: {:.4f}".format(float(sum(support_scores)) / len(support_scores))) + + +if __name__ == "__main__": + find_sparse_data() + find_support_data() diff --git a/applications/text_classification/multi_label/analysis/word_interpret.ipynb 
b/applications/text_classification/multi_label/analysis/word_interpret.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..936628db5fdcaad0808b26f41f713159b66b04b3 --- /dev/null +++ b/applications/text_classification/multi_label/analysis/word_interpret.ipynb @@ -0,0 +1,380 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 词级别可解释性分析\n", + "本项目提供模型的词级别可解释性分析,包括LIME、Integrated Gradient、GradShap 三种分析方法,支持分析微调后模型的预测结果,开发者可以通过更改**数据目录**和**模型目录**在自己的任务中使用此项目进行数据分析。\n", + "\n", + "![image](https://user-images.githubusercontent.com/63761690/192739675-63145d59-23c6-416f-bf71-998fd4995254.png)\n", + "\n", + "## 1.导入Python模块与参数配置\n", + "首先我们导入必要的导入必要python模块和设置配置参数,词级别可解释性分析算法支持三种待分析的文本 `INTERPRETER_FILE` 数据文件格式:\n", + "\n", + "**格式一:包括文本、标签、预测结果**\n", + "```text\n", + "<文本>'\\t'<标签>'\\t'<预测结果>\n", + "...\n", + "```\n", + "\n", + "**格式二:包括文本、标签**\n", + "```text\n", + "<文本>'\\t'<标签>\n", + "...\n", + "```\n", + "\n", + "**格式三:只包括文本**\n", + "```text\n", + "<文本>\n", + "准予原告胡某甲与被告韩某甲离婚。\n", + "...\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "grep: warning: GREP_OPTIONS is deprecated; please use an alias or script\n", + "/usr/local/lib/python3.7/dist-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n", + "/usr/local/lib/python3.7/dist-packages/paddlenlp/transformers/image_utils.py:213: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.\n", + " resample=Image.BILINEAR,\n", + "/usr/local/lib/python3.7/dist-packages/paddlenlp/transformers/image_utils.py:379: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.\n", + " resample=Image.NEAREST,\n", + "/usr/local/lib/python3.7/dist-packages/paddlenlp/transformers/ernie_vil/feature_extraction.py:65: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.\n", + " resample=Image.BICUBIC,\n", + "/usr/local/lib/python3.7/dist-packages/paddlenlp/transformers/clip/feature_extraction.py:64: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). 
Use Resampling.BICUBIC instead.\n", + " resample=Image.BICUBIC,\n" + ] + } + ], + "source": [ + "import functools\n", + "import random\n", + "import os\n", + "import argparse\n", + "\n", + "import jieba\n", + "import numpy as np\n", + "from trustai.interpretation import VisualizationTextRecord\n", + "from trustai.interpretation import get_word_offset\n", + "import paddle\n", + "from paddle.io import DataLoader, BatchSampler\n", + "from paddlenlp.data import DataCollatorWithPadding\n", + "from paddlenlp.datasets import load_dataset\n", + "from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# 预先定义配置参数\n", + "\n", + "# 运行环境,可选\"cpu\",\"gpu\",\"gpu:x\"(x为gpu编号)\n", + "DEVICE = \"gpu\"\n", + "# 数据路径\n", + "DATASET_DIR = \"../data\" \n", + "# 训练模型保存路径\n", + "PARAM_PATH = \"../checkpoint/\" \n", + "# tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数\n", + "MAX_LENGTH = 128 \n", + "# 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数\n", + "BATCH_SIZE = 1 \n", + "# 待分析解释的数据\n", + "INTERPRETER_FILE = \"bad_case.txt\"\n", + "# 可选 \"ig\",\"lime\",\"grad\" ,可以根据实际任务效果选择解释器\n", + "# \"grad\":GradShap方法依赖interpretdl\n", + "# !pip install interpretdl\n", + "INTERPRETER = \"ig\"\n", + "# 分析句子中TOP K关键词,K值\n", + "KEY_WORDS_NUM = 5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2.读取待分析数据" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "def read_local_dataset(path):\n", + " \"\"\"\n", + " Read dataset file\n", + " \"\"\"\n", + " with open(path, 'r', encoding='utf-8') as f:\n", + " for line in f:\n", + " items = line.strip().split('\\t')\n", + " if items[0] == 'Text':\n", + " continue\n", + " items[0] = items[0][:MAX_LENGTH-2]\n", + " if len(items) == 3:\n", + " yield {'text': items[0], 'label': items[1], 'predict': items[2]}\n", + " elif len(items) == 2:\n", + " yield {'text': items[0], 'label': items[1], 'predict': ''}\n", + " elif len(items) == 1:\n", + " yield {'text': items[0], 'label': '', 'predict': ''}\n", + " else:\n", + " raise ValueError(\"{} should be in fixed format.\".format(path))\n", + "\n", + "def preprocess_function(examples, tokenizer, max_seq_length):\n", + " \"\"\"\n", + " Preprocess dataset\n", + " \"\"\"\n", + " result = tokenizer(text=examples[\"text\"], max_seq_len=max_seq_length)\n", + " return result\n", + "\n", + "class LocalDataCollatorWithPadding(DataCollatorWithPadding):\n", + " \"\"\"\n", + " Convert the result of DataCollatorWithPadding from dict dictionary to a list\n", + " \"\"\"\n", + "\n", + " def __call__(self, features):\n", + " batch = super().__call__(features)\n", + " batch = list(batch.values())\n", + " return batch" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[32m[2022-09-28 04:51:03,566] [ INFO]\u001b[0m - We are using to load '../checkpoint/'.\u001b[0m\n", + "W0928 04:51:03.570216 4827 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2\n", + "W0928 04:51:03.575362 4827 gpu_resources.cc:91] device: 0, cuDNN Version: 8.1.\n", + "\u001b[32m[2022-09-28 04:51:06,542] [ INFO]\u001b[0m - We are using to load '../checkpoint/'.\u001b[0m\n" + ] + } + ], + "source": [ + "paddle.set_device(DEVICE)\n", + "\n", + "# Define model & 
tokenizer\n", + "if os.path.exists(PARAM_PATH):\n", + " model = AutoModelForSequenceClassification.from_pretrained(PARAM_PATH)\n", + " tokenizer = AutoTokenizer.from_pretrained(PARAM_PATH)\n", + "else:\n", + " raise ValueError(\"The {} should exist.\".format(PARAM_PATH))\n", + "\n", + "# Prepare & preprocess dataset\n", + "interpret_path = os.path.join(DATASET_DIR, INTERPRETER_FILE)\n", + "\n", + "\n", + "interpret_ds = load_dataset(read_local_dataset, path=interpret_path, lazy=False)\n", + "trans_func = functools.partial(preprocess_function,\n", + " tokenizer=tokenizer,\n", + " max_seq_length=MAX_LENGTH)\n", + "\n", + "interpret_ds = interpret_ds.map(trans_func)\n", + "\n", + "# Batchify dataset\n", + "collate_fn = LocalDataCollatorWithPadding(tokenizer)\n", + "interpret_batch_sampler = BatchSampler(interpret_ds,\n", + " batch_size=BATCH_SIZE,\n", + " shuffle=False)\n", + "interpret_data_loader = DataLoader(dataset=interpret_ds,\n", + " batch_sampler=interpret_batch_sampler,\n", + " collate_fn=collate_fn)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.开始数据可解释性分析\n", + "数据量较大时,数据分析时间较长,请耐心等待" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Start token level interpretion, it will take some time...\n", + "Building prefix dict from the default dictionary ...\n", + "Loading model from cache /tmp/jieba.cache\n", + "Loading model cost 0.751 seconds.\n", + "Prefix dict has been built successfully.\n", + "Start word level alignment, it will take some time...\n" + ] + } + ], + "source": [ + "# Init an interpreter\n", + "if INTERPRETER == 'ig':\n", + " from trustai.interpretation.token_level import IntGradInterpreter\n", + " interpreter = IntGradInterpreter(model)\n", + "elif INTERPRETER == 'lime':\n", + " from trustai.interpretation.token_level import LIMEInterpreter\n", + " interpreter = LIMEInterpreter(model, unk_id=tokenizer.convert_tokens_to_ids('[UNK]'), pad_id=tokenizer.convert_tokens_to_ids('[PAD]'))\n", + "else:\n", + " from trustai.interpretation.token_level import GradShapInterpreter\n", + " interpreter = GradShapInterpreter(model)\n", + "\n", + "# Use interpreter to get the importance scores for all data\n", + "print(\"Start token level interpretion, it will take some time...\")\n", + "analysis_result = []\n", + "for batch in interpret_data_loader:\n", + " analysis_result += interpreter(tuple(batch))\n", + "\n", + "# Add CLS and SEP tags to both original text and standard splited tokens\n", + "contexts = []\n", + "words = []\n", + "for i in range(len(interpret_ds)):\n", + " text = interpret_ds.data[i][\"text\"]\n", + " contexts.append(\"[CLS]\" + text + \"[SEP]\")\n", + " words.append([\"[CLS]\"] + list(jieba.cut(text)) + [\"[SEP]\"])\n", + "\n", + "# Get the offset map of tokenized tokens and standard splited tokens\n", + "print(\"Start word level alignment, it will take some time...\")\n", + "ori_offset_maps = []\n", + "word_offset_maps = []\n", + "for i in range(len(contexts)):\n", + " ori_offset_maps.append(tokenizer.get_offset_mapping(contexts[i]))\n", + " word_offset_maps.append(get_word_offset(contexts[i], words[i]))\n", + "\n", + "align_res = interpreter.alignment(analysis_result, contexts, words, word_offset_maps, ori_offset_maps, special_tokens=[\"[CLS]\", '[SEP]'],rationale_num=KEY_WORDS_NUM)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.数据可解释性分析结果可视化\n", + "使用用颜色深浅可视化方式代表句子中词对预测结果的重要程度" + ] + }, + 
{ + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.core.display import display, HTML\n", + "class Visualization(VisualizationTextRecord):\n", + "\n", + " def __init__(self, interpret_res, true_label=None, pred_label=None, words=None):\n", + " if words is not None:\n", + " self.words = words\n", + " else:\n", + " self.words = interpret_res.words\n", + " self.pred_label = pred_label if pred_label is not None else ''\n", + " self.true_label = true_label if true_label is not None else ''\n", + " self.key_words = \" \".join(set(interpret_res.rationale_tokens))\n", + " word_attributions = interpret_res.word_attributions\n", + " _max = max(word_attributions)\n", + " _min = min(word_attributions)\n", + " self.word_attributions = [(word_imp - _min) / (_max - _min) for word_imp in word_attributions]\n", + "\n", + " def record_html(self):\n", + " \"\"\"change all informations to html\"\"\"\n", + " return \"\".join([\n", + " \"\",\n", + " self._format_class(self.true_label),\n", + " self._format_class(self.pred_label),\n", + " self._format_class(self.key_words),\n", + " self._format_word_attributions(),\n", + " \"\",\n", + " ])\n", + " def _format_class(self, label):\n", + " return '{label}'.format(label=label)\n", + "\n", + "def visualize_text(text_records):\n", + " \"\"\"visualize text\"\"\"\n", + " html = [\"\"]\n", + " rows = [\"\"\n", + " \"\"\n", + " \"\"\n", + " \"\"]\n", + " for record in text_records:\n", + " rows.append(record.record_html())\n", + " html.append(\"\".join(rows))\n", + " html.append(\"
LabelPredictionKey wordsImportant visualization
\")\n", + " html = HTML(\"\".join(html))\n", + " display(html)\n", + " return html.data\n", + "\n", + "\n", + "def visualize(interpret_res, ds):\n", + " records = []\n", + " for i in range(len(interpret_res)):\n", + " records.append(Visualization(interpret_res[i], true_label=ds.data[i][\"label\"], pred_label=ds.data[i][\"predict\"]))\n", + " html = visualize_text(records)\n", + " return html" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
Label: 不履行家庭义务,婚后分居 | Prediction: 婚后分居 | Key words: 至今 双方 出 分居 。 | Important visualization: [CLS] 2015 2 23 被告 原告 家门 原告 居住 娘家 待产 双方 分居 至今 [SEP]
Label: 婚后有子女,限制行为能力子女抚养 | Prediction: 婚后有子女,限制行为能力子女抚养,不履行离婚协议 | Key words: 财产 符合 付清 欠条 抚养 | Important visualization: [CLS] 被告 孙某 辩称 离婚 协议 关于 财产 分割 给付 资金 符合 法律 规定 只有 离婚 子女 抚养 符合 法律 规定 没有 协议 代表 被告 真实 意思 表示 离婚 协议 没有 约定 付款 时间 而且 被告 原告 出具 欠条 5 年内 付清 原告 期满 起诉 驳回 [SEP]
Label: 存在非婚生子,支付抚养费,限制行为能力子女抚养 | Prediction: 限制行为能力子女抚养,存在非婚生子 | Key words: 赵某 并非 认可 之女 表示 | Important visualization: [CLS] 被告 董某 认可 赵某 并非 原告 之女 表示 愿意 自行 抚养 赵某 [SEP]
Label: 准予离婚 | Prediction: 准予离婚,法定离婚 | Key words: 原告 韩某 准予 离婚 。 | Important visualization: [CLS] 准予 原告 胡某 被告 韩某 离婚 [SEP]
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# process for vbisualize\n", + "html = visualize(align_res, interpret_ds)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.7.13 64-bit", + "metadata": { + "interpreter": { + "hash": "767d51c1340bd893661ea55ea3124f6de3c7a262a8b4abca0554b478b1e2ff90" + } + }, + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.13-final" + }, + "orig_nbformat": 2 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/README.md b/applications/text_classification/multi_label/deploy/paddle_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2b7ef12d709fc579fa8dfe5a9725789671243e81 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/paddle_serving/README.md @@ -0,0 +1,190 @@ +# 基于Paddle Serving的服务化部署 + +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具搭建多标签在线服务部署。 + +## 目录 +- [环境准备](#环境准备) +- [模型转换](#模型转换) +- [部署模型](#部署模型) + +## 环境准备 +需要准备PaddleNLP的运行环境和Paddle Serving的运行环境。 + +- python >= 3.6 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4 + +### 安装PaddlePaddle + + 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.3, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + + +### 安装PaddleNLP + +安装PaddleNLP默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以删去` -i https://mirror.baidu.com/pypi/simple` ,更多关于PaddleNLP安装的详细教程请查见[PaddleNLP快速安装](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + +```shell +python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple +``` +### 安装Paddle Serving +安装client和serving app,用于向服务发送请求: +``` +pip install paddle_serving_app paddle_serving_client +``` +安装serving,用于启动服务,根据服务器设备选择安装CPU server或GPU server: + +- 安装CPU server +```shell +pip install paddle_serving_server +``` +- 安装GPU server, 注意选择跟本地环境一致的命令 +```shell +# CUDA10.2 + Cudnn7 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +**NOTE:** +- 默认开启国内清华镜像源来加速下载,如果您使用 HTTP 代理可以关闭(-i https://pypi.tuna.tsinghua.edu.cn/simple) +- 更多wheel包请参考[serving官网文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Latest_Packages_CN.md) + +### 安装FastTokenizer文本处理加速库(可选) +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install fast-tokenizer-python +``` + + +## 模型转换 + +使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 + +用已安装的paddle_serving_client将静态图参数模型转换成serving格式。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型地址`dirname`,模型文件和参数名`model_filename`,`params_filename`根据实际填写即可。 + +```shell +python -m paddle_serving_client.convert --dirname ../../export --model_filename float32.pdmodel --params_filename float32.pdiparams +``` + +可以通过命令查参数含义: +```shell +python -m paddle_serving_client.convert --help +``` + +转换成功后的目录如下: +``` +paddle_serving/ 
+├──serving_server +│ ├── float32.pdiparams +│ ├── float32.pdmodel +│ ├── serving_server_conf.prototxt +│ └── serving_server_conf.stream.prototxt +└──serving_client + ├── serving_client_conf.prototxt + └── serving_client_conf.stream.prototxt +``` + +## 部署模型 + +serving目录包含启动pipeline服务和发送预测请求的代码和模型,包括: + +``` +serving/ +├──serving_server +│ ├── float32.pdiparams +│ ├── float32.pdmodel +│ ├── serving_server_conf.prototxt +│ └── serving_server_conf.stream.prototxt +├──config.yml # 分类任务启动服务端的配置文件 +├──rpc_client.py # 分类任务发送pipeline预测请求的脚本 +└──service.py # 分类任务启动服务端的脚本 +``` + +### 修改配置文件 +目录中的`config.yml`文件解释了每一个参数的含义,可以根据实际需要修改其中的配置。比如: +``` +# 修改模型目录为下载的模型目录或自己的模型目录: +model_config: serving_server => model_config: erine-3.0-tiny/serving_server + +# 修改rpc端口号 +rpc_port: 10231 => rpc_port: 9998 + +# 修改使用GPU推理为使用CPU推理: +device_type: 1 => device_type: 0 + +#开启MKLDNN加速 +#use_mkldnn: False => use_mkldnn: True + +#Fetch结果列表,以serving_client/serving_client_conf.prototxt中fetch_var的alias_name为准 +fetch_list: ["linear_147.tmp_1"] => fetch_list: ["linear_75.tmp_1"] +``` + +### 分类任务 +#### 启动服务 +修改好配置文件后,执行下面命令启动服务: +```shell +python service.py --max_seq_length 128 --model_name "ernie-3.0-medium-zh" +``` + +可支持配置的参数: +* `max_seq_length`:分词器tokenizer使用的最大序列长度,ERNIE模型最大不能超过2048。请根据文本长度选择,通常推荐128、256或512,若出现显存不足,请适当调低这一参数;默认为128。 +* `model_name`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 + +输出打印如下: +``` +[DAG] Succ init +[PipelineServicer] succ init +...... +--- Running analysis [ir_graph_to_program_pass] +I0625 16:44:36.563802 40218 analysis_predictor.cc:1007] ======= optimize end ======= +I0625 16:44:36.571702 40218 naive_executor.cc:102] --- skip [feed], feed -> token_type_ids +I0625 16:44:36.571728 40218 naive_executor.cc:102] --- skip [feed], feed -> input_ids +I0625 16:44:36.574352 40218 naive_executor.cc:102] --- skip [linear_147.tmp_1], fetch -> fetch +[2022-06-25 16:44:37,545] [ WARNING] - Can't find the fast_tokenizer package, please ensure install fast_tokenizer correctly. You can install fast_tokenizer by `pip install fast-tokenizer-python`. +[2022-06-25 16:44:37,546] [ INFO] - We are using to load 'ernie-3.0-medium-zh'. +[2022-06-25 16:44:37,546] [ INFO] - Already cached /root/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_base_zh_vocab.txt +[OP Object] init success +W0625 16:45:40.312942 40218 gpu_context.cc:278] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.2 +W0625 16:45:40.316538 40218 gpu_context.cc:306] device: 3, cuDNN Version: 8.1. 
+``` + +#### 启动rpc client测试 +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器) +```shell +python rpc_client.py +``` +输出打印如下: +``` +data: 五松新村房屋是被告婚前购买的; +label: 婚前个人财产 +-------------------- +data: 被告于2016年3月将车牌号为皖B×××××出售了2.7万元,被告通过原告偿还了齐荷花人民币2.6万元,原、被告尚欠齐荷花2万元。 +label: 有夫妻共同财产,有夫妻共同债务 +-------------------- +data: 2、判令被告返还借婚姻索取的现金33万元,婚前个人存款10万元; +label: 婚前个人财产 +-------------------- +data: 一、判决原告于某某与被告杨某某离婚; +label: 准予离婚,法定离婚 +``` +#### 启动http client测试 +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器) +```shell +python http_client.py +``` +输出打印如下: +``` +data: 五松新村房屋是被告婚前购买的; +label: 婚前个人财产 +-------------------- +data: 被告于2016年3月将车牌号为皖B×××××出售了2.7万元,被告通过原告偿还了齐荷花人民币2.6万元,原、被告尚欠齐荷花2万元。 +label: 有夫妻共同财产,有夫妻共同债务 +-------------------- +data: 2、判令被告返还借婚姻索取的现金33万元,婚前个人存款10万元; +label: 婚前个人财产 +-------------------- +data: 一、判决原告于某某与被告杨某某离婚; +label: 准予离婚,法定离婚 +``` diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/config.yml b/applications/text_classification/multi_label/deploy/paddle_serving/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..62a1a3056b826619c7c640fcb9c426a2d96fc28f --- /dev/null +++ b/applications/text_classification/multi_label/deploy/paddle_serving/config.yml @@ -0,0 +1,59 @@ +#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 18090 + +#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 9878 + +#worker_num, 最大并发数。 +#当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +#当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 1 + +#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + #op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + + #重试次数 + retry: 1 + + #使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + use_profile: false + tracer: + interval_s: 10 + +op: + seq_cls: + #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + + #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + #client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + + #模型路径 + model_config: serving_server + + #Fetch结果列表,以client_config中fetch_var的alias_name为准 + fetch_list: ["linear_75.tmp_1"] + + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + + #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" + + #开启MKLDNN加速 + use_mkldnn: True + + #thread_num + thread_num: 12 + + #ir_optim + ir_optim: True + + #开启tensorrt后,进行优化的子图包含的最少节点数 + #min_subgraph_size: 10 \ No newline at end of file diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/http_client.py b/applications/text_classification/multi_label/deploy/paddle_serving/http_client.py new file mode 100644 index 0000000000000000000000000000000000000000..df4d2d9313bcd1f73f50a62d6e564d5b7cfcb10f --- /dev/null +++ b/applications/text_classification/multi_label/deploy/paddle_serving/http_client.py @@ -0,0 +1,74 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import json + +import numpy as np +import requests + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.server_url = server_url + + def Run(self, text, label_list): + sentence = np.array([t.encode("utf-8") for t in text], dtype=np.object_) + sentence = sentence.__repr__() + data = {"key": ["sentence"], "value": [sentence]} + data = json.dumps(data) + + ret = requests.post(url=self.server_url, data=data) + ret = ret.json() + for t, l in zip(text, eval(ret["value"][0])): + print("text: ", t) + label = ",".join([label_list[int(ll)] for ll in l.split(",")]) + print("label: ", label) + print("--------------------") + return + + +if __name__ == "__main__": + server_url = "http://127.0.0.1:9878/seq_cls/prediction" + runner = Runner(server_url) + text = [ + "五松新村房屋是被告婚前购买的;", + "被告于2016年3月将车牌号为皖B×××××出售了2.7万元,被告通过原告偿还了齐荷花人民币2.6万元,原、被告尚欠齐荷花2万元。", + "2、判令被告返还借婚姻索取的现金33万元,婚前个人存款10万元;", + "一、判决原告于某某与被告杨某某离婚;", + ] + label_list = [ + "婚后有子女", + "限制行为能力子女抚养", + "有夫妻共同财产", + "支付抚养费", + "不动产分割", + "婚后分居", + "二次起诉离婚", + "按月给付抚养费", + "准予离婚", + "有夫妻共同债务", + "婚前个人财产", + "法定离婚", + "不履行家庭义务", + "存在非婚生子", + "适当帮助", + "不履行离婚协议", + "损害赔偿", + "感情不和分居满二年", + "子女随非抚养权人生活", + "婚后个人财产", + ] + runner.Run(text, label_list) diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/rpc_client.py b/applications/text_classification/multi_label/deploy/paddle_serving/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..4146d4b591470254b687c815a40989327a3d25cd --- /dev/null +++ b/applications/text_classification/multi_label/deploy/paddle_serving/rpc_client.py @@ -0,0 +1,68 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
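+
+# Pipeline gRPC client for the multi-label classification service started by
+# service.py. It connects to the rpc_port configured in config.yml (18090 by
+# default), sends a batch of raw sentences to the "seq_cls" op, and maps the
+# returned comma-separated class ids back to readable label names via
+# label_list before printing each prediction.
+# Example usage (assuming the service is already running on this machine):
+#     python rpc_client.py
+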
+import numpy as np +from paddle_serving_server.pipeline import PipelineClient + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.client = PipelineClient() + self.client.connect([server_url]) + + def Run(self, data, label_list): + sentence = np.array([x.encode("utf-8") for x in data], dtype=np.object_) + ret = self.client.predict(feed_dict={"sentence": sentence}) + for d, l in zip(data, eval(ret.value[0])): + print("data: ", d) + label = ",".join([label_list[int(ll)] for ll in l.split(",")]) + print("label: ", label) + print("--------------------") + return + + +if __name__ == "__main__": + server_url = "127.0.0.1:18090" + runner = Runner(server_url) + text = [ + "五松新村房屋是被告婚前购买的;", + "被告于2016年3月将车牌号为皖B×××××出售了2.7万元,被告通过原告偿还了齐荷花人民币2.6万元,原、被告尚欠齐荷花2万元。", + "2、判令被告返还借婚姻索取的现金33万元,婚前个人存款10万元;", + "一、判决原告于某某与被告杨某某离婚;", + ] + label_list = [ + "婚后有子女", + "限制行为能力子女抚养", + "有夫妻共同财产", + "支付抚养费", + "不动产分割", + "婚后分居", + "二次起诉离婚", + "按月给付抚养费", + "准予离婚", + "有夫妻共同债务", + "婚前个人财产", + "法定离婚", + "不履行家庭义务", + "存在非婚生子", + "适当帮助", + "不履行离婚协议", + "损害赔偿", + "感情不和分居满二年", + "子女随非抚养权人生活", + "婚后个人财产", + ] + runner.Run(text, label_list) diff --git a/applications/text_classification/multi_label/deploy/paddle_serving/service.py b/applications/text_classification/multi_label/deploy/paddle_serving/service.py new file mode 100644 index 0000000000000000000000000000000000000000..cac5558a62865d1eb3b9795a0e36cfe96079ad1a --- /dev/null +++ b/applications/text_classification/multi_label/deploy/paddle_serving/service.py @@ -0,0 +1,106 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging + +import numpy as np +from paddle_serving_server.web_service import Op, WebService + +from paddlenlp.transformers import AutoTokenizer + +_LOGGER = logging.getLogger() + +FETCH_NAME_MAP = { + "ernie-1.0-large-zh-cw": "linear_291.tmp_1", + "ernie-3.0-xbase-zh": "linear_243.tmp_1", + "ernie-3.0-base-zh": "linear_147.tmp_1", + "ernie-3.0-medium-zh": "linear_75.tmp_1", + "ernie-3.0-mini-zh": "linear_75.tmp_1", + "ernie-3.0-micro-zh": "linear_51.tmp_1", + "ernie-3.0-nano-zh": "linear_51.tmp_1", + "ernie-2.0-base-en": "linear_147.tmp_1", + "ernie-2.0-large-en": "linear_291.tmp_1", + "ernie-m-base": "linear_147.tmp_1", + "ernie-m-large": "linear_291.tmp_1", +} + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +args = parser.parse_args() +# fmt: on + + +class Op(Op): + def init_op(self): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_fast=True) + # Output nodes may differ from model to model + # You can see the output node name in the conf.prototxt file of serving_server + self.fetch_names = [ + FETCH_NAME_MAP[args.model_name], + ] + + def preprocess(self, input_dicts, data_id, log_id): + # Convert input format + ((_, input_dict),) = input_dicts.items() + data = input_dict["sentence"] + if isinstance(data, str) and "array(" in data: + data = eval(data) + else: + _LOGGER.error("input value {}is not supported.".format(data)) + data = [i.decode("utf-8") for i in data] + + # tokenizer + pad + data = self.tokenizer( + data, + max_length=args.max_seq_length, + padding=True, + truncation=True, + return_position_ids=False, + return_attention_mask=False, + ) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], dtype="int64") + + return tokenized_data, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + + results = fetch_dict[self.fetch_names[0]] + results = np.array(results) + labels = [] + + for result in results: + label = [] + result = 1 / (1 + (np.exp(-result))) + for i, p in enumerate(result): + if p > 0.5: + label.append(str(i)) + labels.append(",".join(label)) + return {"label": labels}, None, "" + + +class Service(WebService): + def get_pipeline_response(self, read_op): + return Op(name="seq_cls", input_ops=[read_op]) + + +if __name__ == "__main__": + service = Service(name="seq_cls") + service.prepare_pipeline_config("config.yml") + service.run_service() diff --git a/applications/text_classification/multi_label/deploy/predictor/README.md b/applications/text_classification/multi_label/deploy/predictor/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7bc4e828c0df3d518e2e671bf7b3ddeca924d054 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/predictor/README.md @@ -0,0 +1,178 @@ +# 离线推理部署指南 + +**目录** + * [环境准备](#环境准备) + * [基于GPU部署推理样例](#基于GPU部署推理样例) + * [基于CPU部署推理样例](#基于CPU部署推理样例) + * [性能与精度测试](#性能与精度测试) + +## 环境准备 + +模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime,Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。如何使用[静态图导出脚本](../../export_model.py)将训练后的模型转为静态图模型详见[模型静态图导出](../../README.md),模型使用裁剪API进行裁剪之后会自动生成静态图模型。 + +如果基于GPU部署,请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +```shell +python -m pip install onnxruntime-gpu onnx onnxconverter-common==1.9.0 paddle2onnx==1.0.5 +``` + +如果基于CPU部署,请使用如下命令安装所需依赖: +```shell +python -m pip install onnxruntime +``` + +安装FastTokenizer文本处理加速库(可选) +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 +```shell +pip install fast-tokenizer-python +``` +## 基于GPU部署推理样例 + +请使用如下命令进行部署 +``` +python infer.py \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 
32 \ + --dataset_dir "../../data" +``` +多语言模型加上`--multilingual`,裁剪后的模型前缀为`--model_path_prefix ../../prune/width_mult_XXXX/pruned_model`。 + +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型,可选"ernie-1.0-large-zh-cw","ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en","ernie-m-base","ernie-m-large";默认为"ernie-3.0-medium-zh",根据实际使用的预训练模型选择。 +* `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `use_fp16`:选择是否开启FP16进行加速;默认为False。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `device`: 选用什么设备进行训练,可选cpu、gpu。 +* `device_id`: 选择GPU卡号;默认为0。 +* `perf`:选择进行模型性能和精度评估;默认为False。 +* `dataset_dir`:本地数据集地址,需包含data.txt, label.txt, test.txt/dev.txt(可选,如果启动模型性能和精度评估);默认为None。 +* `perf_dataset`:评估数据集,可选'dev'、'test',选择在开发集或测试集评估模型;默认为"dev"。 +* `multilingual`:是否为多语言任务(是否使用ERNIE M作为预训练模型);默认为False。 + +在GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,在包括V100、T4、A10、A100、GTX 20系列和30系列显卡等设备上可以开启FP16进行加速,在CPU或者CUDA计算能力 (CUDA Compute Capability) 小于7.0时开启不会带来加速效果。可以使用如下命令开启ONNXRuntime的FP16进行推理加速: + +``` +python infer.py \ + --use_fp16 \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +可以使用如下命令开启ONNXRuntime推理评估模型的性能和精度: + +``` +python infer.py \ + --perf \ + --perf_dataset 'dev' \ + --device "gpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +## 基于CPU部署推理样例 + +请使用如下命令进行部署 +``` +python infer.py \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型;默认为"ernie-3.0-medium-zh",中文数据集推荐使用"ernie-3.0-medium-zh"。 +* `max_seq_length`:ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `use_quantize`:选择是否开启INT8动态量化进行加速;默认为False。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为200。 +* `num_threads`:cpu线程数;默认为cpu的物理核心数量。 +* `device`: 选用什么设备进行训练,可选cpu、gpu。 +* `perf`:选择进行模型性能和精度评估;默认为False。 +* `dataset_dir`:本地数据集地址,需包含data.txt, label.txt, dev.txt/test.txt(可选,如果启动模型性能和精度评估);默认为None。 +* `perf_dataset`:评估数据集,选择在开发集或测试集评估模型;默认为"dev"。 + +可以使用如下命令开启ONNXRuntime的INT8动态量化进行推理加速: + +``` +python infer.py \ + --use_quantize \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + +**Note**:INT8动态量化与FP16相比精度损失较大,GPU部署建议使用FP16加速。 + +可以使用如下命令开启ONNXRuntime推理评估模型的性能和精度: + +``` +python infer.py \ + --perf \ + --perf_dataset 'dev' \ + --device "cpu" \ + --model_path_prefix "../../export/float32" \ + --model_name_or_path "ernie-3.0-medium-zh" \ + --max_seq_length 128 \ + --batch_size 32 \ + --dataset_dir "../../data" +``` + + +## 性能与精度测试 + + +测试配置如下: + +1. CAIL2019—婚姻家庭要素提取任务开发集 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.3.1 + +5. 
性能数据指标:latency。latency 测试方法:固定 batch size 为 32,GPU部署运行时间 total_time,计算 latency = total_time / total_samples + +6. 精度评价指标:Micro F1分数、Macro F1分数 + +| | Micro F1(%) | Macro F1(%) | latency(ms) | +| -------------------------- | ------------ | ------------- |------------- | +| ERNIE 3.0 Medium+FP32+GPU | 90.57|79.36| 1.46| +| ERNIE 3.0 Medium+FP16+GPU | 90.57| 79.36| 0.49| +| ERNIE 3.0 Medium+FP32+CPU | 90.57|79.36| 47.92 | +| ERNIE 3.0 Medium+INT8+CPU | 90.05 | 77.69| 34.24 | + + +经过FP16转化加速比达到3~4倍左右,精度变化较小,与FP16相比,INT8在线量化精度下降较大,加速比在1.5倍左右 diff --git a/applications/text_classification/multi_label/deploy/predictor/infer.py b/applications/text_classification/multi_label/deploy/predictor/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..fcf562cb6fb4a4b6c89c2fce783e2ccc06591892 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/predictor/infer.py @@ -0,0 +1,89 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import psutil +from predictor import Predictor + +from paddlenlp.datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") +parser.add_argument('--model_name_or_path', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") +parser.add_argument("--use_quantize", action='store_true', help="Whether to use quantization for acceleration, only takes effect when deploying on cpu.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu, only takes effect when deploying on cpu.") +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--device_id', default=0, help="Select which gpu device to train model.") +parser.add_argument("--perf", action='store_true', help="Whether to compute the latency and f1 score of the test set.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="The dataset directory including data.txt, taxonomy.txt, test.txt(optional, if evaluate the performance).") +parser.add_argument("--perf_dataset", choices=['dev', 'test'], default='dev', type=str, help="evaluate the performance on dev dataset or test dataset") +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') +args = parser.parse_args() +# yapf: enable + + +def read_local_dataset(path, label_list): + label_list_dict = {label_list[i]: i for i in range(len(label_list))} + with open(path, "r", encoding="utf-8") as f: + for line in f: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list_dict[l] for l in label.split(",")] + yield {"sentence": sentence, "label": labels} + + +if __name__ == "__main__": + + label_list = [] + label_dir = os.path.join(args.dataset_dir, "label.txt") + with open(label_dir, "r", encoding="utf-8") as f: + lines = f.readlines() + for i, line in enumerate(lines): + label_list.append(line.strip()) + f.close() + + predictor = Predictor(args, label_list) + + if args.perf: + eval_dir = os.path.join(args.dataset_dir, "{}.txt".format(args.perf_dataset)) + eval_ds = load_dataset(read_local_dataset, path=eval_dir, label_list=label_list, lazy=False) + texts, labels = predictor.get_text_and_label(eval_ds) + + # preprocess & evaluate & latency + preprocess_result = predictor.preprocess(texts) + predictor.evaluate(preprocess_result, labels) + predictor.performance(preprocess_result) + else: + data = [] + data_dir = os.path.join(args.dataset_dir, "data.txt") + with open(data_dir, "r", encoding="utf-8") as f: + lines = f.readlines() + for i, line in enumerate(lines): + data.append(line.strip()) + f.close() + predictor.predict(data) diff --git a/applications/text_classification/multi_label/deploy/predictor/predictor.py b/applications/text_classification/multi_label/deploy/predictor/predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..d1b6a8785d182a70caf0384ba5cd18c02e18a961 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/predictor/predictor.py @@ -0,0 +1,229 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import numpy as np +import onnxruntime as ort +import paddle2onnx +from sklearn.metrics import f1_score + +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + + +class InferBackend(object): + def __init__( + self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, use_quantize=False, num_threads=10 + ): + logger.info(">>> [InferBackend] Creating Engine ...") + onnx_model = paddle2onnx.export( + model_file=model_path_prefix + ".pdmodel", + params_file=model_path_prefix + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + infer_model_dir = model_path_prefix.rsplit("/", 1)[0] + float_onnx_file = os.path.join(infer_model_dir, "model.onnx") + with open(float_onnx_file, "wb") as f: + f.write(onnx_model) + + if device == "gpu": + + logger.info(">>> [InferBackend] Use GPU to inference ...") + + if use_fp16: + logger.info(">>> [InferBackend] Use FP16 to inference ...") + import onnx + from onnxconverter_common import float16 + + fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx") + onnx_model = onnx.load_model(float_onnx_file) + trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True) + onnx.save_model(trans_model, fp16_model_file) + onnx_model = fp16_model_file + if use_quantize: + logger.info( + ">>> [InferBackend] use_quantize only takes effect when deploying on cpu, use_fp16 for acceleration when deploying on gpu ..." + ) + sess_options = ort.SessionOptions() + self.predictor = ort.InferenceSession( + onnx_model, + sess_options=sess_options, + providers=["CUDAExecutionProvider"], + provider_options=[{"device_id": device_id}], + ) + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + "The environment for GPU inference is not set properly. " + "A possible cause is that you had installed both onnxruntime and onnxruntime-gpu. " + "Please run the following commands to reinstall: \n " + "1) pip uninstall -y onnxruntime onnxruntime-gpu \n 2) pip install onnxruntime-gpu" + ) + else: + logger.info(">>> [InferBackend] Use CPU to inference ...") + if use_fp16: + logger.info( + ">>> [InferBackend] use_fp16 only takes effect when deploying on gpu, use_quantize for acceleration when deploying on cpu ..." 
+ ) + if use_quantize: + dynamic_quantize_model = os.path.join(infer_model_dir, "int8_model.onnx") + self.dynamic_quantize(float_onnx_file, dynamic_quantize_model) + onnx_model = dynamic_quantize_model + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=["CPUExecutionProvider"] + ) + + logger.info(">>> [InferBackend] Engine Created ...") + + def dynamic_quantize(self, input_float_model, dynamic_quantized_model): + from onnxruntime.quantization import quantize_dynamic + + quantize_dynamic(input_float_model, dynamic_quantized_model) + + def infer(self, input_dict: dict): + result = self.predictor.run(None, input_dict) + return result + + +def sigmoid_(x): + """ + compute sigmoid + """ + return 1 / (1 + np.exp(-x)) + + +class Predictor(object): + def __init__(self, args, label_list): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=True) + self.label_list = label_list + self.batch_size = args.batch_size + self.max_seq_length = args.max_seq_length + self.multilingual = args.multilingual + self.inference_backend = InferBackend( + args.model_path_prefix, args.device, args.device_id, args.use_fp16, args.use_quantize, args.num_threads + ) + + def preprocess(self, input_data: list): + + # tokenizer + pad + data = self.tokenizer( + input_data, + max_length=self.max_seq_length, + padding=True, + truncation=True, + return_position_ids=False, + return_attention_mask=False, + return_token_type_ids=not self.multilingual, + ) + tokenized_data = {} + for tokenizer_key in data: + tokenized_data[tokenizer_key] = np.array(data[tokenizer_key], dtype="int64") + return tokenized_data + + def postprocess(self, infer_data): + threshold = 0.5 + + sigmoid = np.vectorize(sigmoid_) + probs = sigmoid(infer_data) + labels = [] + + for prob in probs: + label = [] + + for i, p in enumerate(prob): + if p > threshold: + label.append(i) + + labels.append(label) + + return labels + + def infer(self, data): + infer_data = self.inference_backend.infer(data) + logits = np.array(infer_data[0]) + return logits + + def infer_batch(self, preprocess_result): + sample_num = len(preprocess_result["input_ids"]) + infer_result = None + for i in range(0, sample_num, self.batch_size): + batch_size = min(self.batch_size, sample_num - i) + preprocess_result_batch = {} + for tokenizer_key in preprocess_result: + preprocess_result_batch[tokenizer_key] = [ + preprocess_result[tokenizer_key][i + j] for j in range(batch_size) + ] + + result = self.infer(preprocess_result_batch) + if infer_result is None: + infer_result = result + else: + infer_result = np.append(infer_result, result, axis=0) + return infer_result + + def printer(self, result, input_data): + + for idx, text in enumerate(input_data): + labels = [] + logger.info("input data: {}".format(text)) + for r in result[idx]: + labels.append(self.label_list[r]) + logger.info("labels: {}".format(",".join(labels))) + logger.info("----------------------------") + + def predict(self, input_data: list): + preprocess_result = self.preprocess(input_data) + infer_result = self.infer_batch(preprocess_result) + result = self.postprocess(infer_result) + self.printer(result, input_data) + return + + def performance(self, preprocess_result): + nums = len(preprocess_result["input_ids"]) + + start = time.time() + self.infer_batch(preprocess_result) + total_time = time.time() - start + logger.info("sample nums: %s, time: %.2f, latency: %.2f ms" % 
(nums, total_time, 1000 * total_time / nums)) + return + + def evaluate(self, preprocess_result, labels): + + infer_result = self.infer_batch(preprocess_result) + sigmoid = np.vectorize(sigmoid_) + probs = sigmoid(infer_result) + preds = probs > 0.5 + micro_f1_score = f1_score(y_pred=preds, y_true=labels, average="micro") + macro_f1_score = f1_score(y_pred=preds, y_true=labels, average="macro") + logger.info("micro f1: %.2f, macro f1: %.2f" % (micro_f1_score * 100, macro_f1_score * 100)) + + return + + def get_text_and_label(self, ds): + """ + Return text and label list + """ + all_texts = [] + all_labels = [] + for ii in range(len(ds)): + all_texts.append(ds[ii]["sentence"]) + labels = [float(1) if i in ds[ii]["label"] else float(0) for i in range(len(self.label_list))] + all_labels.append(labels) + return all_texts, all_labels diff --git a/applications/text_classification/multi_label/deploy/simple_serving/README.md b/applications/text_classification/multi_label/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..583887019358cc2cb36abc8488f0b9bdc76a8249 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/simple_serving/README.md @@ -0,0 +1,41 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 +```shell +pip install paddlenlp --upgrade +``` +## Server服务启动 +### 分类任务启动 +#### 启动 分类 Server 服务 +```bash +paddlenlp server server:app --host 0.0.0.0 --port 8189 +``` +如果是ERNIE-M模型则启动 +```bash +paddlenlp server ernie_m_server:app --host 0.0.0.0 --port 8189 +``` +#### 分类任务发送服务 +```bash +python client.py +``` + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size`, `prob_limit` 参数 +```python + data = { + 'data': { + 'text': texts, + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size, + 'prob_limit': args.prob_limit + } + } +``` diff --git a/applications/text_classification/multi_label/deploy/simple_serving/client.py b/applications/text_classification/multi_label/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..de347a19654a31e5666bc5e75d693c7e426a8e9e --- /dev/null +++ b/applications/text_classification/multi_label/deploy/simple_serving/client.py @@ -0,0 +1,43 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
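+
+# Example SimpleServing client for the multi-label classification service: it posts a
+# batch of texts, together with the max_seq_len / batch_size / prob_limit parameters,
+# to http://0.0.0.0:8189/models/cls_multi_label and prints the JSON response.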
+ +import argparse +import requests +import json + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--prob_limit", default=0.5, type=float, help="The limitation of probability for the label.") +args = parser.parse_args() +# yapf: enable + +url = "http://0.0.0.0:8189/models/cls_multi_label" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = [ + "原、被告另购置橱柜、碗架、电磁炉、电饭锅各一个归原告王某某所有。", + "于是原告到儿子就读的幼儿园进行探望,被告碰见后对原告破口大骂,还不让儿子叫原告妈妈,而叫被告现在的妻子做妈妈。", + "由我全额出资购买的联想台式电脑,我均依次放弃。", + ] + data = { + "data": { + "text": texts, + }, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size, "prob_limit": args.prob_limit}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(json.loads(r.text)) diff --git a/applications/text_classification/multi_label/deploy/simple_serving/ernie_m_server.py b/applications/text_classification/multi_label/deploy/simple_serving/ernie_m_server.py new file mode 100644 index 0000000000000000000000000000000000000000..ba7d0cbaf23ceaa74bb25521f1d6905776e9c523 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/simple_serving/ernie_m_server.py @@ -0,0 +1,26 @@ +# coding:utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer +from paddlenlp.server import ERNIEMHandler, MultiLabelClassificationPostHandler + +app = SimpleServer() +app.register( + "models/cls_multi_label", + model_path="../../export", + tokenizer_name="ernie-m-base", + model_handler=ERNIEMHandler, + post_handler=MultiLabelClassificationPostHandler, +) diff --git a/applications/text_classification/multi_label/deploy/simple_serving/server.py b/applications/text_classification/multi_label/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..caec715de52b9f75a1edd13c8840920f671af3e1 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/simple_serving/server.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
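+
+# SimpleServing server definition for the multi-label task: it registers the exported
+# static-graph model under "models/cls_multi_label" with the ernie-3.0-medium-zh
+# tokenizer and the multi-label post-processing handler. As described in the README,
+# it is started with: paddlenlp server server:app --host 0.0.0.0 --port 8189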
+ +from paddlenlp import SimpleServer +from paddlenlp.server import CustomModelHandler, MultiLabelClassificationPostHandler + +app = SimpleServer() +app.register( + "models/cls_multi_label", + model_path="../../export", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=CustomModelHandler, + post_handler=MultiLabelClassificationPostHandler, +) diff --git a/applications/text_classification/multi_label/deploy/triton_serving/README.md b/applications/text_classification/multi_label/deploy/triton_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c8541c22ccf5d85ac856cc4c5477d9a06efbe163 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/README.md @@ -0,0 +1,188 @@ +# 基于Triton Inference Server的服务化部署指南 + +本文档将介绍如何使用[Triton Inference Server](https://github.com/triton-inference-server/server)工具部署基于ERNIE 3.0中文模型文本多标签分类的pipeline在线服务。 + +## 目录 +- [服务端环境准备](#服务端环境准备) +- [模型获取和转换](#模型获取和转换) +- [部署模型](#部署模型) +- [客户端请求](#客户端请求) + +## 服务端环境准备 + +### 安装Triton Server +拉取Triton Server镜像: +```shell +docker pull nvcr.io/nvidia/tritonserver:21.10-py3 +``` +启动容器: +```shell +docker run -it --gpus all --net=host --name triton_server -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash +``` + +**NOTE:** + +1. Triton版本号`21.10`可以根据自己的需求调整,各个Triton版本对应的Driver、CUDA、TRT和ONNX Runtime等后端版本可以参考[官网文档](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)。注意其中的`NVIDIA Driver`行,如果NVIDIA Driver低于文档中要求,在启动运行时会报错。 + +2. 可以使用`--gpus '"device=1"'`来指定GPU卡号,更多GPU指定方式请参见[Nvidia User Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#gpu-enumeration) + + +### 进入容器并准备PaddleNLP环境 +整个服务的前后处理依赖PaddleNLP,需要在容器内安装相关python包 + +进入容器: +```shell +docker exec -it triton_server bash +``` +安装PaddlePaddle、PaddleNLP +```shell +python3 -m pip install paddlepaddle-gpu paddlenlp -i https://mirror.baidu.com/pypi/simple +``` + +**NOTE:** + +1. 默认开启百度镜像源来加速下载,如果您使用 HTTP 代理可以关闭(-i https://mirror.baidu.com/pypi/simple) + +2. 环境中paddlepaddle-gpu或paddlepaddle版本应大于或等于2.2, 请参见[飞桨快速安装](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html)根据自己需求选择合适的PaddlePaddle下载命令。 + +3. 
更多关于PaddleNLP安装的详细教程请查看[Installation](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/get_started/installation.rst)。 + + +### 安装FastTokenizer文本处理加速库(可选) + +推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。 + +在容器内安装 fast_tokenizer +```shell +python3 -m pip install fast-tokenizer-python +``` + + +## 模型获取和转换 + +使用Triton做服务化部署时,选择ONNX Runtime后端运行需要先将模型转换成ONNX格式。 + + +首先将保存的动态图参数导出成静态图参数,具体代码见[静态图导出脚本](../../export_model.py),静态图参数保存在`output_path`指定路径中,裁剪API裁剪会自动保存静态图模型。运行方式: + +```shell +python ../../export_model.py --params_path=../../checkpoint/model_state.pdparams --output_path=./infer_model +``` + +使用Paddle2ONNX将Paddle静态图模型转换为ONNX模型格式的命令如下,以下命令成功运行后,将会在当前目录下生成model.onnx模型文件。 + +用Paddle2ONNX转换分类模型 +```shell +paddle2onnx --model_dir infer_model/ --model_filename float32.pdmodel --params_filename float32.pdiparams --save_file model.onnx --opset_version 13 --enable_onnx_checker True --enable_dev_version True +``` +创建空白目录/seqcls/1和seqcls_model/1,并将将转换好的ONNX模型移动到模型仓库目录 +```shell +mkdir /models/seqcls/1 +mkdir /models/seqcls_model/1 +mv model.onnx /models/seqcls_model/1 +``` + +Paddle2ONNX的命令行参数说明请查阅:[Paddle2ONNX命令行参数说明](https://github.com/PaddlePaddle/Paddle2ONNX#%E5%8F%82%E6%95%B0%E9%80%89%E9%A1%B9) + +模型下载转换好之后,models目录结构如下: +``` +models +├── seqcls +│   ├── 1 +│   └── config.pbtxt +├── seqcls_model +│   ├── 1 +│   │   └── model.onnx +│   └── config.pbtxt +├── seqcls_postprocess +│   ├── 1 +│   │   └── model.py +│   └── config.pbtxt +└── tokenizer + ├── 1 + │   └── model.py + └── config.pbtxt +``` + +模型配置文件config.pbtxt配置细节请参见[Triton Server Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md) + +## 部署模型 + +triton目录包含启动pipeline服务的配置和发送预测请求的代码,包括: + +``` +models # Triton启动需要的模型仓库,包含模型和服务配置文件 +seqcls_grpc_client.py # 分类任务发送pipeline预测请求的脚本 +``` + +### 启动服务端 + + +在容器内执行下面命令启动服务,默认启动models下所有模型: +```shell +tritonserver --model-repository=/models +``` +也可以通过设定参数只启动单一任务服务: +```shell +tritonserver --model-repository=/models --model-control-mode=explicit --load-model=seqcls +``` + +**NOTE:** + +启动服务时,Triton Server的每个python后端进程默认申请`64M`内存,默认启动的docker无法启动多个python后端节点。两个解决方案: + +1. 启动容器时设置`shm-size`参数, 比如:`docker run -it --net=host --name triton_server --shm-size="1g" -v /path/triton/models:/models nvcr.io/nvidia/tritonserver:21.10-py3 bash` + +2. 启动服务时设置python后端的`shm-default-byte-size`参数, 设置python后端的默认内存为10M: `tritonserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760` + +输出打印如下: + +``` +... +I0619 13:40:51.590901 5127 onnxruntime.cc:1999] TRITONBACKEND_Initialize: onnxruntime +I0619 13:40:51.590938 5127 onnxruntime.cc:2009] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.590947 5127 onnxruntime.cc:2015] 'onnxruntime' TRITONBACKEND API version: 1.6 +I0619 13:40:51.623808 5127 openvino.cc:1193] TRITONBACKEND_Initialize: openvino +I0619 13:40:51.623862 5127 openvino.cc:1203] Triton TRITONBACKEND API version: 1.6 +I0619 13:40:51.623868 5127 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.6 +I0619 13:40:52.980990 5127 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f14d8000000' with size 268435456 +... +I0619 13:43:33.360018 5127 server.cc:592] ++--------------------+---------+--------+ +| Model | Version | Status | ++--------------------+---------+--------+ +| seqcls | 1 | READY | +| seqcls_model | 1 | READY | +| seqcls_postprocess | 1 | READY | +| tokenizer | 1 | READY | ++--------------------+---------+--------+ +... 
+I0619 13:43:33.365824 5127 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001 +I0619 13:43:33.366221 5127 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000 +I0619 13:43:33.409775 5127 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002 +``` + +## 客户端请求 + +### 客户端环境准备 +客户端请求有两种方式,可以选择在本地执行脚本请求,或下载官方客户端镜像在容器中执行。 + +方式一:本地执行脚本,需要先安装依赖: +``` +pip install grpcio +pip install tritonclient==2.10.0 +``` + +方式二:拉取官网镜像并启动容器: + +```shell +docker pull nvcr.io/nvidia/tritonserver:21.10-py3-sdk +docker run -it --net=host --name triton_client -v /path/to/triton:/triton_code nvcr.io/nvidia/tritonserver:21.10-py3-sdk bash +``` + +### 启动客户端测试 +注意执行客户端请求时关闭代理,并根据实际情况修改main函数中的ip地址(启动服务所在的机器) + +``` +python seqcls_grpc_client.py +``` diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls/config.pbtxt b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..82261157aefe68bac9a1865d888c0257d2e905e8 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls/config.pbtxt @@ -0,0 +1,75 @@ +name: "seqcls" +platform: "ensemble" +max_batch_size: 64 +input [ + { + name: "INPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "label" + data_type: TYPE_INT64 + dims: [ 1 ] + }, + { + name: "confidence" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "tokenizer" + model_version: 1 + input_map { + key: "INPUT_0" + value: "INPUT" + } + output_map { + key: "OUTPUT_0" + value: "tokenizer_input_ids" + } + output_map { + key: "OUTPUT_1" + value: "tokenizer_token_type_ids" + } + }, + { + model_name: "seqcls_model" + model_version: 1 + input_map { + key: "input_ids" + value: "tokenizer_input_ids" + } + input_map { + key: "token_type_ids" + value: "tokenizer_token_type_ids" + } + output_map { + key: "linear_75.tmp_1" + value: "OUTPUT_2" + } + }, + { + model_name: "seqcls_postprocess" + model_version: 1 + input_map { + key: "POST_INPUT" + value: "OUTPUT_2" + } + output_map { + key: "POST_label" + value: "label" + } + output_map { + key: "POST_confidence" + value: "confidence" + } + } + ] +} + diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_model/config.pbtxt b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_model/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..d61a7fec6d5a2d266c1d5d43ff11638d445e2036 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_model/config.pbtxt @@ -0,0 +1,36 @@ +platform: "onnxruntime_onnx" +max_batch_size: 64 +input [ + { + name: "input_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "token_type_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] +output [ + { + name: "linear_75.tmp_1" + data_type: TYPE_FP32 + dims: [ 20 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_GPU + } +] + +optimization { + graph: {level: -1} +} + +parameters { key: "intra_op_thread_count" value: { string_value: "0" } } +parameters { key: "execution_mode" value: { string_value: "0" } } +parameters { key: "inter_op_thread_count" value: { string_value: "0" } } diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/1/model.py 
b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..5db7ef0c7746db295e9110817db6982704d6ac1b --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/1/model.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel(object): + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.txt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. 
The length of this list must + be the same as `requests` + """ + responses = [] + # print("num:", len(requests), flush=True) + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = 1 / (1 + (np.exp((-data[0])))) + + probs = [] + labels = [] + for l, p in enumerate(data): + if p > 0.5: + labels.append(l) + probs.append(p) + + labels = np.array(labels, dtype=self.output_dtype[0]) + probs = np.array(probs, dtype=self.output_dtype[1]) + # print(labels, probs) + out_tensor1 = pb_utils.Tensor(self.output_names[0], labels) + out_tensor2 = pb_utils.Tensor(self.output_names[1], probs) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..c625f28cc43537863fbeff341f94a7a3f9f60b31 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/seqcls_postprocess/config.pbtxt @@ -0,0 +1,31 @@ +name: "seqcls_postprocess" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "POST_INPUT" + data_type: TYPE_FP32 + dims: [ 20 ] + } +] + +output [ + { + name: "POST_label" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "POST_confidence" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/1/model.py b/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..97a96222075370ebcca531250dc6cc7c0e8fead0 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/1/model.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + +from paddlenlp.transformers import AutoTokenizer + + +class TritonPythonModel(object): + """Your Python model must use the same class name. 
Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration, config.pbtxt + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) + # You must parse model_config. JSON string is not parsed here + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + # print("num:", len(requests), flush=True) + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = [i[0].decode("utf-8") for i in data] + data = self.tokenizer(data, max_length=128, padding=True, truncation=True) + input_ids = np.array(data["input_ids"], dtype=self.output_dtype[0]) + token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) + + # print("input_ids:", input_ids) + # print("token_type_ids:", token_type_ids) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) + out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. 
+ """ + print("Cleaning up...") diff --git a/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/config.pbtxt b/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..d35d1f44968ba205b1890899a82568d33e90a999 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/models/tokenizer/config.pbtxt @@ -0,0 +1,31 @@ +name: "tokenizer" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT_0" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT_0" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "OUTPUT_1" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/applications/text_classification/multi_label/deploy/triton_serving/seqcls_grpc_client.py b/applications/text_classification/multi_label/deploy/triton_serving/seqcls_grpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..0f10e3a90d9f628114a94a858e996d6f593955c8 --- /dev/null +++ b/applications/text_classification/multi_label/deploy/triton_serving/seqcls_grpc_client.py @@ -0,0 +1,114 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from typing import Optional + +import numpy as np +from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput + +LOGGER = logging.getLogger("run_inference_on_triton") + + +class SyncGRPCTritonRunner: + DEFAULT_MAX_RESP_WAIT_S = 120 + + def __init__( + self, + server_url: str, + model_name: str, + model_version: str, + *, + verbose=False, + resp_wait_s: Optional[float] = None, + ): + self._server_url = server_url + self._model_name = model_name + self._model_version = model_version + self._verbose = verbose + self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s + + self._client = InferenceServerClient(self._server_url, verbose=self._verbose) + error = self._verify_triton_state(self._client) + if error: + raise RuntimeError(f"Could not communicate to Triton Server: {error}") + + LOGGER.debug( + f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " + f"are up and ready!" 
+ ) + + model_config = self._client.get_model_config(self._model_name, self._model_version) + model_metadata = self._client.get_model_metadata(self._model_name, self._model_version) + LOGGER.info(f"Model config {model_config}") + LOGGER.info(f"Model metadata {model_metadata}") + + self._inputs = {tm.name: tm for tm in model_metadata.inputs} + self._input_names = list(self._inputs) + self._outputs = {tm.name: tm for tm in model_metadata.outputs} + self._output_names = list(self._outputs) + self._outputs_req = [InferRequestedOutput(name) for name in self._outputs] + + def Run(self, inputs): + """ + Args: + inputs: list, Each value corresponds to an input name of self._input_names + Returns: + results: dict, {name : numpy.array} + """ + infer_inputs = [] + for idx, data in enumerate(inputs): + data = np.array([[x.encode("utf-8")] for x in data], dtype=np.object_) + infer_input = InferInput(self._input_names[idx], [len(data), 1], "BYTES") + infer_input.set_data_from_numpy(data) + infer_inputs.append(infer_input) + + results = self._client.infer( + model_name=self._model_name, + model_version=self._model_version, + inputs=infer_inputs, + outputs=self._outputs_req, + client_timeout=self._response_wait_t, + ) + results = {name: results.as_numpy(name) for name in self._output_names} + return results + + def _verify_triton_state(self, triton_client): + if not triton_client.is_server_live(): + return f"Triton server {self._server_url} is not live" + elif not triton_client.is_server_ready(): + return f"Triton server {self._server_url} is not ready" + elif not triton_client.is_model_ready(self._model_name, self._model_version): + return f"Model {self._model_name}:{self._model_version} is not ready" + return None + + +if __name__ == "__main__": + model_name = "seqcls" + model_version = "1" + url = "localhost:8001" + runner = SyncGRPCTritonRunner(url, model_name, model_version) + + texts = [ + ["五松新村房屋是被告婚前购买的;"], + ["被告于2016年3月将车牌号为皖B×××××出售了2.7万元,被告通过原告偿还了齐荷花人民币2.6万元,原、被告尚欠齐荷花2万元。"], + ["一、判决原告于某某与被告杨某某离婚;"], + ] + for text in texts: + # input format:[input1, input2 ... inputn], n = len(self._input_names) + result = runner.Run([text]) + print("text: ", text) + print("label: ", ",".join([str(r) for r in result["label"]])) + print("confidence: ", ",".join([str("%.3f" % c) for c in result["confidence"]])) + print("--------------------") diff --git a/applications/text_classification/multi_label/export_model.py b/applications/text_classification/multi_label/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..c57dc23372f9b934fbee6686092309cd5ef5b22a --- /dev/null +++ b/applications/text_classification/multi_label/export_model.py @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
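+
+# Export a fine-tuned dynamic-graph checkpoint to a static-graph inference model
+# (saved as <output_path>/float32.pdmodel and float32.pdiparams). ERNIE-M style
+# models (--multilingual) only take input_ids; other models also take token_type_ids.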
+ +import argparse +import os + +import paddle +from paddlenlp.transformers import AutoModelForSequenceClassification + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--multilingual', action='store_true', help='Whether is multilingual task') +parser.add_argument("--params_path", type=str, default='./checkpoint/', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + model.eval() + if args.multilingual: + input_spec = [paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids")] + else: + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + ] + # Convert to static graph with specific input description + model = paddle.jit.to_static(model, input_spec=input_spec) + + # Save in static graph model. + save_path = os.path.join(args.output_path, "float32") + paddle.jit.save(model, save_path) diff --git a/applications/text_classification/multi_label/few-shot/README.md b/applications/text_classification/multi_label/few-shot/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d291ac73308a4a36e809db7ca7702aacdc343216 --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/README.md @@ -0,0 +1,377 @@ +# 小样本场景下的多标签分类任务指南 + +**零样本/小样本文本分类推荐使用 UTC 模型,详情见[目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/zero_shot_text_classification),本项目将会在2.5.2版本下线。** + +## 目录 + +- [1. 项目说明](#项目说明) +- [2. 效果展示](#效果展示) +- [3. 定制训练](#定制训练) + - [3.1 运行环境](#运行环境) + - [3.2 代码结构](#代码结构) + - [3.3 数据标注](#数据标注) + - [3.4 模型训练](#模型训练) + - [3.5 模型评估](#模型评估) + - [3.6 模型部署](#模型部署) +- [4. References](#References) + + +## 1. 项目说明 + +本项目提供了小样本场景下文本多标签分类的解决方案,在 ERNIE3.0 的基础上利用提示学习取得比微调更好的分类效果,充分利用标注信息。 + +近年来,大量包含了案件事实及其适用法律条文信息的裁判文书逐渐在互联网上公开,海量的数据使自然语言处理技术的应用成为可能。现实中的案情错综复杂,案情描述通常涉及多个重要事实,以CAIL2019数据集中婚姻家庭领域的案情要素抽取为例: + +```text +"2013年11月28日原、被告离婚时自愿达成协议,婚生子张某乙由被告李某某抚养,本院以(2013)宝渭法民初字第01848号民事调解书对该协议内容予以了确认,该协议具有法律效力,对原、被告双方均有约束力。" +``` +该案件中涉及`婚后有子女`、`限制行为能力子女抚养`两项要素。接下来我们将讲解在小样本场景下如何利用多标签模型,对输入文本中进行案情重要要素抽取。 + +**文本多标签分类** 用于预测样本属于哪些标签类别,这些类别具有不相互排斥的属性,在商品分类、网页标签、新闻标注、蛋白质功能分类、电影分类、语义场景分类等现实场景中有着广泛应用。 +现有的主流解决方案是在预训练语言模型上进行微调,因为多标签分类任务与预训练阶段的掩码预测任务有着天然的差异,想要取得较好的分类效果往往需要大量数据标注。 + +**提示学习(Prompt Learning)** 的主要思想是将二/多分类任务转换为掩码预测任务,充分利用预训练语言模型学习到的特征,从而降低样本需求。以情感分类任务为例,标签分为`1-正向`,`0-负向`两类,如下图所示,通过提示`我[MASK]喜欢。`,原有`1-正向`,`0-负向`的标签被转化为了预测空格是`很`还是`不`。 + +
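+To make the idea above concrete, the following is a minimal, library-free sketch of the toy sentiment example from the previous paragraph. The template string, label-word mapping and helper functions here are illustrative assumptions only; the actual template and label words for this project are configured later through the `--prompt` argument and `label.txt`.
+
+```python
+# Toy illustration of prompt learning (assumed example, not this project's real template):
+# classification is recast as predicting the character at the [MASK] position.
+
+template = "{text}我[MASK]喜欢。"                # prompt appended to the input text
+label_words = {"1-正向": "很", "0-负向": "不"}   # verbalizer: one character per label
+
+def build_masked_input(text: str) -> str:
+    """Wrap the raw text with the prompt; the MLM head then fills in [MASK]."""
+    return template.format(text=text)
+
+def word_to_label(predicted_char: str) -> str:
+    """Map the character predicted at [MASK] back to the original label."""
+    inverse = {word: label for label, word in label_words.items()}
+    return inverse.get(predicted_char, "未知")
+
+print(build_masked_input("这部电影太好看了"))  # -> 这部电影太好看了我[MASK]喜欢。
+print(word_to_label("很"))                     # -> 1-正向
+```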
+ +
+ +微调方法和提示方法的区别如图所示: + +【微调学习】需要学习的参数是以 `[CLS]` 向量为输入,以负向/正向为输出的随机初始化的分类器。 + +【提示学习】通过构造提示,将原有的分类任务转化为掩码预测,即掩盖原句中的某个字,用模型预测该字。此时的分类器不再是随机初始化,而是利用了待预测字的预训练向量来初始化,充分利用了预训练模型学习到的参数。 + +【方案选择】对于标注样本充足的场景可以直接使用[微调学习](../README.md)实现文本多分类,对于尚无标注或者标注样本较少的任务场景我们推荐使用提示学习,以取得更好的效果。 + +### 方案特点 + +- **标注成本低**:以往的微调方式需要大量的数据标注才能保证模型分类效果。提示学习可以降低数据标注依赖,在小样本(few-shot)的场景下取得比微调更好的分类效果。 +- **全流程打通**:提供了从训练到部署的完整解决方案,可以低成本迁移至实际应用场景。 + + +## 2.效果展示 + +本项目中使用了 ERNIE3.0 模型,对于中文训练任务可以根据需求选择不同的预训练模型参数进行训练,我们测评了 Base 模型在婚姻家庭要素提取任务上的表现。测试配置如下: + +1. 数据集:CAIL2019—婚姻家庭要素提取任务小样本数据集测试集。 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.4rc + +4. PaddleNLP 版本:2.4.3 + +5. 评估设置 + +- 每个 epoch 评估一次,按照验证集上的评价指标,取分数最高的模型参数用于测试集的评估。表格中的最终结果为重复 10 次的均值。 +- 为了避免过拟合,这里使用了早停机制 (Early-stopping)。因为微调方式收敛较慢,且波动较大,我们将微调方式的早停步数增加为 10 步。 +- 测试脚本如下 + - 微调 + + ``` + cd ../ + python train.py --dataset_dir "./data/" --save_dir "./checkpoints" --max_seq_length 128 --model_name "ernie-3.0-base-zh" --batch_size 8 --learning_rate 3e-5 --epochs 100 --logging_steps 5 --early_stop --early_stop_num 10 + ``` + + - 提示学习 + + ``` + python train.py --data_dir ./data/ --output_dir ./checkpoints/ --prompt "这句话包含的要素有" --model_name_or_path ernie-3.0-base-zh --max_seq_length 128 --learning_rate 3e-5 --ppt_learning_rate 3e-4 --do_train --do_eval --num_train_epochs 100 --logging_steps 5 --per_device_eval_batch_size 32 --per_device_train_batch_size 8 --do_predict --metric_for_best_model macro_f1_score --load_best_model_at_end --eval_steps 100 --save_total_limit 1 + ``` + +6. 精度评价指标:Micro F1分数、Macro F1分数 + + | model_name | 训练方式 | Micro F1分数 | Macro F1分数 | + | ---------- | ------- | ----------- | ----------- | + | ernie-3.0-base-zh | 微调学习 | 0.7419 | 0.5105 | + | ernie-3.0-base-zh | 提示学习 | 0.7839 | 0.6003 | + + +## 3.定制训练 + +下边通过婚姻家庭要素提取的例子展示如何使用小样本学习来进行文本分类。 + + +### 3.1 运行环境 + +- python >= 3.7 +- paddlepaddle >= 2.4rc +- paddlenlp >= 2.4.3 +- paddle2onnx >= 1.0.3 + + +### 3.2 代码结构 + +```text +. +├── train.py # 模型组网训练脚本 +├── utils.py # 数据处理工具 +├── infer.py # 模型部署脚本 +└── README.md +``` + + +### 3.3 数据标注 + +我们推荐使用数据标注平台[doccano](https://github.com/doccano/doccano)进行自定义数据标注,本项目也打通了从标注到训练的通道,即doccano导出数据后可通过[doccano.py](../../doccano.py)脚本轻松将数据转换为输入模型时需要的形式,实现无缝衔接。标注方法的详细介绍请参考[doccano数据标注指南](../../doccano.md)。 + +**示例数据** + +这里我们使用CAIL2019—婚姻家庭要素提取任务数据集的子集作为示例数据集。该数据集中原始训练集包括 14377 条标注样本,我们按每条标签随机采样 4 条样本,得到 80 条样本数据作为训练集,剩余训练集数据作为测试集。可点击[这里](https://paddlenlp.bj.bcebos.com/datasets/few-shot/elements.tar.gz)下载解压并放入`./data/`文件夹,或者运行以下脚本 + +``` +wget https://paddlenlp.bj.bcebos.com/datasets/few-shot/elements.tar.gz +tar zxvf elements.tar.gz +mv elements data +``` + +**数据格式** + +下边主要介绍多标签分类任务自定义数据集的格式要求,整体目录如下 + +```text +data/ +├── train.txt # 训练数据集 +├── dev.txt # 验证数据集 +├── test.txt # 测试数据集(可选) +├── data.txt # 待预测数据(可选) +└── label.txt # 分类标签集 +``` + +**训练/验证/测试数据** + +对于训练/验证/测试数据集文件,每行数据表示一条样本,包括文本和标签两部分,由tab符`\t`分隔,多个标签以英文逗号`,`分隔。格式如下 +```text +<文本>'\t'<标签>','<标签>','<标签> +<文本>'\t'<标签>','<标签> +... +``` +例如,在婚姻家庭要素提取数据集中 +``` +现在原告已是第二次申请与被告离婚了。 二次起诉离婚 +双方均认可价值6万元。 不动产分割,有夫妻共同财产 +2004年4月,原、被告发生纠纷后,被告离家外出未归,直到现在,双方长期分居生活,十几年间互无联系,夫妻感情已经完全破裂。 婚后分居 +婚生子杨某甲由原告抚养,高中阶段之前的相关费用由原告承担,高中阶段之后的相关费用由双方协商,被告可以随时探望孩子; 婚后有子女,支付抚养费,限制行为能力子女抚养 +... +``` + +**预测数据** + +对于待预测数据文件,每行包含一条待预测样本,无标签。格式如下 +```text +<文本> +<文本> +... 
+``` +例如,在婚姻家庭要素提取数据集中 +``` +五松新村房屋是被告婚前购买的; +2、判令被告返还借婚姻索取的现金33万元,婚前个人存款10万元; +... +``` + +**标签数据** + +对于分类标签集文件,存储了数据集中所有的标签集合,每行为一个标签名。如果需要自定义标签映射用于分类器初始化,则每行需要包括标签名和相应的映射词,由`==`分隔。格式如下 +```text +<标签>'=='<映射词> +<标签>'=='<映射词> +... +``` +例如,对于婚姻家庭要素提取数据集,原标签字数较多,因此同一个标签依赖的输出也多。为了降低训练难度,我们可以将其映射为较短的短语 +``` +有夫妻共同债务==共同债务 +存在非婚生子==非婚生子 +... +``` +**Note**: 这里的标签映射词定义遵循的规则是,不同映射词尽可能长度一致,映射词和提示需要尽可能构成通顺的语句。越接近自然语句,小样本下模型训练效果越好。如果原标签名已经可以构成通顺语句,也可以不构造映射词,每行一个标签即可,即 +``` +有夫妻共同债务 +存在非婚生子 +... +``` + + +### 3.4 模型训练 + +**单卡训练** + +``` +python train.py \ +--data_dir ./data/ \ +--output_dir ./checkpoints/ \ +--prompt "这句话包含的要素有" \ +--model_name_or_path ernie-3.0-base-zh \ +--max_seq_length 128 \ +--learning_rate 3e-5 \ +--ppt_learning_rate 3e-4 \ +--do_train \ +--do_eval \ +--do_predict \ +--do_export \ +--num_train_epochs 100 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--metric_for_best_model macro_f1_score \ +--load_best_model_at_end \ +--evaluation_strategy epoch \ +--save_strategy epoch +``` +**多卡训练** + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus 0,1,2,3 train.py \ +--data_dir ./data/ \ +--output_dir ./checkpoints/ \ +--prompt "这句话包含的要素有" \ +--model_name_or_path ernie-3.0-base-zh \ +--max_seq_length 128 \ +--learning_rate 3e-5 \ +--ppt_learning_rate 3e-4 \ +--do_train \ +--do_eval \ +--do_predict \ +--do_export \ +--num_train_epochs 100 \ +--logging_steps 5 \ +--save_total_limit 1 \ +--per_device_eval_batch_size 32 \ +--per_device_train_batch_size 8 \ +--metric_for_best_model macro_f1_score \ +--load_best_model_at_end \ +--evaluation_strategy epoch \ +--save_strategy epoch +``` + +可配置参数说明: +- `model_name_or_path`: 内置模型名,或者模型参数配置目录路径。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 训练数据集路径,数据格式要求详见[数据标注](#数据标注)。 +- `output_dir`: 模型参数、训练日志和静态图导出的保存目录。 +- `prompt`: 提示模板。定义了如何将文本和提示拼接结合。 +- `soft_encoder`: 提示向量的编码器,`lstm`表示双向LSTM, `mlp`表示双层线性层, None表示直接使用提示向量。默认为`lstm`。 +- `use_rdrop`: 使用 [R-Drop](https://arxiv.org/abs/2106.14448) 策略。 +- `use_rgl`: 使用 [RGL](https://aclanthology.org/2022.findings-naacl.81/) 策略。 +- `encoder_hidden_size`: 提示向量的维度。若为None,则使用预训练模型字向量维度。默认为200。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `learning_rate`: 预训练语言模型参数基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。 +- `ppt_learning_rate`: 提示相关参数的基础学习率大小,当预训练参数不固定时,与其共用learning rate scheduler。一般设为`learning_rate`的十倍。 +- `do_train`: 是否进行训练。 +- `do_eval`: 是否进行评估。 +- `do_predict`: 是否进行预测。 +- `do_export`: 是否在运行结束时将模型导出为静态图,保存路径为`output_dir/export`。 +- `num_train_epochs`: 训练的最大轮数。 +- `max_steps`: 训练的最大步数。此设置将会覆盖`num_train_epochs`。 +- `save_total_limit`: 模型检查点保存数量。 +- `device`: 使用的设备,默认为`gpu`。 +- `eval_steps`: 评估模型的间隔步数。 +- `logging_steps`: 打印日志的间隔步数。 +- `per_device_train_batch_size`: 每次训练每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `per_device_eval_batch_size`: 每次评估每张卡上的样本数量。可根据实际GPU显存适当调小/调大此配置。 +- `load_best_model_at_end`: 是否在模型训练结束后加载评估指标最优的模型参数。 +- `evaluation_strategy`: 模型评估的间隔策略。若为`epoch`,则每轮训练结束后评估模型。 +- `save_strategy`: 模型保存的间隔策略。若为`epoch`,则每轮训练结束后保存当前模型参数。 + +更多参数介绍可参考[配置文件](https://paddlenlp.readthedocs.io/zh/latest/trainer.html)。 + + +### 3.5 模型评估 + +在模型训练时开启`--do_predict`,训练结束后直接在测试集上`test.txt`进行评估,也可以在训练结束后,通过运行以下命令加载模型参数进行评估: +``` +python train.py --do_predict --data_dir ./data --output_dir ./predict_checkpoint --resume_from_checkpoint ./checkpoints/ --max_seq_length 128 +``` + +可配置参数说明: + +- `data_dir`: 测试数据路径。测试数据应存放在该目录下`test.txt`文件中,每行一条待预测文本。 +- `output_dir`: 日志的保存目录。 +- `resume_from_checkpoint`: 
训练时模型参数的保存目录,用于加载模型参数。 +- `do_predict`: 是否进行测试集评估。 +- `max_seq_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 + + +### 3.6 模型部署 + +#### 模型导出 + +在训练结束后,需要将动态图模型导出为静态图参数用于部署推理。可以在模型训练时开启`--do_export`在训练结束后直接导出,也可以运行以下命令加载并导出训练后的模型参数,默认导出到在`output_dir`指定的目录下。 +``` +python train.py --do_export --data_dir ./data --output_dir ./export_checkpoint --resume_from_checkpoint ./checkpoints/ +``` + +可配置参数说明: + +- `data_dir`: 标签数据路径。 +- `output_dir`: 静态图模型参数和日志的保存目录。 +- `resume_from_checkpoint`: 训练时模型参数的保存目录,用于加载模型参数。 +- `do_export`: 是否将模型导出为静态图,保存路径为`output_dir/export`。 +- `export_type`: 模型导出的格式,默认为`paddle`,即导出静态图。 + +#### ONNXRuntime部署 + +**运行环境** + +模型转换与ONNXRuntime预测部署依赖Paddle2ONNX和ONNXRuntime,Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。 + +- 如果基于GPU部署,请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime-gpu onnx onnxconverter-common +``` + +- 如果基于CPU部署,请使用如下命令安装所需依赖: +```shell +pip install psutil +python -m pip install onnxruntime +``` + +**CPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device cpu +``` + +**GPU端推理样例** + +``` +python infer.py --model_path_prefix checkpoints/export/model --data_dir ./data --batch_size 32 --device gpu --device_id 0 +``` + +可配置参数说明: + +- `model_path_prefix`: 导出的静态图模型路径及文件前缀。 +- `model_name`: 内置预训练模型名,用于加载tokenizer。默认为`ernie-3.0-base-zh`。 +- `data_dir`: 待推理数据所在路径,数据应存放在该目录下的`data.txt`文件。 +- `max_length`: 最大句子长度,超过该长度的文本将被截断,不足的以Pad补全。提示文本不会被截断。 +- `batch_size`: 每次预测的样本数量。 +- `device`: 选择推理设备,包括`cpu`和`gpu`。默认为`gpu`。 +- `device_id`: 指定GPU设备ID。 +- `use_fp16`: 是否使用半精度加速推理。仅在GPU设备上有效。 +- `num_threads`: 设置CPU使用的线程数。默认为机器上的物理内核数。 + +**Note**: 在GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,在包括V100、T4、A10、A100、GTX 20系列和30系列显卡等设备上可以开启FP16进行加速,在CPU或者CUDA计算能力 (CUDA Compute Capability) 小于7.0时开启不会带来加速效果。 + + +## 4. References + +- Liu, Xiao, et al. "GPT understands, too." arXiv preprint arXiv:2103.10385 (2021). [[PDF]](https://arxiv.org/abs/2103.10385) +- Hambardzumyan, Karen, Hrant Khachatrian, and Jonathan May. "Warp: Word-level adversarial reprogramming." arXiv preprint arXiv:2101.00121 (2021). [[PDF]](https://arxiv.org/abs/2101.00121) +- Ding, Ning, et al. "Openprompt: An open-source framework for prompt-learning." arXiv preprint arXiv:2111.01998 (2021). [[PDF]](https://arxiv.org/abs/2111.01998) diff --git a/applications/text_classification/multi_label/few-shot/infer.py b/applications/text_classification/multi_label/few-shot/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..1de1407b9a2610ed51c00b40081f3bccc423fa63 --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/infer.py @@ -0,0 +1,227 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
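+
+# Offline ONNXRuntime inference for the prompt-tuned multi-label model: the exported
+# static graph is converted to ONNX via paddle2onnx, the prompt template and verbalizer
+# are rebuilt from template_config.json / verbalizer_config.json in the export
+# directory, and predictions are produced for data.txt under --data_dir.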
+ +import argparse +import json +import os + +import numpy as np +import onnxruntime as ort +import paddle2onnx +import psutil +import six + +from paddlenlp.prompt import AutoTemplate, PromptDataCollatorWithPadding +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") +parser.add_argument("--model_name", default="ernie-3.0-base-zh", type=str, help="The name of pretrained model.") +parser.add_argument("--data_dir", default=None, type=str, help="The path to the prediction data, including label.txt and data.txt.") +parser.add_argument("--max_length", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--use_fp16", action='store_true', help="Whether to use fp16 inference, only takes effect when deploying on gpu.") +parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu.") +parser.add_argument("--device", choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--device_id", default=0, help="Select which gpu device to train model.") +args = parser.parse_args() +# yapf: enable + + +class InferBackend(object): + def __init__(self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, num_threads=10): + + if not isinstance(device, six.string_types): + logger.error( + ">>> [InferBackend] The type of device must be string, but the type you set is: ", type(device) + ) + exit(0) + if device not in ["cpu", "gpu"]: + logger.error(">>> [InferBackend] The device must be cpu or gpu, but your device is set to:", type(device)) + exit(0) + + logger.info(">>> [InferBackend] Creating Engine ...") + + onnx_model = paddle2onnx.command.c_paddle_to_onnx( + model_file=model_path_prefix + ".pdmodel", + params_file=model_path_prefix + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + infer_model_dir = model_path_prefix.rsplit("/", 1)[0] + float_onnx_file = os.path.join(infer_model_dir, "model.onnx") + with open(float_onnx_file, "wb", encoding="utf-8") as f: + f.write(onnx_model) + + if device == "gpu": + logger.info(">>> [InferBackend] Use GPU to inference ...") + providers = ["CUDAExecutionProvider"] + if use_fp16: + logger.info(">>> [InferBackend] Use FP16 to inference ...") + import onnx + from onnxconverter_common import float16 + + fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx") + onnx_model = onnx.load_model(float_onnx_file) + trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True) + onnx.save_model(trans_model, fp16_model_file) + onnx_model = fp16_model_file + else: + logger.info(">>> [InferBackend] Use CPU to inference ...") + providers = ["CPUExecutionProvider"] + if use_fp16: + logger.warning( + ">>> [InferBackend] Ignore use_fp16 as it only " + "takes effect when deploying on gpu..." 
+ ) + + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=providers, provider_options=[{"device_id": device_id}] + ) + + if device == "gpu": + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + "The environment for GPU inference is not set properly. " + "A possible cause is that you had installed both onnxruntime and onnxruntime-gpu. " + "Please run the following commands to reinstall: \n " + "1) pip uninstall -y onnxruntime onnxruntime-gpu \n 2) pip install onnxruntime-gpu" + ) + logger.info(">>> [InferBackend] Engine Created ...") + + def infer(self, input_dict: dict): + result = self.predictor.run(None, input_dict) + return result + + +class MultiLabelPredictor(object): + def __init__(self, args): + self.args = args + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name) + self.model = AutoModelForMaskedLM.from_pretrained(args.model_name) + self.template, self.labels, self.input_handles = self.post_init() + self.collate_fn = PromptDataCollatorWithPadding( + self.tokenizer, padding=True, return_tensors="np", return_attention_mask=True + ) + + self.inference_backend = InferBackend( + self.args.model_path_prefix, + self.args.device, + self.args.device_id, + self.args.use_fp16, + self.args.num_threads, + ) + + def post_init(self): + export_path = os.path.dirname(self.args.model_path_prefix) + template_path = os.path.join(export_path, "template_config.json") + with open(template_path, "r", encoding="utf-8") as fp: + prompt = json.load(fp) + template = AutoTemplate.create_from(prompt, self.tokenizer, self.args.max_length, self.model) + keywords = template.extract_template_keywords(template.prompt) + inputs = ["input_ids", "token_type_ids", "position_ids", "attention_mask", "masked_positions"] + if "soft" in keywords: + inputs.append("soft_token_ids") + if "encoder" in keywords: + inputs.append("encoder_ids") + verbalizer_path = os.path.join(export_path, "verbalizer_config.json") + with open(verbalizer_path, "r", encoding="utf-8") as fp: + label_words = json.load(fp) + labels = sorted(list(label_words.keys())) + + return template, labels, inputs + + def predict(self, input_data: list): + encoded_inputs = self.preprocess(input_data) + infer_result = self.infer_batch(encoded_inputs) + result = self.postprocess(infer_result) + self.printer(result, input_data) + return result + + def _infer(self, input_dict): + infer_data = self.inference_backend.infer(input_dict) + return infer_data + + def infer_batch(self, inputs): + num_sample = len(inputs) + infer_data = None + num_infer_data = None + for index in range(0, num_sample, self.args.batch_size): + left, right = index, index + self.args.batch_size + batch_dict = self.collate_fn(inputs[left:right]) + input_dict = {} + for key in self.input_handles: + value = batch_dict[key] + if key == "attention_mask": + if value.ndim == 2: + value = (1 - value[:, np.newaxis, np.newaxis, :]) * -1e4 + elif value.ndim != 4: + raise ValueError("Expect attention mask with ndim=2 or 4, but get ndim={}".format(value.ndim)) + value = value.astype("float32") + else: + value = value.astype("int64") + input_dict[key] = value + results = self._infer(input_dict) + if infer_data is None: + infer_data = [[x] for x in results] + num_infer_data = len(results) + else: + for i in range(num_infer_data): + infer_data[i].append(results[i]) + for i in range(num_infer_data): + 
infer_data[i] = np.concatenate(infer_data[i], axis=0) + return infer_data + + def preprocess(self, input_data: list): + text = [{"text_a": x} for x in input_data] + inputs = [self.template(x) for x in text] + return inputs + + @staticmethod + def sigmoid(z): + return 1 / (1 + np.exp(-z)) + + def postprocess(self, infer_data): + threshold = 0.5 + probs = self.sigmoid(infer_data[0]) + label_ids = np.argwhere(probs > threshold) + labels = [[] for _ in range(probs.shape[0])] + for idx, label_id in label_ids: + labels[idx].append(self.labels[label_id]) + return {"label": labels} + + def printer(self, result, input_data): + label = result["label"] + for i in range(len(label)): + logger.info("input data: {}".format(input_data[i])) + logger.info("labels: {}".format(", ".join(label[i]))) + logger.info("-----------------------------") + + +if __name__ == "__main__": + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + predictor = MultiLabelPredictor(args) + + text_dir = os.path.join(args.data_dir, "data.txt") + with open(text_dir, "r", encoding="utf-8") as f: + text_list = [x.strip() for x in f.readlines()] + + predictor.predict(text_list) diff --git a/applications/text_classification/multi_label/few-shot/metric.py b/applications/text_classification/multi_label/few-shot/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..f41317ba76030105364c73843b6cf4fa9ad6d0bb --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/metric.py @@ -0,0 +1,81 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from paddle.metric import Metric +from sklearn.metrics import classification_report, f1_score + +from paddlenlp.utils.log import logger + + +class MetricReport(Metric): + """ + F1 score for multi-label text classification task. + """ + + def __init__(self, name="MetricReport", average="micro"): + super(MetricReport, self).__init__() + self.average = average + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. 
+ """ + self.y_prob = None + self.y_true = None + + def f1_score(self, y_prob): + """ + Compute micro f1 score and macro f1 score + """ + threshold = 0.5 + self.y_pred = y_prob > threshold + micro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="micro") + macro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="macro") + return micro_f1_score, macro_f1_score + + def update(self, probs, labels): + """ + Update the probability and label + """ + if self.y_prob is not None: + self.y_prob = np.append(self.y_prob, probs.numpy(), axis=0) + else: + self.y_prob = probs.numpy() + if self.y_true is not None: + self.y_true = np.append(self.y_true, labels.numpy(), axis=0) + else: + self.y_true = labels.numpy() + + def accumulate(self): + """ + Returns micro f1 score and macro f1 score + """ + micro_f1_score, macro_f1_score = self.f1_score(y_prob=self.y_prob) + return micro_f1_score, macro_f1_score + + def report(self): + """ + Returns classification report + """ + self.y_pred = self.y_prob > 0.5 + logger.info("classification report:\n" + classification_report(self.y_true, self.y_pred, digits=4)) + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/applications/text_classification/multi_label/few-shot/requirements_cpu.txt b/applications/text_classification/multi_label/few-shot/requirements_cpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..bbe76e363f00631d66e0733833813cad5991f009 --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/requirements_cpu.txt @@ -0,0 +1,5 @@ +psutil +paddlepaddle>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime diff --git a/applications/text_classification/multi_label/few-shot/requirements_gpu.txt b/applications/text_classification/multi_label/few-shot/requirements_gpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..66454bd8b6b5fe08521215d4a5c2e7242225d869 --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/requirements_gpu.txt @@ -0,0 +1,7 @@ +psutil +paddlepaddle-gpu>=2.4rc +paddlenlp>=2.4.3 +paddle2onnx>=1.0.3 +onnxruntime-gpu +onnx +onnxconverter-common diff --git a/applications/text_classification/multi_label/few-shot/train.py b/applications/text_classification/multi_label/few-shot/train.py new file mode 100644 index 0000000000000000000000000000000000000000..345a5217158355bb95864f00227e114dc3f25fcf --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/train.py @@ -0,0 +1,133 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
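+
+# NOTE: prompt-tuning entry point for few-shot multi-label classification. The script
+# builds a template from `--prompt` and a SoftVerbalizer from `label.txt` (optionally
+# using the `标签==映射词` mapping), wraps the masked-LM backbone in
+# PromptModelForSequenceClassification, and trains it with PromptTrainer using
+# BCEWithLogitsLoss, micro/macro F1 metrics and early stopping; it can also evaluate on
+# the test set and export a static graph for deployment.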
+ +import os +from collections import defaultdict +from dataclasses import dataclass, field + +import paddle +import paddle.nn.functional as F +from metric import MetricReport +from utils import load_local_dataset + +from paddlenlp.prompt import ( + AutoTemplate, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + SoftVerbalizer, +) +from paddlenlp.trainer import EarlyStoppingCallback, PdArgumentParser +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + data_dir: str = field(default="./data", metadata={"help": "The dataset dictionary includes train.txt, dev.txt and label.txt files."}) + prompt: str = field(default=None, metadata={"help": "The input prompt for tuning."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-3.0-base-zh", metadata={"help": "The build-in pretrained model or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Load the pretrained language model. + model = AutoModelForMaskedLM.from_pretrained(model_args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # Define the template for preprocess and the verbalizer for postprocess. + template = AutoTemplate.create_from(data_args.prompt, tokenizer, training_args.max_seq_length, model=model) + logger.info("Using template: {}".format(template.prompt)) + + label_file = os.path.join(data_args.data_dir, "label.txt") + with open(label_file, "r", encoding="utf-8") as fp: + label_words = defaultdict(list) + for line in fp: + data = line.strip().split("==") + word = data[1] if len(data) > 1 else data[0].split("##")[-1] + label_words[data[0]].append(word) + verbalizer = SoftVerbalizer(label_words, tokenizer, model) + + # Load the few-shot datasets. + train_ds, dev_ds, test_ds = load_local_dataset( + data_path=data_args.data_dir, splits=["train", "dev", "test"], label_list=verbalizer.labels_to_ids + ) + + # Define the criterion. + criterion = paddle.nn.BCEWithLogitsLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds): + metric = MetricReport() + preds = F.sigmoid(paddle.to_tensor(eval_preds.predictions)) + metric.update(preds, paddle.to_tensor(eval_preds.label_ids)) + micro_f1_score, macro_f1_score = metric.accumulate() + return {"micro_f1_score": micro_f1_score, "macro_f1_score": macro_f1_score} + + # Deine the early-stopping callback. + callbacks = [EarlyStoppingCallback(early_stopping_patience=4, early_stopping_threshold=0.0)] + + # Initialize the trainer. + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=callbacks, + compute_metrics=compute_metrics, + ) + + # Training. 
+ if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=None) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Prediction. + if training_args.do_predict: + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Export static model. + if training_args.do_export: + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/multi_label/few-shot/utils.py b/applications/text_classification/multi_label/few-shot/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3a9c2a5b83faff2b8d512d43bc9471e603c7f6fd --- /dev/null +++ b/applications/text_classification/multi_label/few-shot/utils.py @@ -0,0 +1,53 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +from paddlenlp.datasets import load_dataset + + +def load_local_dataset(data_path, splits, label_list): + """ + Load dataset for multi-label classification from files, where + there is one example per line. Text and label are separated + by '\t', and multiple labels are delimited by ','. + + Args: + data_path (str): + Path to the dataset directory, including label.txt, train.txt, + dev.txt (and data.txt). + splits (list): + Which file(s) to load, such as ['train', 'dev', 'test']. + label_list (dict): + The dictionary that maps labels to indeces. + """ + + def _reader(data_file, label_list): + with open(data_file, "r", encoding="utf-8") as fp: + for idx, line in enumerate(fp): + data = line.strip().split("\t") + if len(data) == 1: + yield {"text_a": data[0]} + else: + text, label = data + label = label.strip().split(",") + label = [float(1) if x in label else float(0) for x in label_list] + yield {"text_a": text, "labels": label} + + split_map = {"train": "train.txt", "dev": "dev.txt", "test": "test.txt"} + datasets = [] + for split in splits: + data_file = os.path.join(data_path, split_map[split]) + datasets.append(load_dataset(_reader, data_file=data_file, label_list=label_list, lazy=False)) + return datasets diff --git a/applications/text_classification/multi_label/metric.py b/applications/text_classification/multi_label/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..44ca1c37f12c80bcd2df0b8c8d0c6d50cde3a014 --- /dev/null +++ b/applications/text_classification/multi_label/metric.py @@ -0,0 +1,81 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from sklearn.metrics import f1_score, classification_report + +from paddle.metric import Metric +from paddlenlp.utils.log import logger + + +class MetricReport(Metric): + """ + F1 score for multi-label text classification task. + """ + + def __init__(self, name="MetricReport", average="micro"): + super(MetricReport, self).__init__() + self.average = average + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. + """ + self.y_prob = None + self.y_true = None + + def f1_score(self, y_prob): + """ + Compute micro f1 score and macro f1 score + """ + threshold = 0.5 + self.y_pred = y_prob > threshold + micro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="micro") + macro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="macro") + return micro_f1_score, macro_f1_score + + def update(self, probs, labels): + """ + Update the probability and label + """ + if self.y_prob is not None: + self.y_prob = np.append(self.y_prob, probs.numpy(), axis=0) + else: + self.y_prob = probs.numpy() + if self.y_true is not None: + self.y_true = np.append(self.y_true, labels.numpy(), axis=0) + else: + self.y_true = labels.numpy() + + def accumulate(self): + """ + Returns micro f1 score and macro f1 score + """ + micro_f1_score, macro_f1_score = self.f1_score(y_prob=self.y_prob) + return micro_f1_score, macro_f1_score + + def report(self): + """ + Returns classification report + """ + self.y_pred = self.y_prob > 0.5 + logger.info("classification report:\n" + classification_report(self.y_true, self.y_pred, digits=4)) + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/applications/text_classification/multi_label/predict.py b/applications/text_classification/multi_label/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..c0ad1f4a72bec354812872f81c116fa73c663656 --- /dev/null +++ b/applications/text_classification/multi_label/predict.py @@ -0,0 +1,102 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
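+
+# NOTE: offline prediction script for the fine-tuned multi-label classifier. It loads
+# the checkpoint from `--params_path`, reads unlabeled texts and the label set from
+# `--dataset_dir` (data.txt / label.txt by default), applies a per-label sigmoid with a
+# 0.5 threshold, and writes tab-separated text/label results to `--output_file`.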
+ +import argparse +import functools +import os + +import paddle +import paddle.nn.functional as F +from paddle.io import BatchSampler, DataLoader +from utils import preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="Local dataset directory should include data.txt and label.txt") +parser.add_argument("--output_file", default="output.txt", type=str, help="Save prediction result") +parser.add_argument("--params_path", default="./checkpoint/", type=str, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--data_file", type=str, default="data.txt", help="Unlabeled data file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +args = parser.parse_args() +# yapf: enable + + +@paddle.no_grad() +def predict(): + """ + Predicts the data labels. + """ + paddle.set_device(args.device) + model = AutoModelForSequenceClassification.from_pretrained(args.params_path) + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + + label_list = [] + label_path = os.path.join(args.dataset_dir, args.label_file) + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + label_list.append(line.strip()) + + data_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.data_file), is_test=True, lazy=False + ) + + trans_func = functools.partial( + preprocess_function, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + label_nums=len(label_list), + is_test=True, + ) + + data_ds = data_ds.map(trans_func) + + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + data_batch_sampler = BatchSampler(data_ds, batch_size=args.batch_size, shuffle=False) + + data_data_loader = DataLoader(dataset=data_ds, batch_sampler=data_batch_sampler, collate_fn=collate_fn) + + results = [] + model.eval() + for batch in data_data_loader: + logits = model(**batch) + probs = F.sigmoid(logits).numpy() + for prob in probs: + labels = [] + for i, p in enumerate(prob): + if p > 0.5: + labels.append(i) + results.append(labels) + + with open(args.output_file, "w", encoding="utf-8") as f: + f.write("text" + "\t" + "label" + "\n") + for d, result in zip(data_ds.data, results): + label = [label_list[r] for r in result] + f.write(d["sentence"] + "\t" + ", ".join(label) + "\n") + logger.info("Prediction results save in {}.".format(args.output_file)) + + return + + +if __name__ == "__main__": + + predict() diff --git a/applications/text_classification/multi_label/prune.py b/applications/text_classification/multi_label/prune.py new file mode 100644 index 0000000000000000000000000000000000000000..16d8b1596f9bb31bcbbd901e0cb6df11dbd2e543 --- /dev/null +++ b/applications/text_classification/multi_label/prune.py @@ -0,0 +1,123 @@ +# Copyright (c) 
2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import functools +import os +from dataclasses import dataclass, field + +import paddle +import paddle.nn.functional as F +from metric import MetricReport +from paddleslim.nas.ofa import OFA +from utils import preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import CompressionArguments, PdArgumentParser, Trainer +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + dataset_dir: str = field(default=None, metadata={"help": "Local dataset directory should include train.txt, dev.txt and label.txt."}) + max_seq_length: int = field(default=128, metadata={"help": "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + params_dir: str = field(default='./checkpoint/', metadata={"help": "The output directory where the model checkpoints are written."}) +# yapf: enable + + +@paddle.no_grad() +def custom_evaluate(self, model, data_loader): + metric = MetricReport() + model.eval() + metric.reset() + for batch in data_loader: + logits = model(batch["input_ids"], batch["token_type_ids"]) + # Supports paddleslim.nas.ofa.OFA model and nn.layer model. 
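+        # OFA-wrapped models return a list/tuple of outputs; the first element is the logits.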
+ if isinstance(model, OFA): + logits = logits[0] + probs = F.sigmoid(logits) + metric.update(probs, batch["labels"]) + + micro_f1_score, macro_f1_score = metric.accumulate() + logger.info("micro f1 score: %.5f, macro f1 score: %.5f" % (micro_f1_score, macro_f1_score)) + model.train() + return macro_f1_score + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, compression_args = parser.parse_args_into_dataclasses() + paddle.set_device(compression_args.device) + compression_args.strategy = "dynabert" + # Log model and data config + compression_args.print_config(model_args, "Model") + compression_args.print_config(data_args, "Data") + + label_list = {} + label_path = os.path.join(data_args.dataset_dir, "label.txt") + train_path = os.path.join(data_args.dataset_dir, "train.txt") + dev_path = os.path.join(data_args.dataset_dir, "dev.txt") + with open(label_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + + train_ds = load_dataset(read_local_dataset, path=train_path, label_list=label_list, lazy=False) + dev_ds = load_dataset(read_local_dataset, path=dev_path, label_list=label_list, lazy=False) + + model = AutoModelForSequenceClassification.from_pretrained(model_args.params_dir) + tokenizer = AutoTokenizer.from_pretrained(model_args.params_dir) + + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=data_args.max_seq_length, label_nums=len(label_list) + ) + train_dataset = train_ds.map(trans_func) + dev_dataset = dev_ds.map(trans_func) + + # Define data collector, criterion + data_collator = DataCollatorWithPadding(tokenizer) + criterion = paddle.nn.BCEWithLogitsLoss() + + trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=dev_dataset, + criterion=criterion, + ) # Strategy`dynabert` needs arguments `criterion` + + compression_args.print_config() + + trainer.compress(custom_evaluate=custom_evaluate) + + +if __name__ == "__main__": + main() diff --git a/applications/text_classification/multi_label/retrieval_based/README.md b/applications/text_classification/multi_label/retrieval_based/README.md new file mode 100644 index 0000000000000000000000000000000000000000..835336e446ab01cf6c06123d8c73bd2addf43346 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/README.md @@ -0,0 +1,512 @@ +# 基于检索的多标签文本分类方法 + + **目录** + +* [1. 基于语义索引的多标签分类任务介绍](#基于语义索引的多标签分类任务介绍) +* [2. 代码结构说明](#代码结构说明) +* [3. 环境准备](#环境准备) +* [4. 数据准备](#数据准备) +* [5. 模型训练](#模型训练) +* [6. 模型评估](#模型训练) +* [7. 模型预测](#模型预测) +* [8. 模型部署](#模型部署) +* [9. 分类流程](#分类流程) + + + +# 1.基于语义索引的多标签分类任务介绍 + +以前的分类任务中,标签信息作为无实际意义,独立存在的one-hot编码形式存在,这种做法会潜在的丢失标签的语义信息,本方案把文本分类任务中的标签信息转换成含有语义信息的语义向量,将文本分类任务转换成向量检索和匹配的任务。这样做的好处是对于一些类别标签不是很固定的场景,或者需要经常有一些新增类别的需求的情况非常合适。另外,对于一些新的相关的分类任务,这种方法也不需要模型重新学习或者设计一种新的模型结构来适应新的任务。总的来说,这种基于检索的文本分类方法能够有很好的拓展性,能够利用标签里面包含的语义信息,不需要重新进行学习。这种方法可以应用到相似标签推荐,文本标签标注,金融风险事件分类,政务信访分类等领域。 + +本方案是基于语义索引模型的分类,语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。基于语义索引的多标签分类方法有两种,第一种方法是直接把标签变成召回库,即把输入文本和标签的文本进行匹配,第二种是利用召回的文本带有类别标签,把召回文本的类别标签作为给定输入文本的类别。本方案使用双塔模型,训练阶段引入In-batch Negatives 策略,使用hnswlib建立索引库,并把标签作为召回库,进行召回测试。最后利用召回的结果使用 Accuracy 指标来评估语义索引模型的分类的效果。 + +**注意** 基于语义索引的文本分类的标签在预测过程中会抽取成向量,所以标签需要文本的形式,不能是ID形式的标签。 + + + +## 2. 
代码结构说明 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— base_model.py # 语义索引模型基类 +|—— train.py # In-batch Negatives 策略的训练主脚本 +|—— model.py # In-batch Negatives 策略核心网络结构 + +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— metric.py # Macro F1和Micro F1评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— export_model.py # 动态图转换成静态图 +|—— export_to_serving.py # 静态图转 Serving +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— predict.sh # 预测 bash 版本 + |—— evaluate.sh # 评估 bash 版本 + |—— run_build_index.sh # 构建索引 bash 版本 + |—— train.sh # 训练 bash 版本 + |—— export_to_serving.sh # Paddle Inference 转 Serving 的 bash 脚本 + |—— run.sh # 构建Milvus向量的 bash 版本 +|—— utils + ├── config.py # Milvus 的配置文件 + ├── feature_extract.py # 向量抽取文件 + ├── milvus_util.py # Milvus 的配置文件 +|—— deploy + |—— python + |—— predict.py # PaddleInference + |—— deploy.sh # Paddle Inference 部署脚本 + |—— rpc_client.py # Paddle Serving 的 Client 端 + |—— web_service.py # Paddle Serving 的 Serving 端 + |—— config_nlp.yml # Paddle Serving 的配置文件 + +``` + + + +## 3. 环境准备 + +推荐使用GPU进行训练,在预测阶段使用CPU或者GPU均可。 + +**环境依赖** +* python >= 3.7 +* paddlepaddle >= 2.3.1 +* paddlenlp >= 2.3.4 +* hnswlib >= 0.5.2 +* visualdl >= 2.2.2 + +``` +pip install -r requirements.txt +``` + + + +## 4. 数据准备 + +训练需要准备指定格式的本地数据集,如果没有已标注的数据集,可以参考[文本分类任务doccano数据标注使用指南](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/text_classification/doccano.md)进行文本分类数据标注。 + +**指定格式本地数据集目录结构** + +``` +├── data # 数据集目录 + ├── label.txt # 标签集 + ├── dev.txt # 验证集 + ├── test.txt # 测试集 + ├── train.txt # 训练集 +``` + +**训练、开发、测试数据集** + +train.txt(训练数据集文件), dev.txt(开发数据集文件),test.txt(可选,测试数据集文件),文件中文本与标签类别名用tab符`'\t'`分隔开,由于文本有多个标签,需要把文本拆分成多个单标签的形式,即多行文本标签对。训练集指用于训练模型的数据;开发集指用于评测模型表现的数据,可以根据模型在开发集上的精度调整训练参数和模型;测试集用于测试模型表现,没有测试集时可以使用开发集代替。 + +- train.txt/test.txt 文件格式: +```text +<文本>'\t'<标签1> +<文本>'\t'<标签2> +... +``` +- train.txt/test.txt 文件样例: +```text +茄子怎么做才好吃?怎么做才好吃? 生活 +茄子怎么做才好吃?怎么做才好吃? 美食/烹饪 +茄子怎么做才好吃?怎么做才好吃? 烹饪方法 +... +``` +- dev.txt 文件格式: +``` +<文本>'\t'<标签1>,<标签2>,<标签3> +<文本>'\t'<标签2>,<标签5> +``` +- dev.txt 文件样例: +```text +克隆人的思想和原体是一样的吗? 教育/科学,理工学科,生物学 +家用燃气灶哪个牌子好些?最好是高效节能的~最好是高效的~ 生活,购物 +咨询下,黛莱美面膜怎么样,那里有咨询下,黛莱美怎么样,那里有 生活,美容/塑身,护肤 +... +``` + +**分类标签** + +label.txt(分类标签文件)记录数据集中所有标签集合,每一行为一个标签名。 +- label.txt 文件格式: +```text +<标签> +<标签> +... +``` +- label.txt 文件样例: +```text +魔力宝贝 +完美游戏 +地球科学 +美食/烹饪 +肝胆外科 +动漫 +婚嫁 +历史话题 +新生儿 +... +``` + + + +## 5. 
模型训练
+
+我们使用百科知识问答的数据来构建训练集和开发集。
+
+**训练集(train.txt)** 和 **开发集(dev.txt)** 格式一致,训练集30k条,开发集10k条,每行由文本的标题、内容和类别标签组成,以tab符分隔,第一列是问题标题与问题描述的拼接,剩下的列是问题的类别。
+**召回库(label.txt)** 类别的数量是323类,召回标签库的构建有2种方式,第一种是把所有的类别标签当成召回库,第二种是把训练集当成召回集合,我们以第一种为例。
+
+数据集选择的是百科问答数据集的一个子集,问答数据集详情请参考[nlp_chinese_corpus](https://github.com/brightmart/nlp_chinese_corpus)。
+
+- [baike_qa_category](https://paddlenlp.bj.bcebos.com/applications/baike_qa_multilabel.zip)
+
+```
+wget https://paddlenlp.bj.bcebos.com/applications/baike_qa_multilabel.zip
+unzip baike_qa_multilabel.zip
+```
+
+### 单机单卡训练/单机多卡训练
+
+这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1 卡;如果采用单机单卡训练,只需要把`--gpus`参数设置成单卡的卡号即可。
+
+如果使用CPU进行训练,则需要把`--gpus`参数去除,并把`device`设置成cpu即可,详细配置请参考train.sh文件中的训练设置。
+
+然后运行下面的命令使用GPU训练,得到语义索引模型:
+
+```
+root_path=inbatch
+data_path=data
+python -u -m paddle.distributed.launch --gpus "0,1" \
+    train.py \
+    --device gpu \
+    --save_dir ./checkpoints/${root_path} \
+    --batch_size 24 \
+    --learning_rate 5E-5 \
+    --epochs 100 \
+    --output_emb_size 0 \
+    --save_steps 50 \
+    --max_seq_length 384 \
+    --warmup_proportion 0.0 \
+    --margin 0.2 \
+    --recall_result_dir "recall_result_dir" \
+    --recall_result_file "recall_result.txt" \
+    --train_set_file ${data_path}/train.txt \
+    --corpus_file ${data_path}/label.txt \
+    --similar_text_pair_file ${data_path}/dev.txt \
+    --evaluate True
+```
+
+参数含义说明
+
+* `device`: 使用 cpu/gpu 进行训练
+* `save_dir`: 模型存储路径
+* `batch_size`: 训练的batch size的大小
+* `learning_rate`: 训练的学习率的大小
+* `epochs`: 训练的epoch数
+* `output_emb_size`: Transformer 顶层输出的文本向量维度
+* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数
+* `max_seq_length`: 输入序列的最大长度
+* `margin`: 正样本相似度与负样本之间的目标 Gap
+* `train_set_file`: 训练集文件
+* `evaluate`: 是否开启边训练边评估模型训练效果,默认开启
+* `recall_result_dir`: 召回结果存储目录
+* `recall_result_file`: 召回结果的文件名
+* `hnsw_m`: hnsw 算法相关参数,保持默认即可
+* `hnsw_ef`: hnsw 算法相关参数,保持默认即可
+* `recall_num`: 对 1 个文本召回的相似文本数量
+* `similar_text_pair`: 由相似文本对构成的评估集
+* `corpus_file`: 召回库数据 corpus_file
+
+也可以使用bash脚本:
+
+```
+sh scripts/train.sh
+```
+
+
+
+## 6. 模型评估
+
+评估脚本命令如下:
+```
+python -u evaluate.py \
+    --similar_text_pair "data/dev.txt" \
+    --recall_result_file "./recall_result_dir/recall_result.txt" \
+    --label_path data/label.txt
+```
+也可以使用bash脚本:
+
+```
+sh scripts/evaluate.sh
+```
+会得到如下的输出结果:
+
+```
+Micro f1 score: 99.30934877025769
+Macro f1 score: 78.20694877991563
+```
+
+
+
+## 7. 模型预测
+
+我们可以基于语义索引模型计算文本和标签的语义相似度。
+
+
+### 开始预测
+
+加载训练好的语义索引模型,然后计算文本和标签的语义相似度:
+
+```
+root_dir="checkpoints/inbatch/model_best"
+python -u -m paddle.distributed.launch --gpus "0" \
+    predict.py \
+    --device gpu \
+    --params_path "${root_dir}/model_state.pdparams" \
+    --model_name_or_path rocketqa-zh-dureader-query-encoder \
+    --output_emb_size 0 \
+    --batch_size 128 \
+    --max_seq_length 384 \
+    --text_pair_file "data/test.txt"
+```
+
+参数含义说明
+* `device`: 使用 cpu/gpu 进行训练
+* `params_path`: 训练好的语义索引模型的参数文件
+* `output_emb_size`: Transformer 顶层输出的文本向量维度
+* `model_name_or_path`: 预训练模型,用于模型和`Tokenizer`的参数初始化。
+* `text_pair_file`: 由文本 Pair 构成的待预测数据集
+
+也可以运行下面的bash脚本:
+
+```
+sh scripts/predict.sh
+```
+predict.sh文件包含了cpu和gpu运行的脚本,默认使用gpu运行。
+
+产出如下结果:
+```
+0.8841502070426941
+0.7834227681159973
+0.04591505229473114
+0.15116563439369202
+......
+```
+
+
+
+## 8. 
模型部署 + +模型部署分为:动转静导出,向量引擎,Paddle Inference推理, Paddle Serving服务化这几个部分。为了提升预测速度,通常需要把训练好的模型转换成静态图,然后就可以使用Paddle Inference静态图进行推理,向量引擎则是存放标签的向量的形式,方便快速检索,另外,Paddle Inference可以进一步对模型服务化,即使用Paddle Serving进行服务化,这样可以通过HTTP或者RPC的方式进行调用。Paddle Serving的服务化形式有Pipeline和C++两种形式,Pipeline灵活一点,方便进行修改,C++部署更麻烦一点,但C++的部署形式效率更高。 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py \ + --params_path checkpoints/inbatch/model_best/model_state.pdparams \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` +### Paddle Inference预测 + +预测既可以抽取向量也可以计算两个文本的相似度。 + +修改deploy/python/predict.py中的id2corpus和corpus_list的样本: + +``` +# 抽取向量 +id2corpus = {0: {"sentence": "快要到期的“我的资料”怎么续日期?"}} +# 计算文本和类别的相似度 +corpus_list = [{ + "sentence": "快要到期的“我的资料”怎么续日期?", + 'label': '互联网' + }, { + "sentence": "快要到期的“我的资料”怎么续日期?", + 'label': '游戏' + }] + +``` + +然后使用PaddleInference + +``` +python deploy/python/predict.py --model_dir=./output +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +最终输出的是256维度的特征向量和句子对的预测概率: + +``` +(1, 768) +[[ 0.01510728 -0.03822846 -0.0350773 -0.02304687 0.04219331 -0.04335611 + -0.03983097 0.04164692 0.04074539 -0.02351343 0.04246496 -0.02563381 + .... + +[0.8133385181427002, 0.1509452909231186] +``` + +### 向量引擎 + +模型准备结束以后,开始搭建 Milvus 的向量检索引擎,用于文本语义向量的快速检索,本项目使用[Milvus](https://milvus.io/)开源工具进行向量检索,Milvus 的搭建教程请参考官方教程 [Milvus官方安装教程](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md)本案例使用的是 Milvus 的1.1.1 CPU版本,建议使用官方的 Docker 安装方式,简单快捷。 + + +Milvus 搭建完系统以后就可以插入和检索向量了,首先生成 embedding 向量,每个样本生成768维度的向量: + +``` +CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \ + --data_name label \ + --model_dir ./output \ + --output_dir data \ + --corpus_file "./data/label.txt" +``` +其中 output 目录下存放的是召回的 Paddle Inference 静态图模型。 + +修改 utils/config.py 的配置 ip 和端口,本项目使用的是8530端口,而 Milvus 默认的是19530,需要根据情况进行修改: +``` +MILVUS_HOST='your milvus ip' +MILVUS_PORT = 8530 +``` + +然后向搭建好的 Milvus 系统插入向量: + +``` +python utils/vector_insert.py \ + --vector_path ./data/label_embedding.npy +``` +也可以直接运行: + +```bash +sh scripts/run.sh +``` + +### Paddle Serving部署 + +Paddle Serving 的安装可以参考[Paddle Serving 安装文档](https://github.com/PaddlePaddle/Serving#installation)。需要在服务端和客户端安装相关的依赖,用pip安装Paddle Serving的依赖如下: + +``` +pip install paddle-serving-client==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple +pip install paddle-serving-app==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# 如果是CPU部署,只需要安装CPU Server +pip install paddle-serving-server==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple + +# 如果是GPU Server,需要确认环境再选择执行哪一条,推荐使用CUDA 10.2的包 +# CUDA10.2 + Cudnn7 + TensorRT6(推荐) +pip install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple +``` +更详细的安装信息请参考[链接](https://github.com/PaddlePaddle/Serving/blob/v0.9.0/doc/Install_Linux_Env_CN.md),安装完依赖后就可以执行下面的步骤。首先把生成的静态图模型导出为 Paddle Serving的格式,命令如下: + +``` +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "./serving_server" \ + --client_path "./serving_client" \ + --fetch_alias_names "output_embedding" +``` + +参数含义说明 +* `dirname`: 需要转换的模型文件存储路径,Program 结构文件和参数文件均保存在此目录。 +* 
`model_filename`: 存储需要转换的模型 Inference Program 结构的文件名称。如果设置为 None ,则使用 `__model__` 作为默认的文件名 +* `params_filename`: 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为 None +* `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server +* `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client +* `fetch_alias_names`: 模型输出的别名设置,比如输入的 input_ids 等,都可以指定成其他名字,默认不指定 +* `feed_alias_names`: 模型输入的别名设置,比如输出 pooled_out 等,都可以重新指定成其他模型,默认不指定 + +也可以运行下面的 bash 脚本: +``` +sh scripts/export_to_serving.sh +``` + +Paddle Serving的部署采用Pipeline的方式,如果用户有对性能有更高的要求,可以采用C++的部署形式,请参考[Neural Search](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/recall/in_batch_negative#c%E7%9A%84%E6%96%B9%E5%BC%8F): + +#### Pipeline方式 + +启动 Pipeline Server: + +``` +cd deploy/python/ +python web_service.py +``` + +启动客户端调用 Server, 使用 POST的方式: + +向服务端发送 POST 请求示例: + +``` +curl -X POST -k http://localhost:8090/ernie/prediction -d '{"key": ["0"], "value": ["{\"sentence\": \"中国农业大学怎么样?可以吗?\"}"]}' +``` + +也可以使用 rpc的方式: +首先修改rpc_client.py中需要预测的样本: + +``` +list_data = [{ + "sentence": "中国农业大学怎么样?可以吗?" +}] +``` +然后运行: + +``` +python rpc_client.py +``` +模型的输出为: + +``` +PipelineClient::predict pack_data time:1658988633.3673246 +PipelineClient::predict before time:1658988633.3678396 +time to cost :0.014188766479492188 seconds +['output_embedding'] +(1, 768) +[[-0.06491912 -0.0133915 0.00937684 0.01285653 -0.02468005 0.03528611 + 0.0623698 -0.06062918 0.02238894 -0.05348937 0.02161925 0.04480227 + ...... +``` + +可以看到客户端发送了1条文本,返回这个 embedding 向量 + + + +## 9. 分类流程 + +为了演示基于检索的文本分类流程,我们使用下面的python脚本来完成整个流程,该分类系统使用了Client Server的模式,即抽取向量的模型部署在服务端,然后启动客户端(Client)端去访问,得到分类的结果。 + +``` +python run_system.py +``` +代码内置的测试用例为: + +``` +list_data = [{"sentence": "中国农业大学怎么样?可以吗?"}] +``` +会输出如下的结果: + +``` +...... +PipelineClient::predict pack_data time:1658988661.507715 +PipelineClient::predict before time:1658988661.5081818 +Extract feature time to cost :0.02322244644165039 seconds +Search milvus time cost is 0.06801486015319824 seconds +{'sentence': '中国农业大学怎么样?可以吗?'} 教育/科学 0.6138255596160889 +{'sentence': '中国农业大学怎么样?可以吗?'} 院校信息 0.9188833236694336 +``` +输出的结果包括特征提取和检索的时间,还包含检索出来文本和对应的标签,通过设定阈值等方式可以得到最终的标签。 + +## Reference + +[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. diff --git a/applications/text_classification/multi_label/retrieval_based/base_model.py b/applications/text_classification/multi_label/retrieval_based/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..56aa3ba50e189281c35d41e8819014f56d8e53f4 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/base_model.py @@ -0,0 +1,153 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
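+
+# NOTE: base encoders for the retrieval-based classifier (dual-tower semantic indexing).
+# Both classes take the pooled [CLS] embedding from the pretrained model, optionally
+# project it to `output_emb_size`, apply dropout and L2 normalization, and score
+# query/title pairs by the inner product of the normalized vectors; the *Static variant
+# declares `paddle.jit.to_static` input specs so it can be exported for inference.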
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + @paddle.jit.to_static( + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + ) + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + 
input_ids, token_type_ids = batch_data + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding diff --git a/applications/text_classification/multi_label/retrieval_based/data.py b/applications/text_classification/multi_label/retrieval_based/data.py new file mode 100644 index 0000000000000000000000000000000000000000..61b6fc701a8d3489ca3d7aa57990970b52781a08 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/data.py @@ -0,0 +1,236 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import hnswlib +import numpy as np +import paddle +from paddlenlp.utils.log import logger + + +def build_index(corpus_data_loader, model, output_emb_size, hnsw_max_elements, hnsw_ef, hnsw_m): + + index = hnswlib.Index(space="cosine", dim=output_emb_size if output_emb_size > 0 else 768) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. 
Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=hnsw_max_elements, ef_construction=hnsw_ef, M=hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + logger.info("start build index..........") + all_embeddings = [] + for text_embeddings in model.get_semantic_embedding(corpus_data_loader): + all_embeddings.append(text_embeddings.numpy()) + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + logger.info("Total index number:{}".format(index.get_current_count())) + return index + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_corpus_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. 
+ token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_label_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + for k, v in example.items(): + encoded_inputs = tokenizer(text=v, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + print(data) + continue + yield {"sentence": data[0], "label": data[1]} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using lateset 
ann_data_file:{}".format(latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def label2ids(label_path): + label2id = {} + with open(label_path) as f: + for idx, label in enumerate(f.readlines()): + label = label.strip() + label2id[label] = idx + return label2id + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + splited_line = line.rstrip().split("\t") + text, similar_text = splited_line[0], ",".join(splited_line[1:]) + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/applications/text_classification/multi_label/retrieval_based/deploy/python/config_nlp.yml b/applications/text_classification/multi_label/retrieval_based/deploy/python/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..2429b5a6d01c837116f66dca345f90253146325e --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/deploy/python/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 8090 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8080 +op: + ernie: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + #ir_optim + ir_optim: True + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '0' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['output_embedding'] + # 模型路径 + model_config: ../../serving_server/ diff --git a/applications/text_classification/multi_label/retrieval_based/deploy/python/deploy.sh b/applications/text_classification/multi_label/retrieval_based/deploy/python/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..6351e89d8b7b80fd740746a8617ffa6f072b0e15 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/deploy/python/deploy.sh @@ -0,0 +1,15 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/applications/text_classification/multi_label/retrieval_based/deploy/python/predict.py b/applications/text_classification/multi_label/retrieval_based/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..ce442f3992e3e4860aca3ea22198ae3ea6fbded0 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/deploy/python/predict.py @@ -0,0 +1,250 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import paddle +from paddle import inference +from scipy import spatial + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoTokenizer + +sys.path.append(".") + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. 
+ Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_query_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + result = [] + encoded_inputs = tokenizer( + text=example["sentence"], + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len, + truncation_strategy="longest_first", + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in 
self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def extract_embedding(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the feature vectors. + """ + examples = [] + for idx, text in data.items(): + print(text) + input_ids, segment_ids = convert_query_example(text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + return logits + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions probs. + """ + + examples = [] + for idx, text in enumerate(data): + input_ids, segment_ids, title_ids, title_segment_ids = convert_example(text, tokenizer) + + examples.append((input_ids, segment_ids, title_ids, title_segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + query_ids, query_segment_ids, title_ids, title_segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(query_ids) + self.input_handles[1].copy_from_cpu(query_segment_ids) + self.predictor.run() + query_logits = self.output_handle.copy_to_cpu() + + self.input_handles[0].copy_from_cpu(title_ids) + self.input_handles[1].copy_from_cpu(title_segment_ids) + self.predictor.run() + title_logits = self.output_handle.copy_to_cpu() + + result = [float(1 - spatial.distance.cosine(arr1, arr2)) for arr1, arr2 in zip(query_logits, title_logits)] + return result + + +if __name__ == "__main__": + # Define predictor to do prediction. 
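+    # For reference: `extract_embedding` takes a dict of {id: {"sentence": text}} and
+    # returns a (batch_size, embedding_dim) numpy array from the exported
+    # pooled-embedding model, while `predict` takes a list of
+    # {"sentence": ..., "label": ...} dicts and returns one cosine similarity per
+    # sentence/label pair.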
+ predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + output_emb_size = 256 + tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-dureader-query-encoder") + id2corpus = {0: {"sentence": "快要到期的“我的资料”怎么续日期?"}} + res = predictor.extract_embedding(id2corpus, tokenizer) + print(res.shape) + print(res) + corpus_list = [{"sentence": "快要到期的“我的资料”怎么续日期?", "label": "互联网"}, {"sentence": "快要到期的“我的资料”怎么续日期?", "label": "游戏"}] + res = predictor.predict(corpus_list, tokenizer) + print(res) diff --git a/applications/text_classification/multi_label/retrieval_based/deploy/python/rpc_client.py b/applications/text_classification/multi_label/retrieval_based/deploy/python/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..b20265cfb4e3eeeddfb1a79500d5680e957d6f4a --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/deploy/python/rpc_client.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time + +import numpy as np +from paddle_serving_server.pipeline import PipelineClient + +client = PipelineClient() +client.connect(["127.0.0.1:8080"]) + +list_data = [{"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"}] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = str(item) + +print(feed) +start_time = time.time() +ret = client.predict(feed_dict=feed) +end_time = time.time() +print("time to cost :{} seconds".format(end_time - start_time)) +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/applications/text_classification/multi_label/retrieval_based/deploy/python/web_service.py b/applications/text_classification/multi_label/retrieval_based/deploy/python/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..df054797d51ec195c6f23ad1c144aa4f6aed43d1 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/deploy/python/web_service.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
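The RPC client above sends a single sentence, but `ErnieOp.preprocess` (defined in `web_service.py` below) batches every key in the feed dict, so several queries can go out in one request. A minimal sketch, assuming the pipeline service is already running on 127.0.0.1:8080 as configured in `config_nlp.yml`; both sample sentences are reused from the scripts in this PR:

```python
import numpy as np
from paddle_serving_server.pipeline import PipelineClient

client = PipelineClient()
client.connect(["127.0.0.1:8080"])

list_data = [
    {"sentence": "青岛有什么好一点的国际青旅推荐?离海近一点 外国人多一点 氛围好点的"},
    {"sentence": "中国农业大学怎么样?可以吗?"},
]
feed = {str(i): str(item) for i, item in enumerate(list_data)}

ret = client.predict(feed_dict=feed)
# The op returns one stringified matrix for the whole batch, so the recovered
# array should have shape (len(list_data), embedding_dim).
result = np.array(eval(ret.value[0]))
print(ret.key)
print(result.shape)
```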
+ +from paddle_serving_server.web_service import Op, WebService + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + result = [] + for text in example: + encoded_inputs = tokenizer( + text=text["sentence"], max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len + ) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class ErnieOp(Op): + def init_op(self): + from paddlenlp.transformers import AutoTokenizer + + model_name_or_path = "rocketqa-zh-dureader-query-encoder" + self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + + def preprocess(self, input_dicts, data_id, log_id): + from paddlenlp.data import Pad, Tuple + + ((_, input_dict),) = input_dicts.items() + print("input dict", input_dict) + batch_size = len(input_dict.keys()) + examples = [] + for i in range(batch_size): + example = eval(input_dict[str(i)]) + input_ids, segment_ids = convert_example([example], self.tokenizer) + examples.append((input_ids, segment_ids)) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + input_ids, segment_ids = batchify_fn(examples) + feed_dict = {} + feed_dict["input_ids"] = input_ids + feed_dict["token_type_ids"] = segment_ids + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["output_embedding"] = str(fetch_dict["output_embedding"].tolist()) + return new_dict, None, "" + + +class ErnieService(WebService): + def get_pipeline_response(self, read_op): + ernie_op = ErnieOp(name="ernie", input_ops=[read_op]) + return ernie_op + + +ernie_service = ErnieService(name="ernie") +ernie_service.prepare_pipeline_config("config_nlp.yml") +ernie_service.run_service() diff --git a/applications/text_classification/multi_label/retrieval_based/evaluate.py b/applications/text_classification/multi_label/retrieval_based/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..f3b5f937767bab04edb97a6864296646d0d364f9 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/evaluate.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
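For reference, the feed dict that `ErnieOp.preprocess` hands to the predictor can be reproduced offline with the same tokenizer and padding helpers. This is only a sketch of that preprocessing step, with the model name and sentence taken from the files above:

```python
from paddlenlp.data import Pad, Tuple
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-dureader-query-encoder")

examples = []
for example in [{"sentence": "中国农业大学怎么样?可以吗?"}]:
    encoded = tokenizer(text=example["sentence"], max_seq_len=512)
    examples.append((encoded["input_ids"], encoded["token_type_ids"]))

batchify_fn = Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"),  # token_type_ids
)
input_ids, token_type_ids = batchify_fn(examples)

# These two arrays mirror the feed_dict that preprocess() builds for the model.
feed_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids}
print(input_ids.shape, token_type_ids.shape)
```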
+
+import argparse
+
+import numpy as np
+from data import label2ids
+from metric import MetricReport
+from tqdm import tqdm
+
+# yapf: disable
+parser = argparse.ArgumentParser()
+parser.add_argument("--label_path", type=str,
+                    default='data/label.txt', help="The full path of label file")
+parser.add_argument("--recall_result_file", type=str,
+                    default='./recall_result_dir/recall_result.txt', help="The full path of recall result file")
+parser.add_argument("--similar_text_pair", default='data/dev.txt',
+                    help="The full path of similar pair file")
+
+parser.add_argument("--threshold", default=0.5, type=float,
+                    help="The threshold for selecting the labels")
+
+args = parser.parse_args()
+# yapf: enable
+
+
+def evaluate(label2id):
+    metric = MetricReport()
+    text2similar = {}
+    # Encoding labels as one hot
+    with open(args.similar_text_pair, "r", encoding="utf-8") as f:
+        for line in f:
+            text, similar_text = line.rstrip().rsplit("\t", 1)
+            text2similar[text] = np.zeros(len(label2id))
+            # One hot Encoding
+            for label in similar_text.strip().split(","):
+                text2similar[text][label2id[label]] = 1
+    pred_labels = {}
+    # Convert predicted labels into one hot encoding
+    with open(args.recall_result_file, "r", encoding="utf-8") as f:
+        for index, line in enumerate(f):
+            text_arr = line.rstrip().split("\t")
+            text, labels, cosine_sim = text_arr
+            # One hot Encoding
+            if text not in pred_labels:
+                pred_labels[text] = np.zeros(len(label2id))
+            if float(cosine_sim) > args.threshold:
+                for label in labels.split(","):
+                    pred_labels[text][label2id[label]] = float(cosine_sim)
+
+    for text, probs in tqdm(pred_labels.items()):
+        metric.update(probs, text2similar[text])
+
+    micro_f1_score, macro_f1_score = metric.accumulate()
+    print("Micro f1 score: {}".format(micro_f1_score * 100))
+    print("Macro f1 score: {}".format(macro_f1_score * 100))
+
+
+if __name__ == "__main__":
+    label2id = label2ids(args.label_path)
+    evaluate(label2id)
diff --git a/applications/text_classification/multi_label/retrieval_based/export_model.py b/applications/text_classification/multi_label/retrieval_based/export_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac06c79a8f971e5cdbeede11c99c9f16d6e59520
--- /dev/null
+++ b/applications/text_classification/multi_label/retrieval_based/export_model.py
@@ -0,0 +1,54 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
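A toy illustration of the thresholding and one-hot encoding that `evaluate()` performs; the label vocabulary and scores below are invented purely for the example:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical label vocabulary (label2ids would normally read data/label.txt).
label2id = {"互联网": 0, "游戏": 1, "体育": 2}

# Gold labels for two texts, one-hot encoded as in evaluate().
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])

# Recalled cosine similarities; only scores above --threshold (0.5) become predictions.
probs = np.array([[0.9, 0.1, 0.6],
                  [0.2, 0.8, 0.4]])
y_pred = probs > 0.5

print("Micro f1:", f1_score(y_true=y_true, y_pred=y_pred, average="micro"))
print("Macro f1:", f1_score(y_true=y_true, y_pred=y_pred, average="macro"))
```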
+ +import argparse +import os + +import paddle +from base_model import SemanticIndexBaseStatic + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--output_emb_size", default=0, type=int, help="output_embedding_size") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = SemanticIndexBaseStatic(pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/applications/text_classification/multi_label/retrieval_based/export_to_serving.py b/applications/text_classification/multi_label/retrieval_based/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..6cc932da11173e54460642c16fd4226411ba3cfb --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/export_to_serving.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle_serving_client.io as serving_io + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dirname", type=str, required=True, + default='./output', help="Path of saved model files. Program file and parameter files are saved in this directory.") +parser.add_argument("--model_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdmodel', help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.") +parser.add_argument("--params_filename", type=str, required=True, + default='inference.get_pooled_embedding.pdiparams', help="The name of file to load all parameters. 
It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. Default: None.") +parser.add_argument("--server_path", type=str, default='./serving_server', + help="The path of server parameter in static graph to be saved.") +parser.add_argument("--client_path", type=str, default='./serving_client', + help="The path of client parameter in static graph to be saved.") +parser.add_argument("--feed_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of feed vars') +parser.add_argument("--fetch_alias_names", type=str, default=None, + help='set alias names for feed vars, split by comma \',\', you should run --show_proto to check the number of fetch vars') +parser.add_argument("--show_proto", type=bool, default=False, + help='If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.') +# yapf: enable + +if __name__ == "__main__": + args = parser.parse_args() + serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/applications/text_classification/multi_label/retrieval_based/metric.py b/applications/text_classification/multi_label/retrieval_based/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..d68451a538bbda1354cd632488b3363a18dfea77 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/metric.py @@ -0,0 +1,80 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from sklearn.metrics import f1_score, classification_report +from paddle.metric import Metric +from paddlenlp.utils.log import logger + + +class MetricReport(Metric): + """ + F1 score for multi-label text classification task. + """ + + def __init__(self, name="MetricReport", average="micro"): + super(MetricReport, self).__init__() + self.average = average + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. 
+ """ + self.y_prob = None + self.y_true = None + + def f1_score(self, y_prob): + """ + Compute micro f1 score and macro f1 score + """ + threshold = 0.5 + self.y_pred = y_prob > threshold + micro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="micro") + macro_f1_score = f1_score(y_pred=self.y_pred, y_true=self.y_true, average="macro") + return micro_f1_score, macro_f1_score + + def update(self, probs, labels): + """ + Update the probability and label + """ + if self.y_prob is not None: + self.y_prob = np.append(self.y_prob, probs, axis=0) + else: + self.y_prob = probs + if self.y_true is not None: + self.y_true = np.append(self.y_true, labels, axis=0) + else: + self.y_true = labels + + def accumulate(self): + """ + Returns micro f1 score and macro f1 score + """ + micro_f1_score, macro_f1_score = self.f1_score(y_prob=self.y_prob) + return micro_f1_score, macro_f1_score + + def report(self): + """ + Returns classification report + """ + self.y_pred = self.y_prob > 0.5 + logger.info("classification report:\n" + classification_report(self.y_true, self.y_pred, digits=4)) + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/applications/text_classification/multi_label/retrieval_based/model.py b/applications/text_classification/multi_label/retrieval_based/model.py new file mode 100644 index 0000000000000000000000000000000000000000..d21569ab78c7436c024ebfe5f70e420620c63526 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/model.py @@ -0,0 +1,66 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
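A minimal usage sketch for `MetricReport`, assuming it runs from the retrieval_based directory and is fed probability/one-hot label arrays the way `evaluate.py` and `train.py` do:

```python
import numpy as np
from metric import MetricReport

metric = MetricReport()
# Each update takes label probabilities and the matching one-hot gold labels.
metric.update(probs=np.array([[0.9, 0.1, 0.7]]), labels=np.array([[1, 0, 1]]))
metric.update(probs=np.array([[0.2, 0.8, 0.1]]), labels=np.array([[0, 1, 0]]))

micro_f1, macro_f1 = metric.accumulate()
print(micro_f1, macro_f1)
metric.report()  # also logs a full classification report at the 0.5 threshold
```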
+
+
+import paddle
+import paddle.nn.functional as F
+from base_model import SemanticIndexBase
+
+
+class SemanticIndexBatchNeg(SemanticIndexBase):
+    def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None):
+        super().__init__(pretrained_model, dropout, output_emb_size)
+
+        self.margin = margin
+        # Scale cosine similarity to ease convergence
+        self.scale = scale
+
+    def forward(
+        self,
+        query_input_ids,
+        title_input_ids,
+        query_token_type_ids=None,
+        query_position_ids=None,
+        query_attention_mask=None,
+        title_token_type_ids=None,
+        title_position_ids=None,
+        title_attention_mask=None,
+    ):
+
+        query_cls_embedding = self.get_pooled_embedding(
+            query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask
+        )
+
+        title_cls_embedding = self.get_pooled_embedding(
+            title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask
+        )
+
+        cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True)
+
+        # Subtract the margin from the positive samples' cosine similarity (the diagonal)
+        margin_diag = paddle.full(
+            shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype()
+        )
+
+        cosine_sim = cosine_sim - paddle.diag(margin_diag)
+
+        # Scale cosine similarity to ease training convergence
+        cosine_sim *= self.scale
+
+        labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64")
+        labels = paddle.reshape(labels, shape=[-1, 1])
+
+        loss = F.cross_entropy(input=cosine_sim, label=labels)
+
+        return loss
diff --git a/applications/text_classification/multi_label/retrieval_based/predict.py b/applications/text_classification/multi_label/retrieval_based/predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..906fdf3519c3aff341c81bcb0bb47f4245b91daf
--- /dev/null
+++ b/applications/text_classification/multi_label/retrieval_based/predict.py
@@ -0,0 +1,100 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import os
+from functools import partial
+
+import numpy as np
+import paddle
+from base_model import SemanticIndexBase
+from data import convert_example, create_dataloader, read_text_pair
+
+from paddlenlp.data import Pad, Tuple
+from paddlenlp.datasets import load_dataset
+from paddlenlp.transformers import AutoModel, AutoTokenizer
+
+# fmt: off
+parser = argparse.ArgumentParser()
+parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file")
+parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.")
+parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max seq length.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + cosine_sims = [] + model.eval() + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + cosine_sims.append(batch_cosine_sim) + cosine_sims = np.concatenate(cosine_sims, axis=0) + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + cosin_sim = predict(model, valid_data_loader) + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) + if idx > 5: + break diff --git a/applications/text_classification/multi_label/retrieval_based/recall.py b/applications/text_classification/multi_label/retrieval_based/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..1d6f49ae9f7c92156fdc5d8d0c338e2339221072 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/recall.py @@ -0,0 +1,113 @@ +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from base_model import SemanticIndexBase +from data import ( + build_index, + convert_corpus_example, + create_dataloader, + gen_id2corpus, + gen_text_file, +) + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(convert_corpus_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + id2corpus = gen_id2corpus(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + # Need better way to get inner model of DataParallel + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = 
len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/applications/text_classification/multi_label/retrieval_based/requirements.txt b/applications/text_classification/multi_label/retrieval_based/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..7033416510087320bbd232fdb3eeacaed736dee5 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/requirements.txt @@ -0,0 +1,5 @@ +pymilvus==1.1.2 +pandas==0.25.1 +paddlenlp>=2.3.7 +hnswlib>=0.5.2 +pybind11 \ No newline at end of file diff --git a/applications/text_classification/multi_label/retrieval_based/run_system.py b/applications/text_classification/multi_label/retrieval_based/run_system.py new file mode 100644 index 0000000000000000000000000000000000000000..46e24fe94318a9cefb55de931fff06b856489aaf --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/run_system.py @@ -0,0 +1,64 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
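The `1.0 - cosine_sims` conversion in `recall.py` above follows from how hnswlib reports distances. The self-contained sketch below illustrates that relationship; the inner-product space and the 256-dimension size are assumptions based on `build_index` and `output_emb_size`, not values fixed by this PR:

```python
import hnswlib
import numpy as np

dim = 256  # assumed embedding size
index = hnswlib.Index(space="ip", dim=dim)  # assumption: inner-product space over normalized vectors
index.init_index(max_elements=1000, ef_construction=100, M=100)
index.set_ef(100)

corpus = np.random.rand(100, dim).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
index.add_items(corpus)

# knn_query returns (ids, distances); with an "ip" space the distance is
# 1 - inner product, so 1.0 - distance recovers the similarity score that
# recall.py writes to the result file.
recalled_idx, distances = index.knn_query(corpus[:2], k=5)
print(recalled_idx.shape, (1.0 - distances)[:, 0])  # top-1 similarity should be close to 1.0
```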
+ +import sys +import time + +import numpy as np +import pandas as pd +from data import gen_id2corpus +from paddle_serving_server.pipeline import PipelineClient + +sys.path.append("utils") +from utils.config import collection_name, partition_tag # noqa: E402 +from utils.milvus_util import RecallByMilvus # noqa: E402 + + +def search_in_milvus(text_embedding, corpus_file, query_text): + client = RecallByMilvus() + start_time = time.time() + status, results = client.search( + collection_name=collection_name, vectors=text_embedding, partition_tag=partition_tag + ) + end_time = time.time() + print("Search milvus time cost is {} seconds ".format(end_time - start_time)) + id2corpus = gen_id2corpus(corpus_file) + list_data = [] + for line in results: + for item in line: + idx = item.id + distance = item.distance + text = id2corpus[idx] + list_data.append([query_text, text, distance]) + df = pd.DataFrame(list_data, columns=["query_text", "label", "innner_product"]) + df = df.sort_values(by="innner_product") + for index, row in df.iterrows(): + if row["innner_product"] > 0.5: + print(row["query_text"], row["label"], row["innner_product"]) + + +if __name__ == "__main__": + client = PipelineClient() + client.connect(["127.0.0.1:8080"]) + corpus_file = "data/label.txt" + list_data = [{"sentence": "中国农业大学怎么样?可以吗?"}] + feed = {} + for i, item in enumerate(list_data): + feed[str(i)] = str(item) + start_time = time.time() + ret = client.predict(feed_dict=feed) + end_time = time.time() + print("Extract feature time to cost :{} seconds".format(end_time - start_time)) + result = np.array(eval(ret.value[0])) + search_in_milvus(result, corpus_file, list_data[0]) diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/evaluate.sh b/applications/text_classification/multi_label/retrieval_based/scripts/evaluate.sh new file mode 100644 index 0000000000000000000000000000000000000000..b7c908c13ca3e0e63d4ee7d28d79838f66fbada8 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/evaluate.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -u evaluate.py \ + --similar_text_pair "data/dev.txt" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --label_path data/label.txt \ No newline at end of file diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/export_model.sh b/applications/text_classification/multi_label/retrieval_based/scripts/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..188e3a9bdf383e40f36ba3c7c5bb015ad6cdcddd --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/export_model.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py \ + --params_path checkpoints/inbatch/model_best/model_state.pdparams \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --output_path=./output diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/export_to_serving.sh b/applications/text_classification/multi_label/retrieval_based/scripts/export_to_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..7a7337b40b7a7c2d652ce2a837562eaceeba0531 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/export_to_serving.sh @@ -0,0 +1,21 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_to_serving.py \ + --dirname "output" \ + --model_filename "inference.get_pooled_embedding.pdmodel" \ + --params_filename "inference.get_pooled_embedding.pdiparams" \ + --server_path "serving_server" \ + --client_path "serving_client" \ + --fetch_alias_names "output_embedding" diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/predict.sh b/applications/text_classification/multi_label/retrieval_based/scripts/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..ff3fd8f76d1ce1442cdee0d53c5f2e73b2214fa7 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/predict.sh @@ -0,0 +1,26 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
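Before fixing `--fetch_alias_names` as `export_to_serving.sh` does above, the exported program's feed/fetch variables can be previewed via `--show_proto`, as the help text in `export_to_serving.py` suggests. The sketch below is simply the same `inference_model_to_serving` call with `show_proto=True`, assuming `export_model.py` has already written `./output`:

```python
import paddle_serving_client.io as serving_io

# Preview the serving proto so the feed/fetch variable names (and how many there
# are) can be checked before settling on alias names such as "output_embedding".
serving_io.inference_model_to_serving(
    dirname="output",
    serving_server="serving_server",
    serving_client="serving_client",
    model_filename="inference.get_pooled_embedding.pdmodel",
    params_filename="inference.get_pooled_embedding.pdiparams",
    show_proto=True,
)
```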
+ +# gpu version +root_dir="checkpoints/inbatch/model_best" +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --output_emb_size 0 \ + --batch_size 128 \ + --max_seq_length 384 \ + --text_pair_file "data/test.txt" + diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/run.sh b/applications/text_classification/multi_label/retrieval_based/scripts/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..c4c990729c26e0c9fd00e4420ebe1810abd00984 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/run.sh @@ -0,0 +1,22 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +CUDA_VISIBLE_DEVICES=0 python utils/feature_extract.py \ + --data_name label \ + --model_dir ./output \ + --output_dir data \ + --corpus_file "./data/label.txt" + +python utils/vector_insert.py \ + --vector_path ./data/label_embedding.npy \ No newline at end of file diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/run_build_index.sh b/applications/text_classification/multi_label/retrieval_based/scripts/run_build_index.sh new file mode 100644 index 0000000000000000000000000000000000000000..2224ad585b1b70886ab6ee167f5ff2626fa37604 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/run_build_index.sh @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
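`predict.sh` above points `--text_pair_file` at `data/test.txt`, which `read_text_pair` in `data.py` parses as one tab-separated `text<TAB>label` pair per line. A throwaway sketch of that format, run from the retrieval_based directory; the file name and the second label are invented for illustration:

```python
from data import read_text_pair

# Write a tiny pair file in the expected "<text>\t<label>" layout.
with open("tiny_pairs.txt", "w", encoding="utf-8") as f:
    f.write("快要到期的“我的资料”怎么续日期?\t互联网\n")
    f.write("青岛有什么好一点的国际青旅推荐?\t旅游\n")

for example in read_text_pair("tiny_pairs.txt"):
    print(example)  # {"sentence": ..., "label": ...}
```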
+ +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_best/model_state.pdparams" \ + --model_name_or_path rocketqa-zh-dureader-query-encoder \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 0 \ + --max_seq_length 384 \ + --recall_num 5 \ + --similar_text_pair "data/dev.txt" \ + --corpus_file "data/label.txt" \ No newline at end of file diff --git a/applications/text_classification/multi_label/retrieval_based/scripts/train.sh b/applications/text_classification/multi_label/retrieval_based/scripts/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..2cef4abcddac47f91a0d98ce701375d910879f27 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/scripts/train.sh @@ -0,0 +1,36 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU training +root_path=inbatch +data_path=data +python -u -m paddle.distributed.launch --gpus "0,1" \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 24 \ + --learning_rate 5E-5 \ + --epochs 100 \ + --output_emb_size 0 \ + --save_steps 50 \ + --max_seq_length 384 \ + --warmup_proportion 0.0 \ + --margin 0.2 \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --train_set_file ${data_path}/train.txt \ + --corpus_file ${data_path}/label.txt \ + --similar_text_pair_file ${data_path}/dev.txt \ + --evaluate True \ + --model_name_or_path rocketqa-zh-dureader-query-encoder diff --git a/applications/text_classification/multi_label/retrieval_based/train.py b/applications/text_classification/multi_label/retrieval_based/train.py new file mode 100644 index 0000000000000000000000000000000000000000..c6304d24b877e00bcb874ef4c5fdcd9fdb70eaf9 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/train.py @@ -0,0 +1,245 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
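As a toy illustration of the in-batch negative objective that `train.sh` drives through `SemanticIndexBatchNeg`: within a batch, the i-th label text is the positive for the i-th query and every other row is a negative, so the target is simply the row index. The 3x3 similarity matrix is made up; margin 0.2 and scale 30 mirror the script's settings:

```python
import paddle
import paddle.nn.functional as F

# A made-up 3x3 query/label cosine-similarity matrix; the diagonal holds the positives.
cosine_sim = paddle.to_tensor([[0.9, 0.2, 0.1],
                               [0.3, 0.8, 0.2],
                               [0.1, 0.4, 0.7]])
margin, scale = 0.2, 30

cosine_sim = cosine_sim - paddle.diag(paddle.full([3], margin))  # subtract the margin from the positives
cosine_sim = cosine_sim * scale                                  # sharpen the softmax
labels = paddle.arange(0, 3, dtype="int64").reshape([-1, 1])     # positive index = row index

print(F.cross_entropy(input=cosine_sim, label=labels))
```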
+import argparse
+import os
+import random
+import time
+from functools import partial
+
+import numpy as np
+import paddle
+import paddle.nn as nn
+from data import (
+    build_index,
+    convert_example,
+    create_dataloader,
+    gen_id2corpus,
+    gen_text_file,
+    label2ids,
+    read_text_pair,
+)
+from metric import MetricReport
+from model import SemanticIndexBatchNeg
+
+from paddlenlp.data import Pad, Tuple
+from paddlenlp.datasets import MapDataset, load_dataset
+from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup
+
+# fmt: off
+parser = argparse.ArgumentParser()
+parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.")
+parser.add_argument("--max_seq_length", default=512, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.")
+parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
+parser.add_argument("--output_emb_size", default=256, type=int, help="output_embedding_size")
+parser.add_argument("--learning_rate", default=5E-5, type=float, help="The initial learning rate for Adam.")
+parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
+parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.")
+parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.")
+parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.")
+parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization")
+parser.add_argument('--device', choices=['cpu', 'gpu'], default="cpu", help="Select which device to train model, defaults to cpu.")
+parser.add_argument('--save_steps', type=int, default=10000, help="Interval steps to save checkpoint")
+parser.add_argument('--log_steps', type=int, default=10, help="Interval steps to print log")
+parser.add_argument("--train_set_file", type=str, default='./data/train.txt', help="The full path of train_set_file.")
+parser.add_argument("--margin", default=0.2, type=float, help="Margin between pos_sample and neg_samples")
+parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss")
+parser.add_argument("--corpus_file", type=str, default='./data/label.txt', help="The full path of input file")
+parser.add_argument("--similar_text_pair_file", type=str, default='./data/dev.txt', help="The full path of similar text pair file")
+parser.add_argument("--recall_result_dir", type=str, default='./recall_result_dir', help="The full path of recall result file to save")
+parser.add_argument("--recall_result_file", type=str, default='recall_result_init.txt', help="The file name of recall result")
+parser.add_argument("--recall_num", default=50, type=int, help="Recall number for each query from Ann index.")
+parser.add_argument("--hnsw_m", default=100, type=int, help="The max connections (M) per node of the HNSW index.")
+parser.add_argument("--hnsw_ef", default=100, type=int, help="The ef parameter (size of the dynamic candidate list) of the HNSW index.")
+parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="The maximum number of elements the HNSW index can hold.")
+parser.add_argument("--evaluate_result", type=str, default='evaluate_result.txt', help="evaluate_result")
+parser.add_argument('--evaluate', default=True, type=eval, choices=[True, False], help='whether evaluate while training') +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +parser.add_argument("--threshold", default=0.5, type=float, help="The threshold for selection the labels") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus, label2id): + metric = MetricReport() + # Load pretrained semantic model + inner_model = model._layers + final_index = build_index( + corpus_data_loader, + inner_model, + output_emb_size=args.output_emb_size, + hnsw_max_elements=args.hnsw_max_elements, + hnsw_ef=args.hnsw_ef, + hnsw_m=args.hnsw_m, + ) + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + batch_size = len(cosine_sims) + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) + text2similar = {} + with open(args.similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + text_arr = line.rstrip().rsplit("\t") + text, similar_text = text_arr[0], text_arr[1] + text2similar[text] = np.zeros(len(label2id)) + # One hot Encoding + for label in similar_text.strip().split(","): + text2similar[text][label2id[label]] = 1 + # Convert predicted labels into one hot encoding + pred_labels = {} + with open(recall_result_file, "r", encoding="utf-8") as f: + for index, line in enumerate(f): + text_arr = line.rstrip().split("\t") + text, labels, cosine_sim = text_arr + # One hot Encoding + if text not in pred_labels: + pred_labels[text] = np.zeros(len(label2id)) + if float(cosine_sim) > args.threshold: + for label in labels.split(","): + pred_labels[text][label2id[label]] = float(cosine_sim) + + for text, probs in pred_labels.items(): + metric.update(probs, text2similar[text]) + micro_f1_score, macro_f1_score = metric.accumulate() + return macro_f1_score + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(args.seed) + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + pretrained_model = AutoModel.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, 
mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + model = SemanticIndexBatchNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + model = paddle.DataParallel(model) + batchify_fn_dev = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # text_segment + ): [data for data in fn(samples)] + if args.evaluate: + eval_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + id2corpus = gen_id2corpus(args.corpus_file) + label2id = label2ids(args.corpus_file) + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=eval_func + ) + query_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + text_list, _ = gen_text_file(args.similar_text_pair_file) + query_ds = MapDataset(text_list) + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn_dev, trans_fn=query_func + ) + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+    decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
+    optimizer = paddle.optimizer.AdamW(
+        learning_rate=lr_scheduler,
+        parameters=model.parameters(),
+        weight_decay=args.weight_decay,
+        apply_decay_param_fun=lambda x: x in decay_params,
+        grad_clip=nn.ClipGradByNorm(clip_norm=1.0),
+    )
+    global_step = 0
+    best_score = 0.0
+    tic_train = time.time()
+    for epoch in range(1, args.epochs + 1):
+        for step, batch in enumerate(train_data_loader, start=1):
+            query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch
+            loss = model(
+                query_input_ids=query_input_ids,
+                title_input_ids=title_input_ids,
+                query_token_type_ids=query_token_type_ids,
+                title_token_type_ids=title_token_type_ids,
+            )
+            global_step += 1
+            if global_step % args.log_steps == 0 and rank == 0:
+                print(
+                    "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s"
+                    % (global_step, epoch, step, loss, args.log_steps / (time.time() - tic_train))
+                )
+                tic_train = time.time()
+            loss.backward()
+            optimizer.step()
+            lr_scheduler.step()
+            optimizer.clear_grad()
+            if not args.evaluate and rank == 0:
+                if global_step % args.save_steps == 0:
+                    save_dir = os.path.join(args.save_dir, "model_%d" % global_step)
+                    if not os.path.exists(save_dir):
+                        os.makedirs(save_dir)
+                    save_param_path = os.path.join(save_dir, "model_state.pdparams")
+                    paddle.save(model.state_dict(), save_param_path)
+                    tokenizer.save_pretrained(save_dir)
+        if args.evaluate and rank == 0:
+            print("evaluating")
+            macro_f1_score = evaluate(
+                model, corpus_data_loader, query_data_loader, recall_result_file, text_list, id2corpus, label2id
+            )
+            # Keep only the checkpoint with the best macro F1 on the dev set.
+            if macro_f1_score > best_score:
+                best_score = macro_f1_score
+                save_dir = os.path.join(args.save_dir, "model_best")
+                if not os.path.exists(save_dir):
+                    os.makedirs(save_dir)
+                save_param_path = os.path.join(save_dir, "model_state.pdparams")
+                paddle.save(model.state_dict(), save_param_path)
+                tokenizer.save_pretrained(save_dir)
+                with open(os.path.join(save_dir, "train_result.txt"), "a", encoding="utf-8") as fp:
+                    fp.write("epoch=%d, global_step: %d, Macro f1: %s\n" % (epoch, global_step, macro_f1_score))
+
+
+if __name__ == "__main__":
+    do_train()
diff --git a/applications/text_classification/multi_label/retrieval_based/utils/__init__.py b/applications/text_classification/multi_label/retrieval_based/utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47
--- /dev/null
+++ b/applications/text_classification/multi_label/retrieval_based/utils/__init__.py
@@ -0,0 +1,13 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
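补充示例(非本仓库代码,仅为原理示意):上面 `train.py` 的 `evaluate` 函数会把召回得到的 (文本, 标签, 相似度) 三元组按 `threshold` 转成 one-hot 预测,再与 dev 集的 one-hot 标签计算 micro/macro F1。下面用 numpy 给出这一阈值判定与 macro F1 计算逻辑的极简示意,其中标签集合、召回结果与阈值均为假设的示例数据:

```python
# A minimal, self-contained sketch with hypothetical data (not repository code):
# turn recalled (text, label, similarity) triples into one-hot predictions with a
# threshold, then compute macro F1 against one-hot ground truth.
import numpy as np

label2id = {"财经": 0, "体育": 1, "科技": 2}  # hypothetical label set
threshold = 0.5

# (text, recalled label, similarity) triples, e.g. parsed from recall_result.txt
recalled = [("文本A", "财经", 0.83), ("文本A", "体育", 0.31), ("文本B", "科技", 0.66)]
gold = {"文本A": ["财经"], "文本B": ["科技", "体育"]}

# Threshold the similarities into one-hot predictions per text.
preds = {text: np.zeros(len(label2id)) for text in gold}
for text, label, sim in recalled:
    if sim > threshold:
        preds[text][label2id[label]] = 1.0

# One-hot encode the ground-truth labels.
trues = {text: np.zeros(len(label2id)) for text in gold}
for text, labels in gold.items():
    for label in labels:
        trues[text][label2id[label]] = 1.0

y_pred = np.stack([preds[t] for t in gold])
y_true = np.stack([trues[t] for t in gold])

# Macro F1: average the per-label F1 scores.
f1_scores = []
for j in range(len(label2id)):
    tp = float(((y_pred[:, j] == 1) & (y_true[:, j] == 1)).sum())
    fp = float(((y_pred[:, j] == 1) & (y_true[:, j] == 0)).sum())
    fn = float(((y_pred[:, j] == 0) & (y_true[:, j] == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1_scores.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)

print("macro F1:", sum(f1_scores) / len(f1_scores))  # 约 0.667(该玩具数据下)
```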
diff --git a/applications/text_classification/multi_label/retrieval_based/utils/config.py b/applications/text_classification/multi_label/retrieval_based/utils/config.py new file mode 100644 index 0000000000000000000000000000000000000000..2da411d0c5e84aa359357823e037df96de69619c --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/utils/config.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from milvus import MetricType, IndexType + +MILVUS_HOST = "localhost" +MILVUS_PORT = 8530 + +output_emb_size = 0 + +collection_param = { + "dimension": output_emb_size if output_emb_size > 0 else 768, + "index_file_size": 256, + "metric_type": MetricType.IP, +} + +index_type = IndexType.FLAT +index_param = {"nlist": 1000} + +top_k = 20 +search_param = {"nprobe": 20} + +collection_name = "multi_label" +partition_tag = "partition_2" diff --git a/applications/text_classification/multi_label/retrieval_based/utils/feature_extract.py b/applications/text_classification/multi_label/retrieval_based/utils/feature_extract.py new file mode 100644 index 0000000000000000000000000000000000000000..966a801776ba89bb104a2ddcc16a400df93ba268 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/utils/feature_extract.py @@ -0,0 +1,194 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import paddle +from paddle import inference +from tqdm import tqdm + +import paddlenlp as ppnlp +from paddlenlp.data import Pad, Tuple + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--corpus_file", type=str, required=True, help="The corpus_file path.") +parser.add_argument("--output_dir", type=str, required=True, help="The output path.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--data_name", type=str, required=True, help="The dataset name.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument("--model_name_or_path", default='rocketqa-zh-dureader-query-encoder', type=str, help='The pretrained model used for training') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') + +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + - single sequence: ``[CLS] X [SEP]`` + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + + encoded_inputs = tokenizer(text=example, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, data_name): + """ + Predicts the data labels. + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + + all_embeddings = [] + examples = [] + for idx, text in tqdm(data.items()): + input_ids, segment_ids = convert_example( + text, tokenizer, max_seq_length=self.max_seq_length, pad_to_max_seq_len=True + ) + examples.append((input_ids, segment_ids)) + if len(examples) >= self.batch_size: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + examples = [] + + if len(examples) > 0: + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + all_embeddings.append(logits) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + np.save("./{}/{}_embedding".format(args.output_dir, data_name), all_embeddings) + + +def read_text(file_path): + file = open(file_path) + id2corpus = {} + for idx, line in enumerate(file.readlines()): + id2corpus[idx] = line.rstrip() + return id2corpus + + +if __name__ == "__main__": + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + data_name = args.data_name + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(args.model_name_or_path) + id2corpus = read_text(args.corpus_file) + predictor.predict(id2corpus, tokenizer, data_name) diff --git a/applications/text_classification/multi_label/retrieval_based/utils/milvus_util.py b/applications/text_classification/multi_label/retrieval_based/utils/milvus_util.py new file mode 100644 index 0000000000000000000000000000000000000000..e6b186c4fa480ab20b888c0cd1376624083da9b9 --- /dev/null +++ b/applications/text_classification/multi_label/retrieval_based/utils/milvus_util.py @@ -0,0 +1,114 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
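+# This module wraps the Milvus 1.x Python SDK: VecToMilvus creates the collection,
+# index and partition configured in config.py and inserts label embeddings, while
+# RecallByMilvus searches the collection for the vectors closest to a query embedding.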
+
+from config import (
+    MILVUS_HOST,
+    MILVUS_PORT,
+    collection_param,
+    index_param,
+    index_type,
+    search_param,
+    top_k,
+)
+from milvus import Milvus
+
+
+class VecToMilvus:
+    def __init__(self):
+        self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT)
+
+    def has_collection(self, collection_name):
+        try:
+            status, ok = self.client.has_collection(collection_name)
+            return ok
+        except Exception as e:
+            print("Milvus has_collection error:", e)
+
+    def creat_collection(self, collection_name):
+        try:
+            collection_param["collection_name"] = collection_name
+            status = self.client.create_collection(collection_param)
+            print(status)
+            return status
+        except Exception as e:
+            print("Milvus create collection error:", e)
+
+    def create_index(self, collection_name):
+        try:
+            status = self.client.create_index(collection_name, index_type, index_param)
+            print(status)
+            return status
+        except Exception as e:
+            print("Milvus create index error:", e)
+
+    def has_partition(self, collection_name, partition_tag):
+        try:
+            status, ok = self.client.has_partition(collection_name, partition_tag)
+            return ok
+        except Exception as e:
+            print("Milvus has partition error: ", e)
+
+    def delete_partition(self, collection_name, partition_tag):
+        try:
+            # Drop only the target partition instead of the whole collection.
+            status = self.client.drop_partition(collection_name, partition_tag)
+            return status
+        except Exception as e:
+            print("Milvus delete partition error: ", e)
+
+    def create_partition(self, collection_name, partition_tag):
+        try:
+            status = self.client.create_partition(collection_name, partition_tag)
+            print("create partition {} successfully".format(partition_tag))
+            return status
+        except Exception as e:
+            print("Milvus create partition error: ", e)
+
+    def insert(self, vectors, collection_name, ids=None, partition_tag=None):
+        try:
+            if not self.has_collection(collection_name):
+                self.creat_collection(collection_name)
+                self.create_index(collection_name)
+                print("collection info: {}".format(self.client.get_collection_info(collection_name)[1]))
+            if (partition_tag is not None) and (not self.has_partition(collection_name, partition_tag)):
+                self.create_partition(collection_name, partition_tag)
+            status, ids = self.client.insert(
+                collection_name=collection_name, records=vectors, ids=ids, partition_tag=partition_tag
+            )
+            self.client.flush([collection_name])
+            print(
+                "Insert {} entities, there are {} entities after insert data.".format(
+                    len(ids), self.client.count_entities(collection_name)[1]
+                )
+            )
+            return status, ids
+        except Exception as e:
+            print("Milvus insert error:", e)
+
+
+class RecallByMilvus:
+    def __init__(self):
+        self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT)
+
+    def search(self, vectors, collection_name, partition_tag=None):
+        try:
+            status, results = self.client.search(
+                collection_name=collection_name,
+                query_records=vectors,
+                top_k=top_k,
+                params=search_param,
+                partition_tag=partition_tag,
+            )
+            return status, results
+        except Exception as e:
+            print("Milvus recall error: ", e)
diff --git a/applications/text_classification/multi_label/retrieval_based/utils/vector_insert.py b/applications/text_classification/multi_label/retrieval_based/utils/vector_insert.py
new file mode 100644
index 0000000000000000000000000000000000000000..986ba6f0918b01df8eb79887a2a07cd0a9dd52ac
--- /dev/null
+++ b/applications/text_classification/multi_label/retrieval_based/utils/vector_insert.py
@@ -0,0 +1,55 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np +from config import collection_name, partition_tag +from milvus_util import VecToMilvus +from tqdm import tqdm + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--vector_path", type=str, required=True, help="feature file path.") + +args = parser.parse_args() +# fmt: on + + +def vector_insert(file_path): + embeddings = np.load(file_path) + print(embeddings.shape) + embedding_ids = [i for i in range(embeddings.shape[0])] + print(len(embedding_ids)) + client = VecToMilvus() + + if client.has_partition(collection_name, partition_tag): + client.delete_partition(collection_name, partition_tag) + data_size = len(embedding_ids) + batch_size = 50000 + for i in tqdm(range(0, data_size, batch_size)): + cur_end = i + batch_size + if cur_end > data_size: + cur_end = data_size + batch_emb = embeddings[np.arange(i, cur_end)] + status, ids = client.insert( + collection_name=collection_name, + vectors=batch_emb.tolist(), + ids=embedding_ids[i : i + batch_size], + partition_tag=partition_tag, + ) + + +if __name__ == "__main__": + vector_insert(args.vector_path) diff --git a/applications/text_classification/multi_label/train.py b/applications/text_classification/multi_label/train.py new file mode 100644 index 0000000000000000000000000000000000000000..a1769d3019b3ac84a76f79dd2086c1a1ff87bf17 --- /dev/null +++ b/applications/text_classification/multi_label/train.py @@ -0,0 +1,198 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import functools +import os +import random +import time + +import numpy as np +import paddle +from metric import MetricReport +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from utils import evaluate, preprocess_function, read_local_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument('--device', default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--dataset_dir", required=True, default=None, type=str, help="Local dataset directory should include train.txt, dev.txt and label.txt") +parser.add_argument("--save_dir", default="./checkpoint", type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--model_name', default="ernie-3.0-medium-zh", help="Select model to train, defaults to ernie-3.0-medium-zh.", + choices=["ernie-1.0-large-zh-cw", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-micro-zh", "ernie-3.0-mini-zh", "ernie-3.0-nano-zh", "ernie-2.0-base-en", "ernie-2.0-large-en", "ernie-m-base", "ernie-m-large"]) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=3e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument('--early_stop', action='store_true', help='Epoch before early stop.') +parser.add_argument('--early_stop_nums', type=int, default=3, help='Number of epoch before early stop.') +parser.add_argument("--logging_steps", default=5, type=int, help="The interval steps to logging.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument('--warmup', action='store_true', help="whether use warmup strategy") +parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup steps over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=3, help="random seed for initialization") +parser.add_argument("--train_file", type=str, default="train.txt", help="Train dataset file name") +parser.add_argument("--dev_file", type=str, default="dev.txt", help="Dev dataset file name") +parser.add_argument("--label_file", type=str, default="label.txt", help="Label file name") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """ + Sets random seed + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + os.environ["PYTHONHASHSEED"] = str(seed) + + +def args_saving(): + argsDict = args.__dict__ + with open(os.path.join(args.save_dir, "setting.txt"), "w") as f: + f.writelines("------------------ start ------------------" + "\n") + for eachArg, value in argsDict.items(): + f.writelines(eachArg + " : " + str(value) + "\n") + f.writelines("------------------- end -------------------") + + 
+def train(): + """ + Training a multi label classification model + """ + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + args_saving() + set_seed(args.seed) + paddle.set_device(args.device) + + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + # load and preprocess dataset + label_list = {} + with open(os.path.join(args.dataset_dir, args.label_file), "r", encoding="utf-8") as f: + for i, line in enumerate(f): + l = line.strip() + label_list[l] = i + train_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.train_file), label_list=label_list, lazy=False + ) + dev_ds = load_dataset( + read_local_dataset, path=os.path.join(args.dataset_dir, args.dev_file), label_list=label_list, lazy=False + ) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name) + trans_func = functools.partial( + preprocess_function, tokenizer=tokenizer, max_seq_length=args.max_seq_length, label_nums=len(label_list) + ) + train_ds = train_ds.map(trans_func) + dev_ds = dev_ds.map(trans_func) + # batchify dataset + collate_fn = DataCollatorWithPadding(tokenizer) + if paddle.distributed.get_world_size() > 1: + train_batch_sampler = DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + else: + train_batch_sampler = BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + dev_batch_sampler = BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=collate_fn) + dev_data_loader = DataLoader(dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=collate_fn) + + # define model + model = AutoModelForSequenceClassification.from_pretrained(args.model_name, num_classes=len(label_list)) + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.BCEWithLogitsLoss() + metric = MetricReport() + + global_step = 0 + best_f1_score = 0 + early_stop_count = 0 + tic_train = time.time() + + for epoch in range(1, args.epochs + 1): + + if args.early_stop and early_stop_count >= args.early_stop_nums: + logger.info("Early stop!") + break + + for step, batch in enumerate(train_data_loader, start=1): + + labels = batch.pop("labels") + logits = model(**batch) + loss = criterion(logits, labels) + + loss.backward() + optimizer.step() + if args.warmup: + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + early_stop_count += 1 + micro_f1_score, macro_f1_score = evaluate(model, criterion, metric, dev_data_loader) + + save_best_path = args.save_dir + if not os.path.exists(save_best_path): + os.makedirs(save_best_path) + + # save models + if macro_f1_score > best_f1_score: + early_stop_count = 0 + best_f1_score = macro_f1_score + model._layers.save_pretrained(save_best_path) + tokenizer.save_pretrained(save_best_path) + logger.info("Current best macro f1 score: %.5f" % (best_f1_score)) + logger.info("Final best macro f1 score: %.5f" % (best_f1_score)) + logger.info("Save best macro f1 text classification model in %s" % (args.save_dir)) + + +if __name__ == "__main__": + train() diff --git a/applications/text_classification/multi_label/utils.py b/applications/text_classification/multi_label/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9915906c9a2ab9dd8680d95711a4336efd96ec6e --- /dev/null +++ b/applications/text_classification/multi_label/utils.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +import paddle +import paddle.nn.functional as F +from paddlenlp.utils.log import logger + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evaluates model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. 
+ """ + + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + labels = batch.pop("labels") + logits = model(**batch) + loss = criterion(logits, labels) + probs = F.sigmoid(logits) + losses.append(loss.numpy()) + metric.update(probs, labels) + + micro_f1_score, macro_f1_score = metric.accumulate() + logger.info( + "eval loss: %.5f, micro f1 score: %.5f, macro f1 score: %.5f" + % (np.mean(losses), micro_f1_score, macro_f1_score) + ) + model.train() + metric.reset() + + return micro_f1_score, macro_f1_score + + +def preprocess_function(examples, tokenizer, max_seq_length, label_nums, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + examples(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_length(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + label_nums(obj:`int`): The number of the labels. + Returns: + result(obj:`dict`): The preprocessed data including input_ids, token_type_ids, labels. + """ + result = tokenizer(text=examples["sentence"], max_seq_len=max_seq_length) + # One-Hot label + if not is_test: + result["labels"] = [float(1) if i in examples["label"] else float(0) for i in range(label_nums)] + return result + + +def read_local_dataset(path, label_list=None, is_test=False): + """ + Read dataset + """ + with open(path, "r", encoding="utf-8") as f: + for line in f: + if is_test: + items = line.strip().split("\t") + sentence = "".join(items) + yield {"sentence": sentence} + else: + items = line.strip().split("\t") + if len(items) == 0: + continue + elif len(items) == 1: + sentence = items[0] + labels = [] + else: + sentence = "".join(items[:-1]) + label = items[-1] + labels = [label_list[l] for l in label.split(",")] + yield {"sentence": sentence, "label": labels} diff --git a/applications/text_summarization/README.md b/applications/text_summarization/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3baa427f77f6a56ac4079640c88ca7bb6a3fea82 --- /dev/null +++ b/applications/text_summarization/README.md @@ -0,0 +1,30 @@ +# 文本摘要应用 + +## **1. 文本摘要简介** +文本摘要的目标是自动地将输入文本转换成简短摘要,为用户提供简明扼要的内容描述,是缓解文本信息过载的一个重要手段。 +文本摘要也是自然语言生成领域中的一个重要任务,有很多应用场景,如新闻摘要、论文摘要、财报摘要、传记摘要、专利摘要、对话摘要、评论摘要、观点摘要、电影摘要、文章标题生成、商品名生成、自动报告生成、搜索结果预览等。 + + +## **2. 
文本摘要项目介绍** + +PaddleNLP文本摘要应用主要针对中文文本数据上的摘要需求,基于最前沿的文本摘要预训练模型,开源了文本摘要解决方案。针对文本摘要应用,本项目提供了基于Taskflow开箱即用的产业级文本摘要预置任务能力,无需训练,一键完成文本摘要预测。除此之外,本项目提供给用户定制化训练策略,可以结合用户自身的不同数据需求完成模型的训练、预测和推理部署工作。对于需要特殊能力的文本摘要预训练模型,本项目开源了摘要模型的预训练代码,用户可以使用大规模无标注数据定制在特定领域有摘要能力的预训练模型。 + +本项目使用的基础模型为[PEGASUS(Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models)](https://arxiv.org/pdf/1912.08777.pdf), 是由谷歌公司提出的文本摘要预训练模型。其预训练目标:Gap Sentences Generation (GSG),是根据文本摘要任务形式特殊设计的自监督上游任务。PEGASUS有两个不同的版本(base和large),其模型参数分别为: + + +| 参数 | base(238M) | large(523M) | +| :---: | :--------: | :--------: | +| encoder layers | 12 | 16| +| encoder_attention_heads | 12 | 16| +| encoder_ffn_dim | 3072 |4096 | +| decoder layers | 12 | 16| +| decoder_attention_heads | 12 | 16| +| decoder_ffn_dim | 3072 |4096 | +| max_encode_length | 512 | 1024| + + +## **3. 快速开始** + +- [预训练PEGASUS模型](./pretrain/) + +- [微调PEGASUS模型](./finetune/) diff --git a/applications/text_summarization/finetune/README.md b/applications/text_summarization/finetune/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f6a059944a39db39c6e336f792b3ac0489476796 --- /dev/null +++ b/applications/text_summarization/finetune/README.md @@ -0,0 +1,319 @@ +# 生成式文本摘要应用 +**目录** + +- [生成式文本摘要应用](#生成式文本摘要应用) + - [简介](#简介) + - [效果展示](#效果展示) + - [开箱即用](#开箱即用) + - [支持单条、批量预测](#支持单条批量预测) + - [可配置参数说明](#可配置参数说明) + - [训练定制](#训练定制) + - [文本摘要应用定制训练全流程介绍](#文本摘要应用定制训练全流程介绍) + - [环境依赖](#环境依赖) + - [代码结构说明](#代码结构说明) + - [数据准备](#数据准备) + - [数据加载](#数据加载) + - [从本地文件创建数据集](#从本地文件创建数据集) + - [模型训练](#模型训练) + - [模型预测](#模型预测) + - [模型推理部署](#模型推理部署) + - [FastGeneration加速及模型静态图导出](#fastgeneration加速及模型静态图导出) + - [模型部署](#模型部署) + - [References](#references) + +## 简介 + +文本摘要的目标是自动地将输入文本转换成简短摘要,为用户提供简明扼要的内容描述,是缓解文本信息过载的一个重要手段。 +文本摘要也是自然语言生成领域中的一个重要任务,有很多应用场景,如新闻摘要、论文摘要、财报摘要、传记摘要、专利摘要、对话摘要、评论摘要、观点摘要、电影摘要、文章标题生成、商品名生成、自动报告生成、搜索结果预览等。 + +本项目是基于预训练语言模型PEGASUS的中文文本摘要产业实践,具有以下优势: + +- 效果领先。在LCSTS上效果达到SOTA。 +- 开箱即用。本项目提供TaskFlow接口,无需训练,仅需几行代码便可预测。 +- 高性能推理。本项目基于[FastGeneration](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/fast_generation) + 进行推理加速,能够提供更高性能的推理体验。 +- 训练推理全流程打通。本项目提供了全面的定制训练流程,从数据准备、模型训练预测,到模型推理部署,一应俱全。 + +## 效果展示 + +## 开箱即用 + +PaddleNLP提供开箱即用的产业级NLP预置任务能力,无需训练,一键预测。 + +### 支持单条、批量预测 + +```python +>> > from paddlenlp import Taskflow +>> > summarizer = Taskflow("text_summarization") +# 单条输入 +>> > summarizer( + '2022年,中国房地产进入转型阵痛期,传统“高杠杆、快周转”的模式难以为继,万科甚至直接喊话,中国房地产进入“黑铁时代”') +# 输出:['万科喊话中国房地产进入“黑铁时代”'] + +# 多条输入 +>> > summarizer([ + '据悉,2022年教育部将围绕“巩固提高、深化落实、创新突破”三个关键词展开工作。要进一步强化学校教育主阵地作用,继续把落实“双减”作为学校工作的重中之重,重点从提高作业设计水平、提高课后服务水平、提高课堂教学水平、提高均衡发展水平四个方面持续巩固提高学校“双减”工作水平。', + '党参有降血脂,降血压的作用,可以彻底消除血液中的垃圾,从而对冠心病以及心血管疾病的患者都有一定的稳定预防工作作用,因此平时口服党参能远离三高的危害。另外党参除了益气养血,降低中枢神经作用,调整消化系统功能,健脾补肺的功能。' +]) +# 输出:['教育部:将从四个方面持续巩固提高学校“双减”工作水平', '党参能降低三高的危害'] +``` + +### 可配置参数说明 + +* `model`:可选模型,默认为`IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese`。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 + +## 训练定制 + +### 文本摘要应用定制训练全流程介绍 + +接下来,我们将按数据准备、训练、预测、推理部署对文本摘要应用的全流程进行介绍。 + +1. **数据准备** + +- 如果没有已标注的数据集,我们推荐[doccano](https://github.com/doccano/doccano)数据标注工具。 + 如果已有标注好的本地数据集,我们需要根据将数据集整理为文档要求的格式,请参考[从本地文件创建数据集](#从本地文件创建数据集) + 。 + +2. **模型训练** + +- 数据准备完成后,可以开始使用我们的数据集对预训练模型进行微调训练。我们可以根据任务需求,调整可配置参数,选择使用GPU或CPU进行模型训练,脚本默认保存在开发集最佳表现模型。中文任务默认使用"IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese"模型,还支持large模型: "IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese"。 + + +3. **模型预测** + +- 训练结束后,我们可以加载保存的最佳模型进行模型测试,打印模型预测结果。 + +4. 
**模型推理部署** + +- 模型部署需要将保存的最佳模型参数(动态图)导出成静态图参数,用于后续的推理部署。 + +- 文本摘要应用提供了基于Paddle Inference的本地部署predictor,并且支持在GPU设备使用FastGeneration进行加速。 + +- 文本摘要应用提供了基于Simple Serving的服务端部署方案。 + +### 环境依赖 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +finetune/ +├── data # 数据 +│ ├── train.json # 训练数据集文件 +│ └── test.json # 可选,待预测数据文件 +├── deploy # 部署 +│ ├── paddle_inference # PaddleInference高性能推理部署 +│ │ ├── inference_pegasus.py # 推理部署脚本 +│ │ └── README.md # 说明文档 +│ └── simple_serving +│ ├── client.py # 客户端程序 +│ ├── server.py # 服务器程序 +│ └── README.md # 说明文档 +├── run_prepare.py # 小数据集获取脚本 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── export_model.sh # 动态图参数导出静态图参数shell脚本 +├── predict.py # 预测脚本 +├── train.py # 训练评估脚本 +├── utils.py # 工具函数脚本 +├── requirements.txt # 依赖包 +└── README.md # 说明文档 +``` + +### 数据准备 + +#### 数据加载 + +#### 从本地文件创建数据集 + +在许多情况,我们需要使用本地数据集来训练我们的文本摘要模型,本项目支持使用固定格式本地数据集文件进行训练。 + +本地数据集目录结构如下: + +```text +data/ +├── train.json # 训练数据集文件 +└── test.json # 可选,待预测数据文件 +``` + +本地数据集文件格式如下: + +- train.json/test.json 文件每行格式: + +```text +{ +"title": "任志强抨击政府把土地作为投机品地产业被人为破坏", +"content": "“北京的保障房市场就像一个巨大的赌场,每个人都在期待中奖。”面对中国目前现行的保障性住房政策,华远地产董事长任志强再次语出惊人。(分享自@第一财经-中国房地产金融)" +} +``` + +这里提供小数据集供测试,运行下面命令即可下载: + +```bash +python run_prepare.py +``` + +更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#) +和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + +### 模型训练 + +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +python -m paddle.distributed.launch --gpus "2,3,4,5,6,7" train.py \ + --model_name_or_path=IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese \ + --train_file data/train.json \ + --eval_file data/test.json \ + --output_dir pegasus_out \ + --max_source_length 128 \ + --max_target_length 64 \ + --num_train_epochs 20 \ + --logging_steps 1 \ + --save_steps 10000 \ + --per_device_train_batch_size 128 \ + --per_device_eval_batch_size 128 \ + --learning_rate 5e-5 \ + --warmup_ratio 0.02 \ + --weight_decay=0.01 \ + --do_train \ + --do_eval \ + --device=gpu +``` + +关键参数释义如下: + +- `gpus` 指示了训练所用的GPU卡号。 +- `train_file` 本地训练数据地址。 +- `eval_file` 本地测试数据地址。 +- `model_name_or_path` + 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: + ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese | + | IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese | + +- `output_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `num_train_epochs` 表示训练轮数。 +- `per_device_train_batch_size` 表示每次训练**每张卡**上的样本数目。 +- `per_device_eval_batch_size` 表示每次验证**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_ratio` + 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf) + 。 +- `max_source_length` 模型输入序列的最大长度。 +- `max_target_length` 模型训练时标签的最大长度。 +- `do_train` 是否进行训练。 +- `do_eval` 是否进行预测。 +- `device` 表示使用的设备,从gpu和cpu中选择。 + + 除此之外,我们提供了一种可选的解码端输入增强策略。该策略在解码过程中,基于标准摘要和模型输出构造了新的解码输入数据,以此实现解码端的数据增强。具体详情可以参考[SSTIA论文](https://openreview.net/pdf?id=pz1euXohm4H)。如果想使用该策略,可以设置参数: +- `use_SSTIA` 为True表示使用该策略。以及, +- `mix_ratio` 表示构造输入和原始输入的权重。 + + 该策略在Pegasus-238M和Pegasus-523M模型上均有大幅度提升,具体效果见后文实验结果表格。 + + 
PaddleNLP提供了训练好的SSTIA模型,可以修改`model_name_or_path`直接使用: + + | PaddleNLP提供的SSTIA模型 | + |---------------------------------| + | PaddlePaddle/Randeng-Pegasus-238M-Summary-Chinese-SSTIA | + | PaddlePaddle/Randeng-Pegasus-523M-Summary-Chinese-SSTIA | + +更多参数详情和参数的默认值请参考`train.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`output_dir`中。 +如: + +```text +./pegasus_out/ +├── model_config.json +├── model_state.pdparams +├── special_tokens_map.json +├── tokenizer_config.json +└── vocab.txt +``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + +### 模型预测 + +运行下方脚本可以使用训练好的模型进行预测。 + +```shell +unset CUDA_VISIBLE_DEVICES + +python predict.py \ + --init_checkpoint_dir=pegasus_out \ + --prefict_file data/valid.json \ + --max_source_length 128 \ + --max_target_length 64 \ + --batch_size 128 \ + --device=gpu \ +``` + +程序运行结束后会将预测结果保存在`output_path`中。 + +Finetuned baseline的模型在[LCSTS](https://aclanthology.org/D15-1229/)测试集上有如下结果: +| model_name | Rouge-1 | Rouge-2 | Rouge-L | BLEU-4 | +| :-----------------------------: | :---: | :-----------: | :-------------------: |:-------------------: | +| finetuned IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese | 43.30 | 30.08 | 40.12 | 24.50 | +| IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese + SSTIA | 45.79 | 33.20 | 42.88 | 28.07 | +| finetuned IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese | 48.13 | 36.41 | 45.39 | 31.99 | +| IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese + SSTIA | 53.23 | 42.79 | 50.84 | 39.05 | + +### 模型推理部署 + +#### FastGeneration加速及模型静态图导出 + +使用动态图训练结束之后,可以通过[静态图导出脚本](export_model.py) +实现基于FastGeneration的高性能预测加速,并将动态图参数导出成静态图参数,静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python export_model.py \ + --model_name_or_path IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese \ + --decoding_strategy beam_search \ + --export_output_dir ./inference_model \ + --max_out_len 30 \ +``` + +关键参数释义如下: + +* `model_name_or_path`:动态图训练保存的参数路径;默认为"IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese"。 +* `export_output_dir`:静态图图保存的参数路径;默认为"./inference_model"。 +* `max_out_len`:最大输出长度。 + +执行命令后将会自动导出模型到指定的 `inference_model` 中,保存模型文件结构如下所示: + +```text +inference_model/ +├── pegasus.pdiparams +├── pegasus.pdiparams.info +└── pegasus.pdmodel +``` + +#### 模型部署 + +文本摘要应用已打通多种场景部署方案,点击链接获取具体的使用教程。 + +- [Paddle Inference 推理 (Python)](./deploy/paddle_inference/README.md) +- [Simple Serving 服务化部署(Python)](./deploy/simple_serving/README.md) + +## References + +- Zhang J, Zhao Y, Saleh M, et al. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization[C] + //International Conference on Machine Learning. PMLR, 2020: 11328-11339. +- Wang J, Zhang Y, Zhang L, et al. Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence[J]. arXiv + preprint arXiv:2209.02970, 2022. +- Xie S, Lv A, Xia Y, et al. Target-side input augmentation for sequence to sequence generation[C] + //International Conference on Learning Representations. 2022. 
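
补充说明(概念示意,非本项目源码):上文介绍的 SSTIA 解码端输入增强策略,其核心思想是训练时不完全使用标准摘要作为解码器输入,而是按 `mix_ratio` 将基于模型自身输出构造的输入与原始输入加权混合。下面给出一个与具体框架无关的 numpy 示意,其中词表大小、序列长度、嵌入维度与 `mix_ratio` 均为假设值,实际实现请以 `train.py` 中 `use_SSTIA` 相关代码为准:

```python
# Conceptual sketch of target-side (decoder) input augmentation with
# hypothetical sizes; this is NOT the project's implementation.
import numpy as np

np.random.seed(0)
vocab_size, hidden_size, seq_len = 8, 4, 5  # toy sizes
mix_ratio = 0.3  # weight of the constructed (model-derived) input

embedding = np.random.randn(vocab_size, hidden_size)  # token embedding table
gold_ids = np.array([2, 5, 1, 7, 3])  # gold summary token ids (decoder input)
# Model output distribution over the vocabulary at each decoding step.
pred_probs = np.random.dirichlet(np.ones(vocab_size), size=seq_len)

gold_emb = embedding[gold_ids]  # (seq_len, hidden_size): teacher-forcing input
pred_emb = pred_probs @ embedding  # expected embedding under the model's prediction

# New decoder input: weighted mix of the original input and the constructed input.
mixed_emb = (1.0 - mix_ratio) * gold_emb + mix_ratio * pred_emb
print(mixed_emb.shape)  # (5, 4)
```

该示意只展示“按权重混合原始解码输入与构造输入”这一步,不涉及损失函数与训练循环。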
diff --git a/applications/text_summarization/finetune/deploy/paddle_inference/README.md b/applications/text_summarization/finetune/deploy/paddle_inference/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4c3cf774dced1606cd6df4e00630954ab70babe7 --- /dev/null +++ b/applications/text_summarization/finetune/deploy/paddle_inference/README.md @@ -0,0 +1,31 @@ +# Paddle Inference部署 +本文档将介绍如何使用[Paddle Inference](https://paddle-inference.readthedocs.io/en/latest/guides/introduction/index_intro.html#paddle-inference)工具进行自动文本摘要应用高性能推理推理部署。 + +**目录** + * [背景介绍](#背景介绍) + * [导出预测部署模型](#导出预测部署模型) + * [基于Python预测](#基于Python预测) + + +## 背景介绍 +Paddle inference和主框架的Model.predict均可实现推理预测,Paddle Inference 是飞桨的原生推理库, 作用于服务器端和云端,提供高性能的推理能力,主框架的Model 对象是一个具备训练、测试、推理的神经网络。相比于Model.predict,inference可使用MKLDNN、CUDNN、TensorRT进行预测加速。Model.predict适用于训练好的模型直接进行预测,paddle inference适用于对推理性能、通用性有要求的用户,针对不同平台不同的应用场景进行了深度的适配优化,保证模型在服务器端即训即用,快速部署。由于 Paddle Inference 能力直接基于飞桨的训练算子,因此它支持飞桨训练出的所有模型的推理。 + + +Paddle Inference Python端预测部署主要包含两个步骤: +- 导出预测部署模型 +- 基于Python预测 + + +## 导出预测部署模型 +部署时需要使用预测格式的模型(即动态图转静态图操作)。预测格式模型相对训练格式模型而言,在拓扑上裁剪掉了预测不需要的算子,并且会做特定部署优化。具体操作详见[FastGeneration加速及模型静态图导出](../../README.md)。 + +## 基于Python预测 + + +在终端输入以下命令可在GPU上进行预测: +```shell +python inference_pegasus.py --inference_model_dir ../../inference_model +``` + +关键参数释义如下: +* `inference_model_dir`:用于高性能推理的静态图模型参数路径;默认为"../../inference_model"。 diff --git a/applications/text_summarization/finetune/deploy/paddle_inference/inference_pegasus.py b/applications/text_summarization/finetune/deploy/paddle_inference/inference_pegasus.py new file mode 100644 index 0000000000000000000000000000000000000000..b917302a9e0c63c2f3b5672190f5f6e0345577a2 --- /dev/null +++ b/applications/text_summarization/finetune/deploy/paddle_inference/inference_pegasus.py @@ -0,0 +1,102 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +import numpy as np +from paddle import inference + +from paddlenlp.ops.ext_utils import load +from paddlenlp.transformers import PegasusChineseTokenizer + + +def setup_args(): + """Setup arguments.""" + parser = argparse.ArgumentParser() + parser.add_argument( + "--inference_model_dir", + default="../../inference_model/", + type=str, + help="Path to save inference model of Pegasus. ", + ) + args = parser.parse_args() + return args + + +def setup_predictor(args): + """Setup inference predictor.""" + # Load FastGeneration lib. 
+ load("FastGeneration", verbose=True) + model_file = os.path.join(args.inference_model_dir, "pegasus.pdmodel") + params_file = os.path.join(args.inference_model_dir, "pegasus.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = inference.Config(model_file, params_file) + config.enable_use_gpu(100, 0) + config.switch_ir_optim() + config.enable_memory_optim() + config.disable_glog_info() + + predictor = inference.create_predictor(config) + return predictor + + +def convert_example(example, tokenizer, max_seq_len=512): + """Convert all examples into necessary features.""" + tokenized_example = tokenizer( + example, max_length=max_seq_len, padding=True, truncation=True, return_attention_mask=False + ) + return tokenized_example + + +def infer(args, predictor): + """Use predictor to inference.""" + tokenizer = PegasusChineseTokenizer.from_pretrained("IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese") + + inputs = [ + "在北京冬奥会自由式滑雪女子坡面障碍技巧决赛中,中国选手谷爱凌夺得银牌。祝贺谷爱凌!今天上午,自由式滑雪女子坡面障碍技巧决赛举行。决赛分三轮进行,取选手最佳成绩排名决出奖牌。第一跳,中国选手谷爱凌获得69.90分。在12位选手中排名第三。完成动作后,谷爱凌又扮了个鬼脸,甚是可爱。第二轮中,谷爱凌在道具区第三个障碍处失误,落地时摔倒。获得16.98分。网友:摔倒了也没关系,继续加油!在第二跳失误摔倒的情况下,谷爱凌顶住压力,第三跳稳稳发挥,流畅落地!获得86.23分!此轮比赛,共12位选手参赛,谷爱凌第10位出场。网友:看比赛时我比谷爱凌紧张,加油!", + "据微信公众号“界面”报道,4日上午10点左右,中国发改委反垄断调查小组突击查访奔驰上海办事处,调取数据材料,并对多名奔驰高管进行了约谈。截止昨日晚9点,包括北京梅赛德斯-奔驰销售服务有限公司东区总经理在内的多名管理人员仍留在上海办公室内", + ] + + data = convert_example(inputs, tokenizer, max_seq_len=128) + input_handles = {} + for name in predictor.get_input_names(): + input_handles[name] = predictor.get_input_handle(name) + input_handles[name].copy_from_cpu(np.array(data[name], dtype="int32")) + + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + predictor.run() + + output = [output_handle.copy_to_cpu() for output_handle in output_handles] + + for idx, sample in enumerate(output[0]): + for beam_idx, beam in enumerate(sample): + if beam_idx >= len(sample) // 2: + break + print( + f"Example {idx} beam {beam_idx}: ", + "".join(tokenizer.decode(beam, skip_special_tokens=True, clean_up_tokenization_spaces=False)), + ) + + +if __name__ == "__main__": + args = setup_args() + pprint(args) + predictor = setup_predictor(args) + infer(args, predictor) diff --git a/applications/text_summarization/finetune/deploy/simple_serving/README.md b/applications/text_summarization/finetune/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d46d99f9317367cd9b8f9cddf8ac72266d7d3e92 --- /dev/null +++ b/applications/text_summarization/finetune/deploy/simple_serving/README.md @@ -0,0 +1,78 @@ +# SimpleServing服务化部署 + +本文档将介绍如何使用[PaddleNLP SimpleServing](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/server.md)工具部署自动文本摘要在线服务。 + +## 目录 +- [SimpleServing服务化部署](#SimpleServing服务化部署) + - [目录](#目录) + - [背景介绍](#背景介绍) + - [环境准备](#环境准备) + - [启动服务](#启动服务) + - [发送请求](#发送请求) + - [服务化自定义参数](#服务化自定义参数) + - [server参数](#server参数) + - [模型路径](#模型路径) + - [多卡服务化预测](#多卡服务化预测) + - [Taskflow加速](#Taskflow加速) + - [client参数](#client参数) + + + +## 背景介绍 +PaddleNLP SimpleServing 是基于 unicorn 封装的模型部署服务化工具,该服务化工具具备灵活、易用的特性,可以简易部署预训练模型和预训练模型工具Taskflow,PaddleNLP SimpleServing 具备以下两个特性: + +- 易用:一行代码即可部署预训练模型和预训练工具Taskflow +- 灵活:Handler机制可以快速定制化服务化部署方式 + +PaddleNLP SimpleServing Python端预测部署主要包含以下步骤: +- 环境准备 +- 启动服务 +- 发送请求 + +## 环境准备 +下载安装包含SimpleServing功能的PaddleNLP版本: +```shell +pip install 
paddlenlp +``` + +## 启动服务 +```shell +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8189 +``` + +## 发送请求 +```shell +python client.py +``` + +## 服务化自定义参数 + +### server参数 + +#### 模型路径 + +默认使用的模型为 `IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese` , 用户也可以通过修改`task_path`参数使用其他模型或自己的模型: + +```shell +ts = Taskflow("text_summarization", task_path='../../checkpoint/model_best/') +``` +可选模型有 `PaddlePaddle/Randeng-Pegasus-238M-Summary-Chinese-SSTIA`, `PaddlePaddle/Randeng-Pegasus-523M-Summary-Chinese-SSTIA`, `unimo-text-1.0-summary`, `IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese`, `IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese` + +#### 多卡服务化预测 +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,下面是示例代码: + +```shell +ts1 = Taskflow('text_summarization', device_id=0) +ts2 = Taskflow('text_summarization', device_id=1) +service.register_taskflow("taskflow/text_summarization", [ts1, ts2]) +``` + +#### Taskflow加速 +PaddleNLP SimpleServing 支持在线服务加速,需要在注册Taskflow时设置参数`use_faster`: + +```shell +ts = Taskflow("text_summarization", use_faster=True) +``` + +### client参数 +用户修改`client.py`中的texts变量以对任意文本进行摘要。 diff --git a/applications/text_summarization/finetune/deploy/simple_serving/client.py b/applications/text_summarization/finetune/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..4a24deca524a63e90aba224bd67746f1be8d3e64 --- /dev/null +++ b/applications/text_summarization/finetune/deploy/simple_serving/client.py @@ -0,0 +1,29 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import requests + +url = "http://0.0.0.0:8189/taskflow/text_summarization" +headers = {"Content-Type": "application/json"} +texts = [ + "在北京冬奥会自由式滑雪女子坡面障碍技巧决赛中,中国选手谷爱凌夺得银牌。祝贺谷爱凌!今天上午,自由式滑雪女子坡面障碍技巧决赛举行。决赛分三轮进行,取选手最佳成绩排名决出奖牌。第一跳,中国选手谷爱凌获得69.90分。在12位选手中排名第三。完成动作后,谷爱凌又扮了个鬼脸,甚是可爱。第二轮中,谷爱凌在道具区第三个障碍处失误,落地时摔倒。获得16.98分。网友:摔倒了也没关系,继续加油!在第二跳失误摔倒的情况下,谷爱凌顶住压力,第三跳稳稳发挥,流畅落地!获得86.23分!此轮比赛,共12位选手参赛,谷爱凌第10位出场。网友:看比赛时我比谷爱凌紧张,加油!", + "据微信公众号“界面”报道,4日上午10点左右,中国发改委反垄断调查小组突击查访奔驰上海办事处,调取数据材料,并对多名奔驰高管进行了约谈。截止昨日晚9点,包括北京梅赛德斯-奔驰销售服务有限公司东区总经理在内的多名管理人员仍留在上海办公室内", +] +data = {"data": {"text": texts}} + +r = requests.post(url=url, headers=headers, data=json.dumps(data)) +datas = json.loads(r.text) +print(datas) diff --git a/applications/text_summarization/finetune/deploy/simple_serving/server.py b/applications/text_summarization/finetune/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..ea35203924063cfbd78a6bcd469c84bb26cb2bdb --- /dev/null +++ b/applications/text_summarization/finetune/deploy/simple_serving/server.py @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer, Taskflow + +ts = Taskflow("text_summarization") +app = SimpleServer() +app.register_taskflow("taskflow/text_summarization", ts) diff --git a/applications/text_summarization/finetune/export_model.py b/applications/text_summarization/finetune/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..fad70e14ee88e5b81710f567301d820747b68711 --- /dev/null +++ b/applications/text_summarization/finetune/export_model.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +import paddle + +from paddlenlp.ops import FasterPegasus +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusForConditionalGeneration, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + default="IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + type=str, + help="The model name to specify the Pegasus to use. ", + ) + parser.add_argument( + "--export_output_dir", default="./inference_model", type=str, help="Path to save inference model of Pegasus. " + ) + parser.add_argument("--topk", default=4, type=int, help="The number of candidate to procedure top_k sampling. ") + parser.add_argument( + "--topp", default=1.0, type=float, help="The probability threshold to procedure top_p sampling. " + ) + parser.add_argument("--max_out_len", default=64, type=int, help="Maximum output length. ") + parser.add_argument("--min_out_len", default=1, type=int, help="Minimum output length. ") + parser.add_argument("--num_return_sequence", default=1, type=int, help="The number of returned sequence. ") + parser.add_argument("--temperature", default=1.0, type=float, help="The temperature to set. ") + parser.add_argument("--num_return_sequences", default=1, type=int, help="The number of returned sequences. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + parser.add_argument( + "--decoding_strategy", + default="beam_search", + choices=["beam_search"], + type=str, + help="The main strategy to decode. ", + ) + parser.add_argument("--num_beams", default=4, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument( + "--diversity_rate", default=0.0, type=float, help="The diversity rate to procedure beam search. 
" + ) + parser.add_argument( + "--length_penalty", + default=0.0, + type=float, + help="The exponential penalty to the sequence length in the beam_search strategy. ", + ) + + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + model_name_or_path = args.model_name_or_path + model = PegasusForConditionalGeneration.from_pretrained(model_name_or_path) + tokenizer = PegasusChineseTokenizer.from_pretrained(model_name_or_path) + + pegasus = FasterPegasus(model=model, use_fp16_decoding=args.use_fp16_decoding, trans_out=True) + + # Set evaluate mode + pegasus.eval() + + # Convert dygraph model to static graph model + pegasus = paddle.jit.to_static( + pegasus, + input_spec=[ + # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int32"), + # encoder_output + None, + # seq_len + None, + # min_length + args.min_out_len, + # max_length + args.max_out_len, + # num_beams. Used for beam_search. + args.num_beams, + # decoding_strategy + args.decoding_strategy, + # decoder_start_token_id + model.decoder_start_token_id, + # bos_token_id + tokenizer.bos_token_id, + # eos_token_id + tokenizer.eos_token_id, + # pad_token_id + tokenizer.pad_token_id, + # diversity rate. Used for beam search. + args.diversity_rate, + # length_penalty + args.length_penalty, + # topk + args.topk, + # topp + args.topp, + # temperature + args.temperature, + # num_return_sequences + args.num_return_sequences, + ], + ) + + # Save converted static graph model + paddle.jit.save(pegasus, os.path.join(args.export_output_dir, "pegasus")) + logger.info("PEGASUS has been saved to {}.".format(args.export_output_dir)) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + + do_predict(args) diff --git a/applications/text_summarization/finetune/export_model.sh b/applications/text_summarization/finetune/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..95bf1d5acb750b4860efee95a40d112c04c5dedd --- /dev/null +++ b/applications/text_summarization/finetune/export_model.sh @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py \ + --model_name_or_path IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese \ + --decoding_strategy beam_search \ + --export_output_dir ./inference_model \ + --max_out_len 30 \ \ No newline at end of file diff --git a/applications/text_summarization/finetune/predict.py b/applications/text_summarization/finetune/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..b15c5008061e54479861070fc8de9c491d0ccdfb --- /dev/null +++ b/applications/text_summarization/finetune/predict.py @@ -0,0 +1,193 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import random +import time +from functools import partial +from pprint import pprint + +import numpy as np +import paddle +from datasets import load_dataset +from paddle.io import BatchSampler, DataLoader +from utils import compute_metrics, convert_example + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusForConditionalGeneration, +) + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--init_checkpoint_dir", + default="IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + type=str, + required=True, + help="Path to pre-trained model. ", + ) + parser.add_argument( + "--prefict_file", type=str, required=False, default="data/valid.json", help="Predict data path." + ) + parser.add_argument( + "--output_path", type=str, default="generate.txt", help="The file path where the infer result will be saved." + ) + parser.add_argument( + "--max_source_length", + default=128, + type=int, + help="The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--min_target_length", + default=0, + type=int, + help="The minimum total sequence length for target text when generating. ", + ) + parser.add_argument( + "--max_target_length", + default=64, + type=int, + help="The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``.", + ) + parser.add_argument( + "--decode_strategy", default="greedy_search", type=str, help="The decode strategy in generation." + ) + parser.add_argument( + "--top_k", + default=2, + type=int, + help="The number of highest probability vocabulary tokens to keep for top-k sampling.", + ) + parser.add_argument("--top_p", default=1.0, type=float, help="The cumulative probability for top-p sampling.") + parser.add_argument("--num_beams", default=1, type=int, help="The number of beams for beam search.") + parser.add_argument( + "--length_penalty", + default=0.6, + type=float, + help="The exponential penalty to the sequence length for beam search.", + ) + parser.add_argument( + "--early_stopping", + default=False, + type=eval, + help="Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.", + ) + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity of beam search. ") + parser.add_argument( + "--faster", action="store_true", help="Whether to process inference using faster transformer. " + ) + parser.add_argument( + "--use_fp16_decoding", + action="store_true", + help="Whether to use fp16 when using faster transformer. Only works when using faster transformer. 
", + ) + parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for testing or evaluation.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def generate(args): + paddle.set_device(args.device) + set_seed(args) + tokenizer = PegasusChineseTokenizer.from_pretrained(args.init_checkpoint_dir) + model = PegasusForConditionalGeneration.from_pretrained(args.init_checkpoint_dir) + dataset = load_dataset("json", data_files=args.prefict_file, split="train") + remove_columns = ["content", "title"] + trans_func = partial( + convert_example, + text_column="content", + summary_column="title", + tokenizer=tokenizer, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ) + dataset = dataset.map(trans_func, batched=True, load_from_cache_file=True, remove_columns=remove_columns) + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size, shuffle=False) + batchify_fn = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model) + data_loader = DataLoader( + dataset=dataset, batch_sampler=batch_sampler, num_workers=0, collate_fn=batchify_fn, return_list=True + ) + data_loader.pin_memory = False + + model.eval() + total_time = 0.0 + start_time = time.time() + all_preds = [] + all_labels = [] + for step, batch in enumerate(data_loader): + labels = batch.pop("labels").numpy() + preds, _ = model.generate( + input_ids=batch["input_ids"], + attention_mask=batch["attention_mask"], + max_length=args.max_target_length, + min_length=args.min_target_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + diversity_rate=args.diversity_rate, + use_fast=args.faster, + ) + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + all_preds.extend( + tokenizer.batch_decode(preds.numpy(), skip_special_tokens=True, clean_up_tokenization_spaces=False) + ) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + start_time = time.time() + + compute_metrics(all_preds, all_labels) + with open(args.output_path, "w", encoding="utf-8") as fout: + for decoded_pred in all_preds: + fout.write(decoded_pred + "\n") + print("Save generated result into: %s" % args.output_path) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + generate(args) diff --git a/applications/text_summarization/finetune/requirements.txt b/applications/text_summarization/finetune/requirements.txt new file mode 100644 index 
0000000000000000000000000000000000000000..7cfd8eb0cbe9979f0943ae351e09c4c60aa3f13d --- /dev/null +++ b/applications/text_summarization/finetune/requirements.txt @@ -0,0 +1 @@ +rouge==1.0.1 \ No newline at end of file diff --git a/applications/text_summarization/finetune/run_prepare.py b/applications/text_summarization/finetune/run_prepare.py new file mode 100644 index 0000000000000000000000000000000000000000..373d0b08d7c65ab4431386fbc5e64255483e3aa8 --- /dev/null +++ b/applications/text_summarization/finetune/run_prepare.py @@ -0,0 +1,29 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + + +def prepare(): + + bos_link_train = "https://paddlenlp.bj.bcebos.com/datasets/tiny_summary_dataset/train.json" + bos_link_valid = "https://paddlenlp.bj.bcebos.com/datasets/tiny_summary_dataset/valid.json" + bos_link_test = "https://paddlenlp.bj.bcebos.com/datasets/tiny_summary_dataset/test.json" + os.system("mkdir data") + os.system("cd data && wget %s " % (bos_link_train)) + os.system("cd data && wget %s " % (bos_link_valid)) + os.system("cd data && wget %s " % (bos_link_test)) + + +prepare() diff --git a/applications/text_summarization/finetune/train.py b/applications/text_summarization/finetune/train.py new file mode 100644 index 0000000000000000000000000000000000000000..52e36eda84d978cb24637549c253f8da4046cc74 --- /dev/null +++ b/applications/text_summarization/finetune/train.py @@ -0,0 +1,170 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from datasets import load_dataset +from utils import PegasusTrainer, compute_metrics, convert_example, main_process_first + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, set_seed +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusForConditionalGeneration, +) + + +@dataclass +class ModelArguments: + model_name_or_path: Optional[str] = field( + default="IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + metadata={"help": ("Path to pre-trained model.")}, + ) + max_source_length: Optional[int] = field( + default=128, + metadata={ + "help": ( + "The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded." 
+ ) + }, + ) + min_target_length: Optional[int] = field( + default=0, + metadata={"help": ("The minimum total sequence length for target text when generating. ")}, + ) + max_target_length: Optional[int] = field( + default=64, + metadata={ + "help": ( + "The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``." + ) + }, + ) + use_SSTIA: Optional[bool] = field( + default=False, + metadata={"help": ("Whether to use SSTIA.")}, + ) + mix_ratio: Optional[float] = field( + default=0, + metadata={"help": ("Mixture ratio for TSDASG synthetic input.")}, + ) + num_beams: Optional[int] = field( + default=1, + metadata={"help": ("The number of beams to use in beam search.")}, + ) + predict_with_generate: Optional[bool] = field( + default=True, + metadata={"help": ("Whether to generate in predcit.")}, + ) + + +@dataclass +class DataArguments: + train_file: Optional[str] = field( + default="data/train.json", + metadata={"help": ("Train data path.")}, + ) + eval_file: Optional[str] = field( + default="data/test.json", + metadata={"help": ("Eval data path.")}, + ) + + +def compute_metrics_trainer(eval_preds, tokenizer): + all_preds = [] + all_labels = [] + labels = eval_preds.label_ids + preds = eval_preds.predictions + all_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + rougel = compute_metrics(all_preds, all_labels) + return {"RougeL": rougel} + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + training_args.generation_max_length = model_args.max_target_length + training_args.generation_num_beams = model_args.num_beams + training_args.predict_with_generate = model_args.predict_with_generate + + tokenizer = PegasusChineseTokenizer.from_pretrained(model_args.model_name_or_path) + train_set = load_dataset("json", data_files=data_args.train_file, split="train") + dev_set = load_dataset("json", data_files=data_args.eval_file, split="train") + remove_columns = ["content", "title"] + trans_func = partial( + convert_example, + text_column="content", + summary_column="title", + tokenizer=tokenizer, + max_source_length=model_args.max_source_length, + max_target_length=model_args.max_target_length, + ) + with main_process_first(desc="train dataset map pre-processing"): + train_set = train_set.map(trans_func, batched=True, load_from_cache_file=True, remove_columns=remove_columns) + with main_process_first(desc="dev dataset map pre-processing"): + dev_set = dev_set.map(trans_func, batched=True, load_from_cache_file=True, remove_columns=remove_columns) + + model = PegasusForConditionalGeneration.from_pretrained(model_args.model_name_or_path) + if model_args.use_SSTIA: + model.use_SSTIA = True + model.mix_ratio = model_args.mix_ratio + + batchify_fn = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model) + + compute_metrics_func = partial( + compute_metrics_trainer, + tokenizer=tokenizer, + ) + + trainer = PegasusTrainer( + model=model, + 
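+        # PegasusTrainer (defined in utils.py) extends Seq2SeqTrainer with Pegasus-specific
+        # loss extraction and generation-based evaluation in prediction_step.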
args=training_args, + train_dataset=train_set if training_args.do_train else None, + eval_dataset=dev_set if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + compute_metrics=compute_metrics_func, + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/applications/text_summarization/finetune/utils.py b/applications/text_summarization/finetune/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..489bc4567f6690ba80587dd78c9588eeff749b36 --- /dev/null +++ b/applications/text_summarization/finetune/utils.py @@ -0,0 +1,224 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import contextlib +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +from paddle import nn +from rouge import Rouge + +from paddlenlp.metrics import BLEU +from paddlenlp.trainer import Seq2SeqTrainer +from paddlenlp.utils.log import logger + + +def convert_example(example, text_column, summary_column, tokenizer, max_source_length, max_target_length): + """ + Convert a example into necessary features. + """ + inputs = example[text_column] + targets = example[summary_column] + model_inputs = tokenizer( + inputs, max_length=max_source_length, padding=False, truncation=True, return_attention_mask=True + ) + labels = tokenizer(targets, max_length=max_target_length, padding=False, truncation=True) + model_inputs["labels"] = labels["input_ids"] + return model_inputs + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("rouge-1:", round(rouge1, 4)) + print("rouge-2:", round(rouge2, 4)) + print("rouge-L:", round(rougel, 4)) + print("BLEU-4:", round(bleu4.score(), 4)) + return rougel + + +@contextlib.contextmanager +def main_process_first(desc="work"): + if paddle.distributed.get_world_size() > 1: + rank = paddle.distributed.get_rank() + is_main_process = rank == 0 + main_process_desc = "main local process" + + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{rank}: waiting for the {main_process_desc} to perform {desc}") + paddle.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{rank}: {main_process_desc} completed {desc}, releasing all replicas") + paddle.distributed.barrier() + else: + yield + + +class PegasusTrainer(Seq2SeqTrainer): + def compute_loss(self, model, inputs, return_outputs=False): + """ + How the loss is computed by Trainer. By default, all models return the loss in the first element. + Subclass and override for custom behavior. + """ + if self.criterion is not None: + if "labels" in inputs: + labels = inputs.pop("labels") + elif "start_positions" in inputs and "end_positions" in inputs: + labels = (inputs.pop("start_positions"), inputs.pop("end_positions")) + elif self.args.label_names is not None: + labels = [] + for label in self.label_names: + labels.append(inputs.pop(label)) + labels = tuple(labels) + elif "generator_labels" in inputs: + labels = inputs["generator_labels"] + else: + labels = None + + outputs = model(**inputs) + if self.criterion is not None: + loss = self.criterion(outputs, labels) + outputs = (loss, outputs) + + # Save past state if it exists + # TODO: this needs to be fixed and made cleaner later. + if self.args.past_index >= 0: + self._past = outputs[self.args.past_index] + + # We don't use .loss here since the model may return tuples instead of ModelOutput. + loss = outputs["loss"] if isinstance(outputs, dict) else outputs[2] + + return (loss, outputs) if return_outputs else loss + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[float], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + """ + Perform an evaluation step on `model` using `inputs`. + + Subclass and override to inject custom behavior. + + Args: + model (`nn.Layer`): + The model to evaluate. + inputs (`Dict[str, Union[paddle.Tensor, Any]]`): + The inputs and targets of the model. + + The dictionary will be unpacked before being fed to the model. Most models expect the targets under the + argument `labels`. Check your model's documentation for all accepted arguments. + prediction_loss_only (`bool`): + Whether or not to return the loss only. 
+ + Return: + Tuple[Optional[float], Optional[paddle.Tensor], Optional[paddle.Tensor]]: A tuple with the loss, logits and + labels (each being optional). + """ + + if not self.args.predict_with_generate or prediction_loss_only: + return super().prediction_step( + model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys + ) + + has_labels = "labels" in inputs + inputs = self._prepare_inputs(inputs) + + gen_kwargs = self._gen_kwargs.copy() + if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None: + gen_kwargs["max_length"] = self.model.config.max_length + gen_kwargs["num_beams"] = ( + gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.model.config.num_beams + ) + + if "attention_mask" in inputs: + gen_kwargs["attention_mask"] = inputs.get("attention_mask", None) + if "global_attention_mask" in inputs: + gen_kwargs["global_attention_mask"] = inputs.get("global_attention_mask", None) + + # prepare generation inputs + # some encoder-decoder models can have varying encoder's and thus + # varying model input names + if hasattr(self.model, "encoder") and self.model.encoder.main_input_name != self.model.main_input_name: + generation_inputs = inputs[self.model.encoder.main_input_name] + else: + generation_inputs = inputs[self.model.main_input_name] + + generated_tokens = self.model.generate( + generation_inputs, + **gen_kwargs, + ) + # different from hf returns: tuple[Tensor]: It is a tuple contains two elements: ids and scores. + if isinstance(generated_tokens, tuple): + generated_tokens = generated_tokens[0] + # in case the batch is shorter than max length, the output should be padded + if gen_kwargs.get("max_length") is not None and generated_tokens.shape[-1] < gen_kwargs["max_length"]: + generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_length"]) + elif gen_kwargs.get("max_new_tokens") is not None and generated_tokens.shape[-1] < ( + gen_kwargs["max_new_tokens"] + 1 + ): + generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_new_tokens"] + 1) + + with paddle.no_grad(): + if has_labels: + with self.autocast_smart_context_manager(): + outputs = model(**inputs) + if self.label_smoother is not None: + loss = self.label_smoother(outputs, inputs["labels"]).mean().detach() + else: + # pegasus output is lm_logits, new_cache, masked_lm_loss + loss = (outputs["loss"] if isinstance(outputs, dict) else outputs[2]).mean().detach() + else: + loss = None + + if self.args.prediction_loss_only: + return (loss, None, None) + + if has_labels: + labels = inputs["labels"] + if gen_kwargs.get("max_length") is not None and labels.shape[-1] < gen_kwargs["max_length"]: + labels = self._pad_tensors_to_max_len(labels, gen_kwargs["max_length"]) + elif gen_kwargs.get("max_new_tokens") is not None and labels.shape[-1] < ( + gen_kwargs["max_new_tokens"] + 1 + ): + labels = self._pad_tensors_to_max_len(labels, (gen_kwargs["max_new_tokens"] + 1)) + else: + labels = None + + return (loss, generated_tokens, labels) diff --git a/applications/text_summarization/pretrain/README.md b/applications/text_summarization/pretrain/README.md new file mode 100644 index 0000000000000000000000000000000000000000..76f86e23b57cad39f5639921dc0a66e310a586ca --- /dev/null +++ b/applications/text_summarization/pretrain/README.md @@ -0,0 +1,191 @@ +# 生成式文本摘要预训练 +**目录** + +- [生成式文本摘要预训练](#生成式文本摘要预训练) + - [简介](#简介) + - [预训练任务介绍](#预训练任务) + - [预训练定制](#预训练定制) + - [文本摘要预训练全流程介绍](#文本摘要预训练全流程介绍) + - [环境依赖](#环境依赖) + - 
[代码结构说明](#代码结构说明) + - [数据准备](#数据准备) + - [数据加载](#数据加载) + - [从本地文件创建数据集](#从本地文件创建数据集) + - [模型预训练](#模型预训练) + - [模型微调](#模型微调) + - [References](#references) + +## 简介 + +文本摘要的目标是自动地将输入文本转换成简短摘要,为用户提供简明扼要的内容描述,是缓解文本信息过载的一个重要手段。 +文本摘要也是自然语言生成领域中的一个重要任务,有很多应用场景,如新闻摘要、论文摘要、财报摘要、传记摘要、专利摘要、对话摘要、评论摘要、观点摘要、电影摘要、文章标题生成、商品名生成、自动报告生成、搜索结果预览等。 + +本项目预训练了一个专门为中文文本摘要任务设计的语言模型:PEGASUS。其预训练目标为间隙句子生成(Gap Sentences Generation, GSG),是专门为文本摘要任务设计的上游任务。 + +## 预训练任务 +Gap Sentences Generation(GSG)是一种专门为文本摘要提出的自监督预训练任务,其首先找出输入文本中较为核心的数个句子,然后将它们直接拼接到一起得到伪摘要输出,这些句子在输入中的位置则被替换成mask token,预训练的目标就是生成这些被mask掉的核心句子,即间隙句子。 + +对于GSG任务如何选择核心句子以及超参数的设置,请参考[原论文](https://arxiv.org/pdf/1912.08777.pdf) + +另外,原论文中也用到了Masked Language Model (MLM) 作为预训练任务,但实际效果增幅不大,所以不做使用。 + + +## 预训练定制 + +### 文本摘要预训练全流程介绍 + +接下来,我们将按数据准备、预训练、预测的全流程进行介绍。 + +1. **数据准备** + +- 如果没有已标注的数据集,我们推荐[doccano](https://github.com/doccano/doccano)数据标注工具。 + 如果已有标注好的本地数据集,我们需要根据将数据集整理为文档要求的格式,请参考[从本地文件创建数据集](#从本地文件创建数据集) + 。 +- 此外,还需要准备中文停用词表,存放到stopwords.txt中,建议参考[哈工大停用词表](https://github.com/goto456/stopwords) + +2. **模型预训练** + +- 数据准备完成后,可以开始使用我们的数据集完成模型的预训练任务。我们可以根据任务需求,调整可配置参数,选择使用GPU或CPU进行模型训练,脚本默认保存在开发集最佳表现模型。预训练的Tokenizer默认使用base版本"IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese"的分词器,还支持large版本的分词器: "IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese" + + +3. **模型预测** + +- 预训练结束后,我们可以加载保存的最佳模型进行模型测试,打印模型在文本摘要任务上的预测结果。 + + +### 环境依赖 + +rouge==1.0.1 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +pretrain/ +├── data # 数据 +│ ├── train.json # 预训练数据集文件 +│ └── test.json # 可选,待预测数据文件 +├── stopwords.txt # 停用词表 +├── train.py # 训练评估脚本 +├── utils.py # 工具函数脚本 +├── requirements.txt # 依赖包 +└── README.md # 说明文档 +``` + +### 数据准备 + +#### 数据加载 + +#### 从本地文件创建数据集 + +如果您想使用自己的数据来预训练PEGASUS模型,本项目支持使用固定格式本地数据集文件进行预训练。 + +本地数据集目录结构如下: + +```text +data/ +├── train.json # 训练数据集文件 +└── test.json # 可选,待预测数据文件 +``` + +本地数据集文件格式如下: + +- train.json/test.json 文件每行格式: + +```text +{ +"title": "任志强抨击政府把土地作为投机品地产业被人为破坏", +"content": "“北京的保障房市场就像一个巨大的赌场,每个人都在期待中奖。”面对中国目前现行的保障性住房政策,华远地产董事长任志强再次语出惊人。(分享自@第一财经-中国房地产金融)" +} +``` + +更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#) +和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + +### 模型预训练 + +运行如下命令即可在样例训练集上开始pretrain,并在样例验证集上进行验证。 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +python -m paddle.distributed.launch --gpus "2,3,4,5,6,7" train.py \ + --model_name_or_path=IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese \ + --train_file data/train.json \ + --eval_file data/test.json \ + --output_dir pegasus_out \ + --max_source_length 128 \ + --max_target_length 64 \ + --num_train_epochs 20 \ + --logging_steps 1 \ + --save_steps 10000 \ + --per_device_train_batch_size 128 \ + --per_device_eval_batch_size 128 \ + --learning_rate 1e-4 \ + --warmup_ratio 0.02 \ + --weight_decay=0.001 \ + --do_train \ + --do_eval \ + --device=gpu +``` + +关键参数释义如下: + +- `gpus` 指示了训练所用的GPU卡号。 +- `train_file` 本地训练数据地址。 +- `eval_file` 本地测试数据地址。 +- `model_name_or_path` + 指示了pretrain所使用的分词器,可以是PaddleNLP提供的分词器,或者是本地的分词器。如果使用本地的分词器,可以配置本地分词器的目录地址,例如: + ./checkpoints/model_xx/。如果使用PaddleNLP提供的分词器,可以选择下面其中之一。 + + | PaddleNLP提供的分词器 | + |---------------------------------| + | IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese | + | IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese | + +- `output_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `num_train_epochs` 表示训练轮数。 +- `per_device_train_batch_size` 表示每次训练**每张卡**上的样本数目。 
+- `per_device_eval_batch_size` 表示每次验证**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_ratio` + 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `max_source_length` 模型输入序列的最大长度。 +- `max_target_length` 模型训练时标签的最大长度。 +- `do_train` 是否进行训练。 +- `do_eval` 是否进行预测。 +- `device` 表示使用的设备,从gpu和cpu中选择。 + +更多参数详情和参数的默认值请参考`train.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`output_dir`中。 +如: + +```text +./pegasus_out/ +├── model_config.json +├── model_state.pdparams +├── special_tokens_map.json +├── tokenizer_config.json +└── vocab.txt +``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + + +## 模型微调 +微调代码及效果请参考[PEGASUS微调](../finetune/) + + +## References + +- Zhang J, Zhao Y, Saleh M, et al. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization[C] + //International Conference on Machine Learning. PMLR, 2020: 11328-11339. +- Wang J, Zhang Y, Zhang L, et al. Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence[J]. arXiv + preprint arXiv:2209.02970, 2022. diff --git a/applications/text_summarization/pretrain/requirements.txt b/applications/text_summarization/pretrain/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..7cfd8eb0cbe9979f0943ae351e09c4c60aa3f13d --- /dev/null +++ b/applications/text_summarization/pretrain/requirements.txt @@ -0,0 +1 @@ +rouge==1.0.1 \ No newline at end of file diff --git a/applications/text_summarization/pretrain/train.py b/applications/text_summarization/pretrain/train.py new file mode 100644 index 0000000000000000000000000000000000000000..14487efc3c1aca4103153d7c22d4664c2da0d349 --- /dev/null +++ b/applications/text_summarization/pretrain/train.py @@ -0,0 +1,167 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from datasets import load_dataset +from utils import PegasusTrainer, compute_metrics, convert_example, main_process_first + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, set_seed +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusConfig, + PegasusForConditionalGeneration, +) + + +@dataclass +class ModelArguments: + model_name_or_path: Optional[str] = field( + default="IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + metadata={"help": ("Path to pre-trained model.")}, + ) + max_source_length: Optional[int] = field( + default=128, + metadata={ + "help": ( + "The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded." 
+ ) + }, + ) + min_target_length: Optional[int] = field( + default=0, + metadata={"help": ("The minimum total sequence length for target text when generating. ")}, + ) + max_target_length: Optional[int] = field( + default=64, + metadata={ + "help": ( + "The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``." + ) + }, + ) + num_beams: Optional[int] = field( + default=1, + metadata={"help": ("The number of beams to use in beam search.")}, + ) + predict_with_generate: Optional[bool] = field( + default=True, + metadata={"help": ("Whether to generate in predcit.")}, + ) + + +@dataclass +class DataArguments: + train_file: Optional[str] = field( + default="data/train.json", + metadata={"help": ("Train data path.")}, + ) + eval_file: Optional[str] = field( + default="data/test.json", + metadata={"help": ("Eval data path.")}, + ) + stop_words: Optional[str] = field( + default="stopwords.txt", + metadata={"help": ("The stop words vocab.")}, + ) + + +def compute_metrics_trainer(eval_preds, tokenizer): + all_preds = [] + all_labels = [] + labels = eval_preds.label_ids + preds = eval_preds.predictions + all_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + rougel = compute_metrics(all_preds, all_labels) + return {"RougeL": rougel} + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + training_args.generation_max_length = model_args.max_target_length + training_args.max_source_length = model_args.max_source_length + training_args.generation_num_beams = model_args.num_beams + training_args.predict_with_generate = model_args.predict_with_generate + training_args.stop_words = data_args.stop_words + + tokenizer = PegasusChineseTokenizer.from_pretrained(model_args.model_name_or_path) + train_set = load_dataset("json", data_files=data_args.train_file, split="train") + dev_set = load_dataset("json", data_files=data_args.eval_file, split="train") + + # train_set needn't map + remove_columns = ["title", "content"] + trans_func = partial( + convert_example, + text_column="content", + summary_column="title", + tokenizer=tokenizer, + max_source_length=model_args.max_source_length, + max_target_length=model_args.max_target_length, + ) + with main_process_first(desc="dev dataset map pre-processing"): + dev_set = dev_set.map(trans_func, batched=True, load_from_cache_file=True, remove_columns=remove_columns) + + config = PegasusConfig() + model = PegasusForConditionalGeneration(config=config) + + dev_batchify_fn = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model) + + compute_metrics_func = partial( + compute_metrics_trainer, + tokenizer=tokenizer, + ) + + trainer = PegasusTrainer( + model=model, + args=training_args, + train_dataset=train_set if training_args.do_train else None, + eval_dataset=dev_set if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=dev_batchify_fn, + compute_metrics=compute_metrics_func, + ) + + if 
training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/applications/text_summarization/pretrain/utils.py b/applications/text_summarization/pretrain/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d15f540f233651e3c27121678b43eff7153e5a01 --- /dev/null +++ b/applications/text_summarization/pretrain/utils.py @@ -0,0 +1,486 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import contextlib +import random +import re +import sys +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +from paddle import nn +from paddle.io import DataLoader +from rouge import Rouge + +from paddlenlp.metrics import BLEU +from paddlenlp.trainer import Seq2SeqTrainer +from paddlenlp.utils.log import logger + +rouge = Rouge() + + +class FakeAbstractCollator: + def __init__(self, tokenizer, stopwords_dict, max_enc_length): + self.tokenizer = tokenizer + self.max_seq_length = max_enc_length + self.stopwords_dict = stopwords_dict + + def __call__(self, samples): + labels = [] + attn_mask = [] + decoder_attn_mask = [] + source_inputs = [] + + for text in samples: + texts = text["content"] + text = text_segmentate(texts) + + if len(text) < 2: + continue + sentence_id_vec, source, target, source_idxs, target_idxs = pseudo_summary_f1( + text, self.stopwords_dict, self.tokenizer, self.max_seq_length, "rouge-l" + ) + source_idxs, target_idxs = get_input_mask(sentence_id_vec, target_idxs) + if len(source_idxs) > self.max_seq_length: + if 2 not in source_idxs[self.max_seq_length - 1 :]: + source_idxs = source_idxs[: self.max_seq_length] + source_idxs[-1] = self.tokenizer.eos_token_id + sys.stderr.write("Warning split long line: " + source + "\n") + else: + continue + + source_idxs, attention_mask = padding_to_maxlength( + source_idxs, self.max_seq_length, self.tokenizer.pad_token_id + ) + label, target_attention_mask = padding_to_maxlength( + target_idxs, self.max_seq_length, self.tokenizer.pad_token_id + ) + source_inputs.append(source_idxs) + attn_mask.append(attention_mask) + decoder_attn_mask.append(target_attention_mask) + labels.append(label) + labels = paddle.to_tensor(labels) + decode_input_idxs = shift_tokens_right(labels, self.tokenizer.pad_token_id, self.tokenizer.pad_token_id) + end_token_index = paddle.where(labels == self.tokenizer.eos_token_id)[1] + for idx, end_idx in enumerate(end_token_index): + labels[idx, end_idx + 1 :] = -100 + + return { + "input_ids": paddle.to_tensor(source_inputs), + "attention_mask": paddle.to_tensor(attn_mask), + "labels": labels, + "decoder_input_ids": 
decode_input_idxs, + "decoder_attention_mask": paddle.to_tensor(decoder_attn_mask), + } + + +def load_stopwords(stopwords_path): + stopwords_dict = {} + with open(stopwords_path, "r") as rf: + for line in rf: + line = line.strip() + if line not in stopwords_dict: + stopwords_dict[line] = 0 + else: + pass + return stopwords_dict + + +def text_segmentate(text): + en_seg_pattern = "((?:\\!|\\?|\\.|\\n)+(?:\\s)+)" + ch_seg_pattern = "((?:?|!|。|\\n)+)" + try: + text = re.sub(en_seg_pattern, r"\1[SEP]", text) + except Exception as e: + print("input: ", text) + raise e + text = re.sub(ch_seg_pattern, r"\1[SEP]", text) + text_list = text.split("[SEP]") + text_list = list(filter(lambda x: len(x) != 0, text_list)) + return text_list + + +def gather_join(texts, idxs): + return "".join([texts[i] for i in idxs]) + + +def gather_join_f1(texts_token, idsx): + join_texts = [] + for id in idsx: + join_texts.extend(texts_token[id]) + return join_texts + + +def compute_rouge(source, target): + source, target = " ".join(source), " ".join(target) + try: + scores = rouge.get_scores(hyps=source, refs=target) + return { + "rouge-1": scores[0]["rouge-1"]["f"], + "rouge-2": scores[0]["rouge-2"]["f"], + "rouge-l": scores[0]["rouge-l"]["f"], + } + except ValueError: + return { + "rouge-1": 0.0, + "rouge-2": 0.0, + "rouge-l": 0.0, + } + + +def remove_stopwords(texts, stopwords_dict): + for i, text in enumerate(texts): + texts[i] = list(filter(lambda x: x not in stopwords_dict, text)) + return texts + + +def pseudo_summary_f1(texts, stopwords, tokenizer, max_length, rouge_strategy="rouge-l"): + summary_rate = 0.25 + max_length = max_length - 1 + texts_tokens = [] + sentece_idxs_vec = [] + for text in texts: + if len(texts) == 0: + continue + try: + ids = tokenizer.encode(text.strip())["input_ids"][:-1] + except ValueError: + print("error, input : ", text) + raise ValueError + sentece_idxs_vec.append(ids) + tokens = [tokenizer._convert_id_to_token(token) for token in ids] + texts_tokens.append(tokens) + + texts_tokens_rm = remove_stopwords(texts_tokens, stopwords) + source_idxs, target_idxs = list(range(len(texts))), [] + + assert len(texts_tokens) == len(texts) + while True: + sims = [] + for i in source_idxs: + new_source_idxs = [j for j in source_idxs if j != i] + new_target_idxs = sorted(target_idxs + [i]) + new_source = gather_join_f1(texts_tokens_rm, new_source_idxs) + new_target = gather_join_f1(texts_tokens_rm, new_target_idxs) + sim = compute_rouge(new_source, new_target)[rouge_strategy] + sims.append(sim) + new_idx = source_idxs[np.argmax(sims)] + del sims + source_idxs.remove(new_idx) + target_idxs = sorted(target_idxs + [new_idx]) + source = gather_join(texts, source_idxs) + target = gather_join(texts, target_idxs) + try: + if len(source_idxs) == 1 or 1.0 * len(target) / len(source) > summary_rate: + break + except ZeroDivisionError: + print(texts) + print("source: ", source) + print("target: ", target) + + if len(source) < len(target): + source, target = target, source + source_idxs, target_idxs = target_idxs, source_idxs + + return sentece_idxs_vec, source, target, source_idxs, target_idxs + + +def get_input_mask(sentence_id_vec, indexs): + target_idxs = [] + input_idxs = [] + kMaskSentenceTokenId = 2 + kEosTokenId = 1 + mask_sentence_options_cumulative_prob = [0.9, 0.9, 1, 1] + for index in indexs: + target_idxs.extend(sentence_id_vec[index]) + choice = random.uniform(0, 1) + if choice < mask_sentence_options_cumulative_prob[0]: + sentence_id_vec[index] = [kMaskSentenceTokenId] + elif choice < 
mask_sentence_options_cumulative_prob[1]: + replace_id = random.randint(0, len(sentence_id_vec)) + sentence_id_vec[index] = sentence_id_vec[replace_id] + elif choice < mask_sentence_options_cumulative_prob[2]: + pass + else: + sentence_id_vec[index] = [] + + target_idxs.append(kEosTokenId) + for index, sentence_id in enumerate(sentence_id_vec): + if len(sentence_id) == 0: + continue + input_idxs.extend(sentence_id_vec[index]) + + input_idxs.append(kEosTokenId) + return input_idxs, target_idxs + + +def shift_tokens_right(input_ids, pad_token_id, decoder_start_token_id): + shifted_input_ids = paddle.zeros_like(input_ids) + shifted_input_ids[:, 1:] = paddle.clone(input_ids[:, :-1]) + shifted_input_ids[:, 0] = decoder_start_token_id + + if pad_token_id is None: + raise ValueError("self.model.config.pad_token_id has to be defined.") + shifted_input_ids = paddle.where(shifted_input_ids == -100, paddle.to_tensor(pad_token_id), shifted_input_ids) + + return shifted_input_ids + + +def padding_to_maxlength(ids, max_length, pad_id): + cur_len = len(ids) + len_diff = max_length - cur_len + return ids + [pad_id] * len_diff, [1] * cur_len + [0] * len_diff + + +def convert_example(example, text_column, summary_column, tokenizer, max_source_length, max_target_length): + """ + Convert a example into necessary features. + """ + inputs = example[text_column] + targets = example[summary_column] + model_inputs = tokenizer( + inputs, max_length=max_source_length, padding=False, truncation=True, return_attention_mask=True + ) + labels = tokenizer(targets, max_length=max_target_length, padding=False, truncation=True) + model_inputs["labels"] = labels["input_ids"] + return model_inputs + + +def compute_correct(logits, labels): + y_pred = paddle.argmax(logits, axis=-1) + y_pred = y_pred.reshape( + [ + -1, + ] + ) + y_true = labels.reshape( + [ + -1, + ] + ) + correct = paddle.sum(paddle.equal(y_pred, y_true).astype("float32")).item() + return correct + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("rouge-1:", round(rouge1, 4)) + print("rouge-2:", round(rouge2, 4)) + print("rouge-L:", round(rougel, 4)) + print("BLEU-4:", round(bleu4.score(), 4)) + return rougel + + +@contextlib.contextmanager +def main_process_first(desc="work"): + if paddle.distributed.get_world_size() > 1: + rank = paddle.distributed.get_rank() + is_main_process = rank == 0 + main_process_desc = "main local process" + + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{rank}: waiting for the {main_process_desc} to perform {desc}") + paddle.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{rank}: {main_process_desc} completed {desc}, releasing all replicas") + paddle.distributed.barrier() + else: + yield + + +class PegasusTrainer(Seq2SeqTrainer): + def get_train_dataloader(self): + """ + Returns the training [`~paddle.io.DataLoader`]. + + Will use no sampler if `self.train_dataset` does not implement `__len__`, a random sampler (adapted to + distributed training if necessary) otherwise. + + Subclass and override this method if you want to inject some custom behavior. + """ + if self.train_dataset is None: + raise ValueError("Trainer: training requires a train_dataset.") + + train_dataset = self.train_dataset + train_sampler = self._get_train_sampler() + + stopwords_dict = load_stopwords(self.args.stop_words) + train_batchify_fn = FakeAbstractCollator(self.tokenizer, stopwords_dict, self.args.max_source_length) + + return DataLoader( + train_dataset, + batch_sampler=train_sampler, + collate_fn=train_batchify_fn, + num_workers=self.args.dataloader_num_workers, + ) + + def compute_loss(self, model, inputs, return_outputs=False): + """ + How the loss is computed by Trainer. By default, all models return the loss in the first element. + Subclass and override for custom behavior. + """ + if self.criterion is not None: + if "labels" in inputs: + labels = inputs.pop("labels") + elif "start_positions" in inputs and "end_positions" in inputs: + labels = (inputs.pop("start_positions"), inputs.pop("end_positions")) + elif self.args.label_names is not None: + labels = [] + for label in self.label_names: + labels.append(inputs.pop(label)) + labels = tuple(labels) + elif "generator_labels" in inputs: + labels = inputs["generator_labels"] + else: + labels = None + + outputs = model(**inputs) + if self.criterion is not None: + loss = self.criterion(outputs, labels) + outputs = (loss, outputs) + + # Save past state if it exists + # TODO: this needs to be fixed and made cleaner later. + if self.args.past_index >= 0: + self._past = outputs[self.args.past_index] + + # We don't use .loss here since the model may return tuples instead of ModelOutput. 
+ # pegasus output is lm_logits, new_cache, masked_lm_loss + loss = outputs["loss"] if isinstance(outputs, dict) else outputs[2] + + return (loss, outputs) if return_outputs else loss + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[float], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + """ + Perform an evaluation step on `model` using `inputs`. + + Subclass and override to inject custom behavior. + + Args: + model (`nn.Layer`): + The model to evaluate. + inputs (`Dict[str, Union[paddle.Tensor, Any]]`): + The inputs and targets of the model. + + The dictionary will be unpacked before being fed to the model. Most models expect the targets under the + argument `labels`. Check your model's documentation for all accepted arguments. + prediction_loss_only (`bool`): + Whether or not to return the loss only. + + Return: + Tuple[Optional[float], Optional[paddle.Tensor], Optional[paddle.Tensor]]: A tuple with the loss, logits and + labels (each being optional). + """ + + if not self.args.predict_with_generate or prediction_loss_only: + return super().prediction_step( + model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys + ) + + has_labels = "labels" in inputs + inputs = self._prepare_inputs(inputs) + + gen_kwargs = self._gen_kwargs.copy() + if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None: + gen_kwargs["max_length"] = self.model.config.max_length + gen_kwargs["num_beams"] = ( + gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.model.config.num_beams + ) + + if "attention_mask" in inputs: + gen_kwargs["attention_mask"] = inputs.get("attention_mask", None) + if "global_attention_mask" in inputs: + gen_kwargs["global_attention_mask"] = inputs.get("global_attention_mask", None) + + # prepare generation inputs + # some encoder-decoder models can have varying encoder's and thus + # varying model input names + if hasattr(self.model, "encoder") and self.model.encoder.main_input_name != self.model.main_input_name: + generation_inputs = inputs[self.model.encoder.main_input_name] + else: + generation_inputs = inputs[self.model.main_input_name] + + generated_tokens = self.model.generate( + generation_inputs, + **gen_kwargs, + ) + # different from hf returns: tuple[Tensor]: It is a tuple contains two elements: ids and scores. 
+ if isinstance(generated_tokens, tuple): + generated_tokens = generated_tokens[0] + # in case the batch is shorter than max length, the output should be padded + if gen_kwargs.get("max_length") is not None and generated_tokens.shape[-1] < gen_kwargs["max_length"]: + generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_length"]) + elif gen_kwargs.get("max_new_tokens") is not None and generated_tokens.shape[-1] < ( + gen_kwargs["max_new_tokens"] + 1 + ): + generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_new_tokens"] + 1) + + with paddle.no_grad(): + if has_labels: + with self.autocast_smart_context_manager(): + outputs = model(**inputs) + if self.label_smoother is not None: + loss = self.label_smoother(outputs, inputs["labels"]).mean().detach() + else: + # pegasus output is lm_logits, new_cache, masked_lm_loss + loss = (outputs["loss"] if isinstance(outputs, dict) else outputs[2]).mean().detach() + else: + loss = None + + if self.args.prediction_loss_only: + return (loss, None, None) + + if has_labels: + labels = inputs["labels"] + if gen_kwargs.get("max_length") is not None and labels.shape[-1] < gen_kwargs["max_length"]: + labels = self._pad_tensors_to_max_len(labels, gen_kwargs["max_length"]) + elif gen_kwargs.get("max_new_tokens") is not None and labels.shape[-1] < ( + gen_kwargs["max_new_tokens"] + 1 + ): + labels = self._pad_tensors_to_max_len(labels, (gen_kwargs["max_new_tokens"] + 1)) + else: + labels = None + + return (loss, generated_tokens, labels) diff --git a/applications/zero_shot_text_classification/README.md b/applications/zero_shot_text_classification/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4f9ed711246ca402e8785c416e7d3d626d717a77 --- /dev/null +++ b/applications/zero_shot_text_classification/README.md @@ -0,0 +1,286 @@ +简体中文 | [English](README_en.md) + +# 零样本文本分类 + +**目录** +- [1. 零样本文本分类应用](#1) +- [2. 快速开始](#2) + - [2.1 代码结构](#代码结构) + - [2.2 数据标注](#数据标注) + - [2.3 模型微调](#模型微调) + - [2.4 模型评估](#模型评估) + - [2.5 定制模型一键预测](#定制模型一键预测) + - [2.6 模型部署](#模型部署) + - [2.7 实验指标](#实验指标) + + + +## 1. 零样本文本分类应用 + +本项目提供基于通用文本分类 UTC(Universal Text Classification) 模型微调的文本分类端到端应用方案,打通**数据标注-模型训练-模型调优-预测部署全流程**,可快速实现文本分类产品落地。 + +
+ UTC模型结构图 +
+ +文本分类简单来说就是对给定的句子或文本使用分类模型分类。在文本分类的落地过程中通常面临领域多变、任务多样、数据稀缺等许多挑战。针对文本分类领域的痛点和难点,PaddleNLP 零样本文本分类应用 UTC 通过统一语义匹配方式 USM(Unified Semantic Matching)统一建模标签与文本的语义匹配能力,具备低资源迁移能力,支持通用分类、评论情感分析、语义相似度计算、蕴含推理、多项式阅读理解等众多“泛分类”任务,助力开发者简单高效实现多任务文本分类数据标注、训练、调优、上线,降低文本分类落地技术门槛。 + + +**零样本文本分类应用亮点:** + +- **覆盖场景全面🎓:** 覆盖文本分类各类主流任务,支持多任务训练,满足开发者多样文本分类落地需求。 +- **效果领先🏃:** 具有突出分类效果的UTC模型作为训练基座,提供良好的零样本和小样本学习能力。该模型在[ZeroCLUE](https://www.cluebenchmarks.com/zeroclue.html)和[FewCLUE](https://www.cluebenchmarks.com/fewclue.html)均取得榜首(截止2023年1月11日)。 +- **简单易用:** 通过Taskflow实现三行代码可实现无标注数据的情况下进行快速调用,一行命令即可开启文本分类,轻松完成部署上线,降低多任务文本分类落地门槛。 +- **高效调优✊:** 开发者无需机器学习背景知识,即可轻松上手数据标注及模型训练流程。 + + + +## 2. 快速开始 + +对于简单的文本分类可以直接使用```paddlenlp.Taskflow```实现零样本(zero-shot)分类,对于细分场景我们推荐使用定制功能(标注少量数据进行模型微调)以进一步提升效果。 + + + +### 2.1 代码结构 + +```shell +. +├── deploy/simple_serving/ # 模型部署脚本 +├── utils.py # 数据处理工具 +├── run_train.py # 模型微调脚本 +├── run_eval.py # 模型评估脚本 +├── label_studio.py # 数据格式转换脚本 +├── label_studio_text.md # 数据标注说明文档 +└── README.md +``` + + + +### 2.2 数据标注 + +我们推荐使用[Label Studio](https://labelstud.io/) 数据标注工具进行标注,如果已有标注好的本地数据集,我们需要将数据集整理为文档要求的格式,详见[Label Studio数据标注指南](./label_studio_text.md)。 + +这里我们提供预先标注好的`医疗意图分类数据集`的文件,可以运行下面的命令行下载数据集,我们将展示如何使用数据转化脚本生成训练/验证/测试集文件,并使用UTC模型进行微调。 + +下载医疗意图分类数据集: + + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/utc-medical.tar.gz +tar -xvf utc-medical.tar.gz +mv utc-medical data +rm utc-medical.tar.gz +``` + +生成训练/验证集文件: +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options ./data/label.txt +``` +多任务训练场景可分别进行数据转换再进行混合。 + + + +### 2.3 模型微调 + +推荐使用 PromptTrainer API 对模型进行微调,该 API 封装了提示定义功能,且继承自 [Trainer API ](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md) 。只需输入模型、数据集等就可以使用 Trainer API 高效快速地进行预训练、微调等任务,可以一键启动多卡训练、混合精度训练、梯度累积、断点重启、日志显示等功能,Trainer API 还针对训练过程的通用训练配置做了封装,比如:优化器、学习率调度等。 + +使用下面的命令,使用 `utc-base` 作为预训练模型进行模型微调,将微调后的模型保存至`$finetuned_model`: + +单卡启动: + +```shell +python run_train.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path utc-base \ + --output_dir ./checkpoint/model_best \ + --dataset_path ./data/ \ + --max_seq_length 512 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model macro_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 \ + --save_plm +``` + +如果在GPU环境中使用,可以指定gpus参数进行多卡训练: + +```shell +python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path utc-base \ + --output_dir ./checkpoint/model_best \ + --dataset_path ./data/ \ + --max_seq_length 512 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model macro_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 \ + --save_plm +``` + +该示例代码中由于设置了参数 `--do_eval`,因此在训练完会自动进行评估。 + +可配置参数说明: +* `single_label`: 每条样本是否只预测一个标签。默认为`False`,表示多标签分类。 +* 
`device`: 训练设备,可选择 'cpu'、'gpu' 其中的一种;默认为 GPU 训练。 +* `logging_steps`: 训练过程中日志打印的间隔 steps 数,默认10。 +* `save_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `eval_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `seed`:全局随机种子,默认为 42。 +* `model_name_or_path`:进行 few shot 训练使用的预训练模型。默认为 "utc-base", 可选"utc-xbase", "utc-base", "utc-medium", "utc-mini", "utc-micro", "utc-nano", "utc-pico"。 +* `output_dir`:必须,模型训练或压缩后保存的模型目录;默认为 `None` 。 +* `dataset_path`:数据集文件所在目录;默认为 `./data/` 。 +* `train_file`:训练集后缀;默认为 `train.txt` 。 +* `dev_file`:开发集后缀;默认为 `dev.txt` 。 +* `max_seq_len`:文本最大切分长度,包括标签的输入超过最大长度时会对输入文本进行自动切分,标签部分不可切分,默认为512。 +* `per_device_train_batch_size`:用于训练的每个 GPU 核心/CPU 的batch大小,默认为8。 +* `per_device_eval_batch_size`:用于评估的每个 GPU 核心/CPU 的batch大小,默认为8。 +* `num_train_epochs`: 训练轮次,使用早停法时可以选择 100;默认为10。 +* `learning_rate`:训练最大学习率,UTC 推荐设置为 1e-5;默认值为3e-5。 +* `do_train`:是否进行微调训练,设置该参数表示进行微调训练,默认不设置。 +* `do_eval`:是否进行评估,设置该参数表示进行评估,默认不设置。 +* `do_export`:是否进行导出,设置该参数表示进行静态图导出,默认不设置。 +* `export_model_dir`:静态图导出地址,默认为None。 +* `overwrite_output_dir`: 如果 `True`,覆盖输出目录的内容。如果 `output_dir` 指向检查点目录,则使用它继续训练。 +* `disable_tqdm`: 是否使用tqdm进度条。 +* `metric_for_best_model`:最优模型指标, UTC 推荐设置为 `macro_f1`,默认为None。 +* `load_best_model_at_end`:训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用,默认为False。 +* `save_total_limit`:如果设置次参数,将限制checkpoint的总数。删除旧的checkpoints `输出目录`,默认为None。 + + + +### 2.4 模型评估 + +通过运行以下命令进行模型评估预测: + +```shell +python run_eval.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/test.txt \ + --per_device_eval_batch_size 2 \ + --max_seq_len 512 \ + --output_dir ./checkpoint_test +``` + +可配置参数说明: + +- `model_path`: 进行评估的模型文件夹路径,路径下需包含模型权重文件`model_state.pdparams`及配置文件`model_config.json`。 +- `test_path`: 进行评估的测试集文件。 +- `per_device_eval_batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_len`: 文本最大切分长度,输入超过最大长度时会对输入文本进行自动切分,默认为512。 +- `single_label`: 每条样本是否只预测一个标签。默认为`False`,表示多标签分类。 + + + +### 2.5 定制模型一键预测 + +`paddlenlp.Taskflow`装载定制模型,通过`task_path`指定模型权重文件的路径,路径下需要包含训练好的模型权重文件`model_state.pdparams`。 + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] +>>> my_cls = Taskflow("zero_shot_text_classification", model="utc-base", schema=schema, task_path='./checkpoint/model_best/plm', precision="fp16") +>>> pprint(my_cls("中性粒细胞比率偏低")) +``` + + + +### 2.6 模型部署 + +目前 UTC 模型提供基于多种部署方式,包括基于 FastDeploy 的本地 Python 部署以及 PaddleNLP SimpleServing 的服务化部署。 + +#### Python 部署 + +以下示例展示如何基于 FastDeploy 库完成 UTC 模型完成通用文本分类任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型。模型目录为 `application/zero_shot_text_classification/checkpoint/model_best`(用户可按实际情况设置)。 + +```bash +# CPU 推理 +python deploy/python/infer.py --model_dir ./checkpoint/model_best --device cpu +# GPU 推理 +python deploy/python/infer.py --model_dir ./checkpoint/model_best --device gpu +``` + +运行完成后返回的结果如下: + +```bash +[2023-03-02 06:32:47,528] [ INFO] - We are using to load './checkpoint/model_best'. +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. 
+[2023-03-02 06:33:18,120] [ INFO] - Assigning ['[O-MASK]'] to the additional_special_tokens key of the tokenizer +[{'predictions': [{'label': '这是一条好评', 'score': 0.9073}], 'text_a': '房间干净明亮,非常不错'}] +``` + +更多细节请参考[UTC Python 部署方法](./deploy/python/README.md) + +#### 服务化部署 + +在 UTC 的服务化能力中我们提供基于PaddleNLP SimpleServing 来搭建服务化能力,通过几行代码即可搭建服务化部署能力。 + +``` +# Save at server.py +from paddlenlp import SimpleServer, Taskflow + +schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"] +utc = Taskflow("zero_shot_text_classification", + model="utc-base", + schema=schema, + task_path="../../checkpoint/model_best/plm", + precision="fp32") +app = SimpleServer() +app.register_taskflow("taskflow/utc", utc) +``` + +``` +# Start the server +paddlenlp server server:app --host 0.0.0.0 --port 8990 +``` + +支持FP16半精度推理加速,详见[UTC SimpleServing 使用方法](./deploy/simple_serving/README.md) + + + +### 2.7 实验指标 + +医疗意图分类数据集 KUAKE-QIC 验证集 zero-shot 实验指标: + + | | Macro F1 | Micro F1 | + | :--------: | :--------: | :--------: | + | utc-xbase | 66.30 | 89.67 | + | utc-base | 64.13 | 89.06 | + | utc-medium | 69.62 | 89.15 | + | utc-micro | 60.31 | 79.14 | + | utc-mini | 65.82 | 89.82 | + | utc-nano | 62.03 | 80.92 | + | utc-pico | 53.63 | 83.57 | diff --git a/applications/zero_shot_text_classification/README_en.md b/applications/zero_shot_text_classification/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..9c8a4df1f3df4933f5f03b9956b83d933aec452d --- /dev/null +++ b/applications/zero_shot_text_classification/README_en.md @@ -0,0 +1,255 @@ +[简体中文](README.md) | English + +# Zero-shot Text Classification + +**Table of contents** +- [1. Zero-shot Text Classification Application](#1) +- [2. Quick Start](#2) + - [2.1 Code Structure](#21) + - [2.2 Data Annotation](#22) + - [2.3 Finetuning](#23) + - [2.4 Evaluation](#24) + - [2.5 Inference](#25) + - [2.6 Deployment](#26) + - [2.7 Experiments](#27) + + + +## 1. Zero-shot Text Classification + +This project provides an end-to-end application solution for universal text classification based on Universal Task Classification (UTC) finetuning and goes through the full lifecycle of **data labeling, model training and model deployment**. We hope this guide can help you apply Text Classification techniques with zero-shot ability in your own products or models. + +
+ UTC model architecture +
+ +Text Classification refers to assigning a set of categories to given input text. Despite the advantages of tuning, applying text classification techniques in practice remains a challenge due to domain adaption and lack of labeled data, etc. This PaddleNLP Zero-shot Text Classification Guide builds on our UTC from the Unified Semantic Matching (USM) model series and provides an industrial-level solution that supports universal text classification tasks, including but not limited to **semantic analysis, semantic matching, intention recognition and event detection**, allowing you accomplish multiple tasks with a single model. Besides, our method brings good generation performance through multi-task pretraining. + +**Highlights:** + +- **Comprehensive Coverage**🎓: Covers various mainstream tasks of text classification, including but not limited to semantic analysis, semantic matching, intention recognition and event detection. + +- **State-of-the-Art Performance**🏃: Strong performance from the UTC model, which ranks first on [ZeroCLUE](https://www.cluebenchmarks.com/zeroclue.html)/[FewCLUE](https://www.cluebenchmarks.com/fewclue.html) as of 01/11/2023. + +- **Easy to use**⚡: Three lines of code to use our Taskflow for out-of-box Zero-shot Text Classification capability. One line of command to model training and model deployment. + +- **Efficient Tuning**✊: Developers can easily get started with the data labeling and model training process without a background in Machine Learning. + + + +## 2. Quick start + +For quick start, you can directly use ```paddlenlp.Taskflow``` out-of-the-box, leveraging the zero-shot performance. For production use cases, we recommend labeling a small amount of data for model fine-tuning to further improve the performance. + + + +### 2.1 Code structure + +```shell +. +├── deploy/simple_serving/ # model deployment script +├── utils.py # data processing tools +├── run_train.py # model fine-tuning script +├── run_eval.py # model evaluation script +├── label_studio.py # data format conversion script +├── label_studio_text.md # data annotation instruction +└── README.md +``` + + +### 2.2 Data labeling + +We recommend using [Label Studio](https://labelstud.io/) for data labeling. You can export labeled data in Label Studio and convert them into the required input format. Please refer to [Label Studio Data Labeling Guide](./label_studio_text_en.md) for more details. + +Here we provide a pre-labeled example dataset `Medical Question Intent Classification Dataset`, which you can download with the following command. We will show how to use the data conversion script to generate training/validation/test set files for fine-tuning. + +Download the medical question intent classification dataset: + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/utc-medical.tar.gz +tar -xvf utc-medical.tar.gz +mv utc-medical data +rm utc-medical.tar.gz +``` + +Generate training/validation set files: + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options ./data/label.txt +``` + +For multi-task training, you can convert data with script separately and move them to the same directory. 
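After conversion, each line of the generated `train.txt`/`dev.txt` is a JSON object with the fields produced by `label_studio.py` (`text_a`, `text_b`, `question`, `choices` and `labels`, where `labels` holds indices into `choices`). A converted sample might look like the following (the text and label index are illustrative only):

```json
{"text_a": "中性粒细胞比率偏低", "text_b": "", "question": "", "choices": ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"], "labels": [3]}
```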
+ + + +### 2.3 Finetuning + +Use the following command to fine-tune the model using `utc-base` as the pre-trained model, and save the fine-tuned model to `./checkpoint/model_best/`: + +Single GPU: + +```shell +python run_train.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path utc-base \ + --output_dir ./checkpoint/model_best \ + --dataset_path ./data/ \ + --max_seq_length 512 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model macro_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Multiple GPUs: + +```shell +python -u -m paddle.distributed.launch --gpus "0,1" run_train.py \ + --device gpu \ + --logging_steps 10 \ + --save_steps 100 \ + --eval_steps 100 \ + --seed 1000 \ + --model_name_or_path utc-base \ + --output_dir ./checkpoint/model_best \ + --dataset_path ./data/ \ + --max_seq_length 512 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --gradient_accumulation_steps 8 \ + --num_train_epochs 20 \ + --learning_rate 1e-5 \ + --do_train \ + --do_eval \ + --do_export \ + --export_model_dir ./checkpoint/model_best \ + --overwrite_output_dir \ + --disable_tqdm True \ + --metric_for_best_model macro_f1 \ + --load_best_model_at_end True \ + --save_total_limit 1 +``` + +Parameters: + +* `device`: Training device, one of 'cpu' and 'gpu' can be selected; the default is GPU training. +* `logging_steps`: The interval steps of log printing during training, the default is 10. +* `save_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `eval_steps`: The number of interval steps to save the model checkpoint during training, the default is 100. +* `seed`: global random seed, default is 42. +* `model_name_or_path`: The pre-trained model used for few shot training. Defaults to "utc-base". Options: "utc-xbase", "utc-base", "utc-medium", "utc-mini", "utc-micro", "utc-nano", "utc-pico". +* `output_dir`: Required, the model directory saved after model training or compression; the default is `None`. +* `dataset_path`: The directory to dataset; defaults to `./data`. +* `train_file`: Training file name; defaults to `train.txt`. +* `dev_file`: Development file name; defaults to `dev.txt`. +* `max_seq_len`: The maximum segmentation length of the text and label candidates. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. +* `per_device_train_batch_size`: The batch size of each GPU core/CPU used for training, the default is 8. +* `per_device_eval_batch_size`: Batch size per GPU core/CPU for evaluation, default is 8. +* `num_train_epochs`: Training rounds, 100 can be selected when using early stopping method; the default is 10. +* `learning_rate`: The maximum learning rate for training, UTC recommends setting it to 1e-5; the default value is 3e-5. +* `do_train`: Whether to perform fine-tuning training, setting this parameter means to perform fine-tuning training, and it is not set by default. +* `do_eval`: Whether to evaluate, setting this parameter means to evaluate, the default is not set. +* `do_export`: Whether to export, setting this parameter means to export static graph, and it is not set by default. 
+* `export_model_dir`: Static map export address, the default is `./checkpoint/model_best`. +* `overwrite_output_dir`: If `True`, overwrite the contents of the output directory. If `output_dir` points to a checkpoint directory, use it to continue training. +* `disable_tqdm`: Whether to use tqdm progress bar. +* `metric_for_best_model`: Optimal model metric, UTC recommends setting it to `macro_f1`, the default is None. +* `load_best_model_at_end`: Whether to load the best model after training, usually used in conjunction with `metric_for_best_model`, the default is False. +* `save_total_limit`: If this parameter is set, the total number of checkpoints will be limited. Remove old checkpoints `output directory`, defaults to None. + + + +### 2.4 Evaluation + +Model evaluation: + +```shell +python evaluate.py \ + --model_path ./checkpoint/model_best \ + --test_path ./data/test.txt \ + --per_device_eval_batch_size 2 \ + --max_seq_len 512 \ + --output_dir ./checkpoint_test +``` + +Parameters: + +- `model_path`: The path of the model folder for evaluation, which must contain the model weight file `model_state.pdparams` and the configuration file `model_config.json`. +- `test_path`: The test set file for evaluation. +- `per_device_eval_batch_size`: Batch size, please adjust it according to the machine situation, the default is 8. +- `max_seq_len`: The maximum segmentation length of the text and label candidates. When the input exceeds the maximum length, the input text will be automatically segmented. The default is 512. + + + +### 2.5 Inference + +You can use `paddlenlp.Taskflow` to load your custom model by specifying the path of the model weight file through `task_path`. + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow +>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] +>>> my_cls = Taskflow("zero_shot_text_classification", model="utc-base", schema=schema, task_path="./checkpoint/model_best", precision="fp16") +>>> pprint(my_cls("中性粒细胞比率偏低")) +``` + + + +### 2.6 Deployment + +We provide the deployment solution on the foundation of PaddleNLP SimpleServing, where you can easily build your own deployment service with three-line code. + +``` +# Save at server.py +from paddlenlp import SimpleServer, Taskflow + +schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议"] +utc = Taskflow("zero_shot_text_classification", + model="utc-base", + schema=schema, + task_path="../../checkpoint/model_best/", + precision="fp32") +app = SimpleServer() +app.register_taskflow("taskflow/utc", utc) +``` + +``` +# Start the server +paddlenlp server server:app --host 0.0.0.0 --port 8990 +``` + +It supports FP16 (half-precision) and multiple process for inference acceleration. + + + +### 2.7 Experiments + +The zero-shot results reported here are based on the development set of KUAKE-QIC. 
+ + | | Macro F1 | Micro F1 | + | :--------: | :--------: | :--------: | + | utc-xbase | 66.30 | 89.67 | + | utc-base | 64.13 | 89.06 | + | utc-medium | 69.62 | 89.15 | + | utc-micro | 60.31 | 79.14 | + | utc-mini | 65.82 | 89.82 | + | utc-nano | 62.03 | 80.92 | + | utc-pico | 53.63 | 83.57 | diff --git a/applications/zero_shot_text_classification/deploy/python/README.md b/applications/zero_shot_text_classification/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b1da371d8a7c0dd026e97bc83e798cfe347dc9b6 --- /dev/null +++ b/applications/zero_shot_text_classification/deploy/python/README.md @@ -0,0 +1,130 @@ +# FastDeploy UTC 模型 Python 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy Python SDK。 + +本目录下提供 `infer.py` 快速完成在 CPU/GPU 的通用文本分类任务的 Python 部署示例。 + +## 依赖安装 + +直接执行以下命令安装部署示例的依赖。 + +```bash +# 安装 fast_tokenizer 以及 GPU 版本 fastdeploy +pip install fast-tokenizer-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html +``` + +## 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 UTC 模型进行文本分类任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [UTC 训练文档](../../README.md)导出得到的部署模型,其模型目录为 `application/zero_shot_text_classification/checkpoint/model_best`(用户可按实际情况设置)。 + + +```bash +# CPU 推理 +python infer.py --model_dir ../../checkpoint/model_best --device cpu +# GPU 推理 +python infer.py --model_dir ../../checkpoint/model_best --device gpu +``` + +运行完成后返回的结果如下: + +```bash +[2023-03-02 06:32:47,528] [ INFO] - We are using to load './checkpoint/model_best'. +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. +[2023-03-02 06:33:18,120] [ INFO] - Assigning ['[O-MASK]'] to the additional_special_tokens key of the tokenizer +[{'predictions': [{'label': '这是一条好评', 'score': 0.9073}], 'text_a': '房间干净明亮,非常不错'}] +``` + +## 参数说明 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录, | +|--batch_size |输入的batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--num_omask_tokens | 最大标签数量,默认为64| +|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' | +|--device_id | 运行设备的id。默认为0。 | +|--cpu_threads | 当使用cpu推理时,指定推理的cpu线程数,默认为1。| +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | + +## FastDeploy 高阶用法 + +FastDeploy 在 Python 端上,提供 `fastdeploy.RuntimeOption.use_xxx()` 以及 `fastdeploy.RuntimeOption.use_xxx_backend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 UTC 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 UTC 模型。 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
硬件 硬件对应的接口 可用的推理引擎 推理引擎对应的接口 是否支持 Paddle 新格式量化模型 是否支持 FP16 模式
CPU use_cpu() Paddle Inference use_paddle_infer_backend() N/A
ONNX Runtime use_ort_backend() N/A
GPU use_gpu() Paddle Inference use_paddle_infer_backend() N/A
ONNX Runtime use_ort_backend()
Paddle TensorRT use_paddle_infer_backend() + paddle_infer_option.enable_trt = True
TensorRT use_trt_backend()
昆仑芯 XPU use_kunlunxin() Paddle Lite use_paddle_lite_backend() N/A
华为 昇腾 use_ascend() Paddle Lite use_paddle_lite_backend()
Graphcore IPU use_ipu() Paddle Inference use_paddle_infer_backend() N/A
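下面给出一个切换硬件与推理引擎后端的最小 Python 示例(仅为示意,接口用法与本目录 `infer.py` 保持一致,模型路径请按实际导出目录修改):

```python
import fastdeploy as fd

# 构建运行时选项(模型路径仅为示例)
option = fd.RuntimeOption()
option.set_model_path(
    "../../checkpoint/model_best/model.pdmodel",
    "../../checkpoint/model_best/model.pdiparams",
)

# 选择硬件:GPU 或 CPU
option.use_gpu(0)  # CPU 上可改用 option.use_cpu()

# 选择推理引擎后端,例如 Paddle Inference 或 ONNX Runtime
option.use_paddle_infer_backend()  # 或 option.use_ort_backend()

runtime = fd.Runtime(option)
```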
diff --git a/applications/zero_shot_text_classification/deploy/python/infer.py b/applications/zero_shot_text_classification/deploy/python/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..5dceeef418ed3a0bd26cc39a80c8d6a4e054f5c4 --- /dev/null +++ b/applications/zero_shot_text_classification/deploy/python/infer.py @@ -0,0 +1,248 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import distutils.util +import os +from typing import Any, Dict, List, Union + +import fastdeploy as fd +import numpy as np + +from paddlenlp.prompt import PromptDataCollatorWithPadding, UTCTemplate +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument( + "--pred_threshold", + default=0.5, + type=float, + help="Probability threshold for start/end index probabiliry.", + ) + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--num_omask_tokens", type=int, default=64, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument("--cpu_threads", type=int, default=1, help="Number of threads to predict when using cpu.") + parser.add_argument("--device_id", type=int, default=0, help="Select which gpu device to train model.") + return parser.parse_args() + + +class Predictor(object): + def __init__(self, args, schema: list = None): + self.set_schema(schema) + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + self.template = UTCTemplate(self.tokenizer, self.max_length) + self.collator = PromptDataCollatorWithPadding(self.tokenizer, return_tensors="np") + self.pred_threshold = args.pred_threshold + + def set_schema(self, schema): + if schema is None: + self._question = None + self._choices = None + elif isinstance(schema, list): + self._question = "" + self._choices = schema + elif isinstance(schema, dict) and len(schema) == 1: + 
for key, value in schema.items(): + self._question = key + self._choices = value + else: + raise ValueError(f"Invalid schema: {schema}.") + + def _check_input_text(self, inputs): + if isinstance(inputs, str) or isinstance(inputs, dict): + inputs = [inputs] + + if isinstance(inputs, list): + input_list = [] + for example in inputs: + data = {"text_a": "", "text_b": "", "choices": self._choices, "question": self._question} + if isinstance(example, dict): + for k, v in example.items(): + if k in data: + data[k] = example[k] + elif isinstance(example, str): + data["text_a"] = example + data["text_b"] = "" + elif isinstance(example, list): + for x in example: + if not isinstance(x, str): + raise ValueError("Invalid inputs, input text should be strings.") + data["text_a"] = example[0] + data["text_b"] = "".join(example[1:]) if len(example) > 1 else "" + else: + raise ValueError( + "Invalid inputs, the input should be {'text_a': a, 'text_b': b}, a text or a list of text." + ) + + if len(data["text_a"]) < 1 and len(data["text_b"]) < 1: + raise ValueError("Invalid inputs, input `text_a` and `text_b` are both missing or empty.") + if not isinstance(data["choices"], list) or len(data["choices"]) < 2: + raise ValueError("Invalid inputs, label candidates should be a list with length >= 2.") + if not isinstance(data["question"], str): + raise ValueError("Invalid inputs, prompt question should be a string.") + input_list.append(data) + else: + raise TypeError("Invalid input format!") + return input_list + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + option.set_cpu_thread_num(args.cpu_threads) + else: + option.use_gpu(args.device_id) + if args.backend == "paddle": + option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.use_paddle_infer_backend() + option.paddle_infer_option.collect_trt_shape = True + option.paddle_infer_option.enable_trt = True + trt_file = os.path.join(args.model_dir, "model.trt") + option.trt_option.set_shape( + "input_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + option.trt_option.set_shape( + "token_type_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + option.trt_option.set_shape( + "position_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + option.trt_option.set_shape( + "attention_mask", + [1, 1, 1, 1], + [args.batch_size, 1, args.max_length, args.max_length], + [args.batch_size, 1, args.max_length, args.max_length], + ) + option.trt_option.set_shape( + "omask_positions", + [1, 1], + [args.batch_size, args.num_omask_tokens], + [args.batch_size, args.num_omask_tokens], + ) + option.trt_option.set_shape("cls_positions", [1], [args.batch_size], [args.batch_size]) + if args.use_fp16: + option.trt_option.enable_fp16 = True + trt_file = trt_file + ".fp16" + option.trt_option.serialize_file = trt_file + return fd.Runtime(option) + + @staticmethod + def sigmoid(z): + return 1 / (1 + np.exp(-z)) + + def preprocess(self, inputs: Union[str, List[str]]) -> Dict[str, Any]: + """ + Transform the raw text to 
the model inputs, two steps involved: + 1) Transform the raw text to token ids. + 2) Generate the other model inputs from the raw text and token ids. + """ + inputs = self._check_input_text(inputs) + # Get the config from the kwargs + tokenized_inputs = [self.template(i) for i in inputs] + batches = [ + tokenized_inputs[idx : idx + self.batch_size] for idx in range(0, len(tokenized_inputs), self.batch_size) + ] + outputs = {} + outputs["text"] = inputs + outputs["batches"] = [self.collator(batch) for batch in batches] + + return outputs + + def infer(self, inputs: Dict[str, Any]) -> Dict[str, Any]: + outputs = {} + outputs["text"] = inputs["text"] + outputs["batch_logits"] = [] + dtype_list = ["int64", "int64", "int64", "float32", "int64", "int64"] + for batch in inputs["batches"]: + batch = dict(batch) + for i in range(self.runtime.num_inputs()): + input_name = self.runtime.get_input_info(i).name + batch[input_name] = batch[input_name].astype(dtype_list[i]) + del batch["soft_token_ids"] + logits = self.runtime.infer(batch)[0] + outputs["batch_logits"].append(logits) + return outputs + + def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, Any]: + outputs = [] + for logits in inputs["batch_logits"]: + scores = self.sigmoid(np.array(logits)) + output = {} + output["predictions"] = [] + for i, class_score in enumerate(scores[0]): + if class_score > self.pred_threshold: + output["predictions"].append({"label": i, "score": class_score}) + outputs.append(output) + + for i, output in enumerate(outputs): + if len(inputs["text"][i]["text_a"]) > 0: + output["text_a"] = inputs["text"][i]["text_a"] + if len(inputs["text"][i]["text_b"]) > 0: + output["text_b"] = inputs["text"][i]["text_b"] + for j, pred in enumerate(output["predictions"]): + output["predictions"][j] = { + "label": inputs["text"][i]["choices"][pred["label"]], + "score": pred["score"], + } + + return outputs + + def predict(self, texts): + inputs = self.preprocess(texts) + outputs = self.infer(inputs) + results = self.postprocess(outputs) + return results + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args, schema=["这是一条差评", "这是一条好评"]) + results = predictor.predict("房间干净明亮,非常不错") + print(results) diff --git a/applications/zero_shot_text_classification/deploy/simple_serving/README.md b/applications/zero_shot_text_classification/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..81179c100c293ffdd706b8fa3c15c49d47097207 --- /dev/null +++ b/applications/zero_shot_text_classification/deploy/simple_serving/README.md @@ -0,0 +1,82 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [Client请求启动](#Client请求启动) +- [服务化自定义参数](#服务化自定义参数) + +## 环境准备 + +使用有SimpleServing功能的PaddleNLP版本(或者最新的develop版本) + +```shell +pip install paddlenlp >= 2.5.0 +``` + +## Server服务启动 + +```bash +paddlenlp server server:app --workers 1 --host 0.0.0.0 --port 8190 +``` + +## Client请求启动 + +```bash +python client.py +``` + +## 服务化自定义参数 + +### Server 自定义参数 + +#### schema替换 + +```python +# Default schema +schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] +``` + +#### 设置模型路径 + +```python +# Default task_path +utc = Taskflow("zero_shot_text_classification", model="utc-base", task_path="../../checkpoint/model_best/plm", schema=schema) +``` + +#### 多卡服务化预测 +PaddleNLP SimpleServing 支持多卡负载均衡预测,主要在服务化注册的时候,注册两个Taskflow的task即可,下面是示例代码 + +```python +utc1 = Taskflow("zero_shot_text_classification", 
model="utc-base", task_path="../../checkpoint/model_best/plm", schema=schema) +utc2 = Taskflow("zero_shot_text_classification", model="utc-base", task_path="../../checkpoint/model_best/plm", schema=schema) +service.register_taskflow("taskflow/utc", [utc1, utc2]) +``` + +#### 更多配置 + +```python +>>> from paddlenlp import Taskflow +>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] +>>> utc = Taskflow("zero_shot_text_classification", + schema=schema, + model="utc-base", + max_seq_len=512, + batch_size=1, + pred_threshold=0.5, + precision="fp32") +``` + +* `schema`:定义任务标签候选集合。 +* `model`:选择任务使用的模型,默认为`utc-base`, 可选有`utc-xbase`, `utc-base`, `utc-medium`, `utc-micro`, `utc-mini`, `utc-nano`, `utc-pico`。 +* `max_seq_len`:最长输入长度,包括所有标签的长度,默认为512。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `pred_threshold`:模型对标签预测的概率在0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快。如果选择`fp16`,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 + +### Client 自定义参数 + +```python +# Changed to input texts you wanted +texts = ["中性粒细胞比率偏低"] +``` diff --git a/applications/zero_shot_text_classification/deploy/simple_serving/client.py b/applications/zero_shot_text_classification/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..9c50a8c1fc98db17fe553f37b04117dde7dca025 --- /dev/null +++ b/applications/zero_shot_text_classification/deploy/simple_serving/client.py @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +from pprint import pprint + +import requests + +if __name__ == "__main__": + url = "http://0.0.0.0:8190/taskflow/utc" + + headers = {"Content-Type": "application/json"} + + texts = ["中性粒细胞比率偏低", "男性小腹疼痛是什么原因?"] + + data = {"data": {"text": texts}} + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + datas = json.loads(r.text) + pprint(datas) diff --git a/applications/zero_shot_text_classification/deploy/simple_serving/server.py b/applications/zero_shot_text_classification/deploy/simple_serving/server.py new file mode 100644 index 0000000000000000000000000000000000000000..a1bb9ef1c278621bd070edc397b825a8381d21e3 --- /dev/null +++ b/applications/zero_shot_text_classification/deploy/simple_serving/server.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer, Taskflow + +# The schema changed to your defined schema +schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"] +# The task path changed to your best model path +utc = Taskflow( + "zero_shot_text_classification", model="utc-base", task_path="../../checkpoint/model_best/plm", schema=schema +) +# If you want to define the finetuned utc service +app = SimpleServer() +app.register_taskflow("taskflow/utc", utc) diff --git a/applications/zero_shot_text_classification/label_studio.py b/applications/zero_shot_text_classification/label_studio.py new file mode 100644 index 0000000000000000000000000000000000000000..8b54112935be631b9cf45a5a8d6d6eaab431145b --- /dev/null +++ b/applications/zero_shot_text_classification/label_studio.py @@ -0,0 +1,159 @@ +# coding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import random +import time +from decimal import Decimal + +import numpy as np +import paddle + +from paddlenlp.utils.log import logger + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +class LabelStudioDataConverter(object): + """ + DataConverter to convert data export from LabelStudio platform + """ + + def __init__(self, options, text_separator): + super().__init__() + if isinstance(options, list) and len(options) == 1 and os.path.isfile(options[0]): + with open(options[0], "r", encoding="utf-8") as fp: + self.options = [x.strip() for x in fp] + elif isinstance(options, list) and len(options) > 0: + self.options = options + else: + raise ValueError( + "Invalid options. Please use file with one label per line or set `options` with condidate labels." + ) + self.text_separator = text_separator + + def convert_utc_examples(self, raw_examples): + utc_examples = [] + for example in raw_examples: + raw_text = example["data"]["text"].split(self.text_separator) + if len(raw_text) < 1: + continue + elif len(raw_text) == 1: + raw_text.append("") + elif len(raw_text) > 2: + raw_text = ["".join(raw_text[:-1]), raw_text[-1]] + + label_list = [] + for raw_label in example["annotations"][0]["result"][0]["value"]["choices"]: + if raw_label not in self.options: + raise ValueError( + f"Label `{raw_label}` not found in label candidates `options`. Please recheck the data." 
+ ) + label_list.append(np.where(np.array(self.options) == raw_label)[0].tolist()[0]) + + utc_examples.append( + { + "text_a": raw_text[0], + "text_b": raw_text[1], + "question": "", + "choices": self.options, + "labels": label_list, + } + ) + return utc_examples + + +def do_convert(): + set_seed(args.seed) + + tic_time = time.time() + if not os.path.exists(args.label_studio_file): + raise ValueError("Please input the correct path of label studio file.") + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + if len(args.splits) != 0 and len(args.splits) != 3: + raise ValueError("Only []/ len(splits)==3 accepted for splits.") + + def _check_sum(splits): + return Decimal(str(splits[0])) + Decimal(str(splits[1])) + Decimal(str(splits[2])) == Decimal("1") + + if len(args.splits) == 3 and not _check_sum(args.splits): + raise ValueError("Please set correct splits, sum of elements in splits should be equal to 1.") + + with open(args.label_studio_file, "r", encoding="utf-8") as f: + raw_examples = json.loads(f.read()) + + if args.is_shuffle: + indexes = np.random.permutation(len(raw_examples)) + index_list = indexes.tolist() + raw_examples = [raw_examples[i] for i in indexes] + + i1, i2, _ = args.splits + p1 = int(len(raw_examples) * i1) + p2 = int(len(raw_examples) * (i1 + i2)) + + train_ids = index_list[:p1] + dev_ids = index_list[p1:p2] + test_ids = index_list[p2:] + + with open(os.path.join(args.save_dir, "sample_index.json"), "w") as fp: + maps = {"train_ids": train_ids, "dev_ids": dev_ids, "test_ids": test_ids} + fp.write(json.dumps(maps)) + + data_converter = LabelStudioDataConverter(args.options, args.text_separator) + + train_examples = data_converter.convert_utc_examples(raw_examples[:p1]) + dev_examples = data_converter.convert_utc_examples(raw_examples[p1:p2]) + test_examples = data_converter.convert_utc_examples(raw_examples[p2:]) + + def _save_examples(save_dir, file_name, examples): + count = 0 + save_path = os.path.join(save_dir, file_name) + with open(save_path, "w", encoding="utf-8") as f: + for example in examples: + f.write(json.dumps(example, ensure_ascii=False) + "\n") + count += 1 + logger.info("Save %d examples to %s." % (count, save_path)) + + _save_examples(args.save_dir, "train.txt", train_examples) + _save_examples(args.save_dir, "dev.txt", dev_examples) + _save_examples(args.save_dir, "test.txt", test_examples) + + logger.info("Finished! It takes %.2f seconds" % (time.time() - tic_time)) + + +if __name__ == "__main__": + # yapf: disable + parser = argparse.ArgumentParser() + + parser.add_argument("--label_studio_file", default="./data/label_studio.json", type=str, help="The annotation file exported from label studio platform.") + parser.add_argument("--save_dir", default="./data", type=str, help="The path of data that you wanna save.") + parser.add_argument("--splits", default=[0.8, 0.1, 0.1], type=float, nargs="*", help="The ratio of samples in datasets. 
[0.6, 0.2, 0.2] means 60% samples used for training, 20% for evaluation and 20% for test.") + parser.add_argument("--text_separator", type=str, default='\t', help="Separator for classification with two input texts.") + parser.add_argument("--options", default=None, type=str, nargs="+", help="The options for classification.") + parser.add_argument("--is_shuffle", default=True, type=bool, help="Whether to shuffle the labeled dataset, defaults to True.") + parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization") + args = parser.parse_args() + # yapf: enable + + do_convert() diff --git a/applications/zero_shot_text_classification/label_studio_text.md b/applications/zero_shot_text_classification/label_studio_text.md new file mode 100644 index 0000000000000000000000000000000000000000..50a419046c2b85fc517dbccfda95f6676a37eb2c --- /dev/null +++ b/applications/zero_shot_text_classification/label_studio_text.md @@ -0,0 +1,143 @@ +简体中文 | [English](label_studio_text_en.md) + +# 文本分类任务Label Studio使用指南 + + **目录** + +- [1. 安装](#1) +- [2. 文本分类任务标注](#2) + - [2.1 项目创建](#21) + - [2.2 数据上传](#22) + - [2.3 标签构建](#23) + - [2.4 任务标注](#24) + - [2.5 数据导出](#25) + - [2.6 数据转换](#26) + - [2.7 更多配置](#27) + + + +## 1. 安装 +**以下标注示例用到的环境配置:** + +- Python 3.8+ +- label-studio == 1.6.0 + +在终端(terminal)使用pip安装label-studio: + +```shell +pip install label-studio==1.6.0 +``` + +安装完成后,运行以下命令行: +```shell +label-studio start +``` + +在浏览器打开[http://localhost:8080/](http://127.0.0.1:8080/),输入用户名和密码登录,开始使用label-studio进行标注。 + + + +2. 文本分类任务标注 + + + +#### 2.1 项目创建 + +点击创建(Create)开始创建一个新的项目,填写项目名称、描述,然后在``Labeling Setup``中选择``Text Classification``。 + +- 填写项目名称、描述 + +
+ +
+ +- 数据上传,从本地上传txt格式文件,选择``List of tasks``,然后选择导入本项目 + + + +
+ +
+ +- 设置任务,添加标签 + + + +
+ +
+ +
+ +
+ + + +#### 2.2 数据上传 + +项目创建后,可在Project/文本分类任务中点击``Import``继续导入数据,同样从本地上传txt格式文件,选择``List of tasks``,详见[项目创建](#data) 。 + + + +#### 2.3 标签构建 + +项目创建后,可在Setting/Labeling Interface中继续配置标签,详见[项目创建](#label) + +默认模式为单标签多分类数据标注。对于多标签多分类数据标注,需要将`choice`的值由`single`改为`multiple`。 + +
+ +
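作为参考,多标签场景下 Labeling Interface 中 Code 标签页的配置大致如下(标签取值仅为示例,请替换为实际的标签候选):

```xml
<View>
  <Text name="text" value="$text"/>
  <!-- choice="single" 表示单标签标注,改为 "multiple" 即支持多标签标注 -->
  <Choices name="label" toName="text" choice="multiple">
    <Choice value="病情诊断"/>
    <Choice value="治疗方案"/>
    <Choice value="其他"/>
  </Choices>
</View>
```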
+ + + +#### 2.4 任务标注 + +
+ +
+ + + +#### 2.5 数据导出 + +勾选已标注文本ID,选择导出的文件类型为``JSON``,导出数据: + +
+ +
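导出的 JSON 文件是一个标注记录数组,转换脚本 [label_studio.py](./label_studio.py) 只依赖其中的 `data.text` 与 `annotations[0].result[0].value.choices` 字段,单条记录的结构大致如下(内容仅为示例,实际导出还会包含其他字段):

```json
{
  "data": {"text": "中性粒细胞比率偏低"},
  "annotations": [
    {"result": [{"value": {"choices": ["指标解读"]}}]}
  ]
}
```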
+ + + +#### 2.6 数据转换 + +将导出的文件重命名为``label_studio.json``后,放入``./data``目录下。通过[label_studio.py](./label_studio.py)脚本可转为UTC的数据格式。 + +在数据转换阶段,还需要提供标签候选信息,放在`./data/label.txt`文件中,每个标签占一行。例如在医疗意图分类中,标签候选为``["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"]``,也可通过``options``参数直接进行配置。 + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options ./data/label.txt +``` + + + +#### 2.7 更多配置 + +- ``label_studio_file``: 从label studio导出的数据标注文件。 +- ``save_dir``: 训练数据的保存目录,默认存储在``data``目录下。 +- ``splits``: 划分数据集时训练集、验证集所占的比例。默认为[0.8, 0.1, 0.1]表示按照``8:1:1``的比例将数据划分为训练集、验证集和测试集。 +- ``options``: 指定分类任务的类别标签。若输入类型为文件,则文件中每行一个标签。 +- ``is_shuffle``: 是否对数据集进行随机打散,默认为True。 +- ``seed``: 随机种子,默认为1000. + +备注: +- 默认情况下 [label_studio.py](./label_studio.py) 脚本会按照比例将数据划分为 train/dev/test 数据集 +- 每次执行 [label_studio.py](./label_studio.py) 脚本,将会覆盖已有的同名数据文件 +- 对于从label_studio导出的文件,默认文件中的每条数据都是经过人工正确标注的。 + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/zero_shot_text_classification/label_studio_text_en.md b/applications/zero_shot_text_classification/label_studio_text_en.md new file mode 100644 index 0000000000000000000000000000000000000000..e45c4146c116e74a11c153828f0b8f0d311a0283 --- /dev/null +++ b/applications/zero_shot_text_classification/label_studio_text_en.md @@ -0,0 +1,143 @@ +[简体中文](label_studio_text.md) | English + +# Label Studio User Guide - Text Classification + +**Table of contents** + +- [1. Installation](#1) +- [2. Text Classification Task Annotation](#2) + - [2.1 Project Creation](#21) + - [2.2 Data Upload](#22) + - [2.3 Label Construction](#23) + - [2.4 Task Annotation](#24) + - [2.5 Data Export](#25) + - [2.6 Data Conversion](#26) + - [2.7 More Configuration](#27) + + + +## 1. Installation + +** Dependencies used in the following annotation examples:** + +- Python 3.8+ +- label-studio == 1.6.0 + +Use pip to install label-studio in the terminal: + +```shell +pip install label-studio==1.6.0 +``` + +Once the installation is complete, run the following command line: +```shell +label-studio start +``` + +Open [http://localhost:8080/](http://127.0.0.1:8080/) in the browser, enter the user name and password to log in, and start using label-studio for labeling. + + + +## 2. Text Classification Task Annotation + + + +#### 2.1 Project Creation + +Click Create to start creating a new project, fill in the project name, description, and select ``Text Classification`` in ``Labeling Setup``. + +- Fill in the project name, description + +
+ +
+ +- Upload the txt format file locally, select ``List of tasks``, and then choose to import this project. + + + +
+ +
+ +- Define labels + + + +
+ +
+ +
+ +
+ + + +#### 2.2 Data Upload + +You can continue to import local txt format data after project creation. See more details in [Project Creation](#data). + + + +#### 2.3 Label Construction + +After project creation, you can add/delete labels in Setting/Labeling Interface just as in [Project Creation](#label) + +LabelStudio supports single-label data annotation by default. Modify the value of `choice` as `multiple` in the `code` tab when multiple-label annotation is required. + +
+ +
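For reference, the Code configuration of the labeling interface for multi-label annotation roughly looks like the snippet below (the label values are placeholders and should be replaced with your own candidates):

```xml
<View>
  <Text name="text" value="$text"/>
  <!-- choice="single" enables single-label annotation; switch it to "multiple" for multi-label annotation -->
  <Choices name="label" toName="text" choice="multiple">
    <Choice value="病情诊断"/>
    <Choice value="治疗方案"/>
    <Choice value="其他"/>
  </Choices>
</View>
```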
+ + + +#### 2.4 Task annotation + +
+ +
+ + + +#### 2.5 Data Export + +Check the marked text ID, select the exported file type as ``JSON``, and export the data: + +
+ +
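The exported JSON file is an array of annotation records. The conversion script [label_studio.py](./label_studio.py) only relies on the `data.text` and `annotations[0].result[0].value.choices` fields, so a single record roughly looks like the example below (the content is illustrative; real exports contain additional fields):

```json
{
  "data": {"text": "中性粒细胞比率偏低"},
  "annotations": [
    {"result": [{"value": {"choices": ["指标解读"]}}]}
  ]
}
```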
+ + + +#### 2.6 Data conversion + +First, create a label file in the `./data` directory, with one label candidate per line. You can also directly set label condidates list by `options`. Rename the exported file to ``label_studio.json`` and put it in the ``./data`` directory. Through the [label_studio.py](./label_studio.py) script, it can be converted to the data format of UTC. + + +```shell +python label_studio.py \ + --label_studio_file ./data/label_studio.json \ + --save_dir ./data \ + --splits 0.8 0.1 0.1 \ + --options ./data/label.txt +``` + + + +#### 2.7 More Configuration + +- ``label_studio_file``: Data labeling file exported from label studio. +- ``save_dir``: The storage directory of the training data, which is stored in the ``data`` directory by default. +- ``splits``: The proportion of training set and validation set when dividing the data set. The default is [0.8, 0.1, 0.1], which means that the data is divided into training set, verification set and test set according to the ratio of ``8:1:1``. +- ``options``: Specify the label candidates set. For filename, there should be one label per line in the file. For list, the length should be longer than 1. +- ``is_shuffle``: Whether to randomly shuffle the data set, the default is True. +- ``seed``: random seed, default is 1000. + +Note: +- By default the [label_studio.py](./label_studio.py) script will divide the data proportionally into train/dev/test datasets +- Each time the [label_studio.py](./label_studio.py) script is executed, the existing data file with the same name will be overwritten +- For files exported from label_studio, each piece of data in the default file is correctly labeled manually. + +## References +- **[Label Studio](https://labelstud.io/)** diff --git a/applications/zero_shot_text_classification/run_eval.py b/applications/zero_shot_text_classification/run_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..30686ff9fda33ca7075f274710daaf261e491348 --- /dev/null +++ b/applications/zero_shot_text_classification/run_eval.py @@ -0,0 +1,136 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import os +from dataclasses import dataclass, field + +import paddle +from paddle.metric import Accuracy +from sklearn.metrics import f1_score +from utils import UTCLoss, read_local_dataset + +from paddlenlp.datasets import load_dataset +from paddlenlp.prompt import ( + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + UTCTemplate, +) +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import UTC, AutoTokenizer + + +@dataclass +class DataArguments: + test_path: str = field(default="./data/test.txt", metadata={"help": "Test dataset file name."}) + threshold: float = field(default=0.5, metadata={"help": "The threshold to produce predictions."}) + single_label: str = field(default=False, metadata={"help": "Predict exactly one label per sample."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="utc-base", metadata={"help": "Build-in pretrained model."}) + model_path: str = field(default=None, metadata={"help": "Build-in pretrained model."}) + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Load the pretrained language model. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = UTC.from_pretrained(model_args.model_name_or_path) + + # Define template for preprocess and verbalizer for postprocess. + template = UTCTemplate(tokenizer, training_args.max_seq_length) + + # Load and preprocess dataset. + if data_args.test_path is not None: + test_ds = load_dataset(read_local_dataset, data_path=data_args.test_path, lazy=False) + + # Initialize the prompt model. + prompt_model = PromptModelForSequenceClassification( + model, template, None, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + if model_args.model_path is not None: + model_state = paddle.load(os.path.join(model_args.model_path, "model_state.pdparams")) + prompt_model.set_state_dict(model_state) + + # Define the metric function. 
+ def compute_metrics_single_label(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=-1) + labels = paddle.argmax(labels, axis=-1) + print(preds, labels) + metric = Accuracy() + correct = metric.compute(preds, labels) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + def compute_metrics(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + + preds = paddle.nn.functional.sigmoid(preds) + preds = preds[labels != -100].numpy() + labels = labels[labels != -100].numpy() + preds = preds > data_args.threshold + micro_f1 = f1_score(y_pred=preds, y_true=labels, average="micro") + macro_f1 = f1_score(y_pred=preds, y_true=labels, average="macro") + + return {"micro_f1": micro_f1, "macro_f1": macro_f1} + + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=UTCLoss(), + train_dataset=None, + eval_dataset=None, + callbacks=None, + compute_metrics=compute_metrics_single_label if data_args.single_label else compute_metrics, + ) + + if data_args.test_path is not None: + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + with open(os.path.join(training_args.output_dir, "test_metric.json"), "w", encoding="utf-8") as fp: + json.dump(test_ret.metrics, fp) + + with open(os.path.join(training_args.output_dir, "test_predictions.json"), "w", encoding="utf-8") as fp: + if data_args.single_label: + preds = paddle.nn.functional.softmax(paddle.to_tensor(test_ret.predictions), axis=-1) + for index, pred in enumerate(preds): + result = {"id": index} + result["labels"] = paddle.argmax(pred).item() + result["probs"] = pred[result["labels"]].item() + fp.write(json.dumps(result, ensure_ascii=False) + "\n") + else: + preds = paddle.nn.functional.sigmoid(paddle.to_tensor(test_ret.predictions)) + for index, pred in enumerate(preds): + result = {"id": index} + result["labels"] = paddle.where(pred > data_args.threshold)[0].tolist() + result["probs"] = pred[pred > data_args.threshold].tolist() + fp.write(json.dumps(result, ensure_ascii=False) + "\n") + + +if __name__ == "__main__": + main() diff --git a/applications/zero_shot_text_classification/run_train.py b/applications/zero_shot_text_classification/run_train.py new file mode 100644 index 0000000000000000000000000000000000000000..6e7956d19bdd0a84bb70a9da9eb9c53c5574fac4 --- /dev/null +++ b/applications/zero_shot_text_classification/run_train.py @@ -0,0 +1,154 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from dataclasses import dataclass, field + +import paddle +from paddle.metric import Accuracy +from paddle.static import InputSpec +from sklearn.metrics import f1_score +from utils import UTCLoss, read_local_dataset + +from paddlenlp.datasets import load_dataset +from paddlenlp.prompt import ( + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + UTCTemplate, +) +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import UTC, AutoTokenizer, export_model + + +@dataclass +class DataArguments: + dataset_path: str = field( + default="./data", + metadata={"help": "Local dataset directory including train.txt, dev.txt and label.txt (optional)."}, + ) + train_file: str = field(default="train.txt", metadata={"help": "Train dataset file name."}) + dev_file: str = field(default="dev.txt", metadata={"help": "Dev dataset file name."}) + threshold: float = field(default=0.5, metadata={"help": "The threshold to produce predictions."}) + single_label: str = field(default=False, metadata={"help": "Predict exactly one label per sample."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field( + default="utc-base", + metadata={ + "help": "The build-in pretrained UTC model name or path to its checkpoints, such as " + "`utc-xbase`, `utc-base`, `utc-medium`, `utc-mini`, `utc-micro`, `utc-nano` and `utc-pico`." + }, + ) + export_type: str = field(default="paddle", metadata={"help": "The type to export. Support `paddle` and `onnx`."}) + export_model_dir: str = field(default="checkpoints/model_best", metadata={"help": "The export model path."}) + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Load the pretrained language model. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = UTC.from_pretrained(model_args.model_name_or_path) + + # Define template for preprocess and verbalizer for postprocess. + template = UTCTemplate(tokenizer, training_args.max_seq_length) + + # Load and preprocess dataset. + train_ds = load_dataset( + read_local_dataset, + data_path=data_args.dataset_path, + data_file=data_args.train_file, + lazy=False, + ) + dev_ds = load_dataset( + read_local_dataset, + data_path=data_args.dataset_path, + data_file=data_args.dev_file, + lazy=False, + ) + + # Define the criterion. + criterion = UTCLoss() + + # Initialize the prompt model. + prompt_model = PromptModelForSequenceClassification( + model, template, None, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. 
+ def compute_metrics_single_label(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=-1) + labels = paddle.argmax(labels, axis=-1) + metric = Accuracy() + correct = metric.compute(preds, labels) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + def compute_metrics(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.sigmoid(preds) + preds = preds[labels != -100].numpy() + labels = labels[labels != -100].numpy() + preds = preds > data_args.threshold + micro_f1 = f1_score(y_pred=preds, y_true=labels, average="micro") + macro_f1 = f1_score(y_pred=preds, y_true=labels, average="macro") + + return {"micro_f1": micro_f1, "macro_f1": macro_f1} + + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=None, + compute_metrics=compute_metrics_single_label if data_args.single_label else compute_metrics, + ) + + # Training. + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Export. + if training_args.do_export: + input_spec = [ + InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + InputSpec(shape=[None, None], dtype="int64", name="position_ids"), + InputSpec(shape=[None, None, None, None], dtype="float32", name="attention_mask"), + InputSpec(shape=[None, None], dtype="int64", name="omask_positions"), + InputSpec(shape=[None], dtype="int64", name="cls_positions"), + ] + export_model(trainer.pretrained_model, input_spec, model_args.export_model_dir, model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/applications/zero_shot_text_classification/utils.py b/applications/zero_shot_text_classification/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..c983b75dc9d1d26cf3d90fe718fa2dd83e9dde05 --- /dev/null +++ b/applications/zero_shot_text_classification/utils.py @@ -0,0 +1,78 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
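+"""Data reading and loss utilities shared by run_train.py and run_eval.py.
+
+``read_local_dataset`` expects one JSON object per line, for example::
+
+    {"text_a": "...", "text_b": "", "question": "...", "choices": ["A", "B"], "labels": [0]}
+
+``UTCLoss`` is a multi-label loss: per sample it computes
+``logsumexp([s_neg, 0]) + logsumexp([-s_pos, 0])``, where ``s_pos`` / ``s_neg``
+are the logits of the positive / negative classes selected by the one-hot
+labels, and positions labeled -100 are masked out.
+"""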
+ + +import json +import os + +import numpy as np +import paddle + +from paddlenlp.utils.log import logger + + +def read_local_dataset(data_path, data_file=None, is_test=False): + """ + Load datasets with one example per line, formated as: + {"text_a": X, "text_b": X, "question": X, "choices": [A, B], "labels": [0, 1]} + """ + if data_file is not None: + file_paths = [os.path.join(data_path, fname) for fname in os.listdir(data_path) if fname.endswith(data_file)] + else: + file_paths = [data_path] + skip_count = 0 + for file_path in file_paths: + with open(file_path, "r", encoding="utf-8") as fp: + for example in fp: + example = json.loads(example.strip()) + if len(example["choices"]) < 2 or not isinstance(example["text_a"], str) or len(example["text_a"]) < 3: + skip_count += 1 + continue + if "text_b" not in example: + example["text_b"] = "" + if not is_test or "labels" in example: + if not isinstance(example["labels"], list): + example["labels"] = [example["labels"]] + one_hots = np.zeros(len(example["choices"]), dtype="float32") + for x in example["labels"]: + one_hots[x] = 1 + example["labels"] = one_hots.tolist() + + if is_test: + yield example + continue + std_keys = ["text_a", "text_b", "question", "choices", "labels"] + std_example = {k: example[k] for k in std_keys if k in example} + yield std_example + logger.warning(f"Skip {skip_count} examples.") + + +class UTCLoss(object): + def __call__(self, logit, label): + return self.forward(logit, label) + + def forward(self, logit, label): + logit = (1.0 - 2.0 * label) * logit + logit_neg = logit - label * 1e12 + logit_pos = logit - (1.0 - label) * 1e12 + zeros = paddle.zeros_like(logit[..., :1]) + logit_neg = paddle.concat([logit_neg, zeros], axis=-1) + logit_pos = paddle.concat([logit_pos, zeros], axis=-1) + label = paddle.concat([label, zeros], axis=-1) + logit_neg[label == -100] = -1e12 + logit_pos[label == -100] = -1e12 + neg_loss = paddle.logsumexp(logit_neg, axis=-1) + pos_loss = paddle.logsumexp(logit_pos, axis=-1) + loss = (neg_loss + pos_loss).mean() + return loss diff --git a/csrc/README.md b/csrc/README.md new file mode 100644 index 0000000000000000000000000000000000000000..117d9c79c69cb0c9af39a0898bea157f21d89a96 --- /dev/null +++ b/csrc/README.md @@ -0,0 +1,15 @@ +# PaddleNLP 自定义 OP + +此文档介绍如何编译安装 PaddleNLP 自定义 OP。 + +## 安装 C++ 依赖 + +```shell +pip install -r requirements.txt +``` + +## 编译 Cuda 算子 + +```shell +python setup_cuda.py install +``` diff --git a/csrc/generation/encode_rotary_qk.cu b/csrc/generation/encode_rotary_qk.cu new file mode 100644 index 0000000000000000000000000000000000000000..a920f81a2d1160a9a4e9e9eeb84bdb98c2bc5b51 --- /dev/null +++ b/csrc/generation/encode_rotary_qk.cu @@ -0,0 +1,225 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
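+// encode_rotary_qk: applies rotary position embedding (RoPE) to q and kv in place.
+// Shapes assumed by LaunchRotaryQK below:
+//   q, kv      : [batch_size, head_num, seq_len, dim_head]
+//   rotary_emb : cos/sin table of shape [2, batch_size, 1, seq_len, dim_head]
+//                (cos in the first half, sin in the second), as produced by
+//                fused_get_rotary_embedding
+//   seq_lens   : [batch_size] int32; positions beyond the valid length are skipped
+// use_neox == true selects the GPT-NeoX rotate-halves variant, otherwise the
+// interleaved even/odd pairing is used.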
+ +#include "helper.h" + +template +__global__ void NeoXRotaryKernel(const T *input, + const float *cos_emb, + const float *sin_emb, + const int *sequence_lengths, + T *output, + const int rotary_emb_dims, + const int batch_size, + const int head_num, + const int seq_len, + const int last_dim) { + int bi = blockIdx.x; + int hi = blockIdx.y; + int si = blockIdx.z; + if (sequence_lengths && si >= sequence_lengths[bi] * rotary_emb_dims) return; + int half_lastdim = last_dim / 2; + for (int ti = threadIdx.x; ti < half_lastdim; ti += blockDim.x) { + int base_idx = bi * head_num * seq_len * last_dim + + hi * seq_len * last_dim + si * last_dim; + int left_idx = base_idx + ti; + const int right_idx = base_idx + ti + half_lastdim; + int emb_idx_left = bi * seq_len * last_dim + si * last_dim + ti; + int emb_idx_right = + bi * seq_len * last_dim + si * last_dim + ti + half_lastdim; + float input_left = static_cast(input[left_idx]); + float input_right = static_cast(input[right_idx]); + + float cos_tmp_left = cos_emb[emb_idx_left]; + float sin_tmp_left = sin_emb[emb_idx_left]; + float cos_tmp_right = cos_emb[emb_idx_right]; + float sin_tmp_right = sin_emb[emb_idx_right]; + + T res1 = + static_cast(input_left * cos_tmp_left - input_right * sin_tmp_left); + T res2 = static_cast(input_right * cos_tmp_right + + input_left * sin_tmp_right); + output[left_idx] = res1; + output[right_idx] = res2; + } +} + + +template +__global__ void RotaryKernel(const T *input, + const float *cos_emb, + const float *sin_emb, + const int *sequence_lengths, + T *output, + const int rotary_emb_dims, + const int batch_size, + const int head_num, + const int seq_len, + const int last_dim) { + int bi = blockIdx.x; + int hi = blockIdx.y; + int si = blockIdx.z; + if (sequence_lengths && si >= sequence_lengths[bi] * rotary_emb_dims) return; + int half_lastdim = last_dim / 2; + // Note(ZhenyuLi): Calculate the relevant data at one time, so that no + // additional space is required. 
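+  // For each interleaved pair (x_{2t}, x_{2t+1}) the rotation applied below is:
+  //   out_{2t}   = x_{2t}   * cos - x_{2t+1} * sin
+  //   out_{2t+1} = x_{2t+1} * cos + x_{2t}   * sin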
+ for (int ti = threadIdx.x; ti < half_lastdim; ti += blockDim.x) { + int base_idx = bi * head_num * seq_len * last_dim + + hi * seq_len * last_dim + si * last_dim; + int left_idx = base_idx + 2 * ti; + const int right_idx = base_idx + 2 * ti + 1; + int emb_idx = bi * seq_len * last_dim + si * last_dim + 2 * ti; + float input_left = static_cast(input[left_idx]); + float input_right = static_cast(input[right_idx]); + float cos_tmp = cos_emb[emb_idx]; + float sin_tmp = sin_emb[emb_idx]; + T res1 = static_cast(input_left * cos_tmp - input_right * sin_tmp); + T res2 = static_cast(input_right * cos_tmp + input_left * sin_tmp); + output[left_idx] = res1; + output[right_idx] = res2; + } +} + +template +void LaunchRotaryQK(const paddle::Tensor& q, + const paddle::Tensor& kv, + const paddle::Tensor& rotary_emb, + const paddle::Tensor& seq_lens, + const int32_t rotary_emb_dims, + bool use_neox) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + + const int32_t batch_size = q.shape()[0]; + const int32_t head_num = q.shape()[1]; + const int32_t seq_len = q.shape()[2]; + const int32_t dim_head = q.shape()[3]; + + auto cu_stream = q.stream(); + dim3 grid(batch_size, head_num, seq_len * rotary_emb_dims); + const int last_dim = dim_head / rotary_emb_dims; + auto getBlockSize = [](int dim) { + if (dim > 256) { + return 512; + } else if (dim > 128) { + return 256; + } else if (dim > 64) { + return 128; + } else if (dim > 32) { + return 64; + } else { + return 32; + } + }; + int BlockSize = getBlockSize(last_dim / 2); + const float *cos_emb = rotary_emb.data(); + const float *sin_emb = rotary_emb.data() + batch_size * seq_len * dim_head; + + const DataType_* q_data = reinterpret_cast(q.data()); + const DataType_* k_data = reinterpret_cast(kv.data()); + + DataType_* q_out_data = reinterpret_cast(const_cast(q.data())); + DataType_* k_out_data = reinterpret_cast(const_cast(kv.data())); + + + if (!use_neox) { + RotaryKernel<<>>( + q_data, + cos_emb, + sin_emb, + seq_lens.data()/*sequence_lengths*/, + q_out_data, + rotary_emb_dims, + batch_size, + head_num, + seq_len * rotary_emb_dims, + last_dim); + RotaryKernel<<>>( + k_data, + cos_emb, + sin_emb, + seq_lens.data()/*sequence_lengths*/, + k_out_data, + rotary_emb_dims, + batch_size, + head_num, + seq_len * rotary_emb_dims, + last_dim); + } else { + NeoXRotaryKernel<<>>( + q_data, + cos_emb, + sin_emb, + seq_lens.data()/*sequence_lengths*/, + q_out_data, + rotary_emb_dims, + batch_size, + head_num, + seq_len * rotary_emb_dims, + last_dim); + NeoXRotaryKernel<<>>( + k_data, + cos_emb, + sin_emb, + seq_lens.data()/*sequence_lengths*/, + k_out_data, + rotary_emb_dims, + batch_size, + head_num, + seq_len * rotary_emb_dims, + last_dim); + } +} + +void RotaryQK(const paddle::Tensor& q, + const paddle::Tensor& kv, + const paddle::Tensor& rotary_emb, + const paddle::Tensor& seq_lens, + const int32_t rotary_emb_dims, + bool use_neox) { + switch (q.type()) { + case paddle::DataType::BFLOAT16: { + return LaunchRotaryQK( + q, kv, rotary_emb, seq_lens, rotary_emb_dims, use_neox + ); + } + case paddle::DataType::FLOAT16: { + return LaunchRotaryQK( + q, kv, rotary_emb, seq_lens, rotary_emb_dims, use_neox + ); + } + case paddle::DataType::FLOAT32: { + return LaunchRotaryQK( + q, kv, rotary_emb, seq_lens, rotary_emb_dims, use_neox + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only bfloat16, float16 and float32 are supported. 
"); + break; + } + } +} + + + +PD_BUILD_OP(encode_rotary_qk) + .Inputs({"q", "kv", "rotary_emb", "seq_lens"}) + .Outputs({"rotary_q_out", "rotary_kv_out"}) + .SetInplaceMap({{"q", "rotary_q_out"}, {"kv", "rotary_kv_out"}}) + .Attrs({"rotary_emb_dims: int", "use_neox: bool"}) + .SetKernelFn(PD_KERNEL(RotaryQK)); \ No newline at end of file diff --git a/csrc/generation/fused_get_rope.cu b/csrc/generation/fused_get_rope.cu new file mode 100644 index 0000000000000000000000000000000000000000..0115652626433a55fbdeeceee5a4b836348c94e5 --- /dev/null +++ b/csrc/generation/fused_get_rope.cu @@ -0,0 +1,215 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +/* +Position_ids: bsz, max_seq_length +*/ + +template +struct GetPackType { + using type = typename std::aligned_storage::type; +}; + +template +using PackType = typename GetPackType::type; + +template +union Pack { + static_assert(sizeof(PackType) == sizeof(T) * N, ""); + __device__ Pack() { + // do nothing + } + PackType storage; + T elem[N]; +}; + +__global__ __launch_bounds__(kBlockSize) void fused_get_rotary_embedding_neox(const int64_t* position_ids, + const int32_t bsz, + const int32_t max_seq_length, + const int32_t max_position_seq_length, + const int32_t head_dim, + const int32_t prompt_num, + const float inv_head_dim, + const int32_t elem_cnt, + float* rope_embedding) { + /* + In Naive implementation, it will stacks [freqs, freqs] + And actually, each threads can process 1 values, and store continuous 2 same values. + So here We construct a Pack to store 2 values. + */ + constexpr int PackSize = 2; + // Pack SinStorePack{}; + // Pack CosStorePack{}; + + const int half_head_dim = head_dim / PackSize; + const int32_t global_thread_idx = blockIdx.x * blockDim.x + threadIdx.x; + for(int idx = global_thread_idx, step=blockDim.x * gridDim.x; idx < elem_cnt; idx += step){ + const int32_t bsz_seq_idx = idx / half_head_dim; + const int32_t bsz_idx = bsz_seq_idx / max_seq_length; + const int32_t seq_idx = bsz_seq_idx % max_seq_length; + const int64_t position_offset = bsz_idx * max_position_seq_length + seq_idx + prompt_num; + const int32_t half_head_idx = (idx % half_head_dim) * PackSize; + const float exponent_factor = -static_cast(half_head_idx) * inv_head_dim; // * inv_head_dim equals to / head_dim. 
+ const float inv_freq_val = powf(10000.0f, exponent_factor); + const float freqs_val = static_cast(position_ids[position_offset]) * inv_freq_val; + const float cos_embedding_val = cos(freqs_val); + const float sin_embedding_val = sin(freqs_val); + + const int32_t cos_offset = bsz_seq_idx * head_dim + half_head_idx / PackSize; + rope_embedding[cos_offset] = cos_embedding_val; + rope_embedding[cos_offset + half_head_dim] = cos_embedding_val; + const int32_t sin_offset = bsz * max_seq_length * head_dim + cos_offset; + rope_embedding[sin_offset] = sin_embedding_val; + rope_embedding[sin_offset + half_head_dim] = sin_embedding_val; + + // /* + // Since After stack, the continuous 2 elements value is same. + // So here each threads store 2 computed embedding value. + // */ + // #pragma unroll + // for(int unroll_idx = 0; unroll_idx < PackSize; unroll_idx++){ + // CosStorePack.elem[unroll_idx] = cos_embedding_val; + // SinStorePack.elem[unroll_idx] = sin_embedding_val; + // } + // + // const int32_t cos_offset = bsz_seq_idx * head_dim + half_head_idx; + // const int32_t sin_offset = bsz * max_seq_length * head_dim + cos_offset; + // *(reinterpret_cast*>(rope_embedding + cos_offset)) = CosStorePack.storage; + // *(reinterpret_cast*>(rope_embedding + sin_offset)) = SinStorePack.storage; + } +} + +__global__ __launch_bounds__(kBlockSize) void fused_get_rotary_embedding(const int64_t* position_ids, + const int32_t bsz, + const int32_t max_seq_length, + const int32_t max_position_seq_length, + const int32_t head_dim, + const int32_t prompt_num, + const float inv_head_dim, + const int32_t elem_cnt, + float* rope_embedding) { + /* + In Naive implementation, it will stacks [freqs, freqs] + And actually, each threads can process 1 values, and store continuous 2 same values. + So here We construct a Pack to store 2 values. + */ + constexpr int PackSize = 2; + Pack SinStorePack{}; + Pack CosStorePack{}; + + const int half_head_dim = head_dim / PackSize; + const int32_t global_thread_idx = blockIdx.x * blockDim.x + threadIdx.x; + for(int idx = global_thread_idx, step=blockDim.x * gridDim.x; idx < elem_cnt; idx += step){ + const int32_t bsz_seq_idx = idx / half_head_dim; + const int32_t bsz_idx = bsz_seq_idx / max_seq_length; + const int32_t seq_idx = bsz_seq_idx % max_seq_length; + const int64_t position_offset = bsz_idx * max_position_seq_length + seq_idx + prompt_num; + const int32_t half_head_idx = (idx % half_head_dim) * PackSize; + const float exponent_factor = -static_cast(half_head_idx) * inv_head_dim; // * inv_head_dim equals to / head_dim. + const float inv_freq_val = powf(10000.0f, exponent_factor); + const float freqs_val = static_cast(position_ids[position_offset]) * inv_freq_val; + const float cos_embedding_val = cos(freqs_val); + const float sin_embedding_val = sin(freqs_val); + + /* + Since After stack, the continuous 2 elements value is same. + So here each threads store 2 computed embedding value. 
+ */ + #pragma unroll + for(int unroll_idx = 0; unroll_idx < PackSize; unroll_idx++){ + CosStorePack.elem[unroll_idx] = cos_embedding_val; + SinStorePack.elem[unroll_idx] = sin_embedding_val; + } + + const int32_t cos_offset = bsz_seq_idx * head_dim + half_head_idx; + const int32_t sin_offset = bsz * max_seq_length * head_dim + cos_offset; + *(reinterpret_cast*>(rope_embedding + cos_offset)) = CosStorePack.storage; + *(reinterpret_cast*>(rope_embedding + sin_offset)) = SinStorePack.storage; + } +} + +std::vector GetRoPE(const paddle::Tensor& input_ids, + const paddle::Tensor& position_ids, + const paddle::Tensor& head_dim_shape_tensor, + int prompt_num, + bool use_neox) { + const int64_t batch_size = input_ids.shape()[0]; + const int64_t max_seq_length = input_ids.shape()[1]; + const int64_t max_position_seq_length = position_ids.shape()[1]; + const int64_t head_dim = head_dim_shape_tensor.shape()[0]; + const float inv_head_dim = 1.0f / static_cast(head_dim); + + auto cu_stream = position_ids.stream(); + + auto rotary_embedding = paddle::full({2, batch_size, 1, max_seq_length, head_dim}, -1, paddle::DataType::FLOAT32, position_ids.place()); + + assert(head_dim % 2 == 0); + const int32_t elem_cnt = batch_size * max_seq_length * head_dim / 2; + int32_t grid_size = 1; + GetNumBlocks(elem_cnt, &grid_size); + if (use_neox) { + fused_get_rotary_embedding_neox<<>> ( + position_ids.data(), + batch_size, + max_seq_length, + max_position_seq_length, + head_dim, + prompt_num, + inv_head_dim, + elem_cnt, + reinterpret_cast(rotary_embedding.data())); + } else { + fused_get_rotary_embedding<<>> ( + position_ids.data(), + batch_size, + max_seq_length, + max_position_seq_length, + head_dim, + prompt_num, + inv_head_dim, + elem_cnt, + reinterpret_cast(rotary_embedding.data())); + } + return {rotary_embedding}; +} + + + +std::vector> GetRoPEInferShape(const std::vector& input_ids_shape, + const std::vector& position_ids_shape, + const std::vector& head_dim_shape_tensor_shape) { + const int64_t batch_size = position_ids_shape[0]; + const int64_t max_seq_length = input_ids_shape[1]; + const int64_t head_dim = head_dim_shape_tensor_shape[0]; + std::vector out_shape = {2, batch_size, 1, max_seq_length, head_dim}; + return {out_shape}; +} + +std::vector GetRoPEInferDtype(const paddle::DataType& input_ids_dtype, + const paddle::DataType& position_ids_dtype, + const paddle::DataType& head_dim_shape_tensor_dtype) { + // RoPE output dtype is Float. + return {paddle::DataType::FLOAT32}; +} + +PD_BUILD_OP(fused_get_rotary_embedding) + .Inputs({"input_ids", "position_ids", "head_dim_shape_tensor"}) + .Outputs({"rotary_embedding"}) + .Attrs({"prompt_num: int", + "use_neox: bool"}) + .SetKernelFn(PD_KERNEL(GetRoPE)) + .SetInferShapeFn(PD_INFER_SHAPE(GetRoPEInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(GetRoPEInferDtype)); \ No newline at end of file diff --git a/csrc/generation/get_padding_offset.cu b/csrc/generation/get_padding_offset.cu new file mode 100644 index 0000000000000000000000000000000000000000..1d6327a7616a73cfce68e31b16b0a0c3d37a9e6f --- /dev/null +++ b/csrc/generation/get_padding_offset.cu @@ -0,0 +1,133 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +__global__ void RemovePadding(int64_t *output_data, + const int64_t *input_data, + const int *seq_lens, + const int *cum_offsets, + const int sequence_length) { + const int bi = blockIdx.x; + const int tid = threadIdx.x; + + for (int i = tid; i < seq_lens[bi]; i += blockDim.x) { + const int tgt_seq_id = bi * sequence_length - cum_offsets[bi] + i; + const int src_seq_id = bi * sequence_length + i; + output_data[tgt_seq_id] = input_data[src_seq_id]; + } +} + +__global__ void GetCumOffsetKernel(int *token_num, + int *enc_token_num, + int *dec_token_num, + int *cum_offsets, + const int *sequence_lengths, + const int *sequence_lengths_encoder, + const int *sequence_lengths_decoder, + const int batch_size, + const int max_seq_len) { + // get padding offset of each batch + int total_seq_len = 0; + int enc_total_seq_len = 0; + int dec_total_seq_len = 0; + int cum_offset = 0; + int index = 0; + + for (int i = 0; i < batch_size; i++) { + cum_offsets[i] = cum_offset; + int seq_len = sequence_lengths[i]; + int seq_len_enc = sequence_lengths_encoder[i]; + int seq_len_dec = sequence_lengths_decoder[i]; + + cum_offset += max_seq_len - seq_len; + + total_seq_len += seq_len; + enc_total_seq_len += seq_len_enc; + dec_total_seq_len += seq_len_dec; + } + token_num[0] = total_seq_len; + enc_token_num[0] = enc_total_seq_len; + dec_token_num[0] = dec_total_seq_len; +} + +__global__ void GetPaddingOffsetKernel(int *padding_offset, + int *cum_offsets_out, + const int *cum_offsets, + const int *seq_lens, + const int max_seq_len) { + // get padding offset of each batch + const int bi = blockIdx.x; + const int ti = threadIdx.x; + if (ti == 0) { + cum_offsets_out[bi] = bi == 0 ? 0 : cum_offsets[bi - 1]; + } + int cum_offset = bi == 0 ? 
0 : cum_offsets[bi - 1]; + for (int i = ti; i < seq_lens[bi]; i += blockDim.x) { + padding_offset[bi * max_seq_len - cum_offset + i] = cum_offset; + } +} + + +std::vector GetPaddingOffset(const paddle::Tensor& input_ids, + const paddle::Tensor& cum_offsets, + const paddle::Tensor& token_num, + const paddle::Tensor& seq_len) { + auto cu_stream = input_ids.stream(); + std::vector input_ids_shape = input_ids.shape(); + const int bsz = input_ids_shape[0]; + const int seq_length = input_ids_shape[1]; + auto cum_offsets_out = cum_offsets.copy_to(cum_offsets.place(), false); + auto cpu_token_num = token_num.copy_to(paddle::CPUPlace(), false); + const int token_num_data = cpu_token_num.data()[0]; + auto x_remove_padding = paddle::full({token_num_data}, 0, paddle::DataType::INT64, input_ids.place()); + auto padding_offset = paddle::full({token_num_data}, 0, paddle::DataType::INT32, input_ids.place()); + int blockSize = min((token_num_data + 32 - 1) / 32 * 32, 128); + GetPaddingOffsetKernel<<>>( + padding_offset.data(), + cum_offsets_out.data(), + cum_offsets.data(), + seq_len.data(), + seq_length); + RemovePadding<<>>( + x_remove_padding.data(), + input_ids.data(), + seq_len.data(), + cum_offsets_out.data(), + seq_length); + return {x_remove_padding, cum_offsets_out, padding_offset}; // , enc_token_num, dec_token_num}; +} + +std::vector> GetPaddingOffsetInferShape(const std::vector& input_ids_shape, + const std::vector& cum_offsets_shape, + const std::vector& token_num_shape, + const std::vector& seq_len_shape) { + int64_t bsz = input_ids_shape[0]; + int64_t seq_len = input_ids_shape[1]; + return {{-1}, {bsz}, {-1}}; +} + +std::vector GetPaddingOffsetInferDtype(const paddle::DataType& input_ids_dtype, + const paddle::DataType& cum_offsets_dtype, + const paddle::DataType& token_num_dtype, + const paddle::DataType& seq_len_dtype) { + return {input_ids_dtype, seq_len_dtype, seq_len_dtype}; +} + +PD_BUILD_OP(get_padding_offset) + .Inputs({"input_ids", "cum_offsets", "token_num", "seq_len"}) + .Outputs({"x_remove_padding", "cum_offsets_out", "padding_offset"}) + .SetKernelFn(PD_KERNEL(GetPaddingOffset)) + .SetInferShapeFn(PD_INFER_SHAPE(GetPaddingOffsetInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(GetPaddingOffsetInferDtype)); \ No newline at end of file diff --git a/csrc/generation/helper.h b/csrc/generation/helper.h new file mode 100644 index 0000000000000000000000000000000000000000..4a74709aecaea6c725a3c2f4e84a0eb6b31d974d --- /dev/null +++ b/csrc/generation/helper.h @@ -0,0 +1,103 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
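+// Shared helpers for the custom generation ops:
+//   - GetNumBlocks: picks a grid size that keeps roughly kNumWaves waves of
+//     kBlockSize-thread blocks resident on the device
+//   - PDTraits: maps a paddle::DataType to the CUDA compute type (DataType) and
+//     the corresponding paddle C++ type (data_t)
+//   - AlignedVector / Load / Store: 16-byte aligned vectorized global memory access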
+ +#pragma once + +#include "paddle/extension.h" +#include +#include + +constexpr int kBlockSize = 256; +constexpr int kNumWaves = 16; + +inline cudaError_t GetNumBlocks(int64_t n, int* num_blocks) { + int dev; + { + cudaError_t err = cudaGetDevice(&dev); + if (err != cudaSuccess) { return err; } + } + int sm_count; + { + cudaError_t err = cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev); + if (err != cudaSuccess) { return err; } + } + int tpm; + { + cudaError_t err = cudaDeviceGetAttribute(&tpm, cudaDevAttrMaxThreadsPerMultiProcessor, dev); + if (err != cudaSuccess) { return err; } + } + *num_blocks = std::max(1, std::min((n + kBlockSize - 1) / kBlockSize, + sm_count * tpm / kBlockSize * kNumWaves)); + return cudaSuccess; +} + +template +__device__ T max_func(const T a, const T b) { + return a > b ? a : b; +} + +template +struct MaxOp { + __device__ __forceinline__ T operator()(const T& a, const T& b) const { + return max_func(a, b); + } +}; + +template +class PDTraits; + +template <> +class PDTraits { +public: + typedef float DataType; + typedef float data_t; +}; + +template <> +class PDTraits { +public: + typedef half DataType; + typedef paddle::float16 data_t; +}; + +template <> +class PDTraits { +public: + typedef __nv_bfloat16 DataType; + typedef paddle::bfloat16 data_t; +}; + +template +struct alignas(sizeof(T) * Size) AlignedVector { + T val[Size]; + + HOSTDEVICE inline const T& operator[](int i) const { return val[i]; } + HOSTDEVICE inline T& operator[](int i) { return val[i]; } +}; + +template +HOSTDEVICE inline void Load(const T* addr, AlignedVector* vec) { + const AlignedVector* addr_vec = + reinterpret_cast*>(addr); + *vec = *addr_vec; +} + +template +HOSTDEVICE inline void Store(const AlignedVector& vec, T* addr) { + AlignedVector* addr_vec = + reinterpret_cast*>(addr); + *addr_vec = vec; +} + +constexpr int VEC_16B = 16; \ No newline at end of file diff --git a/csrc/generation/qkv_transpose_split.cu b/csrc/generation/qkv_transpose_split.cu new file mode 100644 index 0000000000000000000000000000000000000000..ba9ee1f8ceb75170281afdfffd3ae0b271a0fe8f --- /dev/null +++ b/csrc/generation/qkv_transpose_split.cu @@ -0,0 +1,193 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
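+// qkv_transpose_split: splits a fused, padding-free QKV projection of shape
+// [token_num, 3 * num_head * head_size] into three padded tensors q/k/v of shape
+// [bsz, num_head, max_seq_len, head_size], scattering every token back to its
+// original (batch, position) slot with the help of padding_offset.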
+ +#include "helper.h" + +template +__global__ void fusedQKV_transpose_split_kernel( + T *q_buf, + T *k_buf, + T *v_buf, + const T *qkv, + const int *padding_offset, + const int *seq_lens, + const int32_t elem_cnt, + const int batch_size, + const int max_len_this_time, + const int seq_len, + const int token_num, + const int head_num, + const int size_per_head) { + const int32_t offset = batch_size * max_len_this_time * head_num * size_per_head; + const int32_t hidden_size = head_num * size_per_head; + const int32_t fused_hidden_size = 3 * hidden_size; + int64_t global_thread_idx = blockDim.x * blockIdx.x + threadIdx.x; + using LoadT = AlignedVector; + LoadT src_vec; + LoadT bias_vec; + + for (int32_t linear_index = global_thread_idx * VecSize, + step = gridDim.x * blockDim.x * VecSize; + linear_index < elem_cnt; + linear_index += step) { + Load(&qkv[linear_index], &src_vec); + int32_t bias_idx = linear_index % fused_hidden_size; + const int32_t token_idx = linear_index / fused_hidden_size; + const int32_t ori_token_idx = + token_idx + (padding_offset == nullptr ? 0 : padding_offset[token_idx]); + const int32_t target_batch_id = ori_token_idx / seq_len; + if (seq_lens[target_batch_id] == 0) continue; + const int32_t seq_id = ori_token_idx % seq_len; + + // equal to: + // const int qkv_id = (linear_index % fused_hidden_size) / hidden_size; + const int32_t qkv_id = bias_idx / hidden_size; + const int32_t head_id = (linear_index % hidden_size) / size_per_head; + const int32_t size_id = linear_index % size_per_head; + + if (qkv_id == 0) { + Store( + src_vec, + &q_buf[target_batch_id * head_num * max_len_this_time * size_per_head + + head_id * max_len_this_time * size_per_head + seq_id * size_per_head + + size_id]); + } else if (qkv_id == 1) { + Store( + src_vec, + &k_buf[target_batch_id * head_num * max_len_this_time * size_per_head + + head_id * max_len_this_time * size_per_head + seq_id * size_per_head + + size_id]); + } else { + Store( + src_vec, + &v_buf[target_batch_id * head_num * max_len_this_time * size_per_head + + head_id * max_len_this_time * size_per_head + seq_id * size_per_head + + size_id]); + } + } +} + +template +std::vector qkv_transpose_split(const paddle::Tensor& qkv, // [token_num, dim_embed] + const paddle::Tensor& padding_offset, // [bsz, 1] + const paddle::Tensor& seq_lens, + const paddle::Tensor& input_ids, + int num_head, + int head_size) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + auto cu_stream = qkv.stream(); + std::vector qkv_shape = qkv.shape(); + const int token_num = qkv_shape[0]; + const int bsz = seq_lens.shape()[0]; + const int max_seq_len = input_ids.shape()[1]; //max_seq_len_tensor.copy_to(paddle::CPUPlace(), false).data()[0]; + auto q_out = paddle::full({bsz, num_head, max_seq_len, head_size}, 0, qkv.dtype(), qkv.place()); + auto k_out = paddle::full({bsz, num_head, max_seq_len, head_size}, 0, qkv.dtype(), qkv.place()); + auto v_out = paddle::full({bsz, num_head, max_seq_len, head_size}, 0, qkv.dtype(), qkv.place()); + constexpr int PackSize = VEC_16B / sizeof(DataType_); + const int elem_cnt = token_num * num_head * head_size * 3; + const int pack_num = elem_cnt / PackSize; + const int blocksize = 128; + const int grid_size = (pack_num + blocksize - 1) / blocksize; + fusedQKV_transpose_split_kernel + <<>>( + reinterpret_cast(q_out.data()), + reinterpret_cast(k_out.data()), + reinterpret_cast(v_out.data()), + reinterpret_cast(const_cast(qkv.data())), + padding_offset.data(), + 
seq_lens.data(), + elem_cnt, + bsz, + max_seq_len, + max_seq_len, + token_num, + num_head, + head_size); + return {q_out, k_out, v_out}; +} + +std::vector QKVTransposeSplit(const paddle::Tensor& qkv, + const paddle::Tensor& padding_offset, + const paddle::Tensor& seq_lens, + const paddle::Tensor& input_ids, + int num_head, + int head_size) { + switch (qkv.type()) { + case paddle::DataType::BFLOAT16: { + return qkv_transpose_split( + qkv, + padding_offset, + seq_lens, + input_ids, + num_head, + head_size + ); + } + case paddle::DataType::FLOAT16: { + return qkv_transpose_split( + qkv, + padding_offset, + seq_lens, + input_ids, + num_head, + head_size + ); + } + case paddle::DataType::FLOAT32: { + return qkv_transpose_split( + qkv, + padding_offset, + seq_lens, + input_ids, + num_head, + head_size + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16, bfloat16 and float32 are supported. "); + break; + } + } +} + +std::vector> QKVTransposeSplitInferShape(const std::vector& qkv_shape, + const std::vector& padding_offset_shape, + const std::vector& seq_lens_shape, + const std::vector& input_ids_shape, + int num_head, + int head_size) { + int64_t bsz = seq_lens_shape[0]; + return {{bsz, num_head, -1, head_size}, {bsz, num_head, -1, head_size}, {bsz, num_head, -1, head_size}}; +} + +std::vector QKVTransposeSplitInferDtype(const paddle::DataType& qkv_dtype, + const paddle::DataType& padding_offset_dtype, + const paddle::DataType& seq_lens_dtype, + const paddle::DataType& input_ids_dtype) { + return {qkv_dtype, qkv_dtype, qkv_dtype}; +} + +PD_BUILD_OP(qkv_transpose_split) + .Inputs({"qkv", "padding_offset", "seq_lens", "input_ids"}) + .Outputs({"q_out", "k_out", "v_out"}) + .Attrs({"num_head: int", + "head_size: int"}) + .SetKernelFn(PD_KERNEL(QKVTransposeSplit)) + .SetInferShapeFn(PD_INFER_SHAPE(QKVTransposeSplitInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(QKVTransposeSplitInferDtype)); \ No newline at end of file diff --git a/csrc/generation/rebuild_padding.cu b/csrc/generation/rebuild_padding.cu new file mode 100644 index 0000000000000000000000000000000000000000..3c8dcc9be47fcc208b12ec2e759e1eb50cac02ca --- /dev/null +++ b/csrc/generation/rebuild_padding.cu @@ -0,0 +1,166 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
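+// rebuild_padding: for every sequence in the batch, gathers the hidden state of
+// its last valid token from the packed [token_num, dim_embed] input and writes it
+// to a [bsz, dim_embed] output, locating that token via cum_offsets and seq_lens.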
+ +#include "helper.h" + +template +__global__ void RebuildPaddingKernel(T *output_data, + const T *input_data, + const int *cum_offsets, + const int *seq_lens, + const int max_seq_len, + const int dim_embed, + const int elem_nums) { + using LoadT = AlignedVector; + LoadT src_vec; + const int global_idx = blockDim.x * blockIdx.x + threadIdx.x; + for (int i = global_idx * VecSize; i < elem_nums; i += gridDim.x * blockDim.x * VecSize) { + const int bi = i / dim_embed; + const int bias_idx = i % dim_embed; + int seq_id = seq_lens[bi] - 1; + const int ori_token_idx = bi * max_seq_len - cum_offsets[bi] + seq_id; + const int src_offset = ori_token_idx * dim_embed + bias_idx; + Load(&input_data[src_offset], &src_vec); + Store(src_vec, &output_data[i]); + } +} + +template +__global__ void RebuildPaddingKernel(T *output_data, + const T *input_data, + const int *padding_offset, + const int dim_embed) { + const int tid = threadIdx.x; + const int bid = blockIdx.x; + const int dst_seq_id = bid + padding_offset[bid]; + const int src_seq_id = bid; + + for (int i = tid; i < dim_embed; i += blockDim.x) { + output_data[dst_seq_id * dim_embed + i] = + input_data[src_seq_id * dim_embed + i]; + } +} + +template +void InvokeRebuildPadding(T *output_data, + const T *input_data, + const int *padding_offset, + const int token_num, + const int dim_embed, + cudaStream_t stream) { + // src: [token_num, dim_embed] + // dst: [batch_size * max_seq_len, dim_embed] + RebuildPaddingKernel<<>>( + output_data, input_data, padding_offset, dim_embed); +} + +template +std::vector rebuild_padding(const paddle::Tensor& tmp_out, // [token_num, dim_embed] + const paddle::Tensor& padding_offset, // [bsz, 1] + const paddle::Tensor& seq_lens, + const paddle::Tensor& input_ids) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + auto cu_stream = tmp_out.stream(); + std::vector tmp_out_shape = tmp_out.shape(); + const int token_num = tmp_out_shape[0]; + const int dim_embed = tmp_out_shape[1]; + const int bsz = seq_lens.shape()[0]; + auto out = paddle::full({bsz, dim_embed}, 0, tmp_out.dtype(), tmp_out.place()); + constexpr int PackSize = VEC_16B / sizeof(DataType_); + int elem_nums = out.numel(); + int pack_num = elem_nums / PackSize; + const int blocksize = 128; + const int grid_size = (pack_num + blocksize - 1) / blocksize; + RebuildPaddingKernel<<>>( + reinterpret_cast(out.data()), + reinterpret_cast(const_cast(tmp_out.data())), + padding_offset.data(), + seq_lens.data(), + input_ids.shape()[1], + dim_embed, + elem_nums); + // InvokeRebuildPadding( + // reinterpret_cast(out.data()), + // reinterpret_cast(const_cast(tmp_out.data())), + // padding_offset.data(), + // token_num, + // dim_embed, + // tmp_out.stream() + // ); + return {out}; +} + +std::vector RebuildPadding(const paddle::Tensor& tmp_out, + const paddle::Tensor& padding_offset, + const paddle::Tensor& seq_lens, + const paddle::Tensor& input_ids) { + switch (tmp_out.type()) { + case paddle::DataType::BFLOAT16: { + return rebuild_padding( + tmp_out, + padding_offset, + seq_lens, + input_ids + ); + } + case paddle::DataType::FLOAT16: { + return rebuild_padding( + tmp_out, + padding_offset, + seq_lens, + input_ids + ); + } + case paddle::DataType::FLOAT32: { + return rebuild_padding( + tmp_out, + padding_offset, + seq_lens, + input_ids + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16, bfloat16 and float32 are supported. 
"); + break; + } + } +} + +std::vector> RebuildPaddingInferShape(const std::vector& tmp_out_shape, + const std::vector& padding_offset_shape, + const std::vector& seq_lens_shape, + const std::vector& input_ids_shape) { + int64_t bsz = seq_lens_shape[0]; + int64_t dim_embed = tmp_out_shape[1]; + return {{bsz, dim_embed}}; +} + +std::vector RebuildPaddingInferDtype(const paddle::DataType& tmp_out_dtype, + const paddle::DataType& padding_offset_dtype, + const paddle::DataType& seq_lens_dtype, + const paddle::DataType& input_ids_dtype) { + return {tmp_out_dtype}; +} + +PD_BUILD_OP(rebuild_padding) + .Inputs({"tmp_out", "padding_offset", "seq_lens", "input_ids"}) + .Outputs({"out"}) + .SetKernelFn(PD_KERNEL(RebuildPadding)) + .SetInferShapeFn(PD_INFER_SHAPE(RebuildPaddingInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(RebuildPaddingInferDtype)); \ No newline at end of file diff --git a/csrc/generation/save_with_output.cc b/csrc/generation/save_with_output.cc new file mode 100644 index 0000000000000000000000000000000000000000..8cf19d5c2d818921d3e7c6fc485878af26756ebe --- /dev/null +++ b/csrc/generation/save_with_output.cc @@ -0,0 +1,166 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include +#include +#include +#include +#include + +#include "stdlib.h" +#include +#include // dladdr +#include +#include +#include "paddle/extension.h" + +constexpr char kSEP = '/'; + +std::string DirName(const std::string &filepath) { + auto pos = filepath.rfind(kSEP); + if (pos == std::string::npos) { + return ""; + } + return filepath.substr(0, pos); +} + +bool FileExists(const std::string &filepath) { + struct stat buffer; + return (stat(filepath.c_str(), &buffer) == 0); +} + +void MkDir(const char *path) { + std::string path_error(path); + path_error += " mkdir failed!"; + if (mkdir(path, 0755)) { + if (errno != EEXIST) { + throw std::runtime_error(path_error); + } + } +} + +void MkDirRecursively(const char *fullpath) { + if (*fullpath == '\0') return; // empty string + if (FileExists(fullpath)) return; + MkDirRecursively(DirName(fullpath).c_str()); + MkDir(fullpath); +} + + +template +void saveToFile(std::ostream & os, const void* x_data, std::vector shape, int64_t x_numel, const char type_id) { + // 1.type + os.write(reinterpret_cast(&type_id),sizeof(type_id)); + // 2.data + uint64_t size = x_numel * sizeof(data_t); + os.write(static_cast(x_data),static_cast(size)); + +} + +template +void save_with_output_kernel(const paddle::Tensor& x, + const paddle::Tensor& batch_idx, + const paddle::Tensor& step_idx, + std::string file_path, + int64_t rank_id, + char type_id) { + std::vector x_shape = x.shape(); + + if(rank_id >= 0) { + file_path += "_rank_" + std::to_string(rank_id); + } + + int batch_idx_data = -1, step_idx_data = -1; + + if(batch_idx.is_gpu()) { + paddle::Tensor batch_idx_cpu = batch_idx.copy_to(paddle::CPUPlace()); + batch_idx_data = batch_idx_cpu.data()[0]; + } else { + batch_idx_data = batch_idx.data()[0]; + } + if(step_idx.is_gpu()) { + paddle::Tensor step_idx_cpu = step_idx.copy_to(paddle::CPUPlace()); + step_idx_data = step_idx_cpu.data()[0]; + } else { + step_idx_data = step_idx.data()[0]; + } + auto x_data = x.data(); + + if(batch_idx_data >= 0) { + file_path += "_batch_" + std::to_string(batch_idx_data); + } + if(step_idx_data >= 0) { + file_path += "_step_" + std::to_string(step_idx_data); + } + MkDirRecursively(DirName(file_path).c_str()); + std::ofstream fout(file_path, std::ios::binary); + fout.write("0",1); + saveToFile(fout, x_data, x_shape, x.numel(),type_id); + fout.seekp(std::ios::beg); + fout.write("1",1); + fout.close(); + +} + +void print_shape(const paddle::Tensor& tmp, char *tmp_str){ + std::vector shape = tmp.shape(); + printf("%s's shape: \n", tmp_str); + for(int i=0; i < shape.size(); i++) { + printf("%d ", (int)shape[i]); + } + printf("\n"); +} + +std::vector SaveWithOutputForward(const paddle::Tensor& x, + const paddle::Tensor& batch_idx, + const paddle::Tensor& step_idx, + std::string file_path, + int64_t rank_id) { + auto out = x.copy_to(paddle::CPUPlace(), false); + switch(x.type()) { + case paddle::DataType::FLOAT32: + save_with_output_kernel(out, batch_idx, step_idx, file_path, rank_id, '0'); + break; + case paddle::DataType::INT64: + save_with_output_kernel(out, batch_idx, step_idx, file_path, rank_id,'1'); + break; + case paddle::DataType::INT32: + save_with_output_kernel(out, batch_idx, step_idx, file_path, rank_id, '2'); + break; + default: + PD_THROW("function SaveWithOutputForward is not implemented for data type"); + } + return {out}; +} + +std::vector> SaveWithOutputInferShape(const std::vector& x_shape, + const std::vector& batch_idx_shape, + const std::vector& step_idx_shape) { + return {x_shape}; +} + +std::vector 
SaveWithOutputInferDtype(const paddle::DataType& x_dtype, + const paddle::DataType& batch_idx_dtype, + const paddle::DataType& step_idx_dtype) { + return {x_dtype}; +} + +PD_BUILD_OP(save_with_output) + .Inputs({"x", "batch_idx", "step_idx"}) + .Attrs({"file_path: std::string", + "rank_id: int64_t"}) + .Outputs({"out"}) + .SetKernelFn(PD_KERNEL(SaveWithOutputForward)) + .SetInferShapeFn(PD_INFER_SHAPE(SaveWithOutputInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(SaveWithOutputInferDtype)); diff --git a/csrc/generation/set_alibi_mask_value.cu b/csrc/generation/set_alibi_mask_value.cu new file mode 100644 index 0000000000000000000000000000000000000000..8036f1096ebd536ac20596f20a7c4c3040d1c6b3 --- /dev/null +++ b/csrc/generation/set_alibi_mask_value.cu @@ -0,0 +1,136 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +template +__global__ void set_value_by_id(const int *seq_lens, + const bool *stop_flags, + const float *alibi_slopes, + const int64_t *tgt_pos, + T *output_data, + int *sequence_lengths, + int bs, + int length, + int num_head) { + int bs_id = blockIdx.x; + int hid = threadIdx.x; + if (bs_id < bs) { + T *output_data_now = output_data + bs_id * num_head * length + hid * length; + float tgt_pos_now = static_cast(tgt_pos[bs_id]); + output_data_now[seq_lens[bs_id]] = static_cast(tgt_pos_now * alibi_slopes[hid]); + if (stop_flags[bs_id]) { + sequence_lengths[bs_id] = 0; + } + } +} + +template +std::vector set_mask_value(const paddle::Tensor& input_data, + const paddle::Tensor& stop_flags, + const paddle::Tensor& seq_lens, + const paddle::Tensor& alibi_slopes, + const paddle::Tensor& tgt_pos + ) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + PD_CHECK(seq_lens.dtype() == paddle::DataType::INT32); + PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL); + auto cu_stream = input_data.stream(); + std::vector input_data_shape = input_data.shape(); + std::vector seq_lens_shape = seq_lens.shape(); + auto sequence_lengths = seq_lens.copy_to(seq_lens.place(), false); + + int input_bs = input_data_shape[0]; + int length = input_data_shape[3]; + int seq_bs = seq_lens_shape[0]; + int num_head = alibi_slopes.shape()[0]; + + int grid_size = input_bs; + int block_size = num_head; + set_value_by_id<<>>(seq_lens.data(), + stop_flags.data(), + alibi_slopes.data(), + tgt_pos.data(), + reinterpret_cast(const_cast(input_data.data())), + sequence_lengths.data(), seq_bs, length, num_head); + return {sequence_lengths}; +} + +std::vector SetMaskValue(const paddle::Tensor& input_data, + const paddle::Tensor& stop_flags, + const paddle::Tensor& seq_lens, + const paddle::Tensor& alibi_slopes, + const paddle::Tensor& tgt_pos) { + switch (input_data.type()) { + case paddle::DataType::BFLOAT16: { + return set_mask_value( + input_data, + stop_flags, + seq_lens, + alibi_slopes, + tgt_pos + ); + } + case paddle::DataType::FLOAT16: { + return 
set_mask_value( + input_data, + stop_flags, + seq_lens, + alibi_slopes, + tgt_pos + ); + } + case paddle::DataType::FLOAT32: { + return set_mask_value( + input_data, + stop_flags, + seq_lens, + alibi_slopes, + tgt_pos + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16, bfloat16 and float32 are supported. "); + break; + } + } +} + +std::vector> SetMaskValueInferShape(const std::vector& input_data_shape, + const std::vector& stop_flags_shape, + const std::vector& seq_lens_shape, + const std::vector& alibi_slopes_shape, + const std::vector& tgt_pos) { + return {seq_lens_shape}; +} + +std::vector SetMaskValueInferDtype(const paddle::DataType& input_data_dtype, + const paddle::DataType& stop_flags_dtype, + const paddle::DataType& seq_lens_dtype, + const paddle::DataType& alibi_slopes_dtype, + const paddle::DataType& tgt_pos_dtype) { + return {seq_lens_dtype}; +} + +PD_BUILD_OP(set_alibi_mask_value) + .Inputs({"input_data", "stop_flags", "seq_lens", "alibi_slopes", "tgt_pos"}) + .Outputs({"sequence_lengths"}) + .SetKernelFn(PD_KERNEL(SetMaskValue)) + .SetInferShapeFn(PD_INFER_SHAPE(SetMaskValueInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(SetMaskValueInferDtype)); \ No newline at end of file diff --git a/csrc/generation/set_mask_value.cu b/csrc/generation/set_mask_value.cu new file mode 100644 index 0000000000000000000000000000000000000000..bcd63a277de7b84463b105cf5ebae2bdb5128bd0 --- /dev/null +++ b/csrc/generation/set_mask_value.cu @@ -0,0 +1,123 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
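+// set_mask_value: for each sequence i, writes 1.0 at offset i * length +
+// seq_lens[i] of the attention-mask tensor (marking the newly generated position)
+// and returns a copy of seq_lens in which entries whose stop_flags are set have
+// been zeroed.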
+ +#include "paddle/extension.h" + +template +class PDTraits; + +template <> +class PDTraits { +public: + typedef float DataType; + typedef float data_t; +}; + +template <> +class PDTraits { +public: + typedef half DataType; + typedef paddle::float16 data_t; +}; + +template <> +class PDTraits { +public: + typedef __nv_bfloat16 DataType; + typedef paddle::bfloat16 data_t; +}; + +template +__global__ void set_value_by_id(const int *seq_lens, const bool *stop_flags, T *output_data, int *sequence_lengths, int bs, int length) { + int tid = threadIdx.x; + if (tid < bs) { + T *output_data_now = output_data + tid * length; + output_data_now[seq_lens[tid]] = 1.0; + if (stop_flags[tid]) { + sequence_lengths[tid] = 0; + } + } +} + +template +std::vector set_mask_value(const paddle::Tensor& input_data, const paddle::Tensor& stop_flags, const paddle::Tensor& seq_lens) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + PD_CHECK(seq_lens.dtype() == paddle::DataType::INT32); + PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL); + auto cu_stream = input_data.stream(); + std::vector input_data_shape = input_data.shape(); + std::vector seq_lens_shape = seq_lens.shape(); + auto sequence_lengths = seq_lens.copy_to(seq_lens.place(), false); + + int input_bs = input_data_shape[0]; + int length = input_data_shape[3]; + int seq_bs = seq_lens_shape[0]; + + int block_size = (input_bs + 32 - 1) / 32 * 32; + set_value_by_id<<<1, block_size, 0, cu_stream>>>(seq_lens.data(), + stop_flags.data(), + reinterpret_cast(const_cast(input_data.data())), + sequence_lengths.data(), seq_bs, length); + return {sequence_lengths}; +} + +std::vector SetMaskValue(const paddle::Tensor& input_data, const paddle::Tensor& stop_flags, const paddle::Tensor& seq_lens) { + switch (input_data.type()) { + case paddle::DataType::BFLOAT16: { + return set_mask_value( + input_data, + stop_flags, + seq_lens + ); + } + case paddle::DataType::FLOAT16: { + return set_mask_value( + input_data, + stop_flags, + seq_lens + ); + } + case paddle::DataType::FLOAT32: { + return set_mask_value( + input_data, + stop_flags, + seq_lens + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16, bfloat16 and float32 are supported. "); + break; + } + } +} + +std::vector> SetMaskValueInferShape(const std::vector& input_data_shape, const std::vector& stop_flags_shape, const std::vector& seq_lens_shape) { + return {seq_lens_shape}; +} + +std::vector SetMaskValueInferDtype(const paddle::DataType& input_data_dtype, const paddle::DataType& stop_flags_dtype, const paddle::DataType& seq_lens_dtype) { + return {seq_lens_dtype}; +} + +PD_BUILD_OP(set_mask_value) + .Inputs({"input_data", "stop_flags", "seq_lens"}) + .Outputs({"sequence_lengths"}) + .SetKernelFn(PD_KERNEL(SetMaskValue)) + .SetInferShapeFn(PD_INFER_SHAPE(SetMaskValueInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(SetMaskValueInferDtype)); diff --git a/csrc/generation/set_value_by_flags.cu b/csrc/generation/set_value_by_flags.cu new file mode 100644 index 0000000000000000000000000000000000000000..03900783972c20e85f9f76b92b6141e9b2a79391 --- /dev/null +++ b/csrc/generation/set_value_by_flags.cu @@ -0,0 +1,56 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/extension.h" + +__global__ void set_value_by_flag_and_id(const bool *stop_flags, int64_t *pre_ids_all, const int64_t *pre_ids, const int64_t *step_idx, int bs, int length) { + int tid = threadIdx.x; + if (tid < bs && !stop_flags[tid]) { + int64_t *pre_ids_all_now = pre_ids_all + tid * length; + if (step_idx[tid] >= 0) { + pre_ids_all_now[step_idx[tid]] = pre_ids[tid]; + } + } +} + +std::vector SetValueByFlagsAndIdx(const paddle::Tensor& pre_ids_all, const paddle::Tensor& pre_ids_now, const paddle::Tensor& step_idx, const paddle::Tensor& stop_flags) { + auto cu_stream = stop_flags.stream(); + std::vector pre_ids_all_shape = pre_ids_all.shape(); + auto stop_flags_out = stop_flags.copy_to(stop_flags.place(), false); // gpu -> gpu + + int bs = stop_flags.shape()[0]; + int length = pre_ids_all_shape[1]; + int block_size = (bs + 32 - 1) / 32 * 32; + set_value_by_flag_and_id<<<1, block_size, 0, cu_stream>>>(stop_flags.data(), const_cast(pre_ids_all.data()), pre_ids_now.data(), step_idx.data(), bs, length); + return {stop_flags_out}; +} + +std::vector> SetValueByFlagsAndIdxInferShape(const std::vector& pre_ids_all_shape, const std::vector& pre_ids_now_shape, + const std::vector& step_idx_shape, const std::vector& stop_flags_shape) { + return {stop_flags_shape}; +} + +std::vector SetValueByFlagsAndIdxInferDtype(const paddle::DataType& pre_ids_all_dtype, + const paddle::DataType& pre_ids_now_dtype, + const paddle::DataType& step_idx_dtype, + const paddle::DataType& stop_flags_dtype) { + return {stop_flags_dtype}; +} + +PD_BUILD_OP(set_value_by_flags_and_idx) + .Inputs({"pre_ids_all", "pre_ids_now", "step_idx", "stop_flags"}) + .Outputs({"stop_flags_out"}) + .SetKernelFn(PD_KERNEL(SetValueByFlagsAndIdx)) + .SetInferShapeFn(PD_INFER_SHAPE(SetValueByFlagsAndIdxInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(SetValueByFlagsAndIdxInferDtype)); diff --git a/csrc/generation/stop_generation_multi_ends.cu b/csrc/generation/stop_generation_multi_ends.cu new file mode 100644 index 0000000000000000000000000000000000000000..7be2c6cf3cd1835cb8293d5bba2be22ceac6475b --- /dev/null +++ b/csrc/generation/stop_generation_multi_ends.cu @@ -0,0 +1,126 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
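+// stop_generation_multi_ends: post-processes sampled token ids. Sequences flagged
+// as stopped (and, in modes 0/1, sequences flagged in a memory-mapped early-stop
+// file) have their id replaced with end_ids[0]; any sequence whose resulting id is
+// in end_ids gets its stop flag set. mode 0 combines both flag sources, mode 1
+// uses only the file flags, mode 2 uses only the existing stop_flags.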
+ +#include "paddle/extension.h" +#include +#include +#include +#include +#include +#include +#include +#include + +void set_flags_multi_ends(char *str_flags, bool *res, int length) { + for (int i = 0; i < length; i++) { + if (str_flags[i] == '0') { + res[i] = false; + } else { + res[i] = true; + } + } +} + +__device__ bool is_in_end(const int64_t id, const int64_t *end_ids, int length) { + bool flag = false; + for (int i = 0; i < length; i++) { + if (id == end_ids[i]) { + return true; + } + } + return flag; +} + +__global__ void set_value_by_flags(const bool *stop_flags, const int64_t *end_ids, int64_t *topk_ids, bool *stop_flags_out, const int bs, int end_length) { + int tid = threadIdx.x; + if (tid < bs) { + topk_ids[tid] = stop_flags[tid] ? end_ids[0] : topk_ids[tid]; + __syncthreads(); + if (is_in_end(topk_ids[tid], end_ids, end_length)) { + stop_flags_out[tid] = true; + } + } +} + +__global__ void set_value_by_flags_both(const bool *flags, const bool *stop_flags, const int64_t *end_ids, int64_t *topk_ids, bool *stop_flags_out, const int bs, int end_length) { + int tid = threadIdx.x; + if (tid < bs) { + topk_ids[tid] = flags[tid] || stop_flags[tid] ? end_ids[0] : topk_ids[tid]; + __syncthreads(); + if (is_in_end(topk_ids[tid], end_ids, end_length)) { + stop_flags_out[tid] = true; + } + } +} + +std::vector GetStopFlagsMulti(const paddle::Tensor& topk_ids, const paddle::Tensor& stop_flags, const paddle::Tensor& end_ids, int64_t mode) { + // mode = 0, stop_generation and stop_flags + // mode = 1, just stop_generation + // mode = 2, just stop_flags + PD_CHECK(mode <= 2); + PD_CHECK(topk_ids.dtype() == paddle::DataType::INT64); + PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL); + + auto cu_stream = topk_ids.stream(); + std::vector shape = topk_ids.shape(); + int64_t bs_now = shape[0]; + int64_t end_length = end_ids.shape()[0]; + auto topk_ids_out = topk_ids.copy_to(topk_ids.place(), false); // gpu -> gpu + auto stop_flags_out = stop_flags.copy_to(stop_flags.place(), false); // gpu -> gpu + if (mode == 0 || mode == 1) { + constexpr char *path = "/root/paddlejob/workspace/env_run/lzy/ERNIE_ALL/early_stop/ERNIE3.0-fused-fp16/ops/test"; + auto flags = paddle::full({bs_now, 1}, 1, paddle::DataType::BOOL, paddle::CPUPlace()); + int fd = -1; + int ret = -1; + void *addr = nullptr; + fd = open(path, O_RDWR); + if(-1 == fd){ + perror("open error"); + } + addr = mmap(NULL, bs_now, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if(addr == MAP_FAILED){ + perror("mmap error"); + } + close(fd); + set_flags_multi_ends((char *)addr, flags.data(), bs_now); + munmap(addr, bs_now); + auto flags_gpu = flags.copy_to(topk_ids.place(), false); // cpu -> gpu + int block_size = (bs_now + 32 - 1) / 32 * 32; + if (mode == 0) { + set_value_by_flags_both<<<1, block_size, 0, cu_stream>>>(flags_gpu.data(), stop_flags.data(), end_ids.data(), topk_ids_out.data(), stop_flags_out.data(), bs_now, end_length); + } else { + set_value_by_flags<<<1, block_size, 0, cu_stream>>>(flags_gpu.data(), end_ids.data(), topk_ids_out.data(), stop_flags_out.data(), bs_now, end_length); + } + } else if (mode == 2) { + int block_size = (bs_now + 32 - 1) / 32 * 32; + set_value_by_flags<<<1, block_size, 0, cu_stream>>>(stop_flags.data(), end_ids.data(), topk_ids_out.data(), stop_flags_out.data(), bs_now, end_length); + } + return {topk_ids_out, stop_flags_out}; +} + +std::vector> GetStopFlagsMultiInferShape(const std::vector& topk_ids_shape, const std::vector& stop_flags_shape, const std::vector& end_ids_shape) { + return {topk_ids_shape, 
stop_flags_shape}; +} + +std::vector GetStopFlagsMultiInferDtype(const paddle::DataType& topk_ids_dtype, const paddle::DataType& stop_flags_dtype, const paddle::DataType& end_ids_dtype) { + return {topk_ids_dtype, stop_flags_dtype}; +} + +PD_BUILD_OP(set_stop_value_multi_ends) + .Inputs({"topk_ids", "stop_flags", "end_ids"}) + .Outputs({"topk_ids_out", "stop_flags_out"}) + .Attrs({"mode: int64_t"}) + .SetKernelFn(PD_KERNEL(GetStopFlagsMulti)) + .SetInferShapeFn(PD_INFER_SHAPE(GetStopFlagsMultiInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(GetStopFlagsMultiInferDtype)); \ No newline at end of file diff --git a/csrc/generation/token_penalty_multi_scores.cu b/csrc/generation/token_penalty_multi_scores.cu new file mode 100644 index 0000000000000000000000000000000000000000..3ef010501921ca7a5191a5a7cd84a57ec6d5aa4b --- /dev/null +++ b/csrc/generation/token_penalty_multi_scores.cu @@ -0,0 +1,231 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + + +template +__global__ inline void min_length_logits_process(T* logits, + const int64_t *cur_len, + const int64_t *min_len, + const int64_t *eos_token_id, + const int64_t bs, + const int64_t length, + const int64_t end_length) { + int bi = threadIdx.x; + if (bi >= bs) return; + if (cur_len[bi] < 0) { + return; + } + if (cur_len[bi] < min_len[bi]) { + for (int i=0; i < end_length; i++) { + logits[bi * length + eos_token_id[i]] = -1e10; + } + } +} + +template<> +__global__ inline void min_length_logits_process(half* logits, + const int64_t *cur_len, + const int64_t *min_len, + const int64_t *eos_token_id, + const int64_t bs, + const int64_t length, + const int64_t end_length) { + int bi = threadIdx.x; + if (bi >= bs) return; + if (cur_len[bi] < 0) { + return; + } + if (cur_len[bi] < min_len[bi]) { + for (int i=0; i < end_length; i++) { + logits[bi * length + eos_token_id[i]] = -1e4; + } + } +} + + +__global__ void update_repeat_times(const int64_t *pre_ids, + const int64_t *cur_len, + int *repeat_times, + const int64_t bs, + const int64_t length, + const int64_t length_id) { + int bi = blockIdx.x; + if (cur_len[bi] < 0) { + return; + } + int tid = threadIdx.x; + const int64_t *pre_ids_now = pre_ids + bi * length_id; + int *repeat_times_now = repeat_times + bi * length; + for (int i = tid; i < length_id; i += blockDim.x) { + int64_t id = pre_ids_now[i]; + if (id < 0) break; + atomicAdd(&repeat_times_now[id], 1); + } +} + +template +__global__ void update_value_by_repeat_times(const int *repeat_times, + const T *penalty_scores, + const T *frequency_score, + const T *presence_score, + T *logits, + const int64_t bs, + const int64_t length) { + int bi = blockIdx.x; + int tid = threadIdx.x; + T *logits_now = logits + bi * length; + const int *repeat_times_now = repeat_times + bi * length; + float alpha = static_cast(penalty_scores[bi]); + float beta = static_cast(frequency_score[bi]); + float gamma = static_cast(presence_score[bi]); + for (int i = tid; i < length; 
i += blockDim.x) { + int times = repeat_times_now[i]; + if (times == 0) continue; + float logit_now = static_cast(logits_now[i]); + logit_now = logit_now < 0 ? logit_now * alpha : logit_now / alpha; + logits_now[i] = static_cast(logit_now - times * beta - gamma); + } +} + +template +std::vector token_penalty_multi_scores_kernel(const paddle::Tensor& pre_ids, + const paddle::Tensor& logits, + const paddle::Tensor& penalty_scores, + const paddle::Tensor& frequency_score, + const paddle::Tensor& presence_score, + const paddle::Tensor& cur_len, + const paddle::Tensor& min_len, + const paddle::Tensor& eos_token_id) { + + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + auto cu_stream = logits.stream(); + std::vector shape = logits.shape(); + auto repeat_times = paddle::full(shape, 0, paddle::DataType::INT32, pre_ids.place()); + int64_t bs = shape[0]; + int64_t length = shape[1]; + int64_t length_id = pre_ids.shape()[1]; + auto logits_out = logits.copy_to(logits.place(), false); // gpu -> gpu + + int64_t end_length = eos_token_id.shape()[0]; + + const int block_size = (bs + 32 - 1) / 32 * 32; + min_length_logits_process<<<1, block_size, 0, cu_stream>>>( + reinterpret_cast(const_cast(logits_out.data())), + cur_len.data(), + min_len.data(), + eos_token_id.data(), + bs, length, end_length); + + int block_size_1 = (length_id + 32 - 1) / 32 * 32; + block_size_1 = min(block_size_1, 512); + update_repeat_times<<>>(pre_ids.data(), cur_len.data(), repeat_times.data(), bs, length, length_id); + int block_size_2 = (length + 32 - 1) / 32 * 32; + block_size_2 = min(block_size_2, 512); + update_value_by_repeat_times<<>>( + repeat_times.data(), + reinterpret_cast(const_cast(penalty_scores.data())), + reinterpret_cast(const_cast(frequency_score.data())), + reinterpret_cast(const_cast(presence_score.data())), + reinterpret_cast(const_cast(logits_out.data())), + bs, length); + return {logits_out}; +} + +std::vector TokenPenaltyMultiScores(const paddle::Tensor& pre_ids, + const paddle::Tensor& logits, + const paddle::Tensor& penalty_scores, + const paddle::Tensor& frequency_scores, + const paddle::Tensor& presence_scores, + const paddle::Tensor& cur_len, + const paddle::Tensor& min_len, + const paddle::Tensor& eos_token_id) { + + switch (logits.type()) { + case paddle::DataType::BFLOAT16: { + return token_penalty_multi_scores_kernel( + pre_ids, + logits, + penalty_scores, + frequency_scores, + presence_scores, + cur_len, + min_len, + eos_token_id + ); + } + case paddle::DataType::FLOAT16: { + return token_penalty_multi_scores_kernel( + pre_ids, + logits, + penalty_scores, + frequency_scores, + presence_scores, + cur_len, + min_len, + eos_token_id + ); + } + case paddle::DataType::FLOAT32: { + return token_penalty_multi_scores_kernel( + pre_ids, + logits, + penalty_scores, + frequency_scores, + presence_scores, + cur_len, + min_len, + eos_token_id + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16 and float32 are supported. 
"); + break; + } + } +} + +std::vector> TokenPenaltyMultiScoresInferShape(const std::vector& pre_ids_shape, + const std::vector& logits_shape, + const std::vector& penalty_scores_shape, + const std::vector& frequency_scores_shape, + const std::vector& presence_scores_shape, + const std::vector& cur_len_shape, + const std::vector& min_len_shape, + const std::vector& eos_token_id_shape) { + return {logits_shape}; +} + +std::vector TokenPenaltyMultiScoresInferDtype(const paddle::DataType& pre_ids_dtype, + const paddle::DataType& logits_dtype, + const paddle::DataType& penalty_scores_dtype, + const paddle::DataType& frequency_scores_dtype, + const paddle::DataType& presence_scores_dtype, + const paddle::DataType& cur_len_dtype, + const paddle::DataType& min_len_dtype, + const paddle::DataType& eos_token_id_dtype) { + return {logits_dtype}; +} + +PD_BUILD_OP(get_token_penalty_multi_scores) + .Inputs({"pre_ids", "logits", "penalty_scores", "frequency_scores", "presence_scores", "cur_len", "min_len", "eos_token_id"}) + .Outputs({"logits_out"}) + .SetKernelFn(PD_KERNEL(TokenPenaltyMultiScores)) + .SetInferShapeFn(PD_INFER_SHAPE(TokenPenaltyMultiScoresInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(TokenPenaltyMultiScoresInferDtype)); diff --git a/csrc/generation/top_p_sampling.cu b/csrc/generation/top_p_sampling.cu new file mode 100644 index 0000000000000000000000000000000000000000..ae847d5febc31045a05f55361ed00f51dbd4737b --- /dev/null +++ b/csrc/generation/top_p_sampling.cu @@ -0,0 +1,678 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +#define CHECK_INPUT(x) PD_CHECK(x.is_gpu(), #x " must be a GPU Tensor.") + +#define FINAL_MASK 0xFFFFFFFF + +#define FIXED_BLOCK_DIM_BASE(dim, ...) \ + case (dim): { \ + constexpr auto kBlockDim = (dim); \ + __VA_ARGS__; \ + } break + + +#define FIXED_BLOCK_DIM(...) 
\ + FIXED_BLOCK_DIM_BASE(1024, ##__VA_ARGS__); \ + FIXED_BLOCK_DIM_BASE(512, ##__VA_ARGS__); \ + FIXED_BLOCK_DIM_BASE(256, ##__VA_ARGS__); \ + FIXED_BLOCK_DIM_BASE(128, ##__VA_ARGS__); \ + FIXED_BLOCK_DIM_BASE(64, ##__VA_ARGS__); \ + FIXED_BLOCK_DIM_BASE(32, ##__VA_ARGS__) + +struct SegmentOffsetIter { + explicit SegmentOffsetIter(int num_cols) : num_cols_(num_cols) {} + + __host__ __device__ __forceinline__ int operator()(int idx) const { + return idx * num_cols_; + } + + int num_cols_; +}; + +template +struct Pair { + __device__ __forceinline__ Pair() {} + __device__ __forceinline__ Pair(T value, int id) : v(value), id(id) {} + + __device__ __forceinline__ void set(T value, int id) { + v = value; + id = id; + } + + __device__ __forceinline__ void operator=(const Pair& in) { + v = in.v; + id = in.id; + } + + __device__ __forceinline__ bool operator<(const T value) const { + return ((float)v < (float)value); + } + + __device__ __forceinline__ bool operator>(const T value) const { + return ((float)v > (float)value); + } + __device__ __forceinline__ bool operator<(const Pair& in) const { + return ((float)v < (float)in.v) || (((float)v == (float)in.v) && (id > in.id)); + } + + __device__ __forceinline__ bool operator>(const Pair& in) const { + return ((float)v > (float)in.v) || (((float)v == (float)in.v) && (id < in.id)); + } + + T v; + int id; +}; + +inline int div_up(int a, int n) +{ + return (a + n - 1) / n; +} + +__global__ void setup_kernel(curandState_t *state, const uint64_t seed, const int bs) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + for (int i = idx; i < bs; i += gridDim.x * blockDim.x) { + curand_init(seed, 0, i, &state[i]); + } +} + +template +__device__ __forceinline__ void AddTo(Pair topk[], + const Pair& p, + int beam_size) { + for (int k = beam_size - 2; k >= 0; k--) { + if (topk[k] < p) { + topk[k + 1] = topk[k]; + } else { + topk[k + 1] = p; + return; + } + } + topk[0] = p; +} + +template +__device__ __forceinline__ void GetTopK(Pair topk[], + const T* src, + int idx, + int dim, + int beam_size) { + while (idx < dim) { + if (topk[beam_size - 1] < src[idx]) { + Pair tmp(src[idx], idx); + AddTo(topk, tmp, beam_size); + } + idx += BlockSize; + } +} + +template +__device__ __forceinline__ void GetTopK(Pair topk[], + const T* src, + int idx, + int dim, + const Pair& max, + int beam_size) { + while (idx < dim) { + if (topk[beam_size - 1] < src[idx]) { + Pair tmp(src[idx], idx); + if (tmp < max) { + AddTo(topk, tmp, beam_size); + } + } + idx += BlockSize; + } +} + +template +__device__ __forceinline__ void ThreadGetTopK(Pair topk[], + int* beam, + int beam_size, + const T* src, + bool* firstStep, + bool* is_empty, + Pair* max, + int dim, + const int tid) { + if (*beam > 0) { + int length = (*beam) < beam_size ? 
*beam : beam_size; + if (*firstStep) { + *firstStep = false; + GetTopK(topk, src, tid, dim, length); + } else { + for (int k = 0; k < MaxLength; k++) { + if (k < MaxLength - (*beam)) { + topk[k] = topk[k + *beam]; + } else { + topk[k].set(std::numeric_limits::min(), -1); + } + } + if (!(*is_empty)) { + GetTopK( + topk + MaxLength - *beam, src, tid, dim, *max, length); + } + } + + *max = topk[MaxLength - 1]; + if ((*max).id == -1) *is_empty = true; + *beam = 0; + } +} + +template +__forceinline__ __device__ Pair WarpReduce(Pair input) { +#pragma unroll + for (int offset = 16; offset > 0; offset >>= 1) { + T tmp_val = __shfl_down_sync(FINAL_MASK, input.v, static_cast(offset), 32); + int tmp_id = __shfl_down_sync(FINAL_MASK, input.id, static_cast(offset), 32); + if ((float)input.v < (float)tmp_val) { + input.v = tmp_val; + input.id = tmp_id; + } + } + return input; +} + +template +__device__ __forceinline__ void BlockReduce(Pair shared_max[], + Pair topk[], + Pair beam_max[], + int* beam, + int* k, + int *count, + const int tid, + const int wid, + const int lane) { + while (true) { + __syncthreads(); + Pair input_now = topk[0]; + input_now = WarpReduce(input_now); + + if (lane == 0) { + shared_max[wid] = input_now; + } + __syncthreads(); + input_now = (tid < BlockSize / 32) + ? shared_max[lane] + : Pair(std::numeric_limits::min(), -1); + if (wid == 0) { + input_now = WarpReduce(input_now); + if (lane == 0) shared_max[0] = input_now; + } + __syncthreads(); + if (tid == 0) { + beam_max[*count] = shared_max[0]; + (*count)++; + } + int tid_max = shared_max[0].id % BlockSize; + if (tid == tid_max) { + (*beam)++; + } + if (--(*k) == 0) break; + __syncthreads(); + + if (tid == tid_max) { + if (*beam < MaxLength) { + topk[0] = topk[*beam]; + } + } + + if (MaxLength < 5) { + if (*beam >= MaxLength) break; + } else { + unsigned mask = 0u; + mask = __ballot_sync(FINAL_MASK, true); + if (tid_max / 32 == wid) { + if (__shfl_down_sync(FINAL_MASK, *beam, tid_max % 32, 32) == + MaxLength) + break; + } + } + } +} + +template +__global__ void KeMatrixTopPBeamTopK(const T* src, + T* top_ps, + int64_t* out_id, // topk id + T* out_val, // topk val + int vocab_size, + curandState_t* state, + int* count_iter, + int* count_iter_begin) { + const int tid = threadIdx.x; + const int wid = tid / 32; + const int lane = tid % 32; + const int bid = blockIdx.x; + + int top_num = TopPBeamTopK; + float top_p_num = static_cast(top_ps[bid]); + + __shared__ Pair shared_max[BlockSize / 32]; + __shared__ Pair beam_max[TopPBeamTopK]; + + Pair topk[MaxLength]; + int beam = MaxLength; + Pair max; + bool is_empty = false; + bool firststep = true; + __shared__ int count; + + if (tid == 0) { + count = 0; + } + + for (int j = 0; j < MaxLength; j++) { + topk[j].set(std::numeric_limits::min(), -1); + } + + while (top_num) { + ThreadGetTopK(topk, + &beam, + TopPBeamTopK, + src + bid * vocab_size, + &firststep, + &is_empty, + &max, + vocab_size, + tid); + BlockReduce( + shared_max, topk, beam_max, &beam, &top_num, &count, tid, wid, lane); + } + if (tid == 0) { + count_iter_begin[bid] = count_iter[bid]; + float rand_top_p = curand_uniform(state + bid) * top_p_num; + top_ps[bid] = (T)rand_top_p; + float sum_prob = 0.0f; + + for(int i = 0; i < TopPBeamTopK; i++) { + float val = static_cast(beam_max[i].v); + sum_prob += val; +#ifdef DEBUG_TOPP + printf("bi: %d, top_p: %f, rand_top_p: %f, sum_prob: %f\n", bid, top_p_num, rand_top_p, sum_prob); +#endif + if(sum_prob >= rand_top_p) { + count_iter_begin[bid] += 1; + out_id[bid] = 
static_cast(beam_max[i].id); + out_val[bid] = beam_max[i].v; + break; + } + } + } +} + +__global__ void SetCountIter(int *count_iter, int num) { + int tid = threadIdx.x; + int bid = blockIdx.x; + int idx = bid * blockDim.x + tid; + for (int i = idx; i < num; i += gridDim.x * blockDim.x) { + count_iter[i] = i; + } +} + +template +__global__ void FillIndex(T* indices, T num_rows, T num_cols) { + int col_id = threadIdx.x; + int row_id = blockIdx.x; + + for (T j = row_id; j < num_rows; j += gridDim.x) { + for (T i = col_id; i < num_cols; i += blockDim.x) { + indices[j * num_cols + i] = i; + } + } +} + +struct BlockPrefixCallbackOp { + // Running prefix + float running_total; + // Constructor + __device__ BlockPrefixCallbackOp(float running_total): running_total(running_total) {} + // Callback operator to be entered by the first warp of threads in the block. + // Thread-0 is responsible for returning a value for seeding the block-wide scan. + __device__ float operator()(float block_aggregate) + { + float old_prefix = running_total; + running_total += block_aggregate; + return old_prefix; + } +}; + +template +__global__ void topp_sampling(T* sorted_probs, + int64_t* sorted_id, + T* out_val, + int64_t* out_id, + const T* top_ps, + const T* threshold, + const uint64_t seed, + const int p_num, + const int vocab_size, + int* count_iter, + int* count_iter_begin) { + __shared__ int stop_shared; + __shared__ float rand_p; + const int tid = threadIdx.x; + const int bid = blockIdx.x; + constexpr int NUM_WARPS = BLOCK_SIZE / 32; + constexpr int WARP_SIZE = 32; + const int lane_id = tid % 32; + const int warp_id = tid / 32; + const float p_t = static_cast(top_ps[bid]); + const float threshold_now = threshold ? static_cast(threshold[bid]) : 0.f; + if (tid == 0) { + stop_shared = 0; + rand_p = p_t; +#ifdef DEBUG_TOPP + printf("bi: %d, p: %f\n", bid, rand_p); +#endif + } + if (count_iter_begin[bid] == count_iter[bid + 1]) { + // topk + return; + } + + typedef cub::BlockScan BlockScan; + typedef cub::BlockReduce BlockReduce; + __shared__ typename BlockScan::TempStorage temp_storage; + __shared__ typename BlockReduce::TempStorage temp_storage_reduce; + __shared__ uint32_t selected_shared[NUM_WARPS]; + int threshold_id = 0; + + // Initialize running total + BlockPrefixCallbackOp prefix_op(0); + + if (lane_id == 0) { + selected_shared[warp_id] = 0; + } + __syncthreads(); + + int offset = bid * vocab_size; +#ifdef DEBUG_TOPP + if(tid == 0) { + printf("first_elem1_1: %f, first_elem1_2: %f, first_id1_1: %d, first_id1_2: %d\n", (float)sorted_probs[offset], (float)sorted_probs[offset+1], (int)sorted_id[offset], (int)sorted_id[offset+1]); + } +#endif + int end = ((vocab_size + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE; + int i_activate = 0; + float thread_offset = 0; + for (int i = tid; i < end; i += BLOCK_SIZE) { + float thread_count = + (i < vocab_size) ? 
static_cast(sorted_probs[offset + i]) : 0.f; + if (i < vocab_size && thread_count >= threshold_now) { + threshold_id = i; + } + BlockScan(temp_storage) + .InclusiveSum(thread_count, thread_offset, prefix_op); + + uint32_t activate_mask = __ballot_sync(FINAL_MASK, rand_p <= thread_offset); + + i_activate = i; + if (activate_mask != 0) { + if (lane_id == 0) { + atomicAdd(&stop_shared, 1); + selected_shared[warp_id] = activate_mask; + } + } + __syncthreads(); + if (stop_shared > 0) { + break; + } + } + __syncthreads(); + if (stop_shared == 0) { + if (tid == 0) { + out_id[bid] = sorted_id[offset]; + out_val[bid] = sorted_probs[offset]; +#ifdef DEBUG_TOPP + printf("stop_shared: %d, out_id: %d, out_val: %f\n", (int)stop_shared, (int)out_id[bid], (float)out_val[bid]); +#endif + } + return; + } +#ifdef DEBUG_TOPP + if(tid == 0) { + printf("first_elem2_1: %f, first_elem2_2: %f, first_id2_1: %d, first_id2_2: %d\n", (float)sorted_probs[offset], (float)sorted_probs[offset+1], (int)sorted_id[offset], (int)sorted_id[offset+1]); + } +#endif + bool skip = (selected_shared[warp_id] > 0) ? false : true; + for (int i = 0; i < warp_id; i++) { + if (selected_shared[i] != 0) { + // If the previous has stopped, skip the current warp + skip = true; + } + } + if (!skip) { + int active_lane_id = + WARP_SIZE - __popc(selected_shared[warp_id]); // first not 0 + if (lane_id == active_lane_id) { + float val = static_cast(sorted_probs[offset + i_activate]); +#ifdef DEBUG_TOPP + printf("active_lane_id: %d, i_activate: %d.\n", active_lane_id, i_activate); + for (int i=0; i < active_lane_id; i++) { + printf("p %d, value: %f\n", i, (float)(sorted_probs[offset + i])); + } +#endif + if (val < threshold_now) { + // don't sample low score token + int max_id = BlockReduce(temp_storage_reduce).Reduce(threshold_id, MaxOp()); + curandStatePhilox4_32_10_t rng; + curand_init(seed, tid, 0, &rng); + int random_id = curand(&rng) % (max_id + 1); + out_id[bid] = sorted_id[offset + random_id]; + out_val[bid] = sorted_probs[offset + random_id]; + } else { + out_id[bid] = sorted_id[offset + i_activate]; + out_val[bid] = sorted_probs[offset + i_activate]; + } + } + } +} + +int GetBlockSize(int vocab_size) { + if (vocab_size > 512) { + return 1024; + } else if (vocab_size > 256) { + return 512; + } else if (vocab_size > 128) { + return 256; + } else if (vocab_size > 64) { + return 128; + } else { + return 64; + } +} + +template +std::vector top_p_sampling_kernel(const paddle::Tensor& x, const paddle::Tensor& top_ps, int random_seed) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + std::vector shape = x.shape(); + auto cu_stream = x.stream(); + + int bs = shape[0]; + int p_num = top_ps.numel(); + PD_CHECK(bs == p_num, "PD_CHECK returns ", false, ", expected bs == p_num."); + int vocab_size = shape[1]; + auto topp_ids = paddle::full({bs, 1}, 1, paddle::DataType::INT64, x.place()); + auto topp_probs = paddle::full({bs, 1}, 1, x.dtype(), x.place()); + auto top_ps_tmp = top_ps.copy_to(top_ps.place(), false); + auto inds_input = paddle::full({bs, vocab_size}, 1, paddle::DataType::INT64, x.place()); + auto sorted_out = paddle::full({bs, vocab_size}, 1, x.dtype(), x.place()); + auto sorted_id = paddle::full({bs, vocab_size}, 1, paddle::DataType::INT64, x.place()); + + + int BlockSize = GetBlockSize(vocab_size); + switch (BlockSize) { + FIXED_BLOCK_DIM(FillIndex<<>>(inds_input.data(), bs, vocab_size)); + default: + PD_THROW("the input data shape has error in the FillIndex 
kernel."); + } + + + static int count = 0; + static curandState_t* dev_curand_states; + if (count == 0) { +#if CUDA_VERSION >= 11020 + cudaMallocAsync(&dev_curand_states, bs * sizeof(curandState_t), cu_stream); +#else + cudaMalloc(&dev_curand_states, bs * sizeof(curandState_t)); +#endif + } + uint64_t seed = 0; + if (random_seed == -1) { + srand((unsigned int)(time(NULL))); + seed = rand(); + } else { + seed = random_seed; + } + setup_kernel<<<1, 256, 0, cu_stream>>>(dev_curand_states, seed, bs); + + auto count_iter = paddle::empty({bs + 1}, paddle::DataType::INT32, x.place()); + auto count_iter_begin = paddle::empty({bs}, paddle::DataType::INT32, x.place()); + SetCountIter<<<1, 256, 0, cu_stream>>>(count_iter.data(), bs + 1); + + constexpr int TopKMaxLength = 2; + constexpr int TopPBeamTopK = 5; + switch (BlockSize) { + FIXED_BLOCK_DIM( + KeMatrixTopPBeamTopK<<>>( + reinterpret_cast(const_cast(x.data())), + reinterpret_cast(const_cast(top_ps_tmp.data())), + topp_ids.data(), + reinterpret_cast(topp_probs.data()), + vocab_size, + dev_curand_states, + count_iter.data(), + count_iter_begin.data())); + default: + PD_THROW("the input data shape has error in the topp_beam_topk kernel."); + } + count++; + + size_t temp_storage_bytes = 0; + + cub::TransformInputIterator + segment_offsets_t_begin(count_iter_begin.data(), + SegmentOffsetIter(vocab_size)); + + cub::TransformInputIterator + segment_offsets_t_end(count_iter.data(), + SegmentOffsetIter(vocab_size)); + + DataType_ *x_ptr = reinterpret_cast(const_cast(x.data())); + DataType_ *sorted_out_ptr = reinterpret_cast(const_cast(sorted_out.data())); + int64_t *in_id_ptr = inds_input.data(); + int64_t *out_id_ptr = sorted_id.data(); + + cub::DeviceSegmentedRadixSort::SortPairsDescending(nullptr, + temp_storage_bytes, + x_ptr, + sorted_out_ptr, + in_id_ptr, + out_id_ptr, + vocab_size * bs, + bs, + segment_offsets_t_begin, + segment_offsets_t_end + 1, + 0, + sizeof(data_t) * 8, + cu_stream); + + temp_storage_bytes = div_up(temp_storage_bytes, 256) * 256; + int64_t temp_size = temp_storage_bytes; + auto temp_storage = paddle::empty({temp_size}, paddle::DataType::UINT8, x.place()); + + cub::DeviceSegmentedRadixSort::SortPairsDescending( + temp_storage.data(), + temp_storage_bytes, + x_ptr, + sorted_out_ptr, + in_id_ptr, + out_id_ptr, + vocab_size * bs, + bs, + segment_offsets_t_begin, + segment_offsets_t_end + 1, + 0, + sizeof(data_t) * 8, + cu_stream); + + switch (BlockSize) { + FIXED_BLOCK_DIM( + topp_sampling<<>>( + sorted_out_ptr, + out_id_ptr, + reinterpret_cast(topp_probs.data()), + topp_ids.data(), + reinterpret_cast(const_cast(top_ps_tmp.data())), + nullptr, + seed, + p_num, + vocab_size, + count_iter.data(), + count_iter_begin.data())); + default: + PD_THROW("the input data shape has error in the topp_sampling kernel."); + } + return {topp_probs, topp_ids}; +} + + +std::vector TopPSampling(const paddle::Tensor& x, const paddle::Tensor& top_ps, int random_seed) { + switch (x.type()) { + case paddle::DataType::FLOAT16: { + return top_p_sampling_kernel( + x, + top_ps, + random_seed + ); + } + case paddle::DataType::FLOAT32: { + return top_p_sampling_kernel( + x, + top_ps, + random_seed + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16 and float32 are supported. 
"); + break; + } + } +} + +std::vector> TopPSamplingInferShape(const std::vector& x_shape, + const std::vector& top_ps_shape) { + std::vector out_probs_shape = {x_shape[0], 1}; + std::vector out_ids_shape = {x_shape[0], 1}; + return {out_probs_shape, out_ids_shape}; +} + +std::vector TopPSamplingInferDtype(const paddle::DataType& x_dtype, + const paddle::DataType& top_ps_dtype) { + return {x_dtype, paddle::DataType::INT64}; +} + +PD_BUILD_OP(top_p_sampling) + .Inputs({"x", "top_ps"}) + .Outputs({"topp_probs", "topp_ids"}) + .Attrs({"random_seed: int"}) + .SetKernelFn(PD_KERNEL(TopPSampling)) + .SetInferShapeFn(PD_INFER_SHAPE(TopPSamplingInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(TopPSamplingInferDtype)); \ No newline at end of file diff --git a/csrc/generation/transpose_removing_padding.cu b/csrc/generation/transpose_removing_padding.cu new file mode 100644 index 0000000000000000000000000000000000000000..5b6b16a7faa2dbe5885a2d0fc8d0514a17dfc80e --- /dev/null +++ b/csrc/generation/transpose_removing_padding.cu @@ -0,0 +1,177 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + +template +__global__ void TransposeRemovingPadding(const T* input_data, + const int* seq_lens, + T* output_data, + const int batch_size, + const int num_head, + const int max_len_this_time, + const int seq_len, + const int head_dim, + const int token_num, + const int elem_cnt, + const int* padding_offset) { + // transpose and remove padding + // [batch_size, num_head, max_len_this_time, head_dim] -> [token_num, num_head, + // head_dim] + int64_t idx = blockDim.x * blockIdx.x + threadIdx.x; + const int dim_embed = num_head * head_dim; + using LoadT = AlignedVector; + LoadT src_vec; + + for (int32_t linear_index = idx * VecSize, + step = gridDim.x * blockDim.x * VecSize; + linear_index < elem_cnt; + linear_index += step) { + const int token_idx = linear_index / dim_embed; + const int ori_token_idx = + token_idx + (padding_offset == nullptr ? 
0 : padding_offset[token_idx]); + const int ori_batch_id = ori_token_idx / seq_len; + if (seq_lens && seq_lens[ori_batch_id] == 0) continue; + const int ori_seq_id = ori_token_idx % seq_len; + const int ori_head_id = (linear_index % dim_embed) / head_dim; + const int ori_head_lane = (linear_index % dim_embed) % head_dim; + const int ori_idx = ori_batch_id * num_head * max_len_this_time * head_dim + + ori_head_id * max_len_this_time * head_dim + + ori_seq_id * head_dim + ori_head_lane; + Load(&input_data[ori_idx], &src_vec); + Store(src_vec, &output_data[linear_index]); + } +} + +template +void InvokeTransposeRemovePadding(const T* input_data, + const int* seq_lens, + T* output_data, + const int batch_size, + const int num_head, + const int max_len_this_time, + const int seq_len, + const int head_dim, + const int token_num, + const int* padding_offset, + cudaStream_t cu_stream) { + // [batch_size, num_head, max_len_this_time, head_dim] -> [token_num, num_head, + // head_dim] + constexpr int VEC_16B = 16; + const int elem_cnt = token_num * num_head * head_dim; + constexpr int PackSize = VEC_16B / sizeof(T); + const int32_t pack_num = elem_cnt / PackSize; + const int32_t block_size = 128; + int32_t grid_size = (pack_num + block_size - 1) / block_size; + TransposeRemovingPadding + <<>>(input_data, + seq_lens, + output_data, + batch_size, + num_head, + max_len_this_time, + seq_len, + head_dim, + token_num, + elem_cnt, + padding_offset); +} + +template +std::vector apply_transpose_remove_padding(const paddle::Tensor& input, + const paddle::Tensor& seq_lens, + const paddle::Tensor& padding_offset) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + auto cu_stream = input.stream(); + std::vector input_shape = input.shape(); + const int bsz = input_shape[0]; + const int num_head = input_shape[1]; + const int seq_len = input_shape[2]; + const int dim_head = input_shape[3]; + const int token_num = padding_offset.shape()[0]; + + auto out = paddle::full({token_num, num_head * dim_head}, 0, input.dtype(), input.place()); + InvokeTransposeRemovePadding( + reinterpret_cast(const_cast(input.data())), + seq_lens.data(), + reinterpret_cast(out.data()), + bsz, + num_head, + seq_len, + seq_len, + dim_head, + token_num, + padding_offset.data(), + cu_stream + ); + return {out}; +} + +std::vector ApplyTransposeRemovingPadding(const paddle::Tensor& input, + const paddle::Tensor& seq_lens, + const paddle::Tensor& padding_offset) { + switch (input.type()) { + case paddle::DataType::BFLOAT16: { + return apply_transpose_remove_padding( + input, + seq_lens, + padding_offset + ); + } + case paddle::DataType::FLOAT16: { + return apply_transpose_remove_padding( + input, + seq_lens, + padding_offset + ); + } + case paddle::DataType::FLOAT32: { + return apply_transpose_remove_padding( + input, + seq_lens, + padding_offset + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only float16, bfloat16 and float32 are supported. 
"); + break; + } + } +} + +std::vector> ApplyTransposeRemovingPaddingInferShape( + const std::vector& input_shape, + const std::vector& seq_lens_shape, + const std::vector& padding_offset_shape) { + return {{padding_offset_shape[0], input_shape[1] * input_shape[3]}}; +} + +std::vector ApplyTransposeRemovingPaddingInferDtype( + const paddle::DataType& input_dtype, + const paddle::DataType& seq_lens_dtype, + const paddle::DataType& padding_offset_dtype) { + return {input_dtype}; +} + +PD_BUILD_OP(transpose_remove_padding) + .Inputs({"input", "seq_lens", "padding_offset"}) + .Outputs({"fmha_out"}) + .SetKernelFn(PD_KERNEL(ApplyTransposeRemovingPadding)) + .SetInferShapeFn(PD_INFER_SHAPE(ApplyTransposeRemovingPaddingInferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(ApplyTransposeRemovingPaddingInferDtype)); \ No newline at end of file diff --git a/csrc/generation/write_cache_kv.cu b/csrc/generation/write_cache_kv.cu new file mode 100644 index 0000000000000000000000000000000000000000..62ebf854b0e058cc445775f2f98f7fcf776d2a17 --- /dev/null +++ b/csrc/generation/write_cache_kv.cu @@ -0,0 +1,185 @@ +// Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "helper.h" + + +template +inline __device__ __host__ T div_up(T m, T n) { + return (m + n - 1) / n; +} + +template +__global__ void write_cache_k_kernel(T *cache_k, + const T *k, + const int *seq_lens, + const int num_head, + const int dim_head, + const int seq_len, + const int max_seq_len) { + const int bi = blockIdx.y; + const int len = seq_lens ? seq_lens[bi] : seq_len; + if (len == 0) { + return; + } + + const int hi = blockIdx.z; + constexpr int X_ELEMS = VEC_16B / sizeof(T); + + // [bsz, num_head, seq_len, dim_head/x, x] + auto k_src = reinterpret_cast( + k + bi * num_head * seq_len * dim_head + hi * seq_len * dim_head); + // [bsz, num_head, dim_head/x, max_seq_len, x] + auto k_dst = reinterpret_cast( + cache_k + bi * num_head * max_seq_len * dim_head + + hi * max_seq_len * dim_head); + + const int out_idx = blockIdx.x * blockDim.x + threadIdx.x; + // vec size + int dim_head_div_x = dim_head / X_ELEMS; + + // FIXME(wangxi): num_head is not need? + // if (out_idx >= num_head * dim_head_div_x * max_seq_len) return; + if (out_idx >= dim_head_div_x * max_seq_len) return; + + int idx = out_idx; + const int k_seq_len_id = idx % max_seq_len; + // idx = (idx - k_seq_len_id) / max_seq_len; + idx = idx / max_seq_len; + const int k_vec_id = idx % dim_head_div_x; + + if (k_seq_len_id < len) { + k_dst[out_idx] = k_src[k_seq_len_id * dim_head_div_x + k_vec_id]; + } +} + +template +__global__ void write_cache_v_kernel(T *cache_v, + const T *v, + const int *seq_lens, + const int num_head, + const int dim_head, + const int seq_len, + const int max_seq_len) { + const int bi = blockIdx.y; + const int len = seq_lens ? 
seq_lens[bi] : seq_len; + if (len == 0) { + return; + } + + const int hi = blockIdx.z; + + // [bsz, num_head, seq_len, dim_head/x, x] + auto v_src = reinterpret_cast( + v + bi * num_head * seq_len * dim_head + hi * seq_len * dim_head); + // [bsz, num_head, max_seq_len, dim_head/x, x] + auto v_dst = reinterpret_cast( + cache_v + bi * num_head * max_seq_len * dim_head + + hi * max_seq_len * dim_head); + + const int idx = blockIdx.x * blockDim.x + threadIdx.x; + constexpr int X_ELEMS = VEC_16B / sizeof(T); + const int dim_head_div_x = dim_head / X_ELEMS; + + if (idx >= dim_head_div_x * len) return; + + v_dst[idx] = v_src[idx]; +} + +template +void LaunchWriteCacheKV(const paddle::Tensor& input_k, + const paddle::Tensor& input_v, + const paddle::Tensor& cache_kv, + const paddle::Tensor& sequence_lengths) { + typedef PDTraits traits_; + typedef typename traits_::DataType DataType_; + typedef typename traits_::data_t data_t; + + const int64_t bsz = input_k.shape()[0]; + const int64_t seq_len = input_k.shape()[2]; + const int64_t cache_bsz = cache_kv.shape()[1]; + const int64_t num_head = cache_kv.shape()[2]; + const int64_t dim_head = cache_kv.shape()[4]; + // printf("bsz: %d, cache_bsz: %d, num_head: %d, seq_len: %d, dim_head: %d.\n", bsz, cache_bsz, num_head, seq_len, dim_head); + + auto cache_kv_out = paddle::full({1}, -1, cache_kv.dtype(), cache_kv.place()); + + const DataType_ *k_ptr = reinterpret_cast(input_k.data()); + const DataType_ *v_ptr = reinterpret_cast(input_v.data()); + + // [2, bsz, num_head, max_seq_len, head_dim] + int max_seq_len = cache_kv.shape()[3]; + DataType_ *cache_kv_data = reinterpret_cast(const_cast(cache_kv.data())); + + int64_t cache_k_size = cache_bsz * num_head * max_seq_len * dim_head; + + DataType_ *cache_k_ptr = cache_kv_data; + DataType_ *cache_v_ptr = cache_kv_data + cache_k_size; + + constexpr int block_sz = 128; + constexpr int x = VEC_16B / sizeof(DataType_); + + assert(dim_head % x == 0); + // PD_CHECK((dim_head % x) == 0, "PD_CHECK returns ", false, ", dim_head must be divisible by vec_size."); + + int max_size = max_seq_len * dim_head / x; + int size = seq_len * dim_head / x; + dim3 grid(div_up(max_size, block_sz), bsz, num_head); + dim3 grid_v(div_up(size, block_sz), bsz, num_head); + + // transpose [bsz, num_head, seq_len, dim_head/x, x]-> + // [bsz, num_head, dim_head/x, max_seq_len, x] + write_cache_k_kernel<<>>( + cache_k_ptr, k_ptr, sequence_lengths.data(), num_head, dim_head, seq_len, max_seq_len); + + // copy [bsz, num_head, seq_len, dim_head/x, x]-> + // [bsz, num_head, max_seq_len, dim_head/x, x] + write_cache_v_kernel<<>>( + cache_v_ptr, v_ptr, sequence_lengths.data(), num_head, dim_head, seq_len, max_seq_len); +} + +void WriteCacheKV(const paddle::Tensor& input_k, + const paddle::Tensor& input_v, + const paddle::Tensor& cache_kv, + const paddle::Tensor& sequence_lengths_shape) { + switch (cache_kv.type()) { + case paddle::DataType::BFLOAT16: { + return LaunchWriteCacheKV( + input_k, input_v, cache_kv, sequence_lengths_shape + ); + } + case paddle::DataType::FLOAT16: { + return LaunchWriteCacheKV( + input_k, input_v, cache_kv, sequence_lengths_shape + ); + } + case paddle::DataType::FLOAT32: { + return LaunchWriteCacheKV( + input_k, input_v, cache_kv, sequence_lengths_shape + ); + } + default: { + PD_THROW( + "NOT supported data type. " + "Only bfloat16, float16 and float32 are supported. 
"); + break; + } + } +} + +PD_BUILD_OP(write_cache_kv) + .Inputs({"input_k", "input_v", "cache_kv", "sequence_lengths"}) + .Outputs({"cache_kv_out"}) + .SetInplaceMap({{"cache_kv", "cache_kv_out"}}) + .SetKernelFn(PD_KERNEL(WriteCacheKV)); \ No newline at end of file diff --git a/csrc/requirements.txt b/csrc/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..0bf0625387b4da469c3e1866ea4849ca6d1183ec --- /dev/null +++ b/csrc/requirements.txt @@ -0,0 +1,2 @@ +cupy-cuda116 +pybind11 \ No newline at end of file diff --git a/csrc/setup_cuda.py b/csrc/setup_cuda.py new file mode 100644 index 0000000000000000000000000000000000000000..cf4ef65b45d5c5b417f3b3617ca9075e54a959d0 --- /dev/null +++ b/csrc/setup_cuda.py @@ -0,0 +1,37 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddle.utils.cpp_extension import CUDAExtension, setup + +setup( + name="paddlenlp_ops", + ext_modules=CUDAExtension( + sources=[ + "./generation/save_with_output.cc", + "./generation/set_mask_value.cu", + "./generation/set_value_by_flags.cu", + "./generation/token_penalty_multi_scores.cu", + "./generation/stop_generation_multi_ends.cu", + "./generation/fused_get_rope.cu", + "./generation/get_padding_offset.cu", + "./generation/qkv_transpose_split.cu", + "./generation/rebuild_padding.cu", + "./generation/transpose_removing_padding.cu", + "./generation/write_cache_kv.cu", + "./generation/encode_rotary_qk.cu", + "./generation/top_p_sampling.cu", + "./generation/set_alibi_mask_value.cu", + ] + ), +) diff --git a/docs/FAQ.md b/docs/FAQ.md new file mode 100644 index 0000000000000000000000000000000000000000..12853cf627fdd24ef0ae49cbab32ddaa8693c666 --- /dev/null +++ b/docs/FAQ.md @@ -0,0 +1,495 @@ +## PaddleNLP常见问题汇总(持续更新) + ++ [【精选】NLP精选5问](#NLP精选) + + + [Q1.1 如何加载自己的本地数据集,以便使用PaddleNLP的功能?](#1-1) + + [Q1.2 PaddleNLP会将内置的数据集、模型下载到默认路径,如何修改路径?](#1-2) + + [Q1.3 PaddleNLP中如何保存、加载训练好的模型?](#1-3) + + [Q1.4 当训练样本较少时,有什么推荐的方法能提升模型效果吗?](#1-4) + + [Q1.5 如何提升模型的性能,提升QPS?](#1-5) + ++ [【理论篇】NLP通用问题](#NLP通用问题 ) + + + [Q2.1 数据类别分布不均衡, 有哪些应对方法?](#2-2) + + [Q2.2 如果使用预训练模型,一般需要多少条样本?](#2-3) + ++ [【实战篇】PaddleNLP实战问题](#PaddleNLP实战问题) + + [数据集和数据处理](#数据问题) + + + [Q3.1 使用自己的数据集训练预训练模型时,如何引入额外的词表?](#3-1) + + [模型训练调优](#训练调优问题) + + + [Q3.2 如何加载自己的预训练模型,进而使用PaddleNLP的功能?](#4-1) + + [Q3.3 如果训练中断,需要继续热启动训练,如何保证学习率和优化器能从中断地方继续迭代?](#4-2) + + [Q3.4 如何冻结模型梯度?](#4-3) + + [Q3.5 如何在eval阶段打印评价指标,在各epoch保存模型参数?](#4-4) + + [Q3.6 训练过程中,训练程序意外退出或Hang住,应该如何排查?](#4-5) + + + [Q3.7 在模型验证和测试过程中,如何保证每一次的结果是相同的?](#4-6) + + [Q3.8 ERNIE模型如何返回中间层的输出?](#4-7) + + [预测部署](#部署问题) + + + [Q3.9 PaddleNLP训练好的模型如何部署到服务器 ?](#5-1) + + [Q3.10 静态图模型如何转换成动态图模型?](#5-2) + ++ [特定模型和应用场景咨询](#NLP应用场景) + + [Q4.1 【词法分析】LAC模型,如何自定义标签label,并继续训练?](#6-1) + + [Q4.2 信息抽取任务中,是否推荐使用预训练模型+CRF,怎么实现呢?](#6-2) + + [Q4.3 【阅读理解】`MapDatasets`的`map()`方法中对应的`batched=True`怎么理解,在阅读理解任务中为什么必须把参数`batched`设置为`True`?](#6-3) + + [Q4.4 【语义匹配】语义索引和语义匹配有什么区别?](#6-4) + + [Q4.5 【解语】wordtag模型如何自定义添加命名实体及对应词类?](#6-5) + ++ 
[其他使用咨询](#使用咨询问题) + + [Q5.1 在CUDA11使用PaddlNLP报错?](#7-1) + + [Q5.2 如何设置parameter?](#7-2) + + [Q5.3 GPU版的Paddle虽然能在CPU上运行,但是必须要有GPU设备吗?](#7-3) + + [Q5.4 如何指定用CPU还是GPU训练模型?](#7-4) + + [Q5.5 动态图模型和静态图模型的预测结果一致吗?](#7-5) + + [Q5.6 如何可视化acc、loss曲线图、模型网络结构图等?](#7-6) + + + +## ⭐️【精选】NLP精选5问 + + + +##### Q1.1 如何加载自己的本地数据集,以便使用PaddleNLP的功能? + +**A:** 通过使用PaddleNLP提供的 `load_dataset`, `MapDataset` 和 `IterDataset` ,可以方便的自定义属于自己的数据集哦,也欢迎您贡献数据集到PaddleNLP repo。 + +从本地文件创建数据集时,我们 **推荐** 根据本地数据集的格式给出读取function并传入 `load_dataset()` 中创建数据集。 +以[waybill_ie](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/waybill_ie)快递单信息抽取任务中的数据为例: + +```python +from paddlenlp.datasets import load_dataset + +def read(data_path): + with open(data_path, 'r', encoding='utf-8') as f: + # 跳过列名 + next(f) + for line in f: + words, labels = line.strip('\n').split('\t') + words = words.split('\002') + labels = labels.split('\002') + yield {'tokens': words, 'labels': labels} + +# data_path为read()方法的参数 +map_ds = load_dataset(read, data_path='train.txt', lazy=False) +iter_ds = load_dataset(read, data_path='train.txt', lazy=True) +``` + +如果您习惯使用`paddle.io.Dataset/IterableDataset`来创建数据集也是支持的,您也可以从其他python对象如`List`对象创建数据集,详细内容可参照[官方文档-自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + + + +##### Q1.2 PaddleNLP会将内置的数据集、模型下载到默认路径,如何修改路径? + +**A:** 内置的数据集、模型默认会下载到`$HOME/.paddlenlp/`下,通过配置环境变量可下载到指定路径: + +(1)Linux下,设置 `export PPNLP_HOME="xxxx"`,注意不要设置带有中文字符的路径。 + +(2)Windows下,同样配置环境变量 PPNLP_HOME 到其他非中文字符路径,重启即可。 + + + +##### Q1.3 PaddleNLP中如何保存、加载训练好的模型? + +**A:**(1)PaddleNLP预训练模型 + +​ 保存: + +```python +model.save_pretrained("./checkpoint') +tokenizer.save_pretrained("./checkpoint') +``` + +​ 加载: + +```python +model.from_pretrained("./checkpoint') +tokenizer.from_pretrained("./checkpoint') +``` + +(2)常规模型 + 保存: + +```python +emb = paddle.nn.Embedding(10, 10) +layer_state_dict = emb.state_dict() +paddle.save(layer_state_dict, "emb.pdparams") #保存模型参数 +``` + +​ 加载: +```python +emb = paddle.nn.Embedding(10, 10) +load_layer_state_dict = paddle.load("emb.pdparams") # 读取模型参数 +emb.set_state_dict(load_layer_state_dict) # 加载模型参数 +``` + + + +##### Q1.4 当训练样本较少时,有什么推荐的方法能提升模型效果吗? + +**A:** 增加训练样本带来的效果是最直接的。此外,可以基于我们开源的[预训练模型](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers)进行热启,再用少量数据集fine-tune模型。此外,针对分类、匹配等场景,[小样本学习](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/few_shot)也能够带来不错的效果。 + + + +##### Q1.5 如何提升模型的性能,提升QPS? + +**A:** 从工程角度,对于服务器端部署可以使用[Paddle Inference](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/05_inference_deployment/inference/inference_cn.html)高性能预测引擎进行预测部署。对于Transformer类模型的GPU预测还可以使用PaddleNLP中提供的[FastGeneration](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/ops)功能来进行快速预测,其集成了[NV FasterTransformer](https://github.com/NVIDIA/FasterTransformer)并进行了功能增强。 + +从模型策略角度,可以使用一些模型小型化技术来进行模型压缩,如模型蒸馏和裁剪,通过小模型来实现加速。PaddleNLP中集成了ERNIE-Tiny这样一些通用小模型供下游任务微调使用。另外PaddleNLP提供了[模型压缩示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression),实现了DynaBERT、TinyBERT、MiniLM等方法策略,可以参考对自己的模型进行蒸馏压缩。 + + + +## ⭐️【理论篇】NLP通用问题 + + + +##### Q2.1 数据类别分布不均衡, 有哪些应对方法? + +**A:** 可以采用以下几种方法优化类别分布不均衡问题: + +(1)欠采样:对样本量较多的类别进行欠采样,去除一些样本,使得各类别数目接近。 + +(2)过采样:对样本量较少的类别进行过采样,选择样本进行复制,使得各类别数目接近。 + +(3)修改分类阈值:直接使用类别分布不均衡的数据训练分类器,会使得模型在预测时更偏向于多数类,所以不再以0.5为分类阈值,而是针对少数类在模型仅有较小把握时就将样本归为少数类。 + +(4)代价敏感学习:比如LR算法中设置class_weight参数。 + + + +##### Q2.2 如果使用预训练模型,一般需要多少条样本? 
+ +**A:** 很难定义具体需要多少条样本,取决于具体的任务以及数据的质量。如果数据质量没问题的话,分类、文本匹配任务所需数据量级在百级别,翻译则需要百万级能够训练出一个比较鲁棒的模型。如果样本量较少,可以考虑数据增强,或小样本学习。 + + + + +## ⭐️【实战篇】PaddleNLP实战问题 + + + +### 数据集和数据处理 + + + +##### Q3.1 使用自己的数据集训练预训练模型时,如何引入额外的词表? + +**A:** 预训练模型通常会有配套的tokenzier和词典,对于大多数中文预训练模型,如ERNIE-3.0,使用的都是字粒度的输入,tokenzier会将句子转换为字粒度的形式,模型无法收到词粒度的输入。如果希望引入额外的词典,需要修改预训练模型的tokenizer和词典,可以参考这里[blog](https://kexue.fm/archives/7758/comment-page-1#Tokenizer ),另外注意embedding矩阵也要加上这些新增词的embedding表示。 + +另外还有一种方式可以使用这些字典信息,可以将数据中在词典信息中的词进行整体mask进行一个mask language model的二次预训练,这样经过二次训练的模型就包含了对额外字典的表征。可参考 [PaddleNLP 预训练数据流程](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-1.0/)。 + + +此外还有些词粒度及字词混合粒度的预训练模型,在这些词粒度的模型下引入额外的词表也会容易些,我们也将持续丰富PaddleNLP中的预训练模型。 + + + +### 模型训练调优 + + + +##### Q3.2 如何加载自己的预训练模型,进而使用PaddleNLP的功能? + +**A:** 以bert为例,如果是使用PaddleNLP训练,通过`save_pretrained()`接口保存的模型,可通过`from_pretrained()`来加载: + +```python +tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") +model = BertModel.from_pretrained("bert-base-uncased") +``` + +如果不是上述情况,可以使用如下方式加载模型,也欢迎您贡献模型到PaddleNLP repo中。 + +(1)加载`BertTokenizer`和`BertModel` + +```python +tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") +model = BertModel.from_pretrained("bert-base-uncased") +``` + +(2)调用`save_pretrained()`生成 `model_config.json`、 ``tokenizer_config.json``、`model_state.pdparams`、 `vocab.txt `文件,保存到`./checkpoint`: + +```python +tokenizer.save_pretrained("./checkpoint") +model.save_pretrained("./checkpoint") +``` + +(3)修改`model_config.json`、 `tokenizer_config.json`这两个配置文件,指定为自己的模型,之后通过`from_pretrained()`加载模型。 + +```python +tokenizer = BertTokenizer.from_pretrained("./checkpoint") +model = BertModel.from_pretrained("./checkpoint") +``` + + + +##### Q3.3 如果训练中断,需要继续热启动训练,如何保证学习率和优化器能从中断地方继续迭代? + +**A:** + + (1)完全恢复训练状态,可以先将`lr`、` optimizer`、`model`的参数保存下来: + +```python +paddle.save(lr_scheduler.state_dict(), "xxx_lr") +paddle.save(optimizer.state_dict(), "xxx_opt") +paddle.save(model.state_dict(), "xxx_para") +``` + +(2)加载`lr`、` optimizer`、`model`参数即可恢复训练: + +```python +lr_scheduler.set_state_dict(paddle.load("xxxx_lr")) +optimizer.set_state_dict(paddle.load("xxx_opt")) +model.set_state_dict(paddle.load("xxx_para")) +``` + + + +##### Q3.4 如何冻结模型梯度? + +**A:** +有多种方法可以尝试: + +(1)可以直接修改 PaddleNLP 内部代码实现,在需要冻结梯度的地方用 `paddle.no_grad()` 包裹一下 + + `paddle.no_grad()` 的使用方式,以对 `forward()` 进行冻结为例: + +``` python + # Method 1 + class Model(nn.Layer): + def __init__(self, ...): + ... + + def forward(self, ...): + with paddle.no_grad(): + ... + + + # Method 2 + class Model(nn.Layer): + def __init__(self, ...): + ... + + @paddle.no_grad() + def forward(self, ...): + ... +``` + + `paddle.no_grad()` 的使用也不局限于模型内部实现里面,也可以包裹外部的方法,比如: + +``` python + @paddle.no_grad() + def evaluation(...): + ... + + model = Model(...) + model.eval() + + ... + +``` + +(2)第二种方法:以ERNIE为例,将模型输出的 tensor 设置 `stop_gradient` 为 True。可以使用 `register_forward_post_hook` 按照如下的方式尝试: + +``` python + def forward_post_hook(layer, input, output): + output.stop_gradient=True + + self.ernie.register_forward_post_hook(forward_post_hook) +``` + +(3)第三种方法:在 `optimizer` 上进行处理,`model.parameters` 是一个 `List`,可以通过 `name` 进行相应的过滤,更新/不更新某些参数,这种方法需要对网络结构的名字有整体了解,因为网络结构的实体名字决定了参数的名字,这个使用方法有一定的门槛: + +```python + [ p for p in model.parameters() if 'linear' not in p.name] # 这里就可以过滤一下linear层,具体过滤策略可以根据需要来设定 +``` + + + +##### Q3.5 如何在eval阶段打印评价指标,在各epoch保存模型参数? 
+ +**A:** 飞桨主框架提供了两种训练与预测的方法,一种是用 [paddle.Model()](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/Model_cn.html)对模型进行封装,通过高层API如`Model.fit()`、`Model.evaluate()`、`Model.predict()`等完成模型的训练与预测;另一种就是基于基础API常规的训练方式。 + +(1)对于第一种方法: + +- 我们可以设置 `paddle.Model.fit() ` API中的 *eval_data* 和 *eval_freq* 参数在训练过程中打印模型评价指标:*eval_data* 参数是一个可迭代的验证集数据源,*eval_freq* 参数是评估的频率;当*eval_data* 给定后,*eval_freq* 的默认值为1,即每一个epoch进行一次评估。注意:在训练前,我们需要在 `Model.prepare()` 接口传入metrics参数才能在eval时打印模型评价指标。 + +- 关于模型保存,我们可以设置 `paddle.Model.fit()` 中的 *save_freq* 参数控制模型保存的频率:*save_freq* 的默认值为1,即每一个epoch保存一次模型。 + +(2)对于第二种方法: + +- 我们在PaddleNLP的examples目录下提供了常见任务的训练与预测脚本:如[GLUE](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/benchmark/glue) 和 [SQuAD](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_reading_comprehension/SQuAD)等 + +- 开发者可以参考上述脚本进行自定义训练与预测脚本的开发。 + + + +##### Q3.6 训练过程中,训练程序意外退出或Hang住,应该如何排查? + +**A:** 一般先考虑内存、显存(使用GPU训练的话)是否不足,可将训练和评估的batch size调小一些。 + +需要注意,batch size调小时,学习率learning rate也要调小,一般可按等比例调整。 + + + +##### Q3.7 在模型验证和测试过程中,如何保证每一次的结果是相同的? + +**A:** 在验证和测试过程中常常出现的结果不一致情况一般有以下几种解决方法: + +(1)确保设置了eval模式,并保证数据相关的seed设置保证数据一致性。 + +(2)如果是下游任务模型,查看是否所有模型参数都被导入了,直接使用bert-base这种预训练模型是不包含任务相关参数的,要确认导入的是微调后的模型,否则任务相关参数会随机初始化导致出现随机性。 + +(3)部分算子使用CUDNN后端产生的不一致性可以通过环境变量的设置来避免。如果模型中使用了CNN相关算子,可以设置`FLAGS_cudnn_deterministic=True`。如果模型中使用了RNN相关算子,可以设置`CUBLAS_WORKSPACE_CONFIG=:16:8`或`CUBLAS_WORKSPACE_CONFIG=:4096:2`(CUDNN 10.2以上版本可用,参考[CUDNN 8 release note](https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_8.html))。 + + + +##### Q3.8 ERNIE模型如何返回中间层的输出? + +**A:** 目前的API设计不保留中间层输出,当然在PaddleNLP里可以很方便地修改源码。 +此外,还可以在`ErnieModel`的`__init__`函数中通过`register_forward_post_hook()`为想要保留输出的Layer注册一个`forward_post_hook`函数,在`forward_post_hook`函数中把Layer的输出保存到一个全局的`List`里面。`forward_post_hook`函数将会在`forward`函数调用之后被调用,并保存Layer输出到全局的`List`。详情参考[`register_forward_post_hook()`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/Layer_cn.html#register_forward_post_hook)。 + + + +### 预测部署 + + + +##### Q3.9 PaddleNLP训练好的模型如何部署到服务器 ? + +**A:** 我们推荐在动态图模式下开发,静态图模式部署。 + +(1)动转静 + + 动转静,即将动态图的模型转为可用于部署的静态图模型。 + 动态图接口更加易用,python 风格的交互式编程体验,对于模型开发更为友好,而静态图相比于动态图在性能方面有更绝对的优势。因此动转静提供了这样的桥梁,同时兼顾开发成本和性能。 + 可以参考官方文档 [动态图转静态图文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/04_dygraph_to_static/index_cn.html),使用 `paddle.jit.to_static` 完成动转静。 + 另外,在 PaddleNLP 我们也提供了导出静态图模型的例子,可以参考 [waybill_ie 模型导出](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/waybill_ie/#%E6%A8%A1%E5%9E%8B%E5%AF%BC%E5%87%BA)。 + +(2)借助Paddle Inference部署 + + 动转静之后保存下来的模型可以借助Paddle Inference完成高性能推理部署。Paddle Inference内置高性能的CPU/GPU Kernel,结合细粒度OP横向纵向融合等策略,并集成 TensorRT 实现模型推理的性能提升。具体可以参考文档 [Paddle Inference 简介](https://paddleinference.paddlepaddle.org.cn/master/product_introduction/inference_intro.html)。 + 为便于初次上手的用户更易理解 NLP 模型如何使用Paddle Inference,PaddleNLP 也提供了对应的例子以供参考,可以参考 [/PaddleNLP/examples](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/) 下的deploy目录,如[基于ERNIE的命名实体识别模型部署](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/waybill_ie/deploy/python)。 + + + +##### Q3.10 静态图模型如何转换成动态图模型? + +**A:** 首先,需要将静态图参数保存成`ndarray`数据,然后将静态图参数名和对应动态图参数名对应,最后保存成动态图参数即可。详情可参考[参数转换脚本](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/ernie/static_to_dygraph_params)。 + + + +### ⭐️特定模型和应用场景咨询 + + + +##### Q4.1 【词法分析】LAC模型,如何自定义标签label,并继续训练? 
+ +**A:** 更新label文件`tag.dict`,添加 修改下CRF的标签数即可。 + +可参考[自定义标签示例](https://github.com/PaddlePaddle/PaddleNLP/issues/662),[增量训练自定义LABLE示例](https://github.com/PaddlePaddle/PaddleNLP/issues/657)。 + + + +##### Q4.2 信息抽取任务中,是否推荐使用预训练模型+CRF,怎么实现呢? + +**A:** 预训练模型+CRF是一个通用的序列标注的方法,目前预训练模型对序列信息的表达也是非常强的,也可以尝试直接使用预训练模型对序列标注任务建模。 + + + +##### Q4.3.【阅读理解】`MapDatasets`的`map()`方法中对应的`batched=True`怎么理解,在阅读理解任务中为什么必须把参数`batched`设置为`True`? + +**A:** `batched=True`就是对整个batch(这里不一定是训练中的batch,理解为一组数据就可以)的数据进行map,即map中的trans_func接受一组数据为输入,而非逐条进行map。在阅读理解任务中,根据使用的doc_stride不同,一条样本可能被转换成多条feature,对数据逐条map是行不通的,所以需要设置`batched=True`。 + + + +##### Q4.4 【语义匹配】语义索引和语义匹配有什么区别? + +**A:** 语义索引要解决的核心问题是如何从海量 Doc 中通过 ANN 索引的方式快速、准确地找出与 query 相关的文档,语义匹配要解决的核心问题是对 query和文档更精细的语义匹配信息建模。换个角度理解, [语义索引](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/semantic_indexing)是要解决搜索、推荐场景下的召回问题,而[语义匹配](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_matching)是要解决排序问题,两者要解决的问题不同,所采用的方案也会有很大不同,但两者间存在一些共通的技术点,可以互相借鉴。 + + + +##### Q4.5 【解语】wordtag模型如何自定义添加命名实体及对应词类? + +**A:** 其主要依赖于二次构造数据来进行finetune,同时要更新termtree信息。wordtag分为两个步骤: +(1)通过BIOES体系进行分词; +(2)将分词后的信息和TermTree进行匹配。 + 因此我们需要: +(1)分词正确,这里可能依赖于wordtag的finetune数据,来让分词正确; +(2)wordtag里面也需要把分词正确后term打上相应的知识信息。wordtag自定义TermTree的方式将在后续版本提供出来。 + +可参考[issue](https://github.com/PaddlePaddle/PaddleNLP/issues/822)。 + + + +### ⭐️其他使用咨询 + + + +##### Q5.1 在CUDA11使用PaddlNLP报错? + +**A:** 在CUDA11安装,可参考[issue](https://github.com/PaddlePaddle/PaddleNLP/issues/348),其他CUDA版本安装可参考 [官方文档](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/conda/linux-conda.html) + + + +##### Q5.2 如何设置parameter? + +**A:** 有多种方法: +(1)可以通过`set_value()`来设置parameter,`set_value()`的参数可以是`numpy`或者`tensor`。 + +```python + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") else + self.ernie.config["initializer_range"], + shape=layer.weight.shape)) +``` +(2)通过`create_parameter()`设置参数。 + +``` python + class MyLayer(paddle.nn.Layer): + def __init__(self): + super(MyLayer, self).__init__() + self._linear = paddle.nn.Linear(1, 1) + w_tmp = self.create_parameter([1,1]) + self.add_parameter("w_tmp", w_tmp) + + def forward(self, input): + return self._linear(input) + + mylayer = MyLayer() + for name, param in mylayer.named_parameters(): + print(name, param) +``` + + + +##### Q5.3 GPU版的Paddle虽然能在CPU上运行,但是必须要有GPU设备吗? + +**A:** 不支持 GPU 的设备只能安装 CPU 版本的 PaddlePaddle。 GPU 版本的 PaddlePaddle 如果想只在 CPU 上运行,可以通过 `export CUDA_VISIBLE_DEVICES=-1` 来设置。 + + + +##### Q5.4 如何指定用CPU还是GPU训练模型? + +**A:** 一般我们的训练脚本提供了 `--device` 选项,用户可以通过 `--device` 选择需要使用的设备。 + +具体而言,在Python文件中,我们可以通过·paddle.device.set_device()·,设置为gpu或者cpu,可参考 [set_device文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/device/set_device_cn.html#set-device)。 + + + +##### Q5.5 动态图模型和静态图模型的预测结果一致吗? + +**A:** 正常情况下,预测结果应当是一致的。如果遇到不一致的情况,可以及时反馈给 PaddleNLP 的开发人员,我们进行处理。 + + + +##### Q5.6 如何可视化acc、loss曲线图、模型网络结构图等? 
+ +**A:** 可使用[VisualDL](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/03_VisualDL/index_cn.html)进行可视化。其中acc、loss曲线图的可视化可参考[Scalar——折线图组件](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/03_VisualDL/visualdl_usage_cn.html#scalar)使用指南,模型网络结构的可视化可参考[Graph——网络结构组件](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/03_VisualDL/visualdl_usage_cn.html#graph)使用指南。 diff --git a/docs/Makefile b/docs/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..ed88099027f775942fa65dce2314f1ae9675cb36 --- /dev/null +++ b/docs/Makefile @@ -0,0 +1,20 @@ +# Minimal makefile for Sphinx documentation +# + +# You can set these variables from the command line, and also +# from the environment for the first two. +SPHINXOPTS ?= +SPHINXBUILD ?= sphinx-build +SOURCEDIR = . +BUILDDIR = build + +# Put it first so that "make" without argument is like "make help". +help: + @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + +.PHONY: help Makefile + +# Catch-all target: route all unknown targets to Sphinx using the new +# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/_static/custom.css b/docs/_static/custom.css new file mode 100644 index 0000000000000000000000000000000000000000..bbd2345b4ee40ccf407c3b4f8afb657b9b7b803c --- /dev/null +++ b/docs/_static/custom.css @@ -0,0 +1,3 @@ +.wy-nav-content { + max-width: 80%; +} diff --git a/docs/advanced_guide/distributed_training.rst b/docs/advanced_guide/distributed_training.rst new file mode 100644 index 0000000000000000000000000000000000000000..16bee02a8eb86c4ab4f3a603c5c0cafdc013b981 --- /dev/null +++ b/docs/advanced_guide/distributed_training.rst @@ -0,0 +1,5 @@ +============ +大规模分布式训练 +============ + +大规模分布式训练: diff --git a/docs/advanced_guide/fastgeneration/fastgeneration.rst b/docs/advanced_guide/fastgeneration/fastgeneration.rst new file mode 100644 index 0000000000000000000000000000000000000000..95fff8849aef78c8daa8ffdfd1225152c6dc6868 --- /dev/null +++ b/docs/advanced_guide/fastgeneration/fastgeneration.rst @@ -0,0 +1,189 @@ +======== +FastGeneration加速生成API +======== + +FastGeneration是PaddleNLP v2.2版本加入的一个高性能推理功能,可实现基于CUDA的序列解码。该功能可以用于多种生成类的预训练NLP模型,例如GPT、BART、UnifiedTransformer等,并且支持多种解码策略。因此该功能主要适用于机器翻译,文本续写,文本摘要,对话生成等任务。 + +功能底层依托于 `FasterTransformer `_ ,该库专门针对Transformer系列模型及各种解码策略进行了优化。功能顶层封装于 `model.generate` 函数。功能的开启和关闭通过传入 `use_fast` 参数进行控制(默认为开启状态)。该功能具有如下特性: + +- 全面支持生成式预训练模型。包括GPT、BART、mBART、UnifiedTransformer和UNIMO-text。 +- 支持大多数主流解码策略。包括Beam Search、Sampling、Greedy Search。以及Diverse Sibling Search、Length Penalty等子策略。 +- 解码速度快。最高可达非加速版generate函数的 **17倍**。HuggingFace generate函数的 **8倍**。**并支持FP16混合精度计算**。 详细性能试验数据请参见 `FastGeneration Performence `_ 。 +- 易用性强。功能的入口为 `model.generate` ,与非加速版生成api的使用方法相同,当满足加速条件时使用jit即时编译高性能算子并用于生成,不满足则自动切换回非加速版生成api。下图展示了FastGeneration的启动流程: + +.. image:: ../../imgs/fast_generation.png + +快速开始 +----------- + +为体现FastGeneration的易用性,我们在 `samples `_ 文件夹中内置了几个典型任务示例,下面以基于GPT模型的中文文本续写任务为例: + +.. code-block:: + + python samples/gpt_sample.py + + +如果是第一次执行,PaddleNLP会启动即时编译( `JIT Compile `_ )自动编译高性能解码算子。 + +.. code-block:: + + ... 
+ 2021-11-17 13:42:56,771 - INFO - execute command: cd /10.2/hub/PaddleNLP/paddlenlp/ops/extenstions && /usr/local/bin/python FasterTransformer_setup.py build + INFO:utils.cpp_extension:execute command: cd /10.2/hub/PaddleNLP/paddlenlp/ops/extenstions && /usr/local/bin/python FasterTransformer_setup.py build + grep: warning: GREP_OPTIONS is deprecated; please use an alias or script + running build + running build_ext + -- The C compiler identification is GNU 8.2.0 + -- The CXX compiler identification is GNU 8.2.0 + -- The CUDA compiler identification is NVIDIA 10.2.89 + -- Check for working C compiler: /usr/bin/cc + -- Check for working C compiler: /usr/bin/cc -- works + -- Detecting C compiler ABI info + -- Detecting C compiler ABI info - done + -- Detecting C compile features + -- Detecting C compile features - done + -- Check for working CXX compiler: /usr + ... + + +编译过程通常会花费几分钟的时间但是只会进行一次,之后再次使用高性能解码不需要重新编译了。编译完成后会继续运行,可以看到生成的结果如下: + +.. code-block:: + + Model input: 花间一壶酒,独酌无相亲。举杯邀明月, + Result: 对影成三人。 + +打开示例代码 `samples/gpt_sample.py` ,我们可以看到如下代码: + +.. code-block:: + + ... + model = GPTLMHeadModel.from_pretrained(model_name) + ... + outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search') + ... + +可以看到,FastGeneration的使用方法与 `model.generate()` 相同,只需传入输入tensor和解码相关参数即可,使用非常简便。如果要使用非加速版的 `model.generate()` 方法,只需传入 `use_fast=False` 即可,示例如下: + +.. code-block:: + + ... + outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search', use_fast=False) + ... + +.. note:: + + 需要注意的是,如果传入 `model.generate()` 的参数不满足高性能版本的要求。程序会做出提示并自动切换为非加速版本,例如我们传入 `min_length=1` ,会得到如下提示: + + .. code-block:: + + [2021-11-17 14:21:06,132] [ WARNING] - 'min_length != 0' is not supported yet in the fast version + [2021-11-17 14:21:06,132] [ WARNING] - FastGeneration is not available, and the original version would be used instead. + + +关于该方法的更多参数可以参考API文档 `generate `_ 。 + +`samples `_ 文件夹中的其他示例的使用方法相同。 + +其他示例 +----------- + +除了以上简单示例之外,PaddleNLP的examples中所有使用了 `model.generate()` 的示例都可以通过调整到合适的参数使用高性能推理。具体如下: + +- `examples/dialogue/unified_transformer `_ +- `model_zoo/gpt/fast_gpt `_ +- `examples/text_generation/unimo-text `_ +- `examples/text_summarization/bart `_ + +根据提示修改对应参数即可使用FastGeneration加速生成。下面我们以基于 `Unified Transformer` 的任务型对话为例展示一下FastGeneration的加速效果: + +打开以上链接中Unified Transformer对应的example,找到README中对应预测的脚本。稍作修改如下: + +.. code-block:: + + export CUDA_VISIBLE_DEVICES=0 + python infer.py \ + --model_name_or_path=unified_transformer-12L-cn-luge \ + --output_path=./predict.txt \ + --logging_steps=10 \ + --seed=2021 \ + --max_seq_len=512 \ + --max_knowledge_len=256 \ + --batch_size=4 \ + --min_dec_len=1 \ + --max_dec_len=64 \ + --num_return_sequences=1 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu + +由于这里只是展示性能,我们直接在 `model_name_or_path` 填入PaddleNLP预训练模型名称 `unified_transformer-12L-cn-luge` 。 + +可以看到,由于该任务为对话任务,我们为了防止模型生成过多安全回复(如:哈哈哈、不错等),保证生成结果具有更多的随机性,我们选择TopK-sampling作为解码策略,并让k=5。 + +打开 `infer.py` ,可以看到我们传入的脚本参数大多都提供给了 `model.generate()` 方法: + +.. 
code-block:: + + output = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + seq_len=seq_len, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + num_return_sequences=args.num_return_sequences, + use_fp16_decoding=args.use_fp16_decoding, + use_fast=args.faster) + +运行脚本,输出结果如下: + +.. code-block:: + + step 10 - 1.695s/step + step 20 - 1.432s/step + step 30 - 1.435s/step + +可以看到,非加速版 `generate()` 方法的预测速度为每个step耗时1.5秒左右。 + +下面我们在启动脚本中传入 `--faster` 参数,这会让 `generate()` 方法传入 `use_fast=True` ,启动加速模式。同时我们需要设置 `--min_dec_len=0` ,因为FastGeneration当前还不支持该参数。新的脚本启动参数如下: + +.. code-block:: + + export CUDA_VISIBLE_DEVICES=0 + python infer.py \ + --model_name_or_path=unified_transformer-12L-cn-luge \ + --output_path=./predict.txt \ + --logging_steps=10 \ + --seed=2021 \ + --max_seq_len=512 \ + --max_knowledge_len=256 \ + --batch_size=4 \ + --min_dec_len=0 \ + --max_dec_len=64 \ + --num_return_sequences=1 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu \ + --faster + +再次运行脚本,输出结果如下(由于我们已经编译过高性能算子,所以这里不会重新编译): + +.. code-block:: + + [2021-11-23 13:38:09,200] [ DEBUG] - skipping 'FastTransformer' extension (up-to-date) build + step 10 - 0.511s/step + step 20 - 0.343s/step + step 30 - 0.419s/step + +可以看到,FastGeneration的预测速度为每个step耗时0.4秒左右,提速超过三倍。如果减少 `num_return_sequences` ,可以得到更高的加速比。 diff --git a/docs/advanced_guide/fastgeneration/fasttransformer.rst b/docs/advanced_guide/fastgeneration/fasttransformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..28d767560676c634ed872de32aaced47ac823fdc --- /dev/null +++ b/docs/advanced_guide/fastgeneration/fasttransformer.rst @@ -0,0 +1,241 @@ +============ +Transformer高性能加速 +============ + + +使用环境说明 +------------ + +* 本项目依赖于 PaddlePaddle 2.1.0 及以上版本或适当的 develop 版本 +* CMake >= 3.10 +* CUDA 10.1 或 10.2(需要 PaddlePaddle 框架一致) +* gcc 版本需要与编译 PaddlePaddle 版本一致,比如使用 gcc8.2 +* 推荐使用 Python3 +* `FasterTransformer `_ 使用必要的环境 +* 环境依赖 + + - attrdict + - pyyaml + + .. code-block:: + + pip install attrdict pyyaml + + +快速开始 +------------ + +我们实现了基于 FasterTransformer 的自定义 op 的接入,打造了 FastGeneration 的能力用于加速文本生成模型在 GPU 上的预测性能。接下来,我们将分别介绍基于 Python 动态图和预测库使用 FastGeneration 自定义 op 的方式,包括 op 的编译与使用。 + +Python 动态图使用自定义 op +------------ + +JIT 自动编译 +^^^^^^^^^^^^ + +目前当基于动态图使用 FastGeneration 预测加速自定义 op 时,PaddleNLP 提供了 Just In Time 的自动编译,在一些 API 上,用户无需关注编译流程,可以直接执行对应的 API,程序会自动编译需要的第三方库。 + +以 Transformer 为例,可以直接调用 `TransformerGenerator()` 这个 API,程序会自动编译。使用示例可以参考 `Transformer 预测加速使用示例-sample `_,`Transformer 预测加速使用示例-机器翻译 `_。 + +编译自定义OP +^^^^^^^^^^^^ + +除了自动编译外,如果需要自行编译,我们已经提供对应的 CMakeLists.txt,可以参考使用如下的方式完成编译。 + +PaddleNLP 准备 +"""""""""""" + +首先,如果需要从源码自行编译,可以直接使用 Python 的 package 下的 paddlenlp,或是可从 github 克隆一个 PaddleNLP,并重新编译: + +以下以从 github 上 clone 一个新版 PaddleNLP 为例: + +.. code-block:: + + git clone https://github.com/PaddlePaddle/PaddleNLP.git + +其次,配置环境变量,让我们可以使用当前 clone 的 paddlenlp,并进入到自定义 OP 的路径,准备后续的编译操作: + +.. code-block:: + + export PYTHONPATH=$PWD/PaddleNLP/:$PYTHONPATH + cd PaddleNLP/paddlenlp/ops/ + +编译 +"""""""""""" + +编译之前,请确保安装的 PaddlePaddle 的版本高于 2.1.0 或是基于最新的 develop 分支的代码编译,并且正常可用。 + +编译自定义 OP 可以参照一下步骤: + +.. code-block:: + + mkdir build + cd build/ + cmake .. 
-DCMAKE_BUILD_TYPE=Release -DPY_CMD=python3.x + make -j + cd ../ + +可以使用的编译选项包括: + +* `-DPY_CMD`: 指定当前装有 PaddlePaddle 版本的 python 环境,比如 `-DPY_CMD=python3.7`。若未指定 `-DPY_CMD` 将会默认使用系统命令 `python` 对应的 Python。 +* `-DSM`: 是指的所用 GPU 的 compute capability,建议不使用该选项设置,未设置时将自动检测。如要设置,需根据 [compute capability](https://developer.nvidia.com/zh-cn/cuda-gpus#compute) 进行设置,如 V100 时设置 `-DSM=70` 或 T4 时设置 `-DSM=75`。 +* `-DWITH_GPT`: 是否编译带有 GPT 相关的 lib。若使用 GPT-2 高性能推理,需要加上 `-DWITH_GPT=ON`。默认为 OFF。 +* `-DWITH_UNIFIED`: 是否编译带有 Unified Transformer 或是 UNIMOText 相关的 lib。若使用,需要加上 `-DWITH_UNIFIED=ON`。默认为 ON。 +* `-DWITH_BART`: 是否编译带有 BART 支持的相关 lib。若使用,需要加上 `-DWITH_BART=ON`。默认为 ON。 +* `-DWITH_DECODER`: 是否编译带有 decoder 优化的 lib。默认为 ON。 + +最终,编译会在 `./build/lib/` 路径下,产出 `libdecoding_op.so`,即需要的 FastGeneration decoding 执行的库。 + +使用 Transformer decoding 高性能推理 +^^^^^^^^^^^^ + +编写 python 脚本的时候,调用 `FasterTransformer API `_ 即可实现 Transformer 模型的高性能预测。 + +举例如下: + +.. code-block:: + + from paddlenlp.ops import FasterTransformer + + transformer = FasterTransformer( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + n_layer=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + decoding_strategy=args.decoding_strategy, + beam_size=args.beam_size, + topk=args.topk, + topp=args.topp, + max_out_len=args.max_out_len, + decoding_lib=args.decoding_lib, + use_fp16_decoding=args.use_fp16_decoding) + +若当前环境下没有需要的自定义 op 的动态库,将会使用 JIT 自动编译需要的动态库。如果需要自行编译自定义 op 所需的动态库,可以如前文所述进行编译。编译好后,使用 `FasterTransformer(decoding_lib="/path/to/lib", ...)` 可以完成导入。 + +更详细的例子可以参考 `Transformer 预测加速使用示例-sample `_,`Transformer 预测加速使用示例-机器翻译 `_,我们提供了更详细用例。 + +Transformer decoding 示例代码 +"""""""""""" + +使用 PaddlePaddle 仅执行 decoding 测试(float32): + +.. code-block:: + + export CUDA_VISIBLE_DEVICES=0 + export FLAGS_fraction_of_gpu_memory_to_use=0.1 + # 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 + ./build/third-party/build/fastertransformer/bin/decoding_gemm 32 4 8 64 30000 32 512 0 + python ./fast_transformer/sample/decoding_sample.py --config ./fast_transformer/sample/config/decoding.sample.yaml --decoding_lib ./build/lib/libdecoding_op.so + +使用 PaddlePaddle 仅执行 decoding 测试(float16): +执行 float16 的 decoding,需要在执行的时候,加上 `--use_fp16_decoding` 选项。 + +.. code-block:: + + export CUDA_VISIBLE_DEVICES=0 + export FLAGS_fraction_of_gpu_memory_to_use=0.1 + # 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 + ./build/third-party/build/fastertransformer/bin/decoding_gemm 32 4 8 64 30000 32 512 1 + python ./fast_transformer/sample/decoding_sample.py --config ./fast_transformer/sample/config/decoding.sample.yaml --decoding_lib ./build/lib/libdecoding_op.so --use_fp16_decoding + +其中,`decoding_gemm` 不同参数的意义可以参考 `FasterTransformer 文档 `_。这里提前执行 `decoding_gemm`,可以在当前路径下生成一个 config 文件,里面会包含针对当前 decoding 部分提供的配置下,性能最佳的矩阵乘的算法,并在执行的时候读入这个数据。 + +C++ 预测库使用自定义 op +------------ + +编译自定义OP +^^^^^^^^^^^^ + +在 C++ 预测库使用自定义 OP 需要将实现的 C++、CUDA 代码**以及 C++ 预测的 demo**编译成一个可执行文件。因预测库支持方式与 Python 不同,这个过程将不会产生自定义 op 的动态库,将直接得到可执行文件。我们已经提供对应的 CMakeLists.txt ,可以参考使用如下的方式完成编译。并获取执行 demo。 + +PaddleNLP 准备 +"""""""""""" + +首先,因为需要基于当前环境重新编译,当前的 paddlenlp 的 python 包里面并不包含 FastGeneration 相关 lib,需要从源码自行编译,可以直接使用 Python 的 package 下的 paddlenlp,或是可从 github 克隆一个 PaddleNLP,并重新编译: + +以下以从 github 上 clone 一个新版 PaddleNLP 为例: + +.. 
code-block:: + + git clone https://github.com/PaddlePaddle/PaddleNLP.git + +其次,让我们可以使用当前 clone 的 paddlenlp,并进入到自定义 OP 的路径,准备后续的编译操作: + +.. code-block:: + + cd PaddleNLP/paddlenlp/ops/ + +编译 +"""""""""""" + +编译之前,请确保安装的 PaddlePaddle 的版本高于 2.1.0 或是基于最新的 develop 分支的代码编译,并且正常可用。 + +编译自定义 OP 可以参照一下步骤: + +.. code-block:: + + mkdir build + cd build/ + cmake .. -DCMAKE_BUILD_TYPE=Release -DPADDLE_LIB=/path/to/paddle_inference_lib/ -DDEMO=./demo/transformer_e2e.cc -DON_INFER=ON -DWITH_MKL=ON + make -j + cd ../ + +可以使用的编译选项包括: + +* `-DPADDLE_LIB`: 需要指明使用的 PaddlePaddle 预测库的路径 `/path/to/paddle_inference_install_dir/`,需要使用的 PaddlePaddle 的 lib 可以选择自行编译或者直接从官网下载 `paddle_inference_linux_lib `_。需要注意的是,在该路径下,预测库的组织结构满足: + .. code-block:: + + . + ├── CMakeCache.txt + ├── paddle/ + ├── include/ + └── lib/ + ├── third_party/ + ├── cudaerror/ + ├── install/ + └── threadpool/ + └── version.txt + +* `-DDEMO`: 说明预测库使用 demo 的位置。比如指定 -DDEMO=./demo/transformer_e2e.cc 或是 -DDEMO=./demo/gpt.cc。最好使用绝对路径,若使用相对路径,需要是相对于 `PaddleNLP/paddlenlp/ops/fast_transformer/src/` 的相对路径。 +* `-DSM`: 是指的所用 GPU 的 compute capability,建议不使用该选项设置,未设置时将自动检测。如要设置,需根据 [compute capability](https://developer.nvidia.com/zh-cn/cuda-gpus#compute) 进行设置,如 V100 时设置 `-DSM=70` 或 T4 时设置 `-DSM=75`。 +* `-DWITH_GPT`: 是否编译带有 GPT 相关的 lib。若使用 GPT-2 高性能推理,需要加上 `-DWITH_GPT=ON`。默认为 OFF。 +* `-DWITH_UNIFIED`: 是否编译带有 Unified Transformer 或是 UNIMOText 相关的 lib。若使用,需要加上 `-DWITH_UNIFIED=ON`。默认为 ON。 +* `-DWITH_BART`: 是否编译带有 BART 支持的相关 lib。若使用,需要加上 `-DWITH_BART=ON`。默认为 ON。 +* `-DWITH_DECODER`: 是否编译带有 decoder 优化的 lib。默认为 ON。 +* `-DWITH_MKL`: 若当前是使用的 mkl 的 Paddle lib,那么需要打开 MKL 以引入 MKL 相关的依赖。 +* `-DON_INFER`: 是否编译 paddle inference 预测库。 +* **当使用预测库的自定义 op 的时候,请务必开启 `-DON_INFER=ON` 选项,否则,不会得到预测库的可执行文件。** + +执行 Transformer decoding on PaddlePaddle +"""""""""""" + +编译完成后,在 `build/bin/` 路径下将会看到 `transformer_e2e` 的一个可执行文件。通过设置对应的设置参数完成执行的过程。 + +.. code-block:: + + cd bin/ + ./transformer_e2e -batch_size -gpu_id -model_dir -vocab_file -data_file + +举例说明: + +.. code-block:: + + cd bin/ + # 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 + ../third-party/build/fastertransformer/bin/decoding_gemm 8 5 8 64 38512 256 512 0 + ./transformer_e2e -batch_size 8 -gpu_id 0 -model_dir ./infer_model/ -vocab_file DATA_HOME/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/vocab_all.bpe.33708 -data_file DATA_HOME/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en + +其中: + +* `decoding_gemm` 不同参数的意义可以参考 `FasterTransformer 文档 `_。这里提前执行 `decoding_gemm`,可以在当前路径下生成一个 config 文件,里面会包含针对当前 decoding 部分提供的配置下,性能最佳的矩阵乘的算法,并在执行的时候读入这个数据。 +* `DATA_HOME` 则是 `paddlenlp.utils.env.DATA_HOME` 返回的路径。 + +预测所需要的模型文件,可以通过 `fast_transformer/README.md `_ 文档中所记述的方式导出。 + diff --git a/docs/advanced_guide/fastgeneration/index.rst b/docs/advanced_guide/fastgeneration/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..f99b8666359d59d5fff6303748a343163f057ed7 --- /dev/null +++ b/docs/advanced_guide/fastgeneration/index.rst @@ -0,0 +1,8 @@ +============ +文本生成高性能加速 +============ + +.. 
toctree:: + :maxdepth: 1 + + fasttransformer.rst diff --git a/docs/advanced_guide/model_compression/distill_lstm.rst b/docs/advanced_guide/model_compression/distill_lstm.rst new file mode 100644 index 0000000000000000000000000000000000000000..bb50237bdcc21b46cd24e0e380c24ec8bcfc9756 --- /dev/null +++ b/docs/advanced_guide/model_compression/distill_lstm.rst @@ -0,0 +1,134 @@ +由BERT到Bi-LSTM的知识蒸馏 +============ + + +整体原理介绍 +------------ + +本例是将特定任务下BERT模型的知识蒸馏到基于Bi-LSTM的小模型中,主要参考论文 `Distilling Task-Specific Knowledge from BERT into Simple Neural Networks `_ \ +实现。整体原理如下: + +1. 在本例中,较大的模型是BERT被称为教师模型,Bi-LSTM被称为学生模型。 + +2. 小模型学习大模型的知识,需要小模型学习蒸馏相关的损失函数。在本实验中,损失函数是均方误差损失函数,传入函数的两个参数分别是学生模型的输出和教师模型的输出。 + +3. 在论文的模型蒸馏阶段,作者为了能让教师模型表达出更多的“暗知识”(dark knowledge,通常指分类任务中低概率类别与高概率类别的关系)供学生模型学习,对训练数据进行了数据增强。通过数据增强,可以产生更多无标签的训练数据,在训练过程中,学生模型可借助教师模型的“暗知识”,在更大的数据集上进行训练,产生更好的蒸馏效果。本文的作者使用了三种数据增强方式,分别是: + + A. Masking,即以一定的概率将原数据中的word token替换成 ``[MASK]`` ; + + B. POS—guided word replacement,即以一定的概率将原数据中的词用与其有相同POS tag的词替换; + + C. n-gram sampling,即以一定的概率,从每条数据中采样n-gram,其中n的范围可通过人工设置。 + + + +模型蒸馏步骤介绍 +------------ + +本实验分为三个训练过程:在特定任务上对BERT进行微调、在特定任务上对基于Bi-LSTM的小模型进行训练(用于评价蒸馏效果)、将BERT模型的知识蒸馏到基于Bi-LSTM的小模型上。 + +1. 基于bert-base-uncased预训练模型在特定任务上进行微调 +^^^^^^^^^^^^ + +训练BERT的fine-tuning模型,可以去 `PaddleNLP `_ 中\ +的 `glue `_ 目录下对bert-base-uncased做微调。 + +以GLUE的SST-2任务为例,用bert-base-uncased做微调之后,可以得到一个在SST-2任务上的教师模型,可以把在dev上取得最好Accuracy的模型保存下来,用于第三步的蒸馏。 + + +2. 训练基于Bi-LSTM的小模型 +^^^^^^^^^^^^ + +在本示例中,小模型采取的是基于双向LSTM的分类模型,网络层分别是 ``Embedding`` 、``LSTM`` 、 带有 ``tanh`` 激活函数的 ``Linear`` 层,最后经过\ +一个全连接的输出层得到logits。``LSTM`` 网络层定义如下: + +.. code-block:: + + self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers, + 'bidirectional', dropout=dropout_prob) + +基于Bi-LSTM的小模型的 ``forward`` 函数定义如下: + +.. code-block:: + + def forward(self, x, seq_len): + x_embed = self.embedder(x) + lstm_out, (hidden, _) = self.lstm( + x_embed, sequence_length=seq_len) # 双向LSTM + out = paddle.concat((hidden[-2, :, :], hidden[-1, :, :]), axis=1) + out = paddle.tanh(self.fc(out)) + logits = self.output_layer(out) + + return logits + + +3.数据增强介绍 +^^^^^^^^^^^^ + +接下来的蒸馏过程,蒸馏时使用的训练数据集并不只包含数据集中原有的数据,而是按照上文原理介绍中的A、C两种方法进行数据增强后的总数据。 +在多数情况下,``alpha`` 会被设置为0,表示无视硬标签,学生模型只利用数据增强后的无标签数据进行训练。根据教师模型提供的软标签 ``teacher_logits`` \ +,对比学生模型的 ``logits`` ,计算均方误差损失。由于数据增强过程产生了更多的数据,学生模型可以从教师模型中学到更多的暗知识。 + +数据增强的核心代码如下: + +.. code-block:: + + def ngram_sampling(words, words_2=None, p_ng=0.25, ngram_range=(2, 6)): + if np.random.rand() < p_ng: + ngram_len = np.random.randint(ngram_range[0], ngram_range[1] + 1) + ngram_len = min(ngram_len, len(words)) + start = np.random.randint(0, len(words) - ngram_len + 1) + words = words[start:start + ngram_len] + if words_2: + words_2 = words_2[start:start + ngram_len] + return words if not words_2 else (words, words_2) + + def data_augmentation(data, whole_word_mask=whole_word_mask): + # 1. Masking + words = [] + if not whole_word_mask: + tokenized_list = tokenizer.tokenize(data) + words = [ + tokenizer.mask_token if np.random.rand() < p_mask else word + for word in tokenized_list + ] + else: + for word in data.split(): + words += [[tokenizer.mask_token]] if np.random.rand( + ) < p_mask else [tokenizer.tokenize(word)] + # 2. 
N-gram sampling + words = ngram_sampling(words, p_ng=p_ng, ngram_range=ngram_range) + words = flatten(words) if isinstance(words[0], list) else words + new_text = " ".join(words) + return words, new_text + + +4.蒸馏模型 +^^^^^^^^^^^^ + +这一步是将教师模型BERT的知识蒸馏到基于Bi-LSTM的学生模型中,在本例中,主要是让学生模型(Bi-LSTM)去学习教师模型的输出logits。\ +蒸馏时使用的训练数据集是由上一步数据增强后的数据,核心代码如下: + +.. code-block:: + + ce_loss = nn.CrossEntropyLoss() # 交叉熵损失函数 + mse_loss = nn.MSELoss() # 均方误差损失函数 + + for epoch in range(args.max_epoch): + for i, batch in enumerate(train_data_loader): + bert_input_ids, bert_segment_ids, student_input_ids, seq_len, labels = batch + + # Calculate teacher model's forward. + with paddle.no_grad(): + teacher_logits = teacher.model(bert_input_ids, bert_segment_ids) + + # Calculate student model's forward. + logits = model(student_input_ids, seq_len) + + # Calculate the loss, usually args.alpha equals to 0. + loss = args.alpha * ce_loss(logits, labels) + ( + 1 - args.alpha) * mse_loss(logits, teacher_logits) + + loss.backward() + optimizer.step() + diff --git a/docs/advanced_guide/model_compression/index.rst b/docs/advanced_guide/model_compression/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..74f38cf8b5d2efff22b9cb7bd5fdeebf981d35f7 --- /dev/null +++ b/docs/advanced_guide/model_compression/index.rst @@ -0,0 +1,10 @@ +============ +模型压缩 +============ + +.. toctree:: + :maxdepth: 1 + + introduction.rst + distill_lstm.rst + ofa_bert.rst diff --git a/docs/advanced_guide/model_compression/introduction.rst b/docs/advanced_guide/model_compression/introduction.rst new file mode 100644 index 0000000000000000000000000000000000000000..0fb932abc1d9b4d279fc10f63292d6d3ecf5fbba --- /dev/null +++ b/docs/advanced_guide/model_compression/introduction.rst @@ -0,0 +1,46 @@ +============ +模型压缩简介 +============ + + +近些年,基于Transformer的语言模型在机器翻译、阅读理解、文本匹配、自然语言推理等自然语言处理任务上取得了实质\ +进展。然而,海量的参数和计算资源的大量耗费,使BERT及其变体在部署中困难重重。模型压缩的发展,使得这些问题得到\ +了缓解。 + +模型压缩简介 +------------ + +模型压缩在保证一定精度的情况下,能够降低模型的存储,加速模型的推理时间。常见的模型压缩方法主要包括模型裁剪、量化和蒸馏。\ +下面分别对这几种方法进行简要的介绍。 + +模型裁剪 +^^^^^^^^^^^^ +模型裁剪是通过对已经训练好的模型中不重要的网络连接进行裁剪,减少模型的冗余和计算量,从而减少网络存储、大幅度进行加速的模型压缩方法。 + +量化 +^^^^^^^^^^^^ +一般而言,神经网络模型的参数都是用的32bit长度的浮点型数表示。实际上,有时不需要保留那么高的精度,可以通过量化的方法减少\ +模型的存储空间,通常用INT8代替Float32存储。比如,SGD(Stochastic Gradient Descent)所需要的精度仅为6~8bit,\ +因此合理的量化网络也可保证精度的情况下减小模型的存储体积,并且能够大幅度加速,使得神经网络在CPU上的运行成为可能。\ +通常,量化包含多种方法,例如:二值神经网络、三元权重网络以及XNOR网络。 + + +蒸馏 +^^^^^^^^^^^^ +蒸馏本质是student模型(参数量较少的模型)对teacher模型(参数量较多)的拟合,student模型从teacher中学到知识,比自己单独学习效果更好,。比较常见的方法通常是由Bert base蒸馏到\ +Bi-LSTM或者是Transformer层数更少的BERT小模型。例如DistilBERT,它保留了BERT-base 97%的精度,\ +减少了40%的参数,推理速度快了60%。 + + +模型压缩示例 +------------ + +下面将会对基于飞桨实现的常见的模型压缩示例进行介绍,其中《由BERT到Bi-LSTM的知识蒸馏》可以作为蒸馏实验的"Hello World"示例。\ +而《使用DynaBERT中的策略对BERT进行压缩》中使用的DynaBERT则是同时对不同尺寸的子网络进行训练,通过该方法训练后可以在推理阶段直接对模型裁剪。 + + +.. 
toctree:: + :maxdepth: 1 + + distill_lstm.rst + ofa_bert.rst diff --git a/docs/advanced_guide/model_compression/ofa_bert.rst b/docs/advanced_guide/model_compression/ofa_bert.rst new file mode 100644 index 0000000000000000000000000000000000000000..5b0db3b18db4c0e2cf6779dc809514e0eff9fb24 --- /dev/null +++ b/docs/advanced_guide/model_compression/ofa_bert.rst @@ -0,0 +1,177 @@ +使用DynaBERT中的策略对BERT进行压缩 +============ + +本教程使用的是 `DynaBERT-Dynamic BERT with Adaptive Width and Depth `_ 中的训练策略。\ +把原始模型作为超网络中最大的子模型,这里超网络指的是包含所有搜索空间在内的一个网络。\ +原始模型包括多个相同大小的Transformer Block。在每次训练前会选择当前轮次要训练的子模型,\ +每个子模型包含多个相同大小的Sub Transformer Block,每个Sub Transformer Block是选择不同宽度的Transformer Block得到的,\ +一个Transformer Block包含一个Multi-Head Attention和一个Feed-Forward Network,Sub Transformer Block获得方式为: + +1. 一个 ``Multi-Head Attention`` 层中有多个Head,每次选择不同宽度的子模型时,会同时对Head数量进行等比例减少,\ +例如:如果原始模型中有12个Head,本次训练选择的模型是宽度为原始宽度75%的子模型,则本次训练中所有Transformer Block的Head数量为9。 + +2. ``Feed-Forward Network`` 层中 ``Linear`` 的参数大小进行等比例减少,例如:如果原始模型中 ``FFN`` 层的特征维度为3072,\ +本次训练选择的模型是宽度为原始宽度75%的子模型,则本次训练中所有Transformer Block中 ``FFN`` 层的特征维度为2304。 + + +整体原理介绍 +------------ + +1. 首先对预训练模型的参数和head根据其重要性进行重排序,把重要的参数和head排在参数的前侧,保证训练过程中的参数裁剪不会裁剪掉这些重要的参数。\ +参数的重要性计算是先使用dev数据计算一遍每个参数的梯度,然后根据梯度和参数的整体大小来计算当前参数的重要性,head的重要性计算是通过传入一个\ +全1的对head的mask,并计算这个mask的梯度,根据mask的梯度来判断每个 ``Multi-Head Attention`` 层中每个Head的重要性。 + +2. 使用原本的预训练模型作为蒸馏过程中的教师网络。同时定义一个超网络,这个超网络中最大的子网络的结构和教师网络的结构相同其他小的子网络是对最大网络\ +进行不同的宽度选择来得到的,宽度选择具体指对网络中的参数进行裁剪,所有子网络在整个训练过程中都是参数共享的。 + +3. 使用重排序之后的预训练模型参数初始化超网络,并把这个超网络作为学生网络。分别为 ``Embedding`` 层,为每个transformer block层和最后的logits添加蒸馏损失。 + +4. 每个batch数据在训练前首先会选择当前要训练的子网络配置(子网络配置目前仅包括对整个模型的宽度的选择),参数更新时仅会更新当前子网络计算中用到的那部分参数。 + +5. 通过以上的方式来优化整个超网络参数,训练完成后选择满足加速要求和精度要求的子模型。 + +.. image:: ../../../examples/model_compression/ofa/imgs/ofa_bert.jpg + +.. centered:: 整体流程 + + +基于PaddleSlim进行模型压缩 +------------ + +在本例中,也需要训练基于特定任务的BERT模型,方法同上一篇教程《由BERT到Bi-LSTM的知识蒸馏》。下面重点介绍本例模型压缩的过程。 + +1. 定义初始网络 +^^^^^^^^^^^^ +定义原始BERT-base模型并定义一个字典保存原始模型参数。普通模型转换为超网络之后,由于其组网OP的改变导致原始模型加载的参数失效,所以需要定义一个字典保存原始模型的参数并用来初始化超网络。 + +.. code-block:: + + model = BertForSequenceClassification.from_pretrained('bert', num_classes=2) + origin_weights = {} + for name, param in model.named_parameters(): + origin_weights[name] = param + + +2. 构建超网络 +^^^^^^^^^^^^ +定义搜索空间,并根据搜索空间把普通网络转换为超网络。 + +.. code-block:: + + # 定义搜索空间 + sp_config = supernet(expand_ratio=[0.25, 0.5, 0.75, 1.0]) + # 转换模型为超网络 + model = Convert(sp_config).convert(model) + paddleslim.nas.ofa.utils.set_state_dict(model, origin_weights) + + +3. 定义教师网络 +^^^^^^^^^^^^ +构造教师网络。 + +.. code-block:: + + teacher_model = BertForSequenceClassification.from_pretrained('bert', num_classes=2) + + +4. 配置蒸馏相关参数 +^^^^^^^^^^^^ +需要配置的参数包括教师模型实例;需要添加蒸馏的层,在教师网络和学生网络的 ``Embedding`` 层和每一个 ``Tranformer Block`` 层\ +之间添加蒸馏损失,中间层的蒸馏损失使用默认的MSE损失函数;配置 ``lambda_distill`` 参数表示整体蒸馏损失的缩放比例。 + +.. code-block:: + + mapping_layers = ['bert.embeddings'] + for idx in range(model.bert.config['num_hidden_layers']): + mapping_layers.append('bert.encoder.layers.{}'.format(idx)) + + default_distill_config = { + 'lambda_distill': 0.1, + 'teacher_model': teacher_model, + 'mapping_layers': mapping_layers, + } + distill_config = DistillConfig(**default_distill_config) + + +5. 定义Once-For-All模型 +^^^^^^^^^^^^ +普通模型和蒸馏相关配置传给 ``OFA`` 接口,自动添加蒸馏过程并把超网络训练方式转为 ``OFA`` 训练方式。 + +.. code-block:: + + ofa_model = paddleslim.nas.ofa.OFA(model, distill_config=distill_config) + + +6. 计算神经元和head的重要性并根据其重要性重排序参数 +^^^^^^^^^^^^ + +.. 
code-block:: + + head_importance, neuron_importance = utils.compute_neuron_head_importance( + 'sst-2', + ofa_model.model, + dev_data_loader, + num_layers=model.bert.config['num_hidden_layers'], + num_heads=model.bert.config['num_attention_heads']) + reorder_neuron_head(ofa_model.model, head_importance, neuron_importance) + + +7. 传入当前OFA训练所处的阶段 +^^^^^^^^^^^^ + +.. code-block:: + + ofa_model.set_epoch(epoch) + ofa_model.set_task('width') + + +8. 传入网络相关配置,开始训练 +^^^^^^^^^^^^ +本示例使用DynaBERT的策略进行超网络训练。 + +.. code-block:: + + width_mult_list = [1.0, 0.75, 0.5, 0.25] + lambda_logit = 0.1 + for width_mult in width_mult_list: + net_config = paddleslim.nas.ofa.utils.dynabert_config(ofa_model, width_mult) + ofa_model.set_net_config(net_config) + logits, teacher_logits = ofa_model(input_ids, segment_ids, attention_mask=[None, None]) + rep_loss = ofa_model.calc_distill_loss() + logit_loss = soft_cross_entropy(logits, teacher_logits.detach()) + loss = rep_loss + lambda_logit * logit_loss + loss.backward() + optimizer.step() + lr_scheduler.step() + ofa_model.model.clear_gradients() + + + +**NOTE** + +由于在计算head的重要性时会利用一个mask来收集梯度,所以需要通过monkey patch的方式重新实现一下 ``BERTModel`` 类的 ``forward`` 函数。示例如下: + +.. code-block:: + + from paddlenlp.transformers import BertModel + def bert_forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=[None, None]): + wtype = self.pooler.dense.fn.weight.dtype if hasattr( + self.pooler.dense, 'fn') else self.pooler.dense.weight.dtype + if attention_mask[0] is None: + attention_mask[0] = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids) + encoder_outputs = self.encoder(embedding_output, attention_mask) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + + BertModel.forward = bert_forward diff --git a/docs/advanced_guide/prompt.md b/docs/advanced_guide/prompt.md new file mode 100644 index 0000000000000000000000000000000000000000..c8e5d5a98a82109b51456a4b8271b06b9c991612 --- /dev/null +++ b/docs/advanced_guide/prompt.md @@ -0,0 +1,600 @@ +# 提示学习:Prompt API + +随着预训练语言模型规模的增长,“预训练-微调”范式在下游自然语言处理任务上的表现越来越好,但与之相应地对训练数据量和计算存储资源的要求也越来越高。为了充分利用预训练语言模型学习到的知识,同时降低对数据和资源的依赖,**提示学习**(Prompt Learning)作为一种可能的新范式受到了越来越多的关注,在 FewCLUE、SuperGLUE 等榜单的小样本任务上取得了远优于传统微调范式的结果。 + +**提示学习**的核心思想是将下游任务转化为预训练阶段的掩码预测(MLM)任务。实现思路包括通过模板(Template)定义的提示语句,将原有任务转化为预测掩码位置的词,以及通过标签词(Verbalizer)的定义,建立预测词与真实标签之间的映射关系。 + +以情感分类任务为例,“预训练-微调”范式和“预训练-提示”范式(以 [PET](https://arxiv.org/abs/2001.07676) 为例)之间的区别如下图所示 + +
+*(图:“预训练-微调”范式与“预训练-提示”范式(以 PET 为例)的对比示意图)*
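+
+下面用一段简单的代码示意同一条情感分类样本在两种范式下送入模型的文本形式(示例句子与提示语句均为假设,仅作说明;提示语句与标签词的具体定义方式可参考后文“如何定义模板”和“如何定义标签词映射”两节):
+
+```python
+text = "这家餐厅的菜味道让人满意"
+
+# “预训练-微调”范式:取 [CLS] 位置的表示,训练一个随机初始化的分类器,输出 0/1 标签
+finetune_input = "[CLS] " + text + " [SEP]"
+
+# “预训练-提示”范式:拼接提示语句,预测 [MASK] 处的词(如“很”/“不”),再映射回原标签
+prompt_input = text + "这句话表示我[MASK]满意。"
+
+print(finetune_input)
+print(prompt_input)
+```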
+ +【微调学习】使用 `[CLS]` 来做分类,需要训练随机初始化的分类器,需要充分的训练数据来拟合。 + +【提示学习】通过提示语句和标签词映射的定义,转化为 MLM 任务,无需训练新的参数,适用于小样本场景。 + + +Prompt API 提供了这类算法实现的基本模块,支持[PET](https://arxiv.org/abs/2001.07676)、[P-Tuning](https://arxiv.org/abs/2103.10385)、[WARP](https://aclanthology.org/2021.acl-long.381/)、[RGL](https://aclanthology.org/2022.findings-naacl.81/)等经典算法的快速实现。 + +**目录** + +* [如何定义模板](#如何定义模板) + * [离散型模板](#离散型模板) + * [连续型模板](#连续型模板) + * [前缀连续型模板](#前缀连续型模板) + * [快速定义模板](#快速定义模板) +* [如何定义标签词映射](#如何定义标签词映射) + * [离散型标签词映射](#离散型标签词映射) + * [连续型标签词映射](#连续型标签词映射) +* [快速开始训练](#快速开始训练) + * [数据准备](#数据准备) + * [预训练参数准备](#预训练参数准备) + * [定义提示学习模型](#定义提示学习模型) + * [使用PromptTrainer训练](#使用PromptTrainer训练) +* [实践教程](#实践教程) + * [文本分类示例](#文本分类示例) + * 其他任务示例(待更新) +* [Reference](#Reference) + +## 如何定义模板 + +**模板**(Template)的功能是在原有输入文本上增加提示语句,从而将原任务转化为 MLM 任务,可以分为离散型和连续型两种。Prompt API 中提供了统一的数据结构来构造不同类型的模板,输入相应格式的**字符串**,通过解析得到对应的输入模板。模板由不同字段构成,可任意组合。每个字段中的关键字定义了数据文本或者提示文本,即 `input_ids`,属性可定义该字段是否可截断,以及对应的 `position_ids`,`token_type_ids` 等。 + +### 离散型模板 + +离散型模板 `ManualTemplate` 是直接将提示语句与原始输入文本拼接起来,二者的词向量矩阵共享,均为预训练模型学到的词向量矩阵。可用于实现 PET、RGL 等算法。 + +**模板关键字及属性** + +- ``text`` :数据集中原始输入文本对应的关键字,例如,`text_a`、`text_b` 和 `content`。 +- ``hard`` :自定义的提示语句文本。 +- ``mask`` :待预测词的占位符。 + - ``length`` :定义 ``mask`` 的数量。 +- ``sep`` :句间的标志符。不同句子的 `token_type_ids` 需使用 `token_type` 属性定义,默认相同。 +- ``options`` :数据集字典或者文件中的候选标签序列。 + - ``add_omask`` :在每个标签前新增 `[O-MASK]` 字符,用于计算候选标签的预测值。支持实现 [UniMC](https://arxiv.org/pdf/2210.08590.pdf) 算法。 + - ``add_prompt`` :给每个标签拼接固定的提示文本,标签位置由 `[OPT]` 标记。支持实现 [EFL](https://arxiv.org/pdf/2104.14690.pdf) 算法。 + +**模版通用属性** + +- `position`: 定义当前字段的起始 `position id`。 +- `token_type`: 定义当前字段及后续字段的 `token type id`。 +- `truncate`: 定义当提示和文本总长度超过最大长度时,当前字段是否可截断。可选 `True` 和 `False`。 + +**模板定义** + +``` +{'hard': '“'}{'text': 'text_a'}{'hard': '”和“'}{'text': 'text_b'}{'hard': '”之间的逻辑关系是'}{'mask'} +``` + +或者使用简化方式定义,省略关键字 ``hard`` 后与上述模板等价。 + +``` +“{'text': 'text_a'}”和“{'text': 'text_b'}”之间的逻辑关系是{'mask'} +``` + +``` +{'options': './data/label.txt'}{'sep'}下边两句话间的逻辑关系是什么?{'text': 'text_a'}{'sep': None, 'token_type': 1}{'text': 'text_b'} +``` +其中 `label.txt` 为候选标签的本地文件路径,每行一个候选标签,例如 + +``` +中立 +蕴含 +矛盾 +``` + +**样本示例** + +例如,对于自然语言推理任务,给定样本 + +```python +sample = { + "text_a": "心里有些生畏,又不知畏惧什么", "text_b": "心里特别开心", "labels": "矛盾" +} +``` + +按照模板修改拼接后,最终输入模型的文本数据为 + +``` +“心里有些生畏,又不知畏惧什么”和“心里特别开心”之间的逻辑关系是[MASK] +``` + + +**调用 API** + +```python +from paddlenlp.prompt import ManualTemplate +from paddlenlp.transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +template = ManualTemplate(prompt="“{'text': 'text_a'}”和“{'text': 'text_b'}”之间的逻辑关系是{'mask'}", + tokenizer=tokenizer, + max_length=512) +input_dict = template(sample) +``` + +其中初始化参数定义如下 + +- ``prompt`` :定义提示语句以及与输入文本组合方式的字符串。 +- ``tokenizer`` :预训练模型的 tokenizer,用于文本编码。 +- ``max_length`` :定义输入模型文本的最大长度,包括提示部分。 + +**使用技巧** + +不同模板定义对结果的影响很明显。一般来说,提示语句与原始输入文本拼接后,语句越通顺自然,模型效果越好。在实践中,对于不同的任务需要分析文本特点,尝试不同的模板以取得好的效果。 + + +### 连续型模板 + +离散型模板的使用难点在于设计一个好的提示语句需要很多经验和语言专业知识。为了解决这一问题,连续型模板 `SoftTemplate` 尝试使用一组连续性 prompt 向量作为模板,这样模型训练时就无需人工给定提示语句。当然,`SoftTemplate` 也支持用人工构造的提示来初始化 prompt 向量。与离散型模板的区别在于连续型提示向量与输入文本的词向量矩阵不共享,二者在训练过程中分别进行参数更新。可用于实现 P-Tuning 等算法。 + +除此之外,连续型模板还支持混合模板定义,即在原始输入上同时拼接离散型提示和连续型提示向量。 + +**模板关键字** + +- ``text`` :数据集中原始输入文本对应的关键字,例如,`text_a`和`text_b`。 +- ``hard`` :自定义的文本提示语句。 +- ``mask`` :待预测词的占位符。 +- ``sep`` :句间的标志符。不同句子的 `token_type_ids` 需使用 `token_type` 属性定义,默认相同。 +- ``soft`` 表示连续型提示。若值为 ``None`` 
,则随机初始化提示向量;若值为文本,则使用文本对应的预训练字向量初始化提示向量。 + - ``length`` :定义 ``soft token`` 的数量。若定义文本长度小于该值,超过部分随机初始化。 + - ``encoder`` :定义 `soft token` 的编码器类型,可选 `lstm`,`mlp`。默认为 `None`, 不使用编码器。 + - ``hidden_size`` :定义编码器的隐藏层维度。默认与预训练词向量维度相同。 +- ``options`` :数据集字典或者文件中的候选标签序列。 + - ``add_omask`` :在每个标签前新增 `[O-MASK]` 字符,用于计算候选标签的预测值。支持实现 [UniMC](https://arxiv.org/pdf/2210.08590.pdf) 算法。 + - ``add_prompt`` :给每个标签拼接固定的提示文本,标签位置由 `[OPT]` 标记。支持实现 [EFL](https://arxiv.org/pdf/2104.14690.pdf) 算法。 + +**模版通用属性** + +- `position`: 定义当前字段的起始 `position id`。 +- `token_type`: 定义当前字段及后续字段的 `token type id`。 +- `truncate`: 定义当提示和文本总长度超过最大长度时,当前字段是否可截断。可选 `True` 和 `False`。 + +**模板定义** + +- 定义长度为 1 的连续型提示,随机初始化: + +```python +"{'soft'}{'text': 'text_a'}{'sep': None, 'token_type': 1}{'text': 'text_b'}" +``` + +- 定义长度为 10 的连续型提示,随机初始化,编码器为 `mlp`: + +```python +"{'text': 'text_a'}{'sep'}{'text': 'text_b'}{'soft': None, 'length':10, 'encoder': 'mlp'}{'mask'}" +``` + +- 定义长度为 15 的连续型提示,使用 `请判断` 初始化前三个 soft token,其余随机初始化,编码器为隐藏层维度为 100 的双层 LSTM: + +```python +"{'text': 'text_a'}{'sep'}{'text': 'text_b'}{'soft': '请判断:', 'length': 15, 'encoder': 'lstm', 'hidden_size': 100}{'mask'}" +``` + +- 定义长度为 15 的连续型提示,使用 `"请判断这两个句子间的逻辑关系:"` 的预训练词向量逐一进行初始化: + +```python +"{'text': 'text_a'}{'sep'}{'text': 'text_b'}{'soft': '请判断这两个句子间的逻辑关系:'}{'mask'}" +``` + +- 定义混合模板,这里`soft`关键字对应的提示和`hard`对应的提示对应两套不同的向量: + +```python +"{'soft': '自然语言推理任务:'}{'text': 'text_a'}{'sep'}{'text': 'text_b'}这两个句子间的逻辑关系是{'mask'}" +``` + + +**调用 API** + +```python +from paddlenlp.prompt import SoftTemplate +from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM + +model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh") +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +template = SoftTemplate(prompt="{'text': 'text_a'}{'sep'}{'text': 'text_b'}{'soft': '请判断这两个句子间的逻辑关系:'}{'mask'}", + tokenizer=tokenizer, + max_length=512, + word_embeddings=model.get_input_embeddings()) +``` + +其中初始化参数定义如下 + +- ``prompt`` :定义连续型模板的提示语句、初始化以及与输入文本组合方式的字符串。 +- ``tokenizer`` :预训练模型的 tokenizer,用于文本编码。 +- ``max_seq_length`` :定义输入模型文本的最大长度,包括提示部分。 +- ``word_embeddings`` :预训练语言模型的词向量,用于连续型提示向量初始化。 +- ``soft_embeddings`` :连续型提示向量矩阵,可用于不同模板间的连续型参数共享。设置后将覆盖默认连续型向量矩阵。 + +**使用技巧** + +- 对于分类任务,推荐的连续型提示长度一般为10-20。 +- 对于随机初始化的连续性 prompt 向量,通常用比预训练模型微调更大的学习率来更新参数。 +- 与离散型模板相似,连续型模板对初始化参数也比较敏感。自定义提示语句作为连续性 prompt 向量的初始化参数通常比随机初始化效果好。 +- prompt_encoder 为已有论文中的策略,用于建模不同连续型提示向量之间的序列关系。 + + +### 前缀连续型模板 + +`PrefixTemplate` 同样使用了连续型向量作为提示,与 `SoftTemplate` 的不同,该模版的提示向量不仅仅作用于输入层,每层都会有相应的提示向量。可用于实现 P-Tuning 等算法。 + +**模板关键字** + +- ``text`` :数据集中原始输入文本对应的关键字,例如,`text_a`和`text_b`。 +- ``hard`` :自定义的文本提示语句。 +- ``mask`` :待预测词的占位符。 +- ``sep`` :句间的标志符。不同句子的 `token_type_ids` 需使用 `token_type` 属性定义,默认相同。 +- ``prefix`` 表示连续型提示,该字段**必须**位于模板首位。若值为 ``None`` ,则随机初始化提示向量;若值为文本,则使用文本对应的预训练字向量初始化提示向量。 + - ``length`` :定义 ``soft token`` 的数量。若定义文本长度小于该值,超过部分随机初始化。 + - ``encoder`` :定义 `soft token` 的编码器类型,可选 `lstm`,`mlp`。默认为 `None`, 不使用编码器。 + - ``hidden_size`` :定义编码器的隐藏层维度。默认与预训练词向量维度相同。 +- ``options`` :数据集字典或者文件中的候选标签序列。 + - ``add_omask`` :在每个标签前新增 `[O-MASK]` 字符,用于计算候选标签的预测值。支持实现 [UniMC](https://arxiv.org/pdf/2210.08590.pdf) 算法。 + - ``add_prompt`` :给每个标签拼接固定的提示文本,标签位置由 `[OPT]` 标记。支持实现 [EFL](https://arxiv.org/pdf/2104.14690.pdf) 算法。 + +**模版通用属性** + +- `position`: 定义当前字段的起始 `position id`。 +- `token_type`: 定义当前字段及后续字段的 `token type id`。 +- `truncate`: 定义当提示和文本总长度超过最大长度时,当前字段是否可截断。可选 `True` 和 `False`。 + +**模板定义** + +- 定义长度为 15 的连续型提示,随机初始化: + +```python +"{'prefix': '新闻类别', 'length': 10, 'encoder': 
'lstm'}{'text': 'text_a'}" +``` + +- 定义混合模板,这里`prefix`关键字对应的提示和`hard`对应的提示对应两套不同的向量: + +```python +"{'prefix': '自然语言推理任务:', 'encoder': 'mlp'}{'text': 'text_a'}{'sep'}{'text': 'text_b'}这两个句子间的逻辑关系是{'mask'}" +``` + + +**调用 API** + +```python +from paddlenlp.prompt import PrefixTemplate +from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM + +model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh") +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +template = PrefixTemplate(prompt="{'prefix': '任务描述'}{'text': 'text_a'}{'mask'}", + tokenizer=tokenizer, + max_length=512, + model=model, + prefix_dropout=0.1) +``` + +其中初始化参数定义如下 + +- ``prompt`` :定义连续型模板的提示语句、初始化以及与输入文本组合方式的字符串。 +- ``tokenizer`` :预训练模型的 tokenizer,用于文本编码。 +- ``max_length`` :定义输入模型文本的最大长度,包括提示部分。 +- ``model`` :预训练语言模型,用于连续型提示向量初始化,以及根据模型结构生成每层对应的提示向量。 +- ``prefix_dropout`` :连续型提示向量的丢弃概率,用于正则化。 + + +### 快速定义模板 + +PaddleNLP 提供了 ``AutoTemplate`` API 快速定义简化离散型模板,也可根据完整模板字符串自动切换 ManualTemplate、SoftTemplate 和 PrefixTemplate。 + +**模板定义** + +- 快速定义离散型的文本提示。例如, + +```python +"这篇文章表达了怎样的情感?" +``` + +等价于 + +```python +"{'text': 'text_a'}{'hard': '这篇文章表达了怎样的情感?'}{'mask'}" +``` + +- 当输入为完整模板字符串时,解析得到的模板与[离散型模板](#离散型模板)和[连续型模板](#连续型模板)中描述的一致。 + +**调用 API** + +```python +from paddlenlp.prompt import AutoTemplate +from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM + +model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh") +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +# 离散型模板,返回值为 ManualTemplate 实例 +template = AutoTemplate.create_from(prompt="这个句子表达了怎样的情感?", + tokenizer=tokenizer, + max_length=512) + +template = AutoTemplate.create_from(prompt="这个句子表达了怎样的情感?{'text': 'text_a'}{'mask'}", + tokenizer=tokenizer, + max_length=512) + +# 连续型模板,返回值为 SoftTemplate 实例 +template = AutoTemplate.create_from(prompt="{'text': 'text_a'}{'sep'}{'text': 'text_b'}{'soft': '请判断这两个句子间的逻辑关系:'}{'mask'}", + tokenizer=tokenizer, + max_length=512, + model=model) + +# 前缀连续型模板,返回值为 PrefixTemplate 实例 +template = AutoTemplate.create_from(prompt="{'prefix': None, 'encoder': 'mlp', 'hidden_size': 50}{'text': 'text_a'}", + tokenizer=tokenizer, + max_length=512, + model=model) +``` + +其中初始化参数定义如下 + +- ``prompt`` :定义离散型/连续型提示、初始化以及和输入文本的组合方式。 +- ``tokenizer`` :预训练模型的 tokenizer,用于文本编码。 +- ``max_length`` :定义输入模型文本的最大长度,包括提示部分。 +- ``model`` :预训练语言模型,为了取预训练词向量用于连续型提示向量初始化。 + +## 如何定义标签词映射 + +**标签词映射**(Verbalizer)也是提示学习中可选的重要模块,用于建立预测词和标签之间的映射,将“预训练-微调”模式中预测标签的任务转换为预测模板中掩码位置的词语,从而将下游任务统一为预训练任务的形式。目前框架支持了离散型标签词映射和连续型标签词映射 [Word-level Adversarial ReProgramming (WARP)](https://aclanthology.org/2021.acl-long.381/) 方法。 + + +例如,在情感二分类任务中,微调方法和提示学习的标签体系如下 + +- **微调方式** : 数据集的标签为 ``负向`` 和 ``正向``,分别映射为 ``0`` 和 ``1`` ; + +- **提示学习** : 通过下边的标签词映射建立原始标签与预测词之间的映射。 + +``` python +{'负向': '不', '正向': '很'} +``` + +具体来说,对于模板 ``{'text':'text_a'}这句话表示我{'mask'}满意。`` ,我们使用映射 ``{'负向': '不', '正向': '很'}`` 将标签 ``负向`` 映射为 ``不`` ,将标签 ``正向`` 映射为 ``很`` 。也就是说,我们期望对于正向情感的文本,预测结果为 ``...这句话表示我很满意。`` ,对于负向情感的文本,预测结果为 ``...这句话表示我不满意。`` + + +### 离散型标签词映射 + +``ManualVerbalizer`` 支持构造 ``{'mask'}`` 对应的标签词映射,同一标签可对应多个不同长度的词,直接作用于 ``AutoMaskedLM`` 模型结构。当标签对应的预测词长度大于 ``1`` 时,默认取均值;当标签对应多个 `{'mask'}` 时,默认与单个 `{mask}` 效果等价。 + +**调用 API** + +```python +from paddlenlp.prompt import ManualVerbalizer +from paddlenlp.transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +verbalizer = ManualVerbalizer(tokenizer=tokenizer, + label_words={'负向': '不', '正向': '很'}) +``` + +其中初始化参数定义如下 + +- ``label_words`` : 
原标签到预测词之间的映射字典。 +- ``tokenizer`` : 预训练模型的 tokenizer,用于预测词的编码。 + +``MaskedLMVerbalizer`` 同样支持构造 ``{'mask'}`` 对应的标签词映射,映射词与模板中的 `{'mask'}` 逐字对应,因此,映射词长度应与 `{'mask'}` 数量保持一致。当定义的标签词映射中同一标签对应多个词时,仅有第一个映射词生效。在自定义的 `compute_metric` 函数中需先调用 `verbalizer.aggregate_multiple_mask` 将多 `{'mask'}` 合并后再计算评估函数,默认使用乘积的方式。 + +**调用 API** +```python +from paddlenlp.prompt import MaskedLMVerbalizer +from paddlenlp.transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +verbalizer = MaskedLMVerbalizer(tokenizer=tokenizer, + label_words={'负向': '不', '正向': '很'}) +``` + +其中初始化参数定义如下 + +- ``label_words`` : 原标签到预测词之间的映射字典。 +- ``tokenizer`` : 预训练模型的 tokenizer,用于预测词的编码。 + +### 连续型标签词映射 + +标签词映射分类器 ``SoftVerbalizer`` 修改了原 ``AutoMaskedLM`` 的模型结构,将预训练模型最后一层“隐藏层-词表”替换为“隐藏层-标签”的映射。该层网络的初始化参数由标签词映射中的预测词词向量来决定,如果预测词长度大于 ``1`` ,则使用词向量均值进行初始化。当前支持的预训练模型包括 ``ErnieForMaskedLM`` 、 ``BertForMaskedLM`` 、 ``AlbertForMaskedLM`` 和 ``RobertaForMaskedLM`` 。可用于实现 WARP 算法。 + + +**调用 API** + +```python +from paddlenlp.prompt import SoftVerbalizer +from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM + +model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh") +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +verbalizer = SoftVerbalizer(label_words={'负向': '生气', '正向': '高兴'}, + tokenizer=tokenizer, + model=model) +``` + +- ``label_words`` : 原标签到预测词之间的映射字典。 +- ``tokenizer`` : 预训练模型的 tokenizer,用于预测词的编码。 +- ``model`` :预训练语言模型,用于取预训练词向量进行“隐藏层-标签”网络的修改和初始化。 + +## 快速开始训练 + +本节介绍了如何使用 ``PromptTrainer`` 快速搭建提示训练流程。 + +### 数据准备 + +数据集封装为 ``MapDataset`` 类型。每条数据格式为字典结构,字典中关键字与模板中 `text` 定义的值相对应,统一使用 `labels` 关键字表示样本标签。 + +例如,文本语义相似度 BUSTM 数据集中的数据样本 + +```python +from paddlenlp.datasets import MapDataset + +data_ds = MapDataset([ + {'id': 3, 'sentence1': '你晚上吃了什么', 'sentence2': '你晚上吃啥了', 'label': 1}, + {'id': 4, 'sentence1': '我想打开滴滴叫的士', 'sentence2': '你叫小欧吗', 'label': 0}, + {'id': 5, 'sentence1': '女孩子到底是不是你', 'sentence2': '你不是女孩子吗', 'label': 1} +]) + +def convert_label_keyword(input_dict): + input_dict["labels"] = input_dict.pop("label") + return input_dict + +data_ds = data_ds.map(convert_label_keyword) +``` + +### 预训练参数准备 + +如果使用标签词映射,用 ``AutoModelForMaskedLM`` 和 ``AutoTokenizer`` 加载预训练模型参数。如果不使用标签词映射,可将 ``AutoModelForMaskedLM`` 替换为任务对应的模型。 + +```python +from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM + +model = AutoModelForMaskedLM.from_pretrained("ernie-3.0-base-zh") +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh") +``` + + +### 定义提示学习模型 + +对于文本分类任务,我们将模板预处理和标签词映射封装为提示学习模型 ``PromptModelForSequenceClassification`` 。 + + +```python +from paddlenlp.prompt import AutoTemplate +from paddlenlp.prompt import ManualVerbalizer +from paddlenlp.prompt import PromptModelForSequenceClassification + +# 定义模板 +template = AutoTemplate.create_from(prompt="{'text': 'text_a'}和{'text': 'text_b'}说的是{'mask'}同的事情。", + tokenizer=tokenizer, + max_length=512) + +# 定义标签词映射 +verbalizer = ManualVerbalizer(label_words={0: '不', 1: '相'}, + tokenizer=tokenizer) + +# 定义文本分类提示模型 +prompt_model = PromptModelForSequenceClassification(model, + template, + verbalizer, + freeze_plm=False, + freeze_dropout=False) +``` + +其中提示模型初始化参数如下 + +- ``model`` : 预训练模型实例,支持 ``AutoModelForMaskedLM`` 和 ``AutoModelForSequenceClassification`` 。 +- ``template`` : 模板实例。 +- ``verbalizer`` : 标签词映射实例。当设为 ``None`` 时,不使用标签词映射,模型输出及损失值计算由 ``model`` 类型定义。 +- ``freeze_plm`` : 在训练时固定预训练模型参数,默认为 `False`。对于轻量级预训练模型,推荐使用默认值。 +- ``freeze_dropout`` : 在训练时固定预训练模型参数并关闭 ``dropout`` 。 当 
``freeze_dropout=True`` ,``freeze_plm`` 也为 ``True`` 。 + + +### 使用PromptTrainer训练 + +``PromptTrainer`` 继承自 ``Trainer`` , 封装了数据处理,模型训练、测试,训练策略等,便于训练流程的快速搭建。 + +**配置训练参数** + +``PromptTuningArguments`` 继承自 ``TrainingArguments`` ,包含了提示学习的主要训练参数。其中 ``TrainingArguments`` 参数见 `Trainer API 文档 `_ ,其余参数详见 [Prompt Trainer参数列表](#PromptTrainer参数列表) 。推荐使用 **命令行** 的形式进行参数配置,即 + +```shell +python xxx.py --output_dir xxx --learning_rate xxx +``` + +除了训练参数,还需要自定义数据和模型相关的参数。最后用 ``PdArgumentParser`` 输出参数。 + +```python +from dataclasses import dataclass, field +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.prompt import PromptTuningArguments + +@dataclass +class DataArguments: + data_path : str = field(default="./data", metadata={"help": "The path to dataset."}) + +parser = PdArgumentParser((DataArguments, PromptTuningArguments)) +data_args, training_args = parser.parse_args_into_dataclasses( + args=["--output_dir", "./", "--do_train", "True"], look_for_args_file=False) +``` + +**初始化和训练** + +除了上述准备,还需要定义损失函数和评估函数。 + +```python + +import paddle +from paddle.metric import Accuracy +from paddlenlp.prompt import PromptTrainer + +# 损失函数 +criterion = paddle.nn.CrossEntropyLoss() + +# 评估函数 +def compute_metrics(eval_preds): + metric = Accuracy() + correct = metric.compute(paddle.to_tensor(eval_preds.predictions), + paddle.to_tensor(eval_preds.label_ids)) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + +# 初始化 +trainer = PromptTrainer(model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=data_ds, + eval_dataset=None, + callbacks=None, + compute_metrics=compute_metrics) + +# 训练模型 +if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=None) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() +``` + +## 实践教程 + +### 文本分类示例 + + +- [多分类文本分类示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/few-shot) + +- [多标签文本分类示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_label/few-shot) + +- [多层次文本分类示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/hierarchical/few-shot) + + +## Reference + +- Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference. [[PDF]](https://arxiv.org/abs/2001.07676) +- GPT Understands, Too. [[PDF]](https://arxiv.org/abs/2103.10385) +- WARP: Word-level Adversarial ReProgramming. [[PDF]](https://aclanthology.org/2021.acl-long.381/) +- RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning. [[PDF]](https://aclanthology.org/2022.findings-naacl.81/) +- R-Drop: Regularized Dropout for Neural Networks. [[PDF]](https://arxiv.org/abs/2106.14448) +- Openprompt: An open-source framework for prompt-learning. 
[[PDF]](https://arxiv.org/abs/2111.01998) + + +### 附录 + + +#### PromptTrainer参数列表 + + +| 参数 | 类型 | 默认值 | 含义 | +| ---------------- | ------ | ------- | ------------------------------------------------------- | +| max_seq_length | int | 512 | 模型输入的最大长度,包括模板部分 | +| freeze_plm | bool | False | 是否在训练时固定预训练模型的参数 | +| freeze_dropout | bool | False | 是否在训练时固定预训练模型的参数,同时关闭 dropout | +| use_rdrop | bool | False | 是否使用 RDrop 策略,详见 [RDrop 论文](https://arxiv.org/abs/2106.14448) | +| alpha_rdrop | float | 5.0 | RDrop Loss 的权重 | +| use_rgl | bool | False | 是否使用 RGL 策略,详见 [RGL 论文](https://aclanthology.org/2022.findings-naacl.81/) | +| alpha_rgl | float | 0.5 | RGL Loss 的权重 | +| ppt_learning_rate| float | 1e-4 | 连续型提示以及 SoftVerbalizer “隐藏层-标签”层参数的学习率 | +| ppt_weight_decay | float | 0.0 | 连续型提示以及 SoftVerbalizer “隐藏层-标签”层参数的衰减参数 | +| ppt_adam_beta1 | float | 0.9 | 连续型提示以及 SoftVerbalizer “隐藏层-标签”层参数的 beta1 | +| ppt_adam_beta2 | float | 0.999 | 连续型提示以及 SoftVerbalizer “隐藏层-标签”层参数的 beta2 | +| ppt_adam_epsilon | float | 1e-8 | 连续型提示以及 SoftVerbalizer “隐藏层-标签”层参数的 epsilon| diff --git a/docs/clear_api.py b/docs/clear_api.py new file mode 100644 index 0000000000000000000000000000000000000000..136beaae97d311ac58ce1a83301545c921077f72 --- /dev/null +++ b/docs/clear_api.py @@ -0,0 +1,138 @@ +import os +import re + + +def modify_doc_title_dir(abspath_rstfiles_dir): + """ + rst文件中:有‘========’和‘----------’行的表示其行上一行的文字是标题, + ‘=’和‘-’要大于等于标题的长度。 + 使用sphinx-apidoc -o ./source/rst_files /home/myubuntu/pro/mypro命令将 + 生成rst文件放在./source/rst_files目录下, 执行sphinx-quickstart命令生成的 + index.rst不用放到这个目录中。 或在source目录下新建 + rst_files目录然后将rst文件剪切到这个目录下,修改后再剪切出来 + 生成rst文件后将rst_files/modules.rst文件中的标题去掉,并修改maxdepth字段。 + 删除和修改使用sphinx-apidoc -o 命令的生成的rst文件中的标题 + :param abspath_rstfiles_dir: rst文件所在的文件夹的绝对路径 + :return: + """ + rst_files = os.listdir(abspath_rstfiles_dir) + # 要删除的节点(标题目录的节点) + del_nodes = ["Submodules", "Module contents", "Subpackages"] + # 要删除的标题中的字符串 + del_str = [" module", " package"] + # datasets需要的部分 + dataset_list = ["datasets", "dataset"] + # 需要call方法 + add_call_files = [ + "data.collate", + "data.iterator", + "data.sampler", + "data.tokenizer", + "data.vocab", + "tokenizer\_utils", + ] + # 删除inheritance + del_inheritance = [ + "crf", + "tcn", + "distributed", + "dataset", + "paraller", + "decoder", + "rdrop", + "decoding", + "fast\_transformer", + "Adamoptimizer", + "attention\_utils", + "model\_utils", + "batch\_sampler", + "model", + ] + # 文档中空白的part,不显示 + del_rst = ["iterator", "constant"] + for rst_file in rst_files: + f = open(os.path.join(abspath_rstfiles_dir, rst_file), "r") + file_lines = f.readlines() + f.close() + write_con = [] + flag = 0 + first_line = file_lines[0] + # 去除不需要的datasets + if "datasets" in first_line: + name = first_line.split()[0] + length = len(name.split(".")) + # paddlenlp.datasets 需要留下 + if length > 2: + if "datasets.dataset" not in first_line: + path = os.path.join(abspath_rstfiles_dir, rst_file) + print(path) + os.remove(path) + print(path) + continue + # 去除文档中空白页面,目前是data.iterator, embeddings.constant部分 + del_rst_flag = 0 + for pattern in del_rst: + if pattern in first_line: + path = os.path.join(abspath_rstfiles_dir, rst_file) + os.remove(path) + del_rst_flag = 1 + break + if del_rst_flag == 1: + continue + # 是否加入call + add_call_files_flag = 0 + for i in add_call_files: + if i in first_line: + add_call_files_flag = 1 + # 是否删除inheritance + del_inheritance_flag = 0 + for j in del_inheritance: + if j in first_line: + del_inheritance_flag = 1 + if "modeling" in first_line: + 
del_inheritance_flag = 0 + for file_line in file_lines: + if file_line.strip() in del_nodes: + flag = 1 + continue + if flag: + flag = 0 + continue + if re.search(del_str[0], file_line): + length = len(file_line.split(".")) + if length > 2: + modify_line = file_line.split(".")[-1].replace(del_str[0], "") + else: + modify_line = file_line.replace(del_str[0], "") + write_con.append(modify_line) + continue + if re.search(del_str[1], file_line): + length = len(file_line.split(".")) + if length > 2: + modify_line = file_line.split(".")[-1].replace(del_str[1], "") + else: + modify_line = file_line.replace(del_str[1], "") + write_con.append(modify_line) + continue + if "undoc-members" in file_line: + if "no-undoc-members" not in file_line: + file_line = file_line.replace("undoc-members", "no-undoc-members") + # 去除datasets中多余内容 + if "paddlenlp.datasets" in file_line: + last_name = file_line.split(".")[-1] + if last_name.strip() not in dataset_list: + continue + if "show-inheritance" in file_line: + if del_inheritance_flag == 0: + write_con.append(file_line) + else: + write_con.append(file_line) + if add_call_files_flag == 1: + write_con.append(" :special-members: __call__\n") + f = open(os.path.join(abspath_rstfiles_dir, rst_file), "w") + f.writelines(write_con) + f.close() + + +if __name__ == "__main__": + modify_doc_title_dir("./source") diff --git a/docs/community/contribute_datasets/how_to_write_a_DatasetBuilder.rst b/docs/community/contribute_datasets/how_to_write_a_DatasetBuilder.rst new file mode 100644 index 0000000000000000000000000000000000000000..a7c20303259bc859282ec79d968d36975adc7b65 --- /dev/null +++ b/docs/community/contribute_datasets/how_to_write_a_DatasetBuilder.rst @@ -0,0 +1,123 @@ +============== +创建 :class:`DatasetBuilder` +============== + +数据集的贡献通过定义一个 :class:`DatasetBuilder` 的子类来实现。一个合格的 :class:`DatasetBuilder` 需要遵循一些协议和规范。 + +下面我们以 :obj:`LCQMC` 为例了解一下 :class:`DatasetBuilder` 通常需要包含哪些方法和参数。 + +成员变量 +--------------- + +.. code-block:: + + from paddle.dataset.common import md5file + from paddle.utils.download import get_path_from_url + from paddlenlp.utils.env import DATA_HOME + + class LCQMC(DatasetBuilder): + """ + LCQMC:A Large-scale Chinese Question Matching Corpus + More information please refer to `https://www.aclweb.org/anthology/C18-1166/` + + """ + lazy = False + URL = "https://bj.bcebos.com/paddlehub-dataset/lcqmc.tar.gz" + MD5 = "62a7ba36f786a82ae59bbde0b0a9af0c" + META_INFO = collections.namedtuple('META_INFO', ('file', 'md5')) + SPLITS = { + 'train': META_INFO( + os.path.join('lcqmc', 'train.tsv'), + '2193c022439b038ac12c0ae918b211a1'), + 'dev': META_INFO( + os.path.join('lcqmc', 'dev.tsv'), + 'c5dcba253cb4105d914964fd8b3c0e94'), + 'test': META_INFO( + os.path.join('lcqmc', 'test.tsv'), + '8f4b71e15e67696cc9e112a459ec42bd'), + } + +首先贡献的数据集需要继承 :class:`paddlenlp.datasets.DatasetBuilder` 类,类名格式为camel case。之后应该添加一段注释,简要说明数据集的来源等信息。之后需定义以下成员变量: + +- :attr:`lazy` :数据集的默认类型。:obj:`False` 对应 :class:`MapDataset` ,:obj:`True` 对应 :class:`IterDataset` 。 +- :attr:`URL` :数据集压缩包下载地址,需提供有效并稳定的下载链接。如果数据集不是压缩包,可以不再这里提供。 +- :attr:`MD5` :数据集压缩包的md5值,用于文件校验,如果数据集文件不是压缩包,可以不再这里提供。 +- :attr:`META_INFO` :数据集split信息格式。 +- :attr:`SPLITS` :数据集的split信息,包含数据集解压后的不同文件的具体位置,文件名,md5值等,如果数据集不是压缩包则通常在这里提供下载地址,还可以包含诸如不同文件对应的文件读取参数等信息。 + +除此之外,不同的数据集可能还需要诸如 :attr:`VOCAB_INFO` 等其他成员变量(参见 `iwslt15.py `__ )。或者成员变量会有其他格式。贡献者可以根据实际情况自行调整。 + +.. 
note:: + + - 如果贡献的数据集没有子数据集,那么 :class:`DatasetBuilder` **必须包含** :attr:`SPLITS` 成员变量,且该变量必须是一个字典,字典的key是该数据集包含的splits。 + - 如果贡献的数据集有子数据集,那么 :class:`DatasetBuilder` **必须包含** :attr:`BUILDER_CONFIGS` 成员变量,且该变量必须是一个字典,字典的key是该数据集包含的子数据集的 :attr:`name` 。字典的value是包含该数据集的子数据集split信息的字典,key值必须是 `splits` 。具体格式(参见 `glue.py `__ ) + +:func:`_get_data` 方法 +----------------------- + +.. code-block:: + + def _get_data(self, mode, **kwargs): + ''' Check and download Dataset ''' + default_root = os.path.join(DATA_HOME, self.__class__.__name__) + filename, data_hash = self.SPLITS[mode] + fullname = os.path.join(default_root, filename) + if not os.path.exists(fullname) or (data_hash and + not md5file(fullname) == data_hash): + get_path_from_url(self.URL, default_root, self.MD5) + + return fullname + +:func:`_get_data` 方法根据传入的 :attr:`mode` 和数据集的split信息定位到具体数据集文件。首先进行md5值校验本地文件,若校验失败则调用 :func:`paddle.utils.download.get_path_from_url` 方法下载并校验数据集文件,最后返回数据集文件的本地地址。 + +:func:`_read` 方法 +----------------------- + +.. code-block:: + + def _read(self, filename): + """Reads data.""" + with open(filename, 'r', encoding='utf-8') as f: + head = None + for line in f: + data = line.strip().split("\t") + if not head: + head = data + else: + query, title, label = data + yield {"query": query, "title": title, "label": label} + +:func:`_read` 方法根据传入的文件地址读取数据。该方法必须是一个生成器,以确保 :class:`DatasetBuilder` 可以构造 :class:`MapDataset` 和 :class:`IterDataset` 两种数据集。 +当不同split对应的数据文件读取方式不同时,该方法还需要支持 :attr:`split` 参数,并支持不同split下的读取方式。 + +.. note:: + + - 该方法提供的每条example都应是一个 :class:`Dictionary` 对象。 + - :class:`DatasetBuilder` 在生成Dataset时提供了将class label转换为id的功能。如果用户需要此功能,需要将example中label对应的key设置为 **"label"** 或 **"labels"** ,并在类中正确添加 :func:`get_labels` 方法。 + +:func:`get_labels` 方法 +----------------------- + +.. code-block:: + + def get_labels(self): + """ + Return labels of the LCQMC object. + """ + return ["0", "1"] + +:func:`get_labels` 方法返回一个由该数据集中所有label组成的list。用于将数据集中的class label转换为id,并且这个list之后会作为实例变量传给生成的数据集。 + +:func:`get_vocab` 方法 +----------------------- + +如果数据集提供词典文件,则需要加入 :func:`get_vocab` 方法和 :attr:`VOCAB_INFO` 变量。 + +该方法会根据 :attr:`VOCAB_INFO` 变量返回一个包含数据集词典信息的 :class:`Dictionary` 对象并作为实例变量传给生成的数据集。用于在训练过程中初始化 :class:`paddlenlp.data.Vocab` 对象。 +该方法的写法请参考 `iwslt15.py `__ 。 + +.. note:: + + - 贡献数据集时 :func:`get_labels` 和 :func:`get_vocab` 方法是可选的,视具体数据集内容而定。 :func:`_read` 和 :func:`_get_data` 方法是 **必须包含** 的。 + - 如果您不希望在数据获取过程中进行md5值校验,可以不用给出相关成员变量和校验代码。 + diff --git a/docs/community/contribute_datasets/index.rst b/docs/community/contribute_datasets/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..e18717438ba1414365b7450cf2df883bd0c00883 --- /dev/null +++ b/docs/community/contribute_datasets/index.rst @@ -0,0 +1,9 @@ +============ +如何贡献数据集 +============ + +.. toctree:: + :maxdepth: 1 + + sharing_dataset.rst + how_to_write_a_DatasetBuilder.rst \ No newline at end of file diff --git a/docs/community/contribute_datasets/sharing_dataset.rst b/docs/community/contribute_datasets/sharing_dataset.rst new file mode 100644 index 0000000000000000000000000000000000000000..1f94a5c378d15958e37c41344ff6a0439a303b66 --- /dev/null +++ b/docs/community/contribute_datasets/sharing_dataset.rst @@ -0,0 +1,98 @@ +======================== +分享你的数据集 +======================== + +除了使用PaddleNLP内置的数据集以外,我们也鼓励用户向PaddleNLP贡献自己的数据集。 + +下面我们来介绍一下贡献数据集的详细流程: + +配置环境 +--------------- + +#. 编写和测试PaddleNLP代码需要依赖python3.6以上版本以及最新版本的PaddlePaddle。请确保正确安装以上依赖。 +#. 在PaddleNLP的github页面上点击Fork按钮,在自己的github中创建一份PaddleNLP repo的副本。 +#. 
将您frok的内容下载到本地,并将官方repo作为remote。 + + .. code-block:: + + git clone https://github.com/USERNAME/PaddleNLP + cd PaddleNLP + git remote add upstream https://github.com/PaddlePaddle/PaddleNLP.git + +#. 安装pre-commit钩子,它可以帮助我们格式化源代码,再提交前自动检查代码问题。不满足钩子的PR **不能** 被提交到PaddleNLP。 + + .. code-block:: + + pip install pre-commit + pre-commit install + +添加一个 :class:`DatasetBuilder` +---------------------------------- + +#. 创建一个新的本地分支,一般从develop 分支上创建新分支。 + + .. code-block:: + + git checkout -b my-new-dataset + +#. 找到您本地repo下的 `PaddleNLP/paddlenlp/datasets/` 路径,PaddleNLP的所有数据集代码都储存在这个文件夹下。 + + .. code-block:: + + cd paddlenlp/datasets + +#. 为您的数据集确定一个 `name`,例如 `squad` , `chnsenticorp` 等,这个 `name` 就是您的数据集被读取时的名称。 + + .. note:: + + - 为了方便别人使用您的数据集,确保这个 `name` **不会太长而且能够正确的表义**。 + - 数据集的 `name` 格式应为snake case。 + +#. 在该路径下创建python文件,文件名是数据集的 `name`,例如 `squad.py` 。并在这个文件中编写数据集的 :class:`DatasetBuilder` 代码。 + + :class:`DatasetBuilder` 的编写可以参考教程 :doc:`如何创建一个DatasetBuilder <./how_to_write_a_DatasetBuilder>` 。里面给出了详细的步骤和规范。 + + 我们也推荐您参考已有数据集的 :class:`DatasetBuilder` 进行创建,从已有代码copy一些共用部分可能对您编写自己的数据集代码有所帮助,下面是一些已有数据集的示例: + + - `iwslt15.py `__ 翻译数据集,包含词表文件。 + - `glue.py `__ glue数据集,包含多个子数据集,文件格式为tsv。 + - `squad.py `__ 阅读理解数据集,文件格式为json。 + - `imdb.py `__ imdb数据集,每个split包含多个文件。 + - `ptb.py `__ 语料库数据集。 + - `msra_ner.py `__ 序列标注数据集。 + +#. 开发完成后,可以使用 :attr:`load_dataset` 测试您创建的数据集中的split能否正确被识别。也可以使用 :attr:`print` 看看数据集读入的格式是否符合您的预期: + + .. code-block:: + + from paddlenlp.datasets import load_dataset + + ds = load_dataset('your_dataset_name', splits='your_split') + print(ds[0]) + +提交您的成果 +--------------- + +#. 当您认为数据集的代码已经ready后,就可以在本地commit您的修改了: + + .. code-block:: + + git add PaddleNLP/paddlenlp/datasets/your_dataset_name.py + git commit + +#. 在提交修改之前,最好获取获取先upstream的最新代码并更新当前分支。 + + .. code-block:: + + git fetch upstream + git pull upstream develop + +#. 将本地的修改推送到GitHub上,并在GitHub上向PaddleNLP提交Pull Request。 + + .. code-block:: + + git push origin my-new-dataset + +以上就是像PaddleNLP贡献数据集的完整流程了。我们看到您的PR后会尽快review,如果有任何问题都会尽快反馈给您。如果没有问题的话我们就会合入到PaddleNLP repo,您贡献的数据集就可以供其他人使用啦。 + +如果您对贡献数据集还有任何疑问,欢迎加入官方QQ技术交流群: 973379845向我们提出。我们会尽快为您解答。 \ No newline at end of file diff --git a/docs/community/contribute_docs.rst b/docs/community/contribute_docs.rst new file mode 100644 index 0000000000000000000000000000000000000000..19af97658064076f9b7c7c56246ad5dbe617ccdb --- /dev/null +++ b/docs/community/contribute_docs.rst @@ -0,0 +1,3 @@ +============== +如何贡献问答、案例 +============== diff --git a/docs/community/contribute_models/contribute_awesome_pretrained_models.rst b/docs/community/contribute_models/contribute_awesome_pretrained_models.rst new file mode 100644 index 0000000000000000000000000000000000000000..f2afc1d02c9f290578373cc896d2a0891f382cb8 --- /dev/null +++ b/docs/community/contribute_models/contribute_awesome_pretrained_models.rst @@ -0,0 +1,78 @@ +==================================================================================== +贡献预训练模型权重 +==================================================================================== + +1. 模型网络结构类型 +------------------------------------------------------------------------------------ +PaddleNLP目前已支持绝大多数主流的预训练模型网络结构,既包括百度自研的预训练模型(如ERNIE系列), +也涵盖业界主流的预训练模型(如BERT,ALBERT,GPT,RoBERTa,XLNet等)。 + +PaddleNLP目前支持的预训练模型结构类型汇总可见 +`Transformer预训练模型汇总 `_ +(持续增加中,也非常欢迎进行新模型贡献:`如何贡献新模型 `_ )。 + +2. 
模型参数权重类型 +------------------------------------------------------------------------------------ +非常欢迎大家贡献优质模型参数权重。 +参数权重类型包括但不限于(以BERT模型网络为例): + +- PaddleNLP还未收录的BERT预训练模型参数权重 + (如 `bert-base-japanese-char `_ ,`danish-bert-botxo `_ 等); +- BERT模型在其他垂类领域(如数学,金融,法律,医学等)的预训练模型参数权重 + (如 `MathBERT `_ ,`finbert `_ 等); +- 基于BERT在下游具体任务进行fine-tuning后的模型参数权重 + (如 `bert-base-multilingual-uncased-sentiment `_ , + `bert-base-NER `_ 等); +- 其他模型参数权重(任何你觉得有价值的模型参数权重); + +3. 参数权重格式转换 +------------------------------------------------------------------------------------ +当我们想要贡献github上开源的某模型权重时,但是发现该权重保存为其他的深度学习框架(PyTorch,TensorFlow等)的格式, +这就需要我们进行不同深度学习框架间的模型格式转换,下面的链接给出了一份详细的关于Pytorch到Paddle模型格式转换的教程: +`Pytorch到Paddle模型格式转换文档 <./convert_pytorch_to_paddle.rst>`_ 。 + +4. 进行贡献 +------------------------------------------------------------------------------------ +4.1 准备权重相关文件 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +一般来说,我们需要准备 **model_state.pdparams** ,**vocab.txt**,**tokenizer_config.json** +以及 **model_config.json** 这四个文件进行参数权重贡献。 + +- model_state.pdparams 文件可以通过上述的参数权重格式转换过程得到; +- vocab.txt 文件可以直接使用原始模型对应的vocab文件(根据模型对应tokenizer类型的不同,该文件名可能为spiece.model等); +- model_config.json 文件可以参考对应 model.save_pretrained() 接口保存的model_config.json文件; +- tokenizer_config.json 文件可以参考对应 tokenizer.save_pretrained() 接口保存的model_config.json文件; + +4.2 创建个人目录 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +如果你是首次进行权重贡献,那么你需要在 ``PaddleNLP/community/`` 下新建一个目录。 +目录名称使用你的github名称,比如新建目录 ``PaddleNLP/community/yingyibiao/`` 。 +如果已有个人目录,则可以跳过此步骤。 + +4.3 创建权重目录 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +在步骤4.2的个人目录下新建一个权重目录,权重目录名为本次贡献的模型权重名称。 +比如我想贡献 ``bert-base-uncased-sst-2-finetuned`` 这个模型, +则新建权重目录 ``PaddleNLP/community/yingyibiao/bert-base-uncased-sst-2-finetuned/`` 。 + +4.4 在权重目录下添加PR(pull request)相关文件 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +在步骤4.3的目录下加入两个文件,分别为 ``README.md`` 和 ``files.json`` 。 + +- ``README.md`` 是对你贡献的权重的详细介绍,使用示例,权重来源等。 +- ``files.json`` 为步骤4.1所得的权重相关文件以及对应地址。files.json文件内容示例如下,只需将地址中的 *yingyibiao* 和 + *bert-base-uncased-sst-2-finetuned* 分别更改为你的github用户名和权重名称。 + +.. 
code:: python + + { + "model_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/yingyibiao/bert-base-uncased-sst-2-finetuned/model_config.json", + "model_state": "https://bj.bcebos.com/paddlenlp/models/transformers/community/yingyibiao/bert-base-uncased-sst-2-finetuned/model_state.pdparams", + "tokenizer_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/yingyibiao/bert-base-uncased-sst-2-finetuned/tokenizer_config.json", + "vocab_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/yingyibiao/bert-base-uncased-sst-2-finetuned/vocab.txt" + } + +4.5 在github上提PR进行贡献 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +- 第一次进行开源贡献的同学可以参考 `first-contributions `_ 。 +- 模型权重贡献PR示例请参考 `bert-base-uncased-sst-2-finetuned PR <.>`_ 。 \ No newline at end of file diff --git a/docs/community/contribute_models/contribute_new_models.rst b/docs/community/contribute_models/contribute_new_models.rst new file mode 100644 index 0000000000000000000000000000000000000000..c7de561eb55b5dd29a2276b9d1279df064f25051 --- /dev/null +++ b/docs/community/contribute_models/contribute_new_models.rst @@ -0,0 +1,3 @@ +========================================== +贡献新模型 +========================================== \ No newline at end of file diff --git a/docs/community/contribute_models/convert_pytorch_to_paddle.rst b/docs/community/contribute_models/convert_pytorch_to_paddle.rst new file mode 100644 index 0000000000000000000000000000000000000000..7da5dd6d008b8979a7c825a31b818aeacc63ee15 --- /dev/null +++ b/docs/community/contribute_models/convert_pytorch_to_paddle.rst @@ -0,0 +1,463 @@ +========================================== +模型格式转换 +========================================== + +0. 前言 +------------------------------------------ +本文将介绍如何进行不同框架下的模型权重转换(以模型权重从PyTorch框架到Paddle框架的格式转换为例)。 + +模型格式转换的过程需要用户对模型结构有一个较详细的了解,成功完成模型格式转换也会有助于加深用户对该模型结构的理解。 +让我们开始这个有趣的过程吧! + +1. 模型权重文件概述 +------------------------------------------ +不管在什么框架下,当我们保存训练好的模型时,我们都需要将模型的参数权重持久化保存下来; +当我们加载一个保存好的模型时,我们都需要将参数权重加载并重新赋值给相应的模型。 + +PyTorch和Paddle都是通过序列化和反序列化模型的 ``state dict`` (状态字典)来进行参数权重的存储和加载的。 +``state dict`` 从数据结构上来看就是一个字典(比如Python中的dict), +其中key是模型参数的名称(数据类型为string),而value则为key所对应的值(数据类型为Tensor)。 +参数存储时,先获取目标对象的 ``state dict`` ,然后将 ``state dict`` 存储至磁盘; +参数载入时,先从磁盘载入保存的 ``state dict`` ,然后通过 ``set_state_dict()`` 方法配置到目标对象中。 + +按照约定俗成的命名规则,Paddle框架保存的模型文件名一般后缀为 `'.pdparams'` , +PyTorch框架保存的模型文件名一般后缀为 `'.pt'` 、 `'.pth'` 或者 `'.bin'` 。 +虽然后缀并不影响模型的保存和加载,但我们一般都会遵循这个命名规范。 + +2. 模型的 ``state dict`` 概述 +------------------------------------------ +刚刚我们简单介绍了一下模型文件和其中存储的 ``state dict`` , +下面让我们来看一个具体的例子来对 ``state dict`` 有更进一步的了解。 + +``LeNet`` 是由Yann LeCun等人在1998年提出的一个CNN网络模型,并且成功应用于手写数字识别系统。 +Paddle集成了 ``LeNet`` 这个简单的模型,我们可以一键进行模型加载, +下面的代码实现了该模型的加载和对应 ``state dict`` 的输出: + +.. code:: python + + >>> import paddle + >>> from paddle.vision.models import LeNet + >>> model = LeNet() + >>> model.state_dict().keys() # 输出state_dict的所有keys + odict_keys(['features.0.weight', 'features.0.bias', 'features.3.weight', 'features.3.bias', + 'fc.0.weight', 'fc.0.bias', 'fc.1.weight', 'fc.1.bias', 'fc.2.weight', 'fc.2.bias']) + + >>> model.state_dict()['features.0.weight'] # 输出 'features.0.weight' 对应的value + Parameter containing: + Tensor(shape=[6, 1, 3, 3], dtype=float32, place=CPUPlace, stop_gradient=False, + [[[[-0.31584871, 0.27280194, -0.43816274], + [ 0.06681869, 0.44526964, 0.80944657], + [ 0.05796078, 0.57411081, 0.15335406]]], + ... + ... 
+ [[[-0.07211500, -0.14458601, -1.11733580], + [ 0.53036308, -0.19761689, 0.56962037], + [-0.09760553, -0.02011104, -0.50577533]]]]) + + +我们可以通过 ``model.state_dict().keys()`` 来获取模型的所有参数名称。 +可以看到 ``LeNet`` 一共有10组参数,分别为:*'features.0.weight'*、*'features.0.bias'*、*'features.3.weight'* +、*'features.3.bias'*、*'fc.0.weight'*、*'fc.0.bias'*、*'fc.1.weight'*、*'fc.1.bias'*、*'fc.2.weight'* 和 *'fc.2.bias'*。 + +通过查询 ``model.state_dict()['features.0.weight']`` 可以查看 **'features.0.weight'** 这个参数的具体权重数值。 +上述输出显示该权重是一个dtype=float32,shape=[6, 1, 3, 3]的Tensor。 + +3. 利用 ``state dict`` 进行权重格式转换 +------------------------------------------ +了解了模型的存储和加载以及相关的 ``state dict`` 之后,我们来看一下模型格式的转换的具体步骤。 +一般来说,我们可以通过 ``state dict`` 的相互转换来帮助我们进行模型格式的转换。 + +以从PyTorch框架到Paddle框架的模型权重转换为例,转换的具体流程为: + +1. 加载PyTorch模型得到 ``state dict`` +2. PyTorch下的 ``state dict`` 转换为Paddle下的 ``state dict`` +3. 保存Paddle下的 ``state dict`` 得到Paddle模型。 + +下面我们来看一个具体的例子:``'bert-base-uncased'`` 是一个谷歌开源的12层的bert英文模型。 +PaddleNLP(Paddle框架)和HuggingFace的transformers(PyTorch框架)里都集成了这个模型, +两者参数量和具体参数数值是完全一致的。我们可以来加载对比这两个模型的 ``state dict`` 来了解转换的细节。 + +3.1 PyTorch框架下的 ``state dict`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +首先加载transformers下的 ``'bert-base-uncased'`` 模型, + +.. code:: python + + >>> import torch + >>> model_name = "bert-base-uncased" + >>> # 模型下载地址: https://huggingface.co/bert-base-uncased/blob/main/pytorch_model.bin + >>> model_file = "pytorch_model.bin" + >>> pytorch_state_dict = torch.load(model_file) + >>> pytorch_state_dict.keys() + odict_keys(['bert.embeddings.word_embeddings.weight', 'bert.embeddings.position_embeddings.weight', 'bert.embeddings.token_type_embeddings.weight', + 'bert.embeddings.LayerNorm.gamma', 'bert.embeddings.LayerNorm.beta', + 'bert.encoder.layer.0.attention.self.query.weight', 'bert.encoder.layer.0.attention.self.query.bias', + 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.layer.0.attention.self.key.bias', + 'bert.encoder.layer.0.attention.self.value.weight', 'bert.encoder.layer.0.attention.self.value.bias', + 'bert.encoder.layer.0.attention.output.dense.weight', 'bert.encoder.layer.0.attention.output.dense.bias', + 'bert.encoder.layer.0.attention.output.LayerNorm.gamma', 'bert.encoder.layer.0.attention.output.LayerNorm.beta', + 'bert.encoder.layer.0.intermediate.dense.weight', 'bert.encoder.layer.0.intermediate.dense.bias', + 'bert.encoder.layer.0.output.dense.weight', 'bert.encoder.layer.0.output.dense.bias', + 'bert.encoder.layer.0.output.LayerNorm.gamma', 'bert.encoder.layer.0.output.LayerNorm.beta', + 'bert.encoder.layer.1'... + 'bert.encoder.layer.2'... + . + . + . + 'bert.encoder.layer.9'... + 'bert.encoder.layer.10'... 
+ 'bert.encoder.layer.11.attention.self.query.weight', 'bert.encoder.layer.11.attention.self.query.bias', + 'bert.encoder.layer.11.attention.self.key.weight', 'bert.encoder.layer.11.attention.self.key.bias', + 'bert.encoder.layer.11.attention.self.value.weight', 'bert.encoder.layer.11.attention.self.value.bias', + 'bert.encoder.layer.11.attention.output.dense.weight', 'bert.encoder.layer.11.attention.output.dense.bias', + 'bert.encoder.layer.11.attention.output.LayerNorm.gamma', 'bert.encoder.layer.11.attention.output.LayerNorm.beta', + 'bert.encoder.layer.11.intermediate.dense.weight', 'bert.encoder.layer.11.intermediate.dense.bias', + 'bert.encoder.layer.11.output.dense.weight', 'bert.encoder.layer.11.output.dense.bias', + 'bert.encoder.layer.11.output.LayerNorm.gamma', 'bert.encoder.layer.11.output.LayerNorm.beta', + 'bert.pooler.dense.weight', 'bert.pooler.dense.bias', + 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', + 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.gamma', + 'cls.predictions.transform.LayerNorm.beta', 'cls.predictions.decoder.weight', + 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']) + +\**odict_keys**(ordered_dict keys)所显示的是PyTorch模型文件所对应的 ``state dict`` 的keys: +我们仔细观察一下可以发现参数可以分成几大模块:**embeddings** 模块, +**encoder_layers** 模块, **pooler** 模块和 **cls** 模块。 + +我们可以结合bert的具体结构来解读一下各个模块: + +- **embeddings** 模块 + + *'bert.embeddings'* 开头的各个参数是embeddings模块的参数, + 包括word_embeddings矩阵,position_embeddings矩阵,token_type_embeddings矩阵以及embeddings模块的LayerNorm层参数等。 +- **encoder_layers** 模块 + + *'bert.encoder.layer'*开头的各个参数是各encoder层的参数, + 可以看到 ``'bert-base-uncased'`` 模型一共有12层encoder(编号0-11),每一层encoder的结构都相同。 + 每一层encoder主要由一个*self-attention*模块和一个*feed-forward*模块构成。 + 我们具体来看一下第1层encoder的参数(编号为0,'bert.encoder.layer.0'开头的参数): + + 首先是*self-attention*模块: + + * *'attention.self.query'*,*'attention.self.key'* 和 *'attention.self.value'* + 分别代表self-attention结构里面的query矩阵,key矩阵和value矩阵。 + * *'attention.output.dense'* 是self-attention结构的线性层。 + * *'attention.output.LayerNorm'* 则是self-attention结构后的LayerNorm层。 + + 接下来是*feed-forward*模块,对应 'intermediate.dense' 和 'output.dense' 开头的参数 + 。*feed-forward*之后还有一个*LayerNorm*层,对应的是 'output.LayerNorm' 开头的参数。 +- **pooler** 模块 + + pooler模块在最后一层encoder之后,是我们对最后一层encoder输出的池化操作, +- **cls** 模块 + + cls模块是我们计算mlm(masked language model)和next sentence prediction(nsp)任务的结构。 + 'cls.predictions'开头的参数是我们做mlm任务时的参数,'cls.seq_relationship'开头的参数是我们做nsp预测任务时的参数。 + +3.2 Paddle框架下的 ``state dict`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +相信到现在,我们已经对bert这个模型的结构以及相应的具体参数有了更进一步的了解。 +接下来我们来加载PaddleNLP下的模型: + +.. 
code:: python + + >>> import paddle + >>> model_name = "bert-base-uncased" + >>> # 模型下载地址: https://bj.bcebos.com/paddlenlp/models/transformers/bert-base-uncased.pdparams + >>> model_file = "bert-base-uncased.pdparams" + >>> paddle_state_dict = paddle.load(model_file) + >>> paddle_state_dict.keys() + dict_keys(['bert.embeddings.word_embeddings.weight', 'bert.embeddings.position_embeddings.weight', 'bert.embeddings.token_type_embeddings.weight', + 'bert.embeddings.layer_norm.weight', 'bert.embeddings.layer_norm.bias', + 'bert.encoder.layers.0.self_attn.q_proj.weight', 'bert.encoder.layers.0.self_attn.q_proj.bias', + 'bert.encoder.layers.0.self_attn.k_proj.weight', 'bert.encoder.layers.0.self_attn.k_proj.bias', + 'bert.encoder.layers.0.self_attn.v_proj.weight', 'bert.encoder.layers.0.self_attn.v_proj.bias', + 'bert.encoder.layers.0.self_attn.out_proj.weight', 'bert.encoder.layers.0.self_attn.out_proj.bias', + 'bert.encoder.layers.0.linear1.weight', 'bert.encoder.layers.0.linear1.bias', + 'bert.encoder.layers.0.linear2.weight', 'bert.encoder.layers.0.linear2.bias', + 'bert.encoder.layers.0.norm1.weight', 'bert.encoder.layers.0.norm1.bias', + 'bert.encoder.layers.0.norm2.weight', 'bert.encoder.layers.0.norm2.bias', + 'bert.encoder.layers.1'... + ... + ... + ... + 'bert.encoder.layers.10'... + 'bert.encoder.layers.11.self_attn.q_proj.weight', 'bert.encoder.layers.11.self_attn.q_proj.bias', + 'bert.encoder.layers.11.self_attn.k_proj.weight', 'bert.encoder.layers.11.self_attn.k_proj.bias', + 'bert.encoder.layers.11.self_attn.v_proj.weight', 'bert.encoder.layers.11.self_attn.v_proj.bias', + 'bert.encoder.layers.11.self_attn.out_proj.weight', 'bert.encoder.layers.11.self_attn.out_proj.bias', + 'bert.encoder.layers.11.linear1.weight', 'bert.encoder.layers.11.linear1.bias', + 'bert.encoder.layers.11.linear2.weight', 'bert.encoder.layers.11.linear2.bias', + 'bert.encoder.layers.11.norm1.weight', 'bert.encoder.layers.11.norm1.bias', + 'bert.encoder.layers.11.norm2.weight', 'bert.encoder.layers.11.norm2.bias', + 'bert.pooler.dense.weight', 'bert.pooler.dense.bias', + 'cls.predictions.decoder_weight', 'cls.predictions.decoder_bias', + 'cls.predictions.transform.weight', 'cls.predictions.transform.bias', + 'cls.predictions.layer_norm.weight', 'cls.predictions.layer_norm.bias', + 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']) + +Paddle模型的 ``state dict`` 是通过一个dict来进行存储,可以看到,两者的 ``state dict`` 是十分相似的。 +我们对比一下两者: + +- 两者的存储是相似的,PyTorch里使用的是python中的ordered_dict来存储模型的参数状态, + 在Paddle中则使用的是python中的dict来来进行存储。 +- 两者的结构也是相似的,都可以分成embeddings,encoder_layer, pooler, cls等 + 模块(当然这也很直观,毕竟两者的模型结构和模型参数是完全一致的)。 +- 同时两者也存在一些区别,两者的 ``state dict`` 的keys有一些细微的差异,这是由于模型代码的具体实现的参数命名差异所造成的。 + +3.3 PyTorch和Paddle的 ``state dict`` 对比 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +我们接下来对上述两个 ``state dict`` 的参数名称以及对应权重来做一一对应。 +下面的表格是整理好的 ``state_dict`` 对应关系表格(同一行代表着相对应的参数): + ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| Keys (PyTorch) | Shape (PyTorch) | Keys (Paddle) | Shape (Paddle) | ++========================================================+============================+==================================================+===========================+ +| bert.embeddings.word_embeddings.weight | [30522, 768] | bert.embeddings.word_embeddings.weight | [30522, 768] | 
++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.embeddings.position_embeddings.weight | [512, 768] | bert.embeddings.position_embeddings.weight | [512, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.embeddings.token_type_embeddings.weight | [2, 768] | bert.embeddings.token_type_embeddings.weight | [2, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.embeddings.LayerNorm.gamma | [768] | bert.embeddings.layer_norm.weight | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.embeddings.LayerNorm.beta | [768] | bert.embeddings.layer_norm.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.query.weight | [768, 768] | bert.encoder.layers.0.self_attn.q_proj.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.query.bias | [768] | bert.encoder.layers.0.self_attn.q_proj.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.key.weight | [768, 768] | bert.encoder.layers.0.self_attn.k_proj.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.key.bias | [768] | bert.encoder.layers.0.self_attn.k_proj.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.value.weight | [768, 768] | bert.encoder.layers.0.self_attn.v_proj.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.self.value.bias | [768] | bert.encoder.layers.0.self_attn.v_proj.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.output.dense.weight | [768, 768] | bert.encoder.layers.0.self_attn.out_proj.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.output.dense.bias | [768] | bert.encoder.layers.0.self_attn.out_proj.bias | [768] | 
++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.output.LayerNorm.gamma | [768] | bert.encoder.layers.0.norm1.weight | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.attention.output.LayerNorm.beta | [768] | bert.encoder.layers.0.norm1.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.intermediate.dense.weight | [3072, 768] | bert.encoder.layers.0.linear1.weight | [768, 3072] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.intermediate.dense.bias | [3072] | bert.encoder.layers.0.linear1.bias | [3072] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.output.dense.weight | [768, 3072] | bert.encoder.layers.0.linear2.weight | [3072, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.output.dense.bias | [768] | bert.encoder.layers.0.linear2.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.output.LayerNorm.gamma | [768] | bert.encoder.layers.0.norm2.weight | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.encoder.layer.0.output.LayerNorm.beta | [768] | bert.encoder.layers.0.norm2.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.pooler.dense.weight | [768, 768] | bert.pooler.dense.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| bert.pooler.dense.bias | [768] | bert.pooler.dense.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.bias | [30522] | cls.predictions.decoder_bias | [30522] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.transform.dense.weight | [768, 768] | cls.predictions.transform.weight | [768, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.transform.dense.bias | [768] | cls.predictions.transform.bias | [768] | 
++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.transform.LayerNorm.gamma | [768] | cls.predictions.layer_norm.weight | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.transform.LayerNorm.beta | [768] | cls.predictions.layer_norm.bias | [768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.predictions.decoder.weight | [30522, 768] | cls.predictions.decoder_weight | [30522, 768] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.seq_relationship.weight | [2, 768] | cls.seq_relationship.weight | [768, 2] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ +| cls.seq_relationship.bias | [2] | cls.seq_relationship.bias | [2] | ++--------------------------------------------------------+----------------------------+--------------------------------------------------+---------------------------+ + +正确地对应好 ``state dict`` 的参数以及权重有助于我们正确地进行 ``state dict`` 的转换。 + +我们从参数名称上能看出基本的一个对应关系,例如: + +* `bert.embeddings.LayerNorm.gamma` 对应 `bert.embeddings.layer_norm.weight` ; +* `bert.embeddings.LayerNorm.beta` 对应 `bert.embeddings.layer_norm.bias` ; +* `bert.encoder.layer.0.attention.self.query.weight` 对应 `bert.encoder.layers.0.self_attn.q_proj.weight` ; +* `bert.encoder.layer.0.attention.self.query.bias` 对应 `bert.encoder.layers.0.self_attn.q_proj.bias`。 + +两者的顺序是基本一致的,但也有一些例外,例如: + +* `bert.encoder.layers.0.norm1.weight` 对应 `bert.encoder.layer.0.attention.output.LayerNorm.gamma` ; +* `bert.encoder.layers.0.norm1.bias` 对应 `bert.encoder.layer.0.attention.output.LayerNorm.beta` ; +* `bert.encoder.layer.0.intermediate.dense.weight` 对应 `bert.encoder.layers.0.linear1.weight` ; +* `bert.encoder.layer.0.output.dense.weight` 对应 `bert.encoder.layers.0.linear2.weight` ; +* `bert.encoder.layer.0.output.LayerNorm.gamma` 对应 `bert.encoder.layers.0.norm2.weight`。 + +正确的参数对应关系可能需要我们阅读具体的代码进行判断。 +在上面的表格中我们已经将两者的keys准确地一一对应了。建立好了keys的对应关系之后,我们可以进行values的对应。 + +如果你仔细观察表格,会发现有些参数对应的values形状存在差异。 +比如 ``bert.encoder.layer.0.intermediate.dense.weight`` 和 ``bert.encoder.layers.0.linear1.weight`` +这两个keys是相对应的一组参数名,但是他们的values形状却不相同;前者是 ``[3072, 768]`` , +后者是 ``[768, 3072]`` ,两者刚好是一个转置的关系。这是因为PyTorch对于 ``nn.Linear`` 模块的保存是将权重的shape进行转置后保存的。 +所以在我们进行 ``state dict`` 转换的时候,需要注意做好shape的转换(例如将PyTorch模型里 +nn.Linear层对应的参数权重转置处理后生成Paddle的参数权重)。 + +另外还需要注意其他一些细节,这里列出来几个可能会遇到的情景以供参考: + +- 有些模型结构可能在实现时对参数的处理有差异导致存在参数的拆分或者合并等操作, + 此时我们需要进行参数多对一或者一对多的映射,同时将对应的values拆分或者合并。 +- 还有存在batch norm层时,我们需要注意todo。 + +3.4 bert模型转换代码 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +下一步就是进行最关键的模型转换环节。 +这一步十分关键,正确地进行 ``state dict`` 的转换才能确保我们通过精度验证。 + +下面是进行模型转换的代码(PyTorch转换为Paddle): + +.. 
code:: python + + import paddle + import torch + import numpy as np + + torch_model_path = "pytorch_model.bin" + torch_state_dict = torch.load(torch_model_path) + + paddle_model_path = "bert_base_uncased.pdparams" + paddle_state_dict = {} + + # State_dict's keys mapping: from torch to paddle + keys_dict = { + # about embeddings + "embeddings.LayerNorm.gamma": "embeddings.layer_norm.weight", + "embeddings.LayerNorm.beta": "embeddings.layer_norm.bias", + + # about encoder layer + 'encoder.layer': 'encoder.layers', + 'attention.self.query': 'self_attn.q_proj', + 'attention.self.key': 'self_attn.k_proj', + 'attention.self.value': 'self_attn.v_proj', + 'attention.output.dense': 'self_attn.out_proj', + 'attention.output.LayerNorm.gamma': 'norm1.weight', + 'attention.output.LayerNorm.beta': 'norm1.bias', + 'intermediate.dense': 'linear1', + 'output.dense': 'linear2', + 'output.LayerNorm.gamma': 'norm2.weight', + 'output.LayerNorm.beta': 'norm2.bias', + + # about cls predictions + 'cls.predictions.transform.dense': 'cls.predictions.transform', + 'cls.predictions.decoder.weight': 'cls.predictions.decoder_weight', + 'cls.predictions.transform.LayerNorm.gamma': 'cls.predictions.layer_norm.weight', + 'cls.predictions.transform.LayerNorm.beta': 'cls.predictions.layer_norm.bias', + 'cls.predictions.bias': 'cls.predictions.decoder_bias' + } + + + for torch_key in torch_state_dict: + paddle_key = torch_key + for k in keys_dict: + if k in paddle_key: + paddle_key = paddle_key.replace(k, keys_dict[k]) + + if ('linear' in paddle_key) or ('proj' in paddle_key) or ('vocab' in paddle_key and 'weight' in paddle_key) or ("dense.weight" in paddle_key) or ('transform.weight' in paddle_key) or ('seq_relationship.weight' in paddle_key): + paddle_state_dict[paddle_key] = paddle.to_tensor(torch_state_dict[torch_key].cpu().numpy().transpose()) + else: + paddle_state_dict[paddle_key] = paddle.to_tensor(torch_state_dict[torch_key].cpu().numpy()) + + print("torch: ", torch_key,"\t", torch_state_dict[torch_key].shape) + print("paddle: ", paddle_key, "\t", paddle_state_dict[paddle_key].shape, "\n") + + paddle.save(paddle_state_dict, paddle_model_path) + + +我们来看一下这份转换代码: +我们需要下载好待转换的PyTorch模型,并加载模型得到**torch_state_dict** +;**paddle_state_dict** 和 **paddle_model_path** 则定义了转换后的 ``state dict`` 和模型文件路径; +代码中 **keys_dict** 定义了两者keys的映射关系(可以通过上面的表格对比得到)。 + +下一步就是最关键的 *paddle_state_dict* 的构建,我们对 *torch_state_dict* 里的每一个key都进行映射, +得到对应的 *paddle_state_dict* 的key。获取 *paddle_state_dict* 的key之后我们需要 +对 *torch_state_dict* 的value进行转换,如果key对应的结构是 ``nn.Linear`` 模块的话, +我们还需要进行value的transpose操作。 + +最后我们保存得到的 *paddle_state_dict* 就能得到对应的Paddle模型。 +至此我们已经完成了模型的转换工作,得到了Paddle框架下的模型 ``"model_state.pdparams"`` 。 + +4. 模型权重验证 +------------------------------------------ +得到了模型权重后,我们还需要进行精度的对齐来验证我们上述转换的正确性。 +我们可以通过前向推理和下游任务fine-tuning这两个任务进行精度对齐验证。 + +4.1 对齐前向精度 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +前向精度的对齐十分简单,我们只需要保证两者输入是一致的前提下,观察得到的输出是否一致。 +这里有几个注意事项,我们运行推理时需要打开eval模式,设置dropout为0等操作去除随机性造成的影响。 + +除了得到的模型权重文件,我们还需要准备模型配置文件。将模型权重文件(model_state.pdparams)和模型配置文件(model_config.json) +这两个文件放在同一个路径下,我们就可以进行模型前向精度的对齐验证,下面提供了bert模型对齐前向精度的代码示例: + +.. code:: python + + text = "Welcome to use paddle paddle and paddlenlp!" 
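+
+    # 下面分别用 PyTorch 与 Paddle 对同一句输入做前向推理并对比输出;
+    # 经验上,FP32 模型若成功对齐,输出的最大绝对差通常在 1e-5 量级或更小,可据此粗略判断是否对齐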
+ torch_model_name = "bert-base-uncased" + paddle_model_name = "bert-base-uncased" + + # torch output + import torch + import transformers + from transformers.models.bert import * + + # torch_model = BertForPreTraining.from_pretrained(torch_model_name) + torch_model = BertModel.from_pretrained(torch_model_name) + torch_tokenizer = BertTokenizer.from_pretrained(torch_model_name) + torch_model.eval() + + torch_inputs = torch_tokenizer(text, return_tensors="pt") + torch_outputs = torch_model(**torch_inputs) + + torch_logits = torch_outputs[0] + torch_array = torch_logits.cpu().detach().numpy() + print("torch_prediction_logits shape:{}".format(torch_array.shape)) + print("torch_prediction_logits:{}".format(torch_array)) + + + # paddle output + import paddle + import paddlenlp + from paddlenlp.transformers.bert.modeling import * + import numpy as np + + # paddle_model = BertForPretraining.from_pretrained(paddle_model_name) + paddle_model = BertModel.from_pretrained(paddle_model_name) + paddle_tokenizer = BertTokenizer.from_pretrained(paddle_model_name) + paddle_model.eval() + + paddle_inputs = paddle_tokenizer(text) + paddle_inputs = {k:paddle.to_tensor([v]) for (k, v) in paddle_inputs.items()} + paddle_outputs = paddle_model(**paddle_inputs) + + paddle_logits = paddle_outputs[0] + paddle_array = paddle_logits.numpy() + print("paddle_prediction_logits shape:{}".format(paddle_array.shape)) + print("paddle_prediction_logits:{}".format(paddle_array)) + + + # the output logits should have the same shape + assert torch_array.shape == paddle_array.shape, "the output logits should have the same shape, but got : {} and {} instead".format(torch_array.shape, paddle_array.shape) + diff = torch_array - paddle_array + print(np.amax(abs(diff))) + +代码最后会打印模型输出矩阵的每个元素最大差值,根据这个差值可以判定我们是否对齐了前向精度。 + +4.2 下游任务fine-tuning验证(可选) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +当我们对齐前向精度时,一般来说我们的模型转换就已经成功了。我们还可以运行下游任务fine-tuning进行double check。 +同样的,我们需要设置相同的训练数据,相同的训练参数,相同的训练环境进行fine-tuning来对比两者的收敛性以及收敛指标。 + +5. 写在最后 +------------------------------------------ +恭喜你成功完成了模型权重的格式转换工作!欢迎向PaddleNLP提PR共享你的模型, +这样每一个使用PaddleNLP的用户都能使用你共享的模型哦~ diff --git a/docs/community/contribute_models/index.rst b/docs/community/contribute_models/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..9e15bd78f87b733f03f87ba0facfdfac5045c1db --- /dev/null +++ b/docs/community/contribute_models/index.rst @@ -0,0 +1,10 @@ +============ +如何贡献模型 +============ + +.. 
toctree:: + :maxdepth: 1 + + convert_pytorch_to_paddle.rst + contribute_awesome_pretrained_models.rst + contribute_new_models.rst \ No newline at end of file diff --git a/docs/community/join_in_PaddleNLP-SIG.rst b/docs/community/join_in_PaddleNLP-SIG.rst new file mode 100644 index 0000000000000000000000000000000000000000..44d55add660875f8cb87210c6331f068b5dac6af --- /dev/null +++ b/docs/community/join_in_PaddleNLP-SIG.rst @@ -0,0 +1,3 @@ +============== +如何加入PaddleNLP-SIG +============== diff --git a/docs/community/rfcs/20230304_api_design_for_tie_weight_task_103.md b/docs/community/rfcs/20230304_api_design_for_tie_weight_task_103.md new file mode 100644 index 0000000000000000000000000000000000000000..f023f95cde654d2af97a35378980a871a772b574 --- /dev/null +++ b/docs/community/rfcs/20230304_api_design_for_tie_weight_task_103.md @@ -0,0 +1,269 @@ +# 标题 + +标题如:paddle.io.dataset 设计文档 + +|API名称 | 新增API名称 | +|---|----------------------------------------------------| +|提交作者 | 丘文波, 刘旺旺 | +|提交时间 | 2022-03-10 | +|版本号 | V3 | +|依赖飞桨版本 | 如无特殊情况,都应基于develop版本开发 | +|文件名 | 20230304_api_design_for_tie_weight_task_103.md
| + + +# 一、概述 +## 1、相关背景 +对应任务是 No.103:新增tie_weights能力 + +权重绑定, 一般是指将输入层embedding和 输出层embeding共享权重, 从而在减少网络的参数量, 使得embeding层参数训练更加充分. + +其中《attention is all you need》中的提到的transformer模型也使用到了tie weigh这个技巧, 论文3.4节提到将encoder输入embedding与decoder输入embedding以及输出线性层权重共享 这个技巧的有效性在论文《Using the output embedding to improve language models》进行了验证 . + +所以预训练语言模型需要实现一个输入层embedding和 输出层embeding共享权重共享功能,方便使用者进行调用. + +相关issue: +* [https://github.com/PaddlePaddle/PaddleNLP/issues/4740](https://github.com/PaddlePaddle/PaddleNLP/issues/4740) + + +## 2、功能目标 +给预训练语言模型增加一个基础函数, 实现输入层embeding和输出层embedding的权重共享绑定: + +- 为PaddleNLP新增tie_weights功能,能够对齐HuggingFace Transformers中的[tie_weights](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.tie_weights)功能 +- 参考: [https://github.com/huggingface/transformers/blob/v4.26.1/src/transformers/modeling_utils.py#L1172](https://github.com/huggingface/transformers/blob/v4.26.1/src/transformers/modeling_utils.py#L1172) + + +## 3、意义 +实现权重绑定的函数, 作为一种模型技巧来提升训练效果.减少模型参数, + +权重绑定的函数作为模型的一个基本函数, 在基于预训练模型组网的时候 方便进行调用进行实验, 减少模型参数,提升模型效果. + + +# 二、飞桨现状 +对飞桨框架目前支持此功能的现状调研,如果不支持此功能,如是否可以有替代实现的API,是否有其他可绕过的方式,或者用其他API组合实现的方式; + +paddle 中并没有对tie weight的统一实现,调用者需自己写代码实现这部分功能. + +paddleNLP中的一些示例代码中也找到了一个tie weight的实现. + +(1) [代码链接1](https://github.com/qiuwenbogdut/PaddleNLP/blob/develop/examples/language_model/transformer-xl/mem_transformer.py#L811) + +```python +if tie_weight: + for i in range(len(self.crit.out_layers_weight)): + self.crit.out_layers_weight[i] = self.word_emb.emb_layers[i].weight + +if tie_projs: + for i, tie_proj in enumerate(tie_projs): + if tie_proj and div_val == 1 and d_model != d_embed: + self.crit.out_projs[i] = self.word_emb.emb_projs[0] + elif tie_proj and div_val != 1: + self.crit.out_projs[i] = self.word_emb.emb_projs[i] +``` + +(2) [代码链接2](https://github.com/PaddlePaddle/PaddleNLP/blob/4e5df921ff61ddae1d869c37aea621b9cac6bcd4/paddlenlp/transformers/reformer/modeling.py#L1977) + +```python +def tie_weights(self): + """ + Tie the weights between the input embeddings and the output embeddings. + """ + tie_word_embeddings = ( + self.tie_word_embeddings + if hasattr(self, "tie_word_embeddings") + else self.config.get("tie_word_embeddings", False) + ) + if hasattr(self, "get_output_embeddings") and hasattr(self, "get_input_embeddings") and tie_word_embeddings: + output_embeddings = self.get_output_embeddings() + if output_embeddings is not None: + self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings()) +``` + +(3) [代码链接3](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/ernie/modeling.py#L748) +```python +class ErnieLMPredictionHead(nn.Layer): + r""" + Ernie Model with a `language modeling` head on top. 
+ """ + + def __init__( + self, + config: ErnieConfig, + embedding_weights=None, + weight_attr=None, + ): + super(ErnieLMPredictionHead, self).__init__() + + self.transform = nn.Linear(config.hidden_size, config.hidden_size, weight_attr=weight_attr) + self.activation = getattr(nn.functional, config.hidden_act) + self.layer_norm = nn.LayerNorm(config.hidden_size) + self.decoder_weight = ( + self.create_parameter( + shape=[config.vocab_size, config.hidden_size], + dtype=self.transform.weight.dtype, + attr=weight_attr, + is_bias=False, + ) + if embedding_weights is None + else embedding_weights + ) + self.decoder_bias = self.create_parameter( + shape=[config.vocab_size], dtype=self.decoder_weight.dtype, is_bias=True + ) +``` + + +其实paddlenlp内大部分的tie_weights实现是直接在模型layer定义层面实现的,见[代码](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/ernie/modeling.py#L748) +,而不是类似transformers一样在模型以外统一实现的。这个项目的目标就是看一下能否在模型外统一实现,而不用每个模型都自己实现一次 + +paddle里面tie_weghts实现主要有两种方式: +* 一种在modeling.py中定义了tie_weghts函数,相应的模型也实现了get_input_embeding()和get_output_embeding()来获取输入和输出embeding层权重,然后通过赋值方式进行绑定。如上面的代码链接(1)(2) +* 另外一种是 在定义模型层的时候 直接将输入input_embeding的weight,赋值给输出层weight. 将embedding的weight直接传给head来构建linear输出层,期望是在get_input_embeding()拿到weight,然后传给head层,如上面代码链接(3) + + + +最好是在模型[基类里面model_utils.py#L897](https://github.com/PaddlePaddle/PaddleNLP/blob/be80a3e30fb681e53773c265babe611d4df62ead/paddlenlp/transformers/model_utils.py#L897) +去统一实现 tie_weights,减少调用者的开发. + +# 三、业内方案调研 +描述业内深度学习框架如何实现此功能,包括与此功能相关的现状、未来趋势;调研的范围包括不限于TensorFlow、PyTorch、NumPy等 + +(1)目前huggingface的transformers库中实现了这个tieweight 这个基础函数. [代码链接](https://github.com/huggingface/transformers/blob/v4.26.1/src/transformers/modeling_utils.py#L1172) +```python +def tie_weights(self): + """ + Tie the weights between the input embeddings and the output embeddings. + If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning the + weights instead. 
+ """ + if getattr(self.config, "tie_word_embeddings", True): + output_embeddings = self.get_output_embeddings() + if output_embeddings is not None: + self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings()) + + if getattr(self.config, "is_encoder_decoder", False) and getattr(self.config, "tie_encoder_decoder", False): + if hasattr(self, self.base_model_prefix): + self = getattr(self, self.base_model_prefix) + self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix) + + for module in self.modules(): + if hasattr(module, "_tie_weights"): + module._tie_weights() +``` + + +(2) tensor2tensor库 tieweight 实现代码 [代码链接](https://github.com/tensorflow/tensor2tensor/blob/316c9ce2f2b2373f44f5be0da712dda3e5861a75/tensor2tensor/layers/modalities.py#L1106) +```python +def symbol_top(body_output, targets, model_hparams, vocab_size): + del targets # unused arg + if model_hparams.shared_embedding_and_softmax_weights: + scope_name = "shared" + reuse = tf.AUTO_REUSE + else: + scope_name = "softmax" + reuse = False + with tf.variable_scope(scope_name, reuse=reuse): + body_output_shape = common_layers.shape_list(body_output) + var = get_weights(model_hparams, vocab_size, body_output_shape[-1]) + if (model_hparams.factored_logits and + model_hparams.mode == tf_estimator.ModeKeys.TRAIN): + # insert channels dimension + body_output = tf.expand_dims(body_output, 3) + return common_layers.FactoredTensor(body_output, var) + else: + body_output = tf.reshape(body_output, [-1, body_output_shape[-1]]) + logits = tf.matmul(body_output, var, transpose_b=True) + return tf.reshape(logits, + body_output_shape[:-1] + [1, vocab_size]) +``` + + +(3) fairseq库 中 tie weight实现函数 [代码链接](https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/fconv.py#L480) +```python +self.fc2 = Linear(in_channels, out_embed_dim) + if share_embed: + assert out_embed_dim == embed_dim, ( + "Shared embed weights implies same dimensions " + " out_embed_dim={} vs embed_dim={}".format(out_embed_dim, embed_dim) + ) + self.fc3 = nn.Linear(out_embed_dim, num_embeddings) + self.fc3.weight = self.embed_tokens.weight + else: + self.fc3 = Linear(out_embed_dim, num_embeddings, dropout=dropout) +``` + +# 四、对比分析 +paddle和 huggingface的transformers 都是基于动态图进行开发, 所以准备参照huggingface的transformers 的 tie weight 函数思路去实现功能. + +# 五、设计思路与实现方案 +参考huggingface的 transformers中的实现思路来基于paddle进行开发 + +实现tie_weight函数步骤: +1. 获取模型input embedding 权重对象 A +2. 获取模型 output embedding 权重对象 B +3. 让A和B 都指向同一个权重值 + + + + +## 命名与参数设计 +参考:[飞桨API 设计及命名规范](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/api_contributing_guides/api_design_guidelines_standard_cn.html) +## 底层OP设计 +## API实现方案 + +# 六、测试和验收的考量 +参考:[新增API 测试及验收规范](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/api_contributing_guides/api_accpetance_criteria_cn.html) + +测试tie_weight有两个办法: +* 直接判断输出层weight和输入层weight的id,如果一致即通过,否则Failed. +* 训练几个step,经过几个反向后,看下输出层weight和输入层weight是否一致,如果一致即通过,否则Failed. 
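+
+例如,按第一种方式(id 一致性)可以写出如下形式的检查函数。注意这只是一个示意草稿,这里假设待测模型提供 get_input_embeddings() 与 get_output_embeddings() 方法,方法名与返回形式以最终实现为准:
+
+```python
+def check_tie_weights(model):
+    # 获取输入层 embedding 的权重
+    input_weight = model.get_input_embeddings().weight
+    # get_output_embeddings() 可能返回带 weight 属性的层,也可能直接返回权重参数,这里做兼容处理
+    output_embeddings = model.get_output_embeddings()
+    output_weight = getattr(output_embeddings, "weight", output_embeddings)
+    # 两者 id 一致即认为权重绑定成功
+    return id(input_weight) == id(output_weight)
+```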
+ +用过id的一致性判断是否绑定成功, 简单高效,后面准备采用这种方式进行单侧: +构建单元测试, 测试模型的get_input_embeding得到的权重的id 和get_output_embeding 得到的权重id 是都一致, 如果是一致就通过,都则不通过 + + + +# 七、可行性分析和排期规划 + +设计一个小脚本验证一下这种方式的有效性: +```python +import numpy as np +from paddle.nn import Embedding + +"""step1 定义两个不同的embedding 对象 AA 和 BB""" +print('------------step1') +AA = Embedding(1,2) +BB = Embedding(1,2) + +AA.weight = BB.weight # 进行权重的绑定 + +""" step2 测试一下绑定结果""" +print('------------step2') +print('检测 AA 和 BB 的id是否一致:', AA is BB,id(AA), id(BB)) # AA 和 BB 的id 不一致 +print('检测 AA.weight 和 BB.weight 的id是否一致:',AA.weight is BB.weight,id(AA.weight), id(BB.weight)) # 但是AA.weight 和 BB.weight 的id是一致的 + +print("AA.weight: ",AA.weight) +print("BB.weight: ",BB.weight) + + + +""" step3 尝试修改一下AA的weight的值 BB的weight的值是否也跟着会一起修改""" +# 修改一下其中一个AA 的权重值, 看一下 BB的权重值会不会变化 +print('------------step3') +AA.weight.set_value(np.array([[4.0,6.0]],dtype=np.float32)) + +print('检测 修改后的 AA.weight 和 BB.weight 的id是否一致:',AA.weight is BB.weight,id(AA.weight), id(BB.weight)) # AA.weight 和 BB.weight 的id是一致的 +print("AA.weight 修改后的值: ",AA.weight) +print("BB.weight:",BB.weight) + +``` + +时间和开发排期规划,主要milestone +- 3.10 跟官方确认好开发思路 +- 3.17 提交实现代码 + +# 八、影响面 +需要进一步讨论的问题,开放性问题,有争议问题;对其他模块是否有影响 + +# 名词解释 + +# 附件及参考资料 diff --git a/docs/community/rfcs/api_design_template.md b/docs/community/rfcs/api_design_template.md new file mode 100644 index 0000000000000000000000000000000000000000..fad836f99af4c7feeea401e5d8547f2861595852 --- /dev/null +++ b/docs/community/rfcs/api_design_template.md @@ -0,0 +1,50 @@ +# 标题 + +标题如:paddle.io.dataset 设计文档 +|API名称 | 新增API名称 | +|---|---| +|提交作者 | 李强、张明 | +|提交时间 | 2022-03-01 | +|版本号 | 此设计文档的版本号,如V1.0 | +|依赖飞桨版本 | 如无特殊情况,都应基于develop版本开发 | +|文件名 | 提交的markdown设计文档文件名称,如:20200301_api_design_for_dataset.md
| + + +# 一、概述 +## 1、相关背景 +填写此任务的开发背景,为什么想要开发这个API。如果有相关issue,请将issue链接填写至此。 + +## 2、功能目标 + +## 3、意义 +集中阐述本次升级的作用和意义。 + +# 二、飞桨现状 +对飞桨框架目前支持此功能的现状调研,如果不支持此功能,如是否可以有替代实现的API,是否有其他可绕过的方式,或者用其他API组合实现的方式; + + +# 三、业内方案调研 +描述业内深度学习框架如何实现此功能,包括与此功能相关的现状、未来趋势;调研的范围包括不限于TensorFlow、PyTorch、NumPy等 + +# 四、对比分析 +对第三部分调研的方案进行对比**评价**和**对比分析**,论述各种方案的优劣势。 + +# 五、设计思路与实现方案 + +## 命名与参数设计 +参考:[飞桨API 设计及命名规范](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/api_contributing_guides/api_design_guidelines_standard_cn.html) +## 底层OP设计 +## API实现方案 + +# 六、测试和验收的考量 +参考:[新增API 测试及验收规范](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/api_contributing_guides/api_accpetance_criteria_cn.html) + +# 七、可行性分析和排期规划 +时间和开发排期规划,主要milestone + +# 八、影响面 +需要进一步讨论的问题,开放性问题,有争议问题;对其他模块是否有影响 + +# 名词解释 + +# 附件及参考资料 diff --git a/docs/compression.md b/docs/compression.md new file mode 100644 index 0000000000000000000000000000000000000000..35d8d6f8dc635510fecb37e714ce929fda5e5b3c --- /dev/null +++ b/docs/compression.md @@ -0,0 +1,470 @@ +# PaddleNLP 模型压缩 API + + **目录** + * [模型压缩 API 功能简介](#模型压缩API功简介) + * [三大场景快速启动模型压缩示例](#三大场景快速启动模型压缩示例) + * [四步启动模型压缩](#四步启动模型压缩) + * [Step1:获取模型压缩参数 compression_args](#获取模型压缩参数compression_args) + * [Step2:实例化 Trainer 并调用 compress()](#实例化Trainer并调用compress()) + * [Trainer 实例化参数介绍](#Trainer实例化参数介绍) + * [Step3:实现自定义评估函数(按需可选)](#实现自定义评估函数(按需可选)) + * [Step4:传参并运行压缩脚本](#传参并运行压缩脚本) + * [CompressionArguments 参数介绍](#CompressionArguments参数介绍) + * [模型评估与部署](#模型评估与部署) + * [FAQ](#FAQ) + * [参考文献](#References) + + + + +## 模型压缩 API 功能简介 + +PaddleNLP 模型压缩 API 功能支持对 ERNIE 类下游任务上微调后的模型进行裁剪、量化,以缩小模型体积、减少内存占用、减少计算、提升推理速度从而减少部署难度。模型压缩 API 效果好,且简洁易用。目前裁剪功能现在支持 DynaBERT 中的宽度自适应裁剪策略;量化现在支持静态离线量化方法(PTQ)、量化训练(QAT)和 Embedding 量化。PTQ 无需训练,只需少量校准数据,即可导出量化模型,QAT 类似 FP32 模型的训练过程,也基本能够做到精度无损,Embedding 量化过程较为简单,不需要训练也不需要校准数据即可完成。 + +- **效果好**:目前已经在分类(包含文本分类、文本匹配、自然语言推理、代词消歧、阅读理解等任务)、序列标注、抽取式阅读理解任务上进行过验证,基本达到精度无损。例如,对于 12L768H 和 6L768H 结构的模型,进行宽度保留比例为 2/3 的裁剪基本可以达到精度无损,模型裁剪后推理速度能够达到原先的 1-2 倍;6L768H 结构的模型量化后推理速度能够达到量化前的 2-3 倍。 + +- **简洁易用**:只需要简单几步即可开展模型压缩任务 + +##### ERNIE 3.0 压缩效果 +如下表所示,ERNIE 3.0-Medium (6-layer, 384-hidden, 12-heads) 模型在三类任务(文本分类、序列标注、抽取式阅读理解)经过裁剪 + 量化后加速比均达到 3 倍左右,所有任务上平均精度损失可控制在 0.5 以内(0.46)。 + +| | TNEWS 性能 | TNEWS 精度 | MSRA_NER 性能 | MSRA_NER 精度 | CMRC2018 性能 | CMRC2018 精度 | +| -------------------------- | ------------- | ------------ | ------------- | ------------- | ------------- | ------------- | +| ERNIE 3.0-Medium+FP32 | 1123.85(1.0x) | 57.45 | 366.75(1.0x) | 93.04 | 146.84(1.0x) | 66.95 | +| ERNIE 3.0-Medium+INT8 | 3226.26(2.9x) | 56.99(-0.46) | 889.33(2.4x) | 92.70(-0.34) | 348.84(2.4x) | 66.32(-0.63 | +| ERNIE 3.0-Medium+裁剪+FP32 | 1424.01(1.3x) | 57.31(-0.14) | 454.27(1.2x) | 93.27(+0.23) | 183.77(1.3x) | 65.92(-1.03) | +| ERNIE 3.0-Medium+裁剪+INT8 | 3635.48(3.2x) | 57.26(-0.19) | 1105.26(3.0x) | 93.20(+0.16) | 444.27(3.0x) | 66.17(-0.78) | + +(以上数据来自 [ERNIE 3.0 性能测试文档](../model_zoo/ernie-3.0/#性能测试),文档包含测试环境介绍) + +##### UIE 压缩效果 + +以报销工单信息抽取任务为例,使用 `uie-base` 进行微调,先得到原始 FP32 模型,然后使用 QAT 策略进一步量化。量化后的模型比原始 FP32 模型的 F1 值高 2.19。 + +| Models | F1 | +| ------------- |:------------:| +| uie-base+微调+FP32 | 91.93 | +| uie-base+微调+量化+INT8 | 94.12 | + + + + +### 三大场景快速启动模型压缩示例 + +本项目提供了压缩 API 在分类(包含文本分类、文本匹配、自然语言推理、代词消歧等任务)、序列标注、抽取式阅读理解三大场景下的使用样例,可以分别参考 [ERNIE 3.0](../model_zoo/ernie-3.0/) 目录下的 [compress_seq_cls.py](../model_zoo/ernie-3.0/compress_seq_cls.py) 
、[compress_token_cls.py](../model_zoo/ernie-3.0/compress_token_cls.py)、[compress_qa.py](../model_zoo/ernie-3.0/compress_qa.py) 脚本,启动方式如下: + +```shell +# 分类任务 +# 该脚本共支持 CLUE 中 7 个分类任务,超参不全相同,因此分类任务中的超参配置利用 config.yml 配置 +python compress_seq_cls.py \ + --dataset "clue tnews" \ + --model_name_or_path best_models/TNEWS \ + --output_dir ./ + +# 序列标注任务 +python compress_token_cls.py \ + --dataset "msra_ner" \ + --model_name_or_path best_models/MSRA_NER \ + --output_dir ./ \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --learning_rate 0.00005 \ + --remove_unused_columns False \ + --num_train_epochs 3 + +# 阅读理解任务 +python compress_qa.py \ + --dataset "clue cmrc2018" \ + --model_name_or_path best_models/CMRC2018 \ + --output_dir ./ \ + --max_seq_length 512 \ + --learning_rate 0.00003 \ + --num_train_epochs 8 \ + --per_device_train_batch_size 24 \ + --per_device_eval_batch_size 24 \ + --max_answer_length 50 \ + +``` + +示例代码中压缩使用的是 datasets 内置的数据集,若想要使用自定义数据集压缩,可参考 [datasets 加载自定义数据集文档](https://huggingface.co/docs/datasets/loading)。 + + + + +## 四步启动模型压缩 + +### 环境依赖 + +- paddlepaddle-gpu >=2.4.1 +- paddlenlp >= 2.5 +- paddleslim >= 2.4.0 + +模型压缩 API 中的压缩功能依赖最新的 `paddleslim` 包。可运行以下命令安装: + +```shell +pip install paddleslim -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +模型压缩 API 的使用大致分为四步: + +- Step 1: 使用 `PdArgumentParser` 解析从命令行传入的超参数,以获取压缩参数 `compression_args`; +- Step 2: 实例化 Trainer 并调用 `compress()` 压缩 API +- Step 3: 实现自定义评估函数和 loss 计算函数(按需可选),以适配自定义压缩任务 +- Step 4:传参并运行压缩脚本 + +**示例代码** + +```python +from paddlenlp.trainer import PdArgumentParser, CompressionArguments + +# Step1: 使用 `PdArgumentParser` 解析从命令行传入的超参数,以获取压缩参数 `compression_args`; +parser = PdArgumentParser(CompressionArguments) +compression_args = parser.parse_args_into_dataclasses() + +# Step2: 实例化 Trainer 并调用 compress() +trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + criterion=criterion) + +# Step 3: 使用内置模型和评估方法,则不需要实现自定义评估函数和 loss 计算函数 +trainer.compress() +``` + +```shell +# Step4: 传参并运行压缩脚本 +python compress.py \ + --output_dir ./compress_models \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --num_train_epochs 4 \ + --width_mult_list 0.75 \ + --batch_size_list 4 8 16 \ + --batch_num_list 1 \ + +``` + + + + +### Step 1:获取模型压缩参数 compression_args + +使用 `PdArgumentParser` 对象解析从命令行得到的超参数,从而得到 `compression_args`,并将 `compression_args` 传给 `Trainer` 对象。获取 `compression_args` 的方法通常如下: + +```python +from paddlenlp.trainer import PdArgumentParser, CompressionArguments + +# Step1: 使用 `PdArgumentParser` 解析从命令行传入的超参数,以获取压缩参数 `compression_args`; +parser = PdArgumentParser(CompressionArguments) +compression_args = parser.parse_args_into_dataclasses() +``` + + + +### Step 2:实例化 Trainer 并调用 compress + + + +#### Trainer 实例化参数介绍 + +- **--model** 待压缩的模型,目前支持 ERNIE、BERT、RoBERTa、ERNIE-M、ELECTRA、ERNIE-Gram、PP-MiniLM、TinyBERT 等结构相似的模型,是在下游任务中微调后的模型,当预训练模型选择 ERNIE 时,需要继承 `ErniePretrainedModel`。以分类任务为例,可通过`AutoModelForSequenceClassification.from_pretrained(model_name_or_path)` 等方式来获取,这种情况下,`model_name_or_path`目录下需要有 model_config.json, model_state.pdparams 文件; +- **--data_collator** 三类任务均可使用 PaddleNLP 预定义好的 [DataCollator 类](../paddlenlp/data/data_collator.py),`data_collator` 可对数据进行 `Pad` 等操作。使用方法参考 [示例代码](../model_zoo/ernie-3.0/compress_seq_cls.py) 即可; +- **--train_dataset** 裁剪训练需要使用的训练集,是任务相关的数据。自定义数据集的加载可参考 [文档](https://huggingface.co/docs/datasets/loading)。不启动裁剪时,可以为 None; 
+- **--eval_dataset** 裁剪训练使用的评估集,也是量化使用的校准数据,是任务相关的数据。自定义数据集的加载可参考 [文档](https://huggingface.co/docs/datasets/loading)。是 Trainer 的必选参数; +- **--tokenizer** 模型 `model` 对应的 `tokenizer`,可使用 `AutoTokenizer.from_pretrained(model_name_or_path)` 来获取。 +- **--criterion** 模型的 loss 计算方法,可以是一个 nn.Layer 对象,也可以是一个函数,用于在 ofa_utils.py 计算模型的 loss 用于计算梯度从而确定神经元重要程度。 + +其中,`criterion` 函数定义示例: + +```python +# 支持的形式一: +def criterion(logits, labels): + loss_fct = paddle.nn.BCELoss() + start_ids, end_ids = labels + start_prob, end_prob = outputs + start_ids = paddle.cast(start_ids, 'float32') + end_ids = paddle.cast(end_ids, 'float32') + loss_start = loss_fct(start_prob, start_ids) + loss_end = loss_fct(end_prob, end_ids) + loss = (loss_start + loss_end) / 2.0 + return loss + +# 支持的形式二: +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, + label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, + label=end_position) + loss = (start_loss + end_loss) / 2 + return loss +``` + +用以上参数实例化 Trainer 对象,之后直接调用 `compress()` 。`compress()` 会根据选择的策略进入不同的分支,以进行裁剪或者量化的过程。 + +**示例代码** + +```python +from paddlenlp.trainer import PdArgumentParser, CompressionArguments + +# Step1: 使用 `PdArgumentParser` 解析从命令行传入的超参数,以获取压缩参数 `compression_args`; +parser = PdArgumentParser(CompressionArguments) +compression_args = parser.parse_args_into_dataclasses() + +# Step2: 实例化 Trainer 并调用 compress() +trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + criterion=criterion) + +trainer.compress() +``` + + + +### Step3:实现自定义评估函数,以适配自定义压缩任务 + +当使用 DynaBERT 裁剪功能时,如果模型、Metrics 不符合下表的情况,那么模型压缩 API 中评估函数需要自定义。 + +目前 DynaBERT 裁剪功能只支持 SequenceClassification 等三类 PaddleNLP 内置 class,并且内置评估器对应为 Accuracy、F1、Squad。 + +| Model class name | SequenceClassification | TokenClassification | QuestionAnswering | +| ---------------- | ------------------------- | --------------------- | ----------------- | +| Metrics | Accuracy | F1 | Squad | + +需要注意以下三个条件: + +- 如果模型是自定义模型,需要继承 `XXXPretrainedModel`,例如当预训练模型选择 ERNIE 时,继承 `ErniePretrainedModel`,模型需要支持调用 `from_pretrained()` 导入模型,且只含 `pretrained_model_name_or_path` 一个必选参数,`forward` 函数返回 `logits` 或者 `tuple of logits`; + +- 如果模型是自定义模型,或者数据集比较特殊,压缩 API 中 loss 的计算不符合使用要求,需要自定义 `custom_evaluate` 评估函数,需要同时支持 `paddleslim.nas.ofa.OFA` 模型和 `paddle.nn.layer` 模型。可参考下方示例代码。 + - 输入`model` 和 `dataloader`,返回模型的评价指标(单个 float 值)。 + - 将该函数传入 `compress()` 中的 `custom_evaluate` 参数; + +`custom_evaluate()` 函数定义示例: + +```python + import paddle + from paddle.metric import Accuracy + + @paddle.no_grad() + def evaluate_seq_cls(self, model, data_loader): + metric = Accuracy() + model.eval() + metric.reset() + for batch in data_loader: + logits = model(input_ids=batch['input_ids'], + token_type_ids=batch['token_type_ids']) + # Supports paddleslim.nas.ofa.OFA model and nn.layer model. 
+ if isinstance(model, paddleslim.nas.ofa.OFA): + logits = logits[0] + correct = metric.compute(logits, batch['labels']) + metric.update(correct) + res = metric.accumulate() + logger.info("acc: %s, " % res) + model.train() + return res +``` + + +在调用 `compress()` 时传入这个自定义函数: + +```python +trainer.compress(custom_evaluate=evaluate_seq_cls) +``` + + + + +### Step 4:传参并运行压缩脚本 + +这一步主要是将压缩需要用到的参数通过命令行传入,并启动压缩脚本。 + +压缩启动命令: + +**示例代码** + +```shell +# Step4: 运行压缩脚本 +python compress.py \ + --output_dir ./compress_models \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --num_train_epochs 4 \ + --width_mult_list 0.75 \ + --batch_size_list 4 8 16 \ + --batch_num_list 1 \ + +``` + +下面会介绍模型压缩启动命令可以传递的超参数。 + + + +#### CompressionArguments 参数介绍 + +`CompressionArguments` 中的参数一部分是模型压缩功能特定参数,另一部分继承自 `TrainingArguments`,是压缩训练时需要设置的超参数。下面会进行具体介绍, + +**公共参数** + +公共参数中的参数和具体的压缩策略无关。 + +- **--strategy** 模型压缩策略,目前支持 `'dynabert+qat+embeddings'`、`'dynabert+qat'`、`'dynabert+embeddings'`、`'dynabert+ptq'`、 `'dynabert'` 、 `'ptq'` 和 `'qat'`。 +其中 `'dynabert'` 代表基于 DynaBERT 的宽度裁剪策略,`'qat'` 表示量化训练,`'ptq'` 表示静态离线量化,`'embeddings'` 表示词表量化,并且 `--strategy` 支持选择它们之间所有合理的策略组合。默认是 `'dynabert+ptq'`; + +- **--output_dir** 模型压缩后模型保存目录; + +- **--input_infer_model_path** 待压缩的静态图模型,该参数是为了支持对静态图模型的压缩。不需使用时可忽略。默认为 `None`; + +- **--input_dtype** 导出模型的输入类型,一般是 `int64` 或者是 `int32`。默认为 `int64`; + +**DynaBERT 裁剪参数** + +当用户使用了 DynaBERT 裁剪、PTQ 量化策略(即策略中包含 'dynabert'、'qat' 时需要传入以下可选参数: + +- **--width_mult_list** 裁剪宽度保留的搜索列表,对 6 层模型推荐 `3/4` ,对 12 层模型推荐 `2/3`,表示对 `q`、`k`、`v` 以及 `ffn` 权重宽度的保留比例,假设 12 层模型原先有 12 个 attention heads,裁剪后只剩 9 个 attention heads。默认是 `[3/4]`; + +- **--per_device_train_batch_size** 用于裁剪训练的每个 GPU/CPU 核心 的 batch 大小。默认是 8; + +- **--per_device_eval_batch_size** 用于裁剪评估的每个 GPU/CPU 核心 的 batch 大小。默认是 8; + +- **--num_train_epochs** 裁剪训练所需要的 epochs 数。默认是 3.0; + +- **--max_steps** 如果设置为正数,则表示要执行的训练步骤总数。覆盖 `num_train_epochs`。默认为 -1; + +- **--logging_steps** 两个日志之间的更新步骤数。默认为 500; + +- **--save_steps** 评估模型的步数。默认为 100; + +- **--optim** 裁剪训练使用的优化器名称,默认为adamw,默认为 'adamw'; + +- **--learning_rate** 裁剪训练使用优化器的初始学习率,默认为 5e-05; + +- **--weight_decay** 除了所有 bias 和 LayerNorm 权重之外,应用于所有层裁剪训练时的权重衰减数值。 默认为 0.0; + +- **--adam_beta1** 裁剪训练使用 AdamW 的优化器时的 beta1 超参数。默认为 0.9; + +- **--adam_beta2** 裁剪训练使用 AdamW 优化器时的 beta2 超参数。默认为 0.999; + +- **--adam_epsilon** 裁剪训练使用 AdamW 优化器时的 epsilon 超参数。默认为 1e-8; + +- **--max_grad_norm** 最大梯度范数(用于梯度裁剪)。默认为 1.0; + +- **--lr_scheduler_type** 要使用的学习率调度策略。默认为 'linear'; + +- **--warmup_ratio** 用于从 0 到 `learning_rate` 的线性 warmup 的总训练步骤的比例。 默认为 0.0; + +- **--warmup_steps** 用于从 0 到 `learning_rate` 的线性 warmup 的步数。覆盖 warmup_ratio 参数。默认是 0; + +- **--seed** 设置的随机种子。为确保多次运行的可复现性。默认为 42; + +- **--device** 运行的设备名称。支持 cpu/gpu。默认为 'gpu'; + +- **--remove_unused_columns** 是否去除 Dataset 中不用的字段数据。默认是 True; + +**量化公共参数** + + +**PTQ 量化参数** + +当用户使用了 PTQ 量化策略时需要传入以下可选参数: + +- **--algo_list** 量化策略搜索列表,目前支持 `'KL'`、`'abs_max'`、`'min_max'`、`'avg'`、`'hist'`、`'mse'` 和 `'emd'`,不同的策略计算量化比例因子的方法不同。建议传入多种策略,可批量得到由多种策略产出的多个量化模型,可从中选择效果最优模型。ERNIE 类模型较推荐 `'hist'`, `'mse'`, `'KL'`,`'emd'` 等策略。默认是 ['mse', 'KL']; + +- **--batch_num_list** batch_nums 的超参搜索列表,batch_nums 表示采样需要的 batch 数。校准数据的总量是 batch_size * batch_nums。如 batch_num 为 None,则 data loader 提供的所有数据均会被作为校准数据。默认是 [1]; + +- **--batch_size_list** 校准样本的 batch_size 搜索列表。并非越大越好,也是一个超参数,建议传入多种校准样本数,最后可从多个量化模型中选择最优模型。默认是 `[4]`; + +- **--weight_quantize_type** 权重的量化类型,支持 `'abs_max'` 和 `'channel_wise_abs_max'` 两种方式。通常使用 'channel_wise_abs_max', 这种方法得到的模型通常精度更高; + +- 
**activation_quantize_type** 激活 tensor 的量化类型。支持 'abs_max', 'range_abs_max' 和 'moving_average_abs_max'。在 'ptq' 策略中,默认是 'range_abs_max'; + +- **--round_type** 权重值从 FP32 到 INT8 的转化方法,目前支持 `'round'` 和 '[adaround](https://arxiv.org/abs/2004.10568.)',默认是 `'round'`; + +- **--bias_correction** 如果是 True,表示使用 [bias correction](https://arxiv.org/abs/1810.05723) 功能,默认为 False。 + +**QAT 量化参数** + +当用户使用了 QAT 量化策略时,除了可以设置上面训练相关的参数,还可以传入以下可选参数: + +- **--weight_quantize_type** 权重的量化类型,支持 `'abs_max'` 和 `'channel_wise_abs_max'` 两种方式。通常使用 'channel_wise_abs_max', 这种方法得到的模型通常精度更高; + +- **activation_quantize_type** 激活 tensor 的量化类型。支持 'abs_max', 'range_abs_max' 和 'moving_average_abs_max'。在'qat'策略中,它默认是 'moving_average_abs_max'; + +- **use_pact** 是否使用 PACT 量化策略,是对普通方法的改进,参考论文[PACT: Parameterized Clipping Activation for Quantized Neural Networks](https://arxiv.org/abs/1805.06085),打开后精度更高,默认是 True。 + +- **moving_rate** 'moving_average_abs_max' 量化方法中的衰减系数,默认为 0.9; + + + +## 模型评估与部署 + +裁剪、量化后的模型不能再通过 `from_pretrained` 导入进行预测,而是需要使用 Paddle 部署工具才能完成预测。 + +压缩后的模型部署可以参考 [部署文档](../model_zoo/ernie-3.0/deploy) 完成。 + +### Python 部署 + +服务端部署可以从这里开始。可以参考 [seq_cls_infer.py](../model_zoo/ernie-3.0/deploy/python/seq_cls_infer.py) 或者 [token_cls_infer.py](../model_zoo/ernie-3.0/deploy/python/token_cls_infer.py) 来编写自己的预测脚本。并根据 [Python 部署指南](../model_zoo/ernie-3.0/deploy/python/README.md) 的介绍安装预测环境,对压缩后的模型进行精度评估、性能测试以及部署。 + + + + +### 服务化部署 + +- [FastDeploy ERNIE 3.0 模型 Serving 部署示例](../model_zoo/ernie-3.0/deploy/serving/README.md) +- [基于PaddleNLP SimpleServing 的服务化部署](../model_zoo/ernie-3.0/deploy/simple_serving/README.md) + +### 移动端部署 + + + + +## FAQ + +**Q:模型压缩需要数据吗?** + +A:DynaBERT 裁剪和量化训练 QAT 需要使用训练集进行训练,验证集进行评估,其过程类似微调;静态离线量化 PTQ 只需要验证集(对样本量要求较低,一般 4-16 个样本就可能可以满足要求); + +**Q:示例代码里是内置的数据集,如何使用我自己的数据呢** + +A:可以参考 UIE 的例子,也可以参考 [datasets 加载自定义数据集文档](https://huggingface.co/docs/datasets/loading); + +**Q:模型压缩后的模型还能继续训练吗?** + +A:模型压缩主要用于推理加速,因此压缩后的模型都是静态图(预测)模型,不能再通过 `from_pretrained()` API 导入继续训练; + +**Q:裁剪和量化怎么选?** + +A:可以设置参数 `--strategy` 来选择压缩的策略,默认是裁剪和量化同时选择,先裁剪后量化。目前裁剪策略有训练过程,需要下游任务的训练数据,其训练时间视下游任务数据量而定,且和微调的训练时间是一个量级。静态离线量化则不需要额外的训练,更快,通常来说量化的加速比比裁剪更明显。建议裁剪和量化同时选择,有些情况下可能比单独量化效果更好; + +**Q:裁剪中也有训练过程吗?** + +A:DynaBERT 裁剪类似蒸馏过程,也会有模型训练时用到的超参,方便起见,可以直接使用微调时所用的最佳的超参。如果想进一步提升精度,可以对 `batch_size`、`learning_rate`、`epoch` 等超参数进行 Grid Search; + +**Q:使用 `TensorDataset` 对象做量化报错了,为什么?** + +A:使用量化时,`eval_dataset` 不可以是 `TensorDataset` 对象,因为量化功能内部在静态图模式下执行,而 `TensorDataset` 只能在动态图下使用,两者同时使用会导致错误; + + + +## 参考文献 +- Hou L, Huang Z, Shang L, Jiang X, Chen X and Liu Q. DynaBERT: Dynamic BERT with Adaptive Width and Depth[J]. arXiv preprint arXiv:2004.04037, 2020. + +- Cai H, Gan C, Wang T, Zhang Z, and Han S. Once for all: Train one network and specialize it for efficient deployment[J]. arXiv preprint arXiv:1908.09791, 2020. + +- Wu H, Judd P, Zhang X, Isaev M and Micikevicius P. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation[J]. arXiv preprint arXiv:2004.09602v1, 2020. diff --git a/docs/conf.py b/docs/conf.py new file mode 100644 index 0000000000000000000000000000000000000000..7477c20c7cbe42a9d44925601260336b5c06214c --- /dev/null +++ b/docs/conf.py @@ -0,0 +1,111 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Configuration file for the Sphinx documentation builder. +# +# This file only contains a selection of the most common options. For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. +# +import os +import sys + +sys.path.insert(0, os.path.abspath("../..")) +sys.path.append(os.path.abspath("..")) + +# -- Project information ----------------------------------------------------- + +project = "PaddleNLP" +copyright = "2023, PaddleNLP" +author = "PaddleNLP" +default_role = "py:obj" + +# -- General configuration --------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ + "sphinx_rtd_theme", + "recommonmark", + "sphinx.ext.autodoc", + "sphinx.ext.autosummary", + "sphinx.ext.napoleon", + "sphinx_copybutton", + "sphinx_markdown_tables", + "sphinx.ext.viewcode", + "sphinx.ext.coverage", + "sphinx.ext.extlinks", +] + +autodoc_default_options = { + "member-order": "bysource", + "undoc-members": False, +} + +autosummary_generate = True + +# Add any paths that contain templates here, relative to this directory. +templates_path = ["_templates"] + +# The language for content autogenerated by Sphinx. Refer to documentation +# for a list of supported languages. +# +# This is also used if you do content translation via gettext catalogs. +# Usually you set "language" from the command line for these cases. +locale_dirs = ["locale/"] +gettext_compact = False +language = "zh_CN" +add_module_names = False + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. +exclude_patterns = [] + +# source_parsers = { +# '.md': recommonmark.parser.CommonMarkParser, +# } +source_suffix = [".rst", ".md"] + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = "sphinx_book_theme" + +html_theme_options = { + "collapse_navigation": True, + "display_version": True, + "navigation_depth": 5, + "navigation_with_keys": True, + "body_max_width": "80%", +} + +html_css_files = [ + "custom.css", +] + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". 
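+# Note: entries listed in html_css_files above (e.g. custom.css) are resolved
+# against these static paths, so custom.css is expected to live under _static.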
+html_static_path = ["_static"] +html_logo = "paddle.png" diff --git a/docs/data.md b/docs/data.md new file mode 100644 index 0000000000000000000000000000000000000000..32d0eef2296da1e4d4008f60ba04ad31f8cdbe4a --- /dev/null +++ b/docs/data.md @@ -0,0 +1,237 @@ +# PaddleNLP Data API + +该模块提供了在NLP任务中构建有效的数据处理Pipeline的常用API。 + +## APIl列表 + +| API | 简介 | +| ------------------------------- | :----------------------------------------- | +| `paddlenlp.data.Stack` | 堆叠N个具有相同shape的输入数据来构建一个batch | +| `paddlenlp.data.Pad` | 堆叠N个输入数据来构建一个batch,每个输入数据将会被padding到N个输入数据中最大的长度 | +| `paddlenlp.data.Tuple` | 将多个batchify函数包装在一起,组成tuple | +| `paddlenlp.data.Dict` | 将多个batchify函数包装在一起,组成dict | +| `paddlenlp.data.SamplerHelper` | 构建用于`Dataloader`的可迭代sampler | +| `paddlenlp.data.Vocab` | 用于文本token和ID之间的映射 | +| `paddlenlp.data.JiebaTokenizer` | Jieba分词 | + +## API使用方法 + +以上API都是用来辅助构建`DataLoader`,`DataLoader`比较重要的三个初始化参数是`dataset`、`batch_sampler`和`collate_fn`。 + +`paddlenlp.data.Vocab`和`paddlenlp.data.JiebaTokenizer`用在构建`dataset`时处理文本token到ID的映射。 + +`paddlenlp.data.SamplerHelper`用于构建可迭代的`batch_sampler`。 + +`paddlenlp.data.Stack`、`paddlenlp.data.Pad`、`paddlenlp.data.Tuple`和`paddlenlp.data.Dict`用于构建生成mini-batch的`collate_fn`函数。 + +### 数据预处理 + +#### `paddlenlp.data.Vocab` + +`paddlenlp.data.Vocab`词表类,集合了一系列文本token与ids之间映射的一系列方法,支持从文件、字典、json等一系方式构建词表。 + +```python +from paddlenlp.data import Vocab +# 从文件构建 +vocab1 = Vocab.load_vocabulary(vocab_file_path) +# 从字典构建 +# dic = {'unk':0, 'pad':1, 'bos':2, 'eos':3, ...} +vocab2 = Vocab.from_dict(dic) +# 从json构建,一般是已构建好的Vocab对象先保存为json_str或json文件后再进行恢复 +# json_str方式 +json_str = vocab1.to_json() +vocab3 = Vocab.from_json(json_str) +# json文件方式 +vocab1.to_json(json_file_path) +vocab4 = Vocab.from_json(json_file_path) +``` + +#### `paddlenlp.data.JiebaTokenizer` + +`paddlenlp.data.JiebaTokenizer`初始化需传入`paddlenlp.data.Vocab`类,包含`cut`分词方法和将句子明文转换为ids的`encode`方法。 + +```python +from paddlenlp.data import Vocab, JiebaTokenizer +# 词表文件路径,运行示例程序可先下载词表文件 +# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt +vocab_file_path = './senta_word_dict.txt' +# 构建词表 +vocab = Vocab.load_vocabulary( + vocab_file_path, + unk_token='[UNK]', + pad_token='[PAD]') +tokenizer = JiebaTokenizer(vocab) +tokens = tokenizer.cut('我爱你中国') # ['我爱你', '中国'] +ids = tokenizer.encode('我爱你中国') # [1170578, 575565] +``` + +### 构建`Sampler` + +#### `paddlenlp.data.SamplerHelper` + +`paddlenlp.data.SamplerHelper`的作用是构建用于`DataLoader`的可迭代采样器,它包含`shuffle`、`sort`、`batch`、`shard`等一系列方法,方便用户灵活使用。 + +```python +from paddlenlp.data import SamplerHelper +from paddle.io import Dataset + +class MyDataset(Dataset): + def __init__(self): + super(MyDataset, self).__init__() + self.data = [ + [[1, 2, 3, 4], [1]], + [[5, 6, 7], [0]], + [[8, 9], [1]], + ] + + def __getitem__(self, index): + data = self.data[index][0] + label = self.data[index][1] + return data, label + + def __len__(self): + return len(self.data) + +dataset = MyDataset() +# SamplerHelper返回的是数据索引的可迭代对象,产生的迭代的索引为:[0, 1, 2] +sampler = SamplerHelper(dataset) +# `shuffle()`的作用是随机打乱索引顺序,产生的迭代的索引为:[0, 2, 1] +sampler = sampler.shuffle() +# sort()的作用是按照指定key为排序方式并在buffer_size大小个样本中排序 +# 示例中以样本第一个字段的长度进行升序排序,产生的迭代的索引为:[2, 0, 1] +key = (lambda x, data_source: len(data_source[x][0])) +sampler = sampler.sort(key=key, buffer_size=2) +# batch()的作用是按照batch_size组建mini-batch,产生的迭代的索引为:[[2, 0], [1]] +sampler = sampler.batch(batch_size=2) +# shard()的作用是为多卡训练切分数据集,当前卡产生的迭代的索引为:[[2, 0]] +sampler = sampler.shard(num_replicas=2) +``` + +### 构建`collate_fn` + +#### 
`paddlenlp.data.Stack` + +`paddlenlp.data.Stack`用来组建batch,其输入必须具有相同的shape,输出便是这些输入的堆叠组成的batch数据。 + +```python +from paddlenlp.data import Stack +a = [1, 2, 3, 4] +b = [3, 4, 5, 6] +c = [5, 6, 7, 8] +result = Stack()([a, b, c]) +""" +[[1, 2, 3, 4], + [3, 4, 5, 6], + [5, 6, 7, 8]] +""" +``` + +#### `paddlenlp.data.Pad` + +`paddlenlp.data.Pad`用来组建batch,它的输入长度不同,它首先会将输入数据全部padding到最大长度,然后再堆叠组成batch数据输出。 + +```python +from paddlenlp.data import Pad +a = [1, 2, 3, 4] +b = [5, 6, 7] +c = [8, 9] +result = Pad(pad_val=0)([a, b, c]) +""" +[[1, 2, 3, 4], + [5, 6, 7, 0], + [8, 9, 0, 0]] +""" +``` + +#### `paddlenlp.data.Tuple` + +`paddlenlp.data.Tuple`会将多个组batch的函数包装在一起,组成tuple。 + +```python +from paddlenlp.data import Stack, Pad, Tuple +data = [ + [[1, 2, 3, 4], [1]], + [[5, 6, 7], [0]], + [[8, 9], [1]], + ] +batchify_fn = Tuple(Pad(pad_val=0), Stack()) +ids, label = batchify_fn(data) +""" +ids: +[[1, 2, 3, 4], + [5, 6, 7, 0], + [8, 9, 0, 0]] +label: [[1], [0], [1]] +""" +``` + +#### `paddlenlp.data.Dict` + +`paddlenlp.data.Dict`会将多个组batch的函数包装在一起,组成dict。 + +```python +from paddlenlp.data import Stack, Pad, Dict +data = [ + {'labels':[1], 'token_ids':[1, 2, 3, 4]}, + {'labels':[0], 'token_ids':[5, 6, 7]}, + {'labels':[1], 'token_ids':[8, 9]}, + ] +batchify_fn = Dict({'token_ids':Pad(pad_val=0), 'labels':Stack()}) +ids, label = batchify_fn(data) +""" +ids: +[[1, 2, 3, 4], + [5, 6, 7, 0], + [8, 9, 0, 0]] +label: [[1], [0], [1]] +""" +``` + +### 综合示例 + +```python +from paddlenlp.data import Vocab, JiebaTokenizer, Stack, Pad, Tuple, SamplerHelper +from paddlenlp.datasets import load_dataset +from paddle.io import DataLoader + +# 词表文件路径,运行示例程序可先下载词表文件 +# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt +vocab_file_path = './senta_word_dict.txt' +# 构建词表 +vocab = Vocab.load_vocabulary( + vocab_file_path, + unk_token='[UNK]', + pad_token='[PAD]') +# 初始化分词器 +tokenizer = JiebaTokenizer(vocab) + +def convert_example(example): + text, label = example['text'], example['label'] + ids = tokenizer.encode(text) + label = [label] + return ids, label + +dataset = load_dataset('chnsenticorp', splits='train') +dataset = dataset.map(convert_example, lazy=True) + +pad_id = vocab.token_to_idx[vocab.pad_token] +batchify_fn = Tuple( + Pad(axis=0, pad_val=pad_id), # ids + Stack(dtype='int64') # label +) + +batch_sampler = SamplerHelper(dataset).shuffle().batch(batch_size=16) +data_loader = DataLoader( + dataset, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + return_list=True) + +# 测试数据集 +for batch in data_loader: + ids, label = batch + print(ids.shape, label.shape) + print(ids) + print(label) + break +``` diff --git a/docs/data_prepare/data_preprocess.rst b/docs/data_prepare/data_preprocess.rst new file mode 100644 index 0000000000000000000000000000000000000000..cebf4102e2eca1b65ae4093f289c536ff7bd9a8d --- /dev/null +++ b/docs/data_prepare/data_preprocess.rst @@ -0,0 +1,212 @@ +================ +数据处理 +================ + +Dataset中通常为原始数据,需要经过一定的数据处理并进行采样组batch,而后通过 :class:`paddle.io.DataLoader` 为训练或预测使用,PaddleNLP中为其中各环节提供了相应的功能支持。 + +基于预训练模型的数据处理 +------------------------ + +在使用预训练模型做NLP任务时,需要加载对应的Tokenizer,PaddleNLP在 :class:`PreTrainedTokenizer` 中内置的 :func:`__call__` 方法可以实现基础的数据处理功能。PaddleNLP内置的所有预训练模型的Tokenizer都继承自 :class:`PreTrainedTokenizer` ,下面以BertTokenizer举例说明: + +.. 
code-block:: + + from paddlenlp.transformers import BertTokenizer + + tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') + + # 单句转换(单条数据) + print(tokenizer(text='天气不错')) # {'input_ids': [101, 1921, 3698, 679, 7231, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0]} + + # 句对转换(单条数据) + print(tokenizer(text='天气',text_pair='不错')) # {'input_ids': [101, 1921, 3698, 102, 679, 7231, 102], 'token_type_ids': [0, 0, 0, 0, 1, 1, 1]} + + # 单句转换(多条数据) + print(tokenizer(text=['天气','不错'])) # [{'input_ids': [101, 1921, 3698, 102], 'token_type_ids': [0, 0, 0, 0]}, + # {'input_ids': [101, 679, 7231, 102], 'token_type_ids': [0, 0, 0, 0]}] + +关于 :func:`__call__` 方法的其他参数和功能,请查阅PreTrainedTokenizer。 + +paddlenlp内置的 :class:`paddlenlp.datasets.MapDataset` 的 :func:`map` 方法支持传入一个函数,对数据集内的数据进行统一转换。下面我们以 :obj:`LCQMC` 的数据处理流程为例: + +.. code-block:: + + from paddlenlp.transformers import BertTokenizer + from paddlenlp.datasets import load_dataset + + tokenizer = BertTokenizer.from_pretrained('bert-base-chinese') + train_ds = load_dataset('lcqmc', splits='train') + + print(train_ds[0]) # {'query': '喜欢打篮球的男生喜欢什么样的女生', 'title': '爱打篮球的男生喜欢什么样的女生', 'label': 1} + +可以看到, :obj:`LCQMC` 是一个句对匹配任务,即判断两个句子的意思是否相似的2分类任务。我们需要处理的是key为 **query** 和 **title** 的文本数据,我们编写基于 :class:`PreTrainedTokenizer` 的数据处理函数并传入数据集的 :func:`map` 方法。 + +.. code-block:: + + def convert_example(example, tokenizer): + tokenized_example = tokenizer( + text=example['query'], + text_pair=example['title']) + # 加上label用于训练 + tokenized_example['label'] = [example['label']] + return tokenized_example + + from functools import partial + + trans_func = partial( + convert_example, + tokenizer=tokenizer) + + train_ds.map(trans_func) + print(train_ds[0]) # {'input_ids': [101, 1599, 3614, 2802, 5074, 4413, 4638, 4511, 4495, + # 1599, 3614, 784, 720, 3416, 4638, 1957, 4495, 102, + # 4263, 2802, 5074, 4413, 4638, 4511, 4495, 1599, 3614, + # 784, 720, 3416, 4638, 1957, 4495, 102], + # 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + # 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + # 'label': [1]} + +可以看到,数据集中的文本数据已经被处理成了模型可以接受的 *feature* 。 + +:func:`map` 方法有一个重要的参数 :attr:`batched`,当设置为 :obj:`True` 时(默认为 :obj:`False` ),数据处理函数 :func:`trans_func` 的输入不再是单条数据,而是数据集的所有数据: + +.. 
code-block:: + + def convert_examples(examples, tokenizer): + querys = [example['query'] for example in examples] + titles = [example['title'] for example in examples] + tokenized_examples = tokenizer(text=querys, text_pair=titles,return_dict=False) + + # 加上label用于训练 + for idx in range(len(tokenized_examples)): + tokenized_examples[idx]['label'] = [examples[idx]['label']] + + return tokenized_examples + + from functools import partial + + trans_func = partial(convert_examples, tokenizer=tokenizer) + + train_ds.map(trans_func, batched=True) + print(train_ds[0]) # {'input_ids': [101, 1599, 3614, 2802, 5074, 4413, 4638, 4511, 4495, + # 1599, 3614, 784, 720, 3416, 4638, 1957, 4495, 102, + # 4263, 2802, 5074, 4413, 4638, 4511, 4495, 1599, 3614, + # 784, 720, 3416, 4638, 1957, 4495, 102], + # 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + # 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + # 'label': [1]} + +可以看到,在本例中两种实现的结果是相同的。但是在诸如阅读理解,对话等任务中,一条原始数据可能会产生多个 *feature* 的情况(参见 `run_squad.py `__ )通常需要将 :attr:`batched` 参数设置为 :obj:`True` 。 + +:func:`map` 方法还有一个 :attr:`num_workers` 参数,当其大于0时进行多进程数据处理,可以提高处理速度。但是需要注意如果在数据处理的函数中用到了 **数据index** 的相关信息,多进程处理可能会导致错误的结果。 + +关于 :func:`map` 方法的其他参数和 :class:`paddlenlp.datasets.MapDataset` 的其他数据处理方法,请查阅 :doc:`dataset <../source/paddlenlp.datasets.dataset>` 。 + +Batchify +----------- + +PaddleNLP内置了多种collate function,配合 :class:`paddle.io.BatchSampler` 可以协助用户简单的完成组batch的操作。 + +我们继续以 :obj:`LCQMC` 的数据处理流程为例。从上一节最后可以看到,处理后的单条数据是一个 **字典** ,包含 `input_ids` , `token_type_ids` 和 `label` 三个key。 + +其中 `input_ids` 和 `token_type_ids` 是需要进行 **padding** 操作后输入模型的,而 `label` 是需要 **stack** 之后传入loss function的。 + +因此,我们使用PaddleNLP内置的 :func:`Dict` ,:func:`Stack` 和 :func:`Pad` 函数整理batch中的数据。最终的 :func:`batchify_fn` 如下: + +.. code-block:: + + from paddlenlp.data import Dict, Stack, Pad + + # 使用Dict函数将Pad,Stack等函数与数据中的键值相匹配 + train_batchify_fn = lambda samples, fn=Dict({ + 'input_ids': Pad(axis=0, pad_val=tokenizer.pad_token_id), + 'token_type_ids': Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + 'label': Stack(dtype="int64") + }): fn(samples) + +之后使用 :class:`paddle.io.BatchSampler` 和 :func:`batchify_fn` 构建 :class:`paddle.io.DataLoader` : + +.. code-block:: + + from paddle.io import DataLoader, BatchSampler + + train_batch_sampler = BatchSampler(train_ds, batch_size=2, shuffle=True) + + train_data_loader = DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn) + +到此,一个完整的数据准备流程就完成了。关于更多batchify方法,请查阅 :doc:`collate <../source/paddlenlp.data.collate>`。 + +.. note:: + + - 当需要进行 **单机多卡** 训练时,需要将 :class:`BatchSampler` 更换为 :class:`DistributedBatchSampler` 。更多有关 :class:`paddle.io.BatchSampler` 的信息,请查阅 `BatchSampler `_。 + + - 当需要诸如batch内排序,按token组batch等更复杂的组batch功能时。可以使用PaddleNLP内置的 :class:`SamplerHelper` 。相关用例请参考 `reader.py `__。 + +基于非预训练模型的数据处理 +------------------------- + +在使用非预训练模型做NLP任务时,我们可以借助PaddleNLP内置的 :class:`JiebaTokenizer` 和 :class:`Vocab` 完成数据处理的相关功能,整体流程与使用预训练模型基本相似。我们以中文情感分析 :obj:`ChnSentiCorp` 数据集为例: + +.. 
code-block:: + + from paddlenlp.data import JiebaTokenizer, Vocab + from paddlenlp.datasets import load_dataset + + train_ds = load_dataset('chnsenticorp', splits='train') + + print(train_ds[0]) # {'text': '选择珠江花园的原因就是方便,有电动扶梯直接到达海边,周围餐馆、食廊、商场、超市、摊位一应俱全。 + # 酒店装修一般,但还算整洁。 泳池在大堂的屋顶,因此很小,不过女儿倒是喜欢。 包的早餐是西式的,还算丰富。 + # 服务吗,一般', 'label': 1} + + # 从本地词典文件构建Vocab + vocab = Vocab.load_vocabulary('./senta_word_dict.txt', unk_token='[UNK]', pad_token='[PAD]') + + # 使用Vocab初始化JiebaTokenizer + tokenizer = JiebaTokenizer(vocab) + +.. note:: + + - :class:`Vocab` 除了可以从本地词典文件初始化之外,还提供多种初始化方法,包括从 :class:`dictionary` 创建、从数据集创建等。详情请查阅Vocab。 + - 除了使用内置的 :class:`JiebaTokenizer` 外,用户还可以使用任何自定义的方式或第三方库进行分词,之后使用 :func:`Vocab.to_indices` 方法将token转为id。 + +之后与基于预训练模型的数据处理流程相似,编写数据处理函数并传入 :func:`map` 方法: + +.. code-block:: + + def convert_example(example, tokenizer): + input_ids = tokenizer.encode(example["text"]) + valid_length = [len(input_ids)] + label = [example["label"]] + return input_ids, valid_length, label + + trans_fn = partial(convert_example, tokenizer=tokenizer) + train_ds.map(trans_fn) + + print(train_ds[0]) # ([417329, 128448, 140437, 173188, 118001, 213058, 595790, 1106339, 940533, 947744, 169206, + # 421258, 908089, 982848, 1106339, 35413, 1055821, 4782, 377145, 4782, 238721, 4782, 642263, + # 4782, 891683, 767091, 4783, 672971, 774154, 1250380, 1106339, 340363, 146708, 1081122, + # 4783, 1, 943329, 1008467, 319839, 173188, 909097, 1106339, 1010656, 261577, 1110707, + # 1106339, 770761, 597037, 1068649, 850865, 4783, 1, 993848, 173188, 689611, 1057229, 1239193, + # 173188, 1106339, 146708, 427691, 4783, 1, 724601, 179582, 1106339, 1250380], + # [67], + # [1]) + + +可以看到,原始数据已经被处理成了 *feature* 。但是这里我们发现单条数据并不是一个 **字典** ,而是 **元组** 。所以我们的 :func:`batchify_fn` 也要相应的做一些调整: + +.. 
code-block:: + + from paddlenlp.data import Tuple, Stack, Pad + + # 使用Tuple函数将Pad,Stack等函数与数据中的键值相匹配 + train_batchify_fn = lambda samples, fn=Tuple(( + Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 0)), # input_ids + Stack(dtype="int64"), # seq len + Stack(dtype="int64") # label + )): fn(samples) + +可以看到,:func:`Dict` 函数是将单条数据中的键值与 :func:`Pad` 等函数进行对应,适用于单条数据是字典的情况。而 :func:`Tuple` 是通过单条数据中不同部分的index进行对应的。 + +所以需要 **注意** 的是 :func:`convert_example` 方法和 :func:`batchify_fn` 方法的匹配。 + +之后的流程与基于预训练模型的数据处理相同。 diff --git a/docs/data_prepare/dataset_list.md b/docs/data_prepare/dataset_list.md new file mode 100644 index 0000000000000000000000000000000000000000..cd86cabae768af14acb1e7118c67d550dd28eb5e --- /dev/null +++ b/docs/data_prepare/dataset_list.md @@ -0,0 +1,106 @@ +# PaddleNLP Datasets API + +PaddleNLP提供了以下数据集的快速读取API,实际使用时请根据需要**添加splits信息**: + +## 阅读理解 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | ----- | ------ | +| [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) | 斯坦福问答数据集,包括SQuAD1.1和SQuAD2.0|`paddlenlp.datasets.load_dataset('squad')` | +| [DuReader-yesno](https://aistudio.baidu.com/aistudio/competition/detail/49) | 千言数据集:阅读理解,判断答案极性|`paddlenlp.datasets.load_dataset('dureader_yesno')` | +| [DuReader-robust](https://aistudio.baidu.com/aistudio/competition/detail/49) | 千言数据集:阅读理解,答案原文抽取|`paddlenlp.datasets.load_dataset('dureader_robust')` | +| [CMRC2018](http://hfl-rc.com/cmrc2018/) | 第二届“讯飞杯”中文机器阅读理解评测数据集|`paddlenlp.datasets.load_dataset('cmrc2018')` | +| [DRCD](https://github.com/DRCKnowledgeTeam/DRCD) | 台達閱讀理解資料集|`paddlenlp.datasets.load_dataset('drcd')` | +| [TriviaQA](http://nlp.cs.washington.edu/triviaqa/) | Washington大学问答数据集|`paddlenlp.datasets.load_dataset('triviaqa')` | +| [C3](https://dataset.org/c3/) | 阅读理解单选题 |`paddlenlp.datasets.load_dataset('c3')` | + + +## 文本分类 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [CoLA](https://nyu-mll.github.io/CoLA/) | 单句分类任务,二分类,判断句子是否合法| `paddlenlp.datasets.load_dataset('glue','cola')`| +| [SST-2](https://nlp.stanford.edu/sentiment/index.html) | 单句分类任务,二分类,判断句子情感极性| `paddlenlp.datasets.load_dataset('glue','sst-2')`| +| [MRPC](https://microsoft.com/en-us/download/details.aspx?id=52398) | 句对匹配任务,二分类,判断句子对是否是相同意思| `paddlenlp.datasets.load_dataset('glue','mrpc')`| +| [STSB](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) | 计算句子对相似性,分数为1~5| `paddlenlp.datasets.load_dataset('glue','sts-b')`| +| [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) | 判定句子对是否等效,等效、不等效两种情况,二分类任务| `paddlenlp.datasets.load_dataset('glue','qqp')`| +| [MNLI](http://www.nyu.edu/projects/bowman/multinli/) | 句子对,一个前提,一个是假设。前提和假设的关系有三种情况:蕴含(entailment),矛盾(contradiction),中立(neutral)。句子对三分类问题| `paddlenlp.datasets.load_dataset('glue','mnli')`| +| [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) | 判断问题(question)和句子(sentence)是否蕴含,蕴含和不蕴含,二分类| `paddlenlp.datasets.load_dataset('glue','qnli')`| +| [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) | 判断句对是否蕴含,句子1和句子2是否互为蕴含,二分类任务| `paddlenlp.datasets.load_dataset('glue','rte')`| +| [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) | 判断句子对是否相关,相关或不相关,二分类任务| `paddlenlp.datasets.load_dataset('glue','wnli')`| +| [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html) | A Large-scale Chinese Question Matching Corpus 语义匹配数据集| `paddlenlp.datasets.load_dataset('lcqmc')`| +| [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ChnSentiCorp_htl_all/intro.ipynb) | 中文评论情感分析语料| `paddlenlp.datasets.load_dataset('chnsenticorp')`| +| 
[COTE-DP](https://aistudio.baidu.com/aistudio/competition/detail/50/?isFromLuge=1) | 中文观点抽取语料 | `paddlenlp.datasets.load_dataset('cote', 'dp')`| +| [SE-ABSA16_PHNS](https://aistudio.baidu.com/aistudio/competition/detail/50/?isFromLuge=1) | 中文评价对象级情感分析语料| `paddlenlp.datasets.load_dataset('seabsa16', 'phns')`| +| [AFQMC](https://github.com/CLUEbenchmark/CLUE) | 蚂蚁金融语义相似度数据集,1表示句子1和句子2的含义类似,0表示含义不同| `paddlenlp.datasets.load_dataset('clue', 'afqmc')`| +| [TNEWS](https://github.com/CLUEbenchmark/CLUE) | 今日头条中文新闻(短文本)分类,共15类| `paddlenlp.datasets.load_dataset('clue', 'tnews')`| +| [IFLYTEK](https://github.com/CLUEbenchmark/CLUE) | 长文本分类,共119个类别| `paddlenlp.datasets.load_dataset('clue', 'iflytek')`| +| [OCNLI](https://github.com/cluebenchmark/OCNLI) | 原生中文自然语言推理数据集,句子对三分类问题| `paddlenlp.datasets.load_dataset('clue', 'ocnli')`| +| [CMNLI ](https://github.com/CLUEbenchmark/CLUE) | 中文语言推理任务,判断sentence1和sentence2的关系:蕴含(entailment),矛盾(contradiction),中立(neutral)。句子对三分类问题 | `paddlenlp.datasets.load_dataset('clue', 'cmnli')`| +| [CLUEWSC2020](https://github.com/CLUEbenchmark/CLUE) | WSC Winograd模式挑战中文版,代词消歧任务,二分类任务| `paddlenlp.datasets.load_dataset('clue', 'cluewsc2020')`| +| [CSL](https://github.com/P01son6415/CSL) | 论文关键词识别,判断关键词是否全部为真实关键词,二分类任务 | `paddlenlp.datasets.load_dataset('clue', 'csl')`| +| [EPRSTMT](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的电商产品评论情感分析数据集,Positive、Negative 情感 2 分类任务| `paddlenlp.datasets.load_dataset('fewclue', 'eprstmt')`| +| [CSLDCP](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的中文科学文献学科分类数据集,根据文献的中文摘要判断文献类别,共 67 类别。| `paddlenlp.datasets.load_dataset('fewclue', 'csldcp')`| +| [TNEWSF](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的今日头条中文新闻(短文本)分类,共15类 | `paddlenlp.datasets.load_dataset('fewclue', 'tnews')`| +| [IFLYTEK](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的长文本分类任务,共 119 个类别 | `paddlenlp.datasets.load_dataset('fewclue', 'iflytek')`| +| [OCNLIF](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的中文自然语言推理数据集,句子对三分类问题 | `paddlenlp.datasets.load_dataset('fewclue', 'ocnli')`| +| [BUSTM](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中对话短文本语义匹配数据集, 2 分类任务 | `paddlenlp.datasets.load_dataset('fewclue', ‘bustm')`| +| [CHIDF](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的成语阅读理解填空, 根据文本内容从候选 7 个成语中预测正确的成语 | `paddlenlp.datasets.load_dataset('fewclue', 'chid')`| +| [CSLF](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的论文关键词识别,判断关键词是否全部为真实关键词,二分类任务 | `paddlenlp.datasets.load_dataset('fewclue', 'csl')`| +| [CLUEWSCF](https://github.com/CLUEbenchmark/FewCLUE/tree/main/datasets) | FewCLUE 评测中的 WSC Winograd 模式挑战中文版,代词消歧任务,二分类任务 | `paddlenlp.datasets.load_dataset('fewclue', 'cluewsc')`| +| [THUCNews](https://github.com/gaussic/text-classification-cnn-rnn#%E6%95%B0%E6%8D%AE%E9%9B%86) | THUCNews中文新闻类别分类 | `paddlenlp.datasets.load_dataset('thucnews')` | +| [HYP](https://pan.webis.de/semeval19/semeval19-web/) | 英文政治新闻情感分类语料 | `paddlenlp.datasets.load_dataset('hyp')` | +| [XNLI](https://github.com/facebookresearch/XNLI) | 15种语言自然语言推理数据集,三分类任务. | `paddlenlp.datasets.load_dataset('xnli', 'ar')`| +| [XNLI_CN](https://github.com/facebookresearch/XNLI) | 中文自然语言推理数据集(XNLI的子集),三分类任务. 
| `paddlenlp.datasets.load_dataset('xnli_cn')`| + +## 文本匹配 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [CAIL2019-SCM](https://github.com/china-ai-law-challenge/CAIL2019/tree/master/scm) | 相似法律案例匹配 | `paddlenlp.datasets.load_dataset('cail2019_scm')` | + +## 序列标注 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [MSRA_NER](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra) | MSRA 命名实体识别数据集| `paddlenlp.datasets.load_dataset('msra_ner')`| +| [People's Daily](https://github.com/OYE93/Chinese-NLP-Corpus/tree/master/NER/People's%20Daily) | 人民日报命名实体识别数据集| `paddlenlp.datasets.load_dataset('peoples_daily_ner')`| +| [CoNLL-2002](https://www.aclweb.org/anthology/W02-2024/) | 西班牙语和荷兰语实体识别数据集| `paddlenlp.datasets.load_dataset('conll2002', 'es')`| + + +## 机器翻译 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [IWSLT15](https://workshop2015.iwslt.org/) | IWSLT'15 English-Vietnamese data 英语-越南语翻译数据集| `paddlenlp.datasets.load_dataset('iwslt15')`| +| [WMT14ENDE](http://www.statmt.org/wmt14/translation-task.html) | WMT14 EN-DE 经过BPE分词的英语-德语翻译数据集| `paddlenlp.datasets.load_dataset('wmt14ende')`| + +## 机器同传 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [BSTC](https://aistudio.baidu.com/aistudio/competition/detail/44/) | 千言数据集:机器同传,包括transcription_translation和asr | `paddlenlp.datasets.load_dataset('bstc', 'asr')`| + +## 对话系统 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [DuConv](https://aistudio.baidu.com/aistudio/competition/detail/48/) | 千言数据集:开放域对话,中文知识型对话数据集 | `paddlenlp.datasets.load_dataset('duconv')`| + +## 文本生成 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [Poetry](https://github.com/chinese-poetry/chinese-poetry) | 中文诗歌古典文集数据| `paddlenlp.datasets.load_dataset('poetry')`| +| [Couplet](https://github.com/v-zich/couplet-clean-dataset) | 中文对联数据集| `paddlenlp.datasets.load_dataset('couplet')`| +| [DuReaderQG](https://github.com/PaddlePaddle/Research/tree/master/NLP/DuReader-Robust-BASELINE) | 基于DuReader的问题生成数据集| `paddlenlp.datasets.load_dataset('dureader_qg')`| +| [AdvertiseGen](https://github.com/ZhihongShao/Planning-based-Hierarchical-Variational-Model) | 中文文案生成数据集| `paddlenlp.datasets.load_dataset('advertisegen')`| +| [LCSTS_new](https://aclanthology.org/D15-1229.pdf) | 中文摘要生成数据集| `paddlenlp.datasets.load_dataset('lcsts_new')`| +| [CNN/Dailymail](https://github.com/abisee/cnn-dailymail) | 英文摘要生成数据集| `paddlenlp.datasets.load_dataset('cnn_dailymail')`| + +## 语料库 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [PTB](http://www.fit.vutbr.cz/~imikolov/rnnlm/) | Penn Treebank Dataset | `paddlenlp.datasets.load_dataset('ptb')`| +| [Yahoo Answer 100k](https://arxiv.org/pdf/1702.08139.pdf) | 从Yahoo Answer采样100K| `paddlenlp.datasets.load_dataset('yahoo_answer_100k')`| diff --git a/docs/data_prepare/dataset_load.rst b/docs/data_prepare/dataset_load.rst new file mode 100644 index 0000000000000000000000000000000000000000..e3d193da8ec29382c7637589ca536f078a703ec5 --- /dev/null +++ b/docs/data_prepare/dataset_load.rst @@ -0,0 +1,69 @@ +============ +加载数据集 +============ + +快速加载内置数据集 +--------------------- + +目前PaddleNLP内置20余个NLP数据集,涵盖阅读理解,文本分类,序列标注,机器翻译等多项任务。目前提供的数据集可以在 :doc:`数据集列表 <./dataset_list>` 中找到。 + +以 **msra_ner** 数据集为例: + +.. 
code-block:: + + >>> from paddlenlp.datasets import load_dataset + >>> train_ds, test_ds = load_dataset("msra_ner", splits=("train", "test")) + +:func:`load_dataset` 方法会从 :obj:`paddlenlp.datasets` 下找到msra_ner数据集对应的数据读取脚本(默认路径:paddlenlp/datasets/msra_ner.py),并调用脚本中 :class:`DatasetBuilder` 类的相关方法生成数据集。 + +生成数据集可以以 :class:`MapDataset` 和 :class:`IterDataset` 两种类型返回,分别是对 :class:`paddle.io.Dataset` 和 :class:`paddle.io.IterableDataset` 的扩展,只需在 :func:`load_dataset` 时设置 :attr:`lazy` 参数即可获取相应类型。:obj:`Flase` 对应返回 :class:`MapDataset` ,:obj:`True` 对应返回 :class:`IterDataset`,默认值为None,对应返回 :class:`DatasetBuilder` 默认的数据集类型,大多数为 :class:`MapDataset` 。 + +.. code-block:: + + >>> from paddlenlp.datasets import load_dataset + >>> train_ds = load_dataset("msra_ner", splits="train") + >>> print(type(train_ds)) + # Default + >>> train_ds = load_dataset("msra_ner", splits="train", lazy=True) + >>> print(type(train_ds)) + + +关于 :class:`MapDataset` 和 :class:`IterDataset` 功能和异同可以参考API文档 :doc:`datasets <../source/paddlenlp.datasets.dataset>`。 + +选择子数据集 +^^^^^^^^^^^^^^^^^^^^^^^ + +有些数据集是很多子数据集的集合,每个子数据集都是一个独立的数据集。例如 **GLUE** 数据集就包含COLA, SST2, MRPC, QQP等10个子数据集。 + +:func:`load_dataset` 方法提供了一个 :attr:`name` 参数用来指定想要获取的子数据集。使用方法如下: + +.. code-block:: + + >>> from paddlenlp.datasets import load_dataset + >>> train_ds, dev_ds = load_dataset("glue", name="cola", splits=("train", "dev")) + +以内置数据集格式读取本地数据集 +----------------------------- + +有的时候,我们希望使用数据格式与内置数据集相同的本地数据替换某些内置数据集的数据(例如参加SQuAD竞赛,对训练数据进行了数据增强)。 :func:`load_dataset` 方法提供的 :attr:`data_files` 参数可以实现这个功能。以 **SQuAD** 为例。 + +.. code-block:: + + >>> from paddlenlp.datasets import load_dataset + >>> train_ds, dev_ds = load_dataset("squad", data_files=("my_train_file.json", "my_dev_file.json")) + >>> test_ds = load_dataset("squad", data_files="my_test_file.json") + +.. note:: + + 对于某些数据集,不同的split的读取方式不同。对于这种情况则需要在 :attr:`splits` 参数中以传入与 :attr:`data_files` **一一对应** 的split信息。 + + 此时 :attr:`splits` 不再代表选取的内置数据集,而代表以何种格式读取本地数据集。 + + 下面以 **COLA** 数据集为例: + + .. code-block:: + + >>> from paddlenlp.datasets import load_dataset + >>> train_ds, test_ds = load_dataset("glue", "cola", splits=["train", "test"], data_files=["my_train_file.csv", "my_test_file.csv"]) + + **另外需要注意数据集的是没有默认加载选项的,**:attr:`splits` **和**:attr:`data_files` **必须至少指定一个。** \ No newline at end of file diff --git a/docs/data_prepare/dataset_self_defined.rst b/docs/data_prepare/dataset_self_defined.rst new file mode 100644 index 0000000000000000000000000000000000000000..673d78a26acb7af4a48c3c35dfbbd277933b9b17 --- /dev/null +++ b/docs/data_prepare/dataset_self_defined.rst @@ -0,0 +1,155 @@ +============ +如何自定义数据集 +============ + +通过使用PaddleNLP提供的 :func:`load_dataset` , :class:`MapDataset` 和 :class:`IterDataset` 。任何人都可以方便的定义属于自己的数据集。 + +从本地文件创建数据集 +------------------- + +从本地文件创建数据集时,我们 **推荐** 根据本地数据集的格式给出读取function并传入 :func:`load_dataset` 中创建数据集。 + +以 `waybill_ie `__ 快递单信息抽取任务中的数据为例: + +.. 
code-block:: + + from paddlenlp.datasets import load_dataset + + def read(data_path): + with open(data_path, 'r', encoding='utf-8') as f: + # 跳过列名 + next(f) + for line in f: + words, labels = line.strip('\n').split('\t') + words = words.split('\002') + labels = labels.split('\002') + yield {'tokens': words, 'labels': labels} + + # data_path为read()方法的参数 + map_ds = load_dataset(read, data_path='train.txt',lazy=False) + iter_ds = load_dataset(read, data_path='train.txt',lazy=True) + +我们推荐将数据读取代码写成生成器(generator)的形式,这样可以更好的构建 :class:`MapDataset` 和 :class:`IterDataset` 两种数据集。同时我们也推荐将单条数据写成字典的格式,这样可以更方便的监测数据流向。 + +事实上,:class:`MapDataset` 在绝大多数时候都可以满足要求。一般只有在数据集过于庞大无法一次性加载进内存的时候我们才考虑使用 :class:`IterDataset` 。任何人都可以方便的定义属于自己的数据集。 + +.. note:: + + 需要注意的是,只有PaddleNLP内置的数据集具有将数据中的label自动转为id的功能(详细条件参见 :doc:`创建DatasetBuilder <../community/contribute_datasets/how_to_write_a_DatasetBuilder>`)。 + + 像上例中的自定义数据集需要在自定义的convert to feature方法中添加label转id的功能。 + + 自定义数据读取function中的参数可以直接以关键字参数的方式传入 :func:`load_dataset` 中。而且对于自定义数据集,:attr:`lazy` 参数是 **必须** 传入的。 + +从 :class:`paddle.io.Dataset/IterableDataset` 创建数据集 +------------------- + +虽然PaddlePddle内置的 :class:`Dataset` 和 :class:`IterableDataset` 是可以直接接入 :class:`DataLoader` 用于模型训练的,但有时我们希望更方便的使用一些数据处理(例如convert to feature, 数据清洗,数据增强等)。而PaddleNLP内置的 :class:`MapDataset` 和 :class:`IterDataset` 正好提供了能实现以上功能的API。 + +所以如果您习惯使用 :class:`paddle.io.Dataset/IterableDataset` 创建数据集的话。只需要在原来的数据集上套上一层 :class:`MapDataset` 或 :class:`IterDataset` 就可以把原来的数据集对象转换成PaddleNLP的数据集。 + +下面举一个简单的小例子。:class:`IterDataset` 的用法基本相同。 + +.. code-block:: + + from paddle.io import Dataset + from paddlenlp.datasets import MapDataset + + class MyDataset(Dataset): + def __init__(self, path): + + def load_data_from_source(path): + ... + ... + return data + + self.data = load_data_from_source(path) + + def __getitem__(self, idx): + return self.data[idx] + + def __len__(self): + return len(self.data) + + ds = MyDataset(data_path) # paddle.io.Dataset + new_ds = MapDataset(ds) # paddlenlp.datasets.MapDataset + +从其他python对象创建数据集 +------------------- + +理论上,我们可以使用任何包含 :func:`__getitem__` 方法和 :func:`__len__` 方法的python对象创建 :class:`MapDataset`。包括 :class:`List` ,:class:`Tuple` ,:class:`DataFrame` 等。只要将符合条件的python对象作为初始化参数传入 :class:`MapDataset` 即可完成创建。 + +.. code-block:: + + from paddlenlp.datasets import MapDataset + + data_source_1 = [1,2,3,4,5] + data_source_2 = ('a', 'b', 'c', 'd') + + list_ds = MapDataset(data_source_1) + tuple_ds = MapDataset(data_source_2) + + print(list_ds[0]) # 1 + print(tuple_ds[0]) # a + +同样的,我们也可以使用包含 :func:`__iter__` 方法的python对象创建 :class:`IterDataset` 。例如 :class:`List`, :class:`Generator` 等。创建方法与 :class:`MapDataset` 相同。 + +.. code-block:: + + from paddlenlp.datasets import IterDataset + + data_source_1 = ['a', 'b', 'c', 'd'] + data_source_2 = (i for i in range(5)) + + list_ds = IterDataset(data_source_1) + gen_ds = IterDataset(data_source_2) + + print([data for data in list_ds]) # ['a', 'b', 'c', 'd'] + print([data for data in gen_ds]) # [0, 1, 2, 3, 4] + +.. note:: + + 需要注意,像上例中直接将 **生成器** 对象传入 :class:`IterDataset` 所生成的数据集。其数据只能迭代 **一次** 。 + +与常规的python对象一样,只要满足以上的条件,我们也可以使用同样的方法从第三方数据集创建PaddleNLP数据集。 + +例如HuggingFace Dataset: + +.. 
code-block:: + + from paddlenlp.datasets import MapDataset + from datasets import load_dataset + + hf_train_ds = load_dataset('msra_ner', split='train') + print(type(train_ds)) # + + train_ds = MapDataset(train_ds) + print(type(train_ds)) # + + print(train_ds[2]) # {'id': '2', + # 'ner_tags': [0, 0, 0, 5, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + # 0, 0, 0, 0, 0, 0, 5, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0], + # 'tokens': ['因', '有', '关', '日', '寇', '在', '京', '掠', '夺', '文', '物', + # '详', '情', ',', '藏', '界', '较', '为', '重', '视', ',', '也', + # '是', '我', '们', '收', '藏', '北', '京', '史', '料', '中', '的', + # '要', '件', '之', '一', '。']} + + hf_train_ds = load_dataset('cmrc2018', split='train') + train_ds = MapDataset(hf_train_ds) + print(train_ds[1818]) # {'answers': {'answer_start': [9], 'text': ['字仲可']}, + # 'context': '徐珂(),原名昌,字仲可,浙江杭县(今属杭州市)人。光绪举人。 + # 后任商务印书馆编辑。参加南社。1901年在上海担任了《外交报》、 + # 《东方杂志》的编辑,1911年,接管《东方杂志》的“杂纂部”。与潘仕成、 + # 王晋卿、王辑塘、冒鹤亭等友好。编有《清稗类钞》、《历代白话诗选》、 + # 《古今词选集评》等。光绪十五年(1889年)举人。后任商务印书馆编辑。 + # 参加南社。曾担任袁世凯在天津小站练兵时的幕僚,不久离去。', + # 'id': 'TRAIN_113_QUERY_0', + # 'question': '徐珂字什么?'} + + hf_train_ds = load_dataset('glue', 'sst2', split='train') + train_ds = MapDataset(hf_train_ds) + print(train_ds[0]) # {'idx': 0, 'label': 0, 'sentence': 'hide new secretions from the parental units '} + + hf_train_ds = load_dataset('ptb_text_only', split='train') + train_ds = MapDataset(hf_train_ds) + print(train_ds[1]) # {'sentence': 'pierre N years old will join the board as a nonexecutive director nov. N'} diff --git a/docs/data_prepare/overview.rst b/docs/data_prepare/overview.rst new file mode 100644 index 0000000000000000000000000000000000000000..2fadab309ffa415b8dcd2959e828f0f1ca3a7eac --- /dev/null +++ b/docs/data_prepare/overview.rst @@ -0,0 +1,32 @@ +============ +整体介绍 +============ + +数据集和数据处理部分一直是NLP任务中最重要的环节之一。为了方便用户以更低的学习成本完成这一环节,PaddleNLP提供了以下特性: + +- 功能强大的API。可以帮助用户完成大部分常见NLP任务的数据处理流程。 +- 更灵活的封装。各个模块保持低耦合,高内聚,保证用户可以通过继承和改写满足特定的数据处理需求。 +- 内置数据集涵盖大部分NLP任务,搭配简洁易用的数据集加载协议和贡献协议。对新手和社区贡献者更加友好。 + +核心API +---------- + +- :func:`load_dataset` :数据集快速加载接口,通过传入数据集读取脚本的名称和其他参数调用 :class:`DatasetBuilder` 子类的相关方法生成数据集。关于加载数据集的详细方法,请查阅 :doc:`加载数据集 <./dataset_load>` 。 +- :class:`DatasetBuilder` : :class:`DatasetBuilder` 是一个基类,所有的内置数据集都继承自该类,该类的主要功能是下载和读取数据集文件并生成Dataset。其中大部分方法已经封装,不对贡献者暴露。贡献者通过重写 :func:`_get_data` 和 :func:`_read` 等方法像社区贡献数据集。详细信息请查阅 :doc:`如何贡献数据集 ` 。 +- :class:`MapDataset/IterDataset` :PaddleNLP内置数据集类型,分别是对 :class:`paddle.io.Dataset` 和 :class:`paddle.io.IterableDataset` 的扩展。内置诸如 :func:`map` , :func:`filter` 等适用于NLP任务的数据处理功能。同时还能帮助用户简单创建自定义数据集。详细信息请查阅***和 :doc:`如何自定义数据集 <./dataset_self_defined>` 。 + +数据处理流程设计 +----------------- + +目前PaddleNLP的通用数据处理流程如下: + +#. 加载数据集(内置数据集或者自定义数据集,数据集返回 **原始数据**)。 +#. 定义 :func:`trans_func` ,包括tokenize,token to id等操作,并传入数据集的 :func:`map` 方法,将原始数据转为 *feature* 。 +#. 根据上一步数据处理的结果定义 **batchify** 方法和 :class:`BatchSampler` 。 +#. 定义 :class:`DataLoader` , 传入 :class:`BatchSampler` 和 :func:`batchify_fn` 。 + +下面是基于Bert的文本分类任务的数据处理流程图: + +.. image:: ../imgs/data_preprocess_pipline.png + +关于数据处理的详细信息,请查阅 :doc:`./data_preprocess` 。 diff --git a/docs/dataaug.md b/docs/dataaug.md new file mode 100644 index 0000000000000000000000000000000000000000..0ae035990f8dbd8c17ce7328e995bf8f47e6dae9 --- /dev/null +++ b/docs/dataaug.md @@ -0,0 +1,1167 @@ +# Data Augmentation API + +PaddleNLP提供了Data Augmentation数据增强API,可用于训练数据数据增强 + +**目录** +* [1. 词级别数据增强策略](#词级别数据增强策略) + * [1.1 词替换](#词替换) + * [1.2 词插入](#词插入) + * [1.3 词删除](#词删除) + * [1.4 词交换](#词交换) +* [2. 
句子级别数据增强策略](#句子级别数据增强策略) + * [2.1 同义句生成](#同义句生成) + * [2.2 句子回译](#句子回译) + * [2.3 句子摘要](#句子摘要) + * [2.4 句子续写](#句子续写) +* [3. 字级别数据增强策略](#字级别数据增强策略) + * [3.1 字替换](#字替换) + * [3.2 字插入](#字插入) + * [3.3 字删除](#字删除) + * [3.4 字交换](#字交换) +* [4. 文档一键增强](#文档一键增强) + + + + +## 1.词级别数据增强策略 + + + +### 1.1 词替换 +词替换数据增强策略也即将句子中的词随机替换为其他单词进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.WordSubstitute`进行词级别替换的数据增强。 + +```text +WordSubstitute 参数介绍: + + aug_type(str or list(str)): + 词替换增强策略类别。可以选择"antonym"、"embedding"、"synonym"、"homonym"、"custom"、"random"、"mlm"或者 + 前四种词替换增强策略组合。 + + custom_file_path (str,*可选*): + 本地数据增强词表路径。如果词替换增强策略选择"custom",本地数据增强词表路径不能为None。默认为None。 + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被替换词数量。默认为None + + aug_percent(int): + 数据增强句子中被替换词数量占全句词比例。如果aug_n不为None,则被替换词数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被替换词数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被替换词数量最大值。默认为10。 + + tf_idf (bool): + 使用TF-IDF分数确定哪些词进行增强。默认为False。 + + tf_idf_file (str,*可选*): + 用于计算TF-IDF分数的文件。如果tf_idf为True,本地数据增强词表路径不能为None。默认为None。 +``` + +我们接下来将以下面的例子介绍词级别替换的使用: + +``` python +from paddlenlp.dataaug import WordSubstitute +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +**同义词替换** + +根据同义词词表将句子中的词替换为同义词,可以根据实际需要,设置被替换词数量占全句词比例`aug_percent`和生成增强句子数量`create_n`。`synonym`基于[中文同义词词表](https://github.com/guotong1988/chinese_dictionary)实现,`embedding`则是基于词向量(word embedding)之间的词距离构建的同义词词表确定,可以根据实际效果选择合适的词表。 + +``` python +aug = WordSubstitute('synonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的音符号,其中蕴含着丰富的语义信,生人可以很轻松地理解其中的含义。', '全人类语言是泛泛的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的意思。']] +augmented = aug.augment(s) +print(augmented) +# [['全人类言语是抽象的信符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '全人类语言是抽象的信息标记,其中蕴含着丰富的语义信息,人类可以很轻松地略知一二其中的含义。'], ['而计算机不得不处理数值化的信息,无法直接理解人类言语,所以需要将人类语言进行数值化更换。', '而计算机只能处理数值化的信息,无法直接理解人类言语,所以需要将生人语言进行数值化变换。']] +``` + +可以根据的实际需求,直接设置句子中被替换的词数量 `aug_n`: +``` python +aug = WordSubstitute('synonym', create_n=1, aug_n=3) +augmented = aug.augment(s[0]) +print(augmented) +# [['全人类语言是空泛的信息符号,其中蕴含着丰富的涵义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的消息符号,其中蕴含着丰富的疑义信息,人类可以很轻松地理解其中的意义。'], ['而计算机唯其如此处理实测值化的信息,无法直接理解人类语言,所以需要将人类语言进行实测值化转换。']] +``` + +``` python +aug = WordSubstitute('embedding', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的音符号,其中蕴含着丰富的语义信,生人可以很轻松地理解其中的含义。', '全人类语言是泛泛的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的意思。']] +augmented = aug.augment(s) +print(augmented) +# [['全人类言语是抽象的信符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '全人类语言是抽象的信息标记,其中蕴含着丰富的语义信息,人类可以很轻松地略知一二其中的含义。'], ['而计算机不得不处理数值化的信息,无法直接理解人类言语,所以需要将人类语言进行数值化更换。', '而计算机只能处理数值化的信息,无法直接理解人类言语,所以需要将生人语言进行数值化变换。']] +``` + +**同音词替换** + +根据同音词词表将句子中的词替换为同音词: + +``` python +aug = WordSubstitute('homonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是臭香的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松德力竭其中的含义。', '任雷语言是抽象的信息富豪,其中蕴含着丰富的语义信息,任蕾可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是臭香的新潟符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含一。', '任雷语言是抽象的新潟符号,其中蕴含着丰富的语义信息,人类可以很庆松地理解其中的含义。'], ['而计算机只能处理数值化的新戏,无法直接丽姐人类语言,所以需要将人类语言进行书之化转换。', '而计算机只能处理数值化的心系,无法直接李杰人类玉烟,所以需要将人类语言进行数值化转换。']] +``` + +**反义词替换** + +根据反义词词表将句子中的词替换为反义词: + +``` python +aug = WordSubstitute('antonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地糊涂其中的含义。', '人类语言是具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地懵懂其中的含义。']] +augmented = aug.augment(s) 
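+# augment() 同样支持传入句子列表:返回的嵌套列表中,每个子列表对应一条输入句子的 create_n 个增强结果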
+print(augmented) +# [['人类语言是具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地糊涂其中的含义。', '人类语言是具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地困惑其中的含义。'], ['而计算机只能处理数值冻的信息,无法直接困惑人类语言,所以需要将人类语言进行数值冻转换。', '而计算机只能处理数值冻的信息,无法直接懵懂人类语言,所以需要将人类语言进行数值冻转换。']] +``` + +**本地词表替换** + +只需要传入本地词表文件路径`custom_file_path`,即可使用自定义的词表进行替换。本地词表文件为固定格式的`json`文件,字典关键字(key)为词,字典键值(item)为列表形式的替换词。例如自定义本地词表`custom.json`如下: +``` +{"人类":["人", "人种","全人类"], "抽象":["abstract","具象"], "轻松":["简单","容易"]} +``` + +使用自定义的本地词表进行句子中词替换: +``` python +custom_file_path = "custom.json" +aug = WordSubstitute('custom', custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人语言是abstract的信息符号,其中蕴含着丰富的语义信息,全人类可以很轻松地理解其中的含义。', '全人类语言是具象的信息符号,其中蕴含着丰富的语义信息,人可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人语言是abstract的信息符号,其中蕴含着丰富的语义信息,人种可以很轻松地理解其中的含义。', '人语言是具象的信息符号,其中蕴含着丰富的语义信息,人种可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人语言,所以需要将全人类语言进行数值化转换。', '而计算机只能处理数值化的信息,无法直接理解全人类语言,所以需要将人语言进行数值化转换。']] +``` + +**组合替换** + +还可以选择将同义词、同音词、本地词表进行随机组合,例如组合同义词词表核本地词表进行词替换: +``` python +custom_file_path = "custom.json" +aug = WordSubstitute(['custom','synonym'], custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义音信,生人可以很轻松地领悟其中的含义。', '人种语言是抽象的信息符号,其中蕴含着丰富的贬义信息,人可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信符号,其中蕴含着丰富的语义消息,生人可以很轻松地理解其中的含义。', '人语言是抽象的信息符号,其中蕴含着丰富的语义消息,人类可以很轻松地亮堂其中的含义。'], ['而计算机只能处理数值变成的信息,无法直接理解人类语言,所以需要将生人语言进行数值变为转换。', '而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类言语进行标注值变为转换。']] +``` + +**随机词替换** + +使用随机词进行句子中词替换: +``` python +aug = WordSubstitute('random', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类塘屿是抽象的黄芪酒符号,其中蕴含着丰富的语义信息,人类可以很轻单官理解其中的含义。', '人类语言是抽象的亞符号,其中蕴含着丰富的语义镇咳药,人类可以いていた松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类共进退是抽象的信息符号,其中蕴含着丰富的梦界信息,人类可以很轻大凤理解其中的含义。', '人类语言是4490的信息符号,其中蕴含着丰富的语义信息,科摩可以很轻松地崔磊其中的含义。'], ['而库山乡只能处理数值化的信息,无法直接理解街亭失守MicrosoftWorks,所以需要将人类语言进行数值化转换。', '而0.57万只能处理数值化的信息,无法直接理解人类语言,所以需要将穆哥叶楚进行数值化转换。']] +``` + +**上下文替换** + +上下文替换是随机将句子中单词进行掩码,利用中文预训练模型ERNIE 1.0,根据句子中的上下文预测被掩码的单词。相比于根据词表进行词替换,上下文替换预测出的单词更匹配句子内容,数据增强所需的时间也更长。 + +使用模型根据上下文预测单词进行句子中词替换: +``` python +import paddle +# 在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = WordSubstitute('mlm', create_n=1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的信义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的语字符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信言,无法直接理解人类语言,所以需要将人类语言进行数值化转换。']] +``` +句子中被替换的词数量目前只支持 `aug_n` 为1。 + +**基于TF-IDF的词替换** + +TF-IDF算法认为如果一个词在同一个句子中出现的次数多,词对句子的重要性就会增加;如果它在语料库中出现频率越高,它的重要性将被降低。我们将计算每个词的TF-IDF分数,**低的TF-IDF得分将有很高的概率被替换**。 + +我们可以在上面所有词替换策略中使用TF-IDF计算词被替换的概率,我们首先需要将`tf_idf`设为True,并传入语料库文件(包含所有训练的数据) `tf_idf_file` 用于计算单词的TF-IDF分数。语料库文件为固定 `txt` 格式,每一行为一条句子。以语料库文件`"data.txt"`做同义词替换为例,语料库文件格式如下: +``` text +人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。 +而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。 +... 
+``` + +``` python +tf_idf_file = "data.txt" +aug = WordSubstitute('synonym', tf_idf=True, tf_idf_file=tf_idf_file, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的消息符号,其中蕴含着丰富的语义音信,人类可以很轻松地敞亮其中的含义。', '生人语言是抽象的消息符号,其中蕴含着丰富的语义信息,全人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类言语是抽象的信息符号,其中蕴含着丰富的语义信息,生人可以很轻松地分晓其中的含义。', '人类言语是抽象的音问符号,其中蕴含着丰富的语义信息,全人类可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类言语,所以需要将全人类言语进行数值化转换。', '而计算机只能处理数值化的信息,无法直接理解生人语言,所以需要将全人类语言进行数值化变换。']] +``` + + +### 词插入 +词插入数据增强策略也即将句子中的词随机插入其他单词进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.WordInsert`进行词级别插入的数据增强。 + +```text +WordInsert 参数介绍: + + aug_type(str or list(str)): + 词插入增强策略类别。可以选择"antonym"、"embedding"、"synonym"、"homonym"、"custom"、"random"、"mlm"或者 + 前三种词插入增强策略组合。 + + custom_file_path (str,*可选*): + 本地数据增强词表路径。如果词插入增强策略选择"custom",本地数据增强词表路径不能为None。默认为None。 + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被插入词数量。默认为None + + aug_percent(int): + 数据增强句子中被插入词数量占全句词比例。如果aug_n不为None,则被插入词数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被插入词数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被插入词数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍词级别插入的使用: + +``` python +from paddlenlp.dataaug import WordInsert +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", "而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +**同义词插入** +根据同义词词表将句子中的词前/后插入同义词,可以根据实际需要,设置插入词数量占全句词比例`aug_percent`和生成增强句子数量`create_n`。`synonym`基于[中文同义词词表](https://github.com/guotong1988/chinese_dictionary)实现,`embedding`则是基于词向量(word embedding)之间的词距离构建的同义词词表确定,可以根据实际效果选择合适的词表。 + +``` python +aug = WordInsert('synonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['全人类人类语言是华而不实抽象的信息符号,其中蕴含着丰富的语义消息信息,人类可以很轻松地理解其中的含义。', '人类语言是抽象的音信信息符号,其中蕴含着丰富的语义消息信息,生人人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言言语是抽象的信息符号,其中蕴含着丰富的语义褒义信息音问,人类可以很轻松地理解其中的含义。', '人类语言是抽象言之无物的信息符号记号,其中蕴含着丰富的语义信息,人类可以很轻松地理解清楚其中的含义。'], ['而计算机只能只得处理数值化变为的信息,无法直接理解人类生人语言,所以需要将人类语言进行数值化转换。', '而计算机只能处理数值分值化化为的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换变换。']] +``` + +可以根据的实际需求,直接设置句子中被替换的词数量 `aug_n`: +``` python +aug = WordInsert('synonym', create_n=1, aug_n=3) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言自然语言是抽象的信息符号,其中蕴含着蕴含丰富的语义信息数据,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象具象的信息符号,其中蕴含着丰富的语义演算信息,人类人类文明可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类全人类语言进行数值最大值化转换切换。']] +``` + +``` python +aug = WordInsert('embedding', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的音符号,其中蕴含着丰富的语义信,生人可以很轻松地理解其中的含义。', '全人类语言是泛泛的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的意思。']] +augmented = aug.augment(s) +print(augmented) +# [['全人类言语是抽象的信符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '全人类语言是抽象的信息标记,其中蕴含着丰富的语义信息,人类可以很轻松地略知一二其中的含义。'], ['而计算机不得不处理数值化的信息,无法直接理解人类言语,所以需要将人类语言进行数值化更换。', '而计算机只能处理数值化的信息,无法直接理解人类言语,所以需要将生人语言进行数值化变换。']] +``` + +**同音词插入** + +根据同音词词表将句子中的词插入为同音词: + +``` python +aug = WordInsert('homonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言雨燕是抽象的信息符号,其中蕴含着丰富的语义信息,人类任雷可以很轻松地理解其中的含义寒意。', '人泪人类语言是丑像抽象的心细信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象筹饷的信息符号,其中蕴含着丰富的语义信息,人类可以很轻恨情松地理解力竭其中的含义。', '人类语言是抽象臭香的信息新戏符号,其中蕴含着丰富的语义信息,人类可以很轻很庆松地理解其中的含义。'], ['而计算机只能纸能处理数值化的信息新西,无法直接理解李杰人类语言,所以需要将人类语言进行数值化转换。', '而计算机只能处理数值化的信息,无法直接理解人类语言语嫣,所以需要将人类语言语嫣进行数值书之化转换。']] +``` + +**反义词插入** + +根据反义词词表将句子中的词前/后插入反义词: + +``` python +aug = WordInsert('antonym', create_n=2, 
aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解懵懂其中的含义。', '人类语言是具体抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地懵懂理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象具体的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解懵懂其中的含义。', '人类语言是具体抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地困惑理解其中的含义。'], ['而计算机只能处理数值化凝的信息,无法直接理解困惑人类语言,所以需要将人类语言进行数值化冻转换。', '而计算机只能处理数值化凝的信息,无法直接理解懵懂人类语言,所以需要将人类语言进行数值化冻转换。']] +``` + +**本地词表插入** + +只需要传入本地词表文件路径`custom_file_path`,即可使用自定义的词表进行插入。本地词表文件为固定格式的`json`文件,字典关键字(key)为词,字典键值(item)为列表形式的插入词。例如自定义本地词表`custom.json`如下: +``` +{"人类":["人累", "扔雷"], "抽象":["丑相"], "符号":["富豪","负号","付豪"]} +``` + +使用自定义的本地词表进行句子中词插入: +``` python +custom_file_path = "custom.json" +aug = WordInsert('custom', custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类扔雷语言是抽象的信息符号富豪,其中蕴含着丰富的语义信息,人类扔雷可以很轻松地理解其中的含义。', '人类扔雷语言是抽象丑相的信息符号负号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['扔雷人类语言是丑相抽象的信息付豪符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '人类扔雷语言是抽象丑相的信息符号,其中蕴含着丰富的语义信息,人类人累可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类人累语言,所以需要将人类扔雷语言进行数值化转换。', '而计算机只能处理数值化的信息,无法直接理解人类扔雷语言,所以需要将人类人累语言进行数值化转换。']] +``` + + +**组合插入** + +还可以选择将同义词、同音词、本地词表进行随机组合,例如组合同义词词表核本地词表进行词插入: +``` python +custom_file_path = "custom.json" +aug = WordInsert(['custom','synonym'], custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言词汇是抽象的信息数据符号,其中蕴含着蕴含丰富的语义信息,人类可以很轻松地理解其中的含义。', '人类语言是丑相抽象的信息符号,其中蕴含蕴含着丰富的语义信息,人类可以很轻松地理解其中的含意含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含蕴含着丰富的语义数据信息,人类可以很轻松地理解其中的涵义含义。', '人类人累语言语法是抽象的信息符号,其中蕴含着丰富的语义信息数据,人类可以很轻松地理解其中的含义。'], ['而计算机计算机系统只能处理数值值化的信息,无法直接理解人类人累语言,所以需要将人类语言进行数值化转换。', '而计算机只能处理数值计算结果化的信息,无法直接理解人类语言,所以需要将人类人类文明语言进行数值化转换变换。']] +``` + +**随机词插入** + +使用随机词进行句子中词插入: +``` python +aug = WordInsert('random', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['郎氏人类语言是抽象的魏略信息符号,其中晓畅蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', 'seeddestiny人类语言是抽象的那一双信息符号,其中九王坟蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类文胸语言是抽象解放日报的信息符号鸭池,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '堤平人类语言是文学作家抽象的信息中越关系符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。'], ['而勤业计算机只能处理数值HIStory化的信息,无法直接理解人类语言,所以需要将唐本佑人类语言进行数值化转换。', '而计算机刀弓只能处理数值化苏雨琪的信息,无法直接理解人类语言,所以需要将人类平达语言进行数值化转换。']] +``` + + +**上下文插入** + +上下文插入是随机将句子中单词进行掩码,利用中文预训练模型ERNIE 1.0,根据句子中的上下文预测被掩码的单词。相比于根据词表进行词插入,上下文插入预测出的单词更匹配句子内容,数据增强所需的时间也更长。 + +使用模型根据上下文预测单词进行句子中词插入: +``` python +import paddle +# 在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = WordInsert('mlm', create_n=1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义语化信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号系统,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。'], ['而计算机只能直接处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。']] +``` +句子中插入的词数量目前只支持 `aug_n` 为1。 + +### 词删除 + +词删除数据增强策略也即将句子中的词随机删除进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.WordDelete`进行词级别删除的数据增强。 + +```text +WordDelete 参数介绍: + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被删除词数量。默认为None + + aug_percent(int): + 数据增强句子中被删除词数量占全句词比例。如果aug_n不为None,则被删除词数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被删除词数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被删除词数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍词级别删除的使用: + +``` python +from paddlenlp.dataaug import WordDelete +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", 
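+# 与前面词替换示例相同,s 为待增强的句子列表,可同时传入多条句子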
"而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +将随机删除句子中的词: +``` python +aug = WordDelete(create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的。', '人类语言是抽象的信息符号,其中蕴含着丰富的语义,人类可以松地其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是的信息符号,其中丰富的语义,人类可以很轻松地理解其中的含义。', '人类语言是的信息,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的。'], ['而计算机只能处理数值化的信息,无法直接理解语言,所以需要将人类语言进行转换。', '而计算机处理数值化的信息,无法直接人类语言,所以需要将人类语言进行数值化。']] +``` + +### 词交换 + +词交换数据增强策略也即将句子中的词的位置随机交换进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.WordSwap`进行词级别交换的数据增强。 + +```text +WordSwap 参数介绍: + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被交换词数量。默认为None + + aug_percent(int): + 数据增强句子中被交换词数量占全句词比例。如果aug_n不为None,则被交换词数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被交换词数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被交换词数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍词级别交换的使用: + +``` python +from paddlenlp.dataaug import WordSwap +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", "而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +将随机交换句子中的词: +``` python +aug = WordSwap(create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的符号信息,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以松地很轻理解其中的含义。'], ['而计算机只能处理化数值的信息,无法直接理解人类语言,所以需要将人类语言进行数值转换化。']] +``` + + + +## 2. 句子级别数据增强策略 + + + +### 2.1 同义句生成 + +同义句生成数据增强策略也即根据输入句子生成相似句,模型首先生成`generate_n`个句子,然后再利用模型筛选出最佳的`create_n`。这里我们将介绍如何使用`paddlenlp.dataaug.SentenceGenerate`进行同义句生成的数据增强。 + +```text +SentenceGenerate 参数介绍: + + model_name (str): + 生成同义句模型名,可选"roformer-chinese-sim-char-ft-base", "roformer-chinese-sim-char-base","roformer-chinese-sim-char-ft-small","roformer-chinese-sim-char-small"。默认为"roformer-chinese-sim-char-base"。 + + create_n(int): + 数据增强句子数量,从生成相似句中筛选最佳的句子数量。默认为1。 + + generate_n(int): + 模型生成相似句数量。默认为5。 + + max_length(int): + 模型生成相似句最长长度。默认为128。 + + top_p (float): + “sampling”策略中top-p-filtering的累积概率。该值应满足:math:`0<=top_p<1`。默认为0.95 +``` + +我们接下来将以下面的例子介绍同义句生成的使用: + +``` python +from paddlenlp.dataaug import SentenceGenerate +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +``` python +import paddle +# 建议在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = SentenceGenerate(create_n=2, generate_n=5, max_length=128, top_p=0.95) +augmented = aug.augment(s[0]) +print(augmented) +# ['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义', '人类语言是一个抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。', '人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义答。'], ['而计算机只能处理数值化的信息,无法直接理解人类语言,故需要将人类语言进行数值化转换。', '2、计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。']] +``` + + + +### 2.2 句子回译 + +句子回译数据增强策略也即将输入的句子翻译为另一种语言,然后再翻译回来,生成语义相同表达方式不同的句子,用于数据增强。这里我们将介绍如何使用基于百度翻译API`paddlenlp.dataaug.SentenceBackTranslateAPI`进行句子回译的数据增强和基于模型的`paddlenlp.dataaug.SentenceBackTranslate`。 + + +```text +SentenceBackTranslateAPI 参数介绍: + + src_lang (str): + 输入句子的语言。默认为"zh"。 + + tgt_lang(str): + 目标句子的语言,增强策略将会把句子翻译为目标句子语言,再翻译回输入句子语言。默认为"en"。 + + appid(str): + 百度通用翻译API的APPID(如果你使用自己的百度翻译API服务appid/secretKey)。默认为None。 + + secretKey (str): + 百度通用翻译API的密钥(如果你使用自己的百度翻译API服务appid/secretKey)。默认为1。 + + qps (int): + 百度通用翻译API的QPS(如果你使用自己的百度翻译API服务appid/secretKey)。 默认为1。 +``` + +我们接下来将以下面的例子介绍基于百度翻译API的句子回译的使用: + +使用SentenceBackTranslateAPI需要安装PaddleHub +```shell +pip install paddlehub==2.3.1 +``` + +``` python +from paddlenlp.dataaug import 
SentenceBackTranslateAPI +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +``` python +aug = SentenceBackTranslateAPI(src_lang='zh', tgt_lang='en') +augmented = aug.augment(s[0]) +print(augmented) +# ['人类语言是一种抽象的信息符号,蕴含着丰富的语义信息。人类很容易理解它的含义。'] +augmented = aug.augment(s) +print(augmented) +# ['人类语言是一种抽象的信息符号,蕴含着丰富的语义信息。人类很容易理解它的含义。', '然而,计算机只能处理数字信息,不能直接理解人类语言,因此有必要将人类语言转换为数字信息。'] +``` +**Note** +1. 默认使用PaddleHub提供的百度翻译API服务,也可以选择注册自己的百度翻译API服务账号获取相应的AppID和密钥,账号注册流程请参见[百度翻译API文档](https://fanyi-api.baidu.com/doc/21),使用自己AppID和密钥则无需安装PaddleHub。 +2. `src_lang`和`tgt_lang`支持的语言和服务异常报错详见[百度翻译API文档](https://fanyi-api.baidu.com/doc/21)中完整语种列表和错误码列表。 + +```text +SentenceBackTranslate 参数介绍: + + src_lang (str): + 输入句子的语言。默认为"zh"。可选语言:'ar', 'cs', 'de', 'en', 'es', 'et', 'fi', 'fr', 'gu', 'hi', 'it', 'ja', 'kk', 'ko', 'lt', 'lv', 'my', 'ne', 'nl', 'ro', 'ru', 'si', 'tr', 'vi', 'zh', 'af', 'az', 'bn', 'fa', 'he', 'hr', 'id', 'ka', 'km', 'mk', 'ml', 'mn', 'mr', 'pl', 'ps', 'pt', 'sv', 'sw', 'ta', 'te', 'th', 'tl', 'uk', 'ur', 'xh', 'gl', 'sl'。 + + tgt_lang(str): + 目标句子的语言,增强策略将会把句子翻译为目标句子语言,再翻译回输入句子语言。默认为"en"。可选语言:'ar', 'cs', 'de', 'en', 'es', 'et', 'fi', 'fr', 'gu', 'hi', 'it', 'ja', 'kk', 'ko', 'lt', 'lv', 'my', 'ne', 'nl', 'ro', 'ru', 'si', 'tr', 'vi', 'zh', 'af', 'az', 'bn', 'fa', 'he', 'hr', 'id', 'ka', 'km', 'mk', 'ml', 'mn', 'mr', 'pl', 'ps', 'pt', 'sv', 'sw', 'ta', 'te', 'th', 'tl', 'uk', 'ur', 'xh', 'gl', 'sl'。 + + max_length(int): + 模型生成相似句最长长度。默认为128。 + + batch_size (int): + 批大小,如果显存不足,适当调小该值。默认为1。 + + num_beams (int): + “beam_search”策略中的beam值。 默认为 4。 + + use_faster (bool): + 是否使用FasterGeneration进行加速。默认为False。 + + decode_strategy (str): + 生成中的解码策略。 目前支持三种解码策略:“greedy_search”、“sampling”和“beam_search”。 默认为“beam_search”。 + +``` + +我们接下来将以下面的例子介绍基于模型的句子回译的使用: + +``` python +from paddlenlp.dataaug import SentenceBackTranslate +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +``` python +import paddle +# 建议在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = SentenceBackTranslate(src_lang='zh', tgt_lang='en', batch_size=1, max_length=128) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象信息符号, 它包含丰富的语义信息, 可以容易理解.']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象信息符号, 它包含丰富的语义信息, 可以容易理解.'], ['计算机只能处理数字化信息,不能直接理解人类语言,因此有必要进行数字化。']] +``` +**Note** +1. 
如果`use_faster`设为True,第一次执行PaddleNLP会启动即时编译(JIT Compile)自动编译高性能解码算子。编译过程通常会花费几分钟的时间编译只会进行一次,之后再次使用高性能解码就不需要重新编译了,编译完成后会继续运行。 + + + +### 2.3 句子摘要 + +句子摘要数据增强策略也即对输入句子生成摘要句子,这里我们将介绍如何使用`paddlenlp.dataaug.SentenceSummarize`进行句子摘要的数据增强。 + +```text +SentenceSummarize 参数介绍: + + create_n(int): + 数据增强句子数量,从生成相似句中筛选最佳的句子数量。默认为1。 + + max_length(int): + 模型生成相似句最长长度。默认为128。 + + batch_size (int): + 批大小,如果显存不足,适当调小该值。默认为1。 + + top_k (int): + “sampling”策略中top-k-filtering的最高概率token的数量, 0表示没有影响。默认为5。 + + top_p (float): + “sampling”策略中top-p-filtering的累积概率。该值应满足:math:`0<=top_p<1`。默认为1.0,表示没有影响。 + + temperature (float): + “sampling”策略中对下一个token概率进行建模的值。 默认为 1.0,表示没有影响。 + + use_fp16_decoding (bool): + 是否使用fp16进行加速。默认为False。 +``` + +我们接下来将以下面的例子介绍句子摘要的使用: + +``` python +from paddlenlp.dataaug import SentenceSummarize +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +``` python +import paddle +# 建议在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = SentenceSummarize(create_n=2, batch_size=1, max_length=128) +augmented = aug.augment(s[0]) +print(augmented) +# [['什么是人类语言?', '为什么说人类语言是抽象的信息符号?']] +augmented = aug.augment(s) +print(augmented) +# [['什么是人类语言?', '为什么说人类语言是抽象的信息符号?'], ['计算机只能处理数值化的信息(图)', '计算机只能处理数值化的信息']] +``` + + + +### 2.4 句子续写 + +句子续写数据增强策略也即对输入句子进行句子续写,这里我们将介绍如何使用`paddlenlp.dataaug.SentenceContinue`进行句子续写的数据增强。 + +```text +SentenceContinue 参数介绍: + + model_name (str): + 生成同义句模型名,可选"gpt-cpm-large-cn", "gpt-cpm-small-cn-distill"。默认为"gpt-cpm-small-cn-distill"。 + + max_length(int): + 模型生成相似句最长长度。默认为128。 + + decode_strategy (str): + 生成中的解码策略。 目前支持三种解码策略:“greedy_search”、“sampling”和“beam_search”。 默认为“beam_search”。 + + use_faster (bool): + 是否使用FasterGeneration进行加速。默认为False。 + + create_n(int): + 数据增强句子数量,从生成相似句中筛选最佳的句子数量。默认为1。 + + top_k (int): + “sampling”策略中top-k-filtering的最高概率token的数量, 0表示没有影响。默认为5。 + + top_p (float): + “sampling”策略中top-p-filtering的累积概率。该值应满足:math:`0<=top_p<1`。默认为1.0,表示没有影响。 + + temperature (float): + “sampling”策略中对下一个token概率进行建模的值。 默认为 1.0,表示没有影响。 + + batch_size (int): + 批大小,如果显存不足,适当调小该值。默认为1。 +``` + +我们接下来将以下面的例子介绍同义句生成的使用: + +``` python +from paddlenlp.dataaug import SentenceContinue +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +``` python +import paddle +# 建议在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = SentenceContinue(create_n=2, batch_size=1, max_length=64) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。然而语言本身的抽象不是简单的,语言的复杂性以及语言的抽象化则是人类认识世界的另一个重要途径。信息本身和人类的理解能力无关,人类理解世界的过程就是信息过程的不断丰富与不断', '人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。不过,这也是很不容易的。有一些事情是不可能实现的,对于一些人来说,不可能实现的事情只是遥不可及的梦,这也就是为什么在他们的思想中经常会']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。那么,为什么会出现这种现象呢?首先,我们知道人类拥有最简单的语言,但是我们无法通过语言去直接理解它,这就使得我们需要建立数学模型,使得理解过程比语言模型复杂得多', '人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。如果人类可以用语言解决语言问题,那么这个问题是不能回避的。这就是为什么计算机是一个语言的存在,因为它能够处理语言的逻辑关系。这就要求我们对语言的基本事实和各种各样的信息进行细致'], ['而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。因此,计算机在编程方面的功能就是将程序的数据进行算法处理,以便在特定情况下做出特定的功能。在这里可以看到,计算机编程的主要功能是处理文字的信息,而与文字的信息无关的', '而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。因此,“语言”这个词的含义,实际上可以由下面这个公式来表示:=\\alpha\\left(\\alpha-(\\alpha']] +``` + + +## 3.字级别数据增强策略 + + + +### 3.1 字替换 +字替换数据增强策略也即将句子中的字随机替换为其他单字进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.CharSubstitute`进行字级别替换的数据增强。 + +```text +CharSubstitute 参数介绍: + + aug_type(str or list(str)): + 
字替换增强策略类别。可以选择"antonym"、"homonym"、"custom"、"random"、"mlm"或者 + 前三种字替换增强策略组合。 + + custom_file_path (str,*可选*): + 本地数据增强字表路径。如果字替换增强策略选择"custom",本地数据增强字表路径不能为None。默认为None。 + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被替换字数量。默认为None + + aug_percent(int): + 数据增强句子中被替换字数量占全句字比例。如果aug_n不为None,则被替换字数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被替换字数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被替换字数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍字级别替换的使用: + +``` python +from paddlenlp.dataaug import CharSubstitute +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。","而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +**同音字替换** + +根据同音字表将句子中的字替换为同音字,可以根据实际需要,设置被替换字数量占全句字比例`aug_percent`和生成增强句子数量`create_n`。 + +``` python +aug = CharSubstitute('homonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是筹象的信汐符号,其中蕴含着逢富的语义锌息,人类可以很轻诵地理解其中的含义。', '人类语嫣是抽象的信息符号,其中蕴含着丰富的语义信息,人垒可以很情松地理婕其种的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的辛息符豪,其中匀含着丰富的语义信息,人类可以很庆耸地理解其中的含义。', '人磊语晏是抽象的新息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理劫其种的含义。'], ['而叽算机只能处理数值化的信息,无法直接理解人蕾语堰,所以需要将人类语演进行数值化专换。', '而疾算机只能杵理数值华的信息,无法直接理捷人类语验,所以需要将人类语言进行数值化转换。']] +``` + +可以根据的实际需求,直接设置句子中被替换的字数量 `aug_n`: +``` python +aug = CharSubstitute('homonym', create_n=1, aug_n=3) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的裕义信息,人类可以很轻送地理解其中的含漪。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中陨含着丰富的语义信息,人蕾可以很轻松地理解其种的含义。'], ['而计算机只能处理数值化的心息,无罚直接理解人类语言,所以需要将人类煜言进行数值化转换。']] +``` + +**反义字替换** + +根据反义字字表将句子中的字替换为反义字: + +``` python +aug = CharSubstitute('antonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰穷的语义信息,人类可以很轻紧地理结其西的露义。', '人类语言是抽象的疑息符号,其西蕴含着歉富的语义信息,人类可以很轻松地理结其中的露义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的疑作符号,其洋蕴含着丰贫的语义信息,人类可以很轻松地理解其中的露义。', '人类语言是抽象的信息符号,其洋蕴含着歉贫的语义信息,人类可以很轻紧地理系其中的含义。'], ['而计算机只能处理数值凝的疑作,无法曲接理扎人类语言,所以需要将人类语言进行数值化转换。', '而计算机只能处理数值化的信作,无法屈接理结人类语言,所以需要将人类语言退行数值凝转换。']] +``` + +**本地字表替换** + +只需要传入本地字表文件路径`custom_file_path`,即可使用自定义的字表进行替换。本地字表文件为固定格式的`json`文件,字典关键字(key)为字,字典键值(item)为列表形式的替换字。例如自定义本地字表`custom.json`如下: +``` +{"人":["任", "认","忍"], "抽":["丑","臭"], "轻":["亲","秦"],"数":["书","树"],"转":["赚","专"],"理":["里","例"]} +``` + +使用自定义的本地字表进行句子中字替换: +``` python +custom_file_path = "custom.json" +aug = CharSubstitute('custom', custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是丑象的信息符号,其中蕴含着丰富的语义信息,人类可以很秦松地理解其中的含义。', '人类语言是臭象的信息符号,其中蕴含着丰富的语义信息,人类可以很秦松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是丑象的信息符号,其中蕴含着丰富的语义信息,人类可以很秦松地例解其中的含义。', '人类语言是臭象的信息符号,其中蕴含着丰富的语义信息,人类可以很秦松地里解其中的含义。'], ['而计算机只能处例书值化的信息,无法直接里解人类语言,所以需要将人类语言进行书值化专换。', '而计算机只能处里书值化的信息,无法直接例解人类语言,所以需要将人类语言进行树值化赚换。']] +``` + +**组合替换** + +还可以选择将同音字、本地字表进行随机组合,例如组合同音字表和本地字表进行字替换: +``` python +custom_file_path = "custom.json" +aug = CharSubstitute(['custom','homonym'], custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信囍符号,其中蕴含着丰斧的遇倚信息,人类可以很轻颂地理解其中的含义。', '人类语言是抽乡的信吸符好,其终蕴含着丰富的语义芯息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类於言是抽想的信息肤号,其中蕴含着丰腐的语义信息,人类可以很轻松地理解其中的含诣。', '人类语言是抽项的信息符号,其中蕴憨着丰富的娱义信息,人类可以很请怂地理解其中的含义。'], ['而计算机只能处理数值划的信羲,无法直接理解人类钰言,所以墟要将人类语闫进行数值化转换。', '而计算羁只能处理数值化的信熙,无法直介理解人类语岩,所以需要将人类语焰进行数值化转换。']] +``` + +**随机字替换** + +使用随机字进行句子中字替换: +``` python +aug = CharSubstitute('random', create_n=2, aug_percent=0.1) +augmented = 
aug.augment(s[0]) +print(augmented) +# [['人开自言是抽象的信息符号,其中蕴正着丰富的语义信息,人类可以很拜松地理解其中的含侯。', '人类语言是抽象的许息符号,其世蕴银着丰B的语义莘息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类吧言是抽象的信息符号,其中蕴含着丰富的萎义桅小,人类可以很轻松地理解其中的后义。', '人类语言是河象的信夹符号,其中蕴含着丰刘的语义信息,人类可以很轻李地理解其中的含阿。'], ['而庙算机只能处葛数弘化的信息,无法直接理解人类语拉,所以需要将人吴语言进行数值化转换。', '而n算机只能处理数值化的信息,无法直接理解人红语言,所以需要将人类语言进行林值查转P。']] +``` + +**上下文替换** + +上下文替换是随机将句子中单字进行掩码,利用中文预训练模型ERNIE 3.0,根据句子中的上下文预测被掩码的单字。相比于根据字表进行字替换,上下文替换预测出的单字更匹配句子内容,数据增强所需的时间也更长。 + +使用模型根据上下文预测单字进行句子中字替换: +``` python +import paddle +# 在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = CharSubstitute('mlm', create_n=1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息符号,其中包含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +``` +句子中被替换的字数量目前只支持 `aug_n` 为1。 + + + +### 字插入 +字插入数据增强策略也即将句子中的字随机插入其他单字进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.CharInsert`进行字级别插入的数据增强。 + +```text +CharInsert 参数介绍: + + aug_type(str or list(str)): + 字插入增强策略类别。可以选择"antonym"、"homonym"、"custom"、"random"、"mlm"或者 + 前三种字插入增强策略组合。 + + custom_file_path (str,*可选*): + 本地数据增强字表路径。如果字插入增强策略选择"custom",本地数据增强字表路径不能为None。默认为None。 + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被插入字数量。默认为None + + aug_percent(int): + 数据增强句子中被插入字数量占全句字比例。如果aug_n不为None,则被插入字数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被插入字数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被插入字数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍字级别插入的使用: + +``` python +from paddlenlp.dataaug import CharInsert +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", "而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +**同音字插入** +根据同音字表将句子中的字前/后插入同音字,可以根据实际需要,设置插入字数量占全句字比例`aug_percent`和生成增强句子数量`create_n`。 + +``` python +aug = CharInsert('homonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语寓言咽是抽象的信息符复号,其中蕴韵含着丰富夫的语义信息,人类可以很轻松地理解其中的含义。', '人镭类语岩言是抽想象的信息符号,其忠中蕴含着疯丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类勒语言是抽象想的信息符号,其中蕴含着丰富的语誉义以信息,人类可以很轻卿松地理解其中的含义。', '人泪类语言是抽象的芯信息符号,其中蕴含着枫丰富的语疑义锌信息,人类可以很轻松地理解其中的含义。'], ['而计算机只能处理数植值化的新信息,无法直接狸理解人类语言,所以需要将人类峪语言进行书数值化转换。', '而计算机只能处理梳数值化的新信息,无法直接笠理解人类语言,所以需要将人类语衍言进行数值化赚转换。']] +``` + +可以根据的实际需求,直接设置句子中被替换的字数量 `aug_n`: +``` python +aug = CharInsert('homonym', create_n=1, aug_n=3) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类勒语言是抽象的信息符号,其中蕴含着丰缝富的语义信息,人类可以很轻松颂地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义新信息,人类可以很轻松地荔理解其终中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类语言,所以序需要将人類类语言进刑行数值化转换。']] +``` + + +**本地字表插入** + +只需要传入本地字表文件路径`custom_file_path`,即可使用自定义的字表进行插入。本地字表文件为固定格式的`json`文件,字典关键字(key)为字,字典键值(item)为列表形式的插入字。例如自定义本地字表`custom.json`如下: +``` +{"人":["任", "认","忍"], "抽":["丑","臭"], "轻":["亲","秦"],"数":["书","树"],"转":["赚","专"],"理":["里","例"]} +``` + +使用自定义的本地字表进行句子中字插入: +``` python +custom_file_path = "custom.json" +aug = CharInsert('custom', custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是臭抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很亲轻松地里理解其中的含义。', '人类语言是抽臭象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻秦松地理里解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是丑抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很秦轻松地例理解其中的含义。', '人类语言是丑抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很亲轻松地例理解其中的含义。'], ['而计算机只能处理例数树值化的信息,无法直接理例解人类语言,所以需要将人类语言进行数树值化转专换。', '而计算机只能处里理树数值化的信息,无法直接例理解人类语言,所以需要将人类语言进行书数值化赚转换。']] +``` +**反义字插入** + +根据反义字字表将句子中的字前/后插入反义字: + +``` python +aug = CharInsert('antonym', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# 
[['人类语言是抽象的疑信作息符号,其中蕴露含着丰富的语义信息,人类可以很轻紧松地理扎解其中的含义。', '人类语言是抽象的信疑息符号,其中洋蕴含着丰富穷的语义信息作,人类可以很轻松地理解其中的含露义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信作息符号,其中蕴含着丰富的语义信作息,人类可以很轻紧松地理系解其中的露含义。', '人类语言是抽象的信疑息符号,其中洋蕴含露着丰富的语义信息作,人类可以很轻松地理解扎其中的含义。'], ['而计算机只能处理数值凝化的信作息,无法屈直接理解人类语言,所以需要将人类语言进止行数值化停转换。', '而计算机只能处理数值化凝的信疑息,无法直接递理解系人类语言,所以需要将人类语言进行数值化凝转换。']] +``` + +**组合插入** + +还可以选择将同音字、同音字、本地字表进行随机组合,例如组合同音字表核本地字表进行字插入: +``` python +custom_file_path = "custom.json" +aug = CharInsert(['custom','homonym'], custom_file_path=custom_file_path, create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类镭语言是抽象的信鑫息夕符号壕,其中蕴含着丰富的语义信息,人类可以很轻晴松地理解其中的含义。', '人类语咽言是抽翔象的信息覆符号,其中蕴含着丰腐富的语义信息,人类可以很轻松地離理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是稠抽象的芯信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理桀解其重中的含裔义。', '人类语言是抽象的信息囍符号壕,其中蕴含着丰富孵的语义信息奚,人类可以很轻卿松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类语言阎,所以需要将人类语言衍进金行数值化哗转专换。', '而计纪算机只能处岀理隶数值化的信息,无法直接理解人类语言,所以需要将人类雷语言进行数值芷化转换。']] +``` + +**随机字插入** + +使用随机字进行句子中字插入: +``` python +aug = CharInsert('random', create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类S语言是抽象的信息符号,其中蕴含着丰富的语义信息,人鞋类可以很轻J松地张理解其中的含陈义。', '人类谷语言是抽象的信息符号,其中蕴含着丰富的语义信息,人烘类可以很轻割松地灵理解其中的含异义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语创言是抽象的信好息符号,其中蕴王含着丰富的语义信如息,人类可以很轻松地理解其中的丹含义。', '人类语F言是抽象的信M息符号,其中蕴史含着丰富的语义信伊息,人类可以很轻松地理解其中的秀含义。'], ['而计算机只能处楚理数值化O的信息,无法直接理解人类语丁言,所以需P要将人类语言进行甲数值化转换。', '而计算机只能处漫理数值化翁的信息,无法直接理解人类语奚言,所以需中要将人类语言进行黄数值化转换。']] +``` + + +**上下文插入** + +上下文插入是随机将句子中单字进行掩码,利用中文预训练模型ERNIE 3.0,根据句子中的上下文预测被掩码的单字。相比于根据字表进行字插入,上下文插入预测出的单字更匹配句子内容,数据增强所需的时间也更长。 + +使用模型根据上下文预测单字进行句子中字插入: +``` python +import paddle +# 在GPU环境下运行 +paddle.set_device("gpu") +# 在CPU下环境运行 +# paddle.set_device("cpu") +aug = CharInsert('mlm', create_n=1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的信息息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。'], ['而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值转化转换。']] +``` +句子中插入的字数量目前只支持 `aug_n` 为1。 + +### 字删除 + +字删除数据增强策略也即将句子中的字随机删除进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.CharDelete`进行字级别删除的数据增强。 + +```text +CharDelete 参数介绍: + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被删除字数量。默认为None + + aug_percent(int): + 数据增强句子中被删除字数量占全句字比例。如果aug_n不为None,则被删除字数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被删除字数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被删除字数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍字级别删除的使用: + +``` python +from paddlenlp.dataaug import CharDelete +s = ["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", "而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +将随机删除句子中的字: +``` python +aug = CharDelete(create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的。', '人类语言是抽象的信息符号,其中蕴含着丰富的语义,人类可以松地其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是的信息符号,其中丰富的语义,人类可以很轻松地理解其中的含义。', '人类语言是的信息,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的。'], ['而计算机只能处理数值化的信息,无法直接理解语言,所以需要将人类语言进行转换。', '而计算机处理数值化的信息,无法直接人类语言,所以需要将人类语言进行数值化。']] +``` + +### 字交换 + +字交换数据增强策略也即将句子中的字的位置随机交换进行数据增强,这里我们将介绍如何使用`paddlenlp.dataaug.CharSwap`进行字级别交换的数据增强。 + +```text +CharSwap 参数介绍: + + create_n(int): + 数据增强句子数量。默认为1。 + + aug_n(int): + 数据增强句子中被交换字数量。默认为None + + aug_percent(int): + 数据增强句子中被交换字数量占全句字比例。如果aug_n不为None,则被交换字数量为aug_n。默认为0.1。 + + aug_min (int): + 数据增强句子中被交换字数量最小值。默认为1。 + + aug_max (int): + 数据增强句子中被交换字数量最大值。默认为10。 +``` + +我们接下来将以下面的例子介绍字级别交换的使用: + +``` python +from paddlenlp.dataaug import CharSwap +s = 
["人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。", "而计算机只能处理数值化的信息,无法直接理解人类语言,所以需要将人类语言进行数值化转换。"] +``` + +将随机交换句子中的字: +``` python +aug = CharSwap(create_n=2, aug_percent=0.1) +augmented = aug.augment(s[0]) +print(augmented) +# [['人类语言是抽象的符号信息,其中蕴含着丰富的语义信息,人类可以很轻松地理解其中的含义。']] +augmented = aug.augment(s) +print(augmented) +# [['人类语言是抽象的信息符号,其中蕴含着丰富的语义信息,人类可以松地很轻理解其中的含义。'], ['而计算机只能处理化数值的信息,无法直接理解人类语言,所以需要将人类语言进行数值转换化。']] +``` + + + + +## 4. 文档一键增强 + +数据增强API也提供了文档一键增强功能,可以输入指定格式文件进行数据增强。 +```text +FileAugment 初始化参数介绍: + + strategies(list): + 输入应用的数据增强策略。 +``` + +我们接下来将以下面的例子介绍文档一键增强的使用。 + +只需要传入固定格式的`txt`文件,如下自定义输入文件`data.txt`: + +```text +25岁已经感觉脸部松弛了怎么办 +小孩的眉毛剪了会长吗? +... +``` + +我们对文件`data.txt`应用词替换和词插入数据增强策略。 + +```python +from paddlenlp.dataaug import WordSubstitute, WordInsert, FileAugment +aug1 = WordSubstitute('synonym', create_n=1, aug_percent=0.1) +aug2 = WordInsert('synonym', create_n=1, aug_percent=0.1) +aug = FileAugment([aug1,aug2]) +aug.augment(input_file='data.txt', output_file="aug.txt") +``` + +数据增强结果保存在`aug.txt`中,如下: +```text +25岁已经感觉面松弛了怎么办 +小朋友的眉毛剪了会长吗? +25岁已经感觉脸部松驰松弛了怎么办 +幼儿小孩的眉毛剪了会长吗? +``` + +如果输入的文件中带有文本标签,如下自定义输入文件`data.txt`: + +```text +25岁已经感觉脸部松弛了怎么办 治疗方案 +小孩的眉毛剪了会长吗? 其他 +``` +我们可以通过定义`separator`和`separator_id`选择只对其中部分文本进行数据增强策略。 +```python +aug.augment(input_file='data.txt', output_file="aug.txt", separator='\t', separator_id=0) +``` + +数据增强结果保存在`aug.txt`中,如下: + +```text +25阴历年已经感觉脸部松弛了怎么办 治疗方案 +小孩子的眉毛剪了会长吗? 其他 +25岁已经感觉面庞脸部松弛了怎么办 治疗方案 +小孩小朋友的眉毛剪了会长吗? 其他 +``` diff --git a/docs/datasets.md b/docs/datasets.md new file mode 100644 index 0000000000000000000000000000000000000000..0d5d598a7da97bdef5af69cb0e0a4ef960f8ee35 --- /dev/null +++ b/docs/datasets.md @@ -0,0 +1,67 @@ +# PaddleNLP Datasets API + +PaddleNLP提供了以下数据集的快速读取API,实际使用时请根据需要**添加splits信息**: + +## 阅读理解 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | ----- | ------ | +| [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) | 斯坦福问答数据集,包括SQuAD1.1和SQuAD2.0|`paddlenlp.datasets.load_dataset('squad')` | +| [DuReader-yesno](https://aistudio.baidu.com/aistudio/competition/detail/49) | 千言数据集:阅读理解,判断答案极性|`paddlenlp.datasets.load_dataset('dureader_yesno')` | +| [DuReader-robust](https://aistudio.baidu.com/aistudio/competition/detail/49) | 千言数据集:阅读理解,答案原文抽取|`paddlenlp.datasets.load_dataset('dureader_robust')` | +| [CMRC2018](http://hfl-rc.com/cmrc2018/) | 第二届“讯飞杯”中文机器阅读理解评测数据集|`paddlenlp.datasets.load_dataset('cmrc2018')` | +| [DRCD](https://github.com/DRCKnowledgeTeam/DRCD) | 台達閱讀理解資料集|`paddlenlp.datasets.load_dataset('drcd')` | + +## 文本分类 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [CoLA](https://nyu-mll.github.io/CoLA/) | 单句分类任务,二分类,判断句子是否合法| `paddlenlp.datasets.load_dataset('glue','cola')`| +| [SST-2](https://nlp.stanford.edu/sentiment/index.html) | 单句分类任务,二分类,判断句子情感极性| `paddlenlp.datasets.load_dataset('glue','sst-2')`| +| [MRPC](https://microsoft.com/en-us/download/details.aspx?id=52398) | 句对匹配任务,二分类,判断句子对是否是相同意思| `paddlenlp.datasets.load_dataset('glue','mrpc')`| +| [STSB](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) | 计算句子对相似性,分数为1~5| `paddlenlp.datasets.load_dataset('glue','sts-b')`| +| [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) | 判定句子对是否等效,等效、不等效两种情况,二分类任务| `paddlenlp.datasets.load_dataset('glue','qqp')`| +| [MNLI](http://www.nyu.edu/projects/bowman/multinli/) | 句子对,一个前提,一个是假设。前提和假设的关系有三种情况:蕴含(entailment),矛盾(contradiction),中立(neutral)。句子对三分类问题| `paddlenlp.datasets.load_dataset('glue','mnli')`| +| [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) | 
判断问题(question)和句子(sentence)是否蕴含,蕴含和不蕴含,二分类| `paddlenlp.datasets.load_dataset('glue','qnli')`| +| [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) | 判断句对是否蕴含,句子1和句子2是否互为蕴含,二分类任务| `paddlenlp.datasets.load_dataset('glue','rte')`| +| [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) | 判断句子对是否相关,相关或不相关,二分类任务| `paddlenlp.datasets.load_dataset('glue','wnli')`| +| [LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html) | A Large-scale Chinese Question Matching Corpus 语义匹配数据集| `paddlenlp.datasets.load_dataset('lcqmc')`| +| [ChnSentiCorp](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/ChnSentiCorp_htl_all/intro.ipynb) | 中文评论情感分析语料| `paddlenlp.datasets.load_dataset('chnsenticorp')`| + + +## 序列标注 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [MSRA_NER](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra) | MSRA 命名实体识别数据集| `paddlenlp.datasets.load_dataset('msra_ner')`| +| [People's Daily](https://github.com/OYE93/Chinese-NLP-Corpus/tree/master/NER/People's%20Daily) | 人民日报命名实体识别数据集| `paddlenlp.datasets.load_dataset('peoples_daily_ner')`| + + +## 机器翻译 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [IWSLT15](https://workshop2015.iwslt.org/) | IWSLT'15 English-Vietnamese data 英语-越南语翻译数据集| `paddlenlp.datasets.load_dataset('iwslt15')`| +| [WMT14ENDE](http://www.statmt.org/wmt14/translation-task.html) | WMT14 EN-DE 经过BPE分词的英语-德语翻译数据集| `paddlenlp.datasets.load_dataset('wmt14ende')`| + + +## 机器同传 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [BSTC](https://aistudio.baidu.com/aistudio/competition/detail/44/) | 千言数据集:机器同传,包括transcription_translation和asr | `paddlenlp.datasets.load_dataset('bstc', 'asr')`| + + +## 文本生成 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [Poetry](https://github.com/chinese-poetry/chinese-poetry) | 中文诗歌古典文集数据| `paddlenlp.datasets.load_dataset('poetry')`| +| [Couplet](https://github.com/v-zich/couplet-clean-dataset) | 中文对联数据集| `paddlenlp.datasets.load_dataset('couplet')`| + +## 语料库 + +| 数据集名称 | 简介 | 调用方法 | +| ---- | --------- | ------ | +| [PTB](http://www.fit.vutbr.cz/~imikolov/rnnlm/) | Penn Treebank Dataset | `paddlenlp.datasets.load_dataset('ptb')`| +| [Yahoo Answer 100k](https://arxiv.org/pdf/1702.08139.pdf) | 从Yahoo Answer采样100K| `paddlenlp.datasets.load_dataset('yahoo_answer_100k')`| diff --git a/docs/get_started/installation.rst b/docs/get_started/installation.rst new file mode 100644 index 0000000000000000000000000000000000000000..4b1362b411b5e76342a89ace9be037434ac8e7a6 --- /dev/null +++ b/docs/get_started/installation.rst @@ -0,0 +1,111 @@ +安装PaddleNLP +^^^^^^^^ +以下安装过程默认用户已安装好paddlepaddle-gpu或paddlepaddle(版本大于或等于2.0),paddlepaddle安装方式参照 飞桨官网_。 + +.. _飞桨官网: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/2.0/install/pip/windows-pip.html + +pip安装 +-------- +.. code-block:: + + pip install --upgrade paddlenlp>=2.0.0rc -i https://pypi.org/simple + +Anaconda安装 +-------- +Anaconda是一个开源的Python发行版本,其包含了conda、Python等180多个科学包及其依赖项。使用Anaconda可以通过创建多个独立的Python环境,避免用户的Python环境安装太多不同版本依赖导致冲突。 + +1、windows安装Anaconda +>>>>>>>>> + +第一步 下载 +::::::::: +* 在 Anaconda官网_ 选择下载Windows Python3.7 64-Bit版本。 + +.. _Anaconda官网: https://www.anaconda.com/products/individual + +* 确保已经安装Visual C++ Build Tools(可以在开始菜单中找到),如未安装,请点击 下载安装_。 + +.. _下载安装: https://go.microsoft.com/fwlink/?Linkid=691126 + +第二步 安装 +::::::::: +运行下载的安装包(以.exe为后辍),根据引导完成安装, 用户可自行修改安装目录(如下图)。 + +.. 
image:: ../imgs/anaconda_windows.png + +第三步 使用 +::::::::: +* 点击Windows系统左下角的Windows图标,打开:所有程序->Anaconda3/2(64-bit)->Anaconda Prompt +* 在命令行中执行下述命令 + +.. code-block:: + + # 创建名为my_paddlenlp的环境,指定Python版本为3.7 + conda create -n my_paddlenlp python=3.7 + # 进入my_paddlenlp环境 + conda activate my_paddlenlp + # 安装PaddleNLP + pip install --upgrade paddlenlp>=2.0.0rc -i https://pypi.org/simple + +按如上方式配置后,即可在环境中使用PaddleNLP了,命令行输入python回车后,import paddlenlp试试吧,之后再次使用都可以通过打开'所有程序->Anaconda3/2(64-bit)->Anaconda Prompt',再执行conda activate my_paddlenlp进入环境后,即可再次使用PaddleNLP。 + +2、Linux/Mac安装Anaconda +>>>>>>>>> + +第一步 下载 +::::::::: +在 Anaconda官网_ 选择下载对应系统 Python3.7版本下载(Mac下载Command Line Installer版本即可)。 + +.. _Anaconda官网: https://www.anaconda.com/products/individual + +第二步 安装 +::::::::: +打开终端,在终端安装Anaconda + +.. code-block:: + + # ~/Downloads/Anaconda3-2019.07-Linux-x86_64.sh即下载的文件 + bash ~/Downloads/Anaconda3-2019.07-Linux-x86_64.sh + +安装过程中一直回车即可,如提示设置安装路径,可根据需求修改,一般默认即可。 + +第三步 使用 +::::::::: + +.. code-block:: + + # 创建名为my_paddlenlp的环境,指定Python版本为3.7 + conda create -n my_paddlenlp python=3.7 + # 进入my_paddlenlp环境 + conda activate my_paddlenlp + # 安装PaddleNLP + pip install --upgrade paddlenlp>=2.0.0rc -i https://pypi.org/simple + +按如上方式配置后,即可在环境中使用PaddleNLP了,命令行输入python回车后,import paddlenlp试试吧,之后再次使用都可以通过打开'所有程序->Anaconda3/2(64-bit)->Anaconda Prompt',再执行conda activate my_paddlenlp进入环境后,即可再次使用PaddleNLP。 + +代码安装 +--------- +github代码会跟随开发进度不断更新 + +.. code-block:: + + git clone https://github.com/PaddlePaddle/PaddleNLP.git + cd PaddleNLP + git checkout develop + +使用Docker镜像体验PaddleNLP +^^^^^^^^ + +如果您没有Docker运行环境,请参考 `Docker官网`_ 进行安装 + +.. _Docker官网: https://www.docker.com + +PaddleNLP提供了带有最新代码的docker镜像供您使用,您只需要*拉取docker镜像*,然后*运行docker镜像*,无需其他任何额外操作,即可开始使用PaddleNLP的所有功能。 + +在 `Docker Hub`_ 中获取这些镜像及相应的使用指南,包括CPU、GPU、ROCm版本。 + +.. _Docker Hub: https://hub.docker.com/repository/docker/paddlecloud/paddlenlp + +如果您对自动化制作docker镜像感兴趣,或有自定义需求,请访问 `PaddlePaddle/PaddleCloud`_ 做进一步了解。 + +.. _PaddlePaddle/PaddleCloud: https://github.com/PaddlePaddle/PaddleCloud/tree/main/tekton diff --git a/docs/get_started/quick_start.rst b/docs/get_started/quick_start.rst new file mode 100644 index 0000000000000000000000000000000000000000..1ece7a7aa95c1e5f85b8a1e283d83e2fa117694b --- /dev/null +++ b/docs/get_started/quick_start.rst @@ -0,0 +1,164 @@ +======== +10分钟完成高精度中文情感分析 +======== + +1. 安装PaddleNLP +======== + +安装相关过程和问题可以参考PaddleNLP的 安装文档_。 + +.. _安装文档: https://paddlenlp.readthedocs.io/en/latest/gettingstarted/install.html + + +.. code-block:: + + >>> pip install --upgrade paddlenlp -i https://pypi.org/simple + +2. 一键加载预训练模型 +======== + +情感分析本质是一个文本分类任务。PaddleNLP内置了ERNIE、BERT、RoBERTa、Electra等丰富的预训练模型,并且内置了各种预训练模型对于不同下游任务的Fine-tune网络。用户可以使用PaddleNLP提供的模型,完成问答、序列分类、token分类等任务。查阅 预训练模型_ 了解更多。这里以ERNIE模型为例,介绍如何将预训练模型Fine-tune完成文本分类任务。 + +.. _预训练模型: https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html + +加载预训练模型ERNIE + +.. code-block:: + + >>> MODEL_NAME = "ernie-3.0-medium-zh" + >>> ernie_model = paddlenlp.transformers.ErnieModel.from_pretrained(MODEL_NAME) + +加载预训练模型ERNIE用于文本分类任务的Fine-tune网络,只需指定想要使用的模型名称和文本分类的类别数即可完成网络定义。 + +.. code-block:: + + >>> model = paddlenlp.transformers.ErnieForSequenceClassification.from_pretrained( + ... MODEL_NAME, num_classes=len(label_list)) + +3. 调用Tokenizer进行数据处理 +======== + +Tokenizer用于将原始输入文本转化成模型可以接受的输入数据形式。PaddleNLP对于各种预训练模型已经内置了相应的Tokenizer,指定想要使用的模型名字即可加载。 + +.. 
code-block:: + + >>> tokenizer = paddlenlp.transformers.ErnieTokenizer.from_pretrained(MODEL_NAME) + +Transformer类预训练模型所需的数据处理步骤通常包括将原始输入文本切分token;将token映射为对应的token id;拼接上预训练模型对应的特殊token ,如[CLS]、[SEP];最后转化为框架所需的数据格式。为了方便使用,PaddleNLP提供了高阶API,一键即可返回模型所需数据格式。 + +一行代码完成切分token,映射token ID以及拼接特殊token: + +.. code-block:: + + >>> encoded_text = tokenizer(text="请输入测试样例") + +转化成paddle框架数据格式: + +.. code-block:: + + >>> input_ids = paddle.to_tensor([encoded_text['input_ids']]) + >>> print("input_ids : {}".format(input_ids)) + >>> token_type_ids = paddle.to_tensor([encoded_text['token_type_ids']]) + >>> print("token_type_ids : {}".format(token_type_ids)) + input_ids : Tensor(shape=[1, 9], dtype=int64, place=CUDAPlace(0), stop_gradient=True, + [[1 , 647, 789, 109, 558, 525, 314, 656, 2 ]]) + token_type_ids : Tensor(shape=[1, 9], dtype=int64, place=CUDAPlace(0), stop_gradient=True, + [[0, 0, 0, 0, 0, 0, 0, 0, 0]]) + +input_ids: 表示输入文本的token ID。 + +token_type_ids: 表示对应的token属于输入的第一个句子还是第二个句子。(Transformer类预训练模型支持单句以及句对输入。) + +此时即可输入ERNIE模型中得到相应输出。 + +.. code-block:: + + >>> sequence_output, pooled_output = ernie_model(input_ids, token_type_ids) + >>> print("Token wise output: {}, Pooled output: {}".format( + ... sequence_output.shape, pooled_output.shape)) + Token wise output: [1, 9, 768], Pooled output: [1, 768] + +可以看出,ERNIE模型输出有2个tensor。 + +sequence_output是对应每个输入token的语义特征表示,shape为(1, num_tokens, hidden_size)。其一般用于序列标注、问答等任务。 + +pooled_output是对应整个句子的语义特征表示,shape为(1, hidden_size)。其一般用于文本分类、信息检索等任务。 + +4. 加载数据集 +======== +PaddleNLP内置了适用于阅读理解、文本分类、序列标注、机器翻译等下游任务的多个数据集,这里我们使用公开中文情感分析数据集ChnSenticorp,包含7000多条正负向酒店评论数据。 + +一键加载PaddleNLP内置数据集: + +.. code-block:: + + >>> train_ds, dev_ds, test_ds = paddlenlp.datasets.load_dataset( + ... 'chnsenticorp', splits=['train', 'dev', 'test']) + +获取分类数据标签: + +.. code-block:: + + >>> label_list = train_ds.label_list + >>> print(label_list) + ['0', '1'] + +展示一些数据: + +.. code-block:: + + >>> for idx in range(5): + ... print(train_ds[idx]) + + {'text': '选择珠江花园的原因就是方便,有电动扶梯直接到达海边,周围餐馆、食廊、商场、超市、摊位一应俱全。 + 酒店装修一般,但还算整洁。 泳池在大堂的屋顶,因此很小,不过女儿倒是喜欢。 包的早餐是西式的,还算丰富。 服务吗,一般', 'label': 1} + {'text': '15.4寸笔记本的键盘确实爽,基本跟台式机差不多了,蛮喜欢数字小键盘,输数字特方便,样子也很美观,做工也相当不错', 'label': 1} + {'text': '房间太小。其他的都一般。。。。。。。。。', 'label': 0} + {'text': '1.接电源没有几分钟,电源适配器热的不行. 2.摄像头用不起来. 3.机盖的钢琴漆,手不能摸,一摸一个印. 4.硬盘分区不好办.', 'label': 0} + {'text': '今天才知道这书还有第6卷,真有点郁闷:为什么同一套书有两种版本呢?当当网是不是该跟出版社商量商量, + 单独出个第6卷,让我们的孩子不会有所遗憾。', 'label': 1} + +5. 模型训练与评估 +======== +数据读入时使用 :func:`paddle.io.DataLoader` 接口多线程异步加载数据,然后设置适用于ERNIE这类Transformer模型的动态学习率和损失函数、优化算法、评价指标等。 + +模型训练的过程通常按照以下步骤: + +#. 从dataloader中取出一个batch data。 +#. 将batch data喂给model,做前向计算。 +#. 将前向计算结果传给损失函数,计算loss。将前向计算结果传给评价方法,计算评价指标。 +#. loss反向回传,更新梯度。重复以上步骤。 +#. 每训练一个epoch时,程序将会评估一次,评估当前模型训练的效果。 + +本示例同步在AIStudio上,可直接 在线体验模型训练_。 + +.. _在线体验模型训练: https://aistudio.baidu.com/aistudio/projectdetail/1294333 + +最后,保存训练好的模型用于预测。 + +6. 模型预测 +======== +保存训练模型,定义预测函数 :func:`predict` ,即可开始预测文本情感倾向。 + +以自定义预测数据和数据标签为示例: + +.. code-block:: + + >>> data = [ + ... '这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般', + ... '怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片', + ... '作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。', + ... ] + >>> label_map = {0: 'negative', 1: 'positive'} + +得到预测结果: + +.. code-block:: + + >>> results = predict( + ... model, data, tokenizer, label_map, batch_size=batch_size) + >>> for idx, text in enumerate(data): + ... 
print('Data: {} \t Label: {}'.format(text, results[idx])) + Data: 这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般 Label: negative + Data: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片 Label: negative + Data: 作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。 Label: positive diff --git a/docs/imgs/anaconda_windows.png b/docs/imgs/anaconda_windows.png new file mode 100644 index 0000000000000000000000000000000000000000..fe1c62ff6134f1d3cba928d91940f404ae9ac11d Binary files /dev/null and b/docs/imgs/anaconda_windows.png differ diff --git a/docs/imgs/data_preprocess_pipline.png b/docs/imgs/data_preprocess_pipline.png new file mode 100644 index 0000000000000000000000000000000000000000..0bf1887ac272f1d961da4ca2c0ad41e538959b46 Binary files /dev/null and b/docs/imgs/data_preprocess_pipline.png differ diff --git a/docs/imgs/fast_generation.png b/docs/imgs/fast_generation.png new file mode 100644 index 0000000000000000000000000000000000000000..4e03be62ed9a3eceed467e083ef6bbdb8868d210 Binary files /dev/null and b/docs/imgs/fast_generation.png differ diff --git a/docs/imgs/paddlenlp.png b/docs/imgs/paddlenlp.png new file mode 100644 index 0000000000000000000000000000000000000000..5053c464119c15872c42acfee746881098e8e12b Binary files /dev/null and b/docs/imgs/paddlenlp.png differ diff --git a/docs/index.rst b/docs/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..8780d0e9093fa0e916b71457eadcd3b364db1741 --- /dev/null +++ b/docs/index.rst @@ -0,0 +1,113 @@ +欢迎使用PaddleNLP +================== + +`PaddleNLP `_ 是飞桨自然语言处理开发库,具备 **易用的文本领域API**,**多场景的应用示例**、和 **高性能分布式训练** 三大特点,旨在提升飞桨开发者文本领域建模效率,旨在提升开发者在文本领域的开发效率,并提供丰富的NLP应用示例。 + + +- **易用的文本领域API** + + - 提供丰富的产业级预置任务能力 **Taskflow** 和全流程的文本领域API:支持丰富中文数据集加载的 **Dataset API**,可灵活高效地完成数据预处理的 **Data API** ,预置60+预训练词向量的 **Embedding API** ,提供100+预训练模型的 **Transformer API** 等,可大幅提升NLP任务建模的效率。 + +- **多场景的应用示例** + + - 覆盖从学术到产业级的NLP应用示例,涵盖NLP基础技术、NLP系统应用以及相关拓展应用。全面基于飞桨核心框架2.0全新API体系开发,为开发者提供飞桨文本领域的最佳实践。 + +- **高性能分布式训练** + + - 基于飞桨核心框架领先的自动混合精度优化策略,结合分布式Fleet API,支持4D混合并行策略,可高效地完成大规模预训练模型训练。 + + +* 项目GitHub: https://github.com/PaddlePaddle/PaddleNLP +* 项目Gitee: https://gitee.com/paddlepaddle/PaddleNLP +* GitHub Issue反馈: https://github.com/PaddlePaddle/PaddleNLP/issues +* 微信交流群: 微信扫描二维码并填写问卷之后,即可加入交流群,与众多社区开发者以及官方团队深度交流。 + +.. image:: https://user-images.githubusercontent.com/11793384/184784832-bb97930f-a738-4480-99be-517aeb65afac.png + :align: center + :alt: paddlenlp微信交流群二维码 + +.. toctree:: + :maxdepth: 1 + :caption: 快速开始 + + 安装 + 10分钟完成高精度中文情感分析 + +.. toctree:: + :maxdepth: 1 + :caption: 数据准备 + + 整体介绍 + 数据集列表 + 加载数据集 + 自定义数据集 + 数据处理 + +.. toctree:: + :maxdepth: 1 + :caption: 模型库 + + Transformer预训练模型 + 使用Trainer API训练 + 使用Trainer API进行模型压缩 + 一键预测功能 + 预训练词向量 + + +.. toctree:: + :maxdepth: 1 + :caption: 评价指标 + + 评价指标 + +.. toctree:: + :maxdepth: 1 + :caption: 实践教程 + + AI Studio Notebook + +.. toctree:: + :maxdepth: 1 + :caption: 进阶指南 + + 模型压缩 + 文本生成高性能加速 + 大规模分布式训练 + +.. toctree:: + :maxdepth: 1 + :caption: 社区交流共建 + + 如何贡献模型 + 如何贡献数据集 + 如何贡献文档案例 + 如何加入兴趣小组 + +.. toctree:: + :maxdepth: 1 + :caption: FAQ + + FAQ + +.. 
toctree:: + :maxdepth: 1 + :caption: API Reference + + paddlenlp.data + paddlenlp.datasets + paddlenlp.embeddings + paddlenlp.layers + paddlenlp.losses + paddlenlp.metrics + paddlenlp.ops + paddlenlp.seq2vec + paddlenlp.taskflow + paddlenlp.trainer + paddlenlp.transformers + paddlenlp.utils + +Indices and tables +==================== +* :ref:`genindex` +* :ref:`modindex` +* :ref:`search` diff --git a/docs/locale/en/LC_MESSAGES/FAQ.po b/docs/locale/en/LC_MESSAGES/FAQ.po new file mode 100644 index 0000000000000000000000000000000000000000..2439942650b8553133c0aa8bac07f1b0efc79c29 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/FAQ.po @@ -0,0 +1,686 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../FAQ.md:1 +msgid "PaddleNLP常见问题汇总(持续更新)" +msgstr "" + +#: ../FAQ.md:3 +msgid "【精选】NLP精选5问" +msgstr "" + +#: ../FAQ.md:5 ../FAQ.md:59 +msgid "Q1.1 如何加载自己的本地数据集,以便使用PaddleNLP的功能?" +msgstr "" + +#: ../FAQ.md:6 ../FAQ.md:88 +msgid "Q1.2 PaddleNLP会将内置的数据集、模型下载到默认路径,如何修改路径?" +msgstr "" + +#: ../FAQ.md:7 ../FAQ.md:98 +msgid "Q1.3 PaddleNLP中如何保存、加载训练好的模型?" +msgstr "" + +#: ../FAQ.md:8 ../FAQ.md:134 +msgid "Q1.4 当训练样本较少时,有什么推荐的方法能提升模型效果吗?" +msgstr "" + +#: ../FAQ.md:9 ../FAQ.md:140 +msgid "Q1.5 如何提升模型的性能,提升QPS?" +msgstr "" + +#: ../FAQ.md:11 +msgid "【理论篇】NLP通用问题" +msgstr "" + +#: ../FAQ.md:13 ../FAQ.md:152 +msgid "Q2.1 数据类别分布不均衡, 有哪些应对方法?" +msgstr "" + +#: ../FAQ.md:14 ../FAQ.md:166 +msgid "Q2.2 如果使用预训练模型,一般需要多少条样本?" +msgstr "" + +#: ../FAQ.md:16 +msgid "【实战篇】PaddleNLP实战问题" +msgstr "" + +#: ../FAQ.md:18 ../FAQ.md:177 +msgid "数据集和数据处理" +msgstr "" + +#: ../FAQ.md:20 ../FAQ.md:181 +msgid "Q3.1 使用自己的数据集训练预训练模型时,如何引入额外的词表?" +msgstr "" + +#: ../FAQ.md:22 ../FAQ.md:192 +msgid "模型训练调优" +msgstr "" + +#: ../FAQ.md:24 ../FAQ.md:196 +msgid "Q3.2 如何加载自己的预训练模型,进而使用PaddleNLP的功能?" +msgstr "" + +#: ../FAQ.md:25 ../FAQ.md:230 +msgid "Q3.3 如果训练中断,需要继续热启动训练,如何保证学习率和优化器能从中断地方继续迭代?" +msgstr "" + +#: ../FAQ.md:26 ../FAQ.md:252 +msgid "Q3.4 如何冻结模型梯度?" +msgstr "" + +#: ../FAQ.md:27 ../FAQ.md:313 +msgid "Q3.5 如何在eval阶段打印评价指标,在各epoch保存模型参数?" +msgstr "" + +#: ../FAQ.md:28 ../FAQ.md:331 +msgid "Q3.6 训练过程中,训练程序意外退出或Hang住,应该如何排查?" +msgstr "" + +#: ../FAQ.md:30 ../FAQ.md:339 +msgid "Q3.7 在模型验证和测试过程中,如何保证每一次的结果是相同的?" +msgstr "" + +#: ../FAQ.md:31 ../FAQ.md:351 +msgid "Q3.8 ERNIE模型如何返回中间层的输出?" +msgstr "" + +#: ../FAQ.md:33 ../FAQ.md:358 +msgid "预测部署" +msgstr "" + +#: ../FAQ.md:35 ../FAQ.md:362 +msgid "Q3.9 PaddleNLP训练好的模型如何部署到服务器 ?" +msgstr "" + +#: ../FAQ.md:36 ../FAQ.md:380 +msgid "Q3.10 静态图模型如何转换成动态图模型?" +msgstr "" + +#: ../FAQ.md:38 +msgid "特定模型和应用场景咨询" +msgstr "" + +#: ../FAQ.md:39 ../FAQ.md:390 +msgid "Q4.1 【词法分析】LAC模型,如何自定义标签label,并继续训练?" +msgstr "" + +#: ../FAQ.md:40 ../FAQ.md:398 +msgid "Q4.2 信息抽取任务中,是否推荐使用预训练模型+CRF,怎么实现呢?" +msgstr "" + +#: ../FAQ.md:41 +msgid "" +"Q4.3 " +"【阅读理解】MapDatasets的map()方法中对应的batched=True怎么理解,在阅读理解任务中为什么必须把参数batched设置为True?" +msgstr "" + +#: ../FAQ.md:42 ../FAQ.md:410 +msgid "Q4.4 【语义匹配】语义索引和语义匹配有什么区别?" +msgstr "" + +#: ../FAQ.md:43 ../FAQ.md:416 +msgid "Q4.5 【解语】wordtag模型如何自定义添加命名实体及对应词类?" 
+msgstr "" + +#: ../FAQ.md:45 +msgid "其他使用咨询" +msgstr "" + +#: ../FAQ.md:46 ../FAQ.md:433 +msgid "Q5.1 在CUDA11使用PaddlNLP报错?" +msgstr "" + +#: ../FAQ.md:47 ../FAQ.md:439 +msgid "Q5.2 如何设置parameter?" +msgstr "" + +#: ../FAQ.md:48 ../FAQ.md:473 +msgid "Q5.3 GPU版的Paddle虽然能在CPU上运行,但是必须要有GPU设备吗?" +msgstr "" + +#: ../FAQ.md:49 ../FAQ.md:479 +msgid "Q5.4 如何指定用CPU还是GPU训练模型?" +msgstr "" + +#: ../FAQ.md:50 ../FAQ.md:487 +msgid "Q5.5 动态图模型和静态图模型的预测结果一致吗?" +msgstr "" + +#: ../FAQ.md:51 ../FAQ.md:493 +msgid "Q5.6 如何可视化acc、loss曲线图、模型网络结构图等?" +msgstr "" + +#: ../FAQ.md:53 +msgid "" +msgstr "" + +#: ../FAQ.md:55 +msgid "⭐️【精选】NLP精选5问" +msgstr "" + +#: ../FAQ.md:57 +msgid "" +msgstr "" + +#: ../FAQ.md:61 +msgid "" +"A: 通过使用PaddleNLP提供的 load_dataset, MapDataset 和 IterDataset " +",可以方便的自定义属于自己的数据集哦,也欢迎您贡献数据集到PaddleNLP repo。" +msgstr "" + +#: ../FAQ.md:63 +msgid "" +"从本地文件创建数据集时,我们 推荐 根据本地数据集的格式给出读取function并传入 load_dataset() 中创建数据集。 " +"以waybill_ie快递单信息抽取任务中的数据为例:" +msgstr "" + +#: ../FAQ.md:84 +msgid "如果您习惯使用paddle.io.Dataset/IterableDataset来创建数据集也是支持的,您也可以从其他python对象如List对象创建数据集,详细内容可参照官方文档-自定义数据集。" +msgstr "" + +#: ../FAQ.md:86 +msgid "" +msgstr "" + +#: ../FAQ.md:90 +msgid "A: 内置的数据集、模型默认会下载到$HOME/.paddlenlp/下,通过配置环境变量可下载到指定路径:" +msgstr "" + +#: ../FAQ.md:92 +msgid "(1)Linux下,设置 export PPNLP_HOME=\"xxxx\",注意不要设置带有中文字符的路径。" +msgstr "" + +#: ../FAQ.md:94 +msgid "(2)Windows下,同样配置环境变量 PPNLP_HOME 到其他非中文字符路径,重启即可。" +msgstr "" + +#: ../FAQ.md:96 +msgid "" +msgstr "" + +#: ../FAQ.md:100 +msgid "A:(1)PaddleNLP预训练模型" +msgstr "" + +#: ../FAQ.md:102 +msgid "​ 保存:" +msgstr "" + +#: ../FAQ.md:109 ../FAQ.md:125 +msgid "​ 加载:" +msgstr "" + +#: ../FAQ.md:116 +msgid "(2)常规模型 保存:" +msgstr "" + +#: ../FAQ.md:132 +msgid "" +msgstr "" + +#: ../FAQ.md:136 +msgid "" +"A: 增加训练样本带来的效果是最直接的。此外,可以基于我们开源的预训练模型进行热启,再用少量数据集fine-" +"tune模型。此外,针对分类、匹配等场景,小样本学习也能够带来不错的效果。" +msgstr "" + +#: ../FAQ.md:138 +msgid "" +msgstr "" + +#: ../FAQ.md:142 +msgid "" +"A: 从工程角度,对于服务器端部署可以使用Paddle " +"Inference高性能预测引擎进行预测部署。对于Transformer类模型的GPU预测还可以使用PaddleNLP中提供的FasterTransformer功能来进行快速预测,其集成了NV" +" FasterTransformer并进行了功能增强。" +msgstr "" + +#: ../FAQ.md:144 +msgid "" +"从模型策略角度,可以使用一些模型小型化技术来进行模型压缩,如模型蒸馏和裁剪,通过小模型来实现加速。PaddleNLP中集成了ERNIE-" +"Tiny这样一些通用小模型供下游任务微调使用。另外PaddleNLP提供了模型压缩示例,实现了DynaBERT、TinyBERT、MiniLM等方法策略,可以参考对自己的模型进行蒸馏压缩。" +msgstr "" + +#: ../FAQ.md:146 +msgid "" +msgstr "" + +#: ../FAQ.md:148 +msgid "⭐️【理论篇】NLP通用问题" +msgstr "" + +#: ../FAQ.md:150 +msgid "" +msgstr "" + +#: ../FAQ.md:154 +msgid "A: 可以采用以下几种方法优化类别分布不均衡问题:" +msgstr "" + +#: ../FAQ.md:156 +msgid "(1)欠采样:对样本量较多的类别进行欠采样,去除一些样本,使得各类别数目接近。" +msgstr "" + +#: ../FAQ.md:158 +msgid "(2)过采样:对样本量较少的类别进行过采样,选择样本进行复制,使得各类别数目接近。" +msgstr "" + +#: ../FAQ.md:160 +msgid "(3)修改分类阈值:直接使用类别分布不均衡的数据训练分类器,会使得模型在预测时更偏向于多数类,所以不再以0.5为分类阈值,而是针对少数类在模型仅有较小把握时就将样本归为少数类。" +msgstr "" + +#: ../FAQ.md:162 +msgid "(4)代价敏感学习:比如LR算法中设置class_weight参数。" +msgstr "" + +#: ../FAQ.md:164 +msgid "" +msgstr "" + +#: ../FAQ.md:168 +msgid "" +"A: " +"很难定义具体需要多少条样本,取决于具体的任务以及数据的质量。如果数据质量没问题的话,分类、文本匹配任务所需数据量级在百级别,翻译则需要百万级能够训练出一个比较鲁棒的模型。如果样本量较少,可以考虑数据增强,或小样本学习。" +msgstr "" + +#: ../FAQ.md:171 +msgid "" +msgstr "" + +#: ../FAQ.md:173 +msgid "⭐️【实战篇】PaddleNLP实战问题" +msgstr "" + +#: ../FAQ.md:175 +msgid "" +msgstr "" + +#: ../FAQ.md:179 +msgid "" +msgstr "" + +#: ../FAQ.md:183 +msgid "" +"A: " +"预训练模型通常会有配套的tokenzier和词典,对于大多数中文预训练模型,如ERNIE-3.0-Medium-zh,使用的都是字粒度的输入,tokenzier会将句子转换为字粒度的形式,模型无法收到词粒度的输入。如果希望引入额外的词典,需要修改预训练模型的tokenizer和词典,可以参考这里blog,另外注意embedding矩阵也要加上这些新增词的embedding表示。" +msgstr "" + +#: ../FAQ.md:185 +msgid "" 
+"另外还有一种方式可以使用这些字典信息,可以将数据中在词典信息中的词进行整体mask进行一个mask language " +"model的二次预训练,这样经过二次训练的模型就包含了对额外字典的表征。可参考 Mask Language Model 数据构建。" +msgstr "" + +#: ../FAQ.md:188 +msgid "此外还有些词粒度及字词混合粒度的预训练模型,在这些词粒度的模型下引入额外的词表也会容易些,我们也将持续丰富PaddleNLP中的预训练模型。" +msgstr "" + +#: ../FAQ.md:190 +msgid "" +msgstr "" + +#: ../FAQ.md:194 +msgid "" +msgstr "" + +#: ../FAQ.md:198 +msgid "" +"A: " +"以bert为例,如果是使用PaddleNLP训练,通过save_pretrained()接口保存的模型,可通过from_pretrained()来加载:" +msgstr "" + +#: ../FAQ.md:205 +msgid "如果不是上述情况,可以使用如下方式加载模型,也欢迎您贡献模型到PaddleNLP repo中。" +msgstr "" + +#: ../FAQ.md:207 +msgid "(1)加载BertTokenizer和BertModel" +msgstr "" + +#: ../FAQ.md:214 +msgid "" +"(2)调用save_pretrained()生成 model_config.json、 " +"tokenizer_config.json、model_state.pdparams、 vocab.txt " +"文件,保存到./checkpoint:" +msgstr "" + +#: ../FAQ.md:221 +msgid "" +"(3)修改model_config.json、 " +"tokenizer_config.json这两个配置文件,指定为自己的模型,之后通过from_pretrained()加载模型。" +msgstr "" + +#: ../FAQ.md:228 +msgid "" +msgstr "" + +#: ../FAQ.md:232 +msgid "A:" +msgstr "" + +#: ../FAQ.md:234 +msgid "(1)完全恢复训练状态,可以先将lr、 optimizer、model的参数保存下来:" +msgstr "" + +#: ../FAQ.md:242 +msgid "(2)加载lr、 optimizer、model参数即可恢复训练:" +msgstr "" + +#: ../FAQ.md:250 +msgid "" +msgstr "" + +#: ../FAQ.md:254 +msgid "A: 有多种方法可以尝试:" +msgstr "" + +#: ../FAQ.md:257 +msgid "(1)可以直接修改 PaddleNLP 内部代码实现,在需要冻结梯度的地方用 paddle.no_grad() 包裹一下" +msgstr "" + +#: ../FAQ.md:259 +msgid "paddle.no_grad() 的使用方式,以对 forward() 进行冻结为例:" +msgstr "" + +#: ../FAQ.md:282 +msgid "paddle.no_grad() 的使用也不局限于模型内部实现里面,也可以包裹外部的方法,比如:" +msgstr "" + +#: ../FAQ.md:296 +msgid "" +"(2)第二种方法:以ERNIE为例,将模型输出的 tensor 设置 stop_gradient 为 True。可以使用 " +"register_forward_post_hook 按照如下的方式尝试:" +msgstr "" + +#: ../FAQ.md:305 +msgid "" +"(3)第三种方法:在 optimizer 上进行处理,model.parameters 是一个 List,可以通过 name " +"进行相应的过滤,更新/不更新某些参数,这种方法需要对网络结构的名字有整体了解,因为网络结构的实体名字决定了参数的名字,这个使用方法有一定的门槛:" +msgstr "" + +#: ../FAQ.md:311 +msgid "" +msgstr "" + +#: ../FAQ.md:315 +msgid "" +"A: 飞桨主框架提供了两种训练与预测的方法,一种是用 " +"paddle.Model()对模型进行封装,通过高层API如Model.fit()、Model.evaluate()、Model.predict()等完成模型的训练与预测;另一种就是基于基础API常规的训练方式。" +msgstr "" + +#: ../FAQ.md:317 +msgid "(1)对于第一种方法:" +msgstr "" + +#: ../FAQ.md:319 +msgid "" +"我们可以设置 paddle.Model.fit() API中的 eval_data 和 eval_freq " +"参数在训练过程中打印模型评价指标:eval_data 参数是一个可迭代的验证集数据源,eval_freq 参数是评估的频率;当eval_data " +"给定后,eval_freq 的默认值为1,即每一个epoch进行一次评估。注意:在训练前,我们需要在 Model.prepare() " +"接口传入metrics参数才能在eval时打印模型评价指标。" +msgstr "" + +#: ../FAQ.md:321 +msgid "" +"关于模型保存,我们可以设置 paddle.Model.fit() 中的 save_freq 参数控制模型保存的频率:save_freq " +"的默认值为1,即每一个epoch保存一次模型。" +msgstr "" + +#: ../FAQ.md:323 +msgid "(2)对于第二种方法:" +msgstr "" + +#: ../FAQ.md:325 +msgid "我们在PaddleNLP的examples目录下提供了常见任务的训练与预测脚本:如GLUE 和 SQuAD等" +msgstr "" + +#: ../FAQ.md:327 +msgid "开发者可以参考上述脚本进行自定义训练与预测脚本的开发。" +msgstr "" + +#: ../FAQ.md:329 +msgid "" +msgstr "" + +#: ../FAQ.md:333 +msgid "A: 一般先考虑内存、显存(使用GPU训练的话)是否不足,可将训练和评估的batch size调小一些。" +msgstr "" + +#: ../FAQ.md:335 +msgid "需要注意,batch size调小时,学习率learning rate也要调小,一般可按等比例调整。" +msgstr "" + +#: ../FAQ.md:337 +msgid "" +msgstr "" + +#: ../FAQ.md:341 +msgid "A: 在验证和测试过程中常常出现的结果不一致情况一般有以下几种解决方法:" +msgstr "" + +#: ../FAQ.md:343 +msgid "(1)确保设置了eval模式,并保证数据相关的seed设置保证数据一致性。" +msgstr "" + +#: ../FAQ.md:345 +msgid "" +"(2)如果是下游任务模型,查看是否所有模型参数都被导入了,直接使用bert-" +"base这种预训练模型是不包含任务相关参数的,要确认导入的是微调后的模型,否则任务相关参数会随机初始化导致出现随机性。" +msgstr "" + +#: ../FAQ.md:347 +msgid "" 
+"(3)部分算子使用CUDNN后端产生的不一致性可以通过环境变量的设置来避免。如果模型中使用了CNN相关算子,可以设置FLAGS_cudnn_deterministic=True。如果模型中使用了RNN相关算子,可以设置CUBLAS_WORKSPACE_CONFIG=:16:8或CUBLAS_WORKSPACE_CONFIG=:4096:2(CUDNN" +" 10.2以上版本可用,参考CUDNN 8 release note)。" +msgstr "" + +#: ../FAQ.md:349 +msgid "" +msgstr "" + +#: ../FAQ.md:353 +msgid "" +"A: 目前的API设计不保留中间层输出,当然在PaddleNLP里可以很方便地修改源码。 " +"此外,还可以在ErnieModel的__init__函数中通过register_forward_post_hook()为想要保留输出的Layer注册一个forward_post_hook函数,在forward_post_hook函数中把Layer的输出保存到一个全局的List里面。forward_post_hook函数将会在forward函数调用之后被调用,并保存Layer输出到全局的List。详情参考register_forward_post_hook()。" +msgstr "" + +#: ../FAQ.md:356 +msgid "" +msgstr "" + +#: ../FAQ.md:360 +msgid "" +msgstr "" + +#: ../FAQ.md:364 +msgid "A: 我们推荐在动态图模式下开发,静态图模式部署。" +msgstr "" + +#: ../FAQ.md:366 +msgid "(1)动转静" +msgstr "" + +#: ../FAQ.md:368 +msgid "" +"动转静,即将动态图的模型转为可用于部署的静态图模型。 动态图接口更加易用,python " +"风格的交互式编程体验,对于模型开发更为友好,而静态图相比于动态图在性能方面有更绝对的优势。因此动转静提供了这样的桥梁,同时兼顾开发成本和性能。 " +"可以参考官方文档 动态图转静态图文档,使用 paddle.jit.to_static 完成动转静。 另外,在 PaddleNLP " +"我们也提供了导出静态图模型的例子,可以参考 waybill_ie 模型导出。" +msgstr "" + +#: ../FAQ.md:373 +msgid "(2)借助Paddle Inference部署" +msgstr "" + +#: ../FAQ.md:375 +msgid "" +"动转静之后保存下来的模型可以借助Paddle Inference完成高性能推理部署。Paddle Inference内置高性能的CPU/GPU " +"Kernel,结合细粒度OP横向纵向融合等策略,并集成 TensorRT 实现模型推理的性能提升。具体可以参考文档 Paddle " +"Inference 简介。 为便于初次上手的用户更易理解 NLP 模型如何使用Paddle Inference,PaddleNLP " +"也提供了对应的例子以供参考,可以参考 /PaddleNLP/examples 下的deploy目录,如基于ERNIE的命名实体识别模型部署。" +msgstr "" + +#: ../FAQ.md:378 +msgid "" +msgstr "" + +#: ../FAQ.md:382 +msgid "A: 首先,需要将静态图参数保存成ndarray数据,然后将静态图参数名和对应动态图参数名对应,最后保存成动态图参数即可。详情可参考参数转换脚本。" +msgstr "" + +#: ../FAQ.md:384 +msgid "" +msgstr "" + +#: ../FAQ.md:386 +msgid "⭐️特定模型和应用场景咨询" +msgstr "" + +#: ../FAQ.md:388 +msgid "" +msgstr "" + +#: ../FAQ.md:392 +msgid "A: 更新label文件tag.dict,添加 修改下CRF的标签数即可。" +msgstr "" + +#: ../FAQ.md:394 +msgid "可参考自定义标签示例,增量训练自定义LABLE示例。" +msgstr "" + +#: ../FAQ.md:396 +msgid "" +msgstr "" + +#: ../FAQ.md:400 +msgid "A: 预训练模型+CRF是一个通用的序列标注的方法,目前预训练模型对序列信息的表达也是非常强的,也可以尝试直接使用预训练模型对序列标注任务建模。" +msgstr "" + +#: ../FAQ.md:402 +msgid "" +msgstr "" + +#: ../FAQ.md:404 +msgid "Q4.3.【阅读理解】MapDatasets的map()方法中对应的batched=True怎么理解,在阅读理解任务中为什么必须把参数batched设置为True?" 
+msgstr "" + +#: ../FAQ.md:406 +msgid "" +"A: " +"batched=True就是对整个batch(这里不一定是训练中的batch,理解为一组数据就可以)的数据进行map,即map中的trans_func接受一组数据为输入,而非逐条进行map。在阅读理解任务中,根据使用的doc_stride不同,一条样本可能被转换成多条feature,对数据逐条map是行不通的,所以需要设置batched=True。" +msgstr "" + +#: ../FAQ.md:408 +msgid "" +msgstr "" + +#: ../FAQ.md:412 +msgid "" +"A: 语义索引要解决的核心问题是如何从海量 Doc 中通过 ANN 索引的方式快速、准确地找出与 query " +"相关的文档,语义匹配要解决的核心问题是对 query和文档更精细的语义匹配信息建模。换个角度理解, " +"语义索引是要解决搜索、推荐场景下的召回问题,而语义匹配是要解决排序问题,两者要解决的问题不同,所采用的方案也会有很大不同,但两者间存在一些共通的技术点,可以互相借鉴。" +msgstr "" + +#: ../FAQ.md:414 +msgid "" +msgstr "" + +#: ../FAQ.md:418 +msgid "" +"A: 其主要依赖于二次构造数据来进行finetune,同时要更新termtree信息。wordtag分为两个步骤: " +"(1)通过BIOES体系进行分词; (2)将分词后的信息和TermTree进行匹配。 因此我们需要: " +"(1)分词正确,这里可能依赖于wordtag的finetune数据,来让分词正确; " +"(2)wordtag里面也需要把分词正确后term打上相应的知识信息。wordtag自定义TermTree的方式将在后续版本提供出来。" +msgstr "" + +#: ../FAQ.md:425 +msgid "可参考issue。" +msgstr "" + +#: ../FAQ.md:427 +msgid "" +msgstr "" + +#: ../FAQ.md:429 +msgid "⭐️其他使用咨询" +msgstr "" + +#: ../FAQ.md:431 +msgid "" +msgstr "" + +#: ../FAQ.md:435 +msgid "A: 在CUDA11安装,可参考issue,其他CUDA版本安装可参考 官方文档" +msgstr "" + +#: ../FAQ.md:437 +msgid "" +msgstr "" + +#: ../FAQ.md:441 +msgid "A: 有多种方法: (1)可以通过set_value()来设置parameter,set_value()的参数可以是numpy或者tensor。" +msgstr "" + +#: ../FAQ.md:453 +msgid "(2)通过create_parameter()设置参数。" +msgstr "" + +#: ../FAQ.md:471 +msgid "" +msgstr "" + +#: ../FAQ.md:475 +msgid "" +"A: 不支持 GPU 的设备只能安装 CPU 版本的 PaddlePaddle。 GPU 版本的 PaddlePaddle 如果想只在 CPU " +"上运行,可以通过 export CUDA_VISIBLE_DEVICES=-1 来设置。" +msgstr "" + +#: ../FAQ.md:477 +msgid "" +msgstr "" + +#: ../FAQ.md:481 +msgid "A: 一般我们的训练脚本提供了 --device 选项,用户可以通过 --device 选择需要使用的设备。" +msgstr "" + +#: ../FAQ.md:483 +msgid "" +"具体而言,在Python文件中,我们可以通过·paddle.device.set_device()·,设置为gpu或者cpu,可参考 " +"set_device文档。" +msgstr "" + +#: ../FAQ.md:485 +msgid "" +msgstr "" + +#: ../FAQ.md:489 +msgid "A: 正常情况下,预测结果应当是一致的。如果遇到不一致的情况,可以及时反馈给 PaddleNLP 的开发人员,我们进行处理。" +msgstr "" + +#: ../FAQ.md:491 +msgid "" +msgstr "" + +#: ../FAQ.md:495 +msgid "" +"A: " +"可使用VisualDL进行可视化。其中acc、loss曲线图的可视化可参考Scalar——折线图组件使用指南,模型网络结构的可视化可参考Graph——网络结构组件使用指南。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/distributed_training.po b/docs/locale/en/LC_MESSAGES/advanced_guide/distributed_training.po new file mode 100644 index 0000000000000000000000000000000000000000..c2612396ac1f085ca14818e90ead14758b6467b9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/distributed_training.po @@ -0,0 +1,27 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/distributed_training.rst:3 +msgid "大规模分布式训练" +msgstr "" + +#: ../advanced_guide/distributed_training.rst:5 +msgid "大规模分布式训练:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fastgeneration.po b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fastgeneration.po new file mode 100644 index 0000000000000000000000000000000000000000..a927bd73636b72016446d75882fb7a505655f24c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fastgeneration.po @@ -0,0 +1,211 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:3 +msgid "FastGeneration加速生成API" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:5 +msgid "" +"FastGeneration是PaddleNLP " +"v2.2版本加入的一个高性能推理功能,可实现基于CUDA的序列解码。该功能可以用于多种生成类的预训练NLP模型,例如GPT、BART、UnifiedTransformer等,并且支持多种解码策略。因此该功能主要适用于机器翻译,文本续写,文本摘要,对话生成等任务。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:7 +msgid "" +"功能底层依托于 `FasterTransformer " +"`_ " +",该库专门针对Transformer系列模型及各种解码策略进行了优化。功能顶层封装于 `model.generate` " +"函数。功能的开启和关闭通过传入 `use_fast` 参数进行控制(默认为开启状态)。该功能具有如下特性:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:9 +msgid "全面支持生成式预训练模型。包括GPT、BART、mBART、UnifiedTransformer和UNIMO-text。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:10 +msgid "" +"支持大多数主流解码策略。包括Beam Search、Sampling、Greedy Search。以及Diverse Sibling " +"Search、Length Penalty等子策略。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:11 +msgid "" +"解码速度快。最高可达非加速版generate函数的 **17倍**。HuggingFace generate函数的 " +"**8倍**。**并支持FP16混合精度计算**。 详细性能试验数据请参见 `FastGeneration Performence " +"`_" +" 。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:12 +msgid "" +"易用性强。功能的入口为 `model.generate` " +",与非加速版生成api的使用方法相同,当满足加速条件时使用jit即时编译高性能算子并用于生成,不满足则自动切换回非加速版生成api。下图展示了FastGeneration的启动流程:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:17 +msgid "快速开始" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:19 +msgid "" +"为体现FastGeneration的易用性,我们在 `samples " +"`_" +" 文件夹中内置了几个典型任务示例,下面以基于GPT模型的中文文本续写任务为例:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:26 +msgid "" +"如果是第一次执行,PaddleNLP会启动即时编译( `JIT Compile " +"`_ )自动编译高性能解码算子。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:49 +msgid "编译过程通常会花费几分钟的时间但是只会进行一次,之后再次使用高性能解码不需要重新编译了。编译完成后会继续运行,可以看到生成的结果如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:56 +msgid "打开示例代码 `samples/gpt_sample.py` ,我们可以看到如下代码:" +msgstr "" + +#: 
../advanced_guide/fastgeneration/fastgeneration.rst:67 +msgid "" +"可以看到,FastGeneration的使用方法与 `model.generate()` " +"相同,只需传入输入tensor和解码相关参数即可,使用非常简便。如果要使用非加速版的 `model.generate()` 方法,只需传入 " +"`use_fast=False` 即可,示例如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:78 +msgid "" +"需要注意的是,如果传入 `model.generate()` 的参数不满足高性能版本的要求。程序会做出提示并自动切换为非加速版本,例如我们传入 " +"`min_length=1` ,会得到如下提示:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:86 +msgid "" +"关于该方法的更多参数可以参考API文档 `generate " +"`_" +" 。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:88 +msgid "" +"`samples " +"`_" +" 文件夹中的其他示例的使用方法相同。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:91 +msgid "其他示例" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:93 +msgid "" +"除了以上简单示例之外,PaddleNLP的examples中所有使用了 `model.generate()` " +"的示例都可以通过调整到合适的参数使用高性能推理。具体如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:95 +msgid "" +"`examples/dialogue/unified_transformer " +"`_" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:96 +msgid "" +"`model_zoo/gpt/fast_gpt " +"`_" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:97 +msgid "" +"`examples/text_generation/unimo-text " +"`_" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:98 +msgid "" +"`examples/text_summarization/bart " +"`_" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:100 +msgid "" +"根据提示修改对应参数即可使用FastGeneration加速生成。下面我们以基于 `Unified Transformer` " +"的任务型对话为例展示一下FastGeneration的加速效果:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:102 +msgid "打开以上链接中Unified Transformer对应的example,找到README中对应预测的脚本。稍作修改如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:122 +msgid "" +"由于这里只是展示性能,我们直接在 `model_name_or_path` 填入PaddleNLP预训练模型名称 " +"`unified_transformer-12L-cn-luge` 。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:124 +msgid "" +"可以看到,由于该任务为对话任务,我们为了防止模型生成过多安全回复(如:哈哈哈、不错等),保证生成结果具有更多的随机性,我们选择TopK-" +"sampling作为解码策略,并让k=5。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:126 +msgid "打开 `infer.py` ,可以看到我们传入的脚本参数大多都提供给了 `model.generate()` 方法:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:149 +msgid "运行脚本,输出结果如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:157 +msgid "可以看到,非加速版 `generate()` 方法的预测速度为每个step耗时1.5秒左右。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:159 +msgid "" +"下面我们在启动脚本中传入 `--faster` 参数,这会让 `generate()` 方法传入 `use_fast=True` " +",启动加速模式。同时我们需要设置 `--min_dec_len=0` " +",因为FastGeneration当前还不支持该参数。新的脚本启动参数如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:180 +msgid "再次运行脚本,输出结果如下(由于我们已经编译过高性能算子,所以这里不会重新编译):" +msgstr "" + +#: ../advanced_guide/fastgeneration/fastgeneration.rst:189 +msgid "" +"可以看到,FastGeneration的预测速度为每个step耗时0.4秒左右,提速超过三倍。如果减少 " +"`num_return_sequences` ,可以得到更高的加速比。" +msgstr "" + +#~ msgid "" +#~ "如果是第一次执行,PaddleNLP会启动即时编译( `JIT Compile " +#~ "`_ )自动编译高性能解码算子。" +#~ msgstr "" + +#~ msgid "" +#~ "`examples/language_model/gpt/fast_gpt " +#~ "`_" +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fasttransformer.po b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fasttransformer.po new file mode 100644 index 0000000000000000000000000000000000000000..4a7f27a8a45a692613e48e382172344191d04a7f --- /dev/null +++ 
b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/fasttransformer.po @@ -0,0 +1,331 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:3 +msgid "Transformer高性能加速" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:7 +msgid "使用环境说明" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:9 +msgid "本项目依赖于 PaddlePaddle 2.1.0 及以上版本或适当的 develop 版本" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:10 +msgid "CMake >= 3.10" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:11 +msgid "CUDA 10.1 或 10.2(需要 PaddlePaddle 框架一致)" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:12 +msgid "gcc 版本需要与编译 PaddlePaddle 版本一致,比如使用 gcc8.2" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:13 +msgid "推荐使用 Python3" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:14 +msgid "" +"`FasterTransformer " +"`_ 使用必要的环境" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:15 +msgid "环境依赖" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:17 +msgid "attrdict" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:18 +msgid "pyyaml" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:26 +msgid "快速开始" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:28 +msgid "" +"我们实现了基于 FasterTransformer 的自定义 op 的接入,打造了 FastGeneration 的能力,用于加速文本生成模型在 GPU " +"上的预测性能。接下来,我们将分别介绍基于 Python 动态图和预测库使用 FastGeneration 自定义 op 的方式,包括 op " +"的编译与使用。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:31 +msgid "Python 动态图使用自定义 op" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:34 +msgid "JIT 自动编译" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:36 +msgid "" +"目前当基于动态图使用 FastGeneration 预测加速自定义 op 时,PaddleNLP 提供了 Just In Time " +"的自动编译,在一些 API 上,用户无需关注编译流程,可以直接执行对应的 API,程序会自动编译需要的第三方库。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:38 +msgid "" +"以 Transformer 为例,可以直接调用 `TransformerGenerator()` 这个 API,程序会自动编译。使用示例可以参考 " +"`Transformer 预测加速使用示例-sample " +"`_,`Transformer" +" 预测加速使用示例-机器翻译 " +"`_。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:41 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:154 +msgid "编译自定义OP" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:43 +msgid "除了自动编译外,如果需要自行编译,我们已经提供对应的 CMakeLists.txt,可以参考使用如下的方式完成编译。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:46 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:159 +msgid "PaddleNLP 准备" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:48 +msgid "" +"首先,如果需要从源码自行编译,可以直接使用 Python 的 package 下的 paddlenlp,或是可从 github 克隆一个 " +"PaddleNLP,并重新编译:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:50 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:163 +msgid "以下以从 github 上 clone 一个新版 PaddleNLP 
为例:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:56 +msgid "其次,配置环境变量,让我们可以使用当前 clone 的 paddlenlp,并进入到自定义 OP 的路径,准备后续的编译操作:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:64 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:176 +msgid "编译" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:66 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:178 +msgid "编译之前,请确保安装的 PaddlePaddle 的版本高于 2.1.0 或是基于最新的 develop 分支的代码编译,并且正常可用。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:68 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:180 +msgid "编译自定义 OP 可以参照一下步骤:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:78 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:190 +msgid "可以使用的编译选项包括:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:80 +msgid "" +"`-DPY_CMD`: 指定当前装有 PaddlePaddle 版本的 python 环境,比如 " +"`-DPY_CMD=python3.7`。若未指定 `-DPY_CMD` 将会默认使用系统命令 `python` 对应的 Python。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:81 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:207 +msgid "" +"`-DSM`: 是指的所用 GPU 的 compute capability,建议不使用该选项设置,未设置时将自动检测。如要设置,需根据 " +"[compute capability](https://developer.nvidia.com/zh-cn/cuda-" +"gpus#compute) 进行设置,如 V100 时设置 `-DSM=70` 或 T4 时设置 `-DSM=75`。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:82 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:208 +msgid "" +"`-DWITH_GPT`: 是否编译带有 GPT 相关的 lib。若使用 GPT-2 高性能推理,需要加上 `-DWITH_GPT=ON`。默认为" +" OFF。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:83 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:209 +msgid "" +"`-DWITH_UNIFIED`: 是否编译带有 Unified Transformer 或是 UNIMOText 相关的 " +"lib。若使用,需要加上 `-DWITH_UNIFIED=ON`。默认为 ON。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:84 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:210 +msgid "`-DWITH_BART`: 是否编译带有 BART 支持的相关 lib。若使用,需要加上 `-DWITH_BART=ON`。默认为 ON。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:85 +#: ../advanced_guide/fastgeneration/fasttransformer.rst:211 +msgid "`-DWITH_DECODER`: 是否编译带有 decoder 优化的 lib。默认为 ON。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:87 +msgid "" +"最终,编译会在 `./build/lib/` 路径下,产出 `libdecoding_op.so`,即需要的 FastGeneration " +"decoding 执行的库。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:90 +msgid "使用 Transformer decoding 高性能推理" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:92 +msgid "" +"编写 python 脚本的时候,调用 `FasterTransformer API " +"`_" +" 即可实现 Transformer 模型的高性能预测。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:94 +msgid "举例如下:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:120 +msgid "" +"若当前环境下没有需要的自定义 op 的动态库,将会使用 JIT 自动编译需要的动态库。如果需要自行编译自定义 op " +"所需的动态库,可以如前文所述进行编译。编译好后,使用 " +"`FasterTransformer(decoding_lib=\"/path/to/lib\", ...)` 可以完成导入。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:122 +msgid "" +"更详细的例子可以参考 `Transformer 预测加速使用示例-sample " +"`_,`Transformer" +" 预测加速使用示例-机器翻译 " +"`_,我们提供了更详细用例。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:125 +msgid "Transformer decoding 示例代码" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:127 +msgid "使用 PaddlePaddle 仅执行 decoding 测试(float32):" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:137 +msgid 
"" +"使用 PaddlePaddle 仅执行 decoding 测试(float16): 执行 float16 的 " +"decoding,需要在执行的时候,加上 `--use_fp16_decoding` 选项。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:148 +msgid "" +"其中,`decoding_gemm` 不同参数的意义可以参考 `FasterTransformer 文档 " +"`_。这里提前执行 `decoding_gemm`,可以在当前路径下生成一个 config " +"文件,里面会包含针对当前 decoding 部分提供的配置下,性能最佳的矩阵乘的算法,并在执行的时候读入这个数据。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:151 +msgid "C++ 预测库使用自定义 op" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:156 +msgid "" +"在 C++ 预测库使用自定义 OP 需要将实现的 C++、CUDA 代码**以及 C++ 预测的 " +"demo**编译成一个可执行文件。因预测库支持方式与 Python 不同,这个过程将不会产生自定义 op " +"的动态库,将直接得到可执行文件。我们已经提供对应的 CMakeLists.txt ,可以参考使用如下的方式完成编译。并获取执行 demo。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:161 +msgid "" +"首先,因为需要基于当前环境重新编译,当前的 paddlenlp 的 python 包里面并不包含 FastGeneration 相关 " +"lib,需要从源码自行编译,可以直接使用 Python 的 package 下的 paddlenlp,或是可从 github 克隆一个 " +"PaddleNLP,并重新编译:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:169 +msgid "其次,让我们可以使用当前 clone 的 paddlenlp,并进入到自定义 OP 的路径,准备后续的编译操作:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:192 +msgid "" +"`-DPADDLE_LIB`: 需要指明使用的 PaddlePaddle 预测库的路径 " +"`/path/to/paddle_inference_install_dir/`,需要使用的 PaddlePaddle 的 lib " +"可以选择自行编译或者直接从官网下载 `paddle_inference_linux_lib " +"`_。需要注意的是,在该路径下,预测库的组织结构满足:" +" .. code-block::" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:206 +msgid "" +"`-DDEMO`: 说明预测库使用 demo 的位置。比如指定 -DDEMO=./demo/transformer_e2e.cc 或是 " +"-DDEMO=./demo/gpt.cc。最好使用绝对路径,若使用相对路径,需要是相对于 " +"`PaddleNLP/paddlenlp/ops/fast_transformer/src/` 的相对路径。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:212 +msgid "`-DWITH_MKL`: 若当前是使用的 mkl 的 Paddle lib,那么需要打开 MKL 以引入 MKL 相关的依赖。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:213 +msgid "`-DON_INFER`: 是否编译 paddle inference 预测库。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:214 +msgid "**当使用预测库的自定义 op 的时候,请务必开启 `-DON_INFER=ON` 选项,否则,不会得到预测库的可执行文件。**" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:217 +msgid "执行 Transformer decoding on PaddlePaddle" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:219 +msgid "" +"编译完成后,在 `build/bin/` 路径下将会看到 `transformer_e2e` " +"的一个可执行文件。通过设置对应的设置参数完成执行的过程。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:226 +msgid "举例说明:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:235 +msgid "其中:" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:237 +msgid "" +"`decoding_gemm` 不同参数的意义可以参考 `FasterTransformer 文档 " +"`_。这里提前执行 `decoding_gemm`,可以在当前路径下生成一个 config " +"文件,里面会包含针对当前 decoding 部分提供的配置下,性能最佳的矩阵乘的算法,并在执行的时候读入这个数据。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:238 +msgid "`DATA_HOME` 则是 `paddlenlp.utils.env.DATA_HOME` 返回的路径。" +msgstr "" + +#: ../advanced_guide/fastgeneration/fasttransformer.rst:240 +msgid "" +"预测所需要的模型文件,可以通过 `fast_transformer/README.md " +"`_" +" 文档中所记述的方式导出。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/index.po b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/index.po new file mode 100644 index 0000000000000000000000000000000000000000..d0c0a3168a8c7b6341a3c1e736ae91a23c564bef --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/fastgeneration/index.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/fastgeneration/index.rst:3 +msgid "文本生成高性能加速" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/distill_lstm.po b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/distill_lstm.po new file mode 100644 index 0000000000000000000000000000000000000000..a04e1157a185756ee74fa67b11aaeb7663505775 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/distill_lstm.po @@ -0,0 +1,128 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/model_compression/distill_lstm.rst:2 +msgid "由BERT到Bi-LSTM的知识蒸馏" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:6 +msgid "整体原理介绍" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:8 +msgid "" +"本例是将特定任务下BERT模型的知识蒸馏到基于Bi-LSTM的小模型中,主要参考论文 `Distilling Task-Specific " +"Knowledge from BERT into Simple Neural Networks " +"`_ \\ 实现。整体原理如下:" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:11 +msgid "在本例中,较大的模型是BERT被称为教师模型,Bi-LSTM被称为学生模型。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:13 +msgid "小模型学习大模型的知识,需要小模型学习蒸馏相关的损失函数。在本实验中,损失函数是均方误差损失函数,传入函数的两个参数分别是学生模型的输出和教师模型的输出。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:15 +msgid "" +"在论文的模型蒸馏阶段,作者为了能让教师模型表达出更多的“暗知识”(dark " +"knowledge,通常指分类任务中低概率类别与高概率类别的关系)供学生模型学习,对训练数据进行了数据增强。通过数据增强,可以产生更多无标签的训练数据,在训练过程中,学生模型可借助教师模型的“暗知识”,在更大的数据集上进行训练,产生更好的蒸馏效果。本文的作者使用了三种数据增强方式,分别是:" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:17 +msgid "Masking,即以一定的概率将原数据中的word token替换成 ``[MASK]`` ;" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:19 +msgid "POS—guided word replacement,即以一定的概率将原数据中的词用与其有相同POS tag的词替换;" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:21 +msgid "n-gram sampling,即以一定的概率,从每条数据中采样n-gram,其中n的范围可通过人工设置。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:26 +msgid "模型蒸馏步骤介绍" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:28 +msgid "" +"本实验分为三个训练过程:在特定任务上对BERT进行微调、在特定任务上对基于Bi-LSTM的小模型进行训练(用于评价蒸馏效果" +")、将BERT模型的知识蒸馏到基于Bi-LSTM的小模型上。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:31 +msgid "1. 
基于bert-base-uncased预训练模型在特定任务上进行微调" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:33 +msgid "" +"训练BERT的fine-tuning模型,可以去 `PaddleNLP " +"`_ 中\\ 的 `glue " +"`_" +" 目录下对bert-base-uncased做微调。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:36 +msgid "" +"以GLUE的SST-2任务为例,用bert-base-" +"uncased做微调之后,可以得到一个在SST-2任务上的教师模型,可以把在dev上取得最好Accuracy的模型保存下来,用于第三步的蒸馏。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:40 +msgid "2. 训练基于Bi-LSTM的小模型" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:42 +msgid "" +"在本示例中,小模型采取的是基于双向LSTM的分类模型,网络层分别是 ``Embedding`` 、``LSTM`` 、 带有 ``tanh`` " +"激活函数的 ``Linear`` 层,最后经过\\ 一个全连接的输出层得到logits。``LSTM`` 网络层定义如下:" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:50 +msgid "基于Bi-LSTM的小模型的 ``forward`` 函数定义如下:" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:66 +msgid "3.数据增强介绍" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:68 +msgid "" +"接下来的蒸馏过程,蒸馏时使用的训练数据集并不只包含数据集中原有的数据,而是按照上文原理介绍中的A、C两种方法进行数据增强后的总数据。 " +"在多数情况下,``alpha`` 会被设置为0,表示无视硬标签,学生模型只利用数据增强后的无标签数据进行训练。根据教师模型提供的软标签 " +"``teacher_logits`` \\ ,对比学生模型的 ``logits`` " +",计算均方误差损失。由于数据增强过程产生了更多的数据,学生模型可以从教师模型中学到更多的暗知识。" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:72 +msgid "数据增强的核心代码如下:" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:107 +msgid "4.蒸馏模型" +msgstr "" + +#: ../advanced_guide/model_compression/distill_lstm.rst:109 +msgid "" +"这一步是将教师模型BERT的知识蒸馏到基于Bi-LSTM的学生模型中,在本例中,主要是让学生模型(Bi-" +"LSTM)去学习教师模型的输出logits。\\ 蒸馏时使用的训练数据集是由上一步数据增强后的数据,核心代码如下:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/index.po b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/index.po new file mode 100644 index 0000000000000000000000000000000000000000..beff3fe46a772e5b129741b0eb162080c9f168bb --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/index.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/model_compression/index.rst:3 +msgid "模型压缩" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/introduction.po b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/introduction.po new file mode 100644 index 0000000000000000000000000000000000000000..0f2a0fb1c6053fce23acfe7854c5403a6dc6d1a0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/introduction.po @@ -0,0 +1,79 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../advanced_guide/model_compression/introduction.rst:3
+#: ../advanced_guide/model_compression/introduction.rst:11
+msgid "模型压缩简介"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:6
+msgid ""
+"近些年,基于Transformer的语言模型在机器翻译、阅读理解、文本匹配、自然语言推理等自然语言处理任务上取得了实质\\ "
+"进展。然而,海量的参数和计算资源的大量耗费,使BERT及其变体在部署中困难重重。模型压缩的发展,使得这些问题得到\\ 了缓解。"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:13
+msgid ""
+"模型压缩在保证一定精度的情况下,能够降低模型的存储,加速模型的推理时间。常见的模型压缩方法主要包括模型裁剪、量化和蒸馏。\\ "
+"下面分别对这几种方法进行简要的介绍。"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:17
+msgid "模型裁剪"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:18
+msgid "模型裁剪是通过对已经训练好的模型中不重要的网络连接进行裁剪,减少模型的冗余和计算量,从而减少网络存储、大幅度进行加速的模型压缩方法。"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:21
+msgid "量化"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:22
+msgid ""
+"一般而言,神经网络模型的参数都是用的32bit长度的浮点型数表示。实际上,有时不需要保留那么高的精度,可以通过量化的方法减少\\ "
+"模型的存储空间,通常用INT8代替Float32存储。比如,SGD(Stochastic Gradient "
+"Descent)所需要的精度仅为6~8bit,\\ "
+"因此合理的量化网络也可在保证精度的情况下减小模型的存储体积,并且能够大幅度加速,使得神经网络在CPU上的运行成为可能。\\ "
+"通常,量化包含多种方法,例如:二值神经网络、三元权重网络以及XNOR网络。"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:29
+msgid "蒸馏"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:30
+msgid ""
+"蒸馏本质是student模型(参数量较少的模型)对teacher模型(参数量较多)的拟合,student模型从teacher中学到知识,比自己单独学习效果更好。比较常见的方法通常是由Bert"
+" base蒸馏到\\ Bi-LSTM或者是Transformer层数更少的BERT小模型。例如DistilBERT,它保留了BERT-base "
+"97%的精度,\\ 减少了40%的参数,推理速度快了60%。"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:36
+msgid "模型压缩示例"
+msgstr ""
+
+#: ../advanced_guide/model_compression/introduction.rst:38
+msgid ""
+"下面将会对基于飞桨实现的常见的模型压缩示例进行介绍,其中《由BERT到Bi-LSTM的知识蒸馏》可以作为蒸馏实验的\"Hello "
+"World\"示例。\\ "
+"而《使用DynaBERT中的策略对BERT进行压缩》中使用的DynaBERT则是同时对不同尺寸的子网络进行训练,通过该方法训练后可以在推理阶段直接对模型裁剪。"
+msgstr ""
+
diff --git a/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/ofa_bert.po b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/ofa_bert.po
new file mode 100644
index 0000000000000000000000000000000000000000..3ee3881eaa9144bf8dab1ada4fb3820683f2249d
--- /dev/null
+++ b/docs/locale/en/LC_MESSAGES/advanced_guide/model_compression/ofa_bert.po
@@ -0,0 +1,167 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2021, PaddleNLP
+# This file is distributed under the same license as the PaddleNLP package.
+# FIRST AUTHOR , 2022.
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../advanced_guide/model_compression/ofa_bert.rst:2 +msgid "使用DynaBERT中的策略对BERT进行压缩" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:4 +msgid "" +"本教程使用的是 `DynaBERT-Dynamic BERT with Adaptive Width and Depth " +"`_ 中的训练策略。\\ " +"把原始模型作为超网络中最大的子模型,这里超网络指的是包含所有搜索空间在内的一个网络。\\ 原始模型包括多个相同大小的Transformer " +"Block。在每次训练前会选择当前轮次要训练的子模型,\\ 每个子模型包含多个相同大小的Sub Transformer Block,每个Sub " +"Transformer Block是选择不同宽度的Transformer Block得到的,\\ 一个Transformer " +"Block包含一个Multi-Head Attention和一个Feed-Forward Network,Sub Transformer " +"Block获得方式为:" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:10 +msgid "" +"1. 一个 ``Multi-Head Attention`` " +"层中有多个Head,每次选择不同宽度的子模型时,会同时对Head数量进行等比例减少,\\ " +"例如:如果原始模型中有12个Head,本次训练选择的模型是宽度为原始宽度75%的子模型,则本次训练中所有Transformer " +"Block的Head数量为9。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:13 +msgid "" +"2. ``Feed-Forward Network`` 层中 ``Linear`` 的参数大小进行等比例减少,例如:如果原始模型中 ``FFN``" +" 层的特征维度为3072,\\ 本次训练选择的模型是宽度为原始宽度75%的子模型,则本次训练中所有Transformer Block中 " +"``FFN`` 层的特征维度为2304。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:18 +msgid "整体原理介绍" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:20 +msgid "" +"1. " +"首先对预训练模型的参数和head根据其重要性进行重排序,把重要的参数和head排在参数的前侧,保证训练过程中的参数裁剪不会裁剪掉这些重要的参数。\\" +" " +"参数的重要性计算是先使用dev数据计算一遍每个参数的梯度,然后根据梯度和参数的整体大小来计算当前参数的重要性,head的重要性计算是通过传入一个\\" +" 全1的对head的mask,并计算这个mask的梯度,根据mask的梯度来判断每个 ``Multi-Head Attention`` " +"层中每个Head的重要性。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:24 +msgid "" +"2. " +"使用原本的预训练模型作为蒸馏过程中的教师网络。同时定义一个超网络,这个超网络中最大的子网络的结构和教师网络的结构相同其他小的子网络是对最大网络\\" +" 进行不同的宽度选择来得到的,宽度选择具体指对网络中的参数进行裁剪,所有子网络在整个训练过程中都是参数共享的。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:27 +msgid "" +"使用重排序之后的预训练模型参数初始化超网络,并把这个超网络作为学生网络。分别为 ``Embedding`` 层,为每个transformer " +"block层和最后的logits添加蒸馏损失。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:29 +msgid "每个batch数据在训练前首先会选择当前要训练的子网络配置(子网络配置目前仅包括对整个模型的宽度的选择),参数更新时仅会更新当前子网络计算中用到的那部分参数。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:31 +msgid "通过以上的方式来优化整个超网络参数,训练完成后选择满足加速要求和精度要求的子模型。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:37 +msgid "整体流程" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:39 +msgid "基于PaddleSlim进行模型压缩" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:41 +msgid "在本例中,也需要训练基于特定任务的BERT模型,方法同上一篇教程《由BERT到Bi-LSTM的知识蒸馏》。下面重点介绍本例模型压缩的过程。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:44 +msgid "1. 定义初始网络" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:45 +msgid "" +"定义原始BERT-" +"base模型并定义一个字典保存原始模型参数。普通模型转换为超网络之后,由于其组网OP的改变导致原始模型加载的参数失效,所以需要定义一个字典保存原始模型的参数并用来初始化超网络。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:56 +msgid "2. 构建超网络" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:57 +msgid "定义搜索空间,并根据搜索空间把普通网络转换为超网络。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:69 +msgid "3. 
定义教师网络" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:70 +msgid "构造教师网络。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:78 +msgid "4. 配置蒸馏相关参数" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:79 +msgid "" +"需要配置的参数包括教师模型实例;需要添加蒸馏的层,在教师网络和学生网络的 ``Embedding`` 层和每一个 ``Tranformer " +"Block`` 层\\ 之间添加蒸馏损失,中间层的蒸馏损失使用默认的MSE损失函数;配置 ``lambda_distill`` " +"参数表示整体蒸馏损失的缩放比例。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:97 +msgid "5. 定义Once-For-All模型" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:98 +msgid "普通模型和蒸馏相关配置传给 ``OFA`` 接口,自动添加蒸馏过程并把超网络训练方式转为 ``OFA`` 训练方式。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:106 +msgid "6. 计算神经元和head的重要性并根据其重要性重排序参数" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:120 +msgid "7. 传入当前OFA训练所处的阶段" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:129 +msgid "8. 传入网络相关配置,开始训练" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:130 +msgid "本示例使用DynaBERT的策略进行超网络训练。" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:150 +msgid "**NOTE**" +msgstr "" + +#: ../advanced_guide/model_compression/ofa_bert.rst:152 +msgid "" +"由于在计算head的重要性时会利用一个mask来收集梯度,所以需要通过monkey patch的方式重新实现一下 ``BERTModel`` 类的" +" ``forward`` 函数。示例如下:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/changelog.po b/docs/locale/en/LC_MESSAGES/changelog.po new file mode 100644 index 0000000000000000000000000000000000000000..31069daadc58c0ddb51f8add953cf981acf11540 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/changelog.po @@ -0,0 +1,105 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../changelog.md:1 +msgid "v2.0.2 (2021.06.07)" +msgstr "" + +#: ../changelog.md:3 +msgid "丰富预训练模型" +msgstr "" + +#: ../changelog.md:4 +msgid "新增多粒度语言知识预训练模型ERNIE-Gram,该模型在多项中文NLP任务取得SOTA成绩。" +msgstr "" + +#: ../changelog.md:5 +msgid "新增NeZha中文预训练模型,感谢 @jm12138 的高质量贡献! 
🎉 🎉 🎉" +msgstr "" + +#: ../changelog.md:6 +msgid "新增GPT CPM-Distill中文小型化模型,感谢 @jm12138 的高质量贡献!🎉 🎉 🎉" +msgstr "" + +#: ../changelog.md:8 ../changelog.md:14 +msgid "Bug Fix" +msgstr "" + +#: ../changelog.md:9 +msgid "修复了softmax_with_crossentropy API导致的deprecated warning" +msgstr "" + +#: ../changelog.md:10 +msgid "更正了ChnSentiCorp, LCQMC等数据集的官方下载链接。" +msgstr "" + +#: ../changelog.md:12 +msgid "v2.0.1 (2021.05.21)" +msgstr "" + +#: ../changelog.md:15 +msgid "修复Windows CPU环境下的import产生的CUDA_HOME检测问题。" +msgstr "" + +#: ../changelog.md:17 +msgid "v2.0.0 (2021.05.20)" +msgstr "" + +#: ../changelog.md:19 +msgid "" +"PaddleNLP " +"2.0是飞桨生态的文本领域核心库,具备易用的文本领域API,多场景的应用示例、和高性能分布式训练三大特点,旨在提升飞桨开发者文本领域建模效率,并提供基于飞桨框架2.0的NLP领域最佳实践。" +msgstr "" + +#: ../changelog.md:21 +msgid "特性" +msgstr "" + +#: ../changelog.md:23 +msgid "易用的文本领域API" +msgstr "" + +#: ../changelog.md:24 +msgid "" +"提供从数据集加载、文本预处理、组网建模、评估、到推的领域API:如一键加载丰富中文数据集的Dataset API, " +"可灵活高效的进行数据与处理的Data API,预置60+预训练词向量的Embedding API, " +"内置50+预训练模型,提供预训练模型生态基础设施的Transformer " +"API等,可大幅提升NLP任务建模和迭代的效率。更多API详细说明请查看PaddleNLP官方文档" +msgstr "" + +#: ../changelog.md:27 +msgid "多场景的应用示例" +msgstr "" + +#: ../changelog.md:28 +msgid "" +"PaddleNLP " +"2.0提供多粒度多场景的应用示例,涵盖从NLP基础技术、NLP核心技术、NLP系统应用以及文本相关的拓展应用等。全面基于飞桨2.0全新API体系开发,为开发提供飞桨2.0框架在文本领域的最佳实践。" +msgstr "" + +#: ../changelog.md:31 +msgid "高性能分布式训练" +msgstr "" + +#: ../changelog.md:32 +msgid "" +"基于飞桨核心框架『动静统一』的特性与领先的自动混合精度优化策略,通过分布式Fleet " +"API,支持超大规模参数的4D混合并行策略,并且可根据硬件情况灵活可配,高效地完成超大规模参数的模型训练。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_datasets/how_to_write_a_DatasetBuilder.po b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/how_to_write_a_DatasetBuilder.po new file mode 100644 index 0000000000000000000000000000000000000000..775c576d3241fbc9d8b1988df6e726dcbd24a71b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/how_to_write_a_DatasetBuilder.po @@ -0,0 +1,160 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:3
+msgid "创建 :class:`DatasetBuilder`"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:5
+msgid ""
+"数据集的贡献通过定义一个 :class:`DatasetBuilder` 的子类来实现。一个合格的 :class:`DatasetBuilder`"
+" 需要遵循一些协议和规范。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:7
+msgid "下面我们以 :obj:`LCQMC` 为例了解一下 :class:`DatasetBuilder` 通常需要包含哪些方法和参数。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:10
+msgid "成员变量"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:40
+msgid ""
+"首先贡献的数据集需要继承 :class:`paddlenlp.datasets.DatasetBuilder` 类,类名格式为camel "
+"case。之后应该添加一段注释,简要说明数据集的来源等信息。之后需定义以下成员变量:"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:42
+msgid ""
+":attr:`lazy` :数据集的默认类型。:obj:`False` 对应 :class:`MapDataset` ,:obj:`True` "
+"对应 :class:`IterDataset` 。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:43
+msgid ":attr:`URL` :数据集压缩包下载地址,需提供有效并稳定的下载链接。如果数据集不是压缩包,可以不在这里提供。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:44
+msgid ":attr:`MD5` :数据集压缩包的md5值,用于文件校验,如果数据集文件不是压缩包,可以不在这里提供。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:45
+msgid ":attr:`META_INFO` :数据集split信息格式。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:46
+msgid ""
+":attr:`SPLITS` "
+":数据集的split信息,包含数据集解压后的不同文件的具体位置,文件名,md5值等,如果数据集不是压缩包则通常在这里提供下载地址,还可以包含诸如不同文件对应的文件读取参数等信息。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:48
+msgid ""
+"除此之外,不同的数据集可能还需要诸如 :attr:`VOCAB_INFO` 等其他成员变量(参见 `iwslt15.py "
+"`__"
+" )。或者成员变量会有其他格式。贡献者可以根据实际情况自行调整。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:52
+msgid ""
+"如果贡献的数据集没有子数据集,那么 :class:`DatasetBuilder` **必须包含** :attr:`SPLITS` "
+"成员变量,且该变量必须是一个字典,字典的key是该数据集包含的splits。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:53
+msgid ""
+"如果贡献的数据集有子数据集,那么 :class:`DatasetBuilder` **必须包含** :attr:`BUILDER_CONFIGS`"
+" 成员变量,且该变量必须是一个字典,字典的key是该数据集包含的子数据集的 :attr:`name` "
+"。字典的value是包含该数据集的子数据集split信息的字典,key值必须是 `splits` 。具体格式(参见 `glue.py "
+"`__"
+" )"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:56
+msgid ":func:`_get_data` 方法"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:71
+msgid ""
+":func:`_get_data` 方法根据传入的 :attr:`mode` "
+"和数据集的split信息定位到具体数据集文件。首先进行md5值校验本地文件,若校验失败则调用 "
+":func:`paddle.utils.download.get_path_from_url` "
+"方法下载并校验数据集文件,最后返回数据集文件的本地地址。"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:74
+msgid ":func:`_read` 方法"
+msgstr ""
+
+#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:90
+msgid ""
+":func:`_read` 方法根据传入的文件地址读取数据。该方法必须是一个生成器,以确保 :class:`DatasetBuilder` "
+"可以构造 :class:`MapDataset` 和 :class:`IterDataset` 两种数据集。 "
+"当不同split对应的数据文件读取方式不同时,该方法还需要支持 :attr:`split` 参数,并支持不同split下的读取方式。"
+msgstr ""
+ +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:95 +msgid "该方法提供的每条example都应是一个 :class:`Dictionary` 对象。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:96 +msgid "" +":class:`DatasetBuilder` 在生成Dataset时提供了将class " +"label转换为id的功能。如果用户需要此功能,需要将example中label对应的key设置为 **\"label\"** 或 " +"**\"labels\"** ,并在类中正确添加 :func:`get_labels` 方法。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:99 +msgid ":func:`get_labels` 方法" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:109 +msgid "" +":func:`get_labels` 方法返回一个由该数据集中所有label组成的list。用于将数据集中的class " +"label转换为id,并且这个list之后会作为实例变量传给生成的数据集。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:112 +msgid ":func:`get_vocab` 方法" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:114 +msgid "如果数据集提供词典文件,则需要加入 :func:`get_vocab` 方法和 :attr:`VOCAB_INFO` 变量。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:116 +msgid "" +"该方法会根据 :attr:`VOCAB_INFO` 变量返回一个包含数据集词典信息的 :class:`Dictionary` " +"对象并作为实例变量传给生成的数据集。用于在训练过程中初始化 :class:`paddlenlp.data.Vocab` 对象。 该方法的写法请参考" +" `iwslt15.py " +"`__" +" 。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:121 +msgid "" +"贡献数据集时 :func:`get_labels` 和 :func:`get_vocab` 方法是可选的,视具体数据集内容而定。 " +":func:`_read` 和 :func:`_get_data` 方法是 **必须包含** 的。" +msgstr "" + +#: ../community/contribute_datasets/how_to_write_a_DatasetBuilder.rst:122 +msgid "如果您不希望在数据获取过程中进行md5值校验,可以不用给出相关成员变量和校验代码。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_datasets/index.po b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/index.po new file mode 100644 index 0000000000000000000000000000000000000000..5124d3a4729456cf6af0f0d0b2c5d1f84f18eec0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/index.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_datasets/index.rst:3 +msgid "如何贡献数据集" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_datasets/sharing_dataset.po b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/sharing_dataset.po new file mode 100644 index 0000000000000000000000000000000000000000..67b63325e2d452aea6b0c70cd68fbe8aee6fd9ba --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_datasets/sharing_dataset.po @@ -0,0 +1,169 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_datasets/sharing_dataset.rst:3 +msgid "分享你的数据集" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:5 +msgid "除了使用PaddleNLP内置的数据集以外,我们也鼓励用户向PaddleNLP贡献自己的数据集。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:7 +msgid "下面我们来介绍一下贡献数据集的详细流程:" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:10 +msgid "配置环境" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:12 +msgid "编写和测试PaddleNLP代码需要依赖python3.6以上版本以及最新版本的PaddlePaddle。请确保正确安装以上依赖。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:13 +msgid "在PaddleNLP的github页面上点击Fork按钮,在自己的github中创建一份PaddleNLP repo的副本。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:14 +msgid "将您frok的内容下载到本地,并将官方repo作为remote。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:22 +msgid "安装pre-commit钩子,它可以帮助我们格式化源代码,再提交前自动检查代码问题。不满足钩子的PR **不能** 被提交到PaddleNLP。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:30 +msgid "添加一个 :class:`DatasetBuilder`" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:32 +msgid "创建一个新的本地分支,一般从develop 分支上创建新分支。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:38 +msgid "" +"找到您本地repo下的 `PaddleNLP/paddlenlp/datasets/` " +"路径,PaddleNLP的所有数据集代码都储存在这个文件夹下。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:44 +msgid "为您的数据集确定一个 `name`,例如 `squad` , `chnsenticorp` 等,这个 `name` 就是您的数据集被读取时的名称。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:48 +msgid "为了方便别人使用您的数据集,确保这个 `name` **不会太长而且能够正确的表义**。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:49 +msgid "数据集的 `name` 格式应为snake case。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:51 +msgid "" +"在该路径下创建python文件,文件名是数据集的 `name`,例如 `squad.py` 。并在这个文件中编写数据集的 " +":class:`DatasetBuilder` 代码。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:53 +msgid "" +":class:`DatasetBuilder` 的编写可以参考教程 :doc:`如何创建一个DatasetBuilder " +"<./how_to_write_a_DatasetBuilder>` 。里面给出了详细的步骤和规范。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:55 +msgid "" +"我们也推荐您参考已有数据集的 :class:`DatasetBuilder` " +"进行创建,从已有代码copy一些共用部分可能对您编写自己的数据集代码有所帮助,下面是一些已有数据集的示例:" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:57 +msgid "" +"`iwslt15.py " +"`__" +" 翻译数据集,包含词表文件。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:58 +msgid "" +"`glue.py " +"`__" +" glue数据集,包含多个子数据集,文件格式为tsv。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:59 +msgid "" +"`squad.py " +"`__" +" 阅读理解数据集,文件格式为json。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:60 +msgid "" +"`imdb.py " +"`__" +" imdb数据集,每个split包含多个文件。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:61 +msgid "" +"`ptb.py " +"`__" +" 语料库数据集。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:62 +msgid "" +"`msra_ner.py " +"`__" +" 序列标注数据集。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:64 +msgid "" +"开发完成后,可以使用 
:attr:`load_dataset` 测试您创建的数据集中的split能否正确被识别。也可以使用 " +":attr:`print` 看看数据集读入的格式是否符合您的预期:" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:74 +msgid "提交您的成果" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:76 +msgid "当您认为数据集的代码已经ready后,就可以在本地commit您的修改了:" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:83 +msgid "在提交修改之前,最好获取获取先upstream的最新代码并更新当前分支。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:90 +msgid "将本地的修改推送到GitHub上,并在GitHub上向PaddleNLP提交Pull Request。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:96 +msgid "" +"以上就是像PaddleNLP贡献数据集的完整流程了。我们看到您的PR后会尽快review,如果有任何问题都会尽快反馈给您。如果没有问题的话我们就会合入到PaddleNLP" +" repo,您贡献的数据集就可以供其他人使用啦。" +msgstr "" + +#: ../community/contribute_datasets/sharing_dataset.rst:98 +msgid "如果您对贡献数据集还有任何疑问,欢迎加入官方QQ技术交流群: 973379845向我们提出。我们会尽快为您解答。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_docs.po b/docs/locale/en/LC_MESSAGES/community/contribute_docs.po new file mode 100644 index 0000000000000000000000000000000000000000..1ed304c34f68876424ff35e625ac25fd4eb9cca1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_docs.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_docs.rst:3 +msgid "如何贡献问答、案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_awesome_pretrained_models.po b/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_awesome_pretrained_models.po new file mode 100644 index 0000000000000000000000000000000000000000..1aedaf36c857e0c7df137d27b9cd9ed0956ed69f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_awesome_pretrained_models.po @@ -0,0 +1,178 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:3 +msgid "贡献预训练模型权重" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:6 +msgid "1. 
模型网络结构类型" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:7 +msgid "" +"PaddleNLP目前已支持绝大多数主流的预训练模型网络结构,既包括百度自研的预训练模型(如ERNIE系列), " +"也涵盖业界主流的预训练模型(如BERT,ALBERT,GPT,RoBERTa,XLNet等)。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:10 +msgid "" +"PaddleNLP目前支持的预训练模型结构类型汇总可见 `Transformer预训练模型汇总 " +"`_" +" (持续增加中,也非常欢迎进行新模型贡献:`如何贡献新模型 " +"`_" +" )。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:15 +msgid "2. 模型参数权重类型" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:16 +msgid "非常欢迎大家贡献优质模型参数权重。 参数权重类型包括但不限于(以BERT模型网络为例):" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:19 +msgid "" +"PaddleNLP还未收录的BERT预训练模型参数权重 (如 `bert-base-japanese-char " +"`_ ,`danish-" +"bert-botxo `_ 等);" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:21 +msgid "" +"BERT模型在其他垂类领域(如数学,金融,法律,医学等)的预训练模型参数权重 (如 `MathBERT " +"`_ ,`finbert " +"`_ 等);" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:23 +msgid "" +"基于BERT在下游具体任务进行fine-tuning后的模型参数权重 (如 `bert-base-multilingual-uncased-" +"sentiment `_ , `bert-base-NER `_ 等);" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:26 +msgid "其他模型参数权重(任何你觉得有价值的模型参数权重);" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:29 +msgid "3. 参数权重格式转换" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:30 +msgid "" +"当我们想要贡献github上开源的某模型权重时,但是发现该权重保存为其他的深度学习框架(PyTorch,TensorFlow等)的格式, " +"这就需要我们进行不同深度学习框架间的模型格式转换,下面的链接给出了一份详细的关于Pytorch到Paddle模型格式转换的教程: " +"`Pytorch到Paddle模型格式转换文档 <./convert_pytorch_to_paddle.rst>`_ 。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:35 +msgid "4. 
进行贡献" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:37 +msgid "4.1 准备权重相关文件" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:38 +msgid "" +"一般来说,我们需要准备 **model_state.pdparams** " +",**vocab.txt**,**tokenizer_config.json** 以及 **model_config.json** " +"这四个文件进行参数权重贡献。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:41 +msgid "model_state.pdparams 文件可以通过上述的参数权重格式转换过程得到;" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:42 +msgid "" +"vocab.txt " +"文件可以直接使用原始模型对应的vocab文件(根据模型对应tokenizer类型的不同,该文件名可能为spiece.model等);" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:43 +msgid "" +"model_config.json 文件可以参考对应 model.save_pretrained() " +"接口保存的model_config.json文件;" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:44 +msgid "" +"tokenizer_config.json 文件可以参考对应 tokenizer.save_pretrained() " +"接口保存的model_config.json文件;" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:47 +msgid "4.2 创建个人目录" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:48 +msgid "" +"如果你是首次进行权重贡献,那么你需要在 ``PaddleNLP/community/`` 下新建一个目录。 " +"目录名称使用你的github名称,比如新建目录 ``PaddleNLP/community/yingyibiao/`` 。 " +"如果已有个人目录,则可以跳过此步骤。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:53 +msgid "4.3 创建权重目录" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:54 +msgid "" +"在步骤4.2的个人目录下新建一个权重目录,权重目录名为本次贡献的模型权重名称。 比如我想贡献 ``bert-base-uncased-" +"sst-2-finetuned`` 这个模型, 则新建权重目录 ``PaddleNLP/community/yingyibiao/bert-" +"base-uncased-sst-2-finetuned/`` 。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:59 +msgid "4.4 在权重目录下添加PR(pull request)相关文件" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:60 +msgid "在步骤4.3的目录下加入两个文件,分别为 ``README.md`` 和 ``files.json`` 。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:62 +msgid "``README.md`` 是对你贡献的权重的详细介绍,使用示例,权重来源等。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:63 +msgid "" +"``files.json`` 为步骤4.1所得的权重相关文件以及对应地址。files.json文件内容示例如下,只需将地址中的 " +"*yingyibiao* 和 *bert-base-uncased-sst-2-finetuned* 分别更改为你的github用户名和权重名称。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:76 +msgid "4.5 在github上提PR进行贡献" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:77 +msgid "" +"第一次进行开源贡献的同学可以参考 `first-contributions " +"`_ 。" +msgstr "" + +#: ../community/contribute_models/contribute_awesome_pretrained_models.rst:78 +msgid "模型权重贡献PR示例请参考 `bert-base-uncased-sst-2-finetuned PR <.>`_ 。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_new_models.po b/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_new_models.po new file mode 100644 index 0000000000000000000000000000000000000000..f33257f2724cc05864ebdc58693759113d514af4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_models/contribute_new_models.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_models/contribute_new_models.rst:3 +msgid "贡献新模型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_models/convert_pytorch_to_paddle.po b/docs/locale/en/LC_MESSAGES/community/contribute_models/convert_pytorch_to_paddle.po new file mode 100644 index 0000000000000000000000000000000000000000..b3a66ac3773b97a737f7806dc1adf86bc634ea61 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_models/convert_pytorch_to_paddle.po @@ -0,0 +1,725 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:3 +msgid "模型格式转换" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:6 +msgid "0. 前言" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:7 +msgid "本文将介绍如何进行不同框架下的模型权重转换(以模型权重从PyTorch框架到Paddle框架的格式转换为例)。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:9 +msgid "模型格式转换的过程需要用户对模型结构有一个较详细的了解,成功完成模型格式转换也会有助于加深用户对该模型结构的理解。 让我们开始这个有趣的过程吧!" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:13 +msgid "1. 模型权重文件概述" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:14 +msgid "" +"不管在什么框架下,当我们保存训练好的模型时,我们都需要将模型的参数权重持久化保存下来; " +"当我们加载一个保存好的模型时,我们都需要将参数权重加载并重新赋值给相应的模型。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:17 +msgid "" +"PyTorch和Paddle都是通过序列化和反序列化模型的 ``state dict`` (状态字典)来进行参数权重的存储和加载的。 " +"``state dict`` 从数据结构上来看就是一个字典(比如Python中的dict), " +"其中key是模型参数的名称(数据类型为string),而value则为key所对应的值(数据类型为Tensor)。 参数存储时,先获取目标对象的 " +"``state dict`` ,然后将 ``state dict`` 存储至磁盘; 参数载入时,先从磁盘载入保存的 ``state dict`` " +",然后通过 ``set_state_dict()`` 方法配置到目标对象中。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:23 +msgid "" +"按照约定俗成的命名规则,Paddle框架保存的模型文件名一般后缀为 `'.pdparams'` , PyTorch框架保存的模型文件名一般后缀为 " +"`'.pt'` 、 `'.pth'` 或者 `'.bin'` 。 虽然后缀并不影响模型的保存和加载,但我们一般都会遵循这个命名规范。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:28 +msgid "2. 
模型的 ``state dict`` 概述" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:29 +msgid "" +"刚刚我们简单介绍了一下模型文件和其中存储的 ``state dict`` , 下面让我们来看一个具体的例子来对 ``state dict`` " +"有更进一步的了解。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:32 +msgid "" +"``LeNet`` 是由Yann LeCun等人在1998年提出的一个CNN网络模型,并且成功应用于手写数字识别系统。 Paddle集成了 " +"``LeNet`` 这个简单的模型,我们可以一键进行模型加载, 下面的代码实现了该模型的加载和对应 ``state dict`` 的输出:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:58 +msgid "" +"我们可以通过 ``model.state_dict().keys()`` 来获取模型的所有参数名称。 可以看到 ``LeNet`` " +"一共有10组参数,分别为:*'features.0.weight'*、*'features.0.bias'*、*'features.3.weight'*" +" " +"、*'features.3.bias'*、*'fc.0.weight'*、*'fc.0.bias'*、*'fc.1.weight'*、*'fc.1.bias'*、*'fc.2.weight'*" +" 和 *'fc.2.bias'*。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:62 +msgid "" +"通过查询 ``model.state_dict()['features.0.weight']`` 可以查看 " +"**'features.0.weight'** 这个参数的具体权重数值。 上述输出显示该权重是一个dtype=float32,shape=[6, " +"1, 3, 3]的Tensor。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:66 +msgid "3. 利用 ``state dict`` 进行权重格式转换" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:67 +msgid "" +"了解了模型的存储和加载以及相关的 ``state dict`` 之后,我们来看一下模型格式的转换的具体步骤。 一般来说,我们可以通过 " +"``state dict`` 的相互转换来帮助我们进行模型格式的转换。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:70 +msgid "以从PyTorch框架到Paddle框架的模型权重转换为例,转换的具体流程为:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:72 +msgid "加载PyTorch模型得到 ``state dict``" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:73 +msgid "PyTorch下的 ``state dict`` 转换为Paddle下的 ``state dict``" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:74 +msgid "保存Paddle下的 ``state dict`` 得到Paddle模型。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:76 +msgid "" +"下面我们来看一个具体的例子:``'bert-base-uncased'`` 是一个谷歌开源的12层的bert英文模型。 " +"PaddleNLP(Paddle框架)和HuggingFace的transformers(PyTorch框架)里都集成了这个模型, " +"两者参数量和具体参数数值是完全一致的。我们可以来加载对比这两个模型的 ``state dict`` 来了解转换的细节。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:81 +msgid "3.1 PyTorch框架下的 ``state dict``" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:82 +msgid "首先加载transformers下的 ``'bert-base-uncased'`` 模型," +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:123 +msgid "" +"\\**odict_keys**(ordered_dict keys)所显示的是PyTorch模型文件所对应的 ``state dict`` " +"的keys: 我们仔细观察一下可以发现参数可以分成几大模块:**embeddings** 模块, **encoder_layers** 模块, " +"**pooler** 模块和 **cls** 模块。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:127 +msgid "我们可以结合bert的具体结构来解读一下各个模块:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:129 +msgid "**embeddings** 模块" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:131 +msgid "" +"*'bert.embeddings'* 开头的各个参数是embeddings模块的参数, " +"包括word_embeddings矩阵,position_embeddings矩阵,token_type_embeddings矩阵以及embeddings模块的LayerNorm层参数等。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:133 +msgid "**encoder_layers** 模块" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:135 +msgid "" +"*'bert.encoder.layer'*开头的各个参数是各encoder层的参数, 可以看到 ``'bert-base-uncased'`` " +"模型一共有12层encoder(编号0-11),每一层encoder的结构都相同。 每一层encoder主要由一个*self-" 
+"attention*模块和一个*feed-forward*模块构成。 " +"我们具体来看一下第1层encoder的参数(编号为0,'bert.encoder.layer.0'开头的参数):" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:140 +msgid "首先是*self-attention*模块:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:142 +msgid "" +"*'attention.self.query'*,*'attention.self.key'* 和 " +"*'attention.self.value'* 分别代表self-attention结构里面的query矩阵,key矩阵和value矩阵。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:144 +msgid "*'attention.output.dense'* 是self-attention结构的线性层。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:145 +msgid "*'attention.output.LayerNorm'* 则是self-attention结构后的LayerNorm层。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:147 +msgid "" +"接下来是*feed-forward*模块,对应 'intermediate.dense' 和 'output.dense' 开头的参数 " +"。*feed-forward*之后还有一个*LayerNorm*层,对应的是 'output.LayerNorm' 开头的参数。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:149 +msgid "**pooler** 模块" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:151 +msgid "pooler模块在最后一层encoder之后,是我们对最后一层encoder输出的池化操作," +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:152 +msgid "**cls** 模块" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:154 +msgid "" +"cls模块是我们计算mlm(masked language model)和next sentence prediction(nsp)任务的结构。 " +"'cls.predictions'开头的参数是我们做mlm任务时的参数,'cls.seq_relationship'开头的参数是我们做nsp预测任务时的参数。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:158 +msgid "3.2 Paddle框架下的 ``state dict``" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:159 +msgid "相信到现在,我们已经对bert这个模型的结构以及相应的具体参数有了更进一步的了解。 接下来我们来加载PaddleNLP下的模型:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:199 +msgid "" +"Paddle模型的 ``state dict`` 是通过一个dict来进行存储,可以看到,两者的 ``state dict`` 是十分相似的。 " +"我们对比一下两者:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:202 +msgid "" +"两者的存储是相似的,PyTorch里使用的是python中的ordered_dict来存储模型的参数状态, " +"在Paddle中则使用的是python中的dict来来进行存储。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:204 +msgid "" +"两者的结构也是相似的,都可以分成embeddings,encoder_layer, pooler, cls等 " +"模块(当然这也很直观,毕竟两者的模型结构和模型参数是完全一致的)。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:206 +msgid "同时两者也存在一些区别,两者的 ``state dict`` 的keys有一些细微的差异,这是由于模型代码的具体实现的参数命名差异所造成的。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:209 +msgid "3.3 PyTorch和Paddle的 ``state dict`` 对比" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:210 +msgid "" +"我们接下来对上述两个 ``state dict`` 的参数名称以及对应权重来做一一对应。 下面的表格是整理好的 ``state_dict`` " +"对应关系表格(同一行代表着相对应的参数):" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:214 +msgid "Keys (PyTorch)" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:214 +msgid "Shape (PyTorch)" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:214 +msgid "Keys (Paddle)" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:214 +msgid "Shape (Paddle)" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:216 +msgid "bert.embeddings.word_embeddings.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:216 +#: 
../community/contribute_models/convert_pytorch_to_paddle.rst:272 +msgid "[30522, 768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:218 +msgid "bert.embeddings.position_embeddings.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:218 +msgid "[512, 768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:220 +msgid "bert.embeddings.token_type_embeddings.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:220 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:274 +msgid "[2, 768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:222 +msgid "bert.embeddings.LayerNorm.gamma" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:222 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:224 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:228 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:232 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:236 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:240 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:242 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:244 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:252 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:254 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:256 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:260 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:266 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:268 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:270 +msgid "[768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:222 +msgid "bert.embeddings.layer_norm.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:224 +msgid "bert.embeddings.LayerNorm.beta" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:224 +msgid "bert.embeddings.layer_norm.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:226 +msgid "bert.encoder.layer.0.attention.self.query.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:226 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:230 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:234 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:238 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:258 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:264 +msgid "[768, 768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:226 +msgid "bert.encoder.layers.0.self_attn.q_proj.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:228 +msgid "bert.encoder.layer.0.attention.self.query.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:228 +msgid "bert.encoder.layers.0.self_attn.q_proj.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:230 +msgid "bert.encoder.layer.0.attention.self.key.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:230 +msgid "bert.encoder.layers.0.self_attn.k_proj.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:232 +msgid 
"bert.encoder.layer.0.attention.self.key.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:232 +msgid "bert.encoder.layers.0.self_attn.k_proj.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:234 +msgid "bert.encoder.layer.0.attention.self.value.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:234 +msgid "bert.encoder.layers.0.self_attn.v_proj.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:236 +msgid "bert.encoder.layer.0.attention.self.value.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:236 +msgid "bert.encoder.layers.0.self_attn.v_proj.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:238 +msgid "bert.encoder.layer.0.attention.output.dense.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:238 +msgid "bert.encoder.layers.0.self_attn.out_proj.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:240 +msgid "bert.encoder.layer.0.attention.output.dense.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:240 +msgid "bert.encoder.layers.0.self_attn.out_proj.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:242 +msgid "bert.encoder.layer.0.attention.output.LayerNorm.gamma" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:242 +msgid "bert.encoder.layers.0.norm1.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:244 +msgid "bert.encoder.layer.0.attention.output.LayerNorm.beta" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:244 +msgid "bert.encoder.layers.0.norm1.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:246 +msgid "bert.encoder.layer.0.intermediate.dense.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:246 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:250 +msgid "[3072, 768]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:246 +msgid "bert.encoder.layers.0.linear1.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:246 +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:250 +msgid "[768, 3072]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:248 +msgid "bert.encoder.layer.0.intermediate.dense.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:248 +msgid "[3072]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:248 +msgid "bert.encoder.layers.0.linear1.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:250 +msgid "bert.encoder.layer.0.output.dense.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:250 +msgid "bert.encoder.layers.0.linear2.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:252 +msgid "bert.encoder.layer.0.output.dense.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:252 +msgid "bert.encoder.layers.0.linear2.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:254 +msgid "bert.encoder.layer.0.output.LayerNorm.gamma" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:254 +msgid 
"bert.encoder.layers.0.norm2.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:256 +msgid "bert.encoder.layer.0.output.LayerNorm.beta" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:256 +msgid "bert.encoder.layers.0.norm2.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:258 +msgid "bert.pooler.dense.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:260 +msgid "bert.pooler.dense.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:262 +msgid "cls.predictions.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:262 +msgid "[30522]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:262 +msgid "cls.predictions.decoder_bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:264 +msgid "cls.predictions.transform.dense.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:264 +msgid "cls.predictions.transform.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:266 +msgid "cls.predictions.transform.dense.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:266 +msgid "cls.predictions.transform.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:268 +msgid "cls.predictions.transform.LayerNorm.gamma" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:268 +msgid "cls.predictions.layer_norm.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:270 +msgid "cls.predictions.transform.LayerNorm.beta" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:270 +msgid "cls.predictions.layer_norm.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:272 +msgid "cls.predictions.decoder.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:272 +msgid "cls.predictions.decoder_weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:274 +msgid "cls.seq_relationship.weight" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:274 +msgid "[768, 2]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:276 +msgid "cls.seq_relationship.bias" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:276 +msgid "[2]" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:279 +msgid "正确地对应好 ``state dict`` 的参数以及权重有助于我们正确地进行 ``state dict`` 的转换。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:281 +msgid "我们从参数名称上能看出基本的一个对应关系,例如:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:283 +msgid "`bert.embeddings.LayerNorm.gamma` 对应 `bert.embeddings.layer_norm.weight` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:284 +msgid "`bert.embeddings.LayerNorm.beta` 对应 `bert.embeddings.layer_norm.bias` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:285 +msgid "" +"`bert.encoder.layer.0.attention.self.query.weight` 对应 " +"`bert.encoder.layers.0.self_attn.q_proj.weight` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:286 +msgid "" +"`bert.encoder.layer.0.attention.self.query.bias` 对应 " +"`bert.encoder.layers.0.self_attn.q_proj.bias`。" +msgstr 
"" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:288 +msgid "两者的顺序是基本一致的,但也有一些例外,例如:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:290 +msgid "" +"`bert.encoder.layers.0.norm1.weight` 对应 " +"`bert.encoder.layer.0.attention.output.LayerNorm.gamma` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:291 +msgid "" +"`bert.encoder.layers.0.norm1.bias` 对应 " +"`bert.encoder.layer.0.attention.output.LayerNorm.beta` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:292 +msgid "" +"`bert.encoder.layer.0.intermediate.dense.weight` 对应 " +"`bert.encoder.layers.0.linear1.weight` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:293 +msgid "" +"`bert.encoder.layer.0.output.dense.weight` 对应 " +"`bert.encoder.layers.0.linear2.weight` ;" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:294 +msgid "" +"`bert.encoder.layer.0.output.LayerNorm.gamma` 对应 " +"`bert.encoder.layers.0.norm2.weight`。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:296 +msgid "" +"正确的参数对应关系可能需要我们阅读具体的代码进行判断。 " +"在上面的表格中我们已经将两者的keys准确地一一对应了。建立好了keys的对应关系之后,我们可以进行values的对应。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:299 +msgid "" +"如果你仔细观察表格,会发现有些参数对应的values形状存在差异。 比如 " +"``bert.encoder.layer.0.intermediate.dense.weight`` 和 " +"``bert.encoder.layers.0.linear1.weight`` " +"这两个keys是相对应的一组参数名,但是他们的values形状却不相同;前者是 ``[3072, 768]`` , 后者是 ``[768, " +"3072]`` ,两者刚好是一个转置的关系。这是因为PyTorch对于 ``nn.Linear`` " +"模块的保存是将权重的shape进行转置后保存的。 所以在我们进行 ``state dict`` " +"转换的时候,需要注意做好shape的转换(例如将PyTorch模型里 nn.Linear层对应的参数权重转置处理后生成Paddle的参数权重)。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:306 +msgid "另外还需要注意其他一些细节,这里列出来几个可能会遇到的情景以供参考:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:308 +msgid "" +"有些模型结构可能在实现时对参数的处理有差异导致存在参数的拆分或者合并等操作, " +"此时我们需要进行参数多对一或者一对多的映射,同时将对应的values拆分或者合并。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:310 +msgid "还有存在batch norm层时,我们需要注意todo。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:313 +msgid "3.4 bert模型转换代码" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:314 +msgid "下一步就是进行最关键的模型转换环节。 这一步十分关键,正确地进行 ``state dict`` 的转换才能确保我们通过精度验证。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:317 +msgid "下面是进行模型转换的代码(PyTorch转换为Paddle):" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:376 +msgid "" +"我们来看一下这份转换代码: 我们需要下载好待转换的PyTorch模型,并加载模型得到**torch_state_dict** " +";**paddle_state_dict** 和 **paddle_model_path** 则定义了转换后的 ``state dict`` " +"和模型文件路径; 代码中 **keys_dict** 定义了两者keys的映射关系(可以通过上面的表格对比得到)。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:381 +msgid "" +"下一步就是最关键的 *paddle_state_dict* 的构建,我们对 *torch_state_dict* 里的每一个key都进行映射, " +"得到对应的 *paddle_state_dict* 的key。获取 *paddle_state_dict* 的key之后我们需要 对 " +"*torch_state_dict* 的value进行转换,如果key对应的结构是 ``nn.Linear`` 模块的话, " +"我们还需要进行value的transpose操作。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:386 +msgid "" +"最后我们保存得到的 *paddle_state_dict* 就能得到对应的Paddle模型。 " +"至此我们已经完成了模型的转换工作,得到了Paddle框架下的模型 ``\"model_state.pdparams\"`` 。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:390 +msgid "4. 
模型权重验证" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:391 +msgid "" +"得到了模型权重后,我们还需要进行精度的对齐来验证我们上述转换的正确性。 我们可以通过前向推理和下游任务fine-" +"tuning这两个任务进行精度对齐验证。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:395 +msgid "4.1 对齐前向精度" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:396 +msgid "" +"前向精度的对齐十分简单,我们只需要保证两者输入是一致的前提下,观察得到的输出是否一致。 " +"这里有几个注意事项,我们运行推理时需要打开eval模式,设置dropout为0等操作去除随机性造成的影响。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:399 +msgid "" +"除了得到的模型权重文件,我们还需要准备模型配置文件。将模型权重文件(model_state.pdparams)和模型配置文件(model_config.json)" +" 这两个文件放在同一个路径下,我们就可以进行模型前向精度的对齐验证,下面提供了bert模型对齐前向精度的代码示例:" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:453 +msgid "代码最后会打印模型输出矩阵的每个元素最大差值,根据这个差值可以判定我们是否对齐了前向精度。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:456 +msgid "4.2 下游任务fine-tuning验证(可选)" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:457 +msgid "" +"当我们对齐前向精度时,一般来说我们的模型转换就已经成功了。我们还可以运行下游任务fine-tuning进行double check。 " +"同样的,我们需要设置相同的训练数据,相同的训练参数,相同的训练环境进行fine-tuning来对比两者的收敛性以及收敛指标。" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:461 +msgid "5. 写在最后" +msgstr "" + +#: ../community/contribute_models/convert_pytorch_to_paddle.rst:462 +msgid "恭喜你成功完成了模型权重的格式转换工作!欢迎向PaddleNLP提PR共享你的模型, 这样每一个使用PaddleNLP的用户都能使用你共享的模型哦~" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/contribute_models/index.po b/docs/locale/en/LC_MESSAGES/community/contribute_models/index.po new file mode 100644 index 0000000000000000000000000000000000000000..774255fc5db06b285bfa429f0fc3719754aa15d1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/contribute_models/index.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/contribute_models/index.rst:3 +msgid "如何贡献模型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/community/join_in_PaddleNLP-SIG.po b/docs/locale/en/LC_MESSAGES/community/join_in_PaddleNLP-SIG.po new file mode 100644 index 0000000000000000000000000000000000000000..746ad42a6dcb8b40e4771b884c47c7dbfdf19534 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/community/join_in_PaddleNLP-SIG.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../community/join_in_PaddleNLP-SIG.rst:3 +msgid "如何加入PaddleNLP-SIG" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data.po b/docs/locale/en/LC_MESSAGES/data.po new file mode 100644 index 0000000000000000000000000000000000000000..1aa73e16fba0acb2d85e25dc186a33a9fd635429 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data.po @@ -0,0 +1,125 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data.md:1 +msgid "PaddleNLP Data API" +msgstr "" + +#: ../data.md:3 +msgid "该模块提供了在NLP任务中构建有效的数据处理Pipeline的常用API。" +msgstr "" + +#: ../data.md:5 +msgid "APIl列表" +msgstr "" + +#: ../data.md:46 +msgid "API使用方法" +msgstr "" + +#: ../data.md:48 +msgid "以上API都是用来辅助构建DataLoader,DataLoader比较重要的三个初始化参数是dataset、batch_sampler和collate_fn。" +msgstr "" + +#: ../data.md:50 +msgid "paddlenlp.data.Vocab和paddlenlp.data.JiebaTokenizer用在构建dataset时处理文本token到ID的映射。" +msgstr "" + +#: ../data.md:52 +msgid "paddlenlp.data.SamplerHelper用于构建可迭代的batch_sampler。" +msgstr "" + +#: ../data.md:54 +msgid "" +"paddlenlp.data.Stack、paddlenlp.data.Pad、paddlenlp.data.Tuple和paddlenlp.data" +".Dict用于构建生成mini-batch的collate_fn函数。" +msgstr "" + +#: ../data.md:56 +msgid "数据预处理" +msgstr "" + +#: ../data.md:58 +msgid "paddlenlp.data.Vocab" +msgstr "" + +#: ../data.md:60 +msgid "paddlenlp.data.Vocab词表类,集合了一系列文本token与ids之间映射的一系列方法,支持从文件、字典、json等一系方式构建词表。" +msgstr "" + +#: ../data.md:78 +msgid "paddlenlp.data.JiebaTokenizer" +msgstr "" + +#: ../data.md:80 +msgid "paddlenlp.data.JiebaTokenizer初始化需传入paddlenlp.data.Vocab类,包含cut分词方法和将句子明文转换为ids的encode方法。" +msgstr "" + +#: ../data.md:97 +msgid "构建Sampler" +msgstr "" + +#: ../data.md:99 +msgid "paddlenlp.data.SamplerHelper" +msgstr "" + +#: ../data.md:101 +msgid "paddlenlp.data.SamplerHelper的作用是构建用于DataLoader的可迭代采样器,它包含shuffle、sort、batch、shard等一系列方法,方便用户灵活使用。" +msgstr "" + +#: ../data.md:139 +msgid "构建collate_fn" +msgstr "" + +#: ../data.md:141 +msgid "paddlenlp.data.Stack" +msgstr "" + +#: ../data.md:143 +msgid "paddlenlp.data.Stack用来组建batch,其输入必须具有相同的shape,输出便是这些输入的堆叠组成的batch数据。" +msgstr "" + +#: ../data.md:158 +msgid "paddlenlp.data.Pad" +msgstr "" + +#: ../data.md:160 +msgid "paddlenlp.data.Pad用来组建batch,它的输入长度不同,它首先会将输入数据全部padding到最大长度,然后再堆叠组成batch数据输出。" +msgstr "" + +#: ../data.md:175 +msgid "paddlenlp.data.Tuple" +msgstr "" + +#: ../data.md:177 +msgid "paddlenlp.data.Tuple会将多个组batch的函数包装在一起,组成tuple。" +msgstr "" + +#: ../data.md:197 +msgid "paddlenlp.data.Dict" +msgstr "" + +#: ../data.md:199 +msgid "paddlenlp.data.Dict会将多个组batch的函数包装在一起,组成dict。" +msgstr "" + +#: ../data.md:219 +msgid "综合示例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/data_preprocess.po 
b/docs/locale/en/LC_MESSAGES/data_prepare/data_preprocess.po new file mode 100644 index 0000000000000000000000000000000000000000..58676470db2b14eac6688031b6943af082d058ee --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data_prepare/data_preprocess.po @@ -0,0 +1,190 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data_prepare/data_preprocess.rst:3 +msgid "数据处理" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:5 +msgid "" +"Dataset中通常为原始数据,需要经过一定的数据处理并进行采样组batch,而后通过 :class:`paddle.io.DataLoader`" +" 为训练或预测使用,PaddleNLP中为其中各环节提供了相应的功能支持。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:8 +msgid "基于预训练模型的数据处理" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:10 +msgid "" +"在使用预训练模型做NLP任务时,需要加载对应的Tokenizer,PaddleNLP在 :class:`PreTrainedTokenizer` " +"中内置的 :func:`__call__` 方法可以实现基础的数据处理功能。PaddleNLP内置的所有预训练模型的Tokenizer都继承自 " +":class:`PreTrainedTokenizer` ,下面以BertTokenizer举例说明:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:28 +msgid "关于 :func:`__call__` 方法的其他参数和功能,请查阅PreTrainedTokenizer。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:30 +msgid "" +"paddlenlp内置的 :class:`paddlenlp.datasets.MapDataset` 的 :func:`map` " +"方法支持传入一个函数,对数据集内的数据进行统一转换。下面我们以 :obj:`LCQMC` 的数据处理流程为例:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:42 +msgid "" +"可以看到, :obj:`LCQMC` 是一个句对匹配任务,即判断两个句子的意思是否相似的2分类任务。我们需要处理的是key为 **query** " +"和 **title** 的文本数据,我们编写基于 :class:`PreTrainedTokenizer` 的数据处理函数并传入数据集的 " +":func:`map` 方法。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:69 +msgid "可以看到,数据集中的文本数据已经被处理成了模型可以接受的 *feature* 。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:71 +msgid "" +":func:`map` 方法有一个重要的参数 :attr:`batched`,当设置为 :obj:`True` 时(默认为 " +":obj:`False` ),数据处理函数 :func:`trans_func` 的输入不再是单条数据,而是数据集的所有数据:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:99 +msgid "" +"可以看到,在本例中两种实现的结果是相同的。但是在诸如阅读理解,对话等任务中,一条原始数据可能会产生多个 *feature* 的情况(参见 " +"`run_squad.py " +"`__" +" )通常需要将 :attr:`batched` 参数设置为 :obj:`True` 。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:101 +msgid "" +":func:`map` 方法还有一个 :attr:`num_workers` " +"参数,当其大于0时进行多进程数据处理,可以提高处理速度。但是需要注意如果在数据处理的函数中用到了 **数据index** " +"的相关信息,多进程处理可能会导致错误的结果。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:103 +msgid "" +"关于 :func:`map` 方法的其他参数和 :class:`paddlenlp.datasets.MapDataset` " +"的其他数据处理方法,请查阅 :doc:`dataset <../source/paddlenlp.datasets.dataset>` 。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:106 +msgid "Batchify" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:108 +msgid "" +"PaddleNLP内置了多种collate function,配合 :class:`paddle.io.BatchSampler` " +"可以协助用户简单的完成组batch的操作。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:110 +msgid "" +"我们继续以 :obj:`LCQMC` 的数据处理流程为例。从上一节最后可以看到,处理后的单条数据是一个 **字典** ,包含 " +"`input_ids` , `token_type_ids` 和 `label` 三个key。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:112 +msgid "" +"其中 `input_ids` 和 `token_type_ids` 是需要进行 **padding** 操作后输入模型的,而 `label` " +"是需要 **stack** 之后传入loss function的。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:114 +msgid "" 
+"因此,我们使用PaddleNLP内置的 :func:`Dict` ,:func:`Stack` 和 :func:`Pad` " +"函数整理batch中的数据。最终的 :func:`batchify_fn` 如下:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:127 +msgid "" +"之后使用 :class:`paddle.io.BatchSampler` 和 :func:`batchify_fn` 构建 " +":class:`paddle.io.DataLoader` :" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:137 +msgid "" +"到此,一个完整的数据准备流程就完成了。关于更多batchify方法,请查阅 :doc:`collate " +"<../source/paddlenlp.data.collate>`。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:141 +msgid "" +"当需要进行 **单机多卡** 训练时,需要将 :class:`BatchSampler` 更换为 " +":class:`DistributedBatchSampler` 。更多有关 :class:`paddle.io.BatchSampler` " +"的信息,请查阅 `BatchSampler " +"`_。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:143 +msgid "" +"当需要诸如batch内排序,按token组batch等更复杂的组batch功能时。可以使用PaddleNLP内置的 " +":class:`SamplerHelper` 。相关用例请参考 `reader.py " +"`__。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:146 +msgid "基于非预训练模型的数据处理" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:148 +msgid "" +"在使用非预训练模型做NLP任务时,我们可以借助PaddleNLP内置的 :class:`JiebaTokenizer` 和 " +":class:`Vocab` 完成数据处理的相关功能,整体流程与使用预训练模型基本相似。我们以中文情感分析 :obj:`ChnSentiCorp`" +" 数据集为例:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:169 +msgid "" +":class:`Vocab` 除了可以从本地词典文件初始化之外,还提供多种初始化方法,包括从 :class:`dictionary` " +"创建、从数据集创建等。详情请查阅Vocab。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:170 +msgid "" +"除了使用内置的 :class:`JiebaTokenizer` 外,用户还可以使用任何自定义的方式或第三方库进行分词,之后使用 " +":func:`Vocab.to_indices` 方法将token转为id。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:172 +msgid "之后与基于预训练模型的数据处理流程相似,编写数据处理函数并传入 :func:`map` 方法:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:195 +msgid "" +"可以看到,原始数据已经被处理成了 *feature* 。但是这里我们发现单条数据并不是一个 **字典** ,而是 **元组** 。所以我们的 " +":func:`batchify_fn` 也要相应的做一些调整:" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:208 +msgid "" +"可以看到,:func:`Dict` 函数是将单条数据中的键值与 :func:`Pad` 等函数进行对应,适用于单条数据是字典的情况。而 " +":func:`Tuple` 是通过单条数据中不同部分的index进行对应的。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:210 +msgid "所以需要 **注意** 的是 :func:`convert_example` 方法和 :func:`batchify_fn` 方法的匹配。" +msgstr "" + +#: ../data_prepare/data_preprocess.rst:212 +msgid "之后的流程与基于预训练模型的数据处理相同。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/dataset_list.po b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_list.po new file mode 100644 index 0000000000000000000000000000000000000000..8018860488efb9f5ea1bfe6c1b627940781545d2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_list.po @@ -0,0 +1,63 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data_prepare/dataset_list.md:1 +msgid "PaddleNLP Datasets API" +msgstr "" + +#: ../data_prepare/dataset_list.md:3 +msgid "PaddleNLP提供了以下数据集的快速读取API,实际使用时请根据需要添加splits信息:" +msgstr "" + +#: ../data_prepare/dataset_list.md:5 +msgid "阅读理解" +msgstr "" + +#: ../data_prepare/dataset_list.md:55 +msgid "文本分类" +msgstr "" + +#: ../data_prepare/dataset_list.md:234 +msgid "文本匹配" +msgstr "" + +#: ../data_prepare/dataset_list.md:253 +msgid "序列标注" +msgstr "" + +#: ../data_prepare/dataset_list.md:283 +msgid "机器翻译" +msgstr "" + +#: ../data_prepare/dataset_list.md:307 +msgid "机器同传" +msgstr "" + +#: ../data_prepare/dataset_list.md:326 +msgid "对话系统" +msgstr "" + +#: ../data_prepare/dataset_list.md:345 +msgid "文本生成" +msgstr "" + +#: ../data_prepare/dataset_list.md:389 +msgid "语料库" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/dataset_load.po b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_load.po new file mode 100644 index 0000000000000000000000000000000000000000..d4575f4162c2e388c06de27ca4fdf099950ccb26 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_load.po @@ -0,0 +1,103 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data_prepare/dataset_load.rst:3 +msgid "加载数据集" +msgstr "" + +#: ../data_prepare/dataset_load.rst:6 +msgid "快速加载内置数据集" +msgstr "" + +#: ../data_prepare/dataset_load.rst:8 +msgid "" +"目前PaddleNLP内置20余个NLP数据集,涵盖阅读理解,文本分类,序列标注,机器翻译等多项任务。目前提供的数据集可以在 " +":doc:`数据集列表 <./dataset_list>` 中找到。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:10 +msgid "以 **msra_ner** 数据集为例:" +msgstr "" + +#: ../data_prepare/dataset_load.rst:17 +msgid "" +":func:`load_dataset` 方法会从 :obj:`paddlenlp.datasets` " +"下找到msra_ner数据集对应的数据读取脚本(默认路径:paddlenlp/datasets/msra_ner.py),并调用脚本中 " +":class:`DatasetBuilder` 类的相关方法生成数据集。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:19 +msgid "" +"生成数据集可以以 :class:`MapDataset` 和 :class:`IterDataset` 两种类型返回,分别是对 " +":class:`paddle.io.Dataset` 和 :class:`paddle.io.IterableDataset` 的扩展,只需在 " +":func:`load_dataset` 时设置 :attr:`lazy` 参数即可获取相应类型。:obj:`Flase` 对应返回 " +":class:`MapDataset` ,:obj:`True` 对应返回 :class:`IterDataset`,默认值为None,对应返回 " +":class:`DatasetBuilder` 默认的数据集类型,大多数为 :class:`MapDataset` 。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:31 +msgid "" +"关于 :class:`MapDataset` 和 :class:`IterDataset` 功能和异同可以参考API文档 " +":doc:`datasets <../source/paddlenlp.datasets.dataset>`。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:34 +msgid "选择子数据集" +msgstr "" + +#: ../data_prepare/dataset_load.rst:36 +msgid "" +"有些数据集是很多子数据集的集合,每个子数据集都是一个独立的数据集。例如 **GLUE** 数据集就包含COLA, SST2, MRPC, " +"QQP等10个子数据集。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:38 +msgid 
":func:`load_dataset` 方法提供了一个 :attr:`name` 参数用来指定想要获取的子数据集。使用方法如下:" +msgstr "" + +#: ../data_prepare/dataset_load.rst:46 +msgid "以内置数据集格式读取本地数据集" +msgstr "" + +#: ../data_prepare/dataset_load.rst:48 +msgid "" +"有的时候,我们希望使用数据格式与内置数据集相同的本地数据替换某些内置数据集的数据(例如参加SQuAD竞赛,对训练数据进行了数据增强)。 " +":func:`load_dataset` 方法提供的 :attr:`data_files` 参数可以实现这个功能。以 **SQuAD** 为例。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:58 +msgid "" +"对于某些数据集,不同的split的读取方式不同。对于这种情况则需要在 :attr:`splits` 参数中以传入与 " +":attr:`data_files` **一一对应** 的split信息。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:60 +msgid "此时 :attr:`splits` 不再代表选取的内置数据集,而代表以何种格式读取本地数据集。" +msgstr "" + +#: ../data_prepare/dataset_load.rst:62 +msgid "下面以 **COLA** 数据集为例:" +msgstr "" + +#: ../data_prepare/dataset_load.rst:69 +msgid "" +"**另外需要注意数据集的是没有默认加载选项的,**:attr:`splits` **和**:attr:`data_files` " +"**必须至少指定一个。**" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/dataset_self_defined.po b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_self_defined.po new file mode 100644 index 0000000000000000000000000000000000000000..3d076811d2f3ac2880557884514818a06859b5a7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_self_defined.po @@ -0,0 +1,127 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data_prepare/dataset_self_defined.rst:3 +msgid "如何自定义数据集" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:5 +msgid "" +"通过使用PaddleNLP提供的 :func:`load_dataset` , :class:`MapDataset` 和 " +":class:`IterDataset` 。任何人都可以方便的定义属于自己的数据集。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:8 +msgid "从本地文件创建数据集" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:10 +msgid "" +"从本地文件创建数据集时,我们 **推荐** 根据本地数据集的格式给出读取function并传入 :func:`load_dataset` " +"中创建数据集。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:12 +msgid "" +"以 `waybill_ie " +"`__" +" 快递单信息抽取任务中的数据为例:" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:32 +msgid "" +"我们推荐将数据读取代码写成生成器(generator)的形式,这样可以更好的构建 :class:`MapDataset` 和 " +":class:`IterDataset` 两种数据集。同时我们也推荐将单条数据写成字典的格式,这样可以更方便的监测数据流向。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:34 +msgid "" +"事实上,:class:`MapDataset` 在绝大多数时候都可以满足要求。一般只有在数据集过于庞大无法一次性加载进内存的时候我们才考虑使用 " +":class:`IterDataset` 。任何人都可以方便的定义属于自己的数据集。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:38 +msgid "" +"需要注意的是,只有PaddleNLP内置的数据集具有将数据中的label自动转为id的功能(详细条件参见 " +":doc:`创建DatasetBuilder " +"<../community/contribute_datasets/how_to_write_a_DatasetBuilder>`)。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:40 +msgid "像上例中的自定义数据集需要在自定义的convert to feature方法中添加label转id的功能。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:42 +msgid "" +"自定义数据读取function中的参数可以直接以关键字参数的方式传入 :func:`load_dataset` " +"中。而且对于自定义数据集,:attr:`lazy` 参数是 **必须** 传入的。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:45 +msgid "从 :class:`paddle.io.Dataset/IterableDataset` 创建数据集" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:47 +msgid "" +"虽然PaddlePddle内置的 :class:`Dataset` 和 
:class:`IterableDataset` 是可以直接接入 " +":class:`DataLoader` 用于模型训练的,但有时我们希望更方便的使用一些数据处理(例如convert to feature, " +"数据清洗,数据增强等)。而PaddleNLP内置的 :class:`MapDataset` 和 :class:`IterDataset` " +"正好提供了能实现以上功能的API。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:49 +msgid "" +"所以如果您习惯使用 :class:`paddle.io.Dataset/IterableDataset` " +"创建数据集的话。只需要在原来的数据集上套上一层 :class:`MapDataset` 或 :class:`IterDataset` " +"就可以把原来的数据集对象转换成PaddleNLP的数据集。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:51 +msgid "下面举一个简单的小例子。:class:`IterDataset` 的用法基本相同。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:78 +msgid "从其他python对象创建数据集" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:80 +msgid "" +"理论上,我们可以使用任何包含 :func:`__getitem__` 方法和 :func:`__len__` 方法的python对象创建 " +":class:`MapDataset`。包括 :class:`List` ,:class:`Tuple` ,:class:`DataFrame` " +"等。只要将符合条件的python对象作为初始化参数传入 :class:`MapDataset` 即可完成创建。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:95 +msgid "" +"同样的,我们也可以使用包含 :func:`__iter__` 方法的python对象创建 :class:`IterDataset` 。例如 " +":class:`List`, :class:`Generator` 等。创建方法与 :class:`MapDataset` 相同。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:112 +msgid "需要注意,像上例中直接将 **生成器** 对象传入 :class:`IterDataset` 所生成的数据集。其数据只能迭代 **一次** 。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:114 +msgid "与常规的python对象一样,只要满足以上的条件,我们也可以使用同样的方法从第三方数据集创建PaddleNLP数据集。" +msgstr "" + +#: ../data_prepare/dataset_self_defined.rst:116 +msgid "例如HuggingFace Dataset:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/overview.po b/docs/locale/en/LC_MESSAGES/data_prepare/overview.po new file mode 100644 index 0000000000000000000000000000000000000000..9da1d4654b3c84a5113c10da24e4879734d2a6dd --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/data_prepare/overview.po @@ -0,0 +1,101 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../data_prepare/overview.rst:3 +msgid "整体介绍" +msgstr "" + +#: ../data_prepare/overview.rst:5 +msgid "数据集和数据处理部分一直是NLP任务中最重要的环节之一。为了方便用户以更低的学习成本完成这一环节,PaddleNLP提供了以下特性:" +msgstr "" + +#: ../data_prepare/overview.rst:7 +msgid "功能强大的API。可以帮助用户完成大部分常见NLP任务的数据处理流程。" +msgstr "" + +#: ../data_prepare/overview.rst:8 +msgid "更灵活的封装。各个模块保持低耦合,高内聚,保证用户可以通过继承和改写满足特定的数据处理需求。" +msgstr "" + +#: ../data_prepare/overview.rst:9 +msgid "内置数据集涵盖大部分NLP任务,搭配简洁易用的数据集加载协议和贡献协议。对新手和社区贡献者更加友好。" +msgstr "" + +#: ../data_prepare/overview.rst:12 +msgid "核心API" +msgstr "" + +#: ../data_prepare/overview.rst:14 +msgid "" +":func:`load_dataset` :数据集快速加载接口,通过传入数据集读取脚本的名称和其他参数调用 " +":class:`DatasetBuilder` 子类的相关方法生成数据集。关于加载数据集的详细方法,请查阅 :doc:`加载数据集 " +"<./dataset_load>` 。" +msgstr "" + +#: ../data_prepare/overview.rst:15 +msgid "" +":class:`DatasetBuilder` : :class:`DatasetBuilder` " +"是一个基类,所有的内置数据集都继承自该类,该类的主要功能是下载和读取数据集文件并生成Dataset。其中大部分方法已经封装,不对贡献者暴露。贡献者通过重写" +" :func:`_get_data` 和 :func:`_read` 等方法像社区贡献数据集。详细信息请查阅 :doc:`如何贡献数据集 " +"` 。" +msgstr "" + +#: ../data_prepare/overview.rst:16 +msgid "" +":class:`MapDataset/IterDataset` :PaddleNLP内置数据集类型,分别是对 " +":class:`paddle.io.Dataset` 和 :class:`paddle.io.IterableDataset` 的扩展。内置诸如 " +":func:`map` , :func:`filter` " +"等适用于NLP任务的数据处理功能。同时还能帮助用户简单创建自定义数据集。详细信息请查阅***和 :doc:`如何自定义数据集 " +"<./dataset_self_defined>` 。" +msgstr "" + +#: ../data_prepare/overview.rst:19 +msgid "数据处理流程设计" +msgstr "" + +#: ../data_prepare/overview.rst:21 +msgid "目前PaddleNLP的通用数据处理流程如下:" +msgstr "" + +#: ../data_prepare/overview.rst:23 +msgid "加载数据集(内置数据集或者自定义数据集,数据集返回 **原始数据**)。" +msgstr "" + +#: ../data_prepare/overview.rst:24 +msgid "" +"定义 :func:`trans_func` ,包括tokenize,token to id等操作,并传入数据集的 :func:`map` " +"方法,将原始数据转为 *feature* 。" +msgstr "" + +#: ../data_prepare/overview.rst:25 +msgid "根据上一步数据处理的结果定义 **batchify** 方法和 :class:`BatchSampler` 。" +msgstr "" + +#: ../data_prepare/overview.rst:26 +msgid "定义 :class:`DataLoader` , 传入 :class:`BatchSampler` 和 :func:`batchify_fn` 。" +msgstr "" + +#: ../data_prepare/overview.rst:28 +msgid "下面是基于Bert的文本分类任务的数据处理流程图:" +msgstr "" + +#: ../data_prepare/overview.rst:32 +msgid "关于数据处理的详细信息,请查阅 :doc:`./data_preprocess` 。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/datasets.po b/docs/locale/en/LC_MESSAGES/datasets.po new file mode 100644 index 0000000000000000000000000000000000000000..346f663b86c1beb880fe0f5534b1f92c0458fb2f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/datasets.po @@ -0,0 +1,55 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../datasets.md:1 +msgid "PaddleNLP Datasets API" +msgstr "" + +#: ../datasets.md:3 +msgid "PaddleNLP提供了以下数据集的快速读取API,实际使用时请根据需要添加splits信息:" +msgstr "" + +#: ../datasets.md:5 +msgid "阅读理解" +msgstr "" + +#: ../datasets.md:44 +msgid "文本分类" +msgstr "" + +#: ../datasets.md:114 +msgid "序列标注" +msgstr "" + +#: ../datasets.md:139 +msgid "机器翻译" +msgstr "" + +#: ../datasets.md:164 +msgid "机器同传" +msgstr "" + +#: ../datasets.md:184 +msgid "文本生成" +msgstr "" + +#: ../datasets.md:208 +msgid "语料库" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/get_started/installation.po b/docs/locale/en/LC_MESSAGES/get_started/installation.po new file mode 100644 index 0000000000000000000000000000000000000000..4a74fb50ddb0a501f038a1f7f7518736279dd9ea --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/get_started/installation.po @@ -0,0 +1,108 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../get_started/installation.rst:2 +msgid "安装PaddleNLP" +msgstr "" + +#: ../get_started/installation.rst:3 +msgid "" +"以下安装过程默认用户已安装好paddlepaddle-" +"gpu或paddlepaddle(版本大于或等于2.0),paddlepaddle安装方式参照 飞桨官网_。" +msgstr "" + +#: ../get_started/installation.rst:8 +msgid "pip安装" +msgstr "" + +#: ../get_started/installation.rst:14 +msgid "Anaconda安装" +msgstr "" + +#: ../get_started/installation.rst:15 +msgid "Anaconda是一个开源的Python发行版本,其包含了conda、Python等180多个科学包及其依赖项。使用Anaconda可以通过创建多个独立的Python环境,避免用户的Python环境安装太多不同版本依赖导致冲突。" +msgstr "" + +#: ../get_started/installation.rst:18 +msgid "1、windows安装Anaconda" +msgstr "" + +#: ../get_started/installation.rst:21 ../get_started/installation.rst:56 +msgid "第一步 下载" +msgstr "" + +#: ../get_started/installation.rst:22 +msgid "在 Anaconda官网_ 选择下载Windows Python3.7 64-Bit版本。" +msgstr "" + +#: ../get_started/installation.rst:26 +msgid "确保已经安装Visual C++ Build Tools(可以在开始菜单中找到),如未安装,请点击 下载安装_。" +msgstr "" + +#: ../get_started/installation.rst:31 ../get_started/installation.rst:62 +msgid "第二步 安装" +msgstr "" + +#: ../get_started/installation.rst:32 +msgid "运行下载的安装包(以.exe为后辍),根据引导完成安装, 用户可自行修改安装目录(如下图)。" +msgstr "" + +#: ../get_started/installation.rst:37 ../get_started/installation.rst:73 +msgid "第三步 使用" +msgstr "" + +#: ../get_started/installation.rst:38 +msgid "点击Windows系统左下角的Windows图标,打开:所有程序->Anaconda3/2(64-bit)->Anaconda Prompt" +msgstr "" + +#: ../get_started/installation.rst:39 +msgid "在命令行中执行下述命令" +msgstr "" + +#: ../get_started/installation.rst:50 ../get_started/installation.rst:84 +msgid "" +"按如上方式配置后,即可在环境中使用PaddleNLP了,命令行输入python回车后,import " +"paddlenlp试试吧,之后再次使用都可以通过打开'所有程序->Anaconda3/2(64-bit)->Anaconda " +"Prompt',再执行conda activate my_paddlenlp进入环境后,即可再次使用PaddleNLP。" +msgstr "" + +#: 
../get_started/installation.rst:53 +msgid "2、Linux/Mac安装Anaconda" +msgstr "" + +#: ../get_started/installation.rst:57 +msgid "在 Anaconda官网_ 选择下载对应系统 Python3.7版本下载(Mac下载Command Line Installer版本即可)。" +msgstr "" + +#: ../get_started/installation.rst:63 +msgid "打开终端,在终端安装Anaconda" +msgstr "" + +#: ../get_started/installation.rst:70 +msgid "安装过程中一直回车即可,如提示设置安装路径,可根据需求修改,一般默认即可。" +msgstr "" + +#: ../get_started/installation.rst:87 +msgid "代码安装" +msgstr "" + +#: ../get_started/installation.rst:88 +msgid "github代码会跟随开发进度不断更新" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/get_started/quick_start.po b/docs/locale/en/LC_MESSAGES/get_started/quick_start.po new file mode 100644 index 0000000000000000000000000000000000000000..0b9dc90f343e3b652ca3534a86a82620b2209aa6 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/get_started/quick_start.po @@ -0,0 +1,178 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../get_started/quick_start.rst:3 +msgid "10分钟完成高精度中文情感分析" +msgstr "" + +#: ../get_started/quick_start.rst:6 +msgid "1. 安装PaddleNLP" +msgstr "" + +#: ../get_started/quick_start.rst:8 +msgid "安装相关过程和问题可以参考PaddleNLP的 安装文档_。" +msgstr "" + +#: ../get_started/quick_start.rst:18 +msgid "2. 一键加载预训练模型" +msgstr "" + +#: ../get_started/quick_start.rst:20 +msgid "" +"情感分析本质是一个文本分类任务。PaddleNLP内置了ERNIE、BERT、RoBERTa、Electra等丰富的预训练模型" +",并且内置了各种预训练模型对于不同下游任务的Fine-" +"tune网络。用户可以使用PaddleNLP提供的模型,完成问答、序列分类、token分类等任务。查阅 预训练模型_ " +"了解更多。这里以ERNIE模型为例,介绍如何将预训练模型Fine-tune完成文本分类任务。" +msgstr "" + +#: ../get_started/quick_start.rst:24 +msgid "加载预训练模型ERNIE" +msgstr "" + +#: ../get_started/quick_start.rst:31 +msgid "加载预训练模型ERNIE用于文本分类任务的Fine-tune网络,只需指定想要使用的模型名称和文本分类的类别数即可完成网络定义。" +msgstr "" + +#: ../get_started/quick_start.rst:39 +msgid "3. 调用Tokenizer进行数据处理" +msgstr "" + +#: ../get_started/quick_start.rst:41 +msgid "Tokenizer用于将原始输入文本转化成模型可以接受的输入数据形式。PaddleNLP对于各种预训练模型已经内置了相应的Tokenizer,指定想要使用的模型名字即可加载。" +msgstr "" + +#: ../get_started/quick_start.rst:47 +msgid "" +"Transformer类预训练模型所需的数据处理步骤通常包括将原始输入文本切分token;将token映射为对应的token " +"id;拼接上预训练模型对应的特殊token " +",如[CLS]、[SEP];最后转化为框架所需的数据格式。为了方便使用,PaddleNLP提供了高阶API,一键即可返回模型所需数据格式。" +msgstr "" + +#: ../get_started/quick_start.rst:49 +msgid "一行代码完成切分token,映射token ID以及拼接特殊token:" +msgstr "" + +#: ../get_started/quick_start.rst:55 +msgid "转化成paddle框架数据格式:" +msgstr "" + +#: ../get_started/quick_start.rst:68 +msgid "input_ids: 表示输入文本的token ID。" +msgstr "" + +#: ../get_started/quick_start.rst:70 +msgid "token_type_ids: 表示对应的token属于输入的第一个句子还是第二个句子。(Transformer类预训练模型支持单句以及句对输入。)" +msgstr "" + +#: ../get_started/quick_start.rst:72 +msgid "此时即可输入ERNIE模型中得到相应输出。" +msgstr "" + +#: ../get_started/quick_start.rst:81 +msgid "可以看出,ERNIE模型输出有2个tensor。" +msgstr "" + +#: ../get_started/quick_start.rst:83 +msgid "" +"sequence_output是对应每个输入token的语义特征表示,shape为(1, num_tokens, " +"hidden_size)。其一般用于序列标注、问答等任务。" +msgstr "" + +#: ../get_started/quick_start.rst:85 +msgid "pooled_output是对应整个句子的语义特征表示,shape为(1, hidden_size)。其一般用于文本分类、信息检索等任务。" +msgstr "" + +#: ../get_started/quick_start.rst:88 +msgid "4. 
加载数据集" +msgstr "" + +#: ../get_started/quick_start.rst:89 +msgid "PaddleNLP内置了适用于阅读理解、文本分类、序列标注、机器翻译等下游任务的多个数据集,这里我们使用公开中文情感分析数据集ChnSenticorp,包含7000多条正负向酒店评论数据。" +msgstr "" + +#: ../get_started/quick_start.rst:91 +msgid "一键加载PaddleNLP内置数据集:" +msgstr "" + +#: ../get_started/quick_start.rst:98 +msgid "获取分类数据标签:" +msgstr "" + +#: ../get_started/quick_start.rst:106 +msgid "展示一些数据:" +msgstr "" + +#: ../get_started/quick_start.rst:122 +msgid "5. 模型训练与评估" +msgstr "" + +#: ../get_started/quick_start.rst:123 +msgid "" +"数据读入时使用 :func:`paddle.io.DataLoader` " +"接口多线程异步加载数据,然后设置适用于ERNIE这类Transformer模型的动态学习率和损失函数、优化算法、评价指标等。" +msgstr "" + +#: ../get_started/quick_start.rst:125 +msgid "模型训练的过程通常按照以下步骤:" +msgstr "" + +#: ../get_started/quick_start.rst:127 +msgid "从dataloader中取出一个batch data。" +msgstr "" + +#: ../get_started/quick_start.rst:128 +msgid "将batch data喂给model,做前向计算。" +msgstr "" + +#: ../get_started/quick_start.rst:129 +msgid "将前向计算结果传给损失函数,计算loss。将前向计算结果传给评价方法,计算评价指标。" +msgstr "" + +#: ../get_started/quick_start.rst:130 +msgid "loss反向回传,更新梯度。重复以上步骤。" +msgstr "" + +#: ../get_started/quick_start.rst:131 +msgid "每训练一个epoch时,程序将会评估一次,评估当前模型训练的效果。" +msgstr "" + +#: ../get_started/quick_start.rst:133 +msgid "本示例同步在AIStudio上,可直接 在线体验模型训练_。" +msgstr "" + +#: ../get_started/quick_start.rst:137 +msgid "最后,保存训练好的模型用于预测。" +msgstr "" + +#: ../get_started/quick_start.rst:140 +msgid "6. 模型预测" +msgstr "" + +#: ../get_started/quick_start.rst:141 +msgid "保存训练模型,定义预测函数 :func:`predict` ,即可开始预测文本情感倾向。" +msgstr "" + +#: ../get_started/quick_start.rst:143 +msgid "以自定义预测数据和数据标签为示例:" +msgstr "" + +#: ../get_started/quick_start.rst:154 +msgid "得到预测结果:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/index.po b/docs/locale/en/LC_MESSAGES/index.po new file mode 100644 index 0000000000000000000000000000000000000000..1cf3bcec97d47c85e6af8b37d096c05e72a34580 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/index.po @@ -0,0 +1,252 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../index.rst:26 +msgid "安装" +msgstr "" + +#: ../index.rst:26 +msgid "10分钟完成高精度中文情感分析" +msgstr "" + +#: ../index.rst:26 +msgid "快速开始" +msgstr "" + +#: ../index.rst:33 +msgid "整体介绍" +msgstr "" + +#: ../index.rst:33 +msgid "数据集列表" +msgstr "" + +#: ../index.rst:33 +msgid "加载数据集" +msgstr "" + +#: ../index.rst:33 +msgid "自定义数据集" +msgstr "" + +#: ../index.rst:33 +msgid "数据处理" +msgstr "" + +#: ../index.rst:33 +msgid "数据准备" +msgstr "" + +#: ../index.rst:43 +msgid "Transformer预训练模型" +msgstr "" + +#: ../index.rst:43 +msgid "使用Trainer API训练" +msgstr "" + +#: ../index.rst:43 +msgid "一键预测功能" +msgstr "" + +#: ../index.rst:43 +msgid "预训练词向量" +msgstr "" + +#: ../index.rst:43 +msgid "模型库" +msgstr "" + +#: ../index.rst:53 +msgid "评价指标" +msgstr "" + +#: ../index.rst:59 +msgid "AI Studio Notebook" +msgstr "" + +#: ../index.rst:59 +msgid "实践教程" +msgstr "" + +#: ../index.rst:65 +msgid "模型压缩" +msgstr "" + +#: ../index.rst:65 +msgid "文本生成高性能加速" +msgstr "" + +#: ../index.rst:65 +msgid "大规模分布式训练" +msgstr "" + +#: ../index.rst:65 +msgid "进阶指南" +msgstr "" + +#: ../index.rst:73 +msgid "如何贡献模型" +msgstr "" + +#: ../index.rst:73 +msgid "如何贡献数据集" +msgstr "" + +#: ../index.rst:73 +msgid "如何贡献文档案例" +msgstr "" + +#: ../index.rst:73 +msgid "如何加入兴趣小组" +msgstr "" + +#: ../index.rst:73 +msgid "社区交流共建" +msgstr "" + +#: ../index.rst:82 +msgid "FAQ" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.data" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.datasets" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.embeddings" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.layers" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.losses" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.metrics" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.ops" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.seq2vec" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.taskflow" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.trainer" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.transformers" +msgstr "" + +#: ../index.rst:88 +msgid "paddlenlp.utils" +msgstr "" + +#: ../index.rst:88 +msgid "API Reference" +msgstr "" + +#: ../index.rst:2 +msgid "欢迎使用PaddleNLP" +msgstr "" + +#: ../index.rst:4 +msgid "" +"`PaddleNLP `_ 是飞桨自然语言处理开发库,具备 " +"**易用的文本领域API**,**多场景的应用示例**、和 **高性能分布式训练** " +"三大特点,旨在提升飞桨开发者文本领域建模效率,旨在提升开发者在文本领域的开发效率,并提供丰富的NLP应用示例。" +msgstr "" + +#: ../index.rst:7 +msgid "**易用的文本领域API**" +msgstr "" + +#: ../index.rst:9 +msgid "" +"提供丰富的产业级预置任务能力 **Taskflow** 和全流程的文本领域API:支持丰富中文数据集加载的 **Dataset " +"API**,可灵活高效地完成数据预处理的 **Data API** ,预置60+预训练词向量的 **Embedding API** " +",提供100+预训练模型的 **Transformer API** 等,可大幅提升NLP任务建模的效率。" +msgstr "" + +#: ../index.rst:11 +msgid "**多场景的应用示例**" +msgstr "" + +#: ../index.rst:13 +msgid "覆盖从学术到产业级的NLP应用示例,涵盖NLP基础技术、NLP系统应用以及相关拓展应用。全面基于飞桨核心框架2.0全新API体系开发,为开发者提供飞桨文本领域的最佳实践。" +msgstr "" + +#: ../index.rst:15 +msgid "**高性能分布式训练**" +msgstr "" + +#: ../index.rst:17 +msgid "基于飞桨核心框架领先的自动混合精度优化策略,结合分布式Fleet API,支持4D混合并行策略,可高效地完成大规模预训练模型训练。" +msgstr "" + +#: ../index.rst:20 +msgid "项目GitHub: https://github.com/PaddlePaddle/PaddleNLP" +msgstr "" + +#: ../index.rst:21 +msgid "项目Gitee: 
https://gitee.com/paddlepaddle/PaddleNLP" +msgstr "" + +#: ../index.rst:22 +msgid "GitHub Issue反馈: https://github.com/PaddlePaddle/PaddleNLP/issues" +msgstr "" + +#: ../index.rst:23 +msgid "官方QQ技术交流群: 973379845" +msgstr "" + +#: ../index.rst:106 +msgid "Indices and tables" +msgstr "" + +#: ../index.rst:107 +msgid ":ref:`genindex`" +msgstr "" + +#: ../index.rst:108 +msgid ":ref:`modindex`" +msgstr "" + +#: ../index.rst:109 +msgid ":ref:`search`" +msgstr "" + +#~ msgid "TaskFlow" +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/metrics.po b/docs/locale/en/LC_MESSAGES/metrics.po new file mode 100644 index 0000000000000000000000000000000000000000..18808669e1d277fff110a19a3198928a5620f3b3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/metrics.po @@ -0,0 +1,27 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../metrics.md:1 +msgid "PaddleNLP Metrics API" +msgstr "" + +#: ../metrics.md:3 +msgid "目前PaddleNLP提供以下模型评价指标:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/metrics/metrics.po b/docs/locale/en/LC_MESSAGES/metrics/metrics.po new file mode 100644 index 0000000000000000000000000000000000000000..e2fb3304a303b6873b906a743966f9f27339ef68 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/metrics/metrics.po @@ -0,0 +1,27 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../metrics/metrics.md:1 +msgid "PaddleNLP Metrics API" +msgstr "" + +#: ../metrics/metrics.md:3 +msgid "目前PaddleNLP提供以下模型评价指标:" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/embeddings.po b/docs/locale/en/LC_MESSAGES/model_zoo/embeddings.po new file mode 100644 index 0000000000000000000000000000000000000000..ad259196948d182dfda771a9a017e1a3b00e86d6 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/embeddings.po @@ -0,0 +1,219 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../model_zoo/embeddings.md:1 +msgid "PaddleNLP Embedding API" +msgstr "" + +#: ../model_zoo/embeddings.md:3 ../model_zoo/embeddings.md:24 +msgid "介绍" +msgstr "" + +#: ../model_zoo/embeddings.md:4 ../model_zoo/embeddings.md:28 +msgid "用法" +msgstr "" + +#: ../model_zoo/embeddings.md:5 ../model_zoo/embeddings.md:30 +msgid "TokenEmbedding参数" +msgstr "" + +#: ../model_zoo/embeddings.md:6 ../model_zoo/embeddings.md:69 +msgid "初始化" +msgstr "" + +#: ../model_zoo/embeddings.md:7 ../model_zoo/embeddings.md:101 +msgid "查询embedding结果" +msgstr "" + +#: ../model_zoo/embeddings.md:8 ../model_zoo/embeddings.md:115 +msgid "可视化embedding结果" +msgstr "" + +#: ../model_zoo/embeddings.md:9 ../model_zoo/embeddings.md:140 +msgid "计算词向量cosine相似度" +msgstr "" + +#: ../model_zoo/embeddings.md:10 ../model_zoo/embeddings.md:147 +msgid "计算词向量内积" +msgstr "" + +#: ../model_zoo/embeddings.md:11 ../model_zoo/embeddings.md:155 +msgid "训练" +msgstr "" + +#: ../model_zoo/embeddings.md:12 ../model_zoo/embeddings.md:171 +msgid "切词" +msgstr "" + +#: ../model_zoo/embeddings.md:13 ../model_zoo/embeddings.md:183 +msgid "预训练模型" +msgstr "" + +#: ../model_zoo/embeddings.md:14 ../model_zoo/embeddings.md:189 +msgid "中文词向量" +msgstr "" + +#: ../model_zoo/embeddings.md:15 ../model_zoo/embeddings.md:343 +msgid "英文词向量" +msgstr "" + +#: ../model_zoo/embeddings.md:16 ../model_zoo/embeddings.md:345 +msgid "Word2Vec" +msgstr "" + +#: ../model_zoo/embeddings.md:17 ../model_zoo/embeddings.md:362 +msgid "GloVe" +msgstr "" + +#: ../model_zoo/embeddings.md:18 ../model_zoo/embeddings.md:395 +msgid "FastText" +msgstr "" + +#: ../model_zoo/embeddings.md:19 ../model_zoo/embeddings.md:416 +msgid "使用方式" +msgstr "" + +#: ../model_zoo/embeddings.md:20 ../model_zoo/embeddings.md:427 +msgid "模型信息" +msgstr "" + +#: ../model_zoo/embeddings.md:21 ../model_zoo/embeddings.md:751 +msgid "致谢" +msgstr "" + +#: ../model_zoo/embeddings.md:22 ../model_zoo/embeddings.md:756 +msgid "参考论文" +msgstr "" + +#: ../model_zoo/embeddings.md:26 +msgid "PaddleNLP提供多个开源的预训练词向量模型,用户仅需在使用paddlenlp.embeddings.TokenEmbedding时,指定预训练模型的名称,即可加载相对应的预训练模型。以下将介绍TokenEmbeddign详细用法,并列出PaddleNLP所支持的预训练Embedding模型。" +msgstr "" + +#: ../model_zoo/embeddings.md:116 +msgid "使用深度学习可视化工具VisualDL的High Dimensional组件可以对embedding结果进行可视化展示,便于对其直观分析,步骤如下:" +msgstr "" + +#: ../model_zoo/embeddings.md:128 +msgid "执行完毕后会在当前路径下生成一个visualize目录,并将日志存放在其中,我们在命令行启动VisualDL即可进行查看,启动命令为:" +msgstr "" + +#: ../model_zoo/embeddings.md:132 +msgid "启动后打开浏览器即可看到可视化结果" +msgstr "" + +#: ../model_zoo/embeddings.md:138 +msgid "使用VisualDL除可视化embedding结果外,还可以对标量、图片、音频等进行可视化,有效提升训练调参效率。关于VisualDL更多功能和详细介绍,可参考VisualDL使用文档。" +msgstr "" + +#: ../model_zoo/embeddings.md:157 +msgid "" +"以下为TokenEmbedding简单的组网使用方法。有关更多TokenEmbedding训练流程相关的使用方法,请参考Word " +"Embedding with PaddleNLP。" +msgstr "" + +#: ../model_zoo/embeddings.md:185 +msgid "以下将列举PaddleNLP支持的Embedding预训练模型。" +msgstr "" + +#: ../model_zoo/embeddings.md:186 +msgid "模型命名方式为:${训练模型}.${语料}.${词向量类型}.${co-occurrence type}.dim${维度}。" +msgstr "" + +#: ../model_zoo/embeddings.md:187 +msgid "模型有三种,分别是Word2Vec(w2v, skip-gram), GloVe(glove)和FastText(fasttext)。" +msgstr "" + +#: 
../model_zoo/embeddings.md:191 +msgid "以下预训练词向量由Chinese-Word-Vectors提供。" +msgstr "" + +#: ../model_zoo/embeddings.md:193 +msgid "根据不同类型的上下文为每个语料训练多个目标词向量,第二列开始表示不同类型的上下文。以下为上下文类别:" +msgstr "" + +#: ../model_zoo/embeddings.md:195 +msgid "Word表示训练时目标词预测的上下文是一个Word。" +msgstr "" + +#: ../model_zoo/embeddings.md:196 +msgid "" +"Word + " +"N-gram表示训练时目标词预测的上下文是一个Word或者Ngram,其中bigram表示2-grams,ngram.1-2表示1-gram或者2-grams。" +msgstr "" + +#: ../model_zoo/embeddings.md:197 +msgid "" +"Word + Character表示训练时目标词预测的上下文是一个Word或者Character,其中word-" +"character.char1-2表示上下文是1个或2个Character。" +msgstr "" + +#: ../model_zoo/embeddings.md:198 +msgid "" +"Word + Character + Ngram表示训练时目标词预测的上下文是一个Word、Character或者Ngram。bigram-" +"char表示上下文是2-grams或者1个Character。" +msgstr "" + +#: ../model_zoo/embeddings.md:284 +msgid "特别地,对于百度百科语料,在不同的 Co-occurrence类型下分别提供了目标词与上下文向量:" +msgstr "" + +#: ../model_zoo/embeddings.md:418 +msgid "" +"以上所述的模型名称可直接以参数形式传入padddlenlp.embeddings.TokenEmbedding,加载相对应的模型。比如要加载语料为Wiki2017,通过FastText训练的预训练模型(fasttext" +".wiki-news.target.word-word.dim300.en),只需执行以下代码:" +msgstr "" + +#: ../model_zoo/embeddings.md:752 +msgid "感谢 Chinese-Word-Vectors提供Word2Vec中文预训练词向量。" +msgstr "" + +#: ../model_zoo/embeddings.md:753 +msgid "感谢 GloVe Project提供的GloVe英文预训练词向量。" +msgstr "" + +#: ../model_zoo/embeddings.md:754 +msgid "感谢 FastText Project提供的英文预训练词向量。" +msgstr "" + +#: ../model_zoo/embeddings.md:757 +msgid "" +"Li, Shen, et al. \"Analogical reasoning on chinese morphological and " +"semantic relations.\" arXiv preprint arXiv:1805.06504 (2018)." +msgstr "" + +#: ../model_zoo/embeddings.md:758 +msgid "" +"Qiu, Yuanyuan, et al. \"Revisiting correlations between intrinsic and " +"extrinsic evaluations of word embeddings.\" Chinese Computational " +"Linguistics and Natural Language Processing Based on Naturally Annotated " +"Big Data. Springer, Cham, 2018. 209-221." +msgstr "" + +#: ../model_zoo/embeddings.md:759 +msgid "" +"Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. " +"GloVe: Global Vectors for Word Representation." +msgstr "" + +#: ../model_zoo/embeddings.md:760 +msgid "" +"T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin. Advances in " +"Pre-Training Distributed Word Representations." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/index.po b/docs/locale/en/LC_MESSAGES/model_zoo/index.po new file mode 100644 index 0000000000000000000000000000000000000000..f3ddee8affb5262babcaa5b8366c11ffb18f9550 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/index.po @@ -0,0 +1,782 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/index.rst:75 +msgid "ALBERT" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "BART" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "BERT" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "BigBird" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "Blenderbot" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "Blenderbot-Small" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ChineseBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ConvBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "CTRL" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "DistilBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ELECTRA" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE-CTM" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE-DOC" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE-GEN" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE-GRAM" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ERNIE-M" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "FNet" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "Funnel" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "GPT" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "LayoutLM" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "LayoutLMV2" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "LayoutXLM" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "Luke" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "MBart" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "MegatronBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "MobileBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "MPNet" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "NeZha" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "PPMiniLM" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "ProphetNet" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "Reformer" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "RemBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "RoBERTa" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "RoFormer" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "SKEP" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "SqueezeBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "T5" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "TinyBert" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "UnifiedTransformer" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "UNIMO" +msgstr "" + +#: ../model_zoo/index.rst:75 +msgid "XLNet" +msgstr "" + +#: ../model_zoo/index.rst:4 +msgid "PaddleNLP Transformer预训练模型" +msgstr "" + +#: ../model_zoo/index.rst:6 +msgid "" +"随着深度学习的发展,NLP领域涌现了一大批高质量的Transformer类预训练模型,多次刷新了不同NLP任务的SOTA(State of the" +" Art),极大地推动了自然语言处理的进展。 PaddleNLP为用户提供了常用的预训练模型及其相应权重,如 " +"``BERT``、``ERNIE``、``ALBERT``、``RoBERTa``、``XLNet`` 等,采用统一的API进行加载、训练和调用," +" 让开发者能够方便快捷地应用各种Transformer类预训练模型及其下游任务,且相应预训练模型权重下载速度快、稳定。" +msgstr "" + +#: ../model_zoo/index.rst:12 +msgid "预训练模型使用方法" +msgstr "" + +#: ../model_zoo/index.rst:14 +msgid "" +"PaddleNLP Transformer API在提供丰富预训练模型的同时,也降低了用户的使用门槛。 " 
+"使用Auto模块,可以加载不同网络结构的预训练模型,无需查找模型对应的类别。只需十几行代码,用户即可完成模型加载和下游任务Fine-tuning。" +msgstr "" + +#: ../model_zoo/index.rst:52 +msgid "" +"上面的代码给出使用预训练模型的简要示例,更完整详细的示例代码, 可以参考:`使用预训练模型Fine-tune完成中文文本分类任务 " +"`_" +msgstr "" + +#: ../model_zoo/index.rst:55 +msgid "加载数据集:PaddleNLP内置了多种数据集,用户可以一键导入所需的数据集。" +msgstr "" + +#: ../model_zoo/index.rst:56 +msgid "" +"加载预训练模型:PaddleNLP的预训练模型可以很容易地通过 ``from_pretrained()`` 方法加载。 " +"Auto模块(包括AutoModel, AutoTokenizer, 及各种下游任务类)提供了方便易用的接口, " +"无需指定类别,即可调用不同网络结构的预训练模型。 第一个参数是汇总表中对应的 ``Pretrained Weight``,可加载对应的预训练权重。" +" ``AutoModelForSequenceClassification`` 初始化 ``__init__`` 所需的其他参数,如 " +"``num_classes`` 等, 也是通过 ``from_pretrained()`` 传入。``Tokenizer`` 使用同样的 " +"``from_pretrained`` 方法加载。" +msgstr "" + +#: ../model_zoo/index.rst:62 +msgid "通过 ``Dataset`` 的 ``map`` 函数,使用 ``tokenizer`` 将 ``dataset`` 从原始文本处理成模型的输入。" +msgstr "" + +#: ../model_zoo/index.rst:63 +msgid "定义 ``BatchSampler`` 和 ``DataLoader``,shuffle数据、组合Batch。" +msgstr "" + +#: ../model_zoo/index.rst:64 +msgid "定义训练所需的优化器,loss函数等,就可以开始进行模型fine-tune任务。" +msgstr "" + +#: ../model_zoo/index.rst:68 +msgid "Transformer预训练模型汇总" +msgstr "" + +#: ../model_zoo/index.rst:70 +msgid "" +"PaddleNLP的Transformer预训练模型包含从 `huggingface.co`_ " +"直接转换的模型权重和百度自研模型权重,方便社区用户直接迁移使用。 目前共包含了40多个主流预训练模型,500多个模型权重。" +msgstr "" + +#: ../model_zoo/index.rst:125 +msgid "Transformer预训练模型适用任务汇总" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Model" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Sequence Classification" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Token Classification" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Question Answering" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Text Generation" +msgstr "" + +#: ../model_zoo/index.rst:128 +msgid "Multiple Choice" +msgstr "" + +#: ../model_zoo/index.rst:130 +msgid "ALBERT_" +msgstr "" + +#: ../model_zoo/index.rst:130 ../model_zoo/index.rst:132 +#: ../model_zoo/index.rst:134 ../model_zoo/index.rst:136 +#: ../model_zoo/index.rst:138 ../model_zoo/index.rst:140 +#: ../model_zoo/index.rst:142 ../model_zoo/index.rst:144 +#: ../model_zoo/index.rst:146 ../model_zoo/index.rst:148 +#: ../model_zoo/index.rst:150 ../model_zoo/index.rst:152 +#: ../model_zoo/index.rst:154 ../model_zoo/index.rst:156 +#: ../model_zoo/index.rst:158 ../model_zoo/index.rst:160 +#: ../model_zoo/index.rst:162 ../model_zoo/index.rst:164 +#: ../model_zoo/index.rst:166 ../model_zoo/index.rst:168 +#: ../model_zoo/index.rst:170 ../model_zoo/index.rst:172 +#: ../model_zoo/index.rst:174 ../model_zoo/index.rst:176 +#: ../model_zoo/index.rst:178 ../model_zoo/index.rst:180 +#: ../model_zoo/index.rst:182 ../model_zoo/index.rst:184 +#: ../model_zoo/index.rst:186 ../model_zoo/index.rst:188 +#: ../model_zoo/index.rst:190 ../model_zoo/index.rst:192 +#: ../model_zoo/index.rst:194 ../model_zoo/index.rst:196 +#: ../model_zoo/index.rst:198 ../model_zoo/index.rst:200 +#: ../model_zoo/index.rst:202 ../model_zoo/index.rst:204 +#: ../model_zoo/index.rst:206 ../model_zoo/index.rst:208 +#: ../model_zoo/index.rst:210 +msgid "✅" +msgstr "" + +#: ../model_zoo/index.rst:130 ../model_zoo/index.rst:132 +#: ../model_zoo/index.rst:134 ../model_zoo/index.rst:136 +#: ../model_zoo/index.rst:138 ../model_zoo/index.rst:140 +#: ../model_zoo/index.rst:142 ../model_zoo/index.rst:144 +#: ../model_zoo/index.rst:146 ../model_zoo/index.rst:148 +#: ../model_zoo/index.rst:150 ../model_zoo/index.rst:152 +#: ../model_zoo/index.rst:154 ../model_zoo/index.rst:156 +#: ../model_zoo/index.rst:158 ../model_zoo/index.rst:160 +#: 
../model_zoo/index.rst:162 ../model_zoo/index.rst:164 +#: ../model_zoo/index.rst:166 ../model_zoo/index.rst:168 +#: ../model_zoo/index.rst:170 ../model_zoo/index.rst:172 +#: ../model_zoo/index.rst:174 ../model_zoo/index.rst:176 +#: ../model_zoo/index.rst:178 ../model_zoo/index.rst:180 +#: ../model_zoo/index.rst:182 ../model_zoo/index.rst:184 +#: ../model_zoo/index.rst:186 ../model_zoo/index.rst:188 +#: ../model_zoo/index.rst:190 ../model_zoo/index.rst:192 +#: ../model_zoo/index.rst:194 ../model_zoo/index.rst:196 +#: ../model_zoo/index.rst:198 ../model_zoo/index.rst:200 +#: ../model_zoo/index.rst:202 ../model_zoo/index.rst:204 +#: ../model_zoo/index.rst:206 ../model_zoo/index.rst:208 +#: ../model_zoo/index.rst:210 +msgid "❌" +msgstr "" + +#: ../model_zoo/index.rst:132 +msgid "BART_" +msgstr "" + +#: ../model_zoo/index.rst:134 +msgid "BERT_" +msgstr "" + +#: ../model_zoo/index.rst:136 +msgid "BigBird_" +msgstr "" + +#: ../model_zoo/index.rst:138 +msgid "Blenderbot_" +msgstr "" + +#: ../model_zoo/index.rst:140 +msgid "Blenderbot-Small_" +msgstr "" + +#: ../model_zoo/index.rst:142 +msgid "ChineseBert_" +msgstr "" + +#: ../model_zoo/index.rst:144 +msgid "ConvBert_" +msgstr "" + +#: ../model_zoo/index.rst:146 +msgid "CTRL_" +msgstr "" + +#: ../model_zoo/index.rst:148 +msgid "DistilBert_" +msgstr "" + +#: ../model_zoo/index.rst:150 +msgid "ELECTRA_" +msgstr "" + +#: ../model_zoo/index.rst:152 +msgid "ERNIE_" +msgstr "" + +#: ../model_zoo/index.rst:154 +msgid "ERNIE-CTM_" +msgstr "" + +#: ../model_zoo/index.rst:156 +msgid "ERNIE-DOC_" +msgstr "" + +#: ../model_zoo/index.rst:158 +msgid "ERNIE-GEN_" +msgstr "" + +#: ../model_zoo/index.rst:160 +msgid "ERNIE-GRAM_" +msgstr "" + +#: ../model_zoo/index.rst:162 +msgid "ERNIE-M_" +msgstr "" + +#: ../model_zoo/index.rst:164 +msgid "FNet_" +msgstr "" + +#: ../model_zoo/index.rst:166 +msgid "Funnel_" +msgstr "" + +#: ../model_zoo/index.rst:168 +msgid "GPT_" +msgstr "" + +#: ../model_zoo/index.rst:170 +msgid "LayoutLM_" +msgstr "" + +#: ../model_zoo/index.rst:172 +msgid "LayoutLMV2_" +msgstr "" + +#: ../model_zoo/index.rst:174 +msgid "LayoutXLM_" +msgstr "" + +#: ../model_zoo/index.rst:176 +msgid "Luke_" +msgstr "" + +#: ../model_zoo/index.rst:178 +msgid "MBart_" +msgstr "" + +#: ../model_zoo/index.rst:180 +msgid "MegatronBert_" +msgstr "" + +#: ../model_zoo/index.rst:182 +msgid "MobileBert_" +msgstr "" + +#: ../model_zoo/index.rst:184 +msgid "MPNet_" +msgstr "" + +#: ../model_zoo/index.rst:186 +msgid "NeZha_" +msgstr "" + +#: ../model_zoo/index.rst:188 +msgid "PPMiniLM_" +msgstr "" + +#: ../model_zoo/index.rst:190 +msgid "ProphetNet_" +msgstr "" + +#: ../model_zoo/index.rst:192 +msgid "Reformer_" +msgstr "" + +#: ../model_zoo/index.rst:194 +msgid "RemBert_" +msgstr "" + +#: ../model_zoo/index.rst:196 +msgid "RoBERTa_" +msgstr "" + +#: ../model_zoo/index.rst:198 +msgid "RoFormer_" +msgstr "" + +#: ../model_zoo/index.rst:200 +msgid "SKEP_" +msgstr "" + +#: ../model_zoo/index.rst:202 +msgid "SqueezeBert_" +msgstr "" + +#: ../model_zoo/index.rst:204 +msgid "T5_" +msgstr "" + +#: ../model_zoo/index.rst:206 +msgid "TinyBert_" +msgstr "" + +#: ../model_zoo/index.rst:208 +msgid "UnifiedTransformer_" +msgstr "" + +#: ../model_zoo/index.rst:210 +msgid "XLNet_" +msgstr "" + +#: ../model_zoo/index.rst:259 +msgid "Reference" +msgstr "" + +#: ../model_zoo/index.rst:260 +msgid "" +"部分中文预训练模型来自: `brightmart/albert_zh " +"`_, `ymcui/Chinese-BERT-wwm " +"`_, `huawei-noah/Pretrained-" +"Language-Model/TinyBERT `_, `ymcui/Chinese-XLNet " +"`_, " 
+"`huggingface/xlnet_chinese_large " +"`_, `Knover/luge-" +"dialogue `_, `huawei-noah/Pretrained-Language-Model/NEZHA-PyTorch/ " +"`_, `ZhuiyiTechnology/simbert " +"`_" +msgstr "" + +#: ../model_zoo/index.rst:269 +msgid "" +"Lan, Zhenzhong, et al. \"Albert: A lite bert for self-supervised learning" +" of language representations.\" arXiv preprint arXiv:1909.11942 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:270 +msgid "" +"Lewis, Mike, et al. \"BART: Denoising Sequence-to-Sequence Pre-training " +"for Natural Language Generation, Translation, and Comprehension.\" arXiv " +"preprint arXiv:1910.13461 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:271 +msgid "" +"Devlin, Jacob, et al. \"Bert: Pre-training of deep bidirectional " +"transformers for language understanding.\" arXiv preprint " +"arXiv:1810.04805 (2018)." +msgstr "" + +#: ../model_zoo/index.rst:272 +msgid "" +"Zaheer, Manzil, et al. \"Big bird: Transformers for longer sequences.\" " +"arXiv preprint arXiv:2007.14062 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:273 +msgid "" +"Stephon, Emily, et al. \"Blenderbot: Recipes for building an open-domain " +"chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:274 +msgid "" +"Stephon, Emily, et al. \"Blenderbot-Small: Recipes for building an open-" +"domain chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:275 +msgid "" +"Sun, Zijun, et al. \"Chinesebert: Chinese pretraining enhanced by glyph " +"and pinyin information.\" arXiv preprint arXiv:2106.16038 (2021)." +msgstr "" + +#: ../model_zoo/index.rst:276 +msgid "" +"Zhang, zhengyan, et al. \"CPM: A Large-scale Generative Chinese Pre-" +"trained Language Model.\" arXiv preprint arXiv:2012.00413 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:277 +msgid "" +"Jiang, Zihang, et al. \"ConvBERT: Improving BERT with Span-based Dynamic " +"Convolution.\" arXiv preprint arXiv:2008.02496 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:278 +msgid "" +"Nitish, Bryan, et al. \"CTRL: A Conditional Transformer Language Model " +"for Controllable Generation.\" arXiv preprint arXiv:1909.05858 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:279 +msgid "" +"Sanh, Victor, et al. \"DistilBERT, a distilled version of BERT: smaller, " +"faster, cheaper and lighter.\" arXiv preprint arXiv:1910.01108 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:280 +msgid "" +"Clark, Kevin, et al. \"Electra: Pre-training text encoders as " +"discriminators rather than generators.\" arXiv preprint arXiv:2003.10555 " +"(2020)." +msgstr "" + +#: ../model_zoo/index.rst:281 +msgid "" +"Sun, Yu, et al. \"Ernie: Enhanced representation through knowledge " +"integration.\" arXiv preprint arXiv:1904.09223 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:282 +msgid "" +"Ding, Siyu, et al. \"ERNIE-Doc: A retrospective long-document modeling " +"transformer.\" arXiv preprint arXiv:2012.15688 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:283 +msgid "" +"Xiao, Dongling, et al. \"Ernie-gen: An enhanced multi-flow pre-training " +"and fine-tuning framework for natural language generation.\" arXiv " +"preprint arXiv:2001.11314 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:284 +msgid "" +"Xiao, Dongling, et al. \"ERNIE-Gram: Pre-Training with Explicitly N-Gram " +"Masked Language Modeling for Natural Language Understanding.\" arXiv " +"preprint arXiv:2010.12148 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:285 +msgid "" +"Ouyang, Xuan, et al. 
\"ERNIE-M: enhanced multilingual representation by " +"aligning cross-lingual semantics with monolingual corpora.\" arXiv " +"preprint arXiv:2012.15674 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:286 +msgid "" +"Lee-Thorp, James, et al. \"Fnet: Mixing tokens with fourier transforms.\"" +" arXiv preprint arXiv:2105.03824 (2021)." +msgstr "" + +#: ../model_zoo/index.rst:287 +msgid "" +"Dai, Zihang, et al. \"Funnel-transformer: Filtering out sequential " +"redundancy for efficient language processing.\" Advances in neural " +"information processing systems 33 (2020): 4271-4282." +msgstr "" + +#: ../model_zoo/index.rst:288 +msgid "" +"Radford, Alec, et al. \"Language models are unsupervised multitask " +"learners.\" OpenAI blog 1.8 (2019): 9." +msgstr "" + +#: ../model_zoo/index.rst:289 +msgid "" +"Xu, Yiheng, et al. \"LayoutLM: Pre-training of Text and Layout for " +"Document Image Understanding.\" arXiv preprint arXiv:1912.13318 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:290 +msgid "" +"Xu, Yang, et al. \"LayoutLMv2: Multi-modal Pre-training for Visually-Rich" +" Document Understanding\" arXiv preprint arXiv:2012.14740 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:291 +msgid "" +"Xu, Yiheng, et al. \"LayoutXLM: Multimodal Pre-training for Multilingual " +"Visually-rich Document Understanding\" arXiv preprint arXiv:2104.08836 " +"(2021)." +msgstr "" + +#: ../model_zoo/index.rst:292 +msgid "" +"Yamada, Ikuya, et al. \"Luke: deep contextualized entity representations " +"with entity-aware self-attention.\" arXiv preprint arXiv:2010.01057 " +"(2020)." +msgstr "" + +#: ../model_zoo/index.rst:293 +msgid "" +"Liu, Yinhan, et al. \"MBart: Multilingual Denoising Pre-training for " +"Neural Machine Translation\" arXiv preprint arXiv:2001.08210 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:294 +msgid "" +"Shoeybi, Mohammad, et al. \"Megatron-lm: Training multi-billion parameter" +" language models using model parallelism.\" arXiv preprint " +"arXiv:1909.08053 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:295 +msgid "" +"Sun, Zhiqing, et al. \"MobileBERT: a Compact Task-Agnostic BERT for " +"Resource-Limited Devices\" arXiv preprint arXiv:2004.02984 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:296 +msgid "" +"Song, Kaitao, et al. \"MPNet: Masked and Permuted Pre-training for " +"Language Understanding.\" arXiv preprint arXiv:2004.09297 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:297 +msgid "" +"Wei, Junqiu, et al. \"NEZHA: Neural contextualized representation for " +"chinese language understanding.\" arXiv preprint arXiv:1909.00204 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:298 +msgid "" +"Qi, Weizhen, et al. \"Prophetnet: Predicting future n-gram for sequence-" +"to-sequence pre-training.\" arXiv preprint arXiv:2001.04063 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:299 +msgid "" +"Kitaev, Nikita, et al. \"Reformer: The efficient Transformer.\" arXiv " +"preprint arXiv:2001.04451 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:300 +msgid "" +"Chung, Hyung Won, et al. \"Rethinking embedding coupling in pre-trained " +"language models.\" arXiv preprint arXiv:2010.12821 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:301 +msgid "" +"Liu, Yinhan, et al. \"Roberta: A robustly optimized bert pretraining " +"approach.\" arXiv preprint arXiv:1907.11692 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:302 +msgid "" +"Su Jianlin, et al. \"RoFormer: Enhanced Transformer with Rotary Position " +"Embedding.\" arXiv preprint arXiv:2104.09864 (2021)." 
+msgstr "" + +#: ../model_zoo/index.rst:303 +msgid "" +"Tian, Hao, et al. \"SKEP: Sentiment knowledge enhanced pre-training for " +"sentiment analysis.\" arXiv preprint arXiv:2005.05635 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:304 +msgid "" +"Forrest, ALbert, et al. \"SqueezeBERT: What can computer vision teach NLP" +" about efficient neural networks?\" arXiv preprint arXiv:2006.11316 " +"(2020)." +msgstr "" + +#: ../model_zoo/index.rst:305 +msgid "" +"Raffel, Colin, et al. \"T5: Exploring the Limits of Transfer Learning " +"with a Unified Text-to-Text Transformer.\" arXiv preprint " +"arXiv:1910.10683 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:306 +msgid "" +"Vaswani, Ashish, et al. \"Attention is all you need.\" arXiv preprint " +"arXiv:1706.03762 (2017)." +msgstr "" + +#: ../model_zoo/index.rst:307 +msgid "" +"Jiao, Xiaoqi, et al. \"Tinybert: Distilling bert for natural language " +"understanding.\" arXiv preprint arXiv:1909.10351 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:308 +msgid "" +"Bao, Siqi, et al. \"Plato-2: Towards building an open-domain chatbot via " +"curriculum learning.\" arXiv preprint arXiv:2006.16779 (2020)." +msgstr "" + +#: ../model_zoo/index.rst:309 +msgid "" +"Yang, Zhilin, et al. \"Xlnet: Generalized autoregressive pretraining for " +"language understanding.\" arXiv preprint arXiv:1906.08237 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:310 +msgid "" +"Cui, Yiming, et al. \"Pre-training with whole word masking for chinese " +"bert.\" arXiv preprint arXiv:1906.08101 (2019)." +msgstr "" + +#: ../model_zoo/index.rst:311 +msgid "" +"Wang, Quan, et al. “Building Chinese Biomedical Language Models via " +"Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/taskflow.po b/docs/locale/en/LC_MESSAGES/model_zoo/taskflow.po new file mode 100644 index 0000000000000000000000000000000000000000..a990df4e54a587f64a87e3ddb3e17467e07f5a57 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/taskflow.po @@ -0,0 +1,796 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/taskflow.md:1 +msgid "PaddleNLP一键预测功能:Taskflow API" +msgstr "" + +#: ../model_zoo/taskflow.md:23 +msgid "特性" +msgstr "" + +#: ../model_zoo/taskflow.md:24 +msgid "PaddleNLP提供开箱即用的产业级NLP预置任务能力,无需训练,一键预测。" +msgstr "" + +#: ../model_zoo/taskflow.md:25 +msgid "最全的中文任务:覆盖自然语言理解与自然语言生成两大核心应用;" +msgstr "" + +#: ../model_zoo/taskflow.md:26 +msgid "极致的产业级效果:在多个中文场景上提供产业级的精度与预测性能;" +msgstr "" + +#: ../model_zoo/taskflow.md:27 +msgid "统一的应用范式:通过paddlenlp.Taskflow调用,简捷易用。" +msgstr "" + +#: ../model_zoo/taskflow.md:167 +msgid "QuickStart" +msgstr "" + +#: ../model_zoo/taskflow.md:169 +msgid "环境依赖" +msgstr "" + +#: ../model_zoo/taskflow.md:170 +msgid "python >= 3.6" +msgstr "" + +#: ../model_zoo/taskflow.md:171 +msgid "paddlepaddle >= 2.2.0" +msgstr "" + +#: ../model_zoo/taskflow.md:172 +msgid "paddlenlp >= 2.2.5" +msgstr "" + +#: ../model_zoo/taskflow.md:174 +msgid "taskflow1" +msgstr "" + +#: ../model_zoo/taskflow.md:176 +msgid "可进入 Jupyter Notebook 环境,在线体验 👉🏻 进入在线运行环境" +msgstr "" + +#: ../model_zoo/taskflow.md:178 +msgid "PaddleNLP Taskflow API 支持任务持续丰富中,我们将根据开发者反馈,灵活调整功能建设优先级,可通过Issue或问卷反馈给我们。" +msgstr "" + +#: ../model_zoo/taskflow.md:180 +msgid "社区交流👬" +msgstr "" + +#: ../model_zoo/taskflow.md:182 +msgid "微信扫描二维码并填写问卷之后,加入交流群领取福利" +msgstr "" + +#: ../model_zoo/taskflow.md:183 +msgid "获取5月18-19日每晚20:30《产业级通用信息抽取技术UIE+ERNIE轻量级模型》直播课链接" +msgstr "" + +#: ../model_zoo/taskflow.md:184 +msgid "10G重磅NLP学习大礼包:" +msgstr "" + +#: ../model_zoo/taskflow.md:190 +msgid "详细使用" +msgstr "" + +#: ../model_zoo/taskflow.md:192 +msgid "PART Ⅰ   一键预测" +msgstr "" + +#: ../model_zoo/taskflow.md:194 +msgid "中文分词" +msgstr "" + +#: ../model_zoo/taskflow.md:198 +msgid "三种分词模式,满足各类分词需求" +msgstr "" + +#: ../model_zoo/taskflow.md:220 ../model_zoo/taskflow.md:422 +#: ../model_zoo/taskflow.md:602 ../model_zoo/taskflow.md:1177 +#: ../model_zoo/taskflow.md:1209 +msgid "批量样本输入,平均速度更快" +msgstr "" + +#: ../model_zoo/taskflow.md:222 +msgid "输入为多个句子组成的list,平均速度会更快。" +msgstr "" + +#: ../model_zoo/taskflow.md:231 ../model_zoo/taskflow.md:369 +#: ../model_zoo/taskflow.md:542 +msgid "自定义词典" +msgstr "" + +#: ../model_zoo/taskflow.md:233 +msgid "" +"你可以通过传入user_dict参数,装载自定义词典来定制分词结果。 " +"在默认模式和精确模式下,词典文件每一行由一个或多个自定义item组成。词典文件user_dict.txt示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:240 +msgid "在快速模式下,词典文件每一行为一个自定义item+\"\\t\"+词频(词频可省略,词频省略则自动计算能保证分出该词的词频),暂时不支持黑名单词典(即通过设置”年“、”末“,以达到切分”年末“的目的)。词典文件user_dict.txt示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:246 +msgid "加载自定义词典及输出结果示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:256 +msgid "参数说明" +msgstr "" + +#: ../model_zoo/taskflow.md:257 +msgid "mode:指定分词模式,默认为None。" +msgstr "" + +#: ../model_zoo/taskflow.md:258 ../model_zoo/taskflow.md:393 +#: ../model_zoo/taskflow.md:572 ../model_zoo/taskflow.md:746 +#: ../model_zoo/taskflow.md:1073 ../model_zoo/taskflow.md:1092 +#: ../model_zoo/taskflow.md:1134 ../model_zoo/taskflow.md:1161 +#: ../model_zoo/taskflow.md:1186 ../model_zoo/taskflow.md:1217 +#: ../model_zoo/taskflow.md:1239 ../model_zoo/taskflow.md:1259 +#: ../model_zoo/taskflow.md:1278 +msgid "batch_size:批处理大小,请结合机器情况进行调整,默认为1。" +msgstr "" + +#: ../model_zoo/taskflow.md:259 +msgid 
"user_dict:自定义词典文件路径,默认为None。" +msgstr "" + +#: ../model_zoo/taskflow.md:260 ../model_zoo/taskflow.md:395 +#: ../model_zoo/taskflow.md:574 ../model_zoo/taskflow.md:753 +#: ../model_zoo/taskflow.md:1094 ../model_zoo/taskflow.md:1137 +#: ../model_zoo/taskflow.md:1162 ../model_zoo/taskflow.md:1188 +#: ../model_zoo/taskflow.md:1219 +msgid "task_path:自定义任务路径,默认为None。" +msgstr "" + +#: ../model_zoo/taskflow.md:263 +msgid "词性标注" +msgstr "" + +#: ../model_zoo/taskflow.md:267 +msgid "支持单条和批量预测" +msgstr "" + +#: ../model_zoo/taskflow.md:280 +msgid "标签集合" +msgstr "" + +#: ../model_zoo/taskflow.md:371 +msgid "你可以通过装载自定义词典来定制化分词和词性标注结果。词典文件每一行表示一个自定义item,可以由一个单词或者多个单词组成,单词后面可以添加自定义标签,格式为item/tag,如果不添加自定义标签,则使用模型默认标签n。" +msgstr "" + +#: ../model_zoo/taskflow.md:373 ../model_zoo/taskflow.md:546 +msgid "词典文件user_dict.txt示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:381 ../model_zoo/taskflow.md:561 +msgid "装载自定义词典及输出结果示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:392 ../model_zoo/taskflow.md:571 +#: ../model_zoo/taskflow.md:745 ../model_zoo/taskflow.md:1072 +#: ../model_zoo/taskflow.md:1160 ../model_zoo/taskflow.md:1185 +#: ../model_zoo/taskflow.md:1216 ../model_zoo/taskflow.md:1238 +#: ../model_zoo/taskflow.md:1258 +msgid "可配置参数说明" +msgstr "" + +#: ../model_zoo/taskflow.md:394 ../model_zoo/taskflow.md:573 +#: ../model_zoo/taskflow.md:1095 +msgid "user_dict:用户自定义词典文件,默认为None。" +msgstr "" + +#: ../model_zoo/taskflow.md:398 ../model_zoo/taskflow.md:763 +msgid "命名实体识别" +msgstr "" + +#: ../model_zoo/taskflow.md:402 +msgid "支持两种模式" +msgstr "" + +#: ../model_zoo/taskflow.md:430 +msgid "实体标签说明" +msgstr "" + +#: ../model_zoo/taskflow.md:432 +msgid "精确模式采用的标签集合" +msgstr "" + +#: ../model_zoo/taskflow.md:434 +msgid "包含66种词性及专名类别标签,标签集合如下表:" +msgstr "" + +#: ../model_zoo/taskflow.md:453 +msgid "快速模式采用的标签集合" +msgstr "" + +#: ../model_zoo/taskflow.md:544 +msgid "你可以通过装载自定义词典来定制化命名实体识别结果。词典文件每一行表示一个自定义item,可以由一个term或者多个term组成,term后面可以添加自定义标签,格式为item/tag,如果不添加自定义标签,则使用模型默认标签。" +msgstr "" + +#: ../model_zoo/taskflow.md:555 +msgid "以\"《长津湖》收尾,北美是最大海外票仓\"为例,原本的输出结果为:" +msgstr "" + +#: ../model_zoo/taskflow.md:575 +msgid "entity_only:只返回实体/概念词及其对应标签。" +msgstr "" + +#: ../model_zoo/taskflow.md:579 +msgid "依存句法分析" +msgstr "" + +#: ../model_zoo/taskflow.md:582 +msgid "支持多种形式输入" +msgstr "" + +#: ../model_zoo/taskflow.md:584 +msgid "未分词输入:" +msgstr "" + +#: ../model_zoo/taskflow.md:594 +msgid "使用分词结果来输入:" +msgstr "" + +#: ../model_zoo/taskflow.md:610 +msgid "多种模型选择,满足精度、速度需求" +msgstr "" + +#: ../model_zoo/taskflow.md:612 +msgid "使用ERNIE 1.0进行预测" +msgstr "" + +#: ../model_zoo/taskflow.md:620 +msgid "" +"除ERNIE 1.0外,还可使用ERNIE-Gram预训练模型,其中model=ddparser(基于LSTM " +"Encoder)速度最快,model=ddparser-ernie-gram-zh和model=ddparser-" +"ernie-1.0效果更优(两者效果相当)。" +msgstr "" + +#: ../model_zoo/taskflow.md:622 +msgid "输出方式" +msgstr "" + +#: ../model_zoo/taskflow.md:624 +msgid "输出概率值和词性标签:" +msgstr "" + +#: ../model_zoo/taskflow.md:632 +msgid "依存关系可视化" +msgstr "" + +#: ../model_zoo/taskflow.md:646 +msgid "依存句法分析标注关系集合" +msgstr "" + +#: ../model_zoo/taskflow.md:747 +msgid "model:选择任务使用的模型,可选有ddparser,ddparser-ernie-1.0和ddparser-ernie-gram-zh。" +msgstr "" + +#: ../model_zoo/taskflow.md:748 +msgid "tree:确保输出结果是正确的依存句法树,默认为True。" +msgstr "" + +#: ../model_zoo/taskflow.md:749 +msgid "prob:是否输出每个弧对应的概率值,默认为False。" +msgstr "" + +#: ../model_zoo/taskflow.md:750 +msgid "use_pos:是否返回词性标签,默认为False。" +msgstr "" + +#: ../model_zoo/taskflow.md:751 +msgid "use_cuda:是否使用GPU进行切词,默认为False。" +msgstr "" + +#: ../model_zoo/taskflow.md:752 +msgid 
"return_visual:是否返回句法树的可视化结果,默认为False。" +msgstr "" + +#: ../model_zoo/taskflow.md:756 +msgid "信息抽取" +msgstr "" + +#: ../model_zoo/taskflow.md:759 +msgid "开放域信息抽取是信息抽取的一种全新范式,主要思想是减少人工参与,利用单一模型支持多种类型的开放抽取任务,用户可以使用自然语言自定义抽取目标,在实体、关系类别等未定义的情况下抽取输入文本中的信息片段。" +msgstr "" + +#: ../model_zoo/taskflow.md:761 +msgid "支持多场景信息抽取任务" +msgstr "" + +#: ../model_zoo/taskflow.md:765 +msgid "" +"命名实体识别(Named Entity " +"Recognition,简称NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。" +msgstr "" + +#: ../model_zoo/taskflow.md:767 +msgid "例如抽取的目标实体类型是\"时间\"、\"选手\"和\"赛事名称\", schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:773 ../model_zoo/taskflow.md:804 +#: ../model_zoo/taskflow.md:844 ../model_zoo/taskflow.md:899 +#: ../model_zoo/taskflow.md:923 ../model_zoo/taskflow.md:959 +#: ../model_zoo/taskflow.md:990 +msgid "预测:" +msgstr "" + +#: ../model_zoo/taskflow.md:796 +msgid "例如抽取的目标实体类型是\"肿瘤的大小\"、\"肿瘤的个数\"、\"肝癌级别\"和\"脉管内癌栓分级\", schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:802 +msgid "在上例中我们已经实例化了一个Taskflow对象,这里可以通过set_schema方法重置抽取目标。" +msgstr "" + +#: ../model_zoo/taskflow.md:828 +msgid "关系抽取" +msgstr "" + +#: ../model_zoo/taskflow.md:830 +msgid "" +"关系抽取(Relation " +"Extraction,简称RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。" +msgstr "" + +#: ../model_zoo/taskflow.md:832 +msgid "例如以\"竞赛名称\"作为抽取主体,抽取关系类型为\"主办方\"、\"承办方\"和\"已举办次数\", schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:880 +msgid "事件抽取" +msgstr "" + +#: ../model_zoo/taskflow.md:882 +msgid "事件抽取 (Event Extraction, 简称EE),是指从自然语言文本中抽取预定义的事件触发词和事件要素,组合为相应的结构化信息。" +msgstr "" + +#: ../model_zoo/taskflow.md:884 +msgid "例如抽取的目标是\"地震\"事件的\"地震强度\"、\"时间\"、\"震中位置\"和\"震源深度\"这些信息,schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:897 +msgid "触发词的格式统一为XX触发词,XX表示具体事件类型,上例中的事件类型是地震,则对应触发词为地震触发词。" +msgstr "" + +#: ../model_zoo/taskflow.md:908 +msgid "评论观点抽取" +msgstr "" + +#: ../model_zoo/taskflow.md:910 +msgid "评论观点抽取,是指抽取文本中包含的评价维度、观点词。" +msgstr "" + +#: ../model_zoo/taskflow.md:912 +msgid "例如抽取的目标是文本中包含的评价维度及其对应的观点词和情感倾向,schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:951 +msgid "情感倾向分类" +msgstr "" + +#: ../model_zoo/taskflow.md:953 +msgid "句子级情感倾向分类,即判断句子的情感倾向是“正向”还是“负向”,schema构造如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:968 +msgid "跨任务抽取" +msgstr "" + +#: ../model_zoo/taskflow.md:970 +msgid "例如在法律场景同时对文本进行实体抽取和关系抽取,schema可按照如下方式进行构造:" +msgstr "" + +#: ../model_zoo/taskflow.md:1019 +msgid "多模型选择,满足精度、速度要求" +msgstr "" + +#: ../model_zoo/taskflow.md:1021 +msgid "模型选择" +msgstr "" + +#: ../model_zoo/taskflow.md:1046 +msgid "使用UIE-Tiny进行预测" +msgstr "" + +#: ../model_zoo/taskflow.md:1057 +msgid "定制训练" +msgstr "" + +#: ../model_zoo/taskflow.md:1059 +msgid "" +"对于简单的抽取目标可以直接使用paddlenlp.Taskflow实现零样本(zero-" +"shot)抽取,对于细分场景我们推荐使用定制训练(标注少量数据进行模型微调)以进一步提升效果。" +msgstr "" + +#: ../model_zoo/taskflow.md:1061 +msgid "我们在互联网、医疗、金融三大垂类自建测试集上进行了实验:" +msgstr "" + +#: ../model_zoo/taskflow.md:1070 +msgid "0-shot表示无训练数据直接通过paddlenlp.Taskflow进行预测,5-shot表示基于5条标注数据进行模型微调。" +msgstr "" + +#: ../model_zoo/taskflow.md:1074 +msgid "model:选择任务使用的模型,默认为uie-base,可选有uie-tiny,uie-base和uie-medical-base。" +msgstr "" + +#: ../model_zoo/taskflow.md:1075 +msgid "schema:定义任务抽取目标,可参考示例中对于不同信息抽取任务的schema配置自定义抽取目标。" +msgstr "" + +#: ../model_zoo/taskflow.md:1076 +msgid "position_prob:模型对于span的起始位置/终止位置的结果概率0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5,span的最终概率输出为起始位置概率和终止位置概率的乘积。" +msgstr "" + +#: ../model_zoo/taskflow.md:1079 +msgid "解语知识标注" +msgstr "" + +#: ../model_zoo/taskflow.md:1082 +msgid "词类知识标注" +msgstr "" + +#: ../model_zoo/taskflow.md:1091 ../model_zoo/taskflow.md:1133 
+msgid "可配置参数说明:" +msgstr "" + +#: ../model_zoo/taskflow.md:1093 +msgid "linking:实现基于词类的linking,默认为True。" +msgstr "" + +#: ../model_zoo/taskflow.md:1098 +msgid "知识挖掘-词类知识标注任务共包含66种词性及专名类别标签,标签集合如下表:" +msgstr "" + +#: ../model_zoo/taskflow.md:1118 +msgid "名词短语标注" +msgstr "" + +#: ../model_zoo/taskflow.md:1135 +msgid "max_seq_len:最大序列长度,默认为64。" +msgstr "" + +#: ../model_zoo/taskflow.md:1136 +msgid "linking:实现与WordTag类别标签的linking,默认为False。" +msgstr "" + +#: ../model_zoo/taskflow.md:1142 +msgid "文本纠错" +msgstr "" + +#: ../model_zoo/taskflow.md:1146 ../model_zoo/taskflow.md:1225 +#: ../model_zoo/taskflow.md:1245 +msgid "支持单条、批量预测" +msgstr "" + +#: ../model_zoo/taskflow.md:1165 +msgid "文本相似度" +msgstr "" + +#: ../model_zoo/taskflow.md:1168 +msgid "单条输入" +msgstr "" + +#: ../model_zoo/taskflow.md:1187 +msgid "max_seq_len:最大序列长度,默认为128。" +msgstr "" + +#: ../model_zoo/taskflow.md:1191 +msgid "情感倾向分析" +msgstr "" + +#: ../model_zoo/taskflow.md:1194 +msgid "支持不同模型,速度快和精度高两种模式" +msgstr "" + +#: ../model_zoo/taskflow.md:1218 +msgid "model:选择任务使用的模型,可选有bilstm和skep_ernie_1.0_large_ch。" +msgstr "" + +#: ../model_zoo/taskflow.md:1222 +msgid "生成式问答" +msgstr "" + +#: ../model_zoo/taskflow.md:1242 +msgid "智能写诗" +msgstr "" + +#: ../model_zoo/taskflow.md:1262 +msgid "开放域对话" +msgstr "" + +#: ../model_zoo/taskflow.md:1265 +msgid "非交互模式" +msgstr "" + +#: ../model_zoo/taskflow.md:1276 +msgid "可配置参数:" +msgstr "" + +#: ../model_zoo/taskflow.md:1279 +msgid "max_seq_len:最大序列长度,默认为512。" +msgstr "" + +#: ../model_zoo/taskflow.md:1281 +msgid "交互模式" +msgstr "" + +#: ../model_zoo/taskflow.md:1299 +msgid "交互模式参数:" +msgstr "" + +#: ../model_zoo/taskflow.md:1300 +msgid "max_turn:任务能记忆的对话轮数,当max_turn为1时,模型只能记住当前对话,无法获知之前的对话内容。" +msgstr "" + +#: ../model_zoo/taskflow.md:1304 +msgid "PART Ⅱ   定制化训练" +msgstr "" + +#: ../model_zoo/taskflow.md:1308 +msgid "如果你有自己的业务数据集,可以对模型效果进一步调优,支持定制化训练的任务如下:" +msgstr "" + +#: ../model_zoo/taskflow.md:1397 +msgid "这里我们以命名实体识别Taskflow(\"ner\", mode=\"accurate\")为例,展示如何定制自己的模型。" +msgstr "" + +#: ../model_zoo/taskflow.md:1399 +msgid "调用Taskflow接口后,程序自动将相关文件下载到$HOME/.paddlenlp/taskflow/wordtag/,该默认路径包含以下文件:" +msgstr "" + +#: ../model_zoo/taskflow.md:1408 +msgid "参考上表中对应示例准备数据集和标签文件tags.txt,执行相应训练脚本得到自己的model_state.pdparams和model_config.json。" +msgstr "" + +#: ../model_zoo/taskflow.md:1410 +msgid "根据自己数据集情况,修改标签文件tags.txt。" +msgstr "" + +#: ../model_zoo/taskflow.md:1412 +msgid "将以上文件保存到任意路径中,自定义路径下的文件需要和默认路径的文件一致:" +msgstr "" + +#: ../model_zoo/taskflow.md:1420 +msgid "通过task_path指定自定义路径,使用Taskflow加载自定义模型进行一键预测:" +msgstr "" + +#: ../model_zoo/taskflow.md:1428 +msgid "模型算法" +msgstr "" + +#: ../model_zoo/taskflow.md:1456 +msgid "FAQ" +msgstr "" + +#: ../model_zoo/taskflow.md:1460 +msgid "" +"A: " +"Taskflow默认会将任务相关模型等文件保存到$HOME/.paddlenlp下,可以在任务初始化的时候通过home_path自定义修改保存路径。示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:1466 +msgid "通过以上方式即可将ner任务相关文件保存至/workspace路径下。" +msgstr "" + +#: ../model_zoo/taskflow.md:1472 +msgid "" +"A: " +"Taskflow默认会将任务相关模型等文件保存到$HOME/.paddlenlp/taskflow下,如果下载或调用失败,可删除相应路径下的文件,重新尝试即可" +msgstr "" + +#: ../model_zoo/taskflow.md:1478 +msgid "A: 可以结合设备情况适当调整batch_size,采用批量输入的方式来提升平均速率。示例:" +msgstr "" + +#: ../model_zoo/taskflow.md:1489 +msgid "通过上述方式进行分词可以大幅提升预测速度。" +msgstr "" + +#: ../model_zoo/taskflow.md:1495 +msgid "A: Taskflow支持任务持续丰富中,我们将根据开发者反馈,灵活调整功能建设优先级,可通过Issue或问卷反馈给我们。" +msgstr "" + +#: ../model_zoo/taskflow.md:1500 +msgid "附录" +msgstr "" + +#: ../model_zoo/taskflow.md:1504 +msgid "fxsjy/jieba" +msgstr "" + +#: ../model_zoo/taskflow.md:1505 +msgid "ZhuiyiTechnology/simbert" 
+msgstr "" + +#: ../model_zoo/taskflow.md:1506 +msgid "CPM: A Large-scale Generative Chinese Pre-trained Language Model" +msgstr "" + +#~ msgid "PaddleNLP Taskflow" +#~ msgstr "" + +#~ msgid "介绍" +#~ msgstr "" + +#~ msgid "任务清单" +#~ msgstr "" + +#~ msgid "用法" +#~ msgstr "" + +#~ msgid "查看使用示例" +#~ msgstr "" + +#~ msgid "句法分析" +#~ msgstr "" + +#~ msgid "情感分析" +#~ msgstr "" + +#~ msgid "『解语』-词类知识标注" +#~ msgstr "" + +#~ msgid "『解语』-名词短语标注" +#~ msgstr "" + +#~ msgid "自定义任务" +#~ msgstr "" + +#~ msgid "paddlenlp.Taskflow提供开箱即用的NLP预置任务,覆盖自然语言理解与自然语言生成两大核心应用,在中文场景上提供产业级的效果与极致的预测性能。" +#~ msgstr "" + +#~ msgid "随着版本迭代会持续开放更多的应用场景。" +#~ msgstr "" + +#~ msgid "安装" +#~ msgstr "" + +#~ msgid "paddlenlp >= 2.2.0" +#~ msgstr "" + +#~ msgid "支持三种模式分词" +#~ msgstr "" + +#~ msgid "Base模式(默认)" +#~ msgstr "" + +#~ msgid "快速模式" +#~ msgstr "" + +#~ msgid "利用『结巴』中文分词工具,实现文本快速切分。" +#~ msgstr "" + +#~ msgid "精确模式" +#~ msgstr "" + +#~ msgid "试图将句子中的实体词完整切分,分词精确度高。" +#~ msgstr "" + +#~ msgid "快速模式词典载入方式:" +#~ msgstr "" + +#~ msgid "用户可以在词典文件每一行有两个部分:词语、词频(可省略),用空格隔开。词频省略则自动计算能保证分出该词的词频。" +#~ msgstr "" + +#~ msgid "\"国家卫健委修订完成了新冠肺炎诊疗方案\"原本的输出结果为:" +#~ msgstr "" + +#~ msgid "Base、精确模式词典载入方式:" +#~ msgstr "" + +#~ msgid "词典文件每一行表示一个自定义item。" +#~ msgstr "" + +#~ msgid "以默认模型为例,\"平原上的火焰计划于年末上映\"原本的输出结果为:" +#~ msgstr "" + +#~ msgid "标签集合:" +#~ msgstr "" + +#~ msgid "用户可以通过装载自定义词典来定制化分词和词性标注结果。词典文件每一行表示一个自定义item,可以由一个单词或者多个单词组成,单词后面可以添加自定义标签,格式为item/tag,如果不添加自定义标签,则使用模型默认标签。" +#~ msgstr "" + +#~ msgid "以\"赛里木湖是新疆海拔最高的高山湖泊\"为例,原本的输出结果为:" +#~ msgstr "" + +#~ msgid "精确模式(默认)" +#~ msgstr "" + +#~ msgid "只返回实体/概念词:" +#~ msgstr "" + +#~ msgid "entity_only:是否返回所有词性标签;若设置为True,则只返回实体/概念词;默认为False。" +#~ msgstr "" + +#~ msgid "使用ddparser-ernie-1.0进行预测:" +#~ msgstr "" + +#~ msgid "依存关系可视化:" +#~ msgstr "" + +#~ msgid "标注关系说明:" +#~ msgstr "" + +#~ msgid "使用BiLSTM模型:" +#~ msgstr "" + +#~ msgid "使用SKEP情感分析预训练模型进行预测:" +#~ msgstr "" + +#~ msgid "知识挖掘-词类知识标注" +#~ msgstr "" + +#~ msgid "知识挖掘-词类知识标注任务共包含66种词性及专名类别标签,标签集合如下表" +#~ msgstr "" + +#~ msgid "知识挖掘-名词短语标注" +#~ msgstr "" + +#~ msgid "非交互模式:" +#~ msgstr "" + +#~ msgid "交互模式:" +#~ msgstr "" + +#~ msgid "交互模式下,Taskflow具备多轮对话记忆功能。" +#~ msgstr "" + +#~ msgid "max_turn:仅在交互模式有效,表示任务能记忆的对话轮数;当max_turn为1时,模型只能记住当前对话,无法获知之前的对话内容。" +#~ msgstr "" + +#~ msgid "Taskflow提供了定制接口来使用自己的数据对模型进行微调/训练,适配任务如下:" +#~ msgstr "" + +#~ msgid "定制任务示例" +#~ msgstr "" + +#~ msgid "任务的默认路径为$HOME/.paddlenlp/taskflow/ner/wordtag/,该默认路径包含以下文件:" +#~ msgstr "" + +#~ msgid "参考表中对应示例准备数据集和标签文件tags.txt,执行相应训练脚本得到自己的model_state.pdparams和model_config.json。" +#~ msgstr "" + +#~ msgid "通过task_path指定用户自定义路径,自定义路径下的文件需要和默认路径的文件一致:" +#~ msgstr "" + +#~ msgid "使用Taskflow加载自定义模型进行一键预测:" +#~ msgstr "" + +#~ msgid "Q1 Taskflow如何修改任务保存路径?" +#~ msgstr "" + +#~ msgid "" +#~ "A: " +#~ "Taskflow默认会将任务相关模型等文件保存到$HOME/.paddlenlp下,可以在任务初始化的时候通过home_path自定义修改保存路径。" +#~ msgstr "" + +#~ msgid "示例:" +#~ msgstr "" + +#~ msgid "参考资料" +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers.po new file mode 100644 index 0000000000000000000000000000000000000000..4b7022a1515c94fe5f5d34cea98ee912e36dfcfa --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers.po @@ -0,0 +1,1913 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../model_zoo/transformers.rst:2 +msgid "PaddleNLP Transformer API" +msgstr "" + +#: ../model_zoo/transformers.rst:4 +msgid "" +"随着深度学习的发展,NLP领域涌现了一大批高质量的Transformer类预训练模型,多次刷新各种NLP任务SOTA(State of the " +"Art)。 PaddleNLP为用户提供了常用的 " +"``BERT``、``ERNIE``、``ALBERT``、``RoBERTa``、``XLNet`` 等经典结构预训练模型, " +"让开发者能够方便快捷应用各类Transformer预训练模型及其下游任务。" +msgstr "" + +#: ../model_zoo/transformers.rst:10 +msgid "Transformer预训练模型汇总" +msgstr "" + +#: ../model_zoo/transformers.rst:14 +msgid "" +"下表汇总了介绍了目前PaddleNLP支持的各类预训练模型以及对应预训练权重。我们目前提供了 **32** 种网络结构, **136** " +"种预训练的参数权重供用户使用, 其中包含了 **59** 种中文语言模型的预训练权重。" +msgstr "" + +#: ../model_zoo/transformers.rst:18 ../model_zoo/transformers.rst:655 +msgid "Model" +msgstr "" + +#: ../model_zoo/transformers.rst:18 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers.rst:18 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers.rst:18 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers.rst:20 ../model_zoo/transformers.rst:657 +msgid "ALBERT_" +msgstr "" + +#: ../model_zoo/transformers.rst:20 +msgid "``albert-base-v1``" +msgstr "" + +#: ../model_zoo/transformers.rst:20 ../model_zoo/transformers.rst:24 +#: ../model_zoo/transformers.rst:28 ../model_zoo/transformers.rst:32 +#: ../model_zoo/transformers.rst:36 ../model_zoo/transformers.rst:40 +#: ../model_zoo/transformers.rst:44 ../model_zoo/transformers.rst:48 +#: ../model_zoo/transformers.rst:76 ../model_zoo/transformers.rst:80 +#: ../model_zoo/transformers.rst:84 ../model_zoo/transformers.rst:88 +#: ../model_zoo/transformers.rst:92 ../model_zoo/transformers.rst:96 +#: ../model_zoo/transformers.rst:148 ../model_zoo/transformers.rst:196 +#: ../model_zoo/transformers.rst:200 ../model_zoo/transformers.rst:204 +#: ../model_zoo/transformers.rst:208 ../model_zoo/transformers.rst:212 +#: ../model_zoo/transformers.rst:216 ../model_zoo/transformers.rst:220 +#: ../model_zoo/transformers.rst:224 ../model_zoo/transformers.rst:228 +#: ../model_zoo/transformers.rst:232 ../model_zoo/transformers.rst:236 +#: ../model_zoo/transformers.rst:241 ../model_zoo/transformers.rst:246 +#: ../model_zoo/transformers.rst:252 ../model_zoo/transformers.rst:256 +#: ../model_zoo/transformers.rst:260 ../model_zoo/transformers.rst:264 +#: ../model_zoo/transformers.rst:296 ../model_zoo/transformers.rst:300 +#: ../model_zoo/transformers.rst:304 ../model_zoo/transformers.rst:312 +#: ../model_zoo/transformers.rst:316 ../model_zoo/transformers.rst:320 +#: ../model_zoo/transformers.rst:324 ../model_zoo/transformers.rst:342 +#: ../model_zoo/transformers.rst:346 ../model_zoo/transformers.rst:350 +#: ../model_zoo/transformers.rst:354 ../model_zoo/transformers.rst:358 +#: ../model_zoo/transformers.rst:362 ../model_zoo/transformers.rst:366 +#: ../model_zoo/transformers.rst:370 ../model_zoo/transformers.rst:378 +#: ../model_zoo/transformers.rst:382 ../model_zoo/transformers.rst:386 +#: ../model_zoo/transformers.rst:390 ../model_zoo/transformers.rst:394 +#: ../model_zoo/transformers.rst:398 ../model_zoo/transformers.rst:402 +#: ../model_zoo/transformers.rst:406 ../model_zoo/transformers.rst:411 +#: ../model_zoo/transformers.rst:416 
../model_zoo/transformers.rst:421 +#: ../model_zoo/transformers.rst:425 ../model_zoo/transformers.rst:445 +#: ../model_zoo/transformers.rst:448 ../model_zoo/transformers.rst:467 +#: ../model_zoo/transformers.rst:471 ../model_zoo/transformers.rst:475 +#: ../model_zoo/transformers.rst:479 ../model_zoo/transformers.rst:527 +#: ../model_zoo/transformers.rst:531 ../model_zoo/transformers.rst:540 +#: ../model_zoo/transformers.rst:545 ../model_zoo/transformers.rst:550 +#: ../model_zoo/transformers.rst:554 ../model_zoo/transformers.rst:558 +#: ../model_zoo/transformers.rst:562 ../model_zoo/transformers.rst:566 +#: ../model_zoo/transformers.rst:570 ../model_zoo/transformers.rst:574 +#: ../model_zoo/transformers.rst:579 ../model_zoo/transformers.rst:584 +#: ../model_zoo/transformers.rst:589 ../model_zoo/transformers.rst:628 +#: ../model_zoo/transformers.rst:632 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers.rst:20 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model" +msgstr "" + +#: ../model_zoo/transformers.rst:24 +msgid "``albert-large-v1``" +msgstr "" + +#: ../model_zoo/transformers.rst:24 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model" +msgstr "" + +#: ../model_zoo/transformers.rst:28 +msgid "``albert-xlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers.rst:28 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. ALBERT xlarge model" +msgstr "" + +#: ../model_zoo/transformers.rst:32 +msgid "``albert-xxlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers.rst:32 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. ALBERT xxlarge model" +msgstr "" + +#: ../model_zoo/transformers.rst:36 +msgid "``albert-base-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:36 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model (version2)" +msgstr "" + +#: ../model_zoo/transformers.rst:40 +msgid "``albert-large-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:40 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model (version2)" +msgstr "" + +#: ../model_zoo/transformers.rst:44 +msgid "``albert-xlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:44 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. ALBERT xlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers.rst:48 +msgid "``albert-xxlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:48 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. 
ALBERT xxlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers.rst:52 +msgid "``albert-chinese-tiny``" +msgstr "" + +#: ../model_zoo/transformers.rst:52 ../model_zoo/transformers.rst:56 +#: ../model_zoo/transformers.rst:60 ../model_zoo/transformers.rst:64 +#: ../model_zoo/transformers.rst:68 ../model_zoo/transformers.rst:72 +#: ../model_zoo/transformers.rst:112 ../model_zoo/transformers.rst:117 +#: ../model_zoo/transformers.rst:123 ../model_zoo/transformers.rst:129 +#: ../model_zoo/transformers.rst:133 ../model_zoo/transformers.rst:137 +#: ../model_zoo/transformers.rst:154 ../model_zoo/transformers.rst:159 +#: ../model_zoo/transformers.rst:164 ../model_zoo/transformers.rst:169 +#: ../model_zoo/transformers.rst:173 ../model_zoo/transformers.rst:268 +#: ../model_zoo/transformers.rst:272 ../model_zoo/transformers.rst:276 +#: ../model_zoo/transformers.rst:280 ../model_zoo/transformers.rst:284 +#: ../model_zoo/transformers.rst:288 ../model_zoo/transformers.rst:292 +#: ../model_zoo/transformers.rst:308 ../model_zoo/transformers.rst:329 +#: ../model_zoo/transformers.rst:333 ../model_zoo/transformers.rst:337 +#: ../model_zoo/transformers.rst:374 ../model_zoo/transformers.rst:429 +#: ../model_zoo/transformers.rst:433 ../model_zoo/transformers.rst:437 +#: ../model_zoo/transformers.rst:441 ../model_zoo/transformers.rst:451 +#: ../model_zoo/transformers.rst:456 ../model_zoo/transformers.rst:461 +#: ../model_zoo/transformers.rst:464 ../model_zoo/transformers.rst:483 +#: ../model_zoo/transformers.rst:487 ../model_zoo/transformers.rst:491 +#: ../model_zoo/transformers.rst:495 ../model_zoo/transformers.rst:499 +#: ../model_zoo/transformers.rst:503 ../model_zoo/transformers.rst:507 +#: ../model_zoo/transformers.rst:511 ../model_zoo/transformers.rst:515 +#: ../model_zoo/transformers.rst:519 ../model_zoo/transformers.rst:523 +#: ../model_zoo/transformers.rst:535 ../model_zoo/transformers.rst:594 +#: ../model_zoo/transformers.rst:599 ../model_zoo/transformers.rst:604 +#: ../model_zoo/transformers.rst:608 ../model_zoo/transformers.rst:612 +#: ../model_zoo/transformers.rst:616 ../model_zoo/transformers.rst:620 +#: ../model_zoo/transformers.rst:624 ../model_zoo/transformers.rst:636 +#: ../model_zoo/transformers.rst:640 ../model_zoo/transformers.rst:644 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers.rst:52 +msgid "" +"4 repeating layers, 128 embedding, 312-hidden, 12-heads, 4M parameters. " +"ALBERT tiny model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:56 +msgid "``albert-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:56 +msgid "" +"6 repeating layers, 128 embedding, 384-hidden, 12-heads, _M parameters. " +"ALBERT small model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:60 +msgid "``albert-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:60 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 12M parameters." +" ALBERT base model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:64 +msgid "``albert-chinese-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:64 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 18M " +"parameters. ALBERT large model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:68 +msgid "``albert-chinese-xlarge``" +msgstr "" + +#: ../model_zoo/transformers.rst:68 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 60M " +"parameters. 
ALBERT xlarge model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:72 +msgid "``albert-chinese-xxlarge``" +msgstr "" + +#: ../model_zoo/transformers.rst:72 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 16-heads, 235M " +"parameters. ALBERT xxlarge model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers.rst:76 ../model_zoo/transformers.rst:659 +msgid "BART_" +msgstr "" + +#: ../model_zoo/transformers.rst:76 +msgid "``bart-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:76 +msgid "12-layer, 768-hidden, 12-heads, 217M parameters. BART base model (English)" +msgstr "" + +#: ../model_zoo/transformers.rst:80 +msgid "``bart-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:80 +msgid "" +"24-layer, 768-hidden, 16-heads, 509M parameters. BART large model " +"(English)." +msgstr "" + +#: ../model_zoo/transformers.rst:84 ../model_zoo/transformers.rst:661 +msgid "BERT_" +msgstr "" + +#: ../model_zoo/transformers.rst:84 +msgid "``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:84 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:88 +msgid "``bert-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:88 ../model_zoo/transformers.rst:304 +#: ../model_zoo/transformers.rst:320 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:92 +msgid "``bert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:92 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on cased English" +" text." +msgstr "" + +#: ../model_zoo/transformers.rst:96 +msgid "``bert-large-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:96 +msgid "" +"24-layer, 1024-hidden, 16-heads, 335M parameters. Trained on cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:100 +msgid "``bert-base-multilingual-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:100 ../model_zoo/transformers.rst:106 +#: ../model_zoo/transformers.rst:141 +msgid "Multilingual" +msgstr "" + +#: ../model_zoo/transformers.rst:100 +msgid "" +"12-layer, 768-hidden, 12-heads, 168M parameters. Trained on lower-cased " +"text in the top 102 languages with the largest Wikipedias." +msgstr "" + +#: ../model_zoo/transformers.rst:106 +msgid "``bert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:106 +msgid "" +"12-layer, 768-hidden, 12-heads, 179M parameters. Trained on cased text in" +" the top 104 languages with the largest Wikipedias." +msgstr "" + +#: ../model_zoo/transformers.rst:112 +msgid "``bert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:112 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text." +msgstr "" + +#: ../model_zoo/transformers.rst:117 +msgid "``bert-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:117 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers.rst:123 +msgid "``bert-wwm-ext-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:123 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking with extented " +"data." 
+msgstr "" + +#: ../model_zoo/transformers.rst:129 +msgid "``junnyu/ckiplab-bert-base-chinese-ner``" +msgstr "" + +#: ../model_zoo/transformers.rst:129 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on NER task." +msgstr "" + +#: ../model_zoo/transformers.rst:133 +msgid "``junnyu/ckiplab-bert-base-chinese-pos``" +msgstr "" + +#: ../model_zoo/transformers.rst:133 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on POS task." +msgstr "" + +#: ../model_zoo/transformers.rst:137 +msgid "``junnyu/ckiplab-bert-base-chinese-ws``" +msgstr "" + +#: ../model_zoo/transformers.rst:137 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on WS task." +msgstr "" + +#: ../model_zoo/transformers.rst:141 +msgid "``junnyu/nlptown-bert-base-multilingual-uncased-sentiment``" +msgstr "" + +#: ../model_zoo/transformers.rst:141 +msgid "" +"12-layer, 768-hidden, 12-heads, 167M parameters. Finetuned for sentiment " +"analysis on product reviews in six languages: English, Dutch, German, " +"French, Spanish and Italian." +msgstr "" + +#: ../model_zoo/transformers.rst:148 +msgid "``junnyu/tbs17-MathBERT``" +msgstr "" + +#: ../model_zoo/transformers.rst:148 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on pre-k to " +"graduate math language (English) using a masked language modeling (MLM) " +"objective." +msgstr "" + +#: ../model_zoo/transformers.rst:154 +msgid "``macbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:154 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers.rst:159 +msgid "``macbert-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:159 +msgid "" +"24-layer, 1024-hidden, 16-heads, 326M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers.rst:164 +msgid "``simbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:164 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on 22 million " +"pairs of similar sentences crawed from Baidu Know." +msgstr "" + +#: ../model_zoo/transformers.rst:169 +msgid "``Langboat/mengzi-bert-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:169 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 300G Chinese " +"Corpus Datasets." +msgstr "" + +#: ../model_zoo/transformers.rst:173 +msgid "``Langboat/mengzi-bert-base-fin``" +msgstr "" + +#: ../model_zoo/transformers.rst:173 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 20G Finacial " +"Corpus, based on ``Langboat/mengzi-bert-base``." +msgstr "" + +#: ../model_zoo/transformers.rst:178 +msgid "BERT-Japanese_" +msgstr "" + +#: ../model_zoo/transformers.rst:178 +msgid "``iverxin/bert-base-japanese``" +msgstr "" + +#: ../model_zoo/transformers.rst:178 ../model_zoo/transformers.rst:182 +#: ../model_zoo/transformers.rst:187 ../model_zoo/transformers.rst:191 +msgid "Japanese" +msgstr "" + +#: ../model_zoo/transformers.rst:178 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text." +msgstr "" + +#: ../model_zoo/transformers.rst:182 +msgid "``iverxin/bert-base-japanese-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers.rst:182 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on Japanese text" +" using Whole-Word-Masking." 
+msgstr "" + +#: ../model_zoo/transformers.rst:187 +msgid "``iverxin/bert-base-japanese-char``" +msgstr "" + +#: ../model_zoo/transformers.rst:187 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text." +msgstr "" + +#: ../model_zoo/transformers.rst:191 +msgid "``iverxin/bert-base-japanese-char-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers.rst:191 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers.rst:196 ../model_zoo/transformers.rst:663 +msgid "BigBird_" +msgstr "" + +#: ../model_zoo/transformers.rst:196 +msgid "``bigbird-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:196 +msgid "" +"12-layer, 768-hidden, 12-heads, 127M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:200 ../model_zoo/transformers.rst:665 +msgid "Blenderbot_" +msgstr "" + +#: ../model_zoo/transformers.rst:200 +msgid "``blenderbot-3B``" +msgstr "" + +#: ../model_zoo/transformers.rst:200 +msgid "26-layer, 32-heads, 3B parameters. The Blenderbot base model." +msgstr "" + +#: ../model_zoo/transformers.rst:204 +msgid "``blenderbot-400M-distill``" +msgstr "" + +#: ../model_zoo/transformers.rst:204 +msgid "" +"14-layer, 384-hidden, 32-heads, 400M parameters. The Blenderbot distil " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:208 +msgid "``blenderbot-1B-distill``" +msgstr "" + +#: ../model_zoo/transformers.rst:208 +msgid "14-layer, 32-heads, 1478M parameters. The Blenderbot Distil 1B model." +msgstr "" + +#: ../model_zoo/transformers.rst:212 ../model_zoo/transformers.rst:667 +msgid "Blenderbot-Small_" +msgstr "" + +#: ../model_zoo/transformers.rst:212 +msgid "``blenderbot_small-90M``" +msgstr "" + +#: ../model_zoo/transformers.rst:212 +msgid "16-layer, 16-heads, 90M parameters. The Blenderbot small model." +msgstr "" + +#: ../model_zoo/transformers.rst:216 ../model_zoo/transformers.rst:669 +msgid "ConvBert_" +msgstr "" + +#: ../model_zoo/transformers.rst:216 +msgid "``convbert-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:216 +msgid "12-layer, 768-hidden, 12-heads, 106M parameters. The ConvBERT base model." +msgstr "" + +#: ../model_zoo/transformers.rst:220 +msgid "``convbert-medium-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:220 +msgid "" +"12-layer, 384-hidden, 8-heads, 17M parameters. The ConvBERT medium small " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:224 +msgid "``convbert-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:224 +msgid "12-layer, 128-hidden, 4-heads, 13M parameters. The ConvBERT small model." +msgstr "" + +#: ../model_zoo/transformers.rst:228 ../model_zoo/transformers.rst:671 +msgid "CTRL_" +msgstr "" + +#: ../model_zoo/transformers.rst:228 +msgid "``ctrl``" +msgstr "" + +#: ../model_zoo/transformers.rst:228 +msgid "48-layer, 1280-hidden, 16-heads, 1701M parameters. The CTRL base model." +msgstr "" + +#: ../model_zoo/transformers.rst:232 +msgid "``sshleifer-tiny-ctrl``" +msgstr "" + +#: ../model_zoo/transformers.rst:232 +msgid "2-layer, 16-hidden, 2-heads, 5M parameters. The Tiny CTRL model." +msgstr "" + +#: ../model_zoo/transformers.rst:236 ../model_zoo/transformers.rst:673 +msgid "DistilBert_" +msgstr "" + +#: ../model_zoo/transformers.rst:236 +msgid "``distilbert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:236 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. 
The DistilBERT model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:241 +msgid "``distilbert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:241 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:246 +msgid "``distilbert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:246 +msgid "" +"6-layer, 768-hidden, 12-heads, 200M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:252 +msgid "``sshleifer-tiny-distilbert-base-uncase-finetuned-sst-2-english``" +msgstr "" + +#: ../model_zoo/transformers.rst:252 +msgid "2-layer, 2-hidden, 2-heads, 50K parameters. The DistilBERT model" +msgstr "" + +#: ../model_zoo/transformers.rst:256 ../model_zoo/transformers.rst:675 +msgid "ELECTRA_" +msgstr "" + +#: ../model_zoo/transformers.rst:256 +msgid "``electra-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:256 +msgid "" +"12-layer, 768-hidden, 4-heads, 14M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:260 +msgid "``electra-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:260 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:264 +msgid "``electra-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:264 +msgid "" +"24-layer, 1024-hidden, 16-heads, 334M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:268 +msgid "``chinese-electra-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:268 +msgid "12-layer, 768-hidden, 4-heads, 12M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:272 +msgid "``chinese-electra-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:272 ../model_zoo/transformers.rst:487 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:276 +msgid "``junnyu/hfl-chinese-electra-180g-base-discriminator``" +msgstr "" + +#: ../model_zoo/transformers.rst:276 +msgid "" +"Discriminator, 12-layer, 768-hidden, 12-heads, 102M parameters. Trained " +"on 180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:280 +msgid "``junnyu/hfl-chinese-electra-180g-small-ex-discriminator``" +msgstr "" + +#: ../model_zoo/transformers.rst:280 +msgid "" +"Discriminator, 24-layer, 256-hidden, 4-heads, 24M parameters. Trained on " +"180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:284 +msgid "``junnyu/hfl-chinese-legal-electra-small-generator``" +msgstr "" + +#: ../model_zoo/transformers.rst:284 +msgid "" +"Generator, 12-layer, 64-hidden, 1-heads, 3M parameters. Trained on " +"Chinese legal corpus." +msgstr "" + +#: ../model_zoo/transformers.rst:288 ../model_zoo/transformers.rst:677 +msgid "ERNIE_" +msgstr "" + +#: ../model_zoo/transformers.rst:288 +msgid "``ernie-3.0-medium-zh``" +msgstr "" + +#: ../model_zoo/transformers.rst:288 ../model_zoo/transformers.rst:308 +#: ../model_zoo/transformers.rst:329 ../model_zoo/transformers.rst:429 +#: ../model_zoo/transformers.rst:604 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." 
+msgstr "" + +#: ../model_zoo/transformers.rst:292 +msgid "``ernie-tiny``" +msgstr "" + +#: ../model_zoo/transformers.rst:292 +msgid "3-layer, 1024-hidden, 16-heads, _M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:296 +msgid "``ernie-2.0-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:296 ../model_zoo/transformers.rst:312 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:300 +msgid "``ernie-2.0-en-finetuned-squad``" +msgstr "" + +#: ../model_zoo/transformers.rst:300 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on finetuned " +"squad text." +msgstr "" + +#: ../model_zoo/transformers.rst:304 +msgid "``ernie-2.0-large-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:308 ../model_zoo/transformers.rst:679 +msgid "ERNIE-DOC_" +msgstr "" + +#: ../model_zoo/transformers.rst:308 +msgid "``ernie-doc-base-zh``" +msgstr "" + +#: ../model_zoo/transformers.rst:312 +msgid "``ernie-doc-base-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:316 ../model_zoo/transformers.rst:681 +msgid "ERNIE-GEN_" +msgstr "" + +#: ../model_zoo/transformers.rst:316 +msgid "``ernie-gen-base-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:316 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers.rst:320 +msgid "``ernie-gen-large-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:324 +msgid "``ernie-gen-large-en-430g``" +msgstr "" + +#: ../model_zoo/transformers.rst:324 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text. with extended data (430 GB)." +msgstr "" + +#: ../model_zoo/transformers.rst:329 ../model_zoo/transformers.rst:683 +msgid "ERNIE-GRAM_" +msgstr "" + +#: ../model_zoo/transformers.rst:329 +msgid "``ernie-gram-zh``" +msgstr "" + +#: ../model_zoo/transformers.rst:333 ../model_zoo/transformers.rst:685 +msgid "GPT_" +msgstr "" + +#: ../model_zoo/transformers.rst:333 +msgid "``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers.rst:333 +msgid "32-layer, 2560-hidden, 32-heads, 2.6B parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:337 +msgid "``gpt-cpm-small-cn-distill``" +msgstr "" + +#: ../model_zoo/transformers.rst:337 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. The model distilled from" +" the GPT model ``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers.rst:342 +msgid "``gpt2-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:342 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:346 +msgid "``gpt2-medium-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:346 +msgid "24-layer, 1024-hidden, 16-heads, 345M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:350 +msgid "``gpt2-large-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:350 ../model_zoo/transformers.rst:370 +msgid "36-layer, 1280-hidden, 20-heads, 774M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:354 +msgid "``gpt2-xl-en``" +msgstr "" + +#: ../model_zoo/transformers.rst:354 +msgid "" +"48-layer, 1600-hidden, 25-heads, 1558M parameters. Trained on English " +"text." 
+msgstr "" + +#: ../model_zoo/transformers.rst:358 +msgid "``junnyu/distilgpt2``" +msgstr "" + +#: ../model_zoo/transformers.rst:358 +msgid "6-layer, 768-hidden, 12-heads, 81M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:362 +msgid "``junnyu/microsoft-DialoGPT-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:362 ../model_zoo/transformers.rst:467 +msgid "12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:366 +msgid "``junnyu/microsoft-DialoGPT-medium``" +msgstr "" + +#: ../model_zoo/transformers.rst:366 +msgid "24-layer, 1024-hidden, 16-heads, 354M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:370 +msgid "``junnyu/microsoft-DialoGPT-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:374 +msgid "``junnyu/uer-gpt2-chinese-poem``" +msgstr "" + +#: ../model_zoo/transformers.rst:374 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on Chinese " +"poetry corpus." +msgstr "" + +#: ../model_zoo/transformers.rst:378 ../model_zoo/transformers.rst:687 +msgid "LayoutLM_" +msgstr "" + +#: ../model_zoo/transformers.rst:378 +msgid "``layoutlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:378 +msgid "" +"12-layer, 768-hidden, 12-heads, 339M parameters. LayoutLm base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:382 +msgid "``layoutlm-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:382 +msgid "" +"24-layer, 1024-hidden, 16-heads, 51M parameters. LayoutLm large Uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:386 ../model_zoo/transformers.rst:689 +msgid "LayoutLMV2_" +msgstr "" + +#: ../model_zoo/transformers.rst:386 +msgid "``layoutlmv2-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:386 +msgid "" +"12-layer, 768-hidden, 12-heads, 200M parameters. LayoutLmv2 base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:390 +msgid "``layoutlmv2-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:390 +msgid "" +"24-layer, 1024-hidden, 16-heads, _M parameters. LayoutLmv2 large uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:394 ../model_zoo/transformers.rst:691 +msgid "LayoutXLM_" +msgstr "" + +#: ../model_zoo/transformers.rst:394 +msgid "``layoutxlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:394 +msgid "" +"12-layer, 768-hidden, 12-heads, 369M parameters. Layoutxlm base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:398 +msgid "MBart_" +msgstr "" + +#: ../model_zoo/transformers.rst:398 +msgid "``mbart-large-cc25``" +msgstr "" + +#: ../model_zoo/transformers.rst:398 +msgid "" +"12-layer, 1024-hidden, 12-heads, 1123M parameters. The ``mbart-large-" +"cc25`` model." +msgstr "" + +#: ../model_zoo/transformers.rst:402 +msgid "``mbart-large-en-ro``" +msgstr "" + +#: ../model_zoo/transformers.rst:402 +msgid "" +"12-layer, 768-hidden, 16-heads, 1123M parameters. The ``mbart-large rn-" +"ro`` model ." +msgstr "" + +#: ../model_zoo/transformers.rst:406 +msgid "``mbart-large-50-one-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers.rst:406 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-one-" +"to-many-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers.rst:411 +msgid "``mbart-large-50-many-to-one-mmt``" +msgstr "" + +#: ../model_zoo/transformers.rst:411 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. 
``mbart-large-50-many-" +"to-one-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers.rst:416 +msgid "``mbart-large-50-many-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers.rst:416 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-many-" +"to-many-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers.rst:421 +msgid "Mobilebert_" +msgstr "" + +#: ../model_zoo/transformers.rst:421 +msgid "``mobilebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:421 +msgid "24-layer, 512-hidden, 4-heads, 24M parameters. Mobilebert uncased Model." +msgstr "" + +#: ../model_zoo/transformers.rst:425 ../model_zoo/transformers.rst:697 +msgid "MPNet_" +msgstr "" + +#: ../model_zoo/transformers.rst:425 +msgid "``mpnet-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:425 +msgid "12-layer, 768-hidden, 12-heads, 109M parameters. MPNet Base Model." +msgstr "" + +#: ../model_zoo/transformers.rst:429 ../model_zoo/transformers.rst:699 +msgid "NeZha_" +msgstr "" + +#: ../model_zoo/transformers.rst:429 +msgid "``nezha-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:433 +msgid "``nezha-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:433 ../model_zoo/transformers.rst:441 +msgid "24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:437 +msgid "``nezha-base-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:437 +msgid "12-layer, 768-hidden, 16-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:441 +msgid "``nezha-large-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers.rst:445 +msgid "Reformer_" +msgstr "" + +#: ../model_zoo/transformers.rst:445 +msgid "``reformer-enwik8``" +msgstr "" + +#: ../model_zoo/transformers.rst:445 +msgid "12-layer, 1024-hidden, 8-heads, 148M parameters." +msgstr "" + +#: ../model_zoo/transformers.rst:448 +msgid "``reformer-crime-and-punishment``" +msgstr "" + +#: ../model_zoo/transformers.rst:448 +msgid "6-layer, 256-hidden, 2-heads, 3M parameters." +msgstr "" + +#: ../model_zoo/transformers.rst:451 ../model_zoo/transformers.rst:703 +msgid "RoBERTa_" +msgstr "" + +#: ../model_zoo/transformers.rst:451 +msgid "``roberta-wwm-ext``" +msgstr "" + +#: ../model_zoo/transformers.rst:451 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on English Text " +"using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers.rst:456 +msgid "``roberta-wwm-ext-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:456 +msgid "" +"24-layer, 1024-hidden, 16-heads, 325M parameters. Trained on English Text" +" using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers.rst:461 +msgid "``rbt3``" +msgstr "" + +#: ../model_zoo/transformers.rst:461 +msgid "3-layer, 768-hidden, 12-heads, 38M parameters." +msgstr "" + +#: ../model_zoo/transformers.rst:464 +msgid "``rbtl3``" +msgstr "" + +#: ../model_zoo/transformers.rst:464 +msgid "3-layer, 1024-hidden, 16-heads, 61M parameters." +msgstr "" + +#: ../model_zoo/transformers.rst:467 +msgid "``nosaydomore/deepset-roberta-base-squad2``" +msgstr "" + +#: ../model_zoo/transformers.rst:471 +msgid "``nosaydomore/roberta-en-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:471 +msgid "12-layer, 768-hidden, 12-heads, 163M parameters. Trained on English text." 
+msgstr "" + +#: ../model_zoo/transformers.rst:475 +msgid "``nosaydomore/roberta-en-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:475 +msgid "24-layer, 1024-hidden, 16-heads, 408M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:479 +msgid "``nosaydomore/sshleifei-tiny-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:479 +msgid "2-layer, 2-hidden, 2-heads, 0.25M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers.rst:483 +msgid "``nosaydomore/uer-roberta-base-chn-extractive-qa``" +msgstr "" + +#: ../model_zoo/transformers.rst:483 ../model_zoo/transformers.rst:491 +msgid "12-layer, 768-hidden, 12-heads, 101M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:487 +msgid "``nosaydomore/uer-roberta-base-ft-chinanews-chn``" +msgstr "" + +#: ../model_zoo/transformers.rst:491 +msgid "``nosaydomore/uer-roberta-base-ft-cluener2020-chn``" +msgstr "" + +#: ../model_zoo/transformers.rst:495 ../model_zoo/transformers.rst:705 +msgid "RoFormer_" +msgstr "" + +#: ../model_zoo/transformers.rst:495 +msgid "``roformer-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:495 +msgid "" +"6-layer, 384-hidden, 6-heads, 30M parameters. Roformer Small Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:499 +msgid "``roformer-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:499 +msgid "" +"12-layer, 768-hidden, 12-heads, 124M parameters. Roformer Base Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:503 +msgid "``roformer-chinese-char-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:503 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Small" +" model." +msgstr "" + +#: ../model_zoo/transformers.rst:507 +msgid "``roformer-chinese-char-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:507 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char " +"Base model." +msgstr "" + +#: ../model_zoo/transformers.rst:511 +msgid "``roformer-chinese-sim-char-ft-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:511 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Ft " +"Small model." +msgstr "" + +#: ../model_zoo/transformers.rst:515 +msgid "``roformer-chinese-sim-char-ft-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:515 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char Ft " +"Base model." +msgstr "" + +#: ../model_zoo/transformers.rst:519 +msgid "``roformer-chinese-sim-char-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:519 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Sim Char " +"Small model." +msgstr "" + +#: ../model_zoo/transformers.rst:523 +msgid "``roformer-chinese-sim-char-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:523 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Sim Char" +" Base model." +msgstr "" + +#: ../model_zoo/transformers.rst:527 +msgid "``roformer-english-small-discriminator``" +msgstr "" + +#: ../model_zoo/transformers.rst:527 +msgid "" +"12-layer, 256-hidden, 4-heads, 13M parameters. Roformer English Small " +"Discriminator." +msgstr "" + +#: ../model_zoo/transformers.rst:531 +msgid "``roformer-english-small-generator``" +msgstr "" + +#: ../model_zoo/transformers.rst:531 +msgid "" +"12-layer, 64-hidden, 1-heads, 5M parameters. Roformer English Small " +"Generator." 
+msgstr "" + +#: ../model_zoo/transformers.rst:535 ../model_zoo/transformers.rst:707 +msgid "SKEP_" +msgstr "" + +#: ../model_zoo/transformers.rst:535 +msgid "``skep_ernie_1.0_large_ch``" +msgstr "" + +#: ../model_zoo/transformers.rst:535 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_1.0``" +msgstr "" + +#: ../model_zoo/transformers.rst:540 +msgid "``skep_ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers.rst:540 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers.rst:545 +msgid "``skep_roberta_large_en``" +msgstr "" + +#: ../model_zoo/transformers.rst:545 +msgid "" +"24-layer, 1024-hidden, 16-heads, 355M parameters. Trained using the " +"RoBERTa model ``roberta_large_en``" +msgstr "" + +#: ../model_zoo/transformers.rst:550 ../model_zoo/transformers.rst:709 +msgid "SqueezeBert_" +msgstr "" + +#: ../model_zoo/transformers.rst:550 +msgid "``squeezebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:550 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Uncased model." +msgstr "" + +#: ../model_zoo/transformers.rst:554 +msgid "``squeezebert-mnli``" +msgstr "" + +#: ../model_zoo/transformers.rst:554 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli model." +msgstr "" + +#: ../model_zoo/transformers.rst:558 +msgid "``squeezebert-mnli-headless``" +msgstr "" + +#: ../model_zoo/transformers.rst:558 +msgid "" +"12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli Headless" +" model." +msgstr "" + +#: ../model_zoo/transformers.rst:562 ../model_zoo/transformers.rst:711 +msgid "T5_" +msgstr "" + +#: ../model_zoo/transformers.rst:562 +msgid "``t5-small``" +msgstr "" + +#: ../model_zoo/transformers.rst:562 +msgid "6-layer, 512-hidden, 8-heads, 93M parameters. T5 small model." +msgstr "" + +#: ../model_zoo/transformers.rst:566 +msgid "``t5-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:566 +msgid "12-layer, 768-hidden, 12-heads, 272M parameters. T5 base model." +msgstr "" + +#: ../model_zoo/transformers.rst:570 +msgid "``t5-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:570 +msgid "24-layer, 1024-hidden, 16-heads, 803M parameters. T5 large model." +msgstr "" + +#: ../model_zoo/transformers.rst:574 ../model_zoo/transformers.rst:713 +msgid "TinyBert_" +msgstr "" + +#: ../model_zoo/transformers.rst:574 +msgid "``tinybert-4l-312d``" +msgstr "" + +#: ../model_zoo/transformers.rst:574 ../model_zoo/transformers.rst:584 +#: ../model_zoo/transformers.rst:594 +msgid "" +"4-layer, 312-hidden, 12-heads, 14.5M parameters. The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:579 +msgid "``tinybert-6l-768d``" +msgstr "" + +#: ../model_zoo/transformers.rst:579 ../model_zoo/transformers.rst:589 +#: ../model_zoo/transformers.rst:599 +msgid "" +"6-layer, 768-hidden, 12-heads, 67M parameters. 
The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers.rst:584 +msgid "``tinybert-4l-312d-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:589 +msgid "``tinybert-6l-768d-v2``" +msgstr "" + +#: ../model_zoo/transformers.rst:594 +msgid "``tinybert-4l-312d-zh``" +msgstr "" + +#: ../model_zoo/transformers.rst:599 +msgid "``tinybert-6l-768d-zh``" +msgstr "" + +#: ../model_zoo/transformers.rst:604 ../model_zoo/transformers.rst:715 +msgid "UnifiedTransformer_" +msgstr "" + +#: ../model_zoo/transformers.rst:604 +msgid "``unified_transformer-12L-cn``" +msgstr "" + +#: ../model_zoo/transformers.rst:608 +msgid "``unified_transformer-12L-cn-luge``" +msgstr "" + +#: ../model_zoo/transformers.rst:608 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text " +"(LUGE.ai)." +msgstr "" + +#: ../model_zoo/transformers.rst:612 +msgid "``plato-mini``" +msgstr "" + +#: ../model_zoo/transformers.rst:612 +msgid "6-layer, 768-hidden, 12-heads, 66M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers.rst:616 +msgid "UNIMO_" +msgstr "" + +#: ../model_zoo/transformers.rst:616 +msgid "``unimo-text-1.0``" +msgstr "" + +#: ../model_zoo/transformers.rst:616 +msgid "12-layer, 768-hidden, 12-heads, 99M parameters. UNIMO-text-1.0 model." +msgstr "" + +#: ../model_zoo/transformers.rst:620 +msgid "``unimo-text-1.0-lcsts-new``" +msgstr "" + +#: ../model_zoo/transformers.rst:620 +msgid "" +"12-layer, 768-hidden, 12-heads, 99M parameters. Finetuned on lcsts_new " +"dataset." +msgstr "" + +#: ../model_zoo/transformers.rst:624 +msgid "``unimo-text-1.0-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:624 +msgid "" +"24-layer, 768-hidden, 16-heads, 316M parameters. UNIMO-text-1.0 large " +"model." +msgstr "" + +#: ../model_zoo/transformers.rst:628 ../model_zoo/transformers.rst:717 +msgid "XLNet_" +msgstr "" + +#: ../model_zoo/transformers.rst:628 +msgid "``xlnet-base-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:628 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model" +msgstr "" + +#: ../model_zoo/transformers.rst:632 +msgid "``xlnet-large-cased``" +msgstr "" + +#: ../model_zoo/transformers.rst:632 +msgid "" +"24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English " +"model" +msgstr "" + +#: ../model_zoo/transformers.rst:636 +msgid "``chinese-xlnet-base``" +msgstr "" + +#: ../model_zoo/transformers.rst:636 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. XLNet Chinese model" +msgstr "" + +#: ../model_zoo/transformers.rst:640 +msgid "``chinese-xlnet-mid``" +msgstr "" + +#: ../model_zoo/transformers.rst:640 +msgid "" +"24-layer, 768-hidden, 12-heads, 209M parameters. XLNet Medium Chinese " +"model" +msgstr "" + +#: ../model_zoo/transformers.rst:644 +msgid "``chinese-xlnet-large``" +msgstr "" + +#: ../model_zoo/transformers.rst:644 +msgid "24-layer, 1024-hidden, 16-heads, _M parameters. 
XLNet Large Chinese model" +msgstr "" + +#: ../model_zoo/transformers.rst:652 +msgid "Transformer预训练模型适用任务汇总" +msgstr "" + +#: ../model_zoo/transformers.rst:655 +msgid "Sequence Classification" +msgstr "" + +#: ../model_zoo/transformers.rst:655 +msgid "Token Classification" +msgstr "" + +#: ../model_zoo/transformers.rst:655 +msgid "Question Answering" +msgstr "" + +#: ../model_zoo/transformers.rst:655 +msgid "Text Generation" +msgstr "" + +#: ../model_zoo/transformers.rst:655 +msgid "Multiple Choice" +msgstr "" + +#: ../model_zoo/transformers.rst:657 ../model_zoo/transformers.rst:659 +#: ../model_zoo/transformers.rst:661 ../model_zoo/transformers.rst:663 +#: ../model_zoo/transformers.rst:665 ../model_zoo/transformers.rst:667 +#: ../model_zoo/transformers.rst:669 ../model_zoo/transformers.rst:671 +#: ../model_zoo/transformers.rst:673 ../model_zoo/transformers.rst:675 +#: ../model_zoo/transformers.rst:677 ../model_zoo/transformers.rst:679 +#: ../model_zoo/transformers.rst:681 ../model_zoo/transformers.rst:683 +#: ../model_zoo/transformers.rst:685 ../model_zoo/transformers.rst:687 +#: ../model_zoo/transformers.rst:689 ../model_zoo/transformers.rst:691 +#: ../model_zoo/transformers.rst:693 ../model_zoo/transformers.rst:695 +#: ../model_zoo/transformers.rst:697 ../model_zoo/transformers.rst:699 +#: ../model_zoo/transformers.rst:701 ../model_zoo/transformers.rst:703 +#: ../model_zoo/transformers.rst:705 ../model_zoo/transformers.rst:707 +#: ../model_zoo/transformers.rst:709 ../model_zoo/transformers.rst:711 +#: ../model_zoo/transformers.rst:713 ../model_zoo/transformers.rst:715 +#: ../model_zoo/transformers.rst:717 +msgid "✅" +msgstr "" + +#: ../model_zoo/transformers.rst:657 ../model_zoo/transformers.rst:659 +#: ../model_zoo/transformers.rst:661 ../model_zoo/transformers.rst:663 +#: ../model_zoo/transformers.rst:665 ../model_zoo/transformers.rst:667 +#: ../model_zoo/transformers.rst:671 ../model_zoo/transformers.rst:673 +#: ../model_zoo/transformers.rst:675 ../model_zoo/transformers.rst:677 +#: ../model_zoo/transformers.rst:679 ../model_zoo/transformers.rst:681 +#: ../model_zoo/transformers.rst:683 ../model_zoo/transformers.rst:685 +#: ../model_zoo/transformers.rst:687 ../model_zoo/transformers.rst:689 +#: ../model_zoo/transformers.rst:691 ../model_zoo/transformers.rst:693 +#: ../model_zoo/transformers.rst:695 ../model_zoo/transformers.rst:697 +#: ../model_zoo/transformers.rst:699 ../model_zoo/transformers.rst:701 +#: ../model_zoo/transformers.rst:703 ../model_zoo/transformers.rst:705 +#: ../model_zoo/transformers.rst:707 ../model_zoo/transformers.rst:709 +#: ../model_zoo/transformers.rst:711 ../model_zoo/transformers.rst:713 +#: ../model_zoo/transformers.rst:715 ../model_zoo/transformers.rst:717 +msgid "❌" +msgstr "" + +#: ../model_zoo/transformers.rst:693 +msgid "Mbart_" +msgstr "" + +#: ../model_zoo/transformers.rst:695 +msgid "MobileBert_" +msgstr "" + +#: ../model_zoo/transformers.rst:701 +msgid "ReFormer_" +msgstr "" + +#: ../model_zoo/transformers.rst:756 +msgid "预训练模型使用方法" +msgstr "" + +#: ../model_zoo/transformers.rst:758 +msgid "" +"PaddleNLP Transformer API在提丰富预训练模型的同时,也降低了用户的使用门槛。 " +"使用Auto模块,可以加载不同网络结构的预训练模型,无需查找 模型对应的类别。只需十几行代码,用户即可完成模型加载和下游任务Fine-" +"tuning。" +msgstr "" + +#: ../model_zoo/transformers.rst:797 +msgid "" +"上面的代码给出使用预训练模型的简要示例,更完整详细的示例代码, 可以参考:`使用预训练模型Fine-tune完成中文文本分类任务 " +"`_" +msgstr "" + +#: ../model_zoo/transformers.rst:800 +msgid "加载数据集:PaddleNLP内置了多种数据集,用户可以一键导入所需的数据集。" +msgstr "" + +#: ../model_zoo/transformers.rst:801 +msgid "" 
+"加载预训练模型:PaddleNLP的预训练模型可以很容易地通过 ``from_pretrained()`` 方法加载。 " +"Auto模块(包括AutoModel, AutoTokenizer, 及各种下游任务类)提供了方便易用的接口, " +"无需指定类别,即可调用不同网络结构的预训练模型。 第一个参数是汇总表中对应的 ``Pretrained Weight``,可加载对应的预训练权重。" +" ``AutoModelForSequenceClassification`` 初始化 ``__init__`` 所需的其他参数,如 " +"``num_classes`` 等, 也是通过 ``from_pretrained()`` 传入。``Tokenizer`` 使用同样的 " +"``from_pretrained`` 方法加载。" +msgstr "" + +#: ../model_zoo/transformers.rst:807 +msgid "通过 ``Dataset`` 的 ``map`` 函数,使用 ``tokenizer`` 将 ``dataset`` 从原始文本处理成模型的输入。" +msgstr "" + +#: ../model_zoo/transformers.rst:808 +msgid "定义 ``BatchSampler`` 和 ``DataLoader``,shuffle数据、组合Batch。" +msgstr "" + +#: ../model_zoo/transformers.rst:809 +msgid "定义训练所需的优化器,loss函数等,就可以开始进行模型fine-tune任务。" +msgstr "" + +#: ../model_zoo/transformers.rst:813 +msgid "Reference" +msgstr "" + +#: ../model_zoo/transformers.rst:814 +msgid "" +"部分中文预训练模型来自: `brightmart/albert_zh " +"`_, `ymcui/Chinese-BERT-wwm " +"`_, `huawei-noah/Pretrained-" +"Language-Model/TinyBERT `_, `ymcui/Chinese-XLNet " +"`_, " +"`huggingface/xlnet_chinese_large " +"`_, `Knover/luge-" +"dialogue `_, `huawei-noah/Pretrained-Language-Model/NEZHA-PyTorch/ " +"`_ `ZhuiyiTechnology/simbert " +"`_" +msgstr "" + +#: ../model_zoo/transformers.rst:823 +msgid "" +"Lan, Zhenzhong, et al. \"Albert: A lite bert for self-supervised learning" +" of language representations.\" arXiv preprint arXiv:1909.11942 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:824 +msgid "" +"Lewis, Mike, et al. \"BART: Denoising Sequence-to-Sequence Pre-training " +"for Natural Language Generation, Translation, and Comprehension.\" arXiv " +"preprint arXiv:1910.13461 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:825 +msgid "" +"Devlin, Jacob, et al. \"Bert: Pre-training of deep bidirectional " +"transformers for language understanding.\" arXiv preprint " +"arXiv:1810.04805 (2018)." +msgstr "" + +#: ../model_zoo/transformers.rst:826 +msgid "" +"Zaheer, Manzil, et al. \"Big bird: Transformers for longer sequences.\" " +"arXiv preprint arXiv:2007.14062 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:827 +msgid "" +"Stephon, Emily, et al. \"Blenderbot: Recipes for building an open-domain " +"chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:828 +msgid "" +"Stephon, Emily, et al. \"Blenderbot-Small: Recipes for building an open-" +"domain chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:829 +msgid "" +"Jiang, Zihang, et al. \"ConvBERT: Improving BERT with Span-based Dynamic " +"Convolution.\" arXiv preprint arXiv:2008.02496 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:830 +msgid "" +"Nitish, Bryan, et al. \"CTRL: A Conditional Transformer Language Model " +"for Controllable Generation.\" arXiv preprint arXiv:1909.05858 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:831 +msgid "" +"Sanh, Victor, et al. \"DistilBERT, a distilled version of BERT: smaller, " +"faster, cheaper and lighter.\" arXiv preprint arXiv:1910.01108 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:832 +msgid "" +"Clark, Kevin, et al. \"Electra: Pre-training text encoders as " +"discriminators rather than generators.\" arXiv preprint arXiv:2003.10555 " +"(2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:833 +msgid "" +"Sun, Yu, et al. \"Ernie: Enhanced representation through knowledge " +"integration.\" arXiv preprint arXiv:1904.09223 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:834 +msgid "" +"Xiao, Dongling, et al. 
\"Ernie-gen: An enhanced multi-flow pre-training " +"and fine-tuning framework for natural language generation.\" arXiv " +"preprint arXiv:2001.11314 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:835 +msgid "" +"Xiao, Dongling, et al. \"ERNIE-Gram: Pre-Training with Explicitly N-Gram " +"Masked Language Modeling for Natural Language Understanding.\" arXiv " +"preprint arXiv:2010.12148 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:836 +msgid "" +"Radford, Alec, et al. \"Language models are unsupervised multitask " +"learners.\" OpenAI blog 1.8 (2019): 9." +msgstr "" + +#: ../model_zoo/transformers.rst:837 +msgid "" +"Xu, Yiheng, et al. \"LayoutLM: Pre-training of Text and Layout for " +"Document Image Understanding.\" arXiv preprint arXiv:1912.13318 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:838 +msgid "" +"Xu, Yang, et al. \"LayoutLMv2: Multi-modal Pre-training for Visually-Rich" +" Document Understanding\" arXiv preprint arXiv:2012.14740 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:839 +msgid "" +"Xu, Yiheng, et al. \"LayoutXLM: Multimodal Pre-training for Multilingual " +"Visually-rich Document Understanding\" arXiv preprint arXiv:2104.08836 " +"(2021)." +msgstr "" + +#: ../model_zoo/transformers.rst:840 +msgid "" +"Liu, Yinhan, et al. \"MBart: Multilingual Denoising Pre-training for " +"Neural Machine Translation\" arXiv preprint arXiv:2001.08210 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:841 +msgid "" +"Sun, Zhiqing, et al. \"MobileBERT: a Compact Task-Agnostic BERT for " +"Resource-Limited Devices\" arXiv preprint arXiv:2004.02984 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:842 +msgid "" +"Song, Kaitao, et al. \"MPNet: Masked and Permuted Pre-training for " +"Language Understanding.\" arXiv preprint arXiv:2004.09297 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:843 +msgid "" +"Wei, Junqiu, et al. \"NEZHA: Neural contextualized representation for " +"chinese language understanding.\" arXiv preprint arXiv:1909.00204 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:844 +msgid "" +"Kitaev, Nikita, et al. \"Reformer: The efficient Transformer.\" arXiv " +"preprint arXiv:2001.04451 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:845 +msgid "" +"Liu, Yinhan, et al. \"Roberta: A robustly optimized bert pretraining " +"approach.\" arXiv preprint arXiv:1907.11692 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:846 +msgid "" +"Su Jianlin, et al. \"RoFormer: Enhanced Transformer with Rotary Position " +"Embedding.\" arXiv preprint arXiv:2104.09864 (2021)." +msgstr "" + +#: ../model_zoo/transformers.rst:847 +msgid "" +"Tian, Hao, et al. \"SKEP: Sentiment knowledge enhanced pre-training for " +"sentiment analysis.\" arXiv preprint arXiv:2005.05635 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:848 +msgid "" +"Forrest, ALbert, et al. \"SqueezeBERT: What can computer vision teach NLP" +" about efficient neural networks?\" arXiv preprint arXiv:2006.11316 " +"(2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:849 +msgid "" +"Raffel, Colin, et al. \"T5: Exploring the Limits of Transfer Learning " +"with a Unified Text-to-Text Transformer.\" arXiv preprint " +"arXiv:1910.10683 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:850 +msgid "" +"Vaswani, Ashish, et al. \"Attention is all you need.\" arXiv preprint " +"arXiv:1706.03762 (2017)." +msgstr "" + +#: ../model_zoo/transformers.rst:851 +msgid "" +"Jiao, Xiaoqi, et al. 
\"Tinybert: Distilling bert for natural language " +"understanding.\" arXiv preprint arXiv:1909.10351 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:852 +msgid "" +"Bao, Siqi, et al. \"Plato-2: Towards building an open-domain chatbot via " +"curriculum learning.\" arXiv preprint arXiv:2006.16779 (2020)." +msgstr "" + +#: ../model_zoo/transformers.rst:853 +msgid "" +"Yang, Zhilin, et al. \"Xlnet: Generalized autoregressive pretraining for " +"language understanding.\" arXiv preprint arXiv:1906.08237 (2019)." +msgstr "" + +#: ../model_zoo/transformers.rst:854 +msgid "" +"Cui, Yiming, et al. \"Pre-training with whole word masking for chinese " +"bert.\" arXiv preprint arXiv:1906.08101 (2019)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ALBERT/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ALBERT/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..93a869bda14b22a7f42492b266f8977a330c396f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ALBERT/contents.po @@ -0,0 +1,199 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ALBERT/contents.rst:5 +msgid "ALBERT模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ALBERT模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:15 +msgid "``albert-base-v1``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:15 +#: ../model_zoo/transformers/ALBERT/contents.rst:19 +#: ../model_zoo/transformers/ALBERT/contents.rst:23 +#: ../model_zoo/transformers/ALBERT/contents.rst:27 +#: ../model_zoo/transformers/ALBERT/contents.rst:31 +#: ../model_zoo/transformers/ALBERT/contents.rst:35 +#: ../model_zoo/transformers/ALBERT/contents.rst:39 +#: ../model_zoo/transformers/ALBERT/contents.rst:43 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:15 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:19 +msgid "``albert-large-v1``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:19 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:23 +msgid "``albert-xlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:23 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. 
ALBERT xlarge model" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:27 +msgid "``albert-xxlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:27 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. ALBERT xxlarge model" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:31 +msgid "``albert-base-v2``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:31 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model (version2)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:35 +msgid "``albert-large-v2``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:35 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model (version2)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:39 +msgid "``albert-xlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:39 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. ALBERT xlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:43 +msgid "``albert-xxlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:43 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. ALBERT xxlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:47 +msgid "``albert-chinese-tiny``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:47 +#: ../model_zoo/transformers/ALBERT/contents.rst:51 +#: ../model_zoo/transformers/ALBERT/contents.rst:55 +#: ../model_zoo/transformers/ALBERT/contents.rst:59 +#: ../model_zoo/transformers/ALBERT/contents.rst:63 +#: ../model_zoo/transformers/ALBERT/contents.rst:67 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:47 +msgid "" +"4 repeating layers, 128 embedding, 312-hidden, 12-heads, 4M parameters. " +"ALBERT tiny model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:51 +msgid "``albert-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:51 +msgid "" +"6 repeating layers, 128 embedding, 384-hidden, 12-heads, _M parameters. " +"ALBERT small model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:55 +msgid "``albert-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:55 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 12M parameters." +" ALBERT base model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:59 +msgid "``albert-chinese-large``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:59 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 18M " +"parameters. ALBERT large model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:63 +msgid "``albert-chinese-xlarge``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:63 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 60M " +"parameters. ALBERT xlarge model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:67 +msgid "``albert-chinese-xxlarge``" +msgstr "" + +#: ../model_zoo/transformers/ALBERT/contents.rst:67 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 16-heads, 235M " +"parameters. 
ALBERT xxlarge model (Chinese)" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BART/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BART/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..2c780422a18973cbe7345853e5a4edd485cfab50 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BART/contents.po @@ -0,0 +1,62 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/BART/contents.rst:5 +msgid "BART模型汇总" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的BART模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:15 +msgid "``bart-base``" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:15 +#: ../model_zoo/transformers/BART/contents.rst:19 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 217M parameters. BART base model (English)" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:19 +msgid "``bart-large``" +msgstr "" + +#: ../model_zoo/transformers/BART/contents.rst:19 +msgid "" +"24-layer, 768-hidden, 16-heads, 509M parameters. BART large model " +"(English)" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BERT/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BERT/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..8aa42f1927335d6b74f0f5fd1920be423b33a138 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BERT/contents.po @@ -0,0 +1,1670 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/BERT/contents.rst:5 +msgid "BERT模型汇总" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的BERT模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:15 +msgid "``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:15 +#: ../model_zoo/transformers/BERT/contents.rst:19 +#: ../model_zoo/transformers/BERT/contents.rst:23 +#: ../model_zoo/transformers/BERT/contents.rst:27 +#: ../model_zoo/transformers/BERT/contents.rst:79 +#: ../model_zoo/transformers/BERT/contents.rst:109 +#: ../model_zoo/transformers/BERT/contents.rst:124 +#: ../model_zoo/transformers/BERT/contents.rst:133 +#: ../model_zoo/transformers/BERT/contents.rst:136 +#: ../model_zoo/transformers/BERT/contents.rst:139 +#: ../model_zoo/transformers/BERT/contents.rst:145 +#: ../model_zoo/transformers/BERT/contents.rst:148 +#: ../model_zoo/transformers/BERT/contents.rst:154 +#: ../model_zoo/transformers/BERT/contents.rst:157 +#: ../model_zoo/transformers/BERT/contents.rst:163 +#: ../model_zoo/transformers/BERT/contents.rst:166 +#: ../model_zoo/transformers/BERT/contents.rst:175 +#: ../model_zoo/transformers/BERT/contents.rst:178 +#: ../model_zoo/transformers/BERT/contents.rst:181 +#: ../model_zoo/transformers/BERT/contents.rst:184 +#: ../model_zoo/transformers/BERT/contents.rst:187 +#: ../model_zoo/transformers/BERT/contents.rst:193 +#: ../model_zoo/transformers/BERT/contents.rst:196 +#: ../model_zoo/transformers/BERT/contents.rst:199 +#: ../model_zoo/transformers/BERT/contents.rst:214 +#: ../model_zoo/transformers/BERT/contents.rst:220 +#: ../model_zoo/transformers/BERT/contents.rst:226 +#: ../model_zoo/transformers/BERT/contents.rst:235 +#: ../model_zoo/transformers/BERT/contents.rst:238 +#: ../model_zoo/transformers/BERT/contents.rst:241 +#: ../model_zoo/transformers/BERT/contents.rst:247 +#: ../model_zoo/transformers/BERT/contents.rst:262 +#: ../model_zoo/transformers/BERT/contents.rst:265 +#: ../model_zoo/transformers/BERT/contents.rst:268 +#: ../model_zoo/transformers/BERT/contents.rst:271 +#: ../model_zoo/transformers/BERT/contents.rst:277 +#: ../model_zoo/transformers/BERT/contents.rst:280 +#: ../model_zoo/transformers/BERT/contents.rst:283 +#: ../model_zoo/transformers/BERT/contents.rst:286 +#: ../model_zoo/transformers/BERT/contents.rst:289 +#: ../model_zoo/transformers/BERT/contents.rst:292 +#: ../model_zoo/transformers/BERT/contents.rst:295 +#: ../model_zoo/transformers/BERT/contents.rst:298 +#: ../model_zoo/transformers/BERT/contents.rst:301 +#: ../model_zoo/transformers/BERT/contents.rst:313 +#: ../model_zoo/transformers/BERT/contents.rst:316 +#: ../model_zoo/transformers/BERT/contents.rst:319 +#: ../model_zoo/transformers/BERT/contents.rst:322 +#: ../model_zoo/transformers/BERT/contents.rst:328 +#: 
../model_zoo/transformers/BERT/contents.rst:331 +#: ../model_zoo/transformers/BERT/contents.rst:334 +#: ../model_zoo/transformers/BERT/contents.rst:337 +#: ../model_zoo/transformers/BERT/contents.rst:340 +#: ../model_zoo/transformers/BERT/contents.rst:346 +#: ../model_zoo/transformers/BERT/contents.rst:352 +#: ../model_zoo/transformers/BERT/contents.rst:355 +#: ../model_zoo/transformers/BERT/contents.rst:358 +#: ../model_zoo/transformers/BERT/contents.rst:376 +#: ../model_zoo/transformers/BERT/contents.rst:379 +#: ../model_zoo/transformers/BERT/contents.rst:382 +#: ../model_zoo/transformers/BERT/contents.rst:394 +#: ../model_zoo/transformers/BERT/contents.rst:400 +#: ../model_zoo/transformers/BERT/contents.rst:418 +#: ../model_zoo/transformers/BERT/contents.rst:427 +#: ../model_zoo/transformers/BERT/contents.rst:445 +#: ../model_zoo/transformers/BERT/contents.rst:472 +#: ../model_zoo/transformers/BERT/contents.rst:481 +#: ../model_zoo/transformers/BERT/contents.rst:490 +#: ../model_zoo/transformers/BERT/contents.rst:493 +#: ../model_zoo/transformers/BERT/contents.rst:496 +#: ../model_zoo/transformers/BERT/contents.rst:499 +#: ../model_zoo/transformers/BERT/contents.rst:505 +#: ../model_zoo/transformers/BERT/contents.rst:511 +#: ../model_zoo/transformers/BERT/contents.rst:514 +#: ../model_zoo/transformers/BERT/contents.rst:523 +#: ../model_zoo/transformers/BERT/contents.rst:526 +#: ../model_zoo/transformers/BERT/contents.rst:529 +#: ../model_zoo/transformers/BERT/contents.rst:532 +#: ../model_zoo/transformers/BERT/contents.rst:538 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:19 +msgid "``bert-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:23 +msgid "``bert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:23 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on cased English" +" text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:27 +msgid "``bert-large-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:27 +msgid "" +"24-layer, 1024-hidden, 16-heads, 335M parameters. Trained on cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:31 +msgid "``bert-base-multilingual-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:31 +#: ../model_zoo/transformers/BERT/contents.rst:37 +#: ../model_zoo/transformers/BERT/contents.rst:72 +#: ../model_zoo/transformers/BERT/contents.rst:121 +#: ../model_zoo/transformers/BERT/contents.rst:421 +msgid "Multilingual" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:31 +msgid "" +"12-layer, 768-hidden, 12-heads, 168M parameters. Trained on lower-cased " +"text in the top 102 languages with the largest Wikipedias." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:37 +msgid "``bert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:37 +msgid "" +"12-layer, 768-hidden, 12-heads, 179M parameters. Trained on cased text in" +" the top 104 languages with the largest Wikipedias." 
+msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:43 +msgid "``bert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:43 +#: ../model_zoo/transformers/BERT/contents.rst:48 +#: ../model_zoo/transformers/BERT/contents.rst:54 +#: ../model_zoo/transformers/BERT/contents.rst:60 +#: ../model_zoo/transformers/BERT/contents.rst:64 +#: ../model_zoo/transformers/BERT/contents.rst:68 +#: ../model_zoo/transformers/BERT/contents.rst:85 +#: ../model_zoo/transformers/BERT/contents.rst:90 +#: ../model_zoo/transformers/BERT/contents.rst:95 +#: ../model_zoo/transformers/BERT/contents.rst:100 +#: ../model_zoo/transformers/BERT/contents.rst:104 +#: ../model_zoo/transformers/BERT/contents.rst:130 +#: ../model_zoo/transformers/BERT/contents.rst:256 +#: ../model_zoo/transformers/BERT/contents.rst:349 +#: ../model_zoo/transformers/BERT/contents.rst:367 +#: ../model_zoo/transformers/BERT/contents.rst:412 +#: ../model_zoo/transformers/BERT/contents.rst:415 +#: ../model_zoo/transformers/BERT/contents.rst:436 +#: ../model_zoo/transformers/BERT/contents.rst:457 +#: ../model_zoo/transformers/BERT/contents.rst:469 +#: ../model_zoo/transformers/BERT/contents.rst:487 +#: ../model_zoo/transformers/BERT/contents.rst:535 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:43 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:48 +msgid "``bert-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:48 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:54 +msgid "``bert-wwm-ext-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:54 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking with extented " +"data." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:60 +msgid "``junnyu/ckiplab-bert-base-chinese-ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:60 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on NER task." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:64 +msgid "``junnyu/ckiplab-bert-base-chinese-pos``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:64 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on POS task." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:68 +msgid "``junnyu/ckiplab-bert-base-chinese-ws``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:68 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on WS task." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:72 +msgid "``junnyu/nlptown-bert-base-multilingual-uncased-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:72 +msgid "" +"12-layer, 768-hidden, 12-heads, 167M parameters. Finetuned for sentiment " +"analysis in six languages: English, Dutch, German, French, Spanish and " +"Italian." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:79 +msgid "``junnyu/tbs17-MathBERT``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:79 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. 
Trained on pre-k to " +"graduate math language (English) using a masked language modeling (MLM) " +"objective." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:85 +msgid "``macbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:85 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:90 +msgid "``macbert-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:90 +msgid "" +"24-layer, 1024-hidden, 16-heads, 326M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:95 +msgid "``simbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:95 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on 22 million " +"pairs of similar sentences crawed from Baidu Know." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:100 +msgid "``Langboat/mengzi-bert-base``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:100 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 300G Chinese " +"Corpus Datasets." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:104 +msgid "``Langboat/mengzi-bert-base-fin``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:104 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 20G Finacial " +"Corpus, based on ``Langboat/mengzi-bert-base``." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:109 +msgid "``cross-encoder/ms-marco-MiniLM-L-12-v2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:109 +msgid "Please refer to: `cross-encoder/ms-marco-MiniLM-L-12-v2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:112 +msgid "``cl-tohoku/bert-base-japanese-char``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:112 +#: ../model_zoo/transformers/BERT/contents.rst:115 +#: ../model_zoo/transformers/BERT/contents.rst:118 +#: ../model_zoo/transformers/BERT/contents.rst:223 +#: ../model_zoo/transformers/BERT/contents.rst:229 +#: ../model_zoo/transformers/BERT/contents.rst:310 +#: ../model_zoo/transformers/BERT/contents.rst:409 +#: ../model_zoo/transformers/BERT/contents.rst:439 +#: ../model_zoo/transformers/BERT/contents.rst:541 +#: ../model_zoo/transformers/BERT/contents.rst:545 +#: ../model_zoo/transformers/BERT/contents.rst:550 +#: ../model_zoo/transformers/BERT/contents.rst:554 +msgid "Japanese" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:112 +msgid "Please refer to: `cl-tohoku/bert-base-japanese-char`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:115 +msgid "``cl-tohoku/bert-base-japanese-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:115 +msgid "Please refer to: `cl-tohoku/bert-base-japanese-whole-word-masking`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:118 +msgid "``cl-tohoku/bert-base-japanese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:118 +msgid "Please refer to: `cl-tohoku/bert-base-japanese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:121 +msgid "``nlptown/bert-base-multilingual-uncased-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:121 +msgid "Please refer to: `nlptown/bert-base-multilingual-uncased-sentiment`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:124 +msgid 
"``bert-large-uncased-whole-word-masking-finetuned-squad``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:124 +msgid "Please refer to: `bert-large-uncased-whole-word-masking-finetuned-squad`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:127 +msgid "``finiteautomata/beto-sentiment-analysis``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:127 +#: ../model_zoo/transformers/BERT/contents.rst:190 +#: ../model_zoo/transformers/BERT/contents.rst:373 +#: ../model_zoo/transformers/BERT/contents.rst:433 +#: ../model_zoo/transformers/BERT/contents.rst:463 +#: ../model_zoo/transformers/BERT/contents.rst:475 +msgid "Spanish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:127 +msgid "Please refer to: `finiteautomata/beto-sentiment-analysis`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:130 +msgid "``hfl/chinese-bert-wwm-ext``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:130 +msgid "Please refer to: `hfl/chinese-bert-wwm-ext`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:133 +msgid "``emilyalsentzer/Bio_ClinicalBERT``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:133 +msgid "Please refer to: `emilyalsentzer/Bio_ClinicalBERT`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:136 +msgid "``dslim/bert-base-NER``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:136 +msgid "Please refer to: `dslim/bert-base-NER`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:139 +msgid "``deepset/bert-large-uncased-whole-word-masking-squad2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:139 +msgid "Please refer to: `deepset/bert-large-uncased-whole-word-masking-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:142 +msgid "``neuralmind/bert-base-portuguese-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:142 +#: ../model_zoo/transformers/BERT/contents.rst:244 +#: ../model_zoo/transformers/BERT/contents.rst:361 +#: ../model_zoo/transformers/BERT/contents.rst:520 +msgid "Portuguese" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:142 +msgid "Please refer to: `neuralmind/bert-base-portuguese-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:145 +msgid "``SpanBERT/spanbert-large-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:145 +msgid "Please refer to: `SpanBERT/spanbert-large-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:148 +msgid "``dslim/bert-large-NER``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:148 +msgid "Please refer to: `dslim/bert-large-NER`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:151 +msgid "``bert-base-german-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:151 +#: ../model_zoo/transformers/BERT/contents.rst:160 +#: ../model_zoo/transformers/BERT/contents.rst:205 +#: ../model_zoo/transformers/BERT/contents.rst:211 +#: ../model_zoo/transformers/BERT/contents.rst:250 +msgid "German" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:151 +msgid "Please refer to: `bert-base-german-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:154 +msgid "``deepset/sentence_bert``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:154 +msgid "Please refer to: `deepset/sentence_bert`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:157 +msgid "``ProsusAI/finbert``" +msgstr "" + +#: 
../model_zoo/transformers/BERT/contents.rst:157 +msgid "Please refer to: `ProsusAI/finbert`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:160 +msgid "``oliverguhr/german-sentiment-bert``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:160 +msgid "Please refer to: `oliverguhr/german-sentiment-bert`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:163 +msgid "``google/bert_uncased_L-2_H-128_A-2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:163 +msgid "Please refer to: `google/bert_uncased_L-2_H-128_A-2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:166 +msgid "``microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:166 +msgid "Please refer to: `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:169 +msgid "``DeepPavlov/rubert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:169 +#: ../model_zoo/transformers/BERT/contents.rst:202 +#: ../model_zoo/transformers/BERT/contents.rst:304 +#: ../model_zoo/transformers/BERT/contents.rst:430 +#: ../model_zoo/transformers/BERT/contents.rst:484 +msgid "Russian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:169 +msgid "Please refer to: `DeepPavlov/rubert-base-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:172 +msgid "``wietsedv/bert-base-dutch-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:172 +#: ../model_zoo/transformers/BERT/contents.rst:397 +msgid "Dutch" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:172 +msgid "Please refer to: `wietsedv/bert-base-dutch-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:175 +msgid "``monologg/bert-base-cased-goemotions-original``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:175 +msgid "Please refer to: `monologg/bert-base-cased-goemotions-original`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:178 +msgid "``allenai/scibert_scivocab_uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:178 +msgid "Please refer to: `allenai/scibert_scivocab_uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:181 +msgid "``dbmdz/bert-large-cased-finetuned-conll03-english``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:181 +msgid "Please refer to: `dbmdz/bert-large-cased-finetuned-conll03-english`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:184 +msgid "``microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:184 +msgid "" +"Please refer to: `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-" +"fulltext`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:187 +msgid "``bert-large-uncased-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:187 +msgid "Please refer to: `bert-large-uncased-whole-word-masking`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:190 +msgid "``dccuchile/bert-base-spanish-wwm-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:190 +msgid "Please refer to: `dccuchile/bert-base-spanish-wwm-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:193 +msgid "``google/bert_uncased_L-6_H-256_A-4``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:193 +msgid "Please refer to: `google/bert_uncased_L-6_H-256_A-4`_" 
+msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:196 +msgid "``google/bert_uncased_L-4_H-512_A-8``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:196 +msgid "Please refer to: `google/bert_uncased_L-4_H-512_A-8`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:199 +msgid "``FPTAI/vibert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:199 +msgid "Please refer to: `FPTAI/vibert-base-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:202 +msgid "``cointegrated/rubert-tiny``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:202 +msgid "Please refer to: `cointegrated/rubert-tiny`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:205 +msgid "``bert-base-german-dbmdz-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:205 +msgid "Please refer to: `bert-base-german-dbmdz-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:208 +msgid "``dbmdz/bert-base-turkish-128k-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:208 +#: ../model_zoo/transformers/BERT/contents.rst:325 +#: ../model_zoo/transformers/BERT/contents.rst:460 +#: ../model_zoo/transformers/BERT/contents.rst:502 +msgid "Turkish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:208 +msgid "Please refer to: `dbmdz/bert-base-turkish-128k-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:211 +msgid "``dbmdz/bert-base-german-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:211 +msgid "Please refer to: `dbmdz/bert-base-german-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:214 +msgid "``deepset/minilm-uncased-squad2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:214 +msgid "Please refer to: `deepset/minilm-uncased-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:217 +msgid "``HooshvareLab/bert-base-parsbert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:217 +#: ../model_zoo/transformers/BERT/contents.rst:388 +#: ../model_zoo/transformers/BERT/contents.rst:478 +msgid "Persian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:217 +msgid "Please refer to: `HooshvareLab/bert-base-parsbert-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:220 +msgid "``textattack/bert-base-uncased-ag-news``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:220 +msgid "Please refer to: `textattack/bert-base-uncased-ag-news`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:223 +msgid "``cl-tohoku/bert-base-japanese-v2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:223 +msgid "Please refer to: `cl-tohoku/bert-base-japanese-v2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:226 +msgid "``emilyalsentzer/Bio_Discharge_Summary_BERT``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:226 +msgid "Please refer to: `emilyalsentzer/Bio_Discharge_Summary_BERT`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:229 +msgid "``KoichiYasuoka/bert-base-japanese-upos``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:229 +msgid "Please refer to: `KoichiYasuoka/bert-base-japanese-upos`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:232 +msgid "``dbmdz/bert-base-italian-xxl-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:232 +#: ../model_zoo/transformers/BERT/contents.rst:403 +msgid "Italian" +msgstr "" + +#: 
../model_zoo/transformers/BERT/contents.rst:232 +msgid "Please refer to: `dbmdz/bert-base-italian-xxl-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:235 +msgid "``deepset/bert-base-cased-squad2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:235 +msgid "Please refer to: `deepset/bert-base-cased-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:238 +msgid "``beomi/kcbert-large``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:238 +msgid "Please refer to: `beomi/kcbert-large`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:241 +msgid "``bert-large-cased-whole-word-masking-finetuned-squad``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:241 +msgid "Please refer to: `bert-large-cased-whole-word-masking-finetuned-squad`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:244 +msgid "``neuralmind/bert-large-portuguese-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:244 +msgid "Please refer to: `neuralmind/bert-large-portuguese-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:247 +msgid "``Luyu/co-condenser-marco``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:247 +msgid "Please refer to: `Luyu/co-condenser-marco`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:250 +msgid "``Sahajtomar/German_Zeroshot``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:250 +msgid "Please refer to: `Sahajtomar/German_Zeroshot`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:253 +msgid "``indolem/indobert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:253 +msgid "Indonesian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:253 +msgid "Please refer to: `indolem/indobert-base-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:256 +msgid "``shibing624/text2vec-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:256 +msgid "Please refer to: `shibing624/text2vec-base-chinese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:259 +msgid "``cointegrated/LaBSE-en-ru``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:259 +msgid "English and Russian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:259 +msgid "Please refer to: `cointegrated/LaBSE-en-ru`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:262 +msgid "``prithivida/parrot_fluency_on_BERT``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:262 +msgid "Please refer to: `prithivida/parrot_fluency_on_BERT`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:265 +msgid "``textattack/bert-base-uncased-SST-2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:265 +msgid "Please refer to: `textattack/bert-base-uncased-SST-2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:268 +msgid "``textattack/bert-base-uncased-snli``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:268 +msgid "Please refer to: `textattack/bert-base-uncased-snli`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:271 +msgid "``klue/bert-base``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:271 +msgid "Please refer to: `klue/bert-base`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:274 +msgid "``asafaya/bert-base-arabic``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:274 +#: ../model_zoo/transformers/BERT/contents.rst:424 +msgid 
"Arabic" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:274 +msgid "Please refer to: `asafaya/bert-base-arabic`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:277 +msgid "``textattack/bert-base-uncased-MRPC``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:277 +msgid "Please refer to: `textattack/bert-base-uncased-MRPC`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:280 +msgid "``textattack/bert-base-uncased-imdb``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:280 +msgid "Please refer to: `textattack/bert-base-uncased-imdb`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:283 +msgid "``cross-encoder/ms-marco-TinyBERT-L-2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:283 +msgid "Please refer to: `cross-encoder/ms-marco-TinyBERT-L-2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:286 +msgid "``mrm8488/bert-tiny-finetuned-sms-spam-detection``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:286 +msgid "Please refer to: `mrm8488/bert-tiny-finetuned-sms-spam-detection`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:289 +msgid "``felflare/bert-restore-punctuation``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:289 +msgid "Please refer to: `felflare/bert-restore-punctuation`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:292 +msgid "``sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:292 +msgid "" +"Please refer to: `sshleifer/tiny-dbmdz-bert-large-cased-finetuned-" +"conll03-english`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:295 +msgid "``textattack/bert-base-uncased-rotten-tomatoes``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:295 +msgid "Please refer to: `textattack/bert-base-uncased-rotten-tomatoes`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:298 +msgid "``nlpaueb/legal-bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:298 +msgid "Please refer to: `nlpaueb/legal-bert-base-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:301 +msgid "``hf-internal-testing/tiny-bert-for-token-classification``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:301 +msgid "Please refer to: `hf-internal-testing/tiny-bert-for-token-classification`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:304 +msgid "``cointegrated/rubert-tiny2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:304 +msgid "Please refer to: `cointegrated/rubert-tiny2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:307 +msgid "``kykim/bert-kor-base``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:307 +msgid "Korean" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:307 +msgid "Please refer to: `kykim/bert-kor-base`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:310 +msgid "``cl-tohoku/bert-base-japanese-char-v2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:310 +msgid "Please refer to: `cl-tohoku/bert-base-japanese-char-v2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:313 +msgid "``mrm8488/bert-small-finetuned-squadv2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:313 +msgid "Please refer to: `mrm8488/bert-small-finetuned-squadv2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:316 +msgid "``beomi/kcbert-base``" 
+msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:316 +msgid "Please refer to: `beomi/kcbert-base`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:319 +msgid "``textattack/bert-base-uncased-MNLI``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:319 +msgid "Please refer to: `textattack/bert-base-uncased-MNLI`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:322 +msgid "``textattack/bert-base-uncased-WNLI``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:322 +msgid "Please refer to: `textattack/bert-base-uncased-WNLI`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:325 +msgid "``dbmdz/bert-base-turkish-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:325 +msgid "Please refer to: `dbmdz/bert-base-turkish-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:328 +msgid "``huawei-noah/TinyBERT_General_4L_312D``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:328 +msgid "Please refer to: `huawei-noah/TinyBERT_General_4L_312D`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:331 +msgid "``textattack/bert-base-uncased-QQP``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:331 +msgid "Please refer to: `textattack/bert-base-uncased-QQP`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:334 +msgid "``textattack/bert-base-uncased-STS-B``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:334 +msgid "Please refer to: `textattack/bert-base-uncased-STS-B`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:337 +msgid "``allenai/scibert_scivocab_cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:337 +msgid "Please refer to: `allenai/scibert_scivocab_cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:340 +msgid "``mrm8488/bert-medium-finetuned-squadv2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:340 +msgid "Please refer to: `mrm8488/bert-medium-finetuned-squadv2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:343 +msgid "``TurkuNLP/bert-base-finnish-cased-v1``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:343 +msgid "Finnish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:343 +msgid "Please refer to: `TurkuNLP/bert-base-finnish-cased-v1`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:346 +msgid "``textattack/bert-base-uncased-RTE``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:346 +msgid "Please refer to: `textattack/bert-base-uncased-RTE`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:349 +msgid "``uer/roberta-base-chinese-extractive-qa``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:349 +msgid "Please refer to: `uer/roberta-base-chinese-extractive-qa`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:352 +msgid "``textattack/bert-base-uncased-QNLI``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:352 +msgid "Please refer to: `textattack/bert-base-uncased-QNLI`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:355 +msgid "``textattack/bert-base-uncased-CoLA``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:355 +msgid "Please refer to: `textattack/bert-base-uncased-CoLA`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:358 +msgid "``dmis-lab/biobert-base-cased-v1.2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:358 +msgid "Please refer to: 
`dmis-lab/biobert-base-cased-v1.2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:361 +msgid "``pierreguillou/bert-base-cased-squad-v1.1-portuguese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:361 +msgid "Please refer to: `pierreguillou/bert-base-cased-squad-v1.1-portuguese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:364 +msgid "``KB/bert-base-swedish-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:364 +#: ../model_zoo/transformers/BERT/contents.rst:466 +msgid "Swedish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:364 +msgid "Please refer to: `KB/bert-base-swedish-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:367 +msgid "``uer/roberta-base-finetuned-cluener2020-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:367 +msgid "Please refer to: `uer/roberta-base-finetuned-cluener2020-chinese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:370 +msgid "``onlplab/alephbert-base``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:370 +msgid "Hebrew" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:370 +msgid "Please refer to: `onlplab/alephbert-base`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:373 +msgid "``mrm8488/bert-spanish-cased-finetuned-ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:373 +msgid "Please refer to: `mrm8488/bert-spanish-cased-finetuned-ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:376 +msgid "``alvaroalon2/biobert_chemical_ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:376 +msgid "Please refer to: `alvaroalon2/biobert_chemical_ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:379 +msgid "``bert-base-cased-finetuned-mrpc``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:379 +msgid "Please refer to: `bert-base-cased-finetuned-mrpc`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:382 +msgid "``unitary/toxic-bert``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:382 +msgid "Please refer to: `unitary/toxic-bert`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:385 +msgid "``nlpaueb/bert-base-greek-uncased-v1``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:385 +msgid "Greek" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:385 +msgid "Please refer to: `nlpaueb/bert-base-greek-uncased-v1`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:388 +msgid "``HooshvareLab/bert-fa-base-uncased-sentiment-snappfood``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:388 +msgid "Please refer to: `HooshvareLab/bert-fa-base-uncased-sentiment-snappfood`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:391 +msgid "``Maltehb/danish-bert-botxo``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:391 +msgid "Danish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:391 +msgid "Please refer to: `Maltehb/danish-bert-botxo`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:394 +msgid "``shahrukhx01/bert-mini-finetune-question-detection``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:394 +msgid "Please refer to: `shahrukhx01/bert-mini-finetune-question-detection`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:397 +msgid "``GroNLP/bert-base-dutch-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:397 +msgid "Please refer to: 
`GroNLP/bert-base-dutch-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:400 +msgid "``SpanBERT/spanbert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:400 +msgid "Please refer to: `SpanBERT/spanbert-base-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:403 +msgid "``dbmdz/bert-base-italian-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:403 +msgid "Please refer to: `dbmdz/bert-base-italian-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:406 +msgid "``dbmdz/bert-base-german-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:406 +msgid "German" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:406 +msgid "Please refer to: `dbmdz/bert-base-german-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:409 +msgid "``cl-tohoku/bert-large-japanese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:409 +msgid "Please refer to: `cl-tohoku/bert-large-japanese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:412 +msgid "``hfl/chinese-bert-wwm``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:412 +msgid "Please refer to: `hfl/chinese-bert-wwm`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:415 +msgid "``hfl/chinese-macbert-large``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:415 +msgid "Please refer to: `hfl/chinese-macbert-large`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:418 +msgid "``dslim/bert-base-NER-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:418 +msgid "Please refer to: `dslim/bert-base-NER-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:421 +msgid "``amberoad/bert-multilingual-passage-reranking-msmarco``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:421 +msgid "Please refer to: `amberoad/bert-multilingual-passage-reranking-msmarco`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:424 +msgid "``aubmindlab/bert-base-arabertv02``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:424 +msgid "Please refer to: `aubmindlab/bert-base-arabertv02`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:427 +msgid "``google/bert_uncased_L-4_H-256_A-4``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:427 +msgid "Please refer to: `google/bert_uncased_L-4_H-256_A-4`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:430 +msgid "``DeepPavlov/rubert-base-cased-conversational``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:430 +msgid "Please refer to: `DeepPavlov/rubert-base-cased-conversational`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:433 +msgid "``dccuchile/bert-base-spanish-wwm-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:433 +msgid "Please refer to: `dccuchile/bert-base-spanish-wwm-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:436 +msgid "``ckiplab/bert-base-chinese-ws``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:436 +msgid "Please refer to: `ckiplab/bert-base-chinese-ws`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:439 +msgid "``daigo/bert-base-japanese-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:439 +msgid "Please refer to: `daigo/bert-base-japanese-sentiment`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:442 +msgid "``SZTAKI-HLT/hubert-base-cc``" +msgstr 
"" + +#: ../model_zoo/transformers/BERT/contents.rst:442 +msgid "Hungarian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:442 +msgid "Please refer to: `SZTAKI-HLT/hubert-base-cc`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:445 +msgid "``nlpaueb/legal-bert-small-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:445 +msgid "Please refer to: `nlpaueb/legal-bert-small-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:448 +msgid "``dumitrescustefan/bert-base-romanian-uncased-v1``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:448 +#: ../model_zoo/transformers/BERT/contents.rst:508 +msgid "Romanian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:448 +msgid "Please refer to: `dumitrescustefan/bert-base-romanian-uncased-v1`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:451 +msgid "``google/muril-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:451 +msgid "Indian" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:451 +msgid "Please refer to: `google/muril-base-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:454 +msgid "``dkleczek/bert-base-polish-uncased-v1``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:454 +msgid "Polish" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:454 +msgid "Please refer to: `dkleczek/bert-base-polish-uncased-v1`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:457 +msgid "``ckiplab/bert-base-chinese-ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:457 +msgid "Please refer to: `ckiplab/bert-base-chinese-ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:460 +msgid "``savasy/bert-base-turkish-sentiment-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:460 +msgid "Please refer to: `savasy/bert-base-turkish-sentiment-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:463 +msgid "``mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:463 +msgid "" +"Please refer to: `mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-" +"spa-squad2-es`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:466 +msgid "``KB/bert-base-swedish-cased-ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:466 +msgid "Please refer to: `KB/bert-base-swedish-cased-ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:469 +msgid "``hfl/rbt3``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:469 +msgid "Please refer to: `hfl/rbt3`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:472 +msgid "``remotejob/gradientclassification_v0``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:472 +msgid "Please refer to: `remotejob/gradientclassification_v0`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:475 +msgid "``Recognai/bert-base-spanish-wwm-cased-xnli``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:475 +msgid "Please refer to: `Recognai/bert-base-spanish-wwm-cased-xnli`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:478 +msgid "``HooshvareLab/bert-fa-zwnj-base``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:478 +msgid "Please refer to: `HooshvareLab/bert-fa-zwnj-base`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:481 +msgid "``monologg/bert-base-cased-goemotions-group``" +msgstr "" 
+ +#: ../model_zoo/transformers/BERT/contents.rst:481 +msgid "Please refer to: `monologg/bert-base-cased-goemotions-group`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:484 +msgid "``blanchefort/rubert-base-cased-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:484 +msgid "Please refer to: `blanchefort/rubert-base-cased-sentiment`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:487 +msgid "``shibing624/macbert4csc-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:487 +msgid "Please refer to: `shibing624/macbert4csc-base-chinese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:490 +msgid "``google/bert_uncased_L-8_H-512_A-8``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:490 +msgid "Please refer to: `google/bert_uncased_L-8_H-512_A-8`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:493 +msgid "``bert-large-cased-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:493 +msgid "Please refer to: `bert-large-cased-whole-word-masking`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:496 +msgid "``alvaroalon2/biobert_diseases_ner``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:496 +msgid "Please refer to: `alvaroalon2/biobert_diseases_ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:499 +msgid "``philschmid/BERT-Banking77``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:499 +msgid "Please refer to: `philschmid/BERT-Banking77`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:502 +msgid "``dbmdz/bert-base-turkish-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:502 +msgid "Please refer to: `dbmdz/bert-base-turkish-uncased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:505 +msgid "``vblagoje/bert-english-uncased-finetuned-pos``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:505 +msgid "Please refer to: `vblagoje/bert-english-uncased-finetuned-pos`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:508 +msgid "``dumitrescustefan/bert-base-romanian-cased-v1``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:508 +msgid "Please refer to: `dumitrescustefan/bert-base-romanian-cased-v1`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:511 +msgid "``nreimers/BERT-Tiny_L-2_H-128_A-2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:511 +msgid "Please refer to: `nreimers/BERT-Tiny_L-2_H-128_A-2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:514 +msgid "``digitalepidemiologylab/covid-twitter-bert-v2``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:514 +msgid "Please refer to: `digitalepidemiologylab/covid-twitter-bert-v2`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:517 +msgid "``UBC-NLP/MARBERT``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:517 +msgid "(DA) and MSA" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:517 +msgid "Please refer to: `UBC-NLP/MARBERT`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:520 +msgid "``pierreguillou/bert-large-cased-squad-v1.1-portuguese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:520 +msgid "Please refer to: `pierreguillou/bert-large-cased-squad-v1.1-portuguese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:523 +msgid "``alvaroalon2/biobert_genetic_ner``" +msgstr "" + +#: 
../model_zoo/transformers/BERT/contents.rst:523 +msgid "Please refer to: `alvaroalon2/biobert_genetic_ner`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:526 +msgid "``bvanaken/clinical-assertion-negation-bert``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:526 +msgid "Please refer to: `bvanaken/clinical-assertion-negation-bert`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:529 +msgid "``cross-encoder/stsb-TinyBERT-L-4``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:529 +msgid "Please refer to: `cross-encoder/stsb-TinyBERT-L-4`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:532 +msgid "``sshleifer/tiny-distilbert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:532 +msgid "Please refer to: `sshleifer/tiny-distilbert-base-cased`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:535 +msgid "``ckiplab/bert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:535 +msgid "Please refer to: `ckiplab/bert-base-chinese`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:538 +msgid "``fabriceyhc/bert-base-uncased-amazon_polarity``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:538 +msgid "Please refer to: `fabriceyhc/bert-base-uncased-amazon_polarity`_" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:541 +msgid "``iverxin/bert-base-japanese``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:541 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:545 +msgid "``iverxin/bert-base-japanese-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:545 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on Japanese text" +" using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:550 +msgid "``iverxin/bert-base-japanese-char``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:550 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text." +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:554 +msgid "``iverxin/bert-base-japanese-char-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/BERT/contents.rst:554 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text using Whole-Word-Masking." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BigBird/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BigBird/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..d77ecc282de00215eea51c174fff926b3a4a59de --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/BigBird/contents.po @@ -0,0 +1,53 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/BigBird/contents.rst:5 +msgid "BigBird模型汇总" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的BigBird模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:15 +msgid "``bigbird-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/BigBird/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 127M parameters. Trained on lower-cased " +"English text." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot-Small/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot-Small/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..469637a391c9021aeac8f73d1bfe2ff89d46d760 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot-Small/contents.po @@ -0,0 +1,51 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:5 +msgid "Blenderbot-Small模型汇总" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的Blenderbot-Small模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:15 +msgid "``blenderbot_small-90M``" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot-Small/contents.rst:15 +msgid "16-layer, 16-heads, 90M parameters. The Blenderbot small model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..825a8687f226d4c3a49ff656d527d6c2129fa95d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Blenderbot/contents.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:5 +msgid "Blenderbot模型汇总" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的Blenderbot模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:15 +msgid "``blenderbot-3B``" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:15 +#: ../model_zoo/transformers/Blenderbot/contents.rst:19 +#: ../model_zoo/transformers/Blenderbot/contents.rst:23 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:15 +msgid "26-layer, 32-heads, 3B parameters. The Blenderbot base model." +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:19 +msgid "``blenderbot-400M-distill``" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:19 +msgid "" +"14-layer, 384-hidden, 32-heads, 400M parameters. The Blenderbot distil " +"model." +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:23 +msgid "``blenderbot-1B-distill``" +msgstr "" + +#: ../model_zoo/transformers/Blenderbot/contents.rst:23 +msgid "14-layer, 32-heads, 1478M parameters. The Blenderbot distil 1B model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/CTRL/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/CTRL/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..5359e6346dd16954ccef2fff81fa958b91f1c1ca --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/CTRL/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/CTRL/contents.rst:5 +msgid "CTRL模型汇总" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的CTRL模型对应预训练权重。" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:12 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:12 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:12 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:14 +msgid "``ctrl``" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:14 +#: ../model_zoo/transformers/CTRL/contents.rst:18 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:14 +msgid "48-layer, 1280-hidden, 16-heads, 1701M parameters. The CTRL base model." +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:18 +msgid "``sshleifer-tiny-ctrl``" +msgstr "" + +#: ../model_zoo/transformers/CTRL/contents.rst:18 +msgid "2-layer, 16-hidden, 2-heads, 5M parameters. The Tiny CTRL model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ChineseBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ChineseBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..704825363f2d83c30dd0e2341125ffa533116906 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ChineseBert/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:5 +msgid "ChineseBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ChineseBert模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:15 +msgid "``ChineseBERT-base``" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:15 +#: ../model_zoo/transformers/ChineseBert/contents.rst:18 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:15 +msgid "For details, please refer to: ChineseBERT-base_" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:18 +msgid "``ChineseBERT-large``" +msgstr "" + +#: ../model_zoo/transformers/ChineseBert/contents.rst:18 +msgid "For details, please refer to: ChineseBERT-large_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ConvBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ConvBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..2c56bbc24c132881d22be32689bd415493ecbedc --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ConvBert/contents.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ConvBert/contents.rst:5 +msgid "ConvBERT模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ConvBERT模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:15 +msgid "``convbert-base``" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:15 +#: ../model_zoo/transformers/ConvBert/contents.rst:19 +#: ../model_zoo/transformers/ConvBert/contents.rst:23 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 106M parameters. The ConvBERT base model." 
+msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:19 +msgid "``convbert-medium-small``" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:19 +msgid "" +"12-layer, 384-hidden, 8-heads, 17M parameters. The ConvBERT medium small " +"model." +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:23 +msgid "``convbert-small``" +msgstr "" + +#: ../model_zoo/transformers/ConvBert/contents.rst:23 +msgid "12-layer, 128-hidden, 4-heads, 13M parameters. The ConvBERT small model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/DistilBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/DistilBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..e184f5127268d6de7a59a2d7a0c82e2f9c78eb1c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/DistilBert/contents.po @@ -0,0 +1,84 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/DistilBert/contents.rst:5 +msgid "DistilBERT模型汇总" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的DistilBERT模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:15 +msgid "``distilbert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:15 +#: ../model_zoo/transformers/DistilBert/contents.rst:20 +#: ../model_zoo/transformers/DistilBert/contents.rst:25 +#: ../model_zoo/transformers/DistilBert/contents.rst:30 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:15 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-uncased``." +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:20 +msgid "``distilbert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:20 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-cased``." +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:25 +msgid "``renmada/distilbert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:25 +msgid "" +"6-layer, 768-hidden, 12-heads, 200M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-multilingual-cased``." +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:30 +msgid "``renmada/sshleifer-tiny-distilbert-base-uncase-finetuned-sst-2-english``" +msgstr "" + +#: ../model_zoo/transformers/DistilBert/contents.rst:30 +msgid "2-layer, 2-hidden, 2-heads, 50K parameters. The DistilBERT model." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ELECTRA/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ELECTRA/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..0c3eefb80966f77f0e880b2c420ec66f4a131f27 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ELECTRA/contents.po @@ -0,0 +1,140 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:5 +msgid "ELECTRA模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ELECTRA模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:15 +msgid "``electra-small``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:15 +#: ../model_zoo/transformers/ELECTRA/contents.rst:19 +#: ../model_zoo/transformers/ELECTRA/contents.rst:23 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 4-heads, 14M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:19 +msgid "``electra-base``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:19 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:23 +msgid "``electra-large``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:23 +msgid "" +"24-layer, 1024-hidden, 16-heads, 334M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:27 +msgid "``chinese-electra-small``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:27 +#: ../model_zoo/transformers/ELECTRA/contents.rst:31 +#: ../model_zoo/transformers/ELECTRA/contents.rst:35 +#: ../model_zoo/transformers/ELECTRA/contents.rst:39 +#: ../model_zoo/transformers/ELECTRA/contents.rst:43 +#: ../model_zoo/transformers/ELECTRA/contents.rst:47 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:27 +msgid "12-layer, 768-hidden, 4-heads, 12M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:31 +msgid "``chinese-electra-base``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:31 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:35 +msgid "``ernie-health-chinese``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:35 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese " +"medical corpus." 
+msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:39 +msgid "``junnyu/hfl-chinese-electra-180g-base-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:39 +msgid "" +"Discriminator, 12-layer, 768-hidden, 12-heads, 102M parameters. Trained " +"on 180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:43 +msgid "``junnyu/hfl-chinese-electra-180g-small-ex-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:43 +msgid "" +"Discriminator, 24-layer, 256-hidden, 4-heads, 24M parameters. Trained on " +"180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:47 +msgid "``junnyu/hfl-chinese-legal-electra-small-generator``" +msgstr "" + +#: ../model_zoo/transformers/ELECTRA/contents.rst:47 +msgid "" +"Generator, 12-layer, 64-hidden, 1-heads, 3M parameters. Trained on " +"Chinese legal corpus." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-CTM/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-CTM/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..1bd874dc170fdb6c92d0ebdceff21e566ae98ccf --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-CTM/contents.po @@ -0,0 +1,65 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:5 +msgid "ERNIE-CTM模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE-CTM模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:15 +msgid "``ernie-ctm``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:15 +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:20 +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:25 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:15 +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:20 +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:25 +msgid "" +"12-layer, 768-hidden, 12-heads, _M parameters. For details, please refer " +"to the ernie-ctm_" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:20 +msgid "``wordtag``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-CTM/contents.rst:25 +msgid "``nptag``" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-DOC/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-DOC/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..9ef7c49ba57f5826d30a63727691955136e1743b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-DOC/contents.po @@ -0,0 +1,65 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:5 +msgid "ERNIE-DOC模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE-DOC模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:15 +msgid "``ernie-doc-base-zh``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:15 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:19 +msgid "``ernie-doc-base-en``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:19 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-DOC/contents.rst:19 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on lower-cased " +"English text." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GEN/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GEN/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..c73b76a6826c5ac86732c239be226abd7977dc2d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GEN/contents.po @@ -0,0 +1,75 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:5 +msgid "ERNIE-GEN模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE-GEN模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:15 +msgid "``ernie-gen-base-en``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:15 +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:20 +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:25 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:20 +msgid "``ernie-gen-large-en``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:20 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:25 +msgid "``ernie-gen-large-en-430g``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GEN/contents.rst:25 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text with extended data (430 GB)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GRAM/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GRAM/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..088e37c7fda9205d6a8e9951a88c7c9984cced95 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-GRAM/contents.po @@ -0,0 +1,62 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:5 +msgid "ERNIE-GRAM模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE-GRAM模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:15 +msgid "``ernie-gram-zh``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:15 +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:20 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:20 +msgid "``ernie-gram-zh-finetuned-dureader-robust``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-GRAM/contents.rst:20 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +" Then finetuned on dreader-robust." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-M/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-M/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..4de3b49f55b7cb283270543531a19026b27e6792 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE-M/contents.po @@ -0,0 +1,64 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:5 +msgid "ERNIE-M模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE-M模型对应预训练权重。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:12 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:12 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:12 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:14 +msgid "``ernie-m-base``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:14 +#: ../model_zoo/transformers/ERNIE-M/contents.rst:19 +msgid "Multilingual" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:14 +msgid "" +"12-layer, 768-hidden, 12-heads, _M parameters. Trained on pseudo-" +"parallel sentence pairs on a monolingual corpus." 
+msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:19 +msgid "``ernie-m-large``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE-M/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, _M parameters. Trained on pseudo-" +"parallel sentence pairs on a monolingual corpus." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..1584610c128f77d73cc401b10ef394a1e76a4811 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ERNIE/contents.po @@ -0,0 +1,105 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ERNIE/contents.rst:5 +msgid "ERNIE模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ERNIE模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:15 +msgid "``ernie-3.0-medium-zh``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:15 +#: ../model_zoo/transformers/ERNIE/contents.rst:19 +#: ../model_zoo/transformers/ERNIE/contents.rst:35 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:19 +msgid "``ernie-tiny``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:19 +msgid "3-layer, 1024-hidden, 16-heads, _M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:23 +msgid "``ernie-2.0-en``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:23 +#: ../model_zoo/transformers/ERNIE/contents.rst:27 +#: ../model_zoo/transformers/ERNIE/contents.rst:31 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:23 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:27 +msgid "``ernie-2.0-en-finetuned-squad``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:27 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on finetuned " +"squad text." +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:31 +msgid "``ernie-2.0-large-en``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:31 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text." 
+msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:35 +msgid "``zhui/ernie-1.0-cluecorpussmall``" +msgstr "" + +#: ../model_zoo/transformers/ERNIE/contents.rst:35 +msgid "Please refer to: `zhui/ernie-1.0-cluecorpussmall`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/FNet/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/FNet/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..4a662a8bb31586fb8caac6724be909151b191311 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/FNet/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/FNet/contents.rst:5 +msgid "FNet模型汇总" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的FNet模型对应预训练权重。" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:12 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:12 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:12 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:14 +msgid "``fnet-base``" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:14 +#: ../model_zoo/transformers/FNet/contents.rst:17 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:14 +msgid "For details, please refer to: `google/fnet-base`_" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:17 +msgid "``fnet-large``" +msgstr "" + +#: ../model_zoo/transformers/FNet/contents.rst:17 +msgid "For details, please refer to: `google/fnet-large`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Funnel/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Funnel/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..68a37ba326ed029d9dc340e18d628bfdf4e888ba --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Funnel/contents.po @@ -0,0 +1,132 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/Funnel/contents.rst:5 +msgid "Funnel模型汇总" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的Funnel模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:15 +msgid "``funnel-transformer/small``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:15 +#: ../model_zoo/transformers/Funnel/contents.rst:18 +#: ../model_zoo/transformers/Funnel/contents.rst:21 +#: ../model_zoo/transformers/Funnel/contents.rst:24 +#: ../model_zoo/transformers/Funnel/contents.rst:27 +#: ../model_zoo/transformers/Funnel/contents.rst:30 +#: ../model_zoo/transformers/Funnel/contents.rst:33 +#: ../model_zoo/transformers/Funnel/contents.rst:36 +#: ../model_zoo/transformers/Funnel/contents.rst:39 +#: ../model_zoo/transformers/Funnel/contents.rst:42 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:15 +msgid "For details, please refer to: `funnel-transformer/small`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:18 +msgid "``funnel-transformer/small-base``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:18 +msgid "For details, please refer to: `funnel-transformer/small-base`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:21 +msgid "``funnel-transformer/meduim``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:21 +msgid "For details, please refer to: `funnel-transformer/meduim`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:24 +msgid "``funnel-transformer/meduim-base``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:24 +msgid "For details, please refer to: `funnel-transformer/meduim-base`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:27 +msgid "``funnel-transformer/intermediate``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:27 +msgid "For details, please refer to: `funnel-transformer/intermediate`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:30 +msgid "``funnel-transformer/intermediate-base``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:30 +msgid "For details, please refer to: `funnel-transformer/intermediate-base`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:33 +msgid "``funnel-transformer/large``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:33 +msgid "For details, please refer to: `funnel-transformer/large`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:36 +msgid "``funnel-transformer/large-base``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:36 +msgid "For details, please refer to: `funnel-transformer/large-base`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:39 +msgid "``funnel-transformer/xlarge``" +msgstr "" + +#: 
../model_zoo/transformers/Funnel/contents.rst:39 +msgid "For details, please refer to: `funnel-transformer/xlarge`_" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:42 +msgid "``funnel-transformer/xlarge-base``" +msgstr "" + +#: ../model_zoo/transformers/Funnel/contents.rst:42 +msgid "For details, please refer to: `funnel-transformer/xlarge-base`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/GPT/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/GPT/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..819c17c7821aef9fdb17a4c8f7224226597d2d03 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/GPT/contents.po @@ -0,0 +1,779 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/GPT/contents.rst:5 +msgid "GPT模型汇总" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:8 +msgid "下表汇总介绍了目前PaddleNLP支持的GPT模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:12 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:12 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:12 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:14 +msgid "``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:14 +#: ../model_zoo/transformers/GPT/contents.rst:18 +#: ../model_zoo/transformers/GPT/contents.rst:55 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:14 +msgid "32-layer, 2560-hidden, 32-heads, 2.6B parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:18 +msgid "``gpt-cpm-small-cn-distill``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:18 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. 
The model distilled from" +" the GPT model ``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:23 +msgid "``gpt2-en``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:23 +#: ../model_zoo/transformers/GPT/contents.rst:27 +#: ../model_zoo/transformers/GPT/contents.rst:31 +#: ../model_zoo/transformers/GPT/contents.rst:35 +#: ../model_zoo/transformers/GPT/contents.rst:39 +#: ../model_zoo/transformers/GPT/contents.rst:43 +#: ../model_zoo/transformers/GPT/contents.rst:47 +#: ../model_zoo/transformers/GPT/contents.rst:51 +#: ../model_zoo/transformers/GPT/contents.rst:59 +#: ../model_zoo/transformers/GPT/contents.rst:65 +#: ../model_zoo/transformers/GPT/contents.rst:68 +#: ../model_zoo/transformers/GPT/contents.rst:71 +#: ../model_zoo/transformers/GPT/contents.rst:74 +#: ../model_zoo/transformers/GPT/contents.rst:80 +#: ../model_zoo/transformers/GPT/contents.rst:83 +#: ../model_zoo/transformers/GPT/contents.rst:89 +#: ../model_zoo/transformers/GPT/contents.rst:95 +#: ../model_zoo/transformers/GPT/contents.rst:98 +#: ../model_zoo/transformers/GPT/contents.rst:104 +#: ../model_zoo/transformers/GPT/contents.rst:107 +#: ../model_zoo/transformers/GPT/contents.rst:110 +#: ../model_zoo/transformers/GPT/contents.rst:116 +#: ../model_zoo/transformers/GPT/contents.rst:122 +#: ../model_zoo/transformers/GPT/contents.rst:134 +#: ../model_zoo/transformers/GPT/contents.rst:140 +#: ../model_zoo/transformers/GPT/contents.rst:146 +#: ../model_zoo/transformers/GPT/contents.rst:149 +#: ../model_zoo/transformers/GPT/contents.rst:161 +#: ../model_zoo/transformers/GPT/contents.rst:164 +#: ../model_zoo/transformers/GPT/contents.rst:167 +#: ../model_zoo/transformers/GPT/contents.rst:170 +#: ../model_zoo/transformers/GPT/contents.rst:173 +#: ../model_zoo/transformers/GPT/contents.rst:176 +#: ../model_zoo/transformers/GPT/contents.rst:191 +#: ../model_zoo/transformers/GPT/contents.rst:197 +#: ../model_zoo/transformers/GPT/contents.rst:200 +#: ../model_zoo/transformers/GPT/contents.rst:203 +#: ../model_zoo/transformers/GPT/contents.rst:209 +#: ../model_zoo/transformers/GPT/contents.rst:212 +#: ../model_zoo/transformers/GPT/contents.rst:221 +#: ../model_zoo/transformers/GPT/contents.rst:227 +#: ../model_zoo/transformers/GPT/contents.rst:230 +#: ../model_zoo/transformers/GPT/contents.rst:233 +#: ../model_zoo/transformers/GPT/contents.rst:236 +#: ../model_zoo/transformers/GPT/contents.rst:239 +#: ../model_zoo/transformers/GPT/contents.rst:248 +#: ../model_zoo/transformers/GPT/contents.rst:251 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:23 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:27 +msgid "``gpt2-medium-en``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:27 +msgid "24-layer, 1024-hidden, 16-heads, 345M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:31 +msgid "``gpt2-large-en``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:31 +#: ../model_zoo/transformers/GPT/contents.rst:51 +msgid "36-layer, 1280-hidden, 20-heads, 774M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:35 +msgid "``gpt2-xl-en``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:35 +msgid "" +"48-layer, 1600-hidden, 25-heads, 1558M parameters. Trained on English " +"text." 
+msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:39 +msgid "``junnyu/distilgpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:39 +msgid "6-layer, 768-hidden, 12-heads, 81M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:43 +msgid "``junnyu/microsoft-DialoGPT-small``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:43 +msgid "12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:47 +msgid "``junnyu/microsoft-DialoGPT-medium``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:47 +msgid "24-layer, 1024-hidden, 16-heads, 354M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:51 +msgid "``junnyu/microsoft-DialoGPT-large``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:55 +msgid "``junnyu/uer-gpt2-chinese-poem``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:55 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on Chinese " +"poetry corpus." +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:59 +msgid "``distilgpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:59 +msgid "Please refer to: `distilgpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:62 +msgid "``w11wo/javanese-gpt2-small-imdb``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:62 +msgid "Javanese" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:62 +msgid "Please refer to: `w11wo/javanese-gpt2-small-imdb`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:65 +msgid "``remotejob/tweetsDISTILGPT2fi_v4``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:65 +msgid "Please refer to: `remotejob/tweetsDISTILGPT2fi_v4`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:68 +msgid "``TrLOX/gpt2-tdk``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:68 +msgid "Please refer to: `TrLOX/gpt2-tdk`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:71 +msgid "``huggingtweets/slime_machine``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:71 +msgid "Please refer to: `huggingtweets/slime_machine`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:74 +msgid "``microsoft/DialoGPT-small``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:74 +msgid "Please refer to: `microsoft/DialoGPT-small`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:77 +msgid "``sberbank-ai/rugpt3large_based_on_gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:77 +#: ../model_zoo/transformers/GPT/contents.rst:86 +#: ../model_zoo/transformers/GPT/contents.rst:101 +msgid "Russian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:77 +msgid "Please refer to: `sberbank-ai/rugpt3large_based_on_gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:80 +msgid "``sshleifer/tiny-gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:80 +msgid "Please refer to: `sshleifer/tiny-gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:83 +msgid "``microsoft/DialoGPT-large``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:83 +msgid "Please refer to: `microsoft/DialoGPT-large`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:86 +msgid "``sberbank-ai/rugpt3small_based_on_gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:86 +msgid "Please refer to: 
`sberbank-ai/rugpt3small_based_on_gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:89 +msgid "``uw-hai/polyjuice``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:89 +msgid "Please refer to: `uw-hai/polyjuice`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:92 +msgid "``NYTK/text-generation-poem-petofi-gpt2-small-hungarian``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:92 +msgid "Hungarian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:92 +msgid "Please refer to: `NYTK/text-generation-poem-petofi-gpt2-small-hungarian`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:95 +msgid "``microsoft/DialogRPT-human-vs-rand``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:95 +msgid "Please refer to: `microsoft/DialogRPT-human-vs-rand`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:98 +msgid "``hf-internal-testing/tiny-random-gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:98 +msgid "Please refer to: `hf-internal-testing/tiny-random-gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:101 +msgid "``Grossmend/rudialogpt3_medium_based_on_gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:101 +msgid "Please refer to: `Grossmend/rudialogpt3_medium_based_on_gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:104 +msgid "``pranavpsv/genre-story-generator-v2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:104 +msgid "Please refer to: `pranavpsv/genre-story-generator-v2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:107 +msgid "``microsoft/DialogRPT-updown``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:107 +msgid "Please refer to: `microsoft/DialogRPT-updown`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:110 +msgid "``microsoft/DialogRPT-human-vs-machine``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:110 +msgid "Please refer to: `microsoft/DialogRPT-human-vs-machine`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:113 +msgid "``pierreguillou/gpt2-small-portuguese``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:113 +msgid "Portuguese" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:113 +msgid "Please refer to: `pierreguillou/gpt2-small-portuguese`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:116 +msgid "``mrm8488/GPT-2-finetuned-covid-bio-medrxiv``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:116 +msgid "Please refer to: `mrm8488/GPT-2-finetuned-covid-bio-medrxiv`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:119 +msgid "``anonymous-german-nlp/german-gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:119 +#: ../model_zoo/transformers/GPT/contents.rst:128 +msgid "German" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:119 +msgid "Please refer to: `anonymous-german-nlp/german-gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:122 +msgid "``microsoft/CodeGPT-small-py``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:122 +msgid "Please refer to: `microsoft/CodeGPT-small-py`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:125 +msgid "``antoiloui/belgpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:125 +#: ../model_zoo/transformers/GPT/contents.rst:131 +#: ../model_zoo/transformers/GPT/contents.rst:152 +#: ../model_zoo/transformers/GPT/contents.rst:245 +msgid "French" +msgstr "" 
+ +#: ../model_zoo/transformers/GPT/contents.rst:125 +msgid "Please refer to: `antoiloui/belgpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:128 +msgid "``benjamin/gerpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:128 +msgid "Please refer to: `benjamin/gerpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:131 +msgid "``asi/gpt-fr-cased-small``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:131 +msgid "Please refer to: `asi/gpt-fr-cased-small`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:134 +msgid "``microsoft/CodeGPT-small-java-adaptedGPT2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:134 +msgid "Please refer to: `microsoft/CodeGPT-small-java-adaptedGPT2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:137 +msgid "``GroNLP/gpt2-small-dutch``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:137 +msgid "Dutch" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:137 +msgid "Please refer to: `GroNLP/gpt2-small-dutch`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:140 +msgid "``lvwerra/gpt2-imdb``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:140 +msgid "Please refer to: `lvwerra/gpt2-imdb`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:143 +msgid "``DeepESP/gpt2-spanish``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:143 +#: ../model_zoo/transformers/GPT/contents.rst:194 +msgid "Spanish" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:143 +msgid "Please refer to: `DeepESP/gpt2-spanish`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:146 +msgid "``microsoft/CodeGPT-small-py-adaptedGPT2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:146 +msgid "Please refer to: `microsoft/CodeGPT-small-py-adaptedGPT2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:149 +msgid "``microsoft/DialogRPT-width``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:149 +msgid "Please refer to: `microsoft/DialogRPT-width`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:152 +msgid "``dbddv01/gpt2-french-small``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:152 +msgid "Please refer to: `dbddv01/gpt2-french-small`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:155 +msgid "``GroNLP/gpt2-small-italian``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:155 +#: ../model_zoo/transformers/GPT/contents.rst:206 +msgid "Italian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:155 +msgid "Please refer to: `GroNLP/gpt2-small-italian`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:158 +msgid "``flax-community/gpt2-medium-persian``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:158 +#: ../model_zoo/transformers/GPT/contents.rst:185 +msgid "Persian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:158 +msgid "Please refer to: `flax-community/gpt2-medium-persian`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:161 +msgid "``microsoft/DialogRPT-depth``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:161 +msgid "Please refer to: `microsoft/DialogRPT-depth`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:164 +msgid "``Nokia/nlgp-natural``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:164 +msgid "Please refer to: `Nokia/nlgp-natural`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:167 +msgid 
"``macedonizer/hr-gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:167 +msgid "Please refer to: `macedonizer/hr-gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:170 +msgid "``mrm8488/GPT-2-finetuned-common_gen``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:170 +msgid "Please refer to: `mrm8488/GPT-2-finetuned-common_gen`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:173 +msgid "``pranavpsv/gpt2-genre-story-generator``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:173 +msgid "Please refer to: `pranavpsv/gpt2-genre-story-generator`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:176 +msgid "``rbhushan/distilgpt2-finetuned-wikitext2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:176 +msgid "Please refer to: `rbhushan/distilgpt2-finetuned-wikitext2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:179 +msgid "``readerbench/RoGPT2-large``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:179 +msgid "Romanian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:179 +msgid "Please refer to: `readerbench/RoGPT2-large`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:182 +msgid "``flax-community/gpt2-small-indonesian``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:182 +#: ../model_zoo/transformers/GPT/contents.rst:188 +#: ../model_zoo/transformers/GPT/contents.rst:215 +msgid "Indonesian" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:182 +msgid "Please refer to: `flax-community/gpt2-small-indonesian`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:185 +msgid "``HooshvareLab/gpt2-fa``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:185 +msgid "Please refer to: `HooshvareLab/gpt2-fa`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:188 +msgid "``cahya/gpt2-small-indonesian-522M``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:188 +msgid "Please refer to: `cahya/gpt2-small-indonesian-522M`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:191 +msgid "``DingleyMaillotUrgell/homer-bot``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:191 +msgid "Please refer to: `DingleyMaillotUrgell/homer-bot`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:194 +msgid "``datificate/gpt2-small-spanish``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:194 +msgid "Please refer to: `datificate/gpt2-small-spanish`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:197 +msgid "``ericzhou/tsundere_v1``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:197 +msgid "Please refer to: `ericzhou/tsundere_v1`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:200 +msgid "``huggingtweets/wwm_shakespeare``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:200 +msgid "Please refer to: `huggingtweets/wwm_shakespeare`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:203 +msgid "``SIC98/GPT2-python-code-generator``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:203 +msgid "Please refer to: `SIC98/GPT2-python-code-generator`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:206 +msgid "``GroNLP/gpt2-small-italian-embeddings``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:206 +msgid "Please refer to: `GroNLP/gpt2-small-italian-embeddings`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:209 +msgid 
"``huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-witheredstrings``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:209 +msgid "" +"Please refer to: `huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-" +"witheredstrings`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:212 +msgid "``salesken/grammar_correction``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:212 +msgid "Please refer to: `salesken/grammar_correction`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:215 +msgid "``flax-community/gpt2-medium-indonesian``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:215 +msgid "Please refer to: `flax-community/gpt2-medium-indonesian`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:218 +msgid "``gorkemgoknar/gpt2-small-turkish``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:218 +msgid "Turkish" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:218 +msgid "Please refer to: `gorkemgoknar/gpt2-small-turkish`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:221 +msgid "``deepparag/DumBot``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:221 +msgid "Please refer to: `deepparag/DumBot`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:224 +msgid "``jcblaise/gpt2-tagalog``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:224 +msgid "Tagalog" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:224 +msgid "Please refer to: `jcblaise/gpt2-tagalog`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:227 +msgid "``BigSalmon/InformalToFormalLincoln21``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:227 +msgid "Please refer to: `BigSalmon/InformalToFormalLincoln21`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:230 +msgid "``LorenzoDeMattei/GePpeTto``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:230 +msgid "Please refer to: `LorenzoDeMattei/GePpeTto`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:233 +msgid "``macedonizer/sr-gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:233 +msgid "Please refer to: `macedonizer/sr-gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:236 +msgid "``indonesian-nlp/gpt2``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:236 +msgid "Please refer to: `indonesian-nlp/gpt2`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:239 +msgid "``ceostroff/harry-potter-gpt2-fanfiction``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:239 +msgid "Please refer to: `ceostroff/harry-potter-gpt2-fanfiction`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:242 +msgid "``akhooli/gpt2-small-arabic-poetry``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:242 +msgid "Arabic" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:242 +msgid "Please refer to: `akhooli/gpt2-small-arabic-poetry`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:245 +msgid "``asi/gpt-fr-cased-base``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:245 +msgid "Please refer to: `asi/gpt-fr-cased-base`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:248 +msgid "``congcongwang/gpt2_medium_fine_tuned_coder``" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:248 +msgid "Please refer to: `congcongwang/gpt2_medium_fine_tuned_coder`_" +msgstr "" + +#: ../model_zoo/transformers/GPT/contents.rst:251 +msgid "``cambridgeltl/simctg_wikitext103``" +msgstr "" + 
+#: ../model_zoo/transformers/GPT/contents.rst:251 +msgid "Please refer to: `cambridgeltl/simctg_wikitext103`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLM/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLM/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..20958ef2b77586917a438bf19aeb15200fc8547e --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLM/contents.po @@ -0,0 +1,64 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:5 +msgid "LayoutLM模型汇总" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的LayoutLM模型以及对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:15 +msgid "``layoutlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:15 +#: ../model_zoo/transformers/LayoutLM/contents.rst:19 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 339M parameters. LayoutLm base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:19 +msgid "``layoutlm-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/LayoutLM/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, 51M parameters. LayoutLm large Uncased " +"model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLMV2/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLMV2/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..d8535b9511a622f4df80384a2bbfd685f293252d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutLMV2/contents.po @@ -0,0 +1,64 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:5 +msgid "LayoutLMV2模型汇总" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的LayoutLMV2模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:15 +msgid "``layoutlmv2-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:15 +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:19 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 200M parameters. LayoutLmv2 base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:19 +msgid "``layoutlmv2-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/LayoutLMV2/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, _M parameters. LayoutLmv2 large uncased " +"model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutXLM/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutXLM/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..0fecb29cfba3cbc6dfe245cfdc4f7be839d562a8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/LayoutXLM/contents.po @@ -0,0 +1,53 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:5 +msgid "LayoutXLM模型汇总" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的LayoutXLM模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:15 +msgid "``layoutxlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/LayoutXLM/contents.rst:15 +msgid "" +"12-layer, 768-hidden, 12-heads, 369M parameters. Layoutxlm base uncased " +"model." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Luke/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Luke/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..c13f1c2fc2b1bc65d3ab0ecad382fbf31753cd94 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Luke/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/Luke/contents.rst:5 +msgid "Luke模型汇总" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的Luke模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:15 +msgid "``luke-base``" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:15 +#: ../model_zoo/transformers/Luke/contents.rst:18 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:15 +msgid "For details, please refer to: `studio-ousia/luke-base`_" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:18 +msgid "``luke-large``" +msgstr "" + +#: ../model_zoo/transformers/Luke/contents.rst:18 +msgid "For details, please refer to: `studio-ousia/luke-large`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MBart/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MBart/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..e18558c6b269722a4ef81e5973cd5553d99f237f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MBart/contents.po @@ -0,0 +1,97 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/MBart/contents.rst:5 +msgid "MBart模型汇总" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的MBart模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:15 +msgid "``mbart-large-cc25``" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:15 +#: ../model_zoo/transformers/MBart/contents.rst:20 +#: ../model_zoo/transformers/MBart/contents.rst:25 +#: ../model_zoo/transformers/MBart/contents.rst:30 +#: ../model_zoo/transformers/MBart/contents.rst:35 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:15 +msgid "" +"12-layer, 1024-hidden, 12-heads, 1123M parameters. The ``mbart-large-" +"cc25`` model." +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:20 +msgid "``mbart-large-en-ro``" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:20 +msgid "" +"12-layer, 768-hidden, 16-heads, 1123M parameters. The ``mbart-large rn-" +"ro`` model." +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:25 +msgid "``mbart-large-50-one-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:25 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-one-" +"to-many-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:30 +msgid "``mbart-large-50-many-to-one-mmt``" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:30 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-many-" +"to-one-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:35 +msgid "``mbart-large-50-many-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers/MBart/contents.rst:35 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-many-" +"to-many-mmt`` model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MPNet/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MPNet/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..9270a38b9c0eac1b7eda6018fcd88654bda903f2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MPNet/contents.po @@ -0,0 +1,51 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/MPNet/contents.rst:5 +msgid "MPNet模型汇总" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的MPNet模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:15 +msgid "``mpnet-base``" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/MPNet/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 109M parameters. MPNet Base Model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MegatronBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MegatronBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..3fce0ad660ad375361b632b6cfc5dcdb07fb173d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MegatronBert/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:5 +msgid "MegatronBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的MegatronBert模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:15 +msgid "``megatronbert-cased``" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:15 +#: ../model_zoo/transformers/MegatronBert/contents.rst:18 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:15 +msgid "For details, please refer to: `nvidia/megatron-bert-cased-345m`_" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:18 +msgid "``megatronbert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/MegatronBert/contents.rst:18 +msgid "For details, please refer to: `nvidia/megatron-bert-uncased-345m`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MobileBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MobileBert/contents.po new file mode 100644 index 
0000000000000000000000000000000000000000..68a3649645aeb0c247ff8bb9e70d2476ec1829e8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/MobileBert/contents.po @@ -0,0 +1,51 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/MobileBert/contents.rst:5 +msgid "MobileBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的MobileBert模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:15 +msgid "``mobilebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/MobileBert/contents.rst:15 +msgid "For details, please refer to: `google/mobilebert-uncased`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/NeZha/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/NeZha/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..e50f08e591ed72a6d3955be26c8ca9335d1888cf --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/NeZha/contents.po @@ -0,0 +1,75 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/NeZha/contents.rst:5 +msgid "NeZha模型汇总" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的NeZha模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:15 +msgid "``nezha-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:15 +#: ../model_zoo/transformers/NeZha/contents.rst:19 +#: ../model_zoo/transformers/NeZha/contents.rst:23 +#: ../model_zoo/transformers/NeZha/contents.rst:27 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." 
+msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:19 +msgid "``nezha-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:19 +#: ../model_zoo/transformers/NeZha/contents.rst:27 +msgid "24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:23 +msgid "``nezha-base-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:23 +msgid "12-layer, 768-hidden, 16-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/NeZha/contents.rst:27 +msgid "``nezha-large-wwm-chinese``" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/PPMiniLM/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/PPMiniLM/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..4e878768196544a74fa2bae6d536cdcbcbf984d5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/PPMiniLM/contents.po @@ -0,0 +1,53 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:5 +msgid "PPMiniLM模型汇总" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的PPMiniLM模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:15 +msgid "``ppminilm-6l-768h``" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:15 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/PPMiniLM/contents.rst:15 +msgid "" +"A Chinese characteristic small model using multiple model compression. " +"Please refer to: ppminilm-6l-768h_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ProphetNet/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ProphetNet/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..2f2ee06dcd52d772221b6b380c01e10efddd87ef --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/ProphetNet/contents.po @@ -0,0 +1,51 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:5 +msgid "ProphetNet模型汇总" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的ProphetNet模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:15 +msgid "``prophetnet-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/ProphetNet/contents.rst:15 +msgid "For details, please refer to: `microsoft/prophetnet-large-uncased`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Reformer/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Reformer/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..02f4319ef8f76a463786811afe0a37f66c98292d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/Reformer/contents.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/Reformer/contents.rst:5 +msgid "Reformer模型汇总" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的Reformer模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:15 +msgid "``reformer-enwik8``" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:15 +#: ../model_zoo/transformers/Reformer/contents.rst:18 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:15 +msgid "12-layer, 1024-hidden, 8-heads, 148M parameters." +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:18 +msgid "``reformer-crime-and-punishment``" +msgstr "" + +#: ../model_zoo/transformers/Reformer/contents.rst:18 +msgid "6-layer, 256-hidden, 2-heads, 3M parameters." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RemBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RemBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..cf1d91f9411ac9b0ab3d1aaeb9732f18f71c8900 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RemBert/contents.po @@ -0,0 +1,53 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/RemBert/contents.rst:5 +msgid "RemBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的RemBert模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:15 +msgid "``rembert``" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:15 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/RemBert/contents.rst:15 +msgid "" +"For details, please refer to the corresponding model card of huggingface:" +" `google/rembert`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoBERTa/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoBERTa/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..afe2d722f2216309148c215678a5317a451d89b1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoBERTa/contents.po @@ -0,0 +1,984 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:5 +msgid "RoBERTa模型汇总" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:8 +msgid "下表汇总介绍了目前PaddleNLP支持的RoBERTa模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:12 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:12 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:12 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:14 +msgid "``roberta-wwm-ext``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:14 +#: ../model_zoo/transformers/RoBERTa/contents.rst:19 +#: ../model_zoo/transformers/RoBERTa/contents.rst:24 +#: ../model_zoo/transformers/RoBERTa/contents.rst:27 +#: ../model_zoo/transformers/RoBERTa/contents.rst:46 +#: ../model_zoo/transformers/RoBERTa/contents.rst:50 +#: ../model_zoo/transformers/RoBERTa/contents.rst:54 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:14 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on English Text " +"using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:19 +msgid "``roberta-wwm-ext-large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, 325M parameters. Trained on English Text" +" using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:24 +msgid "``rbt3``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:24 +msgid "3-layer, 768-hidden, 12-heads, 38M parameters." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:27 +msgid "``rbtl3``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:27 +msgid "3-layer, 1024-hidden, 16-heads, 61M parameters." 
+msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:30 +msgid "``nosaydomore/deepset-roberta-base-squad2``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:30 +#: ../model_zoo/transformers/RoBERTa/contents.rst:34 +#: ../model_zoo/transformers/RoBERTa/contents.rst:38 +#: ../model_zoo/transformers/RoBERTa/contents.rst:42 +#: ../model_zoo/transformers/RoBERTa/contents.rst:58 +#: ../model_zoo/transformers/RoBERTa/contents.rst:61 +#: ../model_zoo/transformers/RoBERTa/contents.rst:64 +#: ../model_zoo/transformers/RoBERTa/contents.rst:67 +#: ../model_zoo/transformers/RoBERTa/contents.rst:70 +#: ../model_zoo/transformers/RoBERTa/contents.rst:73 +#: ../model_zoo/transformers/RoBERTa/contents.rst:76 +#: ../model_zoo/transformers/RoBERTa/contents.rst:79 +#: ../model_zoo/transformers/RoBERTa/contents.rst:82 +#: ../model_zoo/transformers/RoBERTa/contents.rst:85 +#: ../model_zoo/transformers/RoBERTa/contents.rst:88 +#: ../model_zoo/transformers/RoBERTa/contents.rst:91 +#: ../model_zoo/transformers/RoBERTa/contents.rst:94 +#: ../model_zoo/transformers/RoBERTa/contents.rst:97 +#: ../model_zoo/transformers/RoBERTa/contents.rst:100 +#: ../model_zoo/transformers/RoBERTa/contents.rst:103 +#: ../model_zoo/transformers/RoBERTa/contents.rst:109 +#: ../model_zoo/transformers/RoBERTa/contents.rst:112 +#: ../model_zoo/transformers/RoBERTa/contents.rst:115 +#: ../model_zoo/transformers/RoBERTa/contents.rst:118 +#: ../model_zoo/transformers/RoBERTa/contents.rst:121 +#: ../model_zoo/transformers/RoBERTa/contents.rst:124 +#: ../model_zoo/transformers/RoBERTa/contents.rst:127 +#: ../model_zoo/transformers/RoBERTa/contents.rst:130 +#: ../model_zoo/transformers/RoBERTa/contents.rst:136 +#: ../model_zoo/transformers/RoBERTa/contents.rst:139 +#: ../model_zoo/transformers/RoBERTa/contents.rst:142 +#: ../model_zoo/transformers/RoBERTa/contents.rst:145 +#: ../model_zoo/transformers/RoBERTa/contents.rst:148 +#: ../model_zoo/transformers/RoBERTa/contents.rst:151 +#: ../model_zoo/transformers/RoBERTa/contents.rst:154 +#: ../model_zoo/transformers/RoBERTa/contents.rst:157 +#: ../model_zoo/transformers/RoBERTa/contents.rst:160 +#: ../model_zoo/transformers/RoBERTa/contents.rst:163 +#: ../model_zoo/transformers/RoBERTa/contents.rst:166 +#: ../model_zoo/transformers/RoBERTa/contents.rst:169 +#: ../model_zoo/transformers/RoBERTa/contents.rst:172 +#: ../model_zoo/transformers/RoBERTa/contents.rst:175 +#: ../model_zoo/transformers/RoBERTa/contents.rst:178 +#: ../model_zoo/transformers/RoBERTa/contents.rst:181 +#: ../model_zoo/transformers/RoBERTa/contents.rst:184 +#: ../model_zoo/transformers/RoBERTa/contents.rst:190 +#: ../model_zoo/transformers/RoBERTa/contents.rst:193 +#: ../model_zoo/transformers/RoBERTa/contents.rst:199 +#: ../model_zoo/transformers/RoBERTa/contents.rst:202 +#: ../model_zoo/transformers/RoBERTa/contents.rst:205 +#: ../model_zoo/transformers/RoBERTa/contents.rst:208 +#: ../model_zoo/transformers/RoBERTa/contents.rst:211 +#: ../model_zoo/transformers/RoBERTa/contents.rst:214 +#: ../model_zoo/transformers/RoBERTa/contents.rst:217 +#: ../model_zoo/transformers/RoBERTa/contents.rst:220 +#: ../model_zoo/transformers/RoBERTa/contents.rst:223 +#: ../model_zoo/transformers/RoBERTa/contents.rst:226 +#: ../model_zoo/transformers/RoBERTa/contents.rst:229 +#: ../model_zoo/transformers/RoBERTa/contents.rst:232 +#: ../model_zoo/transformers/RoBERTa/contents.rst:235 +#: ../model_zoo/transformers/RoBERTa/contents.rst:241 +#: ../model_zoo/transformers/RoBERTa/contents.rst:244 +#: 
../model_zoo/transformers/RoBERTa/contents.rst:247 +#: ../model_zoo/transformers/RoBERTa/contents.rst:253 +#: ../model_zoo/transformers/RoBERTa/contents.rst:256 +#: ../model_zoo/transformers/RoBERTa/contents.rst:259 +#: ../model_zoo/transformers/RoBERTa/contents.rst:262 +#: ../model_zoo/transformers/RoBERTa/contents.rst:265 +#: ../model_zoo/transformers/RoBERTa/contents.rst:268 +#: ../model_zoo/transformers/RoBERTa/contents.rst:271 +#: ../model_zoo/transformers/RoBERTa/contents.rst:274 +#: ../model_zoo/transformers/RoBERTa/contents.rst:277 +#: ../model_zoo/transformers/RoBERTa/contents.rst:286 +#: ../model_zoo/transformers/RoBERTa/contents.rst:289 +#: ../model_zoo/transformers/RoBERTa/contents.rst:295 +#: ../model_zoo/transformers/RoBERTa/contents.rst:301 +#: ../model_zoo/transformers/RoBERTa/contents.rst:307 +#: ../model_zoo/transformers/RoBERTa/contents.rst:313 +#: ../model_zoo/transformers/RoBERTa/contents.rst:316 +#: ../model_zoo/transformers/RoBERTa/contents.rst:319 +#: ../model_zoo/transformers/RoBERTa/contents.rst:322 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:30 +msgid "12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:34 +msgid "``nosaydomore/roberta-en-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:34 +msgid "12-layer, 768-hidden, 12-heads, 163M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:38 +msgid "``nosaydomore/roberta-en-large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:38 +msgid "24-layer, 1024-hidden, 16-heads, 408M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:42 +msgid "``nosaydomore/sshleifei-tiny-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:42 +msgid "2-layer, 2-hidden, 2-heads, 0.25M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:46 +msgid "``nosaydomore/uer-roberta-base-chinese-extractive-qa``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:46 +#: ../model_zoo/transformers/RoBERTa/contents.rst:54 +msgid "12-layer, 768-hidden, 12-heads, 101M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:50 +msgid "``nosaydomore/uer-roberta-base-ft-chinanews-chn``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:50 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese text." 
+msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:54 +msgid "``nosaydomore/uer-roberta-base-ft-cluener2020-chn``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:58 +msgid "``roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:58 +msgid "Please refer to: roberta-base_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:61 +msgid "``cardiffnlp/twitter-roberta-base-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:61 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base-sentiment`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:64 +msgid "``roberta-large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:64 +msgid "Please refer to: roberta-large_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:67 +msgid "``distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:67 +msgid "Please refer to: distilroberta-base_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:70 +msgid "``cross-encoder/nli-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:70 +msgid "Please refer to: `cross-encoder/nli-distilroberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:73 +msgid "``siebert/sentiment-roberta-large-english``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:73 +msgid "Please refer to: `siebert/sentiment-roberta-large-english`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:76 +msgid "``j-hartmann/emotion-english-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:76 +msgid "Please refer to: `j-hartmann/emotion-english-distilroberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:79 +msgid "``roberta-base-openai-detector``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:79 +msgid "Please refer to: `roberta-base-openai-detector`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:82 +msgid "``huggingface/CodeBERTa-small-v1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:82 +msgid "Please refer to: `huggingface/CodeBERTa-small-v1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:85 +msgid "``mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:85 +msgid "" +"Please refer to: `mrm8488/distilroberta-finetuned-financial-news-" +"sentiment-analysis`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:88 +msgid "``cardiffnlp/twitter-roberta-base-emotion``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:88 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base-emotion`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:91 +msgid "``seyonec/PubChem10M_SMILES_BPE_396_250``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:91 +msgid "Please refer to: `seyonec/PubChem10M_SMILES_BPE_396_250`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:94 +msgid "``textattack/roberta-base-SST-2``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:94 +msgid "Please refer to: `textattack/roberta-base-SST-2`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:97 +msgid "``sshleifer/tiny-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:97 +msgid "Please refer to: `sshleifer/tiny-distilroberta-base`_" 
+msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:100 +msgid "``thatdramebaazguy/roberta-base-squad``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:100 +msgid "Please refer to: `thatdramebaazguy/roberta-base-squad`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:103 +msgid "``ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:103 +msgid "Please refer to: `ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:106 +msgid "``ufal/robeczech-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:106 +msgid "Czech" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:106 +msgid "Please refer to: `ufal/robeczech-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:109 +msgid "``seyonec/PubChem10M_SMILES_BPE_450k``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:109 +msgid "Please refer to: `seyonec/PubChem10M_SMILES_BPE_450k`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:112 +msgid "``cardiffnlp/twitter-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:112 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:115 +msgid "``seyonec/PubChem10M_SMILES_BPE_50k``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:115 +msgid "Please refer to: `seyonec/PubChem10M_SMILES_BPE_50k`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:118 +msgid "``microsoft/codebert-base-mlm``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:118 +msgid "Please refer to: `microsoft/codebert-base-mlm`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:121 +msgid "``textattack/roberta-base-MNLI``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:121 +msgid "Please refer to: `textattack/roberta-base-MNLI`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:124 +msgid "``cardiffnlp/twitter-roberta-base-offensive``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:124 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base-offensive`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:127 +msgid "``cross-encoder/stsb-roberta-large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:127 +msgid "Please refer to: `cross-encoder/stsb-roberta-large`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:130 +msgid "``seyonec/ChemBERTa_zinc250k_v2_40k``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:130 +msgid "Please refer to: `seyonec/ChemBERTa_zinc250k_v2_40k`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:133 +msgid "``uklfr/gottbert-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:133 +msgid "German" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:133 +msgid "Please refer to: `uklfr/gottbert-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:136 +msgid "``seyonec/ChemBERTa-zinc-base-v1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:136 +msgid "Please refer to: `seyonec/ChemBERTa-zinc-base-v1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:139 +msgid "``roberta-large-openai-detector``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:139 +msgid "Please refer to: 
`roberta-large-openai-detector`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:142 +msgid "``cross-encoder/quora-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:142 +msgid "Please refer to: `cross-encoder/quora-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:145 +msgid "``cross-encoder/stsb-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:145 +msgid "Please refer to: `cross-encoder/stsb-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:148 +msgid "``microsoft/graphcodebert-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:148 +msgid "Please refer to: `microsoft/graphcodebert-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:151 +msgid "``cardiffnlp/twitter-roberta-base-hate``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:151 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base-hate`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:154 +msgid "``chkla/roberta-argument``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:154 +msgid "Please refer to: `chkla/roberta-argument`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:157 +msgid "``Salesforce/grappa_large_jnt``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:157 +msgid "Please refer to: `Salesforce/grappa_large_jnt`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:160 +msgid "``vinai/bertweet-large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:160 +msgid "Please refer to: `vinai/bertweet-large`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:163 +msgid "``allenai/biomed_roberta_base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:163 +msgid "Please refer to: `allenai/biomed_roberta_base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:166 +msgid "``facebook/muppet-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:166 +msgid "Please refer to: `facebook/muppet-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:169 +msgid "``Rakib/roberta-base-on-cuad``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:169 +msgid "Please refer to: `Rakib/roberta-base-on-cuad`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:172 +msgid "``cross-encoder/stsb-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:172 +msgid "Please refer to: `cross-encoder/stsb-distilroberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:175 +msgid "``nyu-mll/roberta-base-1B-1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:175 +msgid "Please refer to: `nyu-mll/roberta-base-1B-1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:178 +msgid "``nyu-mll/roberta-med-small-1M-1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:178 +msgid "Please refer to: `nyu-mll/roberta-med-small-1M-1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:181 +msgid "``SkolkovoInstitute/roberta_toxicity_classifier``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:181 +msgid "Please refer to: `SkolkovoInstitute/roberta_toxicity_classifier`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:184 +msgid "``facebook/muppet-roberta-large``" +msgstr "" + +#: 
../model_zoo/transformers/RoBERTa/contents.rst:184 +msgid "Please refer to: `facebook/muppet-roberta-large`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:187 +msgid "``lassl/roberta-ko-small``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:187 +msgid "Korean" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:187 +msgid "Please refer to: `lassl/roberta-ko-small`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:190 +msgid "``huggingface/CodeBERTa-language-id``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:190 +msgid "Please refer to: `huggingface/CodeBERTa-language-id`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:193 +msgid "``textattack/roberta-base-imdb``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:193 +msgid "Please refer to: `textattack/roberta-base-imdb`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:196 +msgid "``macedonizer/mk-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:196 +msgid "Macedonian" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:196 +msgid "Please refer to: `macedonizer/mk-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:199 +msgid "``cross-encoder/nli-MiniLM2-L6-H768``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:199 +msgid "Please refer to: `cross-encoder/nli-MiniLM2-L6-H768`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:202 +msgid "``textattack/roberta-base-QNLI``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:202 +msgid "Please refer to: `textattack/roberta-base-QNLI`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:205 +msgid "``deepset/roberta-base-squad2-covid``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:205 +msgid "Please refer to: `deepset/roberta-base-squad2-covid`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:208 +msgid "``textattack/roberta-base-MRPC``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:208 +msgid "Please refer to: `textattack/roberta-base-MRPC`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:211 +msgid "``bhadresh-savani/roberta-base-emotion``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:211 +msgid "Please refer to: `bhadresh-savani/roberta-base-emotion`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:214 +msgid "``aychang/roberta-base-imdb``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:214 +msgid "Please refer to: `aychang/roberta-base-imdb`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:217 +msgid "``cross-encoder/quora-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:217 +msgid "Please refer to: `cross-encoder/quora-distilroberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:220 +msgid "``csarron/roberta-base-squad-v1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:220 +msgid "Please refer to: `csarron/roberta-base-squad-v1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:223 +msgid "``seyonec/ChemBERTA_PubChem1M_shard00_155k``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:223 +msgid "Please refer to: `seyonec/ChemBERTA_PubChem1M_shard00_155k`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:226 +msgid "``mental/mental-roberta-base``" +msgstr "" + +#: 
../model_zoo/transformers/RoBERTa/contents.rst:226 +msgid "Please refer to: `mental/mental-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:229 +msgid "``textattack/roberta-base-CoLA``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:229 +msgid "Please refer to: `textattack/roberta-base-CoLA`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:232 +msgid "``navteca/quora-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:232 +msgid "Please refer to: `navteca/quora-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:235 +msgid "``cardiffnlp/twitter-roberta-base-emoji``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:235 +msgid "Please refer to: `cardiffnlp/twitter-roberta-base-emoji`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:238 +msgid "``benjamin/roberta-base-wechsel-german``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:238 +msgid "Multilingual" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:238 +msgid "Please refer to: `benjamin/roberta-base-wechsel-german`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:241 +msgid "``textattack/roberta-base-ag-news``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:241 +msgid "Please refer to: `textattack/roberta-base-ag-news`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:244 +msgid "``johngiorgi/declutr-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:244 +msgid "Please refer to: `johngiorgi/declutr-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:247 +msgid "``salesken/query_wellformedness_score``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:247 +msgid "Please refer to: `salesken/query_wellformedness_score`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:250 +msgid "``blinoff/roberta-base-russian-v0``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:250 +msgid "Russian" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:250 +msgid "Please refer to: `blinoff/roberta-base-russian-v0`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:253 +msgid "``allenai/reviews_roberta_base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:253 +msgid "Please refer to: `allenai/reviews_roberta_base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:256 +msgid "``ruiqi-zhong/roberta-base-meta-tuning-test``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:256 +msgid "Please refer to: `ruiqi-zhong/roberta-base-meta-tuning-test`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:259 +msgid "``mrm8488/distilroberta-finetuned-tweets-hate-speech``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:259 +msgid "Please refer to: `mrm8488/distilroberta-finetuned-tweets-hate-speech`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:262 +msgid "``cointegrated/roberta-large-cola-krishna2020``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:262 +msgid "Please refer to: `cointegrated/roberta-large-cola-krishna2020`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:265 +msgid "``deepset/roberta-base-squad2-distilled``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:265 +msgid "Please refer to: `deepset/roberta-base-squad2-distilled`_" +msgstr "" + +#: 
../model_zoo/transformers/RoBERTa/contents.rst:268 +msgid "``tli8hf/unqover-roberta-base-squad``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:268 +msgid "Please refer to: `tli8hf/unqover-roberta-base-squad`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:271 +msgid "``cross-encoder/nli-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:271 +msgid "Please refer to: `cross-encoder/nli-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:274 +msgid "``nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:274 +msgid "Please refer to: `nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:277 +msgid "``seyonec/BPE_SELFIES_PubChem_shard00_160k``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:277 +msgid "Please refer to: `seyonec/BPE_SELFIES_PubChem_shard00_160k`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:280 +msgid "``CLTL/MedRoBERTa.nl``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:280 +msgid "Dutch" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:280 +msgid "Please refer to: `CLTL/MedRoBERTa.nl`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:283 +msgid "``HooshvareLab/roberta-fa-zwnj-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:283 +msgid "Persian" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:283 +msgid "Please refer to: `HooshvareLab/roberta-fa-zwnj-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:286 +msgid "``nyu-mll/roberta-base-100M-1``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:286 +msgid "Please refer to: `nyu-mll/roberta-base-100M-1`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:289 +msgid "``deepset/tinyroberta-squad2``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:289 +msgid "Please refer to: `deepset/tinyroberta-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:292 +msgid "``youscan/ukr-roberta-base``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:292 +msgid "Ukrainian" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:292 +msgid "Please refer to: `youscan/ukr-roberta-base`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:295 +msgid "``navteca/roberta-base-squad2``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:295 +msgid "Please refer to: `navteca/roberta-base-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:298 +msgid "``bertin-project/bertin-roberta-base-spanish``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:298 +msgid "Spanish" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:298 +msgid "Please refer to: `bertin-project/bertin-roberta-base-spanish`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:301 +msgid "``shiyue/roberta-large-tac08``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:301 +msgid "Please refer to: `shiyue/roberta-large-tac08`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:304 +msgid "``softcatala/julibert``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:304 +msgid "Catalan" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:304 +msgid "Please refer to: `softcatala/julibert`_" +msgstr 
"" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:307 +msgid "``elozano/tweet_sentiment_eval``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:307 +msgid "Please refer to: `elozano/tweet_sentiment_eval`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:310 +msgid "``cahya/roberta-base-indonesian-1.5G``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:310 +msgid "Indonesian" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:310 +msgid "Please refer to: `cahya/roberta-base-indonesian-1.5G`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:313 +msgid "``elozano/tweet_emotion_eval``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:313 +msgid "Please refer to: `elozano/tweet_emotion_eval`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:316 +msgid "``navteca/roberta-large-squad2``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:316 +msgid "Please refer to: `navteca/roberta-large-squad2`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:319 +msgid "``elozano/tweet_offensive_eval``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:319 +msgid "Please refer to: `elozano/tweet_offensive_eval`_" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:322 +msgid "``ynie/roberta-large_conv_contradiction_detector_v0``" +msgstr "" + +#: ../model_zoo/transformers/RoBERTa/contents.rst:322 +msgid "Please refer to: `ynie/roberta-large_conv_contradiction_detector_v0`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoFormer/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoFormer/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..427d290450dcf86298b059c775e4e8cc1ad88314 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/RoFormer/contents.po @@ -0,0 +1,155 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/RoFormer/contents.rst:5 +msgid "RoFormer模型汇总" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:7 +msgid "下表汇总介绍了目前PaddleNLP支持的RoFormer模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:11 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:11 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:11 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:13 +msgid "``roformer-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:13 +#: ../model_zoo/transformers/RoFormer/contents.rst:17 +#: ../model_zoo/transformers/RoFormer/contents.rst:21 +#: ../model_zoo/transformers/RoFormer/contents.rst:25 +#: ../model_zoo/transformers/RoFormer/contents.rst:29 +#: ../model_zoo/transformers/RoFormer/contents.rst:33 +#: ../model_zoo/transformers/RoFormer/contents.rst:37 +#: ../model_zoo/transformers/RoFormer/contents.rst:41 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:13 +msgid "" +"6-layer, 384-hidden, 6-heads, 30M parameters. Roformer Small Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:17 +msgid "``roformer-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:17 +msgid "" +"12-layer, 768-hidden, 12-heads, 124M parameters. Roformer Base Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:21 +msgid "``roformer-chinese-char-small``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:21 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Small" +" model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:25 +msgid "``roformer-chinese-char-base``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:25 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char " +"Base model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:29 +msgid "``roformer-chinese-sim-char-ft-small``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:29 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Ft " +"Small model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:33 +msgid "``roformer-chinese-sim-char-ft-base``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:33 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char Ft " +"Base model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:37 +msgid "``roformer-chinese-sim-char-small``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:37 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Sim Char " +"Small model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:41 +msgid "``roformer-chinese-sim-char-base``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:41 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. 
Roformer Chinese Sim Char" +" Base model." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:45 +msgid "``roformer-english-small-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:45 +#: ../model_zoo/transformers/RoFormer/contents.rst:49 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:45 +msgid "" +"12-layer, 256-hidden, 4-heads, 13M parameters. Roformer English Small " +"Discriminator." +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:49 +msgid "``roformer-english-small-generator``" +msgstr "" + +#: ../model_zoo/transformers/RoFormer/contents.rst:49 +msgid "" +"12-layer, 64-hidden, 1-heads, 5M parameters. Roformer English Small " +"Generator." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SKEP/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SKEP/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..d8ec9283d84991ce7c69ef9ed548b784a7807b9d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SKEP/contents.po @@ -0,0 +1,78 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/SKEP/contents.rst:5 +msgid "SKEP模型汇总" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的SKEP模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:15 +msgid "``skep_ernie_1.0_large_ch``" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:15 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:15 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_1.0``" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:20 +msgid "``skep_ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:20 +#: ../model_zoo/transformers/SKEP/contents.rst:25 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:20 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:25 +msgid "``skep_roberta_large_en``" +msgstr "" + +#: ../model_zoo/transformers/SKEP/contents.rst:25 +msgid "" +"24-layer, 1024-hidden, 16-heads, 355M parameters. 
Trained using the " +"RoBERTa model ``roberta_large_en``" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SqueezeBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SqueezeBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..614040be5af641feb7d9c449c6a9b63b05c198c9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/SqueezeBert/contents.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:5 +msgid "SqueezeBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的SqueezeBert模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:15 +msgid "``squeezebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:15 +#: ../model_zoo/transformers/SqueezeBert/contents.rst:19 +#: ../model_zoo/transformers/SqueezeBert/contents.rst:23 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Uncased model." +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:19 +msgid "``squeezebert-mnli``" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:19 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli model." +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:23 +msgid "``squeezebert-mnli-headless``" +msgstr "" + +#: ../model_zoo/transformers/SqueezeBert/contents.rst:23 +msgid "" +"12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli Headless" +" model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/T5/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/T5/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..7c4e944d04ebe0081650789c7563f766375c1db0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/T5/contents.po @@ -0,0 +1,458 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/T5/contents.rst:5 +msgid "T5模型汇总" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的T5模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:15 +msgid "``t5-small``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:15 +#: ../model_zoo/transformers/T5/contents.rst:19 +#: ../model_zoo/transformers/T5/contents.rst:23 +#: ../model_zoo/transformers/T5/contents.rst:27 +#: ../model_zoo/transformers/T5/contents.rst:30 +#: ../model_zoo/transformers/T5/contents.rst:36 +#: ../model_zoo/transformers/T5/contents.rst:39 +#: ../model_zoo/transformers/T5/contents.rst:42 +#: ../model_zoo/transformers/T5/contents.rst:45 +#: ../model_zoo/transformers/T5/contents.rst:48 +#: ../model_zoo/transformers/T5/contents.rst:51 +#: ../model_zoo/transformers/T5/contents.rst:54 +#: ../model_zoo/transformers/T5/contents.rst:57 +#: ../model_zoo/transformers/T5/contents.rst:60 +#: ../model_zoo/transformers/T5/contents.rst:63 +#: ../model_zoo/transformers/T5/contents.rst:66 +#: ../model_zoo/transformers/T5/contents.rst:72 +#: ../model_zoo/transformers/T5/contents.rst:75 +#: ../model_zoo/transformers/T5/contents.rst:78 +#: ../model_zoo/transformers/T5/contents.rst:81 +#: ../model_zoo/transformers/T5/contents.rst:84 +#: ../model_zoo/transformers/T5/contents.rst:87 +#: ../model_zoo/transformers/T5/contents.rst:90 +#: ../model_zoo/transformers/T5/contents.rst:93 +#: ../model_zoo/transformers/T5/contents.rst:96 +#: ../model_zoo/transformers/T5/contents.rst:99 +#: ../model_zoo/transformers/T5/contents.rst:102 +#: ../model_zoo/transformers/T5/contents.rst:105 +#: ../model_zoo/transformers/T5/contents.rst:111 +#: ../model_zoo/transformers/T5/contents.rst:114 +#: ../model_zoo/transformers/T5/contents.rst:117 +#: ../model_zoo/transformers/T5/contents.rst:120 +#: ../model_zoo/transformers/T5/contents.rst:123 +#: ../model_zoo/transformers/T5/contents.rst:126 +#: ../model_zoo/transformers/T5/contents.rst:129 +#: ../model_zoo/transformers/T5/contents.rst:132 +#: ../model_zoo/transformers/T5/contents.rst:135 +#: ../model_zoo/transformers/T5/contents.rst:138 +#: ../model_zoo/transformers/T5/contents.rst:144 +#: ../model_zoo/transformers/T5/contents.rst:147 +#: ../model_zoo/transformers/T5/contents.rst:150 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:15 +msgid "6-layer, 512-hidden, 8-heads, 93M parameters. T5 small model." +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:19 +msgid "``t5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:19 +msgid "12-layer, 768-hidden, 12-heads, 272M parameters. T5 base model." +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:23 +msgid "``t5-large``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:23 +msgid "24-layer, 1024-hidden, 16-heads, 803M parameters. T5 large model." 
+msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:27 +msgid "``t5-v1_1-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:27 +msgid "Please refer to: t5-v1_1-base_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:30 +msgid "``t5-v1_1-large``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:30 +msgid "Please refer to: t5-v1_1-large_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:33 +msgid "``Langboat/mengzi-t5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:33 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:33 +msgid "Please refer to: `Langboat/mengzi-t5-base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:36 +msgid "``deep-learning-analytics/wikihow-t5-small``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:36 +msgid "Please refer to: `deep-learning-analytics/wikihow-t5-small`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:39 +msgid "``sberbank-ai/ruT5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:39 +msgid "Please refer to: `sberbank-ai/ruT5-base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:42 +msgid "``Michau/t5-base-en-generate-headline``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:42 +msgid "Please refer to: `Michau/t5-base-en-generate-headline`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:45 +msgid "``google/t5-v1_1-small``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:45 +msgid "Please refer to: `google/t5-v1_1-small`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:48 +msgid "``prithivida/parrot_paraphraser_on_T5``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:48 +msgid "Please refer to: `prithivida/parrot_paraphraser_on_T5`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:51 +msgid "``prithivida/grammar_error_correcter_v1``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:51 +msgid "Please refer to: `prithivida/grammar_error_correcter_v1`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:54 +msgid "``valhalla/t5-small-qg-hl``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:54 +msgid "Please refer to: `valhalla/t5-small-qg-hl`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:57 +msgid "``valhalla/t5-small-qa-qg-hl``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:57 +msgid "Please refer to: `valhalla/t5-small-qa-qg-hl`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:60 +msgid "``ramsrigouthamg/t5-large-paraphraser-diverse-high-quality``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:60 +msgid "" +"Please refer to: `ramsrigouthamg/t5-large-paraphraser-diverse-high-" +"quality`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:63 +msgid "``mrm8488/t5-base-finetuned-common_gen``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:63 +msgid "Please refer to: `mrm8488/t5-base-finetuned-common_gen`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:66 +msgid "``valhalla/t5-small-e2e-qg``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:66 +msgid "Please refer to: `valhalla/t5-small-e2e-qg`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:69 +msgid "``sonoisa/t5-base-japanese``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:69 +msgid "japanese" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:69 +msgid "Please refer to: `sonoisa/t5-base-japanese`_" +msgstr "" + +#: 
../model_zoo/transformers/T5/contents.rst:72 +msgid "``google/t5-base-lm-adapt``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:72 +msgid "Please refer to: `google/t5-base-lm-adapt`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:75 +msgid "``google/t5-small-lm-adapt``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:75 +msgid "Please refer to: `google/t5-small-lm-adapt`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:78 +msgid "``valhalla/t5-small-qg-prepend``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:78 +msgid "Please refer to: `valhalla/t5-small-qg-prepend`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:81 +msgid "``prithivida/informal_to_formal_styletransfer``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:81 +msgid "Please refer to: `prithivida/informal_to_formal_styletransfer`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:84 +msgid "``KETI-AIR/ke-t5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:84 +msgid "Please refer to: `KETI-AIR/ke-t5-base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:87 +msgid "``nielsr/nt5-small-rc1``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:87 +msgid "Please refer to: `nielsr/nt5-small-rc1`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:90 +msgid "``snrspeaks/t5-one-line-summary``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:90 +msgid "Please refer to: `snrspeaks/t5-one-line-summary`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:93 +msgid "``mrm8488/t5-small-finetuned-quora-for-paraphrasing``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:93 +msgid "Please refer to: `mrm8488/t5-small-finetuned-quora-for-paraphrasing`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:96 +msgid "``p-christ/12412fsasf``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:96 +msgid "Please refer to: `p-christ/12412fsasf`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:99 +msgid "``tscholak/3vnuv1vf``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:99 +msgid "Please refer to: `tscholak/3vnuv1vf`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:102 +msgid "``tennessejoyce/titlewave-t5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:102 +msgid "Please refer to: `tennessejoyce/titlewave-t5-base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:105 +msgid "``vennify/t5-base-grammar-correction``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:105 +msgid "Please refer to: `vennify/t5-base-grammar-correction`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:108 +msgid "``megagonlabs/t5-base-japanese-web``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:108 +#: ../model_zoo/transformers/T5/contents.rst:141 +msgid "Japanese" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:108 +msgid "Please refer to: `megagonlabs/t5-base-japanese-web`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:111 +msgid "``sberbank-ai/ruT5-large``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:111 +msgid "Please refer to: `sberbank-ai/ruT5-large`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:114 +msgid "``tscholak/t5.1.1.lm100k.base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:114 +msgid "Please refer to: `tscholak/t5.1.1.lm100k.base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:117 +msgid 
"``deep-learning-analytics/GrammarCorrector``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:117 +msgid "Please refer to: `deep-learning-analytics/GrammarCorrector`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:120 +msgid "``ThomasNLG/t5-qa_squad2neg-en``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:120 +msgid "Please refer to: `ThomasNLG/t5-qa_squad2neg-en`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:123 +msgid "``flexudy/t5-small-wav2vec2-grammar-fixer``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:123 +msgid "Please refer to: `flexudy/t5-small-wav2vec2-grammar-fixer`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:126 +msgid "``KETI-AIR/ke-t5-small``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:126 +msgid "Please refer to: `KETI-AIR/ke-t5-small`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:129 +msgid "``razent/SciFive-large-Pubmed_PMC``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:129 +msgid "Please refer to: `razent/SciFive-large-Pubmed_PMC`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:132 +msgid "``google/t5-large-ssm-nq``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:132 +msgid "Please refer to: `google/t5-large-ssm-nq`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:135 +msgid "``ozcangundes/T5-base-for-BioQA``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:135 +msgid "Please refer to: `ozcangundes/T5-base-for-BioQA`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:138 +msgid "``Rostlab/prot_t5_base_mt_uniref50``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:138 +msgid "Please refer to: `Rostlab/prot_t5_base_mt_uniref50`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:141 +msgid "``sonoisa/t5-base-japanese-question-generation``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:141 +msgid "Please refer to: `sonoisa/t5-base-japanese-question-generation`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:144 +msgid "``Wikidepia/IndoT5-base``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:144 +msgid "Please refer to: `Wikidepia/IndoT5-base`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:147 +msgid "``razent/SciFive-base-Pubmed_PMC``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:147 +msgid "Please refer to: `razent/SciFive-base-Pubmed_PMC`_" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:150 +msgid "``google/t5-small-ssm-nq``" +msgstr "" + +#: ../model_zoo/transformers/T5/contents.rst:150 +msgid "Please refer to: `google/t5-small-ssm-nq`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/TinyBert/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/TinyBert/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..c62e09a836b76bb956ffcf9d684d1a152f2e4bb7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/TinyBert/contents.po @@ -0,0 +1,91 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/TinyBert/contents.rst:5 +msgid "TinyBert模型汇总" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的TinyBert模型以及对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:15 +msgid "``tinybert-4l-312d``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:15 +#: ../model_zoo/transformers/TinyBert/contents.rst:20 +#: ../model_zoo/transformers/TinyBert/contents.rst:25 +#: ../model_zoo/transformers/TinyBert/contents.rst:30 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:15 +#: ../model_zoo/transformers/TinyBert/contents.rst:25 +#: ../model_zoo/transformers/TinyBert/contents.rst:35 +msgid "" +"4-layer, 312-hidden, 12-heads, 14.5M parameters. The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:20 +msgid "``tinybert-6l-768d``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:20 +#: ../model_zoo/transformers/TinyBert/contents.rst:30 +#: ../model_zoo/transformers/TinyBert/contents.rst:40 +msgid "" +"6-layer, 768-hidden, 12-heads, 67M parameters. The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:25 +msgid "``tinybert-4l-312d-v2``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:30 +msgid "``tinybert-6l-768d-v2``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:35 +msgid "``tinybert-4l-312d-zh``" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:35 +#: ../model_zoo/transformers/TinyBert/contents.rst:40 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/TinyBert/contents.rst:40 +msgid "``tinybert-6l-768d-zh``" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UNIMO/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UNIMO/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..356f04e285cb6b561cb84360d50d19e6920c6ff1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UNIMO/contents.po @@ -0,0 +1,73 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/UNIMO/contents.rst:5 +msgid "UNIMO模型汇总" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的UNIMO模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:15 +msgid "``unimo-text-1.0``" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:15 +#: ../model_zoo/transformers/UNIMO/contents.rst:19 +#: ../model_zoo/transformers/UNIMO/contents.rst:23 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 99M parameters. UNIMO-text-1.0 model." +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:19 +msgid "``unimo-text-1.0-lcsts-new``" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:19 +msgid "" +"12-layer, 768-hidden, 12-heads, 99M parameters. Finetuned on lcsts_new " +"dataset." +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:23 +msgid "``unimo-text-1.0-large``" +msgstr "" + +#: ../model_zoo/transformers/UNIMO/contents.rst:23 +msgid "" +"24-layer, 768-hidden, 16-heads, 316M parameters. UNIMO-text-1.0 large " +"model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UnifiedTransformer/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UnifiedTransformer/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..9cfd0c77fec1d3ff81d53668927941217d3624cc --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/UnifiedTransformer/contents.po @@ -0,0 +1,80 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:5 +msgid "UnifiedTransformer模型汇总" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的UnifiedTransformer模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:15 +msgid "``unified_transformer-12L-cn``" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:15 +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:19 +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:23 +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:27 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:19 +msgid "``unified_transformer-12L-cn-luge``" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:19 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text " +"(LUGE.ai)." +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:23 +msgid "``plato-mini``" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:23 +msgid "6-layer, 768-hidden, 12-heads, 66M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:27 +msgid "``plato-xl``" +msgstr "" + +#: ../model_zoo/transformers/UnifiedTransformer/contents.rst:27 +msgid "72-layer, 3072-hidden, 32-heads, ?M parameters. Trained on Chinese text." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/XLNet/contents.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/XLNet/contents.po new file mode 100644 index 0000000000000000000000000000000000000000..8dee56236538b9ca29232e34af9ce1e3d0da5703 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/XLNet/contents.po @@ -0,0 +1,94 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/XLNet/contents.rst:5 +msgid "XLNet模型汇总" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:9 +msgid "下表汇总介绍了目前PaddleNLP支持的XLNet模型对应预训练权重。 关于模型的具体细节可以参考对应链接。" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:13 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:13 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:13 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:15 +msgid "``xlnet-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:15 +#: ../model_zoo/transformers/XLNet/contents.rst:19 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:15 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model." +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:19 +msgid "``xlnet-large-cased``" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:19 +msgid "" +"24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English " +"model." +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:23 +msgid "``chinese-xlnet-base``" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:23 +#: ../model_zoo/transformers/XLNet/contents.rst:27 +#: ../model_zoo/transformers/XLNet/contents.rst:31 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:23 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. XLNet Chinese model." +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:27 +msgid "``chinese-xlnet-mid``" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:27 +msgid "" +"24-layer, 768-hidden, 12-heads, 209M parameters. XLNet Medium Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:31 +msgid "``chinese-xlnet-large``" +msgstr "" + +#: ../model_zoo/transformers/XLNet/contents.rst:31 +msgid "24-layer, 1024-hidden, 16-heads, _M parameters. XLNet Large Chinese model." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/model_zoo/transformers/all/transformers.po b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/all/transformers.po new file mode 100644 index 0000000000000000000000000000000000000000..a76d68bd4804a3e9e86bed8f438a2111bb0a1e73 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/model_zoo/transformers/all/transformers.po @@ -0,0 +1,2090 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../model_zoo/transformers/all/transformers.rst:2 +msgid "PaddleNLP Transformer API" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:4 +msgid "" +"随着深度学习的发展,NLP领域涌现了一大批高质量的Transformer类预训练模型,多次刷新各种NLP任务SOTA(State of the " +"Art)。 PaddleNLP为用户提供了常用的 " +"``BERT``、``ERNIE``、``ALBERT``、``RoBERTa``、``XLNet`` 等经典结构预训练模型, " +"让开发者能够方便快捷应用各类Transformer预训练模型及其下游任务。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:10 +msgid "Transformer预训练模型汇总" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:14 +msgid "" +"下表汇总了介绍了目前PaddleNLP支持的各类预训练模型以及对应预训练权重。我们目前提供了 **32** 种网络结构, **136** " +"种预训练的参数权重供用户使用, 其中包含了 **59** 种中文语言模型的预训练权重。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:18 +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:18 +msgid "Pretrained Weight" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:18 +msgid "Language" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:18 +msgid "Details of the model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:20 +#: ../model_zoo/transformers/all/transformers.rst:666 +msgid "ALBERT_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:20 +msgid "``albert-base-v1``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:20 +#: ../model_zoo/transformers/all/transformers.rst:24 +#: ../model_zoo/transformers/all/transformers.rst:28 +#: ../model_zoo/transformers/all/transformers.rst:32 +#: ../model_zoo/transformers/all/transformers.rst:36 +#: ../model_zoo/transformers/all/transformers.rst:40 +#: ../model_zoo/transformers/all/transformers.rst:44 +#: ../model_zoo/transformers/all/transformers.rst:48 +#: ../model_zoo/transformers/all/transformers.rst:76 +#: ../model_zoo/transformers/all/transformers.rst:80 +#: ../model_zoo/transformers/all/transformers.rst:84 +#: ../model_zoo/transformers/all/transformers.rst:88 +#: ../model_zoo/transformers/all/transformers.rst:92 +#: ../model_zoo/transformers/all/transformers.rst:96 +#: ../model_zoo/transformers/all/transformers.rst:148 +#: ../model_zoo/transformers/all/transformers.rst:196 +#: ../model_zoo/transformers/all/transformers.rst:200 +#: ../model_zoo/transformers/all/transformers.rst:204 +#: ../model_zoo/transformers/all/transformers.rst:208 +#: ../model_zoo/transformers/all/transformers.rst:212 +#: ../model_zoo/transformers/all/transformers.rst:216 +#: ../model_zoo/transformers/all/transformers.rst:220 +#: ../model_zoo/transformers/all/transformers.rst:224 +#: ../model_zoo/transformers/all/transformers.rst:228 +#: ../model_zoo/transformers/all/transformers.rst:232 +#: ../model_zoo/transformers/all/transformers.rst:236 +#: ../model_zoo/transformers/all/transformers.rst:241 +#: ../model_zoo/transformers/all/transformers.rst:246 +#: ../model_zoo/transformers/all/transformers.rst:252 +#: ../model_zoo/transformers/all/transformers.rst:256 +#: ../model_zoo/transformers/all/transformers.rst:260 +#: ../model_zoo/transformers/all/transformers.rst:264 +#: ../model_zoo/transformers/all/transformers.rst:300 +#: 
../model_zoo/transformers/all/transformers.rst:304 +#: ../model_zoo/transformers/all/transformers.rst:308 +#: ../model_zoo/transformers/all/transformers.rst:316 +#: ../model_zoo/transformers/all/transformers.rst:320 +#: ../model_zoo/transformers/all/transformers.rst:324 +#: ../model_zoo/transformers/all/transformers.rst:328 +#: ../model_zoo/transformers/all/transformers.rst:351 +#: ../model_zoo/transformers/all/transformers.rst:355 +#: ../model_zoo/transformers/all/transformers.rst:359 +#: ../model_zoo/transformers/all/transformers.rst:363 +#: ../model_zoo/transformers/all/transformers.rst:367 +#: ../model_zoo/transformers/all/transformers.rst:371 +#: ../model_zoo/transformers/all/transformers.rst:375 +#: ../model_zoo/transformers/all/transformers.rst:379 +#: ../model_zoo/transformers/all/transformers.rst:387 +#: ../model_zoo/transformers/all/transformers.rst:391 +#: ../model_zoo/transformers/all/transformers.rst:395 +#: ../model_zoo/transformers/all/transformers.rst:399 +#: ../model_zoo/transformers/all/transformers.rst:403 +#: ../model_zoo/transformers/all/transformers.rst:407 +#: ../model_zoo/transformers/all/transformers.rst:411 +#: ../model_zoo/transformers/all/transformers.rst:415 +#: ../model_zoo/transformers/all/transformers.rst:420 +#: ../model_zoo/transformers/all/transformers.rst:425 +#: ../model_zoo/transformers/all/transformers.rst:430 +#: ../model_zoo/transformers/all/transformers.rst:434 +#: ../model_zoo/transformers/all/transformers.rst:454 +#: ../model_zoo/transformers/all/transformers.rst:457 +#: ../model_zoo/transformers/all/transformers.rst:476 +#: ../model_zoo/transformers/all/transformers.rst:480 +#: ../model_zoo/transformers/all/transformers.rst:484 +#: ../model_zoo/transformers/all/transformers.rst:488 +#: ../model_zoo/transformers/all/transformers.rst:536 +#: ../model_zoo/transformers/all/transformers.rst:540 +#: ../model_zoo/transformers/all/transformers.rst:549 +#: ../model_zoo/transformers/all/transformers.rst:554 +#: ../model_zoo/transformers/all/transformers.rst:559 +#: ../model_zoo/transformers/all/transformers.rst:563 +#: ../model_zoo/transformers/all/transformers.rst:567 +#: ../model_zoo/transformers/all/transformers.rst:571 +#: ../model_zoo/transformers/all/transformers.rst:575 +#: ../model_zoo/transformers/all/transformers.rst:579 +#: ../model_zoo/transformers/all/transformers.rst:583 +#: ../model_zoo/transformers/all/transformers.rst:588 +#: ../model_zoo/transformers/all/transformers.rst:593 +#: ../model_zoo/transformers/all/transformers.rst:598 +#: ../model_zoo/transformers/all/transformers.rst:637 +#: ../model_zoo/transformers/all/transformers.rst:641 +msgid "English" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:20 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:24 +msgid "``albert-large-v1``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:24 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:28 +msgid "``albert-xlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:28 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. 
ALBERT xlarge model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:32 +msgid "``albert-xxlarge-v1``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:32 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. ALBERT xxlarge model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:36 +msgid "``albert-base-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:36 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters." +" ALBERT base model (version2)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:40 +msgid "``albert-large-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:40 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M " +"parameters. ALBERT large model (version2)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:44 +msgid "``albert-xlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:44 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M " +"parameters. ALBERT xlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:48 +msgid "``albert-xxlarge-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:48 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 64-heads, 223M " +"parameters. ALBERT xxlarge model (version2)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:52 +msgid "``albert-chinese-tiny``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:52 +#: ../model_zoo/transformers/all/transformers.rst:56 +#: ../model_zoo/transformers/all/transformers.rst:60 +#: ../model_zoo/transformers/all/transformers.rst:64 +#: ../model_zoo/transformers/all/transformers.rst:68 +#: ../model_zoo/transformers/all/transformers.rst:72 +#: ../model_zoo/transformers/all/transformers.rst:112 +#: ../model_zoo/transformers/all/transformers.rst:117 +#: ../model_zoo/transformers/all/transformers.rst:123 +#: ../model_zoo/transformers/all/transformers.rst:129 +#: ../model_zoo/transformers/all/transformers.rst:133 +#: ../model_zoo/transformers/all/transformers.rst:137 +#: ../model_zoo/transformers/all/transformers.rst:154 +#: ../model_zoo/transformers/all/transformers.rst:159 +#: ../model_zoo/transformers/all/transformers.rst:164 +#: ../model_zoo/transformers/all/transformers.rst:169 +#: ../model_zoo/transformers/all/transformers.rst:173 +#: ../model_zoo/transformers/all/transformers.rst:268 +#: ../model_zoo/transformers/all/transformers.rst:272 +#: ../model_zoo/transformers/all/transformers.rst:276 +#: ../model_zoo/transformers/all/transformers.rst:280 +#: ../model_zoo/transformers/all/transformers.rst:284 +#: ../model_zoo/transformers/all/transformers.rst:288 +#: ../model_zoo/transformers/all/transformers.rst:292 +#: ../model_zoo/transformers/all/transformers.rst:296 +#: ../model_zoo/transformers/all/transformers.rst:312 +#: ../model_zoo/transformers/all/transformers.rst:333 +#: ../model_zoo/transformers/all/transformers.rst:337 +#: ../model_zoo/transformers/all/transformers.rst:342 +#: ../model_zoo/transformers/all/transformers.rst:346 +#: ../model_zoo/transformers/all/transformers.rst:383 +#: ../model_zoo/transformers/all/transformers.rst:438 +#: ../model_zoo/transformers/all/transformers.rst:442 +#: ../model_zoo/transformers/all/transformers.rst:446 +#: ../model_zoo/transformers/all/transformers.rst:450 +#: ../model_zoo/transformers/all/transformers.rst:460 +#: 
../model_zoo/transformers/all/transformers.rst:465 +#: ../model_zoo/transformers/all/transformers.rst:470 +#: ../model_zoo/transformers/all/transformers.rst:473 +#: ../model_zoo/transformers/all/transformers.rst:492 +#: ../model_zoo/transformers/all/transformers.rst:496 +#: ../model_zoo/transformers/all/transformers.rst:500 +#: ../model_zoo/transformers/all/transformers.rst:504 +#: ../model_zoo/transformers/all/transformers.rst:508 +#: ../model_zoo/transformers/all/transformers.rst:512 +#: ../model_zoo/transformers/all/transformers.rst:516 +#: ../model_zoo/transformers/all/transformers.rst:520 +#: ../model_zoo/transformers/all/transformers.rst:524 +#: ../model_zoo/transformers/all/transformers.rst:528 +#: ../model_zoo/transformers/all/transformers.rst:532 +#: ../model_zoo/transformers/all/transformers.rst:544 +#: ../model_zoo/transformers/all/transformers.rst:603 +#: ../model_zoo/transformers/all/transformers.rst:608 +#: ../model_zoo/transformers/all/transformers.rst:613 +#: ../model_zoo/transformers/all/transformers.rst:617 +#: ../model_zoo/transformers/all/transformers.rst:621 +#: ../model_zoo/transformers/all/transformers.rst:625 +#: ../model_zoo/transformers/all/transformers.rst:629 +#: ../model_zoo/transformers/all/transformers.rst:633 +#: ../model_zoo/transformers/all/transformers.rst:645 +#: ../model_zoo/transformers/all/transformers.rst:649 +#: ../model_zoo/transformers/all/transformers.rst:653 +msgid "Chinese" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:52 +msgid "" +"4 repeating layers, 128 embedding, 312-hidden, 12-heads, 4M parameters. " +"ALBERT tiny model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:56 +msgid "``albert-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:56 +msgid "" +"6 repeating layers, 128 embedding, 384-hidden, 12-heads, _M parameters. " +"ALBERT small model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:60 +msgid "``albert-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:60 +msgid "" +"12 repeating layers, 128 embedding, 768-hidden, 12-heads, 12M parameters." +" ALBERT base model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:64 +msgid "``albert-chinese-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:64 +msgid "" +"24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 18M " +"parameters. ALBERT large model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:68 +msgid "``albert-chinese-xlarge``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:68 +msgid "" +"24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 60M " +"parameters. ALBERT xlarge model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:72 +msgid "``albert-chinese-xxlarge``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:72 +msgid "" +"12 repeating layers, 128 embedding, 4096-hidden, 16-heads, 235M " +"parameters. ALBERT xxlarge model (Chinese)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:76 +#: ../model_zoo/transformers/all/transformers.rst:668 +msgid "BART_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:76 +msgid "``bart-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:76 +msgid "12-layer, 768-hidden, 12-heads, 217M parameters. 
BART base model (English)" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:80 +msgid "``bart-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:80 +msgid "" +"24-layer, 768-hidden, 16-heads, 509M parameters. BART large model " +"(English)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:84 +#: ../model_zoo/transformers/all/transformers.rst:670 +msgid "BERT_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:84 +msgid "``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:84 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:88 +msgid "``bert-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:88 +#: ../model_zoo/transformers/all/transformers.rst:308 +#: ../model_zoo/transformers/all/transformers.rst:324 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:92 +msgid "``bert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:92 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on cased English" +" text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:96 +msgid "``bert-large-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:96 +msgid "" +"24-layer, 1024-hidden, 16-heads, 335M parameters. Trained on cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:100 +msgid "``bert-base-multilingual-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:100 +#: ../model_zoo/transformers/all/transformers.rst:106 +#: ../model_zoo/transformers/all/transformers.rst:141 +msgid "Multilingual" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:100 +msgid "" +"12-layer, 768-hidden, 12-heads, 168M parameters. Trained on lower-cased " +"text in the top 102 languages with the largest Wikipedias." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:106 +msgid "``bert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:106 +msgid "" +"12-layer, 768-hidden, 12-heads, 179M parameters. Trained on cased text in" +" the top 104 languages with the largest Wikipedias." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:112 +msgid "``bert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:112 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:117 +msgid "``bert-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:117 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:123 +msgid "``bert-wwm-ext-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:123 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on cased Chinese" +" Simplified and Traditional text using Whole-Word-Masking with extended " +"data."
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:129 +msgid "``junnyu/ckiplab-bert-base-chinese-ner``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:129 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on NER task." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:133 +msgid "``junnyu/ckiplab-bert-base-chinese-pos``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:133 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on POS task." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:137 +msgid "``junnyu/ckiplab-bert-base-chinese-ws``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:137 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Finetuned on WS task." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:141 +msgid "``junnyu/nlptown-bert-base-multilingual-uncased-sentiment``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:141 +msgid "" +"12-layer, 768-hidden, 12-heads, 167M parameters. Finetuned for sentiment " +"analysis on product reviews in six languages: English, Dutch, German, " +"French, Spanish and Italian." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:148 +msgid "``junnyu/tbs17-MathBERT``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:148 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on pre-k to " +"graduate math language (English) using a masked language modeling (MLM) " +"objective." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:154 +msgid "``macbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:154 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:159 +msgid "``macbert-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:159 +msgid "" +"24-layer, 1024-hidden, 16-heads, 326M parameters. Trained with novel MLM " +"as correction pre-training task." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:164 +msgid "``simbert-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:164 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on 22 million " +"pairs of similar sentences crawed from Baidu Know." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:169 +msgid "``Langboat/mengzi-bert-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:169 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 300G Chinese " +"Corpus Datasets." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:173 +msgid "``Langboat/mengzi-bert-base-fin``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:173 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on 20G Finacial " +"Corpus, based on ``Langboat/mengzi-bert-base``." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:178 +msgid "BERT-Japanese_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:178 +msgid "``iverxin/bert-base-japanese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:178 +#: ../model_zoo/transformers/all/transformers.rst:182 +#: ../model_zoo/transformers/all/transformers.rst:187 +#: ../model_zoo/transformers/all/transformers.rst:191 +msgid "Japanese" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:178 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:182 +msgid "``iverxin/bert-base-japanese-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:182 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. Trained on Japanese text" +" using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:187 +msgid "``iverxin/bert-base-japanese-char``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:187 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:191 +msgid "``iverxin/bert-base-japanese-char-whole-word-masking``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:191 +msgid "" +"12-layer, 768-hidden, 12-heads, 89M parameters. Trained on Japanese char " +"text using Whole-Word-Masking." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:196 +#: ../model_zoo/transformers/all/transformers.rst:672 +msgid "BigBird_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:196 +msgid "``bigbird-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:196 +msgid "" +"12-layer, 768-hidden, 12-heads, 127M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:200 +#: ../model_zoo/transformers/all/transformers.rst:674 +msgid "Blenderbot_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:200 +msgid "``blenderbot-3B``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:200 +msgid "26-layer, 32-heads, 3B parameters. The Blenderbot base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:204 +msgid "``blenderbot-400M-distill``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:204 +msgid "" +"14-layer, 384-hidden, 32-heads, 400M parameters. The Blenderbot distil " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:208 +msgid "``blenderbot-1B-distill``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:208 +msgid "14-layer, 32-heads, 1478M parameters. The Blenderbot Distil 1B model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:212 +#: ../model_zoo/transformers/all/transformers.rst:676 +msgid "Blenderbot-Small_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:212 +msgid "``blenderbot_small-90M``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:212 +msgid "16-layer, 16-heads, 90M parameters. The Blenderbot small model." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:216 +#: ../model_zoo/transformers/all/transformers.rst:678 +msgid "ConvBert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:216 +msgid "``convbert-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:216 +msgid "12-layer, 768-hidden, 12-heads, 106M parameters. The ConvBERT base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:220 +msgid "``convbert-medium-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:220 +msgid "" +"12-layer, 384-hidden, 8-heads, 17M parameters. The ConvBERT medium small " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:224 +msgid "``convbert-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:224 +msgid "12-layer, 128-hidden, 4-heads, 13M parameters. The ConvBERT small model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:228 +#: ../model_zoo/transformers/all/transformers.rst:680 +msgid "CTRL_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:228 +msgid "``ctrl``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:228 +msgid "48-layer, 1280-hidden, 16-heads, 1701M parameters. The CTRL base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:232 +msgid "``sshleifer-tiny-ctrl``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:232 +msgid "2-layer, 16-hidden, 2-heads, 5M parameters. The Tiny CTRL model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:236 +#: ../model_zoo/transformers/all/transformers.rst:682 +msgid "DistilBert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:236 +msgid "``distilbert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:236 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:241 +msgid "``distilbert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:241 +msgid "" +"6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:246 +msgid "``distilbert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:246 +msgid "" +"6-layer, 768-hidden, 12-heads, 200M parameters. The DistilBERT model " +"distilled from the BERT model ``bert-base-multilingual-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:252 +msgid "``sshleifer-tiny-distilbert-base-uncase-finetuned-sst-2-english``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:252 +msgid "2-layer, 2-hidden, 2-heads, 50K parameters. The DistilBERT model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:256 +#: ../model_zoo/transformers/all/transformers.rst:684 +msgid "ELECTRA_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:256 +msgid "``electra-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:256 +msgid "" +"12-layer, 768-hidden, 4-heads, 14M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:260 +msgid "``electra-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:260 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. 
Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:264 +msgid "``electra-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:264 +msgid "" +"24-layer, 1024-hidden, 16-heads, 334M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:268 +msgid "``chinese-electra-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:268 +msgid "12-layer, 768-hidden, 4-heads, 12M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:272 +msgid "``chinese-electra-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:272 +#: ../model_zoo/transformers/all/transformers.rst:496 +msgid "12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:276 +msgid "``ernie-health-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:276 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on Chinese " +"medical corpus." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:280 +msgid "``junnyu/hfl-chinese-electra-180g-base-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:280 +msgid "" +"Discriminator, 12-layer, 768-hidden, 12-heads, 102M parameters. Trained " +"on 180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:284 +msgid "``junnyu/hfl-chinese-electra-180g-small-ex-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:284 +msgid "" +"Discriminator, 24-layer, 256-hidden, 4-heads, 24M parameters. Trained on " +"180g Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:288 +msgid "``junnyu/hfl-chinese-legal-electra-small-generator``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:288 +msgid "" +"Generator, 12-layer, 64-hidden, 1-heads, 3M parameters. Trained on " +"Chinese legal corpus." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:292 +#: ../model_zoo/transformers/all/transformers.rst:686 +msgid "ERNIE_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:292 +msgid "``ernie-3.0-medium-zh``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:292 +#: ../model_zoo/transformers/all/transformers.rst:312 +#: ../model_zoo/transformers/all/transformers.rst:333 +#: ../model_zoo/transformers/all/transformers.rst:438 +#: ../model_zoo/transformers/all/transformers.rst:613 +msgid "12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:296 +msgid "``ernie-tiny``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:296 +msgid "3-layer, 1024-hidden, 16-heads, _M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:300 +msgid "``ernie-2.0-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:300 +#: ../model_zoo/transformers/all/transformers.rst:316 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:304 +msgid "``ernie-2.0-en-finetuned-squad``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:304 +msgid "" +"12-layer, 768-hidden, 12-heads, 110M parameters. Trained on finetuned " +"squad text." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:308 +msgid "``ernie-2.0-large-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:312 +#: ../model_zoo/transformers/all/transformers.rst:688 +msgid "ERNIE-DOC_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:312 +msgid "``ernie-doc-base-zh``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:316 +msgid "``ernie-doc-base-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:320 +#: ../model_zoo/transformers/all/transformers.rst:690 +msgid "ERNIE-GEN_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:320 +msgid "``ernie-gen-base-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:320 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on lower-cased " +"English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:324 +msgid "``ernie-gen-large-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:328 +msgid "``ernie-gen-large-en-430g``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:328 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on lower-cased " +"English text. with extended data (430 GB)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:333 +#: ../model_zoo/transformers/all/transformers.rst:692 +msgid "ERNIE-GRAM_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:333 +msgid "``ernie-gram-zh``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:337 +msgid "``ernie-gram-zh-finetuned-dureader-robust``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:337 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text." +" Then finetuned on dreader-robust" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:342 +#: ../model_zoo/transformers/all/transformers.rst:694 +msgid "GPT_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:342 +msgid "``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:342 +msgid "32-layer, 2560-hidden, 32-heads, 2.6B parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:346 +msgid "``gpt-cpm-small-cn-distill``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:346 +msgid "" +"12-layer, 768-hidden, 12-heads, 109M parameters. The model distilled from" +" the GPT model ``gpt-cpm-large-cn``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:351 +msgid "``gpt2-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:351 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:355 +msgid "``gpt2-medium-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:355 +msgid "24-layer, 1024-hidden, 16-heads, 345M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:359 +msgid "``gpt2-large-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:359 +#: ../model_zoo/transformers/all/transformers.rst:379 +msgid "36-layer, 1280-hidden, 20-heads, 774M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:363 +msgid "``gpt2-xl-en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:363 +msgid "" +"48-layer, 1600-hidden, 25-heads, 1558M parameters. Trained on English " +"text." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:367 +msgid "``junnyu/distilgpt2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:367 +msgid "6-layer, 768-hidden, 12-heads, 81M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:371 +msgid "``junnyu/microsoft-DialoGPT-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:371 +#: ../model_zoo/transformers/all/transformers.rst:476 +msgid "12-layer, 768-hidden, 12-heads, 124M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:375 +msgid "``junnyu/microsoft-DialoGPT-medium``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:375 +msgid "24-layer, 1024-hidden, 16-heads, 354M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:379 +msgid "``junnyu/microsoft-DialoGPT-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:383 +msgid "``junnyu/uer-gpt2-chinese-poem``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:383 +msgid "" +"12-layer, 768-hidden, 12-heads, 103M parameters. Trained on Chinese " +"poetry corpus." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:387 +#: ../model_zoo/transformers/all/transformers.rst:696 +msgid "LayoutLM_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:387 +msgid "``layoutlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:387 +msgid "" +"12-layer, 768-hidden, 12-heads, 339M parameters. LayoutLm base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:391 +msgid "``layoutlm-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:391 +msgid "" +"24-layer, 1024-hidden, 16-heads, 51M parameters. LayoutLm large Uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:395 +#: ../model_zoo/transformers/all/transformers.rst:698 +msgid "LayoutLMV2_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:395 +msgid "``layoutlmv2-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:395 +msgid "" +"12-layer, 768-hidden, 12-heads, 200M parameters. LayoutLmv2 base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:399 +msgid "``layoutlmv2-large-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:399 +msgid "" +"24-layer, 1024-hidden, 16-heads, _M parameters. LayoutLmv2 large uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:403 +#: ../model_zoo/transformers/all/transformers.rst:700 +msgid "LayoutXLM_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:403 +msgid "``layoutxlm-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:403 +msgid "" +"12-layer, 768-hidden, 12-heads, 369M parameters. Layoutxlm base uncased " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:407 +msgid "MBart_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:407 +msgid "``mbart-large-cc25``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:407 +msgid "" +"12-layer, 1024-hidden, 12-heads, 1123M parameters. The ``mbart-large-" +"cc25`` model." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:411 +msgid "``mbart-large-en-ro``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:411 +msgid "" +"12-layer, 768-hidden, 16-heads, 1123M parameters. The ``mbart-large rn-" +"ro`` model ." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:415 +msgid "``mbart-large-50-one-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:415 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-one-" +"to-many-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:420 +msgid "``mbart-large-50-many-to-one-mmt``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:420 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-many-" +"to-one-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:425 +msgid "``mbart-large-50-many-to-many-mmt``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:425 +msgid "" +"12-layer, 1024-hidden, 16-heads, 1123M parameters. ``mbart-large-50-many-" +"to-many-mmt`` model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:430 +msgid "Mobilebert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:430 +msgid "``mobilebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:430 +msgid "24-layer, 512-hidden, 4-heads, 24M parameters. Mobilebert uncased Model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:434 +#: ../model_zoo/transformers/all/transformers.rst:706 +msgid "MPNet_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:434 +msgid "``mpnet-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:434 +msgid "12-layer, 768-hidden, 12-heads, 109M parameters. MPNet Base Model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:438 +#: ../model_zoo/transformers/all/transformers.rst:708 +msgid "NeZha_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:438 +msgid "``nezha-base-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:442 +msgid "``nezha-large-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:442 +#: ../model_zoo/transformers/all/transformers.rst:450 +msgid "24-layer, 1024-hidden, 16-heads, 336M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:446 +msgid "``nezha-base-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:446 +msgid "12-layer, 768-hidden, 16-heads, 108M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:450 +msgid "``nezha-large-wwm-chinese``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:454 +msgid "Reformer_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:454 +msgid "``reformer-enwik8``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:454 +msgid "12-layer, 1024-hidden, 8-heads, 148M parameters." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:457 +msgid "``reformer-crime-and-punishment``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:457 +msgid "6-layer, 256-hidden, 2-heads, 3M parameters." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:460 +#: ../model_zoo/transformers/all/transformers.rst:712 +msgid "RoBERTa_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:460 +msgid "``roberta-wwm-ext``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:460 +msgid "" +"12-layer, 768-hidden, 12-heads, 102M parameters. Trained on English Text " +"using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:465 +msgid "``roberta-wwm-ext-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:465 +msgid "" +"24-layer, 1024-hidden, 16-heads, 325M parameters. Trained on English Text" +" using Whole-Word-Masking with extended data." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:470 +msgid "``rbt3``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:470 +msgid "3-layer, 768-hidden, 12-heads, 38M parameters." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:473 +msgid "``rbtl3``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:473 +msgid "3-layer, 1024-hidden, 16-heads, 61M parameters." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:476 +msgid "``nosaydomore/deepset-roberta-base-squad2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:480 +msgid "``nosaydomore/roberta-en-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:480 +msgid "12-layer, 768-hidden, 12-heads, 163M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:484 +msgid "``nosaydomore/roberta-en-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:484 +msgid "24-layer, 1024-hidden, 16-heads, 408M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:488 +msgid "``nosaydomore/sshleifei-tiny-distilroberta-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:488 +msgid "2-layer, 2-hidden, 2-heads, 0.25M parameters. Trained on English text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:492 +msgid "``nosaydomore/uer-roberta-base-chn-extractive-qa``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:492 +#: ../model_zoo/transformers/all/transformers.rst:500 +msgid "12-layer, 768-hidden, 12-heads, 101M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:496 +msgid "``nosaydomore/uer-roberta-base-ft-chinanews-chn``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:500 +msgid "``nosaydomore/uer-roberta-base-ft-cluener2020-chn``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:504 +#: ../model_zoo/transformers/all/transformers.rst:714 +msgid "RoFormer_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:504 +msgid "``roformer-chinese-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:504 +msgid "" +"6-layer, 384-hidden, 6-heads, 30M parameters. Roformer Small Chinese " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:508 +msgid "``roformer-chinese-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:508 +msgid "" +"12-layer, 768-hidden, 12-heads, 124M parameters. Roformer Base Chinese " +"model." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:512 +msgid "``roformer-chinese-char-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:512 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Small" +" model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:516 +msgid "``roformer-chinese-char-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:516 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char " +"Base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:520 +msgid "``roformer-chinese-sim-char-ft-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:520 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Char Ft " +"Small model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:524 +msgid "``roformer-chinese-sim-char-ft-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:524 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Char Ft " +"Base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:528 +msgid "``roformer-chinese-sim-char-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:528 +msgid "" +"6-layer, 384-hidden, 6-heads, 15M parameters. Roformer Chinese Sim Char " +"Small model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:532 +msgid "``roformer-chinese-sim-char-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:532 +msgid "" +"12-layer, 768-hidden, 12-heads, 95M parameters. Roformer Chinese Sim Char" +" Base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:536 +msgid "``roformer-english-small-discriminator``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:536 +msgid "" +"12-layer, 256-hidden, 4-heads, 13M parameters. Roformer English Small " +"Discriminator." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:540 +msgid "``roformer-english-small-generator``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:540 +msgid "" +"12-layer, 64-hidden, 1-heads, 5M parameters. Roformer English Small " +"Generator." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:544 +#: ../model_zoo/transformers/all/transformers.rst:716 +msgid "SKEP_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:544 +msgid "``skep_ernie_1.0_large_ch``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:544 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_1.0``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:549 +msgid "``skep_ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:549 +msgid "" +"24-layer, 1024-hidden, 16-heads, 336M parameters. Trained using the Erine" +" model ``ernie_2.0_large_en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:554 +msgid "``skep_roberta_large_en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:554 +msgid "" +"24-layer, 1024-hidden, 16-heads, 355M parameters. 
Trained using the " +"RoBERTa model ``roberta_large_en``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:559 +#: ../model_zoo/transformers/all/transformers.rst:718 +msgid "SqueezeBert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:559 +msgid "``squeezebert-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:559 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Uncased model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:563 +msgid "``squeezebert-mnli``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:563 +msgid "12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:567 +msgid "``squeezebert-mnli-headless``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:567 +msgid "" +"12-layer, 768-hidden, 12-heads, 51M parameters. SqueezeBert Mnli Headless" +" model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:571 +#: ../model_zoo/transformers/all/transformers.rst:720 +msgid "T5_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:571 +msgid "``t5-small``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:571 +msgid "6-layer, 512-hidden, 8-heads, 93M parameters. T5 small model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:575 +msgid "``t5-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:575 +msgid "12-layer, 768-hidden, 12-heads, 272M parameters. T5 base model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:579 +msgid "``t5-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:579 +msgid "24-layer, 1024-hidden, 16-heads, 803M parameters. T5 large model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:583 +#: ../model_zoo/transformers/all/transformers.rst:722 +msgid "TinyBert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:583 +msgid "``tinybert-4l-312d``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:583 +#: ../model_zoo/transformers/all/transformers.rst:593 +#: ../model_zoo/transformers/all/transformers.rst:603 +msgid "" +"4-layer, 312-hidden, 12-heads, 14.5M parameters. The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:588 +msgid "``tinybert-6l-768d``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:588 +#: ../model_zoo/transformers/all/transformers.rst:598 +#: ../model_zoo/transformers/all/transformers.rst:608 +msgid "" +"6-layer, 768-hidden, 12-heads, 67M parameters. 
The TinyBert model " +"distilled from the BERT model ``bert-base-uncased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:593 +msgid "``tinybert-4l-312d-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:598 +msgid "``tinybert-6l-768d-v2``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:603 +msgid "``tinybert-4l-312d-zh``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:608 +msgid "``tinybert-6l-768d-zh``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:613 +#: ../model_zoo/transformers/all/transformers.rst:724 +msgid "UnifiedTransformer_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:613 +msgid "``unified_transformer-12L-cn``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:617 +msgid "``unified_transformer-12L-cn-luge``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:617 +msgid "" +"12-layer, 768-hidden, 12-heads, 108M parameters. Trained on Chinese text " +"(LUGE.ai)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:621 +msgid "``plato-mini``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:621 +msgid "6-layer, 768-hidden, 12-heads, 66M parameters. Trained on Chinese text." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:625 +msgid "UNIMO_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:625 +msgid "``unimo-text-1.0``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:625 +msgid "12-layer, 768-hidden, 12-heads, 99M parameters. UNIMO-text-1.0 model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:629 +msgid "``unimo-text-1.0-lcsts-new``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:629 +msgid "" +"12-layer, 768-hidden, 12-heads, 99M parameters. Finetuned on lcsts_new " +"dataset." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:633 +msgid "``unimo-text-1.0-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:633 +msgid "" +"24-layer, 768-hidden, 16-heads, 316M parameters. UNIMO-text-1.0 large " +"model." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:637 +#: ../model_zoo/transformers/all/transformers.rst:726 +msgid "XLNet_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:637 +msgid "``xlnet-base-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:637 +msgid "12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:641 +msgid "``xlnet-large-cased``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:641 +msgid "" +"24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English " +"model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:645 +msgid "``chinese-xlnet-base``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:645 +msgid "12-layer, 768-hidden, 12-heads, 117M parameters. XLNet Chinese model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:649 +msgid "``chinese-xlnet-mid``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:649 +msgid "" +"24-layer, 768-hidden, 12-heads, 209M parameters. XLNet Medium Chinese " +"model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:653 +msgid "``chinese-xlnet-large``" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:653 +msgid "24-layer, 1024-hidden, 16-heads, _M parameters. 
XLNet Large Chinese model" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:661 +msgid "Transformer预训练模型适用任务汇总" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Sequence Classification" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Token Classification" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Question Answering" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Text Generation" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:664 +msgid "Multiple Choice" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:666 +#: ../model_zoo/transformers/all/transformers.rst:668 +#: ../model_zoo/transformers/all/transformers.rst:670 +#: ../model_zoo/transformers/all/transformers.rst:672 +#: ../model_zoo/transformers/all/transformers.rst:674 +#: ../model_zoo/transformers/all/transformers.rst:676 +#: ../model_zoo/transformers/all/transformers.rst:678 +#: ../model_zoo/transformers/all/transformers.rst:680 +#: ../model_zoo/transformers/all/transformers.rst:682 +#: ../model_zoo/transformers/all/transformers.rst:684 +#: ../model_zoo/transformers/all/transformers.rst:686 +#: ../model_zoo/transformers/all/transformers.rst:688 +#: ../model_zoo/transformers/all/transformers.rst:690 +#: ../model_zoo/transformers/all/transformers.rst:692 +#: ../model_zoo/transformers/all/transformers.rst:694 +#: ../model_zoo/transformers/all/transformers.rst:696 +#: ../model_zoo/transformers/all/transformers.rst:698 +#: ../model_zoo/transformers/all/transformers.rst:700 +#: ../model_zoo/transformers/all/transformers.rst:702 +#: ../model_zoo/transformers/all/transformers.rst:704 +#: ../model_zoo/transformers/all/transformers.rst:706 +#: ../model_zoo/transformers/all/transformers.rst:708 +#: ../model_zoo/transformers/all/transformers.rst:710 +#: ../model_zoo/transformers/all/transformers.rst:712 +#: ../model_zoo/transformers/all/transformers.rst:714 +#: ../model_zoo/transformers/all/transformers.rst:716 +#: ../model_zoo/transformers/all/transformers.rst:718 +#: ../model_zoo/transformers/all/transformers.rst:720 +#: ../model_zoo/transformers/all/transformers.rst:722 +#: ../model_zoo/transformers/all/transformers.rst:724 +#: ../model_zoo/transformers/all/transformers.rst:726 +msgid "✅" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:666 +#: ../model_zoo/transformers/all/transformers.rst:668 +#: ../model_zoo/transformers/all/transformers.rst:670 +#: ../model_zoo/transformers/all/transformers.rst:672 +#: ../model_zoo/transformers/all/transformers.rst:674 +#: ../model_zoo/transformers/all/transformers.rst:676 +#: ../model_zoo/transformers/all/transformers.rst:680 +#: ../model_zoo/transformers/all/transformers.rst:682 +#: ../model_zoo/transformers/all/transformers.rst:684 +#: ../model_zoo/transformers/all/transformers.rst:686 +#: ../model_zoo/transformers/all/transformers.rst:688 +#: ../model_zoo/transformers/all/transformers.rst:690 +#: ../model_zoo/transformers/all/transformers.rst:692 +#: ../model_zoo/transformers/all/transformers.rst:694 +#: ../model_zoo/transformers/all/transformers.rst:696 +#: ../model_zoo/transformers/all/transformers.rst:698 +#: ../model_zoo/transformers/all/transformers.rst:700 +#: ../model_zoo/transformers/all/transformers.rst:702 +#: ../model_zoo/transformers/all/transformers.rst:704 +#: ../model_zoo/transformers/all/transformers.rst:706 +#: ../model_zoo/transformers/all/transformers.rst:708 +#: 
../model_zoo/transformers/all/transformers.rst:710 +#: ../model_zoo/transformers/all/transformers.rst:712 +#: ../model_zoo/transformers/all/transformers.rst:714 +#: ../model_zoo/transformers/all/transformers.rst:716 +#: ../model_zoo/transformers/all/transformers.rst:718 +#: ../model_zoo/transformers/all/transformers.rst:720 +#: ../model_zoo/transformers/all/transformers.rst:722 +#: ../model_zoo/transformers/all/transformers.rst:724 +#: ../model_zoo/transformers/all/transformers.rst:726 +msgid "❌" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:702 +msgid "Mbart_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:704 +msgid "MobileBert_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:710 +msgid "ReFormer_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:765 +msgid "预训练模型使用方法" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:767 +msgid "" +"PaddleNLP Transformer API在提丰富预训练模型的同时,也降低了用户的使用门槛。 " +"使用Auto模块,可以加载不同网络结构的预训练模型,无需查找 模型对应的类别。只需十几行代码,用户即可完成模型加载和下游任务Fine-" +"tuning。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:806 +msgid "" +"上面的代码给出使用预训练模型的简要示例,更完整详细的示例代码, 可以参考:`使用预训练模型Fine-tune完成中文文本分类任务 " +"`_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:809 +msgid "加载数据集:PaddleNLP内置了多种数据集,用户可以一键导入所需的数据集。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:810 +msgid "" +"加载预训练模型:PaddleNLP的预训练模型可以很容易地通过 ``from_pretrained()`` 方法加载。 " +"Auto模块(包括AutoModel, AutoTokenizer, 及各种下游任务类)提供了方便易用的接口, " +"无需指定类别,即可调用不同网络结构的预训练模型。 第一个参数是汇总表中对应的 ``Pretrained Weight``,可加载对应的预训练权重。" +" ``AutoModelForSequenceClassification`` 初始化 ``__init__`` 所需的其他参数,如 " +"``num_classes`` 等, 也是通过 ``from_pretrained()`` 传入。``Tokenizer`` 使用同样的 " +"``from_pretrained`` 方法加载。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:816 +msgid "通过 ``Dataset`` 的 ``map`` 函数,使用 ``tokenizer`` 将 ``dataset`` 从原始文本处理成模型的输入。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:817 +msgid "定义 ``BatchSampler`` 和 ``DataLoader``,shuffle数据、组合Batch。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:818 +msgid "定义训练所需的优化器,loss函数等,就可以开始进行模型fine-tune任务。" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:822 +msgid "Reference" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:823 +msgid "" +"部分中文预训练模型来自: `brightmart/albert_zh " +"`_, `ymcui/Chinese-BERT-wwm " +"`_, `huawei-noah/Pretrained-" +"Language-Model/TinyBERT `_, `ymcui/Chinese-XLNet " +"`_, " +"`huggingface/xlnet_chinese_large " +"`_, `Knover/luge-" +"dialogue `_, `huawei-noah/Pretrained-Language-Model/NEZHA-PyTorch/ " +"`_, `ZhuiyiTechnology/simbert " +"`_" +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:832 +msgid "" +"Lan, Zhenzhong, et al. \"Albert: A lite bert for self-supervised learning" +" of language representations.\" arXiv preprint arXiv:1909.11942 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:833 +msgid "" +"Lewis, Mike, et al. \"BART: Denoising Sequence-to-Sequence Pre-training " +"for Natural Language Generation, Translation, and Comprehension.\" arXiv " +"preprint arXiv:1910.13461 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:834 +msgid "" +"Devlin, Jacob, et al. \"Bert: Pre-training of deep bidirectional " +"transformers for language understanding.\" arXiv preprint " +"arXiv:1810.04805 (2018)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:835 +msgid "" +"Zaheer, Manzil, et al. 
\"Big bird: Transformers for longer sequences.\" " +"arXiv preprint arXiv:2007.14062 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:836 +msgid "" +"Stephon, Emily, et al. \"Blenderbot: Recipes for building an open-domain " +"chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:837 +msgid "" +"Stephon, Emily, et al. \"Blenderbot-Small: Recipes for building an open-" +"domain chatbot.\" arXiv preprint arXiv:2004.13637 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:838 +msgid "" +"Zhang, zhengyan, et al. \"CPM: A Large-scale Generative Chinese Pre-" +"trained Language Model.\" arXiv preprint arXiv:2012.00413 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:839 +msgid "" +"Jiang, Zihang, et al. \"ConvBERT: Improving BERT with Span-based Dynamic " +"Convolution.\" arXiv preprint arXiv:2008.02496 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:840 +msgid "" +"Nitish, Bryan, et al. \"CTRL: A Conditional Transformer Language Model " +"for Controllable Generation.\" arXiv preprint arXiv:1909.05858 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:841 +msgid "" +"Sanh, Victor, et al. \"DistilBERT, a distilled version of BERT: smaller, " +"faster, cheaper and lighter.\" arXiv preprint arXiv:1910.01108 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:842 +msgid "" +"Clark, Kevin, et al. \"Electra: Pre-training text encoders as " +"discriminators rather than generators.\" arXiv preprint arXiv:2003.10555 " +"(2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:843 +msgid "" +"Sun, Yu, et al. \"Ernie: Enhanced representation through knowledge " +"integration.\" arXiv preprint arXiv:1904.09223 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:844 +msgid "" +"Xiao, Dongling, et al. \"Ernie-gen: An enhanced multi-flow pre-training " +"and fine-tuning framework for natural language generation.\" arXiv " +"preprint arXiv:2001.11314 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:845 +msgid "" +"Xiao, Dongling, et al. \"ERNIE-Gram: Pre-Training with Explicitly N-Gram " +"Masked Language Modeling for Natural Language Understanding.\" arXiv " +"preprint arXiv:2010.12148 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:846 +msgid "" +"Radford, Alec, et al. \"Language models are unsupervised multitask " +"learners.\" OpenAI blog 1.8 (2019): 9." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:847 +msgid "" +"Xu, Yiheng, et al. \"LayoutLM: Pre-training of Text and Layout for " +"Document Image Understanding.\" arXiv preprint arXiv:1912.13318 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:848 +msgid "" +"Xu, Yang, et al. \"LayoutLMv2: Multi-modal Pre-training for Visually-Rich" +" Document Understanding\" arXiv preprint arXiv:2012.14740 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:849 +msgid "" +"Xu, Yiheng, et al. \"LayoutXLM: Multimodal Pre-training for Multilingual " +"Visually-rich Document Understanding\" arXiv preprint arXiv:2104.08836 " +"(2021)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:850 +msgid "" +"Liu, Yinhan, et al. \"MBart: Multilingual Denoising Pre-training for " +"Neural Machine Translation\" arXiv preprint arXiv:2001.08210 (2020)." 
+msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:851 +msgid "" +"Sun, Zhiqing, et al. \"MobileBERT: a Compact Task-Agnostic BERT for " +"Resource-Limited Devices\" arXiv preprint arXiv:2004.02984 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:852 +msgid "" +"Song, Kaitao, et al. \"MPNet: Masked and Permuted Pre-training for " +"Language Understanding.\" arXiv preprint arXiv:2004.09297 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:853 +msgid "" +"Wei, Junqiu, et al. \"NEZHA: Neural contextualized representation for " +"chinese language understanding.\" arXiv preprint arXiv:1909.00204 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:854 +msgid "" +"Kitaev, Nikita, et al. \"Reformer: The efficient Transformer.\" arXiv " +"preprint arXiv:2001.04451 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:855 +msgid "" +"Liu, Yinhan, et al. \"Roberta: A robustly optimized bert pretraining " +"approach.\" arXiv preprint arXiv:1907.11692 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:856 +msgid "" +"Su Jianlin, et al. \"RoFormer: Enhanced Transformer with Rotary Position " +"Embedding.\" arXiv preprint arXiv:2104.09864 (2021)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:857 +msgid "" +"Tian, Hao, et al. \"SKEP: Sentiment knowledge enhanced pre-training for " +"sentiment analysis.\" arXiv preprint arXiv:2005.05635 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:858 +msgid "" +"Forrest, ALbert, et al. \"SqueezeBERT: What can computer vision teach NLP" +" about efficient neural networks?\" arXiv preprint arXiv:2006.11316 " +"(2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:859 +msgid "" +"Raffel, Colin, et al. \"T5: Exploring the Limits of Transfer Learning " +"with a Unified Text-to-Text Transformer.\" arXiv preprint " +"arXiv:1910.10683 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:860 +msgid "" +"Vaswani, Ashish, et al. \"Attention is all you need.\" arXiv preprint " +"arXiv:1706.03762 (2017)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:861 +msgid "" +"Jiao, Xiaoqi, et al. \"Tinybert: Distilling bert for natural language " +"understanding.\" arXiv preprint arXiv:1909.10351 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:862 +msgid "" +"Bao, Siqi, et al. \"Plato-2: Towards building an open-domain chatbot via " +"curriculum learning.\" arXiv preprint arXiv:2006.16779 (2020)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:863 +msgid "" +"Yang, Zhilin, et al. \"Xlnet: Generalized autoregressive pretraining for " +"language understanding.\" arXiv preprint arXiv:1906.08237 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:864 +msgid "" +"Cui, Yiming, et al. \"Pre-training with whole word masking for chinese " +"bert.\" arXiv preprint arXiv:1906.08101 (2019)." +msgstr "" + +#: ../model_zoo/transformers/all/transformers.rst:865 +msgid "" +"Wang, Quan, et al. “Building Chinese Biomedical Language Models via " +"Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021)." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/modules.po b/docs/locale/en/LC_MESSAGES/source/modules.po new file mode 100644 index 0000000000000000000000000000000000000000..caee4de4a69d33c64c63c428c6e876b509ba314d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/modules.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/modules.rst:2 +msgid "paddlenlp" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.collate.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.collate.po new file mode 100644 index 0000000000000000000000000000000000000000..0833b056f484eb23d23394a026955ad177efb0e0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.collate.po @@ -0,0 +1,219 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.data.collate.rst:2 +msgid "collate" +msgstr "" + +#: of paddlenlp.data.collate.Dict:1 paddlenlp.data.collate.Pad:1 +#: paddlenlp.data.collate.Stack:1 paddlenlp.data.collate.Tuple:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.data.collate.Stack:1 +msgid "" +"Stacks the input data samples to construct the batch. The N input samples" +" must have the same shape/length and will be stacked to construct a " +"batch." +msgstr "" + +#: of paddlenlp.data.collate.Dict paddlenlp.data.collate.Dict.__call__ +#: paddlenlp.data.collate.Pad paddlenlp.data.collate.Pad.__call__ +#: paddlenlp.data.collate.Stack paddlenlp.data.collate.Stack.__call__ +#: paddlenlp.data.collate.Tuple paddlenlp.data.collate.Tuple.__call__ +msgid "参数" +msgstr "" + +#: of paddlenlp.data.collate.Stack:4 +msgid "" +"The axis in the result data along which the input data are stacked. " +"Default: 0." +msgstr "" + +#: of paddlenlp.data.collate.Stack:7 +msgid "" +"The value type of the output. If it is set to None, the type of input " +"data is used. Default: None." +msgstr "" + +#: of paddlenlp.data.collate.Stack.__call__:1 +msgid "Batchifies the input data by stacking." +msgstr "" + +#: of paddlenlp.data.collate.Pad.__call__:6 +#: paddlenlp.data.collate.Stack.__call__:3 +msgid "" +"The input data samples. It is a list. Each element is a numpy.ndarray or " +"list." +msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__ paddlenlp.data.collate.Pad.__call__ +#: paddlenlp.data.collate.Stack.__call__ paddlenlp.data.collate.Tuple.__call__ +msgid "返回" +msgstr "" + +#: of paddlenlp.data.collate.Stack.__call__:7 +msgid "Stacked batch data." 
+msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__ paddlenlp.data.collate.Pad.__call__ +#: paddlenlp.data.collate.Stack.__call__ paddlenlp.data.collate.Tuple.__call__ +msgid "返回类型" +msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__:14 +#: paddlenlp.data.collate.Pad.__call__:18 +#: paddlenlp.data.collate.Stack.__call__:11 +#: paddlenlp.data.collate.Tuple.__call__:14 +msgid "示例" +msgstr "" + +#: of paddlenlp.data.collate.Pad:1 +msgid "Pads the input data samples to the largest length at `axis`." +msgstr "" + +#: of paddlenlp.data.collate.Pad:3 +msgid "The padding value. Default: 0." +msgstr "" + +#: of paddlenlp.data.collate.Pad:5 +msgid "" +"The axis to pad the arrays. The arrays will be padded to the largest " +"length at `axis`. For example, assume the input arrays have shape (10, 8," +" 5), (6, 8, 5), (3, 8, 5) and the axis is 0. Each input will be padded " +"into (10, 8, 5) and then stacked to form the final output, which has " +"shape (3, 10, 8, 5). Default: 0." +msgstr "" + +#: of paddlenlp.data.collate.Pad:12 +msgid "" +"If it is bool, indicate whether to return the valid length in the output," +" and the data type of returned length is int32 if True. If it is " +"numpy.dtype, indicate the data type of returned length. Default: None." +msgstr "" + +#: of paddlenlp.data.collate.Pad:17 +msgid "" +"The value type of the output. If it is set to None, the input data type " +"is used. Default: None." +msgstr "" + +#: of paddlenlp.data.collate.Pad:20 +msgid "" +"Whether the padding direction is right-side. If True, it indicates we pad" +" to the right side, while False indicates we pad to the left side. " +"Default: True." +msgstr "" + +#: of paddlenlp.data.collate.Pad.__call__:1 +msgid "" +"Batchifies the input data by padding. The input will be padded to the " +"largest dimension at `axis` and then stacked to form the final output. In" +" addition, the function will output the original dimensions at the `axis`" +" if `ret_length` is not None or False." +msgstr "" + +#: of paddlenlp.data.collate.Pad.__call__:10 +msgid "" +"If `ret_length` is False, it is a numpy.ndarray representing the padded " +"batch data and the shape is (N, …). Otherwise, it is a tuple, besides the" +" padded batch data, the tuple also includes a numpy.ndarray representing " +"original length at `axis` of all input samples, which shaped `(N,)`." +msgstr "" + +#: of paddlenlp.data.collate.Dict:1 paddlenlp.data.collate.Tuple:1 +msgid "" +"Wraps multiple batchify functions together. The input functions will be " +"applied to the corresponding input fields." +msgstr "" + +#: of paddlenlp.data.collate.Tuple:4 +msgid "" +"Each sample should be a list or tuple containing multiple fields. The " +"i'th batchify function stored in Tuple will be applied on the i'th field." +msgstr "" + +#: of paddlenlp.data.collate.Tuple:7 +msgid "" +"For example, when data sample is (nd_data, label), you can wrap two " +"batchify functions using `Tuple(DataBatchify, LabelBatchify)` to batchify" +" nd_data and label correspondingly." +msgstr "" + +#: of paddlenlp.data.collate.Tuple:11 +msgid "" +"The batchify functions to wrap. It is a callable function or a list/tuple" +" of callable functions." +msgstr "" + +#: of paddlenlp.data.collate.Tuple:14 +msgid "The additional batchify functions to wrap." +msgstr "" + +#: of paddlenlp.data.collate.Tuple.__call__:1 +msgid "" +"Batchifies data samples by applying each function on the corresponding " +"data field, and each data field is produced by stacking the field data of" +" samples." 
+msgstr "" + +#: of paddlenlp.data.collate.Tuple.__call__:5 +msgid "" +"The samples to batchfy. Each sample in list/tuple should contain `N` " +"fields." +msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__:9 +#: paddlenlp.data.collate.Tuple.__call__:9 +msgid "A tuple composed of results from all including batchifying functions." +msgstr "" + +#: of paddlenlp.data.collate.Dict:4 +msgid "" +"Each sample should be a dict containing multiple fields. Each batchify " +"function with key stored in `Dict` will be applied on the field which has" +" the same key." +msgstr "" + +#: of paddlenlp.data.collate.Dict:8 +msgid "" +"For example, when data sample is {'tokens': tokens, 'labels': labels}, " +"you can wrap two batchify functions using `Dict({'tokens': DataBatchify, " +"'labels': LabelBatchify})` to batchify tokens and labels correspondingly." +msgstr "" + +#: of paddlenlp.data.collate.Dict:13 +msgid "" +"The batchify functions to wrap. It is a dict, which values is callable " +"functions." +msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__:1 +msgid "" +"Batchifies data samples by applying each function on the corresponding " +"data field, and each data field is produced by stacking the field data " +"with the same key as batchify functions of all samples." +msgstr "" + +#: of paddlenlp.data.collate.Dict.__call__:5 +msgid "" +"The samples to batchfy. Each sample in list/tuple is a dict with `N` key-" +"values." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.data_collator.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.data_collator.po new file mode 100644 index 0000000000000000000000000000000000000000..f9ae0c435657599fd55104c9c935bb6a4d8bbec9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.data_collator.po @@ -0,0 +1,182 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.data.data_collator.rst:2 +msgid "data\\_collator" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:1 +#: paddlenlp.data.data_collator.DataCollatorForTokenClassification:1 +#: paddlenlp.data.data_collator.DataCollatorWithPadding:1 +#: paddlenlp.data.data_collator.DefaultDataCollator:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorWithPadding:1 +msgid "" +"Data collator that will dynamically pad the inputs to the longest " +"sequence in the batch." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq +#: paddlenlp.data.data_collator.DataCollatorForTokenClassification +#: paddlenlp.data.data_collator.DataCollatorWithPadding +msgid "参数" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:3 +#: paddlenlp.data.data_collator.DataCollatorForTokenClassification:3 +#: paddlenlp.data.data_collator.DataCollatorWithPadding:3 +msgid "The tokenizer used for encoding the data." 
+msgstr "" + +#: of paddlenlp.data.data_collator.DefaultDataCollator:1 +msgid "" +"Very simple data collator that simply collates batches of dict-like " +"objects and performs special handling for potential keys named:" +msgstr "" + +#: of paddlenlp.data.data_collator.DefaultDataCollator:3 +msgid "`label`: handles a single value (int or float) per object" +msgstr "" + +#: of paddlenlp.data.data_collator.DefaultDataCollator:4 +msgid "`label_ids`: handles a list of values per object" +msgstr "" + +#: of paddlenlp.data.data_collator.DefaultDataCollator:5 +msgid "" +"Does not do any additional preprocessing: property names of the input " +"object will be used as corresponding inputs to the model. See glue and " +"ner for example of how it's useful. This is an object (like other data " +"collators) rather than a pure function like default_data_collator. This " +"can be helpful if you need to set a return_tensors value at " +"initialization. :param return_tensors: Return Tensor or numpy array. " +":type return_tensors: `bool`" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForTokenClassification:1 +msgid "" +"Data collator that will dynamically pad the inputs to longest sequence in" +" the batch, as well as the labels." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForTokenClassification:5 +msgid "The id to use when padding the labels. Defaults to -100." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:1 +msgid "" +"Data collator that will dynamically pad the inputs received, as well as " +"the labels." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:5 +msgid "" +"The model that is being trained. If set and has the " +"*prepare_decoder_input_ids_from_labels*, use it to prepare the " +"*decoder_input_ids* This is useful when using *label_smoothing* to avoid" +" calculating loss twice." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:5 +msgid "" +"The model that is being trained. If set and has the " +"*prepare_decoder_input_ids_from_labels*, use it to prepare the " +"*decoder_input_ids*" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:8 +msgid "" +"This is useful when using *label_smoothing* to avoid calculating loss " +"twice." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:10 +msgid "" +"Select a strategy to pad the returned sequences (according to the model's" +" padding side and padding index) among: - `True` or `'longest'`: Pad to " +"the longest sequence in the batch (or no padding if only a single " +"sequence is provided). - `'max_length'`: Pad to a maximum length " +"specified with the argument `max_length` or to the maximum acceptable " +"input length for the model if that argument is not provided. - `False` or" +" `'do_not_pad'` (default): No padding (i.e., can output a batch with " +"sequences of different lengths)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:10 +msgid "" +"Select a strategy to pad the returned sequences (according to the model's" +" padding side and padding index) among:" +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:13 +msgid "" +"`True` or `'longest'`: Pad to the longest sequence in the batch (or no " +"padding if only a single sequence is provided)." 
+msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:15 +msgid "" +"`'max_length'`: Pad to a maximum length specified with the argument " +"`max_length` or to the maximum acceptable input length for the model if " +"that argument is not provided." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:17 +msgid "" +"`False` or `'do_not_pad'` (default): No padding (i.e., can output a batch" +" with sequences of different lengths)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:20 +msgid "" +"Maximum length of the returned list and optionally padding length (see " +"above)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:22 +msgid "" +"If set will pad the sequence to a multiple of the provided value. This " +"is especially useful to enable the use of Tensor Cores on NVIDIA hardware" +" with compute capability >= 7.5 (Volta)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:22 +msgid "If set will pad the sequence to a multiple of the provided value." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:24 +msgid "" +"This is especially useful to enable the use of Tensor Cores on NVIDIA " +"hardware with compute capability >= 7.5 (Volta)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:27 +msgid "" +"The id to use when padding the labels (-100 will be automatically ignored" +" by PyTorch loss functions)." +msgstr "" + +#: of paddlenlp.data.data_collator.DataCollatorForSeq2Seq:29 +msgid "" +"The type of Tensor to return. Allowable values are \"np\", \"pt\" and " +"\"tf\"." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.po new file mode 100644 index 0000000000000000000000000000000000000000..8f85809137c1ee602c3f2058128bcdd16ab5afb1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.data.rst:2 +msgid "paddlenlp.data" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.sampler.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.sampler.po new file mode 100644 index 0000000000000000000000000000000000000000..741070ddb7b5bd3874c7800b64272eebf2abdbe1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.sampler.po @@ -0,0 +1,194 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.data.sampler.rst:2 +msgid "sampler" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:1 +msgid "" +"The class is to help construct iterable sampler used for " +":class:`paddle.io.DataLoader`. It wraps a dataset and uses its " +":meth:`__getitem__` method. Every subclass of :class:`SamplerHelper` has " +"to provide an :meth:`__iter__` method, providing a way to iterate over " +"indices of dataset elements, and a :meth:`__len__` method that returns " +"the length of the returned iterators." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:8 +msgid "" +"The class also can be used as batch iterator instead of indices iterator " +"when `iterator` yield samples rather than indices by initializing " +"`iterator` with a iterable dataset." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:13 +msgid "" +"The :meth:`__len__` method isn't strictly required by " +":class:`paddle.io.DataLoader`, but is expected in any calculation " +"involving the length of a :class:`paddle.io.DataLoader`." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper +#: paddlenlp.data.sampler.SamplerHelper.batch +#: paddlenlp.data.sampler.SamplerHelper.shard +#: paddlenlp.data.sampler.SamplerHelper.shuffle +#: paddlenlp.data.sampler.SamplerHelper.sort +msgid "参数" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:17 +msgid "Input dataset for :class:`SamplerHelper`." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper:19 +msgid "Iterator of dataset. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.length:1 +msgid "Returns the length." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shuffle:1 +msgid "Shuffles the dataset according to the given buffer size and random seed." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shuffle:3 +msgid "" +"Buffer size for shuffle. If `buffer_size < 0` or more than the length of " +"the dataset, `buffer_size` is the length of the dataset. Default: -1." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shuffle:7 +msgid "Seed for the random. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch +#: paddlenlp.data.sampler.SamplerHelper.shard +#: paddlenlp.data.sampler.SamplerHelper.shuffle +#: paddlenlp.data.sampler.SamplerHelper.sort +msgid "返回" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shuffle:10 +msgid "A new shuffled :class:`SamplerHelper` object." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch +#: paddlenlp.data.sampler.SamplerHelper.shard +#: paddlenlp.data.sampler.SamplerHelper.shuffle +#: paddlenlp.data.sampler.SamplerHelper.sort +msgid "返回类型" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:25 +#: paddlenlp.data.sampler.SamplerHelper.shard:18 +#: paddlenlp.data.sampler.SamplerHelper.shuffle:14 +#: paddlenlp.data.sampler.SamplerHelper.sort:21 +msgid "示例" +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:1 +msgid "Sorts the dataset according to given callable :meth:`cmp` or :meth:`key`." 
+msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:3 +msgid "The function of comparison. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:5 +msgid "The function of key. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:7 +msgid "" +"Whether to reverse when sorting the data samples. If True, it means in " +"descending order, and False means in ascending order. Default: False." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:11 +msgid "" +"Buffer size for sort. If `buffer_size < 0` or `buffer_size` is more than " +"the length of the data, `buffer_size` will be set to the length of the " +"data. Default: -1." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.sort:17 +msgid "A new sorted :class:`SamplerHelper` object." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:1 +msgid "Batches the dataset according to given `batch_size`." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:3 +msgid "The batch size." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:5 +msgid "Whether to drop the last mini batch. Default: False." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:8 +msgid "" +"It accepts four arguments: index of data source, the length of minibatch," +" the size of minibatch so far and data source, and it returns the size of" +" mini batch so far. Actually, the returned value can be anything and " +"would used as argument `size_so_far` in `key`. If None, it would return " +"the length of mini match. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:15 +msgid "" +"The function of key. It accepts the size of minibatch so far and the " +"length of minibatch, and returns what to be compared with `batch_size`. " +"If None, only the size of mini batch so far would be compared with " +"`batch_size`. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.batch:21 +msgid "A new batched :class:`SamplerHelper` object." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shard:1 +msgid "Slices the dataset for multi GPU training." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shard:3 +msgid "" +"The number of training process, and is also the number of GPU cards used " +"in training. If None, it will be set by " +":meth:`paddle.distributed.get_world_size` method. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shard:8 +msgid "" +"The id of current training process. Equal to the value of the environment" +" variable PADDLE_TRAINER_ID. If None, it will be intialized by " +":meth:`paddle.distributed.get_rank` method. Default: None." +msgstr "" + +#: of paddlenlp.data.sampler.SamplerHelper.shard:14 +msgid "A new sliced :class:`SamplerHelper` object." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..ec5b593ae6a7e3a89ef9c52c5290227412c288f1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.tokenizer.po @@ -0,0 +1,99 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.data.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer:1 +msgid "基类::class:`paddlenlp.data.tokenizer.BaseTokenizer`" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer:1 +msgid "" +"Constructs a tokenizer based on `jieba " +"`__. It supports :meth:`cut` method to " +"split the text to tokens, and :meth:`encode` method to covert text to " +"token ids." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer +#: paddlenlp.data.tokenizer.JiebaTokenizer.cut +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode +msgid "参数" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer:5 +msgid "An instance of :class:`paddlenlp.data.Vocab`." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:1 +msgid "The method used to cut the text to tokens." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:3 +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode:5 +msgid "The text that needs to be cuted." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:5 +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode:7 +msgid "" +"Whether to use the full mode. If True, using full mode that gets all the " +"possible words from the sentence, which is fast but not accurate. If " +"False, using accurate mode that attempts to cut the sentence into the " +"most accurate segmentations, which is suitable for text analysis. " +"Default: False." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:12 +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode:14 +msgid "Whether to use the HMM model. Default: True." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode +msgid "返回" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:15 +msgid "A list of tokens." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode +msgid "返回类型" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.cut:19 +#: paddlenlp.data.tokenizer.JiebaTokenizer.encode:21 +msgid "示例" +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.encode:1 +msgid "" +"The method used to convert the text to ids. It will firstly call " +":meth:`cut` method to cut the text to tokens. Then, convert tokens to ids" +" using `vocab`." +msgstr "" + +#: of paddlenlp.data.tokenizer.JiebaTokenizer.encode:17 +msgid "A list of ids." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.vocab.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.vocab.po new file mode 100644 index 0000000000000000000000000000000000000000..34418ef11149794efa1b2f23f734be6b2f1d4332 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.data.vocab.po @@ -0,0 +1,321 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.data.vocab.rst:2 +msgid "vocab" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:1 +msgid "" +"The class used to convert between tokens and ids. It also includes some " +"store/load functions." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab paddlenlp.data.vocab.Vocab.build_vocab +#: paddlenlp.data.vocab.Vocab.from_dict paddlenlp.data.vocab.Vocab.from_json +#: paddlenlp.data.vocab.Vocab.load_vocabulary +#: paddlenlp.data.vocab.Vocab.to_indices paddlenlp.data.vocab.Vocab.to_json +#: paddlenlp.data.vocab.Vocab.to_tokens +msgid "参数" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:4 +msgid "" +"A Counter instance describes the tokens and their frequencies. Its keys " +"will be indexed according to the order of frequency sorting to construct " +"mapping relationship. If None, `token_to_idx` must be provided as the " +"mapping relationship. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:10 +msgid "Max size of vocab, not including special tokens. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:13 paddlenlp.data.vocab.Vocab.build_vocab:11 +msgid "Ignore tokens whose frequencies are less than `min_freq`. Default: 1." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:16 paddlenlp.data.vocab.Vocab.build_vocab:14 +msgid "" +"A dict specifies the mapping relationship between tokens and indices to " +"be used. If provided, adjust the tokens and indices mapping according to " +"it. If None, counter must be provided. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:21 +msgid "" +"Special token for unknown token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:24 +msgid "" +"Special token for padding token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:27 +msgid "" +"Special token for bos token. If no need, it also could be None. Default: " +"None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:30 +msgid "" +"Special token for eos token. If no need, it also could be None. Default: " +"None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab:33 paddlenlp.data.vocab.Vocab.build_vocab:31 +#: paddlenlp.data.vocab.Vocab.from_dict:18 +#: paddlenlp.data.vocab.Vocab.load_vocabulary:19 +msgid "" +"Keyword arguments ending with `_token`. It can be used to specify further" +" special tokens that will be exposed as attribute of the vocabulary and " +"associated with an index." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_tokens:1 +msgid "Maps the input indices to token list." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_tokens:3 +msgid "" +"The input indice(s) for mapping. Must be an `int` or 1D " +"`list[int]`|`tuple[int]`|`numpy.ndarray`."
+msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab +#: paddlenlp.data.vocab.Vocab.from_dict paddlenlp.data.vocab.Vocab.from_json +#: paddlenlp.data.vocab.Vocab.load_vocabulary +#: paddlenlp.data.vocab.Vocab.to_indices paddlenlp.data.vocab.Vocab.to_json +#: paddlenlp.data.vocab.Vocab.to_tokens +msgid "返回" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_tokens:7 +msgid "" +"Obtained token(s). If `indices` is an integer, it will return a str. If " +"`indices` is a list/tuple of integers, it will return a list of str." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab +#: paddlenlp.data.vocab.Vocab.from_dict paddlenlp.data.vocab.Vocab.from_json +#: paddlenlp.data.vocab.Vocab.load_vocabulary +#: paddlenlp.data.vocab.Vocab.to_indices paddlenlp.data.vocab.Vocab.to_json +#: paddlenlp.data.vocab.Vocab.to_tokens +msgid "返回类型" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:41 +#: paddlenlp.data.vocab.Vocab.from_dict:28 +#: paddlenlp.data.vocab.Vocab.from_json:12 +#: paddlenlp.data.vocab.Vocab.load_vocabulary:28 +#: paddlenlp.data.vocab.Vocab.to_indices:13 +#: paddlenlp.data.vocab.Vocab.to_json:14 +#: paddlenlp.data.vocab.Vocab.to_tokens:13 +msgid "示例" +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_indices:1 +msgid "Maps the input tokens into indices." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_indices:3 +msgid "The input token(s) for mapping." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_indices:7 +msgid "" +"Obationed indice(s). If `tokens` is a str, it will return an integer. If " +"`tokens` is a list/tuple of str, it will return a list of integers." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.__call__:1 +msgid "" +"Maps the input tokens into indices. Its function is the same as the " +":meth:`to_indices` method." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.__call__:4 +msgid "See detail at `to_indices`." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_json:1 +msgid "" +"Summarizes some information of vocab as JSON string. If path is gaven, " +"the JSON string will be saved into files. The JSON string and the saved " +"file all can be used to reconstruct the :class:`Vocab` by calling " +":meth:`from_json` method." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_json:6 +msgid "" +"The path to save JSON string. If None, the JSON will not be saved. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.to_json:10 +msgid "The JSON string including information of vocab." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_json:1 +msgid "" +"Loads :class:`Vocab` from JSON string or JSON file, which is gotten by " +"calling :meth:`to_json` method." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_json:4 +msgid "JSON string or file path of JSON string." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_json:7 +msgid "" +"An instance of :class:`Vocab` generated from information contained in " +"JSON string." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:1 +msgid "Builds the :class:`Vocab` from a dict." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:3 +msgid "A dict describes the mapping relationship between tokens and indices." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:6 +msgid "" +"The special token for unknow token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:9 +msgid "" +"The special token for padding token. If no need, it also could be None. " +"Default: None." 
+msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:12 +msgid "" +"The special token for bos token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:15 +msgid "" +"The special token for eos token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.from_dict:23 +msgid "" +"An instance of :class:`Vocab` generated from the given dict and special " +"tokens." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:1 +msgid "" +"Builds the :class:`Vocab` accoring to given iterator and other " +"information. Firstly, iterate over the `iterator` to construct a " +":class:`collections.Counter` and used to init the as :class:`Vocab`." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:5 +msgid "" +"Iterator of tokens. Each element should be a list of tokens if wordlevel " +"vocab is needed." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:8 +msgid "The max size of vocab, not including special tokens. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:19 +msgid "" +"The special token for unknow token ''. If no need, it also could be " +"None. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:22 +msgid "" +"The special token for padding token ''. If no need, it also could be" +" None. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:25 +msgid "" +"The special token for bos token ''. If no need, it also could be " +"None. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:28 +msgid "" +"The special token for eos token ''. If no need, it also could be " +"None. Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.build_vocab:36 +msgid "" +"An instance of :class:`Vocab` generated from given iterator and other " +"informations." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:1 +msgid "" +"Builds the :class:`Vocab` from a file reserving all tokens by calling " +":meth:`Vocab.from_dict` method. The file contains a token per line, and " +"the line index would be the index of corresponding token." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:5 +msgid "the path of file to construct vocabulary." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:7 +msgid "" +"special token for unknown token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:10 +msgid "" +"special token for padding token. If no need, it also could be None. " +"Default: None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:13 +msgid "" +"special token for bos token. If no need, it also could be None. Default: " +"None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:16 +msgid "" +"special token for eos token. If no need, it also could be None. Default: " +"None." +msgstr "" + +#: of paddlenlp.data.vocab.Vocab.load_vocabulary:24 +msgid "An instance of :class:`Vocab` generated from the given file." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.dataset.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.dataset.po new file mode 100644 index 0000000000000000000000000000000000000000..daafb8801ffe434f84878d9e7c1dc13e6f909fb8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.dataset.po @@ -0,0 +1,282 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.datasets.dataset.rst:2 +msgid "dataset" +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset:1 +msgid "" +"Wraps a map-style dataset-like object as an instance of `MapDataset`, and" +" equips it with `map` and other utility methods. All non-magic methods of" +" the raw object are also accessible." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read +#: paddlenlp.datasets.dataset.IterDataset +#: paddlenlp.datasets.dataset.IterDataset.filter +#: paddlenlp.datasets.dataset.IterDataset.map +#: paddlenlp.datasets.dataset.IterDataset.shard +#: paddlenlp.datasets.dataset.MapDataset +#: paddlenlp.datasets.dataset.MapDataset.filter +#: paddlenlp.datasets.dataset.MapDataset.map +#: paddlenlp.datasets.dataset.load_dataset +msgid "参数" +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset:5 +msgid "" +"An object with `__getitem__` and `__len__` methods. It could be a list or" +" a subclass of `paddle.io.Dataset`." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset:8 +#: paddlenlp.datasets.dataset.MapDataset:8 +msgid "Other information to be passed to the dataset." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset:11 +#: paddlenlp.datasets.dataset.MapDataset:11 +msgid "" +"For examples of this class, please see `dataset_self_defined " +"`__." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.filter:1 +#: paddlenlp.datasets.dataset.MapDataset.filter:1 +msgid "" +"Filters samples by the filter function and uses the filtered data to " +"update this dataset." +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.filter:4 +msgid "" +"A filter function that takes a sample as input and returns a boolean. " +"Samples that return False would be discarded." +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.filter:7 +msgid "" +"Number of processes for multiprocessing. If set to 0, it doesn't use " +"multiprocessing. Defaults to `0`." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.map:1 +#: paddlenlp.datasets.dataset.MapDataset.map:1 +msgid "" +"Performs specific function on the dataset to transform and update every " +"sample." +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.map:3 +msgid "" +"Transformations to be performed. It receives single sample as argument if" +" batched is False. Else it receives all examples." +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.map:6 +msgid "" +"If True, transformations would be delayed and performed on demand. " +"Otherwise, transforms all samples at once. Note that if `fn` is " +"stochastic, `lazy` should be True or you will get the same result on all " +"epochs. Defaults to False." +msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.map:11 +msgid "" +"If True, transformations would take all examples as input and return a " +"collection of transformed examples. Note that if set True, `lazy` option " +"would be ignored. Defaults to False." 
+msgstr "" + +#: of paddlenlp.datasets.dataset.MapDataset.map:15 +msgid "" +"Number of processes for multiprocessing. If set to 0, it doesn't use " +"multiprocessing. Note that if set to positive value, `lazy` option would " +"be ignored. Defaults to 0." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder:1 +msgid "" +"A base class for all DatasetBuilder. It provides a `read()` function to " +"turn a data file into a MapDataset or IterDataset." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder:4 +msgid "" +"`_get_data()` function and `_read()` function should be implemented to " +"download data file and read data file into a `Iterable` of the examples." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder:7 +msgid "" +"For how to define a custom `DatasetBuilder`, please see " +"`contribute_dataset " +"`__." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:1 +msgid "" +"Returns a dataset containing all the examples that can be read from the " +"file path." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:3 +msgid "" +"If `self.lazy` is False, this eagerly reads all instances from " +"`self._read()` and returns a `MapDataset`." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:6 +msgid "" +"If `self.lazy` is True, this returns an `IterDataset`, which internally " +"relies on the generator created from `self._read()` to lazily produce " +"examples. In this case your implementation of `_read()` must also be lazy" +" (that is, not load all examples into memory at once)." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:11 +msgid "Path of data file to read, usually provided by `_get_data` function." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:14 +msgid "" +"The split name of selected dataset. This only makes a different when data" +" files of different splits have different structures." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read +#: paddlenlp.datasets.dataset.load_dataset +msgid "返回" +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.read:18 +msgid "A `MapDataset|IterDataset`." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.get_labels:1 +msgid "Returns list of class labels of the dataset if specified." +msgstr "" + +#: of paddlenlp.datasets.dataset.DatasetBuilder.get_vocab:1 +msgid "Returns vocab file path of the dataset if specified." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset:1 +msgid "" +"Wraps a dataset-like object as an instance of `IterDataset`, and equips " +"it with `map` and other utility methods. All non-magic methods of the raw" +" object also accessible." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset:5 +msgid "" +"An object with `__iter__` function. It can be a Iterable or a subclass of" +" `paddle.io.IterableDataset`." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.filter:4 +msgid "" +"A filter function that takes a sample as input and returns a boolean. " +"Samples that return False are discarded." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.shard:1 +msgid "Split the dataset into `num_shards` pieces." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.shard:3 +msgid "" +"An integer representing the number of data shards. If None, `num_shards` " +"would be number of trainers. Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.shard:7 +msgid "" +"An integer representing the index of the current shard. 
If None, `index` " +"would be the current trainer rank id. Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.IterDataset.map:3 +msgid "Transformations to be performed. It receives single sample as argument." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:1 +msgid "" +"This method will load a dataset, either form PaddleNLP library or from a " +"self-defined data loading script, by calling functions in " +"`DatasetBuilder`." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:4 +msgid "" +"For all the names of datasets in PaddleNLP library, see here: " +"`dataset_list " +"`__." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:7 +msgid "Either `splits` or `data_files` must be specified." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:9 +msgid "" +"Name of the dataset processing script in PaddleNLP library or a custom " +"data reading function." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:12 +msgid "Additional name to select a more specific dataset. Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:15 +msgid "" +"Defining the path of dataset files. If None. `splits` must be specified. " +"Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:18 +msgid "" +"Which split of the data to load. If None. `data_files` must be specified." +" Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:21 +msgid "" +"Weather to return `MapDataset` or an `IterDataset`. True for " +"`IterDataset`. False for `MapDataset`. If None, return the default type " +"of this dataset. Defaults to None." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:25 +msgid "Other keyword arguments to be passed to the `DatasetBuilder`." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:28 +msgid "A `MapDataset` or `IterDataset` or a tuple of those." +msgstr "" + +#: of paddlenlp.datasets.dataset.load_dataset:30 +msgid "" +"For how to use this function, please see `dataset_load " +"`__" +" and `dataset_self_defined " +"`__" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.po new file mode 100644 index 0000000000000000000000000000000000000000..97715579e95c3d470e7223458043af91216c949d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.datasets.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.datasets.rst:2 +msgid "paddlenlp.datasets" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.po new file mode 100644 index 0000000000000000000000000000000000000000..d3c2572e9434b0278c479eaf599caa80a0380651 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. 
+# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.embeddings.rst:2 +msgid "paddlenlp.embeddings" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.token_embedding.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.token_embedding.po new file mode 100644 index 0000000000000000000000000000000000000000..638fc8a0a6664df00cc71c927054215f02b96156 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.embeddings.token_embedding.po @@ -0,0 +1,200 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.embeddings.token_embedding.rst:2 +msgid "token\\_embedding" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.list_embedding_name:1 +msgid "Lists all names of pretrained embedding models paddlenlp provides." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:1 +msgid "基类::class:`paddle.nn.layer.common.Embedding`" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:1 +msgid "" +"A `TokenEmbedding` can load pre-trained embedding model which paddlenlp " +"provides by specifying embedding name. Furthermore, a `TokenEmbedding` " +"can load extended vocabulary by specifying extended_vocab_path." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.search +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.set_trainable +msgid "参数" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:5 +msgid "" +"The pre-trained embedding model name. Use " +"`paddlenlp.embeddings.list_embedding_name()` to list the names of all " +"embedding models that we provide. Defaults to " +"`w2v.baidu_encyclopedia.target.word-word.dim300`." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:9 +msgid "Specifies unknown token. Defaults to `[UNK]`." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:12 +msgid "" +"To initialize the vector of unknown token. If it's none, use normal " +"distribution to initialize the vector of unknown token. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:16 +msgid "The file path of extended vocabulary. Defaults to `None`." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:19 +msgid "Whether the weight of embedding can be trained. 
Defaults to True." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding:22 +msgid "" +"Whether to keep the extended vocabulary only, will be effective only if " +"provides extended_vocab_path. Defaults to False." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.set_trainable:1 +msgid "Whether or not to set the weights of token embedding to be trainable." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.set_trainable:3 +msgid "" +"The weights can be trained if trainable is set to True, or the weights " +"are fixed if trainable is False." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.search:1 +msgid "Gets the vectors of specifying words." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.search:3 +msgid "The words which need to be searched." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.search +msgid "返回" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.search:6 +msgid "The vectors of specifying words." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.search +msgid "返回类型" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.search:7 +msgid "`numpy.array`" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim:13 +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot:14 +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words:10 +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.search:10 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word:1 +msgid "Gets the index of specifying word by searching word_to_idx dict." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word:3 +msgid "The input token word which we want to get the token index converted from." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word:6 +msgid "The index of specifying word." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_from_word:7 +msgid "`int`" +msgstr "" + +#: of +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words:1 +msgid "Gets the index list of specifying words by searching word_to_idx dict." +msgstr "" + +#: of +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words:3 +msgid "" +"The input token words which we want to get the token indices converted " +"from." +msgstr "" + +#: of +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words:6 +msgid "The indexes list of specifying words." +msgstr "" + +#: of +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.get_idx_list_from_words:7 +msgid "`list`" +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.dot:1 +msgid "" +"Calculates the dot product of 2 words. 
Dot product or scalar product is " +"an algebraic operation that takes two equal-length sequences of numbers " +"(usually coordinate vectors), and returns a single number." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim:4 +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot:5 +msgid "The first word string." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim:6 +#: paddlenlp.embeddings.token_embedding.TokenEmbedding.dot:7 +msgid "The second word string." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.dot:10 +msgid "The dot product of 2 words." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim:1 +msgid "" +"Calculates the cosine similarity of 2 word vectors. Cosine similarity is " +"the cosine of the angle between two n-dimensional vectors in an " +"n-dimensional space." +msgstr "" + +#: of paddlenlp.embeddings.token_embedding.TokenEmbedding.cosine_sim:9 +msgid "The cosine similarity of 2 words." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.ernie_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.ernie_model.po new file mode 100644 index 0000000000000000000000000000000000000000..31045f6c624d38299a58db84eb7742f33aa80578 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.ernie_model.po @@ -0,0 +1,161 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.experimental.ernie_model.rst:2 +msgid "ernie\\_model" +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:1 +msgid "The bare ERNIE Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.experimental.ernie_model.FasterErnieForSequenceClassification.forward +#: paddlenlp.experimental.ernie_model.FasterErnieForTokenClassification.forward +#: paddlenlp.experimental.ernie_model.FasterErnieModel +#: paddlenlp.experimental.ernie_model.FasterErnieModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ErnieModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `ErnieModel`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." 
+msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`. .. note:: A normal_initializer " +"initializes weight matrices as normal distributions. See " +":meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieModel`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieModel`." +msgstr "" + +#: of paddlenlp.experimental.ernie_model.FasterErnieModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of +#: paddlenlp.experimental.ernie_model.FasterErnieForSequenceClassification.forward:1 +#: paddlenlp.experimental.ernie_model.FasterErnieForTokenClassification.forward:1 +#: paddlenlp.experimental.ernie_model.FasterErnieModel.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of +#: paddlenlp.experimental.ernie_model.FasterErnieForSequenceClassification.forward:4 +#: paddlenlp.experimental.ernie_model.FasterErnieForTokenClassification.forward:4 +#: paddlenlp.experimental.ernie_model.FasterErnieModel.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.experimental.ernie_model.FasterErnieForSequenceClassification.forward:6 +#: paddlenlp.experimental.ernie_model.FasterErnieForTokenClassification.forward:6 +#: paddlenlp.experimental.ernie_model.FasterErnieModel.forward:6 +msgid "unpacked dict arguments" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.faster_tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.faster_tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..5e8ab0f96c3e7d53844c30129e5527dde4d95e90 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.faster_tokenizer.po @@ -0,0 +1,74 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.experimental.faster_tokenizer.rst:2 +msgid "faster\\_tokenizer" +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.to_tensor:1 +msgid "" +"Create the tensor that the value holds the list of string. NOTICE: The " +"value will be holded in the cpu place." +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.FasterTokenizer.forward +#: paddlenlp.experimental.faster_tokenizer.to_tensor +#: paddlenlp.experimental.faster_tokenizer.to_vocab_buffer +msgid "参数" +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.to_tensor:4 +msgid "The value will be setted to the tensor." +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.to_tensor:6 +#: paddlenlp.experimental.faster_tokenizer.to_vocab_buffer:7 +msgid "The name of the tensor." +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.to_vocab_buffer:1 +msgid "" +"Create the tensor that the value holds the map, the type of key is the " +"string. NOTICE: The value will be holded in the cpu place." +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.to_vocab_buffer:4 +msgid "" +"The value will be setted to the tensor. The key is token and the value is" +" the token index." +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.FasterTokenizer:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.FasterTokenizer.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.FasterTokenizer.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.experimental.faster_tokenizer.FasterTokenizer.forward:6 +msgid "unpacked dict arguments" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.model_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.model_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..6eb76061d50380deba0f0ec4879007455882caf7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.model_utils.po @@ -0,0 +1,134 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.experimental.model_utils.rst:2 +msgid "model\\_utils" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:1 +msgid "" +"Creates an instance of `PretrainedModel`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_pretrained +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_resources +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:5 +msgid "" +"Name of pretrained model or dir path to load from. The string can be: - " +"Name of a built-in pretrained model - Name of a community-contributed " +"pretrained model. - Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:5 +msgid "Name of pretrained model or dir path to load from. The string can be:" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:8 +msgid "Name of a built-in pretrained model" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:9 +msgid "Name of a community-contributed pretrained model." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:10 +msgid "" +"Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:13 +msgid "" +"Position arguments for model `__init__`. If provided, use these as " +"position argument values for model initialization." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:16 +msgid "" +"Keyword arguments for model `__init__`. If provided, use these to update " +"pre-defined keyword argument values for model initialization. 
If the " +"keyword is in `__init__` argument names of base model, update argument " +"values of the base model; else update argument values of derived model." +msgstr "" + +#: of paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:23 +msgid "An instance of `PretrainedModel`." +msgstr "" + +#: of paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.from_pretrained:27 +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_pretrained:13 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_pretrained:1 +msgid "" +"Saves model configuration and related resources (model state) as files " +"under `save_dir`. The model configuration would be saved into a file " +"named \"model_config.json\", and model state would be saved into a file " +"named \"model_state.pdparams\"." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_pretrained:6 +msgid "" +"The `save_dir` can be used in `from_pretrained` as argument value of " +"`pretrained_model_name_or_path` to re-load the trained model." +msgstr "" + +#: of +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_pretrained:9 +#: paddlenlp.experimental.model_utils.FasterPretrainedModel.save_resources:4 +msgid "Directory to save files into." +msgstr "" + +#: of paddlenlp.experimental.model_utils.FasterPretrainedModel.save_resources:1 +msgid "" +"Save tokenizer related resources to `resource_files_names` indicating " +"files under `save_directory` by copying directly. Override it if " +"necessary." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.po new file mode 100644 index 0000000000000000000000000000000000000000..c380d05ce24fb38ddf9023e630d8a62b865ead63 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.experimental.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.experimental.rst:2 +msgid "paddlenlp.experimental" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.crf.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.crf.po new file mode 100644 index 0000000000000000000000000000000000000000..3bea53b2814a1959bb95ab2ea6bc2dd26d082483 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.crf.po @@ -0,0 +1,223 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.layers.crf.rst:2 +msgid "crf" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf:1 +msgid "" +"LinearChainCrf is a linear chain Conditional Random Field layer, it can " +"implement sequential dependencies in the predictions. Therefore, it can " +"take context into account whereas a classifier predicts a label for a " +"single sample without considering \"neighboring\" samples. See " +"https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers" +" for reference." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf +#: paddlenlp.layers.crf.LinearChainCrf.forward +#: paddlenlp.layers.crf.LinearChainCrf.gold_score +#: paddlenlp.layers.crf.LinearChainCrfLoss +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward +#: paddlenlp.layers.crf.ViterbiDecoder +#: paddlenlp.layers.crf.ViterbiDecoder.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf:5 +msgid "The label number." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf:7 +msgid "The crf layer learning rate. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf:9 +msgid "" +"If set to True, the start tag and stop tag will be considered, the " +"transitions params will be a tensor with a shape of `[num_labels+2, " +"num_labels+2]`. Else, the transitions params will be a tensor with a " +"shape of `[num_labels, num_labels]`." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:1 +msgid "" +"Computes the normalization in a linear-chain CRF. See " +"http://www.cs.columbia.edu/~mcollins/fb.pdf for reference." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:3 +msgid "" +"F & = logZ(x) = log\\sum_y exp(score(x,y))\n" +"\n" +"score(x,y) & = \\sum_i Emit(x_i,y_i) + Trans(y_{i-1}, y_i)\n" +"\n" +"p(y_i) & = Emit(x_i,y_i), T(y_{i-1}, y_i) = Trans(y_{i-1}, y_i)" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:10 +msgid "then we can get:" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:12 +msgid "" +"F(1) = log\\sum_{y1} exp(p(y_1) + T([START], y1))\n" +"\n" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:15 +msgid "" +"F(2) & = log\\sum_{y1}\\sum_{y2} exp(p(y_1) + T([START], y1) + p(y_2) + " +"T(y_1,y_2)) \\\\\n" +"& = log\\sum_{y2} exp(F(1) + p(y_2) + T(y_1,y_2))\n" +"\n" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:19 +msgid "Further, We can get F(n) is a recursive formula with F(n-1)." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:21 +#: paddlenlp.layers.crf.LinearChainCrf.gold_score:4 +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward:4 +msgid "" +"The input predicted tensor. Its dtype is float32 and has a shape of " +"`[batch_size, sequence_length, num_tags]`." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:23 +#: paddlenlp.layers.crf.LinearChainCrf.gold_score:8 +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward:6 +msgid "The input length. Its dtype is int64 and has a shape of `[batch_size]`." 
+msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward +#: paddlenlp.layers.crf.LinearChainCrf.gold_score +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward +#: paddlenlp.layers.crf.ViterbiDecoder.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward:26 +msgid "" +"Returns the normalizers tensor `norm_score`. Its dtype is float32 and has" +" a shape of `[batch_size]`." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.forward +#: paddlenlp.layers.crf.LinearChainCrf.gold_score +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward +#: paddlenlp.layers.crf.ViterbiDecoder.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.gold_score:1 +msgid "" +"Computes the unnormalized score for a tag sequence. $$ score(x,y) = " +"\\sum_i Emit(x_i,y_i) + Trans(y_{i-1}, y_i) $$" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.gold_score:6 +#: paddlenlp.layers.crf.LinearChainCrfLoss.forward:8 +msgid "" +"The input label tensor. Its dtype is int64 and has a shape of " +"`[batch_size, sequence_length]`" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrf.gold_score:11 +msgid "" +"Returns the unnormalized sequence scores tensor `unnorm_score`. Its dtype" +" is float32 and has a shape of `[batch_size]`." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrfLoss:1 +msgid "" +"The negative log-likelihood for linear chain Conditional Random Field " +"(CRF)." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrfLoss:3 +msgid "" +"The `LinearChainCrf` network object. Its parameter will be used to " +"calculate the loss." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrfLoss.forward:1 +msgid "" +"Calculate the crf loss. Let $$ Z(x) = \\sum_{y'}exp(score(x,y')) $$, " +"means the sum of all path scores, then we have $$ loss = -logp(y|x) = " +"-log(exp(score(x,y))/Z(x)) = -score(x,y) + logZ(x) $$" +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrfLoss.forward:10 +msgid "" +"Unnecessary parameter for compatibility with older versions. Defaults to " +"``None``." +msgstr "" + +#: of paddlenlp.layers.crf.LinearChainCrfLoss.forward:13 +msgid "The crf loss. Its dtype is float32 and has a shape of `[batch_size]`." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder:1 +msgid "" +"ViterbiDecoder can decode the highest scoring sequence of tags, it should" +" only be used at test time." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder:3 +msgid "" +"The transition matrix. Its dtype is float32 and has a shape of " +"`[num_tags, num_tags]`." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder:5 +msgid "" +"If set to True, the last row and the last column of transitions will be " +"considered as start tag, the penultimate row and the penultimate " +"column of transitions will be considered as stop tag. Else, all the rows " +"and columns will be considered as the real tag. Defaults to ``None``." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder.forward:1 +msgid "Decode the highest scoring sequence of tags." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder.forward:3 +msgid "" +"The unary emission tensor. Its dtype is float32 and has a shape of " +"`[batch_size, sequence_length, num_tags]`." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder.forward:5 +msgid "" +"The input length tensor storing real length of each sequence for " +"correctness. Its dtype is int64 and has a shape of `[batch_size]`." +msgstr "" + +#: of paddlenlp.layers.crf.ViterbiDecoder.forward:8 +msgid "" +"Returns tuple (scores, paths). 
The `scores` tensor containing the score " +"for the Viterbi sequence. Its dtype is float32 and has a shape of " +"`[batch_size]`. The `paths` tensor containing the highest scoring tag " +"indices. Its dtype is int64 and has a shape of `[batch_size, " +"sequence_length]`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.po new file mode 100644 index 0000000000000000000000000000000000000000..15d7ab232f01d0cd875ce935b2d9074245569f46 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.layers.rst:2 +msgid "paddlenlp.layers" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.sequence.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.sequence.po new file mode 100644 index 0000000000000000000000000000000000000000..c381dfaa47418a2ff26e568d7a77d796f43ac15d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.sequence.po @@ -0,0 +1,57 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.layers.sequence.rst:2 +msgid "sequence" +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask:1 +msgid "" +"To boost the performance, this sequence_mask is different with " +"paddle.fluid.layers.sequence_mask" +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask +msgid "参数" +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask:3 +msgid "" +"The whole sequence index, a tensor with a shape of [batch_size, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask:5 +msgid "The valid length of every sequence, a tensor with a shape of [batch_size]." +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask +msgid "返回" +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask:8 +msgid "" +"Returns the output sequence mask `mask`. Its dtype is `bool` and has a " +"shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.layers.sequence.sequence_mask +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.tcn.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.tcn.po new file mode 100644 index 0000000000000000000000000000000000000000..e4259a8fb04a1c28749dfb73f8577f7ede28ff85 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.layers.tcn.po @@ -0,0 +1,93 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.layers.tcn.rst:2 +msgid "tcn" +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:1 +msgid "" +"The TCN block, consists of dilated causal conv, relu and residual block. " +"See the Figure 1(b) in https://arxiv.org/pdf/1803.01271.pdf for more " +"details." +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward paddlenlp.layers.tcn.TemporalBlock +#: paddlenlp.layers.tcn.TemporalBlock.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:4 +msgid "The number of channels in the input tensor." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:6 +msgid "The number of filters." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:8 +msgid "The filter size." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:10 +msgid "The stride size." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:12 +msgid "The dilation size." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:14 +msgid "The size of zeros to be padded." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock:16 +msgid "Probability of dropout the units. Defaults to 0.2." +msgstr "" + +#: of paddlenlp.layers.tcn.TemporalBlock.forward:1 +msgid "" +"The input tensor with a shape of [batch_size, input_channel, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward:1 +msgid "Apply temporal convolutional networks to the input tensor." +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward:3 +msgid "" +"The input tensor with a shape of [batch_size, input_channel, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward:6 +msgid "" +"The `output` tensor with a shape of [batch_size, num_channels[-1], " +"sequence_length]." +msgstr "" + +#: of paddlenlp.layers.tcn.TCN.forward +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.po new file mode 100644 index 0000000000000000000000000000000000000000..0356af101fad1bc3d61232f48a9b90615781df4f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.losses.rst:2 +msgid "paddlenlp.losses" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.rdrop.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.rdrop.po new file mode 100644 index 0000000000000000000000000000000000000000..b3873b947f7482888cfa93468d5f4feda6b312c7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.losses.rdrop.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.losses.rdrop.rst:2 +msgid "rdrop" +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss:1 +msgid "" +"R-Drop Loss implementation For more information about R-drop please refer" +" to this paper: https://arxiv.org/abs/2106.14448 Original implementation " +"please refer to this code: https://github.com/dropreg/R-Drop" +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss paddlenlp.losses.rdrop.RDropLoss.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss:5 +msgid "" +"Indicate how to average the loss, the candicates are " +"``'none'``,``'batchmean'``,``'mean'``,``'sum'``. If `reduction` is " +"``'mean'``, the reduced mean loss is returned; If `reduction` is " +"``'batchmean'``, the sum loss divided by batch size is returned; If " +"`reduction` is ``'sum'``, the reduced sum loss is returned; If " +"`reduction` is ``'none'``, no reduction will be applied. Defaults to " +"``'none'``." +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward:1 +msgid "the first forward logits of training examples." +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward:3 +msgid "the second forward logits of training examples." +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward:5 +msgid "" +"The Tensor containing the binary mask to index with, it's data type is " +"bool." +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward:8 +msgid "Returns tensor `loss`, the rdrop loss of p and q." +msgstr "" + +#: of paddlenlp.losses.rdrop.RDropLoss.forward +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.bleu.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.bleu.po new file mode 100644 index 0000000000000000000000000000000000000000..4d7afbb8914db5653edf7d744bbd6ade6211513a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.bleu.po @@ -0,0 +1,188 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.bleu.rst:2 +msgid "bleu" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:1 +msgid "" +"BLEU (bilingual evaluation understudy) is an algorithm for evaluating the" +" quality of text which has been machine-translated from one natural " +"language to another. This metric uses a modified form of precision to " +"compare a candidate translation against multiple reference translations." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:6 +msgid "" +"BLEU could be used as `paddle.metric.Metric` class, or an ordinary class." +" When BLEU is used as `paddle.metric.Metric` class. A function is needed " +"that transforms the network output to reference string list, and " +"transforms the label to candidate string. By default, a default function " +"`default_trans_func` is provided, which gets target sequence id by " +"calculating the maximum probability of each step. In this case, user must" +" provide `vocab`. It should be noted that the BLEU here is different from" +" the BLEU calculated in prediction, and it is only for observation during" +" training and evaluation." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:16 +msgid "" +"BP & =\n" +"\\begin{cases}\n" +"1, & \\text{if }c>r \\\\\n" +"e_{1-r/c}, & \\text{if }c\\leq r\n" +"\\end{cases}\n" +"\n" +"BLEU & = BP\\exp(\\sum_{n=1}^N w_{n} \\log{p_{n}})" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:26 +msgid "" +"where `c` is the length of candidate sentence, and `r` is the length of " +"reference sentence." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU paddlenlp.metrics.bleu.BLEU.add_inst +#: paddlenlp.metrics.bleu.BLEUForDuReader +#: paddlenlp.metrics.bleu.BLEUForDuReader.add_inst +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:28 +msgid "`trans_func` transforms the network output to string to calculate." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:31 +msgid "" +"Vocab for target language. If `trans_func` is None and BLEU is used as " +"`paddle.metric.Metric` instance, `default_trans_func` will be performed " +"and `vocab` must be provided." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:36 paddlenlp.metrics.bleu.BLEUForDuReader:5 +msgid "Number of gram for BLEU metric. Defaults to 4." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:38 +msgid "The weights of precision of each gram. Defaults to None." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:41 +msgid "Name of `paddle.metric.Metric` instance. Defaults to \"bleu\"." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:46 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:47 +msgid "Using as a general evaluation object." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU:58 +msgid "Using as an instance of `paddle.metric.Metric`." 
+msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.update:1 +msgid "Update states for metric" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.update:3 +msgid "" +"Inputs of :code:`update` is the outputs of :code:`Metric.compute`, if " +":code:`compute` is not defined, the inputs of :code:`update` will be " +"flatten arguments of **output** of mode and **label** from data: " +":code:`update(output1, output2, ..., label1, label2,...)`" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.update:8 +msgid "see :code:`Metric.compute`" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.add_inst:1 +#: paddlenlp.metrics.bleu.BLEUForDuReader.add_inst:1 +msgid "Update the states based on a pair of candidate and references." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.add_inst:3 +#: paddlenlp.metrics.bleu.BLEUForDuReader.add_inst:3 +msgid "Tokenized candidate sentence." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.add_inst:5 +#: paddlenlp.metrics.bleu.BLEUForDuReader.add_inst:5 +msgid "List of tokenized ground truth sentences." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.reset:1 +msgid "Reset states and result" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.accumulate:1 +msgid "Calculates and returns the final bleu metric." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.accumulate +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.accumulate:3 +msgid "Returns the accumulated metric `bleu` and its data type is float64." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.accumulate +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEU.name:1 +msgid "Returns metric name" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEUForDuReader:1 +msgid "基类::class:`paddlenlp.metrics.bleu.BLEU`" +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEUForDuReader:1 +msgid "BLEU metric with bonus for DuReader contest." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEUForDuReader:3 +msgid "" +"Please refer to `DuReader " +"Homepage`_ for " +"more details." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEUForDuReader:7 +msgid "" +"Weight of YesNo dataset when adding bonus for DuReader contest. Defaults " +"to 1.0." +msgstr "" + +#: of paddlenlp.metrics.bleu.BLEUForDuReader:9 +msgid "" +"Weight of Entity dataset when adding bonus for DuReader contest. Defaults" +" to 1.0." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.chunk.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.chunk.po new file mode 100644 index 0000000000000000000000000000000000000000..31be210073d18c2f5468ee30a5c310fe0b1253da --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.chunk.po @@ -0,0 +1,176 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.chunk.rst:2 +msgid "chunk" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator:1 +msgid "" +"ChunkEvaluator computes the precision, recall and F1-score for chunk " +"detection. 
It is often used in sequence tagging tasks, such as Named " +"Entity Recognition(NER)." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator +#: paddlenlp.metrics.chunk.ChunkEvaluator.compute +#: paddlenlp.metrics.chunk.ChunkEvaluator.update +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator:4 +msgid "The label list." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator:6 +msgid "" +"If set True, the label ends with '-B', '-I', '-E' or '-S', else the label" +" starts with them. Defaults to `False`." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator:11 +msgid "示例" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:1 +msgid "Computes the precision, recall and F1-score for chunk detection." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:3 +msgid "The valid length of every sequence, a tensor with shape `[batch_size]`" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:5 +msgid "" +"The predictions index, a tensor with shape `[batch_size, " +"sequence_length]`." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:7 +msgid "The labels index, a tensor with shape `[batch_size, sequence_length]`." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:9 +msgid "" +"Unnecessary parameter for compatibility with older versions with " +"parameters list `inputs`, `lengths`, `predictions`, `labels`. Defaults to" +" None." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.accumulate +#: paddlenlp.metrics.chunk.ChunkEvaluator.compute +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:12 +msgid "" +"Returns tuple (`num_infer_chunks, num_label_chunks, num_correct_chunks`)." +" With the fields: - `num_infer_chunks` (Tensor): The number of the " +"inference chunks. - `num_label_chunks` (Tensor): The number of the " +"label chunks. - `num_correct_chunks` (Tensor): The number of the " +"correct chunks." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:12 +msgid "Returns tuple (`num_infer_chunks, num_label_chunks, num_correct_chunks`)." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:14 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:17 +msgid "`num_infer_chunks` (Tensor):" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:17 +msgid "The number of the inference chunks." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:20 +msgid "`num_label_chunks` (Tensor):" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:20 +msgid "The number of the label chunks." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:22 +msgid "`num_correct_chunks` (Tensor):" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.compute:23 +msgid "The number of the correct chunks." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.accumulate +#: paddlenlp.metrics.chunk.ChunkEvaluator.compute +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.update:1 +msgid "" +"This function takes (num_infer_chunks, num_label_chunks, " +"num_correct_chunks) as input, to accumulate and update the corresponding " +"status of the ChunkEvaluator object. The update method is as follows:" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.update:4 +msgid "" +"\\\\ \\begin{array}{l}{\\text { self. num_infer_chunks }+=\\text { " +"num_infer_chunks }} \\\\ {\\text { self. 
num_Label_chunks }+=\\text { " +"num_label_chunks }} \\\\ {\\text { self. num_correct_chunks }+=\\text { " +"num_correct_chunks }}\\end{array} \\\\\n" +"\n" +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.update:7 +msgid "The number of chunks in Inference on the given minibatch." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.update:9 +msgid "The number of chunks in Label on the given mini-batch." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.update:11 +msgid "The number of chunks both in Inference and Label on the given mini-batch." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.accumulate:1 +msgid "" +"This function returns the mean precision, recall and f1 score for all " +"accumulated minibatches." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.accumulate:3 +msgid "Returns tuple (`precision, recall, f1 score`)." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.reset:1 +msgid "Reset function empties the evaluation memory for previous mini-batches." +msgstr "" + +#: of paddlenlp.metrics.chunk.ChunkEvaluator.name:1 +msgid "Return name of metric instance." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.distinct.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.distinct.po new file mode 100644 index 0000000000000000000000000000000000000000..b1d2bb23e8fd2e73790dc1467e94f9f142cfe3b7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.distinct.po @@ -0,0 +1,151 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.distinct.rst:2 +msgid "distinct" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:1 +msgid "" +"`Distinct` is an algorithm for evaluating the textual diversity of the " +"generated text by calculating the number of distinct n-grams. The larger " +"the number of distinct n-grams, the higher the diversity of the text. See" +" details at https://arxiv.org/abs/1510.03055." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:6 +msgid "" +":class:`Distinct` could be used as a :class:`paddle.metric.Metric` class," +" or an ordinary class. When :class:`Distinct` is used as a " +":class:`paddle.metric.Metric` class, a function is needed to transform " +"the network output to a string list." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct +#: paddlenlp.metrics.distinct.Distinct.add_inst +#: paddlenlp.metrics.distinct.Distinct.update +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:11 +msgid "Number of gram for :class:`Distinct` metric. Defaults to 2." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:13 +msgid "" +"`trans_func` transforms the network output to a string list. Defaults to " +"None. .. note:: When :class:`Distinct` is used as a " +":class:`paddle.metric.Metric` class, `trans_func` must be provided. " +"Please note that the input of `trans_func` is numpy array." 
+msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:13 +msgid "" +"`trans_func` transforms the network output to a string list. Defaults to " +"None." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:16 +msgid "" +"When :class:`Distinct` is used as a :class:`paddle.metric.Metric` class, " +"`trans_func` must be provided. Please note that the input of `trans_func`" +" is numpy array." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:20 +msgid "Name of :class:`paddle.metric.Metric` instance. Defaults to \"distinct\"." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:25 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:26 +msgid "Using as a general evaluation object." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct:38 +msgid "Using as an instance of `paddle.metric.Metric`." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.update:1 +msgid "" +"Updates the metrics states. This method firstly will use " +":meth:`trans_func` method to process the `output` to get the tokenized " +"candidate sentence list. Then call :meth:`add_inst` method to process the" +" candidate list one by one." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.update:6 +msgid "The outputs of model." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.update:8 +msgid "The additional inputs." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.add_inst:1 +msgid "Updates the states based on the candidate." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.add_inst:3 +msgid "Tokenized candidate sentence generated by model." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.reset:1 +msgid "Resets states and result." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.accumulate:1 +msgid "Calculates the final distinct score." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.accumulate +#: paddlenlp.metrics.distinct.Distinct.name +#: paddlenlp.metrics.distinct.Distinct.score +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.accumulate:3 +#: paddlenlp.metrics.distinct.Distinct.score:3 +msgid "The final distinct score." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.accumulate +#: paddlenlp.metrics.distinct.Distinct.name +#: paddlenlp.metrics.distinct.Distinct.score +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.score:1 +msgid "The function is the same as :meth:`accumulate` method." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.name:1 +msgid "Returns the metric name." +msgstr "" + +#: of paddlenlp.metrics.distinct.Distinct.name:3 +msgid "The metric name." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.dureader.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.dureader.po new file mode 100644 index 0000000000000000000000000000000000000000..5731e3f52b95ef2f990f96b2c951a74a8b1f91c3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.dureader.po @@ -0,0 +1,55 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.dureader.rst:2 +msgid "dureader" +msgstr "" + +#: of paddlenlp.metrics.dureader:1 +msgid "Official evaluation script for SQuAD version 2.0." +msgstr "" + +#: of paddlenlp.metrics.dureader:3 +msgid "" +"In addition to basic functionality, we also compute additional statistics" +" and plot precision-recall curves if an additional na_prob.json file is " +"provided. This file is expected to map question ID's to the model's " +"predicted probability that a question is unanswerable." +msgstr "" + +#: of paddlenlp.metrics.dureader.compute_predictions:1 +msgid "Write final predictions to the json file and log-odds of null if needed." +msgstr "" + +#: of paddlenlp.metrics.dureader.get_final_text:1 +msgid "Project the tokenized prediction back to the original text." +msgstr "" + +#: of paddlenlp.metrics.dureader.normalize:1 +msgid "Normalize strings to space joined chars. :param s: a list of strings." +msgstr "" + +#: of paddlenlp.metrics.dureader.normalize +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.dureader.normalize:4 +msgid "A list of normalized strings." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.glue.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.glue.po new file mode 100644 index 0000000000000000000000000000000000000000..5c9b7b61a769b285ab404f4bfccc670e3d1df8bf --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.glue.po @@ -0,0 +1,459 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.glue.rst:2 +msgid "glue" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:1 paddlenlp.metrics.glue.Mcc:1 +#: paddlenlp.metrics.glue.MultiLabelsMetric:1 +#: paddlenlp.metrics.glue.PearsonAndSpearman:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:1 +msgid "" +"This class encapsulates Accuracy, Precision, Recall and F1 metric logic, " +"and `accumulate` function returns accuracy, precision, recall and f1. The" +" overview of all metrics could be seen at the document of `paddle.metric " +"`_" +" for details." 
+msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1 +#: paddlenlp.metrics.glue.AccuracyAndF1.compute +#: paddlenlp.metrics.glue.AccuracyAndF1.update paddlenlp.metrics.glue.Mcc +#: paddlenlp.metrics.glue.Mcc.compute paddlenlp.metrics.glue.Mcc.update +#: paddlenlp.metrics.glue.MultiLabelsMetric +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate +#: paddlenlp.metrics.glue.MultiLabelsMetric.compute +#: paddlenlp.metrics.glue.MultiLabelsMetric.update +#: paddlenlp.metrics.glue.PearsonAndSpearman +#: paddlenlp.metrics.glue.PearsonAndSpearman.update +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:7 +msgid "" +"Number of top elements to look at for computing accuracy. Defaults to " +"(1,)." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:10 +msgid "The positive label for calculating precision and recall. Defaults to 1." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:14 +msgid "String name of the metric instance. Defaults to 'acc_and_f1'." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1:18 paddlenlp.metrics.glue.Mcc:7 +#: paddlenlp.metrics.glue.MultiLabelsMetric:11 +#: paddlenlp.metrics.glue.PearsonAndSpearman:9 +msgid "示例" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.compute:1 +#: paddlenlp.metrics.glue.MultiLabelsMetric.compute:1 +msgid "" +"Accepts network's output and the labels, and calculates the top-k " +"(maximum value in topk) indices for accuracy." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.compute:4 +msgid "" +"Predicted tensor, and its dtype is float32 or float64, and has a shape of" +" [batch_size, num_classes]." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.compute:7 +msgid "" +"The ground truth tensor, and its dtype is int64, and has a shape of " +"[batch_size, 1] or [batch_size, num_classes] in one hot representation." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate +#: paddlenlp.metrics.glue.AccuracyAndF1.compute +#: paddlenlp.metrics.glue.AccuracyAndF1.name +#: paddlenlp.metrics.glue.Mcc.accumulate paddlenlp.metrics.glue.Mcc.compute +#: paddlenlp.metrics.glue.Mcc.name +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate +#: paddlenlp.metrics.glue.MultiLabelsMetric.compute +#: paddlenlp.metrics.glue.MultiLabelsMetric.name +#: paddlenlp.metrics.glue.PearsonAndSpearman.accumulate +#: paddlenlp.metrics.glue.PearsonAndSpearman.name +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.compute:12 +msgid "" +"Correct mask, each element indicates whether the prediction equals to the" +" label. Its' a tensor with a data type of float32 and has a shape of " +"[batch_size, topk]." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate +#: paddlenlp.metrics.glue.AccuracyAndF1.compute +#: paddlenlp.metrics.glue.AccuracyAndF1.name +#: paddlenlp.metrics.glue.Mcc.accumulate paddlenlp.metrics.glue.Mcc.compute +#: paddlenlp.metrics.glue.Mcc.name +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate +#: paddlenlp.metrics.glue.MultiLabelsMetric.compute +#: paddlenlp.metrics.glue.MultiLabelsMetric.name +#: paddlenlp.metrics.glue.PearsonAndSpearman.accumulate +#: paddlenlp.metrics.glue.PearsonAndSpearman.name +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.update:1 +#: paddlenlp.metrics.glue.MultiLabelsMetric.update:1 +msgid "" +"Updates the metrics states (accuracy, precision and recall), in order to " +"calculate accumulated accuracy, precision and recall of all instances." 
+msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.update:4 +msgid "" +"Correct mask for calculating accuracy, and it's a tensor with shape " +"[batch_size, topk] and has a dtype of float32." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:1 +#: paddlenlp.metrics.glue.Mcc.accumulate:1 +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:1 +#: paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:1 +msgid "Calculates and returns the accumulated metric." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:3 +msgid "" +"The accumulated metric. A tuple of shape (acc, precision, recall, f1, " +"average_of_acc_and_f1) With the fields: - `acc` (numpy.float64): " +"The accumulated accuracy. - `precision` (numpy.float64): The " +"accumulated precision. - `recall` (numpy.float64): The accumulated " +"recall. - `f1` (numpy.float64): The accumulated f1. - " +"`average_of_acc_and_f1` (numpy.float64): The average of accumulated " +"accuracy and f1." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:3 +msgid "" +"The accumulated metric. A tuple of shape (acc, precision, recall, f1, " +"average_of_acc_and_f1)" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:6 +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:27 +#: paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:6 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:8 +msgid "`acc` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:9 +msgid "The accumulated accuracy." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:10 +msgid "`precision` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:11 +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:30 +msgid "The accumulated precision." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:12 +msgid "`recall` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:13 +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:32 +msgid "The accumulated recall." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:14 +msgid "`f1` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:15 +#: paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:34 +msgid "The accumulated f1." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:16 +msgid "`average_of_acc_and_f1` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.accumulate:17 +msgid "The average of accumulated accuracy and f1." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.reset:1 +#: paddlenlp.metrics.glue.Mcc.reset:1 +#: paddlenlp.metrics.glue.PearsonAndSpearman.reset:1 +msgid "Resets all metric states." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.name:1 +#: paddlenlp.metrics.glue.Mcc.name:1 +#: paddlenlp.metrics.glue.MultiLabelsMetric.name:1 +#: paddlenlp.metrics.glue.PearsonAndSpearman.name:1 +msgid "Returns name of the metric instance." +msgstr "" + +#: of paddlenlp.metrics.glue.AccuracyAndF1.name:3 +#: paddlenlp.metrics.glue.Mcc.name:3 +#: paddlenlp.metrics.glue.MultiLabelsMetric.name:3 +#: paddlenlp.metrics.glue.PearsonAndSpearman.name:3 +msgid "The name of the metric instance." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc:1 +msgid "" +"This class calculates `Matthews correlation coefficient " +"`_ ." 
+msgstr "" + +#: of paddlenlp.metrics.glue.Mcc:3 +msgid "String name of the metric instance. Defaults to 'mcc'." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.compute:1 +msgid "" +"Processes the pred tensor, and returns the indices of the maximum of each" +" sample." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.compute:4 +msgid "" +"The predicted value is a Tensor with dtype float32 or float64. Shape is " +"[batch_size, 1]." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.compute:7 +msgid "" +"The ground truth value is Tensor with dtype int64, and its shape is " +"[batch_size, 1]." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.compute:11 +msgid "" +"A tuple of preds and label. Each shape is [batch_size, 1], with dtype " +"float32 or float64." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.update:1 +msgid "" +"Calculates states, i.e. the number of true positive, false positive, true" +" negative and false negative samples." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.update:4 +msgid "" +"Tuple of predicted value and the ground truth label, with dtype float32 " +"or float64. Each shape is [batch_size, 1]." +msgstr "" + +#: of paddlenlp.metrics.glue.Mcc.accumulate:3 +msgid "" +"Returns the accumulated metric, a tuple of shape (mcc,), `mcc` is the " +"accumulated mcc and its data type is float64." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman:1 +#, python-format +msgid "" +"The class calculates `Pearson correlation coefficient " +"`_ and " +"`Spearman's rank correlation coefficient " +"`_" +" ." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman:5 +msgid "String name of the metric instance. Defaults to 'pearson_and_spearman'." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.update:1 +msgid "" +"Ensures the type of preds and labels is numpy.ndarray and reshapes them " +"into [-1, 1]." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.update:4 +msgid "" +"Tuple or list of predicted value and the ground truth label. Its data " +"type should be float32 or float64 and its shape is [batch_size, d0, ..., " +"dN]." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:3 +msgid "" +"Returns the accumulated metric, a tuple of (pearson, spearman, " +"the_average_of_pearson_and_spearman). With the fields: - `pearson` " +"(numpy.float64): The accumulated pearson. - `spearman` " +"(numpy.float64): The accumulated spearman. - " +"`the_average_of_pearson_and_spearman` (numpy.float64): The average of" +" accumulated pearson and spearman correlation coefficient." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:3 +msgid "" +"Returns the accumulated metric, a tuple of (pearson, spearman, " +"the_average_of_pearson_and_spearman)." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:9 +msgid "`pearson` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:9 +msgid "The accumulated pearson." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:12 +msgid "`spearman` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:12 +msgid "The accumulated spearman." +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:15 +msgid "`the_average_of_pearson_and_spearman` (numpy.float64):" +msgstr "" + +#: of paddlenlp.metrics.glue.PearsonAndSpearman.accumulate:15 +msgid "The average of accumulated pearson and spearman correlation coefficient." 
+msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric:1 +msgid "" +"This class encapsulates Accuracy, Precision, Recall and F1 metric logic " +"in multi-labels setting (also the binary setting). Some codes are taken " +"and modified from sklearn.metrics ." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric:5 +msgid "The total number of labels which is usually the number of classes" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric:7 +msgid "String name of the metric instance. Defaults to 'multi_labels_metric'." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric:43 +msgid "" +"Note: When zero_division is encountered (details as followed), the " +"corresponding metrics will be set to 0.0" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric:40 +msgid "" +"precision is zero_division if there are no positive predictions recall is" +" zero_division if there are no positive labels fscore is zero_division if" +" all labels AND predictions are negative" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.update:4 +msgid "the tuple returned from `compute` function" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:9 +msgid "Only report results for the class specified by pos_label." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:10 +msgid "" +"Calculate metrics globally by counting the total true positives, false " +"negatives and false positives." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:12 +msgid "" +"Calculate metrics for each label, and find their unweighted mean. This " +"does not take label imbalance into account." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:14 +msgid "" +"Calculate metrics for each label, and find their average weighted by " +"support (the number of true instances for each label). This alters " +"`macro` to account for label imbalance; it can result in an F-score that " +"is not between precision and recall." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:18 +msgid "" +"The positive label for calculating precision and recall in binary " +"settings. Noted: Only when `average='binary'`, this arguments will be " +"used. Otherwise, it will be ignored. Defaults to 1." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:24 +msgid "" +"The accumulated metric. A tuple of shape (precision, recall, f1) With" +" the fields: - `precision` (numpy.float64 or numpy.ndarray if " +"average=None): The accumulated precision. - `recall` " +"(numpy.float64 or numpy.ndarray if average=None): The accumulated" +" recall. - `f1` (numpy.float64 or numpy.ndarray if average=None):" +" The accumulated f1." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:33 +msgid "The accumulated metric. A tuple of shape (precision, recall, f1)" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:29 +msgid "`precision` (numpy.float64 or numpy.ndarray if average=None):" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:31 +msgid "`recall` (numpy.float64 or numpy.ndarray if average=None):" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.accumulate:33 +msgid "`f1` (numpy.float64 or numpy.ndarray if average=None):" +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.compute:4 +msgid "" +"Predicted tensor, and its dtype is float32 or float64, and has a shape of" +" [batch_size, *, num_labels]." 
+msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.compute:7 +msgid "" +"The ground truth tensor, and its dtype is int64, and has a shape of " +"[batch_size, *] or [batch_size, *, num_labels] in one hot representation." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.compute:12 +msgid "" +"it contains two Tensor of shape [*, 1]. The tuple should be passed to " +"`update` function." +msgstr "" + +#: of paddlenlp.metrics.glue.MultiLabelsMetric.reset:1 +msgid "Reset states and result" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.perplexity.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.perplexity.po new file mode 100644 index 0000000000000000000000000000000000000000..227fa0e7a906c963765fc3dd6cdf7abc58ffe792 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.perplexity.po @@ -0,0 +1,148 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.perplexity.rst:2 +msgid "perplexity" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:1 +msgid "" +"Perplexity is a metric used to judge how good a language model is. We can" +" define perplexity as the inverse probability of the test set, normalised" +" by the number of the words in the test set. Perplexity is calculated " +"using cross entropy. It supports both padding data and no padding data." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:7 +msgid "" +"If data is not padded, users should provide `seq_len` for `Metric` " +"initialization. If data is padded, your label should contain `seq_mask`, " +"which indicates the actual length of samples." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:11 +msgid "" +"This Perplexity requires that the output of your network is prediction, " +"label and sequence length (optional). If the Perplexity here doesn't meet" +" your needs, you could override the `compute` or `update` method for " +"calculating Perplexity." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity +#: paddlenlp.metrics.perplexity.Perplexity.compute +#: paddlenlp.metrics.perplexity.Perplexity.update +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:16 +msgid "" +"Sequence length of each sample, it must be provided while data is not " +"padded. Defaults to 20." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:19 +msgid "Name of `Metric` instance. Defaults to 'Perplexity'." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity:23 +msgid "示例" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.compute:1 +msgid "Computes cross entropy loss." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.compute:3 +msgid "" +"Predictor tensor, and its dtype is float32 or float64, and has a shape of" +" [batch_size, sequence_length, vocab_size]." 
+msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.compute:6 +msgid "" +"Label tensor, and its dtype is int64, and has a shape of [batch_size, " +"sequence_length, 1] or [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.compute:9 +msgid "" +"Sequence mask tensor, and its type could be float32, float64, int32 or " +"int64, and has a shape of [batch_size, sequence_length]. It's used to " +"calculate loss. Defaults to None." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.accumulate +#: paddlenlp.metrics.perplexity.Perplexity.compute +#: paddlenlp.metrics.perplexity.Perplexity.name +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.compute:14 +msgid "" +"Returns tuple (`ce, word_num`) if `seq_mask` is not None. Otherwise, " +"returns tensor `ce`. `ce` it the cross entropy loss, its shape is " +"[batch_size, sequence_length] and its data type should be float32." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.accumulate +#: paddlenlp.metrics.perplexity.Perplexity.compute +#: paddlenlp.metrics.perplexity.Perplexity.name +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.update:1 +msgid "Updates metric states." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.update:3 +msgid "" +"Cross entropy loss, it's calculated by `compute` and converted to " +"`numpy.ndarray`." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.update:6 +msgid "" +"The number of words of sequence, it's calculated by `compute` and " +"converted to `numpy.ndarray`. Defaults to None." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.reset:1 +msgid "Resets all metric states." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.accumulate:1 +msgid "Calculates and returns the value of perplexity." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.accumulate:3 +msgid "Returns `perplexity`, the calculation results." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.name:1 +msgid "Returns name of the metric instance." +msgstr "" + +#: of paddlenlp.metrics.perplexity.Perplexity.name:3 +msgid "The name of the metric instance." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.po new file mode 100644 index 0000000000000000000000000000000000000000..c9a3d201205f711bbf2a642a641ed454f261b426 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.rst:2 +msgid "paddlenlp.metrics" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.rouge.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.rouge.po new file mode 100644 index 0000000000000000000000000000000000000000..bb5f73841b34a8022249fea883ce13366baea858 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.rouge.po @@ -0,0 +1,172 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.rouge.rst:2 +msgid "rouge" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:1 +msgid "" +"Rouge-L is Recall-Oriented Understudy for Gisting Evaluation based on " +"Longest Common Subsequence (LCS). Longest common subsequence problem " +"takes into account sentence level structure similarity naturally and " +"identifies longest co-occurring in sequence n-grams automatically." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:6 +msgid "" +"R_{LCS} & = \\frac{LCS(C,S)}{len(S)}\n" +"\n" +"P_{LCS} & = \\frac{LCS(C,S)}{len(C)}\n" +"\n" +"F_{LCS} & = \\frac{(1 + \\gamma^2)R_{LCS}P_{LCS}}{R_{LCS}} + " +"\\gamma^2{R_{LCS}}" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:14 +msgid "where `C` is the candidate sentence, and `S` is the reference sentence." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL paddlenlp.metrics.rouge.RougeL.add_inst +#: paddlenlp.metrics.rouge.RougeL.lcs paddlenlp.metrics.rouge.RougeLForDuReader +#: paddlenlp.metrics.rouge.RougeLForDuReader.add_inst +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:16 +msgid "`trans_func` transforms the network output to string to calculate." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:19 +msgid "" +"Vocab for target language. If `trans_func` is None and RougeL is used as " +"`paddle.metric.Metric` instance, `default_trans_func` will be performed " +"and `vocab` must be provided." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:24 +msgid "A hyperparameter to decide the weight of recall. Defaults to 1.2." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:26 +msgid "Name of `paddle.metric.Metric` instance. Defaults to \"rouge-l\"." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL:30 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs:1 +msgid "Calculate the length of longest common subsequence of string and sub." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs:3 +msgid "The string to be calculated, usually longer the sub string." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs:5 +msgid "The sub string to be calculated." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs:8 +msgid "Returns the length of the longest common subsequence of string and sub." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.lcs +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.add_inst:1 +#: paddlenlp.metrics.rouge.RougeLForDuReader.add_inst:1 +msgid "Update the states based on the a pair of candidate and references." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.add_inst:3 +#: paddlenlp.metrics.rouge.RougeLForDuReader.add_inst:3 +msgid "The candidate sentence generated by model." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.add_inst:5 +#: paddlenlp.metrics.rouge.RougeLForDuReader.add_inst:5 +msgid "List of ground truth sentences." 
+msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.update:1 +msgid "Update states for metric" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.update:3 +msgid "" +"Inputs of :code:`update` is the outputs of :code:`Metric.compute`, if " +":code:`compute` is not defined, the inputs of :code:`update` will be " +"flatten arguments of **output** of mode and **label** from data: " +":code:`update(output1, output2, ..., label1, label2,...)`" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.update:8 +msgid "see :code:`Metric.compute`" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.accumulate:1 +msgid "Calculate the final rouge-l metric." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.reset:1 +msgid "Reset states and result" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeL.name:1 +msgid "Returns metric name" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeLForDuReader:1 +msgid "基类::class:`paddlenlp.metrics.rouge.RougeL`" +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeLForDuReader:1 +msgid "Rouge-L metric with bonus for DuReader contest." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeLForDuReader:3 +msgid "" +"Please refer to `DuReader " +"Homepage`_ for " +"more details." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeLForDuReader:5 +msgid "" +"Weight of YesNo dataset when adding bonus for DuReader contest. Defaults " +"to 1.0." +msgstr "" + +#: of paddlenlp.metrics.rouge.RougeLForDuReader:7 +msgid "" +"Weight of Entity dataset when adding bonus for DuReader contest. Defaults" +" to 1.0." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.sighan.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.sighan.po new file mode 100644 index 0000000000000000000000000000000000000000..df04947c3ccd72c3a83f2e1ab4a4093ff9b49a21 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.sighan.po @@ -0,0 +1,74 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.sighan.rst:2 +msgid "sighan" +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1:1 +msgid "基类::class:`paddle.metric.metrics.Metric`" +msgstr "" + +#: of paddlenlp.metrics.sighan.CorrectionF1.update:1 +#: paddlenlp.metrics.sighan.DetectionF1.update:1 +msgid "Update states for metric" +msgstr "" + +#: of paddlenlp.metrics.sighan.CorrectionF1.update:3 +#: paddlenlp.metrics.sighan.DetectionF1.update:3 +msgid "" +"Inputs of :code:`update` is the outputs of :code:`Metric.compute`, if " +":code:`compute` is not defined, the inputs of :code:`update` will be " +"flatten arguments of **output** of mode and **label** from data: " +":code:`update(output1, output2, ..., label1, label2,...)`" +msgstr "" + +#: of paddlenlp.metrics.sighan.CorrectionF1.update:8 +#: paddlenlp.metrics.sighan.DetectionF1.update:8 +msgid "see :code:`Metric.compute`" +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.reset:1 +msgid "Resets all of the metric state." 
+msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.accumulate:1 +msgid "Accumulates statistics, computes and returns the metric value" +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.name:1 +msgid "Returns name of the metric instance." +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.name +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.name:3 +msgid "The name of the metric instance." +msgstr "" + +#: of paddlenlp.metrics.sighan.DetectionF1.name +msgid "返回类型" +msgstr "" + +#: of paddlenlp.metrics.sighan.CorrectionF1:1 +msgid "基类::class:`paddlenlp.metrics.sighan.DetectionF1`" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.span.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.span.po new file mode 100644 index 0000000000000000000000000000000000000000..d6dc2614f71d075c2bfda5e06da57a1541639654 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.span.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.metrics.span.rst:2 +msgid "span" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.squad.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.squad.po new file mode 100644 index 0000000000000000000000000000000000000000..b1644f89ab8f94e8a586fa96b46abb4d6d667536 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.squad.po @@ -0,0 +1,125 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.metrics.squad.rst:2 +msgid "squad" +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:1 +msgid "" +"Post-processes the predictions of a question-answering model to convert " +"them to answers that are substrings of the original contexts. This is the" +" base postprocessing functions for models that only return start and end " +"logits." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction +#: paddlenlp.metrics.squad.squad_evaluate +msgid "参数" +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:6 +msgid "" +"List of raw squad-style data (see `run_squad.py " +"`__ for more " +"information)." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:11 +msgid "" +"List of processed squad-style features (see `run_squad.py " +"`__ for" +" more information)." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:16 +msgid "" +"The predictions of the model. Should be a tuple of two list containing " +"the start logits and the end logits." 
+msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:19 +msgid "Whether the dataset contains examples with no answers. Defaults to False." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:22 +msgid "The total number of candidate predictions to generate. Defaults to 20." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:25 +msgid "The maximum length of predicted answer. Defaults to 20." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:28 +msgid "" +"The threshold used to select the null answer. Only useful when " +"`version_2_with_negative` is True. Defaults to 0.0." +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction +msgid "返回" +msgstr "" + +#: of paddlenlp.metrics.squad.compute_prediction:33 +msgid "" +"A tuple of three dictionaries containing final selected answer, all " +"n_best answers along with their probability and scores, and the " +"score_diff of each example." +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:1 +msgid "" +"Computes and prints the f1 score and em score of input prediction. :param" +" examples: List of raw squad-style data (see `run_squad.py" +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:3 +msgid "" +"`__ for more " +"information)." +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:7 +msgid "" +"Dictionary of final predictions. Usually generated by " +"`compute_prediction`." +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:10 +msgid "" +"Dictionary of score_diffs of each example. Used to decide if answer exits" +" and compute best score_diff threshold of null. Defaults to None." +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:14 +msgid "The threshold used to select the null answer. Defaults to 1.0." +msgstr "" + +#: of paddlenlp.metrics.squad.squad_evaluate:17 +msgid "" +"Whether the predictions and references can be tokenized by whitespace. " +"Usually set True for English and False for Chinese. Defaults to True." +msgstr "" + +#~ msgid "Computes and prints the f1 score and em score of input prediction." +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..00d80ab061109471738d718f6f33dfe0d18ad807 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.metrics.utils.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.metrics.utils.rst:2 +msgid "utils" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.parallel.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.parallel.po new file mode 100644 index 0000000000000000000000000000000000000000..ecc268adbae37e43a2ff56a20b0fa4469323d82b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.parallel.po @@ -0,0 +1,138 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.distributed.parallel.rst:2 +msgid "parallel" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:1 +msgid "Parallel Embedding." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner +#: paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:3 +msgid "" +"The size of embedding dictionary which dictates the maximum value of the " +"input id." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:5 +msgid "The dimensions of each embedding vector." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:7 +msgid "" +"The rank of the current part, which determines the start index of the " +"vocab." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:9 +msgid "The number of trainers." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:11 +msgid "" +"Specify the weight parameter property, including the initialization " +"method. Defaults to None which means the default weight parameter " +"property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:15 +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding:14 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:15 +msgid "Normally there is no need for user to set this property. Defaults to None." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward:1 +msgid "" +"A Tensor contains the id information. Its data type should be int32 or " +"int64, and the value of the input id should be in [0, weight.shape[0]] ." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward:4 +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward:5 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward:4 +msgid "Returns the embedding Tensor mapped by x." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:1 +msgid "Parallel Linear, axis=1." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:3 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:3 +msgid "The size of embedding vector." 
+msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:5 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:5 +msgid "The number of parts within a model parallel group. Defaults to 1." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:7 +msgid "Whether to gather the output tensor. Defaults to True." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:9 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:9 +msgid "" +"Specify the parameter property, including the initialization method. " +"Defaults to None which means the default parameter property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:12 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:12 +msgid "" +"Specify the bias property. Defaults to None which means the default " +"parameter property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward:1 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward:1 +msgid "The input tensor. Its data type can be int or float." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.RowParallelLiner:1 +msgid "Parallel Linear, axis=0." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.RowParallelLiner:7 +msgid "Whether the input is parallel. Defaults to `False`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.po new file mode 100644 index 0000000000000000000000000000000000000000..319dd56ee7c865b53a6848f001ea465fd00d3f0b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.po @@ -0,0 +1,138 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.distributed.rst:2 +msgid "distributed" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:1 +msgid "Parallel Embedding." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner +#: paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:3 +msgid "" +"The size of embedding dictionary which dictates the maximum value of the " +"input id." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:5 +msgid "The dimensions of each embedding vector." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:7 +msgid "" +"The rank of the current part, which determines the start index of the " +"vocab." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:9 +msgid "The number of trainers." 
+msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding:11 +msgid "" +"Specify the weight parameter property, including the initialization " +"method. Defaults to None which means the default weight parameter " +"property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:15 +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding:14 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:15 +msgid "Normally there is no need for user to set this property. Defaults to None." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward:1 +msgid "" +"A Tensor contains the id information. Its data type should be int32 or " +"int64, and the value of the input id should be in [0, weight.shape[0]] ." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward:4 +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward:5 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward:4 +msgid "Returns the embedding Tensor mapped by x." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward +#: paddlenlp.ops.distributed.parallel.ParallelEmbedding.forward +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:1 +msgid "Parallel Linear, axis=1." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:3 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:3 +msgid "The size of embedding vector." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:5 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:5 +msgid "The number of parts within a model parallel group. Defaults to 1." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:7 +msgid "Whether to gather the output tensor. Defaults to True." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:9 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:9 +msgid "" +"Specify the parameter property, including the initialization method. " +"Defaults to None which means the default parameter property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner:12 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner:12 +msgid "" +"Specify the bias property. Defaults to None which means the default " +"parameter property will be used." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.ColumnParallelLiner.forward:1 +#: paddlenlp.ops.distributed.parallel.RowParallelLiner.forward:1 +msgid "The input tensor. Its data type can be int or float." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.RowParallelLiner:1 +msgid "Parallel Linear, axis=0." +msgstr "" + +#: of paddlenlp.ops.distributed.parallel.RowParallelLiner:7 +msgid "Whether the input is parallel. Defaults to `False`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..a4a06c183ccc07dcbc52c752b2df64c8550dab3b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.distributed.utils.rst:2 +msgid "utils" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.random.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.random.po new file mode 100644 index 0000000000000000000000000000000000000000..532e33cfade4b61e76cf7f72bf125ea690c36079 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.random.po @@ -0,0 +1,27 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.distributed.utils.random.rst:2 +msgid "random" +msgstr "" + +#: of paddlenlp.ops.distributed.utils.random.RNGStatesTracker:1 +msgid "Tracker the RNG states." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.topo.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.topo.po new file mode 100644 index 0000000000000000000000000000000000000000..a8764d182b1752bfdf3268381a000ccfd97a807b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.distributed.utils.topo.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.distributed.utils.topo.rst:2 +msgid "topo" +msgstr "" + +#: ../docstring of paddlenlp.ops.distributed.utils.topo.GroupInfo.rank:1 +msgid "Alias for field number 1" +msgstr "" + +#: ../docstring of paddlenlp.ops.distributed.utils.topo.GroupInfo.size:1 +msgid "Alias for field number 0" +msgstr "" + +#: ../docstring of paddlenlp.ops.distributed.utils.topo.GroupInfo.world:1 +msgid "Alias for field number 2" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.einsum.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.einsum.po new file mode 100644 index 0000000000000000000000000000000000000000..d1747337f5eeef0b0a359b599fa83c2662e9d5a3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.einsum.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.einsum.rst:2 +msgid "einsum" +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:1 +msgid "" +"Executes the sum of product of provided operands based on the Einstein " +"summation convention. Einsum can be used to complete a variety of " +"operations, such as sum, transpose, batch matrix multiplication." +msgstr "" + +#: of paddlenlp.ops.einsum.einsum +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:5 +msgid "" +"Uses uncased letters to specify the dimension of the operands and result." +" The input equation is on the left hand before `->` while the output " +"equation is on the right side. Einsum can infer the result shape so that " +"the `->` and the result label letters can be omitted. Operands in the " +"input equation are splited by commas (','), e.g. 'abc,cde' describes two " +"3D operands. The dimensions labeled with same letter should be same or be" +" 1. Ellipsis ('...') can be used to specify the broadcast dimensions." +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:12 +msgid "" +"The operands to compute the Einstein sum of. The number of operands " +"should be the same as the operands described in input equation." +msgstr "" + +#: of paddlenlp.ops.einsum.einsum +msgid "返回" +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:16 +msgid "The result of Einstein sum product." +msgstr "" + +#: of paddlenlp.ops.einsum.einsum +msgid "返回类型" +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:17 +msgid "`Tensor`" +msgstr "" + +#: of paddlenlp.ops.einsum.einsum:20 +msgid "示例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.ext_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.ext_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..5a3ca41484a44b9c309bdaa423ecef62e89dbec9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.ext_utils.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.ops.ext_utils.rst:2 +msgid "ext\\_utils" +msgstr "" + +#: of paddlenlp.ops.ext_utils.CMakeExtension:1 +msgid "基类::class:`setuptools.extension.Extension`" +msgstr "" + +#: of paddlenlp.ops.ext_utils.CMakeExtension.build_with_command:1 +#: paddlenlp.ops.ext_utils.FasterTransformerExtension.build_with_command:1 +msgid "" +"Custom `build_ext.build_extension` in `Extension` instead of `Command`. " +"`ext_builder` is the instance of `build_ext` command." 
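The `paddlenlp.ops.einsum.einsum` entries above describe an equation-string interface following the Einstein summation convention, with comma-separated operand labels and an optional `->` output specification. A minimal usage sketch, assuming `einsum` is re-exported from `paddlenlp.ops` (the import path and the example tensors here are assumptions, not taken from the patch):

```python
import paddle
from paddlenlp.ops import einsum  # assumed re-export of paddlenlp.ops.einsum.einsum

x = paddle.randn([4, 3])
y = paddle.randn([3, 5])

# 'ij,jk->ik' contracts the shared 'j' label: a plain matrix multiplication.
out = einsum("ij,jk->ik", x, y)   # shape [4, 5]

# Explicit output labels also express reductions, e.g. summing over the last axis.
row_sums = einsum("ij->i", x)     # shape [4]
```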
+msgstr "" + +#: of paddlenlp.ops.ext_utils.CMakeExtension.get_target_filename:1 +#: paddlenlp.ops.ext_utils.FasterTransformerExtension.get_target_filename:1 +msgid "" +"The file names of libraries. Currently only support one library for one " +"extension." +msgstr "" + +#: of paddlenlp.ops.ext_utils.CMakeExtension.get_output_filename:1 +#: paddlenlp.ops.ext_utils.FasterTransformerExtension.get_output_filename:1 +msgid "" +"The file names of outputs, which mostly is the same with " +"`get_target_filename`." +msgstr "" + +#: of paddlenlp.ops.ext_utils.FasterTransformerExtension:1 +msgid "基类::class:`paddlenlp.ops.ext_utils.CMakeExtension`" +msgstr "" + +#: of paddlenlp.ops.ext_utils.BuildExtension:1 +msgid "基类::class:`paddle.utils.cpp_extension.cpp_extension.BuildExtension`" +msgstr "" + +#: of paddlenlp.ops.ext_utils.BuildExtension:1 +msgid "Support both `CppExtention` of Paddle and custom extensions of PaddleNLP." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..f9aff194c4ab68184fa8b0037d57730f1dca145a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.fast_transformer.rst:2 +msgid "fast\\_transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoder.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoder.po new file mode 100644 index 0000000000000000000000000000000000000000..bec6fac5c86d068581ba88d6d4ecb531d705e2b0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoder.po @@ -0,0 +1,135 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.fast_transformer.transformer.decoder.rst:2 +msgid "decoder" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:1 +msgid "FasterTransformer decoder block." 
+msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder +#: paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder.forward +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder.forward +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:3 +msgid "Transformer decoder block." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:13 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:5 +msgid "The number of head used in multi-head attention." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:7 +msgid "The size of per head used in multi-head attention." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:31 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:9 +msgid "The path to decoder_lib. Default to None." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:33 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder:11 +msgid "Whether to use fp16 for decoder. Default to False." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.decoder.InferTransformerDecoder.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:1 +msgid "FasterTransformer decoder for auto-regressive generation." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:3 +msgid "The size of source vocabulary." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:5 +msgid "The size of target vocabulary." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:7 +msgid "The maximum length of input sequences." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:9 +msgid "The number of sub-layers to be stacked in the encoder." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:11 +msgid "The number of sub-layers to be stacked in the decoder." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:15 +msgid "" +"The dimension for word embeddings, which is also the last dimension of " +"the input and output of multi-head attention, position-wise feed-forward " +"networks, encoder and decoder." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:19 +msgid "Size of the hidden layer in position-wise feed-forward networks." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:21 +msgid "Dropout rates. Used for pre-process, activation and inside attention." 
+msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:23 +msgid "Whether to use weight sharing." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:25 +msgid "The start token id and also is used as padding id. Defaults to 0." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:27 +msgid "The end token id. Defaults to 1." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoder.FasterDecoder:29 +msgid "The maximum output length. Defaults to 256." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoding.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoding.po new file mode 100644 index 0000000000000000000000000000000000000000..c1001f5c6fb59d222f7ccd1e1ce97fb515318771 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.decoding.po @@ -0,0 +1,304 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.ops.fast_transformer.transformer.decoding.rst:2 +msgid "decoding" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:1 +msgid "" +"Convert parameters included in Transformer layer (`nn.TransformerEncoder`" +" and `gpt.modeling.TransformerDecoder`) from original models to the " +"format of faster models." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.set_partial_model +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferBartDecoding.forward +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferGptDecoding.forward +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferMBartDecoding.forward +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferTransformerDecoding.forward +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferUnifiedDecoding.forward +#: paddlenlp.ops.fast_transformer.transformer.decoding.convert_params +#: paddlenlp.ops.fast_transformer.transformer.decoding.enable_ft_para +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:5 +msgid "The faster model object." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:7 +msgid "" +"The Transformer layer. It can be an instance of `nn.TransformerEncoder` " +"or `gpt.modeling.TransformerDecoder` currently, and " +"`nn.TransformerDecoder` would be supported soon." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:11 +msgid "" +"0 for nofuse, 1 for fuse, 2 for fuse and delete the unfused parameters. 
" +"If environment variable `PPFG_QKV_MEM_OPT` is set and the weights of " +"q/k/v is fused, it will try to delete the original unfused weights. Note " +"the rollback to original model would not be guarantee anymore when the " +"faster model failed if the original weights are deleted. Default to 1." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:18 +msgid "" +"Whether to use float16. Maybe we should use the default dtype as the " +"highest priority later. Default to `False`." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:21 +msgid "" +"If `False`, need to reload the weight values. It should be `True` for " +"weight loaded models. Default to `False`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight +#: paddlenlp.ops.fast_transformer.transformer.decoding.convert_params +#: paddlenlp.ops.fast_transformer.transformer.decoding.get_ft_para_conf +msgid "返回" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:25 +msgid "" +"Each value is a list including converted parameters in all layers. " +"For other parameters not included in Transformer module to be " +"converted, such as embeddings, you can achieve it by using the " +"returned dict `params` though `params['word_emb'].append()` directly " +"which would do CPU/GPU and fp32/fp16 transfer automatically." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:30 +msgid "Each value is a list including converted parameters in all" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.convert_params:28 +msgid "" +"layers. For other parameters not included in Transformer module to be " +"converted, such as embeddings, you can achieve it by using the returned " +"dict `params` though `params['word_emb'].append()` directly which would " +"do CPU/GPU and fp32/fp16 transfer automatically." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight +#: paddlenlp.ops.fast_transformer.transformer.decoding.convert_params +#: paddlenlp.ops.fast_transformer.transformer.decoding.get_ft_para_conf +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferBartDecoding.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferGptDecoding.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferMBartDecoding.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferTransformerDecoding.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferUnifiedDecoding.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferBartDecoding.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferGptDecoding.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferMBartDecoding.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferTransformerDecoding.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferUnifiedDecoding.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferBartDecoding.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferGptDecoding.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferMBartDecoding.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferTransformerDecoding.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.decoding.InferUnifiedDecoding.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf:1 +msgid "" +"Configurations for model parallel in FasterTransformer. Currently only " +"support GPT. Please refer to `Megatron " +"`__ for details." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf:5 +#: paddlenlp.ops.fast_transformer.transformer.decoding.enable_ft_para:5 +msgid "" +"The size for tensor parallel. If it is 1, tensor parallel would not be " +"used. Default to 1." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf:8 +#: paddlenlp.ops.fast_transformer.transformer.decoding.enable_ft_para:8 +msgid "" +"The size for layer parallel. If it is 1, layer parallel would not be " +"used. Default to 1." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf:11 +#: paddlenlp.ops.fast_transformer.transformer.decoding.enable_ft_para:11 +msgid "" +"The local batch size for pipeline parallel. It is suggested to use " +"`batch_size // layer_para_size`. Default to 1." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_last_group:1 +msgid "" +"For layer parallel, only the process corresponding to the last layer " +"group can get the predict results. It is used to check whether this is " +"the process corresponding to the last layer group." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:1 +msgid "" +"Whether or not the given transformer layer of should be loaded to the " +"current parallel model. For layer parallel, there is no need not to load " +"other layer groups." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:5 +msgid "The index of Transformer layer." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:7 +msgid "The number of Transformer layers." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:10 +msgid "" +"Indicate whether or not the given transformer layer of should be " +"loaded to the current parallel model." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:12 +msgid "Indicate whether or not the given transformer layer of should" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.is_load:13 +msgid "be loaded to the current parallel model." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:1 +msgid "Get the weight slice for tensor parallel." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:3 +msgid "The weight or bias to be sliced." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:5 +msgid "The axis to perform slice." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:7 +msgid "" +"0 is used for creating partial model when initializing and " +"`from_pretrained`. While 1 is used in converting parameters to " +"FasterTransformer. No slice would be performed if it is 1, since " +"parameters have been sliced in `phase=0`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:12 +msgid "" +"If true, `weight` should be a Parameter and force the output to be a " +"Parameter." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.slice_weight:16 +msgid "The sliced weight." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.set_partial_model:1 +msgid "" +"This is used to set whether or not the current model has complete " +"parameters." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.set_partial_model:4 +msgid "" +"It is used to set whether or not the current model has complete " +"parameters." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model:1 +msgid "" +"Slice every values included in `state_to_load` according to the shape of " +"corresponding parameters in `model`. This is used in `from_pratrained` to" +" get sliced parameter values." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model:5 +msgid "The model to use." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model:7 +msgid "The state dict including complete parameter values of model." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.decoding.FTParaConf.fit_partial_model:11 +msgid "The state dict contains adjusted values." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.get_ft_para_conf:1 +msgid "Get settings for model parallel." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.get_ft_para_conf:3 +msgid "The settings for model parallel." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.decoding.enable_ft_para:1 +msgid "" +"Enable model parallel with the given settings in FasterTransformer. " +"Currently only support GPT. Please refer to `Megatron " +"`__ for details." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.encoder.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.encoder.po new file mode 100644 index 0000000000000000000000000000000000000000..6f0bc28d10fb015a24e26cf3c30be9bbb38a2245 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.encoder.po @@ -0,0 +1,178 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.fast_transformer.transformer.encoder.rst:2 +msgid "encoder" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.infer_transformer_encoder:1 +msgid "" +"Fusion Encoder API intergrating Encoder inference in FasterTransformer. " +"It accepts the weight and bias of TransformerEncoder and some other " +"parameters for inference." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:1 +msgid "" +"Redefines `forward` function of `paddle.nn.TransformerEncoderLayer` for " +"integrating FasterTransformer for inference." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:4 +msgid "" +"The original `forward` function would not be replaced unless " +"`enable_fast_encoder` is called by objects of its base class. After " +"replacing, objects of `paddle.nn.TransformerEncoderLayer` also have the " +"same member variables as before." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:9 +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:9 +msgid "" +"After inference, `disable_fast_encoder` could be called to restore the " +"`forward` function of `paddle.nn.TransformerEncoder` and " +"`paddle.nn.TransformerEncoderLayer`." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.convert_to_fp16 +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:13 +msgid "" +"The input of Transformer encoder layer. It is a tensor with shape " +"`[batch_size, sequence_length, d_model]`. The data type should be float32" +" or float64." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:17 +msgid "" +"A tensor used in multi-head attention to prevents attention to some " +"unwanted positions, usually the paddings or the subsequent positions. It " +"is a tensor with shape `[batch_size, 1, 1, sequence_length]`. When the " +"data type is bool, the unwanted positions have `False` values and the " +"others have `True` values. When the data type is int, the unwanted " +"positions have 0 values and the others have 1 values. When the data type " +"is float, the unwanted positions have `-INF` values and the others have 0" +" values. It can be None when nothing wanted or needed to be prevented " +"attention to. Defaults to None." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward:28 +msgid "" +"It is a tensor that has the same shape and data type as `enc_input`, " +"representing the output of Transformer encoder layer. 
Or a tuple if " +"`cache` is not None, except for encoder layer output, the tuple includes " +"the new cache which is same as input `cache` argument but " +"`incremental_cache` has an incremental length. See " +"`paddle.nn.MultiHeadAttention.gen_cache` and " +"`paddle.nn.MultiHeadAttention.forward` for more details." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward +#: paddlenlp.ops.fast_transformer.transformer.encoder.encoder_layer_forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:1 +msgid "" +"Redefines `forward` function of `paddle.nn.TransformerEncoder` for " +"integrating FasterTransformer for inference." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:4 +msgid "" +"The original `forward` function would not be replaced unless " +"`enable_fast_encoder` is called by objects of its base class. After " +"replacing, objects of `paddle.nn.TransformerEncoder` also have the same " +"member variables as before." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:13 +msgid "" +"The input of Transformer encoder. It is a tensor with shape `[batch_size," +" sequence_length, d_model]`. The data type should be float32 or float16." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:17 +msgid "" +"A tensor used in multi-head attention to prevents attention to some " +"unwanted positions, usually the paddings or the subsequent positions. It " +"is a tensor with shape `[batch_size, 1, 1, sequence_length]`. The data " +"type must be float, the unwanted positions have `-INF` values or other " +"non-zeros and the wanted positions must be 0.0." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.encoder_forward:24 +msgid "" +"It is a tensor that has the same shape and data type as `src`, " +"representing the output of Transformer encoder. Or a tuple if `cache` is " +"not None, except for encoder output, the tuple includes the new cache " +"which is same as input `cache` argument but `incremental_cache` in it has" +" an incremental length. See `paddle.nn.MultiHeadAttention.gen_cache` and " +"`paddle.nn.MultiHeadAttention.forward` for more details." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.enable_fast_encoder:1 +msgid "" +"Compiles fusion encoder operator intergrated FasterTransformer using the " +"method of JIT(Just-In-Time) and replaces the `forward` function of " +"`paddle.nn.TransformerEncoder` and `paddle.nn.TransformerEncoderLayer` " +"objects inherited from `self` to support inference using " +"FasterTransformer." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.disable_fast_encoder:5 +#: paddlenlp.ops.fast_transformer.transformer.encoder.enable_fast_encoder:7 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.encoder.disable_fast_encoder:1 +msgid "" +"Restores the original `forward` function of " +"`paddle.nn.TransformerEncoder` and `paddle.nn.TransformerEncoderLayer` " +"objects inherited from `self`." +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.convert_to_fp16:1 +msgid "Convert paddle.nn.TransformerEncoder's parameter from float32 to float16" +msgstr "" + +#: of paddlenlp.ops.fast_transformer.transformer.encoder.convert_to_fp16:3 +msgid "" +"The object to be converted to float16 inplaced, it must be an isinstance " +"of paddle.nn.TransformerEncoder." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..9955784239be694fea07c91b9d60ccc0739262ba --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.po @@ -0,0 +1,523 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.rst:2 +msgid "fast\\_transformer" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:1 +msgid "基类::class:`paddlenlp.transformers.transformer.modeling.TransformerModel`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:1 +msgid "" +"FastGeneration is a fast version for generation with the Transformer" +" model. It uses a custom op based on and enhancing NV FasterTransformer " +"to do fast generation." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterBART.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterGPT.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterMBART.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.export_params +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUNIMOText.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUnifiedTransformer.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:5 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:5 +msgid "The size of source vocabulary." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:7 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:7 +msgid "The size of target vocabulary." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:9 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:9 +msgid "The maximum length of input sequences." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:11 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:11 +msgid "The number of sub-layers to be stacked in the encoder." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:13 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:13 +msgid "The number of sub-layers to be stacked in the decoder." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:15 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:15 +msgid "The number of head used in multi-head attention." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:17 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:17 +msgid "" +"The dimension for word embeddings, which is also the last dimension of " +"the input and output of multi-head attention, position-wise feed-forward " +"networks, encoder and decoder." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:21 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:21 +msgid "Size of the hidden layer in position-wise feed-forward networks." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:23 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:23 +msgid "Dropout rates. Used for pre-process, activation and inside attention." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:25 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:25 +msgid "Whether to use weight sharing." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:27 +msgid "" +"The dropout probability used in MHA to drop some attention target. If " +"None, use the value of dropout. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:30 +msgid "" +"The dropout probability used after FFN activition. If None, use the value" +" of dropout. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:33 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:27 +msgid "The start token id and also is used as padding id. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:35 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:29 +msgid "The end token id. Defaults to 1." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:37 +msgid "" +"Indicating the strategy of decoding. It can be 'beam_search', " +"'beam_search_v2', 'topk_sampling' and 'topp_sampling'. For beam search " +"strategies, 'v2' would select the top `beam_size * 2` beams and process " +"the top `beam_size` alive and finish beams in them separately, while 'v1'" +" would only select the top `beam_size` beams and mix up the alive and " +"finish beams. 'v2' always searchs more and get better results, since the " +"alive beams would always be `beam_size` while the number of alive beams " +"in `v1` might decrease when meeting the end token. However, 'v2' always " +"generates longer results thus might do more calculation and be slower." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:48 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:31 +msgid "The beam width for beam search. Defaults to 4." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:50 +msgid "" +"The number of highest probability tokens to keep for top-k sampling. " +"Defaults to 4." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:53 +msgid "" +"The most probable tokens whose cumulative probability is not less than " +"`topp` are kept for top-p sampling. Defaults to 4." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:56 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:33 +msgid "The maximum output length. Defaults to 256." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:58 +msgid "" +"Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural Generation" +" `_ for details. Bigger " +"`diversity_rate` would lead to more diversity. if `diversity_rate == 0` " +"is equivalent to naive BeamSearch. Default to 0 if not set." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:63 +msgid "Whether to use fp16 for decoding." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:65 +msgid "" +"Whether to use the fast version of encoder. This is experimental option" +" for now. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:68 +msgid "" +"Whether to use fp16 for encoder. Only works when enable_fast_encoder is" +" True. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:71 +msgid "" +"Indicating whether `max_out_len` in is the length relative to that of " +"source text. Only works in `v2` temporarily. It is suggest to set a small" +" `max_out_len` and use `rel_len=True`. Default to False if not set." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer:76 +msgid "" +"The power number in length penalty calculation. Only works in `v2` " +"temporarily. Refer to `GNMT `_. " +"Default to 0.6 if not set." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward:1 +msgid "" +"The Transformer forward methods. The input are source/target sequences, " +"and returns logits." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward:4 +msgid "" +"The ids of source sequences words. It is a tensor with shape " +"`[batch_size, source_sequence_length]` and its data type can be int or " +"int64." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward:8 +msgid "" +"The ids of target sequences words. It is a tensor with shape " +"`[batch_size, target_sequence_length]` and its data type can be int or " +"int64." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward:13 +msgid "" +"Output tensor of the final layer of the model whose data type can be " +"float32 or float64 with shape `[batch_size, sequence_length, " +"vocab_size]`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.export_params:10 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.forward:19 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward:21 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.export_params:1 +msgid "" +"This method is used for load static graph from dygraph checkpoint or " +"export inference model using static graph." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.export_params:4 +msgid "The path to dygraph checkpoint." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer.export_params:6 +msgid "The place to execute static graph." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:1 +msgid "" +"The Transformer model for auto-regressive generation with beam search. It" +" wraps `FasterTransformer` and `InferTransformerModel`, and automatically" +" chioces using `FasterTransformer` (with jit building) or the slower " +"verison `InferTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:35 +msgid "" +"The key word arguments can be `output_time_major`, `use_ft`, " +"`use_fp16_decoding`, `rel_len`, `alpha`: - `output_time_major(bool, " +"optional)`: Indicate the data layout of predicted Tensor. If `False`, the" +" data layout would be batch major with shape `[batch_size, seq_len, " +"beam_size]`. If `True`, the data layout would be time major with shape " +"`[seq_len, batch_size, beam_size]`. Default to `False`. - `use_ft(bool, " +"optional)`: Whether to use FastGeneration for decoding. Default to " +"True if not set. - `use_fp16_decoding(bool, optional)`: Whether to use " +"fp16 for decoding. Only works when using FastGeneration. - " +"`beam_search_version(str, optional)`: Indicating the strategy of beam " +"search. It can be 'v1' or 'v2'. 'v2' would select the top `beam_size * 2`" +" beams and process the top `beam_size` alive and finish beams in them " +"separately, while 'v1' would only select the top `beam_size` beams and " +"mix up the alive and finish beams. 'v2' always searchs more and get " +"better results, since the alive beams would always be `beam_size` while " +"the number of alive beams in `v1` might decrease when meeting the end " +"token. However, 'v2' always generates longer results thus might do more " +"calculation and be slower. 
- `rel_len(bool, optional)`: Indicating " +"whether `max_out_len` in is the length relative to that of source text. " +"Only works in `v2` temporarily. It is suggest to set a small " +"`max_out_len` and use `rel_len=True`. Default to False if not set. - " +"`alpha(float, optional)`: The power number in length penalty calculation." +" Refer to `GNMT `_. Only works in " +"`v2` temporarily. Default to 0.6 if not set. - diversity_rate(float, " +"optional): Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural" +" Generation `_ for details. Bigger " +"`diversity_rate` would lead to more diversity. if `diversity_rate == 0` " +"is equivalent to naive BeamSearch. Default to 0 if not set. **NOTE**: " +"Only works when using FastGeneration temporarily." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:35 +msgid "" +"The key word arguments can be `output_time_major`, `use_ft`, " +"`use_fp16_decoding`, `rel_len`, `alpha`:" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:38 +msgid "`output_time_major(bool, optional)`: Indicate the data layout of predicted" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:39 +msgid "" +"Tensor. If `False`, the data layout would be batch major with shape " +"`[batch_size, seq_len, beam_size]`. If `True`, the data layout would be " +"time major with shape `[seq_len, batch_size, beam_size]`. Default to " +"`False`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:44 +msgid "`use_ft(bool, optional)`: Whether to use FastGeneration" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:45 +msgid "for decoding. Default to True if not set." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:47 +msgid "`use_fp16_decoding(bool, optional)`: Whether to use fp16" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:48 +msgid "for decoding. Only works when using FastGeneration." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:50 +msgid "`beam_search_version(str, optional)`: Indicating the strategy of" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:51 +msgid "" +"beam search. It can be 'v1' or 'v2'. 'v2' would select the top `beam_size" +" * 2` beams and process the top `beam_size` alive and finish beams in " +"them separately, while 'v1' would only select the top `beam_size` beams " +"and mix up the alive and finish beams. 'v2' always searchs more and get " +"better results, since the alive beams would always be `beam_size` while " +"the number of alive beams in `v1` might decrease when meeting the end " +"token. However, 'v2' always generates longer results thus might do more " +"calculation and be slower." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:60 +msgid "`rel_len(bool, optional)`: Indicating whether `max_out_len` in is" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:61 +msgid "" +"the length relative to that of source text. Only works in `v2` " +"temporarily. It is suggest to set a small `max_out_len` and use " +"`rel_len=True`. Default to False if not set." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:65 +msgid "`alpha(float, optional)`: The power number in length penalty" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:66 +msgid "" +"calculation. Refer to `GNMT `_. " +"Only works in `v2` temporarily. Default to 0.6 if not set." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:69 +msgid "diversity_rate(float, optional): Refer to `A Simple, Fast Diverse" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator:70 +msgid "" +"Decoding Algorithm for Neural Generation " +"`_ for details. Bigger `diversity_rate`" +" would lead to more diversity. if `diversity_rate == 0` is equivalent to " +"naive BeamSearch. Default to 0 if not set. **NOTE**: Only works when " +"using FastGeneration temporarily." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward:1 +msgid "Performs decoding for transformer model." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward:3 +msgid "" +"The ids of source sequence words. It is a tensor with shape `[batch_size," +" source_sequence_length]` and its data type can be int or int64." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward:7 +msgid "" +"The ids of target sequence words. Normally, it should NOT be given. If " +"it's given, force decoding with previous output token will be trigger. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.TransformerGenerator.forward:12 +msgid "" +"An int64 tensor shaped indicating the predicted ids. Its shape is " +"`[batch_size, seq_len, beam_size]` or `[seq_len, batch_size, beam_size]` " +"according to `output_time_major`. While, when using FastGeneration and" +" beam search v2, the beam dimension would be doubled to include both the " +"top `beam_size` alive and finish beams, thus the tensor shape is " +"`[batch_size, seq_len, beam_size * 2]` or `[seq_len, batch_size, " +"beam_size * 2]`." +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterGPT:1 +msgid "基类::class:`paddlenlp.transformers.gpt.modeling.GPTPretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterBART.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterGPT.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterMBART.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUNIMOText.forward:1 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUnifiedTransformer.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterBART.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterGPT.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterMBART.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUNIMOText.forward:4 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUnifiedTransformer.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterBART.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterGPT.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterMBART.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUNIMOText.forward:6 +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUnifiedTransformer.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUnifiedTransformer:1 +msgid "基类::class:`paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerPretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterUNIMOText:1 +msgid "基类::class:`paddlenlp.transformers.unimo.modeling.UNIMOPretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterBART:1 +msgid "基类::class:`paddlenlp.transformers.bart.modeling.BartPretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterMBART:1 +msgid "基类::class:`paddlenlp.transformers.mbart.modeling.MBartPretrainedModel`" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..b86c4d9321a5dc4038698c895039b8e3dca3e07a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.fast_transformer.transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.fast_transformer.transformer.rst:2 +msgid "transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.AdamwOptimizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.AdamwOptimizer.po new file mode 100644 index 0000000000000000000000000000000000000000..f52a3d7f4044c127d38ec9b8437f9772038543dc --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.AdamwOptimizer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.optimizer.AdamwOptimizer.rst:2 +msgid "AdamwOptimizer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamw.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamw.po new file mode 100644 index 0000000000000000000000000000000000000000..16b4edae7c465a9543a0ba98c702573119192c34 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamw.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.optimizer.adamw.rst:2 +msgid "adamw" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamwdl.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamwdl.po new file mode 100644 index 0000000000000000000000000000000000000000..c2a5f6042d0508ad1e0e69bf2387f36c288c1fe0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.adamwdl.po @@ -0,0 +1,168 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.optimizer.adamwdl.rst:2 +msgid "adamwdl" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:1 +msgid "基类::class:`paddle.optimizer.adamw.AdamW`" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:1 +msgid "" +"The AdamWDL optimizer is implemented based on the AdamW Optimization with" +" dynamic lr setting. Generally it's used for transformer model." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:4 +msgid "" +"We use \"layerwise_lr_decay\" as default dynamic lr setting method of " +"AdamWDL. “Layer-wise decay” means exponentially decaying the learning " +"rates of individual layers in a top-down manner. For example, suppose the" +" 24-th layer uses a learning rate l, and the Layer-wise decay rate is α, " +"then the learning rate of layer m is lα^(24-m). See more details on: " +"https://arxiv.org/abs/1906.08237." 
+msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:10 +msgid "" +"& t = t + 1\n" +"\n" +"& moment\\_1\\_out = {\\beta}_1 * moment\\_1 + (1 - {\\beta}_1) * grad\n" +"\n" +"& moment\\_2\\_out = {\\beta}_2 * moment\\_2 + (1 - {\\beta}_2) * grad * " +"grad\n" +"\n" +"& learning\\_rate = learning\\_rate * \\frac{\\sqrt{1 - {\\beta}_2^t}}{1 " +"- {\\beta}_1^t}\n" +"\n" +"& param\\_out = param - learning\\_rate * " +"(\\frac{moment\\_1}{\\sqrt{moment\\_2} + \\epsilon} + \\lambda * param)" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:21 +msgid "" +"The learning rate used to update ``Parameter``. It can be a float value " +"or a LRScheduler. The default value is 0.001." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:24 +msgid "" +"The exponential decay rate for the 1st moment estimates. It should be a " +"float number or a Tensor with shape [1] and data type as float32. The " +"default value is 0.9." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:28 +msgid "" +"The exponential decay rate for the 2nd moment estimates. It should be a " +"float number or a Tensor with shape [1] and data type as float32. The " +"default value is 0.999." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:32 +msgid "" +"A small float value for numerical stability. It should be a float number " +"or a Tensor with shape [1] and data type as float32. The default value is" +" 1e-08." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:36 +msgid "" +"List/Tuple of ``Tensor`` to update to minimize ``loss``. \\ This " +"parameter is required in dygraph mode. \\ The default value is None in " +"static mode, at this time all parameters will be updated." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:40 +msgid "" +"The weight decay coefficient, it can be float or Tensor. The default " +"value is 0.01." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:42 +msgid "" +"If it is not None, only tensors that makes " +"apply_decay_param_fun(Tensor.name)==True will be updated. It only works " +"when we want to specify tensors. Default: None." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:47 +msgid "" +"Gradient cliping strategy, it's an instance of some derived class of " +"``GradientClipBase`` . There are three cliping strategies ( " +":ref:`api_fluid_clip_GradientClipByGlobalNorm` , " +":ref:`api_fluid_clip_GradientClipByNorm` , " +":ref:`api_fluid_clip_GradientClipByValue` ). Default None, meaning there " +"is no gradient clipping." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:52 +msgid "" +"The official Adam algorithm has two moving-average accumulators. The " +"accumulators are updated at every step. Every element of the two moving-" +"average is updated in both dense mode and sparse mode. If the size of " +"parameter is very large, then the update may be very slow. The lazy mode " +"only update the element that has gradient in current mini-batch, so it " +"will be much more faster. But this mode has different semantics with the " +"original Adam algorithm and may lead to different result. The default " +"value is False." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:60 +msgid "Whether to use multi-precision during weight updating. Default is false." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:62 +msgid "The layer-wise decay ratio. Defaults to 1.0." 
+msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:64 +msgid "The total number of encoder layers. Defaults to 12." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:66 +msgid "" +"If it's not None, set_param_lr_fun() will set the parameter learning " +"rate before it executes Adam Operator. Defaults to " +":ref:`layerwise_lr_decay`." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:69 +msgid "" +"The keys of name_dict is dynamic name of model while the value of " +"name_dict is static name. Use model.named_parameters() to get name_dict." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:72 +msgid "" +"Normally there is no need for user to set this property. For more " +"information, please refer to :ref:`api_guide_Name`. The default value is " +"None." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:78 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.ema.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.ema.po new file mode 100644 index 0000000000000000000000000000000000000000..174756237d0497df58ef8b2406c3716e27e5b636 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.ema.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.ops.optimizer.ema.rst:2 +msgid "ema" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.po new file mode 100644 index 0000000000000000000000000000000000000000..2ff6d35cf192d6d7737432bbb342a5b9a6612080 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.optimizer.po @@ -0,0 +1,168 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.optimizer.rst:2 +msgid "optimizer" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:1 +msgid "基类::class:`paddle.optimizer.adamw.AdamW`" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:1 +msgid "" +"The AdamWDL optimizer is implemented based on the AdamW Optimization with" +" dynamic lr setting. Generally it's used for transformer model." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:4 +msgid "" +"We use \"layerwise_lr_decay\" as default dynamic lr setting method of " +"AdamWDL. “Layer-wise decay” means exponentially decaying the learning " +"rates of individual layers in a top-down manner. 
For example, suppose the" +" 24-th layer uses a learning rate l, and the Layer-wise decay rate is α, " +"then the learning rate of layer m is lα^(24-m). See more details on: " +"https://arxiv.org/abs/1906.08237." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:10 +msgid "" +"& t = t + 1\n" +"\n" +"& moment\\_1\\_out = {\\beta}_1 * moment\\_1 + (1 - {\\beta}_1) * grad\n" +"\n" +"& moment\\_2\\_out = {\\beta}_2 * moment\\_2 + (1 - {\\beta}_2) * grad * " +"grad\n" +"\n" +"& learning\\_rate = learning\\_rate * \\frac{\\sqrt{1 - {\\beta}_2^t}}{1 " +"- {\\beta}_1^t}\n" +"\n" +"& param\\_out = param - learning\\_rate * " +"(\\frac{moment\\_1}{\\sqrt{moment\\_2} + \\epsilon} + \\lambda * param)" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL +msgid "参数" +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:21 +msgid "" +"The learning rate used to update ``Parameter``. It can be a float value " +"or a LRScheduler. The default value is 0.001." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:24 +msgid "" +"The exponential decay rate for the 1st moment estimates. It should be a " +"float number or a Tensor with shape [1] and data type as float32. The " +"default value is 0.9." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:28 +msgid "" +"The exponential decay rate for the 2nd moment estimates. It should be a " +"float number or a Tensor with shape [1] and data type as float32. The " +"default value is 0.999." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:32 +msgid "" +"A small float value for numerical stability. It should be a float number " +"or a Tensor with shape [1] and data type as float32. The default value is" +" 1e-08." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:36 +msgid "" +"List/Tuple of ``Tensor`` to update to minimize ``loss``. \\ This " +"parameter is required in dygraph mode. \\ The default value is None in " +"static mode, at this time all parameters will be updated." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:40 +msgid "" +"The weight decay coefficient, it can be float or Tensor. The default " +"value is 0.01." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:42 +msgid "" +"If it is not None, only tensors that makes " +"apply_decay_param_fun(Tensor.name)==True will be updated. It only works " +"when we want to specify tensors. Default: None." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:47 +msgid "" +"Gradient cliping strategy, it's an instance of some derived class of " +"``GradientClipBase`` . There are three cliping strategies ( " +":ref:`api_fluid_clip_GradientClipByGlobalNorm` , " +":ref:`api_fluid_clip_GradientClipByNorm` , " +":ref:`api_fluid_clip_GradientClipByValue` ). Default None, meaning there " +"is no gradient clipping." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:52 +msgid "" +"The official Adam algorithm has two moving-average accumulators. The " +"accumulators are updated at every step. Every element of the two moving-" +"average is updated in both dense mode and sparse mode. If the size of " +"parameter is very large, then the update may be very slow. The lazy mode " +"only update the element that has gradient in current mini-batch, so it " +"will be much more faster. But this mode has different semantics with the " +"original Adam algorithm and may lead to different result. The default " +"value is False." 
+msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:60 +msgid "Whether to use multi-precision during weight updating. Default is false." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:62 +msgid "The layer-wise decay ratio. Defaults to 1.0." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:64 +msgid "The total number of encoder layers. Defaults to 12." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:66 +msgid "" +"If it's not None, set_param_lr_fun() will set the parameter learning " +"rate before it executes Adam Operator. Defaults to " +":ref:`layerwise_lr_decay`." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:69 +msgid "" +"The keys of name_dict is dynamic name of model while the value of " +"name_dict is static name. Use model.named_parameters() to get name_dict." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:72 +msgid "" +"Normally there is no need for user to set this property. For more " +"information, please refer to :ref:`api_guide_Name`. The default value is " +"None." +msgstr "" + +#: of paddlenlp.ops.optimizer.adamwdl.AdamWDL:78 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.po new file mode 100644 index 0000000000000000000000000000000000000000..c6c62ce5e251eb44bf5c0b1fe503b9328a7d7a5b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.rst:2 +msgid "paddlenlp.ops" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.strings.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.strings.po new file mode 100644 index 0000000000000000000000000000000000000000..3136f0d90b94f699a1f5722a77a6388091713d1f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.strings.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.strings.rst:2 +msgid "strings" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.decoding.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.decoding.po new file mode 100644 index 0000000000000000000000000000000000000000..9c31bf833d87fc12c06a90283a7ea06255c2a02b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.decoding.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.transformer.decoding.rst:2 +msgid "decoding" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.fast_transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.fast_transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..49a6647b9d42808ae2f390bef6ac17c2a2198c51 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.fast_transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.transformer.fast_transformer.rst:2 +msgid "fast\\_transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..92f9893b41b0c119147fb6508340cb7785cfacf5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.ops.transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.ops.transformer.rst:2 +msgid "transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.po new file mode 100644 index 0000000000000000000000000000000000000000..a1c26dcfa3be07be61669e1df9d052865a79901c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.rst:2 +msgid "paddlenlp" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.encoder.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.encoder.po new file mode 100644 index 0000000000000000000000000000000000000000..35d01ba951acb02ec29879d41ca057ad270aad74 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.encoder.po @@ -0,0 +1,566 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.seq2vec.encoder.rst:2 +msgid "encoder" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder:1 +#: paddlenlp.seq2vec.encoder.CNNEncoder:1 +#: paddlenlp.seq2vec.encoder.GRUEncoder:1 +#: paddlenlp.seq2vec.encoder.LSTMEncoder:1 +#: paddlenlp.seq2vec.encoder.RNNEncoder:1 +#: paddlenlp.seq2vec.encoder.TCNEncoder:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder:1 +msgid "" +"A `BoWEncoder` takes as input a sequence of vectors and returns a single " +"vector, which simply sums the embeddings of a sequence across the time " +"dimension. The input to this encoder is of shape `(batch_size, " +"num_tokens, emb_dim)`, and the output is of shape `(batch_size, " +"emb_dim)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder +#: paddlenlp.seq2vec.encoder.BoWEncoder.forward +#: paddlenlp.seq2vec.encoder.CNNEncoder +#: paddlenlp.seq2vec.encoder.CNNEncoder.forward +#: paddlenlp.seq2vec.encoder.GRUEncoder +#: paddlenlp.seq2vec.encoder.GRUEncoder.forward +#: paddlenlp.seq2vec.encoder.LSTMEncoder +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward +#: paddlenlp.seq2vec.encoder.RNNEncoder +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward +#: paddlenlp.seq2vec.encoder.TCNEncoder +#: paddlenlp.seq2vec.encoder.TCNEncoder.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder:6 +#: paddlenlp.seq2vec.encoder.CNNEncoder:20 +msgid "The dimension of each vector in the input sequence." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder:10 +#: paddlenlp.seq2vec.encoder.CNNEncoder:39 +#: paddlenlp.seq2vec.encoder.GRUEncoder:44 +#: paddlenlp.seq2vec.encoder.LSTMEncoder:43 +#: paddlenlp.seq2vec.encoder.RNNEncoder:43 +msgid "示例" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `BoWEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `BoWEncoder`. 
" +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward:1 +msgid "It simply sums the embeddings of a sequence across the time dimension." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward:3 +msgid "" +"Shape as `(batch_size, num_tokens, emb_dim)` and dtype as `float32` or " +"`float64`. The sequence length of the input sequence." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward:6 +msgid "" +"Shape same as `inputs`. Its each elements identify whether the " +"corresponding input token is padding or not. If True, not padding token. " +"If False, padding token. Defaults to `None`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward +#: paddlenlp.seq2vec.encoder.CNNEncoder.forward +#: paddlenlp.seq2vec.encoder.GRUEncoder.forward +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward +#: paddlenlp.seq2vec.encoder.TCNEncoder.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward:12 +msgid "" +"Returns tensor `summed`, the result vector of BagOfEmbedding. Its data " +"type is same as `inputs` and its shape is `[batch_size, emb_dim]`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.BoWEncoder.forward +#: paddlenlp.seq2vec.encoder.CNNEncoder.forward +#: paddlenlp.seq2vec.encoder.GRUEncoder.forward +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward +#: paddlenlp.seq2vec.encoder.TCNEncoder.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:1 +msgid "" +"A `CNNEncoder` takes as input a sequence of vectors and returns a single " +"vector, a combination of multiple convolution layers and max pooling " +"layers. The input to this encoder is of shape `(batch_size, num_tokens, " +"emb_dim)`, and the output is of shape `(batch_size, output_dim)` or " +"`(batch_size, len(ngram_filter_sizes) * num_filter)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:6 +msgid "" +"The CNN has one convolution layer for each ngram filter size. Each " +"convolution operation gives out a vector of size num_filter. The number " +"of times a convolution layer will be used is `num_tokens - ngram_size + " +"1`. The corresponding maxpooling layer aggregates all these outputs from " +"the convolution layer and outputs the max." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:11 +msgid "" +"This operation is repeated for every ngram size passed, and consequently " +"the dimensionality of the output after maxpooling is " +"`len(ngram_filter_sizes) * num_filter`. This then gets (optionally) " +"projected down to a lower dimensional output, specified by `output_dim`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:15 +msgid "" +"We then use a fully connected layer to project in back to the desired " +"output_dim. For more details, refer to `A Sensitivity Analysis of (and " +"Practitioners’ Guide to) Convolutional Neural Networks for Sentence " +"Classification `__ , Zhang and Wallace " +"2016, particularly Figure 1." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:22 +msgid "" +"This is the output dim for each convolutional layer, which is the number " +"of \"filters\" learned by that layer." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:25 +msgid "" +"This specifies both the number of convolutional layers we will create and" +" their sizes. 
The default of `(2, 3, 4, 5)` will have four convolutional" +" layers, corresponding to encoding ngrams of size 2 to 5 with some number" +" of filters." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:29 +msgid "" +"Activation to use after the convolution layers. Defaults to " +"`paddle.nn.Tanh()`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder:32 +msgid "" +"After doing convolutions and pooling, we'll project the collected " +"features into a vector of this size. If this value is `None`, we will " +"just return the result of the max pooling, giving an output of shape " +"`len(ngram_filter_sizes) * num_filter`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `CNNEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `CNNEncoder`. " +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.forward:1 +msgid "The combination of multiple convolution layers and max pooling layers." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.forward:3 +msgid "" +"Shape as `(batch_size, num_tokens, emb_dim)` and dtype as `float32` or " +"`float64`. Tensor containing the features of the input sequence." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.forward:6 +msgid "" +"Shape shoule be same as `inputs` and dtype as `int32`, `int64`, `float32`" +" or `float64`. Its each elements identify whether the corresponding input" +" token is padding or not. If True, not padding token. If False, padding " +"token. Defaults to `None`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.CNNEncoder.forward:12 +msgid "" +"Returns tensor `result`. If output_dim is None, the result shape is of " +"`(batch_size, output_dim)` and dtype is `float`; If not, the result shape" +" is of `(batch_size, len(ngram_filter_sizes) * num_filter)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:1 +msgid "" +"A GRUEncoder takes as input a sequence of vectors and returns a single " +"vector, which is a combination of multiple `paddle.nn.GRU " +"`__ subclass. The input to this encoder " +"is of shape `(batch_size, num_tokens, input_size)`, The output is of " +"shape `(batch_size, hidden_size * 2)` if GRU is bidirection; If not, " +"output is of shape `(batch_size, hidden_size)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:9 +msgid "" +"Paddle's GRU have two outputs: the hidden state for every time step at " +"last layer, and the hidden state at the last time step for every layer. " +"If `pooling_type` is not None, we perform the pooling on the hidden state" +" of every time step at last layer to create a single vector. If None, we " +"use the hidden state of the last time step at last layer as a single " +"output (shape of `(batch_size, hidden_size)`); And if direction is " +"bidirection, the we concat the hidden state of the last forward gru and " +"backward gru layer to create a single vector (shape of `(batch_size, " +"hidden_size * 2)`)." 
+msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:17 +#: paddlenlp.seq2vec.encoder.LSTMEncoder:17 +#: paddlenlp.seq2vec.encoder.RNNEncoder:17 +#: paddlenlp.seq2vec.encoder.TCNEncoder:14 +msgid "The number of expected features in the input (the last dimension)." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:19 +#: paddlenlp.seq2vec.encoder.LSTMEncoder:19 +#: paddlenlp.seq2vec.encoder.RNNEncoder:19 +msgid "The number of features in the hidden state." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:21 +msgid "" +"Number of recurrent layers. E.g., setting num_layers=2 would mean " +"stacking two GRUs together to form a stacked GRU, with the second GRU " +"taking in outputs of the first GRU and computing the final results. " +"Defaults to 1." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:26 +msgid "" +"The direction of the network. It can be \"forward\" and \"bidirect\" (it " +"means bidirection network). If \"bidirect\", it is a birectional GRU, and" +" returns the concat output from both directions. Defaults to \"forward\"." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:31 +msgid "" +"If non-zero, introduces a Dropout layer on the outputs of each GRU layer " +"except the last layer, with dropout probability equal to dropout. " +"Defaults to 0.0." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder:35 +msgid "" +"If `pooling_type` is None, then the GRUEncoder will return the hidden " +"state of the last time step at last layer as a single vector. If " +"pooling_type is not None, it must be one of \"sum\", \"max\" and " +"\"mean\". Then it will be pooled on the GRU output (the hidden state of " +"every time step at last layer) to create a single vector. Defaults to " +"`None`" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `GRUEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `GRUEncoder`. " +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.forward:1 +msgid "" +"GRUEncoder takes the a sequence of vectors and returns a single " +"vector, which is a combination of multiple GRU layers. The input to this " +"encoder is of shape `(batch_size, num_tokens, input_size)`, The output is" +" of shape `(batch_size, hidden_size * 2)` if GRU is bidirection; If not, " +"output is of shape `(batch_size, hidden_size)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.forward:7 +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward:7 +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward:7 +msgid "" +"Shape as `(batch_size, num_tokens, input_size)`. Tensor containing the " +"features of the input sequence." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.forward:10 +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward:10 +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward:10 +msgid "Shape as `(batch_size)`. The sequence length of the input sequence." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.GRUEncoder.forward:14 +#: paddlenlp.seq2vec.encoder.LSTMEncoder.forward:14 +#: paddlenlp.seq2vec.encoder.RNNEncoder.forward:14 +msgid "" +"Returns tensor `output`, the hidden state at the last time step for every" +" layer. 
Its data type is `float` and its shape is `[batch_size, " +"hidden_size]`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:1 +msgid "" +"An LSTMEncoder takes as input a sequence of vectors and returns a single " +"vector, which is a combination of multiple `paddle.nn.LSTM " +"`__ subclass. The input to this encoder" +" is of shape `(batch_size, num_tokens, input_size)`. The output is of " +"shape `(batch_size, hidden_size * 2)` if LSTM is bidirection; If not, " +"output is of shape `(batch_size, hidden_size)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:9 +msgid "" +"Paddle's LSTM have two outputs: the hidden state for every time step at " +"last layer, and the hidden state and cell at the last time step for every" +" layer. If `pooling_type` is not None, we perform the pooling on the " +"hidden state of every time step at last layer to create a single vector. " +"If None, we use the hidden state of the last time step at last layer as a" +" single output (shape of `(batch_size, hidden_size)`); And if direction " +"is bidirection, the we concat the hidden state of the last forward lstm " +"and backward lstm layer to create a single vector (shape of `(batch_size," +" hidden_size * 2)`)." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:21 +msgid "" +"Number of recurrent layers. E.g., setting num_layers=2 would mean " +"stacking two LSTMs together to form a stacked LSTM, with the second LSTM " +"taking in outputs of the first LSTM and computing the final results. " +"Defaults to 1." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:26 +msgid "" +"The direction of the network. It can be \"forward\" or \"bidirect\" (it " +"means bidirection network). If \"bidirect\", it is a birectional LSTM, " +"and returns the concat output from both directions. Defaults to " +"\"forward\"." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:30 +msgid "" +"If non-zero, introduces a Dropout layer on the outputs of each LSTM layer" +" except the last layer, with dropout probability equal to dropout. " +"Defaults to 0.0 ." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder:34 +msgid "" +"If `pooling_type` is None, then the LSTMEncoder will return the hidden " +"state of the last time step at last layer as a single vector. If " +"pooling_type is not None, it must be one of \"sum\", \"max\" and " +"\"mean\". Then it will be pooled on the LSTM output (the hidden state of " +"every time step at last layer) to create a single vector. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `LSTMEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `LSTMEncoder`. " +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.LSTMEncoder.forward:1 +msgid "" +"LSTMEncoder takes the a sequence of vectors and returns a single " +"vector, which is a combination of multiple LSTM layers. The input to this" +" encoder is of shape `(batch_size, num_tokens, input_size)`, The output " +"is of shape `(batch_size, hidden_size * 2)` if LSTM is bidirection; If " +"not, output is of shape `(batch_size, hidden_size)`." 
+msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:1 +msgid "" +"A RNNEncoder takes as input a sequence of vectors and returns a single " +"vector, which is a combination of multiple `paddle.nn.RNN " +"`__ subclass. The input to this encoder " +"is of shape `(batch_size, num_tokens, input_size)`, The output is of " +"shape `(batch_size, hidden_size * 2)` if RNN is bidirection; If not, " +"output is of shape `(batch_size, hidden_size)`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:9 +msgid "" +"Paddle's RNN have two outputs: the hidden state for every time step at " +"last layer, and the hidden state at the last time step for every layer. " +"If `pooling_type` is not None, we perform the pooling on the hidden state" +" of every time step at last layer to create a single vector. If None, we " +"use the hidden state of the last time step at last layer as a single " +"output (shape of `(batch_size, hidden_size)`); And if direction is " +"bidirection, the we concat the hidden state of the last forward rnn and " +"backward rnn layer to create a single vector (shape of `(batch_size, " +"hidden_size * 2)`)." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:21 +msgid "" +"Number of recurrent layers. E.g., setting num_layers=2 would mean " +"stacking two RNNs together to form a stacked RNN, with the second RNN " +"taking in outputs of the first RNN and computing the final results. " +"Defaults to 1." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:26 +msgid "" +"The direction of the network. It can be \"forward\" and \"bidirect\" (it " +"means bidirection network). If \"biderect\", it is a birectional RNN, and" +" returns the concat output from both directions. Defaults to \"forward\"" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:30 +msgid "" +"If non-zero, introduces a Dropout layer on the outputs of each RNN layer " +"except the last layer, with dropout probability equal to dropout. " +"Defaults to 0.0." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder:34 +msgid "" +"If `pooling_type` is None, then the RNNEncoder will return the hidden " +"state of the last time step at last layer as a single vector. If " +"pooling_type is not None, it must be one of \"sum\", \"max\" and " +"\"mean\". Then it will be pooled on the RNN output (the hidden state of " +"every time step at last layer) to create a single vector. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `RNNEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `RNNEncoder`. " +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.RNNEncoder.forward:1 +msgid "" +"RNNEncoder takes the a sequence of vectors and returns a single " +"vector, which is a combination of multiple RNN layers. The input to this " +"encoder is of shape `(batch_size, num_tokens, input_size)`. The output is" +" of shape `(batch_size, hidden_size * 2)` if RNN is bidirection; If not, " +"output is of shape `(batch_size, hidden_size)`." 
+msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:1 +msgid "" +"A `TCNEncoder` takes as input a sequence of vectors and returns a single " +"vector, which is the last one time step in the feature map. The input to " +"this encoder is of shape `(batch_size, num_tokens, input_size)`, and the " +"output is of shape `(batch_size, num_channels[-1])` with a receptive " +"filed:" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:7 +#: paddlenlp.seq2vec.encoder.TCNEncoder.forward:7 +msgid "" +"receptive filed = 2 * " +"\\sum_{i=0}^{len(num\\_channels)-1}2^i(kernel\\_size-1)." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:11 +msgid "" +"Temporal Convolutional Networks is a simple convolutional architecture. " +"It outperforms canonical recurrent networks such as LSTMs in many tasks. " +"See https://arxiv.org/pdf/1803.01271.pdf for more details." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:16 +msgid "The number of channels in different layer." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:18 +msgid "The kernel size. Defaults to 2." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder:20 +msgid "The dropout probability. Defaults to 0.2." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder.get_input_dim:1 +msgid "" +"Returns the dimension of the vector input for each element in the " +"sequence input to a `TCNEncoder`. This is not the shape of the input " +"tensor, but the last element of that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder.get_output_dim:1 +msgid "" +"Returns the dimension of the final vector output by this `TCNEncoder`. " +"This is not the shape of the returned tensor, but the last element of " +"that shape." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder.forward:1 +msgid "" +"TCNEncoder takes as input a sequence of vectors and returns a single " +"vector, which is the last one time step in the feature map. The input to " +"this encoder is of shape `(batch_size, num_tokens, input_size)`, and the " +"output is of shape `(batch_size, num_channels[-1])` with a receptive " +"filed:" +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder.forward:11 +msgid "The input tensor with shape `[batch_size, num_tokens, input_size]`." +msgstr "" + +#: of paddlenlp.seq2vec.encoder.TCNEncoder.forward:14 +msgid "Returns tensor `output` with shape `[batch_size, num_channels[-1]]`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.po new file mode 100644 index 0000000000000000000000000000000000000000..b834eb1c2519b88a2d72c395bf5bf3b4113d0b36 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.seq2vec.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.seq2vec.rst:2 +msgid "paddlenlp.seq2vec" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dependency_parsing.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dependency_parsing.po new file mode 100644 index 0000000000000000000000000000000000000000..28f4a1fc13f643978c8098a0d5c69cbc15a72317 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dependency_parsing.po @@ -0,0 +1,158 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.dependency_parsing.rst:2 +msgid "dependency\\_parsing" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DDParserTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DDParserTask:1 +msgid "" +"DDParser task to analyze the dependency relationship between words in a " +"sentence :param task: The name of task. :type task: string :param model: " +"The model name in the task. :type model: string :param tree: Ensure the " +"output conforms to the tree structure. :type tree: bool :param prob: " +"Whether to return the probability of predicted heads. :type prob: bool " +":param use_pos: Whether to return the postag. :type use_pos: bool :param " +"batch_size: Numbers of examples a batch. :type batch_size: int :param " +"return_visual: If True, the result will contain the dependency " +"visualization. :type return_visual: bool :param kwargs: Additional " +"keyword arguments passed along to the specific task. :type kwargs: dict, " +"optional" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.pad_sequence:1 +msgid "Fill sequences(np.ndarray) into a fixed-length matrix." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.eisner:1 +msgid "" +"Eisner algorithm is a general dynamic programming decoding algorithm for " +"bilexical grammar." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.eisner:5 +msgid "Args:" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.eisner:4 +msgid "" +"scores: Adjacency matrix,shape=(batch, seq_len, seq_len) mask: mask " +"matrix,shape=(batch, sql_len)" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.eisner +msgid "返回" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.eisner:7 +msgid "" +"output,shape=(batch, seq_len),the index of the parent node corresponding " +"to the token in the query" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.fill_diagonal:1 +msgid "" +"Fill value into the diagoanl of x that offset is ${offset} and the " +"coordinate system is (dim1, dim2)." 
+msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.backtrack:1 +msgid "Backtrack the position matrix of eisner to generate the tree" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:1 +msgid "Returns a diagonal stripe of the tensor." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:3 +msgid "the input tensor with 2 or more dims." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:5 +msgid "the length of the stripe." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:7 +msgid "the width of the stripe." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:9 +msgid "the offset of the first two dims." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:11 +msgid "0 if returns a horizontal stripe; 1 else." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:14 +msgid "" +"Example: >>> x = np.arange(25).reshape(5, 5) >>> x tensor([[ 0, 1, 2, " +"3, 4]," +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.stripe:18 +msgid "" +"[ 5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19], [20, " +"21, 22, 23, 24]])" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree:1 +#: paddlenlp.taskflow.dependency_parsing.Node:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.Node:1 +msgid "Node class" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree:1 +msgid "" +"DepTree class, used to check whether the prediction result is a project " +"Tree. A projective tree means that you can project the tree without " +"crossing arcs." +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree.build_tree:1 +msgid "Build the tree" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree.add:1 +msgid "Add a child node" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree.judge_legal:1 +msgid "Determine whether it is a project tree" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.DepTree.inorder_traversal:1 +msgid "Inorder traversal" +msgstr "" + +#: of paddlenlp.taskflow.dependency_parsing.istree:1 +msgid "Is the sequence a project tree" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dialogue.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dialogue.po new file mode 100644 index 0000000000000000000000000000000000000000..03c5d6d4de824833c56cabf07cb57d9defac60f8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.dialogue.po @@ -0,0 +1,39 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.dialogue.rst:2 +msgid "dialogue" +msgstr "" + +#: of paddlenlp.taskflow.dialogue.DialogueTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.dialogue.DialogueTask:1 +msgid "" +"Task of Chinese open domain dialogue. :param task: The name of task. " +":type task: string :param model: The model name in the task. 
:type model:" +" string :param kwargs: Additional keyword arguments passed along to the " +"specific task. :type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.dialogue.DialogueTask.interactive_mode:1 +msgid "Enter the interactive mode." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.information_extraction.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.information_extraction.po new file mode 100644 index 0000000000000000000000000000000000000000..e5eb592a77eea9a4d361f755872b01d725146536 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.information_extraction.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.taskflow.information_extraction.rst:2 +msgid "information\\_extraction" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.knowledge_mining.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.knowledge_mining.po new file mode 100644 index 0000000000000000000000000000000000000000..7e684c5191a0e3a795490540e947c477bf75c16e --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.knowledge_mining.po @@ -0,0 +1,62 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.knowledge_mining.rst:2 +msgid "knowledge\\_mining" +msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.NPTagTask:1 +#: paddlenlp.taskflow.knowledge_mining.WordTagTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.WordTagTask:1 +msgid "" +"This the NER(Named Entity Recognition) task that convert the raw text to " +"entities. And the task with the `wordtag` model will link the more " +"meesage with the entity. :param task: The name of task. :type task: " +"string :param model: The model name in the task. :type model: string " +":param kwargs: Additional keyword arguments passed along to the specific " +"task. :type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.NPTagTask:12 +#: paddlenlp.taskflow.knowledge_mining.WordTagTask:11 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.NPTagTask.summary_num:1 +#: paddlenlp.taskflow.knowledge_mining.WordTagTask.summary_num:1 +msgid "Number of model summary token" +msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.WordTagTask.linking:1 +msgid "Whether to do term linking." 
+msgstr "" + +#: of paddlenlp.taskflow.knowledge_mining.NPTagTask:1 +msgid "" +"Noun phrase tagging task that convert the noun phrase to POS tag. :param " +"task: The name of task. :type task: string :param model: The model name " +"in the task. :type model: string :param batch_size: Numbers of examples a" +" batch. :type batch_size: int :param linking: Returns the categories. If " +"`linking` is True, the fine-grained label (label) will link with the " +"coarse-grained label (category). :type linking: bool" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.lexical_analysis.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.lexical_analysis.po new file mode 100644 index 0000000000000000000000000000000000000000..680107ee5916083e520c96c199482e99cbdd408c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.lexical_analysis.po @@ -0,0 +1,41 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.lexical_analysis.rst:2 +msgid "lexical\\_analysis" +msgstr "" + +#: of paddlenlp.taskflow.lexical_analysis.load_vocab:1 +msgid "Load vocab from file" +msgstr "" + +#: of paddlenlp.taskflow.lexical_analysis.LacTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.lexical_analysis.LacTask:1 +msgid "" +"Lexical analysis of Chinese task to segement the chinese sentence. :param" +" task: The name of task. :type task: string :param model: The model name " +"in the task. :type model: string :param user_dict: The user-defined " +"dictionary, default to None. :type user_dict: string :param kwargs: " +"Additional keyword arguments passed along to the specific task. :type " +"kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.model.po new file mode 100644 index 0000000000000000000000000000000000000000..680bb94c7a78e711e703de04c2f337106daf14ae --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.model.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.model.rst:2 +msgid "model" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.dependency_parsing_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.dependency_parsing_model.po new file mode 100644 index 0000000000000000000000000000000000000000..b422296e89e3040fabea34345387449df89efde0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.dependency_parsing_model.po @@ -0,0 +1,81 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.models.dependency_parsing_model.rst:2 +msgid "dependency\\_parsing\\_model" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.BiAffineParser:1 +msgid "DDParser" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.BiAffine.forward:1 +#: paddlenlp.taskflow.models.dependency_parsing_model.BiAffineParser.forward:1 +#: paddlenlp.taskflow.models.dependency_parsing_model.ErnieEncoder.forward:1 +#: paddlenlp.taskflow.models.dependency_parsing_model.LSTMByWPEncoder.forward:1 +#: paddlenlp.taskflow.models.dependency_parsing_model.MLP.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.BiAffine.forward +#: paddlenlp.taskflow.models.dependency_parsing_model.BiAffineParser.forward +#: paddlenlp.taskflow.models.dependency_parsing_model.ErnieEncoder.forward +#: paddlenlp.taskflow.models.dependency_parsing_model.LSTMByWPEncoder.forward +#: paddlenlp.taskflow.models.dependency_parsing_model.MLP.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.BiAffine.forward:4 +#: paddlenlp.taskflow.models.dependency_parsing_model.BiAffineParser.forward:4 +#: paddlenlp.taskflow.models.dependency_parsing_model.ErnieEncoder.forward:4 +#: paddlenlp.taskflow.models.dependency_parsing_model.LSTMByWPEncoder.forward:4 +#: paddlenlp.taskflow.models.dependency_parsing_model.MLP.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.BiAffine.forward:6 +#: paddlenlp.taskflow.models.dependency_parsing_model.BiAffineParser.forward:6 +#: paddlenlp.taskflow.models.dependency_parsing_model.ErnieEncoder.forward:6 +#: paddlenlp.taskflow.models.dependency_parsing_model.LSTMByWPEncoder.forward:6 +#: paddlenlp.taskflow.models.dependency_parsing_model.MLP.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.index_sample:1 +msgid "Select input value according to index" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.index_sample:4 +msgid "Arags:" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.index_sample:4 +msgid "input: input matrix index: index matrix" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.index_sample +msgid "返回" +msgstr "" + +#: of paddlenlp.taskflow.models.dependency_parsing_model.index_sample:6 +msgid "output" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.information_extraction_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.information_extraction_model.po new file mode 100644 index 0000000000000000000000000000000000000000..22676e1db76165cc30c69079e58ecb957d9c0366 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.information_extraction_model.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.taskflow.models.information_extraction_model.rst:2 +msgid "information\\_extraction\\_model" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.lexical_analysis_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.lexical_analysis_model.po new file mode 100644 index 0000000000000000000000000000000000000000..efaac39394ad6d82b3b48e657ad93a7ad7a07252 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.lexical_analysis_model.po @@ -0,0 +1,55 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.models.lexical_analysis_model.rst:2 +msgid "lexical\\_analysis\\_model" +msgstr "" + +#: of paddlenlp.taskflow.models.lexical_analysis_model.BiGruCrf:1 +msgid "" +"The network for lexical analysis, based on two layers of BiGRU and one " +"layer of CRF. More details see https://arxiv.org/abs/1807.01882 :param " +"word_emb_dim: The dimension in which a word is embedded. :type " +"word_emb_dim: int :param hidden_size: The number of hidden nodes in the " +"GRU layer. :type hidden_size: int :param vocab_size: the word vocab size." +" :type vocab_size: int :param num_labels: the labels amount. :type " +"num_labels: int :param emb_lr: The scaling of the learning rate of the " +"embedding layer. Defaults to 2.0. :type emb_lr: float, optional :param " +"crf_lr: The scaling of the learning rate of the crf layer. Defaults to " +"0.2. :type crf_lr: float, optional" +msgstr "" + +#: of paddlenlp.taskflow.models.lexical_analysis_model.BiGruCrf.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.taskflow.models.lexical_analysis_model.BiGruCrf.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.models.lexical_analysis_model.BiGruCrf.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.taskflow.models.lexical_analysis_model.BiGruCrf.forward:6 +msgid "unpacked dict arguments" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.po new file mode 100644 index 0000000000000000000000000000000000000000..ca52fe50a94ab58e16bf4f48f91c9080c3f6dffd --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.models.rst:2 +msgid "models" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.sentiment_analysis_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.sentiment_analysis_model.po new file mode 100644 index 0000000000000000000000000000000000000000..3b23dedcef965b6e29c24e750a0b1f83a54e5e63 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.sentiment_analysis_model.po @@ -0,0 +1,119 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.models.sentiment_analysis_model.rst:2 +msgid "sentiment\\_analysis\\_model" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel:1 +msgid "" +"This class implements the Bag of Words Classification Network model to " +"classify texts. At a high level, the model starts by embedding the tokens" +" and running them through a word embedding. Then, we encode these " +"epresentations with a `BoWEncoder`. Lastly, we take the output of the " +"encoder to create a final representation, which is passed through some " +"feed-forward layers to output a logits (`output_layer`). :param " +"vocab_size: The vocab size that used to create the embedding. :type " +"vocab_size: int :param num_class: The num class of the classifier. :type " +"num_class: int :param emb_dim: The size of the embedding, default value " +"is 128. :type emb_dim: int. optinal :param padding_idx: The padding value" +" in the embedding, the padding_idx of embedding value will" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel:13 +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:13 +msgid "not be updated, the default value is 0." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel +#: paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel.forward +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel.forward +#: paddlenlp.taskflow.models.sentiment_analysis_model.SkepSequenceModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel:15 +msgid "The output size of linear that after the bow, default value is 128." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel:17 +msgid "" +"The output size of linear that after the fisrt linear, default value is " +"96." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel.forward:1 +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel.forward:1 +#: paddlenlp.taskflow.models.sentiment_analysis_model.SkepSequenceModel.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel.forward:4 +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel.forward:4 +#: paddlenlp.taskflow.models.sentiment_analysis_model.SkepSequenceModel.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.BoWModel.forward:6 +#: paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel.forward:6 +#: paddlenlp.taskflow.models.sentiment_analysis_model.SkepSequenceModel.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:1 +msgid "" +"This class implements the Bag of Words Classification Network model to " +"classify texts. At a high level, the model starts by embedding the tokens" +" and running them through a word embedding. 
Then, we encode these " +"epresentations with a `BoWEncoder`. Lastly, we take the output of the " +"encoder to create a final representation, which is passed through some " +"feed-forward layers to output a logits (`output_layer`). :param " +"vocab_size: The vocab size that used to create the embedding. :type " +"vocab_size: int :param num_class: The num clas of the classifier. :type " +"num_class: int :param emb_dim: The size of the embedding, default value " +"is 128. :type emb_dim: int. optinal :param padding_idx: The padding value" +" in the embedding, the padding_idx of embedding value will" +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:15 +msgid "The output size of the lstm, defalut value 198." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:17 +msgid "The direction of lstm, default value is `forward`." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:19 +msgid "The num of lstm layer." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:21 +msgid "The dropout rate of lstm." +msgstr "" + +#: of paddlenlp.taskflow.models.sentiment_analysis_model.LSTMModel:23 +msgid "" +"The pooling type of lstm. Defalut value is None, if `pooling_type` is " +"None, then the LSTMEncoder will return the hidden state of the last time " +"step at last layer as a single vector." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.text_correction_model.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.text_correction_model.po new file mode 100644 index 0000000000000000000000000000000000000000..e989ef85091b23abf2eaf6f772b3edaf011f2685 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.models.text_correction_model.po @@ -0,0 +1,167 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.models.text_correction_model.rst:2 +msgid "text\\_correction\\_model" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC:1 +msgid "ErnieForCSC is a model specified for Chinese Spelling Correction task." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC:3 +msgid "" +"It integrates phonetic features into language model by leveraging the " +"powerful pre-training and fine-tuning method." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC:6 +msgid "" +"See more details on https://aclanthology.org/2021.findings-acl.198.pdf. " +":param ernie: An instance of `paddlenlp.transformers.ErnieModel`. :type " +"ernie: ErnieModel :param pinyin_vocab_size: The vocab size of pinyin " +"vocab. :type pinyin_vocab_size: int :param pad_pinyin_id: The pad token " +"id of pinyin vocab. Defaults to 0. 
:type pad_pinyin_id: int, optional" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:5 +msgid "" +"Indices of pinyin tokens of input sequence in the pinyin vocabulary. They" +" are numerical representations of tokens that build the pinyin input " +"sequence. It's data type should be `int64` and has a shape of " +"[batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:9 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:9 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:12 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:13 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:15 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to None, which means no segment embeddings is " +"added to token embeddings." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, config.max_position_embeddings - " +"1]``. Defaults to `None`. Shape as `(batch_sie, num_tokens)` and dtype as" +" `int32` or `int64`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:22 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**. - **1** for tokens that are " +"**not masked**, - **0** for tokens that are **masked**. It's data type " +"should be `float32` and has a shape of [batch_size, sequence_length]. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:22 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**." 
+msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:27 +msgid "**1** for tokens that are **not masked**," +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:28 +msgid "**0** for tokens that are **masked**." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:30 +msgid "" +"It's data type should be `float32` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:34 +msgid "" +"A Tensor of the detection prediction of each tokens. Shape as " +"`(batch_size, sequence_length)` and dtype as `int`. char_preds (Tensor):" +" A Tensor of the correction prediction of each tokens. Shape as " +"`(batch_size, sequence_length)` and dtype as `int`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:35 +msgid "A Tensor of the detection prediction of each tokens." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:35 +msgid "Shape as `(batch_size, sequence_length)` and dtype as `int`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:38 +msgid "char_preds (Tensor):" +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward:38 +msgid "" +"A Tensor of the correction prediction of each tokens. Shape as " +"`(batch_size, sequence_length)` and dtype as `int`." +msgstr "" + +#: of paddlenlp.taskflow.models.text_correction_model.ErnieForCSC.forward +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.named_entity_recognition.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.named_entity_recognition.po new file mode 100644 index 0000000000000000000000000000000000000000..1e374e1453239eb95bc5e909d7f1e48a88324cc4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.named_entity_recognition.po @@ -0,0 +1,49 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.named_entity_recognition.rst:2 +msgid "named\\_entity\\_recognition" +msgstr "" + +#: of paddlenlp.taskflow.named_entity_recognition.NERWordTagTask:1 +msgid "基类::class:`paddlenlp.taskflow.knowledge_mining.WordTagTask`" +msgstr "" + +#: of paddlenlp.taskflow.named_entity_recognition.NERWordTagTask:1 +msgid "" +"This the NER(Named Entity Recognition) task that convert the raw text to " +"entities. And the task with the `wordtag` model will link the more " +"meesage with the entity. :param task: The name of task. :type task: " +"string :param model: The model name in the task. :type model: string " +":param kwargs: Additional keyword arguments passed along to the specific " +"task. 
:type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.named_entity_recognition.NERLACTask:1 +msgid "基类::class:`paddlenlp.taskflow.lexical_analysis.LacTask`" +msgstr "" + +#: of paddlenlp.taskflow.named_entity_recognition.NERLACTask:1 +msgid "" +"Part-of-speech tagging task for the raw text. :param task: The name of " +"task. :type task: string :param model: The model name in the task. :type " +"model: string :param kwargs: Additional keyword arguments passed along to" +" the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.po new file mode 100644 index 0000000000000000000000000000000000000000..88c4509069ea68ff7ccc36cde2f9225f36848fcf --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.rst:2 +msgid "paddlenlp.taskflow" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.poetry_generation.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.poetry_generation.po new file mode 100644 index 0000000000000000000000000000000000000000..f9d4a4d5bd3567da1c797d06a71b590479db12e4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.poetry_generation.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.poetry_generation.rst:2 +msgid "poetry\\_generation" +msgstr "" + +#: of paddlenlp.taskflow.poetry_generation.PoetryGenerationTask:1 +msgid "基类::class:`paddlenlp.taskflow.text_generation.TextGenerationTask`" +msgstr "" + +#: of paddlenlp.taskflow.poetry_generation.PoetryGenerationTask:1 +msgid "" +"The text generation model to predict the question or chinese poetry. " +":param task: The name of task. :type task: string :param model: The model" +" name in the task. :type model: string :param kwargs: Additional keyword " +"arguments passed along to the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.pos_tagging.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.pos_tagging.po new file mode 100644 index 0000000000000000000000000000000000000000..125024b9e302679d3f712fbe835b00fbd5cb939a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.pos_tagging.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.pos_tagging.rst:2 +msgid "pos\\_tagging" +msgstr "" + +#: of paddlenlp.taskflow.pos_tagging.POSTaggingTask:1 +msgid "基类::class:`paddlenlp.taskflow.lexical_analysis.LacTask`" +msgstr "" + +#: of paddlenlp.taskflow.pos_tagging.POSTaggingTask:1 +msgid "" +"Part-of-speech tagging task for the raw text. :param task: The name of " +"task. :type task: string :param model: The model name in the task. :type " +"model: string :param kwargs: Additional keyword arguments passed along to" +" the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.question_answering.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.question_answering.po new file mode 100644 index 0000000000000000000000000000000000000000..b5c34939926a4f6db33e67b502137b870aedb98b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.question_answering.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.question_answering.rst:2 +msgid "question\\_answering" +msgstr "" + +#: of paddlenlp.taskflow.question_answering.QuestionAnsweringTask:1 +msgid "基类::class:`paddlenlp.taskflow.text_generation.TextGenerationTask`" +msgstr "" + +#: of paddlenlp.taskflow.question_answering.QuestionAnsweringTask:1 +msgid "" +"The text generation model to predict the question or chinese poetry. " +":param task: The name of task. :type task: string :param model: The model" +" name in the task. :type model: string :param kwargs: Additional keyword " +"arguments passed along to the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.sentiment_analysis.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.sentiment_analysis.po new file mode 100644 index 0000000000000000000000000000000000000000..8626d0f8ba41911710cfdd4d8d8d2b6ff7f76da5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.sentiment_analysis.po @@ -0,0 +1,46 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.sentiment_analysis.rst:2 +msgid "sentiment\\_analysis" +msgstr "" + +#: of paddlenlp.taskflow.sentiment_analysis.SentaTask:1 +#: paddlenlp.taskflow.sentiment_analysis.SkepTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.sentiment_analysis.SentaTask:1 +msgid "" +"Sentiment analysis task using RNN or BOW model to predict sentiment " +"opinion on Chinese text. :param task: The name of task. :type task: " +"string :param model: The model name in the task. :type model: string " +":param kwargs: Additional keyword arguments passed along to the specific " +"task. :type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.sentiment_analysis.SkepTask:1 +msgid "" +"Sentiment analysis task using ERNIE-Gram model to predict sentiment " +"opinion on Chinese text. :param task: The name of task. :type task: " +"string :param model: The model name in the task. :type model: string " +":param kwargs: Additional keyword arguments passed along to the specific " +"task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.task.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.task.po new file mode 100644 index 0000000000000000000000000000000000000000..22119bc44d00e2ea6730252aba2362969d94923c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.task.po @@ -0,0 +1,57 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.task.rst:2 +msgid "task" +msgstr "" + +#: of paddlenlp.taskflow.task.Task:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.taskflow.task.Task:1 +msgid "" +"The meta classs of task in Taskflow. The meta class has the five abstract" +" function," +msgstr "" + +#: of paddlenlp.taskflow.task.Task:2 +msgid "the subclass need to inherit from the meta class." +msgstr "" + +#: of paddlenlp.taskflow.task.Task +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.task.Task:3 +msgid "The name of task." +msgstr "" + +#: of paddlenlp.taskflow.task.Task:5 +msgid "The model name in the task." +msgstr "" + +#: of paddlenlp.taskflow.task.Task:7 +msgid "Additional keyword arguments passed along to the specific task." +msgstr "" + +#: of paddlenlp.taskflow.task.Task.help:1 +msgid "Return the usage message of the current task." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.taskflow.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.taskflow.po new file mode 100644 index 0000000000000000000000000000000000000000..c10c2d08a26a213c9d318ae5d413c726c8d2e758 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.taskflow.po @@ -0,0 +1,84 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.taskflow.rst:2 +msgid "taskflow" +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:3 +msgid "" +"The Taskflow is the end2end inferface that could convert the raw text to " +"model result, and decode the model result to task result. The main " +"functions as follows:" +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:2 +msgid "Convert the raw text to task result." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:3 +msgid "Convert the model to the inference model." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:4 +msgid "Offer the usage and help message." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:5 +msgid "The task name for the Taskflow, and get the task class from the name." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:7 +msgid "The model name in the task, if set None, will use the default model." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:9 +msgid "" +"Select the mode of the task, only used in the tasks of word_segmentation " +"and ner. If set None, will use the default mode." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:12 +msgid "The device id for the gpu, xpu and other devices, the defalut value is 0." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow:14 +msgid "Additional keyword arguments passed along to the specific task." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow.help:1 +msgid "Return the task usage message." +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow.task_path:1 +msgid "Return the path of current task" +msgstr "" + +#: of paddlenlp.taskflow.taskflow.Taskflow.tasks:1 +msgid "Return the available task list." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text2knowledge.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text2knowledge.po new file mode 100644 index 0000000000000000000000000000000000000000..0e8fa7b1ec681792528087267399daf741f9168c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text2knowledge.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.text2knowledge.rst:2 +msgid "text2knowledge" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_correction.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_correction.po new file mode 100644 index 0000000000000000000000000000000000000000..7c8db153a50a895995fd35213759cf99ab16f9df --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_correction.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.text_correction.rst:2 +msgid "text\\_correction" +msgstr "" + +#: of paddlenlp.taskflow.text_correction.CSCTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.text_correction.CSCTask:1 +msgid "" +"The text generation model to predict the question or chinese poetry. " +":param task: The name of task. :type task: string :param model: The model" +" name in the task. :type model: string :param kwargs: Additional keyword " +"arguments passed along to the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_generation.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_generation.po new file mode 100644 index 0000000000000000000000000000000000000000..2c2131c87cd543879dc03c8624276cfb76c2d133 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_generation.po @@ -0,0 +1,35 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.text_generation.rst:2 +msgid "text\\_generation" +msgstr "" + +#: of paddlenlp.taskflow.text_generation.TextGenerationTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.text_generation.TextGenerationTask:1 +msgid "" +"The text generation model to predict the question or chinese poetry. " +":param task: The name of task. :type task: string :param model: The model" +" name in the task. :type model: string :param kwargs: Additional keyword " +"arguments passed along to the specific task. 
:type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_similarity.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_similarity.po new file mode 100644 index 0000000000000000000000000000000000000000..df0db4d96d4c82a9c730eb4298dd25cc0a6ffb88 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.text_similarity.po @@ -0,0 +1,36 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.text_similarity.rst:2 +msgid "text\\_similarity" +msgstr "" + +#: of paddlenlp.taskflow.text_similarity.TextSimilarityTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.text_similarity.TextSimilarityTask:1 +msgid "" +"Text similarity task using SimBERT to predict the similarity of sentence " +"pair. :param task: The name of task. :type task: string :param model: The" +" model name in the task. :type model: string :param kwargs: Additional " +"keyword arguments passed along to the specific task. :type kwargs: dict, " +"optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..710f4c8c0c086d0a312d37e8c206c5863034d97f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.utils.po @@ -0,0 +1,343 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.utils.rst:2 +msgid "utils" +msgstr "" + +#: of paddlenlp.taskflow.utils.download_file:1 +msgid "" +"Download the file from the url to specified directory. Check md5 value " +"when the file is exists, if the md5 value is the same as the existed " +"file, just use the older file, if not, will download the file from the " +"url." 
+msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerNode +#: paddlenlp.taskflow.utils.BurkhardKellerTree.add +#: paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word +#: paddlenlp.taskflow.utils.TermTree.add_term +#: paddlenlp.taskflow.utils.TermTree.build_from_dir +#: paddlenlp.taskflow.utils.TermTree.find_term +#: paddlenlp.taskflow.utils.TermTree.from_dir +#: paddlenlp.taskflow.utils.TermTree.save paddlenlp.taskflow.utils.TermTreeNode +#: paddlenlp.taskflow.utils.TermTreeNode.from_dict +#: paddlenlp.taskflow.utils.TermTreeNode.from_json +#: paddlenlp.taskflow.utils.TriedTree.search +#: paddlenlp.taskflow.utils.download_check +#: paddlenlp.taskflow.utils.download_file +#: paddlenlp.taskflow.utils.levenstein_distance +msgid "参数" +msgstr "" + +#: of paddlenlp.taskflow.utils.download_file:5 +msgid "The specified directory saving the file." +msgstr "" + +#: of paddlenlp.taskflow.utils.download_file:7 +msgid "The specified filename saving the file." +msgstr "" + +#: of paddlenlp.taskflow.utils.download_file:9 +msgid "The url downling the file." +msgstr "" + +#: of paddlenlp.taskflow.utils.download_file:11 +msgid "The md5 value that checking the version downloaded." +msgstr "" + +#: of paddlenlp.taskflow.utils.download_check:1 +msgid "Check the resource statuc in the specified task." +msgstr "" + +#: of paddlenlp.taskflow.utils.download_check:3 +msgid "The name of specified task." +msgstr "" + +#: of paddlenlp.taskflow.utils.add_docstrings:1 +msgid "The function that add the doc string to doc of class." +msgstr "" + +#: of paddlenlp.taskflow.utils.cut_chinese_sent:1 +msgid "" +"Cut the Chinese sentences more precisely, reference to " +"\"https://blog.csdn.net/blmoistawinde/article/details/82379256\"." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerNode:1 +#: paddlenlp.taskflow.utils.BurkhardKellerTree:1 +#: paddlenlp.taskflow.utils.Customization:1 paddlenlp.taskflow.utils.TermTree:1 +#: paddlenlp.taskflow.utils.TermTreeNode:1 paddlenlp.taskflow.utils.TriedTree:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:1 +msgid "" +"Defination of term node. All members are protected, to keep rigorism of " +"data struct." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:3 +msgid "term id of node." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:5 +msgid "term, common name of this term." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:7 +msgid "`cb` indicates concept base, `eb` indicates entity base." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:9 +msgid "type of this term, constructs hirechical of `term` node. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:11 +msgid "parent type of a `type` node. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:13 +msgid "type statement of node, `type` or `term`. Defaults to \"term\"." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:13 +#: paddlenlp.taskflow.utils.TermTreeNode:15 +msgid "alias of this term. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:17 +msgid "extended alias of this term, CANNOT be used in matching. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:20 +msgid "grouped by some term. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:22 +msgid "some lower term. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode:24 +msgid "to sore full imformation of a term. 
Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode.from_dict:1 +msgid "Build a node from dictionary data." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode.from_dict:3 +msgid "Dictionary data contain all k-v data." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word +#: paddlenlp.taskflow.utils.TermTree.find_term +#: paddlenlp.taskflow.utils.TermTree.from_dir +#: paddlenlp.taskflow.utils.TermTreeNode.from_dict +#: paddlenlp.taskflow.utils.TermTreeNode.from_json +#: paddlenlp.taskflow.utils.TriedTree.search +#: paddlenlp.taskflow.utils.levenstein_distance +msgid "返回" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode.from_dict:6 +#: paddlenlp.taskflow.utils.TermTreeNode.from_json:6 +msgid "TermTree node object." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word +#: paddlenlp.taskflow.utils.TermTree.find_term +#: paddlenlp.taskflow.utils.TermTree.from_dir +#: paddlenlp.taskflow.utils.TermTreeNode.from_dict +#: paddlenlp.taskflow.utils.TermTreeNode.from_json +#: paddlenlp.taskflow.utils.TriedTree.search +#: paddlenlp.taskflow.utils.levenstein_distance +msgid "返回类型" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode.from_json:1 +msgid "Build a node from JSON string." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTreeNode.from_json:3 +msgid "JSON string formatted by TermTree data." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree:1 +msgid "TermTree class." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:1 +msgid "Add a term into TermTree." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:3 +msgid "common name of name." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:5 +msgid "term is concept or entity." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:7 +msgid "term type of this term" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:9 +msgid "sub type of this term, must exists in TermTree. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:11 +msgid "sub terms of this term. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:15 +msgid ". Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.add_term:17 +msgid "[description]. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.find_term:1 +msgid "" +"Find a term in Term Tree. If term not exists, return None. If `term_type`" +" is not None, will find term with this type." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.find_term:4 +msgid "term to look up." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.find_term:6 +msgid "find term in this term_type. Defaults to None." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.build_from_dir:3 +#: paddlenlp.taskflow.utils.TermTree.find_term:9 +#: paddlenlp.taskflow.utils.TermTree.from_dir:3 +#: paddlenlp.taskflow.utils.TermTree.from_dir:6 +msgid "[description]" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.build_from_dir:1 +#: paddlenlp.taskflow.utils.TermTree.from_dir:1 +msgid "" +"Build TermTree from a directory which should contain type schema and term" +" data." +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.save:1 +msgid "Save term tree to directory `save_dir`" +msgstr "" + +#: of paddlenlp.taskflow.utils.TermTree.save:3 +msgid "Directory." +msgstr "" + +#: of paddlenlp.taskflow.utils.levenstein_distance:1 +msgid "Calculate minimal Levenstein distance between s1 and s2." 
+msgstr "" + +#: of paddlenlp.taskflow.utils.levenstein_distance:3 +#: paddlenlp.taskflow.utils.levenstein_distance:5 +msgid "string" +msgstr "" + +#: of paddlenlp.taskflow.utils.levenstein_distance:8 +msgid "the minimal distance." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerNode:1 +msgid "" +"Node implementatation for BK-Tree. A BK-Tree node stores the information " +"of current word, and its approximate words calculated by levenstein " +"distance." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerNode:3 +msgid "word of current node." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree:1 +msgid "Implementataion of BK-Tree" +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.add:1 +msgid "Insert a word into current tree. If tree is empty, set this word to root." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.add:3 +msgid "word to be inserted." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word:1 +msgid "Search the most similar (minimal levenstain distance) word between `s`." +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word:3 +msgid "target word" +msgstr "" + +#: of paddlenlp.taskflow.utils.BurkhardKellerTree.search_similar_word:6 +msgid "similar words." +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree:1 +msgid "Implementataion of TriedTree" +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.add_word:1 +msgid "add single word into TriedTree" +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.search:1 +msgid "Backward maximum matching" +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.search:3 +msgid "string to be searched" +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.search:6 +msgid "" +"list of maximum matching words, each element represents the starting " +"and ending position of the matching string." +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.search:8 +msgid "list of maximum matching words, each element represents" +msgstr "" + +#: of paddlenlp.taskflow.utils.TriedTree.search:9 +msgid "the starting and ending position of the matching string." +msgstr "" + +#: of paddlenlp.taskflow.utils.Customization:1 +msgid "User intervention based on Aho-Corasick automaton" +msgstr "" + +#: of paddlenlp.taskflow.utils.Customization.load_customization:1 +msgid "Load the custom vocab" +msgstr "" + +#: of paddlenlp.taskflow.utils.Customization.parse_customization:1 +msgid "Use custom vocab to modify the lac results" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.word_segmentation.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.word_segmentation.po new file mode 100644 index 0000000000000000000000000000000000000000..8dcf598ef031685b7e05c2e5ff7bccaca8b6a08a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.taskflow.word_segmentation.po @@ -0,0 +1,60 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.taskflow.word_segmentation.rst:2 +msgid "word\\_segmentation" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegJiebaTask:1 +msgid "基类::class:`paddlenlp.taskflow.task.Task`" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegJiebaTask:1 +msgid "" +"Word Segmentation task for the raw text. :param task: The name of task. " +":type task: string :param model: The model name in the task. :type model:" +" string :param user_dict: The user-defined dictionary, default to None. " +":type user_dict: string :param kwargs: Additional keyword arguments " +"passed along to the specific task. :type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegLACTask:1 +msgid "基类::class:`paddlenlp.taskflow.lexical_analysis.LacTask`" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegLACTask:1 +msgid "" +"Segement the sentences to the words using LAC mode. :param task: The name" +" of task. :type task: string :param model: The model name in the task. " +":type model: string :param kwargs: Additional keyword arguments passed " +"along to the specific task. :type kwargs: dict, optional" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegWordTagTask:1 +msgid "基类::class:`paddlenlp.taskflow.named_entity_recognition.NERWordTagTask`" +msgstr "" + +#: of paddlenlp.taskflow.word_segmentation.SegWordTagTask:1 +msgid "" +"Segement the sentences to the words using WordTag model. :param task: The" +" name of task. :type task: string :param model: The model name in the " +"task. :type model: string :param kwargs: Additional keyword arguments " +"passed along to the specific task. :type kwargs: dict, optional" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.argparser.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.argparser.po new file mode 100644 index 0000000000000000000000000000000000000000..24c32480ffedb0f60dad1ceaf345fcf234d87b00 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.argparser.po @@ -0,0 +1,137 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.argparser.rst:2 +msgid "argparser" +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser:1 +msgid "基类::class:`argparse.ArgumentParser`" +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser:1 +msgid "" +"This subclass of `argparse.ArgumentParser` uses type hints on dataclasses" +" to generate arguments." +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser:3 +msgid "" +"The class is designed to play well with the native argparse. 
In " +"particular, you can add more (non-dataclass backed) arguments to the " +"parser after initialization and you'll get the output back after parsing " +"as an additional namespace. Optional: To create sub argument groups use " +"the `_argument_group_name` attribute in the dataclass." +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:1 +msgid "Parse command-line args into instances of the specified dataclass types." +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:3 +msgid "" +"This relies on argparse's `ArgumentParser.parse_known_args`. See the doc " +"at: " +"docs.python.org/3.7/library/argparse.html#argparse.ArgumentParser.parse_args" +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:6 +msgid "" +"List of strings to parse. The default is taken from sys.argv. (same as " +"argparse.ArgumentParser)" +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:7 +msgid "If true, also return a list of remaining argument strings." +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:8 +msgid "" +"If true, will look for a \".args\" file with the same base name as the " +"entry point script for this process, and will append its potential " +"content to the command line args." +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:10 +msgid "" +"If not None, will uses this file instead of the \".args\" file specified " +"in the previous argument." +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:12 +msgid "" +"- the dataclass instances in the same order as they were passed to the " +"initializer.abspath - if applicable, an additional namespace for more " +"(non-dataclass backed) arguments added to the parser after " +"initialization. - The potential list of remaining argument strings. (same" +" as argparse.ArgumentParser.parse_known_args)" +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:12 +msgid "" +"the dataclass instances in the same order as they were passed to the " +"initializer.abspath" +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:13 +msgid "" +"if applicable, an additional namespace for more (non-dataclass backed) " +"arguments added to the parser after initialization." +msgstr "" + +#: of +#: paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses:15 +msgid "" +"The potential list of remaining argument strings. (same as " +"argparse.ArgumentParser.parse_known_args)" +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser.parse_args_into_dataclasses +msgid "返回类型" +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser.parse_json_file:1 +msgid "" +"Alternative helper method that does not use `argparse` at all, instead " +"loading a json file and populating the dataclass types." +msgstr "" + +#: of paddlenlp.trainer.argparser.PdArgumentParser.parse_dict:1 +msgid "" +"Alternative helper method that does not use `argparse` at all, instead " +"uses a dict and populating the dataclass types." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.integrations.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.integrations.po new file mode 100644 index 0000000000000000000000000000000000000000..40af1d077b2643c74ffa6a0424c87bf63ae42289 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.integrations.po @@ -0,0 +1,47 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.integrations.rst:2 +msgid "integrations" +msgstr "" + +#: of paddlenlp.trainer.integrations.VisualDLCallback:1 +msgid "基类::class:`paddlenlp.trainer.trainer_callback.TrainerCallback`" +msgstr "" + +#: of paddlenlp.trainer.integrations.VisualDLCallback:1 +msgid "" +"A [`TrainerCallback`] that sends the logs to " +"[VisualDL](https://www.paddlepaddle.org.cn/paddle/visualdl). :param " +"vdl_writer: The writer to use. Will instantiate one if not set. :type " +"vdl_writer: `LogWriter`, *optional*" +msgstr "" + +#: of paddlenlp.trainer.integrations.VisualDLCallback.on_train_begin:1 +msgid "Event called at the beginning of training." +msgstr "" + +#: of paddlenlp.trainer.integrations.VisualDLCallback.on_log:1 +msgid "Event called after logging the last logs." +msgstr "" + +#: of paddlenlp.trainer.integrations.VisualDLCallback.on_train_end:1 +msgid "Event called at the end of training." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.po new file mode 100644 index 0000000000000000000000000000000000000000..9791ba7b7a659245e71780983ce427eaede2202c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.rst:2 +msgid "paddlenlp.trainer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_base.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_base.po new file mode 100644 index 0000000000000000000000000000000000000000..12cd6dcf1b03eaf68e930034d73e34f54b277098 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_base.po @@ -0,0 +1,600 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.trainer_base.rst:2 +msgid "trainer\\_base" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:1 +msgid "" +"Trainer is a simple but feature-complete training and eval loop for " +"PaddlePaddle, optimized for PaddleNLP." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer +#: paddlenlp.trainer.trainer_base.Trainer.add_callback +#: paddlenlp.trainer.trainer_base.Trainer.create_scheduler +#: paddlenlp.trainer.trainer_base.Trainer.evaluate +#: paddlenlp.trainer.trainer_base.Trainer.export_model +#: paddlenlp.trainer.trainer_base.Trainer.get_eval_dataloader +#: paddlenlp.trainer.trainer_base.Trainer.get_optimizer_cls_and_kwargs +#: paddlenlp.trainer.trainer_base.Trainer.get_test_dataloader +#: paddlenlp.trainer.trainer_base.Trainer.log +#: paddlenlp.trainer.trainer_base.Trainer.predict +#: paddlenlp.trainer.trainer_base.Trainer.prediction_step +#: paddlenlp.trainer.trainer_base.Trainer.training_step +msgid "参数" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:3 +msgid "" +"The model to train, evaluate or use for predictions. [`Trainer`] " +"is optimized to work with the [`PretrainedModel`] provided by the " +"library. You can still use your own models defined as `paddle.nn.Layer` " +"as long as they work the same way as the PaddleNLP models. " +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:3 +msgid "The model to train, evaluate or use for predictions." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:5 +msgid "" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:7 +msgid "" +"[`Trainer`] is optimized to work with the [`PretrainedModel`] provided by" +" the library. You can still use your own models defined as " +"`paddle.nn.Layer` as long as they work the same way as the PaddleNLP " +"models." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:11 +msgid "" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:13 +msgid "" +"The arguments to tweak for training. Will default to a basic instance of " +"[`TrainingArguments`] with the `output_dir` set to a directory named " +"*tmp_trainer* in the current directory if not provided." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:16 +msgid "" +"The function to use to form a batch from a list of elements of " +"`train_dataset` or `eval_dataset`. Will default to " +"[`default_data_collator`] if no `tokenizer` is provided, an instance of " +"[`DataCollatorWithPadding`] otherwise." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:20 +msgid "" +"The dataset to use for training. If it is an `datasets.Dataset`, columns " +"not accepted by the `model.forward()` method are automatically removed." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:23 +msgid "" +"The dataset to use for evaluation. If it is an `datasets.Dataset`, " +"columns not accepted by the `model.forward()` method are automatically " +"removed." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:26 +msgid "" +"The tokenizer used to preprocess the data. 
If provided, will be used to " +"automatically pad the inputs the maximum length when batching inputs, and" +" it will be saved along the model to make it easier to rerun an " +"interrupted training or reuse the fine-tuned model." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:30 +msgid "" +"The function that will be used to compute metrics at evaluation. Must " +"take a [`EvalPrediction`] and return a dictionary string to metric " +"values." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:33 +msgid "" +"A tuple containing the optimizer and the scheduler to use. Will default " +"to an instance of [`AdamW`] on your model and a scheduler given by " +"[`get_linear_schedule_with_warmup`] controlled by `args`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:38 +msgid "Important attributes:" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:40 +msgid "" +"**model** -- Always points to the core model. If using a transformers " +"model, it will be a [`PretrainedModel`] subclass." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:42 +msgid "" +"**model_wrapped** -- Always points to the most external model in case one" +" or more other modules wrap the original model. This is the model that " +"should be used for the forward pass. For example, the inner model is " +"wrapped in `paddle.DataParallel`. If model hasn't been wrapped, then " +"`self.model_wrapped` is the same as `self.model`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:46 +msgid "" +"**is_model_parallel** -- Whether or not a model has been switched to a " +"model parallel mode (different from data parallelism, this means some of " +"the model layers are split on different GPUs)." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:48 +msgid "" +"**place_model_on_device** -- Whether or not to automatically place the " +"model on the device - it will be set to `False` if model parallel or " +"deepspeed is used, or if the default " +"`TrainingArguments.place_model_on_device` is overridden to return `False`" +" ." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer:51 +msgid "" +"**is_in_train** -- Whether or not a model is currently running `train` " +"(e.g. when `evaluate` is called while in `train`)" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.log_metrics:1 +msgid "" +"Log metrics in a specially formatted way Under distributed environment " +"this is done only for a process with rank 0. 
:param split: Mode/split " +"name: one of `train`, `eval`, `test` :type split: `str` :param metrics: " +"The metrics returned from train/evaluate/predictmetrics: metrics dict " +":type metrics: `Dict[str, float]`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:1 +msgid "" +"Reformat Trainer metrics values to a human-readable format :param " +"metrics: The metrics returned from train/evaluate/predict :type metrics: " +"`Dict[str, float]`" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate +#: paddlenlp.trainer.trainer_base.Trainer.pop_callback +#: paddlenlp.trainer.trainer_base.Trainer.prediction_step +#: paddlenlp.trainer.trainer_base.Trainer.training_step +#: paddlenlp.trainer.trainer_utils.metrics_format +msgid "返回" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:5 +msgid "The reformatted metrics" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.pop_callback +#: paddlenlp.trainer.trainer_base.Trainer.prediction_step +#: paddlenlp.trainer.trainer_base.Trainer.training_step +#: paddlenlp.trainer.trainer_utils.metrics_format +msgid "返回类型" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:6 +msgid "metrics (`Dict[str, float]`)" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_metrics:1 +msgid "" +"Save metrics into a json file for that split, e.g. `train_results.json`. " +"Under distributed environment this is done only for a process with rank " +"0. :param split: Mode/split name: one of `train`, `eval`, `test`, `all` " +":type split: `str` :param metrics: The metrics returned from " +"train/evaluate/predict :type metrics: `Dict[str, float]` :param combined:" +" Creates combined metrics by updating `all_results.json` with metrics of " +"this call :type combined: `bool`, *optional*, defaults to `True`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_metrics:10 +msgid "" +"To understand the metrics please read the docstring of " +"[`~Trainer.log_metrics`]. The only difference is that raw unformatted " +"numbers are saved in the current method." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_state:1 +msgid "" +"Saves the Trainer state, since Trainer.save_model saves only the " +"tokenizer with the model Under distributed environment this is done only " +"for a process with rank 0." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.add_callback:1 +msgid "Add a callback to the current list of [`~TrainerCallback`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.add_callback:3 +msgid "" +"A [`~TrainerCallback`] class or an instance of a [`~TrainerCallback`]. In" +" the first case, will instantiate a member of that class." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.pop_callback:1 +msgid "" +"Remove a callback from the current list of [`~TrainerCallback`] and " +"returns it. If the callback is not found, returns `None` (and no error is" +" raised). :param callback: A [`~TrainerCallback`] class or an instance of" +" a [`~TrainerCallback`]. In the" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.pop_callback:4 +msgid "" +"first case, will pop the first member of that class found in the list of " +"callbacks." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.pop_callback:7 +msgid "The callback removed, if found." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.pop_callback:8 +msgid "[`~TrainerCallback`]" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.remove_callback:1 +msgid "" +"Remove a callback from the current list of [`~TrainerCallback`]. :param " +"callback: A [`~TrainerCallback`] class or an instance of a " +"[`~TrainerCallback`]. In the" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.remove_callback:3 +msgid "" +"first case, will remove the first member of that class found in the list " +"of callbacks." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_train_dataloader:1 +msgid "Returns the training [`~paddle.io.DataLoader`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_train_dataloader:3 +msgid "" +"Will use no sampler if `self.train_dataset` does not implement `__len__`," +" a random sampler (adapted to distributed training if necessary) " +"otherwise." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_eval_dataloader:3 +#: paddlenlp.trainer.trainer_base.Trainer.get_test_dataloader:3 +#: paddlenlp.trainer.trainer_base.Trainer.get_train_dataloader:6 +msgid "" +"Subclass and override this method if you want to inject some custom " +"behavior." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_eval_dataloader:1 +msgid "Returns the evaluation [`~paddle.io.DataLoader`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_eval_dataloader:5 +msgid "" +"If provided, will override `self.eval_dataset`. If it is an " +"`datasets.Dataset`, columns not accepted by the `model.forward()` method " +"are automatically removed. It must implement `__len__`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_test_dataloader:1 +msgid "Returns the test [`~paddle.io.DataLoader`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_test_dataloader:5 +msgid "" +"The test dataset to use. If it is an `datasets.Dataset`, columns not " +"accepted by the `model.forward()` method are automatically removed. It " +"must implement `__len__`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_optimizer_and_scheduler:1 +msgid "Setup the optimizer and the learning rate scheduler." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_optimizer_and_scheduler:3 +msgid "" +"We provide a reasonable default that works well. If you want to use " +"something else, you can pass a tuple in the Trainer's init through " +"`optimizers`, or subclass and override this method (or `create_optimizer`" +" and/or `create_scheduler`) in a subclass." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_optimizer:1 +msgid "Setup the optimizer." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_optimizer:3 +msgid "" +"We provide a reasonable default that works well. If you want to use " +"something else, you can pass a tuple in the Trainer's init through " +"`optimizers`, or subclass and override this method in a subclass." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_optimizer_cls_and_kwargs:1 +msgid "" +"Returns the optimizer class and optimizer parameters based on the " +"training arguments." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.get_optimizer_cls_and_kwargs:3 +msgid "The training arguments for the training session." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_scheduler:1 +msgid "" +"Setup the scheduler. 
The optimizer of the trainer must have been set up " +"either before this method is called or passed as an argument." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.create_scheduler:4 +msgid "The number of training steps to do." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.autocast_smart_context_manager:1 +msgid "" +"A helper wrapper that creates an appropriate context manager for " +"`autocast` while feeding it the desired arguments, depending on the " +"situation." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.compute_loss:1 +msgid "" +"How the loss is computed by Trainer. By default, all models return the " +"loss in the first element. Subclass and override for custom behavior." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.training_step:1 +msgid "Perform a training step on a batch of inputs." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:3 +#: paddlenlp.trainer.trainer_base.Trainer.training_step:3 +msgid "Subclass and override to inject custom behavior." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.training_step:5 +msgid "The model to train." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:7 +#: paddlenlp.trainer.trainer_base.Trainer.training_step:7 +msgid "" +"The inputs and targets of the model. The dictionary will be unpacked " +"before being fed to the model. Most models expect the targets under the " +"argument `labels`. Check your model's documentation for all accepted " +"arguments." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:7 +#: paddlenlp.trainer.trainer_base.Trainer.training_step:7 +msgid "The inputs and targets of the model." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:9 +#: paddlenlp.trainer.trainer_base.Trainer.training_step:9 +msgid "" +"The dictionary will be unpacked before being fed to the model. Most " +"models expect the targets under the argument `labels`. Check your model's" +" documentation for all accepted arguments." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.training_step:13 +msgid "The tensor with training loss on this batch." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.training_step:14 +msgid "`paddle.Tensor`" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.save_model:1 +msgid "Will save the model, so you can reload it using `from_pretrained()`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.save_model:3 +msgid "Will only save from the main process." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.export_model:1 +msgid "Export paddle inference model." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.export_model:3 +msgid "" +"InputSpec describes the signature information of the model input, such as" +" shape , dtype , name. Defaults to None." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.export_model:6 +msgid "Load best model. Defaults to False." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.export_model:8 +msgid "Output dir to save the exported model. Defaults to None." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.log:1 +msgid "Log `logs` on the various objects watching training." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.log:3 +msgid "Subclass and override this method to inject custom behavior." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.log:5 +msgid "The values to log." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:1 +msgid "Run evaluation and returns metrics." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:3 +msgid "" +"The calling script will be responsible for providing a method to compute " +"metrics, as they are task-dependent (pass it to the init " +"`compute_metrics` argument)." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:6 +msgid "You can also subclass and override this method to inject custom behavior." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:8 +msgid "" +"Pass a dataset if you wish to override `self.eval_dataset`. If it is an " +"`datasets.Dataset`, columns not accepted by the `model.forward()` method " +"are automatically removed. It must implement the `__len__` method." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:12 +#: paddlenlp.trainer.trainer_base.Trainer.predict:7 +#: paddlenlp.trainer.trainer_base.Trainer.prediction_step:14 +msgid "" +"A list of keys in the output of your model (if it is a dictionary) that " +"should be ignored when gathering predictions." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:15 +msgid "" +"An optional prefix to be used as the metrics key prefix. For example the " +"metrics \"bleu\" will be named \"eval_bleu\" if the prefix is \"eval\" " +"(default)" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluate:19 +msgid "" +"A dictionary containing the evaluation loss and the potential metrics " +"computed from the predictions. The dictionary also contains the epoch " +"number which comes from the training state." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluation_loop:1 +msgid "" +"Prediction/evaluation loop, shared by `Trainer.evaluate()` and " +"`Trainer.predict()`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.evaluation_loop:3 +msgid "Works both with or without labels." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:1 +msgid "" +"Run prediction and returns predictions and potential metrics. Depending " +"on the dataset and your use case, your test dataset may contain labels. " +"In that case, this method will also return metrics, like in `evaluate()`." +" :param test_dataset: Dataset to run the predictions on. If it is an " +"`datasets.Dataset`, columns not accepted by the" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:5 +msgid "" +"`model.forward()` method are automatically removed. Has to implement the " +"method `__len__`" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:10 +msgid "" +"An optional prefix to be used as the metrics key prefix. For example the " +"metrics \"bleu\" will be named \"test_bleu\" if the prefix is \"test\" " +"(default)" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:14 +msgid "" +" If your predictions or labels have different sequence length (for " +"instance because you're doing dynamic padding in a token classification " +"task) the predictions will be padded (on the right) to allow for " +"concatenation into one array. The padding index is -100. Returns: " +"*NamedTuple* A namedtuple with the following keys:" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:20 +msgid "predictions (`np.ndarray`): The predictions on `test_dataset`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:21 +msgid "" +"label_ids (`np.ndarray`, *optional*): The labels (if the dataset " +"contained some)." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.predict:22 +msgid "" +"metrics (`Dict[str, float]`, *optional*): The potential dictionary of " +"metrics (if the dataset contained labels)." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:1 +msgid "Perform an evaluation step on `model` using `inputs`." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:5 +msgid "The model to evaluate." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:12 +msgid "Whether or not to return the loss only." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.prediction_step:18 +msgid "A tuple with the loss, logits and labels (each being optional)." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.num_examples:1 +msgid "" +"Helper to get number of samples in a [`~paddle.io.DataLoader`] by " +"accessing its dataset." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.num_examples:3 +msgid "" +"Will raise an exception if the underlying dataset does not implement " +"method `__len__`" +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.is_local_process_zero:1 +msgid "" +"Whether or not this process is the local (e.g., on one machine if " +"training in a distributed fashion on several machines) main process." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.is_world_process_zero:1 +msgid "" +"Whether or not this process is the global main process (when training in " +"a distributed fashion on several machines, this is only going to be " +"`True` for one process)." +msgstr "" + +#: of paddlenlp.trainer.trainer_base.Trainer.print_config:1 +msgid "print config values" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_callback.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_callback.po new file mode 100644 index 0000000000000000000000000000000000000000..93c3875a65cf32579498593eb6974e598f24d90b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_callback.po @@ -0,0 +1,431 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.trainer_callback.rst:2 +msgid "trainer\\_callback" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback:1 +msgid "Callbacks to use with the Trainer class and customize the training loop." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:1 +#: paddlenlp.trainer.trainer_callback.TrainerControl:1 +#: paddlenlp.trainer.trainer_callback.TrainerState:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:1 +msgid "" +"A class containing the [`Trainer`] inner state that will be saved along " +"the model and optimizer when checkpointing and passed to the " +"[`TrainerCallback`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:4 +msgid "" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:6 +msgid "" +"In all this class, one step is to be understood as one update step. 
When " +"using gradient accumulation, one update step may require several forward " +"and backward passes: if you use `gradient_accumulation_steps=n`, then one" +" update step requires going through *n* batches." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:10 +msgid "" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.EarlyStoppingCallback +#: paddlenlp.trainer.trainer_callback.TrainerCallback +#: paddlenlp.trainer.trainer_callback.TrainerControl +#: paddlenlp.trainer.trainer_callback.TrainerState +msgid "参数" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:12 +msgid "" +"Only set during training, will represent the epoch the training is at " +"(the decimal part being the percentage of the current epoch completed)." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:15 +msgid "During training, represents the number of update steps completed." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:17 +msgid "The number of update steps to do during the current training." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:19 +msgid "" +"The total number of floating operations done by the model since the " +"beginning of training (stored as floats to avoid overflow)." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:22 +msgid "The list of logs done since the beginning of training." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:24 +msgid "" +"When tracking the best model, the value of the best metric encountered so" +" far." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:26 +msgid "" +"When tracking the best model, the value of the name of the checkpoint for" +" the best model encountered so far." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:29 +msgid "" +"Whether or not this process is the local (e.g., on one machine if " +"training in a distributed fashion on several machines) main process." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState:32 +msgid "" +"Whether or not this process is the global main process (when training in " +"a distributed fashion on several machines, this is only going to be " +"`True` for one process)." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState.save_to_json:1 +msgid "Save the content of this instance in JSON format inside `json_path`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerState.load_from_json:1 +msgid "Create an instance from the content of `json_path`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:1 +msgid "" +"A class that handles the [`Trainer`] control flow. This class is used by " +"the [`TrainerCallback`] to activate some switches in the training loop." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:4 +msgid "" +"Whether or not the training should be interrupted. If `True`, this " +"variable will not be set back to `False`. The training will just stop." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:4 +msgid "Whether or not the training should be interrupted." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:6 +msgid "" +"If `True`, this variable will not be set back to `False`. The training " +"will just stop." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:8 +msgid "" +"Whether or not the current epoch should be interrupted. 
If `True`, this " +"variable will be set back to `False` at the beginning of the next epoch." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:8 +msgid "Whether or not the current epoch should be interrupted." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:10 +msgid "" +"If `True`, this variable will be set back to `False` at the beginning of " +"the next epoch." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:12 +msgid "" +"Whether or not the model should be saved at this step. If `True`, this " +"variable will be set back to `False` at the beginning of the next step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:12 +msgid "Whether or not the model should be saved at this step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:14 +#: paddlenlp.trainer.trainer_callback.TrainerControl:18 +#: paddlenlp.trainer.trainer_callback.TrainerControl:22 +msgid "" +"If `True`, this variable will be set back to `False` at the beginning of " +"the next step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:16 +msgid "" +"Whether or not the model should be evaluated at this step. If `True`, " +"this variable will be set back to `False` at the beginning of the next " +"step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:16 +msgid "Whether or not the model should be evaluated at this step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:20 +msgid "" +"Whether or not the logs should be reported at this step. If `True`, this" +" variable will be set back to `False` at the beginning of the next step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerControl:20 +msgid "Whether or not the logs should be reported at this step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:1 +msgid "" +"A class for objects that will inspect the state of the training loop at " +"some events and take some decisions. At each of those events the " +"following arguments are available:" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:4 +msgid "The training arguments used to instantiate the [`Trainer`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:6 +msgid "The current state of the [`Trainer`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:8 +msgid "" +"The object that is returned to the [`Trainer`] and can be used to make " +"some decisions." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:10 +msgid "The model being trained." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:12 +msgid "The tokenizer used for encoding the data." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:14 +msgid "The optimizer used for the training steps." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:16 +msgid "The scheduler used for setting the learning rate." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:18 +#: paddlenlp.trainer.trainer_callback.TrainerCallback:20 +msgid "The current dataloader used for training." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:22 +msgid "" +"The metrics computed by the last evaluation phase. Those are only " +"accessible in the event `on_evaluate`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:22 +msgid "The metrics computed by the last evaluation phase." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:24 +msgid "Those are only accessible in the event `on_evaluate`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:26 +msgid "The values to log. Those are only accessible in the event `on_log`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:26 +msgid "The values to log." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:28 +msgid "Those are only accessible in the event `on_log`." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:31 +msgid "" +"The `control` object is the only one that can be changed by the callback," +" in which case the event that changes it should return the modified " +"version." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:34 +msgid "" +"The argument `args`, `state` and `control` are positionals for all " +"events, all the others are grouped in `kwargs`. You can unpack the ones " +"you need in the signature of the event using them. As an example, see the" +" code of the simple [`~transformer.PrinterCallback`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:38 +msgid "Example:" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:40 +msgid "```python class PrinterCallback(TrainerCallback):" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:44 +msgid "def on_log(self, args, state, control, logs=None, **kwargs):" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:43 +msgid "_ = logs.pop(\"total_flos\", None) if state.is_local_process_zero:" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:45 +msgid "print(logs)" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.TrainerCallback:46 +msgid "```" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_init_end:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_init_end:1 +msgid "Event called at the end of the initialization of the [`Trainer`]." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_train_begin:1 +#: paddlenlp.trainer.trainer_callback.EarlyStoppingCallback.on_train_begin:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_train_begin:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_train_begin:1 +msgid "Event called at the beginning of training." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_train_end:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_train_end:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_train_end:1 +msgid "Event called at the end of training." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_epoch_begin:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_epoch_begin:1 +msgid "Event called at the beginning of an epoch." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_epoch_end:1 +#: paddlenlp.trainer.trainer_callback.DefaultFlowCallback.on_epoch_end:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_epoch_end:1 +msgid "Event called at the end of an epoch." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_step_begin:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_step_begin:1 +msgid "" +"Event called at the beginning of a training step. If using gradient " +"accumulation, one training step might take several inputs." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_substep_end:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_substep_end:1 +msgid "Event called at the end of an substep during gradient accumulation." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_step_end:1 +#: paddlenlp.trainer.trainer_callback.DefaultFlowCallback.on_step_end:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_step_end:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_step_end:1 +msgid "" +"Event called at the end of a training step. If using gradient " +"accumulation, one training step might take several inputs." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_evaluate:1 +#: paddlenlp.trainer.trainer_callback.EarlyStoppingCallback.on_evaluate:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_evaluate:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_evaluate:1 +msgid "Event called after an evaluation phase." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_save:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_save:1 +msgid "Event called after a checkpoint save." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_log:1 +#: paddlenlp.trainer.trainer_callback.PrinterCallback.on_log:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_log:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_log:1 +msgid "Event called after logging the last logs." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler.on_prediction_step:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback.on_prediction_step:1 +#: paddlenlp.trainer.trainer_callback.TrainerCallback.on_prediction_step:1 +msgid "Event called after a prediction step." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler:1 +#: paddlenlp.trainer.trainer_callback.DefaultFlowCallback:1 +#: paddlenlp.trainer.trainer_callback.EarlyStoppingCallback:1 +#: paddlenlp.trainer.trainer_callback.PrinterCallback:1 +#: paddlenlp.trainer.trainer_callback.ProgressCallback:1 +msgid "基类::class:`paddlenlp.trainer.trainer_callback.TrainerCallback`" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.CallbackHandler:1 +msgid "Internal class that just calls the list of callbacks in order." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.DefaultFlowCallback:1 +msgid "" +"A [`TrainerCallback`] that handles the default flow of the training loop " +"for logs, evaluation and checkpoints." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.ProgressCallback:1 +msgid "" +"A [`TrainerCallback`] that displays the progress of training or " +"evaluation." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.PrinterCallback:1 +msgid "A bare [`TrainerCallback`] that just prints the logs." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.EarlyStoppingCallback:1 +msgid "A [`TrainerCallback`] that handles early stopping." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.EarlyStoppingCallback:3 +msgid "" +"Use with `metric_for_best_model` to stop training when the specified " +"metric worsens for `early_stopping_patience` evaluation calls." +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.EarlyStoppingCallback:6 +msgid "" +"Use with TrainingArguments `metric_for_best_model` and " +"`early_stopping_patience` to denote how much the specified metric must " +"improve to satisfy early stopping conditions. 
`" +msgstr "" + +#: of paddlenlp.trainer.trainer_callback.EarlyStoppingCallback:10 +msgid "" +"This callback depends on [`TrainingArguments`] argument " +"*load_best_model_at_end* functionality to set best_metric in " +"[`TrainerState`]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..f4c282e4865cab4d856419967bbacf392c4dfb33 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.trainer_utils.po @@ -0,0 +1,246 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.trainer_utils.rst:2 +msgid "trainer\\_utils" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils:1 +msgid "Utilities for the Trainer class." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.ExplicitEnum:1 +msgid "基类::class:`enum.Enum`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.ExplicitEnum:1 +msgid "Enum with more explicit error message for missing values." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun:1 +#: paddlenlp.trainer.trainer_utils.EvalLoopOutput:1 +#: paddlenlp.trainer.trainer_utils.EvalPrediction:1 +#: paddlenlp.trainer.trainer_utils.PredictionOutput:1 +#: paddlenlp.trainer.trainer_utils.TrainOutput:1 +msgid "基类::class:`tuple`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.EvalPrediction:1 +msgid "Evaluation output (always contains labels), to be used to compute metrics." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun +#: paddlenlp.trainer.trainer_utils.EvalPrediction +#: paddlenlp.trainer.trainer_utils.default_compute_objective +msgid "参数" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.EvalPrediction:3 +msgid "Predictions of the model." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.EvalPrediction:5 +msgid "Targets to be matched." 
+msgstr "" + +#: ../docstring of paddlenlp.trainer.trainer_utils.BestRun.run_id:1 +#: paddlenlp.trainer.trainer_utils.EvalLoopOutput.predictions:1 +#: paddlenlp.trainer.trainer_utils.EvalPrediction.predictions:1 +#: paddlenlp.trainer.trainer_utils.PredictionOutput.predictions:1 +#: paddlenlp.trainer.trainer_utils.TrainOutput.global_step:1 +msgid "Alias for field number 0" +msgstr "" + +#: ../docstring of paddlenlp.trainer.trainer_utils.BestRun.objective:1 +#: paddlenlp.trainer.trainer_utils.EvalLoopOutput.label_ids:1 +#: paddlenlp.trainer.trainer_utils.EvalPrediction.label_ids:1 +#: paddlenlp.trainer.trainer_utils.PredictionOutput.label_ids:1 +#: paddlenlp.trainer.trainer_utils.TrainOutput.training_loss:1 +msgid "Alias for field number 1" +msgstr "" + +#: ../docstring of paddlenlp.trainer.trainer_utils.BestRun.hyperparameters:1 +#: paddlenlp.trainer.trainer_utils.EvalLoopOutput.metrics:1 +#: paddlenlp.trainer.trainer_utils.PredictionOutput.metrics:1 +#: paddlenlp.trainer.trainer_utils.TrainOutput.metrics:1 +msgid "Alias for field number 2" +msgstr "" + +#: ../docstring of paddlenlp.trainer.trainer_utils.EvalLoopOutput.num_samples:1 +msgid "Alias for field number 3" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.EvaluationStrategy:1 +#: paddlenlp.trainer.trainer_utils.IntervalStrategy:1 +#: paddlenlp.trainer.trainer_utils.OptimizerNames:1 +#: paddlenlp.trainer.trainer_utils.SchedulerType:1 +msgid "基类::class:`paddlenlp.trainer.trainer_utils.ExplicitEnum`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.EvaluationStrategy:1 +#: paddlenlp.trainer.trainer_utils.IntervalStrategy:1 +#: paddlenlp.trainer.trainer_utils.SchedulerType:1 +msgid "An enumeration." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.OptimizerNames:1 +msgid "Stores the acceptable string identifiers for optimizers." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun:1 +msgid "" +"The best run found by an hyperparameter search (see " +"[`~Trainer.hyperparameter_search`])." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun:3 +msgid "" +"The id of the best run (if models were saved, the corresponding " +"checkpoint will be in the folder ending with run-{run_id})." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun:6 +msgid "The objective that was obtained for this run." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.BestRun:8 +msgid "The hyperparameters picked to get this run." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective:1 +msgid "" +"The default objective to maximize/minimize when doing an hyperparameter " +"search. It is the evaluation loss if no metrics are provided to the " +"[`Trainer`], the sum of all metrics otherwise." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective:4 +msgid "The metrics returned by the evaluate method." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective +#: paddlenlp.trainer.trainer_utils.metrics_format +msgid "返回" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective:7 +msgid "The objective to minimize or maximize" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective +#: paddlenlp.trainer.trainer_utils.metrics_format +msgid "返回类型" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.default_compute_objective:8 +msgid "`float`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.is_main_process:1 +msgid "" +"Whether or not the current process is the local process, based on " +"`xm.get_ordinal()` (for TPUs) first, then on `local_rank`." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.total_processes_number:1 +msgid "" +"Return the number of processes launched in parallel. Works with " +"`paddle.distributed` and TPUs." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:1 +msgid "Measure and return speed performance metrics." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:3 +msgid "" +"This function requires a time snapshot `start_time` before the operation " +"to be measured starts and this function should be run immediately after " +"the operation to be measured has completed." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:6 +msgid "Args:" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:8 +msgid "split: name to prefix metric (like train, eval, test...)" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:9 +msgid "start_time: operation start time" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.speed_metrics:10 +msgid "num_samples: number of samples processed" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:1 +msgid "" +"Reformat Trainer metrics values to a human-readable format :param " +"metrics: The metrics returned from train/evaluate/predict :type metrics: " +"`Dict[str, float]`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:5 +msgid "The reformatted metrics" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.metrics_format:6 +msgid "metrics (`Dict[str, float]`)" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.log_metrics:1 +msgid "" +"Log metrics in a specially formatted way Under distributed environment " +"this is done only for a process with rank 0. :param split: Mode/split " +"name: one of `train`, `eval`, `test` :type split: `str` :param metrics: " +"The metrics returned from train/evaluate/predictmetrics: metrics dict " +":type metrics: `Dict[str, float]`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_metrics:1 +msgid "" +"Save metrics into a json file for that split, e.g. `train_results.json`. " +"Under distributed environment this is done only for a process with rank " +"0. :param split: Mode/split name: one of `train`, `eval`, `test`, `all` " +":type split: `str` :param metrics: The metrics returned from " +"train/evaluate/predict :type metrics: `Dict[str, float]` :param combined:" +" Creates combined metrics by updating `all_results.json` with metrics of " +"this call :type combined: `bool`, *optional*, defaults to `True`" +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_metrics:10 +msgid "" +"To understand the metrics please read the docstring of " +"[`~Trainer.log_metrics`]. The only difference is that raw unformatted " +"numbers are saved in the current method." 
+msgstr "" + +#: of paddlenlp.trainer.trainer_utils.save_state:1 +msgid "" +"Saves the Trainer state, since Trainer.save_model saves only the " +"tokenizer with the model Under distributed environment this is done only " +"for a process with rank 0." +msgstr "" + +#: of paddlenlp.trainer.trainer_utils.has_length:1 +msgid "Checks if the dataset implements __len__() and it doesn't raise an error" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.training_args.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.training_args.po new file mode 100644 index 0000000000000000000000000000000000000000..66c243a762ec95dd7f65d87989d1dcacf2279880 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.training_args.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.training_args.rst:2 +msgid "training\\_args" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.helper.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.helper.po new file mode 100644 index 0000000000000000000000000000000000000000..b1f0d1ac761cbebd8862c96037a02be0ad79be8c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.helper.po @@ -0,0 +1,49 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.utils.helper.rst:2 +msgid "helper" +msgstr "" + +#: of paddlenlp.trainer.utils.helper.paddle_pad_and_concatenate:1 +msgid "" +"Concatenates `tensor1` and `tensor2` on first axis, applying padding on " +"the second if necessary." +msgstr "" + +#: of paddlenlp.trainer.utils.helper.nested_concat:1 +msgid "" +"Concat the `new_tensors` to `tensors` on the first dim and pad them on " +"the second if needed. Works for tensors or nested list/tuples of tensors." +msgstr "" + +#: of paddlenlp.trainer.utils.helper.nested_detach:1 +msgid "Detach `tensors` (even if it's a nested list/tuple of tensors)." +msgstr "" + +#: of paddlenlp.trainer.utils.helper.nested_numpify:1 +msgid "Numpify `tensors` (even if it's a nested list/tuple of tensors)." +msgstr "" + +#: of paddlenlp.trainer.utils.helper.nested_truncate:1 +msgid "" +"Truncate `tensors` at `limit` (even if it's a nested list/tuple of " +"tensors)." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..80cf9f323f57b8fc7d8e34db07c04e2803a3d014 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.trainer.utils.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.trainer.utils.rst:2 +msgid "utils" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..a9813e3bbb7f70eee4b39e969f97ede513a0fc2d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.modeling.po @@ -0,0 +1,774 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.albert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling:1 +msgid "Modeling classes for ALBERT model." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertPretrainedModel:1 +msgid "" +"An abstract class for pretrained ALBERT models. It provides ALBERT " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +"`PretrainedModel` for more details." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM:1 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice:1 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining:1 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification:1 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification:1 +#: paddlenlp.transformers.albert.modeling.AlbertModel:1 +msgid "基类::class:`paddlenlp.transformers.albert.modeling.AlbertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:1 +msgid "The bare Albert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. 
Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertModel +#: paddlenlp.transformers.albert.modeling.AlbertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `AlbertModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `AlbertModel`." +" Defaults to `30000`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:14 +msgid "Dimensionality of the embedding layer. Defaults to `128`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:16 +msgid "Dimensionality of the encoder layer and pooler layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:18 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:20 +msgid "Number of hidden groups in the Transformer encoder. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:22 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:25 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:29 +msgid "Number of inner groups in a hidden group. Default to `1`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:31 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:35 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:38 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0`." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:41 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:44 +msgid "The vocabulary size of `token_type_ids`. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:46 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:46 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:49 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:52 +msgid "" +"The `epsilon` parameter used in :class:`paddle.nn.LayerNorm` for " +"initializing layer normalization layers. A small value to the variance " +"added to the normalization layer to prevent division by zero. Default to " +"`1e-12`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:56 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel:58 +msgid "Whether or not to add the pooling layer. Default to `False`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:1 +msgid "The AlbertModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:16 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. 
Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:16 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:21 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:22 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:24 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:27 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:31 +msgid "" +"Mask to nullify selected heads of the self-attention modules. Masks " +"values can either be 0 or 1: - 1 indicates the head is **not masked**, -" +" 0 indicated the head is **masked**." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:31 +msgid "" +"Mask to nullify selected heads of the self-attention modules. Masks " +"values can either be 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:33 +msgid "1 indicates the head is **not masked**," +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:34 +msgid "0 indicated the head is **masked**." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:36 +msgid "" +"If you want to control how to convert `inputs_ids` indices into " +"associated vectors, you can pass an embedded representation directly " +"instead of passing `inputs_ids`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:39 +msgid "" +"Whether or not to return a dict instead of a plain tuple. Default to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:42 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or a dict with " +"`last_hidden_state`, `pooled_output`, `all_hidden_states`, " +"`all_attentions` fields. With the fields: - `sequence_output` (Tensor):" +" Sequence of hidden-states at the last layer of the model. It's " +"data type should be float32 and has a shape of [`batch_size, " +"sequence_length, hidden_size`]. - `pooled_output` (Tensor): The " +"output of first token (`[CLS]`) in sequence. We \"pool\" the model by " +"simply taking the hidden state corresponding to the first token. 
Its " +"data type should be float32 and has a shape of [batch_size, " +"hidden_size]. - `last_hidden_state` (Tensor): The output of the last " +"encoder layer, it is also the `sequence_output`. It's data type should" +" be float32 and has a shape of [batch_size, sequence_length, " +"hidden_size]. - `all_hidden_states` (Tensor): Hidden_states of all " +"layers in the Transformer encoder. The length of `all_hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]. - `all_attentions` (Tensor): Attentions of all layers " +"of in the Transformer encoder. The length of `all_attentions` is " +"`num_hidden_layers`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, num_attention_heads, " +"sequence_length, sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:42 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or a dict with " +"`last_hidden_state`, `pooled_output`, `all_hidden_states`, " +"`all_attentions` fields." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:21 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:25 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:25 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:20 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:20 +#: paddlenlp.transformers.albert.modeling.AlbertModel.forward:45 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:49 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:48 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and has a shape of [`batch_size, sequence_length, " +"hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:55 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:52 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and has a shape of [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:59 +msgid "`last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:58 +msgid "" +"The output of the last encoder layer, it is also the `sequence_output`. " +"It's data type should be float32 and has a shape of [batch_size, " +"sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:63 +msgid "`all_hidden_states` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:62 +msgid "" +"Hidden_states of all layers in the Transformer encoder. The length of " +"`all_hidden_states` is `num_hidden_layers + 1`. For all element in the " +"tuple, its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:67 +msgid "`all_attentions` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertModel.forward:66 +msgid "" +"Attentions of all layers of in the Transformer encoder. The length of " +"`all_attentions` is `num_hidden_layers`. For all element in the tuple, " +"its data type should be float32 and its shape is [`batch_size, " +"num_attention_heads, sequence_length, sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward +#: paddlenlp.transformers.albert.modeling.AlbertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:37 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:37 +#: paddlenlp.transformers.albert.modeling.AlbertModel.forward:72 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining:1 +msgid "" +"Albert Model with a `masked language modeling` head and a `sentence order" +" prediction` head on top." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM:3 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining:4 +msgid "An instance of :class:`AlbertModel`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining:6 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining:8 +msgid "An instance of :class:`AlbertSOPHead`." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:3 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:5 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:7 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:9 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:11 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:13 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:15 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:3 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:5 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:7 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:9 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:11 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:13 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:19 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining:10 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:3 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:5 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:7 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:9 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:11 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:13 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:19 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:3 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:5 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:7 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:9 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:11 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:13 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:15 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:3 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:5 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:7 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:9 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:11 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:13 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:15 +msgid "See :class:`AlbertModel`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:1 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:1 +msgid "" +"The AlbertForPretraining forward method, overrides the __call__() special" +" method." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:15 +msgid "" +"Labels of the next sequence prediction. Input should be a sequence pair " +"Indices should be 0 or 1. ``0`` indicates original order (sequence A, " +"then sequence B), and ``1`` indicates switched order (sequence B, then " +"sequence A). Defaults to `None`." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:22 +msgid "" +"Returns tuple (`prediction_scores`, `sop_scores`) or a dict with " +"`prediction_logits`, `sop_logits`, `pooled_output`, `hidden_states`, " +"`attentions` fields. With the fields: - `prediction_scores` (Tensor):" +" The scores of masked token prediction. Its data type should be " +"float32. and its shape is [batch_size, sequence_length, vocab_size]." +" - `sop_scores` (Tensor): The scores of sentence order prediction." +" Its data type should be float32 and its shape is [batch_size, 2]. -" +" `prediction_logits` (Tensor): The scores of masked token prediction." +" Its data type should be float32. and its shape is [batch_size, " +"sequence_length, vocab_size]. - `sop_logits` (Tensor): The scores of" +" sentence order prediction. Its data type should be float32 and its " +"shape is [batch_size, 2]. - `hidden_states` (Tensor): Hidden_states " +"of all layers in the Transformer encoder. The length of `hidden_states` " +"is `num_hidden_layers + 1`. For all element in the tuple, its data " +"type should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]. - `attentions` (Tensor): Attentions of all layers of " +"in the Transformer encoder. The length of `attentions` is " +"`num_hidden_layers`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, num_attention_heads," +" sequence_length, sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:22 +msgid "" +"Returns tuple (`prediction_scores`, `sop_scores`) or a dict with " +"`prediction_logits`, `sop_logits`, `pooled_output`, `hidden_states`, " +"`attentions` fields." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:25 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:29 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:24 +#: paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:28 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:28 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:36 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"and its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:33 +msgid "`sop_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:32 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:40 +msgid "" +"The scores of sentence order prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:37 +msgid "`prediction_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:41 +msgid "`sop_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:33 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:33 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:45 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:28 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:28 +msgid "`hidden_states` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:32 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:32 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:44 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:27 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:27 +msgid "" +"Hidden_states of all layers in the Transformer encoder. The length of " +"`hidden_states` is `num_hidden_layers + 1`. For all element in the tuple," +" its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:37 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:37 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:49 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:32 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:32 +msgid "`attentions` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:36 +#: paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:36 +#: paddlenlp.transformers.albert.modeling.AlbertForPretraining.forward:48 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:31 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:31 +msgid "" +"Attentions of all layers of in the Transformer encoder. The length of " +"`attentions` is `num_hidden_layers`. For all element in the tuple, its " +"data type should be float32 and its shape is [`batch_size, " +"num_attention_heads, sequence_length, sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM:1 +msgid "Albert Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:18 +msgid "" +"Returns tensor `prediction_scores` or a dict with `logits`, " +"`hidden_states`, `attentions` fields. With the fields: - " +"`prediction_scores` (Tensor): The scores of masked token prediction. " +"Its data type should be float32. and its shape is [batch_size, " +"sequence_length, vocab_size]. - `logits` (Tensor): The scores of " +"masked token prediction. Its data type should be float32. and its " +"shape is [batch_size, sequence_length, vocab_size]. - `hidden_states` " +"(Tensor): Hidden_states of all layers in the Transformer encoder. The" +" length of `hidden_states` is `num_hidden_layers + 1`. For all " +"element in the tuple, its data type should be float32 and its shape is " +"[`batch_size, sequence_length, hidden_size`]. 
- `attentions` (Tensor):" +" Attentions of all layers of in the Transformer encoder. The length " +"of `attentions` is `num_hidden_layers`. For all element in the tuple," +" its data type should be float32 and its shape is [`batch_size, " +"num_attention_heads, sequence_length, sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:18 +msgid "" +"Returns tensor `prediction_scores` or a dict with `logits`, " +"`hidden_states`, `attentions` fields." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMaskedLM.forward:29 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:24 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:24 +msgid "`logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification:1 +msgid "" +"Albert Model with a linear layer on top of the output layer, designed for" +" sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice:4 +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification:4 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification:4 +msgid "An instance of AlbertModel." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification:6 +msgid "The dropout probability for the classifier. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification:9 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:1 +msgid "" +"The AlbertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:18 +msgid "" +"Returns tensor `logits`, or a dict with `logits`, `hidden_states`, " +"`attentions` fields. With the fields: - `logits` (Tensor): A tensor" +" of the input text classification logits. Shape as `[batch_size, " +"num_classes]` and dtype as float32. - `hidden_states` (Tensor): " +"Hidden_states of all layers in the Transformer encoder. The length of " +"`hidden_states` is `num_hidden_layers + 1`. For all element in the " +"tuple, its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]. - `attentions` (Tensor): Attentions " +"of all layers of in the Transformer encoder. The length of `attentions` " +"is `num_hidden_layers`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, num_attention_heads," +" sequence_length, sequence_length`]." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:18 +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:18 +msgid "" +"Returns tensor `logits`, or a dict with `logits`, `hidden_states`, " +"`attentions` fields." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForSequenceClassification.forward:23 +msgid "" +"A tensor of the input text classification logits. Shape as `[batch_size, " +"num_classes]` and dtype as float32." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForTokenClassification:1 +msgid "" +"Albert Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:1 +msgid "" +"The AlbertForTokenClassification forward method, overrides the __call__()" +" special method." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:18 +msgid "" +"Returns tensor `logits`, or a dict with `logits`, `hidden_states`, " +"`attentions` fields. With the fields: - `logits` (Tensor): A tensor" +" of the input token classification logits. Shape as `[batch_size, " +"sequence_length, num_classes]` and dtype as `float32`. - `hidden_states`" +" (Tensor): Hidden_states of all layers in the Transformer encoder. " +"The length of `hidden_states` is `num_hidden_layers + 1`. For all " +"element in the tuple, its data type should be float32 and its shape is " +"[`batch_size, sequence_length, hidden_size`]. - `attentions` (Tensor):" +" Attentions of all layers of in the Transformer encoder. The length " +"of `attentions` is `num_hidden_layers`. For all element in the tuple," +" its data type should be float32 and its shape is [`batch_size, " +"num_attention_heads, sequence_length, sequence_length`]." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.modeling.AlbertForTokenClassification.forward:23 +msgid "" +"A tensor of the input token classification logits. Shape as `[batch_size," +" sequence_length, num_classes]` and dtype as `float32`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice:1 +msgid "" +"Albert Model with a linear layer on top of the hidden-states output " +"layer, designed for multiple choice tasks like SWAG tasks ." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:1 +msgid "" +"The AlbertForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:15 +msgid "Start positions of the text. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:17 +msgid "End positions of the text. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:22 +msgid "" +"Returns tensor `reshaped_logits` or a dict with `reshaped_logits`, " +"`hidden_states`, `attentions` fields. With the fields: - " +"`reshaped_logits` (Tensor): A tensor of the input multiple choice " +"classification logits. Shape as `[batch_size, num_classes]` and dtype" +" as `float32`. - `hidden_states` (Tensor): Hidden_states of all " +"layers in the Transformer encoder. The length of `hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]. - `attentions` (Tensor): Attentions of all layers of " +"in the Transformer encoder. The length of `attentions` is " +"`num_hidden_layers`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, num_attention_heads," +" sequence_length, sequence_length`]." 
+msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:22 +msgid "" +"Returns tensor `reshaped_logits` or a dict with `reshaped_logits`, " +"`hidden_states`, `attentions` fields." +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:29 +msgid "`reshaped_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.albert.modeling.AlbertForMultipleChoice.forward:28 +msgid "" +"A tensor of the input multiple choice classification logits. Shape as " +"`[batch_size, num_classes]` and dtype as `float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.po new file mode 100644 index 0000000000000000000000000000000000000000..cae0d3112373564530ab46219a5a4e609b285287 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.albert.rst:2 +msgid "albert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..7bfcd12e49d88de3ec2593263e16747bb77c1b66 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.albert.tokenizer.po @@ -0,0 +1,348 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.albert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer:1 +msgid "Tokenization class for ALBERT model." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:1 +msgid "Constructs an Albert tokenizer based on SentencePiece or `BertTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.save_resources +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:7 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:10 +msgid "" +"The vocabulary file (ends with '.spm') required to instantiate a " +"`SentencePiece `__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:13 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:15 +msgid "Whether or note to remove space when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:17 +msgid "Whether or note to keep accents when tokenizing. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:19 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:23 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:26 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:29 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:32 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer:38 +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string:10 +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize:10 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize:1 +msgid "Converts a string to a list of tokens." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize:3 +msgid "The text to be tokenized." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.tokenize:6 +msgid "A list of string representing converted tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (list of string) to a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string:3 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.convert_tokens_to_string:6 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:4 +msgid "An Albert sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A Albert offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of wordpiece offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." 
+msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"Should be overridden in a subclass if the model has a special way of " +"building those." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.save_resources:1 +msgid "" +"Save tokenizer related resources to `resource_files_names` indicating " +"files under `save_directory` by copying directly. Override it if " +"necessary." +msgstr "" + +#: of paddlenlp.transformers.albert.tokenizer.AlbertTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.attention_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.attention_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..9c623fbf987f8e82eb42b883afabcd0b2f245da3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.attention_utils.po @@ -0,0 +1,73 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.attention_utils.rst:2 +msgid "attention\\_utils" +msgstr "" + +#: of paddlenlp.transformers.attention_utils.Attention.forward:1 +#: paddlenlp.transformers.attention_utils.DefaultAttention.forward:1 +#: paddlenlp.transformers.attention_utils.Linear3D.forward:1 +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of paddlenlp.transformers.attention_utils.Attention.forward +#: paddlenlp.transformers.attention_utils.DefaultAttention.forward +#: paddlenlp.transformers.attention_utils.Linear3D.forward +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.attention_utils.Attention.forward:4 +#: paddlenlp.transformers.attention_utils.DefaultAttention.forward:4 +#: paddlenlp.transformers.attention_utils.Linear3D.forward:4 +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.transformers.attention_utils.Attention.forward:6 +#: paddlenlp.transformers.attention_utils.DefaultAttention.forward:6 +#: paddlenlp.transformers.attention_utils.Linear3D.forward:6 +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.attention_utils.BigBirdSparseAttention.forward:1 +msgid "" +"query_matrix: [B, H, T, D] key_matrix: [B, H, T, D] value_matrix: [B, H, " +"T, D] query_mask: [B, 1, T, 1] bool mask key_mask: [B, 1, 1, T] bool " +"mask rand_mask_idx: [H, T//bs, bs] Global Attention Random Attention " +"Window Attention" +msgstr "" + +#: ../docstring of +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.Cache.k:1 +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.StaticCache.k:1 +msgid "Alias for field number 0" +msgstr "" + +#: ../docstring of +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.Cache.v:1 +#: paddlenlp.transformers.attention_utils.MultiHeadAttention.StaticCache.v:1 +msgid "Alias for field number 1" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..c9de1097ab21438a2f20d4000dd46c19954917df --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.modeling.po @@ -0,0 +1,409 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.auto.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder:1 +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator:1 +#: paddlenlp.transformers.auto.modeling.AutoEncoder:1 +#: paddlenlp.transformers.auto.modeling.AutoGenerator:1 +#: paddlenlp.transformers.auto.modeling.AutoModel:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification:1 +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification:1 +msgid "基类::class:`paddlenlp.transformers.auto.modeling._BaseAutoModelClass`" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel:1 +msgid "" +"AutoClass can help you automatically retrieve the relevant model given " +"the provided pretrained weights/vocabulary. AutoModel is a generic model " +"class that will be instantiated as one of the base model classes when " +"created with the from_pretrained() classmethod." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModel`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:5 +msgid "" +"Name of pretrained model or dir path to load from. The string can be: - " +"Name of a built-in pretrained model - Name of a community-contributed " +"pretrained model. - Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." 
+msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:5 +msgid "Name of pretrained model or dir path to load from. The string can be:" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:8 +msgid "Name of a built-in pretrained model" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:9 +msgid "Name of a community-contributed pretrained model." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:10 +msgid "" +"Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:13 +msgid "" +"Specify a downstream task. Task can be 'Model', 'ForPretraining', " +"'ForSequenceClassification', 'ForTokenClassification', " +"'ForQuestionAnswering', 'ForMultipleChoice', 'ForMaskedLM', " +"'ForCausalLM', 'Encoder', 'Decoder', 'Generator', 'Discriminator', " +"'ForConditionalGeneration'. We only support specify downstream tasks in " +"AutoModel. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:19 +msgid "" +"Position arguments for model `__init__`. If provided, use these as " +"position argument values for model initialization." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:22 +msgid "" +"Keyword arguments for model `__init__`. If provided, use these to update " +"pre-defined keyword argument values for model initialization. If the " +"keyword is in `__init__` argument names of base model, update argument " +"values of the base model; else update argument values of derived model." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:29 +msgid "An instance of `AutoModel`." 
+msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModel.from_pretrained:33 +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:16 +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:16 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForPretraining:1 +msgid "AutoModelForPretraining." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForPretraining`. Model weights are " +"loaded by specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." 
+msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:9 +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:5 +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:7 +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:9 +msgid "See :class:`AutoModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForPretraining.from_pretrained:12 +msgid "An instance of `AutoModelForPretraining`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification:1 +msgid "AutoModelForSequenceClassification." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForSequenceClassification`. 
Model " +"weights are loaded by specifying name of a built-in pretrained model, or " +"a community contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForSequenceClassification.from_pretrained:12 +msgid "An instance of `AutoModelForSequenceClassification`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification:1 +msgid "AutoModelForTokenClassification." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForTokenClassification`. Model weights " +"are loaded by specifying name of a built-in pretrained model, or a " +"community contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForTokenClassification.from_pretrained:12 +msgid "An instance of `AutoModelForTokenClassification`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering:1 +msgid "AutoModelForQuestionAnswering." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForQuestionAnswering`. Model weights are" +" loaded by specifying name of a built-in pretrained model, or a community" +" contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForQuestionAnswering.from_pretrained:12 +msgid "An instance of `AutoModelForQuestionAnswering`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice:1 +msgid "AutoModelForMultipleChoice." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForMultipleChoice`. Model weights are " +"loaded by specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForMultipleChoice.from_pretrained:12 +msgid "An instance of `AutoModelForMultipleChoice`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM:1 +msgid "AutoModelForMaskedLM." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForMaskedLM`. Model weights are loaded " +"by specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForMaskedLM.from_pretrained:12 +msgid "An instance of `AutoModelForMaskedLM`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForCausalLM:1 +msgid "AutoModelForCausalLM." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForCausalLM`. Model weights are loaded " +"by specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForCausalLM.from_pretrained:12 +msgid "An instance of `AutoModelForCausalLM`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoEncoder:1 +msgid "AutoEncoder." 
+msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:1 +msgid "" +"Creates an instance of `AutoEncoder`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoEncoder.from_pretrained:12 +msgid "An instance of `AutoEncoder`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder:1 +msgid "AutoDecoder." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:1 +msgid "" +"Creates an instance of `AutoDecoder`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDecoder.from_pretrained:12 +msgid "An instance of `AutoDecoder`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoGenerator:1 +msgid "AutoGenerator." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:1 +msgid "" +"Creates an instance of `AutoGenerator`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoGenerator.from_pretrained:12 +msgid "An instance of `AutoGenerator`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDiscriminator:1 +msgid "AutoDiscriminator." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:1 +msgid "" +"Creates an instance of `AutoDiscriminator`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoDiscriminator.from_pretrained:12 +msgid "An instance of `AutoDiscriminator`." +msgstr "" + +#: of paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration:1 +msgid "AutoModelForConditionalGeneration." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:1 +msgid "" +"Creates an instance of `AutoModelForConditionalGeneration`. Model weights" +" are loaded by specifying name of a built-in pretrained model, or a " +"community contributed model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.auto.modeling.AutoModelForConditionalGeneration.from_pretrained:12 +msgid "An instance of `AutoModelForConditionalGeneration`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.po new file mode 100644 index 0000000000000000000000000000000000000000..fe7904e1e221f52a3329cd5c1afb345b8f51d758 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.auto.rst:2 +msgid "auto" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..fe94e163b5fc6b803171e20c77f5df398cb4ae9a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.auto.tokenizer.po @@ -0,0 +1,101 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.auto.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer:1 +msgid "" +"AutoClass can help you automatically retrieve the relevant model given " +"the provided pretrained weights/vocabulary. AutoTokenizer is a generic " +"tokenizer class that will be instantiated as one of the base tokenizer " +"classes when created with the AutoTokenizer.from_pretrained() " +"classmethod." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:1 +msgid "" +"Creates an instance of `AutoTokenizer`. Related resources are loaded by " +"specifying name of a built-in pretrained model, or a community-" +"contributed pretrained model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:5 +msgid "" +"Name of pretrained model or dir path to load from. The string can be: - " +"Name of built-in pretrained model - Name of a community-contributed " +"pretrained model. - Local directory path which contains tokenizer related" +" resources and tokenizer config file (\"tokenizer_config.json\")." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:5 +msgid "Name of pretrained model or dir path to load from. The string can be:" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:8 +msgid "Name of built-in pretrained model" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:9 +msgid "Name of a community-contributed pretrained model." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:10 +msgid "" +"Local directory path which contains tokenizer related resources and " +"tokenizer config file (\"tokenizer_config.json\")." 
+msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:13 +msgid "" +"position arguments for model `__init__`. If provided, use these as " +"position argument values for tokenizer initialization." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:16 +msgid "" +"keyword arguments for model `__init__`. If provided, use these to update " +"pre-defined keyword argument values for tokenizer initialization." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:21 +msgid "An instance of `PretrainedTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.auto.tokenizer.AutoTokenizer.from_pretrained:25 +msgid "示例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..27542df560b8f46d3cb84530968371892b44ba30 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.modeling.po @@ -0,0 +1,505 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bart.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder:1 +#: paddlenlp.transformers.bart.modeling.BartEncoder:1 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration:1 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering:1 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification:1 +#: paddlenlp.transformers.bart.modeling.BartModel:1 +msgid "基类::class:`paddlenlp.transformers.bart.modeling.BartPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:1 +msgid "The bare Bart Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartClassificationHead.forward +#: paddlenlp.transformers.bart.modeling.BartDecoder.forward +#: paddlenlp.transformers.bart.modeling.BartEncoder.forward +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward +#: paddlenlp.transformers.bart.modeling.BartModel +#: paddlenlp.transformers.bart.modeling.BartModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `BartModel`. Also is the vocab size of" +" token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `BartModel`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:13 +msgid "" +"The beginning of sequence token that was used during pretraining. Can be " +"used a sequence classifier token. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:17 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:20 +msgid "" +"A special token representing the end of a sequence that was used during " +"pretraining. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:23 +msgid "" +"Dimensionality of the embedding layer, encoder layer and decoder layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:25 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `6`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:27 +msgid "Number of hidden layers in the Transformer decoder. Defaults to `6`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:29 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:32 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"decoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:35 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `d_model` to " +"`encoder_ffn_dim`, and then projected back to `d_model`. Typically " +"`encoder_ffn_dim` is larger than `d_model`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:40 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `d_model` to " +"`decoder_ffn_dim`, and then projected back to `d_model`. Typically " +"`decoder_ffn_dim` is larger than `d_model`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:45 +msgid "" +"The dropout probability used in all fully connected layers (pre-process " +"and post-process of MHA and FFN sub-layer) in the encoders and decoders. " +"Defaults to `0.1`." 
+msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:48 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:52 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"and decoder layers to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:55 +msgid "" +"The dropout probability used after FFN activation in all encoder layers " +"and decoder layers. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:58 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`1024`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel:61 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Default to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:1 +msgid "The BartModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:18 +msgid "" +"Indices of decoder input sequence tokens in the vocabulary. Its data type" +" should be `int64` and it has a shape of [batch_size, sequence_length]. " +"Defaults to `None`, which means no `decoder_input_ids` is provided, the " +"model will create the tensor by shifting the `input_ids` to the right." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:23 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions in `decoder_input_ids`. Its data type and shape is the" +" same as `attention_mask`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:26 +msgid "" +"The output of the encoder, a tuple consists `last_hidden_state`, " +"`hidden_states`(optional), `attentions`(optional). 
The data type of " +"`last_hidden_state` is float32 and its shape is `[batch_size, " +"sequence_length, hidden_size]`. `hidden_states` is hidden_states of all " +"layers in the Transformer encoder. The length of `hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]. `attentions` is attentions of all layers of in the " +"Transformer encoder. The length of `attentions` is `num_hidden_layers`. " +"For all element in the tuple, its data type should be float32 and its " +"shape is [`batch_size, num_attention_heads, sequence_length, " +"sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:33 +msgid "" +"Whether or not to use cache. Defaults to `False`. If set to `True`, key " +"value states will be returned and can be used to speed up decoding." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartModel.forward:36 +msgid "" +"It is a list, and each element in the list is a tuple " +"`(incremental_cache, static_cache)`. See `TransformerDecoder.gen_cache " +"`__" +" for more details. It is only used for inference and should be None for " +"training. Default to `None`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder.forward +#: paddlenlp.transformers.bart.modeling.BartEncoder.forward +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward +#: paddlenlp.transformers.bart.modeling.BartModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder.forward:14 +#: paddlenlp.transformers.bart.modeling.BartModel.forward:42 +msgid "" +"Returns tensor `decoder_output`, which is the output at the last layer of" +" the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder.forward +#: paddlenlp.transformers.bart.modeling.BartEncoder.forward +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward +#: paddlenlp.transformers.bart.modeling.BartModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:31 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:32 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:23 +#: paddlenlp.transformers.bart.modeling.BartModel.forward:47 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartPretrainedModel:1 +msgid "" +"An abstract class for pretrained Bart models. It provides Bart related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." 
+msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartEncoder:1 +msgid "" +"The Transformer Encoder of BartModel. The arguments of BartEncoder can " +"see :class:`BartModel`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartEncoder.forward:1 +msgid "The BartEncoder forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder.forward:3 +#: paddlenlp.transformers.bart.modeling.BartDecoder.forward:5 +#: paddlenlp.transformers.bart.modeling.BartDecoder.forward:7 +#: paddlenlp.transformers.bart.modeling.BartDecoder.forward:9 +#: paddlenlp.transformers.bart.modeling.BartDecoder.forward:11 +#: paddlenlp.transformers.bart.modeling.BartEncoder.forward:3 +#: paddlenlp.transformers.bart.modeling.BartEncoder.forward:5 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:3 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:5 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:7 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:9 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:11 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:13 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:15 +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:27 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:3 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:5 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:7 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:9 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:11 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:13 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:15 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:3 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:5 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:7 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:9 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:11 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:13 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:15 +msgid "See :class:`BartModel`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartEncoder.forward:8 +msgid "" +"Returns tensor `encoder_output`, which is the output at the last layer of" +" the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder:1 +msgid "" +"The Transformer Decoder of BartModel. The arguments of BartDecoder can " +"see :class:`BartModel`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartDecoder.forward:1 +msgid "The BartDecoder forward method, overrides the `__call__()` special method." 
+msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartClassificationHead:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartClassificationHead:1 +msgid "Perform sentence-level classification tasks." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartClassificationHead.forward:1 +msgid "Hidden states of the classification model." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForSequenceClassification:1 +msgid "" +"Bart Model with a linear layer on top of the pooled output, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForConditionalGeneration:3 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering:4 +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification:4 +msgid "An instance of BartModel." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForSequenceClassification:6 +msgid "The number of different labels. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForSequenceClassification:8 +msgid "" +"The dropout probability for output of Bart. If None, use the same value " +"as `hidden_dropout_prob` of `BartModel` instance `bart`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:1 +msgid "" +"The BartForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForSequenceClassification.forward:18 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_labels]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering:1 +msgid "" +"Bart Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:1 +msgid "" +"The BartForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:18 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:18 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:20 +#: paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:20 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:24 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:23 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. 
Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:27 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForQuestionAnswering.forward:27 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bart.modeling.BartForConditionalGeneration:1 +msgid "Bart Model with a `language modeling` head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:1 +msgid "" +"The BartForConditionalGeneration forward method, overrides the __call__()" +" special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:18 +msgid "" +"Returns Tensor `lm_logits` if `use_cache` is `False`, otherwise, returns " +"tuple (`lm_logits`, `cache`). With the fields: - `lm_logits` (Tensor):" +" The generated sentence of the model. Its data type should be " +"float32 and has a shape of [batch_size, sequence_length, vocab_size]. - " +"`cache` (Tensor): See :class:`BartModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:18 +msgid "" +"Returns Tensor `lm_logits` if `use_cache` is `False`, otherwise, returns " +"tuple (`lm_logits`, `cache`)." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:24 +msgid "`lm_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:23 +msgid "" +"The generated sentence of the model. Its data type should be float32 and " +"has a shape of [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.modeling.BartForConditionalGeneration.forward:26 +msgid "`cache` (Tensor):" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.po new file mode 100644 index 0000000000000000000000000000000000000000..919ed199b2784097af46434dab10db7746ab9113 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bart.rst:2 +msgid "bart" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..6deb0f8e7fd2f45fea08d0db4332e8830fc4c425 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bart.tokenizer.po @@ -0,0 +1,133 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bart.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.gpt.tokenizer.GPTTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:1 +msgid "Construct a BART tokenizer based on byte-level Byte-Pair-Encoding." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.gpt.tokenizer.GPTTokenizer`. For more " +"information regarding those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:6 +msgid "" +"Path to the vocabulary file. The vocab file contains a mapping from " +"vocabulary strings to indices." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:9 +msgid "" +"Path to the merge file. The merge file is used to split the input " +"sentence into \"subword\" units. The vocab file is then used to encode " +"those units as intices." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:13 +msgid "Paradigm to follow when decoding bytes to UTF-8. Defaults to `'replace'`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:16 +msgid "The maximum value of the input sequence length. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:19 +msgid "" +"The beginning of sequence token that was used during pretraining. Can be " +"used a sequence classifier token. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:23 +msgid "" +"A special token representing the end of a sequence that was used during " +"pretraining. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:26 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:30 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:33 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:37 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:40 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. 
Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.bart.tokenizer.BartTokenizer:46 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.bart.tokenizer.BartTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.tokenizer.BartTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.bart.tokenizer.BartTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.faster_tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.faster_tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..c13b624495d40542f6bac89800a19d932e6f273c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.faster_tokenizer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.bert.fast_tokenizer.rst:2 +msgid "fast\\_tokenizer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..d401e85f0e2a4c953c1f4c969a7951399dbe8df2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.modeling.po @@ -0,0 +1,679 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM:1 +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice:1 +#: paddlenlp.transformers.bert.modeling.BertForPretraining:1 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering:1 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification:1 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification:1 +#: paddlenlp.transformers.bert.modeling.BertModel:1 +msgid "基类::class:`paddlenlp.transformers.bert.modeling.BertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:1 +msgid "The bare BERT Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM +#: paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward +#: paddlenlp.transformers.bert.modeling.BertForPretraining +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward +#: paddlenlp.transformers.bert.modeling.BertModel +#: paddlenlp.transformers.bert.modeling.BertModel.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingCriterion +#: paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `BertModel`. Also is the vocab size of" +" token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:38 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:41 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:41 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel:51 +msgid "" +"The non-linear activation function in the pooling layer. Defaults to " +"`\"tanh\"`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:1 +msgid "The BertModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. 
Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:31 +msgid "Whether to return the output of each hidden layers. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward +#: paddlenlp.transformers.bert.modeling.BertModel.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`). With the fields: - `sequence_output` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. - `pooled_output` (Tensor): The output of first token " +"(`[CLS]`) in sequence. We \"pool\" the model by simply taking the " +"hidden state corresponding to the first token. 
Its data type should " +"be float32 and its shape is [batch_size, hidden_size]. - " +"`encoder_outputs` (List(Tensor)): A list of Tensor containing hidden-" +"states of the model at each hidden layer in the Transformer encoder. " +"The length of the list is `num_hidden_layers`. Each Tensor has a data" +" type of float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:14 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:10 +#: paddlenlp.transformers.bert.modeling.BertModel.forward:37 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:16 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:41 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:40 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:1 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:46 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:44 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:4 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:50 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertModel.forward:49 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and its shape" +" is [batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward +#: paddlenlp.transformers.bert.modeling.BertModel.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:15 +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:17 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:22 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:17 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:17 +#: paddlenlp.transformers.bert.modeling.BertModel.forward:55 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainedModel:1 +msgid "" +"An abstract class for pretrained BERT models. It provides BERT related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining:1 +msgid "Bert Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM:3 +#: paddlenlp.transformers.bert.modeling.BertForPretraining:3 +msgid "An instance of :class:`BertModel`." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:1 +#: paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:7 +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward:1 +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForPretraining.forward:7 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:7 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:9 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:7 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:9 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads:3 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads:5 +msgid "See :class:`BertModel`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:9 +msgid "See :class:`BertPretrainingHeads`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:12 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:14 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next" +" sentence prediction. Its data type should be float32 and its shape " +"is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:12 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:14 +msgid "Returns tuple (``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:19 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:21 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:17 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:19 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:22 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:24 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForPretraining.forward:22 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:24 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion:1 +#: paddlenlp.transformers.bert.modeling.BertPretrainingHeads:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion:1 +msgid "" +"Vocabulary size of `inputs_ids` in `BertModel`. Defines the number of " +"different tokens that can be represented by the `inputs_ids` passed when " +"calling `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:1 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:5 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:8 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"1]. Otherwise, its shape is [batch_size, mask_token_num, 1]" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:12 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:16 +msgid "" +"The scale of masked tokens. Used for the normalization of masked language" +" modeling loss. If it is a `Tensor`, its data type should be int64 and " +"its shape is equal to `prediction_scores`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingCriterion.forward:20 +msgid "" +"The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean" +" of `next_sentence_loss`. Its data type should be float32 and its shape " +"is [1]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingHeads:1 +msgid "Perform language modeling task and next sentence classification task." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingHeads:7 +msgid "Activation function used in the language modeling task." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingHeads:9 +msgid "" +"Decoding weights used to map hidden_states to logits of the masked token " +"prediction. Its data type should be float32 and its shape is [vocab_size," +" hidden_size]. Defaults to `None`, which means use the same weights of " +"the embedding layer." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertPretrainingHeads.forward:8 +msgid "" +"A tensor indicates positions to be masked in the position embedding. Its " +"data type should be int64 and its shape is [batch_size, mask_token_num]. " +"`mask_token_num` is the number of masked tokens. It should be no bigger " +"than `sequence_length`. Defaults to `None`, which means we output hidden-" +"states of all tokens in masked token prediction." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForSequenceClassification:1 +msgid "" +"Bert Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice:4 +#: paddlenlp.transformers.bert.modeling.BertForQuestionAnswering:4 +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification:4 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification:4 +msgid "An instance of BertModel." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForSequenceClassification:6 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForSequenceClassification:8 +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification:8 +msgid "" +"The dropout probability for output of BERT. If None, use the same value " +"as `hidden_dropout_prob` of `BertModel` instance `bert`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:1 +msgid "" +"The BertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.modeling.BertForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForTokenClassification:1 +msgid "" +"Bert Model with a linear layer on top of the hidden-states output layer, " +"designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:1 +msgid "" +"The BertForTokenClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.modeling.BertForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering:1 +msgid "" +"Bert Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering:6 +msgid "" +"The dropout probability for output of BERT. If None, use the same value " +"as `hidden_dropout_prob` of `BertModel` instance `bert`. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:1 +msgid "" +"The BertForQuestionAnswering forward method, overrides the __call__() " +"special method." 
+msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:8 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:8 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:14 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:13 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:17 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice:1 +msgid "" +"Bert Model with a linear layer on top of the hidden-states output layer, " +"designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice:8 +msgid "" +"The dropout probability for output of Bert. If None, use the same value " +"as `hidden_dropout_prob` of `BertModel` instance `bert`. Defaults to " +"None." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:1 +msgid "" +"The BertForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:3 +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:5 +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:7 +#: paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:9 +msgid "" +"See :class:`BertModel` and shape as [batch_size, num_choice, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMultipleChoice.forward:12 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM:1 +msgid "Bert Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.bert.modeling.BertForMaskedLM.forward:10 +msgid "" +"Returns tensor `prediction_scores`, The scores of masked token " +"prediction. Its data type should be float32 and shape is [batch_size, " +"sequence_length, vocab_size]." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.po new file mode 100644 index 0000000000000000000000000000000000000000..aba47451e1de48437f8d453adecb47974f673851 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert.rst:2 +msgid "bert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..00bf336ab2b5f380d378e30faff64451262ed0f9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert.tokenizer.po @@ -0,0 +1,358 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer:1 +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer:1 +msgid "Runs basic tokenization (punctuation splitting, lower casing, etc.)." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer +#: paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer:3 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize:1 +msgid "Tokenizes a piece of text using basic tokenizer." 
+msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize:3 +msgid "A piece of text." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.vocab_size +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize:6 +msgid "A list of tokens." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.vocab_size +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BasicTokenizer.tokenize:10 +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer:30 +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string:12 +#: paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:1 +msgid "" +"Constructs a BERT tokenizer. It uses a basic tokenizer to do punctuation " +"splitting, lower casing and so on, and follows a WordPiece tokenizer to " +"tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:5 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:8 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:11 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:15 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:18 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." 
+msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:21 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer:24 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.BertTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also removes " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:4 +msgid "A BERT sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A BERT offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of wordpiece offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:3 +msgid "A BERT sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.bert.tokenizer.BertTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer:1 +msgid "Runs WordPiece tokenization." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer:3 +msgid "Vocab of the word piece tokenizer." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer:5 +msgid "A specific token to replace all unknown tokens." 
+msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer:7 +msgid "" +"If a word's length is more than max_input_chars_per_word, it will be " +"dealt as unknown word. Defaults to 100." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize:1 +msgid "" +"Tokenizes a piece of text into its word pieces. This uses a greedy " +"longest-match-first algorithm to perform tokenization using the given " +"vocabulary." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize:5 +msgid "" +"A single token or whitespace separated tokens. This should have already " +"been passed through `BasicTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.bert.tokenizer.WordpieceTokenizer.tokenize:8 +msgid "A list of wordpiece tokens." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.po new file mode 100644 index 0000000000000000000000000000000000000000..b016b4e1a04987291399e46e6c470de98c734560 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.rst:2 +msgid "convert\\_bert\\_japanese\\_params" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.po new file mode 100644 index 0000000000000000000000000000000000000000..6241941cf9728f6cb3ad6db6e5eac7b0fac5f568 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert_japanese.rst:2 +msgid "bert\\_japanese" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..1224431a96b89265c119e67ac812d918b49513f9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bert_japanese.tokenizer.po @@ -0,0 +1,152 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bert_japanese.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:1 +msgid "Construct a BERT tokenizer for Japanese text, based on a MecabTokenizer." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:3 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:6 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`False`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:9 +msgid "Whether to do word tokenization. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:11 +msgid "Whether to do subword tokenization. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:13 +msgid "Type of word tokenizer. Defaults to`basic`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:15 +msgid "Type of subword tokenizer. Defaults to`wordpiece`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:17 +msgid "Kept for backward compatibility purposes. Defaults to`None`." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:19 +msgid "Dictionary passed to the `MecabTokenizer` constructor." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:21 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:25 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:28 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:31 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:34 +msgid "" +"A special token representing a masked token. 
This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.BertJapaneseTokenizer:40 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer:1 +#: paddlenlp.transformers.bert_japanese.tokenizer.MecabTokenizer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.MecabTokenizer:1 +msgid "Runs basic tokenization with MeCab morphological parser." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.MecabTokenizer.tokenize:1 +msgid "Tokenizes a piece of text." +msgstr "" + +#: of paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer:1 +msgid "Runs Character tokenization." +msgstr "" + +#: of +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize:1 +msgid "Tokenizes a piece of text into characters." +msgstr "" + +#: of +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize:3 +msgid "" +"For example, `input = \"apple\"\"` wil return as output `[\"a\", \"p\", " +"\"p\", \"l\", \"e\"]`." +msgstr "" + +#: of +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize:5 +msgid "" +"A single token or whitespace separated tokens. This should have already " +"been passed through `BasicTokenizer`." +msgstr "" + +#: of +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.bert_japanese.tokenizer.CharacterTokenizer.tokenize:8 +msgid "A list of characters." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..b074b8478b22ef35a31292718d056542217d2f4b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.modeling.po @@ -0,0 +1,793 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bigbird.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel:1 +msgid "基类::class:`paddlenlp.transformers.bigbird.modeling.BigBirdPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:1 +msgid "The bare BigBird Model outputting raw hidden-states." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM +#: paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:10 +msgid "Number of hidden layers in the Transformer encoder." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:12 +msgid "" +"Vocabulary size of `inputs_ids` in `BigBirdModel`. Also is the vocab size" +" of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling " +"`BigBirdModel`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:15 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:17 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the Transformer encoder." +" Input tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"``, ``\"silu\"`` and ``\"gelu_new\"`` are " +"supported. Defaults to `\"gelu\"`." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:29 +msgid "" +"Indicates whether to put layer normalization into preprocessing of MHA " +"and FFN sub-layers. If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:35 +msgid "The block size for the attention mask. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:38 +msgid "The number of block in a window. Defaults to `3`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:41 +msgid "Number of global blocks per sequence. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:44 +msgid "Number of random blocks per row. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:47 +msgid "The random seed for generating random block id. Defaults to ``None``." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:50 +msgid "The index of padding token for BigBird embedding. Defaults to ``0``." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:53 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:56 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:59 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel:62 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:1 +msgid "The BigBirdModel forward method, overrides the __call__() special method." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. Its data type should " +"be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:6 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can either be 0 or 1: - 0 corresponds to a *sentence A* " +"token, - 1 corresponds to a *sentence B* token. Its data type should be " +"`int64` and it has a shape of [batch_size, sequence_length]. Defaults to " +"``None``, which means we don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:6 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can either be 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:9 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:10 +msgid "1 corresponds to a *sentence B* token." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:12 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to ``None``, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:15 +msgid "" +"A list which contains some tensors used in multi-head attention to " +"prevents attention to some unwanted positions, usually the paddings or " +"the subsequent positions. Its data type can be int, float and bool. When " +"the data type is bool, the `masked` tokens have `False` values and the " +"others have `True` values. When the data type is int, the `masked` tokens" +" have `0` values and the others have `1` values. When the data type is " +"float, the `masked` tokens have `-INF` values and the others have `0` " +"values. It is a tensor with shape broadcasted to `[batch_size, n_head, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:27 +msgid "A list which contains some tensors used in bigbird random block." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:30 +msgid "" +"Returns tuple (`encoder_output`, `pooled_output`). With the fields: - " +"encoder_output (Tensor): Sequence of output at the last layer of the " +"model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]. - pooled_output (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:30 +msgid "Returns tuple (`encoder_output`, `pooled_output`)." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:12 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:12 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:19 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:14 +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:32 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:17 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:36 +msgid "encoder_output (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:35 +msgid "" +"Sequence of output at the last layer of the model. Its data type should " +"be float32 and has a shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:40 +msgid "pooled_output (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:39 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:31 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:17 +#: paddlenlp.transformers.bigbird.modeling.BigBirdModel.forward:45 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainedModel:1 +msgid "" +"An abstract class for pretrained BigBird models. It provides BigBird " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining:1 +msgid "BigBird Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification:4 +msgid "An instance of :class:`BigBirdModel`." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:1 +msgid "" +"The BigBirdForPretraining forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:9 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:9 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:9 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:9 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:5 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:7 +msgid "See :class:`BigBirdModel`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:11 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:9 +msgid "" +"A tensor indicates positions to be masked in the position embedding. Its " +"data type should be int64 and its shape is [batch_size, mask_token_num]. " +"`mask_token_num` is the number of masked tokens. It should be no bigger " +"than `sequence_length`. Defaults to `None`, which means we output hidden-" +"states of all tokens in masked token prediction." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:17 +msgid "" +"Returns tuple (`prediction_scores`, `seq_relationship_score`). With the " +"fields: - prediction_scores (Tensor): The scores of masked token " +"prediction. Its data type should be float32 and its shape is " +"[batch_size, sequence_length, vocab_size]. - seq_relationship_score " +"(Tensor): The scores of next sentence prediction. Its data type " +"should be float32 and its shape is [batch_size, 2]." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:17 +msgid "Returns tuple (`prediction_scores`, `seq_relationship_score`)." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:23 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:22 +msgid "prediction_scores (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:22 +msgid "" +"The scores of masked token prediction. Its data type should be float32 " +"and its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:26 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:25 +msgid "seq_relationship_score (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForPretraining.forward:26 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:7 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion:1 +msgid "BigBird Criterion for a pretraining task on top." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion:5 +msgid "" +"It decides whether it considers Next Sentence Prediction loss. Defaults " +"to `False`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion:8 +msgid "" +"Specifies a target value that is ignored and does not contribute to the " +"input gradient. Only valid if :attr:`soft_label` is set to :attr:`False`." +" Defaults to `0`." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:1 +msgid "" +"The BigBirdPretrainingCriterion forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:3 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:20 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:10 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"1]. Otherwise, its shape is [batch_size, mask_token_num, 1]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:14 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:17 +msgid "" +"The scale of masked tokens. Used for the normalization of masked language" +" modeling loss. 
If it is a `Tensor`, its data type should be int64 and " +"its shape is equal to `prediction_scores`." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:20 +msgid "" +"The weight of masked tokens. Its data type should be float32 and its " +"shape is [mask_token_num, 1]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:24 +msgid "" +"The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean" +" of `next_sentence_loss`. Its data type should be float32 and its shape " +"is [1]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:15 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:26 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:17 +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingCriterion.forward:29 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification:1 +msgid "" +"BigBird Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification:6 +msgid "The number of classes. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:1 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:1 +msgid "" +"The BigBirdForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForSequenceClassification.forward:12 +msgid "" +"Returns tensor `output`, a tensor of the input text classification " +"logits. Its data type should be float32 and it has a shape of " +"[batch_size, num_classes]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:1 +msgid "The BigBird pretraining heads for a pretraining task." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads:9 +msgid "" +"The weight of pretraining embedding layer. Its data type should be " +"float32 and its shape is [hidden_size, vocab_size]. If set to `None`, use" +" normal distribution to initialize weight. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:1 +msgid "" +"The BigBirdPretrainingHeads forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:3 +msgid "" +"The sequence output of BigBirdModel. Its data type should be float32 and " +"has a shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:6 +msgid "" +"The pooled output of BigBirdModel. Its data type should be float32 and " +"has a shape of [batch_size, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:15 +msgid "" +"(``prediction_scores``, ``seq_relationship_score``). With the fields: -" +" prediction_scores (Tensor): The scores of masked token prediction. " +"Its data type should be float32. If `masked_positions` is None, its " +"shape is [batch_size, sequence_length, vocab_size]. Otherwise, its " +"shape is [batch_size, mask_token_num, vocab_size]. 
- " +"seq_relationship_score (Tensor): The logits whether 2 sequences are " +"NSP relationship. Its data type should be float32 and has a shape of " +"[batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:15 +msgid "(``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdPretrainingHeads.forward:25 +msgid "" +"The logits whether 2 sequences are NSP relationship. Its data type should" +" be float32 and has a shape of [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering:1 +msgid "" +"BigBird Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice:4 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering:4 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification:4 +msgid "An instance of BigBirdModel." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering:6 +msgid "" +"The dropout probability for output of BigBirdModel. If None, use the same" +" value as `hidden_dropout_prob` of `BigBirdModel` instance `bigbird`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:1 +msgid "" +"The BigBirdForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:12 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:18 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:21 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForQuestionAnswering.forward:21 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification:1 +msgid "" +"BigBird Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice:8 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification:8 +msgid "" +"The dropout probability for output of BIGBIRD. If None, use the same " +"value as `hidden_dropout_prob` of `BigBirdModel` instance `bigbird`. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice:1 +msgid "" +"BigBird Model with a linear layer on top of the hidden-states output " +"layer, designed for multiple choice tasks like RocStories/SWAG tasks ." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:1 +msgid "" +"The BigBirdForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:3 +msgid "" +"See :class:`BigBirdModel` and shape as [batch_size, num_choice, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:5 +msgid "" +"See :class:`BigBirdModel` and shape as [batch_size, num_choice, n_head, " +"sequence_length, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMultipleChoice.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, 1]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM:1 +msgid "BigBird Model with a `language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:7 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:7 +msgid "" +"The Labels for computing the masked language modeling loss. Indices " +"should be in ``[-100, 0, ..., vocab_size]`` Tokens with indices set to " +"``-100`` are ignored (masked), the loss is only computed for the tokens " +"with labels in ``[0, ..., vocab_size]`` Its shape is [batch_size, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:10 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:10 +msgid "" +"Returns tuple (`masked_lm_loss`, `prediction_scores`, " +"``sequence_output`). With the fields: - `masked_lm_loss` (Tensor): " +"The masked lm loss. Its data type should be float32 and its shape is [1]." +" - `prediction_scores` (Tensor): The scores of masked token " +"prediction. Its data type should be float32. Its shape is [batch_size, " +"sequence_length, vocab_size]. - `sequence_output` (Tensor): Sequence" +" of hidden-states at the last layer of the model. Its data type should be" +" float32. Its shape is `[batch_size, sequence_length, hidden_size]`." 
+msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:10 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:10 +msgid "Returns tuple (`masked_lm_loss`, `prediction_scores`, ``sequence_output`)." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:15 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:15 +msgid "`masked_lm_loss` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:15 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:15 +msgid "The masked lm loss. Its data type should be float32 and its shape is [1]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:18 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:18 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:18 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:18 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"Its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:20 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:20 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM.forward:21 +#: paddlenlp.transformers.bigbird.modeling.BigBirdForMaskedLM.forward:21 +msgid "" +"Sequence of hidden-states at the last layer of the model. Its data type " +"should be float32. Its shape is `[batch_size, sequence_length, " +"hidden_size]`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.modeling.BigBirdForCausalLM:1 +msgid "BigBird Model for casual language model tasks." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.po new file mode 100644 index 0000000000000000000000000000000000000000..bd93b32f357387bef3e839976a6bbe9288b501a0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bigbird.rst:2 +msgid "bigbird" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..cb7fec73508f6640d43a81a1510b954161fb665d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.bigbird.tokenizer.po @@ -0,0 +1,241 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.bigbird.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:1 +msgid "" +"Constructs an BigBird tokenizer based on `SentencePiece " +"`__." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:7 +msgid "" +"The vocabulary file (ends with '.spm') required to instantiate a " +"`SentencePiece `__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:10 +msgid "" +"Whether the text strips accents and convert to Whether or not to " +"lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:14 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:18 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:21 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:24 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:27 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer +msgid "引发" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer:32 +msgid "If file sentencepiece_model_file doesn't exist." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also removes " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:1 +msgid "Returns a tuple containing the encoded sequence and mask information." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:3 +msgid "" +"The first sequence to be encoded. This can be a string, a list of strings" +" (tokenized string using the `tokenize` method) or a list of integers " +"(tokenized string ids using the `convert_tokens_to_ids` method)" +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:7 +msgid "" +"If set to a number, will limit the total sequence returned so that it has" +" a maximum length. If set to None, will not limit the total sequence. " +"Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:11 +msgid "" +"If set to a number, will limit the mask sequence returned so that it has " +"a maximum prediction length. If set to None, will not limit the mask " +"sequence." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:14 +msgid "The probability of the token to be masked. Defaults to `0.15`." +msgstr "" + +#: of paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.encode:17 +msgid "" +"Returns tuple (span_ids, masked_lm_positions, masked_lm_ids, " +"masked_lm_weights)." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. 
Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:4 +msgid "A BigBird sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:11 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.bigbird.tokenizer.BigBirdTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..72618fc633594ed8de483a3d11151bf96a577284 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.modeling.po @@ -0,0 +1,461 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotDecoder:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotEncoder:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:1 +msgid "基类::class:`paddlenlp.transformers.blenderbot.modeling.BlenderbotPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:1 +msgid "Construct a bare Blenderbot Model." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Check the " +"superclass documentation for the generic methods and the library " +"implements for all its model." 
+msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:10 +msgid "Vocabulary size of the Blenderbot model." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:12 +msgid "The id for begging of sentences token. Defaults to ``1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:14 +msgid "The id for padding token. Defaults to ``0``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:16 +msgid "The id for end of sentence token. Defaults to ``2``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:18 +msgid "The id indicating the start of decoding sentence. Defaults to ``1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:20 +msgid "Dimensionality of the layers and the pooler layer. Defaults to ``1280``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:22 +msgid "" +"Number of Transformer encoder layers for BlenderbotEncoder. Defaults to " +"``2``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:24 +msgid "" +"Number of Transformer decoder layers for BlenderbotDecoder. Defaults to " +"``12``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:26 +msgid "" +"Number of attention heads for each Transformer encoder layer in " +"BlenderbotEncoder. Defaults to ``32``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:29 +msgid "" +"Number of attention heads for each Transformer decoder layer in " +"BlenderbotDecoder. Defaults to ``32``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:32 +msgid "" +"Dimensionality of the feed-forward layer for each Transformer encoder " +"layer in BlenderbotEncoder. Defaults to ``5120``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:35 +msgid "" +"Dimensionality of the feed-forward layer for each Transformer dncoder " +"layer in BlenderbotDncoder. Defaults to ``5120``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:38 +msgid "" +"The dropout probability for all fully connected layers in the embeddings," +" encoder, and pooler. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:41 +msgid "" +"The non-linear activation function (function or string) in the encoder " +"and pooler. ``\"gelu\"``, ``\"relu\"`` and any other paddle supported " +"activation functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:45 +msgid "The dropout ratio for the attention probabilities. Defaults to ``0.0``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:48 +msgid "The dropout ratio for activations inside the fully connected layer." 
+msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:50 +msgid ", The max position index of an input sequence. Defaults to ``128``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:53 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Defaults to ``0.02``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:56 +msgid "" +"Indicate whether to scale embeddings by diving by sqrt(d_model). Defaults" +" to ``True``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel:58 +msgid "" +"Indicate whether to put layer normalization into preprocessing of MHA and" +" FFN sub-layers. If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to ``True``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:5 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:5 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**. - **1** for tokens that are " +"**not masked**, - **0** for tokens that are **masked**. It's data type " +"should be `float32` and has a shape of [batch_size, sequence_length]. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:5 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:5 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:10 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:10 +msgid "**1** for tokens that are **not masked**," +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:11 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:11 +msgid "**0** for tokens that are **masked**." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:13 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:13 +msgid "" +"It's data type should be `float32` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:16 +msgid "" +"If not provided, ``decoder_input_ids`` will be automatically generated " +"based on ``decoder_start_token_id`` and ``input_ids``." 
+msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:19 +msgid "" +"If not provided, the default ``decoder_attention_mask`` will be a tensor " +"with upper triangular part being ``-np.inf``. the shape will be " +"``(decoder_length, decoder_length)``" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:22 +msgid "" +"The output of encoder. If not provided, a ``encoder_output`` will be " +"generated from BlenderbotEncoder. Defaults to ``None``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:16 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:25 +msgid "Indicates whether to use cache to speed up decoding. Defaults to ``False``" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:18 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:27 +msgid "" +"It is a list, and each element in the list is a tuple( " +":code:`(incremental_cache, static_cache)` ). See " +"`paddle.nn.TransformerDecoder.gen_cache` for more details. It is only " +"used for inference and should be None for training. Default None." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotEncoder.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:33 +msgid "" +"If ``use_cache=False``, the return will be the last hidden state of " +"decoder with shape of [batch_size, seq_lens, hidden_size]. ``seq_lens`` " +"corresponds to the length of input sequence. Otherwise, the return will " +"be a tuple of ``(decoder_output, cache)``. Please refer to class " +":class:`paddle.nn.TransformerDecoder` for more information regarding " +"``cache``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotEncoder.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:32 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:11 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.forward:40 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.get_encoder:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotModel.get_encoder:1 +msgid "This method is required for model with encoder-decoder architecture." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotPretrainedModel:1 +msgid "" +"An abstract class for pretrained Blenderbot models. It provides " +"Blenderbot related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. Refer " +"to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotEncoder:1 +msgid "" +"The encoder of Blenderbot Model. Please refer to " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` or " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding methods and arguments." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotEncoder.forward:1 +msgid "" +"The last hidden states at the last layer of the encoder. It's data type " +"should be `float` and has a shape of `(batch_size, seq_lens, " +"hidden_size)`. ``seq_lens`` corresponds to the length of input sequence." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotDecoder:1 +msgid "" +"The decoder of Blenderbot Model. Please refer to " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` and " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding methods and arguments." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotDecoder.forward:1 +msgid "" +"Please refer to " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding the arguments." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:1 +msgid "" +"Please refer to " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding arguments. :returns:" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:27 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:6 +msgid "If ``use_cache=False``, the return will be a tensor with shape of" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:6 +msgid "" +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(decoder_output, cache)``." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:14 +msgid "" +"import paddle from paddlenlp.transformers import BlenderbotTokenizer, " +"BlenderbotForConditionalGeneration" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:17 +msgid "" +"pretrained_model_name = \"blenderbot-400M-distill\" tokenizer = " +"BlenderbotTokenizer.from_pretrained(pretrained_model_name) model = " +"BlenderbotForConditionalGeneration.from_pretrained(pretrained_model_name)" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:21 +msgid "" +"sample_text = \"My friends are cool but they eat too many carbs.\" inputs" +" = tokenizer(sample_text, return_attention_mask=True, " +"return_token_type_ids=False) inputs = {k: paddle.to_tensor([v]) for (k, " +"v) in inputs.items()}" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:25 +msgid "" +"# Generate response using beam search result_ids, scores = " +"model.generate(input_ids=inputs['input_ids']," +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:27 +msgid "" +"max_length=60, min_length=20, decode_strategy='beam_search', " +"num_beams=10, length_penalty=0.65)" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:34 +msgid "for sequence_ids in result_ids.numpy().tolist():" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.forward:33 +msgid "" +"print(\"User: \", sample_text) print(\"bot: \", " +"tokenizer.convert_ids_to_string(sequence_ids)) # \"bot: That's " +"unfortunate. Are they trying to lose weight?\"" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.prepare_inputs_for_generation:1 +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForConditionalGeneration.prepare_inputs_for_generation:1 +msgid "" +"Prepare inputs for decoder to generate sentences. :returns: A dictionary " +"containing necessary inputs for generating next token. :rtype: dict" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM:1 +msgid "" +"Constructs BLenderbot For Causal Language Model. This model is equivalent" +" to the blenderbot decoder without cross-attention." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:24 +msgid "" +"If ``use_cache=False``, the return will be a tensor with shape of " +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(lm_logits, cache)``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:27 +msgid "" +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(lm_logits, cache)``." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:35 +msgid "" +"import paddle from paddlenlp.transformers import BlenderbotTokenizer, " +"BlenderbotForCausalLM use_cache = False text = \"My friends are cool but " +"they eat too many carbs.\" model_name = \"blenderbot-400M-distill\" " +"tokenizer = BlenderbotTokenizer.from_pretrained(model_name) model = " +"BlenderbotForCausalLM.from_pretrained(model_name) model.eval() inputs = " +"tokenizer(text) inputs = {k: paddle.to_tensor([v]) for (k, v) in " +"inputs.items()}" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:47 +msgid "with paddle.no_grad():" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.modeling.BlenderbotForCausalLM.forward:47 +msgid "" +"outputs = model(**inputs, use_cache=use_cache) # outputs is a tuple of " +"(lm_logits, cache) if ``use_cache=True``." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.po new file mode 100644 index 0000000000000000000000000000000000000000..e0443109a91e8ba8f83c747c4543faef8eea9dd5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot.rst:2 +msgid "blenderbot" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..904d63a5aa318252cf69e691b3dba83085005ad1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot.tokenizer.po @@ -0,0 +1,110 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.gpt.tokenizer.GPTTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:1 +msgid "" +"Construct a Blenderbot tokenizer, derived from the GPT tokenizer, using " +"byte-level Byte-Pair-Encoding." 
+msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:4 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.GPTTokenizer`, which contains most of the" +" main methods. Please should refer to the superclass for more information" +" regarding methods. :param vocab_file: file path of the vocabulary :type " +"vocab_file: str :param merges_file: file path of the merges file. :type " +"merges_file: str :param errors: The method to handle errors in decoding " +":type errors: str :param max_len: The specified maximum sequence length. " +"Default: \"None\". :type max_len: int :param special_tokens: The " +"additional special tokens. Default: \"None\". :type special_tokens: dict " +":param bos_token: The special token for beginning of sequence token. " +"Default: \"\". :type bos_token: str :param eos_token: The special " +"token for end of sequence token. Default: \"\". :type eos_token: str " +":param cls_token: The special token for cls. Default: \"\". :type " +"cls_token: str :param sep_token: The special token for separator token . " +"Default: \"\". :type sep_token: str :param pad_token: The special " +"token for padding. Default: \"\". :type pad_token: str :param " +"eol_token: The special token for newline. Default: \"\\u010a\". :type " +"eol_token: str :param add_prefix: Whether or not to add an initial space " +"to the input." +msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:30 +msgid "" +"This allows to treat the leading word just as any other word. (Blenderbot" +" adds an initial space when tokenizes input text, which" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:32 +msgid "is differnt from BlenderbotSmall)" +msgstr "" + +#: of paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer:36 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens:1 +msgid "A Blenderbot sequence has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens:5 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens:7 +msgid "token_ids_1 Will be ignored" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens:10 +msgid "List of input_id with the appropriate special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot.tokenizer.BlenderbotTokenizer.build_inputs_with_special_tokens:11 +msgid ":obj:`List[int]`" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..0145bef9318f8615ee0f3f5a42067a62e8cecb98 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.modeling.po @@ -0,0 +1,477 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot_small.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallDecoder:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallEncoder:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:1 +msgid "基类::class:`paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:1 +msgid "Construct a bare BlenderbotSmall Model." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Check the " +"superclass documentation for the generic methods and the library " +"implements for all its model." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:10 +msgid "Vocabulary size of the BlenderbotSmall model." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:12 +msgid "The id for begging of sentences token. Defaults to ``1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:14 +msgid "The id for padding token. Defaults to ``0``." 
+msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:16 +msgid "The id for end of sentence token. Defaults to ``2``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:18 +msgid "The id indicating the start of decoding sentence. Defaults to ``1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:20 +msgid "Dimensionality of the layers and the pooler layer. Defaults to ``512``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:22 +msgid "" +"Number of Transformer encoder layers for BlenderbotSmallEncoder. Defaults" +" to ``8``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:24 +msgid "" +"Number of Transformer decoder layers for BlenderbotSmallDecoder. Defaults" +" to ``8``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:26 +msgid "" +"Number of attention heads for each Transformer encoder layer in " +"BlenderbotSmallEncoder. Defaults to ``16``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:29 +msgid "" +"Number of attention heads for each Transformer decoder layer in " +"BlenderbotSmallDecoder. Defaults to ``16``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:32 +msgid "" +"Dimensionality of the feed-forward layer for each Transformer encoder " +"layer in BlenderbotSmallEncoder. Defaults to ``2048``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:35 +msgid "" +"Dimensionality of the feed-forward layer for each Transformer dncoder " +"layer in BlenderbotSmallDncoder. Defaults to ``2048``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:38 +msgid "" +"The dropout probability for all fully connected layers in the embeddings," +" encoder, and pooler. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:41 +msgid "" +"The non-linear activation function (function or string) in the encoder " +"and pooler. ``\"gelu\"``, ``\"relu\"`` and any other paddle supported " +"activation functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:45 +msgid "The dropout ratio for the attention probabilities. Defaults to ``0.0``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:48 +msgid "The dropout ratio for activations inside the fully connected layer." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:50 +msgid ", The max position index of an input sequence. Defaults to ``512``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:53 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Defaults to ``0.02``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:56 +msgid "" +"Indicate whether to scale embeddings by diving by sqrt(d_model). Defaults" +" to ``True``." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel:58 +msgid "" +"Indicate whether to put layer normalization into preprocessing of MHA and" +" FFN sub-layers. 
If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to ``False``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:5 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:5 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**. - **1** for tokens that are " +"**not masked**, - **0** for tokens that are **masked**. It's data type " +"should be `float32` and has a shape of [batch_size, sequence_length]. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:5 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:5 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:10 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:10 +msgid "**1** for tokens that are **not masked**," +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:11 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:11 +msgid "**0** for tokens that are **masked**." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:13 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:13 +msgid "" +"It's data type should be `float32` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:16 +msgid "" +"If not provided, ``decoder_input_ids`` will be automatically generated " +"based on ``decoder_start_token_id`` and ``input_ids``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:19 +msgid "" +"If not provided, the default ``decoder_attention_mask`` will be a tensor " +"with upper triangular part being ``-np.inf``. the shape will be " +"``(decoder_length, decoder_length)``" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:22 +msgid "" +"The output of encoder. If not provided, a new ``encoder_output`` will be " +"generated from BlenderbotEncoder. Defaults to ``None``." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:16 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:25 +msgid "Indicates whether to use cache to speed up decoding. Defaults to ``False``" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:27 +msgid "" +"It is a list, and each element in the list is a tuple( " +":code:`(incremental_cache, static_cache)` ). See " +"`TransformerDecoder.gen_cache` for more details. It is only used for " +"inference and should be None for training. Default None." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallEncoder.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:33 +msgid "" +"If ``use_cache=False``, the return will be the last hidden state of " +"decoder with shape of [batch_size, seq_lens, hidden_size]. ``seq_lens`` " +"corresponds to the length of input sequence. Otherwise, the return will " +"be a tuple of ``(decoder_output, cache)``. Please refer to class " +":class:`paddle.nn.TransformerDecoder` for more information regarding " +"``cache``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallEncoder.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:32 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration:11 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:40 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:43 +msgid "" +"import paddle from paddlenlp.transformers import " +"BlenderbotSmallTokenizer, BlenderbotSmallModel" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:46 +msgid "" +"# \"blenderbot_small-90M\" is pretrained weight of " +"BlenderbotSmallForConditionalGeneration, # Therefore some weight of " +"additional layers in BlenderbotSmallForConditionalGeneration # might not " +"be loaded and used. pretrained_model_name = \"blenderbot_small-90M\" " +"tokenizer = " +"BlenderbotSmallTokenizer.from_pretrained(pretrained_model_name) model = " +"BlenderbotSmallModel.from_pretrained(pretrained_model_name)" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.forward:53 +msgid "" +"sample_text = \"My friends are cool but they eat too many carbs.\" inputs" +" = tokenizer(sample_text, return_attention_mask=True, " +"return_token_type_ids=False) inputs = {k:paddle.to_tensor([v]) for (k, v)" +" in inputs.items()} decoder_output = model(**inputs)" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration.get_encoder:1 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallModel.get_encoder:1 +msgid "This method is required for model with encoder-decoder architecture." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallPretrainedModel:1 +msgid "" +"An abstract class for pretrained BlenderbotSmall models. It provides " +"BlenderbotSmall related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. Refer " +"to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallEncoder:1 +msgid "" +"The encoder of BlenderbotSmall Model. Please refer to " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` or " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotSmallModel` for more" +" details regarding methods and arguments." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallEncoder.forward:1 +msgid "" +"The last hidden-states at the last layer of the encoder. It's data type " +"should be `float` and has a shape of `(batch_size, seq_lens, " +"hidden_size)`. ``seq_lens`` corresponds to the length of input sequence." +msgstr "" + +#: of paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallDecoder:1 +msgid "" +"The decoder of BlenderbotSmall Model. Please refer to " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` and " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding methods and arguments." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallDecoder.forward:1 +msgid "" +"Please refer to " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding the arguments." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration:1 +msgid "" +"Please refer to " +":class:`~paddlenlp.transformers.Blenderbot.BlenderbotModel` for more " +"information regarding arguments. :returns:" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:27 +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration:6 +msgid "If ``use_cache=False``, the return will be a tensor with shape of" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration:6 +msgid "" +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(decoder_output, cache)``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForConditionalGeneration.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM:1 +msgid "" +"Constructs BLenderbotSmall For Causal Language Model. This model is " +"equivalent to the blenderbotSmall decoder without cross-attention." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:18 +msgid "" +"It is a list, and each element in the list is a tuple( " +":code:`(incremental_cache, static_cache)` ). See " +"`paddle.nn.TransformerDecoder.gen_cache` for more details. It is only " +"used for inference and should be None for training. Default None." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:24 +msgid "" +"If ``use_cache=False``, the return will be a tensor with shape of " +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(lm_logits, cache)``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:27 +msgid "" +"[batch_size, seq_lens, hidden_size]. Otherwise, the return will be a " +"tuple of ``(lm_logits, cache)``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:35 +msgid "" +"import paddle from paddlenlp.transformers import " +"BlenderbotSmallTokenizer, BlenderbotSmallForCausalLM use_cache = False " +"text = \"My friends are cool but they eat too many carbs.\" model_name = " +"\"blenderbot_small-90M\" tokenizer = " +"BlenderbotSmallTokenizer.from_pretrained(model_name) model = " +"BlenderbotSmallForCausalLM.from_pretrained(model_name) model.eval() " +"inputs = tokenizer(text, return_attention_mask=True, " +"return_token_type_ids=False) inputs = {k: paddle.to_tensor([v]) for (k, " +"v) in inputs.items()}" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:47 +msgid "with paddle.no_grad():" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.forward:47 +msgid "" +"outputs = model(**inputs, use_cache=use_cache) # outputs is a tuple of " +"(lm_logits, cache) if ``use_cache=True``." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.modeling.BlenderbotSmallForCausalLM.prepare_inputs_for_generation:1 +msgid "" +"Prepare inputs for decoder to generate sentences. :returns: A dictionary " +"containing necessary inputs for generating next token. :rtype: dict" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.po new file mode 100644 index 0000000000000000000000000000000000000000..8ff26039ee79f9cbd33457d187133f662ca54165 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot_small.rst:2 +msgid "blenderbot\\_small" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..28b6a258b076c918880a2be844ae7d4029172789 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.blenderbot_small.tokenizer.po @@ -0,0 +1,116 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.blenderbot_small.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.gpt.tokenizer.GPTTokenizer`" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer:1 +msgid "Constructs a BlenderbotSmall tokenizer based on Byte-Pair-Encoding." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.GPTTokenizer`, which contains most of the" +" main methods. Please should refer to the superclass for more information" +" regarding methods. :param vocab_file: file path of the vocabulary :type " +"vocab_file: str :param merges_file: file path of the merges file. :type " +"merges_file: str :param errors: The method to handle errors in decoding " +":type errors: str :param max_len: The specified maximum sequence length. " +"Default: \"None\". :type max_len: int :param special_tokens: The " +"additional special tokens. Default: \"None\". :type special_tokens: dict " +":param bos_token: The special token for beginning of sequence token. " +"Default: \"__start__\". :type bos_token: str :param eos_token: The " +"special token for end of sequence token. Default: \"__end__\". :type " +"eos_token: str :param unk_token: The special token for unknown tokens. " +"Default: \"__unk__\" :type unk_token: str :param pad_token: The special " +"token for padding. Default: \"__null__\". :type pad_token: str :param " +"eol_token: The special token for newline. Default: \"__newln__\". :type " +"eol_token: str" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer:28 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.bpe:1 +msgid "" +"Apply Byte-Pair-Encoding on token. The process of bpe in BlenderbotSmall " +"is different from Blenderbot. 
:param token: The token to be converted. " +":type token: str" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.bpe +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_ids_to_string +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_tokens_to_string +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.bpe:6 +msgid "Converted token." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.bpe +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_ids_to_string +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_tokens_to_string +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. :param" +" tokens: A sequence of tokens. :type tokens: list[str]" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_ids_to_string:10 +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_tokens_to_string:5 +msgid "Converted string." +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_ids_to_string:1 +msgid "" +"Converts a sequence of ids (list of integers) to a single string. :param " +"ids: A sequence of ids corresponding to tokens. :type ids: list[int] " +":param skip_special_tokens: Whether to skip and not decode special tokens" +" when converting. Defaults to `False`. :type skip_special_tokens: bool, " +"optional :param clean_up_tokenization_spaces: Whether to Clean up a list " +"of simple English tokenization artifacts" +msgstr "" + +#: of +#: paddlenlp.transformers.blenderbot_small.tokenizer.BlenderbotSmallTokenizer.convert_ids_to_string:7 +msgid "like spaces before punctuations and abbreviated forms." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..29f90d47cd9c5e9fb337c9a1414daab6fa6bd301 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.modeling.po @@ -0,0 +1,634 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.chinesebert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining:1 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering:1 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification:1 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification:1 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:1 +msgid "基类::class:`paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:1 +msgid "The bare ChineseBert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `BChineseBertModel`. Also is the vocab" +" size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`ChineseBertModel`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." 
+msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:38 +msgid "The vocabulary size of `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:41 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:41 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:51 +msgid "" +"The non-linear activation function in the pooling layer. Defaults to " +"`\"tanh\"`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:54 +msgid "The epsilon of layernorm. Defaults to `1e-12`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:56 +msgid "The dim of glyph_embedding. Defaults to `1728`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel:59 +msgid "The length of pinyin map. Defaults to `32`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:1 +msgid "The ChineseBert forward method, overrides the `__call__()` special method." 
+msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:7 +msgid "" +"Indices of input sequence tokens pinyin. We apply a CNN model with width " +"2 on the pinyin sequence, followed by max-pooling to derive the resulting" +" pinyin embedding. This makes output dimensionality immune to the length " +"of the input pinyin sequence. The length of the input pinyin sequence is " +"fixed at 8. Its data type should be `int64` and it has a shape of " +"[batch_size, sequence_length, 8]. Defaults to `None`, which means we " +"don't add pinyin embeddings." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:14 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:14 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:19 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:20 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:22 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:25 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:29 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." 
+msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:38 +msgid "Whether to return the output of each hidden layers. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:42 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`). With the fields: - `sequence_output` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. - `pooled_output` (Tensor): The output of first token " +"(`[CLS]`) in sequence. We \"pool\" the model by simply taking the " +"hidden state corresponding to the first token. Its data type should " +"be float32 and its shape is [batch_size, hidden_size]. - " +"`encoder_outputs` (List(Tensor)): A list of Tensor containing hidden-" +"states of the model at each hidden layer in the Transformer encoder. " +"The length of the list is `num_hidden_layers`. Each Tensor has a data" +" type of float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:42 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`)." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:16 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:12 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:44 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:48 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:47 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:53 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:51 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:57 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:56 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and its shape" +" is [batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:24 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:19 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:19 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertModel.forward:62 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained ChineseBert models. It provides " +"ChineseBert related `model_config_file`, `pretrained_init_configuration`," +" `resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainedModel.init_weights:1 +msgid "Initialize the weights." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining:1 +msgid "ChineseBert Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining:3 +msgid "An instance of :class:`ChineseBertModel`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:1 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:3 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:5 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:7 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:9 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:7 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:7 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:9 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:11 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:3 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:5 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:7 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:9 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:11 +msgid "See :class:`ChineseBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:11 +msgid "See :class:`ChineseBertPretrainingHeads`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:14 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next" +" sentence prediction. Its data type should be float32 and its shape " +"is [batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:14 +msgid "Returns tuple (``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:21 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:19 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:24 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForPretraining.forward:24 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion:1 +msgid "" +"Vocabulary size of `inputs_ids` in `ChineseBertModel`. Defines the number" +" of different tokens that can be represented by the `inputs_ids` passed " +"when calling `ChineseBertBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:1 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:5 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:8 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"1]. Otherwise, its shape is [batch_size, mask_token_num, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:12 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:16 +msgid "" +"The scale of masked tokens. Used for the normalization of masked language" +" modeling loss. If it is a `Tensor`, its data type should be int64 and " +"its shape is equal to `prediction_scores`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertPretrainingCriterion.forward:20 +msgid "" +"The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean" +" of `next_sentence_loss`. Its data type should be float32 and its shape " +"is [1]." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification:1 +msgid "" +"ChineseBert Model with a linear layer on top of the output layer, " +"designed for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering:4 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification:4 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification:4 +msgid "An instance of ChineseBertModel." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification:6 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification:8 +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification:8 +msgid "" +"The dropout probability for output of ChineseBert. If None, use the same " +"value as `hidden_dropout_prob` of `ChineseBertModel` instance " +"`chinesebert`. 
Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:1 +msgid "" +"The ChineseBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForSequenceClassification.forward:14 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification:1 +msgid "" +"ChineseBert Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:1 +msgid "" +"The ChineseBertForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForTokenClassification.forward:14 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering:1 +msgid "" +"ChineseBert Model with a linear layer on top of the hidden-states output " +"to compute `span_start_logits` and `span_end_logits`, designed for " +"question-answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering:6 +msgid "" +"The dropout probability for output of ChineseBert. If None, use the same " +"value as `hidden_dropout_prob` of `ChineseBertModel` instance " +"`chinesebert`. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:1 +msgid "" +"The ChineseBertForQuestionAnswering forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.modeling.ChineseBertForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.po new file mode 100644 index 0000000000000000000000000000000000000000..0df25a22b97674b2a3993b96bbbe1f3f3fcc2652 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.chinesebert.rst:2 +msgid "chinesebert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..665864174807e676101c9f85f19ca71c5927e68c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.chinesebert.tokenizer.po @@ -0,0 +1,565 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.chinesebert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:1 +msgid "" +"Construct a ChineseBert tokenizer. `ChineseBertTokenizer` is similar to " +"`BertTokenizerr`. The difference between them is that ChineseBert has the" +" extra process about pinyin id. For more information regarding those " +"methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:5 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:8 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:11 +msgid "" +"A dict of pinyin map, the map between pinyin char and id. pinyin char is " +"26 Romanian characters and 0-5 numbers. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:14 +msgid "A dict of char id map tensor. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:17 +msgid "A dict of pinyin map tensor. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:20 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:24 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:27 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:30 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:33 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer:39 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports sequence or sequence pair as input, and batch input " +"is not allowed." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:5 +msgid "" +"The sequence to be processed. One sequence is a string, a list of " +"strings, or a list of integers depending on whether it has been " +"pretokenized and converted to ids." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:9 +msgid "" +"Same as `text` argument, while it represents for the latter sequence of " +"the sequence pair." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:10 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:12 +msgid "" +"If set to a number, will limit the total sequence returned so that it has" +" a maximum length. If there are overflowing tokens, those overflowing " +"tokens will be added to the returned dictionary when " +"`return_overflowing_tokens` is `True`. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:15 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:17 +msgid "" +"Only available for batch input of sequence pair and mainly for question " +"answering usage. When for QA, `text` represents questions and `text_pair`" +" represents contexts. If `stride` is set to a positive number, the " +"context will be split into multiple spans where `stride` defines the " +"number of (tokenized) tokens to skip from the start of one span to get " +"the next span, thus will produce a bigger batch than inputs to include " +"all spans. Moreover, 'overflow_to_sample' and 'offset_mapping' preserving" +" the original example and position information will be added to the " +"returned dictionary. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:25 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:27 +msgid "" +"If set to `True`, the returned sequences would be padded up to " +"`max_seq_len` specified length according to padding side " +"(`self.padding_side`) and padding token id. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:29 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:31 +msgid "" +"String selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"`max_seq_len` starting from the longest one at each token (when there is " +"a pair of input sequences). - 'only_first': Only truncate the first " +"sequence. - 'only_second': Only truncate the second sequence. - " +"'do_not_truncate': Do not truncate (raise an error if the input sequence " +"is longer than `max_seq_len`). Defaults to 'longest_first'." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:29 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:31 +msgid "String selected in the following options:" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:31 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:33 +msgid "'longest_first' (default) Iteratively reduce the inputs sequence" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:32 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:34 +msgid "" +"until the input is under `max_seq_len` starting from the longest one at " +"each token (when there is a pair of input sequences). - 'only_first': " +"Only truncate the first sequence. - 'only_second': Only truncate the " +"second sequence. - 'do_not_truncate': Do not truncate (raise an error if " +"the input sequence is longer than `max_seq_len`)." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:39 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:41 +msgid "Defaults to 'longest_first'." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:41 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:43 +msgid "" +"Whether to include tokens position ids in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:44 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:46 +msgid "" +"Whether to include token type ids in the returned dictionary. Defaults to" +" `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:47 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:49 +msgid "" +"Whether to include the attention mask in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:50 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:52 +msgid "" +"Whether to include the length of each encoded inputs in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:53 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:55 +msgid "" +"Whether to include overflowing token information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:56 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:58 +msgid "" +"Whether to include special tokens mask information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:62 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **pinyin_ids** (list[int]): " +"List of pinyin ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. 
Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:60 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:62 +msgid "The dict has the following optional items:" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:62 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:64 +msgid "**input_ids** (list[int]): List of token ids to be fed to a model." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:63 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:65 +msgid "**pinyin_ids** (list[int]): List of pinyin ids to be fed to a model." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:64 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:66 +msgid "" +"**position_ids** (list[int], optional): List of token position ids to be " +"fed to a model. Included when `return_position_ids` is `True`" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:66 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:68 +msgid "" +"**token_type_ids** (list[int], optional): List of token type ids to be " +"fed to a model. Included when `return_token_type_ids` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:68 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:70 +msgid "" +"**attention_mask** (list[int], optional): List of integers valued 0 or 1," +" where 0 specifies paddings and should not be attended to by the model. " +"Included when `return_attention_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:71 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:73 +msgid "" +"**seq_len** (int, optional): The input_ids length. Included when " +"`return_length` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:73 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:75 +msgid "" +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:76 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:78 +msgid "" +"**num_truncated_tokens** (int, optional): The number of overflowing " +"tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:79 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode:81 +msgid "" +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. 
Included when `return_special_tokens_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.encode +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports batch inputs of sequence or sequence pair." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:4 +msgid "" +"The element of list can be sequence or sequence pair, and the sequence is" +" a string or a list of strings depending on whether it has been " +"pretokenized. If each sequence is provided as a list of strings " +"(pretokenized), you must set `is_split_into_words` as `True` to " +"disambiguate with a sequence pair." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:60 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **pinyin_ids** (list[int]): " +"List of pinyin ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` " +"works. - **overflow_to_sample** (int, optional): Index of example from " +"which this feature is generated. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:82 +msgid "" +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.batch_encode:86 +msgid "" +"**overflow_to_sample** (int, optional): Index of example from which this " +"feature is generated. Included when `stride` works." 
+msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:1 +msgid "Truncates a sequence pair in place to the maximum length." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:3 +msgid "" +"list of tokenized input ids. Can be obtained from a string by chaining " +"the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:5 +msgid "" +"Optional second list of input ids. Can be obtained from a string by " +"chaining the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:7 +msgid "" +"The map of tokens and the start and end index of their start and end " +"character" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:9 +msgid "" +"The map of token pairs and the start and end index of their start and end" +" character" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:11 +msgid "number of tokens to remove using the truncation strategy" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:13 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len starting from the longest one at each token (when there " +"is a pair of input sequences). Overflowing tokens only contains " +"overflow from the first sequence. - 'only_first': Only truncate the first" +" sequence. raise an error if the first sequence is shorter or equal to " +"than num_tokens_to_remove. - 'only_second': Only truncate the second " +"sequence - 'do_not_truncate': Does not truncate (raise an error if the " +"input sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:13 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:15 +msgid "" +"starting from the longest one at each token (when there is a pair of " +"input sequences). Overflowing tokens only contains overflow from the " +"first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:17 +msgid "" +"'only_first': Only truncate the first sequence. raise an error if the " +"first sequence is shorter or equal to than num_tokens_to_remove." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:18 +msgid "'only_second': Only truncate the second sequence" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:19 +msgid "" +"'do_not_truncate': Does not truncate (raise an error if the input " +"sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.truncate_sequences:20 +msgid "" +"If set to a number along with max_seq_len, the overflowing tokens " +"returned will contain some tokens from the main sequence returned. 
The " +"value of this argument defines the number of additional tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map:1 +msgid "Get the map of pinyin locations and pinyin tensor." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids:3 +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map:3 +msgid "The sequence to be processed." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.pinyin_locs_map:6 +msgid "the map of pinyin locations and pinyin tensor." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids:1 +msgid "Find chinese character location, and generate pinyin ids." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids:5 +msgid "" +"Same as `text` argument, while it represents for the latter sequence of " +"the sequence pair. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids:8 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.chinesebert.tokenizer.ChineseBertTokenizer.get_pinyin_ids:12 +msgid "The list of pinyin id tensor." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..eba96010f926fbbbe6b6a9b31567bfdc12f56752 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.modeling.po @@ -0,0 +1,703 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.convbert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel:1 +msgid "基类::class:`paddlenlp.transformers.convbert.modeling.ConvBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:1 +msgid "The bare ConvBert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." 
+msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator +#: paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertModel +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ConvBertModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`ConvBertModel`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:13 +msgid "Dimensionality of the embedding layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:15 +msgid "Dimensionality of the encoder layer and pooler layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:17 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:19 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:22 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:27 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:31 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." 
+msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:34 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:37 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:40 +msgid "The vocabulary size of `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:43 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`ConvBertPretrainedModel.init_weights()` for" +" how weights are initialized in `ConvBertModel`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:43 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:47 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ConvBertPretrainedModel.init_weights()` for how weights are " +"initialized in `ConvBertModel`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:50 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:53 +msgid "The size of the convolutional kernel. Defaults to `9`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:56 +msgid "Ratio gamma to reduce the number of attention heads. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel:59 +msgid "" +"The number of groups for grouped linear layers for ConvBert model. " +"Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:1 +msgid "" +"The ConvBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." 
+msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:12 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:12 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:13 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:13 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:15 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:15 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:18 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:18 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:22 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:22 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1. - **1** for tokens that **not " +"masked**, - **0** for tokens that **masked**. It is a tensor with shape " +"broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:22 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:22 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1." 
+msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:27 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:27 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:27 +msgid "**1** for tokens that **not masked**," +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:28 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:28 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:28 +msgid "**0** for tokens that **masked**." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:30 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:30 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:30 +msgid "" +"It is a tensor with shape broadcasted to `[batch_size, " +"num_attention_heads, sequence_length, sequence_length]`. Defaults to " +"`None`, which means nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:34 +msgid "" +"Returns Tensor `sequence_output`, sequence of hidden-states at the last " +"layer of the model. Shape as `[batch_size, sequence_length, hidden_size]`" +" and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:39 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:17 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:26 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:17 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:17 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:39 +#: paddlenlp.transformers.convbert.modeling.ConvBertModel.forward:39 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained ConvBert models. 
It provides ConvBert " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainedModel.init_weights:1 +msgid "Initializes and tie weights if needed." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainedModel.tie_weights:1 +msgid "Tie the weights between the input embeddings and the output embeddings." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining:1 +msgid "" +"Combine generator with discriminator for Replaced Token Detection (RTD) " +"pretraining." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.get_discriminator_inputs:1 +msgid "Sample from the generator to create discriminator input." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:9 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:9 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:5 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:9 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:5 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:28 +msgid "See :class:`ConvBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:9 +msgid "" +"The raw input_ids. Its data type should be `int64` and it has a shape of " +"[batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:11 +msgid "" +"The generator labels. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:14 +msgid "" +"Returns tuple (``gen_logits``, ``disc_logits``, ``disc_labels``, " +"``attention_mask``). With the fields: - `gen_logits` (Tensor): a " +"tensor of the generator prediction logits. Shape as `[batch_size, " +"sequence_length, vocab_size]` and dtype as float32. - `disc_logits` " +"(Tensor): a tensor of the discriminator prediction logits. Shape as " +"`[batch_size, sequence_length]` and dtype as float32. 
- `disc_labels` " +"(Tensor): a tensor of the discriminator prediction labels. Shape as " +"`[batch_size, sequence_length]` and dtype as int64. - `attention_mask` " +"(Tensor): See :class:`ConvBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:14 +msgid "" +"Returns tuple (``gen_logits``, ``disc_logits``, ``disc_labels``, " +"``attention_mask``)." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:14 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:16 +msgid "With the fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:19 +msgid "`gen_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:19 +msgid "" +"a tensor of the generator prediction logits. Shape as `[batch_size, " +"sequence_length, vocab_size]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:22 +msgid "`disc_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:22 +msgid "" +"a tensor of the discriminator prediction logits. Shape as `[batch_size, " +"sequence_length]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:25 +msgid "`disc_labels` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:25 +msgid "" +"a tensor of the discriminator prediction labels. Shape as `[batch_size, " +"sequence_length]` and dtype as int64." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTotalPretraining.forward:27 +msgid "`attention_mask` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator:1 +msgid "ConvBert Model with a discriminator prediction head on top." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice:4 +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering:4 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification:4 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification:4 +#: paddlenlp.transformers.convbert.modeling.ConvBertGenerator:3 +msgid "An instance of ConvBertModel." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:1 +msgid "" +"The ConvBertDiscriminator forward method, overrides the `__call__()` " +"special method." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertDiscriminator.forward:34 +msgid "" +"Returns tensor `logits`, a tensor of the discriminator prediction logits." +" Shape as `[batch_size, sequence_length]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertGenerator:1 +msgid "ConvBert Model with a generator prediction head on top." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:1 +msgid "" +"The ConvBertGenerator forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertGenerator.forward:34 +msgid "" +"Returns tensor `prediction_scores`, a tensor of the generator prediction " +"scores. 
Shape as `[batch_size, sequence_length, vocab_size]` and dtype as" +" float32." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead:1 +msgid "ConvBert head for sentence-level classification tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead.forward:1 +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead.forward:4 +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertClassificationHead.forward:6 +#: paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification:1 +msgid "" +"ConvBert Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification:6 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice:8 +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification:8 +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification:8 +msgid "" +"The dropout probability for output of ConvBert. If None, use the same " +"value as `hidden_dropout_prob` of `ConvBertModel` instance `convbert`. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:1 +msgid "" +"The ConvBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification:1 +msgid "" +"ConvBert Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:1 +msgid "" +"The ConvBertForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion:1 +msgid "" +"Vocabulary size of `inputs_ids` in `ConvBertModel`. 
Defines the number of" +" different tokens that can be represented by the `inputs_ids` passed when" +" calling `ConvBertModel`." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion:4 +msgid "This is the generator weight." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertPretrainingCriterion:6 +msgid "This is the discriminator weight." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering:1 +msgid "" +"ConvBert Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:1 +msgid "" +"The ConvBertForQuestionAnswering forward method, overrides the __call__()" +" special method." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:12 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:18 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:21 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForQuestionAnswering.forward:21 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice:1 +msgid "" +"ConvBert Model with a linear layer on top of the hidden-states output " +"layer, designed for multiple choice tasks like RocStories/SWAG tasks ." +msgstr "" + +#: of paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:1 +msgid "" +"The ConvBertForMultipleChoice forward method, overrides the __call__() " +"special method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:3 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:5 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:7 +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:9 +msgid "" +"See :class:`ConvBertModel` and shape as [batch_size,num_choice, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.convbert.modeling.ConvBertForMultipleChoice.forward:12 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.po new file mode 100644 index 0000000000000000000000000000000000000000..2e9252d1894316c8a413cf66b9ca6debdd33c4ad --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.convbert.rst:2 +msgid "convbert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..7bef9ac5e1c3a473e4ba03d0af0e4cb1eac81d4b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convbert.tokenizer.po @@ -0,0 +1,34 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.convbert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.convbert.tokenizer.ConvBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.electra.tokenizer.ElectraTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.convbert.tokenizer.ConvBertTokenizer:1 +msgid "" +"Construct a ConvBERT tokenizer. `ConvBertTokenizer` is identical to " +"`ElectraTokenizer`. For more information regarding those methods, please " +"refer to this superclass." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convert_slow_tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convert_slow_tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..56a66eef978591fe5eea4e5cf52ee5901091dd00 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.convert_slow_tokenizer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.convert_slow_tokenizer.rst:2 +msgid "convert\\_slow\\_tokenizer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..85b35cd74795ca63738910a79ff8d0447b3dd04c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.modeling.po @@ -0,0 +1,517 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ctrl.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification:1 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel:1 +#: paddlenlp.transformers.ctrl.modeling.CTRLModel:1 +msgid "基类::class:`paddlenlp.transformers.ctrl.modeling.CTRLPreTrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:1 +msgid "" +"The bare CTRL Model transformer outputting raw hidden-states without any " +"specific head on top." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLModel +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward +#: paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `CTRLModel`. Also is the vocab size of" +" token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `CTRLModel`. " +"Defaults to `246534`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:14 +msgid "" +"The maximum sequence length that this model might ever be used with. " +"Typically set this to something large just in case (e.g., 512 or 1024 or " +"2048 or 50000). Defaults to `50000`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:17 +msgid "Dimensionality of the embeddings and hidden states. Defaults to `1280`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:20 +msgid "" +"Dimensionality of the inner dimension of the feed forward networks (FFN)." +" Defaults to `8192`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:23 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `48`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:26 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:29 +msgid "" +"The dropout ratio for all fully connected layers in the encoder. Defaults" +" to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:32 +msgid "The dropout ratio for the embeddings. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:35 +msgid "The epsilon to use in the layer normalization layers. Defaults to `1e-6`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:38 +msgid "" +"Whether the model's input and output word embeddings should be tied. Note" +" that this is only relevant if the model has a output word embedding " +"layer. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:41 +msgid "The id of the `padding` token. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:44 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`CTRLPreTrainedModel._init_weights()` for " +"how weights are initialized in `CTRLModel`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:44 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel:48 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`CTRLPreTrainedModel._init_weights()` for how weights are " +"initialized in `CTRLModel`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:1 +msgid "The CTRLModel forward method, overrides the `__call__()` special method." 
+msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:7 +msgid "" +"Contains pre-computed hidden-states (key and values in the attention " +"blocks) as computed by the model. Can be used to speed up sequential " +"decoding. The `input_ids` which have their past given to this model " +"should not be passed as input ids as they have already been computed. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:13 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `0.0` values and the others have `1.0` values. It is" +" a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:23 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range `[0, type_vocab_size - 1]`. If `type_vocab_size` is" +" 2, which means the inputs have two portions. Indices can either be 0 or " +"1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:23 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range `[0, type_vocab_size - 1]`. If `type_vocab_size` is" +" 2, which means the inputs have two portions. Indices can either be 0 or " +"1:" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:28 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:29 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:31 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:34 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range `[0, max_position_embeddings - 1]`. " +"Shape as [batch_size, num_tokens] and dtype as int64. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:38 +msgid "" +"Whether or not to use cache. Defaults to `False`. If set to `True`, key " +"value states will be returned and can be used to speed up decoding." 
+msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:41 +msgid "" +"Whether or not to return the attentions tensors of all attention layers. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:44 +msgid "" +"Whether or not to return the output of all hidden layers. Defaults to " +"`False`." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:48 +msgid "" +"Returns tuple (`last_hidden_state`, `caches`, `hidden_states`, " +"`attentions`) With the fields: - `last_hidden_state` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. - `caches` (tuple(tuple(Tensor), optional): returned " +"when `use_cache=True` is passed. Tuple of `tuple(Tensor)` of length " +"`num_hidden_layers`, with each tuple having 2 tensors of shape " +"[batch_size, num_heads, sequence_length, embed_size_per_head] and float32" +" dtype. - `hidden_states` (tuple(Tensor), optional): returned when " +"`output_hidden_states=True` is passed. Tuple of `Tensor` (one for the" +" output of the embeddings + one for the output of each layer). Each " +"Tensor has a data type of float32 and its shape is [batch_size, " +"sequence_length, hidden_size]. - `attentions` (tuple(Tensor), optional):" +" returned when `output_attentions=True` is passed. Tuple of " +"`Tensor` (one for each layer) of shape. Each Tensor has a data type of" +" float32 and its shape is [batch_size, num_heads, sequence_length, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:48 +msgid "" +"Returns tuple (`last_hidden_state`, `caches`, `hidden_states`, " +"`attentions`)" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:50 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:54 +msgid "`last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:53 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:38 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:39 +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:59 +msgid "`caches` (tuple(tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:57 +msgid "" +"returned when `use_cache=True` is passed. Tuple of `tuple(Tensor)` of " +"length `num_hidden_layers`, with each tuple having 2 tensors of shape " +"[batch_size, num_heads, sequence_length, embed_size_per_head] and float32" +" dtype." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:41 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:42 +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:65 +msgid "`hidden_states` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:62 +msgid "" +"returned when `output_hidden_states=True` is passed. 
Tuple of `Tensor` " +"(one for the output of the embeddings + one for the output of each " +"layer). Each Tensor has a data type of float32 and its shape is " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:43 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:44 +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:69 +msgid "`attentions` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:68 +msgid "" +"returned when `output_attentions=True` is passed. Tuple of `Tensor` (one " +"for each layer) of shape. Each Tensor has a data type of float32 and its " +"shape is [batch_size, num_heads, sequence_length, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:48 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:49 +#: paddlenlp.transformers.ctrl.modeling.CTRLModel.forward:74 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel:1 +msgid "" +"The CTRL Model transformer with a language modeling head on top (linear " +"layer with weights tied to the input embeddings)." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification:8 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel:4 +msgid "An instance of :class:`CTRLModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:1 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:3 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:5 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:7 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:9 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:17 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:19 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:21 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:38 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:41 +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:44 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:1 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:3 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:5 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:7 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:9 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:17 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:19 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:21 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:39 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:42 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:45 +msgid "See :class:`CTRLModel`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:11 +msgid "" +"Labels for language modeling. 
Note that the labels **are shifted** inside" +" the model, i.e. you can set `labels = input_ids` Indices are selected in" +" `[-100, 0, ..., vocab_size]` All labels set to `-100` are ignored " +"(masked), the loss is only computed for labels in `[0, ..., vocab_size]`." +" Shape is [batch_size, sequence_length] and dtype is int64." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:24 +msgid "" +"Returns tuple `(loss, logits, caches, hidden_states, attentions)`. With " +"the fields: - `loss` (Tensor): returned when `labels` is provided." +" Language modeling loss (for next-token prediction). It's data " +"type should be float32 and its shape is [1,]. - `logits` (Tensor): " +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be " +"float32 and its shape is [batch_size, sequence_length, vocab_size]. " +"- `caches` (tuple(tuple(Tensor), optional): See :class:`CTRLModel`. " +"- `hidden_states` (tuple(Tensor), optional): See :class:`CTRLModel`." +" - `attentions` (tuple(Tensor), optional): See :class:`CTRLModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:24 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:24 +msgid "" +"Returns tuple `(loss, logits, caches, hidden_states, attentions)`. With " +"the fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:30 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:30 +msgid "`loss` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:28 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:28 +msgid "" +"returned when `labels` is provided. Language modeling loss (for next-" +"token prediction). It's data type should be float32 and its shape is " +"[1,]." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:35 +#: paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:36 +msgid "`logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLLMHeadModel.forward:33 +msgid "" +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 and " +"its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification:1 +msgid "" +"The CTRL Model transformer with a sequence classification head on top " +"(linear layer). `CTRLForSequenceClassification` uses the last token in " +"order to do the classification, as other causal models (e.g. GPT-2) do. " +"Since it does classification on the last token, it requires to know the " +"position of the last token. If a `pad_token_id` is defined in the " +"configuration, it finds the last token that is not a padding token in " +"each row. If no `pad_token_id` is defined, it simply takes the last value" +" in each row of the batch." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification:10 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification:12 +msgid "" +"The dropout probability for output of CTRL. If None, use the same value " +"as `hidden_dropout_prob` of `CTRLModel` instance `ctrl`. Defaults to " +"None." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:11 +msgid "" +"Labels for computing the sequence classification/regression loss. Indices" +" should be in `[0, ...,num_classes - 1]`. If `num_classes == 1` a " +"regression loss is computed (Mean-Square loss), If `num_classes > 1` a " +"classification loss is computed (Cross-Entropy). Shape is [batch_size,] " +"and dtype is int64." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:24 +msgid "" +"Returns tuple `(loss, logits, caches, hidden_states, attentions)`. With " +"the fields: - `loss` (Tensor): returned when `labels` is provided." +" Language modeling loss (for next-token prediction). It's data " +"type should be float32 and its shape is [1,]. - `logits` (Tensor): " +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be " +"float32 and its shape is [batch_size, num_classes]. - `caches` " +"(tuple(tuple(Tensor), optional): See :class:`CTRLModel`. - " +"`hidden_states` (tuple(Tensor), optional): See :class:`CTRLModel`. -" +" `attentions` (tuple(Tensor), optional): See :class:`CTRLModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.CTRLForSequenceClassification.forward:33 +msgid "" +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 and " +"its shape is [batch_size, num_classes]." +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding:1 +msgid "基类::class:`paddle.nn.layer.common.Embedding`" +msgstr "" + +#: of paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding:1 +msgid "This module produces sinusoidal positional embeddings of any length." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.modeling.SinusoidalPositionalEmbedding.forward:6 +msgid "unpacked dict arguments" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.po new file mode 100644 index 0000000000000000000000000000000000000000..a2dc827a0ccb9b40788424d57f6c2f53b8096750 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.ctrl.rst:2
+msgid "ctrl"
+msgstr ""
+
diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.tokenizer.po
new file mode 100644
index 0000000000000000000000000000000000000000..416dc598a80b33852c8d648af68c07b274234da0
--- /dev/null
+++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ctrl.tokenizer.po
@@ -0,0 +1,170 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2021, PaddleNLP
+# This file is distributed under the same license as the PaddleNLP package.
+# FIRST AUTHOR , 2022.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.ctrl.tokenizer.rst:2
+msgid "tokenizer"
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:1
+msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`"
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:1
+msgid "Constructs a CTRL tokenizer based on byte-level Byte-Pair-Encoding."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:3
+msgid ""
+"This tokenizer inherits from "
+":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` "
+"which contains most of the main methods. For more information regarding "
+"those methods, please refer to this superclass."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer
+#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens
+#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids
+#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string
+#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.save_resources
+#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize
+msgid "参数"
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:7
+msgid ""
+"Path to the vocab file. The vocab file contains a mapping from vocabulary"
+" strings to indices."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:10
+msgid ""
+"Path to the merge file. The merge file is used to split the input "
+"sentence into \"subword\" units. The vocab file is then used to encode "
+"those units as indices."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:14
+msgid "The maximum value of the input sequence length. Defaults to `None`."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer:17
+msgid ""
+"A special token representing the *unknown (out-of-vocabulary)* token. An "
+"unknown token is set to be `unk_token` in order to be converted to an ID. "
+"Defaults to \"\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize:1
+msgid "Converts a string to a list of tokens."
+msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize:3 +msgid "The text to be tokenized." +msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize:6 +msgid "A list of string representing converted tokens." +msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens:14 +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids:11 +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string:10 +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.tokenize:10 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (list of string) to a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string:3 +msgid "A sequence of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_string:6 +msgid "Converted string." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids:1 +msgid "" +"Converts a single token or a sequence of tokens to an index or a sequence" +" of indices using the vocab." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids:4 +msgid "A single token or a sequence of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_tokens_to_ids:7 +msgid "The converted token id or token ids." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens:1 +msgid "" +"Converts an index or a sequence indices to a single token or a sequence " +"of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens:4 +msgid "The token id (or token ids) to be converted to text." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens:6 +msgid "" +"Whether or not to skip the special tokens. Defaults to `False`, which " +"means we don't skip the special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.convert_ids_to_tokens:10 +msgid "The converted token or the sequence of tokens." +msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.save_resources:1 +msgid "Save tokenizer related resources to files under `save_directory`." +msgstr "" + +#: of paddlenlp.transformers.ctrl.tokenizer.CTRLTokenizer.save_resources:3 +msgid "Directory to save files into." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..2a483df03b586de5f8759c52a1215e71470585e2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.modeling.po @@ -0,0 +1,389 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.distilbert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM:1 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering:1 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification:1 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification:1 +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel:1 +msgid "基类::class:`paddlenlp.transformers.distilbert.modeling.DistilBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:1 +msgid "The bare DistilBert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `DistilBertModel`. Defines the number " +"of different tokens that can be represented by the `inputs_ids` passed " +"when calling `DistilBertModel`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and the pooler " +"layer. Defaults to `768`." 
+msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:38 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:41 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`DistilBertPretrainedModel.init_weights()` " +"for how weights are initialized in `DistilBertModel`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:41 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`DistilBertPretrainedModel.init_weights()` for how weights are" +" initialized in `DistilBertModel`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward:1 +msgid "" +"The DistilBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward:19 +msgid "" +"Returns tensor `encoder_output`, which means the sequence of hidden-" +"states at the last layer of the model. Its data type should be float32 " +"and its shape is [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward:13 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:22 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward:13 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward:13 +#: paddlenlp.transformers.distilbert.modeling.DistilBertModel.forward:24 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained DistilBert models. It provides " +"DistilBert related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." 
+msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification:1 +msgid "" +"DistilBert Model with a linear layer on top of the output layer, designed" +" for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM:3 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering:4 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification:4 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification:4 +msgid "An instance of DistilBertModel." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification:6 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering:6 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification:8 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification:8 +msgid "" +"The dropout probability for output of DistilBert. If None, use the same " +"value as `hidden_dropout_prob` of `DistilBertModel` instance " +"`distilbert`. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward:1 +msgid "" +"The DistilBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward:3 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward:5 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward:3 +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward:5 +msgid "See :class:`DistilBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForSequenceClassification.forward:8 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification:1 +msgid "" +"DistilBert Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward:1 +msgid "" +"The DistilBertForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForTokenClassification.forward:8 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering:1 +msgid "" +"DistilBert Model with a linear layer on top of the hidden-states output " +"to compute `span_start_logits` and `span_end_logits`, designed for " +"question-answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:1 +msgid "" +"The DistilBertForQuestionAnswering forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:8 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"start_logits(Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" end_logits(Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:8 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:10 +msgid "With the fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:14 +msgid "start_logits(Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:13 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:17 +msgid "end_logits(Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM:1 +msgid "DistilBert Model with a `language modeling` head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward:1 +msgid "" +"The DistilBertForMaskedLM forward method, overrides the `__call__()` " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.distilbert.modeling.DistilBertForMaskedLM.forward:8 +msgid "" +"Returns tensor `prediction_logits`, the scores of masked token " +"prediction. Its data type should be float32 and its shape is [batch_size," +" sequence_length, vocab_size]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.po new file mode 100644 index 0000000000000000000000000000000000000000..6036cb24084d21512874e2ce357c9836d1e226a1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.distilbert.rst:2
+msgid "distilbert"
+msgstr ""
+
diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.tokenizer.po
new file mode 100644
index 0000000000000000000000000000000000000000..f3e52ce990a7457f464bcb8b08d83099dc28a8b9
--- /dev/null
+++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distilbert.tokenizer.po
@@ -0,0 +1,36 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2021, PaddleNLP
+# This file is distributed under the same license as the PaddleNLP package.
+# FIRST AUTHOR , 2022.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.distilbert.tokenizer.rst:2
+msgid "tokenizer"
+msgstr ""
+
+#: of paddlenlp.transformers.distilbert.tokenizer.DistilBertTokenizer:1
+msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`"
+msgstr ""
+
+#: of paddlenlp.transformers.distilbert.tokenizer.DistilBertTokenizer:1
+msgid ""
+"Constructs a DistilBert tokenizer. The usage of DistilBertTokenizer is "
+"the same as `BertTokenizer "
+"`__."
+" For more information regarding those methods, please refer to this "
+"superclass."
+msgstr ""
+
diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distill_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distill_utils.po
new file mode 100644
index 0000000000000000000000000000000000000000..4d59ea3942031813e6cb2dcf56deec2a18fee9c3
--- /dev/null
+++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.distill_utils.po
@@ -0,0 +1,116 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2021, PaddleNLP
+# This file is distributed under the same license as the PaddleNLP package.
+# FIRST AUTHOR , 2022.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.distill_utils.rst:2
+msgid "distill\_utils"
+msgstr ""
+
+#: of paddlenlp.transformers.distill_utils.to_distill:1
+msgid ""
+"Can be bound to an object with transformer encoder layers, making the "
+"model expose the attributes `outputs.q`, `outputs.k`, `outputs.v`, "
+"`outputs.scaled_qks`, `outputs.hidden_states` and `outputs.attentions` of "
+"the object for distillation. These intermediate tensors can then be "
+"returned for use in the MiniLM and TinyBERT strategies."
+msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_minilm_loss:1 +msgid "" +"Calculates loss for Q-Q, K-K, V-V relation from MiniLMv2. :param " +"loss_fct: Loss function for distillation. It only supports kl_div loss " +"now. :type loss_fct: callable :param s: Q, K, V of Student. :type s: " +"Tensor :param t: Q, K, V of teacher. :type t: Tensor :param attn_mask: " +"Attention mask for relation. :type attn_mask: Tensor :param " +"num_relation_heads: The number of relation heads. 0 means " +"`num_relation_heads` equals" +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_minilm_loss:11 +msgid "to origin head num. Defaults to 0." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_minilm_loss +#: paddlenlp.transformers.distill_utils.calc_multi_relation_loss +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_minilm_loss:15 +msgid "MiniLM loss value." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_minilm_loss +#: paddlenlp.transformers.distill_utils.calc_multi_relation_loss +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:1 +msgid "" +"Calculates loss for multiple Q-Q, K-K and V-V relation. It supports head-" +"head relation, sample-sample relation and origin token-token relation. " +"The final loss value could be balanced by weight `alpha` and `beta`." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:5 +msgid "Loss function for distillation. It only supports kl_div loss now." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:7 +msgid "Q, K, V of Student." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:9 +msgid "Q, K, V of teacher." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:11 +msgid "Attention mask for relation." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:13 +msgid "" +"The number of relation heads. 0 means `num_relation_heads` equals to " +"origin head num. Defaults to 0." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:17 +msgid "The weight for head-head relation. Defaults to 0.0." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:20 +msgid "The weight for sample-sample relation. Defaults to 0.0." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:24 +msgid "" +"Weighted loss of token-token loss, head-head loss and sample-sample " +"loss." +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:26 +msgid "Weighted loss of token-token loss, head-head loss and" +msgstr "" + +#: of paddlenlp.transformers.distill_utils.calc_multi_relation_loss:27 +msgid "sample-sample loss." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..821939bccd2221d46d3ef288d4bfde24a03dbf3d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.modeling.po @@ -0,0 +1,905 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.electra.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator:1 +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice:1 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering:1 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:1 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification:1 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining:1 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator:1 +#: paddlenlp.transformers.electra.modeling.ElectraModel:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:1 +msgid "基类::class:`paddlenlp.transformers.electra.modeling.ElectraPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:1 +msgid "The bare Electra Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead +#: paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ElectraGenerator +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward +#: paddlenlp.transformers.electra.modeling.ElectraModel +#: paddlenlp.transformers.electra.modeling.ElectraModel.forward +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ElectraModel`. Also is the vocab size" +" of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling " +"`ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:3 +#: paddlenlp.transformers.electra.modeling.ElectraModel:13 +msgid "Dimensionality of the embedding layer." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:15 +msgid "Dimensionality of the encoder layer and pooler layer." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:17 +msgid "Number of hidden layers in the Transformer encoder." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:19 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:21 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:31 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:33 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:36 +msgid "The vocabulary size of `token_type_ids`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:38 +msgid "" +"The standard deviation of the normal initializer. .. note:: A " +"normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ElectraPretrainedModel.init_weights()` for how weights " +"are initialized in `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:38 +msgid "The standard deviation of the normal initializer." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:41 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ElectraPretrainedModel.init_weights()` for how weights are " +"initialized in `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel:44 +msgid "The index of padding token in the token vocabulary." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:1 +msgid "" +"The ElectraModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward +#: paddlenlp.transformers.electra.modeling.ElectraModel.forward +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.sample_negatives_from_softmax +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraModel.forward:32 +msgid "" +"Returns tensor `encoder_outputs`, which is the output at the last layer " +"of the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward +#: paddlenlp.transformers.electra.modeling.ElectraModel.forward +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.sample_negatives_from_softmax +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:16 +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:17 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:26 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:17 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:17 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:15 +#: paddlenlp.transformers.electra.modeling.ElectraModel.forward:37 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraPretrainedModel:1 +msgid "" +"An abstract class for pretrained Electra models. It provides Electra " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainedModel.init_weights:1 +msgid "Initializes and tie weights if needed." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainedModel.tie_weights:1 +msgid "Tie the weights between the input embeddings and the output embeddings." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining:1 +msgid "Electra Model for pretraining tasks." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining:3 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining:3 +msgid "An instance of :class:`ElectraGenerator`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining:5 +msgid "An instance of :class:`ElectraDiscriminator`." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:1 +msgid "" +"The ElectraForPretraining forward method, overrides the __call__() " +"special method." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:1 +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:9 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:9 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:9 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:9 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:1 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:14 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:6 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:8 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:10 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:3 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:5 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:7 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:9 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:14 +msgid "See :class:`ElectraModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:11 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:11 +msgid "" +"Raw inputs used to get discriminator labels. Its data type should be " +"`int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:14 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:14 +msgid "" +"Labels to compute the discriminator inputs. Its data type should be int64" +" and its shape is [batch_size, sequence_length]. The value for unmasked " +"tokens should be -100 and value for masked tokens should be 0." 
+msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:19 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:19 +msgid "" +"Returns tuple (generator_logits, disc_logits, disc_labels, " +"attention_mask). With the fields: - `generator_logits` (Tensor): " +"The scores of Electra Generator. Its data type should be int64 and " +"its shape is [batch_size, sequence_length, vocab_size]. - `disc_logits` " +"(Tensor): The prediction result of replaced tokens. Its data " +"type should be float32 and if batch_size>1, its shape is [batch_size, " +"sequence_length], if batch_size=1, its shape is [sequence_length]. -" +" `disc_labels` (Tensor): The labels of electra discriminator. Its " +"data type should be int32, and its shape is [batch_size, " +"sequence_length]. - `attention_mask` (Tensor): See " +":class:`ElectraModel`. Its data type should be bool." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:19 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:19 +msgid "" +"Returns tuple (generator_logits, disc_logits, disc_labels, " +"attention_mask)." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:14 +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:21 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:21 +msgid "With the fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:25 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:25 +msgid "`generator_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:24 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:24 +msgid "" +"The scores of Electra Generator. Its data type should be int64 and its " +"shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:30 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:30 +msgid "`disc_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:28 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:28 +msgid "" +"The prediction result of replaced tokens. Its data type should be " +"float32 and if batch_size>1, its shape is [batch_size, sequence_length], " +"if batch_size=1, its shape is [sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:34 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:34 +msgid "`disc_labels` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:33 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:33 +msgid "" +"The labels of electra discriminator. Its data type should be int32, and " +"its shape is [batch_size, sequence_length]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:36 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:36 +msgid "`attention_mask` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining.forward:37 +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.forward:37 +msgid "See :class:`ElectraModel`. Its data type should be bool." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator:1 +msgid "" +"The Electra Discriminator can detect the tokens that are replaced by the " +"Electra Generator." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator:3 +#: paddlenlp.transformers.electra.modeling.ElectraGenerator:4 +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:6 +msgid "An instance of :class:`ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraDiscriminator.forward:10 +msgid "" +"Returns tensor `logits`, the prediction result of replaced tokens. Its " +"data type should be float32 and if batch_size>1, its shape is " +"[batch_size, sequence_length], if batch_size=1, its shape is " +"[sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraGenerator:1 +msgid "" +"The Electra Generator will replace some tokens of the given sequence, it " +"is trained as a masked language model." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraGenerator.forward:10 +msgid "" +"Returns tensor `prediction_scores`, the scores of Electra Generator. Its " +"data type should be int64 and its shape is [batch_size, sequence_length, " +"vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:1 +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:1 +msgid "Perform sentence-level classification tasks." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:5 +msgid "The dropout probability for all fully connected layers." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:7 +msgid "The number of classes." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraClassificationHead:9 +msgid "The activation function name between layers." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward:1 +msgid "" +"The ElectraClassificationHead forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward:3 +msgid "" +"Input sequence, usually the `sequence_output` of electra model. Its data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraClassificationHead.forward:7 +msgid "" +"Returns a tensor of the input text classification logits. Shape as " +"`[batch_size, num_classes]` and dtype as float32." 
+msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:1 +msgid "" +"Electra Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice:4 +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering:4 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:4 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification:4 +msgid "An instance of ElectraModel." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:6 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice:8 +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:8 +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification:8 +msgid "" +"The dropout probability for output of Electra. If None, use the same " +"value as `hidden_dropout_prob` of `ElectraModel` instance `electra`. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:12 +msgid "The activation function name for classifier. Defaults to \"gelu\"." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification:15 +msgid "The epsilon to initialize nn.LayerNorm layers. Defaults to 1e-12." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:1 +msgid "" +"The ElectraForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForTokenClassification:1 +msgid "" +"Electra Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:1 +msgid "" +"The ElectraForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion:1 +msgid "" +"Vocabulary size of `inputs_ids` in `ElectraModel`. Defines the number of " +"different tokens that can be represented by the `inputs_ids` passed when " +"calling `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion:4 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion:4 +msgid "The weight of the Electra Generator." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion:6 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion:6 +msgid "The weight of the Electra Discriminator." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:1 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:1 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"and its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:4 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:7 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"and its shape is [batch_size, sequence_length] or [sequence length] if " +"batch_size=1." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:7 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:4 +msgid "" +"The labels of the generator, its dimensionality is equal to " +"`generator_prediction_scores`. Its data type should be int64 and its " +"shape is [batch_size, sequence_size, 1]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:10 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:10 +msgid "" +"The labels of the discriminator, its dimensionality is equal to " +"`discriminator_prediction_scores`. The labels should be numbers between 0" +" and 1. Its data type should be float32 and its shape is [batch_size, " +"sequence_size] or [sequence length] if batch_size=1." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraPretrainingCriterion.forward:17 +#: paddlenlp.transformers.electra.modeling.ErnieHealthPretrainingCriterion.forward:17 +msgid "" +"The pretraining loss, equals to weighted generator loss plus the weighted" +" discriminator loss. Its data type should be float32 and its shape is " +"[1]." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice:1 +msgid "" +"Electra Model with a linear layer on top of the hidden-states output " +"layer, designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:1 +msgid "" +"The ElectraForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:3 +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:5 +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:7 +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:9 +msgid "" +"See :class:`ElectraModel` and shape as [batch_size, num_choice, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForMultipleChoice.forward:12 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering:1 +msgid "" +"Electra Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:1 +msgid "" +"The ElectraForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:12 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:18 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:21 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ElectraForQuestionAnswering.forward:21 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining:1 +msgid "基类::class:`paddlenlp.transformers.electra.modeling.ElectraForTotalPretraining`" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining:1 +msgid "ERNIE-Health Model for pretraining task." +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining:5 +msgid "" +"class:`ErnieHealthDiscriminator): An instance of " +":class:`ErnieHealthDiscriminator`." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.sample_negatives_from_softmax:1 +msgid "Sample K=5 non-original negative samples for candidate set." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthForTotalPretraining.sample_negatives_from_softmax:3 +msgid "" +"Returns tensor `neg_samples_ids`, a tensor of the negative samples of " +"original inputs. Shape as ` [batch_size, sequence_length, K, vocab_size]`" +" and dtype as `int64`." 
+msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:4 +msgid "" +"The Discriminators in ERNIE-Health (https://arxiv.org/abs/2110.07244), " +"including" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:2 +msgid "token-level Replaced Token Detection (RTD) task" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:3 +msgid "token-level Multi-Token Selection (MTS) task" +msgstr "" + +#: of paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator:4 +msgid "sequence-level Contrastive Sequence Prediction (CSP) task." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:3 +msgid "" +"The candidate indices of input sequence tokens in the vocabulary for MTS " +"task. Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:13 +msgid "" +"Returns list of tensors, the prediction results of RTD, MTS and CSP. The " +"logits' data type should be float32 and if batch_size > 1, - the " +"shape of `logits_rtd` is [batch_size, sequence_length], - the shape " +"of `logits_mts` is [batch_size, sequence_length, num_candidate], - " +"the shape of `logits_csp` is [batch_size, 128]. If batch_size=1, the " +"shapes are [sequence_length], [sequence_length, num_cadidate], [128], " +"separately." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:13 +msgid "" +"Returns list of tensors, the prediction results of RTD, MTS and CSP. The " +"logits' data type should be float32 and if batch_size > 1," +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:15 +msgid "the shape of `logits_rtd` is [batch_size, sequence_length]," +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:16 +msgid "the shape of `logits_mts` is [batch_size, sequence_length, num_candidate]," +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:17 +msgid "the shape of `logits_csp` is [batch_size, 128]." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.modeling.ErnieHealthDiscriminator.forward:18 +msgid "" +"If batch_size=1, the shapes are [sequence_length], [sequence_length, " +"num_cadidate], [128], separately." +msgstr "" + +#~ msgid "" +#~ "Returns tuple (gen_logits, disc_logits, " +#~ "disc_labels, attention_mask). With the " +#~ "fields: - `gen_logits` (Tensor): The " +#~ "scores of Electra Generator. Its " +#~ "data type should be int64 and its" +#~ " shape is [batch_size, sequence_length, " +#~ "vocab_size]. - `disc_logits` (Tensor): " +#~ "The prediction result of replaced" +#~ " tokens. Its data type should be" +#~ " float32 and if batch_size>1, its " +#~ "shape is [batch_size, sequence_length], if" +#~ " batch_size=1, its shape is " +#~ "[sequence_length]. - `disc_labels` (Tensor):" +#~ " The labels of electra discriminator." +#~ " Its data type should be int32," +#~ " and its shape is [batch_size, " +#~ "sequence_length]. - `attention_mask` (Tensor):" +#~ " See :class:`ElectraModel`. Its data " +#~ "type should be bool." +#~ msgstr "" + +#~ msgid "Returns tuple (gen_logits, disc_logits, disc_labels, attention_mask)." 
+#~ msgstr "" + +#~ msgid "`gen_logits` (Tensor):" +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.po new file mode 100644 index 0000000000000000000000000000000000000000..24f321c4c9b2f582a1705248a388c0b5134668d1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.electra.rst:2 +msgid "electra" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..c79fb28ec8eb1907ac512701ab906355b2440e0e --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.electra.tokenizer.po @@ -0,0 +1,303 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.electra.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:1 +msgid "" +"Constructs an Electra tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:12 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:15 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:19 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:25 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer:34 +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.vocab_size:3 +msgid "The size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add:3 +msgid "" +"Returns the number of added tokens in the case of a sequence pair if set " +"to True, returns the number of added tokens in the case of a single " +"sequence if set to False." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.num_special_tokens_to_add:6 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:4 +msgid "A ELECTRA sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:12 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A ELECTRA offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "Optional second list of char offsets for offset mapping pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:3 +msgid "A ELECTRA sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:8 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:10 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.create_token_type_ids_from_sequences:15 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.electra.tokenizer.ElectraTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.faster_tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.faster_tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..92d346421ca16a999d4394aad38d012a0ffa2edc --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.faster_tokenizer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.ernie.fast_tokenizer.rst:2 +msgid "fast\\_tokenizer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..7461cbeed8cf5e78d2ca085e929d1c682f1ba064 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.modeling.po @@ -0,0 +1,598 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification:1 +#: paddlenlp.transformers.ernie.modeling.ErnieModel:1 +msgid "基类::class:`paddlenlp.transformers.ernie.modeling.ErniePretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:1 +msgid "The bare ERNIE Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM +#: paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieModel +#: paddlenlp.transformers.ernie.modeling.ErnieModel.forward +#: paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ErnieModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`. .. note:: A normal_initializer " +"initializes weight matrices as normal distributions. 
See " +":meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:5 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:5 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:10 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:11 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:13 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:16 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:20 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. 
For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. We use whole-word-mask in ERNIE, so the whole word will" +" have the same value. For example, \"使用\" as a word, \"使\" and \"用\" will" +" have the same value. Defaults to `None`, which means nothing needed to " +"be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieModel.forward +#: paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:34 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``). With the fields:" +" - `sequence_output` (Tensor): Sequence of hidden-states at the last" +" layer of the model. It's data type should be float32 and its shape " +"is [batch_size, sequence_length, hidden_size]. - `pooled_output` " +"(Tensor): The output of first token (`[CLS]`) in sequence. We " +"\"pool\" the model by simply taking the hidden state corresponding to the" +" first token. Its data type should be float32 and its shape is " +"[batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:34 +msgid "Returns tuple (``sequence_output``, ``pooled_output``)." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:12 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:12 +#: paddlenlp.transformers.ernie.modeling.ErnieModel.forward:36 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:40 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:39 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:44 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieModel.forward:43 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward +#: paddlenlp.transformers.ernie.modeling.ErnieModel.forward +#: paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:15 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:24 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:15 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:15 +#: paddlenlp.transformers.ernie.modeling.ErnieModel.forward:49 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainedModel:1 +msgid "" +"An abstract class for pretrained ERNIE models. It provides ERNIE related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. Refer " +"to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification:1 +msgid "" +"Ernie Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification:4 +msgid "An instance of `paddlenlp.transformers.ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification:6 +msgid "The number of classes. Default to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification:8 +msgid "" +"The dropout probability for output of ERNIE. If None, use the same value " +"as `hidden_dropout_prob` of `paddlenlp.transformers.ErnieModel` instance." +" Defaults to `None`." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:7 +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:7 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:7 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:7 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:1 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:7 +msgid "See :class:`ErnieModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification:1 +msgid "" +"ERNIE Model with a linear layer on top of the hidden-states output layer," +" designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification:4 +msgid "An instance of `ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification:8 +msgid "" +"The dropout probability for output of ERNIE. If None, use the same value " +"as `hidden_dropout_prob` of `ErnieModel` instance `ernie`. Defaults to " +"`None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering:1 +msgid "" +"Ernie Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. 
Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErnieForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining:1 +msgid "" +"Ernie Model with a `masked language modeling` head and a `sentence order " +"prediction` head on top." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:10 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next" +" sentence prediction. Its data type should be float32 and its shape " +"is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:10 +msgid "Returns tuple (``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:17 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:15 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:20 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForPretraining.forward:20 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion:1 +msgid "" +"The loss output of Ernie Model during the pretraining: a `masked language" +" modeling` head and a `next sentence prediction (classification)` head." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward:1 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward:5 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]" +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward:8 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"1]. Otherwise, its shape is [batch_size, mask_token_num, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward:12 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.modeling.ErniePretrainingCriterion.forward:17 +msgid "" +"The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean" +" of `next_sentence_loss`. Its data type should be float32 and its shape " +"is [1]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM:1 +msgid "Ernie Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM:3 +msgid "An instance of :class:`ErnieModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMaskedLM.forward:10 +msgid "" +"Returns tensor `prediction_scores`, The scores of masked token " +"prediction. Its data type should be float32 and shape is [batch_size, " +"sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice:1 +msgid "" +"Ernie Model with a linear layer on top of the hidden-states output layer," +" designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice:4 +msgid "An instance of ErnieModel." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice:8 +msgid "" +"The dropout probability for output of Ernie. If None, use the same value " +"as `hidden_dropout_prob` of `ErnieModel` instance `ernie`. Defaults to " +"None." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:1 +msgid "" +"The ErnieForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:3 +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:5 +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:7 +#: paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:9 +msgid "" +"See :class:`ErnieModel` and shape as [batch_size, num_choice, " +"sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.modeling.ErnieForMultipleChoice.forward:12 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.po new file mode 100644 index 0000000000000000000000000000000000000000..d63a014b0a77f342f5b403c19af8e967132316b0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie.rst:2 +msgid "ernie" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..1f3a1cfcfe77c7b4e7e45067789c1860dc918e99 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie.tokenizer.po @@ -0,0 +1,421 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:1 +msgid "" +"Constructs an ERNIE tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.save_resources +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:23 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:12 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:26 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:15 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:30 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:19 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:33 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:36 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:25 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:39 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." 
+msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:5 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:45 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer:34 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.vocab_size:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.vocab_size +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.vocab_size:3 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.vocab_size +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string:5 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string:8 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add:5 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add:5 +msgid "" +"This encodes inputs and checks the number of added tokens, and is " +"therefore not efficient. Do not put this inside your training loop." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add:8 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add:8 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.num_special_tokens_to_add:12 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.num_special_tokens_to_add:12 +msgid "Number of tokens added to sequences" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:4 +msgid "An Ernie sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:7 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:9 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask:6 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:13 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:15 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_inputs_with_special_tokens:15 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:3 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "An ERNIE offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:5 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:6 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:8 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:10 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.build_offset_mapping_with_special_tokens:14 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:1 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:3 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:3 +msgid "A ERNIE sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:9 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:11 +msgid "A list of `inputs_ids` for the first sequence." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.create_token_type_ids_from_sequences:17 +#: paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer.create_token_type_ids_from_sequences:17 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:1 +msgid "" +"Constructs a ErnieTiny tokenizer. It uses the `dict.wordseg.pickle` cut " +"the text to words, and use the `sentencepiece` tools to cut the words to " +"sub-words." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:17 +msgid "The file path of the vocabulary." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:19 +msgid "The file path of sentencepiece model." +msgstr "" + +#: of paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer:21 +msgid "" +"The file path of word vocabulary, which is used to do chinese word " +"segmentation." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also removes " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.convert_tokens_to_string:11 +msgid "Examples: .. code-block::" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.save_resources:1 +msgid "Save tokenizer related resources to files under `save_directory`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.save_resources:3 +msgid "Directory to save files into." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:4 +msgid "An ERNIE sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.build_offset_mapping_with_special_tokens:14 +msgid "List of wordpiece offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask:9 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie.tokenizer.ErnieTinyTokenizer.get_special_tokens_mask:13 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..ef101c8733d39a41cfe1e3ddb8e5d70967c53e27 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.modeling.po @@ -0,0 +1,443 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_ctm.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmPretrainedModel:1 +msgid "" +"An abstract class for pretrained ErnieCtm models. It provides ErnieCtm " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmPretrainedModel:4 +msgid "and loading pretrained models." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmPretrainedModel:5 +msgid "" +"See :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more" +" details." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel:1 +msgid "基类::class:`paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:1 +msgid "The bare ErnieCtm Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ErnieCtmModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`ErnieCtmModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:13 +msgid "Dimensionality of the embedding layer. Defaults to `128`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:16 +msgid "" +"Dimensionality of the encoder layers and the pooler layer. Defaults to " +"`768`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:19 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:21 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:24 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:44 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:47 +msgid "Whether or not to add content summary tokens. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:50 +msgid "" +"The number of the content summary tokens. Only valid when " +"use_content_summary is True. Defaults to `1`." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel:53 +msgid "" +"The number of the CLS tokens. Only valid when use_content_summary is " +"True. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:1 +msgid "The ErnieCtmModel forward method, overrides the __call__() special method." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. We use whole-word-mask in ERNIE, so the whole word will" +" have the same value. For example, \"使用\" as a word, \"使\" and \"用\" will" +" have the same value. 
Defaults to `None`, which means nothing needed to " +"be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:35 +msgid "" +"Whether the `content_output` is clone from `sequence_output`. If set to " +"`True`, the content_output is clone from sequence_output, which may cause" +" the classification task impact on the sequence labeling task. Defaults " +"to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:40 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``, " +"``content_output``). With the fields: - `sequence_output` (Tensor):" +" Sequence of output at the last layer of the model. Its data type " +"should be float32 and has a shape of [batch_size, sequence_length, " +"hidden_size]. - `pooled_output` (Tensor): The output of first token " +"(`[CLS]`) in sequence. We \"pool\" the model by simply taking the " +"hidden state corresponding to the first token. Its data type should " +"be float32 and its shape is [batch_size, hidden_size]. - " +"`content_output` (Tensor): The output of content summary token " +"(`[CLS1]` in sequence). Its data type should be float32 and has a " +"shape of [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:40 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``, " +"``content_output``)." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:42 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:19 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:46 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:45 +msgid "" +"Sequence of output at the last layer of the model. Its data type should " +"be float32 and has a shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:51 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:49 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:54 +msgid "`content_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:54 +msgid "" +"The output of content summary token (`[CLS1]` in sequence). Its data type" +" should be float32 and has a shape of [batch_size, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:15 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmModel.forward:59 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:15 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:27 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel:1 +msgid "" +"ErnieCtmWordtag Model with a token classification head on top (a crf " +"layer on top of the hidden-states output) . e.g. for Named-Entity-" +"Recognition (NER) tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel:3 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel:4 +msgid "An instance of :class:`ErnieCtmModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel:6 +msgid "The number of different tags." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel:8 +msgid "The learning rate of the crf. Defaults to `100`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:3 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:5 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:7 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:3 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:5 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:7 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:1 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:3 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:5 +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:7 +msgid "See :class:`ErnieCtmModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:9 +msgid "" +"The input length. Its dtype is int64 and has a shape of `[batch_size]`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:12 +msgid "" +"The input predicted tensor. Its dtype is float32 and has a shape of " +"`[batch_size, sequence_length, num_tags]`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:17 +msgid "" +"Returns tuple (`seq_logits`, `cls_logits`). With the fields: - " +"`seq_logits` (Tensor): A tensor of next sentence prediction logits." +" Its data type should be float32 and its shape is [batch_size, " +"sequence_length, num_tag]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:17 +msgid "Returns tuple (`seq_logits`, `cls_logits`)." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:22 +msgid "`seq_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmWordtagModel.forward:22 +msgid "" +"A tensor of next sentence prediction logits. Its data type should be " +"float32 and its shape is [batch_size, sequence_length, num_tag]." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel:1 +msgid "ErnieCtmNptag Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmNptagModel.forward:10 +msgid "" +"Returns tensor `logits`, the scores of masked token prediction. Its data " +"type should be float32 and shape is [batch_size, sequence_length, " +"vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification:1 +msgid "" +"ERNIECtm Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification:4 +msgid "An instance of `ErnieModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification:8 +msgid "" +"The dropout probability for output of ERNIE. If None, use the same value " +"as `hidden_dropout_prob` of `ErnieCtmModel` instance `ernie`. Defaults to" +" `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.modeling.ErnieCtmForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[sequence_length, num_classes]` and dtype as `float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.po new file mode 100644 index 0000000000000000000000000000000000000000..d28a42a646a75e1bc35e5b6b6708785e756898de --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_ctm.rst:2 +msgid "ernie\\_ctm" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..fc4e22a00b052d3190870c102b316d600c3c3aa6 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_ctm.tokenizer.po @@ -0,0 +1,295 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_ctm.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:1 +msgid "Construct an ERNIE-CTM tokenizer." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:7 +msgid "File path of the vocabulary." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:9 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:11 +msgid "" +"Whether or not to do basic tokenization before WordPiece. Defaults to " +"`True`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:13 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:17 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:20 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:23 +msgid "" +"The template of summary token for multiple summary placeholders. Defaults" +" to `\"[CLS{}]\"`" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:25 +msgid "" +"Summary placeholder used in ernie-ctm model. For catching a sentence " +"global feature from multiple aware. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task. This is the token which the model will" +" try to predict the original unmasked ones. Defaults to `\"[MASK]\"`." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:32 +msgid "" +"(bool, optional): Whether or not to strip all accents. If this option is " +"not specified, then it will be determined by the value for `lowercase` " +"(as in the original BERT)." +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer:37 +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequences for sequence " +"classification tasks by concatenating and add special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:4 +msgid "A ERNIE-CTM sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: [CLS0][CLS1]... X [SEP]" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: [CLS0][CLS1]... X [SEP] X [SEP]" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:11 +msgid "Optional second list of IDs for sequence pairs. Defaults to ``None``." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.build_inputs_with_special_tokens:14 +msgid "The input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask:1 +msgid "" +"Creates a special tokens mask from the input sequences. This method is " +"called when adding special tokens using the tokenizer `encode` method." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:23 +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:25 +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask:6 +msgid "" +"Optional second list of `inputs_ids` for the second sequence. Defaults to" +" `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask:9 +msgid "" +"Whether or not the token list already contains special tokens for the " +"model. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.get_special_tokens_mask:13 +msgid "" +"A list of integers which is either 0 or 1: 1 for a special token, 0 for a" +" sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:1 +msgid "Creates a token_type mask from the input sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"If `token_ids_1` is not `None`, then a sequence pair token_type mask has " +"the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:11 +msgid "" +"Else if `token_ids_1` is `None`, then a single sequence token_type mask " +"has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:19 +msgid "0 stands for the segment id of **first segment tokens**," +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:20 +msgid "1 stands for the segment id of **second segment tokens**," +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:21 +msgid "2 stands for the segment id of **cls_token**." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.create_token_type_ids_from_sequences:29 +msgid "List of token type IDs according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add:5 +msgid "" +"This encodes inputs and checks the number of added tokens, and is " +"therefore not efficient. 
Do not put this inside your training loop." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add:8 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_ctm.tokenizer.ErnieCtmTokenizer.num_special_tokens_to_add:12 +msgid "Number of tokens added to sequences." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..ad156d1cebf8cc0ef7aaffea898e7ea34c0c02b9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.modeling.po @@ -0,0 +1,535 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_doc.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering:1 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification:1 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification:1 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:1 +msgid "基类::class:`paddlenlp.transformers.ernie_doc.modeling.ErnieDocPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:1 +msgid "The bare ERNIE-Doc Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:6 +msgid "" +"This model is also a `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:10 +msgid "The number of hidden layers in the Transformer encoder." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:12 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:14 +msgid "Dimensionality of the embedding layers, encoder layers and pooler layer." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:16 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:18 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:20 +msgid "The dropout probability of FFN." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:22 +msgid "The non-linear activation function of FFN." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:24 +msgid "" +"The number of tokens to cache. If not 0, the last `memory_len` hidden " +"states in each layer will be cached into memory." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:27 +msgid "" +"Vocabulary size of `inputs_ids` in `ErnieDocModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`ErnieDocModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:30 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:33 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `3`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:35 +msgid "" +"Indicate whether to put layer normalization into preprocessing of MHA and" +" FFN sub-layers. If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:40 +msgid "" +"The `epsilon` parameter used in :class:`paddle.nn.LayerNorm` for " +"initializing layer normalization layers. Defaults to `1e-5`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:43 +msgid "Whether to share the relative position parameters. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:46 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:49 +msgid "" +"The token id of [PAD] token whose parameters won't be updated when " +"training. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel:52 +msgid "The token id of [CLS] token. Defaults to `-1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:1 +msgid "" +"The ErnieDocModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. 
It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length, 1]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:7 +msgid "" +"A list of length `n_layers` with each Tensor being a pre-computed hidden-" +"state for each layer. Each Tensor has a dtype `float32` and a shape of " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:10 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length, 1]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:10 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:13 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:14 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:16 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length, 1]. Defaults to None, which means no segment embeddings " +"is added to token embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:19 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, config.max_position_embeddings - " +"1]``. Shape as `(batch_sie, num_tokens)` and dtype as `int32` or `int64`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. We use whole-word-mask in ERNIE, so the whole word will" +" have the same value. For example, \"使用\" as a word, \"使\" and \"用\" will" +" have the same value. Defaults to `None`, which means nothing needed to " +"be prevented attention to." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:36 +msgid "" +"Returns tuple (``encoder_output``, ``pooled_output``, ``new_mem``). With" +" the fields: - `encoder_output` (Tensor): Sequence of hidden-states " +"at the last layer of the model. It's data type should be float32 and " +"its shape is [batch_size, sequence_length, hidden_size]. - " +"`pooled_output` (Tensor): The output of first token (`[CLS]`) in " +"sequence. We \"pool\" the model by simply taking the hidden state " +"corresponding to the first token. Its data type should be float32 and" +" its shape is [batch_size, hidden_size]. - `new_mem` (List[Tensor]):" +" A list of pre-computed hidden-states. The length of the list is " +"`n_layers`. Each element in the list is a Tensor with dtype `float32`" +" and shape as [batch_size, memory_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:36 +msgid "Returns tuple (``encoder_output``, ``pooled_output``, ``new_mem``)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:16 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:16 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:17 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:38 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:42 +msgid "`encoder_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:41 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:47 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:45 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:50 +msgid "`new_mem` (List[Tensor]):" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:50 +msgid "" +"A list of pre-computed hidden-states. The length of the list is " +"`n_layers`. Each element in the list is a Tensor with dtype `float32` and" +" shape as [batch_size, memory_length, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:33 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:29 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:30 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocModel.forward:55 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocPretrainedModel:1 +msgid "" +"An abstract class for pretrained ErnieDoc models. It provides ErnieDoc " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification:1 +msgid "" +"ErnieDoc Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering:5 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification:4 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification:4 +msgid "An instance of :class:`ErnieDocModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification:6 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification:6 +msgid "The number of classes." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification:8 +msgid "The dropout ratio of last output. Default to `0.1`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:1 +msgid "" +"The ErnieDocForSequenceClassification forward method, overrides the " +"`__call__()` special method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:3 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:5 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:7 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:9 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:11 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:3 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:5 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:7 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:9 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:11 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:3 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:5 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:10 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:12 +msgid "See :class:`ErnieDocModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:14 +msgid "" +"Returns tuple (`logits`, `mem`). With the fields: - `logits` (Tensor):" +" A tensor containing the [CLS] of hidden-states of the model at the " +"output of last layer. Each Tensor has a data type of `float32` and " +"has a shape of [batch_size, num_classes]. - `mem` (List[Tensor]): A " +"list of pre-computed hidden-states. The length of the list is `n_layers`." +" Each element in the list is a Tensor with dtype `float32` and has a " +"shape of [batch_size, memory_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:14 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:15 +msgid "Returns tuple (`logits`, `mem`)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:20 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:21 +msgid "`logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:19 +msgid "" +"A tensor containing the [CLS] of hidden-states of the model at the output" +" of last layer. Each Tensor has a data type of `float32` and has a shape " +"of [batch_size, num_classes]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:28 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:24 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:25 +msgid "`mem` (List[Tensor]):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:27 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForSequenceClassification.forward:23 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:24 +msgid "" +"A list of pre-computed hidden-states. The length of the list is " +"`n_layers`. Each element in the list is a Tensor with dtype `float32` and" +" has a shape of [batch_size, memory_length, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification:1 +msgid "" +"ErnieDoc Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering:7 +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification:8 +msgid "The dropout ratio of last output. Default to 0.1." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:1 +msgid "" +"The ErnieDocForTokenClassification forward method, overrides the " +"`__call__()` special method." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:7 +msgid "" +"See :class:`ErnieDocModel`. Defaults to None, which means no segment " +"embeddings is added to token embeddings." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:15 +msgid "" +"Returns tuple (`logits`, `mem`). With the fields: - `logits` (Tensor):" +" A tensor containing the hidden-states of the model at the output of " +"last layer. Each Tensor has a data type of `float32` and has a shape " +"of [batch_size, sequence_length, num_classes]. - `mem` (List[Tensor]):" +" A list of pre-computed hidden-states. The length of the list is " +"`n_layers`. Each element in the list is a Tensor with dtype `float32`" +" and has a shape of [batch_size, memory_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForTokenClassification.forward:20 +msgid "" +"A tensor containing the hidden-states of the model at the output of last " +"layer. Each Tensor has a data type of `float32` and has a shape of " +"[batch_size, sequence_length, num_classes]." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering:1 +msgid "" +"ErnieDoc Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:1 +msgid "" +"The ErnieDocForQuestionAnswering forward method, overrides the " +"`__call__()` special method." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:14 +msgid "" +"Returns tuple (`start_logits`, `end_logits`, `mem`). With the fields: -" +" `start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `mem` (List[Tensor]): A list of pre-computed hidden-states. The " +"length of the list is `n_layers`. Each element in the list is a " +"Tensor with dtype `float32` and has a shape of [batch_size, " +"memory_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:14 +msgid "Returns tuple (`start_logits`, `end_logits`, `mem`)." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:20 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:24 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.modeling.ErnieDocForQuestionAnswering.forward:23 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.po new file mode 100644 index 0000000000000000000000000000000000000000..09c4c0c23d421ec1da371bae3bddccc1ddf28448 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_doc.rst:2 +msgid "ernie\\_doc" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..bddacc7b7ee00985d1ff75722a50b9b3b2b392a1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_doc.tokenizer.po @@ -0,0 +1,148 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_doc.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:1 +msgid "" +"Constructs an ERNIE-Doc tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer`. For more" +" information regarding those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:8 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:11 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:13 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:14 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:17 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:18 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:20 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:21 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:23 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:24 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:26 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:27 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:32 +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocTokenizer:33 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.BPETokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:1 +msgid "" +"Constructs an ERNIE-Doc BPE tokenizer. It uses a bpe tokenizer to do " +"punctuation splitting, lower casing and so on, then tokenize words as " +"subwords." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:4 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.BPETokenizer`. For more " +"information regarding those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:7 +msgid "File path of the vocabulary." +msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:9 +msgid "File path of the id to vocab." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer:11 +msgid "File path of word merge text." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_doc.tokenizer.ErnieDocBPETokenizer.vocab_size +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..585be964010e36c5591945c8577c18801531c8a1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.modeling.po @@ -0,0 +1,168 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gen.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieGenPretrainedModel:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieGenPretrainedModel:1 +msgid "" +"An abstract class for pretrained ErnieGen models. It provides ErnieGen " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gen.modeling.ErnieGenPretrainedModel.save_pretrained:1 +msgid "" +"Save model configuration and related resources (model state) to files " +"under `save_directory`. :param save_directory: Directory to save files " +"into. :type save_directory: str" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration:1 +msgid "基类::class:`paddlenlp.transformers.ernie_gen.modeling.ErnieModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration:1 +msgid "Ernie Model for sequence to sequence generation." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.ernie.modeling.ErnieModel`. Refer to the " +"superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:1 +msgid "" +"The ground truth target sequence id (hard label) or distribution (soft " +"label). It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length] or [batch_size, sequence_length, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:5 +msgid "" +"Index of tgt_labels in `src_ids`. It's data type should be `int64` and " +"has a shape of [n_targets, 2])." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:8 +msgid "" +"Whether the model will output the logits or only encode the inputs. If " +"`encode_only` is `True`, `loss` and `logits_2d` will not be returned." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:12 +msgid "" +"Returns tuple (`None`, `None`, `info`) if `encode_only` is `True`, " +"returns (`output_ids`, `logits`, `info`) if `tgt_labels` or `tgt_pos` is " +"`None`, else, returns (`loss`, `logits_2d`, `info`). With the fields: -" +" `info`(dict): Middle level info, includes all hidden stats and k/v " +"caches. - `output_ids`(Tensor): The output index. Its data type " +"should be float32 and its shape is [batch_size]. If `encode_only`, " +"returns None. - `logits`(Tensor): Logits for every targets. Its " +"data type should be float32 and its shape is [batch_size, " +"sequence_length]. If `encode_only`, returns None. - `loss`(Tensor):" +" Cross entropy loss mean over every target label. If " +"`encode_only`, returns None. - `logits_2d`(Tensor): Logits for every" +" targets if `tgt_labels` or `tgt_pos` is not `None` . Its data type " +"should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:12 +msgid "" +"Returns tuple (`None`, `None`, `info`) if `encode_only` is `True`, " +"returns (`output_ids`, `logits`, `info`) if `tgt_labels` or `tgt_pos` is " +"`None`, else, returns (`loss`, `logits_2d`, `info`)." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:16 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:19 +msgid "`info`(dict):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:19 +msgid "Middle level info, includes all hidden stats and k/v caches." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:23 +msgid "`output_ids`(Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:22 +msgid "" +"The output index. Its data type should be float32 and its shape is " +"[batch_size]. If `encode_only`, returns None." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:28 +msgid "`logits`(Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:26 +msgid "" +"Logits for every targets. Its data type should be float32 and its shape " +"is [batch_size, sequence_length]. If `encode_only`, returns None." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:32 +msgid "`loss`(Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:31 +msgid "" +"Cross entropy loss mean over every target label. If `encode_only`, " +"returns None." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:35 +msgid "`logits_2d`(Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward:35 +msgid "" +"Logits for every targets if `tgt_labels` or `tgt_pos` is not `None` . Its" +" data type should be float32 and its shape is [batch_size, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie_gen.modeling.ErnieForGeneration.forward +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.po new file mode 100644 index 0000000000000000000000000000000000000000..c4aa4f4ff1729736b3401de0642f6c1adc50b2a5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gen.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gen.rst:2 +msgid "ernie\\_gen" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.matching_param_name.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.matching_param_name.po new file mode 100644 index 0000000000000000000000000000000000000000..b017641491e348c56a3000dcc39e88a0eeb546fe --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.matching_param_name.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gram.matching_param_name.rst:2 +msgid "matching\\_param\\_name" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..34a2cd874dfa50ca3aed9d472bf2faf08749ee80 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.modeling.po @@ -0,0 +1,435 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gram.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:1 +msgid "基类::class:`paddlenlp.transformers.ernie_gram.modeling.ErnieGramPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:1 +msgid "The bare ERNIE-Gram Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:10 +msgid "" +"Vocabulary size of the ERNIE-Gram model. Also is the vocab size of token " +"embedding matrix." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:12 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:14 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:16 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:19 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:24 +msgid "" +"The non-linear activation function in the feed-forward layer. 
" +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:28 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoders. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:31 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:34 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:37 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:40 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`. .. note:: A normal_initializer " +"initializes weight matrices as normal distributions. See " +":meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieGramModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:40 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:44 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ErniePretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieGramModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:47 +msgid "" +"The relative position size just for ERNIE-Gram English model. Defaults to" +" None." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel:49 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:5 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:5 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:8 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:9 +msgid "1 corresponds to a **sentence B** token." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:11 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to None, which means no segment embeddings is " +"added to token embeddings." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:14 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, config.max_position_embeddings - " +"1]``. Defaults to `None`. Shape as `(batch_sie, num_tokens)` and dtype as" +" `int32` or `int64`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:18 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. We use whole-word-mask in ERNIE, so the whole word will" +" have the same value. For example, \"使用\" as a word, \"使\" and \"用\" will" +" have the same value. Defaults to `None`, which means nothing needed to " +"be prevented attention to." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:32 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``). With the fields:" +" - `sequence_output` (Tensor): Sequence of hidden-states at the last" +" layer of the model. It's data type should be float32 and its shape " +"is [batch_size, sequence_length, hidden_size]. - `pooled_output` " +"(Tensor): The output of first token (`[CLS]`) in sequence. We " +"\"pool\" the model by simply taking the hidden state corresponding to the" +" first token. Its data type should be float32 and its shape is " +"[batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:32 +msgid "Returns tuple (``sequence_output``, ``pooled_output``)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:12 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:34 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:38 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:37 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:42 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:41 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:24 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:15 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:15 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramModel.forward:47 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.modeling.ErnieGramPretrainedModel:1 +msgid "" +"An abstract class for pretrained ERNIE-Gram models. It provides ERNIE-" +"Gram related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification:1 +msgid "" +"ERNIE-Gram Model with a linear layer on top of the output layer, designed" +" for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification:4 +msgid "An instance of `paddlenlp.transformers.ErnieGramModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification:6 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification:6 +msgid "The number of classes. Default to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification:8 +msgid "" +"The dropout probability for output of ERNIE-Gram. If None, use the same " +"value as `hidden_dropout_prob` of `paddlenlp.transformers.ErnieGramModel`" +" instance. Defaults to `None`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:3 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:5 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:7 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:3 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:5 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:7 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:1 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:3 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:5 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:7 +msgid "See :class:`ErnieGramModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification:1 +msgid "" +"ERNIE-Gram Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering:5 +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification:4 +msgid "An instance of `ErnieGramModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification:8 +msgid "" +"The dropout probability for output of ERNIE-Gram. If None, use the same " +"value as `hidden_dropout_prob` of `ErnieGramModel` instance `ernie_gram`." +" Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering:1 +msgid "" +"ERNIE-Gram Model with a linear layer on top of the hidden-states output " +"to compute `span_start_logits` and `span_end_logits`, designed for " +"question-answering tasks like SQuAD.." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_gram.modeling.ErnieGramForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.po new file mode 100644 index 0000000000000000000000000000000000000000..2397fb7858eb3b5259d4f0eafccbf7d3d9b0496a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gram.rst:2 +msgid "ernie\\_gram" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..a99977fa731d901cd5deca74d8c97673d4ea00c8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_gram.tokenizer.po @@ -0,0 +1,91 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_gram.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:1 +msgid "" +"Constructs an ERNIE-Gram tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:4 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer`. For more" +" information regarding those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:7 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:10 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:13 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:17 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:20 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:23 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:26 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_gram.tokenizer.ErnieGramTokenizer:32 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..563c4e457b395571e84b66ab14dd49bb2b83cd03 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.modeling.po @@ -0,0 +1,391 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_m.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel:1 +msgid "基类::class:`paddlenlp.transformers.ernie_m.modeling.ErnieMPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:1 +msgid "The bare ERNIE-M Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `ErnieMModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `ErnieMModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. 
" +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`. .. note:: A normal_initializer " +"initializes weight matrices as normal distributions. See " +":meth:`ErnieMPretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieMModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ErnieMPretrainedModel._init_weights()` for how weights are " +"initialized in `ErnieMModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:5 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:9 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. 
Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:21 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``). With the fields:" +" - `sequence_output` (Tensor): Sequence of hidden-states at the last" +" layer of the model. It's data type should be float32 and its shape " +"is [batch_size, sequence_length, hidden_size]. - `pooled_output` " +"(Tensor): The output of first token (`[CLS]`) in sequence. We " +"\"pool\" the model by simply taking the hidden state corresponding to the" +" first token. Its data type should be float32 and its shape is " +"[batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:21 +msgid "Returns tuple (``sequence_output``, ``pooled_output``)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:12 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:23 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:27 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:26 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:31 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:30 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:24 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:15 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:15 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMModel.forward:36 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMPretrainedModel:1 +msgid "" +"An abstract class for pretrained ERNIE-M models. It provides ERNIE-M " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. Refer " +"to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification:1 +msgid "" +"Ernie-M Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification:4 +msgid "An instance of `paddlenlp.transformers.ErnieMModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification:6 +msgid "The number of classes. Default to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification:8 +msgid "" +"The dropout probability for output of ERNIE-M. If None, use the same " +"value as `hidden_dropout_prob` of `paddlenlp.transformers.ErnieMModel` " +"instance. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:3 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:5 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:7 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:3 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:5 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:7 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:1 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:3 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:5 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:7 +msgid "See :class:`ErnieMModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification:1 +msgid "" +"ERNIE-M Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering:5 +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification:4 +msgid "An instance of `ErnieMModel`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification:8 +msgid "" +"The dropout probability for output of ERNIE-M. If None, use the same " +"value as `hidden_dropout_prob` of `ErnieMModel` instance `ernie_m`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." 
+msgstr "" + +#: of paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering:1 +msgid "" +"Ernie-M Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.modeling.ErnieMForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.po new file mode 100644 index 0000000000000000000000000000000000000000..0bcbb7a03cbd2bed69bd57759f252ceb760cd950 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_m.rst:2 +msgid "ernie\\_m" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..51866a4482e1cd3426ecd0edbb3de6b363ac2a1d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ernie_m.tokenizer.po @@ -0,0 +1,216 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ernie_m.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:1 +msgid "" +"Constructs a ErnieM tokenizer. It uses the `sentencepiece` tools to cut " +"the words to sub-words." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:3 +msgid "The file path of the vocabulary." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:5 +msgid "The file path of sentencepiece model." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:7 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:10 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:14 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:17 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:20 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer:23 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.vocab_size:3 +msgid "The size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.clean_text:1 +msgid "Performs invalid character removal and whitespace cleanup on text." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize:1 +msgid "Converts a string to a list of tokens." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize:3 +msgid "The text to be tokenized." +msgstr "" + +#: of paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.tokenize:6 +msgid "A list of string representing converted tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.convert_ids_to_string:1 +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (strings for sub-words) in a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens:4 +msgid "" +"An ERNIE-M sequence has the following format: - single sequence: " +"``[CLS] X [SEP]`` - pair of sequences: ``[CLS] A [SEP] [SEP] B " +"[SEP]`` :param token_ids_0: List of IDs to which the special tokens will " +"be added. :type token_ids_0: List[int] :param token_ids_1: Optional " +"second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens:10 +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask:6 +msgid "Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_inputs_with_special_tokens:13 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "" +"An ERNIE-M offset_mapping has the following format: - single sequence:" +" ``(0,0) X (0,0)`` - pair of sequences: ``(0,0) A (0,0) (0,0)" +" B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens:7 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens:9 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to `None`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "List of wordpiece offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods. :param token_ids_0: List of ids of the " +"first sequence. :type token_ids_0: List[int] :param token_ids_1: Optional" +" second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.export.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.export.po new file mode 100644 index 0000000000000000000000000000000000000000..4baa1d450bff7493a7ba88da07aa41fedf4768c4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.export.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.export.rst:2 +msgid "export" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..7bd4c32fa12003ec986dc377cea94658f5aed5b9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.modeling.po @@ -0,0 +1,539 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.fnet.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling:1 +msgid "Modeling classes for FNet model." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetPretrainedModel:1 +msgid "" +"An abstract class for pretrained FNet models. 
It provides FNet related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +"`PretrainedModel` for more details." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM:1 +#: paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice:1 +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction:1 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining:1 +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering:1 +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification:1 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification:1 +#: paddlenlp.transformers.fnet.modeling.FNetModel:1 +msgid "基类::class:`paddlenlp.transformers.fnet.modeling.FNetPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:1 +msgid "" +"The model can behave as an encoder, following the architecture described " +"in `FNet: Mixing Tokens with Fourier Transforms " +"`__ by James Lee-Thorp, Joshua Ainslie," +" Ilya Eckstein, Santiago Ontanon." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward +#: paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice +#: paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice.forward +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction.forward +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering.forward +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification.forward +#: paddlenlp.transformers.fnet.modeling.FNetModel +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:5 +msgid "" +"Vocabulary size of `inputs_ids` in `FNetModel`. Also is the vocab size of" +" token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `FNetModel`. " +"Defaults to `32000`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:9 +msgid "Dimensionality of the encoder layer and pooler layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:11 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:13 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:18 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `glue_new`." 
+msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:22 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:25 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:28 +msgid "The vocabulary size of `token_type_ids`. Defaults to `4`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:30 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `ElectraModel`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:30 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .. " +"note::" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:35 +msgid "" +"The `epsilon` parameter used in :class:`paddle.nn.LayerNorm` for " +"initializing layer normalization layers. A small value to the variance " +"added to the normalization layer to prevent division by zero. Defaults to" +" `1e-12`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:39 +msgid "The index of padding token in the token vocabulary. Defaults to `3`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel:41 +msgid "Whether or not to add the pooling layer. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:1 +msgid "The FNetModel forward method." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:3 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:7 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:7 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. 
Indices can either be 0 " +"or 1:" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:12 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:13 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:15 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:18 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:22 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:22 +msgid "" +"If you want to control how to convert `inputs_ids` indices into " +"associated vectors, you can pass an embedded representation directly " +"instead of passing `inputs_ids`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:32 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`, `encoder_outputs[1:]`)" +" or a dict with last_hidden_state`, `pooled_output`, `all_hidden_states`," +" fields. With the fields: - `sequence_output` (Tensor): Sequence of " +"hidden-states at the last layer of the model. It's data type should be" +" float32 and has a shape of [`batch_size, sequence_length, hidden_size`]." +" - `pooled_output` (Tensor): The output of first token (`[CLS]`) in " +"sequence. We \"pool\" the model by simply taking the hidden state " +"corresponding to the first token. Its data type should be float32 and" +" has a shape of [batch_size, hidden_size]. - `last_hidden_state` " +"(Tensor): The output of the last encoder layer, it is also the " +"`sequence_output`. It's data type should be float32 and has a shape of" +" [batch_size, sequence_length, hidden_size]. - `all_hidden_states` " +"(Tensor): Hidden_states of all layers in the Transformer encoder. The " +"length of `all_hidden_states` is `num_hidden_layers + 1`. For all " +"element in the tuple, its data type should be float32 and its shape is " +"[`batch_size, sequence_length, hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:32 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`, `encoder_outputs[1:]`)" +" or a dict with last_hidden_state`, `pooled_output`, `all_hidden_states`," +" fields." 
+msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:22 +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:34 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:35 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:39 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:38 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and has a shape of [`batch_size, sequence_length, " +"hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:45 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:42 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and has a shape of [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:49 +msgid "`last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:48 +msgid "" +"The output of the last encoder layer, it is also the `sequence_output`. " +"It's data type should be float32 and has a shape of [batch_size, " +"sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:52 +msgid "`all_hidden_states` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetModel.forward:52 +msgid "" +"Hidden_states of all layers in the Transformer encoder. The length of " +"`all_hidden_states` is `num_hidden_layers + 1`. For all element in the " +"tuple, its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:46 +#: paddlenlp.transformers.fnet.modeling.FNetModel.forward:57 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification:1 +msgid "" +"FNet Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice:4 +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering:4 +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification:4 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification:4 +msgid "An instance of FNetModel." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification:6 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:1 +msgid "The FNetForSequenceClassification forward method." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:32 +msgid "" +"Returns tensor `logits`, or a dict with `logits`, `hidden_states`, " +"`attentions` fields. 
With the fields: - `logits` (Tensor): A tensor" +" of the input text classification logits. Shape as `[batch_size, " +"num_classes]` and dtype as float32. - `hidden_states` (Tensor): " +"Hidden_states of all layers in the Transformer encoder. The length of " +"`hidden_states` is `num_hidden_layers + 1`. For all element in the " +"tuple, its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:32 +msgid "" +"Returns tensor `logits`, or a dict with `logits`, `hidden_states`, " +"`attentions` fields." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:38 +msgid "`logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:37 +msgid "" +"A tensor of the input text classification logits. Shape as `[batch_size, " +"num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:29 +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:41 +msgid "`hidden_states` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:29 +#: paddlenlp.transformers.fnet.modeling.FNetForSequenceClassification.forward:41 +msgid "" +"Hidden_states of all layers in the Transformer encoder. The length of " +"`hidden_states` is `num_hidden_layers + 1`. For all element in the tuple," +" its data type should be float32 and its shape is [`batch_size, " +"sequence_length, hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForPreTraining:1 +msgid "" +"FNet Model with two heads on top as done during the pretraining: a " +"`masked language modeling` head and a `next sentence prediction " +"(classification)` head." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:1 +msgid "The FNetForPretraining forward method." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:3 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:5 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:7 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:9 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:15 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:17 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:3 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:5 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:7 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:11 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:17 +#: paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:19 +msgid "See :class:`FNetModel`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:9 +msgid "Labels for computing the masked language modeling loss." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:13 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. 
Its data type " +"should be int64 and its shape is [batch_size, 1]" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForPreTraining.forward:22 +msgid "" +"Returns tuple (`prediction_scores`, `seq_relationship_score`) or a dict " +"with `prediction_logits`, `seq_relationship_logits`, `hidden_states` " +"fields." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM:1 +msgid "FNet Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM:3 +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction:3 +msgid "An instance of :class:`FNetModel`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:1 +msgid "The FNetForMaskedLM forward method." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:11 +#: paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:13 +msgid "See :class:`FNetForPreTraining`." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:20 +msgid "" +"Returns tensor `prediction_scores` or a dict with `prediction_logits`, " +"`hidden_states` fields. With the fields: - `prediction_scores` " +"(Tensor): The scores of masked token prediction. Its data type should" +" be float32. and its shape is [batch_size, sequence_length, " +"vocab_size]. - `hidden_states` (Tensor): Hidden_states of all layers" +" in the Transformer encoder. The length of `hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:20 +msgid "" +"Returns tensor `prediction_scores` or a dict with `prediction_logits`, " +"`hidden_states` fields." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:26 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMaskedLM.forward:25 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"and its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction:1 +msgid "FNet Model with a `next sentence prediction` head on top." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice.forward:1 +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction.forward:1 +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering.forward:1 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice.forward:4 +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction.forward:4 +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering.forward:4 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice.forward:6 +#: paddlenlp.transformers.fnet.modeling.FNetForNextSentencePrediction.forward:6 +#: paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering.forward:6 +#: paddlenlp.transformers.fnet.modeling.FNetForTokenClassification.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForMultipleChoice:1 +msgid "" +"FNet Model with a linear layer on top of the hidden-states output layer, " +"designed for multiple choice tasks like SWAG tasks ." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForTokenClassification:1 +msgid "" +"FNet Model with a linear layer on top of the hidden-states output layer, " +"designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering:1 +msgid "" +"FNet Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.fnet.modeling.FNetForQuestionAnswering:6 +msgid "The number of labels." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.po new file mode 100644 index 0000000000000000000000000000000000000000..d1ca3ef4986a7eebcb571f9ea6ab1fa2467925e4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.fnet.rst:2 +msgid "fnet" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..79004b0383ca1f76ff5c64931f67f19e87e61d11 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.fnet.tokenizer.po @@ -0,0 +1,308 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.fnet.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer:1 +msgid "Tokenization class for FNet model." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.albert.tokenizer.AlbertEnglishTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:1 +msgid "" +"Construct a FNet tokenizer. Inherit from :class:`AlbertEnglishTokenizer`." +" Based on `SentencePiece `__." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.save_resources +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:4 +msgid "" +"`SentencePiece `__ file " +"(generally has a `.spm` extension) that contains the vocabulary necessary" +" to instantiate a tokenizer." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:7 +msgid "Whether or not to lowercase the input when tokenizing." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:9 +msgid "" +"Whether or not to strip the text when tokenizing (removing excess spaces " +"before and after the string)." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:11 +msgid "Whether or not to keep accents when tokenizing." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:13 +msgid "" +"The unknown token. A token that is not in the vocabulary cannot be " +"converted to an ID and is set to be this token instead." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:16 +msgid "" +"The separator token, which is used when building a sequence from multiple" +" sequences, e.g. two sequences for sequence classification or for a text " +"and a question for question answering. It is also used as the last token " +"of a sequence built with special tokens." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:20 +msgid "" +"The token used for padding, for example when batching sequences of " +"different lengths." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:22 +msgid "" +"The classifier token which is used when doing sequence classification " +"(classification of the whole sequence instead of per-token " +"classification). It is the first token of the sequence when built with " +"special tokens." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:25 +msgid "" +"The token used for masking values. This is the token used when training " +"this model with masked language modeling. This is the token which the " +"model will try to predict." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:28 +msgid "" +"Will be passed to the ``SentencePieceProcessor.__init__()`` method. 
The " +"`Python wrapper for SentencePiece " +"`__ can be " +"used, among other things, to set: - ``enable_sampling``: Enable subword " +"regularization. - ``nbest_size``: Sampling parameters for unigram. " +"Invalid for BPE-Dropout. - ``nbest_size = {0,1}``: No sampling is " +"performed. - ``nbest_size > 1``: samples from the nbest_size results." +" - ``nbest_size < 0``: assuming that nbest_size is infinite and samples" +" from the all hypothesis (lattice) using forward-filtering-and-" +"backward-sampling algorithm. - ``alpha``: Smoothing parameter for unigram" +" sampling, and dropout probability of merge operations for BPE-dropout." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:28 +msgid "" +"Will be passed to the ``SentencePieceProcessor.__init__()`` method. The " +"`Python wrapper for SentencePiece " +"`__ can be " +"used, among other things, to set:" +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:31 +msgid "``enable_sampling``: Enable subword regularization." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:32 +msgid "``nbest_size``: Sampling parameters for unigram. Invalid for BPE-Dropout." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:34 +msgid "``nbest_size = {0,1}``: No sampling is performed." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:35 +msgid "``nbest_size > 1``: samples from the nbest_size results." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:36 +msgid "" +"``nbest_size < 0``: assuming that nbest_size is infinite and samples from" +" the all hypothesis (lattice) using forward-filtering-and-backward-" +"sampling algorithm." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:38 +msgid "" +"``alpha``: Smoothing parameter for unigram sampling, and dropout " +"probability of merge operations for BPE-dropout." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:44 +msgid "" +"The `SentencePiece` processor that is used for every conversion (string, " +"tokens and IDs)." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer +msgid "type" +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer:46 +msgid ":obj:`SentencePieceProcessor`" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (strings for sub-words) in a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. An FNet " +"sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:4 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:5 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:7 +msgid "List of IDs to which the special tokens will be added." 
+msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:9 +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:12 +msgid "" +"List of `input IDs <../glossary.html#input-ids>`__ with the appropriate " +"special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.build_inputs_with_special_tokens:13 +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:15 +msgid ":obj:`List[int]`" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 " +"for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:14 +msgid "The list of integers in the range [0, 1]:" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.get_special_tokens_mask:15 +msgid "1 for a special token, 0 for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task. An FNet sequence pair mask has the following " +"format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.create_token_type_ids_from_sequences:13 +msgid "" +"List of `token type IDs <../glossary.html#token-type-ids>`_ according to " +"the given sequence(s)." 
+msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.save_resources:1 +msgid "" +"Save tokenizer related resources to `resource_files_names` indicating " +"files under `save_directory` by copying directly. Override it if " +"necessary." +msgstr "" + +#: of paddlenlp.transformers.fnet.tokenizer.FNetTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..a5a0d757377a5688887059dde06c222cfa4eadda --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.modeling.po @@ -0,0 +1,109 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.funnel.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.funnel.modeling.FunnelForQuestionAnswering:1 +#: paddlenlp.transformers.funnel.modeling.FunnelForSequenceClassification:1 +#: paddlenlp.transformers.funnel.modeling.FunnelForTokenClassification:1 +#: paddlenlp.transformers.funnel.modeling.FunnelModel:1 +msgid "基类::class:`paddlenlp.transformers.funnel.modeling.FunnelPreTrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.funnel.modeling.FunnelModel.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.transformers.funnel.modeling.FunnelModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.funnel.modeling.FunnelModel.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.transformers.funnel.modeling.FunnelModel.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForSequenceClassification.forward:3 +msgid "labels (:obj:`paddle.Tensor` of shape :obj:`(batch_size,)`, `optional`):" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForSequenceClassification.forward:2 +msgid "" +"Labels for computing the sequence classification/regression loss. Indices" +" should be in :obj:`[0, ..., config.num_labels - 1]`. If " +":obj:`config.num_labels == 1` a regression loss is computed (Mean-Square " +"loss), If :obj:`config.num_labels > 1` a classification loss is computed " +"(Cross-Entropy)." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForTokenClassification.forward:2 +msgid "" +"labels (:obj:`paddle.Tensor` of shape :obj:`(batch_size, " +"sequence_length)`, `optional`):" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForTokenClassification.forward:2 +msgid "" +"Labels for computing the token classification loss. Indices should be in " +"``[0, ..., config.num_labels - 1]``." 
+msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForQuestionAnswering.forward:3 +msgid "" +"start_positions (:obj:`paddle.Tensor` of shape :obj:`(batch_size,)`, " +"`optional`):" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForQuestionAnswering.forward:2 +msgid "" +"Labels for position (index) of the start of the labelled span for " +"computing the token classification loss. Positions are clamped to the " +"length of the sequence (:obj:`sequence_length`). Position outside of the " +"sequence are not taken into account for computing the loss." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForQuestionAnswering.forward:7 +msgid "" +"end_positions (:obj:`paddle.Tensor` of shape :obj:`(batch_size,)`, " +"`optional`):" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.modeling.FunnelForQuestionAnswering.forward:6 +msgid "" +"Labels for position (index) of the end of the labelled span for computing" +" the token classification loss. Positions are clamped to the length of " +"the sequence (:obj:`sequence_length`). Position outside of the sequence " +"are not taken into account for computing the loss." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.po new file mode 100644 index 0000000000000000000000000000000000000000..50cc071d876c721c467240e39ac2f08228cfbcc5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.funnel.rst:2 +msgid "funnel" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..d8f3f6ca3911cb73ca200c21498832d05e6ca0c7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.funnel.tokenizer.po @@ -0,0 +1,479 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.funnel.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.vocab_size:1 +msgid "" +"return the size of vocabulary. :returns: the size of vocabulary. 
:rtype: " +"int" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.tokenize:1 +msgid "" +"End-to-end tokenization for BERT models. :param text: The text to be " +"tokenized. :type text: str" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.tokenize +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.tokenize:5 +msgid "A list of string representing converted tokens." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.tokenize +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting. :param tokens: A list of string representing tokens" +" to be converted. :type tokens: list" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.convert_tokens_to_string:7 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add:5 +msgid "" +"This encodes inputs and checks the number of added tokens, and is " +"therefore not efficient. Do not put this inside your training loop." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add:8 +msgid "" +"Returns the number of added tokens in the case of a sequence pair if set " +"to True, returns the number of added tokens in the case of a single " +"sequence if set to False." 
+msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.num_special_tokens_to_add:11 +msgid "Number of tokens added to sequences" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A BERT offset_mapping has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "Optional second list of char offsets for offset mapping pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.build_offset_mapping_with_special_tokens:14 +msgid ":obj:`List[tuple]`" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:3 +msgid "A BERT sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:11 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:13 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." 
+msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:1 +msgid "Truncates a sequence pair in place to the maximum length." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:3 +msgid "" +"list of tokenized input ids. Can be obtained from a string by chaining " +"the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:5 +msgid "" +"Optional second list of input ids. Can be obtained from a string by " +"chaining the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:7 +msgid "number of tokens to remove using the truncation strategy" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:9 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len starting from the longest one at each token (when there " +"is a pair of input sequences). Overflowing tokens only contains " +"overflow from the first sequence. - 'only_first': Only truncate the first" +" sequence. raise an error if the first sequence is shorter or equal to " +"than num_tokens_to_remove. - 'only_second': Only truncate the second " +"sequence - 'do_not_truncate': Does not truncate (raise an error if the " +"input sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:9 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:11 +msgid "" +"starting from the longest one at each token (when there is a pair of " +"input sequences). Overflowing tokens only contains overflow from the " +"first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:13 +msgid "" +"'only_first': Only truncate the first sequence. raise an error if the " +"first sequence is shorter or equal to than num_tokens_to_remove." +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:14 +msgid "'only_second': Only truncate the second sequence" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:15 +msgid "" +"'do_not_truncate': Does not truncate (raise an error if the input " +"sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.truncate_sequences:16 +msgid "" +"If set to a number along with max_seq_len, the overflowing tokens " +"returned will contain some tokens from the main sequence returned. The " +"value of this argument defines the number of additional tokens." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports batch inputs of sequence or sequence pair. 
:param " +"batch_text_or_text_pairs: The element of list can be sequence or sequence" +" pair, and the" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:4 +msgid "" +"sequence is a string or a list of strings depending on whether it has " +"been pretokenized. If each sequence is provided as a list of strings " +"(pretokenized), you must set `is_split_into_words` as `True` to " +"disambiguate with a sequence pair." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:9 +msgid "" +"If set to a number, will limit the total sequence returned so that it has" +" a maximum length. If there are overflowing tokens, those overflowing " +"tokens will be added to the returned dictionary when " +"`return_overflowing_tokens` is `True`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:14 +msgid "" +"Only available for batch input of sequence pair and mainly for question " +"answering usage. When for QA, `text` represents questions and `text_pair`" +" represents contexts. If `stride` is set to a positive number, the " +"context will be split into multiple spans where `stride` defines the " +"number of (tokenized) tokens to skip from the start of one span to get " +"the next span, thus will produce a bigger batch than inputs to include " +"all spans. Moreover, 'overflow_to_sample' and 'offset_mapping' preserving" +" the original example and position information will be added to the " +"returned dictionary. Defaults to 0." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:24 +msgid "" +"If set to `True`, the returned sequences would be padded up to " +"`max_seq_len` specified length according to padding side " +"(`self.padding_side`) and padding token id. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:28 +msgid "" +"String selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"`max_seq_len` starting from the longest one at each token (when there is " +"a pair of input sequences). - 'only_first': Only truncate the first " +"sequence. - 'only_second': Only truncate the second sequence. - " +"'do_not_truncate': Do not truncate (raise an error if the input sequence " +"is longer than `max_seq_len`). Defaults to 'longest_first'." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:38 +msgid "" +"Whether to include tokens position ids in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:41 +msgid "" +"Whether to include token type ids in the returned dictionary. Defaults to" +" `True`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:44 +msgid "" +"Whether to include the attention mask in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:47 +msgid "" +"Whether to include the length of each encoded inputs in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:50 +msgid "" +"Whether to include overflowing token information in the returned " +"dictionary. Defaults to `False`." 
+msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:53 +msgid "" +"Whether to include special tokens mask information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:57 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` " +"works. - **overflow_to_sample** (int, optional): Index of example from " +"which this feature is generated. Included when `stride` works." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:57 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:60 +msgid "fed to a model. Included when `return_position_ids` is `True`" +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:61 +msgid "" +"**token_type_ids** (list[int], optional): List of token type ids to be " +"fed to a model. Included when `return_token_type_ids` is `True`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:63 +msgid "" +"**attention_mask** (list[int], optional): List of integers valued 0 or 1," +" where 0 specifies paddings and should not be attended to by the model. " +"Included when `return_attention_mask` is `True`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:66 +msgid "" +"**seq_len** (int, optional): The input_ids length. Included when " +"`return_length` is `True`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:68 +msgid "" +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." 
+msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:71 +msgid "" +"**num_truncated_tokens** (int, optional): The number of overflowing " +"tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:74 +msgid "" +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:77 +msgid "" +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` works." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.batch_encode:81 +msgid "" +"**overflow_to_sample** (int, optional): Index of example from which this " +"feature is generated. Included when `stride` works." +msgstr "" + +#: of paddlenlp.transformers.funnel.tokenizer.FunnelTokenizer.rematch:1 +msgid "" +"changed from " +"https://github.com/bojone/bert4keras/blob/master/bert4keras/tokenizers.py#L372" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.generation_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.generation_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..e4609c0b17e0ab7ae0535230a0ba667bac29db00 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.generation_utils.po @@ -0,0 +1,238 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.generation_utils.rst:2 +msgid "generation\\_utils" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin:1 +msgid "This class implements the interface for generation task." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin:3 +msgid "" +"It's used as the base class of `paddlenlp.transformers.PretrainedModel " +"`__." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:1 +msgid "" +"The interface for generation task. This method can generate sequences by " +"using decoding strategy. Currently, there are three decoding strategies " +"supported: \"greedy_search\", \"sampling\" and \"beam_search\"." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:5 +msgid "" +"The input sequence ids for the generation. It is a Tensor with shape " +"[batch_size, sequence_length]. The data type should be int32 or int64. 
" +"Default to None, which we will initialize it as a Tensor with shape [1, " +"1], filled with the value `bos_token_id`." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:11 +msgid "The maximum length of the sequence to be generated. Default to 20." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:14 +msgid "The minimum length of the sequence to be generated. Default to 0." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:17 +msgid "" +"The decoding strategy in generation. Currently, there are three decoding " +"strategies supported: \"greedy_search\", \"sampling\" and " +"\"beam_search\". Default to \"greedy_search\"." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:22 +msgid "" +"The value used to module the next token probabilities in the \"sampling\"" +" strategy. Default to 1.0, which means no effect." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:26 +msgid "" +"The number of highest probability tokens to keep for top-k-filtering in " +"the \"sampling\" strategy. Default to 0, which means no effect." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:30 +msgid "" +"The cumulative probability for top-p-filtering in the \"sampling\" " +"strategy. The value should satisfy :math:`0 <= top\\_p < 1`. Default to " +"1.0, which means no effect." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:35 +msgid "" +"The parameter for repetition penalty. 1.0 means no penalty. See `this " +"paper `__ for more details. " +"Defaults to 1.0." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:38 +msgid "The number of beams in the \"beam_search\" strategy. Default to 1." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:41 +msgid "" +"Number of groups to divide `num_beams` into in order to use DIVERSE BEAM " +"SEARCH. See `this paper `__ for " +"more details. Default to 1." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:45 +msgid "" +"The exponential penalty to the sequence length in the \"beam_search\" " +"strategy. The larger this param is, the more that the model would " +"generate shorter sequences. Default to 0.0, which means no penalty." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:50 +msgid "" +"Whether to stop searching in the \"beam_search\" strategy when at least " +"`num_beams` sentences are finished per batch or not. Default to False." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:54 +msgid "The id of the `bos_token`. Default to None." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:57 +msgid "The id of the `eos_token`. Default to None." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:60 +msgid "The id of the `pad_token`. Default to None." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:63 +msgid "The start token id for encoder-decoder models. Default to None." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:66 +msgid "" +"The id of the token to force as the first generated token. Usually use " +"for multilingual models. Default to None." 
+msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:70 +msgid "The id of the token to force as the last generated token. Default to None." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:73 +msgid "" +"The number of returned sequences for each sequence in the batch. Default " +"to 1." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:76 +msgid "" +"If num_beam_groups is 1, this is the diversity_rate for Diverse Siblings " +"Search. See `this paper https://arxiv.org/abs/1611.08562`__ for more " +"details. If not, this is the diversity_rate for DIVERSE BEAM SEARCH." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:81 +msgid "" +"(bool, optional): Whether to use the model cache to speed up decoding. " +"Default to True." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:83 +msgid "" +"(bool, optional): Whether to use fast entry of model for " +"FastGeneration. Default to False." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:85 +msgid "" +"(bool, optional): Whether to use fp16 for decoding. Only works when " +"fast entry is avalible. Default to False." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:87 +msgid "It can be used to specify additional kwargs passed to the model." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:91 +msgid "" +"It is a tuple contains two elements: ids and scores. Each element is a " +"Tensor. With the fields: - ids (Tensor): The ids of the generated " +"sequences. It is a Tensor with shape [batch_size * " +"num_return_sequences, sequence_length]. The data type is same as the " +"input `input_ids`. - scores (Tensor): The scores of the generated " +"sequences. It is a Tensor with shape [batch_size * " +"num_return_sequences, 1]. The data type is float32 or float64, which " +"is the same as the parameters in the model." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:91 +msgid "" +"It is a tuple contains two elements: ids and scores. Each element is a " +"Tensor." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:94 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:98 +msgid "ids (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:97 +msgid "" +"The ids of the generated sequences. It is a Tensor with shape [batch_size" +" * num_return_sequences, sequence_length]. The data type is same as the " +"input `input_ids`." +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:102 +msgid "scores (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:101 +msgid "" +"The scores of the generated sequences. It is a Tensor with shape " +"[batch_size * num_return_sequences, 1]. The data type is float32 or " +"float64, which is the same as the parameters in the model." 
+msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.generation_utils.GenerationMixin.generate:107 +msgid "示例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..915270047f8c402de8b61b36a3b12e950a1b9cc2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.modeling.po @@ -0,0 +1,428 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.gpt.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration:1 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining:1 +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification:1 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification:1 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel:1 +#: paddlenlp.transformers.gpt.modeling.GPTModel:1 +msgid "基类::class:`paddlenlp.transformers.gpt.modeling.GPTPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:1 +msgid "The bare GPT Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.forward +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTModel +#: paddlenlp.transformers.gpt.modeling.GPTModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `GPTModel`. Also is the vocab size of " +"token embedding matrix. 
Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `GPTModel`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:13 +msgid "" +"Dimensionality of the embedding layer and decoder layer. Defaults to " +"`768`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:15 +msgid "Number of hidden layers in the Transformer decoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"decoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the decoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and decoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all decoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:38 +msgid "" +"The vocabulary size of the `token_type_ids`. Defaults to `16`. .. note::" +" Please NOT using `type_vocab_size`, for it will be obsolete in the " +"future.." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:41 +msgid "" +"Please NOT using `type_vocab_size`, for it will be obsolete in the " +"future.." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:43 +msgid "" +"The standard deviation of the normal initializer. Default to `0.02`. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`GPTPretrainedModel._init_weights()` for how" +" weights are initialized in `GPTModel`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:43 +msgid "The standard deviation of the normal initializer. Default to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:46 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`GPTPretrainedModel._init_weights()` for how weights are " +"initialized in `GPTModel`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel:49 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:1 +msgid "The GPTModel forward method, overrides the `__call__()` special method." 
+msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:7 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:11 +msgid "" +"Mask used in self attention to avoid performing attention to some " +"unwanted positions, usually the subsequent positions. It is a tensor with" +" shape broadcasted to `[batch_size, num_attention_heads, sequence_length," +" sequence_length]`. It is a tensor with shape broadcasted to " +"`[batch_size, num_attention_heads, sequence_length, sequence_length]`. " +"For example, its shape can be [batch_size, sequence_length], " +"[batch_size, sequence_length, sequence_length], [batch_size, " +"num_attention_heads, sequence_length, sequence_length]. Its data type " +"should be float32. The `masked` tokens have `-1e-9` values, and the " +"`unmasked` tokens have `0` values. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:21 +msgid "" +"Whether or not to use cache. Defaults to `False`. If set to `True`, key " +"value states will be returned and can be used to speed up decoding." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:24 +msgid "" +"It is a list, and each element in the list is a tuple " +"`(incremental_cache, static_cache)`. See `TransformerDecoder.gen_cache " +"`__" +" for more details. It is only used for inference and should be None for " +"training. Default to `None`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.forward +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTModel.forward:30 +msgid "" +"Returns tensor `encoder_output`, which is the output at the last layer of" +" the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.forward +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTModel.forward +#: paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:19 +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:15 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:15 +#: paddlenlp.transformers.gpt.modeling.GPTModel.forward:35 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainedModel:1 +msgid "" +"An abstract class for pretrained GPT models. It provides GPT related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForPretraining:1 +msgid "GPT Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForPretraining:3 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel:3 +msgid "An instance of :class:`GPTModel`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.forward:1 +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:1 +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:3 +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:5 +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:7 +#: paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:9 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:1 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:3 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:5 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:7 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:9 +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:3 +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:5 +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:7 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:3 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:5 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:7 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:1 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:3 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:5 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:7 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:9 +msgid "See :class:`GPTModel`." 
+msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.model:12 +#: paddlenlp.transformers.gpt.modeling.GPTForPretraining.forward:12 +#: paddlenlp.transformers.gpt.modeling.GPTLMHeadModel.forward:12 +msgid "" +"Returns tensor `logits` or tuple `(logits, cached_kvs)`. If `use_cache` " +"is True, tuple (`logits, cached_kvs`) will be returned. Otherwise, tensor" +" `logits` will be returned. `logits` is the output of the gpt model. " +"`cache_kvs` is the cache output of gpt model if `use_cache` is True." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion:1 +msgid "Criterion for GPT. It calculates the final loss." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward:1 +msgid "" +"The logits of masked token prediction. Its data type should be float32 " +"and its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward:4 +msgid "" +"The labels of the masked language modeling, the dimensionality of " +"`masked_lm_labels` is equal to `prediction_scores`. Its data type should " +"be int64 and its shape is [batch_size, sequence_length, 1]." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward:8 +msgid "" +"Mask used for calculating the loss of the masked language modeling to " +"avoid calculating some unwanted tokens. Its data type should be float32 " +"and its shape is [batch_size, sequence_length, 1]." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTPretrainingCriterion.forward:13 +msgid "" +"The pretraining loss. Its data type should be float32 and its shape is " +"[1]." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration:1 +msgid "" +"The generate model for GPT-2. It use the greedy strategy and generate the" +" output sequence with highest probability." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration:4 +msgid "An instance of `paddlenlp.transformers.GPTModel`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration:6 +msgid "The max length of the prediction." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForGreedyGeneration.forward:4 +msgid "" +"Returns tensor `src_ids`, which means the indices of output sequence " +"tokens in the vocabulary. They are numerical representations of tokens " +"that build the output sequence." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTLMHeadModel:1 +msgid "The GPT Model with a `language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForTokenClassification:1 +msgid "" +"GPT Model with a token classification head on top (a linear layer on top " +"of the hidden-states output) e.g. for Named-Entity-Recognition (NER) " +"tasks." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification:4 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification:4 +msgid "An instance of GPTModel." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification:6 +#: paddlenlp.transformers.gpt.modeling.GPTForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForTokenClassification:8 +msgid "" +"The dropout probability for output of GPT. 
If None, use the same value as" +" `hidden_dropout_prob` of `GPTModel` instance `gpt`. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:1 +msgid "" +"The GPTForTokenClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification:1 +msgid "" +"GPT Model with a sequence classification/regression head on top (a linear" +" layer on top of the pooled output) e.g. for GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:1 +msgid "" +"The GPTForSequenceClassification forward method, overrides the __call__()" +" special method." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.modeling.GPTForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.po new file mode 100644 index 0000000000000000000000000000000000000000..73f3ea9694b457fa2bf4b224f035fd335921c138 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.gpt.rst:2 +msgid "gpt" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..b35fc7efb1d262feaae43bdc059141c2fdf31d17 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.gpt.tokenizer.po @@ -0,0 +1,189 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.gpt.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:1 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:1 +msgid "Constructs a GPT tokenizer based on byte-level Byte-Pair-Encoding." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:3 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.save_resources +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.save_resources +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:7 +msgid "" +"Path to the vocab file. The vocab file contains a mapping from vocabulary" +" strings to indices." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:10 +msgid "" +"Path to the merge file. The merge file is used to split the input " +"sentence into \"subword\" units. The vocab file is then used to encode " +"those units as intices." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:14 +msgid "Paradigm to follow when decoding bytes to UTF-8. Defaults to `'replace'`." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:17 +msgid "The maximum value of the input sequence length. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:19 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer:22 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.vocab_size:1 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.vocab_size:1 +msgid "Returns the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.vocab_size +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.vocab_size:3 +msgid "The sum of size of vocabulary and the size of speical tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.vocab_size +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string:1 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string:1 +msgid "Converts a single index or a sequence of indices to texts." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string:3 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string:3 +msgid "The token id (or token ids) to be converted to text." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string:6 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string:6 +msgid "The decoded text." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_string:10 +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens:11 +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.vocab_size:7 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.convert_ids_to_string:10 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.save_resources:1 +msgid "" +"Saves `SentencePiece `__ file " +"(ends with '.spm') under `save_directory`." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.save_resources:3 +#: paddlenlp.transformers.gpt.tokenizer.GPTTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:1 +msgid "" +"Constructs a GPT Chinese tokenizer based on `SentencePiece " +"`__." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:7 +msgid "" +"The vocabulary file required to instantiate a `SentencePiece " +"`__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:10 +msgid "The maximum value of the input sequence length. Defaults to `512`." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer:13 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens:1 +msgid "" +"Converts a single index or a sequence of indices to a token or a sequence" +" of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens:4 +msgid "The token id (or token ids) to be converted to token(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.convert_ids_to_tokens:7 +msgid "The converted token or sequence of tokens." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of paddlenlp.transformers.gpt.tokenizer.GPTChineseTokenizer.save_resources:1 +msgid "Save tokenizer related resources to files under `save_directory`." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..8446e7b167ea52591c49eb96607f005d4f443607 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.modeling.po @@ -0,0 +1,382 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlm.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling:1 +msgid "Modeling classes for LayoutLM model." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM:1 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification:1 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification:1 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:1 +msgid "基类::class:`paddlenlp.transformers.layoutlm.modeling.LayoutLMPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:1 +msgid "The bare LayoutLM Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMModel +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:10 +msgid "" +"Vocabulary size of the LayoutLM model. Defines the number of different " +"tokens that can be represented by the `inputs_ids` passed when calling " +"LayoutLMModel." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:13 +msgid "Dimensionality of the encoder layers and the pooler layer." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:15 +msgid "Number of hidden layers in the Transformer encoder." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder." 
+msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:19 +msgid "" +"Dimensionality of the \"intermediate\" (often named feed-forward) layer " +"in the Transformer encoder." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:21 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:25 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:27 +msgid "The dropout probability for all fully connected layers in the pooler." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:29 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:32 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`LayoutLMPretrainedModel.init_weights()` for" +" how weights are initialized in `LayoutLMModel`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:32 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:36 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`LayoutLMPretrainedModel.init_weights()` for how weights are " +"initialized in `LayoutLMModel`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:39 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel:42 +msgid "" +"The non-linear activation function in the pooling layer. Defaults to " +"`\"tanh\"`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:1 +msgid "" +"The LayoutLMModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. 
Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:31 +msgid "Whether to return the output of each hidden layers. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`). With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:35 +msgid "Returns tuple (`sequence_output`, `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:37 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:41 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:40 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:45 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward:44 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM:1 +msgid "LayoutLM Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM:3 +msgid "An instance of :class:`LayoutLMModel`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:1 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:3 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:5 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:7 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:9 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:3 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:5 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:7 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:9 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:11 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:13 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:3 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:5 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:7 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:9 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:11 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:13 +msgid "See :class:`LayoutLMModel`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:12 +msgid "" +"Returns tensor `prediction_scores`, The scores of masked token " +"prediction. Its data type should be float32 and shape is [batch_size, " +"sequence_length, vocab_size]." 
+msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForMaskedLM.forward:17 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:21 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:21 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification:1 +msgid "" +"LayoutLM Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification:4 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification:4 +msgid "An instance of LayoutLMModel." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification:6 +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification:8 +msgid "" +"The dropout probability for output of LayoutLM. If None, use the same " +"value as `hidden_dropout_prob` of `LayoutLMModel` instance `layoutlm`. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:1 +msgid "" +"The LayoutLMForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForTokenClassification.forward:16 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification:1 +msgid "" +"LayoutLM Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:1 +msgid "" +"The LayoutLMForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlm.modeling.LayoutLMForSequenceClassification.forward:16 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.po new file mode 100644 index 0000000000000000000000000000000000000000..9ad11da107a28272077cf9d55b404b09f4320b04 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlm.rst:2 +msgid "layoutlm" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..9ecd8d8859bef56735cdb856421c4ffe44d6f546 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlm.tokenizer.po @@ -0,0 +1,39 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlm.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.tokenizer:1 +msgid "Tokenization classes for LayoutLM model." +msgstr "" + +#: of paddlenlp.transformers.layoutlm.tokenizer.LayoutLMTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.layoutlm.tokenizer.LayoutLMTokenizer:1 +msgid "" +"The usage of LayoutLMTokenizer is the same as `BertTokenizer " +"`__." +" For more information regarding those methods, please refer to this " +"superclass." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..1169d16673304a192292e249a51dfe476890e096 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.modeling.po @@ -0,0 +1,157 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlmv2.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling:1 +msgid "Modeling classes for LayoutLMv2 model." 
+msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForPretraining:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForTokenClassification:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:1 +msgid "基类::class:`paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:1 +msgid "The bare LayoutLMv2 Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForPretraining.forward +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction.forward +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForTokenClassification.forward +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:10 +msgid "" +"Vocabulary size of the XLNet model. Defines the number of different " +"tokens that can be represented by the `inputs_ids` passed when calling " +"XLNetModel." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:13 +msgid "" +"Dimensionality of the encoder layers and the pooler layer. Defaults to " +"``768``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to ``12``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to ``12``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:20 +msgid "" +"Dimensionality of the \"intermediate\" (often named feed-forward) layer " +"in the Transformer encoder. Defaults to ``3072``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:23 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:27 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:30 +msgid "" +"The dropout probability for all fully connected layers in the pooler. " +"Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model:33 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Defaults to ``0.02``." 
+msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForPretraining.forward:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction.forward:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForTokenClassification.forward:1 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForPretraining.forward:4 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction.forward:4 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForTokenClassification.forward:4 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForPretraining.forward:6 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction.forward:6 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForTokenClassification.forward:6 +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2Model.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2PretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2PretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutlmv2.modeling.LayoutLMv2ForRelationExtraction.init_weights:1 +msgid "Initialize the weights" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.po new file mode 100644 index 0000000000000000000000000000000000000000..e228ffd620941c77e98fbe2549ba65e77a7e8f1c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlmv2.rst:2 +msgid "layoutlmv2" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..36187c4795111a788b10699a494769807d28f745 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutlmv2.tokenizer.po @@ -0,0 +1,39 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutlmv2.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.tokenizer:1 +msgid "Tokenization classes for LayoutLMv2 model." +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.tokenizer.LayoutLMv2Tokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.layoutlmv2.tokenizer.LayoutLMv2Tokenizer:1 +msgid "" +"The usage of LayoutLMv2Tokenizer is the same as `BertTokenizer " +"`__." +" For more information regarding those methods, please refer to this " +"superclass." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..ea5ad2bac1471c74cb3a4095d831bbb1cb2f45f9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.modeling.po @@ -0,0 +1,156 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutxlm.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling:1 +msgid "Modeling classes for LayoutXLM model." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForPretraining:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForTokenClassification:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:1 +msgid "基类::class:`paddlenlp.transformers.layoutxlm.modeling.LayoutXLMPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:1 +msgid "The bare LayoutXLM Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForPretraining.forward +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction.forward +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForTokenClassification.forward +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:10 +msgid "" +"Vocabulary size of the XLNet model. Defines the number of different " +"tokens that can be represented by the `inputs_ids` passed when calling " +"XLNetModel." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:13 +msgid "" +"Dimensionality of the encoder layers and the pooler layer. Defaults to " +"``768``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to ``12``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to ``12``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:20 +msgid "" +"Dimensionality of the \"intermediate\" (often named feed-forward) layer " +"in the Transformer encoder. Defaults to ``3072``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:23 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:27 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:30 +msgid "" +"The dropout probability for all fully connected layers in the pooler. " +"Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel:33 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Defaults to ``0.02``." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForPretraining.forward:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction.forward:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForTokenClassification.forward:1 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." 
+msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForPretraining.forward:4 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction.forward:4 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForTokenClassification.forward:4 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForPretraining.forward:6 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction.forward:6 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForTokenClassification.forward:6 +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMModel.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.modeling.LayoutXLMPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.modeling.LayoutXLMForRelationExtraction.init_weights:1 +msgid "Initialize the weights" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.po new file mode 100644 index 0000000000000000000000000000000000000000..5088b7bb51a55d7b99c9aecef777aa74cecd3c88 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutxlm.rst:2 +msgid "layoutxlm" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..b493f08595fc77f1ece491a5b2628170cf3231db --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.tokenizer.po @@ -0,0 +1,176 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutxlm.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.tokenizer:1 +msgid "Tokenization classes for LayoutXLM model." 
+msgstr "" + +#: of paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens:4 +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"Should be overridden in a subclass if the model has a special way of " +"building those." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens:6 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens:8 +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens:11 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." 
+msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 " +"for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:14 +msgid "The list of integers in the range [0, 1]:" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.get_special_tokens_mask:15 +msgid "1 for a special token, 0 for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (strings for sub-words) in a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the number of added tokens should be computed in the case of a " +"sequence pair or a single sequence. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.tokenizer.LayoutXLMTokenizer.num_special_tokens_to_add:7 +msgid "Number of special tokens added to sequences." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.visual_backbone.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.visual_backbone.po new file mode 100644 index 0000000000000000000000000000000000000000..62dc467e525171f1223428d14e170849d8a23a54 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.layoutxlm.visual_backbone.po @@ -0,0 +1,286 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.layoutxlm.visual_backbone.rst:2 +msgid "visual\\_backbone" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Conv2d:1 +msgid "基类::class:`paddle.nn.layer.conv.Conv2D`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Backbone.forward:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem.forward:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock.forward:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.Conv2d.forward:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool.forward:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.VisualBackbone.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Backbone.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.Conv2d.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.FPN.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage +#: paddlenlp.transformers.layoutxlm.visual_backbone.VisualBackbone.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.get_norm +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Backbone.forward:4 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem.forward:4 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock.forward:4 +#: paddlenlp.transformers.layoutxlm.visual_backbone.Conv2d.forward:4 +#: paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool.forward:4 +#: paddlenlp.transformers.layoutxlm.visual_backbone.VisualBackbone.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Backbone.forward:6 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem.forward:6 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock.forward:6 +#: paddlenlp.transformers.layoutxlm.visual_backbone.Conv2d.forward:6 +#: paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool.forward:6 +#: paddlenlp.transformers.layoutxlm.visual_backbone.VisualBackbone.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.Backbone:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.CNNBlockBase:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.VisualBackbone:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ShapeSpec:1 +msgid "基类::class:`paddlenlp.transformers.layoutxlm.visual_backbone._ShapeSpec`" +msgstr "" + +#: of 
paddlenlp.transformers.layoutxlm.visual_backbone.get_norm:1 +msgid "" +"either one of BN, SyncBN, FrozenBN, GN; or a callable that takes a " +"channel number and returns the normalization layer as a nn.Layer." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.get_norm:5 +msgid "out_channels" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FPN.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage +#: paddlenlp.transformers.layoutxlm.visual_backbone.build_resnet_backbone +#: paddlenlp.transformers.layoutxlm.visual_backbone.get_norm +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.get_norm:8 +msgid "the normalization layer" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FPN.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.forward +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage +#: paddlenlp.transformers.layoutxlm.visual_backbone.build_resnet_backbone +#: paddlenlp.transformers.layoutxlm.visual_backbone.get_norm +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FrozenBatchNorm:1 +msgid "基类::class:`paddle.fluid.dygraph.nn.BatchNorm`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.BasicBlock:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.DeformBottleneckBlock:1 +msgid "基类::class:`paddlenlp.transformers.layoutxlm.visual_backbone.CNNBlockBase`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.BasicBlock:1 +msgid "" +"The basic residual block for ResNet-18 and ResNet-34 defined in " +":paper:`ResNet`, with two 3x3 conv layers and a projection shortcut if " +"needed." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.BottleneckBlock:1 +msgid "" +"The standard bottleneck residual block used by ResNet-50, 101 and 152 " +"defined in :paper:`ResNet`. It contains 3 conv layers with kernels 1x1, " +"3x3, 1x1, and a projection shortcut if needed." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.DeformBottleneckBlock:1 +msgid "" +"Similar to :class:`BottleneckBlock`, but with :paper:`deformable conv " +"` in the 3x3 convolution." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.BasicStem:1 +msgid "" +"The standard ResNet stem (layers before the first residual block), with a" +" conv, relu and max_pool." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FPN:1 +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet:1 +msgid "基类::class:`paddlenlp.transformers.layoutxlm.visual_backbone.Backbone`" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.forward:1 +msgid "" +"Tensor of shape (N,C,H,W). H, W must be a multiple of " +"``self.size_divisibility``." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.forward:3 +msgid "names and the corresponding features" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:1 +msgid "Create a list of blocks of the same type that forms one ResNet stage." 
+msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:3 +msgid "" +"a subclass of CNNBlockBase that's used to create all blocks in this " +"stage. A module of this type must not change spatial resolution of inputs" +" unless its stride != 1." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:7 +msgid "number of blocks in this stage" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:9 +msgid "input channels of the entire stage." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:11 +msgid "output channels of **every block** in the stage." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:13 +msgid "" +"other arguments passed to the constructor of `block_class`. If the " +"argument name is \"xx_per_block\", the argument is a list of values to be" +" passed to each block in the stage. Otherwise, the same argument is " +"passed to every block in the stage." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:19 +msgid "a list of block module." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:22 +msgid "Examples: ::" +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_stage:31 +msgid "" +"Usually, layers that produce the same feature map spatial size are " +"defined as one \"stage\" (in :paper:`FPN`). Under such definition, " +"``stride_per_block[1:]`` should all be 1." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:1 +msgid "" +"Created list of ResNet stages from pre-defined depth (one of 18, 34, 50, " +"101, 152). If it doesn't create the ResNet variant you need, please use " +":meth:`make_stage` instead for fine-grained customization." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:5 +msgid "depth of ResNet" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:7 +msgid "" +"the CNN block class. Has to accept `bottleneck_channels` argument for " +"depth > 50. By default it is BasicBlock or BottleneckBlock, based on the " +"depth." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:12 +msgid "" +"other arguments to pass to `make_stage`. Should not contain stride and " +"channels, as they are predefined for each depth." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:15 +msgid "modules in all stages; see arguments of :class:`ResNet.__init__`." +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:17 +msgid "modules in all stages; see arguments of" +msgstr "" + +#: of +#: paddlenlp.transformers.layoutxlm.visual_backbone.ResNet.make_default_stages:18 +msgid ":class:`ResNet.__init__`." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.LastLevelMaxPool:1 +msgid "" +"This module is used in the original FPN to generate a downsampled P6 " +"feature from P5." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FPN.forward:1 +msgid "" +"mapping feature map name (e.g., \"res5\") to feature map tensor for each " +"feature level in high to low resolution order." 
+msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.FPN.forward:5 +msgid "" +"mapping from feature map name to FPN feature map tensor in high to low " +"resolution order. Returned feature names follow the FPN paper convention:" +" \"p\", where stage has stride = 2 ** stage e.g., [\"p2\", \"p3\"," +" ..., \"p6\"]." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.make_stage:1 +msgid "Deprecated alias for backward compatibiltiy." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.build_resnet_backbone:1 +msgid "Create a ResNet instance from config." +msgstr "" + +#: of paddlenlp.transformers.layoutxlm.visual_backbone.build_resnet_backbone:3 +msgid "a :class:`ResNet` instance." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..783249b186716de930651c6f75e1243434fb6773 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.modeling.po @@ -0,0 +1,598 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.luke.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification:1 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification:1 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification:1 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM:1 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering:1 +#: paddlenlp.transformers.luke.modeling.LukeModel:1 +msgid "基类::class:`paddlenlp.transformers.luke.modeling.LukePretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:1 +msgid "The bare Luke Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward +#: paddlenlp.transformers.luke.modeling.LukeModel +#: paddlenlp.transformers.luke.modeling.LukeModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `LukeModel`. Also is the vocab size of" +" token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `LukeModel`. " +"Defaults to 50267." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:14 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:16 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:18 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:21 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:26 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:30 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:33 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:36 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`514`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:39 +msgid "The vocabulary size of `token_type_ids`. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:42 +msgid "" +"Vocabulary size of `entity_ids` in `LukeModel`. Also is the vocab size of" +" token entity embedding matrix. Defines the number of different entity " +"that can be represented by the `entity_ids` passed when calling " +"`LukeModel`. Defaults to 500000." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:46 +msgid "Dimensionality of the entity embedding layer Defaults to `256`." 
+msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:48 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:48 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:52 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:55 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel:58 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:1 +msgid "The LukeModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. 
When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:31 +msgid "" +"Indices of entity sequence tokens in the entity vocabulary. They are " +"numerical representations of entities that build the entity input " +"sequence. Its data type should be `int64` and it has a shape of " +"[batch_size, entity_sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:35 +msgid "" +"Indices of positions of each entity sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_entity_tokens)` and dtype as int64. Defaults " +"to `None`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:39 +msgid "" +"Segment entity token indices to indicate different portions of the entity" +" inputs. Selected in the range ``[0, type_vocab_size - 1]``. If " +"`type_vocab_size` is 2, which means the inputs have two portions. Indices" +" can either be 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:44 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor will be concat with `attention_mask`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward +#: paddlenlp.transformers.luke.modeling.LukeModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:53 +msgid "" +"Returns tuple (`word_hidden_state, entity_hidden_state, pool_output`). " +"With the fields: - `word_hidden_state` (Tensor): Sequence of hidden-" +"states at the last layer of the model. It's data type should be " +"float32 and its shape is [batch_size, sequence_length, hidden_size]. - " +"`entity_hidden_state` (Tensor): Sequence of entity hidden-states at " +"the last layer of the model. It's data type should be float32 and its" +" shape is [batch_size, sequence_length, hidden_size]. - `pooled_output` " +"(Tensor): The output of first token (``) in sequence. We " +"\"pool\" the model by simply taking the hidden state corresponding to the" +" first token. Its data type should be float32 and its shape is " +"[batch_size, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:53 +msgid "Returns tuple (`word_hidden_state, entity_hidden_state, pool_output`)." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:22 +#: paddlenlp.transformers.luke.modeling.LukeModel.forward:55 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:59 +msgid "`word_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:58 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:63 +msgid "`entity_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:62 +msgid "" +"Sequence of entity hidden-states at the last layer of the model. It's " +"data type should be float32 and its shape is [batch_size, " +"sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:67 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeModel.forward:66 +msgid "" +"The output of first token (``) in sequence. We \"pool\" the model by " +"simply taking the hidden state corresponding to the first token. Its data" +" type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward +#: paddlenlp.transformers.luke.modeling.LukeModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:25 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:25 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:27 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:34 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:31 +#: paddlenlp.transformers.luke.modeling.LukeModel.forward:72 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukePretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukePretrainedModel:1 +msgid "" +"An abstract class for pretrained Luke models. It provides Luke related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukePretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification:1 +msgid "" +"The LUKE model with a span classification head on top (a linear layer on " +"top of the hidden states output) for tasks such as named entity " +"recognition." 
+msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification:4 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification:4 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification:4 +msgid "An instance of LukeModel." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification:6 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification:6 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification:6 +msgid "The number of classes." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:1 +msgid "" +"The LukeForEntitySpanClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:3 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:4 +msgid "The start position of entities in sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:3 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:5 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:9 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:11 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:13 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:15 +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:17 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:3 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:5 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:9 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:11 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:13 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:15 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:17 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:5 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:7 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:11 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:13 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:15 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:17 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:19 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:3 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:5 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:9 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:11 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:13 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:15 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:17 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:3 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:5 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:9 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:11 +#: 
paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:13 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:15 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:17 +msgid "See :class:`LukeModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:7 +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:7 +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:9 +#: paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:7 +#: paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:7 +msgid "See :class: `LukeModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntitySpanClassification.forward:22 +msgid "" +"Returns tensor `logits`, a tensor of the entity span classification " +"logits. Shape as `[batch_size, num_entities, num_classes]` and dtype as " +"float32." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification:1 +msgid "" +"The LUKE model with a classification head on top (a linear layer on top " +"of the hidden states of the two entity tokens) for entity pair " +"classification tasks, such as TACRED." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:1 +msgid "" +"The LukeForEntityPairClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityPairClassification.forward:20 +msgid "" +"Returns tensor `logits`, a tensor of the entity pair classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForEntityClassification:1 +msgid "" +"The LUKE model with a classification head on top (a linear layer on top " +"of the hidden state of the first entity token) for entity classification " +"tasks, such as Open Entity." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:1 +msgid "" +"The LukeForEntityClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.modeling.LukeForEntityClassification.forward:20 +msgid "" +"Returns tensor `logits`, a tensor of the entity classification logits. " +"Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM:1 +msgid "Luke Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM:3 +msgid "An instance of :class:`LukeModel`." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:1 +msgid "" +"The LukeForMaskedLM forward method, overrides the __call__() special " +"method." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:20 +msgid "" +"Returns tuple (``logits``, ``entity_logits``). With the fields: - " +"`logits` (Tensor): The scores of masked token prediction. Its " +"data type should be float32 and shape is [batch_size, sequence_length, " +"vocab_size]. - `entity_logits` (Tensor): The scores of masked entity" +" prediction. Its data type should be float32 and its shape is " +"[batch_size, entity_length, entity_vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:20 +msgid "Returns tuple (``logits``, ``entity_logits``)." 
+msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:26 +msgid "`logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:25 +msgid "" +"The scores of masked token prediction. Its data type should be float32 " +"and shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:29 +msgid "`entity_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForMaskedLM.forward:29 +msgid "" +"The scores of masked entity prediction. Its data type should be float32 " +"and its shape is [batch_size, entity_length, entity_vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering:1 +msgid "" +"LukeBert Model with question answering tasks. :param luke: An instance of" +" :class:`LukeModel`. :type luke: :class:`LukeModel`" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:1 +msgid "" +"The LukeForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:20 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. - " +"`end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:20 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:23 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:26 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.luke.modeling.LukeForQuestionAnswering.forward:26 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.po new file mode 100644 index 0000000000000000000000000000000000000000..d934a3c3cf16a1b79e0e9ffa4a79977a2d62846d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.luke.rst:2 +msgid "luke" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..3361b2a80d0a56855e10d40d46d6859dc93b04a5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.luke.tokenizer.po @@ -0,0 +1,262 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.luke.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer:1 +msgid "Tokenization classes for LUKE." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer`" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:1 +msgid "" +"Constructs a Luke tokenizer. It uses a basic tokenizer to do punctuation " +"splitting, lower casing and so on, and follows a WordPiece tokenizer to " +"tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.add_special_tokens +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.json') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:12 +msgid "" +"The entity vocabulary file path (ends with '.tsv') required to " +"instantiate a `EntityTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:15 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:18 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." 
+msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:22 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:25 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:28 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:31 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer:37 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_entity_vocab:1 +msgid "Get the entity vocab" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:6 +msgid "Tokenize a string." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:6 +msgid "Args:" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:3 +msgid "text (str):" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:4 +msgid "The sentence to be tokenized." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:6 +msgid "add_prefix_space (boolean, default False):" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.tokenize:6 +msgid "" +"Begin the sentence with at least one space to get invariance to word " +"order in GPT-2 (and Luke) tokenizers." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (string) in a single string." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.add_special_tokens:1 +msgid "Adding special tokens if you need." +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.add_special_tokens:3 +msgid "" +"The special token list you provided. If you provide a Dict, the key of " +"the Dict must be \"additional_special_tokens\" and the value must be " +"token list." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.convert_entity_to_id:1 +msgid "Convert the entity to id" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.entity_encode:1 +msgid "Convert the string entity to digital entity" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping:1 +msgid "" +"Returns the map of tokens and the start and end index of their start and " +"end character. Modified from " +"https://github.com/bojone/bert4keras/blob/master/bert4keras/tokenizers.py#L372" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping:4 +msgid "Input text." 
+msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping:7 +msgid "The offset map of input text." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.get_offset_mapping +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:3 +msgid "A Luke sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:11 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:13 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.luke.tokenizer.LukeTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#~ msgid "基类::class:`paddlenlp.transformers.roberta.tokenizer.RobertaTokenizer`" +#~ msgstr "" + +#~ msgid "Converts a sequence of tokens (list of string) to a list of ids." +#~ msgstr "" + +#~ msgid "A list of string representing tokens to be converted." +#~ msgstr "" + +#~ msgid "Converted ids from tokens." +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..b86ffd95ee657dc00d2fec4852c1e31433ccdfb6 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.modeling.po @@ -0,0 +1,518 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mbart.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder:1 +#: paddlenlp.transformers.mbart.modeling.MBartEncoder:1 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration:1 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering:1 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification:1 +#: paddlenlp.transformers.mbart.modeling.MBartModel:1 +msgid "基类::class:`paddlenlp.transformers.mbart.modeling.MBartPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:1 +msgid "The bare MBart Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartClassificationHead.forward +#: paddlenlp.transformers.mbart.modeling.MBartDecoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartEncoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward +#: paddlenlp.transformers.mbart.modeling.MBartModel +#: paddlenlp.transformers.mbart.modeling.MBartModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `MBartModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `MBartModel`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:13 +msgid "" +"The beginning of sequence token that was used during pretraining. Can be " +"used a sequence classifier token. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:17 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:20 +msgid "" +"A special token representing the end of a sequence that was used during " +"pretraining. Defaults to `2`." 
+msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:23 +msgid "" +"Dimensionality of the embedding layer, encoder layer and decoder layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:25 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `6`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:27 +msgid "Number of hidden layers in the Transformer decoder. Defaults to `6`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:29 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:32 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"decoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:35 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `d_model` to " +"`encoder_ffn_dim`, and then projected back to `d_model`. Typically " +"`encoder_ffn_dim` is larger than `d_model`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:40 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `d_model` to " +"`decoder_ffn_dim`, and then projected back to `d_model`. Typically " +"`decoder_ffn_dim` is larger than `d_model`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:45 +msgid "" +"The dropout probability used in all fully connected layers (pre-process " +"and post-process of MHA and FFN sub-layer) in the encoders and decoders. " +"Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:48 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:52 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"and decoder layers to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:55 +msgid "" +"The dropout probability used after FFN activation in all encoder layers " +"and decoder layers. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:58 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`1024`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel:61 +msgid "" +"The standard deviation of the truncated_normal_initializer for " +"initializing all weight matrices. Default to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:1 +msgid "The MBartModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:18 +msgid "" +"Indices of decoder input sequence tokens in the vocabulary. Its data type" +" should be `int64` and it has a shape of [batch_size, sequence_length]. " +"Defaults to `None`, which means no `decoder_input_ids` is provided, the " +"model will create the tensor by shifting the `input_ids` to the right." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:23 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions in `decoder_input_ids`. Its data type and shape is the" +" same as `attention_mask`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:26 +msgid "" +"The output of the encoder, a tuple consists `last_hidden_state`, " +"`hidden_states`(optional), `attentions`(optional). The data type of " +"`last_hidden_state` is float32 and its shape is `[batch_size, " +"sequence_length, hidden_size]`. `hidden_states` is hidden_states of all " +"layers in the Transformer encoder. The length of `hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [`batch_size, sequence_length, " +"hidden_size`]. `attentions` is attentions of all layers of in the " +"Transformer encoder. The length of `attentions` is `num_hidden_layers`. " +"For all element in the tuple, its data type should be float32 and its " +"shape is [`batch_size, num_attention_heads, sequence_length, " +"sequence_length`]." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:33 +msgid "" +"Whether or not to use cache. Defaults to `False`. If set to `True`, key " +"value states will be returned and can be used to speed up decoding." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartModel.forward:36 +msgid "" +"It is a list, and each element in the list is a tuple " +"`(incremental_cache, static_cache)`. See `TransformerDecoder.gen_cache " +"`__" +" for more details. It is only used for inference and should be None for " +"training. Default to `None`." 
+msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartEncoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward +#: paddlenlp.transformers.mbart.modeling.MBartModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:14 +#: paddlenlp.transformers.mbart.modeling.MBartModel.forward:42 +msgid "" +"Returns tensor `decoder_output`, which is the output at the last layer of" +" the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartEncoder.forward +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward +#: paddlenlp.transformers.mbart.modeling.MBartModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:31 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:32 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:23 +#: paddlenlp.transformers.mbart.modeling.MBartModel.forward:47 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartPretrainedModel:1 +msgid "" +"An abstract class for pretrained MBart models. It provides MBart related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartEncoder:1 +msgid "" +"The Transformer Encoder of MBartModel. The arguments of MBartEncoder can " +"see :class:`MBartModel`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartEncoder.forward:1 +msgid "" +"The MBartEncoder forward method, overrides the `__call__()` special " +"method." 
+msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:3 +#: paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:5 +#: paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:7 +#: paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:9 +#: paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:11 +#: paddlenlp.transformers.mbart.modeling.MBartEncoder.forward:3 +#: paddlenlp.transformers.mbart.modeling.MBartEncoder.forward:5 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:3 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:5 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:7 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:9 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:11 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:13 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:15 +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:27 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:3 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:5 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:7 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:9 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:11 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:13 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:15 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:3 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:5 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:7 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:9 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:11 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:13 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:15 +msgid "See :class:`MBartModel`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartEncoder.forward:8 +msgid "" +"Returns tensor `encoder_output`, which is the output at the last layer of" +" the model. Its data type should be float32 and has a shape of " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder:1 +msgid "" +"The Transformer Decoder of MBartModel. The arguments of MBartDecoder can " +"see :class:`MBartModel`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartDecoder.forward:1 +msgid "" +"The MBartDecoder forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartClassificationHead:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartClassificationHead:1 +msgid "Head for sentence-level classification tasks." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartClassificationHead.forward:1 +msgid "Hidden states of the classification model." 
+msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification:1 +msgid "" +"MBart Model with a linear layer on top of the pooled output, designed for" +" sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration:4 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering:4 +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification:4 +msgid "An instance of MBartModel." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification:6 +msgid "The number of different labels. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification:8 +msgid "" +"The dropout probability for output of MBart. If None, use the same value " +"as `hidden_dropout_prob` of `MBartModel` instance `mbart`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:1 +msgid "" +"The MBartForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForSequenceClassification.forward:18 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_labels]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering:1 +msgid "" +"MBart Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:1 +msgid "" +"The MBartForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:18 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:18 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:20 +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:20 +msgid "With the fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:24 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:23 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:27 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForQuestionAnswering.forward:27 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration:1 +msgid "" +"MBart Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD ." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:1 +msgid "" +"The MBartForConditionalGeneration forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:18 +msgid "" +"Returns Tensor `lm_logits` if `use_cache` is `False`, otherwise, returns " +"tuple (`lm_logits`, `cache`). With the fields: - `lm_logits` (Tensor):" +" The generated sentence of the model. Its data type should be " +"float32 and has a shape of [batch_size, sequence_length, vocab_size]. - " +"`cache` (Tensor): See :class:`MBartModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:18 +msgid "" +"Returns Tensor `lm_logits` if `use_cache` is `False`, otherwise, returns " +"tuple (`lm_logits`, `cache`)." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:24 +msgid "`lm_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:23 +msgid "" +"The generated sentence of the model. Its data type should be float32 and " +"has a shape of [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.modeling.MBartForConditionalGeneration.forward:26 +msgid "`cache` (Tensor):" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.po new file mode 100644 index 0000000000000000000000000000000000000000..59ecf1f1db7ff6ee1d9393c6c87b5408a062c2a6 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mbart.rst:2 +msgid "mbart" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..9d2b9fd6c7faf7f0f1855c0d5dee9d928880658e --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mbart.tokenizer.po @@ -0,0 +1,91 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mbart.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.tokenize:1 +msgid "Tokenize a string." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.convert_ids_to_string:1 +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (strings for sub-words) in a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.get_special_tokens_mask:1 +msgid "Retrieve sequence ids from a token list that has no special tokens added." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. An MBART" +" sequence has the following format, where ``X`` represents the sequence:" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.build_inputs_with_special_tokens:4 +msgid "``input_ids`` (for encoder) ``X [eos, src_lang_code]``" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.build_inputs_with_special_tokens:5 +msgid "``decoder_input_ids``: (for decoder) ``X [eos, tgt_lang_code]``" +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.build_inputs_with_special_tokens:7 +msgid "" +"BOS is never used. Pairs of sequences are not the expected use case, but " +"they will be handled without a separator." +msgstr "" + +#: of +#: paddlenlp.transformers.mbart.tokenizer.MBartTokenizer.as_target_tokenizer:1 +msgid "" +"Temporarily sets the tokenizer for encoding the targets. Useful for " +"tokenizer associated to sequence-to-sequence models that need a slightly " +"different processing for the labels." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..9a2181b19282af23ac0466fef840f6c45317cd74 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.modeling.po @@ -0,0 +1,686 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.megatronbert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification:1 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:1 +msgid "基类::class:`paddlenlp.transformers.megatronbert.modeling.MegatronBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:1 +msgid "The bare MegatronBert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `MegatronBertModel`. Also is the vocab" +" size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`MegatronBert`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:13 +msgid "Dimensionality of the encoder layer and pooler layer. Defaults to `1024`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:15 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:18 +msgid "The vocabulary size of `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:21 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:25 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:28 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:31 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `24`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:33 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." 
+msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:36 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:39 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `4096`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:44 +msgid "Type of position embedding. Defaults to \"absolute\"" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:46 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`MegatronBertPretrainedModel.init_weights()`" +" for how weights are initialized in `MegatronBertModel`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:46 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel:50 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`MegatronBertPretrainedModel.init_weights()` for how weights " +"are initialized in `MegatronBertModel`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:1 +msgid "" +"The MegatronBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. 
Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1. - **1** for tokens that **not " +"masked**, - **0** for tokens that **masked**. It is a tensor with shape " +"broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:27 +msgid "**1** for tokens that **not masked**," +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:28 +msgid "**0** for tokens that **masked**." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:30 +msgid "" +"It is a tensor with shape broadcasted to `[batch_size, " +"num_attention_heads, sequence_length, sequence_length]`. Defaults to " +"`None`, which means nothing needed to be prevented attention to." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:34 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`). With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:34 +msgid "Returns tuple (`sequence_output`, `pooled_output`)." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:14 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:14 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:36 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:40 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:39 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:44 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:43 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:21 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:21 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:19 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:19 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:27 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:26 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:16 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:19 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertModel.forward:49 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained MegatronBert models. It provides RoBerta" +" related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. 
See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering:1 +msgid "MegatronBert Model with question answering tasks." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification:3 +msgid "An instance of :class:`MegatronBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:1 +msgid "" +"The MegatronBertForQuestionAnswering forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:9 +#: 
paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:9 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:3 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:7 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:9 +msgid "See :class:`MegatronBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:12 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:18 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:21 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForQuestionAnswering.forward:21 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification:1 +msgid "MegatronBert Model with sequence classification tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification:5 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification:5 +msgid "The number of labels." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:1 +msgid "" +"The MegatronBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForSequenceClassification.forward:12 +msgid "Returns tensor `logits`, a tensor of the sequence classification logits." 
+msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction:1 +msgid "" +"MegatronBert Model with a `next sentence prediction (classification)` " +"head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:1 +msgid "" +"The MegatronBertForNextSentencePrediction forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:12 +msgid "" +"Returns Tensor `seq_relationship_scores`. The scores of next sentence " +"prediction. Its data type should be float32 and its shape is " +"[batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:14 +msgid "" +"Returns Tensor `seq_relationship_scores`. The scores of next sentence " +"prediction." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForNextSentencePrediction.forward:15 +msgid "Its data type should be float32 and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM:1 +msgid "MegatronBert Model with a `causal masked language modeling` head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:1 +msgid "" +"The MegatronBertForCausalLM forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:12 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:12 +msgid "" +"Returns Tensor `prediction_scores`. The scores of masked token " +"prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, " +"sequence_length, vocab_size]. Otherwise, its shape is " +"[batch_size, mask_token_num, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:16 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:16 +msgid "Returns Tensor `prediction_scores`. The scores of masked token prediction." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForCausalLM.forward:15 +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:15 +msgid "" +"Its data type should be float32. If `masked_positions` is None, its shape" +" is [batch_size, sequence_length, vocab_size]. Otherwise, its shape is " +"[batch_size, mask_token_num, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining:1 +msgid "Megatronbert Model with pretraining tasks on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:1 +msgid "" +"The MegatronBertForPreTraining forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:12 +msgid "" +"Returns tuple (`prediction_scores`, `seq_relationship_score`). With the " +"fields: - `prediction_scores` (Tensor): The scores of masked token " +"prediction. Its data type should be float32. If `masked_positions` is" +" None, its shape is [batch_size, sequence_length, vocab_size]. " +"Otherwise, its shape is [batch_size, mask_token_num, vocab_size]. 
- " +"`seq_relationship_score` (Tensor): The scores of next sentence " +"prediction. Its data type should be float32 and its shape is " +"[batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:12 +msgid "Returns tuple (`prediction_scores`, `seq_relationship_score`)." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:19 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:17 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:22 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForPreTraining.forward:22 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM:1 +msgid "MegatronBert Model with a `masked language modeling` head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMaskedLM.forward:1 +msgid "" +"The MegatronBertForMaskedLM forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice:1 +msgid "MegatronBert Model with a multiple choice classification head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:1 +msgid "" +"The MegatronBertForMultipleChoice forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:12 +msgid "" +"Returns Tensor `reshaped_logits`. A tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and " +"dtype as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:14 +msgid "" +"Returns Tensor `reshaped_logits`. A tensor of the multiple choice " +"classification logits." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForMultipleChoice.forward:15 +msgid "Shape as `[batch_size, num_choice]` and dtype as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification:1 +msgid "MegatronBert Model with a token classification head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:1 +msgid "" +"The MegatronBertForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and" +" dtype as `float32`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:14 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits." +msgstr "" + +#: of +#: paddlenlp.transformers.megatronbert.modeling.MegatronBertForTokenClassification.forward:15 +msgid "" +"Shape as `[batch_size, sequence_length, num_classes]` and dtype as " +"`float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.po new file mode 100644 index 0000000000000000000000000000000000000000..74024acea98d21c0b82eba8ed42a76810cf152ec --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.megatronbert.rst:2 +msgid "megatronbert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..39359ac4a5110a184c21bbe5bb4cc6864d64f90d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.megatronbert.tokenizer.po @@ -0,0 +1,88 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.megatronbert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer:1 +msgid "Tokenization classes for MegatronBert." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:1 +msgid "" +"Constructs a MegatronBert tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:5 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:8 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." 
+msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:11 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:15 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:18 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:21 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:24 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.megatronbert.tokenizer.MegatronBertTokenizer:30 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..02165c7089d29a8c9a216d33a184ffebf6435444 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.modeling.po @@ -0,0 +1,538 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mobilebert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining:1 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering:1 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification:1 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel:1 +msgid "基类::class:`paddlenlp.transformers.mobilebert.modeling.MobileBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:1 +msgid "" +"The bare MobileBert Model transformer outputting raw hidden-states. This " +"model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods. This model is also " +"a Paddle `paddle.nn.Layer `__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:8 +msgid "" +"Vocabulary size of `inputs_ids` in `MobileBertModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`MobileBertModel`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:11 +msgid "" +"Embedding dimensionality of lookup_table in the embedding layer. Defaults" +" to `128`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `512`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:15 +msgid "Dimensionality of input_tensor in self attention layer. Defaults to `128`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:17 +msgid "" +"Using bottleneck to value tensor in self attention layer. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:19 +msgid "Key and query shared bottleneck layer. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:21 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `24`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:23 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `4`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:26 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `512`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:31 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"relu\"`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:35 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:38 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." 
+msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:41 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:44 +msgid "The vocabulary size of `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:47 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`MobileBertPretrainedModel.init_weights()` " +"for how weights are initialized in `MobileBertModel`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:47 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note::" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:53 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:56 +msgid "Adding the pooling Layer after the encoder layer. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel:58 +msgid "" +"Using the non-linear activation function in the pooling layer. Defaults " +"to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask:1 +msgid "Prepare the head mask if needed." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask:3 +msgid "" +"The mask indicating if we should keep the heads or not (1.0 for keep, 0.0" +" for discard)." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask:5 +msgid "The number of hidden layers in the model." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask:7 +msgid "" +"(:obj:`bool`, `optional`, defaults to :obj:`False`): Whether or not the " +"attentions scores are computed by chunks or not." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.get_head_mask:10 +msgid "" +":obj:`paddle.Tensor` with shape :obj:`[num_hidden_layers x batch x " +"num_heads x seq_length x seq_length]` or list with :obj:`[None]` for each" +" layer." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:1 +msgid "" +"The MobileBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape of" +" [batch_size, sequence_length]. Defaults to `None`, which means we don't " +"add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:16 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:20 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:29 +msgid "" +"The mask indicating if we should keep the heads or not (1.0 for keep, 0.0" +" for discard). Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:31 +msgid "Whether to return the output of each hidden layers. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:34 +msgid "" +"Whether to return the output of each self attention layers. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:38 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`). With the fields: - `sequence_output` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. - `pooled_output` (Tensor): The output of first token " +"(`[CLS]`) in sequence. We \"pool\" the model by simply taking the " +"hidden state corresponding to the first token. Its data type should " +"be float32 and its shape is [batch_size, hidden_size]. - " +"`encoder_outputs` (List(Tensor)): A list of Tensor containing hidden-" +"states of the model at each hidden layer in the Transformer encoder. " +"The length of the list is `num_hidden_layers`. Each Tensor has a data" +" type of float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:38 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`). 
With the fields: - `sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:41 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:45 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:44 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:49 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:48 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and its shape" +" is [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:39 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:25 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertModel.forward:54 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained MobileBert models. It provides " +"MobileBert related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining:1 +msgid "MobileBert Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining:3 +msgid "An instance of :class:`MobileBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:1 +msgid "" +"The MobileBertForPreTraining forward method, overrides the __call__() " +"special method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:3 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:5 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:7 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:9 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:11 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:13 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:15 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:17 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:7 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:9 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:11 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:13 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:15 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:17 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:7 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:9 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:11 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:13 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:15 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:17 +msgid "See :class:`MobileBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:20 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next " +"sentence prediction. Its data type should be float32 and its shape is" +" [batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:20 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:23 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:27 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForPreTraining.forward:27 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification:1 +msgid "" +"MobileBert Model with a linear layer on top of the output layer, designed" +" for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering:4 +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification:4 +msgid "An instance of MobileBert." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:1 +msgid "" +"The MobileBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForSequenceClassification.forward:20 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering:1 +msgid "" +"MobileBert Model with a linear layer on top of the hidden-states output " +"to compute `span_start_logits` and `span_end_logits`, designed for " +"question-answering tasks like SQuAD." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:1 +msgid "" +"The MobileBertForQuestionAnswering forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:19 +msgid "" +"Labels for position (index) of the start of the labelled span for " +"computing the token classification loss. Positions are clamped to the " +"length of the sequence (:obj:`sequence_length`). Position outside of the " +"sequence are not taken into account for computing the loss." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:23 +msgid "" +"Labels for position (index) of the end of the labelled span for computing" +" the token classification loss. Positions are clamped to the length of " +"the sequence (:obj:`sequence_length`). Position outside of the sequence " +"are not taken into account for computing the loss." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:28 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. - " +"`end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:28 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:31 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:34 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.modeling.MobileBertForQuestionAnswering.forward:34 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.po new file mode 100644 index 0000000000000000000000000000000000000000..37a54f0d9800da5fb15a216d7034265f5cd2117b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mobilebert.rst:2 +msgid "mobilebert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..204f416d4d503a192fecafc98f2652421b8f2767 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mobilebert.tokenizer.po @@ -0,0 +1,256 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mobilebert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer:1 +msgid "" +"Construct a MobileBERT tokenizer. " +":class:`~paddlenlp.transformers.MobileBertTokenizer is identical to " +":class:`~paddlenlp.transformers.BertTokenizer` and runs end-to-end " +"tokenization: punctuation splitting and wordpiece. 
Refer to superclass " +":class:`~~paddlenlp.transformers.BertTokenizer` for usage examples and " +"documentation concerning parameters." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports batch inputs of sequence or sequence pair." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:4 +msgid "" +"The element of list can be sequence or sequence pair, and the sequence is" +" a string or a list of strings depending on whether it has been " +"pretokenized. If each sequence is provided as a list of strings " +"(pretokenized), you must set `is_split_into_words` as `True` to " +"disambiguate with a sequence pair." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:10 +msgid "" +"If set to a number, will limit the total sequence returned so that it has" +" a maximum length. If there are overflowing tokens, those overflowing " +"tokens will be added to the returned dictionary when " +"`return_overflowing_tokens` is `True`. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:15 +msgid "" +"Only available for batch input of sequence pair and mainly for question " +"answering usage. When for QA, `text` represents questions and `text_pair`" +" represents contexts. If `stride` is set to a positive number, the " +"context will be split into multiple spans where `stride` defines the " +"number of (tokenized) tokens to skip from the start of one span to get " +"the next span, thus will produce a bigger batch than inputs to include " +"all spans. Moreover, 'overflow_to_sample' and 'offset_mapping' preserving" +" the original example and position information will be added to the " +"returned dictionary. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:25 +msgid "" +"If set to `True`, the returned sequences would be padded up to " +"`max_seq_len` specified length according to padding side " +"(`self.padding_side`) and padding token id. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:29 +msgid "" +"String selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"`max_seq_len` starting from the longest one at each token (when there is " +"a pair of input sequences). - 'only_first': Only truncate the first " +"sequence. - 'only_second': Only truncate the second sequence. - " +"'do_not_truncate': Do not truncate (raise an error if the input sequence " +"is longer than `max_seq_len`). Defaults to 'longest_first'." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:39 +msgid "" +"Whether to include tokens position ids in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:42 +msgid "" +"Whether to include token type ids in the returned dictionary. Defaults to" +" `True`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:45 +msgid "" +"Whether to include the attention mask in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:48 +msgid "" +"Whether to include the length of each encoded inputs in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:51 +msgid "" +"Whether to include overflowing token information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:54 +msgid "" +"Whether to include special tokens mask information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:58 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` " +"works. - **overflow_to_sample** (int, optional): Index of example from " +"which this feature is generated. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:58 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:61 +msgid "fed to a model. Included when `return_position_ids` is `True`" +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:62 +msgid "" +"**token_type_ids** (list[int], optional): List of token type ids to be " +"fed to a model. Included when `return_token_type_ids` is `True`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:64 +msgid "" +"**attention_mask** (list[int], optional): List of integers valued 0 or 1," +" where 0 specifies paddings and should not be attended to by the model. " +"Included when `return_attention_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:67 +msgid "" +"**seq_len** (int, optional): The input_ids length. Included when " +"`return_length` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:69 +msgid "" +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:72 +msgid "" +"**num_truncated_tokens** (int, optional): The number of overflowing " +"tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:75 +msgid "" +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:78 +msgid "" +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode:82 +msgid "" +"**overflow_to_sample** (int, optional): Index of example from which this " +"feature is generated. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.mobilebert.tokenizer.MobileBertTokenizer.batch_encode +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.model_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.model_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..4c0a2579baf4e2f0b6fe49275038777732f78f81 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.model_utils.po @@ -0,0 +1,258 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.model_utils.rst:2 +msgid "model\\_utils" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:1 +msgid "" +"The base class for all pretrained models. It mainly provides common " +"methods for loading (construction and loading) and saving pretrained " +"models. 
Loading and saving also rely on the following class attributes " +"which should be overridden by derived classes accordingly:" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:6 +msgid "" +"**model_config_file** (str): Represents the file name of model " +"configuration for configuration saving and loading in local file system. " +"The value is `model_config.json`." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:9 +msgid "" +"**resource_files_names** (dict): Name of local file where the model " +"configuration can be saved and loaded locally. Currently, resources only " +"include the model state, thus the dict only includes `'model_state'` as " +"key with corresponding value `'model_state.pdparams'` for model weights " +"saving and loading." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:13 +msgid "" +"**pretrained_init_configuration** (dict): Provides the model " +"configurations of built-in pretrained models (contrasts to models in " +"local file system). It has pretrained model names as keys (such as `bert-" +"base-uncased`), and the values are dict preserving corresponding " +"configuration for model initialization." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:17 +msgid "" +"**pretrained_resource_files_map** (dict): Provides resource URLs of " +"built-in pretrained models (contrasts to models in local file system). It" +" has the same key as resource_files_names (that is \"model_state\"), and " +"the corresponding value is a dict with specific model name to model " +"weights URL mapping (such as \"bert-base-uncased\" -> " +"\"https://bj.bcebos.com/paddlenlp/models/transformers/bert-base-" +"uncased.pdparams\")." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:23 +msgid "" +"**base_model_prefix** (str): Represents the attribute associated to the " +"base model in derived classes of the same architecture adding layers on " +"top of the base model. Note: A base model class is pretrained model class" +" decorated by `register_base_model`, such as `BertModel`; A derived model" +" class is a pretrained model class adding layers on top of the base " +"model, and it has a base model as attribute, such as " +"`BertForSequenceClassification`." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:30 +msgid "" +"Methods common to models for text generation are defined in " +"`GenerationMixin` and also inherited here." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel:33 +msgid "" +"Besides, metaclass `InitTrackerMeta` is used to create `PretrainedModel`," +" by which subclasses can track arguments for initialization " +"automatically." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.base_model:1 +msgid "" +"The body of the same model architecture. It is the base model itself for " +"base model or the base model attribute for derived model." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.base_model +#: paddlenlp.transformers.model_utils.PretrainedModel.model_name_list +msgid "type" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.base_model:5 +msgid "PretrainedModel" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.model_name_list:1 +msgid "" +"Contains all supported built-in pretrained model names of the current " +"PretrainedModel class." 
+msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.model_name_list:4 +msgid "list" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:1 +msgid "" +"Creates an instance of `PretrainedModel`. Model weights are loaded by " +"specifying name of a built-in pretrained model, or a community " +"contributed model, or a local file directory path." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained +#: paddlenlp.transformers.model_utils.PretrainedModel.save_model_config +#: paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained +#: paddlenlp.transformers.model_utils.register_base_model +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:5 +msgid "" +"Name of pretrained model or dir path to load from. The string can be: - " +"Name of a built-in pretrained model - Name of a community-contributed " +"pretrained model. - Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:5 +msgid "Name of pretrained model or dir path to load from. The string can be:" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:8 +msgid "Name of a built-in pretrained model" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:9 +msgid "Name of a community-contributed pretrained model." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:10 +msgid "" +"Local directory path which contains model weights " +"file(\"model_state.pdparams\") and model config file " +"(\"model_config.json\")." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:13 +msgid "" +"Position arguments for model `__init__`. If provided, use these as " +"position argument values for model initialization." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:16 +msgid "" +"Keyword arguments for model `__init__`. If provided, use these to update " +"pre-defined keyword argument values for model initialization. If the " +"keyword is in `__init__` argument names of base model, update argument " +"values of the base model; else update argument values of derived model." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:22 +msgid "" +"The weights read in can be choosed to place on CPU or GPU though the " +"model is on the default device. If `True`, load the model weights as " +"`numpy.ndarray` on CPU. Otherwise, weights would be loaded as tensors on " +"the default device. Note that if on GPU, the latter would creates extra " +"temporary tensors in addition to the model weights, which doubles the " +"memory usage . Thus it is suggested to use `True` for big models on GPU. " +"Default to `False`." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained +#: paddlenlp.transformers.model_utils.PretrainedModel.get_model_config +#: paddlenlp.transformers.model_utils.register_base_model +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:32 +msgid "An instance of `PretrainedModel`." 
+msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained +#: paddlenlp.transformers.model_utils.PretrainedModel.get_model_config +#: paddlenlp.transformers.model_utils.register_base_model +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.from_pretrained:36 +#: paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained:13 +#: paddlenlp.transformers.model_utils.register_base_model:13 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.get_model_config:1 +msgid "Get model configuration." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.get_model_config:3 +msgid "The config of the model." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.save_model_config:1 +msgid "" +"Saves model configuration to a file named \"model_config.json\" under " +"`save_dir`." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.save_model_config:3 +msgid "Directory to save model_config file into." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained:1 +msgid "" +"Saves model configuration and related resources (model state) as files " +"under `save_dir`. The model configuration would be saved into a file " +"named \"model_config.json\", and model state would be saved into a file " +"named \"model_state.pdparams\"." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained:6 +msgid "" +"The `save_dir` can be used in `from_pretrained` as argument value of " +"`pretrained_model_name_or_path` to re-load the trained model." +msgstr "" + +#: of paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained:9 +msgid "Directory to save files into." +msgstr "" + +#: of paddlenlp.transformers.model_utils.register_base_model:1 +msgid "" +"A decorator for `PretrainedModel` class. It first retrieves the parent " +"class of the class being decorated, then sets the `base_model_class` " +"attribute of that parent class to be the class being decorated. In " +"summary, the decorator registers the decorated class as the base model " +"class in all derived classes under the same architecture." +msgstr "" + +#: of paddlenlp.transformers.model_utils.register_base_model:6 +msgid "The class (inherited from PretrainedModel) to be decorated ." +msgstr "" + +#: of paddlenlp.transformers.model_utils.register_base_model:9 +msgid "The input class `cls` after decorating." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..ea77223e7f09950a56ca1f0b872c34a8a65bbd5b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.modeling.po @@ -0,0 +1,521 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mpnet.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetModel:1 +msgid "基类::class:`paddlenlp.transformers.mpnet.modeling.MPNetPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:1 +msgid "The bare MPNet Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM +#: paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetModel +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `MPNetModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `MPNetModel`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. 
Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`514`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:38 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`MPNetPretrainedModel.init_weights()` for " +"how weights are initialized in `MPNetModel`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:38 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:42 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`MPNetPretrainedModel.init_weights()` for how weights are " +"initialized in `MPNetModel`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:45 +msgid "The number of buckets to use for each attention layer. Defaults to `32`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:48 +msgid "The epsilon used by the layer normalization layers. Defaults to `1e-5`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel:51 +msgid "The index of padding token in the token vocabulary. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:1 +msgid "The MPNetModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:7 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:11 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1. 
- **1** for tokens that **not " +"masked**, - **0** for tokens that **masked**. It is a tensor with shape " +"broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:11 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:16 +msgid "**1** for tokens that **not masked**," +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:17 +msgid "**0** for tokens that **masked**." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:19 +msgid "" +"It is a tensor with shape broadcasted to `[batch_size, " +"num_attention_heads, sequence_length, sequence_length]`. Defaults to " +"`None`, which means nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:23 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`). With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (``) in sequence. We \"pool\" the " +"model by simply taking the hidden state corresponding to the first token." +" Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:23 +msgid "Returns tuple (`sequence_output`, `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:12 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:12 +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:25 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:20 +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:29 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:28 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:33 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:32 +msgid "" +"The output of first token (``) in sequence. We \"pool\" the model by " +"simply taking the hidden state corresponding to the first token. Its data" +" type should be float32 and its shape is [batch_size, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:15 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:24 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:15 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:15 +#: paddlenlp.transformers.mpnet.modeling.MPNetModel.forward:38 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetPretrainedModel:1 +msgid "" +"An abstract class for pretrained MPNet models. It provides MPNet related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM:1 +msgid "MPNet Model with a `language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM:3 +msgid "An instance of :class:`MPNetModel`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:1 +#: paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:3 +#: paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:5 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:3 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:5 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:7 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:3 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:5 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:7 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:3 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:5 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:7 +msgid "See :class:`MPNetModel`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:7 +msgid "" +"The Labels for computing the masked language modeling loss. Indices " +"should be in ``[-100, 0, ..., vocab_size]`` Tokens with indices set to " +"``-100`` are ignored (masked), the loss is only computed for the tokens " +"with labels in ``[0, ..., vocab_size]`` Its shape is [batch_size, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:10 +msgid "" +"Returns tuple (`masked_lm_loss`, `prediction_scores`, " +"``sequence_output`). With the fields: - `masked_lm_loss` (Tensor): " +"The masked lm loss. Its data type should be float32 and its shape is [1]." 
+" - `prediction_scores` (Tensor): The scores of masked token " +"prediction. Its data type should be float32. Its shape is [batch_size, " +"sequence_length, vocab_size]. - `sequence_output` (Tensor): Sequence" +" of hidden-states at the last layer of the model. Its data type should be" +" float32. Its shape is `[batch_size, sequence_length, hidden_size]`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:10 +msgid "Returns tuple (`masked_lm_loss`, `prediction_scores`, ``sequence_output`)." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:15 +msgid "`masked_lm_loss` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:15 +msgid "The masked lm loss. Its data type should be float32 and its shape is [1]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:18 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:18 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"Its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMaskedLM.forward:21 +msgid "" +"Sequence of hidden-states at the last layer of the model. Its data type " +"should be float32. Its shape is `[batch_size, sequence_length, " +"hidden_size]`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification:1 +msgid "" +"MPNet Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice:4 +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering:4 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification:4 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification:4 +msgid "An instance of MPNetModel." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering:6 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification:6 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice:8 +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification:8 +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification:8 +msgid "" +"The dropout probability for output of MPNet. If None, use the same value " +"as `hidden_dropout_prob` of `MPNetModel` instance `mpnet`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:1 +msgid "" +"The MPNetForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice:1 +msgid "" +"MPNet Model with a linear layer on top of the hidden-states output layer," +" designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." 
+msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:1 +msgid "" +"The MPNetForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:3 +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:5 +#: paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:7 +msgid "" +"See :class:`MPNetModel` and shape as [batch_size, num_choice, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForMultipleChoice.forward:10 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification:1 +msgid "" +"MPNet Model with a linear layer on top of the hidden-states output layer," +" designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:1 +msgid "" +"The MPNetForTokenClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering:1 +msgid "" +"MPNet Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:1 +msgid "" +"The MPNetForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.modeling.MPNetForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. 
Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.po new file mode 100644 index 0000000000000000000000000000000000000000..433d59d822d107262bce9b1d1298852c0e7bbee3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mpnet.rst:2 +msgid "mpnet" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..011b07f96ccf938585fe935d60048bb6739c8c2d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.mpnet.tokenizer.po @@ -0,0 +1,135 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.mpnet.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer:1 +msgid "" +"Construct a MPNet tokenizer which is almost identical to `BertTokenizer`." +" For more information regarding those methods, please refer to this " +"superclass." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:4 +msgid "A MPNet sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: `` X ``" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: `` A B ``" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences:6 +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences:4 +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Creates a mask from the two sequences passed to be used in a sequence-" +"pair classification task. MPNet does not make use of token type ids, " +"therefore a list of zeros is returned." 
+msgstr "" + +#: of +#: paddlenlp.transformers.mpnet.tokenizer.MPNetTokenizer.create_token_type_ids_from_sequences:9 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..faae4fc0fa755ca69dbb96005d6020639f278c45 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.modeling.po @@ -0,0 +1,600 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.nezha.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaModel:1 +msgid "基类::class:`paddlenlp.transformers.nezha.modeling.NeZhaPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:1 +msgid "The bare NeZha Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice +#: paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaModel +#: paddlenlp.transformers.nezha.modeling.NeZhaModel.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `DistilBertModel`. 
Defines the number " +"of different tokens that can be represented by the `inputs_ids` passed " +"when calling `DistilBertModel`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and the pooler " +"layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:38 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:41 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`NeZhaPretrainedModel.init_weights()` for " +"how weights are initialized in `NeZhaModel`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:41 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`NeZhaPretrainedModel.init_weights()` for how weights are " +"initialized in `NeZhaModel`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:48 +msgid "" +"The maximum value of the dimensionality of relative encoding, which " +"dictates the maximum supported relative distance of two sentences. " +"Defaults to `64`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:52 +msgid "" +"The small value added to the variance in `LayerNorm` to prevent division " +"by zero. Defaults to `1e-12`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel:55 +msgid "Whether or not to use relative position embedding. Defaults to `True`." 
+msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:1 +msgid "The NeZhaModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:18 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. We " +"use whole-word-mask in NeZha, so the whole word will have the same value." +" For example, \"使用\" as a word, \"使\" and \"用\" will have the same value." +" Defaults to `None`, which means nothing needed to be prevented attention" +" to." 
+msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaModel.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:32 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`). With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:32 +msgid "Returns tuple (`sequence_output`, `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:17 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:12 +#: paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:34 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:11 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:38 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:37 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:1 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:42 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:41 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:4 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaModel.forward +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:24 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:15 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:15 +#: paddlenlp.transformers.nezha.modeling.NeZhaModel.forward:47 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainedModel:1 +msgid "" +"An abstract class for pretrained NeZha models. It provides NeZha related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining:1 +msgid "NeZha Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining:3 +msgid "An instance of :class:`NeZhaModel`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward:5 +#: paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward:7 +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:1 +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:5 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:5 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:7 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:5 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:7 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:5 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:7 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:3 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:5 +msgid "See :class:`NeZhaModel`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:7 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64 and its shape is " +"[batch_size, sequence_length, 1]." 
+msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:10 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:14 +msgid "" +"Returns Tensor ``total_loss`` if `masked_lm_labels` is not None. Returns " +"tuple (``prediction_scores``, ``seq_relationship_score``) if " +"`masked_lm_labels` is None. With the fields: - `total_loss` (Tensor):" +" - `prediction_scores` (Tensor): The scores of masked token " +"prediction. Its data type should be float32. If `masked_positions` is" +" None, its shape is [batch_size, sequence_length, vocab_size]. " +"Otherwise, its shape is [batch_size, mask_token_num, vocab_size]. - " +"`seq_relationship_score` (Tensor): The scores of next sentence " +"prediction. Its data type should be float32 and its shape is " +"[batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:14 +msgid "" +"Returns Tensor ``total_loss`` if `masked_lm_labels` is not None. Returns " +"tuple (``prediction_scores``, ``seq_relationship_score``) if " +"`masked_lm_labels` is None." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:19 +msgid "`total_loss` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:25 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:16 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:23 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:14 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:28 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:19 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForPretraining.forward:28 +#: paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:19 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification:1 +msgid "" +"NeZha Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice:4 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering:4 +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification:4 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification:4 +msgid "An instance of NeZhaModel." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification:6 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification:8 +msgid "" +"The dropout probability for output of NeZha. 
If None, use the same value " +"as `hidden_dropout_prob` of `NeZhaModel` instance `nezha`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:1 +msgid "" +"The NeZhaForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:1 +msgid "Perform language modeling task and next sentence classification task." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:7 +msgid "Activation function used in the language modeling task." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads:9 +msgid "" +"Decoding weights used to map hidden_states to logits of the masked token " +"prediction. Its data type should be float32 and its shape is [vocab_size," +" hidden_size]. Defaults to `None`, which means use the same weights of " +"the embedding layer." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:9 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next" +" sentence prediction. Its data type should be float32 and its shape " +"is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaPretrainingHeads.forward:9 +msgid "Returns tuple (``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification:1 +msgid "" +"NeZha Model with a linear layer on top of the hidden-states output layer," +" designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice:8 +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering:6 +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification:8 +msgid "" +"The dropout probability for output of NeZha. If None, use the same value " +"as `hidden_dropout_prob` of `NeZhaModel` instance `nezha`. Defaults to " +"`None`." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:1 +msgid "" +"The NeZhaForTokenClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForTokenClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering:1 +msgid "" +"NeZha with a linear layer on top of the hidden-states output to compute " +"`span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." 
+msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:1 +msgid "" +"The NeZhaForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:10 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:10 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:16 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:15 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:19 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.modeling.NeZhaForQuestionAnswering.forward:19 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice:1 +msgid "" +"NeZha Model with a linear layer on top of the hidden-states output layer," +" designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice:6 +msgid "The number of choices. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward:1 +msgid "" +"The NeZhaForMultipleChoice forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.nezha.modeling.NeZhaForMultipleChoice.forward:10 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the input multiple choice " +"classification logits. Shape as `[batch_size, num_classes]` and dtype as " +"`float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.po new file mode 100644 index 0000000000000000000000000000000000000000..8428ce2ba07414c2f71ca74f05ed13e7c53aed87 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.nezha.rst:2
+msgid "nezha"
+msgstr ""
+
diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.tokenizer.po
new file mode 100644
index 0000000000000000000000000000000000000000..d1b64ed92e9b23927afdbd373dfd2bcc812f790c
--- /dev/null
+++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.nezha.tokenizer.po
@@ -0,0 +1,302 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2021, PaddleNLP
+# This file is distributed under the same license as the PaddleNLP package.
+# FIRST AUTHOR , 2022.
+#
+#, fuzzy
+msgid ""
+msgstr ""
+"Project-Id-Version: PaddleNLP \n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-03-18 21:31+0800\n"
+"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
+"Last-Translator: FULL NAME \n"
+"Language-Team: LANGUAGE \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.0\n"
+
+#: ../source/paddlenlp.transformers.nezha.tokenizer.rst:2
+msgid "tokenizer"
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:1
+msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`"
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:1
+msgid ""
+"Constructs a NeZha tokenizer. It uses a basic tokenizer to do punctuation"
+" splitting, lower casing and so on, and follows a WordPiece tokenizer to "
+"tokenize as subwords."
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:5
+msgid ""
+"This tokenizer inherits from "
+":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` "
+"which contains most of the main methods. For more information regarding "
+"those methods, please refer to this superclass."
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask
+#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add
+msgid "参数"
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:9
+msgid ""
+"The vocabulary file path (ends with '.txt') required to instantiate a "
+"`WordpieceTokenizer`."
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:12
+msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`."
+msgstr ""
+
+#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:15
+msgid ""
+"A special token representing the *unknown (out-of-vocabulary)* token. An "
+"unknown token is set to be `unk_token` in order to be converted to an ID. "
+"Defaults to \"[UNK]\"."
+msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:19 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:25 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer:34 +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also removes " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:4 +msgid "A NeZha sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A NeZha offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of wordpiece offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." 
+msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:3 +msgid "A NeZha sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:13 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.nezha.tokenizer.NeZhaTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.optimization.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.optimization.po new file mode 100644 index 0000000000000000000000000000000000000000..7ba6a123422c649f776b0bd6f7b04e7e4333f01f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.optimization.po @@ -0,0 +1,152 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.optimization.rst:2 +msgid "optimization" +msgstr "" + +#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:1 +#: paddlenlp.transformers.optimization.CosineDecayWithWarmup:1 +#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:1 +#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:1 +msgid "基类::class:`paddle.optimizer.lr.LambdaDecay`" +msgstr "" + +#: of paddlenlp.transformers.optimization.LinearDecayWithWarmup:1 +msgid "" +"Creates a learning rate scheduler, which increases learning rate linearly" +" from 0 to given `learning_rate`, after this warmup period learning rate " +"would be decreased linearly from the base learning rate to 0." 
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup
+#: paddlenlp.transformers.optimization.CosineDecayWithWarmup
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup
+msgid "参数"
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:5
+#: paddlenlp.transformers.optimization.CosineDecayWithWarmup:7
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:5
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:6
+msgid "The base learning rate. It is a python float number."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.CosineDecayWithWarmup:9
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:7
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:8
+msgid "The number of training steps."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:7
+#: paddlenlp.transformers.optimization.CosineDecayWithWarmup:11
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:9
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:10
+msgid ""
+"If int, it means the number of steps for warmup. If float, it means the "
+"proportion of warmup in total training steps."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:14
+#: paddlenlp.transformers.optimization.CosineDecayWithWarmup:23
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:12
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:19
+msgid ""
+"The index of last epoch. It can be set to restart training. If None, it "
+"means initial learning rate. Defaults to -1."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.LinearDecayWithWarmup:16
+msgid "If True, prints a message to stdout for each update. Defaults to False."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:20
+#: paddlenlp.transformers.optimization.CosineDecayWithWarmup:29
+#: paddlenlp.transformers.optimization.LinearDecayWithWarmup:21
+#: paddlenlp.transformers.optimization.PolyDecayWithWarmup:25
+msgid "实际案例"
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:1
+msgid ""
+"Creates a learning rate scheduler, which increases learning rate linearly"
+" from 0 to given `learning_rate` during warmup periods and keeps learning"
+" rate a constant after that."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.ConstScheduleWithWarmup:10
+msgid ""
+"The number of training steps. If `warmup` is a float number, "
+"`total_steps` must be provided. Defaults to None."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.CosineDecayWithWarmup:1
+msgid ""
+"Creates a learning rate scheduler, which increases learning rate linearly"
+" from 0 to given `learning_rate`, after this warmup period learning rate "
+"would be decreased following the values of the cosine function. If "
+"`with_hard_restarts` is True, the cosine function could have several "
+"hard restarts."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.CosineDecayWithWarmup:14
+msgid "Whether cosine function has several hard restarts. Defaults to False."
+msgstr ""
+
+#: of paddlenlp.transformers.optimization.CosineDecayWithWarmup:17
+msgid ""
+"If `with_hard_restarts` is False, it means the number of waves in cosine "
+"scheduler and should be an integer number and defaults to 1. If "
+"`with_hard_restarts` is True, it means the number of hard restarts to use"
+" and should be a float number and defaults to 0.5. 
Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.optimization.PolyDecayWithWarmup:1 +msgid "" +"Creates a learning rate scheduler, which increases learning rate linearly" +" from 0 to given `lr_init`, after this warmup period learning rate would " +"be decreased as a polynomial decay from the base learning rate to the end" +" learning rate `lr_end`." +msgstr "" + +#: of paddlenlp.transformers.optimization.PolyDecayWithWarmup:13 +msgid "The end learning rate. Defaults to 1e-7." +msgstr "" + +#: of paddlenlp.transformers.optimization.PolyDecayWithWarmup:16 +msgid "Power factor. Defaults to 1.0." +msgstr "" + +#: of paddlenlp.transformers.optimization.CosineAnnealingWithWarmupDecay:1 +msgid "基类::class:`paddle.optimizer.lr.LRScheduler`" +msgstr "" + +#: of +#: paddlenlp.transformers.optimization.CosineAnnealingWithWarmupDecay.get_lr:1 +msgid "" +"For those subclass who overload ``LRScheduler`` (Base Class), User should" +" have a custom implementation of ``get_lr()`` ." +msgstr "" + +#: of +#: paddlenlp.transformers.optimization.CosineAnnealingWithWarmupDecay.get_lr:3 +msgid "Otherwise, an ``NotImplementedError`` exception will be thrown." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.po new file mode 100644 index 0000000000000000000000000000000000000000..f1ac70debea6f48d2be291d37592ee93fc11a432 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.rst:2 +msgid "paddlenlp.transformers" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..855d3381963115d311803c162cc924d6e3a51fad --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.modeling.po @@ -0,0 +1,348 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ppminilm.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification:1 +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:1 +msgid "基类::class:`paddlenlp.transformers.ppminilm.modeling.PPMiniLMPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:1 +msgid "The bare PPMiniLM Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `PPMiniLMModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`PPMiniLMModel`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." 
+msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:38 +msgid "The vocabulary size of the `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`. .. note:: A normal_initializer " +"initializes weight matrices as normal distributions. See " +":meth:`PPMiniLMPretrainedModel._init_weights()` for how weights are " +"initialized in `PPMiniLMModel`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:41 +msgid "" +"The standard deviation of the normal initializer for initializing all " +"weight matrices. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:45 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`PPMiniLMPretrainedModel._init_weights()` for how weights are " +"initialized in `PPMiniLMModel`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel:48 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:1 +msgid "" +"If `input_ids` is a Tensor object, it is an indices of input sequence " +"tokens in the vocabulary. They are numerical representations of tokens " +"that build the input sequence. It's data type should be `int64` and has a" +" shape of [batch_size, sequence_length]. If `input_ids` is a list of " +"string, `self.use_faster_tokenizer` should be True, and the network " +"contains `faster_tokenizer` operator." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:9 +msgid "" +"If `token_type_ids` is a Tensor object: Segment token indices to indicate" +" different portions of the inputs. Selected in the range ``[0, " +"type_vocab_size - 1]``. If `type_vocab_size` is 2, which means the inputs" +" have two portions. Indices can either be 0 or 1: - 0 corresponds to a " +"*sentence A* token, - 1 corresponds to a *sentence B* token. Its data " +"type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings. If `token_type_ids` is a list of string: " +"`self.use_faster_tokenizer` should be True, and the network contains " +"`faster_tokenizer` operator." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:9 +msgid "" +"If `token_type_ids` is a Tensor object: Segment token indices to indicate" +" different portions of the inputs. Selected in the range ``[0, " +"type_vocab_size - 1]``. If `type_vocab_size` is 2, which means the inputs" +" have two portions. 
Indices can either be 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:15 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:16 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:18 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:21 +msgid "" +"If `token_type_ids` is a list of string: `self.use_faster_tokenizer` " +"should be True, and the network contains `faster_tokenizer` operator." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:24 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:28 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. For example, its shape can be " +"[batch_size, sequence_length], [batch_size, sequence_length, " +"sequence_length], [batch_size, num_attention_heads, sequence_length, " +"sequence_length]. We use whole-word-mask in PPMiniLM, so the whole word " +"will have the same value. For example, \"使用\" as a word, \"使\" and \"用\" " +"will have the same value. Defaults to `None`, which means nothing needed " +"to be prevented attention to." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:42 +msgid "" +"Returns tuple (``sequence_output``, ``pooled_output``). With the fields:" +" - `sequence_output` (Tensor): Sequence of hidden-states at the last" +" layer of the model. It's data type should be float32 and its shape " +"is [batch_size, sequence_length, hidden_size]. - `pooled_output` " +"(Tensor): The output of first token (`[CLS]`) in sequence. We " +"\"pool\" the model by simply taking the hidden state corresponding to the" +" first token. Its data type should be float32 and its shape is " +"[batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:42 +msgid "Returns tuple (``sequence_output``, ``pooled_output``)." 
+msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:44 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:48 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:47 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:52 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:51 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:15 +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMModel.forward:57 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMPretrainedModel:1 +msgid "基类::class:`paddlenlp.experimental.model_utils.FasterPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.modeling.PPMiniLMPretrainedModel:1 +msgid "" +"An abstract class for pretrained PPMiniLM models. It provides PPMiniLM " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. Refer " +"to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification:1 +msgid "" +"PPMiniLM Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification:4 +msgid "An instance of `paddlenlp.transformers.PPMiniLMModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification:6 +msgid "The number of classes. Default to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification:8 +msgid "" +"The dropout probability for output of PPMiniLM. If None, use the same " +"value as `hidden_dropout_prob` of `paddlenlp.transformers.PPMiniLMModel` " +"instance. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:1 +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:3 +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:5 +msgid "See :class:`PPMiniLMModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:7 +msgid "See :class:`MiniLMModel`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.modeling.PPMiniLMForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.po new file mode 100644 index 0000000000000000000000000000000000000000..2904e71719a238f4f380f604a396361be1f14127 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ppminilm.rst:2 +msgid "ppminilm" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..fdeb47bbfe519ecc9a3dc1622d7ebba8521a62b2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.ppminilm.tokenizer.po @@ -0,0 +1,294 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.ppminilm.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:1 +msgid "" +"Constructs an PPMiniLM tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize
+msgid "参数"
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:9
+msgid ""
+"The vocabulary file path (ends with '.txt') required to instantiate a "
+"`WordpieceTokenizer`."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:12
+msgid "Whether or not to lowercase the input when tokenizing. Defaults to `True`."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:15
+msgid ""
+"A special token representing the *unknown (out-of-vocabulary)* token. An "
+"unknown token is set to be `unk_token` in order to be converted to an ID. "
+"Defaults to \"[UNK]\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:19
+msgid ""
+"A special token separating two different sentences in the same input. "
+"Defaults to \"[SEP]\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:22
+msgid ""
+"A special token used to make arrays of tokens the same size for batching "
+"purposes. Defaults to \"[PAD]\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:25
+msgid ""
+"A special token used for sequence classification. It is the last token of"
+" the sequence when built with special tokens. Defaults to \"[CLS]\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:28
+msgid ""
+"A special token representing a masked token. This is the token used in "
+"the masked language modeling task which the model tries to predict the "
+"original unmasked ones. Defaults to \"[MASK]\"."
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer:34
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string:12
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize:10
+msgid "实际案例"
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.vocab_size:1
+msgid "Return the size of vocabulary."
+msgstr ""
+
+#: of
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize
+#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.vocab_size
+msgid "返回"
+msgstr ""
+
+#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.vocab_size:3
+msgid "The size of vocabulary."
+msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize:1 +msgid "Converts a string to a list of tokens." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize:3 +msgid "The text to be tokenized." +msgstr "" + +#: of paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.tokenize:6 +msgid "A list of string representing converted tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add:5 +msgid "" +"This encodes inputs and checks the number of added tokens, and is " +"therefore not efficient. Do not put this inside your training loop." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add:8 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.num_special_tokens_to_add:12 +msgid "Number of tokens added to sequences" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:4 +msgid "A sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." 
+msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:13 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_inputs_with_special_tokens:15 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "An offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.build_offset_mapping_with_special_tokens:14 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:3 +msgid "A sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:11 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.ppminilm.tokenizer.PPMiniLMTokenizer.create_token_type_ids_from_sequences:17 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..77077e04da76b1a1b06a6501183a3b2685e652ec --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.modeling.po @@ -0,0 +1,85 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.prophetnet.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetDecoder:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetForConditionalGeneration:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetModel:1 +msgid "基类::class:`paddlenlp.transformers.prophetnet.modeling.ProphetNetPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetDecoder.forward:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder.forward:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetForConditionalGeneration.forward:1 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetModel.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetDecoder.forward +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder.forward +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetForConditionalGeneration.forward +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetDecoder.forward:4 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder.forward:4 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetForConditionalGeneration.forward:4 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetModel.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetDecoder.forward:6 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder.forward:6 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetForConditionalGeneration.forward:6 +#: paddlenlp.transformers.prophetnet.modeling.ProphetNetModel.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetPretrainedModel:1 +msgid "" +"An abstract class for pretrained Prophetnet models. It provides " +"Prophetnet related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder:4 +msgid "" +"word_embeddings (:obj:`paddle.nn.Embeddings` of shape " +":obj:`(config.vocab_size, config.hidden_size)`, `optional`):" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.modeling.ProphetNetEncoder:2 +msgid "" +"The word embedding parameters. This can be used to initialize " +":class:`~transformers.ProphetNetEncoder` with pre-defined word embeddings" +" instead of randomly initialized word embeddings." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.po new file mode 100644 index 0000000000000000000000000000000000000000..b6ee17f11e6a006ccf2c541cb85bdeb261fab6a4 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.prophetnet.rst:2 +msgid "prophetnet" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..87f73f783bf49e0ce9850d842e063fb244ce6878 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.prophetnet.tokenizer.po @@ -0,0 +1,292 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.prophetnet.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.load_vocab:1 +msgid "Loads a vocabulary file into a dictionary." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:1 +msgid "Construct a ProphetNetTokenizer. Based on WordPiece." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:3 +msgid "" +"This tokenizer inherits from [`PreTrainedTokenizer`] which contains most " +"of the main methods. Users should refer to this superclass for more " +"information regarding those methods." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.save_vocabulary +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:6 +msgid "File containing the vocabulary." 
+msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:8 +msgid "Whether or not to lowercase the input when tokenizing." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:10 +msgid "Whether or not to do basic tokenization before WordPiece." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:12 +msgid "" +"The unknown token. A token that is not in the vocabulary cannot be " +"converted to an ID and is set to be this token instead." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:15 +msgid "" +"The separator token, which is used when building a sequence from multiple" +" sequences, e.g. two sequences for sequence classification or for a text " +"and a question for question answering. It is also used as the last token " +"of a sequence built with special tokens." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:19 +msgid "" +"Special second separator token, which can be generated by " +"[`ProphetNetForConditionalGeneration`]. It is used to separate bullet-" +"point like sentences in summarization, *e.g.*." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:23 +msgid "" +"The token used for padding, for example when batching sequences of " +"different lengths." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:25 +msgid "" +"The classifier token which is used when doing sequence classification " +"(classification of the whole sequence instead of per-token " +"classification). It is the first token of the sequence when built with " +"special tokens." +msgstr "" + +#: of paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer:28 +msgid "" +"The token used for masking values. This is the token used when training " +"this model with masked language modeling. This is the token which the " +"model will try to predict." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids:1 +msgid "" +"Converts a sequence of tokens into ids using the `vocab` attribute (an " +"instance of `Vocab`). Override it if needed." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids:5 +msgid "Args:" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids:5 +msgid "tokens (list[int]): List of token ids." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids:7 +msgid "Converted id list." 
+msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_ids +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens:1 +msgid "" +"Converts a single index or a sequence of indices to a token or a sequence" +" of tokens, using the vocabulary and added tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens:4 +msgid "The token id (or token ids) to be converted to token(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens:6 +msgid "" +"Whether or not to remove special tokens in the decoding. Defaults to " +"`False` and we do not remove special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_ids_to_tokens:10 +msgid "The decoded token(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (string) in a single string." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieve sequence ids from a token list that has no special tokens added." +" This method is called when adding special tokens using the tokenizer " +"`prepare_for_model` method." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:4 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:9 +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:11 +msgid "" +"A list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:13 +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:18 +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.get_special_tokens_mask:12 +msgid "`List[int]`" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task. 
A ProphetNet sequence pair mask has the following " +"format:" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:4 +msgid "" +"``` 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | first sequence | second " +"sequence | ```" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.create_token_type_ids_from_sequences:16 +msgid "" +"List of [token type IDs](../glossary#token-type-ids) according to the " +"given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. A BERT " +"sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:4 +msgid "single sequence: `[CLS] X [SEP]`" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:5 +msgid "pair of sequences: `[CLS] A [SEP] B [SEP]`" +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:7 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.build_inputs_with_special_tokens:12 +msgid "" +"List of [input IDs](../glossary#input-ids) with the appropriate special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.save_vocabulary:1 +msgid "" +"Save all tokens to a vocabulary file. The file contains a token per line," +" and the line number would be the index of corresponding token." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.save_vocabulary:4 +msgid "File path to be saved to." +msgstr "" + +#: of +#: paddlenlp.transformers.prophetnet.tokenizer.ProphetNetTokenizer.save_vocabulary:6 +msgid "The `Vocab` or `dict` instance to be saved." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..a486f3f2e4d0afd8d854bf0dade44cc5732ef6ff --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.modeling.po @@ -0,0 +1,807 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.reformer.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM:1 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering:1 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification:1 +#: paddlenlp.transformers.reformer.modeling.ReformerModel:1 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead:1 +msgid "基类::class:`paddlenlp.transformers.reformer.modeling.ReformerPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:1 +msgid "" +"The bare Reformer Model transformer outputting raw hidden-states without " +"any specific head on top." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModel +#: paddlenlp.transformers.reformer.modeling.ReformerModel.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:10 +msgid "Whether to tie input and output embeddings. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:12 +msgid "" +"Whether or not to use a causal mask in addition to the `attention_mask` " +"passed to `ReformerModel`. When using the Reformer for causal language " +"modeling, this argument should be set to `True`. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:14 +msgid "" +"The chunk size of all feed forward layers in the residual attention " +"blocks. A chunk size of `0` means that the feed forward layer is not " +"chunked. A chunk size of n means that the feed forward layer processes " +"`n` < sequence_length embeddings at a time. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:18 +msgid "The id of the `padding` token. Defaults to `0`." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:20 +msgid "" +"Seed that can be used to make local sensitive hashing in " +"`LSHSelfAttention` deterministic. This should only be set for testing " +"purposed. For evaluation and training purposes `hash_seed` should be left" +" as `None` to ensure fully random rotations in local sensitive hashing " +"scheme. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:24 +msgid "" +"Vocabulary size of `inputs_ids` in `ReformerModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`ReformerModel`. Defaults to `258`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:27 +msgid "" +"Dimensionality of the projected key, query and value vectors. Defaults to" +" `128`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:29 +msgid "Dimensionality of the embedding layer, encoder layer.Defaults to `1024`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:31 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `8`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:34 +msgid "" +"Number of hashing rounds (e.g., number of random rotations) in Local " +"Sensitive Hashing scheme. The higher `num_hashes`, the more accurate the " +"`LSHSelfAttention` becomes, but also the more memory and time intensive " +"the hashing becomes. Defaults to `4`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:36 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:38 +msgid "" +"Number of buckets, the key query vectors can be \"hashed into\" using the" +" locality sensitive hashing scheme. Each query key vector is hashed into " +"a hash in `1, ..., num_buckets`. The number of buckets can also be " +"factorized into a list for improved memory complexity. In this case, each" +" query key vector is hashed into a hash in `1-1, 1-2, ..., " +"num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is" +" factorized into two factors. The number of buckets (or the product the " +"factors) should approximately equal sequence length / lsh_chunk_length. " +"If `num_buckets` not set, a good value is calculated on the fly. Defaults" +" to `512`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:41 +msgid "" +"Length of chunk which attends to itself in `LSHSelfAttention`. Chunking " +"reduces memory complexity from sequence length x sequence length (self " +"attention) to chunk length x chunk length x sequence length / chunk " +"length (chunked self attention).Defaults to `256`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:43 +msgid "" +"Length of chunk which attends to itself in `LocalSelfAttention`. Chunking" +" reduces memory complexity from sequence length x sequence length (self " +"attention) to chunk length x chunk length x sequence length / chunk " +"length (chunked self attention).Defaults to `128`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:45 +msgid "" +"Number of following neighbouring chunks to attend to in " +"`LSHSelfAttention` layer to itself. Defaults to `0`." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:47 +msgid "" +"Number of previous neighbouring chunks to attend to in `LSHSelfAttention`" +" layer to itself. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:49 +msgid "" +"Number of following neighbouring chunks to attend to in " +"`LocalSelfAttention` layer to itself. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:51 +msgid "" +"Number of previous neighbouring chunks to attend to in " +"`LocalSelfAttention` layer to itself. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:53 +msgid "" +"The non-linear activation function (function or string) in the feed " +"forward layer in the residual attention block. If string, `\"gelu\"`, " +"`\"relu\"`, `\"tanh\"`, `\"mish\"` and `\"gelu_new\"` are supported. " +"Defaults to `\"relu\"`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:55 +msgid "" +"Dimensionality of the feed_forward layer in the residual attention block." +" Defaults to `4096`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:57 +msgid "" +"The dropout ratio for all fully connected layers in the embeddings and " +"encoder. Defaults to `0.2`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:59 +msgid "" +"The dropout ratio for the attention probabilities in `LSHSelfAttention`. " +"Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:61 +msgid "" +"The dropout ratio for the attention probabilities in " +"`LocalSelfAttention`. Defaults to `0.2`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:63 +msgid "" +"The maximum sequence length that this model might ever be used with. " +"Typically set this to something large just in case (e.g., 512 or 1024 or " +"2048). Defaults to `65536`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:65 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`ReformerPretrainedModel._init_weights()` " +"for how weights are initialized in `ReformerModel`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:65 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:68 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`ReformerPretrainedModel._init_weights()` for how weights are " +"initialized in `ReformerModel`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:71 +msgid "The epsilon used by the layer normalization layers. Defaults to `1e-12`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:73 +msgid "Whether or not to use axial position embeddings. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:75 +msgid "" +"The position dims of the axial position encodings. During training, the " +"product of the position dims has to be equal to the sequence length. " +"Defaults to `[128, 512]`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:77 +msgid "" +"The embedding dims of the axial position encodings. 
The sum of the " +"embedding dims has to be equal to the hidden size. Defaults to `[256, " +"768]`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:80 +msgid "" +"The standard deviation of the normal_initializer for initializing the " +"weight matrices of the axial positional encodings. Defaults to `1.0`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:83 +msgid "" +"The chunk size of the final language model feed forward head layer. A " +"chunk size of 0 means that the feed forward layer is not chunked. A chunk" +" size of n means that the feed forward layer processes n < " +"sequence_length embeddings at a time. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel:86 +msgid "" +"List of attention layer types in ascending order. It can be chosen " +"between a LSHSelfAttention layer (`\"lsh\"`) and a LocalSelfAttention " +"layer (`\"local\"`). Defaults to `[\"local\", \"local\", \"lsh\", " +"\"local\", \"local\", \"local\", \"lsh\", \"local\", \"local\", " +"\"local\", \"lsh\", \"local\"]`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:1 +msgid "" +"The ReformerModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float. When the data type is int, " +"the `masked` tokens have `0` values and the others have `1` values. When " +"the data type is float, the `masked` tokens have `0.0` values and the " +"others have `1.0` values. It is a tensor with shape broadcasted to " +"`[batch_size, num_attention_heads, sequence_length, sequence_length]`. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:17 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range `[0, max_position_embeddings - 1]`. " +"Shape as [batch_size, num_tokens] and dtype as int64. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:21 +msgid "" +"The number of hashing rounds that should be performed during bucketing. " +"Setting this argument overwrites the default defined in " +"`config[\"num_hashes\"]`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:25 +msgid "" +"List of `tuple(Tensor, Tensor)` of length " +"`config[\"num_hidden_layers\"]`, with the first element being the " +"previous `buckets` of shape `[batch_size, num_heads, num_hashes, " +"sequence_length]` and the second being the previous `hidden_states` of " +"shape `[batch_size, sequence_length, hidden_size]`. Contains precomputed " +"hidden-states and buckets (only relevant for LSH Self-Attention). Can be " +"used to speed up sequential decoding. Defaults to `None`." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:32 +msgid "" +"Whether or not to use cache. If set to `True`, `cache` states are " +"returned and can be used to speed up decoding. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:36 +msgid "" +"Whether or not to return the attentions tensors of all attention layers. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:39 +msgid "" +"Whether or not to return the output of all hidden layers. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModel.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:43 +msgid "" +"Returns tuple (`last_hidden_state`, `cache`, `hidden_states`, " +"`attentions`) With the fields: - `last_hidden_state` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length," +" hidden_size]. - `cache` (List[tuple(Tensor, Tensor)], optional): " +"returned when `use_cache=True` is passed. List of `tuple(Tensor, " +"Tensor)` of length `config[\"num_hidden_layers\"]`, with the first " +"element being the previous `buckets` of shape `[batch_size, " +"num_heads, num_hashes, sequence_length]` and the second being the " +"previous `hidden_states` of shape `[batch_size, sequence_length, " +"hidden_size]`. - `hidden_states` (tuple(Tensor), optional): returned" +" when `output_hidden_states=True` is passed. tuple of `Tensor` (one " +"for the output of the embeddings + one for the output of each layer)." +" Each Tensor has a data type of float32 and its shape is [batch_size," +" sequence_length, hidden_size]. - `attentions` (tuple(Tensor), " +"optional): returned when `output_attentions=True` is passed. " +"tuple of `Tensor` (one for each layer) of shape. Each Tensor has a data" +" type of float32 and its shape is [batch_size, num_heads, " +"sequence_length, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:43 +msgid "" +"Returns tuple (`last_hidden_state`, `cache`, `hidden_states`, " +"`attentions`)" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:23 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:30 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:22 +#: paddlenlp.transformers.reformer.modeling.ReformerModel.forward:45 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:26 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:50 +msgid "`last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:48 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:57 +msgid "`cache` (List[tuple(Tensor, Tensor)], optional):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:53 +msgid "" +"returned when `use_cache=True` is passed. List of `tuple(Tensor, Tensor)`" +" of length `config[\"num_hidden_layers\"]`, with the first element being " +"the previous `buckets` of shape `[batch_size, num_heads, num_hashes, " +"sequence_length]` and the second being the previous `hidden_states` of " +"shape `[batch_size, sequence_length, hidden_size]`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:63 +msgid "`hidden_states` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:60 +msgid "" +"returned when `output_hidden_states=True` is passed. tuple of `Tensor` " +"(one for the output of the embeddings + one for the output of each " +"layer). Each Tensor has a data type of float32 and its shape is " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:67 +msgid "`attentions` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModel.forward:66 +msgid "" +"returned when `output_attentions=True` is passed. tuple of `Tensor` (one " +"for each layer) of shape. Each Tensor has a data type of float32 and its " +"shape is [batch_size, num_heads, sequence_length, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModel.forward +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:44 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:55 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:41 +#: paddlenlp.transformers.reformer.modeling.ReformerModel.forward:72 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:50 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerPretrainedModel:1 +msgid "" +"An abstract class for pretrained Reformer models. It provides Reformer " +"related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +"`PretrainedModel` for more details." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerPretrainedModel.init_weights:1 +msgid "Initializes and tie weights if needed." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerPretrainedModel.tie_weights:1 +msgid "Tie the weights between the input embeddings and the output embeddings." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification:1 +msgid "" +"The Reformer Model transformer with a sequence classification head on top" +" (linear layer)." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM:3 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification:3 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead:3 +msgid "An instance of :class:`ReformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification:5 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification:7 +msgid "" +"The dropout probability for output of Reformer. If None, use the same " +"value as `hidden_dropout_prob` of `ReformerModel` instance `reformer`. " +"Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:1 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:3 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:5 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:7 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:16 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:18 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:37 +#: paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:40 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:1 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:3 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:5 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:7 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:23 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:25 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:48 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:51 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:1 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:3 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:5 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:7 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:15 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:17 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:34 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:37 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:1 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:3 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:5 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:7 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:9 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:11 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:19 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:21 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:40 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:43 +#: 
paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:46 +msgid "See :class:`ReformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:9 +msgid "" +"Labels for computing the sequence classification/regression loss. Indices" +" should be in `[0, ...,num_classes - 1]`. If `num_classes == 1` a " +"regression loss is computed (Mean-Square loss), If `num_classes > 1` a " +"classification loss is computed (Cross-Entropy). Shape is [batch_size,] " +"and dtype is int64." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:20 +msgid "" +"Returns tuple `(loss, logits, hidden_states, attentions)`. With the " +"fields: - `loss` (Tensor): returned when `labels` is provided. " +"Classification (or regression if num_classes==1) loss. It's data type" +" should be float32 and its shape is [1,]. - `logits` (Tensor): " +"Classification (or regression if num_classes==1) scores (before SoftMax)." +" It's data type should be float32 and its shape is [batch_size, " +"num_classes]. - `hidden_states` (tuple(Tensor)): See " +":class:`ReformerModel`. - `attentions` (tuple(Tensor)): See " +":class:`ReformerModel`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:21 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:28 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:20 +msgid "Returns tuple `(loss, logits, hidden_states, attentions)`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:28 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:35 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:27 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:31 +msgid "`loss` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:33 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:25 +msgid "" +"returned when `labels` is provided. Classification (or regression if " +"num_classes==1) loss. It's data type should be float32 and its shape is " +"[1,]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:34 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:31 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:37 +msgid "`logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:30 +msgid "" +"Classification (or regression if num_classes==1) scores (before SoftMax)." +" It's data type should be float32 and its shape is [batch_size, " +"num_classes]." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:37 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:48 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:34 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:43 +msgid "`hidden_states` (tuple(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:39 +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:50 +#: paddlenlp.transformers.reformer.modeling.ReformerForSequenceClassification.forward:36 +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:45 +msgid "`attentions` (tuple(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering:1 +msgid "" +"Reformer Model with a span classification head on top for extractive " +"question-answering tasks like SQuAD (a linear layers on top of the " +"hidden-states output to compute `span start logits` and `span end " +"logits`)." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering:5 +msgid "An instance of ReformerModel." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering:7 +msgid "" +"The dropout probability for output of Reformer. If None, use the same " +"value as `hidden_dropout_prob` of `ReformerModel` instance `reformer`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:9 +msgid "" +"Labels for position (index) of the start of the labelled span for " +"computing the token classification loss. Positions are clamped to the " +"length of the sequence (`sequence_length`). Position outside of the " +"sequence are not taken into account for computing the loss. Shape is " +"[batch_size,] and dtype is int64." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:16 +msgid "" +"Labels for position (index) of the end of the labelled span for computing" +" the token classification loss. Positions are clamped to the length of " +"the sequence (`sequence_length`). Position outside of the sequence are " +"not taken into account for computing the loss. Shape is [batch_size,] and" +" dtype is int64." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:28 +msgid "" +"Returns tuple `(loss, logits, hidden_states, attentions)`. With the " +"fields: - `loss` (Tensor): returned when `labels` is provided. " +"Classification (or regression if num_classes==1) loss. It's data type" +" should be float32 and its shape is [1,]. - `start_logits` (Tensor):" +" A tensor of the input token classification logits, indicates the" +" start position of the labelled span. Its data type should be float32" +" and its shape is [batch_size, sequence_length]. - `end_logits` " +"(Tensor): A tensor of the input token classification logits, " +"indicates the end position of the labelled span. Its data type " +"should be float32 and its shape is [batch_size, sequence_length]. - " +"`hidden_states` (tuple(Tensor)): See :class:`ReformerModel`. - " +"`attentions` (tuple(Tensor)): See :class:`ReformerModel`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:40 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:38 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:45 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerForQuestionAnswering.forward:43 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead:1 +msgid "The Reformer Model transformer with a language modeling head on top." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:13 +msgid "" +"Labels for language modeling. Note that the labels **are shifted** inside" +" the model, i.e. you can set `labels = input_ids` Indices are selected in" +" `[-100, 0, ..., vocab_size]` All labels set to `-100` are ignored " +"(masked), the loss is only computed for labels in `[0, ..., vocab_size]`." +" Shape is [batch_size, sequence_length] and dtype is int64." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:24 +msgid "" +"Returns tuple `(loss, logits, cache, hidden_states, attentions)`. With " +"the fields: - `loss` (Tensor): returned when `labels` is provided." +" Language modeling loss (for next-token prediction). It's data " +"type should be float32 and its shape is [1,]. - `logits` (Tensor): " +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 " +"and its shape is [batch_size, sequence_length, vocab_size]. - " +"`cache` (List[tuple(Tensor, Tensor)]): See :class:`ReformerModel`. -" +" `hidden_states` (tuple(Tensor)): See :class:`ReformerModel`. - " +"`attentions` (tuple(Tensor)): See :class:`ReformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:24 +msgid "Returns tuple `(loss, logits, cache, hidden_states, attentions)`." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:29 +msgid "" +"returned when `labels` is provided. Language modeling loss (for next-" +"token prediction). It's data type should be float32 and its shape is " +"[1,]." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:34 +msgid "" +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 and " +"its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.modeling.ReformerModelWithLMHead.forward:40 +msgid "`cache` (List[tuple(Tensor, Tensor)]):" +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM:1 +msgid "" +"The Reformer Model transformer with a masked language modeling head on " +"top." 
+msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:9 +msgid "" +"Labels for computing the masked language modeling loss. Indices should be" +" in ``[-100, 0, ..., vocab_size]`` (see ``input_ids`` docstring) Tokens " +"with indices set to ``-100`` are ignored(masked), the loss is only " +"computed for the tokens with labels in ``[0, ..., vocab_size]``. Shape is" +" [batch_size, sequence_length] and dtype is int64." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:21 +msgid "" +"Returns tuple `(loss, logits, hidden_states, attentions)`. With the " +"fields: - `loss` (Tensor): returned when `labels` is provided. " +"Masked Language modeling loss. It's data type should be float32 and " +"its shape is [1,]. - `logits` (Tensor): Prediction scores of the " +"masked language modeling head (scores for each vocabulary token " +"before SoftMax). It's data type should be float32 and its shape is" +" [batch_size, sequence_length, vocab_size]. - `hidden_states` " +"(tuple(Tensor)): See :class:`ReformerModel`. - `attentions` " +"(tuple(Tensor)): See :class:`ReformerModel`." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:26 +msgid "" +"returned when `labels` is provided. Masked Language modeling loss. It's " +"data type should be float32 and its shape is [1,]." +msgstr "" + +#: of paddlenlp.transformers.reformer.modeling.ReformerForMaskedLM.forward:31 +msgid "" +"Prediction scores of the masked language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 and " +"its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.po new file mode 100644 index 0000000000000000000000000000000000000000..d656b21a7e577f4ec956d04ed963b61c99315666 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.reformer.rst:2 +msgid "reformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..a1dadaf14a90d32338bcccdd9329627d77b01cc1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.reformer.tokenizer.po @@ -0,0 +1,155 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.reformer.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.albert.tokenizer.AlbertEnglishTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:1 +msgid "" +"Constructs a Reformer tokenizer based on SentencePiece . This tokenizer " +"inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:6 +msgid "" +"The vocabulary file (ends with '.spm') required to instantiate a " +"`SentencePiece `__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:9 +msgid "" +"Whether or not to lowercase the input when tokenizing. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:11 +msgid "Whether or note to remove space when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:13 +msgid "Whether or note to keep accents when tokenizing. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:15 +msgid "" +"A special token representing the *eos (end-of-sentence)* token. Defaults " +"to \"\"." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:18 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"\"." +msgstr "" + +#: of paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"\"." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:1 +msgid "Build model inputs from a sequence or a pair of sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:3 +msgid "An Reformer sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:5 +msgid "single sequence: ``X``" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:6 +msgid "pair of sequences: ``A B ``" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:8 +msgid "List of IDs to which the special tokens will be added." 
+msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:10 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens:13 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences:1 +msgid "Create a mask from the two sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences:5 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences:7 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.reformer.tokenizer.ReformerTokenizer.create_token_type_ids_from_sequences:10 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..e291691d880a1716dee8cb0c75249bf3c843e573 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.modeling.po @@ -0,0 +1,484 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.rembert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM:1 +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice:1 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering:1 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification:1 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification:1 +#: paddlenlp.transformers.rembert.modeling.RemBertModel:1 +msgid "基类::class:`paddlenlp.transformers.rembert.modeling.RembertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:1 +msgid "The bare RemBERT Model transformer outputting raw hidden-states." 
+msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM +#: paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertModel +#: paddlenlp.transformers.rembert.modeling.RemBertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `RemBertModel`. Also is the vocab size" +" of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling " +"`RemBertModel`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:13 +msgid "Dimensionality of the embedding layer. Defaults to `256`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:15 +msgid "Dimensionality of the encoder layer and pooler layer. Defaults to `1152`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:17 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `32`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:19 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `18`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:22 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:27 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:31 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:34 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." 
+msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:37 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:40 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:43 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:43 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:47 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel:50 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:1 +msgid "" +"The RemBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." 
+msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:32 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:32 +msgid "Returns tuple (`sequence_output`, `pooled_output`)" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:14 +#: paddlenlp.transformers.rembert.modeling.RemBertModel.forward:34 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:38 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:37 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:42 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertModel.forward:41 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward +#: paddlenlp.transformers.rembert.modeling.RemBertModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:15 +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:17 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:26 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:17 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:17 +#: paddlenlp.transformers.rembert.modeling.RemBertModel.forward:47 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM:1 +msgid "RemBert Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM:3 +msgid "An instance of :class:`RemBertModel`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:1 +#: paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:3 +#: paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:5 +#: paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:7 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:3 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:5 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:7 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:9 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:7 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:9 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:3 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:5 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:7 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:9 +msgid "See :class:`RemBertModel`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMaskedLM.forward:10 +msgid "" +"Returns tensor `prediction_scores`, The scores of masked token " +"prediction. Its data type should be float32 and shape is [batch_size, " +"sequence_length, vocab_size]." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering:1 +msgid "" +"RemBert Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice:4 +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering:4 +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification:4 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification:4 +msgid "An instance of RemBertModel." 
+msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:1 +msgid "" +"The RemBertForQuestionAnswering forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:12 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:18 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:21 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForQuestionAnswering.forward:21 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification:1 +msgid "" +"RemBert Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification:6 +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification:6 +msgid "The number of classes." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:1 +msgid "" +"The RemBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice:1 +msgid "" +"RemBert Model with a linear layer on top of the hidden-states output " +"layer, designed for multiple choice tasks like RocStories/SWAG tasks." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice:6 +msgid "The number of choices." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:1 +msgid "" +"The BertForMultipleChoice forward method, overrides the __call__() " +"special method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:3 +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:5 +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:7 +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:9 +msgid "" +"See :class:`RemBertModel` and shape as [batch_size, num_choice, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForMultipleChoice.forward:12 +msgid "" +"Returns tensor `reshaped_logits`, a tensor of the multiple choice " +"classification logits. Shape as `[batch_size, num_choice]` and dtype as " +"`float32`." +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RembertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RembertPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification:1 +msgid "" +"RemBert Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:1 +msgid "" +"The RemBertForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.modeling.RemBertForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.po new file mode 100644 index 0000000000000000000000000000000000000000..f6e5af4d2b038e7dc695895292d8883fe0bb2ad0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.rembert.rst:2 +msgid "rembert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..722f3493018ac4ff8b2a8e0898313b4987ed7345 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.rembert.tokenizer.po @@ -0,0 +1,234 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.rembert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:1 +msgid "" +"Construct a RemBertTokenizer. For more information regarding those " +"methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.save_vocabulary +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:4 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:7 +msgid "" +"Whether or not to lowercase the input when tokenizing. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:10 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:14 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:17 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:20 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:23 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer:29 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string by " +"using ``' '.join(tokens)`` ." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string:4 +msgid "A sequence of tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string:7 +msgid "Converted string." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. A " +"REMBERT sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:4 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:5 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:7 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:9 +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:12 +msgid "" +"List of `input IDs <../glossary.html#input-ids>`__ with the appropriate " +"special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.build_inputs_with_special_tokens:13 +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:18 +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:12 +msgid ":obj:`List[int]`" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieve sequence ids from a token list that has no special tokens added." +" This method is called when adding special tokens using the tokenizer " +"``prepare_for_model`` method." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:4 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model." 
+msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.get_special_tokens_mask:11 +msgid "" +"A list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task. A RemBERT sequence pair mask has the following " +"format:" +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.create_token_type_ids_from_sequences:16 +msgid "" +"List of `token type IDs <../glossary.html#token-type-ids>`_ according to " +"the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.save_vocabulary:1 +msgid "" +"Save all tokens to a vocabulary file. The file contains a token per line," +" and the line number would be the index of corresponding token." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.save_vocabulary:4 +msgid "File path to be saved to." +msgstr "" + +#: of +#: paddlenlp.transformers.rembert.tokenizer.RemBertTokenizer.save_vocabulary:6 +msgid "The `Vocab` or `dict` instance to be saved." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..7bd5d069b183abd1e96b2f354fa052bb575b279a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.modeling.po @@ -0,0 +1,676 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.roberta.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForMultipleChoice:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification:1 +#: paddlenlp.transformers.roberta.modeling.RobertaModel:1 +msgid "基类::class:`paddlenlp.transformers.roberta.modeling.RobertaPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:1 +msgid "The bare Roberta Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." 
+msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM +#: paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForMultipleChoice.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaModel +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `RobertaModel`. Also is the vocab size" +" of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling " +"`RobertaModel`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." 
+msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:38 +msgid "" +"The vocabulary size of the `token_type_ids` passed when calling " +"`~transformers.RobertaModel`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:41 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`RobertaPretrainedModel._init_weights()` for" +" how weights are initialized in `RobertaModel`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:41 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:44 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`RobertaPretrainedModel._init_weights()` for how weights are " +"initialized in `RobertaModel`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:47 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel:50 +msgid "The index of cls token in the token vocabulary. Defaults to `101`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:1 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:5 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:5 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:8 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:9 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:11 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to None, which means no segment embeddings is " +"added to token embeddings." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:14 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:19 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. 
When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:30 +msgid "" +"Whether or not to output hidden states for all hidden layers. Defaults to" +" `False`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:34 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) by default. Returns " +"(`encoder_outputs`, `pooled_output`) if output_hidden_states is `True`. " +"With the fields: - `sequence_output` (Tensor): Sequence of hidden-" +"states at the last layer of the model. It's data type should be " +"float32 and its shape is [batch_size, sequence_length, hidden_size]. - " +"`pooled_output` (Tensor): The output of first token (`[CLS]`) in " +"sequence. We \"pool\" the model by simply taking the hidden state " +"corresponding to the first token. Its data type should be float32 and" +" its shape is [batch_size, hidden_size]. - `encoder_outputs` " +"(List(Tensor)): A list of Tensor containing hidden-states of the " +"model at each hidden layer in the Transformer encoder. The length of " +"the list is `num_hidden_layers`. Each Tensor has a data type of " +"float32 and its shape is [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:34 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) by default. Returns " +"(`encoder_outputs`, `pooled_output`) if output_hidden_states is `True`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:15 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:15 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:15 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:15 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:15 +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward:37 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:41 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:40 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:46 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:44 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:23 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:23 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:27 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:23 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:23 +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward:50 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaModel.forward:49 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and its shape" +" is [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:28 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:28 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:32 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:28 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:28 +#: paddlenlp.transformers.roberta.modeling.RobertaModel.forward:55 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaPretrainedModel:1 +msgid "" +"An abstract class for pretrained RoBerta models. It provides RoBerta " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification:1 +msgid "" +"Roberta Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification:4 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification:4 +msgid "An instance of `RobertaModel`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification:6 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification:8 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification:8 +msgid "" +"The dropout probability for output of Roberta. If None, use the same " +"value as `hidden_dropout_prob` of `RobertaModel` instance `roberta`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:5 +#: paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:7 +#: paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:9 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:5 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:7 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:9 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:5 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:7 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:9 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:5 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:7 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:9 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:1 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:5 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:7 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:9 +msgid "See :class:`RobertaModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits` by default. Returns tuple (`logits`, " +"`encoder_outputs`) if output_hidden_states is set to `True`. With the " +"fields: - `logits` (Tensor): a tensor of the input text " +"classification logits. Its data type should be float32 and it has a " +"shape of [batch_size, num_classes]. - `encoder_outputs` (List(Tensor)):" +" A list of Tensor containing hidden-states of the model at each " +"hidden layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and a " +"shape of [batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:12 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits` by default. Returns tuple (`logits`, " +"`encoder_outputs`) if output_hidden_states is set to `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:19 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:19 +msgid "`logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:18 +msgid "" +"a tensor of the input text classification logits. Its data type should be" +" float32 and it has a shape of [batch_size, num_classes]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:22 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:22 +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:26 +#: paddlenlp.transformers.roberta.modeling.RobertaForSequenceClassification.forward:22 +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:22 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and a shape " +"of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification:1 +msgid "" +"Roberta Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits` by default. Returns tuple (`logits`, " +"`encoder_outputs`) if output_hidden_states is set to `True`. With the " +"fields: - `logits` (Tensor): a tensor of the input token " +"classification logits. Shape as `[batch_size, sequence_length, " +"num_classes]` and dtype as `float32`. - `encoder_outputs` " +"(List(Tensor)): A list of Tensor containing hidden-states of the " +"model at each hidden layer in the Transformer encoder. The length of " +"the list is `num_hidden_layers`. Each Tensor has a data type of " +"float32 and a shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForTokenClassification.forward:18 +msgid "" +"a tensor of the input token classification logits. Shape as `[batch_size," +" sequence_length, num_classes]` and dtype as `float32`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering:2 +msgid "" +"Roberta Model with a linear layer on top of the hidden-states output to " +"compute `span_start_logits`" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering:2 +msgid "and `span_end_logits`, designed for question-answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering:4 +msgid "An instance of RobertaModel." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`) by default if " +"output_hidden_states is `False`. Returns tuple (`start_logits`, " +"`end_logits`, `encoder_outputs`) if output_hidden_states is set to " +"`True`. 
With the fields: - `start_logits` (Tensor): A tensor of the" +" input token classification logits, indicates the start position of the " +"labelled span. Its data type should be float32 and its shape is " +"[batch_size, sequence_length]. - `end_logits` (Tensor): A tensor of " +"the input token classification logits, indicates the end position of the " +"labelled span. Its data type should be float32 and its shape is " +"[batch_size, sequence_length]. - `encoder_outputs` (List(Tensor)): A" +" list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and a " +"shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:12 +msgid "" +"Returns tuple (`start_logits`, `end_logits`) by default if " +"output_hidden_states is `False`. Returns tuple (`start_logits`, " +"`end_logits`, `encoder_outputs`) if output_hidden_states is set to " +"`True`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:19 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:18 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:23 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForQuestionAnswering.forward:22 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM:1 +msgid "Roberta Model with a `masked language modeling` head on top." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM:3 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM:3 +msgid "class:RobertaModel`): An instance of :class:`RobertaModel`." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:12 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:12 +msgid "" +"Returns tensor `prediction_scores` by default. Returns tuple " +"(`prediction_scores`, `encoder_outputs`) if output_hidden_states is set " +"to `True`. With the fields: - `prediction_scores` (Tensor): The " +"scores of masked token prediction. Its data type should be float32 " +"and shape is [batch_size, sequence_length, vocab_size]. - " +"`encoder_outputs` (List(Tensor)): A list of Tensor containing hidden-" +"states of the model at each hidden layer in the Transformer encoder. " +"The length of the list is `num_hidden_layers`. Each Tensor has a data" +" type of float32 and a shape of [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:12 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:12 +msgid "" +"Returns tensor `prediction_scores` by default. Returns tuple " +"(`prediction_scores`, `encoder_outputs`) if output_hidden_states is set " +"to `True`." 
+msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:19 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:19 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM.forward:18 +#: paddlenlp.transformers.roberta.modeling.RobertaForMaskedLM.forward:18 +msgid "" +"The scores of masked token prediction. Its data type should be float32 " +"and shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForMultipleChoice.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForMultipleChoice.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.modeling.RobertaForMultipleChoice.forward:6 +msgid "unpacked dict arguments" +msgstr "" + +#: of paddlenlp.transformers.roberta.modeling.RobertaForCausalLM:1 +msgid "Roberta Model with a `Causal language modeling` head on top." +msgstr "" + +#~ msgid "" +#~ "Returns tuple (`sequence_output`, `pooled_output`)." +#~ " With the fields: - sequence_output " +#~ "(Tensor): Sequence of hidden-states " +#~ "at the last layer of the model." +#~ " It's data type should be float32" +#~ " and its shape is [batch_size, " +#~ "sequence_length, hidden_size]. - pooled_output " +#~ "(Tensor): The output of first token" +#~ " (`[CLS]`) in sequence. We \"pool\" " +#~ "the model by simply taking the " +#~ "hidden state corresponding to the first" +#~ " token. Its data type should be" +#~ " float32 and its shape is " +#~ "[batch_size, hidden_size]." +#~ msgstr "" + +#~ msgid "Returns tuple (`sequence_output`, `pooled_output`)." +#~ msgstr "" + +#~ msgid "sequence_output (Tensor):" +#~ msgstr "" + +#~ msgid "pooled_output (Tensor):" +#~ msgstr "" + +#~ msgid "" +#~ "Returns tensor `logits`, a tensor of " +#~ "the input text classification logits. " +#~ "Its data type should be float32 " +#~ "and it has a shape of [batch_size," +#~ " num_classes]." +#~ msgstr "" + +#~ msgid "" +#~ "Returns tensor `logits`, a tensor of " +#~ "the input token classification logits. " +#~ "Shape as `[batch_size, sequence_length, " +#~ "num_classes]` and dtype as `float32`." +#~ msgstr "" + +#~ msgid "" +#~ "Returns tuple (`start_logits`, `end_logits`). " +#~ "With the fields: - `start_logits` " +#~ "(Tensor): A tensor of the input " +#~ "token classification logits, indicates the " +#~ "start position of the labelled span." +#~ " Its data type should be float32" +#~ " and its shape is [batch_size, " +#~ "sequence_length]. - `end_logits` (Tensor): " +#~ "A tensor of the input token " +#~ "classification logits, indicates the end " +#~ "position of the labelled span. Its" +#~ " data type should be float32 and " +#~ "its shape is [batch_size, sequence_length]." +#~ msgstr "" + +#~ msgid "Returns tuple (`start_logits`, `end_logits`)." +#~ msgstr "" + +#~ msgid "" +#~ "Returns tensor `prediction_scores`, The scores" +#~ " of masked token prediction. Its data" +#~ " type should be float32 and shape " +#~ "is [batch_size, sequence_length, vocab_size]." 
+#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.po new file mode 100644 index 0000000000000000000000000000000000000000..ac6284724b338cc2c48158f64b054c2d082c964f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.roberta.rst:2 +msgid "roberta" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..d8dbdfc30131a8e56dc38fb6a79b2407af88bea2 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roberta.tokenizer.po @@ -0,0 +1,495 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.roberta.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaTokenizer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaTokenizer:1 +msgid "" +"RobertaTokenizer is a generic tokenizer class that will be instantiated " +"as either RobertaChineseTokenizer or RobertaBPETokenizer when created " +"with the RobertaTokenizer.from_pretrained() class method." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:1 +msgid "" +"Constructs a RoBerta tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:12 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:15 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:19 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:25 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:25 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer:34 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also removes " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add:1 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add:3 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.num_special_tokens_to_add:7 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_inputs_with_special_tokens:1 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:4 +msgid "A RoBERTa sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:11 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_inputs_with_special_tokens:15 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:1 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A RoBERTa offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:5 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:8 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of wordpiece offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:10 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:13 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "" +"A list of wordpiece offsets with the appropriate offsets of special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:1 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:3 +msgid "A RoBERTa sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask:4 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask:6 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask:1 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask:8 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_special_tokens_mask:12 +#: paddlenlp.transformers.roberta.tokenizer.RobertaChineseTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:1 +msgid "基类::class:`paddlenlp.transformers.gpt.tokenizer.GPTTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:1 +msgid "Constructs a Roberta tokenizer based on byte-level Byte-Pair-Encoding." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.GPTTokenizer` which contains most of the " +"main methods. For more information regarding those methods, please refer " +"to this superclass." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:7 +msgid "" +"Path to the vocab file. The vocab file contains a mapping from vocabulary" +" strings to indices." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:10 +msgid "" +"Path to the merge file. The merge file is used to split the input " +"sentence into \"subword\" units. The vocab file is then used to encode " +"those units as intices." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:14 +msgid "Paradigm to follow when decoding bytes to UTF-8. Defaults to `'replace'`." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:17 +msgid "The maximum value of the input sequence length. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer:20 +msgid "A list of special tokens not in the vocabulary. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping:1 +msgid "" +"Returns the map of tokens and the start and end index of their start and " +"end character. Modified from " +"https://github.com/bojone/bert4keras/blob/master/bert4keras/tokenizers.py#L372" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping:4 +msgid "Input text." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.get_offset_mapping:7 +msgid "The offset map of input text." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A Roberta offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: ``(0,0) A (0,0) (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"Should be overridden in a subclass if the model has a special way of " +"building those." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string by " +"using ``' '.join(tokens)`` ." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string:4 +msgid "A sequence of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roberta.tokenizer.RobertaBPETokenizer.convert_tokens_to_string:7 +msgid "Converted string." +msgstr "" + +#~ msgid "Converts a string to a list of tokens." +#~ msgstr "" + +#~ msgid "The text to be tokenized." +#~ msgstr "" + +#~ msgid "A list of string representing converted tokens." +#~ msgstr "" + +#~ msgid "Converts a sequence of tokens (list of string) to a list of ids." +#~ msgstr "" + +#~ msgid "Converted ids from tokens." +#~ msgstr "" + +#~ msgid "A list of ids to be converted." +#~ msgstr "" + +#~ msgid "Whether or not to skip specical tokens. Defaults to `False`." +#~ msgstr "" + +#~ msgid "A list of converted tokens." +#~ msgstr "" + +#~ msgid "" +#~ "Save tokenizer related resources to " +#~ "`resource_files_names` indicating files under " +#~ "`save_directory` by copying directly. Override" +#~ " it if necessary." +#~ msgstr "" + +#~ msgid "Directory to save files into." +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..8a649e338ca4bdcac093cae87382212e4ed4a18a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.modeling.po @@ -0,0 +1,635 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.roformer.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerModel:1 +msgid "基类::class:`paddlenlp.transformers.roformer.modeling.RoFormerPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:1 +msgid "The bare RoFormer Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerModel +#: paddlenlp.transformers.roformer.modeling.RoFormerModel.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `RoFormerModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`RoFormerModel`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:13 +msgid "Dimensionality of the embedding layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:15 +msgid "Dimensionality of the, encoder layers and pooler layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:17 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." 
+msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:19 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:22 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:27 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:31 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:34 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:37 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:40 +msgid "The vocabulary size of `token_type_ids`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:43 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:43 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:47 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`BertPretrainedModel.init_weights()` for how weights are " +"initialized in `BertModel`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:50 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:53 +msgid "The non-linear activation function in the pooler. Defaults to `\"tanh\"`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel:56 +msgid "" +"Whether or not apply rotay position embeddings to value. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:1 +msgid "" +"The RoFormerModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. When the data type " +"is bool, the `masked` tokens have `False` values and the others have " +"`True` values. When the data type is int, the `masked` tokens have `0` " +"values and the others have `1` values. When the data type is float, the " +"`masked` tokens have `-INF` values and the others have `0` values. It is " +"a tensor with shape broadcasted to `[batch_size, num_attention_heads, " +"sequence_length, sequence_length]`. Defaults to `None`, which means " +"nothing needed to be prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:31 +msgid "Whether to return the output of each hidden layers. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerModel.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`). 
With the fields: - `sequence_output` (Tensor): " +"Sequence of hidden-states at the last layer of the model. It's data " +"type should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. - `pooled_output` (Tensor): The output of first token " +"(`[CLS]`) in sequence. We \"pool\" the model by simply taking the " +"hidden state corresponding to the first token. Its data type should " +"be float32 and its shape is [batch_size, hidden_size]. - " +"`encoder_outputs` (List(Tensor)): A list of Tensor containing hidden-" +"states of the model at each hidden layer in the Transformer encoder. " +"The length of the list is `num_hidden_layers`. Each Tensor has a data" +" type of float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) or (`encoder_outputs`," +" `pooled_output`)." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:12 +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:10 +#: paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:37 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:16 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:41 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:40 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:1 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:46 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:44 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:4 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:50 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:49 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers`. Each Tensor has a data type of float32 and its shape" +" is [batch_size, sequence_length, hidden_size]." 
+msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerModel.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:22 +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward:13 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward:13 +#: paddlenlp.transformers.roformer.modeling.RoFormerModel.forward:55 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainedModel:1 +msgid "" +"An abstract class for pretrained RoFormer models. It provides RoFormer " +"related `model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining:1 +msgid "RoFormer Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining:3 +msgid "An instance of :class:`RoFormerModel`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:3 +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:5 +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:3 +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:5 +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward:3 +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward:5 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward:3 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward:5 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:3 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:5 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:7 +msgid "See :class:`RoFormerModel`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:7 +msgid "See :class:`RoFormerPretrainingHeads`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:10 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:14 +msgid "" +"Returns tuple (``prediction_scores``, ``seq_relationship_score``). With " +"the fields: - `prediction_scores` (Tensor): The scores of masked " +"token prediction. Its data type should be float32. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]. - `seq_relationship_score` (Tensor): The scores of next" +" sentence prediction. Its data type should be float32 and its shape " +"is [batch_size, 2]." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:10 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:14 +msgid "Returns tuple (``prediction_scores``, ``seq_relationship_score``)." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:17 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:21 +msgid "`prediction_scores` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:15 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:19 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:20 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:24 +msgid "`seq_relationship_score` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForPretraining.forward:20 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:24 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion:1 +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion:1 +msgid "" +"Vocabulary size of `inputs_ids` in `RoFormerModel`. Defines the number of" +" different tokens that can be represented by the `inputs_ids` passed when" +" calling `RoFormerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:1 +msgid "" +"The scores of masked token prediction. Its data type should be float32. " +"If `masked_positions` is None, its shape is [batch_size, sequence_length," +" vocab_size]. Otherwise, its shape is [batch_size, mask_token_num, " +"vocab_size]" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:5 +msgid "" +"The scores of next sentence prediction. Its data type should be float32 " +"and its shape is [batch_size, 2]" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:8 +msgid "" +"The labels of the masked language modeling, its dimensionality is equal " +"to `prediction_scores`. Its data type should be int64. If " +"`masked_positions` is None, its shape is [batch_size, sequence_length, " +"1]. 
Otherwise, its shape is [batch_size, mask_token_num, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:12 +msgid "" +"The labels of the next sentence prediction task, the dimensionality of " +"`next_sentence_labels` is equal to `seq_relation_labels`. Its data type " +"should be int64 and its shape is [batch_size, 1]" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:16 +msgid "" +"The scale of masked tokens. Used for the normalization of masked language" +" modeling loss. If it is a `Tensor`, its data type should be int64 and " +"its shape is equal to `prediction_scores`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingCriterion.forward:20 +msgid "" +"The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean" +" of `next_sentence_loss`. Its data type should be float32 and its shape " +"is [1]." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:1 +msgid "Perform language modeling task and next sentence classification task." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:9 +msgid "Activation function used in the language modeling task." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads:11 +msgid "" +"Decoding weights used to map hidden_states to logits of the masked token " +"prediction. Its data type should be float32 and its shape is [vocab_size," +" hidden_size]. Defaults to `None`, which means use the same weights of " +"the embedding layer." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerPretrainingHeads.forward:8 +msgid "" +"A tensor indicates positions to be masked in the position embedding. Its " +"data type should be int64 and its shape is [batch_size, mask_token_num]. " +"`mask_token_num` is the number of masked tokens. It should be no bigger " +"than `sequence_length`. Defaults to `None`, which means we output hidden-" +"states of all tokens in masked token prediction." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification:1 +msgid "" +"RoFormer Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification:4 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification:4 +msgid "An instance of `paddlenlp.transformers.RoFormerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification:6 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification:6 +msgid "The number of classes. Default to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification:8 +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification:8 +msgid "" +"The dropout probability for output of RoFormer. If None, use the same " +"value as `hidden_dropout_prob` of `paddlenlp.transformers.RoFormerModel` " +"instance. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForSequenceClassification.forward:8 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." 
+msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification:1 +msgid "" +"RoFormer Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForTokenClassification.forward:8 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering:1 +msgid "" +"RoFormer with a linear layer on top of the hidden-states output to " +"compute `span_start_logits` and `span_end_logits`, designed for question-" +"answering tasks like SQuAD." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering:4 +msgid "An instance of RoFormerModel." +msgstr "" + +#: of paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering:6 +msgid "" +"The dropout probability for output of RoFormer. If None, use the same " +"value as `hidden_dropout_prob` of `RoFormerModel` instance `roformer`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:1 +msgid "" +"The RoFormerForQuestionAnswering forward method, overrides the __call__()" +" special method." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:8 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. -" +" `end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:8 +msgid "Returns tuple (`start_logits`, `end_logits`)." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:14 +msgid "`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:13 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:17 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.modeling.RoFormerForQuestionAnswering.forward:17 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.po new file mode 100644 index 0000000000000000000000000000000000000000..6c168e065af2e55f04a9010ca0ce08f4e453d00d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.roformer.rst:2 +msgid "roformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..59189b79266395381f6f0bd55803b98d104c4ac5 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformer.tokenizer.po @@ -0,0 +1,322 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.roformer.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:1 +msgid "" +"Constructs a RoFormer tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing, jieba pretokenizer and so on, and " +"follows a WordPiece tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.JiebaBasicTokenizer +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:12 +msgid "" +"Whether or not to lowercase the input when tokenizing. 
If you use the " +"RoFormer pretrained model, lower is set to False when using the cased " +"model, otherwise it is set to True. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:17 +msgid "Whether or not to tokenize the text with jieba. Default: False." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:19 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:23 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:26 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:29 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:32 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer:38 +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string:10 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.vocab_size:3 +msgid "The size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (list of string) in a single string." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string:3 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.convert_tokens_to_string:6 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:4 +msgid "A Roformer sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:13 +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask:6 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_inputs_with_special_tokens:14 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:3 +msgid "A RoFormer offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "single sequence: ``(0,0) X (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "pair of sequences: `(0,0) A (0,0) B (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "List of wordpiece offsets to which the special tokens will be added." 
+msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "" +"Optional second list of wordpiece offsets for offset mapping pairs. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.build_offset_mapping_with_special_tokens:13 +msgid "List of wordpiece offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:3 +msgid "A RoFormer sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:11 +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.create_token_type_ids_from_sequences:16 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.roformer.tokenizer.RoFormerTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers either be 0 or 1: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.JiebaBasicTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BasicTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.JiebaBasicTokenizer:1 +msgid "" +"Runs basic tokenization with jieba (punctuation splitting, lower casing, " +"jieba pretokenizer etc)." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.JiebaBasicTokenizer:3 +msgid "An instance of paddlenlp.data.Vocab." +msgstr "" + +#: of paddlenlp.transformers.roformer.tokenizer.JiebaBasicTokenizer:5 +msgid "" +"Whether the text strips accents and converts to lower case. If you use " +"the RoFormer Pretrained model, lower is set to False when using the cased" +" model, otherwise it is set to True. Defaults to `True`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..049ad921adaef995df1b92e95bf5fe3bd47e6c20 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.modeling.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.roformerv2.modeling.rst:2 +msgid "modeling" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.po new file mode 100644 index 0000000000000000000000000000000000000000..ba4bfda1b4382f0638427757329a0575a786d7d1 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.roformerv2.rst:2 +msgid "roformerv2" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..8c9eb6b2179a64b7a5d212e538b51b43a58890cd --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.roformerv2.tokenizer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.roformerv2.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..f9542ce184ba5ec4a3a7b8b4d40f16e9ed71d425 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.modeling.po @@ -0,0 +1,53 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.semantic_indexing.modeling.rst:2 +msgid "modeling" +msgstr "" + +#~ msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +#~ msgstr "" + +#~ msgid "" +#~ "This class encapsulates two ErnieEncoder " +#~ "models into one model, so query " +#~ "embedding and title embedding could be" +#~ " obtained using one model. And this" +#~ " class allows two ErnieEncoder models " +#~ "to be trained at the same time." +#~ msgstr "" + +#~ msgid "示例" +#~ msgstr "" + +#~ msgid "" +#~ "Defines the computation performed at " +#~ "every call. Should be overridden by " +#~ "all subclasses." +#~ msgstr "" + +#~ msgid "参数" +#~ msgstr "" + +#~ msgid "unpacked tuple arguments" +#~ msgstr "" + +#~ msgid "unpacked dict arguments" +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.po new file mode 100644 index 0000000000000000000000000000000000000000..fd0bc26a1616d9e6b738b6feb914c28f9a195118 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_indexing.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.semantic_indexing.rst:2 +msgid "semantic\\_indexing" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..10c87b9a27751e36a76af2776cd41f1efdee4755 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.modeling.po @@ -0,0 +1,65 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.semantic_search.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder:1 +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder:1 +msgid "" +"This class encapsulates two ErnieEncoder models into one model, so query " +"embedding and title embedding could be obtained using one model. And this" +" class allows two ErnieEncoder models to be trained at the same time." +msgstr "" + +#: of paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder:2 +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder:6 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder.forward:1 +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder.forward:1 +msgid "" +"Defines the computation performed at every call. Should be overridden by " +"all subclasses." +msgstr "" + +#: of paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder.forward +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder.forward +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder.forward:4 +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder.forward:4 +msgid "unpacked tuple arguments" +msgstr "" + +#: of +#: paddlenlp.transformers.semantic_search.modeling.ErnieCrossEncoder.forward:6 +#: paddlenlp.transformers.semantic_search.modeling.ErnieDualEncoder.forward:6 +msgid "unpacked dict arguments" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.po new file mode 100644 index 0000000000000000000000000000000000000000..5ec6b92a52a7beb070883369ec95f0a38030cef9 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.semantic_search.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.semantic_search.rst:2 +msgid "semantic\\_search" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..3f1162fb0f64665eb78447f5a2ffda03fd13d549 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.modeling.po @@ -0,0 +1,455 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.skep.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForSequenceClassification:1 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification:1 +#: paddlenlp.transformers.skep.modeling.SkepModel:1 +msgid "基类::class:`paddlenlp.transformers.skep.modeling.SkepPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:1 +msgid "The bare SKEP Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:10 +msgid "" +"More details refer to `SKEP `." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepModel +#: paddlenlp.transformers.skep.modeling.SkepModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:12 +msgid "" +"Vocabulary size of `inputs_ids` in `SKEPModel`. Defines the number of " +"different tokens that can be represented by the `inputs_ids` passed when " +"calling `SKEPModel`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:15 +msgid "" +"Dimensionality of the embedding layer, encoder layers and the pooler " +"layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:17 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:19 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:22 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:27 +msgid "" +"The non-linear activation function in the feed-forward layer. 
" +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:31 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to ``0.1``." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:34 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:37 +msgid "" +"The maximum value of the dimensionality of position encoding. The " +"dimensionality of position encoding is the dimensionality of the sequence" +" in `TinyBertModel`. Defaults to `512`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:41 +msgid "" +"The vocabulary size of the `token_type_ids` passed when calling " +"`~transformers.SkepModel`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:44 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`SkepPretrainedModel.init_weights()` for how" +" weights are initialized in `SkepModel`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:44 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:48 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`SkepPretrainedModel.init_weights()` for how weights are " +"initialized in `SkepModel`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel:51 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:1 +msgid "The SkepModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:13 +msgid "1 corresponds to a *sentence B* token." 
+msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:18 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:34 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`). With the fields: - " +"`sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:34 +msgid "Returns tuple (`sequence_output`, `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:36 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:40 +msgid "`sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:39 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:44 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepModel.forward:43 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward +#: paddlenlp.transformers.skep.modeling.SkepModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:17 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:17 +#: paddlenlp.transformers.skep.modeling.SkepModel.forward:49 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepPretrainedModel:1 +msgid "" +"An abstract class for pretrained Skep models. It provides Skep related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForSequenceClassification:1 +msgid "" +"SKEP Model with a linear layer on top of the pooled output, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification:4 +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification:4 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification:4 +msgid "An instance of SkepModel." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForSequenceClassification:6 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForSequenceClassification:8 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification:8 +msgid "" +"The dropout probability for output of SKEP. If None, use the same value " +"as `hidden_dropout_prob` of `SkepModel` instance `skep`. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:1 +msgid "" +"The SkepForSequenceClassification forward method, overrides the " +"__call__() special method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:3 +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:5 +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:7 +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:9 +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:3 +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:5 +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:7 +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:9 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:3 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:5 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:7 +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:9 +msgid "See :class:`SkepModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepForSequenceClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForTokenClassification:1 +msgid "" +"SKEP Model with a linear layer on top of the hidden-states output layer, " +"designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:1 +msgid "" +"The SkepForTokenClassification forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepForTokenClassification.forward:12 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification:1 +msgid "" +"SKEPCRF Model with a linear layer on top of the hidden-states output " +"layer, designed for token classification tasks like NER tasks." +msgstr "" + +#: of paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification:6 +msgid "The number of classes." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:1 +msgid "" +"The SkepCrfForTokenClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:11 +msgid "" +"The input length tensor storing real length of each sequence for " +"correctness. Its data type should be int64 and its shape is " +"`[batch_size]`. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:15 +msgid "" +"The input label tensor. Its data type should be int64 and its shape is " +"`[batch_size, sequence_length]`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:19 +msgid "" +"Returns tensor `loss` if `labels` is not None. Otherwise, returns tensor " +"`prediction`. - `loss` (Tensor): The crf loss. Its data type is " +"float32 and its shape is `[batch_size]`. 
- `prediction` (Tensor): " +"The prediction tensor containing the highest scoring tag indices. Its" +" data type is int64 and its shape is `[batch_size, sequence_length]`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:19 +msgid "" +"Returns tensor `loss` if `labels` is not None. Otherwise, returns tensor " +"`prediction`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:22 +msgid "`loss` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:22 +msgid "The crf loss. Its data type is float32 and its shape is `[batch_size]`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:25 +msgid "`prediction` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.modeling.SkepCrfForTokenClassification.forward:25 +msgid "" +"The prediction tensor containing the highest scoring tag indices. Its " +"data type is int64 and its shape is `[batch_size, sequence_length]`." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.po new file mode 100644 index 0000000000000000000000000000000000000000..d7925d9d9d603cba2f3dbe65136dc2cf3e78a383 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.skep.rst:2 +msgid "skep" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..e415d3306b2b925dfb14927ce1a01546e0e4c1f8 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.skep.tokenizer.po @@ -0,0 +1,239 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.skep.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:1 +msgid "" +"Constructs a Skep tokenizer. It uses a basic tokenizer to do punctuation " +"splitting, lower casing and so on, and follows a WordPiece tokenizer to " +"tokenize as subwords." 
+msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.save_resources +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:12 +msgid "The vocabulary file path of a `BpeTokenizer`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:14 +msgid "The json file path of a `BpeTokenizer`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:16 +msgid "Whether or not to use BPE Encoder. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:18 +msgid "Whether or not to use token type id. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:20 +msgid "Whether or not to add two different `sep_token`. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:22 +msgid "The special token for unknown words. Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:25 +msgid "The special token for separator token. Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:28 +msgid "The special token for padding. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:31 +msgid "The special token for cls. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:34 +msgid "The special token for mask. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer:39 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer.vocab_size:3 +msgid "the size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add:3 +msgid "" +"Returns the number of added tokens in the case of a sequence pair if set " +"to True, returns the number of added tokens in the case of a single " +"sequence if set to False. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.num_special_tokens_to_add:8 +msgid "Number of tokens added to sequences" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:4 +msgid "" +"A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the " +"following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:6 +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:11 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:9 +msgid "A skep_roberta_large_en sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:12 +msgid "pair of sequences: ``[CLS] A [SEP] [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:14 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:16 +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:15 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.build_inputs_with_special_tokens:20 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." 
+msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has " +"the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:9 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:11 +msgid "note: There is no need token type ids for skep_roberta_large_ch model." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.skep.tokenizer.SkepTokenizer.create_token_type_ids_from_sequences:19 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer.save_resources:1 +msgid "Save tokenizer related resources to files under `save_directory`." +msgstr "" + +#: of paddlenlp.transformers.skep.tokenizer.SkepTokenizer.save_resources:3 +msgid "Directory to save files into." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..f002c3f3141d814d653d9c3b782f777018f7932b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.modeling.po @@ -0,0 +1,417 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.squeezebert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering:1 +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification:1 +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification:1 +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:1 +msgid "基类::class:`paddlenlp.transformers.squeezebert.modeling.SqueezeBertPreTrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:1 +msgid "" +"Vocabulary size of `inputs_ids` in `SqueezeBertModel`. Also is the vocab " +"size of token embedding matrix. Defines the number of different tokens " +"that can be represented by the `inputs_ids` passed when calling " +"`BertModel`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:4 +msgid "" +"Dimensionality of the embedding layer, encoder layer and pooler layer. " +"Defaults to `768`." 
+msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:6 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:8 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:11 +msgid "Output chans for intermediate layer." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:13 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:17 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:20 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:23 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:26 +msgid "The vocabulary size of `token_type_ids`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:29 +msgid "" +"number of query groups for all layers in the BertModule. (eventually we " +"could change the interface to allow different groups for different " +"layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:32 +msgid "" +"number of key groups for all layers in the BertModule. (eventually we " +"could change the interface to allow different groups for different " +"layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:35 +msgid "" +"number of value groups for all layers in the BertModule. (eventually we " +"could change the interface to allow different groups for different " +"layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:38 +msgid "" +"number of output groups for all layers in the BertModule. (eventually we " +"could change the interface to allow different groups for different " +"layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:41 +msgid "" +"number of intermediate groups for all layers in the BertModule. " +"(eventually we could change the interface to allow different groups for " +"different layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:44 +msgid "" +"number of post groups for all layers in the BertModule. (eventually we " +"could change the interface to allow different groups for different " +"layers)" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:47 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`BertPretrainedModel.init_weights()` for how" +" weights are initialized in `BertModel`." 
+msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel:47 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note::" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:1 +msgid "" +"The forward method, overrides the `__call__()` special method. :param " +"input_ids: Indices of input sequence tokens in the vocabulary. They are" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:3 +msgid "" +"numerical representations of tokens that build the input sequence. Its " +"data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:6 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float and bool. If its data type is " +"int, the values should be either 0 or 1. - **1** for tokens that **not " +"masked**, - **0** for tokens that **masked**. It is a tensor with shape " +"broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. Defaults to `None`, which means nothing needed to be " +"prevented attention to." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:15 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape of" +" [batch_size, sequence_length]. Defaults to `None`, which means we don't " +"add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:24 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"Shape as `(batch_size, num_tokens)` and dtype as int64. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:28 +msgid "" +"Whether to return the attention_weight of each hidden layers. Defaults to" +" `False`." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:31 +msgid "Whether to return the output of each hidden layers. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) with " +"(`encoder_outputs`, `encoder_attentions`) by optional. With the fields: -" +" `sequence_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. 
We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]. - `encoder_outputs` (List(Tensor)): A list of Tensor " +"containing hidden-states of the model at each hidden layer in the " +"Transformer encoder. The length of the list is `num_hidden_layers` + " +"1 (Embedding Layer output). Each Tensor has a data type of float32 " +"and its shape is [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:35 +msgid "" +"Returns tuple (`sequence_output`, `pooled_output`) with " +"(`encoder_outputs`, `encoder_attentions`) by optional. With the fields: -" +" `sequence_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:39 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:43 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:42 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:47 +msgid "`encoder_outputs` (List(Tensor)):" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward:46 +msgid "" +"A list of Tensor containing hidden-states of the model at each hidden " +"layer in the Transformer encoder. The length of the list is " +"`num_hidden_layers` + 1 (Embedding Layer output). Each Tensor has a data " +"type of float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification.forward +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification:1 +msgid "" +"SqueezeBert Model with a sequence classification/regression head on top " +"(a linear layer on top of the pooled output) e.g. for GLUE tasks. :param " +"squeezebert: An instance of SqueezeBert. :type squeezebert: " +":class:`SqueezeBertModel` :param num_classes: The number of classes. " +"Defaults to `2`. :type num_classes: int, optional :param dropout: The " +"dropout probability for output of SqueezeBertModel." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification:8 +msgid "" +"If None, use the same value as `hidden_dropout_prob` of " +"`SqueezeBertModel` instance `squeezebert`. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification.forward:1 +msgid "" +"The SqueezeBertForSequenceClassification forward method, overrides the " +"__call__() special method. :param input_ids: See " +":class:`SqueezeBertModel`. 
:type input_ids: Tensor :param token_type_ids:" +" See :class:`SqueezeBertModel`. :type token_type_ids: Tensor, optional " +":param position_ids: See :class:`SqueezeBertModel`. :type position_ids: " +"Tensor, optional :param attention_mask: See :class:`SqueezeBertModel`. " +":type attention_mask: list, optional" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForSequenceClassification.forward:11 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification:1 +msgid "" +"SqueezeBert Model with a token classification head on top (a linear layer" +" on top of the hidden-states output) e.g. for Named-Entity-Recognition " +"(NER) tasks. :param squeezebert: An instance of SqueezeBertModel. :type " +"squeezebert: :class:`SqueezeBertModel` :param num_classes: The number of " +"classes. Defaults to `2`. :type num_classes: int, optional :param " +"dropout: The dropout probability for output of squeezebert." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification:8 +msgid "" +"If None, use the same value as `hidden_dropout_prob` of `SqueezeBert` " +"instance `squeezebert`. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification.forward:1 +msgid "" +"The SqueezeBertForTokenClassification forward method, overrides the " +"__call__() special method. :param input_ids: See " +":class:`SqueezeBertModel`. :type input_ids: Tensor :param token_type_ids:" +" See :class:`SqueezeBertModel`. :type token_type_ids: Tensor, optional " +":param position_ids: See :class:`SqueezeBertModel`. :type position_ids: " +"Tensor, optional :param attention_mask: See :class:`SqueezeBertModel`. " +":type attention_mask: list, optional" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForTokenClassification.forward:11 +msgid "" +"Returns tensor `logits`, a tensor of the input token classification " +"logits. Shape as `[batch_size, sequence_length, num_classes]` and dtype " +"as `float32`." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering:1 +msgid "" +"SqueezeBert Model with a span classification head on top for extractive " +"question-answering tasks like SQuAD (a linear layers on top of the " +"hidden-states output to compute `span start logits` and `span end " +"logits`). :param squeezebert: An instance of SqueezeBertModel. :type " +"squeezebert: :class:`SqueezeBertModel` :param dropout: The dropout " +"probability for output of SqueezeBert." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering:7 +msgid "" +"If None, use the same value as `hidden_dropout_prob` of " +"`SqueezeBertModel` instance `squeezebert`. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:1 +msgid "" +"The SqueezeBertForQuestionAnswering forward method, overrides the " +"__call__() special method. :param input_ids: See " +":class:`SqueezeBertModel`. :type input_ids: Tensor :param token_type_ids:" +" See :class:`SqueezeBertModel`. 
:type token_type_ids: Tensor, optional" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:7 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the start position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]. - " +"`end_logits` (Tensor): A tensor of the input token classification " +"logits, indicates the end position of the labelled span. Its data " +"type should be float32 and its shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:7 +msgid "" +"Returns tuple (`start_logits`, `end_logits`). With the fields: - " +"`start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:10 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:13 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.modeling.SqueezeBertForQuestionAnswering.forward:13 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.po new file mode 100644 index 0000000000000000000000000000000000000000..73515a3111c35060a34ac2c5fde3a175c4e467de --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.squeezebert.rst:2 +msgid "squeezebert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..42eb8d320e1a4200d5f60d840d2901c09ae77232 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.squeezebert.tokenizer.po @@ -0,0 +1,237 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.squeezebert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:1 +msgid "" +"Constructs a SqueezeBert tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:5 +msgid "file path of the vocabulary" +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:7 +msgid "" +"Whether the text strips accents and convert to lower case. Default: " +"`True`. Default: True." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:11 +msgid "The special token for unkown words. Default: \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:13 +msgid "The special token for separator token . Default: \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:15 +msgid "The special token for padding. Default: \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:17 +msgid "The special token for cls. Default: \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:19 +msgid "The special token for mask. Default: \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer:23 +msgid "实际案例" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.vocab_size:1 +msgid "" +"return the size of vocabulary. :returns: the size of vocabulary. :rtype: " +"int" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting. :param tokens: A list of string representing tokens" +" to be converted. 
:type tokens: list" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.num_special_tokens_to_add +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.convert_tokens_to_string:7 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens. .. note::" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.num_special_tokens_to_add:7 +msgid "" +"Returns the number of added tokens in the case of a sequence pair if set " +"to True, returns the number of added tokens in the case of a single " +"sequence if set to False." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.num_special_tokens_to_add:10 +msgid "Number of tokens added to sequences" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. A " +"SqueezeBert sequence has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens:7 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens:9 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens:12 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_inputs_with_special_tokens:13 +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences:13 +msgid ":obj:`List[int]`" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens. 
A SqueezeBert offset_mapping has the following" +" format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens:8 +msgid "Optional second list of char offsets for offset mapping pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens:11 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.build_offset_mapping_with_special_tokens:12 +msgid ":obj:`List[tuple]`" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task. A SqueezeBert sequence pair mask has the following " +"format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If :obj:`token_ids_1` is :obj:`None`, this method only returns the first " +"portion of the mask (0s). :param token_ids_0: List of IDs. :type " +"token_ids_0: :obj:`List[int]` :param token_ids_1: Optional second list of" +" IDs for sequence pairs. :type token_ids_1: :obj:`List[int]`, `optional`" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.create_token_type_ids_from_sequences:12 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods. :param token_ids_0: List of ids of the " +"first sequence. :type token_ids_0: List[int] :param token_ids_1: List of " +"ids of the second sequence. :type token_ids_1: List[int], optinal :param " +"already_has_special_tokens: Whether or not the token list is already" +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.get_special_tokens_mask:8 +msgid "formatted with special tokens for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.squeezebert.tokenizer.SqueezeBertTokenizer.get_special_tokens_mask:11 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 for a " +"sequence token." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..3cc4a09950ed14f98625bbb5eac8700bfb432666 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.modeling.po @@ -0,0 +1,475 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.t5.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration:1 +#: paddlenlp.transformers.t5.modeling.T5Model:1 +msgid "基类::class:`paddlenlp.transformers.t5.modeling.T5PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:1 +msgid "" +"The bare T5 Model transformer outputting raw hidden-states without any " +"specific head on top." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward +#: paddlenlp.transformers.t5.modeling.T5Model +#: paddlenlp.transformers.t5.modeling.T5Model.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:10 +msgid "Whether to tie input and output embeddings. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:12 +msgid "The id of the `padding` token. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:14 +msgid "The id of the `bos` token. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:16 +msgid "The id of the `eos` token. Defaults to `1`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:18 +msgid "" +"A factor for initializing all weight matrices (should be kept to 1, used " +"internally for initialization testing). Defaults to `1.0`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:21 +msgid "" +"Vocabulary size of `inputs_ids` in `T5Model`. Also is the vocab size of " +"token embedding matrix. Defines the number of different tokens that can " +"be represented by the `inputs_ids` passed when calling `T5Model`. " +"Defaults to `32128`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:24 +msgid "Dimensionality of the embedding layer, encoder layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:26 +msgid "" +"Size of the key, query, value projections per attention head. Defaults to" +" `64`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:28 +msgid "" +"Dimensionality of the feed_forward layer in the residual attention block." +" Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:30 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:32 +msgid "Number of hidden layers in the Transformer decoder. Defaults to `12`." 
+msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:34 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder and decoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:37 +msgid "The number of buckets to use for each attention layer. Defaults to `32`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:39 +msgid "The dropout ratio for all layers. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:41 +msgid "The epsilon used by the layer normalization layers. Defaults to `1e-6`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model:43 +#: paddlenlp.transformers.t5.modeling.T5Model:45 +msgid "" +"The non-linear activation function (function or string) in the feed " +"forward layer in the residual attention block. If string, `\"relu\"`, " +"`\"gated-gelu\"` are supported. Defaults to `\"relu\"`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:1 +msgid "The T5Model forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:7 +msgid "" +"Mask used in multi-head attention to avoid performing attention on to " +"some unwanted positions, usually the paddings or the subsequent " +"positions. Its data type can be int, float. When the data type is int, " +"the `masked` tokens have `0` values and the others have `1` values. When " +"the data type is float, the `masked` tokens have `0.0` values and the " +"others have `1.0` values. It is a tensor with shape broadcasted to " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:17 +msgid "" +"Indices of decoder input sequence tokens in the vocabulary. Its data type" +" should be `int64` and it has a shape of [batch_size, sequence_length]. " +"Defaults to `None`, which means no `decoder_input_ids` is provided, the " +"model will create the tensor by shifting the `input_ids` to the right." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:22 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions in `decoder_input_ids`. Its data type and shape is the" +" same as `attention_mask`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:25 +msgid "" +"The output of the encoder, a tuple consists `last_hidden_state`, " +"`hidden_states`(optional), `attentions`(optional). The data type of " +"`last_hidden_state` is float32 and its shape is [batch_size, " +"sequence_length, hidden_size]. `hidden_states` is hidden_states of all " +"layers in the Transformer encoder. The length of `hidden_states` is " +"`num_hidden_layers + 1`. For all element in the tuple, its data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]. `attentions` is attentions of all layers of in the " +"Transformer encoder. The length of `attentions` is `num_hidden_layers`. 
" +"For all element in the tuple, its data type should be float32 and its " +"shape is [batch_size, num_attention_heads, sequence_length, " +"sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:32 +msgid "" +"Contains pre-computed hidden-states (key and values in the attention " +"blocks) as computed by the model. Can be used to speed up sequential " +"decoding. The `input_ids` which have their past given to this model " +"should not be passed as input ids as they have already been computed. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:38 +msgid "" +"Whether or not to use cache. If set to `True`, `past_buckets_states` " +"states are returned and can be used to speed up decoding. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:42 +msgid "" +"Whether or not to return the attentions tensors of all attention layers. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:45 +msgid "" +"Whether or not to return the output of all hidden layers. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward +#: paddlenlp.transformers.t5.modeling.T5Model.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:49 +msgid "" +"Returns tuple (`last_hidden_state`, `cache`, `decoder_hidden_states`, " +"`decoder_attentions`, `cross_attentions`, `encoder_last_hidden_state`, " +"`encoder_hidden_states`, `encoder_attentions`) With the fields: - " +"`last_hidden_state` (Tensor): Sequence of hidden-states at the last " +"layer of the decoder of the model. It's data type should be float32 " +"and its shape is [batch_size, sequence_length, hidden_size]. - " +"`cache` (List[tuple(Tensor, Tensor)], optional): returned when " +"`use_cache=True` is passed. List of `tuple(Tensor, Tensor)` of length" +" `config[\"num_layers\"]`, with the first element being the previous " +"`buckets` of shape `[batch_size, num_heads, num_hashes, " +"sequence_length]` and the second being the previous `hidden_states` " +"of shape `[batch_size, sequence_length, hidden_size]`. - " +"`decoder_hidden_states` (tuple(Tensor), optional) returned when " +"``output_hidden_states=True`` is passed. Tuple of `Tensor` (one for " +"the output of the embeddings + one for the output of decoder each layer) " +"of shape `(batch_size, sequence_length, hidden_size)`. - " +"`decoder_attentions` (tuple(Tensor), optional): returned when " +"`output_attentions=True` is passed. tuple of `Tensor` (one for each " +"layer) of shape. Each Tensor has a data type of float32 and its shape" +" is [batch_size, num_heads, sequence_length, sequence_length]. - " +"`cross_attentions` (tuple(Tensor), optional): returned when " +"`output_attentions=True` is passed. tuple of `Tensor` (one for each " +"layer) of shape. Each Tensor has a data type of float32 and its shape" +" is [batch_size, num_heads, sequence_length, sequence_length]. - " +"`encoder_last_hidden_state` (Tensor): Sequence of hidden-states at " +"the last layer of the encoder of the model. It's data type should be " +"float32 and its shape is [batch_size, sequence_length, hidden_size]." +" - `encoder_hidden_states` (tuple(Tensor), optional): returned when " +"`output_hidden_states=True` is passed. tuple of `Tensor` (one for the" +" output of the embeddings + one for the output of encoder each " +"layer). 
Each Tensor has a data type of float32 and its shape is " +"[batch_size, sequence_length, hidden_size]. - `encoder_attentions` " +"(tuple(Tensor), optional): returned when `output_attentions=True` is " +"passed. tuple of `Tensor` (one for each layer) of shape. Each Tensor " +"has a data type of float32 and its shape is [batch_size, num_heads, " +"sequence_length, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:49 +msgid "" +"Returns tuple (`last_hidden_state`, `cache`, `decoder_hidden_states`, " +"`decoder_attentions`, `cross_attentions`, `encoder_last_hidden_state`, " +"`encoder_hidden_states`, `encoder_attentions`)" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:29 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:52 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:57 +msgid "`last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:55 +msgid "" +"Sequence of hidden-states at the last layer of the decoder of the model. " +"It's data type should be float32 and its shape is [batch_size, " +"sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:42 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:64 +msgid "`cache` (List[tuple(Tensor, Tensor)], optional):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:60 +msgid "" +"returned when `use_cache=True` is passed. List of `tuple(Tensor, Tensor)`" +" of length `config[\"num_layers\"]`, with the first element being the " +"previous `buckets` of shape `[batch_size, num_heads, num_hashes, " +"sequence_length]` and the second being the previous `hidden_states` of " +"shape `[batch_size, sequence_length, hidden_size]`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:45 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:68 +msgid "`decoder_hidden_states` (tuple(Tensor), optional)" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:67 +msgid "" +"returned when ``output_hidden_states=True`` is passed. Tuple of `Tensor` " +"(one for the output of the embeddings + one for the output of decoder " +"each layer) of shape `(batch_size, sequence_length, hidden_size)`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:48 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:73 +msgid "`decoder_attentions` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:71 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:76 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:92 +msgid "" +"returned when `output_attentions=True` is passed. tuple of `Tensor` (one " +"for each layer) of shape. Each Tensor has a data type of float32 and its " +"shape is [batch_size, num_heads, sequence_length, sequence_length]." 
+msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:51 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:78 +msgid "`cross_attentions` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:54 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:83 +msgid "`encoder_last_hidden_state` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:81 +msgid "" +"Sequence of hidden-states at the last layer of the encoder of the model. " +"It's data type should be float32 and its shape is [batch_size, " +"sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:57 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:89 +msgid "`encoder_hidden_states` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5Model.forward:86 +msgid "" +"returned when `output_hidden_states=True` is passed. tuple of `Tensor` " +"(one for the output of the embeddings + one for the output of encoder " +"each layer). Each Tensor has a data type of float32 and its shape is " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:59 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:93 +msgid "`encoder_attentions` (tuple(Tensor), optional):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward +#: paddlenlp.transformers.t5.modeling.T5Model.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:64 +#: paddlenlp.transformers.t5.modeling.T5Model.forward:98 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5PretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5PretrainedModel:1 +msgid "" +"An abstract class for pretrained T5 models. It provides T5 related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +"`PretrainedModel` for more details." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5PretrainedModel.init_weights:1 +msgid "Initializes and tie weights if needed." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration:1 +msgid "The T5 Model transformer with a language modeling head on top." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration:3 +msgid "An instance of :class:`T5Model`." 
+msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:1 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:3 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:5 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:7 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:9 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:11 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:19 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:21 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:23 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:42 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:45 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:48 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:51 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:54 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:57 +#: paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:60 +msgid "See :class:`T5Model`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:13 +msgid "" +"Labels for language modeling. Note that the labels **are shifted** inside" +" the model, i.e. you can set `labels = input_ids` Indices are selected in" +" `[-100, 0, ..., vocab_size]` All labels set to `-100` are ignored " +"(masked), the loss is only computed for labels in `[0, ..., vocab_size]`." +" Shape is [batch_size, sequence_length] and dtype is int64." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:26 +msgid "" +"Returns tuple (`loss`, `logits`, `cache`, `decoder_hidden_states`, " +"`decoder_attentions`, `cross_attentions`, `encoder_last_hidden_state`, " +"`encoder_hidden_states`, `encoder_attentions`) With the fields: - " +"`loss` (Tensor): returned when `labels` is provided. Language " +"modeling loss. It's data type should be float32 and its shape is [1,]. -" +" `logits` (Tensor): Prediction scores of the language modeling head" +" (scores for each vocabulary token before SoftMax). It's data " +"type should be float32 and its shape is [batch_size, sequence_length," +" vocab_size]. - `cache` (List[tuple(Tensor, Tensor)], optional): See" +" :class:`T5Model`. - `decoder_hidden_states` (tuple(Tensor), optional)" +" See :class:`T5Model`. - `decoder_attentions` (tuple(Tensor), " +"optional): See :class:`T5Model`. - `cross_attentions` " +"(tuple(Tensor), optional): See :class:`T5Model`. - " +"`encoder_last_hidden_state` (Tensor): See :class:`T5Model`. - " +"`encoder_hidden_states` (tuple(Tensor), optional): See " +":class:`T5Model`. - `encoder_attentions` (tuple(Tensor), optional): " +"See :class:`T5Model`." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:26 +msgid "" +"Returns tuple (`loss`, `logits`, `cache`, `decoder_hidden_states`, " +"`decoder_attentions`, `cross_attentions`, `encoder_last_hidden_state`, " +"`encoder_hidden_states`, `encoder_attentions`)" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:33 +msgid "`loss` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:32 +msgid "" +"returned when `labels` is provided. Language modeling loss. 
It's data " +"type should be float32 and its shape is [1,]." +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:39 +msgid "`logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.t5.modeling.T5ForConditionalGeneration.forward:36 +msgid "" +"Prediction scores of the language modeling head (scores for each " +"vocabulary token before SoftMax). It's data type should be float32 and " +"its shape is [batch_size, sequence_length, vocab_size]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.po new file mode 100644 index 0000000000000000000000000000000000000000..30b62d3a0f55690d31dfab9a00f67b890b39cbde --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.t5.rst:2 +msgid "t5" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..2023df5590557450059c3921a868d6a7659e0229 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.t5.tokenizer.po @@ -0,0 +1,266 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.t5.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:1 +msgid "基类::class:`paddlenlp.transformers.albert.tokenizer.AlbertEnglishTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:1 +msgid "" +"Constructs a T5 tokenizer based on SentencePiece . This tokenizer " +"inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:6 +msgid "" +"The vocabulary file (ends with '.spm') required to instantiate a " +"`SentencePiece `__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:9 +msgid "" +"Whether or not to lowercase the input when tokenizing. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:11 +msgid "Whether or note to remove space when tokenizing. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:13 +msgid "Whether or note to keep accents when tokenizing. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:15 +msgid "" +"A special token representing the *eos (end-of-sentence)* token. Defaults " +"to \"\"." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:18 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"\"." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"\"." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:1 +msgid "Build model inputs from a sequence or a pair of sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:3 +msgid "An Reformer sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:5 +msgid "single sequence: ``X ``" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:6 +msgid "pair of sequences: ``A B ``" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:8 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:10 +msgid "Optional second list of IDs for sequence pairs. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens:13 +msgid "List of input_id with the appropriate special tokens." 
+msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences:1 +msgid "Create a mask from the two sequences." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences:5 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences:7 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.create_token_type_ids_from_sequences:10 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 " +"for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:14 +msgid "The list of integers in the range [0, 1]:" +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.get_special_tokens_mask:15 +msgid "1 for a special token, 0 for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.convert_tokens_to_string:1 +msgid "Converts a sequence of tokens (string) in a single string." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:1 +msgid "" +"Converts a sequence of ids in a string, using the tokenizer and " +"vocabulary with options to remove special tokens and clean up " +"tokenization spaces." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:4 +msgid "" +"Similar to doing " +"``self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))``." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode:3 +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:6 +msgid "List of tokenized input ids." 
+msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode:5 +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:8 +msgid "" +"Whether or not to remove special tokens in the decoding. Defaults to " +"`False`." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode:7 +#: paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:10 +msgid "Whether or not to clean up the tokenization spaces. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.decode:13 +msgid "The decoded sentence." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode:1 +msgid "" +"Convert a list of lists of token ids into a list of strings by calling " +"decode." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.batch_decode:10 +msgid "The list of decoded sentences." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization:1 +msgid "" +"Clean up a list of simple English tokenization artifacts like spaces " +"before punctuations and abbreviated forms." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization:3 +msgid "The text to clean up." +msgstr "" + +#: of paddlenlp.transformers.t5.tokenizer.T5Tokenizer.clean_up_tokenization:6 +msgid "The cleaned-up string." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..efaa3ae560edb0998aa18a9ab40db77f4c996562 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.modeling.po @@ -0,0 +1,369 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.tinybert.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining:1 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification:1 +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel:1 +msgid "基类::class:`paddlenlp.transformers.tinybert.modeling.TinyBertPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:1 +msgid "The bare TinyBert Model transformer outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:6 +msgid "" +"This model is also a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining +#: paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `TinyBertModel`. Defines the number of" +" different tokens that can be represented by the `inputs_ids` passed when" +" calling `TinyBertModel`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:13 +msgid "" +"Dimensionality of the embedding layer, encoder layers and pooler layer. " +"Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:15 +msgid "Number of hidden layers in the Transformer encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:29 +msgid "" +"The dropout probability for all fully connected layers in the embeddings " +"and encoder. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:35 +msgid "" +"The maximum value of the dimensionality of position encoding. The " +"dimensionality of position encoding is the dimensionality of the sequence" +" in `TinyBertModel`. Defaults to `512`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:39 +msgid "" +"The vocabulary size of `token_type_ids` passed when calling `~ " +"transformers.TinyBertModel`. Defaults to `16`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:42 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`TinyBertPretrainedModel.init_weights()` for" +" how weights are initialized in `TinyBertModel`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:42 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:46 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." 
+" See :meth:`TinyBertPretrainedModel.init_weights()` for how weights are " +"initialized in `TinyBertModel`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:49 +msgid "The index of padding token in the token vocabulary. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel:52 +msgid "" +"Dimensionality of the output layer of `fit_dense(s)`, which is the hidden" +" size of the teacher model. `fit_dense(s)` means a hidden states' " +"transformation from student to teacher. `fit_dense(s)` will be generated " +"when bert model is distilled during the training, and will not be " +"generated during the prediction process. `fit_denses` is used in v2 " +"models and it has `num_hidden_layers+1` layers. `fit_dense` is used in " +"other pretraining models and it has one linear layer. Defaults to `768`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:1 +msgid "" +"The TinyBertModel forward method, overrides the `__call__()` special " +"method." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. Its data type " +"should be `int64` and it has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1: - 0 corresponds to a *sentence A* token, - 1 corresponds to a " +"*sentence B* token. Its data type should be `int64` and it has a shape " +"of [batch_size, sequence_length]. Defaults to `None`, which means we " +"don't add segment embeddings." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:7 +msgid "" +"Segment token indices to indicate different portions of the inputs. " +"Selected in the range ``[0, type_vocab_size - 1]``. If `type_vocab_size` " +"is 2, which means the inputs have two portions. Indices can either be 0 " +"or 1:" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:12 +msgid "0 corresponds to a *sentence A* token," +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:13 +msgid "1 corresponds to a *sentence B* token." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:15 +msgid "" +"Its data type should be `int64` and it has a shape of [batch_size, " +"sequence_length]. Defaults to `None`, which means we don't add segment " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:18 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. 
For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:30 +msgid "" +"Returns tuple (`encoder_output`, `pooled_output`). With the fields: - " +"`encoder_output` (Tensor): Sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is" +" [batch_size, sequence_length, hidden_size]. - `pooled_output` (Tensor):" +" The output of first token (`[CLS]`) in sequence. We \"pool\" the" +" model by simply taking the hidden state corresponding to the first " +"token. Its data type should be float32 and its shape is [batch_size, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:30 +msgid "Returns tuple (`encoder_output`, `pooled_output`)." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:32 +msgid "With the fields:" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:36 +msgid "`encoder_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:35 +msgid "" +"Sequence of hidden-states at the last layer of the model. It's data type " +"should be float32 and its shape is [batch_size, sequence_length, " +"hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:40 +msgid "`pooled_output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:39 +msgid "" +"The output of first token (`[CLS]`) in sequence. We \"pool\" the model by" +" simply taking the hidden state corresponding to the first token. Its " +"data type should be float32 and its shape is [batch_size, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:15 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:15 +#: paddlenlp.transformers.tinybert.modeling.TinyBertModel.forward:45 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertPretrainedModel:1 +msgid "" +"An abstract class for pretrained TinyBERT models. It provides TinyBERT " +"related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertPretrainedModel.init_weights:1 +msgid "Initialization hook" +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining:1 +msgid "TinyBert Model with pretraining tasks on top." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining:3 +msgid "An instance of :class:`TinyBertModel`." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:1 +msgid "" +"The TinyBertForPretraining forward method, overrides the __call__() " +"special method." +msgstr "" + +#: of paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:3 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:5 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:7 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:3 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:5 +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:7 +msgid "See :class:`TinyBertModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForPretraining.forward:10 +msgid "" +"Returns tensor `sequence_output`, sequence of hidden-states at the last " +"layer of the model. It's data type should be float32 and its shape is " +"[batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification:1 +msgid "" +"TinyBert Model with a linear layer on top of the output layer, designed " +"for sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification:4 +msgid "An instance of TinyBertModel." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification:6 +msgid "The number of classes. Defaults to `2`." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification:8 +msgid "" +"The dropout probability for output of TinyBert. If None, use the same " +"value as `hidden_dropout_prob` of `TinyBertModel` instance `tinybert`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:1 +msgid "" +"The TinyBertForSequenceClassification forward method, overrides the " +"__call__() special method." +msgstr "" + +#: of +#: paddlenlp.transformers.tinybert.modeling.TinyBertForSequenceClassification.forward:10 +msgid "" +"Returns tensor `logits`, a tensor of the input text classification " +"logits. Shape as `[batch_size, num_classes]` and dtype as float32." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.po new file mode 100644 index 0000000000000000000000000000000000000000..480d2d850b17be12ec92549619f254aabd88c88b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.tinybert.rst:2 +msgid "tinybert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..4de3dbadfb45193b9efa9089ebd2cc3842b4bb53 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tinybert.tokenizer.po @@ -0,0 +1,36 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.tinybert.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.tinybert.tokenizer.TinyBertTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.bert.tokenizer.BertTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.tinybert.tokenizer.TinyBertTokenizer:1 +msgid "" +"Constructs a TinyBert tokenizer. The usage of TinyBertTokenizer is the " +"same as `BertTokenizer " +"`__." +" For more information regarding those methods, please refer to this " +"superclass." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..ab2e9240532e3fd1468830e9f940455ecf028a51 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils.po @@ -0,0 +1,1264 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.tokenizer_utils.rst:2 +msgid "tokenizer\\_utils" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:1 +msgid "" +"The base class for all pretrained tokenizers. It mainly provides common " +"methods for loading (construction and loading) and saving pretrained " +"tokenizers. 
Loading and saving also rely on the following class " +"attributes which should be overridden by derived classes accordingly:" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:6 +msgid "" +"**tokenizer_config_file** (str): Represents the file name of tokenizer " +"configuration for configuration saving and loading in local file system. " +"The value is `tokenizer_config.json`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:9 +msgid "" +"**resource_files_names** (dict): Represents resources to specific file " +"names mapping for resource saving and loading in local file system. The " +"keys of dict representing resource items should be argument names in " +"tokenizer's `__init__` method, and the values are file names for saving " +"and loading corresponding resources. The mostly used resources here are " +"vocabulary file and sentence-piece model file." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:15 +msgid "" +"**pretrained_init_configuration** (dict): Provides the tokenizer " +"configurations of built-in pretrained tokenizers (contrasts to tokenizers" +" in local file system). It has pretrained tokenizer names as keys (the " +"same as pretrained model names, such as `bert-base-uncased`), and the " +"values are dict preserving corresponding configuration for tokenizer " +"initialization." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:20 +msgid "" +"**pretrained_resource_files_map** (dict): Provides resource URLs of " +"built-in pretrained tokenizers (contrasts to tokenizers in local file " +"system). It has the same keys as `resource_files_names`, and the values " +"are also `dict` mapping specific pretrained tokenizer names (such as " +"`bert-base-uncased`) to corresponding resource URLs." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:26 +msgid "" +"Moreover, methods common to tokenizers for tokenization, token/id " +"conversion and encoding as model inputs are also provided here." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer:29 +msgid "" +"Besides, metaclass `InitTrackerMeta` is used to create " +"`PretrainedTokenizer`, by which subclasses can track arguments for " +"initialization automatically and expose special tokens initialization " +"used as attributes." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports sequence or sequence pair as input, and batch input " +"is allowed. `self.encode()` or `self.batch_encode()` would be called " +"separately for single or batch input depending on input format and " +"`is_split_into_words` argument." 
+msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__ +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_pretrained +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_resources +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_vocabulary +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:7 +msgid "" +"The sequence or batch of sequences to be processed. One sequence is a " +"string or a list of strings depending on whether it has been " +"pretokenized. If each sequence is provided as a list of strings " +"(pretokenized), you must set `is_split_into_words` as `True` to " +"disambiguate with a batch of sequences." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:13 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:9 +msgid "" +"Same as `text` argument, while it represents for the latter sequence of " +"the sequence pair." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:16 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:10 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:12 +msgid "" +"If set to a number, will limit the total sequence returned so that it has" +" a maximum length. If there are overflowing tokens, those overflowing " +"tokens will be added to the returned dictionary when " +"`return_overflowing_tokens` is `True`. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:21 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:15 +msgid "" +"Only available for batch input of sequence pair and mainly for question " +"answering usage. When for QA, `text` represents questions and `text_pair`" +" represents contexts. If `stride` is set to a positive number, the " +"context will be split into multiple spans where `stride` defines the " +"number of (tokenized) tokens to skip from the start of one span to get " +"the next span, thus will produce a bigger batch than inputs to include " +"all spans. Moreover, 'overflow_to_sample' and 'offset_mapping' preserving" +" the original example and position information will be added to the " +"returned dictionary. Defaults to 0." 
+msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:31 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:25 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:17 +msgid "" +"If set to `True`, the returned sequences would be padded up to " +"`max_seq_len` specified length according to padding side " +"(`self.padding_side`) and padding token id. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:35 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:29 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:21 +msgid "" +"String selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"`max_seq_len` starting from the longest one at each token (when there is " +"a pair of input sequences). - 'only_first': Only truncate the first " +"sequence. - 'only_second': Only truncate the second sequence. - " +"'do_not_truncate': Do not truncate (raise an error if the input sequence " +"is longer than `max_seq_len`). Defaults to 'longest_first'." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:35 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:29 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:21 +msgid "String selected in the following options:" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:37 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:31 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:23 +msgid "'longest_first' (default) Iteratively reduce the inputs sequence" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:38 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:32 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:24 +msgid "" +"until the input is under `max_seq_len` starting from the longest one at " +"each token (when there is a pair of input sequences). - 'only_first': " +"Only truncate the first sequence. - 'only_second': Only truncate the " +"second sequence. - 'do_not_truncate': Do not truncate (raise an error if " +"the input sequence is longer than `max_seq_len`)." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:45 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:39 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:31 +msgid "Defaults to 'longest_first'." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:47 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:41 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:33 +msgid "" +"Whether to include tokens position ids in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:50 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:44 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:36 +msgid "" +"Whether to include token type ids in the returned dictionary. Defaults to" +" `True`." 
+msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:53 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:47 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:39 +msgid "" +"Whether to include the attention mask in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:56 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:50 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:42 +msgid "" +"Whether to include the length of each encoded inputs in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:59 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:53 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:45 +msgid "" +"Whether to include overflowing token information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:62 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:56 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:48 +msgid "" +"Whether to include special tokens mask information in the returned " +"dictionary. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:65 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:59 +msgid "" +"Decide the format for returned encoded batch inputs. Only works when " +"input is a batch of data. :: - If True, encoded inputs would be a " +"dictionary like: {'input_ids': [[1, 4444, 4385, 1545, 6712],[1, " +"4444, 4385]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0]]}" +" - If False, encoded inputs would be a list like: " +"[{'input_ids': [1, 4444, 4385, 1545, 6712], 'token_type_ids': " +"[0, 0, 0, 0, 0]}, {'input_ids': [1, 4444, 4385], " +"'token_type_ids': [0, 0, 0]}] Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:65 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:59 +msgid "" +"Decide the format for returned encoded batch inputs. Only works when " +"input is a batch of data. ::" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:76 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:70 +msgid "Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:78 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:72 +msgid "" +"Whether to include the list of pair preserving the index of start and end" +" char in original input for each token in the returned dictionary. Would " +"be automatically set to `True` when `stride` > 0. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:83 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:77 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:55 +msgid "" +"Whether to add the special tokens associated with the corresponding model" +" to the encoded inputs. 
Defaults to `True`" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__ +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.tokenizer_utils.convert_to_unicode +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:87 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int] or" +" list[list[int]]): List of token ids to be fed to a model. - " +"**position_ids** (list[int] or list[list[int]], optional): List of token " +"position ids to be fed to a model. Included when `return_position_ids` " +"is `True` - **token_type_ids** (list[int] or list[list[int]], optional): " +"List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int] or " +"list[list[int]], optional): List of integers valued 0 or 1, where 0 " +"specifies paddings and should not be attended to by the model. Included" +" when `return_attention_mask` is `True`. - **seq_len** (int or list[int]," +" optional): The input_ids length. Included when `return_length` is " +"`True`. - **overflowing_tokens** (list[int] or list[list[int]], " +"optional): List of overflowing tokens. Included when if `max_seq_len` " +"is specified and `return_overflowing_tokens` is True. - " +"**num_truncated_tokens** (int or list[int], optional): The number of " +"overflowing tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **special_tokens_mask** " +"(list[int] or list[list[int]], optional): List of integers valued 0 or 1," +" with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when " +"`return_overflowing_tokens` is True or `stride` > 0. - " +"**overflow_to_sample** (int or list[int], optional): Index of example " +"from which this feature is generated. Included when `stride` works." 
+msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:87 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:81 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:59 +msgid "The dict has the following optional items:" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:89 +msgid "" +"**input_ids** (list[int] or list[list[int]]): List of token ids to be fed" +" to a model." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:90 +msgid "" +"**position_ids** (list[int] or list[list[int]], optional): List of token " +"position ids to be fed to a model. Included when `return_position_ids` is" +" `True`" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:92 +msgid "" +"**token_type_ids** (list[int] or list[list[int]], optional): List of " +"token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:94 +msgid "" +"**attention_mask** (list[int] or list[list[int]], optional): List of " +"integers valued 0 or 1, where 0 specifies paddings and should not be " +"attended to by the model. Included when `return_attention_mask` is " +"`True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:97 +msgid "" +"**seq_len** (int or list[int], optional): The input_ids length. Included " +"when `return_length` is `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:99 +msgid "" +"**overflowing_tokens** (list[int] or list[list[int]], optional): List of " +"overflowing tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:102 +msgid "" +"**num_truncated_tokens** (int or list[int], optional): The number of " +"overflowing tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:105 +msgid "" +"**special_tokens_mask** (list[int] or list[list[int]], optional): List of" +" integers valued 0 or 1, with 0 specifying special added tokens and 1 " +"specifying sequence tokens. Included when `return_special_tokens_mask` is" +" `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:108 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:102 +msgid "" +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when " +"`return_overflowing_tokens` is True or `stride` > 0." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__:112 +msgid "" +"**overflow_to_sample** (int or list[int], optional): Index of example " +"from which this feature is generated. Included when `stride` works." 
+msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.__call__ +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.tokenizer_utils.convert_to_unicode +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens:1 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens_extended:1 +msgid "" +"All the special tokens ('', ''...) corresponding to special " +"token arguments in `__init__` (arguments end with '_end')." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_ids +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens_extended +msgid "type" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_ids:3 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens:4 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_tokens_extended:4 +msgid "list" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.all_special_ids:1 +msgid "All the token ids corresponding to all the special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string by " +"using ``' '.join(tokens)`` ." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string:4 +msgid "A sequence of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.convert_tokens_to_string:7 +msgid "Converted string." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:1 +msgid "" +"Creates an instance of `PretrainedTokenizer`. Related resources are " +"loaded by specifying name of a built-in pretrained model, or a community-" +"contributed pretrained model, or a local file directory path." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:5 +msgid "" +"Name of pretrained model or dir path to load from. The string can be: - " +"Name of built-in pretrained model - Name of a community-contributed " +"pretrained model. - Local directory path which contains tokenizer related" +" resources and tokenizer config file (\"tokenizer_config.json\")." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:5 +msgid "Name of pretrained model or dir path to load from. The string can be:" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:8 +msgid "Name of built-in pretrained model" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:9 +msgid "Name of a community-contributed pretrained model." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:10 +msgid "" +"Local directory path which contains tokenizer related resources and " +"tokenizer config file (\"tokenizer_config.json\")." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:13 +msgid "" +"position arguments for model `__init__`. If provided, use these as " +"position argument values for tokenizer initialization." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:16 +msgid "" +"keyword arguments for model `__init__`. If provided, use these to update " +"pre-defined keyword argument values for tokenizer initialization." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:21 +msgid "An instance of `PretrainedTokenizer`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.from_pretrained:25 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_pretrained:14 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_pretrained:1 +msgid "" +"Save tokenizer configuration and related resources to files under " +"`save_directory`. The tokenizer configuration would be saved into " +"`tokenizer_config_file` indicating file (thus `tokenizer_config.json`), " +"and resources would be saved into `resource_files_names` indicating files" +" by using `self.save_resources(save_directory)`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_pretrained:7 +msgid "" +"The `save_directory` can be used in `from_pretrained` as argument value " +"of `pretrained_model_name_or_path` to re-load the tokenizer." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_pretrained:10 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_resources:1 +msgid "" +"Save tokenizer related resources to `resource_files_names` indicating " +"files under `save_directory` by copying directly. Override it if " +"necessary." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:1 +msgid "" +"Instantiate an instance of `Vocab` from a file reserving all tokens by " +"using `Vocab.from_dict`. The file contains a token per line, and the line" +" number would be the index of corresponding token." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:5 +msgid "path of file to construct vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:7 +msgid "" +"special token for unknown token. If no need, it also could be `None`. " +"Defaults to `None`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:10 +msgid "" +"special token for padding token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:13 +msgid "" +"special token for bos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:16 +msgid "" +"special token for eos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:19 +msgid "keyword arguments for `Vocab.from_dict`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.load_vocabulary:22 +msgid "An instance of `Vocab`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_vocabulary:1 +msgid "" +"Save all tokens to a vocabulary file. The file contains a token per line," +" and the line number would be the index of corresponding token." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_vocabulary:4 +msgid "File path to be saved to." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.save_vocabulary:6 +msgid "The `Vocab` or `dict` instance to be saved." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:1 +msgid "Truncates a sequence pair in place to the maximum length." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:3 +msgid "" +"list of tokenized input ids. Can be obtained from a string by chaining " +"the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:5 +msgid "" +"Optional second list of input ids. Can be obtained from a string by " +"chaining the `tokenize` and `convert_tokens_to_ids` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:7 +msgid "number of tokens to remove using the truncation strategy" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:9 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len starting from the longest one at each token (when there " +"is a pair of input sequences). Overflowing tokens only contains " +"overflow from the first sequence. - 'only_first': Only truncate the first" +" sequence. raise an error if the first sequence is shorter or equal to " +"than num_tokens_to_remove. - 'only_second': Only truncate the second " +"sequence - 'do_not_truncate': Does not truncate (raise an error if the " +"input sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:9 +msgid "" +"string selected in the following options: - 'longest_first' (default) " +"Iteratively reduce the inputs sequence until the input is under " +"max_seq_len" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:11 +msgid "" +"starting from the longest one at each token (when there is a pair of " +"input sequences). 
Overflowing tokens only contains overflow from the " +"first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:13 +msgid "" +"'only_first': Only truncate the first sequence. raise an error if the " +"first sequence is shorter or equal to than num_tokens_to_remove." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:14 +msgid "'only_second': Only truncate the second sequence" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:15 +msgid "" +"'do_not_truncate': Does not truncate (raise an error if the input " +"sequence is longer than max_seq_len)" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.truncate_sequences:16 +msgid "" +"If set to a number along with max_seq_len, the overflowing tokens " +"returned will contain some tokens from the main sequence returned. The " +"value of this argument defines the number of additional tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens:4 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens:3 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"Should be overridden in a subclass if the model has a special way of " +"building those." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens:6 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens:8 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_inputs_with_special_tokens:11 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens:7 +msgid "Optional second list of char offsets for offset mapping pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 " +"for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:14 +msgid "The list of integers in the range [0, 1]:" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_special_tokens_mask:15 +msgid "1 for a special token, 0 for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the number of added tokens should be computed in the case of a " +"sequence pair or a single sequence. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.num_special_tokens_to_add:7 +msgid "Number of special tokens added to sequences." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports sequence or sequence pair as input, and batch input " +"is not allowed." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:5 +msgid "" +"The sequence to be processed. One sequence is a string, a list of " +"strings, or a list of integers depending on whether it has been " +"pretokenized and converted to ids." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:51 +msgid "" +"Whether to include the list of pair preserving the index of start and end" +" char in original input for each token in the returned dictionary. " +"Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:59 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. 
- **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:83 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:61 +msgid "**input_ids** (list[int]): List of token ids to be fed to a model." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:84 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:62 +msgid "" +"**position_ids** (list[int], optional): List of token position ids to be " +"fed to a model. Included when `return_position_ids` is `True`" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:86 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:64 +msgid "" +"**token_type_ids** (list[int], optional): List of token type ids to be " +"fed to a model. Included when `return_token_type_ids` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:88 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:66 +msgid "" +"**attention_mask** (list[int], optional): List of integers valued 0 or 1," +" where 0 specifies paddings and should not be attended to by the model. " +"Included when `return_attention_mask` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:91 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:69 +msgid "" +"**seq_len** (int, optional): The input_ids length. Included when " +"`return_length` is `True`." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:93 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:71 +msgid "" +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:96 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:74 +msgid "" +"**num_truncated_tokens** (int, optional): The number of overflowing " +"tokens. Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:99 +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:77 +msgid "" +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.encode:80 +msgid "" +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when " +"`return_overflowing_tokens` is True." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:1 +msgid "" +"Performs tokenization and uses the tokenized tokens to prepare model " +"inputs. It supports batch inputs of sequence or sequence pair." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:4 +msgid "" +"The element of list can be sequence or sequence pair, and the sequence is" +" a string or a list of strings depending on whether it has been " +"pretokenized. If each sequence is provided as a list of strings " +"(pretokenized), you must set `is_split_into_words` as `True` to " +"disambiguate with a sequence pair." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:81 +msgid "" +"The dict has the following optional items: - **input_ids** (list[int]): " +"List of token ids to be fed to a model. - **position_ids** (list[int], " +"optional): List of token position ids to be fed to a model. Included " +"when `return_position_ids` is `True` - **token_type_ids** (list[int], " +"optional): List of token type ids to be fed to a model. Included when " +"`return_token_type_ids` is `True`. - **attention_mask** (list[int], " +"optional): List of integers valued 0 or 1, where 0 specifies paddings " +"and should not be attended to by the model. Included when " +"`return_attention_mask` is `True`. - **seq_len** (int, optional): The " +"input_ids length. Included when `return_length` is `True`. - " +"**overflowing_tokens** (list[int], optional): List of overflowing tokens." +" Included when if `max_seq_len` is specified and " +"`return_overflowing_tokens` is True. - **num_truncated_tokens** (int, " +"optional): The number of overflowing tokens. Included when if " +"`max_seq_len` is specified and `return_overflowing_tokens` is True. - " +"**special_tokens_mask** (list[int], optional): List of integers valued 0 " +"or 1, with 0 specifying special added tokens and 1 specifying sequence " +"tokens. Included when `return_special_tokens_mask` is `True`. - " +"**offset_mapping** (list[int], optional): list of pair preserving the " +"index of start and end char in original input for each token. For a " +"sqecial token, the index pair is `(0, 0)`. Included when " +"`return_overflowing_tokens` is True or `stride` > 0. - " +"**overflow_to_sample** (int, optional): Index of example from which this" +" feature is generated. Included when `stride` works." 
+msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.batch_encode:106 +msgid "" +"**overflow_to_sample** (int, optional): Index of example from which this " +"feature is generated. Included when `stride` works." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping:1 +msgid "" +"Returns the map of tokens and the start and end index of their start and " +"end character. Modified from " +"https://github.com/bojone/bert4keras/blob/master/bert4keras/tokenizers.py#L372" +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping:4 +msgid "Input text." +msgstr "" + +#: of +#: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer.get_offset_mapping:7 +msgid "The offset map of input text." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:1 +msgid "" +"The base class for all bpe tokenizers. It mainly provides common tokenize" +" methods for bpe type tokenizer." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:4 +msgid "file path of the vocabulary." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:6 +msgid "file path of the id to vocab." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:8 +msgid "file path of word merge text." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:10 +msgid "The special token for unknown words. Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:13 +msgid "The special token for separator token. Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:16 +msgid "The special token for padding. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:19 +msgid "The special token for cls. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.BPETokenizer:22 +msgid "The special token for mask. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.tokenize_chinese_chars:1 +msgid "Adds whitespace around any CJK character." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.is_chinese_char:1 +msgid "Checks whether CP is the codepoint of a CJK character." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.normalize_chars:1 +msgid "" +"Normalize the text for multiligual and chinese models. Unicode range: " +"https://www.ling.upenn.edu/courses/Spring_2003/ling538/UnicodeRanges.html" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.tokenize_special_chars:1 +msgid "Adds whitespace around any special character." +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.convert_to_unicode:1 +msgid "" +"Converts `text` to Unicode (if it's not already), assuming utf-8 input. " +":param text: Text to be converted to unicode. :type text: str|bytes" +msgstr "" + +#: of paddlenlp.transformers.tokenizer_utils.convert_to_unicode:5 +msgid "converted text." +msgstr "" + +#~ msgid "" +#~ "The dict has the following optional " +#~ "items: - **input_ids** (list[int]): List " +#~ "of token ids to be fed to a" +#~ " model. - **position_ids** (list[int], " +#~ "optional): List of token position ids" +#~ " to be fed to a model. 
" +#~ "Included when `return_position_ids` is `True`" +#~ " - **token_type_ids** (list[int], optional): " +#~ "List of token type ids to be " +#~ "fed to a model. Included when " +#~ "`return_token_type_ids` is `True`. - " +#~ "**attention_mask** (list[int], optional): List " +#~ "of integers valued 0 or 1, where" +#~ " 0 specifies paddings and should not" +#~ " be attended to by the model. " +#~ "Included when `return_attention_mask` is " +#~ "`True`. - **seq_len** (int, optional): " +#~ "The input_ids length. Included when " +#~ "`return_length` is `True`. - " +#~ "**overflowing_tokens** (list[int], optional): List" +#~ " of overflowing tokens. Included when " +#~ "if `max_seq_len` is specified and " +#~ "`return_overflowing_tokens` is True. - " +#~ "**num_truncated_tokens** (int, optional): The " +#~ "number of overflowing tokens. Included " +#~ "when if `max_seq_len` is specified and" +#~ " `return_overflowing_tokens` is True. - " +#~ "**special_tokens_mask** (list[int], optional): List" +#~ " of integers valued 0 or 1, " +#~ "with 0 specifying special added tokens" +#~ " and 1 specifying sequence tokens. " +#~ "Included when `return_special_tokens_mask` is " +#~ "`True`. - **offset_mapping** (list[int], " +#~ "optional): list of pair preserving the" +#~ " index of start and end char " +#~ "in original input for each token. " +#~ "For a special token, the index " +#~ "pair is `(0, 0)`. Included when " +#~ "`stride` works. - **overflow_to_sample** (int," +#~ " optional): Index of example from " +#~ "which this feature is generated. " +#~ "Included when `stride` works." +#~ msgstr "" + +#~ msgid "" +#~ "**offset_mapping** (list[int], optional): list " +#~ "of pair preserving the index of " +#~ "start and end char in original " +#~ "input for each token. For a " +#~ "special token, the index pair is " +#~ "`(0, 0)`. Included when `stride` works." +#~ msgstr "" + +#~ msgid "" +#~ "The dict has the following optional " +#~ "items: - **input_ids** (list[int]): List " +#~ "of token ids to be fed to a" +#~ " model. - **position_ids** (list[int], " +#~ "optional): List of token position ids" +#~ " to be fed to a model. " +#~ "Included when `return_position_ids` is `True`" +#~ " - **token_type_ids** (list[int], optional): " +#~ "List of token type ids to be " +#~ "fed to a model. Included when " +#~ "`return_token_type_ids` is `True`. - " +#~ "**attention_mask** (list[int], optional): List " +#~ "of integers valued 0 or 1, where" +#~ " 0 specifies paddings and should not" +#~ " be attended to by the model. " +#~ "Included when `return_attention_mask` is " +#~ "`True`. - **seq_len** (int, optional): " +#~ "The input_ids length. Included when " +#~ "`return_length` is `True`. - " +#~ "**overflowing_tokens** (list[int], optional): List" +#~ " of overflowing tokens. Included when " +#~ "if `max_seq_len` is specified and " +#~ "`return_overflowing_tokens` is True. - " +#~ "**num_truncated_tokens** (int, optional): The " +#~ "number of overflowing tokens. Included " +#~ "when if `max_seq_len` is specified and" +#~ " `return_overflowing_tokens` is True. - " +#~ "**special_tokens_mask** (list[int], optional): List" +#~ " of integers valued 0 or 1, " +#~ "with 0 specifying special added tokens" +#~ " and 1 specifying sequence tokens. " +#~ "Included when `return_special_tokens_mask` is " +#~ "`True`." +#~ msgstr "" + +#~ msgid "" +#~ "The dict has the following optional " +#~ "items: - **input_ids** (list[int]): List " +#~ "of token ids to be fed to a" +#~ " model. 
- **position_ids** (list[int], " +#~ "optional): List of token position ids" +#~ " to be fed to a model. " +#~ "Included when `return_position_ids` is `True`" +#~ " - **token_type_ids** (list[int], optional): " +#~ "List of token type ids to be " +#~ "fed to a model. Included when " +#~ "`return_token_type_ids` is `True`. - " +#~ "**attention_mask** (list[int], optional): List " +#~ "of integers valued 0 or 1, where" +#~ " 0 specifies paddings and should not" +#~ " be attended to by the model. " +#~ "Included when `return_attention_mask` is " +#~ "`True`. - **seq_len** (int, optional): " +#~ "The input_ids length. Included when " +#~ "`return_length` is `True`. - " +#~ "**overflowing_tokens** (list[int], optional): List" +#~ " of overflowing tokens. Included when " +#~ "if `max_seq_len` is specified and " +#~ "`return_overflowing_tokens` is True. - " +#~ "**num_truncated_tokens** (int, optional): The " +#~ "number of overflowing tokens. Included " +#~ "when if `max_seq_len` is specified and" +#~ " `return_overflowing_tokens` is True. - " +#~ "**special_tokens_mask** (list[int], optional): List" +#~ " of integers valued 0 or 1, " +#~ "with 0 specifying special added tokens" +#~ " and 1 specifying sequence tokens. " +#~ "Included when `return_special_tokens_mask` is " +#~ "`True`. - **offset_mapping** (list[int], " +#~ "optional): list of pair preserving the" +#~ " index of start and end char " +#~ "in original input for each token. " +#~ "For a sqecial token, the index " +#~ "pair is `(0, 0)`. Included when " +#~ "`stride` works. - **overflow_to_sample** (int," +#~ " optional): Index of example from " +#~ "which this feature is generated. " +#~ "Included when `stride` works." +#~ msgstr "" + +#~ msgid "" +#~ "**offset_mapping** (list[int], optional): list " +#~ "of pair preserving the index of " +#~ "start and end char in original " +#~ "input for each token. For a " +#~ "sqecial token, the index pair is " +#~ "`(0, 0)`. Included when `stride` works." +#~ msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_base.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_base.po new file mode 100644 index 0000000000000000000000000000000000000000..b135d3b7f82db0cd1127817bd9629fe024652199 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_base.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.tokenizer_utils_base.rst:2 +msgid "tokenizer\\_utils\\_base" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_fast.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_fast.po new file mode 100644 index 0000000000000000000000000000000000000000..5c99e1c224cdef1a0dbce4d338c0b12cc4fb7edd --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.tokenizer_utils_fast.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.transformers.tokenizer_utils_fast.rst:2 +msgid "tokenizer\\_utils\\_fast" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..f188f8fcffba594e92e5f1244126b91a6c4e7c07 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.modeling.po @@ -0,0 +1,761 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.transformer.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.position_encoding_init:1 +msgid "" +"Generates the initial values for the sinusoidal position encoding table. " +"This method follows the implementation in tensor2tensor, but is slightly " +"different from the description in \"Attention Is All You Need\"." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward +#: paddlenlp.transformers.transformer.modeling.TransformerModel +#: paddlenlp.transformers.transformer.modeling.TransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.WordEmbedding +#: paddlenlp.transformers.transformer.modeling.WordEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.position_encoding_init +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.position_encoding_init:5 +msgid "" +"The largest position for sequences, that is, the maximum length of source" +" or target sequences." 
+msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.position_encoding_init:8 +msgid "The size of positional embedding vector." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.position_encoding_init:10 +msgid "The output `numpy.array`'s data type. Defaults to \"float32\"." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward +#: paddlenlp.transformers.transformer.modeling.TransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.WordEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.position_encoding_init +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.position_encoding_init:13 +msgid "" +"The embedding table of sinusoidal position encoding with shape " +"`[n_position, d_pos_vec]`." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward +#: paddlenlp.transformers.transformer.modeling.TransformerModel.forward +#: paddlenlp.transformers.transformer.modeling.WordEmbedding.forward +#: paddlenlp.transformers.transformer.modeling.position_encoding_init +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:30 +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward:18 +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward:13 +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch:18 +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:40 +#: paddlenlp.transformers.transformer.modeling.TransformerModel.forward:19 +#: paddlenlp.transformers.transformer.modeling.WordEmbedding.forward:14 +#: paddlenlp.transformers.transformer.modeling.position_encoding_init:18 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion:1 +#: paddlenlp.transformers.transformer.modeling.PositionalEmbedding:1 +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:1 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:1 +#: paddlenlp.transformers.transformer.modeling.WordEmbedding:1 +msgid "基类::class:`paddle.fluid.dygraph.layers.Layer`" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:1 +msgid "Word Embedding layer of Transformer." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:3 +msgid "" +"This layer automatically constructs a 2D embedding matrix based on the " +"input the size of vocabulary (`vocab_size`) and the size of each " +"embedding vector (`emb_dim`). This layer lookups embeddings vector of ids" +" provided by input `word`." 
+msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:8 +msgid "" +"After the embedding, those weights are multiplied by `sqrt(d_model)` " +"which is `sqrt(emb_dim)` in the interface." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:11 +msgid "Out = embedding(word) * sqrt(emb\\_dim)" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:15 +msgid "The size of vocabulary." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding:17 +msgid "Dimensionality of each embedding vector." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:31 +#: paddlenlp.transformers.transformer.modeling.WordEmbedding:19 +msgid "The start token id and also is used as padding id. Defaults to 0." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding.forward:1 +msgid "Computes word embedding." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding.forward:3 +msgid "" +"The input ids which indicates the sequences' words with shape " +"`[batch_size, sequence_length]` whose data type can be int or int64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.WordEmbedding.forward:8 +msgid "" +"The (scaled) embedding tensor of shape `(batch_size, sequence_length, " +"emb_dim)` whose data type can be float32 or float64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding:1 +msgid "" +"This layer produces sinusoidal positional embeddings of any length. While" +" in `forward()` method, this layer lookups embeddings vector of ids " +"provided by input `pos`." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding:5 +msgid "The size of each embedding vector." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding:7 +msgid "The maximum length of sequences." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward:1 +msgid "Computes positional embedding." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward:3 +msgid "" +"The input position ids with shape `[batch_size, sequence_length]` whose " +"data type can be int or int64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.PositionalEmbedding.forward:7 +msgid "" +"The positional embedding tensor of shape `(batch_size, sequence_length, " +"emb_dim)` whose data type can be float32 or float64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion:1 +msgid "" +"Computes the cross entropy loss for given input with or without label " +"smoothing." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion:3 +msgid "" +"The weight used to mix up the original ground-truth distribution and the " +"fixed distribution. Defaults to None. If given, label smoothing will be " +"applied on `label`." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion:7 +msgid "The token id used to pad variant sequence. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:1 +msgid "Computes cross entropy loss with or without label smoothing." 
+msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:3 +msgid "" +"The predict results of `TransformerModel` with shape `[batch_size, " +"sequence_length, vocab_size]` whose data type can be float32 or float64." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:7 +msgid "" +"The label for correspoding results with shape `[batch_size, " +"sequence_length, 1]`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:11 +msgid "" +"A tuple with items: (`sum_cost`, `avg_cost`, `token_num`). With the " +"corresponding fields: - `sum_cost` (Tensor): The sum of loss of " +"current batch whose data type can be float32, float64. - `avg_cost` " +"(Tensor): The average loss of current batch whose data type can be " +"float32, float64. The relation between `sum_cost` and `avg_cost` can " +"be described as: .. math:: avg\\_cost = sum\\_cost / " +"token\\_num - `token_num` (Tensor): The number of tokens of current " +"batch. Its data type can be float32, float64." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:11 +msgid "A tuple with items: (`sum_cost`, `avg_cost`, `token_num`)." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:13 +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:28 +msgid "With the corresponding fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:15 +msgid "`sum_cost` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:16 +msgid "The sum of loss of current batch whose data type can be float32, float64." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:23 +msgid "`avg_cost` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:18 +msgid "" +"The average loss of current batch whose data type can be float32, " +"float64. The relation between `sum_cost` and `avg_cost` can be described " +"as:" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:21 +msgid "avg\\_cost = sum\\_cost / token\\_num" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:25 +msgid "`token_num` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.CrossEntropyCriterion.forward:26 +msgid "" +"The number of tokens of current batch. Its data type can be float32, " +"float64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:1 +msgid "" +"This layer wraps a Transformer decoder combined with embedding layer and " +"output layer to produce logits from ids and position." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:4 +msgid "" +"Can be a `paddle.nn.TransformerDecoder` instance. Or a wrapper that " +"includes an embedding layer accepting ids and positions and includes an " +"output layer transforming decoder output to logits." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:8 +msgid "" +"Can be a `WordEmbedding` instance or a callable that accepts ids as " +"arguments and return embeddings. It can be None if `decoder` includes a " +"embedding layer. Defaults to None." 
+msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:12 +msgid "" +"Can be a `PositionalEmbedding` instance or a callable that accepts " +"position as arguments and return embeddings. It can be None if `decoder` " +"includes a positional embedding layer. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:16 +msgid "" +"Can be a `paddle.nn.Linear` instance or a callable to transform decoder " +"output to logits." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerDecodeCell:19 +msgid "" +"The dropout rate for the results of `word_embedding` and `pos_embedding`." +" Defaults to 0.1." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:1 +msgid "Produces logits." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:3 +msgid "" +"A tuple/list includes target ids and positions. If `word_embedding` is " +"None, then it should be a Tensor which means the input for decoder." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:6 +msgid "" +"It is a list and each element of the list is an instance of " +"`paddle.nn.MultiheadAttention.Cache` for corresponding decoder layer. It " +"can be produced by `paddle.nn.TransformerDecoder.gen_cache`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:10 +msgid "" +"It is a list and each element of the list is an instance of " +"`paddle.nn.MultiheadAttention.StaticCache` for corresponding decoder " +"layer. It can be produced by `paddle.nn.TransformerDecoder.gen_cache`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:14 +msgid "" +"A tensor used in self attention to prevents attention to some unwanted " +"positions, usually the subsequent positions. It is a tensor with shape " +"broadcasted to `[batch_size, n_head, target_length, target_length]`, " +"where the unwanted positions have `-INF` values and the others have 0 " +"values. The data type should be float32 or float64. It can be None when " +"nothing wanted or needed to be prevented attention to." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:21 +msgid "" +"The output of Transformer encoder. It is a tensor with shape " +"`[batch_size, source_length, d_model]` and its data type can be float32 " +"or float64." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:26 +msgid "" +"A tuple with items: `(outputs, new_states)` With the corresponding " +"fields: - `outputs` (Tensor): A float32 or float64 3D tensor " +"representing logits shaped `[batch_size, sequence_length, " +"vocab_size]`. - `new_states` (Tensor): This output has the same " +"structure and data type with `states` while the length is one larger " +"since concatanating the intermediate results of current step." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:26 +msgid "A tuple with items: `(outputs, new_states)`" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:31 +msgid "`outputs` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:31 +msgid "" +"A float32 or float64 3D tensor representing logits shaped `[batch_size, " +"sequence_length, vocab_size]`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:35 +msgid "`new_states` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerDecodeCell.forward:34 +msgid "" +"This output has the same structure and data type with `states` while the " +"length is one larger since concatanating the intermediate results of " +"current step." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:1 +msgid "基类::class:`paddle.fluid.layers.rnn.BeamSearchDecoder`" +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:1 +msgid "" +"This layer is a subclass of `BeamSearchDecoder` to make beam search adapt" +" to Transformer decoder." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:4 +msgid "An instance of `TransformerDecoderCell`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:6 +msgid "The start token id." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:8 +msgid "The end token id." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:10 +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch:10 +msgid "The beam width used in beam search." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder:12 +msgid "Indicate which dimension of states is variant." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch:1 +msgid "" +"Tiles the batch dimension of a tensor. Specifically, this function takes " +"a tensor t shaped `[batch_size, s0, s1, ...]` composed of minibatch " +"entries `t[0], ..., t[batch_size - 1]` and tiles it to have a shape " +"`[batch_size * beam_size, s0, s1, ...]` composed of minibatch entries " +"`t[0], t[0], ..., t[1], t[1], ...` where each minibatch entry is repeated" +" `beam_size` times." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch:8 +msgid "A list of tensor with shape `[batch_size, ...]`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.tile_beam_merge_with_batch:13 +msgid "" +"A tensor with shape `[batch_size * beam_size, ...]`, whose data type is " +"same as `t`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:1 +msgid "" +"Perform a beam search decoding step, which uses cell to get " +"probabilities, and follows a beam search step to calculate scores and " +"select candidate token ids." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:4 +msgid "" +"An `int64` tensor with shape `[1]` provided by the caller, representing " +"the current time step number of decoding." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:7 +msgid "" +"A tensor variable. It is same as `initial_inputs` returned by " +"`initialize()` for the first decoding step and `next_inputs` returned by " +"`step()` for the others." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:11 +msgid "" +"A structure of tensor variables. 
It is same as the `initial_cell_states` " +"returned by `initialize()` for the first decoding step and `next_states` " +"returned by `step()` for the others." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:16 +msgid "Additional keyword arguments, provided by the caller `dynamic_decode`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.TransformerBeamSearchDecoder.step:19 +msgid "" +"Returns tuple (``beam_search_output, beam_search_state, next_inputs, " +"finished``). `beam_search_state` and `next_inputs` have the same " +"structure, shape and data type as the input arguments states and inputs " +"separately. `beam_search_output` is a namedtuple(including scores, " +"predicted_ids, parent_ids as fields) of tensor variables, where `scores, " +"predicted_ids, parent_ids` all has a tensor value shaped [batch_size, " +"beam_size] with data type float32, int64, int64. `finished` is a bool " +"tensor with shape [batch_size, beam_size]." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel:1 +msgid "The Transformer model." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel:3 +msgid "" +"This model is a Paddle `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:3 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:7 +msgid "The size of source vocabulary." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:5 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:9 +msgid "The size of target vocabulary." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:7 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:11 +msgid "The maximum length of input sequences." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:9 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:13 +msgid "The number of sub-layers to be stacked in the encoder." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:11 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:15 +msgid "The number of sub-layers to be stacked in the decoder." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:13 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:17 +msgid "The number of head used in multi-head attention." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:15 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:19 +msgid "" +"The dimension for word embeddings, which is also the last dimension of " +"the input and output of multi-head attention, position-wise feed-forward " +"networks, encoder and decoder." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:19 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:23 +msgid "Size of the hidden layer in position-wise feed-forward networks." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:21 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:25 +msgid "Dropout rates. Used for pre-process, activation and inside attention." 
+msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:23 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:27 +msgid "Whether to use weight sharing." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:25 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:29 +msgid "" +"The dropout probability used in MHA to drop some attention target. If " +"None, use the value of dropout. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel:32 +msgid "" +"The dropout probability used after FFN activation. If None, use the value" +" of dropout. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel:35 +msgid "The start token id and also be used as padding id. Defaults to 0." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:33 +#: paddlenlp.transformers.transformer.modeling.TransformerModel:37 +msgid "The end token id. Defaults to 1." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel.forward:1 +msgid "" +"The Transformer forward methods. The input are source/target sequences, " +"and returns logits." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel.forward:4 +msgid "" +"The ids of source sequences words. It is a tensor with shape " +"`[batch_size, source_sequence_length]` and its data type can be int or " +"int64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel.forward:8 +msgid "" +"The ids of target sequences words. It is a tensor with shape " +"`[batch_size, target_sequence_length]` and its data type can be int or " +"int64." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.TransformerModel.forward:13 +msgid "" +"Output tensor of the final layer of the model whose data type can be " +"float32 or float64 with shape `[batch_size, sequence_length, " +"vocab_size]`." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:1 +msgid "基类::class:`paddlenlp.transformers.transformer.modeling.TransformerModel`" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:1 +msgid "The Transformer model for auto-regressive generation." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:28 +msgid "" +"The dropout probability used after FFN activition. If None, use the value" +" of dropout. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:35 +msgid "The beam width for beam search. Defaults to 4." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:37 +msgid "The maximum output length. Defaults to 256." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:39 +msgid "" +"Indicate the data layout of predicted Tensor. If `False`, the data layout" +" would be batch major with shape `[batch_size, seq_len, beam_size]`. If " +"`True`, the data layout would be time major with shape `[seq_len, " +"batch_size, beam_size]`. Default to `False`." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:45 +msgid "" +"Specify beam search version. It should be in one of [`v1`, `v2`]. If " +"`v2`, need to set `alpha`(default to 0.6) for length penalty. Default to " +"`v1`." 
+msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:49 +msgid "" +"The key word arguments can be `rel_len` and `alpha`: - `rel_len(bool, " +"optional)`: Indicating whether `max_out_len` in is the length relative to" +" that of source text. Only works in `v2` temporarily. It is suggest to " +"set a small `max_out_len` and use `rel_len=True`. Default to False if not" +" set. - `alpha(float, optional)`: The power number in length penalty " +"calculation. Refer to `GNMT `_. " +"Only works in `v2` temporarily. Default to 0.6 if not set." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:49 +msgid "The key word arguments can be `rel_len` and `alpha`:" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:51 +msgid "`rel_len(bool, optional)`: Indicating whether `max_out_len` in" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:52 +msgid "" +"is the length relative to that of source text. Only works in `v2` " +"temporarily. It is suggest to set a small `max_out_len` and use " +"`rel_len=True`. Default to False if not set." +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:56 +msgid "`alpha(float, optional)`: The power number in length penalty" +msgstr "" + +#: of paddlenlp.transformers.transformer.modeling.InferTransformerModel:57 +msgid "" +"calculation. Refer to `GNMT `_. " +"Only works in `v2` temporarily. Default to 0.6 if not set." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward:1 +msgid "The Transformer forward method." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward:3 +msgid "" +"The ids of source sequence words. It is a tensor with shape `[batch_size," +" source_sequence_length]` and its data type can be int or int64." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward:7 +msgid "" +"The ids of target sequence words. Normally, it should NOT be given. If " +"it's given, force decoding with previous output token will be trigger. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.forward:12 +msgid "" +"An int64 tensor shaped indicating the predicted ids. Its shape is " +"`[batch_size, seq_len, beam_size]` or `[seq_len, batch_size, beam_size]` " +"according to `output_time_major`." +msgstr "" + +#: of +#: paddlenlp.transformers.transformer.modeling.InferTransformerModel.beam_search_v2:1 +msgid "" +"Beam search with the alive and finished two queues, both have a beam size" +" capicity separately. It includes `grow_topk` `grow_alive` `grow_finish` " +"as steps. 1. `grow_topk` selects the top `2*beam_size` candidates to " +"avoid all getting EOS. 2. `grow_alive` selects the top `beam_size` non-" +"EOS candidates as the inputs of next decoding step. 3. `grow_finish` " +"compares the already finished candidates in the finished queue and newly " +"added finished candidates from `grow_topk`, and selects the top " +"`beam_size` finished candidates." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..8fc2bc3acd8b78ab2bacac3a99c4638814703909 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.transformer.rst:2 +msgid "transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.convert.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.convert.po new file mode 100644 index 0000000000000000000000000000000000000000..4a3a54f3f3da0d9f00a13dd0c5cd8a26573703e0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.convert.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unified_transformer.convert.rst:2 +msgid "convert" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..02178cd8c75bf2001afef298df7bd1aa498fc5f7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.modeling.po @@ -0,0 +1,421 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unified_transformer.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.unified_transformer.modeling:1 +msgid "Modeling classes for UnifiedTransformer model." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerPretrainedModel:1 +msgid "" +"An abstract class for pretrained UnifiedTransformer models. It provides " +"UnifiedTransformer related `model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel:1 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:1 +msgid "基类::class:`paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerPretrainedModel`" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:1 +msgid "The bare UnifiedTransformer Model outputting raw hidden-states." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:6 +msgid "" +"This model is also a `paddle.nn.Layer `__ " +"subclass. Use it as a regular Paddle Layer and refer to the Paddle " +"documentation for all matter related to general usage and behavior." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:11 +msgid "" +"Vocabulary size of `inputs_ids` in :class:`UnifiedTransformerModel`. Also" +" is the vocab size of token embedding matrix." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:14 +msgid "" +"Dimensionality of the embedding layers, encoder layers and pooler layer. " +"Defaults to 768." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:17 +msgid "The number of hidden layers in the encoder. Defaults to 12." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:19 +msgid "The number of heads in multi-head attention(MHA). Defaults to 12." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:21 +msgid "" +"Dimensionality of the feed-forward layer in the encoder. Input tensors to" +" feed-forward layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to 3072." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:27 +msgid "The activation function in the feedforward network. Defaults to \"gelu\"." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:30 +msgid "" +"The dropout probability used in pre-process and post-precess of MHA and " +"FFN sub-layer. Defaults to 0.1." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:33 +msgid "" +"The dropout probability used in MHA to drop some attention target. " +"Defaults to 0.1." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:36 +msgid "" +"Indicate whether to put layer normalization into preprocessing of MHA and" +" FFN sub-layers. If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:42 +msgid "The maximum length of input `position_ids`. Defaults to 512." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:44 +msgid "The size of the input `token_type_ids`. Defaults to 2." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:46 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal" +" distributions. See " +":meth:`UnifiedTransformerPretrainedModel.init_weights` method for how" +" weights are initialized in :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:46 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:49 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`UnifiedTransformerPretrainedModel.init_weights` method for " +"how weights are initialized in :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:55 +msgid "The id of special token `unk_token`. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:57 +msgid "The id of special token `pad_token`. Defaults to 0." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:59 +msgid "The id of special token `bos_token`. Defaults to 1." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:61 +msgid "The id of special token `eos_token`. Defaults to 2." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel:63 +msgid "The id of special token `mask_token`. Defaults to 30000." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:1 +msgid "" +"The UnifiedTransformerModel forward method, overrides the special " +":meth:`__call__` method." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:4 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:9 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:9 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:12 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:13 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:15 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:18 +msgid "" +"The position indices of input sequence tokens. It's data type should be " +"`int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:21 +msgid "" +"A tensor used in multi-head attention to prevents attention to some " +"unwanted positions, usually the paddings or the subsequent positions. It " +"is a tensor with shape broadcasted to [batch_size, n_head, " +"sequence_length, sequence_length]. - When the data type is bool, the " +"unwanted positions have `False` values and the others have `True` " +"values. - When the data type is int, the unwanted positions have 0 " +"values and the others have 1 values. - When the data type is float, the " +"unwanted positions have `-INF` values and the others have 0 values." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:21 +msgid "" +"A tensor used in multi-head attention to prevents attention to some " +"unwanted positions, usually the paddings or the subsequent positions. It " +"is a tensor with shape broadcasted to [batch_size, n_head, " +"sequence_length, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:26 +msgid "" +"When the data type is bool, the unwanted positions have `False` values " +"and the others have `True` values." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:28 +msgid "" +"When the data type is int, the unwanted positions have 0 values and the " +"others have 1 values." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:30 +msgid "" +"When the data type is float, the unwanted positions have `-INF` values " +"and the others have 0 values." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:33 +msgid "" +"(bool, optional): Whether or not use the model cache to speed up " +"decoding. Defaults to False." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:36 +msgid "" +"It is a list, and each element in the list is `incremental_cache` " +"produced by :meth:`paddle.nn.TransformerEncoderLayer.gen_cache` method. " +"See :meth:`paddle.nn.TransformerEncoder.gen_cache` method for more " +"details. It is only used for inference and should be None for training. " +"Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:42 +msgid "" +"Indices of role ids indicated different roles. It's data type should be " +"`int64` and has a shape of [batch_size, sequence_length]. Defaults to " +"None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:43 +msgid "Indices of role ids indicated different roles." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:44 +msgid "It's data type should be `int64` and has a shape of" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:45 +msgid "[batch_size, sequence_length]. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:48 +msgid "" +"If `use_cache` is False, it is a tensor representing the output of " +":class:`UnifiedTransformerModel`, with shape [batch_size, " +"sequence_length, hidden_size]. The data type is float32 or float64. " +"Otherwise, it is a tuple, besides the output of " +":class:`UnifiedTransformerModel`, the tuple also includes the new cache " +"which is same as input `cache` but `incremental_cache` in it has an " +"incremental length. See :meth:`paddle.nn.MultiHeadAttention.gen_cache` " +"method and :meth:`paddle.nn.MultiHeadAttention.forward` method for more " +"details." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:31 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerModel.forward:60 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel:1 +msgid "" +"The UnifiedTransformer Model with a language modeling head on top for " +"generation tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel:4 +msgid "An instance of :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:1 +msgid "" +"The UnifiedTransformerLMHeadModel forward method, overrides the special " +":meth:`__call__` method." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:4 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:6 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:8 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:10 +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:14 +msgid "See :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:12 +msgid "(bool, optional): See :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:16 +msgid "(Tensor, optional): See :class:`UnifiedTransformerModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.modeling.UnifiedTransformerLMHeadModel.forward:19 +msgid "" +"If `use_cache` is False, it is a tensor representing the output of " +":class:`UnifiedTransformerLMHeadModel`, with shape [batch_size, " +"sequence_length, vocab_size]. The data type is float32 or float64. " +"Otherwise, it is a tuple, besides the output of " +":class:`UnifiedTransformerLMHeadModel`, the tuple also includes the new " +"cache which is same as input `cache` but `incremental_cache` in it has an" +" incremental length. See :meth:`paddle.nn.MultiHeadAttention.gen_cache` " +"method and :meth:`paddle.nn.MultiHeadAttention.forward` method for more " +"details." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.po new file mode 100644 index 0000000000000000000000000000000000000000..7303924289a46bf2630261e610ede3f3b2ca3a8f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unified_transformer.rst:2 +msgid "unified\\_transformer" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..15517e061e49822217e82db00c2e6c20991eac3d --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unified_transformer.tokenizer.po @@ -0,0 +1,647 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unified_transformer.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.unified_transformer.tokenizer:1 +msgid "Tokenization class for UnifiedTransformer model." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:1 +msgid "" +"Constructs an UnifiedTransformer tokenizer based on `SentencePiece " +"`__." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.save_resources +msgid "参数" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:7 +msgid "The path of file to construct vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:9 +msgid "" +"The sentencepiece model file (ends with '.spm') required to instantiate a" +" `SentencePiece `__." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:12 +msgid "" +"Whether or not to lowercase the input when tokenizing. Defaults to False " +"and **does not** lowercase the input." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:15 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:19 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:22 +msgid "" +"A special token representing the beginning of a sequence. Defaults to " +"\"[CLS]\"." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:25 +msgid "" +"A special token representing the end of a sequence or separating two " +"different sentences in the same input. Defaults to \"[SEP]\"." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:28 +msgid "A special token representing a masked token. Defaults to \"[MASK]\"." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer:30 +msgid "" +"The path of file that contains additional special tokens to be used by " +"the tokenizer. Defaults to \"\"." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.vocab_size:1 +msgid "Returns the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string:14 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string:15 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:107 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.vocab_size:4 +msgid "示例" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `__` to concat subwords, also remove " +"`__` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string:5 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string:6 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string:7 +msgid "Whether or not to keep the segmentation with space. Defaults to True." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string:11 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string:1 +msgid "" +"Converts a single index or a sequence of indices to a token or a sequence" +" of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string:4 +msgid "The token id (or token ids) to be converted to token(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.convert_ids_to_string:10 +msgid "The decoded token(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the number of added tokens should be computed in the case of a " +"sequence pair or a single sequence. Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.num_special_tokens_to_add:7 +msgid "Number of special tokens added to sequences." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens:4 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens:3 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:3 +msgid "" +"Should be overridden in a subclass if the model has a special way of " +"building those." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens:6 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens:8 +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:10 +msgid "Optional second list of IDs for sequence pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_inputs_with_special_tokens:11 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens:5 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens:7 +msgid "Optional second list of char offsets for offset mapping pairs." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.build_offset_mapping_with_special_tokens:10 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:6 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:8 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:1 +msgid "" +"Retrieves sequence ids from a token list that has no special tokens " +"added. 
This method is called when adding special tokens using the " +"tokenizer ``encode`` methods." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:4 +msgid "List of ids of the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:6 +msgid "List of ids of the second sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:8 +msgid "" +"Whether or not the token list is already formatted with special tokens " +"for the model. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:12 +msgid "" +"The list of integers in the range [0, 1]: 1 for a special token, 0 " +"for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:14 +msgid "The list of integers in the range [0, 1]:" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.get_special_tokens_mask:15 +msgid "1 for a special token, 0 for a sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.save_resources:1 +msgid "" +"Save tokenizer related resources to `resource_files_names` indicating " +"files under `save_directory` by copying directly. Override it if " +"necessary." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:1 +msgid "" +"Instantiate an instance of `Vocab` from a file reserving all tokens by " +"using `Vocab.from_dict`. The file contains a token per line, and the line" +" number would be the index of corresponding token." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:5 +msgid "path of file to construct vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:7 +msgid "" +"special token for unknown token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:10 +msgid "" +"special token for padding token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:13 +msgid "" +"special token for bos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:16 +msgid "" +"special token for eos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:19 +msgid "keyword arguments for `Vocab.from_dict`." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.load_vocabulary:22 +msgid "An instance of `Vocab`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:1 +msgid "" +"Main method to encode the single-turn or multi-turn dialogue " +"conversation. It will return a dictionary containing the encoded sequence" +" and other relative informations which meets the input format " +"requirements of the UnifiedTransformer model. See detail at " +"https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:8 +msgid "" +"The history of dialogue conversation. It is an utterance or list of " +"utterances to be encoded. Each utterance is a string." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:12 +msgid "" +"The response of dialogue conversation. It should be set when training the" +" model. It should not be set when running inference. Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:16 +msgid "" +"The knowledge information of dialogue conversation. It should be set if " +"the `task_type` is \"knowledge\" or \"recommend\". Defaults to None." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:20 +msgid "" +"The type of dialogue conversation. It is one of \"chitchat\", " +"\"knowledge\" and \"recommend\". They represent the chitchat dialogue, " +"knowledge grounded dialogue and conversational recommendation " +"respectively. Defaults to None, which means there is no `special_token` " +"added in output sequence for identifying different conversation types." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:27 +msgid "The maximum encoded sequence length. Defaults to 512." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:30 +msgid "" +"The maximum encoded sequence length of the input `response`. Defaults to " +"128." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:33 +msgid "" +"The maximum encoded sequence length of the input `knowledge`. Defaults to" +" 128." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:36 +msgid "Whether to return the position_ids. Defaults to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:39 +msgid "Whether to return the token_type_ids. Defaults to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:42 +msgid "Whether to return the role_ids. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:45 +msgid "Whether to return the attention_mask. Defaults to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:48 +msgid "Whether to return the length of the encoded sequence. Defaults to False." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:51 +msgid "" +"Whether to add the special token \"[CLS]\" at the end of sequence as the " +"beginning of the response when running inference to force the model to " +"start generating response sequence. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:56 +msgid "" +"Whether to pad the returned sequences to the `max_seq_len`. Note that, in" +" this method, returned sequences will be padded on the left. Defaults to " +"False." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:60 +msgid "Whether to convert the returned sequences to Tensor. Defaults to False." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:63 +msgid "" +"Whether or not the input text (`history`, `response` and `knowledge`) has" +" been pretokenized. Defaults to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:67 +msgid "" +"Specify the involved positional style which must be one of [relative, " +"continuous]. Defaults to continuous which means start from 0 to maximum " +"length of history." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:72 +msgid "" +"A dictionary containing the encoded sequence and other relative " +"informations. With the corresponding fields: - input_ids " +"(list[int]|Tensor): A list of indices of input tokens to be feed to " +"UnifiedTransformer model. If `return_tensors` is True, it is a Tensor" +" with shape [1, sequence_length] and data type 'int64'. - role_ids " +"(list[int]|Tensor, optional): A list of indices of role indices. If " +"`return_role_ids` is True, it is a Tensor with shape [1, " +"sequence_length] and data type 'int64'. - token_type_ids " +"(list[int]|Tensor, optional): A list of segment token indices to " +"indicate whether the token belongs to the dialogue response. If " +"`return_tensors` is True, it is a Tensor with shape [1, " +"sequence_length] and data type 'int64'. Being returned when " +"`return_token_type_ids` is set to True. - position_ids (list[int]|Tensor," +" optional): A list of The position indices. If `return_tensors` is " +"True, it is a Tensor with shape [1, sequence_length] and data type" +" 'int64'. Being returned when `return_position_ids` is set to " +"True. - attention_mask (numpy.ndarray|Tensor, optional): A " +"numpy.ndarray to prevents attention to some unwanted positions, with " +"shape [sequence_length, sequence_length] and data type 'float32'. If " +"`return_tensors` is True, it is a Tensor with shape [1, 1, " +"sequence_length, sequence_length] and data type 'float32'. Being " +"returned when `return_attention_mask` is set to True. - seq_len (int, " +"optional): The actual length of the `input_ids`, excluding the pad " +"token. Being returned when `return_length` is set to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:72 +msgid "" +"A dictionary containing the encoded sequence and other relative " +"informations." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:75 +msgid "With the corresponding fields:" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:79 +msgid "input_ids (list[int]|Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:78 +msgid "" +"A list of indices of input tokens to be feed to UnifiedTransformer model." +" If `return_tensors` is True, it is a Tensor with shape [1, " +"sequence_length] and data type 'int64'." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:82 +msgid "role_ids (list[int]|Tensor, optional):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:82 +msgid "" +"A list of indices of role indices. If `return_role_ids` is True, it is a " +"Tensor with shape [1, sequence_length] and data type 'int64'." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:88 +msgid "token_type_ids (list[int]|Tensor, optional):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:85 +msgid "" +"A list of segment token indices to indicate whether the token belongs to " +"the dialogue response. If `return_tensors` is True, it is a Tensor with " +"shape [1, sequence_length] and data type 'int64'. Being returned when " +"`return_token_type_ids` is set to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:93 +msgid "position_ids (list[int]|Tensor, optional):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:91 +msgid "" +"A list of The position indices. If `return_tensors` is True, it is a " +"Tensor with shape [1, sequence_length] and data type 'int64'. Being " +"returned when `return_position_ids` is set to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:99 +msgid "attention_mask (numpy.ndarray|Tensor, optional):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:96 +msgid "" +"A numpy.ndarray to prevents attention to some unwanted positions, with " +"shape [sequence_length, sequence_length] and data type 'float32'. If " +"`return_tensors` is True, it is a Tensor with shape [1, 1, " +"sequence_length, sequence_length] and data type 'float32'. Being returned" +" when `return_attention_mask` is set to True." +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:102 +msgid "seq_len (int, optional):" +msgstr "" + +#: of +#: paddlenlp.transformers.unified_transformer.tokenizer.UnifiedTransformerTokenizer.dialogue_encode:102 +msgid "" +"The actual length of the `input_ids`, excluding the pad token. Being " +"returned when `return_length` is set to True." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..2a1873a8eda037e4363d1fb5539487bc3f0deea7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.modeling.po @@ -0,0 +1,346 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unimo.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling:1 +msgid "Modeling classes for UNIMO model." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOPretrainedModel:1 +msgid "" +"An abstract class for pretrained UNIMO models. It provides UNIMO related " +"`model_config_file`, `pretrained_init_configuration`, " +"`resource_files_names`, `pretrained_resource_files_map`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel:1 +#: paddlenlp.transformers.unimo.modeling.UNIMOModel:1 +msgid "基类::class:`paddlenlp.transformers.unimo.modeling.UNIMOPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:1 +msgid "The bare UNIMO Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:6 +msgid "" +"This model is also a `paddle.nn.Layer " +"`__" +" subclass. Use it as a regular Paddle Layer and refer to the Paddle " +"documentation for all matter related to general usage and behavior." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel +#: paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward +#: paddlenlp.transformers.unimo.modeling.UNIMOModel +#: paddlenlp.transformers.unimo.modeling.UNIMOModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `UNIMOModel`. Also is the vocab size " +"of token embedding matrix. Defines the number of different tokens that " +"can be represented by the `inputs_ids` passed when calling `UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:13 +msgid "" +"Dimensionality of the embedding layers and encoder layers. Defaults to " +"`768`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:15 +msgid "The number of hidden layers in the Transformer encoder. Defaults to `12`." 
+msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:17 +msgid "" +"Number of attention heads for each attention layer in the Transformer " +"encoder. Defaults to `12`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:20 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `hidden_size` to " +"`intermediate_size`, and then projected back to `hidden_size`. Typically " +"`intermediate_size` is larger than `hidden_size`. Defaults to `3072`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:25 +msgid "" +"The non-linear activation function in the feed-forward layer. " +"``\"gelu\"``, ``\"relu\"`` and any other paddle supported activation " +"functions are supported. Defaults to ``\"gelu\"``." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:29 +msgid "" +"The dropout probability used in pre-process and post-precess of MHA and " +"FFN sub-layer. Defaults to 0.1." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:32 +msgid "" +"The dropout probability used in MultiHeadAttention in all encoder layers " +"to drop some attention target. Defaults to `0.1`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:35 +msgid "" +"Indicate whether to put layer normalization into preprocessing of MHA and" +" FFN sub-layers. If True, pre-process is layer normalization and post-" +"precess includes dropout, residual connection. Otherwise, no pre-process " +"and post-precess includes dropout, residual connection, layer " +"normalization. Defaults to `True`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:41 +msgid "" +"The maximum value of the dimensionality of position encoding, which " +"dictates the maximum supported length of an input sequence. Defaults to " +"`512`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:44 +msgid "" +"The vocabulary size of the `token_type_ids` passed when calling " +"`~transformers.UNIMOModel`. Defaults to `2`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:47 +msgid "" +"The standard deviation of the normal initializer. Defaults to `0.02`. .." +" note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`UNIMOPretrainedModel._init_weights()` for " +"how weights are initialized in `UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:47 +msgid "The standard deviation of the normal initializer. Defaults to `0.02`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:50 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`UNIMOPretrainedModel._init_weights()` for how weights are " +"initialized in `UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:53 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` in order to be converted to an ID." +" Defaults to `17963`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:57 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to `0`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:60 +msgid "" +"A special token representing the beginning of a sequence that was used " +"during pretraining. Defaults to `1`." 
+msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:63 +msgid "" +"A special token representing the end of a sequence that was used during " +"pretraining. Defaults to `3`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel:66 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to `3`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:1 +msgid "" +"The UNIMOModel forward method, overrides the special :meth:`__call__` " +"method." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:7 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:7 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:10 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:11 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:13 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to None, which means no segment embeddings is " +"added to token embeddings." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:16 +msgid "" +"Indices of positions of each input sequence tokens in the position " +"embeddings. Selected in the range ``[0, max_position_embeddings - 1]``. " +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:21 +msgid "" +"Mask used in multi-head attention to avoid performing attention to some " +"unwanted positions, usually the paddings or the subsequent positions. Its" +" data type can be int, float and bool. When the data type is bool, the " +"`masked` tokens have `False` values and the others have `True` values. " +"When the data type is int, the `masked` tokens have `0` values and the " +"others have `1` values. When the data type is float, the `masked` tokens " +"have `-INF` values and the others have `0` values. It is a tensor with " +"shape broadcasted to `[batch_size, num_attention_heads, sequence_length, " +"sequence_length]`. For example, its shape can be [batch_size, " +"sequence_length], [batch_size, sequence_length, sequence_length], " +"[batch_size, num_attention_heads, sequence_length, sequence_length]. " +"Defaults to `None`, which means nothing needed to be prevented attention " +"to." 
+msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:32 +msgid "" +"(bool, optional): Whether or not use the model cache to speed up " +"decoding. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:35 +msgid "" +"It is a list, and each element in the list is `incremental_cache` " +"produced by :meth:`paddle.nn.TransformerEncoderLayer.gen_cache` method. " +"See :meth:`paddle.nn.TransformerEncoder.gen_cache` method for more " +"details. It is only used for inference and should be None for training. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward +#: paddlenlp.transformers.unimo.modeling.UNIMOModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:42 +msgid "" +"If `use_cache` is False, it is a tensor representing the output of " +":class:`UNIMOModel`, with shape [batch_size, sequence_length, " +"hidden_size]. The data type is float64. Otherwise, it is a tuple, besides" +" the output of :class:`UNIMOModel`, the tuple also includes the new cache" +" which is same as input `cache` but `incremental_cache` in it has an " +"incremental length. See :meth:`paddle.nn.MultiHeadAttention.gen_cache` " +"method and :meth:`paddle.nn.MultiHeadAttention.forward` method for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward +#: paddlenlp.transformers.unimo.modeling.UNIMOModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:26 +#: paddlenlp.transformers.unimo.modeling.UNIMOModel.forward:51 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel:1 +msgid "" +"The UNIMO Model with a `language modeling` head on top designed for " +"generation tasks." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel:3 +msgid "An instance of :class:`UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:1 +msgid "" +"The UNIMOLMHeadModel forward method, overrides the special " +":meth:`__call__` method." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:4 +#: paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:6 +#: paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:8 +#: paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:10 +#: paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:14 +msgid "See :class:`UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:12 +msgid "(bool, optional): See :class:`UNIMOModel`." +msgstr "" + +#: of paddlenlp.transformers.unimo.modeling.UNIMOLMHeadModel.forward:17 +msgid "" +"If `use_cache` is False, it is a tensor representing the output of " +":class:`UNIMOModel`, with shape [batch_size, sequence_length, " +"hidden_size]. The data type is float64. Otherwise, it is a tuple, besides" +" the output of :class:`UNIMOLMHeadModel`, the tuple also includes the new" +" cache which is same as input `cache` but `incremental_cache` in it has " +"an incremental length. See :meth:`paddle.nn.MultiHeadAttention.gen_cache`" +" method and :meth:`paddle.nn.MultiHeadAttention.forward` method for more " +"details." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.po new file mode 100644 index 0000000000000000000000000000000000000000..54632057c48c9ae07ab0e50ab8a9aaf3f8d0962a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unimo.rst:2 +msgid "unimo" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..50c8a3dc632a6966bdf17dc3e776ce8b7b4ca484 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.unimo.tokenizer.po @@ -0,0 +1,512 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.unimo.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:1 +msgid "" +"Constructs an UNIMO tokenizer. It uses a basic tokenizer to do " +"punctuation splitting, lower casing and so on, and follows a WordPiece " +"tokenizer to tokenize as subwords." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:5 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." 
+msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:9 +msgid "" +"The vocabulary file path (ends with '.txt') required to instantiate a " +"`WordpieceTokenizer`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:12 +msgid "Whether or not to lowercase the input when tokenizing. Defaults to`True`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:15 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to \"[UNK]\"." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:19 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to \"[SEP]\"." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:22 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to \"[PAD]\"." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:25 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to \"[CLS]\"." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:28 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to \"[MASK]\"." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer:34 +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string:12 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.vocab_size:1 +msgid "Return the size of vocabulary." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.vocab_size +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.vocab_size:3 +msgid "The size of vocabulary." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.vocab_size +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:1 +msgid "" +"Instantiate an instance of `Vocab` from a file reserving all tokens by " +"using `Vocab.from_dict`. The file contains a token per line, and the line" +" number would be the index of corresponding token." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:5 +msgid "path of file to construct vocabulary." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:7 +msgid "" +"special token for unknown token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:10 +msgid "" +"special token for padding token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:13 +msgid "" +"special token for bos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:16 +msgid "" +"special token for eos token. If no need, it also could be `None`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:19 +msgid "keyword arguments for `Vocab.from_dict`." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.load_vocabulary:22 +msgid "An instance of `Vocab`." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) in a single string. Since " +"the usage of WordPiece introducing `##` to concat subwords, also remove " +"`##` when converting." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string:5 +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword:4 +msgid "A list of string representing tokens to be converted." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.convert_tokens_to_string:8 +msgid "Converted string from tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Build model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:4 +msgid "A UNIMO sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:6 +msgid "single sequence: ``[CLS] X [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:7 +msgid "pair of sequences: ``[CLS] A [SEP] B [SEP]``" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:9 +msgid "List of IDs to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:11 +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:15 +msgid "Optional second list of IDs for sequence pairs. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_inputs_with_special_tokens:15 +msgid "List of input_id with the appropriate special tokens." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword:1 +msgid "" +"Converts the subwords in a sequence of tokens (list of string) to whole " +"words, also remove `##` when converting." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.merge_subword:7 +msgid "Converted sequence of whole words." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Build offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:4 +msgid "A UNIMO offset_mapping has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:9 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:11 +msgid "" +"Optional second list of char offsets for offset mapping pairs. Defaults " +"to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:15 +msgid "List of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:17 +msgid "List of char offsets with the appropriate offsets" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.build_offset_mapping_with_special_tokens:18 +msgid "of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Create a mask from the two sequences passed to be used in a sequence-pair" +" classification task." 
+msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:4 +msgid "A UNIMO sequence pair mask has the following format: ::" +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:10 +msgid "" +"If `token_ids_1` is `None`, this method only returns the first portion of" +" the mask (0s)." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:13 +msgid "List of IDs." +msgstr "" + +#: of +#: paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.create_token_type_ids_from_sequences:19 +msgid "List of token_type_id according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:1 +msgid "" +"Main method for encoding the source for generation. It will return a " +"dictionary containing the encoded sequence and other relative " +"informations which meets the input format requirements of the UNIMO-text " +"model." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:5 +msgid "The source text of generation. It should be a string." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:7 +msgid "" +"The target text of generation. It should be set when training the model " +"and should be None when running inference. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:11 +msgid "" +"The additional information of some of the generation tasks such as " +"summary. Defaults to None." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:14 +msgid "The maximum encoded sequence length. Defaults to 512." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:17 +msgid "" +"The maximum encoded sequence length of the input `target`. Defaults to " +"128." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:20 +msgid "The maximum encoded sequence length of the input `title`. Defaults to 128." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:23 +msgid "Whether to return the position_ids. Defaults to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:26 +msgid "Whether to return the token_type_ids. Defaults to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:29 +msgid "Whether to return the attention_mask. Defaults to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:32 +msgid "Whether to return the length of the encoded sequence. Defaults to False." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:35 +msgid "" +"Whether to add the special token \"[CLS]\" at the end of sequence as the " +"beginning of the target when running inference to force the model to start" +" generating target sequence. Defaults to False." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:40 +msgid "" +"Whether to pad the returned sequences to the `max_seq_len`. Note that, in" +" this method, returned sequences will be padded on the left. Defaults to " +"False." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:44 +msgid "Whether to convert the returned sequences to Tensor. Defaults to False." 
+msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:47 +msgid "" +"Whether or not the input text (`source`, `target` and `title`) has been " +"pretokenized. Defaults to False." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:51 +msgid "" +"Whether the position ids is continuous between source ids and target ids." +" Defaults to False." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:55 +msgid "" +"A dictionary containing the encoded sequence and other relative " +"informations. With the corresponding fields: - input_ids " +"(list[int]|Tensor): A list of indices of input tokens to be feed to " +"UNIMO-text model. If `return_tensors` is True, it is a Tensor with " +"shape [1, sequence_length] and data type 'int64'. - token_type_ids " +"(list[int]|Tensor, optional): A list of segment token indices to " +"indicate whether the token belongs to the dialogue target. If " +"`return_tensors` is True, it is a Tensor with shape [1, " +"sequence_length] and data type 'int64'. Being returned when " +"`return_token_type_ids` is set to True. - position_ids (list[int]|Tensor," +" optional): A list of The position indices. If `return_tensors` is " +"True, it is a Tensor with shape [1, sequence_length] and data type" +" 'int64'. Being returned when `return_position_ids` is set to " +"True. - attention_mask (numpy.ndarray|Tensor, optional): A " +"numpy.ndarray to prevents attention to some unwanted positions, with " +"shape [sequence_length, sequence_length] and data type 'float32'. If " +"`return_tensors` is True, it is a Tensor with shape [1, 1, " +"sequence_length, sequence_length] and data type 'float32'. Being " +"returned when `return_attention_mask` is set to True. - seq_len (int, " +"optional): The actual length of the `input_ids`, excluding the pad " +"token. Being returned when `return_length` is set to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:55 +msgid "" +"A dictionary containing the encoded sequence and other relative " +"informations." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:58 +msgid "With the corresponding fields:" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:62 +msgid "input_ids (list[int]|Tensor):" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:61 +msgid "" +"A list of indices of input tokens to be feed to UNIMO-text model. If " +"`return_tensors` is True, it is a Tensor with shape [1, sequence_length] " +"and data type 'int64'." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:68 +msgid "token_type_ids (list[int]|Tensor, optional):" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:65 +msgid "" +"A list of segment token indices to indicate whether the token belongs to " +"the dialogue target. If `return_tensors` is True, it is a Tensor with " +"shape [1, sequence_length] and data type 'int64'. Being returned when " +"`return_token_type_ids` is set to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:73 +msgid "position_ids (list[int]|Tensor, optional):" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:71 +msgid "" +"A list of The position indices. If `return_tensors` is True, it is a " +"Tensor with shape [1, sequence_length] and data type 'int64'. 
Being " +"returned when `return_position_ids` is set to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:79 +msgid "attention_mask (numpy.ndarray|Tensor, optional):" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:76 +msgid "" +"A numpy.ndarray to prevents attention to some unwanted positions, with " +"shape [sequence_length, sequence_length] and data type 'float32'. If " +"`return_tensors` is True, it is a Tensor with shape [1, 1, " +"sequence_length, sequence_length] and data type 'float32'. Being returned" +" when `return_attention_mask` is set to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:82 +msgid "seq_len (int, optional):" +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:82 +msgid "" +"The actual length of the `input_ids`, excluding the pad token. Being " +"returned when `return_length` is set to True." +msgstr "" + +#: of paddlenlp.transformers.unimo.tokenizer.UNIMOTokenizer.gen_encode:87 +msgid "示例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..a792eb80f03fd4fffde5bd47bb099b304cdec395 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.utils.po @@ -0,0 +1,71 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.utils.rst:2 +msgid "utils" +msgstr "" + +#: of paddlenlp.transformers.utils.fn_args_to_dict:1 +msgid "" +"Inspect function `func` and its arguments for running, and extract a dict" +" mapping between argument names and keys." +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta:1 +msgid "基类::class:`type`" +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta:1 +msgid "" +"This metaclass wraps the `__init__` method of a class to add " +"`init_config` attribute for instances of that class, and `init_config` " +"use a dict to track the initial configuration. If the class has " +"`_wrap_init` method, it would be hooked after `__init__` and called as " +"`_wrap_init(self, init_fn, init_args)`. Since InitTrackerMeta would be " +"used as metaclass for pretrained model classes, which always are Layer " +"and `type(Layer)` is not `type`, thus use `type(Layer)` rather than " +"`type` as base class for it to avoid inheritance metaclass conflicts." +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta.init_and_track_conf:1 +msgid "" +"wraps `init_func` which is `__init__` method of a class to add " +"`init_config` attribute for instances of that class. :param init_func: It" +" should be the `__init__` method of a class. 
:type init_func: callable " +":param help_func: If provided, it would be hooked after" +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta.init_and_track_conf:6 +msgid "" +"`init_func` and called as `_wrap_init(self, init_func, *init_args, " +"**init_args)`. Default None." +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta.init_and_track_conf +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta.init_and_track_conf:10 +msgid "the wrapped function" +msgstr "" + +#: of paddlenlp.transformers.utils.InitTrackerMeta.init_and_track_conf +msgid "返回类型" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.modeling.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.modeling.po new file mode 100644 index 0000000000000000000000000000000000000000..6862a9012543b7c6efc6af564eef1c62294000f0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.modeling.po @@ -0,0 +1,915 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.xlnet.modeling.rst:2 +msgid "modeling" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling:1 +msgid "Modeling classes for XLNet model." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetPretrainedModel:1 +msgid "基类::class:`paddlenlp.transformers.model_utils.PretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetPretrainedModel:1 +msgid "" +"An abstract class for pretrained XLNet models. It provides XLNet related " +"`model_config_file`, `resource_files_names`, " +"`pretrained_resource_files_map`, `pretrained_init_configuration`, " +"`base_model_prefix` for downloading and loading pretrained models. See " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more " +"details." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice:1 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering:1 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification:1 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification:1 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel:1 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel:1 +msgid "基类::class:`paddlenlp.transformers.xlnet.modeling.XLNetPretrainedModel`" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:1 +msgid "The bare XLNet Model outputting raw hidden-states." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:3 +msgid "" +"This model inherits from " +":class:`~paddlenlp.transformers.model_utils.PretrainedModel`. Refer to " +"the superclass documentation for the generic methods." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:6 +msgid "" +"This model is also a `paddle.nn.Layer " +"`__ subclass. Use " +"it as a regular Paddle Layer and refer to the Paddle documentation for " +"all matter related to general usage and behavior." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetModel +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:10 +msgid "" +"Vocabulary size of `inputs_ids` in `XLNetModel`. Also is the vocab size " +"of token embedding matrix." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:13 +msgid "" +"The number of tokens to cache. If not 0 or None, the last `mem_len` " +"hidden states in each layer will be cached into memory. Defaults to " +"`None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:16 +msgid "" +"The number of tokens in the current batch to be cached. If positive, then" +" at most `reuse_len` tokens can be cached in the current batch. " +"Otherwise, there is no limit to the number of tokens. Defaults to `None`." +" .. note:: The difference between `mem_len` and `reuse_len` is that " +"`mem_len` defines **the total number** of tokens to cache while " +"`reuse_len` defines the number of tokens in **the current batch** to " +"be cached." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:16 +msgid "" +"The number of tokens in the current batch to be cached. If positive, then" +" at most `reuse_len` tokens can be cached in the current batch. " +"Otherwise, there is no limit to the number of tokens. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:21 +msgid "" +"The difference between `mem_len` and `reuse_len` is that `mem_len` " +"defines **the total number** of tokens to cache while `reuse_len` defines" +" the number of tokens in **the current batch** to be cached." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:25 +msgid "" +"Dimensionality of the embedding layers, encoder layers and pooler layer. " +"Defaults to 768." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:28 +msgid "" +"Whether or not to use the same attention length for each token. Defaults " +"to `False`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:31 +msgid "" +"The attention type used in the attention layer. Set **\"bi\"** for " +"``XLNet``, **\"uni\"** for ``Transformer-XL``. Defaults to **\"bi\"**." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:34 +msgid "" +"Whether or not to use bidirectional input pipeline. Set to `True` during " +"pretraining and `False` during fine-tuning. Defaults to `False`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:37 +msgid "" +"Maximum relative distance supported. All relative distances larger than " +"`clamp_len` will be clamped. Setting this attribute to -1 means no " +"clamping. Defaults to -1." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:40 +msgid "The number of hidden layers in the encoder. Defaults to 12." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:42 +msgid "" +"The dropout ratio for all fully connected layers in the embeddings and " +"encoder. Defaults to 0.1." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:45 +msgid "" +"The dropout ratio for all fully connected layers in the pooler " +"(classification head). Defaults to 0.1." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:48 +msgid "Number of attention heads in each attention layer. Defaults to 12." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:51 +msgid "" +"Dimensionality of each attention head. Defaults to 64. .. note:: " +"`d_head` should be equal to `d_model` divided by `n_head`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:51 +msgid "Dimensionality of each attention head. Defaults to 64." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:54 +msgid "`d_head` should be equal to `d_model` divided by `n_head`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:56 +msgid "" +"The `epsilon` parameter used in :class:`paddle.nn.LayerNorm` for " +"initializing layer normalization layers. Defaults to 1e-12." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:59 +msgid "" +"Dimensionality of the feed-forward (ff) layer in the encoder. Input " +"tensors to ff layers are firstly projected from `d_model` to `d_inner`, " +"and then projected back to `d_model`. Typically `d_inner` is larger than " +"`d_model`. Defaults to 3072." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:64 +msgid "" +"The non-linear activation function in the feed-forward layers in the " +"encoder. Choose from the following supported activation functions: " +"`[\"relu\", \"gelu\", \"tanh\", \"sigmoid\", \"mish\", \"swish\"]`. " +"Defaults to `\"gelu\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:68 +msgid "" +"The standard deviation of the normal initializer. Defaults to 0.02. .. " +"note:: A normal_initializer initializes weight matrices as normal " +"distributions. See :meth:`XLNetPretrainedModel._init_weights()` for " +"how weights are initialized in `XLNetModel`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:68 +msgid "The standard deviation of the normal initializer. Defaults to 0.02." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel:71 +msgid "" +"A normal_initializer initializes weight matrices as normal distributions." +" See :meth:`XLNetPretrainedModel._init_weights()` for how weights are " +"initialized in `XLNetModel`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:1 +msgid "The XLNetModel forward method, overrides the `__call__()` special method." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:3 +msgid "" +"Indices of input sequence tokens in the vocabulary. They are numerical " +"representations of tokens that build the input sequence. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:7 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. 
Indices can be either 0 or 1: - 0 corresponds to a **sentence " +"A** token, - 1 corresponds to a **sentence B** token. It's data type " +"should be `int64` and has a shape of [batch_size, sequence_length]. " +"Defaults to None, which means no segment embeddings is added to token " +"embeddings." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:7 +msgid "" +"Segment token indices to indicate first and second portions of the " +"inputs. Indices can be either 0 or 1:" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:10 +msgid "0 corresponds to a **sentence A** token," +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:11 +msgid "1 corresponds to a **sentence B** token." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:13 +msgid "" +"It's data type should be `int64` and has a shape of [batch_size, " +"sequence_length]. Defaults to None, which means no segment embeddings is " +"added to token embeddings." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:16 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**. - **1** for tokens that are " +"**not masked**, - **0** for tokens that are **masked**. It's data type " +"should be `float32` and has a shape of [batch_size, sequence_length]. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:16 +msgid "" +"Mask to indicate whether to perform attention on each input token or not." +" The values should be either 0 or 1. The attention scores will be set to " +"**-infinity** for any positions in the mask that are **0**, and will be " +"**unchanged** for positions that are **1**." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:21 +msgid "**1** for tokens that are **not masked**," +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:22 +msgid "**0** for tokens that are **masked**." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:24 +msgid "" +"It's data type should be `float32` and has a shape of [batch_size, " +"sequence_length]. Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:27 +msgid "" +"A list of length `n_layers` with each Tensor being a pre-computed hidden-" +"state for each layer. Each Tensor has a dtype `float32` and a shape of " +"[batch_size, sequence_length, hidden_size]. Defaults to None, and we " +"don't use mems. .. note:: `use_mems` has to be set to `True` in " +"order to make use of `mems`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:27 +msgid "" +"A list of length `n_layers` with each Tensor being a pre-computed hidden-" +"state for each layer. Each Tensor has a dtype `float32` and a shape of " +"[batch_size, sequence_length, hidden_size]. Defaults to None, and we " +"don't use mems." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:32 +msgid "`use_mems` has to be set to `True` in order to make use of `mems`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:34 +msgid "" +"Mask to indicate the permutation pattern of the input sequence with " +"values being either 0 or 1. 
- if ``perm_mask[k, i, j] = 0``, i " +"**attend** to j in batch k; - if ``perm_mask[k, i, j] = 1``, i **does not" +" attend** to j in batch k. Only used during pretraining (to define " +"factorization order) or for sequential decoding (generation). It's data " +"type should be `float32` and has a shape of [batch_size, sequence_length," +" sequence_length]. Defaults to `None`, then each token attends to all the" +" other tokens (full bidirectional attention)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:34 +msgid "" +"Mask to indicate the permutation pattern of the input sequence with " +"values being either 0 or 1." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:36 +msgid "if ``perm_mask[k, i, j] = 0``, i **attend** to j in batch k;" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:37 +msgid "if ``perm_mask[k, i, j] = 1``, i **does not attend** to j in batch k." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:39 +msgid "" +"Only used during pretraining (to define factorization order) or for " +"sequential decoding (generation). It's data type should be `float32` and " +"has a shape of [batch_size, sequence_length, sequence_length]. Defaults " +"to `None`, then each token attends to all the other tokens (full " +"bidirectional attention)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:44 +msgid "" +"Mask to indicate the output tokens to use with values being either 0 or " +"1. If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on " +"the j-th token. It's data type should be `float32` and has a shape of " +"[batch_size, num_predict, sequence_length]. Only used during pretraining " +"for partial prediction or for sequential decoding (generation). Defaults " +"to `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:50 +msgid "" +"Mask to avoid performing attention on padding token with values being " +"either 0 or 1. It's data type should be `float32` and it has a shape of " +"[batch_size, sequence_length]. This mask is negative of `attention_mask`:" +" - 1 for tokens that are **masked**, - 0 for tokens that are **not " +"masked**. You should use only one of `input_mask` and `attention_mask`. " +"Defaults to `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:50 +msgid "" +"Mask to avoid performing attention on padding token with values being " +"either 0 or 1. It's data type should be `float32` and it has a shape of " +"[batch_size, sequence_length]. This mask is negative of `attention_mask`:" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:54 +msgid "1 for tokens that are **masked**," +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:55 +msgid "0 for tokens that are **not masked**." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:57 +msgid "" +"You should use only one of `input_mask` and `attention_mask`. Defaults to" +" `None`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:59 +msgid "" +"Mask to nullify selected heads of the self-attention layers with values " +"being either 0 or 1. - 1 indicates the head is **not masked**, - 0 " +"indicates the head is **masked**. It's data type should be `float32` and" +" has a shape of [num_heads] or [num_layers, num_heads]. Defaults to " +"`None`, which means we keep all heads." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:59 +msgid "" +"Mask to nullify selected heads of the self-attention layers with values " +"being either 0 or 1." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:61 +msgid "1 indicates the head is **not masked**," +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:62 +msgid "0 indicates the head is **masked**." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:64 +msgid "" +"It's data type should be `float32` and has a shape of [num_heads] or " +"[num_layers, num_heads]. Defaults to `None`, which means we keep all " +"heads." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:67 +msgid "" +"An embedded representation tensor which is an alternative of `input_ids`." +" You should specify only either one of them to avoid contradiction. It's " +"data type should be `float32` and has a shape of [batch_size, " +"sequence_length, hidden_size]. Defaults to `None`, which means we only " +"specify `input_ids`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:72 +msgid "" +"Whether or not to use recurrent memory mechanism during training. " +"Defaults to `False` and we don't use recurrent memory mechanism in " +"training mode." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:75 +msgid "" +"Whether or not to use recurrent memory mechanism during evaluation. " +"Defaults to `False` and we don't use recurrent memory mechanism in " +"evaluation mode." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:78 +msgid "" +"Whether or not to return additional information other than the output " +"tensor. If True, then returns information about `output`, `new_mems`, " +"`hidden_states` and `attentions` which will also be formatted as a dict. " +"Else only returns the output tensor. Defaults to False." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward +msgid "返回" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:84 +msgid "" +"Returns tensor `output` or a dict with key-value pairs: " +"{\"last_hidden_state\": `output`, \"mems\": `mems`, \"hidden_states\": " +"`hidden_states`, \"attentions\": `attentions`}. With the corresponding " +"fields: - `output` (Tensor): Output of the final layer of the model." +" It's a Tensor of dtype `float32` and has a shape of [batch_size, " +"num_predict, hidden_size]. .. note:: `num_predict` " +"corresponds to `target_mapping.shape[1]`. If `target_mapping` is " +"`None`, then `num_predict` equals to `sequence_length`. - `mems` " +"(List[Tensor]): A list of pre-computed hidden-states. The length of " +"the list is `n_layers`. Each element in the list is a Tensor with " +"dtype `float32` and has a shape of [batch_size, sequence_length, " +"hidden_size]. - `hidden_states` (List[Tensor], optional): A list of " +"Tensor containing hidden-states of the model at the output of each layer" +" plus the initial embedding outputs. 
Each Tensor has a data type of " +"`float32` and has a shape of [batch_size, sequence_length, " +"hidden_size]. Being returned when `output_hidden_states` is set to " +"`True`. - `attentions` (List[Tensor], optional): A list of Tensor " +"containing attentions weights of each hidden layer. Each Tensor (one " +"for each layer) has a data type of `float32` and has a shape of " +"[batch_size, num_heads, sequence_length, sequence_length]. Being " +"returned when `output_attentions` is set to `True`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:84 +msgid "" +"Returns tensor `output` or a dict with key-value pairs: " +"{\"last_hidden_state\": `output`, \"mems\": `mems`, \"hidden_states\": " +"`hidden_states`, \"attentions\": `attentions`}." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:32 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:34 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:34 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:88 +msgid "With the corresponding fields:" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:95 +msgid "`output` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:91 +msgid "" +"Output of the final layer of the model. It's a Tensor of dtype `float32` " +"and has a shape of [batch_size, num_predict, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:95 +msgid "" +"`num_predict` corresponds to `target_mapping.shape[1]`. If " +"`target_mapping` is `None`, then `num_predict` equals to " +"`sequence_length`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:38 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:41 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:37 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:39 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:39 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:99 +msgid "`mems` (List[Tensor]):" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:98 +msgid "" +"A list of pre-computed hidden-states. The length of the list is " +"`n_layers`. Each element in the list is a Tensor with dtype `float32` and" +" has a shape of [batch_size, sequence_length, hidden_size]." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:40 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:43 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:39 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:41 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:41 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:104 +msgid "`hidden_states` (List[Tensor], optional):" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:102 +msgid "" +"A list of Tensor containing hidden-states of the model at the output of " +"each layer plus the initial embedding outputs. Each Tensor has a data " +"type of `float32` and has a shape of [batch_size, sequence_length, " +"hidden_size]. Being returned when `output_hidden_states` is set to " +"`True`." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:42 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:45 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:41 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:43 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:43 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:109 +msgid "`attentions` (List[Tensor], optional):" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:107 +msgid "" +"A list of Tensor containing attentions weights of each hidden layer. Each" +" Tensor (one for each layer) has a data type of `float32` and has a shape" +" of [batch_size, num_heads, sequence_length, sequence_length]. Being " +"returned when `output_attentions` is set to `True`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward +msgid "返回类型" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:47 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:50 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:46 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:48 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:48 +#: paddlenlp.transformers.xlnet.modeling.XLNetModel.forward:114 +msgid "示例" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification:1 +msgid "" +"XLNet Model with a linear layer on top of the output layer, designed for " +"sequence classification/regression tasks like GLUE tasks." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice:4 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering:4 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification:4 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification:4 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel:3 +msgid "An instance of :class:`XLNetModel`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification:6 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification:6 +msgid "The number of classes. Defaults to 2." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:1 +msgid "" +"The XLNetForSequenceClassification forward method, overrides the " +"`__call__()` special method." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:3 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:5 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:7 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:9 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:11 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:13 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:15 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:17 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:19 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:21 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:23 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:25 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:39 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:41 +#: paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:43 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:3 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:5 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:7 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:9 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:11 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:13 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:15 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:17 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:19 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:21 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:23 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:25 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:42 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:44 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:46 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:3 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:5 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:7 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:9 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:11 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:13 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:15 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:17 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:19 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:21 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:23 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:25 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:38 +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:40 +#: 
paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:42 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:3 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:5 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:7 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:9 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:11 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:13 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:15 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:17 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:19 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:21 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:23 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:25 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:40 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:42 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:44 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:3 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:5 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:7 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:9 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:11 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:13 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:15 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:17 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:19 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:21 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:23 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:25 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:40 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:42 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:44 +msgid "See :class:`XLNetModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:28 +msgid "" +"Returns tensor `logits` or a dict with key-value pairs: {\"logits\": " +"`logits`, \"mems\": `mems`, \"hidden_states\": `hidden_states`, " +"\"attentions\": `attentions`}. With the corresponding fields: - " +"`logits` (Tensor): Classification scores before SoftMax (also called " +"logits). It's data type should be `float32` and has a shape of " +"[batch_size, num_classes]. - `mems` (List[Tensor]): See " +":class:`XLNetModel`. - `hidden_states` (List[Tensor], optional): See " +":class:`XLNetModel`. - `attentions` (List[Tensor], optional): See " +":class:`XLNetModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:28 +msgid "" +"Returns tensor `logits` or a dict with key-value pairs: {\"logits\": " +"`logits`, \"mems\": `mems`, \"hidden_states\": `hidden_states`, " +"\"attentions\": `attentions`}." 
+msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:35 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:37 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:37 +msgid "`logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForSequenceClassification.forward:35 +msgid "" +"Classification scores before SoftMax (also called logits). It's data type" +" should be `float32` and has a shape of [batch_size, num_classes]." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification:1 +msgid "" +"XLNet Model with a linear layer on top of the hidden-states output layer," +" designed for token classification tasks like NER tasks." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:1 +msgid "" +"The XLNetForTokenClassification forward method, overrides the " +"`__call__()` special method." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:28 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:28 +msgid "" +"Returns tensor `logits` or a dict with key-value pairs: {\"logits\": " +"`logits`, \"mems\": `mems`, \"hidden_states\": `hidden_states`, " +"\"attentions\": `attentions`}. With the corresponding fields: - " +"`logits` (Tensor): Classification scores before SoftMax (also called " +"logits). It's data type should be `float32` and has a shape of " +"[batch_size, sequence_length, num_classes]. - `mems` (List[Tensor]): " +"See :class:`XLNetModel`. - `hidden_states` (List[Tensor], optional): " +"See :class:`XLNetModel`. - `attentions` (List[Tensor], optional): See" +" :class:`XLNetModel`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:30 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:30 +msgid "Returns tensor `logits` or a dict with key-value pairs:" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:31 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:31 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:31 +msgid "{\"logits\": `logits`, \"mems\": `mems`," +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:32 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:32 +msgid "\"hidden_states\": `hidden_states`, \"attentions\": `attentions`}." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:36 +#: paddlenlp.transformers.xlnet.modeling.XLNetForTokenClassification.forward:37 +#: paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:37 +msgid "" +"Classification scores before SoftMax (also called logits). It's data type" +" should be `float32` and has a shape of [batch_size, sequence_length, " +"num_classes]." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel:1 +msgid "" +"XLNet Model with a language modeling head on top (linear layer with " +"weights tied to the input embeddings)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetLMHeadModel.forward:1 +msgid "" +"The XLNetLMHeadModel forward method, overrides the `__call__()` special " +"method." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice:1 +msgid "" +"XLNet Model with a multiple choice classification head on top (a linear " +"layer on top of the pooled output and a softmax) e.g. for RACE/SWAG " +"tasks." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:1 +msgid "" +"The XLNetForMultipleChoice forward method, overrides the `__call__()` " +"special method." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:28 +msgid "" +"Returns tensor `logtis` or a dict with key-value pairs: {\"logits\": " +"`logits`, \"mems\": `mems`, \"hidden_states\": `hidden_states`, " +"\"attentions\": `attentions`} With the corresponding fields: - `logits` " +"(Tensor): Classification scores before SoftMax (also called logits). " +"It's data type should be `float32` and has a shape of [batch_size, " +"sequence_length, num_classes]. - `mems` (List[Tensor]): See " +":class:`XLNetModel`. - `hidden_states` (List[Tensor], optional): See " +":class:`XLNetModel`. - `attentions` (List[Tensor], optional): See " +":class:`XLNetModel`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:30 +msgid "Returns tensor `logtis` or a dict with key-value pairs:" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:32 +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:32 +msgid "\"hidden_states\": `hidden_states`, \"attentions\": `attentions`}" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForMultipleChoice.forward:34 +msgid "With the corresponding fields: - `logits` (Tensor):" +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering:1 +msgid "" +"XLNet Model with a span classification head on top for extractive " +"question-answering tasks like SQuAD (a linear layers on top of the " +"hidden-states output to compute `span start logits` and `span end " +"logits`)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:1 +msgid "" +"The XLNetForQuestionAnswering forward method, overrides the `__call__()` " +"special method." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:28 +msgid "" +"Returns tensor (`start_logits`, `end_logits`) or a dict with key-value " +"pairs: {\"start_logits\": `start_logits`, \"end_logits\": `end_logits`, " +"\"mems\": `mems`, \"hidden_states\": `hidden_states`, \"attentions\": " +"`attentions`} With the corresponding fields: - `start_logits` (Tensor):" +" A tensor of the input token classification logits, indicates the " +"start position of the labelled span. Its data type should be float32 " +"and its shape is [batch_size, sequence_length]. - `end_logits` (Tensor):" +" A tensor of the input token classification logits, indicates the end" +" position of the labelled span. Its data type should be float32 and " +"its shape is [batch_size, sequence_length]. - `mems` (List[Tensor]): " +"See :class:`XLNetModel`. - `hidden_states` (List[Tensor], optional): " +"See :class:`XLNetModel`. - `attentions` (List[Tensor], optional): See" +" :class:`XLNetModel`." 
+msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:30 +msgid "" +"Returns tensor (`start_logits`, `end_logits`) or a dict with key-value " +"pairs:" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:31 +msgid "" +"{\"start_logits\": `start_logits`, \"end_logits\": `end_logits`, " +"\"mems\": `mems`," +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:34 +msgid "With the corresponding fields: - `start_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:36 +msgid "" +"A tensor of the input token classification logits, indicates the start " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:39 +msgid "`end_logits` (Tensor):" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.modeling.XLNetForQuestionAnswering.forward:39 +msgid "" +"A tensor of the input token classification logits, indicates the end " +"position of the labelled span. Its data type should be float32 and its " +"shape is [batch_size, sequence_length]." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.po new file mode 100644 index 0000000000000000000000000000000000000000..95bbdac3087e9c2265785bc5a2677c1255958900 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.po @@ -0,0 +1,41 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.xlnet.rst:2 +msgid "xlnet" +msgstr "" + +#: of paddlenlp.transformers.xlnet:4 +msgid "" +"`XLNet: Generalized Autoregressive Pretraining for Language Understanding" +" `__ 是一款无监督的自回归预训练语言模型。" +msgstr "" + +#: of paddlenlp.transformers.xlnet:7 +msgid "" +"有别于传统的单向自回归模型,XLNet通过最大化输入序列所有排列的期望来进行语言建模,这使得它可以同时关注到上下文的信息。 " +"另外,XLNet在预训练阶段集成了 `Transformer-XL `__ " +"模型, Transformer-XL中的片段循环机制(Segment Recurrent Mechanism)和相对位置编码(Relative " +"Positional Encoding)机制 能够支持XLNet接受更长的输入序列,这使得XLNet在长文本序列的语言任务上有着优秀的表现。" +msgstr "" + +#: of paddlenlp.transformers.xlnet:12 +msgid "本项目是XLNet在 Paddle 2.0上的开源实现,由modeling和tokenizer两部分组成。" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.tokenizer.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.tokenizer.po new file mode 100644 index 0000000000000000000000000000000000000000..3272bbcb31be4f7ca0a2952ba18d611dd33df3c3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.transformers.xlnet.tokenizer.po @@ -0,0 +1,352 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.transformers.xlnet.tokenizer.rst:2 +msgid "tokenizer" +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer:1 +msgid "Tokenization class for XLNet model." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:1 +msgid "基类::class:`paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer`" +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:1 +msgid "" +"Constructs an XLNet tokenizer based on `SentencePiece " +"`__." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:3 +msgid "" +"This tokenizer inherits from " +":class:`~paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer` " +"which contains most of the main methods. For more information regarding " +"those methods, please refer to this superclass." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.save_resources +msgid "参数" +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:7 +msgid "" +"The vocabulary file (ends with '.spm') required to instantiate a " +"`SentencePiece `__ tokenizer." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:10 +msgid "" +"Whether or not to lowercase the input when tokenizing. Defaults to " +"`False` and **does not** lowercase the input." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:13 +msgid "" +"Whether or not to strip the text when tokenizing. Defaults to `True` and " +"removes excess spaces before and after the string." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:16 +msgid "" +"Whether or not to keep accents when tokenizing. Defaults to `False` and " +"**does not** keep accents." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:18 +msgid "" +"A special token representing the beginning of a sequence that was used " +"during pretraining. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:21 +msgid "" +"A special token representing the end of a sequence that was used during " +"pretraining. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:24 +msgid "" +"A special token representing the *unknown (out-of-vocabulary)* token. An " +"unknown token is set to be `unk_token` inorder to be converted to an ID. " +"Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:28 +msgid "" +"A special token separating two different sentences in the same input. " +"Defaults to `\"\"`." 
+msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:31 +msgid "" +"A special token used to make arrays of tokens the same size for batching " +"purposes. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:34 +msgid "" +"A special token used for sequence classification. It is the last token of" +" the sequence when built with special tokens. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:37 +msgid "" +"A special token representing a masked token. This is the token used in " +"the masked language modeling task which the model tries to predict the " +"original unmasked ones. Defaults to `\"\"`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:41 +msgid "" +"A list of additional special tokens to be used by the tokenizer. Defaults" +" to `[\"\", \"\"]`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:47 +msgid "" +"The *SentencePiece* processor that is used for every conversion (string, " +"tokens and IDs)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer +msgid "type" +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer:49 +msgid "SentencePieceProcessor" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string:1 +msgid "" +"Converts a sequence of tokens (list of string) to a single string by " +"using ``' '.join(tokens)`` ." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string:4 +msgid "A sequence of tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add +msgid "返回" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string:7 +msgid "Converted string." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.convert_tokens_to_string +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add +msgid "返回类型" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add:1 +msgid "" +"Returns the number of added tokens when encoding a sequence with special " +"tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add:3 +msgid "" +"Whether the input is a sequence pair or a single sequence. Defaults to " +"`False` and the input is a single sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.num_special_tokens_to_add:7 +msgid "Number of tokens added to sequences." 
+msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:1 +msgid "" +"Builds model inputs from a sequence or a pair of sequence for sequence " +"classification tasks by concatenating and adding special tokens. An XLNet" +" sequence has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:5 +msgid "single sequence: ``X ``" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:6 +msgid "pair of sequences: ``A B ``" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:8 +msgid "List of IDs for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:10 +msgid "Optional second list of IDs for the second sequenze. Defaults to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_inputs_with_special_tokens:13 +msgid "List of input IDs with the appropriate special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:1 +msgid "" +"Builds offset map from a pair of offset map by concatenating and adding " +"offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:4 +msgid "An XLNet offset_mapping has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:6 +msgid "single sequence: ``X (0,0) (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:7 +msgid "pair of sequences: ``A (0,0) B (0,0) (0,0)``" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:9 +msgid "List of char offsets to which the special tokens will be added." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:11 +msgid "" +"Optional second list of char offsets for offset mapping pairs. Defaults " +"to `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.build_offset_mapping_with_special_tokens:15 +msgid "A list of char offsets with the appropriate offsets of special tokens." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask:1 +msgid "" +"Creates a special tokens mask from the input sequences. This method is " +"called when adding special tokens using the tokenizer `encode` method." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:22 +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask:4 +msgid "A list of `inputs_ids` for the first sequence." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:24 +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask:6 +msgid "" +"Optional second list of `inputs_ids` for the second sequence. Defaults to" +" `None`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask:9 +msgid "" +"Whether or not the token list already contains special tokens for the " +"model. 
Defaults to `False`." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.get_special_tokens_mask:13 +msgid "" +"A list of integers which is either 0 or 1: 1 for a special token, 0 for a" +" sequence token." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:1 +msgid "" +"Creates a token_type mask from the input sequences. If `token_ids_1` is " +"not `None`, then a sequence pair token_type mask has the following " +"format:" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:10 +msgid "" +"Else if `token_ids_1` is `None`, then a single sequence token_type mask " +"has the following format:" +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:18 +msgid "0 stands for the segment id of **first segment tokens**," +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:19 +msgid "1 stands for the segment id of **second segment tokens**," +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:20 +msgid "2 stands for the segment id of **cls_token**." +msgstr "" + +#: of +#: paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.create_token_type_ids_from_sequences:28 +msgid "List of token type IDs according to the given sequence(s)." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.save_resources:1 +msgid "" +"Saves `SentencePiece `__ file " +"(ends with '.spm') under `save_directory`." +msgstr "" + +#: of paddlenlp.transformers.xlnet.tokenizer.XLNetTokenizer.save_resources:4 +msgid "Directory to save files into." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.batch_sampler.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.batch_sampler.po new file mode 100644 index 0000000000000000000000000000000000000000..d09a8f042cf3c9d9f1daa3e4bd83b6a1efdc3d99 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.batch_sampler.po @@ -0,0 +1,98 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.batch_sampler.rst:2 +msgid "batch\\_sampler" +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:1 +msgid "Sampler that restricts data loading to a subset of the dataset." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:3 +msgid "" +"In such case, each process can pass a DistributedBatchSampler instance as" +" a DataLoader sampler, and load a subset of the original dataset that is " +"exclusive to it." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:8 +msgid "Dataset is assumed to be of constant size." 
+msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler +#: paddlenlp.utils.batch_sampler.DistributedBatchSampler.set_epoch +msgid "参数" +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:10 +msgid "" +"this could be a `paddle.io.Dataset` implement or other python object " +"which implemented `__len__` for BatchSampler to get sample number of data" +" source." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:15 +msgid "sample indice number in a mini-batch indices." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:17 +msgid "" +"porcess number in distributed training. If :attr:`num_replicas` is None, " +":attr:`num_replicas` will be retrieved from " +":code:`paddle.distributed.ParallenEnv`. Default None." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:22 +msgid "" +"the rank of the current process among :attr:`num_replicas` processes. If " +":attr:`rank` is None, :attr:`rank` is retrieved from " +":code:`paddle.distributed.ParallenEnv`. Default None." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:26 +msgid "" +"whther to shuffle indices order before genrating batch indices. Default " +"False." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:29 +msgid "" +"whether drop the last incomplete batch dataset size is not divisible by " +"the batch size. Default False" +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler:34 +#: paddlenlp.utils.batch_sampler.DistributedBatchSampler.set_epoch:11 +msgid "实际案例" +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler.set_epoch:1 +msgid "" +"Sets the epoch number. When :attr:`shuffle=True`, this number is used as " +"seeds of random numbers. By default, users may not set this, all replicas" +" (workers) use a different random ordering for each epoch. If set same " +"number at each epoch, this sampler will yield the same ordering at all " +"epoches." +msgstr "" + +#: of paddlenlp.utils.batch_sampler.DistributedBatchSampler.set_epoch:7 +msgid "Epoch number." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.downloader.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.downloader.po new file mode 100644 index 0000000000000000000000000000000000000000..996dc51a87061c05139973ef61f9355c337834e0 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.downloader.po @@ -0,0 +1,46 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.downloader.rst:2 +msgid "downloader" +msgstr "" + +#: of paddlenlp.utils.downloader.get_weights_path_from_url:1 +msgid "" +"Get weights path from WEIGHT_HOME, if not exists, download it from url. 
" +":param url: download url :type url: str :param md5sum: md5 sum of " +"download package :type md5sum: str" +msgstr "" + +#: of paddlenlp.utils.downloader.get_weights_path_from_url +msgid "返回" +msgstr "" + +#: of paddlenlp.utils.downloader.get_weights_path_from_url:8 +msgid "a local path to save downloaded weights." +msgstr "" + +#: of paddlenlp.utils.downloader.get_weights_path_from_url +msgid "返回类型" +msgstr "" + +#: of paddlenlp.utils.downloader.get_weights_path_from_url:12 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.env.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.env.po new file mode 100644 index 0000000000000000000000000000000000000000..63c65bb62a50bc9b3387e7f6ea7b588c1254ce3e --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.env.po @@ -0,0 +1,33 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.env.rst:2 +msgid "env" +msgstr "" + +#: of paddlenlp.utils.env:1 +msgid "" +"This module is used to store environmental variables in PaddleNLP. " +"PPNLP_HOME --> the root directory for storing PaddleNLP " +"related data. Default to ~/.paddlenlp. Users can change the ├" +" default value through the PPNLP_HOME " +"environment variable. ├─ MODEL_HOME --> Store model files. " +"└─ DATA_HOME --> Store automatically downloaded datasets." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.file_lock.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.file_lock.po new file mode 100644 index 0000000000000000000000000000000000000000..b6a1ff8f3d66051a171fd1c1d4d91af72bb6ac6c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.file_lock.po @@ -0,0 +1,52 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.utils.file_lock.rst:2 +msgid "file\\_lock" +msgstr "" + +#: of paddlenlp.utils.file_lock.FileLockException:1 +msgid "基类::class:`Exception`" +msgstr "" + +#: of paddlenlp.utils.file_lock.FileLock:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.utils.file_lock.FileLock:1 +msgid "" +"A file locking mechanism that has context-manager support so you can use " +"it in a with statement. This should be relatively cross compatible as it " +"doesn't rely on msvcrt or fcntl for the locking." +msgstr "" + +#: of paddlenlp.utils.file_lock.FileLock.acquire:1 +msgid "" +"Acquire the lock, if possible. If the lock is in use, it check again " +"every `wait` seconds. 
It does this until it either gets the lock or " +"exceeds `timeout` number of seconds, in which case it throws an " +"exception." +msgstr "" + +#: of paddlenlp.utils.file_lock.FileLock.release:1 +msgid "" +"Get rid of the lock by deleting the lockfile. When working in a `with` " +"statement, this gets automatically called at the end." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.import_utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.import_utils.po new file mode 100644 index 0000000000000000000000000000000000000000..c63490425c91f68d32219608c323d28c24bf942b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.import_utils.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../source/paddlenlp.utils.import_utils.rst:2 +msgid "import\\_utils" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.log.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.log.po new file mode 100644 index 0000000000000000000000000000000000000000..4b4330933c4467aa22056ab07c1e0f09036b1db7 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.log.po @@ -0,0 +1,51 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.log.rst:2 +msgid "log" +msgstr "" + +#: of paddlenlp.utils.log.Logger:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.utils.log.Logger:1 +msgid "Deafult logger in PaddleNLP" +msgstr "" + +#: of paddlenlp.utils.log.Logger paddlenlp.utils.log.Logger.processing +msgid "参数" +msgstr "" + +#: of paddlenlp.utils.log.Logger:3 +msgid "Logger name, default is 'PaddleNLP'" +msgstr "" + +#: of paddlenlp.utils.log.Logger.processing:1 +msgid "Continuously print a progress bar with rotating special effects." +msgstr "" + +#: of paddlenlp.utils.log.Logger.processing:3 +msgid "Message to be printed." +msgstr "" + +#: of paddlenlp.utils.log.Logger.processing:5 +msgid "Rotation interval. Default to 0.1." +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.po new file mode 100644 index 0000000000000000000000000000000000000000..0ecaeaa053ebf5b3651185a2151ce4d6cfbd4c34 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.rst:2 +msgid "paddlenlp.utils" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.profiler.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.profiler.po new file mode 100644 index 0000000000000000000000000000000000000000..2b6ec35b4f6d6ef5f1b40da6c3610a77cd669c2c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.profiler.po @@ -0,0 +1,92 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.profiler.rst:2 +msgid "profiler" +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:1 +msgid "" +"Use a string to initialize a ProfilerOptions. The string should be in the" +" format: \"key1=value1;key2=value;key3=value3\". For example:" +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:4 +msgid "" +"\"profile_path=model.profile\" \"batch_range=[50, 60]; " +"profile_path=model.profile\" \"batch_range=[50, 60]; " +"tracer_option=OpDetail; profile_path=model.profile\"" +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:16 +msgid "ProfilerOptions supports following key-value pair:" +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:9 +msgid "" +"batch_range - a integer list, e.g. [100, 110]. state - a " +"string, the optional values are 'CPU', 'GPU' or 'All'. sorted_key -" +" a string, the optional values are 'calls', 'total'," +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:12 +msgid "'max', 'min' or 'ave." +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:13 +msgid "" +"tracer_option - a string, the optional values are 'Default', " +"'OpDetail'," +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:14 +msgid "'AllOpDetail'." +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:15 +msgid "profile_path - a string, the path to save the serialized profile data," +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:16 +msgid "which can be used to generate a timeline." +msgstr "" + +#: of paddlenlp.utils.profiler.ProfilerOptions:17 +msgid "exit_on_finished - a boolean." +msgstr "" + +#: of paddlenlp.utils.profiler.add_profiler_step:1 +msgid "" +"Enable the operator-level timing using PaddlePaddle's profiler. The " +"profiler uses a independent variable to count the profiler steps. One " +"call of this function is treated as a profiler step." +msgstr "" + +#: of paddlenlp.utils.profiler.add_profiler_step +msgid "参数" +msgstr "" + +#: of paddlenlp.utils.profiler.add_profiler_step:5 +msgid "Default is None, and the profiler is disabled." 
+msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.tools.po b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.tools.po new file mode 100644 index 0000000000000000000000000000000000000000..be7be41c852f415ea6e5cd1dbeea816de4d66d09 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/source/paddlenlp.utils.tools.po @@ -0,0 +1,130 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../source/paddlenlp.utils.tools.rst:2 +msgid "tools" +msgstr "" + +#: of paddlenlp.utils.tools.static_params_to_dygraph:1 +msgid "Simple tool for convert static paramters to dygraph paramters dict." +msgstr "" + +#: of paddlenlp.utils.tools.dygraph_params_to_static:3 +#: paddlenlp.utils.tools.static_params_to_dygraph:3 +msgid "**NOTE** The model must both support static graph and dygraph mode." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version +#: paddlenlp.utils.tools.dygraph_params_to_static +#: paddlenlp.utils.tools.static_params_to_dygraph +msgid "参数" +msgstr "" + +#: of paddlenlp.utils.tools.dygraph_params_to_static:5 +#: paddlenlp.utils.tools.static_params_to_dygraph:5 +msgid "the model of a neural network." +msgstr "" + +#: of paddlenlp.utils.tools.static_params_to_dygraph:7 +msgid "" +"path of which locate the saved paramters in static mode. Usualy load by " +"`paddle.static.load_program_state`." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version +#: paddlenlp.utils.tools.dygraph_params_to_static +#: paddlenlp.utils.tools.static_params_to_dygraph +msgid "返回" +msgstr "" + +#: of paddlenlp.utils.tools.dygraph_params_to_static:10 +#: paddlenlp.utils.tools.static_params_to_dygraph:11 +msgid "a state dict the same as the dygraph mode." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version +#: paddlenlp.utils.tools.dygraph_params_to_static +#: paddlenlp.utils.tools.static_params_to_dygraph +msgid "返回类型" +msgstr "" + +#: of paddlenlp.utils.tools.dygraph_params_to_static:1 +msgid "Simple tool for convert dygraph paramters to static paramters dict." +msgstr "" + +#: of paddlenlp.utils.tools.dygraph_params_to_static:7 +msgid "path of which locate the saved paramters in static mode." +msgstr "" + +#: of paddlenlp.utils.tools.TimeCostAverage:1 +msgid "基类::class:`object`" +msgstr "" + +#: of paddlenlp.utils.tools.TimeCostAverage:1 +msgid "" +"Simple tool for calcluating time average cost in the process of training " +"and inferencing." +msgstr "" + +#: of paddlenlp.utils.tools.TimeCostAverage.reset:1 +msgid "Reset the recoder state, and reset the `cnt` to zero." +msgstr "" + +#: of paddlenlp.utils.tools.TimeCostAverage.record:1 +msgid "Recoding the time cost in current step and accumulating the `cnt`." +msgstr "" + +#: of paddlenlp.utils.tools.TimeCostAverage.get_average:1 +msgid "Returning the average time cost after the start of training." +msgstr "" + +#: of paddlenlp.utils.tools.get_env_device:1 +msgid "Return the device name of running environment." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:1 +msgid "" +"The first version string needed to be compared. 
The format of version " +"string should be as follow : \"xxx.yyy.zzz\"." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:4 +msgid "" +"The second version string needed to be compared. The format of version " +"string should be as follow : \"xxx.yyy.zzz\"." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:8 +msgid "" +"The result of comparasion. 1 means version > pair_version; 0 means " +"version = pair_version; -1 means version < pair_version." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:10 +msgid "The result of comparasion. 1 means version > pair_version; 0 means" +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:11 +msgid "version = pair_version; -1 means version < pair_version." +msgstr "" + +#: of paddlenlp.utils.tools.compare_version:15 +msgid "实际案例" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/trainer.po b/docs/locale/en/LC_MESSAGES/trainer.po new file mode 100644 index 0000000000000000000000000000000000000000..1fe4d12bd7443783f87a8e7c606151752de0f6eb --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/trainer.po @@ -0,0 +1,137 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-05-19 14:17+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.10.1\n" + +#: ../trainer.md:1 +msgid "PaddleNLP Trainer API" +msgstr "" + +#: ../trainer.md:3 +msgid "PaddleNLP提供了Trainer训练API,针对训练过程的通用训练配置做了封装,比如:" +msgstr "" + +#: ../trainer.md:5 +msgid "优化器、学习率调度等训练配置" +msgstr "" + +#: ../trainer.md:6 +msgid "多卡,混合精度,梯度累积等功能" +msgstr "" + +#: ../trainer.md:7 +msgid "checkpoint断点,断点重启(数据集,随机数恢复)" +msgstr "" + +#: ../trainer.md:8 +msgid "日志显示,loss可视化展示等" +msgstr "" + +#: ../trainer.md:10 +msgid "用户输入模型,数据集,就可以使用Trainer API高效快速的实现预训练、微调等任务。" +msgstr "" + +#: ../trainer.md:13 +msgid "Trainer基本使用方法介绍" +msgstr "" + +#: ../trainer.md:15 +msgid "" +"下面是用户使用 Trainer API进行finetune任务的简单示例,这里以中文情感分类数据集chnsenticorp为例。 " +"更详细的使用可以参考CLUE Trainer版本。" +msgstr "" + +#: ../trainer.md:18 +msgid "导入需要用到的头文件。" +msgstr "" + +#: ../trainer.md:19 +msgid "主要是模型、Tokenizer" +msgstr "" + +#: ../trainer.md:20 +msgid "还有Trainer组件" +msgstr "" + +#: ../trainer.md:21 +msgid "其中Trainer是训练主要入口,用户传入模型,数据集,即可进行训练" +msgstr "" + +#: ../trainer.md:22 +msgid "TrainingArguments 包含了用户需要的大部分训练参数。" +msgstr "" + +#: ../trainer.md:23 +msgid "PdArgumentParser 是用户输出参数的工具" +msgstr "" + +#: ../trainer.md:31 +msgid "设置好用户参数" +msgstr "" + +#: ../trainer.md:32 +msgid "" +"PdArgumentParser 可以接受多个类似TrainingArguments的参数。用户可以自定义所需要的ModelArguments, " +"DataArguments为 tuple 传入 PdArgumentParser即可。" +msgstr "" + +#: ../trainer.md:33 +msgid "" +"这些参数都是通过python xxx.py --dataset xx --max_seq_length " +"xx的方式传入。TrainingArguments的所有可配置参数见后文。" +msgstr "" + +#: ../trainer.md:50 +msgid "加载模型,tokenizer, 数据集" +msgstr "" + +#: ../trainer.md:51 +msgid "注意,这里的数据集,需要输出的是一个dict。dict中的key,需要和模型的输入名称对应。" +msgstr "" + +#: ../trainer.md:52 +msgid "这里的,labels如果模型没有使用到,我们还需要额外定义criterion,计算最后的loss损失。" +msgstr "" + +#: ../trainer.md:66 +msgid "构造Trainer实例,进行模型训练。" +msgstr "" + +#: ../trainer.md:67 +msgid "这里传入model,criterion,args,train_dataset,tokenizer这些训练需要的组件,构建了实例化的trainer" +msgstr "" + +#: ../trainer.md:68 +msgid 
"使用trainer.train()接口开始训练过程。训练完成后,可以保存模型,保存一些日志。" +msgstr "" + +#: ../trainer.md:84 +msgid "预训练的使用方式可以参考ERNIE-1.0 Trainer版本。" +msgstr "" + +#: ../trainer.md:87 +msgid "Trainer 实例化参数介绍" +msgstr "" + +#: ../trainer.md:88 +msgid "Trainer 是一个简单,但功能完整的 Paddle训练和评估模块,并针对 PaddleNLP 模型进行了优化。" +msgstr "" + +#: ../trainer.md:172 +msgid "TrainingArguments 参数介绍" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/classify.po b/docs/locale/en/LC_MESSAGES/tutorials/classify.po new file mode 100644 index 0000000000000000000000000000000000000000..1690b3b2028482e49d4e038e1306f7e534b06c9b --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/classify.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/classify.rst:3 +msgid "文本分类" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/embedding.po b/docs/locale/en/LC_MESSAGES/tutorials/embedding.po new file mode 100644 index 0000000000000000000000000000000000000000..ea594eb16d9d7c0cd6126e4946963e77d445bbe3 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/embedding.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/embedding.rst:3 +msgid "词向量" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/general_dialogue.po b/docs/locale/en/LC_MESSAGES/tutorials/general_dialogue.po new file mode 100644 index 0000000000000000000000000000000000000000..bed957915ad3a139c08a70daa5e9a4cd99fa48af --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/general_dialogue.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/general_dialogue.rst:3 +msgid "通用对话" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/lexical_analysis.po b/docs/locale/en/LC_MESSAGES/tutorials/lexical_analysis.po new file mode 100644 index 0000000000000000000000000000000000000000..41ad758c3fb1f4bd94480525b8ed41b97371928c --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/lexical_analysis.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. 
+# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/lexical_analysis.rst:3 +msgid "词法分析" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/machine_translation.po b/docs/locale/en/LC_MESSAGES/tutorials/machine_translation.po new file mode 100644 index 0000000000000000000000000000000000000000..8759d5ffc4d65358c5593ab8030cfe7b95e6049f --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/machine_translation.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/machine_translation.rst:3 +msgid "机器翻译" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/ner.po b/docs/locale/en/LC_MESSAGES/tutorials/ner.po new file mode 100644 index 0000000000000000000000000000000000000000..3bca60c8ae4bd3e104c59902fdc6c53e24dc1942 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/ner.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/ner.rst:3 +msgid "序列标注" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/overview.po b/docs/locale/en/LC_MESSAGES/tutorials/overview.po new file mode 100644 index 0000000000000000000000000000000000000000..54ca186a71b386c1b623deaef209e2638d10b348 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/overview.po @@ -0,0 +1,133 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/overview.rst:3 +msgid "整体介绍" +msgstr "" + +#: ../tutorials/overview.rst:7 +msgid "案例集" +msgstr "" + +#: ../tutorials/overview.rst:9 +msgid "词向量" +msgstr "" + +#: ../tutorials/overview.rst:11 +msgid "" +"`使用预训练词向量改善模型效果 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:13 +msgid "文本分类" +msgstr "" + +#: ../tutorials/overview.rst:15 +msgid "" +"`基于LSTM等RNN网络的文本分类 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:16 +msgid "" +"`基于预训练模型的文本分类 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:17 +msgid "" +"`自定义数据集实现文本多分类任务 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:19 +msgid "信息抽取" +msgstr "" + +#: ../tutorials/overview.rst:21 +msgid "" +"`使用BiGRU-CRF模型完成快递单信息抽取 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:22 +msgid "" +"`使用预训练模型ERNIE优化快递单信息抽取 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:23 +msgid "`关系抽取 `_" +msgstr "" + +#: ../tutorials/overview.rst:24 +msgid "`事件抽取 `_" +msgstr "" + +#: ../tutorials/overview.rst:26 +msgid "阅读理解式问答" +msgstr "" + +#: ../tutorials/overview.rst:28 +msgid "" +"`使用预训练模型完成阅读理解 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:30 +msgid "对话" +msgstr "" + +#: ../tutorials/overview.rst:32 +msgid "`多技能对话 `_" +msgstr "" + +#: ../tutorials/overview.rst:34 +msgid "文本生成" +msgstr "" + +#: ../tutorials/overview.rst:36 +msgid "" +"`使用Seq2Seq模型完成自动对联 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:37 +msgid "" +"`使用预训练模型ERNIE-GEN实现智能写诗 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:39 +msgid "时序预测" +msgstr "" + +#: ../tutorials/overview.rst:41 +msgid "" +"`使用TCN网络完成新冠疫情病例数预测 " +"`_" +msgstr "" + +#: ../tutorials/overview.rst:43 +msgid "" +"更多教程参见 `PaddleNLP on AI Studio " +"`_" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/reading_comprehension.po b/docs/locale/en/LC_MESSAGES/tutorials/reading_comprehension.po new file mode 100644 index 0000000000000000000000000000000000000000..dd64cc674261d5f00a25ac145e952a4e914f8b08 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/reading_comprehension.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/reading_comprehension.rst:3 +msgid "阅读理解" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/semantic_matching.po b/docs/locale/en/LC_MESSAGES/tutorials/semantic_matching.po new file mode 100644 index 0000000000000000000000000000000000000000..16629a2f69eef79ed57bab50297391190ebabb7a --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/semantic_matching.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. 
+# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/semantic_matching.rst:3 +msgid "语义匹配" +msgstr "" + diff --git a/docs/locale/en/LC_MESSAGES/tutorials/text_generation.po b/docs/locale/en/LC_MESSAGES/tutorials/text_generation.po new file mode 100644 index 0000000000000000000000000000000000000000..22f5a95187db8208fae666677d9df84f4b1f2b42 --- /dev/null +++ b/docs/locale/en/LC_MESSAGES/tutorials/text_generation.po @@ -0,0 +1,23 @@ +# SOME DESCRIPTIVE TITLE. +# Copyright (C) 2021, PaddleNLP +# This file is distributed under the same license as the PaddleNLP package. +# FIRST AUTHOR , 2022. +# +#, fuzzy +msgid "" +msgstr "" +"Project-Id-Version: PaddleNLP \n" +"Report-Msgid-Bugs-To: \n" +"POT-Creation-Date: 2022-03-18 21:31+0800\n" +"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" +"Last-Translator: FULL NAME \n" +"Language-Team: LANGUAGE \n" +"MIME-Version: 1.0\n" +"Content-Type: text/plain; charset=utf-8\n" +"Content-Transfer-Encoding: 8bit\n" +"Generated-By: Babel 2.9.0\n" + +#: ../tutorials/text_generation.rst:3 +msgid "文本生成" +msgstr "" + diff --git a/docs/metrics.md b/docs/metrics.md new file mode 100644 index 0000000000000000000000000000000000000000..581287460d49eb3bead3920791d562dc9bbd8c69 --- /dev/null +++ b/docs/metrics.md @@ -0,0 +1,15 @@ +# PaddleNLP Metrics API + +目前PaddleNLP提供以下模型评价指标: + +| Metric | 简介 | API | +| ------ | --- | --- | +| [Perplexity](https://en.wikipedia.org/wiki/Perplexity) | 困惑度,常用来衡量语言模型优劣,也可用于机器翻译、文本生成等任务。 | `paddlenlp.metrics.Perplexity` | +| [BLEU(BiLingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) | 机器翻译常用评价指标 | `paddlenlp.metrics.BLEU` | +| [Rouge(Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)) | 评估自动文摘以及机器翻译的指标 | `paddlenlp.metrics.RougeL`, `paddlenlp.metrics.RougeN` | +| AccuracyAndF1 | 准确率及F1-score,可用于GLUE中的MRPC 和QQP任务 | `paddlenlp.metrics.AccuracyAndF1` | +| PearsonAndSpearman | 皮尔森相关性系数和斯皮尔曼相关系数。可用于GLUE中的STS-B任务 | `paddlenlp.metrics.PearsonAndSpearman` | +| Mcc(Matthews correlation coefficient) | 马修斯相关系数,用以测量二分类的分类性能的指标。可用于GLUE中的CoLA任务 | `paddlenlp.metrics.Mcc` | +| ChunkEvaluator | 计算了块检测的精确率、召回率和F1-score。常用于序列标记任务,如命名实体识别(NER) | `paddlenlp.metrics.ChunkEvaluator` | +| Squad Evalutaion | 用于SQuAD和DuReader-robust的评价指标 | `paddlenlp.metrics.compute_predictions`, `paddlenlp.metrics.squad_evaluate` | +| [Distinct](https://arxiv.org/abs/1510.03055) | 多样性指标,常用来衡量文本生成模型生成的句子形式上的多样性。 | `paddlenlp.metrics.Distinct` | diff --git a/docs/metrics/metrics.md b/docs/metrics/metrics.md new file mode 100644 index 0000000000000000000000000000000000000000..581287460d49eb3bead3920791d562dc9bbd8c69 --- /dev/null +++ b/docs/metrics/metrics.md @@ -0,0 +1,15 @@ +# PaddleNLP Metrics API + +目前PaddleNLP提供以下模型评价指标: + +| Metric | 简介 | API | +| ------ | --- | --- | +| [Perplexity](https://en.wikipedia.org/wiki/Perplexity) | 困惑度,常用来衡量语言模型优劣,也可用于机器翻译、文本生成等任务。 | `paddlenlp.metrics.Perplexity` | +| [BLEU(BiLingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) | 机器翻译常用评价指标 | `paddlenlp.metrics.BLEU` | +| [Rouge(Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)) | 评估自动文摘以及机器翻译的指标 | `paddlenlp.metrics.RougeL`, 
`paddlenlp.metrics.RougeN` | +| AccuracyAndF1 | 准确率及F1-score,可用于GLUE中的MRPC 和QQP任务 | `paddlenlp.metrics.AccuracyAndF1` | +| PearsonAndSpearman | 皮尔森相关性系数和斯皮尔曼相关系数。可用于GLUE中的STS-B任务 | `paddlenlp.metrics.PearsonAndSpearman` | +| Mcc(Matthews correlation coefficient) | 马修斯相关系数,用以测量二分类的分类性能的指标。可用于GLUE中的CoLA任务 | `paddlenlp.metrics.Mcc` | +| ChunkEvaluator | 计算了块检测的精确率、召回率和F1-score。常用于序列标记任务,如命名实体识别(NER) | `paddlenlp.metrics.ChunkEvaluator` | +| Squad Evalutaion | 用于SQuAD和DuReader-robust的评价指标 | `paddlenlp.metrics.compute_predictions`, `paddlenlp.metrics.squad_evaluate` | +| [Distinct](https://arxiv.org/abs/1510.03055) | 多样性指标,常用来衡量文本生成模型生成的句子形式上的多样性。 | `paddlenlp.metrics.Distinct` | diff --git a/docs/model_zoo/embeddings.md b/docs/model_zoo/embeddings.md new file mode 100644 index 0000000000000000000000000000000000000000..b24bb5ba18f56a9f9181dade82a694416a9caf6e --- /dev/null +++ b/docs/model_zoo/embeddings.md @@ -0,0 +1,307 @@ +# PaddleNLP Embedding API + +- [介绍](#介绍) +- [用法](#用法) + * [TokenEmbedding参数](#TokenEmbedding参数) + * [初始化](#初始化) + * [查询embedding结果](#查询embedding结果) + * [可视化embedding结果](#可视化embedding结果) + * [计算词向量cosine相似度](#计算词向量cosine相似度) + * [计算词向量内积](#计算词向量内积) + * [训练](#训练) + * [切词](#切词) +- [预训练模型](#预训练模型) + * [中文词向量](#中文词向量) + * [英文词向量](#英文词向量) + * [Word2Vec](#word2vec) + * [GloVe](#glove) + * [FastText](#fasttext) + * [使用方式](#使用方式) + * [模型信息](#模型信息) +- [致谢](#致谢) +- [参考论文](#参考论文) + +## 介绍 + +PaddleNLP提供多个开源的预训练词向量模型,用户仅需在使用`paddlenlp.embeddings.TokenEmbedding`时,指定预训练模型的名称,即可加载相对应的预训练模型。以下将介绍`TokenEmbedding`详细用法,并列出PaddleNLP所支持的预训练Embedding模型。 + +## 用法 + +### TokenEmbedding参数 + +| 参数 | 类型 | 属性 | +| ------------ | ------------ | ------------ | +| embedding_name | **string** | 预训练embedding名称,可通过paddlenlp.embeddings.list_embedding_name()或[Embedding 模型汇总](#中文词向量)查询。 | +| unknown_token | **string** | unknown token。 | +| unknown_token_vector | **list** 或者 **np.array** | 用来初始化unknown token对应的vector。默认为None(以正态分布方式初始化vector)| +| extended_vocab_path | **string** | 扩展词表的文件名路径。词表格式为一行一个词。 | +| trainable | **bool** | 是否可训练。True表示Embedding可以更新参数,False为不可更新。 | + +### 初始化 +```python +import paddle +from paddlenlp.embeddings import TokenEmbedding, list_embedding_name +paddle.set_device("cpu") + +# 查看预训练embedding名称: +print(list_embedding_name()) # ['w2v.baidu_encyclopedia.target.word-word.dim300'] + +# 初始化TokenEmbedding, 预训练embedding没下载时会自动下载并加载数据 +token_embedding = TokenEmbedding(embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300") + +# 查看token_embedding详情 +print(token_embedding) + +Object type: +Unknown index: 635963 +Unknown token: [UNK] +Padding index: 635964 +Padding token: [PAD] +Parameter containing: +Tensor(shape=[635965, 300], dtype=float32, place=CPUPlace, stop_gradient=False, + [[-0.24200200, 0.13931701, 0.07378800, ..., 0.14103900, 0.05592300, -0.08004800], + [-0.08671700, 0.07770800, 0.09515300, ..., 0.11196400, 0.03082200, -0.12893000], + [-0.11436500, 0.12201900, 0.02833000, ..., 0.11068700, 0.03607300, -0.13763499], + ..., + [ 0.02628800, -0.00008300, -0.00393500, ..., 0.00654000, 0.00024600, -0.00662600], + [-0.00924490, 0.00652097, 0.01049327, ..., -0.01796000, 0.03498908, -0.02209341], + [ 0. , 0. , 0. , ..., 0. , 0. , 0. 
]]) + +``` + +### 查询embedding结果 + +```python +test_token_embedding = token_embedding.search("中国") +print(test_token_embedding) +[[ 0.260801 0.1047 0.129453 -0.257317 -0.16152 0.19567 -0.074868 + 0.361168 0.245882 -0.219141 -0.388083 0.235189 0.029316 0.154215 + -0.354343 0.017746 0.009028 0.01197 -0.121429 0.096542 0.009255 + ..., + -0.260592 -0.019668 -0.063312 -0.094939 0.657352 0.247547 -0.161621 + 0.289043 -0.284084 0.205076 0.059885 0.055871 0.159309 0.062181 + 0.123634 0.282932 0.140399 -0.076253 -0.087103 0.07262 ]] +``` + +### 可视化embedding结果 +使用深度学习可视化工具[VisualDL](https://github.com/PaddlePaddle/VisualDL)的High Dimensional组件可以对embedding结果进行可视化展示,便于对其直观分析,步骤如下: +```python +# 获取词表中前1000个单词 +labels = token_embedding.vocab.to_tokens(list(range(0,1000))) +test_token_embedding = token_embedding.search(labels) + +# 引入VisualDL的LogWriter记录日志 +from visualdl import LogWriter + +with LogWriter(logdir='./visualize') as writer: + writer.add_embeddings(tag='test', mat=test_token_embedding, metadata=labels) +``` +执行完毕后会在当前路径下生成一个visualize目录,并将日志存放在其中,我们在命令行启动VisualDL即可进行查看,启动命令为: +```shell +visualdl --logdir ./visualize +``` +启动后打开浏览器即可看到可视化结果 + +
+（此处为 VisualDL High Dimensional 组件展示 embedding 可视化结果的效果图）
+
+ +使用VisualDL除可视化embedding结果外,还可以对标量、图片、音频等进行可视化,有效提升训练调参效率。关于VisualDL更多功能和详细介绍,可参考[VisualDL使用文档](https://github.com/PaddlePaddle/VisualDL/tree/develop/docs)。 + +### 计算词向量cosine相似度 + +```python +score = token_embedding.cosine_sim("中国", "美国") +print(score) # 0.49586025 +``` + +### 计算词向量内积 + +```python +score = token_embedding.dot("中国", "美国") +print(score) # 8.611071 +``` + + +### 训练 + +以下为`TokenEmbedding`简单的组网使用方法。有关更多`TokenEmbedding`训练流程相关的使用方法,请参考[Word Embedding with PaddleNLP](../../examples/word_embedding/README.md)。 + +```python +in_words = paddle.to_tensor([0, 2, 3]) +input_embeddings = token_embedding(in_words) +linear = paddle.nn.Linear(token_embedding.embedding_dim, 20) +input_fc = linear(input_embeddings) +print(input_fc) +Tensor(shape=[3, 20], dtype=float32, place=CPUPlace, stop_gradient=False, + [[ 0. , 0. , 0. , ..., 0. , 0. , 0. ], + [-0.23473957, 0.17878169, 0.07215232, ..., 0.03698236, 0.14291850, 0.05136518], + [-0.42466098, 0.15017235, -0.04780108, ..., -0.04995505, 0.15847842, 0.00025209]]) +``` + +### 切词 + +```python +from paddlenlp.data import JiebaTokenizer +tokenizer = JiebaTokenizer(vocab=token_embedding.vocab) +words = tokenizer.cut("中国人民") +print(words) # ['中国人', '民'] + +tokens = tokenizer.encode("中国人民") +print(tokens) # [12530, 1334] +``` + +## 预训练模型 + +以下将列举PaddleNLP支持的Embedding预训练模型。 +- 模型命名方式为:\${训练模型}.\${语料}.\${词向量类型}.\${co-occurrence type}.dim\${维度}。 +- 模型有三种,分别是Word2Vec(w2v, skip-gram), GloVe(glove)和FastText(fasttext)。 + +### 中文词向量 + +以下预训练词向量由[Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors)提供。 + +根据不同类型的上下文为每个语料训练多个目标词向量,第二列开始表示不同类型的上下文。以下为上下文类别: + +* Word表示训练时目标词预测的上下文是一个Word。 +* Word + N-gram表示训练时目标词预测的上下文是一个Word或者Ngram,其中bigram表示2-grams,ngram.1-2表示1-gram或者2-grams。 +* Word + Character表示训练时目标词预测的上下文是一个Word或者Character,其中word-character.char1-2表示上下文是1个或2个Character。 +* Word + Character + Ngram表示训练时目标词预测的上下文是一个Word、Character或者Ngram。bigram-char表示上下文是2-grams或者1个Character。 + +| 语料 | Word | Word + N-gram | Word + Character | Word + Character + N-gram | +| ------------------------------------------- | ---- | ---- | ---- | ---- | +| Baidu Encyclopedia 百度百科 | w2v.baidu_encyclopedia.target.word-word.dim300 | w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300 | w2v.baidu_encyclopedia.target.word-character.char1-2.dim300 | w2v.baidu_encyclopedia.target.bigram-char.dim300 | +| Wikipedia_zh 中文维基百科 | w2v.wiki.target.word-word.dim300 | w2v.wiki.target.word-bigram.dim300 | w2v.wiki.target.word-char.dim300 | w2v.wiki.target.bigram-char.dim300 | +| People's Daily News 人民日报 | w2v.people_daily.target.word-word.dim300 | w2v.people_daily.target.word-bigram.dim300 | w2v.people_daily.target.word-char.dim300 | w2v.people_daily.target.bigram-char.dim300 | +| Sogou News 搜狗新闻 | w2v.sogou.target.word-word.dim300 | w2v.sogou.target.word-bigram.dim300 | w2v.sogou.target.word-char.dim300 | w2v.sogou.target.bigram-char.dim300 | +| Financial News 金融新闻 | w2v.financial.target.word-word.dim300 | w2v.financial.target.word-bigram.dim300 | w2v.financial.target.word-char.dim300 | w2v.financial.target.bigram-char.dim300 | +| Zhihu_QA 知乎问答 | w2v.zhihu.target.word-word.dim300 | w2v.zhihu.target.word-bigram.dim300 | w2v.zhihu.target.word-char.dim300 | w2v.zhihu.target.bigram-char.dim300 | +| Weibo 微博 | w2v.weibo.target.word-word.dim300 | w2v.weibo.target.word-bigram.dim300 | w2v.weibo.target.word-char.dim300 | w2v.weibo.target.bigram-char.dim300 | +| Literature 文学作品 | w2v.literature.target.word-word.dim300 | w2v.literature.target.word-bigram.dim300 | 
w2v.literature.target.word-char.dim300 | w2v.literature.target.bigram-char.dim300 | +| Complete Library in Four Sections 四库全书 | w2v.sikuquanshu.target.word-word.dim300 | w2v.sikuquanshu.target.word-bigram.dim300 | 无 | 无 | +| Mixed-large 综合 | w2v.mixed-large.target.word-word.dim300 | 暂无 | w2v.mixed-large.target.word-word.dim300 | 暂无 | + +特别地,对于百度百科语料,在不同的 Co-occurrence类型下分别提供了目标词与上下文向量: + +| Co-occurrence 类型 | 目标词向量 | 上下文词向量 | +| --------------------------- | ------ | ---- | +| Word → Word | w2v.baidu_encyclopedia.target.word-word.dim300 | w2v.baidu_encyclopedia.context.word-word.dim300 | +| Word → Ngram (1-2) | w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300 | w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300 | +| Word → Ngram (1-3) | w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300 | w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300 | +| Ngram (1-2) → Ngram (1-2)| w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300 | w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300 | +| Word → Character (1) | w2v.baidu_encyclopedia.target.word-character.char1-1.dim300 | w2v.baidu_encyclopedia.context.word-character.char1-1.dim300 | +| Word → Character (1-2) | w2v.baidu_encyclopedia.target.word-character.char1-2.dim300 | w2v.baidu_encyclopedia.context.word-character.char1-2.dim300 | +| Word → Character (1-4) | w2v.baidu_encyclopedia.target.word-character.char1-4.dim300 | w2v.baidu_encyclopedia.context.word-character.char1-4.dim300 | +| Word → Word (left/right) | w2v.baidu_encyclopedia.target.word-wordLR.dim300 | w2v.baidu_encyclopedia.context.word-wordLR.dim300 | +| Word → Word (distance) | w2v.baidu_encyclopedia.target.word-wordPosition.dim300 | w2v.baidu_encyclopedia.context.word-wordPosition.dim300 | + +### 英文词向量 + +### Word2Vec + +| 语料 | 名称 | +|------|------| +| Google News | w2v.google_news.target.word-word.dim300.en | + +### GloVe + +| 语料 | 25维 | 50维 | 100维 | 200维 | 300 维 | +| ----------------- | ------ | ------ | ------ | ------ | ------ | +| Wiki2014 + GigaWord | 无 | glove.wiki2014-gigaword.target.word-word.dim50.en | glove.wiki2014-gigaword.target.word-word.dim100.en | glove.wiki2014-gigaword.target.word-word.dim200.en | glove.wiki2014-gigaword.target.word-word.dim300.en | +| Twitter | glove.twitter.target.word-word.dim25.en | glove.twitter.target.word-word.dim50.en | glove.twitter.target.word-word.dim100.en | glove.twitter.target.word-word.dim200.en | 无 | + +### FastText + +| 语料 | 名称 | +|------|------| +| Wiki2017 | fasttext.wiki-news.target.word-word.dim300.en | +| Crawl | fasttext.crawl.target.word-word.dim300.en | + +### 使用方式 + +以上所述的模型名称可直接以参数形式传入`padddlenlp.embeddings.TokenEmbedding`,加载相对应的模型。比如要加载语料为Wiki2017,通过FastText训练的预训练模型(`fasttext.wiki-news.target.word-word.dim300.en`),只需执行以下代码: + +```python +import paddle +from paddlenlp.embeddings import TokenEmbedding + +token_embedding = TokenEmbedding(embedding_name="fasttext.wiki-news.target.word-word.dim300.en") +``` + +### 模型信息 + +| 模型 | 文件大小 | 词表大小 | +|-----|---------|---------| +| w2v.baidu_encyclopedia.target.word-word.dim300 | 678.21 MB | 635965 | +| w2v.baidu_encyclopedia.target.word-character.char1-1.dim300 | 679.15 MB | 636038 | +| w2v.baidu_encyclopedia.target.word-character.char1-2.dim300 | 679.30 MB | 636038 | +| w2v.baidu_encyclopedia.target.word-character.char1-4.dim300 | 679.51 MB | 636038 | +| w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300 | 679.48 MB | 635977 | +| w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300 | 671.27 MB | 628669 | +| w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300 | 7.28 GB | 
6969069 | +| w2v.baidu_encyclopedia.target.word-wordLR.dim300 | 678.22 MB | 635958 | +| w2v.baidu_encyclopedia.target.word-wordPosition.dim300 | 679.32 MB | 636038 | +| w2v.baidu_encyclopedia.target.bigram-char.dim300 | 679.29 MB | 635976 | +| w2v.baidu_encyclopedia.context.word-word.dim300 | 677.74 MB | 635952 | +| w2v.baidu_encyclopedia.context.word-character.char1-1.dim300 | 678.65 MB | 636200 | +| w2v.baidu_encyclopedia.context.word-character.char1-2.dim300 | 844.23 MB | 792631 | +| w2v.baidu_encyclopedia.context.word-character.char1-4.dim300 | 1.16 GB | 1117461 | +| w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300 | 7.25 GB | 6967598 | +| w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300 | 5.21 GB | 5000001 | +| w2v.baidu_encyclopedia.context.word-ngram.2-2.dim300 | 7.26 GB | 6968998 | +| w2v.baidu_encyclopedia.context.word-wordLR.dim300 | 1.32 GB | 1271031 | +| w2v.baidu_encyclopedia.context.word-wordPosition.dim300 | 6.47 GB | 6293920 | +| w2v.wiki.target.bigram-char.dim300 | 375.98 MB | 352274 | +| w2v.wiki.target.word-char.dim300 | 375.52 MB | 352223 | +| w2v.wiki.target.word-word.dim300 | 374.95 MB | 352219 | +| w2v.wiki.target.word-bigram.dim300 | 375.72 MB | 352219 | +| w2v.people_daily.target.bigram-char.dim300 | 379.96 MB | 356055 | +| w2v.people_daily.target.word-char.dim300 | 379.45 MB | 355998 | +| w2v.people_daily.target.word-word.dim300 | 378.93 MB | 355989 | +| w2v.people_daily.target.word-bigram.dim300 | 379.68 MB | 355991 | +| w2v.weibo.target.bigram-char.dim300 | 208.24 MB | 195199 | +| w2v.weibo.target.word-char.dim300 | 208.03 MB | 195204 | +| w2v.weibo.target.word-word.dim300 | 207.94 MB | 195204 | +| w2v.weibo.target.word-bigram.dim300 | 208.19 MB | 195204 | +| w2v.sogou.target.bigram-char.dim300 | 389.81 MB | 365112 | +| w2v.sogou.target.word-char.dim300 | 389.89 MB | 365078 | +| w2v.sogou.target.word-word.dim300 | 388.66 MB | 364992 | +| w2v.sogou.target.word-bigram.dim300 | 388.66 MB | 364994 | +| w2v.zhihu.target.bigram-char.dim300 | 277.35 MB | 259755 | +| w2v.zhihu.target.word-char.dim300 | 277.40 MB | 259940 | +| w2v.zhihu.target.word-word.dim300 | 276.98 MB | 259871 | +| w2v.zhihu.target.word-bigram.dim300 | 277.53 MB | 259885 | +| w2v.financial.target.bigram-char.dim300 | 499.52 MB | 467163 | +| w2v.financial.target.word-char.dim300 | 499.17 MB | 467343 | +| w2v.financial.target.word-word.dim300 | 498.94 MB | 467324 | +| w2v.financial.target.word-bigram.dim300 | 499.54 MB | 467331 | +| w2v.literature.target.bigram-char.dim300 | 200.69 MB | 187975 | +| w2v.literature.target.word-char.dim300 | 200.44 MB | 187980 | +| w2v.literature.target.word-word.dim300 | 200.28 MB | 187961 | +| w2v.literature.target.word-bigram.dim300 | 200.59 MB | 187962 | +| w2v.sikuquanshu.target.word-word.dim300 | 20.70 MB | 19529 | +| w2v.sikuquanshu.target.word-bigram.dim300 | 20.77 MB | 19529 | +| w2v.mixed-large.target.word-char.dim300 | 1.35 GB | 1292552 | +| w2v.mixed-large.target.word-word.dim300 | 1.35 GB | 1292483 | +| w2v.google_news.target.word-word.dim300.en | 1.61 GB | 3000000 | +| glove.wiki2014-gigaword.target.word-word.dim50.en | 73.45 MB | 400002 | +| glove.wiki2014-gigaword.target.word-word.dim100.en | 143.30 MB | 400002 | +| glove.wiki2014-gigaword.target.word-word.dim200.en | 282.97 MB | 400002 | +| glove.wiki2014-gigaword.target.word-word.dim300.en | 422.83 MB | 400002 | +| glove.twitter.target.word-word.dim25.en | 116.92 MB | 1193516 | +| glove.twitter.target.word-word.dim50.en | 221.64 MB | 1193516 | +| glove.twitter.target.word-word.dim100.en | 431.08 
MB | 1193516 | +| glove.twitter.target.word-word.dim200.en | 848.56 MB | 1193516 | +| fasttext.wiki-news.target.word-word.dim300.en | 541.63 MB | 999996 | +| fasttext.crawl.target.word-word.dim300.en | 1.19 GB | 2000002 | + +## 致谢 +- 感谢 [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors)提供Word2Vec中文预训练词向量。 +- 感谢 [GloVe Project](https://nlp.stanford.edu/projects/glove)提供的GloVe英文预训练词向量。 +- 感谢 [FastText Project](https://fasttext.cc/docs/en/english-vectors.html)提供的英文预训练词向量。 + +## 参考论文 +- Li, Shen, et al. "Analogical reasoning on chinese morphological and semantic relations." arXiv preprint arXiv:1805.06504 (2018). +- Qiu, Yuanyuan, et al. "Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings." Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. +- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. +- T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin. Advances in Pre-Training Distributed Word Representations. diff --git a/docs/model_zoo/index.rst b/docs/model_zoo/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..d15e4b6ab022179267ae8afa63cd4705eaa5184c --- /dev/null +++ b/docs/model_zoo/index.rst @@ -0,0 +1,311 @@ + + +PaddleNLP Transformer预训练模型 +==================================== + +随着深度学习的发展,NLP领域涌现了一大批高质量的Transformer类预训练模型,多次刷新了不同NLP任务的SOTA(State of the Art),极大地推动了自然语言处理的进展。 +PaddleNLP为用户提供了常用的预训练模型及其相应权重,如 ``BERT``、``ERNIE``、``ALBERT``、``RoBERTa``、``XLNet`` 等,采用统一的API进行加载、训练和调用, +让开发者能够方便快捷地应用各种Transformer类预训练模型及其下游任务,且相应预训练模型权重下载速度快、稳定。 + +------------------------------------ +预训练模型使用方法 +------------------------------------ + +PaddleNLP Transformer API在提供丰富预训练模型的同时,也降低了用户的使用门槛。 +使用Auto模块,可以加载不同网络结构的预训练模型,无需查找模型对应的类别。只需十几行代码,用户即可完成模型加载和下游任务Fine-tuning。 + +.. code:: python + + from functools import partial + import numpy as np + + import paddle + from paddlenlp.datasets import load_dataset + from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + + train_ds = load_dataset("chnsenticorp", splits=["train"]) + + model = AutoModelForSequenceClassification.from_pretrained("bert-wwm-chinese", num_classes=len(train_ds.label_list)) + + tokenizer = AutoTokenizer.from_pretrained("bert-wwm-chinese") + + def convert_example(example, tokenizer): + encoded_inputs = tokenizer(text=example["text"], max_seq_len=512, pad_to_max_seq_len=True) + return tuple([np.array(x, dtype="int64") for x in [ + encoded_inputs["input_ids"], encoded_inputs["token_type_ids"], [example["label"]]]]) + train_ds = train_ds.map(partial(convert_example, tokenizer=tokenizer)) + + batch_sampler = paddle.io.BatchSampler(dataset=train_ds, batch_size=8, shuffle=True) + train_data_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=batch_sampler, return_list=True) + + optimizer = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters()) + + criterion = paddle.nn.loss.CrossEntropyLoss() + + for input_ids, token_type_ids, labels in train_data_loader(): + logits = model(input_ids, token_type_ids) + loss = criterion(logits, labels) + loss.backward() + optimizer.step() + optimizer.clear_grad() + +上面的代码给出使用预训练模型的简要示例,更完整详细的示例代码, +可以参考:`使用预训练模型Fine-tune完成中文文本分类任务 `_ + +1. 加载数据集:PaddleNLP内置了多种数据集,用户可以一键导入所需的数据集。 +2. 
加载预训练模型:PaddleNLP的预训练模型可以很容易地通过 ``from_pretrained()`` 方法加载。 + Auto模块(包括AutoModel, AutoTokenizer, 及各种下游任务类)提供了方便易用的接口, + 无需指定类别,即可调用不同网络结构的预训练模型。 + 第一个参数是汇总表中对应的 ``Pretrained Weight``,可加载对应的预训练权重。 + ``AutoModelForSequenceClassification`` 初始化 ``__init__`` 所需的其他参数,如 ``num_classes`` 等, + 也是通过 ``from_pretrained()`` 传入。``Tokenizer`` 使用同样的 ``from_pretrained`` 方法加载。 +3. 通过 ``Dataset`` 的 ``map`` 函数,使用 ``tokenizer`` 将 ``dataset`` 从原始文本处理成模型的输入。 +4. 定义 ``BatchSampler`` 和 ``DataLoader``,shuffle数据、组合Batch。 +5. 定义训练所需的优化器,loss函数等,就可以开始进行模型fine-tune任务。 + +------------------------------------ +Transformer预训练模型汇总 +------------------------------------ + +PaddleNLP的Transformer预训练模型包含从 `huggingface.co`_ 直接转换的模型权重和百度自研模型权重,方便社区用户直接迁移使用。 +目前共包含了40多个主流预训练模型,500多个模型权重。 + +.. _huggingface.co: https://huggingface.co/models + +.. toctree:: + :maxdepth: 3 + + ALBERT + BART + BERT + BigBird + Blenderbot + Blenderbot-Small + ChineseBert + ConvBert + CTRL + DistilBert + ELECTRA + ERNIE + ERNIE-CTM + ERNIE-DOC + ERNIE-GEN + ERNIE-GRAM + ERNIE-M + FNet + Funnel + GPT + LayoutLM + LayoutLMV2 + LayoutXLM + Luke + MBart + MegatronBert + MobileBert + MPNet + NeZha + PPMiniLM + ProphetNet + Reformer + RemBert + RoBERTa + RoFormer + SKEP + SqueezeBert + T5 + TinyBert + UnifiedTransformer + UNIMO + XLNet + + + +------------------------------------ +Transformer预训练模型适用任务汇总 +------------------------------------ + ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +| Model | Sequence Classification | Token Classification | Question Answering | Text Generation | Multiple Choice | ++====================+=========================+======================+====================+=================+=================+ +|ALBERT_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|BART_ | ✅ | ✅ | ✅ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|BERT_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|BigBird_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|Blenderbot_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|Blenderbot-Small_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ChineseBert_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ConvBert_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|CTRL_ | ✅ | ❌ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|DistilBert_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ELECTRA_ | ✅ | ✅ | ✅ | ❌ | ✅ | 
++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE-CTM_ | ❌ | ✅ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE-DOC_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE-GEN_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE-GRAM_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ERNIE-M_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|FNet_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|Funnel_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|GPT_ | ✅ | ✅ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|LayoutLM_ | ✅ | ✅ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|LayoutLMV2_ | ❌ | ✅ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|LayoutXLM_ | ❌ | ✅ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|Luke_ | ❌ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|MBart_ | ✅ | ❌ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|MegatronBert_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|MobileBert_ | ✅ | ❌ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|MPNet_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|NeZha_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|PPMiniLM_ | ✅ | ❌ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|ProphetNet_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|Reformer_ | ✅ | ❌ | ✅ | ❌ | ❌ | 
++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|RemBert_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|RoBERTa_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|RoFormer_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|SKEP_ | ✅ | ✅ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|SqueezeBert_ | ✅ | ✅ | ✅ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|T5_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|TinyBert_ | ✅ | ❌ | ❌ | ❌ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|UnifiedTransformer_ | ❌ | ❌ | ❌ | ✅ | ❌ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ +|XLNet_ | ✅ | ✅ | ✅ | ❌ | ✅ | ++--------------------+-------------------------+----------------------+--------------------+-----------------+-----------------+ + +.. _ALBERT: https://arxiv.org/abs/1909.11942 +.. _BART: https://arxiv.org/abs/1910.13461 +.. _BERT: https://arxiv.org/abs/1810.04805 +.. _BERT-Japanese: https://arxiv.org/abs/1810.04805 +.. _BigBird: https://arxiv.org/abs/2007.14062 +.. _Blenderbot: https://arxiv.org/pdf/2004.13637.pdf +.. _Blenderbot-Small: https://arxiv.org/pdf/2004.13637.pdf +.. _ChineseBert: https://arxiv.org/abs/2106.16038 +.. _ConvBert: https://arxiv.org/abs/2008.02496 +.. _CTRL: https://arxiv.org/abs/1909.05858 +.. _DistilBert: https://arxiv.org/abs/1910.01108 +.. _ELECTRA: https://arxiv.org/abs/2003.10555 +.. _ERNIE: https://arxiv.org/abs/1904.09223 +.. _ERNIE-CTM: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm +.. _ERNIE-DOC: https://arxiv.org/abs/2012.15688 +.. _ERNIE-GEN: https://arxiv.org/abs/2001.11314 +.. _ERNIE-GRAM: https://arxiv.org/abs/2010.12148 +.. _ERNIE-M: https://arxiv.org/abs/2012.15674 +.. _FNet: https://arxiv.org/abs/2105.03824 +.. _Funnel: https://arxiv.org/abs/2006.03236 +.. _GPT: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf +.. _LayoutLM: https://arxiv.org/abs/1912.13318 +.. _LayoutLMV2: https://arxiv.org/abs/2012.14740 +.. _LayoutXLM: https://arxiv.org/abs/2104.08836 +.. _Luke: https://arxiv.org/abs/2010.01057 +.. _MBart: https://arxiv.org/abs/2001.08210 +.. _MegatronBert: https://arxiv.org/abs/1909.08053 +.. _MobileBert: https://arxiv.org/abs/2004.02984 +.. _MPNet: https://arxiv.org/abs/2004.09297 +.. _NeZha: https://arxiv.org/abs/1909.00204 +.. _PPMiniLM: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression/pp-minilm +.. _ProphetNet: https://arxiv.org/abs/2001.04063 +.. _Reformer: https://arxiv.org/abs/2001.04451 +.. _RemBert: https://arxiv.org/abs/2010.12821 +.. _RoBERTa: https://arxiv.org/abs/1907.11692 +.. _RoFormer: https://arxiv.org/abs/2104.09864 +.. 
_SKEP: https://arxiv.org/abs/2005.05635 +.. _SqueezeBert: https://arxiv.org/abs/2006.11316 +.. _T5: https://arxiv.org/abs/1910.10683 +.. _TinyBert: https://arxiv.org/abs/1909.10351 +.. _UnifiedTransformer: https://arxiv.org/abs/2006.16779 +.. _UNIMO: https://arxiv.org/abs/2012.15409 +.. _XLNet: https://arxiv.org/abs/1906.08237 + +------------------------------------ +Reference +------------------------------------ +- 部分中文预训练模型来自: + `brightmart/albert_zh `_, + `ymcui/Chinese-BERT-wwm `_, + `huawei-noah/Pretrained-Language-Model/TinyBERT `_, + `ymcui/Chinese-XLNet `_, + `huggingface/xlnet_chinese_large `_, + `Knover/luge-dialogue `_, + `huawei-noah/Pretrained-Language-Model/NEZHA-PyTorch/ `_, + `ZhuiyiTechnology/simbert `_ +- Lan, Zhenzhong, et al. "Albert: A lite bert for self-supervised learning of language representations." arXiv preprint arXiv:1909.11942 (2019). +- Lewis, Mike, et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." arXiv preprint arXiv:1910.13461 (2019). +- Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018). +- Zaheer, Manzil, et al. "Big bird: Transformers for longer sequences." arXiv preprint arXiv:2007.14062 (2020). +- Stephon, Emily, et al. "Blenderbot: Recipes for building an open-domain chatbot." arXiv preprint arXiv:2004.13637 (2020). +- Stephon, Emily, et al. "Blenderbot-Small: Recipes for building an open-domain chatbot." arXiv preprint arXiv:2004.13637 (2020). +- Sun, Zijun, et al. "Chinesebert: Chinese pretraining enhanced by glyph and pinyin information." arXiv preprint arXiv:2106.16038 (2021). +- Zhang, zhengyan, et al. "CPM: A Large-scale Generative Chinese Pre-trained Language Model." arXiv preprint arXiv:2012.00413 (2020). +- Jiang, Zihang, et al. "ConvBERT: Improving BERT with Span-based Dynamic Convolution." arXiv preprint arXiv:2008.02496 (2020). +- Nitish, Bryan, et al. "CTRL: A Conditional Transformer Language Model for Controllable Generation." arXiv preprint arXiv:1909.05858 (2019). +- Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019). +- Clark, Kevin, et al. "Electra: Pre-training text encoders as discriminators rather than generators." arXiv preprint arXiv:2003.10555 (2020). +- Sun, Yu, et al. "Ernie: Enhanced representation through knowledge integration." arXiv preprint arXiv:1904.09223 (2019). +- Ding, Siyu, et al. "ERNIE-Doc: A retrospective long-document modeling transformer." arXiv preprint arXiv:2012.15688 (2020). +- Xiao, Dongling, et al. "Ernie-gen: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation." arXiv preprint arXiv:2001.11314 (2020). +- Xiao, Dongling, et al. "ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding." arXiv preprint arXiv:2010.12148 (2020). +- Ouyang, Xuan, et al. "ERNIE-M: enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora." arXiv preprint arXiv:2012.15674 (2020). +- Lee-Thorp, James, et al. "Fnet: Mixing tokens with fourier transforms." arXiv preprint arXiv:2105.03824 (2021). +- Dai, Zihang, et al. "Funnel-transformer: Filtering out sequential redundancy for efficient language processing." Advances in neural information processing systems 33 (2020): 4271-4282. +- Radford, Alec, et al. 
"Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9. +- Xu, Yiheng, et al. "LayoutLM: Pre-training of Text and Layout for Document Image Understanding." arXiv preprint arXiv:1912.13318 (2019). +- Xu, Yang, et al. "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding" arXiv preprint arXiv:2012.14740 (2020). +- Xu, Yiheng, et al. "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding" arXiv preprint arXiv:2104.08836 (2021). +- Yamada, Ikuya, et al. "Luke: deep contextualized entity representations with entity-aware self-attention." arXiv preprint arXiv:2010.01057 (2020). +- Liu, Yinhan, et al. "MBart: Multilingual Denoising Pre-training for Neural Machine Translation" arXiv preprint arXiv:2001.08210 (2020). +- Shoeybi, Mohammad, et al. "Megatron-lm: Training multi-billion parameter language models using model parallelism." arXiv preprint arXiv:1909.08053 (2019). +- Sun, Zhiqing, et al. "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices" arXiv preprint arXiv:2004.02984 (2020). +- Song, Kaitao, et al. "MPNet: Masked and Permuted Pre-training for Language Understanding." arXiv preprint arXiv:2004.09297 (2020). +- Wei, Junqiu, et al. "NEZHA: Neural contextualized representation for chinese language understanding." arXiv preprint arXiv:1909.00204 (2019). +- Qi, Weizhen, et al. "Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training." arXiv preprint arXiv:2001.04063 (2020). +- Kitaev, Nikita, et al. "Reformer: The efficient Transformer." arXiv preprint arXiv:2001.04451 (2020). +- Chung, Hyung Won, et al. "Rethinking embedding coupling in pre-trained language models." arXiv preprint arXiv:2010.12821 (2020). +- Liu, Yinhan, et al. "Roberta: A robustly optimized bert pretraining approach." arXiv preprint arXiv:1907.11692 (2019). +- Su Jianlin, et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv preprint arXiv:2104.09864 (2021). +- Tian, Hao, et al. "SKEP: Sentiment knowledge enhanced pre-training for sentiment analysis." arXiv preprint arXiv:2005.05635 (2020). +- Forrest, ALbert, et al. "SqueezeBERT: What can computer vision teach NLP about efficient neural networks?" arXiv preprint arXiv:2006.11316 (2020). +- Raffel, Colin, et al. "T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." arXiv preprint arXiv:1910.10683 (2019). +- Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017). +- Jiao, Xiaoqi, et al. "Tinybert: Distilling bert for natural language understanding." arXiv preprint arXiv:1909.10351 (2019). +- Bao, Siqi, et al. "Plato-2: Towards building an open-domain chatbot via curriculum learning." arXiv preprint arXiv:2006.16779 (2020). +- Yang, Zhilin, et al. "Xlnet: Generalized autoregressive pretraining for language understanding." arXiv preprint arXiv:1906.08237 (2019). +- Cui, Yiming, et al. "Pre-training with whole word masking for chinese bert." arXiv preprint arXiv:1906.08101 (2019). +- Wang, Quan, et al. “Building Chinese Biomedical Language Models via Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021). diff --git a/docs/model_zoo/taskflow.md b/docs/model_zoo/taskflow.md new file mode 100644 index 0000000000000000000000000000000000000000..fde8f31b6bcc069a09f9129226495e0592c24b4e --- /dev/null +++ b/docs/model_zoo/taskflow.md @@ -0,0 +1,1912 @@ +# PaddleNLP一键预测功能:Taskflow API + + + +
+ + +------------------------------------------------------------------------------------------ + +## 特性 +PaddleNLP提供**开箱即用**的产业级NLP预置任务能力,无需训练,一键预测。 +- 最全的中文任务:覆盖自然语言理解与自然语言生成两大核心应用; +- 极致的产业级效果:在多个中文场景上提供产业级的精度与预测性能; +- 统一的应用范式:通过`paddlenlp.Taskflow`调用,简捷易用。 + +| 任务名称 | 调用方式 | 一键预测 | 单条输入 | 多条输入 | 文档级输入 | 定制化训练 | 其它特性 | +| :--------------------------------- | -------------------------------- | -------- | -------- | -------- | ---------- | ---------- | ------------------------------------------------------ | +| [中文分词](#中文分词) | `Taskflow("word_segmentation")` | ✅ | ✅ | ✅ | ✅ | ✅ | 多种分词模式,满足快速切分和实体粒度精准切分 | +| [词性标注](#词性标注) | `Taskflow("pos_tagging")` | ✅ | ✅ | ✅ | ✅ | ✅ | 基于百度前沿词法分析工具LAC | +| [命名实体识别](#命名实体识别) | `Taskflow("ner")` | ✅ | ✅ | ✅ | ✅ | ✅ | 覆盖最全中文实体标签 | +| [依存句法分析](#依存句法分析) | `Taskflow("dependency_parsing")` | ✅ | ✅ | ✅ | | ✅ | 基于最大规模中文依存句法树库研发的DDParser | +| [信息抽取](#信息抽取) | `Taskflow("information_extraction")`| ✅ | ✅ | ✅ | ✅ | ✅ | 适配多场景的开放域通用信息抽取工具 | +| [『解语』-知识标注](#解语知识标注) | `Taskflow("knowledge_mining")` | ✅ | ✅ | ✅ | ✅ | ✅ | 覆盖所有中文词汇的知识标注工具 | +| [文本纠错](#文本纠错) | `Taskflow("text_correction")` | ✅ | ✅ | ✅ | ✅ | ✅ | 融合拼音特征的端到端文本纠错模型ERNIE-CSC | +| [文本相似度](#文本相似度) | `Taskflow("text_similarity")` | ✅ | ✅ | ✅ | | | 基于百万量级Dureader Retrieval数据集训练RocketQA并达到前沿文本相似效果| +| [情感分析](#情感分析) | `Taskflow("sentiment_analysis")` | ✅ | ✅ | ✅ | | ✅ | 集成BiLSTM、SKEP、UIE等模型,支持评论维度、观点抽取、情感极性分类等情感分析任务 | +| [生成式问答](#生成式问答) | `Taskflow("question_answering")` | ✅ | ✅ | ✅ | | | 使用最大中文开源CPM模型完成问答 | +| [智能写诗](#智能写诗) | `Taskflow("poetry_generation")` | ✅ | ✅ | ✅ | | | 使用最大中文开源CPM模型完成写诗 | +| [开放域对话](#开放域对话) | `Taskflow("dialogue")` | ✅ | ✅ | ✅ | | | 十亿级语料训练最强中文闲聊模型PLATO-Mini,支持多轮对话 | +| [代码生成](#代码生成) | `Taskflow("code_generation")` | ✅ | ✅ | ✅ | ✅ | | 代码生成大模型 | +| [文本摘要](#文本摘要) | `Taskflow("text_summarization")` | ✅ | ✅ | ✅ | ✅ | | 文本摘要大模型 | +| [文档智能](#文档智能) | `Taskflow("document_intelligence")` | ✅ | ✅ | ✅ | ✅ | | 以多语言跨模态布局增强文档预训练模型ERNIE-Layout为核心底座 | +| [问题生成](#问题生成) | `Taskflow("question_generation")` | ✅ | ✅ | ✅ | ✅ | | 问题生成大模型 | +| [零样本文本分类](#零样本文本分类) | `Taskflow("zero_shot_text_classification")` | ✅ | ✅ | ✅ | | ✅ | 集成多场景的通用文本分类工具 | +| [模型特征提取](#模型特征提取) | `Taskflow("feature_extraction")` | ✅ | ✅ | ✅ | ✅ | | 集成文本,图片的特征抽取工具 | + +## QuickStart + +**环境依赖** + - python >= 3.6 + - paddlepaddle >= 2.3.0 + - paddlenlp >= 2.3.4 + +![taskflow1](https://user-images.githubusercontent.com/11793384/159693816-fda35221-9751-43bb-b05c-7fc77571dd76.gif) + +可进入 Jupyter Notebook 环境,在线体验 👉🏻 [进入在线运行环境](https://aistudio.baidu.com/aistudio/projectdetail/3696243) + +PaddleNLP Taskflow API 支持任务持续丰富中,我们将根据开发者反馈,灵活调整功能建设优先级,可通过Issue或[问卷](https://iwenjuan.baidu.com/?code=44amg8)反馈给我们。 + +## 社区交流👬 + +- 微信扫描二维码并填写问卷之后,加入交流群领取福利 + - 获取5月18-19日每晚20:30《产业级通用信息抽取技术UIE+ERNIE轻量级模型》直播课链接 + - 10G重磅NLP学习大礼包: + +
+(此处为微信交流群二维码图片)
+ +## 详细使用 + +## PART Ⅰ   一键预测 + +### 中文分词 + +
+**多种分词模式,满足快速切分和实体粒度精准切分**
+ +#### 三种分词模式,满足各类分词需求 + +```python +from paddlenlp import Taskflow + +# 默认模式————实体粒度分词,在精度和速度上的权衡,基于百度LAC +>>> seg = Taskflow("word_segmentation") +>>> seg("近日国家卫健委发布第九版新型冠状病毒肺炎诊疗方案") +['近日', '国家卫健委', '发布', '第九版', '新型', '冠状病毒肺炎', '诊疗', '方案'] + +# 快速模式————最快:实现文本快速切分,基于jieba中文分词工具 +>>> seg_fast = Taskflow("word_segmentation", mode="fast") +>>> seg_fast("近日国家卫健委发布第九版新型冠状病毒肺炎诊疗方案") +['近日', '国家', '卫健委', '发布', '第九版', '新型', '冠状病毒', '肺炎', '诊疗', '方案'] + +# 精确模式————最准:实体粒度切分准确度最高,基于百度解语 +# 精确模式基于预训练模型,更适合实体粒度分词需求,适用于知识图谱构建、企业搜索Query分析等场景中 +>>> seg_accurate = Taskflow("word_segmentation", mode="accurate") +>>> seg_accurate("近日国家卫健委发布第九版新型冠状病毒肺炎诊疗方案") +['近日', '国家卫健委', '发布', '第九版', '新型冠状病毒肺炎', '诊疗', '方案'] +``` + +#### 批量样本输入,平均速度更快 + +输入为多个句子组成的list,平均速度会更快。 + +```python +>>> from paddlenlp import Taskflow +>>> seg = Taskflow("word_segmentation") +>>> seg(["第十四届全运会在西安举办", "三亚是一个美丽的城市"]) +[['第十四届', '全运会', '在', '西安', '举办'], ['三亚', '是', '一个', '美丽', '的', '城市']] +``` + +#### 自定义词典 + +你可以通过传入`user_dict`参数,装载自定义词典来定制分词结果。 +在默认模式和精确模式下,词典文件每一行由一个或多个自定义item组成。词典文件`user_dict.txt`示例: +```text +平原上的火焰 +上 映 +``` + +在快速模式下,词典文件每一行为一个自定义item+"\t"+词频(词频可省略,词频省略则自动计算能保证分出该词的词频),暂时不支持黑名单词典(即通过设置”年“、”末“,以达到切分”年末“的目的)。词典文件`user_dict.txt`示例: + +```text +平原上的火焰 10 +``` + +加载自定义词典及输出结果示例: +```python +>>> from paddlenlp import Taskflow +>>> seg = Taskflow("word_segmentation") +>>> seg("平原上的火焰宣布延期上映") +['平原', '上', '的', '火焰', '宣布', '延期', '上映'] +>>> seg = Taskflow("word_segmentation", user_dict="user_dict.txt") +>>> seg("平原上的火焰宣布延期上映") +['平原上的火焰', '宣布', '延期', '上', '映'] +``` +#### 参数说明 +* `mode`:指定分词模式,默认为None。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `user_dict`:自定义词典文件路径,默认为None。 +* `task_path`:自定义任务路径,默认为None。 +
+ +### 词性标注 + +
+**基于百度词法分析工具LAC**
+ +#### 支持单条和批量预测 +```python +>>> from paddlenlp import Taskflow +# 单条预测 +>>> tag = Taskflow("pos_tagging") +>>> tag("第十四届全运会在西安举办") +[('第十四届', 'm'), ('全运会', 'nz'), ('在', 'p'), ('西安', 'LOC'), ('举办', 'v')] + +# 批量样本输入,平均速度更快 +>>> tag(["第十四届全运会在西安举办", "三亚是一个美丽的城市"]) +[[('第十四届', 'm'), ('全运会', 'nz'), ('在', 'p'), ('西安', 'LOC'), ('举办', 'v')], [('三亚', 'LOC'), ('是', 'v'), ('一个', 'm'), ('美丽', 'a'), ('的', 'u'), ('城市', 'n')]] +``` + +#### 标签集合 + +| 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | +| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- | +| n | 普通名词 | f | 方位名词 | s | 处所名词 | t | 时间 | +| nr | 人名 | ns | 地名 | nt | 机构名 | nw | 作品名 | +| nz | 其他专名 | v | 普通动词 | vd | 动副词 | vn | 名动词 | +| a | 形容词 | ad | 副形词 | an | 名形词 | d | 副词 | +| m | 数量词 | q | 量词 | r | 代词 | p | 介词 | +| c | 连词 | u | 助词 | xc | 其他虚词 | w | 标点符号 | +| PER | 人名 | LOC | 地名 | ORG | 机构名 | TIME | 时间 | + +#### 自定义词典 + +你可以通过装载自定义词典来定制化分词和词性标注结果。词典文件每一行表示一个自定义item,可以由一个单词或者多个单词组成,单词后面可以添加自定义标签,格式为`item/tag`,如果不添加自定义标签,则使用模型默认标签`n`。 + +词典文件`user_dict.txt`示例: + +```text +赛里木湖/LAKE +高/a 山/n +海拔最高 +``` + +装载自定义词典及输出结果示例: + +```python +>>> from paddlenlp import Taskflow +>>> tag = Taskflow("pos_tagging") +>>> tag("赛里木湖是新疆海拔最高的高山湖泊") +[('赛里木湖', 'LOC'), ('是', 'v'), ('新疆', 'LOC'), ('海拔', 'n'), ('最高', 'a'), ('的', 'u'), ('高山', 'n'), ('湖泊', 'n')] +>>> my_tag = Taskflow("pos_tagging", user_dict="user_dict.txt") +>>> my_tag("赛里木湖是新疆海拔最高的高山湖泊") +[('赛里木湖', 'LAKE'), ('是', 'v'), ('新疆', 'LOC'), ('海拔最高', 'n'), ('的', 'u'), ('高', 'a'), ('山', 'n'), ('湖泊', 'n')] +``` +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `user_dict`:用户自定义词典文件,默认为None。 +* `task_path`:自定义任务路径,默认为None。 +
+ +### 命名实体识别 + +
+**最全中文实体标签**
+ +#### 支持两种模式 + +```python +# 精确模式(默认),基于百度解语,内置91种词性及专名类别标签 +>>> from paddlenlp import Taskflow +>>> ner = Taskflow("ner") +>>> ner("《孤女》是2010年九州出版社出版的小说,作者是余兼羽") +[('《', 'w'), ('孤女', '作品类_实体'), ('》', 'w'), ('是', '肯定词'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('的', '助词'), ('小说', '作品类_概念'), (',', 'w'), ('作者', '人物类_概念'), ('是', '肯定词'), ('余兼羽', '人物类_实体')] + +>>> ner = Taskflow("ner", entity_only=True) # 只返回实体/概念词 +>>> ner("《孤女》是2010年九州出版社出版的小说,作者是余兼羽") +[('孤女', '作品类_实体'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('小说', '作品类_概念'), ('作者', '人物类_概念'), ('余兼羽', '人物类_实体')] + +# 快速模式,基于百度LAC,内置24种词性和专名类别标签 +>>> from paddlenlp import Taskflow +>>> ner = Taskflow("ner", mode="fast") +>>> ner("三亚是一个美丽的城市") +[('三亚', 'LOC'), ('是', 'v'), ('一个', 'm'), ('美丽', 'a'), ('的', 'u'), ('城市', 'n')] +``` + +#### 批量样本输入,平均速度更快 +```python +>>> from paddlenlp import Taskflow +>>> ner = Taskflow("ner") +>>> ner(["热梅茶是一道以梅子为主要原料制作的茶饮", "《孤女》是2010年九州出版社出版的小说,作者是余兼羽"]) +[[('热梅茶', '饮食类_饮品'), ('是', '肯定词'), ('一道', '数量词'), ('以', '介词'), ('梅子', '饮食类'), ('为', '肯定词'), ('主要原料', '物体类'), ('制作', '场景事件'), ('的', '助词'), ('茶饮', '饮食类_饮品')], [('《', 'w'), ('孤女', '作品类_实体'), ('》', 'w'), ('是', '肯定词'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('的', '助词'), ('小说', '作品类_概念'), (',', 'w'), ('作者', '人物类_概念'), ('是', '肯定词'), ('余兼羽', '人物类_实体')]] +``` + +#### 实体标签说明 + +- 精确模式采用的标签集合 + +包含91种词性及专名类别标签,标签集合如下表: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| WordTag标签集合 | | | | | | |
+|:---|:---|:---|:---|:---|:---|:---|
+| 人物类_实体 | 组织机构类_军事组织机构_概念 | 文化类_制度政策协议 | 位置方位 | 术语类_医药学术语 | 信息资料_性别 | 否定词 |
+| 人物类_概念 | 组织机构类_医疗卫生机构 | 文化类_姓氏与人名 | 世界地区类 | 术语类_生物体 | 链接地址 | 数量词 |
+| 作品类_实体 | 组织机构类_医疗卫生机构_概念 | 生物类 | 世界地区类_国家 | 疾病损伤类 | 个性特征 | 数量词_序数词 |
+| 作品类_概念 | 组织机构类_教育组织机构 | 生物类_植物 | 世界地区类_区划概念 | 疾病损伤类_植物病虫害 | 感官特征 | 数量词_单位数量词 |
+| 组织机构类 | 组织机构类_教育组织机构_概念 | 生物类_动物 | 世界地区类_地理概念 | 宇宙类 | 场景事件 | 叹词 |
+| 组织机构类_概念 | 物体类 | 品牌名 | 饮食类 | 事件类 | 介词 | 拟声词 |
+| 组织机构类_企事业单位 | 物体类_概念 | 品牌名_品牌类型 | 饮食类_菜品 | 时间类 | 介词_方位介词 | 修饰词 |
+| 组织机构类_企事业单位_概念 | 物体类_兵器 | 场所类 | 饮食类_饮品 | 时间类_特殊日 | 助词 | 修饰词_性质 |
+| 组织机构类_国家机关 | 物体类_化学物质 | 场所类_概念 | 药物类 | 时间类_朝代 | 代词 | 修饰词_类型 |
+| 组织机构类_国家机关_概念 | 其他角色类 | 场所类_交通场所 | 药物类_中药 | 时间类_具体时间 | 连词 | 修饰词_化 |
+| 组织机构类_体育组织机构 | 文化类 | 场所类_交通场所_概念 | 术语类 | 时间类_时长 | 副词 | 外语单词 |
+| 组织机构类_体育组织机构_概念 | 文化类_语言文字 | 场所类_网上场所 | 术语类_术语类型 | 词汇用语 | 疑问词 | 汉语拼音 |
+| 组织机构类_军事组织机构 | 文化类_奖项赛事活动 | 场所类_网上场所_概念 | 术语类_符号指标类 | 信息资料 | 肯定词 | w(标点) |
+ +- 快速模式采用的标签集合 + +| 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | +| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- | +| n | 普通名词 | f | 方位名词 | s | 处所名词 | t | 时间 | +| nr | 人名 | ns | 地名 | nt | 机构名 | nw | 作品名 | +| nz | 其他专名 | v | 普通动词 | vd | 动副词 | vn | 名动词 | +| a | 形容词 | ad | 副形词 | an | 名形词 | d | 副词 | +| m | 数量词 | q | 量词 | r | 代词 | p | 介词 | +| c | 连词 | u | 助词 | xc | 其他虚词 | w | 标点符号 | +| PER | 人名 | LOC | 地名 | ORG | 机构名 | TIME | 时间 | + +#### 自定义词典 + +你可以通过装载自定义词典来定制化命名实体识别结果。词典文件每一行表示一个自定义item,可以由一个term或者多个term组成,term后面可以添加自定义标签,格式为`item/tag`,如果不添加自定义标签,则使用模型默认标签。 + +词典文件`user_dict.txt`示例: + +```text +长津湖/电影类_实体 +收/词汇用语 尾/术语类 +最 大 +海外票仓 +``` + +以"《长津湖》收尾,北美是最大海外票仓"为例,原本的输出结果为: + +```text +[('《', 'w'), ('长津湖', '作品类_实体'), ('》', 'w'), ('收尾', '场景事件'), (',', 'w'), ('北美', '世界地区类'), ('是', '肯定词'), ('最大', '修饰词'), ('海外', '场所类'), ('票仓', '词汇用语')] +``` + +装载自定义词典及输出结果示例: + +```python +>>> from paddlenlp import Taskflow + +>>> my_ner = Taskflow("ner", user_dict="user_dict.txt") +>>> my_ner("《长津湖》收尾,北美是最大海外票仓") +[('《', 'w'), ('长津湖', '电影类_实体'), ('》', 'w'), ('收', '词汇用语'), ('尾', '术语类'), (',', 'w'), ('北美', '世界地区类'), ('是', '肯定词'), ('最', '修饰词'), ('大', '修饰词'), ('海外票仓', '场所类')] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `user_dict`:用户自定义词典文件,默认为None。 +* `task_path`:自定义任务路径,默认为None。 +* `entity_only`:只返回实体/概念词及其对应标签。 +
+ + +### 依存句法分析 +
+**基于最大规模中文依存句法树库研发的DDParser**
+ +#### 支持多种形式输入 + +未分词输入: + +```python +>>> from paddlenlp import Taskflow +>>> ddp = Taskflow("dependency_parsing") +>>> ddp("2月8日谷爱凌夺得北京冬奥会第三金") +[{'word': ['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金'], 'head': [3, 3, 0, 5, 3], 'deprel': ['ADV', 'SBV', 'HED', 'ATT', 'VOB']}] + +``` + +使用分词结果来输入: + +```python +>>> ddp = Taskflow("dependency_parsing") +>>> ddp.from_segments([['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金']]) +[{'word': ['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金'], 'head': [3, 3, 0, 5, 3], 'deprel': ['ADV', 'SBV', 'HED', 'ATT', 'VOB']}] +``` + +#### 批量样本输入,平均速度更快 + +```python +>>> from paddlenlp import Taskflow +>>> ddp(["2月8日谷爱凌夺得北京冬奥会第三金", "他送了一本书"]) +[{'word': ['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金'], 'head': [3, 3, 0, 5, 3], 'deprel': ['ADV', 'SBV', 'HED', 'ATT', 'VOB']}, {'word': ['他', '送', '了', '一本', '书'], 'head': [2, 0, 2, 5, 2], 'deprel': ['SBV', 'HED', 'MT', 'ATT', 'VOB']}] +``` + +#### 多种模型选择,满足精度、速度需求 + +使用ERNIE 1.0进行预测 + +```python +>>> ddp = Taskflow("dependency_parsing", model="ddparser-ernie-1.0") +>>> ddp("2月8日谷爱凌夺得北京冬奥会第三金") +[{'word': ['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金'], 'head': [3, 3, 0, 5, 3], 'deprel': ['ADV', 'SBV', 'HED', 'ATT', 'VOB']}] +``` + +除ERNIE 1.0外,还可使用ERNIE-Gram预训练模型,其中`model=ddparser`(基于LSTM Encoder)速度最快,`model=ddparser-ernie-gram-zh`和`model=ddparser-ernie-1.0`效果更优(两者效果相当)。 + +#### 输出方式 + +输出概率值和词性标签: + +```python +>>> ddp = Taskflow("dependency_parsing", prob=True, use_pos=True) +>>> ddp("2月8日谷爱凌夺得北京冬奥会第三金") +[{'word': ['2月8日', '谷爱凌', '夺得', '北京冬奥会', '第三金'], 'head': [3, 3, 0, 5, 3], 'deprel': ['ADV', 'SBV', 'HED', 'ATT', 'VOB'], 'postag': ['TIME', 'PER', 'v', 'ORG', 'n'], 'prob': [0.97, 1.0, 1.0, 0.99, 0.99]}] +``` + +依存关系可视化 + +```python +>>> from paddlenlp import Taskflow +>>> ddp = Taskflow("dependency_parsing", return_visual=True) +>>> result = ddp("2月8日谷爱凌夺得北京冬奥会第三金")[0]['visual'] +>>> import cv2 +>>> cv2.imwrite('test.png', result) +``` + +
+(此处为依存句法树可视化结果示例图,对应上文保存的 test.png)
+ +#### 依存句法分析标注关系集合 + +| Label | 关系类型 | 说明 | 示例 | +| :---: | :--------: | :----------------------- | :----------------------------- | +| SBV | 主谓关系 | 主语与谓词间的关系 | 他送了一本书(他<--送) | +| VOB | 动宾关系 | 宾语与谓词间的关系 | 他送了一本书(送-->书) | +| POB | 介宾关系 | 介词与宾语间的关系 | 我把书卖了(把-->书) | +| ADV | 状中关系 | 状语与中心词间的关系 | 我昨天买书了(昨天<--买) | +| CMP | 动补关系 | 补语与中心词间的关系 | 我都吃完了(吃-->完) | +| ATT | 定中关系 | 定语与中心词间的关系 | 他送了一本书(一本<--书) | +| F | 方位关系 | 方位词与中心词的关系 | 在公园里玩耍(公园-->里) | +| COO | 并列关系 | 同类型词语间关系 | 叔叔阿姨(叔叔-->阿姨) | +| DBL | 兼语结构 | 主谓短语做宾语的结构 | 他请我吃饭(请-->我,请-->吃饭) | +| DOB | 双宾语结构 | 谓语后出现两个宾语 | 他送我一本书(送-->我,送-->书) | +| VV | 连谓结构 | 同主语的多个谓词间关系 | 他外出吃饭(外出-->吃饭) | +| IC | 子句结构 | 两个结构独立或关联的单句 | 你好,书店怎么走?(你好<--走) | +| MT | 虚词成分 | 虚词与中心词间的关系 | 他送了一本书(送-->了) | +| HED | 核心关系 | 指整个句子的核心 | | + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `model`:选择任务使用的模型,可选有`ddparser`,`ddparser-ernie-1.0`和`ddparser-ernie-gram-zh`。 +* `tree`:确保输出结果是正确的依存句法树,默认为True。 +* `prob`:是否输出每个弧对应的概率值,默认为False。 +* `use_pos`:是否返回词性标签,默认为False。 +* `use_cuda`:是否使用GPU进行切词,默认为False。 +* `return_visual`:是否返回句法树的可视化结果,默认为False。 +* `task_path`:自定义任务路径,默认为None。 +
+ +### 信息抽取 +
+**适配多场景的开放域通用信息抽取工具**
+ +开放域信息抽取是信息抽取的一种全新范式,主要思想是减少人工参与,利用单一模型支持多种类型的开放抽取任务,用户可以使用自然语言自定义抽取目标,在实体、关系类别等未定义的情况下抽取输入文本中的信息片段。 + +#### 实体抽取 + + 命名实体识别(Named Entity Recognition,简称NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。 + + - 例如抽取的目标实体类型是"时间"、"选手"和"赛事名称", schema构造如下: + + ```text + ['时间', '选手', '赛事名称'] + ``` + + 调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction + >>> ie = Taskflow('information_extraction', schema=schema) + >>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")) # Better print results using pprint + [{'时间': [{'end': 6, + 'probability': 0.9857378532924486, + 'start': 0, + 'text': '2月8日上午'}], + '赛事名称': [{'end': 23, + 'probability': 0.8503089953268272, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + '选手': [{'end': 31, + 'probability': 0.8981548639781138, + 'start': 28, + 'text': '谷爱凌'}]}] + ``` + + - 例如抽取的目标实体类型是"肿瘤的大小"、"肿瘤的个数"、"肝癌级别"和"脉管内癌栓分级", schema构造如下: + + ```text + ['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级'] + ``` + + 在上例中我们已经实例化了一个`Taskflow`对象,这里可以通过`set_schema`方法重置抽取目标。 + + 调用示例: + + ```python + >>> schema = ['肿瘤的大小', '肿瘤的个数', '肝癌级别', '脉管内癌栓分级'] + >>> ie.set_schema(schema) + >>> pprint(ie("(右肝肿瘤)肝细胞性肝癌(II-III级,梁索型和假腺管型),肿瘤包膜不完整,紧邻肝被膜,侵及周围肝组织,未见脉管内癌栓(MVI分级:M0级)及卫星子灶形成。(肿物1个,大小4.2×4.0×2.8cm)。")) + [{'肝癌级别': [{'end': 20, + 'probability': 0.9243267447402701, + 'start': 13, + 'text': 'II-III级'}], + '肿瘤的个数': [{'end': 84, + 'probability': 0.7538413804059623, + 'start': 82, + 'text': '1个'}], + '肿瘤的大小': [{'end': 100, + 'probability': 0.8341128043459491, + 'start': 87, + 'text': '4.2×4.0×2.8cm'}], + '脉管内癌栓分级': [{'end': 70, + 'probability': 0.9083292325934664, + 'start': 67, + 'text': 'M0级'}]}] + ``` + + - 例如抽取的目标实体类型是"person"和"organization",schema构造如下: + + ```text + ['person', 'organization'] + ``` + + 英文模型调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + >>> schema = ['Person', 'Organization'] + >>> ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en') + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Organization': [{'end': 53, + 'probability': 0.9985840259877357, + 'start': 48, + 'text': 'Apple'}], + 'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'start': 9, + 'text': 'Steve'}]}] + ``` + +#### 关系抽取 + + 关系抽取(Relation Extraction,简称RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。 + + - 例如以"竞赛名称"作为抽取主体,抽取关系类型为"主办方"、"承办方"和"已举办次数", schema构造如下: + + ```text + { + '竞赛名称': [ + '主办方', + '承办方', + '已举办次数' + ] + } + ``` + + 调用示例: + + ```python + >>> schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']} # Define the schema for relation extraction + >>> ie.set_schema(schema) # Reset schema + >>> pprint(ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办,已连续举办4届,成为全球最热门的中文NLP赛事之一。')) + [{'竞赛名称': [{'end': 13, + 'probability': 0.7825402622754041, + 'relations': {'主办方': [{'end': 22, + 'probability': 0.8421710521379353, + 'start': 14, + 'text': '中国中文信息学会'}, + {'end': 30, + 'probability': 0.7580801847701935, + 'start': 23, + 'text': '中国计算机学会'}], + '已举办次数': [{'end': 82, + 'probability': 0.4671295049136148, + 'start': 80, + 'text': '4届'}], + '承办方': [{'end': 39, + 'probability': 0.8292706618236352, + 'start': 35, + 'text': '百度公司'}, + {'end': 72, + 'probability': 0.6193477885474685, + 'start': 56, + 'text': '中国计算机学会自然语言处理专委会'}, + {'end': 55, + 'probability': 0.7000497331473241, + 'start': 40, + 'text': '中国中文信息学会评测工作委员会'}]}, + 'start': 0, + 'text': '2022语言与智能技术竞赛'}]}] + 
``` + + - 例如以"person"作为抽取主体,抽取关系类型为"Company"和"Position", schema构造如下: + + ```text + { + 'Person': [ + 'Company', + 'Position' + ] + } + ``` + + 英文模型调用示例: + + ```python + >>> schema = [{'Person': ['Company', 'Position']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en('In 1997, Steve was excited to become the CEO of Apple.')) + [{'Person': [{'end': 14, + 'probability': 0.999631971804547, + 'relations': {'Company': [{'end': 53, + 'probability': 0.9960158209451642, + 'start': 48, + 'text': 'Apple'}], + 'Position': [{'end': 44, + 'probability': 0.8871063806420736, + 'start': 41, + 'text': 'CEO'}]}, + 'start': 9, + 'text': 'Steve'}]}] + ``` + +#### 事件抽取 + + 事件抽取 (Event Extraction, 简称EE),是指从自然语言文本中抽取预定义的事件触发词(Trigger)和事件论元(Argument),组合为相应的事件结构化信息。 + + - 例如抽取的目标是"地震"事件的"地震强度"、"时间"、"震中位置"和"震源深度"这些信息,schema构造如下: + + ```text + { + '地震触发词': [ + '地震强度', + '时间', + '震中位置', + '震源深度' + ] + } + ``` + + 触发词的格式统一为`触发词`或``XX触发词`,`XX`表示具体事件类型,上例中的事件类型是`地震`,则对应触发词为`地震触发词`。 + + 调用示例: + + ```python + >>> schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # Define the schema for event extraction + >>> ie.set_schema(schema) # Reset schema + >>> ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。') + [{'地震触发词': [{'text': '地震', 'start': 56, 'end': 58, 'probability': 0.9987181623528585, 'relations': {'地震强度': [{'text': '3.5级', 'start': 52, 'end': 56, 'probability': 0.9962985320905915}], '时间': [{'text': '5月16日06时08分', 'start': 11, 'end': 22, 'probability': 0.9882578028575182}], '震中位置': [{'text': '云南临沧市凤庆县(北纬24.34度,东经99.98度)', 'start': 23, 'end': 50, 'probability': 0.8551415716584501}], '震源深度': [{'text': '10千米', 'start': 63, 'end': 67, 'probability': 0.999158304648045}]}}]}] + ``` + + - 英文模型zero-shot方式**暂不支持事件抽取**,如有英文事件抽取相关语料请进行训练定制。 + +#### 评论观点抽取 + + 评论观点抽取,是指抽取文本中包含的评价维度、观点词。 + + - 例如抽取的目标是文本中包含的评价维度及其对应的观点词和情感倾向,schema构造如下: + + ```text + { + '评价维度': [ + '观点词', + '情感倾向[正向,负向]' + ] + } + ``` + + 调用示例: + + ```python + >>> schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']} # Define the schema for opinion extraction + >>> ie.set_schema(schema) # Reset schema + >>> pprint(ie("店面干净,很清静,服务员服务热情,性价比很高,发现收银台有排队")) # Better print results using pprint + [{'评价维度': [{'end': 20, + 'probability': 0.9817040258681473, + 'relations': {'情感倾向[正向,负向]': [{'probability': 0.9966142505350533, + 'text': '正向'}], + '观点词': [{'end': 22, + 'probability': 0.957396472711558, + 'start': 21, + 'text': '高'}]}, + 'start': 17, + 'text': '性价比'}, + {'end': 2, + 'probability': 0.9696849569741168, + 'relations': {'情感倾向[正向,负向]': [{'probability': 0.9982153274927796, + 'text': '正向'}], + '观点词': [{'end': 4, + 'probability': 0.9945318044652538, + 'start': 2, + 'text': '干净'}]}, + 'start': 0, + 'text': '店面'}]}] + ``` + + - 英文模型schema构造如下: + + ```text + { + 'Aspect': [ + 'Opinion', + 'Sentiment classification [negative, positive]' + ] + } + ``` + + 英文模型调用示例: + + ```python + >>> schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}] + >>> ie_en.set_schema(schema) + >>> pprint(ie_en("The teacher is very nice.")) + [{'Aspect': [{'end': 11, + 'probability': 0.4301476415932193, + 'relations': {'Opinion': [{'end': 24, + 'probability': 0.9072940447883724, + 'start': 15, + 'text': 'very nice'}], + 'Sentiment classification [negative, positive]': [{'probability': 0.9998571920670685, + 'text': 'positive'}]}, + 'start': 4, + 'text': 'teacher'}]}] + ``` + +#### 情感分类 + + - 句子级情感倾向分类,即判断句子的情感倾向是“正向”还是“负向”,schema构造如下: + + ```text + '情感倾向[正向,负向]' + ``` + + 调用示例: + + ```python + >>> schema = '情感倾向[正向,负向]' # Define the schema for 
sentence-level sentiment classification + >>> ie.set_schema(schema) # Reset schema + >>> ie('这个产品用起来真的很流畅,我非常喜欢') + [{'情感倾向[正向,负向]': [{'text': '正向', 'probability': 0.9988661643929895}]}] + ``` + + 英文模型schema构造如下: + + ```text + '情感倾向[正向,负向]' + ``` + + 英文模型调用示例: + + ```python + >>> schema = 'Sentiment classification [negative, positive]' + >>> ie_en.set_schema(schema) + >>> ie_en('I am sorry but this is the worst film I have ever seen in my life.') + [{'Sentiment classification [negative, positive]': [{'text': 'negative', 'probability': 0.9998415771287057}]}] + ``` + +#### 跨任务抽取 + + - 例如在法律场景同时对文本进行实体抽取和关系抽取,schema可按照如下方式进行构造: + + ```text + [ + "法院", + { + "原告": "委托代理人" + }, + { + "被告": "委托代理人" + } + ] + ``` + + 调用示例: + + ```python + >>> schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}] + >>> ie.set_schema(schema) + >>> pprint(ie("北京市海淀区人民法院\n民事判决书\n(199x)建初字第xxx号\n原告:张三。\n委托代理人李四,北京市 A律师事务所律师。\n被告:B公司,法定代表人王五,开发公司总经理。\n委托代理人赵六,北京市 C律师事务所律师。")) # Better print results using pprint + [{'原告': [{'end': 37, + 'probability': 0.9949814024296764, + 'relations': {'委托代理人': [{'end': 46, + 'probability': 0.7956844697990384, + 'start': 44, + 'text': '李四'}]}, + 'start': 35, + 'text': '张三'}], + '法院': [{'end': 10, + 'probability': 0.9221074192336651, + 'start': 0, + 'text': '北京市海淀区人民法院'}], + '被告': [{'end': 67, + 'probability': 0.8437349536631089, + 'relations': {'委托代理人': [{'end': 92, + 'probability': 0.7267121388225029, + 'start': 90, + 'text': '赵六'}]}, + 'start': 64, + 'text': 'B公司'}]}] + ``` + +#### 模型选择 + +- 多模型选择,满足精度、速度要求 + + | 模型 | 结构 | 语言 | + | :---: | :--------: | :--------: | + | `uie-base` (默认)| 12-layers, 768-hidden, 12-heads | 中文 | + | `uie-base-en` | 12-layers, 768-hidden, 12-heads | 英文 | + | `uie-medical-base` | 12-layers, 768-hidden, 12-heads | 中文 | + | `uie-medium`| 6-layers, 768-hidden, 12-heads | 中文 | + | `uie-mini`| 6-layers, 384-hidden, 12-heads | 中文 | + | `uie-micro`| 4-layers, 384-hidden, 12-heads | 中文 | + | `uie-nano`| 4-layers, 312-hidden, 12-heads | 中文 | + | `uie-m-large`| 24-layers, 1024-hidden, 16-heads | 中、英文 | + | `uie-m-base`| 12-layers, 768-hidden, 12-heads | 中、英文 | + +- `uie-nano`调用示例: + + ```python + >>> from paddlenlp import Taskflow + + >>> schema = ['时间', '选手', '赛事名称'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-nano") + >>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!") + [{'时间': [{'text': '2月8日上午', 'start': 0, 'end': 6, 'probability': 0.6513581678349247}], '选手': [{'text': '谷爱凌', 'start': 28, 'end': 31, 'probability': 0.9819330659468051}], '赛事名称': [{'text': '北京冬奥会自由式滑雪女子大跳台决赛', 'start': 6, 'end': 23, 'probability': 0.4908131110420939}]}] + ``` + +- `uie-m-base`和`uie-m-large`支持中英文混合抽取,调用示例: + + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> schema = ['Time', 'Player', 'Competition', 'Score'] + >>> ie = Taskflow('information_extraction', schema=schema, model="uie-m-base", schema_lang="en") + >>> pprint(ie(["2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!", "Rafael Nadal wins French Open Final!"])) + [{'Competition': [{'end': 23, + 'probability': 0.9373889907291257, + 'start': 6, + 'text': '北京冬奥会自由式滑雪女子大跳台决赛'}], + 'Player': [{'end': 31, + 'probability': 0.6981119555336441, + 'start': 28, + 'text': '谷爱凌'}], + 'Score': [{'end': 39, + 'probability': 0.9888507878270296, + 'start': 32, + 'text': '188.25分'}], + 'Time': [{'end': 6, + 'probability': 0.9784080036931151, + 'start': 0, + 'text': '2月8日上午'}]}, + {'Competition': [{'end': 35, + 'probability': 0.9851549932171295, + 'start': 18, + 'text': 'French Open 
Final'}], + 'Player': [{'end': 12, + 'probability': 0.9379371275888104, + 'start': 0, + 'text': 'Rafael Nadal'}]}] + ``` + +#### 定制训练 + +对于简单的抽取目标可以直接使用```paddlenlp.Taskflow```实现零样本(zero-shot)抽取,对于细分场景我们推荐使用[定制训练](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie)(标注少量数据进行模型微调)以进一步提升效果。 + +我们在互联网、医疗、金融三大垂类自建测试集上进行了实验: + + +
+| 模型 | 金融 0-shot | 金融 5-shot | 医疗 0-shot | 医疗 5-shot | 互联网 0-shot | 互联网 5-shot |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| uie-base (12L768H) | 46.43 | 70.92 | 71.83 | 85.72 | 78.33 | 81.86 |
+| uie-medium (6L768H) | 41.11 | 64.53 | 65.40 | 75.72 | 78.32 | 79.68 |
+| uie-mini (6L384H) | 37.04 | 64.65 | 60.50 | 78.36 | 72.09 | 76.38 |
+| uie-micro (4L384H) | 37.53 | 62.11 | 57.04 | 75.92 | 66.00 | 70.22 |
+| uie-nano (4L312H) | 38.94 | 66.83 | 48.29 | 76.74 | 62.86 | 72.35 |
+| uie-m-large (24L1024H) | 49.35 | 74.55 | 70.50 | 92.66 | 78.49 | 83.02 |
+| uie-m-base (12L768H) | 38.46 | 74.31 | 63.37 | 87.32 | 76.27 | 80.13 |
+ +0-shot表示无训练数据直接通过```paddlenlp.Taskflow```进行预测,5-shot表示每个类别包含5条标注数据进行模型微调。**实验表明UIE在垂类场景可以通过少量数据(few-shot)进一步提升效果**。 + +#### 可配置参数说明 + +* `schema`:定义任务抽取目标,可参考开箱即用中不同任务的调用示例进行配置。 +* `schema_lang`:设置schema的语言,默认为`zh`, 可选有`zh`和`en`。因为中英schema的构造有所不同,因此需要指定schema的语言。该参数只对`uie-m-base`和`uie-m-large`模型有效。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `model`:选择任务使用的模型,默认为`uie-base`,可选有`uie-base`, `uie-medium`, `uie-mini`, `uie-micro`, `uie-nano`, `uie-medical-base`, `uie-base-en`。 +* `position_prob`:模型对于span的起始位置/终止位置的结果概率0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5,span的最终概率输出为起始位置概率和终止位置概率的乘积。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快。如果选择`fp16`,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖(主要是**确保安装onnxruntime-gpu**)。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 +
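+在上述参数基础上,下面给出一个组合设置 `position_prob`、`precision` 与 `batch_size` 的调用示意(取值仅为演示用的假设,实际请结合抽取效果与硬件环境调整;开启 `fp16` 前需满足上文所述的驱动与依赖要求):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> schema = ['时间', '选手', '赛事名称']
+>>> # 示例取值:将阈值调高到0.6、开启fp16推理、批大小设为4
+>>> ie = Taskflow('information_extraction', schema=schema, position_prob=0.6, precision='fp16', batch_size=4)
+>>> ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!")
+```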
+ +### 解语知识标注 +
+**覆盖所有中文词汇的知识标注工具**
+ +#### 词类知识标注 + +```python +>>> from paddlenlp import Taskflow +>>> wordtag = Taskflow("knowledge_mining") +>>> wordtag("《孤女》是2010年九州出版社出版的小说,作者是余兼羽") +[{'text': '《孤女》是2010年九州出版社出版的小说,作者是余兼羽', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '孤女', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 2}, {'item': '》', 'offset': 3, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 4, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '2010年', 'offset': 5, 'wordtag_label': '时间类', 'length': 5, 'termid': '时间阶段_cb_2010年'}, {'item': '九州出版社', 'offset': 10, 'wordtag_label': '组织机构类', 'length': 5, 'termid': '组织机构_eb_九州出版社'}, {'item': '出版', 'offset': 15, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_出版'}, {'item': '的', 'offset': 17, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '小说', 'offset': 18, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '小说_cb_小说'}, {'item': ',', 'offset': 20, 'wordtag_label': 'w', 'length': 1}, {'item': '作者', 'offset': 21, 'wordtag_label': '人物类_概念', 'length': 2, 'termid': '人物_cb_作者'}, {'item': '是', 'offset': 23, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '余兼羽', 'offset': 24, 'wordtag_label': '人物类_实体', 'length': 3}]}] +``` + +**可配置参数说明:** +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `linking`:实现基于词类的linking,默认为True。 +* `task_path`:自定义任务路径,默认为None。 +* `user_dict`:用户自定义词典文件,默认为None。 + + +知识挖掘-词类知识标注任务共包含91种词性及专名类别标签,标签集合如下表: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| WordTag标签集合 | | | | | | |
+|:---|:---|:---|:---|:---|:---|:---|
+| 人物类_实体 | 组织机构类_军事组织机构_概念 | 文化类_制度政策协议 | 位置方位 | 术语类_医药学术语 | 信息资料_性别 | 否定词 |
+| 人物类_概念 | 组织机构类_医疗卫生机构 | 文化类_姓氏与人名 | 世界地区类 | 术语类_生物体 | 链接地址 | 数量词 |
+| 作品类_实体 | 组织机构类_医疗卫生机构_概念 | 生物类 | 世界地区类_国家 | 疾病损伤类 | 个性特征 | 数量词_序数词 |
+| 作品类_概念 | 组织机构类_教育组织机构 | 生物类_植物 | 世界地区类_区划概念 | 疾病损伤类_植物病虫害 | 感官特征 | 数量词_单位数量词 |
+| 组织机构类 | 组织机构类_教育组织机构_概念 | 生物类_动物 | 世界地区类_地理概念 | 宇宙类 | 场景事件 | 叹词 |
+| 组织机构类_概念 | 物体类 | 品牌名 | 饮食类 | 事件类 | 介词 | 拟声词 |
+| 组织机构类_企事业单位 | 物体类_概念 | 品牌名_品牌类型 | 饮食类_菜品 | 时间类 | 介词_方位介词 | 修饰词 |
+| 组织机构类_企事业单位_概念 | 物体类_兵器 | 场所类 | 饮食类_饮品 | 时间类_特殊日 | 助词 | 修饰词_性质 |
+| 组织机构类_国家机关 | 物体类_化学物质 | 场所类_概念 | 药物类 | 时间类_朝代 | 代词 | 修饰词_类型 |
+| 组织机构类_国家机关_概念 | 其他角色类 | 场所类_交通场所 | 药物类_中药 | 时间类_具体时间 | 连词 | 修饰词_化 |
+| 组织机构类_体育组织机构 | 文化类 | 场所类_交通场所_概念 | 术语类 | 时间类_时长 | 副词 | 外语单词 |
+| 组织机构类_体育组织机构_概念 | 文化类_语言文字 | 场所类_网上场所 | 术语类_术语类型 | 词汇用语 | 疑问词 | 汉语拼音 |
+| 组织机构类_军事组织机构 | 文化类_奖项赛事活动 | 场所类_网上场所_概念 | 术语类_符号指标类 | 信息资料 | 肯定词 | w(标点) |
+ +#### 知识模板信息抽取 +```python +>>> from paddlenlp import Taskflow +>>> wordtag_ie = Taskflow("knowledge_mining", with_ie=True) +>>> wordtag_ie('《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。') +[[{'text': '《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '忘了所有', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 4}, {'item': '》', 'offset': 5, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 6, 'wordtag_label': '肯定词', 'length': 1}, {'item': '一首', 'offset': 7, 'wordtag_label': '数量词_单位数量词', 'length': 2}, {'item': '由', 'offset': 9, 'wordtag_label': '介词', 'length': 1}, {'item': '王杰', 'offset': 10, 'wordtag_label': '人物类_实体', 'length': 2}, {'item': '作词', 'offset': 12, 'wordtag_label': '场景事件', 'length': 2}, {'item': '、', 'offset': 14, 'wordtag_label': 'w', 'length': 1}, {'item': '作曲', 'offset': 15, 'wordtag_label': '场景事件', 'length': 2}, {'item': '并', 'offset': 17, 'wordtag_label': '连词', 'length': 1}, {'item': '演唱', 'offset': 18, 'wordtag_label': '场景事件', 'length': 2}, {'item': '的', 'offset': 20, 'wordtag_label': '助词', 'length': 1}, {'item': '歌曲', 'offset': 21, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': ',', 'offset': 23, 'wordtag_label': 'w', 'length': 1}, {'item': '收录', 'offset': 24, 'wordtag_label': '场景事件', 'length': 2}, {'item': '在', 'offset': 26, 'wordtag_label': '介词', 'length': 1}, {'item': '专辑', 'offset': 27, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': '同名', 'offset': 29, 'wordtag_label': '场景事件', 'length': 2}, {'item': '《', 'offset': 31, 'wordtag_label': 'w', 'length': 1}, {'item': '忘了所有', 'offset': 32, 'wordtag_label': '作品类_实体', 'length': 4}, {'item': '》', 'offset': 36, 'wordtag_label': 'w', 'length': 1}, {'item': '中', 'offset': 37, 'wordtag_label': '词汇用语', 'length': 1}, {'item': ',', 'offset': 38, 'wordtag_label': 'w', 'length': 1}, {'item': '由', 'offset': 39, 'wordtag_label': '介词', 'length': 1}, {'item': '波丽佳音', 'offset': 40, 'wordtag_label': '人物类_实体', 'length': 4}, {'item': '唱片', 'offset': 44, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': '于', 'offset': 46, 'wordtag_label': '介词', 'length': 1}, {'item': '1996年08月31日', 'offset': 47, 'wordtag_label': '时间类_具体时间', 'length': 11}, {'item': '发行', 'offset': 58, 'wordtag_label': '场景事件', 'length': 2}, {'item': '。', 'offset': 60, 'wordtag_label': 'w', 'length': 1}]}], [[{'HEAD_ROLE': {'item': '王杰', 'offset': 10, 'type': '人物类_实体'}, 'TAIL_ROLE': [{'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}], 'GROUP': '创作', 'TRIG': [{'item': '作词', 'offset': 12}, {'item': '作曲', 'offset': 15}, {'item': '演唱', 'offset': 18}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '王杰', 'offset': 10, 'type': '人物类_实体'}], 'GROUP': '创作者', 'SRC': 'HTG', 'TRIG': [{'item': '作词', 'offset': 12}, {'item': '作曲', 'offset': 15}, {'item': '演唱', 'offset': 18}]}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '歌曲', 'offset': 21, 'type': '作品类_概念'}], 'GROUP': '类型', 'SRC': 'TAIL'}, {'HEAD_ROLE': {'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}, 'TAIL_ROLE': [{'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}], 'GROUP': '收录', 'TRIG': [{'item': '收录', 'offset': 24}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}], 'GROUP': '收录于', 'SRC': 'HGT', 'TRIG': [{'item': '收录', 'offset': 24}]}, {'HEAD_ROLE': {'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}, 'TAIL_ROLE': [{'item': '王杰', 
'type': '人物类_实体', 'offset': 10}], 'GROUP': '创作者', 'TRIG': [{'item': '专辑', 'offset': 27}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '王杰', 'type': '人物类_实体', 'offset': 10}, 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}], 'GROUP': '创作', 'SRC': 'HGT', 'TRIG': [{'item': '专辑', 'offset': 27}]}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 32}, 'TAIL_ROLE': [{'item': '唱片', 'offset': 44, 'type': '作品类_概念'}], 'GROUP': '类型', 'SRC': 'TAIL'}]]] + +``` + +**自定义抽取的schema** + +``` python +>>> from pprint import pprint +>>> schema = [ + { + "head_role": "作品类_实体", #头实体词类 + "group": "创作者", #关系名 + "tail_role": [ + { + "main": [ + "人物类_实体" #尾实体词类 + ], + "support": [] #相关词类,可作为该关系的补充,不可作为尾实体独立存在 + } + ], + "trig_word": [ + "作词", #触发词,对于没有触发词,而是由头尾实体直接触发的关系,可为null + ], + "trig_type": "trigger", #trigger表明由触发词触发,tail表明为尾实体触发 + "reverse": False, #是否为反向配置,即尾实体实际是头,头实体实际是尾 + "trig_direction": "B", #触发P的方向,表示在自然表达中,尾实体在触发词的哪一边,L为左,R为右,B为双向都有可能,默认为B + "rel_group": "创作" #对应的反关系,即头尾实体对调后,对应的关系,用于逻辑推断 + }] +>>> wordtag_ie.set_schema(schema) +>>> pprint(wordtag_ie('《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。')[1]) +[[{'GROUP': '创作', + 'HEAD_ROLE': {'item': '王杰', 'offset': 10, 'type': '人物类_实体'}, + 'SRC': 'REVERSE', + 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 1, 'type': '作品类_实体'}], + 'TRIG': [{'item': '作词', 'offset': 12}]}, + {'GROUP': '创作者', + 'HEAD_ROLE': {'item': '忘了所有', 'offset': 1, 'type': '作品类_实体'}, + 'SRC': 'HTG', + 'TAIL_ROLE': [{'item': '王杰', 'offset': 10, 'type': '人物类_实体'}], + 'TRIG': [{'item': '作词', 'offset': 12}]}]] +``` +具体的WordTag-IE信息抽取的功能可以见[WordTag-IE具体介绍](../../examples/text_to_knowledge/wordtag-ie/README.md) . + + +#### 名词短语标注 +```python +>>> from paddlenlp import Taskflow +>>> nptag = Taskflow("knowledge_mining", model="nptag") +>>> nptag("糖醋排骨") +[{'text': '糖醋排骨', 'label': '菜品'}] + +>>> nptag(["糖醋排骨", "红曲霉菌"]) +[{'text': '糖醋排骨', 'label': '菜品'}, {'text': '红曲霉菌', 'label': '微生物'}] + +# 使用`linking`输出粗粒度类别标签`category`,即WordTag的词汇标签。 +>>> nptag = Taskflow("knowledge_mining", model="nptag", linking=True) +>>> nptag(["糖醋排骨", "红曲霉菌"]) +[{'text': '糖醋排骨', 'label': '菜品', 'category': '饮食类_菜品'}, {'text': '红曲霉菌', 'label': '微生物', 'category': '生物类_微生物'}] +``` +**可配置参数说明:** +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `max_seq_len`:最大序列长度,默认为64。 +* `linking`:实现与WordTag类别标签的linking,默认为False。 +* `task_path`:自定义任务路径,默认为None。 + + +
+ +### 文本纠错 +
+**融合拼音特征的端到端文本纠错模型ERNIE-CSC**
+ + +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +>>> corrector = Taskflow("text_correction") +# 单条输入 +>>> corrector('遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。') +[{'source': '遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。', 'target': '遇到逆境时,我们必须勇于面对,而且要愈挫愈勇。', 'errors': [{'position': 3, 'correction': {'竟': '境'}}]}] + +# 批量预测 +>>> corrector(['遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。', '人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。']) +[{'source': '遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇。', 'target': '遇到逆境时,我们必须勇于面对,而且要愈挫愈勇。', 'errors': [{'position': 3, 'correction': {'竟': '境'}}]}, {'source': '人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。', 'target': '人生就是如此,经过磨练才能让自己更加茁壮,才能使自己更加乐观。', 'errors': [{'position': 18, 'correction': {'拙': '茁'}}]}] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `task_path`:自定义任务路径,默认为None。 +
+ +### 文本相似度 +
+**基于百万量级Dureader Retrieval数据集训练RocketQA并达到前沿文本相似效果**
+ +#### 单条输入 + ++ Query-Query的相似度匹配 + +```python +>>> from paddlenlp import Taskflow +>>> similarity = Taskflow("text_similarity") +>>> similarity([["春天适合种什么花?", "春天适合种什么菜?"]]) +[{'text1': '春天适合种什么花?', 'text2': '春天适合种什么菜?', 'similarity': 0.83402544}] +``` + ++ Query-Passage的相似度匹配 + +```python +>>> similarity = Taskflow("text_similarity", model='rocketqa-base-cross-encoder') +>>> similarity([["国家法定节假日共多少天?", "现在法定假日是元旦1天,春节3天,清明节1天,五一劳动节1天,端午节1天,国庆节3天,中秋节1天,共计11天。法定休息日每年52个周末总共104天。合到一起总计115天。"]]) +[{'text1': '国家法定节假日共多少天?', 'text2': '现在法定假日是元旦1天,春节3天,清明节1天,五一劳动节1天,端午节1天,国庆节3天,中秋节1天,共计11天。法定休息日每年52个周末总共104天。合到一起总计115天。', 'similarity': 0.7174624800682068}] +``` + +#### 批量样本输入,平均速度更快 + ++ Query-Query的相似度匹配 + +```python +>>> from paddlenlp import Taskflow +>>> similarity = Taskflow("text_similarity") +>>> similarity([['春天适合种什么花?','春天适合种什么菜?'],['谁有狂三这张高清的','这张高清图,谁有']]) +[{'text1': '春天适合种什么花?', 'text2': '春天适合种什么菜?', 'similarity': 0.83402544}, {'text1': '谁有狂三这张高清的', 'text2': '这张高清图,谁有', 'similarity': 0.6540646}] +``` + ++ Query-Passage的相似度匹配 + +```python +>>> similarity = Taskflow("text_similarity", model='rocketqa-base-cross-encoder') +>>> similarity([["国家法定节假日共多少天?", "现在法定假日是元旦1天,春节3天,清明节1天,五一劳动节1天,端午节1天,国庆节3天,中秋节1天,共计11天。法定休息日每年52个周末总共104天。合到一起总计115天。"],["衡量酒水的价格的因素有哪些?", "衡量酒水的价格的因素很多的,酒水的血统(也就是那里产的,采用什么工艺等);存储的时间等等,酒水是一件很难标准化得商品,只要你敢要价,有买的那就值那个钱。"]]) +[{'text1': '国家法定节假日共多少天?', 'text2': '现在法定假日是元旦1天,春节3天,清明节1天,五一劳动节1天,端午节1天,国庆节3天,中秋节1天,共计11天。法定休息日每年52个周末总共104天。合到一起总计115天。', 'similarity': 0.7174624800682068}, {'text1': '衡量酒水的价格的因素有哪些?', 'text2': '衡量酒水的价格的因素很多的,酒水的血统(也就是那里产的,采用什么工艺等);存储的时间等等,酒水是一件很难标准化得商品,只要你敢要价,有买的那就值那个钱。', 'similarity': 0.9069755673408508}] + +``` + +#### 模型选择 + +- 多模型选择,满足精度、速度要求 + + | 模型 | 结构 | 语言 | + | :---: | :--------: | :--------: | + | `rocketqa-zh-dureader-cross-encoder` | 12-layers, 768-hidden, 12-heads | 中文 | + | `simbert-base-chinese` (默认) | 12-layers, 768-hidden, 12-heads | 中文 | + | `rocketqa-base-cross-encoder` | 12-layers, 768-hidden, 12-heads | 中文 | + | `rocketqa-medium-cross-encoder` | 6-layers, 768-hidden, 12-heads | 中文 | + | `rocketqa-mini-cross-encoder` | 6-layers, 384-hidden, 12-heads | 中文 | + | `rocketqa-micro-cross-encoder` | 4-layers, 384-hidden, 12-heads | 中文 | + | `rocketqa-nano-cross-encoder` | 4-layers, 312-hidden, 12-heads | 中文 | + | `rocketqav2-en-marco-cross-encoder` | 12-layers, 768-hidden, 12-heads | 英文 | + + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `max_seq_len`:最大序列长度,默认为384。 +* `task_path`:自定义任务路径,默认为None。 +
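+如需在速度与精度之间折中,可按上表切换更小的模型并调整 `max_seq_len`,下面是一个调用示意(模型与参数取值仅作演示):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> # 示例:改用6层的rocketqa-medium-cross-encoder,并调小max_seq_len
+>>> similarity = Taskflow("text_similarity", model="rocketqa-medium-cross-encoder", max_seq_len=128, batch_size=2)
+>>> similarity([["春天适合种什么花?", "春天适合种什么菜?"]])
+```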
+ +### 情感分析 +
+**集成BiLSTM、SKEP、UIE等模型,支持评论维度、观点抽取、情感极性分类等情感分析任务**
+ +#### 支持不同模型,速度快和精度高两种模式 + +```python +>>> from paddlenlp import Taskflow +# 默认使用bilstm模型进行预测,速度快 +>>> senta = Taskflow("sentiment_analysis") +>>> senta("这个产品用起来真的很流畅,我非常喜欢") +[{'text': '这个产品用起来真的很流畅,我非常喜欢', 'label': 'positive', 'score': 0.9938690066337585}] + +# 使用SKEP情感分析预训练模型进行预测,精度高 +>>> senta = Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch") +>>> senta("作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。") +[{'text': '作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。', 'label': 'positive', 'score': 0.984320878982544}] + +# 使用UIE模型进行情感分析,具有较强的样本迁移能力 +# 1. 语句级情感分析 +>>> schema = ['情感倾向[正向,负向]'] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema) +>>> senta('蛋糕味道不错,店家服务也很好') +[{'情感倾向[正向,负向]': [{'text': '正向', 'probability': 0.996646058824652}]}] + +# 2. 评价维度级情感分析 +>>> # Aspect Term Extraction +>>> # schema = ["评价维度"] +>>> # Aspect - Opinion Extraction +>>> # schema = [{"评价维度":["观点词"]}] +>>> # Aspect - Sentiment Extraction +>>> # schema = [{"评价维度":["情感倾向[正向,负向,未提及]"]}] +>>> # Aspect - Sentiment - Opinion Extraction +>>> schema = [{"评价维度":["观点词", "情感倾向[正向,负向,未提及]"]}] + +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema) +>>> senta('蛋糕味道不错,店家服务也很热情') +[{'评价维度': [{'text': '服务', 'start': 9, 'end': 11, 'probability': 0.9709093024793489, 'relations': { '观点词': [{'text': '热情', 'start': 13, 'end': 15, 'probability': 0.9897222206316556}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.9999327669598301}]}}, {'text': '味道', 'start': 2, 'end': 4, 'probability': 0.9105472387838915, 'relations': {'观点词': [{'text': '不错', 'start': 4, 'end': 6, 'probability': 0.9946981266891619}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.9998829392709467}]}}]}] +``` + +#### 批量样本输入,平均速度更快 +```python +>>> from paddlenlp import Taskflow +>>> schema = [{"评价维度":["观点词", "情感倾向[正向,负向,未提及]"]}] +>>> senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema) +>>> senta(["房间不大,很干净", "老板服务热情,价格也便宜"]) +[{'评价维度': [{'text': '房间', 'start': 0, 'end': 2, 'probability': 0.998526653966298, 'relations': {'观点词': [{'text': '干净', 'start': 6, 'end': 8, 'probability': 0.9899580841973474}, {'text': '不大', 'start': 2, 'end': 4, 'probability': 0.9945525066163512}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.6077412795680956}]}}]}, {'评价维度': [{'text': '服务', 'start': 2, 'end': 4, 'probability': 0.9913965811617516, 'relations': {'观点词': [{'text': '热情', 'start': 4, 'end': 6, 'probability': 0.9995530034336753}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.9956709542206106}]}}, {'text': '价格', 'start': 7, 'end': 9, 'probability': 0.9970075537913772, 'relations': {'观点词': [{'text': '便宜', 'start': 10, 'end': 12, 'probability': 0.9991568497876635}], '情感倾向[正向,负向,未提及]': [{'text': '正向', 'probability': 0.9943191048602245}]}}]}] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `model`:选择任务使用的模型,可选有`bilstm`,`skep_ernie_1.0_large_ch`,`uie-senta-base`,`uie-senta-medium`,`uie-senta-mini`,`uie-senta-micro`,`uie-senta-nano`。 +* `task_path`:自定义任务路径,默认为None。 +
+ +### 生成式问答 +
+**使用最大中文开源CPM模型完成问答**
+ +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +>>> qa = Taskflow("question_answering") +# 单条输入 +>>> qa("中国的国土面积有多大?") +[{'text': '中国的国土面积有多大?', 'answer': '960万平方公里。'}] +# 多条输入 +>>> qa(["中国国土面积有多大?", "中国的首都在哪里?"]) +[{'text': '中国国土面积有多大?', 'answer': '960万平方公里。'}, {'text': '中国的首都在哪里?', 'answer': '北京。'}] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +
+ +### 智能写诗 +
+**使用最大中文开源CPM模型完成写诗**
+ +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +>>> poetry = Taskflow("poetry_generation") +# 单条输入 +>>> poetry("林密不见人") +[{'text': '林密不见人', 'answer': ',但闻人语响。'}] +# 多条输入 +>>> poetry(["林密不见人", "举头邀明月"]) +[{'text': '林密不见人', 'answer': ',但闻人语响。'}, {'text': '举头邀明月', 'answer': ',低头思故乡。'}] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +
+ +### 开放域对话 +
+**十亿级语料训练最强中文闲聊模型PLATO-Mini,支持多轮对话**
+ +#### 非交互模式 +```python +>>> from paddlenlp import Taskflow +>>> dialogue = Taskflow("dialogue") +>>> dialogue(["吃饭了吗"]) +['刚吃完饭,你在干什么呢?'] + +>>> dialogue(["你好", "吃饭了吗"], ["你是谁?"]) +['吃过了,你呢', '我是李明啊'] +``` + +可配置参数: + +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `max_seq_len`:最大序列长度,默认为512。 + +#### 交互模式 +```python +>>> from paddlenlp import Taskflow + +>>> dialogue = Taskflow("dialogue") +# 输入`exit`可退出交互模式 +>>> dialogue.interactive_mode(max_turn=3) + +''' +[Human]:你好 +[Bot]:你好,很高兴认识你,我想问你一下,你喜欢运动吗? +[Human]:喜欢 +[Bot]:那你喜欢什么运动啊? +[Human]:篮球,你喜欢篮球吗 +[Bot]:当然了,我很喜欢打篮球的 +''' +``` + +交互模式参数: +* `max_turn`:任务能记忆的对话轮数,当max_turn为1时,模型只能记住当前对话,无法获知之前的对话内容。 +
+ +### 代码生成 +
+**通过CodeGen模型来生成代码**
+ +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +# 默认模型为 Salesforce/codegen-350M-mono +>>> codegen = Taskflow("code_generation", model="Salesforce/codegen-2B-mono") +# 单条输入 +>>> codegen("def hello_world():") +['\n print("Hello World")'] +# 多条输入 +>>> codegen(["Get the length of array", "def hello_world():"]) +['\n n = len(a)\n\n #', '\n print("Hello World!")'] +``` + +#### 可配置参数说明 +* `model`:可选模型,默认为Salesforce/codegen-350M-mono,支持的模型参考[CodeGen文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/code_generation/codegen/README.md)。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `max_length`:生成代码的最大长度,默认为128。 +* `min_length`:生成代码的最小长度,默认为0。 +* `decode_strategy`:解码策略,支持greedy_search,beam_search和sampling,默认为sampling。 +* `temperature`:解码参数temperature,默认为0.6。 +* `top_k`:解码参数top_k,默认为5。 +* `top_p`:解码参数top_p,默认为1.0。 +* `num_beams`:beam_search解码的beam size,默认为4。 +* `length_penalty`:解码长度控制值,默认为1.0。 +* `repetition_penalty`:解码重复惩罚值,默认为1.1。 +* `output_scores`:是否要输出解码得分,请默认为False。 +
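+上述解码参数可在构造 `Taskflow` 时直接传入,下面是一个调用示意(参数取值仅作演示,实际生成结果受解码策略与模型影响,可能与示例不同):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> # 示例:缩短生成长度并改用greedy_search解码
+>>> codegen = Taskflow("code_generation", max_length=64, decode_strategy="greedy_search", repetition_penalty=1.2)
+>>> codegen("def hello_world():")
+```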
+ + + +### 文本摘要 +
+**通过Pegasus模型来生成摘要**
+ +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +>>> summarizer = Taskflow("text_summarization") +# 单条输入 +>>> summarizer('2022年,中国房地产进入转型阵痛期,传统“高杠杆、快周转”的模式难以为继,万科甚至直接喊话,中国房地产进入“黑铁时代”') +# 输出:['万科喊话中国房地产进入“黑铁时代”'] + +# 多条输入 +>>> summarizer([ + '据悉,2022年教育部将围绕“巩固提高、深化落实、创新突破”三个关键词展开工作。要进一步强化学校教育主阵地作用,继续把落实“双减”作为学校工作的重中之重,重点从提高作业设计水平、提高课后服务水平、提高课堂教学水平、提高均衡发展水平四个方面持续巩固提高学校“双减”工作水平。', + '党参有降血脂,降血压的作用,可以彻底消除血液中的垃圾,从而对冠心病以及心血管疾病的患者都有一定的稳定预防工作作用,因此平时口服党参能远离三高的危害。另外党参除了益气养血,降低中枢神经作用,调整消化系统功能,健脾补肺的功能。' + ]) +#输出:['教育部:将从四个方面持续巩固提高学校“双减”工作水平', '党参能降低三高的危害'] +``` + +#### 可配置参数说明 +* `model`:可选模型,默认为`IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese`。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 + +
+ +### 文档智能 +
+**以多语言跨模态布局增强文档预训练模型ERNIE-Layout为核心底座**
+ +#### 输入格式 + +``` +[ + {"doc": "./invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}, + {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]} +] +``` + +默认使用PaddleOCR进行OCR识别,同时支持用户通过``word_boxes``传入自己的OCR结果,格式为``List[str, List[float, float, float, float]]``。 + +``` +[ + {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes} +] +``` + +#### 支持单条、批量预测 + +- 支持本地图片路径输入 + +
+(此处为本地简历图片 resume.png 的示例图)
+ + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> docprompt = Taskflow("document_intelligence") +>>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}])) +[{'prompt': '五百丁本次想要担任的是什么职位?', + 'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]}, +{'prompt': '五百丁是在哪里上的大学?', + 'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]}, +{'prompt': '大学学的是什么专业?', + 'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}] +``` + +- http图片链接输入 + +
+(此处为发票图片 invoice.jpg 的示例图)
+ + +```python +>>> from pprint import pprint +>>> from paddlenlp import Taskflow + +>>> docprompt = Taskflow("document_intelligence") +>>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}])) +[{'prompt': '发票号码是多少?', + 'result': [{'end': 2, 'prob': 0.74, 'start': 2, 'value': 'No44527206'}]}, +{'prompt': '校验码是多少?', + 'result': [{'end': 233, + 'prob': 1.0, + 'start': 231, + 'value': '01107 555427109891646'}]}] +``` + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `lang`:选择PaddleOCR的语言,`ch`可在中英混合的图片中使用,`en`在英文图片上的效果更好,默认为`ch`。 +* `topn`: 如果模型识别出多个结果,将返回前n个概率值最高的结果,默认为1。 + + +
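+如需返回多个候选答案,可结合上文的 `topn`、`batch_size` 等参数使用,下面是一个调用示意(取值仅作演示,输出从略):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> # 示例:返回概率最高的前2个候选答案
+>>> docprompt = Taskflow("document_intelligence", topn=2, batch_size=2)
+>>> docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?"]}])
+```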
+ +### 问题生成 +
+**基于百度自研中文预训练模型UNIMO-Text和大规模多领域问题生成数据集**
+ +#### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +# 默认模型为 unimo-text-1.0-dureader_qg +>>> question_generator = Taskflow("question_generation") +# 单条输入 +>>> question_generator([ + {"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"} + ]) +''' + ['黄山最高峰是什么'] +''' +# 多条输入 +>>> question_generator([ + {"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"}, + {"context": "弗朗索瓦·韦达外文名:franciscusvieta国籍:法国出生地:普瓦图出生日期:1540年逝世日期:1603年12月13日职业:数学家主要成就:为近代数学的发展奠定了基础。", "answer": "法国"} + ]) +''' + ['黄山最高峰是什么', '弗朗索瓦是哪里人'] +''' +``` + +#### 可配置参数说明 +* `model`:可选模型,默认为unimo-text-1.0-dureader_qg,支持的模型有["unimo-text-1.0", "unimo-text-1.0-dureader_qg", "unimo-text-1.0-question-generation", "unimo-text-1.0-question-generation-dureader_qg"]。 +* `device`:运行设备,默认为"gpu"。 +* `template`:模版,可选项有[0, 1, 2, 3],1表示使用默认模版,0表示不使用模版。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `output_scores`:是否要输出解码得分,默认为False。 +* `is_select_from_num_return_sequences`:是否对多个返回序列挑选最优项输出,当为True时,若num_return_sequences不为1则自动根据解码得分选择得分最高的序列最为最终结果,否则返回num_return_sequences个序列,默认为True。 +* `max_length`:生成代码的最大长度,默认为50。 +* `min_length`:生成代码的最小长度,默认为3。 +* `decode_strategy`:解码策略,支持beam_search和sampling,默认为beam_search。 +* `temperature`:解码参数temperature,默认为1.0。 +* `top_k`:解码参数top_k,默认为0。 +* `top_p`:解码参数top_p,默认为1.0。 +* `num_beams`:解码参数num_beams,表示beam_search解码的beam size,默认为6。 +* `num_beam_groups`:解码参数num_beam_groups,默认为1。 +* `diversity_rate`:解码参数diversity_rate,默认为0.0。 +* `length_penalty`:解码长度控制值,默认为1.2。 +* `num_return_sequences`:解码返回序列数,默认为1。 +* `repetition_penalty`:解码重复惩罚值,默认为1。 +* `use_fast`:表示是否开启基于FastGeneration的高性能预测,注意FastGeneration的高性能预测仅支持gpu,默认为False。 +* `use_fp16_decoding`: 表示在开启高性能预测的时候是否使用fp16来完成预测过程,若不使用则使用fp32,默认为False。 + +
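+如希望一次得到多个候选问题,可结合 `num_return_sequences` 与 `is_select_from_num_return_sequences` 使用,下面是一个调用示意(取值仅作演示,输出从略):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> # 示例:返回2个候选问题,而不是自动挑选得分最高的一个
+>>> question_generator = Taskflow("question_generation", num_return_sequences=2, is_select_from_num_return_sequences=False)
+>>> question_generator([{"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"}])
+```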
+ +### 零样本文本分类 +
+**适配多场景的零样本通用文本分类工具**
+ +通用文本分类主要思想是利用单一模型支持通用分类、评论情感分析、语义相似度计算、蕴含推理、多项式阅读理解等众多“泛分类”任务。用户可以自定义任意标签组合,在不限定领域、不设定 prompt 的情况下进行文本分类。 + + +#### 情感分析 + +``` +>>> cls = Taskflow("zero_shot_text_classification", schema=["这是一条好评", "这是一条差评"]) +>>> cls("房间干净明亮,非常不错") +[{'predictions': [{'label': '这是一条好评', 'score': 0.9072999699439914}], 'text_a': '房间干净明亮,非常不错'}] +>>> cls("东西还可以,但是快递非常慢,下次不会再买这家了。") +[{'predictions': [{'label': '这是一条差评', 'score': 0.9282672873429476}], 'text_a': '东西还可以,但是快递非常慢,下次不会再买这家了。'}] +``` + +#### 意图识别 + +``` +>>> from paddlenlp import Taskflow +>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用"] +>>> cls("先天性厚甲症去哪里治") +[{'predictions': [{'label': '就医建议', 'score': 0.5494891306403806}], 'text_a': '先天性厚甲症去哪里治'}] +>>> cls("男性小腹疼痛是什么原因?") +[{'predictions': [{'label': '病因分析', 'score': 0.5763229815300723}], 'text_a': '男性小腹疼痛是什么原因?'}] +``` + +#### 语义相似度计算 + +``` +>>> from paddlenlp import Taskflow +>>> cls = Taskflow("zero_shot_text_classification", schema=["不同", "相同"]) +>>> cls([["怎么查看合同", "从哪里可以看到合同"]]) +[{'predictions': [{'label': '相同', 'score': 0.9951385264364382}], 'text_a': '怎么查看合同', 'text_b': '从哪里可以看到合同'}] +>>> cls([["为什么一直没有电话来确认借款信息", "为何我还款了,今天却接到客服电话通知"]]) +[{'predictions': [{'label': '不同', 'score': 0.9991497973466908}], 'text_a': '为什么一直没有电话来确认借款信息', 'text_b': '为何我还款了,今天却接到客服电话通知'}] +``` + +#### 蕴含推理 + +``` +>>> from paddlenlp import Taskflow +>>> cls = Taskflow("zero_shot_text_classification", schema=["无关", "蕴含", "矛盾"]) +>>> cls([["一个骑自行车的人正沿着一条城市街道朝一座有时钟的塔走去。", "骑自行车的人正朝钟楼走去。"]]) +[{'predictions': [{'label': '蕴含', 'score': 0.9931122738524856}], 'text_a': '一个骑自行车的人正沿着一条城市街道朝一座有时钟的塔走去。', 'text_b': '骑自行车的人正朝钟楼走去。'}] +>>> cls([["一个留着长发和胡须的怪人,在地铁里穿着一件颜色鲜艳的衬衫。", "这件衬衫是新的。"]]) +[{'predictions': [{'label': '无关', 'score': 0.997680189334587}], 'text_a': '一个留着长发和胡须的怪人,在地铁里穿着一件颜色鲜艳的衬衫。', 'text_b': '这件衬衫是新的。'}] +>>> cls([["一个穿着绿色衬衫的妈妈和一个穿全黑衣服的男人在跳舞。", "两人都穿着白色裤子。"]]) +[{'predictions': [{'label': '矛盾', 'score': 0.9666946163628479}], 'text_a': '一个穿着绿色衬衫的妈妈和一个穿全黑衣服的男人在跳舞。', 'text_b': '两人都穿着白色裤子。'}] +``` + +#### 可配置参数说明 + +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `task_path`:自定义任务路径,默认为None。 +* `schema`:定义任务标签候选集合。 +* `model`:选择任务使用的模型,默认为`utc-base`, 支持`utc-xbase`, `utc-base`, `utc-medium`, `utc-micro`, `utc-mini`, `utc-nano`, `utc-pico`。 +* `max_seq_len`:最长输入长度,包括所有标签的长度,默认为512。 +* `pred_threshold`:模型对标签预测的概率在0~1之间,返回结果去掉小于这个阈值的结果,默认为0.5。 +* `precision`:选择模型精度,默认为`fp32`,可选有`fp16`和`fp32`。`fp16`推理速度更快。如果选择`fp16`,请先确保机器正确安装NVIDIA相关驱动和基础软件,**确保CUDA>=11.2,cuDNN>=8.1.1**,初次使用需按照提示安装相关依赖。其次,需要确保GPU设备的CUDA计算能力(CUDA Compute Capability)大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix)。 + +
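+下面给出一个显式构造并切换轻量模型的调用示意(`model` 与 `pred_threshold` 的取值仅作演示,标签集合沿用上文意图识别示例):
+
+```python
+>>> from paddlenlp import Taskflow
+>>> schema = ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用"]
+>>> # 示例:使用更轻量的utc-nano,并将概率阈值调整为0.6
+>>> cls = Taskflow("zero_shot_text_classification", schema=schema, model="utc-nano", pred_threshold=0.6)
+>>> cls("先天性厚甲症去哪里治")
+```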
+ +### 模型特征提取 + +
  基于百度自研中文图文跨模态预训练模型ERNIE-ViL 2.0
+ +#### 多模态特征提取 + +```python +>>> from paddlenlp import Taskflow +>>> from PIL import Image +>>> import paddle.nn.functional as F +>>> vision_language= Taskflow("feature_extraction") +# 单条输入 +>>> image_embeds = vision_language(Image.open("demo/000000039769.jpg")) +>>> image_embeds["features"] +Tensor(shape=[1, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[-0.59475428, -0.69795364, 0.22144008, 0.88066685, -0.58184201, +# 单条输入 +>>> text_embeds = vision_language("猫的照片") +>>> text_embeds['features'] +Tensor(shape=[1, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[ 0.04250504, -0.41429776, 0.26163983, 0.29910022, 0.39019185, + -0.41884750, -0.19893740, 0.44328332, 0.08186490, 0.10953025, + ...... + +# 多条输入 +>>> image_embeds = vision_language([Image.open("demo/000000039769.jpg")]) +>>> image_embeds["features"] +Tensor(shape=[1, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[-0.59475428, -0.69795364, 0.22144008, 0.88066685, -0.58184201, + ...... +# 多条输入 +>>> text_embeds = vision_language(["猫的照片","狗的照片"]) +>>> text_embeds["features"] +Tensor(shape=[2, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[ 0.04250504, -0.41429776, 0.26163983, ..., 0.26221892, + 0.34387422, 0.18779707], + [ 0.06672225, -0.41456309, 0.13787819, ..., 0.21791610, + 0.36693242, 0.34208685]]) +>>> image_features = image_embeds["features"] +>>> text_features = text_embeds["features"] +>>> image_features /= image_features.norm(axis=-1, keepdim=True) +>>> text_features /= text_features.norm(axis=-1, keepdim=True) +>>> logits_per_image = 100 * image_features @ text_features.t() +>>> probs = F.softmax(logits_per_image, axis=-1) +>>> probs +Tensor(shape=[1, 2], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[0.99833173, 0.00166824]]) +``` +#### 模型选择 + +- 多模型选择,满足精度、速度要求 + + | 模型 | 视觉| 文本 | 语言 | + | :---: | :--------: | :--------: | :--------: | + | `PaddlePaddle/ernie_vil-2.0-base-zh` (默认) | ViT | ERNIE | 中文 | + | `OFA-Sys/chinese-clip-vit-base-patch16` | ViT-B/16 |RoBERTa-wwm-Base| 中文 | + | `OFA-Sys/chinese-clip-vit-large-patch14` | ViT-L/14 | RoBERTa-wwm-Base | 中文 | + | `OFA-Sys/chinese-clip-vit-large-patch14-336px` | ViT-L/14 | RoBERTa-wwm-Base | 中文 | + + +#### 可配置参数说明 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 +* `_static_mode`:静态图模式,默认开启。 +* `model`:选择任务使用的模型,默认为`PaddlePaddle/ernie_vil-2.0-base-zh`。 + +#### 文本特征提取 + +```python +>>> from paddlenlp import Taskflow +>>> import paddle.nn.functional as F +>>> text_encoder = Taskflow("feature_extraction", model='rocketqa-zh-base-query-encoder') +>>> text_embeds = text_encoder(['春天适合种什么花?','谁有狂三这张高清的?']) +>>> text_features1 = text_embeds["features"] +>>> text_features1 +Tensor(shape=[2, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[ 0.27640465, -0.13405125, 0.00612330, ..., -0.15600294, + -0.18932408, -0.03029604], + [-0.12041329, -0.07424965, 0.07895312, ..., -0.17068857, + 0.04485796, -0.18887770]]) +>>> text_embeds = text_encoder('春天适合种什么菜?') +>>> text_features2 = text_embeds["features"] +>>> text_features2 +Tensor(shape=[1, 768], dtype=float32, place=Place(gpu:0), stop_gradient=True, + [[ 0.32578075, -0.02398480, -0.18929179, -0.18639392, -0.04062131, + ...... 
+>>> probs = F.cosine_similarity(text_features1, text_features2)
+>>> probs
+Tensor(shape=[2], dtype=float32, place=Place(gpu:0), stop_gradient=True,
+       [0.86455142, 0.41222256])
+```
+
+#### 模型选择
+
+- 多模型选择,满足精度、速度要求
+
+  | 模型 | 层数 | 维度 | 语言 |
+  | :---: | :--------: | :--------: | :--------: |
+  | `rocketqa-zh-dureader-query-encoder` | 12 | 768 | 中文 |
+  | `rocketqa-zh-dureader-para-encoder` | 12 | 768 | 中文 |
+  | `rocketqa-zh-base-query-encoder` | 12 | 768 | 中文 |
+  | `rocketqa-zh-base-para-encoder` | 12 | 768 | 中文 |
+  | `moka-ai/m3e-base` | 12 | 768 | 中文 |
+  | `rocketqa-zh-medium-query-encoder` | 6 | 768 | 中文 |
+  | `rocketqa-zh-medium-para-encoder` | 6 | 768 | 中文 |
+  | `rocketqa-zh-mini-query-encoder` | 6 | 384 | 中文 |
+  | `rocketqa-zh-mini-para-encoder` | 6 | 384 | 中文 |
+  | `rocketqa-zh-micro-query-encoder` | 4 | 384 | 中文 |
+  | `rocketqa-zh-micro-para-encoder` | 4 | 384 | 中文 |
+  | `rocketqa-zh-nano-query-encoder` | 4 | 312 | 中文 |
+  | `rocketqa-zh-nano-para-encoder` | 4 | 312 | 中文 |
+  | `rocketqav2-en-marco-query-encoder` | 12 | 768 | 英文 |
+  | `rocketqav2-en-marco-para-encoder` | 12 | 768 | 英文 |
+  | `ernie-search-base-dual-encoder-marco-en` | 12 | 768 | 英文 |
+
+#### 可配置参数说明
+* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。
+* `max_seq_len`:文本序列的最大长度,默认为128。
+* `return_tensors`: 返回的类型,有pd和np,默认为pd。
+* `model`:选择任务使用的模型,默认为`PaddlePaddle/ernie_vil-2.0-base-zh`。
+* `pooling_mode`:选择句向量获取方式,有'max_tokens','mean_tokens','mean_sqrt_len_tokens','cls_token',默认为'cls_token'(`moka-ai/m3e-base`)。
+
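+下面是一个结合上述参数的文本特征提取示例(以 `moka-ai/m3e-base` 为例,`pooling_mode`、`return_tensors` 等取值仅作示意):
+
+```python
+from paddlenlp import Taskflow
+
+text_encoder = Taskflow(
+    "feature_extraction",
+    model="moka-ai/m3e-base",
+    pooling_mode="mean_tokens",   # 句向量获取方式
+    return_tensors="np",          # 返回 numpy 数组,便于与其他库配合使用
+    max_seq_len=128,
+    batch_size=2,
+)
+embeds = text_encoder(["春天适合种什么花?", "谁有狂三这张高清的?"])
+print(embeds["features"].shape)   # 预期为 (2, 768)
+```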
+ +## PART Ⅱ   定制化训练 + +
适配任务列表
+
+如果你有自己的业务数据集,可以对模型效果进一步调优,支持定制化训练的任务如下:
+
+| 任务名称 | 默认路径 | 示例 |
+| :----------------------------------------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: |
+| `Taskflow("word_segmentation", mode="base")` | `$HOME/.paddlenlp/taskflow/lac` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/lexical_analysis) |
+| `Taskflow("word_segmentation", mode="accurate")` | `$HOME/.paddlenlp/taskflow/wordtag` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm) |
+| `Taskflow("pos_tagging")` | `$HOME/.paddlenlp/taskflow/lac` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/lexical_analysis) |
+| `Taskflow("ner", mode="fast")` | `$HOME/.paddlenlp/taskflow/lac` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/lexical_analysis) |
+| `Taskflow("ner", mode="accurate")` | `$HOME/.paddlenlp/taskflow/wordtag` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm) |
+| `Taskflow("information_extraction", model="uie-base")` | `$HOME/.paddlenlp/taskflow/information_extraction/uie-base` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie) |
+| `Taskflow("information_extraction", model="uie-tiny")` | `$HOME/.paddlenlp/taskflow/information_extraction/uie-tiny` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie) |
+| `Taskflow("text_correction", model="ernie-csc")` | `$HOME/.paddlenlp/taskflow/text_correction/ernie-csc` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_correction/ernie-csc) |
+| `Taskflow("dependency_parsing", model="ddparser")` | `$HOME/.paddlenlp/taskflow/dependency_parsing/ddparser` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/dependency_parsing/ddparser) |
+| `Taskflow("dependency_parsing", model="ddparser-ernie-1.0")` | `$HOME/.paddlenlp/taskflow/dependency_parsing/ddparser-ernie-1.0` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/dependency_parsing/ddparser) |
+| `Taskflow("dependency_parsing", model="ddparser-ernie-gram-zh")` | `$HOME/.paddlenlp/taskflow/dependency_parsing/ddparser-ernie-gram-zh` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/dependency_parsing/ddparser) |
+| `Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch")` | `$HOME/.paddlenlp/taskflow/sentiment_analysis/skep_ernie_1.0_large_ch` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/sentiment_analysis/skep) |
+| `Taskflow("knowledge_mining", model="wordtag")` | `$HOME/.paddlenlp/taskflow/wordtag` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm) |
+| `Taskflow("knowledge_mining", model="nptag")` | `$HOME/.paddlenlp/taskflow/knowledge_mining/nptag` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/nptag) |
+| `Taskflow("zero_shot_text_classification", model="utc-base")` | `$HOME/.paddlenlp/taskflow/zero_shot_text_classification/utc-base` | [示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/zero_shot_text_classification) |
+
+ + +
定制化训练示例
+ +这里我们以命名实体识别`Taskflow("ner", mode="accurate")`为例,展示如何定制自己的模型。 + +调用`Taskflow`接口后,程序自动将相关文件下载到`$HOME/.paddlenlp/taskflow/wordtag/`,该默认路径包含以下文件: + +```text +$HOME/.paddlenlp/taskflow/wordtag/ +├── model_state.pdparams # 默认模型参数文件 +├── model_config.json # 默认模型配置文件 +└── tags.txt # 默认标签文件 +``` + +* 参考上表中对应[示例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm)准备数据集和标签文件`tags.txt`,执行相应训练脚本得到自己的`model_state.pdparams`和`model_config.json`。 + +* 根据自己数据集情况,修改标签文件`tags.txt`。 + +* 将以上文件保存到任意路径中,自定义路径下的文件需要和默认路径的文件一致: + +```text +custom_task_path/ +├── model_state.pdparams # 定制模型参数文件 +├── model_config.json # 定制模型配置文件 +└── tags.txt # 定制标签文件 +``` +* 通过`task_path`指定自定义路径,使用Taskflow加载自定义模型进行一键预测: + +```python +from paddlenlp import Taskflow +my_ner = Taskflow("ner", mode="accurate", task_path="./custom_task_path/") +``` +
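+
+加载完成后,`my_ner` 的调用方式与使用默认模型时一致,例如(输入句子仅作示意):
+
+```python
+my_ner("热梅茶是一道以梅子为主要原料制作的茶饮")
+```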
+ +## 模型算法 + +
模型算法说明
+ + +
+| 任务名称 | 模型 | 模型详情 | 训练集 |
+| :---: | :---: | :---: | :---: |
+| 中文分词 | 默认模式:BiGRU+CRF | 训练详情 | 百度自建数据集,包含近2200万句子,覆盖多种场景 |
+| 中文分词 | 快速模式:Jieba | - | - |
+| 中文分词 | 精确模式:WordTag | 训练详情 | 百度自建数据集,词类体系基于TermTree构建 |
+| 词性标注 | BiGRU+CRF | 训练详情 | 百度自建数据集,包含2200万句子,覆盖多种场景 |
+| 命名实体识别 | 精确模式:WordTag | 训练详情 | 百度自建数据集,词类体系基于TermTree构建 |
+| 命名实体识别 | 快速模式:BiGRU+CRF | 训练详情 | 百度自建数据集,包含2200万句子,覆盖多种场景 |
+| 依存句法分析 | DDParser | 训练详情 | 百度自建数据集,DuCTB 1.0中文依存句法树库 |
+| 信息抽取 | UIE | 训练详情 | 百度自建数据集 |
+| 解语知识标注 | 词类知识标注:WordTag | 训练详情 | 百度自建数据集,词类体系基于TermTree构建 |
+| 解语知识标注 | 名词短语标注:NPTag | 训练详情 | 百度自建数据集 |
+| 文本纠错 | ERNIE-CSC | 训练详情 | SIGHAN简体版数据集及Automatic Corpus Generation生成的中文纠错数据集 |
+| 文本相似度 | SimBERT | - | 收集百度知道2200万对相似句组 |
+| 情感分析 | BiLSTM | - | 百度自建数据集 |
+| 情感分析 | SKEP | 训练详情 | 百度自建数据集 |
+| 情感分析 | UIE | 训练详情 | 百度自建数据集 |
+| 生成式问答 | CPM | - | 100GB级别中文数据 |
+| 智能写诗 | CPM | - | 100GB级别中文数据 |
+| 开放域对话 | PLATO-Mini | - | 十亿级别中文对话数据 |
+| 零样本文本分类 | UTC | 训练详情 | 百度自建数据集 |
+ +
+ +## FAQ + +
+**Q:** Taskflow如何修改任务保存路径?
+ +**A:** Taskflow默认会将任务相关模型等文件保存到`$HOME/.paddlenlp`下,可以在任务初始化的时候通过`home_path`自定义修改保存路径。示例: +```python +from paddlenlp import Taskflow + +ner = Taskflow("ner", home_path="/workspace") +``` +通过以上方式即可将ner任务相关文件保存至`/workspace`路径下。 +
+ + +
+**Q:** 下载或调用模型失败,多次下载均失败怎么办?
+
+**A:** Taskflow默认会将任务相关模型等文件保存到`$HOME/.paddlenlp/taskflow`下,如果下载或调用失败,可删除相应路径下的文件后重新尝试即可。
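+
+例如,`Taskflow("ner", mode="accurate")` 对应的默认缓存目录为 `$HOME/.paddlenlp/taskflow/wordtag`(见上文适配任务列表),可参考以下方式清理后重新下载(仅作示意):
+
+```python
+import shutil
+from pathlib import Path
+
+# 删除 wordtag 相关缓存文件,下次调用 Taskflow 时会重新下载
+shutil.rmtree(Path.home() / ".paddlenlp" / "taskflow" / "wordtag", ignore_errors=True)
+```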
+ +
+**Q:** Taskflow如何提升预测速度?
+ +**A:** 可以结合设备情况适当调整batch_size,采用批量输入的方式来提升平均速率。示例: +```python +from paddlenlp import Taskflow + +# 精确模式模型体积较大,可结合机器情况适当调整batch_size,采用批量样本输入的方式。 +seg_accurate = Taskflow("word_segmentation", mode="accurate", batch_size=32) + +# 批量样本输入,输入为多个句子组成的list,预测速度更快 +texts = ["热梅茶是一道以梅子为主要原料制作的茶饮", "《孤女》是2010年九州出版社出版的小说,作者是余兼羽"] +seg_accurate(texts) +``` +通过上述方式进行分词可以大幅提升预测速度。 + +
+ +
+**Q:** 后续会增加更多任务支持吗?
+
+**A:** Taskflow 支持的任务正在持续丰富中,我们会根据开发者反馈灵活调整功能建设的优先级,欢迎通过 Issue 或[问卷](https://wenjuan.baidu-int.com/manage/?r=survey/pageEdit&sid=85827)反馈给我们。
+
+ + +## 附录 + +
参考资料
+ +1. [fxsjy/jieba](https://github.com/fxsjy/jieba) +2. [ZhuiyiTechnology/simbert](https://github.com/ZhuiyiTechnology/simbert) +3. [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) + +
diff --git a/docs/model_zoo/transformers/ALBERT/contents.rst b/docs/model_zoo/transformers/ALBERT/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..febf37681a07d7d47b9dfaec6acabd8a7f5683b2 --- /dev/null +++ b/docs/model_zoo/transformers/ALBERT/contents.rst @@ -0,0 +1,70 @@ + + +------------------------------------ +ALBERT模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ALBERT模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``albert-base-v1`` | English | 12 repeating layers, 128 embedding, | +| | | 768-hidden, 12-heads, 11M parameters. | +| | | ALBERT base model | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-large-v1`` | English | 24 repeating layers, 128 embedding, | +| | | 1024-hidden, 16-heads, 17M parameters. | +| | | ALBERT large model | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-xlarge-v1`` | English | 24 repeating layers, 128 embedding, | +| | | 2048-hidden, 16-heads, 58M parameters. | +| | | ALBERT xlarge model | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-xxlarge-v1`` | English | 12 repeating layers, 128 embedding, | +| | | 4096-hidden, 64-heads, 223M parameters. | +| | | ALBERT xxlarge model | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-base-v2`` | English | 12 repeating layers, 128 embedding, | +| | | 768-hidden, 12-heads, 11M parameters. | +| | | ALBERT base model (version2) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-large-v2`` | English | 24 repeating layers, 128 embedding, | +| | | 1024-hidden, 16-heads, 17M parameters. | +| | | ALBERT large model (version2) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-xlarge-v2`` | English | 24 repeating layers, 128 embedding, | +| | | 2048-hidden, 16-heads, 58M parameters. | +| | | ALBERT xlarge model (version2) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-xxlarge-v2`` | English | 12 repeating layers, 128 embedding, | +| | | 4096-hidden, 64-heads, 223M parameters. 
| +| | | ALBERT xxlarge model (version2) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-tiny`` | Chinese | 4 repeating layers, 128 embedding, | +| | | 312-hidden, 12-heads, 4M parameters. | +| | | ALBERT tiny model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-small`` | Chinese | 6 repeating layers, 128 embedding, | +| | | 384-hidden, 12-heads, _M parameters. | +| | | ALBERT small model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-base`` | Chinese | 12 repeating layers, 128 embedding, | +| | | 768-hidden, 12-heads, 12M parameters. | +| | | ALBERT base model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-large`` | Chinese | 24 repeating layers, 128 embedding, | +| | | 1024-hidden, 16-heads, 18M parameters. | +| | | ALBERT large model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-xlarge`` | Chinese | 24 repeating layers, 128 embedding, | +| | | 2048-hidden, 16-heads, 60M parameters. | +| | | ALBERT xlarge model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``albert-chinese-xxlarge`` | Chinese | 12 repeating layers, 128 embedding, | +| | | 4096-hidden, 16-heads, 235M parameters. | +| | | ALBERT xxlarge model (Chinese) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/BART/contents.rst b/docs/model_zoo/transformers/BART/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..d9205a31fc48bf36cc9155a46c20f7433c581af3 --- /dev/null +++ b/docs/model_zoo/transformers/BART/contents.rst @@ -0,0 +1,22 @@ + + +------------------------------------ +BART模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的BART模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``bart-base`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 217M parameters. 
| +| | | BART base model (English) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``bart-large`` | English | 24-layer, 768-hidden, | +| | | 16-heads, 509M parameters. | +| | | BART large model (English) | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/BERT/contents.rst b/docs/model_zoo/transformers/BERT/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..5cc533a10007f0f21b2c469387430c347b2c1180 --- /dev/null +++ b/docs/model_zoo/transformers/BERT/contents.rst @@ -0,0 +1,692 @@ + + +------------------------------------ +BERT模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的BERT模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +| ``bert-base-uncased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 110M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-uncased`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-cased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 109M parameters. | +| | | Trained on cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-cased`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 335M parameters. | +| | | Trained on cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-multilingual-uncased`` | Multilingual | 12-layer, 768-hidden, | +| | | 12-heads, 168M parameters. | +| | | Trained on lower-cased text | +| | | in the top 102 languages | +| | | with the largest Wikipedias. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-multilingual-cased`` | Multilingual | 12-layer, 768-hidden, | +| | | 12-heads, 179M parameters. | +| | | Trained on cased text | +| | | in the top 104 languages | +| | | with the largest Wikipedias. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on cased Chinese Simplified | +| | | and Traditional text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-wwm-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on cased Chinese Simplified | +| | | and Traditional text using | +| | | Whole-Word-Masking. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-wwm-ext-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on cased Chinese Simplified | +| | | and Traditional text using | +| | | Whole-Word-Masking with extented data. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-base`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-12_H-768`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-medium`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-8_H-512`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-small`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-4_H-512`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-mini`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-4_H-256`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-tiny`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-2_H-128`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/chinese-roberta-6l-768h`` | Chinese | Please refer to: | +| | | `uer/chinese_roberta_L-6_H-768`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ckiplab/bert-base-chinese-pos`` | Chinese | Please refer to: | +| | | `ckiplab/bert-base-chinese-pos`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``tbs17/MathBERT`` | English | Please refer to: | +| | | `tbs17/MathBERT`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``macbert-base-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained with novel MLM as correction | +| | | pre-training task. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``macbert-large-chinese`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, 326M parameters. | +| | | Trained with novel MLM as correction | +| | | pre-training task. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``simbert-base-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on 22 million pairs of similar | +| | | sentences crawed from Baidu Know. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Langboat/mengzi-bert-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on 300G Chinese Corpus Datasets. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Langboat/mengzi-bert-base-fin`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on 20G Finacial Corpus, | +| | | based on ``Langboat/mengzi-bert-base``. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cross-encoder/ms-marco-MiniLM-L-12-v2`` | English | Please refer to: | +| | | `cross-encoder/ms-marco-MiniLM-L-12-v2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-base-japanese-char`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-base-japanese-char`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-base-japanese-whole-word-masking`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-base-japanese-whole-word-masking`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-base-japanese`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-base-japanese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``nlptown/bert-base-multilingual-uncased-sentiment`` | Multilingual | Please refer to: | +| | | `nlptown/bert-base-multilingual-uncased-sentiment`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-uncased-whole-word-masking-finetuned-squad`` | English | Please refer to: | +| | | `bert-large-uncased-whole-word-masking-finetuned-squad`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``finiteautomata/beto-sentiment-analysis`` | Spanish | Please refer to: | +| | | `finiteautomata/beto-sentiment-analysis`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hfl/chinese-bert-wwm-ext`` | Chinese | Please refer to: | +| | | `hfl/chinese-bert-wwm-ext`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``emilyalsentzer/Bio_ClinicalBERT`` | English | Please refer to: | +| | | `emilyalsentzer/Bio_ClinicalBERT`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dslim/bert-base-NER`` | English | Please refer to: | +| | | `dslim/bert-base-NER`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``deepset/bert-large-uncased-whole-word-masking-squad2`` | English | Please refer to: | +| | | `deepset/bert-large-uncased-whole-word-masking-squad2`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``neuralmind/bert-base-portuguese-cased`` | Portuguese | Please refer to: | +| | | `neuralmind/bert-base-portuguese-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``SpanBERT/spanbert-large-cased`` | English | Please refer to: | +| | | `SpanBERT/spanbert-large-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dslim/bert-large-NER`` | English | Please refer to: | +| | | `dslim/bert-large-NER`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-german-cased`` | German | Please refer to: | +| | | `bert-base-german-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``deepset/sentence_bert`` | English | Please refer to: | +| | | `deepset/sentence_bert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ProsusAI/finbert`` | English | Please refer to: | +| | | `ProsusAI/finbert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``oliverguhr/german-sentiment-bert`` | German | Please refer to: | +| | | `oliverguhr/german-sentiment-bert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/bert_uncased_L-2_H-128_A-2`` | English | Please refer to: | +| | | `google/bert_uncased_L-2_H-128_A-2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`` | English | Please refer to: | +| | | `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``DeepPavlov/rubert-base-cased`` | Russian | Please refer to: | +| | | `DeepPavlov/rubert-base-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``wietsedv/bert-base-dutch-cased`` | Dutch | Please refer to: | +| | | `wietsedv/bert-base-dutch-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``monologg/bert-base-cased-goemotions-original`` | English | Please refer to: | +| | | 
`monologg/bert-base-cased-goemotions-original`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``allenai/scibert_scivocab_uncased`` | English | Please refer to: | +| | | `allenai/scibert_scivocab_uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-large-cased-finetuned-conll03-english`` | English | Please refer to: | +| | | `dbmdz/bert-large-cased-finetuned-conll03-english`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`` | English | Please refer to: | +| | | `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-uncased-whole-word-masking`` | English | Please refer to: | +| | | `bert-large-uncased-whole-word-masking`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dccuchile/bert-base-spanish-wwm-uncased`` | Spanish | Please refer to: | +| | | `dccuchile/bert-base-spanish-wwm-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/bert_uncased_L-6_H-256_A-4`` | English | Please refer to: | +| | | `google/bert_uncased_L-6_H-256_A-4`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/bert_uncased_L-4_H-512_A-8`` | English | Please refer to: | +| | | `google/bert_uncased_L-4_H-512_A-8`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``FPTAI/vibert-base-cased`` | English | Please refer to: | +| | | `FPTAI/vibert-base-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cointegrated/rubert-tiny`` | Russian | Please refer to: | +| | | `cointegrated/rubert-tiny`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-german-dbmdz-uncased`` | German | Please refer to: | +| | | `bert-base-german-dbmdz-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-turkish-128k-cased`` | Turkish | Please refer to: | +| | | `dbmdz/bert-base-turkish-128k-cased`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-german-uncased`` | German | Please refer to: | +| | | `dbmdz/bert-base-german-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``deepset/minilm-uncased-squad2`` | English | Please refer to: | +| | | `deepset/minilm-uncased-squad2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``HooshvareLab/bert-base-parsbert-uncased`` | Persian | Please refer to: | +| | | `HooshvareLab/bert-base-parsbert-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-ag-news`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-ag-news`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-base-japanese-v2`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-base-japanese-v2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``emilyalsentzer/Bio_Discharge_Summary_BERT`` | English | Please refer to: | +| | | `emilyalsentzer/Bio_Discharge_Summary_BERT`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``KoichiYasuoka/bert-base-japanese-upos`` | Japanese | Please refer to: | +| | | `KoichiYasuoka/bert-base-japanese-upos`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-italian-xxl-cased`` | Italian | Please refer to: | +| | | `dbmdz/bert-base-italian-xxl-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``deepset/bert-base-cased-squad2`` | English | Please refer to: | +| | | `deepset/bert-base-cased-squad2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``beomi/kcbert-large`` | English | Please refer to: | +| | | `beomi/kcbert-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-cased-whole-word-masking-finetuned-squad`` | English | Please refer to: | +| | | `bert-large-cased-whole-word-masking-finetuned-squad`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| 
``neuralmind/bert-large-portuguese-cased`` |Portuguese | Please refer to: | +| | | `neuralmind/bert-large-portuguese-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Luyu/co-condenser-marco`` | English | Please refer to: | +| | | `Luyu/co-condenser-marco`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Sahajtomar/German_Zeroshot`` | German | Please refer to: | +| | | `Sahajtomar/German_Zeroshot`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``indolem/indobert-base-uncased`` | Indonesian | Please refer to: | +| | | `indolem/indobert-base-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``shibing624/text2vec-base-chinese`` | Chinese | Please refer to: | +| | | `shibing624/text2vec-base-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cointegrated/LaBSE-en-ru`` | English | Please refer to: | +| | and Russian | `cointegrated/LaBSE-en-ru`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``prithivida/parrot_fluency_on_BERT`` | English | Please refer to: | +| | | `prithivida/parrot_fluency_on_BERT`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-SST-2`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-SST-2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-snli`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-snli`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``klue/bert-base`` | English | Please refer to: | +| | | `klue/bert-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``asafaya/bert-base-arabic`` | Arabic | Please refer to: | +| | | `asafaya/bert-base-arabic`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-MRPC`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-MRPC`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| 
``textattack/bert-base-uncased-imdb`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-imdb`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cross-encoder/ms-marco-TinyBERT-L-2`` | English | Please refer to: | +| | | `cross-encoder/ms-marco-TinyBERT-L-2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/bert-tiny-finetuned-sms-spam-detection`` | English | Please refer to: | +| | | `mrm8488/bert-tiny-finetuned-sms-spam-detection`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``felflare/bert-restore-punctuation`` | English | Please refer to: | +| | | `felflare/bert-restore-punctuation`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english`` | English | Please refer to: | +| | | `sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-rotten-tomatoes`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-rotten-tomatoes`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``nlpaueb/legal-bert-base-uncased`` | English | Please refer to: | +| | | `nlpaueb/legal-bert-base-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hf-internal-testing/tiny-bert-for-token-classification`` | English | Please refer to: | +| | | `hf-internal-testing/tiny-bert-for-token-classification`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cointegrated/rubert-tiny2`` | Russian | Please refer to: | +| | | `cointegrated/rubert-tiny2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``kykim/bert-kor-base`` | Korean | Please refer to: | +| | | `kykim/bert-kor-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-base-japanese-char-v2`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-base-japanese-char-v2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/bert-small-finetuned-squadv2`` | English | Please refer to: | +| | | `mrm8488/bert-small-finetuned-squadv2`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``beomi/kcbert-base`` | English | Please refer to: | +| | | `beomi/kcbert-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-MNLI`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-MNLI`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-WNLI`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-WNLI`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-turkish-cased`` | Turkish | Please refer to: | +| | | `dbmdz/bert-base-turkish-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``huawei-noah/TinyBERT_General_4L_312D`` | English | Please refer to: | +| | | `huawei-noah/TinyBERT_General_4L_312D`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-QQP`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-QQP`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-STS-B`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-STS-B`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``allenai/scibert_scivocab_cased`` | English | Please refer to: | +| | | `allenai/scibert_scivocab_cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/bert-medium-finetuned-squadv2`` | English | Please refer to: | +| | | `mrm8488/bert-medium-finetuned-squadv2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``TurkuNLP/bert-base-finnish-cased-v1`` | Finnish | Please refer to: | +| | | `TurkuNLP/bert-base-finnish-cased-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-RTE`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-RTE`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/roberta-base-chinese-extractive-qa`` | Chinese | Please refer to: | 
+| | | `uer/roberta-base-chinese-extractive-qa`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-QNLI`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-QNLI`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``textattack/bert-base-uncased-CoLA`` | English | Please refer to: | +| | | `textattack/bert-base-uncased-CoLA`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dmis-lab/biobert-base-cased-v1.2`` | English | Please refer to: | +| | | `dmis-lab/biobert-base-cased-v1.2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``pierreguillou/bert-base-cased-squad-v1.1-portuguese`` | Portuguese | Please refer to: | +| | | `pierreguillou/bert-base-cased-squad-v1.1-portuguese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``KB/bert-base-swedish-cased`` | Swedish | Please refer to: | +| | | `KB/bert-base-swedish-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uer/roberta-base-finetuned-cluener2020-chinese`` | Chinese | Please refer to: | +| | | `uer/roberta-base-finetuned-cluener2020-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``onlplab/alephbert-base`` | Hebrew | Please refer to: | +| | | `onlplab/alephbert-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/bert-spanish-cased-finetuned-ner`` | Spanish | Please refer to: | +| | | `mrm8488/bert-spanish-cased-finetuned-ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``alvaroalon2/biobert_chemical_ner`` | English | Please refer to: | +| | | `alvaroalon2/biobert_chemical_ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-base-cased-finetuned-mrpc`` | English | Please refer to: | +| | | `bert-base-cased-finetuned-mrpc`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``unitary/toxic-bert`` | English | Please refer to: | +| | | `unitary/toxic-bert`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``nlpaueb/bert-base-greek-uncased-v1`` | Greek | Please refer to: | +| | | `nlpaueb/bert-base-greek-uncased-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``HooshvareLab/bert-fa-base-uncased-sentiment-snappfood`` | Persian | Please refer to: | +| | | `HooshvareLab/bert-fa-base-uncased-sentiment-snappfood`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Maltehb/danish-bert-botxo`` | Danish | Please refer to: | +| | | `Maltehb/danish-bert-botxo`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``shahrukhx01/bert-mini-finetune-question-detection`` | English | Please refer to: | +| | | `shahrukhx01/bert-mini-finetune-question-detection`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``GroNLP/bert-base-dutch-cased`` | Dutch | Please refer to: | +| | | `GroNLP/bert-base-dutch-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``SpanBERT/spanbert-base-cased`` | English | Please refer to: | +| | | `SpanBERT/spanbert-base-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-italian-uncased`` | Italian | Please refer to: | +| | | `dbmdz/bert-base-italian-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-german-cased`` | Germanh | Please refer to: | +| | | `dbmdz/bert-base-german-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cl-tohoku/bert-large-japanese`` | Japanese | Please refer to: | +| | | `cl-tohoku/bert-large-japanese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hfl/chinese-bert-wwm`` | Chinese | Please refer to: | +| | | `hfl/chinese-bert-wwm`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hfl/chinese-macbert-large`` | Chinese | Please refer to: | +| | | `hfl/chinese-macbert-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dslim/bert-base-NER-uncased`` | English | Please refer to: | +| | | 
`dslim/bert-base-NER-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``amberoad/bert-multilingual-passage-reranking-msmarco`` | Multilingual | Please refer to: | +| | | `amberoad/bert-multilingual-passage-reranking-msmarco`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``aubmindlab/bert-base-arabertv02`` | Arabic | Please refer to: | +| | | `aubmindlab/bert-base-arabertv02`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/bert_uncased_L-4_H-256_A-4`` | English | Please refer to: | +| | | `google/bert_uncased_L-4_H-256_A-4`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``DeepPavlov/rubert-base-cased-conversational`` | Russian | Please refer to: | +| | | `DeepPavlov/rubert-base-cased-conversational`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dccuchile/bert-base-spanish-wwm-cased`` | Spanish | Please refer to: | +| | | `dccuchile/bert-base-spanish-wwm-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ckiplab/bert-base-chinese-ws`` | Chinese | Please refer to: | +| | | `ckiplab/bert-base-chinese-ws`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``daigo/bert-base-japanese-sentiment`` | Japanese | Please refer to: | +| | | `daigo/bert-base-japanese-sentiment`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``SZTAKI-HLT/hubert-base-cc`` | Hungarian | Please refer to: | +| | | `SZTAKI-HLT/hubert-base-cc`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``nlpaueb/legal-bert-small-uncased`` | English | Please refer to: | +| | | `nlpaueb/legal-bert-small-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dumitrescustefan/bert-base-romanian-uncased-v1`` | Romanian | Please refer to: | +| | | `dumitrescustefan/bert-base-romanian-uncased-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/muril-base-cased`` | Indian | Please refer to: | +| | | `google/muril-base-cased`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dkleczek/bert-base-polish-uncased-v1`` | Polish | Please refer to: | +| | | `dkleczek/bert-base-polish-uncased-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ckiplab/bert-base-chinese-ner`` | Chinese | Please refer to: | +| | | `ckiplab/bert-base-chinese-ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``savasy/bert-base-turkish-sentiment-cased`` | Turkish | Please refer to: | +| | | `savasy/bert-base-turkish-sentiment-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es`` | Spanish | Please refer to: | +| | | `mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``KB/bert-base-swedish-cased-ner`` | Swedish | Please refer to: | +| | | `KB/bert-base-swedish-cased-ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hfl/rbt3`` | Chinese | Please refer to: | +| | | `hfl/rbt3`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``remotejob/gradientclassification_v0`` | English | Please refer to: | +| | | `remotejob/gradientclassification_v0`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Recognai/bert-base-spanish-wwm-cased-xnli`` | Spanish | Please refer to: | +| | | `Recognai/bert-base-spanish-wwm-cased-xnli`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``HooshvareLab/bert-fa-zwnj-base`` | Persian | Please refer to: | +| | | `HooshvareLab/bert-fa-zwnj-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``monologg/bert-base-cased-goemotions-group`` | English | Please refer to: | +| | | `monologg/bert-base-cased-goemotions-group`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``blanchefort/rubert-base-cased-sentiment`` | Russian | Please refer to: | +| | | `blanchefort/rubert-base-cased-sentiment`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``shibing624/macbert4csc-base-chinese`` | Chinese | Please refer to: | +| | | `shibing624/macbert4csc-base-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``google/bert_uncased_L-8_H-512_A-8`` | English | Please refer to: | +| | | `google/bert_uncased_L-8_H-512_A-8`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bert-large-cased-whole-word-masking`` | English | Please refer to: | +| | | `bert-large-cased-whole-word-masking`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``alvaroalon2/biobert_diseases_ner`` | English | Please refer to: | +| | | `alvaroalon2/biobert_diseases_ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``philschmid/BERT-Banking77`` | English | Please refer to: | +| | | `philschmid/BERT-Banking77`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbmdz/bert-base-turkish-uncased`` | Turkish | Please refer to: | +| | | `dbmdz/bert-base-turkish-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``vblagoje/bert-english-uncased-finetuned-pos`` | English | Please refer to: | +| | | `vblagoje/bert-english-uncased-finetuned-pos`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dumitrescustefan/bert-base-romanian-cased-v1`` | Romanian | Please refer to: | +| | | `dumitrescustefan/bert-base-romanian-cased-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``nreimers/BERT-Tiny_L-2_H-128_A-2`` | English | Please refer to: | +| | | `nreimers/BERT-Tiny_L-2_H-128_A-2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``digitalepidemiologylab/covid-twitter-bert-v2`` | English | Please refer to: | +| | | `digitalepidemiologylab/covid-twitter-bert-v2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``UBC-NLP/MARBERT`` | (DA) and MSA | Please refer to: | +| | | `UBC-NLP/MARBERT`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| 
``pierreguillou/bert-large-cased-squad-v1.1-portuguese`` | Portuguese | Please refer to: | +| | | `pierreguillou/bert-large-cased-squad-v1.1-portuguese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``alvaroalon2/biobert_genetic_ner`` | English | Please refer to: | +| | | `alvaroalon2/biobert_genetic_ner`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``bvanaken/clinical-assertion-negation-bert`` | English | Please refer to: | +| | | `bvanaken/clinical-assertion-negation-bert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cross-encoder/stsb-TinyBERT-L-4`` | English | Please refer to: | +| | | `cross-encoder/stsb-TinyBERT-L-4`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``sshleifer/tiny-distilbert-base-cased`` | English | Please refer to: | +| | | `sshleifer/tiny-distilbert-base-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ckiplab/bert-base-chinese`` | Chinese | Please refer to: | +| | | `ckiplab/bert-base-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``fabriceyhc/bert-base-uncased-amazon_polarity`` | English | Please refer to: | +| | | `fabriceyhc/bert-base-uncased-amazon_polarity`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _ckiplab/bert-base-chinese-pos: https://huggingface.co/ckiplab/bert-base-chinese-pos +.. _uer/chinese_roberta_L-12_H-768: https://huggingface.co/uer/chinese_roberta_L-12_H-768 +.. _uer/chinese_roberta_L-6_H-768: https://huggingface.co/uer/chinese_roberta_L-6_H-768 +.. _uer/chinese_roberta_L-8_H-512: https://huggingface.co/uer/chinese_roberta_L-8_H-512 +.. _uer/chinese_roberta_L-4_H-512: https://huggingface.co/uer/chinese_roberta_L-4_H-512 +.. _uer/chinese_roberta_L-4_H-256: https://huggingface.co/uer/chinese_roberta_L-4_H-256 +.. _uer/chinese_roberta_L-2_H-128: https://huggingface.co/uer/chinese_roberta_L-2_H-128 +.. _tbs17/MathBERT: https://huggingface.co/tbs17/MathBERT +.. _cross-encoder/ms-marco-MiniLM-L-12-v2: https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2 +.. _cl-tohoku/bert-base-japanese-char: https://huggingface.co/cl-tohoku/bert-base-japanese-char +.. _cl-tohoku/bert-base-japanese-whole-word-masking: https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking +.. _cl-tohoku/bert-base-japanese: https://huggingface.co/cl-tohoku/bert-base-japanese +.. _nlptown/bert-base-multilingual-uncased-sentiment: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment +.. _bert-large-uncased-whole-word-masking-finetuned-squad: https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad +.. 
_finiteautomata/beto-sentiment-analysis: https://huggingface.co/finiteautomata/beto-sentiment-analysis +.. _hfl/chinese-bert-wwm-ext: https://huggingface.co/hfl/chinese-bert-wwm-ext +.. _emilyalsentzer/Bio_ClinicalBERT: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT +.. _dslim/bert-base-NER: https://huggingface.co/dslim/bert-base-NER +.. _deepset/bert-large-uncased-whole-word-masking-squad2: https://huggingface.co/deepset/bert-large-uncased-whole-word-masking-squad2 +.. _neuralmind/bert-base-portuguese-cased: https://huggingface.co/neuralmind/bert-base-portuguese-cased +.. _SpanBERT/spanbert-large-cased: https://huggingface.co/SpanBERT/spanbert-large-cased +.. _dslim/bert-large-NER: https://huggingface.co/dslim/bert-large-NER +.. _bert-base-german-cased: https://huggingface.co/bert-base-german-cased +.. _deepset/sentence_bert: https://huggingface.co/deepset/sentence_bert +.. _ProsusAI/finbert: https://huggingface.co/ProsusAI/finbert +.. _oliverguhr/german-sentiment-bert: https://huggingface.co/oliverguhr/german-sentiment-bert +.. _google/bert_uncased_L-2_H-128_A-2: https://huggingface.co/google/bert_uncased_L-2_H-128_A-2 +.. _DeepPavlov/rubert-base-cased: https://huggingface.co/DeepPavlov/rubert-base-cased +.. _wietsedv/bert-base-dutch-cased: https://huggingface.co/wietsedv/bert-base-dutch-cased +.. _monologg/bert-base-cased-goemotions-original: https://huggingface.co/monologg/bert-base-cased-goemotions-original +.. _allenai/scibert_scivocab_uncased: https://huggingface.co/allenai/scibert_scivocab_uncased +.. _microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract +.. _dbmdz/bert-large-cased-finetuned-conll03-english: https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english +.. _microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext +.. _bert-large-uncased-whole-word-masking: https://huggingface.co/bert-large-uncased-whole-word-masking +.. _dccuchile/bert-base-spanish-wwm-uncased: https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased +.. _google/bert_uncased_L-6_H-256_A-4: https://huggingface.co/google/bert_uncased_L-6_H-256_A-4 +.. _google/bert_uncased_L-4_H-512_A-8: https://huggingface.co/google/bert_uncased_L-4_H-512_A-8 +.. _FPTAI/vibert-base-cased: https://huggingface.co/FPTAI/vibert-base-cased +.. _cointegrated/rubert-tiny: https://huggingface.co/cointegrated/rubert-tiny +.. _bert-base-german-dbmdz-uncased: https://huggingface.co/bert-base-german-dbmdz-uncased +.. _dbmdz/bert-base-turkish-128k-cased: https://huggingface.co/dbmdz/bert-base-turkish-128k-cased +.. _dbmdz/bert-base-german-uncased: https://huggingface.co/dbmdz/bert-base-german-uncased +.. _deepset/minilm-uncased-squad2: https://huggingface.co/deepset/minilm-uncased-squad2 +.. _HooshvareLab/bert-base-parsbert-uncased: https://huggingface.co/HooshvareLab/bert-base-parsbert-uncased +.. _textattack/bert-base-uncased-ag-news: https://huggingface.co/textattack/bert-base-uncased-ag-news +.. _cl-tohoku/bert-base-japanese-v2: https://huggingface.co/cl-tohoku/bert-base-japanese-v2 +.. _emilyalsentzer/Bio_Discharge_Summary_BERT: https://huggingface.co/emilyalsentzer/Bio_Discharge_Summary_BERT +.. _KoichiYasuoka/bert-base-japanese-upos: https://huggingface.co/KoichiYasuoka/bert-base-japanese-upos +.. _dbmdz/bert-base-italian-xxl-cased: https://huggingface.co/dbmdz/bert-base-italian-xxl-cased +.. 
_deepset/bert-base-cased-squad2: https://huggingface.co/deepset/bert-base-cased-squad2 +.. _beomi/kcbert-large: https://huggingface.co/beomi/kcbert-large +.. _bert-large-cased-whole-word-masking-finetuned-squad: https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad +.. _neuralmind/bert-large-portuguese-cased: https://huggingface.co/neuralmind/bert-large-portuguese-cased +.. _Luyu/co-condenser-marco: https://huggingface.co/Luyu/co-condenser-marco +.. _Sahajtomar/German_Zeroshot: https://huggingface.co/Sahajtomar/German_Zeroshot +.. _indolem/indobert-base-uncased: https://huggingface.co/indolem/indobert-base-uncased +.. _shibing624/text2vec-base-chinese: https://huggingface.co/shibing624/text2vec-base-chinese +.. _cointegrated/LaBSE-en-ru: https://huggingface.co/cointegrated/LaBSE-en-ru +.. _prithivida/parrot_fluency_on_BERT: https://huggingface.co/prithivida/parrot_fluency_on_BERT +.. _textattack/bert-base-uncased-SST-2: https://huggingface.co/textattack/bert-base-uncased-SST-2 +.. _textattack/bert-base-uncased-snli: https://huggingface.co/textattack/bert-base-uncased-snli +.. _klue/bert-base: https://huggingface.co/klue/bert-base +.. _asafaya/bert-base-arabic: https://huggingface.co/asafaya/bert-base-arabic +.. _textattack/bert-base-uncased-MRPC: https://huggingface.co/textattack/bert-base-uncased-MRPC +.. _textattack/bert-base-uncased-imdb: https://huggingface.co/textattack/bert-base-uncased-imdb +.. _cross-encoder/ms-marco-TinyBERT-L-2: https://huggingface.co/cross-encoder/ms-marco-TinyBERT-L-2 +.. _mrm8488/bert-tiny-finetuned-sms-spam-detection: https://huggingface.co/mrm8488/bert-tiny-finetuned-sms-spam-detection +.. _felflare/bert-restore-punctuation: https://huggingface.co/felflare/bert-restore-punctuation +.. _sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english: https://huggingface.co/sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english +.. _textattack/bert-base-uncased-rotten-tomatoes: https://huggingface.co/textattack/bert-base-uncased-rotten-tomatoes +.. _nlpaueb/legal-bert-base-uncased: https://huggingface.co/nlpaueb/legal-bert-base-uncased +.. _hf-internal-testing/tiny-bert-for-token-classification: https://huggingface.co/hf-internal-testing/tiny-bert-for-token-classification +.. _cointegrated/rubert-tiny2: https://huggingface.co/cointegrated/rubert-tiny2 +.. _kykim/bert-kor-base: https://huggingface.co/kykim/bert-kor-base +.. _cl-tohoku/bert-base-japanese-char-v2: https://huggingface.co/cl-tohoku/bert-base-japanese-char-v2 +.. _mrm8488/bert-small-finetuned-squadv2: https://huggingface.co/mrm8488/bert-small-finetuned-squadv2 +.. _beomi/kcbert-base: https://huggingface.co/beomi/kcbert-base +.. _textattack/bert-base-uncased-MNLI: https://huggingface.co/textattack/bert-base-uncased-MNLI +.. _textattack/bert-base-uncased-WNLI: https://huggingface.co/textattack/bert-base-uncased-WNLI +.. _dbmdz/bert-base-turkish-cased: https://huggingface.co/dbmdz/bert-base-turkish-cased +.. _huawei-noah/TinyBERT_General_4L_312D: https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D +.. _textattack/bert-base-uncased-QQP: https://huggingface.co/textattack/bert-base-uncased-QQP +.. _textattack/bert-base-uncased-STS-B: https://huggingface.co/textattack/bert-base-uncased-STS-B +.. _allenai/scibert_scivocab_cased: https://huggingface.co/allenai/scibert_scivocab_cased +.. _mrm8488/bert-medium-finetuned-squadv2: https://huggingface.co/mrm8488/bert-medium-finetuned-squadv2 +.. 
_TurkuNLP/bert-base-finnish-cased-v1: https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1 +.. _textattack/bert-base-uncased-RTE: https://huggingface.co/textattack/bert-base-uncased-RTE +.. _uer/roberta-base-chinese-extractive-qa: https://huggingface.co/uer/roberta-base-chinese-extractive-qa +.. _textattack/bert-base-uncased-QNLI: https://huggingface.co/textattack/bert-base-uncased-QNLI +.. _textattack/bert-base-uncased-CoLA: https://huggingface.co/textattack/bert-base-uncased-CoLA +.. _dmis-lab/biobert-base-cased-v1.2: https://huggingface.co/dmis-lab/biobert-base-cased-v1.2 +.. _pierreguillou/bert-base-cased-squad-v1.1-portuguese: https://huggingface.co/pierreguillou/bert-base-cased-squad-v1.1-portuguese +.. _KB/bert-base-swedish-cased: https://huggingface.co/KB/bert-base-swedish-cased +.. _uer/roberta-base-finetuned-cluener2020-chinese: https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese +.. _onlplab/alephbert-base: https://huggingface.co/onlplab/alephbert-base +.. _mrm8488/bert-spanish-cased-finetuned-ner: https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-ner +.. _alvaroalon2/biobert_chemical_ner: https://huggingface.co/alvaroalon2/biobert_chemical_ner +.. _bert-base-cased-finetuned-mrpc: https://huggingface.co/bert-base-cased-finetuned-mrpc +.. _unitary/toxic-bert: https://huggingface.co/unitary/toxic-bert +.. _nlpaueb/bert-base-greek-uncased-v1: https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1 +.. _HooshvareLab/bert-fa-base-uncased-sentiment-snappfood: https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood +.. _Maltehb/danish-bert-botxo: https://huggingface.co/Maltehb/danish-bert-botxo +.. _shahrukhx01/bert-mini-finetune-question-detection: https://huggingface.co/shahrukhx01/bert-mini-finetune-question-detection +.. _GroNLP/bert-base-dutch-cased: https://huggingface.co/GroNLP/bert-base-dutch-cased +.. _SpanBERT/spanbert-base-cased: https://huggingface.co/SpanBERT/spanbert-base-cased +.. _dbmdz/bert-base-italian-uncased: https://huggingface.co/dbmdz/bert-base-italian-uncased +.. _dbmdz/bert-base-german-cased: https://huggingface.co/dbmdz/bert-base-german-cased +.. _cl-tohoku/bert-large-japanese: https://huggingface.co/cl-tohoku/bert-large-japanese +.. _hfl/chinese-bert-wwm: https://huggingface.co/hfl/chinese-bert-wwm +.. _hfl/chinese-macbert-large: https://huggingface.co/hfl/chinese-macbert-large +.. _dslim/bert-base-NER-uncased: https://huggingface.co/dslim/bert-base-NER-uncased +.. _amberoad/bert-multilingual-passage-reranking-msmarco: https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco +.. _aubmindlab/bert-base-arabertv02: https://huggingface.co/aubmindlab/bert-base-arabertv02 +.. _google/bert_uncased_L-4_H-256_A-4: https://huggingface.co/google/bert_uncased_L-4_H-256_A-4 +.. _DeepPavlov/rubert-base-cased-conversational: https://huggingface.co/DeepPavlov/rubert-base-cased-conversational +.. _dccuchile/bert-base-spanish-wwm-cased: https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased +.. _ckiplab/bert-base-chinese-ws: https://huggingface.co/ckiplab/bert-base-chinese-ws +.. _daigo/bert-base-japanese-sentiment: https://huggingface.co/daigo/bert-base-japanese-sentiment +.. _SZTAKI-HLT/hubert-base-cc: https://huggingface.co/SZTAKI-HLT/hubert-base-cc +.. _nlpaueb/legal-bert-small-uncased: https://huggingface.co/nlpaueb/legal-bert-small-uncased +.. _dumitrescustefan/bert-base-romanian-uncased-v1: https://huggingface.co/dumitrescustefan/bert-base-romanian-uncased-v1 +.. 
_google/muril-base-cased: https://huggingface.co/google/muril-base-cased +.. _dkleczek/bert-base-polish-uncased-v1: https://huggingface.co/dkleczek/bert-base-polish-uncased-v1 +.. _ckiplab/bert-base-chinese-ner: https://huggingface.co/ckiplab/bert-base-chinese-ner +.. _savasy/bert-base-turkish-sentiment-cased: https://huggingface.co/savasy/bert-base-turkish-sentiment-cased +.. _mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es: https://huggingface.co/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es +.. _KB/bert-base-swedish-cased-ner: https://huggingface.co/KB/bert-base-swedish-cased-ner +.. _hfl/rbt3: https://huggingface.co/hfl/rbt3 +.. _remotejob/gradientclassification_v0: https://huggingface.co/remotejob/gradientclassification_v0 +.. _Recognai/bert-base-spanish-wwm-cased-xnli: https://huggingface.co/Recognai/bert-base-spanish-wwm-cased-xnli +.. _HooshvareLab/bert-fa-zwnj-base: https://huggingface.co/HooshvareLab/bert-fa-zwnj-base +.. _monologg/bert-base-cased-goemotions-group: https://huggingface.co/monologg/bert-base-cased-goemotions-group +.. _blanchefort/rubert-base-cased-sentiment: https://huggingface.co/blanchefort/rubert-base-cased-sentiment +.. _shibing624/macbert4csc-base-chinese: https://huggingface.co/shibing624/macbert4csc-base-chinese +.. _google/bert_uncased_L-8_H-512_A-8: https://huggingface.co/google/bert_uncased_L-8_H-512_A-8 +.. _bert-large-cased-whole-word-masking: https://huggingface.co/bert-large-cased-whole-word-masking +.. _alvaroalon2/biobert_diseases_ner: https://huggingface.co/alvaroalon2/biobert_diseases_ner +.. _philschmid/BERT-Banking77: https://huggingface.co/philschmid/BERT-Banking77 +.. _dbmdz/bert-base-turkish-uncased: https://huggingface.co/dbmdz/bert-base-turkish-uncased +.. _vblagoje/bert-english-uncased-finetuned-pos: https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos +.. _dumitrescustefan/bert-base-romanian-cased-v1: https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1 +.. _nreimers/BERT-Tiny_L-2_H-128_A-2: https://huggingface.co/nreimers/BERT-Tiny_L-2_H-128_A-2 +.. _digitalepidemiologylab/covid-twitter-bert-v2: https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2 +.. _UBC-NLP/MARBERT: https://huggingface.co/UBC-NLP/MARBERT +.. _pierreguillou/bert-large-cased-squad-v1.1-portuguese: https://huggingface.co/pierreguillou/bert-large-cased-squad-v1.1-portuguese +.. _alvaroalon2/biobert_genetic_ner: https://huggingface.co/alvaroalon2/biobert_genetic_ner +.. _bvanaken/clinical-assertion-negation-bert: https://huggingface.co/bvanaken/clinical-assertion-negation-bert +.. _cross-encoder/stsb-TinyBERT-L-4: https://huggingface.co/cross-encoder/stsb-TinyBERT-L-4 +.. _sshleifer/tiny-distilbert-base-cased: https://huggingface.co/sshleifer/tiny-distilbert-base-cased +.. _ckiplab/bert-base-chinese: https://huggingface.co/ckiplab/bert-base-chinese +.. 
_fabriceyhc/bert-base-uncased-amazon_polarity: https://huggingface.co/fabriceyhc/bert-base-uncased-amazon_polarity diff --git a/docs/model_zoo/transformers/BigBird/contents.rst b/docs/model_zoo/transformers/BigBird/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..ba2c57911fb39a6e2cf65196375b5a829620ea53 --- /dev/null +++ b/docs/model_zoo/transformers/BigBird/contents.rst @@ -0,0 +1,18 @@ + + +------------------------------------ +BigBird模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的BigBird模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``bigbird-base-uncased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 127M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/Blenderbot-Small/contents.rst b/docs/model_zoo/transformers/Blenderbot-Small/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..d348b9059da88ca2ebe354b0a2e0b7fe90470f1e --- /dev/null +++ b/docs/model_zoo/transformers/Blenderbot-Small/contents.rst @@ -0,0 +1,18 @@ + + +------------------------------------ +Blenderbot-Small模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的Blenderbot-Small模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``blenderbot_small-90M`` | English | 16-layer, | +| | | 16-heads, 90M parameters. | +| | | The Blenderbot small model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/Blenderbot/contents.rst b/docs/model_zoo/transformers/Blenderbot/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..e4948268f5c901431d0dc76d38ef27e31ebcd5a2 --- /dev/null +++ b/docs/model_zoo/transformers/Blenderbot/contents.rst @@ -0,0 +1,26 @@ + + +------------------------------------ +Blenderbot模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的Blenderbot模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``blenderbot-3B`` | English | 26-layer, | +| | | 32-heads, 3B parameters. | +| | | The Blenderbot base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``blenderbot-400M-distill`` | English | 14-layer, 384-hidden, | +| | | 32-heads, 400M parameters. | +| | | The Blenderbot distil model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``blenderbot-1B-distill`` | English | 14-layer, | +| | | 32-heads, 1478M parameters. | +| | | The Blenderbot distil 1B model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/CTRL/contents.rst b/docs/model_zoo/transformers/CTRL/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..6f57970b2119d58dc9d33fa72a8b494b153ff1bd --- /dev/null +++ b/docs/model_zoo/transformers/CTRL/contents.rst @@ -0,0 +1,21 @@ + + +------------------------------------ +CTRL模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的CTRL模型对应预训练权重。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ctrl`` | English | 48-layer, 1280-hidden, | +| | | 16-heads, 1701M parameters. | +| | | The CTRL base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sshleifer-tiny-ctrl`` | English | 2-layer, 16-hidden, | +| | | 2-heads, 5M parameters. | +| | | The Tiny CTRL model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/ChineseBert/contents.rst b/docs/model_zoo/transformers/ChineseBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..cdacb2ec1e0722fb2f7c3d560a79e141cbaaa96e --- /dev/null +++ b/docs/model_zoo/transformers/ChineseBert/contents.rst @@ -0,0 +1,23 @@ + + +------------------------------------ +ChineseBert模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ChineseBert模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ChineseBERT-base`` | Chinese | For details, please refer to: | +| | | ChineseBERT-base_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ChineseBERT-large`` | Chinese | For details, please refer to: | +| | | ChineseBERT-large_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _ChineseBERT-base: https://huggingface.co/ShannonAI/ChineseBERT-base +.. _ChineseBERT-large: https://huggingface.co/ShannonAI/ChineseBERT-large diff --git a/docs/model_zoo/transformers/ConvBert/contents.rst b/docs/model_zoo/transformers/ConvBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..889f2e5da446b9221514e11a123644d41ef25b6d --- /dev/null +++ b/docs/model_zoo/transformers/ConvBert/contents.rst @@ -0,0 +1,26 @@ + + +------------------------------------ +ConvBERT模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ConvBERT模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``convbert-base`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 106M parameters. | +| | | The ConvBERT base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``convbert-medium-small`` | English | 12-layer, 384-hidden, | +| | | 8-heads, 17M parameters. | +| | | The ConvBERT medium small model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``convbert-small`` | English | 12-layer, 128-hidden, | +| | | 4-heads, 13M parameters. | +| | | The ConvBERT small model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/DistilBert/contents.rst b/docs/model_zoo/transformers/DistilBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..06e2a8bdc77a3776be71dc2958d852fa7c85d2e5 --- /dev/null +++ b/docs/model_zoo/transformers/DistilBert/contents.rst @@ -0,0 +1,42 @@ + + +------------------------------------ +DistilBERT模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的DistilBERT模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``distilbert-base-uncased`` | English | 6-layer, 768-hidden, | +| | | 12-heads, 66M parameters. | +| | | The DistilBERT model distilled from | +| | | the BERT model ``bert-base-uncased``. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``distilbert-base-cased`` | English | 6-layer, 768-hidden, | +| | | 12-heads, 66M parameters. | +| | | The DistilBERT model distilled from | +| | | the BERT model ``bert-base-cased``. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``distilbert-base-multilingual-cased`` | Multilingual | 6-layer, 768-hidden, 12-heads, | +| | | 200M parameters. The DistilBERT model | +| | | distilled from the BERT model | +| | | ``bert-base-multilingual-cased``. | +| | | | +| | | Please refer to: | +| | | `distilbert-base-multilingual-cased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english`` | English | 2-layer, 2-hidden, | +| | | 2-heads, 50K parameters. | +| | | The DistilBERT model. | +| | | | +| | | Please refer to: | +| | | `sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _distilbert-base-multilingual-cased: https://huggingface.co/distilbert-base-multilingual-cased +.. 
_sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english: https://huggingface.co/sshleifer/tiny-distilbert-base-uncased-finetuned-sst-2-english diff --git a/docs/model_zoo/transformers/ELECTRA/contents.rst b/docs/model_zoo/transformers/ELECTRA/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..0176aac948fd5738ad221a7a5e47af22c8883061 --- /dev/null +++ b/docs/model_zoo/transformers/ELECTRA/contents.rst @@ -0,0 +1,63 @@ + + +------------------------------------ +ELECTRA模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ELECTRA模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``electra-small`` | English | 12-layer, 768-hidden, | +| | | 4-heads, 14M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``electra-base`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 109M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``electra-large`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 334M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chinese-electra-small`` | Chinese | 12-layer, 768-hidden, | +| | | 4-heads, 12M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chinese-electra-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-health-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on Chinese medical corpus. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``hfl/chinese-electra-180g-base-discriminator`` | Chinese | Discriminator, 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on 180g Chinese text. 
| +| | | | +| | | Please refer to: | +| | | `hfl/chinese-electra-180g-base-discriminator`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``hfl/chinese-electra-180g-small-ex-discriminator`` | Chinese | Discriminator, 24-layer, 256-hidden, | +| | | 4-heads, 24M parameters. | +| | | Trained on 180g Chinese text. | +| | | | +| | | Please refer to: | +| | | `hfl/chinese-electra-180g-small-ex-discriminator`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``hfl/chinese-legal-electra-small-generator`` | Chinese | Generator, 12-layer, 64-hidden, | +| | | 1-heads, 3M parameters. | +| | | Trained on Chinese legal corpus. | +| | | | +| | | Please refer to: | +| | | `hfl/chinese-legal-electra-small-generator`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _hfl/chinese-electra-180g-base-discriminator: https://huggingface.co/hfl/chinese-electra-180g-base-discriminator +.. _hfl/chinese-electra-180g-small-ex-discriminator: https://huggingface.co/hfl/chinese-electra-180g-small-ex-discriminator +.. _hfl/chinese-legal-electra-small-generator: https://huggingface.co/hfl/chinese-legal-electra-small-generator diff --git a/docs/model_zoo/transformers/ERNIE-CTM/contents.rst b/docs/model_zoo/transformers/ERNIE-CTM/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..c0e5719585ac966d2c8a3f3edacbd667df267c37 --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE-CTM/contents.rst @@ -0,0 +1,31 @@ + + +------------------------------------ +ERNIE-CTM模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE-CTM模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-ctm`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, _M parameters. | +| | | For details, please refer to the | +| | | ernie-ctm_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``wordtag`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, _M parameters. | +| | | For details, please refer to the | +| | | ernie-ctm_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nptag`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, _M parameters. | +| | | For details, please refer to the | +| | | ernie-ctm_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. 
_ernie-ctm: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_to_knowledge/ernie-ctm \ No newline at end of file diff --git a/docs/model_zoo/transformers/ERNIE-DOC/contents.rst b/docs/model_zoo/transformers/ERNIE-DOC/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..905708628d8c83c2bb9695a00cc62a92ab930744 --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE-DOC/contents.rst @@ -0,0 +1,22 @@ + + +------------------------------------ +ERNIE-DOC模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE-DOC模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-doc-base-zh`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-doc-base-en`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 103M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/ERNIE-GEN/contents.rst b/docs/model_zoo/transformers/ERNIE-GEN/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..a0822f91379e022cf1581c9f6fdebe19aa68f7b0 --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE-GEN/contents.rst @@ -0,0 +1,29 @@ + + +------------------------------------ +ERNIE-GEN模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE-GEN模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-gen-base-en`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on lower-cased English text. | +| | | | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-gen-large-en`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on lower-cased English text. | +| | | | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-gen-large-en-430g`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on lower-cased English text. | +| | | with extended data (430 GB). 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/ERNIE-GRAM/contents.rst b/docs/model_zoo/transformers/ERNIE-GRAM/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..2ed80d0e35e66d3b87f352f373ae8687aee66e9b --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE-GRAM/contents.rst @@ -0,0 +1,24 @@ + + +------------------------------------ +ERNIE-GRAM模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE-GRAM模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-gram-zh`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text. | +| | | | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-gram-zh-finetuned-dureader-robust`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text. | +| | | Then finetuned on DuReader-robust. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/ERNIE-M/contents.rst b/docs/model_zoo/transformers/ERNIE-M/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..62493fed917ed883d4fd67e6fe80945e7a925eff --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE-M/contents.rst @@ -0,0 +1,23 @@ + + +------------------------------------ +ERNIE-M模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE-M模型对应预训练权重。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-m-base`` | Multilingual | 12-layer, 768-hidden, | +| | | 12-heads, _M parameters. | +| | | Trained on pseudo-parallel sentence | +| | | pairs on a monolingual corpus. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-m-large`` | Multilingual | 24-layer, 1024-hidden, | +| | | 16-heads, _M parameters. | +| | | Trained on pseudo-parallel sentence | +| | | pairs on a monolingual corpus. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/ERNIE/contents.rst b/docs/model_zoo/transformers/ERNIE/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..5bab4d1dc2eedba51c3903c52b0baaf9cdcf8649 --- /dev/null +++ b/docs/model_zoo/transformers/ERNIE/contents.rst @@ -0,0 +1,127 @@ + + +------------------------------------ +ERNIE模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ERNIE模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``ernie-1.0-base-zh`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-1.0-base-zh-cw`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 118M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-1.0-large-zh-cw`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, 272M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-tiny`` | Chinese | 3-layer, 1024-hidden, | +| | | 16-heads, _M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-2.0-base-en`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 103M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-2.0-base-en-finetuned-squad`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 110M parameters. | +| | | Trained on finetuned squad text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-2.0-large-en`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on lower-cased English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-xbase-zh`` | Chinese | 20-layer, 1024-hidden, | +| | | 16-heads, 296M parameters. | +| | | Trained on Chinese text. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-base-zh`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 118M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-medium-zh`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 75M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-mini-zh`` | Chinese | 6-layer, 384-hidden, | +| | | 12-heads, 27M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-micro-zh`` | Chinese | 4-layer, 384-hidden, | +| | | 12-heads, 23M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ernie-3.0-nano-zh`` | Chinese | 4-layer, 312-hidden, | +| | | 12-heads, 18M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-base-cross-encoder`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 118M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-medium-cross-encoder`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 75M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-mini-cross-encoder`` | Chinese | 6-layer, 384-hidden, | +| | | 12-heads, 27M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-micro-cross-encoder`` | Chinese | 4-layer, 384-hidden, | +| | | 12-heads, 23M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-nano-cross-encoder`` | Chinese | 4-layer, 312-hidden, | +| | | 12-heads, 18M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-base-query-encoder`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 118M parameters. 
| +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-base-para-encoder`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 118M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-medium-query-encoder`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 75M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-medium-para-encoder`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 75M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-mini-query-encoder`` | Chinese | 6-layer, 384-hidden, | +| | | 12-heads, 27M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-mini-para-encoder`` | Chinese | 6-layer, 384-hidden, | +| | | 12-heads, 27M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-micro-query-encoder`` | Chinese | 4-layer, 384-hidden, | +| | | 12-heads, 23M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-micro-para-encoder`` | Chinese | 4-layer, 384-hidden, | +| | | 12-heads, 23M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-nano-query-encoder`` | Chinese | 4-layer, 312-hidden, | +| | | 12-heads, 18M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``rocketqa-zh-nano-para-encoder`` | Chinese | 4-layer, 312-hidden, | +| | | 12-heads, 18M parameters. | +| | | Trained on DuReader retrieval text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +.. 
_zhui/ernie-1.0-cluecorpussmall: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/community/zhui/ernie-1.0-cluecorpussmall diff --git a/docs/model_zoo/transformers/FNet/contents.rst b/docs/model_zoo/transformers/FNet/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..2fa95029adc7e97a39658d1b43918f5dafa38f56 --- /dev/null +++ b/docs/model_zoo/transformers/FNet/contents.rst @@ -0,0 +1,22 @@ + + +------------------------------------ +FNet模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的FNet模型对应预训练权重。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``fnet-base`` | English | For details, please refer to: | +| | | `google/fnet-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``fnet-large`` | English | For details, please refer to: | +| | | `google/fnet-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _google/fnet-base: https://huggingface.co/google/fnet-base +.. _google/fnet-large: https://huggingface.co/google/fnet-large \ No newline at end of file diff --git a/docs/model_zoo/transformers/Funnel/contents.rst b/docs/model_zoo/transformers/Funnel/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..89b192f17152741e79dfc2b9137b9659b32aef3d --- /dev/null +++ b/docs/model_zoo/transformers/Funnel/contents.rst @@ -0,0 +1,55 @@ + + +------------------------------------ +Funnel模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的Funnel模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``funnel-transformer/small`` | English | For details, please refer to: | +| | | `funnel-transformer/small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/small-base`` | English | For details, please refer to: | +| | | `funnel-transformer/small-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/meduim`` | English | For details, please refer to: | +| | | `funnel-transformer/meduim`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/meduim-base`` | English | For details, please 
refer to: | +| | | `funnel-transformer/meduim-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/intermediate`` | English | For details, please refer to: | +| | | `funnel-transformer/intermediate`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/intermediate-base`` | English | For details, please refer to: | +| | | `funnel-transformer/intermediate-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/large`` | English | For details, please refer to: | +| | | `funnel-transformer/large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/large-base`` | English | For details, please refer to: | +| | | `funnel-transformer/large-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/xlarge`` | English | For details, please refer to: | +| | | `funnel-transformer/xlarge`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``funnel-transformer/xlarge-base`` | English | For details, please refer to: | +| | | `funnel-transformer/xlarge-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _funnel-transformer/small: https://huggingface.co/funnel-transformer/small +.. _funnel-transformer/small-base: https://huggingface.co/funnel-transformer/small-base +.. _funnel-transformer/meduim: https://huggingface.co/funnel-transformer/medium +.. _funnel-transformer/meduim-base: https://huggingface.co/funnel-transformer/medium-base +.. _funnel-transformer/intermediate: https://huggingface.co/funnel-transformer/intermediate +.. _funnel-transformer/intermediate-base: https://huggingface.co/funnel-transformer/intermediate-base +.. _funnel-transformer/large: https://huggingface.co/funnel-transformer/large +.. _funnel-transformer/large-base: https://huggingface.co/funnel-transformer/large-base +.. _funnel-transformer/xlarge: https://huggingface.co/funnel-transformer/xlarge +.. 
_funnel-transformer/xlarge-base: https://huggingface.co/funnel-transformer/xlarge-base \ No newline at end of file diff --git a/docs/model_zoo/transformers/GPT/contents.rst b/docs/model_zoo/transformers/GPT/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..a5b802387d9f9564133d2289c4e91783a0279528 --- /dev/null +++ b/docs/model_zoo/transformers/GPT/contents.rst @@ -0,0 +1,315 @@ + + +------------------------------------ +GPT模型汇总 +------------------------------------ + + +下表汇总介绍了目前PaddleNLP支持的GPT模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``gpt-cpm-large-cn`` | Chinese | 32-layer, 2560-hidden, | +| | | 32-heads, 2.6B parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``gpt-cpm-small-cn-distill`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 109M parameters. | +| | | The model distilled from | +| | | the GPT model ``gpt-cpm-large-cn`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``gpt2-en`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 117M parameters. | +| | | Trained on English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``gpt2-medium-en`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 345M parameters. | +| | | Trained on English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``gpt2-large-en`` | English | 36-layer, 1280-hidden, | +| | | 20-heads, 774M parameters. | +| | | Trained on English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``gpt2-xl-en`` | English | 48-layer, 1600-hidden, | +| | | 25-heads, 1558M parameters. | +| | | Trained on English text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``microsoft/DialoGPT-medium`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 354M parameters. | +| | | Trained on English text. | +| | | | +| | | Please refer to: | +| | | `microsoft/DialoGPT-medium`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``uer/gpt2-chinese-poem`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 103M parameters. | +| | | Trained on Chinese poetry corpus. 
| +| | | | +| | | Please refer to: | +| | | `uer/gpt2-chinese-poem`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``distilgpt2`` | English | Please refer to: | +| | | `distilgpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``w11wo/javanese-gpt2-small-imdb`` | Javanese | Please refer to: | +| | | `w11wo/javanese-gpt2-small-imdb`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``remotejob/tweetsDISTILGPT2fi_v4`` | English | Please refer to: | +| | | `remotejob/tweetsDISTILGPT2fi_v4`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``TrLOX/gpt2-tdk`` | English | Please refer to: | +| | | `TrLOX/gpt2-tdk`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``huggingtweets/slime_machine`` | English | Please refer to: | +| | | `huggingtweets/slime_machine`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialoGPT-small`` | English | Please refer to: | +| | | `microsoft/DialoGPT-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``sberbank-ai/rugpt3large_based_on_gpt2`` | Russian | Please refer to: | +| | | `sberbank-ai/rugpt3large_based_on_gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``sshleifer/tiny-gpt2`` | English | Please refer to: | +| | | `sshleifer/tiny-gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialoGPT-large`` | English | Please refer to: | +| | | `microsoft/DialoGPT-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``sberbank-ai/rugpt3small_based_on_gpt2`` | Russian | Please refer to: | +| | | `sberbank-ai/rugpt3small_based_on_gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``uw-hai/polyjuice`` | English | Please refer to: | +| | | `uw-hai/polyjuice`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``NYTK/text-generation-poem-petofi-gpt2-small-hungarian`` | Hungarian | Please refer to: | +| | | `NYTK/text-generation-poem-petofi-gpt2-small-hungarian`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialogRPT-human-vs-rand`` | English | Please refer to: | +| | | `microsoft/DialogRPT-human-vs-rand`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``hf-internal-testing/tiny-random-gpt2`` | English | Please refer to: | +| | | `hf-internal-testing/tiny-random-gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Grossmend/rudialogpt3_medium_based_on_gpt2`` | Russian | Please refer to: | +| | | `Grossmend/rudialogpt3_medium_based_on_gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``pranavpsv/genre-story-generator-v2`` | English | Please refer to: | +| | | `pranavpsv/genre-story-generator-v2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialogRPT-updown`` | English | Please refer to: | +| | | `microsoft/DialogRPT-updown`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialogRPT-human-vs-machine`` | English | Please refer to: | +| | | `microsoft/DialogRPT-human-vs-machine`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``pierreguillou/gpt2-small-portuguese`` | Portuguese | Please refer to: | +| | | `pierreguillou/gpt2-small-portuguese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/GPT-2-finetuned-covid-bio-medrxiv`` | English | Please refer to: | +| | | `mrm8488/GPT-2-finetuned-covid-bio-medrxiv`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``anonymous-german-nlp/german-gpt2`` | German | Please refer to: | +| | | `anonymous-german-nlp/german-gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/CodeGPT-small-py`` | English | Please refer to: | +| | | `microsoft/CodeGPT-small-py`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``antoiloui/belgpt2`` | French | Please refer to: | +| | | `antoiloui/belgpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``benjamin/gerpt2`` | German | Please refer to: | +| | 
| `benjamin/gerpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``asi/gpt-fr-cased-small`` | French | Please refer to: | +| | | `asi/gpt-fr-cased-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/CodeGPT-small-java-adaptedGPT2`` | English | Please refer to: | +| | | `microsoft/CodeGPT-small-java-adaptedGPT2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``GroNLP/gpt2-small-dutch`` | Dutch | Please refer to: | +| | | `GroNLP/gpt2-small-dutch`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``lvwerra/gpt2-imdb`` | English | Please refer to: | +| | | `lvwerra/gpt2-imdb`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``DeepESP/gpt2-spanish`` | Spanish | Please refer to: | +| | | `DeepESP/gpt2-spanish`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/CodeGPT-small-py-adaptedGPT2`` | English | Please refer to: | +| | | `microsoft/CodeGPT-small-py-adaptedGPT2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialogRPT-width`` | English | Please refer to: | +| | | `microsoft/DialogRPT-width`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``dbddv01/gpt2-french-small`` | French | Please refer to: | +| | | `dbddv01/gpt2-french-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``GroNLP/gpt2-small-italian`` | Italian | Please refer to: | +| | | `GroNLP/gpt2-small-italian`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``flax-community/gpt2-medium-persian`` | Persian | Please refer to: | +| | | `flax-community/gpt2-medium-persian`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``microsoft/DialogRPT-depth`` | English | Please refer to: | +| | | `microsoft/DialogRPT-depth`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``Nokia/nlgp-natural`` | English | Please refer to: | +| | | `Nokia/nlgp-natural`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``macedonizer/hr-gpt2`` | English | Please refer to: | +| | | `macedonizer/hr-gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``mrm8488/GPT-2-finetuned-common_gen`` | English | Please refer to: | +| | | `mrm8488/GPT-2-finetuned-common_gen`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``pranavpsv/gpt2-genre-story-generator`` | English | Please refer to: | +| | | `pranavpsv/gpt2-genre-story-generator`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``rbhushan/distilgpt2-finetuned-wikitext2`` | English | Please refer to: | +| | | `rbhushan/distilgpt2-finetuned-wikitext2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``readerbench/RoGPT2-large`` | Romanian | Please refer to: | +| | | `readerbench/RoGPT2-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``flax-community/gpt2-small-indonesian`` | Indonesian | Please refer to: | +| | | `flax-community/gpt2-small-indonesian`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``HooshvareLab/gpt2-fa`` | Persian | Please refer to: | +| | | `HooshvareLab/gpt2-fa`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cahya/gpt2-small-indonesian-522M`` | Indonesian | Please refer to: | +| | | `cahya/gpt2-small-indonesian-522M`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``DingleyMaillotUrgell/homer-bot`` | English | Please refer to: | +| | | `DingleyMaillotUrgell/homer-bot`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``datificate/gpt2-small-spanish`` | Spanish | Please refer to: | +| | | `datificate/gpt2-small-spanish`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ericzhou/tsundere_v1`` | English | Please refer to: | +| | | `ericzhou/tsundere_v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``huggingtweets/wwm_shakespeare`` | English | Please refer to: | +| | | `huggingtweets/wwm_shakespeare`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``SIC98/GPT2-python-code-generator`` | English | Please refer to: | +| | | `SIC98/GPT2-python-code-generator`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``GroNLP/gpt2-small-italian-embeddings`` | Italian | Please refer to: | +| | | `GroNLP/gpt2-small-italian-embeddings`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-witheredstrings`` | English | Please refer to: | +| | | `huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-witheredstrings`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``salesken/grammar_correction`` | English | Please refer to: | +| | | `salesken/grammar_correction`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``flax-community/gpt2-medium-indonesian`` | Indonesian | Please refer to: | +| | | `flax-community/gpt2-medium-indonesian`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``gorkemgoknar/gpt2-small-turkish`` | Turkish | Please refer to: | +| | | `gorkemgoknar/gpt2-small-turkish`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``deepparag/DumBot`` | English | Please refer to: | +| | | `deepparag/DumBot`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``jcblaise/gpt2-tagalog`` | Tagalog | Please refer to: | +| | | `jcblaise/gpt2-tagalog`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``BigSalmon/InformalToFormalLincoln21`` | English | Please refer to: | +| | | `BigSalmon/InformalToFormalLincoln21`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``LorenzoDeMattei/GePpeTto`` | English | Please refer to: | +| | | `LorenzoDeMattei/GePpeTto`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``macedonizer/sr-gpt2`` | English | Please refer to: | +| | | `macedonizer/sr-gpt2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``indonesian-nlp/gpt2`` | English | Please refer to: | +| | | `indonesian-nlp/gpt2`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``ceostroff/harry-potter-gpt2-fanfiction`` | English | Please refer to: | +| | | `ceostroff/harry-potter-gpt2-fanfiction`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``akhooli/gpt2-small-arabic-poetry`` | Arabic | Please refer to: | +| | | `akhooli/gpt2-small-arabic-poetry`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``asi/gpt-fr-cased-base`` | French | Please refer to: | +| | | `asi/gpt-fr-cased-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``congcongwang/gpt2_medium_fine_tuned_coder`` | English | Please refer to: | +| | | `congcongwang/gpt2_medium_fine_tuned_coder`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| ``cambridgeltl/simctg_wikitext103`` | English | Please refer to: | +| | | `cambridgeltl/simctg_wikitext103`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _microsoft/DialoGPT-medium: https://huggingface.co/microsoft/DialoGPT-medium +.. _uer/gpt2-chinese-poem: https://huggingface.co/uer/gpt2-chinese-poem +.. _distilgpt2: https://huggingface.co/distilgpt2 +.. _w11wo/javanese-gpt2-small-imdb: https://huggingface.co/w11wo/javanese-gpt2-small-imdb +.. _remotejob/tweetsDISTILGPT2fi_v4: https://huggingface.co/remotejob/tweetsDISTILGPT2fi_v4 +.. _TrLOX/gpt2-tdk: https://huggingface.co/TrLOX/gpt2-tdk +.. _huggingtweets/slime_machine: https://huggingface.co/huggingtweets/slime_machine +.. _microsoft/DialoGPT-small: https://huggingface.co/microsoft/DialoGPT-small +.. _sberbank-ai/rugpt3large_based_on_gpt2: https://huggingface.co/sberbank-ai/rugpt3large_based_on_gpt2 +.. _sshleifer/tiny-gpt2: https://huggingface.co/sshleifer/tiny-gpt2 +.. _microsoft/DialoGPT-large: https://huggingface.co/microsoft/DialoGPT-large +.. _sberbank-ai/rugpt3small_based_on_gpt2: https://huggingface.co/sberbank-ai/rugpt3small_based_on_gpt2 +.. _uw-hai/polyjuice: https://huggingface.co/uw-hai/polyjuice +.. _NYTK/text-generation-poem-petofi-gpt2-small-hungarian: https://huggingface.co/NYTK/text-generation-poem-petofi-gpt2-small-hungarian +.. _microsoft/DialogRPT-human-vs-rand: https://huggingface.co/microsoft/DialogRPT-human-vs-rand +.. _hf-internal-testing/tiny-random-gpt2: https://huggingface.co/hf-internal-testing/tiny-random-gpt2 +.. _Grossmend/rudialogpt3_medium_based_on_gpt2: https://huggingface.co/Grossmend/rudialogpt3_medium_based_on_gpt2 +.. _pranavpsv/genre-story-generator-v2: https://huggingface.co/pranavpsv/genre-story-generator-v2 +.. _microsoft/DialogRPT-updown: https://huggingface.co/microsoft/DialogRPT-updown +.. _microsoft/DialogRPT-human-vs-machine: https://huggingface.co/microsoft/DialogRPT-human-vs-machine +.. 
_pierreguillou/gpt2-small-portuguese: https://huggingface.co/pierreguillou/gpt2-small-portuguese +.. _mrm8488/GPT-2-finetuned-covid-bio-medrxiv: https://huggingface.co/mrm8488/GPT-2-finetuned-covid-bio-medrxiv +.. _anonymous-german-nlp/german-gpt2: https://huggingface.co/anonymous-german-nlp/german-gpt2 +.. _microsoft/CodeGPT-small-py: https://huggingface.co/microsoft/CodeGPT-small-py +.. _antoiloui/belgpt2: https://huggingface.co/antoiloui/belgpt2 +.. _benjamin/gerpt2: https://huggingface.co/benjamin/gerpt2 +.. _asi/gpt-fr-cased-small: https://huggingface.co/asi/gpt-fr-cased-small +.. _microsoft/CodeGPT-small-java-adaptedGPT2: https://huggingface.co/microsoft/CodeGPT-small-java-adaptedGPT2 +.. _GroNLP/gpt2-small-dutch: https://huggingface.co/GroNLP/gpt2-small-dutch +.. _lvwerra/gpt2-imdb: https://huggingface.co/lvwerra/gpt2-imdb +.. _DeepESP/gpt2-spanish: https://huggingface.co/DeepESP/gpt2-spanish +.. _microsoft/CodeGPT-small-py-adaptedGPT2: https://huggingface.co/microsoft/CodeGPT-small-py-adaptedGPT2 +.. _microsoft/DialogRPT-width: https://huggingface.co/microsoft/DialogRPT-width +.. _dbddv01/gpt2-french-small: https://huggingface.co/dbddv01/gpt2-french-small +.. _GroNLP/gpt2-small-italian: https://huggingface.co/GroNLP/gpt2-small-italian +.. _flax-community/gpt2-medium-persian: https://huggingface.co/flax-community/gpt2-medium-persian +.. _microsoft/DialogRPT-depth: https://huggingface.co/microsoft/DialogRPT-depth +.. _Nokia/nlgp-natural: https://huggingface.co/Nokia/nlgp-natural +.. _macedonizer/hr-gpt2: https://huggingface.co/macedonizer/hr-gpt2 +.. _mrm8488/GPT-2-finetuned-common_gen: https://huggingface.co/mrm8488/GPT-2-finetuned-common_gen +.. _pranavpsv/gpt2-genre-story-generator: https://huggingface.co/pranavpsv/gpt2-genre-story-generator +.. _rbhushan/distilgpt2-finetuned-wikitext2: https://huggingface.co/rbhushan/distilgpt2-finetuned-wikitext2 +.. _readerbench/RoGPT2-large: https://huggingface.co/readerbench/RoGPT2-large +.. _flax-community/gpt2-small-indonesian: https://huggingface.co/flax-community/gpt2-small-indonesian +.. _HooshvareLab/gpt2-fa: https://huggingface.co/HooshvareLab/gpt2-fa +.. _cahya/gpt2-small-indonesian-522M: https://huggingface.co/cahya/gpt2-small-indonesian-522M +.. _DingleyMaillotUrgell/homer-bot: https://huggingface.co/DingleyMaillotUrgell/homer-bot +.. _datificate/gpt2-small-spanish: https://huggingface.co/datificate/gpt2-small-spanish +.. _ericzhou/tsundere_v1: https://huggingface.co/ericzhou/tsundere_v1 +.. _huggingtweets/wwm_shakespeare: https://huggingface.co/huggingtweets/wwm_shakespeare +.. _SIC98/GPT2-python-code-generator: https://huggingface.co/SIC98/GPT2-python-code-generator +.. _GroNLP/gpt2-small-italian-embeddings: https://huggingface.co/GroNLP/gpt2-small-italian-embeddings +.. _huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-witheredstrings: https://huggingface.co/huggingtweets/hel_ql-shahdashrf_-sinnerslayerr-witheredstrings +.. _salesken/grammar_correction: https://huggingface.co/salesken/grammar_correction +.. _flax-community/gpt2-medium-indonesian: https://huggingface.co/flax-community/gpt2-medium-indonesian +.. _gorkemgoknar/gpt2-small-turkish: https://huggingface.co/gorkemgoknar/gpt2-small-turkish +.. _deepparag/DumBot: https://huggingface.co/deepparag/DumBot +.. _jcblaise/gpt2-tagalog: https://huggingface.co/jcblaise/gpt2-tagalog +.. _BigSalmon/InformalToFormalLincoln21: https://huggingface.co/BigSalmon/InformalToFormalLincoln21 +.. _LorenzoDeMattei/GePpeTto: https://huggingface.co/LorenzoDeMattei/GePpeTto +.. 
_macedonizer/sr-gpt2: https://huggingface.co/macedonizer/sr-gpt2 +.. _indonesian-nlp/gpt2: https://huggingface.co/indonesian-nlp/gpt2 +.. _ceostroff/harry-potter-gpt2-fanfiction: https://huggingface.co/ceostroff/harry-potter-gpt2-fanfiction +.. _akhooli/gpt2-small-arabic-poetry: https://huggingface.co/akhooli/gpt2-small-arabic-poetry +.. _asi/gpt-fr-cased-base: https://huggingface.co/asi/gpt-fr-cased-base +.. _congcongwang/gpt2_medium_fine_tuned_coder: https://huggingface.co/congcongwang/gpt2_medium_fine_tuned_coder +.. _cambridgeltl/simctg_wikitext103: https://huggingface.co/cambridgeltl/simctg_wikitext103 \ No newline at end of file diff --git a/docs/model_zoo/transformers/LayoutLM/contents.rst b/docs/model_zoo/transformers/LayoutLM/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..47662bcb3df4d7d9979b07dd69b12dcdd92a8673 --- /dev/null +++ b/docs/model_zoo/transformers/LayoutLM/contents.rst @@ -0,0 +1,22 @@ + + +------------------------------------ +LayoutLM模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的LayoutLM模型以及对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``layoutlm-base-uncased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 339M parameters. | +| | | LayoutLm base uncased model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``layoutlm-large-uncased`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 51M parameters. | +| | | LayoutLm large Uncased model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/LayoutLMV2/contents.rst b/docs/model_zoo/transformers/LayoutLMV2/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..efecc7fbbc69c67fb1c84d8f37f30759e2696dd6 --- /dev/null +++ b/docs/model_zoo/transformers/LayoutLMV2/contents.rst @@ -0,0 +1,22 @@ + + +------------------------------------ +LayoutLMV2模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的LayoutLMV2模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``layoutlmv2-base-uncased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 200M parameters. | +| | | LayoutLmv2 base uncased model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``layoutlmv2-large-uncased`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, _M parameters. 
| +| | | LayoutLmv2 large uncased model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/LayoutXLM/contents.rst b/docs/model_zoo/transformers/LayoutXLM/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..212bb27e272f1484ea0465da2b58150984a4d551 --- /dev/null +++ b/docs/model_zoo/transformers/LayoutXLM/contents.rst @@ -0,0 +1,18 @@ + + +------------------------------------ +LayoutXLM模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的LayoutXLM模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``layoutxlm-base-uncased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 369M parameters. | +| | | Layoutxlm base uncased model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/Luke/contents.rst b/docs/model_zoo/transformers/Luke/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..396600f55c4298f9b4efd15c2dfb6bff926d2118 --- /dev/null +++ b/docs/model_zoo/transformers/Luke/contents.rst @@ -0,0 +1,23 @@ + + +------------------------------------ +Luke模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的Luke模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``luke-base`` | English | For details, please refer to: | +| | | `studio-ousia/luke-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``luke-large`` | English | For details, please refer to: | +| | | `studio-ousia/luke-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _studio-ousia/luke-base: https://huggingface.co/studio-ousia/luke-base +.. 
_studio-ousia/luke-large: https://huggingface.co/studio-ousia/luke-large diff --git a/docs/model_zoo/transformers/MBart/contents.rst b/docs/model_zoo/transformers/MBart/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..a1a432e171c7d3a50074b5f19ee36847e7815852 --- /dev/null +++ b/docs/model_zoo/transformers/MBart/contents.rst @@ -0,0 +1,39 @@ + + +------------------------------------ +MBart模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的MBart模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``mbart-large-cc25`` | English | 12-layer, 1024-hidden, | +| | | 12-heads, 1123M parameters. | +| | | The ``mbart-large-cc25`` model. | +| | | | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mbart-large-en-ro`` | English | 12-layer, 768-hidden, | +| | | 16-heads, 1123M parameters. | +| | | The ``mbart-large rn-ro`` model. | +| | | | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mbart-large-50-one-to-many-mmt`` | English | 12-layer, 1024-hidden, | +| | | 16-heads, 1123M parameters. | +| | | ``mbart-large-50-one-to-many-mmt`` | +| | | model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mbart-large-50-many-to-one-mmt`` | English | 12-layer, 1024-hidden, | +| | | 16-heads, 1123M parameters. | +| | | ``mbart-large-50-many-to-one-mmt`` | +| | | model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mbart-large-50-many-to-many-mmt`` | English | 12-layer, 1024-hidden, | +| | | 16-heads, 1123M parameters. | +| | | ``mbart-large-50-many-to-many-mmt`` | +| | | model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/MPNet/contents.rst b/docs/model_zoo/transformers/MPNet/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..2f3dd1c078dbb33f67a57e5ba9dcac9f3ea3a0a7 --- /dev/null +++ b/docs/model_zoo/transformers/MPNet/contents.rst @@ -0,0 +1,18 @@ + + +------------------------------------ +MPNet模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的MPNet模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``mpnet-base`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 109M parameters. | +| | | MPNet Base Model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/MegatronBert/contents.rst b/docs/model_zoo/transformers/MegatronBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..0d9118115858931813048406ee29d18b96314a93 --- /dev/null +++ b/docs/model_zoo/transformers/MegatronBert/contents.rst @@ -0,0 +1,23 @@ + + +------------------------------------ +MegatronBert模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的MegatronBert模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``megatronbert-cased`` | English | For details, please refer to: | +| | | `nvidia/megatron-bert-cased-345m`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``megatronbert-uncased`` | English | For details, please refer to: | +| | | `nvidia/megatron-bert-uncased-345m`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _nvidia/megatron-bert-cased-345m: https://huggingface.co/nvidia/megatron-bert-cased-345m +.. 
_nvidia/megatron-bert-uncased-345m: https://huggingface.co/nvidia/megatron-bert-uncased-345m diff --git a/docs/model_zoo/transformers/MobileBert/contents.rst b/docs/model_zoo/transformers/MobileBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..ad948ff8411899e7e912da6f33c90426accf3125 --- /dev/null +++ b/docs/model_zoo/transformers/MobileBert/contents.rst @@ -0,0 +1,19 @@ + + +------------------------------------ +MobileBert模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的MobileBert模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``mobilebert-uncased`` | English | For details, please refer to: | +| | | `google/mobilebert-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _google/mobilebert-uncased: https://huggingface.co/google/mobilebert-uncased diff --git a/docs/model_zoo/transformers/NeZha/contents.rst b/docs/model_zoo/transformers/NeZha/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..23a54aa8eb62a935a4530e864ed2050c8b5b1cf1 --- /dev/null +++ b/docs/model_zoo/transformers/NeZha/contents.rst @@ -0,0 +1,30 @@ + + +------------------------------------ +NeZha模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的NeZha模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``nezha-base-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nezha-large-chinese`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nezha-base-wwm-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 16-heads, 108M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nezha-large-wwm-chinese`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. | +| | | Trained on Chinese text. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/PPMiniLM/contents.rst b/docs/model_zoo/transformers/PPMiniLM/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..90ec6236f2c78b0ffcabd497ac05b91f290106ca --- /dev/null +++ b/docs/model_zoo/transformers/PPMiniLM/contents.rst @@ -0,0 +1,20 @@ + + +------------------------------------ +PPMiniLM模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的PPMiniLM模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +| ``ppminilm-6l-768h`` | Chinese | A Chinese characteristic small model | +| | | using multiple model compression. | +| | | Please refer to: ppminilm-6l-768h_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _ppminilm-6l-768h: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/model_compression/pp-minilm diff --git a/docs/model_zoo/transformers/ProphetNet/contents.rst b/docs/model_zoo/transformers/ProphetNet/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..0ae7d3e510dcb7e65ef8b41355a7b80847872f49 --- /dev/null +++ b/docs/model_zoo/transformers/ProphetNet/contents.rst @@ -0,0 +1,19 @@ + + +------------------------------------ +ProphetNet模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的ProphetNet模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``prophetnet-large-uncased`` | English | For details, please refer to: | +| | | `microsoft/prophetnet-large-uncased`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. 
_microsoft/prophetnet-large-uncased: https://huggingface.co/microsoft/prophetnet-large-uncased diff --git a/docs/model_zoo/transformers/Reformer/contents.rst b/docs/model_zoo/transformers/Reformer/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..c7f54848a55c3e4c5621b39d26567f1233ffc3db --- /dev/null +++ b/docs/model_zoo/transformers/Reformer/contents.rst @@ -0,0 +1,20 @@ + + +------------------------------------ +Reformer模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的Reformer模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``reformer-enwik8`` | English | 12-layer, 1024-hidden, | +| | | 8-heads, 148M parameters. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``reformer-crime-and-punishment`` | English | 6-layer, 256-hidden, | +| | | 2-heads, 3M parameters. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/RemBert/contents.rst b/docs/model_zoo/transformers/RemBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..0cd7914c1ef15415f86b9437dd332b7ccca618bc --- /dev/null +++ b/docs/model_zoo/transformers/RemBert/contents.rst @@ -0,0 +1,20 @@ + + +------------------------------------ +RemBert模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的RemBert模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``rembert`` | English | For details, please refer to the | +| | | corresponding model card of huggingface: | +| | | `google/rembert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. 
_google/rembert: https://huggingface.co/google/rembert
diff --git a/docs/model_zoo/transformers/RoBERTa/contents.rst b/docs/model_zoo/transformers/RoBERTa/contents.rst
new file mode 100644
index 0000000000000000000000000000000000000000..859bfbcde559a3e9e9a047e897246c9dd3dd5063
--- /dev/null
+++ b/docs/model_zoo/transformers/RoBERTa/contents.rst
@@ -0,0 +1,424 @@
+
+
+------------------------------------
+RoBERTa模型汇总
+------------------------------------
+
+
+下表汇总介绍了目前PaddleNLP支持的RoBERTa模型对应预训练权重。
+关于模型的具体细节可以参考对应链接。
+
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+| Pretrained Weight                                                                | Language     | Details of the model                                                             |
++==================================================================================+==============+==================================================================================+
+|``hfl/roberta-wwm-ext``                                                           | Chinese      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 102M parameters.                                                       |
+|                                                                                  |              | Trained on Chinese text using                                                    |
+|                                                                                  |              | Whole-Word-Masking with extended data.                                           |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``hfl/roberta-wwm-ext-large``                                                     | Chinese      | 24-layer, 1024-hidden,                                                           |
+|                                                                                  |              | 16-heads, 325M parameters.                                                       |
+|                                                                                  |              | Trained on Chinese text using                                                    |
+|                                                                                  |              | Whole-Word-Masking with extended data.                                           |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``hfl/rbt3``                                                                      | Chinese      | 3-layer, 768-hidden,                                                             |
+|                                                                                  |              | 12-heads, 38M parameters.                                                        |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``hfl/rbtl3``                                                                     | Chinese      | 3-layer, 1024-hidden,                                                            |
+|                                                                                  |              | 16-heads, 61M parameters.                                                        |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``hfl/rbt4``                                                                      | Chinese      | 4-layer, 768-hidden,                                                             |
+|                                                                                  |              | 12-heads, 47M parameters.                                                        |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``hfl/rbt6``                                                                      | Chinese      | 6-layer, 768-hidden,                                                             |
+|                                                                                  |              | 12-heads, 60M parameters.                                                        |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``deepset/roberta-base-squad2``                                                   | English      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 124M parameters.                                                       |
+|                                                                                  |              | Trained on English text.                                                         |
+|                                                                                  |              |                                                                                  |
+|                                                                                  |              | Please refer to:                                                                 |
+|                                                                                  |              | `deepset/roberta-base-squad2`_                                                   |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``uer/roberta-base-chinese-extractive-qa``                                        | Chinese      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 101M parameters.                                                       |
+|                                                                                  |              | Trained on Chinese text. 
| +| | | | +| | | Please refer to: | +| | | `uer/roberta-base-chinese-extractive-qa`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``uer/roberta-base-finetuned-chinanews-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 102M parameters. | +| | | Trained on Chinese text. | +| | | | +| | | Please refer to: | +| | | `uer/roberta-base-finetuned-chinanews-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``uer/roberta-base-finetuned-cluener2020-chinese`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 101M parameters. | +| | | Trained on Chinese text. | +| | | | +| | | Please refer to: | +| | | `uer/roberta-base-finetuned-cluener2020-chinese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roberta-base`` | English | Please refer to: | +| | | `roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base-sentiment`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base-sentiment`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roberta-large`` | English | Please refer to: | +| | | `roberta-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``distilroberta-base`` | English | Please refer to: | +| | | `distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/nli-distilroberta-base`` | English | Please refer to: | +| | | `cross-encoder/nli-distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``siebert/sentiment-roberta-large-english`` | English | Please refer to: | +| | | `siebert/sentiment-roberta-large-english`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``j-hartmann/emotion-english-distilroberta-base`` | English | Please refer to: | +| | | `j-hartmann/emotion-english-distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roberta-base-openai-detector`` | English | Please refer to: | +| | | `roberta-base-openai-detector`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``huggingface/CodeBERTa-small-v1`` | English | Please refer to: | 
+| | | `huggingface/CodeBERTa-small-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis`` | English | Please refer to: | +| | | `mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base-emotion`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base-emotion`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/PubChem10M_SMILES_BPE_396_250`` | English | Please refer to: | +| | | `seyonec/PubChem10M_SMILES_BPE_396_250`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-SST-2`` | English | Please refer to: | +| | | `textattack/roberta-base-SST-2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sshleifer/tiny-distilroberta-base`` | English | Please refer to: | +| | | `sshleifer/tiny-distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``thatdramebaazguy/roberta-base-squad`` | English | Please refer to: | +| | | `thatdramebaazguy/roberta-base-squad`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli`` | English | Please refer to: | +| | | `ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ufal/robeczech-base`` | Czech | Please refer to: | +| | | `ufal/robeczech-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/PubChem10M_SMILES_BPE_450k`` | English | Please refer to: | +| | | `seyonec/PubChem10M_SMILES_BPE_450k`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/PubChem10M_SMILES_BPE_50k`` | English | Please refer to: | +| | | `seyonec/PubChem10M_SMILES_BPE_50k`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``microsoft/codebert-base-mlm`` | English | Please refer to: | +| | | `microsoft/codebert-base-mlm`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-MNLI`` | English | Please refer to: | +| | | `textattack/roberta-base-MNLI`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base-offensive`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base-offensive`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/stsb-roberta-large`` | English | Please refer to: | +| | | `cross-encoder/stsb-roberta-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/ChemBERTa_zinc250k_v2_40k`` | English | Please refer to: | +| | | `seyonec/ChemBERTa_zinc250k_v2_40k`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``uklfr/gottbert-base`` | German | Please refer to: | +| | | `uklfr/gottbert-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/ChemBERTa-zinc-base-v1`` | English | Please refer to: | +| | | `seyonec/ChemBERTa-zinc-base-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roberta-large-openai-detector`` | English | Please refer to: | +| | | `roberta-large-openai-detector`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/quora-roberta-base`` | English | Please refer to: | +| | | `cross-encoder/quora-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/stsb-roberta-base`` | English | Please refer to: | +| | | `cross-encoder/stsb-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``microsoft/graphcodebert-base`` | English | Please refer to: | +| | | `microsoft/graphcodebert-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base-hate`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base-hate`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chkla/roberta-argument`` | English | Please refer to: | +| | | `chkla/roberta-argument`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Salesforce/grappa_large_jnt`` | English | Please refer to: | +| | | `Salesforce/grappa_large_jnt`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``vinai/bertweet-large`` | English | Please refer to: | +| | | `vinai/bertweet-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``allenai/biomed_roberta_base`` | English | Please refer to: | +| | | `allenai/biomed_roberta_base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``facebook/muppet-roberta-base`` | English | Please refer to: | +| | | `facebook/muppet-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Rakib/roberta-base-on-cuad`` | English | Please refer to: | +| | | `Rakib/roberta-base-on-cuad`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/stsb-distilroberta-base`` | English | Please refer to: | +| | | `cross-encoder/stsb-distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nyu-mll/roberta-base-1B-1`` | English | Please refer to: | +| | | `nyu-mll/roberta-base-1B-1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nyu-mll/roberta-med-small-1M-1`` | English | Please refer to: | +| | | `nyu-mll/roberta-med-small-1M-1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``SkolkovoInstitute/roberta_toxicity_classifier`` | English | Please refer to: | +| | | `SkolkovoInstitute/roberta_toxicity_classifier`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``facebook/muppet-roberta-large`` | English | Please refer to: | +| | | `facebook/muppet-roberta-large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``lassl/roberta-ko-small`` | Korean | Please refer to: | +| | | `lassl/roberta-ko-small`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``huggingface/CodeBERTa-language-id`` | English | Please refer to: | +| | | `huggingface/CodeBERTa-language-id`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-imdb`` | English | Please refer to: | +| | | `textattack/roberta-base-imdb`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``macedonizer/mk-roberta-base`` | Macedonian | Please refer to: | +| | | `macedonizer/mk-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/nli-MiniLM2-L6-H768`` | English | Please refer to: | +| | | `cross-encoder/nli-MiniLM2-L6-H768`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-QNLI`` | English | Please refer to: | +| | | `textattack/roberta-base-QNLI`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``deepset/roberta-base-squad2-covid`` | English | Please refer to: | +| | | `deepset/roberta-base-squad2-covid`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-MRPC`` | English | Please refer to: | +| | | `textattack/roberta-base-MRPC`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``bhadresh-savani/roberta-base-emotion`` | English | Please refer to: | +| | | `bhadresh-savani/roberta-base-emotion`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``aychang/roberta-base-imdb`` | English | Please refer to: | +| | | `aychang/roberta-base-imdb`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/quora-distilroberta-base`` | English | Please refer to: | +| | | `cross-encoder/quora-distilroberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``csarron/roberta-base-squad-v1`` | English | Please refer to: | +| | | `csarron/roberta-base-squad-v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/ChemBERTA_PubChem1M_shard00_155k`` | English | Please refer to: | +| | | 
`seyonec/ChemBERTA_PubChem1M_shard00_155k`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mental/mental-roberta-base`` | English | Please refer to: | +| | | `mental/mental-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-CoLA`` | English | Please refer to: | +| | | `textattack/roberta-base-CoLA`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``navteca/quora-roberta-base`` | English | Please refer to: | +| | | `navteca/quora-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cardiffnlp/twitter-roberta-base-emoji`` | English | Please refer to: | +| | | `cardiffnlp/twitter-roberta-base-emoji`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``benjamin/roberta-base-wechsel-german`` | Multilingual | Please refer to: | +| | | `benjamin/roberta-base-wechsel-german`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``textattack/roberta-base-ag-news`` | English | Please refer to: | +| | | `textattack/roberta-base-ag-news`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``johngiorgi/declutr-base`` | English | Please refer to: | +| | | `johngiorgi/declutr-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``salesken/query_wellformedness_score`` | English | Please refer to: | +| | | `salesken/query_wellformedness_score`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``blinoff/roberta-base-russian-v0`` | Russian | Please refer to: | +| | | `blinoff/roberta-base-russian-v0`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``allenai/reviews_roberta_base`` | English | Please refer to: | +| | | `allenai/reviews_roberta_base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ruiqi-zhong/roberta-base-meta-tuning-test`` | English | Please refer to: | +| | | `ruiqi-zhong/roberta-base-meta-tuning-test`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ 
+|``mrm8488/distilroberta-finetuned-tweets-hate-speech`` | English | Please refer to: | +| | | `mrm8488/distilroberta-finetuned-tweets-hate-speech`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cointegrated/roberta-large-cola-krishna2020`` | English | Please refer to: | +| | | `cointegrated/roberta-large-cola-krishna2020`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``deepset/roberta-base-squad2-distilled`` | English | Please refer to: | +| | | `deepset/roberta-base-squad2-distilled`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tli8hf/unqover-roberta-base-squad`` | English | Please refer to: | +| | | `tli8hf/unqover-roberta-base-squad`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cross-encoder/nli-roberta-base`` | English | Please refer to: | +| | | `cross-encoder/nli-roberta-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large`` | English | Please refer to: | +| | | `nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``seyonec/BPE_SELFIES_PubChem_shard00_160k`` | English | Please refer to: | +| | | `seyonec/BPE_SELFIES_PubChem_shard00_160k`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``CLTL/MedRoBERTa.nl`` | Dutch | Please refer to: | +| | | `CLTL/MedRoBERTa.nl`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``HooshvareLab/roberta-fa-zwnj-base`` | Persian | Please refer to: | +| | | `HooshvareLab/roberta-fa-zwnj-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nyu-mll/roberta-base-100M-1`` | English | Please refer to: | +| | | `nyu-mll/roberta-base-100M-1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``deepset/tinyroberta-squad2`` | English | Please refer to: | +| | | `deepset/tinyroberta-squad2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``youscan/ukr-roberta-base`` | Ukrainian | Please refer to: | +| | | `youscan/ukr-roberta-base`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``navteca/roberta-base-squad2`` | English | Please refer to: | +| | | `navteca/roberta-base-squad2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``bertin-project/bertin-roberta-base-spanish`` | Spanish | Please refer to: | +| | | `bertin-project/bertin-roberta-base-spanish`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``shiyue/roberta-large-tac08`` | English | Please refer to: | +| | | `shiyue/roberta-large-tac08`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``softcatala/julibert`` | Catalan | Please refer to: | +| | | `softcatala/julibert`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``elozano/tweet_sentiment_eval`` | English | Please refer to: | +| | | `elozano/tweet_sentiment_eval`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``cahya/roberta-base-indonesian-1.5G`` | Indonesian | Please refer to: | +| | | `cahya/roberta-base-indonesian-1.5G`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``elozano/tweet_emotion_eval`` | English | Please refer to: | +| | | `elozano/tweet_emotion_eval`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``navteca/roberta-large-squad2`` | English | Please refer to: | +| | | `navteca/roberta-large-squad2`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``elozano/tweet_offensive_eval`` | English | Please refer to: | +| | | `elozano/tweet_offensive_eval`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ynie/roberta-large_conv_contradiction_detector_v0`` | English | Please refer to: | +| | | `ynie/roberta-large_conv_contradiction_detector_v0`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + +.. _deepset/roberta-base-squad2: https://huggingface.co/deepset/roberta-base-squad2 +.. _uer/roberta-base-chinese-extractive-qa: https://huggingface.co/uer/roberta-base-chinese-extractive-qa +.. _uer/roberta-base-finetuned-chinanews-chinese: https://huggingface.co/uer/roberta-base-finetuned-chinanews-chinese +.. 
_uer/roberta-base-finetuned-cluener2020-chinese: https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese
+.. _roberta-base: https://huggingface.co/roberta-base
+.. _cardiffnlp/twitter-roberta-base-sentiment: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment
+.. _roberta-large: https://huggingface.co/roberta-large
+.. _distilroberta-base: https://huggingface.co/distilroberta-base
+.. _cross-encoder/nli-distilroberta-base: https://huggingface.co/cross-encoder/nli-distilroberta-base
+.. _roberta-base-openai-detector: https://huggingface.co/roberta-base-openai-detector
+.. _huggingface/CodeBERTa-small-v1: https://huggingface.co/huggingface/CodeBERTa-small-v1
+.. _mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis: https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis
+.. _siebert/sentiment-roberta-large-english: https://huggingface.co/siebert/sentiment-roberta-large-english
+.. _j-hartmann/emotion-english-distilroberta-base: https://huggingface.co/j-hartmann/emotion-english-distilroberta-base
+.. _cardiffnlp/twitter-roberta-base-emotion: https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion
+.. _seyonec/PubChem10M_SMILES_BPE_396_250: https://huggingface.co/seyonec/PubChem10M_SMILES_BPE_396_250
+.. _textattack/roberta-base-SST-2: https://huggingface.co/textattack/roberta-base-SST-2
+.. _sshleifer/tiny-distilroberta-base: https://huggingface.co/sshleifer/tiny-distilroberta-base
+.. _thatdramebaazguy/roberta-base-squad: https://huggingface.co/thatdramebaazguy/roberta-base-squad
+.. _ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli: https://huggingface.co/ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli
+.. _ufal/robeczech-base: https://huggingface.co/ufal/robeczech-base
+.. _seyonec/PubChem10M_SMILES_BPE_450k: https://huggingface.co/seyonec/PubChem10M_SMILES_BPE_450k
+.. _cardiffnlp/twitter-roberta-base: https://huggingface.co/cardiffnlp/twitter-roberta-base
+.. _seyonec/PubChem10M_SMILES_BPE_50k: https://huggingface.co/seyonec/PubChem10M_SMILES_BPE_50k
+.. _microsoft/codebert-base-mlm: https://huggingface.co/microsoft/codebert-base-mlm
+.. _textattack/roberta-base-MNLI: https://huggingface.co/textattack/roberta-base-MNLI
+.. _cardiffnlp/twitter-roberta-base-offensive: https://huggingface.co/cardiffnlp/twitter-roberta-base-offensive
+.. _cross-encoder/stsb-roberta-large: https://huggingface.co/cross-encoder/stsb-roberta-large
+.. _seyonec/ChemBERTa_zinc250k_v2_40k: https://huggingface.co/seyonec/ChemBERTa_zinc250k_v2_40k
+.. _uklfr/gottbert-base: https://huggingface.co/uklfr/gottbert-base
+.. _seyonec/ChemBERTa-zinc-base-v1: https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1
+.. _roberta-large-openai-detector: https://huggingface.co/roberta-large-openai-detector
+.. _cross-encoder/quora-roberta-base: https://huggingface.co/cross-encoder/quora-roberta-base
+.. _cross-encoder/stsb-roberta-base: https://huggingface.co/cross-encoder/stsb-roberta-base
+.. _microsoft/graphcodebert-base: https://huggingface.co/microsoft/graphcodebert-base
+.. _cardiffnlp/twitter-roberta-base-hate: https://huggingface.co/cardiffnlp/twitter-roberta-base-hate
+.. _chkla/roberta-argument: https://huggingface.co/chkla/roberta-argument
+.. _Salesforce/grappa_large_jnt: https://huggingface.co/Salesforce/grappa_large_jnt
+.. _vinai/bertweet-large: https://huggingface.co/vinai/bertweet-large
+.. _allenai/biomed_roberta_base: https://huggingface.co/allenai/biomed_roberta_base
+.. 
_facebook/muppet-roberta-base: https://huggingface.co/facebook/muppet-roberta-base +.. _Rakib/roberta-base-on-cuad: https://huggingface.co/Rakib/roberta-base-on-cuad +.. _cross-encoder/stsb-distilroberta-base: https://huggingface.co/cross-encoder/stsb-distilroberta-base +.. _nyu-mll/roberta-base-1B-1: https://huggingface.co/nyu-mll/roberta-base-1B-1 +.. _nyu-mll/roberta-med-small-1M-1: https://huggingface.co/nyu-mll/roberta-med-small-1M-1 +.. _SkolkovoInstitute/roberta_toxicity_classifier: https://huggingface.co/SkolkovoInstitute/roberta_toxicity_classifier +.. _facebook/muppet-roberta-large: https://huggingface.co/facebook/muppet-roberta-large +.. _lassl/roberta-ko-small: https://huggingface.co/lassl/roberta-ko-small +.. _huggingface/CodeBERTa-language-id: https://huggingface.co/huggingface/CodeBERTa-language-id +.. _textattack/roberta-base-imdb: https://huggingface.co/textattack/roberta-base-imdb +.. _macedonizer/mk-roberta-base: https://huggingface.co/macedonizer/mk-roberta-base +.. _cross-encoder/nli-MiniLM2-L6-H768: https://huggingface.co/cross-encoder/nli-MiniLM2-L6-H768 +.. _textattack/roberta-base-QNLI: https://huggingface.co/textattack/roberta-base-QNLI +.. _deepset/roberta-base-squad2-covid: https://huggingface.co/deepset/roberta-base-squad2-covid +.. _textattack/roberta-base-MRPC: https://huggingface.co/textattack/roberta-base-MRPC +.. _bhadresh-savani/roberta-base-emotion: https://huggingface.co/bhadresh-savani/roberta-base-emotion +.. _aychang/roberta-base-imdb: https://huggingface.co/aychang/roberta-base-imdb +.. _cross-encoder/quora-distilroberta-base: https://huggingface.co/cross-encoder/quora-distilroberta-base +.. _csarron/roberta-base-squad-v1: https://huggingface.co/csarron/roberta-base-squad-v1 +.. _seyonec/ChemBERTA_PubChem1M_shard00_155k: https://huggingface.co/seyonec/ChemBERTA_PubChem1M_shard00_155k +.. _mental/mental-roberta-base: https://huggingface.co/mental/mental-roberta-base +.. _textattack/roberta-base-CoLA: https://huggingface.co/textattack/roberta-base-CoLA +.. _navteca/quora-roberta-base: https://huggingface.co/navteca/quora-roberta-base +.. _cardiffnlp/twitter-roberta-base-emoji: https://huggingface.co/cardiffnlp/twitter-roberta-base-emoji +.. _benjamin/roberta-base-wechsel-german: https://huggingface.co/benjamin/roberta-base-wechsel-german +.. _textattack/roberta-base-ag-news: https://huggingface.co/textattack/roberta-base-ag-news +.. _johngiorgi/declutr-base: https://huggingface.co/johngiorgi/declutr-base +.. _salesken/query_wellformedness_score: https://huggingface.co/salesken/query_wellformedness_score +.. _blinoff/roberta-base-russian-v0: https://huggingface.co/blinoff/roberta-base-russian-v0 +.. _allenai/reviews_roberta_base: https://huggingface.co/allenai/reviews_roberta_base +.. _ruiqi-zhong/roberta-base-meta-tuning-test: https://huggingface.co/ruiqi-zhong/roberta-base-meta-tuning-test +.. _mrm8488/distilroberta-finetuned-tweets-hate-speech: https://huggingface.co/mrm8488/distilroberta-finetuned-tweets-hate-speech +.. _cointegrated/roberta-large-cola-krishna2020: https://huggingface.co/cointegrated/roberta-large-cola-krishna2020 +.. _deepset/roberta-base-squad2-distilled: https://huggingface.co/deepset/roberta-base-squad2-distilled +.. _tli8hf/unqover-roberta-base-squad: https://huggingface.co/tli8hf/unqover-roberta-base-squad +.. _cross-encoder/nli-roberta-base: https://huggingface.co/cross-encoder/nli-roberta-base +.. 
_nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large: https://huggingface.co/nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large +.. _seyonec/BPE_SELFIES_PubChem_shard00_160k: https://huggingface.co/seyonec/BPE_SELFIES_PubChem_shard00_160k +.. _CLTL/MedRoBERTa.nl: https://huggingface.co/CLTL/MedRoBERTa.nl +.. _HooshvareLab/roberta-fa-zwnj-base: https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base +.. _nyu-mll/roberta-base-100M-1: https://huggingface.co/nyu-mll/roberta-base-100M-1 +.. _deepset/tinyroberta-squad2: https://huggingface.co/deepset/tinyroberta-squad2 +.. _youscan/ukr-roberta-base: https://huggingface.co/youscan/ukr-roberta-base +.. _navteca/roberta-base-squad2: https://huggingface.co/navteca/roberta-base-squad2 +.. _bertin-project/bertin-roberta-base-spanish: https://huggingface.co/bertin-project/bertin-roberta-base-spanish +.. _shiyue/roberta-large-tac08: https://huggingface.co/shiyue/roberta-large-tac08 +.. _softcatala/julibert: https://huggingface.co/softcatala/julibert +.. _elozano/tweet_sentiment_eval: https://huggingface.co/elozano/tweet_sentiment_eval +.. _cahya/roberta-base-indonesian-1.5G: https://huggingface.co/cahya/roberta-base-indonesian-1.5G +.. _elozano/tweet_emotion_eval: https://huggingface.co/elozano/tweet_emotion_eval +.. _navteca/roberta-large-squad2: https://huggingface.co/navteca/roberta-large-squad2 +.. _elozano/tweet_offensive_eval: https://huggingface.co/elozano/tweet_offensive_eval +.. _ynie/roberta-large_conv_contradiction_detector_v0: https://huggingface.co/ynie/roberta-large_conv_contradiction_detector_v0 diff --git a/docs/model_zoo/transformers/RoFormer/contents.rst b/docs/model_zoo/transformers/RoFormer/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..3b4db00dc70f9c93fc45a92ad3aee21cba3cfb6f --- /dev/null +++ b/docs/model_zoo/transformers/RoFormer/contents.rst @@ -0,0 +1,53 @@ + + +------------------------------------ +RoFormer模型汇总 +------------------------------------ + +下表汇总介绍了目前PaddleNLP支持的RoFormer模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``roformer-chinese-small`` | Chinese | 6-layer, 384-hidden, | +| | | 6-heads, 30M parameters. | +| | | Roformer Small Chinese model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 124M parameters. | +| | | Roformer Base Chinese model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-char-small`` | Chinese | 6-layer, 384-hidden, | +| | | 6-heads, 15M parameters. | +| | | Roformer Chinese Char Small model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-char-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 95M parameters. 
| +| | | Roformer Chinese Char Base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-sim-char-ft-small`` | Chinese | 6-layer, 384-hidden, | +| | | 6-heads, 15M parameters. | +| | | Roformer Chinese Char Ft Small model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-sim-char-ft-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 95M parameters. | +| | | Roformer Chinese Char Ft Base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-sim-char-small`` | Chinese | 6-layer, 384-hidden, | +| | | 6-heads, 15M parameters. | +| | | Roformer Chinese Sim Char Small model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-chinese-sim-char-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 95M parameters. | +| | | Roformer Chinese Sim Char Base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-english-small-discriminator`` | English | 12-layer, 256-hidden, | +| | | 4-heads, 13M parameters. | +| | | Roformer English Small Discriminator. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``roformer-english-small-generator`` | English | 12-layer, 64-hidden, | +| | | 1-heads, 5M parameters. | +| | | Roformer English Small Generator. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + diff --git a/docs/model_zoo/transformers/SKEP/contents.rst b/docs/model_zoo/transformers/SKEP/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..f250842787d6ee5c235c750941837018c6306a9c --- /dev/null +++ b/docs/model_zoo/transformers/SKEP/contents.rst @@ -0,0 +1,29 @@ + + +------------------------------------ +SKEP模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的SKEP模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``skep_ernie_1.0_large_ch`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, 336M parameters. 
|
+|                                                                                  |              | Trained using the ERNIE model                                                    |
+|                                                                                  |              | ``ernie_1.0``                                                                    |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``skep_ernie_2.0_large_en``                                                       | English      | 24-layer, 1024-hidden,                                                           |
+|                                                                                  |              | 16-heads, 336M parameters.                                                       |
+|                                                                                  |              | Trained using the ERNIE model                                                    |
+|                                                                                  |              | ``ernie_2.0_large_en``                                                           |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``skep_roberta_large_en``                                                         | English      | 24-layer, 1024-hidden,                                                           |
+|                                                                                  |              | 16-heads, 355M parameters.                                                       |
+|                                                                                  |              | Trained using the RoBERTa model                                                  |
+|                                                                                  |              | ``roberta_large_en``                                                             |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
diff --git a/docs/model_zoo/transformers/SqueezeBert/contents.rst b/docs/model_zoo/transformers/SqueezeBert/contents.rst
new file mode 100644
index 0000000000000000000000000000000000000000..6f1b2afcc1efc4fb054353cb8b5b147fa9060a25
--- /dev/null
+++ b/docs/model_zoo/transformers/SqueezeBert/contents.rst
@@ -0,0 +1,26 @@
+
+
+------------------------------------
+SqueezeBert模型汇总
+------------------------------------
+
+
+
+下表汇总介绍了目前PaddleNLP支持的SqueezeBert模型对应预训练权重。
+关于模型的具体细节可以参考对应链接。
+
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+| Pretrained Weight                                                                | Language     | Details of the model                                                             |
++==================================================================================+==============+==================================================================================+
+|``squeezebert-uncased``                                                           | English      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 51M parameters.                                                        |
+|                                                                                  |              | SqueezeBert Uncased model.                                                       |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``squeezebert-mnli``                                                              | English      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 51M parameters.                                                        |
+|                                                                                  |              | SqueezeBert MNLI model.                                                          |
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+
+|``squeezebert-mnli-headless``                                                     | English      | 12-layer, 768-hidden,                                                            |
+|                                                                                  |              | 12-heads, 51M parameters.                                                        |
+|                                                                                  |              | SqueezeBert MNLI Headless model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/T5/contents.rst b/docs/model_zoo/transformers/T5/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..b5b0ea1f9ead660dd40206cc19af150407551831 --- /dev/null +++ b/docs/model_zoo/transformers/T5/contents.rst @@ -0,0 +1,203 @@ + + +------------------------------------ +T5模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的T5模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``t5-small`` | English | 6-layer, 512-hidden, | +| | | 8-heads, 93M parameters. | +| | | T5 small model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``t5-base`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 272M parameters. | +| | | T5 base model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``t5-large`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 803M parameters. | +| | | T5 large model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``t5-v1_1-base`` | English | Please refer to: | +| | | t5-v1_1-base_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``t5-v1_1-large`` | English | Please refer to: | +| | | t5-v1_1-large_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Langboat/mengzi-t5-base`` | Chinese | Please refer to: | +| | | `Langboat/mengzi-t5-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Langboat/mengzi-t5-base-mt`` | Chinese | Please refer to: | +| | | `Langboat/mengzi-t5-base-mt`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``deep-learning-analytics/wikihow-t5-small`` | English | Please refer to: | +| | | `deep-learning-analytics/wikihow-t5-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sberbank-ai/ruT5-base`` | English | Please refer to: | +| | | `sberbank-ai/ruT5-base`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Michau/t5-base-en-generate-headline`` | English | Please refer to: | +| | | `Michau/t5-base-en-generate-headline`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``google/t5-v1_1-small`` | English | Please refer to: | +| | | `google/t5-v1_1-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``prithivida/parrot_paraphraser_on_T5`` | English | Please refer to: | +| | | `prithivida/parrot_paraphraser_on_T5`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``prithivida/grammar_error_correcter_v1`` | English | Please refer to: | +| | | `prithivida/grammar_error_correcter_v1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``valhalla/t5-small-qg-hl`` | English | Please refer to: | +| | | `valhalla/t5-small-qg-hl`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``valhalla/t5-small-qa-qg-hl`` | English | Please refer to: | +| | | `valhalla/t5-small-qa-qg-hl`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ramsrigouthamg/t5-large-paraphraser-diverse-high-quality`` | English | Please refer to: | +| | | `ramsrigouthamg/t5-large-paraphraser-diverse-high-quality`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mrm8488/t5-base-finetuned-common_gen`` | English | Please refer to: | +| | | `mrm8488/t5-base-finetuned-common_gen`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``valhalla/t5-small-e2e-qg`` | English | Please refer to: | +| | | `valhalla/t5-small-e2e-qg`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sonoisa/t5-base-japanese`` | japanese | Please refer to: | +| | | `sonoisa/t5-base-japanese`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``google/t5-base-lm-adapt`` | English | Please refer to: | +| | | `google/t5-base-lm-adapt`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``google/t5-small-lm-adapt`` | English | Please refer to: | +| | | `google/t5-small-lm-adapt`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``valhalla/t5-small-qg-prepend`` | English | Please refer to: | +| | | `valhalla/t5-small-qg-prepend`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``prithivida/informal_to_formal_styletransfer`` | English | Please refer to: | +| | | `prithivida/informal_to_formal_styletransfer`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``KETI-AIR/ke-t5-base`` | English | Please refer to: | +| | | `KETI-AIR/ke-t5-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``nielsr/nt5-small-rc1`` | English | Please refer to: | +| | | `nielsr/nt5-small-rc1`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``snrspeaks/t5-one-line-summary`` | English | Please refer to: | +| | | `snrspeaks/t5-one-line-summary`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``mrm8488/t5-small-finetuned-quora-for-paraphrasing`` | English | Please refer to: | +| | | `mrm8488/t5-small-finetuned-quora-for-paraphrasing`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``p-christ/12412fsasf`` | English | Please refer to: | +| | | `p-christ/12412fsasf`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tscholak/3vnuv1vf`` | English | Please refer to: | +| | | `tscholak/3vnuv1vf`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tennessejoyce/titlewave-t5-base`` | English | Please refer to: | +| | | `tennessejoyce/titlewave-t5-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``vennify/t5-base-grammar-correction`` | English | Please refer to: | +| | | `vennify/t5-base-grammar-correction`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``megagonlabs/t5-base-japanese-web`` | Japanese | Please refer to: | +| | | `megagonlabs/t5-base-japanese-web`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sberbank-ai/ruT5-large`` | English | Please refer to: | +| | | `sberbank-ai/ruT5-large`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tscholak/t5.1.1.lm100k.base`` | English | Please refer to: | +| | | `tscholak/t5.1.1.lm100k.base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``deep-learning-analytics/GrammarCorrector`` | English | Please refer to: | +| | | `deep-learning-analytics/GrammarCorrector`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ThomasNLG/t5-qa_squad2neg-en`` | English | Please refer to: | +| | | `ThomasNLG/t5-qa_squad2neg-en`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``flexudy/t5-small-wav2vec2-grammar-fixer`` | English | Please refer to: | +| | | `flexudy/t5-small-wav2vec2-grammar-fixer`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``KETI-AIR/ke-t5-small`` | English | Please refer to: | +| | | `KETI-AIR/ke-t5-small`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``razent/SciFive-large-Pubmed_PMC`` | English | Please refer to: | +| | | `razent/SciFive-large-Pubmed_PMC`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``google/t5-large-ssm-nq`` | English | Please refer to: | +| | | `google/t5-large-ssm-nq`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``ozcangundes/T5-base-for-BioQA`` | English | Please refer to: | +| | | `ozcangundes/T5-base-for-BioQA`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Rostlab/prot_t5_base_mt_uniref50`` | English | Please refer to: | +| | | `Rostlab/prot_t5_base_mt_uniref50`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``sonoisa/t5-base-japanese-question-generation`` | Japanese | Please refer to: | +| | | `sonoisa/t5-base-japanese-question-generation`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``Wikidepia/IndoT5-base`` | English | Please refer to: | +| | | `Wikidepia/IndoT5-base`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``razent/SciFive-base-Pubmed_PMC`` | English | Please refer to: | +| | | `razent/SciFive-base-Pubmed_PMC`_ | 
++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``google/t5-small-ssm-nq`` | English | Please refer to: | +| | | `google/t5-small-ssm-nq`_ | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + + + +.. _t5-v1_1-base: https://huggingface.co/google/t5-v1_1-base +.. _t5-v1_1-large: https://huggingface.co/google/t5-v1_1-large +.. _Langboat/mengzi-t5-base: https://huggingface.co/Langboat/mengzi-t5-base +.. _Langboat/mengzi-t5-base-mt: https://huggingface.co/Langboat/mengzi-t5-base-mt +.. _deep-learning-analytics/wikihow-t5-small: https://huggingface.co/deep-learning-analytics/wikihow-t5-small +.. _sberbank-ai/ruT5-base: https://huggingface.co/sberbank-ai/ruT5-base +.. _Michau/t5-base-en-generate-headline: https://huggingface.co/Michau/t5-base-en-generate-headline +.. _google/t5-v1_1-small: https://huggingface.co/google/t5-v1_1-small +.. _prithivida/parrot_paraphraser_on_T5: https://huggingface.co/prithivida/parrot_paraphraser_on_T5 +.. _prithivida/grammar_error_correcter_v1: https://huggingface.co/prithivida/grammar_error_correcter_v1 +.. _valhalla/t5-small-qg-hl: https://huggingface.co/valhalla/t5-small-qg-hl +.. _valhalla/t5-small-qa-qg-hl: https://huggingface.co/valhalla/t5-small-qa-qg-hl +.. _ramsrigouthamg/t5-large-paraphraser-diverse-high-quality: https://huggingface.co/ramsrigouthamg/t5-large-paraphraser-diverse-high-quality +.. _mrm8488/t5-base-finetuned-common_gen: https://huggingface.co/mrm8488/t5-base-finetuned-common_gen +.. _valhalla/t5-small-e2e-qg: https://huggingface.co/valhalla/t5-small-e2e-qg +.. _sonoisa/t5-base-japanese: https://huggingface.co/sonoisa/t5-base-japanese +.. _google/t5-base-lm-adapt: https://huggingface.co/google/t5-base-lm-adapt +.. _google/t5-small-lm-adapt: https://huggingface.co/google/t5-small-lm-adapt +.. _valhalla/t5-small-qg-prepend: https://huggingface.co/valhalla/t5-small-qg-prepend +.. _prithivida/informal_to_formal_styletransfer: https://huggingface.co/prithivida/informal_to_formal_styletransfer +.. _KETI-AIR/ke-t5-base: https://huggingface.co/KETI-AIR/ke-t5-base +.. _nielsr/nt5-small-rc1: https://huggingface.co/nielsr/nt5-small-rc1 +.. _snrspeaks/t5-one-line-summary: https://huggingface.co/snrspeaks/t5-one-line-summary +.. _mrm8488/t5-small-finetuned-quora-for-paraphrasing: https://huggingface.co/mrm8488/t5-small-finetuned-quora-for-paraphrasing +.. _p-christ/12412fsasf: https://huggingface.co/p-christ/12412fsasf +.. _tscholak/3vnuv1vf: https://huggingface.co/tscholak/3vnuv1vf +.. _tennessejoyce/titlewave-t5-base: https://huggingface.co/tennessejoyce/titlewave-t5-base +.. _vennify/t5-base-grammar-correction: https://huggingface.co/vennify/t5-base-grammar-correction +.. _megagonlabs/t5-base-japanese-web: https://huggingface.co/megagonlabs/t5-base-japanese-web +.. _sberbank-ai/ruT5-large: https://huggingface.co/sberbank-ai/ruT5-large +.. _tscholak/t5.1.1.lm100k.base: https://huggingface.co/tscholak/t5.1.1.lm100k.base +.. _deep-learning-analytics/GrammarCorrector: https://huggingface.co/deep-learning-analytics/GrammarCorrector +.. _ThomasNLG/t5-qa_squad2neg-en: https://huggingface.co/ThomasNLG/t5-qa_squad2neg-en +.. _t5-small-wav2vec2-grammar-fixer: https://huggingface.co/t5-small-wav2vec2-grammar-fixer +.. _KETI-AIR/ke-t5-small: https://huggingface.co/KETI-AIR/ke-t5-small +.. 
_razent/SciFive-large-Pubmed_PMC: https://huggingface.co/razent/SciFive-large-Pubmed_PMC +.. _google/t5-large-ssm-nq: https://huggingface.co/google/t5-large-ssm-nq +.. _ozcangundes/T5-base-for-BioQA: https://huggingface.co/ozcangundes/T5-base-for-BioQA +.. _Rostlab/prot_t5_base_mt_uniref50: https://huggingface.co/Rostlab/prot_t5_base_mt_uniref50 +.. _sonoisa/t5-base-japanese-question-generation: https://huggingface.co/sonoisa/t5-base-japanese-question-generation +.. _Wikidepia/IndoT5-base: https://huggingface.co/Wikidepia/IndoT5-base +.. _razent/SciFive-base-Pubmed_PMC: https://huggingface.co/razent/SciFive-base-Pubmed_PMC +.. _google/t5-small-ssm-nq: https://huggingface.co/google/t5-small-ssm-nq +.. _flexudy/t5-small-wav2vec2-grammar-fixer: https://huggingface.co/flexudy/t5-small-wav2vec2-grammar-fixer + diff --git a/docs/model_zoo/transformers/TinyBert/contents.rst b/docs/model_zoo/transformers/TinyBert/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..92bf2879c39d9ad2f546c67007737ccd080d8f1a --- /dev/null +++ b/docs/model_zoo/transformers/TinyBert/contents.rst @@ -0,0 +1,44 @@ + + +------------------------------------ +TinyBert模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的TinyBert模型以及对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``tinybert-4l-312d`` | English | 4-layer, 312-hidden, | +| | | 12-heads, 14.5M parameters. | +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tinybert-6l-768d`` | English | 6-layer, 768-hidden, | +| | | 12-heads, 67M parameters. | +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tinybert-4l-312d-v2`` | English | 4-layer, 312-hidden, | +| | | 12-heads, 14.5M parameters. | +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tinybert-6l-768d-v2`` | English | 6-layer, 768-hidden, | +| | | 12-heads, 67M parameters. | +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tinybert-4l-312d-zh`` | Chinese | 4-layer, 312-hidden, | +| | | 12-heads, 14.5M parameters. 
| +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``tinybert-6l-768d-zh`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 67M parameters. | +| | | The TinyBert model distilled from | +| | | the BERT model ``bert-base-uncased`` | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/UNIMO/contents.rst b/docs/model_zoo/transformers/UNIMO/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..85e3bbbdde9c16080e87e5dd46e01ed6a99be274 --- /dev/null +++ b/docs/model_zoo/transformers/UNIMO/contents.rst @@ -0,0 +1,26 @@ + + +------------------------------------ +UNIMO模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的UNIMO模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``unimo-text-1.0`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 99M parameters. | +| | | UNIMO-text-1.0 model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``unimo-text-1.0-lcsts-new`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 99M parameters. | +| | | Finetuned on lcsts_new dataset. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``unimo-text-1.0-large`` | Chinese | 24-layer, 768-hidden, | +| | | 16-heads, 316M parameters. | +| | | UNIMO-text-1.0 large model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/model_zoo/transformers/UnifiedTransformer/contents.rst b/docs/model_zoo/transformers/UnifiedTransformer/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..16c62d9c3e7fad6e808276f09ef7ba4983e51939 --- /dev/null +++ b/docs/model_zoo/transformers/UnifiedTransformer/contents.rst @@ -0,0 +1,32 @@ + + +------------------------------------ +UnifiedTransformer模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的UnifiedTransformer模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``unified_transformer-12L-cn`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. 
| +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``unified_transformer-12L-cn-luge`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 108M parameters. | +| | | Trained on Chinese text (LUGE.ai). | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``plato-mini`` | Chinese | 6-layer, 768-hidden, | +| | | 12-heads, 66M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``plato-xl`` | Chinese | 72-layer, 3072-hidden, | +| | | 32-heads, ?M parameters. | +| | | Trained on Chinese text. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ + + diff --git a/docs/model_zoo/transformers/XLNet/contents.rst b/docs/model_zoo/transformers/XLNet/contents.rst new file mode 100644 index 0000000000000000000000000000000000000000..5856dd99bb3c7b058d05725de000f48c653175b9 --- /dev/null +++ b/docs/model_zoo/transformers/XLNet/contents.rst @@ -0,0 +1,34 @@ + + +------------------------------------ +XLNet模型汇总 +------------------------------------ + + + +下表汇总介绍了目前PaddleNLP支持的XLNet模型对应预训练权重。 +关于模型的具体细节可以参考对应链接。 + ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +| Pretrained Weight | Language | Details of the model | ++==================================================================================+==============+==================================================================================+ +|``xlnet-base-cased`` | English | 12-layer, 768-hidden, | +| | | 12-heads, 110M parameters. | +| | | XLNet English model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``xlnet-large-cased`` | English | 24-layer, 1024-hidden, | +| | | 16-heads, 340M parameters. | +| | | XLNet Large English model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chinese-xlnet-base`` | Chinese | 12-layer, 768-hidden, | +| | | 12-heads, 117M parameters. | +| | | XLNet Chinese model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chinese-xlnet-mid`` | Chinese | 24-layer, 768-hidden, | +| | | 12-heads, 209M parameters. | +| | | XLNet Medium Chinese model. | ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ +|``chinese-xlnet-large`` | Chinese | 24-layer, 1024-hidden, | +| | | 16-heads, _M parameters. | +| | | XLNet Large Chinese model. 
| ++----------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------+ diff --git a/docs/paddle.png b/docs/paddle.png new file mode 100644 index 0000000000000000000000000000000000000000..bc1135abfab7aa48f29392da4bca614f688314af Binary files /dev/null and b/docs/paddle.png differ diff --git a/docs/peft.md b/docs/peft.md new file mode 100644 index 0000000000000000000000000000000000000000..64e0a87038146e8ad43196dde69eb604e324a027 --- /dev/null +++ b/docs/peft.md @@ -0,0 +1,277 @@ +# PaddleNLP PEFT API + +PaddleNLP PEFT API提供单卡/分布式LoRA和Prefix-Tuning,用户定义好模型,数据集, 以及相应的配置,就可以快速使用PEFT适配模型进行低参数模型微调。 + +# 预备知识 +## LoRA +
+ +
+
+大模型网络中有很多线性层,需要进行密集的矩阵乘法计算,而这些权重矩阵通常是满秩(full rank)的,直接全量微调的计算和显存开销很大。LoRA 论文的研究表明,将输入表示投影到较小的子空间后,不仅仍然可以有效地学习,还可以节约大量的计算与显存需求。具体做法:对于预训练的权重矩阵,引入两个低秩(low rank)矩阵 $A$、$B$(图中橙色的两个矩阵)来近似权重的更新过程 $W_0+\Delta W=W_0+B A$,其中 $B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}$,$r$ 远小于原权重矩阵的 rank。训练期间,$W_0$ 参数冻结,只对 $A$ 和 $B$ 两个矩阵进行梯度更新,前向传播公式如下:
+
+$$
+h=W_{0}x+BAx
+$$
+
+由于可训练参数大幅减少,训练过程中需要保存的中间变量也随之减少,由此节约大量的训练显存消耗。
+更多算法细节可参考 LoRA [论文](https://arxiv.org/abs/2106.09685)。
+
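+为便于理解上面的前向公式,下面给出一个最小的 LoRA 线性层示意实现(仅用于说明原理,并非 PaddleNLP 中 `LoRALinear` 的实际源码,类名 `SketchLoRALinear` 为自拟):
+
+```python
+import paddle
+from paddle import nn
+
+class SketchLoRALinear(nn.Layer):
+    """示意实现:冻结原始权重 W0,仅训练低秩矩阵 A、B。"""
+
+    def __init__(self, in_features, out_features, r=8, lora_alpha=16):
+        super().__init__()
+        self.base = nn.Linear(in_features, out_features)  # 对应公式中的 W0
+        self.base.weight.stop_gradient = True  # 冻结 W0,不参与梯度更新
+        self.base.bias.stop_gradient = True
+        self.lora_A = nn.Linear(in_features, r, bias_attr=False)
+        # B 初始化为全 0,保证训练开始时 BAx 为 0,输出与原模型完全一致
+        self.lora_B = nn.Linear(
+            r, out_features,
+            weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)),
+            bias_attr=False)
+        self.scaling = lora_alpha / r
+
+    def forward(self, x):
+        # h = W0 x + B A x,并乘以 lora_alpha / r 的缩放系数
+        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling
+
+layer = SketchLoRALinear(768, 768, r=8, lora_alpha=16)
+print(layer(paddle.randn([2, 16, 768])).shape)  # [2, 16, 768]
+```
+
+可以看到,训练开始时 LoRA 分支的输出为 0,不会改变原模型的行为;训练过程中只有 $A$、$B$ 参与梯度更新。
+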
+## Prefix-tuning
+
+ + +Prefix-tuning是一个针对NLG类型下游任务的轻量级微调方案,受提示学习(Prompt learning)的影响,加入的一部分 prefix embedding 作为连续型提示进行训练。prefix embedding是由专门的 prefix encoder 网络生成的数个张量,会以 past_key_value的方式被插入到语言模型每一层的 hidden_state之前。和 LoRA 类似,它也会冻结整个预训练模型的所有参数权重,只对prefix embedding进行梯度更新,因此训练参数量只有常规 SFT 的 0.1%。Prefix-tuning可以在全样本下获得与 SFT 比肩的训练效果,在小样本环境下甚至可以超越 SFT。更多算法细节参考 +Prefix-tuning[论文](https://arxiv.org/abs/2101.00190) + +# 快速开始 +## LoRA + +1. 要对 model 进行 LoRA 微调,首先需要定义LoRAConfig, 再通过 LoRAConfig 对 LoRAModel 进行构建,再通过 mark_only_lora_as_trainable函数冻结主干参数: +```python + from paddlenlp.peft import LoRAConfig, LoRAModel + from paddlenlp.transformers import AutoModelForCausalLM + + model = AutoModelForCausalLM.from_pretrained('facebook/llama-7b') + target_modules = [".*q_proj.*", ".*v_proj.*", ".*k_proj.*"] + lora_rank = 8 + lora_config = LoRAConfig( + target_modules=target_modules, + r=lora_rank, + lora_alpha=2 * lora_rank, + merge_weights=True + ) + model = LoRAModel(model, lora_config) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() +``` + +2. 模型的保存和载入 + +LoRAModel的保存和载入和普通的 model 没有太大区别,都是通过 save_pretrained/from_pretrained调用 +```python + # 保存 + model.save_pretrained('lora_path') +``` +Paddle会将 LoRAModel 的矩阵 AB 权重保存为lora_mode_state.pdparams文件,LoRAConfig 配置保存为 lora_config.json 文件在 lora_path 目录下。 +之后当需要载入模型权重进行推理时,则直接进行 from_pretrained即可。 +```python + from paddlenlp.transformers import AutoModelForCausalLM + + from paddlenlp.peft import LoRAModel, LoRAConfig + + # 载入 + + config = LoRAConfig.from_pretrained('lora_path') + model = AutoModelForCausalLM.from_pretrained('facebook/llama-7b') + + model = LoRAModel.from_pretrained(model, 'lora_path') + model.eval() +``` +## class LoRAConfig +```python +Parameters: + + --r + 默认为 8,LoRA A/B 矩阵秩。 + + --target_modules + 指定哪些 module 需要适配 LoRA 算法,格式为module 的名字 + 或正则表达式的 List,比如, ['q', 'v'] 或者 '.*decoder.*(SelfAttention|EncDecAttention).*(q|v)$' + + --trainable_modules + 指定除LoRA参数外的需要进行梯度更新参数的 modules,格式为module 的名字 + 或正则表达式的 List,比如, ['q', 'v'] 或者 '.*decoder.*(SelfAttention|EncDecAttention).*(q|v)$' + + --lora_alpha + 默认为 8,LoRA算法的 alpha 值,int 类型 + + --lora_dropout + 默认为 0.0,dropout的比例设置,float 类型 + + --merge_weights + 默认为 False,模型推理时,是否进行base model 权重和 LoRA 权重的合参操作,bool 类型 + + --trainable_bias + 指定可训练的 bias, 可选项 ['lora', 'all'] + + --enable_lora_list + 指定是否需要使用`MergedLoRALinear`,如果不指定则默认使用 + `LoRALinear` + + --tensor_parallel_degree + 默认为-1,多 GPU 并行的控制参数,传入tensor_parallel_degree 必须与 base model保持一致 + + --dtype + LoRA矩阵参数类型设置 + + --head_dim + 多头注意力的头数,只有`LoRAMergedLinear`和 + `ColumnParallelLoRAMergedLinear`使用 +``` +## class LoRAModel +```python +Parameters: + + --model + 指定 base model,必须是 nn.Layer 类型的对象 + + --lora_config + 指定 LoRAConfig 用于配置 LoRAModel + +key function: + + -mark_only_lora_as_trainable() + + 其作用是将模型中与LoRA相关的的一些层标记为可训练,而其他层则标记为不可训练。 + + + -save_pretrained(save_directory, merge_tensor_parallel) + --save_directory + 保存目录的路径 + --merge_tensor_parallel + 是否合并多卡参数,默认为True + + 如果merge_tensor_parallel为真且模型的配置中的张量并行度大于1,则获取可训练的state_dict,并使用_merge_trainable_tensor_parallel方法合并张量并行训练的state_dict。如果merge_tensor_parallel为真且模型的张量并行度大于1,只有主进程会进行保存操作。 + + + -from_pretrained(model, lora_path) + --model + 要加载LORA权重参数的model对象 + --lora_path + 保存LORA权重参数和 config 的路径 + + 该函数用于从预先训练的模型中加载LORA权重参数,并将其设置到给定的模型中,以便在后续的任务中使用该模型进行预测或训练。 + + + -print_trainable_parameters() + + 该函数会遍历整个权重参数列表,对于每个权重参数weight,统计所有进行梯度更新的参数,最后将信息打印出来。 +``` + + +## Prefix-tuning +1. 
设置Prefix-tuning参数
+```python
+    from paddlenlp.peft import PrefixConfig, PrefixModelForCausalLM
+    from paddlenlp.transformers import AutoModelForCausalLM
+
+    model = AutoModelForCausalLM.from_pretrained('facebook/llama-7b')
+
+    prefix_config = PrefixConfig(
+        num_prefix_tokens=64,
+        num_attention_heads=model.config.n_head,
+        num_hidden_layers=model.config.n_layer,
+        hidden_size=model.config.hidden_size,
+        prefix_projection=False,
+        prefix_projection_hidden_size=model.config.hidden_size
+    )
+    model = PrefixModelForCausalLM(model=model, prefix_config=prefix_config)
+    model.mark_only_prefix_as_trainable()
+    model.print_trainable_parameters()
+```
+
+2. 模型的保存和载入
+
+和 LoRAModel 一致,通过 save_pretrained/from_pretrained 调用:
+```python
+    # 保存
+    model.save_pretrained('prefix_path')
+```
+Paddle 会将 PrefixModelForCausalLM 中 prefix_encoder(包含 Embedding layer 和 Linear layers)的网络权重,以及 PrefixConfig 配置(保存为 prefix_config.json 文件)一并保存到 prefix_path 目录下。
+之后当需要载入模型权重进行推理时,直接调用 from_pretrained 即可。
+```python
+    from paddlenlp.transformers import AutoModelForCausalLM
+
+    from paddlenlp.peft import PrefixConfig, PrefixModelForCausalLM
+
+    # 载入
+
+    config = PrefixConfig.from_pretrained('prefix_path')
+    model = AutoModelForCausalLM.from_pretrained('facebook/llama-7b')
+
+    model = PrefixModelForCausalLM.from_pretrained(model, 'prefix_path')
+    model.eval()
+```
+
+## class PrefixConfig
+```python
+Parameters:
+
+    --prefix_dropout
+        默认为 0.0,prefix projection dropout比例设置,float 类型
+
+    --num_prefix_tokens
+        prefix tokens个数设定,int 类型
+
+    --num_attention_heads
+        注意力头数设置,int 类型
+
+    --multi_query_group_num
+        multi query group 个数设置,int 类型
+
+    --num_hidden_layers
+        base model 的 layer层数设置,int 类型
+
+    --hidden_size
+        base model 的 hidden size 设置,int 类型
+
+    --prefix_projection
+        默认为 False,是否对 prefix tokens 进行 projection 操作,bool 类型
+
+    --prefix_projection_hidden_size
+        如果 prefix_projection 设置为 True,则在这里设置
+        projection 操作的 hidden size,int 类型
+
+    --tensor_parallel_degree
+        默认为-1,多 GPU 并行的控制参数
+
+    --dtype
+        prefix embeddings 参数类型设置
+
+```
+
+## class PrefixModelForCausalLM
+```python
+Parameters:
+
+    --model
+        指定 base model,必须是 nn.Layer 类型的对象
+
+    --prefix_config
+        指定 PrefixConfig 用于配置 PrefixModelForCausalLM
+
+    --postprocess_past_key_value
+        指定对 past_key_value 进行后处理的函数
+
+    --pad_attention_mask
+        指定处理新增的 prefix embedding 的 pad_attention_mask函数
+
+key function:
+
+    -mark_only_prefix_as_trainable()
+
+        其作用是只把模型中的 Prefix embedding 和 Prefix projection 层标记为可训练,其他层参数全部冻结。
+
+    -save_pretrained(save_directory, merge_tensor_parallel)
+        --save_directory
+            保存目录的路径
+        --merge_tensor_parallel
+            是否合并多卡参数,默认为True
+
+        如果merge_tensor_parallel为真且模型配置中的张量并行度大于1,则获取可训练的state_dict,并使用_merge_trainable_tensor_parallel方法合并张量并行训练的state_dict,此时只有主进程会进行保存操作。
+
+    -from_pretrained(model, prefix_path, postprocess_past_key_value, pad_attention_mask)
+        --model
+            要加载Prefix权重参数的model对象
+        --prefix_path
+            保存Prefix权重参数和 config 文件的路径
+        --postprocess_past_key_value
+            功能同 PrefixModelForCausalLM 构造参数
+        --pad_attention_mask
+            功能同 PrefixModelForCausalLM 构造参数
+
+        该函数用于从预先训练的模型中加载Prefix权重参数,并将其设置到给定的模型中,以便在后续的任务中使用该模型进行预测或训练。
+
+    -print_trainable_parameters()
+
+        该函数会遍历整个权重参数列表,统计所有参与梯度更新的参数量,并将信息打印出来。
+```
+
+更详细的使用可以参考 [finetuning 脚本](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/causallm/finetune_generation.py),以及对应的启动脚本编写方式(见 [README.md](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/causallm/README.md) 文件)。
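+
+作为补充,下面用一段示意代码展示 Prefix-tuning 中 prefix embedding 如何被整理成逐层的 past_key_value(仅用于说明原理,并非 PaddleNLP 中 prefix_encoder 的实际源码;类名 `SketchPrefixEncoder` 为自拟,张量形状按常见的多头注意力实现假设):
+
+```python
+import paddle
+from paddle import nn
+
+class SketchPrefixEncoder(nn.Layer):
+    """示意实现:由可学习的 prefix embedding 生成每一层的 (key, value)。"""
+
+    def __init__(self, num_prefix_tokens, num_hidden_layers, num_attention_heads,
+                 hidden_size, prefix_projection=False, projection_hidden_size=None):
+        super().__init__()
+        self.num_prefix_tokens = num_prefix_tokens
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.head_dim = hidden_size // num_attention_heads
+        out_dim = num_hidden_layers * 2 * hidden_size  # 每层各需要一份 key 和 value
+        if prefix_projection:
+            # prefix_projection=True 时,先查表再经两层 MLP 投影到目标维度
+            self.embedding = nn.Embedding(num_prefix_tokens, hidden_size)
+            self.projection = nn.Sequential(
+                nn.Linear(hidden_size, projection_hidden_size),
+                nn.Tanh(),
+                nn.Linear(projection_hidden_size, out_dim))
+        else:
+            self.embedding = nn.Embedding(num_prefix_tokens, out_dim)
+            self.projection = None
+
+    def forward(self, batch_size):
+        prefix_ids = paddle.arange(self.num_prefix_tokens).unsqueeze(0).expand([batch_size, -1])
+        prefix = self.embedding(prefix_ids)
+        if self.projection is not None:
+            prefix = self.projection(prefix)
+        # 整理成 [n_layers * 2, batch, n_heads, prefix_len, head_dim],
+        # 再按层拆分,即可作为每层 attention 的 past_key_value 插到 hidden_state 之前
+        prefix = prefix.reshape([batch_size, self.num_prefix_tokens,
+                                 self.num_hidden_layers * 2,
+                                 self.num_attention_heads, self.head_dim])
+        prefix = prefix.transpose([2, 0, 3, 1, 4])
+        return paddle.split(prefix, self.num_hidden_layers)
+
+past_key_values = SketchPrefixEncoder(64, 12, 12, 768)(batch_size=2)
+print(len(past_key_values), past_key_values[0].shape)  # 12 [2, 2, 12, 64, 64]
+```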
diff --git a/docs/requirements.txt b/docs/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..129805924a933b43302ead432016fcb1167776fb
--- /dev/null
+++ b/docs/requirements.txt
@@ -0,0 +1,18 @@
+# Defining the exact version will make sure things don't break
+
+urllib3==1.26.2 # fix urllib3 version dependency: https://github.com/psf/requests/issues/6432#issuecomment-1537221990
+scipy==1.9.1
+aiohttp==3.8.4
+numpy<1.27.0,>=1.19.5
+h11<0.13,>=0.11
+jinja2
+sphinx
+sphinx_book_theme
+readthedocs-sphinx-search
+
+Markdown
+sphinx-copybutton
+sphinx-markdown-tables
+
+# use paddlepaddle == 2.3.* according to: https://github.com/PaddlePaddle/Paddle/issues/48243
+paddlepaddle>=2.4.2
diff --git a/docs/server.md b/docs/server.md
new file mode 100644
index 0000000000000000000000000000000000000000..1505790b65c18df8f5785b2df83aa32d728ceeac
--- /dev/null
+++ b/docs/server.md
@@ -0,0 +1,207 @@
+# PaddleNLP SimpleServing
+
+PaddleNLP SimpleServing 是基于 uvicorn 封装的模型部署服务化工具,可以简易部署预训练模型和预训练模型工具 Taskflow。PaddleNLP SimpleServing 具备以下两个特性:
+ - 易用:一行代码即可部署预训练模型和预训练工具Taskflow
+ - 灵活:Handler机制可以快速定制化服务化部署方式
+
+
+## Taskflow部署
+
+Taskflow 是 PaddleNLP 的预训练模型工具,具备开箱即用的特性,同时 Taskflow 支持加载微调后的模型,基于 Taskflow 的服务化方式可以进一步降低使用者的部署难度。PaddleNLP SimpleServing 基于这样的设计需求,设计了一套基于 Taskflow 的快速部署方式。下面从 server 搭建、client 发送请求两方面详细介绍使用方式。
+
+### server 搭建
+
+下面是 Taskflow 搭建服务的简易代码:
+
+```python
+from paddlenlp import SimpleServer, Taskflow
+
+schema = ['出发地', '目的地', '费用', '时间']
+uie = Taskflow("information_extraction", schema=schema)
+app = SimpleServer()
+app.register_taskflow('taskflow/uie', uie)
+```
+这里主要是使用 `SimpleServer` 服务类来注册 Taskflow Server,下面我们具体介绍一下 `register_taskflow` 相关参数:
+
+```text
+def register_taskflow(
+    task_name,
+    task,
+    taskflow_handler=None)
+
+task_name(str):
+    服务化的名称,最终的服务化的URL: https://host:port/{task_name}
+task(paddlenlp.Taskflow or list(paddlenlp.Taskflow)):
+    Taskflow的实例对象,将想要注册的Taskflow任务注册进去,可以是多个Taskflow实例来支持多卡服务化
+taskflow_handler(paddlenlp.server.BaseTaskflowHandler, 可选):
+    Taskflow句柄处理类,可以自定义处理类来定制化Taskflow服务,默认为None,即使用默认的TaskflowHandler
+```
+### 多卡服务化(可选)
+如果机器上有多张卡,可以在 register_taskflow 注册服务时传入多个 Taskflow 实例,服务在处理请求的过程中会做负载均衡,充分利用机器设备。下面是具体的使用例子:
+```python
+from paddlenlp import SimpleServer, Taskflow
+
+schema = ['出发地', '目的地', '费用', '时间']
+uie1 = Taskflow("information_extraction", schema=schema, device_id=0)
+uie2 = Taskflow("information_extraction", schema=schema, device_id=1)
+app = SimpleServer()
+app.register_taskflow('taskflow/uie', [uie1, uie2])
+```
+### 启动服务化
+执行以下命令即可启动服务:
+```
+paddlenlp server server:app --host 0.0.0.0 --port 8189 --workers 1
+```
+服务化整体参数配置如下:
+```text
+--host: 启动服务化的IP地址,通常可以设置成 0.0.0.0
+--port:启动服务化的网络端口
+--workers: 接收服务化的进程数,默认为1
+--log_level:服务化输出日志的级别,默认为 info 级别
+--limit_concurrency:服务化能接受的并发数目,默认为None,即不做限制
+--timeout_keep_alive:保持服务化连接的时间,默认为15s
+--app_dir:服务化本地的路径,默认为服务化启动的位置
+--reload: 当 app_dir的服务化相关配置和代码发生变化时,是否重启server,默认为False
+```
+
+### client 发送
+```python
+import requests
+import json
+
+url = "http://0.0.0.0:8189/taskflow/uie"
+headers = {"Content-Type": "application/json"}
+texts = ["城市内交通费7月5日金额114广州至佛山", "5月9日交通费29元从北苑到望京搜后"]
+data = {
+    "data": {
+        "text": texts,
+    }
+}
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+datas = json.loads(r.text)
+print(datas)
+```
+通过上述代码即可发送 POST 请求,注意请求文本需要放在 `data` 这个 key 下。
+
+同时支持把自定义 `schema` 传入 client 请求中,从而快速切换 `schema`:
+
+```python
+import requests
+import json
+
+url = "http://0.0.0.0:8189/taskflow/uie"
+headers = {"Content-Type": "application/json"}
+texts = ["城市内交通费7月5日金额114广州至佛山", "5月9日交通费29元从北苑到望京搜后"]
+data = {
+    "data": {
+        "text": texts,
+    },
+    "parameters": {
+        "schema": []  # 自定义schema
+    }
+}
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+datas = json.loads(r.text)
+print(datas)
+```
+
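+如果需要在业务代码中反复调用服务,也可以把上述请求封装成一个简单的函数(以下仅为示意,函数名 `query_uie` 为自拟,默认地址与上文启动示例保持一致):
+
+```python
+import json
+
+import requests
+
+
+def query_uie(texts, schema=None, url="http://0.0.0.0:8189/taskflow/uie", timeout=60):
+    """向 SimpleServer 的 Taskflow 服务发送抽取请求;schema 不为 None 时随请求切换抽取目标。"""
+    payload = {"data": {"text": texts}}
+    if schema is not None:
+        payload["parameters"] = {"schema": schema}
+    r = requests.post(url, headers={"Content-Type": "application/json"},
+                      data=json.dumps(payload), timeout=timeout)
+    r.raise_for_status()  # 服务端返回异常状态码时直接抛错,便于排查
+    return r.json()
+
+
+print(query_uie(["城市内交通费7月5日金额114广州至佛山"], schema=["时间", "费用"]))
+```
+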
+## 预训练模型部署
+PaddleNLP SimpleServing 除了支持 Taskflow 的服务化部署,也支持预训练模型的部署:通过简单的配置即可加载预训练模型进行服务化,同时在接口层面支持服务化的扩展,满足模型前后处理的定制化需求。
+
+### server 搭建
+
+下面是预训练模型服务搭建的简易代码:
+```python
+from paddlenlp import SimpleServer
+from paddlenlp.server import CustomModelHandler, MultiClassificationPostHandler
+
+app = SimpleServer()
+app.register('cls_multi_class',
+             model_path="./export",
+             tokenizer_name='ernie-3.0-medium-zh',
+             model_handler=CustomModelHandler,
+             post_handler=MultiClassificationPostHandler)
+```
+
+这里主要是使用 `SimpleServer` 服务类来注册 Transformers Server,下面我们具体介绍一下 `register` 相关参数:
+
+```text
+def register(task_name,
+             model_path,
+             tokenizer_name,
+             model_handler,
+             post_handler,
+             precision='fp32',
+             device_id=0)
+task_name(str):
+    服务化的名称,最终的服务化的URL: https://host:port/{task_name}
+model_path(str):
+    需要部署的模型路径,这里的路径必须是动转静后的模型路径
+model_handler(paddlenlp.server.BaseModelHandler):
+    模型前置处理以及模型预测的Handler类别名字,这里可以继承 BaseModelHandler 自定义处理逻辑
+post_handler(paddlenlp.server.BasePostHandler):
+    模型后置处理的Handler类别名字,这里可以继承 BasePostHandler 自定义处理逻辑
+precision(str):
+    模型的预测精度,默认为fp32;可选fp16,fp16的支持需要以下条件 1) **硬件**: V100、T4、A10、A100/GA100、Jetson AGX Xavier、2080、3080 等显卡 2)**CUDA环境**:确保 CUDA >= 11.2,cuDNN >= 8.1.1 3) **安装依赖**:安装 onnx、onnxruntime-gpu
+device_id(int, list(int)):
+    GPU设备,device_id默认为0;如果有多张显卡,可以设置成list,例如[0, 1]即可支持多卡服务化;CPU设备无需设置。
+```
+- BaseModelHandler继承类:主要是 `CustomModelHandler`,该类的实现可以参考[链接](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/server/handlers/custom_model_handler.py),绝大多数语义理解模型均可使用该继承类
+- BasePostHandler继承类:主要是文本分类的 `MultiClassificationPostHandler`、`MultiLabelClassificationPostHandler`,分别支持多分类、多标签分类,实现代码可以参考[链接](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/server/handlers/cls_post_handler.py);`TokenClsModelHandler` 支持序列标注任务,实现代码可以参考[链接](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/server/handlers/token_model_handler.py)
+
+### 启动服务化
+执行以下命令即可启动服务:
+```
+paddlenlp server server:app --host 0.0.0.0 --port 8189 --workers 1
+```
+服务化整体参数配置如下:
+```text
+--host: 启动服务化的IP地址,通常可以设置成 0.0.0.0
+--port:启动服务化的网络端口,注意不要与已被占用的端口冲突
+--workers: 接收服务化的进程数,默认为1
+--log_level:服务化输出日志的级别,默认为 info 级别
+--limit_concurrency:服务化能接受的并发数目,默认为None,即不做限制
+--timeout_keep_alive:保持服务化连接的时间,默认为15s
+--app_dir:服务化本地的路径,默认为服务化启动的位置
+--reload: 当 app_dir的服务化相关配置和代码发生变化时,是否重启server,默认为False
+```
+
+### 多卡服务化(可选)
+如果机器上有多张卡,只需设置 device_id 即可实现多卡服务化,充分利用机器设备。下面是具体的使用例子:
+```python
+from paddlenlp import SimpleServer
+from paddlenlp.server import CustomModelHandler, MultiClassificationPostHandler
+
+app = SimpleServer()
+app.register('models/cls_multi_class',
+             model_path="../../export",
+             tokenizer_name='ernie-3.0-medium-zh',
+             model_handler=CustomModelHandler,
+             post_handler=MultiClassificationPostHandler,
+             device_id=[0, 1])  # device_id 为 0、1 两张卡
+```
+### client 发送
+```python
+import requests
+import json
+
+url = "http://0.0.0.0:8189/models/cls_multi_class"  # 对应 register 时的 task_name
+headers = {"Content-Type": "application/json"}
+texts = [
+    '黑苦荞茶的功效与作用及食用方法', '交界痣会凸起吗', '检查是否能怀孕挂什么科', '鱼油怎么吃咬破吃还是直接咽下去',
+    '幼儿挑食的生理原因是'
+]
+data = {
+    'data': {
+        'text': texts,
+    },
+    'parameters': {
+        'max_seq_len': 128,
+        'batch_size': 2
+    }
+}
+r = requests.post(url=url, headers=headers, data=json.dumps(data))
+result_json = json.loads(r.text)
+print(result_json)
+```
+在 client 发送请求的过程中,可以传入一些参数来控制服务化的处理逻辑,例如上面的 `max_seq_len` 和 `batch_size` 分别控制服务化处理时的序列长度和 batch size。
+
+## 参考示例
+- [UIE 服务化部署](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie/deploy/serving/simple_serving)
+- 
[文本分类服务化部署](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/text_classification/multi_class/deploy/simple_serving) +- [预训练模型定制化post_handler](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-health/cblue/deploy/serving/simple_serving) diff --git a/docs/source/modules.rst b/docs/source/modules.rst new file mode 100644 index 0000000000000000000000000000000000000000..e470b1c22bcc177a73949cdf6fe379992aea679b --- /dev/null +++ b/docs/source/modules.rst @@ -0,0 +1,7 @@ +paddlenlp +========= + +.. toctree:: + :maxdepth: 4 + + paddlenlp diff --git a/docs/source/paddlenlp.data.collate.rst b/docs/source/paddlenlp.data.collate.rst new file mode 100644 index 0000000000000000000000000000000000000000..a0b8e9058a98198c0d069fbae7990a09294f201c --- /dev/null +++ b/docs/source/paddlenlp.data.collate.rst @@ -0,0 +1,8 @@ +collate +============================= + +.. automodule:: paddlenlp.data.collate + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.data.data_collator.rst b/docs/source/paddlenlp.data.data_collator.rst new file mode 100644 index 0000000000000000000000000000000000000000..1543cdd6eb86c88e170ff3644a1359719bb4468a --- /dev/null +++ b/docs/source/paddlenlp.data.data_collator.rst @@ -0,0 +1,7 @@ +data\_collator +==================================== + +.. automodule:: paddlenlp.data.data_collator + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.data.rst b/docs/source/paddlenlp.data.rst new file mode 100644 index 0000000000000000000000000000000000000000..608f8eff45258c49fa16fba168f401278bc97190 --- /dev/null +++ b/docs/source/paddlenlp.data.rst @@ -0,0 +1,18 @@ +paddlenlp.data +====================== + +.. automodule:: paddlenlp.data + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.data.collate + paddlenlp.data.data_collator + paddlenlp.data.iterator + paddlenlp.data.sampler + paddlenlp.data.tokenizer + paddlenlp.data.vocab diff --git a/docs/source/paddlenlp.data.sampler.rst b/docs/source/paddlenlp.data.sampler.rst new file mode 100644 index 0000000000000000000000000000000000000000..97ef992ab92ec5d705a1c2681e6f2115328fe645 --- /dev/null +++ b/docs/source/paddlenlp.data.sampler.rst @@ -0,0 +1,8 @@ +sampler +============================= + +.. automodule:: paddlenlp.data.sampler + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.data.tokenizer.rst b/docs/source/paddlenlp.data.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..80848dcd8372590944401b1909fc648237209bae --- /dev/null +++ b/docs/source/paddlenlp.data.tokenizer.rst @@ -0,0 +1,8 @@ +tokenizer +=============================== + +.. automodule:: paddlenlp.data.tokenizer + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.data.vocab.rst b/docs/source/paddlenlp.data.vocab.rst new file mode 100644 index 0000000000000000000000000000000000000000..15c436f6f914246270fc63f38c69ed91a71e5832 --- /dev/null +++ b/docs/source/paddlenlp.data.vocab.rst @@ -0,0 +1,8 @@ +vocab +=========================== + +.. 
automodule:: paddlenlp.data.vocab + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.dataaug.base_augment.rst b/docs/source/paddlenlp.dataaug.base_augment.rst new file mode 100644 index 0000000000000000000000000000000000000000..be43595b87cbd4114f147788dd13fd5b919fed93 --- /dev/null +++ b/docs/source/paddlenlp.dataaug.base_augment.rst @@ -0,0 +1,7 @@ +base\_augment +====================================== + +.. automodule:: paddlenlp.dataaug.base_augment + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.dataaug.rst b/docs/source/paddlenlp.dataaug.rst new file mode 100644 index 0000000000000000000000000000000000000000..3df52b5ab89c7f328cddcc605a62e4ed02e9024b --- /dev/null +++ b/docs/source/paddlenlp.dataaug.rst @@ -0,0 +1,17 @@ +paddlenlp.dataaug +========================= + +.. automodule:: paddlenlp.dataaug + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.dataaug.base_augment + paddlenlp.dataaug.word_delete + paddlenlp.dataaug.word_insert + paddlenlp.dataaug.word_substitute + paddlenlp.dataaug.word_swap diff --git a/docs/source/paddlenlp.dataaug.word_delete.rst b/docs/source/paddlenlp.dataaug.word_delete.rst new file mode 100644 index 0000000000000000000000000000000000000000..36f4fd46bb830426437a397242a3297fba04e50c --- /dev/null +++ b/docs/source/paddlenlp.dataaug.word_delete.rst @@ -0,0 +1,7 @@ +word\_delete +===================================== + +.. automodule:: paddlenlp.dataaug.word_delete + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.dataaug.word_insert.rst b/docs/source/paddlenlp.dataaug.word_insert.rst new file mode 100644 index 0000000000000000000000000000000000000000..90aea8210552ca06b00b720e2fd8386019e2a5bf --- /dev/null +++ b/docs/source/paddlenlp.dataaug.word_insert.rst @@ -0,0 +1,7 @@ +word\_insert +===================================== + +.. automodule:: paddlenlp.dataaug.word_insert + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.dataaug.word_substitute.rst b/docs/source/paddlenlp.dataaug.word_substitute.rst new file mode 100644 index 0000000000000000000000000000000000000000..dd343af4cc477a2bfe9c2f77862dbc02f479b49e --- /dev/null +++ b/docs/source/paddlenlp.dataaug.word_substitute.rst @@ -0,0 +1,7 @@ +word\_substitute +========================================= + +.. automodule:: paddlenlp.dataaug.word_substitute + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.dataaug.word_swap.rst b/docs/source/paddlenlp.dataaug.word_swap.rst new file mode 100644 index 0000000000000000000000000000000000000000..bedc8d99621e19e5d10cb4d266773d2948b5eed0 --- /dev/null +++ b/docs/source/paddlenlp.dataaug.word_swap.rst @@ -0,0 +1,7 @@ +word\_swap +=================================== + +.. automodule:: paddlenlp.dataaug.word_swap + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.datasets.dataset.rst b/docs/source/paddlenlp.datasets.dataset.rst new file mode 100644 index 0000000000000000000000000000000000000000..5faa5b46234761bcca1f2a350dbf1764c39836d5 --- /dev/null +++ b/docs/source/paddlenlp.datasets.dataset.rst @@ -0,0 +1,6 @@ +dataset +================================= + +.. 
automodule:: paddlenlp.datasets.dataset + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.datasets.rst b/docs/source/paddlenlp.datasets.rst new file mode 100644 index 0000000000000000000000000000000000000000..6681f0fb142ff7a3579f66e1e5793318fada5851 --- /dev/null +++ b/docs/source/paddlenlp.datasets.rst @@ -0,0 +1,17 @@ +paddlenlp.datasets +========================== + +.. automodule:: paddlenlp.datasets + :members: + :no-undoc-members: + + +.. toctree:: + :maxdepth: 4 + + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.datasets.dataset diff --git a/docs/source/paddlenlp.embeddings.rst b/docs/source/paddlenlp.embeddings.rst new file mode 100644 index 0000000000000000000000000000000000000000..1c85548c46d4429e514154983334fb11683dd41e --- /dev/null +++ b/docs/source/paddlenlp.embeddings.rst @@ -0,0 +1,14 @@ +paddlenlp.embeddings +============================ + +.. automodule:: paddlenlp.embeddings + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.embeddings.constant + paddlenlp.embeddings.token_embedding diff --git a/docs/source/paddlenlp.embeddings.token_embedding.rst b/docs/source/paddlenlp.embeddings.token_embedding.rst new file mode 100644 index 0000000000000000000000000000000000000000..eabb899600e01d9014b3b69dc65d18d95081c83a --- /dev/null +++ b/docs/source/paddlenlp.embeddings.token_embedding.rst @@ -0,0 +1,7 @@ +token\_embedding +============================================ + +.. automodule:: paddlenlp.embeddings.token_embedding + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.experimental.ernie_model.rst b/docs/source/paddlenlp.experimental.ernie_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..607fd5740a1854cc768e748aff6fe52f32d96de3 --- /dev/null +++ b/docs/source/paddlenlp.experimental.ernie_model.rst @@ -0,0 +1,6 @@ +ernie\_model +========================================== + +.. automodule:: paddlenlp.experimental.ernie_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.experimental.faster_tokenizer.rst b/docs/source/paddlenlp.experimental.faster_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..a53da32782808505dea088fa738acebce85550fb --- /dev/null +++ b/docs/source/paddlenlp.experimental.faster_tokenizer.rst @@ -0,0 +1,7 @@ +faster\_tokenizer +=============================================== + +.. automodule:: paddlenlp.experimental.faster_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.experimental.model_utils.rst b/docs/source/paddlenlp.experimental.model_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..07715f9552b4a627978089f61e0311e08d3747b7 --- /dev/null +++ b/docs/source/paddlenlp.experimental.model_utils.rst @@ -0,0 +1,6 @@ +model\_utils +========================================== + +.. automodule:: paddlenlp.experimental.model_utils + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.experimental.rst b/docs/source/paddlenlp.experimental.rst new file mode 100644 index 0000000000000000000000000000000000000000..7018ced0af7bae9ef9ed22fa8fd769ad5c9e5895 --- /dev/null +++ b/docs/source/paddlenlp.experimental.rst @@ -0,0 +1,15 @@ +paddlenlp.experimental +============================== + +.. automodule:: paddlenlp.experimental + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.experimental.ernie_model + paddlenlp.experimental.faster_tokenizer + paddlenlp.experimental.model_utils diff --git a/docs/source/paddlenlp.layers.crf.rst b/docs/source/paddlenlp.layers.crf.rst new file mode 100644 index 0000000000000000000000000000000000000000..ed340a8fea0c51a0b389038d5bd7da882b5ba727 --- /dev/null +++ b/docs/source/paddlenlp.layers.crf.rst @@ -0,0 +1,6 @@ +crf +=========================== + +.. automodule:: paddlenlp.layers.crf + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.layers.rst b/docs/source/paddlenlp.layers.rst new file mode 100644 index 0000000000000000000000000000000000000000..c41c7e41962eb8b5eef8fe6b15028b052d08c74e --- /dev/null +++ b/docs/source/paddlenlp.layers.rst @@ -0,0 +1,15 @@ +paddlenlp.layers +======================== + +.. automodule:: paddlenlp.layers + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.layers.crf + paddlenlp.layers.sequence + paddlenlp.layers.tcn diff --git a/docs/source/paddlenlp.layers.sequence.rst b/docs/source/paddlenlp.layers.sequence.rst new file mode 100644 index 0000000000000000000000000000000000000000..945540340c9cce8c94ea04891c3a2091511011f0 --- /dev/null +++ b/docs/source/paddlenlp.layers.sequence.rst @@ -0,0 +1,7 @@ +sequence +================================ + +.. automodule:: paddlenlp.layers.sequence + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.layers.tcn.rst b/docs/source/paddlenlp.layers.tcn.rst new file mode 100644 index 0000000000000000000000000000000000000000..c7a6784837d1eb7511ddec480aab38b04f8a11be --- /dev/null +++ b/docs/source/paddlenlp.layers.tcn.rst @@ -0,0 +1,6 @@ +tcn +=========================== + +.. automodule:: paddlenlp.layers.tcn + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.losses.rdrop.rst b/docs/source/paddlenlp.losses.rdrop.rst new file mode 100644 index 0000000000000000000000000000000000000000..39ccfd28fb728facb93249991451817134b5c192 --- /dev/null +++ b/docs/source/paddlenlp.losses.rdrop.rst @@ -0,0 +1,6 @@ +rdrop +============================= + +.. automodule:: paddlenlp.losses.rdrop + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.losses.rst b/docs/source/paddlenlp.losses.rst new file mode 100644 index 0000000000000000000000000000000000000000..9c7196336aee4f317dfa931f793b9ceb21fe675a --- /dev/null +++ b/docs/source/paddlenlp.losses.rst @@ -0,0 +1,13 @@ +paddlenlp.losses +======================== + +.. automodule:: paddlenlp.losses + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.losses.rdrop diff --git a/docs/source/paddlenlp.metrics.bleu.rst b/docs/source/paddlenlp.metrics.bleu.rst new file mode 100644 index 0000000000000000000000000000000000000000..610ddf2ccd792425116e4d0db4104e991fee46b3 --- /dev/null +++ b/docs/source/paddlenlp.metrics.bleu.rst @@ -0,0 +1,7 @@ +bleu +============================= + +.. automodule:: paddlenlp.metrics.bleu + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.chunk.rst b/docs/source/paddlenlp.metrics.chunk.rst new file mode 100644 index 0000000000000000000000000000000000000000..b496e40bfced04a440c0fe6c5c91a1f850953380 --- /dev/null +++ b/docs/source/paddlenlp.metrics.chunk.rst @@ -0,0 +1,7 @@ +chunk +============================== + +.. 
automodule:: paddlenlp.metrics.chunk + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.distinct.rst b/docs/source/paddlenlp.metrics.distinct.rst new file mode 100644 index 0000000000000000000000000000000000000000..135b5b0cbdf5db495dd7475e35d9d136fd4fe852 --- /dev/null +++ b/docs/source/paddlenlp.metrics.distinct.rst @@ -0,0 +1,7 @@ +distinct +================================= + +.. automodule:: paddlenlp.metrics.distinct + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.dureader.rst b/docs/source/paddlenlp.metrics.dureader.rst new file mode 100644 index 0000000000000000000000000000000000000000..deeb9372db3f434f4d6a71e86b08c8b6980e9993 --- /dev/null +++ b/docs/source/paddlenlp.metrics.dureader.rst @@ -0,0 +1,7 @@ +dureader +================================= + +.. automodule:: paddlenlp.metrics.dureader + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.glue.rst b/docs/source/paddlenlp.metrics.glue.rst new file mode 100644 index 0000000000000000000000000000000000000000..55c3a972f7a85de74d9981aa031b615af0938954 --- /dev/null +++ b/docs/source/paddlenlp.metrics.glue.rst @@ -0,0 +1,7 @@ +glue +============================= + +.. automodule:: paddlenlp.metrics.glue + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.perplexity.rst b/docs/source/paddlenlp.metrics.perplexity.rst new file mode 100644 index 0000000000000000000000000000000000000000..5a0ebed6b3348fe1db7fedfc2d8c356f8e722514 --- /dev/null +++ b/docs/source/paddlenlp.metrics.perplexity.rst @@ -0,0 +1,7 @@ +perplexity +=================================== + +.. automodule:: paddlenlp.metrics.perplexity + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.rouge.rst b/docs/source/paddlenlp.metrics.rouge.rst new file mode 100644 index 0000000000000000000000000000000000000000..5844354ca18f1b3d63ade339835d40be473e3627 --- /dev/null +++ b/docs/source/paddlenlp.metrics.rouge.rst @@ -0,0 +1,7 @@ +rouge +============================== + +.. automodule:: paddlenlp.metrics.rouge + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.rst b/docs/source/paddlenlp.metrics.rst new file mode 100644 index 0000000000000000000000000000000000000000..995355b941a021163c8e70f7c27fba90a32257d3 --- /dev/null +++ b/docs/source/paddlenlp.metrics.rst @@ -0,0 +1,23 @@ +paddlenlp.metrics +========================= + +.. automodule:: paddlenlp.metrics + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.metrics.bleu + paddlenlp.metrics.chunk + paddlenlp.metrics.distinct + paddlenlp.metrics.dureader + paddlenlp.metrics.glue + paddlenlp.metrics.perplexity + paddlenlp.metrics.rouge + paddlenlp.metrics.sighan + paddlenlp.metrics.span + paddlenlp.metrics.squad + paddlenlp.metrics.utils diff --git a/docs/source/paddlenlp.metrics.sighan.rst b/docs/source/paddlenlp.metrics.sighan.rst new file mode 100644 index 0000000000000000000000000000000000000000..058c86387d138d9499b4861d90f4e276fa2599f0 --- /dev/null +++ b/docs/source/paddlenlp.metrics.sighan.rst @@ -0,0 +1,7 @@ +sighan +=============================== + +.. 
automodule:: paddlenlp.metrics.sighan + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.span.rst b/docs/source/paddlenlp.metrics.span.rst new file mode 100644 index 0000000000000000000000000000000000000000..5476262a60fd98ed8cd558c07d15b99ad7a87713 --- /dev/null +++ b/docs/source/paddlenlp.metrics.span.rst @@ -0,0 +1,7 @@ +span +============================= + +.. automodule:: paddlenlp.metrics.span + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.squad.rst b/docs/source/paddlenlp.metrics.squad.rst new file mode 100644 index 0000000000000000000000000000000000000000..ef2ffabf35c46339537793fd3e6861b4578e51bc --- /dev/null +++ b/docs/source/paddlenlp.metrics.squad.rst @@ -0,0 +1,7 @@ +squad +============================== + +.. automodule:: paddlenlp.metrics.squad + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.metrics.utils.rst b/docs/source/paddlenlp.metrics.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..ff1640ba062c358067661302ac5154b7fe190534 --- /dev/null +++ b/docs/source/paddlenlp.metrics.utils.rst @@ -0,0 +1,7 @@ +utils +============================== + +.. automodule:: paddlenlp.metrics.utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.distributed.parallel.rst b/docs/source/paddlenlp.ops.distributed.parallel.rst new file mode 100644 index 0000000000000000000000000000000000000000..83c36944bdefc2347e751cc748299e3ebffbf33b --- /dev/null +++ b/docs/source/paddlenlp.ops.distributed.parallel.rst @@ -0,0 +1,6 @@ +parallel +========================================= + +.. automodule:: paddlenlp.ops.distributed.parallel + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.distributed.rst b/docs/source/paddlenlp.ops.distributed.rst new file mode 100644 index 0000000000000000000000000000000000000000..733f114bf4ea0350efad768102bec87071dad985 --- /dev/null +++ b/docs/source/paddlenlp.ops.distributed.rst @@ -0,0 +1,18 @@ +distributed +================================= + +.. automodule:: paddlenlp.ops.distributed + :members: + :no-undoc-members: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.distributed.utils + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.distributed.parallel diff --git a/docs/source/paddlenlp.ops.distributed.utils.random.rst b/docs/source/paddlenlp.ops.distributed.utils.random.rst new file mode 100644 index 0000000000000000000000000000000000000000..27370ba9f718241cbadc822ea854253c90d2b38f --- /dev/null +++ b/docs/source/paddlenlp.ops.distributed.utils.random.rst @@ -0,0 +1,6 @@ +random +============================================= + +.. automodule:: paddlenlp.ops.distributed.utils.random + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.distributed.utils.rst b/docs/source/paddlenlp.ops.distributed.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..c7e1c223995d18b4374b7276644f2d6a1b1fa629 --- /dev/null +++ b/docs/source/paddlenlp.ops.distributed.utils.rst @@ -0,0 +1,13 @@ +utils +======================================= + +.. automodule:: paddlenlp.ops.distributed.utils + :members: + :no-undoc-members: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.ops.distributed.utils.random + paddlenlp.ops.distributed.utils.topo diff --git a/docs/source/paddlenlp.ops.distributed.utils.topo.rst b/docs/source/paddlenlp.ops.distributed.utils.topo.rst new file mode 100644 index 0000000000000000000000000000000000000000..6cd8a7aa2ae43ce0f0a2a6ec948980e6b49825cb --- /dev/null +++ b/docs/source/paddlenlp.ops.distributed.utils.topo.rst @@ -0,0 +1,6 @@ +topo +=========================================== + +.. automodule:: paddlenlp.ops.distributed.utils.topo + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.einsum.rst b/docs/source/paddlenlp.ops.einsum.rst new file mode 100644 index 0000000000000000000000000000000000000000..c5907b8e135f1fe3baa68a442f41cc6288a0177a --- /dev/null +++ b/docs/source/paddlenlp.ops.einsum.rst @@ -0,0 +1,7 @@ +einsum +=========================== + +.. automodule:: paddlenlp.ops.einsum + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.ext_utils.rst b/docs/source/paddlenlp.ops.ext_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..a3750293b1bb6463fa38a8226dbe7dd9628942ac --- /dev/null +++ b/docs/source/paddlenlp.ops.ext_utils.rst @@ -0,0 +1,7 @@ +ext\_utils +=============================== + +.. automodule:: paddlenlp.ops.ext_utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.fast_transformer.rst b/docs/source/paddlenlp.ops.fast_transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..02a51ecdcb01a776d80a2345cabaa1ee5f451e8d --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.rst @@ -0,0 +1,13 @@ +fast\_transformer +========================================= + +.. automodule:: paddlenlp.ops.fast_transformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.fast_transformer.transformer diff --git a/docs/source/paddlenlp.ops.fast_transformer.transformer.decoder.rst b/docs/source/paddlenlp.ops.fast_transformer.transformer.decoder.rst new file mode 100644 index 0000000000000000000000000000000000000000..50a3a23fa2386aa90c74b34bf84cc39951310cfc --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.transformer.decoder.rst @@ -0,0 +1,6 @@ +decoder +============================================================ + +.. automodule:: paddlenlp.ops.fast_transformer.transformer.decoder + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.fast_transformer.transformer.decoding.rst b/docs/source/paddlenlp.ops.fast_transformer.transformer.decoding.rst new file mode 100644 index 0000000000000000000000000000000000000000..e8ed3550c06f9062025b39f0d4f469c0c77b423f --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.transformer.decoding.rst @@ -0,0 +1,6 @@ +decoding +============================================================= + +.. automodule:: paddlenlp.ops.fast_transformer.transformer.decoding + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.fast_transformer.transformer.encoder.rst b/docs/source/paddlenlp.ops.fast_transformer.transformer.encoder.rst new file mode 100644 index 0000000000000000000000000000000000000000..5ca4d481f64b083f3797048beee72e3003d46435 --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.transformer.encoder.rst @@ -0,0 +1,7 @@ +encoder +============================================================ + +.. 
automodule:: paddlenlp.ops.fast_transformer.transformer.encoder + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.rst b/docs/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..d57dd44eb85b96479d03a02153b6823227e21bf6 --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.rst @@ -0,0 +1,7 @@ +fast\_transformer +======================================================================== + +.. automodule:: paddlenlp.ops.fast_transformer.transformer.fast_transformer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.fast_transformer.transformer.rst b/docs/source/paddlenlp.ops.fast_transformer.transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..4fade55a679ee2c72a71430e377f4e91102a6a00 --- /dev/null +++ b/docs/source/paddlenlp.ops.fast_transformer.transformer.rst @@ -0,0 +1,16 @@ +transformer +===================================================== + +.. automodule:: paddlenlp.ops.fast_transformer.transformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.fast_transformer.transformer.decoder + paddlenlp.ops.fast_transformer.transformer.decoding + paddlenlp.ops.fast_transformer.transformer.encoder + paddlenlp.ops.fast_transformer.transformer.fast_transformer diff --git a/docs/source/paddlenlp.ops.optimizer.AdamwOptimizer.rst b/docs/source/paddlenlp.ops.optimizer.AdamwOptimizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..88be3e6ddc10b42c96f532bc2e8c553670f33a1c --- /dev/null +++ b/docs/source/paddlenlp.ops.optimizer.AdamwOptimizer.rst @@ -0,0 +1,6 @@ +AdamwOptimizer +============================================= + +.. automodule:: paddlenlp.ops.optimizer.AdamwOptimizer + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.optimizer.adamw.rst b/docs/source/paddlenlp.ops.optimizer.adamw.rst new file mode 100644 index 0000000000000000000000000000000000000000..5aab38563dcd740b4d4540694b35083dfc94b8fa --- /dev/null +++ b/docs/source/paddlenlp.ops.optimizer.adamw.rst @@ -0,0 +1,7 @@ +adamw +==================================== + +.. automodule:: paddlenlp.ops.optimizer.adamw + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.optimizer.adamwdl.rst b/docs/source/paddlenlp.ops.optimizer.adamwdl.rst new file mode 100644 index 0000000000000000000000000000000000000000..78be7f88818fbfccb0db82d413b219b774a26df5 --- /dev/null +++ b/docs/source/paddlenlp.ops.optimizer.adamwdl.rst @@ -0,0 +1,7 @@ +adamwdl +====================================== + +.. automodule:: paddlenlp.ops.optimizer.adamwdl + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.optimizer.ema.rst b/docs/source/paddlenlp.ops.optimizer.ema.rst new file mode 100644 index 0000000000000000000000000000000000000000..b02eb7eb6b073133a6550282406e98709e7098d2 --- /dev/null +++ b/docs/source/paddlenlp.ops.optimizer.ema.rst @@ -0,0 +1,7 @@ +ema +================================== + +.. 
automodule:: paddlenlp.ops.optimizer.ema + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.optimizer.rst b/docs/source/paddlenlp.ops.optimizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..df220f9fd7f89552e8d6f277a01a60cb1ffc9544 --- /dev/null +++ b/docs/source/paddlenlp.ops.optimizer.rst @@ -0,0 +1,14 @@ +optimizer +=============================== + +.. automodule:: paddlenlp.ops.optimizer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.optimizer.adamwdl + paddlenlp.ops.optimizer.ema diff --git a/docs/source/paddlenlp.ops.rst b/docs/source/paddlenlp.ops.rst new file mode 100644 index 0000000000000000000000000000000000000000..831aeb95411e6cd689611cea4b190f233613f335 --- /dev/null +++ b/docs/source/paddlenlp.ops.rst @@ -0,0 +1,22 @@ +paddlenlp.ops +===================== + +.. automodule:: paddlenlp.ops + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.distributed + paddlenlp.ops.fast_transformer + paddlenlp.ops.optimizer + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.einsum + paddlenlp.ops.ext_utils diff --git a/docs/source/paddlenlp.ops.strings.rst b/docs/source/paddlenlp.ops.strings.rst new file mode 100644 index 0000000000000000000000000000000000000000..8ccb7d3a66b3f9e5a1dec024391618ae1c8a4b0c --- /dev/null +++ b/docs/source/paddlenlp.ops.strings.rst @@ -0,0 +1,7 @@ +strings +============================ + +.. automodule:: paddlenlp.ops.strings + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.transformer.decoding.rst b/docs/source/paddlenlp.ops.transformer.decoding.rst new file mode 100644 index 0000000000000000000000000000000000000000..e3966647cb3400017ef55c0863d5878db6eeae08 --- /dev/null +++ b/docs/source/paddlenlp.ops.transformer.decoding.rst @@ -0,0 +1,6 @@ +decoding +========================================= + +.. automodule:: paddlenlp.ops.transformer.decoding + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.ops.transformer.fast_transformer.rst b/docs/source/paddlenlp.ops.transformer.fast_transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..2718eb8f6fc9f4bd1d4af9b182aac42c565d8cb0 --- /dev/null +++ b/docs/source/paddlenlp.ops.transformer.fast_transformer.rst @@ -0,0 +1,7 @@ +fast\_transformer +==================================================== + +.. automodule:: paddlenlp.ops.transformer.fast_transformer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.ops.transformer.rst b/docs/source/paddlenlp.ops.transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..ecab713ecfa8b9622c07eb9b4bd88125e7c8b1ee --- /dev/null +++ b/docs/source/paddlenlp.ops.transformer.rst @@ -0,0 +1,14 @@ +transformer +================================= + +.. automodule:: paddlenlp.ops.transformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.ops.transformer.decoding + paddlenlp.ops.transformer.fast_transformer diff --git a/docs/source/paddlenlp.rst b/docs/source/paddlenlp.rst new file mode 100644 index 0000000000000000000000000000000000000000..fbc7672055fad96c4a03f9b736938eb0a7f0d8f9 --- /dev/null +++ b/docs/source/paddlenlp.rst @@ -0,0 +1,26 @@ +paddlenlp +================= + +.. automodule:: paddlenlp + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.data + paddlenlp.dataaug + paddlenlp.datasets + paddlenlp.embeddings + paddlenlp.experimental + paddlenlp.layers + paddlenlp.losses + paddlenlp.metrics + paddlenlp.ops + paddlenlp.seq2vec + paddlenlp.taskflow + paddlenlp.trainer + paddlenlp.transformers + paddlenlp.utils diff --git a/docs/source/paddlenlp.seq2vec.encoder.rst b/docs/source/paddlenlp.seq2vec.encoder.rst new file mode 100644 index 0000000000000000000000000000000000000000..e2a42957efa67d603f7451766bfe9feb4ce23b25 --- /dev/null +++ b/docs/source/paddlenlp.seq2vec.encoder.rst @@ -0,0 +1,7 @@ +encoder +================================ + +.. automodule:: paddlenlp.seq2vec.encoder + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.seq2vec.rst b/docs/source/paddlenlp.seq2vec.rst new file mode 100644 index 0000000000000000000000000000000000000000..48497e5874f9b7417d9f0d97906d4a0bdd51bf1f --- /dev/null +++ b/docs/source/paddlenlp.seq2vec.rst @@ -0,0 +1,13 @@ +paddlenlp.seq2vec +========================= + +.. automodule:: paddlenlp.seq2vec + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.seq2vec.encoder diff --git a/docs/source/paddlenlp.taskflow.code_generation.rst b/docs/source/paddlenlp.taskflow.code_generation.rst new file mode 100644 index 0000000000000000000000000000000000000000..7fbd7ab20bb097e8270aeda1e309c4c8389c0354 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.code_generation.rst @@ -0,0 +1,7 @@ +code\_generation +========================================== + +.. automodule:: paddlenlp.taskflow.code_generation + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.dependency_parsing.rst b/docs/source/paddlenlp.taskflow.dependency_parsing.rst new file mode 100644 index 0000000000000000000000000000000000000000..a2c7ab93f4dace23c64f997bdad5c90a77cba4e2 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.dependency_parsing.rst @@ -0,0 +1,7 @@ +dependency\_parsing +============================================= + +.. automodule:: paddlenlp.taskflow.dependency_parsing + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.dialogue.rst b/docs/source/paddlenlp.taskflow.dialogue.rst new file mode 100644 index 0000000000000000000000000000000000000000..dbba5586323d9b3d7381ccfa69ab4d1d4468f907 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.dialogue.rst @@ -0,0 +1,7 @@ +dialogue +================================== + +.. automodule:: paddlenlp.taskflow.dialogue + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.information_extraction.rst b/docs/source/paddlenlp.taskflow.information_extraction.rst new file mode 100644 index 0000000000000000000000000000000000000000..bc5cf227ea099b8119a46c35942e7a32220a7afc --- /dev/null +++ b/docs/source/paddlenlp.taskflow.information_extraction.rst @@ -0,0 +1,7 @@ +information\_extraction +================================================= + +.. automodule:: paddlenlp.taskflow.information_extraction + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.knowledge_mining.rst b/docs/source/paddlenlp.taskflow.knowledge_mining.rst new file mode 100644 index 0000000000000000000000000000000000000000..2bc62bf7b6ac1ed60293844f594641010d3da3bb --- /dev/null +++ b/docs/source/paddlenlp.taskflow.knowledge_mining.rst @@ -0,0 +1,7 @@ +knowledge\_mining +=========================================== + +.. 
automodule:: paddlenlp.taskflow.knowledge_mining + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.lexical_analysis.rst b/docs/source/paddlenlp.taskflow.lexical_analysis.rst new file mode 100644 index 0000000000000000000000000000000000000000..0916c249d95ab0a0dd1cdac9acbc5a2dc00df078 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.lexical_analysis.rst @@ -0,0 +1,7 @@ +lexical\_analysis +=========================================== + +.. automodule:: paddlenlp.taskflow.lexical_analysis + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.model.rst b/docs/source/paddlenlp.taskflow.model.rst new file mode 100644 index 0000000000000000000000000000000000000000..a09842dd516df4068ee47b715ac1ff3e1c4496cf --- /dev/null +++ b/docs/source/paddlenlp.taskflow.model.rst @@ -0,0 +1,6 @@ +model +=============================== + +.. automodule:: paddlenlp.taskflow.model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.models.dependency_parsing_model.rst b/docs/source/paddlenlp.taskflow.models.dependency_parsing_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..ef0bc9010245e07957c0b9fadf2680fa1f28d987 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.dependency_parsing_model.rst @@ -0,0 +1,6 @@ +dependency\_parsing\_model +=========================================================== + +.. automodule:: paddlenlp.taskflow.models.dependency_parsing_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.models.information_extraction_model.rst b/docs/source/paddlenlp.taskflow.models.information_extraction_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..78407871c9d6ca762619e997d10c3f4db573176e --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.information_extraction_model.rst @@ -0,0 +1,6 @@ +information\_extraction\_model +=============================================================== + +.. automodule:: paddlenlp.taskflow.models.information_extraction_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.models.lexical_analysis_model.rst b/docs/source/paddlenlp.taskflow.models.lexical_analysis_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..db0eadd0273f68427b8aa3481ab1655cba74822f --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.lexical_analysis_model.rst @@ -0,0 +1,6 @@ +lexical\_analysis\_model +========================================================= + +.. automodule:: paddlenlp.taskflow.models.lexical_analysis_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.models.rst b/docs/source/paddlenlp.taskflow.models.rst new file mode 100644 index 0000000000000000000000000000000000000000..f5966db3a44c0b68affaca7040714bdd1e138efa --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.rst @@ -0,0 +1,16 @@ +models +================================= + +.. automodule:: paddlenlp.taskflow.models + :members: + :no-undoc-members: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.taskflow.models.dependency_parsing_model + paddlenlp.taskflow.models.information_extraction_model + paddlenlp.taskflow.models.lexical_analysis_model + paddlenlp.taskflow.models.sentiment_analysis_model + paddlenlp.taskflow.models.text_correction_model diff --git a/docs/source/paddlenlp.taskflow.models.sentiment_analysis_model.rst b/docs/source/paddlenlp.taskflow.models.sentiment_analysis_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..df299b6af547c3601c06b5ea84cc358929321137 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.sentiment_analysis_model.rst @@ -0,0 +1,6 @@ +sentiment\_analysis\_model +=========================================================== + +.. automodule:: paddlenlp.taskflow.models.sentiment_analysis_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.models.text_correction_model.rst b/docs/source/paddlenlp.taskflow.models.text_correction_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..4231184a34671107931328cae3d25cdeaa2cbb99 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.models.text_correction_model.rst @@ -0,0 +1,6 @@ +text\_correction\_model +======================================================== + +.. automodule:: paddlenlp.taskflow.models.text_correction_model + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.taskflow.named_entity_recognition.rst b/docs/source/paddlenlp.taskflow.named_entity_recognition.rst new file mode 100644 index 0000000000000000000000000000000000000000..3183ae7930c7efd1d548afbf0cf9f4a9c3a1e50b --- /dev/null +++ b/docs/source/paddlenlp.taskflow.named_entity_recognition.rst @@ -0,0 +1,7 @@ +named\_entity\_recognition +==================================================== + +.. automodule:: paddlenlp.taskflow.named_entity_recognition + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.poetry_generation.rst b/docs/source/paddlenlp.taskflow.poetry_generation.rst new file mode 100644 index 0000000000000000000000000000000000000000..092e1838dae2273ce291b7c29ce4130f2fcb7b2e --- /dev/null +++ b/docs/source/paddlenlp.taskflow.poetry_generation.rst @@ -0,0 +1,7 @@ +poetry\_generation +============================================ + +.. automodule:: paddlenlp.taskflow.poetry_generation + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.pos_tagging.rst b/docs/source/paddlenlp.taskflow.pos_tagging.rst new file mode 100644 index 0000000000000000000000000000000000000000..e8251cae44c35aca02ed79d9d4e1a276086c5747 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.pos_tagging.rst @@ -0,0 +1,7 @@ +pos\_tagging +====================================== + +.. automodule:: paddlenlp.taskflow.pos_tagging + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.question_answering.rst b/docs/source/paddlenlp.taskflow.question_answering.rst new file mode 100644 index 0000000000000000000000000000000000000000..3d9a34d72e0ec70590c9cd3222388ad09bb345fa --- /dev/null +++ b/docs/source/paddlenlp.taskflow.question_answering.rst @@ -0,0 +1,7 @@ +question\_answering +============================================= + +.. 
automodule:: paddlenlp.taskflow.question_answering + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.rst b/docs/source/paddlenlp.taskflow.rst new file mode 100644 index 0000000000000000000000000000000000000000..791033a8812b5fd2ee246a5cbba727a15c948c92 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.rst @@ -0,0 +1,37 @@ +paddlenlp.taskflow +========================== + +.. automodule:: paddlenlp.taskflow + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.taskflow.models + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.taskflow.code_generation + paddlenlp.taskflow.dependency_parsing + paddlenlp.taskflow.dialogue + paddlenlp.taskflow.information_extraction + paddlenlp.taskflow.knowledge_mining + paddlenlp.taskflow.lexical_analysis + paddlenlp.taskflow.named_entity_recognition + paddlenlp.taskflow.poetry_generation + paddlenlp.taskflow.pos_tagging + paddlenlp.taskflow.question_answering + paddlenlp.taskflow.sentiment_analysis + paddlenlp.taskflow.task + paddlenlp.taskflow.taskflow + paddlenlp.taskflow.text_to_image + paddlenlp.taskflow.text_correction + paddlenlp.taskflow.text_generation + paddlenlp.taskflow.text_similarity + paddlenlp.taskflow.utils + paddlenlp.taskflow.word_segmentation diff --git a/docs/source/paddlenlp.taskflow.sentiment_analysis.rst b/docs/source/paddlenlp.taskflow.sentiment_analysis.rst new file mode 100644 index 0000000000000000000000000000000000000000..4d539b970840759024e2a495be2fda2c119f07d6 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.sentiment_analysis.rst @@ -0,0 +1,7 @@ +sentiment\_analysis +============================================= + +.. automodule:: paddlenlp.taskflow.sentiment_analysis + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.task.rst b/docs/source/paddlenlp.taskflow.task.rst new file mode 100644 index 0000000000000000000000000000000000000000..64804791adf32ea03be498df693670479ca0b16f --- /dev/null +++ b/docs/source/paddlenlp.taskflow.task.rst @@ -0,0 +1,7 @@ +task +============================== + +.. automodule:: paddlenlp.taskflow.task + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.taskflow.rst b/docs/source/paddlenlp.taskflow.taskflow.rst new file mode 100644 index 0000000000000000000000000000000000000000..6244b2c54055fd666f13b140dfe89805d0829f81 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.taskflow.rst @@ -0,0 +1,7 @@ +taskflow +================================== + +.. automodule:: paddlenlp.taskflow.taskflow + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.text2knowledge.rst b/docs/source/paddlenlp.taskflow.text2knowledge.rst new file mode 100644 index 0000000000000000000000000000000000000000..e855e1520dd9f530baa295068eb9f3d9679fbf1e --- /dev/null +++ b/docs/source/paddlenlp.taskflow.text2knowledge.rst @@ -0,0 +1,7 @@ +text2knowledge +======================================== + +.. automodule:: paddlenlp.taskflow.text2knowledge + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.text_correction.rst b/docs/source/paddlenlp.taskflow.text_correction.rst new file mode 100644 index 0000000000000000000000000000000000000000..d3737ea366562065774da14617f7fabcd42edca1 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.text_correction.rst @@ -0,0 +1,7 @@ +text\_correction +========================================== + +.. 
automodule:: paddlenlp.taskflow.text_correction + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.text_generation.rst b/docs/source/paddlenlp.taskflow.text_generation.rst new file mode 100644 index 0000000000000000000000000000000000000000..ff69b87fbea13400ab2af3e7b236f871dadc8f9d --- /dev/null +++ b/docs/source/paddlenlp.taskflow.text_generation.rst @@ -0,0 +1,7 @@ +text\_generation +========================================== + +.. automodule:: paddlenlp.taskflow.text_generation + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.text_similarity.rst b/docs/source/paddlenlp.taskflow.text_similarity.rst new file mode 100644 index 0000000000000000000000000000000000000000..0b3dde8dad74a5bed0682919e760382e13f98b90 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.text_similarity.rst @@ -0,0 +1,7 @@ +text\_similarity +========================================== + +.. automodule:: paddlenlp.taskflow.text_similarity + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.text_to_image.rst b/docs/source/paddlenlp.taskflow.text_to_image.rst new file mode 100644 index 0000000000000000000000000000000000000000..fa834e27471dc3bb17118bad26393df3f197aa9d --- /dev/null +++ b/docs/source/paddlenlp.taskflow.text_to_image.rst @@ -0,0 +1,7 @@ +text\_to\_image +================================================ + +.. automodule:: paddlenlp.taskflow.text_to_image + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.utils.rst b/docs/source/paddlenlp.taskflow.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..097a299a56d020d47a64d54565db59c04efee976 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.utils.rst @@ -0,0 +1,7 @@ +utils +=============================== + +.. automodule:: paddlenlp.taskflow.utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.taskflow.word_segmentation.rst b/docs/source/paddlenlp.taskflow.word_segmentation.rst new file mode 100644 index 0000000000000000000000000000000000000000..c7c6090f145e177cf1ee55a55573bf6b2a1da302 --- /dev/null +++ b/docs/source/paddlenlp.taskflow.word_segmentation.rst @@ -0,0 +1,7 @@ +word\_segmentation +============================================ + +.. automodule:: paddlenlp.taskflow.word_segmentation + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.argparser.rst b/docs/source/paddlenlp.trainer.argparser.rst new file mode 100644 index 0000000000000000000000000000000000000000..80c48c001e69ca104655324ec5fd52eb3c972cd5 --- /dev/null +++ b/docs/source/paddlenlp.trainer.argparser.rst @@ -0,0 +1,7 @@ +argparser +================================== + +.. automodule:: paddlenlp.trainer.argparser + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.integrations.rst b/docs/source/paddlenlp.trainer.integrations.rst new file mode 100644 index 0000000000000000000000000000000000000000..e16235dedb7b8602bfb8c1f249576ff201a290c7 --- /dev/null +++ b/docs/source/paddlenlp.trainer.integrations.rst @@ -0,0 +1,7 @@ +integrations +===================================== + +.. 
automodule:: paddlenlp.trainer.integrations + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.rst b/docs/source/paddlenlp.trainer.rst new file mode 100644 index 0000000000000000000000000000000000000000..af33d53c245bd0226880f31b1f5a212a8b485b17 --- /dev/null +++ b/docs/source/paddlenlp.trainer.rst @@ -0,0 +1,24 @@ +paddlenlp.trainer +========================= + +.. automodule:: paddlenlp.trainer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.trainer.utils + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.trainer.argparser + paddlenlp.trainer.integrations + paddlenlp.trainer.trainer_base + paddlenlp.trainer.trainer_callback + paddlenlp.trainer.trainer_utils + paddlenlp.trainer.training_args diff --git a/docs/source/paddlenlp.trainer.trainer_base.rst b/docs/source/paddlenlp.trainer.trainer_base.rst new file mode 100644 index 0000000000000000000000000000000000000000..4378dd75d758ce0c587b9f42fd27d5806228f8d3 --- /dev/null +++ b/docs/source/paddlenlp.trainer.trainer_base.rst @@ -0,0 +1,7 @@ +trainer\_base +====================================== + +.. automodule:: paddlenlp.trainer.trainer_base + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.trainer_callback.rst b/docs/source/paddlenlp.trainer.trainer_callback.rst new file mode 100644 index 0000000000000000000000000000000000000000..f4de42a59b965b2c54581d89972b426bcbf46eff --- /dev/null +++ b/docs/source/paddlenlp.trainer.trainer_callback.rst @@ -0,0 +1,7 @@ +trainer\_callback +========================================== + +.. automodule:: paddlenlp.trainer.trainer_callback + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.trainer_utils.rst b/docs/source/paddlenlp.trainer.trainer_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..c58e4c52d91a850b2874793f886b48bc4cc40d95 --- /dev/null +++ b/docs/source/paddlenlp.trainer.trainer_utils.rst @@ -0,0 +1,7 @@ +trainer\_utils +======================================= + +.. automodule:: paddlenlp.trainer.trainer_utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.training_args.rst b/docs/source/paddlenlp.trainer.training_args.rst new file mode 100644 index 0000000000000000000000000000000000000000..e4909f9aef910e818057d6807a0a906991d04975 --- /dev/null +++ b/docs/source/paddlenlp.trainer.training_args.rst @@ -0,0 +1,7 @@ +training\_args +======================================= + +.. automodule:: paddlenlp.trainer.training_args + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.utils.helper.rst b/docs/source/paddlenlp.trainer.utils.helper.rst new file mode 100644 index 0000000000000000000000000000000000000000..2d8b6f6e1e2580da788a02b54fcf96ed891ebf1f --- /dev/null +++ b/docs/source/paddlenlp.trainer.utils.helper.rst @@ -0,0 +1,7 @@ +helper +===================================== + +.. automodule:: paddlenlp.trainer.utils.helper + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.trainer.utils.rst b/docs/source/paddlenlp.trainer.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..de6a9daff5160eeabd5f253aa11d4a3bd71326fc --- /dev/null +++ b/docs/source/paddlenlp.trainer.utils.rst @@ -0,0 +1,13 @@ +utils +=============================== + +.. 
automodule:: paddlenlp.trainer.utils + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.trainer.utils.helper diff --git a/docs/source/paddlenlp.transformers.albert.modeling.rst b/docs/source/paddlenlp.transformers.albert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..0a7948f40a04f4ba99fd3bb287b695727071fe9f --- /dev/null +++ b/docs/source/paddlenlp.transformers.albert.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================= + +.. automodule:: paddlenlp.transformers.albert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.albert.rst b/docs/source/paddlenlp.transformers.albert.rst new file mode 100644 index 0000000000000000000000000000000000000000..0c0da687851368c5a9e9ffefa195929f1fd4cd51 --- /dev/null +++ b/docs/source/paddlenlp.transformers.albert.rst @@ -0,0 +1,14 @@ +albert +===================================== + +.. automodule:: paddlenlp.transformers.albert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.albert.modeling + paddlenlp.transformers.albert.tokenizer diff --git a/docs/source/paddlenlp.transformers.albert.tokenizer.rst b/docs/source/paddlenlp.transformers.albert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..3feef349131446cd5d7ab1b6b3d20e4c8c29d0e9 --- /dev/null +++ b/docs/source/paddlenlp.transformers.albert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================== + +.. automodule:: paddlenlp.transformers.albert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.artist.modeling.rst b/docs/source/paddlenlp.transformers.artist.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..751786f8b99fd5e2c5f3a126caef2d990769fe90 --- /dev/null +++ b/docs/source/paddlenlp.transformers.artist.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================= + +.. automodule:: paddlenlp.transformers.artist.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.artist.rst b/docs/source/paddlenlp.transformers.artist.rst new file mode 100644 index 0000000000000000000000000000000000000000..cacfe699e3e58ecabefb1b1ba2bb2236c1dcc8d7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.artist.rst @@ -0,0 +1,14 @@ +artist +===================================== + +.. automodule:: paddlenlp.transformers.artist + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.artist.modeling + paddlenlp.transformers.artist.tokenizer diff --git a/docs/source/paddlenlp.transformers.artist.tokenizer.rst b/docs/source/paddlenlp.transformers.artist.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..0e1a05f005ae64a3b115f2182ee1e53def1ab393 --- /dev/null +++ b/docs/source/paddlenlp.transformers.artist.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================== + +.. 
automodule:: paddlenlp.transformers.artist.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.attention_utils.rst b/docs/source/paddlenlp.transformers.attention_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..9cbf2e8705edbd8aadc957ffca09f8e766465275 --- /dev/null +++ b/docs/source/paddlenlp.transformers.attention_utils.rst @@ -0,0 +1,6 @@ +attention\_utils +============================================== + +.. automodule:: paddlenlp.transformers.attention_utils + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.transformers.auto.modeling.rst b/docs/source/paddlenlp.transformers.auto.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..b8778dd70d0d32a963281b9093a2903b05d1f091 --- /dev/null +++ b/docs/source/paddlenlp.transformers.auto.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.auto.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.auto.rst b/docs/source/paddlenlp.transformers.auto.rst new file mode 100644 index 0000000000000000000000000000000000000000..7ff49719749f11438d9839c65abffe4b40c9fc3f --- /dev/null +++ b/docs/source/paddlenlp.transformers.auto.rst @@ -0,0 +1,14 @@ +auto +=================================== + +.. automodule:: paddlenlp.transformers.auto + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.auto.modeling + paddlenlp.transformers.auto.tokenizer diff --git a/docs/source/paddlenlp.transformers.auto.tokenizer.rst b/docs/source/paddlenlp.transformers.auto.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..3cf2c590985a57d7ab91d33eb8d300c5f5401e11 --- /dev/null +++ b/docs/source/paddlenlp.transformers.auto.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.auto.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bart.modeling.rst b/docs/source/paddlenlp.transformers.bart.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..4ed15ed2a64d4bfead9f58371f49f656d6c7be32 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bart.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.bart.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bart.rst b/docs/source/paddlenlp.transformers.bart.rst new file mode 100644 index 0000000000000000000000000000000000000000..de955a2d3af8ff62a405a98b77420b6956cf9fbc --- /dev/null +++ b/docs/source/paddlenlp.transformers.bart.rst @@ -0,0 +1,14 @@ +bart +=================================== + +.. automodule:: paddlenlp.transformers.bart + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.bart.modeling + paddlenlp.transformers.bart.tokenizer diff --git a/docs/source/paddlenlp.transformers.bart.tokenizer.rst b/docs/source/paddlenlp.transformers.bart.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..6fd4c7a597f5f666ae1ea1d67c051c48d0149a34 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bart.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. 
automodule:: paddlenlp.transformers.bart.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bert.fast_tokenizer.rst b/docs/source/paddlenlp.transformers.bert.fast_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..d102c88993748c7335bbe0f915e9b0081ff79d24 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert.fast_tokenizer.rst @@ -0,0 +1,7 @@ +fast\_tokenizer +==================================================== + +.. automodule:: paddlenlp.transformers.bert.fast_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bert.modeling.rst b/docs/source/paddlenlp.transformers.bert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..01b8b37ded3989b378cfed8926b5aaede3db8538 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.bert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bert.rst b/docs/source/paddlenlp.transformers.bert.rst new file mode 100644 index 0000000000000000000000000000000000000000..bca1ffa00ecada671c30c74719ec6a9c6e0724d9 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert.rst @@ -0,0 +1,15 @@ +bert +=================================== + +.. automodule:: paddlenlp.transformers.bert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.bert.fast_tokenizer + paddlenlp.transformers.bert.modeling + paddlenlp.transformers.bert.tokenizer diff --git a/docs/source/paddlenlp.transformers.bert.tokenizer.rst b/docs/source/paddlenlp.transformers.bert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..f451e3d3e510408c53f2384d3461b90b6645ea03 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.bert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.rst b/docs/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.rst new file mode 100644 index 0000000000000000000000000000000000000000..c8608818dceb162037fb8fd4bfcd095bc2be3e89 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert_japanese.convert_bert_japanese_params.rst @@ -0,0 +1,7 @@ +convert\_bert\_japanese\_params +============================================================================ + +.. automodule:: paddlenlp.transformers.bert_japanese.convert_bert_japanese_params + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bert_japanese.rst b/docs/source/paddlenlp.transformers.bert_japanese.rst new file mode 100644 index 0000000000000000000000000000000000000000..4e0a617de60829ea12233d06faab6688b5260ea5 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert_japanese.rst @@ -0,0 +1,13 @@ +bert\_japanese +============================================= + +.. automodule:: paddlenlp.transformers.bert_japanese + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.bert_japanese.tokenizer diff --git a/docs/source/paddlenlp.transformers.bert_japanese.tokenizer.rst b/docs/source/paddlenlp.transformers.bert_japanese.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..6f78bbf99884782c655522c30510ca3931954841 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bert_japanese.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +====================================================== + +.. automodule:: paddlenlp.transformers.bert_japanese.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bigbird.modeling.rst b/docs/source/paddlenlp.transformers.bigbird.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..44f82bbe22905d1fbcd038f6b8d4813b45f9a9da --- /dev/null +++ b/docs/source/paddlenlp.transformers.bigbird.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================== + +.. automodule:: paddlenlp.transformers.bigbird.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.bigbird.rst b/docs/source/paddlenlp.transformers.bigbird.rst new file mode 100644 index 0000000000000000000000000000000000000000..21fb8e6e8e0e2c09520913d48102557a915de448 --- /dev/null +++ b/docs/source/paddlenlp.transformers.bigbird.rst @@ -0,0 +1,14 @@ +bigbird +====================================== + +.. automodule:: paddlenlp.transformers.bigbird + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.bigbird.modeling + paddlenlp.transformers.bigbird.tokenizer diff --git a/docs/source/paddlenlp.transformers.bigbird.tokenizer.rst b/docs/source/paddlenlp.transformers.bigbird.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..34e240e414f5744894579830c58a85044745ab7f --- /dev/null +++ b/docs/source/paddlenlp.transformers.bigbird.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=============================================== + +.. automodule:: paddlenlp.transformers.bigbird.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.blenderbot.modeling.rst b/docs/source/paddlenlp.transformers.blenderbot.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..a52aa9d2e4347024a37b3e8aee0381e612d35eba --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.blenderbot.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.blenderbot.rst b/docs/source/paddlenlp.transformers.blenderbot.rst new file mode 100644 index 0000000000000000000000000000000000000000..b7f3e335a0ddeaf3e67c93804e62b2ac0d15f5cb --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot.rst @@ -0,0 +1,14 @@ +blenderbot +========================================= + +.. automodule:: paddlenlp.transformers.blenderbot + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.blenderbot.modeling + paddlenlp.transformers.blenderbot.tokenizer diff --git a/docs/source/paddlenlp.transformers.blenderbot.tokenizer.rst b/docs/source/paddlenlp.transformers.blenderbot.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..3f74eeaad3fe1c6693c6ea79c4b6b3fd8b2da6b9 --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.blenderbot.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.blenderbot_small.modeling.rst b/docs/source/paddlenlp.transformers.blenderbot_small.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..e6b486f4b2930483b37806e6b37ce98728bcb5c5 --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot_small.modeling.rst @@ -0,0 +1,7 @@ +modeling +======================================================== + +.. automodule:: paddlenlp.transformers.blenderbot_small.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.blenderbot_small.rst b/docs/source/paddlenlp.transformers.blenderbot_small.rst new file mode 100644 index 0000000000000000000000000000000000000000..13736397b4542c20cd8d0826c43ae2d6378a0bc8 --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot_small.rst @@ -0,0 +1,14 @@ +blenderbot\_small +================================================ + +.. automodule:: paddlenlp.transformers.blenderbot_small + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.blenderbot_small.modeling + paddlenlp.transformers.blenderbot_small.tokenizer diff --git a/docs/source/paddlenlp.transformers.blenderbot_small.tokenizer.rst b/docs/source/paddlenlp.transformers.blenderbot_small.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..496c3e03f03b90b22bda57e3ddd61b8040aaaefe --- /dev/null +++ b/docs/source/paddlenlp.transformers.blenderbot_small.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +========================================================= + +.. automodule:: paddlenlp.transformers.blenderbot_small.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.chinesebert.modeling.rst b/docs/source/paddlenlp.transformers.chinesebert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..a93209e0655e689391fcc0fb22d9335833de3024 --- /dev/null +++ b/docs/source/paddlenlp.transformers.chinesebert.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================== + +.. automodule:: paddlenlp.transformers.chinesebert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.chinesebert.rst b/docs/source/paddlenlp.transformers.chinesebert.rst new file mode 100644 index 0000000000000000000000000000000000000000..fa8cad8e88a91f6a7922049ed34d069a1cd3150e --- /dev/null +++ b/docs/source/paddlenlp.transformers.chinesebert.rst @@ -0,0 +1,14 @@ +chinesebert +========================================== + +.. automodule:: paddlenlp.transformers.chinesebert + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.chinesebert.modeling + paddlenlp.transformers.chinesebert.tokenizer diff --git a/docs/source/paddlenlp.transformers.chinesebert.tokenizer.rst b/docs/source/paddlenlp.transformers.chinesebert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..9521fc2adb2dfd3b3f5b555f38de0b865aed5bcb --- /dev/null +++ b/docs/source/paddlenlp.transformers.chinesebert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=================================================== + +.. automodule:: paddlenlp.transformers.chinesebert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.codegen.modeling.rst b/docs/source/paddlenlp.transformers.codegen.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..94bf3c40080b8b7e90098a28d926108a615775fc --- /dev/null +++ b/docs/source/paddlenlp.transformers.codegen.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================== + +.. automodule:: paddlenlp.transformers.codegen.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.codegen.rst b/docs/source/paddlenlp.transformers.codegen.rst new file mode 100644 index 0000000000000000000000000000000000000000..06b98a2db02c13665327b7877a68258fadf9d847 --- /dev/null +++ b/docs/source/paddlenlp.transformers.codegen.rst @@ -0,0 +1,14 @@ +codegen +====================================== + +.. automodule:: paddlenlp.transformers.codegen + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.codegen.modeling + paddlenlp.transformers.codegen.tokenizer diff --git a/docs/source/paddlenlp.transformers.codegen.tokenizer.rst b/docs/source/paddlenlp.transformers.codegen.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..49ebd1510971e953085113f70aa3b03bafa14cd8 --- /dev/null +++ b/docs/source/paddlenlp.transformers.codegen.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=============================================== + +.. automodule:: paddlenlp.transformers.codegen.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.convbert.modeling.rst b/docs/source/paddlenlp.transformers.convbert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..f24b7e95f7520a5acc731b7bf370fc4de7e8ac5b --- /dev/null +++ b/docs/source/paddlenlp.transformers.convbert.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.convbert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.convbert.rst b/docs/source/paddlenlp.transformers.convbert.rst new file mode 100644 index 0000000000000000000000000000000000000000..93b396b679e9d7015064ba6d157656c48a725b6a --- /dev/null +++ b/docs/source/paddlenlp.transformers.convbert.rst @@ -0,0 +1,14 @@ +convbert +======================================= + +.. automodule:: paddlenlp.transformers.convbert + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.convbert.modeling + paddlenlp.transformers.convbert.tokenizer diff --git a/docs/source/paddlenlp.transformers.convbert.tokenizer.rst b/docs/source/paddlenlp.transformers.convbert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..715b3fad7b7d0ea9f53b9c225c5e63d066501f10 --- /dev/null +++ b/docs/source/paddlenlp.transformers.convbert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. automodule:: paddlenlp.transformers.convbert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.convert_slow_tokenizer.rst b/docs/source/paddlenlp.transformers.convert_slow_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..c8e4e4b576fb6dbbf8bf90c46b18717afb382166 --- /dev/null +++ b/docs/source/paddlenlp.transformers.convert_slow_tokenizer.rst @@ -0,0 +1,7 @@ +convert\_slow\_tokenizer +====================================================== + +.. automodule:: paddlenlp.transformers.convert_slow_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ctrl.modeling.rst b/docs/source/paddlenlp.transformers.ctrl.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..2a00824efcc49fca708324f7143dca58fc04199a --- /dev/null +++ b/docs/source/paddlenlp.transformers.ctrl.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.ctrl.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ctrl.rst b/docs/source/paddlenlp.transformers.ctrl.rst new file mode 100644 index 0000000000000000000000000000000000000000..ac550371b5a781faeeb6b8b74171fb965ff927e9 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ctrl.rst @@ -0,0 +1,14 @@ +ctrl +=================================== + +.. automodule:: paddlenlp.transformers.ctrl + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ctrl.modeling + paddlenlp.transformers.ctrl.tokenizer diff --git a/docs/source/paddlenlp.transformers.ctrl.tokenizer.rst b/docs/source/paddlenlp.transformers.ctrl.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..ae6909f8367624a1ef42a6e093dae955161b6850 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ctrl.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.ctrl.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.dallebart.modeling.rst b/docs/source/paddlenlp.transformers.dallebart.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..43b9008645f6bdb0055d2cf800be51f30a70073a --- /dev/null +++ b/docs/source/paddlenlp.transformers.dallebart.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================ + +.. 
automodule:: paddlenlp.transformers.dallebart.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.dallebart.rst b/docs/source/paddlenlp.transformers.dallebart.rst new file mode 100644 index 0000000000000000000000000000000000000000..b4c076fe1095d2ebf470f5d44d14efff6cbab5a8 --- /dev/null +++ b/docs/source/paddlenlp.transformers.dallebart.rst @@ -0,0 +1,14 @@ +dallebart +======================================== + +.. automodule:: paddlenlp.transformers.dallebart + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.dallebart.modeling + paddlenlp.transformers.dallebart.tokenizer diff --git a/docs/source/paddlenlp.transformers.dallebart.tokenizer.rst b/docs/source/paddlenlp.transformers.dallebart.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..dc6ce082462a55b60f0ceb98800e4acf138d8728 --- /dev/null +++ b/docs/source/paddlenlp.transformers.dallebart.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================= + +.. automodule:: paddlenlp.transformers.dallebart.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.distilbert.modeling.rst b/docs/source/paddlenlp.transformers.distilbert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..d269bf4a4ccdc7eceafcfd231054e945587347d2 --- /dev/null +++ b/docs/source/paddlenlp.transformers.distilbert.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.distilbert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.distilbert.rst b/docs/source/paddlenlp.transformers.distilbert.rst new file mode 100644 index 0000000000000000000000000000000000000000..09ed79d561cbe7c2816e10ff50dd809dc4e11397 --- /dev/null +++ b/docs/source/paddlenlp.transformers.distilbert.rst @@ -0,0 +1,14 @@ +distilbert +========================================= + +.. automodule:: paddlenlp.transformers.distilbert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.distilbert.modeling + paddlenlp.transformers.distilbert.tokenizer diff --git a/docs/source/paddlenlp.transformers.distilbert.tokenizer.rst b/docs/source/paddlenlp.transformers.distilbert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..490daa508c15737a3b363c2e032bbbc16b1545a7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.distilbert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.distilbert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.distill_utils.rst b/docs/source/paddlenlp.transformers.distill_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..776b2c9fb4d6f3e456a6c920ea9c6ce3b83508f1 --- /dev/null +++ b/docs/source/paddlenlp.transformers.distill_utils.rst @@ -0,0 +1,7 @@ +distill\_utils +============================================ + +.. 
automodule:: paddlenlp.transformers.distill_utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.electra.modeling.rst b/docs/source/paddlenlp.transformers.electra.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..c31be0555a89424aca414f20f9c2663cc5eb00a6 --- /dev/null +++ b/docs/source/paddlenlp.transformers.electra.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================== + +.. automodule:: paddlenlp.transformers.electra.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.electra.rst b/docs/source/paddlenlp.transformers.electra.rst new file mode 100644 index 0000000000000000000000000000000000000000..9c9c44eaa6fddd9ea94bbec6352978750daa20b0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.electra.rst @@ -0,0 +1,14 @@ +electra +====================================== + +.. automodule:: paddlenlp.transformers.electra + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.electra.modeling + paddlenlp.transformers.electra.tokenizer diff --git a/docs/source/paddlenlp.transformers.electra.tokenizer.rst b/docs/source/paddlenlp.transformers.electra.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..870b37825fb1727eda7871195a1fb2163b78b61d --- /dev/null +++ b/docs/source/paddlenlp.transformers.electra.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=============================================== + +.. automodule:: paddlenlp.transformers.electra.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie.fast_tokenizer.rst b/docs/source/paddlenlp.transformers.ernie.fast_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..d076125bafe460bf74cde7a5eb802c5612608c05 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie.fast_tokenizer.rst @@ -0,0 +1,7 @@ +fast\_tokenizer +===================================================== + +.. automodule:: paddlenlp.transformers.ernie.fast_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie.modeling.rst b/docs/source/paddlenlp.transformers.ernie.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..380a8d64cfcb5f983f34aad1e2a01802cf070658 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.ernie.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie.rst b/docs/source/paddlenlp.transformers.ernie.rst new file mode 100644 index 0000000000000000000000000000000000000000..7444d131dbb07cb2e941c1fc8bb1ff8496f7a77f --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie.rst @@ -0,0 +1,15 @@ +ernie +==================================== + +.. automodule:: paddlenlp.transformers.ernie + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie.fast_tokenizer + paddlenlp.transformers.ernie.modeling + paddlenlp.transformers.ernie.tokenizer diff --git a/docs/source/paddlenlp.transformers.ernie.tokenizer.rst b/docs/source/paddlenlp.transformers.ernie.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..fb6c1fa1c503f7e94e518897a8db7ef26df7e956 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. automodule:: paddlenlp.transformers.ernie.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_ctm.modeling.rst b/docs/source/paddlenlp.transformers.ernie_ctm.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..3b8770cddca47fa9800ee1b7fc894b3009d3fab2 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_ctm.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.ernie_ctm.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_ctm.rst b/docs/source/paddlenlp.transformers.ernie_ctm.rst new file mode 100644 index 0000000000000000000000000000000000000000..f1494fa75c0eb3b32dc383835a1ad7a9aaa46c4e --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_ctm.rst @@ -0,0 +1,14 @@ +ernie\_ctm +========================================= + +.. automodule:: paddlenlp.transformers.ernie_ctm + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie_ctm.modeling + paddlenlp.transformers.ernie_ctm.tokenizer diff --git a/docs/source/paddlenlp.transformers.ernie_ctm.tokenizer.rst b/docs/source/paddlenlp.transformers.ernie_ctm.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..15a8daed187666dcca9bc90a146f3e2343fe590a --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_ctm.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.ernie_ctm.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_doc.modeling.rst b/docs/source/paddlenlp.transformers.ernie_doc.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..e8fd915642d1fed055f8a9afda23a59ebc203f3f --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_doc.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.ernie_doc.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_doc.rst b/docs/source/paddlenlp.transformers.ernie_doc.rst new file mode 100644 index 0000000000000000000000000000000000000000..8cf67597ebca15a64c06e76d1eb97302ad8acb60 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_doc.rst @@ -0,0 +1,14 @@ +ernie\_doc +========================================= + +.. automodule:: paddlenlp.transformers.ernie_doc + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie_doc.modeling + paddlenlp.transformers.ernie_doc.tokenizer diff --git a/docs/source/paddlenlp.transformers.ernie_doc.tokenizer.rst b/docs/source/paddlenlp.transformers.ernie_doc.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..cdf63976128cd831c044185574c3b6a00e9a75b0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_doc.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.ernie_doc.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_gen.modeling.rst b/docs/source/paddlenlp.transformers.ernie_gen.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..b33a1d0abc2d9f34a48d5fc29f4a838257a1dc34 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gen.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.ernie_gen.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_gen.rst b/docs/source/paddlenlp.transformers.ernie_gen.rst new file mode 100644 index 0000000000000000000000000000000000000000..be4f6f3b853f728a0d0e0adb129f18807471a8fe --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gen.rst @@ -0,0 +1,13 @@ +ernie\_gen +========================================= + +.. automodule:: paddlenlp.transformers.ernie_gen + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie_gen.modeling diff --git a/docs/source/paddlenlp.transformers.ernie_gram.matching_param_name.rst b/docs/source/paddlenlp.transformers.ernie_gram.matching_param_name.rst new file mode 100644 index 0000000000000000000000000000000000000000..7968e19f2db34028fd6c63f50d257fd55c03055c --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gram.matching_param_name.rst @@ -0,0 +1,7 @@ +matching\_param\_name +=============================================================== + +.. automodule:: paddlenlp.transformers.ernie_gram.matching_param_name + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_gram.modeling.rst b/docs/source/paddlenlp.transformers.ernie_gram.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..3053c06384527dbc5ce95d58563aab49655586b8 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gram.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================== + +.. automodule:: paddlenlp.transformers.ernie_gram.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_gram.rst b/docs/source/paddlenlp.transformers.ernie_gram.rst new file mode 100644 index 0000000000000000000000000000000000000000..2e615be1548dedb02e3ac20f24e7b103704f8e6c --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gram.rst @@ -0,0 +1,15 @@ +ernie\_gram +========================================== + +.. automodule:: paddlenlp.transformers.ernie_gram + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie_gram.matching_param_name + paddlenlp.transformers.ernie_gram.modeling + paddlenlp.transformers.ernie_gram.tokenizer diff --git a/docs/source/paddlenlp.transformers.ernie_gram.tokenizer.rst b/docs/source/paddlenlp.transformers.ernie_gram.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..459dba4fc07658cd499b1f4ab0afc78b95388be8 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_gram.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=================================================== + +.. automodule:: paddlenlp.transformers.ernie_gram.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_m.fast_tokenizer.rst b/docs/source/paddlenlp.transformers.ernie_m.fast_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..a0c7dcb07bababd9af9006af9e2f620906e34fe7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_m.fast_tokenizer.rst @@ -0,0 +1,7 @@ +fast\_tokenizer +====================================================== + +.. automodule:: paddlenlp.transformers.ernie_m.fast_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_m.modeling.rst b/docs/source/paddlenlp.transformers.ernie_m.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..4a836830edd1ca0b59fa27f7d85d70ba4f6ad775 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_m.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.ernie_m.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ernie_m.rst b/docs/source/paddlenlp.transformers.ernie_m.rst new file mode 100644 index 0000000000000000000000000000000000000000..ab97f8f9553a4f367ea9d2b00fb6f2709e8652bd --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_m.rst @@ -0,0 +1,15 @@ +ernie\_m +======================================= + +.. automodule:: paddlenlp.transformers.ernie_m + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ernie_m.fast_tokenizer + paddlenlp.transformers.ernie_m.modeling + paddlenlp.transformers.ernie_m.tokenizer diff --git a/docs/source/paddlenlp.transformers.ernie_m.tokenizer.rst b/docs/source/paddlenlp.transformers.ernie_m.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..fb9eaec8cce3e9be1b84e9c19a67354716ba8f8d --- /dev/null +++ b/docs/source/paddlenlp.transformers.ernie_m.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. automodule:: paddlenlp.transformers.ernie_m.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.export.rst b/docs/source/paddlenlp.transformers.export.rst new file mode 100644 index 0000000000000000000000000000000000000000..7f49d2faafde10ddfcda6d1c62f38c4bd9660cff --- /dev/null +++ b/docs/source/paddlenlp.transformers.export.rst @@ -0,0 +1,7 @@ +export +==================================== + +.. 
automodule:: paddlenlp.transformers.export + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.fnet.modeling.rst b/docs/source/paddlenlp.transformers.fnet.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..141baad80d80ea8009f29a8cc9ff405ea0067253 --- /dev/null +++ b/docs/source/paddlenlp.transformers.fnet.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.fnet.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.fnet.rst b/docs/source/paddlenlp.transformers.fnet.rst new file mode 100644 index 0000000000000000000000000000000000000000..7e0440c2704d608c5ef7862342ed29397157d4ad --- /dev/null +++ b/docs/source/paddlenlp.transformers.fnet.rst @@ -0,0 +1,14 @@ +fnet +=================================== + +.. automodule:: paddlenlp.transformers.fnet + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.fnet.modeling + paddlenlp.transformers.fnet.tokenizer diff --git a/docs/source/paddlenlp.transformers.fnet.tokenizer.rst b/docs/source/paddlenlp.transformers.fnet.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..22bf444a04bf5be97da770c35466e69d6a500a7d --- /dev/null +++ b/docs/source/paddlenlp.transformers.fnet.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.fnet.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.funnel.modeling.rst b/docs/source/paddlenlp.transformers.funnel.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..329f6bec4b098f17a5c35654244402a6428c0ed5 --- /dev/null +++ b/docs/source/paddlenlp.transformers.funnel.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================= + +.. automodule:: paddlenlp.transformers.funnel.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.funnel.rst b/docs/source/paddlenlp.transformers.funnel.rst new file mode 100644 index 0000000000000000000000000000000000000000..3a69c3398ff81e27004026f1f76a22bb99a0a09b --- /dev/null +++ b/docs/source/paddlenlp.transformers.funnel.rst @@ -0,0 +1,14 @@ +funnel +===================================== + +.. automodule:: paddlenlp.transformers.funnel + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.funnel.modeling + paddlenlp.transformers.funnel.tokenizer diff --git a/docs/source/paddlenlp.transformers.funnel.tokenizer.rst b/docs/source/paddlenlp.transformers.funnel.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..49b413a063184ce5e25be48b49b3dfcfcb207004 --- /dev/null +++ b/docs/source/paddlenlp.transformers.funnel.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================== + +.. 
automodule:: paddlenlp.transformers.funnel.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.gau_alpha.modeling.rst b/docs/source/paddlenlp.transformers.gau_alpha.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..9bc4a2e4e01c9eed0cee941415ba6ca331bd04c0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.gau_alpha.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.gau_alpha.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.gau_alpha.rst b/docs/source/paddlenlp.transformers.gau_alpha.rst new file mode 100644 index 0000000000000000000000000000000000000000..91684515e801b83f53c0370d6edff3f0a96e52fa --- /dev/null +++ b/docs/source/paddlenlp.transformers.gau_alpha.rst @@ -0,0 +1,14 @@ +gau\_alpha +========================================= + +.. automodule:: paddlenlp.transformers.gau_alpha + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.gau_alpha.modeling + paddlenlp.transformers.gau_alpha.tokenizer diff --git a/docs/source/paddlenlp.transformers.gau_alpha.tokenizer.rst b/docs/source/paddlenlp.transformers.gau_alpha.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..40d966a2e0efc3e2b77ca063b1228221ab436db4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.gau_alpha.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.gau_alpha.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.generation_utils.rst b/docs/source/paddlenlp.transformers.generation_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..a36d1f0b26fc9344b78c1ed691a2054a9828dc51 --- /dev/null +++ b/docs/source/paddlenlp.transformers.generation_utils.rst @@ -0,0 +1,7 @@ +generation\_utils +=============================================== + +.. automodule:: paddlenlp.transformers.generation_utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.gpt.modeling.rst b/docs/source/paddlenlp.transformers.gpt.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..e50a4d51a9ebf71444de1075a9e2aff47a5f6363 --- /dev/null +++ b/docs/source/paddlenlp.transformers.gpt.modeling.rst @@ -0,0 +1,7 @@ +modeling +========================================== + +.. automodule:: paddlenlp.transformers.gpt.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.gpt.rst b/docs/source/paddlenlp.transformers.gpt.rst new file mode 100644 index 0000000000000000000000000000000000000000..08123c1ee019ff6d74e693178307d14bec3970c1 --- /dev/null +++ b/docs/source/paddlenlp.transformers.gpt.rst @@ -0,0 +1,14 @@ +gpt +================================== + +.. automodule:: paddlenlp.transformers.gpt + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.gpt.modeling + paddlenlp.transformers.gpt.tokenizer diff --git a/docs/source/paddlenlp.transformers.gpt.tokenizer.rst b/docs/source/paddlenlp.transformers.gpt.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..58dd364ba1b535aad91d3402ad6c79fceabf62ab --- /dev/null +++ b/docs/source/paddlenlp.transformers.gpt.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=========================================== + +.. automodule:: paddlenlp.transformers.gpt.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutlm.modeling.rst b/docs/source/paddlenlp.transformers.layoutlm.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..389635c84f7edb9b9b72e7a147e774205474d77c --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlm.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.layoutlm.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutlm.rst b/docs/source/paddlenlp.transformers.layoutlm.rst new file mode 100644 index 0000000000000000000000000000000000000000..5777c44f88a228c18243c1abcbefa312ab6be572 --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlm.rst @@ -0,0 +1,14 @@ +layoutlm +======================================= + +.. automodule:: paddlenlp.transformers.layoutlm + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.layoutlm.modeling + paddlenlp.transformers.layoutlm.tokenizer diff --git a/docs/source/paddlenlp.transformers.layoutlm.tokenizer.rst b/docs/source/paddlenlp.transformers.layoutlm.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..4328daa8d95140f346c567c6f139e13229ce33be --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlm.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. automodule:: paddlenlp.transformers.layoutlm.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutlmv2.modeling.rst b/docs/source/paddlenlp.transformers.layoutlmv2.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..bbce41e59313253c2cfd34432cdbb28ff883b0ab --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlmv2.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.layoutlmv2.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutlmv2.rst b/docs/source/paddlenlp.transformers.layoutlmv2.rst new file mode 100644 index 0000000000000000000000000000000000000000..208cec4cf7568a518b095cfaba9c57cf22de05da --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlmv2.rst @@ -0,0 +1,14 @@ +layoutlmv2 +========================================= + +.. automodule:: paddlenlp.transformers.layoutlmv2 + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.layoutlmv2.modeling + paddlenlp.transformers.layoutlmv2.tokenizer diff --git a/docs/source/paddlenlp.transformers.layoutlmv2.tokenizer.rst b/docs/source/paddlenlp.transformers.layoutlmv2.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..dfb3da6c03c9a0bc160c82c740f0a333348f72a2 --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutlmv2.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.layoutlmv2.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutxlm.modeling.rst b/docs/source/paddlenlp.transformers.layoutxlm.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..696a754912797ad52171036e11209123fcf9988f --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutxlm.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================ + +.. automodule:: paddlenlp.transformers.layoutxlm.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutxlm.rst b/docs/source/paddlenlp.transformers.layoutxlm.rst new file mode 100644 index 0000000000000000000000000000000000000000..5c5f45f562ea2c18052cdaaaee5d867f145635ef --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutxlm.rst @@ -0,0 +1,15 @@ +layoutxlm +======================================== + +.. automodule:: paddlenlp.transformers.layoutxlm + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.layoutxlm.modeling + paddlenlp.transformers.layoutxlm.tokenizer + paddlenlp.transformers.layoutxlm.visual_backbone diff --git a/docs/source/paddlenlp.transformers.layoutxlm.tokenizer.rst b/docs/source/paddlenlp.transformers.layoutxlm.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..8d59aa88ead34650cd343728b7f304ccb9c9ac64 --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutxlm.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================= + +.. automodule:: paddlenlp.transformers.layoutxlm.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.layoutxlm.visual_backbone.rst b/docs/source/paddlenlp.transformers.layoutxlm.visual_backbone.rst new file mode 100644 index 0000000000000000000000000000000000000000..7f513c7592828569be937a8f7fcab03db0b6da68 --- /dev/null +++ b/docs/source/paddlenlp.transformers.layoutxlm.visual_backbone.rst @@ -0,0 +1,7 @@ +visual\_backbone +======================================================== + +.. automodule:: paddlenlp.transformers.layoutxlm.visual_backbone + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.luke.modeling.rst b/docs/source/paddlenlp.transformers.luke.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..68b42e54af00222a03a3407715f78abd661af836 --- /dev/null +++ b/docs/source/paddlenlp.transformers.luke.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. 
automodule:: paddlenlp.transformers.luke.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.luke.rst b/docs/source/paddlenlp.transformers.luke.rst new file mode 100644 index 0000000000000000000000000000000000000000..ddc1975aed911dce5518c4dc2b1997df9fec9ad4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.luke.rst @@ -0,0 +1,14 @@ +luke +=================================== + +.. automodule:: paddlenlp.transformers.luke + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.luke.modeling + paddlenlp.transformers.luke.tokenizer diff --git a/docs/source/paddlenlp.transformers.luke.tokenizer.rst b/docs/source/paddlenlp.transformers.luke.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..5ffdc300ad83101442b101c60c29831b5df36c51 --- /dev/null +++ b/docs/source/paddlenlp.transformers.luke.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.luke.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.mbart.modeling.rst b/docs/source/paddlenlp.transformers.mbart.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..57877eb210ec547f77e5a0c5775868286769f7ce --- /dev/null +++ b/docs/source/paddlenlp.transformers.mbart.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.mbart.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.mbart.rst b/docs/source/paddlenlp.transformers.mbart.rst new file mode 100644 index 0000000000000000000000000000000000000000..ee9e94b7a194016efdedded4355a42cec4407187 --- /dev/null +++ b/docs/source/paddlenlp.transformers.mbart.rst @@ -0,0 +1,14 @@ +mbart +==================================== + +.. automodule:: paddlenlp.transformers.mbart + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.mbart.modeling + paddlenlp.transformers.mbart.tokenizer diff --git a/docs/source/paddlenlp.transformers.mbart.tokenizer.rst b/docs/source/paddlenlp.transformers.mbart.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..baf9e956d91259a1c9a8dfe87ffb5a86c452b788 --- /dev/null +++ b/docs/source/paddlenlp.transformers.mbart.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. automodule:: paddlenlp.transformers.mbart.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.megatronbert.modeling.rst b/docs/source/paddlenlp.transformers.megatronbert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..fe34bcc5f8be0b71ed536dc96c5270010021395b --- /dev/null +++ b/docs/source/paddlenlp.transformers.megatronbert.modeling.rst @@ -0,0 +1,7 @@ +modeling +=================================================== + +.. 
automodule:: paddlenlp.transformers.megatronbert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.megatronbert.rst b/docs/source/paddlenlp.transformers.megatronbert.rst new file mode 100644 index 0000000000000000000000000000000000000000..ed428d1afad26a3e36e667cb259091236ffebd3d --- /dev/null +++ b/docs/source/paddlenlp.transformers.megatronbert.rst @@ -0,0 +1,14 @@ +megatronbert +=========================================== + +.. automodule:: paddlenlp.transformers.megatronbert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.megatronbert.modeling + paddlenlp.transformers.megatronbert.tokenizer diff --git a/docs/source/paddlenlp.transformers.megatronbert.tokenizer.rst b/docs/source/paddlenlp.transformers.megatronbert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..61243ad237eff9d4bd6594a65ab3556417a0d0a0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.megatronbert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +==================================================== + +.. automodule:: paddlenlp.transformers.megatronbert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.mobilebert.modeling.rst b/docs/source/paddlenlp.transformers.mobilebert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..d912aeea993d468de2eddf51956f2f9471b76ce2 --- /dev/null +++ b/docs/source/paddlenlp.transformers.mobilebert.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.mobilebert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.mobilebert.rst b/docs/source/paddlenlp.transformers.mobilebert.rst new file mode 100644 index 0000000000000000000000000000000000000000..da320896cb8671db2279245684955eee0179ee81 --- /dev/null +++ b/docs/source/paddlenlp.transformers.mobilebert.rst @@ -0,0 +1,14 @@ +mobilebert +========================================= + +.. automodule:: paddlenlp.transformers.mobilebert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.mobilebert.modeling + paddlenlp.transformers.mobilebert.tokenizer diff --git a/docs/source/paddlenlp.transformers.mobilebert.tokenizer.rst b/docs/source/paddlenlp.transformers.mobilebert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..20576d7b3d0e432b8274c9b72a2fbabc1a6b3a9c --- /dev/null +++ b/docs/source/paddlenlp.transformers.mobilebert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.mobilebert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.model_outputs.rst b/docs/source/paddlenlp.transformers.model_outputs.rst new file mode 100644 index 0000000000000000000000000000000000000000..7c3220fc0fd0a54b7235e2aa202381ba77b1ade5 --- /dev/null +++ b/docs/source/paddlenlp.transformers.model_outputs.rst @@ -0,0 +1,6 @@ +model\_outputs +============================================ + +.. 
automodule:: paddlenlp.transformers.model_outputs + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.transformers.model_utils.rst b/docs/source/paddlenlp.transformers.model_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..f4e465e7fcfc594de2d8088cefeded21332ac111 --- /dev/null +++ b/docs/source/paddlenlp.transformers.model_utils.rst @@ -0,0 +1,6 @@ +model\_utils +========================================== + +.. automodule:: paddlenlp.transformers.model_utils + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.transformers.mpnet.modeling.rst b/docs/source/paddlenlp.transformers.mpnet.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..3af94ea79612c9a315c31e86611dc54e20005e46 --- /dev/null +++ b/docs/source/paddlenlp.transformers.mpnet.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.mpnet.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.mpnet.rst b/docs/source/paddlenlp.transformers.mpnet.rst new file mode 100644 index 0000000000000000000000000000000000000000..ecf0185c05494cc104ecc60d5670b6ae624c148f --- /dev/null +++ b/docs/source/paddlenlp.transformers.mpnet.rst @@ -0,0 +1,14 @@ +mpnet +==================================== + +.. automodule:: paddlenlp.transformers.mpnet + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.mpnet.modeling + paddlenlp.transformers.mpnet.tokenizer diff --git a/docs/source/paddlenlp.transformers.mpnet.tokenizer.rst b/docs/source/paddlenlp.transformers.mpnet.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..4d46073b97fdccc582d719dbc5c3943ff1bfc7cd --- /dev/null +++ b/docs/source/paddlenlp.transformers.mpnet.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. automodule:: paddlenlp.transformers.mpnet.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.nezha.modeling.rst b/docs/source/paddlenlp.transformers.nezha.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..4246cb72fea66aab2571191d116aa21ca4c4ce6d --- /dev/null +++ b/docs/source/paddlenlp.transformers.nezha.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.nezha.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.nezha.rst b/docs/source/paddlenlp.transformers.nezha.rst new file mode 100644 index 0000000000000000000000000000000000000000..02a28088eb3261c3dff14d9acb6bec6a26541ff4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.nezha.rst @@ -0,0 +1,14 @@ +nezha +==================================== + +.. automodule:: paddlenlp.transformers.nezha + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.nezha.modeling + paddlenlp.transformers.nezha.tokenizer diff --git a/docs/source/paddlenlp.transformers.nezha.tokenizer.rst b/docs/source/paddlenlp.transformers.nezha.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..435140acb8a9031cbb83e05ab9a793c110193571 --- /dev/null +++ b/docs/source/paddlenlp.transformers.nezha.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. 
automodule:: paddlenlp.transformers.nezha.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.opt.modeling.rst b/docs/source/paddlenlp.transformers.opt.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..71e2af7e61ddb973ca034b540b830a25c2bccf50 --- /dev/null +++ b/docs/source/paddlenlp.transformers.opt.modeling.rst @@ -0,0 +1,7 @@ +modeling +========================================== + +.. automodule:: paddlenlp.transformers.opt.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.opt.rst b/docs/source/paddlenlp.transformers.opt.rst new file mode 100644 index 0000000000000000000000000000000000000000..445e3c2639b9871d85c016a319827cd1aba3a8ed --- /dev/null +++ b/docs/source/paddlenlp.transformers.opt.rst @@ -0,0 +1,13 @@ +opt +================================== + +.. automodule:: paddlenlp.transformers.opt + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.opt.modeling diff --git a/docs/source/paddlenlp.transformers.optimization.rst b/docs/source/paddlenlp.transformers.optimization.rst new file mode 100644 index 0000000000000000000000000000000000000000..eb8c4eb182e07e17c302f3bf61e53a8255f0a45f --- /dev/null +++ b/docs/source/paddlenlp.transformers.optimization.rst @@ -0,0 +1,7 @@ +optimization +========================================== + +.. automodule:: paddlenlp.transformers.optimization + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ppminilm.modeling.rst b/docs/source/paddlenlp.transformers.ppminilm.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..a786fb47e2ae6f20f17c3bd5c93fc136d239219f --- /dev/null +++ b/docs/source/paddlenlp.transformers.ppminilm.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.ppminilm.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.ppminilm.rst b/docs/source/paddlenlp.transformers.ppminilm.rst new file mode 100644 index 0000000000000000000000000000000000000000..8676e01e332b54d76c225a85010c636cbb4b9336 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ppminilm.rst @@ -0,0 +1,14 @@ +ppminilm +======================================= + +.. automodule:: paddlenlp.transformers.ppminilm + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.ppminilm.modeling + paddlenlp.transformers.ppminilm.tokenizer diff --git a/docs/source/paddlenlp.transformers.ppminilm.tokenizer.rst b/docs/source/paddlenlp.transformers.ppminilm.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..974c8579e852fd9a1219b0ea27becca227dc9b31 --- /dev/null +++ b/docs/source/paddlenlp.transformers.ppminilm.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. 
automodule:: paddlenlp.transformers.ppminilm.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.prophetnet.modeling.rst b/docs/source/paddlenlp.transformers.prophetnet.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..9c6c587fcdefc0b60f6ace7041ad57141b5fbd75 --- /dev/null +++ b/docs/source/paddlenlp.transformers.prophetnet.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.prophetnet.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.prophetnet.rst b/docs/source/paddlenlp.transformers.prophetnet.rst new file mode 100644 index 0000000000000000000000000000000000000000..91a594abc38d7ee87d29cc6e73bc166b3eaef513 --- /dev/null +++ b/docs/source/paddlenlp.transformers.prophetnet.rst @@ -0,0 +1,14 @@ +prophetnet +========================================= + +.. automodule:: paddlenlp.transformers.prophetnet + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.prophetnet.modeling + paddlenlp.transformers.prophetnet.tokenizer diff --git a/docs/source/paddlenlp.transformers.prophetnet.tokenizer.rst b/docs/source/paddlenlp.transformers.prophetnet.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..38fba82a15e741eabff4e94d7cde58cdb0608339 --- /dev/null +++ b/docs/source/paddlenlp.transformers.prophetnet.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. automodule:: paddlenlp.transformers.prophetnet.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.reformer.modeling.rst b/docs/source/paddlenlp.transformers.reformer.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..1b1d8d26f71b33f4c19a85d69d8bf18b9d74de49 --- /dev/null +++ b/docs/source/paddlenlp.transformers.reformer.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.reformer.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.reformer.rst b/docs/source/paddlenlp.transformers.reformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..e65273244234347b42eba22560613290daa4cfa7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.reformer.rst @@ -0,0 +1,14 @@ +reformer +======================================= + +.. automodule:: paddlenlp.transformers.reformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.reformer.modeling + paddlenlp.transformers.reformer.tokenizer diff --git a/docs/source/paddlenlp.transformers.reformer.tokenizer.rst b/docs/source/paddlenlp.transformers.reformer.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..ffb7a11cc76c92a436e81b2702b9e30408e412b7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.reformer.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. 
automodule:: paddlenlp.transformers.reformer.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.rembert.modeling.rst b/docs/source/paddlenlp.transformers.rembert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..f1b5aa838b65ba48f538dd777031d2b113377484 --- /dev/null +++ b/docs/source/paddlenlp.transformers.rembert.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================== + +.. automodule:: paddlenlp.transformers.rembert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.rembert.rst b/docs/source/paddlenlp.transformers.rembert.rst new file mode 100644 index 0000000000000000000000000000000000000000..a559fa9c5378f49db4ff251fa451b50031f42a02 --- /dev/null +++ b/docs/source/paddlenlp.transformers.rembert.rst @@ -0,0 +1,14 @@ +rembert +====================================== + +.. automodule:: paddlenlp.transformers.rembert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.rembert.modeling + paddlenlp.transformers.rembert.tokenizer diff --git a/docs/source/paddlenlp.transformers.rembert.tokenizer.rst b/docs/source/paddlenlp.transformers.rembert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..0ee175516cc5cde31677c949c115dab19f925077 --- /dev/null +++ b/docs/source/paddlenlp.transformers.rembert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=============================================== + +.. automodule:: paddlenlp.transformers.rembert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roberta.modeling.rst b/docs/source/paddlenlp.transformers.roberta.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..2e46e4358b4bb2894fdd2155c95f2600d5eb0b47 --- /dev/null +++ b/docs/source/paddlenlp.transformers.roberta.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================== + +.. automodule:: paddlenlp.transformers.roberta.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roberta.rst b/docs/source/paddlenlp.transformers.roberta.rst new file mode 100644 index 0000000000000000000000000000000000000000..3ff4cb5fcdf5d699d803647a5c1cff4972d57e0a --- /dev/null +++ b/docs/source/paddlenlp.transformers.roberta.rst @@ -0,0 +1,14 @@ +roberta +====================================== + +.. automodule:: paddlenlp.transformers.roberta + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.roberta.modeling + paddlenlp.transformers.roberta.tokenizer diff --git a/docs/source/paddlenlp.transformers.roberta.tokenizer.rst b/docs/source/paddlenlp.transformers.roberta.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..c9cbc66227e169784af566480815caa21be35c1a --- /dev/null +++ b/docs/source/paddlenlp.transformers.roberta.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=============================================== + +.. 
automodule:: paddlenlp.transformers.roberta.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roformer.modeling.rst b/docs/source/paddlenlp.transformers.roformer.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..6d1d02576738956df899d9887ba6d338984e77d4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformer.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.roformer.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roformer.rst b/docs/source/paddlenlp.transformers.roformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..5a67d90fe63dcb2845bc3b05a25f50f0b66e02af --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformer.rst @@ -0,0 +1,14 @@ +roformer +======================================= + +.. automodule:: paddlenlp.transformers.roformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.roformer.modeling + paddlenlp.transformers.roformer.tokenizer diff --git a/docs/source/paddlenlp.transformers.roformer.tokenizer.rst b/docs/source/paddlenlp.transformers.roformer.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..ac6c2332f819b4b6f9c41c00aff6a396635ba775 --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformer.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. automodule:: paddlenlp.transformers.roformer.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roformerv2.modeling.rst b/docs/source/paddlenlp.transformers.roformerv2.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..e145ea6b352e143c1895de235718c0d90a88474c --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformerv2.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================= + +.. automodule:: paddlenlp.transformers.roformerv2.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.roformerv2.rst b/docs/source/paddlenlp.transformers.roformerv2.rst new file mode 100644 index 0000000000000000000000000000000000000000..a19ac028c7d2aebb7e6bc5afc0ad8a2d99ce7191 --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformerv2.rst @@ -0,0 +1,14 @@ +roformerv2 +========================================= + +.. automodule:: paddlenlp.transformers.roformerv2 + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.roformerv2.modeling + paddlenlp.transformers.roformerv2.tokenizer diff --git a/docs/source/paddlenlp.transformers.roformerv2.tokenizer.rst b/docs/source/paddlenlp.transformers.roformerv2.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..25f32bd4bfbab8d3497d8e17721db559a800e2ee --- /dev/null +++ b/docs/source/paddlenlp.transformers.roformerv2.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================== + +.. 
automodule:: paddlenlp.transformers.roformerv2.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.rst b/docs/source/paddlenlp.transformers.rst new file mode 100644 index 0000000000000000000000000000000000000000..1ec7ff404395051fad08d3a0e2d9d722b0a156eb --- /dev/null +++ b/docs/source/paddlenlp.transformers.rst @@ -0,0 +1,83 @@ +paddlenlp.transformers +============================== + +.. automodule:: paddlenlp.transformers + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.albert + paddlenlp.transformers.artist + paddlenlp.transformers.auto + paddlenlp.transformers.bart + paddlenlp.transformers.bert + paddlenlp.transformers.bert_japanese + paddlenlp.transformers.bigbird + paddlenlp.transformers.blenderbot + paddlenlp.transformers.blenderbot_small + paddlenlp.transformers.chinesebert + paddlenlp.transformers.codegen + paddlenlp.transformers.convbert + paddlenlp.transformers.ctrl + paddlenlp.transformers.dallebart + paddlenlp.transformers.distilbert + paddlenlp.transformers.electra + paddlenlp.transformers.ernie + paddlenlp.transformers.ernie_ctm + paddlenlp.transformers.ernie_doc + paddlenlp.transformers.ernie_gen + paddlenlp.transformers.ernie_gram + paddlenlp.transformers.ernie_m + paddlenlp.transformers.fnet + paddlenlp.transformers.funnel + paddlenlp.transformers.gau_alpha + paddlenlp.transformers.gpt + paddlenlp.transformers.layoutlm + paddlenlp.transformers.layoutlmv2 + paddlenlp.transformers.layoutxlm + paddlenlp.transformers.luke + paddlenlp.transformers.mbart + paddlenlp.transformers.megatronbert + paddlenlp.transformers.mobilebert + paddlenlp.transformers.mpnet + paddlenlp.transformers.nezha + paddlenlp.transformers.opt + paddlenlp.transformers.ppminilm + paddlenlp.transformers.prophetnet + paddlenlp.transformers.reformer + paddlenlp.transformers.rembert + paddlenlp.transformers.roberta + paddlenlp.transformers.roformer + paddlenlp.transformers.roformerv2 + paddlenlp.transformers.semantic_search + paddlenlp.transformers.skep + paddlenlp.transformers.squeezebert + paddlenlp.transformers.t5 + paddlenlp.transformers.tinybert + paddlenlp.transformers.transformer + paddlenlp.transformers.unified_transformer + paddlenlp.transformers.unimo + paddlenlp.transformers.xlm + paddlenlp.transformers.xlnet + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.attention_utils + paddlenlp.transformers.convert_slow_tokenizer + paddlenlp.transformers.distill_utils + paddlenlp.transformers.export + paddlenlp.transformers.generation_utils + paddlenlp.transformers.model_outputs + paddlenlp.transformers.model_utils + paddlenlp.transformers.optimization + paddlenlp.transformers.sentencepiece_model_pb2 + paddlenlp.transformers.tokenizer_utils + paddlenlp.transformers.tokenizer_utils_base + paddlenlp.transformers.tokenizer_utils_fast + paddlenlp.transformers.utils diff --git a/docs/source/paddlenlp.transformers.semantic_indexing.modeling.rst b/docs/source/paddlenlp.transformers.semantic_indexing.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..e9280aa743c86c592b86f10770b3f919cb0d28f9 --- /dev/null +++ b/docs/source/paddlenlp.transformers.semantic_indexing.modeling.rst @@ -0,0 +1,7 @@ +modeling +========================================================= + +.. 
automodule:: paddlenlp.transformers.semantic_indexing.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.semantic_indexing.rst b/docs/source/paddlenlp.transformers.semantic_indexing.rst new file mode 100644 index 0000000000000000000000000000000000000000..ae92d819be16fb6af3d025a8ea4624b1966cb1fa --- /dev/null +++ b/docs/source/paddlenlp.transformers.semantic_indexing.rst @@ -0,0 +1,13 @@ +semantic\_indexing +================================================= + +.. automodule:: paddlenlp.transformers.semantic_indexing + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.semantic_indexing.modeling diff --git a/docs/source/paddlenlp.transformers.semantic_search.modeling.rst b/docs/source/paddlenlp.transformers.semantic_search.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..8d64e83628531cda8244ec0dc384cbbe3cc37bbd --- /dev/null +++ b/docs/source/paddlenlp.transformers.semantic_search.modeling.rst @@ -0,0 +1,7 @@ +modeling +======================================================= + +.. automodule:: paddlenlp.transformers.semantic_search.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.semantic_search.rst b/docs/source/paddlenlp.transformers.semantic_search.rst new file mode 100644 index 0000000000000000000000000000000000000000..cb255233b7f3520e94557431b1fc8e13f5018bd5 --- /dev/null +++ b/docs/source/paddlenlp.transformers.semantic_search.rst @@ -0,0 +1,13 @@ +semantic\_search +=============================================== + +.. automodule:: paddlenlp.transformers.semantic_search + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.semantic_search.modeling diff --git a/docs/source/paddlenlp.transformers.sentencepiece_model_pb2.rst b/docs/source/paddlenlp.transformers.sentencepiece_model_pb2.rst new file mode 100644 index 0000000000000000000000000000000000000000..27793b44085afc768616e4143751da07c86d1c54 --- /dev/null +++ b/docs/source/paddlenlp.transformers.sentencepiece_model_pb2.rst @@ -0,0 +1,6 @@ +sentencepiece\_model\_pb2 +======================================================= + +.. automodule:: paddlenlp.transformers.sentencepiece_model_pb2 + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.transformers.skep.modeling.rst b/docs/source/paddlenlp.transformers.skep.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..78f4c10a35c339259694912b880a3174e5503195 --- /dev/null +++ b/docs/source/paddlenlp.transformers.skep.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================== + +.. automodule:: paddlenlp.transformers.skep.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.skep.rst b/docs/source/paddlenlp.transformers.skep.rst new file mode 100644 index 0000000000000000000000000000000000000000..116bfd5bfef3f3bd40d34da55f2b845cef2fe353 --- /dev/null +++ b/docs/source/paddlenlp.transformers.skep.rst @@ -0,0 +1,14 @@ +skep +=================================== + +.. automodule:: paddlenlp.transformers.skep + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.skep.modeling + paddlenlp.transformers.skep.tokenizer diff --git a/docs/source/paddlenlp.transformers.skep.tokenizer.rst b/docs/source/paddlenlp.transformers.skep.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..a33afb457018ed0d935945a5f7e2aac615d1b753 --- /dev/null +++ b/docs/source/paddlenlp.transformers.skep.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================ + +.. automodule:: paddlenlp.transformers.skep.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.squeezebert.modeling.rst b/docs/source/paddlenlp.transformers.squeezebert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..1992788a2091892931269fb70cf5ed8c1ee74254 --- /dev/null +++ b/docs/source/paddlenlp.transformers.squeezebert.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================== + +.. automodule:: paddlenlp.transformers.squeezebert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.squeezebert.rst b/docs/source/paddlenlp.transformers.squeezebert.rst new file mode 100644 index 0000000000000000000000000000000000000000..7498b55a0b271318368d2e18d9fc1502a21e3c75 --- /dev/null +++ b/docs/source/paddlenlp.transformers.squeezebert.rst @@ -0,0 +1,14 @@ +squeezebert +========================================== + +.. automodule:: paddlenlp.transformers.squeezebert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.squeezebert.modeling + paddlenlp.transformers.squeezebert.tokenizer diff --git a/docs/source/paddlenlp.transformers.squeezebert.tokenizer.rst b/docs/source/paddlenlp.transformers.squeezebert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..f6a680c40f5184878a8bf6117dbd767453de2120 --- /dev/null +++ b/docs/source/paddlenlp.transformers.squeezebert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=================================================== + +.. automodule:: paddlenlp.transformers.squeezebert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.t5.modeling.rst b/docs/source/paddlenlp.transformers.t5.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..ee60d516ef5f078bc503698dbe9309c09137827b --- /dev/null +++ b/docs/source/paddlenlp.transformers.t5.modeling.rst @@ -0,0 +1,7 @@ +modeling +========================================= + +.. automodule:: paddlenlp.transformers.t5.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.t5.rst b/docs/source/paddlenlp.transformers.t5.rst new file mode 100644 index 0000000000000000000000000000000000000000..35c1aec1639b2ae0ae18b4eac2c43471f30d7ec2 --- /dev/null +++ b/docs/source/paddlenlp.transformers.t5.rst @@ -0,0 +1,14 @@ +t5 +================================= + +.. automodule:: paddlenlp.transformers.t5 + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.transformers.t5.modeling + paddlenlp.transformers.t5.tokenizer diff --git a/docs/source/paddlenlp.transformers.t5.tokenizer.rst b/docs/source/paddlenlp.transformers.t5.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..a7b912bc055d226a230a7cb98786d2fddbd31f41 --- /dev/null +++ b/docs/source/paddlenlp.transformers.t5.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +========================================== + +.. automodule:: paddlenlp.transformers.t5.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.tinybert.fast_tokenizer.rst b/docs/source/paddlenlp.transformers.tinybert.fast_tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..0f37e8ad03e7f8c92cc4106a8e92395a831964ef --- /dev/null +++ b/docs/source/paddlenlp.transformers.tinybert.fast_tokenizer.rst @@ -0,0 +1,7 @@ +fast\_tokenizer +======================================================== + +.. automodule:: paddlenlp.transformers.tinybert.fast_tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.tinybert.modeling.rst b/docs/source/paddlenlp.transformers.tinybert.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..344656a9970690a5572f9fdd7aa5b153980f75d7 --- /dev/null +++ b/docs/source/paddlenlp.transformers.tinybert.modeling.rst @@ -0,0 +1,7 @@ +modeling +=============================================== + +.. automodule:: paddlenlp.transformers.tinybert.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.tinybert.rst b/docs/source/paddlenlp.transformers.tinybert.rst new file mode 100644 index 0000000000000000000000000000000000000000..43b80a253545b730424cc1d9c1158dab7cfa88d0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.tinybert.rst @@ -0,0 +1,15 @@ +tinybert +======================================= + +.. automodule:: paddlenlp.transformers.tinybert + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.tinybert.fast_tokenizer + paddlenlp.transformers.tinybert.modeling + paddlenlp.transformers.tinybert.tokenizer diff --git a/docs/source/paddlenlp.transformers.tinybert.tokenizer.rst b/docs/source/paddlenlp.transformers.tinybert.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..eaa651268369a0512e797c3959f55caddde60e00 --- /dev/null +++ b/docs/source/paddlenlp.transformers.tinybert.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +================================================ + +.. automodule:: paddlenlp.transformers.tinybert.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.tokenizer_utils.rst b/docs/source/paddlenlp.transformers.tokenizer_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..52693b51a499a90b1fdb11e91980dca5ccd100f3 --- /dev/null +++ b/docs/source/paddlenlp.transformers.tokenizer_utils.rst @@ -0,0 +1,8 @@ +tokenizer\_utils +============================================== + +.. 
automodule:: paddlenlp.transformers.tokenizer_utils + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.transformers.tokenizer_utils_base.rst b/docs/source/paddlenlp.transformers.tokenizer_utils_base.rst new file mode 100644 index 0000000000000000000000000000000000000000..91de4951b9060912fe03b7c457e9138237fc939c --- /dev/null +++ b/docs/source/paddlenlp.transformers.tokenizer_utils_base.rst @@ -0,0 +1,8 @@ +tokenizer\_utils\_base +==================================================== + +.. automodule:: paddlenlp.transformers.tokenizer_utils_base + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.transformers.tokenizer_utils_fast.rst b/docs/source/paddlenlp.transformers.tokenizer_utils_fast.rst new file mode 100644 index 0000000000000000000000000000000000000000..3c24a50a8d8cdefa327bed2896865d52db8d00d3 --- /dev/null +++ b/docs/source/paddlenlp.transformers.tokenizer_utils_fast.rst @@ -0,0 +1,8 @@ +tokenizer\_utils\_fast +====================================================== + +.. automodule:: paddlenlp.transformers.tokenizer_utils_fast + :members: + :no-undoc-members: + :show-inheritance: + :special-members: __call__ diff --git a/docs/source/paddlenlp.transformers.transformer.modeling.rst b/docs/source/paddlenlp.transformers.transformer.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..53361828f7865dccd5920dbd138ed5898af953b4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.transformer.modeling.rst @@ -0,0 +1,7 @@ +modeling +================================================== + +.. automodule:: paddlenlp.transformers.transformer.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.transformer.rst b/docs/source/paddlenlp.transformers.transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..7d8435bff395c4b9ddce9896f9ab31393e011de4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.transformer.rst @@ -0,0 +1,13 @@ +transformer +========================================== + +.. automodule:: paddlenlp.transformers.transformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.transformer.modeling diff --git a/docs/source/paddlenlp.transformers.unified_transformer.convert.rst b/docs/source/paddlenlp.transformers.unified_transformer.convert.rst new file mode 100644 index 0000000000000000000000000000000000000000..c0722b806a7c4d9732895a789c6929da7ab49672 --- /dev/null +++ b/docs/source/paddlenlp.transformers.unified_transformer.convert.rst @@ -0,0 +1,7 @@ +convert +========================================================== + +.. automodule:: paddlenlp.transformers.unified_transformer.convert + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.unified_transformer.modeling.rst b/docs/source/paddlenlp.transformers.unified_transformer.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..ff123566d6a32200691bbc02c39387111718bff4 --- /dev/null +++ b/docs/source/paddlenlp.transformers.unified_transformer.modeling.rst @@ -0,0 +1,7 @@ +modeling +=========================================================== + +.. 
automodule:: paddlenlp.transformers.unified_transformer.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.unified_transformer.rst b/docs/source/paddlenlp.transformers.unified_transformer.rst new file mode 100644 index 0000000000000000000000000000000000000000..1c6a1b68df30f8e6e1a77890f3c3f0b8461304f0 --- /dev/null +++ b/docs/source/paddlenlp.transformers.unified_transformer.rst @@ -0,0 +1,15 @@ +unified\_transformer +=================================================== + +.. automodule:: paddlenlp.transformers.unified_transformer + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.unified_transformer.convert + paddlenlp.transformers.unified_transformer.modeling + paddlenlp.transformers.unified_transformer.tokenizer diff --git a/docs/source/paddlenlp.transformers.unified_transformer.tokenizer.rst b/docs/source/paddlenlp.transformers.unified_transformer.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..96a227173b17b2ed81e52a9f7503938222bb521d --- /dev/null +++ b/docs/source/paddlenlp.transformers.unified_transformer.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================================ + +.. automodule:: paddlenlp.transformers.unified_transformer.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.unimo.modeling.rst b/docs/source/paddlenlp.transformers.unimo.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..026cf3230aa139652306d111e10c8d3aaac5bd28 --- /dev/null +++ b/docs/source/paddlenlp.transformers.unimo.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.unimo.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.unimo.rst b/docs/source/paddlenlp.transformers.unimo.rst new file mode 100644 index 0000000000000000000000000000000000000000..8a4d4afe8d5d3aeb5c7748ab3ebbce31f5a71c0d --- /dev/null +++ b/docs/source/paddlenlp.transformers.unimo.rst @@ -0,0 +1,14 @@ +unimo +==================================== + +.. automodule:: paddlenlp.transformers.unimo + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.unimo.modeling + paddlenlp.transformers.unimo.tokenizer diff --git a/docs/source/paddlenlp.transformers.unimo.tokenizer.rst b/docs/source/paddlenlp.transformers.unimo.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..d7e711fa5b01ae351f5d524480fd96fcabcafa15 --- /dev/null +++ b/docs/source/paddlenlp.transformers.unimo.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. automodule:: paddlenlp.transformers.unimo.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.utils.rst b/docs/source/paddlenlp.transformers.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..bd1f403f5cb64d78aa71c6663b4f0ee892d427e6 --- /dev/null +++ b/docs/source/paddlenlp.transformers.utils.rst @@ -0,0 +1,7 @@ +utils +=================================== + +.. 
automodule:: paddlenlp.transformers.utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.xlm.modeling.rst b/docs/source/paddlenlp.transformers.xlm.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..24df639bdc43f4d9b3245bcfae462047b3dc16d3 --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlm.modeling.rst @@ -0,0 +1,7 @@ +modeling +========================================== + +.. automodule:: paddlenlp.transformers.xlm.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.xlm.rst b/docs/source/paddlenlp.transformers.xlm.rst new file mode 100644 index 0000000000000000000000000000000000000000..54ec485dd13b5c7f5f78bb868fa570099b1f3b69 --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlm.rst @@ -0,0 +1,14 @@ +xlm +================================== + +.. automodule:: paddlenlp.transformers.xlm + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.xlm.modeling + paddlenlp.transformers.xlm.tokenizer diff --git a/docs/source/paddlenlp.transformers.xlm.tokenizer.rst b/docs/source/paddlenlp.transformers.xlm.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..9ba4a1cddbe0401189a2a9aa461d2834f452e827 --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlm.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +=========================================== + +.. automodule:: paddlenlp.transformers.xlm.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.xlnet.modeling.rst b/docs/source/paddlenlp.transformers.xlnet.modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..edc53492499e2798fb7874ae5e0fa24e0d85fe35 --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlnet.modeling.rst @@ -0,0 +1,7 @@ +modeling +============================================ + +.. automodule:: paddlenlp.transformers.xlnet.modeling + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.transformers.xlnet.rst b/docs/source/paddlenlp.transformers.xlnet.rst new file mode 100644 index 0000000000000000000000000000000000000000..eb6b5c4911cb5edfe1b65500f7aa9e25b34e2106 --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlnet.rst @@ -0,0 +1,14 @@ +xlnet +==================================== + +.. automodule:: paddlenlp.transformers.xlnet + :members: + :no-undoc-members: + :show-inheritance: + + +.. toctree:: + :maxdepth: 4 + + paddlenlp.transformers.xlnet.modeling + paddlenlp.transformers.xlnet.tokenizer diff --git a/docs/source/paddlenlp.transformers.xlnet.tokenizer.rst b/docs/source/paddlenlp.transformers.xlnet.tokenizer.rst new file mode 100644 index 0000000000000000000000000000000000000000..b55fa0a21d452861a1e0f65b98adcaa9559849bf --- /dev/null +++ b/docs/source/paddlenlp.transformers.xlnet.tokenizer.rst @@ -0,0 +1,7 @@ +tokenizer +============================================= + +.. automodule:: paddlenlp.transformers.xlnet.tokenizer + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.batch_sampler.rst b/docs/source/paddlenlp.utils.batch_sampler.rst new file mode 100644 index 0000000000000000000000000000000000000000..5c9ef2fb6068aa940cc0033136871abbae4af96e --- /dev/null +++ b/docs/source/paddlenlp.utils.batch_sampler.rst @@ -0,0 +1,6 @@ +batch\_sampler +===================================== + +.. 
automodule:: paddlenlp.utils.batch_sampler + :members: + :no-undoc-members: diff --git a/docs/source/paddlenlp.utils.downloader.rst b/docs/source/paddlenlp.utils.downloader.rst new file mode 100644 index 0000000000000000000000000000000000000000..946af01660db9a371a9ee89a6ed402589da2b33f --- /dev/null +++ b/docs/source/paddlenlp.utils.downloader.rst @@ -0,0 +1,7 @@ +downloader +================================= + +.. automodule:: paddlenlp.utils.downloader + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.env.rst b/docs/source/paddlenlp.utils.env.rst new file mode 100644 index 0000000000000000000000000000000000000000..c650dfead4b252c28afcde5f07870363e2eaa814 --- /dev/null +++ b/docs/source/paddlenlp.utils.env.rst @@ -0,0 +1,7 @@ +env +========================== + +.. automodule:: paddlenlp.utils.env + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.file_lock.rst b/docs/source/paddlenlp.utils.file_lock.rst new file mode 100644 index 0000000000000000000000000000000000000000..dbf1a60ebc4a0a865be67c909f81dfbec70a56c4 --- /dev/null +++ b/docs/source/paddlenlp.utils.file_lock.rst @@ -0,0 +1,7 @@ +file\_lock +================================= + +.. automodule:: paddlenlp.utils.file_lock + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.import_utils.rst b/docs/source/paddlenlp.utils.import_utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..49a928b734552d878904eb0b47efc75460df3c1d --- /dev/null +++ b/docs/source/paddlenlp.utils.import_utils.rst @@ -0,0 +1,7 @@ +import\_utils +==================================== + +.. automodule:: paddlenlp.utils.import_utils + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.log.rst b/docs/source/paddlenlp.utils.log.rst new file mode 100644 index 0000000000000000000000000000000000000000..343d32eb7ceee056a64fe8fddf79053b9cac371d --- /dev/null +++ b/docs/source/paddlenlp.utils.log.rst @@ -0,0 +1,7 @@ +log +========================== + +.. automodule:: paddlenlp.utils.log + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.profiler.rst b/docs/source/paddlenlp.utils.profiler.rst new file mode 100644 index 0000000000000000000000000000000000000000..3a8d3644bf82f5c0c0c04bbf619788c1e69df0bb --- /dev/null +++ b/docs/source/paddlenlp.utils.profiler.rst @@ -0,0 +1,7 @@ +profiler +=============================== + +.. automodule:: paddlenlp.utils.profiler + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/source/paddlenlp.utils.rst b/docs/source/paddlenlp.utils.rst new file mode 100644 index 0000000000000000000000000000000000000000..55f2d79907d5b0abf9670af8725cb69d37c43285 --- /dev/null +++ b/docs/source/paddlenlp.utils.rst @@ -0,0 +1,20 @@ +paddlenlp.utils +======================= + +.. automodule:: paddlenlp.utils + :members: + :no-undoc-members: + :show-inheritance: + + +.. 
toctree:: + :maxdepth: 4 + + paddlenlp.utils.batch_sampler + paddlenlp.utils.downloader + paddlenlp.utils.env + paddlenlp.utils.file_lock + paddlenlp.utils.import_utils + paddlenlp.utils.log + paddlenlp.utils.profiler + paddlenlp.utils.tools diff --git a/docs/source/paddlenlp.utils.tools.rst b/docs/source/paddlenlp.utils.tools.rst new file mode 100644 index 0000000000000000000000000000000000000000..d3451531212cb573d60c6d08f79881009e2b8693 --- /dev/null +++ b/docs/source/paddlenlp.utils.tools.rst @@ -0,0 +1,7 @@ +tools +============================ + +.. automodule:: paddlenlp.utils.tools + :members: + :no-undoc-members: + :show-inheritance: diff --git a/docs/trainer.md b/docs/trainer.md new file mode 100644 index 0000000000000000000000000000000000000000..31728bbaa627b0ed523ee8f4d33aca01580ba66b --- /dev/null +++ b/docs/trainer.md @@ -0,0 +1,674 @@ +# PaddleNLP Trainer API + +PaddleNLP提供了Trainer训练API,针对训练过程的通用训练配置做了封装,比如: + +- 优化器、学习率调度等训练配置 +- 多卡,混合精度,梯度累积等功能 +- checkpoint断点,断点重启(数据集,随机数恢复) +- 日志显示,loss可视化展示等 + +用户输入模型,数据集,就可以使用Trainer API高效快速的实现预训练、微调等任务。 + + +## Trainer基本使用方法介绍 + +下面是用户使用 Trainer API进行finetune任务的简单示例,这里以中文情感分类数据集`chnsenticorp`为例。 +更详细的使用可以参考[CLUE Trainer](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/benchmark/clue/classification/run_clue_classifier_trainer.py)版本。 + +1. 导入需要用到的头文件。 + - 主要是模型、Tokenizer + - 还有Trainer组件 + - 其中`Trainer`是训练主要入口,用户传入模型,数据集,即可进行训练 + - `TrainingArguments` 包含了用户需要的大部分训练参数。 + - `PdArgumentParser` 是用户输出参数的工具 +```python +from functools import partial +import paddle +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.trainer import Trainer, TrainingArguments, PdArgumentParser +``` +2. 设置好用户参数 + - PdArgumentParser 可以接受多个类似`TrainingArguments`的参数。用户可以自定义所需要的`ModelArguments`, `DataArguments`为 tuple 传入 PdArgumentParser即可。 + - 这些参数都是通过`python xxx.py --dataset xx --max_seq_length xx`的方式传入。`TrainingArguments`的所有可配置参数见后文。 +```python +from dataclasses import dataclass +@dataclass +class DataArguments: + dataset: str = field( + default=None, + metadata={"help": "The name of the dataset to use."}) + + max_seq_length: int = field( + default=128, + metadata={"help": "The maximum total input sequence length after tokenization."}) + +parser = PdArgumentParser(TrainingArguments, DataArguments) +(training_args, data_args) = parser.parse_args_into_dataclasses() +``` + +3. 加载模型,tokenizer, 数据集 + - 注意,这里的数据集,需要输出的是一个dict。dict中的key,需要和模型的输入名称对应。 + - 这里的,`labels`如果模型没有使用到,我们还需要额外定义`criterion`,计算最后的loss损失。 +```python +train_dataset = load_dataset("chnsenticorp", splits=["train"]) +model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=len(train_dataset.label_list)) +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + +def convert_example(example, tokenizer): + encoded_inputs = tokenizer(text=example["text"], max_seq_len=128, pad_to_max_seq_len=True) + encoded_inputs["labels"] = int(example["label"]) + return encoded_inputs + +train_dataset = train_dataset.map(partial(convert_example, tokenizer=tokenizer)) +``` + +4. 
构造Trainer实例,进行模型训练。 + - 这里传入`model,criterion,args,train_dataset,tokenizer`这些训练需要的组件,构建了实例化的trainer + - 使用trainer.train()接口开始训练过程。训练完成后,可以保存模型,保存一些日志。 +```python +trainer = Trainer( + model=model, + criterion=paddle.nn.loss.CrossEntropyLoss(), + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + tokenizer=tokenizer) + +if training_args.do_train: + train_result = trainer.train() + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_state() +``` +预训练的使用方式可以参考[ERNIE-1.0 Trainer](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/run_pretrain_trainer.py)版本。 + + +## Trainer进阶分布式能力使用介绍 + +**通用分布式能力** +对于通用的分布式能力, PaddleNLP主要做了数据并行data_parallel, 分布式参数sharding功能的支持. +这类功能无需用户修改组网, 直接多卡即可运行. + +用户使用 `paddle.distruted.launch --devices "0,1,2,3" train.py`即可将运行的程序切换为多卡数据并行. +如果想要使用sharding功能, 减少模型显存占用, 指定参数`--sharding "stage2"`即可. 更多sharding功能配置见参数介绍部分. + + +**混合并行分布式能力** + +飞桨4D并行, 即: data parallel + sharding parallel + tensor parallel + pipeline parallel. + +混合并行这里, 主要添加了 tensor parallel (TP) 和 pipeline parallel(PP)支持. +目前, PaddleNLP主要对一些大模型, 如 GPT, Llama等做了 TP PP支持, 用户可以使用这些策略. + +相关代码实现可以参考llama训练的[例子](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/llama/run_trainer_tp4pp2.sh) + +流水线并行的组网改造可以参见[modeling_pp.py](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/llama/modeling_pp.py) + + +当组网适配好 张量并行(TP), 流水线并行(PP)之后, 用户使用 `--tensor_parallel_degree` `--pipeline_parallel_degree` 即可启用混合并行训练. + + + + +## Trainer 实例化参数介绍 +Trainer 是一个简单,但功能完整的 Paddle训练和评估模块,并针对 PaddleNLP 模型进行了优化。 + +```python +参数: + model([`PretrainedModel`] 或 `paddle.nn.Layer`,可选): + 用于训练、评估或预测的模型。 + [`Trainer`] 对PaddleNLP的 [`PretrainedModel`] 一起使用进行了优化。你仍然可以使用 + 您自己的模型定义为`paddle.nn.Layer`,只要它们的工作方式与 PaddleNLP 模型相同。 + + ([`PretrainedModel`] or `paddle.nn.Layer`, *optional*): + The model to train, evaluate or use for predictions. + criterion (`paddle.nn.Layer`,*可选*): + model可能只输出中间结果loggit,如果想对模型的输出做更多的计算,可以添加criterion层。 + + The model may only output the loggit, if you want do more computation for the output of model, + you can add the criterion Layer. + + args([`TrainingArguments`],可选): + 训练时需要用到的参数。将默认使用 [`TrainingArguments`] 初始化。 + `output_dir` 设置为当前目录中名为 *tmp_trainer* 的目录(如果未提供)。 + + ([`TrainingArguments`], *optional*): + The arguments to tweak for training. Will default to a basic instance of [`TrainingArguments`] with the + `output_dir` set to a directory named *tmp_trainer* in the current directory if not provided. + + data_collator(`DataCollator`,可选): + 用于将 `train_dataset` 或 `eval_dataset` 的数据,组合为batch的函数。 + 如果没有提供 `tokenizer`,则默认为 [`default_data_collator`], 否则为 + [`DataCollatorWithPadding`]。 + + (`DataCollator`, *optional*): + The function to use to form a batch from a list of elements of `train_dataset` or `eval_dataset`. Will + default to [`default_data_collator`] if no `tokenizer` is provided, an instance of + [`DataCollatorWithPadding`] otherwise. + + + train_dataset(`paddle.io.Dataset` 或 `paddle.io.IterableDataset`,可选): + 用于训练的数据集。如果是 `datasets.Dataset`,那么 + `model.forward()` 不需要的输入字段会被自动删除。 + + (`paddle.io.Dataset` or `paddle.io.IterableDataset`, *optional*): + The dataset to use for training. If it is an `datasets.Dataset`, columns not accepted by the + `model.forward()` method are automatically removed. 
+ + eval_dataset(`paddle.io.Dataset` 或 `Dict[str, paddle.io.Dataset]`,可选): + 用于评估的数据集。如果是 `datasets.Dataset`,那么 + `model.forward()` 不需要的输入字段会被自动删除。 + 如果它是一个字典,则将对字典中每个数据集进行评估, + 并将字典中的键添加到评估指标名称前。 + + The dataset to use for evaluation. If it is a [`~datasets.Dataset`], columns not accepted by the + `model.forward()` method are automatically removed. If it is a dictionary, it will evaluate on each + dataset prepending the dictionary key to the metric name. + + tokenizer([`PretrainedTokenizer`],可选): + 用于数据预处理的tokenizer。如果传入,将用于自动Pad输入 + batch输入的最大长度,它随模型保存,可以重新运行中断的训练过程。 + + ([`PretrainedTokenizer`], *optional*): + The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs the + maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an + interrupted training or reuse the fine-tuned model. + + compute_metrics (`Callable[[EvalPrediction], Dict]`, 可选): + 用于评估的计算指标的函数。必须采用 [`EvalPrediction`] 并返回 + dict形式的metrics结果。 + + (`Callable[[EvalPrediction], Dict]`, *optional*): + The function that will be used to compute metrics at evaluation. Must take a [`EvalPrediction`] and return + a dictionary string to metric values. + + callbacks (List of [`TrainerCallback`],*可选*): + 用于自定义训练call列表函数。将这些函数会被添加到默认回调函数列表。 + 如果要删除使用的回调函数,请使用 [`Trainer.remove_callback`] 方法。 + + A list of callbacks to customize the training loop. Will add those to the list of default callbacks. + If you want to remove one of the default callbacks used, use the [`Trainer.remove_callback`] method. + + optimizers (`Tuple[paddle.optimizer.Optimizer, paddle.optimizer.lr.LRScheduler]`, 可选): + 一个tuple, 包含要使用Optimizer和LRScheduler。将默认为模型上的 [`AdamW`] 实例 + 和LinearDecayWithWarmup。 + + (`Tuple[paddle.optimizer.Optimizer, paddle.optimizer.lr.LRScheduler]`, *optional*) + A tuple containing the optimizer and the scheduler to use. Will default to an instance of [`AdamW`] on your model + and a scheduler [`LinearDecayWithWarmup`]. + + preprocess_logits_for_metrics (`Callable[[paddle.Tensor, paddle.Tensor], paddle.Tensor]`, 可选)): + 一个函数, 在每次评估之前对logits进行预处理。 + + (`Callable[[paddle.Tensor, paddle.Tensor], paddle.Tensor]`, *optional*) + A function that preprocess the logits right before caching them at each evaluation step. Must take two + tensors, the logits and the labels, and return the logits once processed as desired. The modifications made + by this function will be reflected in the predictions received by `compute_metrics`. +``` + + +## TrainingArguments 参数介绍 +```python + --output_dir + 保存模型输出和中间checkpoints的输出目录。(`str`, 必须, 默认为 `None`) + + The output directory where the model predictions and + checkpoints will be written. (default: None) + + --overwrite_output_dir + 如果 `True`,覆盖输出目录的内容。如果 `output_dir` 指向检查点 + 目录,则使用它继续训练。(`bool`, 可选, 默认为 `False`) + + Overwrite the content of the output directory. Use + this to continue training if output_dir points to a + checkpoint directory. (default: False) + + --do_train + 是否进行训练任务。 注:`Trainer`不直接使用此参数,而是提供给用户 + 的训练/评估脚本使用。(`bool`, 可选, 默认为 `False`) + + Whether to run training. (default: False) + + --do_eval + 是否进行评估任务。同上。(`bool`, 可选, 默认为 `False`) + + Whether to run eval on the dev set. (default: False) + + --do_predict + 是否进行预测任务。同上。(`bool`, 可选, 默认为 `False`) + + Whether to run predictions on the test set. (default:False) + + --do_export + 是否进行模型导出任务。同上。(`bool`, 可选, 默认为 `False`) + + Whether to export infernece model. 
(default: False) + + --evaluation_strategy {no,steps,epoch} + 评估策略,(`str`, 可选,默认为 `"no"`): + 训练期间采用的评估策略。可能的值为: + - `"no"`:训练期间不进行评估。 + - `"steps"`:评估在每个`eval_steps`完成(并记录)。 + - `"epoch"`:在每个 epoch 结束时进行评估。 + + The evaluation strategy to use. (default: no) + + --prediction_loss_only + 在执行评估和预测任务时,只返回loss的值。(`bool`, 可选, 默认为 `False`) + + When performing evaluation and predictions, only + returns the loss. (default: False) + + --per_device_train_batch_size + 用于训练的每个 GPU 核心/CPU 的batch大小.(`int`,可选,默认为 8) + + Batch size per GPU core/CPU for training. (default: 8) + + --per_device_eval_batch_size + 用于评估的每个 GPU 核心/CPU 的batch大小.(`int`,可选,默认为 8) + + Batch size per GPU core/CPU for evaluation. (default:8) + + --gradient_accumulation_steps + 在执行反向,更新回传梯度之前,累积梯度的更新步骤数(`int`,可选,默认为 1) + + Number of updates steps to accumulate before + performing a backward/update pass. (default: 1) + + --eval_accumulation_steps + 在将结果移动到CPU之前,累积输出张量的预测步骤数。如果如果未设置, + 则在移动到CPU之前,整个预测都会在GPU上累积(速度更快需要更多的显存)。 + (`int`,可选,默认为 None 不设置) + + Number of predictions steps to accumulate the output tensors for, + before moving the results to the CPU. If left unset, the whole predictions are + accumulated on GPU before being moved to the CPU (faster butrequires more memory) + (default: None) + + --learning_rate + 优化器的初始学习率, (`float`,可选,默认为 5e-05) + + The initial learning rate for optimizer. (default: 5e-05) + + --weight_decay + 除了所有bias和 LayerNorm 权重之外,应用于所有层的权重衰减数值。(`float`,可选,默认为 0.0) + + Weight decay for AdamW if we apply some. (default: + 0.0) + + --adam_beta1 + AdamW的优化器的 beta1 超参数。(`float`,可选,默认为 0.9) + + Beta1 for AdamW optimizer (default: 0.9) + + --adam_beta2 + AdamW的优化器的 beta2 超参数。(`float`,可选,默认为 0.999) + + Beta2 for AdamW optimizer (default: 0.999) + + --adam_epsilon + AdamW的优化器的 epsilon 超参数。(`float`,可选,默认为 1e-8) + + Epsilon for AdamW optimizer. (default: 1e-08) + + --max_grad_norm + 最大梯度范数(用于梯度裁剪)。(`float`,可选,默认为 1.0) + + Max gradient norm. (default: 1.0) + + --num_train_epochs + 要执行的训练 epoch 总数(如果不是整数,将在停止训练 + 之前执行最后一个 epoch 的小数部分百分比)。 + (`float`, 可选, 默认为 3.0): + + Total number of training epochs to perform. (default:3.0) + + --max_steps + 如果设置为正数,则表示要执行的训练步骤总数。 + 覆盖`num_train_epochs`。(`int`,可选,默认为 -1) + + If > 0: set total number of training steps to + perform.Override num_train_epochs. (default: -1 + + --lr_scheduler_type + 要使用的学习率调度策略。 (`str`, 可选, 默认为 `"linear"`) + + The scheduler type to use. (default: linear) 支持,linear, cosine, constant, constant_with_warmup. + + --warmup_ratio + 用于从 0 到 `learning_rate` 的线性warmup的总训练步骤的比例。(`float`,可选,默认为 0.0) + + Linear warmup over warmup_ratio fraction of total + steps. (default: 0.0) + + --warmup_steps + 用于从 0 到 `learning_rate` 的线性warmup的步数。覆盖warmup_ratio参数。 + (`int`,可选,默认为 0) + + Linear warmup over warmup_steps. (default: 0) + + --log_on_each_node + 在多节点分布式训练中,是在每个节点上记录一次,还是仅在主节点上记录节点。(`bool`,可选,默认为`True`) + + When doing a multinode distributed training, whether + to log once per node or just once on the main node. + (default: True) + + --logging_dir + VisualDL日志目录。(`str`,可选,默认为None) + None情况下会修改为 *output_dir/runs/**CURRENT_DATETIME_HOSTNAME** + + VisualDL log dir. (default: None) + + --logging_strategy {no,steps,epoch} + (`str`, 可选,默认为 `"steps"`) + 训练期间采用的日志记录策略。可能的值为: + - `"no"`:训练期间不进行记录。 + - `"epoch"`:记录在每个 epoch 结束时完成。 + - `"steps"`:记录是每 `logging_steps` 完成的。 + + The logging strategy to use. 
(default: steps) + + --logging_first_step + 是否记录和评估第一个 `global_step`。(`bool`,可选,默认为`False`) + + Log the first global_step (default: False) + + --logging_steps + 如果 `logging_strategy="steps"`,则两个日志之间的更新步骤数。 + (`int`,可选,默认为 500) + + Log every X updates steps. (default: 500) + + --save_strategy {no,steps,epoch} + (`str`, 可选,默认为 `"steps"`) + 训练期间采用的checkpoint保存策略。可能的值为: + - `"no"`:训练期间不保存。 + - `"epoch"`:保存在每个 epoch 结束时完成。 + - `"steps"`:保存是每`save_steps`完成。 + The checkpoint save strategy to use. (default: steps) + + --save_steps + 如果 `save_strategy="steps"`,则在两个checkpoint保存之间的更新步骤数。 + (`int`,可选,默认为 500) + + Save checkpoint every X updates steps. (default: 500) + + --save_total_limit + 如果设置次参数,将限制checkpoint的总数。删除旧的checkpoints + `输出目录`。(`int`,可选) + + Limit the total amount of checkpoints. Deletes the + older checkpoints in the output_dir. Default is + unlimited checkpoints (default: None) + + --save_on_each_node + 在做多节点分布式训练时,是在每个节点上保存模型和checkpoints, + 还是只在主节点上。当不同的节点使用相同的存储时,不应激活此功能, + 因为每个节点的文件将以相同的名称保存。(`bool`, 可选, 默认为 `False`) + + When doing multi-node distributed training, whether to + save models and checkpoints on each node, or only on + the main one (default: False) + + --no_cuda + 是否不使用 CUDA,即使CUDA环境可用。(`bool`, 可选, 默认为 `False`) + Do not use CUDA even when it is available (default: + False) + --seed + 设置的随机种子。为确保多次运行的可复现性。(`int`,可选,默认为 42) + + Random seed that will be set at the beginning of + training. (default: 42) + + --bf16 + 是否使用 bf16 混合精度训练而不是 fp32 训练。需要 Ampere 或更高的 NVIDIA + 显卡架构支持。这是实验性质的API,以后可能会修改。 + (`bool`, 可选, 默认为 `False`) + + Whether to use bf16 (mixed) precision instead of + 32-bit. Requires Ampere or higher NVIDIA architecture. + This is an experimental API and it may change. + (default: False) + + --fp16 + 是否使用 fp16 混合精度训练而不是 fp32 训练。 + (`bool`, 可选, 默认为 `False`) + + Whether to use fp16 (mixed) precision instead of + 32-bit (default: False) + + --fp16_opt_level + 混合精度训练模式,可为``O1``或``O2``模式,默认``O1``模式,默认O1. + O1表示混合精度训练,O2表示纯fp16/bf16训练。 + 只在fp16或bf16选项开启时候生效. + (`str`, 可选, 默认为 `O1`) + + For fp16: AMP optimization level selected in + ['O0', 'O1', and 'O2']. See details at https://www.pad + dlepaddle.org.cn/documentation/docs/zh/develop/api/pad + dle/amp/auto_cast_cn.html (default: O1) + --amp_custom_black_list + 飞桨有默认的黑名单,可以根据模型特点设置自定义黑名单。自定义黑名单中的算子在计算时会被认为是数值危险的,它们的影响也可能会在下游算子中观察到。该名单中的算子不会转为 float16/bfloat16 计算。(可选,默认为None) + + The custom black_list. The set of ops that support fp16/bf16 calculation and are considered numerically-dangerous and whose effects may also be observed in downstream ops. These ops will not be converted to fp16/bf16. (default:None) + + --amp_custom_white_list + 飞桨有默认的白名单,通常不需要设置自定义白名单。自定义白名单中的算子在计算时会被认为是数值安全的,并且对性能至关重要。如果设置了该名单,其中的算子会使用 float16/bfloat16 计算。(可选,默认为None) + + The custom white_list. It’s the set of ops that support fp16/bf16 calculation and are considered numerically-safe and performance-critical. These ops will be converted to fp16/bf16. (default:None) + + --amp_master_grad + 当使用pure fp16/bf16的时候, 可能对梯度的数值精度有更高要求, + 例如梯度裁剪, weight decay, 权重更新的时候. + 打开此选项, 梯度的数值精度会变成float32类型. + 只在 --fp16_opt_level O2 生效, 默认为 False + + For amp opt level=’O2’, whether to use float32 weight gradients + for calculations such as gradient clipping, weight decay, and weight updates. + If master_grad is enabled, the weight gradients will be float32 dtype after the backpropagation. + Note: only support model parallel and pipeline parallel for now !!! 
(default: False) + + --scale_loss + fp16/bf16训练时,scale_loss的初始值。 + (`float`,可选,默认为 32768) + + The value of initial scale_loss for fp16. (default: 32768) + + --sharding + 是否使用Paddle的Sharding数据并行功能,用户的参数。支持sharding `stage1`, `stage2` or `stage3`。 + 其中`stage2``stage3`可以和`offload`组合使用。 + 每个种策略分别为: + stage1 : optimizer 中的参数切分到不同卡 + stage2 : optimizer + gradient 中的参数切分到不同卡 + stage3 : parameter + gradient + optimizer 中的参数都切分到不同卡 + offload : offload parameters to cpu 部分参数存放到cpu中 + (`str`, 可选, 默认为 `` 不使用sharding) + + Whether or not to use Paddle Sharding Data Parallel training (in distributed training + only). The base option should be `stage1`, `stage2` or `stage3` and you can add + CPU-offload to `stage2` or `stage3` like this: `stage2 offload` or `stage3 offload`. + Each stage means: + stage1 : optimizer state segmentation + stage2 : optimizer state + gradient segmentation + stage3 : parameter + gradient + optimizer state segmentation + offload : offload parameters to cpu + + --sharding_parallel_degree + 设置sharding的通信组参数,表示通信组的大小。同一个sharding通信组内的参数,进行sharding,分布到不同卡上。 + 不同sharding通信组之间,相当于单纯的数据并行。此选项只在sharding选项开启时候生效。 + 默认值为-1,表示所有训练的卡在同一个通信组内。 + (`int`, 可选, 默认为 `-1`) + + Sharding parameter in certain cards group. For example, aussume we use 2 machines each + with 8 cards, then set sharding_degree=8, sharding will only communication inside machine. + default -1 means sharding parameters between all workers. (`int`, *optional*, defaults to `-1`) + + --tensor_parallel_degree + 张量并行是Megatron论文针对Transformer结构的张量切分方法. + 此方法将一层transformer的计算划分到了不同卡上. + 此参数tensor_parallel_degree表示将一层transformer结构的份数. + 默认值-1, 表示不启用张量并行, + (`int`, 可选, 默认为 `-1`) + (注: 该方法需要修改模型结构, 目前支持GPT/BLOOM/LLAMA/BLOOM/CLM/CHATGLM) + (注: 该方法对通信开销较大, 建议 tensor_parallel_degree<=8, 尽量使用机器内部通信) + + Tensor parallelism is a parallel technique which proposed in (https://arxiv.org/pdf/2104.04473.pdf see 2.3 Tensor Model Parallelism). + This techique splits one transformer layer into multi-cards (For examples, tensor_parallel_degree=4, will split a layer to 4-parts) + tensor_parallel_degree means split the transformer layer to how many parts. + default -1 for not use tensor parallel, Suggest tensor_parallel_degree<=8 for better proformance. + Note, this need model support in source code, currently GPT/BLOOM/LLAMA/BLOOM/CLM/CHATGLM is supported. + + + --pipeline_parallel_degree + 流水线并行是Megatron论文针对多层Transformer结构提出的按层划分方法. + 该方法将多层的transformer结构,按照不同层,均匀划分到不同的卡上. + 然后数据流先后在不同的卡上传递, 形成流水线. + 参数pipeline_parallel_degree表示划分流水线的大小.(假设该参数为4, 模型12层, 则每一个pp stage 包含3层模型) + 默认值-1, 表示不启用流水线并行, + (`int`, 可选, 默认为 `-1`) + (注, 使用此功能需要修改源码,请参见language_model/llama/modeling_pp.py文件) + + Pipeline parallelism is parallel technique proposed in (https://arxiv.org/pdf/2104.04473.pdf see 2.2 Pipeline Model Parallelism). + Pipeline parallelism assigns multi-transformer layers to different cards, the micro batch data stream passed between cards like pipelines. + pipeline_parallel_degree means split all transformer layers to how many stages. + default -1 for not use pipeline parallel. + Note. this need model support in source code, see llama modeling_pp.py file + + --pipeline_parallel_config + 对于流水线并行,一些选项会影响训练性能,这里将一些选项配置集中管理,以str形式传入配置. + 支持如下选项: + disable_p2p_cache_shape : 关闭通信时候的tensor shape cache, 如果你的模型输入的tensor, shape 是不断变化的(如sequence length) 必须配置此选项 + disable_partial_send_recv : 关闭与张量并行合用时候的通信优化. + enable_dp_comm_overlap : 开启PP+DP使用时候的通信优化. + enable_delay_scale_loss : 开启, 使得梯度累积, 先累积最后除以累积次数. 而不是每次除以累积次数. 
+ + Some additional config it highly affect the useage of pipeline parallel, we provide some option to config it. + following config is support: + disable_p2p_cache_shape, if you max sequence length is varying, please set disable_p2p_cache_shape. + disable_partial_send_recv, optmize send speed for tensor parallel. + enable_delay_scale_loss, accumulate gradients util optimizer step, all gradients div by inner pipeline accumute step. instead of div accumute step on loss directly. + enable_dp_comm_overlap, fuse data parallel gradient communication. + + + --recompute + 是否使用重计算训练。可以节省显存。 + 重新计算前向过程以获取梯度,减少中间变量显存. + 注:需要组网支持 recompute,默认使用 enable_recompute 关键字作为recompute功能开关。 + (`bool`, 可选, 默认为 `False`) + + Recompute the forward pass to calculate gradients. Used for saving memory (default: False) + + --minimum_eval_times + 最少评估次数,如果当前设置的eval_steps,评估次数少于minimum_eval_times, + 此选项会覆盖eval_steps参数。 + (`int`,可选,默认为 None) + + If under eval_steps, the valid time is less then + minimum_eval_times, the config of override eval_steps. + (default: None) + + --local_rank + 分布式训练时,设备的本地rank值。 + For distributed training: local_rank (default: -1) + + --dataloader_drop_last + 是否丢弃最后一个不完整的批次(如果数据集的长度不能被批次大小整除) + (`bool`,可选,默认为 False) + + Drop the last incomplete batch if it is not divisible + by the batch size. (default: False) + + --eval_steps + 如果 `evaluation_strategy="steps"`,则两次评估之间的更新步骤数。将默认为相同如果未设置,则值为 `logging_steps`。 + (`int`,可选,默认为 None) + + Run an evaluation every X steps. (default: None) + + --max_evaluate_steps + 如果设置为正数,则表示要执行的评估步骤的总数。 + (`int`,可选,默认为 -1) + + If set to a positive number, the total number of evaluation steps to perform. (default: -1) + + --dataloader_num_workers + 用于数据加载的子进程数。 0 表示数据将在主进程制造。 + (`int`,可选,默认为 0) + + Number of subprocesses to use for data loading. 0 means + that the data will be loaded in the main process. (default: 0) + + --past_index + If >=0, uses the corresponding part of the output as + the past state for next step. (default: -1) + + --run_name + An optional descriptor for the run. (default: None) + --device + 运行的设备名称。支持cpu/gpu, 默认gpu + (`str`,可选,默认为 'gpu') + + select cpu, gpu, xpu devices. (default: gpu) + + --disable_tqdm + 是否使用tqdm进度条 + Whether or not to disable the tqdm progress bars. + (default: None) + + --remove_unused_columns + 去除Dataset中不用的字段数据 + Remove columns not required by the model when using an + nlp.Dataset. (default: True) + + --label_names + 训练数据标签label的名称 + The list of keys in your dictionary of inputs that + correspond to the labels. (default: None) + + --load_best_model_at_end + 训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用 + Whether or not to load the best model found during + training at the end of training. (default: False) + + --metric_for_best_model + 最优模型指标,如`eval_accuarcy`等,用于比较模型好坏。 + The metric to use to compare two different models. + (default: None) + + --greater_is_better + 与`metric_for_best_model`配合使用。 + Whether the `metric_for_best_model` should be + maximized or not. (default: None) + + --ignore_data_skip + 重启训练时候,不略过已经训练的数据。 + When resuming training, whether or not to skip the + first epochs and batches to get to the same training + data. (default: False) + + --optim + 优化器名称,默认为adamw,(`str`, 可选,默认为 `adamw`) + The optimizer to use. (default: adamw) + + --report_to + 日志可视化显示,默认使用visualdl可视化展示。(可选,默认为 None,展示所有) + The list of integrations to report the results and + logs to. (default: None) + + --resume_from_checkpoint + 是否从断点重启恢复训练,(可选,默认为 None) + The path to a folder with a valid checkpoint for your + model. 
(default: None) + + --skip_memory_metrics + 是否跳过内存profiler检测。(可选,默认为True,跳过) + Whether or not to skip adding of memory profiler reports + to metrics.(default:True) + + --flatten_param_grads + 是否在优化器中使用flatten_param_grads策略,该策略将素有参数摊平后输入Optimizer更新。目前该策略仅在NPU设备上生效。(可选,默认为False) + Whether use flatten_param_grads method in optimizer, + only used on NPU devices.(default:False) + +``` diff --git a/docs/tutorials/classify.rst b/docs/tutorials/classify.rst new file mode 100644 index 0000000000000000000000000000000000000000..75cc774f6f26f8d9d63c9dbd5cfaaa744e3cb220 --- /dev/null +++ b/docs/tutorials/classify.rst @@ -0,0 +1,3 @@ +======================== +文本分类 +======================== diff --git a/docs/tutorials/embedding.rst b/docs/tutorials/embedding.rst new file mode 100644 index 0000000000000000000000000000000000000000..8c618c6fde192ad03a645d002913cffb94d183bf --- /dev/null +++ b/docs/tutorials/embedding.rst @@ -0,0 +1,3 @@ +======================== +词向量 +======================== diff --git a/docs/tutorials/general_dialogue.rst b/docs/tutorials/general_dialogue.rst new file mode 100644 index 0000000000000000000000000000000000000000..2b9e0f28110e14ad2d52f240b5ce7b6c1438c9bb --- /dev/null +++ b/docs/tutorials/general_dialogue.rst @@ -0,0 +1,3 @@ +======================== +通用对话 +======================== diff --git a/docs/tutorials/lexical_analysis.rst b/docs/tutorials/lexical_analysis.rst new file mode 100644 index 0000000000000000000000000000000000000000..872bcb710582c676dbfee23775fce0523eea17ac --- /dev/null +++ b/docs/tutorials/lexical_analysis.rst @@ -0,0 +1,3 @@ +======================== +词法分析 +======================== diff --git a/docs/tutorials/machine_translation.rst b/docs/tutorials/machine_translation.rst new file mode 100644 index 0000000000000000000000000000000000000000..e153b7a33b51db0266adec18567b346824e22fb3 --- /dev/null +++ b/docs/tutorials/machine_translation.rst @@ -0,0 +1,3 @@ +======================== +机器翻译 +======================== diff --git a/docs/tutorials/ner.rst b/docs/tutorials/ner.rst new file mode 100644 index 0000000000000000000000000000000000000000..cf2a8b257aba879d7fdab441201f628d526a7b24 --- /dev/null +++ b/docs/tutorials/ner.rst @@ -0,0 +1,3 @@ +======================== +序列标注 +======================== diff --git a/docs/tutorials/overview.rst b/docs/tutorials/overview.rst new file mode 100644 index 0000000000000000000000000000000000000000..040de50bc802f667a57839161db333b8163e3c6a --- /dev/null +++ b/docs/tutorials/overview.rst @@ -0,0 +1,43 @@ +============ +整体介绍 +============ + + +案例集 +---------- + + - 词向量 + + - `使用预训练词向量改善模型效果 `_ + + - 文本分类 + + - `基于LSTM等RNN网络的文本分类 `_ + - `基于预训练模型的文本分类 `_ + - `自定义数据集实现文本多分类任务 `_ + + - 信息抽取 + + - `使用BiGRU-CRF模型完成快递单信息抽取 `_ + - `使用预训练模型ERNIE优化快递单信息抽取 `_ + - `关系抽取 `_ + - `事件抽取 `_ + + - 阅读理解式问答 + + - `使用预训练模型完成阅读理解 `_ + + - 对话 + + - `多技能对话 `_ + + - 文本生成 + + - `使用Seq2Seq模型完成自动对联 `_ + - `使用预训练模型ERNIE-GEN实现智能写诗 `_ + + - 时序预测 + + - `使用TCN网络完成新冠疫情病例数预测 `_ + + 更多教程参见 `PaddleNLP on AI Studio `_ \ No newline at end of file diff --git a/docs/tutorials/reading_comprehension.rst b/docs/tutorials/reading_comprehension.rst new file mode 100644 index 0000000000000000000000000000000000000000..7ff09f6388c6484b1f1598578da13e3fb6215d3d --- /dev/null +++ b/docs/tutorials/reading_comprehension.rst @@ -0,0 +1,3 @@ +======================== +阅读理解 +======================== diff --git a/docs/tutorials/semantic_matching.rst b/docs/tutorials/semantic_matching.rst new file mode 100644 index 
0000000000000000000000000000000000000000..40bd7cfd9c82c1e611529749535ee23e6824c578 --- /dev/null +++ b/docs/tutorials/semantic_matching.rst @@ -0,0 +1,3 @@ +======================== +语义匹配 +======================== diff --git a/docs/tutorials/text_generation.rst b/docs/tutorials/text_generation.rst new file mode 100644 index 0000000000000000000000000000000000000000..3fbb1f3d1be82138fc5c8e2a02df5fc8e44056a5 --- /dev/null +++ b/docs/tutorials/text_generation.rst @@ -0,0 +1,3 @@ +======================== +文本生成 +======================== diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 0000000000000000000000000000000000000000..817c9400b2ddf39de38ce0143c64cda8c92b3fcc --- /dev/null +++ b/examples/README.md @@ -0,0 +1,43 @@ +# PaddleNLP Examples + +PaddleNLP旨在提供覆盖从研究到产业应用的丰富示例,助力开发者加速文本任务开发效率。 + +PaddleNLP provides rich application examples covering mainstream NLP task to help developers accelerate problem solving. + +## NLP 基础技术 (NLP Basic Technique) + +| 目录 Folder | 任务 Task | +| :--------------- | ------- | +| word_embedding | [词向量 (Word Embedding)](./word_embedding/) | +| lexical_analysis | [词法分析 (Lexical Analysis)](./lexical_analysis/) | +| dependency_parsing | [句法依存分析 (Dependency Parsing)](./dependency_parsing/) | +| language_model | [预训练语言模型 (Pretrained Language Model)](./language_model/) | +| text_to_sql | [语义解析 (Semantic Parsing/Text to SQL)](./text_to_sql):star: | +| text_classification | [文本分类 (Text Classification)](./text_classification/) | +| text_matching | [文本匹配 (Text Matching)](./text_matching/) | +| text_generation | [文本生成 (Text Generation)](./text_generation/) | +| text_summarization | [文本摘要 (Text Summarization)](./text_summarization/) | +| text_correction |[文本纠错 (Text Correction)](./text_correction/):star: | +| semantic_indexing | [语义索引 (Semantic Indexing)](./semantic_indexing/)| +| information_extraction | [信息抽取 (Information Extraction)](./information_extraction/) | +| question_generation | [问题生成 (Question Generation)](./question_generation/) | + +## NLP 系统应用 (NLP System Applications) + +| 目录 Folder | 任务 Task | +| :--------------- | ------- | +| sentiment_analysis|[情感分析 (Sentiment Analysis)](./sentiment_analysis/):star2: | +| dialogue |[通用对话 (General Dialogue System)](./dialogue/) | +| machine_translation |[文本翻译 (Machine Translation)](./machine_translation/) | +| simultaneous_translation|[同声翻译 (Simultaneous Translation)](./simultaneous_translation/) | +| machine_reading_comprehension | [阅读理解 (Machine Reading Comprehension)](./machine_reading_comprehension/) | + +## NLP 拓展应用 (NLP Extented Applications) + +| 目录 Folder | 任务 Task | +| :--------------- | ------- | +| few_shot |[小样本学习 (Few-shot Learning)](./few_shot/):star2: | +| text_to_knowledge |[解语知识关联框架 (Text Knowledge Mining)](./text_to_knowledge/):star2: | +| model_compression |[模型压缩 (Model Compression)](./model_compression/) | +| text_graph |[文本图学习 (Text Graph Learning)](./text_graph/erniesage/) | +| time_series |[时间序列预测 (Time Series Prediction)](./time_series/) | diff --git a/examples/benchmark/ceval/README.md b/examples/benchmark/ceval/README.md new file mode 100644 index 0000000000000000000000000000000000000000..67b803ed81dcc2e4152dbd1c6d564e1fd196a50f --- /dev/null +++ b/examples/benchmark/ceval/README.md @@ -0,0 +1,84 @@ +# C-Eval评测脚本 + +此C-Eval评测脚本修改自[ymcui/Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca)项目。 + +## 数据准备 + +从C-Eval官方指定路径下载评测数据集,并解压至data文件夹: + +``` +wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip +unzip ceval-exam.zip -d 
data +``` +将data文件夹放置于本项目的scripts/ceval目录下。 + +## 运行预测脚本 + +运行以下脚本: + +``` +cd scripts/ceval +python eval.py \ + --model_name_or_path /path/to/your/model \ + --cot False \ + --few_shot False \ + --with_prompt True \ + --constrained_decoding True \ + --temperature 0.2 \ + --n_times 1 \ + --ntrain 5 \ + --do_save_csv False \ + --do_test False \ + --output_dir ${output_path} \ +``` + +参数说明 + +- model_path:待评测模型所在目录(合并LoRA后的HF格式模型) +- cot:是否使用chain-of-thought +- few_shot:是否使用few-shot +- ntrain:few_shot=True时,指定few-shot实例的数量(5-shot:ntrain=5);few_shot=False时该项不起作用 +- with_prompt:模型输入是否包含针对Alpaca模型的指令模板 +- constrained_decoding:由于C-Eval评测的标准答案格式为选项'A'/'B'/'C'/'D',所以我们提供了两种从模型生成内容中抽取答案的方案: + - 当constrained_decoding=True,计算模型生成的第一个token分别为'A', 'B', 'C', 'D'的概率,选择其中概率最大的一个作为答案 + - 当constrained_decoding=False,用正则表达式从模型生成内容中提取答案 +- temperature:模型解码时的温度 +- n_times:指定评测的重复次数,将在output_dir下生成指定次数的文件夹 +- do_save_csv:是否将模型生成结果、提取的答案等内容保存在csv文件中 +- output_dir:指定评测结果的输出路径 +- do_test:在valid或test集上测试:当do_test=False,在valid集上测试;当do_test=True,在test集上测试 + +## 评测输出 +模型预测完成后,生成目录`outputs/take*`,其中*代表数字,范围为0至`n_times-1`,分别储存了`n_times`次解码的结果。 + +`outputs/take*`下包含`submission.json`和`summary.json`两个json文件。若`do_save_csv=True`,还将包含52个保存的模型生成结果、提取的答案等内容的csv文件。 + +`submission.json`为依据官方提交规范生成的存储模型评测答案的文件,形式如: + +``` +{ + "computer_network": { + "0": "A", + "1": "B", + ... + }, + "marxism": { + "0": "B", + "1": "A", + ... + }, + ... +} +``` + +summary.json包含模型在52个主题下、4个大类下和总体平均的评测结果。例如,json文件最后的All字段中会显示总体平均效果: + +``` + "All": { + "score": 0.36701337295690933, + "num": 1346, + "correct": 494.0 +} +``` + +其中score为准确率,num为测试的总样本条数,correct为正确的数量。 diff --git a/examples/benchmark/ceval/eval.py b/examples/benchmark/ceval/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..47b2227bbf3c9155a687b1923d5bd6c5af6efeac --- /dev/null +++ b/examples/benchmark/ceval/eval.py @@ -0,0 +1,130 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Adapted from https://github.com/ymcui/Chinese-LLaMA-Alpaca and https://github.com/SJTU-LIT/ceval +import argparse +import json +import os +import time + +import pandas as pd +from model_evaluator import ModelEvaluator + +choices = ["A", "B", "C", "D"] + + +def main(args, evaluator, take): + assert os.path.exists("subject_mapping.json"), "subject_mapping.json not found!" 
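+    # Load the subject -> [English name, Chinese name, category] mapping;
+    # the category ("STEM" / "Social Science" / "Humanities" / "Other") is used below to build the grouped summary.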
+ with open("subject_mapping.json") as f: + subject_mapping = json.load(f) + filenames = os.listdir("data/val") + subject_list = [val_file.replace("_val.csv", "") for val_file in filenames] + accuracy, summary = {}, {} + + run_date = time.strftime("%Y-%m-%d_%H-%M-%S", time.localtime(time.time())) + output_dir = args.output_dir + save_result_dir = os.path.join(output_dir, f"take{take}") + if not os.path.exists(save_result_dir): + os.makedirs(save_result_dir, exist_ok=True) + + all_answers = {} + for index, subject_name in enumerate(subject_list): + print( + f"{index/len(subject_list)} Inference starts at {run_date} on {args.model_name_or_path} with subject of {subject_name}!" + ) + val_file_path = os.path.join("data/val", f"{subject_name}_val.csv") + dev_file_path = os.path.join("data/dev", f"{subject_name}_dev.csv") + test_file_path = os.path.join("data/test", f"{subject_name}_test.csv") + + val_df = pd.read_csv(val_file_path) if args.do_test is False else pd.read_csv(test_file_path) + dev_df = pd.read_csv(dev_file_path) if args.few_shot else None + + correct_ratio, answers = evaluator.eval_subject( + subject_name, + val_df, + dev_df, + save_result_dir=save_result_dir if args.do_save_csv else None, + few_shot=args.few_shot, + cot=args.cot, + with_prompt=args.with_prompt, + constrained_decoding=args.constrained_decoding, + do_test=args.do_test, + ) + print(f"Subject: {subject_name}") + print(f"Acc: {correct_ratio}") + accuracy[subject_name] = correct_ratio + summary[subject_name] = { + "score": correct_ratio, + "num": len(val_df), + "correct": correct_ratio * len(val_df) / 100, + } + all_answers[subject_name] = answers + + json.dump(all_answers, open(save_result_dir + "/submission.json", "w"), ensure_ascii=False, indent=4) + print("Accuracy:") + for k, v in accuracy.items(): + print(k, ": ", v) + + total_num = 0 + total_correct = 0 + summary["grouped"] = { + "STEM": {"correct": 0.0, "num": 0}, + "Social Science": {"correct": 0.0, "num": 0}, + "Humanities": {"correct": 0.0, "num": 0}, + "Other": {"correct": 0.0, "num": 0}, + } + for subj, info in subject_mapping.items(): + group = info[2] + summary["grouped"][group]["num"] += summary[subj]["num"] + summary["grouped"][group]["correct"] += summary[subj]["correct"] + for group, info in summary["grouped"].items(): + info["score"] = info["correct"] / info["num"] + total_num += info["num"] + total_correct += info["correct"] + summary["All"] = {"score": total_correct / total_num, "num": total_num, "correct": total_correct} + + json.dump(summary, open(save_result_dir + "/summary.json", "w"), ensure_ascii=False, indent=2) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", type=str) + parser.add_argument("--cot", choices=["False", "True"], default="False") + parser.add_argument("--few_shot", choices=["False", "True"], default="True") + parser.add_argument("--ntrain", "-k", type=int, default=5) + parser.add_argument("--with_prompt", choices=["False", "True"], default="False") + parser.add_argument("--constrained_decoding", choices=["False", "True"], default="True") + parser.add_argument("--temperature", type=float, default=0.2) + parser.add_argument("--n_times", default=1, type=int) + parser.add_argument("--do_save_csv", choices=["False", "True"], default="False") + parser.add_argument("--output_dir", type=str) + parser.add_argument("--do_test", choices=["False", "True"], default="False") + + args = parser.parse_args() + + args.cot = args.cot == "True" + args.few_shot = args.few_shot == 
"True" + args.with_prompt = args.with_prompt == "True" + args.constrained_decoding = args.constrained_decoding == "True" + args.do_test = args.do_test == "True" + args.do_save_csv = args.do_save_csv == "True" + if args.constrained_decoding is True: + args.n_times = max(args.n_times, 1) + print(args) + + evaluator = ModelEvaluator( + choices=choices, k=args.ntrain, model_name_or_path=args.model_name_or_path, temperature=args.temperature + ) + for i in range(args.n_times): + main(args, evaluator=evaluator, take=i) diff --git a/examples/benchmark/ceval/evaluator.py b/examples/benchmark/ceval/evaluator.py new file mode 100644 index 0000000000000000000000000000000000000000..47eff428b9baf1d63ea5c3fe4139485544d84dba --- /dev/null +++ b/examples/benchmark/ceval/evaluator.py @@ -0,0 +1,61 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Adapted from https://github.com/ymcui/Chinese-LLaMA-Alpaca and https://github.com/SJTU-LIT/ceval +import string + + +class Evaluator: + def __init__(self, choices, model_name, k=-1): + self.choices = choices + self.model_name = model_name + self.k = k + self.puncs = list(string.punctuation) + + def format_example(self, line, include_answer=True): + example = line["question"] + for choice in self.choices: + example += f'\n{choice}. {line[f"{choice}"]}' + example += "\n答案:" + if include_answer: + example += f'{line["answer"]}\n\n' + return example + + def generate_few_shot_prompt(self, subject, dev_df): + prompt = f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。\n\n" + k = self.k + if self.k == -1: + k = dev_df.shape[0] + for i in range(k): + prompt += self.format_example(dev_df.iloc[i, :]) + return prompt + + def eval_subject(self, subject_name, test_df, dev_df=None, few_shot=False, save_result_dir=None): + pass + + def normalize_answer(self, s): + def white_space_fix(text): + return " ".join(text.split()) + + def remove_punc(text): + exclude = set(self.puncs) + return "".join(ch for ch in text if ch not in exclude) + + def lower(text): + return text.lower() + + return white_space_fix(remove_punc(lower(s))) + + def exact_match(self, pred, target): + return self.normalize_answer(pred) == self.normalize_answer(target) diff --git a/examples/benchmark/ceval/model_evaluator.py b/examples/benchmark/ceval/model_evaluator.py new file mode 100644 index 0000000000000000000000000000000000000000..4fbef4fe26c93a2f6f5c583d49a4e11b73774637 --- /dev/null +++ b/examples/benchmark/ceval/model_evaluator.py @@ -0,0 +1,189 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +# Adapted from https://github.com/ymcui/Chinese-LLaMA-Alpaca and https://github.com/SJTU-LIT/ceval +import os +import random +import re + +import numpy as np +import paddle +from evaluator import Evaluator +from tqdm import tqdm + +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer + + +class ModelEvaluator(Evaluator): + def __init__(self, choices, k, model_name_or_path, temperature=0.2): + super().__init__(choices, model_name_or_path, k) + self.model_name_or_path = model_name_or_path + self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path, dtype="float16", low_cpu_mem_usage=True) + self.model.eval() + self.generation_config = dict( + temperature=temperature, + top_k=40, + top_p=0.9, + do_sample=True, + num_beams=1, + repetition_penalty=1.1, + max_new_tokens=20, + ) + + self.A_id = self.tokenizer.encode("A", add_special_tokens=False)["input_ids"][0] + self.B_id = self.tokenizer.encode("B", add_special_tokens=False)["input_ids"][0] + self.C_id = self.tokenizer.encode("C", add_special_tokens=False)["input_ids"][0] + self.D_id = self.tokenizer.encode("D", add_special_tokens=False)["input_ids"][0] + + def eval_subject( + self, + subject_name, + test_df, + dev_df=None, + few_shot=False, + cot=False, + save_result_dir=None, + with_prompt=False, + constrained_decoding=False, + do_test=False, + ): + all_answers = {} + + correct_num = 0 + if save_result_dir: + result = [] + score = [] + if few_shot: + history = self.generate_few_shot_prompt(subject_name, dev_df, cot=cot) + else: + history = "" + answers = ["NA"] * len(test_df) if do_test is True else list(test_df["answer"]) + for row_index, row in tqdm(test_df.iterrows(), total=len(test_df)): + question = self.format_example(row, include_answer=False, cot=cot, with_prompt=with_prompt) + instruction = history + question + inputs = self.tokenizer(instruction, return_tensors="pd") + batch_size, length = inputs.input_ids.shape + if constrained_decoding is True: + # batch_size is 1, take the last logits as the logits for next token prediction + with paddle.no_grad(): + logits = self.model(**inputs)[0][0, -1, :] + choices_logits = logits[[self.A_id, self.B_id, self.C_id, self.D_id]].numpy() + assert not (np.any(np.isinf(choices_logits)) or np.any(np.isnan(choices_logits))) + ans = {0: "A", 1: "B", 2: "C", 3: "D"}[np.argmax(choices_logits)] + response = self.tokenizer.decode([logits.argmax(-1).item()]) + else: + generation_output = self.model.generate( + **inputs, + eos_token_id=self.tokenizer.eos_token_id, + pad_token_id=self.tokenizer.pad_token_id, + **self.generation_config, + ) + response = self.tokenizer.decode(generation_output[0][0, length:], skip_special_tokens=True) + ans, direct_extract = self.extract_answer(row, response) + if ans == answers[row_index]: + correct_num += 1 + correct = 1 + else: + correct = 0 + print(f"\n=======begin {str(row_index)}=======") + print("question: ", question) + print("response: ", response) + print("ans: ", ans) + print("ground truth: ", answers[row_index], "\n") + if save_result_dir: + result.append(response) + score.append(correct) + print(f"=======end {str(row_index)}=======") + + all_answers[str(row_index)] = ans + + correct_ratio = 100 * correct_num / len(answers) + + if save_result_dir: + test_df["model_output"] = result + test_df["correctness"] = score + test_df.to_csv(os.path.join(save_result_dir, 
f"{subject_name}_test.csv")) + + return correct_ratio, all_answers + + def format_example(self, line, include_answer=True, cot=False, with_prompt=False): + example = line["question"] + for choice in self.choices: + example += f'\n{choice}. {line[f"{choice}"]}' + if include_answer: + if cot: + example += "\n答案:让我们一步一步思考,\n" + line["explanation"] + f"\n所以答案是{line['answer']}。\n\n" + else: + example += "\n答案:" + line["answer"] + "\n\n" + else: + if with_prompt is False: + if cot: + example += "\n答案:让我们一步一步思考,\n1." + else: + example += "\n答案:" + else: + if cot: + example += "\n答案是什么?让我们一步一步思考,\n1." + else: + example += "\n答案是什么? " + return example + + def generate_few_shot_prompt(self, subject, dev_df, cot=False): + prompt = f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。\n\n" + k = self.k + if self.k == -1: + k = dev_df.shape[0] + for i in range(k): + prompt += self.format_example(dev_df.iloc[i, :], include_answer=True, cot=cot) + return prompt + + def extract_answer(self, line, gen_ans): + m = re.findall(r"所以答案是(.+?)。", gen_ans, re.M) + if len(m) > 0 and m[-1] in self.choices: + return m[-1], True + answer_patterns = [ + r"([ABCD])是正确的", + r"选项([ABCD])正确", + r"答案为([ABCD])", + r"答案是([ABCD])", + r"答案([ABCD])", + r"选择([ABCD])", + r"答案:([ABCD])", + r"选择答案([ABCD])", + ] + # RE extraction + for answer_pattern in answer_patterns: + m = re.search(answer_pattern, gen_ans, re.M) + if m: + answer = m.group(1) + return answer, False + # only containing one choice-character + m = re.findall(r"[ABCD]", gen_ans, re.M) + if len(m) >= 1: + answer = m[0] + return answer, False + # only containing one choice-context + choices_dict = {} + pattern = "" + for c in self.choices: + choices_dict[str(line[f"{c}"])] = c + pattern += re.escape(str(line[f"{c}"])) + "|" + pattern = pattern[:-1] + m = re.findall(pattern, gen_ans, re.M) + print("w/ escape:", repr(pattern), gen_ans, (len(m) >= 1)) + if len(m) >= 1: + answer = choices_dict[m[0]] + return answer, False + return random.choice("ABCD"), False diff --git a/examples/benchmark/ceval/subject_mapping.json b/examples/benchmark/ceval/subject_mapping.json new file mode 100644 index 0000000000000000000000000000000000000000..493c0f38e42213604cd9a46550d577ee9a51e0db --- /dev/null +++ b/examples/benchmark/ceval/subject_mapping.json @@ -0,0 +1,262 @@ +{ + "computer_network": [ + "Computer Network", + "\u8ba1\u7b97\u673a\u7f51\u7edc", + "STEM" + ], + "operating_system": [ + "Operating System", + "\u64cd\u4f5c\u7cfb\u7edf", + "STEM" + ], + "computer_architecture": [ + "Computer Architecture", + "\u8ba1\u7b97\u673a\u7ec4\u6210", + "STEM" + ], + "college_programming": [ + "College Programming", + "\u5927\u5b66\u7f16\u7a0b", + "STEM" + ], + "college_physics": [ + "College Physics", + "\u5927\u5b66\u7269\u7406", + "STEM" + ], + "college_chemistry": [ + "College Chemistry", + "\u5927\u5b66\u5316\u5b66", + "STEM" + ], + "advanced_mathematics": [ + "Advanced Mathematics", + "\u9ad8\u7b49\u6570\u5b66", + "STEM" + ], + "probability_and_statistics": [ + "Probability and Statistics", + "\u6982\u7387\u7edf\u8ba1", + "STEM" + ], + "discrete_mathematics": [ + "Discrete Mathematics", + "\u79bb\u6563\u6570\u5b66", + "STEM" + ], + "electrical_engineer": [ + "Electrical Engineer", + "\u6ce8\u518c\u7535\u6c14\u5de5\u7a0b\u5e08", + "STEM" + ], + "metrology_engineer": [ + "Metrology Engineer", + "\u6ce8\u518c\u8ba1\u91cf\u5e08", + "STEM" + ], + "high_school_mathematics": [ + "High School Mathematics", + "\u9ad8\u4e2d\u6570\u5b66", + "STEM" + ], + "high_school_physics": [ + "High School Physics", + 
"\u9ad8\u4e2d\u7269\u7406", + "STEM" + ], + "high_school_chemistry": [ + "High School Chemistry", + "\u9ad8\u4e2d\u5316\u5b66", + "STEM" + ], + "high_school_biology": [ + "High School Biology", + "\u9ad8\u4e2d\u751f\u7269", + "STEM" + ], + "middle_school_mathematics": [ + "Middle School Mathematics", + "\u521d\u4e2d\u6570\u5b66", + "STEM" + ], + "middle_school_biology": [ + "Middle School Biology", + "\u521d\u4e2d\u751f\u7269", + "STEM" + ], + "middle_school_physics": [ + "Middle School Physics", + "\u521d\u4e2d\u7269\u7406", + "STEM" + ], + "middle_school_chemistry": [ + "Middle School Chemistry", + "\u521d\u4e2d\u5316\u5b66", + "STEM" + ], + "veterinary_medicine": [ + "Veterinary Medicine", + "\u517d\u533b\u5b66", + "STEM" + ], + "college_economics": [ + "College Economics", + "\u5927\u5b66\u7ecf\u6d4e\u5b66", + "Social Science" + ], + "business_administration": [ + "Business Administration", + "\u5de5\u5546\u7ba1\u7406", + "Social Science" + ], + "marxism": [ + "Marxism", + "\u9a6c\u514b\u601d\u4e3b\u4e49\u57fa\u672c\u539f\u7406", + "Social Science" + ], + "mao_zedong_thought": [ + "Mao Zedong Thought", + "\u6bdb\u6cfd\u4e1c\u601d\u60f3\u548c\u4e2d\u56fd\u7279\u8272\u793e\u4f1a\u4e3b\u4e49\u7406\u8bba\u4f53\u7cfb\u6982\u8bba", + "Social Science" + ], + "education_science": [ + "Education Science", + "\u6559\u80b2\u5b66", + "Social Science" + ], + "teacher_qualification": [ + "Teacher Qualification", + "\u6559\u5e08\u8d44\u683c", + "Social Science" + ], + "high_school_politics": [ + "High School Politics", + "\u9ad8\u4e2d\u653f\u6cbb", + "Social Science" + ], + "high_school_geography": [ + "High School Geography", + "\u9ad8\u4e2d\u5730\u7406", + "Social Science" + ], + "middle_school_politics": [ + "Middle School Politics", + "\u521d\u4e2d\u653f\u6cbb", + "Social Science" + ], + "middle_school_geography": [ + "Middle School Geography", + "\u521d\u4e2d\u5730\u7406", + "Social Science" + ], + "modern_chinese_history": [ + "Modern Chinese History", + "\u8fd1\u4ee3\u53f2\u7eb2\u8981", + "Humanities" + ], + "ideological_and_moral_cultivation": [ + "Ideological and Moral Cultivation", + "\u601d\u60f3\u9053\u5fb7\u4fee\u517b\u4e0e\u6cd5\u5f8b\u57fa\u7840", + "Humanities" + ], + "logic": [ + "Logic", + "\u903b\u8f91\u5b66", + "Humanities" + ], + "law": [ + "Law", + "\u6cd5\u5b66", + "Humanities" + ], + "chinese_language_and_literature": [ + "Chinese Language and Literature", + "\u4e2d\u56fd\u8bed\u8a00\u6587\u5b66", + "Humanities" + ], + "art_studies": [ + "Art Studies", + "\u827a\u672f\u5b66", + "Humanities" + ], + "professional_tour_guide": [ + "Professional Tour Guide", + "\u5bfc\u6e38\u8d44\u683c", + "Humanities" + ], + "legal_professional": [ + "Legal Professional", + "\u6cd5\u5f8b\u804c\u4e1a\u8d44\u683c", + "Humanities" + ], + "high_school_chinese": [ + "High School Chinese", + "\u9ad8\u4e2d\u8bed\u6587", + "Humanities" + ], + "high_school_history": [ + "High School History", + "\u9ad8\u4e2d\u5386\u53f2", + "Humanities" + ], + "middle_school_history": [ + "Middle School History", + "\u521d\u4e2d\u5386\u53f2", + "Humanities" + ], + "civil_servant": [ + "Civil Servant", + "\u516c\u52a1\u5458", + "Other" + ], + "sports_science": [ + "Sports Science", + "\u4f53\u80b2\u5b66", + "Other" + ], + "plant_protection": [ + "Plant Protection", + "\u690d\u7269\u4fdd\u62a4", + "Other" + ], + "basic_medicine": [ + "Basic Medicine", + "\u57fa\u7840\u533b\u5b66", + "Other" + ], + "clinical_medicine": [ + "Clinical Medicine", + "\u4e34\u5e8a\u533b\u5b66", + "Other" + ], + "urban_and_rural_planner": [ + 
"Urban and Rural Planner", + "\u6ce8\u518c\u57ce\u4e61\u89c4\u5212\u5e08", + "Other" + ], + "accountant": [ + "Accountant", + "\u6ce8\u518c\u4f1a\u8ba1\u5e08", + "Other" + ], + "fire_engineer": [ + "Fire Engineer", + "\u6ce8\u518c\u6d88\u9632\u5de5\u7a0b\u5e08", + "Other" + ], + "environmental_impact_assessment_engineer": [ + "Environmental Impact Assessment Engineer", + "\u73af\u5883\u5f71\u54cd\u8bc4\u4ef7\u5de5\u7a0b\u5e08", + "Other" + ], + "tax_accountant": [ + "Tax Accountant", + "\u7a0e\u52a1\u5e08", + "Other" + ], + "physician": [ + "Physician", + "\u533b\u5e08\u8d44\u683c", + "Other" + ] +} \ No newline at end of file diff --git a/examples/benchmark/clue/README.md b/examples/benchmark/clue/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cf6662d543a9162392b21d8976a5c2e53beeacba --- /dev/null +++ b/examples/benchmark/clue/README.md @@ -0,0 +1,1522 @@ +# CLUE Benchmark + +**目录** + * [CLUE 评测结果](#CLUE评测结果) + * [一键复现模型效果](#一键复现模型效果) + * [启动 CLUE 分类任务](#启动CLUE分类任务) + * [使用 Trainer 启动 CLUE 分类任务](#使用Trainer启动CLUE分类任务) + * [启动 CLUE 阅读理解任务](#启动CLUE阅读理解任务) + * [批量启动 Grid Search](#批量启动GridSearch) + * [环境依赖](#环境依赖) + * [一键启动方法](#一键启动方法) + * [Grid Search 脚本说明](#GridSearch脚本说明) + * [参加 CLUE 竞赛](#参加CLUE竞赛) + * [分类任务](#分类任务) + * [阅读理解任务](#阅读理解任务) + +[CLUE](https://www.cluebenchmarks.com/) 自成立以来发布了多项 NLP 评测基准,包括分类榜单,阅读理解榜单和自然语言推断榜单等,在学术界、工业界产生了深远影响。是目前应用最广泛的中文语言测评指标之一。详细可参考 [CLUE论文](https://arxiv.org/abs/2004.05986)。 + +本项目基于 PaddlePaddle 在 CLUE 数据集上对领先的开源预训练模型模型进行了充分评测,为开发者在预训练模型选择上提供参考,同时开发者基于本项目可以轻松一键复现模型效果,也可以参加 CLUE 竞赛取得好成绩。 + + + + +## CLUE 评测结果 + +使用多种**中文**预训练模型微调在 CLUE 的各验证集上有如下结果: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Arch + + Model + + AVG + + AFQMC + + TNEWS + + IFLYTEK + + CMNLI + + OCNLI + + CLUEWSC2020 + + CSL + + CMRC2018 + + CHID + + C3 +
24L1024H + ERNIE 1.0-Large-zh-cw + + 79.03 + + 75.97 + + 59.65 + + 62.91 + + 85.09 + + 81.73 + + 93.09 + + 84.53 + + 74.22/91.88 + + 88.57 + + 84.54 +
+ ERNIE 2.0-Large-zh + + 77.03 + + 76.41 + + 59.67 + + 62.29 + + 83.82 + + 79.69 + + 89.14 + + 84.10 + + 71.48/90.35 + + 85.52 + + 78.12 +
+ HFL/RoBERTa-wwm-ext-large + + 76.61 + + 76.00 + + 59.33 + + 62.02 + + 83.88 + + 78.81 + + 90.79 + + 83.67 + + 70.58/89.82 + + 85.72 + + 75.26 +
20L1024H + ERNIE 3.0-Xbase-zh + + 78.39 + + 76.16 + + 59.55 + + 61.87 + + 84.40 + + 81.73 + + 88.82 + + 83.60 + + 75.99/93.00 + + 86.78 + + 84.98 +
12L768H + + + ERNIE 3.0-Base-zh + + + + 76.05 + + 75.93 + + 58.26 + + 61.56 + + 83.02 + + 80.10 + + 86.18 + + 82.63 + + 70.71/90.41 + + 84.26 + + 77.88 +
+ ERNIE 1.0-Base-zh-cw + + 76.47 + + 76.07 + + 57.86 + + 59.91 + + 83.41 + + 79.58 + + 89.91 + + 83.42 + + 72.88/90.78 + + 84.68 + + 76.98 +
+ ERNIE-Gram-zh + + 75.72 + + 75.28 + + 57.88 + + 60.87 + + 82.90 + + 79.08 + + 88.82 + + 82.83 + + 71.82/90.38 + + 84.04 + + 73.69 +
+ ERNIE 2.0-Base-zh + + 74.32 + + 75.65 + + 58.25 + + 61.64 + + 82.62 + + 78.71 + + 81.91 + + 82.33 + + 66.08/87.46 + + 82.78 + + 73.19 +
+ Langboat/Mengzi-BERT-Base + + 74.69 + + 75.35 + + 57.76 + + 61.64 + + 82.41 + + 77.93 + + 88.16 + + 82.20 + + 67.04/88.35 + + 83.74 + + 70.70 +
+ ERNIE 1.0-Base-zh + + 74.17 + + 74.84 + + 58.91 + + 62.25 + + 81.68 + + 76.58 + + 85.20 + + 82.77 + + 67.32/87.83 + + 82.47 + + 69.68 +
+ HFL/RoBERTa-wwm-ext + + 74.11 + + 74.60 + + 58.08 + + 61.23 + + 81.11 + + 76.92 + + 88.49 + + 80.77 + + 68.39/88.50 + + 83.43 + + 68.03 +
+ BERT-Base-Chinese + + 72.57 + + 74.63 + + 57.13 + + 61.29 + + 80.97 + + 75.22 + + 81.91 + + 81.90 + + 65.30/86.53 + + 82.01 + + 65.38 +
+ UER/Chinese-RoBERTa-Base + + 71.78 + + 72.89 + + 57.62 + + 61.14 + + 80.01 + + 75.56 + + 81.58 + + 80.80 + + 63.87/84.95 + + 81.52 + + 62.76 +
8L512H + UER/Chinese-RoBERTa-Medium + + 67.06 + + 70.64 + + 56.10 + + 58.29 + + 77.35 + + 71.90 + + 68.09 + + 78.63 + + 57.63/78.91 + + 75.13 + + 56.84 +
6L768H + + + ERNIE 3.0-Medium-zh + + + + 72.49 + + 73.37 + + 57.00 + + 60.67 + + 80.64 + + 76.88 + + 79.28 + + 81.60 + + 65.83/87.30 + + 79.91 + + 69.73 +
+ HLF/RBT6, Chinese + + 70.06 + + 73.45 + + 56.82 + + 59.64 + + 79.36 + + 73.32 + + 76.64 + + 80.67 + + 62.72/84.77 + + 78.17 + + 59.85 +
+ TinyBERT6, Chinese + + 69.62 + + 72.22 + + 55.70 + + 54.48 + + 79.12 + + 74.07 + + 77.63 + + 80.17 + + 63.03/83.75 + + 77.64 + + 62.11 +
+ RoFormerV2 Small + + 68.52 + + 72.47 + + 56.53 + + 60.72 + + 76.37 + + 72.95 + + 75.00 + + 81.07 + + 62.97/83.64 + + 67.66 + + 59.41 +
+ UER/Chinese-RoBERTa-L6-H768 + + 67.09 + + 70.13 + + 56.54 + + 60.48 + + 77.49 + + 72.00 + + 72.04 + + 77.33 + + 53.74/75.52 + + 76.73 + + 54.40 +
6L384H + + + ERNIE 3.0-Mini-zh + + + + 66.90 + + 71.85 + + 55.24 + + 54.48 + + 77.19 + + 73.08 + + 71.05 + + 79.30 + + 58.53/81.97 + + 69.71 + + 58.60 +
4L768H + HFL/RBT4, Chinese + + 67.42 + + 72.41 + + 56.50 + + 58.95 + + 77.34 + + 70.78 + + 71.05 + + 78.23 + + 59.30/81.93 + + 73.18 + + 56.45 +
4L512H + UER/Chinese-RoBERTa-Small + + 63.25 + + 69.21 + + 55.41 + + 57.552 + + 73.64 + + 69.80 + + 66.78 + + 74.83 + + 46.75/69.69 + + 67.59 + + 50.92 +
4L384H + + + ERNIE 3.0-Micro-zh + + + + 64.21 + + 71.15 + + 55.05 + + 53.83 + + 74.81 + + 70.41 + + 69.08 + + 76.50 + + 53.77/77.82 + + 62.26 + + 55.53 +
4L312H + + + ERNIE 3.0-Nano-zh + + + + 62.97 + + 70.51 + + 54.57 + + 48.36 + + 74.97 + + 70.61 + + 68.75 + + 75.93 + + 52.00/76.35 + + 58.91 + + 55.11 +
+ TinyBERT4, Chinese + + 60.82 + + 69.07 + + 54.02 + + 39.71 + + 73.94 + + 69.59 + + 70.07 + + 75.07 + + 46.04/69.34 + + 58.53 + + 52.18 +
4L256H + UER/Chinese-RoBERTa-Mini + + 53.40 + + 69.32 + + 54.22 + + 41.63 + + 69.40 + + 67.36 + + 65.13 + + 70.07 + + 5.96/17.13 + + 51.19 + + 39.68 +
3L1024H + HFL/RBTL3, Chinese + + 66.63 + + 71.11 + + 56.14 + + 59.56 + + 76.41 + + 71.29 + + 69.74 + + 76.93 + + 58.50/80.90 + + 71.03 + + 55.56 +
3L768H + HFL/RBT3, Chinese + + 65.72 + + 70.95 + + 55.53 + + 59.18 + + 76.20 + + 70.71 + + 67.11 + + 76.63 + + 55.73/78.63 + + 70.26 + + 54.93 +
2L128H + UER/Chinese-RoBERTa-Tiny + + 44.45 + + 69.02 + + 51.47 + + 20.28 + + 59.95 + + 57.73 + + 63.82 + + 67.43 + + 3.08/14.33 + + 23.57 + + 28.12 +
+
+ +AFQMC(语义相似度)、TNEWS(文本分类)、IFLYTEK(长文本分类)、CMNLI(自然语言推理)、OCNLI(自然语言推理)、CLUEWSC2020(代词消歧)、CSL(论文关键词识别)、CHID(成语阅读理解填空) 和 C3(中文多选阅读理解) 任务使用的评估指标均是 Accuracy。CMRC2018(阅读理解) 的评估指标是 EM (Exact Match)/F1,计算每个模型效果的平均值时,取 EM 为最终指标。 + +其中前 7 项属于分类任务,后面 3 项属于阅读理解任务,这两种任务的训练过程在下面将会分开介绍。 + +**NOTE:具体评测方式如下** +1. 以上所有任务均基于 Grid Search 方式进行超参寻优。分类任务训练每间隔 100 steps 评估验证集效果,阅读理解任务每隔一个 epoch 评估验证集效果,取验证集最优效果作为表格中的汇报指标。 + +2. 分类任务 Grid Search 超参范围: batch_size: 16, 32, 64; learning rates: 1e-5, 2e-5, 3e-5, 5e-5;因为 CLUEWSC2020 数据集较小,所以模型在该数据集上的效果对 batch_size 较敏感,所以对 CLUEWSC2020 评测时额外增加了 batch_size = 8 的超参搜索; 因为CLUEWSC2020 和 IFLYTEK 数据集对 dropout 概率值较为敏感,所以对 CLUEWSC2020 和 IFLYTEK 数据集评测时额外增加了 dropout_prob = 0.0 的超参搜索。 + +3. 阅读理解任务 Grid Search 超参范围:batch_size: 24, 32; learning rates: 1e-5, 2e-5, 3e-5。阅读理解任务均使用多卡训练,其中 Grid Search 中的 batch_size 是指多张卡上的 batch_size 总和。 + +4. 以上每个下游任务的固定超参配置如下表所示: + +| TASK | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | CMRC2018 | CHID | C3 | +| ----------------- | ----- | ----- | ------- | ----- | ----- | ----------- | ---- | -------- | ---- | ------------- | +| epoch | 3 | 3 | 3 | 2 | 5 | 50 | 5 | 2 | 3 | 8 | +| max_seq_length | 128 | 128 | 128 | 128 | 128 | 128 | 256 | 512 | 64 | 512 | +| warmup_proportion | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.06 | 0.1 | +| num_cards | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 4 | 4 | + +不同预训练模型在下游任务上做 Grid Search 之后的最优超参(learning_rate、batch_size)如下: + +| Model | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | CMRC2018 | CHID | C3 | +| -------------------------------- | ------- | ------- | ------- | -------- | -------- | ----------- | ------- | -------- | ------- | ------------- | +| ERNIE 1.0-Large-zh-cw | 2e-5,64 | 3e-5,32 | 5e-5,16 | 2e-5,16 | 2e-5,32 | 1e-5,32 | 1e-5,16 | 2e-5,24 | 1e-5,24 | 2e-5,32 | +| ERNIE 3.0-Xbase-zh | 2e-5,16 | 3e-5,32 | 3e-5,32 | 3e-5,64 | 3e-5,64 | 2e-5,32 | 1e-5,16 | 3e-5,24 | 2e-5,24 | 3e-5,24 | +| ERNIE 2.0-Large-zh | 1e-5,32 | 3e-5,64 | 3e-5,32 | 2e-5,32 | 1e-5,16 | 3e-5,32 | 1e-5,64 | 2e-5,24 | 2e-5,24 | 3e-5,32 | +| HFL/RoBERTa-wwm-ext-large | 1e-5,32 | 3e-5,32 | 2e-5,32 | 1e-5,16 | 1e-5,16 | 2e-5,16 | 2e-5,16 | 3e-5,32 | 1e-5,24 | 2e-5,24 | +| ERNIE 3.0-Base-zh | 3e-5,16 | 3e-5,32 | 5e-5,32 | 3e-5,32 | 2e-5,64 | 2e-5,16 | 2e-5,32 | 2e-5,24 | 3e-5,24 | 3e-5,32 | +| ERNIE 1.0-Base-zh-cw | 2e-5,16 | 3e-5,32 | 5e-5,16 | 2e-5,16 | 3e-5,32 | 2e-5,16 | 2e-5,32 | 3e-5,24 | 2e-5,32 | 3e-5,24 | +| ERNIE-Gram-zh | 1e-5,16 | 5e-5,16 | 5e-5,16 | 2e-5,32 | 2e-5,64 | 3e-5,16 | 3e-5,64 | 3e-5,32 | 2e-5,24 | 2e-5,24 | +| ERNIE 2.0-Base-zh | 3e-5,64 | 3e-5,64 | 5e-5,16 | 5e-5,64 | 5e-5,32 | 5e-5,16 | 2e-5,16 | 2e-5,32 | 3e-5,24 | 3e-5,32 | +| Langboat/Mengzi-Bert-Base | 3e-5,32 | 5e-5,32 | 5e-5,16 | 2e-5,16 | 2e-5,16 | 3e-5,8 | 1e-5,16 | 3e-5,24 | 3e-5,24 | 2e-5,32 | +| ERNIE 1.0-Base-zh | 3e-5,16 | 3e-5,32 | 5e-5,16 | 5e-5,32 | 3e-5,16 | 2e-5,8 | 2e-5,16 | 3e-5,32 | 3e-5,24 | 3e-5,24 | +| HFL/RoBERTa-wwm-ext | 3e-5,32 | 3e-5,64 | 5e-5,16 | 3e-5,32 | 2e-5,32 | 3e-5,32 | 2e-5,32 | 3e-5,32 | 2e-5,32 | 3e-5,24 | +| BERT-Base-Chinese | 2e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,64 | 3e-5,16 | 3e-5,16 | 1e-5,16 | 3e-5,24 | 2e-5,32 | 3e-5,24 | +| UER/Chinese-RoBERTa-Base | 2e-5,16 | 5e-5,16 | 5e-5,16 | 2e-5,16 | 3e-5,16 | 3e-5,8 | 2e-5,16 | 3e-5,24 | 3e-5,32 | 3e-5,32 | +| UER/Chinese-RoBERTa-Medium | 3e-5,32 | 5e-5,64 | 5e-5,16 | 5e-5,32 | 3e-5,32 | 3e-5,16 | 5e-5,32 | 3e-5,24 | 3e-5,24 | 3e-5,32 | +| ERNIE 3.0-Medium-zh | 3e-5,32 | 3e-5,64 | 5e-5,32 | 2e-5,32 | 1e-5,64 | 3e-5,16 | 2e-5,32 | 3e-5,24 | 2e-5,24 | 
1e-5,24 | +| TinyBERT6, Chinese | 1e-5,16 | 3e-5,32 | 5e-5,16 | 5e-5,32 | 3e-5,64 | 3e-5,16 | 3e-5,16 | 3e-5,32 | 3e-5,24 | 2e-5,24 | +| RoFormerV2 Small | 5e-5,16 | 2e-5,16 | 5e-5,16 | 5e-5,32 | 2e-5,16 | 3e-5,8 | 3e-5,16 | 3e-5,24 | 3e-5,24 | 3e-5,24 | +| HLF/RBT6, Chinese | 3e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,64 | 3e-5,16 | 3e-5,8 | 5e-5,64 | 2e-5,24 | 3e-5,32 | 2e-5,32 | +| UER/Chinese-RoBERTa-L6-H768 | 2e-5,16 | 3e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,32 | 2e-5,32 | 3e-5,16 | 3e-5,32 | 3e-5,24 | 3e-5,24 | +| ERNIE 3.0-Mini-zh | 5e-5,64 | 5e-5,64 | 5e-5,16 | 5e-5,32 | 2e-5,16 | 2e-5,8 | 2e-5,16 | 3e-5,24 | 3e-5,24 | 3e-5,24 | +| HFL/RBT4, Chinese | 5e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 2e-5,16 | 2e-5,8 | 2e-5,16 | 3e-5,32 | 3e-5,24 | 3e-5,32 | +| UER/Chinese-RoBERTa-Small | 2e-5,32 | 5e-5,32 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 2e-5,64 | 5e-5,32 | 3e-5,24 | 3e-5,24 | 3e-5,24 | +| ERNIE 3.0-Micro-zh | 3e-5,16 | 5e-5,32 | 5e-5,16 | 5e-5,16 | 2e-5,32 | 5e-5,16 | 3e-5,64 | 3e-5,24 | 3e-5,32 | 3e-5,24 | +| ERNIE 3.0-Nano-zh | 2e-5,32 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 3e-5,16 | 1e-5,8 | 3e-5,32 | 3e-5,24 | 3e-5,24 | 2e-5,24 | +| TinyBERT4, Chinese | 3e-5,32 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 3e-5,16 | 1e-5,16 | 5e-5,16 | 3e-5,24 | 3e-5,24 | 2e-5,24 | +| UER/Chinese-RoBERTa-Mini | 3e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,32 | 3e-5,8 | 5e-5,32 | 3e-5,24 | 3e-5,32 | 3e-5,32 | +| HFL/RBTL3, Chinese | 5e-5,32 | 5e-5,16 | 5e-5,16 | 5e-5,32 | 2e-5,16 | 5e-5,8 | 2e-5,16 | 3e-5,24 | 2e-5,24 | 3e-5,24 | +| HFL/RBT3, Chinese | 5e-5,64 | 5e-5,32 | 5e-5,16 | 5e-5,16 | 2e-5,16 | 3e-5,16 | 5e-5,16 | 3e-5,32 | 3e-5,24 | 3e-5,32 | +| UER/Chinese-RoBERTa-Tiny | 5e-5,64 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,16 | 5e-5,8 | 5e-5,16 | 3e-5,24 | 3e-5,24 | 3e-5,24 | + +其中,`ERNIE 3.0-Base-zh`、`ERNIE 3.0-Medium-zh`、`ERNIE-Gram-zh`、`ERNIE 1.0-Base-zh`、`ERNIE 3.0-Mini-zh`、`ERNIE 3.0-Micro-zh`、`ERNIE 3.0-Nano-zh` 、`HFL/RBT3, Chinese`、`HFL/RBTL3, Chinese`、`HFL/RBT6, Chinese`、`TinyBERT4, Chinese`、`UER/Chinese-RoBERTa-Base`、`UER/Chinese-RoBERTa-Mini`、`UER/Chinese-RoBERTa-Small` 在 CLUEWSC2020 处的 dropout_prob 为 0.0,`ERNIE 3.0-Base-zh`、`HLF/RBT6, Chinese`、`Langboat/Mengzi-BERT-Base`、`ERNIE-Gram-zh`、`ERNIE 1.0-Base-zh` 、`TinyBERT6, Chinese`、`UER/Chinese-RoBERTa-L6-H768`、`ERNIE 3.0-Mini-zh`、`ERNIE 3.0-Micro-zh`、`ERNIE 3.0-Nano-zh`、`HFL/RBT3, Chinese`、`HFL/RBT4, Chinese`、`HFL/RBT6, Chinese`、`TinyBERT4, Chinese`、`UER/Chinese-RoBERTa-Medium`、`UER/Chinese-RoBERTa-Base`、`UER/Chinese-RoBERTa-Mini`、`UER/Chinese-RoBERTa-Tiny`、`UER/Chinese-RoBERTa-Small` 在 IFLYTEK 处的 dropout_prob 为 0.0。 + + + +## 一键复现模型效果 + +这一节将会对分类、阅读理解任务分别展示如何一键复现本文的评测结果。 + + + +### 启动 CLUE 分类任务 +以 CLUE 的 TNEWS 任务为例,启动 CLUE 任务进行 Fine-tuning 的方式如下: + +```shell +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=TNEWS +export LR=3e-5 +export BS=32 +export EPOCH=6 +export MAX_SEQ_LEN=128 +export MODEL_PATH=ernie-3.0-medium-zh + +cd classification +mkdir ernie-3.0-medium-zh +python -u ./run_clue_classifier.py \ + --model_name_or_path ${MODEL_PATH} \ + --task_name ${TASK_NAME} \ + --max_seq_length ${MAX_SEQ_LEN} \ + --batch_size ${BS} \ + --learning_rate ${LR} \ + --num_train_epochs ${EPOCH} \ + --logging_steps 100 \ + --seed 42 \ + --save_steps 100 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --adam_epsilon 1e-8 \ + --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \ + --device gpu \ + --dropout 0.1 \ + --gradient_accumulation_steps 1 \ + --save_best_model True \ + --do_train \ + +``` + +另外,如需评估,传入参数 `--do_eval` 即可,如果只对读入的 checkpoint 进行评估不训练,则不需传入 `--do_train`。 + 
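+例如,只想对某个已训练好的 checkpoint 做评估(不再训练)时,可以参考如下启动方式。该示例仅为参考写法,其中 checkpoint 路径为示意,请替换为实际保存目录:
+
+```shell
+# 仅评估,不训练:传入 --do_eval,省略 --do_train
+python -u ./run_clue_classifier.py \
+    --task_name ${TASK_NAME} \
+    --model_name_or_path ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \
+    --max_seq_length ${MAX_SEQ_LEN} \
+    --batch_size ${BS} \
+    --device gpu \
+    --do_eval
+```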
+其中参数释义如下:
+
+- `model_name_or_path` 指示了 Fine-tuning 使用的具体预训练模型,可以是 PaddleNLP 提供的预训练模型,可以选择[Transformer预训练模型汇总](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 中相对应的中文预训练权重。注意 CLUE 任务应选择中文预训练权重。
+- `task_name` 表示 Fine-tuning 的分类任务,当前支持 AFQMC、TNEWS、IFLYTEK、OCNLI、CMNLI、CSL、CLUEWSC2020。
+- `max_seq_length` 表示最大句子长度,超过该长度将被截断。
+- `batch_size` 表示每次迭代**每张卡**上的样本数目。
+- `learning_rate` 表示基础学习率大小,将与 learning rate scheduler 产生的值相乘作为当前学习率。
+- `num_train_epochs` 表示训练轮数。
+- `logging_steps` 表示日志打印间隔。
+- `save_steps` 表示模型保存及评估间隔。
+- `save_best_model` 是否保存在评估集上效果最好的模型,默认为 True。
+- `output_dir` 表示模型保存路径。
+- `device` 表示训练使用的设备, 'gpu' 表示使用GPU, 'xpu' 表示使用百度昆仑卡, 'cpu' 表示使用 CPU。
+
+Fine-tuning 过程将按照 `logging_steps` 和 `save_steps` 的设置打印出如下日志:
+
+```
+global step 100/20010, epoch: 0, batch: 99, rank_id: 0, loss: 2.734340, lr: 0.0000014993, speed: 8.7969 step/s
+eval loss: 2.720359, acc: 0.0827, eval done total : 25.712125062942505 s
+global step 200/20010, epoch: 0, batch: 199, rank_id: 0, loss: 2.608563, lr: 0.0000029985, speed: 2.5921 step/s
+eval loss: 2.652753, acc: 0.0945, eval done total : 25.64827537536621 s
+global step 300/20010, epoch: 0, batch: 299, rank_id: 0, loss: 2.555283, lr: 0.0000044978, speed: 2.6032 step/s
+eval loss: 2.572999, acc: 0.112, eval done total : 25.67190170288086 s
+global step 400/20010, epoch: 0, batch: 399, rank_id: 0, loss: 2.631579, lr: 0.0000059970, speed: 2.6238 step/s
+eval loss: 2.476962, acc: 0.1697, eval done total : 25.794789791107178 s
+```
+
+
+
+#### 使用 Trainer 启动 CLUE 分类任务
+PaddleNLP 提供了 Trainer API,本示例新增了`run_clue_classifier_trainer.py`脚本供用户使用。
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+export TASK_NAME=TNEWS
+export LR=3e-5
+export BS=32
+export EPOCH=6
+export MAX_SEQ_LEN=128
+export MODEL_PATH=ernie-3.0-medium-zh
+
+cd classification
+mkdir ernie-3.0-medium-zh
+
+python -u ./run_clue_classifier_trainer.py \
+    --model_name_or_path ${MODEL_PATH} \
+    --dataset "clue ${TASK_NAME}" \
+    --max_seq_length ${MAX_SEQ_LEN} \
+    --per_device_train_batch_size ${BS} \
+    --per_device_eval_batch_size ${BS} \
+    --learning_rate ${LR} \
+    --num_train_epochs ${EPOCH} \
+    --logging_steps 100 \
+    --seed 42 \
+    --save_steps 100 \
+    --warmup_ratio 0.1 \
+    --weight_decay 0.01 \
+    --adam_epsilon 1e-8 \
+    --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \
+    --device gpu \
+    --do_train \
+    --do_eval \
+    --metric_for_best_model "eval_accuracy" \
+    --load_best_model_at_end \
+    --save_total_limit 3 \
+```
+大部分参数含义如上文所述,这里简要介绍一些新参数:
+- `dataset` 同上文 `task_name`,此处为小写字母。表示 Fine-tuning 的分类任务,当前支持 afqmc、tnews、iflytek、ocnli、cmnli、csl、cluewsc2020。
+- `per_device_train_batch_size` 同上文 `batch_size`。训练时,每次迭代**每张卡**上的样本数目。
+- `per_device_eval_batch_size` 同上文 `batch_size`。评估时,每次迭代**每张卡**上的样本数目。
+- `warmup_ratio` 同上文 `warmup_proportion`,warmup 步数占总步数的比例。
+- `metric_for_best_model` 评估时用于挑选最优模型的评估指标。
+- `load_best_model_at_end` 训练结束时,是否加载评估结果最好的 ckpt。
+- `save_total_limit` 保存的 ckpt 数量的最大限制。
+
+
+
+### 启动 CLUE 阅读理解任务
+以 CLUE 的 C3 任务为例,多卡启动 CLUE 任务进行 Fine-tuning 的方式如下:
+
+```shell
+
+cd mrc
+
+MODEL_PATH=ernie-3.0-medium-zh
+BATCH_SIZE=6
+LR=2e-5
+
+python -m paddle.distributed.launch --gpus "0,1,2,3" run_c3.py \
+    --model_name_or_path ${MODEL_PATH} \
+    --batch_size ${BATCH_SIZE} \
+    --learning_rate ${LR} \
+    --max_seq_length 512 \
+    --num_train_epochs 8 \
+    --do_train \
+    --warmup_proportion 0.1 \
+    --gradient_accumulation_steps 3 \
+
+```
+需要注意的是,如果显存无法容纳所传入的 `batch_size`,可以通过传入 `gradient_accumulation_steps` 参数来模拟该 `batch_size`。
+
+
+
+### 批量启动 Grid Search
+
+
+
+#### 环境依赖
+
+Grid
Search 需要在 GPU 环境下进行,需要注意的是 C3 任务需要显存大于 16 GB,最好是在显存 32 GB的环境下启动。 + +Grid Search 中的 GPU 调度需要依赖 pynvml 库,pynvml 库提供了 GPU 管理的 Python 接口。可启动以下命令进行安装 pynvml: + +```shell +pip install pynvml +``` + + + +#### 一键启动方法 + +运行下面一句命令即可启动 Grid Search 任务。前期需要注意数据集是否正常下载,否则训练任务不会正式启动。 +脚本默认不保存模型,如需保存每个超参数下最好的模型,需要修改 Python 脚本中的 `--save_best_models` 参数为 True。 + +```shell +cd grid_search_tools + +# 这里 ernie-3.0-base-zh 是模型名,也可以传用户自定义的模型目录 +# 自定义的模型目录需要有 model_config.json, model_state.pdparams, tokenizer_config.json 和 vocab.txt 四个文件 +python grid_seach.py ernie-3.0-base-zh + +``` + +确认模型所有任务训练完成后,可以调用脚本 `extract_result.sh` 一键抽取 Grid Search 结果,打印出每个任务的最佳结果和对应的超参数,例如: + +```shell +bash extract_result.sh ernie-3.0-base-zh +``` +```text +AFQMC TNEWS IFLYTEK CMNLI OCNLI CLUEWSC2020 CSL CMRC2018 CHID C3 +75.93 58.26 61.56 83.02 80.10 86.18 82.63 70.71/90.41 84.26 77.88 +==================================================================== +Best hyper-parameters list: +==================================================================== +TASK result (lr, batch_size, dropout_p) +AFQMC 75.93 (3e-05,16,0.1) +TNEWS 58.26 (3e-05,32,0.1) +IFLYTEK 61.56 (5e-05,32,0.0) +CMNLI 83.02 (3e-05,32,0.1) +OCNLI 80.10 (2e-05,64,0.1) +CLUEWSC2020 86.18 (2e-05,16,0.0) +CSL 82.63 (2e-05,32,0.1) +CMRC2018 70.71/90.41 (2e-05,24,0.1) +CHID 84.26 (3e-05,24,0.1) +C3 77.88 (3e-05,32,0.1) +``` + +另外,如遇意外情况(如机器重启)导致训练中断,可以直接再次启动 `grid_search.py` 脚本,之前已完成(输出完整日志)的任务则会直接跳过。 + + + +#### Grid Search 脚本说明 + +本节介绍 grid_search_tools 目录下各个脚本的功能: + +- `grid_search.py` Grid Search 任务入口脚本,该脚本负责调度 GPU 资源,可自动将 7 个分类任务、3 个阅读理解下所有超参数对应的任务完成,训练完成后会自动调用抽取结果的脚本 `extract_result.sh` 打印出所有任务的最佳结果和对应的超参。 +- `warmup_dataset_and_model.py` 首次运行时,该脚本完成模型下载(如需)、数据集下载,阅读理解任务数据预处理、预处理文件缓存等工作,再次运行则会检查这些文件是否存在,存在则跳过。该脚本由 `grid_search.py` 在 Grid Search 训练前自动调用,预处理 cache 文件生成后,后面所有训练任务即可加载缓存文件避免重复进行数据预处理。如果该阶段任务失败,大多需要检查网络,解决之后需重启 `grid_search.py`,直到训练正常开始。该脚本也可手动调用,需要 1 个参数,模型名称或目录。该脚本在使用 Intel(R) Xeon(R) Gold 6271C CPU 且 `--num_proc`默认为 4 的情况下需约 30 分钟左右完成,可以更改 `run_mrc.sh` 中的 `--num_proc` 参数以改变生成 cache 的进程数。需要注意的是,若改变 num_proc,之前的缓存则不能再使用,该脚本会重新处理数据并生成新的 cache,cache 相关内容可查看[datasets.Dataset.map文档](https://huggingface.co/docs/datasets/v2.0.0/package_reference/main_classes?highlight=map#datasets.Dataset.map)。 +- `extract_result.sh` 从日志抽取每个任务的最佳结果和对应的最佳超参并打印,`grid_search.py` 在完成训练任务后会自动调用,也可手动调用,需要 1 个参数:模型名称或目录。手动调用前需要确认训练均全部完成,并且保证该目录下有分类和阅读理解所有任务的日志。 +- `run_mrc.sh` 阅读理解任务的启动脚本。 +- `run_cls.sh` 分类任务的启动脚本。 + + + + +## 参加 CLUE 竞赛 + +对各个任务运行预测脚本,汇总多个结果文件压缩之后,即可提交至 CLUE 官网进行评测。 + +下面 2 小节会分别介绍分类、阅读理解任务产生预测结果的方法。 + + + +### 分类任务 + +以 TNEWS 为例,可以直接使用脚本 `classification/run_clue_classifier.py` 对单个任务进行预测,注意脚本启动时需要传入参数 `--do_predict`。假设 TNEWS 模型所在路径为 `${TNEWS_MODEL}`,运行如下脚本可得到模型在测试集上的预测结果,预测结果会写入地址 `${OUTPUT_DIR}/tnews_predict.json`。 + +``` +cd classification +OUTPUT_DIR=results +mkdir ${OUTPUT_DIR} + +python run_clue_classifier.py \ + --task_name TNEWS \ + --model_name_or_path ${TNEWS_MODEL} \ + --output_dir ${OUTPUT_DIR} \ + --do_predict \ +``` + + +### 阅读理解任务 + +以 C3 为例,直接使用 `mrc/run_c3.py`对该任务进行预测,注意脚本启动时需要传入参数 `--do_predict`。假设 C3 模型所在路径为 `${C3_MODEL}`,运行如下脚本可得到模型在测试集上的预测结果,预测结果会写入地址 `${OUTPUT_DIR}/c311_predict.json`。 + +```shell +cd mrc +OUTPUT_DIR=results +mkdir ${OUTPUT_DIR} + +python run_c3.py \ + --model_name_or_path ${C3_MODEL} \ + --output_dir ${OUTPUT_DIR} \ + --do_predict \ +``` diff --git a/examples/benchmark/clue/classification/run_clue_classifier.py b/examples/benchmark/clue/classification/run_clue_classifier.py new file mode 100644 index 
0000000000000000000000000000000000000000..81e1ee690fd4281823e0698fe1a202770537c36f --- /dev/null +++ b/examples/benchmark/clue/classification/run_clue_classifier.py @@ -0,0 +1,481 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "afqmc": Accuracy, + "tnews": Accuracy, + "iflytek": Accuracy, + "ocnli": Accuracy, + "cmnli": Accuracy, + "cluewsc2020": Accuracy, + "csl": Accuracy, +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name.", + ) + parser.add_argument( + "--output_dir", + default="best_clue_model", + type=str, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." 
+ ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=1, + help="Number of updates steps to accumualte before performing a backward/update pass.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether do train.") + parser.add_argument("--do_eval", action="store_true", help="Whether do train.") + parser.add_argument("--do_predict", action="store_true", help="Whether do predict.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--save_best_model", + default=True, + type=strtobool, + help="Whether to save best model.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + parser.add_argument("--dropout", default=0.1, type=float, help="dropout.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + labels = batch.pop("labels") + logits = model(**batch) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + logger.info("eval loss: %f, acc: %s, " % (loss.numpy(), res)) + model.train() + return res + + +def convert_example(example, tokenizer, label_list, is_test=False, max_seq_length=512): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + # Get the label + label = np.array(example["label"], dtype="int64") + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in 
example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + if not is_test: + example["labels"] = label + return example + + +def do_eval(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, label_list=dev_ds.label_list, tokenizer=tokenizer, max_seq_length=args.max_seq_length + ) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + batchify_fn = DataCollatorWithPadding(tokenizer) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if dev_ds.label_list is None else len(dev_ds.label_list) + + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + metric = metric_class() + model.eval() + metric.reset() + for batch in dev_data_loader: + labels = batch.pop("labels") + logits = model(**batch) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + logger.info("acc: %s\n, " % (res)) + + +def do_train(args): + assert ( + args.batch_size % args.gradient_accumulation_steps == 0 + ), "Please make sure argmument `batch_size` must be divisible by `gradient_accumulation_steps`." 
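+    # Note: args.batch_size is treated as the effective (global) batch size here;
+    # it is divided by gradient_accumulation_steps below to get the per-step batch size.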
+ paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + + args.batch_size = int(args.batch_size / args.gradient_accumulation_steps) + train_ds, dev_ds = load_dataset("clue", args.task_name, splits=("train", "dev")) + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, label_list=train_ds.label_list, tokenizer=tokenizer, max_seq_length=args.max_seq_length + ) + + train_ds = train_ds.map(trans_func, lazy=True) + + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + batchify_fn = DataCollatorWithPadding(tokenizer) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + + if args.dropout != 0.1: + update_model_dropout(model, args.dropout) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps / args.gradient_accumulation_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs / args.gradient_accumulation_steps + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + best_acc = 0.0 + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + labels = batch.pop("labels") + logits = model(**batch) + loss = loss_fct(logits, labels) + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + loss.backward() + if (step + 1) % args.gradient_accumulation_steps == 0: + global_step += 1 + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + acc = evaluate(model, loss_fct, metric, dev_data_loader) + logger.info("eval done total : %s s" % (time.time() - tic_eval)) + if acc > best_acc: + best_acc = acc + if args.save_best_model: + output_dir = args.output_dir + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + logger.info("best_result: %.2f" % (best_acc * 100)) + return + logger.info("best_result: %.2f" % (best_acc * 100)) + + +def do_predict(args): + paddle.set_device(args.device) + args.task_name = args.task_name.lower() + + train_ds, test_ds = load_dataset("clue", args.task_name, splits=("train", "test")) + if args.task_name == "cluewsc2020" or args.task_name == "tnews": + test_ds_10 = load_dataset("clue", args.task_name, splits="test1.0") + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=train_ds.label_list, + max_seq_length=args.max_seq_length, + is_test=True, + ) + + batchify_fn = DataCollatorWithPadding(tokenizer) + + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "cluewsc2020" or args.task_name == "tnews": + test_ds_10 = test_ds_10.map(trans_func, lazy=True) + test_batch_sampler_10 = paddle.io.BatchSampler(test_ds_10, batch_size=args.batch_size, shuffle=False) + test_data_loader_10 = DataLoader( + dataset=test_ds_10, + batch_sampler=test_batch_sampler_10, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + + model 
= AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + + prediction_filename = args.task_name + + if args.task_name == "ocnli": + prediction_filename = "ocnli_50k" + elif args.task_name == "cluewsc2020": + prediction_filename = "cluewsc" + "11" + elif args.task_name == "tnews": + prediction_filename = args.task_name + "11" + + # For version 1.1 + f = open(os.path.join(args.output_dir, prediction_filename + "_predict.json"), "w") + preds = [] + for step, batch in enumerate(test_data_loader): + with paddle.no_grad(): + logits = model(**batch) + pred = paddle.argmax(logits, axis=1).numpy().tolist() + preds += pred + for idx, pred in enumerate(preds): + j = json.dumps({"id": idx, "label": train_ds.label_list[pred]}) + f.write(j + "\n") + + # For version 1.0 + if args.task_name == "cluewsc2020" or args.task_name == "tnews": + prediction_filename = args.task_name + "10" + if args.task_name == "cluewsc2020": + prediction_filename = "cluewsc10" + f = open(os.path.join(args.output_dir, prediction_filename + "_predict.json"), "w") + + preds = [] + for step, batch in enumerate(test_data_loader_10): + with paddle.no_grad(): + logits = model(**batch) + pred = paddle.argmax(logits, axis=1).numpy().tolist() + preds += pred + for idx, pred in enumerate(preds): + j = json.dumps({"id": idx, "label": train_ds.label_list[pred]}) + f.write(j + "\n") + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def update_model_dropout(model, p=0.0): + model.base_model.embeddings.dropout.p = p + for i in range(len(model.base_model.encoder.layers)): + model.base_model.encoder.layers[i].dropout.p = p + model.base_model.encoder.layers[i].dropout1.p = p + model.base_model.encoder.layers[i].dropout2.p = p + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + if args.do_train: + do_train(args) + if args.do_eval: + do_eval(args) + if args.do_predict: + do_predict(args) diff --git a/examples/benchmark/clue/classification/run_clue_classifier_trainer.py b/examples/benchmark/clue/classification/run_clue_classifier_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..0bfabedc44e9478968557a2fbe4b8678a443fcd6 --- /dev/null +++ b/examples/benchmark/clue/classification/run_clue_classifier_trainer.py @@ -0,0 +1,282 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
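+# Fine-tune and evaluate CLUE classification tasks with the PaddleNLP Trainer API.
+# This is the Trainer-based counterpart of run_clue_classifier.py described in the README.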
+ +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import paddle +import paddle.nn as nn +from paddle.metric import Accuracy + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + export_model, +) +from paddlenlp.utils.log import logger + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + dataset: str = field(default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + do_lower_case: bool = field( + default=False, + metadata={ + "help": "Whether to lower case the input text. Should be True for uncased models and False for cased models." + }, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: str = field( + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + } + ) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + cache_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the dataset cache."}, + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + + +# Data pre-process function for clue benchmark datatset +def convert_clue(example, label_list, tokenizer=None, max_seq_length=512, **kwargs): + """convert a glue example into necessary features""" + is_test = False + if "label" not in example.keys(): + is_test = True + + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # print("label_list", label_list) + # Get the label + # example['label'] = np.array(example["label"], dtype="int64") + example["label"] = int(example["label"]) if label_dtype != "float32" else float(example["label"]) + label = example["label"] + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx 
> query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + + if tokenizer is None: + return example + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"]} + + +def clue_trans_fn(example, tokenizer, args): + return convert_clue(example, tokenizer=tokenizer, label_list=args.label_list, max_seq_length=args.max_seq_length) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + data_args.dataset = data_args.dataset.strip() + + dataset_config = data_args.dataset.split(" ") + print(dataset_config) + raw_datasets = load_dataset( + dataset_config[0], name=None if len(dataset_config) <= 1 else dataset_config[1], splits=("train", "dev") + ) + + data_args.label_list = getattr(raw_datasets["train"], "label_list", None) + num_classes = 1 if raw_datasets["train"].label_list is None else len(raw_datasets["train"].label_list) + + # Define tokenizer, model, loss function. 
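+    # num_classes is taken from the dataset's label list loaded above; CrossEntropyLoss
+    # is used when a label list exists, otherwise MSELoss is used for regression-style tasks.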
+ tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + criterion = nn.loss.CrossEntropyLoss() if data_args.label_list else nn.loss.MSELoss() + + # Define dataset pre-process function + trans_fn = partial(clue_trans_fn, tokenizer=tokenizer, args=data_args) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + + # Dataset pre-process + if training_args.do_train: + train_dataset = raw_datasets["train"].map(trans_fn) + if training_args.do_eval: + eval_dataset = raw_datasets["dev"].map(trans_fn) + if training_args.do_predict: + test_dataset = raw_datasets["test"].map(trans_fn) + + # Define the metrics of tasks. + def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric = Accuracy() + metric.reset() + result = metric.compute(preds, label) + metric.update(result) + accu = metric.accumulate() + metric.reset() + return {"accuracy": accu} + + trainer = Trainer( + model=model, + criterion=criterion, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + if test_ret.label_ids is None: + paddle.save( + test_ret.predictions, + os.path.join(training_args.output_dir, "test_results.pdtensor"), + ) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/examples/benchmark/clue/grid_search_tools/draw_pic.py b/examples/benchmark/clue/grid_search_tools/draw_pic.py new file mode 100644 index 0000000000000000000000000000000000000000..31710f26992db0525fd37e6bcc2aceb4304c8d39 --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/draw_pic.py @@ -0,0 +1,146 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import matplotlib.pyplot as plt + +mode = sys.argv[1] +batch_size = sys.argv[2] + +ylabel_name = "CLUE Avg Score" +title_name = "PaddleNLP Chinese Models" + +if mode == "gpu": + picture_name = "./gpu_bs" + batch_size + ".png" + xlabel_name = "Latency (ms) under FP16 on Tesla T4" +elif mode == "cpu1": + picture_name = "./cpu_thread1_bs" + batch_size + ".png" + xlabel_name = "Latency (ms) under FP32 on Intel(R) Xeon(R) Gold 6271C, num_threads=1" +elif mode == "cpu8": + picture_name = "./cpu_thread8_bs" + batch_size + ".png" + xlabel_name = "Latency (ms) under FP32 on Intel(R) Xeon(R) Gold 6271C, num_threads=8" +else: + raise ValueError("Only supports gpu, cpu1, cpu8.") +xlabel_name += ", batch_size=" + batch_size + +# Each element has model_name, model_param_num, latency(ms), clue avg score, +# color, the size of circle. +# Models of the same series are best represented by colors of the same color +# system. https://zhuanlan.zhihu.com/p/65220518 is for reference. +data = [ + [ + ["ERNIE 3.0-Base", "117.95M", 2.69, 226.43, 33.08, 3.43, 205.57, 34.10, 76.05, "#F08080", 11.8], # #F08080 + ["ERNIE 3.0-Medium", "75.43M", 1.42, 113.35, 17.32, 2.11, 104.06, 17.50, 72.49, "#A52A2A", 7.5], + ["ERNIE 3.0-Mini", "26.95M", 0.75, 38.24, 5.54, 1.59, 30.28, 8.18, 66.90, "#CD5C5C", 2.7], + ["ERNIE 3.0-Micro", "23.40M", 0.62, 26.44, 3.76, 1.33, 20.06, 5.46, 64.21, "#FF6347", 2.3], + ["ERNIE 3.0-Nano", "17.91M", 0.57, 20.93, 3.22, 1.25, 15.24, 4.89, 62.97, "#FF0000", 1.8], + ], + [ + ["RoBERTa-Base", "102.27M", 2.69, 226.16, 32.18, 3.44, 204.27, 34.10, 71.78, "royalblue", 10.2], # #4169E1 + ["RoBERTa-6L768H", "59.74M", 1.43, 112.55, 16.21, 2.14, 102.95, 18.55, 67.09, "#6495ED", 6.0], + ["RoBERTa-Medium", "36.56M", 1.02, 71.23, 10.84, 1.91, 65.74, 13.26, 67.06, "#87CEFA", 3.7], + ["RoBERTa-Small", "23.95M", 0.63, 36.33, 5.61, 1.41, 33.26, 7.01, 63.25, "#B0E0E6", 2.4], + # ['RoBERTa-Mini','8.77M', 0.59, 10.61, 2.02, 1.41, 10.03, 3.60, 53.40, '#40E0D0', 0.9], + # ['RoBERTa-Tiny','3.18M', 0.37, 2.08, 0.72, 1.03, 2.25, 1.30, 44.45, '#4682B4', 0.3], + ], + [ + ["TinyBERT6", "59.74M", 1.44, 113.90, 16.37, 2.14, 104.06, 17.44, 69.62, "gold", 6.5], # '#008000' + ["TinyBERT4", "11.46M", 0.54, 16.53, 2.93, 1.22, 14.02, 4.64, 60.82, "#8FBC8F", 1.3], + ], + [ + # [ + # 'RBTL3', '61.00M', 1.34, 113.27, 16.02, 1.69, 101.59, + # 15.47, 66.79, '#FFA500', 6.1 + # ], + ["RBT6", "59.74M", 1.43, 114.24, 16.35, 2.14, 103.53, 17.27, 70.06, "mediumseagreen", 6.0], # #FFDEAD + ["RBT4", "46.56M", 1.03, 76.19, 11.08, 1.60, 69.90, 12.60, 67.42, "#FFD700", 4.7], + ["RBT3", "38.43M", 0.86, 58.65, 8.28, 1.40, 52.12, 10.63, 65.72, "#FFE4B5", 3.8], + ], +] + +fig, ax = plt.subplots() +ax.spines["right"].set_visible(False) +ax.spines["top"].set_visible(False) + +ln_list = [] +size = 7 +for i in range(len(data)): + model_name_list = [model_info[0] for model_info in data[i]] + clue_res_list = [model_info[-3] for model_info in data[i]] + color = data[i][0][-2] + num_param_list = [model_info[1] for model_info in data[i]] + + if mode == "gpu": + if batch_size == "32": + latency_list = [model_info[2] for model_info in 
data[i]] + else: + latency_list = [model_info[5] for model_info in data[i]] + + (ln,) = plt.plot(latency_list, clue_res_list, color=color, linewidth=2.0, linestyle="-", marker="o", ms=5) + ln_list.append(ln) + for j, model in enumerate(data[i]): + xytext = (latency_list[j] + 0.05, clue_res_list[j] - 0.1) + model_name = model_name_list[j] + clue_res = clue_res_list[j] + num_param = num_param_list[j] + latency = latency_list[j] + if model_name in ("RoBERTa-Medium", "TinyBERT6", "ERNIE 3.0-Nano"): + xytext = (latency + 0.05, clue_res - 0.6) + if model_name in ("RBT4"): + xytext = (latency + 0.05, clue_res + 0.1) + plt.annotate(model_name, xy=(latency, clue_res), xytext=xytext, size=size, alpha=1.0) + plt.annotate(num_param, xy=(latency, clue_res), xytext=(xytext[0], xytext[1] - 0.3), size=5, alpha=1.0) + + elif mode == "cpu1": + if batch_size == "32": + latency_list = [model_info[3] for model_info in data[i]] + else: + latency_list = [model_info[6] for model_info in data[i]] + (ln,) = plt.plot(latency_list, clue_res_list, color=color, linewidth=2.0, linestyle="-", marker="o", ms=5) + ln_list.append(ln) + for j, model in enumerate(data[i]): + xytext = (latency_list[j] + 5.0, clue_res_list[j] - 0.1) + model_name = model_name_list[j] + clue_res = clue_res_list[j] + num_param = num_param_list[j] + latency = latency_list[j] + if model_name in ("RoBERTa-Medium", "TinyBERT6", "ERNIE 3.0-Nano"): + xytext = (latency + 5.0, clue_res - 0.6) + plt.annotate(model_name, xy=(latency, clue_res), xytext=xytext, size=size, alpha=1.0) + plt.annotate(num_param, xy=(latency, clue_res), xytext=(xytext[0], xytext[1] - 0.3), size=5, alpha=1.0) + else: + if batch_size == "32": + latency_list = [model_info[4] for model_info in data[i]] + else: + latency_list = [model_info[7] for model_info in data[i]] + (ln,) = plt.plot(latency_list, clue_res_list, color=color, linewidth=2.0, linestyle="-", marker="o", ms=5) + ln_list.append(ln) + for j, model in enumerate(data[i]): + xytext = (latency_list[j] + 0.8, clue_res_list[j] - 0.1) + model_name = model_name_list[j] + clue_res = clue_res_list[j] + num_param = num_param_list[j] + latency = latency_list[j] + if model_name in ("RoBERTa-Medium", "TinyBERT6", "ERNIE 3.0-Nano"): + xytext = (latency + 0.8, clue_res - 0.6) + plt.annotate(model_name, xy=(latency, clue_res), xytext=xytext, size=size, alpha=1.0) + plt.annotate(num_param, xy=(latency, clue_res), xytext=(xytext[0], xytext[1] - 0.3), size=5, alpha=1.0) + plt.legend(handles=ln_list, labels=["Baidu/ERNIE 3.0", "UER/RoBERTa", "Huawei/TinyBERT", "HFL/RBT"], loc="best") + +plt.title(title_name) +plt.xlabel(xlabel_name) +plt.ylabel(ylabel_name) + +plt.savefig(picture_name, dpi=500) diff --git a/examples/benchmark/clue/grid_search_tools/extract_result.sh b/examples/benchmark/clue/grid_search_tools/extract_result.sh new file mode 100644 index 0000000000000000000000000000000000000000..138c704a06e69e3a28d7a764700a3617b260ebfc --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/extract_result.sh @@ -0,0 +1,55 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +unset GREP_OPTIONS + +MODEL_PATH=$1 +declare -A dict + +for task in afqmc tnews iflytek cmnli ocnli cluewsc2020 csl cmrc2018 chid c3 +do + # `awk '{print substr($0,1,11)}'` is to prevent '[' brought by logger. + if [ $task == 'cmrc2018' ]; then + dict[${task}]=`cat ${MODEL_PATH}/${task}/*|grep best_result|awk '{print $7}' |awk '{print substr($0,1,11)}'|awk 'BEGIN {max = 0} {if ($1+0 > max+0) max=$1} END {print max}'` + else + dict[${task}]=`tail -n 1 ${MODEL_PATH}/${task}/*|grep best_result|awk '{print $7}'|awk '{print substr($0,1,5)}'|awk 'BEGIN {max = 0} {if ($1+0 > max+0) max=$1} END {print max}'` + fi +done + +echo -e AFQMC"\t"TNEWS"\t"IFLYTEK"\t"CMNLI"\t"OCNLI"\t"CLUEWSC2020"\t"CSL"\t"CMRC2018"\t"CHID"\t"C3 + +for task in afqmc tnews iflytek cmnli ocnli cluewsc2020 csl cmrc2018 chid c3 +do + echo -e -n "${dict[$task]}\t" +done + +echo -e "\n====================================================================\nBest hyper-parameters list: \n====================================================================" +echo -e TASK"\t"result"\t(lr, batch_size, dropout_p)" + +for task in afqmc tnews iflytek cmnli ocnli cluewsc2020 csl cmrc2018 chid c3 +do + if [ -z ${dict[$task]} ] + then + continue + fi + s=`find ${MODEL_PATH}/${task}/* | xargs grep -rin "best_result: ${dict[$task]}"` + if [ $task == 'cmrc2018' ]; then + s=${s%/*} + fi + s=${s##*/} + s=`echo $s|awk '{split($1, hy, "."); print hy[1]"."hy[2]}'` + s=`echo $s|awk '{split($1, hy, "_"); print hy[1]"," hy[2]"," hy[3]}'` + echo -n ${task}| tr 'a-z' 'A-Z' + echo -e "\t"${dict[$task]}"\t("$s")" +done diff --git a/examples/benchmark/clue/grid_search_tools/grid_search.py b/examples/benchmark/clue/grid_search_tools/grid_search.py new file mode 100644 index 0000000000000000000000000000000000000000..6887f09bc10bd095a64f7db4e04cd69d911ea0c7 --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/grid_search.py @@ -0,0 +1,187 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
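+# Entry point of the CLUE grid search: schedules classification and MRC jobs onto
+# GPUs with enough free memory (queried via pynvml), retries failed jobs up to 5
+# times, and finally calls extract_result.sh to summarize the best results.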
+ +import os +import subprocess +import sys +import time +from collections import defaultdict + +from pynvml import ( + nvmlDeviceGetCount, + nvmlDeviceGetHandleByIndex, + nvmlDeviceGetMemoryInfo, + nvmlInit, +) + +nvmlInit() + +world_size = nvmlDeviceGetCount() +handles = [] +mrc_device = {} +handle_mapping = {} +for i in range(world_size): + h = nvmlDeviceGetHandleByIndex(i) + handles.append(h) + handle_mapping[str(h)] = i + + +def get_availble(est=15, is_mrc=False): + # Sort handles according to info.free + handles.sort(key=lambda x: nvmlDeviceGetMemoryInfo(x).free, reverse=True) + for i, h in enumerate(handles): + device_id = handle_mapping[str(h)] + if device_id in mrc_device.values(): + continue + info = nvmlDeviceGetMemoryInfo(h) + gb = 1024 * 1024 * 1024 + print(f"- device_id: {device_id}") + print(f"- free : {info.free/gb}") + if info.free / gb >= est: + return device_id + return None + + +# TODO Support multi-machine + + +def get_mrc_tasks(model_name_or_path): + learning_rate_list = [1e-5, 2e-5, 3e-5] + batch_size_list = [32, 24] + cls_base_grd_acc = 4 + tasks = [] + for lr in learning_rate_list: + for bs in batch_size_list: + tasks.append(f"bash run_mrc.sh {model_name_or_path} chid {bs} {lr} {cls_base_grd_acc*2}") + tasks.append(f"bash run_mrc.sh {model_name_or_path} cmrc2018 {bs} {lr} {cls_base_grd_acc}") + tasks.append(f"bash run_mrc.sh {model_name_or_path} c3 {bs} {lr} {bs//2}") + return tasks + + +def get_cls_tasks(model_name_or_path): + learning_rate_list = [1e-5, 2e-5, 3e-5, 5e-5] + batch_size_list = [16, 32, 64] + datasets = ["afqmc", "tnews", "iflytek", "ocnli", "cmnli", "cluewsc2020", "csl"] + cls_base_grd_acc = 1 + hyper_params = { + "afqmc": [[3, 128, cls_base_grd_acc, 0.1]], + "tnews": [[3, 128, cls_base_grd_acc, 0.1]], + "iflytek": [[3, 128, cls_base_grd_acc, 0.1], [3, 128, cls_base_grd_acc, 0.0]], + "ocnli": [[5, 128, cls_base_grd_acc, 0.1]], + "cluewsc2020": [[50, 128, cls_base_grd_acc, 0.1], [50, 128, cls_base_grd_acc, 0.0]], + "csl": [[5, 256, cls_base_grd_acc * 2, 0.1]], + "cmnli": [[2, 128, cls_base_grd_acc, 0.1]], + } + tasks = [] + for dataset in datasets: + for lr in learning_rate_list: + for bs in batch_size_list: + for hyper_param in hyper_params[dataset]: + epoch, max_seq_len, grd_acc, dropout = hyper_param + tasks.append( + f"bash run_cls.sh {dataset} {lr} {bs} {epoch} {max_seq_len} {model_name_or_path} {grd_acc} {dropout}" + ) + for lr in learning_rate_list: + for hyper_param in hyper_params["cluewsc2020"]: + bs = 8 + epoch, max_seq_len, grd_acc, dropout = hyper_param + tasks.append( + f"bash run_cls.sh cluewsc2020 {lr} {bs} {epoch} {max_seq_len} {model_name_or_path} {grd_acc} {dropout}" + ) + return tasks + + +def do_task(task): + # tmp = task.split(" ") + est = 15 + # if int(tmp[4]) * int(tmp[6]) > 32 * 128: + # est = 30 + print(est) + is_mrc = False + if "cmrc" in task or "chid" in task or "c3" in task: + is_mrc = True + device_id = get_availble(est, is_mrc) + retry = 5 + while device_id is None and retry > 0: + print("> No device avaliable, wait 120 seconds.") + time.sleep(120) + device_id = get_availble(est, is_mrc) + retry -= 1 + if retry == 0: + return None + task_ps = f"set -x \nexport CUDA_VISIBLE_DEVICES={device_id}\n" + task + print(f"> Send task \n{task_ps}\n") + ps = subprocess.Popen(task_ps, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) + if is_mrc and device_id is not None: + mrc_device[task] = device_id + print("mrc_device", mrc_device) + return ps + + +def main(): + model_name_or_path = sys.argv[1] + # Make sure that 
dataset has been downloaded first + status = os.system(f"python warmup_dataset_and_model.py {model_name_or_path}") + assert status == 0, "Please make sure clue dataset has been downloaded successfully." + tasks = [] + tasks = get_cls_tasks(model_name_or_path) + tasks += get_mrc_tasks(model_name_or_path) + + for x in tasks: + print(x) + + runs = [] + retry = defaultdict(int) + while len(tasks) > 0 or len(runs) > 0: + i = 0 + print("\n\n\n>> Round start") + while i < len(runs): + returncode = runs[i]["ps"].poll() + if returncode is not None: + if returncode != 0: + retry[runs[i]["ts"]] += 1 + print(f"> {runs[i]['ts']} task failed, will retried, tryed {retry[runs[i]['ts']]} times.") + output = runs[i]["ps"].communicate()[0] + for line in output.decode("utf-8").split("\n"): + print(line) + if retry[runs[i]["ts"]] <= 5: + tasks.append(runs[i]["ts"]) + else: + if "cmrc" in runs[i]["ts"] or "chid" in runs[i]["ts"] or "c3" in runs[i]["ts"]: + mrc_device.pop(runs[i]["ts"]) + print("mrc_device", mrc_device) + print(f"> Done! {runs[i]['ts']}") + runs.pop(i) + i = i - 1 + else: + print(">> DOING", runs[i]["ts"]) + i += 1 + + if len(tasks) > 0: + task = tasks.pop(0) + print(f"> Try to append {task}") + ps = do_task(task) + if ps is None: + tasks.append(task) + else: + runs.append({"ps": ps, "ts": task}) + + print("> Wait for 15 seconds to start!") + time.sleep(15) + print("All done!") + status = os.system(f"bash extract_result.sh {model_name_or_path}") + + +if __name__ == "__main__": + main() diff --git a/examples/benchmark/clue/grid_search_tools/run_cls.sh b/examples/benchmark/clue/grid_search_tools/run_cls.sh new file mode 100644 index 0000000000000000000000000000000000000000..dd0c474c002566f1ce766d5e4e4a0e38ed821618 --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/run_cls.sh @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +export TASK_NAME=$1 +export LR=$2 +export BS=$3 +export EPOCH=$4 +export MAX_SEQ_LEN=$5 +export MODEL_PATH=$6 +export grad_acc=$7 +export dropout_p=$8 + +export FLAGS_cudnn_deterministic=True + +if [ -f "${MODEL_PATH}/${TASK_NAME}/${LR}_${BS}_${dropout_p}.log" ] +then + # Exits if log exits and best_result is computed. 
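+    # This makes the grid search re-entrant: when grid_search.py retries a task
+    # or is re-run, combinations that already produced a best_result are skipped.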
+ best_acc=`cat ${MODEL_PATH}/${TASK_NAME}/${LR}_${BS}_${dropout_p}.log |grep "best_result"` + if [ "${best_acc}" != "" ] + then + exit 0 + fi +fi + +mkdir -p ${MODEL_PATH}/${TASK_NAME} + +python -u ../classification/run_clue_classifier.py \ + --model_name_or_path ${MODEL_PATH} \ + --task_name ${TASK_NAME} \ + --max_seq_length ${MAX_SEQ_LEN} \ + --batch_size ${BS} \ + --learning_rate ${LR} \ + --num_train_epochs ${EPOCH} \ + --logging_steps 100 \ + --seed 42 \ + --save_steps 100 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --adam_epsilon 1e-8 \ + --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \ + --device gpu \ + --gradient_accumulation_steps ${grad_acc} \ + --do_train \ + --dropout ${dropout_p} \ + --save_best_model False > ${MODEL_PATH}/${TASK_NAME}/${LR}_${BS}_${dropout_p}.log 2>&1 + diff --git a/examples/benchmark/clue/grid_search_tools/run_mrc.sh b/examples/benchmark/clue/grid_search_tools/run_mrc.sh new file mode 100644 index 0000000000000000000000000000000000000000..1ddb5cdfc507fe3c8f613ccab576c346d9805cd0 --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/run_mrc.sh @@ -0,0 +1,69 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +MODEL_PATH=$1 +TASK_NAME=$2 +BATCH_SIZE=$3 +LR=$4 +GRAD_ACCU_STEPS=$5 + +export FLAGS_cudnn_deterministic=True + +if [ -f "${MODEL_PATH}/${TASK_NAME}/${LR}_${BATCH_SIZE}_0.1.log" ] +then + # Exits if log exits and best_result is computed. + best_res=`cat ${MODEL_PATH}/${TASK_NAME}/${LR}_${BATCH_SIZE}_0.1.log |grep "best_result"` + if [ "${best_res}" != "" ] + then + exit 0 + fi +fi + +if [ $TASK_NAME == 'cmrc2018' ]; then +MAX_SEQ_LEN=512 +EPOCHS=2 +WARMUP_PROP=0.1 +fi + +if [ $TASK_NAME == 'c3' ]; then +MAX_SEQ_LEN=512 +EPOCHS=8 +WARMUP_PROP=0.1 +fi + +if [ $TASK_NAME == 'chid' ]; then +MAX_SEQ_LEN=64 +EPOCHS=3 +WARMUP_PROP=0.06 +fi + +mkdir -p ${MODEL_PATH}/${TASK_NAME} + +python ../mrc/run_${TASK_NAME}.py \ + --model_name_or_path ${MODEL_PATH} \ + --batch_size ${BATCH_SIZE} \ + --learning_rate ${LR} \ + --max_seq_length ${MAX_SEQ_LEN} \ + --num_train_epochs ${EPOCHS} \ + --output_dir ${MODEL_PATH}/${TASK_NAME}_model/${LR}_${BATCH_SIZE}/ \ + --do_train \ + --seed 42 \ + --weight_decay 0.01 \ + --device gpu \ + --num_proc 4 \ + --logging_steps 100 \ + --warmup_proportion ${WARMUP_PROP} \ + --gradient_accumulation_steps ${GRAD_ACCU_STEPS} \ + --save_best_model False > ${MODEL_PATH}/${TASK_NAME}/${LR}_${BATCH_SIZE}_0.1.log 2>&1 diff --git a/examples/benchmark/clue/grid_search_tools/warmup_dataset_and_model.py b/examples/benchmark/clue/grid_search_tools/warmup_dataset_and_model.py new file mode 100644 index 0000000000000000000000000000000000000000..d5775a7b2f4c1c4f0d7c11a2cf09194af20c1ac5 --- /dev/null +++ b/examples/benchmark/clue/grid_search_tools/warmup_dataset_and_model.py @@ -0,0 +1,50 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys + +from paddlenlp.datasets import load_dataset +from paddlenlp.utils.log import logger + +model_name_or_path = sys.argv[1] + +# CLUE classification dataset warmup +logger.info("Download model and data for CLUE classification tasks.") +for task in ["afqmc", "tnews", "iflytek", "ocnli", "cmnli", "cluewsc2020", "csl"]: + load_dataset("clue", task, splits=("train", "dev", "test")) + +# Downloads HF dataset +from datasets import load_dataset # noqa: E402 + +load_dataset("clue", "chid") +load_dataset("clue", "cmrc2018") +load_dataset("clue", "c3") + +# HF dataset process and cache +logger.info("Data process for CHID tasks, and this will take some time. If cache exists, this will skip.") +status = os.system( + f"python ../mrc/run_chid.py --do_train --max_steps 0 --model_name_or_path {model_name_or_path} --batch_size 1 --gradient_accumulation_steps 1" +) +assert status == 0, "Please make sure clue dataset CHID has been preprocessed successfully." +logger.info("Data process for CMRC2018 tasks. If cache exists, this will skip.") +status = os.system( + f"python ../mrc/run_cmrc2018.py --do_train --max_steps 0 --model_name_or_path {model_name_or_path} --batch_size 1 --gradient_accumulation_steps 1" +) +assert status == 0, "Please make sure clue dataset CMRC2018 has been preprocessed successfully." +logger.info("Data process for C3 tasks. If cache exists, this will skip.") +status = os.system( + f"python ../mrc/run_c3.py --do_train --max_steps 0 --model_name_or_path {model_name_or_path} --batch_size 1 --gradient_accumulation_steps 1" +) +assert status == 0, "Please make sure clue dataset C3 has been preprocessed successfully." diff --git a/examples/benchmark/clue/mrc/run_c3.py b/examples/benchmark/clue/mrc/run_c3.py new file mode 100644 index 0000000000000000000000000000000000000000..a7bc139b4ca7571da235ae0420038696f6d24620 --- /dev/null +++ b/examples/benchmark/clue/mrc/run_c3.py @@ -0,0 +1,413 @@ +# coding: utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
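+
+# Fine-tunes a multiple-choice model on the CLUE C3 reading comprehension task.
+# Each example packs the context, the question and up to four candidate answers
+# into a [num_choices, seq_len] input; the best dev accuracy is logged as
+# "best_result" so that grid_search.py / extract_result.sh can collect it.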
+ +import argparse +import contextlib +import json +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from datasets import load_dataset + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + AutoModelForMultipleChoice, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name.", + ) + parser.add_argument( + "--num_proc", + default=None, + type=int, + help="Max number of processes when generating cache. Already cached shards are loaded sequentially.", + ) + parser.add_argument("--output_dir", default="best_c3_model", type=str, help="The path of the checkpoints .") + parser.add_argument("--save_best_model", default=True, type=strtobool, help="Whether to save best model.") + parser.add_argument( + "--overwrite_cache", + default=False, + type=strtobool, + help="Whether to overwrite cache for dataset.", + ) + parser.add_argument("--num_train_epochs", default=8, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--learning_rate", default=2e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + parser.add_argument("--batch_size", default=24, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--eval_batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=4, + help="Number of updates steps to accumualte before performing a backward/update pass.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. 
+ random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, dev_data_loader, metric): + metric.reset() + model.eval() + for _, batch in enumerate(dev_data_loader): + input_ids, segment_ids, label_id = batch + logits = model(input_ids=input_ids, token_type_ids=segment_ids) + correct = metric.compute(logits, label_id) + metric.update(correct) + acc = metric.accumulate() + model.train() + return acc + + +@contextlib.contextmanager +def main_process_first(desc="work"): + if paddle.distributed.get_world_size() > 1: + rank = paddle.distributed.get_rank() + is_main_process = rank == 0 + main_process_desc = "main local process" + + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{rank}: waiting for the {main_process_desc} to perform {desc}") + paddle.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{rank}: {main_process_desc} completed {desc}, releasing all replicas") + paddle.distributed.barrier() + else: + yield + + +def run(args): + if args.do_train: + assert ( + args.batch_size % args.gradient_accumulation_steps == 0 + ), "Please make sure argmument `batch_size` must be divisible by `gradient_accumulation_steps`." + max_seq_length = args.max_seq_length + max_num_choices = 4 + + def preprocess_function(examples, do_predict=False): + def _truncate_seq_tuple(tokens_a, tokens_b, tokens_c, max_length): + """Truncates a sequence tuple in place to the maximum length.""" + # This is a simple heuristic which will always truncate the longer + # sequence one token at a time. This makes more sense than + # truncating an equal percent of tokens from each, since if one + # sequence is very short then each token that's truncated likely + # contains more information than a longer sequence. 
+ while True: + total_length = len(tokens_a) + len(tokens_b) + len(tokens_c) + if total_length <= max_length: + break + if len(tokens_a) >= len(tokens_b) and len(tokens_a) >= len(tokens_c): + tokens_a.pop() + elif len(tokens_b) >= len(tokens_a) and len(tokens_b) >= len(tokens_c): + tokens_b.pop() + else: + tokens_c.pop() + + num_examples = len(examples.data["question"]) + if do_predict: + result = {"input_ids": [], "token_type_ids": []} + else: + result = {"input_ids": [], "token_type_ids": [], "labels": []} + for idx in range(num_examples): + text = "\n".join(examples.data["context"][idx]).lower() + question = examples.data["question"][idx].lower() + choice_list = examples.data["choice"][idx] + choice_list = [choice.lower() for choice in choice_list][:max_num_choices] + if not do_predict: + answer = examples.data["answer"][idx].lower() + label = choice_list.index(answer) + + tokens_t = tokenizer.tokenize(text) + tokens_q = tokenizer.tokenize(question) + + tokens_t_list = [] + tokens_c_list = [] + + # Pad each new example for axis=1, [batch_size, num_choices, seq_len] + while len(choice_list) < max_num_choices: + choice_list.append("无效答案") + + for choice in choice_list: + tokens_c = tokenizer.tokenize(choice.lower()) + _truncate_seq_tuple(tokens_t, tokens_q, tokens_c, max_seq_length - 4) + + tokens_c = tokens_q + ["[SEP]"] + tokens_c + tokens_t_list.append(tokens_t) + tokens_c_list.append(tokens_c) + + new_data = tokenizer(tokens_t_list, text_pair=tokens_c_list, is_split_into_words="token") + + # Pad each new example for axis=2 of [batch_size, num_choices, seq_len], + # because length of each choice could be different. + input_ids = Pad(axis=0, pad_val=tokenizer.pad_token_id)(new_data["input_ids"]) + token_type_ids = Pad(axis=0, pad_val=tokenizer.pad_token_id)(new_data["token_type_ids"]) + + # Final shape of input_ids: [batch_size, num_choices, seq_len] + result["input_ids"].append(input_ids) + result["token_type_ids"].append(token_type_ids) + if not do_predict: + result["labels"].append([label]) + if (idx + 1) % 1000 == 0: + logger.info("%d samples have been processed." 
% (idx + 1)) + return result + + paddle.set_device(args.device) + set_seed(args) + + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = AutoModelForMultipleChoice.from_pretrained(args.model_name_or_path, num_choices=max_num_choices) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_ds, dev_ds, test_ds = load_dataset("clue", "c3", split=["train", "validation", "test"]) + + if args.do_train: + args.batch_size = int(args.batch_size / args.gradient_accumulation_steps) + column_names = train_ds.column_names + with main_process_first(desc="train dataset map pre-processing"): + train_ds = train_ds.map( + preprocess_function, + batched=True, + batch_size=len(train_ds), + num_proc=args.num_proc, + remove_columns=column_names, + load_from_cache_file=not args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + + batchify_fn = lambda samples, fn=Dict( # noqa: E731 + { + "input_ids": Pad(axis=1, pad_val=tokenizer.pad_token_id), # input + "token_type_ids": Pad(axis=1, pad_val=tokenizer.pad_token_type_id), # segment + "labels": Stack(dtype="int64"), # label + } + ): fn(samples) + + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_data_loader = paddle.io.DataLoader( + dataset=train_ds, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + with main_process_first(desc="evaluate dataset map pre-processing"): + dev_ds = dev_ds.map( + preprocess_function, + batched=True, + batch_size=len(dev_ds), + remove_columns=column_names, + num_proc=args.num_proc, + load_from_cache_file=args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.eval_batch_size, shuffle=False) + dev_data_loader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + num_training_steps = ( + int(args.max_steps / args.gradient_accumulation_steps) + if args.max_steps >= 0 + else int(len(train_data_loader) * args.num_train_epochs / args.gradient_accumulation_steps) + ) + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
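+        # The filter is name-based: any parameter whose name contains "bias" or
+        # "norm" is left out of decay_params and therefore gets no weight decay
+        # through apply_decay_param_fun.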
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + loss_fct = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + model.train() + global_step = 0 + best_acc = 0.0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + input_ids, segment_ids, label = batch + logits = model(input_ids=input_ids, token_type_ids=segment_ids) + loss = loss_fct(logits, label) + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + loss.backward() + if (step + 1) % args.gradient_accumulation_steps == 0: + global_step += 1 + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step + 1, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step >= num_training_steps: + break + if global_step > num_training_steps: + break + tic_eval = time.time() + acc = evaluate(model, loss_fct, dev_data_loader, metric) + logger.info("eval acc: %.5f, eval done total : %s s" % (acc, time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0 and acc > best_acc: + best_acc = acc + if args.save_best_model: + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + if global_step >= num_training_steps: + break + logger.info("best_result: %.2f" % (best_acc * 100)) + + if args.do_predict: + column_names = test_ds.column_names + test_ds = test_ds.map( + partial(preprocess_function, do_predict=True), + batched=True, + batch_size=len(test_ds), + remove_columns=column_names, + num_proc=args.num_proc, + ) + # Serveral samples have more than four choices. 
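+        # A batch size of 1 keeps samples with different numbers of choices from
+        # being padded/stacked along the choice dimension within one batch.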
+ test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=1, shuffle=False) + + batchify_fn = lambda samples, fn=Dict( # noqa: E731 + { + "input_ids": Pad(axis=1, pad_val=tokenizer.pad_token_id), # input + "token_type_ids": Pad(axis=1, pad_val=tokenizer.pad_token_type_id), # segment + } + ): fn(samples) + + test_data_loader = paddle.io.DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + + f = open(os.path.join(args.output_dir, "c311_predict.json"), "w") + result = {} + idx = 0 + for step, batch in enumerate(test_data_loader): + input_ids, segment_ids = batch + with paddle.no_grad(): + logits = model(input_ids, segment_ids) + preds = paddle.argmax(logits, axis=1).numpy().tolist() + for pred in preds: + result[str(idx)] = pred + j = json.dumps({"id": idx, "label": pred}) + f.write(j + "\n") + idx += 1 + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + run(args) diff --git a/examples/benchmark/clue/mrc/run_chid.py b/examples/benchmark/clue/mrc/run_chid.py new file mode 100644 index 0000000000000000000000000000000000000000..952e78b3b1bb2068adb64189f6e167702b18966f --- /dev/null +++ b/examples/benchmark/clue/mrc/run_chid.py @@ -0,0 +1,583 @@ +# coding: utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import contextlib +import json +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from datasets import load_dataset + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + AutoModelForMultipleChoice, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." 
+ ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name.", + ) + parser.add_argument("--output_dir", default="best_chid_model", type=str, help="The path of the checkpoints .") + parser.add_argument("--save_best_model", default=True, type=strtobool, help="Whether to save best model.") + parser.add_argument( + "--overwrite_cache", + default=False, + type=strtobool, + help="Whether to overwrite cache for dataset.", + ) + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--num_proc", + default=None, + type=int, + help="Max number of processes when generating cache. Already cached shards are loaded sequentially.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--learning_rate", default=2e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + parser.add_argument("--batch_size", default=4, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--eval_batch_size", default=24, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=1, + help="Number of updates steps to accumualte before performing a backward/update pass.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + parser.add_argument( + "--max_seq_length", + default=64, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def calc_global_pred_results(logits): + logits = np.array(logits) + # [num_choices, tag_size] + logits = np.transpose(logits) + tmp = [] + for i, row in enumerate(logits): + for j, col in enumerate(row): + tmp.append((i, j, col)) + else: + choice = set(range(i + 1)) + blanks = set(range(j + 1)) + tmp = sorted(tmp, key=lambda x: x[2], reverse=True) + results = [] + for i, j, v in tmp: + if (j in blanks) and (i in choice): + results.append((i, j)) + blanks.remove(j) + choice.remove(i) + results = sorted(results, key=lambda x: x[1], reverse=False) + results = [i for i, j in results] + return results + + +@paddle.no_grad() +def evaluate(model, data_loader, do_predict=False): + model.eval() + right_num, total_num = 0, 0 + all_results = [] + for step, batch in enumerate(data_loader): + if do_predict: + input_ids, segment_ids, example_ids = batch + else: + input_ids, segment_ids, labels, example_ids = batch + logits = model(input_ids=input_ids, token_type_ids=segment_ids) + batch_num = example_ids.shape[0] + l = 0 + r = batch_num - 1 + batch_results = [] + for i in range(batch_num - 1): + if example_ids[i] != example_ids[i + 1]: + r = i + batch_results.extend(calc_global_pred_results(logits[l : r + 1, :])) + l = i + 1 + if l <= batch_num - 1: + batch_results.extend(calc_global_pred_results(logits[l:batch_num, :])) + if do_predict: + all_results.extend(batch_results) + else: + right_num += np.sum(np.array(batch_results) == labels.numpy()) + total_num += labels.shape[0] + model.train() + if not do_predict: + acc = right_num / total_num + return acc + return all_results + + +@contextlib.contextmanager +def main_process_first(desc="work"): + if paddle.distributed.get_world_size() > 1: + rank = paddle.distributed.get_rank() + is_main_process = rank == 0 + main_process_desc = "main local process" + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{rank}: waiting for the {main_process_desc} to perform {desc}") + paddle.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{rank}: {main_process_desc} completed {desc}, releasing all replicas") + paddle.distributed.barrier() + else: + yield + + +def run(args): + if args.do_train: + assert ( + args.batch_size % args.gradient_accumulation_steps == 0 + ), "Please make sure argmument `batch_size` must be divisible by `gradient_accumulation_steps`." + paddle.set_device(args.device) + set_seed(args) + + max_seq_length = args.max_seq_length + max_num_choices = 10 + + def preprocess_function(examples, do_predict=False): + SPIECE_UNDERLINE = "▁" + + def _is_chinese_char(cp): + if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + def is_fuhao(c): + if ( + c == "。" + or c == "," + or c == "!" + or c == "?" 
+ or c == ";" + or c == "、" + or c == ":" + or c == "(" + or c == ")" + or c == "-" + or c == "~" + or c == "「" + or c == "《" + or c == "》" + or c == "," + or c == "」" + or c == '"' + or c == "“" + or c == "”" + or c == "$" + or c == "『" + or c == "』" + or c == "—" + or c == ";" + or c == "。" + or c == "(" + or c == ")" + or c == "-" + or c == "~" + or c == "。" + or c == "‘" + or c == "’" + ): + return True + return False + + def _tokenize_chinese_chars(text): + """Adds whitespace around any CJK character.""" + output = [] + is_blank = False + for index, char in enumerate(text): + cp = ord(char) + if is_blank: + output.append(char) + if context[index - 12 : index + 1].startswith("#idiom"): + is_blank = False + output.append(SPIECE_UNDERLINE) + else: + if text[index : index + 6] == "#idiom": + is_blank = True + if len(output) > 0 and output[-1] != SPIECE_UNDERLINE: + output.append(SPIECE_UNDERLINE) + output.append(char) + elif _is_chinese_char(cp) or is_fuhao(char): + if len(output) > 0 and output[-1] != SPIECE_UNDERLINE: + output.append(SPIECE_UNDERLINE) + output.append(char) + output.append(SPIECE_UNDERLINE) + else: + output.append(char) + return "".join(output) + + def is_whitespace(c): + if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F or c == SPIECE_UNDERLINE: + return True + return False + + def add_tokens_for_around(tokens, pos, num_tokens): + num_l = num_tokens // 2 + num_r = num_tokens - num_l + + if pos >= num_l and (len(tokens) - 1 - pos) >= num_r: + tokens_l = tokens[pos - num_l : pos] + tokens_r = tokens[pos + 1 : pos + 1 + num_r] + elif pos <= num_l: + tokens_l = tokens[:pos] + right_len = num_tokens - len(tokens_l) + tokens_r = tokens[pos + 1 : pos + 1 + right_len] + elif (len(tokens) - 1 - pos) <= num_r: + tokens_r = tokens[pos + 1 :] + left_len = num_tokens - len(tokens_r) + tokens_l = tokens[pos - left_len : pos] + else: + raise ValueError("impossible") + + return tokens_l, tokens_r + + max_tokens_for_doc = max_seq_length - 3 + num_tokens = max_tokens_for_doc - 5 + num_examples = len(examples.data["candidates"]) + if do_predict: + result = {"input_ids": [], "token_type_ids": [], "example_ids": []} + else: + result = {"input_ids": [], "token_type_ids": [], "labels": [], "example_ids": []} + for idx in range(num_examples): + candidate = 0 + options = examples.data["candidates"][idx] + + # Each content may have several sentences. 
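+            # Every "#idiom" blank found in a sentence becomes its own example:
+            # the tokens around the blank are kept, other blanks are masked with
+            # [MASK] tokens, and the 10 candidate idioms form the choice axis.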
+ for context in examples.data["content"][idx]: + context = ( + context.replace("“", '"') + .replace("”", '"') + .replace("——", "--") + .replace("—", "-") + .replace("―", "-") + .replace("…", "...") + .replace("‘", "'") + .replace("’", "'") + ) + context = _tokenize_chinese_chars(context) + paragraph_text = context.strip() + doc_tokens = [] + prev_is_whitespace = True + for c in paragraph_text: + if is_whitespace(c): + prev_is_whitespace = True + else: + if prev_is_whitespace: + doc_tokens.append(c) + else: + doc_tokens[-1] += c + prev_is_whitespace = False + all_doc_tokens = [] + for (i, token) in enumerate(doc_tokens): + if "#idiom" in token: + sub_tokens = [str(token)] + else: + sub_tokens = tokenizer.tokenize(token) + for sub_token in sub_tokens: + all_doc_tokens.append(sub_token) + tags = [blank for blank in doc_tokens if "#idiom" in blank] + + # Each sentence may have several tags + for tag_index, tag in enumerate(tags): + pos = all_doc_tokens.index(tag) + + tmp_l, tmp_r = add_tokens_for_around(all_doc_tokens, pos, num_tokens) + num_l = len(tmp_l) + num_r = len(tmp_r) + tokens_l = [] + for token in tmp_l: + if "#idiom" in token and token != tag: + # Mask tag which is not considered in this new sample. + # Each idiom has four words, so 4 mask tokens are used. + tokens_l.extend(["[MASK]"] * 4) + else: + tokens_l.append(token) + tokens_l = tokens_l[-num_l:] + del tmp_l + + tokens_r = [] + for token in tmp_r: + if "#idiom" in token and token != tag: + tokens_r.extend(["[MASK]"] * 4) + else: + tokens_r.append(token) + tokens_r = tokens_r[:num_r] + del tmp_r + + tokens_list = [] + # Each tag has ten choices, and the shape of each new + # example is [num_choices, seq_len] + for i, elem in enumerate(options): + option = tokenizer.tokenize(elem) + tokens = option + ["[SEP]"] + tokens_l + ["[unused1]"] + tokens_r + tokens_list.append(tokens) + new_data = tokenizer(tokens_list, is_split_into_words=True) + # Final shape of input_ids: [batch_size, num_choices, seq_len] + result["input_ids"].append(new_data["input_ids"]) + result["token_type_ids"].append(new_data["token_type_ids"]) + result["example_ids"].append(idx) + if not do_predict: + label = examples.data["answers"][idx]["candidate_id"][candidate] + result["labels"].append(label) + candidate += 1 + if (idx + 1) % 10000 == 0: + logger.info("%d samples have been processed." 
% (idx + 1)) + return result + + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + model = AutoModelForMultipleChoice.from_pretrained(args.model_name_or_path, num_choices=max_num_choices) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_ds, dev_ds, test_ds = load_dataset("clue", "chid", split=["train", "validation", "test"]) + + if args.do_train: + args.batch_size = int(args.batch_size / args.gradient_accumulation_steps) + column_names = train_ds.column_names + with main_process_first(desc="train dataset map pre-processing"): + train_ds = train_ds.map( + partial(preprocess_function), + batched=True, + batch_size=len(train_ds), + num_proc=args.num_proc, + remove_columns=column_names, + load_from_cache_file=not args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=1, pad_val=tokenizer.pad_token_id), # input + "token_type_ids": Pad(axis=1, pad_val=tokenizer.pad_token_type_id), # segment + "labels": Stack(dtype="int64"), # label + "example_ids": Stack(dtype="int64"), # example id + } + ): fn(samples) + + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_data_loader = paddle.io.DataLoader( + dataset=train_ds, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + with main_process_first(desc="evaluate dataset map pre-processing"): + dev_ds = dev_ds.map( + partial(preprocess_function), + batched=True, + batch_size=len(dev_ds), + remove_columns=column_names, + num_proc=args.num_proc, + load_from_cache_file=args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.eval_batch_size, shuffle=False) + + dev_data_loader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + num_training_steps = ( + int(args.max_steps / args.gradient_accumulation_steps) + if args.max_steps >= 0 + else int(len(train_data_loader) * args.num_train_epochs / args.gradient_accumulation_steps) + ) + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + + loss_fct = nn.CrossEntropyLoss() + + model.train() + global_step = 0 + best_acc = 0.0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + input_ids, segment_ids, labels, example_ids = batch + logits = model(input_ids=input_ids, token_type_ids=segment_ids) + loss = loss_fct(logits, labels) + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + loss.backward() + if (step + 1) % args.gradient_accumulation_steps == 0: + global_step += 1 + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % ( + global_step, + num_training_steps, + epoch, + step + 1, + loss, + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step >= num_training_steps: + break + if global_step > num_training_steps: + break + tic_eval = time.time() + acc = evaluate(model, dev_data_loader) + logger.info("eval acc: %.5f, eval done total : %s s" % (acc, time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0 and acc > best_acc: + best_acc = acc + if args.save_best_model: + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + model_to_save.save_pretrained(args.output_dir) + tokenizer.save_pretrained(args.output_dir) + if global_step >= num_training_steps: + break + logger.info("best_result: %.2f" % (best_acc * 100)) + + if args.do_predict: + column_names = test_ds.column_names + test_ds = test_ds.map( + partial(preprocess_function, do_predict=True), + batched=True, + batch_size=len(test_ds), + remove_columns=column_names, + num_proc=args.num_proc, + ) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.eval_batch_size, shuffle=False) + + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=1, pad_val=tokenizer.pad_token_id), # input + "token_type_ids": Pad(axis=1, pad_val=tokenizer.pad_token_type_id), # segment + "example_ids": Stack(dtype="int64"), # example id + } + ): fn(samples) + + test_data_loader = paddle.io.DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + result = {} + idx = 623377 + preds = evaluate(model, test_data_loader, do_predict=True) + for pred in preds: + result["#idiom" + str(idx) + "#"] = pred + idx += 1 + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + with open(os.path.join(args.output_dir, "chid11_predict.json"), "w") as writer: + json.dump(result, writer, indent=2) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + run(args) diff --git a/examples/benchmark/clue/mrc/run_cmrc2018.py 
b/examples/benchmark/clue/mrc/run_cmrc2018.py new file mode 100644 index 0000000000000000000000000000000000000000..be12a216edd45a8ea07924d2d3a76afe85c5c988 --- /dev/null +++ b/examples/benchmark/clue/mrc/run_cmrc2018.py @@ -0,0 +1,488 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import contextlib +import distutils.util +import json +import os +import random +import time + +import numpy as np +import paddle +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import ( + AutoModelForQuestionAnswering, + AutoTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default="best_cmrc_model", + type=str, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--save_best_model", default=True, type=distutils.util.strtobool, help="Whether to save best model." + ) + parser.add_argument( + "--overwrite_cache", + default=False, + type=distutils.util.strtobool, + help="Whether to overwrite cache for dataset.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument( + "--num_proc", + default=None, + type=int, + help="Max number of processes when generating cache. Already cached shards are loaded sequentially.", + ) + parser.add_argument("--eval_batch_size", default=12, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. 
If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", + default=0.1, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=20, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=50, help="Max answer length.") + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + parser.add_argument("--do_train", action="store_true", help="Whether to train.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=2, + help="Number of updates steps to accumualte before performing a backward/update pass.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, raw_dataset, dataset, data_loader, args, do_eval=True): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + for batch in data_loader: + start_logits, end_logits = model(**batch) + for idx in range(start_logits.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + logger.info("Processing example: %d" % len(all_start_logits)) + logger.info("time per 1000: %s" % (time.time() - tic_eval)) + tic_eval = time.time() + + all_start_logits.append(start_logits.numpy()[idx]) + all_end_logits.append(end_logits.numpy()[idx]) + + all_predictions, _, _ = compute_prediction( + raw_dataset, dataset, (all_start_logits, all_end_logits), False, args.n_best_size, args.max_answer_length + ) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + if do_eval: + filename = os.path.join(args.output_dir, "prediction_validation.json") + else: + filename = os.path.join(args.output_dir, "cmrc2018_predict.json") + with open(filename, "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + if do_eval: + res = squad_evaluate( + examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, is_whitespace_splited=False + ) + model.train() + return res["exact"], res["f1"] + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, 
axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +@contextlib.contextmanager +def main_process_first(desc="work"): + if paddle.distributed.get_world_size() > 1: + rank = paddle.distributed.get_rank() + is_main_process = rank == 0 + main_process_desc = "main local process" + + try: + if not is_main_process: + # tell all replicas to wait + logger.debug(f"{rank}: waiting for the {main_process_desc} to perform {desc}") + paddle.distributed.barrier() + yield + finally: + if is_main_process: + # the wait is over + logger.debug(f"{rank}: {main_process_desc} completed {desc}, releasing all replicas") + paddle.distributed.barrier() + else: + yield + + +def run(args): + if args.do_train: + assert ( + args.batch_size % args.gradient_accumulation_steps == 0 + ), "Please make sure argmument `batch_size` must be divisible by `gradient_accumulation_steps`." + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + set_seed(args) + + train_examples, dev_examples, test_examples = load_dataset( + "clue", "cmrc2018", split=["train", "validation", "test"] + ) + + column_names = train_examples.column_names + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + + model = AutoModelForQuestionAnswering.from_pretrained(args.model_name_or_path) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + def prepare_train_features(examples): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). 
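+            # token_type_ids double as sequence ids here: 0 marks question
+            # tokens and 1 marks context tokens, which the search below uses to
+            # find the context span.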
+ sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + def prepare_validation_features(examples): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HuggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. 
+ sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index and k != len(sequence_ids) - 1 else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + if args.do_train: + args.batch_size = int(args.batch_size / args.gradient_accumulation_steps) + + with main_process_first(desc="train dataset map pre-processing"): + train_ds = train_examples.map( + prepare_train_features, + batched=True, + remove_columns=column_names, + load_from_cache_file=not args.overwrite_cache, + num_proc=args.num_proc, + desc="Running tokenizer on train dataset", + ) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + batchify_fn = DataCollatorWithPadding(tokenizer) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + with main_process_first(desc="evaluate dataset map pre-processing"): + dev_ds = dev_examples.map( + prepare_validation_features, + batched=True, + remove_columns=column_names, + num_proc=args.num_proc, + load_from_cache_file=args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + dev_ds_for_model = dev_ds.remove_columns(["example_id", "offset_mapping", "attention_mask"]) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.eval_batch_size, shuffle=False) + + dev_data_loader = DataLoader( + dataset=dev_ds_for_model, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + num_training_steps = ( + int(args.max_steps / args.gradient_accumulation_steps) + if args.max_steps >= 0 + else int(len(train_data_loader) * args.num_train_epochs / args.gradient_accumulation_steps) + ) + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
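+        # The filter below works on parameter names: any parameter whose name contains
+        # "bias" or "norm" is excluded from `decay_params`, and `apply_decay_param_fun`
+        # then restricts AdamW weight decay to the remaining parameters only.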
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = CrossEntropyLossForSQuAD() + best_res = (0.0, 0.0) + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + start_positions = batch.pop("start_positions") + end_positions = batch.pop("end_positions") + logits = model(**batch) + loss = criterion(logits, (start_positions, end_positions)) + if args.gradient_accumulation_steps > 1: + loss = loss / args.gradient_accumulation_steps + loss.backward() + if (step + 1) % args.gradient_accumulation_steps == 0: + global_step += 1 + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % ( + global_step, + num_training_steps, + epoch, + step + 1, + loss, + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step >= num_training_steps: + break + if global_step > num_training_steps: + break + em, f1 = evaluate(model, dev_examples, dev_ds, dev_data_loader, args) + if paddle.distributed.get_rank() == 0 and em > best_res[0]: + best_res = (em, f1) + if args.save_best_model: + output_dir = args.output_dir + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + break + logger.info("best_result: %.2f/%.2f" % (best_res[0], best_res[1])) + + if args.do_predict and rank == 0: + test_ds = test_examples.map( + prepare_validation_features, batched=True, remove_columns=column_names, num_proc=args.num_proc + ) + test_ds_for_model = test_ds.remove_columns(["example_id", "offset_mapping", "attention_mask"]) + test_batch_sampler = paddle.io.BatchSampler(test_ds_for_model, batch_size=args.eval_batch_size, shuffle=False) + + batchify_fn = DataCollatorWithPadding(tokenizer) + test_data_loader = DataLoader( + dataset=test_ds_for_model, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + evaluate(model, test_examples, test_ds, test_data_loader, args, do_eval=False) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + run(args) diff --git a/examples/benchmark/glue/README.md b/examples/benchmark/glue/README.md new file mode 100644 index 0000000000000000000000000000000000000000..14f9abc434606081297b69087372e7c1987de8b0 --- /dev/null +++ b/examples/benchmark/glue/README.md @@ -0,0 +1,126 @@ +# GLUE Benchmark + +[GLUE](https://gluebenchmark.com/)是当今使用最为普遍的自然语言理解评测基准数据集,评测数据涵盖新闻、电影、百科等许多领域,其中有简单的句子,也有困难的句子。其目的是通过公开的得分榜,促进自然语言理解系统的发展。详细可参考 [GLUE论文](https://openreview.net/pdf?id=rJ4km2R5t7) + +本项目是 GLUE评测任务 在 Paddle 2.0上的开源实现。 + +本项目支持BERT, ELECTRA,ERNIE,ALBERT,RoBERTa模型,可在model_type中进行指定。 + +## 快速开始 + 
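+在运行下述命令前,请确保当前环境中已安装 PaddlePaddle 与 PaddleNLP(GLUE 数据会在首次运行时自动下载)。一个最小的安装示例如下,具体版本请以实际环境为准:
+
+```shell
+pip install --upgrade paddlenlp
+```
+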
+### 启动GLUE任务 +以 GLUE/SST-2 任务为例,启动GLUE任务进行Fine-tuning的方式如下: + +#### 单卡训练 +```shell +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=SST-2 + +python -u ./run_glue.py \ + --model_name_or_path bert-base-uncased \ + --tokenizer_name_or_path bert-base-uncased \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 3e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 100 \ + --output_dir ./tmp/$TASK_NAME/ \ + --device gpu + +``` + +#### 多卡训练 +```shell +unset CUDA_VISIBLE_DEVICES +export TASK_NAME=SST-2 + +python -m paddle.distributed.launch --gpus "0,1" run_glue.py \ + --model_name_or_path bert-base-uncased \ + --tokenizer_name_or_path bert-base-uncased \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 3e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 100 \ + --output_dir ./tmp/$TASK_NAME/ \ + --device gpu + +``` +其中参数释义如下: +- `model_name_or_path` 指示了Fine-tuning使用的具体预训练模型,可以是PaddleNLP提供的预训练模型 或者 本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: /home/xx_model/,目录中需包含paddle预训练模型model_state.pdparams。 +如果使用PaddleNLP提供的预训练模型,可以选择`model_type`在[Transformer预训练模型汇总](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 中相对应的英文预训练权重。注意这里选择的模型权重要和上面配置的模型类型匹配,例如model_type 配置的是bert,则model_name_or_path只能选择bert相关的模型。另,glue任务应选择英文预训练权重。 +- `tokenizer_name_or_path` 指示了Fine-tuning使用的具体tokenizer,一般保持和model_name_or_path一致,也可以单独指定 +- `task_name` 表示 Fine-tuning 的任务,当前支持CoLA、SST-2、MRPC、STS-B、QQP、MNLI、QNLI、RTE。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 + +Fine-tuning过程将按照 `logging_steps` 和 `save_steps` 的设置打印如下日志: + +``` +global step 6310/6315, epoch: 2, batch: 2099, rank_id: 0, loss: 0.035772, lr: 0.0000000880, speed: 3.1527 step/s +global step 6311/6315, epoch: 2, batch: 2100, rank_id: 0, loss: 0.056789, lr: 0.0000000704, speed: 3.4201 step/s +global step 6312/6315, epoch: 2, batch: 2101, rank_id: 0, loss: 0.096717, lr: 0.0000000528, speed: 3.4694 step/s +global step 6313/6315, epoch: 2, batch: 2102, rank_id: 0, loss: 0.044982, lr: 0.0000000352, speed: 3.4513 step/s +global step 6314/6315, epoch: 2, batch: 2103, rank_id: 0, loss: 0.139579, lr: 0.0000000176, speed: 3.4566 step/s +global step 6315/6315, epoch: 2, batch: 2104, rank_id: 0, loss: 0.046043, lr: 0.0000000000, speed: 3.4590 step/s +eval loss: 0.549763, acc: 0.9151376146788991, eval done total : 1.8206987380981445 s +``` + +使用各种预训练模型进行 Fine-tuning ,在GLUE验证集上有如下结果: + +| Model GLUE Score | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | +|--------------------|-------|--------|--------|--------|--------|--------|--------|--------| +| electra-small | 58.22 | 91.85 | 88.24 | 87.24 | 88.83 | 82.45 | 88.61 | 66.78 | +| ernie-2.0-large-en | 65.4 | 96.0 | 88.7 | 92.3 | 92.5 | 89.1 | 94.3 | 85.2 | + +关于GLUE Score的说明: +1. 因Fine-tuning过程中有dropout等随机因素影响,同样预训练模型每次运行的GLUE Score会有较小差异,上表中的GLUE Score是运行多次取eval最好值的得分。 +2. 不同GLUE任务判定得分所使用的评价指标有些差异,简单如下表,详细说明可参考[GLUE论文](https://openreview.net/pdf?id=rJ4km2R5t7)。 + +| GLUE Task | Metric | +|------------|------------------------------| +| CoLA | Matthews corr | +| SST-2 | acc. | +| MRPC | acc./F1 | +| STS-B | Pearson/Spearman corr | +| QQP | acc./F1 | +| MNLI | matched acc./mismatched acc. | +| QNLI | acc. 
| +| RTE | acc. | + +#### trainer 版本 + +```shell +export task_name=mnli +export learning_rate=5e-5 + +python run_glue_trainer.py \ +--model_name_or_path roberta-large \ +--task_name $task_name \ +--do_train \ +--do_eval \ +--max_seq_length 512 \ +--per_device_train_batch_size 16 \ +--per_device_eval_batch_size 64 \ +--learning_rate $learning_rate \ +--num_train_epochs 10 \ +--output_dir ./checkpoints/$task_name/ft \ +--overwrite_output_dir \ +--logging_steps 10 \ +--evaluation_strategy epoch \ +--save_strategy epoch \ +--warmup_ratio 0.06 \ +--seed 0 \ +--weight_decay 0.1 \ +--disable_tqdm True +``` diff --git a/examples/benchmark/glue/run_glue.py b/examples/benchmark/glue/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..29ad6c42236a7efe3eb594d9eefad27da08dc167 --- /dev/null +++ b/examples/benchmark/glue/run_glue.py @@ -0,0 +1,346 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + LinearDecayWithWarmup, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument("--model_type", default=None, type=str, required=False, help="should be remove later") + parser.add_argument( + "--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model" + ) + parser.add_argument("--tokenizer_name_or_path", default=None, type=str, required=False, help="Path to tokenizer") + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("eval loss: %f, mcc: %s, " % (loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + + train_ds = load_dataset("glue", args.task_name, splits="train") + if args.tokenizer_name_or_path: + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name_or_path) + else: + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + 
return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = AutoModelForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(model, loss_fct, metric, dev_data_loader_matched) + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "%s_ft_model_%d.pdparams" % (args.task_name, global_step) + ) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) 
+ tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/benchmark/glue/run_glue_trainer.py b/examples/benchmark/glue/run_glue_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..03c4e2f347bf911bb64df16ed381181231f1eba4 --- /dev/null +++ b/examples/benchmark/glue/run_glue_trainer.py @@ -0,0 +1,333 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from paddle.metric import Accuracy + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + export_model, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + task_name: str = field( + default=None, + metadata={"help": "The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys())}, + ) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
+ """ + + model_name_or_path: str = field( + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + } + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + lora_rank: int = field(default=8, metadata={"help": "Lora rank"}) + lora_alpha: int = field(default=16, metadata={"help": "Lora alpha"}) + qat: bool = field(default=False, metadata={"help": "Whether to use QAT technique"}) + qat_type: str = field(default="A8W8", metadata={"help": "Quantization type. Supported values: A8W8, W4,A8W4"}) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"]} + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + data_args.task_name = data_args.task_name.strip().lower() + metric = METRIC_CLASSES[data_args.task_name]() + + train_ds = load_dataset("glue", data_args.task_name, splits="train") + if model_args.tokenizer_name_or_path: + tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path) + else: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=data_args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + + if data_args.task_name == "mnli": + dev_ds, dev_ds_mismatched = load_dataset("glue", data_args.task_name, splits=["dev_matched", "dev_mismatched"]) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + + test_ds, test_ds_mismatched = load_dataset( + "glue", data_args.task_name, splits=["test_matched", "test_mismatched"] + ) + + test_ds = test_ds.map(trans_func, lazy=True) + test_ds_mismatched = test_ds_mismatched.map(trans_func, lazy=True) + + else: + dev_ds = load_dataset("glue", data_args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + + test_ds = load_dataset("glue", data_args.task_name, splits="test") + test_ds = test_ds.map(trans_func, lazy=True) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + if model_args.lora: + # TODO: hardcode parameters for now. 
Change after MergedLoRA is introduced + lora_config = LoRAConfig( + target_modules=[ + ".*self_attn.q_proj.*", + ".*self_attn.k_proj.*", + ".*self_attn.v_proj.*", + ".*self_attn.out_proj.*", + ".*linear1.*", + ".*linear2.*", + ], + trainable_modules=[".*classifier.*"], + r=model_args.lora_rank, + lora_alpha=model_args.lora_alpha, + merge_weights=False, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + if model_args.qat: + from paddle import nn + from paddle.quantization import QAT, QuantConfig + from paddle.quantization.quanters import ( + FakeQuanterChannelWiseAbsMaxObserver, + FakeQuanterWithAbsMaxObserver, + ) + + from paddlenlp.peft.lora import LoRALinear + from paddlenlp.peft.lora.lora_quant_layers import QuantedLoRALinear + + q_config = QuantConfig(activation=None, weight=None) + q_config.add_qat_layer_mapping(LoRALinear, QuantedLoRALinear) + + if model_args.qat_type == "A8W8": + activation = FakeQuanterWithAbsMaxObserver(moving_rate=0.9, bit_length=8, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=8, dtype=dtype) + elif model_args.qat_type == "W4": + activation = None + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype=dtype) + elif model_args.qat_type == "A8W4": + activation = FakeQuanterWithAbsMaxObserver(moving_rate=0.9, bit_length=8, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype=dtype) + else: + raise ValueError("qat_type should be one of ['A8W8', 'W4', 'A8W4']") + + q_config.add_type_config(LoRALinear, weight=weight, activation=activation) + q_config.add_type_config(nn.Linear, weight=weight, activation=activation) + + qat = QAT(q_config) + model = qat.quantize(model, inplace=True) + + # Define the metrics of tasks. 
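+    # `compute_metrics` receives the Trainer's EvalPrediction-style object: `p.predictions`
+    # holds the logits (a tuple when the model returns several outputs) and `p.label_ids`
+    # holds the gold labels; the paddle metric object is reused through its
+    # compute / update / accumulate interface and reset between evaluations.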
+ def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric.reset() + result = metric.compute(preds, label) + metric.update(result) + res = metric.accumulate() + metric.reset() + if isinstance(metric, AccuracyAndF1): + return { + "accuracy": res[0], + "precision": res[1], + "recall": res[2], + "f1 score": res[3], + "accuracy and f1": res[4], + } + elif isinstance(metric, Mcc): + return {"mcc": res[0]} + elif isinstance(metric, PearsonAndSpearman): + return { + "pearson": res[0], + "spearman": res[1], + "pearson and spearman": res[2], + } + else: + return {"accuracy": res} + + trainer = Trainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + logger.info("*** Evaluate ***") + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + if data_args.task_name == "mnli": + eval_metrics = trainer.evaluate(dev_ds_mismatched) + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + logger.info("*** Predict ***") + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + if data_args.task_name == "mnli": + test_ret = trainer.predict(test_ds_mismatched) + trainer.log_metrics("test", test_ret.metrics) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + export_model(model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/examples/benchmark/lambada/eval.py b/examples/benchmark/lambada/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..8a9aa31043fb0850a388c36a7579f97009af59f9 --- /dev/null +++ b/examples/benchmark/lambada/eval.py @@ -0,0 +1,384 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
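+
+# This script evaluates a causal LM in one of two modes, selected by `--cloze_eval`:
+#   * default: `--eval_path` points to a raw text file (e.g. WikiText); the script reports
+#     loss / perplexity over overlapping windows controlled by `--overlapping_eval`;
+#   * `--cloze_eval`: `--eval_path` points to a JSON-lines file whose records contain a
+#     "text" field (LAMBADA style); the script reports last-word prediction accuracy.
+# Illustrative single-card invocation (model name and file path are placeholders):
+#   python eval.py --model_name_or_path <model> --eval_path lambada_test.jsonl --cloze_eval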
+from __future__ import annotations + +import argparse +import json +import math +import re +import time +from pprint import pprint as print + +# from paddle.distributed.apis import env +import numpy as np +import paddle +from paddle.distributed import fleet +from paddle.io import DataLoader + +from paddlenlp.data import Stack, Tuple +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +def get_parser(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_type", default=None, type=str, required=False, help="Model type selected in the list") + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: ", + ) + + # only support tensor_parallel_degree + parser.add_argument( + "--tensor_parallel_degree", + type=int, + default=1, + help="Model Parallelism degree. Spliting the linear layers to many cards.", + ) + + # Other config + parser.add_argument("--seed", type=int, default=1024, help="Random seed for initialization") + parser.add_argument("--sample_nums", type=int, default=16, help="Random seed for initialization") + parser.add_argument( + "--device", + type=str, + default="gpu", + choices=["cpu", "eval_pathgpu", "xpu", "npu"], + help="select cpu, gpu, xpu devices.", + ) + parser.add_argument( + "--dtype", + type=str, + default="float16", + choices=["bfloat16", "float16", "float32"], + help="set the dtype of model", + ) + + # load autodist name files, eg: bloom-176b + parser.add_argument("--load_autodist", action="store_true", help="whether load auto-dist wieght file") + + return parser + + +def get_eval_parser(): + parser = get_parser() + parser.add_argument( + "--eval_path", + default=None, + type=str, + required=True, + help="The eval file path.", + ) + parser.add_argument( + "--cloze_eval", action="store_true", help="Evaluation dataset from `--eval_path` is a cloze task." + ) + parser.add_argument("--overlapping_eval", type=int, default=32, help="Sliding window for overlapping eval.") + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument( + "--seq_length", type=int, default=512, help="Maximum sequence length to process for evaluation." 
+ ) + parser.add_argument("--logging_steps", type=int, default=10, help="logging step for eval") + return parser + + +class LM_Eval_Dataset(paddle.io.Dataset): + def __init__(self, tokens, seq_len, pad_idx, overlapping_eval=None): + self.tokens = tokens + self.seq_len = seq_len + self.pad_idx = pad_idx + self.overlapping_eval = overlapping_eval + if self.overlapping_eval is None: + self.overlapping_eval = self.seq_len + self.overlapping_eval = max(1, self.overlapping_eval) + + self.total_targets = len(self.tokens) - 1 + # remove first sequence tokens + targets = max(self.total_targets - self.overlapping_eval, 0) + self.total_sequences = max(math.ceil(targets / self.overlapping_eval) + 1, 1) + + def __len__(self): + return self.total_sequences + + def _construct_sample(self, tokens): + tokens = np.array(tokens).astype("int64").tolist() + labels = tokens[1:] + tokens = tokens[:-1] + seq_length = len(tokens) + # attention mask for the attention calulate + attention_mask = np.tri(seq_length, seq_length).reshape((1, seq_length, seq_length)) + + # the pad and eos tokens do not contribute the loss + loss_mask = np.ones(seq_length, dtype="float32") + loss_mask[np.where(np.array(tokens) == self.pad_idx)] = 0.0 + position_ids = np.arange(0, seq_length, dtype="int64") + + # -INF mask value as default + # attention_mask = (attention_mask - 1.0) * 1e9 + # Bool mask of attention + attention_mask = attention_mask.astype("float32") + return [tokens, loss_mask, attention_mask, position_ids, labels] + + def __getitem__(self, idx): + start_idx = idx * self.overlapping_eval + end_idx = start_idx + self.seq_len + tokens = self.tokens[start_idx : end_idx + 1] + num_tokens = len(tokens) + if num_tokens < self.seq_len + 1: + num_pad = self.seq_len + 1 - num_tokens + tokens += [self.pad_idx] * num_pad + [tokens, loss_mask, attention_mask, position_ids, labels] = self._construct_sample(tokens) + if self.overlapping_eval != self.seq_len and idx != 0: + loss_mask[: -self.overlapping_eval] *= 0 + + return [tokens, loss_mask, attention_mask, position_ids, labels] + + +class Lambada_Eval_Dataset(paddle.io.Dataset): + def __init__(self, tokens, labels, seq_len, pad_idx): + self.seq_len = seq_len + self.pad_idx = pad_idx + self.tokens = tokens + self.labels = labels + + def __len__(self): + return len(self.tokens) + + def _construct_sample(self, tokens): + tokens = np.array(tokens).astype("int64").tolist() + labels = tokens[1:] + tokens = tokens[:-1] + + seq_length = len(tokens) + # attention mask for the attention calulate + attention_mask = np.tri(seq_length, seq_length).reshape((1, seq_length, seq_length)) + + # the pad and eos tokens do not contribute the loss + position_ids = np.arange(0, seq_length, dtype="int64") + + # -INF mask value as default + # attention_mask = (attention_mask - 1.0) * 1e9 + # Bool mask of attention + attention_mask = attention_mask.astype("float32") + return [tokens, attention_mask, position_ids, labels] + + def __getitem__left_padding(self, idx): + tokens = self.tokens[idx][: self.seq_len] + labels = self.labels[idx] + tokens = tokens + labels + num_tokens = len(tokens) + if num_tokens < self.seq_len + 1: + num_pad = self.seq_len + 1 - num_tokens + # tokens += [self.pad_idx] * num_pad + tokens + tokens = [self.pad_idx] * num_pad + tokens + loss_mask = np.zeros(self.seq_len, dtype="float32") + loss_mask[-len(labels) :] = 1.0 + [tokens, attention_mask, position_ids, labels] = self._construct_sample(tokens) + return [tokens, loss_mask, attention_mask, position_ids, labels] + + def 
__getitem__(self, idx): + tokens = self.tokens[idx][: self.seq_len] + labels = self.labels[idx] + tokens = tokens + labels + + num_tokens = len(tokens) + if num_tokens < self.seq_len + 1: + num_pad = self.seq_len + 1 - num_tokens + tokens += [self.pad_idx] * num_pad + loss_mask = np.zeros(self.seq_len, dtype="float32") + loss_mask[num_tokens - len(labels) - 1 : num_tokens - 1] = 1.0 + [tokens, attention_mask, position_ids, labels] = self._construct_sample(tokens) + return [tokens, loss_mask, attention_mask, position_ids, labels] + + +def wikitext_detokenizer(string): + # contractions + string = string.replace("s '", "s'") + string = re.sub(r"/' [0-9]/", r"/'[0-9]/", string) + # number separators + string = string.replace(" @-@ ", "-") + string = string.replace(" @,@ ", ",") + string = string.replace(" @.@ ", ".") + # punctuation + string = string.replace(" : ", ": ") + string = string.replace(" ; ", "; ") + string = string.replace(" . ", ". ") + string = string.replace(" ! ", "! ") + string = string.replace(" ? ", "? ") + string = string.replace(" , ", ", ") + # double brackets + string = re.sub(r"\(\s*([^\)]*?)\s*\)", r"(\1)", string) + string = re.sub(r"\[\s*([^\]]*?)\s*\]", r"[\1]", string) + string = re.sub(r"{\s*([^}]*?)\s*}", r"{\1}", string) + string = re.sub(r"\"\s*([^\"]*?)\s*\"", r'"\1"', string) + string = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", string) + # miscellaneous + string = string.replace("= = = =", "====") + string = string.replace("= = =", "===") + string = string.replace("= =", "==") + string = string.replace(" " + chr(176) + " ", chr(176)) + string = string.replace(" \n", "\n") + string = string.replace("\n ", "\n") + string = string.replace(" N ", " 1 ") + string = string.replace(" 's", "'s") + return string + + +def get_tokens(tokenizer, text, strict=True): + if not strict: + tokens = tokenizer(text)["input_ids"] + return tokens[:-1], [tokens[-1]] + last_token = text.split()[-1] + start_idx = text.rfind(last_token) + beginning_tokens = tokenizer(text[:start_idx].strip())["input_ids"] + last_token = tokenizer(" " + last_token)["input_ids"] + return beginning_tokens, last_token + + +def create_eval_dataset(args): + val_dataloader = None + eval_batch_size = args.batch_size + seq_len = args.seq_length + + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + tokenizer.pad_token = tokenizer.eos_token if tokenizer.eos_token else "" + if not args.cloze_eval: + with open(args.eval_path, "rb") as reader: + entire_data = reader.read().decode("utf-8") + num_original_tokens = len(entire_data.strip().split(" ")) + entire_data = wikitext_detokenizer(entire_data) + tokenized_data = tokenizer(entire_data)["input_ids"] + num_tokenized_tokens = len(tokenized_data) + print("Original Tokens: %d, Detokenized tokens: %d" % (num_tokenized_tokens, num_original_tokens)) + val_dataset = LM_Eval_Dataset(tokenized_data, seq_len, tokenizer.pad_token_id, args.overlapping_eval) + else: + tokenized_data = [] + tokenized_label = [] + with open(args.eval_path, "r") as f: + for line in f.readlines(): + text = json.loads(line)["text"] + tokens, labels = get_tokens(tokenizer, text, strict=False) + tokenized_data.append(tokens) + tokenized_label.append(labels) + val_dataset = Lambada_Eval_Dataset(tokenized_data, tokenized_label, seq_len, tokenizer.pad_token_id) + num_tokenized_tokens = 0 + num_original_tokens = 0 + + args.num_examples = len(val_dataset) + args.num_original_tokens = num_original_tokens + args.num_tokenized_tokens = num_tokenized_tokens + val_dataloader = DataLoader( + 
val_dataset, + batch_size=eval_batch_size, + drop_last=False, + collate_fn=Tuple(Stack(), Stack(), Stack(), Stack(), Stack()), + ) + + return val_dataloader + + +def do_generation(): + + # env.set_seed(seed) + parser = get_eval_parser() + args = parser.parse_args() + paddle.set_default_dtype(args.dtype) + + if args.tensor_parallel_degree > 1: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "mp_degree": args.tensor_parallel_degree, + } + # Set control in tensor parallel + strategy.tensor_parallel_configs = {"tensor_init_seed": args.seed} + fleet.init(is_collective=True, strategy=strategy) + + eval_data_loader = create_eval_dataset(args) + + tic_eval = time.time() + + model = AutoModelForCausalLM.from_pretrained( + args.model_name_or_path, + tensor_parallel_output=False, + tensor_parallel_degree=args.tensor_parallel_degree, + tensor_parallel_rank=paddle.distributed.get_rank(), + use_flash_attention=False, + dtype=args.dtype, # todo enable set dtype to avoid additional mem usage + ) + + model.eval() + args.use_pure_fp16 = False + + total_score = 0 + score_name = "loss" if not args.cloze_eval else "number correct" + args.use_pure_fp16 = False + eval_data_loader = create_eval_dataset(args) + with paddle.no_grad(): + for step, batch in enumerate(eval_data_loader): + + tokens, loss_mask = batch[:2] + labels = batch[-1] + with paddle.amp.auto_cast(args.use_pure_fp16): + if args.model_type == "bloom": + preds = model(tokens).detach() + else: + preds = model(tokens)[0].detach() + # print(preds) + + # cast preds to float32 to keep high-precision + preds = preds.astype(paddle.float32) + + if not args.cloze_eval: + masked_lm_loss = paddle.nn.functional.cross_entropy(preds, labels, reduction="none") + loss = paddle.sum(masked_lm_loss * loss_mask) + total_score += float(loss) / (args.num_tokenized_tokens - 1) + else: + outputs = paddle.argmax(preds, -1) + acc = paddle.cast(outputs == labels, "float32") + acc = paddle.where(paddle.cast(loss_mask, "bool"), acc, paddle.ones_like(acc)) + acc = paddle.sum(paddle.prod(acc, -1)) + total_score += float(acc) + + if step % args.logging_steps == 0: + logger.info( + "step %d, batch: %d, %s: %f, speed: %.2f step/s" + % (step, step, score_name, total_score, args.logging_steps / (time.time() - tic_eval)) + ) + tic_eval = time.time() + + if not args.cloze_eval: + total_loss = float(total_score) + ppl = math.exp(min(20, total_loss)) + token_ratio = (args.num_tokenized_tokens - 1) / (args.num_original_tokens - 1) + adjusted_ppl = math.exp(min(20, total_loss * token_ratio)) + string = " validation results on {} | ".format(args.eval_path) + string += "avg loss: {:.4E} | ".format(total_loss) + string += "ppl: {:.4E} | ".format(ppl) + string += "adjusted ppl: {:.4E} | ".format(adjusted_ppl) + string += "token ratio: {} |".format(token_ratio) + else: + num_correct = float(total_score) + acc = float(num_correct / args.num_examples) + string = " validation results on {} | ".format(args.eval_path) + string += "number correct: {:.4E} | ".format(num_correct) + string += "total examples: {:.4E} | ".format(args.num_examples) + string += "avg accuracy: {:.4E}".format(acc) + logger.info(string) + + +if __name__ == "__main__": + do_generation() \ No newline at end of file diff --git a/examples/benchmark/peft/README.md b/examples/benchmark/peft/README.md new file mode 100644 index 0000000000000000000000000000000000000000..841e7f880af8d9b43c56cd062a5bf5bbaa139bcf --- /dev/null +++ b/examples/benchmark/peft/README.md @@ -0,0 +1,97 @@ +# Benchmark Results + +### 配置 
+ +- 硬件: A100-80G with NVLink, 具体卡数见表 +- Torch环境: 见torch/requirements.txt +- FP16配置: torch 使用 cuda amp fp16, paddle 使用 fp16 O2 opt level, intokens 设置为 1024, 并开启了flash attention + +### Bloom + +- 数据: 10k条[Chinese-Vicuna/guanaco_belle_merge_v1.0](https://huggingface.co/datasets/Chinese-Vicuna/guanaco_belle_merge_v1.0) + +| Model | Method | Num GPUs | Batch Size | Paddle Setup | Paddle Effective Tokens/s | Torch Setup | Torch Effective Tokens/s | **Speedup** | +|---------------|----------|----------|------------|--------------|---------------------------|-------------|--------------------------|---------| +| Bloomz-7b1-mt | LoRA | 1 | 4 | | 4097.03 | | 1980.32 | **107%** | +| Bloomz-7b1-mt | Finetune | 4 | 8 | MP 4 | 4136.69 | ZeRO 3 | 1702.00 | **143%** | +| Bloomz-7b1-mt | Finetune | 4 | 16 | MP 4 | 4359.72 | ZeRO 3 | 2849.90 | **53%** | + +###### 多卡分布式实验记录 + +- Finetuning with 4 GPUs + +| Model | Setup | Paddle Effective Tokens /s | Torch Effective Tokens /s | Speedup | +|----------------|-----------------|----------------------------|----------------------------|-----------| +| Bloomz-7b1-mt | bsz 8 MP4 | 7421.09 | N/A | N/A | +| Bloomz-7b1-mt | bsz 8 ZeRO 3 | 6063.23 | 1702.00 | 256% | +| Bloomz-7b1-mt | bsz 8 ZeRO 2 | 5191.47 | 1891.16 | 175% | +| Bloomz-7b1-mt | bsz 16 MP4 | 8214.55 | N/A | N/A | +| Bloomz-7b1-mt | bsz 16 ZeRO 3 | 5822.23 | 2849.90 | 104 | +| Bloomz-7b1-mt | bsz 16 ZeRO 2 | 5572.13 | 2719.92 | 105% | + + +### Llama + +- 数据: 使用10k条[tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) + +| Model | Method | Num GPUs | Batch Size | Paddle Setup | Paddle Effective Tokens/s | Torch Setup | Torch Effective Tokens/s | Speedup | +|-----------|----------|----------|-------------|--------------|---------------------------|-------------|--------------------------|---------| +| Llama-7b | LoRA | 1 | 4 | | 4406.23 | | 1895.90 | **132%** | +| Llama-13b | LoRA | 1 | 4 | | 1975.94 | | 1101.85 | **79%** | +| Llama-13b | LoRA | 1 | 8 | recompute | 1869.60 | gradient ckpt | 768.26 | **143%** | +| Llama-7b | Finetune | 4 | 8 | MP4 | 3275.90 | ZeRO 2 | 1621.52 | **102%** | +| Llama-7b | Finetune | 4 | 16 | sharding 2 | 6798.72 | ZeRO 2 | 2465.55 | **176%** | +| Llama-13b | Finetune | 4 | 8 | MP4 recompute| 1938.19 | ZeRO 3 | 736.19 | **127%** | +| Llama-65b | LoRA | 4 | 8 | MP4 recompute| 840.57 | gradient ckpt, bits 4, max_memory_MB 50000, qlora | 327.75 | **156%** | +| Llama-65b | LoRA | 4 | 16 | MP4 recompute| 993.38 | gradient ckpt, bits 4, max_memory_MB 50000, qlora | 405.90 | **122%** | + + +###### 多卡分布式实验记录 + +- Finetuning with 4 GPUs + +| Model | Setup | Paddle Effective Tokens /s | Torch Effective Tokens /s | Speedup | +|-----------|---------------|----------------------------|----------------------------|-----------| +| LLaMA-7b | bsz 8 MP4 | **3841.61** | N/A | N/A | +| LLaMA-7b | bsz 8 ZeRO 3 | 4189.43 | 1177.93 | 256% | +| LLaMA-7b | bsz 8 ZeRO 2 | 4611.10 | 1621.52 | 184% | +| LLaMA-7b | bsz 16 (4*4) MP4 | 4829.47 | N/A | N/A | +| LLaMA-7b | bsz 16 ZeRO 3 | 4048.61 | 2268.16 | 78% | +| LLaMA-7b | bsz 16 ZeRO 2 | **3463.45** | 2465.55 | 40% | +| LLaMA-13b | bsz 8 MP4 recompute | **2509.50** | N/A | N/A | +| LLaMA-13b | bsz 8 ZeRO 3 | 1867.99 | 736.19 | 154% | +| LLaMA-13b | bsz 8 ZeRO 2 | 1201.75 | OOM | N/A | + + +### ChatGLM + +| Model | Method | Num GPUs | Batch Size | Paddle Setup | Paddle Effective Tokens/s | Torch Setup | Torch Effective Tokens/s | Speedup | 
+|---------------|----------|----------|------------|--------------|---------------------------|-------------|--------------------------|---------| +| chatglm-6b | LoRA | 1 | 4 | | 4216.76 | | 1866.48 | **126%** | +| chatglm-6b | Finetune | 4 | 8 | MP 4 | 3799.78 | ZeRO 2 | 2124.17 | **79%** | +| chatglm-6b | Finetune | 4 | 16 | MP 4 | 5720.21 | ZeRO 3 | 3191.35 | **79%** | + + +###### 多卡分布式实验记录 + +- Finetuning with 4 GPUs + +| Model | Setup | Paddle Effective Tokens /s | Torch Effective Tokens /s | Speedup | +|-----------|-----------------|----------------------------|----------------------------|-----------| +| chatglm-6b | bsz 8 MP4 | 4564.94 | N/A | N/A | +| chatglm-6b | bsz 8 ZeRO 3 | 6480.36 | 1840.99 | 252% | +| chatglm-6b | bsz 8 ZeRO 2 | 4707.74 | 2124.17 | 122% | +| chatglm-6b | bsz 16 MP4 | 4972.21 | N/A | N/A | +| chatglm-6b | bsz 16 ZeRO 3 | 5282.28 | 3184.26 | 66% | +| chatglm-6b | bsz 16 ZeRO 2 | 5751.00 | 3151.07 | 83% | + + +### GPT 3 + +| Model | Method | Num GPUs | Batch Size | Paddle Setup | Paddle Effective Tokens/s | Torch Setup | Torch Effective Tokens/s | Speedup | +|---------------|----------|----------|------------|--------------|---------------------------|-------------|--------------------------|---------| +| gpt3-6.7b | LoRA | 1 | 4 | | 3450.06 | | 1186.74 | **191%**| +| gpt3-13b | LoRA | 1 | 4 | | 2008.40 | | 969.60 | **107%**| +| gpt3-6.7b | Finetune | 4 | 8 | MP 4 | 3301.49 | ZeRO 2 | 1441.65 | **129%**| +| gpt3-13b | Finetune | 4 | 8 | MP 4 | 1890.38 | ZeRO 2 | 783.26 | **141%**| +| gpt3-6.7b | Finetune | 4 | 16 | MP 4 | 4666.19 | ZeRO 3 | 1756.42 | **166%**| diff --git a/examples/benchmark/peft/paddle/benchmark.py b/examples/benchmark/peft/paddle/benchmark.py new file mode 100644 index 0000000000000000000000000000000000000000..1849dd9d73127e8bb35f5d53458dda55864dabf3 --- /dev/null +++ b/examples/benchmark/peft/paddle/benchmark.py @@ -0,0 +1,328 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
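+
+# This script fine-tunes the model given by `--model_name_or_path` (optionally with LoRA via
+# `--lora`, sequence packing via `--intokens` and flash attention via `--use_flash_attention`)
+# on a sample of an instruction-tuning dataset (`--train_data_size` examples) in order to
+# measure training throughput, as compared in ../README.md. Example launch commands for
+# single-card, tensor-parallel and sharding runs are listed in the docstring below.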
+ +from dataclasses import dataclass, field +from typing import Optional + +import numpy as np +import paddle.profiler as profiler +from datasets import load_dataset +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from utils import CustomTrainer, ProfilerCallback + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.datasets import InTokensMapDataset +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.trainer import PdArgumentParser, TrainingArguments +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer, GPTForCausalLM + +""" +单卡 +python benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt \ + --num_train_epochs 1 --per_device_train_batch_size 4 \ + --evaluation_strategy no --save_strategy no \ + --fp16 --fp16_opt_level O2 --lora \ + --logging_steps 50 --output_dir outputs + +多卡mp +python -m paddle.distributed.launch --gpus "0,1,2,3" benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt \ + --num_train_epochs 1 --per_device_train_batch_size 8 \ + --evaluation_strategy no --save_strategy no \ + --fp16 --fp16_opt_level O2 --tensor_parallel_degree 4 \ + --logging_steps 50 --output_dir outputs + +多卡sharding 3 +python -m paddle.distributed.launch --gpus "0,1,2,3" benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt \ + --num_train_epochs 1 --per_device_train_batch_size 4 \ + --evaluation_strategy no --save_strategy no \ + --fp16 --fp16_opt_level O2 \ + --sharding "stage3" --sharding_parallel_degree 4 \ + --logging_steps 50 --output_dir outputs +""" + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch. + """ + + model_name_or_path: str = field(default=None, metadata={"help": "model name or local path"}) + lora: Optional[bool] = field(default=False, metadata={"help": "whether to use LoRA"}) + english: Optional[bool] = field(default=False, metadata={"help": "whether to english benchmark dataset"}) + profiler: Optional[bool] = field(default=False, metadata={"help": "whether to use profiler"}) + train_data_size: int = field(default=1000, metadata={"help": "Number of dataset for training"}) + intokens_length: int = field(default=2048, metadata={"help": "Intokens length"}) + intokens: Optional[bool] = field(default=False, metadata={"help": "whether to use intokens"}) + use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + + +def main(): + parser = PdArgumentParser((ModelArguments, TrainingArguments)) + model_args, training_args = parser.parse_args_into_dataclasses() + + # Set the dtype for loading model + dtype = None + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + else: + dtype = "float32" + if model_args.model_name_or_path in ["gpt3-6.7B-en", "gpt3-13B-en"]: + tokenizer = AutoTokenizer.from_pretrained("gpt3-13B-en") + else: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + if "llama" in model_args.model_name_or_path or "Baichuan" in model_args.model_name_or_path: + tokenizer.pad_token = tokenizer.unk_token + + if model_args.model_name_or_path in ["gpt3-6.7B-en", "gpt3-13B-en"]: + model = GPTForCausalLM.from_pretrained( + model_args.model_name_or_path, + low_cpu_mem_usage=True, + use_flash_attention=model_args.use_flash_attention, + dtype=dtype, + tensor_parallel_degree=training_args.tensor_parallel_degree, + 
tensor_parallel_rank=training_args.tensor_parallel_rank, + ) + tracker = get_rng_state_tracker() + tracker.add("global_seed", 111) + tracker.add("local_seed", 222) + else: + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + low_cpu_mem_usage=True, + use_flash_attention=model_args.use_flash_attention, + dtype=dtype, + tensor_parallel_degree=training_args.tensor_parallel_degree, + tensor_parallel_rank=training_args.tensor_parallel_rank, + ) + + if model_args.lora: + if "llama" in model_args.model_name_or_path or "Baichuan" in model_args.model_name_or_path: + target_modules = [".*q_proj.*", ".*k_proj.*", ".*v_proj.*"] + elif model_args.model_name_or_path in ["gpt3-6.7B-en", "gpt3-13B-en"]: + target_modules = [ + ".*qkv_proj.*", + ".*q_proj.*", + ".*k_proj.*", + ".*v_proj.*", + ".*linear1.*", + ".*linear2.*", + ".*out_proj.*", + ] + elif "chatglm2" in model_args.model_name_or_path: + target_modules = [ + ".*query.*", + ".*key.*", + ".*value.*", + ".*dense.*", + ".*dense_h_to_4h.*", + ".*dense_4h_to_h.*", + ] + else: + target_modules = [".*query_key_value.*"] + + lora_config = LoRAConfig( + target_modules=target_modules, + r=8, + lora_alpha=32, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + def preprocess_function(example, max_src_length=256, max_tgt_length=384, intokens=False): + inputs = example["instruction"] + targets = example["output"] + if "input" in example: + inputs += example["input"] + model_inputs = tokenizer(inputs, max_length=max_src_length, truncation=True, return_attention_mask=False) + labels = tokenizer(targets, max_length=max_tgt_length, truncation=True, return_attention_mask=False) + labels_input_ids = labels["input_ids"] + [tokenizer.eos_token_id] + model_inputs["labels"] = [-100] * len(model_inputs["input_ids"]) + labels_input_ids + model_inputs["input_ids"] = model_inputs["input_ids"] + labels_input_ids + # shift input and labels + model_inputs["input_ids"] = model_inputs["input_ids"][:-1] + model_inputs["labels"] = model_inputs["labels"][1:] + seq_length = len(model_inputs["input_ids"]) + model_inputs["position_ids"] = list(range(seq_length)) + if intokens: + model_inputs["attention_mask"] = np.tril(np.ones([seq_length, seq_length], dtype=bool)) + return model_inputs + + def preprocess_function_chatglm(example, max_src_length=256, max_tgt_length=384, intokens=False): + inputs = example["instruction"] + targets = example["output"] + if "input" in example: + inputs += example["input"] + model_inputs = tokenizer(inputs, max_length=max_src_length, truncation=True, return_attention_mask=False) + labels = tokenizer(targets, max_length=max_tgt_length, truncation=True, return_attention_mask=False) + labels_input_ids = labels["input_ids"] + [tokenizer.eos_token_id] + model_inputs["labels"] = [-100] * len(model_inputs["input_ids"]) + labels_input_ids + model_inputs["input_ids"] = model_inputs["input_ids"] + labels_input_ids + # shift input and labels + model_inputs["input_ids"] = model_inputs["input_ids"][:-1] + model_inputs["labels"] = model_inputs["labels"][1:] + + if intokens: + context_length = model_inputs["input_ids"].index(tokenizer.bos_token_id) + seq_length = len(model_inputs["input_ids"]) + position_ids = np.arange(seq_length, dtype=np.int64) + block_position_ids = np.concatenate( + [ + np.zeros(context_length, dtype=np.int64), + np.arange(1, seq_length - context_length + 1, dtype=np.int64), + ] + ) + model_inputs["position_ids"] = 
np.stack([position_ids, block_position_ids], axis=0) + attention_mask = np.tri(seq_length, seq_length, dtype=bool) + attention_mask[:, :context_length] = 1 + model_inputs["attention_mask"] = attention_mask + + return model_inputs + + def preprocess_function_bloom(example, max_src_length=256, max_tgt_length=384, intokens=False): + inputs = example["instruction"] + targets = example["output"] + if "input" in example: + inputs += example["input"] + model_inputs = tokenizer(inputs, max_length=max_src_length, truncation=True, return_attention_mask=False) + labels = tokenizer(targets, max_length=max_tgt_length, truncation=True, return_attention_mask=False) + labels_input_ids = labels["input_ids"] + [tokenizer.eos_token_id] + model_inputs["labels"] = [-100] * len(model_inputs["input_ids"]) + labels_input_ids + model_inputs["input_ids"] = model_inputs["input_ids"] + labels_input_ids + # shift input and labels + model_inputs["input_ids"] = model_inputs["input_ids"][:-1] + model_inputs["labels"] = model_inputs["labels"][1:] + + if intokens: + model_inputs["attention_mask"] = np.tril( + np.ones([len(model_inputs["input_ids"]), len(model_inputs["input_ids"])], dtype=bool) + ) + return model_inputs + + def preprocess_function_gpt(example, max_source_length=256, max_target_length=384, intokens=False): + """ + Convert an example into necessary features. + """ + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. 
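+        # In this benchmark, however, the instruction and output are simply tokenized with
+        # truncation, concatenated, and shifted by one position; no sliding-window / stride
+        # features are produced.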
+ inputs = example["instruction"] + targets = example["output"] + if "input" in example: + inputs += example["input"] + + input_seq = inputs + output_seq = targets + + outputs = tokenizer( + output_seq, + max_length=max_target_length, + # pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=False, + return_token_type_ids=False, + ) + inputs = tokenizer( + input_seq, + max_length=max_source_length, + # pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=False, + return_length=False, + ) + + final = {} + for k in outputs.keys(): + final[k] = inputs[k] + outputs[k] + if k == "input_ids": + final["labels"] = [tokenizer.pad_token_id] * len(inputs["input_ids"]) + outputs[k] + + # shift inputs and labels + final["input_ids"] = final["input_ids"][:-1] + final["labels"] = final["labels"][1:] + return final + + if model_args.english: + dataset = load_dataset("tatsu-lab/alpaca") + else: + dataset = load_dataset("Chinese-Vicuna/guanaco_belle_merge_v1.0") + + # select first 10k examples for benchmarking + dataset = dataset["train"].select(range(model_args.train_data_size)) + if "chatglm2" in model_args.model_name_or_path: + dataset = dataset.map( + lambda example: preprocess_function(example, intokens=model_args.intokens), + ) + elif "chatglm" in model_args.model_name_or_path: + dataset = dataset.map( + lambda example: preprocess_function_chatglm(example, intokens=model_args.intokens), + ) + elif "bloom" in model_args.model_name_or_path: + + dataset = dataset.map( + lambda example: preprocess_function_bloom(example, intokens=model_args.intokens), + ) + elif model_args.model_name_or_path in ["gpt3-6.7B-en", "gpt3-13B-en"]: + dataset = dataset.map( + lambda example: preprocess_function_gpt(example, intokens=model_args.intokens), + ) + else: + dataset = dataset.map(lambda example: preprocess_function(example, intokens=model_args.intokens)) + total_effective_tokens = sum([len(i["input_ids"]) for i in dataset]) * training_args.num_train_epochs + if model_args.intokens: + dataset = InTokensMapDataset( + dataset, + tokenizer=tokenizer, + max_length=model_args.intokens_length, + ) + if model_args.profiler: + prof = profiler.Profiler( + targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU], + profile_memory=True, + scheduler=profiler.make_scheduler(closed=1, ready=2, record=1, repeat=1), + on_trace_ready=profiler.export_chrome_tracing("./log"), + ) + if model_args.model_name_or_path in ["gpt3-6.7B-en", "gpt3-13B-en"]: + data_collator = DataCollatorForSeq2Seq( + return_tensors="pd", tokenizer=tokenizer, label_pad_token_id=tokenizer.pad_token_id + ) + else: + data_collator = DataCollatorForSeq2Seq(return_tensors="pd", tokenizer=tokenizer) + + trainer = CustomTrainer( + model=model, + tokenizer=tokenizer, + train_dataset=dataset, + callbacks=[ProfilerCallback(prof)] if model_args.profiler else [], + args=training_args, + data_collator=data_collator, + ) + + train_metrics = trainer.train() + tokens_per_second = trainer.total_observed_tokens / train_metrics.metrics["train_runtime"] + effective_tokens_per_second = total_effective_tokens / train_metrics.metrics["train_runtime"] + print(f"Tokens per second: {tokens_per_second:.2f}") + print(f"Effective Tokens per second: {effective_tokens_per_second:.2f}") + + +if __name__ == "__main__": + main() diff --git a/examples/benchmark/peft/paddle/inference_benchmark.py b/examples/benchmark/peft/paddle/inference_benchmark.py new file mode 100644 index 
0000000000000000000000000000000000000000..287e9e13d14b91b82040006b3a0604eb96b2eecf --- /dev/null +++ b/examples/benchmark/peft/paddle/inference_benchmark.py @@ -0,0 +1,99 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time + +import paddle + +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer + + +def parse_args(prog=None): + """ + parse_args + """ + parser = argparse.ArgumentParser(prog=prog) + parser.add_argument("--model_name_or_path", type=str, help="model name or local path", required=True) + parser.add_argument("--do_forward", action="store_true", help="fowrward test") + parser.add_argument("--do_generate", action="store_true", help="generate test") + parser.add_argument("--dtype", type=str, default="float16", choices=["float16", "float32"], help="set dtype of model", required=True) + return parser.parse_args() + + +@paddle.no_grad() +def predict_generate(model, inputs): + for i in range(10): + start = time.perf_counter() + result = model.generate( + **inputs, + max_length=100, + decode_strategy="greedy_search", + use_cache=True, + ) + hf_cost = (time.perf_counter() - start) * 1000 + print("Speed test:", hf_cost) + infer_data = result[0] + for x in infer_data.tolist(): + res = tokenizer.decode(x, skip_special_tokens=True) + print(res) + + +@paddle.no_grad() +def predict_forward(model, inputs): + for i in range(10): + start = time.perf_counter() + _ = model(**inputs) + hf_cost = (time.perf_counter() - start) * 1000 + print("Speed test:", hf_cost) + + +if __name__ == "__main__": + args = parse_args() + all_texts = [ + "你好", + "去年9月,拼多多海外版“Temu”正式在美国上线。数据显示,截至2023年2月23日,Temu App新增下载量4000多万,新增用户增速第一,AppStore购物榜霸榜69天、Google Play购物榜霸榜114天。", + ] + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = AutoModelForCausalLM.from_pretrained( + args.model_name_or_path, + load_state_as_np=True, + low_cpu_mem_usage=True, + dtype=args.dtype, + ) + if model.base_model_prefix == "llama": + tokenizer.pad_token = tokenizer.unk_token + model.eval() + + if args.do_forward: + for input_text in all_texts: + print(f"text: {input_text}") + inputs = tokenizer([input_text], return_tensors="pd", return_token_type_ids=False) + predict_forward(model, inputs) + + if args.do_generate: + for input_text in all_texts: + print(f"text: {input_text}") + _inputs = tokenizer( + input_text, + padding=True, + return_tensors="np", + max_length=50, + return_attention_mask=True, + return_position_ids=True, + ) + inputs_tensor = {} + for key, value in _inputs.items(): + inputs_tensor[key] = paddle.to_tensor(value) + predict_generate(model, inputs_tensor) diff --git a/examples/benchmark/peft/paddle/utils.py b/examples/benchmark/peft/paddle/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..70d8a8824cdaa108b7837e509d94ec37e098814d --- /dev/null +++ b/examples/benchmark/peft/paddle/utils.py @@ -0,0 +1,49 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from paddlenlp.trainer import ( + Trainer, + TrainerCallback, + TrainerControl, + TrainerState, + TrainingArguments, +) + + +class CustomTrainer(Trainer): + total_observed_tokens = 0.0 + + def training_step(self, model, inputs): + input_ids = inputs["input_ids"] + self.total_observed_tokens += float(input_ids.shape[0] * input_ids.shape[1]) + return super().training_step(model, inputs) + + +class ProfilerCallback(TrainerCallback): + "A callback that prints a message at the beginning of training" + + def __init__(self, prof): + self.prof = prof + self.prof.start() + + def on_train_begin(self, args, state, control, **kwargs): + print("Starting training") + + def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): + self.prof.step() + + def on_train_end(self, args, state, control, **kwargs): + self.prof.stop() + self.prof.summary() diff --git a/examples/benchmark/peft/torch/benchmark.py b/examples/benchmark/peft/torch/benchmark.py new file mode 100644 index 0000000000000000000000000000000000000000..7e4d7e1bd634d02e7affe50c39b3af306eee59bc --- /dev/null +++ b/examples/benchmark/peft/torch/benchmark.py @@ -0,0 +1,226 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +from dataclasses import dataclass, field +from typing import Optional + +import torch +import torch.profiler as profiler +from datasets import load_dataset +from peft import LoraConfig, TaskType, get_peft_model +from transformers import ( + AutoModel, + AutoModelForCausalLM, + AutoTokenizer, + BitsAndBytesConfig, + DataCollatorForSeq2Seq, + HfArgumentParser, + LlamaTokenizer, + TrainingArguments, +) +from utils import CustomTrainer, ProfilerCallback + +""" +单卡 +python benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt \ + --num_train_epochs 1 --per_device_train_batch_size 4 \ + --evaluation_strategy no --save_strategy no \ + --fp16 --lora \ + --logging_steps 50 --output_dir outputs + +多卡 deepspeed zero3 +python -m torch.distributed.run --nproc_per_node=4 benchmark.py --deepspeed ds_config.json \ + --model_name_or_path bigscience/bloomz-7b1-mt \ + --num_train_epochs 1 --per_device_train_batch_size 2 \ + --evaluation_strategy no --save_strategy no \ + --fp16 \ + --logging_steps 50 --output_dir outputs +""" + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch. 
+ """ + + model_name_or_path: str = field(default=None, metadata={"help": "model name or local path"}) + lora: Optional[bool] = field(default=False, metadata={"help": "whether to use LoRA"}) + qlora: Optional[bool] = field(default=False, metadata={"help": "whether to use qLoRA"}) + english: Optional[bool] = field(default=False, metadata={"help": "whether to english benchmark dataset"}) + profiler: Optional[bool] = field(default=False, metadata={"help": "whether to use profiler"}) + double_quant: bool = field( + default=True, metadata={"help": "Compress the quantization statistics through double quantization."} + ) + quant_type: str = field( + default="nf4", metadata={"help": "Quantization data type to use. Should be one of `fp4` or `nf4`."} + ) + bits: int = field(default=4, metadata={"help": "How many bits to use."}) + max_memory_MB: int = field(default=80000, metadata={"help": "Free memory per gpu."}) + train_data_size: int = field(default=1000, metadata={"help": "Number of dataset for training"}) + + +def main(): + parser = HfArgumentParser((ModelArguments, TrainingArguments)) + model_args, training_args = parser.parse_args_into_dataclasses() + + if "llama" in model_args.model_name_or_path: + tokenizer = LlamaTokenizer.from_pretrained(model_args.model_name_or_path, use_fast=False) + tokenizer.pad_token_id = 0 + elif model_args.model_name_or_path in ["cerebras/Cerebras-GPT-13B", "stanford-crfm/levanter-gpt2-7B"]: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, use_fast=False) + tokenizer.pad_token_id = 0 + else: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True) + + compute_dtype = torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32) + if "chatglm" in model_args.model_name_or_path: + # Add empty_init=False for zero3 training, refer to https://github.com/THUDM/ChatGLM-6B/issues/530 + model = AutoModel.from_pretrained( + model_args.model_name_or_path, + empty_init=False if training_args.deepspeed is not None else True, + trust_remote_code=True, + torch_dtype="auto", + ) + + else: + if model_args.qlora: + n_gpus = torch.cuda.device_count() + max_memory = f"{model_args.max_memory_MB}MB" + max_memory = {i: max_memory for i in range(n_gpus)} + device_map = "auto" + + # if we are in a distributed setting, we need to set the device map and max memory per device + if os.environ.get("LOCAL_RANK") is not None: + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + device_map = {"": local_rank} + max_memory = {"": max_memory[local_rank]} + + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + torch_dtype="auto", + load_in_4bit=model_args.bits == 4, + load_in_8bit=model_args.bits == 8, + device_map=device_map, + max_memory=max_memory, + quantization_config=BitsAndBytesConfig( + load_in_4bit=model_args.bits == 4, + load_in_8bit=model_args.bits == 8, + llm_int8_threshold=6.0, + llm_int8_has_fp16_weight=False, + bnb_4bit_compute_dtype=compute_dtype, + bnb_4bit_use_double_quant=model_args.double_quant, + bnb_4bit_quant_type=model_args.quant_type, + ), + ) + elif model_args.model_name_or_path in ["cerebras/Cerebras-GPT-13B", "stanford-crfm/levanter-gpt2-7B"]: + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + torch_dtype=torch.float16, + ) + else: + model = AutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + torch_dtype="auto", + ) + if model_args.lora: + if "llama" in model_args.model_name_or_path: + 
target_modules = ["q_proj", "k_proj", "v_proj"] + elif model_args.model_name_or_path in ["cerebras/Cerebras-GPT-13B", "stanford-crfm/levanter-gpt2-7B"]: + target_modules = [ + ".*c_attn.*", + ".*q_attn.*", + ".*c_proj.*", + ".*c_fc.*", + ] + else: + target_modules = ["query_key_value"] + peft_config = LoraConfig( + task_type=TaskType.CAUSAL_LM, target_modules=target_modules, r=8, lora_alpha=32, lora_dropout=0.0 + ) + model = get_peft_model(model, peft_config) + model.print_trainable_parameters() + + if model_args.lora and training_args.gradient_checkpointing: + # For backward compatibility + if hasattr(model, "enable_input_require_grads"): + model.enable_input_require_grads() + else: + + def make_inputs_require_grad(module, input, output): + output.requires_grad_(True) + + model.get_input_embeddings().register_forward_hook(make_inputs_require_grad) + + # enable gradient checkpointing for memory efficiency + model.gradient_checkpointing_enable() + + def preprocess_function(example, max_src_length=256, max_tgt_length=384): + inputs = example["instruction"] + if "input" in example: + inputs += example["input"] + targets = example["output"] + model_inputs = tokenizer(inputs, max_length=max_src_length, truncation=True, return_attention_mask=False) + + labels = tokenizer(targets, max_length=max_tgt_length, truncation=True, return_attention_mask=False) + labels_input_ids = labels["input_ids"] + [tokenizer.eos_token_id] + + model_inputs["labels"] = [-100] * len(model_inputs["input_ids"]) + labels_input_ids + model_inputs["input_ids"] = model_inputs["input_ids"] + labels_input_ids + return model_inputs + + if model_args.english: + dataset = load_dataset("tatsu-lab/alpaca") + else: + dataset = load_dataset("Chinese-Vicuna/guanaco_belle_merge_v1.0") + + # select first 10k examples for benchmarking + dataset = dataset["train"].select(range(model_args.train_data_size)) + dataset = dataset.map( + lambda example: preprocess_function(example), remove_columns=["instruction", "input", "output"] + ) + total_effective_tokens = sum([len(i["input_ids"]) for i in dataset]) * training_args.num_train_epochs + + if model_args.profiler: + prof = profiler.profile( + activities=[ + torch.profiler.ProfilerActivity.CPU, + torch.profiler.ProfilerActivity.CUDA, + ], + schedule=torch.profiler.schedule(wait=1, warmup=1, active=2, repeat=1), + on_trace_ready=torch.profiler.tensorboard_trace_handler("hf-training-trainer"), + profile_memory=True, + with_stack=True, + ) + + data_collator = DataCollatorForSeq2Seq(return_tensors="pt", tokenizer=tokenizer) + + trainer = CustomTrainer( + model=model, + tokenizer=tokenizer, + train_dataset=dataset, + callbacks=[ProfilerCallback(prof=prof)] if model_args.profiler else [], + args=training_args, + data_collator=data_collator, + ) + model.config.use_cache = False # silence the warnings. Please re-enable for inference! 
+ train_metrics = trainer.train() + tokens_per_second = trainer.total_observed_tokens / train_metrics.metrics["train_runtime"] + effective_tokens_per_second = total_effective_tokens / train_metrics.metrics["train_runtime"] + print(f"Tokens per second: {tokens_per_second:.2f}") + print(f"Effective Tokens per second: {effective_tokens_per_second:.2f}") + + +if __name__ == "__main__": + main() diff --git a/examples/benchmark/peft/torch/ds_config_stage2.json b/examples/benchmark/peft/torch/ds_config_stage2.json new file mode 100644 index 0000000000000000000000000000000000000000..da73ccf988b937e55256a25f3bcb3f402dff320d --- /dev/null +++ b/examples/benchmark/peft/torch/ds_config_stage2.json @@ -0,0 +1,35 @@ +{ + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "gradient_accumulation_steps": "auto", + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + "fp16": { + "enabled": "auto" + }, + "zero_optimization": { + "stage": 2, + "allgather_partitions": true, + "allgather_bucket_size": 5e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 5e8, + "contiguous_gradients": true + } +} + diff --git a/examples/benchmark/peft/torch/ds_config_stage3.json b/examples/benchmark/peft/torch/ds_config_stage3.json new file mode 100644 index 0000000000000000000000000000000000000000..f988bc0c66900f44b19025d63716b291420d5fc5 --- /dev/null +++ b/examples/benchmark/peft/torch/ds_config_stage3.json @@ -0,0 +1,37 @@ +{ + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "gradient_accumulation_steps": "auto", + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + "fp16": { + "enabled": "auto" + }, + "zero_optimization": { + "stage": 3, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + } +} \ No newline at end of file diff --git a/examples/benchmark/peft/torch/inference_benchmark.py b/examples/benchmark/peft/torch/inference_benchmark.py new file mode 100644 index 0000000000000000000000000000000000000000..85b23014bc36d28e0bcc4a484ec27b5a4379a44b --- /dev/null +++ b/examples/benchmark/peft/torch/inference_benchmark.py @@ -0,0 +1,80 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
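+
+# Example usage (flags defined by the argument parser below; the model name is only an
+# illustration):
+#   python inference_benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt --do_forward
+#   python inference_benchmark.py --model_name_or_path bigscience/bloomz-7b1-mt --do_generate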
+ +import argparse +import time + +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + + +def parse_args(prog=None): + """ + parse_args + """ + parser = argparse.ArgumentParser(prog=prog) + parser.add_argument("--model_name_or_path", type=str, help="model name or local path", required=True) + parser.add_argument("--do_forward", action="store_true", help="fowrward test") + parser.add_argument("--do_generate", action="store_true", help="generate test") + return parser.parse_args() + + +@torch.no_grad() +def predict_generate(model, inputs): + for i in range(10): + start = time.perf_counter() + generate_ids = model.generate( + inputs.input_ids, + max_length=100, + do_sample=False, + ) + hf_cost = (time.perf_counter() - start) * 1000 + print("Speed test:", hf_cost) + result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] + print(result) + + +@torch.no_grad() +def predict_forward(model, inputs): + for i in range(10): + start = time.perf_counter() + _ = model(**inputs) + hf_cost = (time.perf_counter() - start) * 1000 + print("Speed test:", hf_cost) + + +if __name__ == "__main__": + args = parse_args() + all_texts = [ + "你好", + "去年9月,拼多多海外版“Temu”正式在美国上线。数据显示,截至2023年2月23日,Temu App新增下载量4000多万,新增用户增速第一,AppStore购物榜霸榜69天、Google Play购物榜霸榜114天。", + ] + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + if "llama" in args.model_name_or_path: + tokenizer.pad_token_id = 0 + model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path).cuda() + model = model.eval() + if args.do_forward: + for input_text in all_texts: + print(f"text: {input_text}") + inputs = tokenizer([input_text], return_tensors="pt", max_length=50, padding=True) + inputs = inputs.to("cuda") + predict_forward(model, inputs) + + if args.do_generate: + for input_text in all_texts: + print(f"text: {input_text}") + inputs = tokenizer([input_text], return_tensors="pt", max_length=50, padding=True) + inputs = inputs.to("cuda") + predict_generate(model, inputs) diff --git a/examples/benchmark/peft/torch/requirements.txt b/examples/benchmark/peft/torch/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..c683fbf8ee1bc17202a7e23bfb4ea9ea2778e1ae --- /dev/null +++ b/examples/benchmark/peft/torch/requirements.txt @@ -0,0 +1,8 @@ +deepspeed==0.9.4 +datasets==2.12.0 +transformers==4.30.2 +peft @ git+https://github.com/huggingface/peft.git +torch==2.0.1 +sentencepiece +bitsandbytes==0.39.0 +# CUDA VERSION 11.7 \ No newline at end of file diff --git a/examples/benchmark/peft/torch/utils.py b/examples/benchmark/peft/torch/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..cb8a286e84d03e75270c2f9ac65be17135f2d4a3 --- /dev/null +++ b/examples/benchmark/peft/torch/utils.py @@ -0,0 +1,43 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
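+
+# CustomTrainer accumulates the number of observed tokens (batch_size * seq_len per
+# training step) so benchmark.py can report tokens per second, and ProfilerCallback
+# advances the torch profiler schedule once per optimizer step.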
+ +from transformers import ( + Trainer, + TrainerCallback, + TrainerControl, + TrainerState, + TrainingArguments, +) + + +class CustomTrainer(Trainer): + total_observed_tokens = 0.0 + + def training_step(self, model, inputs): + input_ids = inputs["input_ids"] + self.total_observed_tokens += float(input_ids.shape[0] * input_ids.shape[1]) + return super().training_step(model, inputs) + + +class ProfilerCallback(TrainerCallback): + "A callback that prints a message at the beginning of training" + + def __init__(self, prof): + self.prof = prof + + def on_train_begin(self, args, state, control, **kwargs): + print("Starting training") + + def on_step_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): + self.prof.step() diff --git a/examples/code_generation/codegen/README.md b/examples/code_generation/codegen/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c71803fdba2052f06dbd9d4eff7883d9e8ef5e5b --- /dev/null +++ b/examples/code_generation/codegen/README.md @@ -0,0 +1,327 @@ +# 代码生成:写代码的AI助理 + +**目录** +- [代码生成](#代码生成) + - [简介](#简介) + - [特色](#特色) + - [效果展示](#效果展示) + - [Github Copilot插件配置](#GithubCopilot插件配置) + - [环境依赖](#环境依赖) + - [代码结构说明](#代码结构说明) + - [启动服务](#启动服务) + - [配置参数](#配置参数说明) + - [测试服务](#测试服务) + - [配置插件](#配置插件) + - [注意事项](#注意事项) + - [训练定制](#训练定制) + - [数据准备](#数据准备) + - [从本地文件创建数据集](#从本地文件创建数据集) + - [模型训练](#模型训练) + - [TaskFlow调用](#TaskFlow调用) + - [更多使用案例](#更多使用案例) + - [模型列表](#模型列表) + - [References](#references) + + +## 简介 +代码生成是根据编程人员的输入,生成出编程人员想要的代码,能够帮助编程人员甚至独立生成代码,提高编程效率。 + + +### 特色 + +本项目是基于预训练语言模型CodeGen的代码生成,具有以下优势: +- **效果领先**。CodeGen(16B)在HumanEval benchmark上评估指标已经超过[OpenAI's Codex](https://arxiv.org/pdf/2107.03374.pdf)。 +- **免费的Github Copilot**。支持通过Github Copilot调用该模型,让你免费体验代码AI助理。 +- **高性能**。基于[FastGeneration](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/fast_generation)打造高性能推理,毫秒级响应。具体加速指标可参考[perf](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/fast_generation/README.md)。 +- **支持自定义数据集训练**。可增加自己的代码数据加以微调,让其更智能。 +- **开箱即用**。本项目提供TaskFlow接口,无需训练,仅需几行代码便可预测。 + + +## 效果展示 + +- Github Copilot代码提示效果展示 +

+ +- 解算法题效果展示。求解无重复字符的最长子串的长度 +```python +from paddlenlp import Taskflow + +prompt = "def lengthOfLongestSubstring(self, s: str) -> int:" +codegen = Taskflow("code_generation", model="Salesforce/codegen-2B-mono",decode_strategy="greedy_search", repetition_penalty=1.0) +print(codegen(prompt)) +``` +结果输出为: +```python + if not s: + return 0 + + start = 0 + end = 0 + max_len = 0 + + while end < len(s): + if s[end] not in s[start:end]: + max_len = max(max_len, end - start + 1) + end += 1 + else: + start += 1 + + return max_len +``` +

+ + +## Jupyter Lab插件配置 + +请参考[codegenJupyterLabExt](https://github.com/chenqianhe/codegenJupyterLabExt), 感谢生态开发者[@chenqianhe](https://github.com/chenqianhe)的贡献!👏👏 + +## GithubCopilot插件配置 + +**以VS Code的插件为例** + +### 环境依赖 +- PaddleNLP >= 2.4.0 +- PaddlePaddle >= 2.3.1 + +其他依赖:`pip install -r requirements.txt` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +codegen/ +├── requirements.txt # 环境依赖 +├── codegen_server.py # server启动脚本 +├── run_clm.py # 训练评估脚本 +├── run_clm.sh # 启动脚本 +└── README.md # 说明文档 +``` + +### 启动服务 + +```python +python codegen_server.py +``` + +##### 配置参数说明 +在codegen_server.py中配置如下参数: +- `model_name_or_path`:模型名,默认为 "Salesforce/codegen-350M-mono" +- `device`:运行设备,默认为"gpu" +- `temperature`:解码参数temperature,默认为0.5 +- `top_k`:解码参数top_k,默认为10 +- `top_p`:解码参数top_p,默认为1.0 +- `repetition_penalty`:解码重复惩罚项,默认为1.0 +- `min_length`:生成的最小长度,默认为0 +- `max_length`:生成的最大长度,默认为16 +- `decode_strategy`:解码策略,默认为"greedy_search" +- `load_state_as_np`:以numpy格式加载模型参数,可节省显存,默认为True +- `use_fast`:是否使用FastGeneration,可加速推理,默认为True +- `use_fp16_decoding`:是否使用fp16推理,可节省显存和加速推理,默认为True + +### 测试服务 +```python +import openai +openai.api_key = 'dummy' +openai.api_base = 'http://127.0.0.1:8978' +result = openai.Completion.create( + engine='codegen', prompt='def hello', max_tokens=16, temperature=0.1) +print(result) +''' + JSON: { + "id": "cmpl-dmhoeHmcw9DJ4NeqOJDQVKv3iivJ0", + "choices": [ + { + "text": "_world():\n print(\"Hello World!\")\n\n\n#", + "index": 0, + "finish_reason": "stop", + "logprobs": null, + } + ], + "usage": { + "completion_tokens": null, + "prompt_tokens": null, + "total_tokens": null + } +} +''' +``` +**注意**:如果要从本地访问服务器,`127.0.0.1`需要换成服务器的对外IP。 + + +### 配置插件 +打开用户设置([settings.json](https://code.visualstudio.com/docs/getstarted/settings#_settings-file-locations)),增加一行配置 +```json + "github.copilot.advanced": { + "debug.overrideEngine": "codegen", + "debug.testOverrideProxyUrl": "http://127.0.0.1:8978", + "debug.overrideProxyUrl": "http://127.0.0.1:8978" + }, +``` +接下来就可以愉快地使用了😊。 + + +#### 注意事项 +- 如果使用FastGeneration,需要设置[codegen_server.py](#配置参数说明)中`use_fast=True`,第一次推理会涉及到编译,会耗费一些时间。FastGeneration的环境依赖参考[这里](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/ops/README.md#%E4%BD%BF%E7%94%A8%E7%8E%AF%E5%A2%83%E8%AF%B4%E6%98%8E)。 +- 如果要使用自己训练好的模型,可以设置[codegen_server.py](#配置参数说明)中`model_name_or_path`为本地模型路径。 +- 如果要从本地访问服务器,上述的`127.0.0.1`需要换成服务器的对外IP。 +- 如果出现下方的提示和报错,则说明FastGeneration没有启动成功,需要定位下失败的原因。或者也可设置`use_fast=False`,不启动FastGeneration加速,但推理速度会较慢。 +```shell + FastGeneration is not available, and the original version would be used instead. +``` +```shell + RuntimeError: (NotFound) There are no kernels which are registered in the unsqueeze2 operator. + [Hint: Expected kernels_iter != all_op_kernels.end(), but received kernels_iter == all_op_kernels.end().] 
(at /home/Paddle/paddle/fluid/imperative/prepared_operator.cc:341) + [operator < unsqueeze2 > error] +``` +- 本代码也支持插件[fauxpilot](https://marketplace.visualstudio.com/items?itemName=Venthe.fauxpilot),感谢[@linonetwo](https://github.com/linonetwo)测试。`settings.json`中配置"fauxpilot.server": "http://服务器ip:8978/v1/engines" + +## 训练定制 + +### 数据准备 + +#### 从本地文件创建数据集 + +在许多情况,我们需要使用本地数据集来训练我们的代码生成模型,本项目支持使用固定格式本地数据集文件进行训练。 + +本地数据集文件格式如下: +- train.json/test.json 文件格式: +每行为一个jsonline +```text +{ + "code": "from paddlenlp.transformers import CodeGenForCausalLM\n\n\nmodel = CodeGenForCausalLM.from_pretrained('Salesforce/codegen-2B-mono')\n" +} +``` + +更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#)和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + + +### 模型训练 +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +python -m paddle.distributed.launch --gpus 0,1 run_clm.py \ + --model_name_or_path Salesforce/codegen-350M-mono \ + --block_size 1024 \ + --output_dir output \ + --train_file train.json \ + --validation_file test.json \ + --num_train_epochs 5 \ + --logging_steps 10 \ + --save_steps 1000 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --learning_rate 1e-4 \ + --warmup_ratio 0.1 \ + --do_train \ + --do_eval \ + --device gpu +``` +使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1" + +关键参数释义如下: +- `gpus` 指示了训练所用的GPU卡号。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型(详见[模型列表](#模型列表)),或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 +- `block_size` 表示训练时候数据被拆分的块数。 +- `output_dir` 表示模型的保存路径。 +- `train_file` 本地训练数据地址,数据格式必须与`dataset_name`所指数据集格式相同。 +- `validation_file` 本地测试数据地址,数据格式必须与`dataset_name`所指数据集格式相同。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `per_device_train_batch_size` 表示训练时**每张卡**上的样本数目。 +- `per_device_eval_batch_size` 表示测试时**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `warmup_ratio` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `do_train` 表示是否训练。 +- `do_eval` 表示是否评测。 +- `device` 表示使用的设备,从gpu和cpu中选择。 + +可通过`bash run_clm.sh`启动训练,更多参数详情和参数的默认值请参考`run_clm.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +./output/ +│── model_config.json +│── model_state.pdparams +│── tokenizer_config.json +│── special_tokens_map.json +│── added_tokens.json +│── vocab.json +│── merges.txt +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + + +## TaskFlow调用 +参考[TaskFlow文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/model_zoo/taskflow.md) + +## 更多使用案例 + +- 根据注释/功能描述写代码 + +```python +import re +import paddle +from paddlenlp.transformers import CodeGenTokenizer, CodeGenForCausalLM + +# The supported models are shown in the following table +model_name = 'Salesforce/codegen-2B-mono' +# Init tokenizer +tokenizer = CodeGenTokenizer.from_pretrained(model_name) +# Init model +model = CodeGenForCausalLM.from_pretrained(model_name) + +prompt = "# this function prints hello world" +inputs = tokenizer([prompt]) +inputs = {k: paddle.to_tensor(v) for (k, v) in inputs.items()} +# Generate +output, score = model.generate(inputs['input_ids'], + max_length=128, + decode_strategy='greedy_search') +# Decode the result +print( + tokenizer.decode(output[0], + truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"], + skip_special_tokens=True, + spaces_between_special_tokens=False)) +``` +结果输出为: +```python +def hello_world(): + print("Hello World") + +hello_world() +``` + +## 模型列表 +模型列表 +| 模型名称 | 说明 | +| :--------------------------------- | -------------------------------- | +| Salesforce/codegen-350M-mono | 基于Python数据集BIGPYTHON训练 | +| Salesforce/codegen-2B-mono | 基于Python数据集BIGPYTHON训练 | +| Salesforce/codegen-6B-mono | 基于Python数据集BIGPYTHON训练 | +| Salesforce/codegen-16B-mono | 基于Python数据集BIGPYTHON训练 | +| Salesforce/codegen-350M-nl | 基于自然语言数据集THEPILE训练 | +| Salesforce/codegen-2B-nl | 基于自然语言数据集THEPILE训练 | +| Salesforce/codegen-6B-nl | 基于自然语言数据集THEPILE训练 | +| Salesforce/codegen-16B-nl | 基于自然语言数据集THEPILE训练 | +| Salesforce/codegen-350M-multi | 基于多编程语言数据集BIGQUERY训练 | +| Salesforce/codegen-2B-multi | 基于多编程语言数据集BIGQUERY训练 | +| Salesforce/codegen-6B-multi | 基于多编程语言数据集BIGQUERY训练 | +| Salesforce/codegen-16B-multi | 基于多编程语言数据集BIGQUERY训练 | + +## References +- Nijkamp, Erik, et al. "A conversational paradigm for program synthesis." arXiv preprint arXiv:2203.13474 (2022). +- [https://github.com/features/copilot/](https://github.com/features/copilot/) +- [https://github.com/AndPuQing/Papilot](https://github.com/AndPuQing/Papilot) diff --git a/examples/code_generation/codegen/codegen_server.py b/examples/code_generation/codegen/codegen_server.py new file mode 100644 index 0000000000000000000000000000000000000000..5f8d80bb7cfab4ce436f5461a14e696a172bbff6 --- /dev/null +++ b/examples/code_generation/codegen/codegen_server.py @@ -0,0 +1,140 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
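+
+# Serves an OpenAI-compatible completion endpoint
+# (POST /v1/engines/codegen/completions on port 8978, see uvicorn.run below), which is
+# what the Github Copilot / fauxpilot configuration described in the README points to.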
+import random +import string +import time + +import paddle +import uvicorn +from fastapi import FastAPI, Response, status +from pydantic import BaseModel +from sse_starlette.sse import EventSourceResponse + +from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer +from paddlenlp.utils.log import logger + + +class DefaultConfig: + model_name_or_path = "Salesforce/codegen-350M-mono" + device = "gpu" + temperature = 0.5 + top_k = 10 + top_p = 1.0 + repetition_penalty = 1.0 + min_length = 0 + max_length = 16 + decode_strategy = "greedy_search" + load_state_as_np = True + use_faster = True + use_fp16_decoding = True + default_dtype = "float16" if use_faster and use_fp16_decoding else "float32" + + +class Input(BaseModel): + prompt: str + stream: bool = False + + +class Output(BaseModel): + id: str + model: str = "codegen" + object: str = "text_completion" + created: int = int(time.time()) + choices: list = None + usage = { + "completion_tokens": None, + "prompt_tokens": None, + "total_tokens": None, + } + + +generate_config = DefaultConfig() +paddle.set_device(generate_config.device) +paddle.set_default_dtype(generate_config.default_dtype) + +tokenizer = CodeGenTokenizer.from_pretrained(generate_config.model_name_or_path) +model = CodeGenForCausalLM.from_pretrained( + generate_config.model_name_or_path, load_state_as_np=generate_config.load_state_as_np +) + +app = FastAPI() + + +def random_completion_id(): + return "cmpl-" + "".join(random.choice(string.ascii_letters + string.digits) for _ in range(29)) + + +@app.post("/v1/engines/codegen/completions", status_code=200) +async def gen(item: Input): + item = item.dict() + logger.info(f"Request: {item}") + temperature = item.get("temperature", generate_config.temperature) + top_k = item.get("top_k", generate_config.top_k) + if temperature == 0.0: + temperature = 1.0 + top_k = 1 + repetition_penalty = item.get("frequency_penalty", generate_config.repetition_penalty) + + start_time = time.time() + logger.info("Start generating code") + tokenized = tokenizer([item["prompt"]], truncation=True, return_tensors="pd") + output, _ = model.generate( + tokenized["input_ids"], + max_length=16, + min_length=generate_config.min_length, + decode_strategy=generate_config.decode_strategy, + top_k=top_k, + repetition_penalty=repetition_penalty, + temperature=temperature, + use_fast=generate_config.use_faster, + use_fp16_decoding=generate_config.use_fp16_decoding, + ) + logger.info("Finish generating code") + end_time = time.time() + logger.info(f"Time cost: {end_time - start_time}") + output = tokenizer.decode(output[0], skip_special_tokens=True) + logger.info(f"Generated code: {output}") + output_json = Output( + id=random_completion_id(), + choices=[ + { + "text": output, + "index": 0, + "finish_reason": "stop", + "logprobs": None, + } + ], + usage={ + "completion_tokens": None, + "prompt_tokens": None, + "total_tokens": None, + }, + ).json() + + def stream_response(response): + yield f"{response}\n\n" + yield "data: [DONE]\n\n" + + if item.get("stream", False): + return EventSourceResponse(stream_response(output_json)) + else: + return Response( + status_code=status.HTTP_200_OK, + content=output_json, + media_type="application/json", + ) + + +if __name__ == "__main__": + uvicorn.run("codegen_server:app", host="0.0.0.0", port=8978) diff --git a/examples/code_generation/codegen/requirements.txt b/examples/code_generation/codegen/requirements.txt new file mode 100644 index 
0000000000000000000000000000000000000000..ae00f4799fa170e98fa4cb1744bc3aa537439fe3 --- /dev/null +++ b/examples/code_generation/codegen/requirements.txt @@ -0,0 +1,7 @@ +fastapi==0.79.0 +pydantic==1.9.1 +python-dotenv==0.20.0 +sse_starlette==0.10.3 +uvicorn==0.17.6 +openai==0.8.0 +regex==2022.6.2 \ No newline at end of file diff --git a/examples/code_generation/codegen/run_clm.py b/examples/code_generation/codegen/run_clm.py new file mode 100644 index 0000000000000000000000000000000000000000..4e9d5668763fc74fb9da9a14ad08995a41bbdc38 --- /dev/null +++ b/examples/code_generation/codegen/run_clm.py @@ -0,0 +1,162 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math +from dataclasses import dataclass, field +from functools import partial +from itertools import chain +from typing import Optional + +import paddle +import paddle.nn as nn +from datasets import load_dataset + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer +from paddlenlp.utils.log import logger + + +@dataclass +class ModelArguments: + model_name_or_path: Optional[str] = field( + default="Salesforce/codegen-350M-mono", + metadata={"help": ("Path to pre-trained model.")}, + ) + overwrite_cache: Optional[bool] = field( + default=False, + metadata={"help": ("Whether to overwrite cache for dataset.")}, + ) + + +@dataclass +class DataArguments: + train_file: Optional[str] = field( + default=None, + metadata={"help": "The input training data file."}, + ) + validation_file: Optional[str] = field( + default=None, + metadata={"help": "The input validation data file."}, + ) + block_size: Optional[int] = field( + default=None, + metadata={"help": ("The training dataset will be truncated in block of this size for training. 
")}, + ) + + +def compute_metrics(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + logits = paddle.to_tensor(eval_preds.predictions) + loss_fct = nn.CrossEntropyLoss() + eval_loss = loss_fct(logits[:, :-1, :], labels[:, 1:]) + perplexity = math.exp(eval_loss) + return {"perplexity": perplexity} + + +def convert_example(examples, tokenizer): + """convert examples into necessary features""" + # Convert raw text to feature + tokenized_examples = tokenizer( + examples["code"], return_attention_mask=True, return_position_ids=False, return_token_type_ids=False + ) + return tokenized_examples + + +def group_texts(examples, block_size): + concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()} + total_length = len(concatenated_examples[list(examples.keys())[0]]) + if total_length >= block_size: + total_length = (total_length // block_size) * block_size + result = { + k: [t[i : i + block_size] for i in range(0, total_length, block_size)] + for k, t in concatenated_examples.items() + } + result["labels"] = result["input_ids"].copy() + return result + + +def process_ds(dataset, tokenizer, overwrite_cache, block_size): + trans_func = partial(convert_example, tokenizer=tokenizer) + dataset = dataset.map( + trans_func, batched=True, remove_columns=dataset.column_names, load_from_cache_file=overwrite_cache + ) + trans_func = partial(group_texts, block_size=block_size) + dataset = dataset.map(trans_func, batched=True, load_from_cache_file=overwrite_cache) + return dataset + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + model = CodeGenForCausalLM.from_pretrained(model_args.model_name_or_path) + + tokenizer = CodeGenTokenizer.from_pretrained(model_args.model_name_or_path) + + train_set = load_dataset("json", data_files=data_args.train_file, split="train") + dev_set = load_dataset("json", data_files=data_args.validation_file, split="train") + + if data_args.block_size is None: + block_size = tokenizer.model_max_length + if block_size > 1024: + logger.warning( + f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). " + "Picking 1024 instead. You can change that default value by passing --block_size xxx." + ) + block_size = 1024 + else: + if data_args.block_size > tokenizer.model_max_length: + logger.warning( + f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model" + f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}." 
+ ) + block_size = min(data_args.block_size, tokenizer.model_max_length) + + train_set = process_ds(train_set, tokenizer, model_args.overwrite_cache, block_size) + dev_set = process_ds(dev_set, tokenizer, model_args.overwrite_cache, block_size) + + batchify_fn = DataCollatorWithPadding(tokenizer, return_attention_mask=True) + + trainer = Trainer( + model=model, + args=training_args, + train_dataset=train_set if training_args.do_train else None, + eval_dataset=dev_set if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + compute_metrics=compute_metrics, + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/code_generation/codegen/run_clm.sh b/examples/code_generation/codegen/run_clm.sh new file mode 100644 index 0000000000000000000000000000000000000000..4c76ea178e3f4c2c8a270e799199372151aa78f7 --- /dev/null +++ b/examples/code_generation/codegen/run_clm.sh @@ -0,0 +1,30 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
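+
+# Fine-tunes Salesforce/codegen-350M-mono on the local train.json / test.json files
+# described in the README; adjust --gpus to the devices available on your machine.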
+ +python -m paddle.distributed.launch --gpus 0,1 run_clm.py \ + --model_name_or_path Salesforce/codegen-350M-mono \ + --block_size 1024 \ + --output_dir output \ + --train_file train.json \ + --validation_file test.json \ + --num_train_epochs 5 \ + --logging_steps 10 \ + --save_steps 1000 \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 2 \ + --learning_rate 1e-4 \ + --warmup_ratio 0.1 \ + --do_train \ + --do_eval \ + --device gpu diff --git a/examples/dependency_parsing/ddparser/README.md b/examples/dependency_parsing/ddparser/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c0c964dc23d545204c1aee408bfa6ce6ed79d56d --- /dev/null +++ b/examples/dependency_parsing/ddparser/README.md @@ -0,0 +1,393 @@ +# DDParser + + - [模型简介](#模型简介) + - [快速开始](#快速开始) + - [模型效果](#模型效果) + - [数据格式](#数据格式) + - [数据准备](#数据准备) + - [文件结构](#文件结构) + - [模型训练、预测与部署](#模型训练、预测与部署) + - [Taskflow一键预测](#Taskflow一键预测) + - [Reference](#Reference) + +## 模型简介 + +依存句法分析任务通过分析句子中词语之间的依存关系来确定句子的句法结构, +该项目是基于Paddle v2.1的[baidu/ddparser](https://github.com/baidu/DDParser)实现, +模型结构为[Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734)。 +同时本项目引入了[ERNIE](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 系列预训练模型, +用户可以基于预训练模型finetune完成依存句法分析训练(参考以下[示例](#模型训练))。 + +## 快速开始 + +本项目展示了基于NLPCC2013_EVSAM05_THU和NLPCC2013_EVSAM05_HIT数据集的进行模型训练、预测和部署的示例。 + +### 模型效果 + +以下是NLPCC2013_EVSAM05_THU和NLPCC2013_EVSAM05_HIT数据集的模型性能对比,baseline为第二届自然语言处理与中文计算会议发布的[评测报告](http://tcci.ccf.org.cn/conference/2013/dldoc/evrpt05.rar)。 + +#### NLPCC2013_EVSAM05_THU + +| model | dev UAS | dev LAS | test UAS | test LAS | +| ------------------------- | :-----: | :------:| :-------:| :-------:| +| `baseline` | 81.49 | 72.17 | 84.68 | 76.02 | +| `biaffine-dep(+char)` | 84.11 | 75.16 | 85.31 | 76.73 | +| `biaffine-dep(+pos)` | 83.28 | 74.20 | 84.54 | 75.33 | +| `biaffine-dep-lstm-pe` | 81.02 | 71.20 | 82.86 | 73.86 | +| `biaffine-dep-ernie-tiny` | 89.02 | 81.39 | 89.31 | 81.51 | +| `biaffine-dep-ernie-1.0` | 92.25 | 84.77 | 92.12 | 84.62 | +| `biaffine-dep-ernie-gram` | 92.20 | 85.10 | 91.96 | 84.10 | + +#### NLPCC2013_EVSAM05_HIT + +| model | dev UAS | dev LAS | test UAS | test LAS | +| ------------------------- | :-----: | :------:| :-------:| :-------:| +| `baseline` | 82.96 | 65.45 | 82.65 | 65.25 | +| `biaffine-dep(+char)` | 80.90 | 65.29 | 80.77 | 65.43 | +| `biaffine-dep(+pos)` | 83.85 | 68.27 | 83.75 | 68.04 | +| `biaffine-dep-lstm-pe` | 77.48 | 61.34 | 76.41 | 60.32 | +| `biaffine-dep-ernie-tiny` | 84.21 | 68.89 | 83.98 | 68.67 | +| `biaffine-dep-ernie-1.0` | 89.24 | 74.12 | 88.64 | 74.09 | +| `biaffine-dep-ernie-gram` | 89.59 | 74.75 | 88.79 | 74.46 | + +其中`lstm-pe`表示lstm by positional encoding,`biaffine-dep`的模型输入可以选择句子的word级表示加char级表示(`biaffine-dep(+char)`)或者句子的word级表示加上pos词性标签(`biaffine-dep(+pos)`),其他模型使用句子的word级表示和char级表示。 + +指标释义: +```text +UAS (依存准确率) = number of words assigned correct head / total words +LAS (依存标注准备率) = number of words assigned correct head and relation / total words +``` + +### 数据格式 + +本用例数据格式基于[CoNLL-X](https://ilk.uvt.nl/~emarsi/download/pubs/14964.pdf)。 + +| 名称 | 含义 | +| --- | --- | +| ID | 单词ID,序号从1开始 | +| FORM | 当前单词 | +| LEMMA | 当前词语的原型或词干,在中文中此列与FORM相同 | +| CPOSTAG | 当前词语的词性(粗粒度) | +| POSTAG | 当前词语的词性(细粒度) | +| FEATS | 句法特征 | +| HEAD | 当前单词的中心词 | +| DEPREL | 当前单词与中心词的依存关系 | +| PHEAD | 当前单词的主观中心词 | +| PDEPREL | 当前单词与主观中心词的依存关系 | + +NLPCC2013_EVSAM05_THU数据集示例: +``` +ID FROM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL +1 世界 世界 
n n _ 5 限定 +2 第 第 m m _ 4 限定 +3 八 八 m m _ 2 连接依存 +4 大 大 a a _ 5 限定 +5 奇迹 奇迹 n n _ 6 存现体 +6 出现 出现 v v _ 0 核心成分 +``` + +NLPCC2013_EVSAM05_HIT数据集示例: +``` +ID FROM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL +1 城建 城建 NN NN _ 2 relevant _ _ +2 成为 成为 VV VV _ 0 ROOT _ _ +3 外商 外商 NN NN _ 4 agent _ _ +4 投资 投资 VV VV _ 7 d-restrictive _ _ +5 青海 青海 NR NR _ 4 patient _ _ +6 新 新 JJ JJ _ 7 d-attribute _ _ +7 热点 热点 NN NN _ 2 isa _ _ +``` + +- 该用例中用户只需关注`FORM`、`POSTTAG`、`HEAD`和`DEPREL`这几列信息即可,'_'表示数值不可用。 + +### 数据准备 + +该用例使用的是[第二届自然语言处理与中文计算会议(NLP&CC 2013)](http://tcci.ccf.org.cn/conference/2013/pages/page04_sam.html) +提供的数据集,其中`NLPCC2013_EVSAM05_THU`为清华大学语义依存网络语料,`NLPCC2013_EVSAM05_HIT`为哈尔滨工业大学依存网络语料。 + +为了方便用户的快速使用,PaddleNLP Dataset API内置了数据集,一键可完成数据集加载。 + +加载`NLPCC2013_EVSAM05_THU`数据集: +```python +from paddlenlp.datasets import load_dataset +train_ds, dev_ds, test_ds = load_dataset("nlpcc13_evsam05_thu", splits=["train", "dev", "test"]) +``` + +加载`NLPCC2013_EVSAM05_HIT`数据集: +```python +from paddlenlp.datasets import load_dataset +train_ds, dev_ds, test_ds = load_dataset("nlpcc13_evsam05_hit", splits=["train", "dev", "test"]) +``` + +### 文件结构 + +以下是本项目主要代码结构及说明: + +```text +ddparser/ +├── deploy # 部署 +│   └── python +│   └── predict.py # python预测部署示例 +├── model +│ ├── dropouts.py # dropout +│ ├── encoder.py # 编码器 +│ └── dep.py # 模型网络 +├── README.md # 使用说明 +├── export_model.py # 模型导出脚本 +├── criterion.py # 损失函数 +├── data.py # 数据结构 +├── metric.py # 指标计算 +├── train.py # 训练脚本 +├── predict.py # 预测脚本 +└── utils.py # 工具函数 +``` + +### 模型训练、预测与部署 + +本项目提供了三种模型结构:LSTMEncoder+MLP+BiAffine、LSTMByWPEncoder+MLP+BiAffine和ErnieEncoder+MLP+BiAffine,用户可通过`--encoding_model`指定所使用的模型结构。 + +#### LSTMEncoder+MLP+BiAffine + +##### 启动训练 + +通过如下命令,指定GPU 0卡,以`lstm`为encoder在`nlpcc13_evsam05_thu`数据集上训练与评估: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --device=gpu \ + --epochs=100 \ + --task_name=nlpcc13_evsam05_thu \ + --save_dir=./model_file \ + --encoding_model=lstm \ + --feat=pos \ + --batch_size=1000 \ + --lstm_lr=0.002 +``` + +##### 基于动态图的预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -m paddle.distributed.launch --gpus "0" predict.py \ + --device=gpu \ + --task_name=nlpcc13_evsam05_thu \ + --encoding_model=lstm \ + --feat=pos \ + --params_path=./model_file/best.pdparams \ + --infer_output_file=infer_output.conll +``` + +**NOTE**: 预测时的`encoding_model`和`feat`需要与训练时的参数保持一致。 + +#### LSTMByWPEncoder+MLP+BiAffine + +##### 启动训练 + +通过如下命令,指定GPU 0卡,以`lstm-pe`为encoder在`nlpcc13_evsam05_hit`数据集上训练与评估: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --device=gpu \ + --epochs=100 \ + --task_name=nlpcc13_evsam05_hit \ + --encoding_model=lstm-pe \ + --save_dir=./model_file \ + --lstm_lr=0.002 +``` + +##### 基于动态图的预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -m paddle.distributed.launch --gpus "0" predict.py \ + --device=gpu \ + --task_name=nlpcc13_evsam05_hit \ + --encoding_model=lstm-pe \ + --params_path=./model_file/best.pdparams \ + --infer_output_file=infer_output.conll +``` + +##### 基于静态图的预测部署 + +使用动态图训练结束后,可以将动态图参数导出成静态图参数, 从而获得较优的预测部署性能,执行如下命令完成动态图转换静态图的功能: + +```shell +python export_model.py --encoding_model=lstm-pe \ + --params_path=./model_file/best.pdparams \ + --output_path=./output +``` + +导出静态图模型之后,可以用于部署,`deploy/python/predict.py`脚本提供了python部署预测示例。运行方式: +```shell +python deploy/python/predict.py --model_dir=./output --task_name=nlpcc13_evsam05_hit +``` + +#### ErnieEncoder+MLP+BiAffine + +##### 
启动训练 + +通过如下命令,指定GPU 0卡,以预训练模型`ernie-gram-zh`为encoder在`nlpcc13_evsam05_hit`数据集上训练与评估: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --device=gpu \ + --epochs=100 \ + --task_name=nlpcc13_evsam05_hit \ + --encoding_model=ernie-gram-zh \ + --save_dir=./model_file \ + --ernie_lr=5e-5 +``` + +##### 基于动态图的预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -m paddle.distributed.launch --gpus "0" predict.py \ + --device=gpu \ + --task_name=nlpcc13_evsam05_hit \ + --encoding_model=ernie-gram-zh \ + --params_path=./model_file/best.pdparams \ + --infer_output_file=infer_output.conll +``` + +##### 基于静态图的预测部署 + +使用动态图训练结束后,可以将动态图参数导出成静态图参数, 从而获得较优的预测部署性能,执行如下命令完成动态图转换静态图的功能: + +```shell +python export_model.py --encoding_model=ernie-gram-zh \ + --params_path=./model_file/best.pdparams \ + --output_path=./output +``` + +导出静态图模型之后,可以用于部署,`deploy/python/predict.py`脚本提供了python部署预测示例。运行方式: +```shell +python deploy/python/predict.py --model_dir=./output --task_name=nlpcc13_evsam05_hit +``` + +#### 参数释义 + +项目中的参数具体说明如下: + +* `device`: 选用什么设备进行训练,可选cpu、gpu。 +* `task_name`: 选择训练所用的数据集,可选nlpcc13_evsam05_thu和nlpcc13_evsam05_hit。 +* `encoding_model`: 选择模型编码网络,可选lstm、lstm-pe、ernie-1.0、ernie-tiny和ernie-gram-zh。 +* `epochs`: 训练轮数。 +* `save_dir`: 保存训练模型的路径;默认将当前在验证集上LAS指标最高的模型`best.pdparams`和训练最近一个epoch的模型`last_epoch.pdparams`保存在目录model_file文件夹下。 +* `batch_size`: 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数,默认为1000。 +* `init_from_params`: 模型参数路径,热启动模型训练;默认为None。 +* `clip`: 梯度裁剪阈值,将梯度限制在阈值范围内。 +* `lstm_lr`: 模型编码网络为lstm或lstm-pe时的学习率,默认为0.002。 +* `ernie_lr`: 模型编码网络为ernie-1.0、ernie-tiny、ernie-gram-zh时的学习率,默认为5e-5。 +* `seed`: 随机种子,默认为1000。 +* `min_freq`: 训练模式下的使用参数,基于训练数据生成的词表的最小词频,默认为2。 +* `n_buckets`: 选择数据分桶数,对训练数据按照长度进行分桶。 +* `tree`: 确保输出结果是正确的依存句法树,默认为True。 +* `feat`: 模型编码网络为lstm时的使用参数,选择输入的特征,可选char(句子的char级表示)和pos(词性标签);ernie类别的模型只能为None。 +* `warmup_proportion`: 学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +* `weight_decay`: 控制正则项力度的参数,用于防止过拟合,默认为0.0。 + +## Taskflow一键预测 + +Taskflow向用户提供了一个百度基于大规模标注数据集[DuCTB1.0](#数据来源)训练的依存句法分析工具ddparser。用户可以方便地使用该工具完成[一键预测](#一键预测)。 + +### 环境依赖 + +- LAC >= 2.1 +- matplotlib >= 3.4.2 + +### 一键预测 + +```python +from paddlenlp import Taskflow + +ddp = Taskflow("dependency_parsing") +ddp("百度是一家高科技公司") +''' +[{'word': ['百度', '是', '一家', '高科技', '公司'], + 'head': ['2', '0', '5', '5', '2'], + 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}] +''' +ddp(["百度是一家高科技公司", "他送了一本书"]) +''' +[{'word': ['百度', '是', '一家', '高科技', '公司'], + 'head': ['2', '0', '5', '5', '2'], + 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}, + {'word': ['他', '送', '了', '一本', '书'], + 'head': ['2', '0', '2', '5', '2'], + 'deprel': ['SBV', 'HED', 'MT', 'ATT', 'VOB']}] +''' + +# 输出概率和词性标签 +ddp = Taskflow("dependency_parsing", prob=True, use_pos=True) +ddp("百度是一家高科技公司") +''' +[{'word': ['百度', '是', '一家', '高科技', '公司'], + 'postag': ['ORG', 'v', 'm', 'n', 'n'], + 'head': ['2', '0', '5', '5', '2'], + 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB'], + 'prob': [1.0, 1.0, 1.0, 1.0, 1.0]}] +''' + +# 使用ddparser-ernie-1.0进行预测 +ddp = Taskflow("dependency_parsing", model="ddparser-ernie-1.0") +ddp("百度是一家高科技公司") +''' +[{'word': ['百度', '是', '一家', '高科技', '公司'], + 'head': ['2', '0', '5', '5', '2'], + 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}] +''' + +# 使用ddparser-ernie-gram-zh进行预测 +ddp = Taskflow("dependency_parsing", model="ddparser-ernie-gram-zh") +ddp("百度是一家高科技公司") +''' +[{'word': ['百度', '是', '一家', '高科技', '公司'], + 'head': ['2', '0', '5', '5', '2'], + 'deprel': 
['SBV', 'HED', 'ATT', 'ATT', 'VOB']}] +''' +``` + +### 依存关系可视化 + +```python +from paddlenlp import Taskflow + +ddp = Taskflow("dependency_parsing", return_visual=True) +result = ddp("百度是一家高科技公司")[0]['visual'] +import cv2 +cv2.imwrite('test.png', result) +``` + +### 标注关系说明 + +DuCTB1.0数据集含14种标注关系,具体含义见下表: + +| Label | 关系类型 | 说明 | 示例 | +| :---: | :--------: | :----------------------- | :----------------------------- | +| SBV | 主谓关系 | 主语与谓词间的关系 | 他送了一本书(他<--送) | +| VOB | 动宾关系 | 宾语与谓词间的关系 | 他送了一本书(送-->书) | +| POB | 介宾关系 | 介词与宾语间的关系 | 我把书卖了(把-->书) | +| ADV | 状中关系 | 状语与中心词间的关系 | 我昨天买书了(昨天<--买) | +| CMP | 动补关系 | 补语与中心词间的关系 | 我都吃完了(吃-->完) | +| ATT | 定中关系 | 定语与中心词间的关系 | 他送了一本书(一本<--书) | +| F | 方位关系 | 方位词与中心词的关系 | 在公园里玩耍(公园-->里) | +| COO | 并列关系 | 同类型词语间关系 | 叔叔阿姨(叔叔-->阿姨) | +| DBL | 兼语结构 | 主谓短语做宾语的结构 | 他请我吃饭(请-->我,请-->吃饭) | +| DOB | 双宾语结构 | 谓语后出现两个宾语 | 他送我一本书(送-->我,送-->书) | +| VV | 连谓结构 | 同主语的多个谓词间关系 | 他外出吃饭(外出-->吃饭) | +| IC | 子句结构 | 两个结构独立或关联的单句 | 你好,书店怎么走?(你好<--走) | +| MT | 虚词成分 | 虚词与中心词间的关系 | 他送了一本书(送-->了) | +| HED | 核心关系 | 指整个句子的核心 | | + +### 数据来源 + +**DuCTB1.0**: `Baidu Chinese Treebank1.0`是百度构建的中文句法树库,即Taskflow所提供的依存句法分析工具-DDParser的训练数据来源。 + +## Reference + +- [baidu/ddparser](https://github.com/baidu/DDParser) +- [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734) diff --git a/examples/dependency_parsing/ddparser/criterion.py b/examples/dependency_parsing/ddparser/criterion.py new file mode 100644 index 0000000000000000000000000000000000000000..dcb25995f0e4dc8e72c2491088619ad523c19ae0 --- /dev/null +++ b/examples/dependency_parsing/ddparser/criterion.py @@ -0,0 +1,40 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from utils import index_sample + + +class ParserCriterion(nn.Layer): + def __init__(self, *args, **kwargs): + super(ParserCriterion, self).__init__(*args, **kwargs) + + def __call__(self, s_arc, s_rel, arcs, rels, mask): + + arcs = paddle.masked_select(arcs, mask) + rels = paddle.masked_select(rels, mask) + + select = paddle.nonzero(mask) + s_arc = paddle.gather_nd(s_arc, select) + s_rel = paddle.gather_nd(s_rel, select) + + s_rel = index_sample(s_rel, paddle.unsqueeze(arcs, axis=1)) + + arc_cost = F.cross_entropy(s_arc, arcs) + rel_cost = F.cross_entropy(s_rel, rels) + + avg_cost = paddle.mean(arc_cost + rel_cost) + return avg_cost diff --git a/examples/dependency_parsing/ddparser/data.py b/examples/dependency_parsing/ddparser/data.py new file mode 100644 index 0000000000000000000000000000000000000000..1663dfbe2a2bb41463859744ea23dd66b1851078 --- /dev/null +++ b/examples/dependency_parsing/ddparser/data.py @@ -0,0 +1,278 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import math +import numpy as np + +import paddle +from paddle.io import Dataset +from paddlenlp.data import Vocab +from utils import kmeans, pad_sequence + + +def build_vocab(corpus, tokenizer, encoding_model, feat): + """ + Build vocabs use the api of paddlenlp.data.Vocab.build_vocab(), + Using token_to_idx to specifies the mapping relationship between + tokens and indices to be used. + + Args: + Corpus(obj:`list[list[str]]`): The training corpus which contains + list of input words, features and relations. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from + :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. If the encoding model is lstm, + tokenizer is None. + encoding_model(obj:`str`): The encoder used for embedding. + feat(obj:`str`): The features used for model inputs. If the encoding + model is lstm, feat can be `pos` or `char`, otherwise the feat is None. + + Returns: + word_vocab(obj:`Vocab`): Word vocab. + feat_vocab(obj:`Vocab`): Feature vocab. + rel_vocab(obj:`Vocab`): Relation vocab. + """ + word_examples, feat_examples, rel_examples = corpus + + # Build word vocab and feature vocab + if encoding_model == "lstm": + # Using token_to_idx to specifies the mapping + # relationship between tokens and indices + word_vocab = Vocab.build_vocab( + word_examples, + min_freq=2, + token_to_idx={"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3}, + unk_token="[UNK]", + pad_token="[PAD]", + bos_token="[BOS]", + eos_token="[EOS]", + ) + if feat == "pos": + feat_vocab = Vocab.build_vocab( + feat_examples, + token_to_idx={"[BOS]": 0, "[EOS]": 1}, + bos_token="[BOS]", + eos_token="[EOS]", + ) + else: + feat_vocab = Vocab.build_vocab( + feat_examples, + token_to_idx={"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3}, + unk_token="[UNK]", + pad_token="[PAD]", + bos_token="[BOS]", + eos_token="[EOS]", + ) + else: + word_vocab = tokenizer.vocab + feat_vocab = None + + # Build relation vocab + rel_vocab = Vocab.build_vocab( + rel_examples, + token_to_idx={"[BOS]": 0, "[EOS]": 1, "[UNK]": 2}, + bos_token="[BOS]", + eos_token="[EOS]", + unk_token="[UNK]", + ) + return word_vocab, feat_vocab, rel_vocab + + +def load_vocab(vocab_dir): + """load vocabs""" + word_vocab = Vocab.from_json(os.path.join(vocab_dir, "word_vocab.json")) + rel_vocab = Vocab.from_json(os.path.join(vocab_dir, "rel_vocab.json")) + feat_vocab_path = os.path.join(vocab_dir, "feat_vocab.json") + if os.path.exists(feat_vocab_path): + feat_vocab = Vocab.from_json(os.path.join(feat_vocab_path)) + else: + feat_vocab = None + return word_vocab, feat_vocab, rel_vocab + + +def convert_example(example, vocabs, encoding_model="ernie-3.0-medium-zh", feat=None, mode="train", fix_len=20): + """Builds model inputs for dependency parsing task.""" + word_vocab, feat_vocab, rel_vocab = vocabs + if encoding_model == "lstm": + word_bos_index = word_vocab.to_indices("[BOS]") + word_eos_index = word_vocab.to_indices("[EOS]") + else: + word_bos_index = word_vocab.to_indices("[CLS]") + word_eos_index = word_vocab.to_indices("[SEP]") + + if feat_vocab: + feat_bos_index = 
feat_vocab.to_indices("[BOS]") + feat_eos_index = feat_vocab.to_indices("[EOS]") + + arc_bos_index, arc_eos_index = 0, 1 + + rel_bos_index = rel_vocab.to_indices("[BOS]") + rel_eos_index = rel_vocab.to_indices("[EOS]") + + if mode != "test": + arcs = list(example["HEAD"]) + arcs = [arc_bos_index] + arcs + [arc_eos_index] + arcs = np.array(arcs, dtype=int) + + rels = rel_vocab.to_indices(example["DEPREL"]) + rels = [rel_bos_index] + rels + [rel_eos_index] + rels = np.array(rels, dtype=int) + + if encoding_model == "lstm": + words = word_vocab.to_indices(example["FORM"]) + words = [word_bos_index] + words + [word_eos_index] + words = np.array(words, dtype=int) + + if feat == "pos": + feats = feat_vocab.to_indices(example["CPOS"]) + feats = [feat_bos_index] + feats + [feat_eos_index] + feats = np.array(feats, dtype=int) + else: + feats = [[feat_vocab.to_indices(token) for token in word] for word in example["FORM"]] + feats = [[feat_bos_index]] + feats + [[feat_eos_index]] + feats = pad_sequence([np.array(ids[:fix_len], dtype=int) for ids in feats], fix_len=fix_len) + if mode == "test": + return words, feats + return words, feats, arcs, rels + else: + words = [[word_vocab.to_indices(char) for char in word] for word in example["FORM"]] + words = [[word_bos_index]] + words + [[word_eos_index]] + words = pad_sequence([np.array(ids[:fix_len], dtype=int) for ids in words], fix_len=fix_len) + if mode == "test": + return [words] + return words, arcs, rels + + +def create_dataloader(dataset, batch_size, mode="train", n_buckets=None, trans_fn=None): + """ + Create dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will + shuffle the dataset randomly. + n_buckets(obj:`int`, optional, defaults to `None`): If n_buckets is not None, it will devide + the dataset into n_buckets according to the sequence lengths. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a + data sample to input ids, etc. 
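+
+    Returns:
+        data_loader(obj:`paddle.io.DataLoader`): Dataloader which yields the already
+            batchified samples produced by `Batchify`.
+        buckets(obj:`dict`): Mapping from bucket length to sample indices when
+            `n_buckets` is given, otherwise `None`.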
+ """ + if n_buckets: + word_examples = [seq["FORM"] for seq in dataset] + lengths = [len(i) + 1 for i in word_examples] + buckets = dict(zip(*kmeans(lengths, n_buckets))) + else: + buckets = None + if trans_fn: + dataset = dataset.map(trans_fn) + + if n_buckets: + if mode == "train": + batch_sampler = BucketsSampler( + buckets=buckets, + batch_size=batch_size, + shuffle=True, + ) + else: + batch_sampler = BucketsSampler( + buckets=buckets, + batch_size=batch_size, + shuffle=False, + ) + else: + batch_sampler = SequentialSampler( + batch_size=batch_size, + corpus_length=len(dataset), + ) + + # Subclass of `paddle.io.Dataset` + dataset = Batchify(dataset, batch_sampler) + + # According to the api of `paddle.io.DataLoader` set `batch_size` + # and `batch_sampler` to `None` to disable batchify dataset automatically + data_loader = paddle.io.DataLoader(dataset=dataset, batch_sampler=None, batch_size=None, return_list=True) + return data_loader, buckets + + +class Batchify(Dataset): + def __init__(self, dataset, batch_sampler): + + self.batches = [] + for batch_sample_id in batch_sampler: + batch = [] + raw_batch = self._collate_fn([dataset[sample_id] for sample_id in batch_sample_id]) + for data in raw_batch: + if isinstance(data[0], np.ndarray): + data = pad_sequence(data) + batch.append(data) + self.batches.append(batch) + + def __getitem__(self, idx): + return self.batches[idx] + + def __len__(self): + return len(self.batches) + + def _collate_fn(self, batch): + """Return batch samples""" + return (raw for raw in zip(*batch)) + + +class BucketsSampler(object): + """BucketsSampler""" + + def __init__(self, buckets, batch_size, shuffle=False): + self.batch_size = batch_size + self.shuffle = shuffle + self.sizes, self.buckets = zip(*[(size, bucket) for size, bucket in buckets.items()]) + # The number of chunks in each bucket, which is clipped by range [1, len(bucket)] + self.chunks = [] + for size, bucket in zip(self.sizes, self.buckets): + max_ch = max(math.ceil(size * len(bucket) / batch_size), 1) + chunk = min(len(bucket), int(max_ch)) + self.chunks.append(chunk) + + def __iter__(self): + """Returns an iterator, randomly or sequentially returns a batch id""" + range_fn = np.random.permutation if self.shuffle else np.arange + for i in range_fn(len(self.buckets)).tolist(): + split_sizes = [(len(self.buckets[i]) - j - 1) // self.chunks[i] + 1 for j in range(self.chunks[i])] + for batch in np.split(range_fn(len(self.buckets[i])), np.cumsum(split_sizes)): + if len(batch): + yield [self.buckets[i][j] for j in batch.tolist()] + + def __len__(self): + """Returns the number of batches""" + return sum(self.chunks) + + +class SequentialSampler(object): + """SequentialSampler""" + + def __init__(self, batch_size, corpus_length): + self.batch_size = batch_size + self.corpus_length = corpus_length + + def __iter__(self): + """iter""" + batch = [] + for i in range(self.corpus_length): + batch.append(i) + if len(batch) == self.batch_size: + yield batch + batch = [] + else: + if len(batch): + yield batch diff --git a/examples/dependency_parsing/ddparser/deploy/python/predict.py b/examples/dependency_parsing/ddparser/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..027d0f358621c02f4cdeb90ddf5fa57628161345 --- /dev/null +++ b/examples/dependency_parsing/ddparser/deploy/python/predict.py @@ -0,0 +1,150 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import os + +import numpy as np +import paddle +from data import convert_example, load_vocab +from utils import eisner, flat_words, istree, pad_sequence + +from paddlenlp.datasets import load_dataset + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The path to static model.") +parser.add_argument("--task_name", choices=["nlpcc13_evsam05_thu", "nlpcc13_evsam05_hit"], type=str, default="nlpcc13_evsam05_thu", help="Select the task.") +parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", type=int, default=64, help="Numbers of examples a batch for training.") +parser.add_argument("--infer_output_file", type=str, default='infer_output.conll', help="The path to save infer results.") +parser.add_argument("--tree", type=bool, default=True, help="Ensure the output conforms to the tree structure.") +args = parser.parse_args() +# fmt: on + + +def batchify_fn(batch): + raw_batch = [raw for raw in zip(*batch)] + batch = [pad_sequence(data) for data in raw_batch] + return batch + + +def decode(s_arc, s_rel, mask, tree=True): + + lens = np.sum(mask.astype(int), axis=-1) + arc_preds = np.argmax(s_arc, axis=-1) + + bad = [not istree(seq[: i + 1]) for i, seq in zip(lens, arc_preds)] + if tree and any(bad): + arc_preds[bad] = eisner(s_arc[bad], mask[bad]) + + rel_preds = np.argmax(s_rel, axis=-1) + rel_preds = [rel_pred[np.arange(len(arc_pred)), arc_pred] for arc_pred, rel_pred in zip(arc_preds, rel_preds)] + return arc_preds, rel_preds + + +class Predictor(object): + def __init__(self, model_dir, device): + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = [self.predictor.get_output_handle(name) for name in self.predictor.get_output_names()] + + def predict(self, data, vocabs): + word_vocab, _, rel_vocab = vocabs + word_pad_index = word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[CLS]") + word_eos_index = word_vocab.to_indices("[SEP]") + examples = [] + for text in data: + 
example = { + "FORM": text["FORM"], + "CPOS": text["CPOS"], + } + example = convert_example( + example, + vocabs=vocabs, + mode="test", + ) + examples.append(example) + + batches = [examples[idx : idx + args.batch_size] for idx in range(0, len(examples), args.batch_size)] + + arcs, rels = [], [] + for batch in batches: + words = batchify_fn(batch)[0] + words, position = flat_words(words, word_pad_index) + self.input_handles[0].copy_from_cpu(words) + self.input_handles[1].copy_from_cpu(position) + self.predictor.run() + s_arc = self.output_handle[0].copy_to_cpu() + s_rel = self.output_handle[1].copy_to_cpu() + words = self.output_handle[2].copy_to_cpu() + + mask = np.logical_and( + np.logical_and(words != word_pad_index, words != word_bos_index), + words != word_eos_index, + ) + + arc_preds, rel_preds = decode(s_arc, s_rel, mask, args.tree) + + arcs.extend([arc_pred[m] for arc_pred, m in zip(arc_preds, mask)]) + rels.extend([rel_pred[m] for rel_pred, m in zip(rel_preds, mask)]) + + arcs = [[str(s) for s in seq] for seq in arcs] + rels = [rel_vocab.to_tokens(seq) for seq in rels] + return arcs, rels + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_dir, args.device) + + # Load vocabs from model file path + vocabs = load_vocab(args.model_dir) + + test_ds = load_dataset(args.task_name, splits=["test"]) + test_ds_copy = copy.deepcopy(test_ds) + + pred_arcs, pred_rels = predictor.predict(test_ds, vocabs) + + with open(args.infer_output_file, "w", encoding="utf-8") as out_file: + for res, head, rel in zip(test_ds_copy, pred_arcs, pred_rels): + res["HEAD"] = tuple(head) + res["DEPREL"] = tuple(rel) + res = "\n".join("\t".join(map(str, line)) for line in zip(*res.values())) + "\n" + out_file.write("{}\n".format(res)) + out_file.close() + print("Results saved!") diff --git a/examples/dependency_parsing/ddparser/export_model.py b/examples/dependency_parsing/ddparser/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..97b6cabb5eb507072567b93fc33b2ff80d816eb8 --- /dev/null +++ b/examples/dependency_parsing/ddparser/export_model.py @@ -0,0 +1,85 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
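+
+# Export a trained dygraph DDParser checkpoint to a static graph inference model
+# (inference.pdmodel / inference.pdiparams) via paddle.jit.to_static, together with
+# the word/rel vocab json files, so it can be served by deploy/python/predict.py.
+# Example invocation (from the README):
+#   python export_model.py --encoding_model=ernie-gram-zh \
+#       --params_path=./model_file/best.pdparams --output_path=./output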
+ +import argparse +import os + +import paddle +from data import load_vocab +from model.dep import BiAffineParser + +from paddlenlp.transformers import AutoModel + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--encoding_model", choices=["lstm-pe", "ernie-1.0", "ernie-3.0-medium-zh", "ernie-tiny", "ernie-gram-zh"], type=str, default="ernie-3.0-medium-zh", help="Select the encoding model.") +parser.add_argument("--params_path", type=str, required=True, default='./model_file/best.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + + # Load pretrained model if encoding model is ernie-3.0-medium-zh, ernie-1.0, ernie-tiny or ernie-gram-zh + if args.encoding_model in ["ernie-3.0-medium-zh", "ernie-1.0", "ernie-tiny", "ernie-gram-zh"]: + pretrained_model = AutoModel.from_pretrained(args.encoding_model) + else: + pretrained_model = None + + # Load vocabs from model file path + vocab_dir = os.path.split(args.params_path)[0] + word_vocab, _, rel_vocab = load_vocab(vocab_dir) + + if not os.path.exists(args.output_path): + os.makedirs(args.output_path) + + # Save vocabs to output path + word_vocab.to_json(path=os.path.join(args.output_path, "word_vocab.json")) + rel_vocab.to_json(path=os.path.join(args.output_path, "rel_vocab.json")) + + n_rels, n_words, n_feats = len(rel_vocab), len(word_vocab), None + + word_pad_index = word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[CLS]") + word_eos_index = word_vocab.to_indices("[SEP]") + + # Load ddparser model + model = BiAffineParser( + encoding_model=args.encoding_model, + feat=None, + n_rels=n_rels, + n_feats=n_feats, + n_words=n_words, + pad_index=word_pad_index, + eos_index=word_eos_index, + pretrained_model=pretrained_model, + ) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/dependency_parsing/ddparser/metric.py b/examples/dependency_parsing/ddparser/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..ad3256fc3c5906a334ffeb792b6af72ef3faaa7f --- /dev/null +++ b/examples/dependency_parsing/ddparser/metric.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
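+
+# UAS/LAS evaluation for the dependency parser. During evaluation, call
+# metric.update(arc_preds, rel_preds, arcs, rels, mask) after each batch and
+# metric.accumulate() at the end to obtain the (uas, las) tuple over all masked tokens.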
+ +import numpy as np + +import paddle +from paddle.metric import Metric + + +class ParserEvaluator(Metric): + """ + UAS and LAS for dependency parser. + + UAS = number of words assigned correct head / total words + LAS = number of words assigned correct head and relation / total words + """ + + def __init__(self, name="ParserEvaluator", eps=1e-8): + super(ParserEvaluator, self).__init__() + + self.eps = eps + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. + """ + self.total = 0.0 + self.correct_arcs = 0.0 + self.correct_rels = 0.0 + + def update(self, arc_preds, rel_preds, arcs, rels, mask): + select = paddle.nonzero(mask) + arc_mask = paddle.gather_nd(arc_preds == arcs, select) + rel_mask = paddle.logical_and(paddle.gather_nd(rel_preds == rels, select), arc_mask) + + self.total += len(arc_mask) + self.correct_arcs += np.sum(arc_mask.numpy()).item() + self.correct_rels += np.sum(rel_mask.numpy()).item() + + def accumulate(self): + uas = self.correct_arcs / (self.total + self.eps) + las = self.correct_rels / (self.total + self.eps) + return uas, las + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/examples/dependency_parsing/ddparser/model/dep.py b/examples/dependency_parsing/ddparser/model/dep.py new file mode 100644 index 0000000000000000000000000000000000000000..73d7659d85f02478ad83adcc8949ecc85721b836 --- /dev/null +++ b/examples/dependency_parsing/ddparser/model/dep.py @@ -0,0 +1,140 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
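+
+# BiAffine dependency parser: an encoder (LSTMEncoder, LSTMByWPEncoder or ErnieEncoder)
+# produces token representations, four MLP heads project them into arc/rel spaces,
+# and two BiAffine layers score head attachments s_arc of shape (batch, seq, seq)
+# and relations s_rel of shape (batch, seq, seq, n_rels).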
+ +import paddle +import paddle.nn as nn + +from model.dropouts import SharedDropout +from model.encoder import LSTMEncoder, LSTMByWPEncoder, ErnieEncoder + + +class BiAffineParser(nn.Layer): + """DDParser""" + + def __init__( + self, + encoding_model, + feat, + n_rels, + n_feats, + n_words, + pad_index, + eos_index, + pretrained_model=None, + n_mlp_arc=500, + n_mlp_rel=100, + mlp_dropout=0.33, + ): + super(BiAffineParser, self).__init__() + self.pad_index = pad_index + self.eos_index = eos_index + + if encoding_model == "lstm": + self.embed = LSTMEncoder(feat, n_feats, n_words) + elif encoding_model == "lstm-pe": + self.embed = LSTMByWPEncoder(n_words, pad_index) + else: + self.embed = ErnieEncoder(pad_index, pretrained_model) + + # MLP layer + self.mlp_arc_h = MLP(n_in=self.embed.mlp_input_size, n_out=n_mlp_arc, dropout=mlp_dropout) + self.mlp_arc_d = MLP(n_in=self.embed.mlp_input_size, n_out=n_mlp_arc, dropout=mlp_dropout) + self.mlp_rel_h = MLP(n_in=self.embed.mlp_input_size, n_out=n_mlp_rel, dropout=mlp_dropout) + self.mlp_rel_d = MLP(n_in=self.embed.mlp_input_size, n_out=n_mlp_rel, dropout=mlp_dropout) + + # Biaffine layer + self.arc_attn = BiAffine(n_in=n_mlp_arc, bias_x=True, bias_y=False) + self.rel_attn = BiAffine(n_in=n_mlp_rel, n_out=n_rels, bias_x=True, bias_y=True) + + def forward(self, words, feats): + + words, x = self.embed(words, feats) + mask = paddle.logical_and(words != self.pad_index, words != self.eos_index) + + arc_h = self.mlp_arc_h(x) + arc_d = self.mlp_arc_d(x) + rel_h = self.mlp_rel_h(x) + rel_d = self.mlp_rel_d(x) + + # Get arc and rel scores from the bilinear attention + # Shape: (batch_size, seq_len, seq_len) + s_arc = self.arc_attn(arc_d, arc_h) + # Shape: (batch_size, seq_len, seq_len, n_rels) + s_rel = paddle.transpose(self.rel_attn(rel_d, rel_h), perm=[0, 2, 3, 1]) + # Set the scores that exceed the length of each sentence to -1e5 + s_arc_mask = paddle.unsqueeze(mask, 1) + s_arc = s_arc * s_arc_mask + paddle.scale( + paddle.cast(s_arc_mask, "int32"), scale=1e5, bias=-1, bias_after_scale=False + ) + return s_arc, s_rel, words + + +class MLP(nn.Layer): + """MLP""" + + def __init__(self, n_in, n_out, dropout=0): + super(MLP, self).__init__() + + self.linear = nn.Linear( + n_in, + n_out, + weight_attr=nn.initializer.XavierNormal(), + ) + self.leaky_relu = nn.LeakyReLU(negative_slope=0.1) + self.dropout = SharedDropout(p=dropout) + + def forward(self, x): + # Shape: (batch_size, output_size) + x = self.linear(x) + x = self.leaky_relu(x) + x = self.dropout(x) + return x + + +class BiAffine(nn.Layer): + """BiAffine""" + + def __init__(self, n_in, n_out=1, bias_x=True, bias_y=True): + super(BiAffine, self).__init__() + + self.n_in = n_in + self.n_out = n_out + self.bias_x = bias_x + self.bias_y = bias_y + self.weight = self.create_parameter(shape=[n_out, n_in + bias_x, n_in + bias_y], dtype="float32") + + def forward(self, x, y): + if self.bias_x: + x = paddle.concat([x, paddle.ones_like(x[:, :, :1])], axis=-1) + if self.bias_y: + y = paddle.concat([y, paddle.ones_like(x[:, :, :1])], axis=-1) + # Shape x: (batch_size, num_tokens, input_size + bias_x) + b = x.shape[0] + o = self.weight.shape[0] + # Shape x: (batch_size, output_size, num_tokens, input_size + bias_x) + x = paddle.expand(paddle.unsqueeze(x, axis=1), shape=(x.shape[0], o, x.shape[1], x.shape[2])) + # Shape y: (batch_size, output_size, num_tokens, input_size + bias_y) + y = paddle.expand(paddle.unsqueeze(y, axis=1), shape=(y.shape[0], o, y.shape[1], y.shape[2])) + # Shape weight: (batch_size, 
output_size, input_size + bias_x, input_size + bias_y) + weight = paddle.expand( + paddle.unsqueeze(self.weight, axis=0), + shape=(b, self.weight.shape[0], self.weight.shape[1], self.weight.shape[2]), + ) + + # Shape: (batch_size, output_size, num_tokens, num_tokens) + s = paddle.matmul(paddle.matmul(x, weight), paddle.transpose(y, perm=[0, 1, 3, 2])) + # Remove dim 1 if n_out == 1 + if s.shape[1] == 1: + s = paddle.squeeze(s, axis=1) + return s diff --git a/examples/dependency_parsing/ddparser/model/dropouts.py b/examples/dependency_parsing/ddparser/model/dropouts.py new file mode 100644 index 0000000000000000000000000000000000000000..1aa8c5ac1723a8bb8f207ea3da5be337de326667 --- /dev/null +++ b/examples/dependency_parsing/ddparser/model/dropouts.py @@ -0,0 +1,63 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + + +class SharedDropout(nn.Layer): + """SharedDropout""" + + def __init__(self, p=0.5, batch_first=True): + super(SharedDropout, self).__init__() + + self.p = p + self.batch_first = batch_first + + def forward(self, x): + """Forward network""" + if self.training and self.p > 0: + if self.batch_first: + mask = self.get_mask(x[:, 0], self.p) + else: + mask = self.get_mask(x[0], self.p) + x *= paddle.unsqueeze(mask, axis=1) if self.batch_first else mask + return x + + @staticmethod + def get_mask(x, p): + """Generate the mask matrix of the dropout by the input.""" + mask = paddle.uniform(shape=x.shape, min=0, max=1) >= p + mask = paddle.cast(mask, "float32") + mask = mask / (1 - p) + return mask + + +class IndependentDropout(nn.Layer): + """IndependentDropout""" + + def __init__(self, p=0.5): + super(IndependentDropout, self).__init__() + self.p = p + + def forward(self, *items): + """Forward network""" + if self.training and self.p > 0: + masks = [paddle.uniform(shape=x.shape[:2], min=0, max=1) >= self.p for x in items] + masks = [paddle.cast(x, "float32") for x in masks] + total = paddle.add(*masks) + scale = len(items) / paddle.maximum(total, paddle.ones_like(total)) + masks = [mask * scale for mask in masks] + items = [item * paddle.unsqueeze(mask, axis=-1) for item, mask in zip(items, masks)] + return items diff --git a/examples/dependency_parsing/ddparser/model/encoder.py b/examples/dependency_parsing/ddparser/model/encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..6c319b2e736addf8d69ea47920e6e1a0075a5ef1 --- /dev/null +++ b/examples/dependency_parsing/ddparser/model/encoder.py @@ -0,0 +1,165 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +from model.dropouts import IndependentDropout, SharedDropout +from utils import index_sample, pad_sequence_paddle + + +class ErnieEncoder(nn.Layer): + def __init__(self, pad_index, pretrained_model): + super(ErnieEncoder, self).__init__() + self.pad_index = pad_index + self.ptm = pretrained_model + self.mlp_input_size = self.ptm.config["hidden_size"] + + def forward(self, words, wp): + x, _ = self.ptm(words) + x = paddle.reshape( + index_sample(x, wp), + shape=[wp.shape[0], wp.shape[1], x.shape[2]], + ) + words = index_sample(words, wp) + return words, x + + +class LSTMByWPEncoder(nn.Layer): + def __init__( + self, + n_words, + pad_index, + lstm_by_wp_embed_size=200, + n_embed=300, + n_lstm_hidden=300, + n_lstm_layers=3, + lstm_dropout=0.33, + ): + super(LSTMByWPEncoder, self).__init__() + self.pad_index = pad_index + self.word_embed = nn.Embedding(n_words, lstm_by_wp_embed_size) + + self.lstm = nn.LSTM( + input_size=lstm_by_wp_embed_size, + hidden_size=n_lstm_hidden, + num_layers=n_lstm_layers, + dropout=lstm_dropout, + direction="bidirectional", + ) + + self.lstm_dropout = SharedDropout(p=lstm_dropout) + self.mlp_input_size = n_lstm_hidden * 2 + + def forward(self, words, wp): + + word_embed = self.word_embed(words) + mask = words != self.pad_index + seq_lens = paddle.sum(paddle.cast(mask, "int32"), axis=-1) + + x, _ = self.lstm(word_embed, sequence_length=seq_lens) + x = paddle.reshape( + index_sample(x, wp), + shape=[wp.shape[0], wp.shape[1], x.shape[2]], + ) + words = paddle.index_sample(words, wp) + x = self.lstm_dropout(x) + return words, x + + +class LSTMEncoder(nn.Layer): + def __init__( + self, + feat, + n_feats, + n_words, + pad_index=0, + feat_pad_index=0, + n_char_embed=50, + n_feat_embed=60, + n_lstm_char_embed=100, + n_embed=300, + embed_dropout=0.33, + n_lstm_hidden=300, + n_lstm_layers=3, + lstm_dropout=0.33, + ): + super(LSTMEncoder, self).__init__() + self.pad_index = pad_index + + if feat == "char": + self.feat_embed = CharLSTMEncoder( + n_chars=n_feats, + n_embed=n_char_embed, + n_out=n_lstm_char_embed, + pad_index=feat_pad_index, + ) + feat_embed_size = n_lstm_char_embed + else: + self.feat_embed = nn.Embedding(num_embeddings=n_feats, embedding_dim=n_feat_embed) + feat_embed_size = n_feat_embed + + self.word_embed = nn.Embedding(num_embeddings=n_words, embedding_dim=n_embed) + self.embed_dropout = IndependentDropout(p=embed_dropout) + + self.lstm = nn.LSTM( + input_size=n_embed + feat_embed_size, + hidden_size=n_lstm_hidden, + num_layers=n_lstm_layers, + dropout=lstm_dropout, + direction="bidirectional", + ) + self.lstm_dropout = SharedDropout(p=lstm_dropout) + self.mlp_input_size = n_lstm_hidden * 2 + + def forward(self, words, feats): + word_embed = self.word_embed(words) + feat_embed = self.feat_embed(feats) + word_embed, feat_embed = self.embed_dropout(word_embed, feat_embed) + embed = paddle.concat([word_embed, feat_embed], axis=-1) + mask = words != self.pad_index + seq_lens = paddle.sum(paddle.cast(mask, "int32"), axis=-1) + x, _ = self.lstm(embed, sequence_length=seq_lens) + x = self.lstm_dropout(x) + return 
words, x + + +class CharLSTMEncoder(nn.Layer): + def __init__(self, n_chars, n_embed, n_out, pad_index=0): + super(CharLSTMEncoder, self).__init__() + self.n_chars = n_chars + self.n_embed = n_embed + self.n_out = n_out + self.pad_index = pad_index + + # the embedding layer + self.embed = nn.Embedding(num_embeddings=n_chars, embedding_dim=n_embed) + # the lstm layer + self.lstm = nn.LSTM(input_size=n_embed, hidden_size=n_out // 2, direction="bidirectional") + + def forward(self, x): + """Forward network""" + mask = paddle.any(x != self.pad_index, axis=-1) + + lens = paddle.sum(paddle.cast(mask, "int32"), axis=-1) + select = paddle.nonzero(mask) + masked_x = paddle.gather_nd(x, select) + char_mask = masked_x != self.pad_index + emb = self.embed(masked_x) + word_lens = paddle.sum(paddle.cast(char_mask, "int32"), axis=-1) + _, (h, _) = self.lstm(emb, sequence_length=word_lens) + h = paddle.concat(paddle.unstack(h), axis=-1) + + feat_embed = pad_sequence_paddle(h, lens, pad_index=self.pad_index) + + return feat_embed diff --git a/examples/dependency_parsing/ddparser/predict.py b/examples/dependency_parsing/ddparser/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..53c1835cc919353d7cc02d35e949710eb9c725e4 --- /dev/null +++ b/examples/dependency_parsing/ddparser/predict.py @@ -0,0 +1,187 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
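+
+# Batch prediction with a trained checkpoint: loads the vocabs saved next to
+# --params_path, decodes arc and relation predictions, and writes CoNLL-style
+# results to --infer_output_file. The --encoding_model and --feat flags must
+# match the values used at training time (see the README).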
+ +import argparse +import copy +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, load_vocab +from model.dep import BiAffineParser +from utils import decode, flat_words + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel + +# fmt: off +parser = argparse.ArgumentParser() +# Predict +parser.add_argument("--params_path", type=str, default='model_file/best.pdparams', required=True, help="Directory to load model parameters.") +parser.add_argument("--task_name", choices=["nlpcc13_evsam05_thu", "nlpcc13_evsam05_hit"], type=str, default="nlpcc13_evsam05_thu", help="Select the task.") +parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--encoding_model", choices=["lstm", "lstm-pe", "ernie-3.0-medium-zh", "ernie-1.0", "ernie-tiny", "ernie-gram-zh"], type=str, default="ernie-3.0-medium-zh", help="Select the encoding model.") +parser.add_argument("--batch_size", type=int, default=1000, help="Numbers of examples a batch for training.") +parser.add_argument("--infer_output_file", type=str, default='infer_output.conll', help="The path to save infer results.") +# Preprocess +parser.add_argument("--n_buckets", type=int, default=15, help="Number of buckets to devide the dataset.") +# Postprocess +parser.add_argument("--tree", type=bool, default=True, help="Ensure the output conforms to the tree structure.") +# Lstm +parser.add_argument("--feat", choices=["char", "pos"], type=str, default=None, help="The feature representation to use.") +args = parser.parse_args() +# fmt: on + + +@paddle.no_grad() +def batch_predict( + model, + data_loader, + rel_vocab, + word_pad_index, + word_bos_index, + word_eos_index, +): + + model.eval() + arcs, rels = [], [] + for inputs in data_loader(): + if args.encoding_model.startswith("ernie") or args.encoding_model == "lstm-pe": + words = inputs[0] + words, feats = flat_words(words) + s_arc, s_rel, words = model(words, feats) + else: + words, feats = inputs + s_arc, s_rel, words = model(words, feats) + + mask = paddle.logical_and( + paddle.logical_and(words != word_pad_index, words != word_bos_index), + words != word_eos_index, + ) + + lens = paddle.sum(paddle.cast(mask, "int32"), axis=-1) + arc_preds, rel_preds = decode(s_arc, s_rel, mask) + arcs.extend(paddle.split(paddle.masked_select(arc_preds, mask), lens.numpy().tolist())) + rels.extend(paddle.split(paddle.masked_select(rel_preds, mask), lens.numpy().tolist())) + + arcs = [[str(s) for s in seq.numpy().tolist()] for seq in arcs] + rels = [rel_vocab.to_tokens(seq.numpy().tolist()) for seq in rels] + + return arcs, rels + + +def do_predict(args): + paddle.set_device(args.device) + + # if args.encoding_model.startswith("ernie"): + # tokenizer = AutoTokenizer.from_pretrained(args.encoding_model) + # elif args.encoding_model == "lstm-pe": + # tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + # else: + # tokenizer = None + + # Load vocabs from model file path + vocab_dir = os.path.split(args.params_path)[0] + word_vocab, feat_vocab, rel_vocab = load_vocab(vocab_dir) + + n_rels, n_words = len(rel_vocab), len(word_vocab) + if args.encoding_model == "lstm": + n_feats = len(feat_vocab) + word_pad_index = word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[BOS]") + word_eos_index = word_vocab.to_indices("[EOS]") + else: + n_feats = None + word_pad_index = 
word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[CLS]") + word_eos_index = word_vocab.to_indices("[SEP]") + + test_ds = load_dataset(args.task_name, splits=["test"]) + test_ds_copy = copy.deepcopy(test_ds) + + trans_fn = partial( + convert_example, + vocabs=[word_vocab, feat_vocab, rel_vocab], + encoding_model=args.encoding_model, + feat=args.feat, + mode="test", + ) + + test_data_loader, buckets = create_dataloader( + test_ds, + batch_size=args.batch_size, + mode="test", + n_buckets=args.n_buckets, + trans_fn=trans_fn, + ) + + # Load pretrained model if encoding model is ernie-3.0-medium-zh, ernie-1.0, ernie-tiny or ernie-gram-zh + if args.encoding_model in ["ernie-3.0-medium-zh", "ernie-1.0", "ernie-tiny"]: + pretrained_model = AutoModel.from_pretrained(args.encoding_model) + elif args.encoding_model == "ernie-gram-zh": + pretrained_model = AutoModel.from_pretrained(args.encoding_model) + else: + pretrained_model = None + + # Load model + model = BiAffineParser( + encoding_model=args.encoding_model, + feat=args.feat, + n_rels=n_rels, + n_feats=n_feats, + n_words=n_words, + pad_index=word_pad_index, + eos_index=word_eos_index, + pretrained_model=pretrained_model, + ) + + # Load saved model parameters + if os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("The parameters path is incorrect or not specified.") + + # Start predict + pred_arcs, pred_rels = batch_predict( + model, + test_data_loader, + rel_vocab, + word_pad_index, + word_bos_index, + word_eos_index, + ) + + # Restore the order of sentences in the buckets + if buckets: + indices = np.argsort(np.array([i for bucket in buckets.values() for i in bucket])) + else: + indices = range(len(pred_arcs)) + pred_heads = [pred_arcs[i] for i in indices] + pred_deprels = [pred_rels[i] for i in indices] + + with open(args.infer_output_file, "w", encoding="utf-8") as out_file: + for res, head, rel in zip(test_ds_copy, pred_heads, pred_deprels): + res["HEAD"] = tuple(head) + res["DEPREL"] = tuple(rel) + res = "\n".join("\t".join(map(str, line)) for line in zip(*res.values())) + "\n" + out_file.write("{}\n".format(res)) + out_file.close() + print("Results saved!") + + +if __name__ == "__main__": + do_predict(args) diff --git a/examples/dependency_parsing/ddparser/train.py b/examples/dependency_parsing/ddparser/train.py new file mode 100644 index 0000000000000000000000000000000000000000..473391fb0d68fb2c77825845c2f888036107d6fb --- /dev/null +++ b/examples/dependency_parsing/ddparser/train.py @@ -0,0 +1,300 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
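+
+# Training entry point: builds vocabs from the train split, trains the BiAffine
+# parser and evaluates UAS/LAS on the dev split after each epoch. The checkpoint
+# with the best dev LAS is saved as <save_dir>/best.pdparams and the latest epoch
+# as <save_dir>/last_epoch.pdparams (see the README for launch commands).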
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from criterion import ParserCriterion +from data import build_vocab, convert_example, create_dataloader +from metric import ParserEvaluator +from model.dep import BiAffineParser +from utils import decode, flat_words + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.transformers.optimization import LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +# Train +parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--task_name", choices=["nlpcc13_evsam05_thu", "nlpcc13_evsam05_hit"], type=str, default="nlpcc13_evsam05_thu", help="Select the task.") +parser.add_argument("--encoding_model", choices=["lstm", "lstm-pe", "ernie-3.0-medium-zh", "ernie-1.0", "ernie-tiny", "ernie-gram-zh"], type=str, default="ernie-3.0-medium-zh", help="Select the encoding model.") +parser.add_argument("--epochs", type=int, default=100, help="Number of epoches for training.") +parser.add_argument("--save_dir", type=str, default='model_file/', help="Directory to save model parameters.") +parser.add_argument("--batch_size", type=int, default=1000, help="Numbers of examples a batch for training.") +parser.add_argument("--init_from_params", type=str, default=None, help="The path of model parameters to be loaded.") +parser.add_argument("--clip", type=float, default=1.0, help="The threshold of gradient clip.") +parser.add_argument("--lstm_lr", type=float, default=0.002, help="The Learning rate of lstm encoding model.") +parser.add_argument("--ernie_lr", type=float, default=5e-05, help="The Learning rate of ernie encoding model.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +# Preprocess +parser.add_argument("--n_buckets", type=int, default=15, help="Number of buckets to devide the dataset.") +# Postprocess +parser.add_argument("--tree", type=bool, default=True, help="Ensure the output conforms to the tree structure.") +# Lstm +parser.add_argument("--feat", choices=["char", "pos"], type=str, default=None, help="The feature representation to use.") +# Ernie +parser.add_argument("--warmup_proportion", type=float, default=0.0, help="Linear warmup proportion over total steps.") +parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay if we apply some.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def batch_evaluate( + model, + metric, + criterion, + data_loader, + word_pad_index, + word_bos_index, + word_eos_index, +): + model.eval() + metric.reset() + losses = [] + for batch in data_loader(): + if args.encoding_model.startswith("ernie") or args.encoding_model == "lstm-pe": + words, arcs, rels = batch + words, feats = flat_words(words) + s_arc, s_rel, words = model(words, feats) + else: + words, feats, arcs, rels = batch + s_arc, s_rel, words = model(words, feats) + + mask = paddle.logical_and( + paddle.logical_and(words != word_pad_index, words != word_bos_index), + words != word_eos_index, + ) + + loss = criterion(s_arc, s_rel, arcs, rels, mask) + + losses.append(loss.item()) + + arc_preds, rel_preds = decode(s_arc, s_rel, mask) + metric.update(arc_preds, rel_preds, arcs, rels, mask) + uas, las = metric.accumulate() + 
total_loss = np.mean(losses) + model.train() + metric.reset() + return total_loss, uas, las + + +def do_train(args): + set_seed(args.seed) + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + if args.encoding_model == "ernie-gram-zh": + tokenizer = AutoTokenizer.from_pretrained(args.encoding_model) + elif args.encoding_model.startswith("ernie"): + tokenizer = AutoTokenizer.from_pretrained(args.encoding_model) + elif args.encoding_model == "lstm-pe": + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + else: + tokenizer = None + + train_ds, dev_ds = load_dataset(args.task_name, splits=["train", "dev"]) + + # Build the vocabs based on train corpus + word_examples = [seq["FORM"] for seq in train_ds] + if args.feat == "pos": + feat_examples = [seq["CPOS"] for seq in train_ds] + elif args.feat == "char": + feat_examples = [token for seq in train_ds for token in seq["FORM"]] + else: + feat_examples = None + rel_examples = [seq["DEPREL"] for seq in train_ds] + + train_corpus = [word_examples, feat_examples, rel_examples] + vocabs = build_vocab( + train_corpus, + tokenizer, + encoding_model=args.encoding_model, + feat=args.feat, + ) + word_vocab, feat_vocab, rel_vocab = vocabs + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + # Save vocabs into json file + word_vocab.to_json(path=os.path.join(args.save_dir, "word_vocab.json")) + rel_vocab.to_json(path=os.path.join(args.save_dir, "rel_vocab.json")) + + if feat_vocab: + n_feats = len(feat_vocab) + feat_vocab.to_json(path=os.path.join(args.save_dir, "feat_vocab.json")) + word_pad_index = word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[BOS]") + word_eos_index = word_vocab.to_indices("[EOS]") + else: + n_feats = None + word_pad_index = word_vocab.to_indices("[PAD]") + word_bos_index = word_vocab.to_indices("[CLS]") + word_eos_index = word_vocab.to_indices("[SEP]") + + n_rels, n_words = len(rel_vocab), len(word_vocab) + + trans_fn = partial( + convert_example, + vocabs=vocabs, + encoding_model=args.encoding_model, + feat=args.feat, + ) + + train_data_loader, _ = create_dataloader( + train_ds, + batch_size=args.batch_size, + mode="train", + n_buckets=args.n_buckets, + trans_fn=trans_fn, + ) + dev_data_loader, _ = create_dataloader( + dev_ds, + batch_size=args.batch_size, + mode="dev", + n_buckets=args.n_buckets, + trans_fn=trans_fn, + ) + + # Load pretrained model if encoding model is ernie-3.0-medium-zh, ernie-1.0, ernie-tiny or ernie-gram-zh + if args.encoding_model in ["ernie-3.0-medium-zh", "ernie-1.0", "ernie-tiny", "ernie-gram-zh"]: + pretrained_model = AutoModel.from_pretrained(args.encoding_model) + else: + pretrained_model = None + + # Load ddparser model + model = BiAffineParser( + encoding_model=args.encoding_model, + feat=args.feat, + n_rels=n_rels, + n_feats=n_feats, + n_words=n_words, + pad_index=word_pad_index, + eos_index=word_eos_index, + pretrained_model=pretrained_model, + ) + + # Define learning rate + if args.encoding_model.startswith("ernie"): + lr = args.ernie_lr + else: + lr = args.lstm_lr + + # Continue training from a pretrained model if the checkpoint is specified + if args.init_from_params and os.path.isfile(args.init_from_params): + state_dict = paddle.load(args.init_from_params) + model.set_dict(state_dict) + + # Data parallel for distributed training + model = paddle.DataParallel(model) + + num_training_steps = len(list(train_data_loader)) * 
args.epochs + + # Define the training strategy + lr_scheduler = LinearDecayWithWarmup(lr, num_training_steps, args.warmup_proportion) + grad_clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.clip) + if args.encoding_model.startswith("ernie"): + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=grad_clip, + ) + else: + optimizer = paddle.optimizer.Adam( + learning_rate=lr, + beta1=0.9, + beta2=0.9, + epsilon=1e-12, + parameters=model.parameters(), + grad_clip=grad_clip, + ) + + # Load metric and criterion + best_las = 0 + metric = ParserEvaluator() + criterion = ParserCriterion() + + # Epoch train + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for inputs in train_data_loader(): + if args.encoding_model.startswith("ernie") or args.encoding_model == "lstm-pe": + words, arcs, rels = inputs + words, feats = flat_words(words) + s_arc, s_rel, words = model(words, feats) + else: + words, feats, arcs, rels = inputs + s_arc, s_rel, words = model(words, feats) + + mask = paddle.logical_and( + paddle.logical_and(words != word_pad_index, words != word_bos_index), + words != word_eos_index, + ) + + loss = criterion(s_arc, s_rel, arcs, rels, mask) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss.item(), 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + if rank == 0: + # Evaluate on dev dataset + loss, uas, las = batch_evaluate( + model, + metric, + criterion, + dev_data_loader, + word_pad_index, + word_bos_index, + word_eos_index, + ) + print("eval loss: %.5f, UAS: %.2f%%, LAS: %.2f%%" % (loss, uas * 100, las * 100)) + # Save model parameter of last epoch + save_param_path = os.path.join(args.save_dir, "last_epoch.pdparams") + paddle.save(model.state_dict(), save_param_path) + # Save the model if it get a higher score of las + if las > best_las: + save_param_path = os.path.join(args.save_dir, "best.pdparams") + paddle.save(model.state_dict(), save_param_path) + best_las = las + + +if __name__ == "__main__": + do_train(args) diff --git a/examples/dependency_parsing/ddparser/utils.py b/examples/dependency_parsing/ddparser/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..807d4be437d3d074ddde0801e558d4c958faf044 --- /dev/null +++ b/examples/dependency_parsing/ddparser/utils.py @@ -0,0 +1,436 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
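+
+# Shared helpers for the DDParser example: numpy/paddle padding utilities,
+# flat_words for wordpiece-style inputs, index_sample/mask_fill tensor helpers,
+# kmeans-based length bucketing, and the decoding utilities (decode, backtrack,
+# stripe) used to keep the predicted arcs a well-formed dependency tree.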
+ +import copy + +import numpy as np +import paddle + +from paddlenlp.data import Pad + + +def decode(s_arc, s_rel, mask, tree=True): + """Decode function""" + mask = mask.numpy() + lens = np.sum(mask, -1) + # Prevent self-loops + arc_preds = paddle.argmax(s_arc, axis=-1).numpy() + bad = [not istree(seq[: i + 1]) for i, seq in zip(lens, arc_preds)] + if tree and any(bad): + arc_preds[bad] = eisner(s_arc.numpy()[bad], mask[bad]) + arc_preds = paddle.to_tensor(arc_preds) + rel_preds = paddle.argmax(s_rel, axis=-1) + rel_preds = index_sample(rel_preds, paddle.unsqueeze(arc_preds, axis=-1)) + rel_preds = paddle.squeeze(rel_preds, axis=-1) + return arc_preds, rel_preds + + +def pad_sequence(sequences, padding_value=0, fix_len=None): + """Fill sequences(np.ndarray) into a fixed-length matrix.""" + max_size = sequences[0].shape + trailing_dims = max_size[1:] + max_len = max([s.shape[0] for s in sequences]) + if fix_len is not None: + assert fix_len >= max_len, "fix_len is too small." + max_len = fix_len + out_dims = (len(sequences), max_len) + trailing_dims + out_tensor = np.full(out_dims, padding_value, dtype=sequences[0].dtype) + for i, tensor in enumerate(sequences): + length = tensor.shape[0] + out_tensor[i, :length, ...] = tensor + return out_tensor + + +def pad_sequence_paddle(inputs, lens, pad_index=0): + sequences = [] + idx = 0 + for l in lens: + sequences.append(np.array(inputs[idx : idx + l])) + idx += l + outputs = Pad(pad_val=pad_index)(sequences) + output_tensor = paddle.to_tensor(outputs) + return output_tensor + + +def fill_diagonal(x, value, offset=0, dim1=0, dim2=1): + """Fill value into the diagoanl of x that offset is ${offset} and the coordinate system is (dim1, dim2).""" + strides = x.strides + shape = x.shape + if dim1 > dim2: + dim1, dim2 = dim2, dim1 + assert 0 <= dim1 < dim2 <= 2 + assert len(x.shape) == 3 + assert shape[dim1] == shape[dim2] + + dim_sum = dim1 + dim2 + dim3 = 3 - dim_sum + if offset >= 0: + diagonal = np.lib.stride_tricks.as_strided( + x[:, offset:] if dim_sum == 1 else x[:, :, offset:], + shape=(shape[dim3], shape[dim1] - offset), + strides=(strides[dim3], strides[dim1] + strides[dim2]), + ) + else: + diagonal = np.lib.stride_tricks.as_strided( + x[-offset:, :] if dim_sum in [1, 2] else x[:, -offset:], + shape=(shape[dim3], shape[dim1] + offset), + strides=(strides[dim3], strides[dim1] + strides[dim2]), + ) + + diagonal[...] = value + return x + + +def backtrack(p_i, p_c, heads, i, j, complete): + """Backtrack the position matrix of eisner to generate the tree""" + if i == j: + return + if complete: + r = p_c[i, j] + backtrack(p_i, p_c, heads, i, r, False) + backtrack(p_i, p_c, heads, r, j, True) + else: + r, heads[j] = p_i[i, j], i + i, j = sorted((i, j)) + backtrack(p_i, p_c, heads, i, r, True) + backtrack(p_i, p_c, heads, j, r + 1, True) + + +def stripe(x, n, w, offset=(0, 0), dim=1): + """ + Returns a diagonal stripe of the tensor. + + Args: + x (Tensor): the input tensor with 2 or more dims. + n (int): the length of the stripe. + w (int): the width of the stripe. + offset (tuple): the offset of the first two dims. + dim (int): 0 if returns a horizontal stripe; 1 else. 
+ + Example: + >>> x = np.arange(25).reshape(5, 5) + >>> x + tensor([[ 0, 1, 2, 3, 4], + [ 5, 6, 7, 8, 9], + [10, 11, 12, 13, 14], + [15, 16, 17, 18, 19], + [20, 21, 22, 23, 24]]) + >>> stripe(x, 2, 3, (1, 1)) + tensor([[ 6, 7, 8], + [12, 13, 14]]) + >>> stripe(x, 2, 3, dim=0) + tensor([[ 0, 5, 10], + [ 6, 11, 16]]) + """ + if not x.flags["C_CONTIGUOUS"]: + x = np.ascontiguousarray(x) + strides = x.strides + m = strides[0] + strides[1] + k = strides[1] if dim == 1 else strides[0] + return np.lib.stride_tricks.as_strided( + x[offset[0] :, offset[1] :], shape=[n, w] + list(x.shape[2:]), strides=[m, k] + list(strides[2:]) + ) + + +def flat_words(words, pad_index=0): + mask = words != pad_index + lens = paddle.sum(paddle.cast(mask, "int64"), axis=-1) + position = paddle.cumsum(lens + paddle.cast((lens == 0), "int64"), axis=1) - 1 + select = paddle.nonzero(mask) + words = paddle.gather_nd(words, select) + lens = paddle.sum(lens, axis=-1) + words = pad_sequence_paddle(words, lens, pad_index) + max_len = words.shape[1] + position = mask_fill(position, position >= max_len, max_len - 1) + return words, position + + +def index_sample(x, index): + """ + Select input value according to index + + Arags: + input: input matrix + index: index matrix + Returns: + output + >>> input + [ + [1, 2, 3], + [4, 5, 6] + ] + >>> index + [ + [1, 2], + [0, 1] + ] + >>> index_sample(input, index) + [ + [2, 3], + [4, 5] + ] + """ + x_s = x.shape + dim = len(index.shape) - 1 + assert x_s[:dim] == index.shape[:dim] + + if len(x_s) == 3 and dim == 1: + r_x = paddle.reshape(x, shape=[-1, x_s[1], x_s[-1]]) + else: + r_x = paddle.reshape(x, shape=[-1, x_s[-1]]) + + index = paddle.reshape(index, shape=[len(r_x), -1, 1]) + # Generate arange index, shape like index + arr_index = paddle.arange(start=0, end=len(index), dtype=index.dtype) + arr_index = paddle.unsqueeze(arr_index, axis=[1, 2]) + arr_index = paddle.expand(arr_index, index.shape) + # Genrate new index + new_index = paddle.concat((arr_index, index), -1) + new_index = paddle.reshape(new_index, (-1, 2)) + # Get output + out = paddle.gather_nd(r_x, new_index) + if len(x_s) == 3 and dim == 2: + out = paddle.reshape(out, shape=[x_s[0], x_s[1], -1]) + else: + out = paddle.reshape(out, shape=[x_s[0], -1]) + return out + + +def mask_fill(input, mask, value): + """ + Fill value to input according to mask + + Args: + input: input matrix + mask: mask matrix + value: Fill value + + Returns: + output + + >>> input + [ + [1, 2, 3], + [4, 5, 6] + ] + >>> mask + [ + [True, True, False], + [True, False, False] + ] + >>> mask_fill(input, mask, 0) + [ + [1, 2, 0], + [4, 0, 0] + ] + """ + return input * paddle.logical_not(mask) + paddle.cast(mask, input.dtype) * value + + +def kmeans(x, k): + """ + kmeans algorithm, put sentence id into k buckets according to sentence length + + Args: + x: list, sentence length + k: int, k clusters + + Returns: + centroids: list, center point of k clusters + clusters: list(tuple), k clusters + """ + x = np.array(x, dtype=np.float32) + # Count the frequency of each datapoint + d, indices, f = np.unique(x, return_inverse=True, return_counts=True) + # Calculate the sum of the values of the same datapoints + total = d * f + # Initialize k centroids randomly + c, old = d[np.random.permutation(len(d))[:k]], None + # Assign labels to each datapoint based on centroids + dists_abs = np.absolute(d[..., np.newaxis] - c) + dists, y = dists_abs.min(axis=-1), dists_abs.argmin(axis=-1) + # The number of clusters must not be greater than that of datapoints + k = 
min(len(d), k) + + while old is None or not np.equal(c, old).all(): + # If an empty cluster is encountered, + # choose the farthest datapoint from the biggest cluster + # and move that the empty one + for i in range(k): + if not np.equal(y, i).any(): + # mask.shape=[k, n] + mask = y == np.arange(k)[..., np.newaxis] + lens = mask.sum(axis=-1) + biggest = mask[lens.argmax()].nonzero()[0] + farthest = dists[biggest].argmax() + y[biggest[farthest]] = i + mask = y == np.arange(k)[..., np.newaxis] + # Update the centroids + c, old = (total * mask).sum(-1) / (f * mask).sum(-1), c + # Re-assign all datapoints to clusters + dists_abs = np.absolute(d[..., np.newaxis] - c) + dists, y = dists_abs.min(axis=-1), dists_abs.argmin(axis=-1) + # Assign all datapoints to the new-generated clusters without considering the empty ones + y, assigned = y[indices], np.unique(y).tolist() + # Get the centroids of the assigned clusters + centroids = c[assigned].tolist() + # Map all values of datapoints to buckets + clusters = [np.equal(y, i).nonzero()[0].tolist() for i in assigned] + + return centroids, clusters + + +def eisner(scores, mask): + """Eisner algorithm is a general dynamic programming decoding algorithm for bilexical grammar. + + Args: + scores: Adjacency matrix,shape=(batch, seq_len, seq_len) + mask: mask matrix,shape=(batch, sql_len) + + Returns: + output,shape=(batch, seq_len),the index of the parent node corresponding to the token in the query + + """ + lens = mask.sum(1) + batch_size, seq_len, _ = scores.shape + scores = scores.transpose(2, 1, 0) + # Score for incomplete span + s_i = np.full_like(scores, float("-inf")) + # Score for complete span + s_c = np.full_like(scores, float("-inf")) + # Incompelte span position for backtrack + p_i = np.zeros((seq_len, seq_len, batch_size), dtype=np.int64) + # Compelte span position for backtrack + p_c = np.zeros((seq_len, seq_len, batch_size), dtype=np.int64) + # Set 0 to s_c.diagonal + s_c = fill_diagonal(s_c, 0) + # Contiguous + s_c = np.ascontiguousarray(s_c) + s_i = np.ascontiguousarray(s_i) + for w in range(1, seq_len): + n = seq_len - w + starts = np.arange(n, dtype=np.int64)[np.newaxis, :] + # ilr = C(i->r) + C(j->r+1) + ilr = stripe(s_c, n, w) + stripe(s_c, n, w, (w, 1)) + # Shape: (batch_size, n, w) + ilr = ilr.transpose(2, 0, 1) + # scores.diagonal(-w).shape:[batch, n] + il = ilr + scores.diagonal(-w)[..., np.newaxis] + # I(j->i) = max(C(i->r) + C(j->r+1) + s(j->i)), i <= r < j + il_span, il_path = il.max(-1), il.argmax(-1) + s_i = fill_diagonal(s_i, il_span, offset=-w) + p_i = fill_diagonal(p_i, il_path + starts, offset=-w) + + ir = ilr + scores.diagonal(w)[..., np.newaxis] + # I(i->j) = max(C(i->r) + C(j->r+1) + s(i->j)), i <= r < j + ir_span, ir_path = ir.max(-1), ir.argmax(-1) + s_i = fill_diagonal(s_i, ir_span, offset=w) + p_i = fill_diagonal(p_i, ir_path + starts, offset=w) + + # C(j->i) = max(C(r->i) + I(j->r)), i <= r < j + cl = stripe(s_c, n, w, (0, 0), 0) + stripe(s_i, n, w, (w, 0)) + cl = cl.transpose(2, 0, 1) + cl_span, cl_path = cl.max(-1), cl.argmax(-1) + s_c = fill_diagonal(s_c, cl_span, offset=-w) + p_c = fill_diagonal(p_c, cl_path + starts, offset=-w) + + # C(i->j) = max(I(i->r) + C(r->j)), i < r <= j + cr = stripe(s_i, n, w, (0, 1)) + stripe(s_c, n, w, (1, w), 0) + cr = cr.transpose(2, 0, 1) + cr_span, cr_path = cr.max(-1), cr.argmax(-1) + s_c = fill_diagonal(s_c, cr_span, offset=w) + s_c[0, w][np.not_equal(lens, w)] = float("-inf") + p_c = fill_diagonal(p_c, cr_path + starts + 1, offset=w) + + predicts = [] + p_c = p_c.transpose(2, 
0, 1) + p_i = p_i.transpose(2, 0, 1) + for i, length in enumerate(lens.tolist()): + heads = np.ones(length + 1, dtype=np.int64) + backtrack(p_i[i], p_c[i], heads, 0, length, True) + predicts.append(heads) + + return pad_sequence(predicts, fix_len=seq_len) + + +class Node: + """Node class""" + + def __init__(self, id=None, parent=None): + self.lefts = [] + self.rights = [] + self.id = int(id) + self.parent = parent if parent is None else int(parent) + + +class DepTree: + """ + DepTree class, used to check whether the prediction result is a project Tree. + A projective tree means that you can project the tree without crossing arcs. + """ + + def __init__(self, sentence): + # set root head to -1 + sentence = copy.deepcopy(sentence) + sentence[0] = -1 + self.sentence = sentence + self.build_tree() + self.visit = [False] * len(sentence) + + def build_tree(self): + """Build the tree""" + self.nodes = [Node(index, p_index) for index, p_index in enumerate(self.sentence)] + # set root + self.root = self.nodes[0] + for node in self.nodes[1:]: + self.add(self.nodes[node.parent], node) + + def add(self, parent, child): + """Add a child node""" + if parent.id is None or child.id is None: + raise Exception("id is None") + if parent.id < child.id: + parent.rights = sorted(parent.rights + [child.id]) + else: + parent.lefts = sorted(parent.lefts + [child.id]) + + def judge_legal(self): + """Determine whether it is a project tree""" + target_seq = list(range(len(self.nodes))) + if len(self.root.lefts + self.root.rights) != 1: + return False + cur_seq = self.inorder_traversal(self.root) + if target_seq != cur_seq: + return False + else: + return True + + def inorder_traversal(self, node): + """Inorder traversal""" + if self.visit[node.id]: + return [] + self.visit[node.id] = True + lf_list = [] + rf_list = [] + for ln in node.lefts: + lf_list += self.inorder_traversal(self.nodes[ln]) + for rn in node.rights: + rf_list += self.inorder_traversal(self.nodes[rn]) + + return lf_list + [node.id] + rf_list + + +def istree(sequence): + """Is the sequence a project tree""" + return DepTree(sequence).judge_legal() diff --git a/examples/dialogue/dgu/README.md b/examples/dialogue/dgu/README.md new file mode 100644 index 0000000000000000000000000000000000000000..90a9f57ec209ab1c67abd57be9b2032287f1c5bf --- /dev/null +++ b/examples/dialogue/dgu/README.md @@ -0,0 +1,143 @@ +# 对话通用理解模型 (DGU, Dialogue General Understanding) + +## 模型简介 + +对话系统 (Dialogue System) 常常需要根据应用场景的变化去解决多种多样的任务。任务的多样性(意图识别、槽填充、行为识别、状态追踪等等),以及领域训练数据的稀少,给Dialogue System的研究和应用带来了巨大的困难和挑战,要使得Dialogue System得到更好的发展,需要开发一个通用的对话理解模型。为此,我们给出了基于BERT的对话通用理解模型 (DGU: Dialogue General Understanding),通过实验表明,使用base-model (BERT)并结合常见的学习范式,就可以在几乎全部对话理解任务上取得比肩甚至超越各个领域业内最好的模型的效果,展现了学习一个通用对话理解模型的巨大潜力。 + +DGU模型内共包含6个任务,全部基于公开数据集在Paddle2.0上完成训练及评估,详细说明如下: + +``` +udc: 使用UDC (Ubuntu Corpus V1) 数据集完成对话匹配 (Dialogue Response Selection) 任务; +dstc2: 使用DSTC2 (Dialog State Tracking Challenge 2) 数据集完成对话状态追踪 (Dialogue State Tracking) 任务; +atis_slot: 使用ATIS (Airline Travel Information System) 数据集完成对话槽填充 (Dialogue Slot Filling) 任务; +atis_intent: 使用ATIS (Airline Travel Information System) 数据集完成对话意图识别 (Dialogue Intent Detection) 任务; +mrda: 使用MRDAC (Meeting Recorder Dialogue Act Corpus) 数据集完成对话行为识别 (Dialogue Act Detection) 任务; +swda: 使用SwDAC (Switchboard Dialogue Act Corpus) 数据集完成对话行为识别 (Dialogue Act Detection) 任务; +``` + +## 模型效果 + +DGU模型中的6个任务,分别采用不同的评估指标在test集上进行评估,结果如下: + + + + + + + + + + + +
+| 任务 | 评估指标 | DGU |
+| :---------: | :-------: | :----: |
+| udc | R1@10 | 81.04% |
+| udc | R2@10 | 89.85% |
+| udc | R5@10 | 97.59% |
+| dstc2 | Joint_Acc | 90.43% |
+| atis_slot | F1_Micro | 97.98% |
+| atis_intent | Acc | 97.42% |
+| mrda | Acc | 90.94% |
+| swda | Acc | 80.61% |
+ +**NOTE:** 以上结果均是采用默认配置在GPU单卡上训练和评估得到的,用户如需复现效果,可采用默认配置在单卡上进行训练评估。 + +## 快速开始 + +### 数据准备 + +下载数据集压缩包并解压后,DGU_datasets目录下共存在6个目录,分别对应每个任务的训练集train.txt、评估集dev.txt和测试集test.txt。 + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/DGU_datasets.tar.gz +tar -zxf DGU_datasets.tar.gz +``` + +DGU_datasets目录结构: + +```text +DGU_datasets/ +├── atis_intent +│   ├── dev.txt +│   ├── map_tag_intent_id.txt +│   ├── test.txt +│   └── train.txt +├── udc +│   ├── dev.txt +│   ├── dev.txt-small +│   ├── test.txt +│   └── train.txt +├── atis_slot +│   ├── dev.txt +│   ├── map_tag_slot_id.txt +│   ├── test.txt +│   └── train.txt +├── dstc2 +│   ├── dev.txt +│   ├── map_tag_id.txt +│   ├── test.txt +│   └── train.txt +├── mrda +│   ├── dev.txt +│   ├── map_tag_id.txt +│   ├── test.txt +│   └── train.txt +└── swda + ├── dev.txt + ├── map_tag_id.txt + ├── test.txt + └── train.txt +``` + +数据的每一行由多列组成,都以"\t"作为分割符,详细数据格式说明如下: + +``` +udc:由label、多轮对话conv和回应response组成 +格式:label \t conv1 \t conv2 \t conv3 \t ... \t response + +dstc2:由多轮对话id、当前轮QA对(使用\1拼接)和对话状态序列state_list(state_list中每个state由空格分割)组成 +格式:conversation_id \t question \1 answer \t state1 state2 state3 ... + +atis_slot:由对话内容conversation_content和标签序列label_list (label_list中每个label由空格分割) 组成, 其中标签序列和对话内容中word为一一对应关系 +格式:conversation_content \t label1 label2 label3 ... + +atis_intent:由标签label和对话内容conversation_content组成 +格式: label \t conversation_content + +mrda:由多轮对话id、标签label、发言人caller、对话内容conversation_content组成 +格式:conversation_id \t label \t caller \t conversation_content + +swda:由多轮对话id、标签label、发言人caller、对话内容conversation_content组成 +格式:conversation_id \t label \t caller \t conversation_content +``` + +**NOTE:** 上述数据集来自于 [Paddle1.8静态图版本](https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleNLP/dialogue_system/dialogue_general_understanding),是由相应的开源数据集经过数据格式转换而得来的,本项目中暂未包含数据格式转换脚本,细节请参考 [Paddle1.8静态图版本](https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleNLP/dialogue_system/dialogue_general_understanding)。 + +### 模型训练 + +运行如下命令即可在训练集 (train.tsv) 上进行模型训练,并在开发集 (dev.tsv) 验证,训练结束后会在测试集 (test.txt) 上进行模型评估 + +```shell +# GPU启动,gpus指定训练所用的GPU卡号,可以是单卡,也可以多卡。默认会进行训练、验证和评估 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" --log_dir ./log main.py --task_name=udc --data_dir=./DGU_datasets/udc --output_dir=./checkpoints/udc --device=gpu +# 若只需进行评估,do_train设为False,并且必须指定init_from_ckpt +# python -m paddle.distributed.launch --gpus "0" --log_dir ./log main.py --task_name=udc --data_dir=./DGU_datasets/udc --do_train=False --init_from_ckpt=./checkpoints/udc/best --device=gpu +``` + +以上参数表示: + +* `task_name`:任务名称,可以为udc、dstc2、atis_slot、atis_intent、mrda或swda。 +* `data_dir`:训练数据路径。 +* `output_dir`:训练保存模型的文件路径。 +* `do_train:是否进行训练,默认为`True`。 +* `init_from_ckpt`:恢复模型参数的路径。 +* `device`:表示训练使用的设备。 + +其他可选参数和参数的默认值请参考`args.py`。 + +程序运行时将会自动进行训练,验证和评估。同时训练过程中会自动保存模型在指定的`output_dir`中。 +如: +```text +checkpoints/ +├── 1000.pdopt +├── 1000.pdparams +├── 2000.pdopt +├── 2000.pdparams +├── ... +├── best.pdopt +└── best.pdparams +``` + +**NOTE:** 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/1000`即可,程序会自动加载模型参数`checkpoints/1000.pdparams`,也会自动加载优化器状态`checkpoints/1000.pdopt`。 diff --git a/examples/dialogue/dgu/args.py b/examples/dialogue/dgu/args.py new file mode 100644 index 0000000000000000000000000000000000000000..4139474c906b47b7b690b9d5eab32c14861cb29b --- /dev/null +++ b/examples/dialogue/dgu/args.py @@ -0,0 +1,118 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument("--task_name", default=None, type=str, required=True, help="The name of the task to train.") + parser.add_argument("--model_name_or_path", default='bert-base-uncased', type=str, help="Path to pre-trained bert model or shortcut name.") + parser.add_argument("--output_dir", default=None, type=str, help="The output directory where the checkpoints will be saved.") + parser.add_argument("--data_dir", default=None, type=str, help="The directory where the dataset will be load.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") + parser.add_argument("--max_seq_len", default=None, type=int, help="The maximum total input sequence length after tokenization for trainng. Sequences longer than this will be truncated, sequences shorter will be padded.") + parser.add_argument("--test_max_seq_len", default=None, type=int, help="The maximum total input sequence length after tokenization for testing. Sequences longer than this will be truncated, sequences shorter will be padded.") + parser.add_argument("--batch_size", default=None, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--test_batch_size", default=None, type=int, help="Batch size per GPU/CPU for testing.") + parser.add_argument("--learning_rate", default=None, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--epochs", default=None, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--logging_steps", default=None, type=int, help="Log every X updates steps.") + parser.add_argument("--save_steps", default=None, type=int, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", default=42, type=int, help="Random seed for initialization.") + parser.add_argument("--warmup_proportion", default=0.1, type=float, help="The proportion of warmup.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + parser.add_argument("--do_train", default=True, type=eval, help="Whether training.") + parser.add_argument("--do_eval", default=True, type=eval, help="Whether evaluation.") + parser.add_argument("--do_test", default=True, type=eval, help="Whether testing.") + parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") + + args = parser.parse_args() + return args +# yapf: enable + + +def set_default_args(args): + args.task_name = args.task_name.lower() + if args.task_name == "udc": + if not args.save_steps: + args.save_steps = 1000 + if not args.logging_steps: + args.logging_steps = 100 + if not args.epochs: + args.epochs = 2 + if not args.max_seq_len: + args.max_seq_len = 210 + if not args.test_batch_size: + 
args.test_batch_size = 100 + elif args.task_name == "dstc2": + if not args.save_steps: + args.save_steps = 400 + if not args.logging_steps: + args.logging_steps = 20 + if not args.epochs: + args.epochs = 40 + if not args.learning_rate: + args.learning_rate = 5e-5 + if not args.max_seq_len: + args.max_seq_len = 256 + if not args.test_max_seq_len: + args.test_max_seq_len = 512 + elif args.task_name == "atis_slot": + if not args.save_steps: + args.save_steps = 100 + if not args.logging_steps: + args.logging_steps = 10 + if not args.epochs: + args.epochs = 50 + elif args.task_name == "atis_intent": + if not args.save_steps: + args.save_steps = 100 + if not args.logging_steps: + args.logging_steps = 10 + if not args.epochs: + args.epochs = 20 + elif args.task_name == "mrda": + if not args.save_steps: + args.save_steps = 500 + if not args.logging_steps: + args.logging_steps = 200 + if not args.epochs: + args.epochs = 7 + elif args.task_name == "swda": + if not args.save_steps: + args.save_steps = 500 + if not args.logging_steps: + args.logging_steps = 200 + if not args.epochs: + args.epochs = 3 + else: + raise ValueError("Not support task: %s." % args.task_name) + + if not args.data_dir: + args.data_dir = "./DGU_datasets/" + args.task_name + if not args.output_dir: + args.output_dir = "./checkpoints/" + args.task_name + if not args.learning_rate: + args.learning_rate = 2e-5 + if not args.batch_size: + args.batch_size = 32 + if not args.test_batch_size: + args.test_batch_size = args.batch_size + if not args.max_seq_len: + args.max_seq_len = 128 + if not args.test_max_seq_len: + args.test_max_seq_len = args.max_seq_len diff --git a/examples/dialogue/dgu/data.py b/examples/dialogue/dgu/data.py new file mode 100644 index 0000000000000000000000000000000000000000..469134f7cfc738a8b15573a9599f15e6307f193d --- /dev/null +++ b/examples/dialogue/dgu/data.py @@ -0,0 +1,509 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import numpy as np +from typing import List + +from paddle.io import Dataset + +# The input data bigin with '[CLS]', using '[SEP]' split conversation content( +# Previous part, current part, following part, etc.). If there are multiple +# conversation in split part, using 'INNER_SEP' to further split. +INNER_SEP = "[unused0]" + + +def get_label_map(label_list): + """Create label maps""" + label_map = {} + for (i, l) in enumerate(label_list): + label_map[l] = i + return label_map + + +class UDCv1(Dataset): + """ + The UDCv1 dataset is using in task Dialogue Response Selection. + The source dataset is UDCv1(Ubuntu Dialogue Corpus v1.0). 
See detail at + http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/ + """ + + MAX_LEN_OF_RESPONSE = 60 + LABEL_MAP = get_label_map(["0", "1"]) + + def __init__(self, data_dir, mode="train"): + super(UDCv1, self).__init__() + self._data_dir = data_dir + self._mode = mode + self.read_data() + + def read_data(self): + if self._mode == "train": + data_path = os.path.join(self._data_dir, "train.txt") + elif self._mode == "dev": + data_path = os.path.join(self._data_dir, "dev.txt-small") + elif self._mode == "test": + data_path = os.path.join(self._data_dir, "test.txt") + self.data = [] + with open(data_path, "r", encoding="utf8") as fin: + for line in fin: + if not line: + continue + arr = line.rstrip("\n").split("\t") + if len(arr) < 3: + print("Data format error: %s" % "\t".join(arr)) + print("Data row contains at least three parts: label\tconversation1\t.....\tresponse.") + continue + label = arr[0] + text_a = arr[1:-1] + text_b = arr[-1] + self.data.append([label, text_a, text_b]) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + + def _truncate_and_concat(text_a: List[str], text_b: str, tokenizer, max_seq_length): + tokens_b = tokenizer.tokenize(text_b) + tokens_b = tokens_b[: min(cls.MAX_LEN_OF_RESPONSE, len(tokens_b))] + tokens_a = [] + for text in text_a: + tokens_a.extend(tokenizer.tokenize(text)) + tokens_a.append(INNER_SEP) + tokens_a = tokens_a[:-1] + if len(tokens_a) > max_seq_length - len(tokens_b) - 3: + tokens_a = tokens_a[len(tokens_a) - max_seq_length + len(tokens_b) + 3 :] + tokens, segment_ids = [], [] + tokens.extend([tokenizer.cls_token] + tokens_a + [tokenizer.sep_token]) + segment_ids.extend([0] * len(tokens)) + tokens.extend(tokens_b + [tokenizer.sep_token]) + segment_ids.extend([1] * (len(tokens_b) + 1)) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + return input_ids, segment_ids + + label, text_a, text_b = example + label = np.array([cls.get_label(label)], dtype="int64") + input_ids, segment_ids = _truncate_and_concat(text_a, text_b, tokenizer, max_seq_length) + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) + + +class DSTC2(Dataset): + """ + The dataset DSTC2 is using in task Dialogue State Tracking. + The source dataset is DSTC2(Dialog State Tracking Challenges 2). 
See detail at + https://github.com/matthen/dstc + """ + + LABEL_MAP = get_label_map([str(i) for i in range(217)]) + + def __init__(self, data_dir, mode="train"): + super(DSTC2, self).__init__() + self._data_dir = data_dir + self._mode = mode + self.read_data() + + def read_data(self): + def _concat_dialogues(examples): + """concat multi turns dialogues""" + new_examples = [] + max_turns = 20 + for i in range(len(examples)): + multi_turns = examples[max(i - max_turns, 0) : i + 1] + new_qa = "\1".join([example[0] for example in multi_turns]) + new_examples.append((new_qa.split("\1"), examples[i][1])) + return new_examples + + if self._mode == "train": + data_path = os.path.join(self._data_dir, "train.txt") + elif self._mode == "dev": + data_path = os.path.join(self._data_dir, "dev.txt") + elif self._mode == "test": + data_path = os.path.join(self._data_dir, "test.txt") + self.data = [] + with open(data_path, "r", encoding="utf8") as fin: + pre_idx = -1 + examples = [] + for line in fin: + if not line: + continue + arr = line.rstrip("\n").split("\t") + if len(arr) != 3: + print("Data format error: %s" % "\t".join(arr)) + print("Data row should contains three parts: id\tquestion\1answer\tlabel1 label2 ...") + continue + idx = arr[0] + qa = arr[1] + label_list = arr[2].split() + if idx != pre_idx: + if idx != 0: + examples = _concat_dialogues(examples) + self.data.extend(examples) + examples = [] + pre_idx = idx + examples.append((qa, label_list)) + if examples: + examples = _concat_dialogues(examples) + self.data.extend(examples) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + + def _truncate_and_concat(texts: List[str], tokenizer, max_seq_length): + tokens = [] + for text in texts: + tokens.extend(tokenizer.tokenize(text)) + tokens.append(INNER_SEP) + tokens = tokens[:-1] + if len(tokens) > max_seq_length - 2: + tokens = tokens[len(tokens) - max_seq_length + 2 :] + tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token] + segment_ids = [0] * len(tokens) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + return input_ids, segment_ids + + texts, labels = example + input_ids, segment_ids = _truncate_and_concat(texts, tokenizer, max_seq_length) + labels = [cls.get_label(l) for l in labels] + label = np.zeros(cls.num_classes(), dtype="int64") + for l in labels: + label[l] = 1 + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) + + +class ATIS_DSF(Dataset): + """ + The dataset ATIS_DSF is using in task Dialogue Slot Filling. + The source dataset is ATIS(Airline Travel Information System). 
See detail at + https://www.kaggle.com/siddhadev/ms-cntk-atis + """ + + LABEL_MAP = get_label_map([str(i) for i in range(130)]) + + def __init__(self, data_dir, mode="train"): + super(ATIS_DSF, self).__init__() + self._data_dir = data_dir + self._mode = mode + self.read_data() + + def read_data(self): + if self._mode == "train": + data_path = os.path.join(self._data_dir, "train.txt") + elif self._mode == "dev": + data_path = os.path.join(self._data_dir, "dev.txt") + elif self._mode == "test": + data_path = os.path.join(self._data_dir, "test.txt") + self.data = [] + with open(data_path, "r", encoding="utf8") as fin: + for line in fin: + if not line: + continue + arr = line.rstrip("\n").split("\t") + if len(arr) != 2: + print("Data format error: %s" % "\t".join(arr)) + print("Data row should contains two parts: conversation_content\tlabel1 label2 label3.") + continue + text = arr[0] + label_list = arr[1].split() + self.data.append([text, label_list]) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + text, labels = example + tokens, label_list = [], [] + words = text.split() + assert len(words) == len(labels) + for word, label in zip(words, labels): + piece_words = tokenizer.tokenize(word) + tokens.extend(piece_words) + label = cls.get_label(label) + label_list.extend([label] * len(piece_words)) + if len(tokens) > max_seq_length - 2: + tokens = tokens[len(tokens) - max_seq_length + 2 :] + label_list = label_list[len(tokens) - max_seq_length + 2 :] + tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token] + label_list = [0] + label_list + [0] + segment_ids = [0] * len(tokens) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + label = np.array(label_list, dtype="int64") + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) + + +class ATIS_DID(Dataset): + """ + The dataset ATIS_ID is using in task Dialogue Intent Detection. + The source dataset is ATIS(Airline Travel Information System). 
See detail at + https://www.kaggle.com/siddhadev/ms-cntk-atis + """ + + LABEL_MAP = get_label_map([str(i) for i in range(26)]) + + def __init__(self, data_dir, mode="train"): + super(ATIS_DID, self).__init__() + self._data_dir = data_dir + self._mode = mode + self.read_data() + + def read_data(self): + if self._mode == "train": + data_path = os.path.join(self._data_dir, "train.txt") + elif self._mode == "dev": + data_path = os.path.join(self._data_dir, "dev.txt") + elif self._mode == "test": + data_path = os.path.join(self._data_dir, "test.txt") + self.data = [] + with open(data_path, "r", encoding="utf8") as fin: + for line in fin: + if not line: + continue + arr = line.rstrip("\n").split("\t") + if len(arr) != 2: + print("Data format error: %s" % "\t".join(arr)) + print("Data row should contains two parts: label\tconversation_content.") + continue + label = arr[0] + text = arr[1] + self.data.append([label, text]) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + label, text = example + tokens = tokenizer.tokenize(text) + if len(tokens) > max_seq_length - 2: + tokens = tokens[len(tokens) - max_seq_length + 2 :] + tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token] + segment_ids = [0] * len(tokens) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + label = np.array([cls.get_label(label)], dtype="int64") + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) + + +def read_da_data(data_dir, mode): + def _concat_dialogues(examples): + """concat multi turns dialogues""" + new_examples = [] + for i in range(len(examples)): + label, caller, text = examples[i] + cur_txt = "%s : %s" % (caller, text) + pre_txt = ["%s : %s" % (item[1], item[2]) for item in examples[max(0, i - 5) : i]] + suf_txt = ["%s : %s" % (item[1], item[2]) for item in examples[i + 1 : min(len(examples), i + 3)]] + sample = [label, pre_txt, cur_txt, suf_txt] + new_examples.append(sample) + return new_examples + + if mode == "train": + data_path = os.path.join(data_dir, "train.txt") + elif mode == "dev": + data_path = os.path.join(data_dir, "dev.txt") + elif mode == "test": + data_path = os.path.join(data_dir, "test.txt") + data = [] + with open(data_path, "r", encoding="utf8") as fin: + pre_idx = -1 + examples = [] + for line in fin: + if not line: + continue + arr = line.rstrip("\n").split("\t") + if len(arr) != 4: + print("Data format error: %s" % "\t".join(arr)) + print("Data row should contains four parts: id\tlabel\tcaller\tconversation_content.") + continue + idx, label, caller, text = arr + if idx != pre_idx: + if idx != 0: + examples = _concat_dialogues(examples) + data.extend(examples) + examples = [] + pre_idx = idx + examples.append((label, caller, text)) + if examples: + examples = _concat_dialogues(examples) + data.extend(examples) + return data + + +def truncate_and_concat( + pre_txt: List[str], cur_txt: str, suf_txt: List[str], tokenizer, max_seq_length, max_len_of_cur_text +): + cur_tokens = tokenizer.tokenize(cur_txt) + cur_tokens = cur_tokens[: min(max_len_of_cur_text, len(cur_tokens))] + pre_tokens = [] + for text in pre_txt: + pre_tokens.extend(tokenizer.tokenize(text)) + pre_tokens.append(INNER_SEP) + pre_tokens = pre_tokens[:-1] + suf_tokens = [] + for text in 
suf_txt: + suf_tokens.extend(tokenizer.tokenize(text)) + suf_tokens.append(INNER_SEP) + suf_tokens = suf_tokens[:-1] + if len(cur_tokens) + len(pre_tokens) + len(suf_tokens) > max_seq_length - 4: + left_num = max_seq_length - 4 - len(cur_tokens) + if len(pre_tokens) > len(suf_tokens): + suf_num = int(left_num / 2) + suf_tokens = suf_tokens[:suf_num] + pre_num = left_num - len(suf_tokens) + pre_tokens = pre_tokens[max(0, len(pre_tokens) - pre_num) :] + else: + pre_num = int(left_num / 2) + pre_tokens = pre_tokens[max(0, len(pre_tokens) - pre_num) :] + suf_num = left_num - len(pre_tokens) + suf_tokens = suf_tokens[:suf_num] + tokens, segment_ids = [], [] + tokens.extend([tokenizer.cls_token] + pre_tokens + [tokenizer.sep_token]) + segment_ids.extend([0] * len(tokens)) + tokens.extend(cur_tokens + [tokenizer.sep_token]) + segment_ids.extend([1] * (len(cur_tokens) + 1)) + if suf_tokens: + tokens.extend(suf_tokens + [tokenizer.sep_token]) + segment_ids.extend([0] * (len(suf_tokens) + 1)) + input_ids = tokenizer.convert_tokens_to_ids(tokens) + return input_ids, segment_ids + + +class MRDA(Dataset): + """ + The dataset MRDA is using in task Dialogue Act. + The source dataset is MRDA(Meeting Recorder Dialogue Act). See detail at + https://www.aclweb.org/anthology/W04-2319.pdf + """ + + MAX_LEN_OF_CUR_TEXT = 50 + LABEL_MAP = get_label_map([str(i) for i in range(5)]) + + def __init__(self, data_dir, mode="train"): + super(MRDA, self).__init__() + self.data = read_da_data(data_dir, mode) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + label, pre_txt, cur_txt, suf_txt = example + label = np.array([cls.get_label(label)], dtype="int64") + input_ids, segment_ids = truncate_and_concat( + pre_txt, cur_txt, suf_txt, tokenizer, max_seq_length, cls.MAX_LEN_OF_CUR_TEXT + ) + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) + + +class SwDA(Dataset): + """ + The dataset SwDA is using in task Dialogue Act. + The source dataset is SwDA(Switchboard Dialog Act). See detail at + http://compprag.christopherpotts.net/swda.html + """ + + MAX_LEN_OF_CUR_TEXT = 50 + LABEL_MAP = get_label_map([str(i) for i in range(42)]) + + def __init__(self, data_dir, mode="train"): + super(SwDA, self).__init__() + self.data = read_da_data(data_dir, mode) + + @classmethod + def get_label(cls, label): + return cls.LABEL_MAP[label] + + @classmethod + def num_classes(cls): + return len(cls.LABEL_MAP) + + @classmethod + def convert_example(cls, example, tokenizer, max_seq_length=512): + """Convert a glue example into necessary features.""" + label, pre_txt, cur_txt, suf_txt = example + label = np.array([cls.get_label(label)], dtype="int64") + input_ids, segment_ids = truncate_and_concat( + pre_txt, cur_txt, suf_txt, tokenizer, max_seq_length, cls.MAX_LEN_OF_CUR_TEXT + ) + return input_ids, segment_ids, label + + def __getitem__(self, index): + return self.data[index] + + def __len__(self): + return len(self.data) diff --git a/examples/dialogue/dgu/main.py b/examples/dialogue/dgu/main.py new file mode 100644 index 0000000000000000000000000000000000000000..f5ca4faf457242b57b4b23ee09be242e3404ff0a --- /dev/null +++ b/examples/dialogue/dgu/main.py @@ -0,0 +1,290 @@ +# Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import time +import numpy as np +from functools import partial + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler +from paddle.optimizer import AdamW +from paddle.metric import Accuracy + +from paddlenlp.datasets import MapDataset +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.transformers import BertTokenizer, BertForSequenceClassification, BertForTokenClassification +from paddlenlp.transformers import LinearDecayWithWarmup + +from args import parse_args, set_default_args +import data +import metric + +TASK_CLASSES = { + "udc": (data.UDCv1, metric.RecallAtK), + "dstc2": (data.DSTC2, metric.JointAccuracy), + "atis_slot": (data.ATIS_DSF, metric.F1Score), + "atis_intent": (data.ATIS_DID, Accuracy), + "mrda": (data.MRDA, Accuracy), + "swda": (data.SwDA, Accuracy), +} + + +def set_seed(seed): + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def load_ckpt(args, model, optimizer=None): + if args.init_from_ckpt: + params_state_dict = paddle.load(args.init_from_ckpt + ".pdparams") + model.set_state_dict(params_state_dict) + if optimizer: + opt_state_dict = paddle.load(args.init_from_ckpt + ".pdopt") + optimizer.set_state_dict(opt_state_dict) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + +def save_ckpt(model, optimizer, output_dir, name): + params_path = os.path.join(output_dir, "{}.pdparams".format(name)) + opt_path = os.path.join(output_dir, "{}.pdopt".format(name)) + paddle.save(model.state_dict(), params_path) + paddle.save(optimizer.state_dict(), opt_path) + + +class DGULossFunction(nn.Layer): + def __init__(self, task_name): + super(DGULossFunction, self).__init__() + + self.task_name = task_name + self.loss_fn = self.get_loss_fn() + + def get_loss_fn(self): + if self.task_name in ["udc", "atis_slot", "atis_intent", "mrda", "swda"]: + return F.cross_entropy + elif self.task_name == "dstc2": + return nn.BCEWithLogitsLoss(reduction="sum") + + def forward(self, logits, labels): + if self.task_name in ["udc", "atis_intent", "mrda", "swda"]: + loss = self.loss_fn(logits, labels) + elif self.task_name == "dstc2": + loss = self.loss_fn(logits, paddle.cast(labels, dtype=logits.dtype)) + elif self.task_name == "atis_slot": + labels = paddle.unsqueeze(labels, axis=-1) + loss = self.loss_fn(logits, labels) + return loss + + +def print_logs(args, step, logits, labels, loss, total_time, metric): + if args.task_name in ["udc", "atis_intent", "mrda", "swda"]: + if args.task_name == "udc": + metric = Accuracy() + metric.reset() + correct = metric.compute(logits, labels) + metric.update(correct) + acc = metric.accumulate() + print("step %d - loss: %.4f - acc: %.4f - %.3fs/step" % (step, loss, acc, total_time / args.logging_steps)) + elif args.task_name == "dstc2": + metric.reset() + metric.update(logits, labels) + joint_acc = 
metric.accumulate() + print( + "step %d - loss: %.4f - joint_acc: %.4f - %.3fs/step" + % (step, loss, joint_acc, total_time / args.logging_steps) + ) + elif args.task_name == "atis_slot": + metric.reset() + metric.update(logits, labels) + f1_micro = metric.accumulate() + print( + "step %d - loss: %.4f - f1_micro: %.4f - %.3fs/step" + % (step, loss, f1_micro, total_time / args.logging_steps) + ) + + +def train(args, model, train_data_loader, dev_data_loader, metric, n_procs, rank): + num_examples = len(train_data_loader) * args.batch_size * n_procs + max_train_steps = args.epochs * len(train_data_loader) + print("\nNum train examples: %d" % num_examples) + print("Max train steps: %d" % max_train_steps) + print("Warmup proportion: %.2f" % args.warmup_proportion) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, max_train_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + loss_fn = DGULossFunction(args.task_name) + + load_ckpt(args, model, optimizer) + + step = 0 + best_metric = 0.0 + total_time = 0.0 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for batch in train_data_loader: + step += 1 + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fn(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + print_logs(args, step, logits, labels, loss, total_time, metric) + total_time = 0.0 + if step % args.save_steps == 0 or step == max_train_steps: + if rank == 0: + save_ckpt(model, optimizer, args.output_dir, step) + if args.do_eval: + print("\nEval begin...") + metric_out = evaluation(args, model, dev_data_loader, metric) + if rank == 0 and metric_out > best_metric: + best_metric = metric_out + save_ckpt(model, optimizer, args.output_dir, "best") + print("Best model, step: %d\n" % step) + batch_start_time = time.time() + + +@paddle.no_grad() +def evaluation(args, model, data_loader, metric): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + if args.task_name in ["atis_intent", "mrda", "swda"]: + correct = metric.compute(logits, labels) + metric.update(correct) + else: + metric.update(logits, labels) + model.train() + metric_out = metric.accumulate() + print("Total samples: %d" % (len(data_loader) * args.test_batch_size)) + if args.task_name == "udc": + print("R1@10: %.4f - R2@10: %.4f - R5@10: %.4f\n" % (metric_out[0], metric_out[1], metric_out[2])) + return metric_out[0] + elif args.task_name == "dstc2": + print("Joint_acc: %.4f\n" % metric_out) + return metric_out + elif args.task_name == "atis_slot": + print("F1_micro: %.4f\n" % metric_out) + return metric_out + elif args.task_name in ["atis_intent", "mrda", "swda"]: + print("Acc: %.4f\n" % metric_out) + return metric_out + + +def create_data_loader(args, dataset_class, trans_func, batchify_fn, mode): + dataset = dataset_class(args.data_dir, mode) + dataset = 
MapDataset(dataset).map(trans_func, lazy=True) + if mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + else: + batch_sampler = BatchSampler(dataset, batch_size=args.test_batch_size, shuffle=False) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + return data_loader + + +def main(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + rank = dist.get_rank() + if world_size > 1 and args.do_train: + dist.init_parallel_env() + + set_seed(args.seed) + + dataset_class, metric_class = TASK_CLASSES[args.task_name] + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + trans_func = partial(dataset_class.convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_len) + test_trans_func = partial(dataset_class.convert_example, tokenizer=tokenizer, max_seq_length=args.test_max_seq_len) + metric = metric_class() + + if args.task_name in ("udc", "dstc2", "atis_intent", "mrda", "swda"): + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64"), # label + ): fn(samples) + model = BertForSequenceClassification.from_pretrained( + args.model_name_or_path, num_classes=dataset_class.num_classes() + ) + elif args.task_name == "atis_slot": + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Pad(axis=0, pad_val=0, dtype="int64"), # label + ): fn(samples) + model = BertForTokenClassification.from_pretrained( + args.model_name_or_path, num_classes=dataset_class.num_classes(), dropout=0.0 + ) + if world_size > 1 and args.do_train: + model = paddle.DataParallel(model) + + if args.do_train: + train_data_loader = create_data_loader(args, dataset_class, trans_func, batchify_fn, "train") + if args.do_eval: + dev_data_loader = create_data_loader(args, dataset_class, test_trans_func, batchify_fn, "dev") + else: + dev_data_loader = None + train(args, model, train_data_loader, dev_data_loader, metric, world_size, rank) + + if args.do_test: + if rank == 0: + test_data_loader = create_data_loader(args, dataset_class, test_trans_func, batchify_fn, "test") + if args.do_train: + # If do_eval=True, use best model to evaluate the test data. + # Otherwise, use final model to evaluate the test data. + if args.do_eval: + args.init_from_ckpt = os.path.join(args.output_dir, "best") + load_ckpt(args, model) + else: + if not args.init_from_ckpt: + raise ValueError('"init_from_ckpt" should be set.') + load_ckpt(args, model) + print("\nTest begin...") + evaluation(args, model, test_data_loader, metric) + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + set_default_args(args) + print_args(args) + + main(args) diff --git a/examples/dialogue/dgu/metric.py b/examples/dialogue/dgu/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..b5ef869f768cbc912435539d32fe078ef6447665 --- /dev/null +++ b/examples/dialogue/dgu/metric.py @@ -0,0 +1,245 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +import paddle +import paddle.nn as nn +from paddle.metric import Metric + + +class RecallAtK(Metric): + """ + Recall@K is the fraction of relevant results among the retrieved Top K + results, using to evaluate the performance of Dialogue Response Selection. + + Noted that this class manages the Recall@K score only for binary + classification task. + """ + + def __init__(self, name="Recall@K", *args, **kwargs): + super(RecallAtK, self).__init__(*args, **kwargs) + self._name = name + self.softmax = nn.Softmax() + self.reset() + + def reset(self): + """ + Resets all of the metric state. + """ + self.num_sampls = 0 + self.p_at_1_in_10 = 0.0 + self.p_at_2_in_10 = 0.0 + self.p_at_5_in_10 = 0.0 + + def get_p_at_n_in_m(self, data, n, m, idx): + """ + calculate precision in recall n + """ + pos_score = data[idx][0] + curr = data[idx : idx + m] + curr = sorted(curr, key=lambda x: x[0], reverse=True) + if curr[n - 1][0] <= pos_score: + return 1 + return 0 + + def update(self, logits, labels): + """ + Update the states based on the current mini-batch prediction results. + + Args: + logits (Tensor): The predicted value is a Tensor with + shape [batch_size, 2] and type float32 or float64. + labels (Tensor): The ground truth value is a 2D Tensor, + its shape is [batch_size, 1] and type is int64. + """ + probs = self.softmax(logits) + probs = probs.numpy() + labels = labels.numpy() + assert probs.shape[0] == labels.shape[0] + data = [] + for prob, label in zip(probs, labels): + data.append((prob[1], label)) + assert len(data) % 10 == 0 + + length = int(len(data) / 10) + self.num_sampls += length + for i in range(length): + idx = i * 10 + assert data[idx][1] == 1 + self.p_at_1_in_10 += self.get_p_at_n_in_m(data, 1, 10, idx) + self.p_at_2_in_10 += self.get_p_at_n_in_m(data, 2, 10, idx) + self.p_at_5_in_10 += self.get_p_at_n_in_m(data, 5, 10, idx) + + def accumulate(self): + """ + Calculate the final Recall@K. + + Returns: + A list with scaler float: results of the calculated R1@K, R2@K, R5@K. + """ + metrics_out = [ + self.p_at_1_in_10 / self.num_sampls, + self.p_at_2_in_10 / self.num_sampls, + self.p_at_5_in_10 / self.num_sampls, + ] + return metrics_out + + def name(self): + """ + Returns metric name + """ + return self._name + + +class JointAccuracy(Metric): + """ + The joint accuracy rate is used to evaluate the performance of multi-turn + Dialogue State Tracking. For each turn, if and only if all state in + state_list are correctly predicted, the dialog state prediction is + considered correct. And the joint accuracy rate is equal to 1, otherwise + it is equal to 0. + """ + + def __init__(self, name="JointAccuracy", *args, **kwargs): + super(JointAccuracy, self).__init__(*args, **kwargs) + self._name = name + self.sigmoid = nn.Sigmoid() + self.reset() + + def reset(self): + """ + Resets all of the metric state. 
+ """ + self.num_samples = 0 + self.correct_joint = 0.0 + + def update(self, logits, labels): + """ + Update the states based on the current mini-batch prediction results. + + Args: + logits (Tensor): The predicted value is a Tensor with + shape [batch_size, num_classes] and type float32 or float64. + labels (Tensor): The ground truth value is a 2D Tensor, + its shape is [batch_size, num_classes] and type is int64. + """ + probs = self.sigmoid(logits) + probs = probs.numpy() + labels = labels.numpy() + assert probs.shape[0] == labels.shape[0] + assert probs.shape[1] == labels.shape[1] + for i in range(probs.shape[0]): + pred, refer = [], [] + for j in range(probs.shape[1]): + if probs[i][j] >= 0.5: + pred.append(j) + if labels[i][j] == 1: + refer.append(j) + if not pred: + pred = [np.argmax(probs[i])] + if pred == refer: + self.correct_joint += 1 + self.num_samples += probs.shape[0] + + def accumulate(self): + """ + Calculate the final JointAccuracy. + + Returns: + A scaler float: results of the calculated JointAccuracy. + """ + joint_acc = self.correct_joint / self.num_samples + return joint_acc + + def name(self): + """ + Returns metric name + """ + return self._name + + +class F1Score(Metric): + """ + F1-score is the harmonic mean of precision and recall. Micro-averaging is + to create a global confusion matrix for all examples, and then calculate + the F1-score. This class is using to evaluate the performance of Dialogue + Slot Filling. + """ + + def __init__(self, name="F1Score", *args, **kwargs): + super(F1Score, self).__init__(*args, **kwargs) + self._name = name + self.reset() + + def reset(self): + """ + Resets all of the metric state. + """ + self.tp = {} + self.fn = {} + self.fp = {} + + def update(self, logits, labels): + """ + Update the states based on the current mini-batch prediction results. + + Args: + logits (Tensor): The predicted value is a Tensor with + shape [batch_size, seq_len, num_classes] and type float32 or + float64. + labels (Tensor): The ground truth value is a 2D Tensor, + its shape is [batch_size, seq_len] and type is int64. + """ + probs = paddle.argmax(logits, axis=-1) + probs = probs.numpy() + labels = labels.numpy() + assert probs.shape[0] == labels.shape[0] + assert probs.shape[1] == labels.shape[1] + for i in range(probs.shape[0]): + start, end = 1, probs.shape[1] + while end > start: + if labels[i][end - 1] != 0: + break + end -= 1 + prob, label = probs[i][start:end], labels[i][start:end] + for y_pred, y in zip(prob, label): + if y_pred == y: + self.tp[y] = self.tp.get(y, 0) + 1 + else: + self.fp[y_pred] = self.fp.get(y_pred, 0) + 1 + self.fn[y] = self.fn.get(y, 0) + 1 + + def accumulate(self): + """ + Calculate the final micro F1 score. + + Returns: + A scaler float: results of the calculated micro F1 score. 
+ """ + tp_total = sum(self.tp.values()) + fn_total = sum(self.fn.values()) + fp_total = sum(self.fp.values()) + p_total = float(tp_total) / (tp_total + fp_total) + r_total = float(tp_total) / (tp_total + fn_total) + if p_total + r_total == 0: + return 0 + f1_micro = 2 * p_total * r_total / (p_total + r_total) + return f1_micro + + def name(self): + """ + Returns metric name + """ + return self._name diff --git a/examples/dialogue/lic2021_baseline/README.md b/examples/dialogue/lic2021_baseline/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b0354b0f18d0761dbbba67efc5f72a41a2ac027d --- /dev/null +++ b/examples/dialogue/lic2021_baseline/README.md @@ -0,0 +1,146 @@ +# LIC 2021对话比赛baseline + +## 模型简介 + +近年来,人机对话系统受到了学术界和产业界的广泛关注并取得了不错的发展。开放域对话系统旨在建立一个开放域的多轮对话系统,使得机器可以流畅自然地与人进行语言交互,既可以进行日常问候类的闲聊,又可以完成特定功能,以使得开放域对话系统具有实际应用价值,例如进行对话式推荐,或围绕一个主题进行深入的知识对话等。具体的说,开放域对话可以继续拆分为支持不同功能的对话形式,例如对话式推荐,知识对话技术等,如何解决并有效融合以上多个技能面临诸多挑战。 + +LIC 2021对话比赛收集了一系列公开的开放域对话数据并提供了统一的评测方式,旨在为研究人员和开发者提供学术和技术交流的平台,进一步提升开放域对话的研究水平,推动自然语言理解和人工智能领域技术的应用和发展。 + +为了方便参赛者快速了解LIC 2021对话比赛的流程,并快速地参与到比赛中,本项目基于UnifiedTransformer模型提供了一个基础baseline,利用小规模样例数据在预训练模型上完成了微调及预测。参赛者可以针对赛题进行其他改进,例如修改数据预处理方法、修改网络结构、修改训练方式、修改预测的解码方式或对结果的后处理策略等方式提升模型效果。 + +UnifiedTransformer模型的细节可以[参阅论文](https://arxiv.org/abs/2006.16779)。 + +## 快速开始 + +### 环境依赖 + +- sentencepiece + +安装方式:`pip install sentencepiece` + +### 数据准备 + +由于样例数据涉及LIC 2021对话比赛,暂不开放。 +关于数据集及数据集的预处理过程,详见[2021语言与智能技术竞赛:多技能对话](https://aistudio.baidu.com/aistudio/competition/detail/67)及官方提供的基线系统Baselines。 + +模型的输入由3部分组成:词向量token_ids,句向量token_type_ids和位置向量position_ids。本项目的数据集是样例文本经过数据预处理脚本得到的id化的数据集。数据的每一行由3列组成,以";"作为分割符,格式:token_ids;token_type_ids;position_ids。具体细节请参考`data.py`。 + +### 模型训练 + +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" --log_dir ./log finetune.py \ + --model_name_or_path=unified_transformer-12L-cn \ + --train_data_path=./datasets/train.txt \ + --valid_data_path=./datasets/valid.txt \ + --save_dir=./checkpoints \ + --logging_steps=500 \ + --save_steps=8000 \ + --seed=2021 \ + --epochs=10 \ + --batch_size=8192 \ + --lr=1e-5 \ + --weight_decay=0.01 \ + --warmup_steps=4000 \ + --max_grad_norm=0.1 \ + --sort_pool_size=65536 \ + --device=gpu +``` + +其中参数释义如下: +- `gpus` 指示了训练所用的GPU卡号。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unified_transformer-12L-cn | + | unified_transformer-12L-cn-luge | + +- `train_data_path` 表示训练集文件路径。 +- `valid_data_path` 表示验证集文件路径。 +- `save_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `lr` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_steps` 表示学习率逐渐升高到基础学习率(即上面配置的lr)所需要的迭代数,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `max_grad_norm` 表示梯度裁剪允许的最大梯度值。 +- `sort_pool_size` 表示在构建batch数据时,用来排序的pool size。 +- `device` 表示训练使用的设备。 + +参数详情和参数的默认值请参考`args.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +./checkpoints/ +├── model_8000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── spm.model +│ ├── tokenizer_config.json +│ └── vocab.txt +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + +### 模型预测 + +运行如下命令即可在样例测试集上进行测试 + +```shell +export CUDA_VISIBLE_DEVICES=0 +# GPU启动,预测仅支持单卡 +python infer.py \ + --model_name_or_path=./checkpoints/model_80000 \ + --test_data_path=./datasets/test.txt \ + --output_path=./predict.txt \ + --logging_steps=500 \ + --seed=2021 \ + --batch_size=4 \ + --min_dec_len=1 \ + --max_dec_len=64 \ + --num_samples=20 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu +``` + +其中参数释义如下: +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unified_transformer-12L-cn | + | unified_transformer-12L-cn-luge | + +- `test_data_path` 表示预测集文件路径。 +- `output_path` 表示预测结果的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `seed` 表示随机数生成器的种子。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `min_dec_len` 表示预测生成的句子的最小长度。 +- `max_dec_len` 表示预测生成的句子的最大长度。 +- `num_samples` 表示每条样本生成的句子的数量。对于每条样本,模型会生成`num_samples`个句子,根据每个句子的概率得分进行排序,得分最高的句子作为最终的生成结果。 +- `decode_strategy` 表示预测解码时采取的策略,可选"sampling"、"greedy_search"和"beam_search"之一。 +- `top_k` 表示采用"sampling"解码策略时,token的概率按从大到小排序,生成的token只从前`top_k`个中进行采样。 +- `device` 表示训练使用的设备。 + +参数详情和参数的默认值请参考`args.py`。 + +程序运行结束后会将预测结果保存在`output_path`中。将预测结果准备成比赛官网要求的格式,提交评估即可得评估结果。 + +采用不同的模型在样例测试集上有如下结果: + +| model_name_or_path | F1 | BLEU1 / BLEU2 | DISTINCT1 / DISTINCT2 | +| :-----------------------------: | :---: | :-----------: | :-------------------: | +| unified_transformer-12L-cn | 10.62 | 0.070 / 0.022 | 0.065 / 0.304 | +| unified_transformer-12L-cn-luge | 33.11 | 0.245 / 0.157 | 0.074 / 0.238 | +| ./checkpoints/model_80000 | 32.38 | 0.239 / 0.150 | 0.070 / 0.219 | diff --git a/examples/dialogue/lic2021_baseline/args.py b/examples/dialogue/lic2021_baseline/args.py new file mode 100644 index 0000000000000000000000000000000000000000..cc08c7d618f6e3bdf93fb410adeaca63184b7966 --- /dev/null +++ b/examples/dialogue/lic2021_baseline/args.py @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
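+"""Argument definitions for the LIC 2021 dialogue baseline (shared by finetune.py and infer.py)."""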
+ +import argparse + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_name_or_path', type=str, default='unified_transformer-12L-cn', help='The path or shortcut name of the pre-trained model.') + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument('--train_data_path', type=str, default='./datasets/train.txt', help='Specify the path to load train data.') + parser.add_argument('--valid_data_path', type=str, default='./datasets/valid.txt', help='Specify the path to load valid data.') + parser.add_argument('--test_data_path', type=str, default='./datasets/test.txt', help='Specify the path to load test data.') + parser.add_argument('--logging_steps', type=int, default=500, help='Log every X updates steps.') + parser.add_argument('--save_steps', type=int, default=8000, help='Save checkpoint every X updates steps.') + parser.add_argument('--seed', type=int, default=2021, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=8192, required=True, help='Batch size per GPU/CPU for training.') + parser.add_argument('--lr', type=float, default=1e-5, help='The initial learning rate.') + parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') + parser.add_argument('--epochs', type=int, default=10, help='Total number of training epochs to perform.') + parser.add_argument('--warmup_steps', type=int, default=4000, help='The number of warmup steps.') + parser.add_argument('--max_grad_norm', type=float, default=0.1, help='The max value of grad norm.') + parser.add_argument('--sort_pool_size', type=int, default=65536, help='The pool size for sort in build batch data.') + parser.add_argument('--min_dec_len', type=int, default=1, help='The minimum sequence length of generation.') + parser.add_argument('--max_dec_len', type=int, default=64, help='The maximum sequence length of generation.') + parser.add_argument('--num_samples', type=int, default=1, help='The decode numbers in generation.') + parser.add_argument('--decode_strategy', type=str, default='sampling', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=0, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.0, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--early_stopping', type=eval, default=False, help='Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.') + parser.add_argument('--device', type=str, default='gpu', help='Device for selecting for the training.') + + args = parser.parse_args() + return args +# yapf: enable + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + 
print("------------------------------------------------") diff --git a/examples/dialogue/lic2021_baseline/data.py b/examples/dialogue/lic2021_baseline/data.py new file mode 100644 index 0000000000000000000000000000000000000000..6bc938d1b72902aa3504b02d42dfb7114c3c9bc4 --- /dev/null +++ b/examples/dialogue/lic2021_baseline/data.py @@ -0,0 +1,258 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from glob import glob + +import numpy as np +import paddle.distributed as dist +from paddle.io import IterableDataset + +from paddlenlp.transformers.tokenizer_utils import convert_to_unicode + + +class DialogueDataset(IterableDataset): + def __init__( + self, + filepattern, + batch_size, + pad_token_id, + bos_token_id, + sort_pool_size=2**16, + seed=1, + n_procs=None, + rank=None, + mode="test", + ): + super(DialogueDataset, self).__init__() + + self.file_list = glob(filepattern) + self.sort_pool_size = 0 if mode == "test" else sort_pool_size + self.n_procs = n_procs if n_procs else dist.get_world_size() + self.rank = rank if rank else dist.get_rank() + self.batch_size = batch_size * self.n_procs + self.shuffle = True if mode == "train" else False + self.mode = mode + self.pad_id = pad_token_id + self.bos_id = bos_token_id + self.global_rng = np.random.RandomState(seed) + + assert len(self.file_list) > 0, "There is no files in %s." 
% filepattern + + def load_file(self, file_path): + with open(file_path, "r", encoding="utf-8") as fin: + for i, line in enumerate(fin): + cols = convert_to_unicode(line).strip().split(";") + cols = list(map(lambda x: list(map(int, x.split(" "))), cols)) + if len(cols) > 3: + cols = cols[:3] + token_ids, type_ids, pos_ids = cols + if self.mode == "test": + tgt_start_idx = len(cols[0]) + else: + tgt_start_idx = token_ids.index(self.bos_id, 1) + sample = [token_ids, type_ids, pos_ids, tgt_start_idx] + yield sample + + def get_sorted_batch(self, pool): + """Generate sorted batches from pool.""" + pool = sorted(pool, key=lambda sample: len(sample[0])) + batches = [] + batch, max_len = [], 0 + for sample in pool: + max_len = max(max_len, len(sample[0])) + if self.mode == "test": + to_append = len(batch) < self.batch_size + else: + to_append = (len(batch) + 1) * max_len <= self.batch_size + if to_append: + batch.append(sample) + else: + batches.append(batch) + batch, max_len = [sample], len(sample[0]) + if len(batch) > 0: + batches.append(batch) + if self.shuffle: + self.global_rng.shuffle(batches) + for batch in batches: + yield batch + + @property + def get_batch(self): + all_files = list(self.file_list) + if self.shuffle: + self.global_rng.shuffle(all_files) + if self.sort_pool_size > 0: + pool = [] + for file_path in all_files: + for sample in self.load_file(file_path): + pool.append(sample) + if len(pool) == self.sort_pool_size: + for batch in self.get_sorted_batch(pool): + yield batch + pool = [] + if len(pool) > 0: + for batch in self.get_sorted_batch(pool): + yield batch + else: + batch, max_len = [], 0 + for file_path in all_files: + for sample in self.load_file(file_path): + max_len = max(max_len, len(sample[0])) + if self.mode == "test": + to_append = len(batch) < self.batch_size + else: + to_append = (len(batch) + 1) * max_len <= self.batch_size + if to_append: + batch.append(sample) + else: + yield batch + batch, max_len = [sample], len(sample[0]) + if len(batch) > 0: + yield batch + + def pad_batch_data(self, batch): + """Pad the instances to the max sequence length in batch.""" + max_len = max(map(len, batch)) + batch_data = np.array([list(data) + [self.pad_id] * (max_len - len(data)) for data in batch], dtype="int64") + return batch_data + + def gen_tgt_label_and_pos(self, batch_token_ids, batch_tgt_start_idx): + max_len = max(map(len, batch_token_ids)) + tgt_label = [] + tgt_pos = [] + for sent_index, sent in enumerate(batch_token_ids): + sent_b_index = batch_tgt_start_idx[sent_index] + tgt_label.extend(sent[sent_b_index + 1 :]) + tgt_pos.extend([sent_index * max_len + i for i in range(sent_b_index, len(sent) - 1)]) + tgt_label = np.array(tgt_label).astype("int64") + tgt_pos = np.array(tgt_pos).astype("int64") + + return tgt_label, tgt_pos + + def gen_self_attn_mask(self, batch_token_ids, batch_tgt_start_idx): + max_len = max(map(len, batch_token_ids)) + input_mask_data = np.zeros((len(batch_token_ids), max_len, max_len)) + for index, mask_data in enumerate(input_mask_data): + start = batch_tgt_start_idx[index] + end = len(batch_token_ids[index]) + mask_data[:end, :start] = 1.0 + # Generate the lower triangular matrix using the slice of matrix + b = np.tril(np.ones([end - start, end - start]), 0) + mask_data[start:end, start:end] = b + return input_mask_data.astype("float32") + + def __iter__(self): + for batch_data in self.get_batch: + # sample [token_ids, type_ids, pos_ids, tgt_start_idx] + # raw_batch [sample0, sample1, ...] 
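+            # The raw batch was assembled with a budget of batch_size * n_procs
+            # (see __init__), so each process keeps every n_procs-th sample
+            # starting at its own rank, giving disjoint shards across GPUs.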
+ if self.n_procs > 1: + batch_data = batch_data[self.rank :: self.n_procs] + batch_data = zip(*batch_data) + token_ids, type_ids, pos_ids, tgt_start_idx = batch_data + + pad_token_ids = self.pad_batch_data(token_ids) + pad_type_ids = self.pad_batch_data(type_ids) + pad_pos_ids = self.pad_batch_data(pos_ids) + + generation_mask = self.gen_self_attn_mask(token_ids, tgt_start_idx) + + if self.mode == "test": + # [batch_size, 1] + tgt_ids = np.array([[self.bos_id]] * len(token_ids), dtype="int64") + tgt_type = np.ones((len(token_ids), 1), dtype="int64") + tgt_pos = np.array(tgt_start_idx, dtype="int64").reshape(-1, 1) + tgt_generation_mask = generation_mask[:, 0:1, :].astype("float32") + + pad_token_ids = np.concatenate((pad_token_ids, tgt_ids), axis=1) + pad_type_ids = np.concatenate((pad_type_ids, tgt_type), axis=1) + pad_pos_ids = np.concatenate((pad_pos_ids, tgt_pos), axis=1) + generation_mask = np.concatenate((generation_mask, tgt_generation_mask), axis=1) + + append_mask = np.zeros((generation_mask.shape[0], generation_mask.shape[1], 1), dtype="float32") + append_mask[:, -1, :] = 1.0 + generation_mask = np.concatenate((generation_mask, append_mask), axis=2) + generation_mask = (generation_mask - 1.0) * 1e9 + generation_mask = np.expand_dims(generation_mask, axis=1) + yield (pad_token_ids, pad_type_ids, pad_pos_ids, generation_mask) + else: + tgt_label, tgt_pos = self.gen_tgt_label_and_pos(token_ids, tgt_start_idx) + generation_mask = (generation_mask - 1.0) * 1e9 + generation_mask = np.expand_dims(generation_mask, axis=1) + yield (pad_token_ids, pad_type_ids, pad_pos_ids, generation_mask, tgt_label, tgt_pos) + + +def post_process_response(token_ids, tokenizer): + """ + Post-process the decoded sequence. Truncate from the first + and remove the and tokens currently. 
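+    (i.e. cut the sequence at the first sep/eos token, convert the remaining
+    ids back to tokens and merge subwords into the response string.)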
+ """ + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.sep_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + response = tokenizer.merge_subword(tokens) + return token_ids, response + + +def get_in_turn_repetition(pred, is_cn=False): + """Get in-turn repetition.""" + if len(pred) == 0: + return 1.0 + if isinstance(pred[0], str): + pred = [tok.lower() for tok in pred] + if is_cn: + pred = "".join(pred) + tri_grams = set() + for i in range(len(pred) - 2): + tri_gram = tuple(pred[i : i + 3]) + if tri_gram in tri_grams: + return True + tri_grams.add(tri_gram) + return False + + +def select_response(ids, scores, tokenizer, max_dec_len=None, num_samples=1): + ids = ids.numpy().tolist() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_samples) != 0: + raise ValueError("the length of `ids` is {}, but the `num_samples` is {}".format(len(ids), num_samples)) + + group = [] + tmp = [] + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_response(pred, tokenizer) + num_token = len(pred_token_ids) + response = " ".join(pred_tokens) + + in_turn_repetition = get_in_turn_repetition(pred_tokens, True) or get_in_turn_repetition(pred_token_ids) + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + elif in_turn_repetition: + score -= 1e3 + + tmp.append([response, score]) + if len(tmp) == num_samples: + group.append(tmp) + tmp = [] + + results = [] + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + return results diff --git a/examples/dialogue/lic2021_baseline/finetune.py b/examples/dialogue/lic2021_baseline/finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..8f74d75ef84bc06e6a86cf79c246ffdbee114e9b --- /dev/null +++ b/examples/dialogue/lic2021_baseline/finetune.py @@ -0,0 +1,149 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
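+"""Finetuning entry for the LIC 2021 dialogue baseline (UnifiedTransformer)."""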
+ +import math +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn as nn +import paddle.nn.functional as F +from args import parse_args, print_args +from data import DialogueDataset +from paddle.io import DataLoader +from paddle.optimizer import AdamW +from paddle.optimizer.lr import NoamDecay + +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def main(args): + paddle.set_device(args.device) + paddle.seed(args.seed) + world_size = dist.get_world_size() + if world_size > 1: + dist.init_parallel_env() + + model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) + if world_size > 1: + model = paddle.DataParallel(model) + + train_dataset = DialogueDataset( + args.train_data_path, + args.batch_size, + tokenizer.pad_token_id, + tokenizer.cls_token_id, + args.sort_pool_size, + args.seed, + mode="train", + ) + train_dataloader = DataLoader(train_dataset, return_list=True, batch_size=None) + valid_dataset = DialogueDataset( + args.valid_data_path, + args.batch_size, + tokenizer.pad_token_id, + tokenizer.cls_token_id, + args.sort_pool_size, + mode="valid", + ) + valid_dataloader = DataLoader(valid_dataset, return_list=True, batch_size=None) + + lr_scheduler = NoamDecay(1 / (args.warmup_steps * (args.lr**2)), args.warmup_steps) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
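+    # NOTE: NoamDecay above is constructed with d_model = 1 / (warmup_steps * lr**2);
+    # with Paddle's Noam schedule this makes the learning rate rise to args.lr at
+    # step == warmup_steps and then decay proportionally to step**-0.5.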
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + step = 0 + total_time = 0.0 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for inputs in train_dataloader: + step += 1 + token_ids, type_ids, pos_ids, generation_mask, tgt_label, tgt_pos = inputs + + logits = model(token_ids, type_ids, pos_ids, generation_mask, tgt_pos) + loss = F.cross_entropy(logits, tgt_label) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + if step % args.save_steps == 0: + evaluation(model, valid_dataloader) + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + batch_start_time = time.time() + + +@paddle.no_grad() +def evaluation(model, data_loader): + print("\nEval begin...") + model.eval() + total_tokens = 0 + total_loss = 0.0 + start_time = time.time() + step = 0 + for inputs in data_loader: + step += 1 + token_ids, type_ids, pos_ids, generation_mask, tgt_label, tgt_pos = inputs + + logits = model(token_ids, type_ids, pos_ids, generation_mask, tgt_pos) + loss = F.cross_entropy(logits, tgt_label, reduction="sum") + + total_loss += float(loss.numpy()) + total_tokens += tgt_label.shape[0] + + avg_loss = total_loss / total_tokens + ppl = math.exp(avg_loss) + avg_speed = (time.time() - start_time) / step + print("loss: %.4f - ppl: %.4f - %.3fs/step\n" % (avg_loss, ppl, avg_speed)) + model.train() + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + + main(args) diff --git a/examples/dialogue/lic2021_baseline/infer.py b/examples/dialogue/lic2021_baseline/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..b41cb6fcf2d0ba644525ec17684858437b0fe453 --- /dev/null +++ b/examples/dialogue/lic2021_baseline/infer.py @@ -0,0 +1,88 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
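+"""Inference entry for the LIC 2021 dialogue baseline: generate and select responses."""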
+ +import time + +import paddle +from args import parse_args, print_args +from data import DialogueDataset, select_response +from paddle.io import DataLoader + +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +def main(args): + paddle.set_device(args.device) + paddle.seed(args.seed) + + model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) + + test_dataset = DialogueDataset( + args.test_data_path, args.batch_size, tokenizer.pad_token_id, tokenizer.cls_token_id, mode="test" + ) + test_dataloader = DataLoader(test_dataset, return_list=True, batch_size=None) + + infer(model, test_dataloader, tokenizer) + + +@paddle.no_grad() +def infer(model, data_loader, tokenizer): + print("\nInfer begin...") + model.eval() + total_time = 0.0 + start_time = time.time() + responses = [] + for step, inputs in enumerate(data_loader, 1): + token_ids, type_ids, pos_ids, generation_mask = inputs + ids, scores = model.generate( + input_ids=token_ids, + token_type_ids=type_ids, + position_ids=pos_ids, + attention_mask=generation_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + num_return_sequences=args.num_samples, + use_fast=False, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + results = select_response(ids, scores, tokenizer, args.max_dec_len, args.num_samples) + responses.extend(results) + + start_time = time.time() + + with open(args.output_path, "w", encoding="utf-8") as fout: + for response in responses: + fout.write(response + "\n") + print("\nSave inference result into: %s" % args.output_path) + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + main(args) diff --git a/examples/dialogue/plato-2/README.md b/examples/dialogue/plato-2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..20d6e6644d702162a2d4b4d7b2a9047aeb091e83 --- /dev/null +++ b/examples/dialogue/plato-2/README.md @@ -0,0 +1,67 @@ +# PLATO-2 + +## 模型简介 + +构建高质量的开放领域(Open-Domain)的对话机器人,使得它能用自然语言与人自由地交流,这一直是自然语言处理领域终极目标之一。 + +为了能够简易地构建一个高质量的开放域聊天机器人,本项目在Paddle2.0上实现了PLATO-2的预测模型,并基于终端实现了简单的人机交互。用户可以通过下载预训练模型快速构建一个开放域聊天机器人。 + +PLATO-2的网络结构见下图: + +
+![PLATO-2 网络结构](./imgs/network.png)
+ +PLATO-2的训练过程及其他细节详见 [Knover](https://github.com/PaddlePaddle/Knover) + +## 快速开始 + +### 环境依赖 + +- python 3.7+ +- sentencepiece +- termcolor + +安装方式:`pip install sentencepiece termcolor` + +### 数据准备 + +您可以从以下位置下载预训练模型文件: + +* PLATO-2, 24-layers, 16-heads, 1024-hidden, EN: [预训练模型](https://bj.bcebos.com/paddlenlp/models/transformers/plato2/24L.pdparams) +* PLATO-2, 32-layers, 32-heads, 2048-hidden, EN: [预训练模型](https://bj.bcebos.com/paddlenlp/models/transformers/plato2/32L.pdparams) + +以24层预训练模型为例: + +```shell +wget https://bj.bcebos.com/paddlenlp/models/transformers/plato2/24L.pdparams +``` + +**NOTE:** PLATO-2网络参数量较大,24层网络至少需要显存16G,32层网络至少需要显存22G,用户可选择合适的网络层数及预训练模型。 + +sentencepiece分词预训练模型和词表文件下载: + +```shell +wget https://bj.bcebos.com/paddlenlp/models/transformers/plato2/data.tar.gz +tar -zxf data.tar.gz +``` + +### 人机交互 + +运行如下命令即可开始与聊天机器人用英语进行简单的对话 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python interaction.py --vocab_path ./data/vocab.txt --spm_model_file ./data/spm.model --num_layers 24 --init_from_ckpt ./24L.pdparams +``` + +以上参数表示: + +* vocab_path:词表文件路径。 +* spm_model_file:sentencepiece分词预训练模型路径。 +* num_layers:PLATO-2组网层数。 +* init_from_ckpt:PLATO-2预训练模型路径。 + +32层PLATO-2网络交互示例: + +
+![32层PLATO-2 交互示例](./imgs/case.jpg)
+ +**NOTE:** 输入"[EXIT]"退出交互程序,输入"[NEXT]"开启下一轮新的对话。 diff --git a/examples/dialogue/plato-2/imgs/case.jpg b/examples/dialogue/plato-2/imgs/case.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e3378e4164a3d0fe0a43334414bd4b0a13605458 Binary files /dev/null and b/examples/dialogue/plato-2/imgs/case.jpg differ diff --git a/examples/dialogue/plato-2/imgs/network.png b/examples/dialogue/plato-2/imgs/network.png new file mode 100644 index 0000000000000000000000000000000000000000..c14de8e75d7411b0ce9ea94d565402b2b245e7a2 Binary files /dev/null and b/examples/dialogue/plato-2/imgs/network.png differ diff --git a/examples/dialogue/plato-2/interaction.py b/examples/dialogue/plato-2/interaction.py new file mode 100644 index 0000000000000000000000000000000000000000..6c4227b3d0d12d31fb168078356c93cd5c3b418a --- /dev/null +++ b/examples/dialogue/plato-2/interaction.py @@ -0,0 +1,104 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import argparse +from collections import namedtuple + +import paddle +from model import Plato2InferModel +from readers.nsp_reader import NSPReader +from readers.plato_reader import PlatoReader +from termcolor import colored, cprint +from utils import gen_inputs +from utils.args import parse_args + +from paddlenlp.trainer.argparser import strtobool + + +def setup_args(): + """Setup arguments.""" + parser = argparse.ArgumentParser() + group = parser.add_argument_group("Model") + group.add_argument("--init_from_ckpt", type=str, default="") + group.add_argument("--vocab_size", type=int, default=8001) + group.add_argument("--latent_type_size", type=int, default=20) + group.add_argument("--num_layers", type=int, default=24) + + group = parser.add_argument_group("Task") + group.add_argument("--is_cn", type=strtobool, default=False) + + args, _ = parser.parse_known_args() + NSPReader.add_cmdline_args(parser) + + args = parse_args(parser) + args.batch_size *= args.latent_type_size + + # print(json.dumps(args, indent=2)) + return args + + +def load_params(model, init_from_ckpt): + state_dict = paddle.load(init_from_ckpt) + model.set_state_dict(state_dict) + + +def interact(args): + """Inference main function.""" + plato_reader = PlatoReader(args) + nsp_reader = NSPReader(args) + + if args.num_layers == 24: + n_head = 16 + hidden_size = 1024 + elif args.num_layers == 32: + n_head = 32 + hidden_size = 2048 + else: + raise ValueError( + "The pre-trained model only support 24 or 32 layers, " "but received num_layers=%d." % args.num_layers + ) + + model = Plato2InferModel(nsp_reader, args.num_layers, n_head, hidden_size) + load_params(model, args.init_from_ckpt) + model.eval() + + Example = namedtuple("Example", ["src", "data_id"]) + context = [] + start_info = "Enter [EXIT] to quit the interaction, [NEXT] to start a new conversation." 
+ cprint(start_info, "yellow", attrs=["bold"]) + while True: + user_utt = input(colored("[Human]: ", "red", attrs=["bold"])).strip() + if user_utt == "[EXIT]": + break + elif user_utt == "[NEXT]": + context = [] + cprint(start_info, "yellow", attrs=["bold"]) + else: + context.append(user_utt) + example = Example(src=" [SEP] ".join(context), data_id=0) + record = plato_reader._convert_example_to_record(example, is_infer=True) + data = plato_reader._pad_batch_records([record], is_infer=True) + inputs = gen_inputs(data, args.latent_type_size) + inputs["tgt_ids"] = inputs["tgt_ids"].astype("int64") + pred = model(inputs)[0] + bot_response = pred["response"] + print(colored("[Bot]:", "blue", attrs=["bold"]), colored(bot_response, attrs=["bold"])) + context.append(bot_response) + return + + +if __name__ == "__main__": + args = setup_args() + interact(args) diff --git a/examples/dialogue/plato-2/model.py b/examples/dialogue/plato-2/model.py new file mode 100644 index 0000000000000000000000000000000000000000..437a708876838bce79489cac3204f60a2c861a2a --- /dev/null +++ b/examples/dialogue/plato-2/model.py @@ -0,0 +1,458 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import namedtuple + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +def post_process_context(token_ids, reader, merge=True): + """Post-process the context sequence.""" + context = [] + utt = [] + for tok_id in token_ids[1:]: + if tok_id == reader.eos_id: + utt = reader.tokenizer.convert_ids_to_tokens(utt) + if merge: + utt = reader.tokenizer.merge_subword(utt) + context.append(utt) + utt = [] + else: + utt.append(tok_id) + return context + + +def post_process_response(token_ids, reader, merge=True): + """ + Post-process the decoded sequence. Truncate from the first + and remove the and tokens currently. 
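+    (i.e. drop the leading bos id, cut at the first eos id, and map the rest
+    back to tokens, optionally merging subwords.)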
+ """ + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == reader.eos_id: + eos_pos = i + break + token_ids = token_ids[1:eos_pos] + response = reader.tokenizer.convert_ids_to_tokens(token_ids) + if merge: + response = reader.tokenizer.merge_subword(response) + return token_ids, response + + +def get_cross_turn_repetition(context, pred_tokens, eos_idx, is_cn=False): + """Get cross-turn repetition.""" + if len(pred_tokens) == 0: + return 1.0 + if is_cn: + context = ["".join(utt) for utt in context] + pred_tokens = "".join(pred_tokens) + + pred_tri_grams = set() + for i in range(len(pred_tokens) - 2): + tri_gram = tuple(pred_tokens[i : i + 3]) + pred_tri_grams.add(tri_gram) + for utt in context: + for i in range(len(utt) - 2): + tri_gram = tuple(utt[i : i + 3]) + if tri_gram in pred_tri_grams: + return 1.0 + return 0.0 + + +def get_in_turn_repetition(pred, is_cn=False): + """Get in-turn repetition.""" + if len(pred) == 0: + return 1.0 + if isinstance(pred[0], str): + pred = [tok.lower() for tok in pred] + if is_cn: + pred = "".join(pred) + tri_grams = set() + for i in range(len(pred) - 2): + tri_gram = tuple(pred[i : i + 3]) + if tri_gram in tri_grams: + return 1.0 + tri_grams.add(tri_gram) + return 0.0 + + +class Plato2EncoderLayer(nn.Layer): + def __init__(self, n_head, hidden_size, attn_dropout, act_dropout): + super(Plato2EncoderLayer, self).__init__() + + self.self_attn = nn.MultiHeadAttention(hidden_size, n_head, attn_dropout) + self.pre_norm_layer = nn.LayerNorm(hidden_size) + self.post_norm_layer = nn.LayerNorm(hidden_size) + self.fc1 = nn.Linear(hidden_size, hidden_size * 4) + self.fc2 = nn.Linear(hidden_size * 4, hidden_size) + + self.dropout_layer = nn.Dropout(act_dropout) + self.gelu_layer = nn.GELU() + + def forward(self, x, attn_mask, cache): + query = self.pre_norm_layer(x) + attn_output, new_cache = self.self_attn(query, None, None, attn_mask, cache) + attn_output = self.dropout_layer(attn_output) + attn_output = attn_output + x + ffd_input = self.post_norm_layer(attn_output) + + ffd_output = self.fc1(ffd_input) + ffd_output = self.gelu_layer(ffd_output) + ffd_output = self.dropout_layer(ffd_output) + + ffd_output = self.fc2(ffd_output) + ffd_output = self.dropout_layer(ffd_output) + out = ffd_output + attn_output + + return out, new_cache + + def gen_cache(self, key): + return self.self_attn.gen_cache(key) + + +class Plato2Encoder(nn.Layer): + def __init__( + self, vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout + ): + super(Plato2Encoder, self).__init__() + + self.n_head = n_head + + self.word_embedding_layer = nn.Embedding(vocab_size, hidden_size) + self.sent_embedding_layer = nn.Embedding(type_size, hidden_size) + self.pos_embedding_layer = nn.Embedding(max_position_seq_len, hidden_size) + + self.encoder_layers = [] + for i in range(num_layers): + encoder_layer = Plato2EncoderLayer(n_head, hidden_size, attn_dropout, act_dropout) + self.encoder_layers.append(encoder_layer) + self.add_sublayer("layers." 
+ str(i), encoder_layer) + self.post_encoder_layer_norm = nn.LayerNorm(hidden_size) + + self.dropout_layer = nn.Dropout(act_dropout) + + def forward(self, caches, token_ids, type_ids, pos_ids, generation_mask, aux_emb=None): + out, self_attn_mask = self.gen_input(token_ids, type_ids, pos_ids, generation_mask, aux_emb) + + new_caches = [] + for i, encoder_layer in enumerate(self.encoder_layers): + out, new_cache = encoder_layer(out, self_attn_mask, caches[i]) + new_caches.append(new_cache) + + enc_output = self.post_encoder_layer_norm(out) + return enc_output, new_caches + + def gen_input(self, token_ids, type_ids, pos_ids, input_mask, aux_emb=None): + token_emb_out = self.word_embedding_layer(token_ids) + type_emb_out = self.sent_embedding_layer(type_ids) + pos_emb_out = self.pos_embedding_layer(pos_ids) + emb_out = token_emb_out + type_emb_out + pos_emb_out + + # auxiliary memory embeddings + if aux_emb is not None: + emb_out = paddle.concat([aux_emb, emb_out], axis=1) + + emb_out = self.dropout_layer(emb_out) + + # generate n-head self-attention mask + self_attn_mask = input_mask + self_attn_mask = paddle.scale(x=self_attn_mask, scale=1e4, bias=-1.0, bias_after_scale=False) + n_head_self_attn_mask = paddle.stack(x=[self_attn_mask] * self.n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + + return emb_out, n_head_self_attn_mask + + def gen_caches(self, key): + caches = [encoder_layer.gen_cache(key) for encoder_layer in self.encoder_layers] + return caches + + +class NSP(nn.Layer): + def __init__( + self, vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout + ): + super(NSP, self).__init__() + + self.n_head = n_head + self.hidden_size = hidden_size + + self.word_embedding_layer = nn.Embedding(vocab_size, hidden_size) + self.sent_embedding_layer = nn.Embedding(type_size, hidden_size) + self.pos_embedding_layer = nn.Embedding(max_position_seq_len, hidden_size) + + encoder_layer = nn.TransformerEncoderLayer( + hidden_size, n_head, hidden_size * 4, act_dropout, "gelu", attn_dropout, act_dropout, "True" + ) + encoder_norm = nn.LayerNorm(hidden_size) + self.encoder = nn.TransformerEncoder(encoder_layer, num_layers, encoder_norm) + self.fc1 = nn.Linear(hidden_size, hidden_size) + self.fc2 = nn.Linear(hidden_size, 2) + + self.dropout_layer = nn.Dropout(act_dropout) + self.tanh_layer = nn.Tanh() + self.softmax = nn.Softmax() + + def forward(self, inputs): + token_ids = inputs["token_ids"] + type_ids = inputs["type_ids"] + pos_ids = inputs["pos_ids"] + attention_mask = inputs["attention_mask"] + label_pos = inputs["label_pos"] + + out, self_attn_mask = self.gen_input(token_ids, type_ids, pos_ids, attention_mask) + # [-1, seq_len, hidden_size] + enc_out = self.encoder(out, self_attn_mask) + + enc_out = paddle.reshape(enc_out, [-1, self.hidden_size]) + label_pos = paddle.cast(label_pos, "int64") + out = paddle.gather(enc_out, label_pos) + pooled_out = self.fc1(out) + pooled_out = self.tanh_layer(pooled_out) + + # [-1, 2] + logits = self.fc2(pooled_out) + probs = self.softmax(logits) + + return probs + + def gen_input(self, token_ids, type_ids, pos_ids, input_mask, aux_emb=None): + token_emb_out = self.word_embedding_layer(token_ids) + type_emb_out = self.sent_embedding_layer(type_ids) + pos_emb_out = self.pos_embedding_layer(pos_ids) + emb_out = token_emb_out + type_emb_out + pos_emb_out + + # auxiliary memory embeddings + if aux_emb is not None: + emb_out = paddle.concat([aux_emb, emb_out], axis=1) + + emb_out = 
self.dropout_layer(emb_out) + + # generate n-head self-attention mask + self_attn_mask = input_mask + self_attn_mask = paddle.scale(x=self_attn_mask, scale=1e4, bias=-1.0, bias_after_scale=False) + n_head_self_attn_mask = paddle.stack(x=[self_attn_mask] * self.n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + + return emb_out, n_head_self_attn_mask + + +class Plato2InferModel(nn.Layer): + def __init__( + self, + nsp_reader, + num_layers, + n_head, + hidden_size, + vocab_size=8001, + type_size=2, + latent_type_size=20, + max_position_seq_len=256, + act_dropout=0.1, + attn_dropout=0.1, + max_dec_len=64, + min_dec_len=1, + topk=10, + ): + super(Plato2InferModel, self).__init__() + + self.nsp_reader = nsp_reader + self.num_layers = num_layers + self.latent_type_size = latent_type_size + self.max_dec_len = max_dec_len + self.min_dec_len = min_dec_len + self.topk = topk + self.unk_id = 0 + self.bos_id = 1 + self.eos_id = 2 + self.mask_id = 8000 + self.after_eos = paddle.ones([vocab_size]) * -1e9 + self.after_eos[self.eos_id] = 0 + self.is_cn = False + self.batch_size = 1 + + self.latent_weight = paddle.create_parameter([hidden_size, latent_type_size], "float32") + + self.plato2_encoder = Plato2Encoder( + vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout + ) + + self.logits_fc_layer = nn.Linear(hidden_size, hidden_size) + self.logits_layer_norm = nn.LayerNorm(hidden_size) + self.logits_bias = paddle.create_parameter([vocab_size], "float32", is_bias=True) + + self.nsp_predictor = NSP( + vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout + ) + + self.gelu_layer = nn.GELU() + self.softmax = nn.Softmax() + + @paddle.no_grad() + def forward(self, inputs): + token_ids = inputs["token_ids"] + type_ids = inputs["type_ids"] + pos_ids = inputs["pos_ids"] + generation_mask = inputs["generation_mask"] + latent_id = inputs["latent_id"] + data_id = inputs["data_id"] + + # [-1, 1, latent_type_size] + latent_id = F.one_hot(latent_id, self.latent_type_size) + # [-1, 1, hidden_size] + latent_emb = paddle.matmul(latent_id, self.latent_weight, transpose_y=True) + + caches = self.plato2_encoder.gen_caches(token_ids) + + # [-1, seq_len + 1, hidden_size] + enc_out, new_caches = self.plato2_encoder(caches, token_ids, type_ids, pos_ids, generation_mask, latent_emb) + + pred_ids = self.decode(inputs, new_caches) + + nsp_inputs = self.gen_nsp_input(token_ids, pred_ids) + # [-1, 2] + probs = self.nsp_predictor(nsp_inputs) + + return self.get_results(data_id, token_ids, pred_ids, probs) + + def decode(self, inputs, caches): + tgt_ids = inputs["tgt_ids"] + tgt_pos = inputs["tgt_pos"] + tgt_generation_mask = inputs["tgt_generation_mask"] + predictions = tgt_ids + + # TODO + step = 0 + while step < self.max_dec_len: + # [-1, 1] + append_mask = paddle.cast(tgt_ids != self.eos_id, dtype=tgt_generation_mask.dtype) + tgt_generation_mask = paddle.concat([tgt_generation_mask, paddle.unsqueeze(append_mask, 1)], axis=-1) + tgt_sent = paddle.ones([tgt_generation_mask.shape[0], 1], dtype=tgt_ids.dtype) + + # [-1, 1, hidden_size] + out, caches = self.plato2_encoder(caches, tgt_ids, tgt_sent, tgt_pos, tgt_generation_mask) + out = paddle.squeeze(out, axis=1) + + # [-1, hidden_size] + trans = self.logits_fc_layer(out) + trans = self.gelu_layer(trans) + trans = self.logits_layer_norm(trans) + + # [-1, vocab_size] + logits = ( + paddle.matmul(trans, self.plato2_encoder.word_embedding_layer.weight, transpose_y=True) + + 
self.logits_bias + ) + logits[:, self.unk_id] = -1e9 + logits[:, self.bos_id] = -1e9 + logits[:, self.mask_id] = -1e9 + if step < self.min_dec_len: + logits[:, self.eos_id] = -1e9 + logits = logits * append_mask + (1 - append_mask) * self.after_eos + probs = self.softmax(logits) + + # [-1, topk] + topk_probs, _ = paddle.topk(probs, k=self.topk) + mask = paddle.cast(probs >= topk_probs[:, -1:], "float32") + sums = paddle.sum(topk_probs, axis=-1, keepdim=True) + new_probs = probs * mask / sums + # [-1, 1] + sampling_ids = paddle.multinomial(new_probs) + + step = step + 1 + tgt_ids = sampling_ids + tgt_pos = tgt_pos + 1 + predictions = paddle.concat([predictions, tgt_ids], axis=1) + return predictions + + def gen_nsp_input(self, token_ids, pred_ids): + token_ids = token_ids.numpy() + pred_ids = pred_ids.numpy() + + def __reader__(): + headers = ["src", "tgt", "data_id"] + + Example = namedtuple("Example", headers) + + for i, (raw, pred) in enumerate(zip(token_ids, pred_ids)): + context = post_process_context(raw, self.nsp_reader, merge=False) + _, response = post_process_response(pred, self.nsp_reader, merge=False) + context_tokenized_input = " [SEP] ".join(" ".join(utt) for utt in context) + response_tokenized_input = " ".join(response) + example = Example(src=context_tokenized_input, tgt=response_tokenized_input, data_id=i) + data = self.nsp_reader._convert_example_to_record(example, is_infer=True) + yield data + return + + generator = self.nsp_reader.data_generator( + reader=__reader__, + is_infer=True, + phase="test", + ) + inputs = next(generator()) + + # print('\nnsp_inputs:') + for key in inputs: + inputs[key] = paddle.to_tensor(inputs[key]) + if key in ["token_ids", "type_ids", "pos_ids"]: + inputs[key] = paddle.squeeze(inputs[key], axis=-1) + # print(key, inputs[key].shape) + # print(inputs[key]) + return inputs + + def get_results(self, data_id, token_ids, pred_ids, probs): + data_id = data_id.numpy() + token_ids = token_ids.numpy() + pred_ids = pred_ids.numpy() + probs = probs.numpy() + + infos = [] + for raw, pred, prob in zip(token_ids, pred_ids, probs): + tokens = post_process_context(raw, self.nsp_reader) + pred_token_ids, pred_tokens = post_process_response(pred, self.nsp_reader) + info = {} + info["response"] = " ".join(pred_tokens) + cross_turn_repetition = get_cross_turn_repetition(tokens, pred_tokens, self.nsp_reader.eos_id, self.is_cn) + in_turn_repetition = max( + get_in_turn_repetition(pred_tokens, self.is_cn), get_in_turn_repetition(pred_token_ids) + ) + + info["score"] = float(prob[1]) + if len(pred_token_ids) >= self.max_dec_len: + info["score"] -= 1e3 + elif cross_turn_repetition > 0: + info["score"] -= 1e3 + elif in_turn_repetition > 0: + info["score"] -= 1e3 + infos.append(info) + + results = [] + pre_idx = 0 + sample = [] + for idx, info in zip(data_id, infos): + if idx != pre_idx: + sample = sorted(sample, key=lambda info: -info["score"]) + result = sample[0] + result["data_id"] = pre_idx + results.apeend(result) + sample = [] + pre_idx = idx + sample.append(info) + if sample: + sample = sorted(sample, key=lambda info: -info["score"]) + result = sample[0] + result["data_id"] = pre_idx + results.append(result) + return results diff --git a/examples/dialogue/plato-2/readers/dialog_reader.py b/examples/dialogue/plato-2/readers/dialog_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..00339c1f0a0e83ba75015598c1159790d48cb69b --- /dev/null +++ b/examples/dialogue/plato-2/readers/dialog_reader.py @@ -0,0 +1,436 @@ +# Copyright (c) 2020 
PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Dialogue Reader.""" + +import csv +import gzip +from collections import namedtuple +from contextlib import contextmanager + +import numpy as np +import utils.tokenization as tokenization +from utils import pad_batch_data +from utils.masking import mask + +from paddlenlp.trainer.argparser import strtobool + + +class DialogReader(object): + """The implement of DialogReader.""" + + @classmethod + def add_cmdline_args(cls, parser): + """Add cmdline argurments.""" + group = parser.add_argument_group("Reader") + group.add_argument("--max_src_len", type=int, default=128) + group.add_argument("--max_tgt_len", type=int, default=128) + group.add_argument("--truncate_first_turn", type=strtobool, default=False) + group.add_argument("--file_format", type=str, default="file", choices=["file", "filelist"]) + group.add_argument("--data_format", type=str, default="raw", choices=["raw", "tokenized", "numerical"]) + group.add_argument("--in_tokens", type=strtobool, default=False) + group.add_argument("--batch_size", type=int, default=16) + group.add_argument("--continuous_position", type=strtobool, default=True) + group.add_argument("--random_seed", type=int, default=11) + group.add_argument("--sort_pool_size", type=int, default=2**16) + + group = parser.add_argument_group("Tokenizer") + group.add_argument("--tokenizer", type=str, default="SentencePieceTokenizer") + args, _ = parser.parse_known_args() + tokenizer_cls = getattr(tokenization, args.tokenizer) + tokenizer_cls.add_cmdline_args(parser) + return group + + def __init__(self, args): + tokenizer_cls = getattr(tokenization, args.tokenizer) + self.tokenizer = tokenizer_cls(args) + self.vocab = self.tokenizer.vocab + self.pad_id = args.pad_id = self.vocab["[PAD]"] + self.bos_id = args.bos_id = self.vocab["[CLS]"] + self.eos_id = args.eos_id = self.vocab["[SEP]"] + self.unk_id = args.unk_id = self.vocab["[UNK]"] + self.mask_id = args.mask_id = self.vocab["[MASK]"] + self.vocab_size = args.get("vocab_size", 0) + self.max_src_len = args.max_src_len + self.max_tgt_len = args.max_tgt_len + self.truncate_first_turn = args.truncate_first_turn + self.file_format = args.file_format + self.data_format = args.data_format + self.in_tokens = args.in_tokens + self.batch_size = args.batch_size + self.continuous_position = args.continuous_position + self.sort_pool_size = args.sort_pool_size + + # random_seed must be set for data slicing when using multi-gpu + self.global_rng = np.random.RandomState(args.random_seed) + + # training progress + self.current_example = 0 + self.current_epoch = 0 + self.num_examples = 0 + + # model related + + self.fields = ["token_ids", "type_ids", "pos_ids"] + self.num_numerical_fields = len(self.fields) + self.fields += ["tgt_start_idx", "data_id"] + self.sort_key = lambda record: [len(record.token_ids)] + + self.Record = namedtuple("Record", self.fields, defaults=(None,) * len(self.fields)) + + self.features = {} + return + + def 
get_train_progress(self): + """Gets progress for training phase.""" + return self.current_epoch, self.current_file_index, self.total_file + + def _convert_example_to_record(self, example, is_infer): + # process src + src_token_ids = [] + src_pos_ids = [] + + # tokenize src + s_token_ids_list = [] + for s in example.src.split("[SEP]"): + s = tokenization.convert_to_unicode(s).strip() + + if self.data_format == "tokenized": + s_tokens = s.split(" ") + else: + s_tokens = self.tokenizer.tokenize(s) + + s_token_ids = self.tokenizer.convert_tokens_to_ids(s_tokens) + [self.eos_id] + s_token_ids_list.append(s_token_ids) + + # trim src + idx = len(s_token_ids_list) - 1 + total_token_num = 1 + while idx >= 0: + total_token_num += len(s_token_ids_list[idx]) + if total_token_num > self.max_src_len: + if self.truncate_first_turn and idx == 0: + truncated_ids = s_token_ids_list[idx][: self.max_src_len - total_token_num] + if len(truncated_ids) > 1: + s_token_ids_list[idx] = truncated_ids[:-1] + [self.eos_id] + idx -= 1 + break + idx -= 1 + + for i, s_token_ids in enumerate(s_token_ids_list[idx + 1 :], idx + 1): + src_token_ids += s_token_ids + src_pos_ids += list(range(1, len(s_token_ids) + 1)) + + src_token_ids = [self.bos_id] + src_token_ids + src_type_ids = [0] * len(src_token_ids) + src_pos_ids = [0] + src_pos_ids + assert ( + len(src_token_ids) == len(src_type_ids) == len(src_pos_ids) + ), "not len(src_token_ids) == len(src_type_ids) == len(src_pos_ids)" + + token_ids = src_token_ids + type_ids = src_type_ids + pos_ids = src_pos_ids + tgt_start_idx = len(token_ids) + + if not is_infer: + # process tgt + # tokenize tgt + tgt = tokenization.convert_to_unicode(example.tgt).strip() + if self.data_format == "tokenized": + tgt_tokens = tgt.split(" ") + else: + tgt_tokens = self.tokenizer.tokenize(tgt) + + tgt_token_ids = self.tokenizer.convert_tokens_to_ids(tgt_tokens) + tgt_token_ids.append(self.eos_id) + + # trim tgt + if len(tgt_token_ids) > self.max_tgt_len - 1: + tgt_token_ids = tgt_token_ids[: self.max_tgt_len - 1] + + tgt_token_ids = [self.bos_id] + tgt_token_ids + tgt_type_ids = [1] * len(tgt_token_ids) + tgt_pos_ids = list(range(1, len(tgt_token_ids) + 1)) + assert ( + len(tgt_token_ids) == len(tgt_type_ids) == len(tgt_pos_ids) + ), "not len(tgt_token_ids) == len(tgt_type_ids) == len(tgt_pos_ids)" + + token_ids += tgt_token_ids + type_ids += tgt_type_ids + pos_ids += tgt_pos_ids + + assert len(token_ids) == len(type_ids) == len(pos_ids), "not len(token_ids) == len(type_ids) == len(pos_ids)" + + if self.continuous_position: + src_pos_ids = list(range(len(src_token_ids))) + if not is_infer: + tgt_pos_ids = list(range(len(tgt_token_ids))) + pos_ids = list(range(len(token_ids))) + + field_values = {"token_ids": src_token_ids, "type_ids": src_type_ids, "pos_ids": src_pos_ids} + field_values["tgt_start_idx"] = tgt_start_idx + field_values["data_id"] = example.data_id + + record = self.Record(**field_values) + return record + + def _read_tsv(self, fp, phase, is_infer, delimiter="\t", quotechar=None): + """Reads a tab separated value file.""" + csv.field_size_limit(2**20) + reader = csv.reader(fp, delimiter=delimiter, quotechar=quotechar) + headers = next(reader) + headers.append("data_id") + Example = namedtuple("Example", headers) + + for i, line in enumerate(reader): + example = Example(*line, data_id=i) + if is_infer or phase.endswith("test"): + self.features[phase][i] = example + record = self._convert_example_to_record(example, is_infer) + yield record + + def _read_numerical_file(self, fp, 
delimiter=";"): + for i, line in enumerate(fp): + cols = tokenization.convert_to_unicode(line).strip().split(delimiter) + cols = list(map(lambda x: list(map(int, x.split(" "))), cols)) + if len(cols) > self.num_numerical_fields: + cols = cols[: self.num_numerical_fields] + tgt_start_idx = cols[0].index(self.bos_id, 1) + record = self.Record(*cols, tgt_start_idx=tgt_start_idx, data_id=i) + yield record + + def _read_file(self, input_file, phase, is_infer): + def __wrapper__(): + with open_file(input_file) as fp: + if self.data_format == "numerical": + records = self._read_numerical_file(fp) + else: + records = self._read_tsv(fp, phase, is_infer) + for record in records: + yield record + + return __wrapper__ + + def _read_files(self, filelist, phase, is_infer, shuffle_files): + input_files = open(filelist).readlines() + + def __wrapper__(): + if shuffle_files: + self.global_rng.shuffle(input_files) + + if phase == "train": + self.total_file = len(input_files) + for file_index, input_file in enumerate(input_files, 1): + if phase == "train": + self.current_file_index = file_index + self.current_file = input_file + file_reader = self._read_file(input_file.strip(), phase, is_infer) + for record in file_reader(): + yield record + + return __wrapper__ + + def _batch_reader(self, reader, phase=None, is_infer=False, sort_pool_size=2**16): + """Construct a batch reader.""" + + def update_max_lens(max_lens, record): + """Update max_lens.""" + if max_lens is None: + return self.sort_key(record) + else: + return [max(max_len, l) for max_len, l in zip(max_lens, self.sort_key(record))] + + def get_batch(reader): + """Generate batches from reader.""" + batch, max_lens = [], None + for record in reader(): + if record is None: + yield batch + batch, max_lens = [], None + continue + + self.current_example += 1 + max_lens = update_max_lens(max_lens, record) + if self.in_tokens: + to_append = (len(batch) + 1) * sum(max_lens) <= self.batch_size + else: + to_append = len(batch) < self.batch_size + if to_append: + batch.append(record) + else: + yield batch + batch, max_lens = [record], self.sort_key(record) + + if len(batch) > 0: + yield batch + + def get_sorted_batch(pool): + """Generate sorted batches from pool.""" + pool = sorted(pool, key=self.sort_key) + batches = [] + batch, max_lens = [], None + for record in pool: + self.current_example += 1 + max_lens = update_max_lens(max_lens, record) + if self.in_tokens: + to_append = (len(batch) + 1) * sum(max_lens) <= self.batch_size + else: + to_append = len(batch) < self.batch_size + if to_append: + batch.append(record) + else: + batches.append(batch) + batch, max_lens = [record], self.sort_key(record) + + if len(batch) > 0: + batches.append(batch) + self.global_rng.shuffle(batches) + + for batch in batches: + yield batch + + def __wrapper__(): + if sort_pool_size > 0: + pool = [] + for record in reader(): + pool.append(record) + if len(pool) == sort_pool_size: + for batch in get_sorted_batch(pool): + yield batch + pool = [] + if len(pool) > 0: + for batch in get_sorted_batch(pool): + yield batch + else: + for batch in get_batch(reader): + yield batch + + return __wrapper__ + + def _distributed_batch_reader(self, batch_reader, num_part, part_id, is_test=False): + def __wrapper__(): + batches = [] + for batch in batch_reader(): + batches.append(batch) + if len(batches) == num_part: + yield batches[part_id] + batches = [] + if is_test and 0 <= part_id < len(batches): + yield batches[part_id] + return + + return __wrapper__ + + def data_generator( + self, 
input_file=None, reader=None, num_epochs=1, num_part=1, part_id=0, phase=None, is_infer=False + ): + """Data generator.""" + + def __wrapper__(): + if is_infer or phase.endswith("test"): + self.features[phase] = {} + + nonlocal reader + if reader is None: + if self.file_format == "filelist": + reader = self._read_files(input_file, phase, is_infer, not phase.endswith("test")) + else: + if phase == "train": + self.total_file = 1 + self.current_file_index = 1 + self.current_file = input_file + reader = self._read_file(input_file, phase, is_infer) + + batch_reader = self._batch_reader( + reader, phase, is_infer, sort_pool_size=self.sort_pool_size if not is_infer else 0 + ) + if phase == "train": + batch_reader = self._distributed_batch_reader(batch_reader, num_part, part_id) + elif phase.startswith("distributed"): + batch_reader = self._distributed_batch_reader(batch_reader, num_part, part_id, is_test=True) + + for epoch_index in range(num_epochs): + if phase == "train": + self.current_example = 0 + self.current_epoch = epoch_index + 1 + for batch in batch_reader(): + yield self._pad_batch_records(batch, is_infer) + + return __wrapper__ + + def _gen_self_attn_mask(self, batch_token_ids, batch_tgt_start_idx=None, is_unidirectional=True, shift_len=0): + max_len = max(map(len, batch_token_ids)) + input_mask_data = np.zeros((len(batch_token_ids), max_len + shift_len, max_len + shift_len)) + if is_unidirectional: + for index, mask_data in enumerate(input_mask_data): + start = 0 if batch_tgt_start_idx is None else batch_tgt_start_idx[index] + end = len(batch_token_ids[index]) + mask_data[: end + shift_len, : start + shift_len] = 1.0 + # Generate the lower triangular matrix using the slice of matrix + b = np.tril(np.ones([end - start, end - start]), 0) + mask_data[start + shift_len : end + shift_len, start + shift_len : end + shift_len] = b + else: + for index, token_ids in enumerate(batch_token_ids): + input_mask_data[index, : len(token_ids) + shift_len, : len(token_ids) + shift_len] = 1.0 + return input_mask_data.astype("float32") + + def _pad_batch_records(self, batch_records, is_infer): + """ + Padding batch records and construct model's inputs. 
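+        Pads token_ids/type_ids/pos_ids to the batch max length, builds the
+        generation_mask, and adds either decoding inputs (tgt_ids, tgt_pos,
+        init_score) for inference or tgt_label/tgt_pos for training.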
+ """ + batch = {} + batch_token_ids = [record.token_ids for record in batch_records] + batch_type_ids = [record.type_ids for record in batch_records] + batch_pos_ids = [record.pos_ids for record in batch_records] + batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) + batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) + batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) + + batch_tgt_start_idx = [record.tgt_start_idx for record in batch_records] + batch["generation_mask"] = self._gen_self_attn_mask(batch_token_ids, batch_tgt_start_idx=batch_tgt_start_idx) + + if is_infer: + tgt_ids = np.array([[[self.bos_id]]] * len(batch_token_ids), dtype="int64") + if self.continuous_position: + tgt_pos = np.array(batch_tgt_start_idx, dtype="int64") + else: + tgt_pos = np.zeros_like(batch_tgt_start_idx, dtype="int64") + tgt_pos = tgt_pos.reshape(-1, 1, 1) + batch["init_score"] = np.zeros_like(tgt_ids, dtype="float32").reshape(-1, 1).tolist() + batch["tgt_ids"] = tgt_ids.tolist() + batch["tgt_pos"] = tgt_pos.tolist() + + batch["tgt_generation_mask"] = batch["generation_mask"][:, 0:1, :].astype("float32") + else: + batch["tgt_label"], batch["tgt_pos"] = mask( + batch_tokens=batch_token_ids, + vocab_size=self.vocab_size, + sent_b_starts=batch_tgt_start_idx, + is_unidirectional=True, + ) + + batch_data_id = [record.data_id for record in batch_records] + batch["data_id"] = np.array(batch_data_id).astype("int64").reshape([-1, 1]) + return batch + + +@contextmanager +def open_file(filename): + """Open file.""" + if filename.endswith(".gz"): + fp = gzip.open(filename, "rt") + else: + fp = open(filename) + yield fp + fp.close() diff --git a/examples/dialogue/plato-2/readers/nsp_reader.py b/examples/dialogue/plato-2/readers/nsp_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..968da9ff1ba0e7c8ac87e993df5c964d1ea0bd72 --- /dev/null +++ b/examples/dialogue/plato-2/readers/nsp_reader.py @@ -0,0 +1,152 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
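The `_gen_self_attn_mask` helper defined earlier in `dialog_reader.py` gives every token full visibility of the dialogue context while restricting response tokens to a causal (lower-triangular) view. A minimal numpy sketch of that construction, ignoring `shift_len` and using made-up token ids (`toy_generation_mask` is purely illustrative, not part of the reader):

```python
import numpy as np

def toy_generation_mask(batch_token_ids, batch_tgt_start_idx):
    """Toy re-creation of the unidirectional mask; shift_len is ignored."""
    max_len = max(map(len, batch_token_ids))
    mask = np.zeros((len(batch_token_ids), max_len, max_len), dtype="float32")
    for i, token_ids in enumerate(batch_token_ids):
        start, end = batch_tgt_start_idx[i], len(token_ids)
        mask[i, :end, :start] = 1.0  # the context prefix is visible to every real position
        mask[i, start:end, start:end] = np.tril(np.ones((end - start, end - start)))
    return mask

# 4 context tokens followed by a 3-token response that starts at index 4.
print(toy_generation_mask([[1, 5, 6, 7, 1, 8, 9]], [4])[0])
```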
+"""NSP Reader.""" + +from collections import namedtuple + +import numpy as np +from readers.dialog_reader import DialogReader +from utils import pad_batch_data +from utils.masking import mask + +from paddlenlp.trainer.argparser import strtobool + + +class NSPReader(DialogReader): + """NSP Reader.""" + + @classmethod + def add_cmdline_args(cls, parser): + """Add cmdline argurments.""" + group = DialogReader.add_cmdline_args(parser) + group.add_argument( + "--attention_style", type=str, default="bidirectional", choices=["bidirectional", "unidirectional"] + ) + group.add_argument("--mix_negative_sample", type=strtobool, default=False) + return group + + def __init__(self, args): + super(NSPReader, self).__init__(args) + self.fields.append("label") + self.Record = namedtuple("Record", self.fields, defaults=(None,) * len(self.fields)) + + self.attention_style = args.attention_style + self.mix_negative_sample = args.mix_negative_sample + return + + def _convert_example_to_record(self, example, is_infer): + record = super(NSPReader, self)._convert_example_to_record(example, False) + if "label" in example._fields: + record = record._replace(label=int(example.label)) + return record + + def _mix_negative_sample(self, reader, neg_pool_size=2**16): + def gen_from_pool(pool): + num_samples = len(pool) + if num_samples == 1: + # only one sample: it is impossible to generate negative sample + yield pool[0]._replace(label=1) + return + self.global_rng.shuffle(pool) + for i in range(num_samples): + pool[i] = pool[i]._replace(label=1) + j = (i + 1) % num_samples + idx_i = pool[i].tgt_start_idx + idx_j = pool[j].tgt_start_idx + field_values = {} + field_values["token_ids"] = pool[i].token_ids[:idx_i] + pool[j].token_ids[idx_j:] + field_values["type_ids"] = pool[i].type_ids[:idx_i] + pool[j].type_ids[idx_j:] + field_values["pos_ids"] = list(range(len(field_values["token_ids"]))) + neg_record = self.Record(**field_values, tgt_start_idx=idx_i, data_id=-1, label=0) + pool.append(neg_record) + assert len(neg_record.token_ids) <= self.max_seq_len + self.global_rng.shuffle(pool) + for record in pool: + yield record + + def __wrapper__(): + pool = [] + for record in reader(): + pool.append(record) + if len(pool) == neg_pool_size: + for record in gen_from_pool(pool): + yield record + pool = [] + if len(pool) > 0: + for record in gen_from_pool(pool): + yield record + + return __wrapper__ + + def _batch_reader(self, reader, phase=None, is_infer=False, sort_pool_size=2**16): + if self.mix_negative_sample: + reader = self._mix_negative_sample(reader) + return super(NSPReader, self)._batch_reader( + reader, phase=phase, is_infer=is_infer, sort_pool_size=sort_pool_size + ) + + def _pad_batch_records(self, batch_records, is_infer): + """ + Padding batch records and construct model's inputs. 
+ """ + batch = {} + batch_token_ids = [record.token_ids for record in batch_records] + batch_type_ids = [record.type_ids for record in batch_records] + batch_pos_ids = [record.pos_ids for record in batch_records] + batch_tgt_start_idx = [record.tgt_start_idx for record in batch_records] + batch_label = [record.label for record in batch_records] + + if self.attention_style == "unidirectional": + batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) + batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) + batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) + tgt_label, tgt_pos, label_pos = mask( + batch_tokens=batch_token_ids, + vocab_size=self.vocab_size, + bos_id=self.bos_id, + sent_b_starts=batch_tgt_start_idx, + labels=batch_label, + is_unidirectional=True, + ) + attention_mask = self._gen_self_attn_mask(batch_token_ids, batch_tgt_start_idx) + else: + batch_mask_token_ids, tgt_label, tgt_pos, label_pos = mask( + batch_tokens=batch_token_ids, + vocab_size=self.vocab_size, + bos_id=self.bos_id, + eos_id=self.eos_id, + mask_id=self.mask_id, + sent_b_starts=batch_tgt_start_idx, + labels=batch_label, + is_unidirectional=False, + ) + if not is_infer: + batch_token_ids = batch_mask_token_ids + batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) + batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) + batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) + attention_mask = self._gen_self_attn_mask(batch_token_ids, is_unidirectional=False) + + batch["attention_mask"] = attention_mask + batch["label_pos"] = label_pos + + if not is_infer: + batch_label = np.array(batch_label).astype("int64").reshape([-1, 1]) + batch["label"] = batch_label + batch["tgt_label"] = tgt_label + batch["tgt_pos"] = tgt_pos + + batch_data_id = [record.data_id for record in batch_records] + batch["data_id"] = np.array(batch_data_id).astype("int64").reshape([-1, 1]) + return batch diff --git a/examples/dialogue/plato-2/readers/plato_reader.py b/examples/dialogue/plato-2/readers/plato_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..3d3cd790ee76900799aa2cf2683dc64a2d1524d9 --- /dev/null +++ b/examples/dialogue/plato-2/readers/plato_reader.py @@ -0,0 +1,85 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Plato Reader.""" + +import numpy as np + +from readers.dialog_reader import DialogReader +from utils import pad_batch_data +from utils.masking import mask + + +class PlatoReader(DialogReader): + """The implement of PlatoReader""" + + def __init__(self, args): + super(PlatoReader, self).__init__(args) + self.latent_type_size = args.latent_type_size + self.use_bow = args.use_bow + + def _pad_batch_records(self, batch_records, is_infer): + """ + Padding batch records and construct model's inputs. 
+ """ + batch = {} + batch_token_ids = [record.token_ids for record in batch_records] + batch_type_ids = [record.type_ids for record in batch_records] + batch_pos_ids = [record.pos_ids for record in batch_records] + + batch_tgt_start_idx = [record.tgt_start_idx for record in batch_records] + + batch_size = len(batch_token_ids) + + # padding + batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) + batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) + batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) + + batch["generation_mask"] = self._gen_self_attn_mask( + batch_token_ids, batch_tgt_start_idx=batch_tgt_start_idx, is_unidirectional=True, shift_len=1 + ) + if not is_infer: + batch["recognition_mask"] = self._gen_self_attn_mask(batch_token_ids, is_unidirectional=False, shift_len=1) + + if is_infer: + tgt_ids = np.array([[[self.bos_id]]] * batch_size, dtype="int64") + if self.continuous_position: + tgt_pos = np.array(batch_tgt_start_idx, dtype="int64") + else: + tgt_pos = np.zeros_like(batch_tgt_start_idx, dtype="int64") + tgt_pos = tgt_pos.reshape(-1, 1, 1) + batch["init_score"] = np.zeros_like(tgt_ids, dtype="float32").reshape(-1, 1).tolist() + batch["tgt_ids"] = tgt_ids.tolist() + batch["tgt_pos"] = tgt_pos.tolist() + batch["parent_idx"] = np.array(range(batch_size), dtype="int32") + + batch["tgt_generation_mask"] = batch["generation_mask"][:, 0:1, :].astype("float32") + else: + mask_return_list = mask( + batch_tokens=batch_token_ids, + vocab_size=self.vocab_size, + sent_b_starts=batch_tgt_start_idx, + is_unidirectional=True, + use_latent=True, + use_bow=self.use_bow, + ) + batch["tgt_label"] = mask_return_list[0] + batch["tgt_pos"] = mask_return_list[1] + if self.use_bow: + batch["bow_label"] = mask_return_list[2] + batch["bow_pos"] = mask_return_list[3] + + batch_data_id = [record.data_id for record in batch_records] + batch["data_id"] = np.array(batch_data_id).astype("int64").reshape([-1, 1]) + return batch diff --git a/examples/dialogue/plato-2/utils/__init__.py b/examples/dialogue/plato-2/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..1a9ff1098dd3bbde8f4bf169216c404bb0b689de --- /dev/null +++ b/examples/dialogue/plato-2/utils/__init__.py @@ -0,0 +1,51 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
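`NSPReader._mix_negative_sample` above builds next-sentence-prediction negatives by splicing the context of one record onto the response of the next record in a shuffled pool. A self-contained sketch of the same idea, using a simplified, hypothetical `Sample` tuple in place of the reader's `Record`:

```python
import random
from collections import namedtuple

# label: 1 = genuine context/response pair, 0 = spliced negative.
Sample = namedtuple("Sample", ["token_ids", "tgt_start_idx", "label"])

def mix_negatives(pool, seed=0):
    rng = random.Random(seed)
    rng.shuffle(pool)
    positives = [s._replace(label=1) for s in pool]
    negatives = []
    for i, sample in enumerate(positives):
        other = positives[(i + 1) % len(positives)]  # borrow the next sample's response
        context = sample.token_ids[: sample.tgt_start_idx]
        response = other.token_ids[other.tgt_start_idx :]
        negatives.append(Sample(context + response, sample.tgt_start_idx, 0))
    mixed = positives + negatives
    rng.shuffle(mixed)
    return mixed

pool = [Sample([1, 5, 6, 1, 7, 8], 3, None), Sample([1, 9, 1, 10], 2, None)]
for record in mix_negatives(pool):
    print(record)
```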
+"""Utils.""" + +from itertools import chain + +import numpy as np +import paddle + + +def repeat_array(array, times): + """Repeate numpy array.""" + if isinstance(array, list): + return list(chain(*([array] * times))) + else: + return np.concatenate([array] * times, axis=0) + + +def gen_inputs(inputs, latent_type_size): + batch_size = len(inputs["data_id"]) + inputs = {name: repeat_array(array, latent_type_size) for name, array in inputs.items()} + # Add latent_id + inputs["latent_id"] = np.array( + [i for i in range(latent_type_size) for _ in range(batch_size)], dtype="int64" + ).reshape([-1, 1]) + + # print('\nplato_inputs:') + for key in inputs: + inputs[key] = paddle.to_tensor(inputs[key]) + if key in ["token_ids", "type_ids", "pos_ids", "tgt_ids", "tgt_pos", "data_id"]: + inputs[key] = paddle.squeeze(inputs[key], axis=-1) + # print(key, inputs[key].shape, inputs[key].dtype) + return inputs + + +def pad_batch_data(insts, pad_id=0): + """Pad the instances to the max sequence length in batch.""" + max_len = max(map(len, insts)) + inst_data = np.array([list(inst) + [pad_id] * (max_len - len(inst)) for inst in insts]) + return inst_data.astype("int64").reshape([-1, max_len, 1]) diff --git a/examples/dialogue/plato-2/utils/args.py b/examples/dialogue/plato-2/utils/args.py new file mode 100644 index 0000000000000000000000000000000000000000..b112acf6ba7354f47cab8e35e502fc3ccdda1b03 --- /dev/null +++ b/examples/dialogue/plato-2/utils/args.py @@ -0,0 +1,88 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Parse argument.""" + +import argparse +import json + + +class Args(dict): + """Arguments class + + Store arguments in training / infer / ... scripts. 
+ """ + + def __getattr__(self, name): + if name in self.keys(): + return self[name] + for v in self.values(): + if isinstance(v, Args): + if name in v: + return v[name] + return None + + def get(self, key, default_value=None): + """Get the value of corresponding key.""" + if key in self.keys(): + return self[key] + for v in self.values(): + if isinstance(v, Args): + if key in v: + return v[key] + return default_value + + def __setattr__(self, name, value): + self[name] = value + + def save(self, filename): + with open(filename, "w") as fp: + json.dump(self, fp, ensure_ascii=False, indent=4, sort_keys=False) + + def load(self, filename, group_name=None): + if group_name is not None: + if group_name not in self: + self[group_name] = Args() + self[group_name].load(filename) + return + with open(filename, "r") as fp: + params_dict = json.load(fp) + for k, v in params_dict.items(): + if isinstance(v, dict): + self[k].update(Args(v)) + else: + self[k] = v + + +def parse_args(parser: argparse.ArgumentParser, allow_unknown=False) -> Args: + """Parse hyper-parameters from cmdline.""" + if allow_unknown: + parsed, _ = parser.parse_known_args() + else: + parsed = parser.parse_args() + args = Args() + optional_args = parser._action_groups[1] + for action in optional_args._group_actions[1:]: + arg_name = action.dest + args[arg_name] = getattr(parsed, arg_name) + for group in parser._action_groups[2:]: + group_args = Args() + for action in group._group_actions: + arg_name = action.dest + group_args[arg_name] = getattr(parsed, arg_name) + if len(group_args) > 0: + if group.title in args: + args[group.title].update(group_args) + else: + args[group.title] = group_args + return args diff --git a/examples/dialogue/plato-2/utils/masking.py b/examples/dialogue/plato-2/utils/masking.py new file mode 100644 index 0000000000000000000000000000000000000000..fb6be808448a84bbc0039eafbfaf65ecdbcbecd7 --- /dev/null +++ b/examples/dialogue/plato-2/utils/masking.py @@ -0,0 +1,119 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
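`parse_args` above keeps plain optional arguments at the top level of the returned `Args` mapping and folds each named `argparse` group into a nested `Args`, whose entries remain reachable as attributes from the top level. A rough usage sketch with made-up option names, assuming the module is importable as `utils.args` (the import path used elsewhere in this example):

```python
import argparse

from utils.args import parse_args

parser = argparse.ArgumentParser()
parser.add_argument("--run_name", type=str, default="demo")  # stays at the top level
group = parser.add_argument_group("Model")                    # becomes args["Model"]
group.add_argument("--hidden_size", type=int, default=768)

args = parse_args(parser, allow_unknown=True)
print(args.run_name)           # "demo"
print(args.Model.hidden_size)  # 768, read from the nested group
print(args.hidden_size)        # 768 as well: lookup falls through to nested groups
```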
+"""Reader utils.""" + +import numpy as np + + +def mask( + batch_tokens, + vocab_size, + bos_id=1, + eos_id=2, + mask_id=3, + sent_b_starts=None, + labels=None, + is_unidirectional=False, + use_latent=False, + use_bow=False, +): + """ + Add mask for batch_tokens, return out, mask_label, mask_pos; + Note: mask_pos responding the batch_tokens after padded; + """ + batch_tokens = np.copy(batch_tokens) + max_len = max(map(len, batch_tokens)) + mask_label = [] + mask_pos = [] + if labels is not None: + label_pos = [] + + if is_unidirectional: + # unidirectional language model + if use_latent: + max_len += 1 + shift_len = 1 + else: + shift_len = 0 + for sent_index, sent in enumerate(batch_tokens): + sent_b_index = sent_b_starts[sent_index] if sent_b_starts is not None else 0 + if labels is not None: + label_pos.append(sent_index * max_len + len(sent) - 1 + shift_len) + mask_label.extend(sent[sent_b_index + 1 :]) + mask_pos.extend([sent_index * max_len + i + shift_len for i in range(sent_b_index, len(sent) - 1)]) + mask_label = np.array(mask_label).astype("int64").reshape([-1, 1]) + mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1]) + return_list = [mask_label, mask_pos] + + # latent related (bow label and pos) + if use_latent and use_bow: + bow_label = [] + bow_pos = [] + for sent_index, sent in enumerate(batch_tokens): + sent_b_index = sent_b_starts[sent_index] if sent_b_starts is not None else 0 + + def __filter__(tok_id): + # TODO: exclude [EOS] from bow loss + return True + + bow_pos.extend([sent_index for i in range(sent_b_index + 1, len(sent)) if __filter__(sent[i])]) + bow_label.extend([sent[i] for i in range(sent_b_index + 1, len(sent)) if __filter__(sent[i])]) + bow_label = np.array(bow_label).astype("int64").reshape([-1, 1]) + bow_pos = np.array(bow_pos).astype("int64").reshape([-1, 1]) + return_list += [bow_label, bow_pos] + else: + # bidirectional mask language model + total_token_num = sum(map(len, batch_tokens)) + prob_mask = np.random.rand(total_token_num) + # TODO: fix replace_ids, include [UNK] + replace_ids = np.random.randint(3, high=vocab_size, size=total_token_num) + prob_index = 0 + for sent_index, sent in enumerate(batch_tokens): + # add pair label position + if labels is not None: + label_pos.append(sent_index * max_len) + + # add mask label and position + for token_index, token in enumerate(sent): + if token == eos_id or token == bos_id: + continue + prob = prob_mask[prob_index + token_index] + if prob > 0.15: + continue + elif 0.03 < prob <= 0.15: + # mask + mask_label.append(sent[token_index]) + sent[token_index] = mask_id + mask_pos.append(sent_index * max_len + token_index) + elif 0.015 < prob <= 0.03: + # random replace + mask_label.append(sent[token_index]) + sent[token_index] = replace_ids[prob_index + token_index] + mask_pos.append(sent_index * max_len + token_index) + else: + # keep the original token + mask_label.append(sent[token_index]) + mask_pos.append(sent_index * max_len + token_index) + + prob_index += len(sent) + + mask_label = np.array(mask_label).astype("int64").reshape([-1, 1]) + mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1]) + return_list = [batch_tokens, mask_label, mask_pos] + + if labels is not None: + label_pos = np.array(label_pos).astype("int64").reshape([-1, 1]) + assert len(labels) == len(label_pos) + return_list.append(label_pos) + return return_list diff --git a/examples/dialogue/plato-2/utils/tokenization.py b/examples/dialogue/plato-2/utils/tokenization.py new file mode 100644 index 
0000000000000000000000000000000000000000..7d4741ba984ef17c75a6d2967f573f1f343b895b --- /dev/null +++ b/examples/dialogue/plato-2/utils/tokenization.py @@ -0,0 +1,189 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes.""" + +import collections +import sentencepiece as spm +import unicodedata + +from utils.args import str2bool + + +def clean_text(text): + """Performs invalid character removal and whitespace cleanup on text.""" + text = text.replace("“", '"').replace("”", '"').replace("‘", "'").replace("’", "'").replace("—", "-") + + output = [] + for char in text: + if _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +def preprocess_text(inputs, remove_space=True, lower=False): + """preprocess data by removing extra space and normalize data.""" + outputs = inputs + if remove_space: + outputs = " ".join(inputs.strip().split()) + + outputs = unicodedata.normalize("NFKD", outputs) + outputs = "".join([c for c in outputs if not unicodedata.combining(c)]) + if lower: + outputs = outputs.lower() + + return outputs + + +def encode_pieces(spm_model, text, return_unicode=True, sample=False): + """turn sentences into word pieces.""" + # liujiaxiang: add for ernie-albert, mainly consider for “/”/‘/’/— causing too many unk + text = clean_text(text) + + if not sample: + pieces = spm_model.EncodeAsPieces(text) + else: + pieces = spm_model.SampleEncodeAsPieces(text, 64, 0.1) + + return pieces + + +def encode_ids(spm_model, text, sample=False): + """turn sentences into word pieces.""" + pieces = encode_pieces(spm_model, text, return_unicode=False, sample=sample) + ids = [spm_model.PieceToId(piece) for piece in pieces] + return ids + + +def convert_to_unicode(text): + """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + fin = open(vocab_file, "r", encoding="UTF-8") + for num, line in enumerate(fin): + items = convert_to_unicode(line.rstrip()).split("\t") + if len(items) > 2: + break + token = items[0] + index = items[1] if len(items) == 2 else num + token = token.strip() + vocab[token] = int(index) + return vocab + + +def convert_by_vocab(vocab, items): + """Converts a sequence of [tokens|ids] using the vocab.""" + output = [] + for item in items: + output.append(vocab[item]) + return output + + +class SentencePieceTokenizer(object): + """Runs end-to-end tokenziation.""" + + @classmethod + def add_cmdline_args(cls, parser): + """Add cmdline argurments.""" + group = parser.add_argument_group("Tokenizer") + group.add_argument("--vocab_path", type=str, required=True) + group.add_argument("--do_lower_case", 
type=str2bool, default=False) + group.add_argument("--spm_model_file", type=str, required=True) + return group + + def __init__(self, args): + self.spm_model = spm.SentencePieceProcessor() + self.spm_model.Load(args.spm_model_file) + self.vocab = load_vocab(args.vocab_path) + self.do_lower_case = args.do_lower_case + self.inv_vocab = {v: k for k, v in self.vocab.items()} + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = preprocess_text(text, lower=self.do_lower_case) + return encode_pieces(self.spm_model, text, return_unicode=True) + + def convert_tokens_to_ids(self, tokens): + """Convert tokens to ids.""" + ret = [] + unk_id = self.vocab[""] + for token in tokens: + if token in self.vocab: + ret.append(self.vocab[token]) + else: + ret.append(unk_id) + return ret + + def convert_ids_to_tokens(self, ids): + """Convert ids to tokens.""" + return convert_by_vocab(self.inv_vocab, ids) + + def merge_subword(self, tokens): + """Merge subword.""" + ret = [] + for token in tokens: + if token.startswith("▁"): + ret.append(token[1:]) + else: + if len(ret): + ret[-1] += token + else: + ret.append(token) + + ret = [token for token in ret if token] + return ret + + def convert_ids_to_str(self, ids): + """Convert ids to string.""" + tokens = self.convert_ids_to_tokens(ids) + tokens = self.merge_subword(tokens) + res = " ".join(tokens).replace("", "") + res = res.replace("", "\n").replace("\n ", "\n").strip() + return res + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically contorl characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False diff --git a/examples/dialogue/plato-xl b/examples/dialogue/plato-xl new file mode 100644 index 0000000000000000000000000000000000000000..3225c24e6351bef4a4c8b0031723042adab8fa06 --- /dev/null +++ b/examples/dialogue/plato-xl @@ -0,0 +1 @@ +../../model_zoo/plato-xl \ No newline at end of file diff --git a/examples/dialogue/unified_transformer/README.md b/examples/dialogue/unified_transformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..45cf6f32b2399ed35e48f383ed32e8b5af5c2e44 --- /dev/null +++ b/examples/dialogue/unified_transformer/README.md @@ -0,0 +1,217 @@ +# UnifiedTransformer + +## 模型简介 + +近年来,人机对话系统受到了学术界和产业界的广泛关注并取得了不错的发展。开放域对话系统旨在建立一个开放域的多轮对话系统,使得机器可以流畅自然地与人进行语言交互,既可以进行日常问候类的闲聊,又可以完成特定功能,以使得开放域对话系统具有实际应用价值。具体的说,开放域对话可以继续拆分为支持不同功能的对话形式,例如对话式推荐,知识对话技术等,如何解决并有效融合以上多个技能面临诸多挑战。 + +[UnifiedTransformer](https://arxiv.org/abs/2006.16779)以[Transformer](https://arxiv.org/abs/1706.03762) 编码器为网络基本组件,采用灵活的注意力机制,十分适合对话生成任务。 + +本项目是UnifiedTransformer在 Paddle 2.0上的开源实现,介绍了如何使用UnifiedTransformer在DuConv任务型对话数据集上进行微调,并给出了一个搭建简单中文聊天机器人的例子。 + +## 快速开始 + +### 环境依赖 + +- sentencepiece +- termcolor + +安装方式:`pip install sentencepiece termcolor` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +. 
+├── finetune.py # 模型finetune主程序入口 +├── infer.py # 模型预测主程序入口 +├── utils.py # 定义参数及一些工具函数 +└── README.md # 文档说明 +``` + +### 数据准备 + +**DuConv**是百度发布的基于知识图谱的主动聊天任务数据集,让机器根据构建的知识图谱进行主动聊天,使机器具备模拟人类用语言进行信息传递的能力。数据集的创新性是:强调了bot的主动性,并且在闲聊对话中引入了明确的对话目标,即将对话引导到特定实体上。数据中的知识信息来源于电影和娱乐人物领域有聊天价值的知识信息,如票房、导演、评价等,以三元组SPO的形式组织,对话目标中的话题为电影或娱乐人物实体。数据集中共有3万session,约12万轮对话,划分为训练集、开发集、测试集1和测试集2,其中测试集1中包含对话的response,而测试集2中只有对话历史。 + +为了方便用户快速测试,PaddleNLP Dataset API内置了DuConv数据集,一键即可完成数据集加载,示例代码如下: + +```python +from paddlenlp.datasets import load_dataset +train_ds, dev_ds, test1_ds, test2_ds = load_dataset('duconv', splits=('train', 'dev', 'test_1', 'test_2')) +``` + +### 预训练模型 + +以下是PaddleNLP支持的对话类预训练模型: + +|模型名称| 模型参数 | 模型特点 | +|:-----:|:------:|:-------:| +|unified_transformer-12L-cn| 12-layers, 12-heads, 768-hidden| 在千万级别的中文会话数据上进行预训练。| +|unified_transformer-12L-cn-luge| 12-layers, 12-heads, 768-hidden|由unified_transformer-12L-cn预训练模型在千言对话数据集上进行微调。并且模型输入中加入了标识不同对话技能的special token,使得模型能同时支持闲聊对话、推荐对话和知识对话。| +|plato-mini| 6-layers, 12-heads, 768-hidden|在十亿级别的中文对话数据上进行预训练。参数量更小,但效果更好。只支持闲聊型对话。| + +### 模型训练 + +运行如下命令即可在训练集上进行finetune,并在验证集上进行验证 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu '1,2'` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus '0' --log_dir ./log finetune.py \ + --model_name_or_path=unified_transformer-12L-cn-luge \ + --save_dir=./checkpoints \ + --logging_steps=100 \ + --save_steps=1000 \ + --seed=2021 \ + --epochs=3 \ + --batch_size=16 \ + --lr=5e-5 \ + --weight_decay=0.01 \ + --warmup_steps=2500 \ + --max_grad_norm=0.1 \ + --max_seq_len=512 \ + --max_response_len=128 \ + --max_knowledge_len=256 \ + --device=gpu +``` + +其中参数释义如下: +- `gpus` 指示了训练所用的GPU +- `log_dir` 指示了日志保存目录 +- `model_name_or_path` 指示了finetune使用的预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unified_transformer-12L-cn | + | unified_transformer-12L-cn-luge | + +- `save_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `lr` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_steps` 表示学习率逐渐升高到基础学习率(即上面配置的lr)所需要的迭代数,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `max_grad_norm` 表示梯度裁剪允许的最大梯度值。 +- `max_seq_len` 表示输入序列的最大长度。 +- `max_response_len` 表示输入response的最大长度。 +- `max_knowledge_len` 表示输入knowledge序列的最大长度。 +- `device` 表示使用的设备。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中,其中loss最小的模型会被保存在`save_dir/model_best`中。如: + +```text +./checkpoints/ +├── model_1000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── spm.model +│ ├── tokenizer_config.json +│ └── vocab.txt +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,只需指定`model_name_or_path`为本地微调模型的路径即可。 + +### 模型预测 + +运行如下命令即可在测试集上进行测试。 + +```shell +# GPU启动,预测仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python infer.py \ + --model_name_or_path=./checkpoints/model_best \ + --output_path=./predict.txt \ + --logging_steps=10 \ + --seed=2021 \ + --max_seq_len=512 \ + --max_knowledge_len=256 \ + --batch_size=4 \ + --min_dec_len=1 \ + --max_dec_len=64 \ + --num_return_sequences=20 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu +``` + +其中参数释义如下: +- `model_name_or_path` 指示了预测使用的模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unified_transformer-12L-cn | + | unified_transformer-12L-cn-luge | + +- `output_path` 表示预测结果的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `seed` 表示随机数生成器的种子。 +- `max_seq_len` 表示输入序列的最大长度。 +- `max_knowledge_len` 表示输入knowledge序列的最大长度。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `min_dec_len` 表示预测生成的句子的最小长度。 +- `max_dec_len` 表示预测生成的句子的最大长度。 +- `num_return_sequences` 表示每条样本生成的句子的数量。对于每条样本,模型会生成`num_return_sequences`个句子,根据每个句子的概率得分进行排序,得分最高的句子作为最终的生成结果。 +- `decode_strategy` 表示预测解码时采取的策略,可选"sampling"、"greedy_search"和"beam_search"之一。 +- `top_k` 表示采用"sampling"解码策略时,token的概率按从大到小排序,生成的token只从前`top_k`个中进行采样。 +- `device` 表示使用的设备。 + +同时,我们提供了基于 FastGeneration 的高性能预测的选项,可以选择性开启是否需要采用高性能预测,PaddleNLP 提供了 JIT 的方式,可以自动完成对所需的自定义 op 的动态库编译: +- `faster` 表示是否开启高性能预测。设置 `--faster` 即表示开启。 +- `use_fp16_decoding` 表示在开启高性能预测的时候,是否使用 fp16 来完成预测过程。设置 `--use_fp16_decoding` 即表示使用 fp16 进行预测,否则使用 fp32。 + +程序运行结束后会将预测生成的response保存在`output_path`中。同时终端中会输出评估结果。 + +采用预训练模型及微调模型在测试集上有如下结果: + +| model_name_or_path | BLEU1 / BLEU2 | DISTINCT1 / DISTINCT2 | +| :-----------------------------: | :-------------: | :-------------------: | +| unified_transformer-12L-cn-luge | 0.2606 / 0.1576 | 0.1168 / 0.2977 | +| ./checkpoints/model_best | 0.2808 / 0.1744 | 0.1124 / 0.2899 | + +**NOTE:** `./checkpoints/model_best`是按本项目中的超参在单卡上finetune得到的结果。 + +### 人机交互 + +运行如下命令即可开始与聊天机器人用中文进行简单的对话。 + +```shell +# GPU启动,仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python interaction.py \ + --model_name_or_path=plato-mini \ + --min_dec_len=0 \ + --max_dec_len=64 \ + --num_return_sequences=20 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu +``` + +其中参数释义如下: +- `model_name_or_path` 指示了预测使用的模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unified_transformer-12L-cn | + | unified_transformer-12L-cn-luge | + | plato-mini | + +- `min_dec_len` 表示预测生成的句子的最小长度。 +- `max_dec_len` 表示预测生成的句子的最大长度。 +- `num_return_sequences` 表示每条样本生成的句子的数量。对于每条样本,模型会生成`num_return_sequences`个句子,根据每个句子的概率得分进行排序,得分最高的句子作为最终的生成结果。 +- `decode_strategy` 表示预测解码时采取的策略,可选"sampling"、"greedy_search"和"beam_search"之一。 +- `top_k` 表示采用"sampling"解码策略时,token的概率按从大到小排序,生成的token只从前`top_k`个中进行采样。 +- `device` 表示使用的设备。 + +**NOTE:** 输入"[EXIT]"退出交互程序,输入"[NEXT]"开启下一轮新的对话。需要注意使用退格会导致错误。 + +## Reference + +- [UnifiedTransformer](https://arxiv.org/abs/2006.16779) +- [Knover/luge-dialogue](https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue) +- [DuConv](https://www.aclweb.org/anthology/P19-1369/) diff --git a/examples/dialogue/unified_transformer/finetune.py b/examples/dialogue/unified_transformer/finetune.py new file mode 100644 index 
0000000000000000000000000000000000000000..daeb700d3605878ffbc3293e232f4ba0778757da --- /dev/null +++ b/examples/dialogue/unified_transformer/finetune.py @@ -0,0 +1,165 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import math +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import load_dataset +from paddle.optimizer import AdamW +from paddle.optimizer.lr import NoamDecay +from utils import create_data_loader, print_args, set_seed + +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_name_or_path', type=str, default='unified_transformer-12L-cn-luge', help='The path or shortcut name of the pre-trained model.') + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--save_steps', type=int, default=1000, help='Save checkpoint every X updates steps.') + parser.add_argument('--seed', type=int, default=2021, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--lr', type=float, default=5e-5, help='The initial learning rate.') + parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') + parser.add_argument('--epochs', type=int, default=3, help='Total number of training epochs to perform.') + parser.add_argument('--warmup_steps', type=int, default=2500, help='The number of warmup steps.') + parser.add_argument('--max_grad_norm', type=float, default=0.1, help='The max value of grad norm.') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_response_len', type=int, default=128, help='The maximum response sequence length of training.') + parser.add_argument('--max_knowledge_len', type=int, default=256, help='The maximum knowledge sequence length of training.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + + args = parser.parse_args() + return args +# yapf: enable + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def train(args): + paddle.set_device(args.device) + world_size = 
dist.get_world_size() + if world_size > 1: + dist.init_parallel_env() + + set_seed(args.seed) + + model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + train_ds, dev_ds = load_dataset("duconv", split=("train", "dev")) + train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "dev") + + lr_scheduler = NoamDecay(1 / (args.warmup_steps * (args.lr**2)), args.warmup_steps) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + step = 0 + total_time = 0.0 + best_ppl = 1e9 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for inputs in train_data_loader: + step += 1 + labels = inputs[-1] + + logits = model(*inputs[:-1]) + loss = F.cross_entropy(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + if step % args.save_steps == 0: + ppl = evaluation(model, dev_data_loader) + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + if ppl < best_ppl: + best_ppl = ppl + save_ckpt(model, tokenizer, args.save_dir, "best") + print("Saved step {} as best model.\n".format(step)) + batch_start_time = time.time() + print("\nTraining completed.") + + +@paddle.no_grad() +def evaluation(model, data_loader): + print("\nEval begin...") + model.eval() + total_tokens = 0 + total_loss = 0.0 + start_time = time.time() + step = 0 + for inputs in data_loader: + step += 1 + labels = inputs[-1] + + logits = model(*inputs[:-1]) + loss = F.cross_entropy(logits, labels, reduction="sum") + + total_loss += loss.item() + total_tokens += labels.shape[0] + + avg_loss = total_loss / total_tokens + ppl = math.exp(avg_loss) + avg_speed = (time.time() - start_time) / step + print("loss: %.4f - ppl: %.4f - %.3fs/step" % (avg_loss, ppl, avg_speed)) + model.train() + return ppl + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + train(args) diff --git a/examples/dialogue/unified_transformer/infer.py b/examples/dialogue/unified_transformer/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..3781f73e16aa998d84fd18a732c74e64b1c8283e --- /dev/null +++ b/examples/dialogue/unified_transformer/infer.py @@ -0,0 +1,146 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time + +import paddle +from datasets import load_dataset +from utils import create_data_loader, print_args, select_response, set_seed + +from paddlenlp.metrics import BLEU, Distinct +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_name_or_path', type=str, default='unified_transformer-12L-cn-luge', help='The path or shortcut name of the pre-trained model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--seed', type=int, default=2021, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_response_len', type=int, default=128, help='The maximum response sequence length of training.') + parser.add_argument('--max_knowledge_len', type=int, default=256, help='The maximum knowledge sequence length of training.') + parser.add_argument('--min_dec_len', type=int, default=1, help='The minimum sequence length of generation.') + parser.add_argument('--max_dec_len', type=int, default=64, help='The maximum sequence length of generation.') + parser.add_argument('--num_return_sequences', type=int, default=20, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='sampling', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=0, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.0, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--early_stopping', type=eval, default=False, help='Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--faster', action='store_true', help='Whether to process inference using faster transformer. ') + parser.add_argument('--use_fp16_decoding', action='store_true', help='Whether to use fp16 when using faster transformer. Only works when using faster transformer. 
') + + args = parser.parse_args() + return args +# yapf: enable + + +def calc_bleu_and_distinct(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. But received {} and {}.".format(len(preds), len(targets)) + ) + bleu1 = BLEU(n_size=1) + bleu2 = BLEU(n_size=2) + distinct1 = Distinct(n_size=1) + distinct2 = Distinct(n_size=2) + for pred, target in zip(preds, targets): + pred_tokens = pred.split() + target_token = target.split() + + bleu1.add_inst(pred_tokens, [target_token]) + bleu2.add_inst(pred_tokens, [target_token]) + + distinct1.add_inst(pred_tokens) + distinct2.add_inst(pred_tokens) + + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("BLEU-1:", bleu1.score()) + print("BLEU-2:", bleu2.score()) + print("DISTINCT-1:", distinct1.score()) + print("DISTINCT-2:", distinct2.score()) + + +@paddle.no_grad() +def infer(args): + paddle.set_device(args.device) + set_seed(args.seed) + + model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) + + test_ds = load_dataset("duconv", split="test_1") + test_ds, test_data_loader = create_data_loader(test_ds, tokenizer, args, "test") + + model.eval() + total_time = 0.0 + start_time = time.time() + pred_responses = [] + for step, inputs in enumerate(test_data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask, seq_len = inputs + output = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + seq_len=seq_len, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + num_return_sequences=args.num_return_sequences, + use_fp16_decoding=args.use_fp16_decoding, + use_fast=args.faster, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + ids, scores = output + results = select_response(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_responses.extend(results) + + start_time = time.time() + + with open(args.output_path, "w", encoding="utf-8") as fout: + for response in pred_responses: + fout.write(response + "\n") + print("\nSave inference result into: %s" % args.output_path) + + target_responses = [example["response"] for example in test_ds] + calc_bleu_and_distinct(pred_responses, target_responses) + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + infer(args) diff --git a/examples/dialogue/unified_transformer/interaction.py b/examples/dialogue/unified_transformer/interaction.py new file mode 100644 index 0000000000000000000000000000000000000000..cde62e5057d6ecc5cf143b5c80519a46952fa5f5 --- /dev/null +++ b/examples/dialogue/unified_transformer/interaction.py @@ -0,0 +1,107 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +from termcolor import colored, cprint +from utils import print_args, select_response, set_seed + +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--model_name_or_path', type=str, default='plato-mini', help='The path or shortcut name of the pre-trained model.') + parser.add_argument('--seed', type=int, default=None, help='Random seed for initialization.') + parser.add_argument('--min_dec_len', type=int, default=1, help='The minimum sequence length of generation.') + parser.add_argument('--max_dec_len', type=int, default=64, help='The maximum sequence length of generation.') + parser.add_argument('--num_return_sequences', type=int, default=20, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='sampling', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=5, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=0, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.0, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--early_stopping', type=eval, default=False, help='Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + + args = parser.parse_args() + return args +# yapf: enable + + +def interaction(args, model, tokenizer): + history = [] + start_info = "Enter [EXIT] to quit the interaction, [NEXT] to start a new conversation." 
+ cprint(start_info, "yellow", attrs=["bold"]) + while True: + user_utt = input(colored("[Human]: ", "red", attrs=["bold"])).strip() + if user_utt == "[EXIT]": + break + elif user_utt == "[NEXT]": + history = [] + cprint(start_info, "yellow", attrs=["bold"]) + else: + history.append(user_utt) + inputs = tokenizer.dialogue_encode( + history, add_start_token_as_response=True, return_tensors=True, is_split_into_words=False + ) + inputs["input_ids"] = inputs["input_ids"].astype("int64") + ids, scores = model.generate( + input_ids=inputs["input_ids"], + token_type_ids=inputs["token_type_ids"], + position_ids=inputs["position_ids"], + attention_mask=inputs["attention_mask"], + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + num_return_sequences=args.num_return_sequences, + use_fast=True, + ) + bot_response = select_response( + ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences, keep_space=False + )[0] + print(colored("[Bot]:", "blue", attrs=["bold"]), colored(bot_response, attrs=["bold"])) + history.append(bot_response) + return + + +def main(args): + paddle.set_device(args.device) + if args.seed is not None: + set_seed(args.seed) + + # Initialize the model and tokenizer + model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) + + model.eval() + interaction(args, model, tokenizer) + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + main(args) diff --git a/examples/dialogue/unified_transformer/utils.py b/examples/dialogue/unified_transformer/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..90585d69e0ee684639151a47127b979ba44df165 --- /dev/null +++ b/examples/dialogue/unified_transformer/utils.py @@ -0,0 +1,265 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +from functools import partial + +import numpy as np + +import paddle +import paddle.distributed as dist +from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler +from paddlenlp.data import Pad + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. 
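+    # For example, with seed=2021 and four trainers, ranks 0-3 seed paddle with 2021-2024.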
+ paddle.seed(seed + dist.get_rank()) + + +def preprocess_examples(examples, mode="train"): + """ + For training set and dev set, treat each utterance of the first speaker as + the response, and concatenate the goal, knowledge and the dialog’s previous + utterances as the history. In this way, multiple history-response pairs + are constructed. + """ + if mode == "test": + return examples + new_examples = {} + goal = [] + knowledge = [] + history = [] + response = [] + + conv = examples["conversation"] + for index, conversation in enumerate(conv): + for i in range(0, len(conversation), 2): + goal.append(examples["goal"][index]) + knowledge.append(examples["knowledge"][index]) + history.append(conversation[:i]) + response.append(conversation[i]) + new_examples["goal"] = goal + new_examples["knowledge"] = knowledge + new_examples["history"] = history + new_examples["response"] = response + + return new_examples + + +def convert_example(example, tokenizer, max_seq_len=512, max_response_len=128, max_knowledge_len=256, mode="train"): + """Convert all examples into necessary features.""" + goal = example["goal"] + knowledge = example["knowledge"] + goal_knowledge = " ".join([" ".join(lst) for lst in goal + knowledge]) + + if mode != "test": + tokenized_example = tokenizer.dialogue_encode( + example["history"], + response=example["response"], + knowledge=goal_knowledge, + task_type="knowledge", + max_seq_len=max_seq_len, + max_response_len=max_response_len, + max_knowledge_len=max_knowledge_len, + return_length=True, + ) + response_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) + response_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(response_start, response_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][response_start + 1 : response_end] + return tokenized_example + else: + tokenized_example = tokenizer.dialogue_encode( + example["history"], + knowledge=goal_knowledge, + task_type="knowledge", + max_seq_len=max_seq_len, + max_knowledge_len=max_knowledge_len, + add_start_token_as_response=True, + return_length=True, + ) + + if "response" in example: + tokenized_example["response"] = example["response"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). 
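+        # For example, with two padded sequences and max_len = 4 the mask built
+        # above has shape (2, 4, 4); after the expand_dims below it becomes
+        # (2, 1, 4, 4) and broadcasts over the n_head axis of the attention scores.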
+ attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + if mode != "test": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + else: + seq_len = np.asarray([example["seq_len"] for example in batch_examples]).astype("int32") + return input_ids, token_type_ids, position_ids, attention_mask, seq_len + + +def create_data_loader(dataset, tokenizer, args, mode): + trans_func1 = partial(preprocess_examples, mode=mode) + trans_func2 = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + max_response_len=args.max_response_len, + max_knowledge_len=args.max_knowledge_len, + mode=mode, + ) + remove_columns = None + if mode in ["train", "dev"]: + remove_columns = ["id", "conversation"] + + dataset = dataset.map(trans_func1, batched=True, batch_size=None, remove_columns=remove_columns).map(trans_func2) + if mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + else: + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_response(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.sep_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return token_ids, tokens + + +def get_in_turn_repetition(pred, is_cn=False): + """Get in-turn repetition.""" + if len(pred) == 0: + return 1.0 + if isinstance(pred[0], str): + pred = [tok.lower() for tok in pred] + if is_cn: + pred = "".join(pred) + tri_grams = set() + for i in range(len(pred) - 2): + tri_gram = tuple(pred[i : i + 3]) + if tri_gram in tri_grams: + return True + tri_grams.add(tri_gram) + return False + + +def select_response(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1, keep_space=True): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_response(pred, tokenizer) + num_token = len(pred_token_ids) + if keep_space: + response = " ".join(pred_tokens) + else: + response = "".join(pred_tokens) + + in_turn_repetition = get_in_turn_repetition(pred_tokens, True) or get_in_turn_repetition(pred_token_ids) + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + elif in_turn_repetition: + score -= 1e3 + + tmp.append([response, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_response(pred, tokenizer) + num_token = len(pred_token_ids) + if keep_space: + response = " ".join(pred_tokens) + else: + response = "".join(pred_tokens) + + in_turn_repetition = get_in_turn_repetition(pred_tokens, True) or get_in_turn_repetition(pred_token_ids) + + last_pos = 0 + if (max_dec_len is not None and num_token >= max_dec_len) or in_turn_repetition: + tmp.append([response]) + else: + tmp.insert(last_pos, [response]) + last_pos += 1 + + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + return results diff --git a/examples/few_shot/README.md b/examples/few_shot/README.md new file mode 100644 index 0000000000000000000000000000000000000000..317ce55d06837bfc1f5f49aec4b8f0fa79982025 --- /dev/null +++ b/examples/few_shot/README.md @@ -0,0 +1,34 @@ +# Few-Shot Learning (FSL) + +Few-Shot Learning 旨在研究如何从少量有监督的训练样本中学习出具有良好泛化性的模型,对训练数据很少或监督数据获取成本极高的应用场景有很大价值。 + +随着大规模预训练模型的不断涌现,FSL 结合预训练模型的先验知识和强大的泛化能力在下游任务效果上取得了显著提升,为大规模预训练模型结合 FSL 的工业落地应用带来了无限可能性。 + +我们旨在为 FSL 领域的研究者提供简单易用、全面、前沿的 FSL 策略库,便于研究者基于 FSL 策略库将注意力集中在算法创新上。我们会持续开源 FSL 领域的前沿学术工作,并在中文小样本学习测评基准 [FewCLUE](https://github.com/CLUEbenchmark/FewCLUE) 上进行评测。 + +## Benchmark +我们在 FewCLUE 9 个任务的 test_public.json 测试集上进行了效果评测 + +| 算法 | 预训练模型 | eprstmt | csldcp | iflytek | tnews | ocnli | bustm | chid | csl | cluewsc | +| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |------------ | ------------ | ---------- | +| PET | ERNIE-1.0-Large-CW | 88.03 | 63.79 | 56.43 | 56.57 | 56.27 | 72.69 | 91.39 | 76.00 | 78.79 | +| 
P-Tuning | ERNIE-1.0-Large-CW | 89.84 | 64.57 | 45.80 | 57.41 | 44.13 | 68.51 | 90.00 | 74.67 | 73.26 | +| EFL | ERNIE-1.0-Large-CW | 90.82 | 54.48 | 46.71 | 54.43 | 43.17 | 72.63 | 85.71 | 61.52 | 80.02 | + +**注释**: +- 表格中 CHID 数据集的指标与 FewCLUE 榜单指标计算方式不同。 +- 由于 iflytek 和 csldcp 标签数较多,每条样本采样 5 个非正确标签作为负样本训练评测。 +- 为统一配置,除 EFL-iflytek 外均训练 1000 steps,EFL-iflytek 训练 5000 steps。 + +## Models +- [P-tuning](./p-tuning) +- [EFL](./efl) +- [PET](./pet) + +## References + +- [1] X. Liu et al., “GPT Understands, Too,” arXiv:2103.10385 [cs], Mar. 2021, Accessed: Mar. 22, 2021. [Online]. Available: http://arxiv.org/abs/2103.10385. + +- [2] Wang, Sinong, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. “Entailment as Few-Shot Learner.” ArXiv:2104.14690 [Cs], April 29, 2021. http://arxiv.org/abs/2104.14690. + +- [3] Schick, Timo, and Hinrich Schütze. “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference.” ArXiv:2001.07676 [Cs], January 25, 2021. http://arxiv.org/abs/2001.07676. diff --git a/examples/few_shot/RGL/README.md b/examples/few_shot/RGL/README.md new file mode 100644 index 0000000000000000000000000000000000000000..14c183d8a6ed227bbc1f236c101d4b46cbf3938d --- /dev/null +++ b/examples/few_shot/RGL/README.md @@ -0,0 +1,129 @@ +# RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning + +This is the implementation of the paper [RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning](https://aclanthology.org/2022.findings-naacl.81/). + +**************************** Updates ***************************** + +2022-07-11: Our training code has been released. + +2022-04-08: Our paper has been accepted to Findings of [NAACL 2022](https://aclanthology.org/2022.findings-naacl.81/)! + +# Overview + +
+*Figure: overview of the RGL framework.*
+ +We propose a simple yet effective Relation Graph augmented Learning RGL method that can obtain better performance in few-shot natural language understanding tasks. + +RGL constructs a relation graph based on the label consistency between samples in the same batch, and learns to solve the resultant node classification and link prediction problems of the relation graphs. In this way, RGL fully exploits the limited supervised information, which can boost the tuning effectiveness. + +# Prepare the data + +We evaluate on the GLUE variant for few-shot learning in the paper, including SST-2, SST-5, MR, CR, MPQA, Subj, TREC, CoLA, MNLI, MNLI-mm, SNLI, QNLI, RTE, MRPC, QQP and STS-B. Please download the [datasets](https://paddlenlp.bj.bcebos.com/datasets/k-shot-glue/rgl-k-shot.zip) and extract the data files to the path ``./data/k-shot``. + + +# Experiments + +The structure of the code: + +``` +├── scripts/ +│ ├── run_pet.sh # Script for PET +│ └── run_rgl.sh # Script for RGL +├── template.py # The parser for prompt template +├── verbalizer.py # The mapping from labels to corresponding words +├── tokenizer.py # The tokenizer wrapeer to conduct text truncation +├── utils.py # The tools +└── rgl.py # The training process of RGL +``` + +## How to define a template + +We inspire from [OpenPrompt](https://github.com/thunlp/OpenPrompt/tree/main) and define template as a list of dictionary. The key of raw texts in datasets is `text`, and the corresponding value is the keyword of text in loaded dataset, where we use `text_a` to denote the first sentence in every example and `text_b` to denote the other sentences by default. + +For example, given the template ``{'text':'text_a'} It was {'mask'}.`` and a sample text ``nothing happens , and it happens to flat characters .`` the input text will be ``nothing happens , and it happens to flat characters . It was .`` + + +## Quick start + +Run the following code for prompt-tuning. + +``` +export CUDA_VISIBLE_DEVICES=0 +python rgl.py \ +--output_dir ./checkpoints/ \ +--dataset SST-2 \ +--data_path ./data/k-hot/SST-2/16-13/ \ +--max_seq_length 128 \ +--max_steps 1000 \ +--logging_step 10 \ +--eval_step 100 \ +--batch_size 4 \ +--alpha 0.1 \ +--seed 13 \ +--learning_rate 1e-5 \ +--template "{'text':'text_a'} It was {'mask'}." \ +--verbalizer "{'0':'terrible','1':'great'}" +``` + +The configurations consist of: +- ``output_dir``: The directory to save model checkpoints. +- ``dataset``: The dataset name for few-shot learning. +- ``data_path``: The path to data files of ``dataset``. +- ``max_seq_length``: The maximum length of input text, including the prompt. +- ``max_steps``: The maximum steps for training. +- ``logging_step``: Print logs every ``logging_step``. +- ``eval_step``: Evaluate model every ``eval_step``. +- ``batch_size``: The number of samples per batch. +- ``alpha``: The weight of the loss proposed in RGL. +- ``seed``: Random seed. +- ``learning_rate``: The learning rate for tuning. +- ``template``: The template to define how to combine text data and prompt. +- ``verbalizer``: The verbalizer to map labels to words in vocabulary. + + +## Multiple runs for the best results + +To reproduce our experiments, you can use the scripts to get the results under different settings. We have defined the templates and the verbalizers in both ``./script/run_pet.sh`` and ``./script/run_rgl.sh``. You can refer to these scripts for more details. 
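+
+Before launching the scripts, it can help to sanity-check how a template string is parsed and wrapped around an example. The snippet below is a minimal sketch based on `template.py` and `data.py` in this directory (assuming `paddlenlp` is installed and the code is run from this directory; the sample sentence and the `demo-0` id are made-up values, and `roberta-large` is the default backbone used by `rgl.py`):
+
+```
+from data import InputExample
+from template import ManualTemplate
+
+from paddlenlp.transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("roberta-large")
+
+# Parse the template string into a list of parts (text / hard words / mask).
+template = ManualTemplate(tokenizer, template="{'text':'text_a'} It was {'mask'}.")
+print(template.template)
+
+# Wrap a single example; the first returned element holds the text pieces to
+# tokenize, tagged with their mask_ids / shortenable_ids flags.
+example = InputExample(uid="demo-0", text_a="nothing happens , and it happens to flat characters .")
+to_tokenize, not_to_tokenize = template.wrap_one_example(example)
+print(to_tokenize)
+```
+
+The wrapped parts are what `MLMTokenizerWrapper` in `tokenizer.py` then tokenizes, truncates and pads into model inputs.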
+ +### Run PET + +``` +bash ./scripts/run_pet.sh SST-2 0 +``` + +where ``SST-2`` specifies the dataset used for prompt-tuning and you can replace it with any other downloaded datasets in ``./data/k-shot/ ``. Besides, ``0`` refers to the gpu device id. + +**NOTE**: The dataset name is case-sensitive to run the scripts. + +### Run RGL + +``` +bash ./scripts/run_rgl.sh SST-2 0 +``` + +Please see the descriptions above for the arguments. + + +# Citation + +Please cite our paper if you use RGL in your work: +``` +@inproceedings{wang-etal-2022-rgl, + title = "{RGL}: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning", + author = "Wang, Yaqing and + Tian, Xin and + Xiong, Haoyi and + Li, Yueyang and + Chen, Zeyu and + Guo, Sheng and + Dou, Dejing", + booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022", + year = "2022", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/2022.findings-naacl.81", + pages = "1078--1084", +} + +``` diff --git a/examples/few_shot/RGL/data.py b/examples/few_shot/RGL/data.py new file mode 100644 index 0000000000000000000000000000000000000000..32efac286aad456acbb8b057dd9b12f76bf52ece --- /dev/null +++ b/examples/few_shot/RGL/data.py @@ -0,0 +1,496 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import csv +import json +import os +from abc import abstractmethod +from collections import defaultdict +from dataclasses import dataclass, field + +import paddle +import pandas as pd +from paddle.metric import Accuracy + +from paddlenlp.datasets import MapDataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman + + +@dataclass +class InputExample(object): + """Data structure of every example in datasets.""" + + uid: str = field(default=None, metadata={"help": "A unique identifier of the example."}) + text_a: str = field(default=None, metadata={"help": "The first text sequence in each example."}) + text_b: str = field(default=None, metadata={"help": "The other text sequences in each example."}) + cls_label: int = field(default=None, metadata={"help": "The label of classification tasks."}) + seq_label: list = field(default=None, metadata={"help": "The label of generation tasks."}) + meta: dict = field(default=None, metadata={"help": "An optional dictionary of other data for each example."}) + + def __repr__(self): + content = {k: v for k, v in self.__dict__.items() if v is not None} + content = json.dumps(content, indent=2, sort_keys=True) + "\n" + return str(content) + + def keys(self, keep_none=False): + return [key for key in self.__dict__.keys() if getattr(self, key) is not None] + + +class InputFeatures(dict): + """ + Data structure of every wrapped example or a batch of examples as the input of model. + + Args: + input_ids (paddle.Tensor): + The token ids. + attention_mask (paddle.Tensor): + The mask ids. + token_type_ids (paddle.Tensor, optional): + The token type ids. 
+ input_embeds (paddle.Tensor, optional): + The embeddings of soft tokens. + mask_ids (paddle.Tensor, optional): + The mask ids where 1 denotes that a token is a mask, 0 denotes it is not a mask. + cls_label (list, optional): + The label of classification task. + seq_label (list, optional): + The label of generation task. + uid (list, optional): + The unique id(s) for example(s). + """ + + input_keys = [ + "input_ids", + "attention_mask", + "token_type_ids", + "input_embeds", + "cls_label", + "seq_label", + "label", + "uid", + "mask_ids", + "soft_token_ids", + ] + + def __init__( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + input_embeds=None, + mask_ids=None, + label=None, + cls_label=None, + seq_label=None, + uid=None, + soft_token_ids=None, + ): + self.input_ids = input_ids + self.attention_mask = attention_mask + self.token_type_ids = token_type_ids + self.input_embeds = input_embeds + self.label = label + self.cls_label = cls_label + self.seq_label = seq_label + self.mask_ids = mask_ids + self.uid = uid + self.soft_token_ids = soft_token_ids + + @classmethod + def add_keys(cls, *args): + cls.input_keys.extend(args) + + def keys(self, keep_none=False): + if keep_none: + return self.input_keys + else: + return [key for key in self.input_keys if getattr(self, key) is not None] + + def values(self, keep_none=False): + return [getattr(self, key) for key in self.keys(keep_none=keep_none)] + + def items(self): + return [(key, getattr(self, key)) for key in self.keys()] + + def __len__(self): + return len(self.keys()) + + def __repr__(self): + return str(json.dumps(self.items()) + "\n") + + def __getitem__(self, key): + return getattr(self, key) + + def __iter__(self): + return iter(self.keys()) + + def __contains__(self, key, keep_none): + return key in self.keys(keep_none) + + def __setitem__(self, key, value): + if key not in self.input_keys: + raise KeyError("{} not in predefined keys, use add_keys to add it.".format(key)) + setattr(self, key, value) + + @staticmethod + def collate_fn(batch): + """Collate batch data in form of InputFeatures.""" + new_batch = {} + for key in batch[0]: + values = [b[key] for b in batch] + try: + new_batch[key] = paddle.to_tensor(values) + except ValueError: + new_batch[key] = values + + return InputFeatures(**new_batch) + + +class DataProcessor(object): + """Base class for reading datasets from files.""" + + def __init__(self, labels=None): + self._labels = labels + if labels is not None: + self._labels = sorted(labels) + + @property + def labels(self): + if not getattr(self, "_labels"): + raise ValueError("labels and label_mappings are not setted yet.") + return self._labels + + @labels.setter + def labels(self, labels): + if labels is not None: + self._labels = sorted(labels) + + @property + def label_mapping(self): + if not getattr(self, "_labels"): + raise ValueError("labels and label_mappings are not setted yet.") + if not getattr(self, "_label_mapping"): + self._label_mapping = {k: i for i, k in enumerate(self._labels)} + return self._label_mapping + + @label_mapping.setter + def label_mapping(self, label_mapping): + if getattr(self, "_labels"): + assert self._labels == sorted(list(label_mapping.keys())) + self._label_mapping = label_mapping + + @abstractmethod + def get_examples(self, data_dir, split): + raise NotImplementedError + + def get_train_examples(self, data_dir): + return self.get_examples(data_dir, "train") + + def get_dev_examples(self, data_dir): + return self.get_examples(data_dir, "dev") + + def 
get_test_exaples(self, data_dir): + return self.get_examples(data_dir, "test") + + @classmethod + def read_tsv(cls, input_file, quotechar=None): + with open(input_file, "r", encoding="utf-8-sig") as f: + data = csv.reader(f, delimiter="\t", quotechar=quotechar) + return [x for x in data] + + @classmethod + def read_csv(cls, input_file, header=None): + data = pd.read_csv(input_file, header=header) + return data.values.tolist() + + @classmethod + def read_json(cls, input_file): + with open(input_file, "r") as f: + data = [json.loads(x) for x in f.readlines()] + return data + + +class BoolQProcessor(DataProcessor): + def __init__(self): + super().__init__(["False", "True"]) + self.split_map = {"train": "train", "dev": "dev32", "test": "val"} + + def get_examples(self, data_dir, split): + split = self.split_map[split] + raw_data = self.read_json(os.path.join(data_dir, split + ".jsonl")) + examples = [] + for i, line in enumerate(raw_data): + examples.append( + InputExample( + uid="%s-%d" % (split, i), + text_a=line["passage"], + text_b=line["question"], + cls_label=str(line["label"]), + ) + ) + + return examples + + +class MrpcProcesser(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[3], text_b=line[4], cls_label=line[0])) + + return examples + + +class MnliProcessor(DataProcessor): + def __init__(self): + super().__init__(["contradiction", "entailment", "neutral"]) + + def _process_file(self, split): + if split in ["dev", "test"]: + return split + "_matched" + return split + + def get_examples(self, data_dir, split): + split = self._process_file(split) + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[8], text_b=line[9], cls_label=line[-1]) + ) + return examples + + +class MnliMismatchedProcessor(MnliProcessor): + def _process_file(self, split): + if split == "dev": + return split + "_matched" + if split == "test": + return split + "_mismatched" + return split + + +class SnliProcessor(DataProcessor): + def __init__(self): + super().__init__(["contradiction", "entailment", "neutral"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[7], text_b=line[8], cls_label=line[-1]) + ) + return examples + + +class ColaProcessor(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[3], text_b=None, cls_label=line[1])) + return examples + + +class Sst2Processor(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[0], text_b=None, 
cls_label=line[1])) + return examples + + +class StsbProcessor(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[7], text_b=line[8], cls_label=line[-1]) + ) + return examples + + +class QqpProcessor(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + try: + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[3], text_b=line[4], cls_label=line[5]) + ) + except IndexError: + continue + return examples + + +class QnliProcessor(DataProcessor): + def __init__(self): + super().__init__(["entailment", "not_entailment"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[1], text_b=line[2], cls_label=line[-1]) + ) + return examples + + +class RteProcessor(DataProcessor): + def __init__(self): + super().__init__(["entailment", "not_entailment"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[1], text_b=line[2], cls_label=line[-1]) + ) + return examples + + +class WnliProcessor(DataProcessor): + def __init__(self): + super().__init__(["0", "1"]) + + def get_examples(self, data_dir, split): + raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) + examples = [] + for i, line in enumerate(raw_data): + if i == 0: + continue + examples.append( + InputExample(uid="%s-%s" % (split, line[0]), text_a=line[1], text_b=line[2], cls_label=line[-1]) + ) + return examples + + +class TextClassificationProcessor(DataProcessor): + def __init__(self, task_name): + NUM_LABELS = {"mr": 2, "sst-5": 5, "subj": 2, "trec": 6, "cr": 2, "mpqa": 2} + assert task_name in NUM_LABELS, "task_name not supported." + self.task_name = task_name + self._labels = list(range(NUM_LABELS[self.task_name])) + + def get_examples(self, data_dir, split): + raw_data = self.read_csv(os.path.join(data_dir, split + ".csv")) + examples = [] + for i, line in enumerate(raw_data): + examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[1], cls_label=line[0])) + return examples + + +# The processor mapping for datasets in RGL paper. 
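+# Keys are lowercase dataset names; load_dataset() below uses this mapping (args.dataset is lowercased in utils.check_args).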
+PROCESSOR_MAPPING = { + "mrpc": MrpcProcesser(), + "mnli": MnliProcessor(), + "mnli-mm": MnliMismatchedProcessor(), + "snli": SnliProcessor(), + "cola": ColaProcessor(), + "sst-2": Sst2Processor(), + "sts-b": StsbProcessor(), + "qqp": QqpProcessor(), + "qnli": QnliProcessor(), + "rte": RteProcessor(), + "wnli": WnliProcessor(), + "cr": TextClassificationProcessor("cr"), + "mr": TextClassificationProcessor("mr"), + "sst-5": TextClassificationProcessor("sst-5"), + "subj": TextClassificationProcessor("subj"), + "mpqa": TextClassificationProcessor("mpqa"), + "trec": TextClassificationProcessor("trec"), + "boolq": BoolQProcessor(), +} + +# The task mapping for datasets. +TASK_MAPPING = defaultdict(lambda: "classification") +TASK_MAPPING["sts-b"] = "regression" + +# The metric mapping for datasets. +METRIC_MAPPING = defaultdict(Accuracy) +METRIC_MAPPING.update( + { + "mrpc": AccuracyAndF1(name=["acc", "precision", "recall", "f1", "acc_and_f1"]), + "qqp": AccuracyAndF1(name=["acc", "precision", "recall", "f1", "acc_and_f1"]), + "cola": Mcc(), + "sts-b": PearsonAndSpearman(name=["pearson", "spearman", "corr"]), + } +) + + +def load_dataset(dataset, data_path=None, splits=[]): + """ + Read datasets from files. + + Args: + dataset (str): + The dataset name in lowercase. + data_path (str): + The path to the dataset directory, including train, dev or test file. + splits (list): + Which file(s) of dataset to read, such as ['train', 'dev', 'test']. + + """ + assert len(splits) > 0, "No splits, can not load dataset {}".format(dataset) + processor = PROCESSOR_MAPPING[dataset] + data = [] + if "train" in splits: + train_examples = processor.get_train_examples(data_path) + data.append(MapDataset(train_examples)) + if "dev" in splits: + dev_examples = processor.get_dev_examples(data_path) + data.append(MapDataset(dev_examples)) + if "test" in splits: + test_examples = processor.get_test_exaples(data_path) + data.append(MapDataset(test_examples)) + data.append(processor.labels) + return data diff --git a/examples/few_shot/RGL/rgl.py b/examples/few_shot/RGL/rgl.py new file mode 100644 index 0000000000000000000000000000000000000000..dd137c71b7003101a704f6000312a1083ebc0fe4 --- /dev/null +++ b/examples/few_shot/RGL/rgl.py @@ -0,0 +1,239 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
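+# RGL training entry point: prompt-based tuning with an optional relation-graph (contrastive) loss weighted by --alpha.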
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import METRIC_MAPPING, TASK_MAPPING, InputFeatures, load_dataset +from template import ManualTemplate +from tokenizer import MLMTokenizerWrapper +from utils import ( + LinearSchedulerWarmup, + check_args, + convert_example, + create_dataloader, + set_seed, +) +from verbalizer import ManualVerbalizer +from visualdl import LogWriter + +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser('Implementation of RGL paper.') +parser.add_argument('--seed', type=int, default=1000, help='Random seed.') +parser.add_argument('--device', type=str, default='gpu', choices=['gpu', 'cpu'], help='Device for training, default to gpu.') +parser.add_argument('--dataset', type=str, default='SST-2', help='The build-in few-shot dataset.') +parser.add_argument('--data_path', type=str, default=None, help='The path to local dataset in .tsv files.') + +parser.add_argument('--model_name_or_path', type=str, default='roberta-large', help='The build-in pretrained LM or the path to local model parameters.') +parser.add_argument('--template', type=str, default="{'text':'text_a'} It was {'mask'}.", help='The input template.') +parser.add_argument('--verbalizer', type=str, default="{'0':'terrible', '1':'great'}", help='The label mapping of output.') +parser.add_argument('--alpha', type=float, default=0, help='The weight of link prediction loss in RGL.') +parser.add_argument('--max_seq_length', type=int, default=512, help='The maximum length of input text.') +parser.add_argument('--max_grad_norm', type=float, default=1.0, help='The maximum norm of all parameters.') + +parser.add_argument('--num_epoch', type=int, default=0, help='The number of epoch for training.') +parser.add_argument('--max_steps', type=int, default=1000, help='Maximum steps, which overwrites num_epoch.') +parser.add_argument('--batch_size', type=int, default=32, help='The number of samples used per step.') +parser.add_argument('--learning_rate', type=float, default=1e-5, help='The learning rate of optimizer.') +parser.add_argument('--weight_decay', type=float, default=0.0, help='Weight decay if we apply some.') +parser.add_argument('--warmup_steps', type=int, default=0, help='The warmup steps for leanring rate scheduler.') +parser.add_argument('--logging_step', type=int, default=100, help='Print logs every logging_step steps.') +parser.add_argument('--eval_step', type=int, default=100, help='Evaluate model every eval_step steps.') +parser.add_argument('--save_best', action='store_true', help='Save the best model according to evaluation results. 
Save the last checkpoint if False.') +parser.add_argument('--output_dir', type=str, default='./checkpoints/', help='The path to save checkpoints.') +parser.add_argument('--overwrite_output', action='store_true', help='Whether overwrite the output_dir.') +args = parser.parse_args() +# yapf: enable + +check_args(args) +for arg in vars(args): + logger.info(format(arg, "<20") + format(str(getattr(args, arg)), "<")) + + +@paddle.no_grad() +def evaluate(model, dataloader, metric, verbalizer, task_type, bound=(0, 5)): + if task_type == "regression": + logsoftmax = nn.LogSoftmax(axis=-1) + lb, ub = bound + model.eval() + metric.reset() + for batch in dataloader: + logits = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]) + label_logits = verbalizer.process_logits(logits, batch["mask_ids"]) + if task_type == "regression": + label_logits = logsoftmax(label_logits) + label_logits = paddle.exp(label_logits[..., 1].unsqueeze(-1)) * (ub - lb) + lb + correct = metric.compute(label_logits, batch["label"]) + metric.update(correct) + score = metric.accumulate() + score = score if isinstance(score, (list, tuple)) else [score] + logger.info("{:>20}".format("Evaluation results:")) + for name, value in zip(metric.name(), score): + logger.info("{:>20} = {:.6f}".format(name, value)) + model.train() + return score[0] + + +def contrastive_loss(sentence_embeddings, labels, task_type="classification"): + """Compute the loss proposed in RGL method.""" + + def _raw_equal(x, y): + return int(x == y) + + def _max_equal(x, y): + return int(np.argmax(x, axis=0) == np.argmax(y, axis=0)) + + equal_int = _raw_equal if task_type == "classification" else _max_equal + bce_metric = nn.CrossEntropyLoss() + cos_metric = nn.CosineSimilarity(axis=0, eps=1e-6) + batch_size = sentence_embeddings.shape[0] + loss = 0 + for i in range(batch_size): + for j in range(batch_size): + score = cos_metric(sentence_embeddings[i], sentence_embeddings[j]) + score = score.unsqueeze(0) + logits = paddle.concat([(1 - score) * 50, (1 + score) * 50], axis=-1) + label = paddle.to_tensor(equal_int(labels[i], labels[j])) + loss += bce_metric(logits.reshape([-1, logits.shape[-1]]), label.unsqueeze(0)) + loss = loss / (batch_size * (batch_size - 1)) + loss = loss / 100 + return loss + + +def main(): + paddle.set_device(args.device) + set_seed(args.seed) + + task_type = TASK_MAPPING[args.dataset] + model = AutoModelForMaskedLM.from_pretrained(args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + tokenizer_wrapper = MLMTokenizerWrapper(args.max_seq_length, tokenizer) + + train_ds, dev_ds, test_ds, label_list = load_dataset( + args.dataset, data_path=args.data_path, splits=["train", "dev", "test"] + ) + + template = ManualTemplate(tokenizer, args.template) + logger.info("Set template: {}".format(template.template)) + verbalizer = ManualVerbalizer(tokenizer, labels=label_list, label_to_words=eval(args.verbalizer), prefix=" ") + logger.info("Set verbalizer: {}".format(args.verbalizer)) + + trans_fn = partial(convert_example, template=template, verbalizer=verbalizer, tokenizer_wrapper=tokenizer_wrapper) + + train_loader = create_dataloader(train_ds, "train", args.batch_size, InputFeatures.collate_fn, trans_fn) + dev_loader = create_dataloader(dev_ds, "dev", args.batch_size, InputFeatures.collate_fn, trans_fn) + test_loader = create_dataloader(test_ds, "test", args.batch_size, InputFeatures.collate_fn, trans_fn) + if args.max_steps > 0: + num_epoch = args.max_steps // len(train_loader) + 
int(args.max_steps % len(train_loader) > 0) + max_steps = args.max_steps + else: + num_epoch = args.num_epoch + max_steps = args.num_epoch * len(train_loader) + + lr_scheduler = LinearSchedulerWarmup(args.learning_rate, args.warmup_steps, max_steps) + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + apply_decay_param_fun=lambda x: x in decay_params, + ) + + metric_fn = METRIC_MAPPING[args.dataset] + if task_type == "regression": + loss_fn = nn.KLDivLoss() + lb, ub = 0, 5 + logsoftmax = nn.LogSoftmax(axis=-1) + else: + loss_fn = nn.CrossEntropyLoss() + with LogWriter(logdir="./log/pet/train") as writer: + best_metric = -float("inf") + global_step = 1 + global_loss = 0 + for epoch in range(1, num_epoch + 1): + for step, batch in enumerate(train_loader, start=1): + writer.add_scalar("train/lr", lr_scheduler.get_lr(), global_step) + + logits = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]) + label_logits = verbalizer.process_logits(logits, batch["mask_ids"]) + if task_type == "regression": + label_logits = logsoftmax(label_logits) + + labels = paddle.stack( + [ + 1 - (batch["label"].reshape([-1]) - lb) / (ub - lb), + (batch["label"].reshape([-1]) - lb) / (ub - lb), + ], + axis=-1, + ) + loss = loss_fn(label_logits.reshape([-1, 2]), labels) + else: + labels = paddle.to_tensor(batch["label"], dtype="int64") + loss = loss_fn(label_logits.reshape([-1, label_logits.shape[-1]]), labels.reshape([-1])) + if args.alpha > 0: + con_loss = contrastive_loss(logits, labels, task_type=task_type) + loss += args.alpha * con_loss + global_loss += loss.item() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + writer.add_scalar("train/loss", loss.item(), global_step) + + if global_step % args.logging_step == 0: + avg_loss = global_loss / args.logging_step + logger.info( + "Epoch: {:3d}/{:3d}, Global Step: {:4d}, Loss: {:e}".format( + epoch, num_epoch, global_step, avg_loss + ) + ) + global_loss = 0 + + if global_step % args.eval_step == 0: + logger.info("{0:-^30}".format(" Validate ")) + value = evaluate(model, dev_loader, metric_fn, verbalizer, task_type) + if args.save_best and value > best_metric: + best_metric = value + save_path = os.path.join(args.output_dir, "model_best") + if not os.path.exists(save_path): + os.makedirs(save_path) + model.save_pretrained(save_path) + tokenizer.save_pretrained(save_path) + + global_step += 1 + if global_step > max_steps: + break + + logger.info("{0:-^30}".format(" Test ")) + evaluate(model, test_loader, metric_fn, verbalizer, task_type) + if not args.save_best: + save_path = os.path.join(args.output_dir, "model_last") + if not os.path.exists(save_path): + os.makedirs(save_path) + model.save_pretrained(save_path) + tokenizer.save_pretrained(save_path) + + +if __name__ == "__main__": + main() diff --git a/examples/few_shot/RGL/scripts/run_pet.sh b/examples/few_shot/RGL/scripts/run_pet.sh new file mode 100644 index 0000000000000000000000000000000000000000..ca50d8c6bd00d8300cbb02430844664939f56c18 --- /dev/null +++ b/examples/few_shot/RGL/scripts/run_pet.sh @@ -0,0 +1,114 @@ +dataset=$1 +device=$2 + +MAX_LEN=128 +dataname=$dataset + +case $dataset in + CoLA) + temp="{'text':'text_a'} This is {'mask'}." 
+ verb="{'0':'incorrect','1':'correct'}" + ;; + MRPC) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + QQP) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + STS-B) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + MNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + ;; + MNLI-mm) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + dataname='MNLI' + ;; + SNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + ;; + QNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'not_entailment':'No','entailment':'Yes'}" + ;; + RTE) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'not_entailment':'No','entailment':'Yes'}" + MAX_LEN=256 + ;; + mr) + temp="{'text':'text_a'} It was {'mask'}" + verb="{0:'terrible',1:'great'}" + MAX_LEN=160 + ;; + sst-5) + temp="{'text':'text_a'} It was {'mask'}." + temp="{'text':'text_a'} {'mask'}" + verb="{0:'terrible',1:'bad',2:'okay',3:'good',4:'great'}" + ;; + SST-2) + temp="{'text':'text_a'} It was {'mask'}." + verb="{'0':'terrible','1':'great'}" + ;; + subj) + temp="{'text':'text_a'} This is {'mask'}." + verb="{0:'subjective',1:'objective'}" + MAX_LEN=256 + ;; + trec) + temp="{'mask'}:{'text':'text_a'}" + verb="{0:'Description',1:'Entity',2:'Expression',3:'Human',4:'Location',5:'Number'}" + ;; + cr) + temp="{'text':'text_a'} It was {'mask'}." + verb="{0:'terrible',1:'great'}" + MAX_LEN=160 + ;; + mpqa) + temp="{'text':'text_a'} It was {'mask'}" + verb="{0:'terrible',1:'great'}" + MAX_LEN=128 + ;; + +esac + +echo $temp +echo $verb + + +ALPHA=0 +for seed in 13 21 42 87 100 +do + for lr in 1e-5 2e-5 5e-5 + do + for bs in 2 4 8 + do + CUDA_VISIBLE_DEVICES=$device python rgl.py \ + --output_dir ./ckpt_pet_roberta_$seed/ \ + --dataset $dataset \ + --data_path ./data/k-shot/$dataname/16-$seed/ \ + --max_seq_length $MAX_LEN \ + --max_steps 1000 \ + --logging_step 10 \ + --eval_step 100 \ + --batch_size $bs \ + --alpha $ALPHA \ + --seed $seed \ + --learning_rate $lr \ + --template "$temp" \ + --verbalizer "$verb" \ + --overwrite_output + done + done +done + diff --git a/examples/few_shot/RGL/scripts/run_rgl.sh b/examples/few_shot/RGL/scripts/run_rgl.sh new file mode 100644 index 0000000000000000000000000000000000000000..9b1a5d2dc216b7b8bcc37a607c6606ee955ef30d --- /dev/null +++ b/examples/few_shot/RGL/scripts/run_rgl.sh @@ -0,0 +1,115 @@ +dataset=$1 +device=$2 + +MAX_LEN=128 +dataname=$dataset + +case $dataset in + CoLA) + temp="{'text':'text_a'} This is {'mask'}." 
+ verb="{'0':'incorrect','1':'correct'}" + ;; + MRPC) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + QQP) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + STS-B) + temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" + verb="{'0':'No','1':'Yes'}" + ;; + MNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + ;; + MNLI-mm) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + dataname='MNLI' + ;; + SNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" + MAX_LEN=256 + ;; + QNLI) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'not_entailment':'No','entailment':'Yes'}" + ;; + RTE) + temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" + verb="{'not_entailment':'No','entailment':'Yes'}" + MAX_LEN=256 + ;; + mr) + temp="{'text':'text_a'} It was {'mask'}" + verb="{0:'terrible',1:'great'}" + MAX_LEN=160 + ;; + sst-5) + temp="{'text':'text_a'} It was {'mask'}." + temp="{'text':'text_a'} {'mask'}" + verb="{0:'terrible',1:'bad',2:'okay',3:'good',4:'great'}" + ;; + SST-2) + temp="{'text':'text_a'} It was {'mask'}." + verb="{'0':'terrible','1':'great'}" + ;; + subj) + temp="{'text':'text_a'} This is {'mask'}." + verb="{0:'subjective',1:'objective'}" + MAX_LEN=256 + ;; + trec) + temp="{'mask'}:{'text':'text_a'}" + verb="{0:'Description',1:'Entity',2:'Expression',3:'Human',4:'Location',5:'Number'}" + ;; + cr) + temp="{'text':'text_a'} It was {'mask'}." + verb="{0:'terrible',1:'great'}" + MAX_LEN=160 + ;; + mpqa) + temp="{'text':'text_a'} It was {'mask'}" + verb="{0:'terrible',1:'great'}" + MAX_LEN=128 + ;; + +esac + +echo $temp +echo $verb + + +for seed in 13 21 42 87 100 +do + for lr in 1e-5 2e-5 5e-5 + do + for bs in 2 4 8 + do + for alpha in 0.1 0.3 0.5 0.7 1 + do + CUDA_VISIBLE_DEVICES=$device python rgl.py \ + --output_dir ./ckpt_rgl_$seed/ \ + --dataset $dataset \ + --data_path ./data/k-shot/$dataname/16-$seed/ \ + --max_seq_length $MAX_LEN \ + --max_steps 1000 \ + --logging_step 100 \ + --eval_step 1000 \ + --batch_size $bs \ + --alpha $alpha \ + --seed $seed \ + --learning_rate $lr \ + --template "$temp" \ + --verbalizer "$verb" \ + --overwrite_output + done + done + done +done diff --git a/examples/few_shot/RGL/template.py b/examples/few_shot/RGL/template.py new file mode 100644 index 0000000000000000000000000000000000000000..9f0561fc240296715100086200c156a3c6ef4600 --- /dev/null +++ b/examples/few_shot/RGL/template.py @@ -0,0 +1,391 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
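+# Prompt templates: ManualTemplate for hard prompts (e.g. PET, EFL) and SoftTemplate/PTuningTemplate for soft prompts (e.g. P-tuning).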
+ +from abc import abstractmethod + +import paddle +import paddle.nn as nn +from data import InputExample + +from paddlenlp.utils.log import logger + + +class Template(nn.Layer): + """ + Base template class used to preprocess the inputs of model. + + Args: + tokenizer (paddlenlp.transformers.PretrainedTokenizer): + The tokenizer of pretrained models. + text_mapping (dict): + The dictionary to map text name in template to that in InputExample. + For example, {'premise': 'text_a', 'hypothesis': 'text_b'}. + + """ + + registered_input_names = ["mask_ids", "shortenable_ids"] + + def __init__(self, tokenizer, text_mapping=None): + super().__init__() + self.tokenizer = tokenizer + self.text_mapping = text_mapping + self._process_lock = False + + self.part_start = "{" + self.part_end = "}" + + @property + def template(self): + if not hasattr(self, "_template"): + raise RuntimeError("Property template has not been set before used.") + return self._template + + @template.setter + def template(self, template): + if template is None: + return + self._template = template + self.process_template() + + @abstractmethod + def process_template(self): + """A hook to process template text when it is set.""" + raise NotImplementedError + + def get_default_mask_ids(self): + """List to denote whether an item in template is a mask token.""" + return [1 if "mask" in p else 0 for p in self.template] + + def get_default_shortenable_ids(self): + """List to denote whther an item in template can be truncated.""" + idx = [] + for p in self.template: + if "shortenable" in p: + idx.append(1 if p["shortenable"] else 0) + else: + idx.append(1 if "text" in p else 0) + return idx + + def incorporate_template_text(self, example, template=None): + """Replace each item in template with real text.""" + inputs = template.copy() if self.template is None else self.template.copy() + + for i, p in enumerate(inputs): + if "text" in p: + inputs[i] = p["add_prefix_space"] + getattr(example, p["text"]) + elif "mask" in p: + inputs[i] = self.tokenizer.mask_token + elif "hard" in p: + inputs[i] = p["add_prefix_space"] + p["hard"] + elif "sep" in p: + inputs[i] = self.tokenizer.sep_token + else: + raise ValueError("can not parse {}".format(p)) + + return inputs + + def parse_inputs(self, inputs: str): + """Parse items from the input template text.""" + parsed = [] + i = 0 + while i < len(inputs): + p = {"add_prefix_space": " " if (i > 0 and inputs[i - 1] == " ") else ""} + while i < len(inputs) and inputs[i] == " ": + p["add_prefix_space"] = " " + i = i + 1 + if i == len(inputs): + break + + if inputs[i] == self.part_start: + j = i + 1 + count_part = 1 + while j < len(inputs): + if inputs[j] == self.part_end: + count_part -= 1 + if count_part == 0: + break + elif inputs[j] == self.part_start: + count_part += 1 + j = j + 1 + if j == len(inputs): + raise ValueError( + "{} at position {} has no corresponding {}".format(self.part_start, i, self.part_end) + ) + try: + part = eval("{%s}" % inputs[i + 1 : j]) + if isinstance(part, set): + part = {k: None for k in part} + p.update(part) + except: + import traceback + + logger.error(traceback.format_exc()) + logger.error("syntax error in {}".format("{%s}" % inputs[i + 1 : j])) + exit() + i = j + 1 + else: + j = i + 1 + while j < len(inputs): + if inputs[j] == self.part_start: + break + j = j + 1 + p["hard"] = inputs[i:j].rstrip(" ") + i = j + parsed.append(p) + + return parsed + + def wrap_one_example(self, example): + """Process InputExample according to the predefined template.""" + if 
self.template is None: + raise ValueError("template has not been initialized.") + if isinstance(example, InputExample): + text = self.incorporate_template_text(example) + + non_empty_keys = example.keys() + for key in self.text_mapping: + if self.text_mapping[key] in non_empty_keys: + non_empty_keys.remove(self.text_mapping[key]) + + keys, values = ["text"], [text] + for name in self.registered_input_names: + keys.append(name) + v = None + if hasattr(self, name) and getattr(self, name) is not None: + v = getattr(self, name) + elif hasattr(self, "get_default_" + name): + v = getattr(self, "get_default_" + name)() + setattr(self, name, v) + else: + raise ValueError( + """ + Template's part attribute '{}' is registered but not + initialized. Try using template.{} = [...] to + initialize or create a get_default_{}(self) + method in your template.""".format( + name, name, name + ) + ) + values.append(v) + + wrapped_parts_to_tokenize = [] + for value in list(zip(*values)): + wrapped_parts_to_tokenize.append(dict(zip(keys, value))) + + wrapped_parts_not_to_tokenize = {key: getattr(example, key) for key in non_empty_keys} + return [wrapped_parts_to_tokenize, wrapped_parts_not_to_tokenize] + else: + raise TypeError("InputExample") + + +class ManualTemplate(Template): + """ + ManualTemplate for hard prompt methods, such as PET, EFL. + """ + + def __init__(self, tokenizer, template=None, text_mapping={"text_a": "text_a", "text_b": "text_b"}): + super().__init__(tokenizer=tokenizer, text_mapping=text_mapping) + self.template = template + + def process_template(self): + self._template = self.parse_inputs(self._template) + + +class SoftTemplate(Template): + """ + SoftTemplate on the input layer for soft prompt methods, such as p-tuning. + """ + + registered_input_names = ["soft_token_ids", "mask_ids", "shortenable_ids"] + + def __init__(self, tokenizer, model, template=None, text_mapping={"text_a": "text_a", "text_b": "text_b"}): + super().__init__(tokenizer=tokenizer, text_mapping=text_mapping) + for module in model.children(): + if type(module).__name__.endswith("Model"): + self.token_embeddings = module.embeddings.word_embeddings + break + self.token_embeddings.weight.stop_gradient = True + self.embedding_size = self.token_embeddings.weight.shape[-1] + self.template = template + + def process_template(self): + self._template = self.parse_inputs(self._template) + self.process_soft_tokens() + self.generate_parameters() + + def incorporate_template_text(self, example, template=None): + """Replace each item in template with real text.""" + inputs = template.copy() if self.template is None else self.template.copy() + + for i, p in enumerate(inputs): + if "text" in p: + inputs[i] = p["add_prefix_space"] + getattr(example, p["text"]) + elif "mask" in p: + inputs[i] = self.tokenizer.mask_token + elif "hard" in p: + inputs[i] = p["add_prefix_space"] + p["hard"] + elif "soft" in p: + inputs[i] = p["add_prefix_space"] + p["soft"] + elif "sep" in p: + inputs[i] = self.tokenizer.sep_token + else: + raise ValueError("can not parse {}".format(p)) + + return inputs + + def process_soft_tokens(self): + inputs = [] + soft_token_ids = [] + num_soft_token = 0 + soft2word_init = {} + soft_id_reindex = {} + + for part in self.template: + if "soft" not in part and "soft_id" not in part: + soft_token_ids.append(0) + inputs.append(part) + continue + + if "soft" in part and part["soft"] is not None: + if "duplicate" in part: + logger.warnings("Ignore ``duplicate``. 
It is " "incompatible with ``soft`` with text values.") + + # Get word tokens and ids for soft token initialization. + init_token_ids = self.tokenizer( + part["add_prefix_space"] + part["soft"], add_special_tokens=False, return_token_type_ids=False + )["input_ids"] + init_tokens = self.tokenizer.convert_ids_to_tokens(init_token_ids) + assert len(init_tokens) == len(init_token_ids) + + # Create soft ids and corresponding ``soft`` part in template. + next_num_soft = num_soft_token + 1 + num_soft_token += len(init_tokens) + id_list = list(range(next_num_soft, num_soft_token + 1)) + + soft_token_ids.extend(id_list) + inputs.extend([{"add_prefix_space": part["add_prefix_space"], "soft": token} for token in init_tokens]) + for soft_id, word_id in zip(id_list, init_token_ids): + soft2word_init[soft_id] = word_id + + # Check the ids of ``soft`` and ``soft_id``. + if "soft_id" in part: + if part["soft_id"] in soft_id_reindex: + assert id_list == soft_id_reindex[part["soft_id"]] + else: + soft_id_reindex[part["soft_id"]] = id_list + continue + + if "soft_id" in part and part["soft_id"] in soft_id_reindex: + if "duplicate" in part: + logger.warnings("Ignore ``duplicate``. Initialize " "``soft`` by ``soft_id`` directly.") + id_list = soft_id_reindex[part["soft_id"]] + + elif "duplicate" in part: + assert isinstance(part["duplicate"], int) + if "same" in part: + num_soft_token += 1 + id_list = [num_soft_token for _ in range(part["duplicate"])] + else: + next_num_soft = num_soft_token + 1 + num_soft_token += part["duplicate"] + id_list = list(range(next_num_soft, num_soft_token + 1)) + else: + num_soft_token += 1 + id_list = [num_soft_token] + + if "soft_id" in part: + soft_id_reindex[part["soft_id"]] = id_list + + soft_token_ids.extend(id_list) + inputs.extend([{"add_prefix_space": part["add_prefix_space"], "soft": ""} for _ in range(len(id_list))]) + + self._template = inputs + self.soft_token_ids = soft_token_ids + self.num_soft_token = num_soft_token + self.soft2word_init = soft2word_init + + if self.num_soft_token == 0: + logger.warnings("No soft tokens in template. " "Use ManualTemplate for better performance.") + + def generate_parameters(self): + """ + Generate parameters for soft tokens. 
+ """ + if self.num_soft_token == 0: + return None + self.soft_embeddings = nn.Embedding(self.num_soft_token + 1, self.embedding_size) + + weight = self.soft_embeddings.weight.clone().detach() + for soft_id, word_id in self.soft2word_init.items(): + weight[soft_id] = self.token_embeddings(paddle.to_tensor(word_id)) + self.soft_embeddings.weight.set_value(weight) + + def process_batch(self, batch): + word_embeds = self.token_embeddings(batch["input_ids"]) + batch["input_ids"] = None + if not hasattr(self, "soft_embeddings"): + batch["input_embeds"] = word_embeds + else: + soft_embeds = self.soft_embeddings(batch["soft_token_ids"]) + input_embeds = paddle.where((batch["soft_token_ids"] > 0).unsqueeze(-1), soft_embeds, word_embeds) + batch["input_embeds"] = input_embeds + return batch + + +class PTuningTemplate(SoftTemplate): + def __init__( + self, tokenizer, model, template, prompt_encoder="lstm", text_mapping={"text_a": "text_a", "text_b": "text_b"} + ): + super().__init__(tokenizer=tokenizer, model=model, text_mapping=text_mapping) + self.prompt_encoder = prompt_encoder + self.template = template + + def generate_parameters(self): + super().generate_parameters() + if self.prompt_encoder == "lstm": + self.lstm_head = nn.LSTM( + input_size=self.embedding_size, + hidden_size=self.embedding_size, + num_layers=2, + direction="bidirect", + time_major=False, + ) + self.mlp_head = nn.Sequential( + nn.Linear(2 * self.embedding_size, self.embedding_size), + nn.ReLU(), + nn.Linear(self.embedding_size, self.embedding_size), + ) + elif self.prompt_encoder == "mlp": + self.mlp_head = nn.Sequential( + nn.Linear(self.embedding_size, self.embedding_size), + nn.ReLU(), + nn.Linear(self.embedding_size, self.embedding_size), + ) + else: + raise ValueError("Unsupported soft token encoder: {}".format(self.prompt_encoder)) + + def process_batch(self, batch): + word_embeds = self.token_embeddings(batch["input_ids"]) + batch["input_ids"] = None + if not hasattr(self, "soft_embeddings"): + batch["input_embeds"] = word_embeds + else: + soft_embeds = self.soft_embeddings(batch["soft_token_ids"]) + if self.prompt_encoder == "lstm": + soft_embeds = self.lstm_head(soft_embeds)[0] + soft_embeds = self.mlp_head(soft_embeds) + + input_embeds = paddle.where((batch["soft_token_ids"] > 0).unsqueeze(-1), soft_embeds, word_embeds) + batch["input_embeds"] = input_embeds + return batch diff --git a/examples/few_shot/RGL/tokenizer.py b/examples/few_shot/RGL/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..91f2fbd1fad616b3a289abc5a5f304d7adc28e26 --- /dev/null +++ b/examples/few_shot/RGL/tokenizer.py @@ -0,0 +1,261 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import itertools +import warnings +from collections import defaultdict +from functools import partial + +import numpy as np + + +class TokenizerWrapper: + """ + Process examples encoded by template, such as truncating and padding. 
+ + Args: + max_seq_length (int): + The maximum length of input data (prompt and text). + tokenizer (paddlenlp.transformers.PreTrainedTokenizer): + The tokenizer of pretrained model. + truncate_method (str): + How to truncate input data. + Choices: ``tail``, ``head``, ``manual``. + create_token_type_ids (bool): + Whether to create token_type_ids for inputs. + seq_length_list (list, optional): + The list of maximum length for every part in input data. + """ + + def __init__(self, max_seq_length, tokenizer, truncate_method="tail", create_token_type_ids=False, **kwargs): + self.max_seq_length = max_seq_length + self.tokenizer = tokenizer + if truncate_method == "manual": + assert hasattr(kwargs, "seq_length_list"), "seq_length_list " "should be defined for manual truncation." + self.seq_length_list = kwargs["seq_length_list"] + self.truncate_fn = partial(self.truncate_from_end, etype="tail") + elif truncate_method == "tail" or truncate_method == "head": + self.truncate_fn = partial(self.truncate_from_end, etype=truncate_method) + else: + raise NotImplementedError + + self.create_token_type_ids = create_token_type_ids + + self.num_truncated_sentences = 0 + self.total_passed_sentences = 0 + + @property + def special_tokens_maps(self): + if not hasattr(self, "_special_tokens_map"): + self._special_tokens_map = { + "": getattr(self.tokenizer, "cls_token", ""), + "": getattr(self.tokenizer, "sep_token", ""), + "": getattr(self.tokenizer, "pad_token", ""), + "": getattr(self.tokenizer, "mask_token", ""), + "": getattr(self.tokenizer, "unk_token", ""), + } + return self._special_tokens_map + + @property + def truncate_rate(self): + if self.total_passed_sentences == 0: + return None + else: + return self.num_truncated_sentences / self.total_passed_sentences + + @staticmethod + def truncate_by_manual(input_dict, max_len_list=[]): + """ + Truncate input data by manually defined maximum sequence length. + + Args: + input_dict (dict): + The dictionary of an input example. + max_len_list (list): + The maximum length of every part in example. + ``-1`` denotes that there is no limit on length. 
+ """ + truncated_dict = defaultdict(list) + shortenable_ids = input_dict["shortenable_ids"] + truncated_dict["shortenable_ids"] = shortenable_ids + for attr_name, attr_values in input_dict.items(): + text_idx = 0 + for i, value in enumerate(attr_values): + if shortenable_ids[i][0] == 0: + continue + if text_idx >= len(max_len_list): + break + if len(value) > 0: + max_len = max_len_list[text_idx] + if max_len < 0: + attr_values[i] = value + else: + attr_values[i] = value[:max_len] + text_idx += 1 + truncated_dict[attr_name] = attr_values + return truncated_dict + + @staticmethod + def truncate_from_end(input_dict, num_tokens_to_truncate=0, etype="tail"): + assert etype in ["head", "tail"] + step = 1 if etype == "head" else -1 + idx_offset = 0 if etype == "head" else 1 + truncated_dict = defaultdict(list) + shortenable_ids = input_dict["shortenable_ids"] + for attr_name in input_dict: + attr_values = input_dict[attr_name] + count = num_tokens_to_truncate + for i, value in enumerate(attr_values[::step]): + index = int(step * (idx_offset + i)) + if len(value) == 0 or shortenable_ids[index][0] == 0: + continue + if count < len(value): + attr_values[index] = value[:-count] + else: + attr_values[index] = [] + count -= len(value) + if count <= 0: + break + truncated_dict[attr_name] = attr_values + + return truncated_dict + + @staticmethod + def concate_parts(input_dict): + for key in input_dict: + input_dict[key] = list(itertools.chain(*input_dict[key])) + return input_dict + + @staticmethod + def padding(input_dict, max_len, pad_id_for_inputs=0, pad_id_for_others: int = 0) -> None: + for key, value in input_dict.items(): + if len(input_dict[key]) > max_len: + raise ValueError( + f"""Truncated seq length of '{key}' still greater than + max length {max_len}. One possible reason is that + no enough shortenable parts in template. Try adding + {{"shortenable": "True"}} property. 
+ """ + ) + if "input" in key: + input_dict[key].extend([pad_id_for_inputs] * (max_len - len(value))) + else: + input_dict[key].extend([pad_id_for_others] * (max_len - len(value))) + return input_dict + + def truncate(self, inputs): + if hasattr(self, "seq_length_list"): + inputs = self.truncate_by_manual(inputs, self.seq_length_list) + total_tokens = sum([len(part) for part in inputs["input_ids"]]) + num_specials = self.num_special_tokens_to_add + num_tokens_to_truncate = total_tokens - self.max_seq_length + num_specials + self.total_passed_sentences += 1 + if num_tokens_to_truncate > 0: + self.num_truncated_sentences += 1 + inputs = self.truncate_fn(input_dict=inputs, num_tokens_to_truncate=num_tokens_to_truncate) + return inputs + + def add_special_tokens(self, encode_inputs): + for key in encode_inputs: + if key == "input_ids": + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + encode_inputs[key] = self.tokenizer.build_inputs_with_special_tokens(encode_inputs[key]) + else: + special_tokens_mask = np.array(self.tokenizer.get_special_tokens_mask(encode_inputs[key])) + with_special_tokens = np.array(self.tokenizer.build_inputs_with_special_tokens(encode_inputs[key])) + with_special_tokens[special_tokens_mask == 1] = 0 + encode_inputs[key] = with_special_tokens.tolist() + return encode_inputs + + +class MLMTokenizerWrapper(TokenizerWrapper): + input_keys = ["input_ids", "attention_mask", "token_type_ids"] + + @property + def mask_token(self): + return self.tokenizer.mask_token + + @property + def mask_token_id(self): + return self.tokenizer.mask_token_id + + @property + def soft_token(self): + return self.tokenizer.unk_token + + @property + def soft_token_id(self): + return self.tokenizer.unk_token_id + + @property + def num_special_tokens_to_add(self): + if not hasattr(self, "_num_specials"): + self._num_specials = self.tokenizer.num_special_tokens_to_add() + return self._num_specials + + def get_token_type_ids(self, encoded_inputs): + token_type_ids = [0] * len(encoded_inputs["input_ids"]) + sep_token = getattr(self.tokenizer, "sep_token", -1) + if sep_token >= 0: + sep_index = np.where([x == sep_token for x in encoded_inputs["input_ids"]])[0] + for i, x in enumerate(sep_index[1:]): + pre_x = sep_index[i - 1] + sep_index[pre_x + 1 : x + 1] = [i + 1] * (x - pre_x) + return token_type_ids + + def tokenize_one_example(self, wrapped_example): + to_tokenize, not_to_tokenize = wrapped_example + + encode_inputs = defaultdict(list) + for part in to_tokenize: + if part["mask_ids"] == 1: + text = [self.mask_token_id] + + if part["text"] in self.special_tokens_maps.keys(): + to_replace = self.special_tokens_maps[part["text"]] + if to_replace is not None: + part["text"] = to_replace + else: + raise KeyError("This tokenizer doesn't specify {} token.".format(part["prompt"])) + + if "soft_token_ids" in part and part["soft_token_ids"] == 1: + text = [self.soft_token_id] + else: + text = self.tokenizer.encode(part["text"], add_special_tokens=False, return_token_type_ids=False)[ + "input_ids" + ] + + text_len = len(text) + encode_inputs["input_ids"].append(text) + for key in part: + if key not in ["text"]: + encode_inputs[key].append([part[key]] * text_len) + encode_inputs = self.truncate(inputs=encode_inputs) + encode_inputs.pop("shortenable_ids") + encode_inputs = self.concate_parts(encode_inputs) + encode_inputs = self.add_special_tokens(encode_inputs) + encode_inputs["attention_mask"] = [1] * len(encode_inputs["input_ids"]) + if self.create_token_type_ids: + 
encode_inputs["token_type_ids"] = self.get_token_type_ids(encode_inputs) + encode_inputs = self.padding( + encode_inputs, max_len=self.max_seq_length, pad_id_for_inputs=self.tokenizer.pad_token_id + ) + + return {**encode_inputs} + + +tokenizer_mapping = { + "roberta": MLMTokenizerWrapper, +} diff --git a/examples/few_shot/RGL/utils.py b/examples/few_shot/RGL/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..f855145c444dd092083c50bc3894e8045678ebb6 --- /dev/null +++ b/examples/few_shot/RGL/utils.py @@ -0,0 +1,81 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random + +import numpy as np +import paddle +from data import InputFeatures +from paddle.io import DataLoader +from paddle.optimizer.lr import LambdaDecay + +from paddlenlp.datasets import MapDataset + + +def set_seed(seed): + """set random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def check_args(args): + """check output_dir and make it when not exist""" + if os.path.exists(args.output_dir): + if os.listdir(args.output_dir) and not args.overwrite_output: + raise ValueError("Path Configuration: output_dir {} exists!".format(args.output_dir)) + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + + args.dataset = args.dataset.lower() + + +def convert_example(example, template, tokenizer_wrapper, verbalizer=None): + if verbalizer is not None and hasattr(verbalizer, "wrap_one_example"): + example = verbalizer.wrap_one_example(example) + example = template.wrap_one_example(example) + encoded_inputs = InputFeatures(**tokenizer_wrapper.tokenize_one_example(example), **example[1]) + return encoded_inputs + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if isinstance(dataset, list): + dataset = MapDataset(dataset) + assert isinstance(dataset, MapDataset) + + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +class LinearSchedulerWarmup(LambdaDecay): + """ + Linear scheduler with warm up. 
+ """ + + def __init__(self, learning_rate, warmup_steps, max_steps, last_epoch=-1, verbose=False): + def lr_lambda(current_step): + if current_step < warmup_steps: + return float(current_step) / float(max(1, warmup_steps)) + return max(0.0, float(max_steps - current_step) / float(max(1, max_steps - warmup_steps))) + + super().__init__(learning_rate, lr_lambda, last_epoch, verbose) diff --git a/examples/few_shot/RGL/verbalizer.py b/examples/few_shot/RGL/verbalizer.py new file mode 100644 index 0000000000000000000000000000000000000000..0e741235dcc072bb4e99a9db8cbda742fbbf676f --- /dev/null +++ b/examples/few_shot/RGL/verbalizer.py @@ -0,0 +1,188 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import abstractmethod +from typing import Dict, List, Union + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from data import InputExample + +from paddlenlp.transformers import PretrainedTokenizer + + +class Verbalizer(nn.Layer): + """ + Base verbalizer class used to process the outputs and labels. + + Args: + tokenizer (paddlenlp.transformers.PretrainedTokenizer): + The tokenizer of pretrained models. + labels (list): + The sequence of labels in task. + + """ + + def __init__(self, tokenizer: PretrainedTokenizer = None, labels: List = None): + super().__init__() + assert labels is not None, "Label list for current task is not set yet." + self.tokenizer = tokenizer + self.labels = sorted(labels) + self._process_lock = False + + @property + def vocab(self): + if not hasattr(self, "_vocab"): + self._vocab = self.tokenizer.convert_ids_to_tokens(np.arange(self.vocab_size).tolist()) + return self._vocab + + @property + def vocab_size(self): + return self.tokenizer.vocab_size + + @property + def label_to_words(self): + if not hasattr(self, "_label_to_words"): + raise RuntimeError("Property label_to_words has not been set before used.") + return self._label_to_words + + @label_to_words.setter + def label_to_words(self, label_to_words: Union[List, Dict]): + if label_to_words is None: + return + if isinstance(label_to_words, dict): + new_keys = sorted(list(label_to_words.keys())) + assert new_keys == self.labels, "label_to_words {} does not match the predefined labels {}.".format( + new_keys, self.labels + ) + self._label_to_words = {k: label_to_words[k] for k in self.labels} + elif isinstance(label_to_words, list): + assert len(self.labels) == len( + label_to_words + ), "The lengths of label_to_words and predefined labels do not match." 
+ self._label_to_words = {k: v for k, v in zip(self.labels, label_to_words)} + else: + raise TypeError("Unsupported type {} for label_to_words".format(type(label_to_words))) + self.process_label_words() + + @property + def labels_to_ids(self): + if not hasattr(self, "labels"): + raise RuntimeError("Property labels_to_ids has not been set before used.") + return {k: i for i, k in enumerate(self.labels)} + + @property + def ids_to_labels(self): + if not hasattr(self, "labels"): + raise RuntimeError("Property ids_to_labels has not been set before used.") + return {i: k for i, k in enumerate(self.labels)} + + @abstractmethod + def process_label_words( + self, + ): + """A hook to process verbalizer when it is set.""" + raise NotImplementedError + + @abstractmethod + def project(self, logits, **kwargs): + """ + Project the logits with shape ```[batch_size, vocab_size]``` into + label_word_logits with shape ```[batch_size, num_label_words]```. + """ + raise NotImplementedError + + @staticmethod + def aggregate(label_words_logits, atype="mean", ndim=2): + """ + Aggregate embeddings when multiple words are mapped to one label. + + Args: + label_words_logits (paddle.Tensor): + The logits of words which could be mapped to labels. + atype (str): + The aggregation strategy, including mean and first. + ndim (str): + The aggregated embeddings' number of dimensions. + + """ + if label_words_logits.ndim > ndim: + if atype == "mean": + return label_words_logits.mean(axis=-1) + elif atype == "max": + return label_words_logits.max(axis=-1) + elif atype == "first": + return label_words_logits[..., 0, :] + else: + raise ValueError("Unsupported aggreate type {}".format(atype)) + return label_words_logits + + def normalize(self, logits): + """Normalize the logits of every example.""" + new_logits = F.softmax(logits.reshape(logits.shape[0], -1), axis=-1) + return new_logits.reshape(*logits.shape) + + +class ManualVerbalizer(Verbalizer): + """ + Manual Verbalizer to map labels to words for hard prompt methods. + + Args: + tokenizer (paddlenlp.transformers.PretrainedTokenizer): + The tokenizer of pretrained models. + labels (list): + The sequence of all labels. + label_to_words (dict or list): + The dictionary or corresponding list to map labels to words. + prefix (str): + The prefix string of words, used in PLMs like RoBERTa, which is sensitive to the prefix. + """ + + def __init__(self, tokenizer, labels=None, label_to_words=None, prefix=""): + super().__init__(tokenizer=tokenizer, labels=labels) + self.tokenizer = tokenizer + self.labels = labels + self.prefix = prefix + self.label_to_words = label_to_words + + def process_label_words(self): + word_ids = [] + for label in self.labels: + word_ids.append( + self.tokenizer.encode( + self.prefix + self._label_to_words[label], add_special_tokens=False, return_token_type_ids=False + )["input_ids"] + ) + self.word_ids = paddle.to_tensor(word_ids, dtype="int64").squeeze() + self.label_to_words_ids = {k: v for k, v in zip(self.labels, word_ids)} + + def process_logits(self, logits, mask_ids=None, **kwargs): + if mask_ids is not None: + logits = logits[mask_ids == 1] + label_words_logits = logits.index_select(index=self.word_ids, axis=-1) + return label_words_logits + + def wrap_one_example(self, example): + """Process labels in InputExample According to the predefined verbalizer.""" + if isinstance(example, InputExample): + try: + example.label = self.labels_to_ids[example.cls_label] + except KeyError: + # Regression tasks. 
+ example.label = eval(example.cls_label) + return example + else: + raise TypeError("InputExample") diff --git a/examples/few_shot/efl/README.md b/examples/few_shot/efl/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f8656b690172935109a3c2fbb2b1325c76a1f9b1 --- /dev/null +++ b/examples/few_shot/efl/README.md @@ -0,0 +1,85 @@ +# EFL + + +[Entailment as Few-Shot Learner](https://arxiv.org/abs/2104.14690) + + +## 算法简介 + +Entailment as Few-Shot Learner(EFL)提出将 NLP Fine-tune 任务转换统一转换为 Entailment 二分类任务,为小样本场景下的任务求解提供了新的视角。EFL 的主要思想如下图所示,该算法也可以使用 `Template` 实现标签描述与数据文本的拼接,定义方式详见[Prompt API 文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md)。 + +![EFL](https://user-images.githubusercontent.com/25607475/204245126-bd94e87c-f25f-471e-af1c-d1e05f7a2897.png) + +## 快速开始 + +CLUE(Chinese Language Understanding Evaluation)作为中文语言理解权威测评榜单,在学术界和工业界都有着广泛影响。FewCLUE 是其设立的中文小样本学习测评子榜,旨在探索小样本学习最佳模型和中文实践。PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 EFL 算法训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 + +### 代码结构说明 +``` +├── run_train.py # EFL 算法提示学习脚本 +├── data.py # 数据集构造、数据增强 +├── utils.py # FewCLUE 提交结果保存等工具函数 +└── prompt/ # FewCLUE 各数据集的 prompt 定义文件 +``` + +### 数据准备 + +读取 FewCLUE 数据集只需要 1 行代码,这部分代码在 `data.py` 脚本中。以情感分类数据集 `eprstmt` 为例: + +``` +from paddlenlp.datasets import load_dataset + +# 通过指定 "fewclue" 和数据集名字 name="eprstmt" 即可一键加载 FewCLUE 中的 eprstmt 数据集 +train_ds, dev_ds, public_test_ds = load_dataset("fewclue", name="eprstmt", splits=("train_0", "dev_0", "test_public")) +``` + +### 模型训练、评估、预测 + +通过如下命令,指定 GPU 0 卡, 在 FewCLUE 的 `eprstmt` 数据集上进行训练&评估 +``` +python -u -m paddle.distributed.launch --gpus "0" run_train.py \ + --output_dir checkpoint_eprstmt \ + --task_name eprstmt \ + --split_id few_all \ + --prompt_path prompt/eprstmt.json \ + --prompt_index 0 \ + --do_train \ + --do_eval \ + --do_test \ + --do_predict \ + --do_label \ + --max_steps 1000 \ + --learning_rate 3e-5 \ + --eval_steps 100 \ + --save_steps 100 \ + --logging_steps 5 \ + --per_device_train_batch_size 16 \ + --max_seq_length 128 \ + --load_best_model_at_end \ + --metric_for_best_model accuracy \ + --save_total_limit 1 +``` +参数含义说明 +- `task_name`: FewCLUE 中的数据集名字 +- `split_id`: 数据集编号,包括0, 1, 2, 3, 4 和 few_all +- `prompt_path`: prompt 定义文件名 +- `prompt_index`: 使用定义文件中第 `prompt_index` 个 prompt +- `augment_type`: 数据增强策略,可选 swap, delete, insert, substitute +- `num_augment`: 数据增强策略为每个样本生成的样本数量 +- `word_augment_percent`: 每个序列中数据增强词所占的比例 +- `pseudo_data_path`: 使用模型标注的伪标签数据文件路径 +- `do_label`: 是否使用训练后的模型给无标签数据标注伪标签 +- `do_test`: 是否在公开测试集上评估模型效果 +- `model_name_or_path`: 预训练模型名,默认为 `ernie-1.0-large-zh-cw` +- `use_rdrop`: 是否使用对比学习策略 R-Drop +- `alpha_rdrop`: R-Drop 损失值权重,默认为 0.5 +- `dropout`: 预训练模型的 dropout 参数值,用于 R-Drop 策略中参数配置 +- `export_type`: 模型导出格式,默认为 `paddle`,动态图转静态图 +- 更多配置参考 [Trainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md#trainingarguments-%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D) 和 [PromptTrainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md#prompttrainer%E5%8F%82%E6%95%B0%E5%88%97%E8%A1%A8) + +### 模型部署 + +Coming soon... + +## References +[1] Wang, Sinong, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. “Entailment as Few-Shot Learner.” ArXiv:2104.14690 [Cs], April 29, 2021. http://arxiv.org/abs/2104.14690. 
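+
+补充示意:下面的代码片段仅用于说明 EFL 的转换思路:把一个多分类样本展开为若干“标签描述 + 文本”的二分类(蕴含)样本,预测时取正例得分最高的候选标签。它不是本目录 `data.py` 中 `convert_efl` 的实现,其中的 `label_words` 标签词映射也仅为假设示例。
+
+```python
+# 仅为示意(非本目录实现):把一个 N 分类样本展开成 N 个二分类样本。
+label_words = {"Negative": "不满意", "Positive": "满意"}  # 假设的标签词映射
+
+
+def to_entailment_examples(sentence, gold_label=None):
+    examples = []
+    for label in sorted(label_words):
+        new_example = {"candidate_label": label_words[label], "sentence": sentence}
+        if gold_label is not None:
+            # 1 表示候选标签与真实标签一致,0 表示不一致
+            new_example["labels"] = int(label == gold_label)
+        examples.append(new_example)
+    return examples
+
+
+print(to_entailment_examples("外观漂亮,做工也不错", gold_label="Positive"))
+```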
diff --git a/examples/few_shot/efl/data.py b/examples/few_shot/efl/data.py new file mode 100644 index 0000000000000000000000000000000000000000..b33ea7927166ec03b614be73b3c2eaede7679698 --- /dev/null +++ b/examples/few_shot/efl/data.py @@ -0,0 +1,134 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +from paddlenlp.datasets import MapDataset, load_dataset + + +def extend_with_pseudo_data(data_ds, pseudo_path, labels_to_ids): + """ + Extend train dataset with pseudo labeled examples if exists. + """ + if pseudo_path is None: + return data_ds + with open(pseudo_path, "r", encoding="utf-8") as fp: + pseudo_data = [json.loads(x.strip()) for x in fp] + data_ds = MapDataset([x for x in data_ds] + pseudo_data) + return data_ds + + +def convert_efl(data_ds, label_words, orig_key, is_train=False, num_neg=5): + efl_data_ds = [] + label_list = sorted(label_words.keys()) + for example in data_ds: + label = label_words[example[orig_key]] if orig_key in example else None + sub_list = label_list + if is_train and label is not None and len(label_list) > num_neg: + rand_index = np.random.permutation(len(label_list)) + sub_list = [example[orig_key]] + [label_list[i] for i in rand_index[:num_neg]] + for key in sub_list: + new_example = example.copy() + cand = label_words[key] + new_example["candidate_label"] = cand + if label is not None: + new_example["labels"] = int(cand == label) + efl_data_ds.append(new_example) + return MapDataset(efl_data_ds) + + +def convert_chid(data_ds): + """ + Insert idioms into positions of `#idiom#` so that the task is converted + to binary classification. + """ + split_data_ds = [] + for example in data_ds: + fragments = example["content"].split("#idiom#") + label = example.get("answer", None) + for index, cand in enumerate(example["candidates"]): + new_example = {"content_pre": fragments[0], "content_post": fragments[1], "idiom": cand} + if label is not None: + new_example["label"] = str(int(index == label)) + split_data_ds.append(new_example) + return MapDataset(split_data_ds) + + +def convert_cluewsc(data_ds): + """ + Mark the pronoun and entity with special tokens. 
+ """ + marked_data_ds = [] + for example in data_ds: + target, text = example["target"], list(example["text"]) + pronoun, p_index = target["span2_text"], target["span2_index"] + entity, e_index = target["span1_text"], target["span1_index"] + label = example.get("label", None) + if p_index > e_index: + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + else: + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + new_example = {"text": "".join(text), "pronoun": pronoun, "entity": entity} + if label is not None: + new_example["label"] = label + marked_data_ds.append(new_example) + return MapDataset(marked_data_ds) + + +def load_fewclue_dataset(args, verbalizer): + """ + Load fewclue datasets and convert them to the standard format of PET. + """ + split_id = args.split_id + splits = [f"train_{split_id}", f"dev_{split_id}", "test_public", "test"] + if args.task_name == "cluewsc": + train_ds, dev_ds, public_test_ds, test_ds = load_dataset("fewclue", name=args.task_name, splits=splits) + unlabeled_ds = None + else: + splits.append("unlabeled") + train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_dataset( + "fewclue", name=args.task_name, splits=splits + ) + data_ds = [train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds] + + # Preprocess data for EFL. + if args.task_name == "chid": + for index, sub_data_ds in enumerate(data_ds): + data_ds[index] = convert_chid(sub_data_ds) + elif args.task_name == "cluewsc": + for index, sub_data_ds in enumerate(data_ds[:-1]): + data_ds[index] = convert_cluewsc(sub_data_ds) + + orig_key = "label" + if args.task_name == "tnews": + orig_key = "label_desc" + elif args.task_name == "iflytek": + orig_key = "label_des" + for index, sub_data_ds in enumerate(data_ds): + is_train = index == 0 + if sub_data_ds is not None: + data_ds[index] = convert_efl(sub_data_ds, args.label_words, orig_key, is_train) + + # Extend train dataset with pseudo-label data. 
+ data_ds[0] = extend_with_pseudo_data(data_ds[0], args.pseudo_data_path, verbalizer.labels_to_ids) + + return data_ds diff --git a/examples/few_shot/efl/prompt/bustm.json b/examples/few_shot/efl/prompt/bustm.json new file mode 100644 index 0000000000000000000000000000000000000000..b44363510642badc49e8414b6a1f135436ed8ede --- /dev/null +++ b/examples/few_shot/efl/prompt/bustm.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "下边两个句子说的是{'text': 'candidate_label'}的事情。“{'text': 'sentence1'}”和“{'text': 'sentence2'}”"} + ], + "verbalizer": [ + {"0": "不同", "1": "相关"} + ] +} diff --git a/examples/few_shot/efl/prompt/chid.json b/examples/few_shot/efl/prompt/chid.json new file mode 100644 index 0000000000000000000000000000000000000000..b3c1e648e29a60d37fc21320fc4933b88453f052 --- /dev/null +++ b/examples/few_shot/efl/prompt/chid.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "成语[{'text':'idiom'}]使用{'text': 'candidate_label'}的例子:{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}"} + ], + "verbalizer": [ + {"0": "错误", "1": "正确"} + ] +} diff --git a/examples/few_shot/efl/prompt/cluewsc.json b/examples/few_shot/efl/prompt/cluewsc.json new file mode 100644 index 0000000000000000000000000000000000000000..1e736c43332da941622bb3d03b7d159973bae57a --- /dev/null +++ b/examples/few_shot/efl/prompt/cluewsc.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'text': 'text'}{'text': 'pronoun'}指的{'text': 'candidate_label'}{'text': 'entity'}"} + ], + "verbalizer": [ + {"false": "不是", "true": "是"} + ] +} diff --git a/examples/few_shot/efl/prompt/csl.json b/examples/few_shot/efl/prompt/csl.json new file mode 100644 index 0000000000000000000000000000000000000000..6d19eee927f8badc4c77297e76e04f34e85c02e5 --- /dev/null +++ b/examples/few_shot/efl/prompt/csl.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "给定以下几个词语:{'options': 'keyword', 'add_prompt': '[OPT],'}{'text': 'candidate_label'}扩写成“{'text': 'abst'}”"} + ], + "verbalizer": [ + {"0": "不能", "1": "可以"} + ] +} diff --git a/examples/few_shot/efl/prompt/csldcp.json b/examples/few_shot/efl/prompt/csldcp.json new file mode 100644 index 0000000000000000000000000000000000000000..2dd84e36ca2194851b7865b0a33b0e855ffb373e --- /dev/null +++ b/examples/few_shot/efl/prompt/csldcp.json @@ -0,0 +1,76 @@ +{ + "template": [ + {"text": "这篇论文阐述了{'text': 'candidate_label'}。{'text': 'content'}"} + ], + "verbalizer": [ + [ + "材料科学与工程", + "作物学", + "口腔医学", + "药学", + "教育学", + "水利工程", + "理论经济学", + "食品科学与工程", + "畜牧学/兽医学", + "体育学", + "核科学与技术", + "力学", + "园艺学", + "水产", + "法学", + "地质学/地质资源与地质工程", + "石油与天然气工程", + "农林经济管理", + "信息与通信工程", + "图书馆、情报与档案管理", + "政治学", + "电气工程", + "海洋科学", + "民族学", + "航空宇航科学与技术", + "化学/化学工程与技术", + "哲学", + "公共卫生与预防医学", + "艺术学", + "农业工程", + "船舶与海洋工程", + "计算机科学与技术", + "冶金工程", + "交通运输工程", + "动力工程及工程热物理", + "纺织科学与工程", + "建筑学", + "环境科学与工程", + "公共管理", + "数学", + "物理学", + "林学/林业工程", + "心理学", + "历史学", + "工商管理", + "应用经济学", + "中医学/中药学", + "天文学", + "机械工程", + "土木工程", + "光学工程", + "地理学", + "农业资源利用", + "生物学/生物科学与工程", + "兵器科学与技术", + "矿业工程", + "大气科学", + "基础医学/临床医学", + "电子科学与技术", + "测绘科学与技术", + "控制科学与工程", + "军事学", + "中国语言文学", + "新闻传播学", + "社会学", + "地球物理学", + "植物保护" + ] + ] +} diff --git a/examples/few_shot/efl/prompt/eprstmt.json b/examples/few_shot/efl/prompt/eprstmt.json new file mode 100644 index 0000000000000000000000000000000000000000..309146c5e7e514d7c84c2c041fcbc9589f8ec69a --- /dev/null +++ b/examples/few_shot/efl/prompt/eprstmt.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "这表达了{'text': 'candidate_label'}的情感。{'text':'sentence'}"} + ], + "verbalizer": [ 
+ {"Negative": "不满意", "Positive": "满意"} + ] +} diff --git a/examples/few_shot/efl/prompt/iflytek.json b/examples/few_shot/efl/prompt/iflytek.json new file mode 100644 index 0000000000000000000000000000000000000000..5199508e6f035355b17b0ba6006fae28228543fe --- /dev/null +++ b/examples/few_shot/efl/prompt/iflytek.json @@ -0,0 +1,129 @@ +{ + "template": [ + {"text": "这段文本的应用描述主题是{'text': 'candidate_label'}。{'text': 'sentence'}"} + ], + "verbalizer": [ + [ + "银行", + "社区服务", + "电商", + "支付", + "经营养成", + "卡牌", + "借贷", + "驾校", + "理财", + "职考", + "新闻", + "旅游资讯", + "公共交通", + "魔幻", + "医疗服务", + "影像剪辑", + "动作类", + "工具", + "体育竞技", + "小说", + "运动健身", + "相机", + "辅助工具", + "快递物流", + "高等教育", + "股票", + "菜谱", + "行车辅助", + "仙侠", + "亲子儿童", + "购物咨询", + "射击游戏", + "漫画", + "中小学", + "同城服务", + "成人教育", + "求职", + "电子产品", + "艺术", + "薅羊毛", + "约会社交", + "经营", + "兼职", + "短视频", + "音乐", + "英语", + "棋牌中心", + "摄影修图", + "养生保健", + "办公", + "政务", + "视频", + "论坛圈子", + "彩票", + "直播", + "其他", + "休闲益智", + "策略", + "即时通讯", + "汽车交易", + "违章", + "地图导航", + "民航", + "电台", + "语言(非英语)", + "搞笑", + "婚恋社交", + "社区超市", + "日常养车", + "杂志", + "视频教育", + "家政", + "影视娱乐", + "装修家居", + "体育咨讯", + "社交工具", + "餐饮店", + "美颜", + "问诊挂号", + "飞行空战", + "综合预定", + "电影票务", + "笔记", + "买房", + "外卖", + "母婴", + "打车", + "情侣社交", + "日程管理", + "租车", + "微博博客", + "百科", + "绘画", + "铁路", + "生活社交", + "租房", + "酒店", + "保险", + "问答交流", + "收款", + "MOBA", + "K歌", + "技术", + "减肥瘦身", + "工作社交", + "团购", + "记账", + "女性", + "公务员", + "二手", + "美妆美业", + "汽车咨询", + "行程管理", + "免费WIFI", + "教辅", + "成人", + "婚庆", + "民宿短租", + "出国" + ] + ] +} + diff --git a/examples/few_shot/efl/prompt/ocnli.json b/examples/few_shot/efl/prompt/ocnli.json new file mode 100644 index 0000000000000000000000000000000000000000..caa7fd2c5719fccea5e3dd95a7c30648cae73b4e --- /dev/null +++ b/examples/few_shot/efl/prompt/ocnli.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”之间{'text': 'candidate_label'}。"} + ], + "verbalizer": [ + {"contradiction": "互相矛盾", "entailment": "相互包含", "neutral": "没有关系"} + ] +} diff --git a/examples/few_shot/efl/prompt/tnews.json b/examples/few_shot/efl/prompt/tnews.json new file mode 100644 index 0000000000000000000000000000000000000000..4580cd7662088667f59526dd93c196a9f3e8f730 --- /dev/null +++ b/examples/few_shot/efl/prompt/tnews.json @@ -0,0 +1,24 @@ +{ + "template": [ + {"text": "下边报道一条{'text': 'candidate_label'}新闻{'text':'sentence'}"} + ], + "verbalizer": [ + { + "news_story": "故事", + "news_entertainment": "明星", + "news_finance": "财经", + "news_sports": "体育", + "news_edu": "校园", + "news_game": "游戏", + "news_culture": "文化", + "news_tech": "科技", + "news_car": "汽车", + "news_travel": "旅行", + "news_world": "国际", + "news_agriculture": "农业", + "news_military": "军事", + "news_house": "房产", + "news_stock": "股票" + } + ] +} diff --git a/examples/few_shot/efl/run_train.py b/examples/few_shot/efl/run_train.py new file mode 100644 index 0000000000000000000000000000000000000000..8dd47043d762097f0ea9a5fdc7563ca72afe6338 --- /dev/null +++ b/examples/few_shot/efl/run_train.py @@ -0,0 +1,164 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +from dataclasses import dataclass, field +from functools import partial + +import paddle +from data import load_fewclue_dataset +from paddle.metric import Accuracy +from paddle.static import InputSpec +from utils import load_prompt_arguments, save_fewclue_prediction, save_pseudo_data + +from paddlenlp.prompt import ( + ManualTemplate, + ManualVerbalizer, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, +) +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + task_name: str = field(default="eprstmt", metadata={"help": "The task name in FewCLUE."}) + split_id: str = field(default="0", metadata={"help": "The split id of datasets, including 0, 1, 2, 3, 4, few_all."}) + prompt_path: str = field(default="prompt/eprstmt.json", metadata={"help": "Path to the defined prompts."}) + prompt_index: int = field(default=0, metadata={"help": "The index of defined prompt for training."}) + pseudo_data_path: str = field(default=None, metadata={"help": "Path to data with pseudo labels."}) + do_label: bool = field(default=False, metadata={"help": "Whether to label unsupervised data in unlabeled datasets"}) + do_test: bool = field(default=False, metadata={"help": "Whether to evaluate model on public test datasets."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-1.0-large-zh-cw", metadata={"help": "Build-in pretrained model name or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) + dropout: float = field(default=0.1, metadata={"help": "The dropout used for pretrained model."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + data_args = load_prompt_arguments(data_args) + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Load the pretrained language model. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained( + model_args.model_name_or_path, + num_labels=2, + hidden_dropout_prob=model_args.dropout, + attention_probs_dropout_prob=model_args.dropout, + ) + + # Define template for preprocess and verbalizer for postprocess. + template = ManualTemplate(data_args.prompt, tokenizer, training_args.max_seq_length) + logger.info("Using template: {}".format(template.prompt)) + + verbalizer = ManualVerbalizer(data_args.label_words, tokenizer) + ids_to_labels = {idx: label for idx, label in enumerate(verbalizer.labels)} + logger.info("Using verbalizer: {}".format(data_args.label_words)) + + # Load datasets. 
+ train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_fewclue_dataset(data_args, verbalizer=verbalizer) + + # Define the criterion. + criterion = paddle.nn.CrossEntropyLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, None, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds, num_labels): + metric = Accuracy() + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] + preds = preds.reshape([-1, num_labels]) + labels = paddle.to_tensor(eval_preds.label_ids) + labels = paddle.argmax(labels.reshape([-1, num_labels]), axis=1) + correct = metric.compute(preds, labels) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + # Initialize the trainer. + compute_metrics = partial(compute_metrics, num_labels=len(verbalizer.labels)) + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=None, + compute_metrics=compute_metrics, + ) + + # Traininig. + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + time_stamp = time.strftime("%m%d-%H-%M-%S", time.localtime()) + + # Test. + if data_args.do_test and public_test_ds is not None: + test_ret = trainer.predict(public_test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Predict. + if training_args.do_predict and test_ds is not None: + pred_ret = trainer.predict(test_ds) + logger.info("Prediction done.") + predict_path = os.path.join(training_args.output_dir, "fewclue_submit_examples_" + time_stamp) + save_fewclue_prediction(predict_path, data_args.task_name, pred_ret, verbalizer, ids_to_labels) + + # Label unsupervised data. + if data_args.do_label and unlabeled_ds is not None: + label_ret = trainer.predict(unlabeled_ds) + logger.info("Labeling done.") + pseudo_path = os.path.join(training_args.output_dir, "pseudo_data_" + time_stamp + ".txt") + save_pseudo_data(pseudo_path, data_args.task_name, label_ret, verbalizer, ids_to_labels) + + # Export static model. + if training_args.do_export: + input_spec = [ + InputSpec(shape=[None, None], dtype="int64"), # input_ids, + InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + InputSpec(shape=[None, None], dtype="int64"), # position_ids + InputSpec(shape=[None, None, None, None], dtype="float32"), # attention_mask + ] + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, input_spec=input_spec, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/examples/few_shot/efl/utils.py b/examples/few_shot/efl/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..dfc6463bb69d4d6719cf514ad55ec8a3ba765b4a --- /dev/null +++ b/examples/few_shot/efl/utils.py @@ -0,0 +1,252 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +import pathlib + +import numpy as np +import paddle + +from paddlenlp.datasets import load_dataset + +LABEL_TO_STANDARD = { + "tnews": { + "news_story": "100", + "news_culture": "101", + "news_entertainment": "102", + "news_sports": "103", + "news_finance": "104", + "news_house": "106", + "news_car": "107", + "news_edu": "108", + "news_tech": "109", + "news_military": "110", + "news_travel": "112", + "news_world": "113", + "news_stock": "114", + "news_agriculture": "115", + "news_game": "116", + }, + "iflytek": { + "打车": 0, + "美颜": 100, + "影像剪辑": 101, + "摄影修图": 102, + "相机": 103, + "绘画": 104, + "二手": 105, + "电商": 106, + "团购": 107, + "外卖": 108, + "电影票务": 109, + "社区服务": 10, + "社区超市": 110, + "购物咨询": 111, + "笔记": 112, + "办公": 113, + "日程管理": 114, + "女性": 115, + "经营": 116, + "收款": 117, + "其他": 118, + "薅羊毛": 11, + "魔幻": 12, + "仙侠": 13, + "卡牌": 14, + "飞行空战": 15, + "射击游戏": 16, + "休闲益智": 17, + "动作类": 18, + "体育竞技": 19, + "地图导航": 1, + "棋牌中心": 20, + "经营养成": 21, + "策略": 22, + "MOBA": 23, + "辅助工具": 24, + "约会社交": 25, + "即时通讯": 26, + "工作社交": 27, + "论坛圈子": 28, + "婚恋社交": 29, + "免费WIFI": 2, + "情侣社交": 30, + "社交工具": 31, + "生活社交": 32, + "微博博客": 33, + "新闻": 34, + "漫画": 35, + "小说": 36, + "技术": 37, + "教辅": 38, + "问答交流": 39, + "租车": 3, + "搞笑": 40, + "杂志": 41, + "百科": 42, + "影视娱乐": 43, + "求职": 44, + "兼职": 45, + "视频": 46, + "短视频": 47, + "音乐": 48, + "直播": 49, + "同城服务": 4, + "电台": 50, + "K歌": 51, + "成人": 52, + "中小学": 53, + "职考": 54, + "公务员": 55, + "英语": 56, + "视频教育": 57, + "高等教育": 58, + "成人教育": 59, + "快递物流": 5, + "艺术": 60, + "语言(非英语)": 61, + "旅游资讯": 62, + "综合预定": 63, + "民航": 64, + "铁路": 65, + "酒店": 66, + "行程管理": 67, + "民宿短租": 68, + "出国": 69, + "婚庆": 6, + "工具": 70, + "亲子儿童": 71, + "母婴": 72, + "驾校": 73, + "违章": 74, + "汽车咨询": 75, + "汽车交易": 76, + "日常养车": 77, + "行车辅助": 78, + "租房": 79, + "家政": 7, + "买房": 80, + "装修家居": 81, + "电子产品": 82, + "问诊挂号": 83, + "养生保健": 84, + "医疗服务": 85, + "减肥瘦身": 86, + "美妆美业": 87, + "菜谱": 88, + "餐饮店": 89, + "公共交通": 8, + "体育咨讯": 90, + "运动健身": 91, + "支付": 92, + "保险": 93, + "股票": 94, + "借贷": 95, + "理财": 96, + "彩票": 97, + "记账": 98, + "银行": 99, + "政务": 9, + }, +} + + +def load_prompt_arguments(args): + """ + Load prompt and label words according to prompt index. 
+ """ + with open(args.prompt_path, "r", encoding="utf-8") as fp: + configs = json.load(fp) + assert len(configs["verbalizer"]) == len(configs["template"]) + assert configs["verbalizer"][0] is not None + verbalizer = [configs["verbalizer"][0]] + last_verb_index = 0 + for index, verb in enumerate(configs["verbalizer"][1:]): + if verb is None or len(verb) == 0: + verbalizer.append(configs["verbalizer"][last_verb_index]) + else: + verbalizer.append(verb) + last_verb_index = index + 1 + configs["verbalizer"] = verbalizer + args.prompt = configs["template"][args.prompt_index]["text"] + label_words = configs["verbalizer"][args.prompt_index] + if isinstance(label_words, list): + label_words = {k: k for k in label_words} + args.label_words = label_words + return args + + +def save_pseudo_data(save_path, task_name, label_preds, verbalizer, labels): + """ + Combine unsupervised data and corresponding predicted labels and + save one example per line. + """ + if task_name == "cluewsc": + return None + + num_labels = len(labels) + data_ds = load_dataset("fewclue", name=task_name, splits="unlabeled") + preds = paddle.to_tensor(label_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1].numpy() + preds = preds.reshape([-1, num_labels]) + label_preds = np.argmax(preds, axis=1) + label_probs = np.max(preds, axis=1) + pseudo_data = [] + for index, example in enumerate(data_ds): + example["labels"] = labels[label_preds[index]] + example["prob"] = str(label_probs[index]) + pseudo_data.append(example) + save_data(pseudo_data, save_path) + + +def save_fewclue_prediction(save_path, task_name, label_preds, verbalizer, labels): + """ + Extract predicted labels and save as the format required by FewCLUE. + """ + num_labels = len(labels) + preds = paddle.to_tensor(label_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] + preds = preds.reshape([-1, num_labels]) + if task_name == "chid": + batch_size = preds.shape[0] + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] + preds = preds.reshape([batch_size // 7, 7]) + preds = paddle.nn.functional.softmax(preds, axis=1).numpy() + preds = np.argmax(preds, axis=1) + test_ds = load_dataset("fewclue", name=task_name, splits="test") + + ret_list = [] + maps = LABEL_TO_STANDARD.get(task_name, None) + for idx, example in enumerate(test_ds): + uid = example.get("id", idx) + if task_name in ["bustm", "csl"]: + ret_list.append({"id": uid, "label": str(preds[idx])}) + elif task_name == "chid": + ret_list.append({"id": uid, "answer": preds[idx]}) + elif task_name in ["cluewsc", "eprstmt", "ocnli", "csldcp"]: + ret_list.append({"id": uid, "label": labels[preds[idx]]}) + elif task_name in ["iflytek", "tnews"]: + ret_list.append({"id": uid, "label": str(maps[labels[preds[idx]]])}) + save_file = task_name if task_name in ["bustm", "csldcp", "eprstmt"] else task_name + "f" + save_data(ret_list, save_path, save_file + "_predict.json") + + +def save_data(data, save_path, save_file=None): + if save_file is not None: + pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) + save_path = os.path.join(save_path, save_file) + with open(save_path, "w") as fp: + for example in data: + fp.write(json.dumps(example, ensure_ascii=False) + "\n") diff --git a/examples/few_shot/p-tuning/README.md b/examples/few_shot/p-tuning/README.md new file mode 100644 index 0000000000000000000000000000000000000000..38d2f0af9c527f77e9743764d50d74804cfaa287 --- /dev/null +++ b/examples/few_shot/p-tuning/README.md @@ -0,0 +1,85 @@ +# P-Tuning + +[GPT 
Understands, Too](https://arxiv.org/pdf/2103.10385.pdf) + +## 算法简介 + +P-tuning 引入可学习的连续型提示向量 prompt embeddings 参数, 让模型自己去学习最优的 prompt embedding, 而不再依赖人工去设置自然语言形式的提示(Prompt)信息。P-Tuning 算法的数据和模型定义如下图所示,对应于数据预处理模块 `SoftTemplate` 和标签词映射模块 `MaskedLMVerbalizer`,详细介绍及定义方法参见 [Prompt API 文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md)。 + +![p-tuning](https://user-images.githubusercontent.com/25607475/204214359-3036c6c6-f101-4a5f-958c-abe0e40c243a.png) + + +## 快速开始 + +CLUE(Chinese Language Understanding Evaluation)作为中文语言理解权威测评榜单,在学术界和工业界都有着广泛影响。FewCLUE 是其设立的中文小样本学习测评子榜,旨在探索小样本学习最佳模型和中文实践。PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 PET 策略训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 +PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 P-tuning 策略训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 + +### 代码结构及说明 +``` +├── run_train.py # P-Tuning 算法提示学习脚本 +├── data.py # 数据集构造、数据增强 +├── utils.py # FewCLUE 提交结果保存等工具函数 +└── prompt/ # FewCLUE 各数据集的 prompt 定义文件 +``` + +### 数据准备 + +读取 FewCLUE 数据集只需要 1 行代码,这部分代码在 `data.py` 脚本中。以情感分类数据集 `eprstmt` 为例: +``` +from paddlenlp.datasets import load_dataset + +# 通过指定 "fewclue" 和数据集名字 name="eprstmt" 即可一键加载 FewCLUE 中的 eprstmt 数据集 +train_ds, dev_ds, public_test_ds = load_dataset("fewclue", name="eprstmt", splits=("train_0", "dev_0", "test_public")) +``` + +### 模型训练、评估、预测 + +通过如下命令,指定 GPU 0 卡, 使用一个连续型提示向量在 FewCLUE 的 `eprstmt` 数据集上进行训练和评估。如果要使用多个可学习连续型提示向量,可修改 `./prompt/` 目录下相应的文件,修改 `soft` 的长度属性 `length` 即可。 +``` +python -u -m paddle.distributed.launch --gpus "0" run_train.py \ + --output_dir checkpoint_eprstmt \ + --task_name eprstmt \ + --split_id few_all \ + --prompt_path prompt/eprstmt.json \ + --prompt_index 0 \ + --do_train \ + --do_eval \ + --do_test \ + --do_predict \ + --do_label \ + --max_steps 1000 \ + --learning_rate 3e-5 \ + --eval_steps 100 \ + --save_steps 100 \ + --logging_steps 5 \ + --per_device_train_batch_size 16 \ + --max_seq_length 128 \ + --load_best_model_at_end \ + --metric_for_best_model accuracy \ + --save_total_limit 1 +``` + +参数含义说明 +- `task_name`: FewCLUE 中的数据集名字 +- `split_id`: 数据集编号,包括0, 1, 2, 3, 4 和 few_all +- `prompt_path`: prompt 定义文件名 +- `prompt_index`: 使用定义文件中第 `prompt_index` 个 prompt +- `augment_type`: 数据增强策略,可选 swap, delete, insert, substitute +- `num_augment`: 数据增强策略为每个样本生成的样本数量 +- `word_augment_percent`: 每个序列中数据增强词所占的比例 +- `pseudo_data_path`: 使用模型标注的伪标签数据文件路径 +- `do_label`: 是否使用训练后的模型给无标签数据标注伪标签 +- `do_test`: 是否在公开测试集上评估模型效果 +- `model_name_or_path`: 预训练模型名,默认为 `ernie-1.0-large-zh-cw` +- `use_rdrop`: 是否使用对比学习策略 R-Drop +- `alpha_rdrop`: R-Drop 损失值权重,默认为 0.5 +- `dropout`: 预训练模型的 dropout 参数值,用于 R-Drop 策略中参数配置 +- `export_type`: 模型导出格式,默认为 `paddle`,动态图转静态图 +- 更多配置参考 [Trainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md#trainingarguments-%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D) 和 [PromptTrainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md#prompttrainer%E5%8F%82%E6%95%B0%E5%88%97%E8%A1%A8) + +### 模型部署 + +Coming soon... + +## References +[1]X. Liu et al., “GPT Understands, Too,” arXiv:2103.10385 [cs], Mar. 2021, Accessed: Mar. 22, 2021. [Online]. Available: http://arxiv.org/abs/2103.10385 diff --git a/examples/few_shot/p-tuning/data.py b/examples/few_shot/p-tuning/data.py new file mode 100644 index 0000000000000000000000000000000000000000..6f96ac02cdc8cddcde08a7cc09ca82587b41b174 --- /dev/null +++ b/examples/few_shot/p-tuning/data.py @@ -0,0 +1,202 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+from functools import partial
+
+import paddle
+
+from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap
+from paddlenlp.datasets import MapDataset, load_dataset
+
+
+def extend_with_pseudo_data(data_ds, pseudo_path, labels_to_ids):
+    """
+    Extend the train dataset with pseudo-labeled examples if they exist.
+    """
+    if pseudo_path is None:
+        return data_ds
+    with open(pseudo_path, "r", encoding="utf-8") as fp:
+        pseudo_data = [json.loads(x.strip()) for x in fp]
+    data_ds = MapDataset([x for x in data_ds] + pseudo_data)
+    return data_ds
+
+
+def extend_with_data_augment(data_ds, aug_type, num_aug=10, percent=0.1, aug_base="mlm", example_keys=None):
+    """
+    Extend the train dataset with augmented examples.
+    """
+    if example_keys is None:
+        return data_ds
+    if aug_type is None or aug_type == "None":
+        return data_ds
+    if aug_type == "delete":
+        aug = WordDelete(create_n=num_aug, aug_percent=percent)
+    elif aug_type == "substitute":
+        aug = WordSubstitute(aug_base, create_n=num_aug, aug_percent=percent)
+    elif aug_type == "insert":
+        aug = WordInsert(aug_base, create_n=num_aug, aug_percent=percent)
+    elif aug_type == "swap":
+        aug = WordSwap(create_n=num_aug, aug_percent=percent)
+    else:
+        raise ValueError("Unsupported data augment strategy `{}`".format(aug_type))
+
+    aug_data = []
+    for example in data_ds:
+        for key in example_keys:
+            text_aug = aug.augment(example[key])
+            for text in text_aug:
+                # Modify the copy rather than the original example so that the
+                # source data is left untouched.
+                new_example = example.copy()
+                new_example[key] = text
+                aug_data.append(new_example)
+
+    data_ds = MapDataset([x for x in data_ds] + aug_data)
+    return data_ds
+
+
+def convert_chid(data_ds):
+    """
+    Insert idioms into positions of `#idiom#` so that the task is converted
+    to binary classification.
+    """
+    split_data_ds = []
+    for example in data_ds:
+        fragments = example["content"].split("#idiom#")
+        label = example.get("answer", None)
+        for index, cand in enumerate(example["candidates"]):
+            new_example = {"content_pre": fragments[0], "content_post": fragments[1], "idiom": cand}
+            if label is not None:
+                new_example["label"] = str(int(index == label))
+            split_data_ds.append(new_example)
+    return MapDataset(split_data_ds)
+
+
+def convert_csl(data_ds):
+    """
+    Concatenate keywords. The manual concatenation can be replaced by the
+    `options` keyword in the develop version.
+    """
+    concat_data_ds = []
+    for example in data_ds:
+        example["keyword"] = ",".join(example["keyword"])
+        concat_data_ds.append(example)
+    return MapDataset(concat_data_ds)
+
+
+def convert_cluewsc(data_ds):
+    """
+    Mark the pronoun and entity with special tokens.
+ """ + marked_data_ds = [] + for example in data_ds: + target, text = example["target"], list(example["text"]) + pronoun, p_index = target["span2_text"], target["span2_index"] + entity, e_index = target["span1_text"], target["span1_index"] + label = example.get("label", None) + if p_index > e_index: + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + else: + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + new_example = {"text": "".join(text), "pronoun": pronoun, "entity": entity} + if label is not None: + new_example["label"] = label + marked_data_ds.append(new_example) + return MapDataset(marked_data_ds) + + +def convert_labels_to_ids(example, orig_key, labels_to_ids, pop_keys=None): + """ + Convert the keyword in datasets to `labels`. + """ + if orig_key in example: + example["label_ids"] = labels_to_ids[example.pop(orig_key)] + if pop_keys is not None: + for key in pop_keys: + if key in example: + example.pop(key) + return example + + +def convert_ids_to_words(example, token_ids): + """ + Convert label id to the first word in mapping from labels to words, + the length of which should coincide with that of `mask` in prompt. + """ + if "label_ids" in example: + labels = paddle.index_select(token_ids, paddle.to_tensor(example.pop("label_ids")), axis=0).squeeze(0) + example["labels"] = labels + return example + + +def load_fewclue_dataset(args, verbalizer, example_keys=None): + """ + Load fewclue datasets and convert them to the standard format of PET. + """ + split_id = args.split_id + splits = [f"train_{split_id}", f"dev_{split_id}", "test_public", "test"] + if args.task_name == "cluewsc": + train_ds, dev_ds, public_test_ds, test_ds = load_dataset("fewclue", name=args.task_name, splits=splits) + unlabeled_ds = None + else: + splits.append("unlabeled") + train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_dataset( + "fewclue", name=args.task_name, splits=splits + ) + data_ds = [train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds] + + # Preprocess data for mask prediction task. + if args.task_name == "chid": + for index, sub_data_ds in enumerate(data_ds): + data_ds[index] = convert_chid(sub_data_ds) + elif args.task_name == "cluewsc": + for index, sub_data_ds in enumerate(data_ds[:-1]): + data_ds[index] = convert_cluewsc(sub_data_ds) + elif args.task_name == "csl": + for index, sub_data_ds in enumerate(data_ds): + data_ds[index] = convert_csl(sub_data_ds) + orig_key = "label" + pop_keys = ["id"] + if args.task_name == "tnews": + orig_key = "label_desc" + pop_keys = ["keywords", "label", "id"] + elif args.task_name == "iflytek": + orig_key = "label_des" + pop_keys = ["id", "label"] + elif args.task_name == "ocnli": + pop_keys = ["level", "label0", "label1", "label2", "label3", "label4", "genre", "prem_id", "id"] + convert_label = partial( + convert_labels_to_ids, orig_key=orig_key, labels_to_ids=verbalizer.labels_to_ids, pop_keys=pop_keys + ) + for index, sub_data_ds in enumerate(data_ds): + if sub_data_ds is not None: + data_ds[index] = sub_data_ds.map(convert_label) + + # Extend train dataset with data augmentation and pseudo-label data. 
+ data_ds[0] = extend_with_data_augment( + data_ds[0], args.augment_type, args.num_augment, args.word_augment_percent, args.augment_method, example_keys + ) + data_ds[0] = extend_with_pseudo_data(data_ds[0], args.pseudo_data_path, verbalizer.labels_to_ids) + + dev_labels = [x["label_ids"] for x in data_ds[1]] + test_labels = [x["label_ids"] for x in data_ds[2]] + + convert_fn = partial(convert_ids_to_words, token_ids=verbalizer.token_ids[:, 0, :]) + data_ds[:3] = [x.map(convert_fn) for x in data_ds[:3]] + + return data_ds, (dev_labels, test_labels) diff --git a/examples/few_shot/p-tuning/prompt/bustm.json b/examples/few_shot/p-tuning/prompt/bustm.json new file mode 100644 index 0000000000000000000000000000000000000000..345930ea51a90b229df961300cc8b90ac954b436 --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/bustm.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'soft'}{'text': 'sentence1'}{'text': 'sentence2'}"} + ], + "verbalizer": [ + {"0": "不", "1": "很"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/chid.json b/examples/few_shot/p-tuning/prompt/chid.json new file mode 100644 index 0000000000000000000000000000000000000000..cc3b30195fa7c826921fec7bd2587176b2009f27 --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/chid.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'soft'}{'text':'content_pre'}{'text': 'idiom'}{'text': 'content_post'}"} + ], + "verbalizer": [ + {"0": "否", "1": "是"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/cluewsc.json b/examples/few_shot/p-tuning/prompt/cluewsc.json new file mode 100644 index 0000000000000000000000000000000000000000..c0ef7573441bd2c7fe05597c3a1e4371bee64dee --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/cluewsc.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'mask'}{'soft'}{'text': 'text'}{'text': 'pronoun'}指的是{'text': 'entity'}"} + ], + "verbalizer": [ + {"false": "错误", "true": "正确"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/csl.json b/examples/few_shot/p-tuning/prompt/csl.json new file mode 100644 index 0000000000000000000000000000000000000000..443ba172a2fea9e01c13089622b857c8105d8a0b --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/csl.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'soft'}本文关键词有{'text': 'keyword'}{'text': 'abst'}"} + ], + "verbalizer": [ + {"0": "不", "1": "很"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/csldcp.json b/examples/few_shot/p-tuning/prompt/csldcp.json new file mode 100644 index 0000000000000000000000000000000000000000..5bb12c680f4e5b5d0c4da69bfbbda8312e6135e5 --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/csldcp.json @@ -0,0 +1,76 @@ +{ + "template": [ + {"text": "{'mask'}{'mask'}{'soft'}{'text': 'content'}"} + ], + "verbalizer": [ + { + "材料科学与工程": "材料", + "作物学": "作物", + "口腔医学": "口腔", + "药学": "药学", + "教育学": "教育", + "水利工程": "水利", + "理论经济学": "理经", + "食品科学与工程": "食品", + "畜牧学/兽医学": "畜牧", + "体育学": "体育", + "核科学与技术": "核科", + "力学": "力学", + "园艺学": "园艺", + "水产": "水产", + "法学": "法学", + "地质学/地质资源与地质工程": "地质", + "石油与天然气工程": "石油", + "农林经济管理": "农林", + "信息与通信工程": "通信", + "图书馆、情报与档案管理": "图书", + "政治学": "政治", + "电气工程": "电气", + "海洋科学": "海洋", + "民族学": "民族", + "航空宇航科学与技术": "航空", + "化学/化学工程与技术": "化学", + "哲学": "哲学", + "公共卫生与预防医学": "卫生", + "艺术学": "艺术", + "农业工程": "农工", + "船舶与海洋工程": "船舶", + "计算机科学与技术": "计科", + "冶金工程": "冶金", + "交通运输工程": "交通", + "动力工程及工程热物理": "动力", + "纺织科学与工程": "纺织", + "建筑学": "建筑", + "环境科学与工程": "环境", + "公共管理": "公管", + "数学": "数学", + "物理学": "物理", + "林学/林业工程": "林学", + "心理学": "心理", + "历史学": "历史", + "工商管理": "工管", + 
"应用经济学": "应经", + "中医学/中药学": "中医", + "天文学": "天文", + "机械工程": "机械", + "土木工程": "土木", + "光学工程": "光学", + "地理学": "地理", + "农业资源利用": "农业", + "生物学/生物科学与工程": "生物", + "兵器科学与技术": "兵器", + "矿业工程": "矿业", + "大气科学": "大气", + "基础医学/临床医学": "基础", + "电子科学与技术": "电子", + "测绘科学与技术": "测绘", + "控制科学与工程": "控制", + "军事学": "军事", + "中国语言文学": "中文", + "新闻传播学": "新闻", + "社会学": "社会", + "地球物理学":"地球", + "植物保护":"植保" + } + ] +} diff --git a/examples/few_shot/p-tuning/prompt/eprstmt.json b/examples/few_shot/p-tuning/prompt/eprstmt.json new file mode 100644 index 0000000000000000000000000000000000000000..ea6941cdd963f096f4429c28368c738594df149a --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/eprstmt.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'soft'}{'text':'sentence'}"} + ], + "verbalizer": [ + {"Negative": "不", "Positive": "很"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/iflytek.json b/examples/few_shot/p-tuning/prompt/iflytek.json new file mode 100644 index 0000000000000000000000000000000000000000..198ce19949738290b6494777b47ff359fdd00ab7 --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/iflytek.json @@ -0,0 +1,129 @@ +{ + "template": [ + {"text": "{'mask': None, 'length': 4}{'soft'}{'text': 'sentence'}"} + ], + "verbalizer": [ + { + "银行": "银行办理", + "社区服务": "社区服务", + "电商": "电商网购", + "支付": "支付交易", + "经营养成": "经营养成", + "卡牌": "卡牌游戏", + "借贷": "借贷借款", + "驾校": "驾校学车", + "理财": "投资理财", + "职考": "职业考试", + "新闻": "新闻资讯", + "旅游资讯": "旅游资讯", + "公共交通": "公共交通", + "魔幻": "魔幻游戏", + "医疗服务": "医疗服务", + "影像剪辑": "影像剪辑", + "动作类": "动作游戏", + "工具": "使用工具", + "体育竞技": "体育竞技", + "小说": "小说阅读", + "运动健身": "运动健身", + "相机": "相机拍照", + "辅助工具": "辅助工具", + "快递物流": "快递物流", + "高等教育": "高等教育", + "股票": "股票炒股", + "菜谱": "做菜菜谱", + "行车辅助": "行车帮助", + "仙侠": "仙侠小说", + "亲子儿童": "亲子儿童", + "购物咨询": "购物资讯", + "射击游戏": "射击游戏", + "漫画": "动漫漫画", + "中小学": "中学小学", + "同城服务": "同城跑腿", + "成人教育": "成人教育", + "求职": "面试求职", + "电子产品": "电子产品", + "艺术": "艺术学习", + "薅羊毛": "比价省钱", + "约会社交": "约会社交", + "经营": "经营管理", + "兼职": "兼职赚钱", + "短视频": "拍短视频", + "音乐": "音乐乐库", + "英语": "英语学习", + "棋牌中心": "棋牌中心", + "摄影修图": "摄影修图", + "养生保健": "养生保健", + "办公": "办公工具", + "政务": "政务服务", + "视频": "视频拍摄", + "论坛圈子": "论坛圈子", + "彩票": "彩票乐透", + "直播": "直播娱乐", + "其他": "其他类别", + "休闲益智": "休闲益智", + "策略": "策略游戏", + "即时通讯": "即时通讯", + "汽车交易": "汽车交易", + "违章": "违章罚款", + "地图导航": "地图导航", + "民航": "民用航空", + "电台": "电台播报", + "语言(非英语)": "小语种类", + "搞笑": "搞笑娱乐", + "婚恋社交": "婚恋社交", + "社区超市": "社区超市", + "日常养车": "日常养车", + "杂志": "杂志期刊", + "视频教育": "线上教育", + "家政": "家政服务", + "影视娱乐": "影视娱乐", + "装修家居": "装修家居", + "体育咨讯": "体育资讯", + "社交工具": "社交工具", + "餐饮店": "餐饮美食", + "美颜": "美颜相机", + "问诊挂号": "问诊挂号", + "飞行空战": "飞行空战", + "综合预定": "综合预定", + "电影票务": "电影票务", + "笔记": "笔记记录", + "买房": "买房购房", + "外卖": "外卖配送", + "母婴": "母婴产品", + "打车": "打车出行", + "情侣社交": "情侣社交", + "日程管理": "日程管理", + "租车": "租车出行", + "微博博客": "微博博客", + "百科": "知识百科", + "绘画": "绘画学习", + "铁路": "铁路交通", + "生活社交": "生活社交", + "租房": "租房房源", + "酒店": "酒店住宿", + "保险": "保险理赔", + "问答交流": "问答交流", + "收款": "收款交易", + "MOBA": "多人竞技", + "K歌": "唱歌K歌", + "技术": "技术学习", + "减肥瘦身": "减肥瘦身", + "工作社交": "工作社交", + "团购": "团购拼单", + "记账": "记录记账", + "女性": "女性生活", + "公务员": "公务员类", + "二手": "二手交易", + "美妆美业": "美妆美业", + "汽车咨询": "汽车资讯", + "行程管理": "行程管理", + "免费WIFI": "WIFI", + "教辅": "教育辅助", + "成人": "成人两性", + "婚庆": "婚庆结婚", + "民宿短租": "民宿短租", + "出国": "出国相关" + } + ] +} + diff --git a/examples/few_shot/p-tuning/prompt/ocnli.json b/examples/few_shot/p-tuning/prompt/ocnli.json new file mode 100644 index 0000000000000000000000000000000000000000..796cb691f99d80806e8b01be2ef6240964c9e7a2 --- /dev/null +++ 
b/examples/few_shot/p-tuning/prompt/ocnli.json @@ -0,0 +1,8 @@ +{ + "template": [ + {"text": "{'mask'}{'mask'}{'soft'}{'text': 'sentence1'}{'text': 'sentence2'}"} + ], + "verbalizer": [ + {"contradiction": "不同", "entailment": "相似", "neutral": "无关"} + ] +} diff --git a/examples/few_shot/p-tuning/prompt/tnews.json b/examples/few_shot/p-tuning/prompt/tnews.json new file mode 100644 index 0000000000000000000000000000000000000000..822c30badd5237c2ca4313e94cb273f13e737eb2 --- /dev/null +++ b/examples/few_shot/p-tuning/prompt/tnews.json @@ -0,0 +1,24 @@ +{ + "template": [ + {"text": "{'mask'}{'mask'}{'soft'}{'text':'sentence'}"} + ], + "verbalizer": [ + { + "news_story": "八卦", + "news_entertainment": "明星", + "news_finance": "财经", + "news_sports": "体育", + "news_edu": "校园", + "news_game": "游戏", + "news_culture": "文化", + "news_tech": "科技", + "news_car": "汽车", + "news_travel": "旅行", + "news_world": "国际", + "news_agriculture": "农业", + "news_military": "军事", + "news_house": "房子", + "news_stock": "股票" + } + ] +} diff --git a/examples/few_shot/p-tuning/run_train.py b/examples/few_shot/p-tuning/run_train.py new file mode 100644 index 0000000000000000000000000000000000000000..abe66b7bd3fa883f246f05893ec432d57ace5f48 --- /dev/null +++ b/examples/few_shot/p-tuning/run_train.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
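+
+# This script runs P-tuning on FewCLUE classification tasks: the prompt text
+# and label words are read from the JSON files under `prompt/`, a SoftTemplate
+# supplies the trainable prompt embeddings, and PromptTrainer drives training,
+# evaluation, prediction, pseudo-labeling and static-graph export.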
+ +import os +import time +from dataclasses import dataclass, field +from functools import partial + +import paddle +from data import load_fewclue_dataset +from paddle.metric import Accuracy +from paddle.static import InputSpec +from utils import load_prompt_arguments, save_fewclue_prediction, save_pseudo_data + +from paddlenlp.prompt import ( + MaskedLMVerbalizer, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, + SoftTemplate, +) +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + task_name: str = field(default="eprstmt", metadata={"help": "The task name in FewCLUE."}) + split_id: str = field(default="0", metadata={"help": "The split id of datasets, including 0, 1, 2, 3, 4, few_all."}) + prompt_path: str = field(default="prompt/eprstmt.json", metadata={"help": "Path to the defined prompts."}) + prompt_index: int = field(default=0, metadata={"help": "The index of defined prompt for training."}) + augment_type: str = field(default=None, metadata={"help": "The strategy used for data augmentation, including `swap`, `delete`, `insert`, `subsitute`."}) + num_augment: str = field(default=5, metadata={"help": "Number of augmented data per example, which works when `augment_type` is set."}) + word_augment_percent: str = field(default=0.1, metadata={"help": "Percentage of augmented words in sequences, used for `swap`, `delete`, `insert`, `subsitute`."}) + augment_method: str = field(default="mlm", metadata={"help": "Strategy used for `insert` and `subsitute`."}) + pseudo_data_path: str = field(default=None, metadata={"help": "Path to data with pseudo labels."}) + do_label: bool = field(default=False, metadata={"help": "Whether to label unsupervised data in unlabeled datasets"}) + do_test: bool = field(default=False, metadata={"help": "Whether to evaluate model on public test datasets."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-1.0-large-zh-cw", metadata={"help": "Build-in pretrained model name or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) + dropout: float = field(default=0.1, metadata={"help": "The dropout used for pretrained model."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + data_args = load_prompt_arguments(data_args) + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Load the pretrained language model. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForMaskedLM.from_pretrained( + model_args.model_name_or_path, + hidden_dropout_prob=model_args.dropout, + attention_probs_dropout_prob=model_args.dropout, + ) + + # Define template for preprocess and verbalizer for postprocess. 
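+    # The {'soft'} tokens declared in the prompt JSON become continuous,
+    # trainable prompt embeddings (the core of P-tuning); the word embeddings
+    # passed to SoftTemplate are used to initialize them. The {'mask'}
+    # positions are scored by the MLM head and mapped back to label words by
+    # the verbalizer defined below.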
+ template = SoftTemplate(data_args.prompt, tokenizer, training_args.max_seq_length, model.get_input_embeddings()) + logger.info("Using template: {}".format(template.prompt)) + + verbalizer = MaskedLMVerbalizer(data_args.label_words, tokenizer) + labels_to_ids = verbalizer.labels_to_ids + ids_to_labels = {idx: label for label, idx in labels_to_ids.items()} + logger.info("Using verbalizer: {}".format(data_args.label_words)) + + # Load datasets. + data_ds, label_list = load_fewclue_dataset(data_args, verbalizer=verbalizer, example_keys=template.example_keys) + train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = data_ds + dev_labels, test_labels = label_list + + # Define the criterion. + criterion = paddle.nn.CrossEntropyLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds, labels, verbalizer): + metric = Accuracy() + predictions = paddle.to_tensor(eval_preds.predictions) + predictions = verbalizer.aggregate_multiple_mask(predictions) + correct = metric.compute(predictions, paddle.to_tensor(labels)) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + # Initialize the trainer. + dev_compute_metrics = partial(compute_metrics, labels=dev_labels, verbalizer=verbalizer) + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=None, + compute_metrics=dev_compute_metrics, + ) + + # Traininig. + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + time_stamp = time.strftime("%m%d-%H-%M-%S", time.localtime()) + + # Test. + if data_args.do_test and public_test_ds is not None: + test_compute_metrics = partial(compute_metrics, labels=test_labels, verbalizer=verbalizer) + trainer.compute_metrics = test_compute_metrics + test_ret = trainer.predict(public_test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Predict. + if training_args.do_predict and test_ds is not None: + pred_ret = trainer.predict(test_ds) + logger.info("Prediction done.") + predict_path = os.path.join(training_args.output_dir, "fewclue_submit_examples_" + time_stamp) + save_fewclue_prediction(predict_path, data_args.task_name, pred_ret, verbalizer, ids_to_labels) + + # Label unsupervised data. + if data_args.do_label and unlabeled_ds is not None: + label_ret = trainer.predict(unlabeled_ds) + logger.info("Labeling done.") + pseudo_path = os.path.join(training_args.output_dir, "pseudo_data_" + time_stamp + ".txt") + save_pseudo_data(pseudo_path, data_args.task_name, label_ret, verbalizer, ids_to_labels) + + # Export static model. 
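+    # Each InputSpec below matches one tensor produced by the template for a
+    # batch: the standard ERNIE inputs, the masked positions to predict, and
+    # soft_token_ids marking where the trainable prompt embeddings are placed;
+    # encoder_ids is appended only when the template defines a prompt encoder.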
+ if training_args.do_export: + template = prompt_model.template + template_keywords = template.extract_template_keywords(template.prompt) + input_spec = [ + InputSpec(shape=[None, None], dtype="int64"), # input_ids, + InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + InputSpec(shape=[None, None], dtype="int64"), # position_ids + InputSpec(shape=[None, None, None, None], dtype="float32"), # attention_mask + InputSpec(shape=[None], dtype="int64"), # masked_positions + InputSpec(shape=[None, None], dtype="int64"), # soft_token_ids + ] + if "encoder" in template_keywords: + input_spec.append(InputSpec(shape=[None, None], dtype="int64")) # encoder_ids + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, input_spec=input_spec, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/examples/few_shot/p-tuning/utils.py b/examples/few_shot/p-tuning/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..989b4e6b81a8d156f93bf256e79ba5a6ed201197 --- /dev/null +++ b/examples/few_shot/p-tuning/utils.py @@ -0,0 +1,249 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +import pathlib + +import numpy as np +import paddle + +from paddlenlp.datasets import load_dataset + +LABEL_TO_STANDARD = { + "tnews": { + "news_story": "100", + "news_culture": "101", + "news_entertainment": "102", + "news_sports": "103", + "news_finance": "104", + "news_house": "106", + "news_car": "107", + "news_edu": "108", + "news_tech": "109", + "news_military": "110", + "news_travel": "112", + "news_world": "113", + "news_stock": "114", + "news_agriculture": "115", + "news_game": "116", + }, + "iflytek": { + "打车": 0, + "美颜": 100, + "影像剪辑": 101, + "摄影修图": 102, + "相机": 103, + "绘画": 104, + "二手": 105, + "电商": 106, + "团购": 107, + "外卖": 108, + "电影票务": 109, + "社区服务": 10, + "社区超市": 110, + "购物咨询": 111, + "笔记": 112, + "办公": 113, + "日程管理": 114, + "女性": 115, + "经营": 116, + "收款": 117, + "其他": 118, + "薅羊毛": 11, + "魔幻": 12, + "仙侠": 13, + "卡牌": 14, + "飞行空战": 15, + "射击游戏": 16, + "休闲益智": 17, + "动作类": 18, + "体育竞技": 19, + "地图导航": 1, + "棋牌中心": 20, + "经营养成": 21, + "策略": 22, + "MOBA": 23, + "辅助工具": 24, + "约会社交": 25, + "即时通讯": 26, + "工作社交": 27, + "论坛圈子": 28, + "婚恋社交": 29, + "免费WIFI": 2, + "情侣社交": 30, + "社交工具": 31, + "生活社交": 32, + "微博博客": 33, + "新闻": 34, + "漫画": 35, + "小说": 36, + "技术": 37, + "教辅": 38, + "问答交流": 39, + "租车": 3, + "搞笑": 40, + "杂志": 41, + "百科": 42, + "影视娱乐": 43, + "求职": 44, + "兼职": 45, + "视频": 46, + "短视频": 47, + "音乐": 48, + "直播": 49, + "同城服务": 4, + "电台": 50, + "K歌": 51, + "成人": 52, + "中小学": 53, + "职考": 54, + "公务员": 55, + "英语": 56, + "视频教育": 57, + "高等教育": 58, + "成人教育": 59, + "快递物流": 5, + "艺术": 60, + "语言(非英语)": 61, + "旅游资讯": 62, + "综合预定": 63, + "民航": 64, + "铁路": 65, + "酒店": 66, + "行程管理": 67, + "民宿短租": 68, + "出国": 69, + "婚庆": 6, + "工具": 70, + "亲子儿童": 71, + "母婴": 72, + "驾校": 73, + "违章": 74, + "汽车咨询": 75, + "汽车交易": 76, + "日常养车": 77, + "行车辅助": 
78, + "租房": 79, + "家政": 7, + "买房": 80, + "装修家居": 81, + "电子产品": 82, + "问诊挂号": 83, + "养生保健": 84, + "医疗服务": 85, + "减肥瘦身": 86, + "美妆美业": 87, + "菜谱": 88, + "餐饮店": 89, + "公共交通": 8, + "体育咨讯": 90, + "运动健身": 91, + "支付": 92, + "保险": 93, + "股票": 94, + "借贷": 95, + "理财": 96, + "彩票": 97, + "记账": 98, + "银行": 99, + "政务": 9, + }, +} + + +def load_prompt_arguments(args): + """ + Load prompt and label words according to prompt index. + """ + with open(args.prompt_path, "r", encoding="utf-8") as fp: + configs = json.load(fp) + assert len(configs["verbalizer"]) == len(configs["template"]) + assert configs["verbalizer"][0] is not None + verbalizer = [configs["verbalizer"][0]] + last_verb_index = 0 + for index, verb in enumerate(configs["verbalizer"][1:]): + if verb is None or len(verb) == 0: + verbalizer.append(configs["verbalizer"][last_verb_index]) + else: + verbalizer.append(verb) + last_verb_index = index + 1 + configs["verbalizer"] = verbalizer + args.prompt = configs["template"][args.prompt_index]["text"] + label_words = configs["verbalizer"][args.prompt_index] + if isinstance(label_words, list): + label_words = {k: k for k in label_words} + args.label_words = label_words + return args + + +def save_pseudo_data(save_path, task_name, label_preds, verbalizer, labels): + """ + Combine unsupervised data and corresponding predicted labels and + save one example per line. + """ + if task_name == "cluewsc": + return None + + data_ds = load_dataset("fewclue", name=task_name, splits="unlabeled") + preds = paddle.to_tensor(label_preds.predictions) + preds = verbalizer.aggregate_multiple_mask(preds) + preds = paddle.nn.functional.softmax(preds, axis=1).numpy() + label_preds = np.argmax(preds, axis=1) + label_probs = np.max(preds, axis=1) + pseudo_data = [] + for index, example in enumerate(data_ds): + example["labels"] = labels[label_preds[index]] + example["prob"] = str(label_probs[index]) + pseudo_data.append(example) + save_data(pseudo_data, save_path) + + +def save_fewclue_prediction(save_path, task_name, label_preds, verbalizer, labels): + """ + Extract predicted labels and save as the format required by FewCLUE. 
+ """ + preds = paddle.to_tensor(label_preds.predictions) + preds = verbalizer.aggregate_multiple_mask(preds) + if task_name == "chid": + batch_size = preds.shape[0] + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] + preds = preds.reshape([batch_size // 7, 7]) + preds = paddle.nn.functional.softmax(preds, axis=1).numpy() + preds = np.argmax(preds, axis=1) + test_ds = load_dataset("fewclue", name=task_name, splits="test") + + ret_list = [] + maps = LABEL_TO_STANDARD.get(task_name, None) + for idx, example in enumerate(test_ds): + uid = example.get("id", idx) + if task_name in ["bustm", "csl"]: + ret_list.append({"id": uid, "label": str(preds[idx])}) + elif task_name == "chid": + ret_list.append({"id": uid, "answer": preds[idx]}) + elif task_name in ["cluewsc", "eprstmt", "ocnli", "csldcp"]: + ret_list.append({"id": uid, "label": labels[preds[idx]]}) + elif task_name in ["iflytek", "tnews"]: + ret_list.append({"id": uid, "label": str(maps[labels[preds[idx]]])}) + save_file = task_name if task_name in ["bustm", "csldcp", "eprstmt"] else task_name + "f" + save_data(ret_list, save_path, save_file + "_predict.json") + + +def save_data(data, save_path, save_file=None): + if save_file is not None: + pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) + save_path = os.path.join(save_path, save_file) + with open(save_path, "w") as fp: + for example in data: + fp.write(json.dumps(example, ensure_ascii=False) + "\n") diff --git a/examples/few_shot/pet/README.md b/examples/few_shot/pet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0499883d4707227692f38d3acffee4c82176b9cc --- /dev/null +++ b/examples/few_shot/pet/README.md @@ -0,0 +1,84 @@ +# PET + +[Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference](https://arxiv.org/abs/2001.07676) + +## 算法简介 + +自然语言处理任务可以通过给预训练模型提供“任务描述”等方式来进行无监督学习,但效果一般低于有监督训练。而 Pattern-Exploiting Training (PET) 是一种半监督方法,通过将输入转换为完形填空形式的短语来帮助语言模型理解任务。然后用这些短语来给无标注数据打软标签。最后在得到的标注数据集上用有监督方法进行训练。在小样本设置下,PET 在部分任务上远超有监督学习和强半监督学习方法。以 PET 为代表的提示学习与微调学习的区别如下图所示,包括数据预处理模块 `Template` 和标签词映射模块 `Verbalizer`。详细介绍及定义方法参见 [Prompt API 文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md)。 + +![PET_and_FT](https://user-images.githubusercontent.com/25607475/192727706-0a17b5ef-db6b-46be-894d-0ee315306776.png) + + +## 快速开始 + +CLUE(Chinese Language Understanding Evaluation)作为中文语言理解权威测评榜单,在学术界和工业界都有着广泛影响。FewCLUE 是其设立的中文小样本学习测评子榜,旨在探索小样本学习最佳模型和中文实践。PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 PET 算法训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 + +### 代码结构说明 +``` +├── run_train.py # PET 算法提示学习脚本 +├── data.py # 数据集构造、数据增强 +├── utils.py # FewCLUE 提交结果保存等工具函数 +└── prompt/ # FewCLUE 各数据集的 prompt 定义文件 +``` + +### 数据准备 + +读取 FewCLUE 数据集只需要 1 行代码,这部分代码在 `data.py` 脚本中。以情感分类数据集 `eprstmt` 为例: + +``` +from paddlenlp.datasets import load_dataset + +# 通过指定 "fewclue" 和数据集名字 name="eprstmt" 即可一键加载 FewCLUE 中的eprstmt 数据集 +train_ds, dev_ds, public_test_ds = load_dataset("fewclue", name="eprstmt", splits=("train_0", "dev_0", "test_public")) +``` + +### 模型训练、评估、预测 + +通过如下命令,指定 GPU 0 卡, 在 FewCLUE 的 `eprstmt` 数据集上进行训练&评估 +``` +python -u -m paddle.distributed.launch --gpus "0" run_train.py \ + --output_dir checkpoint_eprstmt \ + --task_name eprstmt \ + --split_id few_all \ + --prompt_path prompt/eprstmt.json \ + --prompt_index 0 \ + --do_train \ + --do_eval \ + --do_test \ + --do_predict \ + --do_label \ + --max_steps 1000 \ + --learning_rate 3e-5 \ + --eval_steps 100 \ + --save_steps 100 \ + --logging_steps 5 \ + 
--per_device_train_batch_size 16 \ + --max_seq_length 128 \ + --load_best_model_at_end \ + --metric_for_best_model accuracy \ + --save_total_limit 1 +``` +参数含义说明 +- `task_name`: FewCLUE 中的数据集名字 +- `split_id`: 数据集编号,包括0, 1, 2, 3, 4 和 few_all +- `prompt_path`: prompt 定义文件名 +- `prompt_index`: 使用定义文件中第 `prompt_index` 个 prompt +- `augment_type`: 数据增强策略,可选 swap, delete, insert, substitute +- `num_augment`: 数据增强策略为每个样本生成的样本数量 +- `word_augment_percent`: 每个序列中数据增强词所占的比例 +- `pseudo_data_path`: 使用模型标注的伪标签数据文件路径 +- `do_label`: 是否使用训练后的模型给无标签数据标注伪标签 +- `do_test`: 是否在公开测试集上评估模型效果 +- `model_name_or_path`: 预训练模型名,默认为 `ernie-1.0-large-zh-cw` +- `use_rdrop`: 是否使用对比学习策略 R-Drop +- `alpha_rdrop`: R-Drop 损失值权重,默认为 0.5 +- `dropout`: 预训练模型的 dropout 参数值,用于 R-Drop 策略中参数配置 +- `export_type`: 模型导出格式,默认为 `paddle`,动态图转静态图 +- 更多配置参考 [Trainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md#trainingarguments-%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D) 和 [PromptTrainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md#prompttrainer%E5%8F%82%E6%95%B0%E5%88%97%E8%A1%A8) + +### 模型部署 + +Coming soon... + +## References +[1] Schick, Timo, and Hinrich Schütze. “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference.” ArXiv:2001.07676 [Cs], January 25, 2021. http://arxiv.org/abs/2001.07676. diff --git a/examples/few_shot/pet/data.py b/examples/few_shot/pet/data.py new file mode 100644 index 0000000000000000000000000000000000000000..ba2cca68383011e989c4b0197500dd6757fe3708 --- /dev/null +++ b/examples/few_shot/pet/data.py @@ -0,0 +1,191 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +from functools import partial + +import paddle + +from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap +from paddlenlp.datasets import MapDataset, load_dataset + + +def extend_with_pseudo_data(data_ds, pseudo_path, labels_to_ids): + """ + Extend train dataset with pseudo labeled examples if exists. + """ + if pseudo_path is None: + return data_ds + with open(pseudo_path, "r", encoding="utf-8") as fp: + pseudo_data = [json.loads(x.strip()) for x in fp] + data_ds = MapDataset([x for x in data_ds] + pseudo_data) + return data_ds + + +def extend_with_data_augment(data_ds, aug_type, num_aug=10, percent=0.1, aug_base="mlm", example_keys=None): + """ + Extend train dataset with augmentation. 
+ """ + if example_keys is None: + return data_ds + if aug_type is None or aug_type == "None": + return data_ds + if aug_type == "delete": + aug = WordDelete(create_n=num_aug, aug_percent=percent) + elif aug_type == "substitute": + aug = WordSubstitute(aug_base, create_n=num_aug, aug_percent=percent) + elif aug_type == "insert": + aug = WordInsert(aug_base, create_n=num_aug, aug_percent=percent) + elif aug_type == "swap": + aug = WordSwap(create_n=num_aug, aug_percent=percent) + else: + raise ValueError("Unsupported data augment strategy `{}`".format(aug_type)) + + aug_data = [] + for example in data_ds: + for key in example_keys: + text_aug = aug.augment(example[key]) + for text in text_aug: + new_example = example.copy() + example[key] = text + aug_data.append(new_example) + + data_ds = MapDataset([x for x in data_ds] + aug_data) + return data_ds + + +def convert_chid(data_ds): + """ + Insert idioms into positions of `#idiom#` so that the task is converted + to binary classification. + """ + split_data_ds = [] + for example in data_ds: + fragments = example["content"].split("#idiom#") + label = example.get("answer", None) + for index, cand in enumerate(example["candidates"]): + new_example = {"content_pre": fragments[0], "content_post": fragments[1], "idiom": cand} + if label is not None: + new_example["label"] = str(int(index == label)) + split_data_ds.append(new_example) + return MapDataset(split_data_ds) + + +def convert_csl(data_ds): + """ + Concatanate keywords and it can be replaced by keyword `options` in develop versioin. + """ + concat_data_ds = [] + for example in data_ds: + example["keyword"] = ",".join(example["keyword"]) + concat_data_ds.append(example) + return MapDataset(concat_data_ds) + + +def convert_cluewsc(data_ds): + """ + Mark the pronoun and entity with special tokens. + """ + marked_data_ds = [] + for example in data_ds: + target, text = example["target"], list(example["text"]) + pronoun, p_index = target["span2_text"], target["span2_index"] + entity, e_index = target["span1_text"], target["span1_index"] + label = example.get("label", None) + if p_index > e_index: + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + else: + text.insert(e_index, "[") + text.insert(e_index + len(entity) + 1, "]") + text.insert(p_index, "_") + text.insert(p_index + len(pronoun) + 1, "_") + new_example = {"text": "".join(text), "pronoun": pronoun, "entity": entity} + if label is not None: + new_example["label"] = label + marked_data_ds.append(new_example) + return MapDataset(marked_data_ds) + + +def convert_labels_to_ids(example, orig_key, labels_to_ids): + """ + Convert the keyword in datasets to `labels`. + """ + if orig_key in example: + example["label_ids"] = labels_to_ids[example.pop(orig_key)] + return example + + +def convert_ids_to_words(example, token_ids): + """ + Convert label id to the first word in mapping from labels to words, + the length of which should coincide with that of `mask` in prompt. + """ + if "label_ids" in example: + labels = paddle.index_select(token_ids, paddle.to_tensor(example.pop("label_ids")), axis=0).squeeze(0) + example["labels"] = labels + return example + + +def load_fewclue_dataset(args, verbalizer, example_keys=None): + """ + Load fewclue datasets and convert them to the standard format of PET. 
+ """ + split_id = args.split_id + splits = [f"train_{split_id}", f"dev_{split_id}", "test_public", "test"] + if args.task_name == "cluewsc": + train_ds, dev_ds, public_test_ds, test_ds = load_dataset("fewclue", name=args.task_name, splits=splits) + unlabeled_ds = None + else: + splits.append("unlabeled") + train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_dataset( + "fewclue", name=args.task_name, splits=splits + ) + data_ds = [train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds] + + # Preprocess data for mask prediction task. + if args.task_name == "chid": + for index, sub_data_ds in enumerate(data_ds): + data_ds[index] = convert_chid(sub_data_ds) + elif args.task_name == "cluewsc": + for index, sub_data_ds in enumerate(data_ds[:-1]): + data_ds[index] = convert_cluewsc(sub_data_ds) + elif args.task_name == "csl": + for index, sub_data_ds in enumerate(data_ds): + data_ds[index] = convert_csl(sub_data_ds) + orig_key = "label" + if args.task_name == "tnews": + orig_key = "label_desc" + elif args.task_name == "iflytek": + orig_key = "label_des" + convert_label = partial(convert_labels_to_ids, orig_key=orig_key, labels_to_ids=verbalizer.labels_to_ids) + for index, sub_data_ds in enumerate(data_ds): + if sub_data_ds is not None: + data_ds[index] = sub_data_ds.map(convert_label) + + # Extend train dataset with data augmentation and pseudo-label data. + data_ds[0] = extend_with_data_augment( + data_ds[0], args.augment_type, args.num_augment, args.word_augment_percent, args.augment_method, example_keys + ) + data_ds[0] = extend_with_pseudo_data(data_ds[0], args.pseudo_data_path, verbalizer.labels_to_ids) + + dev_labels = [x["label_ids"] for x in data_ds[1]] + test_labels = [x["label_ids"] for x in data_ds[2]] + + convert_fn = partial(convert_ids_to_words, token_ids=verbalizer.token_ids[:, 0, :]) + data_ds[:3] = [x.map(convert_fn) for x in data_ds[:3]] + + return data_ds, (dev_labels, test_labels) diff --git a/examples/few_shot/pet/prompt/bustm.json b/examples/few_shot/pet/prompt/bustm.json new file mode 100644 index 0000000000000000000000000000000000000000..ab377ea85708450af14a138fd48049a6b9df447d --- /dev/null +++ b/examples/few_shot/pet/prompt/bustm.json @@ -0,0 +1,14 @@ +{ + "template": [ + {"text": "下边两句话说的是一个事情吗?{'mask'}“{'text': 'sentence1'}”和“{'text': 'sentence2'}”"}, + {"text": "下边两个句子说的是{'mask'}{'mask'}的事情。“{'text': 'sentence1'}”和“{'text': 'sentence2'}”"}, + {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”意思{'mask'}{'mask'}。"}, + {"text": "“{'text':'sentence1'}”和“{'text':'sentence2'}”描述的是{'mask'}{'mask'}的事情。"} + ], + "verbalizer": [ + {"0": "不", "1": "是"}, + {"0": "不同", "1": "相同"}, + {"0": "不同", "1": "一样"}, + {"0": "不同", "1": "相同"} + ] +} diff --git a/examples/few_shot/pet/prompt/chid.json b/examples/few_shot/pet/prompt/chid.json new file mode 100644 index 0000000000000000000000000000000000000000..24dac2d41100e9640af01269479fb6de12b0ca97 --- /dev/null +++ b/examples/few_shot/pet/prompt/chid.json @@ -0,0 +1,14 @@ +{ + "template": [ + {"text": "{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}{'mask'}"}, + {"text": "{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}成语{'text':'idiom'}用在这个句子中{'mask'}合适。"}, + {"text": "选一个合适的词语填在括号里,你会选“{'text': 'idiom'}”吗?{'mask'}。“{'text':'content_pre'}(){'text': 'content_post'}”"}, + {"text": "下边句中成语[{'text':'idiom'}]的理解正确吗?{'mask'}{'mask'}。“{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}”"} + ], + "verbalizer": [ + {"0": "否", "1": "是"}, + {"0": "不", "1": "很"}, + {"0": "不", "1": 
"会"}, + {"0": "错误", "1": "正确"} + ] +} \ No newline at end of file diff --git a/examples/few_shot/pet/prompt/cluewsc.json b/examples/few_shot/pet/prompt/cluewsc.json new file mode 100644 index 0000000000000000000000000000000000000000..76badab27eb87ccd8054866e173afb76731fc56e --- /dev/null +++ b/examples/few_shot/pet/prompt/cluewsc.json @@ -0,0 +1,12 @@ +{ + "template": [ + {"text": "{'text': 'text'}{'text': 'pronoun'}指的{'mask'}是{'text': 'entity'}"}, + {"text": "{'text': 'text'}{'text': 'pronoun'}指的是{'text': 'entity'}。这里{'text': 'pronoun'}理解得对吗?{'mask'}"}, + {"text": "{'text': 'text'}{'text': 'pronoun'}{'mask'}{'mask'}地代表了{'text': 'entity'}"} + ], + "verbalizer": [ + {"false": "不", "true": "就"}, + {"false": "错", "true": "对"}, + {"false": "错误", "true": "正确"} + ] +} diff --git a/examples/few_shot/pet/prompt/csl.json b/examples/few_shot/pet/prompt/csl.json new file mode 100644 index 0000000000000000000000000000000000000000..c604c90d0ce506089aea570966f6b3fb995345ef --- /dev/null +++ b/examples/few_shot/pet/prompt/csl.json @@ -0,0 +1,14 @@ +{ + "template": [ + {"text": "给定以下几个词语:{'text': 'keyword'}{'mask'}{'mask'}扩写成“{'text': 'abst'}”"}, + {"text": "{'text':'abst'}这段话中关键词包括{'text':'keyword', 'truncate': False}对吗?{'mask'}。"}, + {"text": "{'text':'keyword'}这几个词和下边这段话内容{'mask'}关。“{'text':'abst'}”"}, + {"text": "“{'text':'abst'}”本文的内容{'mask'}{'mask'}“{'text':'keyword'}”"} + ], + "verbalizer": [ + {"0": "不能", "1": "可以"}, + {"0": "错", "1": "对"}, + {"0": "无", "1": "有"}, + {"0": "不含", "1": "包括"} + ] +} diff --git a/examples/few_shot/pet/prompt/csldcp.json b/examples/few_shot/pet/prompt/csldcp.json new file mode 100644 index 0000000000000000000000000000000000000000..e0fcf846b7ed3ff0730f24d4319e288b11760e3d --- /dev/null +++ b/examples/few_shot/pet/prompt/csldcp.json @@ -0,0 +1,82 @@ +{ + "template": [ + {"text": "阅读下边一段{'mask'}{'mask'}学的资料:“{'text': 'content'}”"}, + {"text": "阅读下边这段{'mask'}{'mask'}方面的材料:“{'text': 'content'}”"}, + {"text": "阅读这段{'mask'}{'mask'}学的文献:“{'text': 'content'}”"}, + {"text": "阅读这段{'mask'}{'mask'}学的材料:“{'text': 'content'}”"} + ], + "verbalizer": [ + { + "材料科学与工程": "材料", + "作物学": "作物", + "口腔医学": "口腔", + "药学": "药学", + "教育学": "教育", + "水利工程": "水利", + "理论经济学": "理经", + "食品科学与工程": "食品", + "畜牧学/兽医学": "畜牧", + "体育学": "体育", + "核科学与技术": "核科", + "力学": "力学", + "园艺学": "园艺", + "水产": "水产", + "法学": "法学", + "地质学/地质资源与地质工程": "地质", + "石油与天然气工程": "石油", + "农林经济管理": "农林", + "信息与通信工程": "通信", + "图书馆、情报与档案管理": "图书", + "政治学": "政治", + "电气工程": "电气", + "海洋科学": "海洋", + "民族学": "民族", + "航空宇航科学与技术": "航空", + "化学/化学工程与技术": "化学", + "哲学": "哲学", + "公共卫生与预防医学": "卫生", + "艺术学": "艺术", + "农业工程": "农工", + "船舶与海洋工程": "船舶", + "计算机科学与技术": "计科", + "冶金工程": "冶金", + "交通运输工程": "交通", + "动力工程及工程热物理": "动力", + "纺织科学与工程": "纺织", + "建筑学": "建筑", + "环境科学与工程": "环境", + "公共管理": "公管", + "数学": "数学", + "物理学": "物理", + "林学/林业工程": "林学", + "心理学": "心理", + "历史学": "历史", + "工商管理": "工管", + "应用经济学": "应经", + "中医学/中药学": "中医", + "天文学": "天文", + "机械工程": "机械", + "土木工程": "土木", + "光学工程": "光学", + "地理学": "地理", + "农业资源利用": "农业", + "生物学/生物科学与工程": "生物", + "兵器科学与技术": "兵器", + "矿业工程": "矿业", + "大气科学": "大气", + "基础医学/临床医学": "基础", + "电子科学与技术": "电子", + "测绘科学与技术": "测绘", + "控制科学与工程": "控制", + "军事学": "军事", + "中国语言文学": "中文", + "新闻传播学": "新闻", + "社会学": "社会", + "地球物理学":"地球", + "植物保护":"植保" + }, + {}, + {}, + {} + ] +} diff --git a/examples/few_shot/pet/prompt/eprstmt.json b/examples/few_shot/pet/prompt/eprstmt.json new file mode 100644 index 0000000000000000000000000000000000000000..84e408def08760013c8f483582a6761ac72dd612 --- /dev/null +++ b/examples/few_shot/pet/prompt/eprstmt.json @@ 
-0,0 +1,18 @@ +{ + "template": [ + {"text": "{'text':'sentence'}我{'mask'}喜欢。"}, + {"text": "我{'mask'}喜欢。{'text':'sentence'}"}, + {"text": "{'mask'}{'mask'}推荐这件商品!{'text':'sentence'}"}, + {"text": "我对这个东西{'mask'}满意。{'text':'sentence'}"}, + {"text": "{'mask'}理想。{'text':'sentence'}"}, + {"text": "{'text':'sentence'}这句话表示我{'mask'}满意。"} + ], + "verbalizer": [ + {"Negative": "不", "Positive": "很"}, + {"Negative": "不", "Positive": "很"}, + {"Negative": "很不", "Positive": "非常"}, + {"Negative": "不", "Positive": "很"}, + {"Negative": "不", "Positive": "很"}, + {"Negative": "不", "Positive": "很"} + ] +} diff --git a/examples/few_shot/pet/prompt/iflytek.json b/examples/few_shot/pet/prompt/iflytek.json new file mode 100644 index 0000000000000000000000000000000000000000..9bce98d3f57ab67ce8284ba8a280641c30d37987 --- /dev/null +++ b/examples/few_shot/pet/prompt/iflytek.json @@ -0,0 +1,253 @@ +{ + "template": [ + {"text": "下边介绍的是和{'mask': None, 'length': 4}相关的产品:{'text': 'sentence'}"}, + {"text": "搜索更多{'mask'}{'mask'}相关的应用程序。{'text': 'sentence'}"}, + {"text": "这段话跟什么有关?{'mask'}{'mask'}“{'text': 'sentence'}”"} + ], + "verbalizer": [ + { + "银行": "银行办理", + "社区服务": "社区服务", + "电商": "电商网购", + "支付": "支付交易", + "经营养成": "经营养成", + "卡牌": "卡牌游戏", + "借贷": "借贷借款", + "驾校": "驾校学车", + "理财": "投资理财", + "职考": "职业考试", + "新闻": "新闻资讯", + "旅游资讯": "旅游资讯", + "公共交通": "公共交通", + "魔幻": "魔幻游戏", + "医疗服务": "医疗服务", + "影像剪辑": "影像剪辑", + "动作类": "动作游戏", + "工具": "使用工具", + "体育竞技": "体育竞技", + "小说": "小说阅读", + "运动健身": "运动健身", + "相机": "相机拍照", + "辅助工具": "辅助工具", + "快递物流": "快递物流", + "高等教育": "高等教育", + "股票": "股票炒股", + "菜谱": "做菜菜谱", + "行车辅助": "行车帮助", + "仙侠": "仙侠小说", + "亲子儿童": "亲子儿童", + "购物咨询": "购物资讯", + "射击游戏": "射击游戏", + "漫画": "动漫漫画", + "中小学": "中学小学", + "同城服务": "同城跑腿", + "成人教育": "成人教育", + "求职": "面试求职", + "电子产品": "电子产品", + "艺术": "艺术学习", + "薅羊毛": "比价省钱", + "约会社交": "约会社交", + "经营": "经营管理", + "兼职": "兼职赚钱", + "短视频": "拍短视频", + "音乐": "音乐乐库", + "英语": "英语学习", + "棋牌中心": "棋牌中心", + "摄影修图": "摄影修图", + "养生保健": "养生保健", + "办公": "办公工具", + "政务": "政务服务", + "视频": "视频拍摄", + "论坛圈子": "论坛圈子", + "彩票": "彩票乐透", + "直播": "直播娱乐", + "其他": "其他类别", + "休闲益智": "休闲益智", + "策略": "策略游戏", + "即时通讯": "即时通讯", + "汽车交易": "汽车交易", + "违章": "违章罚款", + "地图导航": "地图导航", + "民航": "民用航空", + "电台": "电台播报", + "语言(非英语)": "小语种类", + "搞笑": "搞笑娱乐", + "婚恋社交": "婚恋社交", + "社区超市": "社区超市", + "日常养车": "日常养车", + "杂志": "杂志期刊", + "视频教育": "线上教育", + "家政": "家政服务", + "影视娱乐": "影视娱乐", + "装修家居": "装修家居", + "体育咨讯": "体育资讯", + "社交工具": "社交工具", + "餐饮店": "餐饮美食", + "美颜": "美颜相机", + "问诊挂号": "问诊挂号", + "飞行空战": "飞行空战", + "综合预定": "综合预定", + "电影票务": "电影票务", + "笔记": "笔记记录", + "买房": "买房购房", + "外卖": "外卖配送", + "母婴": "母婴产品", + "打车": "打车出行", + "情侣社交": "情侣社交", + "日程管理": "日程管理", + "租车": "租车出行", + "微博博客": "微博博客", + "百科": "知识百科", + "绘画": "绘画学习", + "铁路": "铁路交通", + "生活社交": "生活社交", + "租房": "租房房源", + "酒店": "酒店住宿", + "保险": "保险理赔", + "问答交流": "问答交流", + "收款": "收款交易", + "MOBA": "多人竞技", + "K歌": "唱歌K歌", + "技术": "技术学习", + "减肥瘦身": "减肥瘦身", + "工作社交": "工作社交", + "团购": "团购拼单", + "记账": "记录记账", + "女性": "女性生活", + "公务员": "公务员类", + "二手": "二手交易", + "美妆美业": "美妆美业", + "汽车咨询": "汽车资讯", + "行程管理": "行程管理", + "免费WIFI": "WIFI", + "教辅": "教育辅助", + "成人": "成人两性", + "婚庆": "婚庆结婚", + "民宿短租": "民宿短租", + "出国": "出国相关" + }, + { + "银行": "银行", + "社区服务": "社区", + "电商": "网购", + "支付": "付钱", + "经营养成": "养成", + "卡牌": "纸牌", + "借贷": "借钱", + "驾校": "学车", + "理财": "投资", + "职考": "考试", + "新闻": "新闻", + "旅游资讯": "旅游", + "公共交通": "交通", + "魔幻": "魔幻", + "医疗服务": "医疗", + "影像剪辑": "剪辑", + "动作类": "动作", + "工具": "工具", + "体育竞技": "体育", + "小说": "小说", + "运动健身": "运动", + "相机": "相机", + "辅助工具": "辅助", + "快递物流": "快递", + "高等教育": "教育", + "股票": "股票", + "菜谱": 
"菜谱", + "行车辅助": "帮助", + "仙侠": "仙侠", + "亲子儿童": "小孩", + "购物咨询": "购物", + "射击游戏": "射击", + "漫画": "漫画", + "中小学": "小学", + "同城服务": "跑腿", + "成人教育": "成人", + "求职": "面试", + "电子产品": "电子", + "艺术": "艺术", + "薅羊毛": "赚钱", + "约会社交": "约会", + "经营": "经营", + "兼职": "兼职", + "短视频": "短片", + "音乐": "音乐", + "英语": "英语", + "棋牌中心": "棋牌", + "摄影修图": "拍照", + "养生保健": "养生", + "办公": "办公", + "政务": "政务", + "视频": "视频", + "论坛圈子": "论坛", + "彩票": "彩票", + "直播": "直播", + "其他": "其他", + "休闲益智": "休闲", + "策略": "策略", + "即时通讯": "通讯", + "汽车交易": "买车", + "违章": "违章", + "地图导航": "地图", + "民航": "航空", + "电台": "电台", + "语言(非英语)": "语言", + "搞笑": "搞笑", + "婚恋社交": "婚恋", + "社区超市": "超市", + "日常养车": "养车", + "杂志": "杂志", + "视频教育": "线上", + "家政": "家政", + "影视娱乐": "影视", + "装修家居": "装修", + "体育咨讯": "资讯", + "社交工具": "交流", + "餐饮店": "美食", + "美颜": "美颜", + "问诊挂号": "挂号", + "飞行空战": "飞行", + "综合预定": "预定", + "电影票务": "票务", + "笔记": "笔记", + "买房": "买房", + "外卖": "外卖", + "母婴": "母婴", + "打车": "打车", + "情侣社交": "情侣", + "日程管理": "日程", + "租车": "租车", + "微博博客": "博客", + "百科": "百科", + "绘画": "绘画", + "铁路": "铁路", + "生活社交": "生活", + "租房": "租房", + "酒店": "酒店", + "保险": "保险", + "问答交流": "问答", + "收款": "收款", + "MOBA": "多人", + "K歌": "唱歌", + "技术": "技术", + "减肥瘦身": "减肥", + "工作社交": "工作", + "团购": "团购", + "记账": "记录", + "女性": "女性", + "公务员": "公务", + "二手": "二手", + "美妆美业": "美妆", + "汽车咨询": "汽车", + "行程管理": "行程", + "免费WIFI": "上网", + "教辅": "教辅", + "成人": "两性", + "婚庆": "婚庆", + "民宿短租": "民宿", + "出国": "出国" + }, + {} + ] +} + diff --git a/examples/few_shot/pet/prompt/ocnli.json b/examples/few_shot/pet/prompt/ocnli.json new file mode 100644 index 0000000000000000000000000000000000000000..0519816f080b9fe9d71cacd5d757a72ff54e5427 --- /dev/null +++ b/examples/few_shot/pet/prompt/ocnli.json @@ -0,0 +1,12 @@ +{ + "template": [ + {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”之间的逻辑关系是{'mask'}{'mask'}"}, + {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”说的是{'mask'}{'mask'}的东西。"}, + {"text": "下边两句话之间有什么逻辑关系?{'mask'}{'mask'}“{'text': 'sentence1'}”{'sep'}“{'text': 'sentence2'}”"} + ], + "verbalizer": [ + {"contradiction": "矛盾", "entailment": "蕴含", "neutral": "中立"}, + {"contradiction": "矛盾", "entailment": "蕴含", "neutral": "中立"}, + {"contradiction": "不同", "entailment": "类似", "neutral": "无关"} + ] +} \ No newline at end of file diff --git a/examples/few_shot/pet/prompt/tnews.json b/examples/few_shot/pet/prompt/tnews.json new file mode 100644 index 0000000000000000000000000000000000000000..5a2c7449b8e132b9a4dde3048217df9951d7fb65 --- /dev/null +++ b/examples/few_shot/pet/prompt/tnews.json @@ -0,0 +1,30 @@ +{ + "template": [ + {"text": "阅读下边一则{'mask'}{'mask'}新闻:{'text':'sentence'}"}, + {"text": "阅读这篇标题为「{'text':'sentence'}」的文章,它讲的是{'mask'}{'mask'}。"}, + {"text": "下边这则新闻属于{'mask'}{'mask'}话题{'text':'sentence'}"}, + {"text": "下边这则新闻属于什么话题呢?{'mask'}{'mask'}{'text':'sentence'}"} + ], + "verbalizer": [ + { + "news_story": "八卦", + "news_entertainment": "明星", + "news_finance": "财经", + "news_sports": "体育", + "news_edu": "校园", + "news_game": "游戏", + "news_culture": "文化", + "news_tech": "科技", + "news_car": "汽车", + "news_travel": "旅行", + "news_world": "国际", + "news_agriculture": "农业", + "news_military": "军事", + "news_house": "房子", + "news_stock": "股票" + }, + {}, + {}, + {} + ] +} diff --git a/examples/few_shot/pet/run_train.py b/examples/few_shot/pet/run_train.py new file mode 100644 index 0000000000000000000000000000000000000000..3bab91cfe712d2456ee390073a56b21e7e8f6d4c --- /dev/null +++ b/examples/few_shot/pet/run_train.py @@ -0,0 +1,170 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +from dataclasses import dataclass, field +from functools import partial + +import paddle +from data import load_fewclue_dataset +from paddle.metric import Accuracy +from paddle.static import InputSpec +from utils import load_prompt_arguments, save_fewclue_prediction, save_pseudo_data + +from paddlenlp.prompt import ( + ManualTemplate, + MaskedLMVerbalizer, + PromptModelForSequenceClassification, + PromptTrainer, + PromptTuningArguments, +) +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +@dataclass +class DataArguments: + task_name: str = field(default="eprstmt", metadata={"help": "The task name in FewCLUE."}) + split_id: str = field(default="0", metadata={"help": "The split id of datasets, including 0, 1, 2, 3, 4, few_all."}) + prompt_path: str = field(default="prompt/eprstmt.json", metadata={"help": "Path to the defined prompts."}) + prompt_index: int = field(default=0, metadata={"help": "The index of defined prompt for training."}) + augment_type: str = field(default=None, metadata={"help": "The strategy used for data augmentation, including `swap`, `delete`, `insert`, `subsitute`."}) + num_augment: str = field(default=5, metadata={"help": "Number of augmented data per example, which works when `augment_type` is set."}) + word_augment_percent: str = field(default=0.1, metadata={"help": "Percentage of augmented words in sequences, used for `swap`, `delete`, `insert`, `subsitute`."}) + augment_method: str = field(default="mlm", metadata={"help": "Strategy used for `insert` and `subsitute`."}) + pseudo_data_path: str = field(default=None, metadata={"help": "Path to data with pseudo labels."}) + do_label: bool = field(default=False, metadata={"help": "Whether to label unsupervised data in unlabeled datasets"}) + do_test: bool = field(default=False, metadata={"help": "Whether to evaluate model on public test datasets."}) + + +@dataclass +class ModelArguments: + model_name_or_path: str = field(default="ernie-1.0-large-zh-cw", metadata={"help": "Build-in pretrained model name or the path to local model."}) + export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) + dropout: float = field(default=0.1, metadata={"help": "The dropout used for pretrained model."}) +# yapf: enable + + +def main(): + # Parse the arguments. + parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + data_args = load_prompt_arguments(data_args) + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + paddle.set_device(training_args.device) + + # Load the pretrained language model. 
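+    # PET casts classification as a cloze task, so the backbone keeps its
+    # masked-language-model head (AutoModelForMaskedLM) and the {'mask'} slots
+    # of the manual template are filled and mapped to label words. The dropout
+    # override corresponds to the `dropout` argument used by the R-Drop
+    # strategy described in the README.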
+ tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForMaskedLM.from_pretrained( + model_args.model_name_or_path, + hidden_dropout_prob=model_args.dropout, + attention_probs_dropout_prob=model_args.dropout, + ) + + # Define template for preprocess and verbalizer for postprocess. + template = ManualTemplate(data_args.prompt, tokenizer, training_args.max_seq_length) + logger.info("Using template: {}".format(template.prompt)) + + verbalizer = MaskedLMVerbalizer(data_args.label_words, tokenizer) + labels_to_ids = verbalizer.labels_to_ids + ids_to_labels = {idx: label for label, idx in labels_to_ids.items()} + logger.info("Using verbalizer: {}".format(data_args.label_words)) + + # Load datasets. + data_ds, label_list = load_fewclue_dataset(data_args, verbalizer=verbalizer, example_keys=template.example_keys) + train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = data_ds + dev_labels, test_labels = label_list + + # Define the criterion. + criterion = paddle.nn.CrossEntropyLoss() + + # Initialize the prompt model with the above variables. + prompt_model = PromptModelForSequenceClassification( + model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout + ) + + # Define the metric function. + def compute_metrics(eval_preds, labels, verbalizer): + metric = Accuracy() + predictions = paddle.to_tensor(eval_preds.predictions) + predictions = verbalizer.aggregate_multiple_mask(predictions) + correct = metric.compute(predictions, paddle.to_tensor(labels)) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + # Initialize the trainer. + dev_compute_metrics = partial(compute_metrics, labels=dev_labels, verbalizer=verbalizer) + trainer = PromptTrainer( + model=prompt_model, + tokenizer=tokenizer, + args=training_args, + criterion=criterion, + train_dataset=train_ds, + eval_dataset=dev_ds, + callbacks=None, + compute_metrics=dev_compute_metrics, + ) + + # Traininig. + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + time_stamp = time.strftime("%m%d-%H-%M-%S", time.localtime()) + + # Test. + if data_args.do_test and public_test_ds is not None: + test_compute_metrics = partial(compute_metrics, labels=test_labels, verbalizer=verbalizer) + trainer.compute_metrics = test_compute_metrics + test_ret = trainer.predict(public_test_ds) + trainer.log_metrics("test", test_ret.metrics) + + # Predict. + if training_args.do_predict and test_ds is not None: + pred_ret = trainer.predict(test_ds) + logger.info("Prediction done.") + predict_path = os.path.join(training_args.output_dir, "fewclue_submit_examples_" + time_stamp) + save_fewclue_prediction(predict_path, data_args.task_name, pred_ret, verbalizer, ids_to_labels) + + # Label unsupervised data. + if data_args.do_label and unlabeled_ds is not None: + label_ret = trainer.predict(unlabeled_ds) + logger.info("Labeling done.") + pseudo_path = os.path.join(training_args.output_dir, "pseudo_data_" + time_stamp + ".txt") + save_pseudo_data(pseudo_path, data_args.task_name, label_ret, verbalizer, ids_to_labels) + + # Export static model. 
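+    # With a ManualTemplate there are no soft-prompt tensors, so the exported
+    # static graph only needs the five standard inputs listed below.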
+ if training_args.do_export: + input_spec = [ + InputSpec(shape=[None, None], dtype="int64"), # input_ids, + InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + InputSpec(shape=[None, None], dtype="int64"), # position_ids + InputSpec(shape=[None, None, None, None], dtype="float32"), # attention_mask + InputSpec(shape=[None], dtype="int64"), # masked_positions + ] + export_path = os.path.join(training_args.output_dir, "export") + trainer.export_model(export_path, input_spec=input_spec, export_type=model_args.export_type) + + +if __name__ == "__main__": + main() diff --git a/examples/few_shot/pet/utils.py b/examples/few_shot/pet/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..989b4e6b81a8d156f93bf256e79ba5a6ed201197 --- /dev/null +++ b/examples/few_shot/pet/utils.py @@ -0,0 +1,249 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +import pathlib + +import numpy as np +import paddle + +from paddlenlp.datasets import load_dataset + +LABEL_TO_STANDARD = { + "tnews": { + "news_story": "100", + "news_culture": "101", + "news_entertainment": "102", + "news_sports": "103", + "news_finance": "104", + "news_house": "106", + "news_car": "107", + "news_edu": "108", + "news_tech": "109", + "news_military": "110", + "news_travel": "112", + "news_world": "113", + "news_stock": "114", + "news_agriculture": "115", + "news_game": "116", + }, + "iflytek": { + "打车": 0, + "美颜": 100, + "影像剪辑": 101, + "摄影修图": 102, + "相机": 103, + "绘画": 104, + "二手": 105, + "电商": 106, + "团购": 107, + "外卖": 108, + "电影票务": 109, + "社区服务": 10, + "社区超市": 110, + "购物咨询": 111, + "笔记": 112, + "办公": 113, + "日程管理": 114, + "女性": 115, + "经营": 116, + "收款": 117, + "其他": 118, + "薅羊毛": 11, + "魔幻": 12, + "仙侠": 13, + "卡牌": 14, + "飞行空战": 15, + "射击游戏": 16, + "休闲益智": 17, + "动作类": 18, + "体育竞技": 19, + "地图导航": 1, + "棋牌中心": 20, + "经营养成": 21, + "策略": 22, + "MOBA": 23, + "辅助工具": 24, + "约会社交": 25, + "即时通讯": 26, + "工作社交": 27, + "论坛圈子": 28, + "婚恋社交": 29, + "免费WIFI": 2, + "情侣社交": 30, + "社交工具": 31, + "生活社交": 32, + "微博博客": 33, + "新闻": 34, + "漫画": 35, + "小说": 36, + "技术": 37, + "教辅": 38, + "问答交流": 39, + "租车": 3, + "搞笑": 40, + "杂志": 41, + "百科": 42, + "影视娱乐": 43, + "求职": 44, + "兼职": 45, + "视频": 46, + "短视频": 47, + "音乐": 48, + "直播": 49, + "同城服务": 4, + "电台": 50, + "K歌": 51, + "成人": 52, + "中小学": 53, + "职考": 54, + "公务员": 55, + "英语": 56, + "视频教育": 57, + "高等教育": 58, + "成人教育": 59, + "快递物流": 5, + "艺术": 60, + "语言(非英语)": 61, + "旅游资讯": 62, + "综合预定": 63, + "民航": 64, + "铁路": 65, + "酒店": 66, + "行程管理": 67, + "民宿短租": 68, + "出国": 69, + "婚庆": 6, + "工具": 70, + "亲子儿童": 71, + "母婴": 72, + "驾校": 73, + "违章": 74, + "汽车咨询": 75, + "汽车交易": 76, + "日常养车": 77, + "行车辅助": 78, + "租房": 79, + "家政": 7, + "买房": 80, + "装修家居": 81, + "电子产品": 82, + "问诊挂号": 83, + "养生保健": 84, + "医疗服务": 85, + "减肥瘦身": 86, + "美妆美业": 87, + "菜谱": 88, + "餐饮店": 89, + "公共交通": 8, + "体育咨讯": 90, + "运动健身": 91, + "支付": 92, + "保险": 93, + "股票": 94, + "借贷": 95, + "理财": 96, + "彩票": 97, + "记账": 98, + "银行": 99, + "政务": 
9, + }, +} + + +def load_prompt_arguments(args): + """ + Load prompt and label words according to prompt index. + """ + with open(args.prompt_path, "r", encoding="utf-8") as fp: + configs = json.load(fp) + assert len(configs["verbalizer"]) == len(configs["template"]) + assert configs["verbalizer"][0] is not None + verbalizer = [configs["verbalizer"][0]] + last_verb_index = 0 + for index, verb in enumerate(configs["verbalizer"][1:]): + if verb is None or len(verb) == 0: + verbalizer.append(configs["verbalizer"][last_verb_index]) + else: + verbalizer.append(verb) + last_verb_index = index + 1 + configs["verbalizer"] = verbalizer + args.prompt = configs["template"][args.prompt_index]["text"] + label_words = configs["verbalizer"][args.prompt_index] + if isinstance(label_words, list): + label_words = {k: k for k in label_words} + args.label_words = label_words + return args + + +def save_pseudo_data(save_path, task_name, label_preds, verbalizer, labels): + """ + Combine unsupervised data and corresponding predicted labels and + save one example per line. + """ + if task_name == "cluewsc": + return None + + data_ds = load_dataset("fewclue", name=task_name, splits="unlabeled") + preds = paddle.to_tensor(label_preds.predictions) + preds = verbalizer.aggregate_multiple_mask(preds) + preds = paddle.nn.functional.softmax(preds, axis=1).numpy() + label_preds = np.argmax(preds, axis=1) + label_probs = np.max(preds, axis=1) + pseudo_data = [] + for index, example in enumerate(data_ds): + example["labels"] = labels[label_preds[index]] + example["prob"] = str(label_probs[index]) + pseudo_data.append(example) + save_data(pseudo_data, save_path) + + +def save_fewclue_prediction(save_path, task_name, label_preds, verbalizer, labels): + """ + Extract predicted labels and save as the format required by FewCLUE. 
+ """ + preds = paddle.to_tensor(label_preds.predictions) + preds = verbalizer.aggregate_multiple_mask(preds) + if task_name == "chid": + batch_size = preds.shape[0] + preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] + preds = preds.reshape([batch_size // 7, 7]) + preds = paddle.nn.functional.softmax(preds, axis=1).numpy() + preds = np.argmax(preds, axis=1) + test_ds = load_dataset("fewclue", name=task_name, splits="test") + + ret_list = [] + maps = LABEL_TO_STANDARD.get(task_name, None) + for idx, example in enumerate(test_ds): + uid = example.get("id", idx) + if task_name in ["bustm", "csl"]: + ret_list.append({"id": uid, "label": str(preds[idx])}) + elif task_name == "chid": + ret_list.append({"id": uid, "answer": preds[idx]}) + elif task_name in ["cluewsc", "eprstmt", "ocnli", "csldcp"]: + ret_list.append({"id": uid, "label": labels[preds[idx]]}) + elif task_name in ["iflytek", "tnews"]: + ret_list.append({"id": uid, "label": str(maps[labels[preds[idx]]])}) + save_file = task_name if task_name in ["bustm", "csldcp", "eprstmt"] else task_name + "f" + save_data(ret_list, save_path, save_file + "_predict.json") + + +def save_data(data, save_path, save_file=None): + if save_file is not None: + pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) + save_path = os.path.join(save_path, save_file) + with open(save_path, "w") as fp: + for example in data: + fp.write(json.dumps(example, ensure_ascii=False) + "\n") diff --git a/examples/information_extraction/DuEE/README.md b/examples/information_extraction/DuEE/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1bdf5fab793a5a85882ac22126afd1b04c70e637 --- /dev/null +++ b/examples/information_extraction/DuEE/README.md @@ -0,0 +1,231 @@ +# LIC2021 DuEE 事件抽取基线 + + +信息抽取旨在从非结构化自然语言文本中提取结构化知识,如实体、关系、事件等。事件抽取的目标是对于给定的自然语言句子,根据预先指定的事件类型和论元角色,识别句子中所有目标事件类型的事件,并根据相应的论元角色集合抽取事件所对应的论元。其中目标事件类型 (event_type) 和论元角色 (role) 限定了抽取的范围,例如 (event_type:胜负,role:时间,胜者,败者,赛事名称)、(event_type:夺冠,role:夺冠事件,夺冠赛事,冠军)。 + +
+事件抽取 +
+ +该示例展示了如何使用PaddleNLP快速复现[LIC2021事件抽取比赛](http://lic2021.ccf.org.cn/)基线并进阶优化基线。 + +同时,我们提供了该示例在线运行展示教程: +[PaddleNLP实战——LIC2021事件抽取任务基线](https://aistudio.baidu.com/aistudio/projectdetail/1639964) + +## 目录结构 + +以下是本项目主要目录结构及说明: + +```text +DuEE/ +├── classifier.py # 文本分类训练脚本 +├── duee_1_data_prepare.py # 句子级事件抽取数据预处理 +├── duee_1_postprocess.py # 句子级事件抽取数据后处理 +├── duee_fin_data_prepare.py # 篇章级事件抽取数据预处理 +├── duee_fin_postprocess.py # 篇章级事件抽取数据后处理 +├── README.md # 文档说明 +├── run_classifier.sh # 文本分类训练启动脚本 +├── run_duee_1.sh # 句子级事件抽取启动脚本 +├── run_duee_fin.sh # 篇章级事件抽取启动脚本 +├── run_sequence_labeling.sh # 序列标注训练启动脚本 +├── sequence_labeling.py # 序列标注训练脚本 +└── utils.py # 效能函数 +``` + + +## 篇章级事件抽取基线 + +篇章级事件抽取数据集(DuEE-Fin)是金融领域篇章级别事件抽取数据集, +共包含13个已定义好的事件类型约束和1.15万中文篇章(存在部分非目标篇章作为负样例),其中6900训练集,1150验证集和3450测试集,数据集下载[地址](https://aistudio.baidu.com/aistudio/competition/detail/65) 。 +在该数据集上基线采用基于[ERNIE](https://github.com/PaddlePaddle/ERNIE)的序列标注(sequence labeling)方案,分为基于序列标注的触发词抽取模型、基于序列标注的论元抽取模型和枚举属性分类模型,属于PipeLine模型;基于序列标注的触发词抽取模型采用BIO方式,识别触发词的位置以及对应的事件类型,基于序列标注的论元抽取模型采用BIO方式识别出事件中的论元以及对应的论元角色;枚举属性分类模型采用ernie进行分类。 + +### 评测方法 + +本任务采用预测论元F1值作为评价指标,对于每个篇章,采用不放回的方式给每个目标事件寻找最相似的预测事件(事件级别匹配),搜寻方式是优先寻找与目标事件的事件类型相同且角色和论元正确数量最多的预测事件 + +f1_score = (2 * P * R) / (P + R),其中 + +- 预测论元正确=事件类型和角色相同且论元正确 +- P=预测论元正确数量 / 所有预测论元的数量 +- R=预测论元正确数量 / 所有人工标注论元的数量 + +### 快速复现基线Step1:数据预处理并加载 + +从比赛官网下载数据集,逐层解压存放于data/DuEE-fin目录下,运行以下脚本将原始数据预处理成序列标注格式数据。 +处理之后的数据放在data/DuEE-Fin下,触发词识别数据文件存放在data/DuEE-Fin/role下,论元角色识别数据文件存放在data/DuEE-Fin/trigger下。 +枚举分类数据存放在data/DuEE-Fin/enum下。 + +``` +sh run_duee_fin.sh data_prepare +``` + +我们可以加载自定义数据集。通过继承[`paddle.io.Dataset`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/Dataset_cn.html#dataset),自定义实现`__getitem__` 和 `__len__`两个方法。 + +如下代码已完成加载数据集操作: +``` +train_ds = DuEventExtraction(args.train_data, args.vocab_path, args.tag_path) +dev_ds = DuEventExtraction(args.dev_data, args.vocab_path, args.tag_path) +test_ds = DuEventExtraction(args.test_data, args.vocab_path, args.tag_path) +``` + +### 快速复现基线Step2:构建模型 + + +基于序列标注的触发词抽取模型是整体模型的一部分,该部分主要是给定事件类型,识别句子中出现的事件触发词对应的位置以及对应的事件类别,该模型是基于ERNIE开发序列标注模型,模型原理图如下: + +
+基于序列标注的触发词抽取模型 +
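+
+触发词抽取模型按字进行BIO序列标注,标签中同时携带事件类型。下面用一个假设的句子示意标注格式(句子与标签只是示意,并非数据集中的真实样例):
+
+```python
+# 示意:事件类型为"收购"时,触发词"收购"标注为 B-收购、I-收购,其余字标注为 O
+tokens = ["新", "东", "方", "宣", "布", "收", "购", "东", "方", "优", "播"]
+labels = ["O", "O", "O", "O", "O", "B-收购", "I-收购", "O", "O", "O", "O"]
+```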
+
+
+同样地,基于序列标注的论元抽取模型也是基于ERNIE开发的序列标注模型,该部分主要识别出事件中的论元以及对应的论元角色,模型原理图如下:
+
+
+基于序列标注的论元抽取模型 +
+
+上述样例中通过模型识别出:1)论元"新东方",并分配标签"B-收购方"、"I-收购方"、"I-收购方";2)论元"东方优播",并分配标签"B-被收购方"、"I-被收购方"、"I-被收购方"、"I-被收购方"。最终识别出文本中包含的论元角色和论元对是<收购方,新东方>、<被收购方,东方优播>。
+
+**PaddleNLP提供了基于ERNIE预训练模型的常用序列标注模型,可以通过指定模型名字完成一键加载**:
+
+```python
+from paddlenlp.transformers import AutoModelForTokenClassification
+
+model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=len(label_map))
+```
+
+同时,对于枚举分类数据,采用的是基于ERNIE的文本分类模型,枚举角色类型为"环节"。模型原理图如下:
+
+
+枚举属性分类模型 +
+ +> 给定文本,对文本进行分类,得到不同类别上的概率 筹备上市(0.8)、暂停上市(0.02)、正式上市(0.15)、终止上市(0.03) + + +**同样地,PaddleNLP提供了ERNIE预训练模型常用文本分类模型,可以通过指定模型名字完成一键加载**: + +```python +from paddlenlp.transformers import AutoModelForSequenceClassification + +model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=len(label_map)) +``` + +### 快速复现基线Step3:数据处理 + +我们需要将原始数据处理成模型可读入的数据。PaddleNLP为了方便用户处理数据,内置了对于各个预训练模型对应的Tokenizer,可以完成 +文本token化,转token ID,文本长度截断等操作。与加载模型类似地,也可以一键加载。 + +```python +from paddlenlp.transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") +``` + +文本数据处理直接调用tokenizer即可输出模型所需输入数据。 + +```python +inputs = tokenizer(text="请输入测试样例", max_seq_len=20) +# {'input_ids': [1, 647, 789, 109, 558, 525, 314, 656, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], +# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], +# 'seq_len': 9} +``` + + +### 快速复现基线Step4:定义损失函数和优化器,开始训练 + +在该基线上,我们选择交叉墒作为损失函数,使用[`paddle.optimizer.AdamW`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/optimizer/adamw/AdamW_cn.html#adamw)作为优化器。 + +启动训练: +```shell +# 触发词识别模型训练 +sh run_duee_fin.sh trigger_train + +# 触发词识别预测 +sh run_duee_fin.sh trigger_predict + +# 论元识别模型训练 +sh run_duee_fin.sh role_train + +# 论元识别预测 +sh run_duee_fin.sh role_predict + +# 枚举分类模型训练 +sh run_duee_fin.sh enum_train + +# 枚举分类预测 +sh run_duee_fin.sh enum_predict +``` + + +### 快速复现基线Step5:数据后处理,提交结果 + +按照比赛预测指定格式提交结果至[评测网站](http://aistudio-bce.bcc-bdbl.baidu.com/aistudio/competition/detail/141)。 + + +``` shell +sh run_duee_fin.sh pred_2_submit +``` + +结果存放于`submit/test_duee_fin.json` + + +#### to-do 增加基线效果 + + +## 句子级事件抽取基线 + + +句子级别通用领域的事件抽取数据集([DuEE 1.0](https://aistudio.baidu.com/aistudio/competition/detail/32?isFromCcf=true))上进行事件抽取的基线模型,该模型采用基于[ERNIE](https://github.com/PaddlePaddle/ERNIE)的序列标注(sequence labeling)方案,分为基于序列标注的触发词抽取模型和基于序列标注的论元抽取模型,属于PipeLine模型;基于序列标注的触发词抽取模型采用BIO方式,识别触发词的位置以及对应的事件类型,基于序列标注的论元抽取模型采用BIO方式识别出事件中的论元以及对应的论元角色。模型和数据处理方式与篇章级事件抽取相同,此处不再赘述。句子级别通用领域的事件抽取无枚举角色分类。 + +启动训练: +```shell +# 训练触发词识别模型 +sh run_duee_1.sh trigger_train + +# 触发词识别预测 +sh run_duee_1.sh trigger_predict + +# 论元识别模型训练 +sh run_duee_1.sh role_train + +# 论元识别预测 +sh run_duee_1.sh role_predict + +# 数据后处理,提交预测结果 +# 结果存放于submit/test_duee_1.json` +sh run_duee_1.sh pred_2_submit +``` + +### 评测方法 + +事件论元结果与人工标注的事件论元结果进行匹配,并按字级别匹配F1进行打分,不区分大小写,如论元有多个表述,则取多个匹配F1中的最高值 + +f1_score = (2 * P * R) / (P + R),其中 + +- P=预测论元得分总和 / 所有预测论元的数量 +- R=预测论元得分总和 / 所有人工标注论元的数量 +- 预测论元得分=事件类型是否准确 * 论元角色是否准确 * 字级别匹配F1值 (*是相乘) +- 字级别匹配F1值 = 2 * 字级别匹配P值 * 字级别匹配R值 / (字级别匹配P值 + 字级别匹配R值) +- 字级别匹配P值 = 预测论元和人工标注论元共有字的数量/ 预测论元字数 +- 字级别匹配R值 = 预测论元和人工标注论元共有字的数量/ 人工标注论元字数 + +#### to-do 增加基线效果 + +## 进阶优化基线效果 + +基线采用的预训练模型为ERNIE,PaddleNLP提供了丰富的预训练模型,如BERT,RoBERTa,Electra,XLNet等 +参考[PaddleNLP预训练模型介绍](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) + +如可以选择RoBERTa large中文模型优化模型效果,只需更换模型和tokenizer即可无缝衔接。 + +```python +from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer + +model = RobertaForTokenClassification.from_pretrained("roberta-wwm-ext-large", num_classes=len(label_map)) +tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large") +``` + +## Reference + +- [DuEE: A Large-Scale Dataset for Chinese Event Extraction in Real-World Scenarios](https://link.springer.com/chapter/10.1007/978-3-030-60457-8_44) diff --git a/examples/information_extraction/DuEE/classifier.py b/examples/information_extraction/DuEE/classifier.py new file mode 100644 index 
0000000000000000000000000000000000000000..9732593639a78eac11b6fbde12438939e288112a --- /dev/null +++ b/examples/information_extraction/DuEE/classifier.py @@ -0,0 +1,323 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +classification +""" +import argparse +import ast +import csv +import json +import os +import random +import traceback +from collections import namedtuple +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from utils import load_dict, read_by_lines, write_by_lines + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer + +""" +For All pre-trained model(English and Chinese), +Please refer to https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer +""" + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--num_epoch", type=int, default=3, help="Number of epoches for fine-tuning.") +parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.") +parser.add_argument("--tag_path", type=str, default=None, help="tag set path") +parser.add_argument("--train_data", type=str, default=None, help="train data") +parser.add_argument("--dev_data", type=str, default=None, help="dev data") +parser.add_argument("--test_data", type=str, default=None, help="test data") +parser.add_argument("--predict_data", type=str, default=None, help="predict data") +parser.add_argument("--do_train", type=ast.literal_eval, default=True, help="do train") +parser.add_argument("--do_predict", type=ast.literal_eval, default=True, help="do predict") +parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") +parser.add_argument( + "--warmup_proportion", type=float, default=0.1, help="Warmup proportion params for warmup strategy" +) +parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest seqence.") +parser.add_argument("--valid_step", type=int, default=100, help="validation step") +parser.add_argument("--skip_step", type=int, default=20, help="skip step") +parser.add_argument("--batch_size", type=int, default=32, help="Total examples' number in batch for training.") +parser.add_argument("--checkpoints", type=str, default=None, help="Directory to model checkpoint") +parser.add_argument("--init_ckpt", type=str, default=None, help="already pretraining model checkpoint") +parser.add_argument("--predict_save_path", type=str, default=None, help="predict data save path") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." 
+) +args = parser.parse_args() + + +def set_seed(random_seed): + """sets random seed""" + random.seed(random_seed) + np.random.seed(random_seed) + paddle.seed(random_seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, labels = batch + logits = model(input_ids, token_type_ids) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + accuracy = metric.accumulate() + metric.reset() + model.train() + return float(np.mean(losses)), accuracy + + +def convert_example(example, tokenizer, label_map=None, max_seq_len=512, is_test=False): + """convert_example""" + has_text_b = False + if isinstance(example, dict): + has_text_b = "text_b" in example.keys() + else: + has_text_b = "text_b" in example._fields + + text_b = None + if has_text_b: + text_b = example.text_b + + tokenized_input = tokenizer(text=example.text_a, text_pair=text_b, max_seq_len=max_seq_len) + input_ids = tokenized_input["input_ids"] + token_type_ids = tokenized_input["token_type_ids"] + + if is_test: + return input_ids, token_type_ids + else: + label = np.array([label_map[example.label]], dtype="int64") + return input_ids, token_type_ids, label + + +class DuEventExtraction(paddle.io.Dataset): + """Du""" + + def __init__(self, data_path, tag_path): + self.label_vocab = load_dict(tag_path) + self.examples = self._read_tsv(data_path) + + def _read_tsv(self, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="UTF-8") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + headers = next(reader) + text_indices = [index for index, h in enumerate(headers) if h != "label"] + Example = namedtuple("Example", headers) + examples = [] + for line in reader: + for index, text in enumerate(line): + if index in text_indices: + line[index] = text + try: + example = Example(*line) + except Exception as e: + traceback.print_exc() + raise Exception(e) + examples.append(example) + return examples + + def __len__(self): + return len(self.examples) + + def __getitem__(self, index): + return self.examples[index] + + +def data_2_examples(datas): + """data_2_examples""" + has_text_b, examples = False, [] + if isinstance(datas[0], list): + Example = namedtuple("Example", ["text_a", "text_b"]) + has_text_b = True + else: + Example = namedtuple("Example", ["text_a"]) + for item in datas: + if has_text_b: + example = Example(text_a=item[0], text_b=item[1]) + else: + example = Example(text_a=item) + examples.append(example) + return examples + + +def do_train(): + paddle.set_device(args.device) + world_size = paddle.distributed.get_world_size() + rank = paddle.distributed.get_rank() + if world_size > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + label_map = load_dict(args.tag_path) + model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=len(label_map)) + model = paddle.DataParallel(model) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + 
print("============start train==========") + train_ds = DuEventExtraction(args.train_data, args.tag_path) + dev_ds = DuEventExtraction(args.dev_data, args.tag_path) + + trans_func = partial(convert_example, tokenizer=tokenizer, label_map=label_map, max_seq_len=args.max_seq_len) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype="int32"), + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype="int32"), + Stack(dtype="int64"), # label + ): fn(list(map(trans_func, samples))) + + batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=batch_sampler, collate_fn=batchify_fn) + dev_loader = paddle.io.DataLoader(dataset=dev_ds, batch_size=args.batch_size, collate_fn=batchify_fn) + + num_training_steps = len(train_loader) * args.num_epoch + metric = paddle.metric.Accuracy() + criterion = paddle.nn.loss.CrossEntropyLoss() + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + step, best_performerence = 0, 0.0 + model.train() + for epoch in range(args.num_epoch): + for idx, (input_ids, token_type_ids, labels) in enumerate(train_loader): + logits = model(input_ids, token_type_ids) + loss = criterion(logits, labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + acc = metric.accumulate() + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_item = loss.item() + if step > 0 and step % args.skip_step == 0 and rank == 0: + print( + f"train epoch: {epoch} - step: {step} (total: {num_training_steps}) - loss: {loss_item:.6f} acc {acc:.5f}" + ) + if step > 0 and step % args.valid_step == 0 and rank == 0: + loss_dev, acc_dev = evaluate(model, criterion, metric, dev_loader) + print( + f"dev step: {step} - loss: {loss_dev:.6f} accuracy: {acc_dev:.5f}, current best {best_performerence:.5f}" + ) + if acc_dev > best_performerence: + best_performerence = acc_dev + print( + f"==============================================save best model best performerence {best_performerence:5f}" + ) + paddle.save(model.state_dict(), f"{args.checkpoints}/best.pdparams") + step += 1 + + # save the final model + if rank == 0: + paddle.save(model.state_dict(), "{}/final.pdparams".format(args.checkpoints)) + + +def do_predict(): + set_seed(args.seed) + paddle.set_device(args.device) + + label_map = load_dict(args.tag_path) + id2label = {val: key for key, val in label_map.items()} + + model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=len(label_map)) + model = paddle.DataParallel(model) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + print("============start predict==========") + if not args.init_ckpt or not os.path.isfile(args.init_ckpt): + raise Exception("init checkpoints {} not exist".format(args.init_ckpt)) + else: + state_dict = paddle.load(args.init_ckpt) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.init_ckpt) + + # load data from predict file + sentences = read_by_lines(args.predict_data) # origin data format + sentences = 
[json.loads(sent) for sent in sentences] + + encoded_inputs_list = [] + for sent in sentences: + sent = sent["text"] + input_sent = [sent] # only text_a + if "text_b" in sent: + input_sent = [[sent, sent["text_b"]]] # add text_b + example = data_2_examples(input_sent)[0] + input_ids, token_type_ids = convert_example(example, tokenizer, max_seq_len=args.max_seq_len, is_test=True) + encoded_inputs_list.append((input_ids, token_type_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), + ): fn(samples) + # Separates data into some batches. + batch_encoded_inputs = [ + encoded_inputs_list[i : i + args.batch_size] for i in range(0, len(encoded_inputs_list), args.batch_size) + ] + results = [] + model.eval() + for batch in batch_encoded_inputs: + input_ids, token_type_ids = batchify_fn(batch) + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + logits = model(input_ids, token_type_ids) + probs = F.softmax(logits, axis=1) + probs_ids = paddle.argmax(probs, -1).numpy() + probs = probs.numpy() + for prob_one, p_id in zip(probs.tolist(), probs_ids.tolist()): + label_probs = {} + for idx, p in enumerate(prob_one): + label_probs[id2label[idx]] = p + results.append({"probs": label_probs, "label": id2label[p_id]}) + + assert len(results) == len(sentences) + for sent, ret in zip(sentences, results): + sent["pred"] = ret + sentences = [json.dumps(sent, ensure_ascii=False) for sent in sentences] + write_by_lines(args.predict_save_path, sentences) + print("save data {} to {}".format(len(sentences), args.predict_save_path)) + + +if __name__ == "__main__": + + if args.do_train: + do_train() + elif args.do_predict: + do_predict() diff --git a/examples/information_extraction/DuEE/duee_1_data_prepare.py b/examples/information_extraction/DuEE/duee_1_data_prepare.py new file mode 100644 index 0000000000000000000000000000000000000000..dee77a7899313b9fd35808e9f27805cb476ef717 --- /dev/null +++ b/examples/information_extraction/DuEE/duee_1_data_prepare.py @@ -0,0 +1,127 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""duee 1.0 dataset process""" +import json +import os + +from utils import read_by_lines, write_by_lines + + +def data_process(path, model="trigger", is_predict=False): + """data_process""" + + def label_data(data, start, l, _type): + """label_data""" + for i in range(start, start + l): + suffix = "B-" if i == start else "I-" + data[i] = "{}{}".format(suffix, _type) + return data + + sentences = [] + output = ["text_a"] if is_predict else ["text_a\tlabel"] + with open(path) as f: + for line in f: + d_json = json.loads(line.strip()) + _id = d_json["id"] + text_a = ["," if t == " " or t == "\n" or t == "\t" else t for t in list(d_json["text"].lower())] + if is_predict: + sentences.append({"text": d_json["text"], "id": _id}) + output.append("\002".join(text_a)) + else: + if model == "trigger": + labels = ["O"] * len(text_a) + for event in d_json.get("event_list", []): + event_type = event["event_type"] + start = event["trigger_start_index"] + trigger = event["trigger"] + labels = label_data(labels, start, len(trigger), event_type) + output.append("{}\t{}".format("\002".join(text_a), "\002".join(labels))) + elif model == "role": + for event in d_json.get("event_list", []): + labels = ["O"] * len(text_a) + for arg in event["arguments"]: + role_type = arg["role"] + argument = arg["argument"] + start = arg["argument_start_index"] + labels = label_data(labels, start, len(argument), role_type) + output.append("{}\t{}".format("\002".join(text_a), "\002".join(labels))) + return output + + +def schema_process(path, model="trigger"): + """schema_process""" + + def label_add(labels, _type): + """label_add""" + if "B-{}".format(_type) not in labels: + labels.extend(["B-{}".format(_type), "I-{}".format(_type)]) + return labels + + labels = [] + for line in read_by_lines(path): + d_json = json.loads(line.strip()) + if model == "trigger": + labels = label_add(labels, d_json["event_type"]) + elif model == "role": + for role in d_json["role_list"]: + labels = label_add(labels, role["role"]) + labels.append("O") + tags = [] + for index, label in enumerate(labels): + tags.append("{}\t{}".format(index, label)) + return tags + + +if __name__ == "__main__": + print("\n=================DUEE 1.0 DATASET==============") + conf_dir = "./conf/DuEE1.0" + schema_path = "{}/event_schema.json".format(conf_dir) + tags_trigger_path = "{}/trigger_tag.dict".format(conf_dir) + tags_role_path = "{}/role_tag.dict".format(conf_dir) + print("\n=================start schema process==============") + print("input path {}".format(schema_path)) + tags_trigger = schema_process(schema_path, "trigger") + write_by_lines(tags_trigger_path, tags_trigger) + print("save trigger tag {} at {}".format(len(tags_trigger), tags_trigger_path)) + tags_role = schema_process(schema_path, "role") + write_by_lines(tags_role_path, tags_role) + print("save trigger tag {} at {}".format(len(tags_role), tags_role_path)) + print("=================end schema process===============") + + # data process + data_dir = "./data/DuEE1.0" + trigger_save_dir = "{}/trigger".format(data_dir) + role_save_dir = "{}/role".format(data_dir) + print("\n=================start schema process==============") + if not os.path.exists(trigger_save_dir): + os.makedirs(trigger_save_dir) + if not os.path.exists(role_save_dir): + os.makedirs(role_save_dir) + print("\n----trigger------for dir {} to {}".format(data_dir, trigger_save_dir)) + train_tri = data_process("{}/duee_train.json".format(data_dir), "trigger") + write_by_lines("{}/train.tsv".format(trigger_save_dir), train_tri) + 
dev_tri = data_process("{}/duee_dev.json".format(data_dir), "trigger") + write_by_lines("{}/dev.tsv".format(trigger_save_dir), dev_tri) + test_tri = data_process("{}/duee_test1.json".format(data_dir), "trigger") + write_by_lines("{}/test.tsv".format(trigger_save_dir), test_tri) + print("train {} dev {} test {}".format(len(train_tri), len(dev_tri), len(test_tri))) + print("\n----role------for dir {} to {}".format(data_dir, role_save_dir)) + train_role = data_process("{}/duee_train.json".format(data_dir), "role") + write_by_lines("{}/train.tsv".format(role_save_dir), train_role) + dev_role = data_process("{}/duee_dev.json".format(data_dir), "role") + write_by_lines("{}/dev.tsv".format(role_save_dir), dev_role) + test_role = data_process("{}/duee_test1.json".format(data_dir), "role") + write_by_lines("{}/test.tsv".format(role_save_dir), test_role) + print("train {} dev {} test {}".format(len(train_role), len(dev_role), len(test_role))) + print("=================end schema process==============") diff --git a/examples/information_extraction/DuEE/duee_1_postprocess.py b/examples/information_extraction/DuEE/duee_1_postprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..6a9efbc030a8fcd1c109e038a57664bfc20a17f6 --- /dev/null +++ b/examples/information_extraction/DuEE/duee_1_postprocess.py @@ -0,0 +1,80 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""duee 1.0 data predict post-process""" + +import argparse +import json + +from utils import extract_result, read_by_lines, write_by_lines + + +def predict_data_process(trigger_file, role_file, schema_file, save_path): + """predict_data_process""" + pred_ret = [] + trigger_data = read_by_lines(trigger_file) + role_data = read_by_lines(role_file) + schema_data = read_by_lines(schema_file) + print("trigger predict {} load from {}".format(len(trigger_data), trigger_file)) + print("role predict {} load from {}".format(len(role_data), role_file)) + print("schema {} load from {}".format(len(schema_data), schema_file)) + + schema = {} + for s in schema_data: + d_json = json.loads(s) + schema[d_json["event_type"]] = [r["role"] for r in d_json["role_list"]] + + # process the role data + sent_role_mapping = {} + for d in role_data: + d_json = json.loads(d) + r_ret = extract_result(d_json["text"], d_json["pred"]["labels"]) + role_ret = {} + for r in r_ret: + role_type = r["type"] + if role_type not in role_ret: + role_ret[role_type] = [] + role_ret[role_type].append("".join(r["text"])) + sent_role_mapping[d_json["id"]] = role_ret + + for d in trigger_data: + d_json = json.loads(d) + t_ret = extract_result(d_json["text"], d_json["pred"]["labels"]) + pred_event_types = list(set([t["type"] for t in t_ret])) + event_list = [] + for event_type in pred_event_types: + role_list = schema[event_type] + arguments = [] + for role_type, ags in sent_role_mapping[d_json["id"]].items(): + if role_type not in role_list: + continue + for arg in ags: + if len(arg) == 1: + continue + arguments.append({"role": role_type, "argument": arg}) + event = {"event_type": event_type, "arguments": arguments} + event_list.append(event) + pred_ret.append({"id": d_json["id"], "text": d_json["text"], "event_list": event_list}) + pred_ret = [json.dumps(r, ensure_ascii=False) for r in pred_ret] + print("submit data {} save to {}".format(len(pred_ret), save_path)) + write_by_lines(save_path, pred_ret) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Official evaluation script for DuEE version 1.0") + parser.add_argument("--trigger_file", help="trigger model predict data path", required=True) + parser.add_argument("--role_file", help="role model predict data path", required=True) + parser.add_argument("--schema_file", help="schema file path", required=True) + parser.add_argument("--save_path", help="save file path", required=True) + args = parser.parse_args() + predict_data_process(args.trigger_file, args.role_file, args.schema_file, args.save_path) diff --git a/examples/information_extraction/DuEE/duee_fin_data_prepare.py b/examples/information_extraction/DuEE/duee_fin_data_prepare.py new file mode 100644 index 0000000000000000000000000000000000000000..49b101007703bf9cb84d2f720bcccfd8c75b8f10 --- /dev/null +++ b/examples/information_extraction/DuEE/duee_fin_data_prepare.py @@ -0,0 +1,276 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""duee finance dataset proces""" +import json +import os + +from utils import cal_md5, read_by_lines, text_to_sents, write_by_lines + +enum_role = "环节" + + +def data_process(path, model="trigger", is_predict=False): + """data_process""" + + def label_data(data, start, l, _type): + """label_data""" + for i in range(start, start + l): + suffix = "B-" if i == start else "I-" + data[i] = "{}{}".format(suffix, _type) + return data + + sentences = [] + output = ["text_a"] if is_predict else ["text_a\tlabel"] + + for line in read_by_lines(path): + d_json = json.loads(line) + _id = d_json["id"] + text_a = ["," if t == " " or t == "\n" or t == "\t" else t for t in list(d_json["text"].lower())] + if is_predict: + sentences.append({"text": d_json["text"], "id": _id}) + output.append("\002".join(text_a)) + else: + if model == "trigger": + labels = ["O"] * len(text_a) + if len(d_json.get("event_list", [])) == 0: + continue + for event in d_json.get("event_list", []): + event_type = event["event_type"] + start = event["trigger_start_index"] + trigger = event["trigger"] + labels = label_data(labels, start, len(trigger), event_type) + output.append("{}\t{}".format("\002".join(text_a), "\002".join(labels))) + elif model == "role": + for event in d_json.get("event_list", []): + labels = ["O"] * len(text_a) + for arg in event["arguments"]: + role_type = arg["role"] + if role_type == enum_role: + continue + argument = arg["argument"] + start = arg["argument_start_index"] + labels = label_data(labels, start, len(argument), role_type) + output.append("{}\t{}".format("\002".join(text_a), "\002".join(labels))) + return output + + +def enum_data_process(path, is_predict=False): + """enum_data_process""" + output = ["text_a"] if is_predict else ["label\ttext_a"] + for line in read_by_lines(path): + d_json = json.loads(line) + text = d_json["text"].lower().replace("\t", " ") + if is_predict: + output.append(text) + continue + if len(d_json.get("event_list", [])) == 0: + continue + label = None + for event in d_json["event_list"]: + if event["event_type"] != "公司上市": + continue + for argument in event["arguments"]: + role_type = argument["role"] + if role_type == enum_role: + label = argument["argument"] + if label: + output.append("{}\t{}".format(label, text)) + return output + + +def schema_process(path, model="trigger"): + """schema_process""" + + def label_add(labels, _type): + """label_add""" + if "B-{}".format(_type) not in labels: + labels.extend(["B-{}".format(_type), "I-{}".format(_type)]) + return labels + + labels = [] + for line in read_by_lines(path): + d_json = json.loads(line.strip()) + if model == "trigger": + labels = label_add(labels, d_json["event_type"]) + elif model == "role": + for role in d_json["role_list"]: + if role["role"] == enum_role: + continue + labels = label_add(labels, role["role"]) + elif model == "enum": + for role in d_json["role_list"]: + if role["role"] == enum_role: + labels = role["enum_items"] + + labels.append("O") + tags = [] + for index, label in enumerate(labels): + tags.append("{}\t{}".format(index, label)) + if model == "enum": + tags = tags[:-1] + return tags + + +def marked_doc_2_sentence(doc): + """marked_doc_2_sentence""" + + def argument_in_sent(sent, argument_list, trigger): + """argument_in_sent""" + trigger_start = sent.find(trigger) + if trigger_start < 0: + return trigger_start, [], None + new_arguments, enum_argument = [], None + for argument in argument_list: + word = argument["argument"] + role_type = argument["role"] + if role_type == enum_role: + # 
special + enum_argument = argument + continue + start = sent.find(word) + if start < 0: + continue + new_arguments.append({"role": role_type, "argument": word, "argument_start_index": start}) + return trigger_start, new_arguments, enum_argument + + title = doc["title"] + text = doc["text"] + sents = text_to_sents(text) + sent_mapping_event, sents_order = {}, [] + step = 3 + batch_sents = [sents[i : i + step] for i in range(0, len(sents), step)] + if len(title) > 0: + batch_sents = [[title]] + batch_sents + for batch in batch_sents: + b_sent = " ".join(batch).replace("\n", " ").replace("\r\n", " ").replace("\r", " ").replace("\t", " ") + if b_sent in sent_mapping_event: + continue + sent_id = cal_md5(b_sent.encode("utf-8")) + sent_mapping_event[b_sent] = {"id": doc["id"], "sent_id": sent_id, "text": b_sent} + sents_order.append(b_sent) + + for event in doc.get("event_list", []): + cur_sent, trigger_start, arguments, enum_argument = "", -1, [], None + for sent in sents_order: + tri_start, argus, enum_arg = argument_in_sent(sent, event["arguments"], event["trigger"]) + if tri_start < 0: + continue + if len(argus) > len(arguments): + cur_sent, trigger_start, arguments = sent, tri_start, argus + if enum_arg: + enum_argument = enum_arg + if trigger_start >= 0 and len(arguments) > 0: + # add enum 2 event + if enum_argument: + arguments.append(enum_argument) + if "event_list" not in sent_mapping_event[cur_sent]: + sent_mapping_event[cur_sent]["event_list"] = [] + new_event = { + "arguments": arguments, + "event_type": event["event_type"], + "trigger": event["trigger"], + "trigger_start_index": trigger_start, + } + sent_mapping_event[cur_sent]["event_list"].append(new_event) + return sent_mapping_event.values() + + +def docs_data_process(path): + """docs_data_process""" + lines = read_by_lines(path) + sentences = [] + for line in lines: + d_json = json.loads(line) + sentences.extend(marked_doc_2_sentence(d_json)) + sentences = [json.dumps(s, ensure_ascii=False) for s in sentences] + return sentences + + +if __name__ == "__main__": + # schema process + print("\n=================DUEE FINANCE DATASET==============") + conf_dir = "./conf/DuEE-Fin" + if not os.path.exists(conf_dir): + os.makedirs(conf_dir) + schema_path = "./data/DuEE-fin/duee_fin_event_schema.json" + tags_trigger_path = "{}/trigger_tag.dict".format(conf_dir) + tags_role_path = "{}/role_tag.dict".format(conf_dir) + tags_enum_path = "{}/enum_tag.dict".format(conf_dir) + print("\n=================start schema process==============") + print("input path {}".format(schema_path)) + tags_trigger = schema_process(schema_path, "trigger") + write_by_lines(tags_trigger_path, tags_trigger) + print("save trigger tag {} at {}".format(len(tags_trigger), tags_trigger_path)) + tags_role = schema_process(schema_path, "role") + write_by_lines(tags_role_path, tags_role) + print("save trigger tag {} at {}".format(len(tags_role), tags_role_path)) + tags_enum = schema_process(schema_path, "enum") + write_by_lines(tags_enum_path, tags_enum) + print("save enum enum tag {} at {}".format(len(tags_enum), tags_enum_path)) + print("=================end schema process===============") + + # data process + data_dir = "./data/DuEE-Fin" + sentence_dir = "{}/sentence".format(data_dir) + trigger_save_dir = "{}/trigger".format(data_dir) + role_save_dir = "{}/role".format(data_dir) + enum_save_dir = "{}/enum".format(data_dir) + print("\n=================start data process==============") + print("\n********** start document process **********") + if not 
os.path.exists(sentence_dir): + os.makedirs(sentence_dir) + train_sent = docs_data_process("./data/DuEE-fin/duee_fin_train.json/duee_fin_train.json") + write_by_lines("{}/train.json".format(sentence_dir), train_sent) + dev_sent = docs_data_process("./data/DuEE-fin/duee_fin_dev.json/duee_fin_dev.json") + write_by_lines("{}/dev.json".format(sentence_dir), dev_sent) + test_sent = docs_data_process("./data/DuEE-fin/duee_fin_test2.json/duee_fin_test2.json") + write_by_lines("{}/test.json".format(sentence_dir), test_sent) + print("train {} dev {} test {}".format(len(train_sent), len(dev_sent), len(test_sent))) + print("********** end document process **********") + + print("\n********** start sentence process **********") + print("\n----trigger------for dir {} to {}".format(sentence_dir, trigger_save_dir)) + if not os.path.exists(trigger_save_dir): + os.makedirs(trigger_save_dir) + train_tri = data_process("{}/train.json".format(sentence_dir), "trigger") + write_by_lines("{}/train.tsv".format(trigger_save_dir), train_tri) + dev_tri = data_process("{}/dev.json".format(sentence_dir), "trigger") + write_by_lines("{}/dev.tsv".format(trigger_save_dir), dev_tri) + test_tri = data_process("{}/test.json".format(sentence_dir), "trigger") + write_by_lines("{}/test.tsv".format(trigger_save_dir), test_tri) + print("train {} dev {} test {}".format(len(train_tri), len(dev_tri), len(test_tri))) + + print("\n----role------for dir {} to {}".format(sentence_dir, role_save_dir)) + if not os.path.exists(role_save_dir): + os.makedirs(role_save_dir) + train_role = data_process("{}/train.json".format(sentence_dir), "role") + write_by_lines("{}/train.tsv".format(role_save_dir), train_role) + dev_role = data_process("{}/dev.json".format(sentence_dir), "role") + write_by_lines("{}/dev.tsv".format(role_save_dir), dev_role) + test_role = data_process("{}/test.json".format(sentence_dir), "role") + write_by_lines("{}/test.tsv".format(role_save_dir), test_role) + print("train {} dev {} test {}".format(len(train_role), len(dev_role), len(test_role))) + + print("\n----enum------for dir {} to {}".format(sentence_dir, enum_save_dir)) + if not os.path.exists(enum_save_dir): + os.makedirs(enum_save_dir) + trian_enum = enum_data_process("{}/train.json".format(sentence_dir)) + write_by_lines("{}/train.tsv".format(enum_save_dir), trian_enum) + dev_enum = enum_data_process("{}/dev.json".format(sentence_dir)) + write_by_lines("{}/dev.tsv".format(enum_save_dir), dev_enum) + test_enum = enum_data_process("{}/test.json".format(sentence_dir)) + write_by_lines("{}/test.tsv".format(enum_save_dir), test_enum) + print("train {} dev {} test {}".format(len(trian_enum), len(dev_enum), len(test_enum))) + print("********** end sentence process **********") + print("=================end data process==============") diff --git a/examples/information_extraction/DuEE/duee_fin_postprocess.py b/examples/information_extraction/DuEE/duee_fin_postprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..544bf88efcb10de6d8fbbc0e88853ab02e052db8 --- /dev/null +++ b/examples/information_extraction/DuEE/duee_fin_postprocess.py @@ -0,0 +1,139 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""duee finance data predict post-process""" + +import argparse +import json + +from utils import extract_result, read_by_lines, write_by_lines + +enum_event_type = "公司上市" +enum_role = "环节" + + +def event_normalization(doc): + """event_merge""" + for event in doc.get("event_list", []): + argument_list = [] + argument_set = set() + for arg in event["arguments"]: + arg_str = "{}-{}".format(arg["role"], arg["argument"]) + if arg_str not in argument_set: + argument_list.append(arg) + argument_set.add(arg_str) + event["arguments"] = argument_list + + event_list = sorted(doc.get("event_list", []), key=lambda x: len(x["arguments"]), reverse=True) + new_event_list = [] + for event in event_list: + event_type = event["event_type"] + event_argument_set = set() + for arg in event["arguments"]: + event_argument_set.add("{}-{}".format(arg["role"], arg["argument"])) + flag = True + for new_event in new_event_list: + if event_type != new_event["event_type"]: + continue + new_event_argument_set = set() + for arg in new_event["arguments"]: + new_event_argument_set.add("{}-{}".format(arg["role"], arg["argument"])) + if len(event_argument_set & new_event_argument_set) == len(new_event_argument_set): + flag = False + if flag: + new_event_list.append(event) + doc["event_list"] = new_event_list + return doc + + +def predict_data_process(trigger_file, role_file, enum_file, schema_file, save_path): + """predict_data_process""" + pred_ret = [] + trigger_data = read_by_lines(trigger_file) + role_data = read_by_lines(role_file) + enum_data = read_by_lines(enum_file) + schema_data = read_by_lines(schema_file) + print("trigger predict {} load from {}".format(len(trigger_data), trigger_file)) + print("role predict {} load from {}".format(len(role_data), role_file)) + print("enum predict {} load from {}".format(len(enum_data), enum_file)) + print("schema {} load from {}".format(len(schema_data), schema_file)) + + schema, sent_role_mapping, sent_enum_mapping = {}, {}, {} + for s in schema_data: + d_json = json.loads(s) + schema[d_json["event_type"]] = [r["role"] for r in d_json["role_list"]] + + # role depends on id and sent_id + for d in role_data: + d_json = json.loads(d) + r_ret = extract_result(d_json["text"], d_json["pred"]["labels"]) + role_ret = {} + for r in r_ret: + role_type = r["type"] + if role_type not in role_ret: + role_ret[role_type] = [] + role_ret[role_type].append("".join(r["text"])) + _id = "{}\t{}".format(d_json["id"], d_json["sent_id"]) + sent_role_mapping[_id] = role_ret + + # process the enum_role data + for d in enum_data: + d_json = json.loads(d) + _id = "{}\t{}".format(d_json["id"], d_json["sent_id"]) + label = d_json["pred"]["label"] + sent_enum_mapping[_id] = label + + # process trigger data + for d in trigger_data: + d_json = json.loads(d) + t_ret = extract_result(d_json["text"], d_json["pred"]["labels"]) + pred_event_types = list(set([t["type"] for t in t_ret])) + event_list = [] + _id = "{}\t{}".format(d_json["id"], d_json["sent_id"]) + for event_type in pred_event_types: + role_list = schema[event_type] + arguments = [] + for role_type, ags in sent_role_mapping[_id].items(): + if 
role_type not in role_list: + continue + for arg in ags: + arguments.append({"role": role_type, "argument": arg}) + # 特殊处理环节 + if event_type == enum_event_type: + arguments.append({"role": enum_role, "argument": sent_enum_mapping[_id]}) + event = {"event_type": event_type, "arguments": arguments, "text": d_json["text"]} + event_list.append(event) + pred_ret.append( + {"id": d_json["id"], "sent_id": d_json["sent_id"], "text": d_json["text"], "event_list": event_list} + ) + doc_pred = {} + for d in pred_ret: + if d["id"] not in doc_pred: + doc_pred[d["id"]] = {"id": d["id"], "event_list": []} + doc_pred[d["id"]]["event_list"].extend(d["event_list"]) + + # unfiy the all prediction results and save them + doc_pred = [json.dumps(event_normalization(r), ensure_ascii=False) for r in doc_pred.values()] + print("submit data {} save to {}".format(len(doc_pred), save_path)) + write_by_lines(save_path, doc_pred) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Official evaluation script for DuEE version 1.0") + parser.add_argument("--trigger_file", help="trigger model predict data path", required=True) + parser.add_argument("--role_file", help="role model predict data path", required=True) + parser.add_argument("--enum_file", help="enum model predict data path", required=True) + parser.add_argument("--schema_file", help="schema file path", required=True) + parser.add_argument("--save_path", help="save file path", required=True) + args = parser.parse_args() + predict_data_process(args.trigger_file, args.role_file, args.enum_file, args.schema_file, args.save_path) diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/ee.png b/examples/information_extraction/DuEE/pictures/DuEE-Fin/ee.png new file mode 100644 index 0000000000000000000000000000000000000000..2d2f6ce67170d302a62842de19d8154db3fa1f70 Binary files /dev/null and b/examples/information_extraction/DuEE/pictures/DuEE-Fin/ee.png differ diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/enum_model.png b/examples/information_extraction/DuEE/pictures/DuEE-Fin/enum_model.png new file mode 100644 index 0000000000000000000000000000000000000000..d52fddb93b0ea43d36349becd2f499f9019d43f9 Binary files /dev/null and b/examples/information_extraction/DuEE/pictures/DuEE-Fin/enum_model.png differ diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/role_model.png b/examples/information_extraction/DuEE/pictures/DuEE-Fin/role_model.png new file mode 100644 index 0000000000000000000000000000000000000000..97ac8ec639a41cd7942506f2832f9875d5849e6d Binary files /dev/null and b/examples/information_extraction/DuEE/pictures/DuEE-Fin/role_model.png differ diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/trigger_model.png b/examples/information_extraction/DuEE/pictures/DuEE-Fin/trigger_model.png new file mode 100644 index 0000000000000000000000000000000000000000..9332eb56a321d9866c882e37b1ee4eb31aa98198 Binary files /dev/null and b/examples/information_extraction/DuEE/pictures/DuEE-Fin/trigger_model.png differ diff --git a/examples/information_extraction/DuEE/run_classifier.sh b/examples/information_extraction/DuEE/run_classifier.sh new file mode 100644 index 0000000000000000000000000000000000000000..a75fcaef4635f4a4cf200de0f3928af0060c81e4 --- /dev/null +++ b/examples/information_extraction/DuEE/run_classifier.sh @@ -0,0 +1,53 @@ +data_dir=${1} +conf_path=${2} +ckpt_dir=${3} +predict_data=${4} +learning_rate=${5} +is_train=${6} +max_seq_len=${7} +batch_size=${8} +epoch=${9} 
+pred_save_path=${10} + + +if [ "$is_train" = True ]; then + unset CUDA_VISIBLE_DEVICES + python -m paddle.distributed.launch --gpus "0" classifier.py \ + --num_epoch ${epoch} \ + --learning_rate 5e-5 \ + --tag_path ${conf_path} \ + --train_data ${data_dir}/train.tsv \ + --dev_data ${data_dir}/dev.tsv \ + --test_data ${data_dir}/test.tsv \ + --predict_data ${predict_data} \ + --do_train True \ + --do_predict False \ + --max_seq_len ${max_seq_len} \ + --batch_size ${batch_size} \ + --skip_step 1 \ + --valid_step 5 \ + --checkpoints ${ckpt_dir} \ + --init_ckpt ${ckpt_dir}/best.pdparams \ + --predict_save_path ${pred_save_path} \ + --device gpu +else + export CUDA_VISIBLE_DEVICES=0 + python classifier.py \ + --num_epoch ${epoch} \ + --learning_rate 5e-5 \ + --tag_path ${conf_path} \ + --train_data ${data_dir}/train.tsv \ + --dev_data ${data_dir}/dev.tsv \ + --test_data ${data_dir}/test.tsv \ + --predict_data ${predict_data} \ + --do_train False \ + --do_predict True \ + --max_seq_len ${max_seq_len} \ + --batch_size ${batch_size} \ + --skip_step 1 \ + --valid_step 1 \ + --checkpoints ${ckpt_dir} \ + --init_ckpt ${ckpt_dir}/best.pdparams \ + --predict_save_path ${pred_save_path} \ + --device gpu +fi diff --git a/examples/information_extraction/DuEE/run_duee_1.sh b/examples/information_extraction/DuEE/run_duee_1.sh new file mode 100644 index 0000000000000000000000000000000000000000..4d599594123c911a7d069255bd59ae972fc619c4 --- /dev/null +++ b/examples/information_extraction/DuEE/run_duee_1.sh @@ -0,0 +1,60 @@ +dataset_name=DuEE1.0 +data_dir=./data/${dataset_name} +conf_dir=./conf/${dataset_name} +ckpt_dir=./ckpt/${dataset_name} +submit_data_path=./submit/test_duee_1.json +pred_data=${data_dir}/duee_test1.json # 换其他数据,需要修改它 + +learning_rate=5e-5 +max_seq_len=300 +batch_size=16 +epoch=20 + +echo -e "check and create directory" +dir_list=(./ckpt ${ckpt_dir} ./submit) +for item in ${dir_list[*]} +do + if [ ! 
-d ${item} ]; then + mkdir ${item} + echo "create dir * ${item} *" + else + echo "dir ${item} exist" + fi +done + +process_name=${1} + +run_sequence_labeling_model(){ + model=${1} + is_train=${2} + pred_save_path=${ckpt_dir}/${model}/test_pred.json + sh run_sequence_labeling.sh ${data_dir}/${model} ${conf_dir}/${model}_tag.dict ${ckpt_dir}/${model} ${pred_data} ${learning_rate} ${is_train} ${max_seq_len} ${batch_size} ${epoch} ${pred_save_path} +} + +if [ ${process_name} == data_prepare ]; then + echo -e "\nstart ${dataset_name} data prepare" + python duee_1_data_prepare.py + echo -e "end ${dataset_name} data prepare" +elif [ ${process_name} == trigger_train ]; then + echo -e "\nstart ${dataset_name} trigger train" + run_sequence_labeling_model trigger True + echo -e "end ${dataset_name} trigger train" +elif [ ${process_name} == trigger_predict ]; then + echo -e "\nstart ${dataset_name} trigger predict" + run_sequence_labeling_model trigger False + echo -e "end ${dataset_name} trigger predict" +elif [ ${process_name} == role_train ]; then + echo -e "\nstart ${dataset_name} role train" + run_sequence_labeling_model role True + echo -e "end ${dataset_name} role train" +elif [ ${process_name} == role_predict ]; then + echo -e "\nstart ${dataset_name} role predict" + run_sequence_labeling_model role False + echo -e "end ${dataset_name} role predict" +elif [ ${process_name} == pred_2_submit ]; then + echo -e "\nstart ${dataset_name} predict data merge to submit fotmat" + python duee_1_postprocess.py --trigger_file ${ckpt_dir}/trigger/test_pred.json --role_file ${ckpt_dir}/role/test_pred.json --schema_file ${conf_dir}/event_schema.json --save_path ${submit_data_path} + echo -e "end ${dataset_name} role predict data merge" +else + echo "no process name ${process_name}" +fi \ No newline at end of file diff --git a/examples/information_extraction/DuEE/run_duee_fin.sh b/examples/information_extraction/DuEE/run_duee_fin.sh new file mode 100644 index 0000000000000000000000000000000000000000..5b0b7b12e4a4474c5dedc96be3953f3162dd652e --- /dev/null +++ b/examples/information_extraction/DuEE/run_duee_fin.sh @@ -0,0 +1,75 @@ +dataset_name=DuEE-Fin +data_dir=./data/${dataset_name} +conf_dir=./conf/${dataset_name} +ckpt_dir=./ckpt/${dataset_name} +submit_data_path=./submit/test_duee_fin.json +pred_data=${data_dir}/sentence/test.json # 换其他数据,需要修改它 + +learning_rate=5e-5 +max_seq_len=300 +batch_size=16 +epoch=20 + +echo -e "check and create directory" +dir_list=(./ckpt ${ckpt_dir} ./submit) +for item in ${dir_list[*]} +do + if [ ! 
-d ${item} ]; then + mkdir ${item} + echo "create dir * ${item} *" + else + echo "dir ${item} exist" + fi +done + +process_name=${1} + +run_sequence_labeling_model(){ + model=${1} + is_train=${2} + pred_save_path=${ckpt_dir}/${model}/test_pred.json + sh run_sequence_labeling.sh ${data_dir}/${model} ${conf_dir}/${model}_tag.dict ${ckpt_dir}/${model} ${pred_data} ${learning_rate} ${is_train} ${max_seq_len} ${batch_size} ${epoch} ${pred_save_path} +} + +run_classifier_model(){ + model=${1} + is_train=${2} + pred_save_path=${ckpt_dir}/${model}/test_pred.json + sh run_classifier.sh ${data_dir}/${model} ${conf_dir}/${model}_tag.dict ${ckpt_dir}/${model} ${pred_data} ${learning_rate} ${is_train} ${max_seq_len} ${batch_size} ${epoch} ${pred_save_path} +} + +if [ ${process_name} == data_prepare ]; then + echo -e "\nstart ${dataset_name} data prepare" + python duee_fin_data_prepare.py + echo -e "end ${dataset_name} data prepare" +elif [ ${process_name} == trigger_train ]; then + echo -e "\nstart ${dataset_name} trigger train" + run_sequence_labeling_model trigger True + echo -e "end ${dataset_name} trigger train" +elif [ ${process_name} == trigger_predict ]; then + echo -e "\nstart ${dataset_name} trigger predict" + run_sequence_labeling_model trigger False + echo -e "end ${dataset_name} trigger predict" +elif [ ${process_name} == role_train ]; then + echo -e "\nstart ${dataset_name} role train" + run_sequence_labeling_model role True + echo -e "end ${dataset_name} role train" +elif [ ${process_name} == role_predict ]; then + echo -e "\nstart ${dataset_name} role predict" + run_sequence_labeling_model role False + echo -e "end ${dataset_name} role predict" +elif [ ${process_name} == enum_train ]; then + echo -e "\nstart ${dataset_name} enum train" + run_classifier_model enum True + echo -e "end ${dataset_name} enum train" +elif [ ${process_name} == enum_predict ]; then + echo -e "\nstart ${dataset_name} enum predict" + run_classifier_model enum False + echo -e "end ${dataset_name} enum predict" +elif [ ${process_name} == pred_2_submit ]; then + echo -e "\nstart ${dataset_name} predict data merge to submit fotmat" + python duee_fin_postprocess.py --trigger_file ${ckpt_dir}/trigger/test_pred.json --role_file ${ckpt_dir}/role/test_pred.json --enum_file ${ckpt_dir}/enum/test_pred.json --schema_file ${conf_dir}/event_schema.json --save_path ${submit_data_path} + echo -e "end ${dataset_name} role predict data merge" +else + echo "no process name ${process_name}" +fi \ No newline at end of file diff --git a/examples/information_extraction/DuEE/run_sequence_labeling.sh b/examples/information_extraction/DuEE/run_sequence_labeling.sh new file mode 100644 index 0000000000000000000000000000000000000000..05f3e337f3a9cf125b4f95adfa1a5931050a195b --- /dev/null +++ b/examples/information_extraction/DuEE/run_sequence_labeling.sh @@ -0,0 +1,53 @@ + +data_dir=$1 +conf_path=$2 +ckpt_dir=$3 +predict_data=$4 +learning_rate=$5 +is_train=$6 +max_seq_len=$7 +batch_size=$8 +epoch=${9} +pred_save_path=${10} + +if [ "$is_train" = True ]; then + unset CUDA_VISIBLE_DEVICES + python -m paddle.distributed.launch --gpus "0" sequence_labeling.py \ + --num_epoch ${epoch} \ + --learning_rate ${learning_rate} \ + --tag_path ${conf_path} \ + --train_data ${data_dir}/train.tsv \ + --dev_data ${data_dir}/dev.tsv \ + --test_data ${data_dir}/test.tsv \ + --predict_data ${predict_data} \ + --do_train True \ + --do_predict False \ + --max_seq_len ${max_seq_len} \ + --batch_size ${batch_size} \ + --skip_step 10 \ + --valid_step 50 \ + 
--checkpoints ${ckpt_dir} \ + --init_ckpt ${ckpt_dir}/best.pdparams \ + --predict_save_path ${pred_save_path} \ + --device gpu +else + export CUDA_VISIBLE_DEVICES=0 + python sequence_labeling.py \ + --num_epoch ${epoch} \ + --learning_rate ${learning_rate} \ + --tag_path ${conf_path} \ + --train_data ${data_dir}/train.tsv \ + --dev_data ${data_dir}/dev.tsv \ + --test_data ${data_dir}/test.tsv \ + --predict_data ${predict_data} \ + --do_train False \ + --do_predict True \ + --max_seq_len ${max_seq_len} \ + --batch_size ${batch_size} \ + --skip_step 10 \ + --valid_step 50 \ + --checkpoints ${ckpt_dir} \ + --init_ckpt ${ckpt_dir}/best.pdparams \ + --predict_save_path ${pred_save_path} \ + --device gpu +fi diff --git a/examples/information_extraction/DuEE/sequence_labeling.py b/examples/information_extraction/DuEE/sequence_labeling.py new file mode 100644 index 0000000000000000000000000000000000000000..c2c596d00eab002dac447d906508379bfe018dc6 --- /dev/null +++ b/examples/information_extraction/DuEE/sequence_labeling.py @@ -0,0 +1,312 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +sequence labeling +""" +import argparse +import ast +import json +import os +import random +import warnings +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from utils import load_dict, read_by_lines, write_by_lines + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer + +warnings.filterwarnings("ignore") + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--num_epoch", type=int, default=3, help="Number of epoches for fine-tuning.") +parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.") +parser.add_argument("--tag_path", type=str, default=None, help="tag set path") +parser.add_argument("--train_data", type=str, default=None, help="train data") +parser.add_argument("--dev_data", type=str, default=None, help="dev data") +parser.add_argument("--test_data", type=str, default=None, help="test data") +parser.add_argument("--predict_data", type=str, default=None, help="predict data") +parser.add_argument("--do_train", type=ast.literal_eval, default=True, help="do train") +parser.add_argument("--do_predict", type=ast.literal_eval, default=True, help="do predict") +parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay rate for L2 regularizer.") +parser.add_argument("--warmup_proportion", type=float, default=0.1, help="Warmup proportion params for warmup strategy") +parser.add_argument("--max_seq_len", type=int, default=512, help="Number of words of the longest seqence.") +parser.add_argument("--valid_step", type=int, default=100, help="validation step") +parser.add_argument("--skip_step", type=int, default=20, help="skip step") +parser.add_argument("--batch_size", 
type=int, default=32, help="Total examples' number in batch for training.") +parser.add_argument("--checkpoints", type=str, default=None, help="Directory to model checkpoint") +parser.add_argument("--init_ckpt", type=str, default=None, help="already pretraining model checkpoint") +parser.add_argument("--predict_save_path", type=str, default=None, help="predict data save path") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable. + + +def set_seed(args): + """sets random seed""" + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, num_label, data_loader): + """evaluate""" + model.eval() + metric.reset() + losses = [] + for input_ids, seg_ids, seq_lens, labels in data_loader: + logits = model(input_ids, seg_ids) + loss = paddle.mean( + criterion(logits.reshape([-1, num_label]), labels.reshape([-1]))) + losses.append(loss.numpy()) + preds = paddle.argmax(logits, axis=-1) + n_infer, n_label, n_correct = metric.compute(None, seq_lens, preds, + labels) + metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy()) + precision, recall, f1_score = metric.accumulate() + avg_loss = np.mean(losses) + model.train() + + return precision, recall, f1_score, avg_loss + + +def convert_example_to_feature(example, + tokenizer, + label_vocab=None, + max_seq_len=512, + no_entity_label="O", + ignore_label=-1, + is_test=False): + tokens, labels = example + tokenized_input = tokenizer(tokens, + return_length=True, + is_split_into_words='token', + max_seq_len=max_seq_len) + + input_ids = tokenized_input['input_ids'] + token_type_ids = tokenized_input['token_type_ids'] + seq_len = tokenized_input['seq_len'] + + if is_test: + return input_ids, token_type_ids, seq_len + elif label_vocab is not None: + labels = labels[:(max_seq_len - 2)] + encoded_label = [no_entity_label] + labels + [no_entity_label] + encoded_label = [label_vocab[x] for x in encoded_label] + return input_ids, token_type_ids, seq_len, encoded_label + + +class DuEventExtraction(paddle.io.Dataset): + """DuEventExtraction""" + + def __init__(self, data_path, tag_path): + self.label_vocab = load_dict(tag_path) + self.word_ids = [] + self.label_ids = [] + with open(data_path, 'r', encoding='utf-8') as fp: + # skip the head line + next(fp) + for line in fp.readlines(): + words, labels = line.strip('\n').split('\t') + words = words.split('\002') + labels = labels.split('\002') + self.word_ids.append(words) + self.label_ids.append(labels) + self.label_num = max(self.label_vocab.values()) + 1 + + def __len__(self): + return len(self.word_ids) + + def __getitem__(self, index): + return self.word_ids[index], self.label_ids[index] + + +def do_train(): + paddle.set_device(args.device) + world_size = paddle.distributed.get_world_size() + rank = paddle.distributed.get_rank() + if world_size > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + no_entity_label = "O" + ignore_label = -1 + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + label_map = load_dict(args.tag_path) + model = AutoModelForTokenClassification.from_pretrained( + "ernie-3.0-medium-zh", num_classes=len(label_map)) + model = paddle.DataParallel(model) + + print("============start train==========") + train_ds = DuEventExtraction(args.train_data, 
args.tag_path) + dev_ds = DuEventExtraction(args.dev_data, args.tag_path) + + trans_func = partial(convert_example_to_feature, + tokenizer=tokenizer, + label_vocab=train_ds.label_vocab, + max_seq_len=args.max_seq_len, + no_entity_label=no_entity_label, + ignore_label=ignore_label, + is_test=False) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype='int32' + ), # input ids + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype='int32' + ), # token type ids + Stack(dtype='int64'), # sequence lens + Pad(axis=0, pad_val=ignore_label, dtype='int64') # labels + ): fn(list(map(trans_func, samples))) + + batch_sampler = paddle.io.DistributedBatchSampler( + train_ds, batch_size=args.batch_size, shuffle=True) + train_loader = paddle.io.DataLoader(dataset=train_ds, + batch_sampler=batch_sampler, + collate_fn=batchify_fn) + dev_loader = paddle.io.DataLoader(dataset=dev_ds, + batch_size=args.batch_size, + collate_fn=batchify_fn) + + num_training_steps = len(train_loader) * args.num_epoch + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + metric = ChunkEvaluator(label_list=train_ds.label_vocab.keys(), + suffix=False) + criterion = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label) + + step, best_f1 = 0, 0.0 + model.train() + for epoch in range(args.num_epoch): + for idx, (input_ids, token_type_ids, seq_lens, + labels) in enumerate(train_loader): + logits = model(input_ids, + token_type_ids).reshape([-1, train_ds.label_num]) + loss = paddle.mean(criterion(logits, labels.reshape([-1]))) + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_item = loss.item() + if step > 0 and step % args.skip_step == 0 and rank == 0: + print( + f'train epoch: {epoch} - step: {step} (total: {num_training_steps}) - loss: {loss_item:.6f}' + ) + if step > 0 and step % args.valid_step == 0 and rank == 0: + p, r, f1, avg_loss = evaluate(model, criterion, metric, + len(label_map), dev_loader) + print(f'dev step: {step} - loss: {avg_loss:.5f}, precision: {p:.5f}, recall: {r:.5f}, f1: {f1:.5f} current best {best_f1:.5f}') + if f1 > best_f1: + best_f1 = f1 + print(f'==============================================save best model best performerence {best_f1:5f}') + paddle.save(model.state_dict(), f'{args.checkpoints}/best.pdparams') + step += 1 + + # save the final model + if rank == 0: + paddle.save(model.state_dict(), + '{}/final.pdparams'.format(args.checkpoints)) + + +def do_predict(): + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + label_map = load_dict(args.tag_path) + id2label = {val: key for key, val in label_map.items()} + model = AutoModelForTokenClassification.from_pretrained( + "ernie-3.0-medium-zh", num_classes=len(label_map)) + + print("============start predict==========") + if not args.init_ckpt or not os.path.isfile(args.init_ckpt): + raise Exception("init checkpoints {} not exist".format(args.init_ckpt)) + else: + state_dict = paddle.load(args.init_ckpt) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.init_ckpt) + + # load data from predict file + sentences = 
read_by_lines(args.predict_data) # origin data format + sentences = [json.loads(sent) for sent in sentences] + + encoded_inputs_list = [] + for sent in sentences: + sent = sent["text"].replace(" ", "\002") + input_ids, token_type_ids, seq_len = convert_example_to_feature( + [list(sent), []], + tokenizer, + max_seq_len=args.max_seq_len, + is_test=True) + encoded_inputs_list.append((input_ids, token_type_ids, seq_len)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype='int32' + ), # input_ids + Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token], dtype='int32' + ), # token_type_ids + Stack(dtype='int64') # sequence lens + ): fn(samples) + # Separates data into some batches. + batch_encoded_inputs = [ + encoded_inputs_list[i:i + args.batch_size] + for i in range(0, len(encoded_inputs_list), args.batch_size) + ] + results = [] + model.eval() + for batch in batch_encoded_inputs: + input_ids, token_type_ids, seq_lens = batchify_fn(batch) + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + logits = model(input_ids, token_type_ids) + probs = F.softmax(logits, axis=-1) + probs_ids = paddle.argmax(probs, -1).numpy() + probs = probs.numpy() + for p_list, p_ids, seq_len in zip(probs.tolist(), probs_ids.tolist(), + seq_lens.tolist()): + prob_one = [ + p_list[index][pid] + for index, pid in enumerate(p_ids[1:seq_len - 1]) + ] + label_one = [id2label[pid] for pid in p_ids[1:seq_len - 1]] + results.append({"probs": prob_one, "labels": label_one}) + assert len(results) == len(sentences) + for sent, ret in zip(sentences, results): + sent["pred"] = ret + sentences = [json.dumps(sent, ensure_ascii=False) for sent in sentences] + write_by_lines(args.predict_save_path, sentences) + print("save data {} to {}".format(len(sentences), args.predict_save_path)) + + +if __name__ == '__main__': + + if args.do_train: + do_train() + elif args.do_predict: + do_predict() diff --git a/examples/information_extraction/DuEE/utils.py b/examples/information_extraction/DuEE/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..e4c912ace48e241af88f3709660d27dc4ccf4caf --- /dev/null +++ b/examples/information_extraction/DuEE/utils.py @@ -0,0 +1,102 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
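+# DuEE 基线的工具函数:按行读写数据文件、按中文标点切分句子、加载标签词典,
+# 以及将 BIO 标签序列解码为 {"start", "text", "type"} 形式的实体片段。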
+ +import hashlib + + +def cal_md5(str): + """calculate string md5""" + str = str.decode("utf-8", "ignore").encode("utf-8", "ignore") + return hashlib.md5(str).hexdigest() + + +def read_by_lines(path): + """read the data by line""" + result = list() + with open(path, "r", encoding="utf8") as infile: + for line in infile: + result.append(line.strip()) + return result + + +def write_by_lines(path, data): + """write the data""" + with open(path, "w", encoding="utf8") as outfile: + [outfile.write(d + "\n") for d in data] + + +def text_to_sents(text): + """text_to_sents""" + deliniter_symbols = ["。", "?", "!"] + paragraphs = text.split("\n") + ret = [] + for para in paragraphs: + if para == "": + continue + sents = [""] + for s in para: + sents[-1] += s + if s in deliniter_symbols: + sents.append("") + if sents[-1] == "": + sents = sents[:-1] + ret.extend(sents) + return ret + + +def load_dict(dict_path): + """load_dict""" + vocab = {} + for line in open(dict_path, "r", encoding="utf-8"): + value, key = line.strip("\n").split("\t") + vocab[key] = int(value) + return vocab + + +def extract_result(text, labels): + """extract_result""" + ret, is_start, cur_type = [], False, None + if len(text) != len(labels): + # 韩文回导致label 比 text要长 + labels = labels[: len(text)] + for i, label in enumerate(labels): + if label != "O": + _type = label[2:] + if label.startswith("B-"): + is_start = True + cur_type = _type + ret.append({"start": i, "text": [text[i]], "type": _type}) + elif _type != cur_type: + """ + # 如果是没有B-开头的,则不要这部分数据 + cur_type = None + is_start = False + """ + cur_type = _type + is_start = True + ret.append({"start": i, "text": [text[i]], "type": _type}) + elif is_start: + ret[-1]["text"].append(text[i]) + else: + cur_type = None + is_start = False + else: + cur_type = None + is_start = False + return ret + + +if __name__ == "__main__": + s = "xxdedewd" + print(cal_md5(s.encode("utf-8"))) diff --git a/examples/information_extraction/DuIE/README.md b/examples/information_extraction/DuIE/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c40cd28a3e156bac7b6b3295ee63147410e24e8d --- /dev/null +++ b/examples/information_extraction/DuIE/README.md @@ -0,0 +1,142 @@ +# LIC2021 DuIE 关系抽取基线 + +信息抽取旨在从非结构化自然语言文本中提取结构化知识,如实体、关系、事件等。关系抽取的目标是对于给定的自然语言句子,根据预先定义的schema集合,抽取出所有满足schema约束的SPO三元组。schema定义了关系P以及其对应的主体S和客体O的类别。 +本基线系统基于预训练语言模型ERNIE设计了结构化的标注策略,可以实现多条、交叠的SPO抽取。 + +该示例展示了如何使用PaddleNLP快速复现[LIC2021关系抽取比赛](http://lic2021.ccf.org.cn/)基线并进阶优化模型基线。 + +同时,我们提供了该示例在线运行展示教程: +[PaddleNLP实战——LIC2021关系抽取任务基线](https://aistudio.baidu.com/aistudio/projectdetail/1639963) + + +## 目录结构 + +以下是本项目主要目录结构及说明: + +```text +DuIE/ +├── data_loader.py # 加载数据 +├── extract_chinese_and_punct.py # 文本数据预处理 +├── README.md # 文档说明 +├── re_official_evaluation.py # 比赛评价脚本 +├── run_duie.py # 模型训练脚本 +├── train.sh # 启动脚本 +└── utils.py # 效能函数 +``` + +## 关系抽取基线 + +针对 DuIE2.0 任务中多条、交叠SPO这一抽取目标,比赛对标准的 'BIO' 标注进行了扩展。 +对于每个 token,根据其在实体span中的位置(包括B、I、O三种),我们为其打上三类标签,并且根据其所参与构建的predicate种类,将 B 标签进一步区分。给定 schema 集合,对于 N 种不同 predicate,以及头实体/尾实体两种情况,我们设计对应的共 2*N 种 B 标签,再合并 I 和 O 标签,故每个 token 一共有 (2*N+2) 个标签,如下图所示。 + +
+<img src="./images/tagging_strategy.png" alt="标注策略" />
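+
+下面给出一个最小示意(假设 schema 文件即本目录提供的 data/predicate2id.json,其中包含 "O"、"I" 两个标签以及 N 个关系类别),演示 (2*N+2) 个标签总数的计算方式,与 run_duie.py 中 `num_classes = (len(label_map) - 2) * 2 + 2` 的实现保持一致:
+
+```python
+import json
+
+# predicate2id.json:除 "O" 与 "I" 外,其余 key 均为关系(predicate)标签
+with open("./data/predicate2id.json", "r", encoding="utf8") as fp:
+    label_map = json.load(fp)
+
+num_predicates = len(label_map) - 2       # 去掉 "O"、"I" 后得到 N
+num_classes = 2 * num_predicates + 2      # 头实体/尾实体各 N 种 B 标签,再加 I、O
+print(num_predicates, num_classes)        # 对于本例提供的 schema,输出 55 112
+```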
+ +### 评价方法 + +对测试集上参评系统输出的SPO结果和人工标注的SPO结果进行精准匹配,采用F1值作为评价指标。注意,对于复杂O值类型的SPO,必须所有槽位都精确匹配才认为该SPO抽取正确。针对部分文本中存在实体别名的问题,使用百度知识图谱的别名词典来辅助评测。F1值的计算方式如下: + +F1 = (2 * P * R) / (P + R),其中 + +- P = 测试集所有句子中预测正确的SPO个数 / 测试集所有句子中预测出的SPO个数 +- R = 测试集所有句子中预测正确的SPO个数 / 测试集所有句子中人工标注的SPO个数 + +### 快速复现基线Step1:构建模型 + +该任务可以看作一个序列标注任务,所以基线模型采用的是ERNIE序列标注模型。 + +**PaddleNLP提供了ERNIE预训练模型常用序列标注模型,可以通过指定模型名字完成一键加载.PaddleNLP为了方便用户处理数据,内置了对于各个预训练模型对应的Tokenizer,可以完成文本token化,转token ID,文本长度截断等操作。** + +```python +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer + +model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=(len(label_map) - 2) * 2 + 2) +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") +``` + +文本数据处理直接调用tokenizer即可输出模型所需输入数据。 + +```python +inputs = tokenizer(text="请输入测试样例", max_seq_len=20) +# {'input_ids': [1, 647, 789, 109, 558, 525, 314, 656, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], +# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], +# 'seq_len': 9} +``` + +### 快速复现基线Step2:加载并处理 + + + +从比赛官网[下载数据集](https://aistudio.baidu.com/aistudio/competition/detail/65),解压存放于data/目录下并重命名为train_data.json, dev_data.json, test_data.json. + +我们可以加载自定义数据集。通过继承[`paddle.io.Dataset`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/Dataset_cn.html#dataset),自定义实现`__getitem__` 和 `__len__`两个方法。 + + +如下代码已完成加载数据集操作: + +```python +train_dataset = DuIEDataset.from_file( + os.path.join(args.data_path, 'train_data.json'), + tokenizer, + args.max_seq_length, + True) +test_dataset = DuIEDataset.from_file( + os.path.join(args.data_path, 'dev_data.json'), + tokenizer, + args.max_seq_length, + True) +``` + +### 快速复现基线Step3:定义损失函数和优化器,开始训练 + +在该基线上,我们选择均方误差作为损失函数,使用[`paddle.optimizer.AdamW`](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/optimizer/adamw/AdamW_cn.html#adamw)作为优化器。 + + +启动训练: +```shell +sh train.sh +``` + +在训练过程中,模型保存在当前目录checkpoints文件夹下。同时在训练的同时使用官方评测脚本进行评估,输出P/R/F1指标。 +在验证集上F1可以达到69.42。 + + +### 快速复现基线Step4:提交预测结果 + +将训练保存的模型加载后进行预测 + +```shell +sh predict.sh +``` + +预测结果会被保存在data/predictions.json,data/predictions.json.zip,其格式与原数据集文件一致。 + +之后可以使用官方评估脚本评估训练模型在dev_data.json上的效果。如: + +```shell +python re_official_evaluation.py --golden_file=dev_data.json --predict_file=predicitons.json.zip [--alias_file alias_dict] +``` +输出指标为Precision, Recall 和 F1,Alias file包含了合法的实体别名,最终评测的时候会使用,这里不予提供。 + +之后在test_data.json上预测,然后预测结果(.zip文件)至[评测网站](http://aistudio-bce.bcc-bdbl.baidu.com/aistudio/competition/detail/141)。 + + +## 进阶优化基线效果 + +基线采用的预训练模型为ERNIE,PaddleNLP提供了丰富的预训练模型,如BERT,RoBERTa,Electra,XLNet等 +参考[PaddleNLP预训练模型介绍](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) + +如可以选择RoBERTa large中文模型优化模型效果,只需更换模型和tokenizer即可无缝衔接。 + +```python +from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer + +model = RobertaForTokenClassification.from_pretrained( + "roberta-wwm-ext-large", + num_classes=(len(label_map) - 2) * 2 + 2) +tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large") +``` +## Reference + +- [DuIE: A Large-scale Chinese Dataset for Information Extraction](http://tcci.ccf.org.cn/conference/2019/papers/EV10.pdf) diff --git a/examples/information_extraction/DuIE/data/id2spo.json b/examples/information_extraction/DuIE/data/id2spo.json new file mode 100644 index 0000000000000000000000000000000000000000..8313a9ef942fba1fa79957acc1652688d2503b5d --- /dev/null +++ b/examples/information_extraction/DuIE/data/id2spo.json @@ 
-0,0 +1 @@ +{"predicate": ["empty", "empty", "注册资本", "作者", "所属专辑", "歌手", "邮政编码", "主演", "上映时间", "上映时间", "饰演", "饰演", "国籍", "成立日期", "毕业院校", "作曲", "作词", "编剧", "导演", "面积", "占地面积", "总部地点", "制片人", "嘉宾", "简称", "主持人", "获奖", "获奖", "获奖", "获奖", "海拔", "出品公司", "配音", "配音", "所在城市", "号", "主角", "创始人", "父亲", "祖籍", "母亲", "朝代", "董事长", "人口数量", "妻子", "丈夫", "票房", "票房", "专业代码", "气候", "修业年限", "改编自", "官方语言", "首都", "主题曲", "校长", "代言人"], "subject_type": ["empty", "empty", "企业", "图书作品", "歌曲", "歌曲", "行政区", "影视作品", "影视作品", "影视作品", "娱乐人物", "娱乐人物", "人物", "机构", "人物", "歌曲", "歌曲", "影视作品", "影视作品", "行政区", "机构", "企业", "影视作品", "电视综艺", "机构", "电视综艺", "娱乐人物", "娱乐人物", "娱乐人物", "娱乐人物", "地点", "影视作品", "娱乐人物", "娱乐人物", "景点", "历史人物", "文学作品", "企业", "人物", "人物", "人物", "历史人物", "企业", "行政区", "人物", "人物", "影视作品", "影视作品", "学科专业", "行政区", "学科专业", "影视作品", "国家", "国家", "影视作品", "学校", "企业/品牌"], "object_type": ["empty", "empty", "Number", "人物", "音乐专辑", "人物", "Text", "人物", "Date_@value", "地点_inArea", "人物_@value", "影视作品_inWork", "国家", "Date", "学校", "人物", "人物", "人物", "人物", "Number", "Number", "地点", "人物", "人物", "Text", "人物", "奖项_@value", "作品_inWork", "Date_onDate", "Number_period", "Number", "企业", "人物_@value", "影视作品_inWork", "城市", "Text", "人物", "人物", "人物", "地点", "人物", "Text", "人物", "Number", "人物", "人物", "Number_@value", "地点_inArea", "Text", "气候", "Number", "作品", "语言", "城市", "歌曲", "人物", "人物"]} \ No newline at end of file diff --git a/examples/information_extraction/DuIE/data/predicate2id.json b/examples/information_extraction/DuIE/data/predicate2id.json new file mode 100644 index 0000000000000000000000000000000000000000..a94c10304a85910d7b0a1f967541471cbc97b940 --- /dev/null +++ b/examples/information_extraction/DuIE/data/predicate2id.json @@ -0,0 +1 @@ +{"O": 0, "I": 1, "注册资本": 2, "作者": 3, "所属专辑": 4, "歌手": 5, "邮政编码": 6, "主演": 7, "上映时间_@value": 8, "上映时间_inArea": 9, "饰演_@value": 10, "饰演_inWork": 11, "国籍": 12, "成立日期": 13, "毕业院校": 14, "作曲": 15, "作词": 16, "编剧": 17, "导演": 18, "面积": 19, "占地面积": 20, "总部地点": 21, "制片人": 22, "嘉宾": 23, "简称": 24, "主持人": 25, "获奖_@value": 26, "获奖_inWork": 27, "获奖_onDate": 28, "获奖_period": 29, "海拔": 30, "出品公司": 31, "配音_@value": 32, "配音_inWork": 33, "所在城市": 34, "号": 35, "主角": 36, "创始人": 37, "父亲": 38, "祖籍": 39, "母亲": 40, "朝代": 41, "董事长": 42, "人口数量": 43, "妻子": 44, "丈夫": 45, "票房_@value": 46, "票房_inArea": 47, "专业代码": 48, "气候": 49, "修业年限": 50, "改编自": 51, "官方语言": 52, "首都": 53, "主题曲": 54, "校长": 55, "代言人": 56} \ No newline at end of file diff --git a/examples/information_extraction/DuIE/data_loader.py b/examples/information_extraction/DuIE/data_loader.py new file mode 100644 index 0000000000000000000000000000000000000000..7a1d26a8a88f7321b8c5ec985b6dd6423a2e1be4 --- /dev/null +++ b/examples/information_extraction/DuIE/data_loader.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
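+# DuIE 基线的数据处理模块:将每行 JSON 样本转换为 token 级别的 multi-hot 标签矩阵,
+# 标签维度为 (2*N+2),其中 N 为 predicate2id.json 中去掉 "O"、"I" 后的关系类别数。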
+ +import collections +import json +import os +from dataclasses import dataclass +from typing import Dict, List, Optional, Union + +import numpy as np +import paddle +from extract_chinese_and_punct import ChineseAndPunctuationExtractor + +from paddlenlp.transformers import AutoTokenizer, PretrainedTokenizer + +InputFeature = collections.namedtuple( + "InputFeature", ["input_ids", "seq_len", "tok_to_orig_start_index", "tok_to_orig_end_index", "labels"] +) + + +def parse_label(spo_list, label_map, tokens, tokenizer): + # 2 tags for each predicate + I tag + O tag + num_labels = 2 * (len(label_map.keys()) - 2) + 2 + seq_len = len(tokens) + # initialize tag + labels = [[0] * num_labels for i in range(seq_len)] + # find all entities and tag them with corresponding "B"/"I" labels + for spo in spo_list: + for spo_object in spo["object"].keys(): + # assign relation label + if spo["predicate"] in label_map.keys(): + # simple relation + label_subject = label_map[spo["predicate"]] + label_object = label_subject + 55 + subject_tokens = tokenizer._tokenize(spo["subject"]) + object_tokens = tokenizer._tokenize(spo["object"]["@value"]) + else: + # complex relation + label_subject = label_map[spo["predicate"] + "_" + spo_object] + label_object = label_subject + 55 + subject_tokens = tokenizer._tokenize(spo["subject"]) + object_tokens = tokenizer._tokenize(spo["object"][spo_object]) + + subject_tokens_len = len(subject_tokens) + object_tokens_len = len(object_tokens) + + # assign token label + # there are situations where s entity and o entity might overlap, e.g. xyz established xyz corporation + # to prevent single token from being labeled into two different entity + # we tag the longer entity first, then match the shorter entity within the rest text + forbidden_index = None + if subject_tokens_len > object_tokens_len: + for index in range(seq_len - subject_tokens_len + 1): + if tokens[index : index + subject_tokens_len] == subject_tokens: + labels[index][label_subject] = 1 + for i in range(subject_tokens_len - 1): + labels[index + i + 1][1] = 1 + forbidden_index = index + break + + for index in range(seq_len - object_tokens_len + 1): + if tokens[index : index + object_tokens_len] == object_tokens: + if forbidden_index is None: + labels[index][label_object] = 1 + for i in range(object_tokens_len - 1): + labels[index + i + 1][1] = 1 + break + # check if labeled already + elif index < forbidden_index or index >= forbidden_index + len(subject_tokens): + labels[index][label_object] = 1 + for i in range(object_tokens_len - 1): + labels[index + i + 1][1] = 1 + break + + else: + for index in range(seq_len - object_tokens_len + 1): + if tokens[index : index + object_tokens_len] == object_tokens: + labels[index][label_object] = 1 + for i in range(object_tokens_len - 1): + labels[index + i + 1][1] = 1 + forbidden_index = index + break + + for index in range(seq_len - subject_tokens_len + 1): + if tokens[index : index + subject_tokens_len] == subject_tokens: + if forbidden_index is None: + labels[index][label_subject] = 1 + for i in range(subject_tokens_len - 1): + labels[index + i + 1][1] = 1 + break + elif index < forbidden_index or index >= forbidden_index + len(object_tokens): + labels[index][label_subject] = 1 + for i in range(subject_tokens_len - 1): + labels[index + i + 1][1] = 1 + break + + # if token wasn't assigned as any "B"/"I" tag, give it an "O" tag for outside + for i in range(seq_len): + if labels[i] == [0] * num_labels: + labels[i][0] = 1 + + return labels + + +def convert_example_to_feature( + 
example, + tokenizer: PretrainedTokenizer, + chineseandpunctuationextractor: ChineseAndPunctuationExtractor, + label_map, + max_length: Optional[int] = 512, + pad_to_max_length: Optional[bool] = None, +): + spo_list = example["spo_list"] if "spo_list" in example.keys() else None + text_raw = example["text"] + + sub_text = [] + buff = "" + for char in text_raw: + if chineseandpunctuationextractor.is_chinese_or_punct(char): + if buff != "": + sub_text.append(buff) + buff = "" + sub_text.append(char) + else: + buff += char + if buff != "": + sub_text.append(buff) + + tok_to_orig_start_index = [] + tok_to_orig_end_index = [] + orig_to_tok_index = [] + tokens = [] + text_tmp = "" + for (i, token) in enumerate(sub_text): + orig_to_tok_index.append(len(tokens)) + sub_tokens = tokenizer._tokenize(token) + text_tmp += token + for sub_token in sub_tokens: + tok_to_orig_start_index.append(len(text_tmp) - len(token)) + tok_to_orig_end_index.append(len(text_tmp) - 1) + tokens.append(sub_token) + if len(tokens) >= max_length - 2: + break + else: + continue + break + + seq_len = len(tokens) + # 2 tags for each predicate + I tag + O tag + num_labels = 2 * (len(label_map.keys()) - 2) + 2 + # initialize tag + labels = [[0] * num_labels for i in range(seq_len)] + if spo_list is not None: + labels = parse_label(spo_list, label_map, tokens, tokenizer) + + # add [CLS] and [SEP] token, they are tagged into "O" for outside + if seq_len > max_length - 2: + tokens = tokens[0 : (max_length - 2)] + labels = labels[0 : (max_length - 2)] + tok_to_orig_start_index = tok_to_orig_start_index[0 : (max_length - 2)] + tok_to_orig_end_index = tok_to_orig_end_index[0 : (max_length - 2)] + tokens = ["[CLS]"] + tokens + ["[SEP]"] + # "O" tag for [PAD], [CLS], [SEP] token + outside_label = [[1] + [0] * (num_labels - 1)] + + labels = outside_label + labels + outside_label + tok_to_orig_start_index = [-1] + tok_to_orig_start_index + [-1] + tok_to_orig_end_index = [-1] + tok_to_orig_end_index + [-1] + if seq_len < max_length: + tokens = tokens + ["[PAD]"] * (max_length - seq_len - 2) + labels = labels + outside_label * (max_length - len(labels)) + tok_to_orig_start_index = tok_to_orig_start_index + [-1] * (max_length - len(tok_to_orig_start_index)) + tok_to_orig_end_index = tok_to_orig_end_index + [-1] * (max_length - len(tok_to_orig_end_index)) + + token_ids = tokenizer.convert_tokens_to_ids(tokens) + + return InputFeature( + input_ids=np.array(token_ids), + seq_len=np.array(seq_len), + tok_to_orig_start_index=np.array(tok_to_orig_start_index), + tok_to_orig_end_index=np.array(tok_to_orig_end_index), + labels=np.array(labels), + ) + + +class DuIEDataset(paddle.io.Dataset): + def __init__(self, data, label_map, tokenizer, max_length=512, pad_to_max_length=False): + super(DuIEDataset, self).__init__() + + self.data = data + self.chn_punc_extractor = ChineseAndPunctuationExtractor() + self.tokenizer = tokenizer + self.max_seq_length = max_length + self.pad_to_max_length = pad_to_max_length + self.label_map = label_map + + def __len__(self): + return len(self.data) + + def __getitem__(self, item): + + example = json.loads(self.data[item]) + input_feature = convert_example_to_feature( + example, + self.tokenizer, + self.chn_punc_extractor, + self.label_map, + self.max_seq_length, + self.pad_to_max_length, + ) + return { + "input_ids": np.array(input_feature.input_ids, dtype="int64"), + "seq_lens": np.array(input_feature.seq_len, dtype="int64"), + "tok_to_orig_start_index": np.array(input_feature.tok_to_orig_start_index, dtype="int64"), 
+ "tok_to_orig_end_index": np.array(input_feature.tok_to_orig_end_index, dtype="int64"), + # If model inputs is generated in `collate_fn`, delete the data type casting. + "labels": np.array(input_feature.labels, dtype="float32"), + } + + @classmethod + def from_file( + cls, + file_path: Union[str, os.PathLike], + tokenizer: PretrainedTokenizer, + max_length: Optional[int] = 512, + pad_to_max_length: Optional[bool] = None, + ): + assert os.path.exists(file_path) and os.path.isfile( + file_path + ), f"{file_path} dose not exists or is not a file." + label_map_path = os.path.join(os.path.dirname(file_path), "predicate2id.json") + assert os.path.exists(label_map_path) and os.path.isfile( + label_map_path + ), f"{label_map_path} dose not exists or is not a file." + with open(label_map_path, "r", encoding="utf8") as fp: + label_map = json.load(fp) + with open(file_path, "r", encoding="utf-8") as fp: + data = fp.readlines() + return cls(data, label_map, tokenizer, max_length, pad_to_max_length) + + +@dataclass +class DataCollator: + """ + Collator for DuIE. + """ + + def __call__(self, examples: List[Dict[str, Union[list, np.ndarray]]]): + batched_input_ids = np.stack([x["input_ids"] for x in examples]) + seq_lens = np.stack([x["seq_lens"] for x in examples]) + tok_to_orig_start_index = np.stack([x["tok_to_orig_start_index"] for x in examples]) + tok_to_orig_end_index = np.stack([x["tok_to_orig_end_index"] for x in examples]) + labels = np.stack([x["labels"] for x in examples]) + + return (batched_input_ids, seq_lens, tok_to_orig_start_index, tok_to_orig_end_index, labels) + + +if __name__ == "__main__": + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + d = DuIEDataset.from_file("./data/train_data.json", tokenizer) + sampler = paddle.io.RandomSampler(data_source=d) + batch_sampler = paddle.io.BatchSampler(sampler=sampler, batch_size=2) + + collator = DataCollator() + loader = paddle.io.DataLoader(dataset=d, batch_sampler=batch_sampler, collate_fn=collator, return_list=True) + for dd in loader(): + model_input = { + "input_ids": dd[0], + "seq_len": dd[1], + "tok_to_orig_start_index": dd[2], + "tok_to_orig_end_index": dd[3], + "labels": dd[4], + } + print(model_input) diff --git a/examples/information_extraction/DuIE/extract_chinese_and_punct.py b/examples/information_extraction/DuIE/extract_chinese_and_punct.py new file mode 100644 index 0000000000000000000000000000000000000000..2cd8e9966746f48832f4ae55b61042f3f369c3fb --- /dev/null +++ b/examples/information_extraction/DuIE/extract_chinese_and_punct.py @@ -0,0 +1,132 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
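+# 依据 Unicode 码位区间构建正则表达式,用于判断单个字符是否为中文(CJK)字符或中英文标点,
+# 供数据预处理阶段将原始文本切分成逐字符/逐词的片段。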
+import re + +LHan = [ + [0x2E80, 0x2E99], # Han # So [26] CJK RADICAL REPEAT, CJK RADICAL RAP + [0x2E9B, 0x2EF3], # Han # So [89] CJK RADICAL CHOKE, CJK RADICAL C-SIMPLIFIED TURTLE + [0x2F00, 0x2FD5], # Han # So [214] KANGXI RADICAL ONE, KANGXI RADICAL FLUTE + 0x3005, # Han # Lm IDEOGRAPHIC ITERATION MARK + 0x3007, # Han # Nl IDEOGRAPHIC NUMBER ZERO + [0x3021, 0x3029], # Han # Nl [9] HANGZHOU NUMERAL ONE, HANGZHOU NUMERAL NINE + [0x3038, 0x303A], # Han # Nl [3] HANGZHOU NUMERAL TEN, HANGZHOU NUMERAL THIRTY + 0x303B, # Han # Lm VERTICAL IDEOGRAPHIC ITERATION MARK + [0x3400, 0x4DB5], # Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400, CJK UNIFIED IDEOGRAPH-4DB5 + [0x4E00, 0x9FC3], # Han # Lo [20932] CJK UNIFIED IDEOGRAPH-4E00, CJK UNIFIED IDEOGRAPH-9FC3 + [0xF900, 0xFA2D], # Han # Lo [302] CJK COMPATIBILITY IDEOGRAPH-F900, CJK COMPATIBILITY IDEOGRAPH-FA2D + [0xFA30, 0xFA6A], # Han # Lo [59] CJK COMPATIBILITY IDEOGRAPH-FA30, CJK COMPATIBILITY IDEOGRAPH-FA6A + [0xFA70, 0xFAD9], # Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70, CJK COMPATIBILITY IDEOGRAPH-FAD9 + [0x20000, 0x2A6D6], # Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000, CJK UNIFIED IDEOGRAPH-2A6D6 + [0x2F800, 0x2FA1D], +] # Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800, CJK COMPATIBILITY IDEOGRAPH-2FA1D + +CN_PUNCTS = [ + (0x3002, "。"), + (0xFF1F, "?"), + (0xFF01, "!"), + (0xFF0C, ","), + (0x3001, "、"), + (0xFF1B, ";"), + (0xFF1A, ":"), + (0x300C, "「"), + (0x300D, "」"), + (0x300E, "『"), + (0x300F, "』"), + (0x2018, "‘"), + (0x2019, "’"), + (0x201C, "“"), + (0x201D, "”"), + (0xFF08, "("), + (0xFF09, ")"), + (0x3014, "〔"), + (0x3015, "〕"), + (0x3010, "【"), + (0x3011, "】"), + (0x2014, "—"), + (0x2026, "…"), + (0x2013, "–"), + (0xFF0E, "."), + (0x300A, "《"), + (0x300B, "》"), + (0x3008, "〈"), + (0x3009, "〉"), + (0x2015, "―"), + (0xFF0D, "-"), + (0x0020, " "), +] +# (0xFF5E, "~"), + +EN_PUNCTS = [[0x0021, 0x002F], [0x003A, 0x0040], [0x005B, 0x0060], [0x007B, 0x007E]] + + +class ChineseAndPunctuationExtractor(object): + def __init__(self): + self.chinese_re = self.build_re() + + def is_chinese_or_punct(self, c): + if self.chinese_re.match(c): + return True + else: + return False + + def build_re(self): + L = [] + for i in LHan: + if isinstance(i, list): + f, t = i + try: + f = chr(f) + t = chr(t) + L.append("%s-%s" % (f, t)) + except: + pass # A narrow python build, so can't use chars > 65535 without surrogate pairs! + + else: + try: + L.append(chr(i)) + except: + pass + for j, _ in CN_PUNCTS: + try: + L.append(chr(j)) + except: + pass + + for k in EN_PUNCTS: + f, t = k + try: + f = chr(f) + t = chr(t) + L.append("%s-%s" % (f, t)) + except: + raise ValueError() + pass # A narrow python build, so can't use chars > 65535 without surrogate pairs! 
+ + RE = "[%s]" % "".join(L) + # print('RE:', RE.encode('utf-8')) + return re.compile(RE, re.UNICODE) + + +if __name__ == "__main__": + extractor = ChineseAndPunctuationExtractor() + for c in "韩邦庆(1856~1894)曾用名寄,字子云,别署太仙、大一山人、花也怜侬、三庆": + if extractor.is_chinese_or_punct(c): + print(c, "yes") + else: + print(c, "no") + + print("~", extractor.is_chinese_or_punct("~")) + print("~", extractor.is_chinese_or_punct("~")) + print("―", extractor.is_chinese_or_punct("―")) + print("-", extractor.is_chinese_or_punct("-")) diff --git a/examples/information_extraction/DuIE/images/tagging_strategy.png b/examples/information_extraction/DuIE/images/tagging_strategy.png new file mode 100644 index 0000000000000000000000000000000000000000..0b67f69d775c1811f70bb0de880fa32bfc7009d3 Binary files /dev/null and b/examples/information_extraction/DuIE/images/tagging_strategy.png differ diff --git a/examples/information_extraction/DuIE/predict.sh b/examples/information_extraction/DuIE/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..dd4a1da7f4cde4457d65a43d64387031215ea16f --- /dev/null +++ b/examples/information_extraction/DuIE/predict.sh @@ -0,0 +1,14 @@ +set -eux + +export CUDA_VISIBLE_DEVICES=0 +export BATCH_SIZE=64 +export CKPT=./checkpoints/model_90000.pdparams +export DATASET_FILE=./data/test1.json + +python run_duie.py \ + --do_predict \ + --init_checkpoint $CKPT \ + --predict_data_file $DATASET_FILE \ + --max_seq_length 128 \ + --batch_size $BATCH_SIZE + diff --git a/examples/information_extraction/DuIE/re_official_evaluation.py b/examples/information_extraction/DuIE/re_official_evaluation.py new file mode 100644 index 0000000000000000000000000000000000000000..827601948873fdcc796ef51d9da0a3c7c2c26e6c --- /dev/null +++ b/examples/information_extraction/DuIE/re_official_evaluation.py @@ -0,0 +1,271 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# imitations under the License. +""" +This module to calculate precision, recall and f1-value +of the predicated results. 
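+Predicted SPO results are read from a zip archive of JSON lines and matched
+against the golden file; an optional alias dictionary is used to normalize entity mentions.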
+""" +import argparse +import json +import os +import sys +import zipfile + +SUCCESS = 0 +FILE_ERROR = 1 +NOT_ZIP_FILE = 2 +ENCODING_ERROR = 3 +JSON_ERROR = 4 +SCHEMA_ERROR = 5 +ALIAS_FORMAT_ERROR = 6 + +CODE_INFO = { + SUCCESS: "success", + FILE_ERROR: "file is not exists", + NOT_ZIP_FILE: "predict file is not a zipfile", + ENCODING_ERROR: "file encoding error", + JSON_ERROR: "json parse is error", + SCHEMA_ERROR: "schema is error", + ALIAS_FORMAT_ERROR: "alias dict format is error", +} + + +def del_bookname(entity_name): + """delete the book name""" + if entity_name.startswith("《") and entity_name.endswith("》"): + entity_name = entity_name[1:-1] + return entity_name + + +def check_format(line): + """检查输入行是否格式错误""" + ret_code = SUCCESS + json_info = {} + try: + line = line.strip() + except: + ret_code = ENCODING_ERROR + return ret_code, json_info + try: + json_info = json.loads(line) + except: + ret_code = JSON_ERROR + return ret_code, json_info + if "text" not in json_info or "spo_list" not in json_info: + ret_code = SCHEMA_ERROR + return ret_code, json_info + required_key_list = ["subject", "predicate", "object"] + for spo_item in json_info["spo_list"]: + if type(spo_item) is not dict: + ret_code = SCHEMA_ERROR + return ret_code, json_info + if not all([required_key in spo_item for required_key in required_key_list]): + ret_code = SCHEMA_ERROR + return ret_code, json_info + if not isinstance(spo_item["subject"], str) or not isinstance(spo_item["object"], dict): + ret_code = SCHEMA_ERROR + return ret_code, json_info + return ret_code, json_info + + +def _parse_structured_ovalue(json_info): + spo_result = [] + for item in json_info["spo_list"]: + s = del_bookname(item["subject"].lower()) + o = {} + for o_key, o_value in item["object"].items(): + o_value = del_bookname(o_value).lower() + o[o_key] = o_value + spo_result.append({"predicate": item["predicate"], "subject": s, "object": o}) + return spo_result + + +def load_predict_result(predict_filename): + """Loads the file to be predicted""" + predict_result = {} + ret_code = SUCCESS + if not os.path.exists(predict_filename): + ret_code = FILE_ERROR + return ret_code, predict_result + try: + predict_file_zip = zipfile.ZipFile(predict_filename) + except: + ret_code = NOT_ZIP_FILE + return ret_code, predict_result + for predict_file in predict_file_zip.namelist(): + for line in predict_file_zip.open(predict_file): + ret_code, json_info = check_format(line) + if ret_code != SUCCESS: + return ret_code, predict_result + sent = json_info["text"] + spo_result = _parse_structured_ovalue(json_info) + predict_result[sent] = spo_result + return ret_code, predict_result + + +def load_test_dataset(golden_filename): + """load golden file""" + golden_dict = {} + ret_code = SUCCESS + if not os.path.exists(golden_filename): + ret_code = FILE_ERROR + return ret_code, golden_dict + with open(golden_filename, "r", encoding="utf-8") as gf: + for line in gf: + ret_code, json_info = check_format(line) + if ret_code != SUCCESS: + return ret_code, golden_dict + + sent = json_info["text"] + spo_result = _parse_structured_ovalue(json_info) + golden_dict[sent] = spo_result + return ret_code, golden_dict + + +def load_alias_dict(alias_filename): + """load alias dict""" + alias_dict = {} + ret_code = SUCCESS + if alias_filename == "": + return ret_code, alias_dict + if not os.path.exists(alias_filename): + ret_code = FILE_ERROR + return ret_code, alias_dict + with open(alias_filename, "r", encoding="utf-8") as af: + for line in af: + line = line.strip() + try: + words = 
line.split("\t") + alias_dict[words[0].lower()] = set() + for alias_word in words[1:]: + alias_dict[words[0].lower()].add(alias_word.lower()) + except: + ret_code = ALIAS_FORMAT_ERROR + return ret_code, alias_dict + return ret_code, alias_dict + + +def del_duplicate(spo_list, alias_dict): + """delete synonyms triples in predict result""" + normalized_spo_list = [] + for spo in spo_list: + if not is_spo_in_list(spo, normalized_spo_list, alias_dict): + normalized_spo_list.append(spo) + return normalized_spo_list + + +def is_spo_in_list(target_spo, golden_spo_list, alias_dict): + """target spo是否在golden_spo_list中""" + if target_spo in golden_spo_list: + return True + target_s = target_spo["subject"] + target_p = target_spo["predicate"] + target_o = target_spo["object"] + target_s_alias_set = alias_dict.get(target_s, set()) + target_s_alias_set.add(target_s) + for spo in golden_spo_list: + s = spo["subject"] + p = spo["predicate"] + o = spo["object"] + if p != target_p: + continue + if s in target_s_alias_set and _is_equal_o(o, target_o, alias_dict): + return True + return False + + +def _is_equal_o(o_a, o_b, alias_dict): + for key_a, value_a in o_a.items(): + if key_a not in o_b: + return False + value_a_alias_set = alias_dict.get(value_a, set()) + value_a_alias_set.add(value_a) + if o_b[key_a] not in value_a_alias_set: + return False + for key_b, value_b in o_b.items(): + if key_b not in o_a: + return False + value_b_alias_set = alias_dict.get(value_b, set()) + value_b_alias_set.add(value_b) + if o_a[key_b] not in value_b_alias_set: + return False + return True + + +def calc_pr(predict_filename, alias_filename, golden_filename): + """calculate precision, recall, f1""" + ret_info = {} + + # load alias dict + ret_code, alias_dict = load_alias_dict(alias_filename) + if ret_code != SUCCESS: + ret_info["errorCode"] = ret_code + ret_info["errorMsg"] = CODE_INFO[ret_code] + return ret_info + # load test golden dataset + ret_code, golden_dict = load_test_dataset(golden_filename) + if ret_code != SUCCESS: + ret_info["errorCode"] = ret_code + ret_info["errorMsg"] = CODE_INFO[ret_code] + return ret_info + # load predict result + ret_code, predict_result = load_predict_result(predict_filename) + if ret_code != SUCCESS: + ret_info["errorCode"] = ret_code + ret_info["errorMsg"] = CODE_INFO[ret_code] + return ret_info + + # evaluation + correct_sum, predict_sum, recall_sum, recall_correct_sum = 0.0, 0.0, 0.0, 0.0 + for sent in golden_dict: + golden_spo_list = del_duplicate(golden_dict[sent], alias_dict) + predict_spo_list = predict_result.get(sent, list()) + normalized_predict_spo = del_duplicate(predict_spo_list, alias_dict) + recall_sum += len(golden_spo_list) + predict_sum += len(normalized_predict_spo) + for spo in normalized_predict_spo: + if is_spo_in_list(spo, golden_spo_list, alias_dict): + correct_sum += 1 + for golden_spo in golden_spo_list: + if is_spo_in_list(golden_spo, predict_spo_list, alias_dict): + recall_correct_sum += 1 + sys.stderr.write("correct spo num = {}\n".format(correct_sum)) + sys.stderr.write("submitted spo num = {}\n".format(predict_sum)) + sys.stderr.write("golden set spo num = {}\n".format(recall_sum)) + sys.stderr.write("submitted recall spo num = {}\n".format(recall_correct_sum)) + precision = correct_sum / predict_sum if predict_sum > 0 else 0.0 + recall = recall_correct_sum / recall_sum if recall_sum > 0 else 0.0 + f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0 + precision = round(precision, 4) + recall = round(recall, 4) + f1 = 
round(f1, 4) + ret_info["errorCode"] = SUCCESS + ret_info["errorMsg"] = CODE_INFO[SUCCESS] + ret_info["data"] = [] + ret_info["data"].append({"name": "precision", "value": precision}) + ret_info["data"].append({"name": "recall", "value": recall}) + ret_info["data"].append({"name": "f1-score", "value": f1}) + return ret_info + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--golden_file", type=str, help="true spo results", required=True) + parser.add_argument("--predict_file", type=str, help="spo results predicted", required=True) + parser.add_argument("--alias_file", type=str, default="", help="entities alias dictionary") + args = parser.parse_args() + golden_filename = args.golden_file + predict_filename = args.predict_file + alias_filename = args.alias_file + ret_info = calc_pr(predict_filename, alias_filename, golden_filename) + print(json.dumps(ret_info)) diff --git a/examples/information_extraction/DuIE/run_duie.py b/examples/information_extraction/DuIE/run_duie.py new file mode 100644 index 0000000000000000000000000000000000000000..94e1a227292be505567703a8112ebca93fbd5156 --- /dev/null +++ b/examples/information_extraction/DuIE/run_duie.py @@ -0,0 +1,317 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import random +import sys +import time + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from data_loader import DataCollator, DuIEDataset +from paddle.io import DataLoader +from tqdm import tqdm +from utils import decoding, get_precision_recall_f1, write_prediction_results + +from paddlenlp.transformers import ( + AutoModelForTokenClassification, + AutoTokenizer, + LinearDecayWithWarmup, +) + +parser = argparse.ArgumentParser() +parser.add_argument("--do_train", action="store_true", default=False, help="do train") +parser.add_argument("--do_predict", action="store_true", default=False, help="do predict") +parser.add_argument("--init_checkpoint", default=None, type=str, required=False, help="Path to initialize params from") +parser.add_argument("--data_path", default="./data", type=str, required=False, help="Path to data.") +parser.add_argument( + "--predict_data_file", default="./data/test_data.json", type=str, required=False, help="Path to data." +) +parser.add_argument( + "--output_dir", + default="./checkpoints", + type=str, + required=False, + help="The output directory where the model predictions and checkpoints will be written.", +) +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", +) +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_ratio", default=0, type=float, help="Linear warmup over warmup_ratio * total_steps.") +parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +args = parser.parse_args() + + +class BCELossForDuIE(nn.Layer): + def __init__( + self, + ): + super(BCELossForDuIE, self).__init__() + self.criterion = nn.BCEWithLogitsLoss(reduction="none") + + def forward(self, logits, labels, mask): + loss = self.criterion(logits, labels) + mask = paddle.cast(mask, "float32") + loss = loss * mask.unsqueeze(-1) + loss = paddle.sum(loss.mean(axis=2), axis=1) / paddle.sum(mask, axis=1) + loss = loss.mean() + return loss + + +def set_random_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, data_loader, file_path, mode): + """ + mode eval: + eval on development set and compute P/R/F1, called between training. + mode predict: + eval on development / test set, then write predictions to \ + predict_test.json and predict_test.json.zip \ + under args.data_path dir for later submission or evaluation. 
+ """ + example_all = [] + with open(file_path, "r", encoding="utf-8") as fp: + for line in fp: + example_all.append(json.loads(line)) + id2spo_path = os.path.join(os.path.dirname(file_path), "id2spo.json") + with open(id2spo_path, "r", encoding="utf8") as fp: + id2spo = json.load(fp) + + model.eval() + loss_all = 0 + eval_steps = 0 + formatted_outputs = [] + current_idx = 0 + for batch in tqdm(data_loader, total=len(data_loader)): + eval_steps += 1 + input_ids, seq_len, tok_to_orig_start_index, tok_to_orig_end_index, labels = batch + logits = model(input_ids=input_ids) + mask = (input_ids != 0).logical_and((input_ids != 1)).logical_and((input_ids != 2)) + loss = criterion(logits, labels, mask) + loss_all += loss.item() + probs = F.sigmoid(logits) + logits_batch = probs.numpy() + seq_len_batch = seq_len.numpy() + tok_to_orig_start_index_batch = tok_to_orig_start_index.numpy() + tok_to_orig_end_index_batch = tok_to_orig_end_index.numpy() + formatted_outputs.extend( + decoding( + example_all[current_idx : current_idx + len(logits)], + id2spo, + logits_batch, + seq_len_batch, + tok_to_orig_start_index_batch, + tok_to_orig_end_index_batch, + ) + ) + current_idx = current_idx + len(logits) + loss_avg = loss_all / eval_steps + print("eval loss: %f" % (loss_avg)) + + if mode == "predict": + predict_file_path = os.path.join(args.data_path, "predictions.json") + else: + predict_file_path = os.path.join(args.data_path, "predict_eval.json") + + predict_zipfile_path = write_prediction_results(formatted_outputs, predict_file_path) + + if mode == "eval": + precision, recall, f1 = get_precision_recall_f1(file_path, predict_zipfile_path) + os.system("rm {} {}".format(predict_file_path, predict_zipfile_path)) + return precision, recall, f1 + elif mode != "predict": + raise Exception("wrong mode for eval func") + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + # Reads label_map. + label_map_path = os.path.join(args.data_path, "predicate2id.json") + if not (os.path.exists(label_map_path) and os.path.isfile(label_map_path)): + sys.exit("{} dose not exists or is not a file.".format(label_map_path)) + with open(label_map_path, "r", encoding="utf8") as fp: + label_map = json.load(fp) + num_classes = (len(label_map.keys()) - 2) * 2 + 2 + + # Loads pretrained model ERNIE + model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=num_classes) + model = paddle.DataParallel(model) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + criterion = BCELossForDuIE() + + # Loads dataset. 
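+    # DuIEDataset.from_file 会在数据文件同级目录下读取 predicate2id.json 作为标签映射,
+    # 并在 __getitem__ 中按需把每行 JSON 转换成定长 input_ids 与 multi-hot 标签。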
+ train_dataset = DuIEDataset.from_file( + os.path.join(args.data_path, "train_data.json"), tokenizer, args.max_seq_length, True + ) + train_batch_sampler = paddle.io.DistributedBatchSampler( + train_dataset, batch_size=args.batch_size, shuffle=True, drop_last=True + ) + collator = DataCollator() + train_data_loader = DataLoader( + dataset=train_dataset, batch_sampler=train_batch_sampler, collate_fn=collator, return_list=True + ) + eval_file_path = os.path.join(args.data_path, "dev_data.json") + test_dataset = DuIEDataset.from_file(eval_file_path, tokenizer, args.max_seq_length, True) + test_batch_sampler = paddle.io.BatchSampler( + test_dataset, batch_size=args.batch_size, shuffle=False, drop_last=True + ) + test_data_loader = DataLoader( + dataset=test_dataset, batch_sampler=test_batch_sampler, collate_fn=collator, return_list=True + ) + + # Defines learning rate strategy. + steps_by_epoch = len(train_data_loader) + num_training_steps = steps_by_epoch * args.num_train_epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_ratio) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + # Starts training. + global_step = 0 + logging_steps = 50 + save_steps = 10000 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + print("\n=====start training of %d epochs=====" % epoch) + tic_epoch = time.time() + model.train() + for step, batch in enumerate(train_data_loader): + input_ids, seq_lens, tok_to_orig_start_index, tok_to_orig_end_index, labels = batch + logits = model(input_ids=input_ids) + mask = (input_ids != 0).logical_and((input_ids != 1)).logical_and((input_ids != 2)) + loss = criterion(logits, labels, mask) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + loss_item = loss.item() + global_step += 1 + + if global_step % logging_steps == 0 and rank == 0: + print( + "epoch: %d / %d, steps: %d / %d, loss: %f, speed: %.2f step/s" + % ( + epoch, + args.num_train_epochs, + step, + steps_by_epoch, + loss_item, + logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_step % save_steps == 0 and rank == 0: + print("\n=====start evaluating ckpt of %d steps=====" % global_step) + precision, recall, f1 = evaluate(model, criterion, test_data_loader, eval_file_path, "eval") + print("precision: %.2f\t recall: %.2f\t f1: %.2f\t" % (100 * precision, 100 * recall, 100 * f1)) + print("saving checkpoing model_%d.pdparams to %s " % (global_step, args.output_dir)) + paddle.save(model.state_dict(), os.path.join(args.output_dir, "model_%d.pdparams" % global_step)) + model.train() # back to train mode + + tic_epoch = time.time() - tic_epoch + print( + "epoch time footprint: %d hour %d min %d sec" + % (tic_epoch // 3600, (tic_epoch % 3600) // 60, tic_epoch % 60) + ) + + # Does final evaluation. 
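+    # 训练结束后仅在 rank 0 上评估一次并保存最终权重。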
+ if rank == 0: + print("\n=====start evaluating last ckpt of %d steps=====" % global_step) + precision, recall, f1 = evaluate(model, criterion, test_data_loader, eval_file_path, "eval") + print("precision: %.2f\t recall: %.2f\t f1: %.2f\t" % (100 * precision, 100 * recall, 100 * f1)) + paddle.save(model.state_dict(), os.path.join(args.output_dir, "model_%d.pdparams" % global_step)) + print("\n=====training complete=====") + + +def do_predict(): + paddle.set_device(args.device) + + # Reads label_map. + label_map_path = os.path.join(args.data_path, "predicate2id.json") + if not (os.path.exists(label_map_path) and os.path.isfile(label_map_path)): + sys.exit("{} dose not exists or is not a file.".format(label_map_path)) + with open(label_map_path, "r", encoding="utf8") as fp: + label_map = json.load(fp) + num_classes = (len(label_map.keys()) - 2) * 2 + 2 + + # Loads pretrained model ERNIE + model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=num_classes) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + criterion = BCELossForDuIE() + + # Loads dataset. + test_dataset = DuIEDataset.from_file(args.predict_data_file, tokenizer, args.max_seq_length, True) + collator = DataCollator() + test_batch_sampler = paddle.io.BatchSampler( + test_dataset, batch_size=args.batch_size, shuffle=False, drop_last=True + ) + test_data_loader = DataLoader( + dataset=test_dataset, batch_sampler=test_batch_sampler, collate_fn=collator, return_list=True + ) + + # Loads model parameters. + if not (os.path.exists(args.init_checkpoint) and os.path.isfile(args.init_checkpoint)): + sys.exit("wrong directory: init checkpoints {} not exist".format(args.init_checkpoint)) + state_dict = paddle.load(args.init_checkpoint) + model.set_dict(state_dict) + + # Does predictions. + print("\n=====start predicting=====") + evaluate(model, criterion, test_data_loader, args.predict_data_file, "predict") + print("=====predicting complete=====") + + +if __name__ == "__main__": + + if args.do_train: + do_train() + elif args.do_predict: + do_predict() diff --git a/examples/information_extraction/DuIE/train.sh b/examples/information_extraction/DuIE/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..89a69e9ab9fba6d10f6d59481c844653280fe598 --- /dev/null +++ b/examples/information_extraction/DuIE/train.sh @@ -0,0 +1,18 @@ +set -eux + +export BATCH_SIZE=8 +export LR=2e-5 +export EPOCH=12 + +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_duie.py \ + --device gpu \ + --seed 42 \ + --do_train \ + --data_path ./data \ + --max_seq_length 128 \ + --batch_size $BATCH_SIZE \ + --num_train_epochs $EPOCH \ + --learning_rate $LR \ + --warmup_ratio 0.06 \ + --output_dir ./checkpoints diff --git a/examples/information_extraction/DuIE/utils.py b/examples/information_extraction/DuIE/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..b810043bdee8fcadc59e796e890ebdac429195a1 --- /dev/null +++ b/examples/information_extraction/DuIE/utils.py @@ -0,0 +1,171 @@ +# Copyright (c) 2021 Baidu.com, Inc. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import codecs +import json +import os +import re +import zipfile + +import numpy as np + + +def find_entity(text_raw, id_, predictions, tok_to_orig_start_index, tok_to_orig_end_index): + """ + retrieval entity mention under given predicate id for certain prediction. + this is called by the "decoding" func. + """ + entity_list = [] + for i in range(len(predictions)): + if [id_] in predictions[i]: + j = 0 + while i + j + 1 < len(predictions): + if [1] in predictions[i + j + 1]: + j += 1 + else: + break + entity = "".join(text_raw[tok_to_orig_start_index[i] : tok_to_orig_end_index[i + j] + 1]) + entity_list.append(entity) + return list(set(entity_list)) + + +def decoding( + example_batch, id2spo, logits_batch, seq_len_batch, tok_to_orig_start_index_batch, tok_to_orig_end_index_batch +): + """ + model output logits -> formatted spo (as in data set file) + """ + formatted_outputs = [] + for (i, (example, logits, seq_len, tok_to_orig_start_index, tok_to_orig_end_index)) in enumerate( + zip(example_batch, logits_batch, seq_len_batch, tok_to_orig_start_index_batch, tok_to_orig_end_index_batch) + ): + + logits = logits[1 : seq_len + 1] # slice between [CLS] and [SEP] to get valid logits + logits[logits >= 0.5] = 1 + logits[logits < 0.5] = 0 + tok_to_orig_start_index = tok_to_orig_start_index[1 : seq_len + 1] + tok_to_orig_end_index = tok_to_orig_end_index[1 : seq_len + 1] + predictions = [] + for token in logits: + predictions.append(np.argwhere(token == 1).tolist()) + + # format predictions into example-style output + formatted_instance = {} + text_raw = example["text"] + complex_relation_label = [8, 10, 26, 32, 46] + complex_relation_affi_label = [9, 11, 27, 28, 29, 33, 47] + + # flatten predictions then retrival all valid subject id + flatten_predictions = [] + for layer_1 in predictions: + for layer_2 in layer_1: + flatten_predictions.append(layer_2[0]) + subject_id_list = [] + for cls_label in list(set(flatten_predictions)): + if 1 < cls_label <= 56 and (cls_label + 55) in flatten_predictions: + subject_id_list.append(cls_label) + subject_id_list = list(set(subject_id_list)) + + # fetch all valid spo by subject id + spo_list = [] + for id_ in subject_id_list: + if id_ in complex_relation_affi_label: + continue # do this in the next "else" branch + if id_ not in complex_relation_label: + subjects = find_entity(text_raw, id_, predictions, tok_to_orig_start_index, tok_to_orig_end_index) + objects = find_entity(text_raw, id_ + 55, predictions, tok_to_orig_start_index, tok_to_orig_end_index) + for subject_ in subjects: + for object_ in objects: + spo_list.append( + { + "predicate": id2spo["predicate"][id_], + "object_type": {"@value": id2spo["object_type"][id_]}, + "subject_type": id2spo["subject_type"][id_], + "object": {"@value": object_}, + "subject": subject_, + } + ) + else: + # traverse all complex relation and look through their corresponding affiliated objects + subjects = find_entity(text_raw, id_, predictions, tok_to_orig_start_index, tok_to_orig_end_index) + objects = find_entity(text_raw, id_ + 55, predictions, tok_to_orig_start_index, tok_to_orig_end_index) + for subject_ in 
subjects: + for object_ in objects: + object_dict = {"@value": object_} + object_type_dict = {"@value": id2spo["object_type"][id_].split("_")[0]} + if id_ in [8, 10, 32, 46] and id_ + 1 in subject_id_list: + id_affi = id_ + 1 + object_dict[id2spo["object_type"][id_affi].split("_")[1]] = find_entity( + text_raw, id_affi + 55, predictions, tok_to_orig_start_index, tok_to_orig_end_index + )[0] + object_type_dict[id2spo["object_type"][id_affi].split("_")[1]] = id2spo["object_type"][ + id_affi + ].split("_")[0] + elif id_ == 26: + for id_affi in [27, 28, 29]: + if id_affi in subject_id_list: + object_dict[id2spo["object_type"][id_affi].split("_")[1]] = find_entity( + text_raw, + id_affi + 55, + predictions, + tok_to_orig_start_index, + tok_to_orig_end_index, + )[0] + object_type_dict[id2spo["object_type"][id_affi].split("_")[1]] = id2spo[ + "object_type" + ][id_affi].split("_")[0] + spo_list.append( + { + "predicate": id2spo["predicate"][id_], + "object_type": object_type_dict, + "subject_type": id2spo["subject_type"][id_], + "object": object_dict, + "subject": subject_, + } + ) + + formatted_instance["text"] = example["text"] + formatted_instance["spo_list"] = spo_list + formatted_outputs.append(formatted_instance) + return formatted_outputs + + +def write_prediction_results(formatted_outputs, file_path): + """write the prediction results""" + + with codecs.open(file_path, "w", "utf-8") as f: + for formatted_instance in formatted_outputs: + json_str = json.dumps(formatted_instance, ensure_ascii=False) + f.write(json_str) + f.write("\n") + zipfile_path = file_path + ".zip" + f = zipfile.ZipFile(zipfile_path, "w", zipfile.ZIP_DEFLATED) + f.write(file_path) + + return zipfile_path + + +def get_precision_recall_f1(golden_file, predict_file): + r = os.popen( + "python3 ./re_official_evaluation.py --golden_file={} --predict_file={}".format(golden_file, predict_file) + ) + result = r.read() + r.close() + precision = float( + re.search('"precision", "value":.*?}', result).group(0).lstrip('"precision", "value":').rstrip("}") + ) + recall = float(re.search('"recall", "value":.*?}', result).group(0).lstrip('"recall", "value":').rstrip("}")) + f1 = float(re.search('"f1-score", "value":.*?}', result).group(0).lstrip('"f1-score", "value":').rstrip("}")) + + return precision, recall, f1 diff --git a/examples/information_extraction/DuUIE/README.md b/examples/information_extraction/DuUIE/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b8bfc1edaf4b1e36034be4cdd7332f81d3e34983 --- /dev/null +++ b/examples/information_extraction/DuUIE/README.md @@ -0,0 +1,180 @@ +# CCKS 2022 通用信息抽取 -- 基于UIE的基线系统 + +信息抽取任务旨在根据特定的抽取需求(Schema,S)从非结构化文本(Text,X)中自动抽取结构化信息(Structure,Y)。 +其中,特定的抽取需求是指抽取任务中的抽取框架,主要由抽取类别(人物名称、公司名称、企业上市事件)及目标结构(实体、关系、事件等)组成。 +本任务为中文信息抽取任务,即按照特定的抽取框架S,从给定的一组自由文本X中抽取出所有符合抽取需求的信息结构Y(实体、关系、事件记录等)。 +对于同一输入文本,不同的抽取框架会抽取不同的信息结构。 + +本例中包含四类抽取任务:实体抽取、关系抽取、事件抽取和情感抽取。 +以“In 1997, Steve was excited to become the CEO of Apple.”为例,各个任务的目标结构为: + +- 实体:Steve - 人物实体、Apple - 组织机构实体 +- 关系:(Steve, Work For Apple) +- 事件:{类别: 就职事件, 触发词: become, 论元: [[雇主, Apple], [雇员, Steve]]} +- 情感:(exicted, become the CEO of Apple, Positive) + +该示例展示了如何使用 PaddleNLP 快速构建 [CCKS 2022 通用信息抽取比赛](https://aistudio.baidu.com/aistudio/competition/detail/161/0/task-definition)基线,构建单个模型同时对上述四个任务进行抽取。 + +## 环境安装 + +``` bash +pip install -r requirements.txt +``` + +## 目录结构 +``` text +. 
+├── config/ # 配置文件 +├── inference.py # 推理入口 +├── process_data.py # 比赛数据处理相关脚本 +├── README.md # 说明文件 +├── requirements.txt # Python 依赖包文件 +├── run_seq2struct.py # Python 入口 +└── uie/ + ├── evaluation # 信息抽取代码 + └── seq2struct # 序列到结构代码 +``` + +## 通用信息抽取基线 + +### 基线说明 + +本例采用面向信息抽取的统一序列到结构生成模型作为任务基线。 + +该模型将多种不同的信息抽取目标结构表示为统一的结构化抽取语言(Structured Extraction Language,SEL),并且通过端到端生成的方式实现复杂结构的抽取。 + +同时,该模型使用结构化框架前缀(Structural Schema Instructor,SSI)作为抽取目标来帮助模型区分不同的抽取任务。 + +**[报名竞赛](https://aistudio.baidu.com/aistudio/competition/detail/161/0/introduction)下载数据集后,从[这里](#quick-start)开始实现快速基线。** + +#### 结构化抽取语言 +结构化抽取语言将不同的目标结构进行统一结构表示。 +典型的结构化抽取语言的形式如下: +``` +( + (Spot Name: Info Span + (Assocation Name: Info Span) + (Assocation Name: Info Span) + ) +) +``` +其中, +- Spot Name: 信息点类别,如实体类型; +- Assocation Name (asoc/asso): 信息点关联类别,如关系类型、事件论元类型; +- Info Span: 信息点所对应的文本片段。 + +以`2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!`中的信息结构为例: + +- 该句中的国籍关系 SEL 表达式为: +``` +( + (人物: 谷爱凌 + (国籍: 中国) + ) +) +``` +- 该句中的夺冠事件 SEL 表达式为: +``` +( + (夺冠: 金牌 + (夺冠时间: 2月8号上午) + (冠军: 谷爱凌) + (夺冠赛事: 北京冬奥会自由式滑雪女子大跳台决赛) + ) +) +``` + +生成SEL表达式后,我们通过解析器将表达式解析成对应的结构化抽取记录。 + +``` +records = sel2record.sel2record( + pred=predicted_sel, # 生成的 SEL 表达式,例如 ((夺冠: 金牌 (冠军: 谷爱凌))) + text=raw_text, + ... +) +records 为解析后的抽取结果。例如 {类别: 夺冠, 触发词: 金牌, 冠军: 谷爱凌} + +``` + +#### 结构化模式前缀 +结构化模式前缀与带抽取的文本一同输入序列到结构生成模型,用于区分不同的抽取任务。 +基线模型使用特殊字符 `[spot]`、`[asoc]` 来组织结构化模式前缀,`[spot]` 对应 SEL 中的 SpotName 类别,`[asoc]` 对应 +不同任务的形式是: +- 实体抽取:[spot] 实体类别 [text] +- 关系抽取:[spot] 实体类别 [asoc] 关系类别 [text] +- 事件抽取:[spot] 事件类别 [asoc] 论元类别 [text] +- 情感抽取:[spot] 评价维度 [asoc] 观点类别 [text] + +以夺冠事件为例,其对应的SSI为 `[spot] 夺冠 [asoc] 夺冠事件 [asoc] 冠军 [asoc] 夺冠赛事 [text] 2月8日上午北京冬奥会自由...`。 + +本次大赛在框架定义文件([seen_schema.zip](https://aistudio.baidu.com/aistudio/competition/detail/161/0/datasets))中提供了详细的抽取类别定义和模板类型,欢迎选手尝试多种多样不同的输入形式和输出形式。 + +### 快速基线第一步:数据预处理并加载 + +从比赛官网下载数据集([duuie.zip](https://aistudio.baidu.com/aistudio/competition/detail/161/0/datasets)),解压存放于 data/ 目录下。 +预处理脚本将在原始数据中添加 Spot 和 Asoc 标注,将需要抽取的内容表示为 Spot-Asoc 的数据形式。 + +``` bash +python process_data.py preprocess +``` + +处理之后的数据将自动生成在 data/DuUIE_pre 下,每个实例中添加了 `spot`, `asoc` 和 `spot_asoc` 三个字段。 + +### 快速基线第二步:多任务模型训练 + +基线采用的预训练模型为字符级别的中文模型 [uie-char-small](https://paddlenlp.bj.bcebos.com/models/ccks2022/uie-char-small.zip),该模型采用两阶段的训练方式构建:首先使用 100G 中文数据进行 Span Corruption 预训练;然后使用远距离监督产生的文本-结构数据进行结构生成预训练。 +下载解压缩后开始多任务训练。 + +#### 多任务配置 + +本例中采用 Yaml 配置文件来配置不同任务的数据来源和验证方式,详见多任务配置文件 `config/multi-task-duuie.yaml`。 +本例将依据配置文件自动读取每个任务所需的训练数据进行训练,并对每个任务进行验证并汇报结果。 +``` bash +python3 run_seq2struct.py \ + --multi_task_config config/multi-task-duuie.yaml \ + --negative_keep 1.0 \ + --do_train \ + --metric_for_best_model=all-task-ave \ + --model_name_or_path=./uie-char-small \ + --num_train_epochs=10 \ + --per_device_train_batch_size=32 \ + --per_device_eval_batch_size=256 \ + --output_dir=output/duuie_multi_task_b32_lr5e-4 \ + --logging_dir=output/duuie_multi_task_b32_lr5e-4_log \ + --learning_rate=5e-4 \ + --overwrite_output_dir \ + --gradient_accumulation_steps 1 +``` + +训练完成后,将生成对应的文件夹 `output/duuie_multi_task_b32_lr5e-4` + +### 快速基线第三步:使用多任务模型进行预测 + +该步骤将依据不同任务的抽取框架进行信息抽取,并输出到对应的文件夹中。 +首先下载[测试文件](https://aistudio.baidu.com/aistudio/competition/detail/161/0/datasets)放置在data目录下,然后使用脚本将其自动处理并预测。 + +``` bash +python process_data.py split-test +python inference.py --data data/duuie_test_a --model output/duuie_multi_task_b32_lr5e-4 +``` + +### 快速基线第四步:后处理提交结果 + +该步骤将按照实例的 `id` 将不同任务的预测进行合并,生成提交数据 `submit.txt`。 + +``` bash +python 
process_data.py merge-test +``` + +### 进阶优化,提升模型性能 +本次大赛联合多个开源平台,提供大量开源数据集、免费计算资源、预训练语言模型和结构化知识图谱数据,参赛选手可以进一步构建数据集开发模型,提升模型性能。 +- [千言中文开源数据集](https://aistudio.baidu.com/aistudio/index) +- [飞桨AI Studio-人工智能学习实训社区](https://aistudio.baidu.com/aistudio/index) +- [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) +- [OpenKG 开放知识图谱社区](http://openkg.cn) + +## Reference +- **[Unified Structure Generation for Universal Information Extraction](https://arxiv.org/pdf/2203.12277.pdf)** +- [DuIE: A Large-scale Chinese Dataset for Information Extraction](http://tcci.ccf.org.cn/conference/2019/papers/EV10.pdf) +- [DuEE: A Large-Scale Dataset for Chinese Event Extraction in Real-World Scenarios](https://link.springer.com/chapter/10.1007/978-3-030-60457-8_44) +- [CASA: Conversational Aspect Sentiment Analysis for Dialogue Understanding](https://jair.org/index.php/jair/article/view/12802) diff --git a/examples/information_extraction/DuUIE/config/multi-task-duuie.yaml b/examples/information_extraction/DuUIE/config/multi-task-duuie.yaml new file mode 100644 index 0000000000000000000000000000000000000000..fafefbad28934241a36f897c996d30b4d6170ce5 --- /dev/null +++ b/examples/information_extraction/DuUIE/config/multi-task-duuie.yaml @@ -0,0 +1,172 @@ +T1: + name: DUIE_LIFE_SPO + path: data/duuie_pre/DUIE_LIFE_SPO + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-rel-boundary-F1 + +T2: + name: DUIE_ORG_SPO + path: data/duuie_pre/DUIE_ORG_SPO + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-rel-boundary-F1 + +T3: + name: 体育竞赛 + path: data/duuie_pre/体育竞赛 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T4: + name: 灾害意外 + path: data/duuie_pre/灾害意外 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T5: + name: CONV-ASA + path: data/duuie_pre/CONV-ASA + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-rel-strict-F1 + +T6: + name: MSRA + path: data/duuie_pre/MSRA + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-ent-F1 + +T7: + name: PEOPLE_DAILY + path: data/duuie_pre/PEOPLE_DAILY + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-ent-F1 + +T8: + name: 金融信息_中标 + path: data/duuie_pre/金融信息_中标 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + + +T9: + name: 金融信息_企业融资 + path: data/duuie_pre/金融信息_企业融资 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + + +T10: + name: 金融信息_股份回购 + path: data/duuie_pre/金融信息_股份回购 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + + +T11: + name: 金融信息_中标 + path: data/duuie_pre/金融信息_中标 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + + +T12: + name: 金融信息_高管变动 + path: data/duuie_pre/金融信息_高管变动 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + + +T13: + name: 金融信息_亏损 + path: data/duuie_pre/金融信息_亏损 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T14: + name: 金融信息_公司上市 + path: data/duuie_pre/金融信息_公司上市 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T15: + name: 金融信息_被约谈 + path: data/duuie_pre/金融信息_被约谈 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T16: + name: 金融信息_企业收购 + path: data/duuie_pre/金融信息_企业收购 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - 
string-evt-role-F1 + +T17: + name: 金融信息_股东减持 + path: data/duuie_pre/金融信息_股东减持 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T18: + name: 金融信息_解除质押 + path: data/duuie_pre/金融信息_解除质押 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T19: + name: 金融信息_企业破产 + path: data/duuie_pre/金融信息_企业破产 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T20: + name: 金融信息_股东增持 + path: data/duuie_pre/金融信息_股东增持 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 + +T21: + name: 金融信息_质押 + path: data/duuie_pre/金融信息_质押 + sel2record: longer_first_zh + eval_match_mode: set + metrics: + - string-evt-role-F1 diff --git a/examples/information_extraction/DuUIE/inference.py b/examples/information_extraction/DuUIE/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..b0a0662837c092eed0f8fa407fa214f995b5c17a --- /dev/null +++ b/examples/information_extraction/DuUIE/inference.py @@ -0,0 +1,164 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +import math +from tqdm import tqdm + +import paddle +from paddlenlp.data import Pad +from paddlenlp.transformers import T5ForConditionalGeneration + +from uie.evaluation.sel2record import RecordSchema, MapConfig, SEL2Record +from uie.seq2struct.t5_bert_tokenizer import T5BertTokenizer + +special_to_remove = {"", ""} + + +def read_json_file(file_name): + return [json.loads(line) for line in open(file_name, encoding="utf8")] + + +def schema_to_ssi(schema: RecordSchema): + # Convert Schema to SSI + # spot type ... 
asoc type + ssi = "" + "".join(sorted(schema.type_list)) + ssi += "" + "".join(sorted(schema.role_list)) + ssi += "" + return ssi + + +def post_processing(x): + for special in special_to_remove: + x = x.replace(special, "") + return x.strip() + + +class Predictor: + def __init__(self, model_path, max_source_length=256, max_target_length=192) -> None: + self.tokenizer = T5BertTokenizer.from_pretrained(model_path) + self.model = T5ForConditionalGeneration.from_pretrained(model_path) + self.model.eval() + self.max_source_length = max_source_length + self.max_target_length = max_target_length + + @paddle.no_grad() + def predict(self, text, schema): + def to_tensor(x): + return paddle.to_tensor(x, dtype="int64") + + ssi = schema_to_ssi(schema=schema) + + text = [ssi + x for x in text] + + inputs = self.tokenizer( + text, return_token_type_ids=False, return_attention_mask=True, max_seq_len=self.max_source_length + ) + + inputs = { + "input_ids": to_tensor(Pad(pad_val=self.tokenizer.pad_token_id)(inputs["input_ids"])), + "attention_mask": to_tensor(Pad(pad_val=0)(inputs["attention_mask"])), + } + + pred, _ = self.model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_length=self.max_target_length, + ) + + pred = self.tokenizer.batch_decode(pred.numpy()) + + return [post_processing(x) for x in pred] + + +def find_to_predict_folder(folder_name): + for root, dirs, _ in os.walk(folder_name): + for dirname in dirs: + data_name = os.path.join(root, dirname) + if os.path.exists(os.path.join(data_name, "record.schema")): + yield data_name + + +def main(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--data", "-d", required=True, help="Folder need to been predicted.") + parser.add_argument("--model", "-m", required=True, help="Trained model for inference") + parser.add_argument( + "--max_source_length", default=384, type=int, help="Max source length for inference, ssi + text" + ) + parser.add_argument("--max_target_length", default=192, type=int) + parser.add_argument("--batch_size", default=512, type=int) + parser.add_argument( + "-c", + "--config", + dest="map_config", + help="Offset mapping config, maping generated sel to offset record", + default="longer_first_zh", + ) + parser.add_argument("--verbose", action="store_true") + options = parser.parse_args() + + # Find the folder need to be predicted with `record.schema` + data_folder = find_to_predict_folder(options.data) + model_path = options.model + + predictor = Predictor( + model_path=model_path, max_source_length=options.max_source_length, max_target_length=options.max_target_length + ) + + for task_folder in data_folder: + + print(f"Extracting on {task_folder}") + schema = RecordSchema.read_from_file(os.path.join(task_folder, "record.schema")) + sel2record = SEL2Record( + schema_dict=SEL2Record.load_schema_dict(task_folder), + map_config=MapConfig.load_by_name(options.map_config), + tokenizer=predictor.tokenizer, + ) + + test_filename = os.path.join(f"{task_folder}", "test.json") + if not os.path.exists(test_filename): + print(f"{test_filename} not found, skip ...") + continue + + instances = read_json_file(test_filename) + text_list = [x["text"] for x in instances] + token_list = [list(x["text"]) for x in instances] + + batch_num = math.ceil(len(text_list) / options.batch_size) + + predict = list() + for index in tqdm(range(batch_num)): + start = index * options.batch_size + end = index * options.batch_size + options.batch_size + predict += 
predictor.predict(text_list[start:end], schema=schema) + + records = list() + for p, text, tokens in zip(predict, text_list, token_list): + records += [sel2record.sel2record(pred=p, text=text, tokens=tokens)] + + pred_filename = os.path.join(f"{task_folder}", "pred.json") + with open(pred_filename, "w", encoding="utf8") as output: + for record in records: + output.write(json.dumps(record, ensure_ascii=False) + "\n") + + +if __name__ == "__main__": + main() diff --git a/examples/information_extraction/DuUIE/process_data.py b/examples/information_extraction/DuUIE/process_data.py new file mode 100644 index 0000000000000000000000000000000000000000..0269223ccb7a287f73e777c694255c9fcce66981 --- /dev/null +++ b/examples/information_extraction/DuUIE/process_data.py @@ -0,0 +1,612 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +from typing import List, Dict +from collections import defaultdict +import yaml +import json +import os +from uie.evaluation.sel2record import RecordSchema, merge_schema + + +def load_definition_schema_file(filename): + """Load schema file in Yaml + 读取 YAML 定义的 Schema 文件 + """ + return yaml.load(open(filename, encoding="utf8"), Loader=yaml.FullLoader) + + +def load_jsonlines_file(filename): + """Load Data file in JSONLINE + 读取 JSONLINE 文件 + """ + return [json.loads(line) for line in open(filename, encoding="utf8")] + + +def convert_entity_schema(entity_schema): + """Convert entity schmea to record schema""" + spots = list() + asocs = list() + spot_asoc_map = dict() + for entity in entity_schema: + spots += [entity] + spot_asoc_map[entity] = list() + return spots, asocs, spot_asoc_map + + +def convert_entity_relation_schema(entity_schema, relation_schema): + """Convert entity and relation chmea to record schema""" + spots = list() + asocs = list() + spot_asoc_map = dict() + for entity in entity_schema: + spots += [entity] + spot_asoc_map[entity] = list() + for relation in relation_schema: + asocs += [relation] + arg1_type = relation_schema[relation]["主体"] + if arg1_type not in spots: + spots += [arg1_type] + spot_asoc_map[arg1_type] = list() + spot_asoc_map[arg1_type] += [relation] + return spots, asocs, spot_asoc_map + + +def convert_event_schema(schema): + """Convert event schmea to record schema""" + spots = list() + asocs = set() + spot_asoc_map = dict() + for event_type, definition in schema.items(): + spots += [event_type] + spot_asoc_map[event_type] = list() + for arg in definition["参数"]: + asocs.add(arg) + spot_asoc_map[event_type] += [arg] + return spots, list(asocs), spot_asoc_map + + +def dump_schema(output_folder, schema_dict): + if not os.path.exists(output_folder): + os.makedirs(output_folder) + + for schema_name, schema in schema_dict.items(): + schema_file = f"{output_folder}/{schema_name}.schema" + with open(schema_file, "w", encoding="utf8") as output: + for element in schema: + output.write(json.dumps(element, ensure_ascii=False) 
+ "\n") + + +def main_entity_relation(schema_file, schema_name, instances, output_folder): + schema = yaml.load(open(schema_file, encoding="utf8"), Loader=yaml.FullLoader) + entity_schema = convert_entity_schema(schema.get("实体", {})) + relation_schema = convert_entity_relation_schema(schema.get("实体", {}), schema.get("关系", {})) + event_schema = convert_event_schema({}) + dump_schema( + output_folder=output_folder, + schema_dict={ + "entity": entity_schema, + "relation": relation_schema, + "event": event_schema, + "record": relation_schema, + }, + ) + + with open(f"{output_folder}/test.json", "w", encoding="utf8") as output: + for instance in instances: + if instance["schema"] == schema_name: + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + return schema_name + + +def main_event(schema_file, schema_name, instances, output_folder): + schema = yaml.load(open(schema_file, encoding="utf8"), Loader=yaml.FullLoader) + event_schema = convert_event_schema(schema.get("事件", {})) + dump_schema( + output_folder=output_folder, + schema_dict={ + "entity": [[], [], {}], + "relation": [[], [], {}], + "event": event_schema, + "record": event_schema, + }, + ) + + with open(f"{output_folder}/test.json", "w", encoding="utf8") as output: + for instance in instances: + if instance["schema"] == schema_name: + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + return schema_name + + +def main_seprate_event(schema_file, schema_name, instances, output_folder): + """Prediction tasks are separated by event types + 按照事件类别分离预测任务生成抽取的 Schema + """ + valid_instances = list() + + for instance in instances: + if schema_name == instance["schema"]: + valid_instances += [instance] + + schema = yaml.load(open(schema_file, encoding="utf8"), Loader=yaml.FullLoader) + _, _, event_map = convert_event_schema(schema.get("事件", {})) + + for event in event_map: + subevent_output_folder = f"{output_folder}_{event}" + dump_schema( + output_folder=subevent_output_folder, + schema_dict={ + "entity": [[], [], {}], + "relation": [[], [], {}], + "event": [[event], event_map[event], {event: event_map[event]}], + "record": [[event], event_map[event], {event: event_map[event]}], + }, + ) + + with open(f"{subevent_output_folder}/test.json", "w", encoding="utf8") as output: + for instance in valid_instances: + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + return event_map.keys() + + +# 将关系抽取结果转换到提交格式 +def convert_relation(relation): + return { + "type": relation[0], + "args": [ + {"type": relation[1], "text": relation[2]}, + {"type": relation[3], "text": relation[4]}, + ], + } + + +# 将实体抽取结果转换到提交格式 +def convert_entity(entity): + return { + "type": entity[0], + "text": entity[1], + } + + +def convert_event(event): + return { + "type": event["type"], + "text": event["trigger"], + "args": [{"type": role_type, "text": arg} for role_type, arg in event["roles"]], + } + + +def merge_pred_text_file(text_filename, pred_filename): + """Merge extracted result + 基于实例编号合并抽取结果 + """ + + # 读取原始文件中的数据,用于获取 ID + test_instances = load_jsonlines_file(text_filename) + # 读取抽取结果的预测文件 + pred_instances = load_jsonlines_file(pred_filename) + + assert len(test_instances) == len(pred_instances) + + to_sumbit_instances = dict() + for test_instance, pred_instance in zip(test_instances, pred_instances): + + # 获取抽取结果中的字符串结果 + entity_list = pred_instance["entity"].get("string", []) + relation_list = pred_instance["relation"].get("string", []) + event_list = pred_instance["event"].get("string", []) + + # 将抽取结果转换为提交的数据格式 + 
to_sumbit_instance = { + "id": test_instance["id"], + "entity": [convert_entity(entity) for entity in entity_list], + "relation": [convert_relation(relation) for relation in relation_list], + "event": [convert_event(event) for event in event_list], + } + to_sumbit_instances[test_instance["id"]] = to_sumbit_instance + + return to_sumbit_instances + + +def split_test(options): + test_file = options.data_file + schema_folder = options.schema_folder + output_folder = options.output_folder + instances = [json.loads(line) for line in open(test_file, encoding="utf8")] + main_entity_relation( + os.path.join(schema_folder, "人生信息.yaml"), "人生信息", instances, os.path.join(output_folder, "人生信息") + ) + main_entity_relation( + os.path.join(schema_folder, "机构信息.yaml"), "机构信息", instances, os.path.join(output_folder, "机构信息") + ) + main_entity_relation( + os.path.join(schema_folder, "影视情感.yaml"), "影视情感", instances, os.path.join(output_folder, "影视情感") + ) + main_event(os.path.join(schema_folder, "灾害意外.yaml"), "灾害意外", instances, os.path.join(output_folder, "灾害意外")) + main_event(os.path.join(schema_folder, "体育竞赛.yaml"), "体育竞赛", instances, os.path.join(output_folder, "体育竞赛")) + main_seprate_event( + os.path.join(schema_folder, "金融信息.yaml"), "金融信息", instances, os.path.join(output_folder, "金融信息") + ) + + +def merge_test(options): + """Merge predicted result from trained model + 将预测文件夹中的预测结果进行合并 + """ + output_folder = options.pred_folder + submit_filename = options.submit + + to_sumbit_instances = dict() + for schema in os.listdir(output_folder): + test_filename = os.path.join(output_folder, schema, "test.json") + pred_filename = os.path.join(output_folder, schema, "pred.json") + sub_to_sumbit_instances = merge_pred_text_file( + text_filename=test_filename, + pred_filename=pred_filename, + ) + print(f"Merge {schema} with {len(sub_to_sumbit_instances)} instances ...") + for instance_id, instance in sub_to_sumbit_instances.items(): + if instance_id in to_sumbit_instances: + to_sumbit_instances[instance_id]["entity"] += instance.get("entity", []) + to_sumbit_instances[instance_id]["relation"] += instance.get("relation", []) + to_sumbit_instances[instance_id]["event"] += instance.get("event", []) + else: + to_sumbit_instances[instance_id] = instance + + print(f"To submit instances number: {len(to_sumbit_instances)}") + with open(submit_filename, "w", encoding="utf8") as output: + for instance in to_sumbit_instances.values(): + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + + +def annonote_graph(entities: List[Dict] = [], relations: List[Dict] = [], events: List[Dict] = []): + """Convert Entity Relation Event to Spot-Assocation Graph + 将实体、关系和事件的标注信息转换成需要生成的 Spot-Assocation 结构 + + Args: + tokens (List[str]): Token List + entities (List[Entity], optional): Entity List. Defaults to []. + relations (List[Relation], optional): Relation List. Defaults to []. + events (List[Event], optional): Event List. Defaults to []. 
+ + Returns: + set: Set of Spot + set: Set of Asoc + list: Instance of Spot-Asoc + """ + spot_dict = dict() + asoc_dict = defaultdict(list) + + def add_spot(spot): + spot_key = (tuple(spot["offset"]), spot["type"]) + spot_dict[spot_key] = spot + + def add_asoc(spot, asoc, tail): + spot_key = (tuple(spot["offset"]), spot["type"]) + asoc_dict[spot_key] += [(tuple(tail["offset"]), tail["text"], asoc)] + + for entity in entities: + add_spot(spot=entity) + + for relation in relations: + add_spot(spot=relation["args"][0]) + add_asoc(spot=relation["args"][0], asoc=relation["type"], tail=relation["args"][1]) + + for event in events: + add_spot(spot=event) + for argument in event["args"]: + add_asoc(spot=event, asoc=argument["type"], tail=argument) + + spot_asoc_instance = list() + for spot_key in sorted(spot_dict.keys()): + offset, label = spot_key + + if len(spot_dict[spot_key]["offset"]) == 0: + continue + + spot_instance = { + "span": spot_dict[spot_key]["text"], + "label": label, + "asoc": list(), + } + for tail_offset, tail_text, asoc in sorted(asoc_dict.get(spot_key, [])): + if len(tail_offset) == 0: + continue + + spot_instance["asoc"] += [(asoc, tail_text)] + spot_asoc_instance += [spot_instance] + + spot_labels = set([label for _, label in spot_dict.keys()]) + asoc_labels = set() + for _, asoc_list in asoc_dict.items(): + for _, _, asoc in asoc_list: + asoc_labels.add(asoc) + return spot_labels, asoc_labels, spot_asoc_instance + + +def add_spot_asoc_to_single_file(filename): + instances = [json.loads(line) for line in open(filename, encoding="utf8")] + print(f"Add spot asoc to {filename} ...") + with open(filename, "w", encoding="utf8") as output: + for instance in instances: + spots, asocs, spot_asoc_instance = annonote_graph( + entities=instance["entity"], + relations=instance["relation"], + events=instance["event"], + ) + # 将信息结构转换成 Spot Asoc 形式 + instance["spot_asoc"] = spot_asoc_instance + # 添加该实例中存在的 Spot 类别 + instance["spot"] = list(spots) + # 添加该实例中存在的 Asoc 类别 + instance["asoc"] = list(asocs) + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + + +def convert_duuie_to_spotasoc(data_folder, ignore_datasets): + + schema_list = list() + + for task_folder in os.listdir(data_folder): + if task_folder in ignore_datasets: + continue + if not os.path.isdir(os.path.join(data_folder, task_folder)): + continue + + print(f"Add spot asoc to {task_folder} ...") + # 读取单任务的 Schema + task_schema_file = os.path.join(data_folder, task_folder, "record.schema") + + # 向单任务数据中添加 Spot Asoc 标注 + add_spot_asoc_to_single_file(os.path.join(data_folder, task_folder, "train.json")) + add_spot_asoc_to_single_file(os.path.join(data_folder, task_folder, "val.json")) + record_schema = RecordSchema.read_from_file(task_schema_file) + + schema_list += [record_schema] + + for line in open(os.path.join(data_folder, task_folder, "train.json"), encoding="utf8"): + new_instance = json.loads(line) + # 添加任务中所有的 Spot 类别 + new_instance["spot"] = record_schema.type_list + # 添加任务中所有的 Asoc 类别 + new_instance["asoc"] = record_schema.role_list + + for line in open(os.path.join(data_folder, task_folder, "val.json"), encoding="utf8"): + new_instance = json.loads(line) + # 添加任务中所有的 Spot 类别 + new_instance["spot"] = record_schema.type_list + # 添加任务中所有的 Asoc 类别 + new_instance["asoc"] = record_schema.role_list + + # 融合不同任务的 Schema + multi_schema = merge_schema(schema_list) + multi_schema.write_to_file(os.path.join(data_folder, "record.schema")) + + +def dump_instances(instances, output_filename): + with open(output_filename, 
"w", encoding="utf8") as output: + for instance in instances: + output.write(json.dumps(instance, ensure_ascii=False) + "\n") + + +def dump_event_schema(event_map, output_folder): + role_list = list() + for roles in event_map.values(): + role_list += roles["参数"] + rols_list = list(set(role_list)) + type_list = list(event_map.keys()) + type_role_map = {event_type: list(event_map[event_type]["参数"].keys()) for event_type in event_map} + dump_schema( + output_folder=output_folder, + schema_dict={ + "entity": [[], [], {}], + "relation": [[], [], {}], + "event": [type_list, rols_list, type_role_map], + "record": [type_list, rols_list, type_role_map], + }, + ) + + +def filter_event_in_instance(instances, required_event_types): + """Filter events in the instance, keep event mentions with `required_event_types` + 过滤实例中的事件,只保留需要的事件类别的事件标注 + """ + import copy + + new_instances = list() + for instance in instances: + new_instance = copy.deepcopy(instance) + new_instance["event"] = list(filter(lambda x: x["type"] in required_event_types, new_instance["event"])) + new_instances += [new_instance] + return new_instances + + +def filter_event(data_folder, event_types, output_folder): + """Keep event with `event_types` in `data_folder` save to `output_folder` + 过滤 `data_folder` 中的事件,只保留 `event_types` 类型事件保存到 `output_folder`""" + dump_event_schema(event_types, output_folder) + for split in ["train", "val"]: + filename = os.path.join(data_folder, f"{split}.json") + instances = [json.loads(line.strip()) for line in open(filename, encoding="utf8")] + new_instances = filter_event_in_instance(instances, required_event_types=event_types) + dump_instances(new_instances, os.path.join(output_folder, f"{split}.json")) + + +def preprocess_event(data_folder, schema_folder): + """Preprocessing event dataset for CCKS 2022 + 针对 CCKS 2022 竞赛数据进行预处理 + """ + # Filter event annotation in raw data, only keep the required event in CCKS 2022 + # 对事件数据进行预处理,过滤除 `灾害意外` 和 `体育竞赛` 外的事件标注 + for schema in ["灾害意外", "体育竞赛"]: + print(f"Building {schema} dataset ...") + duee_folder = os.path.join(data_folder, "DUEE") + schema_file = os.path.join(schema_folder, f"{schema}.yaml") + output_folder = os.path.join(data_folder, schema) + schema = load_definition_schema_file(schema_file) + filter_event( + data_folder=duee_folder, + event_types=schema["事件"], + output_folder=output_folder, + ) + + for schema in ["金融信息"]: + print(f"Building {schema} dataset ...") + duee_fin_folder = os.path.join(data_folder, "DUEE_FIN_LITE") + schema_file = os.path.join(schema_folder, f"{schema}.yaml") + output_folder = os.path.join(data_folder, schema) + schema = load_definition_schema_file(schema_file) + # 依据不同事件类别将多事件抽取分割成多个单事件类型抽取 + # Separate multi-type extraction to multiple single-type extraction + for event_type in schema["事件"]: + filter_event( + data_folder=duee_fin_folder, + event_types={event_type: schema["事件"][event_type]}, + output_folder=output_folder + "_" + event_type, + ) + + +def merge_instance(instance_list): + """Merge instances with same text but different annotation + 合并文本相同标记不同的实例 + """ + + def all_equal(_x): + for __x in _x: + if __x != _x[0]: + return False + return True + + def entity_key(_x): + return (tuple(_x["offset"]), _x["type"]) + + def relation_key(_x): + return ( + tuple(_x["type"]), + tuple(_x["args"][0]["offset"]), + _x["args"][0]["type"], + tuple(_x["args"][1]["offset"]), + _x["args"][1]["type"], + ) + + def event_key(_x): + return (tuple(_x["offset"]), _x["type"]) + + assert all_equal([x["text"] for x in instance_list]) + + 
element_dict = { + "entity": dict(), + "relation": dict(), + "event": dict(), + } + instance_id_list = list() + for x in instance_list: + instance_id_list += [x["id"]] + for entity in x.get("entity", list()): + element_dict["entity"][entity_key(entity)] = entity + for relation in x.get("relation", list()): + element_dict["relation"][relation_key(relation)] = relation + for event in x.get("event", list()): + element_dict["event"][event_key(event)] = event + + return { + "id": "-".join(instance_id_list), + "text": instance_list[0]["text"], + "tokens": instance_list[0]["tokens"], + "entity": list(element_dict["entity"].values()), + "relation": list(element_dict["relation"].values()), + "event": list(element_dict["event"].values()), + } + + +def preprocess_duie(data_folder): + life_folder = os.path.join(data_folder, "DUIE_LIFE_SPO") + org_folder = os.path.join(data_folder, "DUIE_ORG_SPO") + life_train_instances = load_jsonlines_file(f"{life_folder}/train.json") + org_train_instances = load_jsonlines_file(f"{org_folder}/train.json") + life_relation = RecordSchema.read_from_file(f"{life_folder}/record.schema").role_list + org_relation = RecordSchema.read_from_file(f"{org_folder}/record.schema").role_list + + instance_dict = defaultdict(list) + for instance in life_train_instances + org_train_instances: + instance_dict[instance["text"]] += [instance] + + for text in instance_dict: + instance_dict[text] = merge_instance(instance_dict[text]) + + with open(f"{life_folder}/train.json", "w") as output: + for instance in instance_dict.values(): + new_instance = copy.deepcopy(instance) + new_instance["relation"] = list(filter(lambda x: x["type"] in life_relation, instance["relation"])) + output.write(json.dumps(new_instance) + "\n") + + with open(f"{org_folder}/train.json", "w") as output: + for instance in instance_dict.values(): + new_instance = copy.deepcopy(instance) + new_instance["relation"] = list(filter(lambda x: x["type"] in org_relation, instance["relation"])) + output.write(json.dumps(new_instance) + "\n") + + +def preprocess(options): + """Preprocessing event dataset for CCKS 2022 + 针对 CCKS 2022 竞赛数据进行预处理 + """ + import shutil + + shutil.rmtree(options.output_folder) if os.path.exists(options.output_folder) else None + shutil.copytree(options.train_data, options.output_folder) + + preprocess_duie(data_folder=options.output_folder) + preprocess_event(data_folder=options.output_folder, schema_folder=options.schema_folder) + convert_duuie_to_spotasoc(data_folder=options.output_folder, ignore_datasets=options.ignore_datasets) + + +if __name__ == "__main__": + import argparse + + parser = argparse.ArgumentParser() + + subparsers = parser.add_subparsers(help="Data preprocessing scripts for CCKS 2022") + + parser_t = subparsers.add_parser("preprocess", help="Data preprocessing") + parser_t.add_argument("--train_data", default="data/duuie", help="Path for DuUIE data folder") + parser_t.add_argument("--output_folder", default="data/duuie_pre", help="Path for Preprocessed DuUIE data folder") + parser_t.add_argument( + "--ignore_datasets", + default=["DUEE", "DUEE_FIN_LITE"], + nargs="+", + help="Ignore dataset in `output_folder` for training", + ) + parser_t.add_argument("--schema_folder", default="data/seen_schema", help="Path for seen schema folder") + parser_t.set_defaults(func=preprocess) + + parser_a = subparsers.add_parser("split-test", help="Split test file with schema for prediction") + parser_a.add_argument("--data_file", default="data/duuie_test_a.json", help="Path for DuUIE data file") + 
parser_a.add_argument("--output_folder", default="data/duuie_test_a", help="Path for DuUIE predicted folder") + parser_a.add_argument("--schema_folder", default="data/seen_schema", help="Path for seen schema folder") + parser_a.set_defaults(func=split_test) + + parser_b = subparsers.add_parser("merge-test", help="Merge predicted result for submission") + parser_b.add_argument("--data_file", default="data/duuie_test_a.json", help="Path for DuUIE data file") + parser_b.add_argument("--pred_folder", default="data/duuie_test_a", help="Path for DuUIE predicted folder") + parser_b.add_argument("--submit", default="submit.txt", help="Path for output submission file") + parser_b.set_defaults(func=merge_test) + + options = parser.parse_args() + options.func(options) diff --git a/examples/information_extraction/DuUIE/requirements.txt b/examples/information_extraction/DuUIE/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..88bbb7f977ad1f22e2e26b8f4ae3082f73c385fe --- /dev/null +++ b/examples/information_extraction/DuUIE/requirements.txt @@ -0,0 +1,5 @@ +tabulate +importlib_metadata +nltk +visualdl +paddlenlp \ No newline at end of file diff --git a/examples/information_extraction/DuUIE/run_seq2struct.py b/examples/information_extraction/DuUIE/run_seq2struct.py new file mode 100644 index 0000000000000000000000000000000000000000..cb566bfe8e3917f6569b3de2ae465eb60efe2fe8 --- /dev/null +++ b/examples/information_extraction/DuUIE/run_seq2struct.py @@ -0,0 +1,497 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import math +import os + +import paddle +from paddle.amp import GradScaler, auto_cast +from paddle.optimizer import AdamW +from uie.evaluation.sel2record import evaluate_extraction_results +from uie.seq2struct.t5_bert_tokenizer import T5BertTokenizer +from uie.seq2struct.utils import ( + better_print_multi, + get_scheduler, + get_train_dataloader, + get_writer, + load_eval_tasks, + save_checkpoint, + set_logger, + set_seed, +) + +from paddlenlp.transformers import T5ForConditionalGeneration + +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--multi_task_config", + required=True, + help="Path to multi-task config file.", + ) + parser.add_argument( + "--model_name_or_path", + default="t5-large", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + type=str, + help="The output directory where the model predictions and checkpoints" + " will be written. 
Default as `outputs`", + ) + parser.add_argument( + "--overwrite_output_dir", + action="store_true", + help="Overwrite output directory", + ) + parser.add_argument( + "--logging_dir", + type=str, + help="The output logging directory", + ) + parser.add_argument( + "--max_source_length", + default=384, + type=int, + help="The maximum total input sequence length after tokenization." + " Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--max_target_length", + default=192, + type=int, + help="The maximum total target sequence length to be generated.", + ) + parser.add_argument("--max_prefix_length", default=None, type=int, help="The maximum prefix length.") + parser.add_argument( + "--per_device_train_batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--per_device_eval_batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for evaluating.", + ) + parser.add_argument( + "--metric_for_best_model", + type=str, + help="The main metric to choose the best checkpoint.", + ) + parser.add_argument( + "--gradient_accumulation_steps", + default=1, + type=int, + help="gradient_accumulation_steps.", + ) + parser.add_argument( + "--learning_rate", + default=5e-4, + type=float, + help="The initial learning rate for Adam.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_beta1", default=0.9, type=float, help="Beta1 for AdamW optimizer") + parser.add_argument("--adam_beta2", default=0.999, type=float, help="Beta2 for AdamW optimizer") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=10, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_ratio", + default=0.06, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument( + "--ignore_pad_token_for_loss", + type=bool, + default=True, + help="Whether to ignore the tokens corresponding to padded labels in the loss computation or not.", + ) + parser.add_argument( + "--warmup_steps", + type=int, + default=-1, + help="warmup_steps.", + ) + parser.add_argument( + "--logging_steps", + type=int, + default=500, + help="Log every X updates steps.", + ) + parser.add_argument( + "--seed", + type=int, + default=42, + help="random seed for initialization", + ) + parser.add_argument( + "--writer_type", + choices=["visualdl", "tensorboard"], + default="visualdl", + help="writer_type.", + ) + parser.add_argument( + "--lr_scheduler_type", + choices=["linear", "cosine", "poly"], + default="linear", + type=str, + help="lr_scheduler_type.", + ) + parser.add_argument("--use_amp", "--fp16", action="store_true", help="Enable mixed precision training.") + parser.add_argument( + "--scale_loss", + type=float, + default=2**15, + help="The value of scale_loss for fp16.", + ) + parser.add_argument( + "--dataloader_num_workers", + type=int, + default=0, + help="dataloader_num_workers.", + ) + parser.add_argument("--spot_noise", type=float, default=0, help="The noise rate of inserting rejection null spot.") + parser.add_argument("--asoc_noise", type=float, default=0, help="The noise rate of inserting rejection null asoc.") + parser.add_argument( + "--negative_keep", type=float, default=1.0, help="The keep rate of negative instance for fast training." + ) + parser.add_argument("--meta_positive_rate", type=float, default=1, help="The keep rate of positive spot.") + parser.add_argument("--meta_negative", type=int, default=-1, help="Negative Schema Number in Training.") + parser.add_argument( + "--ordered_prompt", action="store_true", help="Whether to sort the spot prompt and asoc prompt or not." + ) + parser.add_argument("--do_train", action="store_true") + parser.add_argument("--do_eval", action="store_true") + parser.add_argument( + "--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Device for selecting for the training." 
+ ) + + args = parser.parse_args() + + # Sanity check + if not (args.do_train or args.do_eval): + raise ValueError('At least one of the "--do_train" or "--do_eval" should be true.') + if args.do_train and not args.output_dir: + raise ValueError("--output_dir should be given when --do_train is true.") + + return args + + +@paddle.no_grad() +def evaluate(model, tokenizer, data_loader, generate_max_length, eval_instances, sel2record, eval_match_mode): + """Evaluate single task""" + + model = model._layers if isinstance(model, paddle.DataParallel) else model + + model.eval() + + to_remove_token_list = list() + if tokenizer.eos_token: + to_remove_token_list += [tokenizer.eos_token] + if tokenizer.pad_token: + to_remove_token_list += [tokenizer.pad_token] + + def postprocess_text(x_str): + # Clean `bos` `eos` `pad` for cleaned text + for to_remove_token in to_remove_token_list: + x_str = x_str.replace(to_remove_token, "") + + return x_str.strip() + + # Generate SEL using Trained Model + all_preds = [] + for batch in data_loader: + + outputs, scores = model.generate( + input_ids=batch["input_ids"], + attention_mask=batch["attention_mask"], + max_length=generate_max_length, + use_fast=True, + ) + + # Convert Token id to Token String + outputs = tokenizer.batch_decode(outputs, clean_up_tokenization_spaces=False, skip_special_tokens=False) + + preds = [postprocess_text(output) for output in outputs] + all_preds.extend(preds) + + assert len(all_preds) == len(eval_instances) + + # Parsing SEL to Record + all_records = [] + for predicted_sel, instance in zip(all_preds, eval_instances): + record = sel2record.sel2record(pred=predicted_sel, text=instance["text"], tokens=instance["tokens"]) + all_records += [record] + + task_metrics = evaluate_extraction_results(eval_instances, all_records, eval_match_mode=eval_match_mode) + + prediction = {"record": all_records, "sel": all_preds, "metric": task_metrics} + + return task_metrics, prediction + + +def eval_all_tasks(eval_tasks, model, tokenizer, generate_max_length): + """Evaluate all tasks""" + eval_overall_results = dict() + eval_overall_predictions = dict() + for task_name, eval_task in eval_tasks.items(): + # Evaulate single task + logger.info(f"Evaluate {task_name} ...") + eval_results, eval_prediction = evaluate( + model=model, + tokenizer=tokenizer, + data_loader=eval_task.dataloader, + generate_max_length=generate_max_length, + eval_instances=eval_task.val_instances, + sel2record=eval_task.sel2record, + eval_match_mode=eval_task.config.eval_match_mode, + ) + + for metric_name in eval_task.metrics: + metric_key = f"{task_name}:{metric_name}" + eval_overall_results[metric_key] = eval_results[metric_name] + + eval_overall_predictions[task_name] = eval_prediction + + sum_metric = sum(eval_overall_results.values()) + number_metric = len(eval_overall_results.values()) + eval_overall_results["all-task-ave"] = sum_metric / float(number_metric) + + return eval_overall_results, eval_overall_predictions + + +def test(args, model, tokenizer): + eval_tasks = load_eval_tasks(model=model, tokenizer=tokenizer, args=args) + + eval_overall_results, eval_predictions = eval_all_tasks( + eval_tasks=eval_tasks, + model=model, + tokenizer=tokenizer, + generate_max_length=args.max_target_length, + ) + + for line in better_print_multi(eval_overall_results).split("\n"): + logger.info(line) + + +def train(args, model, tokenizer): + + set_seed(args) + + generate_max_length = args.max_target_length + + writer = get_writer(args) + + # Distributed Setting + if 
paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + model = paddle.DataParallel(model) + + # get dataloader + train_dataloader = get_train_dataloader( + model=model, + tokenizer=tokenizer, + args=args, + ) + eval_tasks = load_eval_tasks(model=model, tokenizer=tokenizer, args=args) if args.do_eval else None + + def math_ceil(x, y): + return math.ceil(x / float(y)) + + num_update_steps_per_epoch = math_ceil(len(train_dataloader), args.gradient_accumulation_steps) + if args.logging_steps > num_update_steps_per_epoch: + args.logging_steps = num_update_steps_per_epoch + if args.max_steps > 0: + args.num_train_epochs = math_ceil(args.max_steps, num_update_steps_per_epoch) + else: + args.max_steps = args.num_train_epochs * num_update_steps_per_epoch + + # get lr_scheduler + lr_scheduler = get_scheduler( + learning_rate=args.learning_rate, + scheduler_type=args.lr_scheduler_type, + num_warmup_steps=args.warmup_steps if args.warmup_steps > 0 else args.warmup_ratio, + num_training_steps=args.max_steps, + ) + + total_batch_size = ( + args.per_device_train_batch_size * args.gradient_accumulation_steps * paddle.distributed.get_world_size() + ) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + grad_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = AdamW( + learning_rate=lr_scheduler, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=grad_clip, + ) + + if args.use_amp: + scaler = GradScaler(init_loss_scaling=args.scale_loss) + + logger.info("********** Running training **********") + logger.info(f" Num examples = {len(train_dataloader.dataset)}") + logger.info(f" Num Epochs = {args.num_train_epochs}") + logger.info(f" Device train batch size = {args.per_device_train_batch_size}") + logger.info(f" Device eval batch size = {args.per_device_eval_batch_size}") + logger.info(f" Total train batch size (w. 
accumulation) = {total_batch_size}") + logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}") + logger.info(f" Total optimization steps = {args.max_steps}") + + global_steps = 0 + tr_loss, logging_loss = 0.0, 0.0 + + best_score = 0.0 + + def logging_lr_loss(): + cur_lr = lr_scheduler.get_lr() + cur_loss = (tr_loss - logging_loss) / args.logging_steps + writer.add_scalar("lr", cur_lr, global_steps) + writer.add_scalar("loss", cur_loss, global_steps) + logger.info(f"global_steps {global_steps}/{args.max_steps}" f" - lr: {cur_lr:.10f} loss: {cur_loss:.10f}") + + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_dataloader): + model.train() + + with auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax"]): + outputs = model(**batch) + loss = outputs[0] / args.gradient_accumulation_steps + tr_loss += loss.item() + + if args.use_amp: + scaler.scale(loss).backward() + else: + loss.backward() + + if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1: + if args.use_amp: + scaler.minimize(optimizer, loss) + else: + optimizer.step() + + lr_scheduler.step() + optimizer.clear_grad() + global_steps += 1 + + if args.logging_steps > 0 and global_steps % args.logging_steps == 0: + if paddle.distributed.get_rank() == 0: + logging_lr_loss() + logging_loss = tr_loss + + save_checkpoint(tokenizer, model, os.path.join(args.output_dir, f"ckpt_epoch{epoch}")) + if args.do_eval and paddle.distributed.get_rank() == 0: + + logger.info("********** Running evaluating **********") + logger.info(f"************* Epoch {epoch} ************") + + eval_overall_results, eval_predictions = eval_all_tasks( + eval_tasks=eval_tasks, + model=model, + tokenizer=tokenizer, + generate_max_length=generate_max_length, + ) + + for line in better_print_multi(eval_overall_results).split("\n"): + logger.info(line) + + if args.metric_for_best_model not in eval_overall_results: + raise ValueError( + f"Main metric {args.metric_for_best_model} " f"is not in {eval_overall_results.keys()}." + ) + + logger.info("********** Evaluating Done **********") + current_score = eval_overall_results[args.metric_for_best_model] + if current_score > best_score: + logger.info("********** Saving Model **********") + best_score = current_score + save_checkpoint(tokenizer, model, os.path.join(args.output_dir, "best")) + + best_ckpt_file = os.path.join(args.output_dir, "best", "model_state.pdparams") + if os.path.exists(best_ckpt_file): + logger.info(f"Load best checkpoint from {best_ckpt_file}") + model.load_dict(paddle.load(best_ckpt_file)) + + save_checkpoint(tokenizer, model, args.output_dir) + + +def main(args): + logger.info("********** Configuration Arguments **********") + for arg, value in sorted(vars(args).items()): + logger.info(f"{arg}: {value}") + logger.info("**********************************************") + + if args.do_train and args.output_dir is not None: + if os.path.exists(args.output_dir) and not args.overwrite_output_dir: + raise ValueError( + f"Output directory ({args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." 
+ ) + else: + os.makedirs(args.output_dir, exist_ok=True) + + # Set device + paddle.set_device(args.device) + + # Prepare model and tokenizer + tokenizer = T5BertTokenizer.from_pretrained(args.model_name_or_path) + model = T5ForConditionalGeneration.from_pretrained(args.model_name_or_path) + + if args.do_train: + train(args, model, tokenizer) + + if args.do_eval: + test(args, model, tokenizer) + + logger.info(f"Output Dir: {args.output_dir}") + + +if __name__ == "__main__": + args = parse_args() + os.makedirs("caches", exist_ok=True) + if args.logging_dir is not None: + os.makedirs(args.logging_dir, exist_ok=True) + set_logger(args) + logger.info(args) + main(args) diff --git a/examples/information_extraction/DuUIE/uie/__init__.py b/examples/information_extraction/DuUIE/uie/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..b7fa799a6a6e843079c76d8dae328db51d57c8e7 --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/__init__.py @@ -0,0 +1,19 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Code for Evaluation and Sequence-to-Structure +""" diff --git a/examples/information_extraction/DuUIE/uie/evaluation/__init__.py b/examples/information_extraction/DuUIE/uie/evaluation/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..595126a161a77fa1215a2a2032dbd19d35f88a51 --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/evaluation/__init__.py @@ -0,0 +1,19 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Code for Evaluation +""" diff --git a/examples/information_extraction/DuUIE/uie/evaluation/constants.py b/examples/information_extraction/DuUIE/uie/evaluation/constants.py new file mode 100644 index 0000000000000000000000000000000000000000..ac6eacf6a36c296c3150b6030673657c12f3030a --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/evaluation/constants.py @@ -0,0 +1,72 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass
+
+spot_prompt = ""
+asoc_prompt = ""
+
+type_start = ""
+type_end = ""
+text_start = ""
+span_start = ""
+null_span = ""
+null_label = ""
+
+offset_map_strategy = {
+    "closest_en": {
+        "map_strategy": "closest",
+        "de_duplicate": True,
+        "span_to_token": "space",
+    },
+    "closest_zh": {
+        "map_strategy": "closest",
+        "de_duplicate": True,
+        "span_to_token": "list",
+    },
+    "first_en": {
+        "map_strategy": "first",
+        "de_duplicate": True,
+        "span_to_token": "space",
+    },
+    "first_zh": {
+        "map_strategy": "first",
+        "de_duplicate": True,
+        "span_to_token": "list",
+    },
+    "longer_first_zh": {
+        "map_strategy": "longer_first",
+        "de_duplicate": True,
+        "span_to_token": "list",
+    },
+}
+
+
+@dataclass
+class BaseStructureMarker:
+    sent_start = ""
+    sent_end = ""
+    record_start = ""
+    record_end = ""
+    span_start = ""
+    span_end = ""
+    text_start = ""
+    source_span_start = ""
+    source_span_end = ""
+    target_span_start = ""
+    null_span = ""
+    null_label = ""
diff --git a/examples/information_extraction/DuUIE/uie/evaluation/scorer.py b/examples/information_extraction/DuUIE/uie/evaluation/scorer.py
new file mode 100644
index 0000000000000000000000000000000000000000..1fa366e4e10f6971237c3c551d6fba9cc3f57a75
--- /dev/null
+++ b/examples/information_extraction/DuUIE/uie/evaluation/scorer.py
@@ -0,0 +1,569 @@
+#!/usr/bin/env python3
+# -*- coding:utf-8 -*-
+
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
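+# 评测工具:Metric 按 (gold_num, pred_num, tp) 计数并由 compute_f1 计算 P/R/F1;
+# EntityScorer / RelationScorer / EventScorer 分别在 string 与 offset 两个层面
+# 对实体、关系、事件的抽取结果进行对齐与打分。最小使用示意:
+#     m = Metric(match_mode="normal")
+#     m.count_instance(gold_list=[("人物", "谷爱凌")], pred_list=[("人物", "谷爱凌")])
+#     m.compute_f1(prefix="string-ent-")  # 返回包含 "string-ent-F1": 100.0 等键值的字典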
+ +import sys +from collections import defaultdict +from copy import deepcopy +from typing import Dict, List + + +def tuple_offset(offset): + if isinstance(offset, tuple): + return offset + else: + return tuple(offset) + + +def warning_tp_increment(gold, pred, prefix): + sys.stderr.write(f"{prefix} TP Increment Warning, Gold Offset: {gold['offset']}\n") + sys.stderr.write(f"{prefix} TP Increment Warning, Pred Offset: {pred['offset']}\n") + sys.stderr.write(f"{prefix} TP Increment Warning, Gold String: {gold['string']}\n") + sys.stderr.write(f"{prefix} TP Increment Warning, Pred String: {pred['string']}\n") + sys.stderr.write("===============\n") + + +class Metric: + """Tuple Metric""" + + def __init__(self, verbose=False, match_mode="normal"): + self.tp = 0.0 + self.gold_num = 0.0 + self.pred_num = 0.0 + self.verbose = verbose + self.match_mode = match_mode + assert self.match_mode in {"set", "normal", "multimatch"} + + def __repr__(self) -> str: + return f"tp: {self.tp}, gold: {self.gold_num}, pred: {self.pred_num}" + + @staticmethod + def safe_div(a, b): + if b == 0.0: + return 0.0 + else: + return a / b + + def compute_f1(self, prefix=""): + tp = self.tp + pred_num = self.pred_num + gold_num = self.gold_num + p, r = self.safe_div(tp, pred_num), self.safe_div(tp, gold_num) + return { + prefix + "tp": tp, + prefix + "gold": gold_num, + prefix + "pred": pred_num, + prefix + "P": p * 100, + prefix + "R": r * 100, + prefix + "F1": self.safe_div(2 * p * r, p + r) * 100, + } + + def count_instance(self, gold_list, pred_list): + if self.match_mode == "set": + gold_list = set(gold_list) + pred_list = set(pred_list) + if self.verbose: + print("Gold:", gold_list) + print("Pred:", pred_list) + self.gold_num += len(gold_list) + self.pred_num += len(pred_list) + self.tp += len(gold_list & pred_list) + + else: + if self.verbose: + print("Gold:", gold_list) + print("Pred:", pred_list) + self.gold_num += len(gold_list) + self.pred_num += len(pred_list) + + if len(gold_list) > 0 and len(pred_list) > 0: + # guarantee length same + assert len(gold_list[0]) == len(pred_list[0]) + + dup_gold_list = deepcopy(gold_list) + for pred in pred_list: + if pred in dup_gold_list: + self.tp += 1 + if self.match_mode == "normal": + # Each Gold Instance can be matched one time + dup_gold_list.remove(pred) + + def count_batch_instance(self, batch_gold_list, batch_pred_list): + for gold_list, pred_list in zip(batch_gold_list, batch_pred_list): + self.count_instance(gold_list=gold_list, pred_list=pred_list) + + +class Scorer: + @staticmethod + def load_gold_list(gold_list, offset_key=None): + raise NotImplementedError + + @staticmethod + def load_pred_list(pred_list): + raise NotImplementedError + + @staticmethod + def eval_instance_list(gold_instance_list, pred_instance_list, verbose=False, match_mode="normal"): + raise NotImplementedError + + +class EntityScorer(Scorer): + @staticmethod + def load_gold_list(gold_list: List[List[Dict]]): + """Load gold instance to `string` and `offset` + + Args: + gold_list (List[List[Dict]]): [description] + [ + [ + {'type': 'Geo-political', 'offset': [7], 'text': 'seattle'}, + {'type': 'Location', 'offset': [11], 'text': 'lot'}, + {'type': 'Geo-political', 'offset': [14], 'text': 'city'} + ], + [...] + ] + + Returns: + List[Dict]: each instance has `offset` and `string` + [ + { + 'offset': [('Geo-political', (7,)), ('Location', (11,)), ('Geo-political', (14,))], + 'string': [('Geo-political', 'seattle'), ('Location', 'lot'), ('Geo-political', 'city')] + }, + {...}, ... 
+ ] + """ + gold_instance_list = [] + for gold in gold_list: + gold_offset = list() + gold_string = list() + for span in gold: + span_label = span["type"] + span_offset = span["offset"] + span_text = span["text"] + gold_offset += [(span_label, tuple_offset(span_offset))] + gold_string += [(span_label, span_text)] + gold_instance = { + "offset": gold_offset, + "string": gold_string, + } + gold_instance_list += [gold_instance] + return gold_instance_list + + @staticmethod + def load_pred_list(pred_list: List[Dict]): + """[summary] + + Args: + pred_list (List[Dict]): [description] + [ + { + 'offset': [['Geo-political', [7]], ['Geo-political', [14]]], + 'string': [['Geo-political', 'seattle'], ['Geo-political', 'city']] + }, + {...}, + ] + Returns: + List[Dict] : each relation instance has `offset` and `string` + [ + { + 'offset': [('Geo-political', (7,)), ('Geo-political', (14,))], + 'string': [('Geo-political', 'seattle'), ('Geo-political', 'city')] + } + ] + """ + pred_instance_list = list() + for pred in pred_list: + for offset_pred in pred["offset"]: + if not isinstance(offset_pred[1], tuple): + offset_pred[1] = tuple_offset(offset_pred[1]) + pred["offset"] = [tuple_offset(p) for p in pred["offset"]] + pred["string"] = [tuple_offset(p) for p in pred["string"]] + pred_instance_list += [pred] + return pred_instance_list + + @staticmethod + def eval_instance_list( + gold_instance_list: List[Dict], pred_instance_list: List[Dict], verbose=False, match_mode="normal" + ): + """[summary] + + Args: + gold_instance_list (List[Dict]): [description] + [ + { + 'offset': [('Geo-political', (7,)), ('Location', (11,)), ('Geo-political', (14,))], + 'string': [('Geo-political', 'seattle'), ('Location', 'lot'), ('Geo-political', 'city')] + }, + {...}, ... + ] + pred_instance_list (List[Dict]): [description] + [ + { + 'offset': [('Geo-political', (7,)), ('Geo-political', (14,))], + 'string': [('Geo-political', 'seattle'), ('Geo-political', 'city')] + } + ] + verbose (bool, optional): [description]. Defaults to False. + match_mode (string, optional): [description]. Defaults to `normal` . + + Returns: + Dict: Result of Evaluation + (offset, string) X (gold, pred, tp, P, R, F1) + """ + metrics = { + "string": Metric(verbose=verbose, match_mode=match_mode), + "offset": Metric(verbose=verbose, match_mode=match_mode), + } + for pred, gold in zip(pred_instance_list, gold_instance_list): + + pre_string_tp, pre_offset_tp = metrics["string"].tp, metrics["offset"].tp + + for eval_key in metrics: + metrics[eval_key].count_instance(gold_list=gold.get(eval_key, []), pred_list=pred.get(eval_key, [])) + + post_string_tp, post_offset_tp = metrics["string"].tp, metrics["offset"].tp + if verbose and post_offset_tp - pre_offset_tp != post_string_tp - pre_string_tp: + warning_tp_increment(gold=gold, pred=pred, prefix="Entity") + + results = dict() + for eval_key in metrics: + results.update(metrics[eval_key].compute_f1(prefix=eval_key + "-ent-")) + + return results + + +class RelationScorer(Scorer): + @staticmethod + def load_gold_list(gold_list: List[List[Dict]]): + """[summary] + + Args: + gold_list (List[List[Dict]]): List of Sentece, each sentence contains a List of Relation Dict + [ + [ + { + 'type': 'Part-whole', + 'args': [{'type': 'Location', 'offset': [11], 'text': 'lot'}, {'type': 'Geo-political', 'offset': [14], 'text': 'city'}] + }, ... 
+ ], + [...], + ] + + Returns: + List[Dict]: List of Sentece, each sentence contains two List (offset, string) of Relation Tuple + [ + { + 'offset': [('Part-whole', 'Geo-political', (0,), 'Geo-political', (2,)), ... ], + 'string': [('Part-whole', 'Geo-political', 'MULTAN', 'Geo-political', 'Pakistan'), ...] + } + ] + """ + gold_instance_list = [] + for gold in gold_list: + gold_instance = defaultdict(list) + for record in gold: + assert len(record["args"]) == 2 + gold_instance["offset"] += [ + ( + record["type"], + record["args"][0]["type"], + tuple_offset(record["args"][0]["offset"]), + record["args"][1]["type"], + tuple_offset(record["args"][1]["offset"]), + ) + ] + gold_instance["string"] += [ + ( + record["type"], + record["args"][0]["type"], + record["args"][0]["text"], + record["args"][1]["type"], + record["args"][1]["text"], + ) + ] + gold_instance_list += [gold_instance] + + return gold_instance_list + + @staticmethod + def load_pred_list(pred_list): + """[summary] + + Args: + pred_list (List[Dict]): List of Sentece, each sentence contains two List (offset, string) of Relation List + [ + { + 'offset': [['Part-whole', 'Geo-political', [0], 'Geo-political', [2]]], + 'string': [['Part-whole', 'Geo-political', 'MULTAN', 'Geo-political', 'Pakistan']], + }, ... + ] + Returns: + List[Dict]: List of Sentece, each sentence contains two List (offset, string) of Relation Tuple + [ + { + 'offset': [('Part-whole', 'Geo-political', (0,), 'Geo-political', (2,))], + 'string': [('Part-whole', 'Geo-political', 'MULTAN', 'Geo-political', 'Pakistan')] + }, ... + ] + """ + pred_instance_list = list() + for pred in pred_list: + for offset_pred in pred["offset"]: + + if not isinstance(offset_pred[2], tuple): + offset_pred[2] = tuple_offset(offset_pred[2]) + + if not isinstance(offset_pred[4], tuple): + offset_pred[4] = tuple_offset(offset_pred[4]) + + pred["offset"] = [tuple_offset(p) for p in pred["offset"]] + pred["string"] = [tuple_offset(p) for p in pred["string"]] + pred_instance_list += [pred] + return pred_instance_list + + @staticmethod + def eval_instance_list(gold_instance_list, pred_instance_list, verbose=False, match_mode="normal"): + """[summary] + + Args: + gold_instance_list (List[Dict]): List of Sentece, each sentence contains two List (offset, string) of Relation Tuple + [ + { + 'offset': [('Part-whole', 'Geo-political', (0,), 'Geo-political', (2,)), ... ], + 'string': [('Part-whole', 'Geo-political', 'MULTAN', 'Geo-political', 'Pakistan'), ...] + } + ] + pred_instance_list ([type]): List of Sentece, each sentence contains two List (offset, string) of Relation Tuple + [ + { + 'offset': [('Part-whole', 'Geo-political', (0,), 'Geo-political', (2,))], + 'string': [('Part-whole', 'Geo-political', 'MULTAN', 'Geo-political', 'Pakistan')] + }, ... + ] + verbose (bool, optional): Defaults to False. + match_mode (string, optional): [description]. Defaults to `normal` . 
+ + Returns: + Dict: Result of Evaluation + (offset, string) X (boundary, strict) X (gold, pred, tp, P, R, F1) + """ + # Span Boundary and Type + metrics = { + "offset": Metric(verbose=verbose, match_mode=match_mode), + "string": Metric(verbose=verbose, match_mode=match_mode), + } + # Span Boundary Only + boundary_metrics = { + "offset": Metric(verbose=verbose, match_mode=match_mode), + "string": Metric(verbose=verbose, match_mode=match_mode), + } + for pred, gold in zip(pred_instance_list, gold_instance_list): + + pre_string_tp, pre_offset_tp = metrics["string"].tp, metrics["offset"].tp + + for eval_key in metrics: + # Span Boundary and Type + metrics[eval_key].count_instance( + gold_list=gold.get(eval_key, []), + pred_list=pred.get(eval_key, []), + ) + + post_string_tp, post_offset_tp = metrics["string"].tp, metrics["offset"].tp + if verbose and (post_offset_tp - pre_offset_tp != post_string_tp - pre_string_tp): + warning_tp_increment(gold=gold, pred=pred, prefix="Relation Strict") + + pre_string_tp, pre_offset_tp = boundary_metrics["string"].tp, boundary_metrics["offset"].tp + + for eval_key in boundary_metrics: + # Span Boundary Only + boundary_metrics[eval_key].count_instance( + gold_list=[(x[0], x[2], x[4]) for x in gold.get(eval_key, [])], + pred_list=[(x[0], x[2], x[4]) for x in pred.get(eval_key, [])], + ) + post_string_tp, post_offset_tp = boundary_metrics["string"].tp, boundary_metrics["offset"].tp + if verbose and post_offset_tp - pre_offset_tp != post_string_tp - pre_string_tp: + warning_tp_increment(gold=gold, pred=pred, prefix="Relation Boundary") + + results = dict() + for eval_key in metrics: + results.update(metrics[eval_key].compute_f1(prefix=eval_key + "-rel-strict-")) + for eval_key in boundary_metrics: + results.update(boundary_metrics[eval_key].compute_f1(prefix=eval_key + "-rel-boundary-")) + return results + + +class EventScorer(Scorer): + @staticmethod + def load_gold_list(gold_list): + """[summary] + + Args: + gold_list (List[List[Dict]]): List of Sentece, each sentence contains a List of Event Dict + [ + [ # Sentance + { # Event Record + 'type': 'Die', + 'offset': [16], + 'text': 'shot', + 'args': [ + {'type': 'Victim', 'offset': [17], 'text': 'himself'}, + {'type': 'Agent', 'offset': [5, 6], 'text': 'John Joseph'}, + {'type': 'Place', 'offset': [23], 'text': 'court'} + ] + }, + ] + ] + + Returns: + List[Dict]: List of Sentece, each sentence contains Four List of Event Tuple + [ + { + 'offset_trigger': [('Die', (16,)), ('Convict', (30,))], + 'string_trigger': [('Die', 'shot'), ('Convict', 'convicted')], + 'offset_role': [('Die', 'Victim', (17,)), ('Die', 'Agent', (5, 6)), ('Die', 'Place', (23,))], + 'string_role': [('Die', 'Victim', 'himself'), ('Die', 'Agent', 'John Joseph'), ('Die', 'Place', 'court')] + }, + ... 
+ ] + """ + gold_instance_list = [] + for gold in gold_list: + gold_instance = defaultdict(list) + for record in gold: + gold_instance["offset_trigger"] += [(record["type"], tuple_offset(record["offset"]))] + gold_instance["string_trigger"] += [(record["type"], record["text"])] + for arg in record["args"]: + gold_instance["offset_role"] += [(record["type"], arg["type"], tuple_offset(arg["offset"]))] + gold_instance["string_role"] += [(record["type"], arg["type"], arg["text"])] + gold_instance_list += [gold_instance] + return gold_instance_list + + @staticmethod + def load_pred_list(pred_list): + """[summary] + + Args: + pred_list (List[Dict]): List of Sentece, each sentence contains two List (offset, string) of Event List + [ + { + 'offset': [{'type': 'Attack', 'roles': [['Attacker', [5, 6]], ['Place', [23]], ['Target', [17]]], 'trigger': [16]}], + 'string': [{'roles': [['Attacker', 'John Joseph'], ['Place', 'court'], ['Target', 'himself']], 'type': 'Attack', 'trigger': 'shot'}], + }, + ... + ] + Returns: + List[Dict]: List of Sentece, each sentence contains four List (offset, string) X (trigger, role) of Event List + [ + { + 'offset_trigger': [('Attack', (16,))], + 'offset_role': [('Attack', 'Attacker', (5, 6)), ('Attack', 'Place', (23,)), ('Attack', 'Target', (17,))], + 'string_trigger': [('Attack', 'shot')], + 'string_role': [('Attack', 'Attacker', 'John Joseph'), ('Attack', 'Place', 'court'), ('Attack', 'Target', 'himself')], + }, + ... + ] + """ + pred_instance_list = list() + for pred in pred_list: + pred_instance = defaultdict(list) + + for offset_pred in pred["offset"]: + event_type, trigger_offset = offset_pred["type"], tuple_offset(offset_pred["trigger"]) + pred_instance["offset_trigger"] += [(event_type, trigger_offset)] + for role_type, role_offset in offset_pred["roles"]: + pred_instance["offset_role"] += [(event_type, role_type, tuple_offset(role_offset))] + + for string_pred in pred["string"]: + event_type, trigger_string = string_pred["type"], string_pred["trigger"] + pred_instance["string_trigger"] += [(event_type, trigger_string)] + for role_type, role_string in string_pred["roles"]: + pred_instance["string_role"] += [(event_type, role_type, role_string)] + pred_instance_list += [pred_instance] + return pred_instance_list + + @staticmethod + def eval_instance_list(gold_instance_list, pred_instance_list, verbose=False, match_mode="normal"): + """[summary] + + Args: + gold_instance_list (List[Dict]): List of Sentece, each sentence contains Four List of Event Tuple + [ + { + 'offset_trigger': [('Die', (16,)), ('Convict', (30,))], + 'string_trigger': [('Die', 'shot'), ('Convict', 'convicted')], + 'offset_role': [('Die', 'Victim', (17,)), ('Die', 'Agent', (5, 6)), ('Die', 'Place', (23,))], + 'string_role': [('Die', 'Victim', 'himself'), ('Die', 'Agent', 'John Joseph'), ('Die', 'Place', 'court')] + }, + ... + ] + pred_instance_list (List[Dict]): List of Sentece, each sentence contains four List (offset, string) X (trigger, role) of Event List + [ + { + 'offset_trigger': [('Attack', (16,))], + 'offset_role': [('Attack', 'Attacker', (5, 6)), ('Attack', 'Place', (23,)), ('Attack', 'Target', (17,))], + 'string_trigger': [('Attack', 'shot')], + 'string_role': [('Attack', 'Attacker', 'John Joseph'), ('Attack', 'Place', 'court'), ('Attack', 'Target', 'himself')], + }, + ... + ] + verbose (bool, optional): [description]. Defaults to False. + match_mode (string, optional): [description]. Defaults to `normal`. 
+ + Returns: + Dict: Result of Evaluation + (offset, string) X (trigger, role) X (gold, pred, tp, P, R, F1) + """ + trigger_metrics = { + "offset": Metric(verbose=verbose, match_mode=match_mode), + "string": Metric(verbose=verbose, match_mode=match_mode), + } + role_metrics = { + "offset": Metric(verbose=verbose, match_mode=match_mode), + "string": Metric(verbose=verbose, match_mode=match_mode), + } + + for pred, gold in zip(pred_instance_list, gold_instance_list): + + pre_string_tp, pre_offset_tp = trigger_metrics["string"].tp, trigger_metrics["offset"].tp + + for eval_key in trigger_metrics: + trigger_metrics[eval_key].count_instance( + gold_list=gold.get(eval_key + "_trigger", []), pred_list=pred.get(eval_key + "_trigger", []) + ) + + post_string_tp, post_offset_tp = trigger_metrics["string"].tp, trigger_metrics["offset"].tp + if verbose and post_offset_tp - pre_offset_tp != post_string_tp - pre_string_tp: + warning_tp_increment(gold=gold, pred=pred, prefix="Trigger") + + pre_string_tp, pre_offset_tp = role_metrics["string"].tp, role_metrics["offset"].tp + + for eval_key in role_metrics: + role_metrics[eval_key].count_instance( + gold_list=gold.get(eval_key + "_role", []), pred_list=pred.get(eval_key + "_role", []) + ) + + post_string_tp, post_offset_tp = role_metrics["string"].tp, role_metrics["offset"].tp + if verbose and post_offset_tp - pre_offset_tp != post_string_tp - pre_string_tp: + warning_tp_increment(gold=gold, pred=pred, prefix="Role") + + results = dict() + for eval_key in trigger_metrics: + results.update(trigger_metrics[eval_key].compute_f1(prefix=f"{eval_key}-evt-trigger-")) + for eval_key in role_metrics: + results.update(role_metrics[eval_key].compute_f1(prefix=f"{eval_key}-evt-role-")) + + return results diff --git a/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py b/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py new file mode 100644 index 0000000000000000000000000000000000000000..77885638d5080f883c72332088ce26381ac46a5f --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py @@ -0,0 +1,1069 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from typing import Tuple, List, Dict +from collections import defaultdict, OrderedDict, Counter +import os +import numpy +import logging +import re +import json +from nltk.tree import ParentedTree +from uie.evaluation.constants import span_start, type_start, type_end, null_span, offset_map_strategy +from uie.evaluation.scorer import EntityScorer, RelationScorer, EventScorer + +logger = logging.getLogger("__main__") + +left_bracket = "【" +right_bracket = "】" +brackets = left_bracket + right_bracket + +split_bracket = re.compile(r"") + + +def proprocessing_graph_record(graph, schema_dict): + """Mapping generated spot-asoc result to Entity/Relation/Event""" + records = { + "entity": list(), + "relation": list(), + "event": list(), + } + + entity_dict = OrderedDict() + + # 根据不同任务的 Schema 将不同的 Spot 对应到不同抽取结果: Entity/Event + # Mapping generated spot result to Entity/Event + for record in graph["pred_record"]: + + if record["type"] in schema_dict["entity"].type_list: + records["entity"] += [{"text": record["spot"], "type": record["type"]}] + entity_dict[record["spot"]] = record["type"] + + elif record["type"] in schema_dict["event"].type_list: + records["event"] += [{"trigger": record["spot"], "type": record["type"], "roles": record["asocs"]}] + + else: + print("Type `%s` invalid." % record["type"]) + + # 根据不同任务的 Schema 将不同的 Asoc 对应到不同抽取结果: Relation/Argument + # Mapping generated asoc result to Relation/Argument + for record in graph["pred_record"]: + if record["type"] in schema_dict["entity"].type_list: + for role in record["asocs"]: + records["relation"] += [ + { + "type": role[0], + "roles": [ + (record["type"], record["spot"]), + (entity_dict.get(role[1], record["type"]), role[1]), + ], + } + ] + + if len(entity_dict) > 0: + for record in records["event"]: + if record["type"] in schema_dict["event"].type_list: + new_role_list = list() + for role in record["roles"]: + if role[1] in entity_dict: + new_role_list += [role] + record["roles"] = new_role_list + + return records + + +def match_sublist(the_list, to_match): + """Find sublist in the whole list + + Args: + the_list (list(str)): the whole list + - [1, 2, 3, 4, 5, 6, 1, 2, 4, 5] + to_match (list(str)): the sublist + - [1, 2] + + Returns: + list(tuple): matched (start, end) position list + - [(0, 1), (6, 7)] + """ + len_to_match = len(to_match) + matched_list = list() + for index in range(len(the_list) - len_to_match + 1): + if to_match == the_list[index : index + len_to_match]: + matched_list += [(index, index + len_to_match - 1)] + return matched_list + + +def check_overlap(x, y): + """Check two span whether overlap or not + + Args: + x (Tuple[int, int]): start, end including position of span x + y (Tuple[int, int]): start, end including position of span y + + x: (3, 4), y: (4, 5) -> True + x: (3, 3), y: (4, 5) -> False + + Returns: + bool: two span whether overlap or not + """ + # x start > y end or y start > x end, no overlap + if x[0] > y[1] or y[0] > x[1]: + return False + else: + return True + + +def get_index_tuple(matched: Tuple[int, int]): + """Convert start, end (inlcuding) tuple to index list + + Args: + matched (Tuple[int, int]): start and end position tuple + + (3, 4) -> [3, 4] + (3, 3) -> [3] + + Returns: + List[int]: List of index + """ + return tuple(range(matched[0], matched[1] + 1)) + + +def span_to_token(text, span_to_token_strategy="space"): + """Convert text span string to token list + + Args: + text (string): text span string + span_to_token_strategy (str, optional): Defaults to 'space'. 
+ - space: split text to tokens using space + - list: split text to toekns as list + + Raises: + NotImplementedError: No implemented span_to_token_strategy + + Returns: + list(str): list of token + """ + if span_to_token_strategy == "space": + return text.split(" ") + elif span_to_token_strategy == "list": + return list(text) + else: + raise NotImplementedError(f"The span to token strategy {span_to_token_strategy} is not implemented.") + + +class MapConfig: + """Config of mapping string to offset""" + + def __init__(self, map_strategy: str = "first", de_duplicate: bool = True, span_to_token: str = "space") -> None: + self.map_strategy = map_strategy + self.de_duplicate = de_duplicate + self.span_to_token = span_to_token + + def __repr__(self) -> str: + repr_list = [ + f"map_strategy: {self.map_strategy}", + f"de_duplicate: {self.de_duplicate}", + f"span_to_token: {self.span_to_token}", + ] + return ", ".join(repr_list) + + @staticmethod + def load_by_name(config_name): + offset_map = offset_map_strategy[config_name] + return MapConfig( + map_strategy=offset_map["map_strategy"], + de_duplicate=offset_map["de_duplicate"], + span_to_token=offset_map["span_to_token"], + ) + + +class RecordSchema: + """Record Schema Class + type_list: list of spot name + role_list: list of asoc name + type_role_dict: the mapping of spot-to-asoc + """ + + def __init__(self, type_list, role_list, type_role_dict): + self.type_list = type_list + self.role_list = role_list + self.type_role_dict = type_role_dict + + def __repr__(self) -> str: + repr_list = [f"Type: {self.type_list}\n", f"Role: {self.role_list}\n", f"Map: {self.type_role_dict}"] + return "\n".join(repr_list) + + @staticmethod + def get_empty_schema(): + return RecordSchema(type_list=list(), role_list=list(), type_role_dict=dict()) + + @staticmethod + def read_from_file(filename): + lines = open(filename, encoding="utf8").readlines() + type_list = json.loads(lines[0]) + role_list = json.loads(lines[1]) + type_role_dict = json.loads(lines[2]) + return RecordSchema(type_list, role_list, type_role_dict) + + def write_to_file(self, filename): + with open(filename, "w", encoding="utf8") as output: + output.write(json.dumps(self.type_list, ensure_ascii=False) + "\n") + output.write(json.dumps(self.role_list, ensure_ascii=False) + "\n") + output.write(json.dumps(self.type_role_dict, ensure_ascii=False) + "\n") + + +def merge_schema(schema_list: List[RecordSchema]): + """Merge list of schema + + Args: + schema_list (List[RecordSchema]): list of record schema + + Returns: + RecordSchema: A merged schema + """ + type_set = set() + role_set = set() + type_role_dict = defaultdict(list) + + for schema in schema_list: + + for type_name in schema.type_list: + type_set.add(type_name) + + for role_name in schema.role_list: + role_set.add(role_name) + + for type_name in schema.type_role_dict: + type_role_dict[type_name] += schema.type_role_dict[type_name] + + for type_name in type_role_dict: + type_role_dict[type_name] = list(set(type_role_dict[type_name])) + + return RecordSchema( + type_list=list(type_set), + role_list=list(role_set), + type_role_dict=type_role_dict, + ) + + +class Record: + """Record for converting generated string to information record""" + + def __init__(self, map_config) -> None: + self._map_config = map_config + + def span_to_token(self, text): + return span_to_token(text, span_to_token_strategy=self._map_config.span_to_token) + + +class EntityRecord(Record): + """Record for converting generated string to information record """ + + @staticmethod 
+ def to_string(pred_record_list): + entity_list = list() + for pred_record in pred_record_list: + record_type, record_text = pred_record["type"], pred_record["text"] + if record_text == "": + logger.warning(f"Empty Extraction {pred_record}") + continue + entity_list += [(record_type, record_text)] + return entity_list + + def to_offset(self, instance, tokens): + map_strategy_dict = { + "first": self.record_to_offset_first_role, + "closest": self.record_to_offset_closest_role, + "longer_first": self.record_to_offset_longer_first, + } + + if self._map_config.map_strategy in map_strategy_dict: + map_function = map_strategy_dict[self._map_config.map_strategy] + return map_function( + instance=instance, + token_list=tokens, + ) + else: + raise NotImplementedError( + f"The map strategy {self._map_config.map_strategy} in {self.__class__} is not implemented." + ) + + def record_to_offset_closest_role( + self, + instance, + token_list, + ): + """ + Find Role's offset using closest matched with trigger word. + :param instance: + :return: + """ + return self.record_to_offset_first_role(instance, token_list=token_list) + + def record_to_offset_first_role(self, instance, token_list): + """ + Find Entity's offset using first matched in the sentence. + :param instance: + :return: + """ + entity_list = list() + + entity_matched_set = set() + for pred_record in instance: + record_type, record_text = pred_record["type"], pred_record["text"] + if record_text == "": + logger.warning(f"Empty Extraction {pred_record}") + continue + matched_list = match_sublist(token_list, self.span_to_token(record_text)) + for matched in matched_list: + if (record_type, matched) not in entity_matched_set: + entity_list += [(record_type, tuple(range(matched[0], matched[1] + 1)))] + entity_matched_set.add((record_type, matched)) + break + + return entity_list + + def record_to_offset_longer_first(self, instance, token_list): + """ + Find Entity's offset using first matched in the sentence. + :param instance: + :return: + """ + entity_list = list() + + entity_matched_set = set() + for x in instance: + x["length"] = len(x["text"]) + instance.sort(reverse=True, key=lambda x: x["length"]) + + for pred_record in instance: + record_type, record_text = pred_record["type"], pred_record["text"] + if record_text == "": + logger.warning(f"Empty Extraction {pred_record}") + continue + + matched_list = match_sublist(token_list, self.span_to_token(record_text)) + for matched in matched_list: + flag = False + for _, g in entity_matched_set: + if check_overlap(g, matched): + flag = True + if flag: + continue + + if (record_type, matched) not in entity_matched_set: + entity_list += [(record_type, tuple(range(matched[0], matched[1] + 1)))] + entity_matched_set.add((record_type, matched)) + break + + return entity_list + + +class RelationRecord(Record): + """Record for converting generated string to information record + + """ + + def to_offset(self, instance, tokens): + map_strategy_dict = { + "first": self.record_to_offset_first_role, + "closest": self.record_to_offset_closest_role, + "longer_first": self.record_to_offset_closest_role, + } + if self._map_config.map_strategy in map_strategy_dict: + map_function = map_strategy_dict[self._map_config.map_strategy] + return map_function( + instance=instance, + token_list=tokens, + ) + else: + raise NotImplementedError( + f"The map strategy {self._map_config.map_strategy} in {self.__class__} is not implemented." 
+ ) + + @staticmethod + def to_string(instance): + relation_list = list() + for record in instance: + relation_type = record["type"] + relation = [relation_type] + if len(record["roles"]) < 2: + continue + for role_type, text_str in record["roles"][:2]: + relation += [role_type, text_str] + relation_list += [tuple(relation)] + return relation_list + + def record_to_offset_first_role(self, instance, token_list): + """ + Find Role's offset using first matched in the sentence. + :param instance: + :return: + """ + relation_list = list() + + for record in instance: + relation_type = record["type"] + + if len(record["roles"]) < 2: + continue + + relation = [relation_type] + for role_type, text_str in record["roles"][:2]: + matched_list = match_sublist(token_list, self.span_to_token(text_str)) + if len(matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (text_str, token_list)) + break + relation += [role_type, get_index_tuple(matched_list[0])] + if len(relation) != 5 or (self._map_config.de_duplicate and tuple(relation) in relation_list): + continue + relation_list += [tuple(relation)] + + return relation_list + + def record_to_offset_closest_role(self, instance, token_list): + """ + Find Role's offset using closest matched with trigger word. + :param instance: + :return: + """ + relation_list = list() + + for record in instance: + relation_type = record["type"] + + if len(record["roles"]) < 2: + continue + + arg1_type, arg1_text = record["roles"][0] + arg2_type, arg2_text = record["roles"][1] + arg1_matched_list = match_sublist(token_list, self.span_to_token(arg1_text)) + arg2_matched_list = match_sublist(token_list, self.span_to_token(arg2_text)) + + if len(arg1_matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (arg1_text, token_list)) + break + if len(arg2_matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (arg2_text, token_list)) + break + + distance_tuple = list() + for arg1_match in arg1_matched_list: + for arg2_match in arg2_matched_list: + distance = abs(arg1_match[0] - arg2_match[0]) + distance_tuple += [(distance, arg1_match, arg2_match)] + distance_tuple.sort() + + relation = [ + relation_type, + arg1_type, + get_index_tuple(distance_tuple[0][1]), + arg2_type, + get_index_tuple(distance_tuple[0][2]), + ] + if self._map_config.de_duplicate and tuple(relation) in relation_list: + continue + relation_list += [tuple(relation)] + + return relation_list + + +class EventRecord(Record): + """Record for converting generated string to information record in predicate-arguments + { + type: pred_type, + trigger: predicate_span, + args: [(arg_type, arg_span), ...] + } + """ + + def to_offset(self, instance, tokens): + map_strategy_dict = { + "first": self.record_to_offset_first_role, + "closest": self.record_to_offset_closest_role, + "longer_first": self.record_to_offset_closest_role, + } + if self._map_config.map_strategy in map_strategy_dict: + map_function = map_strategy_dict[self._map_config.map_strategy] + return map_function( + instance=instance, + token_list=tokens, + ) + else: + raise NotImplementedError( + f"The map strategy {self._map_config.map_strategy} in {self.__class__} is not implemented." + ) + + @staticmethod + def to_string(instance): + """ + {'type': 'Justice:Appeal', + 'trigger': 'appeal', + 'roles': [ + ('Adjudicator', 'court'), + ('Plaintiff', 'Anwar') + ], } + """ + return instance + + def record_to_offset_first_role(self, instance, token_list): + """ + Find Role's offset using first matched in the sentence. 
+ """ + record_list = list() + + trigger_matched_set = set() + for record in instance: + event_type = record["type"] + trigger = record["trigger"] + matched_list = match_sublist(token_list, self.span_to_token(trigger)) + + if len(matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (trigger, token_list)) + continue + + trigger_offset = None + for matched in matched_list: + if matched not in trigger_matched_set: + trigger_offset = get_index_tuple(matched) + trigger_matched_set.add(matched) + break + + # No trigger word, skip the record + if trigger_offset is None: + break + + pred_record = {"type": event_type, "roles": [], "trigger": trigger_offset} + + for role_type, text_str in record["roles"]: + matched_list = match_sublist(token_list, self.span_to_token(text_str)) + if len(matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (text_str, token_list)) + continue + pred_record["roles"] += [(role_type, get_index_tuple(matched_list[0]))] + + record_list += [pred_record] + + return record_list + + def record_to_offset_closest_role(self, instance, token_list): + """ + Find Role's offset using closest matched with trigger word. + """ + record_list = list() + + trigger_matched_set = set() + for record in instance: + event_type = record["type"] + trigger = record["trigger"] + matched_list = match_sublist(token_list, self.span_to_token(trigger)) + + if len(matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (trigger, token_list)) + continue + + trigger_offset = None + for matched in matched_list: + if matched not in trigger_matched_set: + trigger_offset = get_index_tuple(matched) + trigger_matched_set.add(matched) + break + + # No trigger word, skip the record + if trigger_offset is None or len(trigger_offset) == 0: + break + + pred_record = {"type": event_type, "roles": [], "trigger": trigger_offset} + + for role_type, text_str in record["roles"]: + matched_list = match_sublist(token_list, self.span_to_token(text_str)) + if len(matched_list) == 0: + logger.warning("[Cannot reconstruct]: %s %s\n" % (text_str, token_list)) + else: + abs_distances = [abs(match[0] - trigger_offset[0]) for match in matched_list] + closest_index = numpy.argmin(abs_distances) + pred_record["roles"] += [(role_type, get_index_tuple(matched_list[closest_index]))] + + record_list += [pred_record] + return record_list + + +class SEL2Record: + """Converting sel expression to information records""" + + def __init__(self, schema_dict, map_config: MapConfig, tokenizer=None) -> None: + self._schema_dict = schema_dict + self._predict_parser = SpotAsocPredictParser( + record_schema=schema_dict["record"], + tokenizer=tokenizer, + ) + self._map_config = map_config + self._tokenizer = tokenizer + + def __repr__(self) -> str: + return f"## {self._map_config}" + + def sel2record(self, pred, text, tokens): + """Converting sel expression to information records + + Args: + pred (str): sel expression + text (str): input text + tokens (list(str)): token list + + Returns: + _type_: _description_ + """ + # Parsing generated SEL to String-level Record + # 将生成的结构表达式解析成 String 级别的 Record + well_formed_list, counter = self._predict_parser.decode( + gold_list=[], + pred_list=[pred], + text_list=[text], + ) + + # Convert String-level Record to Entity/Relation/Event + # 将抽取的 Spot-Asoc Record 结构 + # 根据不同的 Schema 转换成 Entity/Relation/Event 结果 + pred_records = proprocessing_graph_record(well_formed_list[0], self._schema_dict) + + task_record_map = { + "entity": EntityRecord, + "relation": 
RelationRecord, + "event": EventRecord, + } + + parsed_record = defaultdict(dict) + # Mapping String-level record to Offset-level record + # 将 String 级别的 Record 回标成 Offset 级别的 Record + for task in task_record_map: + record_map = task_record_map[task]( + map_config=self._map_config, + ) + + parsed_record[task]["offset"] = record_map.to_offset( + instance=pred_records.get(task, []), + tokens=tokens, + ) + + parsed_record[task]["string"] = record_map.to_string( + pred_records.get(task, []), + ) + return parsed_record + + @staticmethod + def load_schema_dict(schema_folder): + schema_dict = dict() + for schema_key in ["record", "entity", "relation", "event"]: + schema_filename = os.path.join(schema_folder, f"{schema_key}.schema") + if os.path.exists(schema_filename): + schema_dict[schema_key] = RecordSchema.read_from_file(schema_filename) + else: + logger.warning(f"{schema_filename} is empty, ignore.") + schema_dict[schema_key] = RecordSchema.get_empty_schema() + return schema_dict + + +def fix_unk_from_text(span, text, unk="", tokenizer=None): + """Fixing unknown tokens `unk` in the generated expression + + Args: + span (str): generated span + text (str): raw text + unk (str, optional): symbol of unk token + tokenizer (Tokenizer, optional): the tokenizer + + Returns: + fixed span + """ + if tokenizer is not None: + return fix_unk_from_text_with_tokenizer(span, text, unk=unk, tokenizer=tokenizer) + else: + return fix_unk_from_text_without_tokenizer(span, text, unk=unk) + + +def fix_unk_from_text_without_tokenizer(span, text, unk=""): + """ + Find span from the text to fix unk in the generated span + Example: + span = " colo e Bengo" + text = "Angola International Airport is located at Ícolo e Bengo" + fixed_span = "Ícolo e Bengo" + """ + if unk not in span: + return span + + def clean_wildcard(x): + sp = ".*?()[]+" + return re.sub("(" + "|".join([f"\\{s}" for s in sp]) + ")", "\\\\\g<1>", x) + + match = r"\s*[^,?。\s]+\s*".join([clean_wildcard(item.strip()) for item in span.split(unk)]) + + if len(match) > 100: + # Too long regular expression may lead re problem + return span + + result = re.search(match, text) + + if not result: + return span + return result.group().strip() + + +def fix_unk_from_text_with_tokenizer(span, text, tokenizer, unk=""): + unk_id = tokenizer.vocab.to_indices(unk) + tokenized_span = tokenizer.encode(span, add_special_tokens=False, return_token_type_ids=None)["input_ids"] + tokenized_text = tokenizer.encode(text, add_special_tokens=False, return_token_type_ids=None)["input_ids"] + + matched = match_sublist(tokenized_text, tokenized_span) + if len(matched) == 0: + return span + + if tokenized_span[0] == unk_id and matched[0][0] > 0: + previous_token = [tokenized_text[matched[0][0] - 1]] + pre_strip = tokenizer.vocab.to_tokens(previous_token[0]) + else: + previous_token = [] + pre_strip = "" + + if tokenized_span[-1] == unk_id and matched[0][1] < len(tokenized_text) - 1: + next_token = [tokenized_text[matched[0][1] + 1]] + next_strip = tokenizer.vocab.to_tokens(next_token[0]) + else: + next_token = [] + next_strip = "" + + extend_span = tokenized_span + if len(previous_token) > 0: + extend_span = previous_token + extend_span + if len(next_token) > 0: + extend_span = extend_span + next_token + + extend_span = tokenizer.decode(extend_span) + fixed_span = fix_unk_from_text_without_tokenizer(extend_span, text, unk) + return fixed_span.rstrip(next_strip).lstrip(pre_strip) + + +def add_space(text): + """ + add space between special token + """ + new_text_list = list() + for item 
in zip(split_bracket.findall(text), split_bracket.split(text)[1:]): + new_text_list += item + return " ".join(new_text_list) + + +def convert_bracket(text): + text = add_space(text) + for start in [type_start]: + text = text.replace(start, left_bracket) + for end in [type_end]: + text = text.replace(end, right_bracket) + return text + + +def find_bracket_num(tree_str): + """ + Count Bracket Number (num_left - num_right), 0 indicates num_left = num_right + """ + count = 0 + for char in tree_str: + if char == left_bracket: + count += 1 + elif char == right_bracket: + count -= 1 + else: + pass + return count + + +def check_well_form(tree_str): + return find_bracket_num(tree_str) == 0 + + +def clean_text(tree_str): + count = 0 + sum_count = 0 + + tree_str_list = tree_str.split() + + for index, char in enumerate(tree_str_list): + if char == left_bracket: + count += 1 + sum_count += 1 + elif char == right_bracket: + count -= 1 + sum_count += 1 + else: + pass + if count == 0 and sum_count > 0: + return " ".join(tree_str_list[: index + 1]) + return " ".join(tree_str_list) + + +def resplit_label_span(label, span, split_symbol=span_start): + label_span = label + " " + span + + if split_symbol in label_span: + splited_label_span = label_span.split(split_symbol) + if len(splited_label_span) == 2: + return splited_label_span[0].strip(), splited_label_span[1].strip() + + return label, span + + +def add_bracket(tree_str): + """add right bracket to fix ill-formed expression""" + tree_str_list = tree_str.split() + bracket_num = find_bracket_num(tree_str_list) + tree_str_list += [right_bracket] * bracket_num + return " ".join(tree_str_list) + + +def get_tree_str(tree): + """get str from sel tree""" + str_list = list() + for element in tree: + if isinstance(element, str): + str_list += [element] + return " ".join(str_list) + + +def rewrite_label_span(label, span, label_set=None, text=None, tokenizer=None): + + # Invalid Type + if label_set and label not in label_set: + logger.debug("Invalid Label: %s" % label) + return None, None + + # Fix unk using Text + if text is not None and "" in span: + span = fix_unk_from_text(span, text, unk="", tokenizer=tokenizer) + + # Invalid Text Span + if text is not None and span not in text: + logger.debug("Invalid Text Span: %s\n%s\n" % (span, text)) + return None, None + + return label, span + + +def convert_spot_asoc(spot_asoc_instance, structure_maker): + """Convert spot asoc instance to target string""" + spot_instance_str_rep_list = list() + for spot in spot_asoc_instance: + spot_str_rep = [ + spot["label"], + structure_maker.target_span_start, + spot["span"], + ] + for asoc_label, asoc_span in spot.get("asoc", list()): + asoc_str_rep = [ + structure_maker.span_start, + asoc_label, + structure_maker.target_span_start, + asoc_span, + structure_maker.span_end, + ] + spot_str_rep += [" ".join(asoc_str_rep)] + spot_instance_str_rep_list += [ + " ".join( + [ + structure_maker.record_start, + " ".join(spot_str_rep), + structure_maker.record_end, + ] + ) + ] + target_text = " ".join( + [ + structure_maker.sent_start, + " ".join(spot_instance_str_rep_list), + structure_maker.sent_end, + ] + ) + return target_text + + +class SpotAsocPredictParser: + """Parser for converting generated sel to extraction record""" + + def __init__(self, record_schema: RecordSchema = None, tokenizer=None): + self.spot_set = set(record_schema.type_list) if record_schema else None + self.asoc_set = set(record_schema.role_list) if record_schema else None + self.tokenizer = tokenizer + + def decode( + 
self, + gold_list, + pred_list, + text_list=None, + raw_list=None, + ) -> Tuple[List[Dict], Counter]: + counter = Counter() + well_formed_list = [] + + if gold_list is None or len(gold_list) == 0: + gold_list = ["%s%s" % (type_start, type_end)] * len(pred_list) + + if text_list is None: + text_list = [None] * len(pred_list) + + if raw_list is None: + raw_list = [None] * len(pred_list) + + for gold, pred, text, raw_data in zip(gold_list, pred_list, text_list, raw_list): + gold = convert_bracket(gold) + pred = convert_bracket(pred) + + pred = clean_text(pred) + + try: + gold_tree = ParentedTree.fromstring(gold, brackets=brackets) + except ValueError: + logger.warning(f"Ill gold: {gold}") + logger.warning(f"Fix gold: {add_bracket(gold)}") + gold_tree = ParentedTree.fromstring(add_bracket(gold), brackets=brackets) + counter.update(["gold_tree add_bracket"]) + + instance = {"gold": gold, "pred": pred, "gold_tree": gold_tree, "text": text, "raw_data": raw_data} + + counter.update(["gold_tree" for _ in gold_tree]) + + _, _, instance["gold_record"] = self.get_record_list(sel_tree=instance["gold_tree"], text=instance["text"]) + + try: + if not check_well_form(pred): + pred = add_bracket(pred) + counter.update(["fixed"]) + + pred_tree = ParentedTree.fromstring(pred, brackets=brackets) + counter.update(["pred_tree" for _ in pred_tree]) + + instance["pred_tree"] = pred_tree + counter.update(["well-formed"]) + + except ValueError: + counter.update(["ill-formed"]) + logger.debug(f"ill-formed: {pred}") + instance["pred_tree"] = ParentedTree.fromstring(left_bracket + right_bracket, brackets=brackets) + + _, _, instance["pred_record"] = self.get_record_list(sel_tree=instance["pred_tree"], text=instance["text"]) + + well_formed_list += [instance] + + return well_formed_list, counter + + def get_record_list(self, sel_tree, text=None): + """Convert single sel expression to extraction records + + Args: + sel_tree (Tree): sel tree + text (str, optional): _description_. Defaults to None. 
+ + Returns: + spot_list: list of (spot_type: str, spot_span: str) + asoc_list: list of (spot_type: str, asoc_label: str, asoc_text: str) + record_list: list of {'asocs': list(), 'type': spot_type, 'spot': spot_text} + """ + spot_list = list() + asoc_list = list() + record_list = list() + + for spot_tree in sel_tree: + + # Drop incomplete tree + if isinstance(spot_tree, str) or len(spot_tree) == 0: + continue + + spot_type = spot_tree.label() + spot_text = get_tree_str(spot_tree) + spot_type, spot_text = resplit_label_span(spot_type, spot_text) + spot_type, spot_text = rewrite_label_span( + label=spot_type, + span=spot_text, + label_set=self.spot_set, + text=text, + tokenizer=self.tokenizer, + ) + + # Drop empty generated span + if spot_text is None or spot_text == null_span: + continue + # Drop empty generated type + if spot_type is None: + continue + # Drop invalid spot type + if self.spot_set is not None and spot_type not in self.spot_set: + continue + + record = {"asocs": list(), "type": spot_type, "spot": spot_text} + + for asoc_tree in spot_tree: + if isinstance(asoc_tree, str) or len(asoc_tree) < 1: + continue + + asoc_label = asoc_tree.label() + asoc_text = get_tree_str(asoc_tree) + asoc_label, asoc_text = resplit_label_span(asoc_label, asoc_text) + asoc_label, asoc_text = rewrite_label_span( + label=asoc_label, + span=asoc_text, + label_set=self.asoc_set, + text=text, + tokenizer=self.tokenizer, + ) + + # Drop empty generated span + if asoc_text is None or asoc_text == null_span: + continue + # Drop empty generated type + if asoc_label is None: + continue + # Drop invalid asoc type + if self.asoc_set is not None and asoc_label not in self.asoc_set: + continue + + asoc_list += [(spot_type, asoc_label, asoc_text)] + record["asocs"] += [(asoc_label, asoc_text)] + + spot_list += [(spot_type, spot_text)] + record_list += [record] + + return spot_list, asoc_list, record_list + + +def evaluate_extraction_results(gold_instances, pred_instances, eval_match_mode="normal"): + task_scorer_dict = {"entity": EntityScorer, "relation": RelationScorer, "event": EventScorer} + # Score Record + results = dict() + for task, scorer in task_scorer_dict.items(): + + gold_list = [x[task] for x in gold_instances] + pred_list = [x[task] for x in pred_instances] + + gold_instance_list = scorer.load_gold_list(gold_list) + pred_instance_list = scorer.load_pred_list(pred_list) + sub_results = scorer.eval_instance_list( + gold_instance_list=gold_instance_list, + pred_instance_list=pred_instance_list, + verbose=False, + match_mode=eval_match_mode, + ) + results.update(sub_results) + return results diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/__init__.py b/examples/information_extraction/DuUIE/uie/seq2struct/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..bc13950c0b3bc029002a2fa67ed49e7eff7cd6d1 --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/seq2struct/__init__.py @@ -0,0 +1,19 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" +Code for Sequence-to-Structure +""" diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/data_collator.py b/examples/information_extraction/DuUIE/uie/seq2struct/data_collator.py new file mode 100644 index 0000000000000000000000000000000000000000..9a602aee4c664e319cc4c8bfe066b9d9ec991416 --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/seq2struct/data_collator.py @@ -0,0 +1,510 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import logging +import math +import random +from collections import OrderedDict +from dataclasses import dataclass +from typing import Optional + +import numpy as np +import paddle +from uie.evaluation.constants import ( + BaseStructureMarker, + asoc_prompt, + null_span, + spot_prompt, + text_start, +) +from uie.evaluation.sel2record import RecordSchema, convert_spot_asoc + +from paddlenlp.data import Pad + +logger = logging.getLogger("__main__") + + +@dataclass +class SpotAsocNoiser: + spot_noise_ratio: float = 0.0 # Ratio of insert spot not in raw record + asoc_noise_ratio: float = 0.0 # Ratio of insert asoc not in raw recordf + + def random_insert_spot(self, spot_asoc, spot_label_list=None): + """Insert negative spot in random, sample negative spot from spot_label_list""" + # If no negative spot_label_list, skip insertion of spot null span + if spot_label_list is None or len(spot_label_list) == 0: + return spot_asoc + + random_num = sum(np.random.binomial(1, self.spot_noise_ratio, len(spot_asoc))) + for _ in range(random_num): + random_position = np.random.randint(low=0, high=len(spot_asoc)) + to_insert_negative_spot = { + "span": null_span, + "label": np.random.choice(spot_label_list), # Sample negative spot from spot_label_list + "asoc": list(), + } + spot_asoc.insert(random_position, to_insert_negative_spot) + return spot_asoc + + def random_insert_asoc(self, spot_asoc, asoc_label_list=None): + """Insert negative asoc in random, sample negative asoc from asoc_label_list""" + # If no negative asoc_label_list, skip insertion of asoc null span + if asoc_label_list is None or len(asoc_label_list) == 0: + return spot_asoc + + spot_sum = len(spot_asoc) + random_num = sum(np.random.binomial(1, self.asoc_noise_ratio, spot_sum)) + for _ in range(random_num): + random_label = np.random.choice(asoc_label_list) + spot_position = np.random.randint(low=0, high=len(spot_asoc)) + asoc_position = np.random.randint(low=0, high=len(spot_asoc[spot_position]["asoc"]) + 1) + # Insert random negative span at `asoc_position` + spot_asoc[spot_position]["asoc"].insert(asoc_position, (random_label, null_span)) + return spot_asoc + + def add_noise(self, spot_asoc, spot_label_list, asoc_label_list): + """Add noise to target spot-asoc structure + spot_asoc: raw spot-asoc structure + spot_label_list: negative spot candidates + 
asoc_label_list: negative asoc candidates + """ + spot_asoc = self.random_insert_asoc(spot_asoc, asoc_label_list) + spot_asoc = self.random_insert_spot(spot_asoc, spot_label_list) + return spot_asoc + + +class DynamicSSIGenerator: + """ + Sample negative spot and asoc to construct SSI + """ + + def __init__(self, tokenizer, schema: RecordSchema, positive_rate=1, negative=-1, ordered_prompt=False) -> None: + self.spot_dict = self.get_ordered_dict(schema.type_list, tokenizer) + self.asoc_dict = self.get_ordered_dict(schema.role_list, tokenizer) + self.spot_list = list(self.spot_dict.keys()) + self.asoc_list = list(self.asoc_dict.keys()) + self.spot_prompt_id = tokenizer.convert_tokens_to_ids(spot_prompt) + self.asoc_prompt_id = tokenizer.convert_tokens_to_ids(asoc_prompt) + self.text_start_id = tokenizer.convert_tokens_to_ids(text_start) + self.positive_rate = positive_rate if 0 < positive_rate < 1 else 1 + self.negative = negative + self.ordered_prompt = ordered_prompt + logger.info(f"Meta Sample " f"Negative: {self.negative}, " f"Ordered SSI: {self.ordered_prompt}") + + @staticmethod + def get_ordered_dict(schema_name_list, tokenizer): + """Get schema name -> id dict + schema_name_list: ["人物", "组织机构"] + """ + schema_ordered_dict = OrderedDict() + for name in schema_name_list: + # tokenizer.encode("人物") -> [8, 122] + encoded_name = tokenizer.encode(name, add_special_tokens=False, return_token_type_ids=None) + schema_ordered_dict[name] = encoded_name["input_ids"] + return schema_ordered_dict + + @staticmethod + def sample_negative(postive, candidates, k=5): + if k < 0: + k = len(candidates) + negative_set = set() + for index in np.random.permutation(len(candidates))[:k].tolist(): + negative = candidates[index] + if negative not in postive: + negative_set.add(negative) + + return list(negative_set) + + def sample_spot(self, positive, candidates=None): + """Sample spot + + Args: + positive (List[str]): Positive Spot List + + Returns: + List[int]: spot index list + List[str]: Sampled Positive Spot List + List[str]: Sampled Negative Spot List + """ + neg_cands = candidates if candidates is not None else self.spot_list + + negative_spot = self.sample_negative(postive=positive, candidates=neg_cands, k=self.negative) + positive_spot = random.sample(positive, math.floor(len(positive) * self.positive_rate)) + + converted_spot_prefix = self.convert_prefix( + candidates=positive_spot + negative_spot, + prompt=self.spot_prompt_id, + mapper=self.spot_dict, + ordered_prompt=self.ordered_prompt, + ) + + return converted_spot_prefix, positive_spot, negative_spot + + def sample_asoc(self, positive, candidates=None): + """Sample Asoc + + Args: + positive (List[str]): Positive Asoc List + + Returns: + List[int]: asoc index list + List[str]: Sampled Negative Asoc List + """ + neg_cands = candidates if candidates is not None else self.asoc_list + negative_asoc = self.sample_negative(postive=positive, candidates=neg_cands, k=self.negative) + converted_asoc_prefix = self.convert_prefix( + candidates=positive + negative_asoc, + prompt=self.asoc_prompt_id, + mapper=self.asoc_dict, + ordered_prompt=self.ordered_prompt, + ) + return converted_asoc_prefix, negative_asoc + + def full_spot(self, candidates=None, shuffle=False): + # Random Prompt + Shuffle + if not self.ordered_prompt and shuffle: + ordered_prompt = False + else: + ordered_prompt = True + + prefix_cands = candidates if candidates is not None else self.spot_list + + return self.convert_prefix( + candidates=prefix_cands, + prompt=self.spot_prompt_id, + 
mapper=self.spot_dict, + ordered_prompt=ordered_prompt, + ) + + def full_asoc(self, candidates=None, shuffle=False): + # Random Prompt + Shuffle + if not self.ordered_prompt and shuffle: + ordered_prompt = False + else: + ordered_prompt = True + + prefix_cands = candidates if candidates is not None else self.asoc_list + + return self.convert_prefix( + candidates=prefix_cands, + prompt=self.asoc_prompt_id, + mapper=self.asoc_dict, + ordered_prompt=ordered_prompt, + ) + + @staticmethod + def convert_prefix(candidates, prompt, mapper, ordered_prompt=True): + prefix = list() + + if ordered_prompt: + candidate_sorted = sorted([(candidate, index) for index, candidate in enumerate(candidates)]) + index_list = [index for _, index in candidate_sorted] + else: + index_list = np.random.permutation(len(candidates)).tolist() + + for index in index_list: + prefix += [prompt] + prefix += mapper[candidates[index]] + return prefix + + +class DataCollatorForSeq2Seq: + def __init__( + self, + tokenizer, + ssi_generator: DynamicSSIGenerator, + model=None, + label_pad_token_id=-100, + padding=True, + max_source_length: Optional[int] = None, + max_target_length: Optional[int] = None, + max_prefix_length: Optional[int] = None, + spot_asoc_nosier: SpotAsocNoiser = None, + return_tensors=True, + ): + + self.tokenizer = tokenizer + self.ssi_generator = ssi_generator + self.model = model + self.label_pad_token_id = label_pad_token_id + self.padding = padding + self.max_source_length = max_source_length + self.max_target_length = max_target_length + self.max_prefix_length = max_prefix_length + self.spot_asoc_nosier = spot_asoc_nosier + self.return_tensors = return_tensors + + def __call__(self, data, return_tensors=None): + + new_data = [] # To avoid the orgin data being covered + + for ins in data: + + target_spot_asoc = copy.deepcopy(ins["spot_asoc"]) + + if ins["sample_ssi"] is True: + # Sample Dynamic SSI + spot_prefix, pos_spot, neg_spot = self.ssi_generator.sample_spot(positive=ins.get("spots", [])) + asoc_prefix, neg_asoc = self.ssi_generator.sample_asoc(positive=ins.get("asocs", [])) + + # Filter spot-asoc not in Positive Spot + target_spot_asoc = list(filter(lambda x: x["label"] in pos_spot, target_spot_asoc)) + + # Inject rejection noise + if self.spot_asoc_nosier is not None: + target_spot_asoc = self.spot_asoc_nosier.add_noise( + target_spot_asoc, + spot_label_list=neg_spot, + asoc_label_list=neg_asoc, + ) + else: + # Evaluation using Ordered SSI + spot_prefix = self.ssi_generator.full_spot(shuffle=self.model.training) + asoc_prefix = self.ssi_generator.full_asoc(shuffle=self.model.training) + + # Prepare prefix ids + prefix = spot_prefix + asoc_prefix + # truncate `prefix` to max length + if self.max_prefix_length is not None and self.max_prefix_length >= 0: + prefix = prefix[: self.max_prefix_length] + prefix = prefix + [self.ssi_generator.text_start_id] + + # Prepare source text ids + source_text_id = prefix + ins["input_ids"] + # truncate `input_ids` to max source length + if self.max_source_length is not None: + source_text_id = source_text_id[: self.max_source_length] + + # Prepare target record ids + # Generate new record + target_record = convert_spot_asoc(target_spot_asoc, structure_maker=BaseStructureMarker()) + target_labels = self.tokenizer.encode( + target_record, + return_token_type_ids=False, + return_attention_mask=True, + max_seq_len=self.max_target_length, + ) + + new_data.append( + { + "input_ids": source_text_id, + "labels": target_labels["input_ids"], + "attention_mask": [1] * 
len(source_text_id), + "decoder_attention_mask": target_labels["attention_mask"], + } + ) + + first = new_data[0] + assert isinstance( + first, dict + ), f"Input pattern not understood. The input of collatot must be a dict with key of input column name and value of data Received input type: {type(first)}" + + labels = [d["labels"] for d in new_data] if "labels" in new_data[0].keys() else None + + batch = {} + + def _pad_function(sequence, pad_value): + return Pad(axis=0, pad_val=pad_value, dtype="int64")(sequence) + + pad_value_map = { + "token_type_ids": self.tokenizer.pad_token_type_id, + "attention_mask": 0, + "decoder_attention_mask": 0, + "special_tokens_mask": 1, + "input_ids": self.tokenizer.pad_token_id, + } + + for k, v in first.items(): + if k not in ("labels", "label_ids") and v is not None and not isinstance(v, str): + batch[k] = _pad_function( + sequence=[d[k] for d in new_data], + pad_value=pad_value_map[k], + ) + else: + batch[k] = _pad_function( + sequence=[d[k] for d in new_data], + pad_value=self.label_pad_token_id, + ) + + # prepare decoder_input_ids + if ( + labels is not None + and self.model is not None + and hasattr(self.model, "prepare_decoder_input_ids_from_labels") + ): + decoder_input_ids = self.model.prepare_decoder_input_ids_from_labels(labels=batch["labels"]) + if not return_tensors: + batch["decoder_input_ids"] = decoder_input_ids.numpy() + if self.return_tensors: + for k, v in batch.items(): + batch[k] = paddle.to_tensor(v) + return batch + + +class DataCollatorForMultiTaskSeq2Seq: + def __init__( + self, + tokenizer, + ssi_generator: DynamicSSIGenerator, + model=None, + label_pad_token_id=-100, + padding=True, + max_source_length: Optional[int] = None, + max_target_length: Optional[int] = None, + max_prefix_length: Optional[int] = None, + spot_asoc_nosier: SpotAsocNoiser = None, + return_tensors=True, + ): + + self.tokenizer = tokenizer + self.ssi_generator = ssi_generator + self.model = model + self.label_pad_token_id = label_pad_token_id + self.padding = padding + self.max_source_length = max_source_length + self.max_target_length = max_target_length + self.max_prefix_length = max_prefix_length + self.spot_asoc_nosier = spot_asoc_nosier + self.return_tensors = return_tensors + + def __call__(self, data, return_tensors=None): + + new_data = [] # To avoid the orgin data being covered + + for ins in data: + + target_spot_asoc = copy.deepcopy(ins["spot_asoc"]) + + if ins["sample_ssi"] is True: + + positive_spot = set() + positive_asoc = set() + for spot_asoc in ins["spot_asoc"]: + positive_spot.add(spot_asoc["label"]) + for asoc in spot_asoc["asoc"]: + positive_asoc.add(asoc[0]) + + # 对 SSI 进行采样 + # 在多任务中,每个数据Instance + # ‘spots’ 对应该任务的 spots + # ‘asocs’ 对应该任务的 asocs + # 因此 candidates 在任务内进行采样 + spot_prefix, pos_spot, neg_spot = self.ssi_generator.sample_spot( + positive=list(positive_spot), + candidates=ins["spots"], + ) + asoc_prefix, neg_asoc = self.ssi_generator.sample_asoc( + positive=list(positive_asoc), + candidates=ins["asocs"], + ) + + # Filter spot-asoc not in Positive Spot + target_spot_asoc = list(filter(lambda x: x["label"] in pos_spot, target_spot_asoc)) + + # Inject rejection noise + if self.spot_asoc_nosier is not None: + target_spot_asoc = self.spot_asoc_nosier.add_noise( + target_spot_asoc, + spot_label_list=neg_spot, + asoc_label_list=neg_asoc, + ) + + else: + # Evaluation using Ordered SSI + spot_prefix = self.ssi_generator.full_spot(candidates=ins["spots"], shuffle=self.model.training) + asoc_prefix = 
self.ssi_generator.full_asoc(candidates=ins["asocs"], shuffle=self.model.training) + + # Prepare prefix ids + prefix_id = spot_prefix + asoc_prefix + # truncate `prefix` to max length + if self.max_prefix_length is not None and self.max_prefix_length >= 0: + prefix_id = prefix_id[: self.max_prefix_length] + prefix_id = prefix_id + [self.ssi_generator.text_start_id] + + # Prepare source text ids + source_text_id = prefix_id + ins["input_ids"] + # truncate `input_ids` to max source length + if self.max_source_length is not None: + source_text_id = source_text_id[: self.max_source_length] + + # Prepare target record ids + # Generate new record + target_record = convert_spot_asoc(target_spot_asoc, structure_maker=BaseStructureMarker()) + target_labels = self.tokenizer.encode( + target_record, + return_token_type_ids=False, + return_attention_mask=True, + max_seq_len=self.max_target_length, + ) + + new_data.append( + { + "input_ids": source_text_id, + "labels": target_labels["input_ids"], + "attention_mask": [1] * len(source_text_id), + "decoder_attention_mask": target_labels["attention_mask"], + } + ) + + first = new_data[0] + assert isinstance( + first, dict + ), f"Input pattern not understood. The input of collatot must be a dict with key of input column name and value of data Received input type: {type(first)}" + + labels = [d["labels"] for d in new_data] if "labels" in new_data[0].keys() else None + + batch = {} + + def _pad_function(sequence, pad_value): + return Pad(axis=0, pad_val=pad_value, dtype="int64")(sequence) + + pad_value_map = { + "token_type_ids": self.tokenizer.pad_token_type_id, + "attention_mask": 0, + "decoder_attention_mask": 0, + "special_tokens_mask": 1, + "input_ids": self.tokenizer.pad_token_id, + } + + for k, v in first.items(): + if k not in ("labels", "label_ids") and v is not None and not isinstance(v, str): + batch[k] = _pad_function( + sequence=[d[k] for d in new_data], + pad_value=pad_value_map[k], + ) + else: + batch[k] = _pad_function( + sequence=[d[k] for d in new_data], + pad_value=self.label_pad_token_id, + ) + + # prepare decoder_input_ids + if ( + labels is not None + and self.model is not None + and hasattr(self.model, "prepare_decoder_input_ids_from_labels") + ): + decoder_input_ids = self.model.prepare_decoder_input_ids_from_labels(labels=batch["labels"]) + if not return_tensors: + batch["decoder_input_ids"] = decoder_input_ids.numpy() + + if self.return_tensors: + for k, v in batch.items(): + batch[k] = paddle.to_tensor(v) + + return batch diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py b/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..cb761d8093f21a895a92f5a520a8f1941d574c99 --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py @@ -0,0 +1,156 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from typing import Optional, Union, List + +from paddle import Tensor +from paddlenlp.transformers import BertTokenizer + +logger = logging.getLogger(__name__) + + +class T5BertTokenizer(BertTokenizer): + + model_input_names = ["input_ids", "attention_mask"] + + def __init__( + self, + vocab_file, + do_lower_case=False, + do_basic_tokenize=True, + never_split=None, + unk_token="", + sep_token=None, + pad_token="", + cls_token=None, + mask_token=None, + space_token="", + tokenize_chinese_chars=True, + strip_accents=None, + **kwargs + ): + super().__init__( + vocab_file=vocab_file, + do_lower_case=do_lower_case, + do_basic_tokenize=do_basic_tokenize, + never_split=never_split, + unk_token=unk_token, + sep_token=sep_token, + pad_token=pad_token, + cls_token=cls_token, + mask_token=mask_token, + tokenize_chinese_chars=tokenize_chinese_chars, + strip_accents=strip_accents, + **kwargs, + ) + if space_token not in self._additional_special_tokens: + self._additional_special_tokens += [space_token] + + self._space_token = space_token + + def get_vocab(self): + vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} + vocab.update(self.added_tokens_encoder) + return vocab + + def tokenize(self, text, **kwargs): + import re + + # Remove space between + split_bracket = re.compile(r"\s*\s*|\s*\s*|\s*\s*") + + if len(split_bracket.split(text)) > 1: + new_text_list = [split_bracket.split(text)[0]] + for item in zip(split_bracket.findall(text), split_bracket.split(text)[1:]): + new_text_list += [item[0].strip(), item[1]] + text = "".join(new_text_list) + text = text.replace(" ", self._space_token) + return super().tokenize(text, **kwargs) + + def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]: + """Do not add eos again if user already added it.""" + if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id: + logging.warn( + f"This sequence already has {self.eos_token}. In future versions this behavior may lead to duplicated eos tokens being added." + ) + return token_ids + else: + return token_ids + [self.eos_token_id] + + def build_inputs_with_special_tokens( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and + adding special tokens. A sequence has the following format: + + - single sequence: ``X `` + - pair of sequences: ``A B `` + + Args: + token_ids_0 (:obj:`List[int]`): + List of IDs to which the special tokens will be added. + token_ids_1 (:obj:`List[int]`, `optional`): + Optional second list of IDs for sequence pairs. + + Returns: + :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens. 
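+
+        Example (illustrative, assuming the tokenizer's ``eos_token_id`` is set;
+        the concrete ids depend on the loaded vocab)::
+
+            tokenizer.build_inputs_with_special_tokens([5, 6])       # -> [5, 6, eos_token_id]
+            tokenizer.build_inputs_with_special_tokens([5, 6], [7])  # -> [5, 6, eos_token_id, 7, eos_token_id]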
+ """ + token_ids_0 = self._add_eos_if_not_present(token_ids_0) + if token_ids_1 is None: + return token_ids_0 + else: + token_ids_1 = self._add_eos_if_not_present(token_ids_1) + return token_ids_0 + token_ids_1 + + def _decode(self, token_ids: Union[List[int], Tensor], skip_special_tokens: bool = False, **kwargs) -> str: + if isinstance(token_ids, Tensor): + tokens = self.convert_ids_to_tokens(token_ids.tolist(), skip_special_tokens=skip_special_tokens) + else: + tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens) + + # Fix '##' subtoken + tokens = [x.lstrip("#") if x.startswith("##") else x for x in tokens] + + x_str = "".join(tokens) + x_str = x_str.replace(" ", "") + x_str = x_str.replace(self._space_token, " ") + return x_str + + def decode(self, token_ids: Union[List[int], Tensor], skip_special_tokens: bool = False, **kwargs) -> str: + return self._decode(token_ids, skip_special_tokens) + + def batch_decode(self, sequences, skip_special_tokens=False, clean_up_tokenization_spaces=True): + """ + Convert a list of lists of token ids into a list of strings by calling decode. + Args: + sequences (Union[List[int], List[List[int]], Tensor]): + List of tokenized input ids. + skip_special_tokens (bool, optional): + Whether or not to remove special tokens in the decoding. Defaults to `False`. + clean_up_tokenization_spaces (bool, optional): + Whether or not to clean up the tokenization spaces. Defaults to `True`. + Returns: + List[str]: The list of decoded sentences. + """ + return [ + self._decode( + seq, skip_special_tokens=skip_special_tokens, clean_up_tokenization_spaces=clean_up_tokenization_spaces + ) + for seq in sequences + ] diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/utils.py b/examples/information_extraction/DuUIE/uie/seq2struct/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..06112300d92d35fe4c970586bf31d1dbff4e032c --- /dev/null +++ b/examples/information_extraction/DuUIE/uie/seq2struct/utils.py @@ -0,0 +1,479 @@ +#!/usr/bin/env python3 +# -*- coding:utf-8 -*- + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import logging +import os +import random +from dataclasses import dataclass +from typing import List + +import numpy as np +import paddle +import tabulate +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from uie.evaluation import constants +from uie.evaluation.sel2record import MapConfig, RecordSchema, SEL2Record, merge_schema +from uie.seq2struct.data_collator import ( + DataCollatorForMultiTaskSeq2Seq, + DataCollatorForSeq2Seq, + DynamicSSIGenerator, + SpotAsocNoiser, +) +from uie.seq2struct.t5_bert_tokenizer import T5BertTokenizer + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + CosineDecayWithWarmup, + LinearDecayWithWarmup, + PolyDecayWithWarmup, +) + +logger = logging.getLogger("__main__") + + +def get_writer(args): + if args.writer_type == "visualdl": + from visualdl import LogWriter + + writer = LogWriter(logdir=args.logging_dir) + elif args.writer_type == "tensorboard": + from tensorboardX import SummaryWriter + + writer = SummaryWriter(logdir=args.logging_dir) + else: + raise ValueError("writer_type must be in ['visualdl', 'tensorboard']") + return writer + + +scheduler_type2cls = { + "linear": LinearDecayWithWarmup, + "cosine": CosineDecayWithWarmup, + "poly": PolyDecayWithWarmup, +} + + +def get_scheduler( + learning_rate, + scheduler_type, + num_warmup_steps=None, + num_training_steps=None, + **scheduler_kwargs, +): + """Set learning rate scheduler""" + + if scheduler_type not in scheduler_type2cls.keys(): + data = " ".join(scheduler_type2cls.keys()) + raise ValueError(f"scheduler_type must be choson from {data}") + + if num_warmup_steps is None: + raise ValueError("requires `num_warmup_steps`, please provide that argument.") + + if num_training_steps is None: + raise ValueError("requires `num_training_steps`, please provide that argument.") + + return scheduler_type2cls[scheduler_type]( + learning_rate=learning_rate, total_steps=num_training_steps, warmup=num_warmup_steps, **scheduler_kwargs + ) + + +def set_seed(args): + """Set default seed""" + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +def save_checkpoint(tokenizer, model, output_dir): + """Save tokenizer and checkpoint model to output_dir""" + logger.info(f"saving checkpoint to {output_dir}") + if isinstance(model, paddle.DataParallel): + model = model._layers + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def set_logger(args): + """Set logger""" + logger.setLevel(logging.DEBUG if "DEBUG" in os.environ else logging.INFO) + + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO, + handlers=[ + logging.FileHandler( + filename=f"{args.output_dir}.log", + mode="w", + encoding="utf-8", + ) + ], + ) + # create console handler and set level to debug + console_handler = logging.StreamHandler() + console_handler.setLevel(level=logging.DEBUG) + # add formatter to console_handler + console_handler.setFormatter(fmt=logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")) + # add console_handler to logger + logger.addHandler(console_handler) + + +def read_json_file(file_name): + """Read jsonline file as generator""" + with open(file_name, encoding="utf8") as fin: + for line in fin: + yield json.loads(line) + + +def better_print_multi(results): + """Better print multi task results + results: 
Dictionary of task and metric {"task:metric": "value", ...} + """ + table = [(task, results[task]) for task in results] + return tabulate.tabulate(table, headers=["Task", "Metric"]) + + +def read_func(tokenizer, data_file: str, max_source_length: int, is_train: bool = False, negative_keep: float = 1.0): + """Read instance from data_file + + Args: + tokenizer (PretrainedTokenizer): Tokenizer + data_file (str): Data filename + max_source_length (int): Max source length + is_train (bool): instance from this file whether for training + negative_keep (float): the ratio of keeping negative instances + """ + + negative_drop_num = 0 + with open(data_file, "r", encoding="utf-8") as fin: + for line in fin: + instance = json.loads(line) + + # Drop negative sample in random during training stage + if is_train and len(instance["spot_asoc"]) == 0: + # if negative_keep >= 1, keep all negative instances + # else drop negative instance when random() > negative_keep + if random.random() > negative_keep: + negative_drop_num += 1 + continue + + inputs = tokenizer( + instance["text"], + return_token_type_ids=False, + return_attention_mask=True, + max_seq_len=max_source_length, + ) + + # `sample_ssi` can be True in the training stage + # `sample_ssi` can only be False in the evaluation stage + # 在训练时,ssi可以动态变化 (sample_ssi=True) + # 但是在推理和验证时,ssi必须固定保证推理结果的一致 (sample_ssi=False) + inputs.update( + { + "spots": instance["spot"], + "asocs": instance["asoc"], + "spot_asoc": instance["spot_asoc"], + "sample_ssi": is_train, + } + ) + yield inputs + + if negative_drop_num > 0: + logger.info(f"Drop negative {negative_drop_num} instance during loading {data_file}.") + + +def read_training_instance_based_config( + tokenizer, config_file: str, max_source_length: int, negative_keep: float = 1.0 +): + """Read training instances based on config_file + + Args: + tokenizer (PretrainedTokenizer): Tokenizer + config_file (str): Config filename + max_source_length (int): Max source length + negative_keep: the ratio of keeping negative instances + + Yields: + dict: instance for training + """ + task_configs = list(TaskConfig.load_list_from_yaml(config_file)) + + for task_config in task_configs: + negative_drop_num = 0 + + train_file = os.path.join(task_config.data_path, "train.json") + schema_file = os.path.join(task_config.data_path, "record.schema") + record_schema = RecordSchema.read_from_file(schema_file) + with open(train_file, "r", encoding="utf-8") as fin: + count = 0 + for line in fin: + instance = json.loads(line) + + # Drop negative sample in random during training stage + if len(instance["spot_asoc"]) == 0: + # if negative_keep >= 1, keep all negative instances + # else drop negative instance when random() > negative_keep + if random.random() > negative_keep: + negative_drop_num += 1 + continue + + inputs = tokenizer( + instance["text"], + return_token_type_ids=False, + return_attention_mask=True, + max_seq_len=max_source_length, + ) + + # `sample_ssi` is True in the training stage + inputs.update( + { + "spots": record_schema.type_list, + "asocs": record_schema.role_list, + "spot_asoc": instance["spot_asoc"], + "sample_ssi": True, + } + ) + yield inputs + count += 1 + logger.info(f"Load {count} instances from {train_file}") + + if negative_drop_num > 0: + logger.info(f"Drop negative {negative_drop_num} instance during loading {train_file}.") + + +def get_train_dataloader(model, tokenizer, args): + logger.info(f"Load data according to {args.multi_task_config} ...") + + dataset = load_dataset( + 
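+        # `load_dataset` is called with a custom reader here; with `lazy=False` the whole
+        # multi-task training set is read and tokenized up front, and the keyword arguments
+        # below are forwarded to `read_training_instance_based_config`.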
read_training_instance_based_config, + tokenizer=tokenizer, + config_file=args.multi_task_config, + max_source_length=args.max_source_length, + lazy=False, + negative_keep=args.negative_keep, + ) + + # Merge schema in all datasets for pre-tokenize + schema_list = list() + for task_config in TaskConfig.load_list_from_yaml(args.multi_task_config): + schema_file = os.path.join(task_config.data_path, "record.schema") + schema_list += [RecordSchema.read_from_file(schema_file)] + schema = merge_schema(schema_list) + + batch_sampler = DistributedBatchSampler( + dataset=dataset, + batch_size=args.per_device_train_batch_size, + shuffle=True, + ) + + if args.spot_noise > 0 or args.asoc_noise > 0: + spot_asoc_nosier = SpotAsocNoiser( + spot_noise_ratio=args.spot_noise, + asoc_noise_ratio=args.asoc_noise, + null_span=constants.null_span, + ) + else: + spot_asoc_nosier = None + + label_pad_token_id = -100 if args.ignore_pad_token_for_loss else tokenizer.pad_token_id + + collate_fn = DataCollatorForMultiTaskSeq2Seq( + tokenizer, + model=model, + label_pad_token_id=label_pad_token_id, + max_source_length=args.max_source_length, + max_prefix_length=args.max_prefix_length, + max_target_length=args.max_target_length, + ssi_generator=DynamicSSIGenerator( + tokenizer=tokenizer, + schema=schema, + positive_rate=args.meta_positive_rate, + negative=args.meta_negative, + ordered_prompt=args.ordered_prompt, + ), + spot_asoc_nosier=spot_asoc_nosier, + ) + + data_loader = DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + collate_fn=collate_fn, + num_workers=args.dataloader_num_workers, + return_list=True, + ) + + return data_loader + + +def get_eval_dataloader(model, tokenizer, eval_filename, record_schema, args): + """Get evaluation dataloader""" + + logger.info(f"Load data from {eval_filename} ...") + + schema = RecordSchema.read_from_file(record_schema) + + dataset = load_dataset( + read_func, + tokenizer=tokenizer, + data_file=eval_filename, + max_source_length=args.max_source_length, + is_train=False, + lazy=False, + ) + + batch_sampler = BatchSampler(dataset=dataset, batch_size=args.per_device_eval_batch_size, shuffle=False) + + label_pad_token_id = -100 if args.ignore_pad_token_for_loss else tokenizer.pad_token_id + + collate_fn = DataCollatorForSeq2Seq( + tokenizer, + model=model, + label_pad_token_id=label_pad_token_id, + max_source_length=args.max_source_length, + max_prefix_length=args.max_prefix_length, + max_target_length=args.max_target_length, + ssi_generator=DynamicSSIGenerator( + tokenizer=tokenizer, + schema=schema, + positive_rate=1, + negative=-1, + ordered_prompt=True, + ), + spot_asoc_nosier=None, + ) + + data_loader = DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + collate_fn=collate_fn, + num_workers=args.dataloader_num_workers, + return_list=True, + ) + + return data_loader + + +def load_eval_tasks(model, tokenizer, args): + """Load evaluation tasks + + Args: + model (PretrainedModel): Pretrain Model + tokenizer (PretrainedTokenizer): Tokenizer + args (Namespace): arguments for loading eval tasks + + Returns: + list(Task): list of evaluation tasks + """ + eval_tasks = dict() + task_configs = list(TaskConfig.load_list_from_yaml(args.multi_task_config)) + + for task_config in task_configs: + + val_filename = os.path.join(task_config.data_path, "val.json") + record_schema = os.path.join(task_config.data_path, "record.schema") + + task_dataloader = get_eval_dataloader( + model=model, tokenizer=tokenizer, eval_filename=val_filename, record_schema=record_schema, 
args=args + ) + + sel2record = SEL2Record( + schema_dict=SEL2Record.load_schema_dict(task_config.data_path), + map_config=MapConfig.load_by_name(task_config.sel2record), + tokenizer=tokenizer if isinstance(tokenizer, T5BertTokenizer) else None, + ) + + eval_tasks[task_config.dataset_name] = Task( + config=task_config, + dataloader=task_dataloader, + sel2record=sel2record, + val_instances=list(read_json_file(val_filename)), + metrics=task_config.metrics, + ) + + return eval_tasks + + +def write_prediction(eval_prediction, output_dir, prefix="eval"): + """Write prediction to output_dir + + Args: + eval_prediction (dict): + - `record` (list(dict)), each element is extraction reocrd + - `sel` (list(str)): each element is sel expression + - `metric` (dict) + output_dir (str): Output directory path + prefix (str, optional): prediction file prefix. Defaults to 'eval'. + + Write prediction to files: + - `preds_record.txt`, each line is extracted record + - `preds_seq2seq.txt`, each line is generated sel + - `results.txt`, detailed metrics of prediction + """ + output_filename = os.path.join(output_dir, f"{prefix}-preds_record.txt") + with open(output_filename, "w", encoding="utf8") as output: + for pred in eval_prediction.get("record", []): + output.write(json.dumps(pred, ensure_ascii=False) + "\n") + + output_filename = os.path.join(output_dir, f"{prefix}-preds_seq2seq.txt") + with open(output_filename, "w", encoding="utf8") as output: + for pred in eval_prediction.get("sel", []): + output.write(pred + "\n") + + output_filename = os.path.join(output_dir, f"{prefix}-results.txt") + with open(output_filename, "w", encoding="utf8") as output: + for key, value in eval_prediction.get("metric", {}).items(): + output.write(f"{prefix}-{key} = {value}\n") + + +class TaskConfig: + def __init__(self, task_dict) -> None: + self.dataset_name = task_dict.get("name", "") + self.task_name = task_dict.get("task", "") + self.data_path = task_dict.get("path", "") + self.sel2record = task_dict.get("sel2record", "") + self.metrics = task_dict.get("metrics", []) + self.eval_match_mode = task_dict.get("eval_match_mode", "normal") + self.schema = RecordSchema.read_from_file(f"{self.data_path}/record.schema") + + def __repr__(self) -> str: + task_config_list = [ + f"dataset: {self.dataset_name}", + f"task : {self.task_name}", + f"path : {self.data_path}", + f"schema : {self.schema}", + f"metrics: {self.metrics}", + f"eval_match_mode : {self.eval_match_mode}", + ] + return "\n".join(task_config_list) + + @staticmethod + def load_list_from_yaml(task_config): + import yaml + + configs = yaml.load(open(task_config, encoding="utf8"), Loader=yaml.FullLoader) + task_configs = filter(lambda x: x.startswith("T"), configs) + for task_config in task_configs: + yield TaskConfig(configs[task_config]) + + +@dataclass +class Task: + config: TaskConfig + dataloader: DataLoader + sel2record: SEL2Record + val_instances: List[dict] + metrics: List[str] diff --git a/examples/information_extraction/msra_ner/README.md b/examples/information_extraction/msra_ner/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3d9e04bf6fe7dc1cdf8f6330e856e378c79243d5 --- /dev/null +++ b/examples/information_extraction/msra_ner/README.md @@ -0,0 +1,125 @@ +# 使用PaddleNLP完成中文命名实体识别 + +## 1. 
简介 + +MSRA-NER 数据集由微软亚研院发布,其目标是识别文本中具有特定意义的实体,主要包括人名、地名、机构名等。示例如下: + +``` +不\002久\002前\002,\002中\002国\002共\002产\002党\002召\002开\002了\002举\002世\002瞩\002目\002的\002第\002十\002五\002次\002全\002国\002代\002表\002大\002会\002。 O\002O\002O\002O\002B-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002O\002O\002O\002O\002O\002O\002O\002O\002B-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002O +这\002次\002代\002表\002大\002会\002是\002在\002中\002国\002改\002革\002开\002放\002和\002社\002会\002主\002义\002现\002代\002化\002建\002设\002发\002展\002的\002关\002键\002时\002刻\002召\002开\002的\002历\002史\002性\002会\002议\002。 O\002O\002O\002O\002O\002O\002O\002O\002B-LOC\002I-LOC\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O +``` + +PaddleNLP集成的数据集MSRA-NER数据集对文件格式做了调整:每一行文本、标签以特殊字符"\t"进行分隔,每个字之间以特殊字符"\002"分隔。 + +## 快速开始 + +### 模型训练 + +#### 单卡训练 + +```shell +export CUDA_VISIBLE_DEVICES=0 + +python -u ./train.py \ + --model_type bert \ + --model_name_or_path bert-base-multilingual-uncased \ + --dataset msra_ner \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/msra_ner/ \ + --device gpu +``` + +其中参数释义如下: +- `model_type`: 指定模型的类型,可选的有 bert、ernie、ernie-ctm。 +- `model_name_or_path`: 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer,支持[PaddleNLP Transformer API](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 中除ernie-gen以外的所有模型。若使用其他系列模型,需修改脚本导入相应的Task和Tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `dataset`: 目前支持 msra_ner 和 peoples_daily_ner 数据集。 +- `max_seq_length`: 表示最大句子长度,超过该长度将被截断。 +- `batch_size`: 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate`: 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs`: 表示训练轮数。 +- `logging_steps`: 表示日志打印间隔。 +- `save_steps`: 表示模型保存及评估间隔。 +- `output_dir`: 表示模型保存路径。 +- `device`: 训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU, 'npu'表示使用华为昇腾卡。 + +#### 多卡训练 +```shell +python -m paddle.distributed.launch --gpus "0,1" ./train.py \ + --model_type bert \ + --model_name_or_path bert-base-multilingual-uncased \ + --dataset msra_ner \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/msra_ner/ \ + --device gpu +``` + + +训练过程将按照 `logging_steps` 和 `save_steps` 的设置打印如下日志: + +``` +global step 3996, epoch: 2, batch: 1184, loss: 0.008593, speed: 4.15 step/s +global step 3997, epoch: 2, batch: 1185, loss: 0.008453, speed: 4.17 step/s +global step 3998, epoch: 2, batch: 1186, loss: 0.002294, speed: 4.19 step/s +global step 3999, epoch: 2, batch: 1187, loss: 0.005351, speed: 4.16 step/s +global step 4000, epoch: 2, batch: 1188, loss: 0.004734, speed: 4.18 step/s +eval loss: 0.006829, precision: 0.908957, recall: 0.926683, f1: 0.917734 +``` + +使用以上命令进行单卡 Fine-tuning ,在验证集上有如下结果: + Metric | Result | +------------------------------|-------------| +Precision | 0.908957 | +Recall | 0.926683 | +F1 | 0.917734 | + +### 模型评估 +目前支持bert类型模型,其他模型可修改为对应的Task和Tokenizer。支持msra_ner数据集。 +```shell +export CUDA_VISIBLE_DEVICES=0 + +python -u ./eval.py \ + --model_name_or_path bert-base-multilingual-uncased \ + --max_seq_length 128 \ + --batch_size 32 \ + --device gpu \ + --init_checkpoint_path tmp/msra_ner/model_500.pdparams +``` + +其中参数释义如下: +- `model_name_or_path`: 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_seq_length`: 表示最大句子长度,超过该长度将被截断。 +- 
`batch_size`: 表示每次迭代**每张卡**上的样本数目。 +- `use_gpu`: 是否使用GPU。 +- `init_checkpoint_path`: 模型加载路径。 + +### 模型预测 + +目前支持bert类型模型,其他模型可修改为对应的Task和Tokenizer。支持msra_ner数据集。 +```shell +export CUDA_VISIBLE_DEVICES=0 + +python -u ./predict.py \ + --model_name_or_path bert-base-multilingual-uncased \ + --max_seq_length 128 \ + --batch_size 32 \ + --device gpu \ + --init_checkpoint_path tmp/msra_ner/model_500.pdparams +``` + +### 使用其它预训练模型 + +请参考[Transformer API文档](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 了解更多PaddleNLP支持的预训练模型信息,并更换`--model_name_or_path`参数即可对比其他预训练模型的效果。 + +## Reference + +- [The third international Chinese language processing bakeoff: Word segmentation and named entity recognition](https://faculty.washington.edu/levow/papers/sighan06.pdf) diff --git a/examples/information_extraction/msra_ner/eval.py b/examples/information_extraction/msra_ner/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..2e30532cd43697b7763a03a6938467d6487bc6aa --- /dev/null +++ b/examples/information_extraction/msra_ner/eval.py @@ -0,0 +1,126 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import BertForTokenClassification, BertTokenizer + +parser = argparse.ArgumentParser() + +parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join(list(BertTokenizer.pretrained_init_configuration.keys())), +) +parser.add_argument("--init_checkpoint_path", default=None, type=str, required=True, help="The model checkpoint path.") +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", +) + + +def do_eval(args): + paddle.set_device(args.device) + + # Create dataset, tokenizer and dataloader. + train_ds, eval_ds = load_dataset("msra_ner", split=("train", "test")) + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + + label_list = train_ds.features["ner_tags"].feature.names + label_num = len(label_list) + no_entity_id = 0 + + def tokenize_and_align_labels(examples): + tokenized_inputs = tokenizer( + examples["tokens"], + max_seq_len=args.max_seq_length, + # We use this argument because the texts in our dataset are lists of words (with a label for each word). 
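+            # The MSRA-NER samples are already lists of single-character tokens, so each
+            # token maps to exactly one input id and the ner_tags align one-to-one
+            # (the [CLS]/[SEP] offsets are handled in the loop below).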
+ is_split_into_words="token", + return_length=True, + ) + labels = [] + + for i, label in enumerate(examples["ner_tags"]): + label_ids = label + if len(tokenized_inputs["input_ids"][i]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_inputs["input_ids"][i]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + label_ids += [no_entity_id] * (len(tokenized_inputs["input_ids"][i]) - len(label_ids)) + + labels.append(label_ids) + tokenized_inputs["labels"] = labels + return tokenized_inputs + + ignore_label = -100 + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int32"), # input + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int32"), # segment + "seq_len": Stack(dtype="int64"), + "labels": Pad(axis=0, pad_val=ignore_label, dtype="int64"), # label + } + ): fn(samples) + + eval_ds = eval_ds.select(range(len(eval_ds) - 1)) + eval_ds = eval_ds.map(tokenize_and_align_labels, batched=True) + eval_data_loader = DataLoader( + dataset=eval_ds, collate_fn=batchify_fn, num_workers=0, batch_size=args.batch_size, return_list=True + ) + + # Define the model netword and its loss + model = BertForTokenClassification.from_pretrained(args.model_name_or_path, num_classes=label_num) + if args.init_checkpoint_path: + model_dict = paddle.load(args.init_checkpoint_path) + model.set_dict(model_dict) + loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label) + + metric = ChunkEvaluator(label_list=label_list) + + model.eval() + metric.reset() + for step, batch in enumerate(eval_data_loader): + input_ids, token_type_ids, length, labels = batch + logits = model(input_ids, token_type_ids) + loss = loss_fct(logits, labels) + avg_loss = paddle.mean(loss) + preds = logits.argmax(axis=2) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(length, preds, labels) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + precision, recall, f1_score = metric.accumulate() + print("eval loss: %f, precision: %f, recall: %f, f1: %f" % (avg_loss, precision, recall, f1_score)) + + +if __name__ == "__main__": + args = parser.parse_args() + do_eval(args) diff --git a/examples/information_extraction/msra_ner/predict.py b/examples/information_extraction/msra_ner/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..5aa4de5197cfcb4d5264ea5c5a7d5ddff5ee0948 --- /dev/null +++ b/examples/information_extraction/msra_ner/predict.py @@ -0,0 +1,133 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
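+# See the License for the specific language governing permissions and
+# limitations under the License.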
+ +import argparse + +import paddle +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.transformers import BertForTokenClassification, BertTokenizer + +parser = argparse.ArgumentParser() + +# yapf: disable +parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(list(BertTokenizer.pretrained_init_configuration.keys()))) +parser.add_argument("--init_checkpoint_path", default=None, type=str, required=True, help="The model checkpoint path.", ) +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded.", ) +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.", ) +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu", "npu"] , help="The device to select to train the model, is must be cpu/gpu/xpu/npu.") +# yapf: enable + + +def parse_decodes(input_words, id2label, decodes, lens): + decodes = [x for batch in decodes for x in batch] + lens = [x for batch in lens for x in batch] + + outputs = [] + for idx, end in enumerate(lens): + sent = "".join(input_words[idx]["tokens"]) + tags = [id2label[x] for x in decodes[idx][1:end]] + sent_out = [] + tags_out = [] + words = "" + for s, t in zip(sent, tags): + if t.startswith("B-") or t == "O": + if len(words): + sent_out.append(words) + if t.startswith("B-"): + tags_out.append(t.split("-")[1]) + else: + tags_out.append(t) + words = s + else: + words += s + if len(sent_out) < len(tags_out): + sent_out.append(words) + outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) + return outputs + + +def do_predict(args): + paddle.set_device(args.device) + + # Create dataset, tokenizer and dataloader. + train_examples, predict_examples = load_dataset("msra_ner", split=("train", "test")) + column_names = train_examples.column_names + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + + label_list = train_examples.features["ner_tags"].feature.names + label_num = len(label_list) + no_entity_id = 0 + + def tokenize_and_align_labels(examples): + tokenized_inputs = tokenizer( + examples["tokens"], + max_seq_len=args.max_seq_length, + # We use this argument because the texts in our dataset are lists of words (with a label for each word). 
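+            # `return_length=True` keeps a "seq_len" field in the features; it is read back
+            # from each batch below and used to trim padding before decoding the tags.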
+ is_split_into_words="token", + return_length=True, + ) + labels = [] + + for i, label in enumerate(examples["ner_tags"]): + label_ids = label + if len(tokenized_inputs["input_ids"][i]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_inputs["input_ids"][i]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + label_ids += [no_entity_id] * (len(tokenized_inputs["input_ids"][i]) - len(label_ids)) + + labels.append(label_ids) + tokenized_inputs["labels"] = labels + return tokenized_inputs + + batchify_fn = DataCollatorForTokenClassification(tokenizer) + + id2label = dict(enumerate(label_list)) + + predict_examples = predict_examples.select(range(len(predict_examples) - 1)) + predict_ds = predict_examples.map(tokenize_and_align_labels, batched=True, remove_columns=column_names) + predict_data_loader = DataLoader( + dataset=predict_ds, collate_fn=batchify_fn, num_workers=0, batch_size=args.batch_size, return_list=True + ) + + # Define the model netword + model = BertForTokenClassification.from_pretrained(args.model_name_or_path, num_classes=label_num) + if args.init_checkpoint_path: + model_dict = paddle.load(args.init_checkpoint_path) + model.set_dict(model_dict) + + model.eval() + pred_list = [] + len_list = [] + for step, batch in enumerate(predict_data_loader): + logits = model(batch["input_ids"], batch["token_type_ids"]) + pred = paddle.argmax(logits, axis=-1) + pred_list.append(pred.numpy()) + len_list.append(batch["seq_len"].numpy()) + + preds = parse_decodes(predict_examples, id2label, pred_list, len_list) + + file_path = "results.txt" + with open(file_path, "w", encoding="utf8") as fout: + fout.write("\n".join(preds)) + # Print some examples + print("The results have been saved in the file: %s, some examples are shown below: " % file_path) + print("\n".join(preds[:10])) + + +if __name__ == "__main__": + args = parser.parse_args() + do_predict(args) diff --git a/examples/information_extraction/msra_ner/train.py b/examples/information_extraction/msra_ner/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e87ba6ad94f4f65971fa275c809e50eb33ee7a8e --- /dev/null +++ b/examples/information_extraction/msra_ner/train.py @@ -0,0 +1,216 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
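+# See the License for the specific language governing permissions and
+# limitations under the License.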
+ +import argparse +import os +import time + +import paddle +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import ( + BertForTokenClassification, + BertTokenizer, + ErnieCtmForTokenClassification, + ErnieCtmTokenizer, + ErnieForTokenClassification, + ErnieTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "bert": (BertForTokenClassification, BertTokenizer), + "ernie": (ErnieForTokenClassification, ErnieTokenizer), + "ernie-ctm": (ErnieCtmForTokenClassification, ErnieCtmTokenizer), +} + +parser = argparse.ArgumentParser() + +# yapf: disable +parser.add_argument("--model_type", default="bert", type=str, help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), ) +parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], [])), ) +parser.add_argument("--dataset", default="msra_ner", type=str, choices=["msra_ner", "peoples_daily_ner"] , help="The named entity recognition datasets.") +parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the model predictions and checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") +parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") +parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.", ) +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",) +parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") +parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu", "npu"] , help="The device to select to train the model, is must be cpu/gpu/xpu/npu.") +# yapf: enable + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader, label_num, mode="valid"): + model.eval() + metric.reset() + avg_loss, precision, recall, f1_score = 0, 0, 0, 0 + for batch in data_loader: + logits = model(batch["input_ids"], batch["token_type_ids"]) + loss = loss_fct(logits, batch["labels"]) + avg_loss = paddle.mean(loss) + preds = logits.argmax(axis=2) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute( + batch["seq_len"], preds, batch["labels"] + ) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + precision, recall, f1_score = metric.accumulate() + print("%s: eval loss: %f, precision: %f, recall: %f, f1: %f" % (mode, avg_loss, precision, recall, f1_score)) + model.train() + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + # Create dataset, tokenizer and dataloader. + if args.dataset == "peoples_daily_ner": + raw_datasets = load_dataset(args.dataset) + else: + raw_datasets = load_dataset(args.dataset) + + AutoForTokenClassification, AutoTokenizer = MODEL_CLASSES[args.model_type] + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + train_ds = raw_datasets["train"] + + label_list = train_ds.features["ner_tags"].feature.names + label_num = len(label_list) + no_entity_id = 0 + + def tokenize_and_align_labels(examples): + tokenized_inputs = tokenizer( + examples["tokens"], + max_seq_len=args.max_seq_length, + # We use this argument because the texts in our dataset are lists of words (with a label for each word). 
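+            # The alignment below truncates ner_tags to len(input_ids) - 2, then fills the
+            # [CLS]/[SEP] positions and any remaining padding with no_entity_id ("O").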
+ is_split_into_words="token", + return_length=True, + ) + labels = [] + + for i, label in enumerate(examples["ner_tags"]): + label_ids = label + if len(tokenized_inputs["input_ids"][i]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_inputs["input_ids"][i]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + label_ids += [no_entity_id] * (len(tokenized_inputs["input_ids"][i]) - len(label_ids)) + + labels.append(label_ids) + tokenized_inputs["labels"] = labels + return tokenized_inputs + + train_ds = train_ds.select(range(len(train_ds) - 1)) + column_names = train_ds.column_names + train_ds = train_ds.map(tokenize_and_align_labels, batched=True, remove_columns=column_names) + + ignore_label = -100 + + batchify_fn = DataCollatorForTokenClassification(tokenizer=tokenizer, label_pad_token_id=ignore_label) + + train_batch_sampler = paddle.io.DistributedBatchSampler( + train_ds, batch_size=args.batch_size, shuffle=True, drop_last=True + ) + + train_data_loader = DataLoader( + dataset=train_ds, collate_fn=batchify_fn, num_workers=0, batch_sampler=train_batch_sampler, return_list=True + ) + + test_ds = raw_datasets["test"] + test_ds = test_ds.select(range(len(test_ds) - 1)) + test_ds = test_ds.map(tokenize_and_align_labels, batched=True, remove_columns=column_names) + + test_data_loader = DataLoader( + dataset=test_ds, collate_fn=batchify_fn, num_workers=0, batch_size=args.batch_size, return_list=True + ) + + if args.dataset == "peoples_daily_ner": + dev_ds = raw_datasets["validation"] + dev_ds = dev_ds.select(range(len(dev_ds) - 1)) + dev_ds = dev_ds.map(tokenize_and_align_labels, batched=True, remove_columns=column_names) + + dev_data_loader = DataLoader( + dataset=dev_ds, collate_fn=batchify_fn, num_workers=0, batch_size=args.batch_size, return_list=True + ) + + # Define the model netword and its loss + model = AutoForTokenClassification.from_pretrained(args.model_name_or_path, num_classes=label_num) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
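+    # `apply_decay_param_fun` receives each parameter's name and applies weight decay
+    # only to the names collected in `decay_params`.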
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label) + + metric = ChunkEvaluator(label_list=label_list) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + logits = model(batch["input_ids"], batch["token_type_ids"]) + loss = loss_fct(logits, batch["labels"]) + avg_loss = paddle.mean(loss) + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, avg_loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + avg_loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if paddle.distributed.get_rank() == 0: + if args.dataset == "peoples_daily_ner": + evaluate(model, loss_fct, metric, dev_data_loader, label_num, "valid") + evaluate(model, loss_fct, metric, test_data_loader, label_num, "test") + + paddle.save(model.state_dict(), os.path.join(args.output_dir, "model_%d.pdparams" % global_step)) + if global_step >= num_training_steps: + return + + +if __name__ == "__main__": + args = parser.parse_args() + for arg in vars(args): + logger.info("{:20}:{}".format(arg, getattr(args, arg))) + + do_train(args) diff --git a/examples/information_extraction/waybill_ie/README.md b/examples/information_extraction/waybill_ie/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c842c91c7242aa8916665df0199f0a252f4cf009 --- /dev/null +++ b/examples/information_extraction/waybill_ie/README.md @@ -0,0 +1,102 @@ +# 快递单信息抽取 (Waybill Information Extraction) + +## 简介 + +本示例将通过BiGRU-CRF和ERNIE + FC两类模型,演示如何从用户提供的快递单中,抽取姓名、电话、省、市、区、详细地址等内容,形成结构化信息。辅助物流行业从业者进行有效信息的提取,从而降低客户填单的成本。 + +## 快速开始 + +### 数据准备 + +执行以下命令,下载并解压示例数据集: + +```bash +python download.py --data_dir ./waybill_ie +``` + +数据示例如下: + +``` +1^B6^B6^B2^B0^B2^B0^B0^B0^B7^B7^B宣^B荣^B嗣^B甘^B肃^B省^B白^B银^B市^B会^B宁^B县^B河^B畔^B镇^B十^B字^B街^B金^B海^B超^B市^B西^B行^B5^B0^B米 T-B^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BP-B^BP-I^BP-I^BA1-B^BA1-I^BA1-I^BA2-B^BA2-I^BA2-I^BA3-B^BA3-I^BA3-I^BA4-B^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I +1^B3^B5^B5^B2^B6^B6^B4^B3^B0^B7^B姜^B骏^B炜^B云^B南^B省^B德^B宏^B傣^B族^B景^B颇^B族^B自^B治^B州^B盈^B江^B县^B平^B原^B镇^B蜜^B回^B路^B下^B段 T-B^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BP-B^BP-I^BP-I^BA1-B^BA1-I^BA1-I^BA2-B^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA3-B^BA3-I^BA3-I^BA4-B^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I +``` +数据集中以特殊字符"\t"分隔文本、标签,以特殊字符"\002"(示例中显示为"^B")分隔每个字。标签的定义如下: + +| 标签 | 定义 | 标签 | 定义 | +| -------- | -------- |-------- | -------- | +| P-B | 姓名起始位置 | P-I | 姓名中间位置或结束位置 | +| T-B | 电话起始位置 | T-I | 电话中间位置或结束位置 | +| A1-B | 省份起始位置 | A1-I | 省份中间位置或结束位置 | +| A2-B | 城市起始位置 | A2-I | 城市中间位置或结束位置 | +| A3-B | 县区起始位置 | A3-I | 县区中间位置或结束位置 | +| A4-B | 详细地址起始位置 | A4-I | 详细地址中间位置或结束位置 | +| O | 无关字符 | | | + +数据标注采用**BIO模式**。其中 B(begin) 表示一个标签类别的开头,比如 P-B 指的是姓名的开头;相应的,I(inside) 表示一个标签的延续。O表示Outside无关字符。更多标注模式介绍请参考[Inside–outside–beginning 
(tagging)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) + +### 启动训练 + +本项目提供了两种模型结构,一种是BiGRU+CRF结构,另一种是ERNIE+FC结构,前者显存占用小,推理速度快;后者能够在更快收敛并取得更高的精度,但推理速度较慢。 + +#### 启动BiGRU + CRF训练 + +```bash +export CUDA_VISIBLE_DEVICES=0 +python run_bigru_crf.py +``` + +#### 启动ERNIE + FC训练 + +```bash +export CUDA_VISIBLE_DEVICES=0 +python run_ernie.py +``` +##### 模型导出 +使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在output_path指定路径中。 运行方式: + +基于 `ERNIE` 的模型结构的导出方式 + +```bash +python export_ernie_model.py --params_path ernie_ckpt/model_80/model_state.pdparams --output_path=./output +``` + +基于 `ERNIE + CRF` 的模型结构的导出方式 + +```bash +python export_ernie_crf_model.py --params_path ernie_ckpt/model_80/model_state.pdparams --output_path=./output +``` + +基于 `BIGRU + CRF` 的模型结构的导出方式 + +```bash +python export_bigru_crf_model.py --params_path bigru_crf_ckpt/model_80/model_state.pdparams --output_path=./output +``` + +其中`params_path`是指动态图训练保存的参数路径,`output_path`是指静态图参数导出路径。 + +#### 模型部署 +导出模型之后,可以用于部署,deploy/python文件提供了python部署预测示例。运行方式: + +基于 `ERNIE` 的模型 + +```bash +python deploy/python/predict_ernie.py --model_dir ./output +``` + +基于 `ERNIE + CRF` 的模型 + +```bash +python deploy/python/predict_ernie_crf.py --model_dir ./output +``` + +基于 `BIGRU + CRF` 的模型 + +```bash +python deploy/python/predict_bigru_crf.py --model_dir ./output +``` + +## 更多详细教程请参考: + +[基于Bi-GRU+CRF的快递单信息抽取](https://aistudio.baidu.com/aistudio/projectdetail/1317771) + +[使用预训练模型ERNIE优化快递单信息抽取](https://aistudio.baidu.com/aistudio/projectdetail/1329361) diff --git a/examples/information_extraction/waybill_ie/data.py b/examples/information_extraction/waybill_ie/data.py new file mode 100644 index 0000000000000000000000000000000000000000..d276b45510749ffb56ba6636827dbc43b1d4c49c --- /dev/null +++ b/examples/information_extraction/waybill_ie/data.py @@ -0,0 +1,79 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.datasets import MapDataset + + +def load_dict(dict_path): + vocab = {} + i = 0 + with open(dict_path, "r", encoding="utf-8") as fin: + for line in fin: + key = line.strip("\n") + vocab[key] = i + i += 1 + return vocab + + +def load_dataset(datafiles): + def read(data_path): + with open(data_path, "r", encoding="utf-8") as fp: + next(fp) # Skip header + for line in fp.readlines(): + words, labels = line.strip("\n").split("\t") + words = words.split("\002") + labels = labels.split("\002") + yield words, labels + + if isinstance(datafiles, str): + return MapDataset(list(read(datafiles))) + elif isinstance(datafiles, list) or isinstance(datafiles, tuple): + return [MapDataset(list(read(datafile))) for datafile in datafiles] + + +def parse_decodes(sentences, predictions, lengths, label_vocab): + """Parse the padding result + + Args: + sentences (list): the tagging sentences. + predictions (list): the prediction tags. + lengths (list): the valid length of each sentence. + label_vocab (dict): the label vocab. 
+ + Returns: + outputs (list): the formatted output. + """ + predictions = [x for batch in predictions for x in batch] + lengths = [x for batch in lengths for x in batch] + id_label = dict(zip(label_vocab.values(), label_vocab.keys())) + + outputs = [] + for idx, end in enumerate(lengths): + sent = sentences[idx][:end] + tags = [id_label[x] for x in predictions[idx][:end]] + sent_out = [] + tags_out = [] + words = "" + for s, t in zip(sent, tags): + if t.endswith("-B") or t == "O": + if len(words): + sent_out.append(words) + tags_out.append(t.split("-")[0]) + words = s + else: + words += s + if len(sent_out) < len(tags_out): + sent_out.append(words) + outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) + return outputs diff --git a/examples/information_extraction/waybill_ie/deploy/python/predict_bigru_crf.py b/examples/information_extraction/waybill_ie/deploy/python/predict_bigru_crf.py new file mode 100644 index 0000000000000000000000000000000000000000..2578b69c4c6c09228e4970416e9cde9d89e15e80 --- /dev/null +++ b/examples/information_extraction/waybill_ie/deploy/python/predict_bigru_crf.py @@ -0,0 +1,290 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from functools import partial + +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.utils.log import logger + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--model_dir", type=str, default="./output", help="The path to parameters in static graph.") +parser.add_argument( + "--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located." +) +parser.add_argument("--batch_size", type=int, default=200, help="The number of sequences contained in a mini-batch.") +parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", +) +parser.add_argument( + "--use_tensorrt", default=False, type=eval, choices=[True, False], help="Enable to use tensorrt to speed up." +) +parser.add_argument( + "--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help="The tensorrt precision." +) +parser.add_argument("--cpu_threads", default=10, type=int, help="Number of threads to predict when using cpu.") +parser.add_argument( + "--enable_mkldnn", + default=False, + type=eval, + choices=[True, False], + help="Enable to use mkldnn to speed up when using cpu.", +) +parser.add_argument( + "--benchmark", type=eval, default=False, help="To log some information about environment and running." 
+) +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() + + +def load_dict(dict_path): + vocab = {} + i = 0 + with open(dict_path, "r", encoding="utf-8") as fin: + for line in fin: + key = line.strip("\n") + vocab[key] = i + i += 1 + return vocab + + +def load_vocab(dict_path): + """Load vocab from file""" + vocab = {} + reverse = None + with open(dict_path, "r", encoding="utf8") as fin: + for i, line in enumerate(fin): + terms = line.strip("\n").split("\t") + if len(terms) == 2: + if reverse is None: + reverse = True if terms[0].isdigit() else False + if reverse: + value, key = terms + else: + key, value = terms + elif len(terms) == 1: + key, value = terms[0], i + else: + raise ValueError("Error line: %s in file: %s" % (line, dict_path)) + vocab[key] = value + return vocab + + +def parse_decodes(sentences, predictions, lengths, label_vocab): + """Parse the padding result + + Args: + sentences (list): the tagging sentences. + predictions (list): the prediction tags. + lengths (list): the valid length of each sentence. + label_vocab (dict): the label vocab. + + Returns: + outputs (list): the formatted output. + """ + predictions = [x for batch in predictions for x in batch] + lengths = [x for batch in lengths for x in batch] + id_label = dict(zip(label_vocab.values(), label_vocab.keys())) + + outputs = [] + for idx, end in enumerate(lengths): + sent = sentences[idx][:end] + print(predictions[idx][:end]) + tags = [id_label[x] for x in predictions[idx][:end]] + sent_out = [] + tags_out = [] + words = "" + for s, t in zip(sent, tags): + if t.endswith("-B") or t == "O": + if len(words): + sent_out.append(words) + tags_out.append(t.split("-")[0]) + words = s + else: + words += s + if len(sent_out) < len(tags_out): + sent_out.append(words) + outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) + return outputs + + +def convert_tokens_to_ids(tokens, vocab, oov_token=None): + token_ids = [] + oov_id = vocab.get(oov_token) if oov_token else None + for token in tokens: + token_id = vocab.get(token, oov_id) + token_ids.append(token_id) + return token_ids + + +def convert_to_features(example, word_vocab): + tokens = example[0] + token_ids = convert_tokens_to_ids(tokens, word_vocab, "OOV") + return token_ids, len(token_ids) + + +def read(data_path): + with open(data_path, "r", encoding="utf-8") as fp: + next(fp) # Skip header + for line in fp.readlines(): + words, labels = line.strip("\n").split("\t") + words = words.split("\002") + labels = labels.split("\002") + yield words, labels + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + batch_size=200, + use_tensorrt=False, + precision="fp32", + enable_mkldnn=False, + benchmark=False, + save_log_path="", + ): + self.batch_size = batch_size + model_file = os.path.join(model_dir, "inference.pdmodel") + param_file = os.path.join(model_dir, "inference.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(param_file): + raise ValueError("not find params file path {}".format(param_file)) + config = paddle.inference.Config(model_file, param_file) + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + 
precision_mode = precision_map[precision] + + if use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-3.0-medium-zh", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def predict(self, dataset, batchify_fn, word_vocab, label_vocab): + if args.benchmark: + self.autolog.times.start() + all_preds = [] + all_lens = [] + num_of_examples = len(dataset) + trans_func = partial(convert_to_features, word_vocab=word_vocab) + start_idx = 0 + while start_idx < num_of_examples: + end_idx = start_idx + self.batch_size + end_idx = end_idx if end_idx < num_of_examples else num_of_examples + batch_data = [trans_func(example) for example in dataset[start_idx:end_idx]] + + if args.benchmark: + self.autolog.times.stamp() + input_ids, lens = batchify_fn(batch_data) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(lens) + self.predictor.run() + preds = self.output_handle.copy_to_cpu() + + if args.benchmark: + self.autolog.times.stamp() + # Drop CLS prediction + all_preds.append(preds) + print(preds.shape) + all_lens.append(lens) + + start_idx += self.batch_size + + if args.benchmark: + self.autolog.times.end(stamp=True) + sentences = [example[0] for example in dataset.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +if __name__ == "__main__": + test_ds = load_dataset(read, data_path=os.path.join(args.data_dir, "test.txt"), lazy=False) + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + word_vocab = load_dict(os.path.join(args.data_dir, "word.dic")) + + trans_func = partial(convert_to_features, word_vocab=word_vocab) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=word_vocab.get("OOV", 0), dtype="int64"), # token_ids + Stack(dtype="int64"), # seq_len + ): fn(samples) + + predictor = Predictor( + args.model_dir, + args.device, + args.batch_size, + args.use_tensorrt, + args.precision, + args.enable_mkldnn, + args.benchmark, + args.save_log_path, + ) + + results = predictor.predict(test_ds, batchify_fn, word_vocab, label_vocab) + print("\n".join(results)) + if args.benchmark: + predictor.autolog.report() diff --git a/examples/information_extraction/waybill_ie/deploy/python/predict_ernie.py b/examples/information_extraction/waybill_ie/deploy/python/predict_ernie.py new file mode 100644 index 
0000000000000000000000000000000000000000..bdd3ccfeba9b0808c0f5434915bbb0bc5f7e7eb5 --- /dev/null +++ b/examples/information_extraction/waybill_ie/deploy/python/predict_ernie.py @@ -0,0 +1,283 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--model_dir", type=str, default="./output", help="The path to parameters in static graph.") +parser.add_argument( + "--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located." +) +parser.add_argument("--batch_size", type=int, default=200, help="The number of sequences contained in a mini-batch.") +parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", +) +parser.add_argument( + "--use_tensorrt", default=False, type=eval, choices=[True, False], help="Enable to use tensorrt to speed up." +) +parser.add_argument( + "--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help="The tensorrt precision." +) +parser.add_argument("--cpu_threads", default=10, type=int, help="Number of threads to predict when using cpu.") +parser.add_argument( + "--enable_mkldnn", + default=False, + type=eval, + choices=[True, False], + help="Enable to use mkldnn to speed up when using cpu.", +) +parser.add_argument( + "--benchmark", type=eval, default=False, help="To log some information about environment and running." +) +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() + + +def load_dict(dict_path): + vocab = {} + i = 0 + with open(dict_path, "r", encoding="utf-8") as fin: + for line in fin: + key = line.strip("\n") + vocab[key] = i + i += 1 + return vocab + + +def load_vocab(dict_path): + """Load vocab from file""" + vocab = {} + reverse = None + with open(dict_path, "r", encoding="utf8") as fin: + for i, line in enumerate(fin): + terms = line.strip("\n").split("\t") + if len(terms) == 2: + if reverse is None: + reverse = True if terms[0].isdigit() else False + if reverse: + value, key = terms + else: + key, value = terms + elif len(terms) == 1: + key, value = terms[0], i + else: + raise ValueError("Error line: %s in file: %s" % (line, dict_path)) + vocab[key] = value + return vocab + + +def parse_decodes(sentences, predictions, lengths, label_vocab): + """Parse the padding result + + Args: + sentences (list): the tagging sentences. + predictions (list): the prediction tags. + lengths (list): the valid length of each sentence. + label_vocab (dict): the label vocab. 
+ + Returns: + outputs (list): the formatted output. + """ + predictions = [x for batch in predictions for x in batch] + lengths = [x for batch in lengths for x in batch] + id_label = dict(zip(label_vocab.values(), label_vocab.keys())) + + outputs = [] + for idx, end in enumerate(lengths): + sent = sentences[idx][:end] + tags = [id_label[x] for x in predictions[idx][:end]] + sent_out = [] + tags_out = [] + words = "" + for s, t in zip(sent, tags): + if t.endswith("-B") or t == "O": + if len(words): + sent_out.append(words) + tags_out.append(t.split("-")[0]) + words = s + else: + words += s + if len(sent_out) < len(tags_out): + sent_out.append(words) + outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) + return outputs + + +def convert_to_features(example, tokenizer): + tokens = example[0] + tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token") + # Token '[CLS]' and '[SEP]' will get label 'O' + return tokenized_input["input_ids"], tokenized_input["token_type_ids"], tokenized_input["seq_len"] + + +def read(data_path): + with open(data_path, "r", encoding="utf-8") as fp: + next(fp) # Skip header + for line in fp.readlines(): + words, labels = line.strip("\n").split("\t") + words = words.split("\002") + labels = labels.split("\002") + yield words, labels + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + batch_size=200, + use_tensorrt=False, + precision="fp32", + enable_mkldnn=False, + benchmark=False, + save_log_path="", + ): + self.batch_size = batch_size + model_file = os.path.join(model_dir, "inference.pdmodel") + param_file = os.path.join(model_dir, "inference.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(param_file): + raise ValueError("not find params file path {}".format(param_file)) + config = paddle.inference.Config(model_file, param_file) + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-3.0-medium-zh", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", 
"postprocess_time"], + warmup=0, + logger=logger, + ) + + def predict(self, dataset, batchify_fn, tokenizer, label_vocab): + if args.benchmark: + self.autolog.times.start() + all_preds = [] + all_lens = [] + num_of_examples = len(dataset) + trans_func = partial(convert_to_features, tokenizer=tokenizer) + start_idx = 0 + while start_idx < num_of_examples: + end_idx = start_idx + self.batch_size + end_idx = end_idx if end_idx < num_of_examples else num_of_examples + batch_data = [trans_func(example) for example in dataset[start_idx:end_idx]] + + if args.benchmark: + self.autolog.times.stamp() + input_ids, segment_ids, lens = batchify_fn(batch_data) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + + if args.benchmark: + self.autolog.times.stamp() + preds = np.argmax(logits, axis=-1) + # Drop CLS prediction + preds = preds[:, 1:] + all_preds.append(preds) + all_lens.append(lens) + + start_idx += self.batch_size + + if args.benchmark: + self.autolog.times.end(stamp=True) + sentences = [example[0] for example in dataset.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +if __name__ == "__main__": + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + test_ds = load_dataset(read, data_path=os.path.join(args.data_dir, "test.txt"), lazy=False) + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # seq_len + ): fn(samples) + + predictor = Predictor( + args.model_dir, + args.device, + args.batch_size, + args.use_tensorrt, + args.precision, + args.enable_mkldnn, + args.benchmark, + args.save_log_path, + ) + + results = predictor.predict(test_ds, batchify_fn, tokenizer, label_vocab) + print("\n".join(results)) + if args.benchmark: + predictor.autolog.report() diff --git a/examples/information_extraction/waybill_ie/deploy/python/predict_ernie_crf.py b/examples/information_extraction/waybill_ie/deploy/python/predict_ernie_crf.py new file mode 100644 index 0000000000000000000000000000000000000000..1158a49aafe25b4b3cc7ec6677a3f52978a927dc --- /dev/null +++ b/examples/information_extraction/waybill_ie/deploy/python/predict_ernie_crf.py @@ -0,0 +1,263 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import argparse +import os +from functools import partial + +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--model_dir", type=str, default='./output', help="The path to parameters in static graph.") +parser.add_argument("--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located.") +parser.add_argument("--batch_size", type=int, default=200, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# yapf: enable + + +def load_dict(dict_path): + vocab = {} + i = 0 + with open(dict_path, "r", encoding="utf-8") as fin: + for line in fin: + key = line.strip("\n") + vocab[key] = i + i += 1 + return vocab + + +def load_vocab(dict_path): + """Load vocab from file""" + vocab = {} + reverse = None + with open(dict_path, "r", encoding="utf8") as fin: + for i, line in enumerate(fin): + terms = line.strip("\n").split("\t") + if len(terms) == 2: + if reverse is None: + reverse = True if terms[0].isdigit() else False + if reverse: + value, key = terms + else: + key, value = terms + elif len(terms) == 1: + key, value = terms[0], i + else: + raise ValueError("Error line: %s in file: %s" % (line, dict_path)) + vocab[key] = value + return vocab + + +def parse_decodes(sentences, predictions, lengths, label_vocab): + """Parse the padding result + + Args: + sentences (list): the tagging sentences. + predictions (list): the prediction tags. + lengths (list): the valid length of each sentence. + label_vocab (dict): the label vocab. + + Returns: + outputs (list): the formatted output. 
+ """ + predictions = [x for batch in predictions for x in batch] + lengths = [x for batch in lengths for x in batch] + id_label = dict(zip(label_vocab.values(), label_vocab.keys())) + + outputs = [] + for idx, end in enumerate(lengths): + sent = sentences[idx][:end] + tags = [id_label[x] for x in predictions[idx][:end]] + sent_out = [] + tags_out = [] + words = "" + for s, t in zip(sent, tags): + if t.endswith("-B") or t == "O": + if len(words): + sent_out.append(words) + tags_out.append(t.split("-")[0]) + words = s + else: + words += s + if len(sent_out) < len(tags_out): + sent_out.append(words) + outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) + return outputs + + +def convert_to_features(example, tokenizer): + tokens = example[0] + tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token") + # Token '[CLS]' and '[SEP]' will get label 'O' + return tokenized_input["input_ids"], tokenized_input["token_type_ids"], tokenized_input["seq_len"] + + +def read(data_path): + with open(data_path, "r", encoding="utf-8") as fp: + next(fp) # Skip header + for line in fp.readlines(): + words, labels = line.strip("\n").split("\t") + words = words.split("\002") + labels = labels.split("\002") + yield words, labels + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + batch_size=200, + use_tensorrt=False, + precision="fp32", + enable_mkldnn=False, + benchmark=False, + save_log_path="", + ): + self.batch_size = batch_size + model_file = os.path.join(model_dir, "inference.pdmodel") + param_file = os.path.join(model_dir, "inference.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(param_file): + raise ValueError("not find params file path {}".format(param_file)) + config = paddle.inference.Config(model_file, param_file) + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-3.0-medium-zh", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def 
predict(self, dataset, batchify_fn, tokenizer, label_vocab): + if args.benchmark: + self.autolog.times.start() + all_preds = [] + all_lens = [] + num_of_examples = len(dataset) + trans_func = partial(convert_to_features, tokenizer=tokenizer) + start_idx = 0 + while start_idx < num_of_examples: + end_idx = start_idx + self.batch_size + end_idx = end_idx if end_idx < num_of_examples else num_of_examples + batch_data = [trans_func(example) for example in dataset[start_idx:end_idx]] + + if args.benchmark: + self.autolog.times.stamp() + input_ids, segment_ids, lens = batchify_fn(batch_data) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.input_handles[2].copy_from_cpu(lens) + self.predictor.run() + preds = self.output_handle.copy_to_cpu() + + if args.benchmark: + self.autolog.times.stamp() + preds = [pred[1:] for pred in preds] + all_preds.append(preds) + all_lens.append(lens) + + start_idx += self.batch_size + + if args.benchmark: + self.autolog.times.end(stamp=True) + sentences = [example[0] for example in dataset.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +if __name__ == "__main__": + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + test_ds = load_dataset(read, data_path=os.path.join(args.data_dir, "test.txt"), lazy=False) + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # seq_len + ): fn(samples) + + predictor = Predictor( + args.model_dir, + args.device, + args.batch_size, + args.use_tensorrt, + args.precision, + args.enable_mkldnn, + args.benchmark, + args.save_log_path, + ) + + results = predictor.predict(test_ds, batchify_fn, tokenizer, label_vocab) + print("\n".join(results)) + if args.benchmark: + predictor.autolog.report() diff --git a/examples/information_extraction/waybill_ie/download.py b/examples/information_extraction/waybill_ie/download.py new file mode 100644 index 0000000000000000000000000000000000000000..a76b56b99aeed624ec9c71b8b8b512ea7889ad35 --- /dev/null +++ b/examples/information_extraction/waybill_ie/download.py @@ -0,0 +1,32 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
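+
+# Downloads and unpacks the waybill example dataset. A possible invocation
+# (the target directory is just the -d/--data_dir default):
+#
+#   python download.py --data_dir ./
+#
+# get_path_from_url fetches waybill.tar.gz from the URL below and extracts it
+# under data_dir.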
+ +import argparse +import sys + +from paddle.utils.download import get_path_from_url + +URL = "https://bj.bcebos.com/paddlenlp/paddlenlp/datasets/waybill.tar.gz" + + +def main(arguments): + parser = argparse.ArgumentParser() + parser.add_argument("-d", "--data_dir", help="directory to save data to", type=str, default="./") + args = parser.parse_args(arguments) + get_path_from_url(URL, args.data_dir) + + +if __name__ == "__main__": + sys.exit(main(sys.argv[1:])) diff --git a/examples/information_extraction/waybill_ie/export_bigru_crf_model.py b/examples/information_extraction/waybill_ie/export_bigru_crf_model.py new file mode 100644 index 0000000000000000000000000000000000000000..b439dc30836afd303e81344b39958021d58d8c3d --- /dev/null +++ b/examples/information_extraction/waybill_ie/export_bigru_crf_model.py @@ -0,0 +1,60 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from data import load_dict +from model import BiGRUWithCRF + +parser = argparse.ArgumentParser() +parser.add_argument( + "--params_path", + type=str, + required=True, + default="./checkpoint/model_900/model_state.pdparams", + help="The path to model parameters to be loaded.", +) +parser.add_argument( + "--output_path", type=str, default="./output", help="The path of model parameter in static graph to be saved." +) +parser.add_argument( + "--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located." +) +args = parser.parse_args() + +if __name__ == "__main__": + # The number of labels should be in accordance with the training dataset. + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + word_vocab = load_dict(os.path.join(args.data_dir, "word.dic")) + + # Define the model netword and its loss + model = BiGRUWithCRF(300, 256, len(word_vocab), len(label_vocab)) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None], dtype="int64"), # lengths + ], + ) + + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/information_extraction/waybill_ie/export_ernie_crf_model.py b/examples/information_extraction/waybill_ie/export_ernie_crf_model.py new file mode 100644 index 0000000000000000000000000000000000000000..9f1d6839e54e52047f5a96b11ee454c67ad9aac5 --- /dev/null +++ b/examples/information_extraction/waybill_ie/export_ernie_crf_model.py @@ -0,0 +1,55 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from data import load_dict +from model import ErnieCrfForTokenClassification + +from paddlenlp.transformers import AutoModelForTokenClassification + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default="./checkpoint/model_900/model_state.pdparams", help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default="./output", help="The path of model parameter in static graph to be saved.") +parser.add_argument("--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located.") +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + # The number of labels should be in accordance with the training dataset. + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + + # Define the model netword and its loss + ernie = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab)) + model = ErnieCrfForTokenClassification(ernie) + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + paddle.static.InputSpec(shape=[None], dtype="int64"), # lengths + ], + ) + + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/information_extraction/waybill_ie/export_ernie_model.py b/examples/information_extraction/waybill_ie/export_ernie_model.py new file mode 100644 index 0000000000000000000000000000000000000000..2436a98b4af8878c7ce59fb29efbd5132f85efcc --- /dev/null +++ b/examples/information_extraction/waybill_ie/export_ernie_model.py @@ -0,0 +1,52 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
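+
+# Converts a fine-tuned dygraph checkpoint into a static graph for inference.
+# A possible invocation (the checkpoint path is an assumption; point it at a
+# model_state.pdparams saved by run_ernie.py):
+#
+#   python export_ernie_model.py \
+#       --params_path ./ernie_ckpt/model_<step>/model_state.pdparams \
+#       --output_path ./output
+#
+# paddle.jit.save then writes output/inference.pdmodel and output/inference.pdiparams,
+# which deploy/python/predict_ernie.py loads via its --model_dir argument.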
+ +import argparse +import os + +import paddle +from data import load_dict + +from paddlenlp.transformers import AutoModelForTokenClassification + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default="./checkpoint/model_900/model_state.pdparams", help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default="./output", help="The path of model parameter in static graph to be saved.") +parser.add_argument("--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located.") +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + # The number of labels should be in accordance with the training dataset. + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + + model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab)) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/information_extraction/waybill_ie/model.py b/examples/information_extraction/waybill_ie/model.py new file mode 100644 index 0000000000000000000000000000000000000000..d6d1e8dfb36fa194dad4b7030c4073ec9151ca6b --- /dev/null +++ b/examples/information_extraction/waybill_ie/model.py @@ -0,0 +1,76 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle +import paddle.nn as nn + +from paddlenlp.embeddings import TokenEmbedding +from paddlenlp.layers.crf import LinearChainCrf, LinearChainCrfLoss +from paddlenlp.utils.tools import compare_version + +if compare_version(paddle.version.full_version, "2.2.0") >= 0: + # paddle.text.ViterbiDecoder is supported by paddle after version 2.2.0 + from paddle.text import ViterbiDecoder +else: + from paddlenlp.layers.crf import ViterbiDecoder + + +class BiGRUWithCRF(nn.Layer): + def __init__(self, emb_size, hidden_size, word_num, label_num, use_w2v_emb=False): + super(BiGRUWithCRF, self).__init__() + if use_w2v_emb: + self.word_emb = TokenEmbedding(extended_vocab_path="./data/word.dic", unknown_token="OOV") + else: + self.word_emb = nn.Embedding(word_num, emb_size) + self.gru = nn.GRU(emb_size, hidden_size, num_layers=2, direction="bidirect") + # We need `label_num + 2` for appending BOS and EOS tag + self.fc = nn.Linear(hidden_size * 2, label_num + 2) + self.crf = LinearChainCrf(label_num) + self.crf_loss = LinearChainCrfLoss(self.crf) + self.viterbi_decoder = ViterbiDecoder(self.crf.transitions) + + def forward(self, inputs, lengths, labels=None): + embs = self.word_emb(inputs) + output, _ = self.gru(embs) + emission = self.fc(output) + if labels is not None: + loss = self.crf_loss(emission, lengths, labels) + return loss + else: + _, prediction = self.viterbi_decoder(emission, lengths) + return prediction + + +class ErnieCrfForTokenClassification(nn.Layer): + def __init__(self, ernie, crf_lr=100): + super().__init__() + self.num_labels = ernie.num_labels + self.ernie = ernie # allow ernie to be config + self.crf = LinearChainCrf(self.num_labels, crf_lr=crf_lr, with_start_stop_tag=False) + self.crf_loss = LinearChainCrfLoss(self.crf) + self.viterbi_decoder = ViterbiDecoder(self.crf.transitions, False) + + def forward( + self, input_ids, token_type_ids=None, lengths=None, position_ids=None, attention_mask=None, labels=None + ): + logits = self.ernie( + input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, position_ids=position_ids + ) + + if labels is not None: + loss = self.crf_loss(logits, lengths, labels) + return loss + else: + _, prediction = self.viterbi_decoder(logits, lengths) + return prediction diff --git a/examples/information_extraction/waybill_ie/run_bigru_crf.py b/examples/information_extraction/waybill_ie/run_bigru_crf.py new file mode 100644 index 0000000000000000000000000000000000000000..f458d36de5b39123cf387f1dc3e55c749af2eba8 --- /dev/null +++ b/examples/information_extraction/waybill_ie/run_bigru_crf.py @@ -0,0 +1,149 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
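+
+# Trains the BiGRU-CRF tagger on the waybill dataset. A possible invocation
+# (paths mirror the argparse defaults below):
+#
+#   python run_bigru_crf.py --data_dir ./waybill_ie/data --device gpu
+#
+# Checkpoints are written to <save_dir>/model_<step>/model_state.pdparams and can
+# later be converted to a static graph with export_bigru_crf_model.py.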
+ +import argparse +import os +from functools import partial + +import paddle +from data import load_dataset, load_dict, parse_decodes +from model import BiGRUWithCRF + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator + +parser = argparse.ArgumentParser() + +# yapf: disable +parser.add_argument("--save_dir", default='./bigru_crf_ckpt', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--data_dir", default='./waybill_ie/data', type=str, help="The folder where the dataset is located.") + +args = parser.parse_args() +# yapf: enable + + +def convert_tokens_to_ids(tokens, vocab, oov_token=None): + token_ids = [] + oov_id = vocab.get(oov_token) if oov_token else None + for token in tokens: + token_id = vocab.get(token, oov_id) + token_ids.append(token_id) + return token_ids + + +def convert_to_features(example, word_vocab, label_vocab): + tokens, labels = example + token_ids = convert_tokens_to_ids(tokens, word_vocab, "OOV") + label_ids = convert_tokens_to_ids(labels, label_vocab, "O") + return token_ids, len(token_ids), label_ids + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + model.eval() + metric.reset() + for token_ids, lengths, label_ids in data_loader: + preds = model(token_ids, lengths) + n_infer, n_label, n_correct = metric.compute(lengths, preds, label_ids) + metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy()) + precision, recall, f1_score = metric.accumulate() + print("[EVAL] Precision: %f - Recall: %f - F1: %f" % (precision, recall, f1_score)) + model.train() + + +@paddle.no_grad() +def predict(model, data_loader, ds, label_vocab): + all_preds = [] + all_lens = [] + for token_ids, lengths, label_ids in data_loader: + preds = model(token_ids, lengths) + all_preds.append(preds.numpy()) + all_lens.append(lengths) + sentences = [example[0] for example in ds.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # Create dataset, tokenizer and dataloader. 
+ train_ds, dev_ds, test_ds = load_dataset( + datafiles=( + os.path.join(args.data_dir, "train.txt"), + os.path.join(args.data_dir, "dev.txt"), + os.path.join(args.data_dir, "test.txt"), + ) + ) + + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + word_vocab = load_dict(os.path.join(args.data_dir, "word.dic")) + + trans_func = partial(convert_to_features, word_vocab=word_vocab, label_vocab=label_vocab) + train_ds.map(trans_func) + dev_ds.map(trans_func) + test_ds.map(trans_func) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=word_vocab.get("OOV", 0), dtype="int32"), # token_ids + Stack(dtype="int64"), # seq_len + Pad(axis=0, pad_val=label_vocab.get("O", 0), dtype="int64"), # label_ids + ): fn(samples) + + train_loader = paddle.io.DataLoader( + dataset=train_ds, + batch_size=args.batch_size, + shuffle=True, + drop_last=True, + return_list=True, + collate_fn=batchify_fn, + ) + + dev_loader = paddle.io.DataLoader( + dataset=dev_ds, batch_size=args.batch_size, drop_last=True, return_list=True, collate_fn=batchify_fn + ) + + test_loader = paddle.io.DataLoader( + dataset=test_ds, batch_size=args.batch_size, drop_last=True, return_list=True, collate_fn=batchify_fn + ) + + # Define the model netword and its loss + model = BiGRUWithCRF(300, 256, len(word_vocab), len(label_vocab)) + + optimizer = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()) + metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True) + + step = 0 + for epoch in range(args.epochs): + for token_ids, lengths, label_ids in train_loader: + loss = model(token_ids, lengths, label_ids) + loss = loss.mean() + loss.backward() + optimizer.step() + optimizer.clear_grad() + step += 1 + print("[TRAIN] Epoch:%d - Step:%d - Loss: %f" % (epoch, step, loss)) + evaluate(model, metric, dev_loader) + paddle.save(model.state_dict(), os.path.join(args.save_dir, "model_%d" % step, "model_state.pdparams")) + + preds = predict(model, test_loader, test_ds, label_vocab) + file_path = "bigru_crf_results.txt" + with open(file_path, "w", encoding="utf8") as fout: + fout.write("\n".join(preds)) + # Print some examples + print("The results have been saved into: %s, some examples are shown below: " % file_path) + print("\n".join(preds[:10])) diff --git a/examples/information_extraction/waybill_ie/run_ernie.py b/examples/information_extraction/waybill_ie/run_ernie.py new file mode 100644 index 0000000000000000000000000000000000000000..d21baad79a7772be433eb8c11d363787bd800953 --- /dev/null +++ b/examples/information_extraction/waybill_ie/run_ernie.py @@ -0,0 +1,166 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
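+
+# Fine-tunes an ERNIE token-classification model on the waybill dataset. The
+# script initializes paddle.distributed when more than one trainer is detected,
+# so multi-card training can be launched as below (a sketch; the GPU ids are an
+# example):
+#
+#   python -m paddle.distributed.launch --gpus "0,1" run_ernie.py \
+#       --data_dir ./waybill_ie/data --save_dir ./ernie_ckpt
+#
+# A plain `python run_ernie.py ...` runs single-card training.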
+ +import argparse +import os +from functools import partial + +import paddle +from data import load_dataset, load_dict, parse_decodes + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default="./ernie_ckpt", type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--data_dir", default="./waybill_ie/data", type=str, help="The folder where the dataset is located.") +args = parser.parse_args() +# fmt: on + + +def convert_to_features(example, tokenizer, label_vocab): + tokens, labels = example + tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token") + # Token '[CLS]' and '[SEP]' will get label 'O' + labels = ["O"] + labels + ["O"] + tokenized_input["labels"] = [label_vocab[x] for x in labels] + return ( + tokenized_input["input_ids"], + tokenized_input["token_type_ids"], + tokenized_input["seq_len"], + tokenized_input["labels"], + ) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + model.eval() + metric.reset() + for input_ids, seg_ids, lens, labels in data_loader: + logits = model(input_ids, seg_ids) + preds = paddle.argmax(logits, axis=-1) + n_infer, n_label, n_correct = metric.compute(lens, preds, labels) + metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy()) + precision, recall, f1_score = metric.accumulate() + print("[EVAL] Precision: %f - Recall: %f - F1: %f" % (precision, recall, f1_score)) + model.train() + + +@paddle.no_grad() +def predict(model, data_loader, ds, label_vocab): + all_preds = [] + all_lens = [] + for input_ids, seg_ids, lens, labels in data_loader: + logits = model(input_ids, seg_ids) + preds = paddle.argmax(logits, axis=-1) + # Drop CLS prediction + preds = [pred[1:] for pred in preds.numpy()] + all_preds.append(preds) + all_lens.append(lens) + sentences = [example[0] for example in ds.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + # Create dataset, tokenizer and dataloader. 
+ train_ds, dev_ds, test_ds = load_dataset( + datafiles=( + os.path.join(args.data_dir, "train.txt"), + os.path.join(args.data_dir, "dev.txt"), + os.path.join(args.data_dir, "test.txt"), + ) + ) + + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_to_features, tokenizer=tokenizer, label_vocab=label_vocab) + + train_ds.map(trans_func) + dev_ds.map(trans_func) + test_ds.map(trans_func) + + ignore_label = -1 + + def batchify_fn(samples): + fn = Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int32"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int32"), # token_type_ids + Stack(dtype="int64"), # seq_len + Pad(axis=0, pad_val=label_vocab.get("O", 0), dtype="int64"), # labels + ) + return fn(samples) + + train_loader = create_dataloader( + dataset=train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn + ) + + dev_loader = create_dataloader(dataset=dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn) + + test_loader = create_dataloader(dataset=test_ds, mode="test", batch_size=args.batch_size, batchify_fn=batchify_fn) + + # Define the model netword and its loss + model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab)) + if trainer_num > 1: + model = paddle.DataParallel(model) + metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True) + loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label) + optimizer = paddle.optimizer.AdamW(learning_rate=2e-5, parameters=model.parameters()) + + step = 0 + for epoch in range(args.epochs): + for input_ids, token_type_ids, length, labels in train_loader: + logits = model(input_ids, token_type_ids) + loss = paddle.mean(loss_fn(logits, labels)) + loss.backward() + optimizer.step() + optimizer.clear_grad() + step += 1 + print("[TRAIN] Epoch:%d - Step:%d - Loss: %f" % (epoch, step, loss)) + evaluate(model, metric, dev_loader) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(os.path.join(args.save_dir, "model_%d" % step)) + + if rank == 0: + preds = predict(model, test_loader, test_ds, label_vocab) + file_path = "ernie_results.txt" + with open(file_path, "w", encoding="utf8") as fout: + fout.write("\n".join(preds)) + # Print some examples + print("The results have been saved in the file: %s, some examples are shown below: " % file_path) + print("\n".join(preds[:10])) diff --git a/examples/information_extraction/waybill_ie/run_ernie_crf.py b/examples/information_extraction/waybill_ie/run_ernie_crf.py new file mode 100644 index 0000000000000000000000000000000000000000..b9d03b77e6437e12745cc70bbbe35f3301b2a2cc --- /dev/null +++ b/examples/information_extraction/waybill_ie/run_ernie_crf.py @@ -0,0 +1,147 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
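+
+# A sketch of the intended end-to-end flow for the ERNIE-CRF variant, pieced
+# together from the scripts in this example (paths are illustrative, matching
+# the defaults of each script):
+#
+#   1. python run_ernie_crf.py --data_dir ./waybill_ie/data
+#      # checkpoints land in ./ernie_crf_ckpt/model_<step>/model_state.pdparams
+#   2. python export_ernie_crf_model.py \
+#          --params_path ./ernie_crf_ckpt/model_<step>/model_state.pdparams \
+#          --output_path ./output
+#   3. python deploy/python/predict_ernie_crf.py --model_dir ./output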
+ +import argparse +import os +from functools import partial + +import paddle +from data import load_dataset, load_dict, parse_decodes +from model import ErnieCrfForTokenClassification + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default="./ernie_crf_ckpt", type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--data_dir", default="./waybill_ie/data", type=str, help="The folder where the dataset is located.") +args = parser.parse_args() +# fmt: on + + +def convert_to_features(example, tokenizer, label_vocab): + tokens, labels = example + tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token") + # Token '[CLS]' and '[SEP]' will get label 'O' + labels = ["O"] + labels + ["O"] + tokenized_input["labels"] = [label_vocab[x] for x in labels] + return ( + tokenized_input["input_ids"], + tokenized_input["token_type_ids"], + tokenized_input["seq_len"], + tokenized_input["labels"], + ) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + model.eval() + metric.reset() + for input_ids, seg_ids, lens, labels in data_loader: + preds = model(input_ids, seg_ids, lengths=lens) + n_infer, n_label, n_correct = metric.compute(lens, preds, labels) + metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy()) + precision, recall, f1_score = metric.accumulate() + print("[EVAL] Precision: %f - Recall: %f - F1: %f" % (precision, recall, f1_score)) + model.train() + + +@paddle.no_grad() +def predict(model, data_loader, ds, label_vocab): + all_preds = [] + all_lens = [] + for input_ids, seg_ids, lens, labels in data_loader: + preds = model(input_ids, seg_ids, lengths=lens) + # Drop CLS prediction + preds = [pred[1:] for pred in preds.numpy()] + all_preds.append(preds) + all_lens.append(lens) + sentences = [example[0] for example in ds.data] + results = parse_decodes(sentences, all_preds, all_lens, label_vocab) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # Create dataset, tokenizer and dataloader. 
+ train_ds, dev_ds, test_ds = load_dataset( + datafiles=( + os.path.join(args.data_dir, "train.txt"), + os.path.join(args.data_dir, "dev.txt"), + os.path.join(args.data_dir, "test.txt"), + ) + ) + + label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_to_features, tokenizer=tokenizer, label_vocab=label_vocab) + + train_ds.map(trans_func) + dev_ds.map(trans_func) + test_ds.map(trans_func) + + def batchify_fn(samples): + fn = Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int32"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int32"), # token_type_ids + Stack(dtype="int64"), # seq_len + Pad(axis=0, pad_val=label_vocab.get("O", 0), dtype="int64"), # labels + ) + return fn(samples) + + train_loader = paddle.io.DataLoader( + dataset=train_ds, batch_size=args.batch_size, return_list=True, collate_fn=batchify_fn + ) + dev_loader = paddle.io.DataLoader( + dataset=dev_ds, batch_size=args.batch_size, return_list=True, collate_fn=batchify_fn + ) + test_loader = paddle.io.DataLoader( + dataset=test_ds, batch_size=args.batch_size, return_list=True, collate_fn=batchify_fn + ) + + # Define the model netword and its loss + ernie = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab)) + model = ErnieCrfForTokenClassification(ernie) + + metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True) + optimizer = paddle.optimizer.AdamW(learning_rate=2e-5, parameters=model.parameters()) + + step = 0 + for epoch in range(args.epochs): + for input_ids, token_type_ids, lengths, labels in train_loader: + loss = model(input_ids, token_type_ids, lengths=lengths, labels=labels) + avg_loss = paddle.mean(loss) + avg_loss.backward() + optimizer.step() + optimizer.clear_grad() + step += 1 + print("[TRAIN] Epoch:%d - Step:%d - Loss: %f" % (epoch, step, avg_loss)) + evaluate(model, metric, dev_loader) + + paddle.save(model.state_dict(), os.path.join(args.save_dir, "model_%d" % step, "model_state.pdparams")) + + preds = predict(model, test_loader, test_ds, label_vocab) + file_path = "ernie_crf_results.txt" + with open(file_path, "w", encoding="utf8") as fout: + fout.write("\n".join(preds)) + # Print some examples + print("The results have been saved in the file: %s, some examples are shown below: " % file_path) + print("\n".join(preds[:10])) diff --git a/examples/language_model/bert b/examples/language_model/bert new file mode 100644 index 0000000000000000000000000000000000000000..d4b3e50787afc1c64b23948635b163a7f181e76c --- /dev/null +++ b/examples/language_model/bert @@ -0,0 +1 @@ +../../model_zoo/bert \ No newline at end of file diff --git a/examples/language_model/bigbird/README.md b/examples/language_model/bigbird/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9bc3f26b8346b94e4f496086320dfe4df40035a0 --- /dev/null +++ b/examples/language_model/bigbird/README.md @@ -0,0 +1,151 @@ +# Big Bird + +## 模型介绍 +[Big Bird](https://arxiv.org/abs/2007.14062)(Transformers for Longer Sequences) 是Google的研究人员提出的针对长序列预训练模型,使用了稀疏注意力机制,将计算复杂度、空间复杂度降到线性复杂度,大大提升了长序列任务的预测能力。 + +本项目是 Big Bird 的 PaddlePaddle 实现, 包含模型训练,模型验证等内容。以下是本例的简要目录结构及说明: + +```text +. 
+├── args.py # 预训练任务的配置 +├── run_classifier.py # IMDB数据集的分类任务 +├── run_pretrain.py # 预训练任务脚本 +├── README.md # 文档 +└── data/ # 示例数据 +``` +## 快速开始 + +### 环境依赖 + +- sentencepiece + +安装命令:`pip install sentencepiece` + +### 数据准备 +根据论文中的信息,目前 Big Bird 的预训练数据是主要是由 Books,CC-News,Stories, Wikipedia 4种预训练数据来构造,用户可以根据自己的需要来下载和清洗相应的数据。目前已提供一份示例数据在 data 目录。 + + +### 预训练任务 + +下面是预训练任务的具体的执行方式 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" --log_dir log run_pretrain.py --model_name_or_path bigbird-base-uncased \ + --input_dir "./data" \ + --output_dir "output" \ + --batch_size 4 \ + --weight_decay 0.01 \ + --learning_rate 1e-5 \ + --max_steps 100000 \ + --save_steps 10000 \ + --logging_steps 1 \ + --max_encoder_length 512 \ + --max_pred_length 75 +``` + +其中参数释义如下: + +- `gpus` paddle.distributed.launch参数,用于指定使用哪张显卡。单卡格式:"0";多卡格式:"0,1,2"。 +- `log_dir` paddle.distributed.launch参数,用于指定训练日志输出的目录,默认值为`log`。(注意,如果需要在同一目录多次启动run_pretrain.py,需要设置不同的log_dir,否则日志会重定向到相同的文件中)。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。目前支持的训练模型配置有:"bigbird-base-uncased"。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./checkpoint/model_xx/" +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `output_dir` 指定输出文件。 +- `batch_size` 训练的batch大小 +- `weight_decay` AdamW权重衰减参数 +- `learning_rate` 训练的学习率 +- `max_steps` 最大训练步数 +- `save_steps` 保存模型间隔 +- `logging_steps` 打印日志的步数 +- `max_encoder_length` MLM任务的最大的token数目 +- `max_pred_length` MLM任务最大的需要预测token的数目 + + +### 验证任务 + +#### Imdb分类任务 +通过预训练任务训练完成之后,可以预训练的模型参数,在 Big Bird 的验证任务中通过IMDB数据集来进行最终模型效果的验证,[IMDB数据集](http://ai.stanford.edu/~amaas/data/sentiment/) ,IMDB数据集是关于电影用户评论情感分析的数据集,主要是包含了50000条偏向明显的评论,其中25000条作为训练集,25000作为测试集。label为pos(positive)和neg(negative),是一个序列文本分类任务,具体的执行脚本如下。 + + +```shell +export CUDA_VISIBLE_DEVICES=0 +python run_classifier.py --model_name_or_path bigbird-base-uncased \ + --output_dir "output" \ + --batch_size 2 \ + --learning_rate 5e-6 \ + --max_steps 16000 \ + --save_steps 1000 \ + --max_encoder_length 3072 +``` + +其中参数释义如下: + +- `model_name_or_path` 指示了finetune使用的具体预训练模型以及预训练时使用的tokenizer,目前支持的预训练模型有:"bigbird-base-uncased"。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./checkpoint/model_xx/"。 +- `output_dir` 指定输出文件。 +- `batch_size` 训练的batch大小。 +- `learning_rate` 训练的学习率。 +- `max_steps` 最大训练步数。 +- `save_steps` 保存模型间隔。 +- `logging_steps` 打印日志的步数。 +- `max_encoder_length` MLM任务的最大的token数目。 + + +基于`bigbird-base-uncased`在IMDB评测任务上Fine-tuning后,在验证集上有如下结果: + +| Task | Metric | Result | +|:-----:|:----------------------------:|:-----------------:| +| IMDB | Accuracy | 0.9449 | + +#### Glue任务 + +以GLUE中的SST-2任务为例,启动Fine-tuning的方式如下: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type bigbird \ + --model_name_or_path bigbird-base-uncased \ + --task_name SST-2 \ + --max_encoder_length 128 \ + --batch_size 32 \ + --learning_rate 1e-5 \ + --epochs 5 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/ \ + --device gpu +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,使用bigbird模型时设置为bigbird即可。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型以及预训练时使用的tokenizer,目前支持的预训练模型有:"bigbird-base-uncased"。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./checkpoint/model_xx/"。 +- `task_name` 表示Fine-tuning的任务。 +- `max_encoder_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 
'cpu'表示使用CPU, 'npu'表示使用华为昇腾卡。
+
+基于`bigbird-base-uncased`在GLUE各评测任务上Fine-tuning后,在验证集上有如下结果:
+
+| Task  | Metric                       | Result            |
+|:-----:|:----------------------------:|:-----------------:|
+| SST-2 | Accuracy                     | 0.9365            |
+| QNLI  | Accuracy                     | 0.9017            |
+| CoLA  | Matthew's corr               | 0.5708            |
+| MRPC  | F1/Accuracy                  | 0.9019 / 0.8603   |
+| STS-B | Pearson/Spearman corr        | 0.8591 / 0.8607   |
+| QQP   | Accuracy/F1                  | 0.9132 / 0.8828   |
+| MNLI  | Matched acc/Mismatched acc   | 0.8615 / 0.8606   |
+| RTE   | Accuracy                     | 0.7004            |
+
+### 致谢
+
+* 感谢[Google 研究团队](https://github.com/google-research/bigbird)提供BigBird开源代码的实现以及预训练模型。
+
+### 参考论文
+
+* Zaheer, et al. "Big bird: Transformers for longer sequences" Advances in Neural Information Processing Systems, 2020
diff --git a/examples/language_model/bigbird/args.py b/examples/language_model/bigbird/args.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d33e05c31b2cbfd1381e7a08f95b95d0d408372
--- /dev/null
+++ b/examples/language_model/bigbird/args.py
@@ -0,0 +1,104 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+
+from paddlenlp.trainer.argparser import strtobool
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model_type", default="bigbird", type=str, help="Model type selected in training model.")
+
+    parser.add_argument(
+        "--model_name_or_path",
+        default="bigbird-base-uncased",
+        type=str,
+        help="Path to pre-trained model or shortcut model name for training model.",
+    )
+
+    parser.add_argument(
+        "--input_dir", default=None, type=str, help="The input directory where the data will be read from."
+    )
+
+    parser.add_argument(
+        "--output_dir",
+        default=None,
+        type=str,
+        required=True,
+        help="The output directory where the model predictions and checkpoints will be written.",
+    )
+
+    parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
+
+    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for AdamW.")
+
+    parser.add_argument("--warmup_steps", default=10000, type=int, help="Linear warmup over warmup_steps.")
+
+    parser.add_argument(
+        "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps."
+    )
+
+    parser.add_argument(
+        "--weight_decay", default=0.01, type=float, help="Weight decay rate applied in the AdamW optimizer."
+    )
+
+    parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for AdamW optimizer.")
+
+    parser.add_argument(
+        "--max_steps", default=100000, type=int, help="If > 0: set total number of training steps to perform."
+ ) + + parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") + + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + + parser.add_argument("--seed", type=int, default=42, help="Random seed for initialization.") + + parser.add_argument( + "--device", + type=str, + default="gpu", + choices=["cpu", "gpu", "npu"], + help="Select cpu, gpu, xpu, npu devices to train model.", + ) + + parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") + + parser.add_argument( + "--max_encoder_length", + type=int, + default=512, + help="The maximum total input sequence length after SentencePiece tokenization.", + ) + + parser.add_argument( + "--max_pred_length", default=75, type=int, help="The maximum total of masked tokens in input sequence." + ) + + parser.add_argument( + "--use_nsp", default=False, type=bool, help="Whether or not add the nsp loss to the total loss." + ) + + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + + parser.add_argument( + "--task_name", + default="sst-2", + type=str, + required=False, + help="The name of the task to train selected in the list: sst-2, cola, mrpc, sts-b, qqp, mnli, qnli, rte", + ) + + args = parser.parse_args() + return args diff --git a/examples/language_model/bigbird/data/data.txt b/examples/language_model/bigbird/data/data.txt new file mode 100644 index 0000000000000000000000000000000000000000..8f264377effd4e015f777600c0998298257cfa56 --- /dev/null +++ b/examples/language_model/bigbird/data/data.txt @@ -0,0 +1,200 @@ + _START_ARTICLE_ Kasturba Road _START_PARAGRAPH_ Kasturba Road is a street in Bangalore, the capital of Karnataka, India, which is connected to M G Road to the north and J C Road to the south. Some important landmarks situated along Kasturba Road are Sree Kanteerava Stadium, Kanteerava Indoor Stadium, Cubbon Park, Government Museum, Venkatappa Art Gallery, Visvesvaraya Industrial and Technological Museum and UB City. A 600-year-old Ganesha temple is also situated on Kasturba Road._NEWLINE_It was earlier known as Sydney Road._NEWLINE_Other important landmarks close to the road are Karnataka High Court, Vidhana Soudha and Chinnaswamy Stadium. + _START_ARTICLE_ Amazon (yacht) _START_SECTION_ Construction _START_PARAGRAPH_ Carvel planked in teak and pitch pine on oak frames, with alternate wrought iron strap floor reinforcement, bronze fastenings, lead keel and copper sheathing, the Amazon's hull is still largely original. _START_SECTION_ History _START_PARAGRAPH_ Her builder and first owner, Tankerville Chamberlayne, an English gentleman, personally supervised her construction by his own Arrow Yard at Northam on the River Itchen. This small private facility was established by the Chamberlayne family for the maintenance of the famous cutter Arrow, which was adapted continuously and thereby kept racing competitively into the 1890s. Amazon's engine and boiler were supplied by the adjacent works of Day, Summers and Company._NEWLINE_Amazon was used for summer cruising, to attend sailing regattas along the south coast of England, and to visit France. 
Having been prepared appropriately for the occasion of Queen Victoria's Diamond Jubilee Royal Fleet Review in 1897 (at which Turbinia made her debut), she was shortly after sold to a prominent French yachtsman and was based at Saint Malo as Armoricain until 1900, when she returned to British ownership._NEWLINE_Already too old (and with a coal-fired compound engine thought to be rather too old-fashioned) for the First World War, she remained in south coast ports as a private yacht. A new owner took her to London and after 52 years of service her original engine and boiler were removed on her conversion to diesel in 1937. The Second World War put paid to pleasure cruising and she subsequently became a houseboat for some years in a west London Yacht Basin._NEWLINE_The actor Arthur Lowe bought her as a houseboat in 1968 and, encouraged by his surveyor's positive report, made her ready for sea again in 1971; at first a private yacht she then pursued a successful charter business in the 1980s, before migrating to northern Scotland in 1990._NEWLINE_In 1997 she made passage from Scotland to Malta, where her new owners used her for cruising in the Mediterranean. In 2009 Amazon crossed the Atlantic Ocean via the Cape Verde Islands and travelled in the Caribbean and to Bermuda._NEWLINE_She arrived at Newport, Rhode Island, United States from Bermuda on Labor Day 2009. Amazon was hosted by the Herreshoff Marine Museum at Bristol, Rhode Island in October 2009. She spent time in Narragansett Bay. The yacht subsequently travelled to Mystic Seaport in late 2009 and was based there in early 2011. Amazon remained at Mystic Seaport until mid-2011._NEWLINE_Amazon acted as Flagship for the Commodore of the Mystic River Yacht Club for a charity regatta in Long Island Sound in June 2011 and visited Canada in July 2011 _NEWLINE_In August 2011 the yacht made a trans-Atlantic passage from Newfoundland to Ireland, and arrived at Waterford on 2 September 2011 where she was described by a local boat owner as the "classiest motor boat I have ever seen!". She remained at Waterford for the winter._NEWLINE_In May 2012 she visited Bristol before sailing to London, where she took part in the Thames Diamond Jubilee Pageant on Sunday 3 June 2012. She was the only vessel present that had also witnessed the Diamond Jubilee Fleet Review for Queen Victoria at Spithead on 26 June 1897. The Director of National Historic Ships referred to her in his public letter of criticism concerning the BBC's coverage of the event._NEWLINE_She was subsequently at the Ramsgate Maritime Museum until late June, at Shoreham on 28 June 2012, then at Cowes and in the Bassin Vauban at St Malo, France in late July 2012._NEWLINE_In August and September 2012, Amazon was in the Channel Islands, visiting Alderney in August and Jersey in September, berthing in St Helier and Gorey Harbours; on 13 September she was in St Aubin's Bay to watch the 2012 Jersey International Air Display._NEWLINE_She was in Bristol during the winter and at the Southampton Maritime Festival on 5 & 6 May 2013._NEWLINE_On 23 May she was in the Bristol Channel en route to Gloucester where she arrived on 24 May for the city's Tall Ships Festival on 25 & 26 May, and was on the Gloucester and Sharpness Canal during June. She was at Gorey, Jersey on 22 July 2013 and had returned to Malta by October that year. 
+ _START_ARTICLE_ Dolores Kuenz _START_SECTION_ Table tennis career _START_PARAGRAPH_ She won a World Championship gold medal in the Women's Team event at the 1937 World Table Tennis Championships known as the Corbillon Cup. she was the captain of the team. _START_SECTION_ Hall of Fame _START_PARAGRAPH_ She was inducted into the USA Hall of Fame in 1979. + _START_ARTICLE_ Mark E. Clayton _START_SECTION_ Background _START_PARAGRAPH_ Clayton was born in Mobile in south Alabama, but he was reared in Alexandria, Virginia. His parents, Mr. and Mrs. William Gill Clayton, were Goldwater Republicans. His father lobbied Congress on religious liberty issues. Clayton graduated from high school and served in the United States Army Reserve. He studied to be an aircraft electrician before he enrolled at Pensacola Christian College in Pensacola, Florida, from which he graduated in 2002._NEWLINE_In 2003, Clayton moved to Nashville. His father died in 2004 and Clayton purchased a 1920s-era farmhouse on three acres in Whites Creek in suburban Davidson County. Clayton, who has never been married, lives in Whites Creek with his dog, Saint._NEWLINE_Clayton has worked at numerous jobs, including Target, a call center, as a floor installer, and as a salesperson of insurance, siding, and roofs. He is a church youth group leader. He is currently employed with a moving company. _START_SECTION_ 2008 U.S. Senate candidacy _START_PARAGRAPH_ Clayton first ran for U.S. Senate in 2008, when he finished fourth among six candidates in the Democratic primary with 32,309 votes. Bob Tuke won the nomination with 59,000 votes and was then decisively defeated in the general election by the Republican incumbent Lamar Alexander of Maryville, whom Clayton had described as a "neo-conservative", _START_SECTION_ 2012 candidacy for U.S. Senate _START_PARAGRAPH_ After Clayton's primary triumph, the Tennessee Democratic Party disavowed his candidacy and his vice-presidency of the socially conservative interest group, the Public Advocate of the United States, based in Washington, D.C. The group has been labelled a hate group by the Southern Poverty Law Center for its anti-gay rhetoric._NEWLINE_The Tennessee Democratic Party issued this statement:_NEWLINE_The only time that Clayton has voted in a Democratic primary was when he was voting for himself. Many Democrats in Tennessee knew nothing about any of the candidates in the race; so they voted for the person at the top of the ticket. Unfortunately, none of the other Democratic candidates was able to run the race needed to gain statewide visibility or support. Mark Clayton is associated with a known hate group in Washington, D.C., and the Tennessee Democratic Party disavows his candidacy, will not do anything to promote or support him in any way, and urges Democrats to write-in a candidate of their choice in November._NEWLINE_Clayton won the Democratic nomination with 30% of the vote, despite raising no money and having a website that was four years out of date. The next day Tennessee's Democratic Party disavowed the candidate over his active role in the Public Advocate of the United States, which they described as a "known hate group". They blamed his victory among candidates, for whom the TNDP provided little forums to become known, on the fact that his name appeared first on the ballot, and said they would do nothing to help his campaign, urging Democrats to vote for "the write-in candidate of their choice" in November. 
One of the Democratic candidates, Larry Crim, filed a petition seeking to offer the voters a new primary in which to select a Democratic Nominee based on Democratic Chair Chip Forrester permitting Clayton, a nondemocratic candidate, at the top of ballot to benefit a candidate Forrester recruited and improperly endorsed - Overall - who did not win the primary. Forrester then disavowed Clayton he had allowed on the ballot after he received the most votes. The background is that the TNDP placed little emphasis on the U.S. Senate race in 2012 to replace Corker. Treasurer and Financial benefactor of the TNDP Bill Freeman who was under Forrester actually contributed to Republican Bob Corker's campaign and was later removed from office, followed by Forrester's not seeking another term subsequent to the TNDP 2012 fiasco. Crim filed a preliminary motion seeking a temporary restraining order against certification of the results until the merits of the case for a new primary could be decided. Yet, after a judge denied the temporary restraining order Crim withdrew his petition stating at a news conference outside the Federal Courthouse that the costs of proceeding and the costs of a new primary to the Democratic Party, even if Crim won, would be overwhelming especially given the political realities the party leaders conducted and permitted to the detriment of any Democratic Nominee. Mr. Crim was subsequently elected Chair of Democrats United For Tennessee in 2012._NEWLINE_Clayton's nomination has been compared to that of the previously unknown Alvin Greene in the Democratic primary in the 2010 Senate race in South Carolina. Greene was then handily defeated by the Republican Jim DeMint. _START_SECTION_ 2014 candidacy for Tennessee governor _START_PARAGRAPH_ Clayton attempted to register as a candidate in the Democratic primary for the Tennessee governor in 2014. The Democratic Party of Tennessee denied his attempt to run in the primary election, describing him as, "not a bona fide Democrat." Clayton filed a suit in federal court, but lost when the judge found there were no grounds for the suit at the federal level. _START_SECTION_ Political positions _START_PARAGRAPH_ Clayton opposes abortion and same-sex marriage. He believes that the Transportation Security Administration (TSA) should be shut down, stating that the TSA "mandates [transsexuals] and homosexuals grabbing children in their stranger-danger zones." He opposes national ID cards. The Washington Post and Vox describe him as a conspiracy theorist. _START_SECTION_ Public Advocate of the United States _START_PARAGRAPH_ Clayton's work at the Public Advocate of the United States, a conservative group based in Washington, D.C., has come under scrutiny. The group has been designated as an anti-gay hate group by the Alabama-based Southern Poverty Law Center._NEWLINE_In a press release The Public Advocate proclaimed that it "associates with members of both major parties in a non-partisan fashion and promotes traditional values". The organization contends that Clayton has demonstrated that "an American patriot can put his or her name on the ballot and win big as a conservative, even in the Democratic Party." A Clayton spokesman criticized the state Democratic party for disowning the nominee and argued that the state party had violated the law by using its resources to attack one of its own candidates. The Clayton campaign also said that it would file a complaint with the Federal Election Commission. + _START_ARTICLE_ Ben T. 
Elliott _START_SECTION_ Life _START_PARAGRAPH_ Elliott was born on November 6, 1944, in Philadelphia, Pennsylvania. He is a 1966 graduate of Bucknell University and a 1970 graduate of The Institut d'Études Politiques de Paris. At Bucknell he played free safety for the Bison football team and was a member of the Phi Gamma Delta fraternity. In 1962 he graduated from the Haverford School, which honored him with its Distinguished Alumni Award in 2007._NEWLINE_In 1978, he worked for the U.S. Chamber of Commerce._NEWLINE_In 1980, he was hired as White house speech writer, by Ken Khachigian._NEWLINE_In 1983, he succeeded him as chief speech writer._NEWLINE_When Treasury Secretary Don Regan took over as Reagan’s chief of staff in 1985, he asked Elliott to resign, which occurred in 1986. Eleanor Clift reported, “Elliott, a staunch believer in supply side economics and a fervent right-to-life advocate, clashed with Regan and other top presidential assistants over the rhetorical tone of the President's State of the Union address last January. Although Elliott was widely believed to have won the battle, his language was nevertheless softened considerably.”_NEWLINE_After leaving the White House, Elliott wrote speeches for Jack Kemp, William Simon, Steve Forbes, IBM, Pepsi, Goldman Sachs and the New York Stock Exchange. He serves as a speechwriter at Bank of America and is a trustee of Bucknell University. + _START_ARTICLE_ The Ex-File 3: The Return of the Exes _START_SECTION_ Release _START_PARAGRAPH_ It was released in China on 29 December 2017 _START_SECTION_ Box office _START_PARAGRAPH_ As of 26 January 2018, it has earned ¥1.9 billion in China._NEWLINE_The film gained some notoriety in the overseas press for beating Star Wars: The Last Jedi at the Chinese box office. + _START_ARTICLE_ Fluor-liddicoatite _START_SECTION_ Crystal habit _START_PARAGRAPH_ Crystals are stout prismatic, with a curved convex trigonal outline, generally elongated and striated parallel to the c axis. Crystals are hemimorphic, meaning that the two ends of the crystal have different forms. Fluor-liddicoatite usually has a pedion (a single crystal face) opposite one or two pyramids. _START_SECTION_ Physical properties _START_PARAGRAPH_ The color is usually smoky brown, but also pink, red, green, blue, or rarely white. Color zoning is abundant at the type locality, parallel to pyramid faces. This is due to changes in the solution during crystal growth. As the concentration of trace elements that serve as coloring agents changes, there will be areas of less or more color in different parts of the crystal. When the crystal is sliced perpendicular to the c axis, triangular zoning may be seen, together with a trigonal star that radiates from the centre of the crystal, with the three rays directed towards the corners of the triangular color patterns. _NEWLINE_The pink-red color is due to the manganese Mn³⁺ content, and the green color is due to intervalence charge transfer transactions between iron Fe²⁺ and titanium Ti⁴⁺._NEWLINE__NEWLINE_The streak is white to very light brown, lighter than the mass color, luster is vitreous and crystals are transparent to translucent._NEWLINE_ _NEWLINE_Cleavage is poor perpendicular to the c crystal axis, or it may be totally absent. The mineral is brittle, with an uneven to conchoidal fracture. It is very hard, with hardness ​7 ¹⁄₂, a little harder than zircon, making it suitable for use as a gemstone. Specific gravity is 3.02, a little lighter than fluorite. 
It is neither fluorescent nor radioactive. _START_SECTION_ Optical properties _START_PARAGRAPH_ Fluor-liddicoatite is uniaxial (-), with _NEWLINE_refractive Indices Nₒ = 1.637 and Nₑ = 1.621 for the type specimen. The refractive indices, however, will vary from specimen to specimen, as they depend on the content of iron and manganese, which are usually present as trace elements.Pleochroism is strong: O dark brown or pink, E light brown or pale pink. _START_SECTION_ Environment _START_PARAGRAPH_ Fluor-liddicoatite is detrital in soil at the type locality, presumably derived from the weathering of granitic pegmatites. Associated minerals are quartz, elbaite, albite and micas. + _START_ARTICLE_ Château Bilquin de Cartier _START_SECTION_ History _START_PARAGRAPH_ Origins of the château can be traced back to the 17th century, around 1635, when the Honoré family builds a castle on the Sambre river bank. The place had formerly been occupied by a seigneurial manor which was destroyed on 21 July 1554._NEWLINE_In 1667, the unfinished Spanish fortress of Charleroy is captured by Louis XIV's troops during the War of Devolution. As the castle in Marchienne was located in neutral territory (under authority of the Prince-Bishopric of Liège), it was used as a hospital for both French and Spanish soldiers._NEWLINE_In 1695, the castle is bought by Guillaume de Bilquin, a wealthy forge owner, who completes and enhances it. In 1717, his daughter, Marie-Agnès Bilquin, marries Jean-Louis Cartier, son of the general treasurer of the prince-bishop of Liège. As such, the castle becomes the property of the Cartier de Marchienne family._NEWLINE_In 1740, the castle hosts Remacle Le Loup, a famous draftsman from the Liège region. It is severely damaged by a fire in 1932, and bought over by the municipality of Marchienne-au-Pont in 1938, ending more than two centuries of ownership by the Cartier family._NEWLINE_Marguerite Yourcenar, a Belgian-born French novelist and essayist, and the first woman elected to the Académie française, is the daughter of Fernande de Cartier de Marchienne, from the Cartier family related to the Cartier castle. She visited the castle in Marchienne-au-Pont in 1956, and mentions her Cartier de Marchienne ancestry and the castle in her 1974 memoir Dear Departed: A Memoir (French: Souvenirs pieux)._NEWLINE_The Cartier castle was listed on 21 August 1980. It underwent restoration in phases between 1986 and 2001 (helped by ERDF), after having been left in a sorry condition (infested by dry rot). _START_SECTION_ Current condition _START_PARAGRAPH_ The castle today hosts a public library on the ground floor (Bibliothèque Marguerite Yourcenar), and administrative services of the Walloon region on the first floor. The library has a section dedicated to books in Turkish language._NEWLINE_The courtyard is equipped with benches and is publicly accessible as part of the Marchienne-au-Pont municipal park. The castle wing which was located on the southern side of the courtyard has been demolished to create an entrance for the De Cartier station of the Charleroi metro. At the tip of the western wing, a stone porch is adorned with the arms of the Bilquin-Baillencourt family, and a 1699 date inscription. Similarly, the lintel above the northern wing door shows a scalloped key with the arms of the Cartier family. 
Other demolished features include a barnyard where the Marchienne-au-Pont municipal swimming pool now stands._NEWLINE_A 19th century grain elevator in neo-renaissance style can be seen on the Sambre embankment. _START_SECTION_ Beijing replica _START_PARAGRAPH_ Emile de Cartier de Marchienne, Marguerite Yourcenar's uncle, who served as the Belgian ambassador in China at the start of the 20thcentury (1910-1917), ordered the construction of a Cartier castle replica, to serve as the Belgian legation building in Beijing. Plans were drawn in Marchienne-au-Pont, and bricks, slates, tiles, panelling and other materials were transported from Belgium to China for the construction. The building, which is now the Zijin Guest House, still exists in the Beijing Legation Quarter, although the original entrance has disappeared (Photo). + _START_ARTICLE_ Paralympic swimming _START_PARAGRAPH_ Paralympic swimming is an adaptation of the sport of swimming for athletes with disabilities. Paralympic swimmers compete at the Summer Paralympic Games and at other sports competitions throughout the world. The sport is governed by the International Paralympic Committee. Both men and women compete in Paralympic swimming, racing against competitors of their own gender. Swimming has been a part of the Paralympic program since the 1960 Summer Olympics in Rome, Italy. _START_SECTION_ Rules _START_PARAGRAPH_ Rules for the sport are adapted from those set forth by the International Swimming Federation (FINA). Swimmers compete individually in backstroke, breaststroke, butterfly, freestyle, individual medley, and as teams in relay races. At the Paralympics, World Championships and other elite level competitions, swimmers compete in an Olympic-size swimming pool._NEWLINE_Significant differences between able-bodied and Paralympic swimming include the starting position and adaptations allowed for visually impaired swimmers. Competitors may start a race by standing on a platform and diving into the pool, as in non-disabled swimming, or by sitting on the platform and diving in, or they may start the race in the water. In events for the blind and visually impaired, people called "tappers" may stand at the end of the pool and use a pole to tap the swimmers when they approach the wall, indicating when the swimmer should turn or end the race. No prostheses or assistive devices may be worn during competition. _START_SECTION_ Classification _START_PARAGRAPH_ Swimmers are classified according to the type and extent of their disability. The classification system allows swimmers to compete against others with a similar level of function._NEWLINE_Swimmers with physical disabilities are allocated a category between 1 and 10, with 1 corresponding to the most severe types of disability. Physical disabilities of Paralympic swimmers include single or multiple limb loss (through birth defects and/or amputation), cerebral palsy, spinal cord injuries (leading to paralysis or disability in limb coordination), dwarfism, and disabilities which impair the use of joints._NEWLINE_Blind and visually impaired swimmers compete within separate categories, being allocated to categories 11, 12 or 13. Category 11 corresponds to totally blind swimmers, while competitors in category 13 have severe but not total visual impairment. Category 11 swimmers compete with blackened goggles to ensure competitors are on an even level. 
Category 11 swimmers are also required to use tappers but they are optional for category 12 and 13._NEWLINE_Swimmers with mental disabilities compete in category 14._NEWLINE_Numbers are combined with a letter prefix depending on the event type. An "S" prefix corresponds to freestyle, backstroke and butterfly, while "SB" corresponds to breaststroke and "SM" to the medley. Hence, a swimmer with severe physical disabilities competing in backstroke may compete in an S3 event, while a blind swimmer in the medley would compete in class SM11._NEWLINE_For relay races, athletes from different classifications compete together, but the sum of their individual classifications must not exceed a given points total. For example, a relay team for a 34 points freestyle relay may consist of two S8 swimmers and two S9 swimmers (9 + 9 + 8 + 8 = 34), or an S10 swimmer and three S8 swimmers (10 + 8 + 8 + 8 = 34) + _START_ARTICLE_ Noah's Ark (1956 TV series) _START_SECTION_ Synopsis _START_PARAGRAPH_ Noah's Ark stars Paul Burke as young veterinarian Dr. Noah McCann, partner with the older Dr. Sam Rinehart, played by Victor Rodman (1892–1965), who in the series uses a wheelchair. May Wynn plays the young receptionist, Liz Clark._NEWLINE_Another similarly titled series, Second Noah, a family drama with Daniel Hugh Kelly in the title role of author Noah Beckett and Betsy Brantley as his veterinarian-wife, was televised on ABC from 1996 to 1997. _START_SECTION_ Production notes _START_PARAGRAPH_ Noah's Ark was created, produced, and directed by Jack Webb through his Mark VII Limited production company, and filmed at Revue Studios, later part of Universal Television. Its pilot episode on September 18, 1956, is titled "Jack Webb Presents". At the time, Webb and Ben Alexander co-starred on NBC's popular police drama Dragnet. In the October 2 episode of Noah's Ark titled "The Petition", a dispute develops over a rezoning request for the veterinary clinic. When Noah tries to reason with recalcitrant neighbors, violence results. _START_SECTION_ Scheduling _START_PARAGRAPH_ The 24th episode of the series, "Irmgaard's Problem", never aired. Instead, on March 5, 1957, the suspense series Panic! made its debut to replace Noah's Ark. The theme song for Noah's Ark was performed by the Hi-Los. _NEWLINE_Noah's Ark aired at 8:30 p.m. EST on Tuesdays, opposite The Brothers (with Gale Gordon, Bob Sweeney, and Barbara Billingsley) on CBS and Hugh O'Brian's The Life and Legend of Wyatt Earp on ABC. Noah's Ark followed the quiz show The Big Surprise and preceded the anthology series, The Jane Wyman Show on NBC. + _START_ARTICLE_ Katy de la Cruz _START_SECTION_ Early life _START_PARAGRAPH_ Catalina de la Cruz was born in Bustos, Bulacan. Even as a young child, de la Cruz would be hired to sing at town fiestas, and at intermissions during cockfights and boxing matches. Her formal schooling ended at the third grade._NEWLINE_In 1914, when she was seven years old, she was hired by the owner of a Manila film theater to sing to the audiences in between movie screenings. Such performances were typical in Manila theaters during that period, and from those routines would emerge a distinct genre eventually known as bodabil. She learned her songs through listening to phonograph records, and mastered the English language with the help of her brother. _START_SECTION_ Bodabil star _START_PARAGRAPH_ By the age of thirteen, de la Cruz was a rising star in the bodabil circuit, performing alongside other leading stage performers such as Atang de la Rama. 
She soon became a solo headliner, performing in Manila's largest theaters such as the Savoy, the Palace, and the Lux. By 1925, de la Cruz was the highest paid entertainer in the Philippines. She fell in love, and later married, the piano player of her stage show. Some of the chorus girls who performed alongside her onstage, such as Chichay, Etang Discher, and Mary Walter, later become prominent entertainers in their own right._NEWLINE_De la Cruz was acknowledged as a proficient performer of torch songs who drew comparisons to Sophie Tucker._NEWLINE_Initially, her signature tune was the bluesy ballad St. Louis Blues. After jazz became popular in the Philippines in the 1920s, de la Cruz adapted her singing style and soon mastered the art of scat singing, which became a trademark of hers. By the 1930s, de la Cruz would be most identified with the song Balut, a fast-paced jazzy tune written by Jerry Brandy. Her take on the song, which afforded her to showcase her scatting ability, has been described as impish and rustic, rounded out by her low, playfully dragging key. A slightly bawdy take, called "Balut", named for a notorious Filipino culinary delicacy of the same name, remains popular to date, with versions performed by the New Minstrels, Pilita Corrales, and Lani Misalucha._NEWLINE_She also occasionally acted in films, most prominently in Inspirasyon (1953), for which she received the FAMAS Best Supporting Actress Award in 1953. Many of her films were for Sampaguita Pictures. _START_SECTION_ Later career _START_PARAGRAPH_ As bodabil slowly declined, de la Cruz concentrated on concert performances and international tours. In the late 1940s and early 1950s, she was a top-billed performer at the famed Forbidden City nightclub in San Francisco. In 1961, she starred in her own stage show in Las Vegas. De la Cruz also performed concert tours in Thailand, Taiwan, Hong Kong, Singapore, Australia, and Hawaii._NEWLINE_De la Cruz eventually retired to San Francisco, California, though she would occasionally perform until the late 1980s. In 1989, she visited the Philippines to attend the premiere at the Cultural Center of the Philippines of Katy!, a highly publicized stage musical based on her life. _START_SECTION_ Family _START_PARAGRAPH_ Of de la Cruz's four children, her daughter Angie followed her into showbusiness, pairing with Nikki Ross to form Wing Duo, a singing tandem that was popular on the bodabil circuit and on film during the 1950s. + _START_ARTICLE_ Wood, North Carolina _START_PARAGRAPH_ Wood (formerly, Wood's Store) is a small unincorporated community in northeastern Franklin County, North Carolina, on North Carolina Highway 561 east of Centerville. Settled in 1893, Wood was incorporated as a town in 1917. The town charter was repealed on May 5, 1961. Wood lies at an elevation of 322 feet (98 m)._NEWLINE_Archibald Taylor House was listed on the National Register of Historic Places in 1975. + _START_ARTICLE_ Requiem Canticles (Robbins) _START_PARAGRAPH_ Requiem Canticles is a ballet made for New York City Ballet's Stravinsky Festival by balletmaster Jerome Robbins to eponymous music from 1966 by Igor Stravinsky. The premiere took place June 25, 1972, at the New York State Theater, Lincoln Center. + _START_ARTICLE_ Tennō, Akita _START_PARAGRAPH_ Tennō (天王町 Tennō-machi) was a town located in Minamiakita District, Akita Prefecture, Japan._NEWLINE_In 2003, the town had an estimated population of 22,115 and a density of 532.76 persons per km². 
The total area was 41.51 km²._NEWLINE_On March 22, 2005, Tennō, along with the towns of Iitagawa and Shōwa (all from Minamiakita District), merged to create the city of Katagami. + _START_ARTICLE_ James W. Jones _START_PARAGRAPH_ James William Jones (1843 – 26 April 1920) was a South Australian surveyor. He was the son of civil engineer Thomas Jones, founder of the Manchester Unity of Oddfellows in South Australia. He studied at J. L. Young's Adelaide Educational Institution. He was soon working for his father on the Port Elliot to Goolwa tramway, for which his father received official criticism. He joined the State public service as a draughtsman in 1865 and was appointed Chief Surveyor then Deputy Surveyor-General in the Department of Survey and Crown Lands. He explored the area north-east of Eucla in 1880, and discovered the Kudna rockhole and catacombs, an immense network of limestone caves, lakes and underground passages under the Nullarbor Plain. He was appointed Conservator of Water in 1887 and Secretary to the Commissioner of Works in 1902. He was secretary of the South Australia branch of the Royal Geographical Society of Australasia from its foundation in 1885 to 1894. He was elected president of the Institute of Surveyors in 1912. He was chairman and president of the Harbors and Marine Board in 1914. He was secretary of the Cheer-Up Society during the First World War._NEWLINE_He was appointed Companion of the Imperial Service Order in 1911. + _START_ARTICLE_ Mathilda Ebeling _START_PARAGRAPH_ Aurora Mathilda Ebeling (1826–1851) was a Swedish soprano opera singer. After first appearing as a concert pianist in 1842, she made her singing début at Stockholm's Mindre Theatre in 1844. She performed at the Royal Swedish Opera from 1846 to 1848 before further study in Paris and an engagement with Berlin's Royal Opera in 1850. _START_SECTION_ Biography _START_PARAGRAPH_ Born in Stockholm County on 11 October 1826, Mathilda Ebeling was the daughter of the flautist Johan Ludvig Ebeling and Aurora Olivia Björkman. She received piano lessons from Wilhelmina Josephson. When 15, she gave her first major concert as a pianist in Stockholm's Stock Exchange Hall where she showed promise as one of the country's most notable performers. Nevertheless, following the encouragement of the composer Johan Magnus Rosén, she decided to take singing lessons which led to her début in the opera Farinelli at the Mindre Theatre in 1844._NEWLINE_In 1845, she was admitted to the Royal Opera as a student, making a highly successful début the following year in The Magic Flute. The opera proved so popular that it ran for over a hundred performances, playing to full houses ten times in succession. Writing in Stockholms Figaro, a critic complimented her as a "bright star" who had "great promise with her full voice, a true sense of music and a propensity for deep perception". She went on to play the roles of Anna in Don Giovanni, Agathe in Der Freischütz, Adalgisa in Norma and the countess in The Marriage of Figaro. Her career in Sweden came to an abrupt end when she began suffering from a pulmonary ailment._NEWLINE_In 1848 Ebeling continued her studies in Paris under Manuel García. Thereafter she appeared in various locations in Germany, giving her last performance in Berlin, shortly before her death on 1 December 1851. + _START_ARTICLE_ William Marshall (illustrator) _START_PARAGRAPH_ William Marshall (fl. 
1617–1649) was a seventeenth-century British engraver and illustrator, best known for his print depicting "Charles the Martyr", a symbolic portrayal of King Charles I of England as a Christian martyr. _START_SECTION_ Early career _START_PARAGRAPH_ Nothing is known of Marshall's life beyond references to his career as an engraver. Marshall's earliest known work is the frontispiece to the book A Solemne Joviall Disposition Briefly Shadowing the Law of Drinking, which was published in 1617. In the 1630s he produced a number of portrait engravings and book frontispieces, depicting Puritan divines, poets, and figures associated with the High Church establishment of the day, such as William Laud._NEWLINE_His most ambitious work was the highly elaborate frontispiece to George Wither's 1635 Collection of Emblemes, Ancient and Moderne, an unusually complex example of the Emblem book. Wither left the design to Marshall, having given general instructions, but expressed himself exasperated with the result, on the grounds that its symbolism was thoroughly incoherent. As he wrote,_NEWLINE_Instead thereof, the Workman brought to light,_NEWLINE__NEWLINE_What, here, you see; therein, mistaking quite_NEWLINE__NEWLINE_The true Design: And, so (with pains, and cost)_NEWLINE__NEWLINE_The first intended FRONTISPIECE, is lost._NEWLINE_Wither's lengthy poem on the engraving claims that its apparently inconsistent symbolism revealed, unintentionally, a deeper truth. The lower part of the frontispiece depicts people wandering in confusion in a cave, apparently having emerged from a womb-like pool in which babies are shown swimming. They exit the cave to draw lots given to them by the goddess of Fortune, symbolic of their allotted place in life. They then climb up a mountain, which divides into two peaks, symbolic of the right and the wrong paths in life. The path to the peak on the right appears more attractive at first, but then becomes rocky and finally leads only to death; the path on the left is at first harder, but eventually becomes pleasant and leads to paradise. A Christian church is depicted on the left and a Pagan temple on the right._NEWLINE_Marshall also created forty-one of the seventy-nine plates in Francis Quarles's Emblems of the life of man._NEWLINE_In 1640 he created the image of William Shakespeare for John Benson's (notoriously inaccurate) edition of the poet's sonnets. This was an adapted and reversed version of the original Martin Droeshout print. Five years later, he created the image of John Milton surrounded by four muses for Milton's 1645 Poems. The muses are Melpomene (tragedy), Erato, (lyric poetry), Urania, (astronomy), and Clio (history). Like Wither, Milton was unimpressed by Marshall's work, considering the portrait to be deeply unflattering. He had Marshall engrave satirical verses written in Greek underneath the image. It is assumed that this was a practical joke on Marshall, who is unlikely to have known that he was engraving insults directed at himself. The verses read in translation,_NEWLINE_Looking at the form of the original, you could say, perhaps, that this likeness had been drawn by a rank beginner; but, my friends, since you do not recognize what is pictured here, have a chuckle at a caricature by a useless artist. + _START_ARTICLE_ Energy policy of Venezuela _START_SECTION_ Oil _START_PARAGRAPH_ Venezuela has been producing oil for nearly a century and was an OPEC founder-member. 
In 2005, Venezuela produced 162 million tons of oil, which is 4.1% of world's total production. By the oil production Venezuela ranks seventh in the world. Venezuela is the world's eight oil exporter and fifth largest net exporter. In 2012, 11 percent of US oil imports came from Venezuela._NEWLINE_Since 2010, when the heavy oil from the Orinoco Belt was considered to be economically recoverable, Venezuela has had the largest proved reserves of petroleum in the world, about 298 billion barrels._NEWLINE_Oil accounts for about half of total government revenues. The leading oil company is Petróleos de Venezuela S.A. (PDVSA), which according to Venezuelan authorities produces 3.3 million barrels per day (520,000 m³/d). However, oil industry analysts and the U.S. Energy Information Administration believe it to be only 2.8-2.9 million barrels per day (460,000 m³/d). Venezuela's main oil fields are located at four major sedimentary basins: Maracaibo, Falcon, Apure, and Oriental. PdVSA has 1.28 million barrels per day (204,000 m³/d) of crude oil refining capacity. The major facilities are the Paraguaná Refining Center, Puerto de la Cruz, and El Palito. _START_SECTION_ Natural gas _START_PARAGRAPH_ As of 2013, Venezuela has the eighth-largest proved gas reserves in the world and the largest in South America. Proved reserves were estimated at 5.5 trillion cubic meters (tcm). However, inadequate transportation and distribution infrastructure has prevented it from making the most of its resources. More than 70% of domestic gas production is consumed by the petroleum industry. Nearly 35% of gross natural gas output are re-injected in order to boost or maintain reservoir pressures, while smaller_NEWLINE_amounts (5%) are vented or flared. About 10% of production volumes are subject to shrinkage as a result of the extraction of NGLs. The 2010 estimate is 176 trillion cubic feet (5,000 km³), and the nation reportedly produced about 848 billion cubic feet (2.40×10¹⁰ m³) in 2008._NEWLINE_The leading gas company is PdVSA. The largest private natural gas producer is Repsol-YPF, who supplies 80-megawatt (MW) power station in Portuguesa, and plans to develop a 450-MW power plant in Obispos. _START_SECTION_ Tar sands and heavy oils _START_PARAGRAPH_ Venezuela has non-conventional oil deposits (extra-heavy crude oil, bitumen and tar sands) at 1,200 billion barrels (1.9×10¹¹ m³) approximately equal to the world's reserves of conventional oil. About 267 billion barrels (4.24×10¹⁰ m³) of this may be producible at current prices using current technology. The main deposits are located in the Orinoco Belt in central Venezuela (Orinoco tar sands), some deposits are also found in the Maracaibo Basin and Lake Guanoco, near the Caribbean coast. _START_SECTION_ Coal _START_PARAGRAPH_ Venezuela has recoverable coal reserves of approximately 528 million short tons (Mmst), most of which is bituminous._NEWLINE_Coal production was at 9.254 million short tons as of 2007. Most coal exports go to Latin American countries, the United States and Europe._NEWLINE_The main coal company in Venezuelas is Carbozulia, a former subsidiary of PdVSA, which is controlled by Venezuela's state development agency Corpozulia. The major coal-producing region in Venezuela is the Guasare Basin, which is located near the Colombian border. The coal industry development plans include the construction of a railway linking coal mines to the coast and a new deepwater port. 
_START_SECTION_ Electricity _START_PARAGRAPH_ The main electricity source is hydropower, which accounts for 71% in 2004. A gross theoretical capability of hydropower is 320 TWh per annum, of which 130 TWh per annum is considered as economically feasible. In 2004, Venezuela produced 70 TWh of hydropower, which accounts 2.5% of world's total. At the end of 2002, total installed hydroelectric generating capacity accounted 13.76 GW with additional 4.5 GW under construction and 7.4 GW of planned capacity._NEWLINE_Hydroelectricity production is concentrated on the Caroní River in Guayana Region. Today it has 4 different dams. The largest hydroplant is the Guri dam with 10,200 MW of installed capacity, which makes it the third-largest hydroelectric plant in the world. Other facilities on the Caroní are Caruachi, Macagua I, Macagua II and Macagua III, with a total of 15.910 MW of installed capacity in 2003. New dams, Tocoma (2 160 MW) and Tayucay (2 450 MW), are currently under construction between Guri and Caruachi. With a projected installed capacity for the whole Hydroelectric Complex (upstream Caroni River and downstream Caroni River), between 17.250 and 20.000 MW in 2010._NEWLINE_The largest power companies are state-owned CVG Electrificación del Caroní (EDELCA), a subsidiary of the mining company Corporación Venezolana de Guayana (CVG), and Compania Anonima de Administracion y Fomento Electrico (CADAFE) accounting respectively for approximately 63% and 18% of generating capacities. Other state-owned power companies are ENELBAR and ENELVEN-ENELCO (approximately 8% of capacities). In 2007, PDVSA bought 82.14% percent of Electricidad de Caracas (EDC) from AES Corporation as part of a renationalization program. Subsequently, the ownership share rose to 93.62% (December 2008). EDC has 11% of Venezuelan capacity, and owns the majority of conventional thermal power plants. The rest of the power production is owned by private companies._NEWLINE_The national transmission system (Sistema Inrterconectado Nacional- SIN) is composed by four interconnected regional transmission systems operated by EDELCA, CADAFE, EDC and ENELVEN-ENELCO. Oficina de Operacion de Sistema Interconectados (OPSIS), jointly owned by the four vertical integrated electric companies, operate the SIN under an RTPA regime. _START_SECTION_ Environmental issues _START_PARAGRAPH_ Prolonged oil production has resulted in significant oil pollution along the Caribbean coast. Hydrocarbons extraction has resulted also in the subsiding of the eastern shore of Lake Maracaibo, South America's largest lake. Venezuela is also the region's top emitter of carbon dioxide. _START_SECTION_ Regional cooperation _START_PARAGRAPH_ Venezuela has pushed the creation of regional oil initiatives for the Caribbean (Petrocaribe), the Andean region (Petroandino), and South America (Petrosur), and Latin America (Petroamerica). The initiatives include assistance for oil developments, investments in refining capacity, and preferential oil pricing. The most developed of these three is the Petrocaribe initiative, with 13 nations signed agreement in 2005. Under Petrocaribe, Venezuela will offer crude oil and petroleum products to Caribbean nations under preferential terms and prices. The payment system allows for a few nations to buy oil on market value but only a certain amount is needed up front; the remainder can be paid through a 25-year financing agreement on 1% interest. 
The deal allows for the Caribbean nations to purchase up to 185 million barrels (29,400,000 m³) of oil per day on these terms. In addition it allows for nations to pay part of the cost with other products provided to Venezuela, such as bananas, rice, and sugar._NEWLINE_In 2000, Venezuela and Cuba signed an agreement, which grants Venezuelan oil supplies to Cuba._NEWLINE_In 2006, the construction of the Trans-Caribbean gas pipeline, which will connect Venezuela and Colombia with extension to Panama (and probably to Nicaragua) began. The pipeline will pump gas from Colombia to Venezuela and, after 7 years, from Venezuela to Colombia. Venezuela has also proposed the project of Gran Gasoducto del Sur, which would connect Venezuela with Brazil and Argentina. There has been some discussion about constructing an oil pipeline to Colombia along the Pacific Ocean._NEWLINE_Venezuela also exports electricity to neighboring countries. Santa Elena/Boa Vista Interconnector permits electricity export to Brazil, and Cuatricenternario/Cuestecitas Interconnector and EI Corozo/San Mateo Interconnector to Colombia. + _START_ARTICLE_ Marvin Creamer _START_PARAGRAPH_ Marvin Creamer (born January 24, 1916) is a former college professor and amateur American sailor noted for having sailed around the globe without the aid of navigational instruments. Between December, 21, 1982, and May 17, 1984, Creamer and the crew of his 36-foot boat, Globe Star, circumnavigated the globe without a compass, sextant, watch, or other instruments. The ship spent 510 days at sea. As general guides, Creamer observed the sun and stars, currents, and occasionally the regional biological setting. In honor of his voyage, Rowan University created the Marvin Creamer Scholarship Fund. He turned 100 in January 2016. _START_SECTION_ Personal life _START_PARAGRAPH_ Creamer was born in Vineland, New Jersey, and taught Geography at Glassboro State College from 1948 until 1977. He has three children, six grandchildren, and two great-grandchildren. He was married to Blanche Creamer for 59 years until her death in 2005. + _START_ARTICLE_ Save a Child's Heart _START_PARAGRAPH_ Save a Child's Heart (SACH) is a humanitarian organization with a mission to improve the quality of pediatric cardiac care for children from developing countries who suffer from heart disease, and who cannot get adequate medical care in their home countries. It also works to create centers of pediatric cardiac competence in these countries, so these children can be treated at home. SACH was founded in 1996 and is based at the Edith Wolfson Medical Center near Tel Aviv, Israel. _START_SECTION_ History _START_PARAGRAPH_ Save a Child's Heart is the creation of Dr. Amram Cohen, and grew out of Cohen's experiences as a doctor serving with the U.S. Armed Forces in Korea in 1988, where he joined a program that helped poor local children with heart disease. The experience introduced him to a network of doctors doing similar work in developing countries, inspiring him to start his own program after moving to Israel in 1992. He brought three Ethiopian children to Israel for heart surgery in 1996, and then went on to make use of a network of professional and personal contacts to build a volunteer organization to help others for whom the operations were unavailable or too expensive._NEWLINE_Through a foundation he established, Save a Child's Heart, Dr. 
Cohen and other surgeons conducted hundreds of operations on children with congenital heart diseases, mostly at the Wolfson Medical Center, where Dr. Cohen was the head of pediatric cardiac surgery and served as Save a Child's Heart's chief surgeon. _NEWLINE_Children were also brought from Nigeria, Tanzania, Congo, Moldova, Russia, Ghana, Vietnam, Ecuador, Jordan and the Palestinian Authority. Dr. Cohen and his team also traveled to China and Ethiopia to operate on about 60 children and taught medical staff there and in other countries. His foundation helped bring doctors and nurses to Israel for training, with the aim of creating centers for treatment of pediatric heart disease in their home countries. _NEWLINE_Dr. Cohen died on August 16, 2001, while climbing Mount Kilimanjaro in Tanzania. He was 47._NEWLINE_Since Dr. Cohen's death, SACH has continued its efforts to benefit children with life-threatening cardiac problems and to teach medical personnel in developing nations the surgical techniques needed to treat these young patients._NEWLINE_In 2006, SACH was selected as a featured charity by KLM Royal Dutch Airlines Air Cares program, with the airline showing a video of the charity's work on board its flights. The airline also donated EUR10,000 and donated free air miles to SACH._NEWLINE_In April, 2007, Israeli musician Idan Raichel traveled with Save a Child's Heart to Rwanda and Ethiopia._NEWLINE_In May 2011, SACH received recognition for special consultative status with the Economic and Social Council (ECOSOC) of the United Nations_NEWLINE_In November 2011, a new children's home was inaugurated. The facility was built specifically to meet the needs of the young patients and staff and will allow Save a Child's Heart to house and treat a larger number of the children._NEWLINE_In June 2012, SACH received the Israeli Presidential Award for Volunteerism._NEWLINE_In July 2016, SACH saves its 4,000th child._NEWLINE_In January 2017, Israeli Prime Minister Benjamin Netanyahu visited SACH_NEWLINE_In January 2019, SACH saved its 5,000th child. _START_SECTION_ Surgeries performed in Israel _START_PARAGRAPH_ Save a Child's Heart has treated over 5,000 children from 59 developing nations in Israeli hospitals._NEWLINE_In 2013, amidst the Syrian Civil War, SACH conducted an open-heart surgery on a 5-year old Syrian girl. The pre-schooler, living as a refugee in an undisclosed country, traveled to the Wolfson Medical Hospital in Holon to receive the treatment. She was the first Syrian child to receive the free medical care and surgery._NEWLINE_SACH is embarking on its biggest project yet, to build an International Pediatric Cardiac Center (IPCC) at the Wolfson Medical Center (WMC), which will serve as a Children's Hospital. The IPCC will be a worldwide center of competence in pediatric cardiac care with international recognition in pediatric cardiac treatment, training and research. It will serve as a model for other SACH centers of competence in developing countries. This new state of the art child oriented medical facility will house all of the infrastructure and equipment needed to perform pediatric heart surgeries, including all pre and post-operative care. _START_SECTION_ International activities _START_PARAGRAPH_ China – On November 16, 2008, a Save a Child's Heart (SACH) training and surgical mission left for Shijiazhuang in the Hebei Province in China. 
This was SACH's 8th mission to China where its medical teams have saved, with Chinese colleagues, more than 100 Chinese children._NEWLINE_Angola - On May 3, 2009, a Save a Child's Heart medical team left for Luanda, Angola, to examine and screen Angolan children. The team examined 88 children. Among them were children who had been treated in Israel and needed a follow up examination._NEWLINE_Moldova – On November 11, 2007, a SACH team arrived in Kishinev, Moldova, to work with a team of local pediatric cardiologists. The mixed surgical group examined children and performed surgeries for five days._NEWLINE_Tanzania – In August 2011, a SACH team of Doctors, Nurses, Staff and Volunteers traveled to Tanzania to the Bugando Medical Center to work alongside local partners. During this mission SACH, together with the local partners doctors screened 300 children and performed 12 surgeries on Ethiopian children. A week later, a team of SACH volunteers, doctors and staff climbed Mount Kilimanjaro in an effort to raise $1M to save the lives of African children in need. As of January 2018 there have been 7 medical missions to Tanzania._NEWLINE_Romania - In 2017, there were two missions to Romania in March and November, as well as one mission in 2018. During these missions Israeli doctors traveled to help assist Romanian medical staff in performing over 11 life saving heart procedures as well as performing their own procedures._NEWLINE_Zanzibar - There have been 8 medical missions to Zanzibar since 2008, the most recent being in February 2019. Save a Child's Heart (SACH) sent an all-women's mission to Zanzibar in mid-February 2019 to screen and diagnose children in need of life-saving heart surgery. SACH worked with its medical partners at the Mnazi Mmoia Hospital in Zanzibar to conduct screenings and determine which children are in need of heart surgeries. Throughout the mission, there were a total of 398 children in Zanzibar screened._NEWLINE_SACH Photo Exhibit Tours the Globe_NEWLINE_Since 2008, a photo exhibit of SACH activities has been presented in cities around the world, including Abuja (Nigeria), Brussels, Detroit, Glasgow, Hebei (China), Jerusalem, Johannesburg, Melbourne, Miami, Moscow, Philadelphia, Quezon City (Philippines), Singapore, Sydney, Toronto, Vancouver and Washington, DC. + _START_ARTICLE_ Nerve point of neck _START_SECTION_ Convergence of nerves _START_PARAGRAPH_ Erb's point is formed by the union of the C5 and C6 nerve roots, which later converge. At the nerve trunk, branches of suprascapular nerves and the nerve to the subclavius also merge. The merged nerve divides into the anterior and posterior division of C5 and C6. _START_SECTION_ Clinical significance _START_PARAGRAPH_ Injury to Erb's point is commonly sustained at birth or from a fall onto the shoulder. The nerve roots normally involved are C5 and partly C6. Symptoms include paralysis of the biceps, brachialis, and coracobrachialis (through the musculocutaneous nerve); the brachioradialis (through the radial nerve); and the deltoid (through the axillary nerve). The effect is called "Erb's palsy". Typically, an affected person's arm hangs at the side with the hand rotated medially, like a porter waiting for a tip; hence the colloquial name "porter's tip hand". + _START_ARTICLE_ Ken Knight _START_SECTION_ Fire service career _START_PARAGRAPH_ Knight started work at Westminster Bank in Reigate (1964-1966). He commenced his fire service career as a firefighter in 1966 and subsequently served in a number of UK fire brigades. 
He was appointed as Chief Fire Officer of Dorset Fire and Rescue Service (1994-1998) and West Midlands Fire Service (1998-2003), before becoming London’s Fire Commissioner in 2003. In 2007 he was appointed as the Government's Chief Fire and Rescue Adviser for England based at the Department for Communities and Local Government._NEWLINE_As the Chief Fire and Rescue Adviser for England (2007-2013), Knight was responsible for advising ministers and senior officials on fire policy matters and for providing advice during major and catastrophic emergencies together with operational advice on preparedness and response during the 2012 London Olympics. He was also responsible for the enforcement of fire safety regulations in Crown Premises in England._NEWLINE_He produced an independent report for the UK Government on fire and rescue services response to the widespread flooding in 2007 entitled Facing the Challenge. He was also tasked by the Secretary of State to undertake a review in the immediate aftermath of the fire in high rise flats at Lakanal House, London, 2009 in which six people died._NEWLINE_In May 2013 Knight published an efficiencies review of the 46 Fire and Rescue Authorities, Facing the Future, which had been commissioned by the UK Government. _START_SECTION_ Independent consultancy _START_PARAGRAPH_ Since leaving his position as the Chief Fire and Rescue Adviser to the UK Government in 2013, Knight has provided independent consultancy advice to public and private sector organisations._NEWLINE_From 2014 to 2017 Knight served as the Lead Commissioner for the London Borough of Tower Hamlets under the Local Government Act 1999._NEWLINE_From 2016-17, the Department for Business, Innovation and Skills commissioned Knight to make recommendations regarding the introduction of electronic balloting for industrial action._NEWLINE_Since June 2017, Knight has been chair of the Independent Expert Advisory Panel at the Ministry of Housing, Communities and Local Government in the immediate aftermath of London's Grenfell Tower fire, in which 71 people lost their lives. His role was to provide advice to officials and Ministers on action to be taken in high rise buildings following the fire._NEWLINE_Knight has also completed reviews of the fire and rescue services in Southern Ireland, Bermuda and Gibraltar, and undertook a review of the national fire safety and civil defence arrangements in the Kurdish Region of Iraq at the request of the Kurdish Regional Government. _START_SECTION_ Honours _START_PARAGRAPH_ Her Majesty the Queen awarded Knight the Queen's Fire Service Medal in 1991 and a CBE in the 2001 Birthday Honours. He was appointed as Her Majesty’s Representative Deputy Lieutenant for Richmond upon Thames (2017-2017), and has been a Deputy Lieutenant for Greater London since 2007. Knight was knighted in the 2006 Birthday Honours in recognition of his outstanding contribution to the fire and rescue service._NEWLINE_Knight is a Companion of the Chartered Management Institute and a Fellow of the Institution of Fire Engineers. He was a founder Trustee of the UK Firefighters Memorial Trust and is a past Master of the Worshipful Company of Firefighters (1998). 
+ _START_ARTICLE_ James Brophy (public servant) _START_SECTION_ Biography _START_PARAGRAPH_ James Brophy was born on 26 September 1889 in South Melbourne, Victoria, the eldest child of Richard Brophy, labourer, and his wife Catherine, née Mackey, both from Ireland._NEWLINE_On 14 January 1922 he married Elizabeth Constance Ridley at St Brigid's Catholic Church, Red Hill, Brisbane. They had ten children. She died in 1965. They moved to Melbourne, Victoria in 1927 and to Canberra in 1930._NEWLINE_He was appointed Papal Knight-Commander of the Order of St. Gregory the Great in 1951 by Pope Pius XII. He was awarded the Companion of the Imperial Service Order in the Queen's Birthday Honours list in 1954._NEWLINE_He was a life member of the NSW Hockey Association, NSW Junior Hockey Association, ACT Hockey Association and the NSW Amateur Swimming Association. He served as a swimming official at the Rome Olympics and the Perth Commonwealth Games. He was also involved with Australian Rules Football and was president of the Canberra Australian National Football Junior League from 1940 to 1942 and senior vice-president of the CANFL in 1941 and 1942. He was a foundation member of the National Eisteddfod Society._NEWLINE_He retired on 25 September 1955._NEWLINE_He died on 24 May 1969 at Canberra Hospital aged 79 and was buried at Canberra Cemetery. + _START_ARTICLE_ Paeania _START_PARAGRAPH_ Paeania or Paiania (Ancient Greek: Παιανία) were two demoi of ancient Attica, divided into Upper Paeania and Lower Paeania, that were situated on the eastern side of Hymettus, near the modern village of Liopesi, since renamed Paiania. It was the deme of Demosthenes. + _START_ARTICLE_ Federal University of Rio de Janeiro Faculty of Law _START_SECTION_ History _START_PARAGRAPH_ The National Faculty of Law of UFRJ is the result of the merger in 1920 of two private schools, the Free Faculty of Law and Social Sciences of Rio de Janeiro and the Free School of Law. It fulfilled a long-held ambition of prominent citizens such as Fernando Mendes de Almeida, who had dreamed of creating a private law school. With the establishment of the republic and the creation of a free educational system, Mendes de Almeida called on former supporters of the idea and, with new members, worked for the establishment of the Free School of Law and Social Sciences of Rio de Janeiro, which eventually became the National Faculty of Law._NEWLINE_The creation of the National Faculty of Law, through the merger of the two private colleges, represented an end to the monopoly of legal education, which until then was the nearly exclusive province of the Faculdade de Direito do Recife in Olinda, and the Faculdade de Direito da Universidade de São Paulo. The founding of the National Faculty of Law added much-needed diversity to the nation's legal education._NEWLINE_The National Faculty of Law, together with the UFRJ Polytechnical School and the UFRJ Medical School, became in 1945 the basis for a new university, the University of Brazil. During that period the faculty's library was created, the college's magazine "A Época" was launched, and the Literary Guild and the Law Journal were created, under a committee formed by Cândido de Oliveira Filho, Luiz Carpenter, Raul Pederneiras, Virgílio de Sá Pereira, Gilberto Amado and Afrânio Peixoto._NEWLINE_In the 1930s, the National Faculty of Law held memorable public contests that attracted remarkable teachers, such as Joaquim Pimenta (sociology).
The class of 1937 was especially noted for graduates such as José Honorio Rodrigues and Evaristo de Moraes Filho, who became a professor of Labor Law and Sociology with his thesis on Auguste Comte._NEWLINE_In the 1940s the National Faculty of Law transferred to its current building, during a period marked by strong student mobilization (especially as resistance to the Estado Novo). Notable recruiting drives continued, bringing young lawyers to the Chairs of the Faculty, such as San Tiago Dantas and Hélio Tornaghi._NEWLINE_The 1950s consolidated the reputation of the National Faculty of Law. In 1955, the inaugural lecture of San Tiago Dantas, entitled "Legal Education and the Brazilian crisis", attracted much attention. At that time, San Tiago presented new guidelines for legal education and criticized the legal teaching methods of the time, defending the case system as opposed to the text system, and also argued that an interdisciplinary approach to Law was more suitable to modern times._NEWLINE_In 1960, the Brazilian capital moved to Brasília, and the process of federalisation of higher education began, with UFRJ as a part of it. With the coup of 1964, the National Faculty of Law faced some consequences, but the CACO – Centro Acadêmico Cândido de Oliveira (the faculty's students' union) fought against the military regime._NEWLINE_In the 1970s, the National Faculty of Law went through a deep crisis, characterized by only a few entrance examinations being held and a gradual reduction of faculty staff. The 1980s were also marked by crises and obstacles in student admissions._NEWLINE_In the 1990s, there were some initiatives, such as curriculum changes, the rearrangement of departmental structure and the creation of a center for community outreach, including a Special Court, an office of the Ombudsman, and a center for legal practice._NEWLINE_Since the end of 2009, following the election of a new directing board, the National Faculty of Law has been going through deep changes in academic and structural matters, aimed at improving the school's quality and reputation. + _START_ARTICLE_ Special flight rules area _START_PARAGRAPH_ In United States aviation, a special flight rules area (SFRA) is a region in which the normal regulations of flight do not apply in whole or in part, especially regulations concerning airspace classification, altitude, course, and speed restrictions, and the like. _START_SECTION_ Washington, DC Special Flight Rules Area _START_PARAGRAPH_ Following the terrorist attacks of September 11, 2001, the airspace around Washington DC underwent a number of changes designed to restrict flying around the city. In 2003, a temporary flight rules area was created and was named the Washington DC Air Defense Identification Zone. In 2008 the temporary status of the ADIZ was removed and the rule was made permanent._NEWLINE_In order to fly within the DC SFRA, pilots of general aviation aircraft are required to file a special flight rules flight plan, obtain a discrete transponder code, and remain in contact with air traffic control at all times. Special training is required in order to fly within 30 nm of the Washington DC (KDCA) VOR. _START_SECTION_ Los Angeles Special Flight Rules Area _START_PARAGRAPH_ Long established in the Los Angeles basin is the Los Angeles SFRA. Los Angeles International Airport is surrounded by extensive Class B airspace, which is difficult for VFR traffic to navigate.
In particular, the airport has four large runways running east/west that have airspace protection, 25 statute miles wide, extending from 10,000 feet down to the surface. This large swath of Class B airspace bisects Los Angeles and means that flights between airports north of LAX and airports south of LAX must be routed by air traffic control. To alleviate this load on ATC, the SFRA over LAX defines two exceptions to the Class B airspace to allow VFR aircraft to transit without control from ATC._NEWLINE_There are two routes, one for southeast-bound traffic and one for northwest-bound traffic. Both follow the 132° radial of the Santa Monica VOR between the Santa Monica Airport and the intersection of Interstate 405 and Imperial Highway. Southeast-bound traffic flies at 3,500 feet. Northwest-bound traffic flies at 4,500 feet. Despite being in the Class B airspace, aircraft following the rules of this corridor need not communicate with ATC._NEWLINE_The rules are fairly simple: Turn on all practical lights, day or night. Squawk 1201. Do not exceed 140 knots IAS. Monitor and self-report on 128.55 MHz. Have a copy of the Los Angeles TAC in the aircraft. No jets. _START_SECTION_ Hudson River Special Flight Rules Area _START_PARAGRAPH_ On November 19, 2009, the FAA established an SFRA in the New York City Class B airspace, motivated largely by the mid-air collision of a private general aviation aircraft and a sightseeing helicopter along the Hudson River VFR corridor in the summer of 2009._NEWLINE_The Hudson River Class B exclusion area is formed from the airspace above the Hudson River between the Alpine Tower and the Verrazano-Narrows Bridge. It is bounded by the banks of the Hudson River and runs from the surface of the river up to 1,300 feet. Aircraft fly along the right-hand bank to separate northbound and southbound traffic. Aircraft transiting the entire corridor fly between 1,000' and 1,300'. Aircraft performing local operations (mostly landing and taking off) inside the area fly under 1,000'. Aircraft need not communicate with ATC, but they must make certain mandatory self-reports at certain charted points._NEWLINE_In addition, there is a further Class B exclusion area over the East River between the Hudson River and just past the Queensboro Bridge. This length cannot be transited, as the Queensboro end of the corridor ends inside the Class B airspace. Additionally, aircraft not landing or taking off inside the East River exclusion must be in contact with ATC. + _START_ARTICLE_ Volf Bronner _START_SECTION_ Early life _START_PARAGRAPH_ Volf Bronner was born in Buriat-Mongolia in 1876. He attended high school in Chita and then began to study medicine at the University of Tomsk but was expelled because of his revolutionary political activities. One of his classmates at Tomsk was A. T. Trubacheev, later the People's Commissar of Public Health of the Buryat Autonomous Soviet Socialist Republic. He continued his medical studies at the University of Berlin, where he obtained his doctorate in medicine in 1900. _START_SECTION_ Career _START_PARAGRAPH_ From 1900 to autumn 1901, Bronner was a doctor in Verkhneudinsk, and from 1906 to 1913 he was in Paris, where he worked with Professor Guyon and subsequently at the Pasteur Institute. He edited the Journal Clinique d'Urologie.
From 1915 he worked in Moscow and in 1922 he established the State Venereological Institute in Moscow, of which he became the director._NEWLINE_Bronner helped to organise the 1928 Soviet-German Syphilis Expedition which aimed to tackle the endemic syphilis in Buriat-Mongolia, Bronner's place of birth, and to determine the method of transmission of the disease. Contrary to expectations, the expedition concluded that the syphilis in the area was spread principally by sexual activity._NEWLINE_In 1927, Bronner edited Prostitutsiia v Rossii (Prostitution in Russia) with Arkadii Elistratov, professor of police law at Moscow University, and in 1936, his book, La lutte contre la prostitution en URSS (The fight against prostitution in the USSR), revealed that two-thirds of prostitutes had been servants._NEWLINE_Following the Russian Communist Party's 17th Congress in 1934, which emphasised service to the collective over individual needs, Bronner was one of a number of public figures who changed their public utterances to match the new ethos, moving away from a humanistic approach that saw syphilitic infection as the result of misfortune and nothing to be ashamed about, towards an approach that characterised it as impeding the efforts of the party and something that carried shameful connotations. _START_SECTION_ Death _START_PARAGRAPH_ Bronner was arrested on suspicion of spying and terrorist activity during Joseph Stalin's Great Purge of 1937. He was executed in 1939. + _START_ARTICLE_ Fordongianus _START_SECTION_ History _START_PARAGRAPH_ In antiquity, Fordongianus was called Forum Trajani in honor of Roman emperor Trajan, who is credited with the building of what are now considerable Roman remains, including those of a bridge, and of thermae on a scale of great magnificence (Valéry, Voy. en Sardaigne, vol. ii. c. 35). The city, in the interior of Sardinia, is known from the Itineraries, which place it on the road from Tibula, through the interior of the island, to Othoca. (Itin. Ant. p. 82.) Fordongianus sits on the left bank of the river Tirsi (ancient Thyrsus), about 25 kilometres (16 mi) from Oristano. + _START_ARTICLE_ Cherry Hill Gourmet Market _START_PARAGRAPH_ The Cherry Hill Gourmet Market is a 19,000-square-foot (1,800 m²) Russian-themed specialty grocery and deli located on the corner of Emmons Avenue and Ocean Avenue on the water in Sheepshead Bay, Brooklyn, New York City. It is the principal establishment occupying the former Lundy's Restaurant (now Lundy's Landing Shopping Plaza), which also houses Momoyama Japanese Restaurant and Masal Turkish Cafe on the first floor, and professional offices on the upper floors._NEWLINE_Opened by Russian-born fruit and vegetable produce entrepreneur David Isaev in May 2009, the creation of the market put an end to attempts to revive Lundy's. The facade of the Lundy's building, an official New York City landmark, remains the same. The market and other businesses located within the landmark Lundy's structure remain embroiled in legal controversy due to ongoing violations of zoning laws created to protect Lundy's. _START_SECTION_ Community opposition _START_PARAGRAPH_ The Lundy's Building underwent a seven-million-dollar renovation in order to be saved and reopened. Brooklynite neighborhood traditionalists have continued to attempt to force the new businesses out of the historic site.
Community opposition centers on previous inclusion of the Lundy's building into a special maritime zoning district enacted in the 1970s to promote water-related commercial and recreational development. The Cherry Hill Gourmet Market also features dining tables and serves fish salad and a dozen kinds of smoked fish and caviar. As such, it is not primarily or exclusively a grocery store and, according to the New York City Department of Buildings, is theoretically not permitted under the special zoning designation imposed on the historic Lundy's building. Mr. Isaev is involved in ongoing negotiations to legalize the market, and keep its roughly 100 workers employed. Isaev was fined for violations of the zoning laws, settled the fines with the New York City Landmarks Commission, and is now pursuing zoning changes which would legalize his business and the other businesses now housed within the Lundy's structure that are out of zoning compliance; these remain in operation despite being in technical violation of still legal and enforceable community zoning laws enacted to protect the Lundy's structure._NEWLINE_As part of a community settlement, Isaev, who had removed the brass lettering 'Lundy Bros' and 'FWIL' (for Lundy's founder Frederick William Irving Lundy) from several arched doorways, restored the brass lettering to its original positions, and placed a large screen around the large refrigeration units on the parking lot behind the market to settle landmark commission objections. The Lundy's building, which was a garbage-strewn decaying structure going to ruin when Mr. Isaev acquired it, is today a popular Brooklyn shopping spot for the Russian emigre communities situated in Sheepshead Bay, and nearby Manhattan Beach and Brighton Beach. In opening his establishment and later pursuing changes to the zoning laws applying to Lundy's, Isaev maintained he was neither aware nor informed of all the complicated landmark designations and zoning requirements applying to the site when he signed his lease and opened a community business in good faith. _START_SECTION_ Hurricane Sandy flood _START_PARAGRAPH_ The effects of Hurricane Sandy in October 2012 caused the waters of Sheepshead Bay to overflow. The storm surge flooded the Cherry Hill Gourmet Market at ground level, causing it to sustain water damage and resulting in tons of spoiled food. During the post-hurricane cleanup the food had to be discarded, but the building was otherwise unaffected. + _START_ARTICLE_ Herping _START_SECTION_ Photography _START_PARAGRAPH_ Photography of reptiles and amphibians is largely dependent on digital cameras with a macro lens. An adequate lens is necessary for successfully capturing many species' images in an efficient manner, as it keeps photographer and subject from being injured, as well as maintaining the natural behavior of the subject. In some cases it is more practical to temporarily capture and pose the subject manually, such as when it is moving or obscured by debris, or when a fossorial snake is scurrying into its burrow. _START_SECTION_ Equipment _START_PARAGRAPH_ Herping activities are often recorded using the latest digital camera or camcorder technology. As many as three flashes may be used for optimal lighting, especially in challenging environments such as tropical rainforests.
The multiple flashes create three distracting catchlights in the subject's eye; two may be edited out of the photo by using Photoshop or similar applications._NEWLINE_Photographing venomous snakes at close range places the photographer within striking range, and various shields have evolved to minimize the danger. These bite shields often take the form of an opaque or transparent plastic covering which surrounds the camera and exposes only the lens. Modifications are made to accommodate various flash setups. Snakes are temperature-dependent and are often active in large numbers during optimal weather. Consequently, the greatest danger in venomous snake photography may lie in a bite from an unseen snake near the photographer. Great care must be taken to survey the area, and bites of this nature have occurred on several occasions._NEWLINE_The safest way to photograph venomous snakes is never to touch them. Snakes may be manipulated with a variety of specialized hooks, ranging from large hooks used for moving snakes, to extendable pocket hooks used for minor posing adjustments. Bite-resistant gloves may also be worn. _START_SECTION_ Setups _START_PARAGRAPH_ Herptiles are extremely weather-sensitive and often appear in heavy rain or other challenging photographic conditions. Some photographers carry cardboard boxes which can be modified in the field to create tiny sets for photography. In a desert area, sand is sprinkled on the bottom of the box and desert debris is scattered about. In wet areas, mossy sets are often developed, which work well for salamanders. The herp is posed to show identifying features and can be photographed at leisure, creating a realistic photo. During heavy rain or cold temperatures, this "studio" work is usually done in the back of an SUV or similar vehicle._NEWLINE_For aquatic herptiles, early spring is often the best period to find them, as aquatic vegetation is still sparse. Aquariums with natural or prefitted substrate may be used to obtain natural photographs. The extent of aquatic setups is limited only by the photographer's imagination, and elaborate studio setups have been used to photograph specialized scenes like basilisks running on water. _START_SECTION_ Techniques _START_PARAGRAPH_ Because reptiles and amphibians are often agitated when captured, various techniques have evolved to pacify subjects of herpetological photography. One technique involves placing a hat or similar object over an animal (typically a snake) so that it coils and rests quietly. The object is then quickly lifted off the animal and a series of photos are taken. Assistants are often standing by out-of-frame to head off escape attempts. _START_SECTION_ Field Techniques _START_PARAGRAPH_ Many techniques are used when a person goes “herping” or looking for reptiles and amphibians. _NEWLINE_One technique is known as road running, road cruising, or cruising. This is done by riding in a vehicle and traveling down stretches of road at a slow speed to count or catch animals. The use of a road as a natural transect can generate estimates of species density by cruising the road at peak migration time. Similarly, driving roads at night during anuran breeding times can yield a high diversity of species. The North American Amphibian Monitoring Program (NAAMP) uses road surveys to log a species count into a database to study amphibian populations across the nation.
This is done by traveling down a set route, stopping at predetermined spots, listening for a few minutes, and writing down every species heard at that location (Dodd 2010)._NEWLINE_Another technique for observing reptiles for research or photo opportunities is the use of cover boards. Silvy (2012) suggests that metal and wood cover boards be set out at least two months prior to searching. These boards act like natural cover under which herpetofauna can hide._NEWLINE_Tree frogs can be caught and photographed by using PVC pipes that are capped on the bottom and hung vertically in a tree near water._NEWLINE_If aquatic species are the target, an aquatic funnel trap can be used._NEWLINE_Drift fences have been used with a high success rate for capturing snakes. The use of a drift fence along with a pit-fall or funnel box trap has yielded high success. The length of the fence is variable, but a longer fence results in a higher success rate. The fence is set with traps in the middle and/or the ends. Snakes encounter the fence and are directed or led to the trap. Care must be taken in providing enough cover so the species do not die of heat exhaustion. Identifying all the species in the trap is recommended so that accidental envenomation is avoided. Pit-fall traps are small buckets that are placed in holes dug out next to the drift fence._NEWLINE_Turtles can be caught by using a variety of techniques; hoop traps, basking traps, floating pitfall traps, and funnel traps are among the best traps to use. Basking traps are used to catch basking turtles. These traps float on the surface and have an elevated platform for the turtle to bask. The net is underwater so they cannot escape once they fall into the trap. _START_SECTION_ Tourism _START_PARAGRAPH_ Herp-related tourism, like bird-related tourism, is on the rise. Because there are several hundred birders for every herper, herp-related tourism presently has a negligible economic impact. Fortunately, there is no way to engineer wildlife preserves for a specified vertebrate group. Instead, large areas of wilderness are conserved, benefiting all wildlife. Some of the more popular herping destinations include the United States, Costa Rica, the Amazon, Madagascar, and Australia._NEWLINE_Other countries such as India and South Africa possess tremendous herpetological diversity and there are entrepreneurial individuals developing ecotourism infrastructure in these areas. One example is Exo-Terra, a division of the Hagen pet supplies company, which since 2004 has traveled to a different tropical African country each year. The company also holds an annual photography contest that showcases some of the best herp photography in the world. The winner of the photo contest goes on the next trip. _START_SECTION_ Geographical differences _START_PARAGRAPH_ In Canada and other high-latitude countries, the herping season lasts 6–8 months, depending on the area. Ontario is the most herpetologically diverse province in Canada. While species lists may seem high, many Canadian herps have extremely limited ranges and exist only in isolated populations. Many Canadian herp species are threatened and in some cases great care is taken to protect remnant populations._NEWLINE_The United States contains a large number of different habitats and thus has a wide diversity of reptiles and amphibians. In some parts of the country, such as South Florida and South Texas, herping can be productive year round because of moderate winter temperatures.
In most cooler parts of the country the herps hibernate in the winter and thus are mostly inaccessible to herpers. Popular herping destinations in the United States are southern California, southern Arizona, Texas, and Florida. These states boast an incredible diversity of herps as well as a number of species that are highly sought after by herpers. It is no coincidence that all of these states are in the southern part of the country; reptiles and amphibians are ectothermic (cold-blooded) and thus are typically more abundant in warmer climates. _START_SECTION_ Safety _START_PARAGRAPH_ Herping can potentially be a dangerous activity if not pursued with proper caution. A strike from a venomous snake can potentially be life-threatening. Other herping activities, especially "flipping," put a herper at risk of accidentally coming in contact with a scorpion or spider. Safety equipment used to mitigate such dangers includes snake hooks, snake tongs, boots and gloves. _START_SECTION_ Ethical and legal issues _START_PARAGRAPH_ Field herpers encompass a wide ethical spectrum, ranging from behavioural observation without approaching the animal to "feeder" animal collection for existing herpetoculture. The majority of herpers practice careful capture and release in the same spot, as many herps have their own territories and relocating them elsewhere would be a disturbance. As wilderness areas shrink, herpers are concentrated into smaller areas, and commercial collectors often encounter field biologists who may have quite different approaches to their study animals. Many species are also threatened or endangered and thus it is illegal to take them from the wild. Another consideration is the spreading of diseases, such as the fungus Batrachochytrium dendrobatidis responsible for a worldwide decline in amphibian populations, which may be spread inadvertently by humans._NEWLINE_Since many herps are nocturnal, herpers often remove animals temporarily for daylight photo sessions. The animals are then replaced exactly where found. There is no "herpers code" and ethical considerations are left to the individual. From time to time, albino and other unusually coloured animals are encountered and these are sometimes kept for herpetoculture. The ethical justification in these cases is that conspicuous animals would be easy prey in the wild. Although true in the case of albino or other light-coloured animals, this is not true, for example, when normally barred individuals are born with striped patterns. In this case the motive is usually commercial, with the collector planning to develop a striped bloodline and charge high prices for an exclusive morph._NEWLINE_There are many different laws in place that affect herpers. Laws vary by country and state and are designed to protect the wildlife and habitats. In most states, a hunting license is required to collect reptiles and amphibians. Some states are stricter than others in terms of herping-related legislation. In Texas, for example, it is illegal to collect herps on public land, and thus the "road cruising" strategy described above is illegal. Herpers should be careful to obey all laws in the areas where they hunt. Lawbreaking herpers risk fines or even legal prosecution.
+ _START_ARTICLE_ Patriot Guard Riders _START_SECTION_ History _START_PARAGRAPH_ The group was formed in 2005 to shelter and protect the families of deceased service members against protesters from the Westboro Baptist Church, who claim that the deaths of American troops in Iraq and Afghanistan are divine retribution for American tolerance of homosexuality. PGR members position themselves to physically shield the mourners from the presence of the Westboro protesters by blocking the protesters from view with their motorcade, or by having members hold American flags. The group also drowns out the protesters' chants by singing patriotic songs or by revving motorcycle engines._NEWLINE_Although initially founded by motorcyclists, the organization is open to anyone, regardless of political affiliation, veteran status, or whether they ride or not. The only prerequisite is "a deep respect for those who serve our country; military and first responders." The Patriot Guard was established in Mulvane, Kansas, at American Legion Post 136 in 2005. The founding members incorporated the organization as a 501(c)(3) non-profit in the State of Oklahoma on February 21, 2006._NEWLINE_The group's mission quickly expanded to include the funerals of law enforcement officers, fire department personnel, all first responders, and any active duty member or veteran of the U.S. Armed Forces from all previous wars and conflicts, and is now largely focused on recognizing and honoring the sacrifices of dead service members as well as their families and loved ones. As of March 2011, PGR reported over 220,000 members. In addition to their attendance at funerals, the group also greets troops returning from overseas at welcome home celebrations and deployment ceremonies, and performs volunteer work for veterans' organizations such as Veterans Homes. The group also assists families in financial difficulties with travel and housing arrangements, and visits military hospitals to encourage and honor wounded service members of the United States Armed Forces. _START_SECTION_ Trademark lawsuit _START_PARAGRAPH_ In 2007, the Patriot Guard Riders attempted to register the name with the United States Patent and Trademark Office. One of the organization's founding members and first President, Jeff Brown, who previously operated the PGR merchandise store, filed an objection. PGR rebutted this, stating in papers filed with the Patent and Trademark Office that Brown had been ejected as a director of PGR in November 2006, and had therefore relinquished all rights to the store and the organization's name. After resigning, Brown filed a trademark request, but this was rejected since the PGR had submitted its own request. PGR contacted all its members asking for donations to establish a defense fund for the lawsuit._NEWLINE_On 16 July 2012 the Trademark Trial and Appeal Board (TTAB) rendered its decision on Brown's opposition to PGR, Inc.'s registration. They stated: "The record further reflects that during Brown's tenure as Executive Director, despite his use of personal funds, he was acting in his official capacity when ordering the collateral merchandise to sell on the online store. Consumers who bought the goods prior to Brown's departure and the subsequent creation of "Twister's Store" were led to believe the goods originated from the PGR. Hence, Brown cannot prevail on his claim of priority since he cannot show by a preponderance of the evidence a prior proprietary interest in the word mark PATRIOT GUARD RIDERS for collateral merchandise.
Decision: The opposition is dismissed." _START_SECTION_ Defending their trademark _START_PARAGRAPH_ After successfully registering multiple trademarks, the Patriot Guard Riders (PGR), Inc., began taking steps to enforce and defend its marks from unauthorized use._NEWLINE_A group in Michigan split from the PGR but continued to use multiple marks while conducting fundraising activities, most notably adopting the name "Michigan Patriot Guard" (MPG). The PGR made multiple requests of the MPG to cease and desist utilizing the name and trademarks. When the MPG failed to comply, the PGR filed a lawsuit in the US District Court in Flint, Michigan._NEWLINE_Before the lawsuit went to trial, the PGR and MPG reached a settlement. As part of the agreement, the MPG agreed to change its name. The organization's new name is Michigan Bikers Helping Veterans. + _START_ARTICLE_ Martyn Farr _START_PARAGRAPH_ Martyn Farr (born Crickhowell, Wales, March 3, 1951) is a leading exploratory cave diver and caver, known for his record-breaking cave dives and the exploration of many miles of previously undiscovered underground passages (e.g. in Ogof y Daren Cilau and Noon's Hole). As an author and photographer he has written many books on the subject of cave diving history and techniques and caving locations. _START_SECTION_ Life and career _START_PARAGRAPH_ Farr began caving in 1961 and cave diving in 1971, and within 10 years had established a world record for underwater cave penetration in the Bahamas. He is noted within the cave diving community for his explorations in Wookey Hole in 1977 and 1982, and for completing the first traverse of Llangattock Mountain in Wales in 1986, the execution of which was a televised media event, being the longest and deepest caving through-trip in the British Isles. In 1978, Farr also discovered the Pollatoomary cave in the Partry Mountains of the Republic of Ireland. In 2008, his student Artur Kozłowski explored this cave to a depth of 103 metres, which made it the deepest known cave in the British Isles._NEWLINE_As well as running a cave diving training facility in South Wales, Farr is a regular contributor to diving magazines around the world. Farr has also acted as support diver in some of the world's most notable cave diving penetrations, including the British-led expedition to Pozo Azul in Spain in September 2010, which at 8.8 km (5.5 miles) of underwater travel is the world's longest cave diving penetration. _NEWLINE_Martyn also owns Cwmdu Campsite, a Visit Wales 4 star campsite and caravan park in the Brecon Beacons National Park, an area on which many of Martyn's books are centered. _NEWLINE_Farr is the author of The Darkness Beckons, regarded as the definitive book on the history of UK cave diving. + _START_ARTICLE_ The Covenant, The Sword, and the Arm of the Lord _START_SECTION_ Leadership _START_PARAGRAPH_ The founder of the CSA was James Ellison, who served time in federal prison along with his "high priest" Kerry Noble. Robert G. Millar became one of Ellison's spiritual advisers, and he was also the founder of Elohim City. Ellison was also mentored by Richard Girnt Butler, founder of the Aryan Nations, and Robert E. Miles, founder of the Mountain Church in Cohoctah, Michigan. Both extreme right-wing leaders taught and practiced the theology of Christian Identity, a belief system which the FBI includes on its watch list as an extremist religion.
Ellison had close ties to the Ku Klux Klan and the Aryan Nations, based in Hayden Lake, Idaho, and led by Richard Girnt Butler, who was described as "the glue of the Aryan Nations movement in the Northwest, if not the country" by the supervisor of the Inland Northwest Joint Terrorism Task Force. Miles had a prison ministry and newsletter, relating mostly to violent white Aryan groups, of which there are many, most notably the Aryan Brotherhood. After Ellison was released from prison, he moved to Elohim City, where he married Millar's granddaughter._NEWLINE_The entire Council of Elders in the CSA community was deeply influenced and mentored by many outside sources. This nine-man council deliberated on both the spiritual meaning and the direction of CSA activities. Jim Ellison, Kerry Noble and William Wade were the only known members of the council. _START_SECTION_ Purpose _START_PARAGRAPH_ The CSA was an organization which believed that doomsday was imminent, and the 224-acre compound that was set up in Elijah became a community for its members. There they trained their members in paramilitary operations. The group believed in white supremacy and was anti-Semitic. Like other prominent anti-Semitic groups that believed in anti-Semitic canards, they referred to the United States Government as ZOG, short for Zionist Occupied Government. The military leader, who used the name Randall Rader during his stay at the CSA compound, left the group in a rift with Ellison and joined the newly forming group The Order in Idaho. The CSA initially professed the belief that the United States government would dissolve due to its own corruption, whereas The Order advocated revolution. However, in July 1983, the CSA published a manifesto called A.T.T.A.C.K. (Aryan Tactical Treaty for the Advancement of Christ’s Kingdom) which declared war against the government. This was seen by followers as the Second American Revolution. _START_SECTION_ Operations _START_PARAGRAPH_ CSA assassins monitored the homes of their targets, practiced mock assassinations of these targets with scoped rifles, and practiced attacks in a mock residential training facility known as Silhouette City. The perimeter of the CSA compound had 100-, 200-, and 300-yard (270 m) indicator plates nailed to trees to allow the defenders to adjust their sights accordingly to engage attackers. The central rallying point in the event of an attack was a concrete bunk house that housed the communications radios next to the 95-foot (29 m) tower, which was constructed for defense. The perimeter of the compound had built-in bunkers for one to three men, and each was numbered as a post and assigned to individuals as an area of responsibility. _NEWLINE_The line infantryman carried a Ruger Mini-14 .223 Remington rifle. As in the early days of the United States Marine Corps, the squads were set up in four-man fire teams. One man in the fire team carried a Heckler and Koch Model 91 rifle in .308 caliber. These had been modified via a technique which the organization sold to "brother groups," converting the rifles into illegal selective-fire weapons (capable of firing either single shots or fully automatically). The Elite "A" Team had black clothing and some fairly sophisticated weapons, such as the .22 caliber Ruger target pistol fitted with an integral silencer, and several MAC-10 submachineguns in both 9 mm and .45 ACP, also with attached suppressors.
These men trained in the covert aspects of military action and were to be the core of the defense initiative._NEWLINE_The Bureau of Alcohol, Tobacco, and Firearms (ATF) later determined that the CSA had obtained 155 Krugerrands, one live light antitank rocket, 94 long arms, 30 handguns, 35 sawed-off shotguns and machine guns, one heavy machine gun, and a quantity of C-4 explosives._NEWLINE_Within "Silhouette City", the CSA also ran a boot camp-style program known as the End Time Overcomer Survival Training School, which was conducted by Order member Randall Rader. Here, the group trained an estimated 1,500 like-minded Christian Identity adherents in combat techniques and paramilitary exercises. Upon completing this training, a newly trained militant would leave to join or start other similar militia groups._NEWLINE_The CSA and its paramilitary arm taught basic pistol and rifle use as well as personal home defense, rural and urban warfare, weapons proficiency, general military field craft, Christian martial arts, and natural wilderness survival._NEWLINE_In 1983, CSA member William Thomas, accompanying Richard Wayne Snell and Steven Scott, attempted to dynamite a natural gas pipeline which crossed the Red River on its way from the Gulf of Mexico to Chicago. This event was part of the group's A.T.T.A.C.K. operations. According to Kerry Noble, the group predicted this would result in riots (since it was winter). However, the trio were unsuccessful in carrying out the act of terror. _NEWLINE_The CSA had links with other radical organizations, including the Aryan Brotherhood, the Mountain Church, and The Order, which were all dangerous white supremacist organizations which advocated the violent overthrow of the United States Government. Many of their members were seen traveling in and out of the compound, and after a search of the compound, several stolen vehicles, including one belonging to The Order, were recovered._NEWLINE_According to a report conducted by the California Department of Justice, The Pagans Motorcycle Club provided the CSA with training in booby trap devices and survival techniques in return for weapons and ammunition._NEWLINE_Things began to go downhill for the organization after Snell, an alleged member, was arrested for killing an African-American police officer. Snell was later tied to the killing of a gun store owner in 1981, having obtained and used the same gun, the serial number of which had been removed by the CSA armorer, Kent Yates. Yates was arrested on Friday, July 13, 1984, on an outstanding warrant out of New Mexico for firearms violations in Farmington. He was later also charged and convicted of weapons manufacture and modification for the CSA._NEWLINE_After the incident with Snell, the FBI began to seek ways to infiltrate the CSA compound and stop the organization which it deemed dangerous. Its agents obtained warrants under Arkansas state law to arrest Ellison, the leader of the CSA, for multiple firearms violations. (The FBI later claimed that at all times it had an "inside man" in the CSA.) _START_SECTION_ Siege _START_PARAGRAPH_ On 16 April 1985, the FBI obtained a search warrant for the CSA compound._NEWLINE_Beginning on 19 April 1985, the FBI and the ATF, led by the FBI's Hostage Rescue Team (HRT), positioned around 300 federal agents in Elijah. It was necessary to keep the operation a secret, but this was not easy in the small community.
However, the FBI and ATF agents took advantage of Elijah being a common destination for anglers by pretending to be fishermen and registering at different motels near the various fishing destinations. On the morning of 19 April, they moved in and surrounded the CSA compound, putting some of their agents in fishing boats in order to seal off the lakeside area of the compound. There they waited, until a few hours later when two guards emerged from the compound. They appeared to be unaware of the presence of the officers and walked towards a sniper hold-out, until an officer yelled commands to return to the compound, with which the guards complied. Later, an unnamed individual emerged from the compound and talked with the federal agents and reported to Ellison that the FBI agents were outside and willing to negotiate his surrender and the emptying of the compound. Ellison emerged later. FBI agents had expected that he would not go down without a firefight, but the FBI negotiators convinced him that the CSA would certainly lose if they had one. They convinced him that they wanted peaceful cooperation, and he asked that his spiritual adviser, assumed to be Millar, come to the compound to instruct him. The individual was flown to the area and seemed eager to convince Ellison to stand down. They allowed the individual into the compound, and the FBI instructed him to call in every 30 minutes in order to report on how negotiations were going._NEWLINE_U.S. Attorney Asa Hutchinson, who later successfully prosecuted Ellison and other leaders of the CSA, put on an FBI flak jacket and entered the compound in order to join the negotiations, leading to a peaceful conclusion to the armed stand-off. After several calls requesting more time, early on the morning of the fourth day of the siege, Arkansas State Police entered the compound and escorted out the remaining members without further bloodshed. Women and children had earlier been evacuated to nearby motels. _START_SECTION_ Charges _START_PARAGRAPH_ Ellison and most of his leadership were charged in federal court with illegal weapons possession and racketeering. In September 1985, Ellison, Kerry Noble, and four other CSA members (Gary Stone, Timothy Russell, Rudy Loewen and David Giles) were sentenced to lengthy federal prison terms. A seventh CSA member, Stephen Scott, pleaded guilty in an Arkansas federal court to charges he dynamited a natural gas pipeline near Fulton, Arkansas in 1983, and was also sent to prison. Ex-CSA member Kent Yates also pleaded guilty to a charge of conspiring to make and transfer automatic weapons silencers._NEWLINE_Ellison faced the maximum sentence of 20 years in prison after he was convicted on federal racketeering and weapons charges. However, Ellison was released in 1987 after agreeing to testify against the leader and six senior members of the Aryan Nations. All seven men were arrested and indicted on charges of sedition. The jury found all the defendants not guilty on all charges. Upon his release from federal prison, Ellison moved to Elohim City._NEWLINE_Richard Wayne Snell, the man who shot and killed the police officer and a pawn shop owner, was sentenced to death by lethal injection, which was carried out on 19 April 1995, the same day as the Oklahoma City bombing. _START_SECTION_ Possible ties to the Oklahoma City bombing _START_PARAGRAPH_ There are several claims that the 1995 Oklahoma City bombing on the Alfred P. Murrah Federal Building was tied to the "New Day" teachings of Elohim City. 
No proof, however, has been established. Elohim City was assembled for the purpose of gathering "prophets of the New Day". Leader Robert G. Millar envisioned himself to be the "Shepherd of Shepherds", traveling to numerous alternative societies, many of which were and are still communes. His ambition was to unite these underground organizations. He appeared several times at the Padanaram Settlement, in southern Indiana, but contrary to reports, members of the Padanaram Settlement did not concur with the radical callings of either Millar or Ellison, who made two appearances there. "The Valley" was and still is known more for being a cultural hub for artists and philosophers, and until roughly 2003 it operated a sawmill._NEWLINE_Timothy McVeigh, who was convicted and executed for perpetrating the Oklahoma City bombing, had no association with the CSA and had just enlisted in the U.S. Army when the CSA compound was besieged and broken up. The Oklahoma City bombing occurred exactly on the 10-year anniversary of the start of the siege of the CSA compound in 1985. The most plausible link is the fact that Richard Wayne Snell, who was executed on the day of the Oklahoma City bombing, had planned a similar attack on the Murrah building in 1983 after becoming upset with the Internal Revenue Service. Additionally, Snell was heard taunting jailers that something drastic would happen on the day of his execution. However, McVeigh has stated that he chose the date of 19 April to coincide with the violent end of the Waco siege exactly two years prior. McVeigh had traveled to and visited Waco during the 51-day siege and cited it and the 1992 Ruby Ridge events as his primary motivation for carrying out the bombing._NEWLINE_The single incident in which the CSA was involved, the robbery of a pawn shop in Springfield, Missouri, was, in fact, foiled by a CSA member on the orders of Jim Ellison, unknown to Wayne Snell, who headed up the plan. It was in regard to this event, not the attack on the Oklahoma City Federal Building, that Ellison saw a "sign from God", which he interpreted to mean that they should not carry out the attempt._NEWLINE_The death knell of the CSA was its attempt to kill FBI special agent Jack Knox, the lead agent assigned to investigate the group; Asa Hutchinson, the federal prosecutor; and the federal judge who presided over the affair that brought about the eventual action against Gordon Kahl, a tax protester and a member of the Posse Comitatus, by federal agents at CSA member Leonard Ginter's home (called 'The Bunker', due to its construction from concrete covered with earth). Ellison revered Kahl as a hero. Like McVeigh, Kahl was a decorated American soldier; Kahl earned a Silver Star in the Korean War, and McVeigh earned a Bronze Star in the first Gulf War – Desert Storm. _START_SECTION_ Media _START_PARAGRAPH_ In 2013, Kerry Noble appeared on the Investigation Discovery show Dangerous Persuasions talking about his time with the group. He was also interviewed for an episode of Brainwashed on the Slice Network in Canada and discussed his time with the CSA._NEWLINE_The Discovery Channel crime series The FBI Files' sixth season featured an episode whose topic was the CSA. The episode reveals the details of the federal investigation into the group, the 1985 siege and aftermath. The episode originally aired 10 December 2002.
+ _START_ARTICLE_ I Alone (The Vampire Diaries) _START_SECTION_ Plot _START_PARAGRAPH_ After Damon (Ian Somerhalder) compels Alaric (Matt Davis) to do whatever he has to do to get the ascendant from Jo (Jodi Lyn O'Keefe), Alaric has no choice. After obtaining it, Damon compels him again to forget everything, and along with Elena (Nina Dobrev) they meet Liv (Penelope Mitchell), who sends them back to 1994 to find Bonnie (Kat Graham) and bring her home. While searching for Bonnie, Elena wonders why Jo agreed to give Damon the ascendant, which is the only thing protecting her from Kai (Chris Wood), but Damon manages to avoid the truth._NEWLINE_Damon and Elena page Bonnie and are able to speak to her and tell her that they are bringing her home. In their conversation Bonnie tells them that Kai stabbed her to get her blood, leaving her in Portland, and that she fears Kai might be free. While waiting for Bonnie to get back to Mystic Falls, Damon tells Elena the truth of how he got the ascendant, something that makes her furious; she accuses him of being willing to do anything to make her fall in love with him again, no matter who gets hurt. Damon confesses to her that Bonnie kept him alive while the two of them were stuck in 1994 and was the one giving him hope, explaining that this is the reason he wants to save Bonnie, and not only to make Elena fall in love with him again._NEWLINE_Back in the present, Kai kills a cab driver, and once he arrives in Mystic Falls he finds Liv. He steals some of her magic and tries to kill her, but Tyler (Michael Trevino) comes in time and saves her. Seeing that Kai is free, Tyler wants to take Liv inside the borders of Mystic Falls, where no magic works, to protect her from Kai. That forces Liv to bring Damon and Elena back to the present before Bonnie gets to them, and she is left behind again. They try to convince her to send them back, but Liv leaves with Tyler. A little later, Kai finds Elena and Damon at the cemetery and destroys the ascendant while he crosses the border into Mystic Falls, with Elena and Damon unable to do anything to stop him._NEWLINE_Jo finds out that the ascendant is gone and confronts Alaric, the only other person who knew where she kept it, but Alaric swears he did not take it. Jo tells him that it is possible that he took it but does not remember because he was compelled to forget. Alaric tells her that Damon is his friend and would never do that to him, so Jo makes him cross the border so they can see whether he has been compelled._NEWLINE_Meanwhile, Stefan (Paul Wesley) meets Matt's (Zach Roerig) friend who claims to be Sarah (Gabrielle Walsh), the daughter of their uncle. Later, it is revealed that she is actually an impostor who goes by the name Monique. Stefan knows about the real Sarah and where she has been all these years, since he has been watching over her for her whole life, so the moment he saw Monique he knew she was lying. He compels Monique to forget she ever knew Sarah and to leave Mystic Falls, because he does not want Damon, or anyone else, to know about Sarah. Enzo (Michael Malarkey) suspects that Stefan is hiding something and kills Monique because Stefan refuses to tell him. That makes Matt go to Jeremy (Steven R. McQueen) and ask him to help him kill Enzo._NEWLINE_At the end of the episode, Alaric, now knowing the truth about Damon compelling him, confronts him; even though Damon tries to apologize, Alaric hits him and leaves when he finds out that Kai is already free.
In the meantime, Bonnie returns to the Mystic Falls of 1994 but does not find Elena and Damon there, while in the present, Kai gets to Tyler's home and tells him that he wants to save Liv's life, asking for Tyler's help. _START_SECTION_ Ratings _START_PARAGRAPH_ In its original American broadcast, "I Alone" was watched by 1.49 million viewers, down by 0.19 million from the previous episode. _START_SECTION_ Reviews _START_PARAGRAPH_ "I Alone" received mixed reviews._NEWLINE_Stephanie Flasher from TV After Dark gave the episode a B+ rating, saying that the episode had a little bit of something for everyone and the writers Brian Young and Holly Brix took viewers on an emotional journey filled with ups and downs._NEWLINE_Ashley Dominique of Geeked Out Nation rated the episode 7.1/10, stating that the episode moved the plot with the Gemini Coven forward, readjusting the tensions within our characters._NEWLINE_Jen from TV Overmind rated the episode 7/10, saying that the episode left her feeling a little uneasy about the second half of the season and the next week's midseason finale._NEWLINE_Leigh Raines of TV Fanatic rated the episode 3.5/5, stating that the episode was a full hour of good intentions but with poor planning._NEWLINE_Sara Ditta from Next Projection rated the episode 5.7/10, saying that the only characters with any real spark in the episode were Enzo and Kai. "While neither the plot nor characters developed significantly in this episode, the show's baddies brought some fun moments in an episode that mostly sets up next week's midseason finale."_NEWLINE_Caroline Preece of Den of Geek gave a good review to the episode, saying that the show delivered yet another fantastic hour of television. "There are shows, like The Vampire Diaries, that start off pretty terribly before going on to become sizeable hits. They burn hot and bright for a couple of seasons before the complacency sets in and eventually drives even the most enthusiastic fans away. Vampire Diaries was a textbook example of this, and to see it get back to its early quality in its sixth year is fantastic." + _START_ARTICLE_ Zoltán Kontár _START_SECTION_ Club career _START_PARAGRAPH_ On 13 July 2015, FK Senica signed Kontár on a one-year loan from FC Petržalka akadémia. At the age of 21, he made his professional debut for FK Senica against FC DAC 1904 Dunajská Streda on 18 July 2015. + _START_ARTICLE_ Dawnn Lewis _START_SECTION_ Early life and education _START_PARAGRAPH_ Dawnn Lewis was born in Brooklyn, New York City, to Carl and Joyce Lewis, who are of African-American and Guyanese descent. She began singing at the age of four and acting at eleven. _NEWLINE_Lewis graduated at 16 from the High School of Music & Art in New York City, now known as Fiorello H. LaGuardia High School. Then she majored in musical theatre with a minor in journalism at the University of Miami, graduating with a Bachelor of Music degree, cum laude, in 1982. _START_SECTION_ A Different World (1987–1992) _START_PARAGRAPH_ Lewis appeared for the first five seasons of the six-season run as Jaleesa Vinson (later Vinson–Taylor), from 1987 until 1992. Lewis co-wrote the theme song to A Different World, with Bill Cosby and Stu Gardner, and co-performed the song for the first season. Although her character was married to another of the main characters on the show, she disappeared from A Different World without explanation, like Chuck Cunningham of Happy Days.
Lewis appeared in a special week-long segment of A Different World called the Hillman College Reunion, airing on Nick at Nite, along with Lisa Bonet, Jasmine Guy, Kadeem Hardison, Darryl M. Bell, Cree Summer, and Sinbad. On her Super Password appearance in 1988, she was paired with Dallas star Ken Kercheval, not any of her co-stars. _START_SECTION_ Hangin' with Mr. Cooper (1992–1993) _START_PARAGRAPH_ In September 1992, Lewis began starring in ABC's Hangin' with Mr. Cooper alongside Mark Curry and Holly Robinson. Lewis appeared in 20 of the 22 episodes of the first season as Robin Dumars, Mark's childhood best friend and roommate. Lewis did not appear on the two shows concurrently — she left A Different World to star in Hangin' with Mr. Cooper. Lewis and Holly Robinson, along with R&B quartet En Vogue, performed the theme song for season one of Hangin' with Mr. Cooper. Sometime before the end of season one, the show's producers decided to scale back on the concept of an updated version of the 1970s ABC sitcom Three's Company. Lewis left the show after the conclusion of the first season due to the producers deciding to change the direction of the show, replacing her character with a mother and child: Mark's cousin Geneva Lee (portrayed by Saundra Quarterman) and her daughter Nicole (portrayed by Raven-Symoné). _START_SECTION_ Other work _START_PARAGRAPH_ Lewis portrayed Deloris Van Cartier in Peter Schneider's Sister Act the Musical, which opened at the Pasadena Playhouse on October 24, 2006. Lewis has voiced Storm of the X-Men in three games, most recently Marvel: Ultimate Alliance 2. She also voiced Granny Grim on The Grim Adventures of Billy and Mandy, and voiced the female Shokan (Sheeva) in Mortal Kombat: Defenders of the Realm. Lewis has also done voice work as LaBarbara in Futurama, Detective Terri Lee on Spider-Man: The Animated Series, villainess Di Archer on Bruno the Kid, and voiced a number of characters on The Boondocks. Additionally, she voiced the character 'Sharona' on "King of the Hill."_NEWLINE_Lewis co-starred in two Disney Channel Original Movies, The Poof Point as Marigold Ballord, and as Gail DeBarge in Let It Shine. In 2000, Lewis played Blabberwort the Troll in the five-episode NBC miniseries The 10th Kingdom. _START_SECTION_ 2006–present _START_PARAGRAPH_ In 2006, Lewis starred as Melba Early in the film adaptation of Dreamgirls. Lewis released her debut CD, entitled Worth Waiting For, in 2006. She most recently played Addaperle in The Wiz with New York City Center's Encores! In 2009, Lewis played Denise Fields on One Tree Hill. In 2010, Lewis played a minor recurring role as Lauren's mother in The Secret Life of the American Teenager. In 2012, she voiced Malora in Strange Frame. She also appeared as Dr. Knapp on Days of Our Lives in 2012–2013. As of 2015, Lewis is playing a recurring role on Major Crimes as Patrice, a love interest for Lt. Provenza, whom he met during a case. In that same year, she also voiced Ruby's mother Helen Hanshaw in one episode of Sofia the First. _NEWLINE_In March 2016, Lewis was cast in Disney Junior's animated series Doc McStuffins as the voice of Grandma McStuffins._NEWLINE_In 2017 she provided the voice of Maybelle Mundy in the film Bunyan and Babe._NEWLINE_In 2018 she began voicing Fannie Granger on DreamWorks’s Spirit Riding Free, and in 2019 began voicing The Chief on Netflix’s animated Carmen Sandiego._NEWLINE_She will be portraying Zelma in the Tina Turner Broadway musical.
_START_SECTION_ Personal life _START_PARAGRAPH_ Lewis was married to former NBA player Johnny Newman in 2004. They divorced in 2006. + _START_ARTICLE_ International Space Year _START_PARAGRAPH_ The International Space Year (ISY) was 1992, the year of the quincentenary of Christopher Columbus's voyage to the Americas in 1492. First proposed by U.S. Senator Spark Matsunaga, the designation of 1992 as International Space Year was endorsed by 18 national and international space agencies, who also proposed the year's theme, "Mission to Planet Earth". Eventually, 29 national space agencies and 10 international organizations took part in coordinated activities to promote space exploration and the use of sustainable technology on Earth. _START_SECTION_ United Nations endorsement _START_PARAGRAPH_ The United Nations Committee on the Peaceful Uses of Outer Space agreed to recognize the International Space Year to promote peaceful cooperation between nations during its 1990 session. United Nations Secretary-General Boutros Boutros-Ghali, addressing the World Space Congress in Washington, D.C. on August 28, 1992 said, "One of the central goals of International Space year is to highlight the importance of understanding the Earth as a single, complex, interdependent system and to stress the unique role that space science and technology can play in promoting that understanding." _START_SECTION_ Global activities _START_PARAGRAPH_ International Space Year was celebrated by 29 space agencies in various countries with the purpose of establishing peaceful international relations in space programs. International Space Year conferences were held regularly in many nations. _START_SECTION_ Australia _START_PARAGRAPH_ In Australia, many public events were organized to augment public awareness of space by the National Space Society chapters of Australia. CSIRO led the "Mission to Planet Earth" Land Cover Change project, using satellites to study plant life on Earth in relation to climate and civilization. CSIRO and various Australian Universities also studied the ocean using European and Japanese satellites. Additionally, a series of commemorative stamps was issued by the Australia Post for International Space Year. _START_SECTION_ Japan _START_PARAGRAPH_ In Tokyo, Japan, a conference — the Asia-Pacific International Space Year Conference — was held to discuss the "Mission to Planet Earth" theme and international cooperation. _START_SECTION_ Russia _START_PARAGRAPH_ In Russia, the Foundation for Social Inventions launched Space Flight Europe-America 500 in an attempt to promote a peaceful social and economic relationship between the former Soviet states and the United States of America. Space Flight Europe-America 500 consisted of a Proton rocket carrying various items symbolizing peace, which orbited the Earth for a few days. The space craft was scheduled to land near Washington in late November. Its cost was estimated by Russian authorities at over US$200 million. _START_SECTION_ United States _START_PARAGRAPH_ In the United States, NASA, which led the US space agencies, responded to ISY with the completion or creation of many important space programs, including numerous collaborations with other domestic and international space agencies. A total of twelve programmes were launched, the most in any year up to that point. 
NASA focused particularly on projects — such as the Mars Observer, which studied the atmosphere and climate of Mars — that examined the possibility of sustaining human life outside Earth, as well as those exploring problems that existed on Earth at the time. ISY was also recognized with the opening of a new exhibit, entitled "Where Next, Columbus?" at the National Air and Space Museum. _START_SECTION_ They Might Be Giants _START_PARAGRAPH_ Alternative rock band They Might Be Giants were designated by NASA as the "Musical Ambassador" of the International Space Year when they were searching the NASA archives for images for their album, Apollo 18. The title of the album came directly from the NASA Apollo program—the last mission of which was Apollo 17. Accordionist and singer/songwriter John Linnell jokingly speculated that an album named Apollo 18 would be a cheaper alternative to actually manning a flight to the Moon as part of the International Space Year, although the album title was selected prior to the band's involvement with ISY. In support of the celebration, the album's back cover artwork and some promotional materials feature the International Space Year logo. Linnell explained that "[the band is] supposed to be included on lists of events happening in connection with International Space Year...In other words, on a particular month they'll say in some town there's this lecture about space telescopes and then there's this They Might Be Giants concert." On a different occasion, however, he pointed out that he "[didn't] think most people have heard that this is International Space Year". + _START_ARTICLE_ Pteraspidomorphi _START_SECTION_ Characteristics _START_PARAGRAPH_ Pteraspidomorphs are characterized by their massive dermal head armour having large, median, ventral and dorsal plates or shields._NEWLINE_The fossils show extensive shielding of the head. Many had hypocercal tails in order to generate lift to increase ease of movement through the water for their armoured bodies, which were covered in dermal bone. They also had sucking mouth parts and some species may have lived in fresh water._NEWLINE_Most pteraspidomorphs were marine, but lived very near to the shore, in lagoons and deltas. Some groups are thought to have been fresh water-dwelling. They were certainly bottom-dwellers, as shown by traces of abrasion of the ventral surfaces of their headshields. _START_SECTION_ Classification _START_PARAGRAPH_ Pteraspidomorphs have been first regarded as related to bony fishes, then to sharks, then ancestral to hagfishes, and finally as the closest jawless relatives of the gnathostomes._NEWLINE_This last theory was based on the fact that they seem to have a paired olfactory organ and a sensory-line pattern which is quite similar to that of the gnathostomes. These characteristics are, however, likely to be general for either the vertebrates or, at any rate, for the ensemble of all ostracoderms and the gnathostomes. Other ostracoderms, such as the Galeaspida are now known to have a paired olfactory organ. Current phylogenetic analysis using a large number of characteristics now place pteraspidomorphs as the sister-group of all other ostracoderms and the gnathostomes. + _START_ARTICLE_ Black Park _START_SECTION_ Wildlife _START_PARAGRAPH_ Black Park SSSI has heath, alder carr - both rare in the county - mixed and coniferous woodland and some areas of acid grassland. It has a varied fauna, and insects include the nationally rare Roesel's bush cricket. 
There are eighteen species of butterfly, birds including hobbies and nightjars, and snakes and lizards. _START_SECTION_ Filming location _START_PARAGRAPH_ Black Park is adjacent to Pinewood Film Studios and has been used as an outdoor location for many film and television productions. The woods and lake featured prominently in the Hammer Horror films from the late 1950s to the 1970s, including The Curse of Frankenstein (1957), The Brides of Dracula (1960), The Curse of the Werewolf (1961) and Dracula: Prince of Darkness (1966). In these films the location was often used to represent Transylvania. The park has also been used in film productions such as the James Bond film Goldfinger, where it was used for a night car chase scene (actually set in Switzerland and featuring Bond's Aston Martin DB5), and the 2006 version of Casino Royale, plus several Carry On films, Wombling Free, Batman, Hawk the Slayer, Sleepy Hollow, Bugsy Malone, the Harry Potter film series, Captain America: The First Avenger, Robin Hood, 47 Ronin,Eden Lake, and the Monty Python film And Now for Something Completely Different._NEWLINE_In television, Black Park, together with its lake, was used extensively in location filming for the planet Alzarius in the 1980 Doctor Who serial Full Circle, and was employed again two years later in the recording of the Restoration-era set serial The Visitation. Dressed with fake cobwebs, it was also used for the filming of the early Blake's 7 episode The Web. _START_SECTION_ Recreation and sports _START_PARAGRAPH_ Black Park is popular with walkers and dog owners due to the wide open spaces and well-maintained routes. _NEWLINE_During summer 2010 a 'Go-Ape' activity centre was established in the park with the construction of climbing rigging and zip lines between the trees. The area is properly supervised by park staff during opening hours. The Go Ape team now offers cycle hire._NEWLINE_Runners are commonplace within the park and the increase in private persons using the park for exercise/training has led to the establishment of a Parkrun event on Saturday mornings. The professionally organised events are free to enter and form part of a network of nationwide parkruns._NEWLINE_Mountain biking is popular in the park as the combination of dense woodland, open plains, technical sections and narrow but quick draining trails make for exciting riding. Beyond Black Park XC10 is an annual event organised in conjunction with Black Park staff by West Drayton Mountain Bike Club and Beyond Mountain Bikes of Surrey. The event attracts riders of all ages and skill levels._NEWLINE_The lake is open for fishing during the normal rod licence season, though pre-baiting, keep nets and night fishing are all forbidden. _START_SECTION_ Black Park at war _START_PARAGRAPH_ During both World War One and Two the Park saw service for the Empire with troops from the Canadian Forestry Regiment helping to farm the Park and harvest the wood, for use in the trenches of France or building air strips in France for the Royal Flying Corps. To this day the lines of trees they planted can still be clearly seen._NEWLINE_Sadly one of the Forestry Regiment never went home after being killed in a road traffic accident on the nearby Crooked Billet Roundabout. He is buried in the nearby St Margaret's Church, Iver Heath. 
Since 2007 the local Scout Group, 1st Iver Heath have laid poppies on his grave, as part of the Centenary of Scouting and an event called 'Uniform Day 007' that featured a representative of the Canadian Army who helped the Scouts' routine of laying a wreath for this young soldier many miles from home._NEWLINE_On the fields between the park and Iver Heath near Pinewood Studios, a World War One fighter crashed on its way to France after stopping off in Iver Heath. In World War Two a V2 rocket fell very close by the site of the fighter's location._NEWLINE_The Park was also used to store military supplies hidden amongst the trees from enemy surveillance, as was nearby Langley Park. _START_SECTION_ Geology _START_PARAGRAPH_ The park is the type site for Black Park Gravel Member, a layer of sand and gravel dating to the Anglian ice age, around 450,000 years ago. + _START_ARTICLE_ Johannes Adam Simon Oertel _START_SECTION_ Biography _START_PARAGRAPH_ After studying art in Germany at Nuremberg and Munich, he practiced engraving until 1848, in which year he came to the United States and taught for a time in Newark, New Jersey. In 1851, he married Julia Adelaide Torrey. They eventually had four children. After his marriage, he engraved plates for bank notes, painted portraits and colored photographs. In 1857 he was elected an associate of the National Academy of Design. In 1857 he moved to Madison, New Jersey, where he painted Lament of the Fallen Spirits and Redemption. _NEWLINE_ About this time, he was invited to assist in preparing new decorations for the capitol in Washington. In 1861 he transferred his studio to Westerly, Rhode Island, where he painted Father Time and his Family and The Final Harvest (1862), The Dispensation of the Promise and the Law (containing 150 figures, 1863), Walk to Emmaus, The Walk to Gethsemane, Easter Morning, Magdalen at the Sepulchre, The Rock of Ages, and others (1868)._NEWLINE__NEWLINE_His painting Rock of Ages became enormously popular and was reproduced in millions of photographs and chromolithographs for sale both in the United States and England. _START_SECTION_ American Civil War _START_PARAGRAPH_ During the American Civil War, Oertel accompanied the Army of Virginia under General Burnside for several months in 1862. His Virginia Turnpike and other landscapes were the fruit of this military experience. He also did some historical battle scenes, such as the "Battle of Sullivan's Island" that happened during the American Revolutionary War, and some illustrations for Harper's Weekly, such as the cover for November 15, 1864 issue, of "Convalescent Soldiers Passing through Washington, DC, to Re-join their Units" and "The Union Scout"._NEWLINE_While residing at Westerly, he prepared himself for orders in the Episcopal church, and he was made deacon in 1865, and subsequently presbyter. He then confined himself almost entirely to the domain of Christian art, and painted pictures that he presented to churches in Glen Cove, New York, New York City, Washington, D.C., North Carolina, and elsewhere._NEWLINE_He was in Washington, D.C., during the funeral of President Abraham Lincoln on April 19, 1865, and left an eloquent account of the event. _START_SECTION_ St James Episcopal Church _START_PARAGRAPH_ The Rev. Johannes Oertel served as the priest of St James Episcopal Church in Lenoir, North Carolina, from 1869 to 1874. 
He was one of the first in the valley to offer a school for African American children, and offered religious services to those recently freed from slavery, including baptism, confirmation, marriage and funeral rites._NEWLINE_The reredos in front of the church is an outstanding example of his woodworking skills. Made from over four hundred pieces of chestnut, oak, poplar, holly, cherry, beech, and pine, it was carved largely during missionary trips to the Chapel of Rest in Happy Valley, North Carolina, and the Chapel of Peace in Witnel, North Carolina. The architectural design is Gothic perpendicular from the 14th and 16th centuries. While Rev. Oertel carved other reredos and altar rails, the one in St. James is considered to be the most intricate and notable. The altar painting (1872) is layered oil on canvas with gold gilt, and depicts Jesus administering Holy Communion to male and female communicants._NEWLINE_While he was at St. James, friends in New York donated the 100-year-old pump organ from Christ Episcopal Church (Tarrytown, New York). The organ, dating from about 1770, was the first instrument to enhance the service in Lenoir. Rev. Oertel rebuilt the damaged organ, making new pipes, a new wind chest and bellows. He then carved an illuminated case for the organ works._NEWLINE_By the main door of the church is "Father Time and His Family" (1862, charcoal and pen on paper), which was completed in Westerly, Rhode Island. It depicts Father Time, his wife (the year) and their children (the months). Each child carries an item from the cornucopia representative of their month._NEWLINE_A collection of his art is held by the church, and includes: "The Wandering Jew" (1902?, oil on canvas); "Capturing Wild Horses" (print); "Founded Upon a Rock" (1900, oil on canvas); "Rock of Ages" (offset lithography), known as his most popular work; "Man Rowing Out on the Sea of Life With Christ as Pilot" (1880, oil on canvas); "In Memorium" (between 1880 and 1900, oil on canvas board); "Christian Hope" (1867, oil on canvas); "Head of St Paul" (oil on canvas, unknown date); "Expulsion from the Garden of Eden" (1893, oil on canvas); "Prophecy of Balaam" (1891, monochrome oil on canvas); "The Four Evangelists" (1884, monochrome oil on canvas); "Lament of the Fallen Spirits" (1850, oil on canvas); "Mary Magdalene at the Cross" (ca 1902, oil on canvas); "The Good Shepherd" (1878, oil on canvas); "The Prophet Jeremiah" (oil on canvas, unknown date); "The King of Truth" (1903, oil on canvas); "The Prophet Joel"; "The Prophet Ezekiel"; "The Prophet Isaiah"; "The Unknown Prophet"; "The Dispensation of Promise and the Law" (1864-1865, chalk and ink on linen-backed paper)._NEWLINE_He had charge of two parishes in North Carolina (in Lenoir) until 1876. He moved around a great deal as a priest and spent time in North Carolina, Florida, Tennessee, St. Louis, Washington DC, Maryland and Virginia. _START_SECTION_ Portrait painter _START_PARAGRAPH_ During his time, Johannes Oertel was also well known as a portrait painter, and he would often leave the church in Lenoir, North Carolina, to go north to raise money by painting portraits. Many of his head and bust portraits were popular after the Civil War, and he did a number of them for prosperous clients in New England. He made an interesting portrait of the Mayor of Providence, Rhode Island, Thomas A. Doyle, as a young man on his way up.
He would later serve eighteen years as the mayor, and brought Providence, Rhode Island from a manufacturing town to a small metropolis._NEWLINE_The Rev. Oertel is also known for his head of St Paul, held today by the St. James Episcopal Church, and portrayed as a weary but stern man. _START_SECTION_ Later years _START_PARAGRAPH_ Oertel was an instructor of art at Washington University in St Louis, in 1889-91. He spent the last 18 years of his life in a town near Washington DC, where he made many religious paintings and wood carvings. He painted a series of four large pictures entitled The Plan of Redemption which he presented to Sewanee (the University of the South in Tennessee). His last major work came in 1906-07 when he created the paintings and designed the new woodwork for the altarpiece of the Cathedral at Quincy, Illinois._NEWLINE_Oertel died in Vienna, Virginia, where he was living with one of his sons, and is buried in Flint Hill Cemetery in nearby Oakton. Collections of his papers are held by the libraries of George Washington University and the University of North Carolina at Chapel Hill. + _START_ARTICLE_ The Viral Fever _START_SECTION_ Early days _START_PARAGRAPH_ After graduating from IIT Kharagpur, Kumar quit his job as a consultant for US Air Force to try his hand at production jobs, assisting Farah Khan on Om Shanti Om. After a few production jobs, Kumar began to write and produce his own short films and videos._NEWLINE_Arunabh Kumar, along with long-time friends and IIT alumni, Amit Golani, and Biswapati Sarkar began pitching shows aimed at the youth to MTV. Faced with rejection, The Viral Fever was founded when the group came together and released a video titled Rowdies on YouTube starring Deepak Kumar Mishra and Naveen Kasturia. The runaway success of the video prompted Arunabh to create the YouTube focused video company, The Viral Fever, in 2012. _START_SECTION_ Growth on Youtube (2012–2014) _START_PARAGRAPH_ The Viral Fever released the first Barely Speaking with Arnub video with an interview of Shah Rukh Khan, and an appearance by then and current Delhi Chief Minister, Arvind Kejriwal. Biswapati Sarkar's parody of Indian news anchor Arnab Goswami was widely appreciated. TVF's content was dominated by parodies during these years with videos like Gaana Waala Song, Gangs of Social Media and Munna Jazbaati contributing to the growing popularity of The Viral Fever. The channel was recognised as one of the first success stories of original digital content in India. _START_SECTION_ Pioneering web-series in India (2015–present) _START_PARAGRAPH_ After a few years of creating "viral videos", The Viral Fever released India's first web-series, Permanent Roommates, in 2015. Featuring then unknown actor Sumeet Vyas and Nidhi Singh, Permanent Roommates has been watched over 5 crore (50 million) times. Pitchers was lauded as one of the best shows in recent memory in Indian entertainment for capturing the zeitgeist of the Indian startup scene. In 2016, The Viral Fever released the smash hit Tripling, and the widely applauded Humorously Yours. Other TVF network channels like Girliyapa, The Screen Patti and The Timeliners have also successfully debuted web-series. _START_SECTION_ TVFPlay & funding from Tiger Global (early 2016–present) _START_PARAGRAPH_ TVF debuted their platform, releasing the final two episodes of Pitchers on TVFPlay. The platform saw 10 lakh (1 million) hits in the first two days and crashed for 3 hours. 
In early 2016, venture capital firm Tiger Global Management invested $10 million into The Viral Fever, acquiring a 20% stake in the company. The Viral Fever has since launched other YouTube channels for original content: Girliyapa, a female-run channel, The Screen Patti, and The Timeliners, headquartered in New Delhi. TVF currently has offices in Mumbai, New Delhi, and Palo Alto. _START_SECTION_ Branded entertainment _START_PARAGRAPH_ The Viral Fever is one of the most sought-after creators of branded original content in India. TVF claims to have worked with over 150 brands. Some major companies to have worked with The Viral Fever or allied channels in the past include Procter & Gamble, Ola, Flipkart, Vodafone, Bharti Airtel, OnePlus, Xiaomi, Nokia and Tata Motors. _START_SECTION_ The Making Of... (2014) _START_PARAGRAPH_ The Making Of... is a series about the making of entertainment products in India, ranging from a blockbuster movie to a decade-long TV soap. The first season of The Making Of... comprised five episodes, with a standalone episode released for Season 2 in 2016. _START_SECTION_ Permanent Roommates (2014–present) _START_PARAGRAPH_ Permanent Roommates is an Indian web series created by TVF and Biswapati Sarkar. The series revolves around a young couple, Tanya and Mikesh, who, after being in a long-distance relationship for 3 years, face the prospect of marriage. The first season was released on YouTube on October 29, 2014. The second season was released on TVFPlay, The Viral Fever's video streaming medium, on February 14, 2016. Permanent Roommates has been renewed for a third season, rumoured to premiere in 2018._NEWLINE_Permanent Roommates was lauded for its portrayal of live-in relationships in conservative urban Indian families. Actors Sumeet Vyas and Nidhi Singh have gone on from Permanent Roommates to be showcased in Bollywood films. _START_SECTION_ Barely Speaking with Arnub (2014–present) _START_PARAGRAPH_ A talk show starring Biswapati Sarkar as Arnub with a U, parodying Indian news anchor Arnab Goswami. Barely Speaking with Arnub was picked up after the success of an earlier video titled "Bollywood Aam Aadmi Party" featuring Jitendra Kumar, Nidhi Bisht, and novelist Mayank Shekhar. The parody talk show has been lauded for its portrayal of the loud and boisterous nature of Indian news, where anchors prefer theatrics over nuance. Season one of the show opened with an interview with Shah Rukh Khan. Popular celebrities who have appeared on the show for an interview with Sarkar include Delhi Chief Minister Arvind Kejriwal, Ranveer Singh, Parineeti Chopra, Sunny Leone and Chetan Bhagat. In the first two seasons, Biswapati Sarkar, Amit Golani, Vipul Goyal, Shivankit Parihar, Jasmeet Singh Bhatia, and Abhishek Upmanyu have been writers on the show._NEWLINE_Barely Speaking with Arnub returned for a shortened season two in 2016. The show has been on hold as writer Biswapati Sarkar focuses on writing web series, including the sequel to TVF Pitchers. _START_SECTION_ TVF Pitchers (2015) _START_PARAGRAPH_ TVF Pitchers is an Indian web series created by The Viral Fever (TVF) and developed by Arunabh Kumar with assistance from others. The first season consists of five episodes and premiered online on The Viral Fever's content portal TVFPlay on 3 June 2015. A week later, on 10 June, it premiered on YouTube. The season finale premiered on TVFPlay on 30 August 2015.
It follows four friends, Naveen, Jitu, Yogi and Mandal, who quit their jobs in order to develop their own start-up company._NEWLINE_In 2016, TVF announced December 2017 as the projected release of Season 2 of Pitchers with the last scene of Permanent Roommates. _START_SECTION_ TVF Tripling (2016) _START_PARAGRAPH_ TVF Tripling is an Asian Television Award-winning Indian web series created by The Viral Fever. It traces the story of three siblings, Chandan, Chanchal & Chitvan, who together embark on a hilarious journey to find themselves and mend their relationships. It features Sumeet Vyas, Maanvi Gagroo and Amol Parashar, was written by Sumeet Vyas and Akarsh Khurana along with other contributors, and has won several awards, including a Kyoorius Blue Elephant. The Viral Fever partnered with Tata Motors for the project to promote the newly launched Tata Tiago. Tripling was recognised as one of the best web-series of 2016 and is a benchmark of success in Indian branded content. _START_SECTION_ Chai Sutta Chronicles (2013–present) _START_PARAGRAPH_ Chai Sutta Chronicles is TVF's Jim Jarmusch-inspired series about conversations between friends. Each episode deals with a theme and a conversation over a cigarette and a cup of tea. Season 1 of Chai Sutta Chronicles aired in 2013, with a season 2 released over 2017-18. _START_SECTION_ Tech Conversations With Dad (2014–present) _START_PARAGRAPH_ An ongoing TVF series about Jeetu and his conversations with his father - often about technology, relationships, and families in India. This is TVF's longest-running digital title, with 9 videos in 4 years. _START_SECTION_ Bisht, Please! (2017) _START_PARAGRAPH_ Nidhi Bisht's solo vehicle released in 2017 - about a small-town Indian girl who moves to the city. _START_SECTION_ Humorously Yours (2016-present) _START_PARAGRAPH_ Long-time TVF writer and stand-up comic Vipul Goyal features in a semi-autobiographical story about the life of a stand-up comic. The show furthered TVF's reputation for story-telling in the Indian context. Humorously Yours has been renewed for a second season, which was released in 2019. _START_SECTION_ F.A.T.H.E.R.S (2017) _START_PARAGRAPH_ A Tech Conversations with Dad spin-off, Fathers features three veterans of the Indian television screen - Brijendra Kala, Gajraj Rao, and Atul Srivastava - playing three fathers trying to keep up with the times. _START_SECTION_ Inmates (2017) _START_PARAGRAPH_ Inmates is a sitcom about living in Mumbai in the 21st century, starring Ashish Verma, Mukti Mohan and Aakansha Thakur, along with writers Raghav Raj Kakkar and Kashyap Kapoor. _START_SECTION_ Bachelors (2016–present) _START_PARAGRAPH_ Bachelors is a show about four bachelors who live together and face hilarious situations as they take on the world after college. In 2016, TVF released a pilot episode of Bachelors featuring popular YouTube star Bhuvan Bam. The success of the pilot led to a 4-episode extension with Bhuvan and over 25 million views. Bachelors was picked up for Season 2, which was released in late 2017, with Jeetendra Kumar replacing Bhuvan Bam as the lead. Season 2 has also crossed 25 million views since release and was listed among the best web-series of 2017 by FilmCompanion and Indian Express. _START_SECTION_ The Aam Aadmi Family (2016) _START_PARAGRAPH_ The Aam Aadmi Family is an offering from TVF's sister channel, The Timeliners. It is an album of moments from the life of a middle-class family based in Delhi.
It has garnered an average of 8 million views across 3 seasons. _START_SECTION_ Flames (2018) _START_PARAGRAPH_ Flames is another offering from TVF's sister channel, The Timeliners. It is the story of a young romance unfolding as a chemical reaction. Studious Rajat falls for Ishita, the new girl at the tuition centre. Rajat's BFFs, Pandey and Anusha, find their friendship beginning to turn into a relationship. The equations of friendship evolve in the first season of this teenage romance. This series has already garnered an average of 5 million views. _START_SECTION_ Zeroes (2018) _START_PARAGRAPH_ Zeroes is a mini-series from TheScreenpatti. It is about four zeroes who come together with an almost delusional ambition to create a great company despite being already late in the startup race. Whether they rise above their label of zeroes forms the crux of the story. _START_SECTION_ Yeh Meri Family (2018) _START_PARAGRAPH_ Yeh Meri Family is a mini-series released by TVF. It is about a nuclear middle-class family living at the height of the fads of the 1990s. The story revolves around a 13-year-old boy who is average at academics and is constantly being bullied and blackmailed by his elder brother. It is a mixture of the drama and thrills of being a teenager. _START_SECTION_ Awkward Conversations With Parents (2018) _START_PARAGRAPH_ Awkward Conversations With Parents is a six-episode series produced by The Screen Patti. The story is about a modern boy named Ishan and his down-to-earth, traditional parents. Ishan is in his late teens, a year before he leaves home for college. He and his parents find themselves in strange conversations about the desires of a boy growing into an adult. The series covers various awkward topics such as condoms, wet dreams, etc. _START_SECTION_ Kota Factory (2019) _START_PARAGRAPH_ Kota Factory is a TVF original released in 2019. _START_SECTION_ Controversy _START_PARAGRAPH_ In March 2017, Arunabh Kumar found himself in the middle of a sexual harassment claim that was posted anonymously via Medium. TVF released a press release via Medium refuting the claims. Several women have come out with similar stories of harassment. An FIR has been filed against Kumar in this case._NEWLINE_A second FIR was then lodged against Arunabh Kumar, the sexual harassment-accused CEO and founder of The Viral Fever. Mumbai's Versova Police registered the FIR after another woman filed a complaint against Kumar over an incident she said took place in 2014. On 16 June 2017, Arunabh Kumar, accused in multiple sexual harassment cases, stepped down as TVF CEO. Dhawal Gusain now leads the company as CEO, with Karan Chaudhry stepping in as COO and Sameer Saxena, the creator of Permanent Roommates & Tripling, appointed CCO. + _START_ARTICLE_ Tobias Bridge _START_PARAGRAPH_ Sir Tobias Bridge fought for Parliament in the English Civil War, and served the Lord Protector Oliver Cromwell during the Interregnum. After the Restoration he served King Charles II._NEWLINE_During the English Civil War, Bridge fought for Parliament under Fairfax. During the Interregnum he was an active supporter of Oliver Cromwell and served on several influential committees.
From 1655 to 1659 he was a Colonel of Horse, and on the death of Charles Worsley he succeeded to the governorship of the Cheshire, Lancashire and Staffordshire district during the second half of 1656, under the Rule of the Major-Generals._NEWLINE_During the Second Commonwealth, in the immediate prelude to the restoration of the monarchy, he served as a major in Lord Lockhart's Regiment of Horse at Dunkirk, and after the restoration he was appointed Captain of Horse at Dunkirk, a post in which he took direct orders from the Governor of Dunkirk and King Charles II. He held the post until 1662, when Dunkirk was sold to France. On his return from Dunkirk he was commissioned into the Duke of Richmond's Regiment as a captain._NEWLINE_A year after he was knighted in 1666, Bridge went to Barbados as colonel of his regiment. In 1673 he commanded the local land forces against the Baron of Tobago in one of the many wars over that island. In 1674 he was admitted to the council of Barbados. He probably died in Bridgetown, a town named after him and the capital of Barbados. + _START_ARTICLE_ General Jewish Labour Bund in Belarus _START_PARAGRAPH_ The Belarusian chapter of the General Jewish Labour Bund was among the political parties forming the government and parliament of the Belarusian People's Republic, which gained independence in 1918._NEWLINE_Members of the Bund became members of the Parliament. Bund member Mojżesz Gutman even became a Minister without portfolio in the Government of the newly created republic and drafted its constitution. The Bund later left the government bodies of the BNR._NEWLINE_Contrary to the attitude of the Communist party in Ukraine and Russia proper, the Communist Party (Bolsheviks) of Lithuania and Belorussia agreed to integrate the local Bund, renamed the Belarusian Kombund in April 1920 after the Twelfth Conference of the Bund held on April 12–19, 1920 in Gomel, into its ranks as an autonomous Jewish Communist Party, thus not forcing individual members to join the party either directly or through the Yevsektsiya. They even demanded that the Yevsektsiya be dissolved into the Kombund. This seems, however, to have been a mere agreement on paper that was never implemented, a manoeuvre by the Communists to attract support from Bundists, as the Bund was more powerful than them in the Belarusian cities._NEWLINE_In 1921, at the Thirteenth Conference of what was still the "General Jewish Labour Bund in Lithuania, Poland and Russia", i.e. by then in Russia, Ukraine and Belarus, a majority of the Bundist delegates decided to dissolve the party; part of its membership joined the R.C.P.(B.) on the basis of the rules of admission, while the minority formed the Social Democratic Bund._NEWLINE_In West Belarus, which was part of interwar Poland, the remnants of the party finally merged into the Polish Bund, while many activists chose to join the Polish Communist Party. + _START_ARTICLE_ Albrecht v. Herald Co. _START_SECTION_ Background _START_PARAGRAPH_ Lester J. Albrecht, an independent newspaper carrier, bought from Herald Publishing Company at wholesale and sold at retail copies of Herald's morning newspaper, the St. Louis Globe-Democrat, under an exclusive territory arrangement terminable if a carrier exceeded the maximum retail price advertised by Herald. When Albrecht exceeded that price, Herald Co. protested to him and then informed Albrecht's subscribers that it would itself deliver the paper at the lower price. Herald Co. engaged an agency (Milne) to solicit Albrecht's customers.
About 300 of Albrecht's 1200 subscribers switched to direct delivery by Herald._NEWLINE_Herald Co. later turned these customers over, without cost, to another carrier (Kroner), who was aware of Herald's purpose and knew that he might have to return the route if Albrecht discontinued his pricing practice. Herald Co. told Albrecht that he could have his customers back if he adhered to the suggested price. Albrecht filed a treble-damage complaint which, as later amended, charged a combination in restraint of trade, in violation of section 1 of the Sherman Antitrust Act, among Herald, Albrecht's customers, Milne, and Kroner. Albrecht's appointment as carrier was terminated, and Herald required him to sell his route. Albrecht made the sale at a price found to be lower than it would have been but for the conduct of Herald Co._NEWLINE_The jury found for Herald Co. Albrecht moved for a judgment notwithstanding the verdict, asserting that, under United States v. Parke, Davis & Co., and like cases, the undisputed facts showed a combination to fix resale prices, which was per se illegal under § 1 of the Sherman Act. The district court denied the motion. _START_SECTION_ Ruling of Eighth Circuit _START_PARAGRAPH_ The court of appeals affirmed. It held that there could be no violation of § 1 of the Sherman Act, which requires concerted action, because Herald's action was unilateral. Herald was entitled to refuse to deal with Albrecht because he violated his contract requiring him to observe Herald's maximum price, and Herald was entitled to engage in competition with Albrecht because he was not entitled to exclusivity after violating the contract. _START_SECTION_ Ruling of Supreme Court _START_PARAGRAPH_ The Supreme Court reversed. Justice White wrote the opinion for the Court; Justice Douglas concurred, and Justices Harlan and Stewart dissented. _START_SECTION_ Majority opinion _START_PARAGRAPH_ The Court decided two principal points, one of which was later overruled. First, the conduct was not unilateral but rather was concerted. Second, maximum price-fixing was illegal per se, the holding later overruled by Khan. _START_SECTION_ The combination _START_PARAGRAPH_ Relying on the Parke Davis case, the Court found that Herald had put together a combination. In Parke Davis the "combination with retailers arose because their acquiescence in the suggested prices was secured by threats of termination; the combination with wholesalers arose because they cooperated in terminating price-cutting retailers." By the same token, "there can be no doubt that a combination arose between respondent, Milne, and Kroner to force petitioner to conform to the advertised retail price." Herald:_NEWLINE_hired Milne to solicit customers away from petitioner in order to get petitioner to reduce his price. It was through the efforts of Milne, as well as because of respondent's letter to petitioner's customers, that about 300 customers were obtained for Kroner. Milne's purpose was undoubtedly to earn its fee, but it was aware that the aim of the solicitation campaign was to force petitioner to lower his price. Kroner knew that respondent was giving him the customer list as part of a program to get petitioner to conform to the advertised price, and he knew that he might have to return the customers if petitioner ultimately complied with respondent's demands. He undertook to deliver papers at the suggested price, and materially aided in the accomplishment of respondent's plan.
Given the uncontradicted facts recited by the Court of Appeals, there was a combination within the meaning of § 1 between respondent, Milne, and Kroner, and the Court of Appeals erred in holding to the contrary._NEWLINE_Justice White pointed out other possible combinations that Albrecht might properly have argued existed. First, he could have claimed a combination between Herald and himself, at least "as of the day he unwillingly complied" with Herald's advertised price. Second, "he might successfully have claimed that respondent [Herald] had combined with other carriers because the firmly enforced price policy applied to all carriers, most of whom acquiesced in it." A third possible combination was between Herald and Albrecht's customers. _START_SECTION_ The price fix _START_PARAGRAPH_ Price-fixing agreements and combinations are illegal per se, including ones to fix maximum prices. In Kiefer-Stewart Co. v. Seagram & Sons, the Court pointed out, liquor distributors combined to set maximum resale prices. The court of appeals perceived no restraint of trade, but the Supreme Court reversed. It held "that agreements to fix maximum prices 'no less than those to fix minimum prices, cripple the freedom of traders, and thereby restrain their ability to sell in accordance with their own judgment.' " The Court said that it agreed with the Kiefer-Stewart decision:_NEWLINE_Maximum and minimum price-fixing may have different consequences in many situations. But schemes to fix maximum prices, by substituting the perhaps erroneous judgment of a seller for the forces of the competitive market, may severely intrude upon the ability of buyers to compete and survive in that market. Competition, even in a single product, is not cast in a single mold. Maximum prices may be fixed too low for the dealer to furnish services essential to the value which goods have for the consumer or to furnish services and conveniences which consumers desire and for which they are willing to pay. Maximum price-fixing may channel distribution through a few large or specifically advantaged dealers who otherwise would be subject to significant non-price competition. Moreover, if the actual price charged under a maximum price scheme is nearly always the fixed maximum price, which is increasingly likely as the maximum price approaches the actual cost of the dealer, the scheme tends to acquire all the attributes of an arrangement fixing minimum prices. It is our view, therefore, that the combination formed by the respondent in this case to force petitioner to maintain a specified price for the resale of the newspapers which he had purchased from respondent constituted, without more, an illegal restraint of trade under § 1 of the Sherman Act. _START_SECTION_ Concurring opinion _START_PARAGRAPH_ Justice Douglas agreed that the court of appeals erred, but considered that "this is a 'rule of reason' case." _START_SECTION_ Harlan dissent _START_PARAGRAPH_ Justice Harlan considered maximum price-fixing beneficial to the public:_NEWLINE_Other things being equal, a manufacturer would like to restrict those distributing his product to the lowest feasible profit margin, for, in this way, he achieves the lowest overall price to the public and the largest volume. When a manufacturer dictates a minimum resale price, he is responding to the interest of his [retailer] customers, who may treat his product better if they have a secure high margin of profits. 
When the same manufacturer dictates a price ceiling, however, he is acting directly in his own interest, and there is no room for the inference that he is merely a mechanism for accomplishing anticompetitive purposes of his customers._NEWLINE_Justice Harlan also disagreed that one who merely acquiesces engages in concerted action within the meaning of § 1 of the Sherman Act. _START_SECTION_ Stewart dissent _START_PARAGRAPH_ Justice Stewart considered that Herald was justified in fixing maximum prices to its ultimate customers, the consuming public, because that was a necessary defensive measure in the face of the territorial monopoly granted the distributors. By not permitting this, "The Court today stands the Sherman Act on its head." _START_SECTION_ Economic background _START_PARAGRAPH_ A newspaper's profits are determined by its circulation and the number of advertisements it sells. As in every circulation industry, circulation depends upon the price of a copy as well as the amount of advertising. Similarly, the demand for advertising space is determined by circulation: the higher the circulation, the higher the demand for advertising space. The profit-maximizing newspaper monopolist therefore sets his copy price taking into account the cost per copy, the marginal cost of advertisement, the traditional price elasticity of demand, and a term that captures the feedback effect of lower copy prices inducing more advertising and vice versa. Most important is the term that captures the marginal advertising profit from selling additional advertising due to increased circulation. The newspaper monopolist's optimal price is therefore lower than for traditional monopolists in non-circulation industries. + _START_ARTICLE_ Gudastviri _START_SECTION_ Construction _START_PARAGRAPH_ The gudastviri is made up of two main parts: the first is a whole sheep or goat skin, or a sewn, rectangular leather bag ("guda"); the second is a yoked double chanter ("stviri"), terminating in a single horn bell, which makes the gudastviri a member of the hornpipe class of bagpipes._NEWLINE_There is a small wooden blow-pipe (khreko) with a check-valve tied into one leg, or corner, of the bag. A fixed round wooden stock holding the chanter is tied into the bag in the opposite foreleg, or corner. The chanter itself has two wooden pipes (dedani) of equal length, bore and wall thickness, which are inserted into the stock. The left chanter tube, the "leader", has the most finger holes; it is also called the "teller" or "beginner". The right chanter tube, the "bass", is called mebane or "deep voice producer". This bass pipe has three front-facing holes, and the "beginner" has six holes (but the Adjaran chiboni's leader pipe has only five holes). _NEWLINE_The three bottom holes of the left pipe are placed symmetrically across from the three holes of the right pipe. _START_SECTION_ Tuning _START_PARAGRAPH_ The Adjaran chiboni has a diatonic scale. It can produce two-part chords and two-part tunes. The two parts are produced by the simultaneous sound of both dedanis. The player's left hand plays the highest notes of the scale on the left chanter tube, while the fingers of the player's right hand cover and uncover the lower notes of the scale, which is made possible by the limited number of finger holes (only 3 or 4 holes) disposed lower down, toward the distal end of the right chanter tube. _NEWLINE_The compass of a chiboni is a major sixth (but the Rachian gudastviri's diapason can be a minor or a major seventh).
The ends of the pipes are fixed inside the resonator/horn. The horn is made of Caucasian goat or bull horn. The gudastviri is decorated with silver bands, mounted with colored glass beads, and numerous small chains. There is a ball of cotton wool inserted into the open end of the horn to absorb the moisture formed during playing. The bag (guda) can have a bag cover of cloth or leather, or have the natural goat hair left on the outside of the bag. _NEWLINE_The six holes of the left reed pipe emit notes of the first octave: F, E, D, C, B, A, G; the three holes of the right one emit deep-voiced notes: C, B, A, G. _START_SECTION_ Playing and application _START_PARAGRAPH_ The gudastviri is used for vocal accompaniment. A majority of recitative songs in the region of Racha were performed with its accompaniment. The gudastviri player's repertoire consists of historical, epic, satirical, comic, and lyrical verses, which are performed as one-part songs. These songs are recitatives, and it is the text, not the melody, that is the most important part of the performance. _NEWLINE_Traditionally, only men play this instrument, and Rachian gudastviri players were strolling musicians who were welcomed as guests at every family celebration, party, or wedding. It was a kind of profession that served as the main source of the player's income. Gudastviri players often took part in the old Georgian improvisation competition known as berikaoba, where they had to invent witty epic, lyrical or comical poems on the spot and recite these poems, accompanied by gudastviri music. _NEWLINE_The competition was often won by the most skillful berika (participant). _START_SECTION_ Design and development _START_PARAGRAPH_ In the region of eastern Javakheti, gudastviri pipes are made of very young branches of a dog-rose. The gudastviri itself is normally constructed by the player to his tastes. Jewellers may also be hired to ornament the instrument._NEWLINE_Among the kinds of Georgian gudastviri, the most developed is the Adjarian chiboni. As for the gudastviri of Pshavi, it belongs, perhaps, to an earlier stage of development, as it has only one hole on the bass chanter, possibly indicating this instrument's early origin. + _START_ARTICLE_ Ignacio Cáceres _START_SECTION_ Biography _START_PARAGRAPH_ He was the silver medallist in the 10,000 m at the 2001 Summer Universiade, and the following year he finished twelfth in the event at the 2002 European Championships. He was selected for the marathon team at the 2010 European Athletics Championships, but failed to finish the race._NEWLINE_He set a personal best at the 2012 Rotterdam Marathon, taking ninth place in a time of 2:11:58 hours. Also in 2012, he competed in the marathon at the Olympic Games, finishing in 31st place. + _START_ARTICLE_ Masayoshi Tomizuka _START_SECTION_ Career _START_PARAGRAPH_ Tomizuka joined the faculty of the Department of Mechanical Engineering at the University of California, Berkeley in 1974. He served as Vice Chair of Mechanical Engineering in charge of instruction from December 1989 to December 1991, and as Vice Chair in charge of graduate studies from July 1995 to December 1996. Since June 11, 2009, he has been Executive Associate Dean for the College of Engineering at UC Berkeley. He also served as Program Director of the Dynamic Systems and Control Program at the National Science Foundation from Sept. 2002 to Dec. 2004.
_START_SECTION_ Research interests _START_PARAGRAPH_ Tomizuka's current research interests include optimal and adaptive control, digital control, signal processing, motion control, and control problems related to robotics, manufacturing, data storage devices, vehicles and human-machine systems. _START_SECTION_ Society activities _START_PARAGRAPH_ Tomizuka has long been an active member of the Dynamic Systems and Control Division (DSCD) of the American Society of Mechanical Engineers (ASME). He served as chairman of the Executive Committee of the Division (1986–87), Technical Editor of the ASME Journal of Dynamic Systems, Measurement and Control, J-DSMC (1988–93) and Editor-in-Chief of the IEEE/ASME Transactions on Mechatronics (1997–99). He served as President of the American Automatic Control Council (1998–99). He chairs the IFAC Technical Committee on Mechatronic Systems. He is a Fellow of the ASME, the Institute of Electrical and Electronics Engineers (IEEE) and the Society of Manufacturing Engineers. He is the recipient of the J-DSMC Best Paper Award (1995), the DSCD Outstanding Investigator Award (1996), the Pi Tau Sigma-ASME Charles Russ Richards Memorial Award (1997), the DSCD Leadership Award (2000), the Rufus Oldenburger Medal (2002) and the John Ragazzini Award (2006). The Oldenburger Medal was awarded to him for his seminal contributions in the areas of adaptive control, preview control and zero-phase control. + _START_ARTICLE_ A-normal form _START_SECTION_ Grammar _START_PARAGRAPH_ The following BNF grammar describes the pure λ-calculus modified to support the constraints of ANF:_NEWLINE_ EXP ::= VAL_NEWLINE_ | let VAR = VAL in EXP_NEWLINE__NEWLINE_ VAL ::= λ VAR . EXP_NEWLINE_ | VAR_NEWLINE_ | VAL VAL_NEWLINE_Variants of ANF used in compilers or in research often allow constants, records, tuples, multi-argument functions, primitive operations and conditional expressions as well. _START_SECTION_ Examples _START_PARAGRAPH_ The expression:_NEWLINE_f(g(x),h(y))_NEWLINE_is written in ANF as:_NEWLINE_let v0 = g(x) in_NEWLINE_ let v1 = h(y) in_NEWLINE_ f(v0,v1) (an illustrative conversion sketch follows below) + _START_ARTICLE_ Prometheus Global Media _START_SECTION_ Founding _START_PARAGRAPH_ On December 10, 2009, the Nielsen Company announced that it would sell its Business Media division, which included brands such as Adweek, Billboard, and The Hollywood Reporter, to a new company known as e5 Global Media, a joint venture between Guggenheim Partners and Pluribus Capital Management, a company led by James Finkelstein, Matthew Doull, and George Green. Two Nielsen properties, Editor & Publisher and Kirkus Reviews, were not included in the sale, and were to be shut down. Editor & Publisher would instead be sold to the Duncan McIntosh Company, and Kirkus Reviews would be sold to Herbert Simon. The company's first CEO was Richard Beckman, previously an executive and publisher at Condé Nast and Fairchild Publications, and former publisher of the magazines GQ and Vogue. Beckman's career suffered a setback in 1999 following "some inappropriate behavior" resulting in injuries to Vogue's West Coast advertising director Carol Matthews, while Beckman was Matthews' publisher at Condé Nast._NEWLINE_Beckman's first major move was a re-launch of The Hollywood Reporter; with the hiring of Janice Min, formerly of Us Weekly, as editorial director, THR replaced its daily print publication with a weekly magazine, and performed a significant redesign to its website with an increased focus on breaking scoops.
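As a brief aside to the A-normal form grammar and example above, the following is a minimal, self-contained sketch of the conversion illustrated there, turning nested calls such as f(g(x), h(y)) into let-bound temporaries. The AST classes (Var, Call, Let), the fresh-name counter, and the to_anf helper are illustrative assumptions for this sketch, not part of any standard library or a canonical ANF implementation.

```python
from dataclasses import dataclass
from typing import List, Union

# Tiny illustrative expression AST; names are assumptions for this sketch.
@dataclass
class Var:
    name: str

@dataclass
class Call:
    func: "Expr"
    args: List["Expr"]

@dataclass
class Let:
    var: str
    value: "Expr"
    body: "Expr"

Expr = Union[Var, Call, Let]

_counter = 0

def fresh() -> str:
    """Return a fresh temporary name: v0, v1, ..."""
    global _counter
    name = f"v{_counter}"
    _counter += 1
    return name

def to_anf(expr: Expr) -> Expr:
    """Hoist nested calls into let-bindings so every call argument is atomic."""
    bindings: List[tuple] = []

    def atomize(e: Expr) -> Var:
        # A variable is already atomic; a nested call gets a fresh let-bound name.
        if isinstance(e, Var):
            return e
        bound = to_anf(e)
        name = fresh()
        bindings.append((name, bound))
        return Var(name)

    if isinstance(expr, Var):
        return expr
    if isinstance(expr, Call):
        func = expr.func if isinstance(expr.func, Var) else atomize(expr.func)
        args = [atomize(a) for a in expr.args]
        result: Expr = Call(func, args)
        # Wrap the now-atomic call in the collected bindings, innermost last.
        for name, value in reversed(bindings):
            result = Let(name, value, result)
        return result
    return expr  # Let nodes are assumed to be in ANF already in this sketch.

# f(g(x), h(y))  ==>  let v0 = g(x) in let v1 = h(y) in f(v0, v1)
source = Call(Var("f"), [Call(Var("g"), [Var("x")]), Call(Var("h"), [Var("y")])])
print(to_anf(source))
```

The sketch deliberately mirrors only the example: call arguments are atomized into fresh let-bindings, while lambda and let bodies are left untouched; a full ANF pass would also normalize those positions.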
The new format was meant to compete against up-and-coming blogs focusing on industry news, such as Deadline Hollywood and TheWrap, along with its then-struggling rival Variety. The changes had a significant impact on the publication's performance: by 2013, ad sales were up more than 50%, while traffic to the magazine's website had grown by 800%. In October 2010, the company was renamed Prometheus Global Media; named after the Greek mythological figure, Beckman stated in an internal memo that the new name would "[carry] more weight and gravitas in the marketplace." _START_SECTION_ Re-organization and acquisition _START_PARAGRAPH_ In late 2011, Prometheus went through a number of cost-cutting measures. In August 2011, Backstage was sold to a group of investors led by John Amato in a transaction funded by Guggenheim, and the following month, Prometheus laid off the staff responsible for the Hollywood Creative Directory and announced it had sold the publication._NEWLINE_In January 2013, Guggenheim Partners acquired the stake in Prometheus owned by Pluribus Capital, giving it full ownership; following the acquisition, former Yahoo! executive Ross Levinsohn was named as CEO of the new Guggenheim Digital Media division, which would oversee Prometheus and other digital assets for Guggenheim companies (such as Dick Clark Productions). In April 2013, Guggenheim re-acquired Backstage (which had also acquired Sonicbids, a platform for allowing musicians to book gigs online) and made its CEO John Amato president of the Billboard Group—a new group consisting of Billboard, Backstage, and Sonicbids._NEWLINE_In a January 2014 restructuring, Levinsohn was shifted to a business development role and no longer directly manages the Prometheus properties. Additionally, the company was split into two operating groups; an Entertainment Group was formed by merging The Hollywood Reporter into the Billboard Group, with Janice Min becoming co-president and chief creative officer of the group alongside Amato. The remaining properties, consisting of Adweek and Film Expo Group, are led by Jeff Wilbur._NEWLINE_On May 29, 2014, Prometheus announced it would acquire the publishing assets of Mediabistro—a network of websites focusing on various aspects of the mass media industry—which includes the media job listing site Mediabistro and its network of blogs such as AgencySpy, FishbowlNY, Lost Remote and TVNewser—for $8 million. The acquisition did not include Mediabistro's expo business, which were retained under the name Mecklermedia. On January 13, 2015, Adweek and Film Expo Group were merged into Mediabistro to form a new Prometheus subsidiary, Mediabistro Holdings. At the same time, its blogs were re-launched under the new "Adweek Blog Network" banner, and all of Mediabistro's social media-oriented blogs were merged into SocialTimes._NEWLINE_In March 2015, Guggenheim Partners reported that its president Todd Boehly was exploring the possibility of forming his own company. A representative stated that such a company would "likely be harmonious with Guggenheim, especially since Todd's role for some time has been strategic and transaction-oriented, rather than working in or managing any of our day-to-day businesses." On December 17, 2015, in response to losses across Guggenheim Partners, the company announced that it would spin out its media properties to a group led by Boehly, including the Hollywood Reporter-Billboard Media Group, Mediabistro, and Dick Clark Productions, all under their existing leadership. 
The resultant company is known as Eldridge Industries. + _START_ARTICLE_ Debora Menicucci _START_SECTION_ Personal life _START_PARAGRAPH_ She is married to the Venezuelan Supreme Court President, Maikel Moreno. + _START_ARTICLE_ Šajkaši _START_SECTION_ Personal armament _START_PARAGRAPH_ The Šajkaši were armed with sabres, spears and ordinary and mechanical arrows. Sometimes they wore helmets and shields. Their spears likely were longer than ordinary, set to be used at longer distances. They used arrows until the end of the 16th century when the arquebus had been perfected. Later, when gunpowder began to be widely used, the Šajkaši were armed with sabres, long spears and muskets. _START_SECTION_ Clothing _START_PARAGRAPH_ Their clothing was in dark-blue colour. _START_SECTION_ 16th century _START_PARAGRAPH_ Pavle Bakić commanded the Šajkaši in the service of Ferdinand, the Archduke of Austria and King of Hungary and Croatia. The Šajkaši participated in the Battle of Mohács (1526). After the battle the Šajkaši were still unpaid for their services. Ferdinand reprimanded the court for not having paid at least part of the unpaid salary to the Šajkaši. Bakić once again turned to Ferdinand, alerting him that the nonpayment to the Šajkaši would cause estrangement of the Serbs in his lands, and those of Zapolya and the Ottoman Empire. He also informed Ferdinand of the persecution of Serbs by the Austrian staff and officers. _START_SECTION_ 17th century _START_PARAGRAPH_ From all the writings in Serbian and German by the estimable Archimandrite Jovan Rajić (Johann Raics, 1726-1801), it has been established that the old Šajkaš Corps had their staff in the city of Komarno (Comorn, Hungarian: Komorom) which is along the upper Danube and that the personnel were under the Hungarian and Polish King Ladislaus III (1424-1444) and those following him until they were included under the rule of Leopold I (1640-1705)._NEWLINE_The old Šajkaš Corps was established in 1526 and dissolved in 1746._NEWLINE_By far the majority of the Šajkaši were Serbs who had come north and west as direct result of the Turkish advance into the Balkans. As Ottoman conquest continued throughout the 1500s, thousands fled north across the Danube into lands vacated by the Serbs, also moving away from the Turks. In addition, thousands of refugees, generally of the Orthodox faith, had entered the largely deserted lands of northern Medieval Serbia where few of the original natives had survived the brutal wars. These Serbs formed the nucleus of the military frontiersmen, beginning in the early 1500s, and of the river fleet, the Šajkaši formations._NEWLINE_In 1690, a major Serbian immigration took place. Some 30,000 families from Kosovo sought refuge across the Sava and Danube rivers among their kin folk after the Austrian supported revolt failed and left them defenceless in the face of Ottoman reprisals. _START_SECTION_ 18th century _START_PARAGRAPH_ These newly-arrived Serbs together with the members of the earlier Šajkaši Corps formed the new Šajkaši Battalion on the lower Danube. They defended the border between the Habsburg and Ottoman empires. _NEWLINE_Like the other regiments of the Austrian Military Frontier, the Šajkaši settled the deserted borderlands designated for them by the Austrian crown in exchange for military service. 
The Šajkaši battalion's first lower Danube villages were: Titel, Lok, Mošorin (Moschorin), Vilovo (Willova), Gardinovci (Gardinovatz) and Žabalj (Xablia, formerly Josefdorf)._NEWLINE_Six more villages were authorized on 7 June 1769: Čurug, Gospođinci, Šajkaš (St. Ivan), Upper Kovilj, Lower Kovilj, and Kać (Kaacs). In 1800 and 1801, two more -- Djurdjevo and Nadalj -- were also settled. On 1 January 1809, the battalion totaled six companies (a division)._NEWLINE_This battalion, along the model of the other regiments of the Military Frontier, was organized according to the standing order of the Austrian military-civil administration. The duties of and support for the Battalion's frontiersmen were the same as all other frontiersmen._NEWLINE_A history of the Šajkaš Battalion was written between 1842 and 1847 by one of its officers, Captain Jovan Trumić. Apparently the only surviving copy is now at the Serbian scholarly society, Matica srpska, in Novi Sad._NEWLINE_The original Trumić manuscript and other documents about the Šajkaši were brought to light in 2004 by Slavko Gavrilović, a Serbian scholar who specialized in the Šajkaš Battalion. _NEWLINE_Included in Captain Trumić's study are valuable statistics from 1844. There were 30,315 inhabitants on the lands of the Šajkaš Battalion at that time: 28,656 Serbs; 758 Germans; 528 Hungarians; 196 Wallachians; and 177 others. Within the jurisdiction of the Battalion, there were 28,275 Eastern Orthodox Christians (Non-Uniates, mostly Serbs); 1,627 Roman Catholics (includes Croatians); 329 Protestant Evangelists (Lutherans); 63 Uniates; and 21 Protestant Reformed (Calvenists)._NEWLINE_Among the additional documents Slavko Gavrilović published in 2004 is a list of the officers who served in the Šajka Battalion between 1762 and 1873. Captain Trumić is among them._NEWLINE_A total of 246 Serbian officers are listed. The list is interesting today for the large number of Serbian officers and for the details about their military service which provide information about the officers of both the Šajkaš Battalion and the Military Frontier._NEWLINE_The list is chronological. The dates and places correspond not only to assignments within the Military Frontier but also to postings in far-away wars waged by the Habsburg crown. The list reveals that these officers were transferred to those wars. _NEWLINE_After 1699, the service of the military frontiersmen was not limited to protecting the Habsburg Empire from the Ottomans. They were required, by regulation, to serve where called. The high command in Vienna viewed the Military Frontier as a vast pool of self-sufficient recruits for the Austrian military, and their participation in the major wars involving the Austrian crown seems to bear this out. _START_SECTION_ 19th Century _START_PARAGRAPH_ By an imperial decree of the Habsburg ruler, Empress Maria Theresa, a special unit of the Austrian Danube Fleet, the Šajkaš Battalion of the lower Danube, was created in 1763. It was abolished under another Habsburg, Emperor Franz Joseph, in 1872._NEWLINE_Of the Battalion's Serbian officers, 89 were promoted from the ranks of non-commissioned officers. Sixty came from several of the Military Frontier's other regiments which were either named for the regimental headquarter cities -- Brod, Gradiska, Ogulin, Otocac, Petrovaradin, Slunj, Titel and Varazdin -- or regions of the Military Frontier -- Banat, Banija, Lika, and Slavonia. Some were cadets or arrived fresh from the Military Frontier School in Vienna. 
_NEWLINE_The Trumić list includes the year each officer joined the Battalion and his rank and prior post. Also shown are each officer's promotions and the year he was transferred out of the Battalion as well as his rank and new post. _NEWLINE_A staggering number of military frontiersmen (graničari, as they were called in Serbian) from the regiments of the Military Frontier were called up to the Austrian armies engaged in various European wars, such as the Austrian War of Succession (1741-1748); the Seven Years' War (1756-1763); the Bavarian War of Succession (1778-1779); the wars against post-revolutionary France (1792-1800); the Napoleonic Wars (1805-1815); the Austro-Italian Wars (1848-1849, 1859, 1866) and the Revolution of 1848 and the wars against the Hungarians (1848-1849)._NEWLINE_Postings on the list include places in northern Italy, Mantua and Solferino, made famous in the Napoleonic Wars and in the Austro-Italian Wars. According to the list, Lieutenant Michael Stanisavljević was transferred to Mantua in 1784 and Captain Marcus Rajčević was killed in the Battle of Solferino in 1859. From the Šajkaš Battalion, Adjutant George Bešanović transferred to the Bosnian-Serbian Frei Corps in 1788. Lieutenant Gligorije Popović and Lieutenant Thimotie Zivković transferred to Count Gyulay's Frei Corps in 1793. Lieutenant Arsenije Sečujac transferred to the 3rd Serbian Frei Corps Battalion in 1813. _NEWLINE_During the Austro-Turkish War of 1788-1791, the Austrians organized Serbian volunteers into a special military force known as the Frei Corps (Free Corps), usually commanded by Austrian officers of Serbian descent._NEWLINE_Thirty-five were transferred to other regiments of the Military Frontier. Fifty-one died while serving in the Battalion, but whether the deaths were in the line of duty is not specified. Seventy-seven retired from the Battalion with pensions. The length of service varied from a few months to as many as 30 years, if not more. _NEWLINE_Several Serbs became majors, colonels and battalion commanders. In 1763, the year the Šajkaš Battalion was created, Theodor Stanisavljević was a Major and the Battalion Commander of the Petrovaradin Frontier Infantry Regiment. In 1773, he was a Colonel in the Šajkaš Battalion. He died in 1783._NEWLINE_Colonel Aron Stanisavljević, in 1813 after 35 years with the Battalion, was promoted to Brigadier General and Major-General and transferred to Banat. That same year, Lieutenant Colonel Johann Nepomuk Majdić became Battalion Commander._NEWLINE_In 1816, Captain Thimotie Zivković returned to the Battalion and was promoted to Colonel and Battalion Commander. In 1835, Colonel Franz Jankovic was appointed Major General and Commander of the Supreme Shipping Office in Vienna._NEWLINE_In 1849, Major Johann Bunčić was the Battalion Commander of the Ogulin Frontier Regiment and Adjutant to the Austrian-Serbian Army Corps when he joined the Šajkaš Battalion. The next year, he was transferred to the Petrovaradin Frontier Infantry Regiment as a Colonel. _START_SECTION_ Šajkaši Battalion _START_PARAGRAPH_ The Frontier Šajkaši Battalion (Krajiški šajkaški bataljon), known in German as Czaikisten-Bataillon, was active in the period 1763–1873. After the Treaty of Belgrade (1739) the Habsburg-Ottoman border was set up on the Danube and Sava rivers. The Šajkaši bands in Komarno, Esztergom, Györ and other places were abolished until the establishment of the Šajkaši Battalion in Bačka, between the Danube and Tisa, in 1763 upon decision of the Habsburg War Council.
The Serb colonising community which was employed in the battalion (the šajkaši) was given the Šajkaška region, which initially included six villages, eventually increased by eight. The battalion headquarters were in Titel. The battalion had four bands in 1769, with ca. 1,116 men, although it was constantly expanded. _START_SECTION_ Šajkaši migrations _START_PARAGRAPH_ Šajkaši families, Serbs, settled in Esztergom during the rule of Matthias Corvinus; a settlement in the lower town, called Srpska varoš, developed from the community._NEWLINE_A group of Serbian Šajkaši settled in Slovakia, where they continued their service, known in Slovak as čajkári. _START_SECTION_ Legacy _START_PARAGRAPH_ In Hungarian war annals, the clearest and also most vulnerable place is taken by the King's Šajkaši. They were the most important factors and participants in the victories of the Royal Army. Whenever there was a threat to Hungary, the Šajkaši were the main support of the territorial defence and the most reliable aid to the Royal Army._NEWLINE_Stationed at many locations, the most important Šajkaši units were those of Komárom, as this was the most important Imperial fortress in Hungary; they were stationed here up until the reign of Maria Theresa, when they were moved to South Bačka._NEWLINE_The šajkača hat is derived from the 18th-century Šajkaši in Banat. _START_SECTION_ Annotations _START_PARAGRAPH_ Another term used in German was Nassadisten (Serbian: насадисте/nasadiste). + _START_ARTICLE_ Path to Paradise: The Untold Story of the World Trade Center Bombing _START_SECTION_ Production _START_PARAGRAPH_ The film was shot in New Jersey and Manhattan. _START_SECTION_ Accolades _START_PARAGRAPH_ The American-Arab Anti-Discrimination Committee awarded the film the Advancing Tolerance award for Enhancing Intolerance. _START_SECTION_ Aftermath _START_PARAGRAPH_ The film was scheduled for a repeat broadcast on HBO the week of the 9/11 attacks. HBO pulled it from its schedule following the attacks. + _START_ARTICLE_ Marilyn Frye _START_SECTION_ Education and career _START_PARAGRAPH_ Frye received the BA with honors in philosophy from Stanford University in 1963 and received the PhD in Philosophy at Cornell University in 1969, writing a dissertation titled "Meaning and Illocutionary Force," under the supervision of Max Black. Before coming to Michigan State University in 1974, she taught in the Philosophy Department at the University of Pittsburgh. From 2003 until her retirement, Frye was University Distinguished Professor at Michigan State University; she also served as Associate Dean for Graduate Studies of the College of Arts and Letters. In 2008 she was the Phi Beta Kappa Romanell Lecturer. _START_SECTION_ Research and publications _START_PARAGRAPH_ Frye is the author of The Politics of Reality (1983), a collection of nine essays which has become a "classic" of feminist philosophy._NEWLINE_In her chapter entitled "Oppression" in the book Feminist Frontiers, Frye discusses the idea of the double bind in gender. This double bind refers to "situations in which options are reduced to a very few and all of them expose one to penalty, censure or deprivation". Frye applies this principle to gender and the dilemma women often face in her discussion of oppression. For example, it is not socially acceptable for a woman to be sexually active, nor for her to be sexually inactive, in which case she is labelled a "man-hater" or "uptight".
This absence of choice permeates so thoroughly into women's day-to-day life that even small things like how they choose to dress or talk are criticized. Frye acknowledges that men face issues as well, but differentiates the issues of men and women through the metaphor of a bird cage. Each individual bind women face can be thought of as a single bar in a cage: by itself, it isn't enough to contain the bird. But, with enough bars, the bird is trapped inside the cage, left with nowhere to go. This is the complete absence of choice Frye describes: how it is the culmination of issues women face that is so "immobilizing" and why their struggle, and not men's, is considered oppression._NEWLINE_Frye is openly lesbian, and much of her work explores social categories—in particular, those based on race and gender. + _START_ARTICLE_ Dmitri Kurakin _START_PARAGRAPH_ Dmitri Kurakin (born 5 June 1975 in Tallinn) is an Estonian ice dancer who also competed internationally for Germany. He originally competed with Anna Mosenkova for Estonia. They were multiple medalists at the Estonian Figure Skating Championships and competed at the World Figure Skating Championships, the European Figure Skating Championships, and the World Junior Figure Skating Championships. He teamed up with Jill Vernekohl in 2001 and they are the 2001 and 2002 German silver medalists._NEWLINE_Kurakin is the older brother of Juri Kurakin, who is also an ice dancer. + _START_ARTICLE_ Giorgi Japaridze _START_PARAGRAPH_ Giorgi Japaridze (also spelled Giorgie Dzhaparidze) is a Georgian-American researcher in logic and theoretical computer science. He currently holds the title of Full Professor at the Computing Sciences Department of Villanova University. Japaridze is best known for his invention of computability logic, cirquent calculus, and Japaridze's polymodal logic. _START_SECTION_ Research _START_PARAGRAPH_ During 1985-1988 Japaridze elaborated the system GLP, known as Japaridze's polymodal logic. This is a system of modal logic with the "necessity" operators [0],[1],[2],…, understood as a natural series of incrementally weak provability predicates for Peano arithmetic. In "The polymodal logic of provability" Japaridze proved the arithmetical completeness of this system, as well as its inherent incompleteness with respect to Kripke frames. GLP has been extensively studied by various authors during the subsequent three decades, especially after Lev Beklemishev, in 2004, pointed out its usefulness in understanding the proof theory of arithmetic (provability algebras and proof-theoretic ordinals)._NEWLINE_Japaridze has also studied the first-order (predicate) versions of provability logic. He came up with an axiomatization of the single-variable fragment of that logic, and proved its arithmetical completeness and decidability. In the same paper he showed that, on the condition of the 1-completeness of the underlying arithmetical theory, predicate provability logic with non-iterated modalities is recursively enumerable. In he did the same for the predicate provability logic with non-modalized quantifiers._NEWLINE_In 1992-1993, Japaridze came up with the concepts of cointerpretability, tolerance and cotolerance, naturally arising in interpretability logic. He proved that cointerpretability is equivalent to 1-conservativity and tolerance is equivalent to 1-consistency. The former was an answer to the long-standing open problem regarding the metamathematical meaning of 1-conservativity. 
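For readers who want a more concrete picture of the system GLP described above, its usual axiomatization in the provability-logic literature can be summarized as follows; the notation follows the article's [n] operators, and this is a sketch of the standard presentation rather than a quotation from Japaridze's papers.

```latex
% Japaridze's polymodal logic GLP, as usually axiomatized in the literature.
% Modalities [0],[1],[2],...; \langle n\rangle\varphi abbreviates \neg[n]\neg\varphi.
\begin{align*}
  &[n](\varphi\to\psi)\to([n]\varphi\to[n]\psi),
   \qquad [n]([n]\varphi\to\varphi)\to[n]\varphi && \text{for every } n,\\
  &[m]\varphi\to[n]\varphi,
   \qquad \langle m\rangle\varphi\to[n]\langle m\rangle\varphi && \text{for all } m<n.
\end{align*}
% Rules: modus ponens and, for each n, necessitation (from \varphi infer [n]\varphi).
```

Under the arithmetical reading mentioned above, [0] can be taken as ordinary provability in Peano arithmetic and each [n+1] as a strictly stronger provability predicate, which is what the axiom [m]p → [n]p (for m < n) reflects.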
Within this same line of research on interpretability, Japaridze constructed the modal logics of tolerance (1993) and of the arithmetical hierarchy (1994), and proved their arithmetical completeness._NEWLINE_In 2002 Japaridze introduced "the Logic of Tasks", which later became a part of his Abstract Resource Semantics on one hand, and a fragment of Computability Logic (see below) on the other hand._NEWLINE_Japaridze is best known for founding Computability Logic in 2003 and making subsequent contributions to its evolution. This is a long-term research program and a semantical platform for "redeveloping logic as a formal theory of (interactive) computability, as opposed to the formal theory of truth that it has more traditionally been"._NEWLINE_In 2006 Japaridze conceived cirquent calculus as a proof-theoretic approach that manipulates graph-style constructs, termed cirquents, instead of the more traditional and less general tree-like constructs such as formulas or sequents. This novel proof-theoretic approach was later successfully used to "tame" various fragments of computability logic, which had otherwise stubbornly resisted all axiomatization attempts within traditional proof theory such as sequent calculus or Hilbert-style systems. It was also used to (define and) axiomatize the purely propositional fragment of Independence-Friendly Logic._NEWLINE_The birth of cirquent calculus was accompanied by the associated "abstract resource semantics". Cirquent calculus with that semantics can be seen as a logic of resources which, unlike Linear Logic, makes it possible to account for resource-sharing. As such, it has been presented as a viable alternative to linear logic by Japaridze, who has repeatedly criticized the latter for being neither sufficiently expressive nor complete as a resource logic. This challenge, however, has remained largely unnoticed by the linear logic community, which never responded to it._NEWLINE_Japaridze has cast a similar (and also never answered) challenge to intuitionistic logic, criticizing it for lacking a convincing semantical justification of the associated constructivistic claims, and for being incomplete as a result of "throwing out the baby with the bath water". Heyting's intuitionistic logic, in its full generality, has been shown to be sound but incomplete with respect to the semantics of computability logic. The positive (negation-free) propositional fragment of intuitionistic logic, however, has been proven to be complete with respect to the computability-logic semantics._NEWLINE_In "On the system CL12 of computability logic", on the platform of computability logic, Japaridze generalized the traditional concepts of time and space complexity to interactive computations, and introduced a third sort of complexity measure for such computations, termed "amplitude complexity"._NEWLINE_Among Japaridze's contributions is the elaboration of a series of systems of (Peano) arithmetic based on computability logic, named "clarithmetics". These include complexity-oriented systems (in the style of bounded arithmetic) for various combinations of time, space and amplitude complexity classes. _START_SECTION_ Biography and academic career _START_PARAGRAPH_ Giorgi Japaridze was born in 1961 in Tbilisi, Georgia (then in the Soviet Union)._NEWLINE_He graduated from Tbilisi State University in 1983, received a PhD degree (in philosophy) from Moscow State University in 1987, and then a second PhD degree (in computer science) from the University of Pennsylvania in 1998.
During 1987-1992 Japaridze worked as a Senior Researcher at the Institute of Philosophy of the Georgian Academy of Sciences. During 1992-1993 he was a Postdoctoral Fellow at the University of Amsterdam (Mathematics and Computer Science department). During 1993-1994 he held the position of a Visiting Associate Professor at the University of Notre Dame (Philosophy Department). He has joined the faculty of Villanova University (Computing Sciences Department). Japaridze has also worked as a Visiting Professor at Xiamen University (2007) and Shandong University (2010-2013) in China. _START_SECTION_ Awards _START_PARAGRAPH_ In 1982, for his work "Determinism and Freedom of Will", Japaridze received a Medal from the Georgian Academy of Sciences for the best student research paper, granted to one student in the nation each year. In 2015, he received an Outstanding Faculty Research Award from Villanova University, granted to one faculty member each year. Japaridze has been a recipient of various grants and scholarships, including research grants from the US National Science Foundation, Villanova University and Shandong University, Postdoctoral Fellowship from the Dutch government, Smullyan Fellowship from Indiana University (never utilized), and Dean's Fellowship from the University of Pennsylvania. + _START_ARTICLE_ Zera Luther Tanner _START_SECTION_ Career _START_PARAGRAPH_ Zera Tanner was born in Warsaw, New York in 1835, the son of Zerah and Ruth (Foster) Tanner. The elder Tanner died when his son was one year old, so the younger Tanner worked in family farms until his late teens, when he apprenticed to a mechanic. Tanner traveled by ship to Great Britain in 1855, and because of ill health chose to take a longer voyage from Liverpool to Bombay, India aboard SS Culloden in 1856. After two round trips, one as third officer, Tanner chose sailing for his profession._NEWLINE_Returning to the United States, after Tanner served aboard American merchantmen, he eventually assisted several seaborne troop movements in the Gulf of Mexico. _START_SECTION_ Civil War service _START_PARAGRAPH_ Tanner chose to join government service and was appointed acting ensign of the Union Navy in the summer of 1862. Tanner served upon the bark USS Midnight and the supply steamer USS Rhode Island during the American Civil War. When Rhode Island captured a British blockade runner in December 1864, Tanner was put in charge of the prize crew. During the Second Battle of Fort Fisher in 1865, Tanner commanded the boats from his vessel landing Union ground forces. _START_SECTION_ Post war service _START_PARAGRAPH_ Tanner entered the United States Navy in 1868, coming over from the deactivated volunteer services. Until his retirement in 1897, Tanner served the navy in hydrographic survey and dredging commands, often in conjunction with the United States Commission of Fish and Fisheries, generally known as the United States Fish Commission. Tanner partially designed and oversaw the construction of two ships for the commission. USFC Fish Hawk, in service from 1880 to 1926 and the first large vessel ever built expressly for the promotion of fisheries, was a smaller vessel designed for coastal waters and was primarily used as a mobile fish hatchery although she also conducted fisheries research, while USFC Albatross, which served as a fisheries research ship from 1882 to 1921 except for brief periods in U.S. Navy service in 1898 and from 1917 to 1919, was the first full-sized vessel primarily designed for marine research. 
Tanner was the first commanding officer of Fish Hawk, and he commanded Albatross for many years, including transporting famed naturalist Alexander Emanuel Agassiz on an 1891 voyage to the Galapagos Islands._NEWLINE_Tanner was promoted to commander in 1893 and was relieved of command of Albatross on 1 May 1894. After an extended furlough, was assigned to duty with the Fish Commission on 1 January 1895. He retired from the Navy on 5 December 1897, having reached the mandatory retirement age of 62._NEWLINE_Tanner died in Washington, D.C. in late 1906 and was buried with military honors at Arlington National Cemetery._NEWLINE_Tanner was a Companion of the Military Order of the Loyal Legion of the United States. He was also a member of the Grand Army of the Republic and the Society of American Wars. _START_SECTION_ Legacy _START_PARAGRAPH_ Tanner developed an improved method of depth sounding, using instruments of his own design. He patented his system in 1899 as the Tanner navigational sounding apparatus._NEWLINE_Two U.S. Navy ships have been named after Tanner. After World War II, USS Pamina (AKA-34), an attack cargo ship with service in the Okinawa campaign was re-purposed for oceanographic survey work and renamed USS Tanner (AGS-15). Tanner spent her career mapping significant coastline areas and was retired in 1969. In 1990, USNS Tanner (T-AGS-40) was built for the U.S. Navy as a fast oceanographic research vessel. Now named TS State of Maine, she serves as the training ship of the Maine Maritime Academy. + _START_ARTICLE_ We Got Married (season 1) _START_PARAGRAPH_ We Got Married (Season 1) is the first season of We Got Married (우리 결혼했어요), a popular reality South Korean variety show and a segment of the Sunday Sunday Night program. First broadcast in 2008, the show pairs up Korean celebrities to show what life would be like if they were married. Each week, couples are assigned missions to complete, with candid interviews of the participants to reveal their thoughts and feelings._NEWLINE_With a new format and slightly different couples, newlyweds are given a mission to complete each week. As during the special pilot episode, interviewed participants provide a unique perspective on the ongoing relationship conflicts and developments. All of the recorded material is then played in front of the participants, MCs, and audience who add commentary or clarification._NEWLINE_Beginning with a Lunar New Year's Special in 2009 with three new couples, a new format is introduced into the show, first forecasted through the addition of Kangin and Lee Yoon Ji. Each couple is given a concept to portray; in Kangin and Lee Yoon Ji's case, a college couple living with a limited income. The show now consists of more special effects and editing in order to show each couple in a set atmosphere and theme. + _START_ARTICLE_ 2006 World Lacrosse Championship _START_SECTION_ Pool play _START_PARAGRAPH_ For the round-robin phase of the tournament, nations were separated into blue, red, orange and yellow divisions according to strength. Each of the twenty-one nations was eligible to win the championship. 
_START_SECTION_ Best Positional Players _START_PARAGRAPH_ Brodie Merrill - Defence_NEWLINE_ Jay Jalbert - Midfield_NEWLINE_ Jeff Zywicki - Attack _START_SECTION_ Tournament MVP _START_PARAGRAPH_ Geoff Snider - Midfield, face-off + _START_ARTICLE_ Thomas Porter (Vermont politician) _START_PARAGRAPH_ Thomas Porter (February 15, 1734 – May 30, 1833) was a Connecticut and Vermont military and political figure who served as Speaker of the Vermont House of Representatives. _START_SECTION_ Biography _START_PARAGRAPH_ Thomas Porter was born in Farmington, Connecticut Colony, on February 15, 1734 and became a farmer in Cornwall. He served with the British during the French and Indian War and held several local offices, including member of the Connecticut House of Representatives._NEWLINE_Porter served against the British at the start of the American Revolution as a Captain in the Connecticut Militia, and relocated to Tinmouth, Vermont in 1779._NEWLINE_In 1780 Porter was elected to the Vermont House of Representatives. He served until 1782 and was Speaker of the House during his entire House tenure._NEWLINE_Porter resigned as Speaker to accept election to the Governor's Council, on which he served until 1795._NEWLINE_From 1781 to 1782 Porter was Assistant Judge of the Rutland County Court, and he was the court's Chief Judge from 1788 to 1789._NEWLINE_In 1783 Porter became a Judge on the Vermont Supreme Court, serving until 1785._NEWLINE_He died in Granville, New York on May 30, 1833. Porter was buried at Sawyer Cemetery in Tinmouth._NEWLINE_Porter was the father of college president and theologian Ebenezer Porter. + _START_ARTICLE_ The Ice House (1969 film) _START_SECTION_ Plot _START_PARAGRAPH_ Ric Martin (Robert Story), a disgraced cop, long since fired from police work, makes a sexual approach to Ice House dancer Venus De Marco (Sabrina) and is struck with a beer bottle for his efforts. Angered, he stalks the dancer, and when she again raises a bottle in a defensive manner, he strangles her. He is thwarted in his efforts to hide the body at a local lovers' lane, and ends up hiding it at The Ice House, where he works in the menial position of attendant. Other women become his victims and their bodies are stored there as well. His identical twin brother Fred Martin (David Story), himself a cop and investigating the disappearances, cannot understand why his brother is acting oddly. In an effort to slow down the hunt for the serial killer, Ric kills Fred and takes his place investigating the case. _START_SECTION_ Production _START_PARAGRAPH_ The film's production began in early 1967, with director Stuart E. McGowan wanting blonde bombshell Jayne Mansfield to play Venus De Marco. However, after Mansfield's death in an automobile accident in June 1967, filming was postponed. Over the next year McGowan offered the role to Mamie Van Doren, Diana Dors and Joi Lansing, all of whom turned down the offer. Eventually the role was filled by model-turned-actress Sabrina. _START_SECTION_ Release _START_PARAGRAPH_ The film had its original United States release in the United States on July 9, 1969 by Orbit Media Group. Marden Films gave the film a theatrical release in Canada in 1972. The film was released on VHS in the USA by Something Weird Video in 1996 as part of Frank Henenlotter's Sexy Shockers from the Vaults (Vol. 60), and a fully restored director's cut was given a worldwide release in 2008 by Grindhouse Releasing. 
The film is also known as Cold Blood, Crimen on the Rocks (Spain), Love in Cold Blood and The Passion Pit. _START_SECTION_ Critical response _START_PARAGRAPH_ John Charles, editor of Video Watchdog magazine, wrote: "Character actors Scott Brady, Jim Davis and Tris Coffin, and a pair of musclebound, thespically challenged leading men are the main points of interest in this thriller/softcore hybrid, which delivers little more than copious nudity." He panned the film for the poor direction of Stuart E. McGowan, and notes that while the film set up the viewer for mystery and horror, it failed to deliver and meandered to a predictable twist ending. He also panned the performances of real-life twins David and Robert Story as "incredibly stiff", and made note that "some amusingly unhip slang" and an undramatic "ridiculous" and "undercranked" motorcycle chase provided only "intermittent entertainment". While noting Grindhouse Releasing's intent to remarket the film, they spoke toward Something Weird Video's 1996 video release, and noted that although SWV's 35mm source material was "damaged in every way imaginable", its color and resolution were still decent. + _START_ARTICLE_ William J. Thaler _START_PARAGRAPH_ William J. Thaler, Ph.D. (December 4, 1925 – June 5, 2005) was an American experimental physicist. Working for the Office of Naval Research (ONR) at the Naval Research Laboratory in the 1950s, Thaler developed an early warning system to detect the launching of ballistic missiles using high frequency radio waves bounced between the Earth's surface and the ionosphere, part of the upper atmosphere._NEWLINE_Monitoring the disruption of the returning radio waves, called back-scatter, allowed for the long distance detection of rocket launchings and nuclear tests. Based in the Washington D.C. area, the experimental monitoring systems, termed "over-the-horizon radar", were able to pick up radio disruptions from nuclear tests held in Nevada and were later successful in tracking a Polaris missile fired from Cape Canaveral. _START_SECTION_ Education _START_PARAGRAPH_ Thaler attended St. James Parochial School in Baltimore and Loyola High School in Towson, MD. He received his undergraduate degree from Loyola College of Baltimore in 1947 and earned his master's degree in science at The Catholic University of America. He received his doctorate in physics at Catholic University in 1951. _START_SECTION_ Operation Argus _START_PARAGRAPH_ In 1958, Thaler was in charge of the ONR section of Operation Argus, a secret series of tests conducted over the Atlantic Ocean that looked at the effect of high-altitude detonations of nuclear weapons on radar and radio transmissions. _START_SECTION_ Later career _START_PARAGRAPH_ In late 1960, Thaler joined the faculty of Georgetown University, expanded the Physics department and chaired the department from 1960 to 1976. From 1976 to 1979, he took a leave of absence to serve as chief scientist and director of the Office of Telecommunications Policy, within the Executive Office of the President, in the Ford and Carter administrations. He returned to Georgetown University and retired in 1996. _START_SECTION_ Awards _START_PARAGRAPH_ In 1960, Thaler was awarded the Mendel Medal by Villanova University. 
This honor "is awarded to outstanding scientists who have done much by their painstaking work to advance the cause of science, and, by their lives and their standing before the world as scientists, have demonstrated that between true science and true religion there is no intrinsic conflict." _START_SECTION_ Personal life _START_PARAGRAPH_ Thaler was married to Barbara Thaler and had six children, two of whom preceded him in death. _START_SECTION_ Death _START_PARAGRAPH_ Thaler died of complications resulting from a stroke at his home in Centreville, Virginia. He was 79 years old. + _START_ARTICLE_ J&D's Down Home Enterprises _START_PARAGRAPH_ J&D's Down Home Enterprises, also known as J&D's Foods, is an American company based in Seattle, Washington, that produces bacon-related products such as Bacon Salt and Baconnaise. The company was founded in 2007 by Justin Esch and Dave Lefkow who used a $5,000 prize winning from America's Funniest Home Videos as start capital to launch the business. J&D's used an unconventional, yet successful advertising campaign consisting of on-line Social network services and personal telephone calls to the media to get stories written about them resulting in selling 20,000 jars of their initial product, Bacon Salt, in their first five months of operation. The business continues to be successful and has since added several new products to the Bacon Salt range. _START_SECTION_ History _START_PARAGRAPH_ The company launched sales on July 16, 2007, when entrepreneurs Justin Esch and David Lefkow started selling Bacon Salt, a product that was conceived by Esch while considering the virtues of the drink Mitch Morgan, which is composed of a shot of bourbon and a piece of fried bacon as a garnish. After registering a trademark and purchasing an internet domain, Esch and Lefkow started producing Bacon Salt. They funded their company with a $5,000 loan from Lefkow's 3-year-old son Dean, who had won the money on America's Funniest Home Videos for a video in which he smacks his dad with a hit while playing T-Ball. Bacon Salt became a hit, and as of 2007 was sold in all 50 states and 26 other countries. Esch and Lefkow have since repaid the $5,000 loan._NEWLINE_Additional financing was provided through the sale of series B stock by outside investors in 2011. _START_SECTION_ Marketing _START_PARAGRAPH_ Initially Esch and Lefkow had no advertising budget and were met by resistance from food brokers and distributors who were unwilling to take a chance on an unknown product. Instead of traditional advertising, they sent jars of Bacon Salt to editors of food blogs and set up online profiles for Bacon Salt for online communities including MySpace and Facebook. The technique worked, Esch and Lefkow received 800 orders in the first week for their three varieties (original, hickory, and peppered) which included an order from a customer in Texas for 36 jars. They sold 20,000 jars between their opening in July 2007 and November 2007. In addition to online marketing, Esch and Lefkow spent an average of 30 minutes per day phoning the media, and explaining why they should do a story about them. _START_SECTION_ Mayonnaise Wrestling _START_PARAGRAPH_ On October 28, 2008 J&D's Foods hosted The World’s First Charity No Holds Barred Battle To The Death Mayonnaise Wrestling Match at Heavens night club in Seattle, Washington. The event featured three bouts of wrestling in 6,000 lbs of mayonnaise inside a ring made of old mattresses and hay bales. 
The main event included a fight between combatants dressed in a 7' foam Bacon suit and a giant jar of Mayonnaise. The event promoted the launch of their new bacon-flavored mayonnaise, Baconnaise, and the money raised benefited the family of a deceased co-worker. J&D's hosted a complimentary BLT sandwich bar and served $3 Mitch Morgan shots made with Maker's Mark bourbon. _START_SECTION_ The Bacathlon _START_PARAGRAPH_ On November 19, 2009 J&D's Foods hosted The Bacathlon at Heavens night club in Seattle, Washington. It was promoted as The World's First Bacon-Themed Multi-Sport Athletic and Endurance Event and featured an attempt to set the World Record for Bacon Eating. The Bacathlon featured local Seattle personalities competing against each other dressed in giant foam Bacon suits. The competitors included Ben Dragavon from the Seattle Sounders, Mark Rahner from ROTTEN Comics, Thee Ted Smith from KISW 99.9 FM, Josh Black a.k.a. Ronald McFondle from Seattle Semi-Pro Wrestling, Justin Barnes from KFNK 104.9 FM, Sheeza Brickhouse from the Rat City Rollergirls, Erik "The Red" Denmark from Major League Eating and Jessica Williams, the J&D's Foods Fall 2009 intern. J&D's once again served $3 Mitch Morgan shots made with Maker's Mark bourbon. _START_SECTION_ Novelty Products _START_PARAGRAPH_ J&D's Foods has used novelty bacon-flavored products such as Bacon Lip Balm, baconlube, Mmmvelopes, BaconAir, Bacon Coffins, Bacon Baby Formula and bacon soda to generate media buzz and spread awareness of their mainstream consumer products. _START_SECTION_ Products _START_PARAGRAPH_ J&D's markets a variety of bacon-flavored products, including: _START_SECTION_ Bacon Salt _START_PARAGRAPH_ Bacon Salt was the first product developed by J&D's Foods. The first attempt at making bacon salt was to take bacon grease and pour it over kosher salt. That attempt was unsuccessful and, according to Esch, the result tasted disgusting. Upon perfecting the recipe, the pair introduced the first three flavors of Bacon Salt — Original, Hickory and Peppered varieties. J&D's went on to launch Natural Bacon Salt and the Limited Edition Holiday Flavors Cheddar, Jalapeño, Maple, Applewood and Mesquite Bacon Salt in 2008. Bacon Salt is vegetarian and kosher. Hickory Bacon Salt is vegan._NEWLINE_In 2008, J&D's Foods launched Operation Bacon Salt to provide Bacon Salt to troops serving abroad in Iraq and Afghanistan, where pork is not readily available. After receiving several emails from troops stationed abroad, J&D's began mailing free Bacon Salt to troops. _START_SECTION_ Baconnaise _START_PARAGRAPH_ Introduced in October 2008, Baconnaise is a bacon-flavored mayonnaise spread available in both Regular and Lite versions. Jon Stewart of The Daily Show described Baconnaise as being "for people who want heart disease but are too lazy to actually make the bacon." _START_SECTION_ Bacon Lip Balm _START_PARAGRAPH_ Introduced in October 2008, Bacon Lip Balm is a bacon-flavored lip balm. J&D's bacon-inspired lip balm is a "hot seller" on their web site, having sold over 50,000 units, often by the dozen. Customer reviews range from describing it as the worst thing they have ever heard of to the greatest thing ever. When Esch and Lefkow appeared on The Oprah Winfrey Show in March 2009, Winfrey's co-hosts Mark Consuelos and Ali Wentworth applied Bacon Lip Balm during the interview. _START_SECTION_ BaconPOP _START_PARAGRAPH_ Introduced in November 2009, BaconPOP is a bacon-flavored microwave popcorn.
_START_SECTION_ Bacon Ranch _START_PARAGRAPH_ Introduced in November 2009, Bacon Ranch is a bacon-flavored ranch dip/mix, to be combined with buttermilk, sour cream and/or mayonnaise. _START_SECTION_ Mmmvelopes _START_PARAGRAPH_ Introduced in November 2009, Mmmvelopes are bacon-flavored envelopes that both look like pieces of bacon and taste like bacon when licked. + _START_ARTICLE_ Lindholmiola spectabilis _START_SECTION_ Geographic distribution _START_PARAGRAPH_ This species is endemic to Greece, where it occurs in the north and north-eastern part of the country's mainland. + _START_ARTICLE_ Jean Vander Pyl _START_SECTION_ Early life and career _START_PARAGRAPH_ Vander Pyl was born in Philadelphia to John Howard and Kathleen Hale Vander Pyl. Her grandfather had come from the Netherlands. Her father was the district manager for Knit Underwear, and her mother was a Southerner from Tennessee. The two died within six months of each other in the early 1950s. By 1939, she was already working as a radio actress. _NEWLINE_On radio she was heard on such programs as The Halls of Ivy (1950–52) and on Father Knows Best during the early 1950s, where she portrayed Margaret Anderson; the role was played on television by Jane Wyatt. Her husband, Carroll G. O'Meara, was a graduate of Stanford University who worked as a copywriter at KHJ radio in the mid-1930s and later became an advertising executive._NEWLINE_Vander Pyl made numerous TV appearances as an actress in programs such as Leave It to Beaver, The Donna Reed Show, Father Knows Best, The Beverly Hillbillies, That Girl, and Petticoat Junction. One of her final TV appearances was in the opening scene of the Season Two Murder, She Wrote episode, "One Good Bid Deserves a Murder". Vander Pyl also had a cameo appearance in the 1994 live-action film version of The Flintstones as Mrs. Feldspar, an elderly woman in a conga line. _START_SECTION_ Voice work _START_PARAGRAPH_ Vander Pyl was the voice of Wilma Flintstone, her best-known character, in the original Flintstones series. She told an interviewer in 1995 that she received $250 per episode for making The Flintstones, and in 1966, when the series ended, she rushed to accept $15,000 in lieu of residual payments from syndication. The Flintstones ran in syndication across the globe for decades. At the time, Vander Pyl lived in San Clemente, California, and remarked: "If I got residuals, I wouldn't live in San Clemente. I'd own San Clemente."_NEWLINE_Most of her other voice acting work was also for the Hanna-Barbera studio, where she played her first voice role in 1958 on an episode of The Huckleberry Hound Show, voicing an actress in the Yogi Bear episode "Show Biz Bear". She did additional voices, including the narrator and various Southern belles and beautiful girls, on The Quick Draw McGraw Show, Snagglepuss and The Yogi Bear Show. In 1961-62, Vander Pyl played Nurse Larue, Charlie the baby, Goldie, Lola Glamour and additional voices on multiple episodes of Top Cat, and in 1962 she took on another memorable role as Rosie, the Jetsons' robotic maid; 23 years later, in 1985, she reprised the character on the returning series._NEWLINE_Later, she did the voices of Maw Rugg and her daughter Floral Rugg on the rural cartoon The Hillbilly Bears and on Winsome Witch; both shows were part of The Atom Ant/Secret Squirrel Show (1965–1967). Jean Vander Pyl was also the voice of Little Ogee on The Magilla Gorilla Show. In 1969, Vander Pyl guest starred on the Scooby-Doo, Where Are You!
episode "Foul Play in Funland", playing Sarah Jenkins._NEWLINE_In the 1970s, she was the voice of Marge Huddles, the main character's wife on Where's Huddles?, in which she played a role similar to that of Wilma Flintstone and was reunited with her Flintstones cast members Alan Reed and Mel Blanc. She went on to voice Mrs. Finkerton on Inch High, Private Eye, as well as several female characters on Hong Kong Phooey, The Tom and Jerry Show and Captain Caveman and the Teen Angels._NEWLINE_In the 1980s and 1990s, the talented voice actress did voices on Mister T, Snorks, Yogi's Treasure Hunt and also on The Flintstone Kids as Mrs. Slaghoople. She mostly reprised Wilma Flintstone on spin-off series and films such as The Flintstone Comedy Hour, The New Fred and Barney Show, The Flintstone Comedy Show, The Jetsons Meet the Flintstones, I Yabba-Dabba Do!, Hollyrock-a-Bye Baby, and A Flintstones Christmas Carol._NEWLINE_Her last roles were again as Wilma Flintstone on What a Cartoon! episode "Dino: Stay Out!" in 1995, A Flintstone Family Christmas in 1996 and on The Weird Al Show in 1997. _START_SECTION_ Personal life _START_PARAGRAPH_ Vander Pyl was married twice. First to Carroll G. O'Meara on March 9, 1939; together they had three children, O'Meara died on February 18, 1962, at the age of 53. She then married her second husband Roger Wells DeWitt in 1963; the couple had one son, they remained married until DeWitt's death in 1992. _START_SECTION_ Death _START_PARAGRAPH_ On April 10, 1999, Vander Pyl, the last surviving original cast member of The Flintstones, died of lung cancer at her home in Dana Point, California, at the age of 79. Vander Pyl was interred in Ascension Cemetery in Lake Forest, California. + _START_ARTICLE_ Choleretic _START_PARAGRAPH_ Choleretics are substances that increase the volume of secretion of bile from the liver as well as the amount of solids secreted. + _START_ARTICLE_ Once Upon a Castle _START_PARAGRAPH_ Once Upon a Castle is a symphonie concertante for organ and orchestra composed in 2003 and revised in 2015 by American composer Michael Daugherty. The music is inspired by both the life and times of American media mogul William Randolph Hearst, Hearst Castle, and the Hollywood lore of Charles Foster Kane, a fictional character based on Hearst in the movie Citizen Kane. _START_SECTION_ Origin and performance history _START_PARAGRAPH_ The composition was commissioned by the Ann Arbor Symphony Orchestra and a consortium consisting of the Cedar Rapids Symphony Orchestra, the Rockford Symphony Orchestra and the West Michigan Symphony Orchestra. The world premiere was given by the Ann Arbor Symphony Orchestra conducted by Arie Lipsky, with Steven Ball, organ, at the Michigan Theater, Ann Arbor, Michigan on November 15, 2003. The world premiere of the revised version was given by the Nashville Symphony conducted by Giancarlo Guerrero, with Paul Jacobs, organ, at the Schermerhorn Symphony Center, Nashville, Tennessee on November 6, 2015. _START_SECTION_ Recording and reception _START_PARAGRAPH_ The concerto was recorded in 2015 with the Nashville Symphony and released on the Naxos label. 
Many critics reviewed the recording favorably, including a 10/10 for both artistic quality and sound quality from music critic David Hurwitz and 4 out of 5 stars from critic James Manheim._NEWLINE__NEWLINE_Donald Rosenberg of Gramophone wrote:_NEWLINE_Hearst’s extravagant abode in San Simeon comes to brilliant life in Once Upon a Castle (2015), whose four movements receive an extra sonic kick with the presence of a pipe organ, played to the glowing hilt by Paul Jacobs. Guerrero and the orchestra sound as if they’re savouring every fresh Daugherty detail._NEWLINE_— Donald Rosenberg, Gramophone_NEWLINE__NEWLINE_Bob McQuiston of Classical Lost and Found wrote:_NEWLINE_American organist Paul Jacobs gives a technically flawless, magnificent account of this colorful work. He's at the console of the U.S.-built, three-manual Schoenstein Organ (64 ranks) in the Schermerhorn Symphony Center's Laura Turner Hall, Nashville, Tennessee, where these recordings were made (see 10 November 2014)...Jacobs receives superb support from the Nashville Symphony under its Music Director Giancarlo Guerrero."_NEWLINE_— Bob McQuiston, "Classical Lost and Found"_NEWLINE_ The album won the 2017 Grammy Award for Best Classical Compendium. _START_SECTION_ Four movements of the composition _START_PARAGRAPH_ The music of the four movements is programmatic, and is based on the composer's framing of different architectural, geographical and fictional aspects of the Hearst/Kane history and lore. The composer's published score includes descriptive program notes explaining the imagery and inspiration for each movement. _START_SECTION_ I. The Road to San Simeon _START_PARAGRAPH_ The music of this movement is meant to represent the winding drive to Hearst Castle from San Simeon, as well as Hearst's opulent antique collection. The composer explains his intent for the music in this movement to occasionally remind the listener of a musical "antique". _START_SECTION_ II. Neptune Pool _START_PARAGRAPH_ This movement is dedicated to William Albright (1944-98), who was the composer's colleague at the University of Michigan and who is recognized as one of the 20th century's greatest composers of contemporary organ music. The music is meant to portray the vast, grandiose architecture of the famous pool at Hearst Castle. _START_SECTION_ III. Rosebud _START_PARAGRAPH_ In his program notes the composer states: "...the ground breaking film [Citizen Kane] starring and directed by Orson Welles, presents a caricature of Randolph Hearst...[the] music for this movement echoes a brilliant scene in the film where the boisterous Kane (the organ) and lonely Susan (the solo violin) argue from opposite ends of a cavernous empty room of the castle." Sleigh bells are used as a musical representation of "rosebud," the famous final word of the fictional character Citizen Kane. _START_SECTION_ IV. Xanadu _START_PARAGRAPH_ This movement is composed to capture the spirit of the bombastic, lavish parties famously held at Hearst Castle, attended by the likes of Winston Churchill and famous film stars of the day, including Clark Gable, Charlie Chaplin, and Greta Garbo.
The composer notes in the published score, "I also had in mind fragments of Samuel Taylor Coleridge's 1798 poem, Kubla Khan...[the music uses] virtuoso bass pedal riffs surrounded by sizzling strings, rumbling brass, shimmering percussion and pulsating timpani.” + _START_ARTICLE_ Inhabited initial _START_PARAGRAPH_ An inhabited initial is an initial, an enlarged letter at the beginning of a paragraph or other section of text that contains an illustration of human or animal figures within the letter. It is similar to a historiated initial; however, the figures in historiated initials show an identifiable scene or story, while the figures in inhabited initials do not show a narrative. Figures in inhabited initials may be related to the contents of the text, but do not have to be. They may be purely decorative instead. + _START_ARTICLE_ Jim St. James _START_PARAGRAPH_ Jim Bozyk (1954–1990), known professionally as Jim St. James, was a Canadian actor and HIV/AIDS activist. He was best known as the star of a series of public service announcements on AIDS awareness which aired on Canadian television in the 1980s, and as the subject of June Callwood's 1988 book Jim: A Life with AIDS. _START_SECTION_ Background _START_PARAGRAPH_ He was raised in rural Southern Ontario in a Jehovah's Witness family, and was briefly married to a woman. He struggled with his sexuality, and undertook at least one suicide attempt before coming out as gay. Many of his family disowned him when he came out, although he remained in occasional contact with his father. He was also excommunicated from the Jehovah's Witnesses, although he remained devoutly religious in his personal life._NEWLINE_He worked as a stage actor in Toronto for several years, winning an award from Theatre Ontario as best actor in a musical for his performance in a production of Man of La Mancha in 1984. Just two days after winning that award, he was first diagnosed HIV-positive. _START_SECTION_ Activism _START_PARAGRAPH_ Following his diagnosis, he battled clinical depression for about a year before deciding in 1985 to get on with life, and renewed his commitment to both acting and HIV activism. He was one of the founding members of Toronto's People With AIDS Foundation, appeared in the AIDS-themed documentary film No Sad Songs in 1985 and a production of Robert E. Sherwood's play Idiot's Delight in 1987, and began appearing as a public speaker on HIV and AIDS issues. During this era, he was commonly credited as Canada's longest-living survivor of the disease, and as the country's most prominent HIV/AIDS activist._NEWLINE_In 1987, he appeared in an HIV education segment on CBC Television's youth public affairs program What's New, and in 1988 he starred in several HIV/AIDS awareness commercials, funded by CJOH-TV and the Canadian Public Health Association, which aired on television stations across Canada. During this era, he was also meeting regularly with Callwood in preparation for the book Jim: A Life with AIDS, which was published in fall 1988. By this time, he had developed Kaposi's sarcoma. In both 1988 and 1989, he invited the media to cover his birthday party as a news story, to highlight his continued survival and to promote further awareness of the disease. At the time of his 1989 party, however, he was making plans to move into Casey House, Toronto's AIDS hospice, due to his declining health._NEWLINE_He died on March 24, 1990 at Casey House, just a few weeks short of his 36th birthday. 
+ _START_ARTICLE_ Festival y Reinado Nacional del Carbón _START_PARAGRAPH_ The National Coal Festival and Pageant (Spanish: Festival y Reinado Nacional del Carbón) is a festival in Colombia that takes place in the town of Barrancas, Department of La Guajira, from October 10 to 13. The festival is an artistic and cultural event which celebrates the municipality's Roman Catholic devotion to the "Virgin of Pilar"; there are art expositions showing different coal sculptures and paintings as well as local gastronomy events. There are also cultural expositions showing the arts and crafts of the Wayuu indigenous ethnic group. + _START_ARTICLE_ Calliostoma zizyphinum _START_SECTION_ Description _START_PARAGRAPH_ The solid, regularly conical shell is straight-sided and imperforate. The shell contains up to 12-13 whorls. It is sculptured with regular spiral grooves and ridges, traversed by fine prosocline growth lines. The apex is minute, composed of a single smooth rounded whorl. Several whorls follow, each with 4 granose spiral ridges. These become smooth and either obsolete or narrow on the later whorls. The body whorl has a prominent peripheral keel bearing two broad ridges; these ridges appear above the suture in the preceding whorls. The base of the shell is rather flat, with the inner lip reflected over a shallow umbilical groove. The periphery is angular, encircled by a smooth rounded rib that becomes a supra-sutural band or fasciole on the spire. The base of the shell is nearly flat. The aperture is quadrate. The cylindrical columella is nearly straight._NEWLINE_The color of the shell is variable. The ground color is yellowish brown, pale pink, or violet with streaks and blotches of brown, red or purple on the periphery. Blotches on the keel are generally darker, more frequent and more regular than on other parts of the shell. It is radiately clouded with brown on the upper surface. The base of the shell is unicolored or obscurely radiately streaked. Pure white or violet specimens are occasionally found. _START_SECTION_ Predators _START_PARAGRAPH_ The starfish Asterias rubens is a known predator of Calliostoma zizyphinum. _START_SECTION_ Distribution _START_PARAGRAPH_ This marine species occurs in European waters from Northern Norway to the Azores, and in the Mediterranean Sea. + _START_ARTICLE_ Shahnshah Zakarian _START_SECTION_ Biography _START_PARAGRAPH_ After the death of Zakaria II Mkhargrdzeli, his five-year-old son Shanshe was adopted by his uncle Ivane Mkhargrdzeli, who raised him and converted him to Chalcedonism. As soon as he reached the age of adulthood he was raised to the office of mandaturukhutsesi._NEWLINE_During the Mongol invasion of Georgia, Queen Rusudan had to evacuate Tbilisi for Kutaisi, while Shanshe took refuge in Adjara, then made peace with the Mongols and agreed to pay them tribute. They confirmed Shanshe in his fief. Soon Rusudan sent Avag Mkhargrdzeli to arrange her submission to the Mongols in 1243 and arrived in eastern Georgia, where she was met by Shanshe and other notable nobles._NEWLINE_During the period of interregnum (1245–1250), with the two Davids absent at the court of the Great Khan in Karakorum, the Mongols divided the Kingdom of Georgia into eight districts (tumen), three of which belonged to the Mkhargrdzelis, i.e., the territories of Shanshe in Ani and Kars; of Avag in Syunik and Artsakh; and of Vahram (Gagi, Shamkor and the surrounding area)._NEWLINE_Rubroek, envoy extraordinary of the French king Louis IX to the Khan of Mongolia, stayed in 1255 with Shanshe on one of his Armenian estates.
Rubroek characterizes Shanshe as a great feudal lord and owner of 15 cities._NEWLINE_In 1260, Hulagu Khan requested that David Ulu support him in the war against the Mamluk Sultanate in Cairo. David, remembering the Georgian losses at Baghdad (1258), refused to comply and revolted. The Georgian nobles, led by David Ulu, were defeated and once again submitted to Mongol rule. Although Prince Shanshe was freed for a ransom, his son Zakaria was killed. Shanshe died not long after this event. He was buried in his ancestral Kobayr monastery. + _START_ARTICLE_ Howlin' at the Moon _START_SECTION_ Background _START_PARAGRAPH_ The up-tempo "Howlin' at the Moon" celebrates the giddiness of true love. Lyrically, the song reflects Williams' sense of humor and love of hunting. The title is punctuated by the hound dog yodels of fiddler Jerry Rivers. In his book Hank Williams: The Biography, writer Colin Escott observes, "The performance tears along...It was but a short step from there to rockabilly." Williams recorded the song at Castle Studio in Nashville on March 16, 1951. Williams was backed on the session by members of his Drifting Cowboys band, including Jerry Rivers (fiddle), Don Helms (steel guitar), Sammy Pruett (electric guitar), Jack Shook (rhythm guitar), Ernie Newton or "Cedric Rainwater," aka Howard Watts (bass), and either Owen Bradley or producer Fred Rose on piano. The B-side of "Howlin' at the Moon," the ballad "I Can't Help It (If I'm Still in Love with You)," outperformed the A-side on the charts, peaking at #2._NEWLINE_Williams disciple George Jones recorded this song for his 1960 album George Jones Salutes Hank Williams. + _START_ARTICLE_ Oddjobs _START_SECTION_ History _START_PARAGRAPH_ Oddjobs released the album Drums in 2001. The 12-inch single "Blue Collar Holler" reached number 6 on the CMJ college radio hip hop chart in 2002. The six-track EP The Shopkeeper's Wife was released in 2003. The group toured with DJ Shadow in the same year. + _START_ARTICLE_ Groep Otten _START_SECTION_ Internal conflicts within Forum for Democracy _START_PARAGRAPH_ After the establishment of Forum for Democracy in 2016, the founders Henk Otten and Thierry Baudet fully devoted themselves to the development of their party. In a short time the party managed to win two seats in the 2017 Dutch general election. Two years later, further efforts led to a landslide victory in the 2019 Dutch provincial elections, yet the party failed to join a provincial governing coalition (college) in any province. Baudet was blamed for this by the political parties involved. His earlier statements about social issues have been viewed by many as far-right, racist and hostile to women. In addition, a number of tweets also caused a great deal of controversy._NEWLINE_On 19 April 2019, co-founder Otten, who had always been involved with the party behind the scenes, stepped forward in an interview with NRC Handelsblad. In this interview, Otten criticised the course of the party and the behaviour of Baudet. "Baudet should not unnecessarily ignore our people. It might be nice to make a daring statement, but we are now a big party. He is talking about his own words, but words have consequences. You have responsibility for other people. The question is whether you should use a political party as a vehicle for academic debates that you enjoy yourself. I don't think so." The party leader was "not amused" that this criticism came out._NEWLINE_Less than a week later, Otten came under public fire.
Two of the party's three board members, Baudet and Rob Rooken, accused Otten of taking money from the party treasury. The money in question was immediately deposited back into the party account by Otten. Moreover, Otten resigned as a board member of the party at the request of the other two board members. During this period Otten opted for media silence so as not to further harm the party, which was especially important with the Senate election approaching. In this election, Otten was also the lead candidate for Forum for Democracy, and the party's intended group chairman. In the aftermath of the internal conflict, Otten renounced the group chairmanship. He was succeeded by Paul Cliteur, with whom he entered the Senate (together with ten brand-new members of the Dutch parliament)._NEWLINE_Owing to suspicions of financial malpractice, Otten was expelled by the party on 24 July 2019. During this period, Otten immediately visited all kinds of media to speak out against the accusations against him, which he himself dismissed as defamation and slander. According to Otten, the expulsion was due to disagreements about the course of the party. Otten also promised to press charges for defamation. In Nieuwsuur, Otten hinted at his further political aspirations, mentioning for the first time the possibility of establishing his own party. _START_SECTION_ Founding of Group Otten _START_PARAGRAPH_ On 18 August 2019, Otten officially set up his political party. The new party leader was still looking for a suitable name for his party, but would continue under the name "Fractie-Otten" for the time being. Two members of the Senate joined him out of the same dissatisfaction with the course of Forum: former party spokesman Jeroen de Vries and Dorien Rookmaker. The three chose to keep their seats. There was even a possibility that the Otten Group could gain a seat in the European Parliament. This had to do with the redistribution of seats in the European Parliament after the United Kingdom's departure from the EU, which would increase FvD's number of seats from three to four. As Rookmaker was fourth on Forum's list of candidates for the European Parliament elections of 2019, she still claimed the "Brexit seat" despite having cancelled her membership._NEWLINE_The situation in the provincial states was equally uneasy. An increasing number of members of the provincial states expressed their dissatisfaction with the direction of FvD and then cancelled their membership. Some even chose to join Otten's party, including North Holland member of the states Robert Baljeu. In an interview with the NOS, Otten expressed the expectation that more FvD members would make the switch to his party. + _START_ARTICLE_ Lethrinus rubrioperculatus _START_SECTION_ Description _START_PARAGRAPH_ This species grows to and is brown or olive-grey in colour. It has small, scattered blotches that are irregular in shape. The body depth is 2.94 to 3.18 times in standard length. Body color is olive-gray or brown, with scattered irregular small black blotches. There is normally a red spot present on the top edge of the operculum. The lips are normally red. The fins are pinkish or pale in colour. _START_SECTION_ Distribution _START_PARAGRAPH_ Lethrinus rubrioperculatus is found in numerous locations, including East African waters, southern Japan and Taiwan, the Marquesas Islands, New Caledonia and the northern half of Australia. 
_START_SECTION_ Habitat _START_PARAGRAPH_ This species lives over sandy bottoms, in areas where rubble is present, and along the slopes of outer reefs. Although reef-associated, Lethrinus rubrioperculatus also occurs at depths of up to 160 metres, much deeper than most other species in this genus. This species is non-migratory. _START_SECTION_ Diet _START_PARAGRAPH_ Lethrinus rubrioperculatus eats mostly crustaceans, mollusks, echinoderms, and other fishes. _START_SECTION_ Human uses _START_PARAGRAPH_ This fish is caught commercially. _START_SECTION_ Parasites _START_PARAGRAPH_ As most fish, Lethrinus rubrioperculatus is the host of many species of parasites. _NEWLINE_Monogeneans parasitic on the gills include the diplectanid Calydiscoides euzeti, the ancyrocephalids Lethrinitrema gibbus and Lethrinitrema dossenus and several capsalids. _NEWLINE_Copepods parasitic on the gills include the caligid Caligus lethrinicola and the lernanthropid Sagum vespertilio. _NEWLINE_The gills also harbour unidentified gnathiid isopod larvae. _NEWLINE_The digestive tract harbours an unidentified Acanthocephala, unidentified tetraphyllid cestodes, species of the anisakid nematode Raphidascaris (Ichthyascaris), and a variety of digeneans, including the acanthocolpid Stephanostomum aaravi, the hemiurid Lecithochirium sp. and Tubulovesicula angusticauda, the opecoelid Pseudoplagioporus interruptus and three other opecoelids. _NEWLINE_The abdominal cavity contains two species of larval tetrarhynch cestodes, the otobothriid Otobothrium parvum and the tentaculariid Nybelinia goreensis. _NEWLINE_In New Caledonia, where its parasites were particularly studied, Lethrinus rubrioperculatus has a total of twenty species of parasites. + _START_ARTICLE_ Claussen pickles _START_PARAGRAPH_ Claussen is a brand of pickled cucumbers. It is headquartered in Woodstock, Illinois, an exurb of Chicago. Unlike many other brands, Claussen pickles are uncooked, and are typically found in the refrigerated section of grocery stores._NEWLINE_Claussen is advertised as having superior crunchiness to other brands. In a 1992 television advertisement, Claussen pickles were shown to snap under pressure, whereas unidentified competing brands merely bent without snapping. In response, Vlasic Foods Inc. submitted a complaint to an advertising industry tribunal, claiming that the commercial was unfair and misleading. Ultimately, however, the claims of Claussen were upheld by the tribunal. _START_SECTION_ Other products _START_PARAGRAPH_ Additionally, Claussen is the manufacturer of sauerkraut and a sweet pickle relish which won the San Francisco Chronicle's June 18, 2008, Taster's Choice challenge. _START_SECTION_ History _START_PARAGRAPH_ The company C. F. Claussen & Sons was founded by Claus Claussen Sr. in Chicago in May 1870. Claussen was a vegetable farmer on land that today is in the Chicago city limits at 51st and South Western Blvd. He had a surplus crop of cucumbers one year, and so he decided to pickle them. Claussen pickles were produced on the same piece of land until 1976 when the plant moved to Woodstock, Ill._NEWLINE_Claus Claussen Sr. was succeeded by his son Claus S. Claussen, who was serving as president of the company when he died following an automobile accident on December 20, 1932._NEWLINE_For some years, William C. Claussen (b. 1890) served as president of the Claussen Pickle Company._NEWLINE_The company was sold to Oscar Mayer in 1970 and moved to Woodstock in 1976. 
Oscar Mayer was later acquired by General Foods in 1981, who in turn merged with Kraft, Inc. in 1990 to form Kraft General Foods, renamed Kraft Foods in 1995._NEWLINE_In 2002, the investment group that owned Vlasic Pickles sought to acquire the Claussen brand as well. The Federal Trade Commission blocked the proposed merger on the grounds that it would have severe anticompetitive effects, leading to a monopoly in the refrigerated-pickle market. + _START_ARTICLE_ Steve Molloy _START_SECTION_ Background _START_PARAGRAPH_ Steve Molloy was born in Gorton, Manchester, Lancashire, England. _START_SECTION_ International honours _START_PARAGRAPH_ Steve Molloy won caps for England while at Leeds in 1992 against Wales, while at Featherstone Rovers in 1996 against France (interchange/substitute), and Wales, while at Sheffield Eagles in 1999 against France (2 matches), and won caps for Great Britain while at Leeds in 1993 against France, while at Featherstone Rovers in 1994 against Fiji, 1996 against Fiji (interchange/substitute), and New Zealand (interchange/substitute). _START_SECTION_ County Cup Final appearances _START_PARAGRAPH_ Steve Molloy played right-prop, i.e. number 10, in Warrington's 24-16 victory over Oldham in the 1989 Lancashire County Cup Final during the 1989–90 season at Knowsley Road, St. Helens on Saturday 14 October 1989. _START_SECTION_ Club career _START_PARAGRAPH_ Molloy made his début for Warrington on Sunday 28 August 1988, and he played his last match for Warrington on Monday 16 April 1990, he made his début for Featherstone Rovers on Sunday 29 August 1993, and he played his last match for Featherstone Rovers during the 1997 season. + _START_ARTICLE_ Hemigalinae _START_SECTION_ Characteristics _START_PARAGRAPH_ The tails of Hemigalinae species are ringed. The toes and the middle of the lower part of the tarsus are bald. The frenum, upper part, and sides of the lower part are hairy. The orbit is imperfect._NEWLINE_Hemigalinae resemble the Viverrinae in having the scent glands present in both sexes and wholly perineal, but differing by their simpler structure, consisting in the male of a shallower, _NEWLINE_smaller pouch, with less tumid lips, situated midway between the scrotum and the penis, but not extending to either. In the female, the scent glands consist of a pair of swellings, each with a slit-like orifice, situated one on each side of the vulva and a little behind it and on a common eminence, the perineal area behind this eminence being naked. The prepuce is long and pendulous. The feet are nearly intermediate in structure between those of the digitigrade Viverrinae and the semiplantigrade Paradoxurinae, but more like the latter, both the carpal and metatarsal pads being well developed, double, and joining the plantar pad below, and as wide as it is at the point of contact. But the feet, with the pads, are considerably narrower, the carpals and metatarsals converging and meeting above so that a much larger area of the under surface is hairy. The area between the four main digits and the plantar pad is covered with short hair, and the pads of the third and fourth digits of the hind foot are separated as in the Viverrinae, not confluent as in the Paradoxurinae. The retractile claws are not protected by skin-lobes. + _START_ARTICLE_ Durand Scott _START_SECTION_ High school career _START_PARAGRAPH_ The Bronx native attended Rice High School where he was a teammate of Kemba Walker until the latter left for college. 
He was crucial in the state championship they earned in 2009, including a good performance in the semifinal, a 77-50 win over a Lance Stephenson-led Lincoln. For his efforts, he was named the Daily News City Player of the Year and was selected to the Jordan Brand Classic. During that time, he also played AAU basketball for the Gauchos. _START_SECTION_ College career _START_PARAGRAPH_ He passed up offers from Memphis, West Virginia and UConn to join Miami (Florida) and play in the Atlantic Coast Conference (ACC) of NCAA Division I._NEWLINE_In his freshman year, Scott played in all 33 games (28 starts) while averaging 10.3 points, 4 rebounds, 3.4 assists and a team-high 1.2 steals per game. He made the ACC All-Rookie team and the ACC All-Tournament First Team._NEWLINE_In his sophomore year, he started in all but one of the 36 games he played in, averaging 13.6 points (second-best on the team), 4.2 rebounds, 3.1 assists and 1.2 steals (team best) in 32.8 minutes (team most) per game._NEWLINE_In his junior year, he played 33.2 minutes per game (6th most in ACC), posting 12.9 points (ACC 14th, team best), 3.1 assists (ACC 7th), 5.4 rebounds (team second best) and 1 steal. He was an All-ACC Honorable Mention._NEWLINE_He scored a career-high 32 points versus NC State in the 2013 ACC Tournament semi-finals. In his senior year, he averaged 13.1 points and 4 rebounds. He was named ACC Defensive Player of the Year and selected to the ACC All-Tournament First Team as Miami won the tournament._NEWLINE_At the end of his college career, he averaged 12.5 points, 4.4 rebounds, 3.1 assists, 1.3 steals and 32.1 minutes in 132 total games played. He was first in Miami history for games started and minutes played (125 and 4,238 respectively), 8th in points scored (1,650), 5th in assists (404) and 7th in steals (166). _START_SECTION_ Professional career _START_PARAGRAPH_ After his college career, Scott attended the Portsmouth Invitational, where he was an all-tournament selection. He also worked out with a number of NBA teams, but went undrafted in the 2013 NBA draft. Scott then joined the San Antonio Spurs for the 2013 NBA Summer League._NEWLINE_In August 2013, Scott signed with Blu:sens Monbús of the Spanish Liga ACB for the 2013–14 season. He registered 4.6 points and 1.2 rebounds in 12.3 minutes per game during the season._NEWLINE_Scott signed with Israeli side Hapoel Tel Aviv for the 2014–15 season; he finished the season with 15.2 points, 4.5 rebounds and 1.5 steals in 31 Israeli League games as Hapoel reached the playoffs._NEWLINE_In July 2015, Scott signed with Italian Serie A side Enel Brindisi for one year. The same month, he was announced as part of the Milwaukee Bucks roster for the 2015 NBA Summer League. On July 22, 2016, he re-signed with Brindisi for one more season._NEWLINE_On July 15, 2017, Scott signed with Italian club Auxilium Torino for the 2017–18 season. On August 20, 2017, it was announced that he would not play for the team for personal reasons. On October 5, 2017, he signed with the Memphis Grizzlies. On October 14, 2017, he was waived by the Grizzlies. On March 29, 2018, EWE Baskets Oldenburg of the Basketball Bundesliga was reported to have signed Scott for the rest of the 2017–18 season._NEWLINE_For the 2018–19 season, Scott signed with the Long Island Nets of the NBA G League. He did not make the final roster._NEWLINE_On November 28, 2018, Scott signed a one-year deal with the French team Levallois Metropolitans. 
In January 2019, Scott parted ways with Levallois Metropolitans after appearing in five games._NEWLINE_On January 22, 2019, Scott returned to Israel for a second stint, signing with Hapoel Gilboa Galil for the rest of the season. On February 4, 2019, Scott recorded a season-high 25 points in his second game with Gilboa Galil, shooting 9-for-12 from the field, along with three rebounds and assists in an 89–87 win over Ironi Nahariya. On April 10, 2019, Scott parted ways after appearing in nine games._NEWLINE_On August 30, 2019, Scott returned to France for a second stint, signing a one-year deal with Cholet Basket. On September 17, 2019, he parted ways with Cholet before appearing in a game. _START_SECTION_ National team career _START_PARAGRAPH_ Scott has played for the Jamaican national team. He participated in the 2013 FIBA Americas Championship, posting 10.5 points, 3.9 rebounds and 0.8 assists in around 28 minutes per game. + _START_ARTICLE_ McCarthy of Muskerry _START_PARAGRAPH_ The MacCarthy dynasty of Muskerry is a branch of the great MacCarthy Mor dynasty, the Kings of Desmond. Their branch descends from Dermod Mor MacCarthy, 1st Lord of Muscry (1310-1367/8), second son of Cormac MacCarthy Mor (1271–1359), King of Desmond._NEWLINE_Dermod Mor was created Lord of Muscry (Muskerry, along the Lee river in central County Cork) in 1353. His descendant Cormac Oge MacCarthy, 17th Lord of Muscry, was in 1628 created Charles MacCarty, 1st Viscount Muskerry, and his son, the 2nd Viscount Muskerry, was in 1658 created Donough MacCarty, 1st Earl of Clancarty._NEWLINE_The dynasty is still in existence and can be considered to still broadly belong to the Irish nobility, but its leadership is in confusion. There also remains some dispute with their (friendly) rivals and kinsmen the MacCarthys Reagh, concerning the title Prince of Desmond. The late main line of the MacCarthy Mor dynasty became extinct in the late 16th century and it has ever since been unclear who inherits the title, because of the advent of the career of Florence MacCarthy. See Kingdom of Desmond. There are also earlier MacCarthy Mor septs in existence who are claimants. The situation was recently thrown into even more exotic confusion by the impostor Terence Francis MacCarthy. + _START_ARTICLE_ Deltahedron _START_PARAGRAPH_ In geometry, a deltahedron (plural deltahedra) is a polyhedron whose faces are all equilateral triangles. The name is taken from the Greek majuscule delta (Δ), which has the shape of an equilateral triangle. There are infinitely many deltahedra, but of these only eight are convex, having 4, 6, 8, 10, 12, 14, 16 and 20 faces. The number of faces, edges, and vertices is listed below for each of the eight convex deltahedra. + _START_ARTICLE_ Stephan Dweck _START_PARAGRAPH_ Stephan Dweck Esq. (born 1960) is an African-American humorist, attorney, radio show host and the author or co-author of several books._NEWLINE_He co-hosted the Sports Funk show on WFAN-AM radio in New York City with Monteria Ivey. Dweck and Ivey lived in the Frederick Douglass Houses housing project in Manhattan._NEWLINE_Ivey, Dweck and James Percelay co-authored several books on African-American humor, from slavery to American ghettos, including the Snaps trilogy. Ivey and Dweck also wrote two books on pick-up lines called You're So Fine I'd Drink a Tub of Your Bathwater and Baby, All Those Curves. And Me With No Brakes. 
Other books include Laugh Your Ass Off: The Big Book of African American Humor and The Field Guide to White People._NEWLINE_Dweck executive produced the Snaps series for HBO and the animated show The Big Head People for Spike TV. He has worked as a screenwriter for Eddie Murphy Productions and Miramax Films. He also was a regular guest on the IMUS in the morning program. His WFAN radio show, Sports Funk, was one of the first African American Sports talk shows in the nation._NEWLINE_He is a graduate of Dartmouth College, where he received the Ernest E. Just award for academic excellence. He is a member of Alpha Phi Alpha fraternity, and a member of the New York, New Jersey and Connecticut bar. As an attorney he has represented several rappers, singers and actors, including the cast of the film Paris Is Burning in their lawsuit against the producers of the film. He is the co-owner and founder of the Digital Cannabis Magazine Bloomin "The Cannabis Life" He Practices Entertainment law in New York City. + _START_ARTICLE_ Izaak Walton Killam Memorial Prize _START_PARAGRAPH_ The Izaak Walton Killam Memorial Prize was established according to the will of Dorothy J. Killam to honour the memory of her husband Izaak Walton Killam._NEWLINE_Five Killam Prizes, each having a value of $100,000, are annually awarded by the Canada Council to eminent Canadian researchers who distinguish themselves in the fields of social, human, natural, or health sciences. + _START_ARTICLE_ Canada-Wide Science Fair _START_SECTION_ History _START_PARAGRAPH_ The First Canada-Wide Science Fair was held May 11 and 12, 1962 at the Science Building at Carleton University in Ottawa. In 1962, the fair was co-sponsored by the Kiwanis Club of Ottawa Incorporated. The initial Headquarters for the Canadian Science Fairs Council was 45 Rideau Street, Ottawa. The two-day science fair was made up of 45 exhibits of regional winners from secondary school fairs across the country. _START_SECTION_ Intel International Science and Engineering Fair (Intel ISEF) _START_PARAGRAPH_ Several competitors and winners from the Canada-Wide Science Fair have been selected for competition at the Intel International Science and Engineering Fair as part of Team Canada, among them inventors Ann Makosinski and Alex Deans. Past Canada-Wide Science Fair winners Raymond Wang and Austin Wang both from Vancouver, BC, won the Gordon E. Moore award at Intel ISEF in 2015 and 2016, respectively. _START_SECTION_ Awards _START_PARAGRAPH_ Almost $1 million in awards and scholarships is given out each year at the Canada-Wide Science Fair._NEWLINE_Bronze, silver, and gold medals are awarded to outstanding projects in each age/grade category (see above). Challenge awards are presented for the best project in each of seven STEM challenges (discovery, energy, environment, health, information, innovation and resources) for each age/grade category. Sponsored special awards are also offered._NEWLINE_Three Grand Awards recognize the top project from the gold medal winners in each age/grade category: The Best Project Award (including $2,500 cash) is presented to the top overall project, regardless of category. The top projects from the two remaining categories receive Platinum Awards, which include $1,000 cash. Two or three of the platinum award winners compete at the European Union Contest for Young Scientists. 
+ _START_ARTICLE_ Home Energy Saver _START_SECTION_ The Home Energy Simulation Model _START_PARAGRAPH_ The Home Energy Saver is built on DOE-2, a computer program for building heating and cooling energy analysis and design. DOE-2 performs a thermal load simulation that accounts for heating and cooling equipment and thermal distribution efficiencies, infiltration, and thermostat management. User-entered zip codes are mapped to one of about 300 unique "weather tapes" that impose a year's worth of local weather conditions on the home to determine heating and cooling needs._NEWLINE_Home Energy Saver extends DOE-2 in a number of ways to improve the simulation model. For example, when users enter their actual electricity tariffs, the predictive power of the model improves. Other methods are used to calculate the energy used by appliances, water heating, and lighting._NEWLINE_The public domain HES calculation methods and underlying data are clearly documented on the website. Other web-based tool developers are welcome to use this information at no cost, providing that the source is properly credited. _START_SECTION_ Awards & Recognition _START_PARAGRAPH_ Each year, the R&D 100 Awards recognize the year's 100 most significant, innovative, newly introduced research and development advances. The awards are recognized in industry, government, and academia as proof that a product is one of the most innovative ideas of the year, nationally and internationally. Home Energy Saver and Hohm received an R&D 100 Award in 2010._NEWLINE_Home Energy Saver received the U.S. Department of Energy's "Energy 100" award as one of the best 100 scientific and technological accomplishments over DOE's 23-year lifetime. The discoveries were chosen based on their impact in saving consumers money and improving quality of life._NEWLINE_PC Magazine recognized Home Energy Saver in 2004 as one of the "Top 100 Undiscovered Websites._NEWLINE_MSN-Money rates Home Energy Saver among the "Best Sites for Free Government Help" including it in the list of "The 100 most Useful Sites on the Internet. _START_SECTION_ Licensing _START_PARAGRAPH_ Organizations who want to provide their customers tools to predict home energy consumption can license the Home Energy Saver Application Programming Interfaces (APIs)._NEWLINE_Microsoft, the first organization to license the Home Energy Saver, uses it to drive Microsoft Hohm. + _START_ARTICLE_ Lego Racers (video game) _START_SECTION_ Gameplay _START_PARAGRAPH_ Lego Racers is a racing game played from a third person perspective. Set in the fictional Legoland universe, the game depicts Rocket Racer, the "greatest racing champion" in Legoland. After becoming bored from beating everyone at racing, he decides to create a racing contest, and finds the best racers in the history of Legoland using a dimensional warp machine created by his friend, Veronica Voltage, a genius scientist and mechanic. The player takes on the hosts and co-racers in an attempt to beat Rocket Racer and become the "Greatest Lego Racer of All Time", completing the game._NEWLINE_Players assume the role of either one of several pre-built or custom-built minifigures and compete against other minifigure characters in races set across different tracks in the Legoland universe, using a variety of cars built out of Lego. At the beginning of each race, the player can perform a "Turbo Start", which allows the player to start the race at full speed. 
Throughout races, the player can also perform power slides and "Super Slides", which allow the player's car to turn around corners more sharply._NEWLINE_Each of the game's tracks contain power up bricks, which can be collected by the player and used to gain an advantage over other racers. The power ups are divided into four categories: Projectile, Hazard, Shield and Turbo, with each providing a different use to the player. The player can also collect up to three "power plus" bricks, which increase the capability of any power ups collected. Most tracks contain one shortcut that players can use to get ahead of opponents, which are usually either found with careful looking, or accessed using power-ups, mainly Projectile power-ups that destroy part of the scenery. During a race, the in-game HUD displays the player's position, lap number, "lap timers", and a "Power Up Icon" if the player is carrying any power up or power plus bricks. The player can also choose between viewing the "Speedometer", the "Course Map" or the "Close-up Map"._NEWLINE_The game contains three single-player modes: "Circuit Race", "Single Race" and "Time Race", as well as one multiplayer mode, "Versus Race". The Circuit Race mode follows the game's main plot, and allows players to race through circuits made up of multiple tracks, gaining points based on where they place, while contending with a highly skilled racer who leads each circuit. In a circuit, the player must earn enough points to move on to the next race, and will win if they finish with the most points. Placing third or above in a circuit unlocks the next circuit for the player. The Single Race mode allows the player to race on a single track unlocked from the Circuit Race mode. The Time Race mode places the player in a race against Veronica Voltage driving a ghost car with the aim of beating her best time around a track chosen by the player. Versus Race allows two players to race against each other in a split screen view without non-player character minifigures on the track._NEWLINE_Throughout the game, the player can unlock various brick sets and character pieces by completing certain tasks, such as coming first in a Circuit Race. The game's "Build Menu" allows the player to build custom cars, minifigures and driving licenses of their own design using unlocked bricks and character parts. Minifigures can be customized with different hat, hair, head, body and leg parts, and given a name entered by the player on the minifigure's driving license. A picture of the player's minifigure is also placed on their driving license, and their facial expression can be changed by the player. The player can create a custom car using a combination of different chassis and car sets. The player can rotate, move and place bricks from these sets directly on to the chassis. Placement of the bricks changes the car's balance and weight, which affects its overall performance. The "Mix" option creates minifigures from randomly selected parts, while the "Quick Build" option creates one of 2 presets for a specific chassis. _START_SECTION_ Development _START_PARAGRAPH_ The concept for Lego Racers was initially created by High Voltage Software founder Kerry J. Ganofsky, with the idea of players being able to build and race cars created with Lego bricks. After a year of development, Lego Media began production of the game, hiring Ganofsky's company to develop it. 
Lego Media and other facilities within The Lego Group collaborated with High Voltage Software during the production of the game._NEWLINE_A large number of character models, documents and pictures from different Lego System characters and models were sent to the developers, who eventually chose to use the Castle, Space, Adventurers and Pirates themes in the game. High Voltage Software chose the characters they liked best from these themes and created character studies for them to "capture the mood of each persona". Certain characters would assume the role of bosses, while others were featured as less skilled opponents. The developers also created two original characters, Rocket Racer and Veronica Voltage._NEWLINE_High Voltage Software spent over a year creating Lego Racers' car building mechanics. The game's lead programmer, Dwight Luetscher, created a formula that was used by the game's artists to create individual Lego elements in the game. The pieces available to the player were selected from hundreds of Lego elements by the developers, chosen first by aesthetics, and then analysed to see if they would fit into Luetscher's formula. The developers chose to affect the attributes of the player's car, such as handling, acceleration and top speed, through how many bricks are placed on the chassis, as this is simpler to understand for the game's main age demographic._NEWLINE_Due to the high number of Lego sets and pieces in the game, a custom mesh code was created to "weld" the geometry in place and optimize the cars polygon count, creating one solid mesh for each car created by the player. Every element in the game, including bricks and character pieces, had different levels of detail created for use in menu screens and cut scenes, where the models had to be a higher quality due to the player seeing them up close. The developers planned a damage system where bricks would break apart from the car upon crashing, but this presented "too many problems to make it a real possibility". Lego Racers was available to play before release by journalists at E3 1999. _START_SECTION_ Sequel _START_PARAGRAPH_ Following Lego Racers' success, news arose in April 2001 that Pocket Studios was working on a sequel to the Game Boy Advance version of Lego Racers, titled Lego Racers 2, which was then shown at E3 2001 in May that year. The eponymous Microsoft Windows and PlayStation 2 counterpart to Lego Racers 2, developed by Attention to Detail, was announced in August 2001, and released in September 2001. The sequel followed up immediately after Rocket Racer's defeat in Lego Racers, who is shown a new opportunity to reclaim his title as world champion, by travelling to Xalax and prove himself worthy of it. After Rocket Racer proceeds to do so and succeeds, the player is tasked to control their self-built protagonist, racing through various worlds based on Lego themes, and eventually face Rocket Racer again. Lego Racers 2 was received less favorably than Lego Racers, and incorporated numerous elements from both Lego Racers and Rollcage, another game developed by Attention to Detail._NEWLINE_An arcade-style version of Lego Racers was shown in Legoland Windsor's Lego Rocket Racers building in "The Beginning" area, between 2000 and 2004, as well as 2009 and 2011. + _START_ARTICLE_ Grange railway station _START_SECTION_ History _START_PARAGRAPH_ The original station, located 13.2 kilometers from Adelaide and on the western side of Military Road, was opened in September 1882 as the terminus of the Grange railway line. 
Initially operated by a private company, South Australian Railways took over the line in the 1890s, and extended it to Henley Beach station via the Henley Beach railway line. On 31 August 1957, however, the line was cut back to Grange._NEWLINE_On 9 March 1986, the current Grange station, on the eastern side of Military Road replaced the original station on the western side. The station was relocated to prevent traffic flow along Military Road from being interrupted by the arrival of trains. The ticket office and shelter of the original station were demolished shortly after, but the unused platform remains in place._NEWLINE_The train station shelter was replaced in 2017. + _START_ARTICLE_ Heinrich Graf von Einsiedel _START_SECTION_ Biography _START_PARAGRAPH_ Einsiedel, a great-grandson of Otto von Bismarck, was born in Potsdam, Province of Brandenburg, as the youngest child to Herbert von Einsiedel (1885–1945) and Irene von Bismarck-Schönhausen (1888–1982). His parents were divorced in 1931._NEWLINE_In World War II Einsiedel served as a German fighter pilot, initially with Jagdgeschwader 2 over the Western Front, flying the Messerschmitt Bf 109. He took part in escort operations over the cruisers Scharnhorst, Gneisenau and Prinz Eugen as they made their 'Channel dash' from Brest to Germany in February 1942. von Einsiedel claimed two of the six Fairey Swordfish of No. 825 Squadron Fleet Air Arm, who made an unsuccessful low-level torpedo attack. _NEWLINE_On one occasion he was shot down and crash-landed near Rotterdam and was also shot down into the Channel and rescued. In June 1942 von Einsiedel was transferred to Jagdgeschwader 3 on the Russian Front for the forthcoming offensive against Stalingrad. Over the next six weeks, he claimed 33 Russian aircraft downed, including four Petlyakov Pe-2 bombers in the space of six minutes on 20 August. He was awarded the German Cross in Gold._NEWLINE_On 30 August 1942, during combat with Russian 'Ratas', he was forced to land and was captured by Russian ground forces, becoming a prisoner of war in the Soviet Union. The Soviet authorities soon realised the pilot was a well-connected member of the German nobility and thus a potentially valuable propaganda weapon. On capture von Einsiedel refused to divulge any military intelligence to his captors. He finally agreed however to send an open letter home stating he was being treated correctly and that Germany was going to lose the war, and that his great-grandfather Otto von Bismarck, would never have invaded Russia._NEWLINE_He became a founding member, Vice-president and commissary of propaganda of the National Committee for a Free Germany and led a propaganda unit which broadcast and distributed leaflets to German forces._NEWLINE_Released after the war, von Einsiedel initially worked for the Tägliche Rundschau, the German newspaper of the Soviet Military Administration in Germany, but became increasingly disillusioned with the Soviet regime, experiencing at first hand the Russian corruption and inefficiency. He was given permission to visit West Berlin on behalf of the NKVD for intelligence gathering purposes. While meeting his mother he was arrested by US Forces and sentenced by an American court for spying and having forged documents. He was released on appeal. 
_NEWLINE_Despite a highly publicised press conference when back in the East, he was by now seen as a liability by the Soviet authorities._NEWLINE_He thus moved to West Germany in late 1948, where he worked as a translator, script-writer and journalist. The governing Socialist Unity Party of East Germany acknowledged von Einsiedel as a bonafide anti-fascist but a petit bourgeois who, "as soon as the class war became acute", had wavered and switched political camps for his own self interests._NEWLINE_von Einsiedel also wrote 'The Shadow of Stalingrad: Being the Diary of Temptation' in 1953, which attempted to tell his complex story. Eventually von Einsiedel joined the film industry, as a scriptwriter and a film soundtrack dubber. He also played the role of a pilot in the drama 'The Last Bridge' (1953) with his first wife, Barbara Rütting._NEWLINE_He also wrote for the liberal Hamburg weekly, Die Zeit. He twice won the German bridge championship and played in the bridge World Cup._NEWLINE_Einsiedel was a member of the Social Democratic Party of Germany from 1957 until 1992 and was elected as a member of the German Bundestag as a candidate of the Party of Democratic Socialism (PDS) from 1994 until 1998._NEWLINE_Einsiedel died in Munich on 18 July 2007 aged 85. + _START_ARTICLE_ Calmar, Iowa _START_SECTION_ History _START_PARAGRAPH_ Calmar was platted in 1854. It was named after Kalmar, a city in Sweden._NEWLINE_The settlement experienced growth in 1868 when the railroad was built through it. Calmar was incorporated on July 14, 1869. _START_SECTION_ Geography _START_PARAGRAPH_ Calmar is located at 43°10′55″N 91°51′59″W (43.182054, -91.866446)._NEWLINE_According to the United States Census Bureau, the city has a total area of 1.07 square miles (2.77 km²), all of it land. _START_SECTION_ 2010 census _START_PARAGRAPH_ As of the census of 2010, there were 978 people, 444 households, and 252 families residing in the city. The population density was 914.0 inhabitants per square mile (352.9/km²). There were 492 housing units at an average density of 459.8 per square mile (177.5/km²). The racial makeup of the city was 98.0% White, 0.3% African American, 0.1% Asian, 0.6% from other races, and 1.0% from two or more races. Hispanic or Latino of any race were 2.0% of the population._NEWLINE_There were 444 households of which 27.0% had children under the age of 18 living with them, 43.9% were married couples living together, 9.9% had a female householder with no husband present, 2.9% had a male householder with no wife present, and 43.2% were non-families. 31.5% of all households were made up of individuals and 10.8% had someone living alone who was 65 years of age or older. The average household size was 2.20 and the average family size was 2.84._NEWLINE_The median age in the city was 34.9 years. 21.5% of residents were under the age of 18; 13.7% were between the ages of 18 and 24; 23.9% were from 25 to 44; 27.7% were from 45 to 64; and 13.2% were 65 years of age or older. The gender makeup of the city was 51.7% male and 48.3% female. _START_SECTION_ 2000 census _START_PARAGRAPH_ As of the census of 2000, there were 1,058 people, 452 households, and 269 families residing in the city. The population density was 1,006.8 people per square mile (389.0/km²). There were 482 housing units at an average density of 458.7 per square mile (177.2/km²). The racial makeup of the city was 98.87% White, 0.19% African American, 0.09% Native American, 0.19% Asian, and 0.66% from two or more races. 
Hispanic or Latino of any race were 0.47% of the population._NEWLINE_There were 452 households out of which 27.2% had children under the age of 18 living with them, 49.8% were married couples living together, 7.1% had a female householder with no husband present, and 40.3% were non-families. 30.3% of all households were made up of individuals and 14.6% had someone living alone who was 65 years of age or older. The average household size was 2.33 and the average family size was 2.97._NEWLINE_In the city, the population was spread out with 23.3% under the age of 18, 14.7% from 18 to 24, 26.6% from 25 to 44, 18.6% from 45 to 64, and 16.8% who were 65 years of age or older. The median age was 36 years. For every 100 females, there were 105.0 males. For every 100 females age 18 and over, there were 104.8 males._NEWLINE_The median income for a household in the city was $36,250, and the median income for a family was $50,063. Males had a median income of $29,875 versus $21,708 for females. The per capita income for the city was $17,958. About 3.4% of families and 9.6% of the population were below the poverty line, including 5.7% of those under age 18 and 17.5% of those age 65 or over. _START_SECTION_ Education _START_PARAGRAPH_ Calmar is home to one of two campuses of Northeast Iowa Community College._NEWLINE_South Winneshiek High School is in Calmar. Its elementary and middle schools are in Ossian. + _START_ARTICLE_ Texas State Highway Loop 207 _START_SECTION_ Route description _START_PARAGRAPH_ Loop 207 begins and ends in Mont Belvieu at SH 146. Between the terminuses, Loop 207 intersects FM 565. _START_SECTION_ History _START_PARAGRAPH_ Loop 207 was designated on its current route on September 12, 1946. + _START_ARTICLE_ Swarovski crystal mesh Armani Privé gown _START_SECTION_ Reception _START_PARAGRAPH_ The Daily Mail remarked that Armani "stole the show" at the 2007 Oscars, as both Beyoncé and Katie Holmes also turned up wearing Giorgio Armani dresses and stated that Blanchett's dress "set the tone for the evening: a pale colour, with clean lines, that paid lip service to the metallic trend so prevalant on the catwalks this season." Cosmopolitan magazine cited the slinky, shimmery-silver, one-shoulder dress as one of the Best Oscar dresses of all time, saying, "Cate makes the list twice because of her consistently impeccable style. This one-shouldered gunmetal gown clings to her fabulous body like it was painted on, and the delicate and elegant hair and makeup complete the look without distracting us from the dress." + _START_ARTICLE_ William G. Bissell _START_PARAGRAPH_ William George Bissell (1857–1925) was a member of the Wisconsin State Senate. _START_SECTION_ Biography _START_PARAGRAPH_ Bissell was born on September 18, 1857 in Massena, New York. He moved with his parents to Lodi, Wisconsin in the spring of 1866. Bissell worked as a farmer and travelling salesman before becoming a general merchant in 1896. He married Eva S. Sisson (1860–1937). Bissell died in 1925 and is interred in Baraboo, Wisconsin. _START_SECTION_ Senate career _START_PARAGRAPH_ In the fall of 1898 Bissell was nominated for the state senate by the Republicans of the Twenty-seventh district, comprising Columbia and Sauk counties, and he was elected over Edmund S. Baker, the candidate of the democrats and James M. Blachly, the candidate of the Prohibitionists. Bissell represented the 27th District in the Senate from 1899 to 1902. 
Bissell served on the committees on state affairs, manufacturers and agriculture of the senate. + _START_ARTICLE_ Patrick Davis (politics) _START_PARAGRAPH_ Patrick Davis is a political consultant and strategist. Davis has worked in the George H.W. Bush Administration and the National Republican Senatorial Committee, most notably as Political Director in 2004. Davis also served as the Executive Director of the South Dakota Republican Party before going into private business in 2005. He lives in Colorado Springs, Colorado, with his wife, Jo Ann, and their two children. _START_SECTION_ Life in the public sphere _START_PARAGRAPH_ After graduating from college in 1990, Davis served as the Assistant to the Deputy Director of White House Political Affairs in the George H. W. Bush Administration. Davis then worked for the 1992 Bush-Quayle Presidential campaign, serving as the field desk coordinator for eleven Northwestern states._NEWLINE_Davis served as the Executive Director of the South Dakota Republican Party from 1995 to 1999. During this time, South Dakota Republicans increased their majorities in both houses of the State Legislature, John Thune was elected to the United States House of Representatives and Governor Bill Janklow was re-elected._NEWLINE_In 1999, Davis was hired to represent the National Republican Senatorial Committee as a Regional Political Director in ten Republican United States Senate campaigns. During the 2004 election cycle, Davis served as the NRSC's Political Director, increasing the Republican majority from 51 to 55. In his position as Political Director, Davis managed the political and strategic operations of the committee, including candidate recruitment, message development, and campaign management. He also directed the committee's $35 million voter contact budget._NEWLINE_Davis was involved in the competitive winning United States Senate campaigns for John Thune, Norm Coleman, Wayne Allard, Gordon Smith, Conrad Burns, Tom Coburn, Mel Martinez, Richard Burr, David Vitter, Jim Bunning, Johnny Isakson, Mike Lee, Richard Burr, John Hoeven, and Jim DeMint._NEWLINE_Davis was involved in the competitive winning United States Representative campaigns for Cynthia Lummis, Rick Berg, Steve Daines, Kristi Noem, Tim Huelskamp, and Mike Coffman. _START_SECTION_ Private consultant _START_PARAGRAPH_ In 2005, Davis founded Patrick Davis Consulting, LLC, a company that serves candidates, campaigns and corporations clients. Patrick Davis Consulting has been hired to work for both local and national campaigns, including Steve House for Governor (CO), Joe Gschwendtner for Governor (CO), Steve Laffey for Congress (CO), Floyd Trujillo for U.S. Senate (CO), Ron Saxton for Governor (OR), Don Stenberg for U.S. Senate (NE), Mike Protack for U.S. Senate (DE), Scott Tipton for Congress (CO-3), Jeff Crank for Congress (CO-5), Duane Sand for Congress (ND), Bruce Whalen for Congress (SD), Rick O’Donnell for Congress (CO-7), Kyle Hybl for CU Regent (CO-5), Eli Schwiesow for State Senate (SD), Glen Urquhart for Congress (DE), Karen England for Lt. Governor (CA), Rhonda Sivarajah for Congress (MN), Sharna Wahlgren for Congress (MN), David Gerson for Congress (MN) and Dan Lederman for Senate (SD)_NEWLINE_Since 1990, Davis has been involved in the winning U.S. 
Senate campaigns for John Thune, John Hoeven, Larry Pressler ('90), Steve Daines, Dan Sullivan, Lisa Murkowski, Jim DeMint, Joni Ernst, Cory Gardner, Thom Tillis, Bill Cassidy, Mike Lee, Norm Coleman, Rudy Boschwitz ('90), Tom Cotton, Bill Frist, Wayne Allard, Gordon Smith, Conrad Burns, Tom Coburn, Mel Martinez, Richard Burr, Jim Inhofe, John Cornyn, Sam Brownback, David Vitter, Jim Bunning, and Johnny Isakson._NEWLINE_He has been involved in the winning US House campaigns of Kristi Noem, Kevin Cramer, Tim Huelskamp, Mike Coffman, Steve Pearce, and Cynthia Lummis._NEWLINE_Finally, Patrick Davis Consulting also provides public relations services for private, non-political clients, such as Wal-Mart, Comcast, Neumann Education Foundation, and Neumann Systems Group. + _START_ARTICLE_ Cubelets _START_PARAGRAPH_ Cubelets are a line of construction toys manufactured by Modular Robotics._NEWLINE_The Cubelets are small color coded cubes that people magnetically stick together to form a variety of simple robots, a kind of modular robot._NEWLINE_The cubes connect with magnets and a genderless connector. + _START_ARTICLE_ 1991 World Artistic Gymnastics Championships _START_PARAGRAPH_ The 26th Artistic Gymnastics World Championships were held in Indianapolis, United States, in the Hoosier Dome from September 6 to 15, 1991. This was the last championships at which the Soviet Union competed. + _START_ARTICLE_ MN 5 (biostratigraphic zone) _START_PARAGRAPH_ In biostratigraphy, MN 5 is one of the MN zones used to characterize the fossil mammal faunas of the Neogene of Europe. It is preceded by MN 4 and followed by MN 6 and is part of the Orleanian age of the middle Miocene. MN 5 starts within magnetostratigraphic chron C5Cr, at 17.0 million years ago, and ends at the start of chron C5Bn.1r, at 15.0 million years ago, although some different correlations have been proposed. The reference locality used to correlate faunas with this zone is Pontlevoy-Thenay in France; other localities include La Retama in Spain, Castelnau-d'Arbieu in France, and Sandelzhausen in Germany._NEWLINE_In this zone, the muroid rodent Cricetodon first appears in western Europe, as do the poorly known Lartetomys and Mixocricetodon. In the extinct rodent family Eomyidae, the genus Ligerimys last appears during MN 5, but Keramidomys and Eomyops appear as immigrants. The last European marsupial, Amphiperatherium, last appears in France and Spain during MN 5, but persists into MN 6 in Germany._NEWLINE_The primate Pliopithecus first appears during MN 5. The rhinoceroses Prosantorhinus, Plesiaceratherium, Hispanotherium, and Gaindatherium make their last appearance, but Ancylotherium and Hoploaceratherium first appear during MN 5. Chalicotherium, a member of the related extinct family Chalicotheriidae, also appears for the first time. Several artiodactyls, such as the pig Conohyus, the deer Heteroprox and Dicrocerus, and the musk deer Micromeryx first appear, and the pigs Bunolistriodon and Aureliachoerus and the ruminants Amphimoschus and Lagomeryx last appear in MN 5. Two artiodactyl genera, Triceromeryx and Pseudoeotragus, occur only during MN 5. The primitive artiodactyl Cainotherium last appears in France and Spain, but persists into MN 6 in Germany. 
+ _START_ARTICLE_ Etheostoma gracile _START_SECTION_ Distribution and habitat _START_PARAGRAPH_ Etheostoma gracile is found in the Mississippi River basin from central Illinois and northeastern Missouri to Louisiana, also in the Red River drainages to southeastern Kansas and eastern Oklahoma, and the Gulf Slope drainages from the Tombigbee River in Mississippi to the Nueces River in Texas. Suitable habitats include pools of slow-flowing water in small streams, backwaters of larger rivers, turbid water over sand or mud, oxbow lakes, swamps, and among vegetation. _START_SECTION_ Status _START_PARAGRAPH_ The IUCN has listed this species as being of "Least Concern", because it has an extensive range in the Mississippi River system, has a large total population size, and numerous subpopulations. In general, the population trend seems stable and no major threats have been identified. + _START_ARTICLE_ N-acylneuraminate-9-phosphate synthase _START_SECTION_ Structural studies _START_PARAGRAPH_ As of late 2007, only one structure has been solved for this class of enzymes, with the PDB accession code 1WVO. + _START_ARTICLE_ GE Healthcare _START_SECTION_ 19th century _START_PARAGRAPH_ In 1893, C.F. Samms and J.B. Wantz founded the Victor Electric Company in a basement. By 1896 they made electrostatic generators for exciting x-ray tubes and electrotherapeutic devices._NEWLINE_They had a staff of six and a capital of $3,000 invested in the company._NEWLINE_Victor Electric_NEWLINE_plunged into the x-ray business and by 1896 (one year after Roentgen’s discovery) were making x-ray machines. The business grew rapidly and so, in 1896, moved into new premises three times the original size, but this did not solve the space problems and the company made 3 moves by 1899._NEWLINE_Victor Electric had competitors. In 1896, G.A.Frye began making x-ray tubes, which in 1897 was purchased by Swett & Lewis as the first merger in the x-ray business. _START_SECTION_ 20th century _START_PARAGRAPH_ During the first years, it was easier to keep up with the competition than space requirements. By 1903, Victor Electric had outgrown its facilities at 418 Dearborn St. in Chicago and bought two floors of a building at 55 Market Street, Chicago. This was again only a temporary stop; by 1910 it was too small and the firm moved again in 1911 to a building at the corner of Jackson Blvd. and Damen Avenue. This was the first permanent home of Victor Electric Co. They stayed there 35 years and during this time, gradually acquired all the space in the building and several around it._NEWLINE_During the first 20 years of the x-ray business, many new names appeared. In 1901 the Western Electric Coil Co. was formed. In 1902 MacAlaster & Wiggin purchased the x-ray tube business of Swett & Lewis. Two other companies were the Radio Electric Co., which was later to be known as Snook-Roentgen Manufacturing and the Scheidel Western X-Ray Coil Co. In 1907, Homer Clyde Snook introduced the Snook apparatus, the first interrupterless device produced for X-ray work. The Snook apparatus was manufactured in England._NEWLINE_In 1916, the first significant merger took place, Scheidel Western, Snook-Roentgen, MacAlaster & Wiggin, and Victor Electric Co. were merged with Victor, the surviving name. 
Victor’s two founders had key roles in the new firm; C.F.Samms was company president and J.B.Wantz was Vice-President of manufacturing and engineering._NEWLINE_Four years later, in 1920, a second major merger was accomplished when Victor was acquired by General Electric_NEWLINE_which was, at that time, the foremost manufacturer of x-ray tubes._NEWLINE_The marriage of Victor Electric and General Electric became complete of July 28, 1926 when Victor was declared a wholly owned affiliate of General Electric. The merger brought renewed vitality to the organization and Victor entered the foreign market with equipment sold and serviced in nearly 70 countries. In 1930, the name was changed from Victor to General Electric X-Ray Corporation._NEWLINE_World War II saw the dramatic use of x-rays in industry for non-destructive testing of war materials. It also saw the broad use of x-rays as a medical tool for military services._NEWLINE_As the war ended, GE X-Ray Corporation continued to grow. Greater production capacity and greater expertise was needed in the core business of building X-ray tubes. Since the tubes were made from hand-blown glass, the decision was made to move the company 90 miles north to Milwaukee, Wisconsin, in order to tap into the enormous amount of glass-blowing talent in Milwaukee's beer-brewing industry. The company moved from Jackson Blvd. in Chicago to a 43-acre (170,000 m²) site in the city of West Milwaukee, which had been used for building turbochargers during the war. The street in front was renamed Electric Avenue, and the General Electric X-Ray Corporation had a new home in 1947._NEWLINE_In 1951, the corporate structure was dissolved and the name changed to General Electric x-Ray Department. This new name lasted less than 10 years as the department divested itself of its industrial x-ray business, widened its medical business, and took on the name of GE Medical Systems Department. One of the reasons for the name of Medical Systems was due to the increase in the electro-medical business, which began in 1961 with the introduction of patient monitoring equipment. By 1967 modular equipment was developed which was soon popular in cardiac and intensive care units. Early in 1960, pacemakers were developed in Corporate Research & Development in Schenectady, New York, and in 1969 the Standby Pacemaker was developed._NEWLINE_In 1968, the Biomedical Business Section opened its first factory in Edgerton Avenue. Late in 1970 a surgical package was introduced and in 1971, equipment to monitor blood gasses during surgery was introduced._NEWLINE_Later in 1971, Biomedical opened a 9,000 square meter admin and engineering building opposite its factory and in 1972, the section was renamed The cardio-Surgical Product Section. With the growth of its medical business, the General Electric Company upgraded the department to The Medical Systems Division in 1971. Also in 1971, a major expansion programme was started and the Waukesha factory was planned. Work started in July 1972, and was completed in 1973._NEWLINE_In 1973, work on CT was started and eventually the first CT machine was installed in 1976. Development continued to the first CT 8800, and after long negotiations, GE acquired the medical division of EMI Group Ltd. 
in late 1980 soon after the 1979 takeover of EMI medical division by Thorn Electric company._NEWLINE_The American Anti-Trust Authorities stopped the takeover in the USA however, and the EMI factory in Chicago was bought up by Omni-Medical, who continued to make CTs for a number of years._NEWLINE_Meanwhile, back at GE, the Patient Monitoring Department was sold off in 1981. The initial boost provided by the EMI takeover turned into the doldrums as Reaganomics sent the US dollar soaring, so in 1984 GE bought a 49% share of YMS (Yokogawa Medical Systems), a Japanese company._NEWLINE_In 1983, GE Medical started investing heavily in Magnetic Resonance Imaging (MRI) technology, investing nearly 1 billion US dollars in a new plant in Waukesha, and the MR Signa was born, which would go on to become the very successful MR model range. The magnet plant in Florence (USA) was opened a short time later, giving GE its own magnet production. In the same year, GE divested its dental x-ray division to form Gendex Dental Systems._NEWLINE_In 1985 GE acquired Technicare from Johnson and Johnson. Originally named Ohio Nuclear (and in 1979, after another fusion, Ohio Nuclear Unirad), the name was changed to Technicare in 1982. Technicare (with headquarters in Cleveland, Ohio) had been producing a range of rotate-stationary CTs with an installed base in the thousands, as well as some x-ray diagnostic equipment and a nascent MRI product range._NEWLINE_Up to this time, the medical Systems Division had simply been divided into domestic and international, but in 1987 it was decided to re-organize into the three "poles" of America, Europe and Pacific. In 1988, GE Medical Europe merged with CGR (a medical equipment supplier based in France) to form General Electric CGR Medical Systems. The European headquarters were moved from Hammersmith (UK) to Buc near Paris._NEWLINE_In 1992, GE had a setback after long negotiations to buy Picker International, who were a major producer of CT and MR equipment. The deal was not approved by the American authorities, and so GE just bought the Picker Service organization in the U.K., leaving the rest of Picker intact._NEWLINE_In 1994, it was decided to change the name in Europe from GE-CGR back to General Electric Medical Systems. At the close of 1998, GE Medical acquired the Nuclear and MR businesses of Elscint, (then a division of Elron, based in Haifa, Israel), the CT business being bought by Picker, and in the same year Marquette Medical Systems became a wholly owned subsidiary of GE Medical. In 1998, GE medical bought Diasonics Vingmed Ltd. from Elbit Medical Imaging (of Haifa, Israel), thus expanding its ultrasound imaging business. _START_SECTION_ 21st century _START_PARAGRAPH_ In 2001, GE Medical Systems acquired San Francisco, CA, based CT maker Imatron, Inc for $210 million. Imatron produced an Electron beam tomography (EBT) scanner that performs imaging applications used by physicians specializing in cardiology, pulmonology and gastroenterology. The formal Imatron business was later incorporated into GE Healthcare's Diagnostic Imaging business segment. 
In early 2002, GE Healthcare acquired MedicaLogic (creator of the former Logician, an ambulatory electronic medical records system) for approximately $32 million._NEWLINE_By January 2003, GE had acquired Millbrook Corporation, maker of Millbrook Practice Manager, a billing and scheduling system for doctors' offices._NEWLINE_GE Healthcare IT would later merge the two products into one, although the stand-alone EMR product is still available and in development. Also in April 2002, GE Healthcare completed the acquisition of Visualization Technology, Inc. of Boston, MA, a manufacturer of intra-operative medical devices and related products for use in minimally invasive image guided surgery._NEWLINE_In 2003, GE Healthcare acquired Instrumentarium (including its Datex-Ohmeda division), a producer, manufacturer, and supplier of anesthesia machines and mechanical ventilators. To satisfy regulatory concerns in the United States and in Europe, GE Healthcare was forced to divest Instrumentarium’s Ziehm Imaging mobile C-arm business, as well as its Spacelabs patient-monitoring unit. Currently, GE Healthcare owns 80% of all anesthesia machines in the United States and 60% of the machines in the world. The former Instrumentarium business was incorporated into GE Healthcare's Clinical Systems business segment._NEWLINE_In 2004, the former Amersham plc business segments were separated into the GE Healthcare Medical Diagnostics and Life Sciences business segments, and on 1 May 2013 both businesses were combined again under the GE Life Sciences brand with Kieran Murphy taking the leadership role. Also in 2004, GE Healthcare, along with other healthcare companies, built a research reactor for neutron and unit cell research at GE's European Research Center near Garching (outside of Munich), Germany. It is the only such reactor currently in operation. In 2005, Sir William Castell, CEO of GE Healthcare and former CEO of Amersham plc, stepped down as CEO to become Chairman of the Wellcome Trust—a charity that fosters and promotes human and animal research—in the United Kingdom. Former GE Medical Systems CEO Joe Hogan became the overall CEO for the GE Healthcare business. In 2005, Dental Imaging operations were separated from GE Healthcare. The PaloDEx Group Oy was founded and continues the business with its subsidiaries Instrumentarium Dental and SOREDEX. Specifically, Instrumentarium Dental continues the Orthopantomograph brand and the intraoral systems FOCUS and SIGMA, formerly known as Instrumentarium Imaging or GE Healthcare products._NEWLINE_In September 2005, GE Healthcare and IDX Systems Corporation announced that they had entered into a definitive $1.2 billion merger agreement for GE to acquire IDX, a leading healthcare information technology (IT) provider. The acquisition was completed in January 2006. IDX was folded into GE Healthcare Integrated IT Solutions, which specializes in clinical information systems and healthcare revenue management._NEWLINE_On 4 February 2008, GE Healthcare announced that it had completed the acquisition of Whatman plc (LSE:WHM), a global supplier of filtration products and technologies, at 270p in cash for each Whatman share, valuing Whatman at approximately £363 million (approximately $713 million). In July 2008, Joseph Hogan announced his intent to leave his post as CEO of GE Healthcare to take the role of CEO at ABB. On July 17, 2008, GE Healthcare announced John Dineen had been chosen to replace outgoing CEO Joseph Hogan. Mr.
Dineen had been head of GE's Transportation division since 2005. On March 24, 2010, GE Healthcare announced the acquisition of MedPlexus. In late April 2010, GE Healthcare announced it was investing €3 million in the Technology Research for Independent Living Centre (TRIL). The Irish centre seeks to enhance independence for elderly people through technological innovation._NEWLINE_In July 2015, GE Healthcare partnered with the 2015 CrossFit Games to provide athletes with mobile imaging equipment._NEWLINE_In January 2016, it was announced that GE Healthcare's global headquarters would move to Chicago in early 2016._NEWLINE_In June 2017, GE announced Kieran Murphy as the new CEO of GE Healthcare, with former CEO John Flannery's appointment as CEO of GE._NEWLINE_In April 2018, GE announced the sale of several healthcare information technology assets for $1.05 billion to Veritas Capital._NEWLINE_In June 2018, GE announced it would spin off GE Healthcare into its own company, most likely to be based in Chicago. This represents the conglomerate's efforts to shrink and focus more on the aviation, power and renewable energy sectors. _START_SECTION_ Criticism _START_PARAGRAPH_ According to The Independent, the firm has received more money back in tax benefits (£1.6 million) in the UK over the past 12 years than it has paid in. Its UK operations are all ultimately owned by a holding company in the Netherlands. Tax paid was £250,000, 1.7% of its £14.3m profit. The group employs 22,000 people in the UK._NEWLINE_It supplies a cloud-based imaging system to the East Midlands Radiology Consortium which was described in October 2017 as breaking down, so that medical images had to be sent between hospitals by taxi. + _START_ARTICLE_ National Association Opposed to Woman Suffrage _START_PARAGRAPH_ The National Association Opposed to Woman Suffrage (NAOWS) was founded in the United States by women opposed to the suffrage movement in 1911. It was the most popular anti-suffrage organization in northeastern cities. NAOWS had influential local chapters in many states, including Texas and Virginia. _START_SECTION_ History _START_PARAGRAPH_ The National Association Opposed to Woman Suffrage (NAOWS) was established by Josephine Jewell Dodge in New York City in 1911. Dodge had the first meeting at her house and women came from New York and surrounding states. Dodge was at the time the president of the New York State Association Opposed to Woman Suffrage (NYSAOWS). Dodge resigned from NYSAOWS to take over as president of NAOWS. Shortly after formation, state branches of NAOWS began to form. Headquarters in Washington, D.C. were opened in 1913, giving the organization a presence in both New York and the U.S. capital._NEWLINE_Like other anti-suffrage organizations, NAOWS published a newsletter as well as other publications, containing their opinions on the current political issues of the time. The newsletter of the association was called Woman's Protest (later renamed Woman Patriot in 1918). Dodge also toured the country, spreading anti-suffrage views to other states._NEWLINE_Members of the NAOWS typically belonged to wealthy families who feared suffrage would upset the status quo. In the South, the NAOWS gained support from many plantation owners who believed rights for women would lead to rights for minorities. Josephine Dodge, the founding president, was replaced in 1917 by Alice Hay Wadsworth, wife of U.S. Senator James W. Wadsworth, Jr. from New York.
Upon the amendment of the New York State Constitution granting women the right to vote, the focus of the NAOWS shifted from the state level to the federal level. The organization also began to see more men join NAOWS than before. The headquarters were moved solely to Washington, D.C., and the organization merged with the Woman Patriot Publishing Company. The organization disbanded in 1920 as a result of the passage of the Nineteenth Amendment. _START_SECTION_ Texas Association Opposed to Woman Suffrage _START_PARAGRAPH_ In March 1916, the Texas Association Opposed to Woman Suffrage (TAOWS) was created as a chapter of NAOWS in Houston with Pauline Wells as the president. The chapter in Texas also connected the increase in African Americans voting to women's suffrage and stoked fears of "domination by the black race in the South." They also believed that women's suffrage was linked to "feminism, sex antagonism, socialism, anarchy and Mormonism." Like their parent organization, TAOWS had local chapters in major Texas cities. TAOWS fought against the Texas Equal Suffrage Association, which was pushing for Texas women's right to vote in Texas primary elections in 1918. In April 1919, headquarters were moved to Fort Worth. In 1919, TAOWS successfully campaigned against a state measure for women's suffrage, which was defeated by 25,000 votes in May. However, in June 1919, Texas passed a suffrage amendment, allowing women to vote, and TAOWS stopped fighting against women's suffrage. _START_SECTION_ Virginia Association Opposed to Woman Suffrage _START_PARAGRAPH_ A group, the Virginia Association Opposed to Woman Suffrage (VAOWS), formed in Richmond in March 1912 and affiliated with NAOWS. Jane Rutherford served as the president of VAOWS. Local branches in different cities formed by 1913 and the organization distributed anti-suffrage literature. In 1915, VAOWS helped raise money for the Belgian Relief Fund during World War I. By May 1917, VAOWS had doubled in size and continued to grow through 1918. Around 8,000 women had signed up with the anti-suffrage cause in Richmond by 1919._NEWLINE_Like the Texas Association Opposed to Woman Suffrage, VAOWS also suggested that race riots, the black vote and women's suffrage were connected. In a sponsored editorial published in The Richmond Times-Dispatch on September 2, 1919, VAOWS exclaimed, "RACE RIOTS WILL INCREASE IF THERE IS MORE POLITICS BETWEEN THE RACES AND IF WOMEN ARE MIXED UP IN POLITICS!" _START_SECTION_ Political views _START_PARAGRAPH_ One of NAOWS' publications was a pamphlet, Some Reasons Why We Oppose Votes for Women, which, as the title suggests, outlined some of the reasons why they were opposed to woman suffrage. They believed it was irrelevant to the success of the country, as stated in their pamphlet:_NEWLINE_"Because the great advance of women in the last century— moral, intellectual and economic— has been made without the vote; which goes to prove that it is not needed for their further advancement along the same lines."_NEWLINE_The National Association Opposed to Woman Suffrage opposed women's right to vote because they said that the majority of women did not want the right to vote, and because they believed that the men in their lives accurately represented the political will of women around the United States. NAOWS submitted pamphlets like these to the general public as well as directing them to government officials so that political figures would see that women opposed the then-unratified Nineteenth Amendment.
They did this in order to counteract the rhetoric of the suffragettes of the time. According to the NAOWS and the state-based organizations that it inspired, voting would severely and negatively affect the true submissive and domestic state of the feminine. These organizations were championed by women who thought themselves the prime examples of true womanhood—quiet, dignified, and regal. They looked with disdain at the outward protests of suffragettes._NEWLINE_NAOWS wanted to appeal to conservative and traditional members of their community, including other women and religious figures. They positioned themselves as being in opposition to "the militant suffragette" and militant or "hysterical" tactics. NAOWS also believed that women's involvement in politics would interfere with their "civic duties for which they are peculiarly adapted." NAOWS believed that women were equal to men, but had different duties and "functions." _START_SECTION_ Quotes from Some Reasons Why We Oppose Votes For Women _START_PARAGRAPH_ "We believe that political equality will deprive us of special privileges hitherto accorded to us by law."_NEWLINE_"[We oppose suffrage] Because it means simply doubling the vote, and especially the undesirable and corrupt vote of our large cities." _NEWLINE_"[We oppose suffrage] Because our present duties fill up the whole measure of our time and ability, and are such as none but ourselves can perform." + _START_ARTICLE_ N4 road (Ghana) _START_SECTION_ Route _START_PARAGRAPH_ Major towns and cities along the route of the N4 include Madina, Adenta, Aburi, Mamfe, Koforidua, Asokore, and Bunso. _START_SECTION_ Greater Accra Region _START_PARAGRAPH_ The N4 begins at the Tetteh Quarshie Interchange in Accra and travels north, running by the University of Ghana at Legon. Continuing north to Madina, the N4 intersects with the R40 near the Madina Police Station and veers slightly northwest toward Oyarifa before entering the Eastern Region. _START_SECTION_ Eastern Region _START_PARAGRAPH_ In the Eastern Region, the N4 continues north to Aburi, where it intersects with the IR1 near the Aburi Botanical Gardens. The route continues north, intersecting with the R22 at Mamfe, then turns west through Saforo and Kwamoso before veering northwest at Adawso toward Koforidua. At Koforidua, the N4 intersects with the R42 near the Koforidua Technical University and continues north to Asokore, where it intersects with the R41. From Asokore, the N4 turns northwest through New Tafo before terminating at Bunso, where it intersects with the R60 and the N6, which continues north to Kumasi. + _START_ARTICLE_ Hussein Adan Isack _START_PARAGRAPH_ Hussein Adan Isack (born 1957) is a naturalist and ethnobiologist living in Kenya, and a former research scientist in ornithology. _START_SECTION_ Background and youth _START_PARAGRAPH_ Hussein Adan Isack was born to parents who lived in the pastoral Northern region of Kenya. Throughout his youth Hussein developed and cultivated a keen interest in bird watching and began keeping birds at his home._NEWLINE_His interest was further stimulated when he joined high school, where he met Paul Scholes, a biologist and an ardent birdwatcher from Liverpool. He reared falcons in school and would later become the leader of the wildlife clubs in his area.
During his holidays, he would spend time pursuing his passion. Later, he met Van Someren, an ornithologist with whom he worked at the National Museum of Kenya, paving the way for a career spanning over 30 years._NEWLINE_He was later awarded the Christopher Welch Scholarship for Natural Sciences at Oxford University, where he studied zoology and majored in behavioral ecology. It was during this time that Hussein travelled extensively across the world and met Heinz Ulrich Reyer, a zoologist from Zurich. The two would become close friends and colleagues. He then made frequent trips to the North alongside Heinz, studying the behavioural patterns of birds in that area, specifically the honeyguide, a bird revered for its ability to locate and direct locals towards honey in the remote desert. _START_SECTION_ Professional life _START_PARAGRAPH_ Having received his PhD in ornithology from Oxford, Hussein went back to Kenya, where he worked as a scientist at the National Museum of Kenya. He became the head of the Ornithology department and co-ordinated research activities in the region._NEWLINE_He is a founding board member of the Ewaso Ng'iro Development Authority, appointed by H.E. President Daniel Toroitich arap Moi and tasked with the responsibility of ensuring sustainable development of the water basin._NEWLINE_In 1991, he took part in the making of a documentary, The Trials of Life, with Sir David Attenborough, produced by the BBC and the Australian Broadcasting Service. It focused on the communication between humans and birds, specifically the honeyguide, on which Hussein was known for his expertise._NEWLINE_Currently based in Kenya, he founded and runs the Kivulini Heritage Trust, a non-governmental organization that seeks to preserve indigenous cultures and promote sustainable use of natural resources. + _START_ARTICLE_ Cloisters Cross _START_PARAGRAPH_ The Cloisters Cross, also referred to as the Bury St Edmunds Cross, is a complex 12th-century ivory Romanesque altar cross in The Cloisters, part of the Metropolitan Museum of Art in New York. The cross is carved from walrus ivory. _START_SECTION_ Description _START_PARAGRAPH_ The carvings, which cover both the front and back sides, include ninety-two intricately carved figures and ninety-eight inscriptions. The figures, each of which is only about one-half inch tall, illustrate a number of Biblical scenes and, on the back, a number of the Old Testament prophets with banderoles containing quotations from their books. There is debate over whether or not these inscriptions were chosen with anti-Semitic intent. The Metropolitan Museum of Art's website currently says: "Prominent among the inscriptions are several strong invectives against Jews. Though it is impossible to know precisely who commissioned this piece and with what aims, the cross certainly offers some indication of the anti-Semitism prevalent in England at this time. By the end of the thirteenth century, Jews were expelled from the country". This theme was developed in a book by Thomas Hoving, the curator involved when the Metropolitan acquired the cross, and later Director. This was unkindly described in an academic review by Elizabeth C. Parker and Charles T.
Little as "an autobiographical romance...written in Raymond Chandler style"._NEWLINE_Parker and Little disagree with Hoving and think that it is doubtful that the cross, a sophisticated theological object, was specifically designed for the purpose of either castigating or converting any member of the small Jewish population in England in the mid-twelfth century._NEWLINE_The sculptor is not known. Thomas Hoving, who managed the acquisition of the cross while he was associate curator at The Cloisters, concluded that it was carved by Master Hugo at Bury St Edmunds Abbey in Suffolk. However, beyond stylistic affinities there is no certain evidence to suggest that the cross was even made in England; although this is accepted by most scholars, other places of origin such as Germany have been proposed._NEWLINE_The history of the cross before it was acquired by Ante Topić Mimara (1898–1987) is unknown. He sold it to the Metropolitan Museum in 1963. The British Museum was also keen to buy the cross, but they eventually declined, because Topić Mimara steadfastly refused to provide proof that he had full title to sell the cross. Hoving reportedly sat up drinking coffee with Topić Mimara until after midnight on the night that the British Museum's option lapsed, and he purchased the cross immediately afterwards for £200,000. + _START_ARTICLE_ Štefan Ružička _START_SECTION_ Playing career _START_PARAGRAPH_ Ružička was drafted in the third round of the 2003 NHL Entry Draft by the Philadelphia Flyers and proceeded to the Canadian Hockey League to play for the Owen Sound Attack, of the Ontario Hockey League, under the direction of head coach Mike Stothers._NEWLINE_On September 14, 2015 he opted to take a hiatus from professional hockey in spite of being only 30 years old. Over a calendar year later, he returned to the professional ranks in securing a contract with HC Sparta Praha of the Czech Extraliga on September 4, 2016. + _START_ARTICLE_ Nikki Marshall _START_SECTION_ Early life _START_PARAGRAPH_ Born in Thornton, Colorado to Mike and Kelly Marshall, Nikki was raised with her younger sister, Shaye, in Mead, Colorado. She attended Skyline High School in the nearby city of Longmont and was the leading scorer for the soccer team. She was named All-State three times during her sophomore, junior and senior years and was a three-time Longmont Times-Call Soccer Player of the Year. Marshall finished her high school career with 100 goals and 38 assists. She graduated as the top scorer in the school's history. She scored 23 goals and served 10 assists during her senior year alone and was named 2006 Northern Conference Player of the Year, all-region soccer player of the year, and Skyline Falcon of the Year. Also a decorated track athlete at the school, she earned All-State honors in the 100 meter, 400 meter relay and long jump during her junior year and won the state championship in the 800 medley as a senior. _START_SECTION_ Colorado Buffaloes _START_PARAGRAPH_ Marshall currently holds seventeen school records for the University of Colorado, and is the all-time leading goal scorer for the Colorado women's soccer team. As a freshman, she led the Buffaloes' most prolific offensive season with seventeen goals, all the way to the Sweet Sixteen in the 2006 NCAA postseason. She was on Soccer Buzz's All-American Fourth Team, All-American Freshman Team, All-Central Region First Team, All-Central Region Freshman team, and Freshman of the Year. 
She was on the Freshman All-American Team in Soccer America and was a National Player of the Week in Soccer Times. Her Big 12 Conference awards in 2006 include Newcomer of the Year, First Team All-Big 12, All-Big 12 Newcomer Team, and Big 12 All-Tournament Team. The University of Colorado Athletic Department awarded her Female Athlete of the Year and Female Freshman of the Year._NEWLINE_In 2007, Marshall led the Buffaloes in scoring for the second straight year with nine goals, even though she played defender that year for Colorado. Before her sophomore season, Marshall was named to the Pre-season All-American Team by Soccer America and was ranked Pre-season All-Big 12 by the Big 12 Conference. She was second on the team in playing time, totaling 1,927 minutes on the season. She led the Buffs to a 10–8–4 record and another bid to the NCAA tournament. She ended the season with a First Team All-Big 12 award from the Big 12 Conference._NEWLINE_During her junior year, Marshall was moved back up to the striker position. She was ranked Pre-season All-Big 12 by the Big 12 Conference. After another successful season, Marshall was second on the team for goals scored, with eight goals. She led her team to second place in the Big 12 Tournament and to a bid to the NCAA Tournament. For the third year in a row, she received First Team All-Big 12 honors from the Big 12 Conference._NEWLINE_In 2009, Marshall was again ranked a Pre-season All-American by Soccer America. She was captain of the Buffaloes along with fellow senior Kara Linder. She led the Buffs in scoring with a total of eight goals on the season. She also led the team in shots taken, with 53. She finished the season with another First Team All-Big 12 accolade. _START_SECTION_ The WPS Years, 2010–11 _START_PARAGRAPH_ Marshall was the first draft pick (seventh overall) for the Washington Freedom in the 2010 WPS Draft. During the 2010 season, she started in all 24 of the team's regular season matches and scored three goals playing as a defender. The Freedom finished fourth during the regular season with an 8–7–9 record, earning a berth to the playoffs. During the playoff quarterfinals, the Freedom were defeated by the Philadelphia Independence 1–0._NEWLINE_Marshall remained with the club in 2011 when they relocated to Florida and became magicJack under new ownership. She played in eight games for magicJack before being traded to the Boston Breakers. She made seven appearances for the Breakers as the team finished fourth in the regular season with a 5–4–0 record. On August 17, 2011, the Breakers were defeated 3–1 by magicJack and eliminated from the playoffs. _START_SECTION_ Portland Thorns FC, 2013–2014 _START_PARAGRAPH_ In February 2013, Marshall signed with Portland Thorns FC for the inaugural season of the National Women's Soccer League. She was in the starting lineup as a defender in all of the Thorns' 22 games as the team finished third in regular season play and received a berth to the playoffs. During the team's playoff semi-final match, Marshall provided the assist on Tiffany Weimer's equalizer goal. The Thorns would eventually defeat FC Kansas City 3–2 in overtime. The Thorns then defeated the Western New York Flash in the playoff final, clinching the league's first championship title._NEWLINE_Marshall was waived by the Thorns during the post-season and picked up during the waiver draft by the Washington Spirit. A few months later, she was traded to the Seattle Reign FC. In December 2013, she was traded back to the Portland Thorns.
Thorns management clarified that Marshall had been put on waivers due to a cap issue. With the trades, she was signed to a new contract at a different salary for the 2014 season._NEWLINE_During the 2014 season, Marshall started in all 24 matches for the Thorns, playing 2,072 minutes. After finishing third during the regular season with a 10–8–6 record, the Thorns advanced to the playoffs, where they were defeated in the semifinals 2–0 by eventual champions FC Kansas City._NEWLINE_In August 2014, Marshall suffered an anterior cruciate ligament (ACL) tear during a match against the Seattle Reign. She announced her retirement in February 2015, citing low pay. _START_SECTION_ International _START_PARAGRAPH_ Marshall was a member of the United States under-20 women's national soccer team that competed in the 2008 FIFA U-20 Women's World Cup in Chile. Marshall and fellow central defender Lauren Fowlkes were the only two members of that squad to start in and play every minute of all six matches of the tournament; both were praised for their poised performance as the anchors of the team's defense, especially during the final game against North Korea. + _START_ARTICLE_ Roberto Calasso _START_SECTION_ Biography _START_PARAGRAPH_ Calasso was born in Florence in 1941, into a family of the Tuscan upper class, well connected with some of the great Italian intellectuals of their time. His maternal grandfather Ernesto Codignola was a professor of philosophy at Florence University. Codignola created a new publishing house called La Nuova Italia, in Florence, as his friend Benedetto Croce had done in Bari with Laterza. Calasso's uncle, Tristano Codignola, was a partisan during World War II who after the war joined the political life of the new republic, and was for a while Minister of Education. His mother Melisenda – who gave up an academic career to raise her three children – was a scholar of German literature, working on Hölderlin’s translations of the Greek poet Pindar. Calasso's father Francesco was a law professor, first at Florence University and then in Rome, where he eventually became dean of his faculty. He was arrested by the fascist militia after the assassination of Giovanni Gentile and sentenced to be killed in reprisal, but was saved by the intervention of friends of Gentile, with whom the family had connections on the maternal side, and of the German consul Gerhard Wolf._NEWLINE_At 12, Calasso met and was greatly influenced by Enzo Turolla, a professor at Padua University, and they became lifelong friends. In 1954 the family moved to Rome, where Calasso developed a passion for cinema. His doctoral dissertation, on Sir Thomas Browne's theory of hieroglyphs, was completed under Mario Praz while he indulged himself with hashish._NEWLINE_Calasso has worked for the publishing firm of Adelphi Edizioni since its founding by Roberto Bazlen in 1962 and became its Chairman in 1999. His books have been translated into most European languages._NEWLINE_He is the author of an unnamed ongoing work reflecting on the culture of modernity, which began with The Ruin of Kasch in 1983, a book admired by Italo Calvino. Dedicated to the French statesman Talleyrand, it was followed in 1988 by The Marriage of Cadmus and Harmony, in which the tale of Cadmus and his wife Harmonia becomes a pretext for re-telling the great tales of Greek mythology and reflecting on the reception of Greek culture for a contemporary readership.
Another world civilization is surveyed in Ka (1996), where the subject of the re-telling is Hindu mythology. K restricts the focus to a single author, Franz Kafka; this trend continues with Il rosa Tiepolo, inspired by an adjective used by Proust to describe a shade of pink used by Tiepolo in his paintings. With La folie Baudelaire, Calasso once more broadens his scope to depict a whole civilisation, that of Paris in the latter half of the 19th century, reconsidering the lives and works of the post-romantic generation of writers and artists from Baudelaire to Valéry. In his most recent work, Ardore (2010), the author returns to India for an exhaustive analysis of the theory and practice of Vedic sacrifice and its significance for post-modern epistemology._NEWLINE_His more narrowly focused essays relating to European modernity are collected in I quarantanove gradini (The Forty-nine Steps), addressed to Pierre Klossowski and his wife; Literature and the Gods (2002), based on his Weidenfeld Lectures at Oxford, on the decline and return of pagan imagery in the art of the West; and La follia che viene dalle ninfe (The Madness that Comes from the Nymphs), a collection of related essays ranging from Plato's Phaedrus to Nabokov's Lolita._NEWLINE_Along with his status as a major analyst specifically of the works of Kafka, Calasso has, more broadly, been active in many essays in retrieving and re-invigorating the notion of a Central European literary culture. He also serves as the president of the International Alexander Lernet-Holenia Society, which promotes the publication, translation and study of this multi-genre Austrian writer and his focus on the identity crisis of his characters at odds with postimperial Austria and Central Europe. _START_SECTION_ Reception _START_PARAGRAPH_ Terri Windling selected the English translation of The Marriage of Cadmus and Harmony as one of the best fantasy books of 1994, describing it as "a complex and intellectually dazzling novel using ancient Greek mythology to explore the origins of Western thought." + _START_ARTICLE_ Henry J. Rosner _START_SECTION_ Personal life _START_PARAGRAPH_ Rosner and his twin sister, Sally Miller, were the oldest of seven children. He married Sophie Kimels in December 1929. The couple, who had met at a Young People's Socialist League picnic, honeymooned in Russia, which they found to be a totalitarian dictatorship rather than the socialist utopia they had hoped to see. They later wrote a report to Norman Thomas about their experience of Russia. (see Barbara Seaman)_NEWLINE_Rosner and Kimels had three children: Barbara Seaman, Jeri Drucker, and Elaine Rosner-Jeria. After Kimels' death, Rosner married journalist Ruth Gruber in 1974. _START_SECTION_ Career _START_PARAGRAPH_ As Norman Thomas's policy researcher, Rosner helped write the socialist platform for the 1932 presidential race. Rosner contributed "The Myth of a Progressive Governor," a statistic-filled six-page tract blasting Franklin D. Roosevelt's failure to honor his promise to "remember the forgotten man at the bottom." On Roosevelt's position (or lack thereof) regarding the seven-day work week, Rosner wrote:_NEWLINE_While distinguished economists urge the five-day week as a solution for the unemployment problem, Roosevelt has done nothing to abolish the seven-day work week among New York transit employees, hotel and cafeteria workers, and elevator operators in apartment houses. The records of the N.Y.C.
Transit Commission reveal that there are 25,000 subway guards, platform men, street car conductors, motormen and bus drivers in New York City alone who work ten hours a day or more seven days a week. There are 25,000 hotel workers in New York City who never get a day off. Thousands of cafeteria workers and elevator operators are in the same predicament. The same conditions exist on the state payroll. Guards and attendants in state hospitals and state prisons work ten and twelve hours a day seven days a week. Watchmen, lock tenders, and bridge workers in the state department of public works are also denied one day of rest in seven._NEWLINE__NEWLINE_It would be a simple matter to amend that section of the N.Y. labor law so as to give all workers in New York this protection. At the request of the City Affairs Committee such an amendment was introduced at the 1932 session of the Legislature. The bill was never reported out of committee or given a public hearing. Communications were sent to the Governor, acquainting him with the facts and requesting his support, but he did not make any effort to compel action from the legislature. In his gubernatorial messages to the legislature, he never mentioned the abolition of the seven-day week." _NEWLINE_The following year, Rosner was co-editor of the 1933 handbook of the New York Socialist Party._NEWLINE_After trouncing the Socialists and the Republicans in the 1932 election, Roosevelt met with Norman Thomas, Henry Rosner, and other members of the Socialist Party. As president, Roosevelt took on many of the social issues Rosner had criticized him for ignoring during his years as governor of New York._NEWLINE_In this respect, Rosner played an important, though low-key, role as an early proponent of New Deal programs._NEWLINE_As fiscal officer for welfare for New York City, Rosner served under all New York City mayors from Fiorello LaGuardia through Abraham Beame. He retired as Assistant Administrator of the New York City Human Resources Administration in 1975._NEWLINE_Rosner contributed controversial and influential articles to The Nation magazine and other political periodicals. + _START_ARTICLE_ Plazenica _START_PARAGRAPH_ Plazenica is a mountain in the municipality of Kupres, Bosnia and Herzegovina. It has an altitude of 1,765 metres (5,791 ft). + _START_ARTICLE_ Error guessing _START_PARAGRAPH_ In software testing, error guessing is a test method in which test cases used to find bugs in programs are established based on experience in prior testing. The scope of the test cases usually relies on the software tester involved, who uses past experience and intuition to determine what situations commonly cause software failure, or may cause errors to appear. Typical errors include divide by zero, null pointers, or invalid parameters._NEWLINE_Error guessing has no explicit rules for testing; test cases can be designed depending on the situation, either drawing from functional documents or when an unexpected/undocumented error is found while testing operations. + _START_ARTICLE_ David Jay _START_SECTION_ Activism _START_PARAGRAPH_ Frustrated with the lack of resources available regarding asexuality, Jay launched AVEN's website in 2001.
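As an aside to the error guessing entry above, the following is a minimal, hypothetical sketch in Python using pytest of what experience-based test cases might look like. The function `parse_ratio` and the specific guessed inputs are assumptions invented for this example; the cases mirror the typical errors listed in that entry (divide by zero, null values, invalid parameters).

```python
# Minimal error-guessing sketch (hypothetical example, not from any real project).
import pytest


def parse_ratio(numerator, denominator):
    """Hypothetical function under test: returns numerator / denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


# Test cases guessed from experience with inputs that commonly break code.
@pytest.mark.parametrize("num, den", [
    (1, 0),       # divide by zero
    (None, 2),    # null / missing value
    ("1", 2),     # invalid parameter type
])
def test_error_guessing_cases(num, den):
    # Error guessing expects these inputs to fail in a controlled way, so the
    # test asserts that an exception is raised rather than allowing a crash
    # or a silently wrong result.
    with pytest.raises((ValueError, TypeError)):
        parse_ratio(num, den)
```

Unlike systematic techniques such as equivalence partitioning, the value of such cases depends entirely on the tester's experience of where defects have appeared before.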
Since then, he has taken a leading role in the asexuality movement, appearing on multiple television shows and being featured heavily in Arts Engine's 2011 documentary (A)sexual._NEWLINE_AVEN, which Salon.com referred to as the "unofficial online headquarters" of the asexuality movement, is widely recognised as the largest online asexual community. Its two main goals are to create public acceptance and discussion about asexuality and to facilitate the growth of a large online asexual community. As of June 17, 2013, AVEN had nearly 70,000 registered members._NEWLINE_In New York City, working both with the Department of Education and private organizations, he has been providing training on ace inclusion to health educators._NEWLINE_He has a vision for a post-sex world, one that asks all of us to work on building a more empathetic, intimate society that celebrates any kind of close human relationship, whether or not it involves sex. _START_SECTION_ Personal life _START_PARAGRAPH_ Jay is from St. Louis, Missouri, and he graduated from Crossroads College Preparatory School in 2000. At the age of 15, Jay began considering himself asexual, and he came out as asexual while a student at Wesleyan University in Connecticut. + _START_ARTICLE_ Lynn Norment _START_SECTION_ Personal _START_PARAGRAPH_ Norment was born the third of nine children. Norment's mother Ester worked as a licensed practical nurse. Her father Alex Norment owned a local repair shop, which was named Norment's Radio and TV. While in elementary school, Norment attended an all-black, segregated school known as Bolivar Industrial Elementary. She then went to vocational school, where she became a member of the school newspaper and the Beta Club. In 1969, Tennessee offered African Americans in Bolivar the opportunity to transfer to the mostly white Bolivar High School. Norment was among the few African Americans who helped integrate the school; she graduated in 1970._NEWLINE_Lynn Norment is an alumna of Memphis State University, where she received a Bachelor of Arts degree in journalism. In college, Norment was an intern for The Commercial Appeal, a newspaper in Memphis, Tennessee._NEWLINE_She resides in the South Loop area of Chicago, Illinois. _START_SECTION_ Career _START_PARAGRAPH_ Later, Norment traveled north to Chicago and worked as a freelance writer for Ebony Magazine. Norment has worked with a number of celebrities, athletes and public figures including Denzel Washington, Barack Obama, Whitney Houston, Steve Harvey, Will Smith, and Michael Jordan. She became the managing editor of Ebony._NEWLINE_Norment has also held different leadership roles for the National Association of Black Journalists, including being chairperson for the Convention in Chicago held in 1977._NEWLINE_She is a board member of Habilitative Inc. She operates programs for residents in need on the West Side of Chicago. Norment has taught college courses at Columbia College Chicago, and mentors young journalists. Norment has launched a company that offers media relations and editorial services to individuals as well as agencies and corporations. _START_SECTION_ Notable works _START_PARAGRAPH_ Lynn Norment is most recognized for her 30 years of writing for Ebony Magazine. Norment has written a wide range of stories on subjects such as religion, business, relationships, social issues and lifestyle.
_START_SECTION_ Context _START_PARAGRAPH_ While growing up in Bolivar, Tennessee, Lynn Norment went to a segregated school; the town had one school built specifically for African Americans and another built for White Americans. Segregation formally began with the passing of Jim Crow laws following the end of the Reconstruction Era in 1877. Those laws prevented blacks, and later Mexican Americans and Native Americans, from going to the same schools as white individuals, and affected other public spaces such as churches, bathrooms, movie theaters, etc. However, in 1969 racial integration in Tennessee schools allowed the African American community to transfer to the mostly white schools. Norment was among the students who helped desegregate the high school._NEWLINE_Later, she moved north to Chicago and began working for Ebony Magazine. The magazine was founded in 1947 by John H. Johnson in Chicago. It is a monthly magazine for the African American community. The magazine has always brought up African American issues and interests while remaining positive despite the negative events of the time. For years, ads were created specifically for Ebony, featuring black models and advertising black-owned products. + _START_ARTICLE_ Come Darkness, Come Light: Twelve Songs of Christmas _START_SECTION_ Background _START_PARAGRAPH_ Come Darkness Come Light: Twelve Songs of Christmas contained twelve holiday-themed songs, six of which were written or co-written by Carpenter. The six additional tracks consisted of rare traditional holiday songs. The album features collaborations with John Jennings, the producer of many of Carpenter's previous albums. Come Darkness, Come Light includes covers of songs by Robin and Linda Williams, Tommy Thompson, and composer John Rutter. The opening track "Once in Royal David's City" was originally performed during the Festival of Nine Lessons and Carols in Cambridge, England, which Carpenter states she listens to every Christmas. Mark Deming of Allmusic thought that the album focused more on the "thoughtful and spiritual side of the season", while Scott Sexton of About.com said that the album's arrangement evoked "a calming vibe that is perfect for any holiday event"._NEWLINE_In an interview with Country Music Television in late 2008, Carpenter explained that she and producer John Jennings tried to create a more solemn approach to the record, without the use of symphonies or orchestras. In the interview, Carpenter commented that she wanted to keep the focus of the album "spare" and make its sound more acoustic._NEWLINE__NEWLINE_"That was the whole point from the beginning -- to make a real acoustic record. Whatever instruments we thought might add a texture or color, John was able to provide himself. We brought in Jon Carroll, my longtime keyboard player. He is so gifted, and he really did the heavy lifting on the piano, but John was able to fill in where it was needed. At most, there were three people in the room, but mostly it was me and John. It's really fun to do that. You feel like you're wacky scientists, late at night in the lab, experimenting to your heart's content." + _START_ARTICLE_ Salut d'Amour _START_SECTION_ History _START_PARAGRAPH_ Elgar finished the piece in July 1888, when he was romantically involved with Caroline Alice Roberts, and he called it "Liebesgruss" ('Love's Greeting') because of Miss Roberts' fluency in German.
On their engagement she had already presented him with a poem, "The Wind at Dawn", which he set to music and, when he returned home to London on 22 September from a holiday at the house of his friend Dr. Charles Buck in Settle, he gave her Salut d'Amour as an engagement present. _NEWLINE_The dedication was in French: "à Carice". "Carice" was a combination of his wife's names Caroline Alice, and was the name to be given to their daughter born two years later._NEWLINE_It was not published by Schott & Co., a German publisher with offices in Mainz, London, Paris and Brussels, until a year later, and the first editions were for violin and piano, piano solo, cello and piano, and for small orchestra. Few copies were sold until Schott changed the title to "Salut d'Amour" with Liebesgruss as a sub-title, and the composer's name as 'Ed. Elgar'. The French title, Elgar realised, would help the work to be sold not only in France but in other European countries. _NEWLINE_The first public performance was of the orchestral version, at a Crystal Palace concert on 11 November 1889, conducted by August Manns. The first recording of that version was made in 1915 for The Gramophone Company with an orchestra conducted by the composer. As a violin-and-piano piece Salut d'Amour had been recorded for The Gramophone & Typewriter Ltd (predecessor to The Gramophone Company) as early as 1901 by Jacques Jacobs, leader/director of the Trocadero Restaurant orchestra. Auguste van Biene recorded a cello transcription in 1907. _START_SECTION_ Legacy _START_PARAGRAPH_ "Salut d'amour" is one of Elgar's best-known works and has inspired numerous arrangements for widely varying instrumental combinations. It was even arranged as a song, "Woo thou, Sweet Music", with words by A. C. Bunten._NEWLINE_The tune, in E major, was included in the 2015 video game Fallout 4 as part of its "Classical Radio Station" songs. + _START_ARTICLE_ 1976 Colgate International _START_SECTION_ Doubles _START_PARAGRAPH_ Chris Evert / Martina Navratilova defeated Olga Morozova / Virginia Wade 6–4, 1–1 divided due to rain + _START_ARTICLE_ Victoria Cross for New Zealand _START_SECTION_ Victoria Cross _START_PARAGRAPH_ The original Victoria Cross was created by Queen Victoria in 1856 to recognise incidents of gallantry that were unconnected with a man's lengthy or meritorious service. She signed a Royal Warrant on 29 January 1856 that officially instituted the VC. The order was retroactive to 1854 to recognise acts of valour during the Crimean War._NEWLINE_The Australian and New Zealand Victoria Crosses are made from the same gunmetal as the originals. It was originally intended that the VCs would be cast from the bronze cascabels of two cannon that were captured from the Russians at the siege of Sevastopol. The historian John Glanfield has since shown that the metal used for VCs is in fact from Chinese cannon, not Russian, and their origin is a mystery._NEWLINE_The barrels of the cannon in question are stationed outside the Officers' Mess at the Royal Artillery Barracks at Woolwich. The remaining portion of the only surviving cascabel, weighing 10 kilograms (385 oz), is stored in a vault maintained by 15 Regiment Royal Logistic Corps at MoD Donnington. It can only be removed under armed guard. It is estimated that approximately 80 to 85 more VCs could be cast from this source. A single company of jewellers, Hancocks of London, has been responsible for the production of every VC.
_START_SECTION_ Appearance _START_PARAGRAPH_ The Victoria Cross for New Zealand is identical to the original design. The decoration is a cross pattée, 41 millimetres (1.6 in) high, 36 millimetres (1.4 in) wide, bearing a crown surmounted by a lion, and the inscription FOR VALOUR. This was originally to have been FOR BRAVERY, until it was changed on the recommendation of Queen Victoria, who thought some might erroneously consider that only the recipients of the VC were brave in battle. The decoration, suspension bar and link weigh about 27 grams (0.87 troy ounces)._NEWLINE_The cross is suspended by a ring from a seriffed "V" to a bar ornamented with laurel leaves, through which the ribbon passes. The reverse of the suspension bar is engraved with the recipient's name, rank, number and unit. On the reverse of the medal is a circular panel on which the date of the act for which it was awarded is engraved in the centre. The ribbon is crimson, 38 millimetres (1.5 inches) wide. Although the warrants state the colour as being red it is described by most commentators as being crimson or "wine-red". + _START_ARTICLE_ Indiana State Road 357 _START_SECTION_ Route description _START_PARAGRAPH_ State Road 357 starts at State Road 64, which is also Morton Street, on the south side of town. It runs north for about 200 feet, then veers to the north-northeast and runs parallel with the railroad track about 100 feet to the east. Upon reaching Mill Street, the road veers back to the north and proceeds to its northern terminus at State Road 57 at the north edge of town. It is concurrent with Main Street over its entire length. + _START_ARTICLE_ George Geldorp _START_PARAGRAPH_ George Geldorp, Georg Geldorp or Jorge Geldorp (1580/1595, Cologne – 4 November 1665, London) was a Flemish painter who was mainly active in England where he was known for his portraits and history paintings. He was also active as an art dealer and impresario. _START_SECTION_ Life _START_PARAGRAPH_ Geldorp was the son of the Flemish portrait painter Gortzius Geldorp who lived and worked in Cologne. Geldorp first trained and worked as a painter in Cologne before being admitted as a Master in the Guild of Saint Luke in Antwerp in 1610. Two years later his first wife Margriet Parmentiers died in Antwerp._NEWLINE_In 1623, Geldorp moved to London where he painted a number of portraits in the Anglo-Netherlandish style, notably of William Cecil, 2nd Earl of Salisbury and his wife Catherine in 1626 (Hatfield House, Hertfordshire) and of Sir Arthur Ingram in late 1638/early 1639._NEWLINE_He was involved in organizing commissions in England for Flemish and Dutch artists including Rubens, Anthony van Dyck and Peter Lely. Upon the Restoration, he assisted with the reconstitution of the art collection and possessions of the English Royal family and was rewarded for his services with the post of picture mender and cleaner to the King._NEWLINE_He was the teacher of Isaac Sailmaker. _START_SECTION_ Work _START_PARAGRAPH_ George Geldorp was a portrait specialist. His portraits are regarded as less accomplished and more stiffly articulated than those of contemporary painters active in London such as Daniel Mijtens. The surfaces of his paintings are decorative. 
The background of the Portrait of William Cecil, 2nd Earl of Salisbury contains an historically important view of Hatfield House with sportsmen in the foreground._NEWLINE_Geldorp was also active as a collaborator and copyist of Anthony van Dyck and later Peter Lely._NEWLINE_The Dutch biographer Arnold Houbraken reported that Geldorp was known to the artist biographer Joachim von Sandrart. Von Sandrart had written that Geldorp was not a very accomplished draughtsman and had the habit of tracing others' sketches, pricking holes in these sketches, and sponging through them onto the canvas as a guide for painting his subjects. Houbraken disapproved of this practice and wrote that he preferred to write about painters who were good draughtsmen. + _START_ARTICLE_ A381 road _START_SECTION_ Route _START_PARAGRAPH_ The A381 starts in Teignmouth from a junction with the A379 at Shaldon Bridge, following the Teign Estuary to Kingsteignton, where it overlaps the A380 to cross the River Teign. At the Penn Inn Roundabout it then continues west on a short dual carriageway into central Newton Abbot and southwest to Totnes. Here it overlaps the A385 to cross the River Dart and the main London-Penzance railway line._NEWLINE_From a junction on the west of Totnes it rises southwards into the South Hams. This section of the road is an important link to the national road network for the town of Dartmouth (served by the A3122) as the alternative A379 via Torbay is reliant upon the Dartmouth Higher Ferry with its associated fares and peak-time queues. As the road approaches Kingsbridge it enters the South Devon Area of Outstanding Natural Beauty and skirts around the edge of the town, overlapping for a short distance with the A379 road before finally turning south to Salcombe. An identically-numbered spur from this road turns back eastward to Kingsbridge. _START_SECTION_ History _START_PARAGRAPH_ The constant pressure of traffic through the narrow streets of Totnes town centre prompted the construction of the Western Bypass around the edge of the town, together with a second crossing of the River Dart at Brutus Bridge in 1982. The tight-knit nature of the town centre's development, quickly thinning to countryside, meant that relatively few buildings needed demolition to facilitate construction of the new road. _NEWLINE_The route through and around Kingsbridge was redrawn twice, in 1991 and 2006: first the section northwest of Kingsbridge was downgraded to B-road status, leaving a gap in the route, and the route was subsequently diverted to the former course of the B3197 around the west side of the town, leaving the original section through West Alvington as a spur of the new road. _NEWLINE_As a rural main road, the A381 has been the scene of multiple accidents. During 2008–2010 there were three fatal accidents on the section from Totnes to Halwell, prompting Devon County Council to implement a Casualty Severity Reduction Scheme, improving road markings and signage. _NEWLINE_On Sunday 20 May 2012 a 0.7-mile (1.1 km) section of the road through Totnes was part of the Olympic Torch procession for the London 2012 Olympics.
+ _START_ARTICLE_ United States House Select Committee on Alleged Abstraction of Books from the Library of the House _START_PARAGRAPH_ The Committee on Alleged Abstraction of Books from the Library of the House was a select committee of the United States House of Representatives that existed from February 14–28, 1861._NEWLINE_The committee was charged with investigating rumors that several members of the House of Representatives from states that had seceded from the United States had taken books from the House Library for personal use, or, alternatively, to help start a congressional library for the Confederate States of America. The allegations were first made public by The New York Times in an article published February 13, 1861, that accused Members of Congress of taking "some of the most valuable volumes in the collection"._NEWLINE_The select committee's investigation determined those rumors to be in error, finding that the supposedly missing books had simply not been properly credited back to the representatives' accounts after being returned. + _START_ARTICLE_ Shazam (application) _START_SECTION_ Overview _START_PARAGRAPH_ Shazam identifies songs using an audio fingerprint derived from a time-frequency graph called a spectrogram. It uses a smartphone or computer's built-in microphone to gather a brief sample of audio being played. Shazam stores a catalogue of audio fingerprints in a database. The user tags a song for 10 seconds and the application creates an audio fingerprint. Shazam works by analyzing the captured sound and seeking a match based on an acoustic fingerprint in a database of millions of songs. If it finds a match, it sends information such as the artist, song title, and album back to the user. Some implementations of Shazam incorporate relevant links to services such as iTunes, Spotify, YouTube, or Groove Music._NEWLINE_Shazam can identify music being played from any source, provided that the background noise level is not high enough to prevent an acoustic fingerprint being taken, and that the song is present in the software's database._NEWLINE_As well as the free app, the company has released a paid app called Shazam Encore. In September 2012, the service was expanded to enable TV users in the US to identify featured music, access cast information, and get links to show information online, as well as adding social networking capabilities._NEWLINE_Shazam redesigned its app in 2014 and added new features. _START_SECTION_ Compatible devices _START_PARAGRAPH_ Shazam runs on Android, iOS (including Apple Watch), BlackBerry OS, and Windows Phone systems. Shazam is also available for Mac as a desktop application that, when enabled, runs in the background and automatically recognises any song played on or near the computer. Apple's launch of iOS 8 in September 2014 came with the integration of Shazam into Apple's Siri function. _START_SECTION_ History _START_PARAGRAPH_ The company was founded in 1999 by Barton and Inghelbrecht, who were students at the University of California, Berkeley, and Mukherjee, who worked at a London-based internet consulting firm called Viant. In need of a digital signal processing specialist, the founding team then hired Wang, who had received his PhD from Stanford University. It first made a profit in 2016, 17 years after launch.
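As a rough, non-authoritative sketch of the spectrogram-based fingerprinting idea described in the Shazam overview above: the clip is converted into a spectrogram, the strongest frequency peaks are kept, and pairs of nearby peaks are hashed so a short sample can be matched against a database by hash lookup. This is not Shazam's actual code; the function name, parameter values, and hashing scheme below are assumptions chosen for readability.

```python
# Illustrative spectrogram-peak fingerprinting sketch (assumed parameters).
import numpy as np
from scipy import signal


def fingerprint(samples: np.ndarray, sample_rate: int = 44100):
    """Return a set of (hash, time_offset) pairs for a mono audio clip."""
    # 1. Time-frequency graph (spectrogram) of the captured audio.
    freqs, times, spec = signal.spectrogram(samples, fs=sample_rate, nperseg=4096)

    # 2. Keep only the strongest frequency bin in each time slice; these
    #    peaks are relatively robust to background noise.
    peak_bins = spec.argmax(axis=0)

    # 3. Hash pairs of nearby peaks together with their time delta so a
    #    10-second sample can be matched against a large database.
    fan_out = 5  # how many subsequent peaks each peak is paired with (assumed)
    prints = set()
    for i in range(len(peak_bins)):
        for j in range(1, fan_out + 1):
            if i + j < len(peak_bins):
                f1, f2 = int(peak_bins[i]), int(peak_bins[i + j])
                dt_ms = int(round((times[i + j] - times[i]) * 1000))
                prints.add((hash((f1, f2, dt_ms)), float(times[i])))
    return prints
```

Matching would then look up each hash in a database of fingerprints computed from known tracks and pick the song whose matching hashes agree on a consistent time offset.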
_START_SECTION_ 2002–2006: Early days of the service _START_PARAGRAPH_ Initially, in 2002, the service was launched only in the UK and was known as "2580", as the number was the shortcode that customers dialled from their mobile phone to get music recognised. The phone would automatically hang up after 30 seconds. A result was then sent to the user in the form of a text message containing the song title and artist name. At a later date, the service also began to add hyperlinks in the text message to allow the user to download the song online._NEWLINE_Shazam launched in the US on the AT&T Wireless network in 2004 in a joint offering with Musicphone, a now defunct San Francisco-based company. The service was free at launch with AT&T saying that it would charge $0.99 for each use in future._NEWLINE_In 2006, users were charged £0.60 per call or had unlimited use for £4.50 a month, as well as an online service to keep track of all tags. _START_SECTION_ 2006–2017: Smartphone app and expansion _START_PARAGRAPH_ Shazam for iPhone debuted on 10 July 2008, with the launch of Apple's App Store. The free app enabled users to launch iTunes and buy the song directly, although the service struggled to identify classical music._NEWLINE_Shazam launched on the Android platform later that year, and on the Windows Mobile Marketplace a year later. Encore first appeared for iPhone in November 2009._NEWLINE_By December 2009, Shazam was downloaded 10 million times in 150 countries across 350 mobile operators. Around eight percent of users purchased a track after it was identified by the service. Its success led to a funding round from Kleiner Perkins Caufield & Byers in October 2009. In January 2011, Apple announced that Shazam was the fourth most downloaded free app of all time on the App Store, while rival SoundHound had the top paid iPad app._NEWLINE_In August 2012, Shazam announced the service had been used to tag five billion songs, TV shows and advertisements. In addition, Shazam claimed to have over 225 million users across 200 countries. A month later, the service claimed to have more than 250 million users with 2 million active users per week. The Shazam app currently has more than 100 million monthly active users and has been used on more than 500 million mobile devices. In October 2014, Shazam announced its technology has been used to identify 15 billion songs._NEWLINE_The Shazam app was listed among Techland's 50 Best Android Applications for 2013._NEWLINE_In August 2014, Shazam announced there would be no more updates for Shazam(RED) after 7 August. _NEWLINE_Apple's launch of iOS 8 in September 2014 came with the seamless integration of Shazam into Apple's intelligent personal assistant Siri function._NEWLINE_In February 2013, Shazam announced a partnership with the music store Beatport, adding its library of electronic music to the service. On 3 April 2013, Shazam announced an exclusive partnership with Saavn, an Indian online music streaming service. The deal added nearly 1 million songs in Indian languages to Shazam's database. In July 2014, Shazam announced a partnership with Rdio that allows Shazam users to stream full songs within the app._NEWLINE__NEWLINE_Rich Riley joined Shazam as CEO in April 2013 to increase the company's growth, after over 13 years at Yahoo! and with more than 17 years of experience as an entrepreneur and internet executive. 
"I look forward to extending our dominance in media engagement, from our roots in music to our leadership position in second-screen TV and want to ensure that Shazam is the company that helps people recognize and engage with the world around them", Riley said at the time. Riley replaced Andrew Fisher, who was hired from Infospace into the CEO role in 2005 to strengthen industry partnerships and grow the userbase. Fisher is now executive chairman._NEWLINE_In addition to music, Shazam has announced collaborations with partners across television, advertising and cinema. In May 2014, National CineMedia announced a partnership with Shazam to incorporate Shazam into FirstLook pre-show segments that run in Regal, AMC and Cinemark theatres. In November 2014, NCM and Shazam announced that NCM FirstLook pre-shows are now Shazam enabled on over 20,000 movie screens across the United States._NEWLINE_In August 2014, Shazam announced the launch of Resonate, a sales product that allows TV networks to access its technology and user base. The news included the announcement of partnerships with AMC, A&E, Dick Clark Productions and Fuse._NEWLINE_Shazam recently announced a partnership with Sun Broadcast Group on Shazam for Radio, a new offering that will allow radio stations to push customised content to listeners on Sun Broadcast's over 8,000 radio stations in the US._NEWLINE_In December 2016, Shazam announced a partnership with Snapchat. The new feature comes as part of the latest Snapchat update and integration with Shazam, which allows Snapchat users to use Shazam's music recognition technology by pressing and holding the camera screen. _START_SECTION_ 2017–present: subsidiary of Apple _START_PARAGRAPH_ In December 2017, Apple announced it would be acquiring Shazam for a reported $400 million (£300 million). On 23 April 2018, the European Commission stated that it would be reviewing the acquisition. The European Commission approved the acquisition on 6 September 2018 and the deal was completed on 24 September 2018. _START_SECTION_ Funding _START_PARAGRAPH_ As of September 2012, Shazam had raised US$32 million in funding. In July 2013, Carlos Slim invested US$40 million in Shazam for an undisclosed share. In March 2014, Shazam confirmed another US$20 million in new funding, raising the total value of the company to US$500 million. The company's earlier backers include European venture capital firm DN Capital, which invested in Shazam in 2004. _START_SECTION_ Patent infringement lawsuit _START_PARAGRAPH_ In May 2009, Tune Hunter accused Shazam of violating U.S. Patent 6,941,275, which covers music identification and purchase in a portable device. Shazam settled the case in January 2010. + _START_ARTICLE_ International scale of river difficulty _START_SECTION_ Caution in application _START_PARAGRAPH_ Classifications can vary enormously, depending on the skill level and experience of the paddlers who rated the river. For example, at the 1999 International Conference on Outdoor Recreation and Education, an author of a paddling guide pointed out that there is too much variation in what is covered by the Class I designation, and proposed making further distinctions within the Class I flat water designations and Class I+ moving water designations, with the goal of providing better information for canoeists, instructors leading trips, and families with young children._NEWLINE_The grade of a river or rapid is likely to change along with the level of the water. 
High water usually makes rapids more difficult and dangerous, although some rapids may be easier at high flows because features are covered or washed out. At spate/flood stage, even rapids which are usually easy can contain lethal and unpredictable hazards. Conversely, some rapids may be easier with lower water levels when dangerous hydraulics become easier to manage. Some rivers with high volumes of fast moving water may require little maneuvering, but will pose serious risk of injury or death in the event of a capsize. + _START_ARTICLE_ Zambian kwacha _START_SECTION_ Etymology _START_PARAGRAPH_ The name kwacha derives from the Nyanja, Bemba, and Tonga word for "dawn", alluding to the Zambian nationalist slogan of a "new dawn of freedom". The name ngwee translates as "bright" in the Nyanja language. _START_SECTION_ History _START_PARAGRAPH_ Prior to independence, the Rhodesia and Nyasaland pound was the legal tender of the short-lived British protectorate of Northern Rhodesia. Banknotes of 10 shillings, 1, 5, and 10 pounds issued by the Central Africa Currency Board were in circulation, together with coins of ½, 1, 3, 6 pence, and 1, 2, 2½, and 5 shillings. After independence, the Bank of Zambia issued the first Zambian currency, the Zambian pound, in 1964. The issued paper bills and coins were of similar denominations to those used before independence, except for the 10 pounds note, which was never issued by the Bank of Zambia. A new design depicting the newly independent country's history and struggle was adopted. The two currencies, the Rhodesia and Nyasaland pound and the Zambian pound, were allowed to circulate in parallel until December 15, 1965, when the Rhodesia and Nyasaland pound bills and coins were withdrawn from circulation, except for the 3 pence coin, which was allowed to circulate alongside its Zambian alternative for a brief period._NEWLINE_On July 1, 1966, the parliament approved the arrangements for a decimal currency system (Act 40 of 1966). The government voted in favor of decimalisation, changing the main currency unit to the kwacha, with one kwacha being equal to 100 ngwee. The exchange rate was set at one kwacha to ten Zambian shillings, or one half of a Zambian pound. Thus, by January 16, 1968, all Zambian pound bills and coins were removed from circulation and replaced by the new kwacha bills and ngwee coins. The Zambian pound bills of 10 shillings, 1, and 5 pounds were changed into 1, 2 and 10 kwacha respectively; a bill of 50 ngwee was issued to replace the old 5 shillings coin, alongside a new bill of 20 kwacha. Ngwee coins with denominations of 1, 2, 5, 10, and 20 ngwee replaced the existing 1, 3, 6 pence, 1, and 2 shillings coins respectively. The Zambian pound notes and coins ceased to be legal tender on January 31, 1974._NEWLINE_Initially, the kwacha was pegged to the pound at a fixed rate of 1.7094 kwacha per 1 pound. After the devaluation of the dollar on August 15, 1971, however, Zambia broke its currency's ties to the British monetary unit and pegged the kwacha to the US dollar. These reforms resulted in a reduction of the kwacha's gold parity by 7.8%.
A few months later, the British Chancellor of the Exchequer, Anthony Barber, announced the end of the sterling area and the flotation of the pound sterling, causing Zambia to renounce the monetary privileges it had once enjoyed as a member state._NEWLINE_Over the years, the Zambian currency suffered high rates of inflation, forcing the Bank of Zambia to introduce high-value denominations in 2003, including 20,000 and 50,000 kwacha bills, to facilitate transactions. In 2013, a new, redenominated kwacha was introduced. _START_SECTION_ Banknotes _START_PARAGRAPH_ The Zambian kwacha was first issued in 1968 to replace the Zambian pound. The design of the kwacha bills changed over time, and different bills were introduced into or withdrawn from circulation. Seven emissions of the first kwacha are known to exist, while only one emission of the second kwacha has been issued; it was introduced into circulation on January 1, 2013, and has remained unchanged in design and security features since then. Each emission shares similar general design features across all the banknotes, with slight changes in the colors and the activity-based theme on the reverse of the banknotes. _START_SECTION_ New Kwacha (2012 series) _START_PARAGRAPH_ On January 23, 2012, the Bank of Zambia proposed certain measures regarding the redenomination of the Zambian kwacha. The recommendations were initially approved by the government as one of the measures required to address the costs associated with the continuous devaluation of the national currency, the direct result of several years of high inflation that characterized the national economy during the late 20th and early 21st centuries. The recommendations were assented to by the parliament on November 3, 2012, and the Re-Denomination of Currency Act (Act 8 of 2012) was enacted on December 3, 2012._NEWLINE_The old currency unit was divided by 1,000, removing three zeros from the preexisting K50,000, K20,000, K10,000, K5,000, and K1,000 banknotes. The lower denominations of K500, K100, and K50 were also divided by 1,000 and were changed into the 50, 10, and 5 ngwee coins respectively. On the other hand, the preexisting K20 banknote was removed from circulation due to its extremely low purchasing power._NEWLINE_The Bank of Zambia announced January 1, 2013, as the changeover date. On the same day, the new redenominated currency became the legal tender of Zambia. The old and new currencies were allowed to circulate side-by-side for a transition period of six months, until 30 June 2013. During this period, the old currency was denoted by 'K', whilst the new one was denoted by 'KR'. After the six-month period, the 'KR' symbol was dropped, and the new currency was referred to by the 'K' symbol._NEWLINE_By June 26, 2013, the Bank of Zambia had withdrawn 3.7 trillion kwacha in old banknotes, amounting to about 95.3% of the banknotes in circulation. Although the old currency ceased to be legal tender four days later, the Bank of Zambia Deputy Governor announced that residents who were still holding the old currency, especially those living in rural areas, could still exchange it for the new one through commercial banks and other designated agents. + _START_ARTICLE_ Transportin 1 _START_SECTION_ Function _START_PARAGRAPH_ This protein is a karyopherin which interacts with nuclear localization sequences to target nuclear proteins to the nucleus.
The classical karyopherin receptor complex, such as the complex that uses Importin-β1 (encoded by the gene KPNB1), is a heterodimer of an alpha subunit which recognizes the nuclear localization signal and a beta subunit which docks the complex at nucleoporins. However, Transportin-1 can directly bind to its cargo proteins and may not need an importin alpha subunit to do so._NEWLINE_Transportin-1 is thought to use the same principal mechanism to carry out nuclear transport as other importins. It mediates docking to the nuclear pore complex through binding to nucleoporins and is subsequently translocated through the pore by an energy-requiring mechanism. Then, in the nucleus, Ran binds to Transportin-1, the cargo dissociates, and Transportin-1 is re-exported from the nucleus to the cytoplasm, where GTP hydrolysis releases Ran. Transportin-1 is then free to bind new cargo._NEWLINE_In addition, Transportin-1 is implicated in helping transport proteins into the primary cilium. Its function in this case is thought to be similar to carrying proteins into the nucleus through a nuclear pore: Transportin-1 binds cargo and then helps it pass through a pore at the base of the cilium. Ran and nucleoporins are also implicated in this mechanism._NEWLINE_Alternate splicing of this gene results in two transcript variants encoding different proteins. _START_SECTION_ Targets _START_PARAGRAPH_ Transportin 1 (TRN1) is part of the non-classical nuclear import pathway. In conjunction with the RanGTP hydrolysis cascade, TRN1 acts to import a selection of proteins into the nucleus of cells. These targets typically contain a PY-motif, otherwise known as an M9 nuclear localisation signal. Well-described examples include hnRNP A1._NEWLINE_The types of cargo proteins that Transportin 1 can carry into the nucleus include RNA-binding proteins (such as hnRNP A1 and hnRNP F) as well as ribosomal proteins. _START_SECTION_ Clinical Significance _START_PARAGRAPH_ TRN1 has been implicated in the pathogenesis of two neurodegenerative diseases, namely amyotrophic lateral sclerosis and frontotemporal dementia. + _START_ARTICLE_ Embedded feminism _START_PARAGRAPH_ Embedded feminism is the attempt of state authorities to legitimize an intervention in a conflict by co-opting feminist discourses and instrumentalizing feminist activists and groups for their own agenda. The term was introduced in the analysis of the US-led invasion of Afghanistan, but can also be applied to several historical examples where women's rights were used as justification and legitimization of Western interventionism. _START_SECTION_ Concept _START_PARAGRAPH_ Originally, the Canadian gender researcher Krista Hunt developed the conceptual framework of embedded feminism to describe the gendered nature of the US-led invasion of Afghanistan in 2001 and the US government's practice of using it to justify the War on Terror in the eyes of the public._NEWLINE_Hunt defines the concept as the "incorporation of feminist discourse and feminist activists into political projects that claim to serve the interests of women, but ultimately subordinate and/or subvert that goal"._NEWLINE_Hunt coined the term embedded feminism in reference to the "embedded journalism" or "embedded media" approach of the US Department of Defense, which became prominent in the media coverage of the 2003 invasion of Iraq. The US government attached journalists, photographers, and camera operators to military units and granted them unprecedented access to the battle frontline.
Although "embedded journalism" allowed the public to get an exclusive look at the situation in Iraq, this practice was regarded as problematic, as it could undermine the independent reporting and promote the preferences of the government._NEWLINE_The "far-reaching process of appropriating and subverting feminism through appeals to women's rights" that is embedded feminism is different from simple co-optation practices by state authorities in so far as it goes beyond the absorption "of the meanings of the original concepts to fit into the prevailing political priorities". _START_SECTION_ Historical examples _START_PARAGRAPH_ Krista Hunt argues that appeals to women's liberation have been embedded in political projects for centuries to mobilize feminists and their discourses._NEWLINE_A large body of feminist literature has analyzed the gender-related dimensions of (post-) colonial projects where feminists from the Global North were convinced to get involved in order to "save" other oppressed women. Such rescue narratives generally presuppose a homogeneity of women as an oppressed group, as showed in the work of Chandra Mohanty, and put into play the orientalized nature of the seemingly dangerous "brown man". Thus, feminism which was incorporated in the modernization and civilization projects of imperial countries is argued to have helped strengthening colonialism and patriarchy instead of promoting women's rights._NEWLINE_Feminists also claim that feminist activists and their discourses have been instrumentalized for nationalist projects. During the Nasser era in Egypt for example, feminists are said to have played a major role in helping create a sense of cohesion and bonding and therefore directly contributed to the emergence of a national identity during and after the struggle for independence. Nevertheless, women remained mostly absent from the public sphere of politics once the project succeed. _START_SECTION_ The War on Terror _START_PARAGRAPH_ The history of the war on terrorism throughout the IR realm consistently showcased a male-stream discipline and a hyper-masculine war hero narrative. In other words, the story is narrated by these men, who hold high positions of power and are fixated to exemplify their heroic qualities to shield women from harm and collide with the world’s difficulties. For example, according to the former US President, George W. Bush, the central goal of the terrorists is the brutal oppression of women…that is the reason this great nation, with our friends and allies, will not rest until we bring them all to justice . This rallying cry by the Bush administration is exactly the narrative that is at question. The time-honored tradition of the good guys defeating the bad guys and protecting racialized women serves to reinforce patriotism and justify violence both abroad and at home. However, how does one exemplify “the bad guys”? By using a gendered lens and looking at the war of terror through a gendered perspective, a simple rallying cry has far more complexities. For instance, there is a power dynamic at play here involving two opposing parties. There is the Western men and women who are deemed to be the saviors. Then there are the Afghan women who needs saving. What does this do? This creates a subtle social construct that the war on terror has created different kinds of men and women based on race, religion and nationality . Having said that, a gendered lens does ignore specific factors. 
It ignores the power dynamic of liberated white western women against their oppressed Afghan women. Basically, in a war, your race and nationality come heavily into play when it comes to who is deemed to be more liberated. It ignores the historical colonial justification for invasion by proclaiming racialized men are harmful to racialized women. Feminists analyzed Bush’s rallying cry and found similarities to the white men knowing what’s right and saving the racialized women because of perceptions about the racialized men. It ignores the reinforced resistance to women’s rights, whereas men see it as Western imposition. In a war situation, when a Western country tries to help an oppressed nation, it is seen as western imposition because it is as if “the west knows best”, without even being apart or living in an oppressed nation and gives the perception that anything the West does (even empowering women) is treated as imposition. It ignores the obscurity of the reality that white Western women are still being oppressed by the same powers that are trying to liberate Afghan women. Finally, it ignores the overarching point of all these factors which is the creation of a divide and conquer situation of women while also initiating solidarity of all women . In other words, what all these factors mean is the examination of race, class, nationality, religion and sexuality, we notice factors that were passive such as progressing onward with political agendas that are traditional, old testament and problematic while also trying to play the good guy by silencing other related key issues. In conclusion, Gender has become a topic that is heavily scrutinized but also heavily appreciated and even in the most traditional of scenarios, a gender lens/perspective is very much needed to tackle the real issue of IR._NEWLINE_In 2001, the Bush administration began expressing their concerns for the situation of women under the Taliban regime. According to Hunt, it invoked the struggle for women's rights and women's liberation as a rational to justify the invasion of Afghanistan. This increased gender awareness can be interpreted as part of a framing strategy which conflated the War on Terror with the fight for women's rights as a proxy for universal human rights. In the eyes of many feminists, the rescue of oppressed women from the Taliban became the powerful normative legitimation of the invasion which obtained broad public approval. More importantly, this strategy could align itself with feminist groups, that are traditionally pacifist, and could win their approval, thereby removing a critical opposition. The doubts in the government's commitment to further women's rights through war arose due to its lack of interest before 9/11. It was only after the terror attacks, that politicians in the US and in Europe began broadly supporting women's liberation from the Taliban._NEWLINE_Despite its usual non-violent stance, the Feminist Majority Foundation (FMF) supporter the policies of the Bush administration and is therefore regarded as one of the most vocal feminist supporters embedded in the War on Terror. 
Although the FMF saw the government's increased gender awareness as a success of their 'Stop Gender Apartheid' campaign, their involvement in Bush's political project was strongly criticized by other NGOs and the critical public because their role was seen as lending legitimacy to the war._NEWLINE_Hunt sees embedded feminism as a concept that was used to advance the engendered war story of the Bush administration that the invasion of Afghanistan could liberate Afghan women. It further created a division between feminist groups that supported the war and those groups that refused to get involved in the usurpation of feminism for war. A division also emerged between "Western" feminists who strove to save the "Other" women from an orientalised enemy and Afghan feminists who criticized the notion that war could liberate them. _START_SECTION_ Hegemonic Western feminism and post-colonial critique _START_PARAGRAPH_ Hunt notes that there is a striking similarity between the logic of embedded feminism in colonialist projects and in the War on Terror. Both are inherently Eurocentric and present the West as culturally and normatively superior to "unmodern" Eastern societies. This rationale would give the West a prerogative to intervene and rescue the "monolithic group" of Other women who have no agency of their own. Spivak's famous post-colonial critique of the relationship between the colonizers and the colonized subjects in "Can the Subaltern Speak?" condenses this relationship into the strategy of "white men saving brown women from brown men". This analysis can also be applied to the seemingly neo-imperialist strategy that the US government was pursuing by framing Taliban men in Afghanistan as a danger to women, who were presented as victims in need of help from the West._NEWLINE_Characteristic of Western hegemonic feminism was the disregard of Western actors for the opinions of Afghan women's groups, who argued that a war would certainly have a negative impact on women and fuel fundamentalist sentiments. In the aftermath, Bush's agenda was in fact interpreted as an attack on Islamic values and resulted in a backlash from conservative forces. Hegemonic feminism also tends to reproduce binary gender roles, especially in the visual representation of women and children as victims of war or oppression in the media. Cynthia Enloe has called this conflation of women and children as victimized subjects "womenandchildren", a single trope that is invoked in patriarchal narratives to support state security interests. _START_SECTION_ Contextualisation _START_PARAGRAPH_ The unique nature of embedded feminism as a state strategy lies not just in argumentation based on the representation of women and children as victims but in the conjunction of this discourse with the struggle for women's rights. _NEWLINE_Hunt's concept has made an impact on gender-related conflict research and has been applied to the wars in Iraq, Kosovo and Afghanistan. Embedded feminism can also be used in other contexts, such as neo-liberal globalization, and can be applied to several other policy fields where pseudo-feminist arguments and feminist groups are misused to legitimize a state-led action or to construct an alternative story. + _START_ARTICLE_ Ridgeway Mine _START_SECTION_ Extraction _START_PARAGRAPH_ Ore was blasted out of the mine pits and hauled in 85-ton trucks to ore storage heaps at the surface of the mine.
Each ore truckload had only around 1.5 to 3 ounces of extractable gold in it, and for every ore truckload there was a truckload of waste rock. Recovering the ore's microscopic gold deposits was a multi-step process. First, the ore was milled to 200 mesh and soaked in a mixture of sodium cyanide, lime and water in one of ten giant tanks for twenty-seven days to dissolve the gold in the ore. Next, the solution was sent through a carbon in leach process where activated charcoal is used to absorb the gold from the solution. The gold on the carbon granules was dissolved into a second solution as it was sent through a stripping column under high pressure. This solution is used in an electrowinning process that deposits the gold onto steel wool cathodes. The gold-plated steel wool undergoes electrorefining and its gold is deposited onto stainless steel plates. Finally, the gold is scraped off of these stainless steel plates, fed into a gold crucible and melted into doré bars. _START_SECTION_ Geology _START_PARAGRAPH_ The Carolina Slate Belt (CSB) originated in the Proterozoic or Cambrian age with a volcanic arc terrane, a chain of volcanoes that form on the edge of one continental plate when it has another continental plate subducted under it. Later, as part of the Taconic orogeny these volcanoes were metamorphosed to greenschist. Finally, the Acadian orogeny and Alleghanian orogeny sheared the rocks and created intrusive granite from the intrusions of magma into cracks in the rock._NEWLINE_The mine is located in an east-west trend of the CSB which otherwise follows in a northeastern trend. It is also on the boundary between an older igneous terrane called the Sawney's Creek volcanics and a newer Bear Creek turbidite (sedimentary rock made from slides of liquefied sediments) terrane. The North Pit's primary gold ore is a chert made from either fine-grained sediment or a volcanic tuff and the South Pit's primary gold ore is finely laminated sediments from underwater flows of felsic (rich in feldspar and quartz) ash from a volcano. There is some debate on the processes that created the ore but there is consensus that exhalative sediments formed when hot water from a deep geothermal vent cooled and precipitated gold onto the ocean floor and these gold-bearing sediments were further concentrated when they were dissolved and precipitated again by an epithermal process (cooler geothermal vents closer to the surface). _START_SECTION_ History _START_PARAGRAPH_ Gold was first found in the Carolina Slate Belt in 1827 at the Haile mine in nearby Lancaster County. Placer mining, hydraulic mining and later use of the Newbery-Vautin chlorination process kept South Carolina mines in operation but by the year 1900 mining became unprofitable and had mostly stopped. In the 1960s interest in mining the Ridgeway area increased after John Chapman found gold while panning in nearby creeks. Amselco Minerals, a company later merged with Kennecott Minerals, began investigating Ridgeway after geologist Irving T. Kiff noticed similarities between Ridgeway's slate outcrops and outcrops at the Haile mine. Kennecott Minerals began buying land for the Ridgeway mine in 1980 and began mining in 1988. _START_SECTION_ Accidents _START_PARAGRAPH_ The Ridgeway mine was a relatively safe operation, only killing three employees in eleven years of operation._NEWLINE_On August 18, 1988 James F. Wise, while working as a spotter for a pan scraper, was crushed to death when it backed over him. 
Roosevelt Williams, the scraper driver, was also injured and was sent to the hospital for treatment. The Caterpillar model 623 scraper that Williams was driving that day has limited visibility for the driver and no right side rear view mirror, blocking the driver from seeing anything within 30 feet of the right side of the scraper. Williams was instructed in weekly safety meetings to keep spotters in his eyesight while driving the vehicle and to not start moving the vehicle until the spotter is located. On the day of the accident Williams did not locate the spotter before moving and only watched the left side of the vehicle while backing it out. Wise was crushed under the right tires of the scraper. Williams did not notice that his spotter was missing until he had backed up for around 100 feet and saw Wise lying around 98 feet in front of him. _NEWLINE_On April 21, 1993 Johnny Ray fell to his death after he was pushed off the top of the mine into the pit by a forklift. Steven Crapps, the forklift driver, was attempting to park the forklift by putting it in neutral and applying the parking brake. However, he also turned off the forklift's engine which provides power to the hydraulic brake system pumps. The braking system has a backup accumulator system that allows the brakes to function even when the engine is turned off, however later tests of the forklift showed that the accumulator was broken. When the brakes failed the forklift rolled down a 5%-6% grade slope towards the edge of the mine pit. Ray was between the forks of the forklift and was pushed by it for around 15 feet as it rolled towards the mine pit. A small berm at the edge of the pit prevented the forklift from rolling over the edge but Ray was pushed into the pit, suffered a drop of around 85 feet, was evacuated by helicopter to a local hospital and later died._NEWLINE_On February 2, 1997 Joseph Sumpter was fatally crushed by a runaway haulage truck. The truck, assigned to Sumpter for the day's work, had a bad battery and starter and had to be jump started by another vehicle. It also had a completely inoperable parking brake. His co-worker, Robert Stover, pulled his truck in front of Sumpter's truck to jump start it and Sumpter climbed across the engine decks of both trucks to connect jumper cables. Sumpter's truck was immobilized with a wooden chock but when Stover parked facing Sumpter's truck he pushed Sumpter's truck backward about two feet behind the chock. The jump start was successful and Stover backed his truck away, allowing Sumpter's truck to roll forward, bypass the chock and continue rolling down a 3% grade hill. Sumpter held on to the engine deck while the truck rolled for around 57 feet, but near the end of the ride Sumpter fell off and was run over by the front wheel of the truck. Sumpter was pronounced dead at the scene._NEWLINE_On the weekend of December 3, 1988, soon after the mine began chemically processing the mined ore, a flock of around 400 sea gulls landed in and around the tailings pond and 65 of the seagulls died. Ridgeway Mining had installed scare cannons to prevent birds from landing in the cyanide pond but in this case it was ineffective. The South Carolina Department of Natural Resources levied a $6,500 fine against Ridgeway Mining for killing the birds. _START_SECTION_ Closure _START_PARAGRAPH_ Mining stopped in the South Pit mine in September 1996 and the North Pit shut down in November 1999. Ore processing finished on December 3, 1999. 
The Confederate States Mint struck commemorative gold, silver and bronze coins with metal from the mine and sold them to the general public. The coins were marketed as wholly South Carolina gold and featured a local building used by the Confederate army during the American Civil War. The Ridgeway Mining Company also commissioned the Confederate States Mint to mint a coin out of the mine's gold and silver which it gave to the mine employees as gifts. _START_SECTION_ Reclamation _START_PARAGRAPH_ The mine tailings are a pyrite-bearing rock which, when exposed to water and oxygen, causes acid rock drainage and environmental damage. Kennecott Ridgeway designed an impoundment for the tailings that seals them off from rainwater runoff and the atmosphere. If fresh water was allowed to run through the tailings it would constantly dissolve and oxidize the rock's sulfides so the tailings are instead kept in stagnant water that has already had a chance to fully saturate with sulfides. The impoundment was created by saturating the tailings with water, mining nearby inert saprolites and clay, sending the saprolites and clay through the ore mills and pouring them over the tailing pile to create a cone over the mine tailings._NEWLINE_The Ridgeway mine recovery has been an ecological success with minimal seepage of acid water into the environment, a stark contrast to the nearby Barite Hill and Brewer gold mines that were both declared Superfund sites after being abandoned by their owners. _START_SECTION_ South Pit Lake _START_PARAGRAPH_ Both mine pits are expected to fill with water by 2020 and public access is planned for the reclaimed South Pit Lake. Mine management hopes to create a meromictic lake, a self-regulated lake with distinct layers of water that do not mix, to minimize oxidation of sulfides and prevent toxic metals from circulating in the environment. A comprehensive study of the South Pit Lake's limnology was conducted from 2000 to 2004, measuring its physical, chemical and biological properties. Wind is typically the most important contributing factor to mixing in a lake and the South Pit Lake was found to have a poor alignment with the prevailing direction of wind in the area. To counteract the mixing forces of wind the lake has other properties that promote water layer stability and stratification. Its relative depth, a ratio of the lake's surface area to its depth, is around 10%, typical of other meromictic lakes. There are several underwater features that dissipate wave energy and reduce the strength of wind-powered seiches in the lake. Finally, bacteria in the lake contributed to a pycnocline by precipitating carbon, gypsum and metals out of the upper layer, lowering the density of the upper layer, increasing the density of the lower layer and promoting lake stability. The lake successfully achieved meromixis in the winter of 2001 and maintained it until the end of the study in 2004. _START_SECTION_ Future mining _START_PARAGRAPH_ The Strongbow Exploration company announced that it was investigating a possible new strike of gold within three miles of Rio Tinto's Ridgeway mine in 2011. After releasing several press releases announcing positive test results from drilling samples and the signing of property purchase option agreements with landowners there has been no further communication on their Ridgeway exploration since December 2012. 
+ _START_ARTICLE_ Atlético Tucumán _START_SECTION_ History _START_PARAGRAPH_ The club was founded in 1902, which makes Atlético the oldest football club from the province of Tucumán._NEWLINE_Atlético has participated in nine seasons in the Primera División: eight seasons between 1973 and 1981, and a single season in 1984. The team's best ever performance in Primera División was in 1979, when they reached the semi-finals of the Torneo Nacional._NEWLINE_In 2008 Atlético Tucumán was promoted to the Argentine 2nd Division after defeating Racing de Córdoba in the final game of Torneo Argentino A, and one year later the squad achieved its 2nd consecutive promotion by winning the B Nacional tournament and reaching the Primera División. _START_SECTION_ Tucumán Derby _START_PARAGRAPH_ The Tucumán Derby is played between Atlético and its longtime rival San Martín, both from the same city. The Santo (as San Martín is nicknamed) currently plays in the Torneo Argentino A, the regionalized third division of the Argentine league system. _START_SECTION_ Ground _START_PARAGRAPH_ The stadium was constructed in 1922 by Spanish architect José Graña (1885–1950) with an original capacity of 5,000 spectators. It was inaugurated on May 21 of the same year. It was originally named the "Grand Stadium" because it was the largest in the north of Argentina, and Racing Club de Avellaneda was invited to play a friendly match against Atlético Tucumán as part of the inauguration celebration. The stadium was named Monumental "José Fierro" in honor of a well-remembered Atlético chairman._NEWLINE_It was the first roofed stadium in Tucumán Province and the first to have an upper stand. The structure was built out of concrete. The record attendance was in 2008, during a match between Atlético and Racing de Córdoba, when all the seats were filled._NEWLINE_The stadium is located in the northern part of the city of San Miguel de Tucumán (the neighborhood known as "Barrio Norte"). It can currently accommodate up to 32,500 people following an upgrade of the facilities that added an extra 2,500 seats. + _START_ARTICLE_ Ford Motor Company Philippines _START_SECTION_ History _START_PARAGRAPH_ Ford's history in the Philippines can be traced back to 1913 with the local assembly of the Ford Model T. In 1929, Henry Ford established Pilipinas Ford Car Works, Inc. (PFCW). In 1967, Ford Philippines, Inc. (FPI) was established as a subsidiary of the Ford Motor Company and began production operations on May 3, 1968, at Sucat, Parañaque. In 1976, FPI inaugurated a body stamping plant in Mariveles, Bataan. On March 20, 1984, FPI formally and unexpectedly announced it would cease its operations in the Philippines by August 1984, in accordance with a decision reached by the management of Ford Motor Company._NEWLINE_In 1997, Ford returned to the Philippines with the establishment of Ford Motor Company Philippines, Inc. (FMCPI), introducing US-made vehicles such as the Expedition, the F-150, the Clubwagon and the Lincoln Town Car. A new P4 billion state-of-the-art assembly plant in Santa Rosa, Laguna, opened in September 1999. The first car manufactured at the plant was the Ford Lynx, and the company began building the Mazda-based Ford Ranger in March 2000. FMCPI expanded its line-up with the introduction of the Escape SUV, Explorer SUV and Everest SUV.
Towards the end of the decade, the Fiesta and Mustang were also introduced in the local market._NEWLINE_In 2012, Ford announced the consolidation of manufacturing operations in Southeast Asia and the cessation of operations at the Santa Rosa plant, citing "lack of supply base and economies of scale." 250 workers were affected by the decision, which Ford Philippines tried to resolve by offering them work at other Ford manufacturing facilities overseas. Despite the closure, Ford Philippines continued to open more dealerships and expand its vehicle lineup through 2015. In March 2014, Mitsubishi Motors Philippines Corporation announced it had acquired the former Ford assembly plant. + _START_ARTICLE_ Imrana Alhaji Buba _START_SECTION_ Early Life and Education _START_PARAGRAPH_ Buba was born in Jakusko, Yobe State, on 6 August 1992 and grew up in Potiskum, Yobe State. He is an alumnus of the University of Maiduguri, Borno State, where he graduated with a first-class honours degree in Political Science in 2015, and he received a master's degree in Africa and International Development from the University of Edinburgh, United Kingdom, in 2018. _START_SECTION_ Career and Activism _START_PARAGRAPH_ Buba had a traumatic experience with Boko Haram in June 2010 when, as an undergraduate travelling to the University of Maiduguri, his bus was stopped by the terrorists and passengers were kidnapped; he survived, but friends and family of his were killed in the Boko Haram insurgency. As a result, he founded the Youth Coalition Against Terrorism (YOCAT) in August 2010 to offer counselling services to victims of terrorism, as well as to provide peace education and skills training for unemployed youths._NEWLINE_He has provided employment opportunities for over 2,000 young people in north-eastern Nigeria through partnerships with local government agencies and private organisations, and the organization has recruited over 600 volunteers and partnered with many local bodies to organize beneficial programs for young people in north-eastern Nigeria._NEWLINE_In 2016, he was selected as one of 3 Nigerians and 21 African changemakers in the Commonwealth to receive the Queen's Young Leaders Award from Her Majesty Queen Elizabeth II in recognition of his peacebuilding work in northern Nigeria, and he also became a fellow of the Generation Change Fellowship of the United States Institute of Peace (USIP) as a result of his work in combating terrorism._NEWLINE_He was selected for the 2017 JCI Ten Outstanding Young Persons of the World in recognition of his efforts to counter violent extremism and promote a culture of peace in Nigeria, and was part of the 2017 Mandela Washington Fellowship program for young African leaders in Washington, D.C. He is also a fellow of LEAP Africa SIP and YALI West Africa._NEWLINE_His work and public accolades have made him an expert and speaker, particularly regarding political instability in Nigeria. He was a speaker/panelist at the 2016/2017 International Day of Peace events at the United States Institute of Peace (USIP), a speaker at the 2017 Wage Peace event at the American University, a speaker/panelist at the 2017 United Nations International Youth Day event, and a speaker at the 2018 United Nations International Day for the Remembrance of Victims of Terrorism and the 2018 One Young World Summit._NEWLINE_Buba's vision is to promote a culture of peace and tolerance that can break the cycle of conflict, violence, and terror that plagues Nigeria.
_START_SECTION_ Links _START_PARAGRAPH_ Age is not a limit to making a difference –Imrana Alhaji Buba_NEWLINE_Interview with Nigerian recipient of 2016 Queen's Young Leader award + _START_ARTICLE_ Princeton High School (Texas) _START_SECTION_ Football _START_PARAGRAPH_ The Princeton football team has made playoff appearances 13 times: in 1948, 1972, 1974, 1975, 1976, 2010, 2011, 2012, 2013, 2014, 2015, 2016 and 2017. Princeton won district championships in 1948, 1972, 1974, 1975 and 1976, when only one team per district made the playoffs. In the new format, where 4 teams from each district make the playoffs, Princeton has added two more district championships, in 2013 and 2017. _START_SECTION_ Basketball _START_PARAGRAPH_ PHS men's and women's basketball teams have experienced success over the years. The 1994-95 men's team finished with a record of 30-5, finishing the season as district 9-3A Runner-up and Area Champions after defeating West High School in the playoffs. _START_SECTION_ Marching Band _START_PARAGRAPH_ The "Panther Pride Marching Band" has made appearances at the UIL State Marching Band Contest on many consecutive occasions, earning the bronze medal three times: in 2010, 2014, and 2016. The band has been well represented locally and at the state level since the early 1970s. _START_SECTION_ Extracurricular activities _START_PARAGRAPH_ Princeton High School offers an extensive range of extracurricular activities and programs, including Speech, Debate, Theatre, UIL Academics, Band, Choir, Winterguard, Art Club, Robotics, Fishing, Journalism, Student Council, Welding, Cheerleading, National Honor Society, Junior Reserve Officers' Training Corps (JROTC), the Fellowship of Christian Athletes, Future Farmers of America (FFA), and Spanish Club. The school also has a widely recognized Career and Technology Exploration (CATE) program that serves Princeton High School students and others from the area's surrounding schools. + _START_ARTICLE_ Bangladesh Institute of Science and Technology _START_PARAGRAPH_ Bangladesh Institute of Science and Technology (BIST) is a university-level institution affiliated with the National University, Bangladesh, and located in Dhaka, Bangladesh. The institute was established in 1999 at Kakrail in Dhaka. The institute is regulated by a governing body consisting of the Principal, the Head of the Faculty of Science and Engineering, and the Head of the Faculty of Business Studies, all operating within the rules and regulations of the National University of Bangladesh. _START_SECTION_ History _START_PARAGRAPH_ Bangladesh Institute of Science and Technology (BIST) was established in 1999. _START_SECTION_ Campus _START_PARAGRAPH_ BIST is located at 122, New Kakrail Road, Dhaka 1000, Bangladesh. + _START_ARTICLE_ Rosemary R. Haggett _START_PARAGRAPH_ Rosemary R. Haggett, Ph.D., is the Vice Chancellor for Academic Affairs and Student Success for the University of North Texas System. She was the second woman to serve as a dean of a college of agriculture in the U.S. _START_SECTION_ Education _START_PARAGRAPH_ Rosemary Haggett earned a bachelor's degree in biology from the University of Bridgeport and a Ph.D. in physiology from the University of Virginia. She conducted postdoctoral work in reproductive biology at Northwestern University. _START_SECTION_ Career _START_PARAGRAPH_ Rosemary Haggett began her career as a postdoctoral associate at Northwestern University, conducting research in reproductive biology. She worked at the U.S. Department of Agriculture for several years.
In 1994, she became a professor of Animal and Veterinary Sciences and the Dean of the College of Agriculture, Forestry, and Consumer Sciences at West Virginia University, only the second woman in the U.S. to serve in such a position. In 1999, she became the Associate Provost for Academic Programs at West Virginia University. _NEWLINE_In 2003, she left West Virginia University to serve at the National Science Foundation as the director of the Division of Undergraduate Education. She also served as the Acting Deputy Assistant Director of the Education and Human Resources Directorate, the Acting Director of the Division of Graduate Education, and the Senior Adviser of the Education and Human Resources Directorate at the National Science Foundation. _NEWLINE_In 2007, she became Provost and Executive Vice President for Academic Affairs at the University of Toledo. _NEWLINE_In 2010, she became Vice Chancellor for Academic Affairs and Student Success for the University of North Texas System._NEWLINE_Rosemary Haggett served as Chair of the Committee of Visitors for the National Science Foundation's Training Cluster in the Division of Biological Infrastructure in 2003. She served on the National Academies Committee to review the USDA Agricultural and Food Research Initiative in 2012. + _START_ARTICLE_ Ted Hooper (rugby league) _START_PARAGRAPH_ Edward James Hooper (1871–1925) was an Australian rugby league referee and administrator. Born in Kent, England, Hooper played rugby union in Sydney before becoming a referee in the sport. When the New South Wales Rugby League (NSWRL) formed in 1908, marking the start of professional rugby league in Australia, he joined the ranks of players and referees who switched to the new code, becoming the president of its referees' association. He officiated the match between Eastern Suburbs and Newtown in the first round of the NSWRL's inaugural season. He died in 1925, collapsing in the dressing rooms after refereeing an exhibition match in Brisbane. + _START_ARTICLE_ Jason Baird Jackson _START_PARAGRAPH_ Jason Baird Jackson, Ph.D. (born 1969) is the Director of the Mathers Museum of World Cultures and Professor of Folklore and Anthropology at Indiana University Bloomington. He is "an advocate of open access issues and works for scholarly communications and scholarly publishing projects." At IUB, he has served as Chair of the Department of Folklore and Ethnomusicology and as Director of the Folklore Institute. According to the Journal of American Folklore, "Jason Baird Jackson establishes himself as one of the foremost scholars in American Indian studies today." _START_SECTION_ Early life and education _START_PARAGRAPH_ Jason Jackson was born in 1969._NEWLINE_He received his B.A. in sociology from University of Florida in 1990 with a minor in anthropology. He earned his M.A. degrees in cultural anthropology and in folklore, as well as his Ph.D. degree in anthropology from Indiana University Bloomington. _START_SECTION_ Career _START_PARAGRAPH_ Jackson was Curator of Anthropology at the Gilcrease Museum in Tulsa, Oklahoma (1995–2000) and Assistant Curator of Ethnology at the Sam Noble Oklahoma Museum of Natural History in Norman, Oklahoma (2000–2004). He remains a Research Associate at SNOMNH._NEWLINE_A noted scholar in the tradition of Boasian anthropology, Dr. 
Jackson's research interests include the following areas: (1) folklore and ethnology (intellectual and cultural property issues, folklore and folklife, material culture, religion, ritual, cultural change, ethnohistory, music and dance, ethnobotany, ethnomedicine, social organization, social theory, history of folkloristics and anthropology), (2) linguistic anthropology (verbal art, oratory, language shift, language ideologies, theories of performance, language and culture), (3) curatorship (community collaboration, exhibitions, collections management), (4) American and Native American studies (Eastern North America)._NEWLINE_Dr. Jackson's ethnographic and historical work has focused on the life of the Yuchi, a Native American people residing today in Oklahoma, USA. He has published and edited several books on Native American topics, including Yuchi Ceremonial Life: Performance, Meaning and Tradition in a Contemporary American Indian Community. He has also published numerous articles based on his studies of Native American ethnography and folklore. Dr. Jackson has additionally spent time as an editor of the Journal of Folklore Research._NEWLINE_Dr. Jackson is the founding editor of Museum Anthropology Review, the first open access, peer-reviewed journal on the subject of museum anthropology. He is also the principal for the Open Folklore Project, which is tasked with "developing tools and resources for open access within Folklore studies." He also serves on the Editorial Board of Anthropological Quarterly and is one of the 2017 Visiting Faculty for the Smithsonian Summer Institute in Museum Anthropology, a position he has also held previously._NEWLINE_In June 2001, Dr. Jackson was awarded a Post-Ph.D. Research Grant from the Wenner-Gren Foundation "to aid archival and ethnographic field research on the role that social dance, musical performance, and cultural performances more generally, play in the network connecting the Woodland Indian communities of central and eastern Oklahoma into a regional system of exchange." + _START_ARTICLE_ The Umbrella (film) _START_SECTION_ Premise _START_PARAGRAPH_ After being released from prison, two incompetent crooks allow the umbrella with their stolen valuables stashed away in it to be carried off by someone else. A series of confusions ensues as they desperately try to recover the missing umbrella. + _START_ARTICLE_ Rick Rivet _START_SECTION_ Background and education _START_PARAGRAPH_ Rivet's family lived both in the country and in town at Aklavik, which was a Métis trading center. The Métis have a distinct culture with First Nations and European roots. He began school in Aklavik at age seven._NEWLINE_Rivet earned four degrees: his Bachelor of Arts from the University of Alberta in 1972; his Bachelor of Fine Arts from the University of Victoria in 1980; his Master of Fine Arts from the University of Saskatchewan in 1985; and his Bachelor of Education from the University of Saskatchewan in 1986. _START_SECTION_ Artwork _START_PARAGRAPH_ His art is deeply influenced by ideas of fusion and hybridity of cultures. He works primarily in acrylic on canvas in a style he has referred to as "an expressionist/primitivist approach." + _START_ARTICLE_ Bogdan Radenković _START_SECTION_ Biography _START_PARAGRAPH_ Radenković was born in 1874 in Srbovac, a village in the municipality of Zvečan, then part of the Ottoman Empire and now in Kosovo, which remains a contested political territory to this day.
As a university graduate and a tonsured monk with the chosen name Vasilije, he became a secretary to the Serbian Orthodox Metropolitanate of Skopje. In this influential post, he had numerous contacts with his people and with the consulates of Serbia, Russia, and France. Among the clergy, he was known as Vasilije (Radenković) and among the laity simply as Bogdan Radenković._NEWLINE_Bogdan Radenković was a member of the Serbian Committee of Skopje and the main organizer of the Serbian Chetnik action in the Ottoman Empire. He was an intermediary between the Serbian consulate and the Chetnik organization and their supporters. During 1905, the Turkish authorities caught a farmer who, after being tortured, revealed that Radenković was the president of the Serbian Committee in Skopje. At the Skopje trial, the farmer recanted, stating that his testimony had been extracted by force, and Radenković was ultimately acquitted. Radenković was a friend of Milan Rakić, then Serbia's vice-consul in Skopje, with whom he discussed confidential operational plans of the Chetnik organization. Milan Rakić at the time began writing a poem called "On Gazimestan" that became popular even before the Balkan War of 1912. _START_SECTION_ Founding of the Serb Democratic League _START_PARAGRAPH_ The Serb Democratic League in the Ottoman Empire (Serbian: Српска демократска лига у Отоманској царевини) was an Ottoman Serb political organisation established on August 13, 1908, at the First Serb Conference (August 10–13), immediately after the Young Turk Revolution. Some 26 of the most distinguished Serbs in the Ottoman Empire attended, and Bogdan Radenković was selected to head the "Temporary Central Board of the Organization of Ottoman Serbs" in July 1908. Bishop Vicentije Krdzić of the Serbian Orthodox Eparchy of Skopje headed the clergy and Bogdan Radenković the lay membership of the "Assembly of Ottoman Serbs in Skopje", held on Sretenje in 1909. These organizations included the Serb elite of Old Raška, Kosovo and Metohija, and Vardar Macedonia and Aegean Macedonia, as well as many members of the Serbian Chetnik Organization. They were: Bogdan Radenković; Aleksandar Bukvić; Gligorije "Gliša" Elezović; Vasa Jovanović; Milan Čemerikić; Sava Stojanović; David Dimitrijević; Đorđe Hadzi-Kostić; Velimir Prelić and Jovan Šantrić._NEWLINE_With its founding, the Serb Democratic League became the first political party to represent the interests of the Serbs in the Ottoman Empire. The League sent Bogdan Radenković, Jovan Šantrić and Đorđe Hadži-Kostić to Thessaloniki to negotiate with the Central Young Turk Board. The Serbian demand was for the three non-Muslim "ethnic groups" (Serbian, Greek and Bulgarian) to get an equal number of seats in the Ottoman Parliament. But the Young Turks refused that concept and made an electoral agreement with the Serbs conditional on a broader agreement that would not have a national background. In 1910, as a representative of the party, he was sent to Istanbul, where he urged the Turkish authorities to stop using their irregular troops (bashi-bazouks) to terrorize the Serbian population in Gjilan. The Sublime Porte denied the violence in Kosovo, claiming that it was a fabrication. Yet many of the outrages committed in Old Serbia, where Turkish troops are alleged to have massacred more than 60,000 Christians, have been credited to the Albanians.
_START_SECTION_ Black Hand _START_PARAGRAPH_ Radenković and a few others, particularly Colonel Dragutin Dimitrijević, were the initiators of the creation of the "Unification or Death" organization, better known as the Black Hand, in 1911. Along with Ljuba Čupa and Vojislav Tankosić, Radenković wrote the constitution of the "Unification or Death" organization, which was modelled on similar German secret nationalistic associations and the Italian Carbonari. _START_SECTION_ Bishop _START_PARAGRAPH_ After bishop Nićifor Perić of Raška-Prizren withdrew from his office (1911), owing to disagreement with the Serbian diplomacy, the Patriarchate of Constantinople appointed Bishop Gavrilo Dožić as successor, as the Serbian diplomacy wanted. There was a conflict within the Serbian Church regarding the appointment of Gavrilo; the "Old Serbs" (clergy from Kosovo and Old Serbia) wanted their candidate, the previous secretary of the Eparchy of Skoplje, monk Vasilije (Bogdan) Radenković. While waiting for the Ottoman government approval, the Serbian government changed the decision and ordered through the consuls that Ottoman Serbs request that Radenković be appointed instead. However, Gavrilo ended up being chosen. Meanwhile, Radenković became a founder of the Black Hand conspiracy group. _START_SECTION_ Secret Mission in Korçë _START_PARAGRAPH_ After the occupation of Serbia in late 1915 by the Germans, Austrians, Hungarians and Bulgarians, Bogdan Radenković withdrew through Montenegro, Albania to the island of Corfu, where he was temporarily hospitalized with tuberculosis. When his condition improved somewhat he was sent to Athens and from there to Korçë County in eastern Albania. There he stayed until August 1916 when a surprise Bulgarian invasion took place and he was forced to flee. He was almost caught while escaping but eventually managed to reach Thessaloniki, where the Serbian Supreme Command was stationed. Weak, suffering from tuberculosis after the harrowing escape from Korçë, he was advised by his doctor to go to Egypt where the climate may improve his condition. Nikola Pašić, however, purposefully delayed his departure until his condition worsened. _START_SECTION_ High Military Court _START_PARAGRAPH_ The Serbian Supreme Command on 15 March 1917 sent a warrant for Bogdan Radenković's arrest, though the main accused was Dragutin Dimitrijević, better known as Apis, and his associates. Radenković was sentenced to death for allegedly plotting against the prince regent Aleksandar Karadjordjević and Nikola Pašić, the head of the Serbian government-in-exile, even though there was no concrete evidence that could link him to such an outrageous plot, not him nor Dimitrijević and others._NEWLINE_Bogdan Radenković died of tuberculosis in a prison hospital in Thessaloniki, Greece on 30 July 1917. Years later it was revealed that Nikola Pašić fabricated the story to rid himself of Dragutin Dimitrijević and other Serbian nationalists that may pose a threat after the war during election time. All the accused were vindicated, but many years later. + _START_ARTICLE_ Super Mario Bros. 3 _START_SECTION_ Gameplay _START_PARAGRAPH_ Super Mario Bros. 3 is a two-dimensional, side-scrolling platform game in which the player controls either Mario or Luigi. The game shares similar gameplay mechanics with previous games in the series — Super Mario Bros., Super Mario Bros. 2 in Japan, and Super Mario Bros. 2 internationally — while introducing several new elements. 
In addition to running and jumping found in past games, the player can slide down slopes, pick up and throw special blocks, and freely climb vines. Mario can also fly and float with power-ups. The game world consists of eight kingdoms, each subdivided into multiple levels. The eight worlds feature distinct visual themes: for example, the second world, "Desert Land", contains sand-covered levels with pyramids, while the levels in the fourth world, "Giant Land", contain obstacles and enemies four times their normal size._NEWLINE_The player navigates through the game via two game screens: an overworld map and a level playfield. The overworld map displays an overhead representation of the current kingdom and has several paths leading from the world's entrance to a castle. Paths connect to action panels, fortresses, and other map icons, and allow players to take different routes to reach the kingdom's goal. Moving the on-screen character to an action panel or fortress will allow access to that level's playfield, a linear stage populated with obstacles and enemies. The majority of the game takes place in these levels, with the player traversing the stage by running, jumping, flying, swimming, and dodging or defeating enemies. Players start with a certain number of lives and may gain additional lives by picking up green spotted 1-Up mushrooms hidden in bricks, or by collecting 100 coins, defeating several enemies in a row with a Koopa shell, or bouncing on enemies successively without touching the ground. Mario and Luigi lose a life if they take damage while small, fall in a bottomless pit, or run out of time. The game ends when all lives are lost, although the player can continue from the last level played by selecting "Continue"._NEWLINE_Completing stages allows the player to progress through the overworld map and to succeeding worlds. Each world features a final stage with a boss to defeat. The first seven worlds feature an airship controlled by one of the Koopalings, while the player battles Bowser in his castle in the eighth world as the Final boss. Other map icons include large boulders and locked doors that impede paths. Mini-games and bonus screens on the map provide the player a chance to obtain special power-ups and additional lives. Power-ups obtained in these mini-games are stored in a reserve until activated by the player from the map screen._NEWLINE_In addition to special items from previous games like the Super Mushroom and the Fire Flower, new power-ups are introduced that provide the player with new options. The Super Leaf and Tanooki Suit give Mario raccoon and tanooki appearances, allowing him to fly. The Tanooki Suit enables him to turn into stone to avoid enemies for a short period of time. Changing into a Tanooki statue while jumping results in Mario pounding the ground and killing whatever enemies are directly under him; this is the first appearance of the now standard "ground pound" move in the Mario series. The new "Frog Suit" increases the character's underwater speed, agility, and jumping height on land. Another new suit, the Hammer Suit, gives Mario the appearance of the Hammer Bro. enemy and allows him to throw hammers at enemies and resist fire attacks when crouching._NEWLINE_Super Mario Bros. 3 includes a multiplayer option which allows two players to play the game by taking turns at navigating the overworld map and accessing stage levels. The first player controls Mario, while the other controls Luigi (a palette swap of Mario). 
Through this mode, players can access several mini-games, including a remake of the original Mario Bros. arcade game, in which one player has the opportunity to steal the cards of another, but may lose their turn if they lose the mini-game. _START_SECTION_ Plot and characters _START_PARAGRAPH_ The plot of Super Mario Bros. 3 is described in the instruction booklet. The Mushroom World, the setting of the game, is invaded by the Koopalings, Bowser's seven children. The Koopalings conquer each of the seven kingdoms by stealing its king's magical wand and using it to transform him into an animal. Princess Toadstool sends Mario and Luigi to travel to each kingdom, retrieve the stolen wand, and restore its king to normal._NEWLINE_Mario and Luigi receive notes and special items from Princess Toadstool after rescuing each of the first six kings. When they rescue the seventh king, they instead receive a note from Bowser, boasting that he has kidnapped Toadstool and imprisoned her within the castle of his own realm, Dark Land. The brothers travel through Dark Land, enter his castle, and defeat Bowser in a battle. The game ends with Princess Toadstool being freed from the castle._NEWLINE_Super Mario Bros. 3 was conceived as a stage play. The title screen features a stage curtain being drawn open, and in-game objects hang from off-screen catwalks, are bolted to the background, or cast shadows on the skyline. When Mario finishes a level, he walks off the stage. _START_SECTION_ Development _START_PARAGRAPH_ Beginning development shortly after the 1986 release of the Famicom's Super Mario Bros. 2, Super Mario Bros. 3 was developed by Nintendo Entertainment Analysis and Development, a team that consisted of more than ten people. The game took more than two years to complete at a budget of about $800,000. Developer Shigeru Miyamoto served as director. He worked closely with the designers and programmers during the conceptual and final stages, encouraging a free interchange of ideas. Miyamoto considered intriguing and original ideas to be key to creating a successful game. Originally, the team intended for the game to be played from an isometric point of view, but the developers found that this made it too difficult to position jumps, so the game was changed to the 2D side view used in previous games. Some isometric elements remain, such as the checkered floor present in the title screen._NEWLINE_The game was designed to appeal to players of varying skill levels. To assist less skilled players, bonus coins and 1-ups are more abundant in earlier worlds, while later worlds present more complex challenges for experienced players. In the two-player mode, the players alternate turns to balance play time. The development team introduced new power-ups and concepts that would give Mario the appearance of different creatures as a means of providing him with new abilities. An early idea changed Mario into a centaur, but was dropped in favor of a raccoon tail with limited flying ability. Other costumes with different abilities were added to his repertoire, and levels were designed to take advantage of these abilities. New enemies were included to add diversity to the game, along with variants of previous enemies, such as Goombas, Hammer Bros., and Koopa Troopas._NEWLINE_Some of the enemies designed for Super Mario Bros. 3 were inspired by the team's personal experiences. 
For example, Miyamoto stated that the Chain Chomp enemy, a tethered ball and chain creature that lunges at the player when in close proximity, was based on a "bad [childhood] experience" he had with a dog. Bowser's children, the Koopalings, were designed to be unique in appearance and personality; Miyamoto based the characters on seven of his programmers as a tribute to their work and efforts. Nintendo of America named the Koopalings after well-known musicians: for example, the characters "Ludwig von Koopa" and "Roy Koopa" are named after Ludwig van Beethoven and Roy Orbison respectively._NEWLINE_The character graphics were created with a special graphics machine ("Character Generator Computer Aided Design") that generated a collection of the graphical shapes used in the game. Shapes in the collection were assigned numbers that the game's code used to access and combine to form complete images on the screen in real time. The Super Mario Bros. 3 cartridge uses Nintendo's custom MMC3 (memory management controller) ASIC to enhance the NES capabilities. The MMC3 chip allows for animated tiles, extra RAM for diagonal scrolling, and a scan line timer to split the screen. The game uses these functions to split the game screen into two portions, a playfield on the top and a status bar on the bottom. This allows the top portion to scroll as the character navigates the stage while the bottom portion remains static to display text and other information._NEWLINE_Like its predecessors, the music in Super Mario Bros. 3 was composed by Koji Kondo, who composed several new songs as well as returning melodies from Super Mario Bros. According to Kondo, who had composed the music in Super Mario Bros. based on what he believed fit the levels rather than focusing on composing a specific genre of music, the game was the most difficult game for him to compose. Kondo experimented with several different genres of music, unsure of how to follow up the music from the first game after hearing from several people that it sounded a lot like latin or fusion music, and came up with several different melodies throughout its development before settling on what ultimately made it into the game. The development team decided that music on the title screen was unnecessary._NEWLINE_During 1988, a global shortage of ROM chips, along with Nintendo's preparation of Super Mario Bros. 2, prevented Nintendo from performing various North American game releases according to their original schedules. The delayed products included Super Mario Bros. 3 and, according to Nintendo Power, Zelda II: The Adventure of Link. The delay, however, presented Nintendo with an opportunity to promote the game in a feature film. In 1989, Tom Pollack of Universal Studios approached Nintendo of America's marketing department about a video game movie; inspired by Nintendo video game competitions, Pollack envisioned a video game version of Tommy for younger audiences. Nintendo licensed its products for inclusion in what would become the film The Wizard. During the movie's production, the filmmakers requested and were granted approval from Nintendo regarding the script and the portrayal of the company's games. Super Mario Bros. 3 was one of the products shown in the film and was used in a final scene involving a video game competition. The film was released in December 1989, between the Japanese and English versions of the game. _START_SECTION_ Sales _START_PARAGRAPH_ Super Mario Bros. 3 became a best-selling game. 
Its inclusion in The Wizard served as a preview which generated a high level of anticipation in the United States prior to its release. Levi Buchanan of IGN considered Super Mario Bros. 3's appearance in the film as a show-stealing element, referring to the movie as a "90-minute commercial" for the game. The game sold 250,000 copies in its first two days of release, according to a spokeswoman for Nintendo. By 1993, the game had sold 4 and 7 million unbundled units in Japan and the United States respectively. In the United States alone, the game generated over US$500 million in revenue for Nintendo. Author David Sheff commented that, in music industry terms, the game went platinum 11 times. The game was later bundled with new NES systems. Including bundled units, the NES version of the game sold over 17 million copies. Game Informer reported in their October 2009 issue that the Virtual Console version had sold one million copies. As of 2011, Super Mario Bros. 3 remains the highest-grossing non-bundled home video game to date, having grossed $1.7 billion, adjusted for inflation. _START_SECTION_ Legacy _START_PARAGRAPH_ Super Mario Bros. 3 introduced several elements carried over to subsequent Mario games. A similar overworld map is used in Super Mario World and New Super Mario Bros., and Mario's ability to fly has been a feature in games such as Super Mario World, Super Mario 64 and Super Mario Galaxy. The game's "Super Leaf" item has returned in more recent Mario games for the Nintendo 3DS, like Super Mario 3D Land, Mario Kart 7 and New Super Mario Bros. 2. Bowser's red hair was first popularized in the game and has since become a part of his standard appearance._NEWLINE_Through a collaboration between NBC and Nintendo of America, an animated television series, The Adventures of Super Mario Bros. 3, was created in 1990 by DIC Entertainment. The show aired weekly and featured numerous characters, enemies, and settings from the video game; the original seven Koopalings are given different names based on their given personalities and are also given a new age order. Other Nintendo products have included various elements of the game as well. Music from Super Mario Bros. 3 appears as a track on Nintendo Sound Selection Koopa, a collection of songs from Nintendo games. The game's stages and graphics comprise a background theme in the 2006 Nintendo DS game Tetris DS. The Koopalings are also world bosses in Super Mario World, Mario is Missing!, Yoshi's Safari, Hotel Mario and all New Super Mario Bros. games except New Super Mario Bros. Boom Boom, another boss from this game, additionally reappears in Super Mario 3D Land and Super Mario 3D World, alongside a boomerang-wielding female counterpart named Pom Pom. Super Mario Bros. 3 is one of the games represented in both Super Mario Maker and Super Mario Maker 2._NEWLINE_Super Mario Bros. 3 has appeared on numerous top video game lists. The game debuted on Nintendo Power's Top 30 best games ever list at number 20 in September 1989. It entered the list's top 10 a few months later and reached number one in May 1990. Super Mario Bros. 3 remained within the top 20 for more than five years. More than a decade later, the magazine ranked the game number six on their list of 200 Greatest Nintendo Games. In August 2008, Nintendo Power listed Super Mario Bros. 3 as the second best NES video game, praising it for making the series more complex and introducing new abilities that have since become signature abilities in the series. 
The game placed 11th, behind Super Mario Bros., in Official Nintendo Magazine's "100 greatest Nintendo games of all time". In 2007, ScrewAttack called Super Mario Bros. 3 the best Mario game in the series as well as the best game on the NES, citing the graphics, power-ups, secrets, and popularity, summing it up as "just incredible" and stating, "If you haven't experienced this greatness, we pity you". In a poll conducted by Dengeki, the game tied with Super Mario World as the number three video game their readers first played._NEWLINE_The game has been ranked on several of IGN's lists of "top games". In 2005, they rated it 23rd among their Top 100 Games, and praised the precise and intuitive controls. IGN editors from the United States, United Kingdom, and Australia ranked Super Mario Bros. 3 number 39 in their 2007 Top 100 Games, citing Miyamoto's "ingenious" designs. They further commented that the game improved on the "already-brilliant concepts" of the previous games with new power-ups and enemies. Users and readers of the website placed the game high on similar lists: 32nd in 2005 and 21st in 2006. In 2007, the game was included in the "game canon", a list of the ten most important video games selected for preservation by the Library of Congress. In 2009, Game Informer put Super Mario Bros. 3 9th on their list of "The Top 200 Games of All Time", saying that it is "a game with incredible lasting power that we won't soon forget". This is down one place from Game Informer's previous ranking in 2001. Edge ranked the game #20 on its list of "The 100 Best Games To Play Today", calling it "the one 8-bit game that still shines today, no caveats required." UGO listed Super Mario Bros. 3 on their list of the "Top 50 Games That Belong On the 3DS", calling it "Arguably the greatest Mario game ever made." GameSpot placed the game on their list of the greatest games of all time. USgamer ranked the game as the third best Mario platformer ever. Super Mario Bros. 3 ranked 34th on Warp Zoned's "Scientifically Proven Best Video Games of All Time" list, a statistical meta-analysis of 44 "top games" lists published between 1995 and 2016._NEWLINE_In the early 1990s, game developers John Carmack and Tom Hall developed an adaptive tile refresh technology to perform smooth, side-scrolling graphics on EGA cards for IBM clone personal computers. They used it to develop a clone of Super Mario Bros. 3 and presented it to Nintendo, who rejected it to retain exclusivity for their games on Nintendo consoles. Carmack and Hall went on to found Id Software and develop Commander Keen, a series of platform games inspired by Super Mario Bros. 3. _START_SECTION_ Remakes _START_PARAGRAPH_ The game has been ported or remade on several other Nintendo consoles. It was included in the 1993 Super NES game Super Mario All-Stars, a compilation of remakes of NES Super Mario games featuring updated graphics and sound, which was also later released on the Wii in 2010. A Game Boy Advance version, Super Mario Advance 4: Super Mario Bros. 3, was released in 2003. This version features support for the Nintendo e-Reader peripheral, which allows the player to access additional levels stored on e-Reader cards, in addition to updated graphics, power-ups, and sound._NEWLINE_Super Mario Bros. 3 was rereleased in emulation as a downloadable Virtual Console game in 2007 for the Wii and in 2014 for the Nintendo 3DS and Wii U consoles. 
It is one of thirty pre-installed games in the NES Classic Edition console, and is on the Nintendo Switch Online service. + _START_ARTICLE_ Mobile Public Library _START_SECTION_ History _START_PARAGRAPH_ The Mobile Public Library has roots going back to the 1850s, when it was started as a subscription organization by the Franklin Society. The library was officially established as the Mobile Public Library in 1902 and was originally housed in an antebellum structure at the corner of Conti and Hamilton Street. The library association appealed to city leaders in the late 1910s to provide operating funds for the library, and it offered to give the city the library property if it would build a new building to house the collections. The city declined to finance the construction of a new building, but did approve operating funds on 2 April 1918. _NEWLINE_Due to increasing public demand for a library, on 15 December 1925, the city commissioners voted to schedule a special election on a $250,000 bond issue. The voters approved the bond and, along with a gift of $30,000 from Eli H. Bernheim of New York City, the new library building was constructed. Noted Mobile architect George Bigelow Rogers designed the building in the Classical Revival style. The new structure, now known as the Ben May Main Library, was opened on 15 September 1928._NEWLINE_The state had passed racial segregation laws at the turn of the century after disenfranchising most blacks and many poor whites in the state, excluding them from politics. Mobile's African-American community did not have access to a public library until one was completed for them in 1931; it was known as the Davis Avenue Branch. It was also designed by George Bigelow Rogers. It was funded by a city bond issue and the city's sale of the old library property on Conti Street._NEWLINE_The Ben May Main Library building is a contributing building to the Church Street East Historic District, which was listed on the National Register of Historic Places on 16 December 1971. The system opened a new branch, the West Regional Branch, in 2002, with First Lady Laura Bush making an address. Beginning in 2006, the Ben May Main Library building was restored and expanded by 22,000 square feet (2,000 m²). It was reopened on 31 May 2007. _START_SECTION_ Services _START_PARAGRAPH_ In addition to basic services, participation in several interlibrary loan systems, and internet access at all locations, the Mobile Public Library provides a range of other services. Free library cards are made available to all residents in the Alabama counties of Baldwin, Washington, Clarke, Monroe, Escambia and Conecuh. Alabama Virtual Library (AVL) cards are also made available for free at all branches. _START_SECTION_ Local history and genealogy _START_PARAGRAPH_ The Local History and Genealogy Division includes works by local authors, Mobile histories, periodicals, Mobile newspapers on microfilm from 1819 to the present, city directories from 1837 onward, federal census records for most of the Southeastern United States, and the Mobile Historic Development Commission's survey of historic architecture in Mobile with 10,000 images stored and indexed on CD-ROM. _START_SECTION_ Youth services _START_PARAGRAPH_ This department attempts to meet the needs and interests of children and young adults through the various library collections, services and programs. Books, movie DVDs and VHS, music CDs, audiobooks, back issues of magazines, and video games are available to be checked out. 
Story time for young children is provided at most library locations. _START_SECTION_ Disabilities _START_PARAGRAPH_ All branches provide handicapped access, materials, and services for patrons with disabilities. A few of the services provided are magnifying glasses, large type books, closed captioned videos, books for and about the handicapped, and instructional books and videos on sign language. In addition, recorded books on discs and cassettes and the equipment for using them are available on free loan to eligible individuals from the Alabama Regional Library for the Blind and Physically Handicapped in Montgomery, Alabama. _START_SECTION_ Bookmobile _START_PARAGRAPH_ The library operates a bookmobile three days a week at over 30 different stops across Mobile County. Each location is visited every three weeks. + _START_ARTICLE_ Daugherty Furniture Building _START_SECTION_ History _START_PARAGRAPH_ The furniture store was originally located in Fork Mountain, Tennessee, but was moved to Clinton in the late 1930s because of the population growth of Anderson County. More than 75,000 individuals had moved to the area in a two-year period, and the demand for household items was high. Daugherty Furniture Store served as the only "one-stop shop" for Anderson County residents. The store offered home delivery, with a fleet of delivery trucks. The third and fourth floors of the building served as apartments, which housed Oak Ridge workers and scientists in need of housing in a region running out of housing options. The business expanded into the nearby regions, and the main location served as a meeting place for local businessmen. The attention and popularity of the main location helped to develop the neighborhood of Market and Main Streets in Clinton._NEWLINE_The store was owned and operated by the family, and most of the employees were family members. Daugherty's brothers, Leonard, Emmitt, and Laford, worked for the shop, with Leonard serving as a manager overseeing deliveries and purchases. Their sons, along with Daugherty's sons, all worked at the store and remained working there until its closing in 1985. All employees were trained to be familiar with the store's merchandise, including demonstrating Singer sewing machines. Employees also provided repair services, both in-store and in the homes of customers. Many employees lived in the upstairs apartments, and Daugherty lived in the corner unit and oversaw the company until his death in 1985. _START_SECTION_ Architecture _START_PARAGRAPH_ The original building was located in Fork Mountain, Tennessee, and was a simple one-story building. When Daugherty moved to Clinton in 1935, he rented two buildings across the street from where the current building stands. While planning to build a new location, he found inspiration in a building in Oak Ridge owned by Elza Gate. Gate's home, known as the Glenn Copeland House, was entirely faced in stone (except the roof). Daugherty would use Frank Gilbreath as an advisor on building his store; Gilbreath served as stonemason on the Glenn Copeland House._NEWLINE_When Daugherty began exploring options for building his store, his friends and colleagues advised him to avoid building a multi-story building. He hired architect Clem H. Meyer, who had designed Huntsville High School in Scott County, Tennessee. 
Oba Hill was also hired; he worked alongside Gilbreath on the stone work._NEWLINE_About 99,000 pounds of locally quarried stone was used on the building's exterior, most of which came from areas near the New River region of Morgan County and Scruggs Farm in Bethel. All of the stone was hand-chiseled and laid by Gilbreath and Sebastian Marie, another local stone cutter. When the building was completed in 1942, it was the largest commercial building in Clinton, along with Magnet Mills, Inc._NEWLINE_The Daugherty Furniture Building does not follow traditional architectural styles, but serves as an example of vernacular architecture, using local construction methods and materials. The building's interior, from the fourth floor to the basement, looks like an inverted step pyramid. The load-bearing walls are stair-stepped, so the fifth-floor walls are thinner than those in the basement; the basement walls are 26 inches thick, while the fifth-floor walls are 12 inches thick. The architecture of the building was also influenced by Meyer's work on Huntsville High School, using techniques found in that project. Examples include the rectangular shape, stone exterior, and rectangular steel windows. The design is also minimalistic, reflecting the wartime emphasis on simplicity seen in design during the period. + _START_ARTICLE_ Waka (poetry) _START_SECTION_ Etymology _START_PARAGRAPH_ The word waka has two different but related meanings: the original meaning was "poetry in Japanese" and encompassed several genres such as chōka and sedōka (discussed below); the later, more common definition refers to poetry in a 5-7-5-7-7 metre. Up to and during the compilation of the Man'yōshū in the eighth century, the word waka was a general term for poetry composed in Japanese, and included several genres such as tanka (短歌, "short poem"), chōka (長歌, "long poem"), bussokusekika (仏足石歌, "Buddha footprint poem") and sedōka (旋頭歌, "repeating-the-first-part poem"). However, by the time of the Kokinshū's compilation at the beginning of the tenth century, all of these forms except for the tanka and chōka had effectively gone extinct, and chōka had significantly diminished in prominence. As a result, the word waka became effectively synonymous with tanka, and the word tanka fell out of use until it was revived at the end of the nineteenth century (see Tanka)._NEWLINE_Tanka (hereafter referred to as waka) consist of five lines (句 ku, literally "phrases") of 5-7-5-7-7 on or syllabic units. Therefore, tanka is sometimes called Misohitomoji (三十一文字), meaning it contains 31 syllables in total. _START_SECTION_ Kamakura and Muromachi periods _START_PARAGRAPH_ After the Heian period, during the Kamakura period and later, renga, a form of collaborative linked poetry, began to develop. In the late Heian period, three of the last great waka poets appeared: Fujiwara no Shunzei, his son Fujiwara no Teika, and Emperor Go-Toba. Emperor Go-Toba ordered the creation of a new anthology and joined in editing it. The anthology was named Shin Kokin Wakashū. He edited it again and again until he died in 1239. Teika made copies of ancient books and wrote on the theory of waka. His descendants, and indeed almost all subsequent poets, such as Shōtetsu, taught his methods and studied his poems. 
The courtly poetry scenes were historically dominated by a few noble clans and allies, each of which staked out a position._NEWLINE_By this period, a number of clans had fallen by the wayside, leaving the Reizei and the Nijō families; the former stood for "progressive" approaches, the varied use of the "ten styles" and novelty, while the latter conservatively hewed to already established norms and the "ushin" (deep feelings) style that dominated courtly poetry. Eventually, the Nijō family became defunct, leading to the ascendancy of the "liberal" Reizei family. Their innovative ascendancy was soon ended by the Asukai family, aided by the Ashikaga shōgun, Ashikaga Yoshinori._NEWLINE_In the Muromachi period, renga became popular in the court and among the people around it. It spread to the priestly classes and thence to wealthy commoners. In much the same way as waka, renga anthologies were produced under the imperial aegis. As momentum and popular interest shifted to the renga form, the tanka style was left to the Imperial court. Conservative tendencies exacerbated the loss of vitality and flexibility. A tradition named Kokin-denju, the heritage of Kokin Wakashū, was developed. It was a system for analyzing the Kokin Wakashū and included the secret (or, more precisely, lost) meanings of words. Studying waka degenerated into learning the many intricate rules, allusions, theories, and secrets, so as to produce tanka that would be accepted by the court._NEWLINE_There were comical waka already in the Kojiki and the Man'yōshū, but the noble style of waka in the court inhibited and scorned such aspects of waka. Renga was soon in the same position, with many codes and strictures reflecting literary tradition. Haikai no renga (also called just haikai (playful renga)) and kyōka, comical waka, were a reaction to this seriousness. But in the Edo period, waka itself lost almost all of its flexibility and began to echo and repeat old poems and themes. _START_SECTION_ Edo period (1603–1867) _START_PARAGRAPH_ In the early Edo period, waka was not a fashionable genre. Newly created haikai no renga (of whose hokku, or opening verse, haiku was a late 19th-century revision) was the favored genre. This tendency persisted throughout the period, but in the late Edo period waka faced new trends from beyond the court. Motoori Norinaga, the great reviver of traditional Japanese literature, attempted to revive waka as a way of providing "traditional feeling expressed in genuine Japanese way". He wrote waka, and waka became an important form to his followers, the Kokugaku scholars._NEWLINE_In Echigo Province a Buddhist priest, Ryōkan, composed many waka in a naïve style, intentionally avoiding complex rules and the traditional way of waka. He belonged to another great tradition of waka: waka for expressing religious feeling. His frank expression of feeling found many admirers, then and now. In the cities, a comical, ironic and satiric form of waka emerged. It was called kyōka (狂歌), or "mad poem", and was loved by intellectual people in big cities like Edo and Osaka. It was not precisely a new form; satirical waka was a style known since ancient times. But it was in the Edo period that this aspect of waka developed and reached an artistic peak. Still, most waka poets kept to ancient tradition or turned those reforms into another stereotype, and waka was generally not a vibrant genre at the end of this period. 
+ _START_ARTICLE_ 1982 Monaco Grand Prix Formula Three _START_PARAGRAPH_ Results from the 1982 Monaco Grand Prix Formula Three held in Monte Carlo on May 22, 1982, at the Circuit de Monaco. + _START_ARTICLE_ John Jairo Castillo _START_SECTION_ Club career _START_PARAGRAPH_ He signed with the team in January 2008 on a long-term contract. He has played for numerous Colombian teams such as Deportes Tolima in the Copa Mustang and also for Bolivian club Oriente Petrolero and most recently for Venezuelan club Guaros FC. He has recently been training with Deportivo Cali. + _START_ARTICLE_ The Rainbow Agenda _START_PARAGRAPH_ The Rainbow Agenda was a set of demands put forth by a coalition of student groups at Stanford University in the late 1980s. Inspired by Jesse Jackson's Rainbow Coalition (now Rainbow/PUSH), Stanford's Rainbow Coalition demanded that the university "explore the critical concerns of minority students, faculty, and staff at Stanford University". _START_SECTION_ History _START_PARAGRAPH_ On January 17, 1987, some 500 Stanford students marched with Jesse Jackson to celebrate a new course at Stanford to replace its previous "Western Culture" requirement. In 1988, the Faculty Senate voted to change the course to "Cultures, Ideas, and Values" (CIV)._NEWLINE_In response to backlash, a "Rainbow Coalition" was formed, a coalition of student groups which made a number of demands of the university (the Rainbow Agenda). These student groups included the Black Student Union, MEChA, the Asian American Student Association, and the Stanford American Indian Organization. The demands included requests concerning student and faculty diversity, support for community centers, and a "renewed commitment to discourage Indian mascot fanatics." _NEWLINE_On 15 May 1989, students from the Rainbow Coalition also occupied President Donald Kennedy's office, to "emphasize the need for an Asian American Studies tenure-track professor; a full-time dean for El Centro Chicano, the Chicano student center; and a discrimination review board to act on complaints of racial slurs and incidents. Fifty-five students... were arrested." _START_SECTION_ Impact _START_PARAGRAPH_ The Rainbow Coalition was renamed the Students of Color Coalition (SOCC) in 1994._NEWLINE_Twenty-five years after its founding, SOCC has recently been one of the most influential student groups in determining the result of student elections on campus._NEWLINE_"Cultures, Ideas, and Values" was eventually replaced by IHUM._NEWLINE_The Stanford Review was founded in 1987 partly to provide an alternative viewpoint to issues raised by the Rainbow Coalition. + _START_ARTICLE_ 1997–98 Bradford City A.F.C. season _START_SECTION_ Season summary _START_PARAGRAPH_ In the 1997–98 season, Bradford started well with 13 points from a possible 15, which saw the Bantams top of the table after five games, but results declined and chairman Geoffrey Richmond sacked Kamara on 6 January, three days after a 2–0 FA Cup defeat to Manchester City._NEWLINE_Richmond turned to Jewell, who was by now Kamara's assistant, and he won his first game 2–1 against Stockport County. In his 21 games in charge, Jewell won six games and drew five to guide Bradford to 13th, their highest position since Jewell had joined the club. He was rewarded with a permanent contract when others expected Richmond to turn to a big name. 
_START_SECTION_ Results _START_PARAGRAPH_ Bradford City's score comes first + _START_ARTICLE_ Ulaanbaatar Hotel _START_SECTION_ History _START_PARAGRAPH_ The infamous Anastasia Filatova, the wife of Mongolian communist leader Yumjaagiin Tsedenbal, and de facto co-ruler of the country, was personally involved in the construction and design. She chose the best workers and designers available at the time to complete the hotel, which was designed to be a flagship property for the Mongolian hospitality industry. Senior employees say that she personally picked the colors and design for the lobby and main hall._NEWLINE_It was the first public building with running hot water; in the 1960s, Mongolian elites used to rent rooms by the hour to enjoy a hot bath or shower. A number of foreign embassies were quartered at the hotel during the 1980s and 1990s._NEWLINE_The last embassy, the Turkish mission, moved out in 1997. Ever since its foundation the hotel has been frequented by politicians and lobbyists. During the Democratic revolution of 1991, Communist rulers used the hotel to meet unofficially with the democratic activists. The future fate of the country was decided during these meetings. The hotel has also become a cultural phenomenon: more than a dozen movies were filmed here._NEWLINE_Julia Roberts, Demis Roussos, Richard Gere, Steven Seagal, the Dalai Lama, Fradkov, Andre Kim and Alsou are among the famous guests who have stayed at the hotel. + _START_ARTICLE_ Ireland–Spain relations _START_SECTION_ Early relations _START_PARAGRAPH_ The first awareness and contact between both nations was through stories about Celtic migration from Iberia to Ireland as mentioned in the Lebor Gabála Érenn regarding the Milesians. The first diplomatic contact between Irish and Spanish nobility happened in April 1529, when the Spanish ambassador, Don Gonzalez Fernandez, visited Ireland and met with the 10th Earl of Desmond. The agreement, known as the Treaty of Dingle, gave a formal legal and constitutional foundation to the rights of citizenship and other privileges that Irish exiles and emigrés enjoyed in Habsburg Spain, Habsburg Austria and the Habsburg Netherlands from the 16th to the early 20th centuries. Both nations felt united in their common belief in Catholicism, but this was not an issue in 1529. In 1554–58 Philip, Prince of Asturias, was married to Mary I and was named as titular King of Ireland in the Papal Bull Ilius ad quem. As a result, during the first plantations of Ireland, what is now County Offaly was shired as "King's County", and Philipstown (now Daingean) was named in his honour, the first Irish place named after someone from Spain. Soon after Mary's death he succeeded as Philip II of Spain._NEWLINE_In 1601, Spain supported Irish rebels fighting against England during the Nine Years' War, and especially during the Siege of Kinsale. At the time, the Catholics of Ireland saw Spain as a potential liberator of their country from Protestant England, and in 1595 Hugh O'Neill offered the crown of Ireland to Philip II of Spain. Philip refused the offer, having already been the titular King of Ireland. _NEWLINE_Many Irishmen in rebellion against English rule subsequently sought refuge in Spain following the Flight of the Earls (1607), and for the next two centuries Irish soldiers contributed to the Spanish Army of Flanders and fought side by side during the Dutch Revolt and during the Thirty Years' War. 
During the Anglo-Spanish War (1625–1630) proposals were made in 1627 to launch an invasion of Ireland under Shane O'Neill and Hugh O'Donnell, but these did not go further than the planning stage._NEWLINE_Irish soldiers in the service of Spain also participated in the colonization of the Americas. Several prominent Spanish officials of Irish origin governed and administered Spanish colonies as viceroys, such as Ambrosio O'Higgins in Peru and Juan O'Donojú in Mexico, or became ministers in the Spanish government, most notably Leopoldo O'Donnell and his relatives Carlos O'Donnell and Juan O'Donnell. At the same time, during the independence wars of Gran Colombia, several thousand Irish soldiers fought for South America against Spain and made up parts of the British Legions. _START_SECTION_ Independence for the Republic of Ireland and the Spanish Civil War _START_PARAGRAPH_ In January 1801, Ireland became a part of the newly created United Kingdom of Great Britain and Ireland, and all relations between Ireland and Spain were henceforth carried out through the Court of St James's. Napoleon's Irish Legion took part in suppressing the Dos de Mayo Uprising. After the Napoleonic Wars the Regiment of Hibernia was disbanded at the demand of their British allies._NEWLINE_In December 1922, most of Ireland gained a form of independence within the British Empire as the Irish Free State and, in 1924, diplomatic relations were officially established between the new Irish Free State and the Kingdom of Spain. That same year, Spain opened its first consulate in Dublin. In 1935, the first Irish Minister was appointed to Spain with residence in Madrid._NEWLINE_In 1936, Spain was engulfed in a civil war between the Republican faction, led by President Manuel Azaña, and the Nationalist faction, led by General Francisco Franco. The Free State was a member of the Non-intervention Committee. However, Ireland (North and South) was divided in opinion over the war. Many Irish Catholics sided with Francisco Franco. A smaller number sided with the Spanish Republican faction. In 1936 the Irish Christian Front was established to financially support Francisco Franco, and the Irish Brigade was created to fight for the Nationalist side, contributing 700 Irish volunteer soldiers to Franco. The Irish-American politician Joe Kennedy stopped the US Congress from supplying arms to the Republic. At the same time, 250 Irish socialist volunteers joined the International Brigades and fought for the Spanish Republican faction. Both Irish factions took part in the Battle of Jarama in February 1937. By the summer of 1937, the Irish Brigade was "disarmed and ordered out of Spain by Franco" (Fearghal McGarry); most of the socialists stayed until late 1938, although they were frequently treated as pariahs on their return home, and many emigrated to the UK. After the war ended in 1939, the Irish Minister presented his credentials in Burgos and formally recognized the new Spanish government under General Franco. _START_SECTION_ Post-war relations _START_PARAGRAPH_ In 1973, both Northern Ireland (as part of the United Kingdom) and the Republic of Ireland joined the European Economic Community (EEC), and Spain followed suit in 1986. In July 1986, King Juan Carlos I of Spain paid his first official visit to the Republic of Ireland. In 1993, Mary Robinson became the first Irish President to pay an official visit to Spain. Since then, there have been numerous visits between leaders of both states. 
Recently, in January 2017, Irish Taoiseach Enda Kenny paid a visit to Spain._NEWLINE_Spain has increasingly become an important tourist destination for Irish travelers. In 2016, 1.4 million Irish citizens visited Spain for tourism. At the same time, 263,000 Spanish tourists visited Ireland. In 2016, 35,000 Spanish nationals studied English in Ireland. Several Irish and Spanish airlines provide direct services between both nations. _START_SECTION_ Bilateral agreements _START_PARAGRAPH_ Both the Republic of Ireland and Spain have signed several bilateral agreements (mostly prior to both states joining the European Union), such as an Agreement on the Exchange of Diplomatic Pouches (1935); Agreement on the Exchange of Information regarding Meteorology (1950); Extradition Treaty (1957); Cultural Cooperation Agreement (1980); Spanish Agreement on Renouncing Historic Rights of Fishing in Irish Waters (1980) and an Agreement on the Avoidance of Double Taxation (1994). _START_SECTION_ Trade _START_PARAGRAPH_ In 2015, trade between the Republic of Ireland and Spain totaled 4.5 billion euros. Ireland's exports to Spain include pharmaceutical products, electrical equipment, perfume and chemical-based products. Spain's exports to Ireland include automobiles, clothing and organic chemical products. That same year, Irish investments in Spain totaled 200 million euros while, at the same time, Spanish investments in the Republic of Ireland totaled 4 billion euros. + _START_ARTICLE_ Tokyo College of Music _START_SECTION_ History _START_PARAGRAPH_ The college moved to Toshima in Tokyo in 1924 after the original campus was destroyed by the Great Kantō earthquake. + _START_ARTICLE_ Fort Pond Bay _START_PARAGRAPH_ Fort Pond Bay is a bay off Long Island Sound at Montauk, New York, that was the site of the first port at the end of Long Island. The bay has a long naval and civilian history. _START_SECTION_ New-York Province and the American Revolution _START_PARAGRAPH_ Fort Pond Bay was first listed by name in a 1655 map, published in 1680 by John Scott, which makes note of a Montaukett Native-American fort on its banks._NEWLINE_Early settlers in the area raised cattle and sheep on the bluffs above the bay. During the American Revolutionary War, at the time of the Siege of Boston, British warships sailed into the bay in 1775. Local militia under Captain John Dayton feigned having more men than they had, turning their coats inside out as they marched back and forth on top of a high hill to the south. The tactic is called Dayton's Ruse._NEWLINE_Long Island was occupied throughout the war and the bay was used by the British for their blockade of Connecticut. In 1781 HMS Culloden ran aground while pursuing a French frigate during a January storm. The ship, which survived the initial grounding, hit a rock and had to be scuttled in the bay at Culloden Point and burned, with its cannons thrown overboard. Its debris field and wreck site are now the only underwater park on Long Island._NEWLINE_In the late 18th century, the small fishing village of Montauk was established at the southeast corner of the bay. _START_SECTION_ The 19th century and today _START_PARAGRAPH_ In 1839 the slave ship Amistad anchored in the bay (also at Culloden Point) when the surviving crew tried to convince their captors, the slaves who had revolted, that they had returned to Africa as the latter went for provisions in the village of Montauk. 
The ship was seized by the USS Washington in the bay._NEWLINE_In the 1890s, Austin Corbin extended the Long Island Rail Road from Bridgehampton, New York, to the Montauk fishing village (the line extension was called the Fort Pond Railway). His friend Arthur Bensen purchased 10,000 acres (40 km²) of Montaukett land around the village, and the LIRR began advertising that it could cut a day off ship travel by docking in Montauk and taking the train rather than going to New York. Corbin built a steel pier into the bay for the overseas ships (even as the Corps of Engineers continued to caution against using the bay because of rocks)._NEWLINE_The dream never materialized, and the U.S. Army bought the land for Camp Wikoff. Following the Spanish–American War, Theodore Roosevelt and his Rough Riders came by transport into the bay to be quarantined at the camp over concerns about yellow fever._NEWLINE_The fishing village was obliterated by the Great Hurricane of 1938. The Navy took over the area for a seaplane and dirigible base during World War II (the dock is still in use). The Montauk fishing village was moved a mile south, closer to the Atlantic Ocean._NEWLINE_During the years after World War II, the bay ceased to be used by most boats because of flooding and rocks. Boats now dock in the dredged Lake Montauk. In the 1960s the bluffs above the bay were used to build Leisurama homes as inexpensive second homes that had been inspired by the Kitchen Debate between Richard Nixon and Nikita Khrushchev. + _START_ARTICLE_ William Johnson (cricketer, born 1884) _START_PARAGRAPH_ William James Johnson (22 September 1884 – 14 August 1941) was a wine and spirit grocer and keen cricketer who played one first-class match for Victoria in 1924–25. He was later a selector of the Australian Test team._NEWLINE_Johnson's son, Ian, went on to captain the Australian Test cricket team. + _START_ARTICLE_ Sports broadcasting contracts in Kosovo _START_SECTION_ Basketball Cups _START_PARAGRAPH_ Kosovo Basketball Cup: ArtMotion (2018–2021) + _START_ARTICLE_ Resettable fuse _START_SECTION_ Operation _START_PARAGRAPH_ A polymeric PTC device is made up of a non-conductive crystalline organic polymer matrix that is loaded with carbon black particles to make it conductive. While cool, the polymer is in a crystalline state, with the carbon forced into the regions between crystals, forming many conductive chains. Since it is conductive (the "initial resistance"), it will pass a current. If too much current is passed through the device, it will begin to heat. As the device heats, the polymer will expand, changing from a crystalline into an amorphous state. The expansion separates the carbon particles and breaks the conductive pathways, causing the device to heat faster and expand more, further raising the resistance. This increase in resistance substantially reduces the current in the circuit. A small (leakage) current still flows through the device and is sufficient to maintain the temperature at a level which will keep it in the high-resistance state. Leakage current can range from less than a hundred mA at rated voltage up to a few hundred mA at lower voltages. The device can be said to have latching functionality. The hold current is the maximum current at which the device is guaranteed not to trip. The trip current is the current at which the device is guaranteed to trip._NEWLINE_When power is removed, the heating due to the leakage current will stop and the PPTC device will cool. 
As the device cools, it regains its original crystalline structure and returns to a low-resistance state where it can hold the current as specified for the device. This cooling usually takes a few seconds, though a tripped device will retain a slightly higher resistance for hours, slowly approaching the initial resistance value; how long this takes depends on how much power it has been passing and how often it has been tripped. Resetting will often not take place even if the fault alone has been removed while power is still flowing, as the operating current may be above the hold current of the PPTC. The device may not return to its original resistance value; it will most likely stabilize at a significantly higher resistance (up to 4 times the initial value). It could take hours, days, weeks or even years for the device to return to a resistance value similar to its original value, if at all._NEWLINE_A PPTC device has a current rating and a voltage rating. _START_SECTION_ Applications _START_PARAGRAPH_ These devices are often used in computer power supplies, largely due to the PC 97 standard (which recommends a sealed PC that the user never has to open), and in aerospace/nuclear applications where replacement is difficult. Another application for such devices is protecting audio loudspeakers, particularly tweeters, from damage when overdriven: by putting a resistor or light bulb in parallel with the PPTC device, it is possible to design a circuit that limits total current through the tweeter to a safe value instead of cutting it off, allowing the speaker to continue operating without damage when the amplifier is delivering more power than the tweeter can tolerate. While a fuse could also offer similar protection, if the fuse is blown, the tweeter cannot operate until the fuse is replaced. _START_SECTION_ Device trade names _START_PARAGRAPH_ These devices are sold by different companies under various trademarks, including Multifuse (Bourns), PolySwitch (TE Connectivity), and Polyfuse (Littelfuse). PolySwitch is the earliest product of this type, having been invented at Raychem Corporation (now TE Connectivity) and introduced in the early 1980s. Due to common availability, electronics engineers and technicians often refer to this device as a "polyswitch" or "polyfuse", in the generic sense, regardless of actual brand. + _START_ARTICLE_ Table Table _START_SECTION_ Expansion _START_PARAGRAPH_ The expansion of Table Table, as with Taybarns, slowed during the recession of 2009/10 as the company sought to consolidate its position. From April 2010 the company began expanding the Table Table brand further as it continued to be the most profitable restaurant brand relative to its number of sites. Although it made one of the smallest profits for Whitbread, Table Table was on average the second-best performing brand throughout 2009 on a site-average basis, behind Taybarns._NEWLINE_Table Table has continued this success, being the best-performing restaurant brand within Whitbread in 2013, 2014, 2015 and 2016._NEWLINE_The conversion of Brewers Fayres to Table Tables has now come to an end following a successful relaunch of the Brewers Fayre brand, and both brands are now expanding in their own markets. 
All new Table Table restaurants are new builds alongside new Premier Inns. _START_SECTION_ Future and trials _START_PARAGRAPH_ Despite the popularity of the brand, during summer–autumn 2012, a number of Table Table pubs were converted back to Brewers Fayre, such as The Brampton Hut and Howgate._NEWLINE_From 2014 a number of selected Table Table pub restaurants, including the Hobbs Boat in Weston, the Liskeard Tavern in Liskeard, The Carclaze in St Austell, The Roundstone in Littlehampton and The Globe Inn located in Christchurch, have been converted to a new brand called Whitbread Inn, harking back to the company's past._NEWLINE_Early 2017 saw some restaurants being moved to the Beefeater brand, including "Mosley Park" in Wolverhampton and "Liberty Bell" in Romford. + _START_ARTICLE_ Simon Duiker _START_PARAGRAPH_ Simon Liekele Eeltje Duiker (16 April 1874, Amsterdam – 6 March 1941, Amsterdam) was a Dutch painter._NEWLINE_Duiker lived and worked in Amsterdam while studying at its National Academy. He painted interior scenes and still-life paintings. His work depicts middle-class Dutch life, particularly men and women at work in the home._NEWLINE_Duiker was greatly influenced by the work of Jan Vermeer. Duiker, along with Jacques Snoeck and Gijbertus Jan Sijhoff, is considered one of the last great Dutch interior scene painters. Duiker's paintings are part of the Dutch National Collection (Rijkscollectie). + _START_ARTICLE_ Stierlitz _START_SECTION_ Character _START_PARAGRAPH_ In the universe of Seventeen Moments of Spring, Stierlitz is the cover name of a Soviet super-spy, Colonel Maxim Maximovich Isayev (Макси́м Макси́мович Иса́ев), whose "real" name is Vsevolod Vladimirovich Vladimirov (Все́волод Влади́мирович Владимиров)._NEWLINE_Stierlitz takes a key role in the SS Reich Main Security Office in Berlin during World War II, infiltrating the Ausland-SD (foreign intelligence) headed by Walter Schellenberg. Working deep undercover, Stierlitz tries to collect intelligence about the Germans' war plans and communicate it to Moscow. He receives instructions from Moscow on how to proceed, on one occasion traveling to Switzerland on a secret mission. He diverts the German nuclear "Vengeance Weapon" research program into a fruitless dead-end, thwarts separate peace talks between Nazi Germany, the United Kingdom and the United States, engages in intellectual games with members of the Nazi high command and sacrifices his own happiness for the good of his motherland. Despite being wracked with desire to return home to his wife, he subordinates his feelings to his duty, thus embodying an idealised Soviet vision of patriotism._NEWLINE_Stierlitz is quite the opposite of the action-oriented James Bond; most of the time he gains his knowledge without any Bond-style stunts and gadgets, while in the film adaptation of the stories the action is presented through a narrative voice-over by Yefim Kopelyan. He is presented in a deeply patriotic but non-ideological light, fighting to defend the Soviet motherland against external enemies rather than just defending the Communist government against its ideological opponents. _START_SECTION_ Influences in Russian life and culture _START_PARAGRAPH_ Although Stierlitz was a much-loved character, he was also the butt of a common genre of Russian jokes, often satirising his deductive trains of thought, with unexpected twists, delivered in the deadpan style of the voice-overs in the film adaptations; for example:_NEWLINE_Stierlitz approaches Berlin. 
The city is veiled in smoke from the fires. "Forgot to switch off the iron again," thought Stierlitz with slight irritation._NEWLINE_Stierlitz continues to be a popular character in modern Russia. Despite the fact that references and Stierlitz jokes still penetrate contemporary speech, Seventeen Moments of Spring is very popular mainly because it is quite patriotic. It is repeated annually on Russian television, usually around Victory Day. Stierlitz also continues to have a political significance. When his portrayer Vyacheslav Tikhonov died in December 2009, the Foreign Intelligence Service—one of the successor organisations of the former Soviet KGB—sent its condolences to his family. Ivan Zassoursky notes that Russian Prime Minister (and former and current President) Vladimir Putin, a former KGB agent, has been portrayed as "embod[ying] the image—very important for the Russian television audience—of Standartenführer von Stierlitz... If anyone missed the connection between Putin, who served in Germany, and von Stierlitz, articles in the press reminded them of the resemblance and helped create the association." The connection went both ways; Putin was strongly influenced by the novels, commenting: "What amazed me most of all was how one man's effort could achieve what whole armies could not."_NEWLINE_Stierlitz movies contributed a number of catchphrases, such as "Character: nordic, robust" (Характер — нордический, выдержанный, a personal characteristic, usually mocking or ironic). + _START_ARTICLE_ Thomas Hunter (New York politician) _START_PARAGRAPH_ Thomas Hunter (September 11, 1834 Baltimore, Maryland – March 11, 1903 Sterling, Cayuga County, New York) was an American businessman and politician from New York. _START_SECTION_ Life _START_PARAGRAPH_ He attended the common schools, worked on a farm, and then took part in the construction of the Manassas Gap Railroad. From 1857 to 1860, he engaged in the milling business in Sterling Valley. During the American Civil War he enlisted as a private in the 110th New York Volunteers, and finished the war as a captain. After the war he engaged in the lumber business, and then became a railroad contractor._NEWLINE_He was a member of the New York State Assembly (Cayuga Co., 1st D.) in 1881 and 1882._NEWLINE_He was a member of the New York State Senate (26th D.) from 1890 to 1893, sitting in the 113th, 114th, 115th and 116th New York State Legislatures. + _START_ARTICLE_ Ford Taunus P3 _START_SECTION_ High profile launch _START_PARAGRAPH_ The car received its public launch at the Beethoven Hall in Bonn. Unusually for a car launch, both the by now 84-year-old German chancellor, Konrad Adenauer, and the grandson of the firm's founder, Henry Ford, were present. In addition to the latest Ford Taunus, they were celebrating the thirtieth anniversary of Ford's Cologne plant. It had been on 2 October 1930 that Henry Ford and the then mayor of nearby Cologne, Konrad Adenauer, had laid the foundation stone for the Cologne Ford Plant. _START_SECTION_ European design _START_PARAGRAPH_ The first post-war Taunus models had been designed in North America. The Taunus P3 was designed by Uwe Bahnsen, a German born designer who would dominate car design at Ford of Germany for nearly thirty years and whose subsequent designs included the 1969 Ford Capri and its successors. Towards the end of his time in charge of design with Ford of Germany, Bahnsen also led the teams that designed the Fords Sierra and Scorpio. 
In the context of 1960 the Taunus P3 can nevertheless be seen as Bahnsen's most innovative design for a production car._NEWLINE_The 1960 Taunus design featured a recurring geometrical shape, which was a cross between a short sausage and a long lozenge. The rear panel and the side panels respected the same basic shape as did the front grill, subject to two large cut-outs for the headlights._NEWLINE_The same shape was carried over to the interior of the car, where the main dials and controls on the dash-board were surrounded by a thick frame in the same short-sausage (or very long lozenge) shape. The repetitious use of a single simple shape at different levels of the design gave the overall car a consistent visual unity which was in stark contrast to the high-finned flamboyance of the previous Taunus 17M and was seen at the time as a radical switch by Ford of Germany away from American styling in favour of European styling. There were no tail fins and there was very little decorative chrome included. The efficiency of its superficially much simpler design enabled Ford to boast that the 1960 car, despite being fractionally narrower on the outside, offered usefully more interior width than the car it replaced._NEWLINE_Despite the importance of sausages in German cuisine, the award for a catchy soubriquet was earned by the person who saw the car and was reminded not of a sausage but of a bathtub. It was and remains the "Badewanne" (bathtub) soubriquet that caught the eye of the press reporters, and it is as the "Badewannetaunus" that the car continues to be remembered by enthusiasts. _START_SECTION_ Technical Innovation _START_PARAGRAPH_ The P3 and the 1961 Citroën Ami were the first cars with rectangular or lozenge-shaped (non-round) headlights. This technical innovation was developed by lighting manufacturers Hella (Taunus) and Cibie (Ami). At the time, it was an unquestioned article of faith that headlights were round, and in the United States it was the law, so these new headlights were illegal there. Ten years later this had inspired European automakers to come up with various non-round headlamp shapes, though many had by 1970 settled on a standardised shared rectangular shape. _START_SECTION_ Body _START_PARAGRAPH_ Most of the cars were sold as two- or four-door sedans/saloons. A three-door "Turnier" station wagon was also available. The confident determination of the car's designers to celebrate the new decade with something new and different was reflected in the unusual positioning of the rear lights on the early station wagons, on the leading edge at the back of the roof of the car, as two red horizontal units lined up directly above the tailgate. Later P3 Turniers had their rear lights more conventionally positioned._NEWLINE_The P3 also followed the tradition of its predecessor in that coach-built two-door cabriolets and coupes were offered, converted by a traditional Cologne-based body builder called Karl Deutsch. However, these special-bodied cars appear to have been relatively expensive, and only about 150 were produced._NEWLINE_The cars were offered with an unusually broad choice of colour and interior trim options._NEWLINE_The early 1960s were a period of rapid expansion for the west European auto-industry, and export markets for the new 17M included Greece and Australia, where several cars were converted locally into "pickups" or, in Australian English, "utes". The 17M was also assembled in South Africa and Southern Rhodesia in right-hand drive. 
_START_SECTION_ Engine and running gear _START_PARAGRAPH_ The cars were all branded as Ford Taunus 17Ms which might have led observers who thought they had understood Ford Germany's naming conventions to conclude that the cars all came with 1.7 litre engines. In fact, there were three different engine sizes offered, being the 1498 cc unit first seen in the Taunus 15M of 1954, the 1698 cc unit originally introduced in 1957 to cope with the weight of the first Ford Taunus 17M and, from September 1961, a new larger 1757 cc engine. Power outputs initially ranged from 55 PS (40 kW; 54 hp) to 60 PS (44 kW; 59 hp), and these engine versions remained available throughout the model's four-year life, but several more powerful engines featuring raised compression ratios in response to the increased availability of higher octane fuels appeared during the four-year period: by 1964 the most powerful Ford 17M offered 75 PS (55 kW; 74 hp). In Sweden, the more powerful version first available in 1962 was sold as the "Taunus Sport", with the inclusion of a more lavish interior. Approximately 50% of the cars built were delivered with the smallest of the three engines, the 1498 cc unit._NEWLINE_The engines were all gasoline/petrol powered four-cylinder inline four-stroke water-cooled units._NEWLINE_Changing gear involved a column-mounted gear change, which by now was becoming increasingly mainstream in Germany. It was possible to specify a "Saxomat" automatic clutch with the three-speed transmission,: drivers content to accept a fully manual gear change system could also specify a four-speed gear box._NEWLINE_There were several important technical innovations during the four-year model run which no doubt go some way to explain the car's commercial success when compared to that achieved by its predecessor, and will have strengthened the Ford image in a market which had grown used to seeing Ford sales trailing those of General Motors’ Opel business. In April 1962 the 17M became the first mainstream production car in Germany to offer, as an option, disc brakes on the front wheels. Just over a year later front disc brakes became a standard fitting on all models. 1962 was also the year when the car acquired an "automatic starter" which reportedly made the traditional manual choke unnecessary. _START_SECTION_ Optional extras _START_PARAGRAPH_ Options available at extra cost included velour carpeting, grab handles for the passengers in the back, whitewall tires, a two tone paint finish and even a "make-up mirror" cleverly incorporated into the sun-visor on the passenger's side. _START_SECTION_ Commercial _START_PARAGRAPH_ The boldly styled and regularly upgraded Taunus P3 was a commercial success. 669,731 were produced. The figure includes 86,010 station wagons. In the sales statistics for several months of 1961/62 the success of the model even enabled Ford briefly to overtake Opel on the German market, becoming the second best selling auto-brand, beaten to the top spot only by Volkswagen._NEWLINE_The Taunus P3 was replaced by the Ford Taunus P5 which would come with a wider range of engines and which would sell at approximately the same rate. However, the overall market size was growing through the 1960s, and with it grew the sales of the Opel Rekord. After the Taunus P3 no future Taunus model would come close to challenging Opel’s dominance of the large lucrative middle-market portion of the German auto-market. 
_START_SECTION_ Fifty years on _START_PARAGRAPH_ The Taunus P3 continues to generate enthusiasm, and most of the surviving vehicles in Germany enjoy the financial privileges and responsibilities available to owners of cars designated and maintained as oldtimers (the term used in Germany for a classic car)._NEWLINE_In 2006, 484 Taunus P3 sedans/saloons were registered in Germany along with 21 "Turnier" station wagons. There were thought to be fewer than 10 in the USA, a relatively high proportion of registered and/or roadworthy vehicles in the Scandinavian countries, and a handful still surviving in other countries (for instance some of the countries in Eastern Europe and Latin America where it was marketed) where statistics are less readily accessible than in Germany. + _START_ARTICLE_ Deflowering (flowers) _START_PARAGRAPH_ Deflowering is a form of pruning that consists of removing flowers before they develop. It is similar to deadheading but stricter, as deadheading refers to the removal of faded flowers. Deflowering is usually performed on fruit-forming and seed-forming shrubs and trees in their first year. The aim is to prevent the plants from spending energy and nutrients on seed development before they establish themselves. Leaving some flowers and fruits to develop may be necessary for identification purposes. + _START_ARTICLE_ Ford railway station (Merseyside) _START_SECTION_ History _START_PARAGRAPH_ It opened for service on 1 June 1906 and closed on 2 April 1951. Passenger trains then only ran once a year on this line, transporting passengers for the Grand National, although this service also ceased in 1956. Demolition of Ford station was completed on 1 May 1959. _START_SECTION_ Reopening proposals _START_PARAGRAPH_ This section of the line still exists, although it has no passenger services running and is no longer electrified, with the only trains running being for engineer access to the Ormskirk line. Plans to open this section as part of Merseyrail's Northern Line have been put forward in Sefton's transport plan, with the first details to emerge about its possible reopening being published by the media on 28 February 2008. + _START_ARTICLE_ Hispanics and Latinos in Texas _START_PARAGRAPH_ Hispanic and Latino Texans are residents of the state of Texas who are of Hispanic or Latino ancestry. As of the 2010 U.S. Census, Hispanics and Latinos of any race were 38.2% of the state's population. Moreover, the U.S. Census shows that the 2010 estimated Hispanic population in Texas was 9.7 million and increased to 11.1 million in 2017 with a calculated 18% change from the 2010 Hispanic population estimate._NEWLINE_Tejano or Texano (Spanish for "Texan") is a term used to identify a Texan of Criollo Spanish or Mexican heritage. _START_SECTION_ Origins _START_PARAGRAPH_ The first European to see Texas was Alonso Álvarez de Pineda, who led an expedition for the governor of Jamaica, Francisco de Garay, in 1520. While searching for a passage between the Gulf of Mexico and Asia, Álvarez de Pineda created the first map of the northern Gulf Coast. This map is the earliest recorded document of Texas history. Moreover, the area of present-day Texas was claimed by Spain at this time._NEWLINE_Years later, in June 1527, an expedition led by Pánfilo de Narváez and Álvar Núñez Cabeza de Vaca set out to reach Florida and build a city, but the mission failed due to harsh weather and disease. 
Instead, the Spanish explorers were left shipwrecked off the coast of Texas, where they lived for around six years. After the years spent living in Texas among Indigenous civilization, Narváez and Cabeza de Vaca, along with some of their men, found their way back to Mexico City in 1536 and told stories about the extravagances witnessed in the north. Learning about this, the Spanish set out due north in 1539 with the purpose of discovering riches in places yet to be explored. One of the primary motives for the excursions was the discovery of gold._NEWLINE_The excursion of the Spanish in 1539 into the north, or what is today Texas, New Mexico, and Arizona, was led by the Spanish conquistador Francisco Vázquez de Coronado. On July 7, 1540, Coronado's army reached the outskirts of the rumored city of much gold, Cibola, near the upper Rio Grande, where the Spanish encountered massive resistance from Puebloans. The violence between the Spanish and the Puebloans continued at Cibola until the Puebloan soldiers inhabiting Cibola were forced to withdraw to a village where their wives and children had moved for shelter._NEWLINE_When the fighting settled, Coronado decided to explore the land more extensively, and one of the expeditions arrived in Texas in 1541, where they encountered groups of people from the Caddo tribe, leading to more events of violence. In the end, Coronado returned to New Spain in April 1542 and reported on the cruel reality of the cities in the north that were explored, describing them as not having any gold or silver. Soon after this, the Spanish decided to remain away from the north, or the present-day southwest region of the United States, for approximately 150 years, though expeditions led by Spaniards and not authorized by Spain did take place within those years. Until 1688, Spain essentially remained out of Texas._NEWLINE_Around 1688, the Spanish learned about French interventions occurring in the area of Texas, land that had already been claimed by Spain. This led to the actions taken by the Spaniard Alonso de León, the then governor of Coahuila, to march into Texas towards Fort St. Louis. Fort St. Louis was the location where the French had set themselves up. In April 1689, Alonso de León arrived with his army ready to take down the French fort and looking for any remaining French in the area. During the time there, de León was informed by some of the French he located that the Karankawa people had attacked them and left the fort in ruins, forcing the French to flee. A year after going back to New Spain, de León returned to Texas because he was concerned about the French returning to Spanish territory. Spanish activity in Texas remained minimal and only resumed when the French attempted to intervene._NEWLINE_In 1690, when de León returned to Texas, he had with him an army of about 100 men made up of soldiers and priests and built the first church in Texas, named San Francisco de los Tejas. The construction of this church was a major stepping stone for Spain, as Spanish Texas was headed to become an area of greater importance for Spain. After San Francisco de los Tejas was established, the construction of many more missions followed, such as Mision Nuestra Senora del Rosario and Nuestra Senora del Refugio. A year later, in January 1691, Domingo Terán de los Ríos was appointed to be the governor of Spanish Texas. Throughout the construction of various churches, the Spanish had interactions with different Indigenous groups. 
Soon enough, interracial marriages led to the development of different races such as mestizos, criollos, and culebras/mulattos. This led to the development of the caste system in Texas and throughout the southwest United States. During this time, Spain faced problems with the French, with the Natives, and also internally. As the years passed, other events, such as the American Revolution in 1775, led to more problems in Texas. Soon enough, Spain would have to face the ever-growing United States and the Mexican population while having problems with the Natives and the French._NEWLINE_Mexico declared its independence from Spain on September 16, 1810, and the war ended on September 26, 1821. With Mexico's independence from Spain, Texas became the property of Mexico. Around this time, the United States had obtained massive amounts of land from France through the Louisiana Purchase in 1803. In addition, under Mexican law, Texas was available for anyone to move to, and land grants were offered to empresarios. During this time, the population of Texas grew quickly. The population was not only Mexican but also included United States citizens, Native Americans and enslaved people. When people residing in Texas disagreed with Mexican law and did not follow it, Mexico ended all immigration into Texas. Such events led to Texas independence, which then led to the annexation of Texas and then to the Mexican–American War._NEWLINE_On February 2, 1848, the peace treaty, the Treaty of Guadalupe Hidalgo, was signed between Mexico and the United States; it essentially gave the United States much of the land that Mexico had owned in the north and established the Rio Grande as the border between Texas and Mexico. Moreover, Hispanics and Latinos already living in the territory that became part of the United States were given the opportunity to stay and obtain United States citizenship. While many chose to leave for their home country, many also decided to stay._NEWLINE_The major immigration of Mexicans into Texas began during the 1890s, as the growth and industrialisation of Texas created a plethora of jobs. _START_SECTION_ Demographics _START_PARAGRAPH_ Hispanics (of any race) were 7.1% of the state population in 1910. As of 2010, 45% of Texas residents had Hispanic ancestry; these include recent immigrants from Mexico, Central America, and South America, as well as Tejanos, whose ancestors have lived in Texas since as early as the 1700s. Tejanos are the largest ancestry group in southern Duval County and among the largest in and around Bexar County, including San Antonio, where over one million Hispanics live. The state has the second largest Hispanic population in the United States, behind California._NEWLINE_Hispanics dominate southern, south-central, and western Texas and form a significant portion of the residents in the cities of San Antonio, Dallas, Houston, and Austin. The Hispanic population contributes to Texas having a younger population than the American average, because Hispanic births have outnumbered non-Hispanic white births since the early 1990s. In 2007, for the first time since the early nineteenth century, Hispanics accounted for more than half of all births (50.2%), while non-Hispanic whites accounted for just 34%._NEWLINE_Steve Murdock, a demographer with the Hobby Center for the Study of Texas at Rice University and a former director of the U.S. 
Census Bureau, predicted that, between 2000 and 2040 (assuming that the net migration rate will equal half that of 1990-2000), Hispanic public school enrollment will increase by 213 percent, while non-Hispanic white enrollment will decrease by 15 percent. As of 2010, 29.21% (6,543,702) of Texas residents age 5 and older spoke Spanish at home as a primary language. _START_SECTION_ Spanish language in Texas _START_PARAGRAPH_ In Texas, English is the state's de facto official language (though it lacks de jure status) and is used in government. However, the continual influx of Spanish-speaking immigrants has increased the importance of Spanish in Texas. Texas's counties bordering Mexico are mostly Hispanic, and consequently, Spanish is commonly spoken in the region. The Government of Texas, through Section 2054.116 of the Government Code, mandates that state agencies provide information on their websites in Spanish to assist residents who have limited English proficiency. _START_SECTION_ Origins _START_PARAGRAPH_ From 1915 to 1919, during the Mexican Revolution, Mexicans and Tejanos in South Texas faced increased violence from Texas Rangers. Due to tensions caused by changes in both governments and the border, people of Latino descent were hanged, shot, burnt, decapitated, and tortured. A Texas legislative investigation ended this period of violence by finding the Texas Rangers guilty. More recently, the Texas government has acknowledged this period of history with the "Life and Death on the Border, 1910 to 1920" exhibit._NEWLINE_Anti-Latino attitudes spiked during the Great Depression. Latinos, among other foreigners, were accused of stealing jobs from Americans and contributing to the decline of the economy. In response to the growing Anglo-American frustration, the United States government forcibly removed 2 million Latinos, the majority of them American citizens. During these repatriations, local governments denied aid to those of Mexican descent, offered train fares to Mexico and raided Latino communities. Hospitals removed Latinos with disabilities and illnesses while employers laid off Latino workers. To avoid raids and discrimination, many Latinos returned to Mexico voluntarily. By 1936, approximately one third of Texas's Latino population had been repatriated._NEWLINE_Such sentiments had earlier heightened in the 1840s with the end of the Mexican-American War and the signing of the Treaty of Guadalupe Hidalgo. The increased population of Latinos was met with further illegal deportations, violence, racism, and segregation. One instance of these reactions was the Olvera Street raid of 1931. During this raid, law enforcement and immigration agents arrested and deported nearly 400 Mexican-Americans regardless of their citizenship or immigration status in America. _START_SECTION_ Mob Violence _START_PARAGRAPH_ Anti-Latino sentiments grew during the California Gold Rush as many Latinos demonstrated more advanced mining skills than their white counterparts. From the Gold Rush era of the mid-19th century to the early 20th century, mob violence against Spanish-speaking individuals became a common occurrence, and the number of victims reached well into the thousands. During this period, Texas Rangers carried out lynchings of Hispanic men, women, and children for accusations that included cattle theft, murder, witchcraft, and even refusal to play the fiddle. 
Some case studies included the 1880 burning of Refugio Ramírez and his family by a mob in Collin County, North Texas, for the alleged bewitching of neighbors. Another event was the Porvenir Massacre of 1918, which involved the seizure and killing of 15 men and boys from the village of Porvenir in Presidio County, Texas. Although Texas Rangers justified the murders by accusing the people of being "thieves, spies and murderers", the United States Army's and the State Department's investigations found that the denizens of Porvenir were unarmed and innocent. As a result, the Texas state government began an investigation of the Texas Rangers. _START_SECTION_ Environmental Racism _START_PARAGRAPH_ With a high number of chemical industries and facilities, various neighborhoods within Houston are susceptible to toxic air pollution. The communities closest to these environmentally hazardous spaces are low-income communities of color. Located in East Houston, Harrisburg/Manchester and Galena Park are the two communities with the closest proximity to Risk Management Plan (RMP) facilities, or facilities that use certain hazardous substances._NEWLINE_Both Harrisburg/Manchester and Galena Park are largely made up of impoverished Latino communities with average household incomes of $49,732 and $45,431, respectively. Due to their close proximity to RMP facilities, the people of these neighborhoods are at a 24 to 36 percent higher risk of getting cancer when compared to the predominantly white neighborhoods of Houston. Harrisburg/Manchester is geographically centered in the middle of "21 Toxic Release Inventory (TRI) reporting facilities, 11 large quantity generators of hazardous waste, 4 facilities that treat, store or dispose of hazardous waste, 9 major dischargers of air pollutants, and 8 major stormwater discharging facilities". An average of 484,000 pounds of toxic chemicals is released into the Harrisburg/Manchester air, while none is released in communities with an average household income of $226,333 and a poverty rate of 3 percent. _START_SECTION_ School Segregation _START_PARAGRAPH_ Spanning from the 1890s to the 1980s, 122 school districts throughout 59 counties established segregated schools for Mexican-Americans. These poorly developed schools lacked an adequate schooling environment. Teachers possessed no credentials or experience, while the classrooms lacked the necessary equipment. School administrators often placed Tejano students into 'low-track' classes. By assessing Tejano students on biased rubrics that evaluated mental, emotional, and language abilities, school officials classified Tejano students as inferior and underdeveloped. Beginning with elementary schools, administrators assigned Tejano children to low-level and nonacademic courses, aiming to lead the students to vocational or general-education courses. Due to unequal educational platforms, disregard for Tejano culture, and linguistic intolerance, Hispanic students had higher withdrawal rates and lower academic performance. + _START_ARTICLE_ So You Won't Squawk _START_SECTION_ Plot _START_PARAGRAPH_ Handyman Eddie (Buster Keaton) is mistaken for gangster Louie the Wolf (Eddie Fetherston). Louie encourages this deception and lets rival gangster Slugger McGraw (Matt McHugh) think Eddie is him. Slugger attempts to kill Eddie many times. After one final attempt a car chase ensues, with Eddie throwing various items out the window to get the attention of the police. 
_START_SECTION_ Production _START_PARAGRAPH_ The chase scene is a reworking of Buster's final chase from his feature Le Roi des Champs-Élysées (1934). + _START_ARTICLE_ Rehearsal letter _START_PARAGRAPH_ A rehearsal letter is a boldface letter of the alphabet in an orchestral score, and its corresponding parts, that provides the conductor, who typically leads rehearsals, with a convenient spot to tell the orchestra to begin at places other than the start of movements or pieces. Rehearsal letters are most often used in scores of the Romantic era and onwards, beginning with Louis Spohr. Rehearsal letters are typically placed at structural points in the piece. _START_SECTION_ Terminology _START_PARAGRAPH_ They may also be generically called rehearsal marks or rehearsal figures, or, when numbers are used instead of letters, rehearsal numbers. _START_SECTION_ Purpose _START_PARAGRAPH_ In the course of rehearsing a symphony or piece, it is often necessary for the conductor to stop and go back to some point in the middle, in order to master the more difficult passages or sections, or to resolve a challenge that the ensemble is having. Many scores and parts have bar numbers, every five or ten bars, or at the beginning of each page or line. But as pieces and individual movements of works became longer (extending to several hundred bars) as the Romantic era progressed, bar numbers became less practical in rehearsal._NEWLINE_For example, a conductor can tell their musicians to resume at bar 387, so that the musicians have to find the nearest bar number in their parts (e.g. 385 or 390) and count back or forward a couple of measures. Even if the number 387 is written at the appropriate bar, it might not particularly stand out. But if there is, for example, a big, bold letter M in the score and parts, it is much easier for the conductor to just say "begin at letter M". Even if the conductor were to say "one bar before letter M", that would still be more convenient than saying "bar 386". Alternatively the conductor could first say "before M..." and allow the players time to find M and then say "one bar"._NEWLINE_In the score of a full orchestra, rehearsal letters are typically placed over the flutes' (or piccolo's) staff, and duplicated above the first violins' staff. For concert bands, rehearsal letters are placed over the piccolo's staff (or flutes'), and over the trumpets'. Rehearsal letters should appear in every part, but the conductor or librarian should check this and also make sure that they agree with the conductor's score; if they do not, the letters from the parts should be copied to the conductor's score. For typical pieces or movements of the Romantic era marked allegro, the letters A to Z can be used up, though the letters I, J or O (or all) may be skipped._NEWLINE_Placement and frequency of the letters do not follow a hard-and-fast rule. Generally they are inserted at places where there is a musically significant change, for example a new theme, or a change in dynamic or instrumentation or the start of a new section – just those places where a conductor might want to restart in rehearsal. In addition, having the letters coincide with musical signposts can help players who are counting rests confirm they are still in the right place, which would not be possible if the marks were placed at numerically regular intervals._NEWLINE_The letter A is almost always used for a point close to the beginning, but not for the very beginning itself because it is much easier to say "from the beginning". 
Likewise, rehearsal letters are not necessary at changes in tempo, key signature or time signature, as the name of the new tempo or signature can serve the same purpose. For example, in some editions of Beethoven's Ninth Symphony, letter A of the Finale does not occur until bar 140, when the relatively late entry of the first violins with the "Ode to Joy" theme might not stand out enough to the other players to be a convenient point of reference, whereas the reminiscences of the previous movements' themes are more easily referenced by their tempo markings._NEWLINE_A rehearsal letter usually breaks a multimeasure rest in a part (except in cases where a given instrument does not play at all in a given movement of the work). Because rehearsal letters are sometimes independent of edition and in some cases even version, they are also useful for telling applicants for positions in the orchestra what passages they need to play at the audition. They are also useful for easy reference in scholarly essays about orchestral works. However, rehearsal letters are altogether absent from some editions of some pieces that have them in other editions, such as the older editions of Richard Wagner's Meistersinger prelude._NEWLINE_Rehearsal letters are less useful in unaccompanied instrumental music such as the solo piano repertoire (although they may be used in duets), since the instrumentalist has no need to communicate to a fellow player where to resume playing. Songs also tend not to use them, because it is more useful to refer to the lyrics (except in pieces where the lyrics are highly repetitive, or those with long lyric-less sections). _START_SECTION_ Usage in the late 19th century to 21st century _START_PARAGRAPH_ In some cases, A to Z might not be enough. After Z, Aa may be used, followed by Bb, and so on until Zz (though Ii, Jj and/or Oo might also be skipped). The Wilhelm Hansen edition of Jean Sibelius's Symphony No. 7 in C major presents one unusual case: the letters A to Z (including both I and J, as well as O) are used up with just three more pages left in the score. For the final flute and bassoon solo, the editors use Ö (the final letter of the Finnish alphabet) as a rehearsal letter._NEWLINE_But in the case of some composers, such as Gustav Mahler and Dmitri Shostakovich, twice through the alphabet might still not be enough. For this reason, some editors prefer rehearsal numbers to rehearsal letters. Mahler's and Shostakovich's scores use rehearsal numbers rather than letters. These are typically in boldface and enclosed in a box, or less commonly, a circle. Confusingly, however, some editions enclose bar numbers in boxes, though usually not in bold. In the Schirmer edition of Roy Harris's Symphony No. 3 (in one movement), the rehearsal numbers are enclosed in circles, and they occur every ten measures, actually being the bar number divided by 10. That rehearsal numbers "are easily confused with measure numbers" is a reason sometimes given in favor of rehearsal letters._NEWLINE_Advocates of rehearsal numbers counter that even 26 letters are not enough for some scores. Whereas rehearsal letters reset to A for each movement of a multi-movement work (even for connected movements), rehearsal numbers typically run over the course of the entire work, even if the movements are not connected. For example, the rehearsal number for the last few bars of the first movement of Edward Elgar's First Symphony is 55; the first rehearsal number of the second movement is 56. There are exceptions, however. 
The final outburst in the first movement of Mahler's Second Symphony is rehearsal number 27. Mahler actually wanted a pause of five minutes before the next movement, so the rehearsal numbers reset to 1, ending with 15. The third movement follows after a short break, but its first rehearsal number is 28. _START_SECTION_ Jazz and pop _START_PARAGRAPH_ For jazz and pop compositions with several choruses, "many jazz composers and arrangers" use a format in which "each successive verse/chorus part of the form is assigned successive letters of the alphabet" combined with a measure number: for example, letter A for the first 8-bar phrase of the verse after the introduction, A9 for the next 8-bar phrase, A17, A25, then B, B9, B17, B25 for the chorus, etc., with the special rehearsal marking TAG for the tag ending. In jazz and pop music, the musicians frequently refer to the "A section" or the "B section" of a 32 bar song during rehearsals. In pop music, the music is commonly organized into standard sections, such as an intro, multiple verses and choruses (refrain), one or more bridges, a guitar solo (or other instrumental solos), and an outros. As such, a bandleader who wishes to start in the middle of a song will typically specify which part of this structure the band should start on (e.g., "last four bars of the bridge, going into the guitar solo" or "last verse and go to the outro"). + _START_ARTICLE_ Joseph Imre _START_PARAGRAPH_ Joseph B. Imre is a historian, political scientist, researcher, and business entrepreneur. Joseph is the proprietor of Seasons Fine Foods and Cookery School. Joseph has been elected to the Board of Directors of the Downtown Napanee Business Improvement Area (BIA) (2019-2023), is a member of the Napanee & District Chamber of Commerce, and a volunteer at the L&A County Museum and Archives. An active member of the Canadian-Hungarian community, Imre has served as the President and founder of the Hungarian Students' Association at the University of Toronto (2002–2005); Vice-President of the Albert Apponyi Association (2000–2010); and on the Board of Directors of several Hungarian organizations including the National Alliance of Hungarians in Canada (Kanadai Magyarok Országos Szövetsége). In 2007, Mr. Imre was awarded the Order of the Knights Cross from the 1956 Hungarian Freedom Fighters Association in Hungary. Mr. Imre is a member of the Friends of Hungary Foundation._NEWLINE_Imre graduated from the University of Toronto with an Honours Bachelor of Arts in history and political science in 2005. Upon completion of postgraduate studies at the University of Oxford (2006) he completed a master's degree in history from the University of Bristol (2007) with distinction. In 2009, Joseph completed a graduate diploma in comparative politics from the London School of Economics. Imre is a member of the Project Management Institute (PMI) where he holds additional certification in project management._NEWLINE_Mr. Imre has written extensively on medieval, early modern, and modern European history. While his academic focus is primarily Renaissance studies and the religious history of 15th century Italy, Mr. Imre has also published widely on 20th century Hungarian historiography and the issue of Hungarian minorities in the Carpathian Basin. His graduate dissertation on Girolamo Savonarola linked the controversial figure to humanistic elements in Renaissance society and challenged existing scholarship. In his capacity as a historian, Mr. 
Imre regularly contributes to a number of newspapers and online blogs as a columnist and contributor. Mr. Imre is an editor with the NAHC._NEWLINE_Mr. Imre and his wife, Mrs. Jazmin Bansagi, also own and operate a farm in Lennox and Addington County in eastern Ontario. Seven Fields Farm & Orchard was founded on organic principles and sustainable approaches to stewardship that protect the land and environment. Seven Fields Farm & Orchard operates a walnut and apple orchard, hay production, and the raising of beef cattle in co-operation with other farms in the area. + _START_ARTICLE_ Vorlage (ski hill) _START_PARAGRAPH_ Vorlage is a ski hill located at the village of Wakefield, within the municipality of La Pêche, Quebec, in the Gatineau Hills north of Ottawa, Ontario. It consists of 18 runs, 13 of which have night skiing. It was opened in 1941 and has a terrain park. + _START_ARTICLE_ Cauldcots railway station _START_SECTION_ History _START_PARAGRAPH_ The station opened in 1883 by the North British, Arbroath and Montrose Railway. The station closed to both passengers and goods traffic on 22 September 1930. + _START_ARTICLE_ Paolo Brescia _START_PARAGRAPH_ Paolo Brescia is an Italian architect and founder of Open Building Research. He graduated with a degree in architecture from the Politecnico di Milano in 1996 and had his academic fellowship at Architectural Association in London._NEWLINE_After working with Renzo Piano, he founded in 2000 OBR with Tommaso Principi to investigate new ways of contemporary living, creating a design network among Milan, London, Mumbai and New York._NEWLINE_He combines his professional experience with the academic world as guest lecturer in several athenaeums, such as Accademia di Architettura di Mendrisio, Kent State University, Aalto University, University of Oulu, Academy of Architecture of Mumbai, College of Architecture of Pune, Mimar Sinan Fine Art University, Hacettepe University. He was university professor in charge at the Faculty of Industrial Design at the Politecnico di Milano (2004-2005) and professor of architectural design at University of Genoa (2013-2015)._NEWLINE_With OBR his projects have been featured in international exhibitions, including at X Biennale di Architettura, Venice 2006; Architecture: Where to, London 2007; V Bienal de Arquitetura, Brasilia 2007; XI Bienal Internacional de Arquitectura, Buenos Aires 2007; AR Award Exhibition, Berlin 2008; China International Architectural Expo, Beijing 2009; International Expo, Shangai 2010; UIA 24th World Congress of Architecture, Tokyo 2011; Energy at MAXXI, Rome 2013; Italy Now, Bogotá 2014; Small Utopias, Johannesburg 2014, XIV Biennale di Architettura, Venice 2014, Triennale di Milano, Milan 2015 and Cooper Hewitt Smithsonian Design Museum, New York. _START_SECTION_ Working life _START_PARAGRAPH_ From 1998, Brescia worked with Renzo Piano till the establishment of OBR Open Building Research in 2000._NEWLINE_With OBR, he won several design competitions, such as: Pythagoras Museum (2003), Galleria Sabauda in the Real Palace in Turin (2003), Milanofiori Residential Complex (2005), Ex Cinema Roma in Parma (2006), Galliera Hospital in Genoa (2009), Polo di Funo in Bologna (2009), Fair of Messina (2010), Cesme Waterfront (2012), Via XX Settembre in Genoa (2012), Santa Margherita Ligure Waterfront (2013), Michelin HQ and Research Labs in Delhi (2014), Terrazza Triennale in Milan (2014), Parco Centrale di Prato (2015), Comparto Stazioni Varese (2016). 
+ _START_ARTICLE_ Bryan Lugo _START_SECTION_ Career _START_PARAGRAPH_ Bryan Lugo played Officer Burton in the IFC Films and La Petite Reine film Maniac. The film premiered at the 2012 Cannes Film Festival, out of competition. Lugo plays opposite Elijah Wood in the film, which premiered to a limited distribution in theaters on January 2, 2013. His other films include Afternoon Delight, which premiered in the 2013 Sundance Film Festival, opposite Kathryn Hahn, Jane Lynch, and Juno Temple; I Am Gangster (2015); and Marvel Studios' Ant-Man and the Wasp, opposite Tip "T.I." Harris, David Dastmalchian, and Walton Goggins, which premiered on July 6, 2018, worldwide in theaters. + _START_ARTICLE_ Feudalism in England _START_SECTION_ Origins of Feudalism _START_PARAGRAPH_ The word feudal derives from an ancient Gothic source faihu signifying simply "property" which in its most basic sense was "cattle". This can be compared to the very ancient classical Latin word pecunia, which means both cattle and money. Many societies in existence today demonstrate the traditional use of cattle as financial currency, for example the Masai of Kenya and Tanzania, who pay dowries in this form._NEWLINE_Because feudalism was in its origin a Teutonic or Gothic system from northern Europe untouched by Roman civilization, it did not exist in ancient Rome, where the nearest equivalent was clientelism. No classical Latin word therefore exists to signify it, and a new Low-Latin word feodum was invented by mediaeval European scribes to use in their Latinised charters and other writings. _START_SECTION_ Classic English feudalism _START_PARAGRAPH_ Under the English feudal system, the person of the king (asserting his allodial right) was the only absolute "owner" of land. All nobles, knights and other tenants, termed vassals, merely "held" land from the king, who was thus at the top of the "feudal pyramid". When feudal land grants were of indefinite or indeterminate duration, such grants were deemed freehold, while fixed term and non-hereditable grants were deemed non-freehold. However, even freehold fiefs were not unconditionally heritable—before inheriting, the heir had to pay a suitable feudal relief._NEWLINE_Below the king in the feudal pyramid was a tenant-in-chief (generally in the form of a baron or knight) who was a vassal of the king, and holding from him in turn was a mesne tenant (generally a knight, sometimes a baron, including tenants-in-chief in their capacity as holders of other fiefs) who held when sub-enfeoffed by the tenant-in-chief. Below the mesne tenant further mesne tenants could hold from each other in series. The obligations and corresponding rights between lord and vassal concerning the fief form the basis of the feudal relationship. _START_SECTION_ Vassalage _START_PARAGRAPH_ Before a lord could grant land (a fief) to a tenant, he had to make that person a vassal. This was done at a formal and symbolic ceremony called a commendation ceremony composed of the two-part act of homage and oath of fealty. During homage, the lord and vassal entered a contract in which the vassal promised to fight for the lord at his command, whilst the lord agreed to protect the vassal from external forces, a valuable right in a society without police and with only a rudimentary justice system._NEWLINE_The word fealty derives from the Latin fidelitas and denotes the fidelity owed by a vassal to his feudal lord. Fealty also refers to an oath which more explicitly reinforces the commitments of the vassal made during homage. 
Such an oath follows homage. Once the commendation was complete, the lord and vassal were now in a feudal relationship with agreed-upon mutual obligations to one another. The vassal's principal obligation to the lord was the performance of military service. Using whatever equipment the vassal could obtain by virtue of the revenues from the fief, the vassal was responsible to answer to calls to military service on behalf of the lord._NEWLINE_This security of military help was the primary reason the lord entered into the feudal relationship, but the vassal had another obligation to his lord, namely attendance at his court, whether manorial, baronial or at the king's court itself in the form of parliament. This involved the vassal providing "counsel", so that if the lord faced a major decision he would summon all his vassals and hold a council. On the manorial level this might be a fairly mundane matter of agricultural policy, but also included the handing down by the lord of sentences for criminal offences, including capital punishment in some cases. Concerning the king's feudal court, the prototype of parliament, such deliberation could include the question of declaring war. Depending on the period of time and location, feudal customs and practices varied, see examples of feudalism. _START_SECTION_ Varieties of feudal tenure _START_PARAGRAPH_ Under the feudal system several different forms of land tenure existed, each effectively a contract with differing rights and duties attached thereto. The main varieties are as follows: + _START_ARTICLE_ Fuso Maru _START_SECTION_ Construction _START_PARAGRAPH_ Fuso Maru was laid down in 1907 at the Barcay Curle Co. Ltd. shipyard in Glasgow, Scotland, United Kingdom. She was launched on 19 March 1908 and was completed in February 1909. She was built for the Russian East Asiatic Steamship Company and was named Russia. She was renamed Fuso Maru when she was bought by the Japanese company Osaka Shosen K. K. - OSK Line on 24 December 1923._NEWLINE_Fuso Maru was 144.78 metres (475 ft 0 in) long, with a beam of 17.53 metres (57 ft 6 in) and a depth of 11.2 metres (36 ft 9 in). The ship was assessed at 8,596 GRT. She had two triple expansion steam engines rated at 7,113 ihp (5,304 kilowatts) and driving two screws. She had two funnels and four masts. _START_SECTION_ Pre-World War II career _START_PARAGRAPH_ As Russia, the ship completed her maiden voyage from Libau, Russia, to New York City, United States, on 2 June 1909, and her last voyage on 26 June 1914. She was then laid up at Kronstadt, Russia, until 1917, when she was renamed Rossija and later Russ. In 1921 she was transferred to the Baltic American Line and renamed Latvia. She started service on the Libau–Danzig–Halifax–New York City route on 11 July 1921. Her ninth and last transatlantic voyage started on 7 February 1923. She then was sold to Osaka Shosen Kaisha of Japan on 24 December 1923 and renamed Fuso Maru. Two of her masts were removed at this time. Fuso Maru then served two different companies under four different names before finally being purchased by the Japanese Company Osaka Shosen K. K. - OSK Line._NEWLINE_Fuso Maru operated on the Kobe, Japan–Kirun, Taiwan route from 18 July 1924 until March 1934. She then provided service on the Kobe–Dairen, Manchukuo, route from March 1934 until November 1941. She had accommodation for 42 first-class, 56 second-class, 212 third-class, and 1,414 fourth-class passengers, and had a crew of 144. 
_START_SECTION_ World War II career _START_PARAGRAPH_ In November 1941, the Imperial Japanese Army chartered Fuso Maru for use as a troopship. She was most likely painted grey overall and armed with a suite of antiaircraft guns at this time._NEWLINE_Fuso Maru participated as a troopship in Operation "E", the Japanese invasion of Malaya beginning on 13 December 1941. In late December 1941, she was rerated as a hospital ship. She was most likely disarmed because of the international prohibition against hospital ships carrying armament, and she was painted white with a green horizontal stripe and red crosses on her sides and funnel._NEWLINE_Shortly after sunrise on 15 April 1943, Allied aircraft attacked Fuso Maru three times near the Shortland Islands at (03°33′S 152°20′E). Fuso Maru returned to service as a troopship later in 1943 and was repainted overall grey and again armed with antiaircraft guns. _START_SECTION_ Sinking _START_PARAGRAPH_ On 31 July 1944, Fuso Maru was part of Convoy MI-11, which consisted of 23 ships, including the tankers Koei Maru, Taketoyo Maru, Shichiyo Maru, Ayagumo Maru, Harima Maru, and Ogura Maru No. 1 and the cargo ships and troopships Fuso Maru, Ayayuki Maru, Yoshino Maru, Miho Maru, Enoshima Maru, Manko Maru, Hachijin Maru, Dakar Maru, Teiritsu Maru, Fukuju Maru, and Banshu Maru No. 16, escorted by the destroyer Shiokaze, the escort ship Shimushu, the minesweepers W-38 and W-39, the submarine chaser CH-55, and the auxiliary gunboat Kazan Maru. While the convoy was proceeding from Moji, Japan, to Miri, Borneo, it was attacked in the South China Sea 280 nautical miles (520 km) northwest of Cape Mayraira, Luzon, by a United States Navy submarine wolfpack patrolling the Luzon Strait under the command of Captain (later Rear Admiral) Lewis S. Parks. The wolfpack consisted of Lieutenant Commander (later Vice Admiral) Lawson P. Ramage's USS Parche (SS-384), Lieutenant Commander (later Captain) David L. Whelchel's USS Steelhead (SS-280), and Lieutenant Commander John C. Martin's USS Hammerhead (SS-364)._NEWLINE_At 3:32 AM, Parche torpedoed and sank Koei Maru with four torpedoes. Although she was carrying a unit of 1,050 Imperial Japanese Army troops, the casualties aboard her were relatively light; about 150 troops and nine crewmen were killed. About the same time, the tanker Ogura Maru No. 1 was hit by two torpedoes, killing five men, but she did not sink. At 3:40 AM, Parche torpedoed and sank Yoshino Maru with four torpedoes; she carried down 2,442 of the 5,063 Imperial Japanese Army troops she was carrying, as well as 18 gunners, 35 crewmen, and 400 cubic meters (14,120 cubic feet) of ammunition. At 4:20 AM, Steelhead hit Dakar Maru with two torpedoes, killing six men, but Dakar Maru did not sink and quickly beached herself._NEWLINE_Aboard Fuso Maru, 40 men were assigned to duty as lookouts, including Imperial Japanese Army artillerymen and infantrymen. At 4:55 AM, one lookout spotted a torpedo approaching the ship and her captain ordered her rudder turned hard to port, but it was too late. Steelhead's torpedo hit Fuso Maru's engine room on the port side of the ship. Fuso Maru bucked and trembled from the explosion and the blast blew upwards, destroying several lifeboats that were on deck. Fuso Maru took on a 25-degree list to port in heavy seas when the order to abandon ship was issued. 
The ground vehicles carried as deck cargo broke loose and fell onto men swimming in the water._NEWLINE_At 5:10 AM, Fuso Maru sank only 15 minutes after the torpedo hit, taking down 1,316 of the 4,500 troops aboard. Seventy men of the 2nd Company, Sixth Aviation Signal Regiment, 12 other passengers, and 22 crew members also perished, bringing the death toll to 1,384 people. A cargo consisting of food and medical supplies, oil, trucks, 36 railway carriages, and 1,120 tons of other military supplies also was lost._NEWLINE_At 5:14 AM, Parche torpedoed and sank Manko Maru. She carried several hundred Imperial Japanese Navy personnel, 17 crewmen, about 20 gunners, and a cargo of ammunition down with her. Altogether, four of the 23 ships of Convoy MI-11 sank and two were damaged. The ships took down several thousand military personnel, gunners, and crewmen, as well as their cargoes of ammunition and other supplies. Thousands of troops were left floating in the waters of the Balintang Channel. _START_SECTION_ Wreck _START_PARAGRAPH_ The wreck of Fuso Maru lies at (18°51′N 122°55′E). Its condition is unknown. + _START_ARTICLE_ Brazil at the 2017 World Aquatics Championships _START_SECTION_ Water polo _START_PARAGRAPH_ Brazil's men's water polo team qualified for the World Championships with a gold medal performance at the 2017 UANA Cup in Couva, Trinidad and Tobago. + _START_ARTICLE_ Ann Catley _START_SECTION_ Life _START_PARAGRAPH_ Catley was born near Tower Hill, London, and first made money singing in pubs and to the garrison of the Tower of London. She was apprenticed aged fifteen to William Bates, a composer and singing teacher. A scandal emerged when Bates sold Ann's apprenticeship to her admirer Sir Francis Blake Delaval of Seaton Delaval Hall for £200. Bates was additionally given money by Delaval to make up for any financial loss to him. Catley's father, Robert Catley, could see that Ann had been sold. Aided by his employer, her father sued the rake Delaval and Bates. Lord Chief Justice Mansfield's judgement extended British law, as he ruled that Delaval had offended society and that the King's Bench could take action against Delaval on society's behalf; he was heavily fined. Catley's relationship with Delaval ended. Delaval found future relationships difficult, and Catley continued her career._NEWLINE_In 1768 she met Lieutenant-Colonel Francis Lascelles (1744–1799) and they became a couple. She took the name Lascelles but they never married. In her will she left her property to their eight surviving children._NEWLINE_She performed many roles on the London and Dublin stage, until 1782. Her pupil Margaret Martyr's style is said to have come from Catley. Thomas Bellamy wrote of Martyr in 1795: "Catley's pupil - Catley's boast, Sportive, playful, arch and free, Lovely MARTYR, hail to thee!"_NEWLINE_Catley spent her last years living at Little Ealing and died on 14 October 1789._NEWLINE_John O'Keeffe wrote of her: "she was one of the most beautiful women I ever saw: the expression of her eyes, and the smiles and dimples that played round her lips and cheeks, enchanting". + _START_ARTICLE_ William Paul (author) _START_SECTION_ Background _START_PARAGRAPH_ Paul grew up in the Fife village of Kingskettle in Scotland. He has an older brother, Donald, and a younger sister, Elizabeth Anne. _NEWLINE_Paul attended the village primary school before going to Bell Baxter High School in Cupar. 
In 1973 he went to Aberdeen University, where he graduated with an MA in English and Politics._NEWLINE_After university, Paul became a trainee reporter for the Press and Journal. He then moved to Edinburgh where he was a reporter on The Scotsman. Paul's first son, Andrew, was born in 1983 while Paul was working on his debut novel Seasons of Revenge._NEWLINE_In 1985, Paul's second son, William, was born and Seasons of Revenge was published. A second novel, Mummy's Boy, was published the next year._NEWLINE_In 1988, Paul joined Scotland on Sunday as senior writer. He was promoted to Chief Reporter, then News Editor, then Assistant Editor and finally Executive Editor. In 1999, Paul was temporarily seconded from Scotland on Sunday to be the launch editor of www.scotsman.com. He left in 2001 to become the Head of Digital Communications for the Scottish Government, a post he held until February 2019._NEWLINE_Paul continues to write novels. He also writes news articles and reports on rugby matches and is a lifelong supporter of Raith Rovers F.C. + _START_ARTICLE_ Gol Talab _START_PARAGRAPH_ Gol Talab or Gol Talaab (talab means tank), also known as Nawab Bari Pukur, is a small oval-shaped water tank/pond in Islampur, Old Dhaka, Dhaka, Bangladesh, located immediately to the north-west of the Ahsan Manzil Palace and north of the Buriganga River. Gol Talab is an official heritage site, designated by the city government of Dhaka. _START_SECTION_ Description _START_PARAGRAPH_ Gol Talab dates to the 19th century. It covers an area of 2.23 acres and has a maximum depth of 23 feet (7.0 m). There are plans to upgrade it into a park. The pond is fenced. Vegetation found around the lake consists of coconut, mango, neem, jackfruit and Chinese date trees. Aquatic fauna reported include fish, frogs, insects and others. The fish species reported are ruhi, tilapia, silver carp, pangash, katal, koi, puti and many more. Other fauna reported include beetles, dragonflies, grasshoppers, butterflies, small birds and water scorpions. The pond has a bathing ghat only on its northwestern part. Boating competitions are held in the pond. A path for jogging and walking exists around the water. _START_SECTION_ Scientific analysis _START_PARAGRAPH_ Gol Talab is one of the five ponds in Dhaka which have a significant effect on the environment and biodiversity of the urban climate. Field research studies have been carried out to assess its link with the environment, economy and society. A socio-environmental survey, involving both quantitative and qualitative aspects, was carried out by three faculty members of the Department of Architecture, Stamford University Bangladesh._NEWLINE_Water quality studies of Gol Talab indicate a degree of pollution, with the following water quality parameters: TDS of 261 mg/liter; a conductivity of 0.528 mS/cm; a pH of 6.92; dissolved oxygen (DO) of 13.92 mg/liter; an arsenic content of less than 10 ppb; a COD value of −23 mg/liter; and a BOD value of 59.4 mg per liter. _START_SECTION_ Conservation _START_PARAGRAPH_ The pond is maintained by the Moulvi Khawaja Abdullah Welfare Trust and the Bangladesh Water Development Board, as part of the 2000 National Water Management Plan. The pond has been cleaned up and restored as part of the water development plan. In 2008, The Daily Star reported that heritage buildings and sites were under threat in the city, including Gol Talab. 
In 2009 the Dhaka City Corporation reaffirmed the conservation status of 93 structures and sites in Dhaka, in consideration of their "historical, aesthetic, scientific, social, cultural, religious, political and heritage value"; Gol Talab is an official heritage site under the Dhaka Metropolitan Development Plan. + _START_ARTICLE_ Clean Up Your Own Backyard _START_SECTION_ Background _START_PARAGRAPH_ Written by Mac Davis and Billy Strange and published by Gladys Music, Inc., it was released as a 7" single in 1969 with "The Fair Is Moving On" on the B-side, but was not featured on any studio album. The single was also released in the UK, Canada, Germany, Australia, New Zealand, and India._NEWLINE_It reached #35 on the Billboard Hot 100 and #21 on the UK Singles Chart. The single reached #18 on the Record World chart and #29 on the Australian Go-Set chart. The RIAA certified the single Gold in March 1992._NEWLINE_The song was from the soundtrack of the MGM film The Trouble with Girls, and was later included on the budget RCA Camden album Almost In Love._NEWLINE_Although The Trouble with Girls is set in the 1920s, several lyrics within this song are anachronistic for the era, such as a reference to "armchair quarterbacks", a term not coined until the advent of television sports broadcasting decades later. _START_SECTION_ Other recordings _START_PARAGRAPH_ The song has been recorded by Nat Stuckey, O.C. Smith, Tom Green, Jennifer Scott, Sue Moreno, Darrel Higham and The Enforcers, and Lee Birchfield in 2012. + _START_ARTICLE_ Houston Direct Navigation Company _START_SECTION_ Houston and Galveston Navigation Company _START_PARAGRAPH_ In 1851, William Marsh Rice founded the Houston and Galveston Navigation Company with his own capital of $5,000 and the capital of twenty-five other investors. These included Paul Bremond, Cornelius Ennis, William J. Hutchins, and John H. Sterrett. _START_SECTION_ Houston Navigation Company _START_PARAGRAPH_ One of the company's antecedents was the Houston Navigation Company, formed in 1854 by many of the same principals as the Houston and Galveston Navigation Company. _START_SECTION_ After the Civil War _START_PARAGRAPH_ The Houston Direct Navigation Company transported freight and passengers from Houston to railheads along Buffalo Bayou._NEWLINE_Houston Direct Navigation Company was founded on 9 October 1866 by William Marsh Rice, Thomas M. Bagby, John H. Sterrett, and several others. Businesses receiving and shipping goods from Houston were paying high fees for moving freight through Galveston, Texas. The company offered cheaper transportation, which bypassed Galveston and its Galveston Wharf Company._NEWLINE_At first, the company’s main business in the late 1860s consisted of lightering around Galveston and interlining freight through the Buffalo Bayou, Brazos and Colorado Railroad; however, it expanded service, running five passenger steamers by 1870. The company continued to expand its fleet, even as passenger demand diminished. By 1873, three steamers operated freight-only service along with 22 barges and three tugs, while only two steamers transported passengers. + _START_ARTICLE_ NY Med _START_SECTION_ Critical reception _START_PARAGRAPH_ NY Med received "universal acclaim" based on an aggregate score of 84 out of 100 from six critics on Metacritic. Verne Gay of Newsday called the series "beautiful and moving," adding "NY Med brings it all home with power, beauty, insight and a degree of emotion that's an occasional sharp uppercut to the jaw." 
New York Magazine's Matt Zoller Seitz said the series "is filled with warm, honest moments like this — some poignant, others comic — and characters who would be plenty compelling even if they didn't keep revealing surprising new sides." Mary McNamara of the Los Angeles Times called the series "surprisingly addictive", adding "NY Med appears less self-conscious about its medical pedigree than its predecessors, more willing to embrace the dramatic pacing and elasticities of television." The New York Times' Mike Hale thought the series was "predictably absorbing" but added "NY Med and its predecessors have an interesting, though certainly unintentional, effect: The intense focus on the heroics of the nurses and doctors can make the patients look just as helpless and pathetic as we fear we will be when our day in the ward comes." _START_SECTION_ Patient privacy lawsuit _START_PARAGRAPH_ One episode of NY Med depicted the case of an elderly man, Mark Chanko, who arrived at NewYork–Presbyterian hospital after he was hit by a garbage truck. The episode showed Chanko's final moments as he died from his injuries. While his face was blurred, Chanko's widow was able to identify him when she watched the episode. The patient's family had not granted ABC or the hospital permission to film his treatment, and they were deeply upset by the episode. The family sued ABC and New York–Presbyterian Hospital for violating Mark Chanko's privacy. The case was dismissed by an appellate court. However, the family appealed and the NY judiciary felt there was sufficient reason to bring it before the state's highest appellate court. New York–Presbyterian agreed to a $2.2M settlement with the Department of Health and Human Services, Office for Civil Rights, who investigated this as a HIPAA violation. ABC removed the segment from the DVD version of the episode and from future broadcasts. + _START_ARTICLE_ Community Professional Loudspeakers _START_PARAGRAPH_ Community Professional Loudspeakers is an American manufacturer of loudspeakers and sound reinforcement equipment. The company has been located in the Philadelphia area since its inception in 1968, and has occupied its present location in Chester, Pennsylvania since 1981. _START_SECTION_ Background _START_PARAGRAPH_ Bruce Howze founded the company in 1968, which was first named Community Light and Sound. The company originally started in the Philadelphia area, and now occupies a 100,000 square foot space in Chester, Pennsylvania._NEWLINE_Community established itself as the first company to utilize fiberglass to create large yet lightweight loudspeaker horns and enclosures. In 1970, it introduced its first notable live sound reinforcement loudspeaker product, the LMF, a fiberglass midrange horn. The company next developed the Leviathan fiberglass composite bass horn, which Elvis Presley used in his 1971 tour. Several top musical groups from that era used Leviathans as well, such as the Eagles, Linda Ronstadt, and Earth, Wind & Fire._NEWLINE_In the mid-1970s, Community became one of the first companies to meticulously test and document the performance of both its own loudspeakers and competitors’ loudspeakers. Community based its test measurement philosophy on the underlying principles of “free field” and “far field,” believing that far more dependable and relevant data can be obtained by testing loudspeakers at measurement distances that correspond to actual listening distances._NEWLINE_Since its founding, Community has pursued pioneering loudspeaker technologies. 
In 2010, the United States Patent and Trademark Office granted the company a patent for Carbon Ring Cone Technology._NEWLINE_In 2014, Community was acquired by Audio Prof, which also owns Apart Audio. Bruce W. Howze is still active with the company today, as is Christine Howze. Steve Johnson joined Community in 2013 as president. The Community brand has a global presence with its products being for live performance and permanent installation in houses of worship, schools, and other venues._NEWLINE_Community Professional is also well known for its weather-resistant loudspeaker designs, which are installed in major sports stadia and arenas throughout the world. This same quality makes the company’s loudspeakers a valuable component in emergency notification systems, such as the one used by the Tidal Information System in Venice, Italy. + _START_ARTICLE_ 2017–18 Argentine Primera División _START_SECTION_ Competition format _START_PARAGRAPH_ The tournament was contested by 28 teams. Each team played the other 27 teams in a single round-robin tournament. The additional match against the main rival team in the so-called "Fecha de Clásicos" was not played in this season. _START_SECTION_ Top goalscorers _START_PARAGRAPH_ Source: AFA _START_SECTION_ Top assists _START_PARAGRAPH_ Source: AFA + _START_ARTICLE_ Kelly Lai Chen _START_SECTION_ Early life _START_PARAGRAPH_ In 1933, Lai was born as Xi Zhongjian (奚重俭) into a prominent family from Pudong, Shanghai, owner of the Xi Fu Ji (奚福记) Factory. He was the third child among six siblings; Betty Loh Ti (born Xi Zhongyi) was the youngest. Their maternal grandfather was the tycoon Gu Zhuxuan, who owned Shanghai's Tianchan Theatre. When Lai was four, his father was killed by Japanese bombing during the Battle of Shanghai, and his mother died ten years later. He and his siblings were brought up by their maternal grandmother._NEWLINE_When the Communists took over Mainland China in 1949, Lai's grandmother brought the children to Hong Kong. He trained in the Republic of China Air Force cadet school in Taiwan for half a year, but was forced to quit because of heart disease. _START_SECTION_ Career _START_PARAGRAPH_ After returning to Hong Kong, Lai attended the actor training school of Motion Picture & General Investment (MP&GI, later Cathay Organisation HK). In 1956, he starred in his first film Green Hills and Jade Valleys directed by Yueh Feng. In his second film, Golden Lotus, also by Yueh Feng, he acted opposite the star actress Lin Dai. The highly successful film launched Lai into stardom. In the following years, he appeared in more than 40 films, including Evan Yang's Our Dream Car (1959), Chung Kai-man (鍾啟文)'sThe Education of Love (1961), and Wong Tin-lam's Father Takes a Bride (1963), starring opposite popular actresses such as Ge Lan, Jeannette Lin Cui, and Lucilla You Min. He was known for his portrayals of "gentle, vulnerable, and sensitive" young men._NEWLINE_In 1967, Lai, his sister Betty Loh Ti, and director Yuan Qiufeng founded Gold Eagle Film Company. It made a number of martial arts films, including Duel at the Supreme Gate (1968), but they were commercially unsuccessful._NEWLINE_Lai retired from acting in 1971 and focused on producing films. He retired in the early 1990s, but returned to the screen for Andrew Lau's 1996 film Young and Dangerous 2 and Wong Kar-wai's award-winning In the Mood for Love (2000), in which he played Maggie Cheung's boss. 
_START_SECTION_ Personal life _START_PARAGRAPH_ In 1974, Lai married martial arts actress Angela Mao, nicknamed "female Bruce Lee". They had a daughter together, and divorced after six years of marriage. Lai raised their daughter._NEWLINE_In his later years Lai lived alone in Hong Kong, and his daughter often visited him. He died on 3 April 2018, aged 84. + _START_ARTICLE_ Douglas v. Veterans Administration _START_PARAGRAPH_ Curtis Douglas vs. Veterans Administration (5 Merit Systems Protection Board (MSPB), 313 (1981) was a case decided by the Merit Systems Protection Board which established criteria that supervisors must consider in determining an appropriate penalty to impose for an act of federal employee misconduct. + _START_ARTICLE_ Blackadder, Scottish Borders _START_PARAGRAPH_ Blackadder is a hamlet on the B6460, in the Scottish Borders area of Scotland, located at grid reference NT846523._NEWLINE_Places nearby include Allanton, Duns, Edrom, Fogo, Gavinton, and Whitsome. + _START_ARTICLE_ James White (engineer) _START_PARAGRAPH_ James Lindsay White (January 3, 1938 – November 26, 2009) was an American polymer scientist._NEWLINE_White was a key figure in defining the field of polymer engineering. He founded two polymer engineering programs, one at the University of Tennessee and the other at the University of Akron. He also founded the International Polymer Processing Society and two scholarly journals: the Journal of Polymer Engineering and the International Polymer Processing Journal. He authored the textbook Rubber Processing, which was long popular among engineers. He published more than 500 papers and eight books based on his studies of flow in internal mixers, pin barrel extruders, and twin screw extruders with and without simultaneous chemical reactions._NEWLINE_He received the Charles Goodyear Medal in 2009, for “fundamental understanding of rheology and mathematical modeling of unfilled and filled rubbers and simulations of flow in batch and continuous mixing machines.” He received the Bingham Medal in 1981. _START_SECTION_ Education _START_PARAGRAPH_ White obtained his BS in chemical engineering at the Brooklyn Polytechnic Institute. He then pursued graduate studies in the research group of Arthur B. Metzner at the University of Delaware, receiving his MS degree in 1962 and his doctorate in 1965. His research resulted in development of the White-Metzner rheological model, which is still widely used in polymer processing simulations. + _START_ARTICLE_ Karen Yarbrough _START_SECTION_ Early life _START_PARAGRAPH_ Yarbrough's father, Don Williams, is a State Farm insurance agent and former village president of the Village of Maywood. Yarbrough earned a bachelor's degree in Business Administration from Chicago State University, a master’s in Inner City Studies from Northeastern Illinois University and attended the advanced leadership institute at Harvard University's Kennedy School of Government. Yarbrough is the founder and CEO of Hathaway Insurance Agency, where she has worked for thirty years. _START_SECTION_ Public service _START_PARAGRAPH_ For eight years, Yarbrough served as the first female president of the Maywood Chamber of Commerce. She established the Gold Card Program, which provided opportunities for business and education to work together to provide scholarships for deserving students._NEWLINE_Yarbrough also served as a board member of the United Way of Suburban Chicago and the Oak Park YMCA. 
_START_SECTION_ State Representative _START_PARAGRAPH_ Yarbrough was first elected in 2000, defeating the appointed incumbent Wanda Sharp. Her term began in January, 2001, and she rose to become an assistant Majority Leader._NEWLINE_During her tenure in the Illinois House, Yarbrough served on seven committees: Housing and Urban Development (Chairwoman), House Insurance Committee (Vice-Chairwoman), Environmental Health Committee, Appropriations: Public Safety, Computer Technology, Illinois Legislative Black Caucus, and the Governor’s Safety and Re-Entry Commission._NEWLINE_She pushed for "All Kids", a program helping to ensure that all children in Illinois get affordable health care. The All Kids program provides working families that make too much for KidCare, but not enough to afford private insurance, with affordable health insurance. Yarbrough also sponsored a bill to support challenged children. It funds training teachers and school administrators to recognize and be sensitive to the special needs of students who are pregnant or living in abusive environments._NEWLINE_Yarbrough also fought the tobacco lobby for a "Smoke Free Illinois", making Illinois the first Midwestern state to go smoke-free. She also fought to fund programs to provide a "Second Chance" for previously incarcerated individuals and sponsored a Quality of Life bill to use state lottery money to treat people with HIV-AIDs. She was also the chief House sponsor for legislation which ended the death penalty in Illinois, making it the 16th state to do so and earning her recognition from the Pope._NEWLINE_After Yarbrough was elected Recorder, Cory Foster was appointed to fulfill the remainder of her term in the 97th General Assembly. Emanuel Chris Welch was elected to succeed her as the representative from the 7th district. _START_SECTION_ Personal life _START_PARAGRAPH_ Yarbrough’s husband, Henderson Yarbrough, Sr. is the current trustee, and was the former mayor of the village of Maywood. They have seven grandchildren. + _START_ARTICLE_ The Iron Lady (TV series) _START_SECTION_ Synopsis _START_PARAGRAPH_ In earlier times and not very long ago, girls born to traditional Chinese families were deprived of privileges and opportunities reserved for males. Most of them accepted their fate dutifully and submissively._NEWLINE_But one woman of those oppressed times dared defy the hand she was dealt. Opportunist, manipulator, ambitious, she desired power to control her own destiny above all else._NEWLINE_This is the story of an unconventional woman of extraordinary will and determination who seized control of her fate at all costs! + _START_ARTICLE_ Elephant Hotel _START_SECTION_ History _START_PARAGRAPH_ Hachaliah Bailey (known as the creator of Bailey Circus) built the Elephant Hotel in Somers, NY, after buying an African elephant, which he named "Old Bet". Bailey intended to use the elephant for farm work, but the number of people it attracted caused Bailey to tour her throughout the northeast. Old Bet was killed on tour in 1816, when she was shot by a local farmer. Bailey's tours were the first of their type in the nation, and inspired numerous others to tour with exotic animals, and during the 1830s the old style circus and Bailey's attractions merged to form the modern circus. Due to this, Somers is known as the "Cradle of the American Circus"._NEWLINE_Bailey had purchased this land in 1805, and began construction of the hotel in 1821, as a memorial to the animals he displayed. 
It is said Old Bet was buried in front of the building. The monument to Old Bet that stands in front of the hotel was placed in 1827. In 1835, the hotel was incorporated by the Zoological Institute._NEWLINE_The Elephant Hotel was purchased by the town of Somers in 1927. It is a town landmark and was designated a National Historic Landmark in 2005. _START_SECTION_ Museum of the Early American Circus _START_PARAGRAPH_ The Somers Historical Society occupies the third floor of the building. The Society operates the Museum of the Early American Circus, which is open on Thursday afternoons and for special holidays.
 + _START_ARTICLE_ Marrit Boonstra _START_SECTION_ Biography _START_PARAGRAPH_ Boonstra, who was born in Groningen, played tennis as a right-hander with a double-handed backhand._NEWLINE_Her junior career included a win over Caroline Wozniacki, with whom she also partnered to reach the girls' doubles quarterfinals of the 2006 Wimbledon Championships._NEWLINE_As a 17-year-old, she played three doubles rubbers for the Netherlands in the 2006 Fed Cup competition, teaming up with Dutch veteran Brenda Schultz-McCarthy to win all three matches._NEWLINE_Boonstra received a wild card to compete in the main draw of the 2006 Ordina Open, a WTA Tour tournament in Rosmalen. She lost in the opening round to Jelena Janković._NEWLINE_From 2008 to 2010, she played collegiate tennis in the United States for the University of Florida._NEWLINE_During her time on the ITF circuit, she won one singles and five doubles titles.
 + _START_ARTICLE_ Mayfair Theatre _START_SECTION_ Description _START_PARAGRAPH_ The Mayfair is a surviving atmospheric cinema of the Spanish Revival form, the second theatre house of this kind to be constructed in Ottawa. Interior features include four faux-balconies, two of which feature clay-tile canopies. Other significant features include stained-glass windows, a proscenium arch, a painted ceiling, decorative plastering and wrought ironwork. The Mayfair has retained the theatre clock used since its inception, a unit which features blue illuminated numbering. _START_SECTION_ 1932 to 1970s _START_PARAGRAPH_ Fred Robertson, a retailer from Almonte, was the Mayfair's original owner. The Mayfair opened on 5 December 1932 with showings of The Blue Danube. Adult admission prices were 15 cents for matinees and 25 cents for evening performances, with children admitted for 10 and 15 cents respectively. After The Blue Danube completed a three-day run, the Mayfair presented its first double bill with Bring 'Em Back Alive and X Marks the Spot. At the outset, the theatre's sound system was supplied by Northern Electric while Montreal-based Canadian Theatre Supply provided the projection and stage equipment._NEWLINE_For the first half century of its existence, the cinema remained under Robertson family ownership. The theatre later operated as a second-run cinema for numerous years. In the late 1970s the Mayfair concentrated on pornographic films, a phase which lasted less than two years. _START_SECTION_ 1980s _START_PARAGRAPH_ In October 1981, the Mayfair adopted a repertory format and in the following month Keith Davidson became theatre manager. The Mayfair became known for its economical double features, which were introduced in June 1982 for five days each week, excluding Sundays and Mondays when Chinese-language films would be presented.
The Mayfair's ownership then consisted of several investors, most of whom were Ottawa-based._NEWLINE_The Mayfair cancelled a planned showing of Videodrome in April 1983 when police threatened the theatre with obscenity charges. A handful of citizens, including Maude Barlow, objected to the violent content of the film which was approved by the Ontario Board of Censors and was previously shown without incident in Nepean, Ontario._NEWLINE_Director Michael Rubbo rented the theatre for three days in early 1986 to conduct a "four-waller" promotion for his film The Peanut Butter Solution which had fared poorly in the English Canadian market._NEWLINE_In 1986, major renovations resulted in new concession stand and box office spaces, while 492 wider seats replaced the original set of approximately 600 seats. In 1988, the Mayfair's regular admission price was $5, or $3.50 for those with theatre memberships which were available for $5 per year, or $3 per year for students. During that time, membership numbered more than 5000. _START_SECTION_ 1990s _START_PARAGRAPH_ Double features became available on all days as of 1 April 1990 as the Chinese-language films were discontinued. Sunday afternoon double features were also introduced at that time. Regular prices for the double features were $5.50, or $4 for those who obtained a $6 annual membership. Featured films were predominantly hit American productions with a minority of classic and international films._NEWLINE_Tom Bergin became manager in the early 1990s when Davidson left to pursue other interests. _START_SECTION_ 2000s _START_PARAGRAPH_ In August 2008, local media indicated that the theatre would close effective 30 November 2008, the date at which the theatre would terminate its membership programme. The City of Ottawa declared the theatre building as a heritage site under the Ontario Heritage Act on 8 October 2008, a designation which prohibits outright demolition of the building._NEWLINE_Public and community concern over the closure of the Mayfair and interest in its heritage value resulted in the formation of the Friends of the Mayfair Theatre, a loosely organized community group that claimed several hundred members._NEWLINE_In November 2008, a new partnership consisting of filmmakers Lee Demarbre and Ian Driscoll, projectionist and film conservator Paul Gordon and film scholar John Yemen announced that they had signed a ten-year lease with owner Stephen Ng. The new owners renovated the facility with new seating, some couches in the balcony, a digital video projection system, a new 16mm projector, a Dolby Digital sound system for the 35mm projectors, and a long play tower system. Seating capacity was reduced from 492 to 343. The Mayfair reopened on 2 January 2009 with the film Metropolis accompanied by short subjects from local filmmakers. The theatre's reopening was accompanied with a renewed emphasis on its repertory role. During this relaunch month, thirteen Ottawa premieres were presented while double bills were now limited to Tuesday nights and occasionally other nights. Midnight screenings on Friday and Saturday nights were also introduced._NEWLINE_In July 2009 two of the founding members of the new partnership, John Yemen and Paul Gordon left the group to pursue other projects. John Yemen was the individual who sent the city a proposal for heritage designation in the summer of 2008. The makeup of the partnership is now Lee Demarbre (programmer), Ian Driscoll, and Josh Stafford. 
_START_SECTION_ 2010s _START_PARAGRAPH_ Currently, the Mayfair's programming includes cult films, family matinees,independent films, Ottawa premieres, local films, festivals, and late night presentations. It also became the main venue for the Ottawa International Writers Festival in spring of 2010, hosting readings and lectures. The theatre also reports continued success with its annual Halloween screenings of The Rocky Horror Picture Show. _START_SECTION_ Mayfair Orleans _START_PARAGRAPH_ The Mayfair opened a three-screen cinema in Orleans on 2 December 2011. It was situated at the former Empire Six theatre facility. This location presented similar programming as the original Mayfair location, with some emphasis on family-oriented films. The Mayfair Orleans location closed on 13 February 2013 when its lease was cancelled due to arrears in rent. + _START_ARTICLE_ DC Comics _START_SECTION_ Origins _START_PARAGRAPH_ Entrepreneur Major Malcolm Wheeler-Nicholson founded National Allied Publications in autumn 1934. The company debuted with the tabloid-sized New Fun: The Big Comic Magazine #1 with a cover date of February 1935. The company's second title, New Comics #1 (Dec. 1935), appeared in a size close to what would become comic books' standard during the period fans and historians call the Golden Age of Comic Books, with slightly larger dimensions than today's. That title evolved into Adventure Comics, which continued through issue #503 in 1983, becoming one of the longest-running comic-book series. In 2009 DC revived Adventure Comics with its original numbering. In 1935, Jerry Siegel and Joe Shuster, the future creators of Superman, created Doctor Occult, who is the earliest DC Comics character to still be in the DC Universe._NEWLINE_Wheeler-Nicholson's third and final title, Detective Comics, advertised with a cover illustration dated December 1936, eventually premiered three months late with a March 1937 cover date. The themed anthology series would become a sensation with the introduction of Batman in issue #27 (May 1939). By then, however, Wheeler-Nicholson had gone. In 1937, in debt to printing-plant owner and magazine distributor Harry Donenfeld—who also published pulp magazines and operated as a principal in the magazine distributorship Independent News—Wheeler-Nicholson had to take Donenfeld on as a partner in order to publish Detective Comics #1. Detective Comics, Inc. was formed, with Wheeler-Nicholson and Jack S. Liebowitz, Donenfeld's accountant, listed as owners. Major Wheeler-Nicholson remained for a year, but cash-flow problems continued, and he was forced out. Shortly afterwards, Detective Comics, Inc. purchased the remains of National Allied, also known as Nicholson Publishing, at a bankruptcy auction._NEWLINE_Detective Comics, Inc. soon launched a fourth title, Action Comics, the premiere of which introduced Superman. Action Comics #1 (June 1938), the first comic book to feature the new character archetype—soon known as "superheroes"—proved a sales hit. The company quickly introduced such other popular characters as the Sandman and Batman._NEWLINE_On February 22, 2010, a copy of Action Comics #1 (June 1938) sold at an auction from an anonymous seller to an anonymous buyer for $1 million, besting the $317,000 record for a comic book set by a different copy, in lesser condition, the previous year. _START_SECTION_ Golden Age _START_PARAGRAPH_ National Allied Publications soon merged with Detective Comics, Inc., forming National Comics Publications on September 30, 1946. 
National Comics Publications absorbed an affiliated concern, Max Gaines' and Liebowitz' All-American Publications. In the same year Gaines let Liebowitz buy him out, and kept only Picture Stories from the Bible as the foundation of his own new company, EC Comics. At that point, "Liebowitz promptly orchestrated the merger of All-American and Detective Comics into National Comics... Next he took charge of organizing National Comics, [the self-distributorship] Independent News, and their affiliated firms into a single corporate entity, National Periodical Publications". National Periodical Publications became publicly traded on the stock market in 1961._NEWLINE_Despite the official names "National Comics" and "National Periodical Publications", the company began branding itself as "Superman-DC" as early as 1940, and the company became known colloquially as DC Comics for years before the official adoption of that name in 1977._NEWLINE_The company began to move aggressively against what it saw as copyright-violating imitations from other companies, such as Fox Comics' Wonder Man, which (according to court testimony) Fox started as a copy of Superman. This extended to DC suing Fawcett Comics over Captain Marvel, at the time comics' top-selling character (see National Comics Publications, Inc. v. Fawcett Publications, Inc.). Faced with declining sales and the prospect of bankruptcy if it lost, Fawcett capitulated in 1953 and ceased publishing comics. Years later, Fawcett sold the rights for Captain Marvel to DC—which in 1972 revived Captain Marvel in the new title Shazam! featuring artwork by his creator, C. C. Beck. In the meantime, the abandoned trademark had been seized by Marvel Comics in 1967, with the creation of their Captain Marvel, forbidding the DC comic itself to be called that. While Captain Marvel did not recapture his old popularity, he later appeared in a Saturday morning live action TV adaptation and gained a prominent place in the mainstream continuity DC calls the DC Universe._NEWLINE_When the popularity of superheroes faded in the late 1940s the company focused on such genres as science fiction, Westerns, humor, and romance. DC also published crime and horror titles, but relatively tame ones, and thus avoided the mid-1950s backlash against such comics. A handful of the most popular superhero-titles, including Action Comics and Detective Comics, the medium's two longest-running titles, continued publication. _START_SECTION_ Silver Age _START_PARAGRAPH_ In the mid-1950s, editorial director Irwin Donenfeld and publisher Liebowitz directed editor Julius Schwartz (whose roots lay in the science-fiction book market) to produce a one-shot Flash story in the try-out title Showcase. Instead of reviving the old character, Schwartz had writers Robert Kanigher and John Broome, penciler Carmine Infantino, and inker Joe Kubert create an entirely new super-speedster, updating and modernizing the Flash's civilian identity, costume, and origin with a science-fiction bent. The Flash's reimagining in Showcase #4 (October 1956) proved sufficiently popular that it soon led to a similar revamping of the Green Lantern character, the introduction of the modern all-star team Justice League of America (JLA), and many more superheroes, heralding what historians and fans call the Silver Age of comic books._NEWLINE_National did not reimagine its continuing characters (primarily Superman, Batman, and Wonder Woman), but radically overhauled them. 
The Superman family of titles, under editor Mort Weisinger, introduced such enduring characters as Supergirl, Bizarro, and Brainiac. The Batman titles, under editor Jack Schiff, introduced the successful Batwoman, Bat-Girl, Ace the Bat-Hound, and Bat-Mite in an attempt to modernize the strip with non-science-fiction elements. Schwartz, together with artist Infantino, then revitalized Batman in what the company promoted as the "New Look", re-emphasizing Batman as a detective. Meanwhile, editor Kanigher successfully introduced a whole family of Wonder Woman characters having fantastic adventures in a mythological context._NEWLINE_Since the 1940s, when Superman, Batman, and many of the company's other heroes began appearing in stories together, DC's characters inhabited a shared continuity that, decades later, was dubbed the "DC Universe" by fans. With the story "Flash of Two Worlds", in Flash #123 (September 1961), editor Schwartz (with writer Gardner Fox and artists Infantino and Joe Giella) introduced a concept that allowed slotting the 1930s and 1940s Golden Age heroes into this continuity via the explanation that they lived on an other-dimensional "Earth 2", as opposed to the modern heroes' "Earth 1"—in the process creating the foundation for what would later be called the DC Multiverse._NEWLINE_DC's introduction of the reimagined superheroes did not go unnoticed by other comics companies. In 1961, with DC's JLA as the specific spur, Marvel Comics writer-editor Stan Lee and a robust creator Jack Kirby ushered in the sub-Silver Age "Marvel Age" of comics with the debut issue of The Fantastic Four. Reportedly, DC ignored the initial success of Marvel with this editorial change until its consistently strengthening sales, albeit also benefiting Independent News' business as their distributor as well, made that impossible. That commercial situation especially applied with Marvel's superior sell-through percentage numbers which were typically 70% to DC's roughly 50%, which meant DC's publications were barely making a profit in comparison after returns from the distributors were calculated while Marvel was making an excellent profit by comparison._NEWLINE_However, the senior DC staff were reportedly at a loss at this time to understand how this small publishing house was achieving this increasingly threatening commercial strength. For instance, when Marvel's product was examined in a meeting, Marvel's emphasis on more sophisticated character-based narrative and artist-driven visual storytelling was apparently ignored for self-deluding guesses at the brand's popularity which included superficial reasons like the presence of the color red or word balloons on the cover, or that the perceived crudeness of the interior art was somehow more appealing to readers. When Lee learned about DC's subsequent experimental attempts to imitate these perceived details, he amused himself by arranging direct defiance of those assumptions in Marvel's publications as sales strengthened further to frustrate the competition._NEWLINE_However, this ignorance of Marvel's true appeal did not extend to some of the writing talent during this period, from which there were some attempts to emulate Marvel's narrative approach. For instance, there was the Doom Patrol series by Arnold Drake, a writer who previously warned the management of the new rival's strength; a superhero team of outsiders who resented their freakish powers, which Drake later speculated was plagiarized by Stan Lee to create The X-Men. 
There was also the young Jim Shooter who purposely emulated Marvel's writing when he wrote for DC after much study of both companies' styles, such as for the Legion of Super-Heroes feature._NEWLINE_A 1966 Batman TV show on the ABC network sparked a temporary spike in comic book sales, and a brief fad for superheroes in Saturday morning animation (Filmation created most of DC's initial cartoons) and other media. DC significantly lightened the tone of many DC comics—particularly Batman and Detective Comics—to better complement the "camp" tone of the TV series. This tone coincided with the famous "Go-Go Checks" checkerboard cover-dress which featured a black-and-white checkerboard strip (all DC books cover dated February 1966 until August 1967) at the top of each comic, a misguided attempt by then-managing editor Irwin Donenfeld to make DC's output "stand out on the newsracks". In particular, DC artist, Carmine Infantino, complained that the visual cover distinctiveness made DC's titles easier for readers to see and then avoid in favor of Marvel's titles._NEWLINE_In 1967, Batman artist Infantino (who had designed popular Silver Age characters Batgirl and the Phantom Stranger) rose from art director to become DC's editorial director. With the growing popularity of upstart rival Marvel Comics threatening to topple DC from its longtime number-one position in the comics industry, he attempted to infuse the company with more focus towards marketing new and existing titles and characters with more adult sensibilities towards an emerging older age group of superhero comic book fans that grew out of Marvel's efforts to market their superhero line to college-aged adults. He also recruited major talents such as ex-Marvel artist and Spider-Man co-creator Steve Ditko and promising newcomers Neal Adams and Denny O'Neil and replaced some existing DC editors with artist-editors, including Joe Kubert and Dick Giordano, to give DC's output a more artistic critical eye. _START_SECTION_ Kinney National/Warner Communications subsidiary (1967-1990) _START_PARAGRAPH_ In 1967, National Periodical Publications was purchased by Kinney National Company, which purchased Warner Bros.-Seven Arts in 1969. Kinney National spun off its non-entertainment assets in 1972 (as National Kinney Corporation) and changed its name to Warner Communications Inc._NEWLINE_In 1970, Jack Kirby moved from Marvel Comics to DC, at the end of the Silver Age of Comics, in which Kirby's contributions to Marvel played a large, integral role. Given carte blanche to write and illustrate his own stories, he created a handful of thematically linked series he called collectively The Fourth World. In the existing series Superman's Pal Jimmy Olsen and in his own, newly launched series New Gods, Mister Miracle, and The Forever People, Kirby introduced such enduring characters and concepts as archvillain Darkseid and the other-dimensional realm Apokolips. Furthermore, Kirby intended their stories to be later reprinted in collected editions in a publishing format that would later be called the trade paperback, which would become a standard industry practice decades later. While sales were respectable, they did not meet DC management's initially high expectations, and also suffered from a lack of comprehension and internal support from Infantino. By 1973 the "Fourth World" was all cancelled, although Kirby's conceptions would soon become integral to the broadening of the DC Universe. 
Obligated by his contract, Kirby created other unrelated series for DC, including Kamandi, The Demon, and OMAC, before ultimately returning to Marvel Comics. _START_SECTION_ The Bronze Age _START_PARAGRAPH_ Following the science-fiction innovations of the Silver Age, the comics of the 1970s and 1980s would become known as the Bronze Age, as fantasy gave way to more naturalistic and sometimes darker themes. Illegal drug use, banned by the Comics Code Authority, explicitly appeared in comics for the first time in Marvel Comics' story "Green Goblin Reborn!" in The Amazing Spider-Man #96 (May 1971), and after the Code's updating in response, DC offered a drug-fueled storyline in writer Dennis O'Neil and artist Neal Adams' Green Lantern, beginning with the story "Snowbirds Don't Fly" in the retitled Green Lantern / Green Arrow #85 (Sept. 1971), which depicted Speedy, the teen sidekick of superhero archer Green Arrow, as having become a heroin addict._NEWLINE_Jenette Kahn, a former children's magazine publisher, replaced Infantino as editorial director in January 1976. DC had attempted to compete with the now-surging Marvel by dramatically increasing its output and attempting to win the market by flooding it. This included launching series featuring such new characters as Firestorm and Shade, the Changing Man, as well as an increasing array of non-superhero titles, in an attempt to recapture the pre-Wertham days of post-War comicdom. In June 1978, five months before the release of the first Superman movie, Kahn expanded the line further, increasing the number of titles and story pages, and raising the price from 35 cents to 50 cents. Most series received eight-page back-up features while some had full-length twenty-five-page stories. This was a move the company called the "DC Explosion". The move was not successful, however, and corporate parent Warner dramatically cut back on these largely unsuccessful titles, firing many staffers in what industry watchers dubbed "the DC Implosion". In September 1978, the line was dramatically reduced and standard-size books returned to 17 story pages, but at an increased price of 40 cents. By 1980, the books returned to 50 cents with a 25-page story count, the extra story pages replacing house ads in the books._NEWLINE_Seeking new ways to boost market share, the new team of publisher Kahn, vice president Paul Levitz, and managing editor Giordano addressed the issue of talent instability. To that end, and following the example of Atlas/Seaboard Comics and such independent companies as Eclipse Comics, DC began to offer royalties in place of the industry-standard work-for-hire agreement in which creators worked for a flat fee and signed away all rights, giving talent a financial incentive tied to the success of their work. As it happened, the implementation of these incentives proved opportune: Marvel Comics' Editor-in-Chief, Jim Shooter, was alienating much of his company's creative staff with his authoritarian manner, and major talents there, such as Roy Thomas, Gene Colan, Marv Wolfman, and George Pérez, went to DC._NEWLINE_In addition, emulating the era's new television form, the miniseries, while addressing the problem of an excessive number of ongoing titles fizzling out within a few issues of their start, DC created the industry concept of the comic book limited series.
This publishing format allowed for the deliberate creation of finite storylines within a more flexible publishing format that could showcase creations without forcing the talent into unsustainable open-ended commitments. The first such title was World of Krypton in 1979, and its positive results lead to subsequent similar titles and later more ambitious productions like Camelot 3000 for the direct market in 1982._NEWLINE_These changes in policy shaped the future of the medium as a whole, and in the short term allowed DC to entice creators away from rival Marvel, and encourage stability on individual titles. In November 1980 DC launched the ongoing series The New Teen Titans, by writer Marv Wolfman and artist George Pérez, two popular talents with a history of success. Their superhero-team comic, superficially similar to Marvel's ensemble series X-Men, but rooted in DC history, earned significant sales in part due to the stability of the creative team, who both continued with the title for six full years. In addition, Wolfman and Pérez took advantage of the limited-series option to create a spin-off title, Tales of the New Teen Titans, to present origin stories of their original characters without having to break the narrative flow of the main series or oblige them to double their work load with another ongoing title. _START_SECTION_ Modern Age _START_PARAGRAPH_ This successful revitalization of the Silver Age Teen Titans led DC's editors to seek the same for the wider DC Universe. The result, the Wolfman/Pérez 12-issue limited series Crisis on Infinite Earths, gave the company an opportunity to realign and jettison some of the characters' complicated backstory and continuity discrepancies. A companion publication, two volumes entitled The History of the DC Universe, set out the revised history of the major DC characters. Crisis featured many key deaths that would shape the DC Universe for the following decades, and separate the timeline of DC publications into pre- and post-"Crisis"._NEWLINE_Meanwhile, a parallel update had started in the non-superhero and horror titles. Since early 1984, the work of British writer Alan Moore had revitalized the horror series The Saga of the Swamp Thing, and soon numerous British writers, including Neil Gaiman and Grant Morrison, began freelancing for the company. The resulting influx of sophisticated horror-fantasy material led to DC in 1993 establishing the Vertigo mature-readers imprint, which did not subscribe to the Comics Code Authority._NEWLINE_Two DC limited series, Batman: The Dark Knight Returns by Frank Miller and Watchmen by Moore and artist Dave Gibbons, drew attention in the mainstream press for their dark psychological complexity and promotion of the antihero. These titles helped pave the way for comics to be more widely accepted in literary-criticism circles and to make inroads into the book industry, with collected editions of these series as commercially successful trade paperbacks._NEWLINE_The mid-1980s also saw the end of many long-running DC war comics, including series that had been in print since the 1960s. These titles, all with over 100 issues, included Sgt. Rock, G.I. Combat, The Unknown Soldier, and Weird War Tales. _START_SECTION_ Time Warner/AOL Time Warner/WarnerMedia unit (1990–present) _START_PARAGRAPH_ In March 1989, Warner Communications merged with Time Inc., making DC Comics a subsidiary of Time Warner. 
In June, the first Tim Burton directed Batman movie was released, and DC began publishing its hardcover series of DC Archive Editions, collections of many of their early, key comics series, featuring rare and expensive stories unseen by many modern fans. Restoration for many of the Archive Editions was handled by Rick Keene with colour restoration by DC's long-time resident colourist, Bob LeRose. These collections attempted to retroactively credit many of the writers and artists who had worked without much recognition for DC during the early period of comics when individual credits were few and far between._NEWLINE_The comics industry experienced a brief boom in the early 1990s, thanks to a combination of speculative purchasing (mass purchase of the books as collectible items, with intent to resell at a higher value as the rising value of older issues, was thought to imply that all comics would rise dramatically in price) and several storylines which gained attention from the mainstream media. DC's extended storylines in which Superman was killed, Batman was crippled and superhero Green Lantern turned into the supervillain Parallax resulted in dramatically increased sales, but the increases were as temporary as the hero's replacements. Sales dropped off as the industry went into a major slump, while manufactured "collectables" numbering in the millions replaced quality with quantity until fans and speculators alike deserted the medium in droves._NEWLINE_DC's Piranha Press and other imprints (including the mature readers line Vertigo, and Helix, a short-lived science fiction imprint) were introduced to facilitate compartmentalized diversification and allow for specialized marketing of individual product lines. They increased the use of non-traditional contractual arrangements, including the dramatic rise of creator-owned projects, leading to a significant increase in critically lauded work (much of it for Vertigo) and the licensing of material from other companies. DC also increased publication of book-store friendly formats, including trade paperback collections of individual serial comics, as well as original graphic novels._NEWLINE_One of the other imprints was Impact Comics from 1991 to 1992 in which the Archie Comics superheroes were licensed and revamped. The stories in the line were part of its own shared universe._NEWLINE_DC entered into a publishing agreement with Milestone Media that gave DC a line of comics featuring a culturally and racially diverse range of superhero characters. Although the Milestone line ceased publication after a few years, it yielded the popular animated series Static Shock. DC established Paradox Press to publish material such as the large-format Big Book of... series of multi-artist interpretations on individual themes, and such crime fiction as the graphic novel Road to Perdition. In 1998, DC purchased WildStorm Comics, Jim Lee's imprint under the Image Comics banner, continuing it for many years as a wholly separate imprint – and fictional universe – with its own style and audience. As part of this purchase, DC also began to publish titles under the fledgling WildStorm sub-imprint America's Best Comics (ABC), a series of titles created by Alan Moore, including The League of Extraordinary Gentlemen, Tom Strong, and Promethea. Moore strongly contested this situation, and DC eventually stopped publishing ABC. 
_START_SECTION_ 2000s _START_PARAGRAPH_ In March 2003 DC acquired publishing and merchandising rights to the long-running fantasy series Elfquest, previously self-published by creators Wendy and Richard Pini under their WaRP Graphics publication banner. This series then followed another non-DC title, Tower Comics' series T.H.U.N.D.E.R. Agents, in collection into DC Archive Editions. In 2004 DC temporarily acquired the North American publishing rights to graphic novels from European publishers 2000 AD and Humanoids. It also rebranded its younger-audience titles with the mascot Johnny DC and established the CMX imprint to reprint translated manga. In 2006, CMX took over from Dark Horse Comics publication of the webcomic Megatokyo in print form. DC also took advantage of the demise of Kitchen Sink Press and acquired the rights to much of the work of Will Eisner, such as his The Spirit series and his graphic novels._NEWLINE_In 2004, DC began laying the groundwork for a full continuity-reshuffling sequel to Crisis on Infinite Earths, promising substantial changes to the DC Universe (and side-stepping the 1994 Zero Hour event which similarly tried to ret-con the history of the DCU). In 2005, the critically lauded Batman Begins film was released; also, the company published several limited series establishing increasingly escalated conflicts among DC's heroes, with events climaxing in the Infinite Crisis limited series. Immediately after this event, DC's ongoing series jumped forward a full year in their in-story continuity, as DC launched a weekly series, 52, to gradually fill in the missing time. Concurrently, DC lost the copyright to "Superboy" (while retaining the trademark) when the heirs of Jerry Siegel used a provision of the 1976 revision to the copyright law to regain ownership._NEWLINE_In 2005, DC launched its "All-Star" line (evoking the title of the 1940s publication), designed to feature some of the company's best-known characters in stories that eschewed the long and convoluted continuity of the DC Universe. The line began with All-Star Batman & Robin the Boy Wonder and All-Star Superman, with All-Star Wonder Woman and All-Star Batgirl announced in 2006 but neither being released nor scheduled as of the end of 2009._NEWLINE_DC licensed characters from the Archie Comics imprint Red Circle Comics by 2007. They appeared in the Red Circle line, based in the DC Universe, with a series of one-shots followed by a miniseries that lead into two ongoing titles, each lasting 10 issues. _START_SECTION_ 2010s _START_PARAGRAPH_ In 2011, DC rebooted all of its running titles following the Flashpoint storyline. The reboot called The New 52 gave new origin stories and costume designs to many of DC's characters._NEWLINE_DC licensed pulp characters including Doc Savage and the Spirit which it then used, along with some DC heroes, as part of the First Wave comics line launched in 2010 and lasting through fall 2011._NEWLINE_In May 2011, DC announced it would begin releasing digital versions of their comics on the same day as paper versions._NEWLINE_On June 1, 2011, DC announced that it would end all ongoing series set in the DC Universe in August and relaunch its comic line with 52 issue #1s, starting with Justice League on August 31 (written by Geoff Johns and drawn by Jim Lee), with the rest to follow later on in September._NEWLINE_On June 4, 2013, DC unveiled two new digital comic innovations to enhance interactivity: DC² and DC² Multiverse. 
DC² layers dynamic artwork onto digital comic panels, adding a new level of dimension to digital storytelling, while DC² Multiverse allows readers to determine a specific story outcome by selecting individual characters, storylines and plot developments while reading the comic, meaning one digital comic has multiple outcomes. DC² will first appear in the upcoming digital-first title, Batman '66, based on the 1960s television series and DC² Multiverse will first appear in Batman: Arkham Origins, a digital-first title based on the video game of the same name._NEWLINE_In 2014, DC announced an eight-issue miniseries titled "Convergence" which began in April 2015._NEWLINE_In 2016, DC announced a line-wide relaunch titled DC Rebirth. The new line would launch with an 80-page one-shot titled DC Universe: Rebirth, written by Geoff Johns, with art from Gary Frank, Ethan Van Sciver, and more. After that, many new series would launch with a twice-monthly release schedule and new creative teams for nearly every title. The relaunch was meant to bring back the legacy and heart many felt had been missing from DC characters since the launch of the New 52. Rebirth brought huge success, both financially and critically. _START_SECTION_ Logo _START_PARAGRAPH_ DC's first logo appeared on the April 1940 issues of its titles. The letters "DC" stood for Detective Comics, the name of Batman's flagship title. The small logo, with no background, read simply, "A DC Publication"._NEWLINE_The November 1941 DC titles introduced an updated logo. This version was almost twice the size of the previous one and was the first version with a white background. The name "Superman" was added to "A DC Publication", effectively acknowledging both Superman and Batman. This logo was the first to occupy the top-left corner of the cover, where the logo has usually resided since. The company now referred to itself in its advertising as "Superman-DC"._NEWLINE_In November 1949, the logo was modified to incorporate the company's formal name, National Comics Publications. This logo would also serve as the round body of Johnny DC, DC's mascot in the 1960s._NEWLINE_In October 1970, DC briefly retired the circular logo in favour of a simple "DC" in a rectangle with the name of the title, or the star of the book; the logo on many issues of Action Comics, for example, read "DC Superman". An image of the lead character either appeared above or below the rectangle. For books that did not have a single star, such as anthologies like House of Mystery or team series such as Justice League of America, the title and "DC" appeared in a stylized logo, such as a bat for "House of Mystery". This use of characters as logos helped to establish the likenesses as trademarks, and was similar to Marvel's contemporaneous use of characters as part of its cover branding._NEWLINE_DC's "100 Page Super-Spectacular" titles and later 100-page and "Giant" issues published from 1972 to 1974 featured a logo exclusive to these editions: the letters "DC" in a simple sans-serif typeface within a circle. A variant had the letters in a square._NEWLINE_The July 1972 DC titles featured a new circular logo. The letters "DC" were rendered in a block-like typeface that would remain through later logo revisions until 2005. The title of the book usually appeared inside the circle, either above or below the letters._NEWLINE_In December 1973, this logo was modified with the addition of the words "The Line of DC Super-Stars" and the star motif that would continue in later logos. 
This logo was placed in the top center of the cover from August 1975 to October 1976._NEWLINE_When Jenette Kahn became DC's publisher in late 1976, she commissioned graphic designer Milton Glaser to design a new logo. Popularly referred to as the "DC bullet", this logo premiered on the February 1977 titles. Although it varied in size and colour and was at times cropped by the edges of the cover, or briefly rotated 4 degrees, it remained essentially unchanged for nearly three decades. Despite logo changes since 2005, the old "DC bullet" continues to be used only on the DC Archive Editions series._NEWLINE_In July 1987, DC released variant editions of Justice League #3 and The Fury of Firestorm #61 with a new DC logo. It featured a picture of Superman in a circle surrounded by the words "SUPERMAN COMICS". The company released these variants to newsstands in certain markets as a marketing test._NEWLINE_On May 8, 2005, a new logo (dubbed the "DC spin") was unveiled, debuting on DC titles in June 2005 with DC Special: The Return of Donna Troy #1 and the rest of the titles the following week. In addition to comics, it was designed for DC properties in other media; it was used for movies beginning with Batman Begins, with Superman Returns showing the logo's normal variant, and for the TV series Smallville, the animated series Justice League Unlimited and others, as well as for collectibles and other merchandise. The logo was designed by Josh Beatman of Brainchild Studios and DC executive Richard Bruning._NEWLINE_In March 2012, DC unveiled a new logo consisting of the letter "D" flipping back to reveal the letter "C" and "DC ENTERTAINMENT". The Dark Knight Rises was the first film to use the new logo, while the TV series Arrow was the first series to feature it._NEWLINE_DC Entertainment announced a new identity and logo for another iconic DC Comics universe brand on May 17, 2016. The new logo was first used on May 25, 2016, in conjunction with the release of DC Universe: Rebirth Special #1 by Geoff Johns. _START_SECTION_ DC Universe _START_PARAGRAPH_ DC Universe is a video on demand service operated by DC Entertainment. It was announced in April 2017, with the title and service formally announced in May 2018. DC Universe is expected to offer more than video content through the inclusion of an immersive experience with fan interaction that encompasses comics in addition to television. _START_SECTION_ Digital distribution _START_PARAGRAPH_ DC Comics are available in digital form through several sources._NEWLINE_Free services: In 2015, Hoopla Digital became the first library-based digital system to distribute DC Comics._NEWLINE_Paid services: Google Play, Comixology
 + _START_ARTICLE_ Prelude in A minor, Op. 51, No. 2 (Scriabin) _START_PARAGRAPH_ Alexander Scriabin's Prelude Opus 51 No. 2 is the second of his Quatre Morceaux (Four Pieces) op. 51, published in 1906. It is notated in A minor. It is written in 6/8 time over 30 measures (plus an upbeat) and is marked Lugubre (mournful)._NEWLINE_This is one of several pieces Scriabin never played in public (together with the Sonata No. 6 (op. 62)). He called it "Shattered Strings" (German "Zersprungene Saiten") when Leonid Sabaneyev reminded him of the piece during a discussion about minor and major. Sabaneyev quotes him as saying: "Oh, let's not talk about this! This is a ghastly piece! [...] I was in an appalling situation back then. This Prelude, and also the Marche funebre in the First Sonata, formed in moments of disheartenment... But only these two!"
(referring to his allegation that he had abandoned the minor tonality a long time ago).
 + _START_ARTICLE_ Teté _START_SECTION_ Biography _START_PARAGRAPH_ Teté was a major football coach in Rio Grande do Sul. He became known as the "Marshal of Victories" because he was an officer of the Army Reserve._NEWLINE_As a player, he served in the 9º Regimento. He then trained Farroupilha (after the change of the club's name), Pelotas, Brazil, Guarany of Bagé, General Osorio, Cruzeiro-RS, Nacional-RS and Internacional._NEWLINE_At Internacional, Teté did well. He coached the team from 1951 to 1957 and was a four-time Gaucho champion (1951, 1952, 1953 and 1955). He also coached the Brazil national team, becoming champion of the 1956 Pan American Championship in Mexico.
 + _START_ARTICLE_ Lauryn Hill _START_SECTION_ 1975–1993: Early life and career beginnings _START_PARAGRAPH_ Lauryn Noelle Hill was born on May 26, 1975, in East Orange, New Jersey. Her mother, Valerie Hill, was an English teacher and her father, Mal Hill, a computer and management consultant. She has one older brother named Malaney who was born in 1972. Her Baptist family moved to New York and Newark for short periods before settling in South Orange, New Jersey._NEWLINE_Hill has said of her musically oriented family: "there were so many records, so much music constantly being played. My mother played the piano, my father sang, and we were always surrounded by music." Her father sang in local nightclubs and at weddings. While growing up, Hill frequently listened to Curtis Mayfield, Stevie Wonder, Aretha Franklin, and Gladys Knight; years later she recalled playing Marvin Gaye's What's Going On repeatedly until she fell asleep to it._NEWLINE_In middle school, Lauryn Hill performed "The Star-Spangled Banner" before a basketball game. Due to its popularity, subsequent games featured a recording of her rendition. In 1988, Hill appeared as an Amateur Night contestant on It's Showtime at the Apollo. She sang her version of the Smokey Robinson track "Who's Lovin' You", garnering an initially harsh reaction from the crowd. She persevered through the performance; however, she later cried off-stage._NEWLINE_Hill attended Columbia High School, where she was a member of the track team and cheerleading squad, and was a classmate of actor Zach Braff. She also took violin lessons, went to dance class, and founded the school's gospel choir. Academically, she took advanced placement classes and received primarily 'A' grades. School officials recognized her as a leader among the student body. Later recalling her education, Hill commented, "I had a love for—I don't know if it was necessarily for academics, more than it just was for achieving, period. If it was academics, if it was sports, if it was music, if it was dance, whatever it was, I was always driven to do a lot in whatever field or whatever area I was focusing on at the moment."_NEWLINE_While Hill was a freshman in high school, Prakazrel "Pras" Michel approached her through mutual friends about a music group he was creating. Hill and Pras began under the name Translator Crew. They came up with this name because they wanted to rhyme in different languages. Another female vocalist was soon replaced by Michel's cousin, multi-instrumentalist Wyclef Jean. The group began performing in local showcases and high school talent shows. Hill was initially only a singer, but then learned to rap too; instead of modeling herself on female rappers like Salt-N-Pepa and MC Lyte, she preferred male rappers like Ice Cube and developed her flow from listening to them.
Hill later said, "I remember doing my homework in the bathroom stalls of hip-hop clubs."_NEWLINE_While growing up, Hill took acting lessons in Manhattan. She began her acting career in 1991, appearing with Jean in Club XII, MC Lyte's Off-Broadway hip-hop rendering of Shakespeare's Twelfth Night. While the play was not a success, an agent noticed her. Later that year, Hill began appearing on the soap opera As the World Turns in a recurring role as troubled teenager Kira Johnson. She subsequently co-starred alongside Whoopi Goldberg in the 1993 release Sister Act 2: Back in the Habit, playing Rita Louise Watson, an inner-city Catholic school teenager with a surly, rebellious attitude. In it, she performed the songs "His Eye Is on the Sparrow" (a duet with Tanya Blount) and "Joyful, Joyful". Director Bill Duke credited Hill with improvising a rap in a scene: "None of that was scripted. That was all Lauryn. She was amazing." Critic Roger Ebert called her "the girl with the big joyful voice", although he thought her talent was wasted, while Rolling Stone said she "performed marvelously against type ... in the otherwise perfunctory [film]." Hill also appeared in Steven Soderbergh's 1993 motion picture King of the Hill, in a minor but pivotal role as a 1930s gum-popping elevator operator. Soderbergh biographer Jason Wood described her as supplying one of the warmest scenes in the film. Hill graduated from Columbia High School in 1993. _START_SECTION_ 1994–1996: The Fugees _START_PARAGRAPH_ Pras, Hill and Jean renamed their group the Fugees, a derivative of the word "refugee", which was a derogatory term for Haitian Americans. Hill began a romantic relationship with Jean. The Fugees, who signed a contract with Columbia/Ruffhouse Records in 1993, became known for their genre blending, particularly of reggae, rock and soul, which they first experimented with on their debut album, Blunted on Reality, released in 1994. It reached number 62 on the Billboard Top R&B/Hip-Hop Albums chart but overall sold poorly and was met with poor critical reviews, due in part to their management's insistence that they adopt gangsta rap attitudes. Although the album made little impact, Hill's rapping on "Some Seek Stardom" was seen as a highlight. Within the group, she was frequently referred to by the nickname "L. Boogie". Hill's image and artistry, as well as her full, rich, raspy alto voice, placed her at the forefront of the band, with some fans urging her to begin a solo career._NEWLINE_The Fugees' second album, The Score (1996), peaked at number one on the U.S. Billboard 200 and stayed in the top ten of that chart for over half a year. It sold about six million copies in the United States and more than 17 million copies worldwide. In the 1996 Pazz & Jop Critics Poll, The Score came second in the list of best albums and three of its tracks placed within the top twenty best singles. It won the Grammy Award for Best Rap Album, and was later included on Rolling Stone's list of the 500 greatest albums of all time. Almost all of the writing and producing for it was done by Jean. The Score garnered praise for being a strong alternative to the gangsta idiom, and Hill stated, "We're trying to do something positive with the music because it seems like only the negative is rising to the top these days. It only takes a drop of purity to clean a cesspool."_NEWLINE_Singles from The Score included "Fu-Gee-La" and "Ready or Not", which highlighted Hill's singing and rapping abilities, and "No Woman, No Cry". 
Her rendition of "Killing Me Softly" became her breakout hit. Buttressed by what Rolling Stone publications later called Hill's "evocative" vocal line and her "amazing pipes", the track became pervasive on pop, R&B, hip hop, and adult contemporary radio formats. It won the Grammy Award for Best R&B Performance by a Duo or Group with Vocals. On the album, Hill combined African-American music and Caribbean music influences with socially conscious lyrics. Newsweek mentioned Hill's "irresistibly cute looks" and proclaimed her "the most powerful new voice in rap."_NEWLINE_At 21 years old, the now-famous Hill was still living at home with her parents. She had been enrolled at Columbia University during this period, and considered majoring in history as she became a sophomore, but left after about a year of total studies once sales of The Score went into the millions. In 1996, Hill responded to a false rumor on The Howard Stern Show that she had made a racist comment on MTV, saying "How can I possibly be a racist? My music is universal. And I believe in God. If I believe in God, then I have to love all of God's creations. There can be no segregation."_NEWLINE_In 1996, Hill founded the Refugee Project, a non-profit outreach organization that sought to transform the attitudes and behavior of at-risk urban youth. Part of this was Camp Hill, which offered stays in the Catskill Mountains for such youngsters; another was production of an annual Halloween haunted house in East Orange. Hill also raised money for Haitian refugees, supported clean water well-building projects in Kenya and Uganda, and staged a rap concert in Harlem to promote voter registration. A 1997 benefit event for the Refugee Project introduced a Board of Trustees for the organization that included Sean Combs, Mariah Carey, Busta Rhymes, Spike Lee, and others as members._NEWLINE_In 1997, the Fugees split to work on solo projects, which Jean later blamed on his tumultuous relationship with Hill and the fact he married his wife Claudinette while still involved with Hill. Meanwhile, in the summer of 1996 Hill had met Rohan Marley, a son of Bob Marley and a former University of Miami football player. Hill subsequently began a relationship with him, while still also involved with Jean. Hill became pregnant in late 1996, and on August 3, 1997, Marley and Hill's first child, Zion David, was born. The couple lived in Hill's childhood house in South Orange after she bought her parents a new house down the street._NEWLINE_Hill had a cameo appearance in the 1997 film Hav Plenty. In 1998, Hill took up another small, but important role in the film Restaurant; Entertainment Weekly praised her portrayal of the protagonist's pregnant former girlfriend as bringing vigor to the film. _START_SECTION_ 1997–1999: The Miseducation of Lauryn Hill _START_PARAGRAPH_ Hill recorded her solo record The Miseducation of Lauryn Hill from late 1997 through June 1998 at Tuff Gong Studios in Jamaica. The title was inspired by the book The Mis-Education of the Negro (1933) by Carter G. Woodson and The Education of Sonny Carson, a film and autobiographical novel. The album featured contributions from D'Angelo, Carlos Santana, Mary J. Blige and the then-unknown John Legend. Wyclef Jean initially did not support Hill recording a solo album, but eventually offered his production help; Hill turned him down. Several songs on the album concerned her frustration with the Fugees; "I Used to Love Him" dealt with the breakdown of the relationship between Hill and Wyclef Jean. 
Other songs such as "To Zion" spoke about her decision to have her first baby, even though many at the time encouraged her to have an abortion so to not interfere with her blossoming career. Indeed, Hill's pregnancy revived her from a period of writer's block._NEWLINE_In terms of production, Hill collaborated with a group of musicians known as New Ark, consisting of Vada Nobles, Rasheem Pugh, Tejumold Newton, and Johari Newton. Hill later said that she wanted to "write songs that lyrically move me and have the integrity of reggae and the knock of hip-hop and the instrumentation of classic soul" and that the production on the album was intended to make the music sound raw and not computer-aided. Hill spoke of pressure from her label to emulate Prince, wherein all tracks would be credited as written and produced by the artist with little outside help. She also wanted to be appreciated as an auteur as much as Jean had within the Fugees. (She also saw a feminist cause: "But step out and try and control things and there are doubts. This is a very sexist industry. They'll never throw the 'genius' title to a sister.") While recording the album, when Hill was asked about providing contracts or documentation to the musicians, she replied, "We all love each other. This ain't about documents. This is blessed."_NEWLINE_Released on August 25, 1998, the album received rave reviews from contemporary music critics, and was the most acclaimed album of 1998. Critics lauded the album's blending of the R&B, doo-wop, pop, hip-hop, and reggae genres and its honest representation of a woman's life and relationships. David Browne, writing in Entertainment Weekly, called it "an album of often-astonishing power, strength, and feeling", and praised Hill for "easily flowing from singing to rapping, evoking the past while forging a future of her own". Robert Christgau quipped, "PC record of the year—songs soft, singing ordinary, rapping skilled, rhymes up and down, skits de trop, production subtle and terrific". In 2017 NPR rated the album as the 2nd best album of all time created by a woman._NEWLINE_It sold over 423,000 copies in its first week (boosted by advance radio play of two non-label-sanctioned singles, "Lost Ones" and "Can't Take My Eyes Off You") and topped the Billboard 200 for four weeks and the Billboard R&B Albums chart for six weeks. It went on to sell about 8 million copies in the U.S. and 12 million copies worldwide. During 1998 and 1999, Hill earned $25 million from record sales and touring. Hill, along with Blige, Missy Elliott, Meshell Ndegeocello, Erykah Badu, and others, found a voice with the neo soul genre._NEWLINE_The first single released from the album was "Lost Ones", which reached number 27 in Spring 1998. The second was "Doo Wop (That Thing)", which debuted at number one on the Billboard Hot 100 chart. It exemplified Hill's appeal, combining feelings of self-empowerment with self-defense. Other charted singles from the album were "Ex-Factor", which has been interpolated by Drake and Cardi B, "Everything Is Everything" and "To Zion". In the 1998 Pazz & Jop Critics Poll, Miseducation came second in the list of best albums and "Doo Wop (That Thing)" second in best singles._NEWLINE_In November 1998, Marley and Hill's second child, Selah Louise, was born. Of being a young mother of two, Hill said, "It's not an easy situation at all. 
You have to really pray and be honest with yourself."_NEWLINE_In the run-up to the 1999 Grammy Awards, Hill became the first woman to be nominated in ten categories in a single year. In addition to Miseducation works, the nominations included her rendition of "Can't Take My Eyes Off You" for the 1997 film Conspiracy Theory, which had appeared on Billboard charts, and Hill's writing and producing of "A Rose Is Still a Rose", which became a late-in-career hit for Aretha Franklin. She appeared on several magazine covers, including Time, Esquire, Rolling Stone, Teen People, and The New York Times Fashion Magazine. During the ceremony, Hill broke another record by becoming the first woman to win five times in one night, taking home the awards for Album of the Year, Best R&B Album, Best R&B Song, Best Female R&B Vocal Performance, and Best New Artist. During an acceptance speech, she said, "This is crazy. This is hip-hop!" Hill had brought forth a new, mainstream acceptance of the genre._NEWLINE_In February 1999, Hill received four awards at the 30th Annual NAACP Image Awards. In May 1999, she became the youngest woman ever named to Ebony magazine's 100+ Most Influential Black Americans list; in November of that year, the same publication named her as one of "10 For Tomorrow" in the "Ebony 2000: Special Millennium Issue". In May 1999, she made People magazine's 50 Most Beautiful People list. The publication, which has called her "model-gorgeous", praised the 5-foot-4-inch (1.63 m) Hill for her idiosyncratic sense of personal style. In June 1999, she received an Essence Award, but her acceptance speech, where she said there was no contradiction in religious love and servitude and "[being] who you are, as fly and as hot and as whatever," drew reaction from those in the public who thought she was not a good role model as a young, unwed mother of two. This was a repetition of criticism she had received after the birth of her first child, and she had said that she and Marley would soon be married. In early 2000, Hill was one of many artists and producers to share the Grammy Award for Album of the Year for Santana's 1999 multi-million-selling Supernatural, for which she had written, produced, and rapped on the track "Do You Like the Way" (a rumination on the direction the world was headed, it also featured the singing of CeeLo Green and the signature guitar runs of Carlos Santana). She was also nominated for Best R&B Song for "All That I Can Say", which she had written and produced for Mary J. Blige. Also, her concocted duet with Bob Marley on "Turn Your Lights Down Low" for the 1999 remix tribute album Chant Down Babylon additionally appeared in the 1999 film The Best Man and later received a Grammy nomination for Best Pop Collaboration with Vocals._NEWLINE_In November 1998, New Ark filed a fifty-page lawsuit against Hill, her management, and record label, claiming that Hill "used their songs and production skills, but failed to properly credit them for the work" on Miseducation. The musicians claimed to be the primary songwriters on two tracks, and major contributors on several others, though Gordon Williams, a prominent recorder, engineer, and mixer on Miseducation, described the album as a "powerfully personal effort by Hill" and said "It was definitely her vision." Hill responded that New Ark had been appropriately credited and now were seeking to take advantage of her success. New Ark requested partial writing credits on most of the tracks on the album as well as monetary reimbursement. 
After many delays, depositions took place during the latter part of 2000. In part, the case illustrated the difficult boundaries between songwriting and all other aspects that went into contemporary arranging, sampling, and recording. The suit would eventually be settled out of court in February 2001, with Hill paying New Ark a reported $5 million. A friend of Hill's later said of the suit, "That was the beginning of a chain effect that would turn everything a little crazy." _START_SECTION_ 2000–2003: Self-imposed exile and MTV Unplugged No. 2.0 _START_PARAGRAPH_ Hill began writing a screenplay about the life of Bob Marley, in which she planned to act as his wife Rita. She also began producing a romantic comedy about soul food with a working title of Sauce, and accepted a starring role in the film adaptation of Toni Morrison's novel Beloved; she later dropped out of both projects due to pregnancy. She also reportedly turned down roles in Charlie's Angels (the part that went to Lucy Liu), The Bourne Identity, The Mexican, The Matrix Reloaded, and The Matrix Revolutions._NEWLINE_During 2000, Hill dropped out of the public eye. The pressures of fame began to overwhelm her. She disliked not being able to go out of her house to do simple errands without having to worry about her physical appearance. She fired her management team and began attending Bible study classes five days a week; she also stopped doing interviews, watching television and listening to music. She started associating with a "spiritual advisor" named Brother Anthony. Some familiar with Hill believe Anthony more resembled a cult leader than a spiritual advisor, and thought his guidance probably inspired much of Hill's more controversial public behavior._NEWLINE_She later described this period of her life to Essence saying "People need to understand that the Lauryn Hill they were exposed to in the beginning was all that was allowed in that arena at that time … I had to step away when I realized that for the sake of the machine, I was being way too compromised. I felt uncomfortable about having to smile in someone's face when I really didn't like them or even know them well enough to like them." She also spoke about her emotional crisis, saying, "For two or three years I was away from all social interaction. It was a very introspective time because I had to confront my fears and master every demonic thought about inferiority, about insecurity or the fear of being black, young and gifted in this western culture." She went on to say that she had to fight to retain her identity, and was forced "to deal with folks who weren't happy about that."_NEWLINE_In July 2001, while pregnant with her third child, Hill unveiled her new material to a small crowd, for a taping of an MTV Unplugged special. An album of the concert, titled MTV Unplugged No. 2.0, was released in May 2002 and featured only her singing and playing an acoustic guitar. Unlike the near-unanimous praise of Miseducation, 2.0 sharply divided critics. AllMusic gave the album 4 out of 5 stars, saying that the recording "is the unfinished, unflinching presentation of ideas and of a person. It may not be a proper follow-up to her first album, but it is fascinating." Rolling Stone called the album "a public breakdown" and Robert Hilburn of the Los Angeles Times said the album's title opened Hill up for jokes that she had become unhinged. 
NME wrote that "Unplugged 2.0 is a sparse and often gruelling listen, but there is enough genius shading these rough sketches to suggest that all might not yet be lost." With the mixed reviews and no significant radio airplay, 2.0 debuted at number three on the Billboard 200. The album has been certified Platinum in the US by the RIAA._NEWLINE_Her song "Mystery of Iniquity" was nominated for a Grammy Award for Best Female Rap Solo Performance and used as an interpolation by hip-hop producer/songwriter Kanye West for his single "All Falls Down", which was co-written with Lauryn Hill, as sung by Syleena Johnson._NEWLINE_Around 2001, Marley and Hill's third child, Joshua Omaru, was born. He was followed a year later by their fourth, John Nesta. While Hill sometimes had spoken of Marley as her husband, they never married, and along the way she was informed that Marley had been previously married at a young age. Furthermore, according to a 2003 Rolling Stone report, he had never secured a divorce; but Marley later disputed this and made public to a blog a 1996 divorce document from Haiti. The two had been living in a high-end Miami hotel, but around 2003 she moved out into her own place in that city. Hill later said that she and Marley "have had long periods of separation over the years". Hill slowly worked on a new album and it was reported that by 2003, Columbia Records had spent more than $2.5 million funding it, including installing a recording studio in the singer's Miami apartment and flying different musicians around the country._NEWLINE_By 2002, Hill had shut down her non-profit Refugee Project. She said, "I had a nonprofit organization and I had to shut all that down. You know, smiling with big checks, obligatory things, not having things come from a place of passion. That's slavery. Everything we do should be a result of our gratitude for what God has done for us. It should be passionate."_NEWLINE_In December 2003, Hill, during a performance in Vatican City, spoke of the "corruption, exploitation, and abuses" in reference to the molestation of boys by Catholic priests in the United States and the cover-up of offenses by Catholic Church officials. High-ranking church officials were in attendance, but Pope John Paul II was not present. The Catholic League called Hill "pathologically miserable" and claimed her career was "in decline". The following day, several reporters suggested that Hill's comments at the Vatican may have been influenced by her spiritual advisor, Brother Anthony. _START_SECTION_ 2004–2009: Sporadic touring and recording _START_PARAGRAPH_ In 2004, Hill contributed a new song, "The Passion", to The Passion of the Christ: Songs. A remix version with John Legend of his "So High" ended up receiving a Grammy Award nomination for Best R&B Performance by a Duo or Group with Vocals. Around this time, Hill began selling a pay-per-view music video of the song "Social Drugs" through her website. Those who purchase the $15 video would only be able to view it three times before it expired. In addition to the video, Hill began selling autographed posters and Polaroids through her website, with some items listed at upwards of $500._NEWLINE_For the first time since 1997, the Fugees performed in September 2004 at Dave Chappelle's Block Party in the Bedford–Stuyvesant neighborhood of Brooklyn. The concert featured Hill's nearly a cappella rendition of "Killing Me Softly". The event was recorded by director Michel Gondry and was released on March 3, 2006, to universal acclaim. 
The Fugees also appeared at BET Awards 2005 during June 2005, where they opened the show with a 12-minute set. One track, "Take It Easy", was leaked online and thereafter was released as an Internet single in late September. It peaked at number forty on the Billboard R&B Chart. In 2005, she told USA Today, "If I make music now, it will only be to provide information to my own children. If other people benefit from it, then so be it." When asked how she now felt about the songs on 2.0, she stated "a lot of the songs were transitional. The music was about how I was feeling at the time, even though I was documenting my distress as well as my bursts of joy."_NEWLINE_The Fugees embarked on a European tour in late 2005. Old tensions between Hill and the other members of the group soon resurfaced, and the reunion ended before an album could be recorded; Jean and Michel both blamed Hill for the split. Hill reportedly demanded to be addressed by everyone, including her bandmates, as "Ms. Hill"; she also considered changing her moniker to "Empress". Hill's tardiness was also cited as a contributing factor._NEWLINE_Hill began touring on her own, although to mixed reviews; often arriving late to concerts (sometimes by over two hours), performing unpopular reconfigurations of her songs and sporting an exaggerated appearance. On some occasions, fans have booed her and left early. In June 2007, Sony Records said Hill had been recording through the past decade, had accumulated considerable unreleased material and had re-entered the studio with the goal of making a new album. Later that same year, an album titled Ms. Hill, which featured cuts from Miseducation, various soundtracks contributions and other "unreleased" songs, was released. It features guest appearances from D'Angelo, Rah Digga and John Forté. Also in June 2007, Hill released a new song, "Lose Myself", on the soundtrack to the film Surf's Up._NEWLINE_In early 2008, Marley and Hill's fifth child, Sarah, was born. The couple were not living together, although Marley considered them "spiritually together" even while listing himself as single on social media. Hill later said that she and Marley "have [had] a long and complex history about which many inaccuracies have been reported since the beginning" and that they both valued their privacy. By August 2008, Hill was living with her mother and children in her hometown of South Orange, New Jersey._NEWLINE_Reports in mid-2008 claimed that Columbia Records then believed Hill to be on hiatus. Marley disputed these claims, telling an interviewer that Hill has enough material for several albums: "She writes music in the bathroom, on toilet paper, on the wall. She writes it in the mirror if the mirror smokes up. She writes constantly. This woman does not sleep". One of the few public appearances Hill made in 2008 was at a Martha Stewart book-signing in New Jersey, perplexing some in the press. In April 2009, it was reported that Hill would engage in a 10-day tour of European summer festivals during mid-July of that year. She performed two shows for the tour and passed out on stage during the start of her second performance and left the stage. She refused to give refunds to angry consumers for the show. On June 10, Hill's management informed the promoters of the Stockholm Jazz Festival, which she was scheduled to headline, that she would not be performing due to unspecified "health reasons." Shortly afterward, the rest of the tour was canceled as well. 
_START_SECTION_ 2010–present: Further activities and imprisonment _START_PARAGRAPH_ In January 2010, Hill returned to the live stage and performed in stops across New Zealand and Australia on the Raggamuffin Music Festival. Many of the songs that Hill had performed and recorded over the past six years were included on an April 2010 unofficial compilation album titled Khulami Phase. The album also features a range of other material found on the Ms. Hill compilation. Hill appeared at the Harmony Festival in Santa Rosa, California, in June 2010, her first live American performance in several years. An unreleased song called "Repercussions" was leaked via the Internet in late July 2010, debuting at number 94 on Billboard's Hot R&B/Hip-Hop Songs (and peaked at number 83 the following week), making it her first Billboard chart appearance as a lead artist since 1999._NEWLINE_Hill joined the Rock the Bells hip-hop festival series in the U.S. during August 2010, and as part of that year's theme of rendering classic albums, she performed The Miseducation of Lauryn Hill in its entirety for the first time. She increased the tempo and urgency from the original recording, but at times had difficulty in communicating with her band. Hill continued touring, including a set at the 6th Annual Jazz in the Gardens, in Miami Gardens, Florida in December. In Spring 2011, Hill performed at the Coachella Valley Music Festival, New Orleans Jazz Fest, and at the Cosmopolitan of Las Vegas. In July 2011, Hill gave birth to her sixth child, Micah, her first not with Rohan Marley; the father remains publicly unknown._NEWLINE_In February 2012, Hill performed a new song titled "Fearless Vampire Killer", during a sold-out performance at the Warner Theater in Washington, D.C. In late 2012, Hill toured with rapper Nas; her portion of the tour, titled Black Rage, is named after her song, released October 30. Hill has described the song as being "about the derivative effects of racial inequity and abuse" and "a juxtaposition to the statement 'life is good,' which she believes can only be so when these long standing issues are addressed and resolved."_NEWLINE_In June 2012, Hill was charged with three counts of tax fraud or failing to file taxes (Title 26 USC § 7202 Willful failure to collect or pay over tax) not tax evasion on $1.8 million of income earned between 2005 and 2007. During this time she had toured as a musical artist, earned royalties from both her records and from films she had appeared in, and had owned and been in charge of multiple corporations. In a long post to her Tumblr, Hill said that she had gone "underground" and had rejected pop culture's "climate of hostility, false entitlement, manipulation, racial prejudice, sexism and ageism." She added that, "When I was working consistently without being affected by the interferences mentioned above, I filed and paid my taxes. This only stopped when it was necessary to withdraw from society, in order to guarantee the safety and well-being of myself and my family." On June 29, 2012, Hill appeared in the United States District Court for the District of New Jersey in Newark and pleaded guilty to the charges; her attorney said she would make restitution for the back taxes she owed. By April 22, 2013, Hill had paid back only $50,000 of the $554,000 she owed immediately; U.S. Magistrate Judge Madeline Cox Arleo criticized Hill, saying "This is not someone who stands before the court penniless. This is a criminal matter. 
Actions speak louder than words, and there has been no effort here to pay these taxes." Hill also faced possible eviction from her rented home in South Orange as well as a civil lawsuit from the town for running a business out of a home without a zoning permit._NEWLINE_On May 4, 2013, Hill released her first official single in over a decade, "Neurotic Society (Compulsory Mix)". She later published a message on her Tumblr describing how she was "required to release [it] immediately, by virtue of the impending legal deadline." The release received some criticism for lyrics that appeared to tie societal decay to certain LGBT social movements. Hill responded that the song was not targeted at any particular group but was instead focused on anyone hiding behind neurotic behavior. Following a deal with Sony Music, which involves Hill creating a new record label within the company, Hill was said to be scheduled to release her first album in fifteen years during 2013._NEWLINE_On May 6, 2013, Hill was sentenced by Judge Arleo to serve three months in prison for failing to file taxes/tax fraud and three months' house arrest afterwards as part of a year of supervised probation. She had faced a possible sentence of as long as 36 months, and the sentence given took into account her lack of a prior criminal record and her six minor-aged children. By this point Hill had fully paid back $970,000 in back taxes and penalties she owed, which also took into account an additional $500,000 that Hill had in unreported income for 2008 and 2009. In the courtroom, Hill said that she had lived "very modestly" considering how much money she had made for others, and that "I am a child of former slaves who had a system imposed on them. I had an economic system imposed on me." Hill reported to the minimum-security Federal Correctional Institution, Danbury on July 8, 2013, to begin serving her sentence._NEWLINE_Hill was released from prison on October 4, 2013, a few days early for good behavior, and began her home confinement and probationary periods. She put out a single called "Consumerism" that she had finished, via verbal and e-mailed instructions, while incarcerated. Judge Arleo allowed her to postpone part of her confinement in order to tour in late 2013 under strict conditions._NEWLINE_During 2014, Hill was heard as the narrator of Concerning Violence, an award-winning Swedish documentary on the African liberation struggles of the 1960s and 1970s. She also continued to draw media attention for her erratic behavior, appearing late twice in the same day for sets at Voodoo Fest in November 2014._NEWLINE_In May 2015, Hill canceled her scheduled concert outside Tel Aviv in Israel following a social media campaign from activists promoting the Boycott, Divestment and Sanctions campaign. She said she had wanted to also perform a show in Ramallah in the West Bank but logistical problems had proved too great. Hill stated: "It is very important to me that my presence or message not be misconstrued, or a source of alienation to either my Israeli or my Palestinian fans."_NEWLINE_Hill contributed her voice to the soundtrack for What Happened, Miss Simone?, a 2015 documentary about the life of Nina Simone, an American singer, pianist, and civil rights activist. Hill was originally supposed to record only two songs for the record, but ended up recording six. She also served as a producer on the compilation alongside Robert Glasper. Hill said of her connection to Simone: "Because I fed on this music ... 
I believed I always had a right to have a voice. Her example is clearly a form of sustenance to a generation needing to find theirs. What a gift." NPR praised Hill's performance on the soundtrack, stating: "This album mainly showcases Lauryn Hill's breadth and dexterity. Not formally marketed as Hill's comeback album, her six tracks here make this her most comprehensive set of studio recordings since The Miseducation of Lauryn Hill in 1998."_NEWLINE_In April 2016, Hill hosted and headlined what was billed as the inaugural Diaspora Calling! festival at the Kings Theatre in Brooklyn. The festival's purpose was to showcase the efforts of musicians and artists from around the African diaspora, such as the Brooklyn Haitian Rara band Brother High Full tempo. The following month, Hill was approximately 2 hours and 20 minutes late for her show at the Chastain Park Amphitheatre in Atlanta, though members of Hill's team claimed it was only an hour after their scheduled start time. Moments after the less-than-40-minute show ended due to the venue's strict 11:00 p.m. closing time, Hill said her driver had gotten lost and she could not help that. Less than 48 hours later, after a large backlash from her fans on Twitter, she took to her Facebook page and stated she was late for the concert because of certain needs, including her need to "align her energy with the time." _START_SECTION_ Legacy and sampling _START_PARAGRAPH_ Lauryn Hill's work continues to inspire rappers and can still be heard sampled in hip hop today. In 2018, Hill was sampled on Cardi B's "Be Careful", Drake's "Nice for What", and A$AP Rocky's "Purity". Kanye West has mentioned Lauryn Hill in several of his songs. In his 2007 song "Champion", he raps, "Lauryn Hill said her heart was in Zion, I wish her heart still was in rhymin". Zion is Hill's first-born son, whom she sings about in "To Zion". West also refers to Hill in his 2016 song "No More Parties in LA", in which he states, "I was uninspired since Lauryn Hill retired", a reference to Hill not having released any new music since 2013._NEWLINE_Other samples of Lauryn Hill's work come from artists such as J Cole, A Boogie Wit Da Hoodie, and Kanye West. J Cole samples Hill's "Nothing Even Matters" and "To Zion" in his song "Cole Summer". A Boogie samples several Hill songs, including "Ex-Factor" in his song of the same name as well as in a remix of Drake's "Nice for What". West famously samples Hill's "Mystery of Iniquity" in "All Falls Down". + _START_ARTICLE_ James Edward McManus _START_PARAGRAPH_ James Edward McManus, C.Ss.R. (October 10, 1900 – July 3, 1976) was an American prelate of the Roman Catholic Church. A Redemptorist, he served as Bishop of Ponce in Puerto Rico (1947–63) and as an auxiliary bishop of the Archdiocese of New York (1963–70). _START_SECTION_ Early life and education _START_PARAGRAPH_ James McManus was born in Brooklyn, New York, the eighth of nine children of William and Elizabeth (née O'Loughlin) McManus. He received his early education at the parochial school of Our Lady of Perpetual Help Church in Brooklyn from 1906 to 1914. In 1915, he enrolled at St. Mary's College, a preparatory school run by the Redemptorists in North East, Pennsylvania. He then studied at Mount St. Alphonsus Seminary at Esopus from 1922 to 1928. He made his profession as a Redemptorist in Ilchester, Maryland, on August 2, 1922. _START_SECTION_ Priesthood _START_PARAGRAPH_ On June 19, 1927, McManus was ordained to the priesthood in Esopus. 
He was assigned to the Puerto Rican mission in Caguas in 1929. He later returned to the continental United States to study at the Catholic University of America in Washington, D.C., where he earned a Doctor of Canon Law degree in 1937. He then served as professor of canon law at Mount St. Alphonsus Seminary until 1940, when he returned to Puerto Rico. He served as a pastor in Aguadilla (1940–45) and then in Mayagüez (1945–47). _START_SECTION_ Ponce _START_PARAGRAPH_ On May 10, 1947, McManus was appointed Bishop of Ponce by Pope Pius XII. He received his episcopal consecration on the following July 1 from Bishop William Tibertus McCarty, with Bishops Aloysius Joseph Willinger and William David O'Brien serving as co-consecrators, at Our Lady of Perpetual Help Church in Brooklyn. His biggest contribution as Bishop of Ponce was the founding of the Pontifical Catholic University of Puerto Rico in 1948. He also oversaw the move of a seminary from the Archdiocese of San Juan to his diocese in Aibonito._NEWLINE_During his tenure in Ponce, McManus became an outspoken critic of Luis Muñoz Marín, who served as Governor of Puerto Rico from 1949 to 1965. In the 1952 and 1956 elections, he opposed Muñoz Marín and supported the Republican Statehood Party, which demanded statehood for the island and proposed an economic plan similar to that of the continental Republican Party. In 1958, he feuded with Muñoz Marín over his program to crack down on gambling, including bingo games for the support of parish churches. He denounced the legalization of birth control measures and a law that would divorce couples who had been separated for more than three years. He also opposed the administration's measure to cut the tax-exempt donations to charity by corporations from 15% of gross income to 5% of surplus._NEWLINE_In 1960, after the Legislative Assembly failed to pass a law allowing religious instruction for schoolchildren, McManus said that the administration of Muñoz Marín was "responsible for the moral evils that cloud and de-Christianize our society." In August of that year, he helped organize the Christian Action Party, which he urged all Catholics to support. The party nominated Salvador Perea, a professor at the Pontifical Catholic University, as its candidate for governor, but was caught in a controversy over the validity of the signatures it collected to get on the ballot._NEWLINE_A month before the election, McManus and two other bishops issued a pastoral letter that prohibited Catholics from voting for Muñoz Marín's Popular Democratic Party, which they claimed "accepts as its own the morality of a 'regime of license,' denying Christian morality..." The letter also stated, "It is evident that the philosophy of the Popular Democratic Party is anti-Christian and anti-Catholic, and that it is based on the modern heresy that popular will and not divine law decides what is moral and immoral. This philosophy destroys the Ten Commandments of God and permits that they be substituted by popular and human criteria." McManus insisted that Catholics who disobeyed the injunction by voting for the Popular Democrats would commit a sin. The letter resulted in widespread protests in Puerto Rico and sparked open controversy within the Church. Cardinal Francis Spellman of New York declared that Puerto Rican voters would not be penalized by the Church while Archbishop James P. Davis of San Juan defended the bishops. 
Muñoz Marín denounced the letter as an "incredible medieval interference in a political campaign."_NEWLINE_Between 1962 and 1965, McManus attended all four sessions of the Second Vatican Council. _START_SECTION_ New York _START_PARAGRAPH_ McManus resigned as Bishop of Ponce for reasons of health on November 18, 1963. On the same date, he was appointed Auxiliary Bishop of New York and Titular Bishop of Benda by Pope Paul VI. He denied that his transfer to New York had anything to do with his opposition to Governor Muñoz Marín, calling his appointment "routine." As an auxiliary bishop, he served as pastor of St. Cecilia's Church in Manhattan (1964–66) and episcopal vicar of Sullivan and Ulster Counties, a post which he held until his retirement in 1970._NEWLINE_McManus died at the Monmouth Medical Center in Long Branch, New Jersey, at age 75. + _START_ARTICLE_ Kate Webb _START_PARAGRAPH_ Kate Webb (24 March 1943 – 13 May 2007) was a New Zealand-born Australian war correspondent for UPI and Agence France-Presse. She earned a reputation for dogged and fearless reporting throughout the Vietnam War, and at one point she was held prisoner for weeks by North Vietnamese troops. After the war, she continued to report from global hotspots including Iraq during the Gulf War. _START_SECTION_ Biography _START_PARAGRAPH_ Born Catherine Merrial Webb in Christchurch, New Zealand, Webb moved to Canberra, Australia, with her family while she was still a child. Her father, Leicester Chisholm Webb, was professor of political science at the Australian National University, and her mother, Caroline Webb, was active in women's organisations. Both her parents were killed when Kate was 18._NEWLINE_On 30 March 1958, at the age of 15, Catherine Webb was charged with the murder of Victoria Fenner, the adopted daughter of Frank Fenner, in Canberra. She supplied a rifle and bullets to Fenner and was present when Fenner shot herself. After a Children's Court hearing the charge was dropped._NEWLINE_She graduated from the University of Melbourne, then left to work for the Sydney Daily Mirror. In 1967 she quit the paper and travelled to Vietnam to cover the escalating war. Webb was soon hired by UPI and earned a reputation as a hard-drinking, chain-smoking war correspondent: She was the first wire correspondent to reach the U.S. Embassy in Saigon during the Tet offensive. With the death of Phnom Penh bureau chief Frank Frosch in 1970, Webb was selected to fill his position—she later claimed it was because she spoke French. In 1971 she made news herself when she was captured by North Vietnamese troops operating in Cambodia. Premature official reports claimed that a body discovered was Webb's, and The New York Times published an obituary. She emerged from captivity 23 days after she was captured, after having endured forced marches, interrogations, and malaria. She described her experiences in a book called On the Other Side, and in War Torn, a collection of reminiscences by women correspondents in the Vietnam War._NEWLINE_After her release from captivity and because of her sudden fame, UPI sent her to Washington DC as their show piece. Soon thereafter she threatened to resign if she did not get a "real job". She was reassigned to the Philippines as the UPI bureau chief in Manila._NEWLINE_After the war, she continued to work as a foreign correspondent for UPI and Agence France-Presse (AFP). 
She served as a correspondent in Iraq during the Gulf War, in Indonesia as Timor-Leste gained independence, and in South Korea, where she was the first to report the death of Kim Il-sung. She also reported from Afghanistan, and later described an incident in Kabul as the most frightening in her career. Following the collapse of Mohammad Najibullah's communist regime, she was captured by a local warlord and brought to a hotel, where she was brutally beaten and dragged up a flight of stairs by her hair. She finally escaped with the help of two fellow journalists, and hid out on a window ledge in the freezing Afghan winter while the warlord and his men searched the building for her._NEWLINE_Webb retired to the Hunter Region in 2001. She died of bowel cancer on 13 May 2007. In 2008, AFP established the Kate Webb Prize, worth €3,000 to €5,000, awarded annually to an Asian correspondent or agency that best exemplifies the spirit of Kate Webb. Webb was commemorated on an Australian postage stamp in 2017._NEWLINE_She is survived by a brother, Jeremy Webb, and a sister, Rachel Miller. + _START_ARTICLE_ Golden State (schooner) _START_SECTION_ Earlier 19th century schooner _START_PARAGRAPH_ The schooner "'Golden State,' of San Francisco" was involved in the 1858 lawsuit Wetherbee vs. Schooner "Golden State" (and Captain W. S. Tuttle). + _START_ARTICLE_ Blooming, Oregon _START_PARAGRAPH_ Blooming is an unincorporated community in Washington County, Oregon, United States, near the Tualatin River, about two miles south of Cornelius. Its elevation is 190 feet (58 m). There are several plant nurseries in the area._NEWLINE_The Blooming area was originally known as "the German Settlement" for a group of German immigrants who had settled there. Rev. Paul of St. Peter's Lutheran Church renamed the community "Blooming", for the area's flowers and as an indicator of the town's good prospects. Blooming post office ran from 1895 to 1904._NEWLINE_The center of the community was St. Peter's Evangelical Lutheran Church, which is the second-oldest congregation of the Lutheran Church–Missouri Synod in the Pacific Northwest. The church was founded and the original church building constructed in 1882; a portion of the current church was built in 1923._NEWLINE_Blooming has a Cornelius ZIP code. + _START_ARTICLE_ Maggi Kelly _START_SECTION_ Life _START_PARAGRAPH_ She played for the University of California, Berkeley. She graduated from the University of North Carolina at Chapel Hill, and from the University of Colorado._NEWLINE_She was a member of the United States women's national water polo team._NEWLINE_She participated in the 1994 and 1998 World Aquatic Championships. + _START_ARTICLE_ BoxChilli _START_SECTION_ History _START_PARAGRAPH_ Launched as a 2-person start-up in 2003, boxChilli has grown to a team of 14 with offices in Southampton and Portsmouth. BoxChilli doubled in size between 2012 and 2014 and has continued to grow. The company has won several awards: in 2014 it was a finalist in the Wirehive 100 award for the fastest-growing agency, and it was listed among the Wirehive top 100 digital agencies to work with outside London in 2014, 2015 and 2016. _START_SECTION_ Youth development _START_PARAGRAPH_ In 2014, boxChilli team member Kirsty Mallan was a finalist at the NatWest Venus Award (Portsmouth), recognising achievement by women under the age of 25. 
As well as hiring, training and supporting graduates and young managers, boxChilli works with Portsmouth-based training agency PETA Ltd to provide apprenticeships in web design, search and related areas. In 2016, Danielle Ellis was a semi-finalist in the young manager of the year category. _START_SECTION_ Charity work _START_PARAGRAPH_ boxChilli supports charities with in-kind donations of services. The company has made websites for a number of Hampshire-based charities, including Hampshire Wildlife Trust and CancerWise. The company also supports team members' individual fundraising efforts, including growing moustaches for Movember and entering cycling and running events. + _START_ARTICLE_ Tanmay Jahagirdar _START_SECTION_ Biography _START_PARAGRAPH_ He began his acting career at a very young age. He believes in being highly educated and holds an engineering diploma as well as a degree in mass media. His father, Vilas Jahagirdar, is a well-known figure in the Indian pharmaceutical industry. He is well versed in languages such as Marathi, Hindi and Urdu, and wishes to learn several others. He married Payal Wachasundar on 25 March 2019. _START_SECTION_ Career _START_PARAGRAPH_ His career spans work as a child artist in television, commercial advertisements and films. In 2002, he appeared in The truth - Yathharth, a feature film based on the darker side of life in rural India. The film's cast includes Milind Gunaji, Raghuveer Yadav and Shraddha Nigam. He played Raghu, a character who supports his friend Bijuria (Shraddha Nigam) through the ups and downs of his life. This challenging project proved momentous for him, as he got the chance to work with versatile actors of the generation and was appreciated for his on-screen presence. He belongs to the era of popular kids' series on Indian television. He appeared in Aryamaan - Brahmaand Ka Yodha, a sci-fi series in which he worked with TV legend Mukesh Khanna and earned a lot of appreciation for the character he played. He also appeared in Shaka Laka Boom Boom, a series based on a magical pencil that aired on Star Plus. Since then, there was no looking back for him in the television industry. Next, he appeared in a lead role in the serial Kya Mujhse Dosti Karoge on Hungama TV. He has also done advertising commercials for some well-known brands in India. In the Siyaram Suitings ad featuring Diya Mirza and international tennis player Boris Becker, he played an adorable and naughty Rajasthani kid. He also acted in commercials for BSA-ibike, Parry's Lacto King and others; these advertisements represent a nostalgic era of TV._NEWLINE_The milestone in his career came with Ram Gopal Verma's film Ab Tak Chappan (2004), in which he played Aman, the son of Nana Patekar's character Sadhu. He was highly praised for his pivotal character in the film, which went on to become a commercial box office hit and a critics' favorite of the year. After this, he appeared in the film Phir Kabhi (2009). He portrayed a young schoolboy representing the childhood of the character played by Gulshan Grover, a comic role in which he showed a commendable comic sense and was recognized and admired for his acting. He was also seen in an episode of the popular series Adalat on Sony TV in 2011. He has played roles in different genres such as thriller, comedy, drama and sci-fi superhero. 
He has worked with recognized banners such as UTV Motion Pictures and RGV Productions._NEWLINE_Recently, he made his comeback with Ab Tak Chhappan 2 (2015), in which he played Sadhu Agashe's grown-up son Aman. He portrayed a young but mature person who still believes in the beauty of life in spite of the odds. Aman has a passion and love for music, and this element of his character added an altogether different angle to the film. His role complemented the story, helping it progress with interesting twists. He has worked with stalwarts of Indian cinema such as Nana Patekar, Mithun Chakraborty, Ronit Roy and Mukesh Khanna. _NEWLINE_He continues to pursue his passion for acting and direction. + _START_ARTICLE_ Spring Creek, Madison County, North Carolina _START_PARAGRAPH_ Spring Creek is a tributary stream of the French Broad River in Madison County, North Carolina, with a length of approximately 17 miles. It flows in much of its lower course through a section of the Pisgah National Forest and passes the communities of Trust, Luck, and Joe. It joins the French Broad River in Hot Springs, North Carolina. + _START_ARTICLE_ Acanthocladium _START_SECTION_ Description _START_PARAGRAPH_ The spiny everlasting is a woody perennial shrub with spines at branch ends, covered in short white hair. It bears oblong, bumpy fruit._NEWLINE_Spiny everlasting was presumed extinct in 1992, having suffered habitat loss from clearance for winter crops, but various colonies of it have been found around Laura, near the Spencer Gulf. _START_SECTION_ Homonym _START_PARAGRAPH_ In 1883, William Mitten used the same name, Acanthocladium, to refer to a group of mosses, now in the family Sematophyllaceae. Several dozen species of mosses were described and placed in this genus before it was realized that Mitten's name represented an illegitimate homonym. The moss genus has since been renamed Wijkia H.A. Crum. + _START_ARTICLE_ Eddie O'Sullivan _START_SECTION_ Early career _START_PARAGRAPH_ O'Sullivan was born in Youghal, Cork, Ireland. After attending the Christian Brothers school in the town, he graduated from Thomond College, which a decade later became part of the University of Limerick._NEWLINE_O'Sullivan played for the Garryowen Football Club during the 1970s and 1980s at fly-half and wing, while teaching physical education, maths, and science in Mountbellew, County Galway. He played for Munster between 1983 and 1986 on the wing and was capped for Ireland A in 1984. He also played Gaelic football: in 1982, he played corner forward on the Mountbellew Moylough Gaelic football team. He was fitness advisor to the Galway senior football team, managed by John O'Mahony, which won two All-Ireland Senior Football titles in 1998 and 2001. _START_SECTION_ Coaching career _START_PARAGRAPH_ He started his coaching career at Monivea Rugby Club in north-east Galway in the early 1980s while still a teacher. He worked as a Development Officer for the Irish Rugby Football Union between 1988 and 1991. During that time and up until 1992, he was the fitness advisor to the Irish rugby team under head coach Ciaran Fitzgerald. He followed this with spells coaching at Blackrock College (first as assistant, then as head coach), at Connacht as assistant coach and head coach between 1992 and 1996, and with the Irish Under-21 side. The Under-21 side won the 1996 Triple Crown, beating Clive Woodward's England. 
Between 1997 and 1999, while working in the USA, he continued to coach the Buccaneers Rugby Club in Connacht, who won promotion from Division 3 to Division 1 of the All-Ireland League and reached the top four of the tournament in their first season in Division 1._NEWLINE_After failing to secure a high-profile coaching position in Ireland, O'Sullivan moved to America to coach the US Eagles, where he worked as forwards coach at the 1999 Rugby World Cup. He also worked as Technical Director to USA Rugby between 1997 and 1999. As Technical Director he developed and delivered the USA Rugby Coach Education Programme, which certifies coaches at Foundation, Level I, Level II and Level III. He was then appointed assistant coach of the Irish national side in 1999, serving as backs coach. During his time as assistant coach he was credited with advancing Irish back play considerably while working with players such as Brian O'Driscoll, Ronan O'Gara, Peter Stringer, Rob Henderson, Shane Horgan and Denis Hickie. In 2001 he was appointed head coach following the controversial departure of Warren Gatland. _START_SECTION_ Ireland _START_PARAGRAPH_ In his first year, Ireland finished in third place in the 2002 Six Nations Championship. O'Sullivan's Ireland went on to achieve second place in 2003, only losing the Grand Slam in the final match against England. At the 2003 Rugby World Cup his team lost to France in the quarter-finals._NEWLINE_Ireland again missed out in the 2004 Six Nations Championship, losing the Grand Slam to France this time, but went on to win Ireland's first Triple Crown in 19 years. While transitioning the team during the 2005 championship, O'Sullivan's side finished in third place with defeats by France and Wales. In 2006, defeat to France cost Ireland the Championship. In 2007, Ireland again lost the championship to France on points difference: on the final day of the tournament, despite Ireland defeating Italy heavily in Rome (51-24), France defeated Scotland with a controversial try in the final minute of the game to again deny Ireland a Six Nations Championship. The fact that the French played later in the day than Ireland gave them the advantage of knowing exactly what score they needed to secure the Championship. This fuelled the discussion about games not kicking off at the same time on the final day of the tournament._NEWLINE_During his tenure as head coach of Ireland, O'Sullivan won three Triple Crowns, in 2004, 2006 and 2007. Ireland also defeated Australia twice (2002 and 2006) and South Africa twice (2004 and 2006). Ireland defeated England in the Six Nations Championship in four consecutive years (2004-2007), including a record victory (43-13) at the iconic Croke Park stadium in 2007. O'Sullivan also coached the Barbarians R.F.C. to victory over the 2007 World Cup champions South Africa in November 2007. _START_SECTION_ 2007 World Cup campaign _START_PARAGRAPH_ In August 2007, O'Sullivan's contract with the IRFU was extended for a further four years, which meant that he was contracted to be in charge of the Irish rugby team until 2012. Part of the terms of the contract allowed him to leave the position temporarily to coach the 2009 Lions squad, were he to be offered that role. Soon, however, he was the subject of press criticism after a run of poor results. Ireland turned in poor performances in the opening matches of the World Cup against the lower-rated Georgia and Namibia. 
They had previously also struggled in pre-tournament games against Italy and Scotland._NEWLINE_Criticisms included a failure to inspire passion in the team and a lack of depth in the squad, which has been said to have caused complacency in the first team players. Many began to see the signing of his contract as a premature move. Rumours have abounded of conflict in the Irish during the tournament, and even that Geordan Murphy might leave the squad as a result of being dropped from the bench for the French game._NEWLINE_The poor performances continued with the failure of Ireland, for the second time in its history, to qualify for the quarter-final stages of the World Cup, finishing third in its Group with two wins and two losses._NEWLINE_After an extensive post Rugby World Cup review it was established that, despite rumours during the tournament, there was no discord among the playing squad. The failure to perform was identified as an over emphasis on strength and conditioning prior to the tournament along with only 2 pre-tournament warm-up games, leaving the squad short of match preparation. _START_SECTION_ End of Irish career _START_PARAGRAPH_ On 19 March 2008, O'Sullivan resigned from his job after a disappointing 6 Nations campaign. O'Sullivan finished with an overall success rate of 64% during his tenure with Ireland. During his time Ireland reached 3rd in the World Rugby ranking on 2 occasions in 2003 and 2006._NEWLINE_O'Sullivan released his autobiography, Never Die Wondering ISBN 978-1-84605-399-3, in autumn 2009. It was written with sports writer Vincent Hogan. _START_SECTION_ United States _START_PARAGRAPH_ After coaching the Irish Under-21 team to their first ever Triple Crown in 1996, O’Sullivan joined USA Rugby as Assistant Coach to the Eagles (Forwards Coach) and Assistant National Technical Director to USA Rugby. In 1998, he took over as National Technical Director while still retaining his position as Assistant Coach on the Eagles coaching staff. He worked with the Eagles at the 1999 Rugby World Cup before returning to Ireland as Assistant Coach to the Irish Rugby Team (Backs Coach)._NEWLINE_In 2009 the position of Head Coach to the USA Eagles Men's team came open when Scott Johnson announced he would leave the team at the end of the 2008–09 season to move to Ospreys of the Celtic League. O'Sullivan's agent reported on 16 February that O'Sullivan had signed a deal that would see him coach the United States through the 2011 World Cup. His hiring was officially announced on 4 March._NEWLINE_O'Sullivan coached the USA Eagles from 2009 up until the end of the 2011 Rugby World Cup. It was O'Sullivan's 5th appearance at a Rugby World Cup in a coaching capacity (RWC 1991 S&C Coach Ireland, RWC 1999 Assistant Coach USA, 2003 Head Coach Ireland, 2007 Head Coach Ireland, 2011 Head Coach USA) _START_SECTION_ Biarritz Olympique _START_PARAGRAPH_ In May 2014 O'Sullivan was confirmed as head coach of Biarritz Olympique. Relegated from the Top 14 following the 2013-14 season, Biarritz played in the Pro D2 league in 2014-15. O'Sullivan left Biarritz Olympique in October 2015. diff --git a/examples/language_model/bigbird/run_classifier.py b/examples/language_model/bigbird/run_classifier.py new file mode 100644 index 0000000000000000000000000000000000000000..bcddaa80c89e4128ed6536ffd17108638df5334e --- /dev/null +++ b/examples/language_model/bigbird/run_classifier.py @@ -0,0 +1,202 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import string +import time + +import numpy as np +import paddle +import paddle.nn as nn + +from paddlenlp.data import Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + BigBirdForSequenceClassification, + BigBirdTokenizer, + create_bigbird_rand_mask_idx_list, +) +from paddlenlp.utils.log import logger + +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--model_name_or_path", type=str, default="bigbird-base-uncased-finetune", help="pretraining model name or path" +) +parser.add_argument( + "--max_encoder_length", + type=int, + default=3072, + help="The maximum total input sequence length after SentencePiece tokenization.", +) +parser.add_argument("--learning_rate", type=float, default=1e-5, help="Learning rate used to train.") +parser.add_argument("--max_steps", default=10000, type=int, help="Max training steps to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default="checkpoints/", help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") +parser.add_argument("--attn_dropout", type=float, default=0.0, help="Attention ffn model dropout.") +parser.add_argument( + "--hidden_dropout_prob", type=float, default=0.0, help="The dropout rate for the embedding pooler." +) +parser.add_argument( + "--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model." 
+) +parser.add_argument("--seed", type=int, default=8, help="Random seed for initialization.") + +args = parser.parse_args() +TRANSLATOR = str.maketrans("", "", string.punctuation) + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def create_dataloader(batch_size, max_encoder_length, tokenizer, config, pad_val=0): + def _tokenize(text): + input_ids = [tokenizer.cls_id] + input_ids.extend(tokenizer.convert_tokens_to_ids(tokenizer._tokenize(text)[: max_encoder_length - 2])) + input_ids.append(tokenizer.sep_id) + input_len = len(input_ids) + if input_len < max_encoder_length: + input_ids.extend([pad_val] * (max_encoder_length - input_len)) + input_ids = np.array(input_ids).astype("int64") + return input_ids + + def _collate_data(data, stack_fn=Stack(dtype="int64")): + num_fields = len(data[0]) + out = [None] * num_fields + out[0] = stack_fn([_tokenize(x["text"].translate(TRANSLATOR)) for x in data]) + out[1] = stack_fn([x["label"] for x in data]) + seq_len = len(out[0][0]) + # Construct the random attention mask for the random attention + rand_mask_idx_list = create_bigbird_rand_mask_idx_list( + config["num_layers"], + seq_len, + seq_len, + config["nhead"], + config["block_size"], + config["window_size"], + config["num_global_blocks"], + config["num_rand_blocks"], + config["seed"], + ) + out.extend(rand_mask_idx_list) + return out + + def _create_dataloader(mode, tokenizer, max_encoder_length, pad_val=0): + dataset = load_dataset("imdb", splits=mode) + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=(mode == "train")) + data_loader = paddle.io.DataLoader( + dataset=dataset, batch_sampler=batch_sampler, collate_fn=_collate_data, return_list=True + ) + return data_loader + + train_data_loader = _create_dataloader("train", tokenizer, max_encoder_length, 0) + test_data_loader = _create_dataloader("test", tokenizer, max_encoder_length, 0) + return train_data_loader, test_data_loader + + +def main(): + # Initialization for the parallel environment + paddle.set_device(args.device) + set_seed(args) + # Define the model and metric + # In finetune task, bigbird performs better when setting dropout to zero. 
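+    # The attn_dropout / hidden_dropout_prob keyword arguments below override the values
+    # stored in the pretrained config, so no dropout is applied during finetuning.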
+ model = BigBirdForSequenceClassification.from_pretrained( + args.model_name_or_path, attn_dropout=args.attn_dropout, hidden_dropout_prob=args.hidden_dropout_prob + ) + + criterion = nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + # Define the tokenizer and dataloader + tokenizer = BigBirdTokenizer.from_pretrained(args.model_name_or_path) + config = getattr(model, BigBirdForSequenceClassification.base_model_prefix).config + train_data_loader, test_data_loader = create_dataloader( + args.batch_size, args.max_encoder_length, tokenizer, config + ) + + # Define the Adam optimizer + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.learning_rate, epsilon=1e-6) + + # Finetune the classification model + do_train(model, criterion, metric, optimizer, train_data_loader, tokenizer) + + # Evaluate the finetune model + do_evalute(model, criterion, metric, test_data_loader) + + +def do_train(model, criterion, metric, optimizer, train_data_loader, tokenizer): + model.train() + global_steps = 0 + tic_train = time.time() + for epoch in range(args.epochs): + for step, batch in enumerate(train_data_loader): + global_steps += 1 + input_ids, labels = batch[:2] + rand_mask_idx_list = batch[2:] + + output = model(input_ids, rand_mask_idx_list=rand_mask_idx_list) + loss = criterion(output, labels) + loss.backward() + optimizer.step() + optimizer.clear_grad() + correct = metric.compute(output, labels) + metric.update(correct) + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, acc:%f, speed: %.2f step/s" + % (global_steps, epoch, loss, metric.accumulate(), args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0: + output_dir = os.path.join(args.output_dir, "model_%d.pdparams" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + if global_steps >= args.max_steps: + return + + +@paddle.no_grad() +def do_evalute(model, criterion, metric, test_data_loader): + model.eval() + global_steps = 0 + for step, batch in enumerate(test_data_loader): + global_steps += 1 + input_ids, labels = batch[:2] + rand_mask_idx_list = batch[2:] + output = model(input_ids, rand_mask_idx_list=rand_mask_idx_list) + loss = criterion(output, labels) + correct = metric.compute(output, labels) + metric.update(correct) + if global_steps % args.logging_steps == 0: + logger.info("eval: global step %d, loss: %f, acc %f" % (global_steps, loss, metric.accumulate())) + logger.info("final eval: loss: %f, acc %f" % (loss, metric.accumulate())) + metric.reset() + model.train() + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/bigbird/run_glue.py b/examples/language_model/bigbird/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..fa0234e01c2359676bd64e8175caf9c339530230 --- /dev/null +++ b/examples/language_model/bigbird/run_glue.py @@ -0,0 +1,333 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import time +from functools import partial + +import args +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BigBirdForSequenceClassification, + BigBirdTokenizer, + LinearDecayWithWarmup, + create_bigbird_rand_mask_idx_list, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bigbird": (BigBirdForSequenceClassification, BigBirdTokenizer), +} + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + input_ids = [tokenizer.cls_id] + token_type_ids = None + + if (int(is_test) + len(example)) == 2: + input_ids.extend(tokenizer(example["sentence"])["input_ids"][: max_seq_length - 2]) + input_ids.append(tokenizer.sep_id) + input_len = len(input_ids) + token_type_ids = input_len * [0] + else: + input_ids1 = tokenizer(example["sentence1"])["input_ids"] + input_ids2 = tokenizer(example["sentence2"])["input_ids"] + total_len = len(input_ids1) + len(input_ids2) + tokenizer.num_special_tokens_to_add(pair=True) + 1 + if total_len > max_seq_length: + input_ids1, input_ids2, _ = tokenizer.truncate_sequences( + input_ids1, input_ids2, total_len - max_seq_length + ) + + input_ids.extend(input_ids1) + input_ids.append(tokenizer.sep_id) + input_len1 = len(input_ids) + + input_ids.extend(input_ids2) + input_ids.append(tokenizer.sep_id) + input_len2 = len(input_ids) - input_len1 + + token_type_ids = input_len1 * [0] + input_len2 * [1] + + input_len = len(input_ids) + if input_len < max_seq_length: + input_ids.extend([tokenizer.pad_id] * (max_seq_length - input_len)) + token_type_ids.extend([tokenizer.pad_token_type_id] * (max_seq_length - input_len)) + + if not is_test: + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +def collect_data(samples, dataset, config): + stack_fn = Stack(dtype="int64" if dataset.label_list else "float32") + stack_fn1 = Stack() + + num_fields = len(samples[0]) + out = [None] * num_fields + out[0] = stack_fn1([x[0] for x in samples]) # input_ids + out[1] = stack_fn1([x[1] for x in 
samples]) # token_type_ids + if num_fields >= 2: + out[2] = stack_fn(x[2] for x in samples) # labels + seq_len = len(out[0][0]) + # Construct the random attention mask for the random attention + rand_mask_idx_list = create_bigbird_rand_mask_idx_list( + config["num_layers"], + seq_len, + seq_len, + config["nhead"], + config["block_size"], + config["window_size"], + config["num_global_blocks"], + config["num_rand_blocks"], + config["seed"], + ) + out.extend(rand_mask_idx_list) + return out + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch[:3] + rand_mask_idx_list = batch[3:] + # run forward + logits = model(input_ids, segment_ids, rand_mask_idx_list=rand_mask_idx_list) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + logger.info( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ) + ) + elif isinstance(metric, Mcc): + logger.info("eval loss: %f, mcc: %s, " % (loss.numpy(), res[0])) + elif isinstance(metric, PearsonAndSpearman): + logger.info( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]) + ) + else: + logger.info("eval loss: %f, acc: %s, " % (loss.numpy(), res)) + model.train() + + +def do_train(args): + paddle.set_device(args.device) + worker_num = paddle.distributed.get_world_size() + if worker_num > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + # In finetune task, bigbird performs better when setting dropout to zero. 
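+    # num_classes is 1 only for the regression task (STS-B), which pairs with the MSELoss
+    # chosen below; passing attn_dropout / hidden_dropout_prob of 0.0 disables dropout.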
+ model = model_class.from_pretrained( + args.model_name_or_path, num_classes=num_classes, attn_dropout=0.0, hidden_dropout_prob=0.0 + ) + if worker_num > 1: + model = paddle.DataParallel(model) + config = getattr(model, model_class.base_model_prefix).config + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_encoder_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = partial(collect_data, dataset=train_ds, config=config) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
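+    # The match is on parameter-name substrings, so any parameter whose name contains
+    # "bias" or "norm" keeps a weight decay of zero via apply_decay_param_fun.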
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.CrossEntropyLoss() if train_ds.label_list else paddle.nn.MSELoss() + + metric = metric_class() + global_step = 0 + tic_train = time.time() + for epoch in range(args.epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, labels = batch[:3] + rand_mask_idx_list = batch[3:] + # run forward + logits = model(input_ids, segment_ids, rand_mask_idx_list=rand_mask_idx_list) + loss = loss_fct(logits, labels) + # run backward and update params + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(model, loss_fct, metric, dev_data_loader_matched) + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + logger.info("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + logger.info("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "%s_ft_model_%d.pdparams" % (args.task_name, global_step) + ) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = args.parse_args() + print_arguments(args) + assert args.device in [ + "cpu", + "gpu", + "xpu", + "npu", + ], "Invalid device! Available device should be cpu, gpu, xpu or npu." + do_train(args) diff --git a/examples/language_model/bigbird/run_pretrain.py b/examples/language_model/bigbird/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..81545f3fb0ad55c5b9ac2cd01167526d71fd24c3 --- /dev/null +++ b/examples/language_model/bigbird/run_pretrain.py @@ -0,0 +1,259 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import time + +import args +import numpy as np +import paddle +from paddle.io import DataLoader, Dataset + +from paddlenlp.data import Stack +from paddlenlp.transformers import ( + BigBirdForPretraining, + BigBirdPretrainingCriterion, + BigBirdTokenizer, + LinearDecayWithWarmup, + create_bigbird_rand_mask_idx_list, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "bigbird": (BigBirdForPretraining, BigBirdTokenizer), +} + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +class PretrainingDataset(Dataset): + def __init__(self, input_file, tokenizer, max_encoder_length=512, max_pred_length=75): + self.tokenizer = tokenizer + self.max_encoder_length = max_encoder_length + self.max_pred_length = max_pred_length + input_file = open(input_file, "r", encoding="utf-8") + self.lines = input_file.readlines() + + self.vocab_size = tokenizer.vocab_size + + def __getitem__(self, index): + line = self.lines[index].rstrip() + subtokens, masked_lm_positions, masked_lm_ids, masked_lm_weights = self.tokenizer._encode( + line, max_seq_len=self.max_encoder_length, max_pred_len=self.max_pred_length + ) + return [ + subtokens, + np.zeros_like(subtokens), + masked_lm_positions, + masked_lm_ids, + masked_lm_weights, + np.zeros([1], dtype="int64"), + ] + + def __len__(self): + return len(self.lines) + + +def set_seed(args): + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +def create_dataloader(input_file, tokenizer, worker_init, batch_size, max_encoder_length, max_pred_length, config): + pretrain_dataset = PretrainingDataset(input_file, tokenizer, max_encoder_length, max_pred_length) + train_batch_sampler = paddle.io.DistributedBatchSampler( + pretrain_dataset, batch_size=batch_size, shuffle=True, drop_last=True + ) + + # make masked_lm_positions can be gathered + def _collate_data(data, stack_fn=Stack()): + # Data Fields: input_ids, segment_ids, masked_lm_positions, masked_lm_ids, masked_lm_weights, next_sentence_labels + num_fields = len(data[0]) + out = [None] * num_fields + + for i in [0, 1, 5]: + out[i] = stack_fn([x[i] for x in data]) + batch_size, seq_length = out[0].shape + size = sum(len(x[2]) for x in data) + out[2] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[3] = np.full([size, 1], -1, dtype=np.int64) + # masked weight + out[4] = np.full([size], 0, dtype="float32") + # # Organize as a 1D tensor for gather or use gather_nd + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[2]): + out[2][mask_token_num] = i * seq_length + pos + out[3][mask_token_num] = x[3][j] + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + out.append(np.asarray([mask_token_num], dtype=np.float32)) + seq_len = len(out[0][0]) + rand_mask_idx_list = create_bigbird_rand_mask_idx_list( + config["num_layers"], + seq_len, + seq_len, + config["nhead"], + config["block_size"], + config["window_size"], + config["num_global_blocks"], + config["num_rand_blocks"], + config["seed"], + ) + out.extend(rand_mask_idx_list) + return out + + dataloader = DataLoader( + dataset=pretrain_dataset, + batch_sampler=train_batch_sampler, + collate_fn=_collate_data, + worker_init_fn=worker_init, + 
return_list=True, + ) + return dataloader + + +def do_train(args): + # Initialization for the parallel environment + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + worker_index = paddle.distributed.get_rank() + worker_num = paddle.distributed.get_world_size() + + # Set the random seed for the training process + set_seed(args) + worker_init = WorkerInitObj(args.seed + worker_index) + + # Get the model class and tokenizer class + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + # Define the pretrain model and metric + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + if args.model_name_or_path in pretrained_models_list: + config = model_class.config_class.from_pretrained(args.model_name_or_path) + model = model_class(config) + else: + model = BigBirdForPretraining.from_pretrained(args.model_name_or_path) + # Get bigbird config for generate random attention mask + config = getattr(model, BigBirdForPretraining.base_model_prefix).config + criterion = BigBirdPretrainingCriterion(config, args.use_nsp) + if worker_num > 1: + model = paddle.DataParallel(model) + + # Define learing_rate scheduler and optimizer + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, args.max_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.epochs): + files = [os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir)] + files.sort() + num_files = len(files) + for f_id in range(num_files): + train_data_loader = create_dataloader( + files[f_id], + tokenizer, + worker_init, + args.batch_size, + args.max_encoder_length, + args.max_pred_length, + config, + ) + for step, batch in enumerate(train_data_loader): + global_step += 1 + ( + input_ids, + segment_ids, + masked_lm_positions, + masked_lm_ids, + masked_lm_weights, + next_sentence_labels, + masked_lm_scale, + ) = batch[:7] + rand_mask_idx_list = batch[7:] + + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + rand_mask_idx_list=rand_mask_idx_list, + masked_positions=masked_lm_positions, + ) + loss = criterion( + prediction_scores, + seq_relationship_score, + masked_lm_ids, + next_sentence_labels, + masked_lm_scale, + masked_lm_weights, + ) + if global_step % args.logging_steps == 0 and worker_index == 0: + logger.info( + "global step %d, epoch: %d, lr: %.10f, loss: %f, speed: %.2f step/s" + % ( + global_step, + epoch, + optimizer.get_lr(), + loss, + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0: + if worker_index == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = 
model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if global_step >= args.max_steps: + del train_data_loader + return + del train_data_loader + + +if __name__ == "__main__": + args = args.parse_args() + do_train(args) diff --git a/examples/language_model/bloom b/examples/language_model/bloom new file mode 100644 index 0000000000000000000000000000000000000000..69ade8179eef55acb2d9c2ac71978e1448293106 --- /dev/null +++ b/examples/language_model/bloom @@ -0,0 +1 @@ +../../llm/bloom \ No newline at end of file diff --git a/examples/language_model/chatglm b/examples/language_model/chatglm new file mode 100644 index 0000000000000000000000000000000000000000..b3d6f16d3004c13997e19135ff470c135ccd419f --- /dev/null +++ b/examples/language_model/chatglm @@ -0,0 +1 @@ +../../llm/chatglm \ No newline at end of file diff --git a/examples/language_model/chinesebert/README.md b/examples/language_model/chinesebert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..383f68b2149d37979fe6665d6008c7f1f125f9f4 --- /dev/null +++ b/examples/language_model/chinesebert/README.md @@ -0,0 +1,203 @@ +# ChineseBert with PaddleNLP + +[ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://arxiv.org/pdf/2106.16038.pdf) + +**摘要:** +最近的汉语预训练模型忽略了汉语特有的两个重要方面:字形和拼音,它们对语言理解具有重要的语法和语义信息。在本研究中,我们提出了汉语预训练,它将汉字的字形和拼音信息纳入语言模型预训练中。字形嵌入是基于汉字的不同字体获得的,能够从视觉特征中捕捉汉字语义,拼音嵌入代表汉字的发音,处理汉语中高度流行的异义现象(同一汉字具有不同的发音和不同的含义)。在大规模的未标记中文语料库上进行预训练后,所提出的ChineseBERT模型在训练步骤较少的基线模型上产生了显著的性能提高。该模型在广泛的中国自然语言处理任务上实现了新的SOTA性能,包括机器阅读理解、自然语言推理、文本分类、句子对匹配和命名实体识别方面的竞争性能。 + +本项目是 ChineseBert 在 Paddle 2.x上的开源实现。 + +## **数据准备** +涉及到的ChnSentiCorp,crmc2018,XNLI数据 +部分Paddle已提供,其他可参考https://github.com/27182812/ChineseBERT_paddle, +在data目录下。 + + +## **模型预训练** +模型预训练过程可参考[Electra的README](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/electra/README.md) + +## **Fine-tuning** + +### 运行Fine-tuning + +#### **使用Paddle提供的预训练模型运行 Fine-tuning** + +#### 1、ChnSentiCorp +以ChnSentiCorp数据集为例 + +#### (1)模型微调: +```shell +# 运行训练 +python -m paddle.distributed.launch --gpus 0,1 python train_chn.py \ +--data_path './data/ChnSentiCorp' \ +--device 'gpu' \ +--num_train_epochs 10 \ +--max_seq_length 512 \ +--per_device_train_batch_size 8 \ +--per_device_eval_batch_size 8 \ +--learning_rate 2e-5 \ +--adam_beta2 0.98 \ +--weight_decay 0.0001 \ +--warmup_ratio 0.1 \ +--logging_steps 10 \ +--save_steps 100 \ +--seed 2333 \ +--do_train \ +--do_eval \ +--output_dir 'outputs/chn' | tee outputs/train_chn.log +``` +其中参数释义如下: +- `data_path` 表示微调数据路径 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 +- `num_train_epochs` 表示训练轮数。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `per_device_train_batch_size` 表示每次迭代**每张卡**上的训练样本数目。 +- `per_device_eval_batch_size` 表示每次迭代**每张卡**上的验证样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `adam_beta2` 表示优化器中使用的beta2的系数。 +- `weight_decay` 表示优化器中使用的weight_decay的系数。 +- `warmup_ratio` 表示动态学习率热启动的比例。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示验证并保存模型间隔。 +- `seed` 指定随机种子。 +- `do_train` 表示是否进行训练。 +- `do_eval` 表示是否进行验证。 +- `output_dir` 表示模型保存路径。 + +#### (2) 评估 + +在dev和test数据集上acc分别为95.8和96.08,达到论文精度要求。 + +#### 2、XNLI + +#### (1)训练 + +```bash +python -m paddle.distributed.launch --gpus 0,1 python train_xnli.py \ +--data_path './data/XNLI' 
\ +--device 'gpu' \ +--num_train_epochs 5 \ +--max_seq_length 256 \ +--per_device_train_batch_size 16 \ +--per_device_eval_batch_size 16 \ +--learning_rate 1.3e-5 \ +--adam_beta2 0.98 \ +--weight_decay 0.001 \ +--warmup_ratio 0.1 \ +--logging_steps 10 \ +--save_steps 100 \ +--seed 2333 \ +--do_train \ +--do_eval \ +--output_dir "outputs/xnli" | tee outputs/train_xnli.log +``` +其中参数释义如下: +- `data_path` 表示微调数据路径 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 +- `num_train_epochs` 表示训练轮数。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `per_device_train_batch_size` 表示每次迭代**每张卡**上的训练样本数目。 +- `per_device_eval_batch_size` 表示每次迭代**每张卡**上的验证样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `adam_beta2` 表示优化器中使用的beta2的系数。 +- `weight_decay` 表示优化器中使用的weight_decay的系数。 +- `warmup_ratio` 表示动态学习率热启动的比例。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示验证并保存模型间隔。 +- `seed` 指定随机种子。 +- `do_train` 表示是否进行训练。 +- `do_eval` 表示是否进行验证。 +- `output_dir` 表示模型保存路径。 + +#### (2)评估 + +test数据集 acc最好结果为81.657,达到论文精度要求。 + +#### 3、cmrc2018 + +#### (1) 训练 + +```shell +# 开始训练 +python -m paddle.distributed.launch --gpus 0,1 python train_cmrc2018.py \ + --data_dir "./data/cmrc2018" \ + --model_name_or_path ChineseBERT-large \ + --max_seq_length 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --gradient_accumulation_steps 8 \ + --learning_rate 4e-5 \ + --max_grad_norm 1.0 \ + --adam_beta2 0.98 \ + --num_train_epochs 3 \ + --logging_steps 2 \ + --save_steps 20 \ + --warmup_ratio 0.1 \ + --weight_decay 0.01 \ + --seed 1111 \ + --do_train \ + --do_eval \ + --dataloader_num_workers 0 \ + --fp16 True \ + --output_dir "outputs/cmrc2018" +``` +其中参数释义如下: +- `data_path` 表示微调数据路径。 +- `model_name_or_path` 模型名称或者路径,支持ChineseBERT-base、ChineseBERT-large两种种规格。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `per_device_train_batch_size` 表示训练过程中每次迭代**每张卡**上的样本数目。 +- `per_device_eval_batch_size` 表示验证过程中每次迭代**每张卡**上的样本数目。 +- `gradient_accumulation_steps` 梯度累加步数。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `max_grad_norm` 梯度裁剪。 +- `adam_beta2` 表示优化器中使用的beta2的系数。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示验证并保存模型间隔。 +- `warmup_ratio` 表示动态学习率热启动的比例。 +- `weight_decay` 表示优化器中使用的weight_decay的系数。 +- `seed` 指定随机种子。 +- `do_train` 表示是否进行训练。 +- `do_eval` 表示是否进行验证。 +- `dataloader_num_workers` 表示同时工作进程。 +- `fp16` 表示是否使用混合精度fp16。 +- `output_dir` 表示模型保存路径。 + +训练过程中模型会在dev数据集进行评估,其中最好的结果如下所示: + +```python + +{ + AVERAGE = 82.791 + F1 = 91.055 + EM = 74.526 + TOTAL = 3219 + SKIP = 0 +} + +``` + +#### (2)运行eval_cmrc.py,生成test数据集预测答案 + +```bash +python eval_cmrc.py --model_name_or_path outputs/step-340 --n_best_size 35 --max_answer_length 65 +``` + +其中,model_name_or_path为模型路径 + +#### (3)提交CLUE + +test数据集 EM为78.55,达到论文精度要求 + + +## Reference + +```bibtex +@article{sun2021chinesebert, + title={ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information}, + author={Sun, Zijun and Li, Xiaoya and Sun, Xiaofei and Meng, Yuxian and Ao, Xiang and He, Qing and Wu, Fei and Li, Jiwei}, + journal={arXiv preprint arXiv:2106.16038}, + year={2021} +} + +``` diff --git a/examples/language_model/chinesebert/cmrc_eval.sh b/examples/language_model/chinesebert/cmrc_eval.sh new file mode 100644 index 0000000000000000000000000000000000000000..79155340b1c7b6f46a08545a6b7dc640d2bb5e45 --- /dev/null +++ b/examples/language_model/chinesebert/cmrc_eval.sh @@ -0,0 +1 @@ +python eval.py --model_name_or_path 
outputs/cmrc2018/step-140 --n_best_size 35 --max_answer_length 65 diff --git a/examples/language_model/chinesebert/cmrc_evaluate.py b/examples/language_model/chinesebert/cmrc_evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..ceb5257ff9f1ae907adee3733547ed68ecff097a --- /dev/null +++ b/examples/language_model/chinesebert/cmrc_evaluate.py @@ -0,0 +1,241 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Evaluation script for CMRC 2018 +version: v5 - special +Note: +v5 - special: Evaluate on SQuAD-style CMRC 2018 Datasets +v5: formatted output, add usage description +v4: fixed segmentation issues +""" + +import argparse +import json +import re +import sys +from collections import OrderedDict + +import nltk + + +# split Chinese with English +def mixed_segmentation(in_str, rm_punc=False): + in_str = str(in_str).lower().strip() + segs_out = [] + temp_str = "" + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + for char in in_str: + if rm_punc and char in sp_char: + continue + if re.search(r"[\u4e00-\u9fa5]", char) or char in sp_char: + if temp_str != "": + ss = nltk.word_tokenize(temp_str) + segs_out.extend(ss) + temp_str = "" + segs_out.append(char) + else: + temp_str += char + + # handling last part + if temp_str != "": + ss = nltk.word_tokenize(temp_str) + segs_out.extend(ss) + + return segs_out + + +# remove punctuation +def remove_punctuation(in_str): + in_str = str(in_str).lower().strip() + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + out_segs = [] + for char in in_str: + if char in sp_char: + continue + else: + out_segs.append(char) + return "".join(out_segs) + + +# find longest common string +def find_lcs(s1, s2): + m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)] + mmax = 0 + p = 0 + for i in range(len(s1)): + for j in range(len(s2)): + if s1[i] == s2[j]: + m[i + 1][j + 1] = m[i][j] + 1 + if m[i + 1][j + 1] > mmax: + mmax = m[i + 1][j + 1] + p = i + 1 + return s1[p - mmax : p], mmax + + +def evaluate(ground_truth_file, prediction_file): + f1 = 0 + em = 0 + total_count = 0 + skip_count = 0 + for instance in ground_truth_file["data"]: + # context_id = instance['context_id'].strip() + # context_text = instance['context_text'].strip() + for para in instance["paragraphs"]: + for qas in para["qas"]: + total_count += 1 + query_id = qas["id"].strip() + answers = [x["text"] for x in qas["answers"]] + + if query_id not in prediction_file: + sys.stderr.write("Unanswered question: {}\n".format(query_id)) + skip_count += 1 + continue + + prediction = 
str(prediction_file[query_id]) + f1 += calc_f1_score(answers, prediction) + em += calc_em_score(answers, prediction) + + f1_score = 100.0 * f1 / total_count + em_score = 100.0 * em / total_count + return f1_score, em_score, total_count, skip_count + + +def calc_f1_score(answers, prediction): + f1_scores = [] + for ans in answers: + ans_segs = mixed_segmentation(ans, rm_punc=True) + prediction_segs = mixed_segmentation(prediction, rm_punc=True) + lcs, lcs_len = find_lcs(ans_segs, prediction_segs) + if lcs_len == 0: + f1_scores.append(0) + continue + precision = 1.0 * lcs_len / len(prediction_segs) + recall = 1.0 * lcs_len / len(ans_segs) + f1 = (2 * precision * recall) / (precision + recall) + f1_scores.append(f1) + return max(f1_scores) + + +def calc_em_score(answers, prediction): + em = 0 + for ans in answers: + ans_ = remove_punctuation(ans) + prediction_ = remove_punctuation(prediction) + if ans_ == prediction_: + em = 1 + break + return em + + +def get_result(ground_truth_file, prediction_file): + ground_truth_file = json.load(open(ground_truth_file, "rb")) + prediction_file = json.load(open(prediction_file, "rb")) + F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file) + AVG = (EM + F1) * 0.5 + output_result = OrderedDict() + output_result["AVERAGE"] = "%.3f" % AVG + output_result["F1"] = "%.3f" % F1 + output_result["EM"] = "%.3f" % EM + output_result["TOTAL"] = TOTAL + output_result["SKIP"] = SKIP + print(json.dumps(output_result)) + return output_result + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Evaluation Script for CMRC 2018") + parser.add_argument("--dataset_file", default="cmrc2018_public/dev.json", help="Official dataset file") + parser.add_argument("--prediction_file", default="all_predictions.json", help="Your prediction File") + args = parser.parse_args() + ground_truth_file = json.load(open(args.dataset_file, "rb")) + prediction_file = json.load(open(args.prediction_file, "rb")) + F1, EM, TOTAL, SKIP = evaluate(ground_truth_file, prediction_file) + AVG = (EM + F1) * 0.5 + output_result = OrderedDict() + output_result["AVERAGE"] = "%.3f" % AVG + output_result["F1"] = "%.3f" % F1 + output_result["EM"] = "%.3f" % EM + output_result["TOTAL"] = TOTAL + output_result["SKIP"] = SKIP + output_result["FILE"] = args.prediction_file + print(json.dumps(output_result)) diff --git a/examples/language_model/chinesebert/dataset_cmrc2018.py b/examples/language_model/chinesebert/dataset_cmrc2018.py new file mode 100644 index 0000000000000000000000000000000000000000..7144d5e67a99acf2384cbabb1110a354d9e464df --- /dev/null +++ b/examples/language_model/chinesebert/dataset_cmrc2018.py @@ -0,0 +1,418 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
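+# Data preparation helpers (SQuAD-style feature conversion for CMRC 2018) and a custom
+# EvalTrainer with an evaluation loop tailored to span-prediction post-processing.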
+ +import os +from functools import partial +from typing import List, Optional + +import numpy as np +import paddle +from paddle.io import DataLoader, Dataset +from utils import load_pickle, save_pickle + +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import Trainer +from paddlenlp.trainer.trainer_utils import ( + EvalLoopOutput, + EvalPrediction, + IterableDatasetShard, + find_batch_size, + has_length, +) +from paddlenlp.trainer.utils.helper import ( + nested_concat, + nested_numpify, + nested_truncate, +) +from paddlenlp.utils.batch_sampler import ( + DistributedBatchSampler as NlpDistributedBatchSampler, +) +from paddlenlp.utils.log import logger + + +# this right +def prepare_train_features_paddlenlp(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer( + questions, + contexts, + stride=args["model_args"].doc_stride, + max_length=args["model_args"].max_seq_length, + return_token_type_ids=True, + ) + + # Let's label those examples! + for i, tokenized_example in enumerate(tokenized_examples): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_example["input_ids"] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offsets = tokenized_example["offset_mapping"] + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + answers = examples[sample_index]["answers"] + answer_starts = examples[sample_index]["answer_starts"] + + # If no answers are given, set the cls_index as answer. + if len(answer_starts) == 0: + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + else: + # Start/end character index of the answer in the text. + start_char = answer_starts[0] + end_char = start_char + len(answers[0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + # Minus one more to reach actual text + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). 
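+            # offsets[k] holds the (char_start, char_end) of token k in the original context,
+            # so the answer fits this chunk only when its character span is covered by the
+            # chunk's first and last context tokens.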
+ if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples[i]["start_positions"] = token_start_index - 1 + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples[i]["end_positions"] = token_end_index + 1 + + return tokenized_examples + + +# this right +def prepare_dev_features_paddlenlp(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer( + questions, + contexts, + stride=args["model_args"].doc_stride, + max_length=args["model_args"].max_seq_length, + return_token_type_ids=True, + ) + + # For validation, there is no need to compute start and end positions + for i, tokenized_example in enumerate(tokenized_examples): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + tokenized_examples[i]["example_id"] = examples[sample_index]["id"] + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
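+        # token_type_ids double as the sequence indicator: 0 marks question tokens and
+        # 1 marks context tokens, so only context offsets are kept.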
+ tokenized_examples[i]["offset_mapping"] = [ + (o if sequence_ids[k] == 1 else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + + return tokenized_examples + + +def get_train_dataset(tokenizer, args, splits="train"): + + data_dir = args["data_args"].data_dir + filename = os.path.join(data_dir, "cmrc2018_" + splits + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("cmrc2018", splits=splits) + ds.map( + partial(prepare_train_features_paddlenlp, tokenizer=tokenizer, args=args), + batched=True, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +def get_dev_dataset(tokenizer, args, splits="dev"): + + data_dir = args["data_args"].data_dir + filename = os.path.join(data_dir, "cmrc2018_" + splits + ".pkl") + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("cmrc2018", splits=splits) + ds.map( + partial(prepare_dev_features_paddlenlp, tokenizer=tokenizer, args=args), + batched=True, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +def is_datasets_available(): + import importlib + + return importlib.util.find_spec("datasets") is not None + + +if is_datasets_available(): + import datasets + + +class EvalTrainer(Trainer): + def set_eval_collator(self, collator): + self.eval_collate_fn = collator + + def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader: + """ + Returns the evaluation [`~paddle.io.DataLoader`]. + + Subclass and override this method if you want to inject some custom behavior. + + Args: + eval_dataset (`paddle.io.Dataset`, *optional*): + If provided, will override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not accepted by + the `model.forward()` method are automatically removed. It must implement `__len__`. + """ + if eval_dataset is None and self.eval_dataset is None: + raise ValueError("Trainer: evaluation requires an eval_dataset.") + eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset + + if is_datasets_available() and isinstance(eval_dataset, datasets.Dataset): + eval_dataset = self._remove_unused_columns(eval_dataset, description="evaluation") + + if self._is_iterable_dataset(eval_dataset): + if self.args.world_size > 1: + eval_dataset = IterableDatasetShard( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + drop_last=self.args.dataloader_drop_last, + num_processes=self.args.world_size, + process_index=self.args.process_index, + ) + + return DataLoader( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + collate_fn=self.eval_collate_fn, + num_workers=self.args.dataloader_num_workers, + ) + + eval_sampler = self._get_eval_sampler(eval_dataset) + + return DataLoader( + eval_dataset, + batch_sampler=eval_sampler, + collate_fn=self.eval_collate_fn, + num_workers=self.args.dataloader_num_workers, + ) + + def evaluation_loop( + self, + dataloader: DataLoader, + description: str, + prediction_loss_only: Optional[bool] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + max_eval_iters: Optional[int] = -1, + ) -> EvalLoopOutput: + """ + Prediction/evaluation loop, shared by `Trainer.evaluate()` and `Trainer.predict()`. + + Works both with or without labels. 
+ """ + args = self.args + + prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else args.prediction_loss_only + + model = self.model + + if isinstance(dataloader, paddle.io.DataLoader): + batch_size = dataloader.batch_sampler.batch_size + elif isinstance(dataloader, paddle.io.dataloader.dataloader_iter._DataLoaderIterBase): + # support for inner dataloader + batch_size = dataloader._batch_sampler.batch_size + # alias for inner dataloader + dataloader.dataset = dataloader._dataset + else: + raise ValueError("Only support for paddle.io.DataLoader") + + num_samples = None + if max_eval_iters > 0: + # on eval limit steps + num_samples = batch_size * self.args.world_size * max_eval_iters + if isinstance(dataloader, paddle.io.dataloader.dataloader_iter._DataLoaderIterBase) and isinstance( + dataloader._batch_sampler, NlpDistributedBatchSampler + ): + consumed_samples = ( + ((self.state.global_step) // args.eval_steps) + * max_eval_iters + * args.per_device_eval_batch_size + * args.world_size + ) + dataloader._batch_sampler.set_epoch(consumed_samples=consumed_samples) + + logger.info(f"***** Running {description} *****") + if has_length(dataloader): + logger.info(f" Num examples = {self.num_examples(dataloader)}") + if max_eval_iters > 0: + logger.info(f" Total prediction steps = {max_eval_iters}") + else: + logger.info(f" Total prediction steps = {len(dataloader)}") + else: + logger.info(" Num examples: Unknown") + if max_eval_iters > 0: + logger.info(f" Total prediction steps = {max_eval_iters}") + + logger.info(f" Pre device batch size = {batch_size}") + logger.info(f" Total Batch size = {batch_size * self.args.world_size}") + + model.eval() + + self.callback_handler.eval_dataloader = dataloader + # Do this before wrapping. + eval_dataset = dataloader.dataset + + if args.past_index >= 0: + self._past = None + + # Initialize containers + # losses/preds/labels on GPU (accumulated for eval_accumulation_steps) + losses_host = None + preds_host = None + labels_host = None + # losses/preds/labels on CPU (final containers) + all_losses = None + all_preds = None + all_labels = None + # Will be useful when we have an iterable dataset so don't know its length. + + observed_num_examples = 0 + # Main evaluation loop + losses = [] + for step, inputs in enumerate(dataloader): + # Update the observed num examples + observed_batch_size = find_batch_size(inputs) + if observed_batch_size is not None: + observed_num_examples += observed_batch_size + # For batch samplers, batch_size is not known by the dataloader in advance. 
+ if batch_size is None: + batch_size = observed_batch_size + + # Prediction step + loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys) + + # Update containers on host + if loss is not None: + # losses = self._nested_gather(loss.repeat(batch_size)) + # losses = self._nested_gather(loss) + losses = self._nested_gather(paddle.tile(loss, repeat_times=[batch_size, 1])) + losses_host = losses if losses_host is None else paddle.concat((losses_host, losses), axis=0) + + if labels is not None: + labels = self._pad_across_processes(labels) + labels = self._nested_gather(labels) + labels_host = labels if labels_host is None else nested_concat(labels_host, labels, padding_index=-100) + if logits is not None: + logits = self._pad_across_processes(logits) + logits = self._nested_gather(logits) + if self.preprocess_logits_for_metrics is not None: + logits = self.preprocess_logits_for_metrics(logits, labels) + preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100) + self.control = self.callback_handler.on_prediction_step(args, self.state, self.control) + if max_eval_iters > 0 and step >= max_eval_iters - 1: + break + + # Gather all remaining tensors and put them back on the CPU + if losses_host is not None: + losses = nested_numpify(losses_host) + all_losses = losses if all_losses is None else np.concatenate((all_losses, losses), axis=0) + if preds_host is not None: + logits = nested_numpify(preds_host) + all_preds = logits if all_preds is None else nested_concat(all_preds, logits, padding_index=-100) + if labels_host is not None: + labels = nested_numpify(labels_host) + all_labels = labels if all_labels is None else nested_concat(all_labels, labels, padding_index=-100) + + # Number of samples + if num_samples is not None: + pass + elif has_length(eval_dataset): + num_samples = len(eval_dataset) + # The instance check is weird and does not actually check for the type, but whether the dataset has the right + # methods. Therefore we need to make sure it also has the attribute. + elif isinstance(eval_dataset, IterableDatasetShard) and hasattr(eval_dataset, "num_examples"): + num_samples = eval_dataset.num_examples + else: + if has_length(dataloader): + num_samples = self.num_examples(dataloader) + else: # both len(dataloader.dataset) and len(dataloader) fail + num_samples = observed_num_examples + + # Number of losses has been rounded to a multiple of batch_size and in a distributed training, the number of + # samplers has been rounded to a multiple of batch_size, so we truncate. 
+ if all_losses is not None: + all_losses = all_losses[:num_samples] + if all_preds is not None: + all_preds = nested_truncate(all_preds, num_samples) + if all_labels is not None: + all_labels = nested_truncate(all_labels, num_samples) + + model.train() + + if self.compute_metrics is not None and all_preds is not None: + metrics = self.compute_metrics( + EvalPrediction(predictions=all_preds, label_ids=all_labels), dataloader, args + ) + else: + metrics = {} + + if all_losses is not None: + metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item() + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + return EvalLoopOutput(predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples) diff --git a/examples/language_model/chinesebert/eval_cmrc.py b/examples/language_model/chinesebert/eval_cmrc.py new file mode 100644 index 0000000000000000000000000000000000000000..9cd8d8d8d50f2c632c62d3527b0dd5573bf731f6 --- /dev/null +++ b/examples/language_model/chinesebert/eval_cmrc.py @@ -0,0 +1,220 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
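+# Runs a finetuned ChineseBERT model over the CMRC 2018 test (or dev) split and writes
+# all_predictions.json (plus an optional n-best list) for scoring, e.g.
+#   python eval_cmrc.py --model_name_or_path outputs/step-340 --n_best_size 35 --max_answer_length 65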
+ +import argparse +from tqdm.auto import tqdm +import os + +import paddle + +from dataset_cmrc2018 import get_dev_dataloader +from train_cmrc2018 import MODEL_CLASSES +from metric import compute_prediction +from utils import save_json + + +@paddle.no_grad() +def evaluate(model, data_loader, args, output_dir="./"): + model.eval() + all_start_logits = [] + all_end_logits = [] + + for batch in tqdm(data_loader): + input_ids, token_type_ids, pinyin_ids = batch + start_logits_tensor, end_logits_tensor = model(input_ids, token_type_ids=token_type_ids, pinyin_ids=pinyin_ids) + all_start_logits.extend(start_logits_tensor.numpy().tolist()) + all_end_logits.extend(end_logits_tensor.numpy().tolist()) + + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + data_loader.dataset.data, + data_loader.dataset.new_data, + (all_start_logits, all_end_logits), + False, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + + save_json(all_predictions, os.path.join(output_dir, "all_predictions.json")) + if args.save_nbest_json: + save_json(all_nbest_json, os.path.join(output_dir, "all_nbest_json.json")) + + +def main(args): + print(args) + paddle.set_device(args.device) + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + model = model_class.from_pretrained(args.model_name_or_path) + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + splits = "test" + dev_data_loader = get_dev_dataloader(tokenizer, args, splits=splits) + evaluate(model, dev_data_loader, args, output_dir=args.output_dir) + + data_dir = args.data_dir + dev_ground_truth_file_path = os.path.join(data_dir, "dev.json") + dev_predict_file_path = os.path.join(args.output_dir, "all_predictions.json") + if splits == "dev": + from cmrc_evaluate import get_result + + get_result(dev_ground_truth_file_path, dev_predict_file_path) + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--model_type", default="chinesebert", type=str, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default="ChineseBERT-large", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default="outputs/cmrc2018", + type=str, + help="The output directory where the model predictions and checkpoints will be written. " + "Default as `outputs`", + ) + parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--train_batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--eval_batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for evaluating.", + ) + + parser.add_argument( + "--gradient_accumulation_steps", + default=1, + type=int, + help="gradient_accumulation_steps.", + ) + parser.add_argument( + "--learning_rate", + default=4e-5, + type=float, + help="The initial learning rate for Adam.", + ) + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=2, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_train_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_radio", + default=0.1, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--warmup_steps", type=int, default=-1, help="warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument( + "--save_steps", + type=int, + default=250, + help="Save checkpoint every X updates steps.", + ) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--writer_type", + choices=["visualdl", "tensorboard"], + default="visualdl", + help="writer_type.", + ) + parser.add_argument( + "--device", + choices=["cpu", "gpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", + ) + parser.add_argument( + "--scheduler_type", + choices=["linear", "cosine", "poly"], + default="linear", + type=str, + help="scheduler_type.", + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=35, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument( + "--null_score_diff_threshold", + type=float, + default=0.0, + help="If null_score - best_non_null is greater than the threshold predict null.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=65, help="Max answer length.") + parser.add_argument("--use_amp", action="store_true", help="Enable mixed precision training.") + + parser.add_argument( + "--scale_loss", + type=float, + default=2**15, + help="The value of scale_loss for fp16.", + ) + parser.add_argument( + "--num_workers", + type=int, + default=0, + help="num_workers.", + ) + parser.add_argument("--save_nbest_json", action="store_true", help="Enable save nbest json.") + + args = parser.parse_args() + + args.model_type = args.model_type.lower() + args.logdir = os.path.join(args.output_dir, "logs") + os.makedirs("caches", exist_ok=True) + os.makedirs(args.logdir, exist_ok=True) + + return args + + +if __name__ == "__main__": + args = 
parse_args() + main(args) diff --git a/examples/language_model/chinesebert/metric_cmrc.py b/examples/language_model/chinesebert/metric_cmrc.py new file mode 100644 index 0000000000000000000000000000000000000000..13208fb9773a158e963881bd335adb3e79732ba1 --- /dev/null +++ b/examples/language_model/chinesebert/metric_cmrc.py @@ -0,0 +1,387 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import json +import re +import string +import numpy as np + + +def compute_prediction( + examples, + features, + predictions, + version_2_with_negative=False, + n_best_size=20, + max_answer_length=30, + null_score_diff_threshold=0.0, +): + """ + Post-processes the predictions of a question-answering model to convert + them to answers that are substrings of the original contexts. This is + the base postprocessing functions for models that only return start and + end logits. + + Args: + examples (list): List of raw squad-style data (see `run_squad.py + `__ for more + information). + features (list): List of processed squad-style features (see + `run_squad.py `__ + for more information). + predictions (tuple): The predictions of the model. Should be a tuple + of two list containing the start logits and the end logits. + version_2_with_negative (bool, optional): Whether the dataset contains + examples with no answers. Defaults to False. + n_best_size (int, optional): The total number of candidate predictions + to generate. Defaults to 20. + max_answer_length (int, optional): The maximum length of predicted answer. + Defaults to 20. + null_score_diff_threshold (float, optional): The threshold used to select + the null answer. Only useful when `version_2_with_negative` is True. + Defaults to 0.0. + + Returns: + A tuple of three dictionaries containing final selected answer, all n_best + answers along with their probability and scores, and the score_diff of each + example. + """ + assert len(predictions) == 2, "`predictions` should be a tuple with two elements (start_logits, end_logits)." + all_start_logits, all_end_logits = predictions + + assert len(predictions[0]) == len(features), "Number of predictions should be equal to number of features." + + # Build a map example to its corresponding features. + features_per_example = collections.defaultdict(list) + for i, feature in enumerate(features): + features_per_example[feature["example_id"]].append(i) + + # The dictionaries we have to fill. + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + + scores_diff_json = collections.OrderedDict() + + # Let's loop over all the examples! + for example_index, example in enumerate(examples): + # Those are the indices of the features associated to the current example. + feature_indices = features_per_example[example["id"]] + + min_null_prediction = None + prelim_predictions = [] + + # Looping through all the features associated to the current example. 
+ for feature_index in feature_indices: + # We grab the predictions of the model for this feature. + start_logits = all_start_logits[feature_index] + end_logits = all_end_logits[feature_index] + # This is what will allow us to map some the positions in our logits to span of texts in the original + # context. + offset_mapping = features[feature_index]["offset_mapping"] + # Optional `token_is_max_context`, if provided we will remove answers that do not have the maximum context + # available in the current feature. + token_is_max_context = features[feature_index].get("token_is_max_context", None) + + # Update minimum null prediction. + feature_null_score = start_logits[0] + end_logits[0] + if min_null_prediction is None or min_null_prediction["score"] > feature_null_score: + min_null_prediction = { + "offsets": (0, 0), + "score": feature_null_score, + "start_logit": start_logits[0], + "end_logit": end_logits[0], + } + + # Go through all possibilities for the `n_best_size` greater start and end logits. + start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist() + end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist() + for start_index in start_indexes: + for end_index in end_indexes: + # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond + # to part of the input_ids that are not in the context. + if ( + start_index >= len(offset_mapping) + or end_index >= len(offset_mapping) + or offset_mapping[start_index] is None + or offset_mapping[end_index] is None + or offset_mapping[start_index] == (0, 0) + or offset_mapping[end_index] == (0, 0) + ): + continue + # Don't consider answers with a length that is either < 0 or > max_answer_length. + if end_index < start_index or end_index - start_index + 1 > max_answer_length: + continue + # Don't consider answer that don't have the maximum context available (if such information is + # provided). + if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False): + continue + prelim_predictions.append( + { + "offsets": ( + offset_mapping[start_index][0], + offset_mapping[end_index][1], + ), + "score": start_logits[start_index] + end_logits[end_index], + "start_logit": start_logits[start_index], + "end_logit": end_logits[end_index], + } + ) + if version_2_with_negative: + # Add the minimum null prediction + prelim_predictions.append(min_null_prediction) + null_score = min_null_prediction["score"] + + # Only keep the best `n_best_size` predictions. + predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size] + + # Add back the minimum null prediction if it was removed because of its low score. + if version_2_with_negative and not any(p["offsets"] == (0, 0) for p in predictions): + predictions.append(min_null_prediction) + + # Use the offsets to gather the answer text in the original context. + context = example["context"] + for pred in predictions: + offsets = pred.pop("offsets") + pred["text"] = context[offsets[0] : offsets[1]] + + # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid + # failure. + if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == ""): + predictions.insert(0, {"text": "empty", "start_logit": 0.0, "end_logit": 0.0, "score": 0.0}) + + # Compute the softmax of all scores (we do it with numpy to stay independent from torch/tf in this file, using + # the LogSumExp trick). 
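+ # Subtracting the maximum score before exponentiating is the standard log-sum-exp
+ # stabilization: it prevents overflow in np.exp while leaving the normalized
+ # probabilities unchanged.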
+ scores = np.array([pred.pop("score") for pred in predictions]) + exp_scores = np.exp(scores - np.max(scores)) + probs = exp_scores / exp_scores.sum() + + # Include the probabilities in our predictions. + for prob, pred in zip(probs, predictions): + pred["probability"] = prob + + # Pick the best prediction. If the null answer is not possible, this is easy. + if not version_2_with_negative: + all_predictions[example["id"]] = predictions[0]["text"] + else: + # Otherwise we first need to find the best non-empty prediction. + i = 0 + while predictions[i]["text"] == "": + i += 1 + best_non_null_pred = predictions[i] + + # Then we compare to the null prediction using the threshold. + score_diff = null_score - best_non_null_pred["start_logit"] - best_non_null_pred["end_logit"] + scores_diff_json[example["id"]] = float(score_diff) # To be JSON-serializable. + if score_diff > null_score_diff_threshold: + all_predictions[example["id"]] = "" + else: + all_predictions[example["id"]] = best_non_null_pred["text"] + + # Make `predictions` JSON-serializable by casting np.float back to float. + all_nbest_json[example["id"]] = [ + {k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()} + for pred in predictions + ] + + return all_predictions, all_nbest_json, scores_diff_json + + +def make_qid_to_has_ans(examples): + qid_to_has_ans = {} + for example in examples: + qid_to_has_ans[example["id"]] = not example.get("is_impossible", False) + return qid_to_has_ans + + +def normalize_answer(s): + # Lower text and remove punctuation, articles and extra whitespace. + def remove_articles(text): + regex = re.compile(r"\b(a|an|the)\b", re.UNICODE) + return re.sub(regex, " ", text) + + def white_space_fix(text): + return " ".join(text.split()) + + def remove_punc(text): + exclude = set(string.punctuation) + return "".join(ch for ch in text if ch not in exclude) + + def lower(text): + return text.lower() + + if not s: + return "" + else: + return white_space_fix(remove_articles(remove_punc(lower(s)))) + + +def compute_exact(a_gold, a_pred): + return int(normalize_answer(a_gold) == normalize_answer(a_pred)) + + +def compute_f1(a_gold, a_pred, is_whitespace_splited=True): + gold_toks = normalize_answer(a_gold).split() + pred_toks = normalize_answer(a_pred).split() + + if not is_whitespace_splited: + gold_toks = gold_toks[0] if gold_toks else "" + pred_toks = pred_toks[0] if pred_toks else "" + + common = collections.Counter(gold_toks) & collections.Counter(pred_toks) + num_same = sum(common.values()) + if len(gold_toks) == 0 or len(pred_toks) == 0: + # If either is no-answer, then F1 is 1 if they agree, 0 otherwise + return int(gold_toks == pred_toks) + if num_same == 0: + return 0 + precision = 1.0 * num_same / len(pred_toks) + recall = 1.0 * num_same / len(gold_toks) + f1 = (2 * precision * recall) / (precision + recall) + return f1 + + +def get_raw_scores(examples, preds, is_whitespace_splited=True): + exact_scores = {} + f1_scores = {} + for example in examples: + qid = example["id"] + gold_answers = [text for text in example["answers"] if normalize_answer(text)] + if not gold_answers: + # For unanswerable questions, only correct answer is empty string + gold_answers = [""] + if qid not in preds: + print("Missing prediction for %s" % qid) + continue + a_pred = preds[qid] + # Take max over all gold answers + exact_scores[qid] = max(compute_exact(a, a_pred) for a in gold_answers) + f1_scores[qid] = max(compute_f1(a, a_pred, is_whitespace_splited) for a in gold_answers) + + 
return exact_scores, f1_scores + + +def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh): + new_scores = {} + for qid, s in scores.items(): + pred_na = na_probs[qid] > na_prob_thresh + if pred_na: + new_scores[qid] = float(not qid_to_has_ans[qid]) + else: + new_scores[qid] = s + return new_scores + + +def make_eval_dict(exact_scores, f1_scores, qid_list=None): + if not qid_list: + total = len(exact_scores) + return collections.OrderedDict( + [ + ("exact", 100.0 * sum(exact_scores.values()) / total), + ("f1", 100.0 * sum(f1_scores.values()) / total), + ("total", total), + ] + ) + else: + total = len(qid_list) + return collections.OrderedDict( + [ + ("exact", 100.0 * sum(exact_scores[k] for k in qid_list) / total), + ("f1", 100.0 * sum(f1_scores[k] for k in qid_list) / total), + ("total", total), + ] + ) + + +def merge_eval(main_eval, new_eval, prefix): + for k in new_eval: + main_eval["%s_%s" % (prefix, k)] = new_eval[k] + + +def find_best_thresh(preds, scores, na_probs, qid_to_has_ans): + num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k]) + cur_score = num_no_ans + best_score = cur_score + best_thresh = 0.0 + qid_list = sorted(na_probs, key=lambda k: na_probs[k]) + for i, qid in enumerate(qid_list): + if qid not in scores: + continue + if qid_to_has_ans[qid]: + diff = scores[qid] + else: + if preds[qid]: + diff = -1 + else: + diff = 0 + cur_score += diff + if cur_score > best_score: + best_score = cur_score + best_thresh = na_probs[qid] + return 100.0 * best_score / len(scores), best_thresh + + +def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans): + best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans) + best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans) + main_eval["best_exact"] = best_exact + main_eval["best_exact_thresh"] = exact_thresh + main_eval["best_f1"] = best_f1 + main_eval["best_f1_thresh"] = f1_thresh + + +def squad_evaluate(examples, preds, na_probs=None, na_prob_thresh=1.0, is_whitespace_splited=True): + """ + Computes and prints the f1 score and em score of input prediction. + + Args: + examples (list): List of raw squad-style data (see `run_squad.py + `__ for more + information). + preds (dict): Dictionary of final predictions. Usually generated by + `compute_prediction`. + na_probs (dict, optional): Dictionary of score_diffs of each example. + Used to decide if answer exits and compute best score_diff + threshold of null. Defaults to None. + na_prob_thresh (float, optional): The threshold used to select the + null answer. Defaults to 1.0. + is_whitespace_splited (bool, optional): Whether the predictions and references + can be tokenized by whitespace. Usually set True for English and + False for Chinese. Defaults to True. 
+ """ + + if not na_probs: + na_probs = {k: 0.0 for k in preds} + + qid_to_has_ans = make_qid_to_has_ans(examples) # maps qid to True/False + has_ans_qids = [k for k, v in qid_to_has_ans.items() if v] + no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v] + exact_raw, f1_raw = get_raw_scores(examples, preds, is_whitespace_splited) + exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans, na_prob_thresh) + f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans, na_prob_thresh) + out_eval = make_eval_dict(exact_thresh, f1_thresh) + if has_ans_qids: + has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids) + merge_eval(out_eval, has_ans_eval, "HasAns") + if no_ans_qids: + no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids) + merge_eval(out_eval, no_ans_eval, "NoAns") + find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans) + + print(json.dumps(out_eval, indent=2)) + return out_eval diff --git a/examples/language_model/chinesebert/run_chn.sh b/examples/language_model/chinesebert/run_chn.sh new file mode 100644 index 0000000000000000000000000000000000000000..c1810b0eb1af0631d32e90a29d193f0246d57187 --- /dev/null +++ b/examples/language_model/chinesebert/run_chn.sh @@ -0,0 +1,31 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -m paddle.distributed.launch --gpus 0,1 python train_chn.py \ +--data_path './data/ChnSentiCorp' \ +--device 'gpu' \ +--num_train_epochs 10 \ +--max_seq_length 512 \ +--per_device_train_batch_size 8 \ +--per_device_eval_batch_size 8 \ +--learning_rate 2e-5 \ +--adam_beta2 0.98 \ +--weight_decay 0.0001 \ +--warmup_ratio 0.1 \ +--logging_steps 10 \ +--save_steps 100 \ +--seed 2333 \ +--do_train \ +--do_eval \ +--output_dir 'outputs/chn' | tee outputs/train_chn.log diff --git a/examples/language_model/chinesebert/run_cmrc2018.sh b/examples/language_model/chinesebert/run_cmrc2018.sh new file mode 100644 index 0000000000000000000000000000000000000000..1541877949f5bf7d9b83faf3a5072e94d21642d9 --- /dev/null +++ b/examples/language_model/chinesebert/run_cmrc2018.sh @@ -0,0 +1,36 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +python -m paddle.distributed.launch --gpus 0,1 python train_cmrc2018.py \ + --data_dir "./data/cmrc2018" \ + --model_name_or_path ChineseBERT-large \ + --max_seq_length 512 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --gradient_accumulation_steps 8 \ + --learning_rate 4e-5 \ + --max_grad_norm 1.0 \ + --adam_beta2 0.98 \ + --num_train_epochs 3 \ + --logging_steps 2 \ + --save_steps 20 \ + --warmup_ratio 0.1 \ + --weight_decay 0.01 \ + --seed 1111 \ + --do_train \ + --do_eval \ + --dataloader_num_workers 0 \ + --fp16 True \ + --output_dir "outputs/cmrc2018" + diff --git a/examples/language_model/chinesebert/run_xnli.sh b/examples/language_model/chinesebert/run_xnli.sh new file mode 100644 index 0000000000000000000000000000000000000000..de0e6945deedfdf2fb41c2e4e1cb6b596fd50fce --- /dev/null +++ b/examples/language_model/chinesebert/run_xnli.sh @@ -0,0 +1,31 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -m paddle.distributed.launch --gpus 0,1 python train_xnli.py \ +--data_path './data/XNLI' \ +--device 'gpu' \ +--num_train_epochs 5 \ +--max_seq_length 256 \ +--per_device_train_batch_size 16 \ +--per_device_eval_batch_size 16 \ +--learning_rate 1.3e-5 \ +--adam_beta2 0.98 \ +--weight_decay 0.001 \ +--warmup_ratio 0.1 \ +--logging_steps 10 \ +--save_steps 100 \ +--seed 2333 \ +--do_train \ +--do_eval \ +--output_dir "outputs/xnli" | tee outputs/train_xnli.log diff --git a/examples/language_model/chinesebert/train_chn.py b/examples/language_model/chinesebert/train_chn.py new file mode 100644 index 0000000000000000000000000000000000000000..75059fcd4fc1a60ff044556265967c1da476c474 --- /dev/null +++ b/examples/language_model/chinesebert/train_chn.py @@ -0,0 +1,165 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
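+# Fine-tunes ChineseBERT-large for binary sentiment classification on ChnSentiCorp
+# with the PaddleNLP Trainer. Train/dev/test splits are read as TSV files from
+# --data_path, and dev-set accuracy is reported through compute_metrics.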
+ +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from paddle.metric import Accuracy +from utils import load_ds + +from paddlenlp.data import Pad, Stack +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.transformers import ( + ChineseBertForSequenceClassification, + ChineseBertTokenizer, +) + + +@dataclass +class ModelArguments: + max_seq_length: Optional[int] = field( + default=512, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded." + ) + }, + ) + + +@dataclass +class DataArguments: + data_path: Optional[str] = field( + default="./data", + metadata={"help": "The path of datasets to be loaded."}, + ) + + +def convert_example(example, tokenizer, max_length=512, is_test=False): + # The original data is processed into a format that can be read in by the model, + # enocded_ Inputs is a dict that contains inputs_ids、token_type_ids、etc. + encoded_inputs = tokenizer(text=example["text"], max_length=max_length) + + # input_ids:After the text is segmented into tokens, the corresponding token id in the vocabulary. + input_ids = encoded_inputs["input_ids"] + # # token_type_ids:Does the current token belong to sentence 1 or sentence 2, that is, the segment ids. + pinyin_ids = encoded_inputs["pinyin_ids"] + + label = np.array([example["label"]], dtype="int64") + # return encoded_inputs + return input_ids, pinyin_ids, label + + +@dataclass +class DataCollator: + tokenizer: ChineseBertTokenizer + + def __call__(self, features): + input_ids = [] + pinyin_ids = [] + labels = [] + batch = {} + + for feature in features: + input_idx, pinyin_idx, label = feature + input_ids.append(input_idx) + pinyin_ids.append(pinyin_idx) + labels.append(label) + + input_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids),) # input_ids + pinyin_ids = (Pad(axis=0, pad_val=0)(pinyin_ids),) # pinyin_ids + labels = (Stack()(labels),) # labels + + batch["input_ids"] = input_ids[0] + batch["pinyin_ids"] = pinyin_ids[0] + batch["labels"] = labels[0] + + return batch + + +def compute_metrics(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=-1) + labels = paddle.argmax(labels, axis=-1) + metric = Accuracy() + correct = metric.compute(preds, labels) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + data_dir = data_args.data_path + train_path = os.path.join(data_dir, "train.tsv") + dev_path = os.path.join(data_dir, "dev.tsv") + test_path = os.path.join(data_dir, "test.tsv") + + train_ds, dev_ds, test_ds = load_ds(datafiles=[train_path, dev_path, test_path]) + + model = ChineseBertForSequenceClassification.from_pretrained("ChineseBERT-large", num_classes=2) + tokenizer = ChineseBertTokenizer.from_pretrained("ChineseBERT-large") + + # Process the data into a data format that the model can read in. 
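+ # convert_example tokenizes each text into an (input_ids, pinyin_ids, label) tuple;
+ # padding to the longest sequence in a batch is deferred to the DataCollator.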
+ trans_func = partial(convert_example, tokenizer=tokenizer, max_length=model_args.max_seq_length) + train_ds = train_ds.map(trans_func, lazy=False) + dev_ds = dev_ds.map(trans_func, lazy=False) + test_ds = test_ds.map(trans_func, lazy=False) + + # Form data into batch data, such as padding text sequences of different lengths into the maximum length of batch data, + # and stack each data label together + batchify_fn = DataCollator(tokenizer) + criterion = paddle.nn.loss.CrossEntropyLoss() + + trainer = Trainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + criterion=criterion, + compute_metrics=compute_metrics, + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/language_model/chinesebert/train_cmrc2018.py b/examples/language_model/chinesebert/train_cmrc2018.py new file mode 100644 index 0000000000000000000000000000000000000000..62b70af02e3345f2f42a75d29b205535c5d30bec --- /dev/null +++ b/examples/language_model/chinesebert/train_cmrc2018.py @@ -0,0 +1,259 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os +from dataclasses import dataclass, field +from typing import Optional + +import paddle +from cmrc_evaluate import get_result +from dataset_cmrc2018 import EvalTrainer, get_dev_dataset, get_train_dataset +from metric_cmrc import compute_prediction, squad_evaluate +from utils import CrossEntropyLossForSQuAD, save_json + +from paddlenlp.data import Pad, Stack +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, set_seed +from paddlenlp.transformers import ( + BertForQuestionAnswering, + BertTokenizer, + ChineseBertForQuestionAnswering, + ChineseBertTokenizer, + ErnieForQuestionAnswering, + ErnieTokenizer, +) + +logger = logging.getLogger(__name__) + +MODEL_CLASSES = { + "bert": (BertForQuestionAnswering, BertTokenizer), + "ernie": (ErnieForQuestionAnswering, ErnieTokenizer), + "chinesebert": (ChineseBertForQuestionAnswering, ChineseBertTokenizer), +} + + +@dataclass +class ModelArguments: + model_type: Optional[str] = field( + default="chinesebert", + metadata={"help": ("Type of pre-trained model.")}, + ) + model_name_or_path: Optional[str] = field( + default="ChineseBERT-large", + metadata={"help": ("Path to pre-trained model or shortcut name of model.")}, + ) + max_seq_length: Optional[int] = field( + default=512, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. 
" + "Sequences longer than this will be truncated, sequences shorter will be padded." + ) + }, + ) + doc_stride: Optional[int] = field( + default=128, + metadata={"help": ("When splitting up a long document into chunks, how much stride to take between chunks.")}, + ) + n_best_size: Optional[int] = field( + default=35, + metadata={ + "help": ("The total number of n-best predictions to generate in the nbest_predictions.json output file.") + }, + ) + null_score_diff_threshold: Optional[float] = field( + default=0.0, + metadata={"help": ("If null_score - best_non_null is greater than the threshold predict null.")}, + ) + max_query_length: Optional[int] = field( + default=64, + metadata={"help": ("Max query length.")}, + ) + max_answer_length: Optional[int] = field( + default=65, + metadata={"help": ("Max answer length.")}, + ) + use_amp: Optional[bool] = field( + default=False, + metadata={"help": ("Enable mixed precision training.")}, + ) + + +@dataclass +class DataArguments: + data_dir: Optional[str] = field( + default="./data/cmrc2018", + metadata={"help": ("the path of cmrc2018 data.")}, + ) + save_nbest_json: Optional[bool] = field( + default=False, + metadata={"help": ("Enable save nbest json.")}, + ) + + +parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) +model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + +@dataclass +class Train_DataCollator: + tokenizer: ChineseBertTokenizer + + def __call__(self, features): + input_ids = [] + token_type_ids = [] + pinyin_ids = [] + start_positions = [] + end_positions = [] + batch = {} + + for feature in features: + input_ids.append(feature["input_ids"]) + token_type_ids.append(feature["token_type_ids"]) + pinyin_ids.append(feature["pinyin_ids"]) + start_positions.append(feature["start_positions"]) + end_positions.append(feature["end_positions"]) + + input_ids = Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids) # input_ids + token_type_ids = Pad(axis=0, pad_val=0)(token_type_ids) + pinyin_ids = Pad(axis=0, pad_val=0)(pinyin_ids) # pinyin_ids + start_positions = Stack(dtype="int64")(start_positions) + end_positions = Stack(dtype="int64")(end_positions) + + batch["input_ids"] = input_ids + batch["token_type_ids"] = token_type_ids + batch["pinyin_ids"] = pinyin_ids + batch["start_positions"] = start_positions + batch["end_positions"] = end_positions + + return batch + + +@dataclass +class Eval_DataCollator: + tokenizer: ChineseBertTokenizer + + def __call__(self, features): + input_ids = [] + token_type_ids = [] + pinyin_ids = [] + batch = {} + + for feature in features: + input_ids.append(feature["input_ids"]) + token_type_ids.append(feature["token_type_ids"]) + pinyin_ids.append(feature["pinyin_ids"]) + + input_ids = Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids) # input_ids + token_type_ids = Pad(axis=0, pad_val=0)(token_type_ids) + pinyin_ids = Pad(axis=0, pad_val=0)(pinyin_ids) # pinyin_ids + + batch["input_ids"] = input_ids + batch["token_type_ids"] = token_type_ids + batch["pinyin_ids"] = pinyin_ids + + return batch + + +def compute_metrics(eval_preds, dataloader, args): + all_start_logits, all_end_logits = eval_preds.predictions + all_start_logits = all_start_logits.tolist() + all_end_logits = all_end_logits.tolist() + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + dataloader.dataset.data, + dataloader.dataset.new_data, + (all_start_logits, all_end_logits), + False, + model_args.n_best_size, + model_args.max_answer_length, + 
model_args.null_score_diff_threshold, + ) + + save_json(all_predictions, os.path.join(args.output_dir, "all_predictions.json")) + if data_args.save_nbest_json: + save_json(all_nbest_json, os.path.join(args.output_dir, "all_nbest_json.json")) + + ground_truth_file = os.path.join(data_args.data_dir, "dev.json") + + eval_results = get_result( + ground_truth_file=ground_truth_file, prediction_file=os.path.join(args.output_dir, "all_predictions.json") + ) + print("CMRC2018 EVALUATE.") + print(eval_results) + print("SQUAD EVALUATE.") + squad_evaluate( + examples=dataloader.dataset.data, + preds=all_predictions, + na_probs=scores_diff_json, + ) + return eval_results + + +def train(): + model_args.model_type = model_args.model_type.lower() + training_args.logdir = os.path.join(training_args.output_dir, "logs") + os.makedirs("caches", exist_ok=True) + os.makedirs(training_args.logdir, exist_ok=True) + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + # get model and tokenizer + model_class, tokenizer_class = MODEL_CLASSES[model_args.model_type] + model = model_class.from_pretrained(model_args.model_name_or_path) + tokenizer = tokenizer_class.from_pretrained(model_args.model_name_or_path) + + # get dataloader + args = {} + args["training_args"] = training_args + args["data_args"] = data_args + args["model_args"] = model_args + train_ds = get_train_dataset(tokenizer, args) + dev_ds = get_dev_dataset(tokenizer, args) + train_collator = Train_DataCollator(tokenizer) + dev_collator = Eval_DataCollator(tokenizer) + + criterion = CrossEntropyLossForSQuAD() + + trainer = EvalTrainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=train_collator, + criterion=criterion, + compute_metrics=compute_metrics, + ) + trainer.set_eval_collator(dev_collator) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + train() diff --git a/examples/language_model/chinesebert/train_xnli.py b/examples/language_model/chinesebert/train_xnli.py new file mode 100644 index 0000000000000000000000000000000000000000..c2e0d5c45c93a364b163da4fcce1d497fd041757 --- /dev/null +++ b/examples/language_model/chinesebert/train_xnli.py @@ -0,0 +1,166 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
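+# Fine-tunes ChineseBERT-large on XNLI (3-way natural language inference) with the
+# PaddleNLP Trainer. Sentence pairs from train/dev/test TSV files are encoded together
+# by the tokenizer, and labels are mapped in convert_example as
+# contradiction/contradictory -> 0, neutral -> 1, entailment -> 2.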
+ +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from paddle.metric import Accuracy +from utils import load_ds_xnli + +from paddlenlp.data import Pad, Stack +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.transformers import ( + ChineseBertForSequenceClassification, + ChineseBertTokenizer, +) + + +@dataclass +class ModelArguments: + max_seq_length: Optional[int] = field( + default=512, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded." + ) + }, + ) + + +@dataclass +class DataArguments: + data_path: Optional[str] = field( + default="./data", + metadata={"help": "The path of datasets to be loaded."}, + ) + + +def convert_example(example, tokenizer, max_length=512, is_test=False): + + label_map = {"contradictory": 0, "contradiction": 0, "entailment": 2, "neutral": 1} + first, second, third = example["sentence1"], example["sentence2"], example["label"] + + encoded_inputs = tokenizer(first, second, max_length=max_length) + input_ids = encoded_inputs["input_ids"] + pinyin_ids = encoded_inputs["pinyin_ids"] + + label = np.array([label_map[third]], dtype="int64") + assert len(input_ids) <= max_length + return input_ids, pinyin_ids, label + + +@dataclass +class DataCollator: + tokenizer: ChineseBertTokenizer + + def __call__(self, features): + input_ids = [] + pinyin_ids = [] + labels = [] + batch = {} + + for feature in features: + input_idx, pinyin_idx, label = feature + input_ids.append(input_idx) + pinyin_ids.append(pinyin_idx) + labels.append(label) + + input_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids),) # input_ids + pinyin_ids = (Pad(axis=0, pad_val=0)(pinyin_ids),) # pinyin_ids + labels = (Stack()(labels),) # labels + + batch["input_ids"] = input_ids[0] + batch["pinyin_ids"] = pinyin_ids[0] + batch["labels"] = labels[0] + + return batch + + +def compute_metrics(eval_preds): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=-1) + labels = paddle.argmax(labels, axis=-1) + metric = Accuracy() + correct = metric.compute(preds, labels) + metric.update(correct) + acc = metric.accumulate() + return {"accuracy": acc} + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + data_dir = data_args.data_path + train_path = os.path.join(data_dir, "train.tsv") + dev_path = os.path.join(data_dir, "dev.tsv") + test_path = os.path.join(data_dir, "test.tsv") + + train_ds, dev_ds, test_ds = load_ds_xnli(datafiles=[train_path, dev_path, test_path]) + + model = ChineseBertForSequenceClassification.from_pretrained("ChineseBERT-large", num_classes=3) + tokenizer = ChineseBertTokenizer.from_pretrained("ChineseBERT-large") + + print(" | load pretrained model state sucessfully.") + + # Process the data into a data format that the model can read in. 
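+ # Each example is encoded as a sentence pair; convert_example returns
+ # (input_ids, pinyin_ids, label) and asserts the encoded pair fits within max_seq_length.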
+ trans_func = partial(convert_example, tokenizer=tokenizer, max_length=model_args.max_seq_length) + train_ds = train_ds.map(trans_func, lazy=False) + dev_ds = dev_ds.map(trans_func, lazy=False) + test_ds = test_ds.map(trans_func, lazy=False) + + # Form data into batch data, such as padding text sequences of different lengths into the maximum length of batch data, + # and stack each data label together + batchify_fn = DataCollator(tokenizer) + criterion = paddle.nn.loss.CrossEntropyLoss() + + trainer = Trainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + criterion=criterion, + compute_metrics=compute_metrics, + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/language_model/chinesebert/utils.py b/examples/language_model/chinesebert/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..a58fc7edb21463646e9789620d551570ec3ab539 --- /dev/null +++ b/examples/language_model/chinesebert/utils.py @@ -0,0 +1,235 @@ +# encoding=utf8 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
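+# Shared helpers for the ChineseBERT examples: learning-rate scheduler construction
+# (linear/cosine/poly with warmup), a SQuAD-style loss that averages the start- and
+# end-position cross entropies, JSON/pickle I/O utilities, and TSV dataset loaders
+# for the ChnSentiCorp and XNLI tasks.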
+ +import json +import pickle +import random +from collections import OrderedDict + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import ( + CosineDecayWithWarmup, + LinearDecayWithWarmup, + PolyDecayWithWarmup, +) + +scheduler_type2cls = { + "linear": LinearDecayWithWarmup, + "cosine": CosineDecayWithWarmup, + "poly": PolyDecayWithWarmup, +} + + +def get_layer_lr_radios(layer_decay=0.8, n_layers=12): + """Have lower learning rates for layers closer to the input.""" + key_to_depths = OrderedDict( + { + "mpnet.embeddings.": 0, + "mpnet.encoder.relative_attention_bias.": 0, + "qa_outputs.": n_layers + 2, + } + ) + for layer in range(n_layers): + key_to_depths[f"mpnet.encoder.layer.{str(layer)}."] = layer + 1 + return {key: (layer_decay ** (n_layers + 2 - depth)) for key, depth in key_to_depths.items()} + + +def set_seed(seed): + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def get_writer(args): + if args.writer_type == "visualdl": + from visualdl import LogWriter + + writer = LogWriter(logdir=args.logdir) + elif args.writer_type == "tensorboard": + from tensorboardX import SummaryWriter + + writer = SummaryWriter(logdir=args.logdir) + else: + raise ValueError("writer_type must be in ['visualdl', 'tensorboard']") + return writer + + +def get_scheduler( + learning_rate, + scheduler_type, + num_warmup_steps=None, + num_training_steps=None, + **scheduler_kwargs, +): + if scheduler_type not in scheduler_type2cls.keys(): + data = " ".join(scheduler_type2cls.keys()) + raise ValueError(f"scheduler_type must be choson from {data}") + + if num_warmup_steps is None: + raise ValueError("requires `num_warmup_steps`, please provide that argument.") + + if num_training_steps is None: + raise ValueError("requires `num_training_steps`, please provide that argument.") + + return scheduler_type2cls[scheduler_type]( + learning_rate=learning_rate, + total_steps=num_training_steps, + warmup=num_warmup_steps, + **scheduler_kwargs, + ) + + +def save_json(data, file_name): + with open(file_name, "w", encoding="utf-8") as w: + w.write(json.dumps(data, ensure_ascii=False, indent=4) + "\n") + + +class CrossEntropyLossForSQuAD(nn.Layer): + def forward(self, logits, labels): + start_logits, end_logits = logits + start_position, end_position = labels + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = F.cross_entropy(input=start_logits, label=start_position) + end_loss = F.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + + return loss + + +def save_pickle(data, file_path): + with open(str(file_path), "wb") as f: + pickle.dump(data, f) + + +def load_pickle(input_file): + with open(str(input_file), "rb") as f: + data = pickle.load(f) + return data + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn, lazy=False) + + # shuffle = True if mode == 'train' else False + shuffle = False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) + return dataloader + + +def convert_example(example, tokenizer, is_test=False): + 
""" + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + valid_length(obj:`int`): The input sequence valid length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + + input_ids = tokenizer.encode(example["text"]) + input_ids = np.array(input_ids, dtype="int64") + + if not is_test: + label = np.array(example["label"], dtype="int64") + return input_ids, label + else: + return input_ids + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, labels = batch + logits = model(input_ids, token_type_ids) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + accu = metric.accumulate() + print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu)) + model.train() + metric.reset() + return accu + + +def load_ds(datafiles): + """ + intput: + datafiles -- str or list[str] -- the path of train or dev sets + split_train -- Boolean -- split from train or not + dev_size -- int -- split how much data from train + + output: + MapDataset + """ + + def read(ds_file): + with open(ds_file, "r", encoding="utf-8") as fp: + next(fp) # Skip header + for line in fp.readlines(): + data = line[:-1].split("\t") + if len(data) == 2: + yield ({"text": data[1], "label": int(data[0])}) + elif len(data) == 3: + yield ({"text": data[2], "label": int(data[1])}) + + if isinstance(datafiles, str): + return MapDataset(list(read(datafiles))) + elif isinstance(datafiles, list) or isinstance(datafiles, tuple): + return [MapDataset(list(read(datafile))) for datafile in datafiles] + + +def load_ds_xnli(datafiles): + def read(ds_file): + with open(ds_file, "r", encoding="utf-8") as fp: + # next(fp) # Skip header + for line in fp.readlines(): + data = line.strip().split("\t", 2) + first, second, third = data + yield ({"sentence1": first, "sentence2": second, "label": third}) + + if isinstance(datafiles, str): + return MapDataset(list(read(datafiles))) + elif isinstance(datafiles, list) or isinstance(datafiles, tuple): + return [MapDataset(list(read(datafile))) for datafile in datafiles] diff --git a/examples/language_model/convbert/README.md b/examples/language_model/convbert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b7014db2abdf8029f62b04272e4e87f461cae893 --- /dev/null +++ b/examples/language_model/convbert/README.md @@ -0,0 +1,99 @@ +# ConvBert with PaddleNLP + +[ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) + +**摘要:** +像BERT及其变体这样的预训练语言模型最近在各种自然语言理解任务中取得了令人印象深刻的表现。然而,BERT严重依赖全局自注意力块,因此需要大量内存占用和计算成本。 
+虽然它的所有注意力头从全局角度查询整个输入序列以生成注意力图,但我们观察到一些头只需要学习局部依赖,这意味着存在计算冗余。 +因此,我们提出了一种新颖的基于跨度的动态卷积来代替这些自注意力头,以直接对局部依赖性进行建模。新的卷积头与其余的自注意力头一起形成了一个新的混合注意力块,在全局和局部上下文学习中都更有效。 +我们为 BERT 配备了这种混合注意力设计并构建了一个ConvBERT模型。实验表明,ConvBERT 在各种下游任务中明显优于BERT及其变体,具有更低的训练成本和更少的模型参数。 +值得注意的是,ConvBERT-base 模型达到86.4GLUE分数,比ELECTRA-base高0.7,同时使用不到1/4的训练成本。 + +本项目是 ConvBert 在 Paddle 2.x上的开源实现。 + +## **数据准备** + +### Fine-tuning数据 +Fine-tuning 使用GLUE数据,这部分Paddle已提供,在执行Fine-tuning 命令时会自动下载并加载 + + +## **模型预训练** +模型预训练过程可参考[Electra的README](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/electra/README.md) + +## **Fine-tuning** + +### 运行Fine-tuning + +#### **使用Paddle提供的预训练模型运行 Fine-tuning** + +以 GLUE/SST-2 任务为例,启动 Fine-tuning 的方式如下: +```shell +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=SST-2 + +python -u examples/language_model/convbert/run_glue.py \ + --model_type convbert \ + --model_name_or_path convbert-small \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 256 \ + --learning_rate 1e-4 \ + --num_train_epochs 3 \ + --logging_steps 100 \ + --save_steps 100 \ + --output_dir ./glue/$TASK_NAME/ \ + --device gpu +``` +其中参数释义如下: +- `model_type` 指示了模型类型,当前支持BERT、ELECTRA、ERNIE、CONVBERT模型。 +- `model_name_or_path` 模型名称或者路径,其中convbert模型当前仅支持convbert-small、convbert-medium-small、convbert-base几种规格。 +- `task_name` 表示 Fine-tuning 的任务,当前支持CoLA、SST-2、MRPC、STS-B、QQP、MNLI、QNLI、RTE。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU、NPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 + +Fine-tuning过程将按照 `logging_steps` 和 `save_steps` 的设置打印如下格式的日志: + +``` +global step 100/792, epoch: 0, batch: 99, rank_id: 0, loss: 0.333723, lr: 0.0000970547, speed: 3.6162 step/s +eval loss: 0.295912, acc: 0.8623853211009175, eval done total : 0.5295147895812988 s +global step 200/792, epoch: 0, batch: 199, rank_id: 0, loss: 0.243273, lr: 0.0000830295, speed: 3.6822 step/s +eval loss: 0.249330, acc: 0.8899082568807339, eval done total : 0.508596658706665 s +global step 300/792, epoch: 1, batch: 35, rank_id: 0, loss: 0.166950, lr: 0.0000690042, speed: 3.7250 step/s +eval loss: 0.307219, acc: 0.8956422018348624, eval done total : 0.5816614627838135 s +global step 400/792, epoch: 1, batch: 135, rank_id: 0, loss: 0.185729, lr: 0.0000549790, speed: 3.6896 step/s +eval loss: 0.201950, acc: 0.9025229357798165, eval done total : 0.5364704132080078 s +global step 500/792, epoch: 1, batch: 235, rank_id: 0, loss: 0.132817, lr: 0.0000409537, speed: 3.7708 step/s +eval loss: 0.239518, acc: 0.9094036697247706, eval done total : 0.5128316879272461 s +global step 600/792, epoch: 2, batch: 71, rank_id: 0, loss: 0.163107, lr: 0.0000269285, speed: 3.7303 step/s +eval loss: 0.199408, acc: 0.9139908256880734, eval done total : 0.5226929187774658 s +global step 700/792, epoch: 2, batch: 171, rank_id: 0, loss: 0.082950, lr: 0.0000129032, speed: 3.7664 step/s +eval loss: 0.236055, acc: 0.9025229357798165, eval done total : 0.5140993595123291 s +global step 792/792, epoch: 2, batch: 263, rank_id: 0, loss: 0.025735, lr: 0.0000000000, speed: 4.1180 step/s +eval loss: 0.226449, acc: 0.9013761467889908, eval done total : 0.5103530883789062 s +``` + +使用convbert-small预训练模型进行单卡Fine-tuning ,在验证集上有如下结果(这里各类任务的结果是运行1次的结果): + +| Task | Metric | Result | 
+|-------|------------------------------|-------------| +| CoLA | Matthews corr | 56.22 | +| SST-2 | acc. | 91.39 | +| MRPC | acc./F1 | 87.70 | +| STS-B | Pearson/Spearman corr | 86.34 | +| QQP | acc./F1 | 85.47 | +| MNLI | matched acc./mismatched acc. | 81.87 | +| QNLI | acc. | 87.71 | +| RTE | acc. | 66.06 | + +注:acc.是Accuracy的简称,表中Metric字段名词取自[GLUE论文](https://openreview.net/pdf?id=rJ4km2R5t7) + + + +## Reference +[Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. ConvBERT: Improving BERT with Span-based Dynamic Convolution. In NeurIPS 2020](https://arxiv.org/abs/2008.02496) diff --git a/examples/language_model/convbert/convert.py b/examples/language_model/convbert/convert.py new file mode 100644 index 0000000000000000000000000000000000000000..96208861028aa10bd69184aeb70b78922f54281d --- /dev/null +++ b/examples/language_model/convbert/convert.py @@ -0,0 +1,81 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import OrderedDict +import argparse + +huggingface_to_paddle = { + "embeddings.LayerNorm": "embeddings.layer_norm", + "encoder.layer": "encoder.layers", + "attention.self.query.": "self_attn.q_proj.", + "attention.self.key.": "self_attn.k_proj.", + "attention.self.value.": "self_attn.v_proj.", + "attention.output.dense.": "self_attn.out_proj.", + "intermediate.dense": "linear1", + "output.dense": "linear2", + "attention.output.LayerNorm": "norm1", + "output.LayerNorm": "norm2", + "attention.self.key_conv_attn_layer": "self_attn.key_conv_attn_layer", + "attention.self.conv_kernel_layer": "self_attn.conv_kernel_layer", + "attention.self.conv_out_layer": "self_attn.conv_out_layer", +} + +skip_weights = ["embeddings.position_ids"] +dont_transpose = ["attention.self.key_conv_attn_layer", "_embeddings.weight", "LayerNorm."] + + +def convert_pytorch_checkpoint_to_paddle(pytorch_checkpoint_path, paddle_dump_path): + import torch + import paddle + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + if k in skip_weights: + continue + if k[-7:] == ".weight": + if not any([w in k for w in dont_transpose]): + if v.ndim == 2: + v = v.transpose(0, 1) + if "self.key_conv_attn_layer.bias" in k: + v = v.squeeze(-1) + + oldk = k + for huggingface_name, paddle_name in huggingface_to_paddle.items(): + k = k.replace(huggingface_name, paddle_name) + + print(f"Converting: {oldk} => {k}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pytorch_checkpoint_path", + default="./conv-bert-base/pytorch_model.bin", + type=str, + required=False, + help="Path to the Pytorch checkpoint path.", + ) + parser.add_argument( + "--paddle_dump_path", + default="./convbert-base/model_state.pdparams", + type=str, + required=False, + help="Path to the output Paddle 
model.", + ) + args = parser.parse_args() + convert_pytorch_checkpoint_to_paddle(args.pytorch_checkpoint_path, args.paddle_dump_path) diff --git a/examples/language_model/convbert/run_glue.py b/examples/language_model/convbert/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..559cec890c23c8b1f715fb5cd2ddb6484ba579fb --- /dev/null +++ b/examples/language_model/convbert/run_glue.py @@ -0,0 +1,372 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ConvBertForSequenceClassification, + ConvBertTokenizer, + ElectraForSequenceClassification, + ElectraTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, + LinearDecayWithWarmup, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "electra": (ElectraForSequenceClassification, ElectraTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), + "convbert": (ConvBertForSequenceClassification, ConvBertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "npu"], + help="The device to select to train the model, is must be cpu/gpu/npu.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("eval loss: %f, mcc: %s, " % (loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + 
return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0 or global_step == num_training_steps: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(model, loss_fct, metric, dev_data_loader_matched) + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "%s_ft_model_%d.pdparams" % (args.task_name, global_step) + ) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + 
"""print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + n_gpu = len(os.getenv("CUDA_VISIBLE_DEVICES", "").split(",")) + if args.device in "gpu" and n_gpu > 1: + paddle.distributed.spawn(do_train, args=(args,), nprocs=n_gpu) + else: + do_train(args) diff --git a/examples/language_model/convbert/run_pretrain.py b/examples/language_model/convbert/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..7ebbe898da0aa86002343a1207f4b37372945e54 --- /dev/null +++ b/examples/language_model/convbert/run_pretrain.py @@ -0,0 +1,507 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import io +import logging +import os +import random +import time + +import numpy as np +import paddle + +from paddlenlp.transformers import ( + ConvBertForTotalPretraining, + ConvBertGenerator, + ConvBertPretrainingCriterion, + ConvBertTokenizer, + LinearDecayWithWarmup, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +MODEL_CLASSES = { + "convbert": (ConvBertForTotalPretraining, ConvBertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default="convbert", + type=str, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default="convbert-small", + type=str, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument("--max_seq_length", default=128, type=int, help="max length of each sequence") + parser.add_argument("--mask_prob", default=0.15, type=float, help="the probability of one word to be mask") + parser.add_argument( + "--train_batch_size", + default=96, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--eval_batch_size", + default=96, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon 
for Adam optimizer.") + parser.add_argument( + "--num_train_epochs", + default=4, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=10000, type=int, help="Linear warmup over warmup_steps.") + + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--use_amp", action="store_true", help="Whether to use float16(Automatic Mixed Precision) to train." + ) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument("--eager_run", type=bool, default=True, help="Use dygraph mode.") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +class BookCorpus(paddle.io.Dataset): + """ + https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html + Args: + data_path (:obj:`str`) : The dataset file path, which contains train.tsv, dev.tsv and test.tsv. + tokenizer (:obj:`class PretrainedTokenizer`) : The tokenizer to split word and convert word to id. + max_seq_length (:obj:`int`) : max length for each sequence. + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train, test or dev). + """ + + def __init__( + self, + data_path, + tokenizer, + max_seq_length, + mode="train", + ): + if mode == "train": + data_file = "train.data" + elif mode == "test": + data_file = "test.data" + else: + data_file = "dev.data" + + self.data_file = os.path.join(data_path, data_file) + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.raw_examples = self._read_file(self.data_file) + + def _read_file(self, input_file): + """ + Reads a text file. + + Args: + input_file (:obj:`str`) : The file to be read. + + Returns: + examples (:obj:`list`): All the input data. 
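+
+        Raises:
+            RuntimeError: If `input_file` does not exist.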
+ """ + if not os.path.exists(input_file): + raise RuntimeError("The file {} is not found.".format(input_file)) + else: + with io.open(input_file, "r", encoding="UTF-8") as f: + examples = [] + while True: + line = f.readline() + if line: + if len(line) > 0 and not line.isspace(): + example = self.tokenizer(line, max_seq_len=self.max_seq_length)["input_ids"] + examples.append(example) + else: + break + return examples + + def truncation_ids(self, ids, max_seq_length): + if len(ids) <= (max_seq_length - 2): + return ids + else: + return ids[: (max_seq_length - 2)] + + def __getitem__(self, idx): + return self.raw_examples[idx] + + def __len__(self): + return len(self.raw_examples) + + +class DataCollatorForConvBert(object): + """ + pads, gets batch of tensors and preprocesses batches for masked language modeling + when dataloader num_worker > 0, this collator may trigger some bugs, for safe, be sure dataloader num_worker=0 + """ + + def __init__(self, tokenizer, max_seq_length, mlm=True, mlm_probability=0.15): + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.mlm = True + self.mlm_probability = mlm_probability + + def __call__(self, examples): + if self.mlm: + inputs, raw_inputs, labels = self.mask_tokens(examples) + return inputs, raw_inputs, labels + else: + raw_inputs, _ = self.add_special_tokens_and_set_maskprob(examples, True, self.max_seq_length) + raw_inputs = self.tensorize_batch(raw_inputs, "int64") + inputs = raw_inputs.clone().detach() + labels = raw_inputs.clone().detach() + if self.tokenizer.pad_token is not None: + pad_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.pad_token) + labels[labels == pad_token_id] = -100 + return inputs, raw_inputs, labels + + def tensorize_batch(self, examples, dtype): + if isinstance(examples[0], (list, tuple)): + examples = [paddle.to_tensor(e, dtype=dtype) for e in examples] + length_of_first = examples[0].shape[0] + are_tensors_same_length = all(x.shape[0] == length_of_first for x in examples) + if are_tensors_same_length: + return paddle.stack(examples, axis=0) + else: + raise ValueError("the tensor in examples not have same shape, please check input examples") + + def add_special_tokens_and_set_maskprob(self, inputs, truncation, max_seq_length): + pad_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.pad_token) + full_inputs = [] + full_maskprob = [] + max_length = 0 + for ids in inputs: + if len(ids) > max_length: + max_length = len(ids) + max_length = min(max_length, max_seq_length) + + for ids in inputs: + if len(ids) <= max_length: + padding_num = max_length - len(ids) + full_inputs.append(ids + ([pad_token_id] * padding_num)) + full_maskprob.append([0] + ([self.mlm_probability] * (len(ids) - 2)) + [0] + ([0] * padding_num)) + else: + if truncation: + full_inputs.append(ids[:max_length]) + full_maskprob.append([0] + ([self.mlm_probability] * (max_length - 2)) + [0]) + else: + full_inputs.append(ids) + full_maskprob.append([0] + ([self.mlm_probability] * (len(ids) - 2)) + [0]) + return full_inputs, full_maskprob + + def mask_tokens(self, examples): + if self.tokenizer.mask_token is None: + raise ValueError("the tokenizer does not have mask_token, please check!") + mask_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token) + + raw_inputs, probability_matrix = self.add_special_tokens_and_set_maskprob(examples, True, self.max_seq_length) + raw_inputs = self.tensorize_batch(raw_inputs, "int64") + probability_matrix = self.tensorize_batch(probability_matrix, "float32") + 
inputs = raw_inputs.clone() + labels = raw_inputs.clone() + + total_indices = paddle.bernoulli(probability_matrix).astype("bool").numpy() + labels[~total_indices] = -100 + + # 80% MASK + indices_mask = paddle.bernoulli(paddle.full(labels.shape, 0.8)).astype("bool").numpy() & total_indices + inputs[indices_mask] = mask_token_id + + # 10% Random + indices_random = ( + paddle.bernoulli(paddle.full(labels.shape, 0.5)).astype("bool").numpy() & total_indices & ~indices_mask + ) + random_words = paddle.randint(low=0, high=self.tokenizer.vocab_size, shape=labels.shape, dtype="int64") + inputs = paddle.where(paddle.to_tensor(indices_random), random_words, inputs) + + # 10% Original + return inputs, raw_inputs, labels + + +def create_dataloader(dataset, mode="train", batch_size=1, use_gpu=True, data_collator=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): + Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): + If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): + The sample number of a mini-batch. + use_gpu(obj:`bool`, optional, defaults to obj:`True`): + Whether to use gpu to run. + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + + if mode == "train" and use_gpu: + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + else: + shuffle = True if mode == "train" else False + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + + return dataloader + + +def do_train(args): + paddle.enable_static() if not args.eager_run else None + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + WorkerInitObj(args.seed + paddle.distributed.get_rank()) + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # Loads or initializes a model. + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + if args.model_name_or_path in pretrained_models_list: + config = model_class.config_class.from_pretrained(args.model_name_or_path) + model = model_class(config) + else: + model = model_class.from_pretrained(args.model_name_or_path) + + criterion = ConvBertPretrainingCriterion( + getattr(model.generator, ConvBertGenerator.base_model_prefix).config.vocab_size, + model.gen_weight, + model.disc_weight, + ) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + # Loads dataset. + tic_load_data = time.time() + print("start load data : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + train_dataset = BookCorpus( + data_path=args.input_dir, tokenizer=tokenizer, max_seq_length=args.max_seq_length, mode="train" + ) + print("load data done, total : %s s" % (time.time() - tic_load_data)) + + # Reads data and generates mini-batches. 
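+    # The collator below applies dynamic masking to every batch: each non-special,
+    # non-padding token is selected with probability `mask_prob`; of the selected
+    # tokens, roughly 80% are replaced with the mask token, 10% with a random
+    # vocabulary token and 10% are left unchanged. It returns the masked input ids,
+    # the original (raw) input ids and the generator labels, with unselected
+    # positions in the labels set to -100.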
+ data_collator = DataCollatorForConvBert( + tokenizer=tokenizer, max_seq_length=args.max_seq_length, mlm=True, mlm_probability=args.mask_prob + ) + + train_data_loader = create_dataloader( + train_dataset, + batch_size=args.train_batch_size, + mode="train", + use_gpu=True if args.device in "gpu" else False, + data_collator=data_collator, + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, + ) + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + + print("start train : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + trained_global_step = global_step = 0 + t_loss = paddle.to_tensor([0.0]) + log_loss = paddle.to_tensor([0.0]) + loss_list = [] + log_list = [] + tic_train = time.time() + + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + if trained_global_step > 0: + trained_global_step -= 1 + continue + global_step += 1 + input_ids, raw_input_ids, generator_labels = batch + if args.use_amp: + with paddle.amp.auto_cast(): + gen_logits, disc_logits, disc_labels, attention_mask = model( + input_ids=input_ids, raw_input_ids=raw_input_ids, generator_labels=generator_labels + ) + loss = criterion(gen_logits, disc_logits, generator_labels, disc_labels, attention_mask) + scaled = scaler.scale(loss) + scaled.backward() + t_loss += loss.detach() + scaler.minimize(optimizer, scaled) + else: + gen_logits, disc_logits, disc_labels, attention_mask = model( + input_ids=input_ids, raw_input_ids=raw_input_ids, generator_labels=generator_labels + ) + loss = criterion(gen_logits, disc_logits, generator_labels, disc_labels, attention_mask) + loss.backward() + t_loss += loss.detach() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + local_loss = (t_loss - log_loss) / args.logging_steps + if paddle.distributed.get_world_size() > 1: + paddle.distributed.all_gather(loss_list, local_loss) + if paddle.distributed.get_rank() == 0: + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "avg_loss: {4:.15f}, lr: {5:.10f}, speed: {6:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + float((paddle.stack(loss_list).sum() / len(loss_list)).numpy()), + optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + print(log_str) + log_list.append(log_str) + loss_list = [] + else: + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "loss: {4:.15f}, lr: {5:.10f}, speed: {6:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + float(local_loss.numpy()), + optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + print(log_str) + log_list.append(log_str) + log_loss = t_loss + tic_train = time.time() + if global_step % args.save_steps == 0: + if paddle.distributed.get_rank() == 0: + output_dir 
= os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if len(log_list) > 0: + with open(os.path.join(output_dir, "train.log"), "w") as f: + for log in log_list: + if len(log.strip()) > 0: + f.write(log.strip() + "\n") + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + n_gpu = len(os.getenv("CUDA_VISIBLE_DEVICES", "").split(",")) + if args.device in "gpu" and n_gpu > 1: + paddle.distributed.spawn(do_train, args=(args,), nprocs=n_gpu) + else: + do_train(args) diff --git a/examples/language_model/elmo/README.md b/examples/language_model/elmo/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bccce85996b855464440c99f263bc125096df78b --- /dev/null +++ b/examples/language_model/elmo/README.md @@ -0,0 +1,142 @@ +# ELMo + +## 模型简介 + +ELMo(Embeddings from Language Models)是重要的通用语义表示模型之一,以双向LSTM为网络基本组件,以Language Model为训练目标,通过预训练得到通用的语义表示,ELMo能够学习到复杂的特征,比如语法、语义,并且能够学习在不同上下文情况下的词汇多义性。将ELMo得到的语义表示作为Feature迁移到下游NLP任务中,会显著提升下游任务的模型性能,比如问答、文本蕴含和情感分析等。ELMo模型的细节可以[参阅论文](https://arxiv.org/abs/1802.05365)。 + +本项目是ELMo在Paddle上的开源实现, 基于1 Billion Word Language Model Benchmark进行预训练,并接入了简单的下游任务作为示例程序。 + +接入的下游任务是在sentence polarity dataset v1数据集上构建的文本二分类任务,采用ELMo + BoW的简单网络结构。与base模型(Word2Vec + BoW)进行精度对比。 + +| 模型 | test acc | +| ---- | -------- | +| word2vec + BoW | 0.7769 | +| ELMo + BoW | 0.7760 | + +## 环境依赖 + +- sklearn +- gensim + +安装方式:`pip install sklearn gensim` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +. +├── args.py # 运行参数配置文件 +├── dataset.py # 数据读取 +├── elmo.py # 模型组网 +├── run_pretrain.py # 训练模型主程序入口 +├── run_eval.py # 评估模型主程序入口 +├── word2vec_base.py # 下游二分类任务base模型训练测试主程序入口 +├── run_finetune.py # 下游二分类任务训练测试主程序入口 +├── download_data.sh # 数据下载脚本 +└── README.md # 文档说明 +``` + +### 数据准备 + +运行下载数据的脚本后,会生成两个文件,1-billion-word目录下会存在训练数据目录(training-tokenized-shuffled)、测试集数据(heldout-tokenized-shuffled)以及对应的词典(vocab-15w.txt),sentence-polarity-dataset-v1目录下会存在未切分的正向样本(rt-polarity.pos)、负向样本(rt-polarity.neg)以及Google预训练好的Word2Vec向量文件GoogleNews-vectors-negative300.bin.gz。 + +```shell +sh download_data.sh +``` + +1-billion-word目录结构: + +```text +. +├── training-tokenized-shuffled # 训练集 +├── heldout-tokenized-shuffled # 测试集 +└── vocab-15w.txt # 词典 +``` + +sentence-polarity-dataset-v1目录结构: + +```text +. 
+├── rt-polarity.pos # 正向样本 +├── rt-polarity.neg # 负向样本 +└── GoogleNews-vectors-negative300.bin.gz # 预训练好的Word2Vec向量 +``` + +### 模型训练 + +基于1-billion-word数据集,可以运行下面的命令,在训练集上进行模型训练 +```shell +# GPU启动, 支持单卡和多卡 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus '0' run_pretrain.py --train_data_path='./1-billion-word/training-tokenized-shuffled/*' --vocab_file='./1-billion-word/vocab-15w.txt' --save_dir='./checkpoints' --device='gpu' +``` + +其他可选参数和参数的默认值请参考`args.py`。 + +程序运行时将会自动开始训练,同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── 10000.pdopt +├── 10000.pdparams +├── 20000.pdopt +├── 20000.pdparams +├── ... +├── final.pdopt +└── final.pdparams +``` + +**NOTE:** 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/10000`即可,程序会自动加载模型参数`checkpoints/10000.pdparams`,也会自动加载优化器状态`checkpoints/10000.pdopt`。 + +### 模型评估 + +基于1-billion-word数据集,可以运行下面的命令,在评测集上进行模型评估 +```shell +# GPU启动,仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python run_eval.py --dev_data_path='./1-billion-word/heldout-tokenized-shuffled/*' --vocab_file='./1-billion-word/vocab-15w.txt' --init_from_ckpt='./checkpoints/10000' --device='gpu' +``` + +### 下游任务 + +下游任务是基于sentence polarity dataset v1数据集的二分类任务,base模型采用Word2Vec + BoW的模型结构,其中Word2Vec采用Google预训练好的GoogleNews-vectors-negative300.bin.gz。 + +#### base模型 + +base模型可以运行下面的命令,在训练集上进行模型训练评估 +```shell +# GPU启动, 支持单卡和多卡 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus '0' word2vec_base.py --data_dir='./sentence-polarity-dataset-v1/' --pretrained_word2vec_file='./sentence-polarity-dataset-v1/GoogleNews-vectors-negative300.bin' --device='gpu' +``` + +#### ELMo finetune + +ELMo finetune可以运行下面的命令,在训练集上进行模型训练评估 +```shell +# GPU启动, 支持单卡和多卡 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus '0' run_finetune.py --data_dir='./sentence-polarity-dataset-v1/' --init_from_ckpt='./checkpoints/10000' --device='gpu' +``` + +**NOTE:** 可以通过构建模型时的trainable参数设置ELMo参与或不参与下游任务的训练。ELMo接入下游任务的具体用法请参考`run_finetune.py`。 + +另外,预训练的ELMo也可以作为文本词向量编码器单独使用,即输入文本内容,输出每个词对应的词向量。用法示例如下: + +```python +from elmo import ELMoEmbedder + +embedder = ELMoEmbedder(params_file) +sentences = [['The', 'first', 'sentence', '.'], ['Second', 'one', '.']] + +embeddings = embedder.encode(sentences) +for i, (text, emb) in enumerate(zip(sentences, embeddings)): + print(text) + print(emb.shape) + print() +``` + +## Reference + +- [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) diff --git a/examples/language_model/elmo/args.py b/examples/language_model/elmo/args.py new file mode 100644 index 0000000000000000000000000000000000000000..6deaf83d53b977a84723bfe433aa1887f00b87c4 --- /dev/null +++ b/examples/language_model/elmo/args.py @@ -0,0 +1,51 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
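+
+# Command-line argument definitions for the ELMo example scripts; see README.md in
+# this directory for example launch commands.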
+ +import argparse + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--train_data_path", type=str, default="./1-billion-word/training-tokenized-shuffled/*", help="Specify the path to load train data.") + parser.add_argument("--dev_data_path", type=str, default="./1-billion-word/heldout-tokenized-shuffled/*", help="Specify the path to load dev data.") + parser.add_argument("--vocab_file", type=str, default="./1-billion-word/vocab-15w.txt", help="Specify the path to load vocab file.") + parser.add_argument("--save_dir", type=str, default="./checkpoint/", help="Specify the path to save the checkpoints.") + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + parser.add_argument("--save_freq", type=int, default=100, help="The frequency, in number of steps, to save checkpoint. (default: %(default)d)") + parser.add_argument("--log_freq", type=int, default=100, help="The frequency, in number of steps, the training logs are printed. (default: %(default)d)") + parser.add_argument("--epochs", type=int, default=10, help="Total number of training epochs to perform.") + parser.add_argument("--batch_size", type=int, default=128, help="Batch size per GPU/CPU for training.") + parser.add_argument("--dropout", type=float, default=0.1, help="The dropout rate.") + parser.add_argument("--lr", type=float, default=0.2, help="The initial learning rate.") + parser.add_argument("--seed", type=int, default=2020, help="Random seed.") + parser.add_argument("--max_grad_norm", type=float, default=10.0, help='The max grad norm.') + parser.add_argument("--max_characters_per_token", type=int, default=50, help="The maximum characters number of token in sequence. (default: %(default)d)") + parser.add_argument("--unroll_steps", type=int, default=20, help="The sentence length after re-cutting in dataset. (default: %(default)d)") + parser.add_argument("--char_embed_dim", type=int, default=16, help="The dimension of char_embedding table. (default: %(default)d)") + parser.add_argument("--projection_dim", type=int, default=512, help="The size of rnn hidden unit. (default: %(default)d)") + parser.add_argument("--num_layers", type=int, default=2, help="The num of rnn layers. (default: %(default)d)") + parser.add_argument("--num_highways", type=int, default=2, help="The num of highways in CharEncoder. (default: %(default)d)") + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Device for selecting for the training.") + + args = parser.parse_args() + return args +# yapf: enable + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") diff --git a/examples/language_model/elmo/dataset.py b/examples/language_model/elmo/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..adacebe7041e0d9b3e24abda22cb4c86347c23b1 --- /dev/null +++ b/examples/language_model/elmo/dataset.py @@ -0,0 +1,443 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import glob +import random +from copy import deepcopy +from typing import List + +import numpy as np +import paddle +from paddle.io import IterableDataset + + +class Vocabulary(object): + """ + A token vocabulary. Holds a map from token to ids and provides a method for + encoding text to a sequence of ids. + + Parameters: + filename (str): The vocabulary file. It is a flat text file with + one (normalized) token per line. + """ + + def __init__(self, filename): + self._word_to_id = {} + for word in ["UNK", "", "", ""]: + self._word_to_id[word] = len(self._word_to_id) + with open(filename, "r") as fin: + for line in fin: + word = line.strip() + if word in self._word_to_id: + raise ValueError("There has repeated token in the vocabulary file: %s" % word) + self._word_to_id[word] = len(self._word_to_id) + + @property + def bos(self): + return self._word_to_id[""] + + @property + def eos(self): + return self._word_to_id[""] + + @property + def unk(self): + return self._word_to_id["UNK"] + + @property + def pad(self): + return self._word_to_id[""] + + @property + def size(self): + return len(self._word_to_id) + + def word_to_id(self, word): + if word in self._word_to_id: + return self._word_to_id[word] + return self.unk + + def encode(self, sentence, split=True): + """ + Convert a sentence to a list of ids, with special tokens added. + Sentence is a single string with tokens separated by whitespace. + """ + if split: + word_ids = [self.word_to_id(cur_word) for cur_word in sentence.split()] + else: + word_ids = [self.word_to_id(cur_word) for cur_word in sentence] + + word_ids = [self.bos] + word_ids + [self.eos] + word_ids_reverse = deepcopy(word_ids) + word_ids_reverse.reverse() + return np.array(word_ids, dtype=np.int64), np.array(word_ids_reverse, dtype=np.int64) + + +class UnicodeCharsVocabulary(Vocabulary): + """ + Vocabulary containing character-level and word level information. + + Has a word vocabulary that is used to lookup word ids and a character id + that is used to map words to arrays of character ids. + + The character ids are defined by ord(c) for c in word.encode('utf-8'). + This limits the total number of possible char ids to 256. + To this we add 5 additional special ids: begin sentence, end sentence, + begin word, end word and char padding. + + Parameters: + filename (str): The vocabulary file. It is a flat text file with + one (normalized) token per line. + max_word_length (int): The maximum characters number of token in sequence. 
+ """ + + def __init__(self, filename, max_word_length, **kwargs): + super(UnicodeCharsVocabulary, self).__init__(filename, **kwargs) + self._max_word_length = max_word_length + + self.bos_char = 256 # + self.eos_char = 257 # + self.bow_char = 258 # + self.eow_char = 259 # + self.pad_char = 260 # + + self._word_char_ids = {} + + # the charcter representation of the begin/end of sentence characters + def _make_bos_eos(c): + r = np.zeros([self.max_word_length], dtype=np.int64) + r[:] = self.pad_char + r[0] = self.bow_char + r[1] = c + r[2] = self.eow_char + return r + + self.bos_chars = _make_bos_eos(self.bos_char) + self.eos_chars = _make_bos_eos(self.eos_char) + + for word in self._word_to_id: + self._word_char_ids[word] = self._convert_word_to_char_ids(word) + + self._word_char_ids[""] = self.bos_chars + self._word_char_ids[""] = self.eos_chars + + @property + def char_size(self): + # char ids 0-255 come from utf-8 encoding bytes. + # assign 256-300 to special chars. + # all +1, the id=0 is for token padding and mask. + return 262 + + @property + def max_word_length(self): + return self._max_word_length + + def _convert_word_to_char_ids(self, word): + code = np.zeros([self.max_word_length], dtype=np.int64) + code[:] = self.pad_char + + word_encoded = word.encode("utf-8", "ignore")[: (self.max_word_length - 2)] + code[0] = self.bow_char + for k, chr_id in enumerate(word_encoded, start=1): + code[k] = chr_id + code[len(word_encoded) + 1] = self.eow_char + + return code + + def word_to_char_ids(self, word): + if word in self._word_to_id: + return self._word_char_ids[word] + else: + return self._convert_word_to_char_ids(word) + + def encode_chars(self, sentence, split=True): + """ + Encode the sentence as a white space delimited string of tokens. + """ + if split: + chars_ids = [self.word_to_char_ids(cur_word) for cur_word in sentence.split()] + else: + chars_ids = [self.word_to_char_ids(cur_word) for cur_word in sentence] + + chars_ids = [self.bos_chars] + chars_ids + [self.eos_chars] + chars_ids_reverse = deepcopy(chars_ids) + chars_ids_reverse.reverse() + + # +1 for token padding and mask + chars_ids = np.vstack(chars_ids) + 1 + chars_ids_reverse = np.vstack(chars_ids_reverse) + 1 + return chars_ids, chars_ids_reverse + + +class CharsVocabulary(object): + def __init__(self, max_word_length): + self._max_word_length = max_word_length + + self.bos_char = 256 # + self.eos_char = 257 # + self.bow_char = 258 # + self.eow_char = 259 # + self.pad_char = 260 # + + # the charcter representation of the begin/end of sentence characters + def _make_bos_eos(c): + r = np.zeros([self.max_word_length], dtype=np.int64) + r[:] = self.pad_char + r[0] = self.bow_char + r[1] = c + r[2] = self.eow_char + return r + + self.bos_chars = _make_bos_eos(self.bos_char) + self.eos_chars = _make_bos_eos(self.eos_char) + + @property + def char_size(self): + # char ids 0-255 come from utf-8 encoding bytes. + # assign 256-300 to special chars. + # all +1, the id=0 is for token padding and mask. 
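+        # In total: 256 byte values + 5 special markers (bos/eos/bow/eow/pad) give
+        # ids 0-260; the +1 shift reserves id 0, so the character vocabulary size is 262.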
+ return 262 + + @property + def max_word_length(self): + return self._max_word_length + + def convert_word_to_char_ids(self, word): + code = np.zeros([self.max_word_length], dtype=np.int64) + code[:] = self.pad_char + + word_encoded = word.encode("utf-8", "ignore")[: (self.max_word_length - 2)] + code[0] = self.bow_char + for k, chr_id in enumerate(word_encoded, start=1): + code[k] = chr_id + code[len(word_encoded) + 1] = self.eow_char + + return code + + def encode_chars(self, sentence, split=True): + """ + Encode the sentence as a white space delimited string of tokens. + """ + if split: + chars_ids = [self.convert_word_to_char_ids(cur_word) for cur_word in sentence.split()] + else: + chars_ids = [self.convert_word_to_char_ids(cur_word) for cur_word in sentence] + + chars_ids = [self.bos_chars] + chars_ids + [self.eos_chars] + chars_ids_reverse = deepcopy(chars_ids) + chars_ids_reverse.reverse() + + # +1 for token padding and mask + chars_ids = np.vstack(chars_ids) + 1 + chars_ids_reverse = np.vstack(chars_ids_reverse) + 1 + return chars_ids, chars_ids_reverse + + +def load_vocab(vocab_file=None, max_word_length=50): + if vocab_file is None: + return CharsVocabulary(max_word_length) + elif max_word_length: + return UnicodeCharsVocabulary(vocab_file, max_word_length) + else: + return Vocabulary(vocab_file) + + +class OneBillionWordDataset(IterableDataset): + """ + Hold the one billion word dataset, consisting of 1B Words which is used for + benchmarking of Language Modeling. The training/held-out data was produced + from the WMT 2011 News Crawl data. + + The dataset is a list of tokenized files. Each file contains one sentence + per line. Each sentence is pre-tokenized and white space joined. + + Parameters: + filepattern (str): A glob string that specifies the list of files. + vocab (Vocabulary): An instance of Vocabulary or UnicodeCharsVocabulary. + batch_size (int): The batch_size. + num_steps (int): The sentence length after re-cutting in dataset. + n_procs (int): The number of GPUs. + mode (str, optional): The dataset mode. It can be "train" and "test". + When "test", the dataset iterate through all data once then stop. + When "train", it will iterate forever. Default: "test". + shuffle (bool, optional): Whether shuffle the data. Default: False. + seed (int, optional): The random seed. Default: 0. 
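+        rank (int, optional): The rank of the current process among `n_procs`
+            processes; used to select this process's shard of each batch. Default: 0.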
+ """ + + def __init__( + self, filepattern, vocab, batch_size, num_steps, n_procs=1, rank=0, mode="test", shuffle=False, seed=0 + ): + super(OneBillionWordDataset, self).__init__() + + self._all_files = glob.glob(filepattern) + print("\nFound %d files at %s\n" % (len(self._all_files), filepattern)) + self._vocab = vocab + self._max_word_length = vocab.max_word_length + self._use_char_inputs = hasattr(vocab, "encode_chars") + self._batch_size = batch_size + self._num_steps = num_steps + self._n_procs = n_procs + self._rank = rank + self._mode = mode + self._shuffle = shuffle + self._seed = abs(seed) + self._file_seed = self._get_file_random_seed() + + def _get_file_random_seed(self): + file_seed = {} + np.random.seed(self._seed) + seed_list = list(np.random.random(len(self._all_files))) + for file_path, seed in zip(list(self._all_files), seed_list): + file_seed[file_path] = seed + return file_seed + + def _load_file(self, file_path): + print("\nLoading data from: %s\n" % file_path) + with open(file_path) as f: + sentences_raw = f.readlines() + sentences = sentences_raw + + if self._shuffle: + if self._n_procs > 1: + seed = self._file_seed[file_path] * self._seed + random.seed(seed) + random.shuffle(sentences) + + for sentence in sentences: + ids, ids_reverse = self._vocab.encode(sentence) + if self._use_char_inputs: + char_ids, char_ids_reverse = self._vocab.encode_chars(sentence) + else: + char_ids, char_ids_reverse = None, None + yield (ids, char_ids, ids_reverse, char_ids_reverse) + + def get_sentence(self): + while True: + self._seed += 1 + all_files = list(self._all_files) + if self._shuffle: + if self._n_procs > 1: + random.seed(self._seed) + random.shuffle(all_files) + for file_path in all_files: + for ret in self._load_file(file_path): + yield ret + if self._mode == "test": + break + + @property + def number_of_tokens(self): + # number of tokens in training data (1B Word Benchmark) + return 768648884 + + def __iter__(self): + sentence_generator = self.get_sentence() + n_batch_size = self._batch_size * self._n_procs + cur_stream = [None] * n_batch_size + + while True: + inputs = np.zeros([n_batch_size, self._num_steps], np.int64) + inputs_reverse = np.zeros([n_batch_size, self._num_steps], np.int64) + if self._max_word_length is not None: + char_inputs = np.zeros([n_batch_size, self._num_steps, self._max_word_length], np.int64) + char_inputs_reverse = np.zeros([n_batch_size, self._num_steps, self._max_word_length], np.int64) + else: + char_inputs = None + char_inputs_reverse = None + targets = np.zeros([n_batch_size, self._num_steps], np.int64) + targets_reverse = np.zeros([n_batch_size, self._num_steps], np.int64) + + for i in range(n_batch_size): + cur_pos = 0 + while cur_pos < self._num_steps: + if cur_stream[i] is None or len(cur_stream[i][0]) <= 1: + try: + cur_stream[i] = list(next(sentence_generator)) + except StopIteration: + return + + how_many = min(len(cur_stream[i][0]) - 1, self._num_steps - cur_pos) + next_pos = cur_pos + how_many + + inputs[i, cur_pos:next_pos] = cur_stream[i][0][:how_many] + inputs_reverse[i, cur_pos:next_pos] = cur_stream[i][2][:how_many] + if self._max_word_length is not None: + char_inputs[i, cur_pos:next_pos] = cur_stream[i][1][:how_many] + char_inputs_reverse[i, cur_pos:next_pos] = cur_stream[i][3][:how_many] + targets[i, cur_pos:next_pos] = cur_stream[i][0][1 : how_many + 1] + targets_reverse[i, cur_pos:next_pos] = cur_stream[i][2][1 : how_many + 1] + + cur_pos = next_pos + + cur_stream[i][0] = cur_stream[i][0][how_many:] + cur_stream[i][2] = 
cur_stream[i][2][how_many:] + if self._max_word_length is not None: + cur_stream[i][1] = cur_stream[i][1][how_many:] + cur_stream[i][3] = cur_stream[i][3][how_many:] + + # token_ids: (n_batch_size, self._num_steps) + # char_inputs: character ids (n_batch_size, self._num_steps, 50) + # targets: word ID of next word (n_batch_size, self._num_steps) + batch_data = { + "token_ids": inputs, + "tokens_characters": char_inputs, + "next_token_ids": targets, + "token_ids_reverse": inputs_reverse, + "tokens_characters_reverse": char_inputs_reverse, + "next_token_ids_reverse": targets_reverse, + } + if self._n_procs > 1: + start = self._rank * self._batch_size + end = start + self._batch_size + for key in batch_data: + batch_data[key] = batch_data[key][start:end] + + yield ( + batch_data["tokens_characters"], + batch_data["next_token_ids"], + batch_data["tokens_characters_reverse"], + batch_data["next_token_ids_reverse"], + ) + + +def create_one_batch(sentences, vocab, max_seq_len): + # Add , for every sentence + max_len = max([len(sentence) for sentence in sentences]) + 2 + max_len = min(max_len, max_seq_len) + batch_ids = np.zeros([len(sentences), max_len, vocab.max_word_length], dtype=np.int64) + batch_ids_reverse = np.zeros([len(sentences), max_len, vocab.max_word_length], dtype=np.int64) + batch_lens = [] + for i, sentence in enumerate(sentences): + sentence = sentence[: max_len - 2] + seq_len = len(sentence) + 2 + ids, ids_reverse = vocab.encode_chars(sentence, split=False) + batch_ids[i, :seq_len, :] = ids + batch_ids_reverse[i, :seq_len, :] = ids_reverse + batch_lens.append(seq_len) + return batch_ids, batch_ids_reverse, batch_lens + + +def create_batches(sentences: List[List[str]], batch_size, vocab, max_seq_len): + """ + Batch the sentences as character ids + Each sentence is a list of tokens without or , e.g. + [['The', 'first', 'sentence', '.'], ['Second', '.']] + """ + n_batch = (len(sentences) - 1) // batch_size + 1 + for i in range(n_batch): + start, end = i * batch_size, (i + 1) * batch_size + ids, ids_reverse, seq_lens = create_one_batch(sentences[start:end], vocab, max_seq_len) + ids = paddle.to_tensor(ids) + ids_reverse = paddle.to_tensor(ids_reverse) + yield ids, ids_reverse, seq_lens diff --git a/examples/language_model/elmo/download_data.sh b/examples/language_model/elmo/download_data.sh new file mode 100644 index 0000000000000000000000000000000000000000..385df158a073b65527441304c1d79ea5225273a2 --- /dev/null +++ b/examples/language_model/elmo/download_data.sh @@ -0,0 +1,25 @@ +set -eux + +rm 1-billion-word* -rf +wget https://bj.bcebos.com/paddlenlp/datasets/1-billion-word.tar.gz +src_md5="5f079a9b88ea27585e0539f502ca9327" +md5=`md5sum 1-billion-word.tar.gz | cut -d ' ' -f1` +if [ $md5 != $src_md5 ] +then + echo "The MD5 values of 1-billion-word.tar.gz are inconsistent. Please download again!" + exit 1 +fi +tar -zxf 1-billion-word.tar.gz + +rm sentence-polarity-dataset-v1* -rf +wget https://bj.bcebos.com/paddlenlp/datasets/movie-review/sentence-polarity-dataset-v1.tar.gz +src_md5="0464239d7b14b18d941f54a948c6cb26" +md5=`md5sum sentence-polarity-dataset-v1.tar.gz | cut -d ' ' -f1` +if [ $md5 != $src_md5 ] +then + echo "The MD5 values of sentence-polarity-dataset-v1.tar.gz are inconsistent. Please download again!" 
+ exit 1 +fi +tar -zxf sentence-polarity-dataset-v1.tar.gz +cd sentence-polarity-dataset-v1 +gunzip GoogleNews-vectors-negative300.bin.gz diff --git a/examples/language_model/elmo/elmo.py b/examples/language_model/elmo/elmo.py new file mode 100644 index 0000000000000000000000000000000000000000..04c79e76423233f53ea652c375301c50a3519c31 --- /dev/null +++ b/examples/language_model/elmo/elmo.py @@ -0,0 +1,335 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import List + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.nn.initializer as I +from dataset import create_batches, load_vocab + + +def reverse_sequence(x, sequence_lengths): + batch_size = x.shape[0] + sequence_lengths = sequence_lengths.numpy().data + y = paddle.zeros(x.shape, x.dtype) + for i in range(batch_size): + lens = sequence_lengths[i] + z = x[i, :lens, :] + z = paddle.reverse(z, axis=[0]) + y[i, :lens, :] = z + return y + + +class ELMo(nn.Layer): + def __init__( + self, + batch_size=None, + char_embed_dim=16, + projection_dim=512, + vocab_size=None, + cnn_filters=[[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]], + char_vocab_size=262, + max_characters_per_token=50, + num_highways=2, + num_layers=2, + dropout=0.1, + task="pre-train", + ): + super(ELMo, self).__init__() + + if task == "pre-train": + if vocab_size is None or batch_size is None: + raise ValueError('vocab_size and batch_size should be set when task="pre-train"') + elif task == "fine-tune": + if batch_size is None: + batch_size = 128 + else: + raise ValueError('task should be "pre-train" or "fine-tune"') + + self._projection_dim = projection_dim + self._task = task + + self._token_embding_layer = ELMoCharacterEncoderLayer( + char_vocab_size, char_embed_dim, projection_dim, num_highways, cnn_filters, max_characters_per_token + ) + self._elmobilm = ELMoBiLM(batch_size, projection_dim, projection_dim, num_layers, dropout, task) + if task == "pre-train": + paramAttr = paddle.ParamAttr(initializer=I.Normal(mean=0.0, std=1.0 / np.sqrt(projection_dim))) + self._linear_layer = nn.Linear(projection_dim, vocab_size, weight_attr=paramAttr) + + @property + def embedding_dim(self): + return self._projection_dim * 2 + + def forward(self, inputs): + # [batch_size, seq_len, max_characters_per_token] + ids, ids_reverse = inputs + # [batch_size, seq_len, projection_dim] + token_embedding = self._token_embding_layer(ids) + token_embedding_reverse = self._token_embding_layer(ids_reverse) + + outs = self._elmobilm(token_embedding, token_embedding_reverse) + + if self._task == "pre-train": + # [batch_size, seq_len, projection_dim] + fw_out, bw_out = outs + + # [batch_size, max_seq_len, vocab_size] + fw_logits = self._linear_layer(fw_out) + bw_logits = self._linear_layer(bw_out) + return [fw_logits, bw_logits] + else: + mask = paddle.any(ids > 0, axis=2) + seq_lens = paddle.sum(paddle.cast(mask, dtype=ids.dtype), axis=1) + outputs = 
[paddle.concat([token_embedding, token_embedding], axis=2)] + for fw_h, bw_h in zip(outs[0], outs[1]): + bw_h = reverse_sequence(bw_h, seq_lens) + outputs.append(paddle.concat([fw_h, bw_h], axis=2)) + # [batch_size, num_lstm_layers + 1, max_seq_len, projection_dim * 2] + outputs = paddle.concat([paddle.unsqueeze(emb, axis=1) for emb in outputs], axis=1) + return outputs + + +class ELMoBiLM(nn.Layer): + def __init__(self, batch_size, input_size, hidden_size, num_layers, dropout, task="pre-train"): + super(ELMoBiLM, self).__init__() + + self._num_layers = num_layers + self._dropout = dropout + self._task = task + + self._lstm_layers = [] + for direction in ["forward", "backward"]: + layers = [] + for i in range(num_layers): + lstm = nn.LSTM( + input_size=input_size, + hidden_size=hidden_size, + num_layers=1, + direction="forward", + weight_hh_attr=paddle.ParamAttr(initializer=I.XavierUniform()), + weight_ih_attr=paddle.ParamAttr(initializer=I.XavierUniform()), + bias_hh_attr=False, + bias_ih_attr=paddle.ParamAttr(initializer=I.Constant(value=0.0)), + ) + self.add_sublayer("{}_lstm_layer_{}".format(direction, i), lstm) + + hidden_state = paddle.zeros(shape=[1, batch_size, hidden_size], dtype="float32") + cell_state = paddle.zeros(shape=[1, batch_size, hidden_size], dtype="float32") + layers.append({"lstm": lstm, "hidden_state": hidden_state, "cell_state": cell_state}) + self._lstm_layers.append(layers) + + if dropout: + self._dropout_layer = nn.Dropout(p=dropout) + + def forward(self, fw_x, bw_x): + final_outs = [] + lstm_outs = [] + for x, layers in zip([fw_x, bw_x], self._lstm_layers): + batch_size = x.shape[0] + outs = [] + for i, dic in enumerate(layers): + lstm = dic["lstm"] + hidden_state = dic["hidden_state"][:, :batch_size, :] + cell_state = dic["cell_state"][:, :batch_size, :] + if self._dropout: + x = self._dropout_layer(x) + x, (hidden_state, cell_state) = lstm(x, (hidden_state, cell_state)) + hidden_state = hidden_state.detach() + cell_state = cell_state.detach() + dic["hidden_state"][:, :batch_size, :] = hidden_state + dic["cell_state"][:, :batch_size, :] = cell_state + outs.append(x) + lstm_outs.append(outs) + + if self._dropout: + x = self._dropout_layer(x) + final_outs.append(x) + if self._task == "pre-train": + return final_outs + else: + return lstm_outs + + +class ELMoCharacterEncoderLayer(nn.Layer): + def __init__( + self, char_vocab_size, char_embed_dim, projection_dim, num_highways, cnn_filters, max_characters_per_token + ): + super(ELMoCharacterEncoderLayer, self).__init__() + + self._use_highway = num_highways > 0 + self._n_filters = sum(f[1] for f in cnn_filters) + self._use_proj = self._n_filters != projection_dim + + paramAttr = paddle.ParamAttr(initializer=I.Uniform(low=-1.0, high=1.0)) + self._char_embedding_layer = nn.Embedding( + num_embeddings=char_vocab_size, embedding_dim=char_embed_dim, weight_attr=paramAttr + ) + + with paddle.no_grad(): + self._char_embedding_layer.weight[0, :] = 0 + + self._convolution_layers = [] + for i, (width, num) in enumerate(cnn_filters): + paramAttr = paddle.ParamAttr(initializer=I.Uniform(low=-0.05, high=0.05)) + conv2d = nn.Conv2D( + in_channels=char_embed_dim, + out_channels=num, + kernel_size=(1, width), + padding="Valid", + data_format="NHWC", + weight_attr=paramAttr, + ) + max_pool = nn.MaxPool2D( + kernel_size=(1, max_characters_per_token - width + 1), + stride=(1, 1), + padding="Valid", + data_format="NHWC", + ) + self.add_sublayer("cnn_layer_{}".format(i), conv2d) + self.add_sublayer("maxpool_layer_{}".format(i), 
max_pool) + self._convolution_layers.append([width, conv2d, max_pool]) + + self._relu = nn.ReLU() + if self._use_highway: + self._highway_layer = Highway(self._n_filters, num_highways) + if self._use_proj: + paramAttr = paddle.ParamAttr(initializer=I.Normal(mean=0.0, std=1.0 / np.sqrt(self._n_filters))) + self._linear_layer = nn.Linear(self._n_filters, projection_dim, weight_attr=paramAttr) + + def forward(self, x): + # [batch_size, seq_len, max_characters_per_token, embed_dim] + char_embedding = self._char_embedding_layer(x) + + cnn_outs = [] + for width, conv2d, max_pool in self._convolution_layers: + # [batch_size, seq_len, max_characters_per_token - kerner_width, out_channel] + conv_out = conv2d(char_embedding) + # [batch_size, seq_len, 1, out_channel] + pool_out = max_pool(conv_out) + # [batch_size, seq_len, 1, out_channel] + out = self._relu(pool_out) + # [batch_size, seq_len, out_channel] + out = paddle.squeeze(out, axis=2) + cnn_outs.append(out) + + # [batch_size, seq_len, n_filters] + token_embedding = paddle.concat(cnn_outs, axis=-1) + + if self._use_highway: + # [batch_size, seq_len, n_filters] + token_embedding = self._highway_layer(token_embedding) + + if self._use_proj: + # [batch_size, seq_len, projection_dim] + token_embedding = self._linear_layer(token_embedding) + + return token_embedding + + +class Highway(nn.Layer): + def __init__(self, input_dim, num_layers): + super(Highway, self).__init__() + + self._num_layers = num_layers + + self._highway_layers = [] + for i in range(num_layers): + paramAttr = paddle.ParamAttr(initializer=I.Normal(mean=0.0, std=1.0 / np.sqrt(input_dim))) + paramAttr_b = paddle.ParamAttr(initializer=I.Constant(value=-2.0)) + carry_linear = nn.Linear(input_dim, input_dim, weight_attr=paramAttr, bias_attr=paramAttr_b) + self.add_sublayer("carry_linear_{}".format(i), carry_linear) + + paramAttr = paddle.ParamAttr(initializer=I.Normal(mean=0.0, std=1.0 / np.sqrt(input_dim))) + transform_linear = nn.Linear(input_dim, input_dim, weight_attr=paramAttr) + self.add_sublayer("transform_linear_{}".format(i), transform_linear) + + self._highway_layers.append([carry_linear, transform_linear]) + + self._relu = nn.ReLU() + self._sigmoid = nn.Sigmoid() + + def forward(self, x): + for i in range(self._num_layers): + carry_linear, transform_linear = self._highway_layers[i] + carry_gate = self._sigmoid(carry_linear(x)) + transform_gate = self._relu(transform_linear(x)) + x = carry_gate * transform_gate + (1.0 - carry_gate) * x + return x + + +class ELMoLoss(nn.Layer): + def __init__(self): + super(ELMoLoss, self).__init__() + + def forward(self, x, y): + # [batch_size, seq_len, vocab_size] + fw_logits, bw_logits = x + # [batch_size, seq_len] + fw_label, bw_label = y + # [batch_size, seq_len, 1] + fw_label = paddle.unsqueeze(fw_label, axis=2) + bw_label = paddle.unsqueeze(bw_label, axis=2) + + # [batch_size, seq_len, 1] + fw_loss = F.cross_entropy(input=fw_logits, label=fw_label) + bw_loss = F.cross_entropy(input=bw_logits, label=bw_label) + + avg_loss = 0.5 * (fw_loss + bw_loss) + return avg_loss + + +def get_elmo_layer(params_file, batch_size, trainable=False): + if trainable: + elmo = ELMo(batch_size=batch_size, task="fine-tune") + else: + elmo = ELMo(batch_size=batch_size, dropout=None, task="fine-tune") + weight_state_dict = paddle.load(params_file + ".pdparams") + elmo.set_state_dict(weight_state_dict) + if trainable: + elmo.train() + else: + for params in elmo.parameters(): + params.trainable = False + elmo.eval() + return elmo + + +class ELMoEmbedder(object): 
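+    # A thin inference wrapper around a frozen, pre-trained ELMo (loaded via
+    # get_elmo_layer with trainable=False). encode() batches tokenized sentences and
+    # returns one numpy array per sentence with shape
+    # [num_lstm_layers + 1, sentence_len, projection_dim * 2], boundary positions stripped.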
+ def __init__(self, params_file, batch_size=128, max_seq_len=256): + self._max_seq_len = max_seq_len + self._batch_size = batch_size + + self._elmo = get_elmo_layer(params_file, batch_size, trainable=False) + self._vocab = load_vocab() + + def encode(self, sentences: List[List[str]]): + """ + Each sentence is a list of tokens without or , e.g. + [['The', 'first', 'sentence', '.'], ['Second', '.']] + """ + batch_data = create_batches(sentences, self._batch_size, self._vocab, self._max_seq_len) + embeddings = [] + for data in batch_data: + ids, ids_reverse, seq_lens = data + # [batch_size, num_lstm_layers + 1, max_seq_len, projection_dim * 2] + outputs = self._elmo([ids, ids_reverse]) + outputs = outputs.numpy() + for i, lens in enumerate(seq_lens): + arr = outputs[i, :, 1 : lens - 1, :] + embeddings.append(arr) + return embeddings diff --git a/examples/language_model/elmo/run_eval.py b/examples/language_model/elmo/run_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..ea330d62eaa61369ecef51c2b92e9505cf3b0fe7 --- /dev/null +++ b/examples/language_model/elmo/run_eval.py @@ -0,0 +1,87 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math +import time + +import paddle +from args import parse_args, print_args +from dataset import OneBillionWordDataset, load_vocab +from elmo import ELMo, ELMoLoss +from paddle.io import DataLoader + + +@paddle.no_grad() +def eval(args): + paddle.set_device(args.device) + + if not args.init_from_ckpt: + raise ValueError("init_from_ckpt should be set when eval.") + vocab = load_vocab(args.vocab_file, args.max_characters_per_token) + + elmo = ELMo( + args.batch_size, + args.char_embed_dim, + args.projection_dim, + vocab.size, + dropout=args.dropout, + num_layers=args.num_layers, + num_highways=args.num_highways, + char_vocab_size=vocab.char_size, + ) + elmo.eval() + + elmo_loss = ELMoLoss() + + # Loads pre-trained parameters. 
+ weight_state_dict = paddle.load(args.init_from_ckpt + ".pdparams") + elmo.set_state_dict(weight_state_dict) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + dev_dataset = OneBillionWordDataset( + args.dev_data_path, vocab, args.batch_size, args.unroll_steps, mode="test", shuffle=False, seed=args.seed + ) + + dev_dataloader = DataLoader(dev_dataset, return_list=True, batch_size=None) + + total_step = total_loss = 0 + total_time = 0.0 + batch_start_time = time.time() + for step, inputs in enumerate(dev_dataloader, start=1): + ids, next_ids, ids_reverse, next_ids_reverse = inputs + outputs = elmo([ids, ids_reverse]) + loss = elmo_loss(outputs, [next_ids, next_ids_reverse]) + ppl = paddle.exp(loss) + + total_loss += float(loss) + total_step += 1 + + total_time += time.time() - batch_start_time + if step % args.log_freq == 0: + print( + "Eval step %d - loss: %.4f - Perplexity: %.4f - %.3fs/step" + % (step, float(loss) * args.unroll_steps, float(ppl), total_time / args.log_freq) + ) + total_time = 0.0 + batch_start_time = time.time() + + avg_loss = total_loss / total_step + avg_ppl = math.exp(avg_loss) + print("Eval - average loss: %.4f - average Perplexity: %.4f" % (avg_loss * args.unroll_steps, avg_ppl)) + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + eval(args) diff --git a/examples/language_model/elmo/run_finetune.py b/examples/language_model/elmo/run_finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..a129c6e87d0c57315a8d19015c5278d1eb902f28 --- /dev/null +++ b/examples/language_model/elmo/run_finetune.py @@ -0,0 +1,275 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import re + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.nn as nn +from dataset import load_vocab +from elmo import get_elmo_layer +from paddle.io import DataLoader, Dataset +from sklearn.model_selection import train_test_split + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--data_dir", type=str, default="./sentence-polarity-dataset-v1/", help="Specify the data dir.") + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + parser.add_argument("--logging_step", type=int, default=10, help="The frequency, in number of steps, the training logs are printed. 
(default: %(default)d)") + parser.add_argument("--epochs", type=int, default=20, help="Total number of training epochs to perform.") + parser.add_argument("--batch_size", type=int, default=64, help="Batch size per GPU/CPU for training.") + parser.add_argument("--dropout", type=float, default=0.5, help="The dropout rate.") + parser.add_argument("--lr", type=float, default=0.001, help="The initial learning rate.") + parser.add_argument("--weight_decay", type=float, default=0.0001, help="The weight decay for optimizer.") + parser.add_argument("--seed", type=int, default=2020, help="Random seed.") + parser.add_argument("--max_seq_len", type=int, default=256, help='max grad norm') + parser.add_argument("--sent_embedding_dim", type=int, default=64, help="The size of sentence embedding.") + parser.add_argument("--num_classes", type=int, default=2, help="The num of classification classes.") + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Device for selecting for the training.") + + args = parser.parse_args() + return args +# yapf: enable + + +def clean_str(string): + """ + Tokenization/string cleaning for all datasets except for SST. + Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py + """ + string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) + string = re.sub(r"\'s", " 's", string) + string = re.sub(r"\'ve", " 've", string) + string = re.sub(r"n\'t", " n't", string) + string = re.sub(r"\'re", " 're", string) + string = re.sub(r"\'d", " 'd", string) + string = re.sub(r"\'ll", " 'll", string) + string = re.sub(r",", " , ", string) + string = re.sub(r"!", " ! ", string) + string = re.sub(r"\(", " \( ", string) + string = re.sub(r"\)", " \) ", string) + string = re.sub(r"\?", " \? ", string) + string = re.sub(r"\s{2,}", " ", string) + return string.strip().lower() + + +def load_data_and_labels(positive_data_file, negative_data_file): + """ + Loads MR polarity data from files, splits the data into words and generates labels. + Returns split sentences and labels. + """ + # Load data from files + positive_examples = list(open(positive_data_file, "r", encoding="latin-1").readlines()) + positive_examples = [s.strip() for s in positive_examples] + negative_examples = list(open(negative_data_file, "r", encoding="latin-1").readlines()) + negative_examples = [s.strip() for s in negative_examples] + # Split by words + x_text = positive_examples + negative_examples + x_text = [clean_str(sent) for sent in x_text] + x_text = list(map(lambda x: x.split(), x_text)) + # Generate labels + positive_labels = [1 for _ in positive_examples] + negative_labels = [0 for _ in negative_examples] + y = np.array(positive_labels + negative_labels) + return [x_text, y] + + +class ELMoBowTextClassification(nn.Layer): + def __init__(self, params_file, batch_size, sent_embedding_dim, dropout, num_labels): + super(ELMoBowTextClassification, self).__init__() + + self._elmo = get_elmo_layer(params_file, batch_size, trainable=True) + word_embedding_dim = self._elmo.embedding_dim + self._fc1 = nn.Linear(word_embedding_dim, sent_embedding_dim) + self._fc2 = nn.Linear(sent_embedding_dim, num_labels) + self._dropout = nn.Dropout(p=dropout) + + def forward(self, inputs): + """ + Parameters: + inputs (Tuple): It is a Tuple contains 2 tensor with shape + `[batch_size, max_seq_len, max_characters_per_token]`. 
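+
+        Returns:
+            Tensor: Classification logits with shape `[batch_size, num_labels]`.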
+ """ + mask = paddle.any(inputs[0] > 0, axis=2) + # [batch_size, 3, max_seq_len, word_embedding_dim] + elmo_out = self._elmo(inputs) + # [batch_size, max_seq_len, word_embedding_dim] + word_emb = self.mix_elmo_outputs(elmo_out) + + # [batch_size, word_embedding_dim] + sent_emb = self.average_word_embedding(word_emb, mask) + + # [batch_size, sent_embedding_dim] + dense = self._fc1(sent_emb) + dense = self._dropout(dense) + + # [batch_size, num_labels] + out = self._fc2(dense) + return out + + def mix_elmo_outputs(self, elmo_out): + """ + Computes a mixture of elmo_out. At present, we simply take the last one. + Parameters: + elmo_out (Tensor): It is a Tensor with shape + `[batch_size, 3, max_seq_len, word_embedding_dim]`. + """ + # [batch_size, max_seq_len, word_embedding_dim] + word_emb = elmo_out[:, 2, :, :] + return word_emb + + def average_word_embedding(self, word_emb, mask): + """ + Parameters: + word_emb: It is a Tensor with shape `[batch_size, max_seq_len, word_embedding_dim]`. + mask: It is a Tensor with shape `[batch_size, max_seq_len]`. + """ + mask = paddle.unsqueeze(mask, axis=-1) + # [batch_size, 1] + seq_lens = paddle.sum(paddle.cast(mask, dtype=word_emb.dtype), axis=1) + + # [batch_size, max_seq_len, word_embedding_dim] + word_emb = word_emb * mask + # [batch_size, word_embedding_dim] + sent_emb = paddle.sum(word_emb, axis=1) + # [batch_size, word_embedding_dim] + sent_emb = sent_emb / seq_lens + return sent_emb + + +class SentencePolarityDatasetV1(Dataset): + def __init__(self, x, y, vocab, max_seq_len): + super(SentencePolarityDatasetV1, self).__init__() + + self._text = list(zip(x, y)) + self._vocab = vocab + self._max_seq_len = max_seq_len + self._data = self.convert_to_ids() + + def convert_to_ids(self): + data = [] + for sentence, label in self._text: + ids, ids_reverse = self._vocab.encode_chars(sentence[: self._max_seq_len - 2], split=False) + data.append([ids, ids_reverse, label]) + return data + + def __getitem__(self, idx): + ids = np.copy(self._data[idx][0]) + ids_reverse = np.copy(self._data[idx][1]) + label = self._data[idx][2] + return (ids, ids_reverse, label) + + def __len__(self): + return len(self._data) + + +def generate_batch(batch): + batch_ids, batch_ids_reverse, batch_label = zip(*batch) + max_len = max([ids.shape[0] for ids in batch_ids]) + new_batch_ids = np.zeros([len(batch_ids), max_len, batch_ids[0].shape[1]], dtype=np.int64) + new_batch_ids_reverse = np.zeros([len(batch_ids), max_len, batch_ids[0].shape[1]], dtype=np.int64) + new_batch_label = [] + for i, (ids, ids_reverse, label) in enumerate(zip(batch_ids, batch_ids_reverse, batch_label)): + seq_len = ids.shape[0] + new_batch_ids[i, :seq_len, :] = ids + new_batch_ids_reverse[i, :seq_len, :] = ids_reverse + new_batch_label.append(label) + return new_batch_ids, new_batch_ids_reverse, new_batch_label + + +def finetune(args): + paddle.set_device(args.device) + if dist.get_world_size() > 1: + dist.init_parallel_env() + + pos_file = os.path.join(args.data_dir, "rt-polarity.pos") + neg_file = os.path.join(args.data_dir, "rt-polarity.neg") + x_text, y = load_data_and_labels(pos_file, neg_file) + x_train, x_test, y_train, y_test = train_test_split(x_text, y, test_size=0.1, random_state=args.seed) + + if not args.init_from_ckpt: + raise ValueError("`init_from_ckpt` should be set.") + model = ELMoBowTextClassification( + args.init_from_ckpt, args.batch_size, args.sent_embedding_dim, args.dropout, args.num_classes + ) + if dist.get_world_size() > 1: + model = paddle.DataParallel(model) + model.train() + 
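+    # Adam with the configured learning rate / weight decay; nn.CrossEntropyLoss below
+    # consumes the raw [batch_size, num_classes] logits, so the model applies no softmax.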
+ adam = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr, weight_decay=args.weight_decay) + criterion = nn.CrossEntropyLoss() + + vocab = load_vocab() + + train_dataset = SentencePolarityDatasetV1(x_train, y_train, vocab, args.max_seq_len) + test_dataset = SentencePolarityDatasetV1(x_test, y_test, vocab, args.max_seq_len) + train_loader = DataLoader( + train_dataset, + batch_size=args.batch_size, + return_list=True, + shuffle=True, + collate_fn=lambda batch: generate_batch(batch), + ) + test_loader = DataLoader( + test_dataset, + batch_size=args.batch_size, + return_list=True, + shuffle=False, + collate_fn=lambda batch: generate_batch(batch), + ) + + for epoch in range(args.epochs): + print("Epoch {}/{}".format(epoch + 1, args.epochs)) + for step, batch_data in enumerate(train_loader, start=1): + ids, ids_reverse, label = batch_data + + output = model((ids, ids_reverse)) + loss = criterion(output, label) + loss.backward() + adam.step() + adam.clear_grad() + + if step % args.logging_step == 0: + print("step {}, loss {}".format(step, float(loss))) + + acc = test(model, test_loader) + print("\ntest acc {}\n".format(acc)) + + +@paddle.no_grad() +def test(model, test_loader): + correct = num = 0 + model.eval() + for batch_data in test_loader: + ids, ids_reverse, label = batch_data + + # [batch_size, 2] + output = model((ids, ids_reverse)) + + num += label.shape[0] + predict = paddle.argmax(output, axis=1) + label = paddle.cast(label, dtype=predict.dtype) + correct += int(paddle.sum(paddle.cast(predict == label, dtype="int64"))) + model.train() + return correct * 1.0 / num + + +if __name__ == "__main__": + args = parse_args() + finetune(args) diff --git a/examples/language_model/elmo/run_pretrain.py b/examples/language_model/elmo/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..7b22f97f8df385aa62fa8fe603334170f634a038 --- /dev/null +++ b/examples/language_model/elmo/run_pretrain.py @@ -0,0 +1,123 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
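+
+# Pretrains the bidirectional ELMo language model on the One Billion Word dataset
+# with Adagrad and global-norm gradient clipping; rank 0 saves a checkpoint every
+# `save_freq` steps and a final one once `epochs * steps_per_epoch` steps are done.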
+ +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn as nn +from args import parse_args, print_args +from dataset import OneBillionWordDataset, load_vocab +from elmo import ELMo, ELMoLoss +from paddle.io import DataLoader + + +def save_params(elmo, optimizer, save_dir, name): + elmo_ckpt = os.path.join(save_dir, "{}.pdparams".format(name)) + opt_ckpt = os.path.join(save_dir, "{}.pdopt".format(name)) + paddle.save(elmo.state_dict(), elmo_ckpt) + paddle.save(optimizer.state_dict(), opt_ckpt) + + +def train(args): + paddle.set_device(args.device) + n_procs = dist.get_world_size() + rank = dist.get_rank() + + if n_procs > 1: + dist.init_parallel_env() + + vocab = load_vocab(args.vocab_file, args.max_characters_per_token) + + elmo = ELMo( + args.batch_size, + args.char_embed_dim, + args.projection_dim, + vocab.size, + dropout=args.dropout, + num_layers=args.num_layers, + num_highways=args.num_highways, + char_vocab_size=vocab.char_size, + ) + if n_procs > 1: + elmo = paddle.DataParallel(elmo) + elmo.train() + + gloabl_norm_clip = nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.Adagrad( + learning_rate=args.lr, parameters=elmo.parameters(), initial_accumulator_value=1.0, grad_clip=gloabl_norm_clip + ) + elmo_loss = ELMoLoss() + + # Loads pre-trained parameters. + if args.init_from_ckpt: + weight_state_dict = paddle.load(args.init_from_ckpt + ".pdparams") + opt_state_dict = paddle.load(args.init_from_ckpt + ".pdopt") + elmo.set_state_dict(weight_state_dict) + optimizer.set_state_dict(opt_state_dict) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + train_dataset = OneBillionWordDataset( + args.train_data_path, + vocab, + args.batch_size, + args.unroll_steps, + n_procs=n_procs, + rank=rank, + mode="train", + shuffle=True, + seed=args.seed, + ) + + train_dataloader = DataLoader(train_dataset, return_list=True, batch_size=None) + + n_tokens_per_batch = args.batch_size * args.unroll_steps * n_procs + n_steps_per_epoch = int(train_dataset.number_of_tokens / n_tokens_per_batch) + n_steps_total = args.epochs * n_steps_per_epoch + print("Training for %s epochs and %s steps" % (args.epochs, n_steps_total)) + + total_time = 0.0 + batch_start_time = time.time() + for step, inputs in enumerate(train_dataloader, start=1): + ids, next_ids, ids_reverse, next_ids_reverse = inputs + outputs = elmo([ids, ids_reverse]) + loss = elmo_loss(outputs, [next_ids, next_ids_reverse]) + ppl = paddle.exp(loss) + loss *= args.unroll_steps + loss.backward() + optimizer.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.log_freq == 0: + print( + "step %d/%d - loss: %.4f - Perplexity: %.4f - %.3fs/step" + % (step, n_steps_total, float(loss), float(ppl), total_time / args.log_freq) + ) + total_time = 0.0 + if rank == 0 and step % args.save_freq == 0: + save_params(elmo, optimizer, args.save_dir, step) + if step == n_steps_total: + # training done + if rank == 0: + save_params(elmo, optimizer, args.save_dir, "final") + break + batch_start_time = time.time() + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + train(args) diff --git a/examples/language_model/elmo/word2vec_base.py b/examples/language_model/elmo/word2vec_base.py new file mode 100644 index 0000000000000000000000000000000000000000..1401268b8f0efc049afe79fe340dad6f2a9380eb --- /dev/null +++ b/examples/language_model/elmo/word2vec_base.py @@ -0,0 +1,255 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import re + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.nn as nn +from gensim.models.keyedvectors import KeyedVectors +from paddle.io import DataLoader, Dataset +from sklearn.model_selection import train_test_split + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--data_dir", type=str, default="./sentence-polarity-dataset-v1/", help="Specify the data dir.") + parser.add_argument("--pretrained_word2vec_file", type=str, default="./sentence-polarity-dataset-v1/GoogleNews-vectors-negative300.bin", help="Specify the pretrained word2vec model path.") + parser.add_argument("--logging_step", type=int, default=10, help="The frequency, in number of steps, the training logs are printed. (default: %(default)d)") + parser.add_argument("--epochs", type=int, default=20, help="Total number of training epochs to perform.") + parser.add_argument("--batch_size", type=int, default=64, help="Batch size per GPU/CPU for training.") + parser.add_argument("--dropout", type=float, default=0.5, help="The dropout rate.") + parser.add_argument("--lr", type=float, default=0.001, help="The initial learning rate.") + parser.add_argument("--weight_decay", type=float, default=0.0001, help="The weight decay for optimizer.") + parser.add_argument("--seed", type=int, default=2020, help="Random seed.") + parser.add_argument("--max_seq_len", type=int, default=256, help='max grad norm') + parser.add_argument("--sent_embedding_dim", type=int, default=64, help="The size of sentence embedding.") + parser.add_argument("--num_classes", type=int, default=2, help="The num of classification classes.") + parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") + + args = parser.parse_args() + return args +# yapf: enable + + +def clean_str(string): + """ + Tokenization/string cleaning for all datasets except for SST. + Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py + """ + string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) + string = re.sub(r"\'s", " 's", string) + string = re.sub(r"\'ve", " 've", string) + string = re.sub(r"n\'t", " n't", string) + string = re.sub(r"\'re", " 're", string) + string = re.sub(r"\'d", " 'd", string) + string = re.sub(r"\'ll", " 'll", string) + string = re.sub(r",", " , ", string) + string = re.sub(r"!", " ! ", string) + string = re.sub(r"\(", " \( ", string) + string = re.sub(r"\)", " \) ", string) + string = re.sub(r"\?", " \? ", string) + string = re.sub(r"\s{2,}", " ", string) + return string.strip().lower() + + +def load_data_and_labels(positive_data_file, negative_data_file): + """ + Loads MR polarity data from files, splits the data into words and generates labels. + Returns split sentences and labels. 
+ """ + # Load data from files + positive_examples = list(open(positive_data_file, "r", encoding="latin-1").readlines()) + positive_examples = [s.strip() for s in positive_examples] + negative_examples = list(open(negative_data_file, "r", encoding="latin-1").readlines()) + negative_examples = [s.strip() for s in negative_examples] + # Split by words + x_text = positive_examples + negative_examples + x_text = [clean_str(sent) for sent in x_text] + x_text = list(map(lambda x: x.split(), x_text)) + # Generate labels + positive_labels = [1 for _ in positive_examples] + negative_labels = [0 for _ in negative_examples] + y = np.array(positive_labels + negative_labels) + return [x_text, y] + + +class Word2VecBoWTextClassification(nn.Layer): + def __init__(self, word_embedding_dim, sent_embedding_dim, dropout, num_classes): + super(Word2VecBoWTextClassification, self).__init__() + + self._fc1 = nn.Linear(word_embedding_dim, sent_embedding_dim) + self._fc2 = nn.Linear(sent_embedding_dim, num_classes) + self._dropout = nn.Dropout(p=dropout) + + def forward(self, inputs): + word_emb, seq_lens = inputs + + # [batch_size, word_embedding_dim] + sent_emb = self.average_word_embedding(word_emb, seq_lens) + + # [batch_size, sent_embedding_dim] + dense = self._fc1(sent_emb) + dense = self._dropout(dense) + + # [batch_size, num_classes] + out = self._fc2(dense) + return out + + def average_word_embedding(self, word_emb, seq_lens): + """ + Parameters: + word_emb: It is a Tensor with shape `[batch_size, max_seq_len, word_embedding_dim]`. + seq_lens: It is a Tensor with shape `[batch_size]`. + """ + seq_lens = paddle.unsqueeze(seq_lens, axis=-1) + seq_lens = paddle.cast(seq_lens, dtype=word_emb.dtype) + + # [batch_size, word_embedding_dim] + sent_emb = paddle.sum(word_emb, axis=1) + # [batch_size, word_embedding_dim] + sent_emb = sent_emb / seq_lens + return sent_emb + + +class SentencePolarityDatasetV1(Dataset): + def __init__(self, x, y, gensim_model, max_seq_len): + super(SentencePolarityDatasetV1, self).__init__() + + self._text = list(zip(x, y)) + self._gensim_model = gensim_model + self._vector_size = gensim_model.vector_size + self._max_seq_len = max_seq_len + self._data = self.convert_to_ids() + + def convert_to_ids(self): + data = [] + for sentence, label in self._text: + sentence = sentence[: self._max_seq_len] + ids = np.zeros([len(sentence), self._vector_size], dtype=np.float32) + for i, word in enumerate(sentence): + if word in self._gensim_model: + ids[i] = self._gensim_model[word] + else: + ids[i] = np.random.uniform(-0.25, 0.25, self._vector_size) + data.append([ids, label]) + return data + + def __getitem__(self, idx): + ids = np.copy(self._data[idx][0]) + label = self._data[idx][1] + return (ids, label) + + def __len__(self): + return len(self._data) + + +def generate_batch(batch): + batch_ids, batch_label = zip(*batch) + max_len = max([ids.shape[0] for ids in batch_ids]) + new_batch_ids = np.zeros([len(batch_ids), max_len, batch_ids[0].shape[1]], dtype=np.float32) + new_batch_label = [] + new_batch_seq_len = [] + for i, (ids, label) in enumerate(zip(batch_ids, batch_label)): + seq_len = ids.shape[0] + new_batch_ids[i, :seq_len, :] = ids + new_batch_label.append(label) + new_batch_seq_len.append(seq_len) + return new_batch_ids, new_batch_label, new_batch_seq_len + + +def train(args): + paddle.set_device(args.device) + if dist.get_world_size() > 1: + dist.init_parallel_env() + + pos_file = os.path.join(args.data_dir, "rt-polarity.pos") + neg_file = os.path.join(args.data_dir, "rt-polarity.neg") 
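+    # rt-polarity.pos / rt-polarity.neg hold one latin-1 encoded sentence per line
+    # (MR polarity data); positive examples are labeled 1, negative examples 0.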
+ x_text, y = load_data_and_labels(pos_file, neg_file) + x_train, x_test, y_train, y_test = train_test_split(x_text, y, test_size=0.1, random_state=args.seed) + + # gensim_model = KeyedVectors.load_word2vec_format(args.pretrained_word2vec_file, binary=True, limit=300000) + gensim_model = KeyedVectors.load_word2vec_format(args.pretrained_word2vec_file, binary=True) + print("\nLoaded word2vec from %s\n" % args.pretrained_word2vec_file) + + train_dataset = SentencePolarityDatasetV1(x_train, y_train, gensim_model, args.max_seq_len) + test_dataset = SentencePolarityDatasetV1(x_test, y_test, gensim_model, args.max_seq_len) + train_loader = DataLoader( + train_dataset, + batch_size=args.batch_size, + return_list=True, + shuffle=True, + collate_fn=lambda batch: generate_batch(batch), + ) + test_loader = DataLoader( + test_dataset, + batch_size=args.batch_size, + return_list=True, + shuffle=False, + collate_fn=lambda batch: generate_batch(batch), + ) + + model = Word2VecBoWTextClassification( + gensim_model.vector_size, args.sent_embedding_dim, args.dropout, args.num_classes + ) + if dist.get_world_size() > 1: + model = paddle.DataParallel(model) + model.train() + + adam = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr, weight_decay=args.weight_decay) + criterion = nn.CrossEntropyLoss() + + for epoch in range(args.epochs): + print("Epoch %d/%d" % (epoch + 1, args.epochs)) + for step, batch_data in enumerate(train_loader, start=1): + ids, label, seq_lens = batch_data + + output = model((ids, seq_lens)) + loss = criterion(output, label) + loss.backward() + adam.step() + adam.clear_grad() + + if step % args.logging_step == 0: + print("step %d, loss %.4f" % (step, float(loss))) + + acc = test(model, test_loader) + print("\ntest acc %.4f\n" % acc) + + +@paddle.no_grad() +def test(model, test_loader): + correct = num = 0 + model.eval() + for batch_data in test_loader: + ids, label, seq_lens = batch_data + + # [batch_size, 2] + output = model((ids, seq_lens)) + + num += label.shape[0] + predict = paddle.argmax(output, axis=1) + label = paddle.cast(label, dtype=predict.dtype) + correct += int(paddle.sum(paddle.cast(predict == label, dtype="int64"))) + model.train() + return correct * 1.0 / num + + +if __name__ == "__main__": + args = parse_args() + train(args) diff --git a/examples/language_model/end_to_end_memory_networks/README.md b/examples/language_model/end_to_end_memory_networks/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d3a2dc040356b9c96c70cef629582db137127dbb --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/README.md @@ -0,0 +1,212 @@ +# End-To-End-Memory-Networks-in-Paddle +## 一、简介 + +用Paddle来复现论文End-To-End Memory Networks + +![模型简介](http://paddle.yulan.net.cn/model_introduction.png) + +本模型是Facebook AI在Memory networks之后提出的一个更加完善的记忆网络模型,在问答系统以及语言模型中均有良好的应用。论文中使用了多个单层单元堆叠而成的多层架构。 + +单层架构如上图a所示,主要的参数包括A,B,C,W四个矩阵,其中A,B,C三个矩阵就是embedding矩阵,主要是将输入文本和Question编码成词向量,W是最终的输出矩阵。从上图可以看出,对于输入的句子s分别会使用A和C进行编码得到Input和Output的记忆模块,Input用来跟Question编码得到的向量相乘得到每句话跟q的相关性,Output则与该相关性进行加权求和得到输出向量。然后再加上q并传入最终的输出层。 + +多层网络如上图b所示,实际上是将多个单层堆叠到一起形成的网络,这里将每一层称为一个hop。 +为了减少参数,模型提出了两种让各个hop之间共享Embedding参数(A与C)的方法: +* Adjacent:这种方法让相邻层之间的$A=C$。也就是说$A_{k+1}=C_{k}$,此外W等于顶层的C,B等于底层的A,这样就减少了一半的参数量。 +* Layer-wise(RNN-like):与RNN相似,采用完全共享参数的方法,即各层之间参数均相等。$A_{1}=A_{2}=...=A_{k}$,$C_{1}=C_{2}=...=C_{k}$。但这样模型的参数太少,性能会受到影响,故提出一种改进方法,在每一层之间加一个线性映射矩阵H,即令$u^{k+1}=H u^{k}+o^{k}$。 + +具体到语言模型,模型做出了一下调整: +1. 
由于输入是单个句子,编码级别是单词级的,所以可以直接将每个单词的词向量存入memory即可,也就是说A与C现在都是单词的Embedding矩阵,mi与ci中都是单个单词的词向量。 +2. 输出W矩阵的output为下一个单词的概率,即输出维度为vocab size。 +3. 不同于QA任务,这里不存在Question,所以直接将q向量设置为全0.1的常量,也不需要再进行Embedding操作。 +4. 采用Layer-wise的参数缩减策略。 +5. 文中提出,对于每一层的u向量中一半的神经元进行ReLU操作,以帮助模型训练。 + +## 二、数据集 + +* Penn Treetank: + + * [Penn Treebank](http://paddle.yulan.net.cn/ptb.zip) + + NLP中常用的PTB语料库,语料来源为1989年华尔街日报,并做以下切分 + + train:887k words + + valid:70k words + + test:78k words + + vocabulary size:10k + + * [text8](http://paddle.yulan.net.cn/text8.zip) + + 来源于enwiki8,总共100M个字符,划分为93.3M/5.7M/1M字符(train/valid/test),将出现次数少于10次的单词替换为 + +## 三、环境依赖 + +* 硬件:GPU +* 框架:Paddle >= 2.0.0,progress库 + +## 四、快速开始 + +下载数据集和已训练好的模型 +```bash +mkdir data +mkdir models +cd data +wget http://paddle.yulan.net.cn/ptb.zip +wget http://paddle.yulan.net.cn/text8.zip +unzip -d ptb ptb.zip +unzip -d text8 text8.zip +cd .. +cd models +wget http://paddle.yulan.net.cn/model_ptb +wget http://paddle.yulan.net.cn/model_text8 +cd .. +``` + +### 训练 + +训练参数可在`config.yaml`文件中调整。 + +Note: 由于本模型受随机因素影响较大,故每次训练的结果差异较大,即使固定随机种子,由于GPU的原因训练结果仍然无法完全一致。 + +#### 在ptb数据集上训练 + +```bash +cp config/config_ptb.yaml config.yaml +python train.py +``` + +#### 寻找最佳模型 + +由于模型受随机因素影响较大,故要进行多次训练来找到最优模型,原论文中在ptb数据集上进行了10次训练,并保留了在test集上表现最好的模型。本复现提供了一个脚本,来进行多次训练以获得能达到足够精度的模型。 + +```bash +cp config/config_ptb.yaml config.yaml +python train_until.py --target 111.0 +``` + +以下是在ptb数据集上进行多次训练以达到目标精度的[log](http://paddle.yulan.net.cn/ptb_train_until.log),可以计算出20轮的平均ppl为113,方差为5.68 + +#### 在text8数据集上训练 + +```bash +cp config/config_text8.yaml config.yaml +python train.py +``` + +### 测试 + +保持`config.yaml`文件与训练时相同 + +``` +python eval.py +``` + +### 使用预训练模型 + +#### ptb数据集上 + +```bash +cp config/config_ptb_test.yaml config.yaml +python eval.py +``` + +将得到以下结果 + +![](http://paddle.yulan.net.cn/test_ptb.png) + +#### text8数据集上 + +```bash +cp config/config_text8_test.yaml config.yaml +python eval.py +``` + +结果如下 + +![](http://paddle.yulan.net.cn/test_text8.png) + +## 五、复现精度 + +相应模型已包含在本repo中,分别位于目录`models_ptb`与`models_text8`下 + +| Dataset | Paper Perplexity | Our Perplexity | +| :-----: | :--------------: | :------------: | +| ptb | 111 | 110.75 | +| text8 | 147 | 145.62 | + +## 六、代码结构详细说明 + +### 6.1 代码结构 + +``` +├── checkpoints +├── config # 配置文件模板 +├── config.yaml +├── README.md +├── requirements.txt +├── config.py +├── model.py +├── data.py +├── train.py # 训练脚本 +├── eval.py # 测试脚本 +├── train_until.py +└── utils.py +``` + +### 6.2 参数说明 + +可以在`config.yaml`中设置以下参数 + +``` +# internal state dimension +edim: 150 +# linear part of the state +lindim: 75 +# number of hops +nhop: 7 +# memory size +mem_size: 200 +# initial internal state value +init_hid: 0.1 +# initial learning rate +init_lr: 0.01 +# weight initialization std +init_std: 0.05 +# clip gradients to this norm +max_grad_norm: 50 + +# batch size to use during training +batch_size: 128 +# number of epoch to use during training +nepoch: 100 + +# data directory +data_dir: "data/ptb" +# checkpoint directory +checkpoint_dir: "checkpoints" +# model name for test and recover train +model_name: "model" +# if True, load model [model_name] before train +recover_train: False +# data set name +data_name: "ptb" +# print progress, need progress module +show: True +# initial random seed +srand: 17814 +# How many epochs output log once +log_epoch: 5 +# Desired ppl +target_ppl: 147 +``` + +### 七、reference +原论文地址:[Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus: “End-To-End Memory Networks”, 2015.](https://arxiv.org/pdf/1503.08895v5.pdf) + 
+复现repo:[yulangz/End-to-End-Memory-Networks-in-Paddle](https://github.com/yulangz/End-to-End-Memory-Networks-in-Paddle) + +参考repo:[https://github.com/facebookarchive/MemNN](https://github.com/facebookarchive/MemNN) + +项目AiStudio地址:[https://aistudio.baidu.com/aistudio/projectdetail/2381004](https://aistudio.baidu.com/aistudio/projectdetail/2381004) diff --git a/examples/language_model/end_to_end_memory_networks/config.py b/examples/language_model/end_to_end_memory_networks/config.py new file mode 100644 index 0000000000000000000000000000000000000000..4e1fcc0eafd2f36619190251dcec075d918638e7 --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config.py @@ -0,0 +1,32 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import yaml + + +class Config(object): + """ + A simple waper for configs + """ + + def __init__(self, config_path: str): + with open(config_path, "r") as f: + self.d = yaml.load(f.read(), Loader=yaml.SafeLoader) + + def __getattribute__(self, key): + d = super(Config, self).__getattribute__("d") + if key in d: + return d[key] + else: + return super(Config, self).__getattribute__(key) diff --git a/examples/language_model/end_to_end_memory_networks/config.yaml b/examples/language_model/end_to_end_memory_networks/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..99ccce66ab526afcae1a507290c17f06c5c80b83 --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config.yaml @@ -0,0 +1,40 @@ +# internal state dimension +edim: 150 +# linear part of the state +lindim: 75 +# number of hops +nhop: 7 +# memory size +mem_size: 200 +# initial internal state value +init_hid: 0.1 +# initial learning rate +init_lr: 0.01 +# weight initialization std +init_std: 0.05 +# clip gradients to this norm +max_grad_norm: 50 + +# batch size to use during training +batch_size: 128 +# number of epoch to use during training +nepoch: 100 + +# data directory +data_dir: "data/ptb" +# checkpoint directory +checkpoint_dir: "checkpoints" +# model name for test and recover train +model_name: "model" +# if True, load model [model_name] before train +recover_train: False +# data set name +data_name: "ptb" +# print progress, need progress module +show: True +# initial random seed +srand: 17814 +# How many epochs output log once +log_epoch: 5 +# Desired ppl +target_ppl: 147 \ No newline at end of file diff --git a/examples/language_model/end_to_end_memory_networks/config/config_ptb.yaml b/examples/language_model/end_to_end_memory_networks/config/config_ptb.yaml new file mode 100644 index 0000000000000000000000000000000000000000..620877cbced56072fc57b0401d441619b578e1ee --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config/config_ptb.yaml @@ -0,0 +1,19 @@ +edim: 150 +lindim: 75 +nhop: 7 +mem_size: 200 +batch_size: 128 +nepoch: 100 +init_lr: 0.01 +init_hid: 0.1 +init_std: 0.05 +max_grad_norm: 50 +data_dir: "data/ptb" +checkpoint_dir: "checkpoints" +model_name: "model" 
+recover_train: False +data_name: "ptb" +show: True +srand: 17814 +log_epoch: 5 +target_ppl: 147 diff --git a/examples/language_model/end_to_end_memory_networks/config/config_ptb_test.yaml b/examples/language_model/end_to_end_memory_networks/config/config_ptb_test.yaml new file mode 100644 index 0000000000000000000000000000000000000000..3be5d7c58a1f47e72e65f9d4421efff66e42bb5f --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config/config_ptb_test.yaml @@ -0,0 +1,18 @@ +edim: 150 +lindim: 75 +nhop: 7 +mem_size: 200 +batch_size: 128 +nepoch: 100 +init_lr: 0.01 +init_hid: 0.1 +init_std: 0.05 +max_grad_norm: 50 +data_dir: "data/ptb" +checkpoint_dir: "models" +model_name: "model_ptb" +recover_train: False +data_name: "ptb" +show: True +log_epoch: 5 +target_ppl: 147 diff --git a/examples/language_model/end_to_end_memory_networks/config/config_text8.yaml b/examples/language_model/end_to_end_memory_networks/config/config_text8.yaml new file mode 100644 index 0000000000000000000000000000000000000000..ce9d2b5fb3fbd00dc24d9356f7af5408fade75c7 --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config/config_text8.yaml @@ -0,0 +1,19 @@ +edim: 500 +lindim: 250 +nhop: 7 +mem_size: 100 +batch_size: 128 +nepoch: 100 +init_lr: 0.01 +init_hid: 0.1 +init_std: 0.05 +max_grad_norm: 50 +data_dir: "data/text8" +checkpoint_dir: "checkpoints" +model_name: "model" +recover_train: False +data_name: "text8" +show: True +srand: 12345 +log_epoch: 5 +target_ppl: 111 diff --git a/examples/language_model/end_to_end_memory_networks/config/config_text8_test.yaml b/examples/language_model/end_to_end_memory_networks/config/config_text8_test.yaml new file mode 100644 index 0000000000000000000000000000000000000000..04751bfa6803650b6d9b38e6f2d677f14df3f63e --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/config/config_text8_test.yaml @@ -0,0 +1,18 @@ +edim: 500 +lindim: 250 +nhop: 7 +mem_size: 100 +batch_size: 128 +nepoch: 100 +init_lr: 0.01 +init_hid: 0.1 +init_std: 0.05 +max_grad_norm: 50 +data_dir: "data/text8" +checkpoint_dir: "models" +model_name: "model_text8" +recover_train: False +data_name: "text8" +show: True +log_epoch: 5 +target_ppl: 147 diff --git a/examples/language_model/end_to_end_memory_networks/data.py b/examples/language_model/end_to_end_memory_networks/data.py new file mode 100644 index 0000000000000000000000000000000000000000..3083cb996348c60a4921a073d8e6fababa58638b --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/data.py @@ -0,0 +1,88 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + + +def read_data(fname, word2idx): + """ + Data is processed into a one-dimensional vector, and each value is the code corresponding to a word. + The two sentences are separated by special characters < EOS >. 
+ + Args: + fname (str): + data filename + word2idx (dict): + word dict + + Returns: + list: return word vectors + """ + if os.path.isfile(fname): + with open(fname) as f: + lines = f.readlines() + else: + raise (Exception("[!] Data %s not found" % fname)) + + words = [] + for line in lines: + words.extend(line.split()) + + print("Read %s words from %s" % (len(words), fname)) + + data = list() + for line in lines: + for word in line.split(): + index = word2idx[word] + data.append(index) + data.append(word2idx[""]) + return data + + +def load_vocab(fname): + """ + load word dict + + Args: + fname (str): filename of the vocav file + + Returns: + dict: word dict + """ + word2idx = {} + with open(fname, "r") as f: + for line in f: + pair = line.split() + word2idx[pair[0]] = int(pair[1]) + return word2idx + + +def load_data(config): + """ + load data + + Args: + config: config + + Returns: + word dict, and train, valid, test data + """ + vocab_path = os.path.join(config.data_dir, "%s.vocab.txt" % config.data_name) + word2idx = load_vocab(vocab_path) + + train_data = read_data(os.path.join(config.data_dir, "%s.train.txt" % config.data_name), word2idx) + valid_data = read_data(os.path.join(config.data_dir, "%s.valid.txt" % config.data_name), word2idx) + test_data = read_data(os.path.join(config.data_dir, "%s.test.txt" % config.data_name), word2idx) + + return word2idx, train_data, valid_data, test_data diff --git a/examples/language_model/end_to_end_memory_networks/eval.py b/examples/language_model/end_to_end_memory_networks/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..bc3dede1d357c78f6f8eb6d416049df795eb7ce8 --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/eval.py @@ -0,0 +1,107 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
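+
+# Evaluation entry point: restores the MemN2N weights named by `checkpoint_dir` and
+# `model_name` in config.yaml and reports test-set perplexity (exp of the average
+# cross-entropy per predicted word).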
+ +import math +import os +from importlib import import_module + +import numpy as np +import paddle +from config import Config +from data import load_data +from model import MemN2N +from paddle import nn + + +@paddle.no_grad() +def eval(model: MemN2N, data, config, mode="Test"): + """ + evaluate the model performance + + Args: + model (MemN2N): the model to be evaluate + data: evaluation data + config: model and eval configs + mode: Valid or Test + + Returns: + average loss + """ + model.eval() + lossfn = nn.CrossEntropyLoss(reduction="sum") + N = int(math.ceil(len(data) / config.batch_size)) + total_loss = 0 + + context = np.ndarray([config.batch_size, config.mem_size], dtype=np.int64) + target = np.ndarray([config.batch_size], dtype=np.int64) + + if config.show: + ProgressBar = getattr(import_module("utils"), "ProgressBar") + bar = ProgressBar(mode, max=N - 1) + + m = config.mem_size + for batch in range(N): + if config.show: + bar.next() + + for i in range(config.batch_size): + if m >= len(data): + break + target[i] = data[m] + context[i, :] = data[m - config.mem_size : m] + m += 1 + if m >= len(data): + break + + batch_data = paddle.to_tensor(context) + batch_label = paddle.to_tensor(target) + + preict = model(batch_data) + loss = lossfn(preict, batch_label) + + total_loss += loss + + if config.show: + bar.finish() + + return total_loss / N / config.batch_size + + +def test(model: MemN2N, test_data, config): + """ + test the model performance + """ + test_loss = eval(model, test_data, config, "Test") + test_perplexity = math.exp(test_loss) + print("Perplexity on Test: %f" % test_perplexity) + + +if __name__ == "__main__": + config = Config("config.yaml") + + if not os.path.exists(config.checkpoint_dir): + os.makedirs(config.checkpoint_dir) + + word2idx, train_data, valid_data, test_data = load_data(config) + idx2word = dict(zip(word2idx.values(), word2idx.keys())) + config.nwords = len(word2idx) + + print("vacab size is %d" % config.nwords) + + model = MemN2N(config) + + model_path = os.path.join(config.checkpoint_dir, config.model_name) + state_dict = paddle.load(model_path) + model.set_dict(state_dict) + test(model, test_data, config) diff --git a/examples/language_model/end_to_end_memory_networks/model.py b/examples/language_model/end_to_end_memory_networks/model.py new file mode 100644 index 0000000000000000000000000000000000000000..8897cbc700026e5e7f6be82c5c176c08e5817b0a --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/model.py @@ -0,0 +1,108 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
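+
+# End-To-End Memory Network for language modeling. Each hop roughly computes
+#     p      = softmax(q . (A(x) + T_A))   # attention over the mem_size context words
+#     o      = p . (C(x) + T_C)            # weighted sum of the output memory
+#     q_next = H(q) + o                    # ReLU on the last (edim - lindim) units
+# and the final state q is projected by W to vocabulary logits.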
+ +import paddle +from paddle import nn +import numpy as np + + +class MemN2N(nn.Layer): + """ + End to End Memory Networks model + + reference paper: https://arxiv.org/pdf/1503.08895v5.pdf + """ + + def __init__(self, config): + """ + Model initialization + + Args: + config: model configuration, see config.yaml for more detail + """ + super(MemN2N, self).__init__() + self.nwords = config.nwords + self.init_hid = config.init_hid + self.init_std = config.init_std + self.nhop = config.nhop + self.edim = config.edim + self.mem_size = config.mem_size + self.lindim = config.lindim + self.max_grad_norm = config.max_grad_norm + self.batch_size = config.batch_size + + self.checkpoint_dir = config.checkpoint_dir + + normal_attr = paddle.framework.ParamAttr(initializer=paddle.nn.initializer.Normal(std=self.init_std)) + self.A = nn.Embedding(self.nwords, self.edim, weight_attr=normal_attr) + self.C = nn.Embedding(self.nwords, self.edim, weight_attr=normal_attr) + + # Temporal Encoding + self.T_A = nn.Embedding(self.mem_size, self.edim, weight_attr=normal_attr) + self.T_C = nn.Embedding(self.mem_size, self.edim, weight_attr=normal_attr) + + # Linear mapping for q + self.H = nn.Linear(self.edim, self.edim, weight_attr=normal_attr, bias_attr=False) + + # output mapping + self.W = nn.Linear(self.edim, self.nwords, weight_attr=normal_attr, bias_attr=False) + + def forward(self, data): + """ + The shape of data is [batch_size, mem_size], and the content is the id of each word + """ + q = np.ndarray([self.batch_size, self.edim], dtype=np.float32) + q.fill(self.init_hid) + q = paddle.to_tensor(q) + + time = np.ndarray([self.batch_size, self.mem_size], dtype=np.int64) + for i in range(self.mem_size): + time[:, i] = i + time = paddle.to_tensor(time) + + for hop in range(self.nhop): + A_in_c = self.A(data) # [batch_size, mem_size, edim] + A_in_t = self.T_A(time) # [batch_size, mem_size, edim] + A_in = paddle.add(A_in_c, A_in_t) # [batch_size, mem_size, edim] + + q_in = q.reshape([-1, 1, self.edim]) # [batch, 1, edim] + A_out3d = paddle.matmul(q_in, A_in, transpose_y=True) # [batch, 1, mem_size] + A_out2d = A_out3d.reshape([-1, self.mem_size]) + p = nn.functional.softmax(A_out2d) # [batch, mem_size] + + C_in_c = self.C(data) + C_in_t = self.T_C(time) + C_in = paddle.add(C_in_c, C_in_t) # [batch_size, mem_size, edim] + + p_3d = p.reshape([-1, 1, self.mem_size]) # [batch, 1, mem_size] + C_out3d = paddle.matmul(p_3d, C_in) # [batch, 1, edim] + + C_out2d = C_out3d.reshape([-1, self.edim]) # [batch, edim] + + # Linear mapping and addition + q_mapped = self.H(q) + q_out = paddle.add(C_out2d, q_mapped) + + if self.lindim == self.edim: + q = q_out + elif self.lindim == 0: + q = nn.functional.relu(q_out) + else: + F = q_out[:, : self.lindim] + G = q_out[:, self.lindim :] + K = nn.functional.relu(G) + q = paddle.concat([F, K], axis=-1) + + predict = self.W(q) + return predict diff --git a/examples/language_model/end_to_end_memory_networks/requirements.txt b/examples/language_model/end_to_end_memory_networks/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..a5c04145b7381f412fae119d93190da0297f2a22 --- /dev/null +++ b/examples/language_model/end_to_end_memory_networks/requirements.txt @@ -0,0 +1,2 @@ +progress==1.6 + diff --git a/examples/language_model/end_to_end_memory_networks/train.py b/examples/language_model/end_to_end_memory_networks/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ef1c6e893b0e5d9a8df48fd20163766356118da0 --- /dev/null +++ 
b/examples/language_model/end_to_end_memory_networks/train.py
@@ -0,0 +1,164 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import math
+import os
+import random
+from importlib import import_module
+
+import numpy as np
+import paddle
+from config import Config
+from data import load_data
+from eval import eval
+from model import MemN2N
+from paddle import nn
+
+
+def train_single_epoch(model: MemN2N, lr, data, config):
+    """
+    Train one epoch.
+
+    Args:
+        model (MemN2N): model to be trained
+        lr (float): the learning rate of this epoch
+        data: training data
+        config: configs
+
+    Returns:
+        float: average loss
+    """
+    model.train()
+    N = int(math.ceil(len(data) / config.batch_size))  # total number of training batches
+
+    clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=config.max_grad_norm)
+    optimizer = paddle.optimizer.SGD(learning_rate=lr, parameters=model.parameters(), grad_clip=clip)
+    lossfn = nn.CrossEntropyLoss(reduction="sum")
+
+    total_loss = 0
+
+    if config.show:
+        ProgressBar = getattr(import_module("utils"), "ProgressBar")
+        bar = ProgressBar("Train", max=N)
+
+    for batch in range(N):
+        if config.show:
+            bar.next()
+
+        optimizer.clear_grad()
+        context = np.ndarray([config.batch_size, config.mem_size], dtype=np.int64)
+        target = np.ndarray([config.batch_size], dtype=np.int64)
+        for i in range(config.batch_size):
+            m = random.randrange(config.mem_size, len(data))
+            target[i] = data[m]
+            context[i, :] = data[m - config.mem_size : m]
+
+        batch_data = paddle.to_tensor(context)
+        batch_label = paddle.to_tensor(target)
+
+        predict = model(batch_data)
+        loss = lossfn(predict, batch_label)
+        loss.backward()
+        optimizer.step()
+        total_loss += loss
+
+    if config.show:
+        bar.finish()
+
+    return total_loss / N / config.batch_size
+
+
+def train(model: MemN2N, train_data, valid_data, config):
+    """
+    Train the model.
+
+    Args:
+        model (MemN2N): the model to be trained
+        train_data: training data
+        valid_data: validation data
+        config: model and training configs
+
+    Returns:
+        None
+    """
+    lr = config.init_lr
+
+    train_losses = []
+    train_perplexities = []
+
+    valid_losses = []
+    valid_perplexities = []
+
+    for epoch in range(1, config.nepoch + 1):
+        train_loss = train_single_epoch(model, lr, train_data, config)
+        valid_loss = eval(model, valid_data, config, "Validation")
+
+        info = {"epoch": epoch, "learning_rate": lr}
+
+        # When the loss on the validation set no longer drops, the learning rate is divided by 1.5
+        if len(valid_losses) > 0 and valid_loss > valid_losses[-1] * 0.9999:
+            lr /= 1.5
+
+        train_losses.append(train_loss)
+        train_perplexities.append(math.exp(train_loss))
+
+        valid_losses.append(valid_loss)
+        valid_perplexities.append(math.exp(valid_loss))
+
+        info["train_perplexity"] = train_perplexities[-1]
+        info["validate_perplexity"] = valid_perplexities[-1]
+
+        print(info)
+
+        if epoch % config.log_epoch == 0:
+            save_dir = os.path.join(config.checkpoint_dir, "model_%d" % epoch)
+            paddle.save(model.state_dict(), save_dir)
+            lr_path = os.path.join(config.checkpoint_dir, "lr_%d" % epoch)
+            with open(lr_path, "w") as f:
+                f.write(f"{lr}")
+
+        # to get the target ppl
+        if info["validate_perplexity"] < config.target_ppl:
+            save_dir = os.path.join(config.checkpoint_dir, "model_good")
+            paddle.save(model.state_dict(), save_dir)
+            break
+
+        if lr < 1e-5:
+            break
+
+    save_dir = os.path.join(config.checkpoint_dir, "model")
+    paddle.save(model.state_dict(), save_dir)
+
+
+if __name__ == "__main__":
+    config = Config("config.yaml")
+
+    if not os.path.exists(config.checkpoint_dir):
+        os.makedirs(config.checkpoint_dir)
+
+    word2idx, train_data, valid_data, test_data = load_data(config)
+    idx2word = dict(zip(word2idx.values(), word2idx.keys()))
+    config.nwords = len(word2idx)
+    print("vocab size is %d" % config.nwords)
+
+    np.random.seed(config.srand)
+    random.seed(config.srand)
+    paddle.seed(config.srand)
+
+    model = MemN2N(config)
+    if config.recover_train:
+        model_path = os.path.join(config.checkpoint_dir, config.model_name)
+        state_dict = paddle.load(model_path)
+        model.set_dict(state_dict)
+    train(model, train_data, valid_data, config)
diff --git a/examples/language_model/end_to_end_memory_networks/train_until.py b/examples/language_model/end_to_end_memory_networks/train_until.py
new file mode 100644
index 0000000000000000000000000000000000000000..ebb94a2455b5e3625e69629455e1be8b861242ed
--- /dev/null
+++ b/examples/language_model/end_to_end_memory_networks/train_until.py
@@ -0,0 +1,57 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
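+"""
+Keeps retraining the MemN2N model with a freshly drawn random seed until the
+perplexity on the test set drops below the requested target, then saves that
+run's checkpoint under ``config.checkpoint_dir``. Illustrative invocation
+(``--target`` is the only command-line flag defined below; everything else
+comes from ``config.yaml``)::
+
+    python train_until.py --target 111
+"""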
+
+import argparse
+import os
+import random
+import time
+
+import numpy as np
+import paddle
+from config import Config
+from data import load_data
+from eval import test
+from model import MemN2N
+from train import train
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--target", default=111.0, type=float, help="target perplexity")
+target = parser.parse_args().target
+
+if __name__ == "__main__":
+    config = Config("config.yaml")
+    if not os.path.exists(config.checkpoint_dir):
+        os.makedirs(config.checkpoint_dir)
+
+    word2idx, train_data, valid_data, test_data = load_data(config)
+    idx2word = dict(zip(word2idx.values(), word2idx.keys()))
+    config.nwords = len(word2idx)
+    print("vocab size is %d" % config.nwords)
+
+    while True:
+        random.seed(time.time())
+        config.srand = random.randint(0, 100000)
+
+        np.random.seed(config.srand)
+        random.seed(config.srand)
+        paddle.seed(config.srand)
+
+        model = MemN2N(config)
+        train(model, train_data, valid_data, config)
+
+        test_ppl = test(model, test_data, config)
+        if test_ppl < target:
+            model_path = os.path.join(config.checkpoint_dir, config.model_name + "_" + str(config.srand) + "_good")
+            paddle.save(model.state_dict(), model_path)
+            break
diff --git a/examples/language_model/end_to_end_memory_networks/utils.py b/examples/language_model/end_to_end_memory_networks/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..632c790093d73278d62186c8be06021170e34247
--- /dev/null
+++ b/examples/language_model/end_to_end_memory_networks/utils.py
@@ -0,0 +1,21 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from progress.bar import Bar
+
+
+class ProgressBar(Bar):
+    message = "Loading"
+    fill = "#"
+    suffix = "%(percent).1f%% | ETA: %(eta)ds"
diff --git a/examples/language_model/glm b/examples/language_model/glm
new file mode 100644
index 0000000000000000000000000000000000000000..aa651eb577cf772492f4a9f901237d505fbfa514
--- /dev/null
+++ b/examples/language_model/glm
@@ -0,0 +1 @@
+../../llm/glm
\ No newline at end of file
diff --git a/examples/language_model/gpt b/examples/language_model/gpt
new file mode 100644
index 0000000000000000000000000000000000000000..6ca9896375c102a161bbd5b811122ab1d45949bb
--- /dev/null
+++ b/examples/language_model/gpt
@@ -0,0 +1 @@
+../../model_zoo/gpt/
\ No newline at end of file
diff --git a/examples/language_model/gpt-3/dygraph/build_optimizer.py b/examples/language_model/gpt-3/dygraph/build_optimizer.py
new file mode 100644
index 0000000000000000000000000000000000000000..72445ea975291cee558e67b5a8b9cc6cf7c38876
--- /dev/null
+++ b/examples/language_model/gpt-3/dygraph/build_optimizer.py
@@ -0,0 +1,62 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import inspect + +import paddle +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_optimizers.dygraph_optimizer import ( + DygraphShardingOptimizer, +) + + +def is_new_version_sharding_stage1_optimizer(): + signature_keys = set(inspect.signature(DygraphShardingOptimizer).parameters.keys()) + return "inner_optimizer_class" not in signature_keys + + +def apply(model, args, lr_scheduler, clip, decay_params, strategy): + if args.sharding_stage == 1 and args.sharding_degree > 1 and not is_new_version_sharding_stage1_optimizer(): + # for backward compatibility. + # this call will raise, if sharding stage1 is handled in HybridParallelOptimizer, + # in which case, the logic follows will handle it + optimizer = DygraphShardingOptimizer( + hcg=fleet.get_hybrid_communicate_group(), + user_defined_strategy=strategy, + params=model.parameters(), + inner_optimizer_class=paddle.optimizer.AdamW, + learning_rate=lr_scheduler if lr_scheduler is not None else args.max_lr, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, + multi_precision=args.use_pure_fp16, + ) + else: + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler if lr_scheduler is not None else args.max_lr, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, + # TODO: remove 'multi_precision' in definition of optimizer + # and add it to 'paddle.amp.decorate' + multi_precision=args.use_pure_fp16, + ) + return optimizer diff --git a/examples/language_model/gpt-3/dygraph/run_finetune.py b/examples/language_model/gpt-3/dygraph/run_finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..04b41bdf619cf238327c348a9ca629fc840fb787 --- /dev/null +++ b/examples/language_model/gpt-3/dygraph/run_finetune.py @@ -0,0 +1,561 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
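+"""
+Fine-tunes GPT-3 in dygraph mode with hybrid parallelism: data parallelism,
+tensor (model) parallelism and sharding, optional pure-FP16 (AMP O2) training,
+automatic resumption from the last checkpoint in ``output_dir``, periodic
+evaluation and checkpoint rotation, and a final export of the trained model to
+a static graph via ``paddle.jit``.
+
+Illustrative multi-GPU launch (flag names are indicative only; see ``args.py``
+for the actual arguments)::
+
+    python -m paddle.distributed.launch --gpus "0,1,2,3" run_finetune.py \
+        --mp_degree 2 --dp_degree 2 --sharding_degree 1 --output_dir ./checkpoints
+"""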
+ +import os +import random +import sys +import time +from functools import partial + +import build_optimizer +import numpy as np +import paddle +from args import parse_args +from configuration import GPTConfig +from modeling import GPTLMHeadModel +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.distributed.fleet.utils.hybrid_parallel_util import ( + fused_allreduce_gradients, +) +from utils import ( + _rotate_checkpoints, + all_gather, + convert_example, + is_dp_group_support_in_group_sharded_parallel, + left_padding, + optimizer_name_suffix, + weight_name_suffix, + wrap_sharding_2_3, +) +from visualdl import LogWriter + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import get_last_checkpoint +from paddlenlp.trainer.training_args import default_logdir +from paddlenlp.transformers import ( + CosineAnnealingWithWarmupDecay, + GPTChineseTokenizer, + GPTTokenizer, + LinearAnnealingWithWarmupDecay, + PretrainedModel, +) +from paddlenlp.transformers.model_utils import _add_variant, paddlenlp_load +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "gpt": (GPTLMHeadModel, GPTTokenizer), + "gpt-cn": (GPTLMHeadModel, GPTChineseTokenizer), +} + + +def set_hyrbid_parallel_seed(basic_seed, data_world_rank, mp_rank, pp_rank=0): + assert args.device != "cpu" + + random.seed(basic_seed + data_world_rank) + np.random.seed(basic_seed + data_world_rank) + paddle.seed(basic_seed + data_world_rank) + + # local_seed/ global_seed is used to control dropout in ModelParallel + local_seed = basic_seed + 59999 + mp_rank * 10 + pp_rank * 1000 + global_seed = basic_seed + 100003 + data_world_rank + tracker = get_rng_state_tracker() + + if "global_seed" not in tracker.states_: + tracker.add("global_seed", global_seed) + if "local_seed" not in tracker.states_: + tracker.add("local_seed", local_seed) + + +@paddle.no_grad() +def run_evaluate(args, data_loader, model, iter_steps, log_writer, global_step, task_name="valid"): + model.eval() + all_loss = [] + local_time = time.time() + iter_step = 0 + iter_steps = sys.maxsize + for eval_step, batch in enumerate(data_loader): + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + loss = model(**batch) + + all_loss.append(float(loss)) + + if (eval_step + 1) % args.accumulate_steps == 0: + iter_step += 1 + else: + continue + + if iter_step >= iter_steps: + break + + average_loss = sum(all_loss) / len(all_loss) + v = paddle.to_tensor(average_loss).detach() + average_loss = all_gather(v) + + if log_writer is not None: + logger.info("--" * 30) + logger.info( + "%s step %d, batch: %d, loss: %f, speed: %.2f step/s" + % (task_name, global_step, iter_step, average_loss, iter_step / (time.time() - local_time)) + ) + logger.info("--" * 30) + log_writer.add_scalar(task_name + "_loss", average_loss, global_step) + + model.train() + + +def do_train(args): + paddle.set_device(args.device) + nranks = paddle.distributed.get_world_size() + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": args.dp_degree, + "mp_degree": args.mp_degree, + "pp_degree": 1, + "sharding_degree": args.sharding_degree, + } + + # set control in tensor parallel + strategy.tensor_parallel_configs = 
{"tensor_init_seed": args.seed} + + fleet.init(is_collective=True, strategy=strategy) + + # obtain rank message of hybrid parallel + hcg = fleet.get_hybrid_communicate_group() + # global_rank = hcg.get_global_rank() + mp_rank = hcg.get_model_parallel_rank() + dp_rank = hcg.get_data_parallel_rank() + sharding_rank = hcg.get_sharding_parallel_rank() + + sharding_size = hcg.get_sharding_parallel_world_size() + data_world_rank = dp_rank * sharding_size + sharding_rank + data_world_size = args.dp_degree * args.sharding_degree + # local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + # seed control in hybrid parallel + set_hyrbid_parallel_seed(args.seed, data_world_rank, mp_rank) + + default_global_tokens_num = args.global_batch_size * args.max_seq_len + + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name_or_path) + + # Detecting last checkpoint. + last_checkpoint = None + training_args = args + training_args.overwrite_output_dir = False + training_args.resume_from_checkpoint = True + if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + global_step = 0 + if training_args.resume_from_checkpoint and last_checkpoint is not None: + global_step = int(str(last_checkpoint).split("-")[-1]) + + log_writer = None + if dp_rank == 0 and mp_rank == 0 and sharding_rank == 0: + log_writer_path = os.path.join(args.output_dir, default_logdir()) + log_writer = LogWriter(logdir=log_writer_path) + + WEIGHTS_NAME = "model_state.pdparams" + OPTIMIZER_NAME = "optimizer.pdopt" + + if args.mp_degree > 1 or args.sharding_degree > 1: + WEIGHTS_NAME = _add_variant(WEIGHTS_NAME, weight_name_suffix()) + OPTIMIZER_NAME = _add_variant(OPTIMIZER_NAME, optimizer_name_suffix()) + # GPTLMHeadModel using old style save_pretrained + # remove if CLASS using save_pretrained_v2 + logger.info(f"{WEIGHTS_NAME}, {OPTIMIZER_NAME}, {optimizer_name_suffix()}") + if not GPTLMHeadModel.constructed_from_pretrained_config(): + GPTLMHeadModel.resource_files_names = {"model_state": WEIGHTS_NAME} + + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + model_config["num_partitions"] = args.mp_degree + model_config["use_recompute"] = args.use_recompute + model_config["enable_fuse_transformer"] = False + model = GPTLMHeadModel(GPTConfig(**model_config)) + # Create the critrion for the gpt model + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + assert args.warmup_rate <= 1.0 and args.warmup_rate >= 0.0, "warmup_rate should be in [0, 1]" + args.warmup_steps = args.warmup_rate * args.max_steps + + lr_scheduler = None + + if args.lr_decay_style == "none": + lr_scheduler = None + elif args.lr_decay_style == "cosine": + 
lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + elif args.lr_decay_style == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = build_optimizer.apply(model, args, lr_scheduler, clip, decay_params, strategy) + if args.use_pure_fp16: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + # level O2 means converting the network to FP16 + if args.sharding_stage not in [2, 3]: + scaler = fleet.distributed_scaler(scaler) + model = paddle.amp.decorate(models=model, level="O2") + + if training_args.resume_from_checkpoint and last_checkpoint is not None: + model.set_state_dict( + paddle.load(os.path.join(last_checkpoint, model.resource_files_names["model_state"]), return_numpy=True) + ) + # wrap sharding stage2/3 and add collective group + # TODO(Baibaifan): combine ShardingStage1/2/3 and fleet.distributed_model in feature + if args.sharding_stage in [2, 3] and args.sharding_degree > 1: + scaler = scaler if args.use_pure_fp16 else None + model, optimizer, scaler = wrap_sharding_2_3(model, optimizer, scaler, args) + + elif paddle.distributed.get_world_size() > 1: + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + # decoder_start_token_id=model.config.bos_token_id, + decoder_start_token_id=tokenizer.bos_token_id, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ignore_pad_token_for_loss=args.ignore_pad_token_for_loss, + ) + + logger.info("Loading train and dev dataset: %s" % args.dataset_name) + train_set, dev_set = load_dataset(args.dataset_name, splits=["train_v1", "dev_v1"]) + logger.info("Loaded train and dev dataset: %s" % args.dataset_name) + train_set = train_set.map(trans_func, lazy=True) + + # print(train_set[0]) + # exit() + + train_batch_sampler = DistributedBatchSampler( + train_set, + batch_size=args.micro_batch_size, + shuffle=True, + drop_last=True, + num_replicas=data_world_size, + rank=data_world_rank, + ) + + train_data_loader = paddle.io.DataLoader( + dataset=train_set, + batch_sampler=train_batch_sampler, + num_workers=0, + collate_fn=DataCollatorForSeq2Seq( + tokenizer=tokenizer, + padding=True, + max_length=args.max_seq_length, + label_pad_token_id=tokenizer.pad_token_id, + ), + return_list=True, + ) + dev_set = dev_set.map(trans_func, lazy=True) + valid_batch_sampler = paddle.io.BatchSampler(dev_set, batch_size=args.micro_batch_size, shuffle=False) + valid_data_loader = paddle.io.DataLoader( + dataset=dev_set, + batch_sampler=valid_batch_sampler, + num_workers=0, + collate_fn=DataCollatorForSeq2Seq( + tokenizer=tokenizer, + padding=True, + max_length=args.max_seq_length, + label_pad_token_id=tokenizer.pad_token_id, + ), + return_list=True, + ) + + global_step = 0 + # time count + train_reader_cost = 0.0 + train_run_cost = 0.0 + reader_start = time.time() + + if training_args.resume_from_checkpoint and 
last_checkpoint is not None: + optimizer.set_state_dict( + paddlenlp_load( + os.path.join(last_checkpoint, OPTIMIZER_NAME), + map_location="cpu", + ) + ) + global_step = int(str(last_checkpoint).split("-")[-1]) + + _globalstep_last_logged = global_step + if isinstance(train_data_loader.batch_sampler, DistributedBatchSampler): + _globalstep_last_logged = 0 + + tr_loss = paddle.to_tensor(0.0) + loss_global = paddle.to_tensor(0.0) + + for epoch in range(sys.maxsize): + for step, batch in enumerate(train_data_loader()): + train_reader_cost += time.time() - reader_start + train_start = time.time() + + if global_step >= args.max_steps: + return + + if _globalstep_last_logged > 0: + _globalstep_last_logged -= 1 + continue + + # In ParallelMode of DataParallel, 'no_sync' can be used for improving + # performance of model by gradient accumulation. + + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + loss = model(**batch) + # loss = criterion(preds, labels, loss_mask) + + if args.accumulate_steps > 1: + tr_loss_step = loss / args.accumulate_steps + else: + tr_loss_step = loss + + if args.use_pure_fp16: + scaler.scale(tr_loss_step).backward() + else: + tr_loss_step.backward() + + tr_loss_step = tr_loss_step.detach() + + tr_loss += tr_loss_step + loss_global += loss.detach() + + # Skip for accumulate_steps in global step + if (step + 1) % args.accumulate_steps != 0: + continue + + if args.sharding_degree > 1 and args.sharding_stage in [2, 3]: + if args.dp_degree > 1 and not is_dp_group_support_in_group_sharded_parallel(): + fused_allreduce_gradients(model.parameters(), fleet.get_hybrid_communicate_group()) + + if args.use_pure_fp16: + # scaler.minimize(optimizer, tr_loss) + scaler.step(optimizer) + scaler.update() + else: + optimizer.step() + + optimizer.clear_grad() + tr_loss.subtract_(tr_loss) + + global_step += 1 + + # Sync for profile time, delete it may be a little faster + # paddle.device.cuda.synchronize() + train_run_cost += time.time() - train_start + + if global_step % args.logging_freq == 0: + avg_loss = all_gather(loss_global) / args.logging_freq / args.accumulate_steps + loss_global.subtract_(loss_global) + speed = args.logging_freq / (train_reader_cost + train_run_cost) + avg_reader_cost = train_reader_cost / args.logging_freq + + logger.info( + "global step %d, epoch: %d, loss: %.9f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, speed: %.2f step/s, ips_total: %.0f tokens/s, ips: %.0f tokens/s, learning rate: %.5e" + % ( + global_step, + epoch, + avg_loss, + avg_reader_cost, + 1.0 / speed, + speed, + speed * default_global_tokens_num, + speed * default_global_tokens_num / nranks, + optimizer.get_lr(), + ) + ) + if log_writer is not None: + log_writer.add_scalar("loss", float(loss), global_step) + log_writer.add_scalar("learning_rate", optimizer.get_lr(), global_step) + + # tic_train = time.time() + train_reader_cost = 0.0 + train_run_cost = 0.0 + + if lr_scheduler is not None: + lr_scheduler.step() + + if global_step % args.eval_freq == 0: + # Since the valid data broardcast to all devices, we do evaluate on all device. + run_evaluate(args, valid_data_loader, model, args.eval_iters, log_writer, global_step, "valid") + + # TODO: 1. merge paramters while saving model. 2. 
ensure that the model is saved and loaded correctly + # only dp_rank = 0 save model + if (global_step % args.save_steps == 0 or global_step >= args.max_steps) and dp_rank == 0: + + model_to_save = ( + model._layers + if paddle.distributed.get_world_size() > 1 and args.sharding_stage not in [2, 3] + else model + ) + + if args.sharding_stage == 3: + # If parameter need to convert to cpu, please add convert2cpu=True + model_to_save.get_all_parameters(convert2cpu=True) + + while hasattr(model_to_save, "_layers") or hasattr(model_to_save, "_layer"): + if hasattr(model_to_save, "_layers"): + model_to_save = model_to_save._layers + else: + model_to_save = model_to_save._layer + + output_dir = os.path.join(args.output_dir, "checkpoint-%d" % global_step) + os.makedirs(output_dir, exist_ok=True) + logger.info("Save model to %s" % output_dir) + + # tokenizer only need to save on one node + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + tokenizer.save_pretrained(output_dir) + + # paramerters is the same in sharding group + if sharding_rank == 0 and dp_rank == 0: + if isinstance(model_to_save, PretrainedModel): + model_to_save.save_pretrained(output_dir) + else: + logger.info("Trainer.model is not a `PretrainedModel`, only saving its state dict.") + state_dict = model_to_save.state_dict() + paddle.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME)) + + # ckpt optimizer weight should save on echo sharding rank + if dp_rank == 0: + paddle.save( + optimizer.state_dict(), + os.path.join( + output_dir, + OPTIMIZER_NAME, + ), + ) + + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + _rotate_checkpoints(args.save_total_limit, output_dir=args.output_dir) + + if global_step >= args.max_steps: + return + + reader_start = time.time() + + +def do_export(args): + + if args.do_export: + from utils import merge_model_parallel + + last_checkpoint = get_last_checkpoint(args.output_dir) + from modeling import GPTForGeneration + + from paddlenlp.transformers import GPTConfig + + _, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + config = GPTConfig.from_pretrained(last_checkpoint) + config.fuse_attention_qkv = True + # config.max_predict_len = 8 + config.max_dec_len = 20 + config.eos_token_id = tokenizer.eos_token_id + config.eol_token_id = tokenizer.eol_token_id + config.pad_token_id = tokenizer.eos_token_id + config.use_cache = True + config.top_k = 1 + + model = GPTForGeneration(config) + missing_keys, unexpected_keys = model.set_state_dict(merge_model_parallel(last_checkpoint, config)) + print("missing_keys", missing_keys) + print("unexpected_keys", unexpected_keys) + + # Switch to eval model + model.eval() + # Convert to static graph with specific input description + input_text = ["Nice to meet", "Hello "] + inputs = tokenizer(input_text) + + # input_ids = tokenizer.encode(input_text)['input_ids'] + inputs = tokenizer(input_text) + inputs = left_padding(inputs, tokenizer.bos_token_id) + input_ids = inputs["input_ids"] + + input_ids = paddle.to_tensor(input_ids, dtype="int64") + ret = model(input_ids=input_ids) + + # ret = model.generate(input_ids = data["input_ids"]) + for out_ids, in_txt in zip(ret[0].tolist(), input_text): + print("==" * 30) + print(in_txt + tokenizer.convert_ids_to_string(out_ids)) + + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + ], + ) + infer_path = os.path.join(args.output_dir, "infer", f"{args.dataset_name}") 
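+        # The static graph saved below, together with the tokenizer saved alongside it, can be
+        # reloaded for inference, e.g. with ``paddle.jit.load(infer_path)`` (illustrative).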
+ + # Save converted static graph model + paddle.jit.save(model, infer_path) + # Also save tokenizer for inference usage + tokenizer.save_pretrained(os.path.dirname(infer_path)) + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + args.do_export = True + os.environ["softmax_mask_fuse_upper_triangle"] = "False" + do_train(args) + do_export(args) diff --git a/examples/language_model/gpt-3/dygraph/run_glue_mp.py b/examples/language_model/gpt-3/dygraph/run_glue_mp.py new file mode 100644 index 0000000000000000000000000000000000000000..396896166e868c0e33ad278b1affdca00e360cfd --- /dev/null +++ b/examples/language_model/gpt-3/dygraph/run_glue_mp.py @@ -0,0 +1,585 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import sys +import time +from functools import partial + +import build_optimizer +import numpy as np +import paddle +from args import parse_args +from configuration import GPTConfig +from modeling import GPTForSequenceClassification +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.distributed.fleet.utils.hybrid_parallel_util import ( + fused_allreduce_gradients, +) +from paddle.metric import Accuracy +from utils import ( + _rotate_checkpoints, + all_gather, + is_dp_group_support_in_group_sharded_parallel, + optimizer_name_suffix, + weight_name_suffix, + wrap_sharding_2_3, +) +from visualdl import LogWriter + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.trainer import get_last_checkpoint +from paddlenlp.trainer.training_args import default_logdir +from paddlenlp.transformers import ( + CosineAnnealingWithWarmupDecay, + GPTChineseTokenizer, + GPTTokenizer, + LinearAnnealingWithWarmupDecay, + PretrainedModel, +) +from paddlenlp.transformers.model_utils import _add_variant, paddlenlp_load +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "gpt": (GPTForSequenceClassification, GPTTokenizer), + "gpt-cn": (GPTForSequenceClassification, GPTChineseTokenizer), +} + + +def set_hyrbid_parallel_seed(basic_seed, data_world_rank, mp_rank, pp_rank=0): + assert args.device != "cpu" + + random.seed(basic_seed + data_world_rank) + np.random.seed(basic_seed + data_world_rank) + paddle.seed(basic_seed + data_world_rank) + + # local_seed/ global_seed is used to control dropout in ModelParallel + local_seed = basic_seed + 59999 + mp_rank * 10 + pp_rank * 1000 + global_seed = basic_seed + 100003 + data_world_rank + tracker = get_rng_state_tracker() + + if "global_seed" not in tracker.states_: + tracker.add("global_seed", global_seed) + if "local_seed" not in tracker.states_: + 
tracker.add("local_seed", local_seed) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer( + example["sentence"], padding="max_length", max_length=max_seq_length, return_token_type_ids=False + ) + else: + example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + padding=True, + max_length=max_seq_length, + return_token_type_ids=False, + ) + + if not is_test: + example["labels"] = label + + return example + + +@paddle.no_grad() +def run_evaluate(args, data_loader, model, log_writer, global_step, metric, task_name="valid"): + model.eval() + metric.reset() + local_time = time.time() + iter_steps = sys.maxsize + all_loss = [] + for eval_step, batch in enumerate(data_loader): + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + loss = model(**batch, return_dict=True) + if isinstance(loss, dict): + logits = loss["logits"] + loss = loss["loss"] + correct = metric.compute(logits.detach(), batch["labels"].detach()) + metric.update(correct) + + all_loss.append(float(loss)) + + if eval_step >= iter_steps - 1: + break + + res = metric.accumulate() + + average_loss = sum(all_loss) / len(all_loss) + + logger.info("--" * 30) + if isinstance(metric, AccuracyAndF1): + logger.info( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % (average_loss, res[0], res[1], res[2], res[3], res[4]), + ) + elif isinstance(metric, Mcc): + logger.info( + "eval loss: %f, mcc: %s, " % (average_loss, res[0]), + ) + elif isinstance(metric, PearsonAndSpearman): + logger.info( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (average_loss, res[0], res[1], res[2]), + ) + else: + logger.info("eval loss: %f, acc: %s, " % (average_loss, res)) + + logger.info("--" * 30) + logger.info( + "%s step %d, batch: %d, loss: %f, speed: %.2f step/s" + % (task_name, global_step, eval_step + 1, average_loss, (eval_step + 1) / (time.time() - local_time)) + ) + logger.info("--" * 30) + if log_writer is not None: + log_writer.add_scalar(task_name + "_loss", average_loss, global_step) + + model.train() + + +def do_train(args): + paddle.set_device(args.device) + nranks = paddle.distributed.get_world_size() + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": args.dp_degree, + "mp_degree": args.mp_degree, + "pp_degree": 1, + "sharding_degree": args.sharding_degree, + } + + # set control in tensor parallel + strategy.tensor_parallel_configs = {"tensor_init_seed": args.seed} + + fleet.init(is_collective=True, strategy=strategy) + + # obtain rank message of hybrid parallel + hcg = fleet.get_hybrid_communicate_group() + # global_rank = hcg.get_global_rank() + mp_rank = hcg.get_model_parallel_rank() + dp_rank = hcg.get_data_parallel_rank() + sharding_rank = hcg.get_sharding_parallel_rank() + + sharding_size = hcg.get_sharding_parallel_world_size() + data_world_rank = dp_rank * sharding_size + sharding_rank + data_world_size = args.dp_degree * args.sharding_degree + # local_rank = 
int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + # seed control in hybrid parallel + set_hyrbid_parallel_seed(args.seed, data_world_rank, mp_rank) + default_global_tokens_num = args.global_batch_size * args.max_seq_len + + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + + train_ds = load_dataset("glue", args.task_name, splits="train") + tokenizer = GPTTokenizer.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler( + train_ds, + batch_size=args.micro_batch_size, + shuffle=True, + num_replicas=data_world_size, + rank=data_world_rank, + ) + + if args.task_name == "mnli": + dev_ds = load_dataset("glue", args.task_name, splits=["dev_matched"]) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + + dev_ds = dev_ds.map(trans_func, lazy=True) + valid_batch_sampler = paddle.io.BatchSampler( + dev_ds, + batch_size=args.micro_batch_size, + shuffle=False, + ) + + train_data_loader = paddle.io.DataLoader( + dataset=train_ds, + batch_sampler=train_batch_sampler, + num_workers=0, + return_list=True, + collate_fn=DataCollatorWithPadding(tokenizer=tokenizer, padding=True, max_length=args.max_seq_length), + ) + + valid_data_loader = paddle.io.DataLoader( + dataset=dev_ds, + batch_sampler=valid_batch_sampler, + num_workers=0, + return_list=True, + collate_fn=DataCollatorWithPadding(tokenizer=tokenizer, padding=True, max_length=args.max_seq_length), + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + + # Detecting last checkpoint. + last_checkpoint = None + training_args = args + training_args.overwrite_output_dir = False + training_args.resume_from_checkpoint = True + if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + global_step = 0 + if training_args.resume_from_checkpoint and last_checkpoint is not None: + global_step = int(str(last_checkpoint).split("-")[-1]) + # Define log writer + log_writer = None + if dp_rank == 0 and mp_rank == 0 and sharding_rank == 0: + log_writer_path = os.path.join(args.output_dir, default_logdir()) + log_writer = LogWriter(log_writer_path) + + WEIGHTS_NAME = "model_state.pdparams" + OPTIMIZER_NAME = "optimizer.pdopt" + + if args.mp_degree > 1 or args.sharding_degree > 1: + WEIGHTS_NAME = _add_variant(WEIGHTS_NAME, weight_name_suffix()) + OPTIMIZER_NAME = _add_variant(OPTIMIZER_NAME, optimizer_name_suffix()) + # GPTForSequenceClassification using old style save_pretrained + # remove if CLASS using save_pretrained_v2 + logger.info(f"{WEIGHTS_NAME}, {OPTIMIZER_NAME}, {optimizer_name_suffix()}") + if not GPTForSequenceClassification.constructed_from_pretrained_config(): + GPTForSequenceClassification.resource_files_names = {"model_state": WEIGHTS_NAME} + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + if args.model_name_or_path in pretrained_models_list: + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + + model_config["num_partitions"] = args.mp_degree + model_config["use_recompute"] = args.use_recompute + model_config["enable_fuse_transformer"] = args.fuse_transformer + model = GPTForSequenceClassification(GPTConfig(**model_config)) + + else: + model = GPTForSequenceClassification.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + attention_probs_dropout_prob=args.attention_probs_dropout_prob, + num_partitions=args.mp_degree, + use_recompute=args.use_recompute, + enable_fuse_transformer=False, + num_labels=num_classes, + ) + + metric = metric_class() + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + assert args.warmup_rate <= 1.0 and args.warmup_rate >= 0.0, "warmup_rate should be in [0, 1]" + args.warmup_steps = args.warmup_rate * args.max_steps + + lr_scheduler = None + + if args.lr_decay_style == "none": + lr_scheduler = None + elif args.lr_decay_style == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + ) + elif args.lr_decay_style == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = build_optimizer.apply(model, args, lr_scheduler, clip, decay_params, strategy) + if args.use_pure_fp16: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + # level O2 means converting the network to FP16 + if args.sharding_stage not in [2, 3]: + scaler = fleet.distributed_scaler(scaler) + model = paddle.amp.decorate(models=model, level="O2") + + if training_args.resume_from_checkpoint and last_checkpoint is not None: + model.set_state_dict( + paddle.load(os.path.join(last_checkpoint, model.resource_files_names["model_state"]), return_numpy=True) + ) + + # wrap sharding stage2/3 and add collective group + # TODO(Baibaifan): combine ShardingStage1/2/3 and fleet.distributed_model in feature + if args.sharding_stage in [2, 3] and args.sharding_degree > 1: + scaler = scaler if args.use_pure_fp16 else None + model, optimizer, scaler = wrap_sharding_2_3(model, optimizer, scaler, args) + + elif paddle.distributed.get_world_size() > 1: + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) + + # time count + train_reader_cost = 0.0 + train_run_cost = 0.0 + reader_start = time.time() + + if training_args.resume_from_checkpoint and last_checkpoint is not None: + optimizer.set_state_dict( + paddlenlp_load( + os.path.join(last_checkpoint, OPTIMIZER_NAME), + map_location="cpu", + ) + ) + + _globalstep_last_logged = global_step + tr_loss = paddle.to_tensor(0.0) + loss_global = paddle.to_tensor(0.0) + + if _globalstep_last_logged > args.max_steps: + return + for epoch in range(sys.maxsize): + train_data_loader.batch_sampler.set_epoch(epoch) + for step, batch in enumerate(train_data_loader): + train_reader_cost += time.time() - reader_start + train_start = time.time() + if _globalstep_last_logged > 0: + _globalstep_last_logged -= 1 + continue + + # In ParallelMode of DataParallel, 'no_sync' can be used for improving + # performance of model by gradient accumulation. 
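+            # Under AMP O2, ops in custom_black_list (softmax with cross entropy, elementwise_div)
+            # are kept in FP32 for numerical stability, while the fused attention/feed-forward
+            # kernels in custom_white_list are allowed to run in FP16.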
+ with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + loss = model(**batch) + if isinstance(loss, tuple): + loss = loss[0] + + if args.accumulate_steps > 1: + tr_loss_step = loss / args.accumulate_steps + else: + tr_loss_step = loss + + if args.use_pure_fp16: + scaler.scale(tr_loss_step).backward() + else: + tr_loss_step.backward() + + tr_loss_step = tr_loss_step.detach() + + tr_loss += tr_loss_step + loss_global += loss.detach() + + # Skip for accumulate_steps in global step + if (step + 1) % args.accumulate_steps != 0: + continue + + if args.sharding_degree > 1 and args.sharding_stage in [2, 3]: + if args.dp_degree > 1 and not is_dp_group_support_in_group_sharded_parallel(): + fused_allreduce_gradients(model.parameters(), fleet.get_hybrid_communicate_group()) + + if args.use_pure_fp16: + scaler.step(optimizer) + scaler.update() + else: + optimizer.step() + + optimizer.clear_grad() + tr_loss.subtract_(tr_loss) + global_step += 1 + + # Sync for profile time, delete it may be a little faster + paddle.device.cuda.synchronize() + train_run_cost += time.time() - train_start + + if global_step % args.logging_freq == 0: + avg_loss = all_gather(loss_global) / args.logging_freq / args.accumulate_steps + loss_global.subtract_(loss_global) + speed = args.logging_freq / (train_reader_cost + train_run_cost) + avg_reader_cost = train_reader_cost / args.logging_freq + + logger.info( + "global step: %d, epoch: %d, loss: %.9f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, speed: %.2f step/s, ips_total: %.0f tokens/s, ips: %.0f tokens/s, learning rate: %.5e" + % ( + global_step, + epoch, + avg_loss, + avg_reader_cost, + 1.0 / speed, + speed, + speed * default_global_tokens_num, + speed * default_global_tokens_num / nranks, + optimizer.get_lr(), + ) + ) + if log_writer is not None: + log_writer.add_scalar("loss", float(loss), global_step) + log_writer.add_scalar("learning_rate", optimizer.get_lr(), global_step) + + # tic_train = time.time() + train_reader_cost = 0.0 + train_run_cost = 0.0 + + if global_step % args.eval_freq == 0: + # Since the valid data broardcast to all devices, we do evaluate on all device. + run_evaluate(args, valid_data_loader, model, log_writer, global_step, metric, "valid") + + # TODO: 1. merge paramters while saving model. 2. 
ensure that the model is saved and loaded correctly + # only dp_rank = 0 save model + if (global_step % args.save_steps == 0 or global_step >= args.max_steps) and dp_rank == 0: + + model_to_save = ( + model._layers + if paddle.distributed.get_world_size() > 1 and args.sharding_stage not in [2, 3] + else model + ) + if args.sharding_stage == 3: + # If parameter need to convert to cpu, please add convert2cpu=True + model_to_save.get_all_parameters(convert2cpu=True) + + while hasattr(model_to_save, "_layers") or hasattr(model_to_save, "_layer"): + if hasattr(model_to_save, "_layers"): + model_to_save = model_to_save._layers + else: + model_to_save = model_to_save._layer + + output_dir = os.path.join(args.output_dir, "checkpoint-%d" % global_step) + os.makedirs(output_dir, exist_ok=True) + + logger.info("Save model to %s" % output_dir) + + # tokenizer only need to save on one node + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + tokenizer.save_pretrained(output_dir) + + # paramerters is the same in sharding group + if sharding_rank == 0 and dp_rank == 0: + if isinstance(model_to_save, PretrainedModel): + model_to_save.save_pretrained(output_dir) + else: + logger.info("Trainer.model is not a `PretrainedModel`, only saving its state dict.") + state_dict = model_to_save.state_dict() + paddle.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME)) + + # ckpt optimizer weight should save on echo sharding rank + if dp_rank == 0: + paddle.save( + optimizer.state_dict(), + os.path.join( + output_dir, + OPTIMIZER_NAME, + ), + ) + + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + _rotate_checkpoints(args.save_total_limit, output_dir=args.output_dir) + + if lr_scheduler is not None: + lr_scheduler.step() + + if global_step >= args.max_steps: + return + + reader_start = time.time() + + +def do_export(args): + if args.do_export: + from utils import merge_model_parallel + + last_checkpoint = get_last_checkpoint(args.output_dir) + from paddlenlp.transformers import GPTConfig, GPTForSequenceClassification + + _, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + config = GPTConfig.from_pretrained(last_checkpoint) + config.fuse_attention_qkv = True + model = GPTForSequenceClassification(config) + missing_keys, unexpected_keys = model.set_state_dict(merge_model_parallel(last_checkpoint, config)) + print("missing_keys", missing_keys) + print("unexpected_keys", unexpected_keys) + # print(train_ds[0]) + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + ], + ) + infer_path = os.path.join(args.output_dir, "infer", f"{args.task_name}") + + # Save converted static graph model + paddle.jit.save(model, infer_path) + # # Also save tokenizer for inference usage + tokenizer.save_pretrained(os.path.dirname(infer_path)) + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + args.do_export = True + do_train(args) + do_export(args) diff --git a/examples/language_model/gpt-3/dygraph/run_pretrain.py b/examples/language_model/gpt-3/dygraph/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..4562f415018f61a88ac383b399e00c6a8c50a8d0 --- /dev/null +++ b/examples/language_model/gpt-3/dygraph/run_pretrain.py @@ -0,0 +1,515 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import sys +import time + +import build_optimizer +import numpy as np +import paddle +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.distributed.fleet.utils.hybrid_parallel_util import ( + fused_allreduce_gradients, +) +from paddle.distributed.sharding import group_sharded_parallel +from visualdl import LogWriter + +from paddlenlp.transformers import ( + CosineAnnealingWithWarmupDecay, + GPTChineseTokenizer, + GPTTokenizer, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils import profiler +from paddlenlp.utils.log import logger + +# to import data_tools +filepath = os.path.abspath(os.path.dirname(__file__)) +sys.path.insert(0, os.path.join(filepath, "../")) +# import lr # noqa e402 +from args import parse_args # noqa e402 +from configuration import GPTConfig +from dataset import create_pretrained_dataset # noqa e402 +from modeling import ( # noqa e402 + GPTForPretraining, + GPTForPretrainingPipe, + GPTPretrainingCriterion, +) + +MODEL_CLASSES = { + "gpt": (GPTForPretraining, GPTTokenizer), + "gpt-cn": (GPTForPretraining, GPTChineseTokenizer), +} + + +def set_hyrbid_parallel_seed(basic_seed, data_world_rank, mp_rank, pp_rank): + assert args.device != "cpu" + + basic_seed = basic_seed * 1000 + random.seed(basic_seed + data_world_rank) + np.random.seed(basic_seed + data_world_rank) + paddle.seed(basic_seed + data_world_rank) + + # local_seed/ global_seed is used to control dropout in ModelParallel + local_seed = basic_seed + 123 + mp_rank * 10 + pp_rank * 1000 + global_seed = basic_seed + data_world_rank + tracker = get_rng_state_tracker() + tracker.add("global_seed", global_seed) + tracker.add("local_seed", local_seed) + + +@paddle.no_grad() +def run_evaluate(args, data_loader, model, criterion, iter_steps, log_writer, global_step, epoch, task_name="valid"): + model.eval() + all_loss = [] + local_time = time.time() + for eval_step, batch in enumerate(data_loader): + tokens, loss_mask, position_ids, labels = batch + if args.pp_degree < 2: + preds = model(tokens, position_ids) + loss = criterion(preds, labels, loss_mask) + else: + data = [(tokens, position_ids), (labels, loss_mask)] + loss = model.eval_batch(data, compute_loss=True) + + all_loss.append(float(loss)) + if eval_step >= iter_steps - 1: + break + + average_loss = sum(all_loss) / len(all_loss) + logger.info( + "%s step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (task_name, global_step, epoch, eval_step, average_loss, iter_steps / (time.time() - local_time)) + ) + log_writer.add_scalar(task_name + "_loss", average_loss, global_step) + model.train() + + +def get_train_data_file(args): + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and str(f).endswith("_idx.npz")) + ] + files = [x.replace("_idx.npz", "") for x in files] + if len(files) == 0: + logger.warning( + "Not found 
dataset with name of xxx_ids.npy and xxx_idx.npz! Try to found old compatible xxx_ids.npz file." + ) + else: + return files + + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and str(f).endswith("_ids.npz")) + ] + + files = [x.replace("_ids.npz", "") for x in files] + return files + + +def do_train(args): + paddle.set_device(args.device) + nranks = paddle.distributed.get_world_size() + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": args.dp_degree, + "mp_degree": args.mp_degree, + "pp_degree": args.pp_degree, + "sharding_degree": args.sharding_degree, + } + + accumulate_steps = args.local_batch_size // args.micro_batch_size + strategy.pipeline_configs = {"accumulate_steps": accumulate_steps, "micro_batch_size": args.micro_batch_size} + + # set control in tensor parallel + strategy.tensor_parallel_configs = {"tensor_init_seed": args.seed} + + fleet.init(is_collective=True, strategy=strategy) + + # obtain rank message of hybrid parallel + hcg = fleet.get_hybrid_communicate_group() + global_rank = hcg.get_global_rank() + mp_rank = hcg.get_model_parallel_rank() + pp_rank = hcg.get_stage_id() + dp_rank = hcg.get_data_parallel_rank() + sharding_rank = hcg.get_sharding_parallel_rank() + + # sharding stage2/3 not support hybrid parallel now + if args.sharding_stage in [2, 3]: + assert args.mp_degree == args.pp_degree == 1, "sharding stage2/3 will support tensor/pipeline parallel later" + dp_group = hcg.get_data_parallel_group() + + sharding_size = hcg.get_sharding_parallel_world_size() + data_world_rank = dp_rank * sharding_size + sharding_rank + data_world_size = args.dp_degree * args.sharding_degree + local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + # seed control in hybrid parallel + set_hyrbid_parallel_seed(args.seed, data_world_rank, mp_rank, pp_rank) + + default_global_tokens_num = args.global_batch_size * args.max_seq_len + + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + # Define log writer + log_writer_path = os.path.join( + args.output_dir, + "train_log", + "{}_globalbsz_{}_pure_fp16_{}_recompute_{}_card_{}".format( + args.model_name_or_path, args.global_batch_size, args.use_pure_fp16, False, global_rank + ).lower(), + ) + + if os.path.exists(log_writer_path): + import shutil + + shutil.rmtree(log_writer_path) + + log_writer = LogWriter(log_writer_path) + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + if args.mp_degree > 1: + GPTForPretraining.resource_files_names = {"model_state": "model_state_mp_{:0>2d}.pdparams".format(mp_rank)} + + if args.model_name_or_path in pretrained_models_list: + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + + model_config["num_partitions"] = args.mp_degree + model_config["use_recompute"] = args.use_recompute + model_config["enable_fuse_transformer"] = args.fuse_transformer + if args.pp_degree == 1: + model = GPTForPretraining(GPTConfig(**model_config)) + else: + topology = hcg.topology() + model = GPTForPretrainingPipe(GPTConfig(**model_config), topology) + else: + model = GPTForPretraining.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + 
attention_probs_dropout_prob=args.attention_probs_dropout_prob, + ) + + # Create the critrion for the gpt model + criterion = GPTPretrainingCriterion() + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + assert args.warmup_rate <= 1.0 and args.warmup_rate >= 0.0, "warmup_rate should be in [0, 1]" + args.warmup_steps = args.warmup_rate * args.max_steps + + lr_scheduler = None + + if args.lr_decay_style == "none": + lr_scheduler = None + elif args.lr_decay_style == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + elif args.lr_decay_style == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = build_optimizer.apply(model, args, lr_scheduler, clip, decay_params, strategy) + + # decorate @to_static for benchmark, skip it by default. + if args.to_static: + specs = None + paddle.jit.ignore_module([os]) + model = paddle.jit.to_static(model, input_spec=specs) + logger.info("Successfully to apply @to_static with specs: {}".format(specs)) + + if args.use_pure_fp16: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + # level O2 means converting the network to FP16 + if args.sharding_stage not in [2, 3]: + scaler = fleet.distributed_scaler(scaler) + model = paddle.amp.decorate(models=model, level="O2") + + # wrap sharding stage2/3 and add collective group + # TODO(Baibaifan): combine ShardingStage1/2/3 and fleet.distributed_model in feature + if args.sharding_stage in [2, 3]: + if args.dp_degree > 1: + from paddle.distributed.parallel import sync_params_buffers + + sync_params_buffers(model, comm_group=dp_group, src_rank=dp_group.ranks[0]) + + scaler = scaler if args.use_pure_fp16 else None + model, optimizer, scaler = wrap_sharding_2_3(model, optimizer, scaler, args.sharding_offload) + + elif paddle.distributed.get_world_size() > 1: + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) + + if args.model_name_or_path not in pretrained_models_list: + logger.info("Try to load checkpoint from %s " % args.model_name_or_path) + opt_path = os.path.join(args.model_name_or_path, "model_state.pdopt") + if os.path.exists(opt_path): + opt_dict = paddle.load(opt_path) + optimizer.set_state_dict(opt_dict) + else: + logger.warning("No optimizer checkpoint file found in %s." 
% opt_path) + + global_step = 0 + # tic_train = time.time() + for epoch in range(args.num_train_epochs): + files = get_train_data_file(args) + files.sort() + num_files = len(files) + for f_id in range(num_files): + data_file = files[f_id] + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + [data_file], + local_rank=local_rank, + data_world_size=data_world_size, + data_world_rank=data_world_rank, + max_seq_len=args.max_seq_len, + eos_id=tokenizer.eos_token_id, + old_version_accumulate_compatible=True, + ) + # Bug fix, if not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. + valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + + # time count + train_reader_cost = 0.0 + train_run_cost = 0.0 + reader_start = time.time() + for step, batch in enumerate(train_data_loader()): + train_reader_cost += time.time() - reader_start + train_start = time.time() + + global_step += 1 + tokens, loss_mask, position_ids, labels = batch + + loss_mask.stop_gradient = True + labels.stop_gradient = True + position_ids.stop_gradient = True + + if args.pp_degree == 1: + # In ParallelMode of DataParallel, 'no_sync' can be used for improving + # performance of model by gradient accumulation. + loss = 0.0 + for i in range(accumulate_steps): + start_index = i * args.micro_batch_size + end_index = start_index + args.micro_batch_size + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["reduce_sum", "c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + preds = model(tokens[start_index:end_index, :], position_ids[start_index:end_index, :]) + loss_mbs = criterion( + preds, labels[start_index:end_index, :], loss_mask[start_index:end_index, :] + ) + loss_mbs = loss_mbs / accumulate_steps + if args.use_pure_fp16: + scaler.scale(loss_mbs).backward() + else: + loss_mbs.backward() + loss = loss + loss_mbs + + if args.sharding_stage in [2, 3] and args.dp_degree > 1: + fused_allreduce_gradients(model.parameters(), hcg) + if args.sharding_stage == 3: + for p in model.parameters(): + if hasattr(p, "bw_storage"): + assert p.grad is None, "This case shouldn't happen." 
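+                                    # Average the stage-3 sharded gradient storage across the
+                                    # data-parallel group: scale by 1/nranks, then sum via all_reduce.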
+ p.bw_storage.scale_(1.0 / dp_group.nranks) + paddle.distributed.all_reduce(p.bw_storage, group=dp_group) + + if args.use_pure_fp16: + if args.sharding_stage in [2, 3]: + scaler.step(optimizer) + scaler.update() + else: + scaler.minimize(optimizer, loss) + else: + optimizer.step() + + if lr_scheduler is not None: + lr_scheduler.step() + + optimizer.clear_grad() + + else: + data = [(tokens, position_ids), (labels, loss_mask)] + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["reduce_sum", "c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + loss = model.train_batch( + data, + optimizer=optimizer, + lr_scheduler=lr_scheduler, + scaler=scaler if args.use_pure_fp16 else None, + ) + + # Sync for profile time, delete it may be a little faster + paddle.device.cuda.synchronize() + train_run_cost += time.time() - train_start + # Profile for model benchmark + profiler.add_profiler_step(args.profiler_options) + + if global_step % args.logging_freq == 0: + avg_loss = loss.numpy() + speed = args.logging_freq / (train_reader_cost + train_run_cost) + avg_reader_cost = train_reader_cost / args.logging_freq + + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %.9f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, speed: %.2f step/s, ips_total: %.0f tokens/s, ips: %.0f tokens/s, learning rate: %.5e" + % ( + global_step, + epoch, + step, + avg_loss, + avg_reader_cost, + 1.0 / speed, + speed, + speed * default_global_tokens_num, + speed * default_global_tokens_num / nranks, + optimizer.get_lr(), + ) + ) + log_writer.add_scalar("loss", float(loss), global_step) + log_writer.add_scalar("learning_rate", optimizer.get_lr(), global_step) + + # tic_train = time.time() + train_reader_cost = 0.0 + train_run_cost = 0.0 + + if args.check_accuracy: + if global_step >= args.max_steps: + return + else: + continue + + if global_step % args.eval_freq == 0: + # Since the valid data broardcast to all devices, we do evaluate on all device. + run_evaluate( + args, + valid_data_loader, + model, + criterion, + args.eval_iters, + log_writer, + global_step, + epoch, + "valid", + ) + + # TODO: 1. merge paramters while saving model. 2. 
ensure that the model is saved and loaded correctly + # only dp_rank = 0 save model + if (global_step % args.save_steps == 0 or global_step >= args.max_steps) and dp_rank == 0: + + model_to_save = ( + model._layers + if paddle.distributed.get_world_size() > 1 and args.sharding_stage not in [2, 3] + else model + ) + output_dir = os.path.join(args.output_dir, "step_%d" % global_step) + os.makedirs(output_dir, exist_ok=True) + + logger.info("Save model to %s" % output_dir) + + if args.pp_degree > 1: + if mp_rank == 0 and sharding_rank == 0 and pp_rank == 0: + tokenizer.save_pretrained(output_dir) + model_to_save.save_state_dict(output_dir) + paddle.save( + optimizer.state_dict(), + os.path.join( + output_dir, + "model_state_mp_{:0>2d}_sharding_{:0>2d}_pp_{:0>2d}.pdopt".format( + mp_rank, sharding_rank, pp_rank + ), + ), + ) + else: + if args.sharding_stage == 3: + # If parameter need to convert to cpu, please add convert2cpu=True + model_to_save.get_all_parameters(convert2cpu=False) + if mp_rank == 0 and sharding_rank == 0: + tokenizer.save_pretrained(output_dir) + model_to_save.save_pretrained(output_dir) + paddle.save( + optimizer.state_dict(), + os.path.join( + output_dir, + "model_state_mp_{:0>2d}_sharding_{:0>2d}.pdopt".format(mp_rank, sharding_rank), + ), + ) + + if global_step >= args.max_steps: + run_evaluate( + args, + test_data_loader, + model, + criterion, + args.test_iters, + log_writer, + global_step, + epoch, + "test", + ) + logger.info("The training process is complete.") + del train_data_loader + return + + reader_start = time.time() + + del train_data_loader + + +def wrap_sharding_2_3(model, optimizer, scaler, sharding_offload): + group = fleet.get_hybrid_communicate_group().get_sharding_parallel_group() + level = "p_g_os" if args.sharding_stage == 3 else "os_g" + return group_sharded_parallel( + model=model, optimizer=optimizer, level=level, scaler=scaler, group=group, offload=sharding_offload + ) + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + do_train(args) diff --git a/examples/language_model/gpt-3/dygraph/run_pretrain_mp.py b/examples/language_model/gpt-3/dygraph/run_pretrain_mp.py new file mode 100644 index 0000000000000000000000000000000000000000..5124d5a6e236a477cab112c6ef4e210c4e41612a --- /dev/null +++ b/examples/language_model/gpt-3/dygraph/run_pretrain_mp.py @@ -0,0 +1,456 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
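+#
+# run_pretrain_mp.py pretrains GPT in dygraph mode with hybrid data / tensor
+# (model) / sharding parallelism (pp_degree is fixed to 1), and can resume
+# training from the last checkpoint found under --output_dir.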
+ +import os +import random +import time + +import build_optimizer +import numpy as np +import paddle +from args import parse_args +from configuration import GPTConfig +from dataset import create_pretrained_dataset +from modeling import GPTForPretraining, GPTPretrainingCriterion +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.distributed.fleet.utils.hybrid_parallel_util import ( + fused_allreduce_gradients, +) +from run_pretrain import get_train_data_file +from utils import ( + _rotate_checkpoints, + all_gather, + is_dp_group_support_in_group_sharded_parallel, + optimizer_name_suffix, + weight_name_suffix, + wrap_sharding_2_3, +) +from visualdl import LogWriter + +from paddlenlp.trainer import get_last_checkpoint +from paddlenlp.trainer.training_args import default_logdir +from paddlenlp.transformers import ( + CosineAnnealingWithWarmupDecay, + GPTChineseTokenizer, + GPTTokenizer, + LinearAnnealingWithWarmupDecay, + PretrainedModel, +) +from paddlenlp.transformers.model_utils import _add_variant, paddlenlp_load +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "gpt": (GPTForPretraining, GPTTokenizer), + "gpt-cn": (GPTForPretraining, GPTChineseTokenizer), +} + + +def set_hyrbid_parallel_seed(basic_seed, data_world_rank, mp_rank, pp_rank=0): + assert args.device != "cpu" + + random.seed(basic_seed + data_world_rank) + np.random.seed(basic_seed + data_world_rank) + paddle.seed(basic_seed + data_world_rank) + + # local_seed/ global_seed is used to control dropout in ModelParallel + local_seed = basic_seed + 59999 + mp_rank * 10 + pp_rank * 1000 + global_seed = basic_seed + 100003 + data_world_rank + tracker = get_rng_state_tracker() + + if "global_seed" not in tracker.states_: + tracker.add("global_seed", global_seed) + if "local_seed" not in tracker.states_: + tracker.add("local_seed", local_seed) + + +@paddle.no_grad() +def run_evaluate(args, data_loader, model, criterion, iter_steps, log_writer, global_step, task_name="valid"): + model.eval() + all_loss = [] + local_time = time.time() + iter_step = 0 + for eval_step, batch in enumerate(data_loader): + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + tokens, loss_mask, position_ids, labels = batch + preds = model(tokens, position_ids) + loss = criterion(preds, labels, loss_mask) + + all_loss.append(float(loss)) + + if (eval_step + 1) % args.accumulate_steps == 0: + iter_step += 1 + else: + continue + + if iter_step >= iter_steps: + break + + average_loss = sum(all_loss) / len(all_loss) + v = paddle.to_tensor(average_loss).detach() + average_loss = all_gather(v) + + if log_writer is not None: + logger.info("--" * 30) + logger.info( + "%s step %d, batch: %d, loss: %f, speed: %.2f step/s" + % (task_name, global_step, iter_steps, average_loss, iter_steps / (time.time() - local_time)) + ) + logger.info("--" * 30) + log_writer.add_scalar(task_name + "_loss", average_loss, global_step) + + model.train() + + +def do_train(args): + paddle.set_device(args.device) + nranks = paddle.distributed.get_world_size() + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": args.dp_degree, + "mp_degree": args.mp_degree, + "pp_degree": 1, + "sharding_degree": args.sharding_degree, + } + + # set control in tensor parallel + 
strategy.tensor_parallel_configs = {"tensor_init_seed": args.seed} + + fleet.init(is_collective=True, strategy=strategy) + + # obtain rank message of hybrid parallel + hcg = fleet.get_hybrid_communicate_group() + # global_rank = hcg.get_global_rank() + mp_rank = hcg.get_model_parallel_rank() + dp_rank = hcg.get_data_parallel_rank() + sharding_rank = hcg.get_sharding_parallel_rank() + + sharding_size = hcg.get_sharding_parallel_world_size() + data_world_rank = dp_rank * sharding_size + sharding_rank + data_world_size = args.dp_degree * args.sharding_degree + local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + # seed control in hybrid parallel + set_hyrbid_parallel_seed(args.seed, data_world_rank, mp_rank) + + default_global_tokens_num = args.global_batch_size * args.max_seq_len + + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name_or_path) + + # Detecting last checkpoint. + last_checkpoint = None + training_args = args + training_args.overwrite_output_dir = False + training_args.resume_from_checkpoint = True + if os.path.isdir(training_args.output_dir) and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + global_step = 0 + if training_args.resume_from_checkpoint and last_checkpoint is not None: + global_step = int(str(last_checkpoint).split("-")[-1]) + + log_writer = None + if dp_rank == 0 and mp_rank == 0 and sharding_rank == 0: + log_writer_path = os.path.join(args.output_dir, default_logdir()) + log_writer = LogWriter(logdir=log_writer_path) + + WEIGHTS_NAME = "model_state.pdparams" + OPTIMIZER_NAME = "optimizer.pdopt" + + if args.mp_degree > 1 or args.sharding_degree > 1: + WEIGHTS_NAME = _add_variant(WEIGHTS_NAME, weight_name_suffix()) + OPTIMIZER_NAME = _add_variant(OPTIMIZER_NAME, optimizer_name_suffix()) + # GPTForPretraining using old style save_pretrained + # remove if CLASS using save_pretrained_v2 + logger.info(f"{WEIGHTS_NAME}, {OPTIMIZER_NAME}, {optimizer_name_suffix()}") + if not GPTForPretraining.constructed_from_pretrained_config(): + GPTForPretraining.resource_files_names = {"model_state": WEIGHTS_NAME} + + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + model_config["num_partitions"] = args.mp_degree + model_config["use_recompute"] = args.use_recompute + model_config["enable_fuse_transformer"] = False + model = GPTForPretraining(GPTConfig(**model_config)) + # Create the critrion for the gpt model + criterion = GPTPretrainingCriterion() + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + assert args.warmup_rate <= 1.0 and args.warmup_rate >= 0.0, "warmup_rate should be in [0, 1]" + args.warmup_steps = args.warmup_rate * args.max_steps + + lr_scheduler = None + + if 
args.lr_decay_style == "none": + lr_scheduler = None + elif args.lr_decay_style == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + elif args.lr_decay_style == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=args.max_lr, + min_lr=args.min_lr, + warmup_step=args.warmup_steps, + decay_step=args.decay_steps, + last_epoch=0, + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = build_optimizer.apply(model, args, lr_scheduler, clip, decay_params, strategy) + + if args.use_pure_fp16: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + # level O2 means converting the network to FP16 + if args.sharding_stage not in [2, 3]: + scaler = fleet.distributed_scaler(scaler) + model = paddle.amp.decorate(models=model, level="O2") + + if training_args.resume_from_checkpoint and last_checkpoint is not None: + model.set_state_dict( + paddle.load(os.path.join(last_checkpoint, model.resource_files_names["model_state"]), return_numpy=True) + ) + # wrap sharding stage2/3 and add collective group + # TODO(Baibaifan): combine ShardingStage1/2/3 and fleet.distributed_model in feature + if args.sharding_stage in [2, 3] and args.sharding_degree > 1: + scaler = scaler if args.use_pure_fp16 else None + model, optimizer, scaler = wrap_sharding_2_3(model, optimizer, scaler, args) + + elif paddle.distributed.get_world_size() > 1: + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) + + files = get_train_data_file(args) + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + files, + local_rank=local_rank, + data_world_size=data_world_size, + data_world_rank=data_world_rank, + max_seq_len=args.max_seq_len, + eos_id=tokenizer.eos_token_id, + current_step=global_step, + ) + # Bug fix, if not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. + valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + + global_step = 0 + # time count + train_reader_cost = 0.0 + train_run_cost = 0.0 + reader_start = time.time() + + if training_args.resume_from_checkpoint and last_checkpoint is not None: + optimizer.set_state_dict( + paddlenlp_load( + os.path.join(last_checkpoint, OPTIMIZER_NAME), + map_location="cpu", + ) + ) + global_step = int(str(last_checkpoint).split("-")[-1]) + + _globalstep_last_logged = global_step + if isinstance(train_data_loader.batch_sampler, DistributedBatchSampler): + _globalstep_last_logged = 0 + + tr_loss = paddle.to_tensor(0.0) + loss_global = paddle.to_tensor(0.0) + + for step, batch in enumerate(train_data_loader()): + train_reader_cost += time.time() - reader_start + train_start = time.time() + + if _globalstep_last_logged > 0: + _globalstep_last_logged -= 1 + continue + + tokens, loss_mask, position_ids, labels = batch + + # In ParallelMode of DataParallel, 'no_sync' can be used for improving + # performance of model by gradient accumulation. 
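+        # Each loop iteration processes one micro batch: the loss is divided by
+        # accumulate_steps (when it is > 1), and optimizer.step() only runs once
+        # every accumulate_steps micro batches (see the (step + 1) % accumulate_steps check below).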
+ + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=["c_softmax_with_cross_entropy", "elementwise_div"], + custom_white_list=["fused_attention", "fused_feedforward"], + level="O2", + ): + preds = model(tokens, position_ids) + loss = criterion(preds, labels, loss_mask) + + if args.accumulate_steps > 1: + tr_loss_step = loss / args.accumulate_steps + else: + tr_loss_step = loss + + if args.use_pure_fp16: + scaler.scale(tr_loss_step).backward() + else: + tr_loss_step.backward() + + tr_loss_step = tr_loss_step.detach() + + tr_loss += tr_loss_step + loss_global += loss.detach() + + # Skip for accumulate_steps in global step + if (step + 1) % args.accumulate_steps != 0: + continue + + if args.sharding_degree > 1 and args.sharding_stage in [2, 3]: + if args.dp_degree > 1 and not is_dp_group_support_in_group_sharded_parallel(): + fused_allreduce_gradients(model.parameters(), fleet.get_hybrid_communicate_group()) + + if args.use_pure_fp16: + # scaler.minimize(optimizer, tr_loss) + scaler.step(optimizer) + scaler.update() + else: + optimizer.step() + + optimizer.clear_grad() + tr_loss.subtract_(tr_loss) + + global_step += 1 + + # Sync for profile time, delete it may be a little faster + # paddle.device.cuda.synchronize() + train_run_cost += time.time() - train_start + + if global_step % args.logging_freq == 0: + avg_loss = all_gather(loss_global) / args.logging_freq / args.accumulate_steps + loss_global.subtract_(loss_global) + speed = args.logging_freq / (train_reader_cost + train_run_cost) + avg_reader_cost = train_reader_cost / args.logging_freq + + logger.info( + "global step %d, loss: %.9f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, speed: %.2f step/s, ips_total: %.0f tokens/s, ips: %.0f tokens/s, learning rate: %.5e" + % ( + global_step, + avg_loss, + avg_reader_cost, + 1.0 / speed, + speed, + speed * default_global_tokens_num, + speed * default_global_tokens_num / nranks, + optimizer.get_lr(), + ) + ) + if log_writer is not None: + log_writer.add_scalar("loss", float(loss), global_step) + log_writer.add_scalar("learning_rate", optimizer.get_lr(), global_step) + + # tic_train = time.time() + train_reader_cost = 0.0 + train_run_cost = 0.0 + + if lr_scheduler is not None: + lr_scheduler.step() + + if global_step % args.eval_freq == 0: + # Since the valid data broardcast to all devices, we do evaluate on all device. + run_evaluate(args, valid_data_loader, model, criterion, args.eval_iters, log_writer, global_step, "valid") + + # TODO: 1. merge paramters while saving model. 2. 
ensure that the model is saved and loaded correctly + # only dp_rank = 0 save model + if (global_step % args.save_steps == 0 or global_step >= args.max_steps) and dp_rank == 0: + + model_to_save = ( + model._layers + if paddle.distributed.get_world_size() > 1 and args.sharding_stage not in [2, 3] + else model + ) + + if args.sharding_stage == 3: + # If parameter need to convert to cpu, please add convert2cpu=True + model_to_save.get_all_parameters(convert2cpu=True) + + while hasattr(model_to_save, "_layers") or hasattr(model_to_save, "_layer"): + if hasattr(model_to_save, "_layers"): + model_to_save = model_to_save._layers + else: + model_to_save = model_to_save._layer + + output_dir = os.path.join(args.output_dir, "checkpoint-%d" % global_step) + os.makedirs(output_dir, exist_ok=True) + logger.info("Save model to %s" % output_dir) + + # tokenizer only need to save on one node + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + tokenizer.save_pretrained(output_dir) + + # paramerters is the same in sharding group + if sharding_rank == 0 and dp_rank == 0: + if isinstance(model_to_save, PretrainedModel): + model_to_save.save_pretrained(output_dir) + else: + logger.info("Trainer.model is not a `PretrainedModel`, only saving its state dict.") + state_dict = model_to_save.state_dict() + paddle.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME)) + + # ckpt optimizer weight should save on echo sharding rank + if dp_rank == 0: + paddle.save( + optimizer.state_dict(), + os.path.join( + output_dir, + OPTIMIZER_NAME, + ), + ) + + if mp_rank == 0 and sharding_rank == 0 and dp_rank == 0: + _rotate_checkpoints(args.save_total_limit, output_dir=args.output_dir) + + if global_step >= args.max_steps: + return + + reader_start = time.time() + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + do_train(args) diff --git a/examples/language_model/llama b/examples/language_model/llama new file mode 100644 index 0000000000000000000000000000000000000000..00841636eed14207239457be205fd4787349653d --- /dev/null +++ b/examples/language_model/llama @@ -0,0 +1 @@ +../../llm/llama \ No newline at end of file diff --git a/examples/language_model/luke/README.md b/examples/language_model/luke/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a66f93035b528413d337c263bbfd299c14267b8e --- /dev/null +++ b/examples/language_model/luke/README.md @@ -0,0 +1,91 @@ +# LUKE with PaddleNLP + +[LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) + +**模型简介:** +许多NLP任务都涉及实体,例如:关系分类、实体类型、命名实体识别(NER)和问答(QA)。解决此类实体相关任务的关键是学习实体有效表示。传统的实体表示为每个实体分配一个固定的Embedding向量,该向量将有关实体的信息存储在知识库(KB)中。它们需要实体链接(entity linking)来表示文本中的实体,而不能表示KB中不存在的实体。 + +相比之下,基于contextualized word representations(CWRs) transformer的大型预训练模型,如BERT和RoBERTa,提供了基于语言建模的有效通用词语表征。然而,由于以下两个原因,CWRs的体系结构不适合表示实体: + +- 由于CWR不输出实体的跨级(span-level)表示,因此它们通常需要学习如何基于通常较小的下游数据集计算此类表征。 + +- 许多与实体相关的任务,如关系分类和问答(QA)涉及实体之间关系的推理。尽管transformer可以通过使用self-attention机制将单词相互关联来捕捉单词之间的复杂关系。在实体之间执行关系推理是困难的,因为许多实体在模型中被分割成多个词。此外,基于单词的CWRs预训练任务不适合学习实体的表征,因为在实体中预测一个被MASK的单词,例如预测“Rings”, 给予句子“The Lord of the [MASK]”,一个完整的实体就这样被拆分。 + +LUKE和现有CWRs之间的一个重要区别在于,它不仅将单词视为独立的token,还将实体视为独立的token,并使用transformer计算所有token的中间表征和输出表征。由于实体被视为token,LUKE可以直接建模实体之间的关系。 +本项目是 LUKE 在 Paddle 2.x上的开源实现。 + +## 快速开始 + +### 下游任务微调 + +数据集 +下载Open Entity数据集 +[下载地址](https://cloud.tsinghua.edu.cn/f/6ec98dbd931b4da9a7f0/) +把下载好的文件解压,并把解压后的Open Entity目录下的`train.json`、`test.json`和`dev.json`分别为训练集、验证集和测试集 + 
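+下面给出一个可供参考的数据准备示例(仅为示意:假设压缩包保存为 `OpenEntity.zip`、解压后的目录名为 `Open Entity`,目标目录 `data/` 与下文 Open Entity 微调命令中的 `--data_dir data/` 保持一致,请按实际下载内容调整):
+
+```shell
+# 解压并整理 Open Entity 数据集(压缩包名与目录名仅为示例)
+unzip OpenEntity.zip
+mkdir -p data
+cp "Open Entity"/train.json "Open Entity"/dev.json "Open Entity"/test.json data/
+```
+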
+下载SQuAD1.1数据集,主流机器阅读理解数据集 +[下载地址](https://data.deepai.org/squad1.1.zip) + +#### 1、SQuAD1.1 +以SQuAD1.1数据集为例 + +运行以下两个命令即可训练并评估LUKE在SQuAD1.1数据集的精度 + +```shell +python -m paddle.distributed.launch examples/language_model/luke/run_squad.py + --model_type luke \ + --device gpu \ + --learning_rate 15e-6 \ + --num_train_epochs 2 \ + --batch_size 8 \ + --do_predict \ + --do_train \ + --model_name_or_path luke-large +``` +其中参数释义如下: +- `model_type` 指示了模型类型,当前支持`luke` +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 +- `num_train_epochs` 表示需要训练的epoch数量 +- `do_train` 表示是否开启训练 +- `do_predict` 表示是否开启评估 +- `model_name_or_path` 模型的名称和路径,支持`luke-base` 和 `luke-large` + +训练结束后模型会对模型进行评估,其评估在验证集上完成, 训练完成后你将看到如下结果: +```text +{"exact_match": 89.75691579943235, "f1": 94.95702001984502} +``` + +#### 2、Open Entity + +```shell +python -m paddle.distributed.launch examples/language_model/luke/run_open_entity.py \ + --model_type luke-large \ + --data_dir data/ \ + --output_dir output/ \ + --device gpu \ + --learning_rate 1e-5 \ + --num_train_epochs 3 \ + --train_batch_size 2 +``` +训练结束后模型会对模型进行评估,其评估在测试集上完成, 训练完成后你将看到如下结果: +```text +Results: { + "test_f1": 0.7815726767275616, + "test_precision": 0.7880405766150561, + "test_recall": 0.7752100840336135 +} +``` + + +# Reference + +```bibtex +@inproceedings{yamada2020luke, + title={LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention}, + author={Ikuya Yamada and Akari Asai and Hiroyuki Shindo and Hideaki Takeda and Yuji Matsumoto}, + booktitle={EMNLP}, + year={2020} +} +``` diff --git a/examples/language_model/luke/args.py b/examples/language_model/luke/args.py new file mode 100644 index 0000000000000000000000000000000000000000..f736916e938d3827384b20b41e64e2cc9b1df801 --- /dev/null +++ b/examples/language_model/luke/args.py @@ -0,0 +1,98 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--model_type", default="bert", type=str, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default="bert-base-uncased", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default="outputs", + type=str, + help="The output directory where the model predictions and checkpoints will be written. " + "Default as `outputs`", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", + default=0.0, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=20, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument( + "--null_score_diff_threshold", + type=float, + default=0.0, + help="If null_score - best_non_null is greater than the threshold predict null.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=30, help="Max answer length.") + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + parser.add_argument( + "--version_2_with_negative", + action="store_true", + help="If true, the SQuAD examples contain some that do not have an answer. If using squad v2.0, it should be set true.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + args = parser.parse_args() + return args diff --git a/examples/language_model/luke/open_entity_processor.py b/examples/language_model/luke/open_entity_processor.py new file mode 100644 index 0000000000000000000000000000000000000000..a0f4739058e428cdea96b4c5a39ff174bde2ca7b --- /dev/null +++ b/examples/language_model/luke/open_entity_processor.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os + +from tqdm import tqdm + +ENTITY_TOKEN = "[ENTITY]" + + +class InputExample(object): + def __init__(self, id_, text, span, labels): + self.id = id_ + self.text = text + self.span = span + self.labels = labels + + +class InputFeatures(object): + def __init__( + self, + word_ids, + word_segment_ids, + word_attention_mask, + entity_ids, + entity_position_ids, + entity_segment_ids, + entity_attention_mask, + labels, + ): + self.word_ids = word_ids + self.word_segment_ids = word_segment_ids + self.word_attention_mask = word_attention_mask + self.entity_ids = entity_ids + self.entity_position_ids = entity_position_ids + self.entity_segment_ids = entity_segment_ids + self.entity_attention_mask = entity_attention_mask + self.labels = labels + + +class DatasetProcessor(object): + def get_train_examples(self, data_dir): + return self._create_examples(data_dir, "train") + + def get_dev_examples(self, data_dir): + return self._create_examples(data_dir, "dev") + + def get_test_examples(self, data_dir): + return self._create_examples(data_dir, "test") + + def get_label_list(self, data_dir): + labels = set() + for example in self.get_train_examples(data_dir): + labels.update(example.labels) + return sorted(labels) + + def _create_examples(self, data_dir, set_type): + with open(os.path.join(data_dir, set_type + ".json"), "r") as f: + data = json.load(f) + return [ + InputExample(i, item["sent"], (item["start"], item["end"]), item["labels"]) for i, item in enumerate(data) + ] + + +def convert_examples_to_features(examples, label_list, tokenizer, max_mention_length): + label_map = {label: i for i, label in enumerate(label_list)} + + conv_tables = ( + ("-LRB-", "("), + ("-LCB-", "("), + ("-LSB-", "("), + ("-RRB-", ")"), + ("-RCB-", ")"), + ("-RSB-", ")"), + ) + features = [] + for example in tqdm(examples): + + def preprocess_and_tokenize(text, start, end=None): + target_text = text[start:end].rstrip() + for a, b in conv_tables: + target_text = target_text.replace(a, b) + + return tokenizer.tokenize(target_text, add_prefix_space=True) + + tokens = [tokenizer.cls_token] + tokens += preprocess_and_tokenize(example.text, 0, example.span[0]) + mention_start = len(tokens) + tokens.append(ENTITY_TOKEN) + tokens += preprocess_and_tokenize(example.text, example.span[0], example.span[1]) + tokens.append(ENTITY_TOKEN) + mention_end = len(tokens) + + tokens += preprocess_and_tokenize(example.text, example.span[1]) + tokens.append(tokenizer.sep_token) + + word_ids = tokenizer.convert_tokens_to_ids(tokens) + word_attention_mask = [1] * len(tokens) + word_segment_ids = [0] * len(tokens) + + entity_ids = [2, 0] + entity_attention_mask = [1, 0] + entity_segment_ids = [0, 0] + entity_position_ids = list(range(mention_start, mention_end))[:max_mention_length] + entity_position_ids += [-1] * (max_mention_length - mention_end + mention_start) + entity_position_ids = [entity_position_ids, [-1] * max_mention_length] + + labels = [0] * len(label_map) + + for label in example.labels: + labels[label_map[label]] = 1 + + features.append( + InputFeatures( + word_ids=word_ids, + 
word_segment_ids=word_segment_ids, + word_attention_mask=word_attention_mask, + entity_ids=entity_ids, + entity_position_ids=entity_position_ids, + entity_segment_ids=entity_segment_ids, + entity_attention_mask=entity_attention_mask, + labels=labels, + ) + ) + + return features diff --git a/examples/language_model/luke/run_open_entity.py b/examples/language_model/luke/run_open_entity.py new file mode 100644 index 0000000000000000000000000000000000000000..edd06060341b9abb7389a9f9941ecadcafb2be73 --- /dev/null +++ b/examples/language_model/luke/run_open_entity.py @@ -0,0 +1,294 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import logging +import os + +import numpy as np +import paddle +import paddle.nn.functional as F +from open_entity_processor import DatasetProcessor, convert_examples_to_features +from paddle.io import DataLoader, Dataset +from paddle.optimizer import AdamW +from tqdm import tqdm + +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + LukeForEntityClassification, + LukeTokenizer, +) + +ENTITY_TOKEN = "[ENTITY]" + +parser = argparse.ArgumentParser(description="LUKE FOR OPEN ENTITY") + +parser.add_argument( + "--output_dir", type=str, required=True, help="Use to store all outputs during training and evaluation." +) +parser.add_argument("--data_dir", type=str, required=True, help="Dataset folder") +parser.add_argument("--num_train_epochs", type=int, default=2, help="Number of training cycles") +parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") +parser.add_argument("--batch_size", type=int, default=8, help="Batch size per GPU/CPU for training.") +parser.add_argument("--device", type=str, default="gpu", help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--gradient_accumulation_steps", type=int, default=3, help="Gradient accumulated before each parameter update." 
+) +parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay if we apply some") +parser.add_argument( + "--warmup_proportion", + type=float, + default=0.06, + help="Proportion of training steps to perform linear learning rate warmup for.", +) +parser.add_argument("--learning_rate", type=float, default=1e-5, help="The initial learning rate for Adam.") +parser.add_argument("--model_type", type=str, default="luke-base", help="Type of pre-trained model.") +parser.add_argument("--max_mention_length", type=int, default=30, help="Max entity position's length") + +args = parser.parse_args() + + +class Trainer(object): + def __init__(self, args, model, dataloader, num_train_steps, step_callback=None): + self.args = args + self.model = model + self.dataloader = dataloader + self.num_train_steps = num_train_steps + self.step_callback = step_callback + + self.optimizer, self.scheduler = self._create_optimizer(model) + self.wd_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + def train(self): + model = self.model + + epoch = 0 + global_step = 0 + + model.train() + + with tqdm(total=self.num_train_steps) as pbar: + while True: + for step, batch in enumerate(self.dataloader): + outputs = model( + input_ids=batch[0], + token_type_ids=batch[1], + attention_mask=batch[2], + entity_ids=batch[3], + entity_position_ids=batch[4], + entity_token_type_ids=batch[5], + entity_attention_mask=batch[6], + ) + + loss = F.binary_cross_entropy_with_logits( + outputs.reshape([-1]), batch[7].reshape([-1]).astype("float32") + ) + + if self.args.gradient_accumulation_steps > 1: + loss = loss / self.args.gradient_accumulation_steps + loss.backward() + if (step + 1) % self.args.gradient_accumulation_steps == 0: + self.optimizer.step() + self.scheduler.step() + self.optimizer.clear_grad() + pbar.set_description("epoch: %d loss: %.7f" % (epoch, loss)) + pbar.update() + global_step += 1 + + if global_step == self.num_train_steps: + break + output_dir = self.args.output_dir + + model.save_pretrained(output_dir) + if global_step == self.num_train_steps: + break + epoch += 1 + + return model, global_step + + def _create_optimizer(self, model): + scheduler = self._create_scheduler() + clip = paddle.nn.ClipGradByNorm(clip_norm=1.0) + return ( + AdamW( + parameters=model.parameters(), + grad_clip=clip, + learning_rate=scheduler, + apply_decay_param_fun=lambda x: x in self.wd_params, + weight_decay=self.args.weight_decay, + ), + scheduler, + ) + + def _create_scheduler(self): + return LinearDecayWithWarmup( + learning_rate=self.args.learning_rate, total_steps=self.num_train_steps, warmup=self.args.warmup_proportion + ) + + +class DataGenerator(Dataset): + def __init__(self, features): + super(DataGenerator, self).__init__() + self.features = features + + def __getitem__(self, item): + word_ids = self.features[item].word_segment_ids + word_segment_ids = self.features[item].word_segment_ids + word_attention_mask = self.features[item].word_attention_mask + entity_ids = self.features[item].entity_ids + entity_position_ids = self.features[item].entity_position_ids + entity_segment_ids = self.features[item].entity_segment_ids + entity_attention_mask = self.features[item].entity_attention_mask + labels = self.features[item].labels + + return ( + word_ids, + word_segment_ids, + word_attention_mask, + entity_ids, + entity_position_ids, + entity_segment_ids, + entity_attention_mask, + labels, + ) + + def __len__(self): + return len(self.features) + + 
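+# A label is treated as predicted when its logit is greater than 0; precision,
+# recall and F1 below are micro-averaged over all predicted / gold entity types.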
+@paddle.no_grad() +def evaluate(args, model, mode="dev", output_file=None): + dataloader, _, _, label_list = load_examples(args, mode=mode) + model.eval() + + all_logits = [] + all_labels = [] + + for batch in tqdm(dataloader, desc=mode): + logits = model( + input_ids=batch[0], + token_type_ids=batch[1], + attention_mask=batch[2], + entity_ids=batch[3], + entity_position_ids=batch[4], + entity_token_type_ids=batch[5], + entity_attention_mask=batch[6], + ) + + logits = logits.tolist() + labels = batch[7].tolist() + + all_logits.extend(logits) + all_labels.extend(labels) + + all_predicted_indexes = [] + all_label_indexes = [] + for logits, labels in zip(all_logits, all_labels): + all_predicted_indexes.append([i for i, v in enumerate(logits) if v > 0]) + all_label_indexes.append([i for i, v in enumerate(labels) if v > 0]) + + if output_file: + with open(output_file, "w") as f: + for predicted_indexes, label_indexes in zip(all_predicted_indexes, all_label_indexes): + data = dict( + predictions=[label_list[ind] for ind in predicted_indexes], + labels=[label_list[ind] for ind in label_indexes], + ) + f.write(json.dumps(data) + "\n") + + num_predicted_labels = 0 + num_gold_labels = 0 + num_correct_labels = 0 + + for predicted_indexes, label_indexes in zip(all_predicted_indexes, all_label_indexes): + num_predicted_labels += len(predicted_indexes) + num_gold_labels += len(label_indexes) + num_correct_labels += len(frozenset(predicted_indexes).intersection(frozenset(label_indexes))) + + if num_predicted_labels > 0: + precision = num_correct_labels / num_predicted_labels + else: + precision = 0.0 + + recall = num_correct_labels / num_gold_labels + if precision + recall == 0.0: + f1 = 0.0 + else: + f1 = 2 * precision * recall / (precision + recall) + + return dict(precision=precision, recall=recall, f1=f1) + + +def load_examples(args, mode="train"): + tokenizer = LukeTokenizer.from_pretrained(args.model_type) + tokenizer.add_special_tokens(dict(additional_special_tokens=[ENTITY_TOKEN])) + processor = DatasetProcessor() + if mode == "train": + examples = processor.get_train_examples(args.data_dir) + elif mode == "dev": + examples = processor.get_dev_examples(args.data_dir) + else: + examples = processor.get_test_examples(args.data_dir) + + label_list = processor.get_label_list(args.data_dir) + + logging.info("Creating features from the dataset...") + features = convert_examples_to_features(examples, label_list, tokenizer, args.max_mention_length) + + dataset = DataGenerator(features) + + def collate_fn(batch): + def create_padded_sequence(k, padding_value): + """Pad sequence to maximum length""" + new_data = [] + max_len = 0 + for each_batch in batch: + if len(each_batch[k]) > max_len: + max_len = len(each_batch[k]) + for each_batch in batch: + new_data.append(each_batch[k] + [padding_value] * (max_len - len(each_batch[k]))) + return np.array(new_data, dtype="int64") + + return ( + create_padded_sequence(0, 1), # pad word_ids + create_padded_sequence(1, 0), # pad word_segment_ids + create_padded_sequence(2, 0), # pad word_attention_mask + create_padded_sequence(3, 0), # pad entity_ids + create_padded_sequence(4, 0), # pad entity_position_ids + create_padded_sequence(5, 0), # pad entity_segment_ids + create_padded_sequence(6, 0), # pad entity_attention_mask + create_padded_sequence(7, 0), + ) # convert to numpy array + + dataloader = DataLoader(dataset, shuffle="train" in mode, batch_size=args.batch_size, collate_fn=collate_fn) + + return dataloader, examples, features, label_list + + +if __name__ == 
"__main__": + results = {} + train_dataloader, _, features, _ = load_examples(args, mode="train") + num_labels = len(features[0].labels) + num_train_steps_per_epoch = len(train_dataloader) // args.gradient_accumulation_steps + num_train_steps = int(num_train_steps_per_epoch * args.num_train_epochs) + model = LukeForEntityClassification.from_pretrained(args.model_type, num_classes=num_labels) + trainer = Trainer(args, model=model, dataloader=train_dataloader, num_train_steps=num_train_steps) + trainer.train() + output_file = os.path.join(args.output_dir, "test_predictions.jsonl") + results.update({f"test_{k}": v for k, v in evaluate(args, model, "test", output_file).items()}) + + logging.info("Results: %s", json.dumps(results, indent=2, sort_keys=True)) + with open(os.path.join(args.output_dir, "results.json"), "w") as f: + json.dump(results, f) diff --git a/examples/language_model/luke/run_squad.py b/examples/language_model/luke/run_squad.py new file mode 100644 index 0000000000000000000000000000000000000000..494eb07d9d55865925d09cec374d5130a2a4769a --- /dev/null +++ b/examples/language_model/luke/run_squad.py @@ -0,0 +1,353 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from args import parse_args +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + LukeForQuestionAnswering, + LukeTokenizer, +) + +MODEL_CLASSES = {"luke": (LukeForQuestionAnswering, LukeTokenizer)} + + +def prepare_train_features(examples, tokenizer, args): + # Some of the questions have lots of whitespace on the left, which is not useful and will make the + # truncation of the context fail (the tokenized question will take a lots of space). So we remove that + # left whitespace + contexts = examples["context"] + questions = examples["question"] + + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + tokenized_examples = tokenizer( + questions, + contexts, + add_prefix_space=True, + return_token_type_ids=True, + max_seq_len=args.max_seq_length, + stride=args.doc_stride, + return_attention_mask=True, + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. 
This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + # Minus one more to reach actual text + token_end_index -= 1 + if token_end_index >= len(offsets): + token_end_index = len(offsets) - 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, + contexts, + add_prefix_space=True, + return_token_type_ids=True, + stride=args.doc_stride, + max_seq_len=args.max_seq_length, + return_attention_mask=True, + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. 
+ sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, data_loader, raw_dataset, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + input_ids, _ = batch + start_logits_tensor, end_logits_tensor = model(input_ids) + + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + print("time per 1000:", time.time() - tic_eval) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, + data_loader.dataset, + (all_start_logits, all_end_logits), + args.version_2_with_negative, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open("prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + squad_evaluate(examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json) + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +def run(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + if args.version_2_with_negative: + train_examples = load_dataset("squad_v2", split="train") + dev_examples = load_dataset("squad_v2", split="validation") + 
else: + train_examples = load_dataset("squad", split="train") + dev_examples = load_dataset("squad", split="validation") + + column_names = train_examples.column_names + set_seed(args) + if rank == 0: + if os.path.exists(args.model_name_or_path): + print("Loads checkpoint from %s" % args.model_name_or_path) + + model = model_class.from_pretrained(args.model_name_or_path) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.do_train: + train_ds = train_examples.map( + partial(prepare_train_features, tokenizer=tokenizer, args=args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "start_positions": Stack(dtype="int64"), + "end_positions": Stack(dtype="int64"), + } + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = CrossEntropyLossForSQuAD() + + global_step = 0 + tic_train = time.time() + + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, _, start_positions, end_positions = batch + logits = model(input_ids) + loss = criterion(logits, (start_positions, end_positions)) + + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch + 1, step + 1, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("Saving checkpoint to:", output_dir) + if global_step == num_training_steps: + break + + if args.do_predict and rank == 0: + dev_ds = dev_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + dev_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, 
pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=dev_batchify_fn, return_list=True + ) + + evaluate(model, dev_data_loader, args) + + +if __name__ == "__main__": + args = parse_args() + run(args) diff --git a/examples/language_model/megatronbert/README.md b/examples/language_model/megatronbert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e9795358bb40a232db43afa05a06853098b5af80 --- /dev/null +++ b/examples/language_model/megatronbert/README.md @@ -0,0 +1,103 @@ +# MegatronBert with PaddleNLP + +[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf) + +**模型简介:** +近期在语言建模方面的工作表明,训练大型transformers模型提高了自然语言处理应用的技术水平。然而,由于内存限制,非常大的模型可能难以训练。在这项工作中, +作者提出了训练大型transformers模型的技术,并实现了一种简单、高效的模型运算并行方法,该方法能够训练具有数十亿个参数的transformers模型。 + +本项目是 MegatronBert 在 Paddle 2.x上的开源实现。 + +## 快速开始 + +### 下游任务微调 + +#### 1、SQuAD1.1 & SQuAD2.0 +SQuAD1.1数据集 + +```shell +python -m paddle.distributed.launch run_squad.py \ + --do_train \ + --do_predict \ + --batch_size=8 \ + --model_name_or_path=megatronbert-cased + --learning_rate=1e-5 \ + --output_dir=output/ \ + --device=gpu \ + --num_train_epochs=2 +``` +其中参数释义如下: +- `model_name_or_path` 指示了模型类型,当前支持`megatronbert-cased`和`megatronbert-uncased`模型。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `output_dir` 表示模型保存路径。 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 +- `num_train_epochs` 表示需要训练的epoch数量 + +训练结束后模型会对模型进行评估,其评估在验证集上完成, 训练完成后你将看到如下结果: +```text +{ + "exact": 88.78902554399243, + "f1": 94.4082803514958, + "total": 10570, + "HasAns_exact": 88.78902554399244, + "HasAns_f1": 94.4082803514958, + "HasAns_total": 10570 +} +``` + +SQuAD2.0数据集 +```shell +python -m paddle.distributed.launch run_squad.py \ + --do_train \ + --version_2_with_negative \ + --do_predict \ + --batch_size=8 \ + --model_name_or_path=megatronbert-cased + --learning_rate=1e-5 \ + --output_dir=output/ \ + --device=gpu \ + --num_train_epochs=2 +``` + +其中参数释义如下: +- `version_2_with_negative` 是否使用SQuAD2.0数据集 + +训练结束后模型会对模型进行评估,其评估在验证集上完成, 训练完成后你将看到如下结果: +```text +{ + "exact": 85.85867093405206, + "f1": 88.70579950475263, + "total": 11873, + "HasAns_exact": 82.47300944669365, + "HasAns_f1": 88.17543143048748, + "HasAns_total": 5928, + "NoAns_exact": 89.23465096719933, + "NoAns_f1": 89.23465096719933, + "NoAns_total": 5945, + "best_exact": 85.99343047250063, + "best_exact_thresh": -1.6154582500457764, + "best_f1": 88.75296534320918, + "best_f1_thresh": -0.20494508743286133 +} +``` + +#### 2、mnli数据集 + +```shell +python -m paddle.distributed.launch run_glue.py \ + --task_name=mnli \ + --output_dir=output/ \ + --model_name_or_path=megatronbert-cased \ + --learning_rate=1e-5 \ + --device=gpu \ + --num_train_epochs=2 +``` +训练结束后模型会对模型进行评估,其评估在测试集上完成, 训练完成后你将看到如下结果: +```text +eval loss: 0.186327, acc: 0.8992358634742741, eval loss: 0.332409, acc: 0.8968673718470301, eval done total : 118.65499472618103 s +``` + +# Reference + +* [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/pdf/1909.08053.pdf) diff --git a/examples/language_model/megatronbert/args.py b/examples/language_model/megatronbert/args.py new file mode 100644 index 
0000000000000000000000000000000000000000..7b9bf24492b770baa82a2a68e1a9244b6a50e9e7 --- /dev/null +++ b/examples/language_model/megatronbert/args.py @@ -0,0 +1,102 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--train_file", type=str, required=False, default=None, help="Train data path.") + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument("--model_type", default="megatronbert", type=str, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default="megatronbert-cased", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + help="The output directory where the model predictions and checkpoints will be written. " + "Default as `outputs`", + ) + parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=2, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", + default=0.06, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=5000, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." 
+ ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=20, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument( + "--null_score_diff_threshold", + type=float, + default=0.0, + help="If null_score - best_non_null is greater than the threshold predict null.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=30, help="Max answer length.") + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + parser.add_argument( + "--version_2_with_negative", + action="store_true", + help="If true, the SQuAD examples contain some that do not have an answer. If using squad v2.0, it should be set true.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + args = parser.parse_args() + return args diff --git a/examples/language_model/megatronbert/run_glue.py b/examples/language_model/megatronbert/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..cbb78c94abd9e3a8450fda83663bb4a2e3125a19 --- /dev/null +++ b/examples/language_model/megatronbert/run_glue.py @@ -0,0 +1,362 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + MegatronBertForSequenceClassification, + MegatronBertTokenizer, +) + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "megatronbert": (MegatronBertForSequenceClassification, MegatronBertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default="megatronbert", + type=str, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default="megatronbert-cased", + type=str, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default="outputs", + type=str, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=2, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=10000, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.06, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu/npu.", + ) + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, token_type_ids=segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("eval loss: %f, mcc: %s, " % (loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, 
pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, num_labels=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
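+    # Concretely: `decay_params` collects the framework names of parameters whose structured
+    # name does not contain "bias" or "norm", and `apply_decay_param_fun` below applies
+    # weight decay only to those parameters.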
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + + input_ids, segment_ids, labels = batch + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model(input_ids, token_type_ids=segment_ids) + loss = loss_fct(logits, labels) + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(model, loss_fct, metric, dev_data_loader_matched) + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "%s_ft_model_%d.pdparams" % (args.task_name, global_step) + ) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/language_model/megatronbert/run_squad.py b/examples/language_model/megatronbert/run_squad.py new file mode 100644 index 0000000000000000000000000000000000000000..5d3f12b5e090f572e5ed351fa5cf8fb2ffc14f02 --- /dev/null +++ b/examples/language_model/megatronbert/run_squad.py @@ -0,0 +1,313 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from args import parse_args +from paddle.io import DataLoader + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + MegatronBertForQuestionAnswering, + MegatronBertTokenizer, +) + +MODEL_CLASSES = {"megatronbert": (MegatronBertForQuestionAnswering, MegatronBertTokenizer)} + + +def prepare_train_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # Let's label those examples! + for i, tokenized_example in enumerate(tokenized_examples): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_example["input_ids"] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offsets = tokenized_example["offset_mapping"] + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + answers = examples[sample_index]["answers"] + answer_starts = examples[sample_index]["answer_starts"] + + # If no answers are given, set the cls_index as answer. + if len(answer_starts) == 0: + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + else: + # Start/end character index of the answer in the text. + start_char = answer_starts[0] + end_char = start_char + len(answers[0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. 
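+            # Here `sequence_ids` are the token_type_ids: 1 marks context tokens (including the
+            # trailing [SEP]), 0 marks the question and special tokens before it. The backward scan
+            # stops at that trailing [SEP]; the extra decrement below moves onto the last real
+            # context token.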
+ token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + # Minus one more to reach actual text + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples[i]["start_positions"] = token_start_index - 1 + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples[i]["end_positions"] = token_end_index + 1 + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # For validation, there is no need to compute start and end positions + for i, tokenized_example in enumerate(tokenized_examples): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + tokenized_examples[i]["example_id"] = examples[sample_index]["id"] + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
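+        # Positions whose offset is None are therefore outside the context, and the
+        # compute_prediction post-processing ignores them when mapping start/end logits
+        # back to character-level answer spans.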
+ tokenized_examples[i]["offset_mapping"] = [ + (o if sequence_ids[k] == 1 else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + + return tokenized_examples + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, data_loader, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + input_ids, token_type_ids = batch + start_logits_tensor, end_logits_tensor = model(input_ids, token_type_ids=token_type_ids) + + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + print("time per 1000:", time.time() - tic_eval) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + data_loader.dataset.data, + data_loader.dataset.new_data, + (all_start_logits, all_end_logits), + args.version_2_with_negative, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open("prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + squad_evaluate(examples=data_loader.dataset.data, preds=all_predictions, na_probs=scores_diff_json) + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +def run(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + if args.version_2_with_negative: + train_ds = load_dataset("squad", splits="train_v2", data_files=args.train_file) + dev_ds = load_dataset("squad", splits="dev_v2", data_files=args.predict_file) + else: + train_ds = load_dataset("squad", splits="train_v1", data_files=args.train_file) + dev_ds = load_dataset("squad", splits="dev_v1", data_files=args.predict_file) + set_seed(args) + if rank == 0: + if os.path.exists(args.model_name_or_path): + print("init checkpoint from %s" % args.model_name_or_path) + + model = model_class.from_pretrained(args.model_name_or_path) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.do_train: + train_ds.map(partial(prepare_train_features, tokenizer=tokenizer, args=args), batched=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, 
pad_val=tokenizer.pad_token_type_id), + "start_positions": Stack(dtype="int64"), + "end_positions": Stack(dtype="int64"), + } + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = CrossEntropyLossForSQuAD() + + global_step = 0 + tic_train = time.time() + + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, token_type_ids, start_positions, end_positions = batch + logits = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(logits, (start_positions, end_positions)) + + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch + 1, step + 1, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("Saving checkpoint to:", output_dir) + if global_step == num_training_steps: + break + + if args.do_predict and rank == 0: + dev_ds.map(partial(prepare_validation_features, tokenizer=tokenizer, args=args), batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + dev_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=dev_batchify_fn, return_list=True + ) + + evaluate(model, dev_data_loader, args) + + +if __name__ == "__main__": + args = parse_args() + run(args) diff --git a/examples/language_model/moe/data_tools b/examples/language_model/moe/data_tools new file mode 100644 index 0000000000000000000000000000000000000000..8841a30c30fdcf5aef96137c98938765471bb759 --- /dev/null +++ b/examples/language_model/moe/data_tools @@ -0,0 +1 @@ +../../../model_zoo/ernie-1.0/data_tools \ No newline at end of file diff --git a/examples/language_model/moe/dygraph/args.py b/examples/language_model/moe/dygraph/args.py new file mode 100644 index 0000000000000000000000000000000000000000..700fc911889a19eefc57433ef819e6dd856f8f97 --- /dev/null +++ 
b/examples/language_model/moe/dygraph/args.py @@ -0,0 +1,274 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger + + +def process_batch_size(args): + if args.global_batch_size is None and args.local_batch_size is None: + raise ValueError("global_batch_size or local_batch_size should be set.") + elif args.global_batch_size is not None and args.local_batch_size is not None: + assert args.global_batch_size // args.local_batch_size == (args.dp_degree * args.sharding_degree), ( + "global_batch_size[{}] should be " + "divided by local_batch_size[{}] when dp_degree is [{}], sharding_degree is [{}]. ".format( + args.global_batch_size, args.local_batch_size, args.dp_degree, args.sharding_degree + ) + ) + elif args.global_batch_size is not None and args.local_batch_size is None: + assert ( + args.global_batch_size % (args.dp_degree * args.sharding_degree) == 0 + ), "global_batch_size[{}] should be divided by dp_degree[{}] times sharding_degree[{}].".format( + args.global_batch_size, args.dp_degree, args.sharding_degree + ) + args.local_batch_size = args.global_batch_size // (args.dp_degree * args.sharding_degree) + else: + args.global_batch_size = args.local_batch_size * args.dp_degree * args.sharding_degree + assert args.local_batch_size % args.micro_batch_size == 0 + + +def parse_args(MODEL_CLASSES): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + + # Train I/O config + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the training logs and checkpoints will be written.", + ) + parser.add_argument("--split", type=str, default="949,50,1", help="Train/valid/test data split.") + + parser.add_argument("--max_seq_len", type=int, default=1024, help="Max sequence length.") + + parser.add_argument( + "--global_batch_size", + default=None, + type=int, + help="Global batch size for all training process. None for not check the size is valid. If we only use data parallelism, it should be device_num * micro_batch_size.", + ) + + parser.add_argument( + "--local_batch_size", + default=None, + type=int, + help="Global batch size for all training process. None for not check the size is valid. 
If we only use data parallelism, it should be device_num * micro_batch_size.", + ) + + parser.add_argument( + "--micro_batch_size", + default=8, + type=int, + help="Batch size per device for one step training.", + ) + + # Default training config + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--grad_clip", default=0.0, type=float, help="Grad clip for the parameter.") + parser.add_argument("--max_lr", default=0.00015, type=float, help="The initial max learning rate for Adam.") + parser.add_argument("--min_lr", default=1e-5, type=float, help="The initial min learning rate for Adam.") + parser.add_argument( + "--warmup_rate", default=0.01, type=float, help="Linear warmup over warmup_steps for learing rate." + ) + + # Adam optimizer config + parser.add_argument( + "--adam_beta1", + default=0.9, + type=float, + help="The beta1 for Adam optimizer. The exponential decay rate for the 1st moment estimates.", + ) + parser.add_argument( + "--adam_beta2", + default=0.999, + type=float, + help="The bate2 for Adam optimizer. The exponential decay rate for the 2nd moment estimates.", + ) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + + # Training steps config + parser.add_argument( + "--num_train_epochs", + default=1, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=500000, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--decay_steps", + default=360000, + type=int, + help="The steps use to control the learing rate. If the step > decay_steps, will use the min_lr.", + ) + parser.add_argument("--logging_freq", type=int, default=1, help="Log every X updates steps.") + parser.add_argument("--eval_freq", type=int, default=500, help="Evaluate for every X updates steps.") + parser.add_argument("--eval_iters", type=int, default=10, help="Evaluate the model use X steps data.") + + # Config for 4D Parallelism + + parser.add_argument( + "--sharding_degree", + type=int, + default=1, + help="Group Sharded degree. Spliting the model parameters to many cards.", + ) + + parser.add_argument("--dp_degree", type=int, default=1, help="Data Parallelism degree.") + parser.add_argument( + "--mp_degree", type=int, default=1, help="Model Parallelism degree. Spliting the linear layers to many cards." + ) + parser.add_argument( + "--pp_degree", + type=int, + default=1, + help="Pipeline Parallelism degree. Spliting the model layers to different parts.", + ) + parser.add_argument( + "--use_recompute", type=strtobool, nargs="?", const=False, help="Using the recompute to save the memory." 
+ ) + + parser.add_argument( + "--recompute_partition", + type=strtobool, + nargs="?", + const=False, + help="use recompute_partition to support mp partition when use_recompute is True .", + ) + + parser.add_argument( + "--recompute_offload", + type=strtobool, + nargs="?", + const=False, + help="use recompute_offload to save the memory by offload when use_recompute is True .", + ) + + parser.add_argument( + "--resume_dir", + default="", + type=str, + required=False, + help="The resume directory where the checkpoint will be resume.", + ) + + # Pure FP16 config + parser.add_argument( + "--use_pure_fp16", type=strtobool, nargs="?", const=False, help="Enable pure fp16 precision training." + ) + + parser.add_argument( + "--scale_loss", + type=float, + default=32768, + help="The value of scale_loss for fp16. This is only used for AMP training.", + ) + + parser.add_argument( + "--sharding_offload", type=strtobool, nargs="?", const=False, help="use sharding stage2 cpu offload strategy." + ) + + parser.add_argument("--hidden_dropout_prob", type=float, default=0.1, help="The hidden dropout prob.") + + parser.add_argument( + "--attention_probs_dropout_prob", type=float, default=0.1, help="The attention probs dropout prob." + ) + + # MOE config + parser.add_argument("--num_experts", type=int, default=1, help="number of experts per worker") + + parser.add_argument("--top_k", type=int, default=2, help="top_k for moe gate") + + parser.add_argument("--expert_mode", type=strtobool, nargs="?", const=False, help="Enable Moe mode.") + + parser.add_argument( + "--balance_loss_weight", + default=1.0, + type=float, + help="The auxiliary loss generated by gate strategy to help balance experts.", + ) + + parser.add_argument( + "--gate", + type=str, + default="gshard", + choices=["naive", "gshard", "switch"], + help="select naive, gshard, switch gate strategy.", + ) + + # Other config + parser.add_argument("--seed", type=int, default=1234, help="Random seed for initialization") + parser.add_argument( + "--check_accuracy", type=strtobool, nargs="?", const=False, help="Check accuracy for training process." + ) + parser.add_argument( + "--device", type=str, default="gpu", choices=["cpu", "gpu", "xpu"], help="select cpu, gpu, xpu devices." + ) + parser.add_argument( + "--lr_decay_style", type=str, default="cosine", choices=["cosine", "none"], help="Learning rate decay style." + ) + + args = parser.parse_args() + args.test_iters = args.eval_iters * 10 + + # process batch size + process_batch_size(args) + + if args.check_accuracy: + if args.hidden_dropout_prob != 0: + args.hidden_dropout_prob = 0.0 + logger.warning("The hidden_dropout_prob should set to 0 for accuracy checking.") + if args.attention_probs_dropout_prob != 0: + args.attention_probs_dropout_prob = 0.0 + logger.warning("The attention_probs_dropout_prob should set to 0 for accuracy checking.") + + logger.info("{:20}:{}".format("paddle commit id", paddle.version.commit)) + for arg in vars(args): + logger.info("{:20}:{}".format(arg, getattr(args, arg))) + + return args diff --git a/examples/language_model/moe/dygraph/checkpointing.py b/examples/language_model/moe/dygraph/checkpointing.py new file mode 100644 index 0000000000000000000000000000000000000000..e9ebb7eae0888ec6223410f510587c3957c8b297 --- /dev/null +++ b/examples/language_model/moe/dygraph/checkpointing.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import paddle + + +def save_checkpoint( + args, + global_step, + model, + optimizer, + lr_scheduler, + tokenizer, + loss_scale, + dp_rank, + mp_rank, + pp_rank, + pass_num, + file_id, + epoch, +): + """save some state for each rank.""" + + assert args.output_dir is not None, "output_dir is not valid." + output_dir = os.path.join(args.output_dir, "step_{}".format(global_step)) + os.makedirs(output_dir, exist_ok=True) + + state_dict = {} + state_dict["args"] = args + state_dict["global_step"] = global_step + state_dict["loss_scale"] = loss_scale + state_dict["data_meta"] = {"pass_num": pass_num, "file_id": file_id, "start_epoch": epoch} + + if optimizer is not None: + state_dict["optimizer"] = optimizer.state_dict() + + if lr_scheduler is not None: + state_dict["lr_scheduler"] = lr_scheduler.state_dict() + + if args.pp_degree > 1: + path = os.path.join(output_dir, "dp_{}_mp_{}_pp_{}".format(dp_rank, mp_rank, pp_rank)) + # model.save_state_dict(path) + paddle.save(model.state_dict(), os.path.join(path, "model_state.pdparams")) + tokenizer.save_pretrained(path) + else: + path = os.path.join(output_dir, "dp_{}_mp_{}".format(dp_rank, mp_rank)) + tokenizer.save_pretrained(path) + paddle.save(model.state_dict(), os.path.join(path, "model_state.pdparams")) + + state_save_path = os.path.join(path, "meta_state.pdopt") + paddle.save(state_dict, state_save_path) + + +def load_checkpoint(args, model, optimizer, lr_scheduler, tokenizer, dp_rank, mp_rank, pp_rank): + """load checkpoint for all rank.""" + + assert args.resume_dir is not None and len(args.resume_dir) > 0, "resume_dir is not valid." + assert os.path.exists(args.resume_dir) and os.path.isdir(args.resume_dir), "resume_dir not exists or not a dir." 
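+    # `resume_dir` is expected to be a `<output_dir>/step_<global_step>` directory written by
+    # save_checkpoint() above, containing one `dp_<dp_rank>_mp_<mp_rank>[_pp_<pp_rank>]`
+    # subdirectory per rank.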
+ + load_path = None + if args.pp_degree > 1: + load_path = os.path.join(args.resume_dir, "dp_{}_mp_{}_pp_{}".format(dp_rank, mp_rank, pp_rank)) + # model.set_state_dir(load_path) + model.set_state_dict(paddle.load(os.path.join(load_path, "model_state.pdparams"))) + else: + load_path = os.path.join(args.resume_dir, "dp_{}_mp_{}".format(dp_rank, mp_rank)) + model.set_state_dict(paddle.load(os.path.join(load_path, "model_state.pdparams"))) + + tokenizer.from_pretrained(load_path) + state_dict = paddle.load(os.path.join(load_path, "meta_state.pdopt")) + + if optimizer is not None: + optimizer.set_state_dict(state_dict["optimizer"]) + + if lr_scheduler is not None: + lr_scheduler.set_state_dict(state_dict["lr_scheduler"]) + + global_step = state_dict["global_step"] + args.seed = state_dict["args"].seed + loss_scale = state_dict["loss_scale"] + + resume_step = int(args.resume_dir.strip("/").split("_")[-1]) + if resume_step != global_step: + print("Warning: resume_step is {}, but the step of checkpoint is {}.".format(resume_step, global_step)) + + return global_step, loss_scale, state_dict["data_meta"] diff --git a/examples/language_model/moe/dygraph/dataset.py b/examples/language_model/moe/dygraph/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..bf8df24e7f431b5db1f02d5f0247d82c0408956e --- /dev/null +++ b/examples/language_model/moe/dygraph/dataset.py @@ -0,0 +1,399 @@ +# Copyright (c) 2020, NVIDIA CORPORATION. +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import numpy as np +import paddle +from paddle.io import DataLoader + +from paddlenlp.data import Stack, Tuple +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + + +def construct_samples_and_shuffle_data( + name, data_prefix, documents, sizes, num_samples, seq_length, seed, build_data_file +): + """ + documents: document index from 0 to len(docs) + sizes: the length list of all docs. + num_samples: total step*bs iterations of data. + seq_length: the sequence length. + sum(sizes) = tokens_per_epoch + data_nums = num_samples * micro_batch_size + num_epochs = (data_nums + 1) // sum(sizes) + len(doc_idx) = num_epochs * sum(sizes) + """ + # Number of tokens in each epoch and number of required epochs. + tokens_per_epoch = _num_tokens(documents, sizes) + num_epochs = _num_epochs(tokens_per_epoch, seq_length, num_samples) + # Rng state + np_rng = np.random.RandomState(seed=seed) + + # Filename of the index mappings. + _filename = data_prefix + _filename += "_{}_indexmap".format(name) + _filename += "_{}ns".format(num_samples) + _filename += "_{}sl".format(seq_length) + doc_idx_filename = _filename + "_doc_idx.npy" + sample_idx_filename = _filename + "_sample_idx.npy" + shuffle_idx_filename = _filename + "_shuffle_idx.npy" + + # Sava random state + savedState = np_rng.get_state() + # Build the indexed mapping if not exist. 
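+    # Three index arrays are cached on disk: doc_idx (the document order repeated for every
+    # epoch), sample_idx (for each sample, its starting position as a doc_idx index plus a
+    # token offset), and shuffle_idx (a permutation that shuffles the samples).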
+ if build_data_file: + if ( + (not os.path.isfile(doc_idx_filename)) + or (not os.path.isfile(sample_idx_filename)) + or (not os.path.isfile(shuffle_idx_filename)) + ): + if num_epochs == 1: + separate_last_epoch = False + else: + num_samples_from_epochs_minus_one = ((num_epochs - 1) * tokens_per_epoch - 1) // seq_length + last_epoch_num_samples = num_samples - num_samples_from_epochs_minus_one + assert last_epoch_num_samples >= 0, "last epoch number of samples should be non-negative." + num_samples_per_epoch = (tokens_per_epoch - 1) // seq_length + assert last_epoch_num_samples < ( + num_samples_per_epoch + 1 + ), "last epoch number of samples exceeded max value." + separate_last_epoch = last_epoch_num_samples < int(0.80 * num_samples_per_epoch) + # Note. len(doc_idx) = num_epochs * len(doc) + doc_idx = _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch) + np.save(doc_idx_filename, doc_idx, allow_pickle=True) + + # sample-idx. pos of each seq_len of data. + assert doc_idx.dtype == np.int32 + sample_idx = _build_sample_idx(sizes, doc_idx, seq_length, num_epochs, tokens_per_epoch) + np.save(sample_idx_filename, sample_idx, allow_pickle=True) + + if separate_last_epoch: + num_samples_ = num_samples_from_epochs_minus_one + else: + num_samples_ = sample_idx.shape[0] - 1 + + # Shuffle all seq len data. + shuffle_idx = _build_shuffle_idx(num_samples_, sample_idx.shape[0] - 1, np_rng) + np.save(shuffle_idx_filename, shuffle_idx, allow_pickle=True) + else: + while True: + if ( + (not os.path.isfile(doc_idx_filename)) + or (not os.path.isfile(sample_idx_filename)) + or (not os.path.isfile(shuffle_idx_filename)) + ): + time.sleep(3) + else: + break + + # Restore random state + np_rng.set_state(savedState) + + if paddle.distributed.get_world_size() > 1: + if paddle.in_dynamic_mode(): + paddle.distributed.barrier() + + # Load mappings. + doc_idx = np.load(doc_idx_filename, allow_pickle=True, mmap_mode="r") + sample_idx = np.load(sample_idx_filename, allow_pickle=True, mmap_mode="r") + shuffle_idx = np.load(shuffle_idx_filename, allow_pickle=True, mmap_mode="r") + return doc_idx, sample_idx, shuffle_idx + + +def _num_tokens(documents, lens): + """Total number of tokens in the dataset.""" + return np.sum(lens[documents]) + + +def _num_epochs(tokens_per_epoch, seq_length, num_samples): + """Based on number of samples and sequence lenght, calculate how many + epochs will be needed.""" + num_epochs = 0 + total_tokens = 0 + while True: + num_epochs += 1 + total_tokens += tokens_per_epoch + if ((total_tokens - 1) // seq_length) >= num_samples: + return num_epochs + + +def _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch): + """ + Build an array with length = number-of-epochs * number-of-documents. + Each index is mapped to a corresponding document. + """ + if not separate_last_epoch or num_epochs == 1: + doc_idx = np.mgrid[0:num_epochs, 0 : len(documents)][1] + doc_idx[:] = documents + # The documents repeat num_epochs times. + doc_idx = doc_idx.reshape(-1) + doc_idx = doc_idx.astype(np.int32) + return doc_idx + + doc_idx_first = _build_doc_idx(documents, num_epochs - 1, np_rng, False) + doc_idx_last = _build_doc_idx(documents, 1, np_rng, False) + return np.concatenate((doc_idx_first, doc_idx_last)) + + +def _build_sample_idx(sizes, doc_idx, seq_length, num_epochs, tokens_per_epoch): + """ + num_samples + 1, pos of bs data + the distance between two points for sample idx is bs tokens. 
+ """ + num_samples = (num_epochs * tokens_per_epoch - 1) // seq_length + sample_idx = np.zeros([int(num_samples) + 1, 2], dtype=np.int32) + + sample_index = 0 + doc_idx_index = 0 + doc_offset = 0 + sample_idx[sample_index][0] = doc_idx_index + sample_idx[sample_index][1] = doc_offset + sample_index += 1 + while sample_index <= num_samples: + remaining_seq_length = seq_length + 1 + while remaining_seq_length != 0: + doc_id = doc_idx[doc_idx_index] + doc_length = sizes[doc_id] - doc_offset + remaining_seq_length -= doc_length + if remaining_seq_length <= 0: + doc_offset += remaining_seq_length + doc_length - 1 + remaining_seq_length = 0 + else: + doc_idx_index += 1 + doc_offset = 0 + sample_idx[sample_index][0] = doc_idx_index + sample_idx[sample_index][1] = doc_offset + sample_index += 1 + + return sample_idx + + +def _build_shuffle_idx(num_samples, total_size, np_rng): + dtype_ = np.uint32 + if total_size >= (np.iinfo(np.uint32).max - 1): + dtype_ = np.int64 + + shuffle_idx_first = np.arange(start=0, stop=num_samples, step=1, dtype=dtype_) + np_rng.shuffle(shuffle_idx_first) + if num_samples == total_size: + return shuffle_idx_first + + shuffle_idx_last = np.arange(start=num_samples, stop=total_size, step=1, dtype=dtype_) + np_rng.shuffle(shuffle_idx_last) + + return np.concatenate((shuffle_idx_first, shuffle_idx_last)) + + +def get_train_valid_test_split_(splits_string, size): + """Get dataset splits from comma or '/' separated string list.""" + + splits = [] + if splits_string.find(",") != -1: + splits = [float(s) for s in splits_string.split(",")] + elif splits_string.find("/") != -1: + splits = [float(s) for s in splits_string.split("/")] + else: + splits = [float(splits_string)] + while len(splits) < 3: + splits.append(0.0) + splits = splits[:3] + splits_sum = sum(splits) + assert splits_sum > 0.0 + splits = [split / splits_sum for split in splits] + splits_index = [0] + for index, split in enumerate(splits): + splits_index.append(splits_index[index] + int(round(split * float(size)))) + diff = splits_index[-1] - size + for index in range(1, len(splits_index)): + splits_index[index] -= diff + assert len(splits_index) == 4 + assert splits_index[-1] == size + return splits_index + + +def create_pretrained_dataset( + args, + input_path, + local_rank, + data_world_rank, + data_world_size, + eos_id, + worker_init=None, + max_seq_len=1024, + places=None, + data_holders=None, +): + device_world_size = paddle.distributed.get_world_size() + + logger.info( + "The distributed run, total device num:{}, distinct dataflow num:{}.".format( + device_world_size, data_world_size + ) + ) + + process_data = np.load(input_path, mmap_mode="r+", allow_pickle=True) + # All documment ids, extend as 1-D array. 
+ sample_ids = process_data["ids"] + # The len(sample_lens) num of docs + # The sum(sample_lens) should equal len(sample_ids) + sample_lens = process_data["lens"] + + splits = get_train_valid_test_split_(args.split, len(sample_lens)) + assert len(sample_lens) >= splits[-1], "The document nums should larger than max of splits, but %s < %s" % ( + len(sample_lens), + splits[-1], + ) + + def build_dataset(index, name, num_samples): + dataset = GPTDataset( + file_path=input_path, + build_data_file=local_rank == 0, + name="gpt_" + name, + max_seq_len=max_seq_len, + num_samples=num_samples, + documents=np.arange(splits[index], splits[index + 1]), + sample_ids=sample_ids, + sample_lens=sample_lens, + eos_id=eos_id, + seed=args.seed, + ) + + batch_sampler = DistributedBatchSampler( + dataset, + batch_size=args.local_batch_size, + num_replicas=data_world_size, + rank=data_world_rank, + shuffle=False, + drop_last=True, + ) + + data_loader = DataLoader( + dataset=dataset, + places=places, + feed_list=data_holders, + batch_sampler=batch_sampler, + num_workers=1, + worker_init_fn=worker_init, + # collate_fn=Tuple(Stack(), Stack(), Stack(), Stack(), Stack()), + collate_fn=Tuple(Stack(), Stack(), Stack()), + return_list=False, + ) + return data_loader + + # Note, data should be broardcast to all devices. + # for train, valid, test, the distinct data num is data_world_size + train_data_loader = build_dataset(0, "train", args.local_batch_size * args.max_steps * data_world_size) + + valid_data_loader = build_dataset( + 1, "valid", args.local_batch_size * (args.max_steps // args.eval_freq + 1) * args.eval_iters * data_world_size + ) + test_data_loader = build_dataset(2, "test", args.local_batch_size * args.test_iters * data_world_size) + + return train_data_loader, valid_data_loader, test_data_loader + + +class GPTDataset(paddle.io.Dataset): + def __init__( + self, + file_path, + num_samples, + eos_id, + sample_ids, + sample_lens, + documents=None, + build_data_file=False, + name="gpt", + max_seq_len=1024, + seed=1234, + ): + self.file_path = file_path + self.max_seq_len = max_seq_len + self.name = name + self.eos_id = eos_id + self.sample_ids = sample_ids + self.sample_lens = sample_lens + if documents is None: + document_ids = np.arange(0, self.sample_lens.shape[0]) + else: + document_ids = documents + + self.doc_idx, self.sample_idx, self.shuffle_idx = construct_samples_and_shuffle_data( + self.name, self.file_path, document_ids, self.sample_lens, num_samples, max_seq_len, seed, build_data_file + ) + + # The doc cumsum start pos + self.start_pos = [0] + np.cumsum(self.sample_lens).tolist() + + def _construct_sample(self, tokens): + tokens = np.array(tokens).astype("int64").tolist() + labels = tokens[1:] + tokens = tokens[:-1] + seq_length = len(tokens) + # Attention mask for the attention calulate + # attention_mask = np.tri(seq_length, seq_length).reshape((1, seq_length, + # seq_length)) + + # The pad and eos tokens do not contribute the loss + loss_mask = np.ones(seq_length, dtype="float32") + loss_mask[np.where(np.array(tokens) == self.eos_id)] = 0.0 + # position_ids = np.arange(0, seq_length, dtype="int64") + + # attention_mask = (attention_mask - 1.0) * 1e9 + # attention_mask = attention_mask.astype("float32") + # return [tokens, loss_mask, attention_mask, position_ids, labels] + return [tokens, loss_mask, labels] + + def _get_single_sample_from_idx(self, doc_index_f, doc_index_l, offset_f, offset_l): + """ + The input means: + doc_index_f: data from the first doc. 
+ doc_index_l: data from the last doc. + offset_f: offset of the first doc. + offset_l: offset of the last doc. + """ + # Data from the sample doc. just select the needed ids. + if doc_index_f == doc_index_l: + current_start_pos = self.start_pos[self.doc_idx[doc_index_f]] + return self.sample_ids[current_start_pos + offset_f : current_start_pos + offset_l + 1].tolist() + + # Data from multi docs. + else: + current_start_pos = self.start_pos[self.doc_idx[doc_index_f]] + next_start_pos = self.start_pos[self.doc_idx[doc_index_f] + 1] + tokens = self.sample_ids[current_start_pos + offset_f : next_start_pos].tolist() + for i in range(doc_index_f + 1, doc_index_l): + current_start_pos = self.start_pos[self.doc_idx[i]] + next_start_pos = self.start_pos[self.doc_idx[i] + 1] + tokens.extend(self.sample_ids[current_start_pos:next_start_pos].tolist()) + last_start_pos = self.start_pos[self.doc_idx[doc_index_l]] + tokens.extend(self.sample_ids[last_start_pos : last_start_pos + offset_l + 1].tolist()) + + return tokens + + def __getitem__(self, index): + idx = self.shuffle_idx[index] + # Start and end documents and offsets. + doc_index_f = self.sample_idx[idx][0] + doc_index_l = self.sample_idx[idx + 1][0] + offset_f = self.sample_idx[idx][1] + offset_l = self.sample_idx[idx + 1][1] + tokens = self._get_single_sample_from_idx(doc_index_f, doc_index_l, offset_f, offset_l) + return self._construct_sample(tokens) + + def __len__(self): + return self.sample_idx.shape[0] - 1 diff --git a/examples/language_model/moe/dygraph/framework/__init__.py b/examples/language_model/moe/dygraph/framework/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e10badb568d2b8019fe5e661943d2ea15ff05154 --- /dev/null +++ b/examples/language_model/moe/dygraph/framework/__init__.py @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .storage_process import assign_group_by_size +from .storage_process import flatten_dense_tensors +from .storage_process import obtain_storage +from paddle.optimizer import AdamW +from .group_sharded import group_sharded_parallel diff --git a/examples/language_model/moe/dygraph/framework/group_sharded.py b/examples/language_model/moe/dygraph/framework/group_sharded.py new file mode 100644 index 0000000000000000000000000000000000000000..27e753abf3f35056f31ca9a9e98d900b565b554e --- /dev/null +++ b/examples/language_model/moe/dygraph/framework/group_sharded.py @@ -0,0 +1,150 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from types import MethodType
+
+import paddle
+
+# New version
+from paddle.distributed.fleet.meta_parallel.sharding.group_sharded_optimizer_stage2 import (
+    GroupShardedOptimizerStage2,
+)
+from paddle.distributed.fleet.meta_parallel.sharding.group_sharded_stage2 import (
+    GroupShardedStage2,
+)
+from paddle.framework import core
+from paddle.incubate.distributed.models.moe.grad_clip import ClipGradForMOEByGlobalNorm
+from paddle.optimizer import Optimizer
+
+
+class ClipGradForShardedMOEByGlobalNorm(ClipGradForMOEByGlobalNorm):
+    @paddle.no_grad()
+    def _dygraph_clip(self, params_grads):
+        normal_params_grads = []
+        moe_params_grads = []
+
+        # separate moe params from normal params
+        if self.moe_group is not None and self.moe_group.nranks > 1:
+            for p, g in params_grads:
+                if self.is_expert_param_func(p):
+                    moe_params_grads.append((p, g))
+                else:
+                    normal_params_grads.append((p, g))
+        else:
+            normal_params_grads = params_grads
+
+        # why to return sum_dtype?
+        # we will call `get_l2_norm_pow` twice and the precisions may be different.
+        # For convenience and simplification, we use sum_dtype directly instead of global_norm_var_normal.dtype
+        global_norm_var_normal, sum_dtype = self.get_l2_norm_pow(normal_params_grads)
+        if global_norm_var_normal is not None:
+            paddle.distributed.all_reduce(
+                global_norm_var_normal, op=paddle.distributed.ReduceOp.SUM, group=self.moe_group
+            )
+
+        global_norm_var_moe = None
+        if len(moe_params_grads) > 0:
+            global_norm_var_moe, _ = self.get_l2_norm_pow(moe_params_grads, sum_dtype)
+            if global_norm_var_moe is not None:
+                paddle.distributed.all_reduce(
+                    global_norm_var_moe, op=paddle.distributed.ReduceOp.SUM, group=self.moe_group
+                )
+
+        if global_norm_var_normal is None and global_norm_var_moe is None:
+            return params_grads
+        elif global_norm_var_normal is None:
+            global_norm_var = global_norm_var_moe
+        elif global_norm_var_moe is None:
+            global_norm_var = global_norm_var_normal
+        else:
+            if global_norm_var_normal.dtype != global_norm_var_moe.dtype:
+                # compared with normal norm, moe norm is the later one,
+                # so its precision is no lower than normal norm
+                global_norm_var_normal = global_norm_var_normal.astype(global_norm_var_moe.dtype)
+            global_norm_var = global_norm_var_normal + global_norm_var_moe
+
+        params_and_grads = []
+        global_norm_var = paddle.sqrt(global_norm_var)
+        max_global_norm = paddle.full(shape=[1], dtype=global_norm_var.dtype, fill_value=self.clip_norm)
+        # scale gradients by clip_norm / max(global_norm, clip_norm)
+        clip_var = paddle.divide(x=max_global_norm, y=paddle.maximum(x=global_norm_var, y=max_global_norm))
+        for p, g in params_grads:
+            if g is None:
+                continue
+            if getattr(p, "need_clip", True) is False:
+                params_and_grads.append((p, g))
+                continue
+            # TODO(wangxi): use inplace elementwise_mul
+            clip_input = clip_var.astype("float16") if g.dtype == core.VarDesc.VarType.FP16 else clip_var
+            new_grad = paddle.multiply(x=g, y=clip_input)
+            params_and_grads.append((p, new_grad))
+        return params_and_grads
+
+
+def group_sharded_parallel(
+    model, optimizer, group=None, offload=False, sync_buffers=False, buffer_max_size=2**23, segment_size=2**20
+):
+
+    # check option types
+    assert isinstance(model, paddle.nn.Layer), "The model must be the instance of paddle.nn.Layer."
+    assert isinstance(optimizer, Optimizer), "The optimizer must be the instance of paddle.optimizer.Optimizer."
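+
+    # Parameters whose names contain "expert" or "gate" are excluded from sharding and
+    # stay replicated on every rank; only the remaining dense parameters are handed to
+    # GroupShardedOptimizerStage2 / GroupShardedStage2 below. clear_grad is patched at
+    # the end so that gradients of the excluded expert/gate parameters are cleared as well.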
+ + def check_dtype(param): + return param.dtype == paddle.float16 + + sharded_params = [] + pretreated_params = [] + for p in optimizer._parameter_list: + if "expert" not in p.name and "gate" not in p.name: + sharded_params.append(p) + else: + pretreated_params.append(p) + + opt_gc = optimizer._grad_clip + if opt_gc is not None: + optimizer._grad_clip = ClipGradForShardedMOEByGlobalNorm( + opt_gc.clip_norm, opt_gc.is_expert_param_func, opt_gc.moe_group, opt_gc.group_name + ) + + # convert model/optimizer + optimizer = GroupShardedOptimizerStage2(params=sharded_params, optim=optimizer, group=group, offload=offload) + model = GroupShardedStage2( + model, optimizer, group=group, sync_buffers=sync_buffers, buffer_max_size=buffer_max_size + ) + + clear_func = model._clear_gradients + for opt in model._sharding_optimizers: + + def _opt_clear(self): + clear_func() + for p in pretreated_params: + if p.grad is not None: + p.grad.zero_() + + opt.clear_grad = MethodType(_opt_clear, opt) + + return model, optimizer diff --git a/examples/language_model/moe/dygraph/framework/storage_process.py b/examples/language_model/moe/dygraph/framework/storage_process.py new file mode 100644 index 0000000000000000000000000000000000000000..11bb790d14e1b1c5807db81d7e8791177d8155d9 --- /dev/null +++ b/examples/language_model/moe/dygraph/framework/storage_process.py @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import OrderedDict + +import numpy as np +from paddle.distributed.fleet.meta_parallel.sharding.group_sharded_storage import ( + GradStorage, + ParamStorage, +) +from paddle.distributed.fleet.meta_parallel.sharding.group_sharded_utils import Type +from paddle.framework import core + +alignment = { + "gpu": 256, +} +align = { + Type.fp16.value: 2, + Type.fp32.value: 4, +} + + +def assign_group_by_size(parameters, group_size=256 * 1024 * 1024): + is_sparse_gradient = [False] * len(parameters) + + group_indices = core.eager_assign_group_by_size(parameters, is_sparse_gradient, [group_size, group_size]) + + var_groups = OrderedDict() + for group_idx, indices in enumerate(group_indices): + for index in indices: + var_groups.setdefault(group_idx, []).append(parameters[index]) + return var_groups + + +def flatten_dense_tensors(parameters): + _buffer_size = 0 + _param2align = {} + dtype = parameters[0].dtype + + for param in parameters: + assert param.trainable, "param must be trainable..." 
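+        # Pad each parameter up to the 256-byte GPU alignment boundary: `size` is the
+        # parameter size in bytes, `ali` is the padding in bytes, and `align_` is that
+        # padding converted back to an element count, which is added to the flat buffer
+        # size and recorded per parameter for add_rank_params / add_grad below.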
+ size = np.prod(param.shape) * align[dtype] + remaining = size % alignment["gpu"] + ali = 0 if remaining == 0 else alignment["gpu"] - remaining + align_ = ali // align[dtype] + _buffer_size += np.prod(param.shape) + align_ + _param2align[param.name] = align_ + + param_storage = ParamStorage(size=_buffer_size, dtype=dtype, device="gpu") + + param_storage.add_rank_params(parameters, _param2align) + + # process gradient + grad_storage = GradStorage(size=_buffer_size, dtype=dtype, device="gpu", destination="0", parm2align=_param2align) + + for param in parameters: + grad_storage.add_grad(param, _param2align[param.name]) + + # param_storage --> grad_storage + param_storage.buffer._copy_gradient_from(grad_storage.buffer) + param_storage.buffer.stop_gradient = False + return param_storage, grad_storage + + +def obtain_storage(parameters): + if len(parameters) < 1: + return [] + + var_groups = assign_group_by_size(parameters) + storage = [] + for group_idx, parameters in var_groups.items(): + param_storage, grad_storage = flatten_dense_tensors(parameters) + storage.append(param_storage.buffer) + return storage diff --git a/examples/language_model/moe/dygraph/lr.py b/examples/language_model/moe/dygraph/lr.py new file mode 100644 index 0000000000000000000000000000000000000000..09070bccaa3331de13d53229f225626fcc714509 --- /dev/null +++ b/examples/language_model/moe/dygraph/lr.py @@ -0,0 +1 @@ +../../gpt/lr.py \ No newline at end of file diff --git a/examples/language_model/moe/dygraph/modeling.py b/examples/language_model/moe/dygraph/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..b45c9a465e70e0a4d51c2959de4948fc82528232 --- /dev/null +++ b/examples/language_model/moe/dygraph/modeling.py @@ -0,0 +1,1070 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
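+
+# GPT components for the MoE example: the standard GPT decoder stack, where each
+# TransformerDecoderLayer can optionally replace its feed-forward block with a MoELayer
+# built from per-rank ExpertLayer experts (expert_mode=True).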
+ +import collections + +import paddle +import paddle.incubate as incubate +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.tensor as tensor +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import ( + LayerDesc, + PipelineLayer, + SharedLayerDesc, + get_rng_state_tracker, +) +from paddle.incubate.distributed.models import moe +from paddle.nn.layer.transformer import _convert_param_attr_to_list + +from paddlenlp.transformers import PretrainedModel, register_base_model + +MoeLayer = moe.MoELayer + +__all__ = [ + "GPTModel", + "GPTPretrainedModel", + "GPTForPretraining", + "GPTPretrainingCriterion", + "GPTForGreedyGeneration", + "GPTLMHeadModel", +] + + +class ExpertLayer(nn.Layer): + def __init__(self, d_model, d_hidden, name=None, rank=0, windex=0, num_expert=1): + super(ExpertLayer, self).__init__() + + self.htoh4 = nn.Linear( + d_model, + d_hidden, + weight_attr=nn.initializer.KaimingUniform(), + bias_attr=nn.initializer.Constant(value=0.0), + ) + self.h4toh = nn.Linear( + d_hidden, + d_model, + weight_attr=nn.initializer.KaimingUniform(), + bias_attr=nn.initializer.Constant(value=0.0), + ) + self.htoh4.weight.name = "expert_" + self.htoh4.weight.name + self.h4toh.weight.name = "expert_" + self.h4toh.weight.name + self.htoh4.bias.name = "expert_" + self.htoh4.bias.name + self.h4toh.bias.name = "expert_" + self.h4toh.bias.name + + def forward(self, x): + x = self.htoh4(x) + x = F.gelu(x, approximate=True) + x = self.h4toh(x) + return x + + +def parallel_matmul(lm_output, logit_weights, parallel_output): + hcg = fleet.get_hybrid_communicate_group() + model_parallel_group = hcg.get_model_parallel_group() + world_size = hcg.get_model_parallel_world_size() + + if world_size > 1: + input_parallel = paddle.distributed.collective._c_identity(lm_output, group=model_parallel_group) + + logits = paddle.matmul(input_parallel, logit_weights, transpose_y=True) + + if parallel_output: + return logits + + return paddle.distributed.collective._c_concat(logits, group=model_parallel_group) + else: + logits = paddle.matmul(lm_output, logit_weights, transpose_y=True) + return logits + + +class MultiHeadAttention(nn.Layer): + """ + Attention mapps queries and a set of key-value pairs to outputs, and + Multi-Head Attention performs multiple parallel attention to jointly attending + to information from different representation subspaces. 
+ + """ + + Cache = collections.namedtuple("Cache", ["k", "v"]) + StaticCache = collections.namedtuple("StaticCache", ["k", "v"]) + + def __init__( + self, + embed_dim, + num_heads, + dropout=0.0, + kdim=None, + vdim=None, + need_weights=False, + weight_attr=None, + bias_attr=None, + fuse=True, + num_partitions=1, + ): + super(MultiHeadAttention, self).__init__() + self.embed_dim = embed_dim + self.kdim = kdim if kdim is not None else embed_dim + self.vdim = vdim if vdim is not None else embed_dim + self.num_heads = num_heads + self.dropout = dropout + self.need_weights = need_weights + self.fuse = fuse + + self.head_dim = embed_dim // num_heads + assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads" + + assert self.num_heads % num_partitions == 0 + self.num_heads = self.num_heads // num_partitions + + if self.fuse: + assert self.kdim == embed_dim, "embed_dim should be equal to kdim" + assert self.vdim == embed_dim, "embed_dim should be equal to vidm" + + self.qkv_proj = fleet.meta_parallel.ColumnParallelLinear( + embed_dim, 3 * embed_dim, weight_attr=weight_attr, has_bias=True, gather_output=False + ) + else: + self.q_proj = fleet.meta_parallel.ColumnParallelLinear( + embed_dim, embed_dim, weight_attr=weight_attr, has_bias=True, gather_output=False + ) + + self.k_proj = fleet.meta_parallel.ColumnParallelLinear( + self.kdim, embed_dim, weight_attr=weight_attr, has_bias=True, gather_output=False + ) + + self.v_proj = fleet.meta_parallel.ColumnParallelLinear( + self.vdim, embed_dim, weight_attr=weight_attr, has_bias=True, gather_output=False + ) + + self.out_proj = fleet.meta_parallel.RowParallelLinear( + embed_dim, embed_dim, weight_attr=weight_attr, has_bias=True, input_is_parallel=True + ) + + def _fuse_prepare_qkv(self, query): + mix_layer = self.qkv_proj(query) + mix_layer = paddle.reshape_(mix_layer, [0, 0, self.num_heads, 3 * self.head_dim]) + mix_layer = paddle.transpose(mix_layer, [0, 2, 1, 3]) + q, k, v = paddle.split(mix_layer, num_or_sections=3, axis=-1) + return q, k, v + + def _prepare_qkv(self, query, key, value, use_cache=False, cache=None): + r""" + Prapares linear projected queries, keys and values for usage of subsequnt + multiple parallel attention. If `cache` is not None, using cached results + to reduce redundant calculations. + + """ + q = self.q_proj(query) + q = tensor.reshape(x=q, shape=[0, 0, self.num_heads, self.head_dim]) + q = tensor.transpose(x=q, perm=[0, 2, 1, 3]) + + if isinstance(cache, self.StaticCache): + # for encoder-decoder attention in inference and has cached + k, v = cache.k, cache.v + else: + k, v = self.compute_kv(key, value) + + if isinstance(cache, self.Cache): + # for decoder self-attention in inference + k = tensor.concat([cache.k, k], axis=2) + v = tensor.concat([cache.v, v], axis=2) + if use_cache is True: + cache = self.Cache(k, v) + + return (q, k, v) if use_cache is False else (q, k, v, cache) + + def compute_kv(self, key, value): + r""" + Applies linear projection on input keys and values, then splits heads + (reshape and transpose) to get keys and values from different representation + subspaces. The results are used as key-values pairs for subsequent multiple + parallel attention. + + It is part of calculations in multi-head attention, and is provided as + a method to pre-compute and prefetch these results, thus we can use them + to construct cache for inference. 
+ + """ + k = self.k_proj(key) + v = self.v_proj(value) + k = tensor.reshape(x=k, shape=[0, 0, self.num_heads, self.head_dim]) + k = tensor.transpose(x=k, perm=[0, 2, 1, 3]) + v = tensor.reshape(x=v, shape=[0, 0, self.num_heads, self.head_dim]) + v = tensor.transpose(x=v, perm=[0, 2, 1, 3]) + return k, v + + def gen_cache(self, key, value=None, type=Cache): + """ + Generates cache for `forward` usage in inference accroding to arguments. + The generated cache is an instance of `MultiHeadAttention.Cache` or an + instance of `MultiHeadAttention.StaticCache`. + """ + if type == MultiHeadAttention.StaticCache: # static_kv + k, v = self.compute_kv(key, value) + return self.StaticCache(k, v) + elif value is None: # incremental_state + k = paddle.full(shape=[key.shape[0], self.num_heads, 0, self.head_dim], dtype=key.dtype, fill_value=0) + v = paddle.full(shape=[key.shape[0], self.num_heads, 0, self.head_dim], dtype=key.dtype, fill_value=0) + return self.Cache(k, v) + else: + # incremental_state with initial value, mainly for usage like UniLM + return self.Cache(key, value) + + def forward(self, query, key, value, attn_mask=None, use_cache=False, cache=None): + r""" + Applies multi-head attention to map queries and a set of key-value pairs + to outputs. + """ + key = query if key is None else key + value = query if value is None else value + # compute q ,k ,v + if use_cache is False: + if self.fuse: + q, k, v = self._fuse_prepare_qkv(query) + else: + q, k, v = self._prepare_qkv(query, key, value, use_cache, cache) + else: + q, k, v, cache = self._prepare_qkv(query, key, value, use_cache, cache) + # scale dot product attention + product = paddle.matmul(x=q, y=k, transpose_y=True) * (self.head_dim**-0.5) + + # if attn_mask is not None: + # product = product + attn_mask + # weights = F.softmax(product) + + weights = incubate.softmax_mask_fuse_upper_triangle(product) + + if self.dropout: + with get_rng_state_tracker().rng_state("local_seed"): + weights = F.dropout(weights, self.dropout, training=self.training, mode="upscale_in_train") + + out = tensor.matmul(weights, v) + + # combine heads + out = tensor.transpose(out, perm=[0, 2, 1, 3]) + out = tensor.reshape(x=out, shape=[0, 0, out.shape[2] * out.shape[3]]) + + # project to output + out = self.out_proj(out) + + outs = [out] + if self.need_weights: + outs.append(weights) + if use_cache: + outs.append(cache) + return out if len(outs) == 1 else tuple(outs) + + +class TransformerDecoder(nn.Layer): + """ + TransformerDecoder is a stack of N decoder layers. + """ + + def __init__(self, decoder_layers, num_layers, norm=None, hidden_size=None): + super(TransformerDecoder, self).__init__() + + self.num_layers = num_layers + self.layers = decoder_layers + self.norm = norm + if norm == "LayerNorm": + self.norm = nn.LayerNorm(hidden_size) + elif norm is not None: + raise ValueError("Only support LayerNorm") + self.checkpoints = [] + + def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, use_cache=False, cache=None): + r""" + Applies a stack of N Transformer decoder layers on inputs. If `norm` is + provided, also applies layer normalization on the output of last decoder + layer. 
+ """ + output = tgt + new_caches = [] + self.checkpoints = [] + + for i, mod in enumerate(self.layers): + if cache is None: + if use_cache: + output, new_cache = mod(output, memory, tgt_mask=tgt_mask, use_cache=use_cache, cache=cache) + new_caches.append(new_cache) + else: + output = mod(output, memory, tgt_mask=tgt_mask, use_cache=use_cache, cache=cache) + else: + output, new_cache = mod(output, memory, tgt_mask=tgt_mask, use_cache=use_cache, cache=cache[i]) + new_caches.append(new_cache) + self.checkpoints.append(output.name) + + if self.norm is not None: + output = self.norm(output) + return output if use_cache is False else (output, new_caches) + + def gen_cache(self, memory, do_zip=False): + r""" + Generates cache for `forward` usage. The generated cache is a list, and + each element in it is a tuple( :code:`(incremental_cache, static_cache)` ) + produced by `TransformerDecoderLayer.gen_cache`. See `TransformerDecoderLayer.gen_cache` + for more details. If `do_zip` is True, apply `zip` on these tuples to get + a list with two elements. + """ + cache = [layer.gen_cache(memory) for layer in self.layers] + if do_zip: + cache = list(zip(*cache)) + return cache + + +class TransformerDecoderLayer(nn.Layer): + """ + The transformer decoder layer. + + It contains multiheadattention and some linear layers. + """ + + def __init__( + self, + d_model, + nhead, + dim_feedforward, + dropout=0.1, + activation="gelu", + attn_dropout=None, + act_dropout=None, + normalize_before=True, + weight_attr=None, + bias_attr=None, + num_partitions=1, + expert_mode=False, + num_experts=1, + top_k=2, + hcg=None, + gate=None, + recompute_interval=0, + recompute_partition=False, + recompute_offload=False, + ): + self._config = locals() + self._config.pop("self") + self._config.pop("__class__", None) # py3 + + super(TransformerDecoderLayer, self).__init__() + attn_dropout = dropout if attn_dropout is None else attn_dropout + act_dropout = dropout if act_dropout is None else act_dropout + self.normalize_before = normalize_before + self.recompute_interval = recompute_interval + + # moe config + self.top_k = top_k + self.num_experts = num_experts + self.expert_mode = expert_mode + self.hcg = hcg + + weight_attrs = _convert_param_attr_to_list(weight_attr, 3) + bias_attrs = _convert_param_attr_to_list(bias_attr, 3) + + self.self_attn = MultiHeadAttention( + d_model, + nhead, + dropout=attn_dropout, + weight_attr=weight_attrs[0], + bias_attr=bias_attrs[0], + num_partitions=num_partitions, + ) + + if expert_mode: + experts_list = nn.LayerList() + for expi in range(num_experts): + exp_layer = ExpertLayer(d_model, dim_feedforward // top_k, windex=expi, num_expert=num_experts) + experts_list.append(exp_layer) + + moe_group = hcg.get_expert_parallel_group() + mp_group = hcg.get_model_parallel_group() + gate_config = { + "type": "gshard", + "top_k": top_k, + } + + recompute_ctx = {"mp_group": mp_group, "offload": recompute_offload, "partition": recompute_partition} + self.moe_mlp = MoeLayer( + d_model=d_model, + experts=experts_list, + gate=gate_config, + moe_group=moe_group, + mp_group=mp_group, + recompute_interval=self.recompute_interval, + recompute_ctx=recompute_ctx, + ) + else: + self.linear1 = fleet.meta_parallel.ColumnParallelLinear( + d_model, dim_feedforward, weight_attr=weight_attrs[2], gather_output=False, has_bias=True + ) + + self.linear2 = fleet.meta_parallel.RowParallelLinear( + dim_feedforward, d_model, weight_attr=weight_attrs[2], input_is_parallel=True, has_bias=True + ) + + self.norm1 = 
nn.LayerNorm(d_model, epsilon=1e-5) + self.norm2 = nn.LayerNorm(d_model, epsilon=1e-5) + self.dropout1 = nn.Dropout(dropout, mode="upscale_in_train") + self.dropout2 = nn.Dropout(act_dropout, mode="upscale_in_train") + self.activation = getattr(F, activation) + + def forward(self, tgt, memory=None, tgt_mask=None, use_cache=False, cache=None): + residual = tgt + + if self.normalize_before: + tgt = self.norm1(tgt) + + if use_cache is False: + tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, use_cache, cache) + else: + tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask, use_cache, cache) + + with get_rng_state_tracker().rng_state("global_seed"): + tgt = residual + self.dropout1(tgt) + + if not self.normalize_before: + tgt = self.norm1(tgt) + + residual = tgt + if self.normalize_before: + tgt = self.norm2(tgt) + + if self.expert_mode: + tgt = self.moe_mlp(tgt) + else: + with get_rng_state_tracker().rng_state("global_seed"): + tgt = self.dropout2(self.linear2(F.gelu(self.linear1(tgt), approximate=True))) + + tgt = residual + tgt + + if not self.normalize_before: + tgt = self.norm2(tgt) + + return tgt if use_cache is False else (tgt, incremental_cache) + + def gen_cache(self, memory): + incremental_cache = self.self_attn.gen_cache(memory, type=self.self_attn.Cache) + return incremental_cache + + +class GPTEmbeddings(nn.Layer): + """ + Include embeddings from word, position and token_type embeddings + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.02, + ): + super(GPTEmbeddings, self).__init__() + + self.word_embeddings = fleet.meta_parallel.VocabParallelEmbedding( + vocab_size, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Normal(mean=0.0, std=initializer_range)), + ) + + self.position_embeddings = nn.Embedding( + max_position_embeddings, + hidden_size, + weight_attr=paddle.ParamAttr( + name="pos_embeddings", initializer=nn.initializer.Normal(mean=0.0, std=initializer_range) + ), + ) + + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, input_ids, position_ids=None): + if position_ids is None: + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=-1) + position_ids = seq_length - ones + + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + embeddings = input_embedings + position_embeddings + + with get_rng_state_tracker().rng_state("global_seed"): + embeddings = self.dropout(embeddings) + + return embeddings + + +class GPTPretrainedModel(PretrainedModel): + """ + An abstract class for pretrained GPT models. It provides GPT related + `model_config_file`, `resource_files_names`, `pretrained_resource_files_map`, + `pretrained_init_configuration`, `base_model_prefix` for downloading and + loading pretrained models. See `PretrainedModel` for more details. 
+ """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + "gpt-cpm-large-cn": { # 2.6B + "vocab_size": 30000, + "hidden_size": 2560, + "num_hidden_layers": 32, + "num_attention_heads": 32, + "intermediate_size": 10240, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "pad_token_id": 0, + "eos_token_id": 7, + "bos_token_id": 0, + "eol_token_id": 3, + "num_partitions": 1, + }, + "gpt-cpm-small-cn-distill": { # 109M + "vocab_size": 30000, + "hidden_size": 768, + "num_hidden_layers": 12, + "num_attention_heads": 12, + "intermediate_size": 3072, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "pad_token_id": 0, + "eos_token_id": 7, + "bos_token_id": 0, + "eol_token_id": 3, + "num_partitions": 1, + }, + "gpt3-13B-en": { # 13B + "vocab_size": 50304, + "hidden_size": 5120, + "num_hidden_layers": 40, + "num_attention_heads": 128, + "intermediate_size": 20480, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "eos_token_id": 50256, + "eol_token_id": 198, + "num_partitions": 1, + }, + "gpt3-1.3B-en": { # 1.3B + "vocab_size": 50304, + "hidden_size": 2048, + "num_hidden_layers": 24, + "num_attention_heads": 16, + "intermediate_size": 8192, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "eos_token_id": 50256, + "eol_token_id": 198, + "num_partitions": 1, + }, + "gpt2-medium-en": { # 345M + "vocab_size": 50304, + "hidden_size": 1024, + "num_hidden_layers": 24, + "num_attention_heads": 16, + "intermediate_size": 4096, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "eos_token_id": 50256, + "eol_token_id": 198, + "num_partitions": 1, + }, + "gpt2-en": { # 117M + "vocab_size": 50304, + "hidden_size": 768, + "num_hidden_layers": 12, + "num_attention_heads": 12, + "intermediate_size": 3072, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "eos_token_id": 50256, + "eol_token_id": 198, + "num_partitions": 1, + }, + "gpt2-small-en": { # config for CE + "vocab_size": 50304, + "hidden_size": 1024, # 1024 + "num_hidden_layers": 8, # 4 + "num_attention_heads": 16, + "intermediate_size": 1024 * 4, # 4096, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "attention_probs_dropout_prob": 0.1, + "max_position_embeddings": 1024, + "type_vocab_size": 1, # no use + "initializer_range": 0.02, + "eos_token_id": 50256, + "eol_token_id": 198, + "num_partitions": 1, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + "gpt-cpm-large-cn": "https://paddlenlp.bj.bcebos.com/models/transformers/gpt/gpt-cpm-large-cn.pdparams", + "gpt-cpm-small-cn-distill": "https://paddlenlp.bj.bcebos.com/models/transformers/gpt/gpt-cpm-small-cn-distill.pdparams", + "gpt2-medium-en": 
"https://paddlenlp.bj.bcebos.com/models/transformers/gpt/gpt2-medium-en.pdparams", + } + } + base_model_prefix = "gpt" + + def _init_weights(self, layer): + """Initialization hook""" + # no hook + return + if isinstance(layer, (nn.Linear, nn.Embedding)): + # In the dygraph mode, use the `set_value` to reset the parameter directly, + # and reset the `state_dict` to update parameter in static mode. + if isinstance(layer.weight, paddle.Tensor): + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.gpt.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + + +@register_base_model +class GPTModel(GPTPretrainedModel): + """ + The base model of gpt. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.02, + pad_token_id=0, + eos_token_id=7, + bos_token_id=0, + eol_token_id=3, + num_partitions=1, + expert_mode=False, + num_experts=1, + top_k=2, + hcg=None, + gate=None, + recompute_interval=0, + recompute_partition=False, + recompute_offload=False, + ): + super(GPTModel, self).__init__() + + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + self.hidden_size = hidden_size + self.vocab_size = vocab_size + + self.embeddings = GPTEmbeddings( + vocab_size, + hidden_size, + hidden_dropout_prob, + max_position_embeddings, + type_vocab_size, + self.initializer_range, + ) + + decoder_layers = nn.LayerList() + for i in range(num_hidden_layers): + decoder_layers.append( + TransformerDecoderLayer( + d_model=hidden_size, + nhead=num_attention_heads, + dim_feedforward=intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=hidden_dropout_prob, + weight_attr=paddle.ParamAttr( + initializer=nn.initializer.Normal(mean=0.0, std=self.initializer_range) + ), + bias_attr=None, + num_partitions=num_partitions, + expert_mode=expert_mode, + num_experts=num_experts, + top_k=top_k, + hcg=hcg, + gate=gate, + recompute_interval=recompute_interval, + recompute_partition=recompute_partition, + recompute_offload=recompute_offload, + ) + ) + + self.decoder = TransformerDecoder(decoder_layers, num_hidden_layers, norm="LayerNorm", hidden_size=hidden_size) + + self.checkpoints = [] + + def forward(self, input_ids, position_ids=None, attention_mask=None, use_cache=False, cache=None): + self.checkpoints = [] + if position_ids is None: + past_length = 0 + if cache is not None: + past_length = paddle.shape(cache[0].k)[-2] + position_ids = paddle.arange(past_length, paddle.shape(input_ids)[-1] + past_length, dtype="int64") + position_ids = position_ids.unsqueeze(0) + # .expand_as(input_ids) + position_ids = paddle.expand_as(position_ids, input_ids) + embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids) + + encoder_outputs = self.decoder( + embedding_output, + memory=None, + # tgt_mask=attention_mask, + tgt_mask=None, + use_cache=use_cache, + cache=cache, + ) + self.checkpoints.extend(self.decoder.checkpoints) + return encoder_outputs + + +class GPTForPretraining(GPTPretrainedModel): + """ + The pretraining model of GPT. + + It returns some logits and cached_kvs. 
+ """ + + def __init__(self, gpt): + super(GPTForPretraining, self).__init__() + self.gpt = gpt + + def forward( + self, input_ids, position_ids=None, attention_mask=None, masked_positions=None, use_cache=False, cache=None + ): + outputs = self.gpt( + input_ids, position_ids=position_ids, attention_mask=attention_mask, use_cache=use_cache, cache=cache + ) + if use_cache: + encoder_outputs, cached_kvs = outputs[:2] + else: + encoder_outputs = outputs + + logits = parallel_matmul(encoder_outputs, self.gpt.embeddings.word_embeddings.weight, True) + + if use_cache: + return logits, cached_kvs + else: + return logits + + +class GPTPretrainingCriterion(paddle.nn.Layer): + """ + Criterion for GPT. + + It calculates the final loss. + """ + + def __init__(self): + super(GPTPretrainingCriterion, self).__init__() + self.loss_func = paddle.nn.CrossEntropyLoss(reduction="none") + self.parallel_loss_func = fleet.meta_parallel.ParallelCrossEntropy() + + def forward(self, prediction_scores, masked_lm_labels, loss_mask): + + hcg = fleet.get_hybrid_communicate_group() + mp_size = hcg.get_model_parallel_world_size() + if mp_size > 1: + masked_lm_loss = self.parallel_loss_func(prediction_scores, masked_lm_labels.unsqueeze(2)) + else: + masked_lm_loss = self.loss_func(prediction_scores, masked_lm_labels.unsqueeze(2)) + + loss_mask = loss_mask.reshape([-1]) + masked_lm_loss = paddle.sum(masked_lm_loss.reshape([-1]) * loss_mask) + loss = masked_lm_loss / loss_mask.sum() + return loss + + +class GPTForGreedyGeneration(GPTPretrainedModel): + """ + The generate model for GPT-2. + It use the greedy stategy and generate the next word with highest probablity. + """ + + def __init__(self, gpt, max_predict_len): + super(GPTForGreedyGeneration, self).__init__() + self.gpt = gpt + self.max_predict_len = paddle.to_tensor(max_predict_len, dtype="int32") + + def model( + self, input_ids, position_ids=None, attention_mask=None, masked_positions=None, use_cache=False, cache=None + ): + outputs = self.gpt( + input_ids, position_ids=position_ids, attention_mask=attention_mask, use_cache=use_cache, cache=cache + ) + if use_cache: + encoder_outputs, cached_kvs = outputs[:2] + else: + encoder_outputs = outputs + logits = paddle.matmul(encoder_outputs, self.gpt.embeddings.word_embeddings.weight, transpose_y=True) + + if use_cache: + return logits, cached_kvs + else: + return logits + + def forward(self, input_ids, end_id): + output, cached_kvs = self.model(input_ids, use_cache=True, cache=None) + src_ids = input_ids + nid = paddle.argmax(output[:, -1, :], axis=-1).reshape([-1, 1]) + src_ids = paddle.concat([src_ids, nid], axis=1) + cur_len = 0 + while cur_len < self.max_predict_len: + output, cached_kvs = self.model(nid, use_cache=True, cache=cached_kvs) + + nid = paddle.argmax(output[:, -1, :], axis=-1).reshape([-1, 1]) + src_ids = paddle.concat([src_ids, nid], axis=1) + cur_len += 1 + if paddle.max(nid) == end_id: + break + return src_ids + + +class GPTLMHead(nn.Layer): + def __init__(self, hidden_size, vocab_size, embedding_weights=None): + super(GPTLMHead, self).__init__() + self.decoder_weight = ( + self.create_parameter(shape=[vocab_size, hidden_size], dtype=paddle.get_default_dtype(), is_bias=True) + if embedding_weights is None + else embedding_weights + ) + + def forward(self, hidden_states): + logits = paddle.tensor.matmul(hidden_states, self.decoder_weight, transpose_y=True) + return logits + + +class GPTLMHeadModel(GPTPretrainedModel): + def __init__(self, gpt): + super(GPTLMHeadModel, self).__init__() + self.gpt = gpt + 
self.lm_head = GPTLMHead( + self.gpt.config["hidden_size"], self.gpt.config["vocab_size"], self.gpt.embeddings.word_embeddings.weight + ) + + def forward(self, input_ids, position_ids=None, attention_mask=None, use_cache=False, cache=None): + outputs = self.gpt( + input_ids, position_ids=position_ids, attention_mask=attention_mask, use_cache=use_cache, cache=cache + ) + + if use_cache: + encoder_outputs, cached_kvs = outputs[:2] + else: + encoder_outputs = outputs + + logits = self.lm_head(encoder_outputs) + + if use_cache: + return logits, cached_kvs + else: + return logits + + def prepare_inputs_for_generation(self, input_ids, use_cache=False, cache=None, **kwargs): + # only last token for inputs_ids if cache is defined in kwargs + position_ids = kwargs.get("position_ids", None) + attention_mask = kwargs.get("attention_mask", None) + if cache is not None: + input_ids = input_ids[:, -1].unsqueeze(-1) + if position_ids is not None: + position_ids = position_ids[:, -1].unsqueeze(-1) + if attention_mask is not None: + attention_mask = attention_mask[:, :, -1, :].unsqueeze(2) + + return { + "input_ids": input_ids, + "position_ids": position_ids, + "attention_mask": attention_mask, + "use_cache": use_cache, + "cache": cache, + } + + def __getattr__(self, name): + try: + return super().__getattr__(name) + except AttributeError as e: + try: + return getattr(getattr(self, self.base_model_prefix), name) + except AttributeError: + try: + return getattr(self, self.base_model_prefix).config[name] + except KeyError: + raise e + + +# these Layers is just for PipelineParallel + + +class GPTPretrainingCriterionPipe(GPTPretrainingCriterion): + """Extends GPTPretrainingCriterion to meet the input standard.""" + + def forward(self, prediction_scores, args): + masked_lm_labels = args[0] + loss_mask = args[1] + loss = super().forward(prediction_scores, masked_lm_labels, loss_mask) + return loss + + +class EmbeddingPipe(GPTEmbeddings): + """Extends GPTEmbeddings to forward attention_mask through the pipeline.""" + + @property + def embedding_weight(self): + return self.word_embeddings.weight + + def forward(self, input_ids): + embeddings = super().forward(input_ids=input_ids, position_ids=None) + return embeddings + + +class GPTForPretrainingPipe(PipelineLayer): + """GPTForPretraining adapted for pipeline parallelism. + + The largest change is flattening the GPTModel class so we can express it as a + sequence of layers including embedding, transformer layers, and output. 
+ """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.02, + pad_token_id=0, + eos_token_id=7, + bos_token_id=0, + eol_token_id=3, + num_partitions=1, + topology=None, + recompute_interval=0, + expert_mode=False, + num_experts=1, + top_k=2, + hcg=None, + ): + + # forward desc + self.descs = [] + + self.descs.append( + SharedLayerDesc( + "embed", + EmbeddingPipe, + shared_weight_attr="embedding_weight", + vocab_size=vocab_size, + hidden_size=hidden_size, + hidden_dropout_prob=hidden_dropout_prob, + max_position_embeddings=max_position_embeddings, + type_vocab_size=type_vocab_size, + initializer_range=0.02, + ) + ) + + for _ in range(num_hidden_layers): + self.descs.append( + LayerDesc( + TransformerDecoderLayer, + d_model=hidden_size, + nhead=num_attention_heads, + dim_feedforward=intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=hidden_dropout_prob, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Normal(mean=0.0, std=initializer_range)), + bias_attr=None, + num_partitions=num_partitions, + expert_mode=expert_mode, + num_experts=num_experts, + top_k=top_k, + hcg=hcg, + ) + ) + + self.descs.append(LayerDesc(nn.LayerNorm, normalized_shape=hidden_size)) + + def _logits_helper(embedding, output): + return parallel_matmul(output, embedding.embedding_weight, True) + + self.descs.append( + SharedLayerDesc( + "embed", + EmbeddingPipe, + forward_func=_logits_helper, + shared_weight_attr="embedding_weight", + vocab_size=vocab_size, + hidden_size=hidden_size, + hidden_dropout_prob=hidden_dropout_prob, + max_position_embeddings=max_position_embeddings, + type_vocab_size=type_vocab_size, + initializer_range=0.02, + ) + ) + + super().__init__( + layers=self.descs, + loss_fn=GPTPretrainingCriterionPipe(), + topology=topology, + seg_method="layer:TransformerDecoderLayer", + recompute_interval=recompute_interval, + recompute_ctx={ + "mp_group": fleet.fleet._hcg.get_model_parallel_group(), + "offload": False, + "partition": False, + }, + ) diff --git a/examples/language_model/moe/dygraph/run.sh b/examples/language_model/moe/dygraph/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..2f513281b60d6bf9b415065bfa275fbd39405355 --- /dev/null +++ b/examples/language_model/moe/dygraph/run.sh @@ -0,0 +1,36 @@ +export PYTHONPATH=$PYTHONPATH:../../../../ + +log_dir=dp8 +rm -rf $log_dir + +python -m paddle.distributed.launch --log_dir $log_dir --gpus "0,1,2,3,4,5,6,7" run_moe_pretrain.py \ + --model_type gpt \ + --model_name_or_path gpt2-small-en \ + --input_dir "./data"\ + --output_dir "output"\ + --weight_decay 0.01\ + --grad_clip 1.0\ + --max_steps 50000\ + --save_steps 100000\ + --decay_steps 320000\ + --device gpu\ + --eval_freq 1000\ + --warmup_rate 0.01\ + --local_batch_size 8\ + --dp_degree 8\ + --mp_degree 1\ + --pp_degree 1\ + --sharding_degree 1\ + --sharding_offload False\ + --expert_mode True\ + --logging_freq 1 \ + --num_experts 8\ + --use_pure_fp16 True\ + --use_recompute True\ + --recompute_partition False\ + --recompute_offload False\ + --resume_dir ""\ + --scale_loss 32768 \ + --gate gshard \ + --balance_loss_weight 1.0 + diff --git a/examples/language_model/moe/dygraph/run_moe_pretrain.py 
b/examples/language_model/moe/dygraph/run_moe_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..0649ec771a08e8eab862934d1dc276492802d2d4 --- /dev/null +++ b/examples/language_model/moe/dygraph/run_moe_pretrain.py @@ -0,0 +1,705 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import random +import time +import types +from types import MethodType + +import lr +import numpy as np +import paddle +import paddle.distributed as dist +from args import parse_args +from checkpointing import load_checkpoint, save_checkpoint +from dataset import create_pretrained_dataset +from framework import AdamW, group_sharded_parallel, obtain_storage +from modeling import ( + GPTForPretraining, + GPTForPretrainingPipe, + GPTModel, + GPTPretrainingCriterion, +) +from paddle import _legacy_C_ops +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.distributed.fleet.meta_parallel.sharding.group_sharded_utils import ( + GroupShardedScaler, +) +from paddle.framework import core +from paddle.incubate.distributed.models import moe +from utils import get_timers, set_timers +from visualdl import LogWriter + +from paddlenlp.transformers import GPTChineseTokenizer, GPTTokenizer +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "gpt": (GPTForPretraining, GPTTokenizer), + "gpt-cn": (GPTForPretraining, GPTChineseTokenizer), +} + +set_timers() + + +def set_hyrbid_parallel_seed(basic_seed, data_world_rank, mp_rank, pp_rank): + assert args.device != "cpu" + + random.seed(basic_seed + data_world_rank) + np.random.seed(basic_seed + data_world_rank) + paddle.seed(basic_seed + data_world_rank) + + from paddle.distributed.fleet import meta_parallel + + meta_parallel.model_parallel_random_seed(basic_seed + data_world_rank + 1000 * mp_rank) + + # local_seed/ global_seed is used to control dropout in ModelParallel + local_seed = basic_seed + 123 + mp_rank * 10 + pp_rank * 1000 + global_seed = basic_seed + data_world_rank + tracker = get_rng_state_tracker() + tracker.add("global_seed", global_seed) + tracker.add("local_seed", local_seed) + + +@paddle.no_grad() +def run_evaluate(args, data_loader, model, criterion, iter_steps, log_writer, global_step, epoch, task_name="valid"): + model.eval() + all_loss = [] + local_time = time.time() + for eval_step, batch in enumerate(data_loader): + tokens, loss_mask, labels = batch + # paddle version >= 2.5.0 or develop + paddle_version = float(paddle.__version__[:3]) + if (paddle_version == 0.0) or (paddle_version >= 2.5): + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=[ + "reduce_sum", + "c_softmax_with_cross_entropy", + "elementwise_div", + ], + level="O2", + use_promote=False, + ): + preds = model(tokens) + else: + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=[ + "reduce_sum", + "c_softmax_with_cross_entropy", + "elementwise_div", + ], + level="O2", + ): + 
preds = model(tokens) + preds = paddle.cast(preds, dtype="float32") + loss = criterion(preds, labels, loss_mask) + + all_loss.append(float(loss)) + if eval_step >= iter_steps - 1: + break + + average_loss = sum(all_loss) / len(all_loss) + logger.info( + "%s step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (task_name, global_step, epoch, eval_step, average_loss, iter_steps / (time.time() - local_time)) + ) + log_writer.add_scalar(task_name + "_loss", average_loss, global_step) + model.train() + + +def initialize_model_and_expert_group(hcg): + def get_expert_parallel_world_size(self): + return self.get_data_parallel_world_size() * self.get_model_parallel_world_size() + + hcg.get_expert_parallel_world_size = types.MethodType(get_expert_parallel_world_size, hcg) + + # need create mp_dp group for expert parallel group in advance + _, mp_dp_comm_group = hcg._set_check_group(parallel_method="pipe") + + def get_expert_parallel_group(self): + return mp_dp_comm_group + + hcg.get_expert_parallel_group = types.MethodType(get_expert_parallel_group, hcg) + + +def initialize_mp_dp_parameters(model, hcg): + mp_group = hcg.get_model_parallel_group() + mp_src_rank = hcg.get_model_parallel_group_src_rank() + + dp_group = hcg.get_data_parallel_group() + dp_src_rank = hcg.get_data_parallel_group_src_rank() + + for param in model.parameters(): + if "expert_" in param.name: + continue + if not param.is_distributed: + paddle.distributed.broadcast(param.detach(), src=mp_src_rank, group=mp_group, sync_op=True) + + paddle.distributed.broadcast(param.detach(), src=dp_src_rank, group=dp_group, sync_op=True) + + +def unscale_method(self, optimizer): + if not self._enable: + return + + if getattr(optimizer, "_param_groups", None) and isinstance(optimizer._param_groups[0], dict): + param_grads_fp16 = [] + param_grads_fp32 = [] + for group in optimizer._param_groups: + for param in group["params"]: + if param._grad_ivar() is not None: + if param._grad_ivar().dtype == core.VarDesc.VarType.FP16: + param_grads_fp16.append(param._grad_ivar()) + else: + param_grads_fp32.append(param._grad_ivar()) + else: + param_grads_fp16 = [ + param._grad_ivar() + for param in optimizer._parameter_list + if (param._grad_ivar() is not None) and (param._grad_ivar().dtype == core.VarDesc.VarType.FP16) + ] + param_grads_fp32 = [ + param._grad_ivar() + for param in optimizer._parameter_list + if (param._grad_ivar() is not None) and (param._grad_ivar().dtype == core.VarDesc.VarType.FP32) + ] + temp_found_inf_fp16 = paddle.to_tensor(np.array([0]).astype(np.bool_)) + temp_found_inf_fp32 = paddle.to_tensor(np.array([0]).astype(np.bool_)) + + if len(param_grads_fp16): + _legacy_C_ops.check_finite_and_unscale(param_grads_fp16, self._scale, param_grads_fp16, temp_found_inf_fp16) + if len(param_grads_fp32): + _legacy_C_ops.check_finite_and_unscale(param_grads_fp32, self._scale, param_grads_fp32, temp_found_inf_fp32) + self._found_inf = 1 if temp_found_inf_fp16 or temp_found_inf_fp32 else 0 + + if dist.get_world_size() > 1: + is_found_inf = paddle.to_tensor([self._found_inf], dtype="int32") + paddle.distributed.all_reduce(is_found_inf, op=paddle.distributed.ReduceOp.MAX, group=None) + self._found_inf = int(is_found_inf) + + +def all_reduce_parameters(params, group): + if group.nranks < 2: + return + + div_factor = 1.0 / group.nranks + with paddle.framework.no_grad(): + for p in params: + grad = p.grad.scale_(div_factor) + paddle.distributed.all_reduce(grad, sync_op=True) + + +def parameters_classify(model, use_sharding=False): + 
decay_gate_params = [] + decay_expert_params = [] + decay_other_params = [] + + gate_params = [] + expert_params = [] + other_params = [] + + for param in model.parameters(): + # param_name = param.name + if "expert_" in param.name: + if not any(nd in param.name for nd in ["bias", "norm"]): + decay_expert_params.append(param) + else: + expert_params.append(param) + elif "gate_" in param.name: + if not any(nd in param.name for nd in ["bias", "norm"]): + decay_gate_params.append(param) + else: + gate_params.append(param) + else: + if not any(nd in param.name for nd in ["bias", "norm"]): + decay_other_params.append(param) + else: + other_params.append(param) + + print("all parameters length:", len(model.parameters())) + print( + "decay_gate_params len: {}, decay_expert_params len: {}, decay_other_params len: {}".format( + len(decay_gate_params), len(decay_expert_params), len(decay_other_params) + ) + ) + print( + "gate_params len: {}, expert_params len: {}, other_params len: {}".format( + len(gate_params), len(expert_params), len(other_params) + ) + ) + + d_gate = obtain_storage(decay_gate_params) + gate = obtain_storage(gate_params) + + d_expert = obtain_storage(decay_expert_params) + expert = obtain_storage(expert_params) + + d_other = decay_other_params if use_sharding else obtain_storage(decay_other_params) + other = other_params if use_sharding else obtain_storage(other_params) + + opt_fused_tensors = [] + decay_fused_tensors = [] + reduce_fused_tensors = [] + gate_fused_tensors = [] + + decay_fused_tensors = d_gate + d_other + d_expert + opt_fused_tensors = decay_fused_tensors + gate + other + expert + reduce_fused_tensors = d_other + other + gate_fused_tensors = d_gate + gate + + expert_fusion_names = [] + for i, p in enumerate(d_expert + expert): + p.name = "fused_expert_tensor_{}".format(i) + expert_fusion_names.append(p.name) + + for i, p in enumerate(d_gate + gate): + p.name = "fused_gate_tensor_{}".format(i) + + return opt_fused_tensors, decay_fused_tensors, reduce_fused_tensors, gate_fused_tensors, expert_fusion_names + + +def timer_log(log_freq): + timers = get_timers() + # Logging + timers_to_log = [] + + def add_to_logging(name): + if name in timers.timers: + timers_to_log.append(name) + + add_to_logging("forward-compute") + add_to_logging("forward-recv") + add_to_logging("forward-send") + add_to_logging("forward-send-backward-recv") + add_to_logging("backward-compute") + add_to_logging("backward-recv") + add_to_logging("backward-send") + add_to_logging("backward-send-forward-recv") + add_to_logging("backward-params-all-reduce") + add_to_logging("backward-embedding-all-reduce") + add_to_logging("optimizer-copy-to-main-grad") + add_to_logging("optimizer-unscale-and-check-inf") + add_to_logging("optimizer-clip-main-grad") + add_to_logging("optimizer-copy-main-to-model-params") + add_to_logging("optimizer") + add_to_logging("batch-generator") + add_to_logging("Prepare Forward") + add_to_logging("Gate Computation") + add_to_logging("Limit_By_Capacity") + add_to_logging("Prune_Gate_By_Cap") + add_to_logging("Random Routing") + add_to_logging("Base Operation") + add_to_logging("AllGather in Limit") + add_to_logging("MOEScatter") + add_to_logging("Expert Computation") + add_to_logging("MOEGather") + add_to_logging("Score BMM") + add_to_logging("AllReduce") + add_to_logging("AllGather") + add_to_logging("lec reduce") + add_to_logging("lec reduce2") + + timers.log(timers_to_log, normalizer=log_freq) + + +def do_train(args): + paddle.set_device(args.device) + strategy = 
fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": args.dp_degree, + "mp_degree": args.mp_degree, + "pp_degree": args.pp_degree, + "sharding_degree": args.sharding_degree, + } + + accumulate_steps = args.local_batch_size // args.micro_batch_size + strategy.pipeline_configs = {"accumulate_steps": accumulate_steps, "micro_batch_size": args.micro_batch_size} + + fleet.init(is_collective=True, strategy=strategy) + + nranks = paddle.distributed.get_world_size() + + # obtain rank message of hybrid parallel + hcg = fleet.get_hybrid_communicate_group() + global_rank = hcg.get_global_rank() + mp_rank = hcg.get_model_parallel_rank() + pp_rank = hcg.get_stage_id() + dp_rank = hcg.get_data_parallel_rank() + sharding_rank = hcg.get_sharding_parallel_rank() + sharding_group = hcg.get_sharding_parallel_group() + + if args.sharding_degree > 1: + assert ( + args.dp_degree == args.mp_degree == args.pp_degree == 1 + ), "sharding stage2 will support hybrid parallel later" + + sharding_size = hcg.get_sharding_parallel_world_size() + data_world_rank = dp_rank * sharding_size + sharding_rank + data_world_size = args.dp_degree * args.sharding_degree + local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + # seed control in hybrid parallel + set_hyrbid_parallel_seed(args.seed, data_world_rank, mp_rank, pp_rank) + + default_global_tokens_num = args.global_batch_size * args.max_seq_len + + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + # Define log writer + log_writer_path = os.path.join( + args.output_dir, + "train_log", + "{}_globalbsz_{}_pure_fp16_{}_recompute_{}_card_{}".format( + args.model_name_or_path, args.global_batch_size, args.use_pure_fp16, False, global_rank + ).lower(), + ) + + if os.path.exists(log_writer_path): + import shutil + + shutil.rmtree(log_writer_path) + + log_writer = LogWriter(log_writer_path) + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + if args.model_name_or_path in pretrained_models_list: + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + + model_config["num_partitions"] = args.mp_degree + + # MOE config + initialize_model_and_expert_group(hcg) + + model_config["expert_mode"] = args.expert_mode + model_config["hcg"] = hcg + model_config["num_experts"] = args.num_experts + model_config["top_k"] = args.top_k + if args.expert_mode: + model_config["gate"] = args.gate + + if args.pp_degree == 1: + model_config["recompute_interval"] = 1 if args.use_recompute else 0 + model_config["recompute_partition"] = args.recompute_partition + model_config["recompute_offload"] = args.recompute_offload + if args.use_recompute and args.recompute_partition: + raise Exception("when use_recompute is True, recompute_partition must be False in MoE.") + + model = GPTForPretraining(GPTModel(**model_config)) + else: + model_config["topology"] = hcg.topology() + model_config["recompute_interval"] = 1 if args.use_recompute else 0 + model = GPTForPretrainingPipe(**model_config) + else: + model = GPTForPretraining.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + attention_probs_dropout_prob=args.attention_probs_dropout_prob, + ) + + # Create the critrion for the gpt model + criterion = GPTPretrainingCriterion() + + if args.decay_steps is 
None: + args.decay_steps = args.max_steps + warmup_step = args.warmup_rate * args.decay_steps + + lr_scheduler = None + + if args.lr_decay_style == "none": + lr_scheduler = None + elif args.lr_decay_style == "cosine": + lr_scheduler = lr.CosineAnnealingWithWarmupDecay( + max_lr=args.max_lr, min_lr=args.min_lr, warmup_step=warmup_step, decay_step=args.decay_steps + ) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + if args.use_pure_fp16: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + if args.sharding_degree == 1: + scaler = fleet.distributed_scaler(scaler) + scaler._unscale = MethodType(unscale_method, scaler) + else: + scaler = GroupShardedScaler(scaler) + + model = paddle.amp.decorate(models=model, optimizers=None, level="O2", save_dtype="float32") + + ( + opt_fused_tensors, + decay_fused_tensors, + reduce_fused_tensors, + gate_fused_tensors, + expert_fusion_names, + ) = parameters_classify(model, use_sharding=(args.sharding_degree > 1)) + decay_params = [p.name for p in decay_fused_tensors] + + clip = None + if args.grad_clip > 0: + is_expert_param_fun = lambda param: param.name in expert_fusion_names # noqa: E731 + clip = moe.ClipGradByGlobalNorm( + clip_norm=args.grad_clip, + is_expert_param_func=is_expert_param_fun, + moe_group=hcg.get_expert_parallel_group(), + ) + + optimizer = AdamW( + learning_rate=lr_scheduler if lr_scheduler is not None else args.max_lr, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + parameters=opt_fused_tensors, + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, # decay_params, + multi_precision=args.use_pure_fp16, + ) + + # in order to restore reader. + pass_num = 0 + file_id = 0 + start_epoch = 0 + args.resume_dir = None if len(args.resume_dir) <= 0 else args.resume_dir + + if paddle.distributed.get_world_size() > 1 and args.resume_dir is None: + print(">> initialize....") + if args.sharding_degree > 1: + model, optimizer = group_sharded_parallel(model, optimizer, sharding_group, args.sharding_offload) + for p in gate_fused_tensors: + dist.broadcast(p, src=sharding_group.ranks[0], group=sharding_group, sync_op=True) + # Multi stream operation will be supported later + dist.wait(tensor=p, group=sharding_group, use_calc_stream=True) + else: + initialize_mp_dp_parameters(model, hcg) + + if args.resume_dir is not None: + global_step, loss_scale, data_meta = load_checkpoint( + args, model, optimizer, lr_scheduler, tokenizer, dp_rank, mp_rank, pp_rank + ) + pass_num = data_meta["pass_num"] + file_id = data_meta["file_id"] + start_epoch = data_meta["start_epoch"] + + if args.model_name_or_path not in pretrained_models_list: + logger.info("Try to load checkpoint from %s " % args.model_name_or_path) + opt_path = os.path.join(args.model_name_or_path, "model_state.pdopt") + if os.path.exists(opt_path): + opt_dict = paddle.load(opt_path) + optimizer.set_state_dict(opt_dict) + else: + logger.warning("No optimizer checkpoint file found in %s." 
% opt_path) + + global_step = 0 if args.resume_dir is None else global_step + timers = get_timers() + tic_train = time.time() + for epoch in range(start_epoch, args.num_train_epochs): + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and "npz_" not in str(f)) + ] + files.sort() + num_files = len(files) + for f_id in range(file_id, num_files): + data_file = files[f_id] + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + data_file, + local_rank=local_rank, + data_world_size=data_world_size, + data_world_rank=data_world_rank, + eos_id=tokenizer.eos_token_id, + ) + + # Bug fix, if not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. + valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + for step, batch in enumerate(train_data_loader()): + # to remove the train data that has been studyed. + if step < global_step - pass_num: + continue + + global_step += 1 + tokens, loss_mask, labels = batch + + loss_mask.stop_gradient = True + labels.stop_gradient = True + + loss = 0.0 + for i in range(accumulate_steps): + start_index = i * args.micro_batch_size + end_index = start_index + args.micro_batch_size + timers("forward-compute").start() + # paddle version >= 2.5.0 or develop + paddle_version = float(paddle.__version__[:3]) + if (paddle_version == 0.0) or (paddle_version >= 2.5): + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=[ + "reduce_sum", + "c_softmax_with_cross_entropy", + "elementwise_div", + ], + level="O2", + use_promote=False, + ): + preds = model(tokens[start_index:end_index, :]) + loss_mbs = criterion( + preds, labels[start_index:end_index, :], loss_mask[start_index:end_index, :] + ) + else: + with paddle.amp.auto_cast( + args.use_pure_fp16, + custom_black_list=[ + "reduce_sum", + "c_softmax_with_cross_entropy", + "elementwise_div", + ], + level="O2", + ): + preds = model(tokens[start_index:end_index, :]) + loss_mbs = criterion( + preds, labels[start_index:end_index, :], loss_mask[start_index:end_index, :] + ) + timers("forward-compute").stop() + + if args.gate != "naive" and args.balance_loss_weight: + aux_loss_list = [ + l.moe_mlp.gate.get_loss(clear=False).reshape([-1]) + for l in model.gpt.decoder.layers + if hasattr(l.moe_mlp, "gate") + ] + bal_loss = paddle.concat(aux_loss_list) + if bal_loss.dtype == paddle.float16: + bal_loss = paddle.cast(bal_loss, dtype=paddle.float32) + bal_loss = bal_loss.mean() + loss_mbs += bal_loss * args.balance_loss_weight + loss_mbs = loss_mbs / accumulate_steps + + timers("backward-compute").start() + if args.use_pure_fp16: + scaler.scale(loss_mbs).backward() + else: + loss_mbs.backward() + timers("backward-compute").stop() + loss = loss + loss_mbs + + timers("backward-params-all-reduce").start() + all_reduce_parameters(gate_fused_tensors, hcg.get_expert_parallel_group()) + if args.sharding_degree == 1: + all_reduce_parameters(reduce_fused_tensors, hcg.get_data_parallel_group()) + timers("backward-params-all-reduce").stop() + + if args.use_pure_fp16: + scaler.step(optimizer) + scaler.update() + else: + optimizer.step() + learning_rate = optimizer.get_lr() + if lr_scheduler is not None: + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_freq == 0: + avg_loss = loss.numpy() + speed = args.logging_freq / (time.time() - tic_train) + if args.gate != "naive" and 
args.balance_loss_weight: + bal_loss = bal_loss.numpy() + avg_loss -= bal_loss + else: + bal_loss = -1 + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %.9f, bal_loss: %.9f, speed: %.2f step/s, ips_total: %.0f tokens/s, ips: %.0f tokens/s, learning rate: %.5e" + % ( + global_step, + epoch, + step, + avg_loss, + bal_loss, + speed, + speed * default_global_tokens_num, + speed * default_global_tokens_num / nranks, + learning_rate, + ) + ) + log_writer.add_scalar("loss", float(loss), global_step) + log_writer.add_scalar("learning_rate", learning_rate, global_step) + + tic_train = time.time() + timer_log(args.logging_freq) + + if global_step % args.save_steps == 0 or global_step >= args.max_steps: + loss_scale = scaler._scale if args.use_pure_fp16 else None + save_checkpoint( + args, + global_step, + model, + optimizer, + lr_scheduler, + tokenizer, + loss_scale, + dp_rank, + mp_rank, + pp_rank, + pass_num, + file_id, + epoch, + ) + print("save checkpoint for step_{} successfully...loss_scale = {}".format(global_step, loss_scale)) + + if global_step % args.eval_freq == 0: + # Since the valid data broardcast to all devices, we do evaluate on all device. + run_evaluate( + args, + valid_data_loader, + model, + criterion, + args.eval_iters, + log_writer, + global_step, + epoch, + "valid", + ) + + if global_step >= args.max_steps: + run_evaluate( + args, + test_data_loader, + model, + criterion, + args.test_iters, + log_writer, + global_step, + epoch, + "test", + ) + logger.info("The training process is complete.") + del train_data_loader + return + + # to record sum of the length of train_data_loader that has been read. + pass_num += len(train_data_loader()) + del train_data_loader + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + do_train(args) diff --git a/examples/language_model/moe/dygraph/sync_files.sh b/examples/language_model/moe/dygraph/sync_files.sh new file mode 100644 index 0000000000000000000000000000000000000000..0e297e908f0d31f3ae9c0d762c2702b9aa679d0d --- /dev/null +++ b/examples/language_model/moe/dygraph/sync_files.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +# get sshd port +sshport=$(lsof -i | grep sshd | awk '{print $9}' | sed s/\*://) + +hostfile=${TRAIN_WORKSPACE}/hostfile +hostlist=$(cat $hostfile | awk '{print $1}' | xargs) +for host in ${hostlist[@]}; do + #ssh $host "ls $PWD" + echo "scp $1 to $host" + scp -r $1 ${host}:${PWD} +done diff --git a/examples/language_model/moe/dygraph/utils.py b/examples/language_model/moe/dygraph/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..49d94ea06806ff457c34ffec772d24bb2f66ecff --- /dev/null +++ b/examples/language_model/moe/dygraph/utils.py @@ -0,0 +1,118 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
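+
+# CUDA-synchronized timing helpers used by the MoE pretraining loop above
+# ("forward-compute", "backward-compute", "backward-params-all-reduce", ...).
+# A minimal usage sketch, assuming the global timers have not been created yet
+# in this process:
+#
+#   set_timers()                        # create the global Timers registry
+#   timers = get_timers()
+#   timers("forward-compute").start()
+#   ...                                 # run the forward pass
+#   timers("forward-compute").stop()
+#   timers.log(["forward-compute"])     # prints the elapsed time in ms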
+ +import time + +import paddle + +_GLOBAL_TIMERS = None + + +def _ensure_var_is_not_initialized(var, name): + """Make sure the input variable is not None.""" + assert var is None, "{} is not initialized.".format(name) + + +def _ensure_var_is_initialized(var, name): + """Make sure the input variable is not None.""" + assert var is not None, "{} is not initialized.".format(name) + + +def get_timers(): + _ensure_var_is_initialized(_GLOBAL_TIMERS, "timers") + return _GLOBAL_TIMERS + + +def set_timers(): + """Initialize timers.""" + global _GLOBAL_TIMERS + _ensure_var_is_not_initialized(_GLOBAL_TIMERS, "timers") + _GLOBAL_TIMERS = Timers() + + +class _Timer: + """Timer.""" + + def __init__(self, name): + self.name = name + self.elapsed_ = 0.0 + self.started_ = False + self.start_time = time.time() + + def start(self): + """Start the timer.""" + assert not self.started_, "timer has already started" + paddle.device.cuda.synchronize() + self.start_time = time.time() + self.started_ = True + + def stop(self): + """Stop the timers.""" + assert self.started_, "timer is not started." + paddle.device.cuda.synchronize() + self.elapsed_ += time.time() - self.start_time + self.started_ = False + + def reset(self): + """Reset timer.""" + self.elapsed_ = 0.0 + self.started_ = False + + def elapsed(self, reset=True): + """Calculate the elapsed time.""" + started_ = self.started_ + # If the timing in progress, end it first. + if self.started_: + self.stop() + # Get the elapsed time. + elapsed_ = self.elapsed_ + # Reset the elapsed time + if reset: + self.reset() + # If timing was in progress, set it back. + if started_: + self.start() + return elapsed_ + + +class Timers: + """Group of timers.""" + + def __init__(self): + self.timers = {} + + def __call__(self, name): + if name not in self.timers: + self.timers[name] = _Timer(name) + return self.timers[name] + + def write(self, names, writer, iteration, normalizer=1.0, reset=False): + """Write timers to a tensorboard writer""" + assert normalizer > 0.0 + for name in names: + value = self.timers[name].elapsed(reset=reset) / normalizer + writer.add_scalar(name + "-time", value, iteration) + + def log(self, names, normalizer=1.0, reset=True): + """Log a group of timers.""" + assert normalizer > 0.0 + string = "time (ms)" + for name in names: + elapsed_time = self.timers[name].elapsed(reset=reset) * 1000.0 / normalizer + string += " | {}: {:.2f}".format(name, elapsed_time) + + if paddle.distributed.get_rank() == (paddle.distributed.get_world_size() - 1): + print(string, flush=True) + else: + print(string, flush=True) diff --git a/examples/language_model/mpnet/README.md b/examples/language_model/mpnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f959121c69634f433b1b6a1197a40b11e9cdd03e --- /dev/null +++ b/examples/language_model/mpnet/README.md @@ -0,0 +1,181 @@ +# MPNet with PaddleNLP + +[MPNet: Masked and Permuted Pre-training for Language Understanding - Microsoft Research](https://www.microsoft.com/en-us/research/publication/mpnet-masked-and-permuted-pre-training-for-language-understanding/) + +**摘要:** +BERT采用掩码语言建模(MLM)进行预训练,是最成功的预训练模型之一。由于BERT忽略了预测标记之间的依赖关系,XLNet引入了置换语言建模(PLM)进行预训练来解决这个问题。然而,XLNet没有利用句子的完整位置信息,因此会受到预训练和微调之间的位置差异的影响。在本文中,我们提出了MPNet,这是一种新的预训练方法,它继承了BERT和XLNet的优点并避免了它们的局限性。MPNet通过置换语言建模(相对于BERT中的MLM)利用预测标记之间的依赖性,并以辅助位置信息作为输入,使模型能够看到完整的句子,从而减少位置差异(相对于XLNet中的PLM)。我们在大规模数据集(超过160GB的文本语料库)上预训练了MPNet模型,并对各种下游任务(GLUE、SQuAD 等)进行微调。实验结果表明,在相同的模型设置下,MPNet大大优于MLM和PLM,并且与之前最先进的预训练方法(例如 
BERT、XLNet、RoBERTa)相比,在这些任务上取得了更好的结果。原始代码和预训练模型可从 https://github.com/microsoft/MPNet 下载得到。 + +本项目是 MPNet 在 Paddle 2.x上的开源实现。 + +## 快速开始 + +### 下游任务微调 + +#### 1、GLUE +以QQP数据集为例,运行其他glue数据集,请参考`train.sh`文件。(超参数遵循原论文的仓库的[README](https://github.com/microsoft/MPNet/blob/master/MPNet/README.glue.md)) + +##### (1)模型微调: +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name qqp \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --weight_decay 0.1 \ + --warmup_steps 5666 \ + --max_steps 113272 \ + --logging_steps 500 \ + --save_steps 2000 \ + --seed 42 \ + --output_dir qqp/ \ + --do_train \ + --do_eval \ + --device gpu +``` +其中参数释义如下: +- `model_type` 指示了模型类型,当前支持BERT、ELECTRA、ERNIE、CONVBERT、MPNET模型。 +- `model_name_or_path` 模型名称或者路径,其中mpnet模型当前仅支持mpnet-base几种规格。 +- `task_name` 表示 Fine-tuning 的任务,当前支持CoLA、SST-2、MRPC、STS-B、QQP、MNLI、QNLI、RTE和WNLI。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `per_device_train_batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `lr_scheduler_type` scheduler类型,可选linear和cosine。 +- `weight_decay` 权重衰减比例。 +- `warmup_steps` warmup步数。 +- `max_steps` 表示最大训练步数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `do_train` 表示是否需要训练。 +- `do_eval` 表示是否需要评测。 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 + +##### (2)模型预测: +```bash +cd glue +python run_predict.py --task_name qqp --ckpt_path qqp/best-qqp_ft_model_106000.pdparams +``` + +##### (3)压缩template文件夹为zip文件,然后提交到[GLUE排行榜](https://gluebenchmark.com/leaderboard): + + +###### GLUE开发集结果: + +| task | cola | sst-2 | mrpc | sts-b | qqp | mnli | qnli | rte | avg | +|--------------------------------|-------|-------|-------------|------------------|-------------|------|-------|-------|-------| +| **metric** | **mcc** | **acc** | **acc/f1** | **pearson/spearman** | **acc/f1** | **acc(m/mm)** | **acc** | **acc** | | +| Paper | **65.0** | **95.5** | **91.8**/空 | 91.1/空 | **91.9**/空 | **88.5**/空 | 93.3 | 85.8 | **87.9** | +| Mine | 64.4 | 95.4 | 90.4/93.1 | **91.6**/91.3 | **91.9**/89.0 | 87.7/88.2 | **93.6** | **86.6** | 87.7 | + +###### GLUE测试集结果对比: + +| task | cola | sst-2 | mrpc | sts-b | qqp | mnli-m | qnli | rte | avg | +|--------------------------------|-------|-------|-------|-------|-----|-------|-------|-------|----------| +| **metric** | **mcc** | **acc** | **acc/f1** | **pearson/spearman** | **acc/f1** | **acc(m/mm)** | **acc** | **acc** | | +| Paper | **64.0** | **96.0** | 89.1/空 | 90.7/空 | **89.9**/空 | **88\.5**/空 | 93\.1 | 81.0 | **86.5** | +| Mine | 60.5 | 95.9 | **91.6**/88.9 | **90.8**/90.3 | 89.7/72.5 | 87.6/86.6 | **93.3** | **82.4** | **86.5** | + +#### 2、SQuAD v1.1 + +使用Paddle提供的预训练模型运行SQuAD v1.1数据集的Fine-tuning + +```bash +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_squad.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --max_seq_length 512 \ + --per_device_train_batch_size 16 \ + --learning_rate 2e-5 \ + --num_train_epochs 4 \ + --lr_scheduler_type linear \ + --logging_steps 25 \ + --save_steps 25 \ + --warmup_ratio 0.1 \ + --weight_decay 0.1 \ + --output_dir squad1.1/ \ + --device gpu \ + --do_train \ + --do_eval \ + --seed 42 +``` + +训练过程中模型会自动对结果进行评估,其中最好的结果如下所示: + +```python +{ + "exact": 86.84957426679281, + "f1": 92.82031917884066, 
+ "total": 10570, + "HasAns_exact": 86.84957426679281, + "HasAns_f1": 92.82031917884066, + "HasAns_total": 10570 +} +``` + +#### 3、SQuAD v2.0 +对于 SQuAD v2.0,按如下方式启动 Fine-tuning: + +```bash +unset CUDA_VISIBLE_DEVICES +cd squad +python -m paddle.distributed.launch --gpus "0" run_squad.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --max_seq_length 512 \ + --per_device_train_batch_size 16 \ + --learning_rate 2e-5 \ + --num_train_epochs 4 \ + --lr_scheduler_type linear \ + --logging_steps 200 \ + --save_steps 200 \ + --warmup_ratio 0.1 \ + --weight_decay 0.1 \ + --output_dir squad2/ \ + --device gpu \ + --do_train \ + --do_eval \ + --seed 42 \ + --version_2_with_negative +``` + +* `version_2_with_negative`: 使用squad2.0数据集和评价指标的标志。 + +训练过程中模型会自动对结果进行评估,其中最好的结果如下所示: + +```python +{ + "exact": 82.27912069401162, + "f1": 85.2774124891565, + "total": 11873, + "HasAns_exact": 80.34750337381917, + "HasAns_f1": 86.35268530427743, + "HasAns_total": 5928, + "NoAns_exact": 84.20521446593776, + "NoAns_f1": 84.20521446593776, + "NoAns_total": 5945, + "best_exact": 82.86869367472417, + "best_exact_thresh": -2.450321674346924, + "best_f1": 85.67634263296013, + "best_f1_thresh": -2.450321674346924 +} +``` + +# Tips: +- 对于SQUAD任务:根据这个[issues](https://github.com/microsoft/MPNet/issues/3)所说,论文中汇报的是`best_exact`和`best_f1`。 +- 对于GLUE任务:根据这个[issues](https://github.com/microsoft/MPNet/issues/7)所说,部分任务采用了热启动初始化的方法。 + +# Reference + +```bibtex +@article{song2020mpnet, + title={MPNet: Masked and Permuted Pre-training for Language Understanding}, + author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan}, + journal={arXiv preprint arXiv:2004.09297}, + year={2020} +} +``` diff --git a/examples/language_model/mpnet/convert.py b/examples/language_model/mpnet/convert.py new file mode 100644 index 0000000000000000000000000000000000000000..c9f3fb8fd14329e437b2759b5f0a06df84533d11 --- /dev/null +++ b/examples/language_model/mpnet/convert.py @@ -0,0 +1,78 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
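+
+# Converts a HuggingFace PyTorch MPNet checkpoint into a PaddleNLP state dict:
+# 2-D weight matrices are transposed (except those matched by dont_transpose),
+# parameter names are remapped via the huggingface_to_paddle table, and the
+# lm_head weights listed in skip_weights are skipped. Example invocation, using
+# the argparse defaults below as purely illustrative paths:
+#
+#   python convert.py \
+#     --pytorch_checkpoint_path weights/hg/mpnet-base/pytorch_model.bin \
+#     --paddle_dump_path weights/pd/mpnet-base/model_state.pdparams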
+ +from collections import OrderedDict +import argparse + +huggingface_to_paddle = { + ".attn.": ".", + "intermediate.dense": "ffn", + "output.dense": "ffn_output", + ".output.LayerNorm.": ".layer_norm.", + ".LayerNorm.": ".layer_norm.", + "lm_head.decoder.bias": "lm_head.decoder_bias", +} + +skip_weights = ["lm_head.decoder.weight", "lm_head.bias"] +dont_transpose = [ + "_embeddings.weight", + ".LayerNorm.weight", + ".layer_norm.weight", + "relative_attention_bias.weight", +] + + +def convert_pytorch_checkpoint_to_paddle(pytorch_checkpoint_path, paddle_dump_path): + import torch + import paddle + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + transpose = False + if k in skip_weights: + continue + if k[-7:] == ".weight": + if not any([w in k for w in dont_transpose]): + if v.ndim == 2: + v = v.transpose(0, 1) + transpose = True + oldk = k + for huggingface_name, paddle_name in huggingface_to_paddle.items(): + k = k.replace(huggingface_name, paddle_name) + + print(f"Converting: {oldk} => {k} | is_transpose {transpose}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pytorch_checkpoint_path", + default="weights/hg/mpnet-base/pytorch_model.bin", + type=str, + required=False, + help="Path to the Pytorch checkpoint path.", + ) + parser.add_argument( + "--paddle_dump_path", + default="weights/pd/mpnet-base/model_state.pdparams", + type=str, + required=False, + help="Path to the output Paddle model.", + ) + args = parser.parse_args() + convert_pytorch_checkpoint_to_paddle(args.pytorch_checkpoint_path, args.paddle_dump_path) diff --git a/examples/language_model/mpnet/glue/predict.sh b/examples/language_model/mpnet/glue/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..c2396372fb6e9c67ee8271de4ed9c138a61f3030 --- /dev/null +++ b/examples/language_model/mpnet/glue/predict.sh @@ -0,0 +1,3 @@ +# task name ["cola","sst-2","mrpc","sts-b","qqp","mnli", "rte", "qnli"] + +python run_predict.py --task_name qqp --ckpt_path qqp/best-qqp_ft_model_106000.pdparams \ No newline at end of file diff --git a/examples/language_model/mpnet/glue/run_glue.py b/examples/language_model/mpnet/glue/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..742c43b7dba7397c381da10be0e6c20b83b36b16 --- /dev/null +++ b/examples/language_model/mpnet/glue/run_glue.py @@ -0,0 +1,454 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
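+
+# Fine-tunes MPNet (or BERT/ELECTRA/ERNIE) on a single GLUE task with the
+# PaddleNLP Trainer. MPNetTrainer below overrides evaluate() to rebuild the dev
+# dataloader(s) on the fly, including the MNLI matched/mismatched splits, and
+# _get_layer_lr_radios() optionally applies layer-wise learning-rate decay.
+# See the MPNet README and glue/train.sh for the per-task launch commands.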
+ +import math +import time +from collections import OrderedDict +from dataclasses import dataclass, field +from functools import partial +from typing import Dict, List, Optional + +import numpy as np +import paddle +from paddle.io import DataLoader, Dataset +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.trainer.trainer_utils import speed_metrics +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ElectraForSequenceClassification, + ElectraTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, + MPNetForSequenceClassification, + MPNetTokenizer, +) + + +class MPNetTrainer(Trainer): + def evaluate( + self, + eval_dataset: Optional[Dataset] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + ) -> Dict[str, float]: + """ + Run evaluation and returns metrics. + + The calling script will be responsible for providing a method to compute metrics, as they are task-dependent + (pass it to the init `compute_metrics` argument). + + You can also subclass and override this method to inject custom behavior. + + Args: + eval_dataset (`Dataset`, *optional*): + Pass a dataset if you wish to override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not + accepted by the `model.forward()` method are automatically removed. It must implement the `__len__` + method. + ignore_keys (`Lst[str]`, *optional*): + A list of keys in the output of your model (if it is a dictionary) that should be ignored when + gathering predictions. + metric_key_prefix (`str`, *optional*, defaults to `"eval"`): + An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named + "eval_bleu" if the prefix is "eval" (default) + + Returns: + A dictionary containing the evaluation loss and the potential metrics computed from the predictions. The + dictionary also contains the epoch number which comes from the training state. 
+ """ + # memory metrics - must set up as early as possible + self._memory_tracker.start() + + trans_func = partial( + convert_example, + tokenizer=self.tokenizer, + label_list=self.args.label_list, + max_seq_length=self.args.max_seq_length, + ) + if self.args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", self.args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler( + dev_ds_matched, batch_size=self.args.per_device_eval_batch_size * 2, shuffle=False + ) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=self.data_collator, + num_workers=2, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=self.args.per_device_eval_batch_size * 2, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=self.data_collator, + num_workers=2, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", self.args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler( + dev_ds, batch_size=self.args.per_device_eval_batch_size * 2, shuffle=False + ) + dev_data_loader = DataLoader( + dataset=dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=self.data_collator, + num_workers=2, + return_list=True, + ) + + start_time = time.time() + + if self.args.task_name == "mnli": + output = self.evaluation_loop( + dev_data_loader_matched, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if self.compute_metrics is None else None, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + + total_batch_size = self.args.eval_batch_size * self.args.dataset_world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + output = self.evaluation_loop( + dev_data_loader_mismatched, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if self.compute_metrics is None else None, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + + total_batch_size = self.args.eval_batch_size * self.args.dataset_world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + + self._memory_tracker.stop_and_update_metrics(output.metrics) + + return output.metrics + else: + output = self.evaluation_loop( + dev_data_loader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if self.compute_metrics is None else None, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + + total_batch_size = 
self.args.eval_batch_size * self.args.dataset_world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + + self._memory_tracker.stop_and_update_metrics(output.metrics) + + return output.metrics + + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "qnli": Accuracy, + "mnli": Accuracy, + "rte": Accuracy, + "wnli": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "electra": (ElectraForSequenceClassification, ElectraTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), + "mpnet": (MPNetForSequenceClassification, MPNetTokenizer), +} + + +@dataclass +class ModelArguments: + max_seq_length: Optional[int] = field( + default=128, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded." + ) + }, + ) + task_name: Optional[str] = field( + default=None, + metadata={"help": ("The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()))}, + ) + model_type: Optional[str] = field( + default="convbert", + metadata={"help": ("Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))}, + ) + model_name_or_path: Optional[str] = field( + default="convbert-base", + metadata={ + "help": ( + "Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum( + [list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], + [], + ) + ), + ) + }, + ) + layer_lr_decay: Optional[float] = field( + default=1.0, + metadata={"help": ("layer_lr_decay")}, + ) + + +@dataclass +class DataArguments: + data_path: Optional[str] = field( + default="./data", + metadata={"help": "The path of datasets to be loaded."}, + ) + + +def compute_metrics(eval_preds, metric): + labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") + preds = paddle.to_tensor(eval_preds.predictions) + preds = paddle.nn.functional.softmax(preds, axis=-1) + labels = paddle.argmax(labels, axis=-1) + correct = metric.compute(preds, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + return { + "acc": res[0], + "precision": res[1], + "recall": res[2], + "f1": res[3], + "acc and f1": res[4], + } + elif isinstance(metric, Mcc): + return { + "mcc": res[0], + } + elif isinstance(metric, PearsonAndSpearman): + return { + "pearson": res[0], + "spearman": res[1], + "pearson and spearman": res[2], + } + else: + return { + "acc": res, + } + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_length=max_seq_length, return_token_type_ids=True) + else: + example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_length=max_seq_length, + 
return_token_type_ids=True, + ) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +@dataclass +class DataCollator: + def __init__(self, tokenizer, train_ds): + self.tokenizer = (tokenizer,) + self.train_ds = (train_ds,) + + def __call__(self, features): + input_ids = [] + labels = [] + batch = {} + + for feature in features: + input_idx, _, label = feature + input_ids.append(input_idx) + labels.append(label) + + if not isinstance(self.tokenizer, MPNetTokenizer): + self.tokenizer = self.tokenizer[0] + self.train_ds = self.train_ds[0] + input_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids),) # input_ids + labels = (Stack(dtype="int64" if self.train_ds.label_list else "float32")(labels),) # labels + + batch["input_ids"] = input_ids[0] + batch["labels"] = labels[0] + + return batch + + +def _get_layer_lr_radios(layer_decay=0.8, n_layers=12): + """Have lower learning rates for layers closer to the input.""" + key_to_depths = OrderedDict( + { + "mpnet.embeddings.": 0, + "mpnet.encoder.relative_attention_bias.": 0, + "mpnet.pooler.": n_layers + 2, + "mpnet.classifier.": n_layers + 2, + } + ) + for layer in range(n_layers): + key_to_depths[f"mpnet.encoder.layer.{str(layer)}."] = layer + 1 + return {key: (layer_decay ** (n_layers + 2 - depth)) for key, depth in key_to_depths.items()} + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if training_args.output_dir is None: + training_args.output_dir = model_args.task_name.lower() + if model_args.task_name is not None: + training_args.task_name = model_args.task_name + if model_args.max_seq_length is not None: + training_args.max_seq_length = model_args.max_seq_length + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + model_args.task_name = model_args.task_name.lower() + metric_class = METRIC_CLASSES[model_args.task_name] + model_args.model_type = model_args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[model_args.model_type] + + train_ds = load_dataset("glue", model_args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(model_args.model_name_or_path) + training_args.label_list = train_ds.label_list + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=training_args.label_list, + max_seq_length=model_args.max_seq_length, + ) + train_ds = train_ds.map(trans_func, lazy=True) + batchify_fn = DataCollator(tokenizer, train_ds) + + num_classes = 1 if training_args.label_list is None else len(training_args.label_list) + model = model_class.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + + if model_args.layer_lr_decay != 1.0: + layer_lr_radios_map = _get_layer_lr_radios(model_args.layer_lr_decay, n_layers=12) + for name, parameter in model.named_parameters(): + layer_lr_radio = 1.0 + for k, radio in layer_lr_radios_map.items(): + if k in name: + layer_lr_radio = radio + break + parameter.optimize_attr["learning_rate"] *= layer_lr_radio + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if training_args.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + compute_metrics_func = partial( + compute_metrics, + metric=metric, + ) + + trainer = MPNetTrainer( + model=model, + 
args=training_args, + train_dataset=train_ds if training_args.do_train else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + criterion=loss_fct, + compute_metrics=compute_metrics_func, + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/language_model/mpnet/glue/run_predict.py b/examples/language_model/mpnet/glue/run_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..5f8c3233c90cf5d1a497d76908ab006400ebc699 --- /dev/null +++ b/examples/language_model/mpnet/glue/run_predict.py @@ -0,0 +1,174 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial +import os +import paddle +from paddle.io import DataLoader +import pandas as pd +from tqdm import tqdm +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Tuple, Pad +from paddlenlp.transformers import MPNetForSequenceClassification, MPNetTokenizer +from run_glue import convert_example + +task2filename = { + "cola": "CoLA.tsv", + "sst-2": "SST-2.tsv", + "mrpc": "MRPC.tsv", + "sts-b": "STS-B.tsv", + "qqp": "QQP.tsv", + "mnli": ["MNLI-m.tsv", "MNLI-mm.tsv"], + "rte": "RTE.tsv", + "qnli": "QNLI.tsv", + "wnli": "WNLI.tsv", +} + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--ckpt_path", + default=None, + type=str, + required=True, + ) + parser.add_argument( + "--task_name", + type=str, + choices=["cola", "sst-2", "mrpc", "sts-b", "qqp", "mnli", "rte", "qnli", "wnli"], + default="cola", + required=True, + help="task_name.", + ) + + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + args = parser.parse_args() + args.task_name = args.task_name.lower() + return args + + +def predict(data_loader, model, id2label=None): + outputs = [] + progress_bar = tqdm( + range(len(data_loader)), + desc="Predition Iteration", + ) + with paddle.no_grad(): + for batch in data_loader: + input_ids, segment_ids = batch + logits = model(input_ids) + if id2label is not None: + pred = paddle.argmax(logits, axis=-1).cpu().tolist() + outputs.extend(list(map(lambda x: id2label[x], pred))) + else: + pred = logits.squeeze(-1).cpu().tolist() + outputs.extend(pred) + progress_bar.update(1) + return outputs + + +def writetsv(outputs, file): + d = {"index": list(range(len(outputs))), "prediction": outputs} + pd.DataFrame(d).to_csv(file, sep="\t", index=False) + print(f"Save to {file}.") + + +def predict2file(args): + if args.task_name == "mnli": + test_ds_matched, test_ds_mismatched = load_dataset("glue", "mnli", splits=["test_matched", "test_mismatched"]) + id2label = dict(zip(range(len(test_ds_matched.label_list)), test_ds_matched.label_list)) + else: + test_ds = load_dataset("glue", args.task_name, splits="test") + if test_ds.label_list is not None: + id2label = dict(zip(range(len(test_ds.label_list)), test_ds.label_list)) + else: + id2label = None + + model = MPNetForSequenceClassification.from_pretrained(args.ckpt_path) + model.eval() + tokenizer = MPNetTokenizer.from_pretrained(args.ckpt_path) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + ): fn(samples) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=None, + max_seq_length=args.max_seq_length, + is_test=True, + ) + + if args.task_name == "mnli": + test_ds_matched = test_ds_matched.map(trans_func, lazy=True) + test_ds_mismatched = test_ds_mismatched.map(trans_func, lazy=True) + test_batch_sampler_matched = paddle.io.BatchSampler(test_ds_matched, batch_size=args.batch_size, shuffle=False) + test_data_loader_matched = DataLoader( + dataset=test_ds_matched, + batch_sampler=test_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=2, + return_list=True, + ) + test_batch_sampler_mismatched = paddle.io.BatchSampler( + test_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + test_data_loader_mismatched = DataLoader( + dataset=test_ds_mismatched, + batch_sampler=test_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=2, + return_list=True, + ) + file_m = os.path.join("template", task2filename[args.task_name][0]) + file_mm = os.path.join("template", task2filename[args.task_name][1]) + matched_outputs = predict(test_data_loader_matched, model, id2label) + mismatched_outputs = predict(test_data_loader_mismatched, model, id2label) + writetsv(matched_outputs, file_m) + writetsv(mismatched_outputs, file_mm) + else: + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, + batch_sampler=test_batch_sampler, + collate_fn=batchify_fn, + num_workers=2, + return_list=True, + ) + predict_outputs = predict(test_data_loader, model, id2label) + + file = os.path.join("template", task2filename[args.task_name]) + writetsv(predict_outputs, file) + + +if __name__ == 
"__main__": + args = get_args() + os.makedirs("template", exist_ok=True) + predict2file(args) diff --git a/examples/language_model/mpnet/glue/train.sh b/examples/language_model/mpnet/glue/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..e068d234c8c148417c72d4cd4813faa6e5fc9217 --- /dev/null +++ b/examples/language_model/mpnet/glue/train.sh @@ -0,0 +1,194 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# ["cola","sst-2","mrpc","sts-b","qqp","mnli", "rte", "qnli"] +unset CUDA_VISIBLE_DEVICES +# QQP +# 运行训练 +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name qqp \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_steps 5666 \ + --max_steps 113272 \ + --logging_steps 1 \ + --save_steps 3 \ + --seed 42 \ + --output_dir qqp \ + --do_train \ + --do_eval \ + --device gpu + +# COLA +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name cola \ + --max_seq_length 128 \ + --per_device_train_batch_size 16 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 200 \ + --save_steps 200 \ + --seed 42 \ + --output_dir cola \ + --do_train \ + --do_eval \ + --device gpu + +# QNLI +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name qnli \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 1000 \ + --save_steps 1000 \ + --seed 42 \ + --output_dir qnli \ + --do_train \ + --do_eval \ + --device gpu + +# SST2 +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name sst-2 \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 400 \ + --save_steps 400 \ + --seed 42 \ + --output_dir sst-2 \ + --do_train \ + --do_eval \ + --device gpu + + +############################################################################################################################################ +# 先训练这个模型,之后需要使用这个权重!(RTE,MRPC和STS-B用了MNLI做初始化,与roberta一致) +# MNLI +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path mpnet-base \ + --task_name mnli \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + 
--layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 1000 \ + --save_steps 1000 \ + --seed 42 \ + --output_dir mnli \ + --do_train \ + --do_eval \ + --device gpu + +######################################################## +# RTE +export MNLI_BEST_CKPT=/path/to/mnli/best/ckpt +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path $MNLI_BEST_CKPT \ + --task_name rte \ + --max_seq_length 128 \ + --per_device_train_batch_size 16 \ + --learning_rate 2e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 13 \ + --logging_steps 100 \ + --save_steps 100 \ + --seed 42 \ + --output_dir rte \ + --do_train \ + --do_eval \ + --device gpu + +############################################################ +# MRPC +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path $MNLI_BEST_CKPT \ + --task_name mrpc \ + --max_seq_length 128 \ + --per_device_train_batch_size 16 \ + --learning_rate 1e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 100 \ + --save_steps 100 \ + --seed 42 \ + --output_dir mrpc \ + --do_train \ + --do_eval \ + --device gpu + +############################################################ +# STSB +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type mpnet \ + --model_name_or_path $MNLI_BEST_CKPT \ + --task_name rte \ + --max_seq_length 128 \ + --per_device_train_batch_size 16 \ + --learning_rate 2e-5 \ + --lr_scheduler_type linear \ + --layer_lr_decay 1.0 \ + --weight_decay 0.1 \ + --warmup_ratio 0.06 \ + --num_train_epochs 10 \ + --logging_steps 100 \ + --save_steps 100 \ + --seed 42 \ + --output_dir rte \ + --do_train \ + --do_eval \ + --device gpu + +############################################################ + diff --git a/examples/language_model/mpnet/squad/run_squad.py b/examples/language_model/mpnet/squad/run_squad.py new file mode 100644 index 0000000000000000000000000000000000000000..e60e208634805640d2ef2ff4d76fa687c4f65587 --- /dev/null +++ b/examples/language_model/mpnet/squad/run_squad.py @@ -0,0 +1,709 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
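+
+# Fine-tunes MPNet (or BERT/ERNIE) on SQuAD v1.1 / v2.0. MPNetTrainer below
+# overrides get_eval_dataloader() and evaluation_loop() so that predictions can
+# be post-processed with compute_prediction()/squad_evaluate() against the raw
+# validation examples. Pass --version_2_with_negative for SQuAD v2.0; see the
+# MPNet README for the full launch commands.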
+ +import json +from collections import OrderedDict +from dataclasses import dataclass, field +from functools import partial +from typing import List, Optional + +import numpy as np +import paddle +from datasets import load_dataset +from paddle.io import DataLoader, Dataset + +from paddlenlp.data import Pad, Stack +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.trainer.trainer_utils import ( + EvalLoopOutput, + EvalPrediction, + IterableDatasetShard, + find_batch_size, + has_length, +) +from paddlenlp.trainer.utils.helper import ( + nested_concat, + nested_numpify, + nested_truncate, +) +from paddlenlp.transformers import ( + BertForQuestionAnswering, + BertTokenizer, + ErnieForQuestionAnswering, + ErnieTokenizer, + MPNetForQuestionAnswering, + MPNetTokenizer, +) +from paddlenlp.utils.batch_sampler import ( + DistributedBatchSampler as NlpDistributedBatchSampler, +) +from paddlenlp.utils.log import logger + + +def is_datasets_available(): + import importlib + + return importlib.util.find_spec("datasets") is not None + + +class MPNetTrainer(Trainer): + def set_eval_collator(self, collator): + self.eval_collate_fn = collator + + def set_eval_raw_dataset(self, raw_dataset): + self.eval_raw_dataset = raw_dataset + + def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader: + """ + Returns the evaluation [`~paddle.io.DataLoader`]. + Subclass and override this method if you want to inject some custom behavior. + Args: + eval_dataset (`paddle.io.Dataset`, *optional*): + If provided, will override `self.eval_dataset`. If it is an `datasets.Dataset`, columns not accepted by + the `model.forward()` method are automatically removed. It must implement `__len__`. + """ + if eval_dataset is None and self.eval_dataset is None: + raise ValueError("Trainer: evaluation requires an eval_dataset.") + eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset + + if self._is_iterable_dataset(eval_dataset): + if self.args.world_size > 1: + eval_dataset = IterableDatasetShard( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + drop_last=self.args.dataloader_drop_last, + num_processes=self.args.world_size, + process_index=self.args.process_index, + ) + + return DataLoader( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + collate_fn=self.eval_collate_fn, + num_workers=self.args.dataloader_num_workers, + ) + + eval_sampler = self._get_eval_sampler(eval_dataset) + + return DataLoader( + eval_dataset, + batch_sampler=eval_sampler, + collate_fn=self.eval_collate_fn, + num_workers=self.args.dataloader_num_workers, + ) + + def evaluation_loop( + self, + dataloader: DataLoader, + description: str, + prediction_loss_only: Optional[bool] = None, + ignore_keys: Optional[List[str]] = None, + metric_key_prefix: str = "eval", + max_eval_iters: Optional[int] = -1, + ) -> EvalLoopOutput: + """ + Prediction/evaluation loop, shared by `Trainer.evaluate()` and `Trainer.predict()`. + Works both with or without labels. 
+ """ + args = self.args + + prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else args.prediction_loss_only + + model = self.model + + if isinstance(dataloader, paddle.io.DataLoader): + batch_size = dataloader.batch_sampler.batch_size + elif isinstance(dataloader, paddle.io.dataloader.dataloader_iter._DataLoaderIterBase): + # support for inner dataloader + batch_size = dataloader._batch_sampler.batch_size + # alias for inner dataloader + dataloader.dataset = dataloader._dataset + else: + raise ValueError("Only support for paddle.io.DataLoader") + + num_samples = None + if max_eval_iters > 0: + # on eval limit steps + num_samples = batch_size * self.args.world_size * max_eval_iters + if isinstance(dataloader, paddle.io.dataloader.dataloader_iter._DataLoaderIterBase) and isinstance( + dataloader._batch_sampler, NlpDistributedBatchSampler + ): + consumed_samples = ( + ((self.state.global_step) // args.eval_steps) + * max_eval_iters + * args.per_device_eval_batch_size + * args.world_size + ) + dataloader._batch_sampler.set_epoch(consumed_samples=consumed_samples) + + logger.info(f"***** Running {description} *****") + if has_length(dataloader): + logger.info(f" Num examples = {self.num_examples(dataloader)}") + if max_eval_iters > 0: + logger.info(f" Total prediction steps = {max_eval_iters}") + else: + logger.info(f" Total prediction steps = {len(dataloader)}") + else: + logger.info(" Num examples: Unknown") + if max_eval_iters > 0: + logger.info(f" Total prediction steps = {max_eval_iters}") + + logger.info(f" Pre device batch size = {batch_size}") + logger.info(f" Total Batch size = {batch_size * self.args.world_size}") + + model.eval() + + self.callback_handler.eval_dataloader = dataloader + # Do this before wrapping. + eval_dataset = dataloader.dataset + + if args.past_index >= 0: + self._past = None + + # Initialize containers + # losses/preds/labels on GPU (accumulated for eval_accumulation_steps) + losses_host = None + preds_host = None + labels_host = None + # losses/preds/labels on CPU (final containers) + all_losses = None + all_preds = None + all_labels = None + # Will be useful when we have an iterable dataset so don't know its length. + + observed_num_examples = 0 + # Main evaluation loop + losses = [] + for step, inputs in enumerate(dataloader): + # Update the observed num examples + observed_batch_size = find_batch_size(inputs) + if observed_batch_size is not None: + observed_num_examples += observed_batch_size + # For batch samplers, batch_size is not known by the dataloader in advance. 
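+                # Fall back to the size of the first observed batch; it is reused
+                # when tiling the per-batch loss into `losses` further below.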
+ if batch_size is None: + batch_size = observed_batch_size + + # Prediction step + loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys) + + # Update containers on host + if loss is not None: + # losses = self._nested_gather(loss.repeat(batch_size)) + # losses = self._nested_gather(loss) + losses = self._nested_gather(paddle.tile(loss, repeat_times=[batch_size, 1])) + losses_host = losses if losses_host is None else paddle.concat((losses_host, losses), axis=0) + + if labels is not None: + labels = self._pad_across_processes(labels) + labels = self._nested_gather(labels) + labels_host = labels if labels_host is None else nested_concat(labels_host, labels, padding_index=-100) + if logits is not None: + logits = self._pad_across_processes(logits) + logits = self._nested_gather(logits) + if self.preprocess_logits_for_metrics is not None: + logits = self.preprocess_logits_for_metrics(logits, labels) + preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100) + self.control = self.callback_handler.on_prediction_step(args, self.state, self.control) + if max_eval_iters > 0 and step >= max_eval_iters - 1: + break + + # Gather all remaining tensors and put them back on the CPU + if losses_host is not None: + losses = nested_numpify(losses_host) + all_losses = losses if all_losses is None else np.concatenate((all_losses, losses), axis=0) + if preds_host is not None: + logits = nested_numpify(preds_host) + all_preds = logits if all_preds is None else nested_concat(all_preds, logits, padding_index=-100) + if labels_host is not None: + labels = nested_numpify(labels_host) + all_labels = labels if all_labels is None else nested_concat(all_labels, labels, padding_index=-100) + + # Number of samples + if num_samples is not None: + pass + elif has_length(eval_dataset): + num_samples = len(eval_dataset) + # The instance check is weird and does not actually check for the type, but whether the dataset has the right + # methods. Therefore we need to make sure it also has the attribute. + elif isinstance(eval_dataset, IterableDatasetShard) and hasattr(eval_dataset, "num_examples"): + num_samples = eval_dataset.num_examples + else: + if has_length(dataloader): + num_samples = self.num_examples(dataloader) + else: # both len(dataloader.dataset) and len(dataloader) fail + num_samples = observed_num_examples + + # Number of losses has been rounded to a multiple of batch_size and in a distributed training, the number of + # samplers has been rounded to a multiple of batch_size, so we truncate. 
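+        # Drop the padded/duplicated tail so metrics are computed on exactly
+        # num_samples predictions.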
+ if all_losses is not None: + all_losses = all_losses[:num_samples] + if all_preds is not None: + all_preds = nested_truncate(all_preds, num_samples) + if all_labels is not None: + all_labels = nested_truncate(all_labels, num_samples) + + model.train() + + if self.compute_metrics is not None and all_preds is not None: + metrics = self.compute_metrics( + EvalPrediction(predictions=all_preds, label_ids=all_labels), + data_loader=dataloader, + raw_dataset=self.eval_raw_dataset, + ) + else: + metrics = {} + + if all_losses is not None: + metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item() + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + return EvalLoopOutput(predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples) + + +MODEL_CLASSES = { + "bert": (BertForQuestionAnswering, BertTokenizer), + "ernie": (ErnieForQuestionAnswering, ErnieTokenizer), + "mpnet": (MPNetForQuestionAnswering, MPNetTokenizer), +} + + +@dataclass +class ModelArguments: + max_seq_length: Optional[int] = field( + default=128, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded." + ) + }, + ) + model_type: Optional[str] = field( + default="convbert", + metadata={"help": ("Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))}, + ) + model_name_or_path: Optional[str] = field( + default="convbert-base", + metadata={ + "help": ( + "Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum( + [list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], + [], + ) + ), + ) + }, + ) + layer_lr_decay: Optional[float] = field( + default=1.0, + metadata={"help": ("layer_lr_decay")}, + ) + doc_stride: Optional[int] = field( + default=128, + metadata={"help": ("When splitting up a long document into chunks, how much stride to take between chunks.")}, + ) + n_best_size: Optional[int] = field( + default=20, + metadata={ + "help": ("The total number of n-best predictions to generate in the nbest_predictions.json output file.") + }, + ) + null_score_diff_threshold: Optional[float] = field( + default=0.0, + metadata={"help": ("If null_score - best_non_null is greater than the threshold predict null.")}, + ) + max_query_length: Optional[int] = field( + default=64, + metadata={"help": ("Max query length.")}, + ) + max_answer_length: Optional[int] = field( + default=30, + metadata={"help": ("Max answer length.")}, + ) + do_lower_case: Optional[bool] = field( + default=False, + metadata={ + "help": ( + "Whether to lower case the input text. Should be True for uncased models and False for cased models." 
+ ) + }, + ) + verbose: Optional[bool] = field( + default=False, + metadata={"help": ("Whether to output verbose log.")}, + ) + version_2_with_negative: Optional[bool] = field( + default=False, + metadata={ + "help": ( + "If true, the SQuAD examples contain some that do not have an answer.", + "If using squad v2.0, it should be set true.", + ) + }, + ) + + +@dataclass +class DataArguments: + train_file: Optional[str] = field( + default=None, + metadata={"help": "Train data path."}, + ) + predict_file: Optional[str] = field( + default=None, + metadata={"help": "Predict data path."}, + ) + + +def _get_layer_lr_radios(layer_decay=0.8, n_layers=12): + """Have lower learning rates for layers closer to the input.""" + key_to_depths = OrderedDict( + { + "mpnet.embeddings.": 0, + "mpnet.encoder.relative_attention_bias.": 0, + "qa_outputs.": n_layers + 2, + } + ) + for layer in range(n_layers): + key_to_depths[f"mpnet.encoder.layer.{str(layer)}."] = layer + 1 + return {key: (layer_decay ** (n_layers + 2 - depth)) for key, depth in key_to_depths.items()} + + +def prepare_train_features(examples, tokenizer, args): + # Some of the questions have lots of whitespace on the left, which is not useful and will make the + # truncation of the context fail (the tokenized question will take a lots of space). So we remove that + # left whitespace + contexts = examples["context"] + questions = examples["question"] + + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + tokenized_examples = tokenizer( + questions, contexts, max_length=args.max_seq_length, stride=args.doc_stride, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_A_lengths = input_ids.index(tokenizer.sep_token_id) + 2 + sequence_B_lengths = len(input_ids) - sequence_A_lengths + sequence_ids = [0] * sequence_A_lengths + [1] * sequence_B_lengths + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. 
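+                # Illustrative example: answer_start = [41] and text = ["Denver Broncos"]
+                # give start_char = 41 and end_char = 41 + 14 = 55.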
+ start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=args.max_seq_length, return_attention_mask=True + ) + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + input_ids = tokenized_examples["input_ids"][i] + sequence_A_lengths = input_ids.index(tokenizer.sep_token_id) + 2 + sequence_B_lengths = len(input_ids) - sequence_A_lengths + sequence_ids = [0] * sequence_A_lengths + [1] * sequence_B_lengths + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
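+ # For example, an offset entry such as (0, 5) is kept for a context token but replaced with
+ # None for question and special tokens (illustrative values).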
+ tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + +@dataclass +class TrainDataCollator: + def __init__(self, tokenizer): + self.tokenizer = tokenizer + + def __call__(self, features): + input_ids = [] + start_positions = [] + end_positions = [] + batch = {} + + for feature in features: + input_ids.append(feature["input_ids"]) + start_positions.append(feature["start_positions"]) + end_positions.append(feature["end_positions"]) + + input_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids),) # input_ids + start_positions = (Stack(dtype="int64")(start_positions),) # start_positions + end_positions = (Stack(dtype="int64")(end_positions),) # end_positions + + batch["input_ids"] = input_ids[0] + batch["start_positions"] = start_positions[0] + batch["end_positions"] = end_positions[0] + + return batch + + +@dataclass +class EvalDataCollator: + def __init__(self, tokenizer): + self.tokenizer = tokenizer + + def __call__(self, features): + input_ids = [] + batch = {} + + for feature in features: + input_ids.append(feature["input_ids"]) + + input_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(input_ids),) # input_ids + + batch["input_ids"] = input_ids[0] + + return batch + + +def compute_metrics(eval_preds, data_loader=None, raw_dataset=None, model_args=None): + start_logits, end_logits = eval_preds.predictions + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, + data_loader.dataset, + (start_logits, end_logits), + model_args.version_2_with_negative, + model_args.n_best_size, + model_args.max_answer_length, + model_args.null_score_diff_threshold, + ) + squad_evaluate(examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json) + return {} + + +@paddle.no_grad() +def evaluate(model, data_loader, raw_dataset, args, global_step, write_predictions=False): + model.eval() + + all_start_logits = [] + all_end_logits = [] + + for batch in data_loader: + input_ids = batch[0] + start_logits_tensor, end_logits_tensor = model(input_ids) + + for idx in range(start_logits_tensor.shape[0]): + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, + data_loader.dataset, + (all_start_logits, all_end_logits), + args.version_2_with_negative, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + if write_predictions: + with open(f"{str(global_step)}_prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + squad_evaluate(examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json) + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, 
label=end_position) + loss = (start_loss + end_loss) / 2 + + return loss + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + model_args.model_type = model_args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[model_args.model_type] + + tokenizer = tokenizer_class.from_pretrained(model_args.model_name_or_path) + model = model_class.from_pretrained(model_args.model_name_or_path) + + if training_args.do_train: + # layer_lr for base + if model_args.layer_lr_decay != 1.0: + layer_lr_radios_map = _get_layer_lr_radios(model_args.layer_lr_decay, n_layers=12) + for name, parameter in model.named_parameters(): + layer_lr_radio = 1.0 + for k, radio in layer_lr_radios_map.items(): + if k in name: + layer_lr_radio = radio + break + parameter.optimize_attr["learning_rate"] *= layer_lr_radio + + if model_args.version_2_with_negative: + train_examples = load_dataset("squad_v2", split="train") + else: + train_examples = load_dataset("squad", split="train") + column_names = train_examples.column_names + train_ds = train_examples.map( + partial(prepare_train_features, tokenizer=tokenizer, args=model_args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + + if training_args.do_eval: + if model_args.version_2_with_negative: + dev_examples = load_dataset("squad_v2", split="validation") + else: + dev_examples = load_dataset("squad", split="validation") + column_names = dev_examples.column_names + dev_ds = dev_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=model_args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + + batchify_fn_train = TrainDataCollator(tokenizer) + batchify_fn_eval = EvalDataCollator(tokenizer) + criterion = CrossEntropyLossForSQuAD() + + compute_metrics_func = partial( + compute_metrics, + model_args=model_args, + ) + + trainer = MPNetTrainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn_train, + criterion=criterion, + compute_metrics=compute_metrics_func, + ) + trainer.set_eval_collator(batchify_fn_eval) + trainer.set_eval_raw_dataset(dev_examples) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/language_model/opt b/examples/language_model/opt new file mode 100644 index 0000000000000000000000000000000000000000..d095192093bdd412fce64ee16050e2dca4aca81c --- /dev/null +++ b/examples/language_model/opt @@ -0,0 +1 @@ +../../llm/opt \ No newline at end of file diff --git a/examples/language_model/rembert/README.md b/examples/language_model/rembert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5a3bf3bc25497359d6e060d3b5d1d19a41c3c812 --- /dev/null +++ b/examples/language_model/rembert/README.md @@ -0,0 +1,95 @@ +# RemBert with 
PaddleNLP
+
+[RemBERT: Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821v1.pdf)
+
+**模型简介:**
+作者发现,分离词嵌入为语言模型建模提供了更好的灵活性,使我们能够显著提高多语言模型输入词嵌入中参数分
+配的效率。通过在transformers层中重新分配输入词嵌入参数,在微调过程中,相比于具有相同参数量的
+模型,在自然语言理解任务上获得了更好的性能。作者还发现,增大输出词嵌入维度可以提升模型的性能,
+即使在预训练结束后丢弃输出词嵌入,这一优势在微调阶段仍能保持。作者分析表明,增大输出词嵌入维度
+可以防止模型在预训练数据集上过拟合,并让模型在其他NLP数据集上有更强的泛化能力。利用这些发现,我们能够
+训练性能更强大的模型,而无需在微调阶段增加参数。
+
+## 快速开始
+
+### 下游任务微调
+
+#### 数据集
+下载XTREME-XNLI数据集:
+训练集:[下载地址](https://dl.fbaipublicfiles.com/XNLI/XNLI-MT-1.0.zip)
+测试集:[下载地址](https://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zip)
+其中训练集位于`XNLI-MT-1.0/multinli/multinli.train.en.tsv`, 测试集位于`XNLI-1.0/xnli.test.tsv`
+
+下载XTREME-PAWS-X数据集:
+[下载地址](https://storage.googleapis.com/paws/pawsx/x-final.tar.gz)
+每个训练集、验证集和测试集分别为`train`、`dev`和`test`开头的`tsv`文件, 将所有语言的数据集解压后,请合并所有语言的测试集到一个文件(此任务需要在多种语言上进行测试)
+
+#### 1、XTREME-XNLI
+以XTREME-XNLI数据集为例,运行以下命令即可训练并评估RemBert在该数据集上的精度:
+
+```shell
+python -m paddle.distributed.launch examples/language_model/rembert/main.py \
+ --model_type rembert \
+ --data_dir data/ \
+ --output_dir output/ \
+ --device gpu \
+ --learning_rate 1e-5 \
+ --num_train_epochs 3 \
+ --train_batch_size 16 \
+ --do_train \
+ --do_eval \
+ --task xnli \
+ --eval_step 500
+```
+其中参数释义如下:
+- `model_type` 指示了模型类型,当前支持`rembert`。
+- `data_dir` 数据集路径。
+- `train_batch_size` 表示每次迭代**每张卡**上的样本数目。
+- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。
+- `output_dir` 表示模型保存路径。
+- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用多GPU训练,将其设置为GPU,同时通过环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。
+- `num_train_epochs` 表示需要训练的epoch数量。
+- `do_train` 表示是否开启训练。
+- `do_eval` 表示是否开启评估。
+- `task` 表示训练的任务。
+- `eval_step` 表示训练多少步评估一次模型。
+
+训练结束后会自动对模型进行评估,你将看到如下结果:
+```bash
+Accuracy 0.8089
+```
+
+#### 2、XTREME-PAWS-X
+在此数据集上训练使用如下命令:
+
+```shell
+python -m paddle.distributed.launch examples/language_model/rembert/main.py \
+ --model_type rembert \
+ --data_dir data/ \
+ --output_dir output/ \
+ --device gpu \
+ --learning_rate 8e-6 \
+ --num_train_epochs 3 \
+ --train_batch_size 16 \
+ --do_train \
+ --do_eval \
+ --task paws \
+ --eval_step 500
+```
+训练结束后会自动在测试集上对模型进行评估,你将看到如下结果:
+```bash
+Accuracy 0.8778
+```
+
+
+# Reference
+
+```bibtex
+@article{chung2020rethinking,
+ title={Rethinking embedding coupling in pre-trained language models},
+ author={Chung, Hyung Won and Fevry, Thibault and Tsai, Henry and Johnson, Melvin and Ruder, Sebastian},
+ journal={arXiv preprint arXiv:2010.12821},
+ year={2020}
+}
+```
diff --git a/examples/language_model/rembert/data_processor.py b/examples/language_model/rembert/data_processor.py
new file mode 100644
index 0000000000000000000000000000000000000000..f5f6568d7a68e7da0e792312bdc252e7a4a7e8e5
--- /dev/null
+++ b/examples/language_model/rembert/data_processor.py
@@ -0,0 +1,146 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
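+
+# NOTE: the processors below read TSV files and eagerly convert each text pair to token ids with a
+# module-level RemBertTokenizer, so downstream batching only needs to pad the precomputed ids.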
+ +import os +from paddlenlp.transformers import RemBertTokenizer +import csv +from paddle.io import Dataset + +tokenization = RemBertTokenizer.from_pretrained("rembert") + + +class InputExample(object): + """ + Use classes to store each example + """ + + def __init__(self, guid, text_a, text_b=None, label=None): + self.guid = guid + self.text_a = text_a + self.text_b = text_b + self.label = label + + +class MrpcProcessor(object): + """Load the dataset and convert each example text to ids""" + + def get_train_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev_2k.tsv")), "dev") + + def get_test_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "test_2k.tsv")), "test") + + def get_labels(self): + return ["0", "1"] + + def _create_examples(self, lines, set_type): + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, i) + text_a = tokenization(line[1])["input_ids"] + text_b = tokenization(line[2])["input_ids"] + label = int(line[3]) + examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="utf-8") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + lines.append(line) + return lines + + +class XNLIProcessor(object): + """Load the dataset and convert each example text to ids""" + + def get_train_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "multinli.train.en.tsv")), "train") + + def get_dev_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv")), "dev") + + def get_test_examples(self, data_dir): + return self._create_examples(self._read_tsv(os.path.join(data_dir, "xnli.test.tsv")), "test") + + def get_labels(self): + return ["neutral", "entailment", "contradictory"] + + def _create_examples(self, lines, set_type): + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, i) + if set_type == "train": + text_a = " ".join(line[0].strip().split(" ")) + text_b = " ".join(line[1].strip().split(" ")) + text_a = tokenization(text_a)["input_ids"] + text_b = tokenization(text_b)["input_ids"] + label = self.get_labels().index(line[2].strip()) + examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + else: + text_a = " ".join(line[6].strip().split(" ")) + text_b = " ".join(line[7].strip().split(" ")) + if line[1] == "contradiction": + line[1] = "contradictory" + label = self.get_labels().index(line[1].strip()) + text_a = tokenization(text_a)["input_ids"] + text_b = tokenization(text_b)["input_ids"] + examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with open(input_file, "r", encoding="utf-8") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + lines.append(line) + return lines + + +class DataGenerator(Dataset): + """Data generator is used to feed features into dataloader.""" + + def 
__init__(self, features): + super(DataGenerator, self).__init__() + self.features = features + + def __getitem__(self, item): + text_a = self.features[item].text_a + text_b = self.features[item].text_b + text_a_token_type_ids = [0] * len(text_a) + text_b_token_type_ids = [1] * len(text_b) + label = [self.features[item].label] + + return dict( + text_a=text_a, + text_b=text_b, + text_a_token_type_ids=text_a_token_type_ids, + text_b_token_type_ids=text_b_token_type_ids, + label=label, + ) + + def __len__(self): + return len(self.features) diff --git a/examples/language_model/rembert/main.py b/examples/language_model/rembert/main.py new file mode 100644 index 0000000000000000000000000000000000000000..f48e5751080294b5520c15a2740e7ea47bd949ad --- /dev/null +++ b/examples/language_model/rembert/main.py @@ -0,0 +1,180 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import random + +import numpy as np +import paddle +import paddle.distributed as dist +from data_processor import DataGenerator, MrpcProcessor, XNLIProcessor +from paddle.io import DataLoader, DistributedBatchSampler +from paddle.metric import Accuracy +from tqdm import tqdm +from trainer import Trainer + +from paddlenlp.transformers import RemBertForSequenceClassification + +logger = logging.getLogger(__name__) + +parser = argparse.ArgumentParser(description="RemBert For Sequence Classification") +parser.add_argument("--data_dir", type=str, default=None, help="Data path.") +parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") +parser.add_argument("--do_eval", action="store_true", help="Whether to predict.") +parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.") +parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") +parser.add_argument("--train_batch_size", type=int, default=16, help="per gpu batch size during thr training.") +parser.add_argument("--eval_batch_size", type=int, default=16, help="per gpu batch size during thr evaluating.") +parser.add_argument( + "--output_dir", + default="outputs", + type=str, + help="The output directory where the model predictions and checkpoints will be written. " "Default as `outputs`", +) +parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." 
+) +parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=2, + help="Proportion of training steps to perform linear learning rate warmup for.", +) +parser.add_argument( + "--warmup_proportion", + type=float, + default=0.02, + help="Proportion of training steps to perform linear learning rate warmup for.", +) +parser.add_argument("--learning_rate", type=float, default=8e-6, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", type=float, default=0.01, help="Weight decay if we apply some.") +parser.add_argument("--task", type=str, required=True, help="Training task") +parser.add_argument("--model_type", default="rembert", type=str, help="Type of pre-trained model.") +parser.add_argument("--eval_step", type=int, default=2000, help="Eavlate the model once after training step X.") +args = parser.parse_args() + +nranks = paddle.distributed.ParallelEnv().nranks + + +def set_seed(seed): + paddle.seed(seed) + random.seed(seed) + np.random.seed(seed) + + +def load_example(args, mode="train"): + """Load data to DataLoader""" + if args.task == "paws": + processor = MrpcProcessor() + if args.task == "xnli": + processor = XNLIProcessor() + if mode == "train": + examples = processor.get_train_examples(args.data_dir) + elif mode == "dev": + examples = processor.get_dev_examples(args.data_dir) + else: + examples = processor.get_test_examples(args.data_dir) + + datagenerator = DataGenerator(examples) + + def collate_fn(batch): + def create_padded_sequence(key, padding_value): + """Pad sequence to max length""" + pad_sequence = [] + max_len = 0 + for example in batch: + if len(example[key]) > max_len: + max_len = len(example[key]) + for example in batch: + pad_sequence.append(example[key] + [padding_value] * (max_len - len(example[key]))) + return np.array(pad_sequence, dtype="int64") + + text_a = create_padded_sequence("text_a", 0) # pad text_a input_ids + text_b = create_padded_sequence("text_b", 0) # pad text_b input_ids + text_a_token_type_ids = create_padded_sequence("text_a_token_type_ids", 0) # pad text_a_token_type_ids + text_b_token_type_ids = create_padded_sequence("text_b_token_type_ids", 1) # pad text_b_token_type_ids + label = create_padded_sequence("label", 0) # label will not pad, just convert to numpy array + + input_ids = np.concatenate([text_a, text_b], axis=-1)[:, : args.max_seq_length] + token_type_ids = np.concatenate([text_a_token_type_ids, text_b_token_type_ids], axis=-1)[ + :, : args.max_seq_length + ] + + return input_ids, token_type_ids, label + + if mode in ("dev", "test"): + dataloader = DataLoader(datagenerator, batch_size=args.eval_batch_size, shuffle=False, collate_fn=collate_fn) + else: + sampler = DistributedBatchSampler( + datagenerator, batch_size=args.train_batch_size, shuffle=True, drop_last=False + ) + dataloader = DataLoader(datagenerator, batch_sampler=sampler, collate_fn=collate_fn) + + return dataloader, processor + + +def run(args): + if args.do_train: + train_dataloader, processor = load_example(args, "train") + num_label = len(processor.get_labels()) + model = RemBertForSequenceClassification.from_pretrained(args.model_type, num_classes=num_label) + if nranks > 1: + dist.init_parallel_env() + model = paddle.DataParallel(model) + + num_train_steps_per_epoch = len(train_dataloader) // args.gradient_accumulation_steps + num_train_steps = int(num_train_steps_per_epoch * args.num_train_epochs) + trainer = Trainer( + args, model=model, dataloader=train_dataloader, num_train_steps=num_train_steps, 
step_callback=evaluate + ) + trainer.train() + + if args.do_eval: + model = RemBertForSequenceClassification.from_pretrained(args.output_dir) + evaluate(model, args) + + +def evaluate(model, args, mode="test"): + """evaluate the model""" + model.eval() + metric = Accuracy() + eval_dataloader, processor = load_example(args, mode) + for batch in tqdm(eval_dataloader, total=len(eval_dataloader)): + logits = model(input_ids=batch[0], token_type_ids=batch[1]) + labels = batch[2].reshape( + ( + -1, + 1, + ) + ) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("Accuracy:", res) + model.train() + return res + + +if __name__ == "__main__": + set_seed(args.seed) + paddle.set_device(args.device) + run(args) diff --git a/examples/language_model/rembert/trainer.py b/examples/language_model/rembert/trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..314cb118d8392e5ccf9a3205164416136983491e --- /dev/null +++ b/examples/language_model/rembert/trainer.py @@ -0,0 +1,102 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at + +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from paddle.optimizer import AdamW +from tqdm import tqdm + +from paddlenlp.transformers import LinearDecayWithWarmup + + +def _create_model_arguments(batch): + return batch + + +class Trainer(object): + def __init__(self, args, model, dataloader, num_train_steps, step_callback=None): + self.args = args + self.model = model + self.dataloader = dataloader + self.num_train_steps = num_train_steps + self.step_callback = step_callback + + self.optimizer, self.scheduler = self._create_optimizer(model) + self.scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + self.wd_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + def train(self): + model = self.model + + epoch = 0 + global_step = 0 + tr_loss = 0.0 + acc = 0.0 + + model.train() + model, self.optimizer = paddle.amp.decorate( + models=model, optimizers=self.optimizer, level="O2", master_weight=None, save_dtype="float32" + ) + + with tqdm(total=self.num_train_steps) as pbar: + while True: + for step, batch in enumerate(self.dataloader): + with paddle.amp.auto_cast(enable=True, custom_white_list=None, custom_black_list=None, level="O2"): + logits = model(input_ids=batch[0], token_type_ids=batch[1]) + + loss = paddle.nn.CrossEntropyLoss()(logits, batch[2].reshape((-1,))) + + if self.args.gradient_accumulation_steps > 1: + loss = loss / self.args.gradient_accumulation_steps + scaled = self.scaler.scale(loss) + scaled.backward() + if (step + 1) % self.args.gradient_accumulation_steps == 0: + self.scaler.minimize(self.optimizer, scaled) + self.scheduler.step() + self.optimizer.clear_grad() + pbar.set_description("epoch: {} loss: {} acc: {}".format(epoch, loss.numpy(), acc)) + pbar.update() + global_step += 1 + + if global_step == self.num_train_steps: + break + if (step + 1) % self.args.eval_step == 0: + ac = self.step_callback(model, 
self.args) + if ac > acc: + acc = ac + model.save_pretrained(self.args.output_dir) + + if global_step == self.num_train_steps: + break + epoch += 1 + + return model, global_step, tr_loss / global_step + + def _create_optimizer(self, model): + scheduler = self._create_scheduler() + clip = paddle.nn.ClipGradByNorm(clip_norm=1.0) + return ( + AdamW( + parameters=model.parameters(), + grad_clip=clip, + learning_rate=scheduler, + beta1=0.9, + apply_decay_param_fun=lambda x: x in self.wd_params, + weight_decay=self.args.weight_decay, + beta2=0.99, + ), + scheduler, + ) + + def _create_scheduler(self): + return LinearDecayWithWarmup(self.args.learning_rate, self.num_train_steps, self.args.warmup_proportion) diff --git a/examples/language_model/rnnlm/README.md b/examples/language_model/rnnlm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..efd6459bd911c12c3841a74b9fae37bf09f4a394 --- /dev/null +++ b/examples/language_model/rnnlm/README.md @@ -0,0 +1,68 @@ +# 语言模型 + +# 简介 + +## 1. 任务说明 +本文主要介绍基于lstm的语言的模型的实现,给定一个输入词序列(中文分词、英文tokenize),计算其ppl(语言模型困惑度,用户表示句子的流利程度),基于循环神经网络语言模型的介绍可以[参阅论文](https://arxiv.org/abs/1409.2329)。相对于传统的方法,基于循环神经网络的方法能够更好的解决稀疏词的问题。 + + +## 2. 效果说明 + +| | train | valid | test | +| :------------- | :---------: | :--------: | :----------: | +| PaddlePaddle | 47.234 | 86.801 | 83.159 | +| Tensorflow | 45.594 | 87.363 | 84.015 | + + + +## 3. 数据集 + +此任务的数据集合是采用ptb dataset,下载地址为: http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz + + +# 快速开始 + +### 数据准备 +为了方便开发者进行测试,我们内置了数据下载脚本,默认自动下载PTB数据集。 + +### 训练或Fine-tune + +任务训练启动命令如下: + +``` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ +``` + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型到checkpoint、中。 +还可以在启动命令后以--的形式修改网络参数或数据位置,具体可修改的参数和参数的默认值参考`args.py`。 + +**NOTE:** 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/test`即可,程序会自动加载模型参数`checkpoints/test.pdparams`,也会自动加载优化器状态`checkpoints/test.pdopt`。 + +# 进阶使用 + +## 任务定义与建模 +此任务目的是给定一个输入的词序列,预测下一个词出现的概率。 + +## 模型原理介绍 +此任务采用了序列任务常用的rnn网络,实现了一个两层的lstm网络,然后lstm的结果去预测下一个词出现的概率。 + +由于数据的特殊性,每一个batch的last hidden和last cell会被作为下一个batch 的init hidden 和 init cell。 + + +## 数据格式说明 +此任务的数据格式比较简单,每一行为一个已经分好词(英文的tokenize)的词序列。 + +目前的句子示例如下图所示: +``` +aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter +pierre N years old will join the board as a nonexecutive director nov. N +mr. is chairman of n.v. 
the dutch publishing group +``` + +特殊说明:ptb的数据比较特殊,ptb的数据来源于一些文章,相邻的句子可能来源于一个段落或者相邻的段落,ptb 数据不能做shuffle。 + + +## 如何组建自己的模型 ++ **自定义数据:** 关于数据,如果可以把自己的数据先进行分词(或者tokenize),通过`--data_path`来指定本地数据集所在文件夹,并需要在`train.py`中修改对应的文件名称。 ++ **网络结构更改:** 网络只实现了基于lstm的语言模型,用户可以自己的需求更换为gru等网络结构,这些实现都是在`model.py`中定义。 diff --git a/examples/language_model/rnnlm/args.py b/examples/language_model/rnnlm/args.py new file mode 100644 index 0000000000000000000000000000000000000000..21b586b83e5c5306d82f368e045fa78dbd9f0495 --- /dev/null +++ b/examples/language_model/rnnlm/args.py @@ -0,0 +1,23 @@ +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--data_path", type=str, default=None, help="all the data for train,valid,test") + parser.add_argument("--batch_size", type=int, default=20, help="batch size") + parser.add_argument("--hidden_size", type=int, default=650, help="hidden_size") + parser.add_argument("--num_steps", type=int, default=35, help="num steps") + parser.add_argument("--num_layers", type=int, default=2, help="num_layers") + parser.add_argument("--max_grad_norm", type=float, default=5.0, help="max grad norm") + parser.add_argument("--dropout", type=float, default=0.5, help="dropout") + parser.add_argument("--epoch_start_decay", type=int, default=6, help="epoch_start_decay") + parser.add_argument("--max_epoch", type=int, default=39, help="max_epoch") + parser.add_argument("--lr_decay", type=float, default=0.8, help="lr_decay") + parser.add_argument("--base_lr", type=float, default=1.0, help="base_lr") + parser.add_argument("--init_scale", type=float, default=0.05, help="init_scale") + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + args = parser.parse_args() + return args diff --git a/examples/language_model/rnnlm/model.py b/examples/language_model/rnnlm/model.py new file mode 100644 index 0000000000000000000000000000000000000000..99c04c43d32a87fc669b834b312a86e54f1565be --- /dev/null +++ b/examples/language_model/rnnlm/model.py @@ -0,0 +1,85 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
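+
+# NOTE: RnnLm keeps `self.hidden`/`self.cell` between forward calls, so each batch starts from the
+# (detached) final states of the previous batch; UpdateModel resets them at the start of every epoch.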
+ +import paddle +import paddle.nn as nn +import paddle.nn.initializer as I + + +class RnnLm(nn.Layer): + def __init__(self, vocab_size, hidden_size, batch_size, num_layers=1, init_scale=0.1, dropout=0.0): + super(RnnLm, self).__init__() + self.hidden_size = hidden_size + self.num_layers = num_layers + self.init_scale = init_scale + self.batch_size = batch_size + self.reset_states() + + self.embedder = nn.Embedding( + vocab_size, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.lstm = nn.LSTM( + input_size=hidden_size, + hidden_size=hidden_size, + num_layers=num_layers, + dropout=dropout, + weight_ih_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + weight_hh_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.fc = nn.Linear( + hidden_size, + vocab_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + bias_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.dropout = nn.Dropout(p=dropout) + + def forward(self, inputs): + x = inputs + x_emb = self.embedder(x) + x_emb = self.dropout(x_emb) + + y, (self.hidden, self.cell) = self.lstm(x_emb, (self.hidden, self.cell)) + (self.hidden, self.cell) = tuple([item.detach() for item in (self.hidden, self.cell)]) + y = self.dropout(y) + y = self.fc(y) + return y + + def reset_states(self): + self.hidden = paddle.zeros(shape=[self.num_layers, self.batch_size, self.hidden_size], dtype="float32") + self.cell = paddle.zeros(shape=[self.num_layers, self.batch_size, self.hidden_size], dtype="float32") + + +class CrossEntropyLossForLm(nn.Layer): + def __init__(self): + super(CrossEntropyLossForLm, self).__init__() + + def forward(self, y, label): + label = paddle.unsqueeze(label, axis=2) + loss = paddle.nn.functional.cross_entropy(input=y, label=label, reduction="none") + loss = paddle.squeeze(loss, axis=[2]) + loss = paddle.mean(loss, axis=[0]) + loss = paddle.sum(loss) + return loss + + +class UpdateModel(paddle.callbacks.Callback): + # This callback reset model hidden states and update learning rate before each epoch begins + def on_epoch_begin(self, epoch=None, logs=None): + self.model.network.reset_states() diff --git a/examples/language_model/rnnlm/reader.py b/examples/language_model/rnnlm/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..065ea409076d51762219e4958ffd1dc5c7ca649f --- /dev/null +++ b/examples/language_model/rnnlm/reader.py @@ -0,0 +1,58 @@ +import numpy as np + +import paddle + +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Vocab + + +def create_data_loader(batch_size, num_steps, data_path=None): + train_ds, valid_ds, test_ds = load_dataset("ptb", splits=("train", "valid", "test")) + + train_examples = [train_ds[i]["sentence"].split() for i in range(len(train_ds))] + vocab = Vocab.build_vocab(train_examples, eos_token="") + + # Because the sentences in PTB dataset might be consecutive, we need to concatenate + # all texts from our dataset and fold them into chunks while the number of rows is + # equal to batch size. For example: + # + # Sentence1: we're talking about years ago before anyone heard of asbestos having + # any questionable properties. + # Sentence2: there is no asbestos in our products now. 
+ # Batch_size: 5 + # Grouped_text: [["we're", "talking", "about", "years"], + # ["ago", "before", "anyone", "heard"], + # ["of", "asbestos", "having", "any"], + # ["questionable", "properties", "there", "is"], + # ["no", "asbestos", "in", "our"]] + # + def group_texts(examples): + concat_examples = [] + for example in examples: + concat_examples += example["sentence"].split() + ["
"] + + concat_examples = vocab.to_indices(concat_examples) + + max_seq_len = len(concat_examples) // batch_size + reshaped_examples = np.asarray(concat_examples[0 : batch_size * max_seq_len], dtype="int64").reshape( + (batch_size, max_seq_len) + ) + encoded_examples = [] + for i in range(max_seq_len // num_steps): + encoded_examples.append( + ( + np.copy(reshaped_examples[:, i * num_steps : (i + 1) * num_steps]), + np.copy(reshaped_examples[:, i * num_steps + 1 : (i + 1) * num_steps + 1]), + ) + ) + + return encoded_examples + + train_ds.map(group_texts, batched=True) + valid_ds.map(group_texts, batched=True) + test_ds.map(group_texts, batched=True) + + train_loader = paddle.io.DataLoader(train_ds, return_list=True, batch_size=None) + valid_loader = paddle.io.DataLoader(valid_ds, return_list=True, batch_size=None) + test_loader = paddle.io.DataLoader(test_ds, return_list=True, batch_size=None) + return train_loader, valid_loader, test_loader, len(vocab) diff --git a/examples/language_model/rnnlm/train.py b/examples/language_model/rnnlm/train.py new file mode 100644 index 0000000000000000000000000000000000000000..bb69ebc37aeee5a97546a33a8626fb35a98a617e --- /dev/null +++ b/examples/language_model/rnnlm/train.py @@ -0,0 +1,79 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle +from args import parse_args +from model import CrossEntropyLossForLm, RnnLm, UpdateModel +from reader import create_data_loader + +from paddlenlp.metrics import Perplexity + +paddle.seed(102) + + +def train(args): + paddle.set_device(args.device) + data_path = args.data_path + train_loader, valid_loader, test_loader, vocab_size = create_data_loader( + batch_size=args.batch_size, num_steps=args.num_steps, data_path=data_path + ) + + network = RnnLm( + vocab_size=vocab_size, + hidden_size=args.hidden_size, + batch_size=args.batch_size, + num_layers=args.num_layers, + init_scale=args.init_scale, + dropout=args.dropout, + ) + gloabl_norm_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + cross_entropy = CrossEntropyLossForLm() + ppl_metric = Perplexity() + callback = UpdateModel() + scheduler = paddle.callbacks.LRScheduler(by_step=False, by_epoch=True) + model = paddle.Model(network) + + learning_rate = paddle.optimizer.lr.LambdaDecay( + learning_rate=args.base_lr, + lr_lambda=lambda x: args.lr_decay ** max(x + 1 - args.epoch_start_decay, 0.0), + verbose=True, + ) + optimizer = paddle.optimizer.SGD( + learning_rate=learning_rate, parameters=model.parameters(), grad_clip=gloabl_norm_clip + ) + + model.prepare(optimizer=optimizer, loss=cross_entropy, metrics=ppl_metric) + + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + benchmark_logger = paddle.callbacks.ProgBarLogger(log_freq=(len(train_loader) // 10), verbose=3) + model.fit( + train_data=train_loader, + eval_data=valid_loader, + epochs=args.max_epoch, + shuffle=False, + callbacks=[callback, scheduler, benchmark_logger], + ) + + model.save(path="checkpoint/test") # save for training + + print("Start to evaluate on test dataset...") + model.evaluate(test_loader, log_freq=len(test_loader)) + + +if __name__ == "__main__": + args = parse_args() + train(args) diff --git a/examples/language_model/roberta/README.md b/examples/language_model/roberta/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b6e948386801d82f54b2f90ba7a675c408823428 --- /dev/null +++ b/examples/language_model/roberta/README.md @@ -0,0 +1,110 @@ +# RoBERTa预训练(Masked Language Modeling) +本项目是RoBERTa模型在 Paddle 2.0上的开源实现,包含了数据tokenization和预训练代码。本项目旨在用简练清晰的代码完成基本预训练任务(仅Masked Language Modeling)。该代码易于理解,便于修改和定制。 +## 简介 +本目录下包含: + +utils.py: 数据采样函数DataCollatorMLM + +create_data.py: tokenize数据(使用HF datasets导入和预处理wikipedia数据) + +run_pretrain.py: 预训练代码 + +## 数据准备 +运行create_data.py,默认使用wikipedia corpus数据,自动下载(约34GB) + +``` +python create_data.py \ +--output_dir wiki \ +--dataset_name wikipedia \ +--dataset_config_name 20200501.en \ +--tokenizer_name roberta-base \ +--max_seq_length 512 \ +--line_by_line False \ +--preprocessing_num_workers 20 +``` + +其中参数释义如下: +- `output_dir` 指示数据tokenize后保存的目录。 +- `dataset_name` 表示数据名称,默认使用wikipedia。 +- `dataset_config_name` 表示数据参数,默认使用wikipedia英文数据。 +- `tokenizer_name` 表示tokenizer名。 +- `max_seq_length` 表示最大序列长度。 +- `line_by_line` 表示是否将数据group到max_seq_length,True则不进行grouping。 +- `preprocessing_num_workers` 表示worker数量,亦为multi-processing数量。 + +## 预训练 + +``` +python -m paddle.distributed.launch --gpus "0,1" run_pretrain.py \ +--model_name_or_path roberta-en-base \ +--batch_size 16 \ +--learning_rate 1e-4 \ +--weight_decay 1e-2 \ +--warmup_steps 10000 \ +--num_train_epochs 3 \ +--input_file wiki \ +--output_dir ckp/ \ +--logging_steps 100 \ +--save_steps 10000 \ +--max_steps -1 \ +--device gpu \ +--max_seq_length 512 \ +--amp True 
+``` + +其中参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_steps` 表示动态学习率热启的step数。 +- `num_train_epochs` 表示训练轮数。 +- `input_file` 表示输入数据的目录,由create_data.py创建。 +- `output_dir` 表示模型的保存目录。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `max_steps` 表示最大训练步数。若训练`num_train_epochs`轮包含的训练步数大于该值,则达到`max_steps`后就提前结束。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `max_seq_length` 训练数据最大长度。 +- `amp` 指示是否启用自动混合精度训练。 + +注: +paddle.Dataloader需2.3rc版本才支持HF datasets类,现行版本可以直接在python paddle库中的reader.py中注释掉: +``` +assert isinstance(dataset, Dataset) +``` +https://github.com/PaddlePaddle/Paddle/blob/0ee230a7d3177f791d2a5388ab4dffdccc03f4aa/python/paddle/fluid/reader.py#L335 + +## fine-tune + +finetune代码请参考[benchmark_glue](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/benchmark/glue) + +运行如下: + +```shell +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=SST-2 + +python -u ./run_glue.py \ + --model_type roberta \ + --model_name_or_path ROBERTA_CKP_PATH \ + --tokenizer_name_or_path roberta-en-base \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 3e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 100 \ + --output_dir ./tmp/$TASK_NAME/ \ + --device gpu + +``` + + +总训练tokens:512(seq_len)* 32(batch_size) * 780000(iteration),约RoBERTa训练量10%,在GLUE validation set表现: + +| Model GLUE Score | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | +|--------------------|-------|--------|--------|--------|--------|--------|--------|--------| +| RoBERTa paper | 68.0 | 96.4 | 90.9 | 92.4 | 92.2 | 90.2 | 94.7 | 86.6 | +| PaddleNLP 6-epoch | 36.9 | 89.5 | 84.3 | 86.2 | 88.6 | 80.5 | 88.4 | 58.1 | diff --git a/examples/language_model/roberta/create_data.py b/examples/language_model/roberta/create_data.py new file mode 100644 index 0000000000000000000000000000000000000000..70b7aee100387b8ffd115d42f9116aa2a9776a78 --- /dev/null +++ b/examples/language_model/roberta/create_data.py @@ -0,0 +1,151 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2021 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os + +from datasets import load_dataset +from transformers import AutoTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--output_dir", + default="wiki", + type=str, + required=False, + help="The output directory where the model predictions and checkpoints will be written.", +) +parser.add_argument("--dataset_name", default="wikipedia", type=str, required=False, help="dataset name") +parser.add_argument( + "--dataset_config_name", default="20200501.en", type=str, required=False, help="dataset config name" +) +parser.add_argument( + "--use_slow_tokenizer", + action="store_true", + help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).", +) +parser.add_argument("--tokenizer_name", default="roberta-base", type=str, required=False, help="tokenizer name") +parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument( + "--line_by_line", + type=bool, + default=False, + help="Whether distinct lines of text in the dataset are to be handled as distinct sequences.", +) +parser.add_argument("--preprocessing_num_workers", default=20, type=int, help="multi-processing number.") +parser.add_argument( + "--overwrite_cache", type=bool, default=False, help="Overwrite the cached training and evaluation sets" +) + + +def main(args): + if args.output_dir is not None: + os.makedirs(args.output_dir, exist_ok=True) + + # Get the datasets: + if args.dataset_name is not None: + # Downloading and loading a dataset from the hub. + raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name) + + # Load pretrained tokenizer + if args.tokenizer_name: + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=not args.use_slow_tokenizer) + + # First we tokenize all the texts. + column_names = raw_datasets["train"].column_names + text_column_name = "text" if "text" in column_names else column_names[0] + + if args.line_by_line: + # When using line_by_line, we just tokenize each nonempty line. + padding = False + + def tokenize_function(examples): + # Remove empty lines + examples[text_column_name] = [ + line for line in examples[text_column_name] if len(line) > 0 and not line.isspace() + ] + return tokenizer( + examples[text_column_name], + padding=padding, + truncation=True, + max_length=args.max_seq_length, + # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it + # receives the `special_tokens_mask`. + return_special_tokens_mask=True, + ) + + tokenized_datasets = raw_datasets.map( + tokenize_function, + batched=True, + num_proc=args.preprocessing_num_workers, + remove_columns=[text_column_name], + load_from_cache_file=not args.overwrite_cache, + desc="Running tokenizer on dataset line_by_line", + ) + else: + # Otherwise, we tokenize every text, then concatenate them together before splitting them in smaller parts. + # We use `return_special_tokens_mask=True` because DataCollatorForLanguageModeling (see below) is more + # efficient when it receives the `special_tokens_mask`. 
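+ # For example, with max_seq_length=512 every 512 tokens of the concatenated corpus become one
+ # training chunk; only the partial chunk at the end of each mapped batch is dropped by group_texts below.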
+ def tokenize_function(examples): + return tokenizer(examples[text_column_name], return_special_tokens_mask=True) + + tokenized_datasets = raw_datasets.map( + tokenize_function, + batched=True, + num_proc=args.preprocessing_num_workers, + remove_columns=column_names, + load_from_cache_file=not args.overwrite_cache, + desc="Running tokenizer on every text in dataset", + ) + + # Main data processing function that will concatenate all texts from our dataset and generate chunks of + # max_seq_length. + def group_texts(examples): + # Concatenate all texts. + concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} + total_length = len(concatenated_examples[list(examples.keys())[0]]) + # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can + # customize this part to your needs. + if total_length >= args.max_seq_length: + total_length = (total_length // args.max_seq_length) * args.max_seq_length + # Split by chunks of max_len. + result = { + k: [t[i : i + args.max_seq_length] for i in range(0, total_length, args.max_seq_length)] + for k, t in concatenated_examples.items() + } + return result + + # Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a + # remainder for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value + # might be slower to preprocess. + + tokenized_datasets = tokenized_datasets.map( + group_texts, + batched=True, + num_proc=args.preprocessing_num_workers, + load_from_cache_file=not args.overwrite_cache, + desc=f"Grouping texts in chunks of {args.max_seq_length}", + ) + tokenized_datasets.save_to_disk(args.output_dir) + + +if __name__ == "__main__": + args = parser.parse_args() + main(args) diff --git a/examples/language_model/roberta/run_pretrain.py b/examples/language_model/roberta/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..d96b0458a4df5cf9fd5938d274d44843ab9a789d --- /dev/null +++ b/examples/language_model/roberta/run_pretrain.py @@ -0,0 +1,176 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +import random +import time + +import numpy as np +import paddle +from paddle.io import DataLoader +from utils import DataCollatorMLM + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + RobertaConfig, + RobertaForMaskedLM, +) + +parser = argparse.ArgumentParser() +IGNORE = -100 + +# yapf: disable +parser.add_argument("--model_name_or_path", default='roberta-en-base', type=str, required=False, help="Path to pre-trained model") +parser.add_argument("--input_file", default='wiki', type=str, required=False, help="The input directory where the model predictions and checkpoints will be written.") +parser.add_argument("--output_dir", default='ckp/', type=str, required=False, help="The output directory where the model predictions and checkpoints will be written.") +parser.add_argument("--max_seq_length", default=512, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=2e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") +parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") +parser.add_argument("--num_train_epochs", default=10, type=int, help="Total number of training epochs to perform.", ) +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.",) +parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") +parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") +parser.add_argument("--save_steps", type=int, default=10000, help="Save checkpoint every X updates steps.") +parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") +parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") +parser.add_argument("--amp", type=strtobool, default=True, help="use mix precision.") + +roberta_arch = { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 514, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 1, + "vocab_size": 50265, + "layer_norm_eps": 1e-05, + "pad_token_id": 1, + "cls_token_id": 0 +} + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(seed) + + +def do_train(args): + paddle.set_device(args.device) + set_seed(args.seed) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + # Load model and train from scratch + config = RobertaConfig(**roberta_arch) + model = RobertaForMaskedLM(config) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + ignore_label = IGNORE + loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label) + + # Load wikipedia dataset via Hugging face datasets + # TO DO: paddle datasets + import datasets + tokenized_datasets = datasets.load_from_disk(args.input_file) + train_ds = tokenized_datasets["train"] + from transformers import AutoTokenizer + tokenizer = AutoTokenizer.from_pretrained('roberta-base') + + # Prepare data for training + collator_func = DataCollatorMLM(tokenizer=tokenizer) # data collator + train_batch_sampler = paddle.io.DistributedBatchSampler( + train_ds, batch_size=args.batch_size, shuffle=True, drop_last=True) + train_data_loader = DataLoader( + dataset=train_ds, + collate_fn=collator_func, + num_workers=0, + batch_sampler=train_batch_sampler, + return_list=True) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_steps) + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + if args.amp: # mixed precision (fp16) + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + # Start training + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + input_ids, _, labels = batch + with paddle.amp.auto_cast(args.amp): + logits = model(input_ids=input_ids) + loss = loss_fct(logits, labels) + if args.amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0: + + print( + "global step %d/%d, loss: %f, lr: %.10f, speed: %.4f step/s" + % (global_step, num_training_steps, loss, optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train))) + tic_train = time.time() + + if global_step % args.save_steps == 0: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "paddle_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + + model_to_save = model._layers if isinstance( + model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +if __name__ == "__main__": + args = parser.parse_args() + do_train(args) diff --git a/examples/language_model/roberta/utils.py b/examples/language_model/roberta/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..8d2a566a2f3ab354fc505a5e45bda73e1ba34f0b --- /dev/null +++ b/examples/language_model/roberta/utils.py @@ -0,0 +1,73 @@ +# Copyright (c) 2020 
PaddlePaddle Authors. All Rights Reserved. +# Copyright 2021 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle + +from paddlenlp.data import Dict, Pad + + +class DataCollatorMLM: + def __init__(self, tokenizer, batch_pad=None): + self.batch_pad = batch_pad + self.mask_token_id = tokenizer.mask_token_id + self.pad_token_id = tokenizer.pad_token_id + self.token_len = tokenizer.vocab_size + if batch_pad is None: + self.batch_pad = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=self.pad_token_id, dtype="int64"), # input + # 'token_type_ids': Pad(axis=0, pad_val=0, dtype='int64'), # segment + "special_tokens_mask": Pad(axis=0, pad_val=True, dtype="int64"), # segment + } + ): fn(samples) + else: + self.batch_pad = batch_pad + + def __call__(self, examples): + examples = self.batch_pad(examples) + examples = [paddle.to_tensor(e) for e in examples] + examples[0], labels = self._mask_tokens( + examples[0], paddle.cast(examples[1], dtype=bool), self.mask_token_id, self.token_len + ) + examples.append(labels) + return examples + + def _mask_tokens(self, inputs, special_tokens_mask, mask_token_id, token_len, mlm_prob=0.15, ignore_label=-100): + """ + Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. 
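+        Implementation note: a Bernoulli draw with p=mlm_prob selects the positions to
+        mask (special tokens are excluded by zeroing their probability), and labels at all
+        other positions are set to `ignore_label` so only masked tokens contribute to the
+        loss. Of the masked positions, a Bernoulli(0.8) draw is replaced with
+        `mask_token_id`; a Bernoulli(0.5) draw over the remaining 20% becomes a random
+        token id, and the rest are left unchanged.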
+ """ + labels = inputs.clone() + probability_matrix = paddle.full(labels.shape, mlm_prob) + probability_matrix[special_tokens_mask] = 0 + + masked_indices = paddle.cast(paddle.bernoulli(probability_matrix), dtype=bool) + labels[~masked_indices] = ignore_label # We only compute loss on masked tokens + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = paddle.cast(paddle.bernoulli(paddle.full(labels.shape, 0.8)), dtype=bool) & masked_indices + inputs[indices_replaced] = mask_token_id + + # 10% of the time, we replace masked input tokens with random word + + indices_random = ( + paddle.cast(paddle.bernoulli(paddle.full(labels.shape, 0.5)), dtype=bool) + & masked_indices + & ~indices_replaced + ) + random_words = paddle.randint(low=0, high=token_len, shape=labels.shape) + inputs[indices_random] = random_words[indices_random] + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return inputs, labels diff --git a/examples/language_model/roformer/README.md b/examples/language_model/roformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e06ee0d6ca947ffdc89e9cabdaa70bab5ed029e2 --- /dev/null +++ b/examples/language_model/roformer/README.md @@ -0,0 +1,112 @@ +# RoFormer + +## 模型简介 + +[RoFormer](https://arxiv.org/pdf/2104.09864.pdf) (RoFormer: Enhanced Transformer with Rotary Position Embedding)是一个带有旋转位置嵌入(RoPE)的MLM预训练语言模型。 RoPE是一种相对位置编码方法,具有良好的理论特性。其主要思想是根据绝对位置将上下文嵌入(transformer中的 q,k)乘以旋转矩阵。可以证明上下文嵌入的内积将仅取决于相对位置。 +RoPE 是唯一可用于线性注意力的相对位置嵌入。更多详情请参考[论文](https://arxiv.org/pdf/2104.09864.pdf)或[原博客](https://kexue.fm/archives/8265)。EleutherAI还发布了一篇[博客](https://blog.eleuther.ai/rotary-embeddings/),其中包含有关 RoPE 的直观解释和实验。 + +本项目是RoFormer在 Paddle 2.x上的开源实现,包含了`THUCNews分类任务`和`Cail2019 Scm任务`的微调代码。 + +## 快速开始 + + +### 预训练MLM测试 + ```bash + python test_mlm.py --model_name roformer-chinese-base --text 今天[MASK]很好,我想去公园玩! + # paddle: 今天[天气||天||阳光||太阳||空气]很好,我想去公园玩! + python test_mlm.py --model_name roformer-chinese-base --text 北京是[MASK]的首都! + # paddle: 北京是[中国||谁||中华人民共和国||我们||中华民族]的首都! + python test_mlm.py --model_name roformer-chinese-char-base --text 今天[MASK]很好,我想去公园玩! + # paddle: 今天[天||气||都||风||人]很好,我想去公园玩! + python test_mlm.py --model_name roformer-chinese-char-base --text 北京是[MASK]的首都! + # paddle: 北京是[谁||我||你||他||国]的首都! 
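+ # (可选)若模型已下载到本地目录,--model_name 也可以指向该目录,例如:
+ # python test_mlm.py --model_name ./roformer-chinese-base --text 北京是[MASK]的首都!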
+ ``` + +### THUCNews分类任务数据 + +THUCNews分类任务所含数据集已在paddlenlp中以API形式提供,无需预先准备,使用`run_thucnews.py`执行微调时将会自动下载。 + +### 执行Fine-tunning + +启动thucnews分类任务的Fine-tuning的方式如下: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" examples/language_model/roformer/run_thucnews.py \ + --model_type roformer \ + --model_name_or_path roformer-chinese-base \ + --max_seq_length 256 \ + --batch_size 64 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./thucnews/ \ + --device gpu \ + --use_amp False +``` +其中参数释义如下: +- `model_type` 指示了模型类型,可以选择roformer。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录的地址。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU, 'npu'表示使用华为昇腾卡。 +- `use_amp` 指示是否启用自动混合精度训练。 + +基于`roformer-chinese-base`在THUCNews分类任务上Fine-tuning后,在验证集上有如下结果: + +| Task | Metric | Result | +|:-----:|:----------------------------:|:-----------------:| +| THUCNews | Accuracy | 0.98 | + + + +### Cail2019_Scm任务数据 + +Cail2019_Scm分类任务所含数据集已在paddlenlp中以API形式提供,无需预先准备,使用`cail2019_scm.py`执行微调时将会自动下载。 + +### 执行Fine-tunning + +启动cail2019_scm任务的Fine-tuning的方式如下: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" examples/language_model/roformer/run_cail2019_scm.py \ + --model_type roformer_mean_pooling \ + --model_name_or_path roformer-chinese-base \ + --max_seq_length 512 \ + --batch_size 16 \ + --learning_rate 6e-6 \ + --num_train_epochs 20 \ + --logging_steps 60 \ + --save_steps 600 \ + --output_dir ./cail2019_scm/ \ + --device gpu \ + --use_amp False +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,可以选择roformer_cls_pooling和roformer_mean_pooling两种类型。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录的地址。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示学习率大小,本代码并未使用学习率衰减。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU, 'npu'表示使用华为昇腾卡。 +- `use_amp` 指示是否启用自动混合精度训练。 + +基于`roformer-chinese-base`在Cail2019_Scm任务上Fine-tuning后,有如下结果: + +| Model | Dev Accuracy | Test Accuracy | +|:-------------:|:-----------------:|:------------------:| +| RoFormer-512 | 0.6307 | 0.6947 | + +注: `run_cail2019_scm.py`参考了[原论文微调的代码](https://github.com/ZhuiyiTechnology/roformer/blob/main/finetune_scm.py),原代码未使用学习率衰减,而是使用了固定学习率6e-6。 diff --git a/examples/language_model/roformer/convert.py b/examples/language_model/roformer/convert.py new file mode 100644 index 0000000000000000000000000000000000000000..c33830e633dde755f15ae16da0de9bdb9c3eca08 --- /dev/null +++ b/examples/language_model/roformer/convert.py @@ -0,0 +1,78 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import OrderedDict +import argparse + +huggingface_to_paddle = { + "embeddings.LayerNorm": "embeddings.layer_norm", + "encoder.layer": "encoder.layers", + "attention.self.query": "self_attn.q_proj", + "attention.self.key": "self_attn.k_proj", + "attention.self.value": "self_attn.v_proj", + "attention.output.dense": "self_attn.out_proj", + "intermediate.dense": "linear1", + "output.dense": "linear2", + "attention.output.LayerNorm": "norm1", + "output.LayerNorm": "norm2", + "predictions.decoder.": "predictions.decoder_", + "predictions.transform.dense": "predictions.transform", + "predictions.transform.LayerNorm": "predictions.layer_norm", +} + + +def convert_pytorch_checkpoint_to_paddle(pytorch_checkpoint_path, paddle_dump_path): + + import torch + import paddle + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + if k == "cls.predictions.bias" or "encoder.embed_positions." in k: + continue + if k[-7:] == ".weight": + if ".embeddings." not in k and ".LayerNorm." not in k: + v = v.transpose(0, 1) + oldk = k + for huggingface_name, paddle_name in huggingface_to_paddle.items(): + k = k.replace(huggingface_name, paddle_name) + + if "roformer." not in k and "cls." not in k: + k = "roformer." + k + + print(f"Converting: {oldk} => {k}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pytorch_checkpoint_path", + default="roformer_chinese_base/pytorch_model.bin", + type=str, + required=True, + help="Path to the Pytorch checkpoint path.", + ) + parser.add_argument( + "--paddle_dump_path", + default="roformer_chinese_base/model_state.pdparams", + type=str, + required=True, + help="Path to the output Paddle model.", + ) + args = parser.parse_args() + convert_pytorch_checkpoint_to_paddle(args.pytorch_checkpoint_path, args.paddle_dump_path) diff --git a/examples/language_model/roformer/run_cail2019_scm.py b/examples/language_model/roformer/run_cail2019_scm.py new file mode 100644 index 0000000000000000000000000000000000000000..68b7e95b5fc48f88ed285c8ca95eb2ba2526fe49 --- /dev/null +++ b/examples/language_model/roformer/run_cail2019_scm.py @@ -0,0 +1,363 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
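+
+# Fine-tunes RoFormer on the CAIL2019-SCM similar-case-matching task. Each (A, B, C)
+# triplet is expanded into two scored pairs, with the pair the gold label marks as more
+# similar labeled 1 and the other 0; training uses BCEWithLogitsLoss, and accuracy checks
+# that the positive pair of each triplet receives the higher score (see
+# Cail2019_SCM_Accuracy below).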
+ +import argparse +import logging +import os +import random +import time +from functools import partial + +import jieba +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import RoFormerTokenizer +from paddlenlp.transformers.roformer.modeling import ( + RoFormerForSequenceClassification, + RoFormerPretrainedModel, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) +jieba.setLogLevel(logging.INFO) + + +class RoFormerMeanPoolingForSequenceClassification(RoFormerPretrainedModel): + def __init__(self, roformer, num_classes): + super(RoFormerMeanPoolingForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.roformer = roformer + self.classifier = nn.Linear(self.roformer.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, attention_mask=None): + last_hidden_state = self.roformer(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)[0] + + mask = (input_ids != self.roformer.pad_token_id).astype(self.classifier.weight.dtype).unsqueeze(-1) + mean_pooling = paddle.sum(last_hidden_state * mask, axis=1) / paddle.sum(mask, axis=1) + logits = self.classifier(mean_pooling) + return logits + + +MODEL_CLASSES = { + "roformer_cls_pooling": (RoFormerForSequenceClassification, RoFormerTokenizer), + "roformer_mean_pooling": (RoFormerMeanPoolingForSequenceClassification, RoFormerTokenizer), +} + + +class Cail2019_SCM_Accuracy(Accuracy): + def compute(self, pred, label, *args): + pred = paddle.cast(pred[::2] > pred[1::2], dtype="int64") + correct = (pred == 1).unsqueeze(-1) + return paddle.cast(correct, dtype="float32") + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum( + [list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], + [], + ) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--learning_rate", + default=1e-4, + type=float, + help="The initial learning rate for Adam.", + ) + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument( + "--save_steps", + type=int, + default=100, + help="Save checkpoint every X updates steps.", + ) + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu", "npu"], + help="The device to select to train the model, is must be cpu/gpu/xpu/npu.", + ) + parser.add_argument( + "--use_amp", + type=strtobool, + default=False, + help="Enable mixed precision training.", + ) + parser.add_argument( + "--scale_loss", + type=float, + default=2**15, + help="The value of scale_loss for fp16.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids).squeeze(-1) + loss = loss_fct(logits, labels) + correct = metric.compute(F.sigmoid(logits), labels) + metric.update(correct) + res = metric.accumulate() + print("eval loss: %f, acc: %s" % (loss.numpy(), res)) + + +def convert_example(example, tokenizer, max_seq_length=512): + if example["label"] == 0: + text1 = example["text_a"] + text2 = example["text_b"] + text3 = example["text_c"] + else: + text1 = example["text_a"] + text2 = example["text_c"] + text3 = example["text_b"] + + data1 = tokenizer(text1, text_pair=text2, max_length=max_seq_length) + data2 = tokenizer(text1, text_pair=text3, max_length=max_seq_length) + + return [data1["input_ids"], data1["token_type_ids"], 1], [data2["input_ids"], data2["token_type_ids"], 0] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + args.model_type = args.model_type.lower() + args.batch_size = args.batch_size // 2 + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("cail2019_scm", splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="float32"), + ), + ): # label + new_samples = [] + for sample in samples: + new_samples.extend(sample) + return fn(new_samples) + + train_data_loader = DataLoader( + dataset=train_ds, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + dev_ds = load_dataset("cail2019_scm", splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size * 4, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + test_ds = load_dataset("cail2019_scm", splits="test") + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size * 4, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, + batch_sampler=test_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=1) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
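+    # Note: the attribute name (n) is matched against the "bias"/"norm" substrings, while
+    # the framework-assigned parameter name (p.name) is what apply_decay_param_fun is
+    # called with, so weight decay is applied only to the parameters collected here.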
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.BCEWithLogitsLoss() + + metric = Cail2019_SCM_Accuracy() + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + model.train() + global_step += 1 + + input_ids, segment_ids, labels = batch + + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model(input_ids, segment_ids).squeeze(-1) + loss = loss_fct(logits, labels) + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + print("============Dev Dataset============") + evaluate(model, loss_fct, metric, dev_data_loader) + print("============Test Dataset============") + evaluate(model, loss_fct, metric, test_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "ft_model_%d.pdparams" % (global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/language_model/roformer/run_thucnews.py b/examples/language_model/roformer/run_thucnews.py new file mode 100644 index 0000000000000000000000000000000000000000..7a014e57b88a983b8d0dda5f7f7d277069eed844 --- /dev/null +++ b/examples/language_model/roformer/run_thucnews.py @@ -0,0 +1,290 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + RoFormerForSequenceClassification, + RoFormerTokenizer, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +MODEL_CLASSES = { + "roformer": (RoFormerForSequenceClassification, RoFormerTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu", "npu"], + help="The device to select to train the model, is must be cpu/gpu/xpu/npu.", + ) + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a thucnews example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["label"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + example = tokenizer(example["text"], max_length=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("thucnews", splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + dev_ds = load_dataset("thucnews", splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, 
num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = Accuracy() + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + + input_ids, segment_ids, labels = batch + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "ft_model_%d.pdparams" % (global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/language_model/roformer/test_mlm.py b/examples/language_model/roformer/test_mlm.py new file mode 100644 index 0000000000000000000000000000000000000000..31bb14c550df21b3757b8314cca8c600b576940d --- /dev/null +++ b/examples/language_model/roformer/test_mlm.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import argparse +from paddlenlp.transformers import RoFormerForMaskedLM, RoFormerTokenizer + + +def test_mlm(text, model_name): + model = RoFormerForMaskedLM.from_pretrained(model_name) + model.eval() + tokenizer = RoFormerTokenizer.from_pretrained(model_name) + tokens = ["[CLS]"] + text_list = text.split("[MASK]") + for i, t in enumerate(text_list): + tokens.extend(tokenizer.tokenize(t)) + if i == len(text_list) - 1: + tokens.extend(["[SEP]"]) + else: + tokens.extend(["[MASK]"]) + + input_ids_list = tokenizer.convert_tokens_to_ids(tokens) + input_ids = paddle.to_tensor([input_ids_list]) + + with paddle.no_grad(): + pd_outputs = model(input_ids)[0] + pd_outputs_sentence = "paddle: " + for i, id in enumerate(input_ids_list): + if id == tokenizer.convert_tokens_to_ids(["[MASK]"])[0]: + tokens = tokenizer.convert_ids_to_tokens(pd_outputs[i].topk(5)[1].tolist()) + pd_outputs_sentence += "[" + "||".join(tokens) + "]" + else: + pd_outputs_sentence += "".join(tokenizer.convert_ids_to_tokens([id], skip_special_tokens=True)) + + print(pd_outputs_sentence) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name", default="roformer-chinese-base", type=str, help="Pretrained roformer name or path." + ) + parser.add_argument("--text", default="今天[MASK]很好,我想去公园玩!", type=str, help="MLM text.") + args = parser.parse_args() + test_mlm(text=args.text, model_name=args.model_name) diff --git a/examples/language_model/rw/README.md b/examples/language_model/rw/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1436353774786c042eb18aa82eb68d8aa17179de --- /dev/null +++ b/examples/language_model/rw/README.md @@ -0,0 +1,13 @@ +# Falcon + +## 介绍 + +Falcon是由[TII](https://www.tii.ae/)构建的Causal decoder-only模型,基于含有 1,500B个tokens的[RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)数据集训练得来。 +Falcon 引入了[FlashAttention](https://github.com/HazyResearch/flash-attention)和[Multi-Query Attention]等新特性。更详细的模型介绍见[论文](https://arxiv.org/abs/2306.01116) + +## 推理 + +``` +python predict_generation.py \ + --model_name_or_path tiiuae/falcon-7b +``` diff --git a/examples/language_model/rw/predict_generation.py b/examples/language_model/rw/predict_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..256feb80585a11927a183231a4fb8d23f72e9ff6 --- /dev/null +++ b/examples/language_model/rw/predict_generation.py @@ -0,0 +1,118 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
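+
+# Batch text-generation demo for Falcon (RW): prompts are tokenized with left-side
+# truncation to --src_length, generation runs greedily (sampling with top_k=1) for up to
+# --tgt_length new tokens, and outputs are decoded with special tokens skipped.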
+ +import paddle + +from paddlenlp.transformers import RWConfig, RWForCausalLM, RWTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", default="tiiuae/falcon-7b", help="The directory of model.") + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--src_length", type=int, default=128, help="The batch size of data.") + parser.add_argument("--tgt_length", type=int, default=128, help="The batch size of data.") + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args=None, tokenizer=None, model=None, **kwargs): + if args is None: + self.tokenizer = tokenizer + self.model = model + self.src_length = kwargs["src_length"] + self.tgt_length = kwargs["tgt_length"] + else: + self.tokenizer = RWTokenizer.from_pretrained(args.model_name_or_path) + self.batch_size = args.batch_size + self.args = args + self.src_length = self.args.src_length + self.tgt_length = self.args.tgt_length + + config = RWConfig.from_pretrained(args.model_name_or_path) + dtype = config.dtype if config.dtype is not None else config.paddle_dtype + + self.model = RWForCausalLM.from_pretrained( + args.model_name_or_path, + dtype=dtype, + ) + self.model.eval() + + def preprocess(self, input_text): + inputs = self.tokenizer( + input_text, + return_tensors="np", + padding=True, + max_length=self.src_length, + truncation=True, + truncation_side="left", + ) + inputs_tensor = {} + for key in inputs: + inputs_tensor[key] = paddle.to_tensor(inputs[key]) + return inputs_tensor + + def infer(self, inputs): + result = self.model.generate( + **inputs, + decode_strategy="sampling", + top_k=1, + max_length=self.tgt_length, + bos_token_id=self.tokenizer.bos_token_id, + eos_token_id=self.tokenizer.eos_token_id, + pad_token_id=self.tokenizer.pad_token_id, + use_cache=True, + ) + result = result[0] + return result + + def postprocess(self, infer_data): + result = [] + for x in infer_data.tolist(): + res = self.tokenizer.decode(x, skip_special_tokens=True) + res = res.strip("\n") + result.append(res) + out_dict = {"result": result} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + all_texts = [ + "Hello!", + "Please introduce yourself, ", + ] + batch_texts = batchfy_text(all_texts, args.batch_size) + for bs, texts in enumerate(batch_texts): + outputs = predictor.predict(texts) + for text, result in zip(texts, outputs["result"]): + print("{}\n{}".format(text, result)) diff --git a/examples/language_model/t5/README.md b/examples/language_model/t5/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7f743228c4123f9f51815b1d45d5e7670dd51ca3 --- /dev/null +++ b/examples/language_model/t5/README.md @@ -0,0 +1,331 @@ +# T5 + +[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683v3.pdf) + +## 摘要 + 
+迁移学习在自然语言处理(NLP)中已经成为一种强大的技术。迁移学习首先在数据丰富的任务上进行预训练,然后在下游任务上进行调整。迁移学习的有效性引起了不同的方法、方法和实践。在本文中,我们通过引入一个统一的框架,将所有基于文本的语言问题转换为文本到文本的格式,来探索自然语言处理的迁移学习技术。我们的系统研究比较了数十项语言理解任务的训练前目标、架构、未标记数据集、迁移方法和其他因素。通过将我们的探索与规模和我们的新"Colossal Clean Crawled Corpus"数据集相结合,我们在摘要、问答、文本分类等许多基准测试中取得了最先进的结果。为了促进NLP迁移学习的未来工作,我们发布了我们的数据集、预训练模型和代码。 + +本项目是T5在 Paddle 2.x上的开源实现,包含了`模型权重`转换代码和`GLUE任务`的微调代码。 + +## 快速开始 + +### 预训练 + +本项目致力于t5模型的预训练,从数据下载,数据转化,模型训练,流程开源开放,可复现。 + +接下来将从下面几个方面,详细介绍整个数据制作全流程,从零开始,构建一个预训练模型。 + +#### 1. 数据准备 + +数据流是预训练的非常重要的,[预处理文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/preprocess/README.md)提供了整体的数据变动的流程示意,用户可以查看数据制作的细节文档。 + +在数据ID化步骤中,我们需要配置tokenzer_name,选择t5模型对应的tokenizer;通过下面脚本转化,我们可以得到处理好的预训练数据,token ids:[`t5_openwebtext.bin`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/t5_openwebtext.bin), 文章索引信息[`t5_openwebtext.idx`](https://paddlenlp.bj.bcebos.com/models/transformers/t5/data/t5_openwebtext.idx).(这里提供了一个处理好的预训练数据,可点击链接下载) + +```shell +python -u create_pretraining_data.py \ + --model_name t5-small \ + --tokenizer_name T5Tokenizer \ + --data_format JSON \ + --input_path openwebtext/2020-04.jsonl.zst \ + --split_sentences \ + --output_prefix t5_openwebtext \ + --workers 1 \ + --log_interval 5 \ + --data_impl mmap +``` + +#### 2. 开始训练 + +**路径配置** + +- 主要配置输入输出目录 +- 这里的`tokenizer_name_or_path`请设置为内置的tokenizer,如`t5-small`等。 +- 这里的 `input_dir` 设置输入数据集路径,例如配置`input_dir "./data"`即可。 + +**启动训练**:这里启动的是单机8卡任务,整体全局的batch_size 512 (64*8)。如果指定ips参数,进行多机运行,如 `python3 -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" --ips 192.168.1.101,192.168.1.101 ` + +```shell +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "./log" \ + t5_run_pretrain_trainer.py \ + --model_type "t5" \ + --model_name_or_path "t5-small" \ + --tokenizer_name_or_path "${vocab_dir}" \ + --input_dir "${data_dir}" \ + --output_dir "${base_dir}" \ + --split 10,5,1 \ + --max_seq_length 512 \ + --max_seq_length_dec 128 \ + --per_device_train_batch_size 64 \ + --per_device_eval_batch_size 64 \ + --learning_rate 0.0001 \ + --min_learning_rate 0.00001 \ + --max_steps 20000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --decay_steps 9900 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 10\ + --dataloader_num_workers 4 \ + --eval_steps 100 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --do_train \ + --do_eval \ + --seed 1234 \ + --device "gpu" \ + --data_impl "mmap" +``` + +其中参数释义如下: + +- `model_name_or_path` 要训练的模型或者之前训练的checkpoint。 +- `tokenizer_name_or_path` 模型词表文件所在的文件夹(对于ernie,词表文件名一般命名为vocab.txt),或者PaddleNLP内置tokenizer的名字。 +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `output_dir` 指定输出文件。 +- `split` 划分数据集为train、valid、test的比例。整个数据集会按照这个比例划分数据。默认`split=949,50,1`, 使用1/1000的数据为test,当样本数太少时,增大测试的样本数目。 +- `max_seq_len` 输入文本序列的长度,默认值`512`。 +- `fp16_opt_level` 混合精度策略,支持O1 自动混合精度,O2 pure fp16精度训练。 +- `max_steps` 最大训练步数。训练不支持通过`epoch`控制,第一次制造数据index时候,日志会显示数据会被计算的epoch数,请注意查看。 +- `save_steps` 保存模型间隔。默认保存地址格式为`output_dir/model_50000`(5w 步时的权重)。 +- `weight_decay` 权重衰减参数。 +- `warmup_rate` 学习率warmup参数。 +- `max_grad_norm` 梯度裁剪范围。 +- `logging_steps` 日志输出间隔。 +- `dataloader_num_workers` DataLoader采样进程,当数据输入为瓶颈时,可尝试提高采样进程数目。 +- `eval_steps` 模型评估间隔。 +- `device` 训练设备,默认为GPU。 +- `data_impl` 指定输入文件数据制作类型,默认为mmap,可指定mmap或lazy。mmap格式在读入数据时会建立内存映射,lazy格式在读入数据时直接从文件读取。 + +### GLUE任务 + +### 执行Fine-tunning + +启动rte分类任务的Fine-tuning的方式如下: + +```shell +python run_glue.py \ + --model_name_or_path t5-base \ + --task_name rte \ + --max_seq_length 256 \ 
+ --train_batch_size 16 \ + --eval_batch_size 64 \ + --learning_rate 1e-4 \ + --weight_decay 0.01 \ + --warmup_radio 0.1 \ + --num_train_epochs 10 \ + --logging_steps 100 \ + --save_steps 100 \ + --seed 42 \ + --scheduler_type linear \ + --output_dir outputs/rte/ \ + --device gpu +``` + +其中参数释义如下: + +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录的地址。 +- `task_name` GLUE任务名称,可从选["cola","sst-2","mrpc","sts-b","qqp","mnli", "rte", "qnli"]选择。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `train_batch_size` 表示训练时的样本数目。 +- `eval_batch_size` 表示验证时的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `warmup_radio` warmup比率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机种子。 +- `scheduler_type` scheduler类型,可选linear和cosine,默认linear。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备,可选cpu、gpu或npu。 + +使用trainer进行Fine-tuning: + +```shell +python -m paddle.distributed.launch --gpus "0,1,2,3" run_glue_trainer.py \ + --model_name_or_path t5-base \ + --task_name rte \ + --max_seq_length 256 \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 64 \ + --learning_rate 1e-4 \ + --weight_decay 0.01 \ + --warmup_ratio 0.1 \ + --num_train_epochs 10 \ + --eval_steps 200 \ + --logging_steps 20 \ + --save_steps 200 \ + --save_total_limit 3 \ + --metric_for_best_model "eval_accuracy" \ + --fp16 false \ + --fp16_opt_level "O1" \ + --recompute true \ + --sharding "stage1" \ + --overwrite_output_dir \ + --disable_tqdm true \ + --output_dir outputs/rte/ +``` + +具体参数含义请参见: https://paddlenlp.readthedocs.io/zh/latest/trainer.html + +###### t5-base模型在GLUE开发集上的结果: + +| Model | cola | sst-2 | mrpc | sts-b | qqp | mnli | qnli | rte | mean | +| -------------- | ----- | ----- | ----- | ------- | ----- | ----- | ----- | ----- | ------- | +| | mcc | acc | acc | pearson | acc | acc | acc | acc | | +| T5-base-Paddle | 61.74 | 95.18 | 90.44 | 90.09 | 91.60 | 87.18 | 93.56 | 81.95 | 86.4675 | + +###### t5_v1_1-base模型在GLUE开发集上的结果: + +使用`run_glue_trainer.py`运行,由于`t5_v1_1-base`没有在glue任务上进行训练过,直接生成label的策略需要的训练时间需要更长。 + +| Model | cola | sst-2 | mrpc | sts-b | qqp | mnli | qnli | rte | +| ------------------- | ------- | ----- | ----- | ------- | ----- | ----- | ------ | ----- | +| | mcc | acc | acc | pearson | acc | acc | acc | acc | +| T5-v1_1-base Paddle | 47.6845 | 94.38 | 84.31 | 87.74 | 88.05 | 85.39 | 90.518 | 65.70 | +| epoch | 100 | 10 | 100 | 100 | 3 | 3 | 10 | 100 | + +注: + +- 直接生成label的finetune方式难度较大,前期基本学习如何正确生成label标签,后期才学习分类任务。 +- 生成的label标签设计,标签差异大一些,效果会更好一些。 +- `qqp`,`mnli`数据集适当增大训练epoch数,可以取得更好效果。 + +### CLUE任务 + +使用trainer进行Fine-tuning: + +```shell +python -m paddle.distributed.launch --gpus "0,1,2,3" run_clue_trainer.py \ + --model_name_or_path Langboat/mengzi-t5-base-mt \ + --task_name cluewsc2020 \ + --max_seq_length 512 \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 64 \ + --learning_rate 1e-4 \ + --weight_decay 0.01 \ + --warmup_ratio 0.1 \ + --num_train_epochs 100 \ + --eval_steps 200 \ + --logging_steps 20 \ + --save_steps 200 \ + --save_total_limit 3 \ + --metric_for_best_model "eval_accuracy" \ + --fp16 false \ + --fp16_opt_level "O1" \ + --recompute true \ + --sharding "stage1" \ + --overwrite_output_dir \ + --disable_tqdm true \ + --output_dir outputs/clue/cluewsc2020 +``` + +###### Langboat/mengzi-t5-base-mt模型在CLUE开发集上的结果: + +| Model | afqmc | tnews | iflytek | cmnli | ocnli | cluewsc2020 | csl | +| 
:------------------------- | :---- | :---- | :------ | :---- | :---- | :---------- | :---- | +| | acc | acc | acc | acc | acc | acc | acc | +| Langboat/mengzi-t5-base-mt | 74.44 | 58.47 | 61.14 | 80.97 | 75.76 | 79.61 | 84.47 | +| epoch | 10 | 10 | 10 | 10 | 10 | 100 | 10 | + + + +### GLUE Demo测试 + +```sh +python glue_demo.py +input text: sst2 sentence: contains no wit , only labored gags +label: negative +================================================== +input text: sst2 sentence: that loves its characters and communicates something rather beautiful about human nature +label: positive +================================================== +input text: cola sentence: Mickey looked it up. +label: acceptable +================================================== +input text: sst2 sentence: remains utterly satisfied to remain the same throughout +label: positive +================================================== +input text: sst2 sentence: a well-made and often lovely depiction of the mysteries of friendship +label: positive +================================================== +``` + +### Zero shot Demo测试 [参考自Langboat/mengzi-zero-shot](https://github.com/Langboat/mengzi-zero-shot) + +```sh +python zero_shot_demo.py +``` + +当前**zero shot**时输入的构造方法如下表所示。 + +| **任务类型** | **prompt构造(其中{s}代表句子输**入) | +| -------------------- | ------------------------------------------------------------ | +| **实体抽取** | “{s}”找出上述句子中的实体和他们对应的类别 | +| **语义相似度** | “{s1}”和“{s2}”这两句话是在说同一件事吗? | +| **金融关系抽取** | “{s}”中的“{e1}”和“{e2}”是什么关系?答: | +| **广告文案生成** | 请根据以下产品信息设计广告文案。商品信息:{s} | +| **医学领域意图分类** | 问题:“{s}”。此问题的医学意图是什么?选项:病情诊断,病因分析,治疗方案,就医建议,指标解读,疾病描述,后果表述,注意事项,功效作用,医疗费用。 | +| **评论情感分类** | 评论:{s}。请判断该条评论所属类别(积极或消极)并填至空格处。回答: | +| **评论对象抽取** | 评论:{s}.这条评论的评价对象是谁? | +| **新闻分类** | “{s}”是什么新闻频道写的?选项:故事,文化,娱乐,体育,财经,房产,汽车,教育,科技,军事,旅游,国际,股票,农业,电竞。答: | + +``` +input_text: “导致泗水的砭石受到追捧,价格突然上涨。而泗水县文化市场综合执法局颜鲲表示,根据监控”找出上述句子中的实体和他们对应的类别 +output: 泗水县文化市场综合执法局:政府,颜鲲:姓名 +================================================== +input_text: “你好,我还款银行怎么更换”和“怎么更换绑定还款的卡”这两句话是在说同一件事吗? +output: 是 +================================================== +input_text: “为打消市场顾虑,工行两位洋股东——美国运通和安联集团昨晚做出承诺,近期不会减持工行H股。”中的“工行”和“美国运通”是什么关系?答: +output: 被持股 +================================================== +input_text: 请根据以下产品信息设计广告文案。商品信息:类型-裤,版型-宽松,风格-潮,风格-复古,风格-文艺,图案-复古,裤型-直筒裤,裤腰型-高腰,裤口-毛边 +output: 这款牛仔裤采用高腰直筒的版型设计,搭配宽松的裤型,穿着舒适又显潮流感。而裤脚的毛边设计,增添几分复古文艺的气息。 +================================================== +input_text: 问题:“呼气试验阳性什么意思”。此问题的医学意图是什么?选项:病情诊断,病因分析,治疗方案,就医建议,指标解读,疾病描述,后果表述,注意事项,功效作用,医疗费用。 +output: 指标解读 +================================================== +input_text: 评论:房间很一般,小,且让人感觉脏,隔音效果差,能听到走廊的人讲话,走廊光线昏暗,旁边没有什么可吃。请判断该条评论所属类别(积极或消极)并填至空格处。回答: +output: 消极 +================================================== +input_text: 评论:灵水的水质清澈,建议带个浮潜装备,可以看清湖里的小鱼。.这条评论的评价对象是谁? +output: 灵水 +================================================== +input_text: “懒人适合种的果树:长得多、好打理,果子多得都得送邻居吃”是什么新闻频道写的?选项:故事,文化,娱乐,体育,财经,房产,汽车,教育,科技,军事,旅游,国际,股票,农业,电竞。答: +output: 农业 +================================================== +``` + +# Reference + +```bibtex +@article{2020t5, + author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. 
Liu}, + title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}, + journal = {Journal of Machine Learning Research}, + year = {2020}, + volume = {21}, + number = {140}, + pages = {1-67}, + url = {http://jmlr.org/papers/v21/20-074.html} +} +@inproceedings{wolf-etal-2020-transformers, + title = "Transformers: State-of-the-Art Natural Language Processing", + author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush", + booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", + month = oct, + year = "2020", + address = "Online", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6", + pages = "38--45" +} +``` diff --git a/examples/language_model/t5/convert.py b/examples/language_model/t5/convert.py new file mode 100644 index 0000000000000000000000000000000000000000..8c1fa30ea14481fc6510292ba96797236d454e09 --- /dev/null +++ b/examples/language_model/t5/convert.py @@ -0,0 +1,68 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import OrderedDict +import argparse + +dont_transpose = [ + "shared.weight", + "layer_norm.weight", + ".layer_norm.weight", + "relative_attention_bias.weight", + "embed_tokens.weight", +] + + +def convert_pytorch_checkpoint_to_paddle(pytorch_checkpoint_path, paddle_dump_path): + import torch + import paddle + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + transpose = False + + if k[-7:] == ".weight": + if not any([w in k for w in dont_transpose]): + if v.ndim == 2: + v = v.transpose(0, 1) + transpose = True + + print(f"Converting: {k} | is_transpose {transpose}") + + if k != "lm_head.weight": + k = "t5." 
+ k + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pytorch_checkpoint_path", + default="google/t5-large/pytorch_model.bin", + type=str, + required=False, + help="Path to the Pytorch checkpoint path.", + ) + parser.add_argument( + "--paddle_dump_path", + default="paddle/t5-large/model_state.pdparams", + type=str, + required=False, + help="Path to the output Paddle model.", + ) + args = parser.parse_args() + convert_pytorch_checkpoint_to_paddle(args.pytorch_checkpoint_path, args.paddle_dump_path) diff --git a/examples/language_model/t5/data.py b/examples/language_model/t5/data.py new file mode 100644 index 0000000000000000000000000000000000000000..8630ffbd64a6c3ff890e748f201c4e87dca4e23c --- /dev/null +++ b/examples/language_model/t5/data.py @@ -0,0 +1,383 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import os +from functools import partial + +from paddle.io import BatchSampler, DataLoader +from utils import load_pickle, save_pickle + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset + +CLUE_PROCESSED = collections.OrderedDict( + [ + ("afqmc", (["afqmc sentence1: ", "afqmc sentence2: "], ["不同", "类似"])), + ( + "tnews", + ( + ["tnews sentence: "], + ["故事", "文化", "娱乐", "体育", "财经", "房产", "汽车", "教育", "科技", "军事", "旅游", "国际", "股票", "农业", "电竞"], + ), + ), + ( + "iflytek", + ( + ["iflytek sentence: "], + [ + "打车", + "地图导航", + "免费WIFI", + "租车", + "同城服务", + "快递物流", + "婚庆", + "家政", + "公共交通", + "政务", + "社区服务", + "薅羊毛", + "魔幻", + "仙侠", + "卡牌", + "飞行空战", + "射击游戏", + "休闲益智", + "动作类", + "体育竞技", + "棋牌中心", + "经营养成", + "策略", + "MOBA", + "辅助工具", + "约会社交", + "即时通讯", + "工作社交", + "论坛圈子", + "婚恋社交", + "情侣社交", + "社交工具", + "生活社交", + "微博博客", + "新闻", + "漫画", + "小说", + "技术", + "教辅", + "问答交流", + "搞笑", + "杂志", + "百科", + "影视娱乐", + "求职", + "兼职", + "视频", + "短视频", + "音乐", + "直播", + "电台", + "K歌", + "成人", + "中小学", + "职考", + "公务员", + "英语", + "视频教育", + "高等教育", + "成人教育", + "艺术", + "语言(非英语)", + "旅游资讯", + "综合预定", + "民航", + "铁路", + "酒店", + "行程管理", + "民宿短租", + "出国", + "工具", + "亲子儿童", + "母婴", + "驾校", + "违章", + "汽车咨询", + "汽车交易", + "日常养车", + "行车辅助", + "租房", + "买房", + "装修家居", + "电子产品", + "问诊挂号", + "养生保健", + "医疗服务", + "减肥瘦身", + "美妆美业", + "菜谱", + "餐饮店", + "体育咨讯", + "运动健身", + "支付", + "保险", + "股票", + "借贷", + "理财", + "彩票", + "记账", + "银行", + "美颜", + "影像剪辑", + "摄影修图", + "相机", + "绘画", + "二手", + "电商", + "团购", + "外卖", + "电影票务", + "社区超市", + "购物咨询", + "笔记", + "办公", + "日程管理", + "女性", + "经营", + "收款", + "其他", + ], + ), + ), + ("cmnli", (["cmnli sentence1: ", "cmnli sentence2: "], ["矛盾", "中立", "蕴涵"])), + ("ocnli", (["ocnli sentence1: ", "ocnli sentence2: "], ["蕴涵", "矛盾", "中立"])), + ("cluewsc2020", (["cluewsc2020 sentence: "], ["同义", "歧义"])), + ("csl", ((["csl sentence1: ", "csl sentence2: "], ["伪造", "真实"]))), + ] +) +GLUE_PROCESSED = collections.OrderedDict( + [ + ("cola", (["cola sentence: "], 
["not_acceptable", "acceptable"])), + ("sst-2", (["sst2 sentence: "], ["negative", "positive"])), + ( + "mrpc", + (["mrpc sentence1: ", " sentence2: "], ["not_equivalent", "equivalent"]), + ), + ("sts-b", (["stsb sentence1: ", " sentence2: "], None)), + ("qqp", (["qqp question1: ", " question2: "], ["not_duplicate", "duplicate"])), + ( + "mnli", + ( + ["mnli hypothesis: ", " premise: "], + ["contradiction", "entailment", "neutral"], + ), + ), + ( + "qnli", + (["qnli question: ", " sentence: "], ["entailment", "not_entailment"]), + ), + ( + "rte", + (["rte sentence1: ", " rte sentence2: "], ["entailment", "not_entailment"]), + ), + ] +) + +GLUE_1_1_PROCESSED = collections.OrderedDict( + [ + ("cola", (["cola sentence: "], ["outrageous", "acceptable"])), + ("sst-2", (["sst2 sentence: "], ["negative", "positive"])), + ( + "mrpc", + (["mrpc sentence1: ", " sentence2: "], ["nonidentical", "equivalent"]), + ), + ("sts-b", (["stsb sentence1: ", " sentence2: "], None)), + ("qqp", (["qqp question1: ", " question2: "], ["inequable", "duplicate"])), + ( + "mnli", + ( + ["mnli hypothesis: ", " premise: "], + ["contradiction", "entailment", "neutral"], + ), + ), + ( + "qnli", + (["qnli question: ", " sentence: "], ["entailment", "contradiction"]), + ), + ( + "rte", + (["rte sentence1: ", " rte sentence2: "], ["entailment", "contradiction"]), + ), + ] +) + + +def trans_func(example, tokenizer, args): + task_name = args.task_name + processed, label = GLUE_PROCESSED[task_name] + if label: + id2label = dict(zip(range(len(label)), label)) + else: + id2label = None + + if not args.is_test: + if id2label: + label_text = id2label[example["labels"]] + else: + label_text = str(example["labels"]) + target = tokenizer(label_text, return_token_type_ids=False, return_attention_mask=True) + + if len(processed) == 1: + text = processed[0] + example["sentence"] + else: + text = processed[0] + example["sentence1"] + processed[1] + example["sentence2"] + + source = tokenizer( + text, + max_seq_len=args.max_seq_length, + return_token_type_ids=False, + return_attention_mask=True, + ) + + if not args.is_test: + return ( + source["input_ids"], + source["attention_mask"], + target["input_ids"], + target["attention_mask"], + ) + else: + return source["input_ids"], source["attention_mask"] + + +def get_train_dataloader(tokenizer, args): + filename = os.path.join("caches", args.task_name + "_train" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits="train") + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + batch_sampler = BatchSampler(ds, batch_size=args.train_batch_size, shuffle=True) + + # batchify_fn = lambda samples, fn=Tuple( + # Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + # Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # attention_mask + # Pad(axis=0, pad_val=-100, dtype="int64"), # lm_labels + # Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # decoder_attention_mask + # ): fn(samples) + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # attention_mask + Pad(axis=0, pad_val=-100, dtype="int64"), # lm_labels + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # decoder_attention_mask + ), + ): + return fn(samples) + + data_loader = DataLoader( + dataset=ds, + 
batch_sampler=batch_sampler, + collate_fn=batchify_fn, + num_workers=args.num_workers, + return_list=True, + ) + + return data_loader + + +def get_dev_dataloader(tokenizer, args): + filename = os.path.join("caches", args.task_name + "_dev" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits="dev") + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + batch_sampler = BatchSampler(ds, batch_size=args.train_batch_size, shuffle=False) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # attention_mask + Pad(axis=0, pad_val=-100, dtype="int64"), # lm_labels + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # decoder_attention_mask + ), + ): + return fn(samples) + + data_loader = DataLoader( + dataset=ds, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + num_workers=args.num_workers, + return_list=True, + ) + + return data_loader + + +def get_mnli_dev_dataloader(tokenizer, args, matched=True): + if matched: + split = "dev_matched" + else: + split = "dev_mismatched" + filename = os.path.join("caches", args.task_name + f"_{split}" + ".pkl") + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits=split) + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + batch_sampler = BatchSampler(ds, batch_size=args.train_batch_size, shuffle=False) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # attention_mask + Pad(axis=0, pad_val=-100, dtype="int64"), # lm_labels + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # decoder_attention_mask + ), + ): + return fn(samples) + + data_loader = DataLoader( + dataset=ds, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + num_workers=args.num_workers, + return_list=True, + ) + + return data_loader diff --git a/examples/language_model/t5/dataset_utils.py b/examples/language_model/t5/dataset_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..c7149da377ec972ff6980e9a9c8e73ea42d117b5 --- /dev/null +++ b/examples/language_model/t5/dataset_utils.py @@ -0,0 +1 @@ +../../../model_zoo/ernie-1.0/data_tools/dataset_utils.py \ No newline at end of file diff --git a/examples/language_model/t5/glue_demo.py b/examples/language_model/t5/glue_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..61ada574d22c2f6609f17c3e526f95254a495d83 --- /dev/null +++ b/examples/language_model/t5/glue_demo.py @@ -0,0 +1,79 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
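+
+# Usage note (illustrative only, not executed by this script): the Demo class below wraps a T5
+# checkpoint and prints the generated label text for a GLUE-style prompt. A hypothetical
+# interactive session, with an invented input sentence, could look like:
+#
+#   from glue_demo import Demo
+#   demo = Demo(model_name_or_path="t5-base", max_predict_len=5)
+#   demo.generate("sst2 sentence: a gorgeous , witty film ")   # expected to print a label such as "positive"
+#   demo.generate("cola sentence: Mickey looked it up.", max_predict_len=4)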
+ +import paddle +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer + + +class Demo: + def __init__(self, model_name_or_path="t5-base", max_predict_len=5): + self.tokenizer = T5Tokenizer.from_pretrained(model_name_or_path) + print("Loading the model parameters, please wait...") + self.model = T5ForConditionalGeneration.from_pretrained(model_name_or_path) + self.model.eval() + self.max_predict_len = max_predict_len + print("Model loaded.") + + # prediction function + @paddle.no_grad() + def generate(self, inputs, max_predict_len=None): + max_predict_len = max_predict_len if max_predict_len is not None else self.max_predict_len + + ids = self.tokenizer(inputs)["input_ids"] + input_ids = paddle.to_tensor([ids], dtype="int64") + outputs = self.model.generate(input_ids, max_length=max_predict_len)[0][0] + decode_outputs = self.tokenizer.decode(outputs, skip_special_tokens=True).strip() + print(f"input text: {inputs}") + print(f"label: {decode_outputs}") + print("=" * 50) + + +if __name__ == "__main__": + label_length_map = { + "cola": 4, + "sst2": 1, + "mrpc": 5, + "stsb": 5, + "qqp": 5, + "mnli": 4, + "qnli": 5, + "rte": 5, + } + demo = Demo(model_name_or_path="t5-base") + input_text_list = [ + "sst2 sentence: contains no wit , only labored gags ", + "sst2 sentence: that loves its characters and communicates something rather beautiful about human nature ", + "cola sentence: Mickey looked it up.", + "sst2 sentence: remains utterly satisfied to remain the same throughout ", + "sst2 sentence: a well-made and often lovely depiction of the mysteries of friendship ", + ] + for text in input_text_list: + max_predict_len = label_length_map[text.split()[0]] + demo.generate(text, max_predict_len=max_predict_len) + + # input text: sst2 sentence: contains no wit , only labored gags + # label: negative + # ================================================== + # input text: sst2 sentence: that loves its characters and communicates something rather beautiful about human nature + # label: positive + # ================================================== + # input text: cola sentence: Mickey looked it up. + # label: acceptable + # ================================================== + # input text: sst2 sentence: remains utterly satisfied to remain the same throughout + # label: positive + # ================================================== + # input text: sst2 sentence: a well-made and often lovely depiction of the mysteries of friendship + # label: positive + # ================================================== diff --git a/examples/language_model/t5/run_clue_trainer.py b/examples/language_model/t5/run_clue_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..d573aba663402b19f4cb63bf1c81153f579c6ae7 --- /dev/null +++ b/examples/language_model/t5/run_clue_trainer.py @@ -0,0 +1,307 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
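+
+# Launch sketch (illustrative only): the flag names come from the DataArguments/ModelArguments
+# dataclasses defined in this file plus the standard PaddleNLP Seq2SeqTrainingArguments; adjust
+# the task name and paths to your setup.
+#
+#   python run_clue_trainer.py \
+#       --model_name_or_path Langboat/mengzi-t5-base \
+#       --task_name tnews \
+#       --max_seq_length 128 \
+#       --output_dir ./clue_tnews \
+#       --do_train --do_eval \
+#       --per_device_train_batch_size 32 \
+#       --learning_rate 1e-4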
+ +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import paddle +from data import CLUE_PROCESSED +from utils import CLUE_METRICS, load_pickle, save_pickle + +from paddlenlp.data.data_collator import DataCollatorForSeq2Seq +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Seq2SeqTrainer, + Seq2SeqTrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer +from paddlenlp.utils.log import logger + + +def trans_func(example, tokenizer, args): + task_name = args.task_name + PROCESSED = CLUE_PROCESSED + processed, label = PROCESSED[task_name] + if label: + id2label = dict(zip(range(len(label)), label)) + else: + id2label = None + + is_test = "label" not in example + # Convert raw text to feature + if "keyword" in example and task_name == "csl": # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example and task_name == "cluewsc2020": # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + if not is_test: + if id2label: + label_text = id2label[example["label"]] + else: + label_text = str(example["label"]) + target = tokenizer(label_text, return_token_type_ids=False, return_attention_mask=True) + + if len(processed) == 1: + text = processed[0] + example["sentence"] + else: + text = processed[0] + example["sentence1"] + processed[1] + example["sentence2"] + + source = tokenizer( + text, + max_seq_len=args.max_seq_length, + padding="max_length", + return_token_type_ids=False, + return_attention_mask=True, + ) + + if not is_test: + return { + "input_ids": source["input_ids"], + "attention_mask": source["attention_mask"], + "labels": target["input_ids"], + "decoder_attention_mask": target["attention_mask"], + } + else: + return {"input_ids": source["input_ids"], "attention_mask": source["attention_mask"]} + + +def get_train_dataset(tokenizer, args): + filename = os.path.join(args.cache_dir, args.task_name + "_train" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("clue", args.task_name, splits="train") + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + return ds + + +def get_dev_dataset(tokenizer, args): + filename = os.path.join(args.cache_dir, args.task_name + "_dev" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("clue", args.task_name, splits="dev") + ds.map( + partial(trans_func, 
tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + task_name: str = field(default=None, metadata={"help": "The name of the task to use (via the datasets library)."}) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + cache_dir: str = field(default="./caches", metadata={"help": "cache dir for datasets."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: str = field( + default="Langboat/mengzi-t5-base", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, Seq2SeqTrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if not os.path.exists(data_args.cache_dir): + os.mkdir(data_args.cache_dir) + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + PROCESSED = CLUE_PROCESSED + label_name = PROCESSED[data_args.task_name][1] + if label_name: + label2id = dict(zip(label_name, range(len(label_name)))) + else: + label2id = None + metric_list = CLUE_METRICS[data_args.task_name] + # generate_max_length = label_length_map[data_args.task_name] + + # get model and tokenizer + model = T5ForConditionalGeneration.from_pretrained(model_args.model_name_or_path) + tokenizer = T5Tokenizer.from_pretrained(model_args.model_name_or_path) + print(model) + # get dataloader + train_dataset = get_train_dataset(tokenizer, data_args) + eval_dataset = get_dev_dataset(tokenizer, data_args) + + data_collator = DataCollatorForSeq2Seq( + tokenizer=tokenizer, model=model, pad_to_multiple_of=8 if training_args.fp16 else None + ) + + # Define the metrics of tasks. + def compute_metrics(p, tokenizer=tokenizer, label2id=label2id): + all_preds = [] + all_labels = [] + # source_ids, source_mask, labels, target_mask = batch + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + labels = p.label_ids + for p, l in zip(preds, labels): + pred = tokenizer.decode(p, skip_special_tokens=True).strip() + label = tokenizer.decode(l, skip_special_tokens=True).strip() + if label2id: + # for classifaction task. + label = label2id[label] + if pred not in label2id: + # set to wrong label if the generated text not in the labal set. + pred = 0 + if label == 0: + pred = 1 + else: + pred = label2id[pred] + else: + # for regression task. + label = float(label.replace(" ", "")) + try: + pred = float(pred.replace(" ", "")) + except Exception as e: + # set to zero if the generated text can not convert to float + pred = 0.0 + print(e) + + all_preds.append(pred) + all_labels.append(label) + + all_preds = paddle.to_tensor(all_preds).detach() + all_labels = paddle.to_tensor(all_labels).detach() + + results = {} + for metric in metric_list: + results.update(metric(all_labels, all_preds)) + + return results + + training_args.predict_with_generate = True + trainer = Seq2SeqTrainer( + model=model, + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=data_collator, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + # trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/t5/run_glue.py b/examples/language_model/t5/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..94de77f467ed43981b357ea7e7e6fa3af9e92e74 --- /dev/null +++ b/examples/language_model/t5/run_glue.py @@ -0,0 +1,439 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import math +import os + +import paddle +from data import ( + GLUE_PROCESSED, + get_dev_dataloader, + get_mnli_dev_dataloader, + get_train_dataloader, +) +from paddle.amp import GradScaler, auto_cast +from paddle.optimizer import AdamW +from tqdm import tqdm +from utils import GLUE_METRICS, get_scheduler, get_writer, set_seed + +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + + parser.add_argument( + "--model_name_or_path", + default="t5-small", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--task_name", + default="sst-2", + type=str, + help="task_name.", + ) + parser.add_argument( + "--output_dir", + default="outputs", + type=str, + help="The output directory where the model predictions and checkpoints will be written. " + "Default as `outputs`", + ) + parser.add_argument( + "--max_seq_length", + default=256, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--train_batch_size", + default=4, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--eval_batch_size", + default=16, + type=int, + help="Batch size per GPU/CPU for evaluating.", + ) + + parser.add_argument( + "--gradient_accumulation_steps", + default=1, + type=int, + help="gradient_accumulation_steps.", + ) + parser.add_argument( + "--learning_rate", + default=2e-5, + type=float, + help="The initial learning rate for Adam.", + ) + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=4, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_train_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_radio", + default=0.1, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--warmup_steps", type=int, default=-1, help="warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=10, help="Log every X updates steps.") + parser.add_argument( + "--save_steps", + type=int, + default=50, + help="Save checkpoint every X updates steps.", + ) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--writer_type", + choices=["visualdl", "tensorboard"], + default="visualdl", + help="writer_type.", + ) + parser.add_argument( + "--scheduler_type", + choices=["linear", "cosine", "poly"], + default="linear", + type=str, + help="scheduler_type.", + ) + parser.add_argument("--use_amp", action="store_true", help="Enable mixed precision training.") + parser.add_argument( + "--scale_loss", + type=float, + default=2**15, + help="The value of scale_loss for fp16.", + ) + parser.add_argument( + "--num_workers", + type=int, + default=0, + help="num_workers.", + ) + parser.add_argument("--is_test", action="store_true", help="is_test.") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["gpu", "cpu", "npu"], + help="The device to select to train the model, is must be cpu/gpu/npu.", + ) + args = parser.parse_args() + args.task_name = args.task_name.lower() + args.logdir = os.path.join(args.output_dir, "logs") + os.makedirs("caches", exist_ok=True) + os.makedirs(args.logdir, exist_ok=True) + + return args + + +label_length_map = { + "cola": 4, + "sst-2": 1, + "mrpc": 5, + "sts-b": 5, + "qqp": 5, + "mnli": 4, + "qnli": 5, + "rte": 5, +} + +logger = logging.getLogger(__name__) + + +@paddle.no_grad() +def evaluate(model, data_loader, tokenizer, label2id, metric_list, generate_max_length=5): + model.eval() + all_preds = [] + all_labels = [] + + for batch in data_loader: + source_ids, source_mask, labels, target_mask = batch + outputs = model.generate( + input_ids=source_ids, + attention_mask=source_mask, + max_length=generate_max_length, + )[0] + + for p, l, m in zip(outputs.numpy(), labels.numpy(), target_mask.numpy()): + pred = tokenizer.decode(p, skip_special_tokens=True).strip() + label = tokenizer.decode(l[m.astype("bool")], skip_special_tokens=True).strip() + if label2id: + pred = label2id[pred] + label = label2id[label] + else: + pred = float(pred.replace(" ", "")) + label = float(label.replace(" ", "")) + + all_preds.append(pred) + all_labels.append(label) + + results = {} + for metric in metric_list: + results.update(metric(all_labels, all_preds)) + print(results) + return results + + +def main(args): + paddle.set_device(args.device) + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO, + handlers=[ + logging.FileHandler( + os.path.join(args.output_dir, "run.log"), + mode="w", + encoding="utf-8", + ) + ], + ) + logger.info("********** Configuration Arguments **********") + for arg, value in sorted(vars(args).items()): + logger.info(f"{arg}: {value}") + logger.info("**************************************************") + set_seed(args) + + # metric and label + label_name = GLUE_PROCESSED[args.task_name][1] + if label_name: + label2id = dict(zip(label_name, range(len(label_name)))) + else: + label2id = None + metric_list = GLUE_METRICS[args.task_name] + generate_max_length = 
label_length_map[args.task_name] + + writer = get_writer(args) + + # get model and tokenizer + model = T5ForConditionalGeneration.from_pretrained(args.model_name_or_path) + tokenizer = T5Tokenizer.from_pretrained(args.model_name_or_path) + + # get dataloader + train_dataloader = get_train_dataloader(tokenizer, args) + if args.task_name == "mnli": + dev_dataloader_match = get_mnli_dev_dataloader(tokenizer, args, matched=True) + dev_dataloader_mismatch = get_mnli_dev_dataloader(tokenizer, args, matched=False) + else: + dev_dataloader = get_dev_dataloader(tokenizer, args) + + num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps) + if args.max_train_steps > 0: + args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch) + else: + args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch + + # get lr_scheduler + lr_scheduler = get_scheduler( + learning_rate=args.learning_rate, + scheduler_type=args.scheduler_type, + num_warmup_steps=args.warmup_steps if args.warmup_steps > 0 else args.warmup_radio, + num_training_steps=args.max_train_steps, + ) + + total_batch_size = args.train_batch_size * args.gradient_accumulation_steps + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + if args.use_amp: + scaler = GradScaler(init_loss_scaling=args.scale_loss) + + logger.info("********** Running training **********") + logger.info(f" Num examples = {len(train_dataloader.dataset)}") + logger.info(f" Num Epochs = {args.num_train_epochs}") + logger.info(f" Instantaneous train batch size = {args.train_batch_size}") + logger.info(f" Instantaneous eval batch size = {args.eval_batch_size}") + logger.info(f" Total train batch size (w. 
accumulation) = {total_batch_size}") + logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}") + logger.info(f" Total optimization steps = {args.max_train_steps}") + + progress_bar = tqdm(range(args.max_train_steps)) + + global_steps = 0 + tr_loss, logging_loss = 0.0, 0.0 + + for _ in range(args.num_train_epochs): + for step, batch in enumerate(train_dataloader): + model.train() + with auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax"]): + source_ids, source_mask, labels, target_mask = batch + outputs = model( + input_ids=source_ids, + attention_mask=source_mask, + labels=labels, + decoder_attention_mask=target_mask, + ) + loss = outputs[0] / args.gradient_accumulation_steps + tr_loss += loss.item() + + if args.use_amp: + scaler.scale(loss).backward() + else: + loss.backward() + + if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1: + if args.use_amp: + scaler.minimize(optimizer, loss) + else: + optimizer.step() + + lr_scheduler.step() + optimizer.clear_grad() + progress_bar.update(1) + global_steps += 1 + + if args.logging_steps > 0 and global_steps % args.logging_steps == 0: + writer.add_scalar("lr", lr_scheduler.get_lr(), global_steps) + writer.add_scalar( + "loss", + (tr_loss - logging_loss) / args.logging_steps, + global_steps, + ) + logger.info( + "global_steps {} - lr: {:.10f} loss: {:.10f}".format( + global_steps, + lr_scheduler.get_lr(), + (tr_loss - logging_loss) / args.logging_steps, + ) + ) + logging_loss = tr_loss + + if args.save_steps > 0 and global_steps % args.save_steps == 0: + logger.info("********** Running evaluating **********") + logger.info(f"********** Step {global_steps} **********") + output_dir = os.path.join(args.output_dir, f"step-{global_steps}") + os.makedirs(output_dir, exist_ok=True) + + if args.task_name == "mnli": + matched_results = evaluate( + model, + dev_dataloader_match, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in matched_results.items(): + writer.add_scalar(f"eval/matched_{k}", v, global_steps) + logger.info(f" {k} = {v}") + mismatched_results = evaluate( + model, + dev_dataloader_mismatch, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in mismatched_results.items(): + writer.add_scalar(f"eval/mismatched_{k}", v, global_steps) + logger.info(f" {k} = {v}") + else: + eval_results = evaluate( + model, + dev_dataloader, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in eval_results.items(): + writer.add_scalar(f"eval/{k}", v, global_steps) + logger.info(f" {k} = {v}") + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + logger.info("********** Evaluating Done **********") + + if global_steps >= args.max_train_steps: + logger.info("********** Running evaluating **********") + logger.info(f"********** Step {global_steps} **********") + output_dir = os.path.join(args.output_dir, f"step-{global_steps}") + os.makedirs(output_dir, exist_ok=True) + + if args.task_name == "mnli": + matched_results = evaluate( + model, + dev_dataloader_match, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in matched_results.items(): + writer.add_scalar(f"eval/matched_{k}", v, global_steps) + logger.info(f" {k} = {v}") + mismatched_results = evaluate( + model, + dev_dataloader_mismatch, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in mismatched_results.items(): + writer.add_scalar(f"eval/mismatched_{k}", v, 
global_steps) + logger.info(f" {k} = {v}") + else: + eval_results = evaluate( + model, + dev_dataloader, + tokenizer, + label2id, + metric_list, + generate_max_length, + ) + for k, v in eval_results.items(): + writer.add_scalar(f"eval/{k}", v, global_steps) + logger.info(f" {k} = {v}") + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + logger.info("********** Evaluating Done **********") + logger.info("********** Training Done **********") + return + + +if __name__ == "__main__": + args = parse_args() + main(args) diff --git a/examples/language_model/t5/run_glue_trainer.py b/examples/language_model/t5/run_glue_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..4998aea7c553212d5d818d8e75fd8bf4eff074ab --- /dev/null +++ b/examples/language_model/t5/run_glue_trainer.py @@ -0,0 +1,394 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from dataclasses import dataclass, field +from functools import partial +from typing import Any, Dict, List, Optional, Tuple, Union + +import paddle +import paddle.nn as nn +from data import GLUE_1_1_PROCESSED, GLUE_PROCESSED +from utils import GLUE_METRICS, load_pickle, save_pickle + +from paddlenlp.data import Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer +from paddlenlp.utils.log import logger + +label_length_map = { + "cola": 4, + "sst-2": 1, + "mrpc": 5, + "sts-b": 5, + "qqp": 5, + "mnli": 4, + "qnli": 5, + "rte": 5, +} + + +def trans_func(example, tokenizer, args): + task_name = args.task_name + PROCESSED = GLUE_PROCESSED + if "v1_1" in args.cache_dir: + PROCESSED = GLUE_1_1_PROCESSED + processed, label = PROCESSED[task_name] + if label: + id2label = dict(zip(range(len(label)), label)) + else: + id2label = None + + is_test = "labels" not in example + + if not is_test: + if id2label: + label_text = id2label[example["labels"]] + else: + label_text = str(example["labels"]) + target = tokenizer(label_text, return_token_type_ids=False, return_attention_mask=True) + + if len(processed) == 1: + text = processed[0] + example["sentence"] + else: + text = processed[0] + example["sentence1"] + processed[1] + example["sentence2"] + + source = tokenizer( + text, + max_seq_len=args.max_seq_length, + padding="max_length", + return_token_type_ids=False, + return_attention_mask=True, + ) + + if not is_test: + return { + "input_ids": source["input_ids"], + "attention_mask": source["attention_mask"], + "labels": target["input_ids"], + "decoder_attention_mask": target["attention_mask"], + } + else: + return {"input_ids": source["input_ids"], "attention_mask": source["attention_mask"]} + + +class BatchDict(object): + def __init__(self, fn): + assert isinstance(fn, (dict)), ( + "Input pattern not understood. 
The input of Dict must be a dict with key of input column name and value of collate_fn " + "Received fn=%s" % (str(fn)) + ) + + self._fn = fn + + for col_name, ele_fn in self._fn.items(): + assert callable(ele_fn), "Batchify functions must be callable! type(fn[%d]) = %s" % ( + col_name, + str(type(ele_fn)), + ) + + def __call__(self, data): + + ret = {} + if len(data) <= 0: + return ret + + for col_name, ele_fn in self._fn.items(): + # skip unused col_name, such as labels in test mode. + if col_name not in data[0].keys(): + continue + result = ele_fn([ele[col_name] for ele in data]) + ret[col_name] = result + + return ret + + +def get_train_dataset(tokenizer, args): + filename = os.path.join(args.cache_dir, args.task_name + "_train" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits="train") + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +def get_dev_dataset(tokenizer, args): + filename = os.path.join(args.cache_dir, args.task_name + "_dev" + ".pkl") + + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits="dev") + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +def get_mnli_dev_dataset(tokenizer, args, matched=True): + if matched: + split = "dev_matched" + else: + split = "dev_mismatched" + filename = os.path.join(args.cache_dir, args.task_name + f"_{split}" + ".pkl") + if os.path.exists(filename): + ds = load_pickle(filename) + else: + ds = load_dataset("glue", args.task_name, splits=split) + ds.map( + partial(trans_func, tokenizer=tokenizer, args=args), + batched=False, + lazy=False, + ) + save_pickle(ds, filename) + + return ds + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + task_name: str = field(default=None, metadata={"help": "The name of the task to use (via the datasets library)."}) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + cache_dir: str = field(default="./caches", metadata={"help": "cache dir for datasets."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
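+
+ Note: `model_name_or_path` may be a local checkpoint directory or a built-in identifier such
+ as "t5-small"; when the path contains "v1_1", `main` switches the cache dir and uses the
+ GLUE_1_1_PROCESSED label words instead. `export_model_dir`, when set, is the directory for
+ the exported inference model.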
+ """ + + model_name_or_path: str = field( + default="t5-small", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) + + +class T5GlueTrainer(Trainer): + def __init__(self, do_generation: bool, label2id, **kwargs): + super().__init__(**kwargs) + self.do_generation = do_generation + self.label2id = label2id + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + + if not self.do_generation: + return super().prediction_step( + model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys + ) + + all_preds = [] + all_labels = [] + # source_ids, source_mask, labels, target_mask = batch + labels = inputs["labels"] + target_mask = inputs["decoder_attention_mask"] + + with paddle.no_grad(): + outputs = model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_length=5, + )[0] + + for p, l, m in zip(outputs.numpy(), labels.numpy(), target_mask.numpy()): + pred = self.tokenizer.decode(p, skip_special_tokens=True).strip() + label = self.tokenizer.decode(l[m.astype("bool")], skip_special_tokens=True).strip() + + if self.label2id: + # for classifaction task. + label = self.label2id[label] + if pred not in self.label2id: + # set to wrong label if the generated text not in the labal set. + pred = 0 + if label == 0: + pred = 1 + else: + pred = self.label2id[pred] + else: + # for regression task. + label = float(label.replace(" ", "")) + try: + pred = float(pred.replace(" ", "")) + except Exception: + # set to zero if the generated text can not convert to float + pred = 0.0 + + all_preds.append(pred) + all_labels.append(label) + + all_preds = paddle.to_tensor(all_preds).detach() + all_labels = paddle.to_tensor(all_labels).detach() + + return (None, all_preds, all_labels) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if "v1_1" in model_args.model_name_or_path: + data_args.cache_dir = "./caches_v1_1" + if not os.path.exists(data_args.cache_dir): + os.mkdir(data_args.cache_dir) + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." 
+ ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + PROCESSED = GLUE_PROCESSED + if "v1_1" in data_args.cache_dir: + PROCESSED = GLUE_1_1_PROCESSED + label_name = PROCESSED[data_args.task_name][1] + if label_name: + label2id = dict(zip(label_name, range(len(label_name)))) + else: + label2id = None + metric_list = GLUE_METRICS[data_args.task_name] + + # get model and tokenizer + model = T5ForConditionalGeneration.from_pretrained(model_args.model_name_or_path) + tokenizer = T5Tokenizer.from_pretrained(model_args.model_name_or_path) + + # get dataloader + train_dataset = get_train_dataset(tokenizer, data_args) + if data_args.task_name == "mnli": + eval_dataset = get_mnli_dev_dataset(tokenizer, data_args, matched=True) + else: + eval_dataset = get_dev_dataset(tokenizer, data_args) + + batchify_fn = lambda samples, fn=BatchDict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + "attention_mask": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # attention_mask + "labels": Pad(axis=0, pad_val=-100, dtype="int64"), # lm_labels + "decoder_attention_mask": Pad( + axis=0, pad_val=tokenizer.pad_token_id, dtype="int64" + ), # decoder_attention_mask + } + ): fn(samples) + data_collator = batchify_fn + + # Define the metrics of tasks. + def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + results = {} + for metric in metric_list: + results.update(metric(p.label_ids, preds)) + + return results + + trainer = T5GlueTrainer( + model=model, + criterion=None, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + do_generation=True, + label2id=label2id, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + # trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/t5/t5_dataset.py b/examples/language_model/t5/t5_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..e173bd35cd736a3018cc05e64abd6de3e9678a51 --- /dev/null +++ b/examples/language_model/t5/t5_dataset.py @@ -0,0 +1,340 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""T5 Style dataset.""" + +import collections +import copy + +import numpy as np +import paddle +from dataset_utils import create_masked_lm_predictions, get_samples_mapping + + +class T5Dataset(paddle.io.Dataset): + def __init__( + self, + name, + tokenizer, + indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + masked_lm_prob, + max_seq_length, + max_seq_length_dec, + short_seq_prob, + seed, + binary_head=False, + share_folder=False, + args=None, + ): + + # Params to store. + self.name = name + self.seed = seed + self.masked_lm_prob = masked_lm_prob + self.max_seq_length = max_seq_length + self.max_seq_length_dec = max_seq_length_dec + self.binary_head = binary_head + self.share_folder = share_folder + self.args = args + # Dataset. + self.indexed_dataset = indexed_dataset + + # Build the samples mapping. + self.samples_mapping = get_samples_mapping( + self.indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + self.max_seq_length - 2, # account for added tokens + short_seq_prob, + self.seed, + self.name, + self.binary_head, + self.share_folder, + ) + # Vocab stuff. + self.vocab_id_list = list(tokenizer.get_vocab().values()) + self.vocab_id_to_token_dict = copy.deepcopy( + {tokenizer.convert_tokens_to_ids(key): key for key, _ in tokenizer.get_vocab().items()} + ) + self.vocab_token_to_id_dict = copy.deepcopy(tokenizer.get_vocab()) + + # T5 is chinese char level model, sometime is need + # add ## chinse char to encode and decode. + # Here we extend the vocab dict. + self.vocab_id_to_token_dict.update(tokenizer.added_tokens_decoder) + self.vocab_token_to_id_dict.update(tokenizer.added_tokens_encoder) + + self.cls_id = tokenizer.cls_token_id + self.sep_id = tokenizer.sep_token_id + self.mask_id = tokenizer.mask_token_id + self.pad_id = tokenizer.pad_token_id + + self.bos_id = tokenizer.bos_token_id + self.eos_id = tokenizer.eos_token_id + + self.sentinel_tokens = tokenizer.additional_special_tokens_ids + assert len(self.sentinel_tokens) > 0, "Provide the argument --vocab-extra-ids 100 to the script" + + def __len__(self): + return self.samples_mapping.shape[0] + + def __getitem__(self, idx): + + start_index, end_index, seq_length = self.samples_mapping[idx] + sample = [] + for index in range(start_index, end_index): + sample.append(self.indexed_dataset[index]) + # Note that this rng state should be numpy and not python since + # python randint is inclusive whereas the numpy one is exclusive. + np_rng = np.random.RandomState(seed=((self.seed + idx) % 2**32)) + return build_training_sample( + sample, + seq_length, + self.max_seq_length, # needed for padding + self.max_seq_length_dec, + self.vocab_id_list, + self.vocab_id_to_token_dict, + self.cls_id, + self.sep_id, + self.mask_id, + self.pad_id, + self.masked_lm_prob, + np_rng, + self.bos_id, + self.eos_id, + self.sentinel_tokens, + ) + + +def build_training_sample( + sample, + target_seq_length, + max_seq_length, + max_seq_length_dec, + vocab_id_list, + vocab_id_to_token_dict, + cls_id, + sep_id, + mask_id, + pad_id, + masked_lm_prob, + np_rng, + bos_id=None, + eos_id=None, + sentinel_tokens=None, +): + """Build training sample. + + Arguments: + sample: A list of sentences in which each sentence is a list token ids. + target_seq_length: Desired sequence length. + max_seq_length: Maximum length of the sequence. All values are padded to + this length. + vocab_id_list: List of vocabulary ids. 
Used to pick a random id. + vocab_id_to_token_dict: A dictionary from vocab ids to text tokens. + cls_id: Start of example id. + sep_id: Separator id. + mask_id: Mask token id. + pad_id: Padding token id. + masked_lm_prob: Probability to mask tokens. + np_rng: Random number genenrator. Note that this rng state should be + numpy and not python since python randint is inclusive for + the opper bound whereas the numpy one is exclusive. + bos_id: start of decoder example id + eos_id: end of generation id + sentinel_tokens: unique value to be substituted for every replaced span + """ + + assert target_seq_length <= max_seq_length + + # flatten sentences into one list + tokens = [token for sentence in sample for token in sentence] + + # Truncate to `target_sequence_length`. + max_num_tokens = target_seq_length + truncated = len(tokens) > max_num_tokens + tokens = tokens[:max_num_tokens] + + # Masking. + max_predictions_per_seq = masked_lm_prob * max_num_tokens + (tokens, masked_positions, masked_labels, _, masked_spans) = create_masked_lm_predictions( + tokens, + vocab_id_list, + vocab_id_to_token_dict, + masked_lm_prob, + cls_id, + sep_id, + mask_id, + max_predictions_per_seq, + np_rng, + max_ngrams=10, + geometric_dist=True, + masking_style="t5", + ) + + # Padding. + tokens_enc, tokens_dec_in, labels, enc_mask, dec_mask, enc_dec_mask, loss_mask = pad_and_convert_to_numpy( + tokens, + masked_positions, + masked_labels, + pad_id, + max_seq_length, + max_seq_length_dec, + masked_spans, + bos_id, + eos_id, + sentinel_tokens, + ) + + train_sample = { + "text_enc": tokens_enc, + "text_dec": tokens_dec_in, + "labels": labels, + "loss_mask": loss_mask, + "truncated": int(truncated), + "enc_mask": enc_mask, + "dec_mask": dec_mask, + "enc_dec_mask": enc_dec_mask, + } + return train_sample + + +def pad_and_convert_to_numpy( + tokens, + masked_positions, + masked_labels, + pad_id, + max_seq_length, + max_seq_length_dec, + masked_spans=None, + bos_id=None, + eos_id=None, + sentinel_tokens=None, +): + """Pad sequences and convert them to numpy.""" + + sentinel_tokens = collections.deque(sentinel_tokens) + t5_input = [] + (t5_decoder_in, t5_decoder_out) = ([bos_id], []) + (start_index, end_index) = (0, None) + for span in masked_spans: + flag = sentinel_tokens.popleft() + + # Append the same tokens in decoder input and output + t5_decoder_in.append(flag) + t5_decoder_in.extend(span.label) + t5_decoder_out.append(flag) + t5_decoder_out.extend(span.label) + + end_index = span.index[0] + t5_input.extend(tokens[start_index:end_index]) + t5_input.append(flag) + + # the next start index is the token after the last span token + start_index = span.index[-1] + 1 + + # Add token to the t5_decoder_out + t5_decoder_out.append(eos_id) + + # Add the remaining tokens to the t5 input + t5_input.extend(tokens[start_index:]) + + # assert (len(t5_input) - len(masked_spans)) + \ + # (len(t5_decoder_in) - (len(masked_spans) + 1)) == len(tokens) + + # Some checks. + + # Encoder-side padding mask. + num_tokens = len(t5_input) + padding_length = max_seq_length - num_tokens + assert padding_length >= 0 + assert len(masked_positions) == len(masked_labels) + + # Tokens.. + filler = [pad_id] * padding_length + tokens_enc = np.array(t5_input + filler, dtype=np.int64) + + # Decoder-side padding mask. 
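+ # Layout reminder: t5_decoder_in is [bos_id, sentinel_0, span_0 tokens, sentinel_1, span_1 tokens, ...],
+ # while t5_decoder_out is the same sentinel/span sequence without bos_id and with eos_id appended.
+ # Both are padded to max_seq_length_dec below; labels are padded with -100 (the conventional
+ # ignore index), and loss_mask marks the real decoder positions.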
+ num_tokens_dec = len(t5_decoder_in) + + padding_length_dec = max_seq_length_dec - num_tokens_dec + assert padding_length_dec >= 0 + filler_dec = [pad_id] * padding_length_dec + # print(t5_decoder_in, filler_dec, pad_id) + tokens_dec_in = np.array(t5_decoder_in + filler_dec, dtype=np.int64) + + # Create attention masks + enc_mask = make_attention_mask(tokens_enc, tokens_enc) + enc_dec_mask = make_attention_mask(tokens_dec_in, tokens_enc) + dec_mask = make_attention_mask(tokens_dec_in, tokens_dec_in) + dec_mask = dec_mask * make_history_mask(tokens_dec_in) + + # Labels mask. + labels = t5_decoder_out + ([-100] * padding_length_dec) + labels = np.array(labels, dtype=np.int64) + + # Loss mask + loss_mask = ([1] * num_tokens_dec) + ([pad_id] * padding_length_dec) + loss_mask = np.array(loss_mask, dtype=np.int64) + + return tokens_enc, tokens_dec_in, labels, enc_mask, dec_mask, enc_dec_mask, loss_mask + + +def make_attention_mask(source_block, target_block): + """ + Returns a 2-dimensional (2-D) attention mask + :param source_block: 1-D array + :param target_block: 1-D array + """ + mask = (target_block[None, :] >= 1) * (source_block[:, None] >= 1) + mask = mask.astype(np.int64) + # (source_length, target_length) + return mask + + +def make_attention_mask_3d(source_block, target_block): + """ + Returns a 3-dimensional (3-D) attention mask + :param source_block: 1-D array + :param target_block: 1-D array + """ + mask = (target_block[:, None, :] >= 1) * (source_block[:, :, None] >= 1) + # (batch, source_length, target_length) + # mask = mask.astype(np.int64) + return mask + + +def make_history_mask(block): + length = block.shape[0] + arange = np.arange(length) + history_mask = ( + arange[ + None, + ] + <= arange[:, None] + ) + history_mask = history_mask.astype(np.int64) + return history_mask + + +def make_history_mask_3d(block): + batch, length = block.shape + arange = paddle.arange(length, device=block.device) + history_mask = (arange[None, :] <= arange[:, None])[None, :] + history_mask = history_mask.expand(batch, length, length) + return history_mask diff --git a/examples/language_model/t5/t5_run_pretrain_trainer.py b/examples/language_model/t5/t5_run_pretrain_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..01e2abfa5c0d2710ca3ef1e3bca42118f4df2d32 --- /dev/null +++ b/examples/language_model/t5/t5_run_pretrain_trainer.py @@ -0,0 +1,436 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +T5 pretraining scripts. 
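+
+An illustrative launch (flag names come from the DataArguments/ModelArguments/PreTrainingArguments
+dataclasses below plus the standard PaddleNLP TrainingArguments; paths are placeholders):
+
+    python t5_run_pretrain_trainer.py \
+        --model_name_or_path t5-small \
+        --input_dir ./preprocessed_data \
+        --output_dir ./t5_pretrain_ckpts \
+        --max_seq_length 512 \
+        --max_seq_length_dec 128 \
+        --max_steps 100000 \
+        --do_train --do_eval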
+""" +import math +import os +import random +import time +from dataclasses import dataclass, field + +# from turtle import shape +from typing import Optional + +import numpy as np +import paddle +from dataset_utils import build_train_valid_test_datasets + +from paddlenlp.data import Stack +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, + speed_metrics, +) +from paddlenlp.transformers import ( + LinearAnnealingWithWarmupDecay, + T5Config, + T5ForConditionalGeneration, + T5Tokenizer, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "t5": (T5Config, T5ForConditionalGeneration, T5Tokenizer), +} + + +def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") + return fn + + return docstring_decorator + + +@dataclass +@add_start_docstrings(TrainingArguments.__doc__) +class PreTrainingArguments(TrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." + }, + ) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + max_seq_length_dec: int = field( + default=128, + metadata={ + "help": "The maximum total output sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + masked_lm_prob: float = field( + default=0.15, + metadata={"help": "Mask token prob."}, + ) + short_seq_prob: float = field( + default=0.1, + metadata={"help": "Short sequence prob."}, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + favor_longer_ngram: bool = field( + default=False, + metadata={"help": "Whether to favor long ngrams"}, + ) + max_ngrams: int = field( + default=3, + metadata={"help": "Max N Grams"}, + ) + data_impl: str = field( + default="mmap", + metadata={"help": "mmap/lazy format converted from preprocessed data."}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. 
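+
+ Note: `model_type` selects an entry of MODEL_CLASSES (only "t5" is registered here), and
+ `tokenizer_name_or_path` falls back to `model_name_or_path` when left unset (see `main`);
+ the pretraining datasets in this script are always built with `binary_head=False`.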
+ """ + + model_type: Optional[str] = field(default="t5", metadata={"help": "Only support for ernie pre-training for now."}) + model_name_or_path: str = field( + default="t5-small", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + binary_head: Optional[bool] = field(default=False, metadata={"help": "True for NSP task."}) + hidden_dropout_prob: float = field(default=0.1, metadata={"help": "The hidden dropout prob."}) + attention_probs_dropout_prob: float = field(default=0.1, metadata={"help": "The attention probs dropout prob."}) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, + binary_head=False, +): + + train_valid_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.world_size * training_args.test_iters, + ] + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + data_prefix=data_file, + args=data_args, + tokenizer=tokenizer, + splits_string=data_args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + max_seq_length=data_args.max_seq_length, + masked_lm_prob=data_args.masked_lm_prob, + short_seq_prob=data_args.short_seq_prob, + max_seq_length_dec=data_args.max_seq_length_dec, + seed=training_args.seed, + skip_warmup=True, + binary_head=False, + dataset_type="t5", + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode") + text_enc, text_dec = data["text_enc"], data["text_dec"] + if tokenizer.pad_token_id in text_enc: + text_enc = text_enc[0 : list(text_enc).index(tokenizer.pad_token_id)] + logger.info(tokenizer._decode(text_enc)) + if tokenizer.pad_token_id in text_dec: + text_dec = text_dec[0 : list(text_dec).index(tokenizer.pad_token_id)] + logger.info(tokenizer._decode(text_dec)) + + print_dataset(train_ds[0], "train") + print_dataset(valid_ds[0], "valid") + print_dataset(test_ds[0], "test") + + def _collate_data(data, stack_fn=Stack()): + # print("Line 200", data[0]) + # num_fields = len(data[0]) + num_fields = len(data[0].keys()) + out = [None] * num_fields + + # text_enc, text_dec, labels, loss_mask, truncated, enc_mask, dec_mask, enc_dec_mask + + for i in range(num_fields): + out[i] = stack_fn([list(x.values())[i] for x in data]) + + return { + "input_ids": out[0], + "decoder_input_ids": out[1], + "labels": out[2], + "attention_mask": out[5], + "decoder_attention_mask": out[6], + } + + return train_ds, valid_ds, test_ds, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... 
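+ # e.g. --input_dir "0.3 /path/to/wiki_t5 0.7 /path/to/books_t5" (paths are placeholders) is
+ # interpreted as alternating (weight, data-prefix) pairs and passed to the dataset builder as-is.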
+ return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f))) + ] + files = [x.replace("_idx.npz", "") for x in files] + files = [x.replace(".idx", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def set_seed(args): + if args.device == "cpu": + idx = 0 + else: + idx = paddle.distributed.get_rank() + random.seed(args.seed + idx) + np.random.seed(args.seed + idx) + paddle.seed(args.seed + idx) + + +class PretrainingTrainer(Trainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str = "eval"): + eval_dataloader = getattr(self, "eval_dataloader", None) + if eval_dataloader is None: + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + # must call data loader, otherwise, it will init many times, cause OOM error. + self.eval_dataloader = eval_dataloader() + + start_time = time.time() + # Temporarily disable metric computation, we will do it in the loop here. + compute_metrics = self.compute_metrics + eval_loop = self.evaluation_loop + + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + # Only evaluate max_eval_iters + max_eval_iters=self.args.eval_iters, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + return output.metrics + + def _get_eval_sampler(self, eval_dataset) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + shuffle=False, + num_replicas=self.args.world_size, + rank=self.args.process_index, + drop_last=self.args.dataloader_drop_last, + ) + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=False, + num_replicas=self.args.world_size, + rank=self.args.process_index, + drop_last=self.args.dataloader_drop_last, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + set_seed(training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, 
"Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + # if last_checkpoint is None and len( + # os.listdir(training_args.output_dir)) > 1: + # raise ValueError( + # f"Output directory ({training_args.output_dir}) already exists and is not empty. " + # "Use --overwrite_output_dir to overcome.") + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + config_class, model_class, tokenizer_class = MODEL_CLASSES[model_args.model_type] + + # if model_args.binary_head is False: + # model_class = ErnieForMaskedLM + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + if model_args.model_name_or_path in pretrained_models_list: + logger.warning(f"Your model {model_args.model_name_or_path} is training from scratch !!!") + model_config = model_class.pretrained_init_configuration[model_args.model_name_or_path] + model_config["hidden_dropout_prob"] = model_args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = model_args.attention_probs_dropout_prob + model = model_class(config_class(**model_config)) + # model_config["enable_recompute"] = args.use_recompute + else: + logger.warning(f"Your model is continue training from {model_args.model_name_or_path}") + model = model_class.from_pretrained( + model_args.model_name_or_path, + hidden_dropout_prob=model_args.hidden_dropout_prob, + attention_probs_dropout_prob=model_args.attention_probs_dropout_prob, + ) + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = LinearAnnealingWithWarmupDecay( + training_args.learning_rate, + training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + ) + + data_file = get_train_data_file(data_args) + tokenizer = tokenizer_class.from_pretrained( + model_args.tokenizer_name_or_path, cls_token="[CLS]", bos_token="", mask_token="[MASK]", sep_token="[SEP]" + ) + + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, training_args, data_file, tokenizer, False + ) + + trainer = PretrainingTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + 
trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/t5/tests/t5_mp.py b/examples/language_model/t5/tests/t5_mp.py new file mode 100644 index 0000000000000000000000000000000000000000..aab854c3475ed2137bf25b9de9a479dff55e825e --- /dev/null +++ b/examples/language_model/t5/tests/t5_mp.py @@ -0,0 +1,101 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import tempfile + +import numpy as np +import paddle + +from paddlenlp.transformers import T5Model + +T5Model._init_weights = lambda *_: None + + +def main(): + world_size = paddle.distributed.get_world_size() + dp_degree = 2 if world_size >= 4 else 1 + tensor_parallel_degree = world_size // dp_degree + + strategy = paddle.distributed.fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": dp_degree, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + paddle.distributed.fleet.init(is_collective=True, strategy=strategy) + + hcg = paddle.distributed.fleet.get_hybrid_communicate_group() + mp_group = hcg.get_model_parallel_group() + tensor_parallel_rank = mp_group.rank + model = T5Model.from_pretrained( + "t5-small", + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + dtype="float32", + low_cpu_mem_usage=True, + ) + model.eval() + loss = model( + input_ids=paddle.arange(100, 110, dtype="int64").reshape([1, -1]), + decoder_input_ids=paddle.arange(100, 105, dtype="int64").reshape([1, -1]), + return_dict=True, + ) + ret = loss.last_hidden_state.abs().mean().item() + np.testing.assert_allclose(ret, 0.136544, rtol=1e-4) + + with tempfile.TemporaryDirectory() as tempdir: + model.save_pretrained(save_dir=tempdir, merge_tensor_parallel=False) + paddle.distributed.barrier() + load_model = T5Model.from_pretrained( + tempdir, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + dtype="float32", + low_cpu_mem_usage=True, + ) + load_model.eval() + loss = load_model( + input_ids=paddle.arange(100, 110, dtype="int64").reshape([1, -1]), + decoder_input_ids=paddle.arange(100, 105, dtype="int64").reshape([1, -1]), + return_dict=True, + ) + ret = loss.last_hidden_state.abs().mean().item() + np.testing.assert_allclose(ret, 0.136544, rtol=1e-4) + + with tempfile.TemporaryDirectory() as tempdir: + object_list = [] + paddle.distributed.all_gather_object(object_list, tempdir, group=mp_group) + tempdir = object_list[0] + model.save_pretrained(save_dir=tempdir, merge_tensor_parallel=True) + paddle.distributed.barrier() + load_model = T5Model.from_pretrained( + tempdir, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + dtype="float32", + low_cpu_mem_usage=True, 
+ ) + load_model.eval() + loss = load_model( + input_ids=paddle.arange(100, 110, dtype="int64").reshape([1, -1]), + decoder_input_ids=paddle.arange(100, 105, dtype="int64").reshape([1, -1]), + return_dict=True, + ) + ret = loss.last_hidden_state.abs().mean().item() + np.testing.assert_allclose(ret, 0.136544, rtol=1e-4) + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/t5/tests/test_parallel_dygraph_dataparallel.py b/examples/language_model/t5/tests/test_parallel_dygraph_dataparallel.py new file mode 100644 index 0000000000000000000000000000000000000000..3cff3233848f3eaec506c346525b0bbe60ac16c2 --- /dev/null +++ b/examples/language_model/t5/tests/test_parallel_dygraph_dataparallel.py @@ -0,0 +1,227 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# + +import copy +import os +import subprocess +import time +import unittest + +import paddle +from paddle.distributed.utils.launch_utils import ( + TrainerProc, + find_free_ports, + get_cluster, + watch_local_trainers, +) + + +def get_cluster_from_args(selected_gpus): + cluster_node_ips = "127.0.0.1" + node_ip = "127.0.0.1" + + node_ips = [x.strip() for x in cluster_node_ips.split(",")] + + node_ips.index(node_ip) + + free_ports = None + + free_ports = find_free_ports(len(selected_gpus)) + if free_ports is not None: + free_ports = list(free_ports) + + trainer_endpoints = [] + for ip in node_ips: + trainer_endpoints.append(["%s:%d" % (ip, port) for port in free_ports]) + return get_cluster(node_ips, node_ip, trainer_endpoints, selected_gpus) + + +def get_gpus(selected_gpus): + selected_gpus = [x.strip() for x in selected_gpus.split(",")] + return selected_gpus + + +def start_local_trainers_cpu(trainer_endpoints, training_script, training_script_args, log_dir=None): + current_env = copy.copy(os.environ.copy()) + current_env.pop("http_proxy", None) + current_env.pop("https_proxy", None) + + procs = [] + n_rank = len(trainer_endpoints) + print(trainer_endpoints) + for rank_id, endpoint in enumerate(trainer_endpoints): + proc_env = { + "PADDLE_DISTRI_BACKEND": "gloo", + "PADDLE_TRAINER_ID": "%d" % rank_id, + "PADDLE_CURRENT_ENDPOINT": "%s" % endpoint, + "PADDLE_TRAINERS_NUM": "%d" % n_rank, + "PADDLE_TRAINER_ENDPOINTS": ",".join(trainer_endpoints), + } + + current_env.update(proc_env) + + print("trainer proc env:{}".format(current_env)) + + assert os.getenv("WITH_COVERAGE", "OFF") == "OFF", "Gloo don't support WITH_COVERAGE." 
+ cmd = "python -u " + training_script + + print("start trainer proc:{} env:{}".format(cmd, proc_env)) + + fn = None + + proc = subprocess.Popen(cmd.split(" "), env=current_env) + + tp = TrainerProc() + tp.proc = proc + tp.rank = rank_id + tp.log_fn = fn + tp.cmd = cmd + + procs.append(tp) + + return procs + + +def start_local_trainers( + cluster, + pod, + training_script, + training_script_args, + eager_mode=True, + allocator_strategy="auto_growth", + log_dir=None, + without_http_proxy=True, +): + current_env = copy.copy(os.environ.copy()) + # paddle broadcast ncclUniqueId use socket, and + # proxy maybe make trainers unreachable, so delete them. + # if we set them to "", grpc will log error message "bad uri" + # so just delete them. + + # current_env.pop("http_proxy", None) + # current_env.pop("https_proxy", None) + + procs = [] + for t in pod.trainers: + proc_env = { + "FLAGS_selected_gpus": "%s" % ",".join([str(g) for g in t.gpus]), + "PADDLE_TRAINER_ID": "%d" % t.rank, + "PADDLE_CURRENT_ENDPOINT": "%s" % t.endpoint, + "PADDLE_TRAINERS_NUM": "%d" % cluster.trainers_nranks(), + "PADDLE_TRAINER_ENDPOINTS": ",".join(cluster.trainers_endpoints()), + } + + proc_env["FLAGS_allocator_strategy"] = allocator_strategy + if allocator_strategy == "auto_growth": + proc_env["FLAGS_fraction_of_gpu_memory_to_use"] = "0.1" + + current_env.update(proc_env) + + print("trainer proc env:{}".format(current_env)) + + if os.getenv("WITH_COVERAGE", "OFF") == "ON": + cmd = "python -m coverage run --branch -p " + training_script + else: + cmd = "python -u " + training_script + + print("start trainer proc:{} env:{}".format(cmd, proc_env)) + + fn = None + + proc = subprocess.Popen(cmd.split(" "), env=current_env) + + tp = TrainerProc() + tp.proc = proc + tp.rank = t.rank + tp.log_fn = fn + tp.cmd = cmd + + procs.append(tp) + + return procs + + +class TestMultipleGpus(unittest.TestCase): + def setUp(self): + self.selected_gpus = get_gpus("0,1") + + def run_1gpu(self, *args, **kwargs): + self.selected_gpus = get_gpus("0") + self.run_n_gpu(*args, **kwargs) + + def run_2gpu(self, *args, **kwargs): + self.selected_gpus = get_gpus("0,1") + self.run_n_gpu(*args, **kwargs) + + def run_4gpu(self, *args, **kwargs): + self.selected_gpus = get_gpus("0,1,2,3") + self.run_n_gpu(*args, **kwargs) + + def run_8gpu(self, *args, **kwargs): + self.selected_gpus = get_gpus("0,1,2,3,4,5,6,7") + self.run_n_gpu(*args, **kwargs) + + def run_n_gpu( + self, + target_file_name, + eager_mode=True, + allocator_strategy="auto_growth", + ): + if not paddle.framework.core.is_compiled_with_cuda() or paddle.framework.core.get_cuda_device_count() == 0: + return + + # selected_gpus = get_gpus("0,1") + cluster = None + pod = None + + cluster, pod = get_cluster_from_args(self.selected_gpus) + + procs = start_local_trainers( + cluster, + pod, + eager_mode=eager_mode, + allocator_strategy=allocator_strategy, + training_script=target_file_name, + training_script_args=[], + ) + + while True: + alive = watch_local_trainers(procs, cluster.trainers_endpoints()) + + if not alive: + print("Local procs complete, POD info:{}".format(pod)) + break + time.sleep(3) + + +class TestMultipleWithGloo(unittest.TestCase): + def run_2cpu(self, target_file_name): + + cluster, pod = get_cluster_from_args([0, 1]) # tmp use. 
for getting trainer_nranks() + + procs = start_local_trainers_cpu( + cluster.trainers_endpoints(), + training_script=target_file_name, + training_script_args=[], + ) + + while True: + alive = watch_local_trainers(procs, cluster.trainers_nranks()) + + if not alive: + print("Local procs complete, POD info:{}".format(pod)) + break + time.sleep(3) diff --git a/examples/language_model/t5/tests/test_t5_mp.py b/examples/language_model/t5/tests/test_t5_mp.py new file mode 100644 index 0000000000000000000000000000000000000000..00777f916126ec81553220cc49a2d138c9b62f17 --- /dev/null +++ b/examples/language_model/t5/tests/test_t5_mp.py @@ -0,0 +1,122 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# import sys +import unittest + +import numpy as np +import paddle +import torch +from test_parallel_dygraph_dataparallel import TestMultipleGpus + +import paddlenlp + + +def load_torch(path, *args, **kwargs): + import torch + + state = torch.load(path, map_location="cpu") + for key in list(state.keys()): + v = state.pop(key) + state[key] = v.numpy() + return state + + +# hack load torch, it has problem to load torch ckpt. +paddlenlp.utils.serialization.load_torch = load_torch +paddlenlp.transformers.conversion_utils.load_torch = load_torch + + +class TestT5(unittest.TestCase): + def testTorchT5(self): + from transformers import AutoModel + + model = AutoModel.from_pretrained("t5-small", trust_remote_code=True) + model.eval() + loss = model( + input_ids=torch.arange(100, 110, dtype=torch.long).reshape(1, -1), + decoder_input_ids=torch.arange(100, 105, dtype=torch.long).reshape(1, -1), + ) + ret = loss.last_hidden_state.abs().mean().item() + # Torch T5 has bug in GELU activation + np.testing.assert_allclose(ret, 0.1365441530942917, rtol=1e-7) + + def testConvertedPaddleT5(self): + from paddlenlp.transformers import AutoModel + + model = AutoModel.from_pretrained("t5-small", from_hf_hub=True) + model.eval() + loss = model( + input_ids=paddle.arange(100, 110, dtype="int64").reshape([1, -1]), + decoder_input_ids=paddle.arange(100, 105, dtype="int64").reshape([1, -1]), + return_dict=True, + ) + ret = loss.last_hidden_state.abs().mean().item() + np.testing.assert_allclose(ret, 0.1365441381931305, rtol=1e-7) + + @unittest.skip("Skip export!") + def testPaddleT5(self): + from paddlenlp.transformers import T5Model + + model = T5Model.from_pretrained("t5-small", dtype="float32") + model.eval() + loss = model( + input_ids=paddle.arange(100, 110, dtype="int64").reshape([1, -1]), + decoder_input_ids=paddle.arange(100, 105, dtype="int64").reshape([1, -1]), + return_dict=True, + ) + ret = loss.last_hidden_state.abs().mean().item() + np.testing.assert_allclose(ret, 0.1365441381931305, rtol=1e-7) + + # # dy2static + # input_spec = [ + # paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + # paddle.static.InputSpec(shape=[None, 2, None], dtype="int64"), # pos_ids + # paddle.static.InputSpec(shape=[None, None, None, None], dtype="int64"), # 
attn_ids + # ] + # with tempfile.TemporaryDirectory() as tempdir: + # paddlenlp.transformers.export_model( + # model=model, + # input_spec=input_spec, + # path=tempdir, + # ) + + # TODO: support @ decorate for multi-gpus tests + @unittest.skip("Skip for reuqired multi-gpus!") + def testPaddleTensorParallelT5(self): + """_summary_""" + from modeling import T5Model as AutoModel + + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = paddle.distributed.get_rank() + strategy = paddle.distributed.fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + paddle.distributed.fleet.init(is_collective=True, strategy=strategy) + model = AutoModel.from_pretrained( + "t5-small", + from_hf=True, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + ) + model.eval() + + +class TestT5TensorParallel(TestMultipleGpus): + def testPaddleTensorParallelT5(self): + self.run_4gpu("t5_mp.py") diff --git a/examples/language_model/t5/utils.py b/examples/language_model/t5/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..303f70af2a71d0aad8a773ea0a961034fe19e126 --- /dev/null +++ b/examples/language_model/t5/utils.py @@ -0,0 +1,162 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
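# --- Editorial note (not part of the original patch) ---
# A minimal, commented-out usage sketch of the helpers defined below in this
# module; it is kept as comments because the definitions only appear further
# down the file. The task name "mrpc", the toy labels and the step counts are
# illustrative assumptions.
#
#   from utils import GLUE_METRICS, get_scheduler
#
#   # Each GLUE task maps to a list of metric functions taking (targets, predictions).
#   results = {}
#   for metric_fn in GLUE_METRICS["mrpc"]:          # [f1_score_with_invalid, accuracy]
#       results.update(metric_fn([1, 0, 1, 1], [1, 0, 0, 1]))
#   # -> {"f1": ..., "accuracy": ...}, both reported on a 0-100 scale
#
#   # Build a warmup + decay LR scheduler by name ("linear", "cosine" or "poly").
#   lr_scheduler = get_scheduler(
#       learning_rate=3e-5,
#       scheduler_type="linear",
#       num_warmup_steps=100,
#       num_training_steps=1000,
#   )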
+ +import collections +import json +import pickle +import random + +import numpy as np +import paddle +import sklearn +from scipy.stats import pearsonr, spearmanr +from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef + +from paddlenlp.transformers import ( + CosineDecayWithWarmup, + LinearDecayWithWarmup, + PolyDecayWithWarmup, +) + + +def accuracy(targets, predictions): + return {"accuracy": 100 * accuracy_score(targets, predictions)} + + +def sklearn_metrics_wrapper(metric_str, metric_dict_str=None, metric_post_process_fn=None, **metric_fn_kwargs): + def fn(targets, predictions): + if metric_str == "matthews_corrcoef": + metric_fn = matthews_corrcoef + else: + metric_fn = getattr(sklearn.metrics, metric_str) + metric_val = metric_fn(targets, predictions, **metric_fn_kwargs) + if metric_post_process_fn is not None: + metric_val = metric_post_process_fn(metric_val) + return {metric_dict_str or metric_str: metric_val} + + return fn + + +def f1_score_with_invalid(targets, predictions): + targets, predictions = np.asarray(targets), np.asarray(predictions) + invalid_idx_mask = np.logical_and(predictions != 0, predictions != 1) + predictions[invalid_idx_mask] = 1 - targets[invalid_idx_mask] + return {"f1": 100 * f1_score(targets, predictions)} + + +def pearson_corrcoef(targets, predictions): + return {"pearson_corrcoef": 100 * pearsonr(targets, predictions)[0]} + + +def spearman_corrcoef(targets, predictions): + return {"spearman_corrcoef": 100 * spearmanr(targets, predictions)[0]} + + +CLUE_METRICS = collections.OrderedDict( + [ + ("afqmc", [accuracy]), + ("tnews", [accuracy]), + ("iflytek", [accuracy]), + ("cmnli", [accuracy]), + ("ocnli", [accuracy]), + ("cluewsc2020", [accuracy]), + ("csl", [accuracy]), + ("ax", []), # Only test set available. + ] +) + +GLUE_METRICS = collections.OrderedDict( + [ + ( + "cola", + [sklearn_metrics_wrapper("matthews_corrcoef", metric_post_process_fn=lambda x: 100 * x)], + ), + ("sst-2", [accuracy]), + ("mrpc", [f1_score_with_invalid, accuracy]), + ("sts-b", [pearson_corrcoef, spearman_corrcoef]), + ("qqp", [f1_score_with_invalid, accuracy]), + ("mnli", [accuracy]), + ("qnli", [accuracy]), + ("rte", [accuracy]), + ("wnli", [accuracy]), + ("ax", []), # Only test set available. 
+ ] +) + +scheduler_type2cls = { + "linear": LinearDecayWithWarmup, + "cosine": CosineDecayWithWarmup, + "poly": PolyDecayWithWarmup, +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def get_writer(args): + if args.writer_type == "visualdl": + from visualdl import LogWriter + + writer = LogWriter(logdir=args.logdir) + elif args.writer_type == "tensorboard": + from tensorboardX import SummaryWriter + + writer = SummaryWriter(logdir=args.logdir) + else: + raise ValueError("writer_type must be in ['visualdl', 'tensorboard']") + return writer + + +def get_scheduler( + learning_rate, + scheduler_type, + num_warmup_steps=None, + num_training_steps=None, + **scheduler_kwargs, +): + if scheduler_type not in scheduler_type2cls.keys(): + data = " ".join(scheduler_type2cls.keys()) + raise ValueError(f"scheduler_type must be choson from {data}") + + if num_warmup_steps is None: + raise ValueError("requires `num_warmup_steps`, please provide that argument.") + + if num_training_steps is None: + raise ValueError("requires `num_training_steps`, please provide that argument.") + + return scheduler_type2cls[scheduler_type]( + learning_rate=learning_rate, + total_steps=num_training_steps, + warmup=num_warmup_steps, + **scheduler_kwargs, + ) + + +def save_json(data, file_name): + with open(file_name, "w", encoding="utf-8") as w: + w.write(json.dumps(data, ensure_ascii=False, indent=4) + "\n") + + +def save_pickle(data, file_path): + with open(str(file_path), "wb") as f: + pickle.dump(data, f) + + +def load_pickle(input_file): + with open(str(input_file), "rb") as f: + data = pickle.load(f) + return data diff --git a/examples/language_model/t5/zero_shot_demo.py b/examples/language_model/t5/zero_shot_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..1f255c2585290e715e60d13ad13c0baf1df30a66 --- /dev/null +++ b/examples/language_model/t5/zero_shot_demo.py @@ -0,0 +1,210 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright (c) 2022 Langboat Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
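# --- Editorial note (not part of the original patch) ---
# The prompt builders defined below simply wrap the raw input into a
# task-specific Chinese template. A commented-out illustration, using the same
# sentences as the __main__ demo further down:
#
#   create_input_with_prompt("text_similarity", "你好，我还款银行怎么更换", "怎么更换绑定还款的卡")
#   # -> ["“你好，我还款银行怎么更换”和“怎么更换绑定还款的卡”这两句话是在说同一件事吗？"]
#
# Demo.generate() then tokenizes the prompt, calls model.generate(), decodes
# every returned sequence and keeps the most frequent answer via
# collections.Counter.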
+""" +https://github.com/Langboat/mengzi-zero-shot +""" + +import paddle +from collections import Counter +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer + + +def task_type_map(task_type): + task_map = { + "sentiment_classifier": sentiment_cls, + "news_classifier": news_cls, + "medical_domain_intent_classifier": domain_cls, + "entity_extraction": entity_extr, + "text_similarity": text_sim, + "financial_relationship_extraction": finance_extr, + "ad_generation": ad_gen, + "comment_object_extraction": com_obj_extr, + } + + return task_map[task_type] + + +def create_input_with_prompt(task_type, input_string, input_string2=None, entity1=None, entity2=None): + prompt_map = task_type_map(task_type) + + if task_type == "text_similarity": + return prompt_map(input_string, input_string2) + elif task_type == "financial_relationship_extraction": + return prompt_map(input_string, entity1, entity2) + return prompt_map(input_string) + + +def entity_extr( + s, +): + """ + dataset: CLUENER + task: 实体抽取 + output: + """ + prompts = [f"“{s}”找出上述句子中的实体和他们对应的类别"] + return prompts + + +def text_sim(s1, s2): + """ + dataset: + task: 语义相似度 + output: + """ + prompts = [f"“{s1}”和“{s2}”这两句话是在说同一件事吗?"] + return prompts + + +def finance_extr(s, e1, e2): + """ + dataset: + task: 金融关系抽取 + output: + """ + prompts = [f"“{s}”中的“{e1}”和“{e2}”是什么关系?答:"] + return prompts + + +def ad_gen(s): + """ + dataset: + task: 广告文案生成 + output: + """ + prompts = [f"请根据以下产品信息设计广告文案。商品信息:{s}"] + return prompts + + +def domain_cls(s): + """ + dataset: + task: 医学领域意图分类 + output: + """ + # dataset: quake-qic + prompts = [f"问题:“{s}”。此问题的医学意图是什么?选项:病情诊断,病因分析,治疗方案,就医建议,指标解读,疾病描述,后果表述,注意事项,功效作用,医疗费用。"] + return prompts + + +def sentiment_cls(s): + """ + dataset: eprstmt + task: 评论情感分类 + output: 消极/积极 + """ + prompts = [f"评论:{s}。请判断该条评论所属类别(积极或消极)并填至空格处。回答:"] + # f'"{s}"。 如果这个评论的作者是客观的,那么请问这个评论的内容是什么态度的回答?答:', + # f'现有机器人能判断句子是消极评论还是积极评论。已知句子:“{s}”。这个机器人将给出的答案是:' + return prompts + + +def com_obj_extr(s): + """ + dataset: + task: 评论对象抽取 + output: + """ + prompts = [f"评论:{s}.这条评论的评价对象是谁?"] + return prompts + + +def news_cls(s): + """ + dataset: tnews + task: 新闻分类 + output: + """ + label_list = ["故事", "文化", "娱乐", "体育", "财经", "房产", "汽车", "教育", "科技", "军事", "旅游", "国际", "股票", "农业", "电竞"] + + prompts = [ + f'“{s}”是什么新闻频道写的?选项:{",".join(label_list)}。答:', + ] + # f'这条新闻是关于什么主题的?新闻:{s}。选项:{",".join(label_list)}。答:', + # f'这是关于“{",".join(label_list)}”中哪个选项的文章?文章:{s}。 答:'] + return prompts + + +class Demo: + def __init__(self, model_name_or_path="Langboat/mengzi-t5-base-mt", max_predict_len=512): + self.tokenizer = T5Tokenizer.from_pretrained(model_name_or_path) + print("Loading the model parameters, please wait...") + self.model = T5ForConditionalGeneration.from_pretrained(model_name_or_path) + self.model.eval() + self.max_predict_len = max_predict_len + print("Model loaded.") + + def token_decode(self, s): + return self.tokenizer.decode(s, skip_special_tokens=True) + + def pick_most_common(self, x): + return Counter(x).most_common(1)[0][0] + + @paddle.no_grad() + def generate(self, task_type, input_string, input_string2=None, entity1=None, entity2=None, max_predict_len=None): + max_predict_len = max_predict_len if max_predict_len is not None else self.max_predict_len + + input_text = create_input_with_prompt(task_type, input_string, input_string2, entity1, entity2) + # tokenize + encodings = self.tokenizer(input_text, max_seq_len=512) + encodings = {k: paddle.to_tensor(v) for k, v in encodings.items()} + outputs = 
self.model.generate(**encodings, max_length=max_predict_len)[0] + dec_out = list(map(self.token_decode, outputs)) + output = self.pick_most_common(dec_out) + print("input_text:", input_text[0]) + print("output:", output) + print("=" * 50) + return output + + +if __name__ == "__main__": + + demo = Demo(model_name_or_path="Langboat/mengzi-t5-base-mt") + # (1) 实体抽取 + demo.generate(task_type="entity_extraction", input_string="导致泗水的砭石受到追捧,价格突然上涨。而泗水县文化市场综合执法局颜鲲表示,根据监控") + # 泗水:地址,泗水县文化市场综合执法局:政府,颜鲲:姓名 + + # (2) 语义相似度 + demo.generate(task_type="text_similarity", input_string="你好,我还款银行怎么更换", input_string2="怎么更换绑定还款的卡") + # 是 + + # (3) 金融关系抽取 + demo.generate( + task_type="financial_relationship_extraction", + input_string="为打消市场顾虑,工行两位洋股东——美国运通和安联集团昨晚做出承诺,近期不会减持工行H股。", + entity1="工行", + entity2="美国运通", + ) + # 被持股 + + # (4) 广告文案生成 + demo.generate(task_type="ad_generation", input_string="类型-裤,版型-宽松,风格-潮,风格-复古,风格-文艺,图案-复古,裤型-直筒裤,裤腰型-高腰,裤口-毛边") + # 这款牛仔裤采用高腰直筒的版型设计,搭配宽松的裤型,穿着舒适又显潮流感。而裤脚的毛边设计,增添几分复古文艺的气息。 + + # (5) 医学领域意图分类 + demo.generate(task_type="medical_domain_intent_classifier", input_string="呼气试验阳性什么意思") + # 指标解读 + + # (6) 情感分类 + demo.generate(task_type="sentiment_classifier", input_string="房间很一般,小,且让人感觉脏,隔音效果差,能听到走廊的人讲话,走廊光线昏暗,旁边没有什么可吃") + # 消极 + + # (7) 评论对象抽取 + demo.generate(task_type="comment_object_extraction", input_string="灵水的水质清澈,建议带个浮潜装备,可以看清湖里的小鱼。") + # 灵水 + + # (8) 新闻分类 + demo.generate(task_type="news_classifier", input_string="懒人适合种的果树:长得多、好打理,果子多得都得送邻居吃") + # 农业 diff --git a/examples/language_model/transformer-xl/README.md b/examples/language_model/transformer-xl/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9f6f7dec8da57b23670c4b085fb7c29b9df41aed --- /dev/null +++ b/examples/language_model/transformer-xl/README.md @@ -0,0 +1,114 @@ +# Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context + +以下是本例的简要目录结构及说明: + +```text +. +├── configs/ # 配置文件 +├── eval.py # 预测脚本 +├── gen_data.sh # 数据下载脚本 +├── mem_transformer.py # 模型组网 +├── reader.py # 数据读取接口 +├── README.md # 文档 +├── train.py # 训练脚本 +└── utils/ # 数据处理工具 +``` + +## 模型简介 + +本项目是语言模型 Transformer-XL 的 PaddlePaddle 实现, 包含模型训练,预测等内容。 + + +## 快速开始 + +### 环境依赖 + +- attrdict +- pyyaml + +安装命令 `pip install attrdict pyyaml` + +### 数据准备 + +公开数据集:enwik8、text8、wt103 多用于语言模型的 benchmark 测试。输出获取与处理方式如下: + +```shell +bash gen_data.sh +``` + +会在当前路径下的 ./gen_data/ 路径下生成我们需要的数据。 + +### 单机训练 + +#### 单机单卡 + +以提供的 enwik8 数据为例,可以执行以下命令进行模型训练: + +``` sh +# setting visible devices for training +export CUDA_VISIBLE_DEVICES=0 +python train.py --config ./configs/enwik8.yaml +``` + +可以在 enwik8.yaml 文件中设置相应的参数,比如 `batch_size`、`epoch` 等。 + +如果要更换成 wt103 数据集进行训练,可以在执行的时候通过 `--config` 指定对应的配置文件即可。 + +``` sh +# setting visible devices for training +export CUDA_VISIBLE_DEVICES=0 +python train.py --config ./configs/wt103.yaml +``` + +#### 使用 CPU 进行训练 + +如果要使用 CPU 进行训练,可以修改 `configs/` 路径下,对应的配置文件中的 `use_gpu` 配置为 `False`,用相同的方式启动训练即可使用 CPU 进行训练。 + +``` sh +python train.py --config ./configs/enwik8.yaml +``` + +### 单机多卡 + +同样,可以执行如下命令实现八卡训练: + +``` sh +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" train.py --config ./configs/enwik8.yaml +``` + +### 恢复训练 + +若需要从之前的 checkpoint 开始继续训练,可以设置 `configs/` 路径中对应的配置文件中的参数 `init_from_checkpoint` 可载入之前的 checkpoint(包括 optimizer 的信息)继续训练。指定的方式是,指定到模型的 checkpoint 保存的路径。比如,指定成 `./trained_models/step_final/`,该路径下的目录结构如下: + +```text +. 
+├── mem_transformer.pdopt # 存储的优化器相关信息 +└── mem_transformer.pdparams # 存储模型参数相关信息 +``` + +若只是从之前训练的参数开始重新训练,无需载入 optimizer 信息,可以设置对应的配置文件中的参数 `init_from_pretrain_model` 可载入指定的参数,从头开始训练。指定的方式也是类似,指定到模型保存的参数文件 `mem_transformer.pdparams` 的路径,比如 `./trained_models/step_final/`。 + +### 模型推断 + +以 enwik8 数据为例,模型训练完成后可以执行以下命令可以进行预测: + +``` sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +python eval.py --config ./configs/enwik8.yaml +``` + +同理,可以通过指定 `--config` 选项来选择需要的数据集对应的配置文件。 + +``` sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +python eval.py --config ./configs/wt103.yaml +``` + +完成推断之后,会将显示在验证集和测试集上的 loss 的结果。 + +## 参考文献 + +[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](http://arxiv.org/abs/1901.02860) diff --git a/examples/language_model/transformer-xl/configs/enwik8.yaml b/examples/language_model/transformer-xl/configs/enwik8.yaml new file mode 100644 index 0000000000000000000000000000000000000000..12b3ef5cff007ef47dd4fe75e713915b0a81187a --- /dev/null +++ b/examples/language_model/transformer-xl/configs/enwik8.yaml @@ -0,0 +1,112 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# Path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# Path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# Path of trained parameter, to make prediction +init_from_params: "./trained_models/step_final/" +# The directory for saving model +save_model: "trained_models" +# The directory for saving inference model. +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The path to data files +data: "./gen_data/enwik8/" +# The name of dataset +dataset: "enwik8" + +# Whether to use cuda +use_gpu: True + +# Args for reader, see reader.py for details +token_delimiter: None +batch_size: 16 +eval_batch_size: 10 + +# Hyparams for training: +# The number of epoches for training +epoch: 200 +# Max step for training. +max_step: 400000 + +# The hyper parameters for optimizer. +# Type of ptimizer. +optim: adam +# Learning rate schedule. +scheduler: cosine +# This static learning_rate will be applied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 0.00025 +# The hyper parameters for Adam optimizer. +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# The hyper parameters for Momentum optimizer. +mom: 0.0 +# Global gradient clip. +clip: 0.25 +# The parameters for learning rate scheduling. +warmup_steps: 0 +# The parameters for CosineAnnealingDecay. Minimum learning rate. +eta_min: 0.0 +# The parameters for ReduceLROnPlateau. +# The Ratio that the learning rate will be reduced. +decay_rate: 0.5 +# When loss doesn’t improve for this number of epochs, learing rate will be reduced. +patience: 0 +# The lower bound of the learning rate after reduction. +min_lr: 0.0 + +# Hyparams for model: +# Whe use adaptive softmax. +adaptive: False +# Size of dictionary. This can be obtained automatically. +ntokens: 10000 +# The dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# Dimension of heads. +d_head: 64 +# Size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# Number of head used in multi-head attention. 
+n_head: 8 +# Number of sub-layers to be stacked in the encoder and decoder. +n_layer: 12 +# Dropout rates. +dropout: 0.1 +# Attention dropout +attn_dropout: 0.0 +# Attention type for decoder. +# 0 for relative partial MHA (in Transformer-XL). +# 1 for relative MHA (in Shaw et al). +attn_type: 0 +# Apply layer normalization before or after sublayers. +normalize_before: False +# Whether to tie weight or not. +tie_weight: True +# The length of the extended context. +ext_len: 0 +# The divident value for softmax and adapative input. +div_val: 1 +# Target length. The number of tokens to predict. +tgt_len: 512 +# Memory length. The length of the retained previous heads. +mem_len: 512 +# Use the same attention length for all tokens. +same_length: False +# Use the same positional encoding after clamp len. +clamp_len: -1 +# The number of samples in sample softmax. -1 means do not use sampled softmax. +sample_softmax: -1 +# Target length for evaluation. That is, the number of tokens to predict for evaluation. +eval_tgt_len: 128 +# What kind of mode for evaluation. valid, test or both("all"). +mode: "all" +# Maximum evaluation step. +max_eval_steps: -1 diff --git a/examples/language_model/transformer-xl/configs/text8.yaml b/examples/language_model/transformer-xl/configs/text8.yaml new file mode 100644 index 0000000000000000000000000000000000000000..5e1353a1d05040e7fe23ee60734510e92b3bb2e6 --- /dev/null +++ b/examples/language_model/transformer-xl/configs/text8.yaml @@ -0,0 +1,112 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# Path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# Path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# Path of trained parameter, to make prediction +init_from_params: "./trained_models/step_final/" +# The directory for saving model +save_model: "trained_models" +# The directory for saving inference model. +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The path to data files +data: "./gen_data/text8/" +# The name of dataset +dataset: "text8" + +# Whether to use cuda +use_gpu: True + +# Args for reader, see reader.py for details +token_delimiter: None +batch_size: 15 +eval_batch_size: 5 + +# Hyparams for training: +# The number of epoches for training +epoch: 200 +# Max step for training. +max_step: 400000 + +# The hyper parameters for optimizer. +# Type of ptimizer. +optim: adam +# Learning rate schedule. +scheduler: cosine +# This static learning_rate will be applied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 0.00025 +# The hyper parameters for Adam optimizer. +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# The hyper parameters for Momentum optimizer. +mom: 0.0 +# Global gradient clip. +clip: 0.25 +# The parameters for learning rate scheduling. +warmup_steps: 0 +# The parameters for CosineAnnealingDecay. Minimum learning rate. +eta_min: 0.0 +# The parameters for ReduceLROnPlateau. +# The Ratio that the learning rate will be reduced. +decay_rate: 0.5 +# When loss doesn’t improve for this number of epochs, learing rate will be reduced. +patience: 0 +# The lower bound of the learning rate after reduction. +min_lr: 0.0 + +# Hyparams for model: +# Whe use adaptive softmax. +adaptive: False +# Size of dictionary. This can be obtained automatically. 
+ntokens: 10000 +# The dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# Dimension of heads. +d_head: 64 +# Size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# Number of head used in multi-head attention. +n_head: 8 +# Number of sub-layers to be stacked in the encoder and decoder. +n_layer: 12 +# Dropout rates. +dropout: 0.1 +# Attention dropout +attn_dropout: 0.0 +# Attention type for decoder. +# 0 for relative partial MHA (in Transformer-XL). +# 1 for relative MHA (in Shaw et al). +attn_type: 0 +# Apply layer normalization before or after sublayers. +normalize_before: False +# Whether to tie weight or not. +tie_weight: True +# The length of the extended context. +ext_len: 0 +# The divident value for softmax and adapative input. +div_val: 1 +# Target length. The number of tokens to predict. +tgt_len: 512 +# Memory length. The length of the retained previous heads. +mem_len: 512 +# Use the same attention length for all tokens. +same_length: False +# Use the same positional encoding after clamp len. +clamp_len: -1 +# The number of samples in sample softmax. -1 means do not use sampled softmax. +sample_softmax: -1 +# Target length for evaluation. That is, the number of tokens to predict for evaluation. +eval_tgt_len: 128 +# What kind of mode for evaluation. valid, test or both("all"). +mode: "all" +# Maximum evaluation step. +max_eval_steps: -1 diff --git a/examples/language_model/transformer-xl/configs/wt103.yaml b/examples/language_model/transformer-xl/configs/wt103.yaml new file mode 100644 index 0000000000000000000000000000000000000000..99fec78d1494686ffc93ee4b123dbaafbd881625 --- /dev/null +++ b/examples/language_model/transformer-xl/configs/wt103.yaml @@ -0,0 +1,112 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# Path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# Path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# Path of trained parameter, to make prediction +init_from_params: "./trained_models/step_final/" +# The directory for saving model +save_model: "trained_models" +# The directory for saving inference model. +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The path to data files +data: "./gen_data/wikitext-103/" +# The name of dataset +dataset: "wt103" + +# Whether to use cuda +use_gpu: True + +# Args for reader, see reader.py for details +token_delimiter: None +batch_size: 32 +eval_batch_size: 10 + +# Hyparams for training: +# The number of epoches for training +epoch: 200 +# Max step for training. +max_step: 200000 + +# The hyper parameters for optimizer. +# Type of ptimizer. +optim: adam +# Learning rate schedule. +scheduler: cosine +# This static learning_rate will be applied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 0.00025 +# The hyper parameters for Adam optimizer. +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# The hyper parameters for Momentum optimizer. +mom: 0.0 +# Global gradient clip. +clip: 0.25 +# The parameters for learning rate scheduling. +warmup_steps: 0 +# The parameters for CosineAnnealingDecay. Minimum learning rate. +eta_min: 0.0 +# The parameters for ReduceLROnPlateau. 
+# The Ratio that the learning rate will be reduced. +decay_rate: 0.5 +# When loss doesn’t improve for this number of epochs, learing rate will be reduced. +patience: 0 +# The lower bound of the learning rate after reduction. +min_lr: 0.0 + +# Hyparams for model: +# Whe use adaptive softmax. +adaptive: True +# Size of dictionary. This can be obtained automatically. +ntokens: 10000 +# The dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 410 +# Dimension of heads. +d_head: 41 +# Size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2100 +# Number of head used in multi-head attention. +n_head: 10 +# Number of sub-layers to be stacked in the encoder and decoder. +n_layer: 16 +# Dropout rates. +dropout: 0.1 +# Attention dropout +attn_dropout: 0.0 +# Attention type for decoder. +# 0 for relative partial MHA (in Transformer-XL). +# 1 for relative MHA (in Shaw et al). +attn_type: 0 +# Apply layer normalization before or after sublayers. +normalize_before: False +# Whether to tie weight or not. +tie_weight: True +# The length of the extended context. +ext_len: 0 +# The divident value for softmax and adapative input. +div_val: 1 +# Target length. The number of tokens to predict. +tgt_len: 150 +# Memory length. The length of the retained previous heads. +mem_len: 150 +# Target length for evaluation. That is, the number of tokens to predict for evaluation. +eval_tgt_len: 150 +# Use the same attention length for all tokens. +same_length: False +# Use the same positional encoding after clamp len. +clamp_len: -1 +# The number of samples in sample softmax. -1 means do not use sampled softmax. +sample_softmax: -1 +# What kind of mode for evaluation. valid, test or both("all"). +mode: "all" +# Maximum evaluation step. +max_eval_steps: -1 diff --git a/examples/language_model/transformer-xl/eval.py b/examples/language_model/transformer-xl/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..de13d17fd9dcc0469b85164f70f22333192ecf52 --- /dev/null +++ b/examples/language_model/transformer-xl/eval.py @@ -0,0 +1,142 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import os +from pprint import pprint + +import numpy as np +import paddle +import yaml +from attrdict import AttrDict +from mem_transformer import MemTransformerLM +from reader import get_lm_data_loader, get_lm_vocab + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="./configs/enwik8.yaml", type=str, help="Path of the config file. 
") + args = parser.parse_args() + return args + + +def do_eval(args): + assert args.ext_len >= 0, "Extended context length must be no less than 0" + + def _evaluate(loader): + total_len, total_loss = 0, 0.0 + + eval_mems = tuple() + for i, (src, target, seq_len) in enumerate(loader): + if args.max_eval_steps > 0 and i >= args.max_eval_steps: + break + ret = mem_transformer(src, target, *eval_mems) + loss, eval_mems = ret[0], ret[1:] + eval_cur_loss = seq_len * loss.numpy() + total_loss += eval_cur_loss + total_len += seq_len + return total_loss / total_len + + def _logger(loss): + if args.dataset in ["enwik8", "text8"]: + logger_info = "loss: %f, bpc: %f" % (loss, loss / np.log(2)) + else: + logger_info = "loss: %f, ppl: %.2f" % (loss, np.exp(loss)) + return logger_info + + if not args.use_gpu: + paddle.set_device("cpu") + + vocab = get_lm_vocab(args) + eval_loader = get_lm_data_loader(args, vocab, "valid") + test_loader = get_lm_data_loader(args, vocab, "test") + + cutoffs, tie_projs = [], [False] + if args.adaptive: + assert args.dataset in ["wt103", "lm1b"] + if args.dataset == "wt103": + cutoffs = [20000, 40000, 200000] + tie_projs += [True] * len(cutoffs) + elif args.dataset == "lm1b": + cutoffs = [60000, 100000, 640000] + tie_projs += [False] * len(cutoffs) + + mem_transformer = MemTransformerLM( + args.ntokens, + args.n_layer, + args.n_head, + args.d_model, + args.d_head, + args.d_inner_hid, + args.dropout, + args.attn_dropout, + tie_weight=args.tie_weight, + d_embed=args.d_model, + div_val=args.div_val, + tie_projs=tie_projs, + normalize_before=args.normalize_before, + tgt_len=args.tgt_len, + ext_len=args.ext_len, + mem_len=args.mem_len, + cutoffs=cutoffs, + same_length=args.same_length, + attn_type=args.attn_type, + clamp_len=args.clamp_len, + sample_softmax=args.sample_softmax, + ) + + assert args.init_from_params, "Please set init_from_params to load the infer model." + + model_dict = paddle.load(os.path.join(args.init_from_params, "mem_transformer.pdparams")) + mem_transformer.load_dict(model_dict) + + logger.info( + "Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}".format( + args.eval_batch_size, args.tgt_len, args.ext_len, args.mem_len, args.clamp_len + ) + ) + + mem_transformer.reset_length(args.tgt_len, args.ext_len, args.mem_len) + + test_loss = None + valid_loss = None + if args.mode == "all": + test_loss = _evaluate(test_loader) + valid_loss = _evaluate(eval_loader) + elif args.mode == "valid": + valid_loss = _evaluate(eval_loader) + elif args.mode == "test": + test_loss = _evaluate(test_loader) + + logger_info = "" + if valid_loss is not None: + logger_info = logger_info + "validation loss: " + _logger(valid_loss) + " | " + if test_loss is not None: + logger_info = logger_info + "test loss: " + _logger(test_loss) + " | " + logger.info(logger_info) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + + do_eval(args) diff --git a/examples/language_model/transformer-xl/gen_data.sh b/examples/language_model/transformer-xl/gen_data.sh new file mode 100644 index 0000000000000000000000000000000000000000..865a8a5835893a8626fee58a9324d5e94c30f1fd --- /dev/null +++ b/examples/language_model/transformer-xl/gen_data.sh @@ -0,0 +1,55 @@ +echo "Downloading dataset..." + +CUR_DIR=$PWD + +mkdir -p gen_data +cd ./gen_data/ + +if [ ! -d "wikitext-103" ]; then + echo "Downloading wikitext-103..." 
+ wget -O wikitext-103-v1.zip https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip + echo "Unzip wikitext-103..." + unzip wikitext-103-v1.zip + cd wikitext-103 + # Rename + mv wiki.train.tokens train.txt + mv wiki.valid.tokens valid.txt + mv wiki.test.tokens test.txt + cd - +fi + +if [ ! -d 'enwik8' ]; then + mkdir -p enwik8 + cd enwik8 + echo "Downloading enwik8..." + wget -O enwik8.zip http://mattmahoney.net/dc/enwik8.zip + wget -O prep_enwik8.py https://raw.githubusercontent.com/salesforce/awd-lstm-lm/master/data/enwik8/prep_enwik8.py + python3 prep_enwik8.py + rm -f prep_enwik8.py + cd - +fi + +if [ ! -d 'text8' ]; then + mkdir -p text8 + cd text8 + echo "Downloading text8..." + wget -O text8.zip http://mattmahoney.net/dc/text8.zip + python ${CUR_DIR}/utils/preprocess_text8.py 5000000 + cd - +fi + +if [ ! -d 'one-billion-words' ]; then + mkdir -p one-billion-words + cd one-billion-words + echo "Downloading one-billion-words..." + wget -O 1-billion-word-language-modeling-benchmark-r13output.tar.gz http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz + tar xzf 1-billion-word-language-modeling-benchmark-r13output.tar.gz + + dir="./1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/" + cat ${dir}/news.en.heldout-00000-of-00050 > valid.txt + cat ${dir}/news.en.heldout-00000-of-00050 > test.txt + wget -O 1b_word_vocab.txt https://github.com/rafaljozefowicz/lm/raw/master/1b_word_vocab.txt + cd - +fi + +echo "All done. " diff --git a/examples/language_model/transformer-xl/mem_transformer.py b/examples/language_model/transformer-xl/mem_transformer.py new file mode 100644 index 0000000000000000000000000000000000000000..122d4e1cc3f23226eaa66f9f13982ebd3328ae6a --- /dev/null +++ b/examples/language_model/transformer-xl/mem_transformer.py @@ -0,0 +1,1031 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
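# --- Editorial note (not part of the original patch) ---
# A commented-out sketch of how the sinusoidal PositionEmbedding defined below
# produces the position encodings consumed by the attention layers. The sizes
# are taken from configs/enwik8.yaml in this change (d_model=512, tgt_len=512,
# mem_len=512, batch_size=16); the descending position sequence is the usual
# Transformer-XL convention for relative positions and is an assumption here,
# since the full model wiring appears later in this file.
#
#   import paddle
#   d_model, tgt_len, mem_len, bsz = 512, 512, 512, 16
#   klen = tgt_len + mem_len
#   pos_seq = paddle.arange(klen - 1, -1, -1.0, dtype="float32")  # klen-1, ..., 1, 0
#   pos_emb = PositionEmbedding(d_model)(pos_seq, bsz=bsz)        # shape: [bsz, klen, d_model]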
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +global_dtype = paddle.get_default_dtype() + + +def sample_logits(embedding, bias, labels, inputs, sampler): + true_log_probs, samp_log_probs, neg_samples = sampler.sample(labels) + n_sample = neg_samples.shape[0] + b1, b2 = labels.shape[0], labels.shape[1] + all_ids = paddle.concat([paddle.reshape(labels, shape=[-1]), neg_samples]) + all_w = embedding(all_ids) + true_w = paddle.reshape(all_w[:-n_sample], shape=[b1, b2, -1]) + sample_w = paddle.reshape(all_w[-n_sample:], shape=[n_sample, -1]) + + all_b = paddle.gather(bias, all_ids) + true_b = paddle.reshape(all_b[:-n_sample], shape=[b1, b2]) + sample_b = all_b[-n_sample:] + + hit = paddle.cast((labels.unsqueeze([2]) == neg_samples), dtype=global_dtype).detach() + true_logits = paddle.sum(true_w * inputs, axis=-1) + true_b - true_log_probs + sample_logits = ( + paddle.transpose(paddle.matmul(sample_w, paddle.transpose(inputs, [0, 2, 1])), [0, 2, 1]) + + sample_b + - samp_log_probs + ) + sample_logits = sample_logits - 1e30 * hit + logits = paddle.concat([true_logits.unsqueeze([2]), sample_logits], -1) + + return logits + + +class ProjAdaptiveSoftmax(nn.Layer): + """ + Combine projection and logsoftmax. + """ + + def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, keep_order=False): + super(ProjAdaptiveSoftmax, self).__init__() + + self.n_token = n_token + self.d_embed = d_embed + self.d_proj = d_proj + + self.cutoffs = cutoffs + [n_token] + self.cutoff_ends = [0] + self.cutoffs + self.div_val = div_val + + self.shortlist_size = self.cutoffs[0] + self.num_clusters = len(self.cutoffs) - 1 + self.head_size = self.shortlist_size + self.num_clusters + + if self.num_clusters > 0: + self.cluster_weight = paddle.create_parameter( + shape=[self.num_clusters, self.d_embed], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + self.cluster_bias = paddle.create_parameter( + shape=[self.num_clusters], + dtype=global_dtype, + is_bias=True, + default_initializer=paddle.nn.initializer.Constant(0.0), + ) + + self.out_layers_weight = nn.ParameterList() + self.out_layers_bias = nn.ParameterList() + self.out_projs = nn.ParameterList() + + if div_val == 1: + for i in range(len(self.cutoffs)): + if d_proj != d_embed: + self.out_projs.append( + paddle.create_parameter( + shape=[d_proj, d_embed], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + ) + else: + self.out_projs.append(None) + + self.out_layers_weight.append( + paddle.create_parameter( + shape=[n_token, d_embed], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Constant(0.0), + ) + ) + self.out_layers_bias.append( + paddle.create_parameter( + shape=[n_token], + dtype=global_dtype, + is_bias=True, + default_initializer=paddle.nn.initializer.Constant(0.0), + ) + ) + else: + for i in range(len(self.cutoffs)): + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + d_emb_i = d_embed // (div_val**i) + + self.out_projs.append( + paddle.create_parameter( + shape=[d_proj, d_emb_i], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + ) + + self.out_layers_weight.append( + paddle.create_parameter( + shape=[r_idx - l_idx, d_emb_i], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Uniform( + low=-((r_idx - l_idx) ** (-1.0 / 2.0)), high=(r_idx - l_idx) ** (-1.0 / 2.0) + ), + ) + ) + self.out_layers_bias.append( + paddle.create_parameter( + 
shape=[r_idx - l_idx], + dtype=global_dtype, + is_bias=True, + default_initializer=paddle.nn.initializer.Uniform( + low=-((r_idx - l_idx) ** (-1.0 / 2.0)), high=(r_idx - l_idx) ** (-1.0 / 2.0) + ), + ) + ) + + self.keep_order = keep_order + + def _compute_logits(self, hidden, weight, bias, proj=None): + if proj is None: + logit = F.linear(hidden, weight.t(), bias=bias) + else: + proj_hid = F.linear(hidden, proj) + logit = F.linear(proj_hid, weight.t(), bias=bias) + + return logit + + def forward(self, hidden, target, keep_order=False): + assert hidden.shape[0] == target.shape[0] + + if self.num_clusters == 0: + logit = self._compute_logits(hidden, self.out_layers_weight[0], self.out_layers_bias[0], self.out_projs[0]) + nll = -paddle.log(F.softmax(logit, axis=-1)) + idx = paddle.concat([paddle.arange(0, nll.shape[0]).unsqueeze([1]), target.unsqueeze(1)], axis=1) + nll = paddle.gather_nd(nll, idx) + else: + weights, biases = [], [] + for i in range(len(self.cutoffs)): + if self.div_val == 1: + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + weight_i = self.out_layers_weight[0][l_idx:r_idx] + bias_i = self.out_layers_bias[0][l_idx:r_idx] + else: + weight_i = self.out_layers_weight[i] + bias_i = self.out_layers_bias[i] + + if i == 0: + weight_i = paddle.concat([weight_i, self.cluster_weight], axis=0) + bias_i = paddle.concat([bias_i, self.cluster_bias], axis=0) + + weights.append(weight_i) + biases.append(bias_i) + + head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0] + + head_logit = self._compute_logits(hidden, head_weight, head_bias, head_proj) + head_logprob = paddle.log(F.softmax(head_logit, axis=-1)) + + nll = paddle.zeros_like(target, dtype=hidden.dtype) + + offset = 0 + cutoff_values = [0] + self.cutoffs + for i in range(len(cutoff_values) - 1): + l_idx, r_idx = cutoff_values[i], cutoff_values[i + 1] + + mask_i = paddle.cast(target >= l_idx, dtype=paddle.get_default_dtype()) * paddle.cast( + target < r_idx, dtype="int64" + ) + indices_i = paddle.nonzero(mask_i).squeeze([1]) + + if paddle.numel(indices_i) == 0: + continue + target_i = paddle.gather(target, indices_i, axis=0) - l_idx + head_logprob_i = paddle.gather(head_logprob, indices_i, axis=0) + if i == 0: + target_i_idx = paddle.concat( + [paddle.arange(0, head_logprob_i.shape[0]).unsqueeze([1]), target_i.unsqueeze([1])], axis=1 + ) + logprob_i = head_logprob_i.gather_nd(target_i_idx) + else: + weight_i, bias_i, proj_i = ( + weights[i], + biases[i], + self.out_projs[i].weight if self.out_projs[i] is not None else None, + ) + + hidden_i = paddle.gather(hidden, indices_i, axis=0) + + tail_logit_i = self._compute_logits(hidden_i, weight_i, bias_i, proj_i) + tail_logprob_i = paddle.log(F.softmax(tail_logit_i, axis=-1)) + + target_i_idx = paddle.concat( + [paddle.arange(0, tail_logprob_i.shape[0]).unsqueeze([1]), target_i.unsqueeze([1])], axis=1 + ) + logprob_i = tail_logprob_i.gather_nd(target_i_idx) + + logprob_i = head_logprob_i[:, -i] + logprob_i + + if self.keep_order or keep_order: + nll = paddle.scatter(nll, indices_i, -logprob_i) + else: + index = paddle.arange(offset, offset + logprob_i.shape[0], 1) + nll = paddle.scatter(nll, index, -logprob_i) + + offset += logprob_i.shape[0] + + return nll + + +class LogUniformSampler(object): + def __init__(self, range_max, n_sample): + with paddle.no_grad(): + self.range_max = range_max + log_indices = paddle.log(paddle.arange(1.0, range_max + 2.0, 1.0, dtype=global_dtype)) + self.dist = (log_indices[1:] - log_indices[:-1]) / log_indices[-1] + + 
self.log_q = paddle.cast( + paddle.log( + paddle.exp(-(paddle.log1p(-paddle.cast(self.dist, dtype=global_dtype)) * 2 * n_sample)) - 1 + ), + dtype=global_dtype, + ) + + self.n_sample = n_sample + + def sample(self, labels): + n_sample = self.n_sample + n_tries = 2 * n_sample + batch_size = labels.shape[0] + + with paddle.no_grad(): + neg_samples = paddle.unique(paddle.multinomial(self.dist, n_tries, replacement=True)) + true_log_probs = paddle.gather(self.log_q, labels.flatten()) + true_log_probs = paddle.reshape(true_log_probs, shape=[batch_size, -1]) + samp_log_probs = paddle.gather(self.log_q, neg_samples) + return true_log_probs, samp_log_probs, neg_samples + + +class PositionEmbedding(nn.Layer): + def __init__(self, emb_dim): + super(PositionEmbedding, self).__init__() + self.emb_dim = emb_dim + self.inv_freq = 1.0 / (10000.0 ** (paddle.arange(0.0, emb_dim, 2.0, dtype=global_dtype) / emb_dim)) + + def forward(self, pos_seq, bsz=None): + sinusoid_inp = paddle.matmul(pos_seq.unsqueeze([1]), self.inv_freq.unsqueeze([0])) + pos_emb = paddle.concat([paddle.sin(sinusoid_inp), paddle.cos(sinusoid_inp)], axis=-1) + + if bsz is not None: + pos_emb = pos_emb.unsqueeze([0]).expand([bsz, -1, -1]) + pos_emb.stop_gradient = True + return pos_emb + else: + pos_emb = pos_emb.unsqueeze([0]) + pos_emb.stop_gradient = True + return pos_emb + + +class PositionwiseFFN(nn.Layer): + def __init__(self, d_model, d_inner, dropout, normalize_before=False): + super(PositionwiseFFN, self).__init__() + + self.d_model = d_model + self.d_inner = d_inner + + self.CoreNet = nn.Sequential( + nn.Linear( + d_model, + d_inner, + weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ), + nn.ReLU(), + nn.Dropout(dropout), + nn.Linear( + d_inner, + d_model, + weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ), + nn.Dropout(dropout), + ) + self.layer_norm = nn.LayerNorm( + d_model, + weight_attr=paddle.nn.initializer.Normal(mean=1.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ) + self.normalize_before = normalize_before + + def forward(self, inp): + if self.normalize_before: + core_out = self.CoreNet(self.layer_norm(inp)) + output = core_out + inp + else: + core_out = self.CoreNet(inp) + output = self.layer_norm(inp + core_out) + return output + + +class MultiHeadAttn(nn.Layer): + def __init__(self, n_head, d_model, d_head, dropout, attn_dropout=0, normalize_before=False): + super(MultiHeadAttn, self).__init__() + self.n_head = n_head + self.d_model = d_model + self.d_head = d_head + + self.q_proj = nn.Linear( + d_model, n_head * d_head, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), bias_attr=False + ) + self.kv_proj = nn.Linear( + d_model, 2 * n_head * d_head, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), bias_attr=False + ) + self.drop = nn.Dropout(p=dropout) + self.attn_drop = nn.Dropout(p=attn_dropout) + self.o_proj = nn.Linear( + n_head * d_head, d_model, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), bias_attr=False + ) + self.layer_norm = nn.LayerNorm( + d_model, + weight_attr=paddle.nn.initializer.Normal(mean=1.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ) + + self.scale = 1 / (d_head**0.5) + self.normalize_before = normalize_before + + def forward(self, h, attn_mask=None, mems=None): + if mems is not None: + c = paddle.concat([mems, h], axis=1) + else: + c = h + + if self.normalize_before: + c = 
self.layer_norm(c) + + head_q = self.q_proj(h) + head_k, head_v = paddle.chunk(self.kv_proj(c), chunks=2, axis=-1) + + head_q = paddle.reshape(head_q, shape=[h.shape[0], h.shape[1], self.n_head, self.d_head]) + head_k = paddle.reshape(head_k, shape=[c.shape[0], c.shape[1], self.n_head, self.d_head]) + head_v = paddle.reshape(head_v, shape=[c.shape[0], c.shape[1], self.n_head, self.d_head]) + + attn_score = paddle.einsum("bind,bjnd->bnij", head_q, head_k) + attn_score = attn_score * self.scale + if attn_mask is not None: + attn_score = attn_score - float("inf") * attn_mask + + attn_prob = F.softmax(attn_score, dim=-1) + attn_prob = self.attn_drop(attn_prob) + + attn_vec = paddle.einsum("bnij,bjnd->bind", attn_prob, head_v) + attn_vec = paddle.reshape(attn_vec, shape=[attn_vec.shape[0], attn_vec.shape[1], self.n_head * self.d_head]) + + attn_out = self.o_proj(attn_vec) + attn_out = self.drop(attn_out) + if self.normalize_before: + output = h + attn_out + else: + output = self.layer_norm(h + attn_out) + + return output + + +class RelMultiHeadAttn(nn.Layer): + def __init__( + self, + n_head, + d_model, + d_head, + dropout, + attn_dropout=0, + tgt_len=None, + ext_len=None, + mem_len=None, + normalize_before=False, + ): + super(RelMultiHeadAttn, self).__init__() + + self.n_head = n_head + self.d_model = d_model + self.d_head = d_head + self.dropout = dropout + + self.qkv_proj = nn.Linear( + d_model, 3 * n_head * d_head, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), bias_attr=False + ) + + self.drop = nn.Dropout(dropout) + self.attn_drop = nn.Dropout(attn_dropout) + self.o_proj = nn.Linear( + n_head * d_head, d_model, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), bias_attr=False + ) + + self.layer_norm = nn.LayerNorm( + d_model, + weight_attr=paddle.nn.initializer.Normal(mean=1.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ) + + self.scale = 1 / (d_head**0.5) + + self.normalize_before = normalize_before + + def _rel_shift(self, x, zero_triu=False): + x_shape = x.shape + zero_pad = paddle.zeros([x_shape[0], x_shape[1], x_shape[2], 1], dtype=x.dtype) + x_padded = paddle.concat([zero_pad, x], axis=-1) + + x_padded = paddle.reshape(x_padded, shape=[x_shape[0], x_shape[1], x_shape[3] + 1, x_shape[2]]) + + x = paddle.reshape(x_padded[:, :, 1:, :], shape=x_shape) + + if zero_triu: + ones = paddle.ones([x_shape[2], x_shape[3]]) + x = x * paddle.tril(ones, diagonal=x_shape[3] - x_shape[2]).unsqueeze([2, 3]) + + return x + + def forward(self, w, r, attn_mask=None, mems=None): + raise NotImplementedError + + +class RelPartialLearnableMultiHeadAttn(RelMultiHeadAttn): + def __init__(self, *args, **kwargs): + super(RelPartialLearnableMultiHeadAttn, self).__init__(*args, **kwargs) + + self.r_proj = nn.Linear( + self.d_model, + self.n_head * self.d_head, + weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + bias_attr=False, + ) + + def forward(self, w, r, r_w_bias, r_r_bias, attn_mask=None, mems=None): + qlen, rlen, bsz = w.shape[1], r.shape[1], w.shape[0] + + if mems is not None: + cat = paddle.concat([mems, w], axis=1) + if self.normalize_before: + w_heads = self.qkv_proj(self.layer_norm(cat)) + else: + w_heads = self.qkv_proj(cat) + r_head_k = self.r_proj(r) + + w_head_q, w_head_k, w_head_v = paddle.chunk(w_heads, chunks=3, axis=-1) + + w_head_q = w_head_q[:, -qlen:, :] + else: + if self.normalize_before: + w_heads = self.qkv_proj(self.layer_norm(w)) + else: + w_heads = self.qkv_proj(w) + r_head_k = self.r_proj(r) + + w_head_q, w_head_k, w_head_v 
= paddle.chunk(w_heads, chunks=3, axis=-1) + + klen = w_head_k.shape[1] + + w_head_q = paddle.reshape(w_head_q, shape=[bsz, qlen, self.n_head, self.d_head]) + w_head_k = paddle.reshape(w_head_k, shape=[bsz, klen, self.n_head, self.d_head]) + w_head_v = paddle.reshape(w_head_v, shape=[bsz, klen, self.n_head, self.d_head]) + + r_head_k = paddle.reshape(r_head_k, shape=[bsz, rlen, self.n_head, self.d_head]) + + rw_head_q = w_head_q + r_w_bias + + AC = paddle.einsum("bind,bjnd->bnij", rw_head_q, w_head_k) + rr_head_q = w_head_q + r_r_bias + + BD = paddle.einsum("bind,bjnd->bnij", rr_head_q, r_head_k) + BD = self._rel_shift(BD) + + attn_score = AC + BD + attn_score = attn_score * self.scale + + if attn_mask is not None: + attn_score = attn_score - 1e30 * attn_mask + + attn_prob = F.softmax(attn_score, axis=-1) + attn_prob = self.attn_drop(attn_prob) + + attn_vec = paddle.einsum("bnij,bjnd->bind", attn_prob, w_head_v) + + attn_vec = paddle.reshape(attn_vec, shape=[attn_vec.shape[0], attn_vec.shape[1], self.n_head * self.d_head]) + + attn_out = self.o_proj(attn_vec) + attn_out = self.drop(attn_out) + + if self.normalize_before: + output = w + attn_out + else: + output = self.layer_norm(w + attn_out) + + return output + + +class RelLearnableMultiHeadAttn(RelMultiHeadAttn): + def __init__(self, *args, **kwargs): + super(RelLearnableMultiHeadAttn, self).__init__(*args, **kwargs) + + def forward(self, w, r_emb, r_w_bias, r_bias, attn_mask=None, mems=None): + qlen, bsz = w.shape[1], w.shape[0] + + if mems is not None: + cat = paddle.concat([mems, w], 1) + if self.normalize_before: + w_heads = self.qkv_proj(self.layer_norm(cat)) + else: + w_heads = self.qkv_proj(cat) + w_head_q, w_head_k, w_head_v = paddle.chunk(w_heads, chunks=3, axis=-1) + + w_head_q = w_head_q[-qlen:] + else: + if self.normalize_before: + w_heads = self.qkv_proj(self.layer_norm(w)) + else: + w_heads = self.qkv_proj(w) + w_head_q, w_head_k, w_head_v = paddle.chunk(w_heads, chunks=3, axis=-1) + + klen = w_head_k.shape[1] + + w_head_q = paddle.reshape(w_head_q, shape=[w_head_q.shape[0], w_head_q.shape[1], self.n_head, self.d_head]) + w_head_k = paddle.reshape(w_head_k, shape=[w_head_k.shape[0], w_head_k.shape[1], self.n_head, self.d_head]) + w_head_v = paddle.reshape(w_head_v, shape=[w_head_v.shape[0], w_head_v.shape[1], self.n_head, self.d_head]) + + if klen > r_emb.shape[0]: + r_emb_pad = r_emb[0:1].expand(klen - r_emb.shape[0], -1, -1) + r_emb = paddle.concat([r_emb_pad, r_emb], 0) + r_bias_pad = r_bias[0:1].expand(klen - r_bias.shape[0], -1) + r_bias = paddle.concat([r_bias_pad, r_bias], 0) + else: + r_emb = r_emb[-klen:] + r_bias = r_bias[-klen:] + + rw_head_q = w_head_q + r_w_bias.unsqueeze([0]) + + AC = paddle.einsum("bind,bjnd->bnij", rw_head_q, w_head_k) + r_emb = r_emb.unsqueeze([0]).expand([bsz, -1, -1, -1]) + B_ = paddle.einsum("bind,bjnd->bnij", w_head_q, r_emb) + D_ = r_bias.unsqueeze([0, 2]) + BD = self._rel_shift(B_ + D_) + + attn_score = AC + BD + attn_score = attn_score * self.scale + + if attn_mask is not None: + attn_score = attn_score - float("inf") * attn_mask + + attn_prob = F.softmax(attn_score, dim=-1) + attn_prob = self.attn_drop(attn_prob) + + attn_vec = paddle.einsum("bnij,bjnd->bind", attn_prob, w_head_v) + + attn_vec = paddle.reshape(attn_vec, shape=[attn_vec.shape[0], attn_vec.shape[1], self.n_head * self.d_head]) + + attn_out = self.o_net(attn_vec) + attn_out = self.drop(attn_out) + + if self.normalize_before: + output = w + attn_out + else: + output = self.layer_norm(w + attn_out) + + return output + + 
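+# Decoder layers pair one attention variant with a position-wise FFN. MemTransformerLM
+# selects the variant through `attn_type`: 0 -> RelPartialLearnableDecoderLayer
+# (Transformer-XL relative attention), 1 -> RelLearnableDecoderLayer (learnable
+# relative embeddings), 2/3 -> the plain MultiHeadAttn-based DecoderLayer.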
+class DecoderLayer(nn.Layer): + def __init__(self, n_head, d_model, d_head, d_inner, dropout, **kwargs): + super(DecoderLayer, self).__init__() + + self.dec_attn = MultiHeadAttn(n_head, d_model, d_head, dropout, **kwargs) + self.pos_ff = PositionwiseFFN(d_model, d_inner, dropout, normalize_before=kwargs.get("normalize_before")) + + def forward(self, dec_inp, dec_attn_mask=None, mems=None): + + output = self.dec_attn(dec_inp, attn_mask=dec_attn_mask, mems=mems) + output = self.pos_ff(output) + + return output + + +class RelLearnableDecoderLayer(nn.Layer): + def __init__(self, n_head, d_model, d_head, d_inner, dropout, **kwargs): + super(RelLearnableDecoderLayer, self).__init__() + + self.dec_attn = RelLearnableMultiHeadAttn(n_head, d_model, d_head, dropout, **kwargs) + self.pos_ff = PositionwiseFFN(d_model, d_inner, dropout, normalize_before=kwargs.get("normalize_before")) + + def forward(self, dec_inp, r_emb, r_w_bias, r_bias, dec_attn_mask=None, mems=None): + + output = self.dec_attn(dec_inp, r_emb, r_w_bias, r_bias, attn_mask=dec_attn_mask, mems=mems) + output = self.pos_ff(output) + + return output + + +class RelPartialLearnableDecoderLayer(nn.Layer): + def __init__(self, n_head, d_model, d_head, d_inner, dropout, **kwargs): + super(RelPartialLearnableDecoderLayer, self).__init__() + + self.dec_attn = RelPartialLearnableMultiHeadAttn(n_head, d_model, d_head, dropout, **kwargs) + self.pos_ff = PositionwiseFFN(d_model, d_inner, dropout, normalize_before=kwargs.get("normalize_before")) + + def forward(self, dec_inp, r, r_w_bias, r_r_bias, dec_attn_mask=None, mems=None): + output = self.dec_attn(dec_inp, r, r_w_bias, r_r_bias, attn_mask=dec_attn_mask, mems=mems) + output = self.pos_ff(output) + + return output + + +class AdaptiveEmbedding(nn.Layer): + def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, sample_softmax=False): + super(AdaptiveEmbedding, self).__init__() + + self.n_token = n_token + self.d_embed = d_embed + + self.cutoffs = cutoffs + [n_token] + self.div_val = div_val + self.d_proj = d_proj + + self.emb_scale = d_proj**0.5 + + self.cutoff_ends = [0] + self.cutoffs + + self.emb_layers = nn.LayerList() + self.emb_projs = nn.ParameterList() + if div_val == 1: + self.emb_layers.append( + nn.Embedding( + n_token, + d_embed, + sparse=sample_softmax > 0, + weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + ) + if d_proj != d_embed: + self.emb_projs.append( + paddle.create_parameter( + shape=[d_embed, d_proj], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + ) + else: + for i in range(len(self.cutoffs)): + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + d_emb_i = d_embed // (div_val**i) + self.emb_layers.append( + nn.Embedding(r_idx - l_idx, d_emb_i, weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01)) + ) + self.emb_projs.append( + paddle.create_parameter( + shape=[d_emb_i, d_proj], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + ) + + def forward(self, inp): + if self.div_val == 1: + embed = self.emb_layers[0](inp) + if self.d_proj != self.d_embed: + embed = F.linear(embed, self.emb_projs[0]) + else: + inp_flat = paddle.reshape(inp, shape=[-1]) + emb_flat = paddle.zeros([inp_flat.shape[0], self.d_proj], dtype=global_dtype) + for i in range(len(self.cutoffs)): + l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1] + + mask_i = (inp_flat >= l_idx) & (inp_flat < r_idx) + indices_i = paddle.nonzero(mask_i).squeeze([1]) + + if 
indices_i.numel() == 0: + continue + + inp_i = paddle.gather(inp_flat, indices_i, axis=0) - l_idx + emb_i = self.emb_layers[i](inp_i) + emb_i = F.linear(emb_i, self.emb_projs[i]) + + emb_flat = paddle.scatter(emb_flat, indices_i, emb_i) + + embed = paddle.reshape(emb_flat, shape=inp.shape.append(self.d_proj)) + + embed = embed * self.emb_scale + + return embed + + +class MemTransformerLM(nn.Layer): + def __init__( + self, + n_token, + n_layer, + n_head, + d_model, + d_head, + d_inner, + dropout, + attn_dropout, + tie_weight=True, + d_embed=None, + div_val=1, + tie_projs=[False], + normalize_before=False, + tgt_len=None, + ext_len=None, + mem_len=None, + cutoffs=[], + adapt_inp=False, + same_length=False, + attn_type=0, + clamp_len=-1, + sample_softmax=-1, + ): + super(MemTransformerLM, self).__init__() + self.n_token = n_token + + d_embed = d_model if d_embed is None else d_embed + self.d_embed = d_embed + self.d_model = d_model + self.n_head = n_head + self.d_head = d_head + + self.word_emb = AdaptiveEmbedding(n_token, d_embed, d_model, cutoffs, div_val=div_val) + + self.drop = nn.Dropout(dropout) + + self.n_layer = n_layer + + self.tgt_len = tgt_len + self.mem_len = mem_len + self.ext_len = ext_len + self.max_klen = tgt_len + ext_len + mem_len + + self.attn_type = attn_type + + self.layers = nn.LayerList() + if attn_type == 0: + for i in range(n_layer): + self.layers.append( + RelPartialLearnableDecoderLayer( + n_head, + d_model, + d_head, + d_inner, + dropout, + tgt_len=tgt_len, + ext_len=ext_len, + mem_len=mem_len, + attn_dropout=attn_dropout, + normalize_before=normalize_before, + ) + ) + elif attn_type == 1: + for i in range(n_layer): + self.layers.append( + RelLearnableDecoderLayer( + n_head, + d_model, + d_head, + d_inner, + dropout, + tgt_len=tgt_len, + ext_len=ext_len, + mem_len=mem_len, + attn_dropout=attn_dropout, + normalize_before=normalize_before, + ) + ) + elif attn_type in [2, 3]: + for i in range(n_layer): + self.layers.append( + DecoderLayer( + n_head, + d_model, + d_head, + d_inner, + dropout, + attn_dropout=attn_dropout, + normalize_before=normalize_before, + ) + ) + + self.sample_softmax = sample_softmax + if sample_softmax > 0: + self.out_layer = nn.Linear( + d_model, + n_token, + weight_attr=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + bias_attr=paddle.nn.initializer.Constant(0.0), + ) + self.tie_weight = tie_weight + self.sampler = LogUniformSampler(n_token, sample_softmax) + else: + self.crit = ProjAdaptiveSoftmax(n_token, d_embed, d_model, cutoffs, div_val=div_val) + + if tie_weight: + for i in range(len(self.crit.out_layers_weight)): + self.crit.out_layers_weight[i] = self.word_emb.emb_layers[i].weight + + if tie_projs: + for i, tie_proj in enumerate(tie_projs): + if tie_proj and div_val == 1 and d_model != d_embed: + self.crit.out_projs[i] = self.word_emb.emb_projs[0] + elif tie_proj and div_val != 1: + self.crit.out_projs[i] = self.word_emb.emb_projs[i] + + self.same_length = same_length + self.clamp_len = clamp_len + + self._create_params() + + def backward_compatible(self): + self.sample_softmax = -1 + + def _create_params(self): + if self.attn_type == 0: + self.pos_emb = PositionEmbedding(self.d_model) + self.r_w_bias = paddle.create_parameter( + shape=[self.n_head, self.d_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + self.r_r_bias = paddle.create_parameter( + shape=[self.n_head, self.d_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) 
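+        # attn_type 1 instead learns per-layer relative position embeddings (r_emb)
+        # together with per-layer r_w_bias and r_bias terms.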
+ elif self.attn_type == 1: + self.r_emb = paddle.create_parameter( + shape=[self.n_layer, self.max_klen, self.n_head, self.d_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + self.r_w_bias = paddle.create_parameter( + shape=[self.n_layer, self.n_head, self.d_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + self.r_bias = paddle.create_parameter( + shape=[self.n_layer, self.max_klen, self.n_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + elif self.attn_type == 2: + self.pos_emb = PositionEmbedding(self.d_model) + elif self.attn_type == 3: + self.r_emb = paddle.create_parameter( + shape=[self.n_layer, self.max_klen, self.n_head, self.d_head], + dtype=global_dtype, + default_initializer=paddle.nn.initializer.Normal(mean=0.0, std=0.01), + ) + + def reset_length(self, tgt_len, ext_len, mem_len): + self.tgt_len = tgt_len + self.mem_len = mem_len + self.ext_len = ext_len + + def init_mems(self, batch_size, d_model): + if self.mem_len > 0: + mems = [] + for _ in range(self.n_layer + 1): + empty = paddle.empty(shape=[batch_size, 0, d_model], dtype=global_dtype) + mems.append(empty) + + return mems + else: + return None + + def _update_mems(self, hids, mems, qlen, mlen): + if mems is None: + return None + + assert len(hids) == len(mems), "length of hids and length of mems must be the same. " + + with paddle.no_grad(): + new_mems = [] + end_idx = mlen + max(0, qlen - 0 - self.ext_len) + beg_idx = max(0, end_idx - self.mem_len) + for i in range(len(hids)): + cat = paddle.concat([mems[i], hids[i]], axis=1) + new_mems.append(cat[:, beg_idx:end_idx].detach()) + + return new_mems + + def _forward(self, dec_inputs, mems=None): + bsz, qlen = dec_inputs.shape + + word_emb = self.word_emb(dec_inputs) + + mlen = mems[0].shape[1] if mems is not None else 0 + klen = mlen + qlen + if self.same_length: + all_ones = paddle.ones(shape=[qlen, klen], dtype=word_emb.dtype) + mask_len = klen - self.mem_len + if mask_len > 0: + mask_shift_len = qlen - mask_len + else: + mask_shift_len = qlen + dec_attn_mask = ( + paddle.triu(all_ones, diagonal=1 + mlen) + paddle.tril(all_ones, -mask_shift_len) + ).unsqueeze([0, 1]) + else: + dec_attn_mask = paddle.ones(shape=[qlen, klen], dtype=word_emb.dtype) + dec_attn_mask = paddle.triu(dec_attn_mask, diagonal=1 + mlen).unsqueeze([0, 1]) + + hids = [] + if self.attn_type == 0: + pos_seq = paddle.arange(klen - 1, -1, -1.0, dtype=word_emb.dtype) + if self.clamp_len > 0: + # TODO: clamp and clip + pos_seq = paddle.clip(pos_seq, max=self.clamp_len) + pos_emb = self.pos_emb(pos_seq, bsz) + + core_out = self.drop(word_emb) + pos_emb = self.drop(pos_emb) + + hids.append(core_out) + for i, layer in enumerate(self.layers): + mems_i = None if mems is None else mems[i] + core_out = layer( + core_out, pos_emb, self.r_w_bias, self.r_r_bias, dec_attn_mask=dec_attn_mask, mems=mems_i + ) + hids.append(core_out) + elif self.attn_type == 1: + core_out = self.drop(word_emb) + hids.append(core_out) + for i, layer in enumerate(self.layers): + if self.clamp_len > 0: + r_emb = self.r_emb[i][-self.clamp_len :] + r_bias = self.r_bias[i][-self.clamp_len :] + else: + r_emb, r_bias = self.r_emb[i], self.r_bias[i] + + mems_i = None if mems is None else mems[i] + core_out = layer(core_out, r_emb, self.r_w_bias[i], r_bias, dec_attn_mask=dec_attn_mask, mems=mems_i) + hids.append(core_out) + elif self.attn_type == 2: + pos_seq = 
paddle.arange(klen - 1, -1, -1.0, dtype=word_emb.dtype) + if self.clamp_len > 0: + pos_seq = paddle.clip(pos_seq, max=self.clamp_len) + pos_emb = self.pos_emb(pos_seq, bsz) + + core_out = self.drop(word_emb + pos_emb[-qlen:]) + + hids.append(core_out) + for i, layer in enumerate(self.layers): + mems_i = None if mems is None else mems[i] + if mems_i is not None and i == 0: + mems_i += pos_emb[:mlen] + core_out = layer(core_out, dec_attn_mask=dec_attn_mask, mems=mems_i) + hids.append(core_out) + elif self.attn_type == 3: + core_out = self.drop(word_emb) + + hids.append(core_out) + for i, layer in enumerate(self.layers): + mems_i = None if mems is None else mems[i] + if mems_i is not None and mlen > 0: + cur_emb = self.r_emb[i][:-qlen] + cur_size = cur_emb.size(0) + if cur_size < mlen: + cur_emb_pad = cur_emb[0:1].expand(mlen - cur_size, -1, -1) + cur_emb = paddle.concat([cur_emb_pad, cur_emb], 0) + else: + cur_emb = cur_emb[-mlen:] + mems_i += cur_emb.view(mlen, 1, -1) + core_out += self.r_emb[i][-qlen:].view(qlen, 1, -1) + + core_out = layer(core_out, dec_attn_mask=dec_attn_mask, mems=mems_i) + hids.append(core_out) + + core_out = self.drop(core_out) + + new_mems = self._update_mems(hids, mems, mlen, qlen) + + return core_out, new_mems + + def forward(self, data, target, *mems): + if not mems: + batch_size = data.shape[0] + mems = self.init_mems(batch_size, self.d_model) + + hidden, new_mems = self._forward(data, mems=mems) + + # TODO(FrostML): use getitem. + tgt_len = target.shape[1] + pred_hid = paddle.slice(hidden, [1], [-tgt_len], [hidden.shape[1]]) + if self.sample_softmax > 0 and self.training: + assert self.tie_weight, "tie_weight must be True if sample_softmax > 0" + logit = sample_logits(self.word_emb, self.out_layer.bias, target, pred_hid, self.sampler) + loss = -paddle.log(F.softmax(logit, axis=-1))[:, :, 0] + else: + loss = self.crit( + paddle.reshape(pred_hid, shape=[-1, pred_hid.shape[-1]]), paddle.reshape(target, shape=[-1]) + ) + + if new_mems is None: + return [loss.mean()] + else: + return [loss.mean()] + new_mems diff --git a/examples/language_model/transformer-xl/reader.py b/examples/language_model/transformer-xl/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..390392e9046a5a78110d36b57f0bd03c2812d271 --- /dev/null +++ b/examples/language_model/transformer-xl/reader.py @@ -0,0 +1,193 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import numpy as np +import paddle.distributed as dist +from paddle.io import DataLoader, IterableDataset + +from paddlenlp.data import Vocab + + +class LMDataset(IterableDataset): + def __init__(self, mode, vocab, path, dataset_name, batch_size, bptt, ext_len, nranks, rank): + assert mode in ["train", "valid", "test"], "Parameter mode must be one of [train, valid, test]." 
+ + super(LMDataset, self).__init__() + self.vocab = vocab + self.dataset_name = dataset_name + + if self.dataset_name in ["wt103"]: + self.data = self.read_raw_data(filename=os.path.join(path, mode + ".txt"), ordered=True, lower_case=False) + elif self.dataset_name in ["enwik8", "text8"]: + self.data = self.read_raw_data(filename=os.path.join(path, mode + ".txt"), ordered=True, add_eos=False) + else: + raise ValueError("Not supported dataset yet. ") + self.rank = rank + self.batch_size = batch_size + batch_size *= nranks + + self.bptt = bptt + self.ext_len = ext_len if ext_len is not None else 0 + + self.num_step = len(self.data) // batch_size + data = self.data[: self.num_step * batch_size] + self.data = data.reshape([batch_size, -1]) + + # Number of samples + self.num_samples = (self.num_step + self.bptt - 1) // self.bptt + + def __len__(self): + return self.num_samples + + def __iter__(self): + for i in range(0, self.data.shape[1] - 1, self.bptt): + seq_len = min(self.bptt, self.data.shape[1] - 1 - i) + end_idx = i + seq_len + beg_idx = max(0, i - self.ext_len) + src = self.data[:, beg_idx:end_idx] + target = self.data[:, i + 1 : i + 1 + seq_len] + + # NOTE: For now, DataLoader can yield `int`. It's not necessary + # to transfer `seq_len` after DataLoader. + # However, if it's necessary to use `seq_len` as input for some + # PaddlePaddle op, then it must be yielded by `[seq_len]` whose + # shape is [1], cause some op cannot use shape [] as input. + yield [ + src[self.rank * self.batch_size : (self.rank + 1) * self.batch_size], + target[self.rank * self.batch_size : (self.rank + 1) * self.batch_size], + seq_len, + ] + + def read_raw_data( + self, filename, ordered=False, lower_case=True, delimiter=None, add_eos=True, add_double_eos=False + ): + assert os.path.exists(filename), "%s is not exist. " % filename + + data = [] + with open(filename, "r", encoding="utf-8") as f: + for line in f: + tokens = LMDataset.tokenize(line=line, delimiter=delimiter, lower_case=lower_case) + if add_double_eos: # for lm1b + tokens = ( + [self.vocab._identifiers_to_tokens["bos_token"]] + + tokens + + [self.vocab._identifiers_to_tokens["bos_token"]] + ) + elif add_eos: + tokens = tokens + [self.vocab._identifiers_to_tokens["eos_token"]] + data.append(np.asarray(self.get_indices(tokens)).astype("int64")) + + if ordered: + data = np.concatenate(data) + + return data + + def get_indices(self, tokens): + return self.vocab.to_indices(tokens) + + @classmethod + def get_vocab( + cls, + files, + max_size=None, + min_freq=0, + lower_case=True, + delimiter=None, + unk_token=None, + pad_token=None, + bos_token=None, + eos_token=None, + **kwargs + ): + return Vocab.build_vocab( + cls.data_iterator(files=files, delimiter=delimiter, lower_case=lower_case), + max_size=max_size, + min_freq=min_freq, + unk_token=unk_token, + pad_token=pad_token, + bos_token=bos_token, + eos_token=eos_token, + ) + + @classmethod + def tokenize(cls, line, delimiter=None, lower_case=True): + line = line.strip() + if lower_case: + line = line.lower() + tokens = list(line) if delimiter == "" else line.split(delimiter) + return tokens + + @classmethod + def data_iterator(cls, files, delimiter=None, lower_case=True): + if isinstance(files, str): + files = [files] + elif not isinstance(files, (list, tuple)): + raise ValueError("The parameter files must be a str or a list/tuple.") + + for fl in files: + assert os.path.exists(fl), "%s is not exist. 
" % fl + + with open(fl, "r", encoding="utf-8") as f: + for line in f: + tokens = cls.tokenize(line=line, delimiter=delimiter, lower_case=lower_case) + yield tokens + + +def get_lm_data_loader(args, vocab, mode="train"): + lm_dataset = LMDataset( + mode=mode, + vocab=vocab, + path=args.data, + dataset_name=args.dataset, + batch_size=args.batch_size if mode == "train" else args.eval_batch_size, + bptt=args.tgt_len, + ext_len=args.ext_len, + nranks=dist.get_world_size() if mode == "train" else 1, + rank=dist.get_rank() if mode == "train" else 0, + ) + + data_loader = DataLoader(dataset=lm_dataset, batch_size=None, num_workers=0, return_list=True) + + return data_loader + + +def get_lm_vocab(args): + kwargs = {"unk_token": ""} + if args.token_delimiter == "None": + kwargs["delimiter"] = None + else: + kwargs["delimiter"] = args.token_delimiter + + if args.dataset == "wt103": + kwargs["eos_token"] = "" + kwargs["lower_case"] = False + + if args.dataset in ["enwik8", "text8"]: + files = [ + os.path.join(args.data, "train.txt"), + os.path.join(args.data, "valid.txt"), + os.path.join(args.data, "test.txt"), + ] + elif args.dataset == "wt103": + files = [os.path.join(args.data, "train.txt")] + else: + raise ValueError("Not supported dataset yet. ") + + vocab = LMDataset.get_vocab(files, **kwargs) + args.ntokens = len(vocab) + print("Finish processing vocabulary, and the size of vocabulary is {}".format(args.ntokens)) + + return vocab diff --git a/examples/language_model/transformer-xl/train.py b/examples/language_model/transformer-xl/train.py new file mode 100644 index 0000000000000000000000000000000000000000..579a9114efd9cbc9f554499037f0379aa896b4b7 --- /dev/null +++ b/examples/language_model/transformer-xl/train.py @@ -0,0 +1,297 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import os +import time +from pprint import pprint + +import numpy as np +import paddle +import paddle.distributed as dist +import yaml +from attrdict import AttrDict +from mem_transformer import MemTransformerLM +from reader import get_lm_data_loader, get_lm_vocab + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="./configs/enwik8.yaml", type=str, help="Path of the config file. 
") + args = parser.parse_args() + return args + + +def do_train(args): + if args.use_gpu: + rank = dist.get_rank() + trainer_count = dist.get_world_size() + else: + rank = 0 + trainer_count = 1 + paddle.set_device("cpu") + + if trainer_count > 1: + dist.init_parallel_env() + + random_seed = eval(str(args.random_seed)) + if random_seed is not None: + paddle.seed(random_seed) + + vocab = get_lm_vocab(args) + train_loader = get_lm_data_loader(args, vocab, "train") + eval_loader = get_lm_data_loader(args, vocab, "valid") + + cutoffs, tie_projs = [], [False] + if args.adaptive: + assert args.dataset in ["wt103", "lm1b"] + if args.dataset == "wt103": + cutoffs = [20000, 40000, 200000] + tie_projs += [True] * len(cutoffs) + elif args.dataset == "lm1b": + cutoffs = [60000, 100000, 640000] + tie_projs += [False] * len(cutoffs) + + mem_transformer = MemTransformerLM( + args.ntokens, + args.n_layer, + args.n_head, + args.d_model, + args.d_head, + args.d_inner_hid, + args.dropout, + args.attn_dropout, + tie_weight=args.tie_weight, + d_embed=args.d_model, + div_val=args.div_val, + tie_projs=tie_projs, + normalize_before=args.normalize_before, + tgt_len=args.tgt_len, + ext_len=args.ext_len, + mem_len=args.mem_len, + cutoffs=cutoffs, + same_length=args.same_length, + attn_type=args.attn_type, + clamp_len=args.clamp_len, + sample_softmax=args.sample_softmax, + ) + + if args.scheduler == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay( + learning_rate=args.learning_rate, T_max=args.max_step, eta_min=args.eta_min + ) + elif args.scheduler == "noam": + scheduler = paddle.optimizer.lr.NoamDecay( + d_model=args.d_model, warmup_steps=args.warmup_steps, learning_rate=args.learning_rate + ) + elif args.scheduler == "dev_perf": + paddle.optimizer.lr.ReduceOnPlateau( + learning_rate=args.learning_rate, factor=args.decay_rate, patience=args.patience, min_lr=args.lr_min + ) + elif args.scheduler == "constant": + scheduler = args.learning_rate + + clip = paddle.nn.ClipGradByGlobalNorm(args.clip) + if args.optim.lower() == "momentum": + optimizer = paddle.optimizer.Momentum( + learning_rate=scheduler, parameters=mem_transformer.parameters(), momentum=args.mom, grad_clip=clip + ) + elif args.optim.lower() == "adam": + optimizer = paddle.optimizer.Adam( + learning_rate=scheduler, + parameters=mem_transformer.parameters(), + beta1=args.beta1, + beta2=args.beta2, + epsilon=eval(args.eps), + grad_clip=clip, + ) + elif args.optim.lower() == "adagrad": + optimizer = paddle.optimizer.Adagrad( + learning_rate=scheduler, parameters=mem_transformer.parameters(), grad_clip=clip + ) + + # Init from some checkpoint, to resume the previous training + if args.init_from_checkpoint: + model_dict = paddle.load(os.path.join(args.init_from_checkpoint, "mem_transformer.pdparams")) + opt_dict = paddle.load(os.path.join(args.init_from_checkpoint, "mem_transformer.pdopt")) + mem_transformer.set_state_dict(model_dict) + optimizer.set_state_dict(opt_dict) + print("loaded from checkpoint.") + # Init from some pretrain models, to better solve the current task + if args.init_from_pretrain_model: + model_dict = paddle.load(os.path.join(args.init_from_pretrain_model, "mem_transformer.pdparams")) + mem_transformer.set_state_dict(model_dict) + print("loaded from pre-trained model.") + + if trainer_count > 1: + mem_transformer = paddle.DataParallel(mem_transformer) + + step_idx = 0 + train_loss = 0.0 + + log_start_time = time.time() + + for pass_id in range(args.epoch): + batch_id = 0 + + mems = tuple() + for input_data in train_loader: 
+ (src, target, seq_len) = input_data + ret = mem_transformer(src, target, *mems) + loss = ret[0] + mems = ret[1:] + train_loss += loss.numpy() + + loss.backward() + optimizer.step() + optimizer.clear_grad() + + if step_idx > 0 and step_idx % args.print_step == 0 and rank == 0: + cur_loss = train_loss / args.print_step + elapsed = time.time() - log_start_time + if args.scheduler == "constant": + lr = optimizer.get_lr() + else: + lr = scheduler.get_lr() + logger_info = ( + "step_idx: %d, epoch: %d, batch: %d, learning rate: %.8f, " + "speed: %f ms/batch, loss: %f" + % (step_idx, pass_id, batch_id, lr, elapsed * 1000.0 / args.print_step, cur_loss) + ) + if args.dataset in ["enwik8", "text8"]: + logger_info = logger_info + ", bpc: %f" % (cur_loss / np.log(2)) + else: + logger_info = logger_info + ", ppl: %f" % (np.exp(cur_loss)) + + logger.info(logger_info) + train_loss = 0.0 + log_start_time = time.time() + + if step_idx % args.save_step == 0 and step_idx != 0: + # Do validation. + mem_transformer.eval() + + # TODO(FrostML): simplify this. + if args.mem_len == 0: + if dist.get_world_size() == 1: + mem_transformer.reset_length( + tgt_len=args.eval_tgt_len, + ext_len=args.ext_len + args.tgt_len - args.eval_tgt_len, + mem_len=args.mem_len, + ) + else: + mem_transformer._layers.reset_length( + tgt_len=args.eval_tgt_len, + ext_len=args.ext_len + args.tgt_len - args.eval_tgt_len, + mem_len=args.mem_len, + ) + else: + if dist.get_world_size() == 1: + mem_transformer.reset_length( + tgt_len=args.eval_tgt_len, + ext_len=args.ext_len, + mem_len=args.mem_len + args.tgt_len - args.eval_tgt_len, + ) + else: + mem_transformer._layers.reset_length( + tgt_len=args.eval_tgt_len, + ext_len=args.ext_len, + mem_len=args.mem_len + args.tgt_len - args.eval_tgt_len, + ) + + total_len, total_loss = 0, 0.0 + + eval_mems = tuple() + with paddle.no_grad(): + for i, (src, target, seq_len) in enumerate(eval_loader): + if args.max_eval_steps > 0 and i >= args.max_eval_steps: + break + ret = mem_transformer(src, target, *eval_mems) + loss, eval_mems = ret[0], ret[1:] + eval_cur_loss = seq_len * loss.numpy() + total_loss += eval_cur_loss + total_len += seq_len + eval_loss = total_loss / total_len + + logger_info = "Validation, step_idx: %d, validation loss: %f" % (step_idx, eval_loss) + if args.dataset in ["enwik8", "text8"]: + logger_info = logger_info + ", bpc: %f" % (eval_loss / np.log(2)) + else: + logger_info = logger_info + ", ppl: %f" % (np.exp(eval_loss)) + logger.info(logger_info) + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_" + str(step_idx)) + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(mem_transformer.state_dict(), os.path.join(model_dir, "mem_transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "mem_transformer.pdopt")) + f = open( + os.path.join(args.save_model, "step_" + str(step_idx), "evaluation_loss_" + str(eval_loss)), + "w", + ) + f.close() + + if args.scheduler == "dev_perf": + scheduler.step(eval_loss) + + # TODO(FrostML): simplify this. 
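+                    # Validation above ran with eval_tgt_len; restore the training
+                    # tgt_len/ext_len/mem_len before training continues.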
+ if dist.get_world_size() == 1: + mem_transformer.reset_length(tgt_len=args.tgt_len, ext_len=args.ext_len, mem_len=args.mem_len) + else: + mem_transformer._layers.reset_length( + tgt_len=args.tgt_len, ext_len=args.ext_len, mem_len=args.mem_len + ) + + mem_transformer.train() + + if step_idx >= args.max_step: + return + step_idx += 1 + batch_id += 1 + if args.scheduler in ["cosine", "dev_perf"]: + if step_idx < args.warmup_steps: + curr_lr = args.learning_rate * step_idx / args.warmup_steps + scheduler.base_lr = curr_lr + else: + if args.scheduler == "cosine": + scheduler.step() + elif args.scheduler == "constant": + if step_idx < args.warmup_steps: + curr_lr = args.learning_rate * step_idx / args.warmup_steps + optimizer.set_lr(curr_lr) + elif args.scheduler == "noam": + scheduler.step() + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_final") + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(mem_transformer.state_dict(), os.path.join(model_dir, "mem_transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "mem_transformer.pdopt")) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + + do_train(args) diff --git a/examples/language_model/transformer-xl/utils/preprocess_text8.py b/examples/language_model/transformer-xl/utils/preprocess_text8.py new file mode 100644 index 0000000000000000000000000000000000000000..ad70ab65ccc2f5c818905bea3ad26af1dab67eb0 --- /dev/null +++ b/examples/language_model/transformer-xl/utils/preprocess_text8.py @@ -0,0 +1,33 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
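+
+# Split the raw text8 corpus into character-level train/valid/test files. The single
+# command-line argument is the number of characters held out for each of valid and
+# test; spaces are rewritten as "_" so every character becomes a space-separated token.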
+
+import sys
+import zipfile
+
+if __name__ == "__main__":
+    zipfile.ZipFile("text8.zip").extractall()
+    data = open("text8", "r", encoding="utf-8").read()
+
+    num_test_char = int(sys.argv[1])
+
+    train_data = data[: -2 * num_test_char]
+    valid_data = data[-2 * num_test_char : -num_test_char]
+    test_data = data[-num_test_char:]
+
+    for files, data in [("train.txt", train_data), ("valid.txt", valid_data), ("test.txt", test_data)]:
+        data_str = " ".join(["_" if c == " " else c for c in data.strip()])
+        with open(files, "w") as f:
+            f.write(data_str)
+        with open(files + ".raw", "w", encoding="utf-8") as fw:
+            fw.write(data)
diff --git a/examples/language_model/xlm/README.md b/examples/language_model/xlm/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..31a707f9f3ebcfa412d9caeffb6e9f53cc08b535
--- /dev/null
+++ b/examples/language_model/xlm/README.md
@@ -0,0 +1,128 @@
+# XLM—Enhancing BERT for Cross-lingual Language Model
+
+## 目录
+* [模型简介](#模型简介)
+* [模型实现的注意点](#模型实现的注意点)
+* [快速开始](#快速开始)
+  * [通用参数释义](#通用参数释义)
+  * [自然语言推断任务](#自然语言推断任务)
+* [参考资料](#参考资料)
+
+## 模型简介
+
+[XLM—Enhancing BERT for Cross-lingual Language Model](https://arxiv.org/abs/1901.07291) 是 facebook 团队提出的一个跨语言预训练模型。
+
+在这项工作中,他们将预训练方法扩展到多种语言,并展示了跨语言预训练的有效性。论文提出了两种学习跨语言语言模型 (XLM) 的方法:一种是**仅依赖单语数据的无监督方法**,另一种是**利用具有新的跨语言语言模型目标的并行数据的监督方法**。该方法在跨语言分类、无监督和有监督机器翻译方面获得了最先进的结果。在 XNLI 上,该方法以 4.9% 的绝对精度提升刷新了最新技术水平。在无监督机器翻译上,该方法在 WMT'16 German-English 上获得 34.3 BLEU,将之前的最新技术提高了 9 BLEU 以上。在有监督机器翻译上,该方法在 WMT'16 罗马尼亚语-英语上获得了 38.5 BLEU 的最新成绩,比之前的最佳方法高出 4 BLEU 以上。
+
+XLM 论文中一共提出了三种预训练任务:**CLM**、**MLM** 和 **TLM**。
+- **CLM:Causal Language Model**,无监督单语单向 LM 训练任务,即用 `Transformer` 进行单向的语言模型训练。
+- **MLM:Masked Language Model**,无监督单语双向 LM 训练任务,与 `BERT` 一样。
+- **TLM:Translation Language Model**,有监督翻译 LM 训练,拼接平行双语语料,然后执行 MLM,以期学到翻译的对齐信息。
+
+![framework](./framework.jpg)
+
+## 模型实现的注意点
+本仓库的模型在复现过程中主要参考了 huggingface 的实现,故在实现过程中与 facebook 团队的官方实现相比存在一定的不同。
+- 对于 `token_pair` 任务,`huggingface` 的 `tokenizer` 会额外添加 `<s> A </s></s> B </s>` 的标记,而 `facebook` 的 `tokenizer` 会添加 `</s> A </s></s> B </s>` 的标记。本仓库的实现遵循了 `huggingface` 的实现,主要区别在于第一个特殊标记使用了 `<s>` 而不是 `</s>`。
+- facebook 的 XLM 模型并未使用 `token_type_id` 参数,因此实际使用 `tokenizer` 时需要人工传入 `return_token_type_ids=False`,如:`tokenizer(text, return_token_type_ids=False)`,这样就不会返回 `token_type_id` 了。
+- 考虑到现有已开源预训练权重的 XLM 模型在 `XLMPredLayer` 处并未使用 `adaptive_softmax`,因此本仓库仅实现了带有 `cross_entropy` 的 `XLMPredLayer`。
+
+本文件夹内包含了 `XLM模型` 在 `xnli任务` 上的训练和验证内容。以下是本例的简要目录结构及说明:
+
+```text
+.
+├── README.md # README文档 +├── xnli_train.py # 自然语言推断训练代码 +├── xnli_eval.py # 自然语言推断评估代码 +``` + +## 快速开始 + +### xlm tokenizer依赖安装 + +```shell +# sacremoses +pip install sacremoses +# Thai tokenizer +pip install pythainlp +# Japanese tokenizer +git clone https://github.com/neubig/kytea.git +cd kytea +autoreconf -i +./configure --prefix=$HOME/local +make && make install +pip install kytea +# Chinese tokenizer +pip install jieba +``` + +### 通用参数释义 +- `model_name_or_path` 指示了 Fine-tuning 使用的具体预训练模型以及预训练时使用的tokenizer,目前支持的预训练模型有:"xlm-mlm-tlm-xnli15-1024"。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./checkpoint/model_xx/"。 +- `output_dir` 表示模型保存路径。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断,不足该长度的将会进行 padding。 +- `learning_rate` 表示基础学习率大小,本代码并未使用学习率warmup和衰减。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔步数。 +- `save_steps` 表示模型保存及评估间隔步数。 +- `batch_size` 表示每次迭代**每张**卡上的样本数目。 +- `adam_epsilon` 表示Adam优化器的epsilon。 +- `max_steps` 表示最大训练步数。若训练`num_train_epochs`轮包含的训练步数大于该值,则达到`max_steps`后就提前结束。 +- `seed` 表示随机数种子。 +- `device` 表示训练使用的设备, `'gpu'`表示使用 GPU, `'xpu'`表示使用百度昆仑卡, `'cpu'`表示使用 CPU。 +- `use_amp` 表示是否启用自动混合精度训练。 +- `scale_loss` 表示自动混合精度训练的参数。 + +### 自然语言推断任务 + +#### 数据集介绍 +XNLI 是 MNLI 的子集,并且已被翻译成14种不同的语言(包含一些较低资源语言)。与 MNLI 一样,目标是预测文本蕴含(句子 A 是否暗示/矛盾/都不是句子 B )。 + +#### 单卡训练 + +```shell +python xnli_train.py \ + --batch_size 8 \ + --model_name_or_path xlm-mlm-tlm-xnli15-1024 \ + --save_steps 24544 \ + --output_dir outputs +``` + +#### 单卡评估 + +```shell +python xnli_eval.py \ + --batch_size 8 \ + --model_name_or_path outputs/best_model +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus 0,1 --log_dir outputs xnli_train.py \ + --batch_size 8 \ + --model_name_or_path xlm-mlm-tlm-xnli15-1024 \ + --save_steps 24544 \ + --output_dir outputs +``` + +在XNLI数据集上微调 cross-lingual-transfer 类型的自然语言推断任务后,在测试集上有如下结果 +| Model | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur | Avg | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +| XLM | 84.6 | 79.2 | 79.8 | 76.9 | 76.6 | 77.6 | 76.2 | 71.7 | 73.8 | 74.5 | 71.1 | 74.8 | 68.8 | 69.2 | 65.8 | 74.7 | + + +## 参考资料 +- https://github.com/facebookresearch/XLM +- https://github.com/huggingface/transformers/tree/main/src/transformers/models/xlm + +## 引用 + +Bibtex: +```tex +@article{lample2019cross, + title={Cross-lingual Language Model Pretraining}, + author={Lample, Guillaume and Conneau, Alexis}, + journal={Advances in Neural Information Processing Systems (NeurIPS)}, + year={2019} +} +``` diff --git a/examples/language_model/xlm/framework.jpg b/examples/language_model/xlm/framework.jpg new file mode 100644 index 0000000000000000000000000000000000000000..96d613f945621029f5451d6acc273ca84a513600 Binary files /dev/null and b/examples/language_model/xlm/framework.jpg differ diff --git a/examples/language_model/xlm/xnli_eval.py b/examples/language_model/xlm/xnli_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..54e65cfdc65c0a3d90867815a5d28684b6cad6c2 --- /dev/null +++ b/examples/language_model/xlm/xnli_eval.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial +import numpy as np + +import paddle +from paddle.io import BatchSampler, DataLoader +from paddlenlp.transformers import XLMForSequenceClassification, XLMTokenizer +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Stack, Tuple, Pad +from paddle.metric import Accuracy + +all_languages = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"] + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model." + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU/XPU for training.", + ) + parser.add_argument( + "--max_seq_length", + default=256, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + args = parser.parse_args() + return args + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, language, tokenizer): + metric.reset() + for batch in data_loader: + input_ids, attention_mask, labels = batch + # add lang_ids + lang_ids = paddle.ones_like(input_ids) * tokenizer.lang2id[language] + logits = model(input_ids, langs=lang_ids, attention_mask=attention_mask) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("[%s] acc: %s " % (language.upper(), res)) + return res + + +def convert_example(example, tokenizer, max_seq_length=256, language="en"): + """convert a example into necessary features""" + # Get the label + label = example["label"] + premise = example["premise"] + hypothesis = example["hypothesis"] + # Convert raw text to feature + example = tokenizer( + premise, + text_pair=hypothesis, + max_length=max_seq_length, + return_attention_mask=True, + return_token_type_ids=False, + lang=language, + ) + return example["input_ids"], example["attention_mask"], label + + +def get_test_dataloader(args, language, tokenizer): + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=0, dtype="int64"), # attention_mask + Stack(dtype="int64"), # labels + ): fn(samples) + # make sure language is `language`` + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=language) + test_ds = load_dataset("xnli", language, splits="test") + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = BatchSampler(test_ds, batch_size=args.batch_size * 4, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + return test_data_loader + + +def do_eval(args): + paddle.set_device(args.device) + tokenizer = 
XLMTokenizer.from_pretrained(args.model_name_or_path) + model = XLMForSequenceClassification.from_pretrained(args.model_name_or_path) + model.eval() + metric = Accuracy() + all_languages_acc = [] + for language in all_languages: + test_dataloader = get_test_dataloader(args, language, tokenizer) + acc = evaluate(model, metric, test_dataloader, language, tokenizer) + all_languages_acc.append(acc) + print("test mean acc: %.4f" % np.mean(all_languages_acc)) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_eval(args) diff --git a/examples/language_model/xlm/xnli_train.py b/examples/language_model/xlm/xnli_train.py new file mode 100644 index 0000000000000000000000000000000000000000..d19d40ec6f14dd65710053f407a78aa0145dc2c1 --- /dev/null +++ b/examples/language_model/xlm/xnli_train.py @@ -0,0 +1,279 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from paddle.metric import Accuracy +from paddle.optimizer import Adam + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import XLMForSequenceClassification, XLMTokenizer + +all_languages = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"] + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model." + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=256, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=2e-6, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--dropout", default=0.1, type=float, help="Dropout rate.") + parser.add_argument( + "--num_train_epochs", + default=5, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=200, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=24544, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU/XPU for training.", + ) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, language, tokenizer): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, attention_mask, labels = batch + # add lang_ids + lang_ids = paddle.ones_like(input_ids) * tokenizer.lang2id[language] + logits = model(input_ids, langs=lang_ids, attention_mask=attention_mask) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("[%s] acc: %s " % (language.upper(), res)) + model.train() + return res + + +def convert_example(example, tokenizer, max_seq_length=256, language="en"): + """convert a example into necessary features""" + # Get the label + label = example["label"] + premise = example["premise"] + hypothesis = example["hypothesis"] + # Convert raw text to feature + example = tokenizer( + premise, + text_pair=hypothesis, + max_length=max_seq_length, + return_attention_mask=True, + return_token_type_ids=False, + lang=language, + ) + return example["input_ids"], example["attention_mask"], label + + +def get_test_dataloader(args, language, batchify_fn, tokenizer): + # make sure language is `language`` + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=language) + test_ds = load_dataset("xnli", language, splits="test") + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = BatchSampler(test_ds, batch_size=args.batch_size * 4, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + return test_data_loader + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + tokenizer = XLMTokenizer.from_pretrained(args.model_name_or_path) + + # define train dataset language + language = "en" + train_ds = load_dataset("xnli", language, splits="train") + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=language) + + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=0, dtype="int64"), # attention_mask + Stack(dtype="int64"), # labels + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + model = XLMForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=3, dropout=args.dropout) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + optimizer = Adam( + learning_rate=args.learning_rate, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + ) + + loss_fct = nn.CrossEntropyLoss() + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + metric = Accuracy() + + global_step = 0 + tic_train = time.time() + max_test_acc = 0.0 + 
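+    # NOTE: max_test_acc tracks the best mean accuracy over all XNLI test languages seen so far;
+    # only a checkpoint that matches or improves on it is saved to "best_model" below.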
print(f"num_training_steps {num_training_steps}") + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, attention_mask, labels = batch + lang_ids = paddle.ones_like(input_ids) * tokenizer.lang2id[language] + + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model(input_ids, langs=lang_ids, attention_mask=attention_mask) + loss = loss_fct(logits, labels) + + if args.use_amp: + scaled_loss = scaler.scale(loss) + scaled_loss.backward() + scaler.minimize(optimizer, scaled_loss) + else: + loss.backward() + optimizer.step() + + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + all_languages_acc = [] + for language in all_languages: + test_data_loader = get_test_dataloader(args, language, batchify_fn, tokenizer) + acc = evaluate(model, metric, test_data_loader, language, tokenizer) + all_languages_acc.append(acc) + test_mean_acc = np.mean(all_languages_acc) + print("test mean acc: %.4f" % test_mean_acc) + + if paddle.distributed.get_rank() == 0: + if test_mean_acc >= max_test_acc: + max_test_acc = test_mean_acc + output_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("best test mean acc: %.4f" % max_test_acc) + print("Save model and tokenizer to %s" % output_dir) + + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/language_model/xlnet/README.md b/examples/language_model/xlnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bb17af1529bb4864e1c270c39e687594b13b12d9 --- /dev/null +++ b/examples/language_model/xlnet/README.md @@ -0,0 +1,67 @@ +# XLNet + +## 模型简介 + +[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) 是一款无监督的自回归预训练语言模型。 有别于传统的单向自回归模型,XLNet通过最大化输入序列所有排列的期望来进行语言建模,这使得它可以同时关注到上下文的信息。 另外,XLNet在预训练阶段集成了 [Transformer-XL](https://arxiv.org/abs/1901.02860) 模型,Transformer-XL中的片段循环机制(Segment Recurrent Mechanism)和 相对位置编码(Relative Positional Encoding)机制能够支持XLNet接受更长的输入序列,这使得XLNet在长文本序列的语言任务上有着优秀的表现。 + +本项目是XLNet在 Paddle 2.0上的开源实现,包含了在 [GLUE评测任务](https://gluebenchmark.com/tasks) 上的微调代码。 + +## 快速开始 + +### 环境依赖 + +- sentencepiece + +安装命令:`pip install sentencepiece` + +### 数据准备 + +GLUE评测任务所含数据集已在paddlenlp中以API形式提供,无需预先准备,使用`run_glue.py`执行时将会自动下载。 + +### 执行Fine-tuning + +以GLUE中的SST-2任务为例,启动Fine-tuning的方式如下: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" ./run_glue.py \ + --model_name_or_path 
xlnet-base-cased \ + --task_name SST-2 \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 100 \ + --save_steps 500 \ + --output_dir ./tmp/ +``` + +其中参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `task_name` 表示Fine-tuning的任务。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 + +基于`xlnet-base-cased`在GLUE各评测任务上Fine-tuning后,在验证集上有如下结果: + +| Task | Metric | Result | +|:-----:|:----------------------------:|:------------------:| +| SST-2 | Accuracy | 94.266 | +| QNLI | Accuracy | 91.708 | +| CoLA | Mattehew's corr | 50.264 | +| MRPC | F1/Accuracy | 91.071/87.745 | +| STS-B | Person/Spearman corr | 86.243/85.973 | +| QQP | Accuracy/F1 | 90.838/87.644 | +| MNLI | Matched acc/MisMatched acc | 87.468/86.859 | +| RTE | Accuracy | 70.036 | +| WNLI | Accuracy | 56.338 | + +## Reference + +- [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) +- [zihangdai/xlnet](https://github.com/zihangdai/xlnet) diff --git a/examples/language_model/xlnet/run_glue.py b/examples/language_model/xlnet/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..a59e3acb39d7ef04115e41fb85bbf8153b036b80 --- /dev/null +++ b/examples/language_model/xlnet/run_glue.py @@ -0,0 +1,377 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial +from math import ceil + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import LinearDecayWithWarmup +from paddlenlp.transformers.xlnet.modeling import ( + XLNetForSequenceClassification, + XLNetPretrainedModel, +) +from paddlenlp.transformers.xlnet.tokenizer import XLNetTokenizer +from paddlenlp.utils import profiler + +final_res = "Not evaluated yet!" 
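+# NOTE: final_res is a module-level summary string; evaluate() overwrites it with the latest
+# metrics so that do_train() can print the final result once training finishes.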
+ +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, + "wnli": Accuracy, +} + + +def parse_args(): + # yapf: disable + parser = argparse.ArgumentParser() + parser.add_argument("--task_name", default=None, type=str, required=True, help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()),) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(XLNetPretrainedModel.pretrained_init_configuration.keys()),) + parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the model predictions and checkpoints will be written.",) + parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.",) + parser.add_argument("--pad_to_max_seq_len", default=False, type=bool, help="Whether to pad all sequences to max length for sequences shorter than max length.",) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per device for training.",) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.",) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.",) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.",) + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.",) + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.",) + parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.",) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.",) + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.",) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization",) + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu", "xpu", "npu"], help="Select cpu, gpu, xpu, npu devices.",) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup_steps. 
If > 0: Override warmup_proportion",) + parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps.",) + parser.add_argument('-p', '--profiler_options', type=str, default=None, help='The option of profiler, which should be in format \"key1=value1;key2=value2;key3=value3\".',) + # yapf: enable + + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + losses = [] + global final_res + for batch in data_loader: + input_ids, token_type_ids, attention_mask, labels = batch + logits = model(input_ids, token_type_ids, attention_mask) + loss = loss_fct(logits, labels) + losses.append(loss.detach().numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s" + % (np.average(losses), res[0], res[1], res[2], res[3], res[4]) + ) + + final_res = "final: acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s" % ( + res[0], + res[1], + res[2], + res[3], + res[4], + ) + elif isinstance(metric, Mcc): + print("eval loss: %f, mcc: %s" % (np.average(losses), res[0])) + final_res = "final: mcc: %s" % (res[0]) + elif isinstance(metric, PearsonAndSpearman): + print( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s" + % (np.average(losses), res[0], res[1], res[2]) + ) + final_res = "final: pearson: %s, spearman: %s, pearson and spearman: %s" % (res[0], res[1], res[2]) + else: + print("eval loss: %f, acc: %s" % (np.average(losses), res)) + final_res = "final: acc: %s" % res + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, pad_to_max_seq_len=False, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer( + example["sentence"], + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len, + return_attention_mask=True, + ) + else: + example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len, + return_attention_mask=True, + ) + + if not is_test: + return example["input_ids"], example["token_type_ids"], example["attention_mask"], label + else: + return example["input_ids"], example["token_type_ids"], example["attention_mask"] + + +def create_data_loader(args, tokenizer): + train_ds = load_dataset("glue", args.task_name, splits="train") + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=train_ds.label_list, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, pad_right=False), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, pad_right=False), # token_type + Pad(axis=0, pad_val=0, pad_right=False), # 
attention_mask + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + return ( + train_data_loader, + dev_data_loader_matched, + dev_data_loader_mismatched, + train_ds, + dev_ds_matched, + dev_ds_mismatched, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + return train_data_loader, dev_data_loader, train_ds, dev_ds + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + global final_res + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + tokenizer_class = XLNetTokenizer + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + if args.task_name == "mnli": + ( + train_data_loader, + dev_data_loader_matched, + dev_data_loader_mismatched, + train_ds, + dev_ds_matched, + dev_ds_mismatched, + ) = create_data_loader(args, tokenizer) + else: + train_data_loader, dev_data_loader, train_ds, dev_ds = create_data_loader(args, tokenizer) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = XLNetForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.max_grad_norm) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
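+    # NOTE: the filter below matches by substring on the parameter name, so any parameter whose
+    # name contains "bias" or "layer_norm" is excluded from weight decay.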
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "layer_norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + grad_clip=clip, + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + global_step = 0 + model.train() + + train_reader_cost = 0.0 + train_run_cost = 0.0 + reader_start = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + train_reader_cost += time.time() - reader_start + train_start = time.time() + + global_step += 1 + input_ids, token_type_ids, attention_mask, labels = batch + logits = model(input_ids, token_type_ids, attention_mask) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + train_run_cost += time.time() - train_start + # Profile for model benchmark + profiler.add_profiler_step(args.profiler_options) + + if global_step % args.logging_steps == 0: + speed = args.logging_steps / (train_reader_cost + train_run_cost) + avg_reader_cost = train_reader_cost / args.logging_steps + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s, avg_reader_cost: %.4f sec, avg_batch_cost: %.4f sec, avg_samples: %d, avg_ips: %.4f sequences/sec" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + speed, + avg_reader_cost, + 1.0 / speed, + args.batch_size, + speed * args.batch_size, + ) + ) + train_reader_cost = 0.0 + train_run_cost = 0.0 + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + print("matched ", end="") + evaluate(model, loss_fct, metric, dev_data_loader_matched) + final_res1 = "matched " + final_res + print("mismatched ", end="") + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + final_res2 = "mismatched " + final_res + final_res = final_res1 + "\r\n" + final_res2 + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if (not paddle.distributed.get_world_size() > 1) or paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "%s_ft_model_%d" % (args.task_name, global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step == num_training_steps: + print(final_res) + exit(0) + + reader_start = time.time() + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/lexical_analysis/README.md b/examples/lexical_analysis/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..421a04d6aa0a613b061b6d105ac889effcda7d08 --- /dev/null +++ b/examples/lexical_analysis/README.md @@ -0,0 +1,185 @@ +# 词法分析 + +## 1. 简介 + +词法分析任务的输入是一个字符串(我们后面使用『句子』来指代它),而输出是句子中的词边界和词性、实体类别。序列标注是词法分析的经典建模方式,我们使用基于 GRU 的网络结构学习特征,将学习到的特征接入 CRF 解码层完成序列标注。模型结构如下所示:
+ +![GRU-CRF-MODEL](https://bj.bcebos.com/paddlenlp/imgs/gru-crf-model.png) + +1. 输入采用 one-hot 方式表示,每个字以一个 id 表示 +2. one-hot 序列通过字表,转换为实向量表示的字向量序列; +3. 字向量序列作为双向 GRU 的输入,学习输入序列的特征表示,得到新的特性表示序列,我们堆叠了两层双向 GRU 以增加学习能力; +4. CRF 以 GRU 学习到的特征为输入,以标记序列为监督信号,实现序列标注。 + + +## 快速开始 + +### 数据准备 + +我们提供了少数样本用以示例输入数据格式。执行以下命令,下载并解压示例数据集: + +```bash +python download.py --data_dir ./ +``` + +训练使用的数据可以由用户根据实际的应用场景,自己组织数据。除了第一行是 `text_a\tlabel` 固定的开头,后面的每行数据都是由两列组成,以制表符分隔,第一列是 utf-8 编码的中文文本,以 `\002` 分割,第二列是对应每个字的标注,以 `\002` 分隔。我们采用 IOB2 标注体系,即以 X-B 作为类型为 X 的词的开始,以 X-I 作为类型为 X 的词的持续,以 O 表示不关注的字(实际上,在词性、专名联合标注中,不存在 O )。示例如下: + +```text +除\002了\002他\002续\002任\002十\002二\002届\002政\002协\002委\002员\002,\002马\002化\002腾\002,\002雷\002军\002,\002李\002彦\002宏\002也\002被\002推\002选\002为\002新\002一\002届\002全\002国\002人\002大\002代\002表\002或\002全\002国\002政\002协\002委\002员 p-B\002p-I\002r-B\002v-B\002v-I\002m-B\002m-I\002m-I\002ORG-B\002ORG-I\002n-B\002n-I\002w-B\002PER-B\002PER-I\002PER-I\002w-B\002PER-B\002PER-I\002w-B\002PER-B\002PER-I\002PER-I\002d-B\002p-B\002v-B\002v-I\002v-B\002a-B\002m-B\002m-I\002ORG-B\002ORG-I\002ORG-I\002ORG-I\002n-B\002n-I\002c-B\002n-B\002n-I\002ORG-B\002ORG-I\002n-B\002n-I +``` + +其中词性和专名类别标签集合如下表,包含词性标签 24 个(小写字母),专名类别标签 4 个(大写字母)。这里需要说明的是,人名、地名、机构名和时间四个类别,存在(PER / LOC / ORG / TIME 和 nr / ns / nt / t)两套标签,被标注为第二套标签的词,是模型判断为低置信度的人名、地名、机构名和时间词。开发者可以基于这两套标签,在四个类别的准确、召回之间做出自己的权衡。 + +| 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | +| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- | +| n | 普通名词 | f | 方位名词 | s | 处所名词 | t | 时间 | +| nr | 人名 | ns | 地名 | nt | 机构名 | nw | 作品名 | +| nz | 其他专名 | v | 普通动词 | vd | 动副词 | vn | 名动词 | +| a | 形容词 | ad | 副形词 | an | 名形词 | d | 副词 | +| m | 数量词 | q | 量词 | r | 代词 | p | 介词 | +| c | 连词 | u | 助词 | xc | 其他虚词 | w | 标点符号 | +| PER | 人名 | LOC | 地名 | ORG | 机构名 | TIME | 时间 | + +### 模型训练 + +#### 单卡训练 + +启动方式如下: + +```bash +python train.py \ + --data_dir ./lexical_analysis_dataset_tiny \ + --model_save_dir ./save_dir \ + --epochs 10 \ + --batch_size 32 \ + --device gpu \ + # --init_checkpoint ./save_dir/final +``` + +其中参数释义如下: +- `data_dir`: 数据集所在文件夹路径. 
+- `model_save_dir`: 训练期间模型保存路径。 +- `epochs`: 模型训练迭代轮数。 +- `batch_size`: 表示每次迭代**每张卡**上的样本数目。 +- `device`: 训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `init_checkpoint`: 模型加载路径,通过设置init_checkpoint可以启动增量训练。 + +#### 多卡训练 + +启动方式如下: + +```bash +python -m paddle.distributed.launch --gpus "0,1" train.py \ + --data_dir ./lexical_analysis_dataset_tiny \ + --model_save_dir ./save_dir \ + --epochs 10 \ + --batch_size 32 \ + --device gpu \ + # --init_checkpoint ./save_dir/final +``` + +### 模型评估 + +通过加载训练保存的模型,可以对测试集数据进行验证,启动方式如下: + +```bash +python eval.py --data_dir ./lexical_analysis_dataset_tiny \ + --init_checkpoint ./save_dir/model_100.pdparams \ + --batch_size 32 \ + --device gpu +``` + +其中`./save_dir/model_100.pdparams`是训练过程中保存的参数文件,请更换为实际得到的训练保存路径。 + +### 模型导出 + +使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在`output_path`指定路径中。 + +运行方式: + +```shell +python export_model.py --data_dir=./lexical_analysis_dataset_tiny --params_path=./save_dir/model_100.pdparams --output_path=./infer_model/static_graph_params +``` + +其中`./save_dir/model_100.pdparams`是训练过程中保存的参数文件,请更换为实际得到的训练保存路径。 + +* `params_path`是指动态图训练保存的参数路径 +* `output_path`是指静态图参数导出路径。 + +导出模型之后,可以用于部署,deploy/predict.py文件提供了python部署预测示例。运行方式: + +```shell +python deploy/predict.py --model_file=infer_model/static_graph_params.pdmodel --params_file=infer_model/static_graph_params.pdiparams --data_dir lexical_analysis_dataset_tiny +``` + +### 模型预测 + +对无标签数据可以启动模型预测: + +```bash +python predict.py --data_dir ./lexical_analysis_dataset_tiny \ + --init_checkpoint ./save_dir/model_100.pdparams \ + --batch_size 32 \ + --device gpu +``` + +得到类似以下输出: + +```txt +(大学, n)(学籍, n)(证明, n)(怎么, r)(开, v) +(电车, n)(的, u)(英文, nz) +(什么, r)(是, v)(司法, n)(鉴定人, vn) +``` + +### Taskflow一键预测 +可以使用PaddleNLP提供的Taskflow工具来对输入的文本进行一键分词,具体使用方法如下: + +```python +from paddlenlp import Taskflow + +lac = Taskflow("lexical_analysis") +lac("LAC是个优秀的分词工具") +''' +[{'text': 'LAC是个优秀的分词工具', 'segs': ['LAC', '是', '个', '优秀', '的', '分词', '工具'], 'tags': ['nz', 'v', 'q', 'a', 'u', 'n', 'n']}] +''' + +lac(["LAC是个优秀的分词工具", "三亚是一个美丽的城市"]) +''' +[{'text': 'LAC是个优秀的分词工具', 'segs': ['LAC', '是', '个', '优秀', '的', '分词', '工具'], 'tags': ['nz', 'v', 'q', 'a', 'u', 'n', 'n']}, + {'text': '三亚是一个美丽的城市', 'segs': ['三亚', '是', '一个', '美丽', '的', '城市'], 'tags': ['LOC', 'v', 'm', 'a', 'u', 'n']} +] +''' +``` + +任务的默认路径为`$HOME/.paddlenlp/taskflow/lexical_analysis/lac/`,默认路径下包含了执行该任务需要的所有文件。 + +如果希望得到定制化的分词及标注结果,用户也可以通过Taskflow来加载自定义的词法分析模型并进行预测。 + +通过`task_path`指定用户自定义路径,自定义路径下的文件需要和默认路径的文件一致。 + +自定义路径包含如下文件(用户自己的模型权重、标签字典): +```text +custom_task_path/ +├── model.pdparams +├── word.dic +├── tag.dic +└── q2b.dic +``` + +使用Taskflow加载自定义模型进行一键预测: + +```python +from paddlenlp import Taskflow + +my_lac = Taskflow("lexical_analysis", model_path="./custom_task_path/") +``` + +更多使用方法请参考[Taskflow文档](../../docs/model_zoo/taskflow.md)。 + +## 预训练模型 + +如果您希望使用已经预训练好了的LAC模型完成词法分析任务,请参考: + +[Lexical Analysis of Chinese](https://github.com/baidu/lac) + +[PaddleHub分词模型](https://www.paddlepaddle.org.cn/hubdetail?name=lac&en_category=LexicalAnalysis) diff --git a/examples/lexical_analysis/data.py b/examples/lexical_analysis/data.py new file mode 100644 index 0000000000000000000000000000000000000000..d1d1a59aa9bd687deacd9ab7dfcdf441b3caea97 --- /dev/null +++ b/examples/lexical_analysis/data.py @@ -0,0 +1,142 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +The file_reader converts raw corpus to input. +""" + +from paddlenlp.datasets import MapDataset + +# We use "\002" to separate sentence characters and sequence labels, +# for example: 除\002了\002他\002续\002任\002十\002二\002届\002政\002协\002委\002员 +# p-B\002p-I\002r-B\002v-B\002v-I\002m-B\002m-I\002m-I\002ORG-B\002ORG-I\002n-B\002n-I\002 +CHAR_DELIMITER = "\002" + + +def load_dataset(datafiles): + def read(data_path): + with open(data_path, "r", encoding="utf-8") as fp: + if "infer" in data_path: + next(fp) + for line in fp: + line = line.strip() + if "infer" in data_path: + words = list(line) + yield [words] + else: + words, labels = line.split("\t") + words = words.split(CHAR_DELIMITER) + labels = labels.split(CHAR_DELIMITER) + assert len(words) == len(labels), "The word %s is not match with the label %s" % (words, labels) + yield [words, labels] + + if isinstance(datafiles, str): + return MapDataset(list(read(datafiles))) + elif isinstance(datafiles, list) or isinstance(datafiles, tuple): + return [MapDataset(list(read(datafile))) for datafile in datafiles] + + +def load_vocab(dict_path): + """ + Load vocab from file + """ + vocab = {} + reverse = None + with open(dict_path, "r", encoding="utf8") as fin: + for i, line in enumerate(fin): + terms = line.strip("\n").split("\t") + if len(terms) == 2: + if reverse is None: + reverse = True if terms[0].isdigit() else False + if reverse: + value, key = terms + else: + key, value = terms + elif len(terms) == 1: + key, value = terms[0], i + else: + raise ValueError("Error line: %s in file: %s" % (line, dict_path)) + vocab[key] = value + return vocab + + +def normalize_token(token, normlize_vocab): + """Normalize text from DBC case to SBC case""" + if normlize_vocab: + token = normlize_vocab.get(token, token) + return token + + +def convert_tokens_to_ids(tokens, vocab, oov_replace_token=None, normlize_vocab=None): + """convert tokens to token indexs""" + token_ids = [] + oov_replace_token = vocab.get(oov_replace_token) if oov_replace_token else None + for token in tokens: + token = normalize_token(token, normlize_vocab) + token_id = vocab.get(token, oov_replace_token) + token_ids.append(token_id) + + return token_ids + + +def convert_example(example, max_seq_len, word_vocab, label_vocab=None, normlize_vocab=None): + if len(example) == 2: + tokens, labels = example + else: + tokens, labels = example[0], None + tokens = tokens[:max_seq_len] + + token_ids = convert_tokens_to_ids(tokens, word_vocab, oov_replace_token="OOV", normlize_vocab=normlize_vocab) + length = len(token_ids) + if labels is not None: + labels = labels[:max_seq_len] + label_ids = convert_tokens_to_ids(labels, label_vocab, oov_replace_token="O") + return token_ids, length, label_ids + else: + return token_ids, length + + +def parse_result(words, preds, lengths, word_vocab, label_vocab): + """parse padding result""" + batch_out = [] + id2word_dict = dict(zip(word_vocab.values(), word_vocab.keys())) + id2label_dict = dict(zip(label_vocab.values(), label_vocab.keys())) + for sent_index in range(len(lengths)): + sent = [id2word_dict[index] for index in words[sent_index][: 
lengths[sent_index]]] + tags = [id2label_dict[index] for index in preds[sent_index][: lengths[sent_index]]] + + sent_out = [] + tags_out = [] + parital_word = "" + for ind, tag in enumerate(tags): + # for the first word + if parital_word == "": + parital_word = sent[ind] + tags_out.append(tag.split("-")[0]) + continue + + # for the beginning of word + if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"): + sent_out.append(parital_word) + tags_out.append(tag.split("-")[0]) + parital_word = sent[ind] + continue + + parital_word += sent[ind] + + # append the last word, except for len(tags)=0 + if len(sent_out) < len(tags_out): + sent_out.append(parital_word) + + batch_out.append([sent_out, tags_out]) + return batch_out diff --git a/examples/lexical_analysis/deploy/predict.py b/examples/lexical_analysis/deploy/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..9d50e0b49033bad95a2f25f27eb68a6cebb07ccf --- /dev/null +++ b/examples/lexical_analysis/deploy/predict.py @@ -0,0 +1,206 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time + +import paddle + +from paddlenlp.data import Pad, Stack, Tuple + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--model_file", type=str, required=True, default='./static_graph_params.pdmodel', help="The path to model info in static graph.") +parser.add_argument("--params_file", type=str, required=True, default='./static_graph_params.pdiparams', help="The path to parameters in static graph.") +parser.add_argument("--data_dir", type=str, default=None, help="The folder where the dataset is located.") +parser.add_argument("--init_checkpoint", type=str, default=None, help="Path to init model.") +parser.add_argument("--batch_size", type=int, default=2, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--max_seq_len", type=int, default=64, help="Number of words of the longest seqence.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--epochs", default=1, type=int, help="The number of epochs when running benchmark.") + +args = parser.parse_args() +# yapf: enable + + +def normalize_token(token, normlize_vocab): + """Normalize text from DBC case to SBC case""" + if normlize_vocab: + token = normlize_vocab.get(token, token) + return token + + +def convert_tokens_to_ids(tokens, vocab, oov_replace_token=None, normlize_vocab=None): + """Convert tokens to token indexs""" + token_ids = [] + oov_replace_token = vocab.get(oov_replace_token) if oov_replace_token else None + for token in tokens: + token = normalize_token(token, normlize_vocab) + token_id = vocab.get(token, oov_replace_token) + token_ids.append(token_id) + + return token_ids + + +def convert_example(tokens, max_seq_len, word_vocab, normlize_vocab=None): + """Convert tokens of sequences to 
token ids""" + tokens = tokens[:max_seq_len] + + token_ids = convert_tokens_to_ids(tokens, word_vocab, oov_replace_token="OOV", normlize_vocab=normlize_vocab) + length = len(token_ids) + return token_ids, length + + +def load_vocab(dict_path): + """Load vocab from file""" + vocab = {} + reverse = None + with open(dict_path, "r", encoding="utf8") as fin: + for i, line in enumerate(fin): + terms = line.strip("\n").split("\t") + if len(terms) == 2: + if reverse is None: + reverse = True if terms[0].isdigit() else False + if reverse: + value, key = terms + else: + key, value = terms + elif len(terms) == 1: + key, value = terms[0], i + else: + raise ValueError("Error line: %s in file: %s" % (line, dict_path)) + vocab[key] = value + return vocab + + +def parse_result(words, preds, lengths, word_vocab, label_vocab): + """Parse padding result""" + batch_out = [] + id2word_dict = dict(zip(word_vocab.values(), word_vocab.keys())) + id2label_dict = dict(zip(label_vocab.values(), label_vocab.keys())) + for sent_index in range(len(lengths)): + sent = [id2word_dict[index] for index in words[sent_index][: lengths[sent_index]]] + tags = [id2label_dict[index] for index in preds[sent_index][: lengths[sent_index]]] + + sent_out = [] + tags_out = [] + parital_word = "" + for ind, tag in enumerate(tags): + # for the first word + if parital_word == "": + parital_word = sent[ind] + tags_out.append(tag.split("-")[0]) + continue + + # for the beginning of word + if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"): + sent_out.append(parital_word) + tags_out.append(tag.split("-")[0]) + parital_word = sent[ind] + continue + + parital_word += sent[ind] + + # append the last word, except for len(tags)=0 + if len(sent_out) < len(tags_out): + sent_out.append(parital_word) + + batch_out.append([sent_out, tags_out]) + return batch_out + + +class Predictor(object): + def __init__(self, model_file, params_file, device, max_seq_length): + self.max_seq_length = max_seq_length + + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, word_vocab, label_vocab, normlize_vocab, batch_size=1): + """ + Predicts the data labels. + + Args: + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + word_vocab(obj:`dict`): The word id (key) to word str (value) map. + label_vocab(obj:`dict`): The label id (key) to label str (value) map. + normlize_vocab(obj:`dict`): The fullwidth char (key) to halfwidth char (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + examples = [] + + for text in data: + tokens = list(text.strip()) + token_ids, length = convert_example( + tokens, self.max_seq_length, word_vocab=word_vocab, normlize_vocab=normlize_vocab + ) + examples.append((token_ids, length)) + + def batchify_fn(samples): + fn = Tuple(Pad(axis=0, pad_val=0, dtype="int64"), Stack(axis=0, dtype="int64")) + + return fn(samples) + + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + results = [] + + for batch in batches: + token_ids, length = batchify_fn(batch) + self.input_handles[0].copy_from_cpu(token_ids) + self.input_handles[1].copy_from_cpu(length) + self.predictor.run() + preds = self.output_handle.copy_to_cpu() + result = parse_result(token_ids, preds, length, word_vocab, label_vocab) + results.extend(result) + return results + + +if __name__ == "__main__": + word_vocab = load_vocab(os.path.join(args.data_dir, "word.dic")) + label_vocab = load_vocab(os.path.join(args.data_dir, "tag.dic")) + normlize_vocab = load_vocab(os.path.join(args.data_dir, "q2b.dic")) + infer_ds = [] + with open(os.path.join(args.data_dir, "infer.tsv"), "r", encoding="utf-8") as fp: + for line in fp.readlines(): + infer_ds += [line.strip()] + predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_len) + start = time.time() + for _ in range(args.epochs): + results = predictor.predict(infer_ds, word_vocab, label_vocab, normlize_vocab, batch_size=args.batch_size) + end = time.time() + for idx, result in enumerate(results): + print("Text: {}".format(infer_ds[idx])) + sent_tags = [] + sent, tags = result + sent_tag = ["(%s, %s)" % (ch, tag) for ch, tag in zip(sent, tags)] + print("Result: {}\n".format(sent_tag)) + print("Total predict time: {:.4f} s".format(end - start)) diff --git a/examples/lexical_analysis/download.py b/examples/lexical_analysis/download.py new file mode 100644 index 0000000000000000000000000000000000000000..f1f0236b3cf0977dbde881aa97bd3a6a9d8cff95 --- /dev/null +++ b/examples/lexical_analysis/download.py @@ -0,0 +1,32 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import sys + +from paddle.utils.download import get_path_from_url + +URL = "https://bj.bcebos.com/paddlenlp/datasets/lexical_analysis_dataset_tiny.tar.gz" + + +def main(arguments): + parser = argparse.ArgumentParser() + parser.add_argument("-d", "--data_dir", help="directory to save data to", type=str, default="data") + args = parser.parse_args(arguments) + get_path_from_url(URL, args.data_dir) + + +if __name__ == "__main__": + sys.exit(main(sys.argv[1:])) diff --git a/examples/lexical_analysis/eval.py b/examples/lexical_analysis/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..0bd51480b5ad707fbe2d2662d4a85135045fa692 --- /dev/null +++ b/examples/lexical_analysis/eval.py @@ -0,0 +1,92 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +from data import convert_example, load_dataset, load_vocab +from model import BiGruCrf + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator + +# fmt: off +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--data_dir", type=str, default=None, help="The folder where the dataset is located.") +parser.add_argument("--init_checkpoint", type=str, default=None, help="Path to init model.") +parser.add_argument("--batch_size", type=int, default=300, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--max_seq_len", type=int, default=64, help="Number of words of the longest seqence.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--emb_dim", type=int, default=128, help="The dimension in which a word is embedded.") +parser.add_argument("--hidden_size", type=int, default=128, help="The number of hidden nodes in the GRU layer.") +args = parser.parse_args() +# fmt: on + + +def evaluate(args): + paddle.set_device(args.device) + + # create dataset. 
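+    # test.tsv shares the training data format: "text\tlabels", where characters and
+    # per-character tags are separated by "\002" (see data.py and the README).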
+ test_ds = load_dataset(datafiles=(os.path.join(args.data_dir, "test.tsv"))) + word_vocab = load_vocab(os.path.join(args.data_dir, "word.dic")) + label_vocab = load_vocab(os.path.join(args.data_dir, "tag.dic")) + # q2b.dic is used to replace DBC case to SBC case + normlize_vocab = load_vocab(os.path.join(args.data_dir, "q2b.dic")) + + trans_func = partial( + convert_example, + max_seq_len=args.max_seq_len, + word_vocab=word_vocab, + label_vocab=label_vocab, + normlize_vocab=normlize_vocab, + ) + test_ds.map(trans_func) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=0, dtype="int64"), # word_ids + Stack(dtype="int64"), # length + Pad(axis=0, pad_val=0, dtype="int64"), # label_ids + ): fn(samples) + + # Create sampler for dataloader + test_sampler = paddle.io.BatchSampler(dataset=test_ds, batch_size=args.batch_size, shuffle=False, drop_last=False) + test_loader = paddle.io.DataLoader( + dataset=test_ds, batch_sampler=test_sampler, return_list=True, collate_fn=batchify_fn + ) + + # Define the model network and metric evaluator + model = BiGruCrf(args.emb_dim, args.hidden_size, len(word_vocab), len(label_vocab)) + chunk_evaluator = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True) + + # Load the model and start predicting + model_dict = paddle.load(args.init_checkpoint) + model.load_dict(model_dict) + + model.eval() + chunk_evaluator.reset() + for batch in test_loader: + token_ids, length, labels = batch + preds = model(token_ids, length) + num_infer_chunks, num_label_chunks, num_correct_chunks = chunk_evaluator.compute(length, preds, labels) + chunk_evaluator.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + precision, recall, f1_score = chunk_evaluator.accumulate() + print("eval precision: %f, recall: %f, f1: %f" % (precision, recall, f1_score)) + + +if __name__ == "__main__": + args = parser.parse_args() + evaluate(args) diff --git a/examples/lexical_analysis/export_model.py b/examples/lexical_analysis/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..d2939fcad6c8974790437af1716e91a8fbe4a929 --- /dev/null +++ b/examples/lexical_analysis/export_model.py @@ -0,0 +1,57 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
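+# Export a trained dygraph BiGruCrf checkpoint to a static graph model via paddle.jit.to_static
+# so it can be deployed with the inference code in deploy/predict.py.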
+ +import argparse +import os + +import paddle +from data import load_vocab +from model import BiGruCrf +from paddle.static import InputSpec + +# fmt: off +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--data_dir", type=str, default=None, help="The folder where the dataset is located.") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +parser.add_argument("--output_path", type=str, default='./infer_model/static_graph_params', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--emb_dim", type=int, default=128, help="The dimension in which a word is embedded.") +parser.add_argument("--hidden_size", type=int, default=128, help="The number of hidden nodes in the GRU layer.") +args = parser.parse_args() +# fmt: on + + +def main(): + word_vocab = load_vocab(os.path.join(args.data_dir, "word.dic")) + label_vocab = load_vocab(os.path.join(args.data_dir, "tag.dic")) + + model = BiGruCrf(args.emb_dim, args.hidden_size, len(word_vocab), len(label_vocab)) + + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + InputSpec(shape=[None, None], dtype="int64", name="token_ids"), + InputSpec(shape=[None], dtype="int64", name="length"), + ], + ) + # Save in static graph model. + paddle.jit.save(model, args.output_path) + + +if __name__ == "__main__": + main() diff --git a/examples/lexical_analysis/model.py b/examples/lexical_analysis/model.py new file mode 100644 index 0000000000000000000000000000000000000000..20a4a5ebe4b1412b87bb8bec1d8660f001b22203 --- /dev/null +++ b/examples/lexical_analysis/model.py @@ -0,0 +1,96 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + +from paddlenlp.layers.crf import LinearChainCrf, LinearChainCrfLoss + +if hasattr(paddle, "text") and hasattr(paddle.text, "ViterbiDecoder"): + from paddle.text import ViterbiDecoder +else: + from paddlenlp.layers.crf import ViterbiDecoder + + +class BiGruCrf(nn.Layer): + """The network for lexical analysis, based on two layers of BiGRU and one layer of CRF. More details see https://arxiv.org/abs/1807.01882 + + Args: + word_emb_dim (int): The dimension in which a word is embedded. + hidden_size (int): The number of hidden nodes in the GRU layer. + vocab_size (int): the word vocab size. + num_labels (int): the labels amount. + emb_lr (float, optional): The scaling of the learning rate of the embedding layer. Defaults to 2.0. + crf_lr (float, optional): The scaling of the learning rate of the crf layer. Defaults to 0.2. 
+ """ + + def __init__( + self, word_emb_dim, hidden_size, vocab_size, num_labels, emb_lr=2.0, crf_lr=0.2, with_start_stop_tag=True + ): + super(BiGruCrf, self).__init__() + self.word_emb_dim = word_emb_dim + self.vocab_size = vocab_size + self.num_labels = num_labels + self.hidden_size = hidden_size + self.emb_lr = emb_lr + self.crf_lr = crf_lr + self.init_bound = 0.1 + + self.word_embedding = nn.Embedding( + num_embeddings=self.vocab_size, + embedding_dim=self.word_emb_dim, + weight_attr=paddle.ParamAttr( + learning_rate=self.emb_lr, + initializer=nn.initializer.Uniform(low=-self.init_bound, high=self.init_bound), + ), + ) + + self.gru = nn.GRU( + input_size=self.word_emb_dim, + hidden_size=self.hidden_size, + num_layers=2, + direction="bidirectional", + weight_ih_attr=paddle.ParamAttr( + initializer=nn.initializer.Uniform(low=-self.init_bound, high=self.init_bound), + regularizer=paddle.regularizer.L2Decay(coeff=1e-4), + ), + weight_hh_attr=paddle.ParamAttr( + initializer=nn.initializer.Uniform(low=-self.init_bound, high=self.init_bound), + regularizer=paddle.regularizer.L2Decay(coeff=1e-4), + ), + ) + + self.fc = nn.Linear( + in_features=self.hidden_size * 2, + out_features=self.num_labels + 2 if with_start_stop_tag else self.num_labels, + weight_attr=paddle.ParamAttr( + initializer=nn.initializer.Uniform(low=-self.init_bound, high=self.init_bound), + regularizer=paddle.regularizer.L2Decay(coeff=1e-4), + ), + ) + + self.crf = LinearChainCrf(self.num_labels, self.crf_lr, with_start_stop_tag) + self.crf_loss = LinearChainCrfLoss(self.crf) + self.viterbi_decoder = ViterbiDecoder(self.crf.transitions, with_start_stop_tag) + + def forward(self, inputs, lengths, labels=None): + word_embed = self.word_embedding(inputs) + bigru_output, _ = self.gru(word_embed, sequence_length=lengths) + emission = self.fc(bigru_output) + if labels is not None: + loss = self.crf_loss(emission, lengths, labels) + return loss + else: + _, prediction = self.viterbi_decoder(emission, lengths) + return prediction diff --git a/examples/lexical_analysis/predict.py b/examples/lexical_analysis/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..0b7a7b80cdd85514920503f4d9f0fc1453683a57 --- /dev/null +++ b/examples/lexical_analysis/predict.py @@ -0,0 +1,101 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +from functools import partial + +import paddle +from data import convert_example, load_dataset, load_vocab, parse_result +from model import BiGruCrf + +from paddlenlp.data import Pad, Stack, Tuple + +# fmt: off +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--data_dir", type=str, default=None, help="The folder where the dataset is located.") +parser.add_argument("--init_checkpoint", type=str, default=None, help="Path to init model.") +parser.add_argument("--batch_size", type=int, default=300, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--max_seq_len", type=int, default=64, help="Number of words of the longest seqence.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--emb_dim", type=int, default=128, help="The dimension in which a word is embedded.") +parser.add_argument("--hidden_size", type=int, default=128, help="The number of hidden nodes in the GRU layer.") +args = parser.parse_args() +# fmt: on + + +def infer(args): + paddle.set_device(args.device) + + # create dataset. + infer_ds = load_dataset(datafiles=(os.path.join(args.data_dir, "infer.tsv"))) + word_vocab = load_vocab(os.path.join(args.data_dir, "word.dic")) + label_vocab = load_vocab(os.path.join(args.data_dir, "tag.dic")) + # q2b.dic is used to replace DBC case to SBC case + normlize_vocab = load_vocab(os.path.join(args.data_dir, "q2b.dic")) + + trans_func = partial( + convert_example, + max_seq_len=args.max_seq_len, + word_vocab=word_vocab, + label_vocab=label_vocab, + normlize_vocab=normlize_vocab, + ) + infer_ds.map(trans_func) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=0, dtype="int64"), # word_ids + Stack(dtype="int64"), # length + ): fn(samples) + + # Create sampler for dataloader + infer_sampler = paddle.io.BatchSampler( + dataset=infer_ds, batch_size=args.batch_size, shuffle=False, drop_last=False + ) + infer_loader = paddle.io.DataLoader( + dataset=infer_ds, batch_sampler=infer_sampler, return_list=True, collate_fn=batchify_fn + ) + + # Define the model network + model = BiGruCrf(args.emb_dim, args.hidden_size, len(word_vocab), len(label_vocab)) + + # Load the model and start predicting + model_dict = paddle.load(args.init_checkpoint) + model.load_dict(model_dict) + + model.eval() + results = [] + for batch in infer_loader: + token_ids, length = batch + preds = model(token_ids, length) + result = parse_result(token_ids.numpy(), preds.numpy(), length.numpy(), word_vocab, label_vocab) + results += result + + sent_tags = [] + for sent, tags in results: + sent_tag = ["(%s, %s)" % (ch, tag) for ch, tag in zip(sent, tags)] + sent_tags.append("".join(sent_tag)) + + file_path = "results.txt" + with open(file_path, "w", encoding="utf8") as fout: + fout.write("\n".join(sent_tags)) + + # Print some examples + print("The results have been saved in the file: %s, some examples are shown below: " % file_path) + print("\n".join(sent_tags[:10])) + + +if __name__ == "__main__": + infer(args) diff --git a/examples/lexical_analysis/train.py b/examples/lexical_analysis/train.py new file mode 100644 index 0000000000000000000000000000000000000000..480721f5eaf7e4f56edb42b4d495b573b556df03 --- /dev/null +++ b/examples/lexical_analysis/train.py @@ -0,0 +1,178 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import paddle +from data import convert_example, load_dataset, load_vocab +from model import BiGruCrf + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--data_dir", type=str, default=None, help="The folder where the dataset is located.") +parser.add_argument("--init_checkpoint", type=str, default=None, help="Path to init model.") +parser.add_argument("--model_save_dir", type=str, default=None, help="The model will be saved in this path.") +parser.add_argument("--epochs", type=int, default=10, help="Corpus iteration num.") +parser.add_argument("--batch_size", type=int, default=300, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--max_seq_len", type=int, default=64, help="Number of words of the longest seqence.") +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu.") +parser.add_argument("--base_lr", type=float, default=0.001, help="The basic learning rate that affects the entire network.") +parser.add_argument("--crf_lr", type=float, default=0.2, help="The learning rate ratio that affects CRF layers.") +parser.add_argument("--emb_dim", type=int, default=128, help="The dimension in which a word is embedded.") +parser.add_argument("--hidden_size", type=int, default=128, help="The number of hidden nodes in the GRU layer.") +parser.add_argument("--logging_steps", type=int, default=10, help="Log every X updates steps.") +parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") +parser.add_argument("--do_eval", type=strtobool, default=True, help="To evaluate the model if True.") +# yapf: enable + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + token_ids, length, labels = batch + preds = model(token_ids, length) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(length, preds, labels) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + precision, recall, f1_score = metric.accumulate() + logger.info("eval precision: %f, recall: %f, f1: %f" % (precision, recall, f1_score)) + model.train() + return precision, recall, f1_score + + +def train(args): + paddle.set_device(args.device) + + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + # Create dataset. 
+ train_ds, test_ds = load_dataset( + datafiles=(os.path.join(args.data_dir, "train.tsv"), os.path.join(args.data_dir, "test.tsv")) + ) + + word_vocab = load_vocab(os.path.join(args.data_dir, "word.dic")) + label_vocab = load_vocab(os.path.join(args.data_dir, "tag.dic")) + # q2b.dic is used to replace DBC case to SBC case + normlize_vocab = load_vocab(os.path.join(args.data_dir, "q2b.dic")) + + trans_func = partial( + convert_example, + max_seq_len=args.max_seq_len, + word_vocab=word_vocab, + label_vocab=label_vocab, + normlize_vocab=normlize_vocab, + ) + train_ds.map(trans_func) + test_ds.map(trans_func) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=word_vocab.get("[PAD]", 0), dtype="int64"), # word_ids + Stack(dtype="int64"), # length + Pad(axis=0, pad_val=label_vocab.get("O", 0), dtype="int64"), # label_ids + ): fn(samples) + + # Create sampler for dataloader + train_sampler = paddle.io.DistributedBatchSampler( + dataset=train_ds, batch_size=args.batch_size, shuffle=True, drop_last=True + ) + train_loader = paddle.io.DataLoader( + dataset=train_ds, batch_sampler=train_sampler, return_list=True, collate_fn=batchify_fn + ) + + test_sampler = paddle.io.BatchSampler(dataset=test_ds, batch_size=args.batch_size, shuffle=False, drop_last=False) + test_loader = paddle.io.DataLoader( + dataset=test_ds, batch_sampler=test_sampler, return_list=True, collate_fn=batchify_fn + ) + + # Define the model netword and its loss + model = BiGruCrf(args.emb_dim, args.hidden_size, len(word_vocab), len(label_vocab), crf_lr=args.crf_lr) + # Prepare optimizer, loss and metric evaluator + optimizer = paddle.optimizer.Adam(learning_rate=args.base_lr, parameters=model.parameters()) + chunk_evaluator = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True) + + if args.init_checkpoint: + if os.path.exists(args.init_checkpoint): + logger.info("Init checkpoint from %s" % args.init_checkpoint) + model_dict = paddle.load(args.init_checkpoint) + model.load_dict(model_dict) + else: + logger.info("Cannot init checkpoint from %s which doesn't exist" % args.init_checkpoint) + logger.info("Start training") + # Start training + global_step = 0 + last_step = args.epochs * len(train_loader) + train_reader_cost = 0.0 + train_run_cost = 0.0 + total_samples = 0 + reader_start = time.time() + max_f1_score = -1 + for epoch in range(args.epochs): + for step, batch in enumerate(train_loader): + train_reader_cost += time.time() - reader_start + global_step += 1 + token_ids, length, label_ids = batch + train_start = time.time() + loss = model(token_ids, length, label_ids) + avg_loss = paddle.mean(loss) + train_run_cost += time.time() - train_start + total_samples += args.batch_size + if global_step % args.logging_steps == 0: + logger.info( + "global step %d / %d, loss: %f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sequences/sec" + % ( + global_step, + last_step, + avg_loss, + train_reader_cost / args.logging_steps, + (train_reader_cost + train_run_cost) / args.logging_steps, + total_samples / args.logging_steps, + total_samples / (train_reader_cost + train_run_cost), + ) + ) + train_reader_cost = 0.0 + train_run_cost = 0.0 + total_samples = 0 + avg_loss.backward() + optimizer.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 or global_step == last_step: + if rank == 0: + paddle.save( + model.state_dict(), os.path.join(args.model_save_dir, "model_%d.pdparams" % global_step) + ) + logger.info("Save %d steps model." 
% (global_step)) + if args.do_eval: + precision, recall, f1_score = evaluate(model, chunk_evaluator, test_loader) + if f1_score > max_f1_score: + max_f1_score = f1_score + paddle.save(model.state_dict(), os.path.join(args.model_save_dir, "best_model.pdparams")) + logger.info("Save best model.") + + reader_start = time.time() + + +if __name__ == "__main__": + args = parser.parse_args() + train(args) diff --git a/examples/machine_reading_comprehension/DuReader-robust/README.md b/examples/machine_reading_comprehension/DuReader-robust/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7d74fcf6c8ba09e7b35503fae72ae191e98c2ed8 --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-robust/README.md @@ -0,0 +1,73 @@ +# 阅读理解 DuReader-robust + +# 简介 + +## 任务说明 +阅读理解模型的鲁棒性是衡量该技术能否在实际应用中大规模落地的重要指标之一。随着当前技术的进步,模型虽然能够在一些阅读理解测试集上取得较好的性能,但在实际应用中,这些模型所表现出的鲁棒性仍然难以令人满意。DuReader-robust数据集作为首个关注阅读理解模型鲁棒性的中文数据集,旨在考察模型在真实应用场景中的过敏感性、过稳定性以及泛化能力等问题。 + +## 数据集 + +DuReader-robust数据集是单篇章、抽取式阅读理解数据集,具体的任务定义为: +对于一个给定的问题q和一个篇章p,参赛系统需要根据篇章内容,给出该问题的答案a。数据集中的每个样本,是一个三元组,例如: + +**问题 q**: 乔丹打了多少个赛季 + +**篇章 p**: 迈克尔.乔丹在NBA打了15个赛季。他在84年进入nba,期间在1993年10月6日第一次退役改打棒球,95年3月18日重新回归,在99年1月13日第二次退役,后于2001年10月31日复出,在03年最终退役… + +**参考答案 a**: [‘15个’,‘15个赛季’] + +关于该数据集的详细内容,可参考数据集[论文](https://arxiv.org/abs/2004.11142)。 + +## 快速开始 + +### 数据准备 + +为了方便开发者进行测试,我们已将数据集上传至HuggingFace。 + + +### Fine-tune + +按如下方式启动 Fine-tuning: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_du.py \ + --model_type ernie_gram \ + --model_name_or_path ernie-gram-zh \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 1 \ + --logging_steps 10 \ + --save_steps 1000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/dureader-robust/ \ + --do_train \ + --do_predict \ + --device gpu \ + ``` + +* `model_type`: 预训练模型的种类。如bert,ernie,roberta等。 +* `model_name_or_path`: 预训练模型的具体名称。如bert-base-chinese,roberta-wwm-ext等。或者是模型文件的本地路径。 +* `output_dir`: 保存模型checkpoint的路径。 +* `do_train`: 是否进行训练。 +* `do_predict`: 是否进行预测。 + +训练结束后模型会自动对结果进行评估,得到类似如下的输出: + +```text +{ + "exact": 72.90049400141143, + "f1": 86.95957173352133, + "total": 1417, + "HasAns_exact": 72.90049400141143, + "HasAns_f1": 86.95957173352133, + "HasAns_total": 1417 +} +``` + +评估结束后模型会自动对测试集进行预测,并将可提交的结果生成在`prediction.json`中。 + + +**NOTE:** 如需恢复模型训练,则model_name_or_path只需指定到文件夹名即可。如`--model_name_or_path=./tmp/dureader-robust/model_19000/`,程序会自动加载模型参数`/model_state.pdparams`,也会自动加载词表,模型config和tokenizer的config。 diff --git a/examples/machine_reading_comprehension/DuReader-robust/args.py b/examples/machine_reading_comprehension/DuReader-robust/args.py new file mode 100644 index 0000000000000000000000000000000000000000..7d3a2927eabfccee53c204d88136f98fe4b96318 --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-robust/args.py @@ -0,0 +1,92 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
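+
+"""Command-line arguments for DuReader-robust fine-tuning and prediction (run_du.py).
+
+This module only builds the argparse parser; run_du.py imports parse_args() from here.
+Because the parser is created with argparse.ArgumentParser(description=__doc__), this
+docstring also appears as the --help description. A minimal sketch of how it is consumed
+(the flag values shown are illustrative):
+
+    from args import parse_args
+
+    args = parse_args()  # e.g. --model_type ernie_gram --model_name_or_path ernie-gram-zh
+"""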
+ +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--model_type", default=None, type=str, required=True, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", + default=0.0, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + choices=["cpu", "gpu", "npu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=20, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=30, help="Max answer length.") + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. 
Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + args = parser.parse_args() + return args diff --git a/examples/machine_reading_comprehension/DuReader-robust/run_du.py b/examples/machine_reading_comprehension/DuReader-robust/run_du.py new file mode 100644 index 0000000000000000000000000000000000000000..2e7a76043c8e835e05e8f3092adfc2ec3cfdfb47 --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-robust/run_du.py @@ -0,0 +1,328 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import os +import random +import time + +import numpy as np +import paddle +from args import parse_args +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import ( + BertForQuestionAnswering, + BertTokenizer, + ErnieForQuestionAnswering, + ErnieGramForQuestionAnswering, + ErnieGramTokenizer, + ErnieTokenizer, + LinearDecayWithWarmup, + RobertaForQuestionAnswering, + RobertaTokenizer, +) + +MODEL_CLASSES = { + "bert": (BertForQuestionAnswering, BertTokenizer), + "ernie": (ErnieForQuestionAnswering, ErnieTokenizer), + "ernie_gram": (ErnieGramForQuestionAnswering, ErnieGramTokenizer), + "roberta": (RobertaForQuestionAnswering, RobertaTokenizer), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, raw_dataset, data_loader, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + input_ids, token_type_ids = batch + start_logits_tensor, end_logits_tensor = model(input_ids, token_type_ids) + + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + print("time per 1000:", time.time() - tic_eval) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, _, _ = compute_prediction( + raw_dataset, + data_loader.dataset, + (all_start_logits, all_end_logits), + False, + args.n_best_size, + args.max_answer_length, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open("prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + squad_evaluate(examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, 
is_whitespace_splited=False) + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +def run(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + set_seed(args) + + train_examples = load_dataset("PaddlePaddle/dureader_robust", split="train") + dev_examples = load_dataset("PaddlePaddle/dureader_robust", split="validation") + + column_names = train_examples.column_names + if rank == 0: + if os.path.exists(args.model_name_or_path): + print("init checkpoint from %s" % args.model_name_or_path) + + model = model_class.from_pretrained(args.model_name_or_path) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + def prepare_train_features(examples): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. 
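+            # Using the CLS position as both start and end labels the feature as having no
+            # extractable answer; the same fallback is applied further below when the gold
+            # span lies outside the current doc_stride chunk.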
+ if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + if args.do_train: + train_ds = train_examples.map(prepare_train_features, batched=True, remove_columns=column_names, num_proc=4) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "start_positions": Stack(dtype="int64"), + "end_positions": Stack(dtype="int64"), + } + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
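+        # decay_params holds the unique parameter names (p.name) of every parameter whose
+        # structured name does not contain "bias" or "norm"; AdamW calls
+        # apply_decay_param_fun with each parameter name and applies weight decay only to
+        # names found in this list.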
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = CrossEntropyLossForSQuAD() + + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, token_type_ids, start_positions, end_positions = batch + logits = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(logits, (start_positions, end_positions)) + + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch + 1, step + 1, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("Saving checkpoint to:", output_dir) + if global_step == num_training_steps: + break + + def prepare_validation_features(examples): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
+ tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + if args.do_predict and rank == 0: + dev_ds = dev_examples.map(prepare_validation_features, batched=True, remove_columns=column_names, num_proc=4) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + dev_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=dev_batchify_fn, return_list=True + ) + + evaluate(model, dev_examples, dev_data_loader, args) + + +if __name__ == "__main__": + args = parse_args() + run(args) diff --git a/examples/machine_reading_comprehension/DuReader-yesno/README.md b/examples/machine_reading_comprehension/DuReader-yesno/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ac95144f7cec93eae3df1ceec784aaafadaa0383 --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-yesno/README.md @@ -0,0 +1,70 @@ +# 阅读理解 DuReader-yesno + +## 简介 + +### 任务说明 +机器阅读理解评测中常用的F1、EM等指标虽然能够很好的衡量抽取式模型所预测的答案和真实答案的匹配程度,但在处理观点类问题时,该类指标难以衡量模型是否真正理解答案所代表的含义,例如答案中包含的观点极性。DuReader-yesno是一个以观点极性判断为目标任务的数据集,通过引入该数据集,可以弥补抽取类数据集的不足,从而更好地评价模型的自然语言理解能力。 + + +### 数据集 + +该数据集的任务定义如下: +对于一个给定的问题q、一系列相关文档D=d1, d2, …, dn,以及人工抽取答案段落摘要a,要求参评系统自动对问题q、候选文档D以及答案段落摘要a进行分析,输出每个答案段落摘要所表述的是非观点极性。其中,极性分为三类 {Yes, No, Depends}。其中: + +* Yes:肯定观点,肯定观点指的是答案给出了较为明确的肯定态度。有客观事实的从客观事实的角度出发,主观态度类的从答案的整体态度来判断。 +* No:否定观点,否定观点通常指的是答案较为明确的给出了与问题相反的态度。 +* Depends:无法确定/分情况,主要指的是事情本身存在多种情况,不同情况下对应的观点不一致;或者答案本身对问题表示不确定,要具体具体情况才能判断。 + +例如: +```text +{ + "documents":[ + { + "title":"香蕉能放冰箱吗 香蕉剥皮冷冻保存_健康贴士_保健_99健康网", + "paragraphs":[ + "本文导读:............." 
+ ] + } + ], + "yesno_answer":"No", + "question":"香蕉能放冰箱吗", + "answer":"香蕉不能放冰箱,香蕉如果放冰箱里,会更容易变坏,会发黑腐烂。", + "id":293 +} +``` + +## 快速开始 + +### Fine-tune + +按如下方式启动 Fine-tuning: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_du.py \ + --model_type ernie_gram \ + --model_name_or_path ernie-gram-zh \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --logging_steps 200 \ + --save_steps 1000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/dureader-yesno/ \ + --device gpu \ + ``` + +* `model_type`: 预训练模型的种类。如bert,ernie,roberta等。 +* `model_name_or_path`: 预训练模型的具体名称。如bert-base-uncased,bert-large-cased等。或者是模型文件的本地路径。 +* `output_dir`: 保存模型checkpoint的路径。 + +训练结束后模型会自动对结果进行评估,得到类似如下的输出: + +```text +accu: 0.874954 +``` +评估结束后模型会自动对测试集进行预测,并将可提交的结果生成在`prediction.json`中。 + +**NOTE:** 如需恢复模型训练,则model_name_or_path只需指定到文件夹名即可。如`--model_name_or_path=./tmp/dureader-yesno/model_19000/`,程序会自动加载模型参数`/model_state.pdparams`,也会自动加载词表,模型config和tokenizer的config。 diff --git a/examples/machine_reading_comprehension/DuReader-yesno/args.py b/examples/machine_reading_comprehension/DuReader-yesno/args.py new file mode 100644 index 0000000000000000000000000000000000000000..e460d925a1a08aef5455eec3257a27f3fef4cfba --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-yesno/args.py @@ -0,0 +1,60 @@ +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--model_type", default=None, type=str, required=True, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", + default=0.0, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + + args = parser.parse_args() + return args diff --git a/examples/machine_reading_comprehension/DuReader-yesno/run_du.py b/examples/machine_reading_comprehension/DuReader-yesno/run_du.py new file mode 100644 index 0000000000000000000000000000000000000000..434b94c28df35508586efd821abf9c0c39e0a29f --- /dev/null +++ b/examples/machine_reading_comprehension/DuReader-yesno/run_du.py @@ -0,0 +1,208 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
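+
+"""Fine-tune a sequence-classification model on the DuReader-yesno polarity task.
+
+Each (question, answer) pair is tokenized as a text pair, the model is trained with AdamW
+under a linear warmup/decay schedule, dev-set accuracy is reported at every checkpoint,
+and test-set predictions are written to prediction.json. A typical single-GPU launch
+(argument values are illustrative; the README in this directory lists the full set):
+
+    python -m paddle.distributed.launch --gpus "0" run_du.py \
+        --model_type ernie_gram \
+        --model_name_or_path ernie-gram-zh \
+        --batch_size 12 \
+        --learning_rate 3e-5 \
+        --output_dir ./tmp/dureader-yesno/
+"""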
+ +import json +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from args import parse_args +from paddle.io import DataLoader + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ErnieForSequenceClassification, + ErnieGramForSequenceClassification, + ErnieGramTokenizer, + ErnieTokenizer, + LinearDecayWithWarmup, + RobertaForSequenceClassification, + RobertaTokenizer, +) + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), + "ernie_gram": (ErnieGramForSequenceClassification, ErnieGramTokenizer), + "roberta": (RobertaForSequenceClassification, RobertaTokenizer), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def convert_example(example, tokenizer): + """convert a Dureader-yesno example into necessary features""" + + feature = tokenizer(text=example["question"], text_pair=example["answer"], max_seq_len=args.max_seq_length) + feature["labels"] = example["labels"] + feature["id"] = example["id"] + + return feature + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + correct = metric.compute(logits, labels) + metric.update(correct) + accu = metric.accumulate() + print("accu: %f" % (accu)) + model.train() # Switch the model to training mode after evaluation + + +@paddle.no_grad() +def predict(model, data_loader): + model.eval() + res = {} + for batch in data_loader: + input_ids, segment_ids, qas_id = batch + logits = model(input_ids, segment_ids) + qas_id = qas_id.numpy() + preds = paddle.argmax(logits, axis=1).numpy() + for i in range(len(preds)): + res[str(qas_id[i])] = data_loader.dataset.label_list[preds[i]] + model.train() + return res + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + set_seed(args) + + train_ds, dev_ds, test_ds = load_dataset("dureader_yesno", splits=["train", "dev", "test"]) + + trans_func = partial(convert_example, tokenizer=tokenizer) + + train_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "labels": Stack(dtype="int64"), + } + ): fn(samples) + + test_batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "id": Stack(), + } + ): fn(samples) + + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + 
dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + test_ds = test_ds.map(trans_func, lazy=True) + test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=args.batch_size, shuffle=False) + test_data_loader = DataLoader( + dataset=test_ds, batch_sampler=test_batch_sampler, collate_fn=test_batchify_fn, return_list=True + ) + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=len(train_ds.label_list)) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, label = batch + + logits = model(input_ids=input_ids, token_type_ids=segment_ids) + loss = criterion(logits, label) + + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch + 1, step + 1, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if rank == 0: + evaluate(model, metric, dev_data_loader) + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("Saving checkpoint to:", output_dir) + if global_step == num_training_steps: + break + + if rank == 0: + predictions = predict(model, test_data_loader) + with open("prediction.json", "w") as writer: + writer.write(json.dumps(predictions, ensure_ascii=False, indent=4) + "\n") + + +if __name__ == "__main__": + args = parse_args() + do_train(args) diff --git a/examples/machine_reading_comprehension/SQuAD/README.md b/examples/machine_reading_comprehension/SQuAD/README.md new file mode 100644 index 0000000000000000000000000000000000000000..537453318acd41980c0e54be6901536a00b6f67a --- /dev/null +++ b/examples/machine_reading_comprehension/SQuAD/README.md @@ -0,0 +1,195 @@ +# 阅读理解 SQuAD + +## 简介 + +### 任务说明 +本文主要介绍基于Bert预训练模型的SQuAD(Stanford Question Answering Dataset)数据集的阅读理解任务,给定一篇文章和一个问题,计算答案在文章中的起始位置和结束位置。对于SQuAD2.0数据集,还可以返回答案在文章中不存在的概率。 + +### 数据集 + +此任务的数据集包括以下数据集: + +SQuAD v1.1 +- [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json) +- 
[dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) + +SQuAD v2.0 +- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json) +- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json) + + +## 快速开始 + +### 数据准备 + +为了方便开发者进行测试,我们使用了HuggingFace的数据集,用户可以通过命令行传入`--version_2_with_negative`控制所需要的SQuAD数据集版本。 + +### Fine-tune + +对于 SQuAD v1.1,按如下方式启动 Fine-tuning: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_squad.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --logging_steps 1000 \ + --save_steps 1000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/squad/ \ + --device gpu \ + --do_train \ + --do_predict + ``` + +* `model_type`: 预训练模型的种类。如bert,ernie,roberta等。 +* `model_name_or_path`: 预训练模型的具体名称。如bert-base-uncased,bert-large-cased等。或者是模型文件的本地路径。 +* `output_dir`: 保存模型checkpoint的路径。 +* `do_train`: 是否进行训练。 +* `do_predict`: 是否进行预测。 + +训练结束后模型会自动对结果进行评估,得到类似如下的输出: + +```text +{ + "exact": 81.18259224219489, + "f1": 88.68817481234801, + "total": 10570, + "HasAns_exact": 81.18259224219489, + "HasAns_f1": 88.68817481234801, + "HasAns_total": 10570 +} +``` + +对于 SQuAD v2.0,按如下方式启动 Fine-tuning: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_squad.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --logging_steps 1000 \ + --save_steps 1000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/squad/ \ + --device gpu \ + --do_train \ + --do_predict \ + --version_2_with_negative + ``` + +* `version_2_with_negative`: 使用squad2.0数据集和评价指标的标志。 + +训练结束后会在模型会自动对结果进行评估,得到类似如下的输出: + +```text +{ + "exact": 73.25865408910974, + "f1": 76.63096554166046, + "total": 11873, + "HasAns_exact": 73.22874493927125, + "HasAns_f1": 79.98303877802545, + "HasAns_total": 5928, + "NoAns_exact": 73.28847771236333, + "NoAns_f1": 73.28847771236333, + "NoAns_total": 5945, + "best_exact": 74.31988545439232, + "best_exact_thresh": -2.5820093154907227, + "best_f1": 77.20521797731851, + "best_f1_thresh": -1.559523582458496 +} +``` + +其中会输出 `best_f1_thresh` 是最佳阈值,可以使用这个阈值重新训练,或者从 `all_nbest_json`变量中获取最终 `prediction`。 +训练方法与前面大体相同,只需要设定 `--null_score_diff_threshold` 参数的值为测评时输出的 `best_f1_thresh` ,通常这个值在 -1.0 到 -5.0 之间。 + +**NOTE:** 如需恢复模型训练,则model_name_or_path只需指定到文件夹名即可。如`--model_name_or_path=./tmp/squad/model_19000/`,程序会自动加载模型参数`/model_state.pdparams`,也会自动加载词表,模型config和tokenizer的config。 + +### 预测 + +如需使用训练好的模型预测并输出结果,需将自己的数据集改成SQuAD格式(以下示例为SQuAD2.0)。 + +```text +{"data": [{'title': 'Beyoncé', + 'paragraphs': [ + {'qas': [{'question': 'When did Beyonce start becoming popular?', + 'id': '56be85543aeaaa14008c9063', + 'answers': [], + 'is_impossible': False}]], + 'context':'Beyoncé Giselle Knowles-Carter(biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. 
Born and raised in Houston, Texas, she.'} + }] +``` + +并参考[以内置数据集格式读取本地数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#id4)中的方法创建自己的数据集并修改`run_squad.py`中对应的数据集读取代码。再运行以下脚本: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_squad.py \ + --model_type bert \ + --model_name_or_path your-best-model \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --logging_steps 1000 \ + --save_steps 1000 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/squad/ \ + --device gpu \ + --do_predict \ + --version_2_with_negative + ``` + +即可完成预测,预测的答案保存在`prediction.json`中。数据格式如下所示,左边的id与输入中的id对应。 + +```text +{ + "56be85543aeaaa14008c9063": "in the late 1990s", + ... +} +``` + +### 静态图预测 + +在Fine-tune完成后,我们可以使用如下方式导出希望用来预测的模型: + +```shell +python -u ./export_model.py \ + --model_type bert \ + --model_path bert-base-uncased \ + --output_path ./infer_model/model +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_path` 表示训练模型的保存路径,与训练时的`output_dir`一致。 +- `output_path` 表示导出预测模型文件的前缀。保存时会添加后缀(`pdiparams`,`pdiparams.info`,`pdmodel`);除此之外,还会在`output_path`包含的目录下保存tokenizer相关内容。 + +然后按照如下的方式对阅读理解任务进行预测: + +```shell +python -u deploy/python/predict.py \ + --model_type bert \ + --model_name_or_path ./infer_model/model \ + --batch_size 4 \ + --max_seq_length 384 +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_name_or_path` 表示预测模型文件的前缀,和上一步导出预测模型中的`output_path`一致。 +- `batch_size` 表示每个预测批次的样本数目。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断,和训练时一致。 + +以上命令将在SQuAD v1.1的验证集上进行预测。此外,同训练时一样,用户可以通过命令行传入`--version_2_with_negative`控制所需要的SQuAD数据集版本。 diff --git a/examples/machine_reading_comprehension/SQuAD/args.py b/examples/machine_reading_comprehension/SQuAD/args.py new file mode 100644 index 0000000000000000000000000000000000000000..2fa0e6e8ab8fa6222f2a75451522236c67f19f7b --- /dev/null +++ b/examples/machine_reading_comprehension/SQuAD/args.py @@ -0,0 +1,104 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--model_type", default="bert", type=str, help="Type of pre-trained model.") + parser.add_argument( + "--model_name_or_path", + default="bert-base-uncased", + type=str, + help="Path to pre-trained model or shortcut name of model.", + ) + parser.add_argument( + "--output_dir", + default="outputs", + type=str, + help="The output directory where the model predictions and checkpoints will be written. " + "Default as `outputs`", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument( + "--warmup_proportion", + default=0.0, + type=float, + help="Proportion of training steps to perform linear learning rate warmup for.", + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + choices=["cpu", "gpu", "mlu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument( + "--n_best_size", + type=int, + default=20, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument( + "--null_score_diff_threshold", + type=float, + default=0.0, + help="If null_score - best_non_null is greater than the threshold predict null.", + ) + parser.add_argument("--max_query_length", type=int, default=64, help="Max query length.") + parser.add_argument("--max_answer_length", type=int, default=30, help="Max answer length.") + parser.add_argument( + "--do_lower_case", + action="store_false", + help="Whether to lower case the input text. Should be True for uncased models and False for cased models.", + ) + parser.add_argument("--verbose", action="store_true", help="Whether to output verbose log.") + parser.add_argument( + "--version_2_with_negative", + action="store_true", + help="If true, the SQuAD examples contain some that do not have an answer. If using squad v2.0, it should be set true.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") + parser.add_argument("--do_predict", action="store_true", help="Whether to predict.") + parser.add_argument("--use_amp", action="store_true", help="Whether to use AMP.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + args = parser.parse_args() + return args diff --git a/examples/machine_reading_comprehension/SQuAD/deploy/python/predict.py b/examples/machine_reading_comprehension/SQuAD/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..3a8281d70d96d4fe5fbf64361b55aa97970d7956 --- /dev/null +++ b/examples/machine_reading_comprehension/SQuAD/deploy/python/predict.py @@ -0,0 +1,124 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +from functools import partial + +import paddle +from datasets import load_dataset + +from paddlenlp.data import Dict, Pad +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate + +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir, os.pardir))) +from args import parse_args # noqa: E402 +from run_squad import MODEL_CLASSES, prepare_validation_features # noqa: E402 + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + config = paddle.inference.Config(args.model_name_or_path + ".pdmodel", args.model_name_or_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + predictor = paddle.inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + return cls(predictor, input_handles, output_handles) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def predict(self, dataset, raw_dataset, collate_fn, args, do_eval=True): + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=args.batch_size, shuffle=False) + data_loader = paddle.io.DataLoader( + dataset=dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, num_workers=0, return_list=True + ) + outputs = [] + all_start_logits = [] + all_end_logits = [] + for data in data_loader: + output = self.predict_batch(data) + outputs.append(output) + if do_eval: + all_start_logits.extend(list(output[0])) + all_end_logits.extend(list(output[1])) + if do_eval: + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, + data_loader.dataset, + (all_start_logits, all_end_logits), + args.version_2_with_negative, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + squad_evaluate( + examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json + ) + return outputs + + +def main(): + args = parse_args() + + predictor = Predictor.create_predictor(args) + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = 
tokenizer_class.from_pretrained(os.path.dirname(args.model_name_or_path)) + + if args.version_2_with_negative: + raw_dataset = load_dataset("squad_v2", split="validation") + else: + raw_dataset = load_dataset("squad", split="validation") + column_names = raw_dataset.column_names + dataset = raw_dataset.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + predictor = Predictor.create_predictor(args) + predictor.predict(dataset, raw_dataset, args=args, collate_fn=batchify_fn) + + +if __name__ == "__main__": + main() diff --git a/examples/machine_reading_comprehension/SQuAD/export_model.py b/examples/machine_reading_comprehension/SQuAD/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..7f1ad38135c94021c1d0bc2eb9a6adaec69e6210 --- /dev/null +++ b/examples/machine_reading_comprehension/SQuAD/export_model.py @@ -0,0 +1,78 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle + +from run_squad import MODEL_CLASSES + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_path", + default=None, + type=str, + required=True, + help="Path of the trained model to be exported.", + ) + parser.add_argument( + "--output_path", + default=None, + type=str, + required=True, + help="The output file prefix used to save the exported inference model.", + ) + args = parser.parse_args() + return args + + +def main(): + args = parse_args() + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # build model and load trained parameters + model = model_class.from_pretrained(args.model_path) + # switch to eval model + model.eval() + # convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # save converted static graph model + paddle.jit.save(model, args.output_path) + # also save tokenizer for inference usage + tokenizer = tokenizer_class.from_pretrained(args.model_path) + tokenizer.save_pretrained(os.path.dirname(args.output_path)) + + +if __name__ == "__main__": + main() diff --git a/examples/machine_reading_comprehension/SQuAD/run_squad.py b/examples/machine_reading_comprehension/SQuAD/run_squad.py new file mode 100644 index 
0000000000000000000000000000000000000000..9a4c9357050f70b73cb2086e11317597c97cc643 --- /dev/null +++ b/examples/machine_reading_comprehension/SQuAD/run_squad.py @@ -0,0 +1,350 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from args import parse_args +from datasets import load_dataset +from paddle.io import DataLoader + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import ( + BertForQuestionAnswering, + BertTokenizer, + ErnieForQuestionAnswering, + ErnieTokenizer, + FunnelForQuestionAnswering, + FunnelTokenizer, + LinearDecayWithWarmup, +) + +MODEL_CLASSES = { + "bert": (BertForQuestionAnswering, BertTokenizer), + "ernie": (ErnieForQuestionAnswering, ErnieTokenizer), + "funnel": (FunnelForQuestionAnswering, FunnelTokenizer), +} + + +def prepare_train_features(examples, tokenizer, args): + # Some of the questions have lots of whitespace on the left, which is not useful and will make the + # truncation of the context fail (the tokenized question will take a lots of space). So we remove that + # left whitespace + contexts = examples["context"] + questions = examples["question"] + + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + tokenized_examples = tokenizer( + questions, contexts, max_seq_len=args.max_seq_length, stride=args.doc_stride, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. 
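+        # Illustrative note (hypothetical numbers): with max_seq_length=384 and doc_stride=128, a context
+        # that tokenizes to roughly 600 tokens is split into several overlapping features, so sample_mapping
+        # might look like [0, 0, 1, ...], meaning the first two features both come from example 0.
+        # offsets[k] holds the (start_char, end_char) span of token k inside the original context string.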
+ sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
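+        # Offsets left as None here are skipped later by compute_prediction, so only tokens that truly
+        # belong to the context (and not the question, padding or the final special token) can be chosen
+        # as answer-span boundaries during post-processing.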
+ tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index and k != len(sequence_ids) - 1 else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, data_loader, raw_dataset, features, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + start_logits_tensor, end_logits_tensor = model( + batch["input_ids"], token_type_ids=batch["token_type_ids"], attention_mask=batch["attention_mask"] + ) + + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + print("time per 1000:", time.time() - tic_eval) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, + features, + (all_start_logits, all_end_logits), + args.version_2_with_negative, + args.n_best_size, + args.max_answer_length, + args.null_score_diff_threshold, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open("prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + squad_evaluate(examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json) + + model.train() + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +def run(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + if args.version_2_with_negative: + train_examples = load_dataset("squad_v2", split="train") + dev_examples = load_dataset("squad_v2", split="validation") + else: + train_examples = load_dataset("squad", split="train") + dev_examples = load_dataset("squad", split="validation") + set_seed(args) + if rank == 0: + if os.path.exists(args.model_name_or_path): + print("init checkpoint from %s" % args.model_name_or_path) + + model = model_class.from_pretrained(args.model_name_or_path) + column_names = train_examples.column_names + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.do_train: + train_ds = train_examples.map( + partial(prepare_train_features, tokenizer=tokenizer, args=args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + train_batchify_fn = 
DataCollatorWithPadding(tokenizer) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=train_batchify_fn, return_list=True + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + criterion = CrossEntropyLossForSQuAD() + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + global_step = 0 + tic_train = time.time() + + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + if args.use_amp: + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model( + input_ids=batch["input_ids"], + token_type_ids=batch["token_type_ids"], + attention_mask=batch["attention_mask"], + ) + loss = criterion(logits, (batch["start_positions"], batch["end_positions"])) + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + logits = model( + input_ids=batch["input_ids"], + token_type_ids=batch["token_type_ids"], + attention_mask=batch["attention_mask"], + ) + loss = criterion(logits, (batch["start_positions"], batch["end_positions"])) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch + 1, step + 1, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + print("Saving checkpoint to:", output_dir) + if global_step == num_training_steps: + break + + if args.do_predict and rank == 0: + dev_ds = dev_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=args), + batched=True, + remove_columns=column_names, + num_proc=4, + ) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_ds_for_model = dev_ds.remove_columns(["example_id", "offset_mapping"]) + dev_batchify_fn = DataCollatorWithPadding(tokenizer) + + dev_data_loader = DataLoader( + dataset=dev_ds_for_model, batch_sampler=dev_batch_sampler, collate_fn=dev_batchify_fn, return_list=True + ) + + evaluate(model, dev_data_loader, dev_examples, dev_ds, args) + + +if __name__ == "__main__": + args = parse_args() + run(args) diff --git a/examples/machine_translation/README.md b/examples/machine_translation/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..6d58e14e2015083bce66f9f9ff8cd66cb7af9c37 --- /dev/null +++ b/examples/machine_translation/README.md @@ -0,0 +1,116 @@ +# 机器翻译 + +机器翻译(Machine Translation)是利用计算机将一种自然语言(源语言)转换为另一种自然语言(目标语言)的过程,输入为源语言句子,输出为相应的目标语言的句子。 + +## 快速开始 + +### 环境依赖 + +使用当前机器翻译示例,需要额外安装配置以下环境: + +* attrdict +* pyyaml +* subword_nmt +* fastBPE (可选,若不使用 preprocessor.py 的 bpe 分词功能可以不需要) + +### 数据准备 + +数据准备部分提供两种模式:一种是使用 PaddleNLP 内置的、已经处理好的 WMT14 EN-DE 翻译数据集;另一种是为当前 Transformer demo 使用自定义数据集。以下分别展开介绍。 + +#### 使用内置的已经处理好的数据集 + +内置的处理好的数据集基于公开的 WMT 数据集。 + +WMT 翻译大赛是机器翻译领域最具权威的国际评测大赛,其中英德翻译任务提供了一个中等规模的数据集,该数据集被较多论文使用,也是 Transformer 论文中用到的数据集之一。我们将 [WMT'14 EN-DE 数据集](http://www.statmt.org/wmt14/translation-task.html) 作为示例提供。 + +编写如下代码,即可自动载入处理好的上述数据,对应的 WMT14 EN-DE 数据集会自动下载并解压到 `~/.paddlenlp/datasets/WMT14ende/`。 + +``` python +from paddlenlp.datasets import load_dataset + +datasets = load_dataset('wmt14ende', splits=('train', 'dev')) +``` + +如果使用内置的处理好的数据,到这里即可完成数据准备。接下来可以直接移步 [Transformer 翻译模型](transformer/README.md),其中详细介绍了如何使用内置数据集训练一个英德翻译的 Transformer 模型。 + +#### 使用自定义翻译数据集 + +本示例同时提供了使用自定义数据集的方法。可参考以下方式执行数据处理: + +``` bash +# 数据下载、处理,包括 bpe 的训练 +bash preprocessor/prepare-wmt14en2de.sh --icml17 + +# 数据预处理 +DATA_DIR=preprocessor/wmt14_en_de + +python preprocessor/preprocessor.py \ + --src_lang en \ + --trg_lang de \ + --train_pref $DATA_DIR/train \ + --dev_pref $DATA_DIR/dev \ + --test_pref $DATA_DIR/test \ + --dest_dir data/wmt14_en_de \ + --threshold_trg 0 \ + --threshold_src 0 \ + --joined_dictionary +``` + +`preprocessor/preprocessor.py` 支持机器翻译中常见的数据预处理方式,提供词表构建、数据集文件整理以及可选的 bpe 分词功能。最后获取的处理完成的 train,dev,test 数据可以直接用于后面 Transformer 模型的训练、评估和推理中。具体的参数说明如下: + +* `--src_lang`(`-s`): 指明数据处理对应的源语言类型,比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。 +* `--trg_lang`(`-t`): 指明数据处理对应的目标语言的类型,比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。 +* `--train_pref`: 指明前序步骤中,下载的训练数据的路径,以及对应的文件名前缀,比如 `preprocessor/wmt14_en_de/train` 结合 `--src_lang de` 和 `--trg_lang en`,表示在 `preprocessor/wmt14_en_de/` 路径下,源语言是 `preprocessor/wmt14_en_de/train.en`,目标语言是 `preprocessor/wmt14_en_de/train.de`。 +* `--dev_pref`: 指明前序步骤中,下载的验证数据的路径,以及对应的文件名前缀。在验证集语料中,如果有的 token 在训练集中从未出现过,那么将会被 `<unk>` 替换。 +* `--test_pref`: 指明前序步骤中,下载的测试数据的路径,以及对应的文件名前缀。在测试集语料中,如果有的 token 在训练集中从未出现过,那么将会被 `<unk>` 替换。 +* `--dest_dir`: 完成数据处理之后,保存处理完成数据以及词表的路径。 +* `--threshold_src`: 在源语言中,出现频次小于 `--threshold_src` 指定的频次的 token 将会被替换成 `<unk>`。默认为 0,表示不会根据 token 出现的频次忽略 token 本身。 +* `--threshold_trg`: 在目标语言中,出现频次小于 `--threshold_trg` 指定的频次的 token 将会被替换成 `<unk>`。默认为 0,表示不会根据 token 出现的频次忽略 token 本身。 +* `--src_vocab`: 源语言词表,默认为 None,表示需要预处理步骤根据训练集语料重新生成一份词表。如果源语言与目标语言共用同一份词表,那么将使用 `--src_vocab` 指定的词表。 +* `--trg_vocab`: 目标语言词表,默认为 None,表示需要预处理步骤根据训练集语料重新生成一份词表。如果源语言与目标语言共用同一份词表,那么将使用 `--src_vocab` 指定的词表。 +* `--nwords_src`: 源语言词表最大的大小,不包括 special token。默认为 None,表示不限制。若源语言和目标语言共用同一份词表,那么将使用 `--nwords_src` 指定的大小。 +* `--nwords_trg`: 目标语言词表最大的大小,不包括 special token。默认为 None,表示不限制。若源语言和目标语言共用同一份词表,那么将使用 `--nwords_src` 指定的大小。 +* `--align_file`: 可选,指定词对齐文件的路径;若指定,将结合训练语料统计词对齐频次,并在 `--dest_dir` 下生成 `alignment.源语言-目标语言.txt` 对齐词典。 +* `--joined_dictionary`: 源语言和目标语言是否使用同一份词表。若不共用同一份词表,无需指定。 +* `--only_source`: 是否仅处理源语言。 +* `--dict_only`: 是否仅处理词表。若指定,则仅完成词表处理。 +* `--bos_token`: 指明翻译所用的 `bos_token`,表示一个句子开始。 +* `--eos_token`: 指明翻译所用的 `eos_token`,表示一个句子的结束。 +* `--pad_token`: 指明 `pad_token`,用于将一个 batch 内不同长度的句子 pad 到合适长度。 +* `--unk_token`: 指明 `unk_token`,用于当一个 token 在词表中未曾出现的情况,将使用 `--unk_token` 指明的字符替换。 +* `--apply_bpe`: 是否需要对数据作 bpe 分词。若指定则会在 preprocessor.py 脚本开始执行 bpe 分词。如果是使用提供的 shell 脚本完成的数据下载,则无需设置,在 shell 脚本中会作 bpe 分词处理。 +* 
`--bpe_code`: 若指明 `--apply_bpe` 使用 bpe 分词,则需同时提供训练好的 bpe code 文件。 + +除了 WMT14 德英翻译数据集外,我们也提供了其他的 shell 脚本完成数据下载处理,比如 WMT14 英法翻译数据。 + +``` bash +# WMT14 英法翻译的数据下载、处理 +bash prepare-wmt14en2fr.sh +``` + +完成数据处理之后,同样也可以采用上文提到的预处理方式获取词表,完成预处理。 + +如果有或者需要使用其他的平行语料,可以自行完成下载和简单的处理。 + +在下载部分,即在 shell 脚本中,处理需要用到 [mosesdecoder](https://github.com/moses-smt/mosesdecoder) 和 [subword-nmt](https://github.com/rsennrich/subword-nmt) 这两个工具。包括: + +* 使用 `mosesdecoder/scripts/tokenizer/tokenizer.perl` 完成对词做一个初步的切分; +* 基于 `mosesdecoder/scripts/training/clean-corpus-n.perl` 完成数据的清洗; +* 使用 `subword-nmt/subword_nmt/learn_bpe.py` 完成 bpe 的学习; + +此外,基于学到的 bpe code 进行分词的操作目前提供了两种选项,其一是,可以在以上的 shell 脚本中处理完成,使用以下的工具: + +* 使用 `subword-nmt/subword_nmt/apply_bpe.py` 完成分词工作。 + +其二,也可以直接在后面的 `preprocessor/preprocessor.py` 脚本中,指明 `--apply_bpe` 完成分词操作。 + + +### 如何训一个翻译模型 + +前文介绍了如何快速开始完成翻译训练所需平行语料的准备,关于进一步的,模型训练、评估和推理部分,可以根据需要,参考对应的模型的文档: + +* [Transformer 翻译模型](transformer/README.md) + +## Acknowledge + +我们借鉴了 facebookresearch 的 [fairseq](https://github.com/facebookresearch/fairseq) 在翻译数据的预处理上优秀的设计,在此对 fairseq 作者以及其开源社区表示感谢。 diff --git a/examples/machine_translation/preprocessor/prepare-iwslt14.sh b/examples/machine_translation/preprocessor/prepare-iwslt14.sh new file mode 100644 index 0000000000000000000000000000000000000000..4746b27c82e79e085f8fba4e1c593f38bfb77ac0 --- /dev/null +++ b/examples/machine_translation/preprocessor/prepare-iwslt14.sh @@ -0,0 +1,135 @@ +#!/usr/bin/env bash +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. +# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh + +cd preprocessor/ + +echo 'Cloning Moses github repository (for tokenization scripts)...' +git clone https://github.com/moses-smt/mosesdecoder.git + +echo 'Cloning Subword NMT repository (for BPE pre-processing)...' +git clone https://github.com/rsennrich/subword-nmt.git + +SCRIPTS=mosesdecoder/scripts +TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl +LC=$SCRIPTS/tokenizer/lowercase.perl +CLEAN=$SCRIPTS/training/clean-corpus-n.perl +BPEROOT=subword-nmt/subword_nmt +BPE_TOKENS=10000 + +URL="http://dl.fbaipublicfiles.com/fairseq/data/iwslt14/de-en.tgz" +GZ=de-en.tgz + +if [ ! -d "$SCRIPTS" ]; then + echo "Please set SCRIPTS variable correctly to point to Moses scripts." + exit +fi + +src=de +tgt=en +lang=de-en +prep=iwslt14.tokenized.de-en +tmp=$prep/tmp +origin=origin + +mkdir -p $origin $tmp $prep + +echo "Downloading data from ${URL}..." +cd $origin +wget "$URL" + +if [ -f $GZ ]; then + echo "Data successfully downloaded." +else + echo "Data not successfully downloaded." + exit +fi + +tar zxvf $GZ +cd .. + +echo "pre-processing train data..." 
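+# The loop below strips the TED talk XML metadata (<url>, <talkid>, <keywords> lines and the
+# <title>/<description> markup) from the raw files and then tokenizes each side with the Moses tokenizer.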
+for l in $src $tgt; do + f=train.tags.$lang.$l + tok=train.tags.$lang.tok.$l + + cat $origin/$lang/$f | \ + grep -v '<url>' | \ + grep -v '<talkid>' | \ + grep -v '<keywords>' | \ + sed -e 's/<title>//g' | \ + sed -e 's/<\/title>//g' | \ + sed -e 's/<description>//g' | \ + sed -e 's/<\/description>//g' | \ + perl $TOKENIZER -threads 8 -l $l > $tmp/$tok + echo "" +done +perl $CLEAN -ratio 1.5 $tmp/train.tags.$lang.tok $src $tgt $tmp/train.tags.$lang.clean 1 175 +for l in $src $tgt; do + perl $LC < $tmp/train.tags.$lang.clean.$l > $tmp/train.tags.$lang.$l +done + +echo "pre-processing dev/test data..." +for l in $src $tgt; do + for o in `ls $origin/$lang/IWSLT14.TED*.$l.xml`; do + fname=${o##*/} + f=$tmp/${fname%.*} + echo $o $f + grep '<seg id' $o | \ + sed -e 's/<seg id="[0-9]*">\s*//g' | \ + sed -e 's/\s*<\/seg>\s*//g' | \ + sed -e "s/\’/\'/g" | \ + perl $TOKENIZER -threads 8 -l $l | \ + perl $LC > $f + echo "" + done +done + + +echo "creating train, dev, test..." +for l in $src $tgt; do + awk '{if (NR%23 == 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/dev.$l + awk '{if (NR%23 != 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/train.$l + + cat $tmp/IWSLT14.TED.dev2010.de-en.$l \ + $tmp/IWSLT14.TEDX.dev2012.de-en.$l \ + $tmp/IWSLT14.TED.tst2010.de-en.$l \ + $tmp/IWSLT14.TED.tst2011.de-en.$l \ + $tmp/IWSLT14.TED.tst2012.de-en.$l \ + > $tmp/test.$l +done + +TRAIN=$tmp/train.en-de +BPE_CODE=$prep/code +rm -f $TRAIN +for l in $src $tgt; do + cat $tmp/train.$l >> $TRAIN +done + +echo "learn_bpe.py on ${TRAIN}..." +python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE + +for L in $src $tgt; do + for f in train.$L dev.$L test.$L; do + echo "apply_bpe.py to ${f}..." + python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $prep/$f + done +done + +cd - diff --git a/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh b/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh new file mode 100644 index 0000000000000000000000000000000000000000..32926c4aa96b27ceda377e314102c3f15f122568 --- /dev/null +++ b/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh @@ -0,0 +1,163 @@ +#!/bin/bash +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. +# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh + +cd preprocessor/ + +echo 'Cloning Moses github repository (for tokenization scripts)...' +git clone https://github.com/moses-smt/mosesdecoder.git + +echo 'Cloning Subword NMT repository (for BPE pre-processing)...' 
+git clone https://github.com/rsennrich/subword-nmt.git + +SCRIPTS=mosesdecoder/scripts +TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl +CLEAN=$SCRIPTS/training/clean-corpus-n.perl +NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl +REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl +BPEROOT=subword-nmt/subword_nmt +BPE_TOKENS=40000 + +URLS=( + "http://statmt.org/wmt13/training-parallel-europarl-v7.tgz" + "http://statmt.org/wmt13/training-parallel-commoncrawl.tgz" + "http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz" + "http://data.statmt.org/wmt17/translation-task/dev.tgz" + "http://statmt.org/wmt14/test-full.tgz" +) +FILES=( + "training-parallel-europarl-v7.tgz" + "training-parallel-commoncrawl.tgz" + "training-parallel-nc-v12.tgz" + "dev.tgz" + "test-full.tgz" +) +CORPORA=( + "training/europarl-v7.de-en" + "commoncrawl.de-en" + "training/news-commentary-v12.de-en" +) + +# This will make the dataset compatible to the one used in "Convolutional Sequence to Sequence Learning" +# https://arxiv.org/abs/1705.03122 +if [ "$1" == "--icml17" ]; then + URLS[2]="http://statmt.org/wmt14/training-parallel-nc-v9.tgz" + FILES[2]="training-parallel-nc-v9.tgz" + CORPORA[2]="training/news-commentary-v9.de-en" + OUTDIR=wmt14_en_de +else + OUTDIR=wmt17_en_de +fi + +if [ ! -d "$SCRIPTS" ]; then + echo "Please set SCRIPTS variable correctly to point to Moses scripts." + exit +fi + +src=en +tgt=de +lang=en-de +prep=$OUTDIR +tmp=$prep/tmp +origin=origin +dev=dev/newstest2013 + +mkdir -p $origin $tmp $prep + +cd $origin + +for ((i=0;i<${#URLS[@]};++i)); do + file=${FILES[i]} + if [ -f $file ]; then + echo "$file already exists, skipping download" + else + url=${URLS[i]} + wget "$url" --no-check-certificate + if [ -f $file ]; then + echo "$url successfully downloaded." + else + echo "$url not successfully downloaded." + exit -1 + fi + if [ ${file: -4} == ".tgz" ]; then + tar zxvf $file + elif [ ${file: -4} == ".tar" ]; then + tar xvf $file + fi + fi +done +cd .. + +echo "pre-processing train data..." +for l in $src $tgt; do + rm $tmp/train.tags.$lang.tok.$l + for f in "${CORPORA[@]}"; do + cat $origin/$f.$l | \ + perl $NORM_PUNC $l | \ + perl $REM_NON_PRINT_CHAR | \ + perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l + done +done + +echo "pre-processing test data..." +for l in $src $tgt; do + if [ "$l" == "$src" ]; then + t="src" + else + t="ref" + fi + grep '<seg id' $origin/test-full/newstest2014-deen-$t.$l.sgm | \ + sed -e 's/<seg id="[0-9]*">\s*//g' | \ + sed -e 's/\s*<\/seg>\s*//g' | \ + sed -e "s/\’/\'/g" | \ + perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l + echo "" +done + +echo "splitting train and dev..." +for l in $src $tgt; do + awk '{if (NR%100 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/dev.$l + awk '{if (NR%100 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l +done + +TRAIN=$tmp/train.de-en +BPE_CODE=$prep/code +rm -f $TRAIN +for l in $src $tgt; do + cat $tmp/train.$l >> $TRAIN +done + +echo "learn_bpe.py on ${TRAIN}..." +python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE + +for L in $src $tgt; do + for f in train.$L dev.$L test.$L; do + echo "apply_bpe.py to ${f}..." 
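+        # Encode every split with the merge operations learned above; the BPE output goes to $tmp/bpe.*
+        # and is length-ratio cleaned into $prep further below (the test set is copied as is).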
+ python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f + done +done + +perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250 +perl $CLEAN -ratio 1.5 $tmp/bpe.dev $src $tgt $prep/dev 1 250 + +for L in $src $tgt; do + cp $tmp/bpe.test.$L $prep/test.$L +done + +cd - diff --git a/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh b/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh new file mode 100644 index 0000000000000000000000000000000000000000..3fc3bc10f6324875a5401688140ac0985f39600d --- /dev/null +++ b/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh @@ -0,0 +1,157 @@ +#!/bin/bash +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. +# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh + +cd preprocessor/ + +echo 'Cloning Moses github repository (for tokenization scripts)...' +git clone https://github.com/moses-smt/mosesdecoder.git + +echo 'Cloning Subword NMT repository (for BPE pre-processing)...' +git clone https://github.com/rsennrich/subword-nmt.git + +SCRIPTS=mosesdecoder/scripts +TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl +CLEAN=$SCRIPTS/training/clean-corpus-n.perl +NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl +REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl +BPEROOT=subword-nmt/subword_nmt +BPE_TOKENS=40000 + +URLS=( + "http://statmt.org/wmt13/training-parallel-europarl-v7.tgz" + "http://statmt.org/wmt13/training-parallel-commoncrawl.tgz" + "http://statmt.org/wmt13/training-parallel-un.tgz" + "http://statmt.org/wmt14/training-parallel-nc-v9.tgz" + "http://statmt.org/wmt10/training-giga-fren.tar" + "http://statmt.org/wmt14/test-full.tgz" +) +FILES=( + "training-parallel-europarl-v7.tgz" + "training-parallel-commoncrawl.tgz" + "training-parallel-un.tgz" + "training-parallel-nc-v9.tgz" + "training-giga-fren.tar" + "test-full.tgz" +) +CORPORA=( + "training/europarl-v7.fr-en" + "commoncrawl.fr-en" + "un/undoc.2000.fr-en" + "training/news-commentary-v9.fr-en" + "giga-fren.release2.fixed" +) + +if [ ! -d "$SCRIPTS" ]; then + echo "Please set SCRIPTS variable correctly to point to Moses scripts." + exit +fi + +src=en +tgt=fr +lang=en-fr +prep=wmt14_en_fr +tmp=$prep/tmp +origin=origin + +mkdir -p $origin $tmp $prep + +cd $origin + +for ((i=0;i<${#URLS[@]};++i)); do + file=${FILES[i]} + if [ -f $file ]; then + echo "$file already exists, skipping download" + else + url=${URLS[i]} + wget "$url" --no-check-certificate + if [ -f $file ]; then + echo "$url successfully downloaded." + else + echo "$url not successfully downloaded." + exit -1 + fi + if [ ${file: -4} == ".tgz" ]; then + tar zxvf $file + elif [ ${file: -4} == ".tar" ]; then + tar xvf $file + fi + fi +done + +gunzip giga-fren.release2.fixed.*.gz +cd .. 
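+# Note: the EN-FR corpora (UN, news-commentary and especially giga-fren) are considerably larger than
+# the EN-DE ones, so the normalization/tokenization below needs a sizeable amount of disk space and time.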
+ +echo "pre-processing train data..." +for l in $src $tgt; do + rm $tmp/train.tags.$lang.tok.$l + for f in "${CORPORA[@]}"; do + cat $origin/$f.$l | \ + perl $NORM_PUNC $l | \ + perl $REM_NON_PRINT_CHAR | \ + perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l + done +done + +echo "pre-processing test data..." +for l in $src $tgt; do + if [ "$l" == "$src" ]; then + t="src" + else + t="ref" + fi + grep '<seg id' $origin/test-full/newstest2014-fren-$t.$l.sgm | \ + sed -e 's/<seg id="[0-9]*">\s*//g' | \ + sed -e 's/\s*<\/seg>\s*//g' | \ + sed -e "s/\’/\'/g" | \ + perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l + echo "" +done + +echo "splitting train and dev..." +for l in $src $tgt; do + awk '{if (NR%1333 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/dev.$l + awk '{if (NR%1333 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l +done + +TRAIN=$tmp/train.fr-en +BPE_CODE=$prep/code +rm -f $TRAIN +for l in $src $tgt; do + cat $tmp/train.$l >> $TRAIN +done + +echo "learn_bpe.py on ${TRAIN}..." +python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE + +for L in $src $tgt; do + for f in train.$L dev.$L test.$L; do + echo "apply_bpe.py to ${f}..." + python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f + done +done + +perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250 +perl $CLEAN -ratio 1.5 $tmp/bpe.dev $src $tgt $prep/dev 1 250 + +for L in $src $tgt; do + cp $tmp/bpe.test.$L $prep/test.$L +done + +cd - diff --git a/examples/machine_translation/preprocessor/preprocessor.py b/examples/machine_translation/preprocessor/preprocessor.py new file mode 100644 index 0000000000000000000000000000000000000000..bc434a76478796478c79c06f39cb25e360b998a0 --- /dev/null +++ b/examples/machine_translation/preprocessor/preprocessor.py @@ -0,0 +1,326 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. + +import argparse +import os +import shutil +from itertools import zip_longest +from pprint import pprint + +from paddlenlp.data import Vocab +from paddlenlp.utils.log import logger + + +def get_preprocessing_parser(): + parser = argparse.ArgumentParser() + + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--train_pref", default=None, type=str, help="The prefix for train file and also used to save dict. " + ) + parser.add_argument( + "--dev_pref", + default=None, + type=str, + help="The prefixes for dev file and use comma to separate. " + "(words missing from train set are replaced with <unk>)", + ) + parser.add_argument( + "--test_pref", + default=None, + type=str, + help="The prefixes for test file and use comma to separate. 
" + "(words missing from train set are replaced with <unk>)", + ) + parser.add_argument( + "--dest_dir", + default="./data/", + type=str, + help="The destination dir to save processed train, dev and test file. ", + ) + parser.add_argument( + "--threshold_trg", default=0, type=int, help="Map words appearing less than threshold times to unknown. " + ) + parser.add_argument( + "--threshold_src", default=0, type=int, help="Map words appearing less than threshold times to unknown. " + ) + parser.add_argument("--src_vocab", default=None, type=str, help="Reuse given source dictionary. ") + parser.add_argument("--trg_vocab", default=None, type=str, help="Reuse given target dictionary. ") + parser.add_argument("--nwords_trg", default=None, type=int, help="The number of target words to retain. ") + parser.add_argument("--nwords_src", default=None, type=int, help="The number of source words to retain. ") + parser.add_argument("--align_file", default=None, help="An alignment file (optional). ") + parser.add_argument("--joined_dictionary", action="store_true", help="Generate joined dictionary. ") + parser.add_argument("--only_source", action="store_true", help="Only process the source language. ") + parser.add_argument( + "--dict_only", action="store_true", help="Only builds a dictionary and then exits if it's set." + ) + parser.add_argument("--bos_token", default="<s>", type=str, help="bos_token. ") + parser.add_argument("--eos_token", default="</s>", type=str, help="eos_token. ") + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The token used for padding. If it's None, the bos_token will be used. Defaults to None. ", + ) + parser.add_argument("--unk_token", default="<unk>", type=str, help="Unk token. ") + parser.add_argument("--apply_bpe", action="store_true", help="Whether to apply bpe to the files. ") + parser.add_argument( + "--bpe_code", default=None, type=str, help="The code used for bpe. Must be provided when --apply_bpe is set. " + ) + + args = parser.parse_args() + return args + + +def _train_path(lang, train_pref): + return "{}{}".format(train_pref, ("." + lang) if lang else "") + + +def _dev_path(lang, dev_pref): + return "{}{}".format(dev_pref, ("." + lang) if lang else "") + + +def _test_path(lang, test_pref): + return "{}{}".format(test_pref, ("." + lang) if lang else "") + + +def _file_name(prefix, lang): + fname = prefix + if lang is not None: + fname += ".{lang}".format(lang=lang) + return fname + + +def _dest_path(prefix, lang, dest_dir): + return os.path.join(dest_dir, _file_name(prefix, lang)) + + +def _dict_path(lang, dest_dir): + return _dest_path("dict", lang, dest_dir) + ".txt" + + +def _build_dictionary(filenames, args, src=False, trg=False): + assert src ^ trg, "src and trg cannot be both True or both False. 
" + + if not isinstance(filenames, (list, tuple)): + filenames = [filenames] + + tokens = [] + for file in filenames: + with open(file, "r") as f: + lines = f.readlines() + for line in lines: + tokens.append(line.strip().split()) + + return Vocab.build_vocab( + tokens, + max_size=args.nwords_src if src else args.nwords_trg, + min_freq=args.threshold_src if src else args.threshold_trg, + unk_token=args.unk_token, + pad_token=args.pad_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + ) + + +def _make_dataset(vocab, input_prefix, output_prefix, lang, args): + # Copy original text file to destination folder + output_text_file = _dest_path( + output_prefix + ".{}-{}".format(args.src_lang, args.trg_lang), + lang, + args.dest_dir, + ) + + shutil.copyfile(_file_name(input_prefix, lang), output_text_file) + + +def _make_all(lang, vocab, args): + if args.train_pref: + _make_dataset(vocab, args.train_pref, "train", lang, args=args) + + if args.dev_pref: + for k, dev_pref in enumerate(args.dev_pref.split(",")): + out_prefix = "dev{}".format(k) if k > 0 else "dev" + _make_dataset(vocab, dev_pref, out_prefix, lang, args=args) + + if args.test_pref: + for k, test_pref in enumerate(args.test_pref.split(",")): + out_prefix = "test{}".format(k) if k > 0 else "test" + _make_dataset(vocab, test_pref, out_prefix, lang, args=args) + + +def _align_files(args, src_vocab, trg_vocab): + assert args.train_pref, "--train_pref must be set if --align_file is specified" + src_file_name = _train_path(args.src_lang, args.train_pref) + trg_file_name = _train_path(args.trg_lang, args.train_pref) + freq_map = {} + + with open(args.align_file, "r", encoding="utf-8") as align_file: + with open(src_file_name, "r", encoding="utf-8") as src_file: + with open(trg_file_name, "r", encoding="utf-8") as trg_file: + for a, s, t in zip_longest(align_file, src_file, trg_file): + si = src_vocab.to_indices(s) + ti = trg_vocab.to_indices(t) + ai = list(map(lambda x: tuple(x.split("\t")), a.split())) + for sai, tai in ai: + src_idx = si[int(sai)] + trg_idx = ti[int(tai)] + if src_idx != src_vocab.get_unk_token_id() and trg_idx != trg_vocab.get_unk_token_id(): + assert src_idx != src_vocab.get_pad_token_id() + assert src_idx != src_vocab.get_eos_token_id() + assert trg_idx != trg_vocab.get_pad_token_id() + assert trg_idx != trg_vocab.get_eos_token_id() + if src_idx not in freq_map: + freq_map[src_idx] = {} + if trg_idx not in freq_map[src_idx]: + freq_map[src_idx][trg_idx] = 1 + else: + freq_map[src_idx][trg_idx] += 1 + + align_dict = {} + for src_idx in freq_map.keys(): + align_dict[src_idx] = max(freq_map[src_idx], key=freq_map[src_idx].get) + + with open( + os.path.join( + args.dest_dir, + "alignment.{}-{}.txt".format(args.src_lang, args.trg_lang), + ), + "w", + encoding="utf-8", + ) as f: + for k, v in align_dict.items(): + print("{} {}".format(src_vocab[k], trg_vocab[v]), file=f) + + +def main(args): + os.makedirs(args.dest_dir, exist_ok=True) + pprint(args) + + if args.apply_bpe: + import fastBPE + + bpe = fastBPE.fastBPE(args.bpe_code) + filenames = [_train_path(lang, args.train_pref) for lang in [args.src_lang, args.trg_lang]] + for k, dev_pref in enumerate(args.dev_pref.split(",")): + filenames.extend([_dev_path(lang, args.dev_pref) for lang in [args.src_lang, args.trg_lang]]) + for k, test_pref in enumerate(args.test_pref.split(",")): + filenames.extend([_test_path(lang, args.test_pref) for lang in [args.src_lang, args.trg_lang]]) + + for file in filenames: + sequences = [] + with open(file, "r") as f: + lines = 
f.readlines() + for seq in lines: + sequences.append(seq.strip()) + + bpe_sequences = bpe.apply(sequences) + os.makedirs(os.path.join(args.train_pref, "tmp_bpe"), exist_ok=True) + shutil.copyfile(file, os.path.join(args.train_pref, "tmp_bpe", os.path.split(file)[-1])) + + with open(file, "w") as f: + for bpe_seq in bpe_sequences: + f.write(bpe_seq + "\n") + + # build dictionaries + target = not args.only_source + + if not args.src_vocab and os.path.exists(_dict_path(args.src_lang, args.dest_dir)): + raise FileExistsError(_dict_path(args.src_lang, args.dest_dir)) + + if target and not args.trg_vocab and os.path.exists(_dict_path(args.trg_lang, args.dest_dir)): + raise FileExistsError(_dict_path(args.trg_lang, args.dest_dir)) + + if args.joined_dictionary: + assert ( + not args.src_vocab or not args.trg_vocab + ), "Cannot use both --src_vocab and --trg_vocab with --joined_dictionary" + + if args.src_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + elif args.trg_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + assert args.train_pref, "--train_pref must be set if --src_vocab is not specified. " + src_vocab = _build_dictionary( + [_train_path(lang, args.train_pref) for lang in [args.src_lang, args.trg_lang]], args=args, src=True + ) + + trg_vocab = src_vocab + else: + if args.src_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + assert args.train_pref, "--train_pref must be set if --src_vocab is not specified" + src_vocab = _build_dictionary([_train_path(args.src_lang, args.train_pref)], args=args, src=True) + + if target: + if args.trg_vocab: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + assert args.train_pref, "--train_pref must be set if --trg_vocab is not specified" + trg_vocab = _build_dictionary([_train_path(args.trg_lang, args.train_pref)], args=args, trg=True) + else: + trg_vocab = None + + # save dictionaries + src_vocab.save_vocabulary(_dict_path(args.src_lang, args.dest_dir)) + if target and trg_vocab is not None: + trg_vocab.save_vocabulary(_dict_path(args.trg_lang, args.dest_dir)) + + if args.dict_only: + return + + _make_all(args.src_lang, src_vocab, args) + if target: + _make_all(args.trg_lang, trg_vocab, args) + + logger.info("Wrote preprocessed data to {}".format(args.dest_dir)) + + if args.align_file: + _align_files(args, src_vocab=src_vocab, trg_vocab=trg_vocab) + + +if __name__ == "__main__": + args = get_preprocessing_parser() + main(args) diff --git a/examples/machine_translation/requirements.txt b/examples/machine_translation/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..59ee8dc7f381d36665221bba3384844c881013e0 --- /dev/null +++ b/examples/machine_translation/requirements.txt @@ -0,0 +1,4 @@ +attrdict +easydict +pyyaml +subword_nmt diff --git a/examples/machine_translation/seq2seq/README.md b/examples/machine_translation/seq2seq/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2f271dfb6b4f26aff0cb2eada4a29c818c00a7ce --- /dev/null +++ 
b/examples/machine_translation/seq2seq/README.md @@ -0,0 +1,104 @@ +# Machine Translation using Seq2Seq with Attention + +以下是本范例模型的简要目录结构及说明: + +``` +. +├── deploy # 预测部署目录 +│ └── python +│ └── infer.py # 用预测模型进行推理的程序 +├── README.md # 文档,本文件 +├── args.py # 训练、预测、导出模型以及模型参数配置程序 +├── data.py # 数据读入程序 +├── train.py # 训练主程序 +├── predict.py # 预测主程序 +├── export_model.py # 导出预测模型的程序 +└── seq2seq_attn.py # 带注意力机制的翻译模型程序 +``` + +## 简介 + +Sequence to Sequence (Seq2Seq),使用编码器-解码器(Encoder-Decoder)结构,用编码器将源序列编码成vector,再用解码器将该vector解码为目标序列。Seq2Seq 广泛应用于机器翻译,自动对话机器人,文档摘要自动生成,图片描述自动生成等任务中。 + +本目录包含Seq2Seq的一个经典样例:机器翻译,带Attention机制的翻译模型。Seq2Seq翻译模型,模拟了人类在进行翻译类任务时的行为:先解析源语言,理解其含义,再根据该含义来写出目标语言的语句。更多关于机器翻译的具体原理和数学表达式,我们推荐参考飞桨官网[机器翻译案例](https://www.paddlepaddle.org.cn/documentation/docs/zh/user_guides/nlp_case/machine_translation/README.cn.html)。 + +## 模型概览 + +本模型中,在编码器方面,我们采用了基于LSTM的多层的RNN encoder;在解码器方面,我们使用了带注意力(Attention)机制的RNN decoder,在预测时我们使用柱搜索(beam search)算法来生成翻译的目标语句。 + +## 数据介绍 + +本教程使用[IWSLT'15 English-Vietnamese data ](https://nlp.stanford.edu/projects/nmt/)数据集中的英语到越南语的数据作为训练语料,tst2012的数据作为开发集,tst2013的数据作为测试集。 + +### 数据获取 +如果用户在初始化数据集时没有提供路径,数据集会自动下载到`paddlenlp.utils.env.DATA_HOME`的`IWSLT15/`路径下,例如在linux系统下,默认存储路径是`~/.paddlenlp/datasets/IWSLT15`。 + +## 模型训练 + +执行以下命令即可训练带有注意力机制的Seq2Seq机器翻译模型: + +```sh +python train.py \ + --num_layers 2 \ + --hidden_size 512 \ + --batch_size 128 \ + --dropout 0.2 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --device gpu \ + --model_path ./attention_models +``` + +各参数的具体说明请参阅 `args.py` 。训练程序会在每个epoch训练结束之后,save一次模型。 + +**NOTE:** 如需恢复模型训练,则`init_from_ckpt`只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=attention_models/5`即可,程序会自动加载模型参数`attention_models/5.pdparams`,也会自动加载优化器状态`attention_models/5.pdopt`。 + +## 模型预测 + +训练完成之后,可以使用保存的模型(由 `--init_from_ckpt` 指定)对测试集的数据集进行beam search解码。生成的翻译结果位于`--infer_output_file`指定的路径,预测命令如下: + +```sh +python predict.py \ + --num_layers 2 \ + --hidden_size 512 \ + --batch_size 128 \ + --dropout 0.2 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --init_from_ckpt attention_models/9 \ + --infer_output_file infer_output.txt \ + --beam_size 10 \ + --device gpu +``` + +各参数的具体说明请参阅 `args.py` ,注意预测时所用模型超参数需和训练时一致。 + +## 预测效果评价 +取第10个epoch的结果,用取beam_size为10的beam search解码,`predict.py`脚本在生成翻译结果之后,会调用`paddlenlp.metrics.BLEU`计算翻译结果的BLEU指标,最终计算出的BLEU分数为0.24329954822714048 + +## 保存预测模型 +这里指定的参数`export_path` 表示导出预测模型文件的前缀。保存时会添加后缀(`pdiparams`,`pdiparams.info`,`pdmodel`)。 +```shell +python export_model.py \ + --num_layers 2 \ + --hidden_size 512 \ + --batch_size 128 \ + --dropout 0.2 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --init_from_ckpt attention_models/9.pdparams \ + --beam_size 10 \ + --export_path ./infer_model/model +``` + +## 基于预测引擎推理 +然后按照如下的方式对IWSLT15数据集中的测试集(有标注的)进行预测(基于Paddle的[Python预测API](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/05_inference_deployment/inference/python_infer_cn.html)): + +```shell +cd deploy/python +python infer.py \ + --export_path ../../infer_model/model \ + --device gpu \ + --batch_size 128 \ + --infer_output_file infer_output.txt +``` diff --git a/examples/machine_translation/seq2seq/args.py b/examples/machine_translation/seq2seq/args.py new file mode 100644 index 0000000000000000000000000000000000000000..317917ab1189cb523acdef7a110752ee38db810d --- /dev/null +++ b/examples/machine_translation/seq2seq/args.py @@ -0,0 +1,61 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + + parser.add_argument("--learning_rate", type=float, default=0.001, help="learning rate for optimizer") + + parser.add_argument("--num_layers", type=int, default=1, help="layers number of encoder and decoder") + + parser.add_argument("--hidden_size", type=int, default=100, help="hidden size of encoder and decoder") + + parser.add_argument("--batch_size", type=int, help="batch size of each step") + + parser.add_argument("--max_epoch", type=int, default=12, help="max epoch for the training") + + parser.add_argument("--max_len", type=int, default=50, help="max length for source and target sentence") + + parser.add_argument("--dropout", type=float, default=0.2, help="drop probability") + + parser.add_argument("--init_scale", type=float, default=0.0, help="init scale for parameter") + + parser.add_argument("--max_grad_norm", type=float, default=5.0, help="max grad norm for global norm clip") + + parser.add_argument("--log_freq", type=int, default=100, help="The frequency to print training logs") + + parser.add_argument("--model_path", type=str, default="model", help="model path for model to save") + + parser.add_argument("--infer_output_file", type=str, default="infer_output", help="file name for inference output") + + parser.add_argument("--beam_size", type=int, default=10, help="file name for inference") + + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference." + ) + + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + + parser.add_argument( + "--export_path", + type=str, + default=None, + help="The output file prefix used to save the exported inference model.", + ) + + args = parser.parse_args() + return args diff --git a/examples/machine_translation/seq2seq/data.py b/examples/machine_translation/seq2seq/data.py new file mode 100644 index 0000000000000000000000000000000000000000..3e4f44901a42e3f23c11fe2d1a7cc065736ea94f --- /dev/null +++ b/examples/machine_translation/seq2seq/data.py @@ -0,0 +1,113 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
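+# data.py builds the IWSLT'15 en-vi dataloaders for this seq2seq example. A rough usage sketch
+# (argument values here are only illustrative):
+#
+#     from args import parse_args
+#     args = parse_args()  # e.g. --batch_size 128 --max_len 50
+#     train_loader, dev_loader, src_vocab_size, tgt_vocab_size, pad_id = create_train_loader(args)
+#     for src, src_length, tgt_in, tgt_out, tgt_mask in train_loader:
+#         ...  # feed the batch to the model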
+ +from functools import partial + +import numpy as np +import paddle + +from paddlenlp.data import Pad, SamplerHelper, Vocab +from paddlenlp.datasets import load_dataset + + +def create_train_loader(args): + batch_size = args.batch_size + max_len = args.max_len + + train_ds, dev_ds = load_dataset("iwslt15", splits=("train", "dev")) + src_vocab = Vocab.load_vocabulary(**train_ds.vocab_info["en"]) + tgt_vocab = Vocab.load_vocabulary(**train_ds.vocab_info["vi"]) + bos_id = src_vocab[src_vocab.bos_token] + eos_id = src_vocab[src_vocab.eos_token] + pad_id = eos_id + + def convert_example(example): + source = example["en"].split()[:max_len] + target = example["vi"].split()[:max_len] + + source = src_vocab.to_indices(source) + target = tgt_vocab.to_indices(target) + + return source, target + + key = lambda x, data_source: len(data_source[x][0]) + + # Truncate and convert example to ids + train_ds = train_ds.map(convert_example, lazy=False) + dev_ds = dev_ds.map(convert_example, lazy=False) + + train_batch_sampler = ( + SamplerHelper(train_ds).shuffle().sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) + ) + + dev_batch_sampler = SamplerHelper(dev_ds).sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) + + train_loader = paddle.io.DataLoader( + train_ds, + batch_sampler=train_batch_sampler, + collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + + dev_loader = paddle.io.DataLoader( + dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + + return train_loader, dev_loader, len(src_vocab), len(tgt_vocab), pad_id + + +def create_infer_loader(args): + batch_size = args.batch_size + test_ds = load_dataset("iwslt15", splits="test") + src_vocab = Vocab.load_vocabulary(**test_ds.vocab_info["en"]) + tgt_vocab = Vocab.load_vocabulary(**test_ds.vocab_info["vi"]) + bos_id = src_vocab[src_vocab.bos_token] + eos_id = src_vocab[src_vocab.eos_token] + pad_id = eos_id + + def convert_example(example): + source = example["en"].split() + target = example["vi"].split() + + source = src_vocab.to_indices(source) + target = tgt_vocab.to_indices(target) + + return source, target + + test_ds.map(convert_example) + test_batch_sampler = SamplerHelper(test_ds).batch(batch_size=batch_size) + + test_loader = paddle.io.DataLoader( + test_ds, + batch_sampler=test_batch_sampler, + collate_fn=partial(prepare_infer_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + return test_loader, len(src_vocab), len(tgt_vocab), bos_id, eos_id + + +def prepare_infer_input(insts, bos_id, eos_id, pad_id): + insts = [([bos_id] + inst[0] + [eos_id], [bos_id] + inst[1] + [eos_id]) for inst in insts] + src, src_length = Pad(pad_val=pad_id, ret_length=True)([inst[0] for inst in insts]) + return src, src_length + + +def prepare_train_input(insts, bos_id, eos_id, pad_id): + # Add eos token id and bos token id. + insts = [([bos_id] + inst[0] + [eos_id], [bos_id] + inst[1] + [eos_id]) for inst in insts] + # Pad sequence using eos id. 
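+    # The batch returned below is (src, src_length, decoder_input, labels, tgt_mask): decoder_input is the
+    # target shifted right (tgt[:, :-1]) and labels are the next tokens (tgt[:, 1:]), the usual
+    # teacher-forcing layout, while tgt_mask marks the non-padding positions so they can be masked in the loss.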
+ src, src_length = Pad(pad_val=pad_id, ret_length=True)([inst[0] for inst in insts]) + tgt, tgt_length = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([inst[1] for inst in insts]) + tgt_mask = (tgt[:, :-1] != pad_id).astype("float32") + return src, src_length, tgt[:, :-1], tgt[:, 1:, np.newaxis], tgt_mask diff --git a/examples/machine_translation/seq2seq/deploy/python/infer.py b/examples/machine_translation/seq2seq/deploy/python/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..de6f80e0778590208c81142faa663d115c85cde4 --- /dev/null +++ b/examples/machine_translation/seq2seq/deploy/python/infer.py @@ -0,0 +1,99 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import io +import sys + +sys.path.append("../../") + +import numpy as np # noqa: E402 +import paddle # noqa: E402 +from args import parse_args # noqa: E402 +from data import create_infer_loader # noqa: E402 +from predict import post_process_seq # noqa: E402 + +from paddlenlp.data import Vocab # noqa: E402 +from paddlenlp.datasets import load_dataset # noqa: E402 +from paddlenlp.metrics import BLEU # noqa: E402 + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + config = paddle.inference.Config(args.export_path + ".pdmodel", args.export_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + predictor = paddle.inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + return cls(predictor, input_handles, output_handles) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def predict(self, dataloader, infer_output_file, trg_idx2word, bos_id, eos_id): + cand_list = [] + with io.open(infer_output_file, "w", encoding="utf-8") as f: + for data in dataloader(): + finished_seq = self.predict_batch(data)[0] + finished_seq = finished_seq[:, :, np.newaxis] if len(finished_seq.shape) == 2 else finished_seq + finished_seq = np.transpose(finished_seq, [0, 2, 1]) + for ins in finished_seq: + for beam_idx, beam in enumerate(ins): + id_list = post_process_seq(beam, bos_id, eos_id) + 
word_list = [trg_idx2word[id] for id in id_list] + sequence = " ".join(word_list) + "\n" + f.write(sequence) + cand_list.append(word_list) + break + + test_ds = load_dataset("iwslt15", splits="test") + bleu = BLEU() + for i, data in enumerate(test_ds): + ref = data["vi"].split() + bleu.add_inst(cand_list[i], [ref]) + print("BLEU score is %s." % bleu.score()) + + +def main(): + args = parse_args() + + predictor = Predictor.create_predictor(args) + test_loader, src_vocab_size, tgt_vocab_size, bos_id, eos_id = create_infer_loader(args) + tgt_vocab = Vocab.load_vocabulary(**test_loader.dataset.vocab_info["vi"]) + trg_idx2word = tgt_vocab.idx_to_token + + predictor.predict(test_loader, args.infer_output_file, trg_idx2word, bos_id, eos_id) + + +if __name__ == "__main__": + main() diff --git a/examples/machine_translation/seq2seq/export_model.py b/examples/machine_translation/seq2seq/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..79b05c1dccb5a54a13437b66cffa4c63c1aa8a99 --- /dev/null +++ b/examples/machine_translation/seq2seq/export_model.py @@ -0,0 +1,57 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from args import parse_args +from data import create_infer_loader +from seq2seq_attn import Seq2SeqAttnInferModel + + +def main(): + args = parse_args() + _, src_vocab_size, tgt_vocab_size, bos_id, eos_id = create_infer_loader(args) + + # Build model and load trained parameters + model = Seq2SeqAttnInferModel( + src_vocab_size, + tgt_vocab_size, + args.hidden_size, + args.hidden_size, + args.num_layers, + args.dropout, + bos_id=bos_id, + eos_id=eos_id, + beam_size=args.beam_size, + max_out_len=256, + ) + + # Load the trained model + model.set_state_dict(paddle.load(args.init_from_ckpt)) + + # Wwitch to eval model + model.eval() + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # src + paddle.static.InputSpec(shape=[None], dtype="int64"), # src length + ], + ) + # Save converted static graph model + paddle.jit.save(model, args.export_path) + + +if __name__ == "__main__": + main() diff --git a/examples/machine_translation/seq2seq/predict.py b/examples/machine_translation/seq2seq/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..0da32d69d057141597a8d183331b3d8b34e1d1bd --- /dev/null +++ b/examples/machine_translation/seq2seq/predict.py @@ -0,0 +1,92 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import io + +import numpy as np +import paddle +from args import parse_args +from data import create_infer_loader +from seq2seq_attn import Seq2SeqAttnInferModel + +from paddlenlp.data import Vocab +from paddlenlp.metrics import BLEU + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + paddle.set_device(args.device) + + test_loader, src_vocab_size, tgt_vocab_size, bos_id, eos_id = create_infer_loader(args) + tgt_vocab = Vocab.load_vocabulary(**test_loader.dataset.vocab_info["vi"]) + + model = paddle.Model( + Seq2SeqAttnInferModel( + src_vocab_size, + tgt_vocab_size, + args.hidden_size, + args.hidden_size, + args.num_layers, + args.dropout, + bos_id=bos_id, + eos_id=eos_id, + beam_size=args.beam_size, + max_out_len=256, + ) + ) + + model.prepare() + + # Load the trained model + assert args.init_from_ckpt, "Please set reload_model to load the infer model." + model.load(args.init_from_ckpt) + + cand_list = [] + with io.open(args.infer_output_file, "w", encoding="utf-8") as f: + for data in test_loader(): + with paddle.no_grad(): + finished_seq = model.predict_batch(inputs=data)[0] + finished_seq = finished_seq[:, :, np.newaxis] if len(finished_seq.shape) == 2 else finished_seq + finished_seq = np.transpose(finished_seq, [0, 2, 1]) + for ins in finished_seq: + for beam_idx, beam in enumerate(ins): + id_list = post_process_seq(beam, bos_id, eos_id) + word_list = [tgt_vocab.to_tokens(id) for id in id_list] + sequence = " ".join(word_list) + "\n" + f.write(sequence) + cand_list.append(word_list) + break + + bleu = BLEU() + for i, data in enumerate(test_loader.dataset.data): + ref = data["vi"].split() + bleu.add_inst(cand_list[i], [ref]) + print("BLEU score is %s." % bleu.score()) + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/examples/machine_translation/seq2seq/seq2seq_attn.py b/examples/machine_translation/seq2seq/seq2seq_attn.py new file mode 100644 index 0000000000000000000000000000000000000000..5bbcf62b77c5e899d9baec8187c139749c14bac6 --- /dev/null +++ b/examples/machine_translation/seq2seq/seq2seq_attn.py @@ -0,0 +1,254 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
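+
+# LSTM encoder-decoder with input-feeding attention. Seq2SeqAttnModel is used
+# for training with teacher forcing; Seq2SeqAttnInferModel reuses its layers
+# and decodes with beam search for inference.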
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.nn.initializer as I + + +class CrossEntropyCriterion(nn.Layer): + def __init__(self): + super(CrossEntropyCriterion, self).__init__() + + def forward(self, predict, label, trg_mask): + cost = F.cross_entropy(input=predict, label=label, soft_label=False, reduction="none") + cost = paddle.squeeze(cost, axis=[2]) + masked_cost = cost * trg_mask + batch_mean_cost = paddle.mean(masked_cost, axis=[0]) + seq_cost = paddle.sum(batch_mean_cost) + + return seq_cost + + +class Seq2SeqEncoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, dropout_prob=0.0, init_scale=0.1): + super(Seq2SeqEncoder, self).__init__() + self.embedder = nn.Embedding( + vocab_size, + embed_dim, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.lstm = nn.LSTM( + input_size=embed_dim, + hidden_size=hidden_size, + num_layers=num_layers, + direction="forward", + dropout=dropout_prob if num_layers > 1 else 0.0, + ) + + def forward(self, sequence, sequence_length): + inputs = self.embedder(sequence) + encoder_output, encoder_state = self.lstm(inputs, sequence_length=sequence_length) + + return encoder_output, encoder_state + + +class AttentionLayer(nn.Layer): + def __init__(self, hidden_size, bias=False, init_scale=0.1): + super(AttentionLayer, self).__init__() + self.input_proj = nn.Linear( + hidden_size, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + bias_attr=bias, + ) + self.output_proj = nn.Linear( + hidden_size + hidden_size, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + bias_attr=bias, + ) + + def forward(self, hidden, encoder_output, encoder_padding_mask): + encoder_output = self.input_proj(encoder_output) + attn_scores = paddle.matmul(paddle.unsqueeze(hidden, [1]), encoder_output, transpose_y=True) + + if encoder_padding_mask is not None: + attn_scores = paddle.add(attn_scores, encoder_padding_mask) + + attn_scores = F.softmax(attn_scores) + attn_out = paddle.squeeze(paddle.matmul(attn_scores, encoder_output), [1]) + attn_out = paddle.concat([attn_out, hidden], 1) + attn_out = self.output_proj(attn_out) + return attn_out + + +class Seq2SeqDecoderCell(nn.RNNCellBase): + def __init__(self, num_layers, input_size, hidden_size, dropout_prob=0.0): + super(Seq2SeqDecoderCell, self).__init__() + if dropout_prob > 0.0: + self.dropout = nn.Dropout(dropout_prob) + else: + self.dropout = None + + self.lstm_cells = nn.LayerList( + [ + nn.LSTMCell(input_size=input_size + hidden_size if i == 0 else hidden_size, hidden_size=hidden_size) + for i in range(num_layers) + ] + ) + + self.attention_layer = AttentionLayer(hidden_size) + + def forward(self, step_input, states, encoder_output, encoder_padding_mask=None): + lstm_states, input_feed = states + new_lstm_states = [] + step_input = paddle.concat([step_input, input_feed], 1) + for i, lstm_cell in enumerate(self.lstm_cells): + out, new_lstm_state = lstm_cell(step_input, lstm_states[i]) + if self.dropout: + step_input = self.dropout(out) + else: + step_input = out + + new_lstm_states.append(new_lstm_state) + out = self.attention_layer(step_input, encoder_output, encoder_padding_mask) + return out, [new_lstm_states, out] + + +class Seq2SeqDecoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, dropout_prob=0.0, init_scale=0.1): + super(Seq2SeqDecoder, self).__init__() + 
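+        # The target-side embedding uses the same uniform initialization
+        # range (init_scale) as the encoder embedding.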
self.embedder = nn.Embedding( + vocab_size, + embed_dim, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + self.lstm_attention = nn.RNN( + Seq2SeqDecoderCell(num_layers, embed_dim, hidden_size, dropout_prob), is_reverse=False, time_major=False + ) + self.output_layer = nn.Linear( + hidden_size, + vocab_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + bias_attr=False, + ) + + def forward(self, trg, decoder_initial_states, encoder_output, encoder_padding_mask): + inputs = self.embedder(trg) + + decoder_output, _ = self.lstm_attention( + inputs, + initial_states=decoder_initial_states, + encoder_output=encoder_output, + encoder_padding_mask=encoder_padding_mask, + ) + predict = self.output_layer(decoder_output) + + return predict + + +class Seq2SeqAttnModel(nn.Layer): + def __init__( + self, + src_vocab_size, + trg_vocab_size, + embed_dim, + hidden_size, + num_layers, + dropout_prob=0.0, + eos_id=1, + init_scale=0.1, + ): + super(Seq2SeqAttnModel, self).__init__() + self.hidden_size = hidden_size + self.eos_id = eos_id + self.num_layers = num_layers + self.INF = 1e9 + self.encoder = Seq2SeqEncoder(src_vocab_size, embed_dim, hidden_size, num_layers, dropout_prob, init_scale) + self.decoder = Seq2SeqDecoder(trg_vocab_size, embed_dim, hidden_size, num_layers, dropout_prob, init_scale) + + def forward(self, src, src_length, trg): + encoder_output, encoder_final_state = self.encoder(src, src_length) + + # Transfer shape of encoder_final_states to [num_layers, 2, batch_size, hidden_size] + encoder_final_states = [(encoder_final_state[0][i], encoder_final_state[1][i]) for i in range(self.num_layers)] + # Construct decoder initial states: use input_feed and the shape is + # [[h,c] * num_layers, input_feed], consistent with Seq2SeqDecoderCell.states + decoder_initial_states = [ + encoder_final_states, + self.decoder.lstm_attention.cell.get_initial_states(batch_ref=encoder_output, shape=[self.hidden_size]), + ] + # Build attention mask to avoid paying attention on padddings + src_mask = (src != self.eos_id).astype(paddle.get_default_dtype()) + encoder_padding_mask = (src_mask - 1.0) * self.INF + encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1]) + + predict = self.decoder(trg, decoder_initial_states, encoder_output, encoder_padding_mask) + return predict + + +class Seq2SeqAttnInferModel(Seq2SeqAttnModel): + def __init__( + self, + src_vocab_size, + trg_vocab_size, + embed_dim, + hidden_size, + num_layers, + dropout_prob=0.0, + bos_id=0, + eos_id=1, + beam_size=4, + max_out_len=256, + ): + args = dict(locals()) + args.pop("self") + args.pop("__class__", None) + self.bos_id = args.pop("bos_id") + self.beam_size = args.pop("beam_size") + self.max_out_len = args.pop("max_out_len") + self.num_layers = num_layers + super(Seq2SeqAttnInferModel, self).__init__(**args) + # Dynamic decoder for inference + self.beam_search_decoder = nn.BeamSearchDecoder( + self.decoder.lstm_attention.cell, + start_token=bos_id, + end_token=eos_id, + beam_size=beam_size, + embedding_fn=self.decoder.embedder, + output_fn=self.decoder.output_layer, + ) + + def forward(self, src, src_length): + encoder_output, encoder_final_state = self.encoder(src, src_length) + + encoder_final_state = [(encoder_final_state[0][i], encoder_final_state[1][i]) for i in range(self.num_layers)] + + # Initial decoder initial states + decoder_initial_states = [ + encoder_final_state, + 
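+            # The second state is the input feed, initialized by
+            # get_initial_states to zeros of shape [batch_size, hidden_size].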
self.decoder.lstm_attention.cell.get_initial_states(batch_ref=encoder_output, shape=[self.hidden_size]), + ] + # Build attention mask to avoid paying attention on paddings + src_mask = (src != self.eos_id).astype(paddle.get_default_dtype()) + + encoder_padding_mask = (src_mask - 1.0) * self.INF + encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1]) + + # Tile the batch dimension with beam_size + encoder_output = nn.BeamSearchDecoder.tile_beam_merge_with_batch(encoder_output, self.beam_size) + encoder_padding_mask = nn.BeamSearchDecoder.tile_beam_merge_with_batch(encoder_padding_mask, self.beam_size) + + # Dynamic decoding with beam search + seq_output, _ = nn.dynamic_decode( + decoder=self.beam_search_decoder, + inits=decoder_initial_states, + max_step_num=self.max_out_len, + encoder_output=encoder_output, + encoder_padding_mask=encoder_padding_mask, + ) + return seq_output diff --git a/examples/machine_translation/seq2seq/train.py b/examples/machine_translation/seq2seq/train.py new file mode 100644 index 0000000000000000000000000000000000000000..fec0708040f5e3f28f3cbc713ed225eb0b86751b --- /dev/null +++ b/examples/machine_translation/seq2seq/train.py @@ -0,0 +1,64 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
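+
+# Trains the attention-based Seq2Seq model on the IWSLT'15 en-vi dataset with
+# the paddle.Model high-level API, reporting perplexity on the dev set.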
+ +import paddle +import paddle.nn as nn +from args import parse_args +from data import create_train_loader +from seq2seq_attn import CrossEntropyCriterion, Seq2SeqAttnModel + +from paddlenlp.metrics import Perplexity + + +def do_train(args): + paddle.set_device(args.device) + + # Define dataloader + train_loader, eval_loader, src_vocab_size, tgt_vocab_size, eos_id = create_train_loader(args) + + model = paddle.Model( + Seq2SeqAttnModel( + src_vocab_size, tgt_vocab_size, args.hidden_size, args.hidden_size, args.num_layers, args.dropout, eos_id + ) + ) + + grad_clip = nn.ClipGradByGlobalNorm(args.max_grad_norm) + optimizer = paddle.optimizer.Adam( + learning_rate=args.learning_rate, parameters=model.parameters(), grad_clip=grad_clip + ) + + ppl_metric = Perplexity() + model.prepare(optimizer, CrossEntropyCriterion(), ppl_metric) + + print(args) + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + benchmark_logger = paddle.callbacks.ProgBarLogger(log_freq=args.log_freq, verbose=3) + + model.fit( + train_data=train_loader, + eval_data=eval_loader, + epochs=args.max_epoch, + eval_freq=1, + save_freq=1, + save_dir=args.model_path, + callbacks=[benchmark_logger], + ) + + +if __name__ == "__main__": + args = parse_args() + do_train(args) diff --git a/examples/machine_translation/transformer/README.md b/examples/machine_translation/transformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e653a539a99d8047b3d4c643537c99001614a1ca --- /dev/null +++ b/examples/machine_translation/transformer/README.md @@ -0,0 +1,445 @@ +# Machine Translation using Transformer + +机器翻译(Machine Translation)是利用计算机将一种自然语言(源语言)转换为另一种自然语言(目标语言)的过程,输入为源语言句子,输出为相应的目标语言的句子。 + +本项目是机器翻译领域主流模型 Transformer 的 PaddlePaddle 实现,包含模型训练,预测以及使用自定义数据等内容。用户可以基于发布的内容搭建自己的翻译模型。 + +## 模型介绍 +Transformer 是论文 [Attention Is All You Need](https://arxiv.org/abs/1706.03762) 中提出的用以完成机器翻译(Machine Translation)等序列到序列(Seq2Seq)学习任务的一种全新网络结构,其完全使用注意力(Attention)机制来实现序列到序列的建模[1]。 + +<p align="center"> +<img src="images/transformer_network.png" height=400 hspace='10'/> <br /> +图 1. Transformer 网络结构图 +</p> + +相较于此前 Seq2Seq 模型中广泛使用的循环神经网络(Recurrent Neural Network, RNN),使用Self Attention进行输入序列到输出序列的变换主要具有以下优势: + +- 计算复杂度小 + - 特征维度为 d 、长度为 n 的序列,在 RNN 中计算复杂度为 `O(n * d * d)` (n 个时间步,每个时间步计算 d 维的矩阵向量乘法),在 Self-Attention 中计算复杂度为 `O(n * n * d)` (n 个时间步两两计算 d 维的向量点积或其他相关度函数),n 通常要小于 d 。 +- 计算并行度高 + - RNN 中当前时间步的计算要依赖前一个时间步的计算结果;Self-Attention 中各时间步的计算只依赖输入不依赖之前时间步输出,各时间步可以完全并行。 +- 容易学习长程依赖(long-range dependencies) + - RNN 中相距为 n 的两个位置间的关联需要 n 步才能建立;Self-Attention 中任何两个位置都直接相连;路径越短信号传播越容易。 + +Transformer 中引入使用的基于 Self-Attention 的序列建模模块结构,已被广泛应用在 Bert [2]等语义表示模型中,取得了显著效果。 + +### 模型特点 + +Transformer 中的 Encoder 由若干相同的 layer 堆叠组成,每个 layer 主要由多头注意力(Multi-Head Attention)和全连接的前馈(Feed-Forward)网络这两个 sub-layer 构成。 +- Multi-Head Attention 在这里用于实现 Self-Attention,相比于简单的 Attention 机制,其将输入进行多路线性变换后分别计算 Attention 的结果,并将所有结果拼接后再次进行线性变换作为输出。参见图2,其中 Attention 使用的是点积(Dot-Product),并在点积后进行了 scale 的处理以避免因点积结果过大进入 softmax 的饱和区域。 +- Feed-Forward 网络会对序列中的每个位置进行相同的计算(Position-wise),其采用的是两次线性变换中间加以 ReLU 激活的结构。 + +此外,每个 sub-layer 后还施以 Residual Connection [3] 和 Layer Normalization [4] 来促进梯度传播和模型收敛。 + +<p align="center"> +<img src="images/multi_head_attention.png" height=300 hspace='10'/> <br /> +图 2. 
Multi-Head Attention +</p> + +Decoder 具有和 Encoder 类似的结构,只是相比于组成 Encoder 的 layer ,在组成 Decoder 的 layer 中还多了一个 Multi-Head Attention 的 sub-layer 来实现对 Encoder 输出的 Attention,这个 Encoder-Decoder Attention 在其他 Seq2Seq 模型中也是存在的。 + +## 数据准备 + +本示例可以使用 PaddleNLP 内置的处理好的 WMT14 EN-DE 翻译的数据进行训练、预测,也可以使用自定义数据集。数据准备部分可以参考前页文档 [使用自定义翻译数据集](../README.md)。 + +## 动态图 + +### 使用内置数据集进行训练 + +以下文档,介绍了使用 PaddleNLP 内置的处理好的 WMT14 EN-DE 翻译数据集的训练方式。 + +#### 单机单卡 + +以提供的英德翻译数据为例,可以执行以下命令进行模型训练: + +``` sh +# Setting visible devices for training +export CUDA_VISIBLE_DEVICES=0 +python train.py --config ./configs/transformer.base.yaml +``` + +可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中设置相应的参数。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 + +如果是在单卡下进行训练,可能需要适当调整下参数,比如考虑增大 `warmup_steps` 参数为 `16000`,相关的设置可以参考 `configs/transformer.big.yaml` 或是 `configs/transformer.base.yaml` 配置文件中各个选项。 + +#### 单机多卡 + +同样,可以执行如下命令实现八卡训练: + +``` sh +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" train.py --config ./configs/transformer.base.yaml +``` + +与上面的情况相似,可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中设置相应的参数。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 + +### 使用自定义数据集进行训练 + +自定义数据集与内置数据集训练的方式基本上是一致的,不过需要额外提供数据文件的路径。可以参照以下文档。 + +#### 单机单卡 + +本示例这里略去自定义数据下载、处理的步骤,如果需要,可以参考前页文档 [使用自定义翻译数据集](../README.md)。 + +本示例以处理好的 WMT14 数据为例。 + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python train.py \ + --config configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.en ${DATA_DEST_DIR}/train.de-en.de \ + --dev_file ${DATA_DEST_DIR}/dev.de-en.en ${DATA_DEST_DIR}/dev.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" \ + --pad_token "<s>" +``` + +`train.py` 脚本中,各个参数的含义如下: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--train_file`: 指明训练所需要的 `train` 训练集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,一组平行语料的源语言和目标语言,依次两个文件的路径和名称,`--train_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`。比如,`--train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en`。 +* `--dev_file`: 指明训练所需要的 `dev` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,一组平行语料的源语言和目标语言,依次两个文件的路径和名称,`--dev_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`。比如,`--dev_file ${DATA_DEST_DIR}/dev.de-en.de ${DATA_DEST_DIR}/dev.de-en.en`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 
若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--batch_size`: 指明训练时,一个 batch 里面,最多的 token 的数目。默认为 config 中设置的 4096。 +* `--max_iter`: 指明训练时,需要训练的最大的 step 的数目,默认为 None。表示使用 config 中指定的 `epoch: 30` 来作为最大的迭代的 epoch 的数量,而不是 step。 +* `--use_amp`: 是否使用混合精度训练。设置的类型是一个 `str`,可以是 `['true', 'false', 'True', 'False']` 中任意一个。默认不使用混合精度训练。 +* `--amp_level`: 若使用混合精度,则指明混合精度的级别。可以是 `['O1', 'O2']` 中任意一个。默认是 `O1`。 + +#### 单机多卡 + +单机多卡的执行方式与单机打卡差别不大,需要额外加上单机多卡的启动命令,如下所示: + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" train.py \ + --config configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.en ${DATA_DEST_DIR}/train.de-en.de \ + --dev_file ${DATA_DEST_DIR}/dev.de-en.en ${DATA_DEST_DIR}/dev.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +其余启动参数与单机单卡相同,这里不再累述。 + +### 模型推断 + +#### 使用内置数据集进行预测 + +如果是基于内置的数据集训练得到的英德翻译的模型,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: + +``` sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +python predict.py --config ./configs/transformer.base.yaml +``` + +翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 + +需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 + +另外 `predict.py` 中使用的 `TransformerGenerator` 接口对于GPU预测将在适配的条件下自动切换到 `FastGeneration` 预测加速版本(期间会进行jit编译), `FastGeneration` 的更多内容可以参考 `fast_transformer/README.md`。 + +#### 基于自定义数据集进行预测 + +本示例同样支持自定义数据集进行预测。可以参照以下文档。 + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python predict.py \ + --config configs/transformer.base.yaml \ + --test_file ${DATA_DEST_DIR}/test.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +以下是各个参数的含义: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--test_file`: 指明训练所需要的 `test` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,传入源语言的文件。比如,`--test_file ${DATA_DEST_DIR}/test.de-en.de`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 
token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--without_ft`: 本示例在预测时,支持了 GPU 的翻译预测的加速,如果不使用加速特性,可以设置 `--without_ft` 即会执行普通的 PaddlePaddle 动态图预测。 + +翻译结果会输出到 config 文件中 `output_file` 条目指定的文件中。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。 + +#### 导出静态图预测模型与预测引擎预测 + +Transformer 同时提供了将训练的动态图的 checkpoint 转成静态图模型功能,并提供了对应的使用预测引擎进行预测推理的方法。具体的使用方式如下: + +首先是进行动转静,使用 `export_model.py` 脚本完成将动态图的 checkpoint 转成静态图的模型,并保存成 inference 的模型。 + +``` sh +python export_model.py --config ./configs/transformer.base.yaml +``` + +模型默认保存在 `infer_model/` 路径下面。可以在 `configs/` 路径下的配置文件中更改 `inference_model_dir` 配置,从而保存至自定义的路径。 + +同样,因为模型导出会用到模型的词表等信息,所以如果是**自定义数据集**,仍需要传入所使用的词表。 + +``` bash +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python export_model.py \ + --config ./configs/transformer.base.yaml \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" +``` + +其中: + +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 + +#### 使用 Paddle Inference API 进行推理 + +准备好以上模型之后,可以使用预测引擎 Paddle Inference API 进行推理。 + +如果使用 Paddle Inference Python API,可以参考[使用 Paddle Inference Python API 推理](./deploy/python/README.md)。 + +如果使用 Paddle Inference C++ API,可以参考[使用 Paddle Inference C++ API 推理](./deploy/cpp/README.md)。 + +#### 使用 Paddle Serving 进行推理 + +除了使用 Paddle Inference API 进行本地推理外,还可以使用 Paddle Serving 实现在服务器上部署推理模型,客户端发送数据进行推理。可以参考[使用 Paddle Serving 推理](./deploy/serving/README.md)。 + +## 静态图 + +在静态图中,本示例仍然可以选择内置数据集进行训练或是使用自定义数据集进行训练。 + +### 使用内置数据集进行训练 + +#### 单机单卡 + +如果是需要单机单卡训练,则使用下面的命令进行训练: +``` shell +cd static/ +export CUDA_VISIBLE_DEVICES=0 +python train.py --config ../configs/transformer.base.yaml +``` + +建议可以在单卡执行的时候,尝试增大 `warmup_steps`。可以修改 `configs/transformer.big.yaml` 或是 `configs/transformer.base.yaml` 中对应参数。 + +#### 单机多卡 + +如果是需要单机多卡训练,则使用下面的命令进行训练: + +##### PE 的方式启动单机多卡: +``` shell +cd static/ +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +python train.py --config ../configs/transformer.base.yaml +``` + +##### fleet 的方式启动单机多卡: +``` shell +cd static/ +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" train.py --config ../configs/transformer.base.yaml --distributed +``` + +需要注意的是,使用 fleet 的方式启动单机多卡务必设置 `--distributed`。 + +### 使用自定义数据集进行训练 + +静态图和动态图在训练脚本启动上差别不大,仍然需要指明对应的文件的位置。可以参照以下文档。 + +#### 单机单卡 + +本示例这里略去自定义数据下载、处理的步骤,如果需要,可以参考前页文档 [使用自定义翻译数据集](../README.md)。 + +本示例以处理好的 WMT14 数据为例。 + +``` bash +cd static/ +export CUDA_VISIBLE_DEVICES=0 + 
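+# DATA_DEST_DIR 指向前文自定义数据处理步骤生成的 WMT14 en-de 数据目录。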
+DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ + +python train.py \ + --config configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.en ${DATA_DEST_DIR}/train.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +`train.py` 脚本中,各个参数的含义如下: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--train_file`: 指明训练所需要的 `train` 训练集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,一组平行语料的源语言和目标语言,依次两个文件的路径和名称,`--train_file ${SOURCE_LANG_FILE} ${TARGET_LANG_FILE}`。比如,`--train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--batch_size`: 指明训练时,一个 batch 里面,最多的 token 的数目。默认为 config 中设置的 4096。 +* `--max_iter`: 指明训练时,需要训练的最大的 step 的数目,默认为 None。表示使用 config 中指定的 `epoch: 30` 来作为最大的迭代的 epoch 的数量,而不是 step。 + +#### 单机多卡 + +单机多卡下,执行方式与上文所述单机单卡传入自定义数据集方式相同。因静态图多卡有两种方式执行,所以这里会多一个参数: + +* `--distributed`:(**多卡训练需要**)指明是否是使用 fleet 来启动多卡。若设置,则使用 fleet 启动多卡。具体使用方式如下。 + +##### PE 的方式启动单机多卡: +``` shell +cd static/ +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +python train.py \ + --config ../configs/transformer.base.yaml \ + --train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +##### fleet 的方式启动单机多卡: +``` shell +cd static/ +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" train.py \ + --config ../configs/transformer.base.yaml \ + --distributed \ + --train_file ${DATA_DEST_DIR}/train.de-en.de ${DATA_DEST_DIR}/train.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +需要注意的是,使用 fleet 的方式启动单机多卡务必设置 `--distributed`。 + +#### 使用内置数据集进行预测 + +如果是基于内置的数据集训练得到的英德翻译的模型,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: + +``` sh +# setting visible devices for prediction +cd static/ +export CUDA_VISIBLE_DEVICES=0 +python predict.py --config ../configs/transformer.base.yaml +``` + +由 `predict_file` 
指定的文件中文本的翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 big model 的配置。 + +需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 + +#### 基于自定义数据集进行预测 + +本示例同样支持自定义数据集进行预测。可以参照以下文档。 + +``` bash +cd static/ +export CUDA_VISIBLE_DEVICES=0 + +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/wmt14_en_de/ +python predict.py \ + --config configs/transformer.base.yaml \ + --test_file ${DATA_DEST_DIR}/test.de-en.en \ + --src_vocab ${DATA_DEST_DIR}/dict.en.txt \ + --trg_vocab ${DATA_DEST_DIR}/dict.de.txt \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +以下是各个参数的含义: + +* `--config`: 指明所使用的 Transformer 的 config 文件,包括模型超参、训练超参等,默认是 `transformer.big.yaml`。即,默认训练 Transformer Big 模型。 +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--test_file`: 指明训练所需要的 `test` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,传入源语言的文件。比如,`--test_file ${DATA_DEST_DIR}/test.de-en.de`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 `pad_token` 使用。 +* `--without_ft`: 本示例在预测时,支持了 GPU 的翻译预测的加速,如果不使用加速特性,可以设置 `--without_ft` 即会执行普通的 PaddlePaddle 动态图预测。 + +翻译结果会输出到 config 文件中 `output_file` 条目指定的文件中。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `configs/transformer.big.yaml` 和 `configs/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。 + +## 使用 FastGeneration 实现预测 + +具体的说明可以参考 `fast_transformer/README.md`。`cd fast_transformer/` 即可查看。 + +## 模型评估 + +预测结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): + +``` sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +git clone https://github.com/moses-smt/mosesdecoder.git +# 以英德翻译 newstest2014 测试数据为例 +perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt +``` + +执行上述操作之后,可以看到类似如下的结果,此处结果是 big model 在 newstest2014 上的 BLEU 结果: +``` +BLEU = 27.48, 58.6/33.2/21.1/13.9 (BP=1.000, ratio=1.012, hyp_len=65312, ref_len=64506) +``` + +## FAQ + +**Q:** 预测结果中样本数少于输入的样本数是什么原因 +**A:** 若样本中最大长度超过 `transformer.base.yaml` 或是 `transformer.big.yaml` 中 `max_length` 的默认设置,请注意运行时增大 `max_length` 的设置,否则超长样本将被过滤。 + +**Q:** 预测时最大长度超过了训练时的最大长度怎么办 
+**A:** 由于训练时 `max_length` 的设置决定了保存模型 position encoding 的大小,若预测时长度超过 `max_length`,请调大该值,会重新生成更大的 position encoding 表。 + + +## 参考文献 +1. Vaswani A, Shazeer N, Parmar N, et al. [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010. +2. Devlin J, Chang M W, Lee K, et al. [Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805)[J]. arXiv preprint arXiv:1810.04805, 2018. +3. He K, Zhang X, Ren S, et al. [Deep residual learning for image recognition](http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778. +4. Ba J L, Kiros J R, Hinton G E. [Layer normalization](https://arxiv.org/pdf/1607.06450.pdf)[J]. arXiv preprint arXiv:1607.06450, 2016. +5. Sennrich R, Haddow B, Birch A. [Neural machine translation of rare words with subword units](https://arxiv.org/pdf/1508.07909)[J]. arXiv preprint arXiv:1508.07909, 2015. diff --git a/examples/machine_translation/transformer/configs/transformer.base.yaml b/examples/machine_translation/transformer/configs/transformer.base.yaml new file mode 100644 index 0000000000000000000000000000000000000000..faf1c65374d81abdba865221fe537141b454e5a6 --- /dev/null +++ b/examples/machine_translation/transformer/configs/transformer.base.yaml @@ -0,0 +1,138 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# Path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# Path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# Path of trained parameter, to make prediction +init_from_params: "./trained_models/step_final/" +# The directory for saving model +save_model: "trained_models" +# The directory for saving inference model +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The file to output the translation results of predict_file to. +output_file: "predict.txt" +# The <bos>, <eos> and <unk> tokens in the dictionary. +special_token: ["<s>", "<e>", "<unk>"] +# The data type of input ids. +input_dtype: "int64" + +# Device to use. +device: "gpu" + +# Args for reader, see reader.py for details +# The translation task to process. +task_name: "de-en" +src_lang: "en" +trg_lang: "de" +pool_size: 200000 +sort_type: "global" +batch_size: 4096 +infer_batch_size: 8 +shuffle_batch: True +# Data shuffle only works when sort_type is pool or none +shuffle: True +# shuffle_seed must be set when shuffle is True and using multi-cards to train. +# Otherwise, the number of batches cannot be guaranteed. +shuffle_seed: 128 +# For Dataloader num_workers +num_workers: 0 + +# Hyparams for training: +# The number of epoches for training +epoch: 30 + +# The hyper parameters for Adam optimizer. +# This static learning_rate will be applied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 2.0 +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# The parameters for learning rate scheduling. +warmup_steps: 4000 +# The weight used to mix up the ground-truth distribution and the fixed +# uniform distribution in label smoothing when training. +# Set this as zero if label smoothing is not wanted. 
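+# With eps, the target distribution becomes (1 - eps) * one_hot + eps / vocab_size;
+# 0.1 is the value used in the original Transformer paper.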
+label_smooth_eps: 0.1 + +# Hyparams for generation: +# The parameters for beam search. +# Indicating the strategy of beam search. It can be 'v1' or 'v2'. 'v2' would +# select the top `beam_size * 2` beams and process the top `beam_size` alive +# and finish beams in them separately, while 'v1' would only select the top +# `beam_size` beams and mix up the alive and finish beams. 'v2' always +# searchs more and get better results, since the alive beams would +# always be `beam_size` while the number of alive beams in `v1` might +# decrease when meeting the end token. However, 'v2' always generates +# longer results thus might do more calculation and be slower. +beam_search_version: "v1" +beam_size: 4 +max_out_len: 256 +# Indicating whether max_out_len in configurations is the length relative to +# that of source text. Only works in `v2` temporarily. +use_rel_len: False +# The power number in length penalty calculation. Only works in `v2` temporarily. +# Please refer to GNMT <https://arxiv.org/pdf/1609.08144.pdf>. +alpha: 0.6 +# Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural Generation +# <https://arxiv.org/abs/1611.08562>`_ for details. Bigger `diversity_rate` +# would lead to more diversity. if `diversity_rate == 0` is equivalent to naive +# BeamSearch. **NOTE**: Only works when using FastGeneration temporarily. +diversity_rate: 0.0 +# The number of decoded sentences to output. +n_best: 1 + +# Hyparams for model: +# These following five vocabularies related configurations will be set +# automatically according to the passed vocabulary path and special tokens. +# Size of source word dictionary. +src_vocab_size: 10000 +# Size of target word dictionay +trg_vocab_size: 10000 +# Used to pad vocab size to be multiple of pad_factor. +pad_factor: 8 +# Used to pad sequence length to be multiple of pad_seq. +pad_seq: 1 +# Used to make batch size to be multiple of bsz_multi. +bsz_multi: 8 +# Index for <bos> token +bos_idx: 0 +# Index for <eos> token +eos_idx: 1 +# Index for <unk> token +unk_idx: 2 +# Max length of sequences deciding the size of position encoding table. +max_length: 256 +# The dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# Size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# Number of head used in multi-head attention. +n_head: 8 +# Number of sub-layers to be stacked in the encoder and decoder. +n_layer: 6 +# Dropout rates. +dropout: 0.1 +# The flag indicating whether to share embedding and softmax weights. +# Vocabularies in source and target should be same for weight sharing. +weight_sharing: True +# Whether to apply pre-normalization or not. +normalize_before: True + +# Mixed precision training +use_amp: False +use_pure_fp16: False +scale_loss: 128.0 + +# Maximum iteration for training. +max_iter: None + +# enable to static ? +to_static: False diff --git a/examples/machine_translation/transformer/configs/transformer.big.yaml b/examples/machine_translation/transformer/configs/transformer.big.yaml new file mode 100644 index 0000000000000000000000000000000000000000..299637bc64a2043be762322da3480bb91b6124ba --- /dev/null +++ b/examples/machine_translation/transformer/configs/transformer.big.yaml @@ -0,0 +1,135 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. 
+print_step: 100 +# Path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# Path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# Path of trained parameter, to make prediction +init_from_params: "./trained_models/step_final/" +# The directory for saving model +save_model: "trained_models" +# The directory for saving inference model +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The file to output the translation results of predict_file to. +output_file: "predict.txt" +# The <bos>, <eos> and <unk> tokens in the dictionary. +special_token: ["<s>", "<e>", "<unk>"] +# The data type of input ids. +input_dtype: "int64" + +# Device to use. +device: "gpu" + +# Args for reader, see reader.py for details +# The translation task to process. +task_name: "de-en" +src_lang: "en" +trg_lang: "de" +pool_size: 200000 +sort_type: "global" +batch_size: 4096 +infer_batch_size: 8 +shuffle_batch: True +# Data shuffle only works when sort_type is pool or none +shuffle: True +# shuffle_seed must be set when shuffle is True and using multi-cards to train. +# Otherwise, the number of batches cannot be guaranteed. +shuffle_seed: 128 +# For Dataloader num_workers +num_workers: 0 + +# Hyparams for training: +# The number of epoches for training +epoch: 30 + +# The hyper parameters for Adam optimizer. +# This static learning_rate will be applied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 2.0 +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# The parameters for learning rate scheduling. +warmup_steps: 4000 +# The weight used to mix up the ground-truth distribution and the fixed +# uniform distribution in label smoothing when training. +# Set this as zero if label smoothing is not wanted. +label_smooth_eps: 0.1 + +# Hyparams for generation: +# The parameters for beam search. +# Indicating the strategy of beam search. It can be 'v1' or 'v2'. 'v2' would +# select the top `beam_size * 2` beams and process the top `beam_size` alive +# and finish beams in them separately, while 'v1' would only select the top +# `beam_size` beams and mix up the alive and finish beams. 'v2' always +# searchs more and get better results, since the alive beams would +# always be `beam_size` while the number of alive beams in `v1` might +# decrease when meeting the end token. However, 'v2' always generates +# longer results thus might do more calculation and be slower. +beam_search_version: "v1" +beam_size: 4 +max_out_len: 1024 +# Indicating whether max_out_len in configurations is the length relative to +# that of source text. Only works in `v2` temporarily. +use_rel_len: False +# The power number in length penalty calculation. Only works in `v2` temporarily. +# Please refer to GNMT <https://arxiv.org/pdf/1609.08144.pdf>. +alpha: 0.6 +# Refer to `A Simple, Fast Diverse Decoding Algorithm for Neural Generation +# <https://arxiv.org/abs/1611.08562>`_ for details. Bigger `diversity_rate` +# would lead to more diversity. if `diversity_rate == 0` is equivalent to naive +# BeamSearch. **NOTE**: Only works when using FastGeneration temporarily. +diversity_rate: 0.0 +# The number of decoded sentences to output. +n_best: 1 + +# Hyparams for model: +# These following five vocabularies related configurations will be set +# automatically according to the passed vocabulary path and special tokens. +# Size of source word dictionary. 
+src_vocab_size: 10000 +# Size of target word dictionay +trg_vocab_size: 10000 +# Used to pad vocab size to be multiple of pad_factor. +pad_factor: 8 +# Used to pad sequence length to be multiple of pad_seq. +pad_seq: 1 +# Used to make batch size to be multiple of bsz_multi. +bsz_multi: 8 +# Index for <bos> token +bos_idx: 0 +# Index for <eos> token +eos_idx: 1 +# Index for <unk> token +unk_idx: 2 +# Max length of sequences deciding the size of position encoding table. +max_length: 1024 +# The dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 1024 +# Size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 4096 +# Number of head used in multi-head attention. +n_head: 16 +# Number of sub-layers to be stacked in the encoder and decoder. +n_layer: 6 +# Dropout rates. +dropout: 0.1 +# The flag indicating whether to share embedding and softmax weights. +# Vocabularies in source and target should be same for weight sharing. +weight_sharing: True +# Whether to apply pre-normalization or not. +normalize_before: True + +# Mixed precision training +use_amp: False +use_pure_fp16: False +scale_loss: 128.0 + +# Maximum iteration for training. +max_iter: None diff --git a/examples/machine_translation/transformer/deploy/cpp/CMakeLists.txt b/examples/machine_translation/transformer/deploy/cpp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..cade70b6388bbdedd623093adc568c9602bc076b --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/CMakeLists.txt @@ -0,0 +1,82 @@ +project(cpp_inference_demo CXX C) +option(WITH_MKL "Compile demo with MKL/OpenBlas support, default use MKL." ON) +option(WITH_GPU "Compile demo with GPU/CPU, default use CPU." OFF) +option(WITH_STATIC_LIB "Compile demo with static/shared library, default use static." ON) +option(USE_TENSORRT "Compile demo with TensorRT." 
OFF) + +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -g") +set(CMAKE_STATIC_LIBRARY_PREFIX "") +message("flags" ${CMAKE_CXX_FLAGS}) + +if(NOT DEFINED PADDLE_LIB) + message(FATAL_ERROR "please set PADDLE_LIB with -DPADDLE_LIB=/path/paddle/lib") +endif() +if(NOT DEFINED DEMO_NAME) + message(FATAL_ERROR "please set DEMO_NAME with -DDEMO_NAME=demo_name") +endif() + + +include_directories("${PADDLE_LIB}") +include_directories("${PADDLE_LIB}/third_party/install/protobuf/include") +include_directories("${PADDLE_LIB}/third_party/install/glog/include") +include_directories("${PADDLE_LIB}/third_party/install/gflags/include") +include_directories("${PADDLE_LIB}/third_party/install/zlib/include") +include_directories("${PADDLE_LIB}/third_party/boost") +include_directories("${PADDLE_LIB}/third_party/eigen3") + +if (USE_TENSORRT AND WITH_GPU) + include_directories("${TENSORRT_ROOT}/include") + link_directories("${TENSORRT_ROOT}/lib") +endif() + +link_directories("${PADDLE_LIB}/third_party/install/zlib/lib") + +link_directories("${PADDLE_LIB}/third_party/install/protobuf/lib") +link_directories("${PADDLE_LIB}/third_party/install/glog/lib") +link_directories("${PADDLE_LIB}/third_party/install/gflags/lib") +link_directories("${PADDLE_LIB}/paddle/lib") + +add_executable(${DEMO_NAME} ${DEMO_NAME}.cc) + +if(WITH_MKL) + include_directories("${PADDLE_LIB}/third_party/install/mklml/include") + set(MATH_LIB ${PADDLE_LIB}/third_party/install/mklml/lib/libmklml_intel${CMAKE_SHARED_LIBRARY_SUFFIX} + ${PADDLE_LIB}/third_party/install/mklml/lib/libiomp5${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(MKLDNN_PATH "${PADDLE_LIB}/third_party/install/mkldnn") + if(EXISTS ${MKLDNN_PATH}) + include_directories("${MKLDNN_PATH}/include") + set(MKLDNN_LIB ${MKLDNN_PATH}/lib/libmkldnn.so.0) + endif() +else() + set(MATH_LIB ${PADDLE_LIB}/third_party/install/openblas/lib/libopenblas${CMAKE_STATIC_LIBRARY_SUFFIX}) +endif() + +# Note: libpaddle_inference_api.so/a must put before libpaddle_fluid.so/a +if(WITH_STATIC_LIB) + set(DEPS + ${PADDLE_LIB}/paddle/lib/libpaddle_inference${CMAKE_STATIC_LIBRARY_SUFFIX}) +else() + set(DEPS + ${PADDLE_LIB}/paddle/lib/libpaddle_inference${CMAKE_SHARED_LIBRARY_SUFFIX}) +endif() + +set(EXTERNAL_LIB "-lrt -ldl -lpthread") +set(DEPS ${DEPS} + ${MATH_LIB} ${MKLDNN_LIB} + glog gflags protobuf z + ${EXTERNAL_LIB}) + +if(WITH_GPU) + if (USE_TENSORRT) + set(DEPS ${DEPS} + ${TENSORRT_ROOT}/lib/libnvinfer${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(DEPS ${DEPS} + ${TENSORRT_ROOT}/lib/libnvinfer_plugin${CMAKE_SHARED_LIBRARY_SUFFIX}) + endif() + set(DEPS ${DEPS} ${CUDA_LIB}/libcudart${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(DEPS ${DEPS} ${CUDA_LIB}/libcudart${CMAKE_SHARED_LIBRARY_SUFFIX} ) + set(DEPS ${DEPS} /usr/lib/x86_64-linux-gnu/libcublas${CMAKE_SHARED_LIBRARY_SUFFIX} ) + set(DEPS ${DEPS} ${CUDNN_LIB}/libcudnn${CMAKE_SHARED_LIBRARY_SUFFIX} ) +endif() + +target_link_libraries(${DEMO_NAME} ${DEPS}) diff --git a/examples/machine_translation/transformer/deploy/cpp/README.md b/examples/machine_translation/transformer/deploy/cpp/README.md new file mode 100644 index 0000000000000000000000000000000000000000..67cf8b9b6997477258da4ac81b87cfc4e9f775a0 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/README.md @@ -0,0 +1,96 @@ +# 使用 Paddle Inference C++ API 推理 + +## 模型推理 + +通过前文介绍,我们可以获取导出后的预测模型。模型导出后,`infer_model/` 下的目录结构如下: + +``` text +. 
+├── transformer.pdiparams +├── transformer.pdiparams.info +└── transformer.pdmodel +``` + +可以将存有导出后模型的目录拷贝到当前路径下: + +``` sh +cp -rf ../../infer_model/ ./ +``` + +使用 C++ 进行推理需要提前先编译出可执行文件。编译的方式可以直接使用 `run.sh`,不过需要做一些指定。 + +首先打开 run.sh: + +``` sh +LIB_DIR=YOUR_LIB_DIR +CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR +CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR +MODEL_DIR=YOUR_MODEL_DIR +VOCAB_DIR=/root/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/vocab_all.bpe.33708 +DATA_DIR=/root/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en +``` + +需要依次指定: +* `LIB_DIR`: 所使用的 Paddle Inference 的库,即 `libpaddle_inference.so` 的位置。预测库的组织结构满足: + ```text + . + ├── CMakeCache.txt + ├── paddle/ + ├── include/ + └── lib/ + ├── third_party/ + ├── cudaerror/ + ├── install/ + └── threadpool/ + └── version.txt + ``` +* `CUDA_LIB_DIR`: 所使用的 CUDA 的库的位置。 +* `CUDNN_LIB_DIR`: 所使用的 CUDNN 的库的位置。 +* `MODEL_DIR`: 导出的模型的路径。 +* `VOCAB_DIR`: 词表的位置。 +* `DATA_DIR`: 需要推理的数据的位置,当前数据是经过 tokenize 以及 bpe 处理之后的序列用空格连接成的句子,并非原始数据。 + +可以简单执行如下语句完成编译以及推理整个过程。 + +``` sh +bash run.sh +``` + +以上步骤,如果全部正确执行,将会依次完成编译、预测全部过程。不过,如果需要自行执行可执行文件,编译完成后,其实,在 `build/bin/` 路径下会生成 `transformer_e2e` 的可执行文件,也可以直接执行这个可执行文件进行推理。 + +执行的参数及解释如下: + +``` sh +export CUDA_VISIBLE_DEVICES=0 +./build/bin/transformer_e2e -batch_size 8 -device gpu -gpu_id 0 -model_dir ./infer_model/ -vocab_file /root/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/vocab_all.bpe.33708 -data_file /root/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en +``` + +各个参数解释如下: +* `-batch_size`: 使用 Paddle Inference 的时候一个 batch 的句子数目。 +* `-device`: 使用的设备,可以是 gpu 或是 cpu。 +* `-gpu_id`: 若使用 gpu,则需要提供所使用的 gpu 的 id。 +* `-use_mkl`: 是否使用 mkl,设置代表使用 mkl,不设置则不使用 mkl。仅在使用 cpu 进行预测的时候有效。 +* `-threads`: 仅在使用 mkl 的时候起效,用于指定计算 math 库时的线程数。 +* `-model_dir`: 导出的模型的位置。 +* `-vocab_file`: 词表文件的位置。 +* `-data_file`: 推理用的数据的位置。 + +英德翻译的结果会保存到 `predict.txt` 文件中。 + +## 模型评估 + +推理结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): + +``` sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +git clone https://github.com/moses-smt/mosesdecoder.git +# 以英德翻译 newstest2014 测试数据为例 +perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt +``` + +执行上述操作之后,可以看到类似如下的结果,此处结果是 big model 在 newstest2014 上的 BLEU 结果: +``` +BLEU = 27.48, 58.6/33.2/21.1/13.9 (BP=1.000, ratio=1.012, hyp_len=65312, ref_len=64506) +``` diff --git a/examples/machine_translation/transformer/deploy/cpp/helper.h b/examples/machine_translation/transformer/deploy/cpp/helper.h new file mode 100644 index 0000000000000000000000000000000000000000..6b4d41f82d8d25a34750fdc3ce10c2d010052b98 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/helper.h @@ -0,0 +1,51 @@ +#pragma once +#include <gflags/gflags.h> +#include <glog/logging.h> +#include <glog/raw_logging.h> +#include <sys/time.h> +#include <chrono> // NOLINT +#include <numeric> +#include <sstream> +#include <string> +#include <vector> +#include "paddle/include/paddle_inference_api.h" + +namespace paddle { +namespace inference { +// Timer for timer +class Timer { +public: + std::chrono::high_resolution_clock::time_point start; + std::chrono::high_resolution_clock::time_point startu; + void tic() { start = std::chrono::high_resolution_clock::now(); } + double toc() { + 
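+    // Elapsed time in milliseconds since the last call to tic().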
startu = std::chrono::high_resolution_clock::now(); + std::chrono::duration<double> time_span = + std::chrono::duration_cast<std::chrono::duration<double>>(startu - + start); + double used_time_ms = static_cast<double>(time_span.count()) * 1000.0; + return used_time_ms; + } +}; + +static void split(const std::string &str, + char sep, + std::vector<std::string> *pieces) { + pieces->clear(); + if (str.empty()) { + return; + } + size_t pos = 0; + size_t next = str.find(sep, pos); + while (next != std::string::npos) { + pieces->push_back(str.substr(pos, next - pos)); + pos = next + 1; + next = str.find(sep, pos); + } + if (!str.substr(pos).empty()) { + pieces->push_back(str.substr(pos)); + } +} + +} // namespace inference +} // namespace paddle diff --git a/examples/machine_translation/transformer/deploy/cpp/run.sh b/examples/machine_translation/transformer/deploy/cpp/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..6c59397b5cadf548b3102d49620f9e4af0b44cd0 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/run.sh @@ -0,0 +1,21 @@ +#!/bin/bash +# Whether to use mkl or gpu +WITH_MKL=ON +DEVICE='gpu' + +# Please set: +# * Corresponding PaddlePaddle inference lib +# * Corresponding CUDA lib +# * Corresponding CUDNN lib +# * Corresponding model directory +# * Corresponding vocab directory +# * Corresponding data directory +LIB_DIR=YOUR_LIB_DIR +CUDA_LIB_DIR=YOUR_CUDA_LIB_DIR +CUDNN_LIB_DIR=YOUR_CUDNN_LIB_DIR +MODEL_DIR=YOUR_MODEL_DIR +# DATA_HOME is where paddlenlp stores dataset and can be returned by paddlenlp.utils.env.DATA_HOME. +VOCAB_DIR=DATA_HOME/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/vocab_all.bpe.33708 +DATA_DIR=DATA_HOME/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en + +bash run_impl.sh ${LIB_DIR} transformer_e2e ${MODEL_DIR} ${WITH_MKL} ${DEVICE} ${CUDNN_LIB_DIR} ${CUDA_LIB_DIR} ${VOCAB_DIR} ${DATA_DIR} diff --git a/examples/machine_translation/transformer/deploy/cpp/run_impl.sh b/examples/machine_translation/transformer/deploy/cpp/run_impl.sh new file mode 100644 index 0000000000000000000000000000000000000000..e8c5782af25e7d3643d104755106c27627086f2e --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/run_impl.sh @@ -0,0 +1,33 @@ +#!/bin/bash +mkdir -p build +cd build +rm -rf * + +LIB_DIR=$1 +DEMO_NAME=$2 +MODEL_FILE_DIR=$3 +WITH_MKL=$4 +DEVICE=$5 +CUDNN_LIB=$6 +CUDA_LIB=$7 +VOCAB_DIR=$8 +DATA_DIR=$9 + +WITH_GPU=OFF +if [[ $DEVICE="gpu" ]]; then + WITH_GPU=ON +fi + +cmake .. 
-DPADDLE_LIB=${LIB_DIR} \ + -DWITH_MKL=${WITH_MKL} \ + -DDEMO_NAME=${DEMO_NAME} \ + -DWITH_GPU=${WITH_GPU} \ + -DWITH_STATIC_LIB=OFF \ + -DUSE_TENSORRT=${USE_TENSORRT} \ + -DCUDNN_LIB=${CUDNN_LIB} \ + -DCUDA_LIB=${CUDA_LIB} \ + -DTENSORRT_ROOT=${TENSORRT_ROOT} + +make -j + +./${DEMO_NAME} -batch_size 8 -device ${DEVICE} -gpu_id 0 -model_dir ${MODEL_FILE_DIR} -vocab_file ${VOCAB_DIR} -data_file ${DATA_DIR} diff --git a/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc b/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc new file mode 100644 index 0000000000000000000000000000000000000000..2f21391f3cab25bde23302732836a2e334937ab3 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc @@ -0,0 +1,291 @@ +#include <pthread.h> +#include <algorithm> +#include <atomic> +#include <cstring> +#include <fstream> +#include <iostream> +#include <numeric> +#include <string> +#include <thread> +#include <unordered_map> + +#include "helper.h" + +#include <sys/time.h> +#include <unistd.h> +#include <cmath> +#include <cstdio> +#include <cstdlib> +#include <ctime> + +DEFINE_int32(batch_size, 1, "Batch size to do inference. "); +DEFINE_string(device, "gpu", "The device to do inference. Can be gpu or cpu. "); +DEFINE_int32(gpu_id, 0, "The gpu id to do inference. "); +DEFINE_bool(use_mkl, + false, + "Whether to use mkl when using cpu to do inference. "); +DEFINE_int32(threads, + 1, + "The number of threads to run math lib when using mkl. "); +DEFINE_string(model_dir, + "./infer_model/", + "The directory to the inference model. "); +DEFINE_string(vocab_file, + "./vocab_all.bpe.33708", + "The path to the vocabulary file. "); +DEFINE_string(data_file, + "./newstest2014.tok.bpe.33708.en", + "The path to the input data file. 
"); + +using namespace paddle_infer; + +std::string model_dir = ""; +std::string vocab_file = ""; +std::string data_file = ""; + +const int EOS_IDX = 1; +const int PAD_IDX = 0; +const int MAX_LENGTH = 256; +const int N_BEST = 1; + +int batch_size = 1; +int gpu_id = 0; + +namespace paddle { +namespace inference { + +struct DataInput { + std::vector<int64_t> src_data; +}; + +struct DataResult { + std::string result_q; +}; + +bool get_result_tensor(const std::unique_ptr<paddle_infer::Tensor>& seq_ids, + std::vector<DataResult>& dataresultvec, + std::unordered_map<int, std::string>& num2word_dict) { + std::vector<int> output_shape = seq_ids->shape(); + int batch_size = output_shape[0]; + int beam_num = output_shape[2]; + int out_num = std::accumulate( + output_shape.begin(), output_shape.end(), 1, std::multiplies<int>()); + std::vector<int64_t> seq_ids_out; + seq_ids_out.resize(out_num); + seq_ids->CopyToCpu(seq_ids_out.data()); + + dataresultvec.resize(batch_size * N_BEST); + auto max_output_length = output_shape[1]; + + for (int bsz = 0; bsz < batch_size; ++bsz) { + for (int k = 0; k < N_BEST; ++k) { + dataresultvec[bsz * N_BEST + k].result_q = ""; + for (int len = 0; len < max_output_length; ++len) { + if (seq_ids_out[bsz * max_output_length * beam_num + len * beam_num + + k] == EOS_IDX) { + break; + } + dataresultvec[bsz * N_BEST + k].result_q = + dataresultvec[bsz * N_BEST + k].result_q + + num2word_dict[seq_ids_out[bsz * max_output_length * beam_num + + len * beam_num + k]] + + " "; + } + } + } + return true; +} + +class DataReader { +public: + explicit DataReader(const std::string& path) + : file(new std::ifstream(path)) {} + + bool NextBatch(std::shared_ptr<paddle_infer::Predictor>& predictor, + const int& batch_size, + std::vector<std::string>& source_query_vec) { + std::string line; + std::vector<std::string> word_data; + std::vector<DataInput> data_input_vec; + int max_len = 0; + for (int i = 0; i < batch_size; i++) { + if (!std::getline(*file, line)) { + break; + } + DataInput data_input; + split(line, ' ', &word_data); + std::string query_str = ""; + for (int j = 0; j < word_data.size(); ++j) { + if (j >= MAX_LENGTH) { + break; + } + query_str += word_data[j]; + if (word2num_dict.find(word_data[j]) == word2num_dict.end()) { + data_input.src_data.push_back(word2num_dict["<unk>"]); + } else { + data_input.src_data.push_back(word2num_dict[word_data[j]]); + } + } + source_query_vec.push_back(query_str); + data_input.src_data.push_back(EOS_IDX); + max_len = std::max(max_len, static_cast<int>(data_input.src_data.size())); + max_len = std::min(max_len, MAX_LENGTH); + data_input_vec.push_back(data_input); + } + if (data_input_vec.empty()) { + return false; + } + return TensorMoreBatch( + predictor, data_input_vec, max_len, data_input_vec.size()); + } + + bool GetWordDict() { + std::ifstream fin(vocab_file); + std::string line; + int k = 0; + while (std::getline(fin, line)) { + word2num_dict[line] = k; + num2word_dict[k] = line; + k += 1; + } + + fin.close(); + + return true; + } + + std::unordered_map<std::string, int> word2num_dict; + std::unordered_map<int, std::string> num2word_dict; + std::unique_ptr<std::ifstream> file; + +private: + bool TensorMoreBatch(std::shared_ptr<paddle_infer::Predictor>& predictor, + std::vector<DataInput>& data_input_vec, + int max_len, + int batch_size) { + auto src_word_t = predictor->GetInputHandle("src_word"); + std::vector<int64_t> src_word_vec; + src_word_vec.resize(max_len * batch_size); + for (int i = 0; i < batch_size; ++i) { + for (int k = 0; k < 
max_len; ++k) { + if (k < data_input_vec[i].src_data.size()) { + src_word_vec[i * max_len + k] = data_input_vec[i].src_data[k]; + } else { + src_word_vec[i * max_len + k] = PAD_IDX; + } + } + } + src_word_t->Reshape({batch_size, max_len}); + src_word_t->CopyFromCpu(src_word_vec.data()); + + return true; + } +}; + + +template <typename... Args> +void SummaryConfig(const paddle_infer::Config& config, + double infer_time, + int num_batches, + int num_samples) { + LOG(INFO) << "----------------------- Model info ----------------------"; + LOG(INFO) << "model_name: " + << "transformer"; + LOG(INFO) << "model_type: " + << "FP32"; + LOG(INFO) << "----------------------- Data info -----------------------"; + LOG(INFO) << "batch_size: " << batch_size; + LOG(INFO) << "num_of_samples: " << num_samples; + LOG(INFO) << "----------------------- Conf info -----------------------"; + LOG(INFO) << "runtime_device: " << (config.use_gpu() ? "gpu" : "cpu"); + LOG(INFO) << "ir_optim: " << (config.ir_optim() ? "true" : "false"); + LOG(INFO) << "enable_memory_optim: " + << (config.enable_memory_optim() ? "true" : "false"); + if (config.use_gpu()) { + LOG(INFO) << "enable_tensorrt: " + << (config.tensorrt_engine_enabled() ? "true" : "false"); + } else { + LOG(INFO) << "enable_mkldnn: " + << (config.mkldnn_enabled() ? "true" : "false"); + LOG(INFO) << "cpu_math_library_num_threads: " + << config.cpu_math_library_num_threads(); + } + LOG(INFO) << "----------------------- Perf info -----------------------"; + LOG(INFO) << "average_latency(ms): " << infer_time / num_samples << ", " + << "QPS: " << num_samples / (infer_time / 1000.0); +} + + +void Main( + int batch_size, std::string device, int gpu_id, int use_mkl, int threads) { + Config config; + config.SetModel(model_dir + "/transformer.pdmodel", + model_dir + "/transformer.pdiparams"); + + if (device == "gpu") { + config.EnableUseGpu(100, gpu_id); + } else { + config.DisableGpu(); + if (use_mkl) { + config.EnableMKLDNN(); + config.SetCpuMathLibraryNumThreads(threads); + } + } + + config.SwitchUseFeedFetchOps(false); + config.SwitchSpecifyInputNames(true); + // When using fp16, fc_elementwise_layernorm_fuse_pass causes a little + // different translation results with original dygraph prediction, maybe you + // can turn off the IR optimization for same results as following: + // config.SwitchIrOptim(false); + auto predictor = CreatePredictor(config); + DataReader reader(data_file); + reader.GetWordDict(); + + double whole_time = 0; + Timer timer; + int num_batches = 0; + int num_samples = 0; + std::vector<std::string> source_query_vec; + std::ofstream out("predict.txt"); + + while (reader.NextBatch(predictor, batch_size, source_query_vec)) { + timer.tic(); + predictor->Run(); + std::vector<DataResult> dataresultvec; + auto output_names = predictor->GetOutputNames(); + get_result_tensor(predictor->GetOutputHandle(output_names[0]), + dataresultvec, + reader.num2word_dict); + + whole_time += timer.toc(); + num_batches++; + source_query_vec.clear(); + + if (out.is_open()) { + for (int i = 0; i < dataresultvec.size(); ++i) { + out << dataresultvec[i].result_q << "\n"; + } + } + num_samples += dataresultvec.size(); + } + SummaryConfig(config, whole_time, num_batches, num_samples); +} +} // namespace inference +} // namespace paddle + +int main(int argc, char** argv) { + gflags::ParseCommandLineFlags(&argc, &argv, true); + + batch_size = FLAGS_batch_size; + gpu_id = FLAGS_gpu_id; + + model_dir = FLAGS_model_dir; + vocab_file = FLAGS_vocab_file; + data_file = 
FLAGS_data_file;
+
+  paddle::inference::Main(
+      batch_size, FLAGS_device, gpu_id, FLAGS_use_mkl, FLAGS_threads);
+
+  return 0;
+}
diff --git a/examples/machine_translation/transformer/deploy/python/README.md b/examples/machine_translation/transformer/deploy/python/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6df683ecabde75bcfaf2aeaf7fc18a4148d44f62
--- /dev/null
+++ b/examples/machine_translation/transformer/deploy/python/README.md
@@ -0,0 +1,57 @@
+# 使用 Paddle Inference Python API 推理
+
+## 模型推理
+
+通过前文介绍,我们可以获取导出后的预测模型。模型导出后,`infer_model/` 下的目录结构如下:
+
+``` text
+.
+├── transformer.pdiparams
+├── transformer.pdiparams.info
+└── transformer.pdmodel
+```
+
+可以将存有导出后模型的目录拷贝到当前路径下:
+
+``` sh
+cp -rf ../../infer_model/ ./
+```
+
+执行如下命令可以使用 Paddle Inference Python API 进行推理:
+
+``` sh
+export CUDA_VISIBLE_DEVICES=0
+python inference.py \
+    --config ../../configs/transformer.base.yaml \
+    --batch_size 8 \
+    --device gpu \
+    --model_dir ./infer_model/
+```
+
+各个参数解释如下:
+* `--config`: yaml 配置文件,和训练时使用的相同,不过因为模型导出时已经固定了模型结构,因此,模型超参相关配置将不会再起作用,仅有 `reader` 相关配置、`infer_batch_size` 以及 `inference_model_dir` 仍会有效。
+* `--batch_size`: 与配置文件中 `infer_batch_size` 意义相同,是指的使用 Paddle Inference 的时候一个 batch 的句子数目。
+* `--device`: 使用的设备,可以是 gpu,xpu 或是 cpu。
+* `--use_mkl`: 是否使用 mkl,没有设定表示不使用 mkl。可以通过 `--use_mkl True` 指定。
+* `--threads`: 仅在使用 mkl 的时候起效,用于指定计算 math 库时的线程数。
+* `--model_dir`: 导出的 Paddle Inference 可用的模型路径,与配置文件中的 `inference_model_dir` 对应。
+
+英德翻译的结果会保存到 `predict.txt` 文件中。
+
+## 模型评估
+
+推理结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标):
+
+``` sh
+# 还原 predict.txt 中的预测结果为 tokenize 后的数据
+sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt
+# 若无 BLEU 评估工具,需先进行下载
+git clone https://github.com/moses-smt/mosesdecoder.git
+# 以英德翻译 newstest2014 测试数据为例
+perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt
+```
+
+执行上述操作之后,可以看到类似如下的结果,此处结果是 big model 在 newstest2014 上的 BLEU 结果:
+```
+BLEU = 27.48, 58.6/33.2/21.1/13.9 (BP=1.000, ratio=1.012, hyp_len=65312, ref_len=64506)
+```
diff --git a/examples/machine_translation/transformer/deploy/python/benchmark.sh b/examples/machine_translation/transformer/deploy/python/benchmark.sh
new file mode 100644
index 0000000000000000000000000000000000000000..0b9b8c482995b219d2c701989db8ce0ecee4598d
--- /dev/null
+++ b/examples/machine_translation/transformer/deploy/python/benchmark.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+model_dir=${1}
+model=${2}
+mkdir -p output_pipeline
+log_path="output_pipeline"
+
+for batch_size in "1" "2" "4"; do
+    python inference.py \
+        --config="../../configs/transformer.${model}.yaml" \
+        --device cpu \
+        --model_dir=${model_dir} \
+        --batch_size=${batch_size} \
+        --profile > ${log_path}/transformer_${model}_cpu_nomkl_bs${batch_size}_inference.log 2>&1
+
+    for threads in "1" "6"; do
+        python inference.py \
+            --config="../../configs/transformer.${model}.yaml" \
+            --model_dir=${model_dir} \
+            --device cpu \
+            --use_mkl True \
+            --threads=${threads} \
+            --batch_size=${batch_size} \
+            --profile > ${log_path}/transformer_${model}_cpu_mkl_threads${threads}_bs${batch_size}_inference.log 2>&1
+    done
+
+    python inference.py \
+        --config="../../configs/transformer.${model}.yaml" \
+        --model_dir=${model_dir} \
+        --device gpu \
+        --batch_size=${batch_size} \
+        --profile > ${log_path}/transformer_${model}_gpu_bs${batch_size}_inference.log 2>&1
+done
diff --git 
a/examples/machine_translation/transformer/deploy/python/inference.py b/examples/machine_translation/transformer/deploy/python/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..d765fadcebd3ff6ed5e31aa209a19778ccac5b6f --- /dev/null +++ b/examples/machine_translation/transformer/deploy/python/inference.py @@ -0,0 +1,330 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +from pprint import pprint + +import paddle +import yaml +from easydict import EasyDict as AttrDict +from paddle import inference + +from paddlenlp.utils.log import logger + +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir, os.pardir))) +import reader # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--batch_size", type=int, help="Batch size. ") + parser.add_argument( + "--config", default="./configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["gpu", "xpu", "cpu", "npu"], + help="Device to use during inference. ", + ) + parser.add_argument("--use_mkl", default=False, type=eval, choices=[True, False], help="Whether to use mkl. ") + parser.add_argument("--threads", default=1, type=int, help="The number of threads when enable mkl. ") + parser.add_argument("--model_dir", default="", type=str, help="Path of the model. ") + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument("--profile", action="store_true", help="Whether to profile. ") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--test_file", + nargs="+", + default=None, + type=str, + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--save_log_path", + default="./transformer/output/", + type=str, + help="The path to save logs when profile is enabled. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. 
", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles, autolog=None): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + self.autolog = autolog + self.use_auto_log = not isinstance(self.autolog, recorder.Recorder) + + @classmethod + def create_predictor(cls, args, config=None, profile=False, model_name=None): + if config is None: + config = inference.Config( + os.path.join(args.inference_model_dir, "transformer.pdmodel"), + os.path.join(args.inference_model_dir, "transformer.pdiparams"), + ) + if args.device == "gpu": + config.enable_use_gpu(100, 0) + elif args.device == "xpu": + config.enable_xpu() + elif args.device == "npu": + config.enable_custom_device("npu") + else: + # CPU + config.disable_gpu() + if args.use_mkl: + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.threads) + # Use ZeroCopy. 
+ config.switch_use_feed_fetch_ops(False) + + if profile: + if args.mod is recorder: + autolog = args.mod.Recorder(config, args.infer_batch_size, args.model_name) + else: + pid = os.getpid() + autolog = args.mod.AutoLogger( + model_name=args.model_name, + model_precision="fp32", + batch_size=args.infer_batch_size, + save_path=args.save_log_path, + inference_config=config, + data_shape="dynamic", + pids=pid, + process_name=None, + gpu_ids=0 if args.device == "gpu" else None, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + else: + autolog = None + + predictor = inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + return cls(predictor, input_handles, output_handles, autolog) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def predict(self, test_loader, to_tokens, n_best, bos_idx, eos_idx): + outputs = [] + samples = 0 + if self.autolog is not None: + if self.use_auto_log: + self.autolog.times.start() + else: + cpu_rss_mb, gpu_rss_mb = 0, 0 + gpu_id = 0 if self.autolog.use_gpu else None + gpu_util = 0 + + for data in test_loader: + samples = len(data[0]) + + if self.autolog is not None: + if self.use_auto_log: + self.autolog.times.stamp() + else: + self.autolog.tic() + + output = self.predict_batch(data) + + if self.autolog is not None: + if self.use_auto_log: + self.autolog.times.stamp() + else: + self.autolog.toc(samples) + gpu_util += recorder.Recorder.get_current_gputil(gpu_id) + cm, gm = recorder.Recorder.get_current_memory_mb(gpu_id) + cpu_rss_mb += cm + gpu_rss_mb += gm + + finished_sequence = output[0].transpose([0, 2, 1]) + for ins in finished_sequence: + n_best_seq = [] + for beam_idx, beam in enumerate(ins): + if beam_idx >= n_best: + break + id_list = post_process_seq(beam, bos_idx, eos_idx) + word_list = to_tokens(id_list) + sequence = " ".join(word_list) + n_best_seq.append(sequence) + outputs.append(n_best_seq) + + if self.autolog is not None: + if self.use_auto_log: + self.autolog.times.end(stamp=True) + else: + self.autolog.get_device_info( + cpu_rss_mb=cpu_rss_mb / len(test_loader), + gpu_rss_mb=gpu_rss_mb / len(test_loader) if self.autolog.use_gpu else 0, + gpu_util=gpu_util / len(test_loader) if self.autolog.use_gpu else 0, + ) + + return outputs + + +def do_inference(args): + # Define data loader + test_loader, to_tokens = reader.create_infer_loader(args) + + predictor = Predictor.create_predictor(args=args, profile=args.profile, model_name=args.model_name) + sequence_outputs = predictor.predict(test_loader, to_tokens, args.n_best, args.bos_idx, args.eos_idx) + + f = open(args.output_file, "w", encoding="utf-8") + for target in sequence_outputs: + for sequence in target: + f.write(sequence + "\n") + f.close() + + if args.profile: + predictor.autolog.report() + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + args.device = ARGS.device + args.use_mkl = ARGS.use_mkl + args.threads = ARGS.threads + if ARGS.batch_size is not None: + 
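+        # A --batch_size given on the command line overrides infer_batch_size from the yaml config.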
args.infer_batch_size = ARGS.batch_size + args.profile = ARGS.profile + args.model_name = "transformer_base" if "base" in ARGS.config else "transformer_big" + if ARGS.model_dir != "": + args.inference_model_dir = ARGS.model_dir + args.save_log_path = ARGS.save_log_path + args.data_dir = ARGS.data_dir + args.test_file = ARGS.test_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + + if args.profile: + import importlib + + import tls.recorder as recorder + + try: + mod = importlib.import_module("auto_log") + except ImportError: + mod = importlib.import_module("tls.recorder") + args.mod = mod + + do_inference(args) diff --git a/examples/machine_translation/transformer/deploy/python/tls/benchmark_utils.py b/examples/machine_translation/transformer/deploy/python/tls/benchmark_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..c85f36913b59ed4d972fe8730afcf671d30d75c5 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/python/tls/benchmark_utils.py @@ -0,0 +1,247 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os +from pathlib import Path + +import paddle +import paddle.inference as paddle_infer + +CUR_DIR = os.path.dirname(os.path.abspath(__file__)) +LOG_PATH_ROOT = f"{CUR_DIR}/../../output" + + +class PaddleInferBenchmark(object): + def __init__( + self, + config, + model_info: dict = {}, + data_info: dict = {}, + perf_info: dict = {}, + resource_info: dict = {}, + **kwargs + ): + """ + Construct PaddleInferBenchmark Class to format logs. 
+ args: + config(paddle.inference.Config): paddle inference config + model_info(dict): basic model info + {'model_name': 'resnet50' + 'precision': 'fp32'} + data_info(dict): input data info + {'batch_size': 1 + 'shape': '3,224,224' + 'data_num': 1000} + perf_info(dict): performance result + {'inference_time_s': 2.0} + resource_info(dict): + cpu and gpu resources + {'cpu_rss': 100 + 'gpu_rss': 100 + 'gpu_util': 60} + """ + # PaddleInferBenchmark Log Version + self.log_version = "1.0.3" + + # Paddle Version + self.paddle_version = paddle.__version__ + self.paddle_commit = paddle.__git_commit__ + paddle_infer_info = paddle_infer.get_version() + self.paddle_branch = paddle_infer_info.strip().split(": ")[-1] + + # model info + self.model_info = model_info + + # data info + self.data_info = data_info + + # perf info + self.perf_info = perf_info + + try: + # required value + self.model_name = model_info["model_name"] + self.precision = model_info["precision"] + + self.batch_size = data_info["batch_size"] + self.shape = data_info["shape"] + self.data_num = data_info["data_num"] + + self.inference_time_s = round(perf_info["inference_time_s"], 4) + except: + self.print_help() + raise ValueError("Set argument wrong, please check input argument and its type") + + self.inference_time_s_90 = perf_info.get("inference_time_s_90", "") + self.inference_time_s_99 = perf_info.get("inference_time_s_99", "") + self.succ_rate = perf_info.get("succ_rate", "") + self.qps = perf_info.get("qps", "") + + # conf info + self.config_status = self.parse_config(config) + + # mem info + if isinstance(resource_info, dict): + self.cpu_rss_mb = int(resource_info.get("cpu_rss_mb", 0)) + self.cpu_vms_mb = int(resource_info.get("cpu_vms_mb", 0)) + self.cpu_shared_mb = int(resource_info.get("cpu_shared_mb", 0)) + self.cpu_dirty_mb = int(resource_info.get("cpu_dirty_mb", 0)) + self.cpu_util = round(resource_info.get("cpu_util", 0), 2) + + self.gpu_rss_mb = int(resource_info.get("gpu_rss_mb", 0)) + self.gpu_util = round(resource_info.get("gpu_util", 0), 2) + else: + self.cpu_rss_mb = 0 + self.cpu_vms_mb = 0 + self.cpu_shared_mb = 0 + self.cpu_dirty_mb = 0 + self.cpu_util = 0 + + self.gpu_rss_mb = 0 + self.gpu_util = 0 + + # init benchmark logger + self.benchmark_logger() + + def benchmark_logger(self): + """ + benchmark logger + """ + # remove other logging handler + for handler in logging.root.handlers[:]: + logging.root.removeHandler(handler) + + # Init logger + FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s" + log_output = f"{LOG_PATH_ROOT}/{self.model_name}.log" + Path(f"{LOG_PATH_ROOT}").mkdir(parents=True, exist_ok=True) + logging.basicConfig( + level=logging.INFO, + format=FORMAT, + handlers=[ + logging.FileHandler(filename=log_output, mode="w"), + logging.StreamHandler(), + ], + ) + self.logger = logging.getLogger(__name__) + self.logger.info(f"Paddle Inference benchmark log will be saved to {log_output}") + + def parse_config(self, config) -> dict: + """ + parse paddle predictor config + args: + config(paddle.inference.Config): paddle inference config + return: + config_status(dict): dict style config info + """ + if isinstance(config, paddle_infer.Config): + config_status = {} + config_status["runtime_device"] = "cpu" + if config.use_gpu(): + config_status["runtime_device"] = "gpu" + if config.use_xpu(): + config_status["runtime_device"] = "xpu" + config_status["ir_optim"] = config.ir_optim() + config_status["enable_tensorrt"] = config.tensorrt_engine_enabled() + config_status["precision"] = self.precision 
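+            # The getters used below (mkldnn_enabled(), cpu_math_library_num_threads(), ...) report what was set on this Config.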
+ config_status["enable_mkldnn"] = config.mkldnn_enabled() + config_status["cpu_math_library_num_threads"] = config.cpu_math_library_num_threads() + elif isinstance(config, dict): + config_status["runtime_device"] = config.get("runtime_device", "") + config_status["ir_optim"] = config.get("ir_optim", "") + config_status["enable_tensorrt"] = config.get("enable_tensorrt", "") + config_status["precision"] = config.get("precision", "") + config_status["enable_mkldnn"] = config.get("enable_mkldnn", "") + config_status["cpu_math_library_num_threads"] = config.get("cpu_math_library_num_threads", "") + else: + self.print_help() + raise ValueError("Set argument config wrong, please check input argument and its type") + return config_status + + def report(self, identifier=None): + """ + print log report + args: + identifier(string): identify log + """ + if identifier: + identifier = f"[{identifier}]" + else: + identifier = "" + + self.logger.info("\n") + self.logger.info("---------------------- Paddle info ----------------------") + self.logger.info(f"{identifier} paddle_version: {self.paddle_version}") + self.logger.info(f"{identifier} paddle_commit: {self.paddle_commit}") + self.logger.info(f"{identifier} paddle_branch: {self.paddle_branch}") + self.logger.info(f"{identifier} log_api_version: {self.log_version}") + self.logger.info("----------------------- Conf info -----------------------") + self.logger.info(f"{identifier} runtime_device: {self.config_status['runtime_device']}") + self.logger.info(f"{identifier} ir_optim: {self.config_status['ir_optim']}") + self.logger.info(f"{identifier} enable_memory_optim: {True}") + self.logger.info(f"{identifier} enable_tensorrt: {self.config_status['enable_tensorrt']}") + self.logger.info(f"{identifier} enable_mkldnn: {self.config_status['enable_mkldnn']}") + self.logger.info( + f"{identifier} cpu_math_library_num_threads: {self.config_status['cpu_math_library_num_threads']}" + ) + self.logger.info("----------------------- Model info ----------------------") + self.logger.info(f"{identifier} model_name: {self.model_name}") + self.logger.info(f"{identifier} precision: {self.precision}") + self.logger.info("----------------------- Data info -----------------------") + self.logger.info(f"{identifier} batch_size: {self.batch_size}") + self.logger.info(f"{identifier} input_shape: {self.shape}") + self.logger.info(f"{identifier} data_num: {self.data_num}") + self.logger.info("----------------------- Perf info -----------------------") + self.logger.info( + f"{identifier} cpu_rss(MB): {self.cpu_rss_mb}, cpu_vms: {self.cpu_vms_mb}, cpu_shared_mb: {self.cpu_shared_mb}, cpu_dirty_mb: {self.cpu_dirty_mb}, cpu_util: {self.cpu_util}%" + ) + self.logger.info(f"{identifier} gpu_rss(MB): {self.gpu_rss_mb}, gpu_util: {self.gpu_util}%") + self.logger.info(f"{identifier} inference_time(ms): {round(self.inference_time_s*1000, 1)}") + if self.inference_time_s_90: + self.logger.info( + f"{identifier} 90%_cost: {self.inference_time_s_90}, 99%_cost: {self.inference_time_s_99}, succ_rate: {self.succ_rate}" + ) + if self.qps: + self.logger.info(f"{identifier} QPS: {self.qps}") + + def print_help(self): + """ + print function help + """ + print( + """Usage: + ==== Print inference benchmark logs. 
==== + config = paddle.inference.Config() + model_info = {'model_name': 'resnet50' + 'precision': 'fp32'} + data_info = {'batch_size': 1 + 'shape': '3,224,224' + 'data_num': 1000} + perf_info = {'inference_time_s': 2.0} + resource_info = {'cpu_rss_mb': 100 + 'gpu_rss_mb': 100 + 'gpu_util': 60} + log = PaddleInferBenchmark(config, model_info, data_info, perf_info, resource_info) + log('Test') + """ + ) + + def __call__(self, identifier=None): + """ + __call__ + args: + identifier(string): identify log + """ + self.report(identifier) diff --git a/examples/machine_translation/transformer/deploy/python/tls/recorder.py b/examples/machine_translation/transformer/deploy/python/tls/recorder.py new file mode 100644 index 0000000000000000000000000000000000000000..d33e8fac1a3eeac3ef2ecb60ea282a163e135d92 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/python/tls/recorder.py @@ -0,0 +1,96 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import paddle +import psutil +from tls.benchmark_utils import PaddleInferBenchmark + +if paddle.is_compiled_with_cuda(): + import GPUtil + import pynvml + + +class Recorder(object): + def __init__(self, config, batch_size, model_name, mem_info=None): + self.model_name = model_name + self.config = config + self.precision = "fp32" + self.batch_size = batch_size + + self.use_gpu = False + self.use_xpu = False + self.use_cpu = False + + if config.use_gpu(): + self.place = "gpu" + self.use_gpu = True + elif config.use_xpu(): + self.place = "xpu" + self.use_xpu = True + else: + self.place = "cpu" + self.use_cpu = True + + self.infer_time_s = 0 + self.samples = 0 + + self.start = 0 + + self.device_info = {"cpu_rss_mb": None, "gpu_rss_mb": None, "gpu_util": None} + if mem_info is not None: + self.mem_info = mem_info + + def tic(self): + self.start = time.time() + + def toc(self, samples=1): + self.infer_time_s += time.time() - self.start + self.samples += samples + + def get_device_info(self, cpu_rss_mb=None, gpu_rss_mb=None, gpu_util=None): + self.device_info["cpu_rss_mb"] = cpu_rss_mb + self.device_info["gpu_rss_mb"] = gpu_rss_mb + self.device_info["gpu_util"] = gpu_util + + def report(self): + model_info = {"model_name": self.model_name, "precision": self.precision} + data_info = {"batch_size": self.batch_size, "shape": "dynamic_shape", "data_num": self.samples} + perf_info = {"inference_time_s": self.infer_time_s} + log = PaddleInferBenchmark(self.config, model_info, data_info, perf_info, self.device_info) + log("Test") + + @staticmethod + def get_current_memory_mb(gpu_id=None): + pid = os.getpid() + p = psutil.Process(pid) + info = p.memory_full_info() + cpu_rss_mb = info.uss / 1024.0 / 1024.0 + gpu_rss_mb = 0 + if gpu_id is not None: + pynvml.nvmlInit() + handle = pynvml.nvmlDeviceGetHandleByIndex(0) + meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle) + gpu_rss_mb = meminfo.used / 1024.0 / 1024.0 + return cpu_rss_mb, gpu_rss_mb + + @staticmethod + def 
get_current_gputil(gpu_id=None): + gpu_load = 0 + if gpu_id is not None: + GPUs = GPUtil.getGPUs() + gpu_load = GPUs[gpu_id].load + return gpu_load diff --git a/examples/machine_translation/transformer/deploy/serving/README.md b/examples/machine_translation/transformer/deploy/serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2fc68c6794375e30941546a8574b541553817f39 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/README.md @@ -0,0 +1,104 @@ +# 使用 Paddle Serving 推理 + +## Paddle Serving 的使用 + +Paddle Serving 的安装可以参考[Paddle Serving 安装文档](https://github.com/PaddlePaddle/Serving#installation)。需要在服务端和客户端安装相关的依赖。 + +Serving 的执行包含两部分,其一是服务端的执行,其二是客户端的执行。接下来我们会一一说明如何使用 Paddle Serving 完成对 Transformer 的推理。 + +## 模型推理 + +通过前文介绍,我们可以获取导出后的预测模型。模型导出后,`infer_model/` 下的目录结构如下: + +``` text +. +└── infer_model/ + ├── transformer.pdiparams + ├── transformer.pdiparams.info + └── transformer.pdmodel +``` + +可以将存有导出后模型的目录拷贝到当前路径下: + +``` sh +cp -rf ../../infer_model/ ./ +``` + +### 导出 Serving 模型和配置 + +使用导出的 Paddle Inference 的模型,我们需要再做一次转换,将上面保存在 `infer_model/` 下面的模型重新转换成 Paddle Serving 使用的模型。具体操作方式如下: + +``` sh +python export_serving_model.py --model_dir ./infer_model/ +``` + +执行结束之后,会在 shell 上打印出 Transformer 模型输入、输出的变量的名称: + +``` sh +model feed_names : dict_keys(['src_word']) # 模型输入的变量的名称 +model fetch_names : dict_keys(['save_infer_model/scale_0.tmp_1']) # 模型输出的变量的名称 +``` + +导出后,可以在当前路径下得到两个新的目录 `transformer_client/` 和 `transformer_server/`。 + +``` text +. +├── transformer_client/ + ├── serving_client_conf.prototxt + └── serving_client_conf.stream.prototxt +└── transformer_server/ + ├── __model__ + ├── __params__ + ├── serving_server_conf.prototxt + └── serving_server_conf.stream.prototxt +``` + +脚本成功执行并打印出预期内的输入、输出的变量的名称即完成整个过程。 + +### 启动服务端 + +Transformer 的服务端使用的是 Paddle Serving 的 `WebService` 相关接口。执行的命令如下: + +``` sh +export CUDA_VISIBLE_DEVICES=0 +python transformer_web_server.py --config ../../configs/transformer.base.yaml --device gpu --model_dir ./transformer_server +``` + +各个参数的解释如下: +* `--config`: yaml 配置文件,和训练时使用的相同,不过因为模型导出时已经固定了模型结构,因此,模型超参相关配置将不会再起作用,仅有 `reader` 相关配置,比如词表以及 `inference_model_dir` 等仍会有效。 +* `--device`: 使用的设备,可以是 gpu 或是 cpu。 +* `--model_dir`: 导出的 Paddle Serving 可用的模型路径,与配置文件中的 `inference_model_dir` 对应。在这里,特指的 `transformer_server/` 的路径。 + +### 启动客户端完成推理 + +在英德翻译的例子里面,在客户端这侧,我们只需要传给服务端需要翻译的句子即可。这里的句子是经过了 tokenize 以及 bpe 切词的序列用空格连接而成的句子。 + +执行的方式如下: + +``` sh +python transformer_web_client.py --config ../../configs/transformer.base.yaml --batch_size 8 +``` + +各个参数的解释如下: +* `--config`: yaml 配置文件,和训练时使用的相同,不过因为模型导出时已经固定了模型结构,因此,模型超参相关配置将不会再起作用,仅有 `reader` 相关配置,比如使用的测试集以及 `infer_batch_size` 等仍会有效。 +* `--batch_size`: 与配置文件中 `infer_batch_size` 意义相同,是指的使用 Paddle Serving 的时候一个 batch 的句子数目。 + +执行完客户端的脚本,将会在本地生成一个 `predict.txt` 的文件,存有推理的结果。 + +## 模型评估 + +推理结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): + +``` sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +git clone https://github.com/moses-smt/mosesdecoder.git +# 以英德翻译 newstest2014 测试数据为例 +perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt +``` + +执行上述操作之后,可以看到类似如下的结果,此处结果是 big model 在 newstest2014 上的 BLEU 结果: +``` +BLEU = 27.48, 58.6/33.2/21.1/13.9 (BP=1.000, ratio=1.012, hyp_len=65312, ref_len=64506) +``` diff --git 
a/examples/machine_translation/transformer/deploy/serving/benchmark.py b/examples/machine_translation/transformer/deploy/serving/benchmark.py new file mode 100644 index 0000000000000000000000000000000000000000..973f5d2b3ecf9cb442525cb67d94acb1ee4d56b5 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/benchmark.py @@ -0,0 +1,36 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import yaml + + +def parse_benchmark(filein, fileout): + with open(filein, "r") as fin: + res = yaml.load(fin) + del_list = [] + for key in res["DAG"].keys(): + if "call" in key: + del_list.append(key) + for key in del_list: + del res["DAG"][key] + with open(fileout, "w") as fout: + yaml.dump(res, fout, default_flow_style=False) + + +if __name__ == "__main__": + filein = sys.argv[1] + fileout = sys.argv[2] + parse_benchmark(filein, fileout) diff --git a/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh b/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..7f57ef5a0873f9bfa7902a04d914bb0fa12bce3c --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh @@ -0,0 +1,24 @@ +modelname="transformer" +export FLAGS_profile_pipeline=1 +# HTTP +ps -ef | grep web_service | awk '{print $2}' | xargs kill -9 +sleep 3 +rm -rf profile_log_$modelname +for thread_num in "1" "8" "16"; do + for batch_size in "1" "2" "4"; do + python transformer_web_server.py --config ../../configs/transformer.base.yaml --device gpu --model_dir ./transformer_server --profile & + sleep 3 + echo "----Transformer thread num: ${thread_num} batch size: ${batch_size} mode:http ----" >> profile_log_$modelname + nvidia-smi --id=2 --query-compute-apps=used_memory --format=csv -lms 100 > gpu_use.log 2>&1 & + nvidia-smi --id=2 --query-gpu=utilization.gpu --format=csv -lms 100 > gpu_utilization.log 2>&1 & + echo "import psutil\ncpu_utilization=psutil.cpu_percent(1,False)\nprint('CPU_UTILIZATION:', cpu_utilization)\n" > cpu_utilization.py + python transformer_web_client.py --config ../../configs/transformer.base.yaml --batch_size ${batch_size} --threads ${thread_num} --profile + python cpu_utilization.py >> profile_log_$modelname + ps -ef | grep web_server | awk '{print $2}' | xargs kill -9 + python benchmark.py benchmark.log benchmark.tmp + mv benchmark.tmp benchmark.log + awk 'BEGIN {max = 0} {if(NR>1){if ($modelname > max) max=$modelname}} END {print "MAX_GPU_MEMORY:", max}' gpu_use.log >> profile_log_$modelname + awk 'BEGIN {max = 0} {if(NR>1){if ($modelname > max) max=$modelname}} END {print "GPU_UTILIZATION:", max}' gpu_utilization.log >> profile_log_$modelname + cat benchmark.log >> profile_log_$modelname + done +done diff --git a/examples/machine_translation/transformer/deploy/serving/export_serving_model.py b/examples/machine_translation/transformer/deploy/serving/export_serving_model.py new file 
mode 100644 index 0000000000000000000000000000000000000000..97e0a526dc805f09eb0006f5fbfd015e11e28aac --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/export_serving_model.py @@ -0,0 +1,28 @@ +import argparse +import paddle +import paddle_serving_client.io as serving_io + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", type=str, required=True, help="input inference model dir") + return parser.parse_args() + + +def do_export(model_dir): + feed_names, fetch_names = serving_io.inference_model_to_serving( + dirname=model_dir, + serving_server="transformer_server", + serving_client="transformer_client", + model_filename="transformer.pdmodel", + params_filename="transformer.pdiparams", + ) + + print("model feed_names : %s" % feed_names) + print("model fetch_names : %s" % fetch_names) + + +if __name__ == "__main__": + paddle.enable_static() + args = parse_args() + do_export(args.model_dir) diff --git a/examples/machine_translation/transformer/deploy/serving/transformer_reader.py b/examples/machine_translation/transformer/deploy/serving/transformer_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..2b295e7d6c37d174ced3c78ab9808a5868b1db39 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/transformer_reader.py @@ -0,0 +1,52 @@ +import numpy as np + +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Pad, Vocab + + +class TransformerReader(object): + def __init__(self, args={}): + super(TransformerReader, self).__init__() + + dataset = load_dataset("wmt14ende", splits=("test")) + if not args.benchmark: + self.vocab = Vocab.load_vocabulary(**dataset.vocab_info["bpe"]) + else: + self.vocab = Vocab.load_vocabulary(**dataset.vocab_info["benchmark"]) + self.src_vocab = self.trg_vocab = self.vocab + + def convert_samples(samples): + source = [] + for sample in samples: + src = sample.split() + source.append(self.src_vocab.to_indices(src)) + + return source + + self.tokenize = convert_samples + self.to_tokens = self.trg_vocab.to_tokens + self.feed_keys = ["src_word"] + self.bos_idx = args.bos_idx + self.eos_idx = args.eos_idx + self.pad_idx = args.bos_idx + self.pad_seq = args.pad_seq + self.word_pad = Pad(self.pad_idx) + + def set_feed_keys(self, keys): + self.feed_keys = keys + + def get_feed_keys(self): + return self.feed_keys + + def prepare_infer_input(self, insts): + """ + Put all padded data needed by beam search decoder into a list. + """ + insts = self.tokenize(insts) + + src_max_len = (max([len(inst) for inst in insts]) + self.pad_seq) // self.pad_seq * self.pad_seq + src_word = self.word_pad( + [inst + [self.eos_idx] + [self.pad_idx] * (src_max_len - 1 - len(inst)) for inst in insts] + ) + + return np.asarray(src_word) diff --git a/examples/machine_translation/transformer/deploy/serving/transformer_web_client.py b/examples/machine_translation/transformer/deploy/serving/transformer_web_client.py new file mode 100644 index 0000000000000000000000000000000000000000..0407a1f8eb08b7fd17cd642ffaefafd815977699 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/transformer_web_client.py @@ -0,0 +1,98 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +from pprint import pprint + +import requests +import yaml +from easydict import EasyDict as AttrDict +from paddle_serving_client.utils import MultiThreadRunner + +from paddlenlp.datasets import load_dataset + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument("--batch_size", type=int, help="Batch size. ") + parser.add_argument("--threads", default=1, type=int, help="Number of threads. ") + parser.add_argument("--profile", action="store_true", help="Whether to profile. ") + args = parser.parse_args() + return args + + +def do_client(idx, args): + dataset = load_dataset("wmt14ende", splits=("test")) + + headers = {"Content-type": "application/json"} + url = "http://127.0.0.1:9292/transformer/prediction" + + batch = [] + sample = 0 + f = open(args.output_file, "w") + if args.profile: + recorder = Recorder(args.infer_batch_size, args.model_name) + recorder.tic() + + for sequence in dataset: + sample += 1 + batch.append(sequence[args.src_lang]) + if len(batch) < args.infer_batch_size and sample != len(dataset): + continue + data = {"feed": [{"src_word": batch}], "fetch": ["finished_sequence"]} + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + if r is not None: + print("Status: ", r) + + if args.profile: + recorder.toc(samples=len(batch)) + else: + for seq in r.json()["result"]["finished_sequence"]: + f.write(seq[0] + "\n") + batch = [] + if args.profile: + recorder.tic() + f.close() + if args.profile: + recorder.report() + return [[recorder.infer_time]] + + +def multithread_http(args): + multi_thread_runner = MultiThreadRunner() + multi_thread_runner.run(do_client, args.threads, args) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + if ARGS.batch_size is not None: + args.infer_batch_size = ARGS.batch_size + args.profile = ARGS.profile + args.threads = ARGS.threads + args.model_name = "transformer_base" if "base" in ARGS.config else "transformer_big" + + if args.profile: + from utils.recorder import Recorder + + multithread_http(args) + else: + do_client(0, args) diff --git a/examples/machine_translation/transformer/deploy/serving/transformer_web_server.py b/examples/machine_translation/transformer/deploy/serving/transformer_web_server.py new file mode 100644 index 0000000000000000000000000000000000000000..c13a78fc1f2dc2775d72fb1f3aa54a2a9af5ef17 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/transformer_web_server.py @@ -0,0 +1,130 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from pprint import pprint + +import numpy as np +import yaml +from easydict import EasyDict as AttrDict + +try: + from paddle_serving_server_gpu.web_service import WebService +except: + from paddle_serving_server.web_service import WebService + +from transformer_reader import TransformerReader + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--device", default="gpu", type=str, choices=["gpu", "cpu"], help="Device to use during inference. " + ) + parser.add_argument("--model_dir", default="", type=str, help="Path of the model. ") + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument("--profile", action="store_true", help="Whether to profile. ") + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +class TransformerService(WebService): + def init_client(self, args): + self.args = args + self.transformer_reader = TransformerReader(args=args) + + def preprocess(self, feed=[], fetch=[]): + src_sequence = feed[0]["src_word"] + if isinstance(src_sequence, str): + src_sequence = [src_sequence] + src_word = self.transformer_reader.prepare_infer_input(src_sequence) + feed_batch = {"src_word": src_word} + fetch = ["save_infer_model/scale_0.tmp_1"] + + return feed_batch, fetch, True + + def postprocess(self, feed={}, fetch=[], fetch_map=None): + if fetch_map is not None: + finished_sequence = np.array(fetch_map["save_infer_model/scale_0.tmp_1"]).transpose([0, 2, 1]) + outputs = [] + for ins in finished_sequence: + n_best_seq = [] + for beam_idx, beam in enumerate(ins): + if beam_idx >= self.args.n_best: + break + id_list = post_process_seq(beam, self.args.bos_idx, self.args.eos_idx) + word_list = self.transformer_reader.to_tokens(id_list) + sequence = " ".join(word_list) + n_best_seq.append(sequence) + outputs.append(n_best_seq) + res = {"finished_sequence": outputs} + return res + + +def do_server(args): + service = TransformerService(name="transformer") + if args.profile: + try: + service.setup_profile(30) + except: + pass + service.load_model_config(args.inference_model_dir) + if args.device == "gpu": + service.set_gpus("0") + service.prepare_server(workdir="workdir", port=9292, device="gpu", gpuid=0) + else: + service.prepare_server(workdir="workdir", port=9292, device="cpu") + + service.init_client(args=args) + + if args.profile: + service.run_debugger_service() + else: + service.run_rpc_service() + service.run_web_service() + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with 
open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + args.benchmark = ARGS.benchmark + args.profile = ARGS.profile + args.device = ARGS.device + if ARGS.model_dir != "": + args.inference_model_dir = ARGS.model_dir + + do_server(args) diff --git a/examples/machine_translation/transformer/deploy/serving/utils/recorder.py b/examples/machine_translation/transformer/deploy/serving/utils/recorder.py new file mode 100644 index 0000000000000000000000000000000000000000..70a156a5e4f86ce9eb72ad1643b04117b2ce46d4 --- /dev/null +++ b/examples/machine_translation/transformer/deploy/serving/utils/recorder.py @@ -0,0 +1,34 @@ +import time +import paddle + + +class Recorder(object): + def __init__(self, batch_size, model_name, mem_info=None): + self.model_name = model_name + self.precision = "fp32" + self.batch_size = batch_size + + self.infer_time = 0 + self.samples = 0 + + self.start = 0 + + def tic(self): + self.start = time.time() + + def toc(self, samples=1): + self.infer_time += (time.time() - self.start) * 1000 + self.samples += samples + + def report(self): + print("----------------------- Env info ------------------------") + print("paddle_version: {}".format(paddle.__version__)) + print("----------------------- Model info ----------------------") + print("model_name: {}".format(self.model_name)) + print("model_type: {}".format(self.precision)) + print("----------------------- Data info -----------------------") + print("batch_size: {}".format(self.batch_size)) + print("num_of_samples: {}".format(self.samples)) + print("----------------------- Perf info -----------------------") + print("average_latency(ms): {}".format(self.infer_time / (self.samples))) + print("QPS: {}".format((self.samples) / (self.infer_time / 1000.0))) diff --git a/examples/machine_translation/transformer/export_model.py b/examples/machine_translation/transformer/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..fceddc5bf628edaa624ba23557f86cb7ac16a275 --- /dev/null +++ b/examples/machine_translation/transformer/export_model.py @@ -0,0 +1,164 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +import paddle +import reader +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.transformers import InferTransformerModel, position_encoding_init +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="./configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. 
Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + args = parser.parse_args() + return args + + +def do_export(args): + # Adapt vocabulary size + reader.adapt_vocab_size(args) + # Define model + transformer = InferTransformerModel( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + beam_size=args.beam_size, + max_out_len=args.max_out_len, + beam_search_version=args.beam_search_version, + normalize_before=args.get("normalize_before", True), + rel_len=args.use_rel_len, + alpha=args.alpha, + ) + + # Load the trained model + assert args.init_from_params, "Please set init_from_params to load the infer model." 
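+    # init_from_params is expected to point at the directory that contains transformer.pdparams.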
+ + model_dict = paddle.load(os.path.join(args.init_from_params, "transformer.pdparams")) + + # To avoid a longer length than training, reset the size of position + # encoding to max_length + model_dict["encoder.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + model_dict["decoder.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + transformer.load_dict(model_dict) + # Set evaluate mode + transformer.eval() + + # Convert dygraph model to static graph model + transformer = paddle.jit.to_static( + transformer, + input_spec=[ + # src_word + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # trg_word + # paddle.static.InputSpec( + # shape=[None, None], dtype="int64") + ], + ) + + # Save converted static graph model + paddle.jit.save(transformer, os.path.join(args.inference_model_dir, "transformer")) + logger.info("Transformer has been saved to {}".format(args.inference_model_dir)) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. 
" + ) + + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + + do_export(args) diff --git a/examples/machine_translation/transformer/fast_transformer/README.md b/examples/machine_translation/transformer/fast_transformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cb760655e003bec63a55bdd177cbb0256b7f3fb3 --- /dev/null +++ b/examples/machine_translation/transformer/fast_transformer/README.md @@ -0,0 +1,220 @@ +# FastGeneration 预测 + +在这里我们集成了 NVIDIA [FasterTransformer](https://github.com/NVIDIA/FasterTransformer/tree/v3.1) 用于预测加速。同时集成了 FasterTransformer float32 以及 float16 预测,打造了 FastGeneration 的能力。以下是使用 FastGeneration 的说明。 + +## 使用环境说明 + +* 本项目依赖于 PaddlePaddle 2.1.0 及以上版本或适当的 develop 版本 +* CMake >= 3.10 +* CUDA 10.1 或 10.2(需要 PaddlePaddle 框架一致) +* gcc 版本需要与编译 PaddlePaddle 版本一致,比如使用 gcc8.2 +* 推荐使用 Python3 +* [FasterTransformer](https://github.com/NVIDIA/FasterTransformer/tree/v3.1#setup) 使用必要的环境 +* 环境依赖 + - attrdict + - pyyaml + ```shell + pip install attrdict pyyaml + ``` + +## 快速开始 + +我们实现了基于 FastGeneration 的自定义 op 的接入,用于加速当前机器翻译 example 在 GPU 上的预测性能。 + +## 使用 FastGeneration 完成预测 + +编写 python 脚本的时候,调用 [`FasterTransformer` API](https://paddlenlp.readthedocs.io/zh/latest/source/paddlenlp.ops.fast_transformer.transformer.fast_transformer.html#paddlenlp.ops.fast_transformer.transformer.fast_transformer.FasterTransformer) 即可实现 Transformer 模型的高性能预测。 + +举例如下: + +``` python +from paddlenlp.ops import FasterTransformer + +transformer = FasterTransformer( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + decoding_strategy=args.decoding_strategy, + beam_size=args.beam_size, + topk=args.topk, + topp=args.topp, + max_out_len=args.max_out_len, + use_fp16_decoding=args.use_fp16_decoding) +``` + +若当前环境下没有需要的自定义 op 的动态库,将会使用 JIT 自动编译需要的动态库。如果需要自行编译自定义 op 所需的动态库,可以参考 [文本生成高性能加速](../../../../paddlenlp/ops/README.md)。编译好后,使用 `FasterTransformer(decoding_lib="/path/to/lib", ...)` 可以完成导入。 + +更详细的例子可以参考 `encoder_decoding_predict.py`,我们提供了更详细用例。 + + +#### 数据准备 + +本示例可以使用 PaddleNLP 内置的处理好的 WMT14 EN-DE 翻译的数据进行训练、预测,也可以使用自定义数据集。数据准备部分可以参考前页文档 [使用自定义翻译数据集](../README.md)。 + +#### 模型推断 + +使用模型推断前提是需要指定一个合适的 checkpoint,需要在对应的 `../configs/transformer.base.yaml` 中修改对应的模型载入的路径参数 `init_from_params`。 + +我们提供一个已经训练好的动态图的 base model 的 checkpoint 以供使用,可以通过[transformer-base-wmt_ende_bpe](https://bj.bcebos.com/paddlenlp/models/transformers/transformer/transformer-base-wmt_ende_bpe.tar.gz)下载。 + +``` sh +wget https://bj.bcebos.com/paddlenlp/models/transformers/transformer/transformer-base-wmt_ende_bpe.tar.gz +tar -zxf transformer-base-wmt_ende_bpe.tar.gz +``` + +然后,需要修改对应的 `../configs/transformer.base.yaml` 配置文件中的 `init_from_params` 的值为 `./base_trained_models/step_final/`。 + +#### 使用动态图预测(使用 float32 decoding 预测) + +以英德翻译数据为例,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: + +``` sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +# 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 +cp -rf ../../../../paddlenlp/ops/build/third-party/build/fastertransformer/bin/decoding_gemm ./ +./decoding_gemm 8 4 8 64 38512 32 512 0 +python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + 
--decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so \ + --decoding_strategy beam_search \ + --beam_size 5 +``` + +其中: +* `--config`: 选项用于指明配置文件的位置 +* `--decoding_lib`: 选项用于指明编译好的 FastGeneration decoding lib 的位置 +* `--decoding_strategy`: 选项用于指定解码使用的策略,可以选择是 `beam_search`,`topk_sampling`,`topp_sampling`。 + * 当使用 `beam_search` 的时候,需要指定 `--beam_size` 的值 + * 当使用 `topk_sampling` 的时候,需要指定 `--topk` 的值 + * 当使用 `topp_sampling` 的时候,需要指定 `topp` 的值,并且需要保证 `--topk` 的值为 0 +* `--beam_size`: 解码策略是 `beam_search` 的时候,beam size 的大小,数据类型是 `int` +* `--diversity_rate`: 解码策略是 `beam_search` 的时候,设置 diversity rate 的大小,数据类型是 `float`。当设置的 `diversity_rate` 大于 0 的时候,FasterTransformer 仅支持 beam size 为 1,4,16,64 +* `--topk`: 解码策略是 `topk_sampling` 的时候,topk 计算的 k 值的大小,数据类型是 `int` +* `--topp`: 解码策略是 `topp_sampling` 的时候,p 的大小,数据类型是 `float` + +翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `./sample/config/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 base model 的配置。 + +#### 使用动态图预测(使用 float16 decoding 预测) + +float16 与 float32 预测的基本流程相同,不过在使用 float16 的 decoding 进行预测的时候,需要再加上 `--use_fp16_decoding` 选项,表示使用 fp16 进行预测。后按照与之前相同的方式执行即可。具体执行方式如下: + +``` sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +# 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 +cp -rf ../../../../paddlenlp/ops/build/third-party/build/fastertransformer/bin/decoding_gemm ./ +./decoding_gemm 8 4 8 64 38512 32 512 1 +python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so \ + --use_fp16_decoding \ + --decoding_strategy beam_search \ + --beam_size 5 +``` + +其中,`--config` 选项用于指明配置文件的位置,而 `--decoding_lib` 选项用于指明编译好的 FastGeneration decoding lib 的位置。 + +翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `./sample/config/transformer.base.yaml` 文件中查阅注释说明并进行更改设置。如果执行不提供 `--config` 选项,程序将默认使用 base model 的配置。 + +需要注意的是,目前预测仅实现了单卡的预测,原因在于,翻译后面需要的模型评估依赖于预测结果写入文件顺序,多卡情况下,目前暂未支持将结果按照指定顺序写入文件。 + +#### 使用自定义数据集进行预测 + +如果需要使用准备好的自定义数据集进行高性能推理,同样可以通过在执行 `encoder_decoding_predict.py` 脚本时指明以下参数,从而引入自定义数据集。 + +* `--data_dir`: 指明训练需要的数据集的路径。无需提供不同的 train、dev 和 test 文件具体的文件名,会自动根据 `--src_lang` 和 `--trg_lang` 指定的语言进行构造。train、dev 和 test 默认的文件名分别为 [train|dev|test].{src_lang}-{trg_lang}.[{src_lang}|{trg_lang}]。且 `--data_dir` 设置的优先级会高于后面提到的 `--train_file`,`--dev_file` 和 `--test_file` 的优先级。 + * `--src_lang`(`-s`): 指代翻译模型的源语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 + * `--trg_lang`(`-t`): 指代翻译模型的目标语言。比如 `de` 表示德语,`en` 表示英语,`fr` 表示法语等等。和数据集本身相关。 +* `--test_file`: 指明训练所需要的 `test` 验证集的数据集的路径。若没有提供 `--data_dir` 或是需要特别指明训练数据的名称的时候指定。指定的方式为,传入源语言的文件。比如,`--test_file ${DATA_DEST_DIR}/test.de-en.de`。 +* `--vocab_file`: 指明训练所需的词表文件的路径和名称。若指定 `--vocab_file` 则默认是源语言和目标语言使用同一个词表。且 `--vocab_file` 设置的优先级会高于后面提到的 `--src_vocab` 和 `--trg_vocab` 优先级。 +* `--src_vocab`: 指明训练所需的源语言的词表文件的路径和名称。可以与 `--trg_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--trg_vocab`: 指明训练所需的目标语言的词表文件的路径和名称。可以与 `--src_vocab` 相同,若相同,则视为源语言和目标语言共用同一个词表。 +* `--unk_token`: 若提供了自定义的词表,则需要额外指明词表中未登录词 `[UNK]` 具体的 token。比如,`--unk_token "<unk>"`。默认为 `<unk>`,与数据预处理脚本设定默认值相同。 +* `--bos_token`: 若提供了自定义的词表,则需要额外指明词表中起始词 `[BOS]` 具体的 token。比如,`--bos_token "<s>"`。默认为 `<s>`,与数据预处理脚本设定默认值相同。 +* `--eos_token`: 若提供了自定义的词表,则需要额外指明词表中结束词 `[EOS]` 具体的 token。比如,`--eos_token "</s>"`。默认为 `</s>`,与数据预处理脚本设定默认值相同。 +* `--pad_token`: 若提供了自定义的词表,原则上,需要额外指定词表中用于表示 `[PAD]` 具体的 token。比如,`--pad_token "<pad>"`。默认为 None,若使用 None,则使用 `--bos_token` 作为 
`pad_token` 使用。 + +比如: + +``` bash +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 +DATA_DEST_DIR=${PATH_TO_PADDLENLP}/PaddleNLP/examples/machine_translation/data/iwslt14.tokenized.de-en/ + +# 执行 decoding_gemm 目的是基于当前环境、配置,提前确定一个性能最佳的矩阵乘算法,不是必要的步骤 +cp -rf ../../../../paddlenlp/ops/build/third-party/build/fastertransformer/bin/decoding_gemm ./ +./decoding_gemm 8 4 8 64 38512 32 512 1 + +python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so \ + --use_fp16_decoding \ + --decoding_strategy beam_search \ + --beam_size 5 \ + --test_file ${DATA_DEST_DIR}/test.de-en.de \ + --src_vocab ${DATA_DEST_DIR}/dev.de-en.de \ + --trg_vocab ${DATA_DEST_DIR}/dev.de-en.en \ + --bos_token "<s>" \ + --eos_token "</s>" \ + --unk_token "<unk>" +``` + +#### 导出基于 FastGeneration 的预测库使用模型文件 + +我们提供一个已经基于动态图训练好的 base model 的 checkpoint 以供使用,当前 checkpoint 是基于 WMT 英德翻译的任务训练。可以通过[transformer-base-wmt_ende_bpe](https://bj.bcebos.com/paddlenlp/models/transformers/transformer/transformer-base-wmt_ende_bpe.tar.gz)下载。 + +使用 C++ 预测库,首先,我们需要做的是将动态图的 checkpoint 导出成预测库能使用的模型文件和参数文件。可以执行 `export_model.py` 实现这个过程。 + +``` sh +python export_model.py \ + --config ../configs/transformer.base.yaml \ + --decoding_strategy beam_search \ + --beam_size 5 +``` + +若当前环境下没有需要的自定义 op 的动态库,将会使用 JIT 自动编译需要的动态库。如果需要自行编译自定义 op 所需的动态库,可以参考 [文本生成高性能加速](../../../../paddlenlp/ops/README.md)。编译好后,可以在执行 `export_model.py` 时使用 `--decoding_lib ../../../../paddlenlp/ops/build/lib/libdecoding_op.so` 可以完成导入。 + +注意:如果是自行编译的话,这里的 `libdecoding_op.so` 的动态库是参照文档 [文本生成高性能加速](../../../../paddlenlp/ops/README.md) 中 **`Python 动态图使用自定义 op`** 编译出来的 lib,与相同文档中 **`C++ 预测库使用自定义 op`** 编译产出不同。因此,在使用预测库前,还需要额外导出模型: + * 一次用于获取 Python 动态图下的 lib,用到 Python 端进行模型导出。 + * 一次获取编译的基于预测库的可执行文件 + +执行 `export_model.py` 之后,可以在当前路径的 `infer_model/` 下面看到导出的模型文件: + ```text + └── infer_model/ + ├── transformer.pdiparams + └── transformer.pdmodel + ``` + +#### C++ 预测库使用高性能加速 + +C++ 预测库使用 FastGeneration 的高性能加速需要自行编译,可以参考 [文本生成高性能加速](../../../../paddlenlp/ops/README.md) 文档完成基于 C++ 预测库的编译,同时也可以参考相同文档执行对应的 C++ 预测库的 demo 完成预测。 + +具体的使用 demo 可以参考 [Transformer 预测库 C++ demo](../../../../paddlenlp/ops/fast_transformer/src/demo/transformer_e2e.cc)。 + +## 模型评估 + +评估方式与动态图评估方式相同,预测结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): + +``` sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +git clone https://github.com/moses-smt/mosesdecoder.git +# 以英德翻译 newstest2014 测试数据为例 +perl mosesdecoder/scripts/generic/multi-bleu.perl ~/.paddlenlp/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data/newstest2014.tok.de < predict.tok.txt +``` + +执行上述操作之后,可以看到类似如下的结果,此处结果是 beam_size 为 5 时 base model 在 newstest2014 上的 BLEU 结果: +``` +BLEU = 26.89, 58.4/32.6/20.5/13.4 (BP=1.000, ratio=1.010, hyp_len=65166, ref_len=64506) +``` diff --git a/examples/machine_translation/transformer/fast_transformer/encoder_decoding_predict.py b/examples/machine_translation/transformer/fast_transformer/encoder_decoding_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..f27ea58ac929071875a05ea436f14b4090f54d49 --- /dev/null +++ b/examples/machine_translation/transformer/fast_transformer/encoder_decoding_predict.py @@ -0,0 +1,292 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +from pprint import pprint + +import numpy as np +import paddle +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.ops import FasterTransformer +from paddlenlp.utils.log import logger + +sys.path.append("../") +import reader # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.base.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--decoding_lib", + default="../../../../paddlenlp/ops/build/lib/libdecoding_op.so", + type=str, + help="Path of libdecoding_op.so. ", + ) + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + parser.add_argument( + "--enable_fast_encoder", + action="store_true", + help="Whether to use fast version encoder to predict. This is experimental option for now. ", + ) + parser.add_argument("--use_fp16_encoder", action="store_true", help="Whether to use fp16 encoder to predict. ") + parser.add_argument( + "--decoding_strategy", + default="beam_search", + type=str, + choices=["beam_search", "beam_search_v2", "topk_sampling", "topp_sampling"], + help="Decoding strategy. Can be one of ['beam_search', 'topk_sampling', 'topp_sampling']. ", + ) + parser.add_argument("--beam_size", default=4, type=int, help="Beam size. ") + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity rate for beam search. ") + parser.add_argument("--topk", default=4, type=int, help="The k value for topk_sampling. Default is 4. ") + parser.add_argument( + "--topp", + default=0.0, + type=float, + help="The probability threshold for topp_sampling. Default is 0.0 which means it won't go through topp_sampling. ", + ) + parser.add_argument("--batch_size", default=None, type=int, help="Batch size. ") + parser.add_argument( + "--profile", action="store_true", help="Whether to profile the performance using newstest2014 dataset. " + ) + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--test_file", + nargs="+", + default=None, + type=str, + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. 
Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + # Define data loader + # NOTE: Data yielded by DataLoader may be on CUDAPinnedPlace, + # but custom op doesn't support CUDAPinnedPlace. Hence, + # disable using CUDAPinnedPlace in DataLoader. + paddle.io.reader.use_pinned_memory(False) + test_loader, to_tokens = reader.create_infer_loader(args) + + # Define model + transformer = FasterTransformer( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + decoding_strategy=args.decoding_strategy, + beam_size=args.beam_size, + max_out_len=args.max_out_len, + diversity_rate=args.diversity_rate, + decoding_lib=args.decoding_lib, + use_fp16_decoding=args.use_fp16_decoding, + enable_fast_encoder=args.enable_fast_encoder, + use_fp16_encoder=args.use_fp16_encoder, + ) + + # Set evaluate mode + transformer.eval() + + # Load checkpoint. + transformer.load(init_from_params=os.path.join(args.init_from_params, "transformer.pdparams")) + + # Providing model_dict still works. 
+ # state_dict = paddle.load(os.path.join(args.init_from_params, + # "transformer.pdparams")) + # transformer.load(state_dict=state_dict) + + f = open(args.output_file, "w") + with paddle.no_grad(): + if args.profile: + import time + + start = time.time() + for (src_word,) in test_loader: + finished_seq = transformer(src_word=src_word) + if not args.profile: + if args.decoding_strategy == "beam_search" or args.decoding_strategy == "beam_search_v2": + finished_seq = finished_seq.numpy().transpose([1, 2, 0]) + elif args.decoding_strategy == "topk_sampling" or args.decoding_strategy == "topp_sampling": + finished_seq = np.expand_dims(finished_seq.numpy().transpose([1, 0]), axis=1) + for ins in finished_seq: + for beam_idx, beam in enumerate(ins): + if beam_idx >= args.n_best: + break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + word_list = to_tokens(id_list) + sequence = " ".join(word_list) + "\n" + f.write(sequence) + if args.profile: + if args.decoding_strategy == "beam_search" or args.decoding_strategy == "beam_search_v2": + logger.info( + "Setting info: batch size: {}, beam size: {}, use fp16: {}. ".format( + args.infer_batch_size, args.beam_size, args.use_fp16_decoding + ) + ) + elif args.decoding_strategy == "topk_sampling": + logger.info( + "Setting info: batch size: {}, topk: {}, use fp16: {}. ".format( + args.infer_batch_size, args.topk, args.use_fp16_decoding + ) + ) + elif args.decoding_strategy == "topp_sampling": + logger.info( + "Setting info: batch size: {}, topp: {}, use fp16: {}. ".format( + args.infer_batch_size, args.topp, args.use_fp16_decoding + ) + ) + paddle.device.cuda.synchronize(place) + logger.info( + "Average time latency is {} ms/batch. ".format((time.time() - start) / len(test_loader) * 1000) + ) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.decoding_lib = ARGS.decoding_lib + args.use_fp16_decoding = ARGS.use_fp16_decoding + args.enable_fast_encoder = ARGS.enable_fast_encoder + args.use_fp16_encoder = ARGS.use_fp16_encoder + args.decoding_strategy = ARGS.decoding_strategy + args.beam_size = ARGS.beam_size + args.diversity_rate = ARGS.diversity_rate + args.topk = ARGS.topk + args.topp = ARGS.topp + args.profile = ARGS.profile + args.benchmark = ARGS.benchmark + if ARGS.batch_size: + args.infer_batch_size = ARGS.batch_size + args.data_dir = ARGS.data_dir + args.test_file = ARGS.test_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. 
" + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + + do_predict(args) diff --git a/examples/machine_translation/transformer/fast_transformer/export_model.py b/examples/machine_translation/transformer/fast_transformer/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..0f444985c36f32723d243ef9792dc026d21a256a --- /dev/null +++ b/examples/machine_translation/transformer/fast_transformer/export_model.py @@ -0,0 +1,203 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +from pprint import pprint + +import paddle +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.ops import FasterTransformer +from paddlenlp.utils.log import logger + +sys.path.append("../") +import reader # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.base.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--decoding_lib", + default="../../../../paddlenlp/ops/build/lib/libdecoding_op.so", + type=str, + help="Path of libdecoding_op.so. ", + ) + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + parser.add_argument( + "--enable_fast_encoder", + action="store_true", + help="Whether to use fast version encoder to predict. This is experimental option for now. ", + ) + parser.add_argument("--use_fp16_encoder", action="store_true", help="Whether to use fp16 encoder to predict. ") + parser.add_argument( + "--decoding_strategy", + default="beam_search", + type=str, + choices=["beam_search", "topk_sampling", "topp_sampling", "beam_search_v2"], + help="Decoding strategy. Can be one of ['beam_search', 'topk_sampling', 'topp_sampling', 'beam_search_v2']. ", + ) + parser.add_argument("--beam_size", default=4, type=int, help="Beam size. ") + parser.add_argument("--topk", default=4, type=int, help="The k value for topk_sampling. Default is 4. ") + parser.add_argument( + "--topp", + default=0.0, + type=float, + help="The probability threshold for topp_sampling. Default is 0.0 which means it won't go through topp_sampling. ", + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. 
", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + reader.adapt_vocab_size(args) + + # Define model + transformer = FasterTransformer( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + decoding_strategy=args.decoding_strategy, + beam_size=args.beam_size, + max_out_len=args.max_out_len, + decoding_lib=args.decoding_lib, + use_fp16_decoding=args.use_fp16_decoding, + enable_fast_encoder=args.enable_fast_encoder, + use_fp16_encoder=args.use_fp16_encoder, + rel_len=args.use_rel_len, + alpha=args.alpha, + ) + + # Set evaluate mode + transformer.eval() + + # Load checkpoint. + transformer.load(init_from_params=os.path.join(args.init_from_params, "transformer.pdparams")) + + # Convert dygraph model to static graph model + transformer = paddle.jit.to_static( + transformer, + input_spec=[ + # src_word + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # trg_word + # Support exporting model which support force decoding + # NOTE: Data type MUST be int32 ! 
+ # paddle.static.InputSpec( + # shape=[None, None], dtype="int32") + ], + ) + + # Save converted static graph model + paddle.jit.save(transformer, os.path.join(args.inference_model_dir, "transformer")) + logger.info("Transformer has been saved to {}".format(args.inference_model_dir)) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.decoding_lib = ARGS.decoding_lib + args.use_fp16_decoding = ARGS.use_fp16_decoding + args.enable_fast_encoder = ARGS.enable_fast_encoder + args.use_fp16_encoder = ARGS.use_fp16_encoder + args.decoding_strategy = ARGS.decoding_strategy + args.beam_size = ARGS.beam_size + args.topk = ARGS.topk + args.topp = ARGS.topp + args.benchmark = ARGS.benchmark + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + + do_predict(args) diff --git a/examples/machine_translation/transformer/images/multi_head_attention.png b/examples/machine_translation/transformer/images/multi_head_attention.png new file mode 100644 index 0000000000000000000000000000000000000000..427fb6b32aaeb7013066a167aab4fb97c024c2d6 Binary files /dev/null and b/examples/machine_translation/transformer/images/multi_head_attention.png differ diff --git a/examples/machine_translation/transformer/images/transformer_network.png b/examples/machine_translation/transformer/images/transformer_network.png new file mode 100644 index 0000000000000000000000000000000000000000..34be0e5c7e2b08f858683d86353db5e81049c7ca Binary files /dev/null and b/examples/machine_translation/transformer/images/transformer_network.png differ diff --git a/examples/machine_translation/transformer/predict.py b/examples/machine_translation/transformer/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..9226da595e65996c5e2a7ec93b471b647cfee62d --- /dev/null +++ b/examples/machine_translation/transformer/predict.py @@ -0,0 +1,234 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +import paddle +import reader +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.ops import TransformerGenerator + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="./configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--test_file", + nargs="+", + default=None, + type=str, + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument("--without_ft", action="store_true", help="Whether to use FastGeneration to do predict. ") + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu", "npu", "mlu"], help="Device selected for inference." + ) + + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. 
+ """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + if args.device == "gpu": + place = "gpu" + elif args.device == "xpu": + place = "xpu" + elif args.device == "npu": + place = "npu" + elif args.device == "mlu": + place = "mlu" + else: + place = "cpu" + + paddle.set_device(place) + + # Define data loader + test_loader, to_tokens = reader.create_infer_loader(args) + + # Define model + # `TransformerGenerator` automatically chioces using `FastGeneration` + # (with jit building) or the slower verison `InferTransformerModel`. + transformer = TransformerGenerator( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + beam_size=args.beam_size, + max_out_len=args.max_out_len, + use_ft=not args.without_ft, + beam_search_version=args.beam_search_version, + normalize_before=args.get("normalize_before", True), + rel_len=args.use_rel_len, # only works when using FT or beam search v2 + alpha=args.alpha, # only works when using beam search v2 + diversity_rate=args.diversity_rate, # only works when using FT + use_fp16_decoding=False, + ) # only works when using FT + + # Load the trained model + assert args.init_from_params, "Please set init_from_params to load the infer model." + + transformer.load(os.path.join(args.init_from_params, "transformer.pdparams")) + + # Providing model_dict still works. + # state_dict = paddle.load(os.path.join(args.init_from_params, + # "transformer.pdparams")) + # transformer.load(state_dict=state_dict) + + # Set evaluate mode + transformer.eval() + + f = open(args.output_file, "w", encoding="utf-8") + with paddle.no_grad(): + for (src_word,) in test_loader: + # When `output_time_major` argument is `True` for TransformerGenerator, + # the shape of finished_seq is `[seq_len, batch_size, beam_size]` + # for beam search v1 or `[seq_len, batch_size, beam_size * 2]` for + # beam search v2. 
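+            # The transpose below rearranges it to [batch_size, beam_size, seq_len]
+            # before the ids are converted back to tokens and written out.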
+ finished_seq = transformer(src_word=src_word) + finished_seq = finished_seq.numpy().transpose([1, 2, 0]) + for ins in finished_seq: + for beam_idx, beam in enumerate(ins): + if beam_idx >= args.n_best: + break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + word_list = to_tokens(id_list) + sequence = " ".join(word_list) + "\n" + f.write(sequence) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + args.without_ft = ARGS.without_ft + args.data_dir = ARGS.data_dir + args.test_file = ARGS.test_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + + args.device = ARGS.device + pprint(args) + + do_predict(args) diff --git a/examples/machine_translation/transformer/reader.py b/examples/machine_translation/transformer/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..32687908ed8acea2f538b75b375e03cb2dbf89b5 --- /dev/null +++ b/examples/machine_translation/transformer/reader.py @@ -0,0 +1,545 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import itertools +import os +import sys +from functools import partial + +import numpy as np +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader + +from paddlenlp.data import Pad, Vocab + + +def min_max_filer(data, max_len, min_len=0): + # 1 for special tokens. 
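+    # Compare source/target lengths (plus one position for the appended
+    # bos/eos token) against the [min_len, max_len] bounds.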
+ data_min_len = min(len(data["source"]), len(data["target"])) + 1 + data_max_len = max(len(data["source"]), len(data["target"])) + 1 + return (data_min_len >= min_len) and (data_max_len <= max_len) + + +def padding_vocab(x, args): + return (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor + + +def create_data_loader(args, places=None): + use_custom_dataset = args.train_file is not None or args.dev_file is not None or args.data_dir is not None + map_kwargs = {} + if use_custom_dataset: + data_files = {} + if args.data_dir is not None: + if os.path.exist( + os.path.join(args.data_dir, "train.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)) + ): + data_files["train"] = [ + os.path.join(args.data_dir, "train.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)), + os.path.join(args.data_dir, "train.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang)), + ] + if os.path.exist( + os.path.join(args.data_dir, "dev.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)) + ): + data_files["dev"] = [ + os.path.join(args.data_dir, "dev.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)), + os.path.join(args.data_dir, "dev.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang)), + ] + else: + # datasets.load_dataset doesn't support tuple + if args.train_file is not None: + data_files["train"] = list(args.train_file) + if args.dev_file is not None: + data_files["dev"] = list(args.dev_file) + + from datasets import load_dataset + + if len(data_files) > 0: + for split in data_files: + if isinstance(data_files[split], (list, tuple)): + for i, path in enumerate(data_files[split]): + data_files[split][i] = os.path.abspath(data_files[split][i]) + else: + data_files[split] = os.path.abspath(data_files[split]) + + datasets = load_dataset("language_pair", data_files=data_files, split=("train", "dev")) + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --src_vocab must be specified when using custom dataset. ") + + else: + from paddlenlp.datasets import load_dataset + + datasets = load_dataset("wmt14ende", splits=("train", "dev")) + + map_kwargs["lazy"] = False + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + elif not args.benchmark: + src_vocab = Vocab.load_vocabulary(**datasets[0].vocab_info["bpe"]) + else: + src_vocab = Vocab.load_vocabulary(**datasets[0].vocab_info["benchmark"]) + + if use_custom_dataset and not args.joined_dictionary: + if args.trg_vocab is not None: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --trg_vocab must be specified when the dict is not joined. 
") + else: + trg_vocab = src_vocab + + args.src_vocab_size = padding_vocab(len(src_vocab), args) + args.trg_vocab_size = padding_vocab(len(trg_vocab), args) + + if args.bos_token is not None: + args.bos_idx = src_vocab.get_bos_token_id() + if args.eos_token is not None: + args.eos_idx = src_vocab.get_eos_token_id() + if args.pad_token is not None: + args.pad_idx = src_vocab.get_pad_token_id() + else: + args.pad_idx = args.bos_idx + + def convert_samples(sample): + source = sample["source"].split() + sample["source"] = src_vocab.to_indices(source) + + target = sample["target"].split() + sample["target"] = trg_vocab.to_indices(target) + + return sample + + data_loaders = [(None)] * 2 + for i, dataset in enumerate(datasets): + dataset = dataset.map(convert_samples, **map_kwargs).filter(partial(min_max_filer, max_len=args.max_length)) + batch_sampler = TransformerBatchSampler( + dataset=dataset, + batch_size=args.batch_size, + pool_size=args.pool_size, + sort_type=args.sort_type, + shuffle=args.shuffle, + shuffle_batch=args.shuffle_batch, + use_token_batch=True, + max_length=args.max_length, + distribute_mode=True if i == 0 else False, + world_size=dist.get_world_size(), + rank=dist.get_rank(), + pad_seq=args.pad_seq, + bsz_multi=args.bsz_multi, + ) + + data_loader = DataLoader( + dataset=dataset, + places=places, + batch_sampler=batch_sampler, + collate_fn=partial( + prepare_train_input, + bos_idx=args.bos_idx, + eos_idx=args.eos_idx, + pad_idx=args.pad_idx, + pad_seq=args.pad_seq, + dtype=args.input_dtype, + ), + num_workers=args.num_workers, + ) + data_loaders[i] = data_loader + return data_loaders + + +def create_infer_loader(args): + use_custom_dataset = args.test_file is not None or args.data_dir is not None + map_kwargs = {} + if use_custom_dataset: + data_files = {} + if args.data_dir is not None: + if os.path.exist( + os.path.join(args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)) + ): + data_files["test"] = [ + os.path.join(args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.src_lang)), + os.path.join(args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang)) + if os.path.exist( + os.path.join( + args.data_dir, "test.{}-{}.{}".format(args.src_lang, args.trg_lang, args.trg_lang) + ) + ) + else None, + ] + else: + if args.test_file is not None: + # datasets.load_dataset doesn't support tuple + data_files["test"] = list(args.test_file) if isinstance(args.test_file, tuple) else args.test_file + + from datasets import load_dataset + + if len(data_files) > 0: + for split in data_files: + if isinstance(data_files[split], (list, tuple)): + for i, path in enumerate(data_files[split]): + data_files[split][i] = os.path.abspath(data_files[split][i]) + else: + data_files[split] = os.path.abspath(data_files[split]) + + dataset = load_dataset("language_pair", data_files=data_files, split=("test")) + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --src_vocab must be specified when using custom dataset. 
") + + else: + from paddlenlp.datasets import load_dataset + + dataset = load_dataset("wmt14ende", splits=("test")) + + map_kwargs["lazy"] = False + + if args.src_vocab is not None: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + elif not args.benchmark: + src_vocab = Vocab.load_vocabulary(**dataset.vocab_info["bpe"]) + else: + src_vocab = Vocab.load_vocabulary(**dataset.vocab_info["benchmark"]) + + if use_custom_dataset and not args.joined_dictionary: + if args.trg_vocab is not None: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, + unk_token=args.unk_token, + bos_token=args.bos_token, + eos_token=args.eos_token, + pad_token=args.pad_token, + ) + else: + raise ValueError("The --trg_vocab must be specified when the dict is not joined. ") + else: + trg_vocab = src_vocab + + args.src_vocab_size = padding_vocab(len(src_vocab), args) + args.trg_vocab_size = padding_vocab(len(trg_vocab), args) + + if args.bos_token is not None: + args.bos_idx = src_vocab.get_bos_token_id() + if args.eos_token is not None: + args.eos_idx = src_vocab.get_eos_token_id() + if args.pad_token is not None: + args.pad_idx = src_vocab.get_pad_token_id() + else: + args.pad_idx = args.bos_idx + + def convert_samples(sample): + source = sample["source"].split() + sample["source"] = src_vocab.to_indices(source) + + if "target" in sample.keys() and sample["target"] != "": + target = sample["target"].split() + sample["target"] = trg_vocab.to_indices(target) + + return sample + + dataset = dataset.map(convert_samples, **map_kwargs) + + data_loader = DataLoader( + dataset=dataset, + batch_size=args.infer_batch_size, + shuffle=False, + drop_last=False, + collate_fn=partial( + prepare_infer_input, + bos_idx=args.bos_idx, + eos_idx=args.eos_idx, + pad_idx=args.pad_idx, + pad_seq=args.pad_seq, + dtype=args.input_dtype, + ), + num_workers=args.num_workers, + return_list=True, + ) + return data_loader, trg_vocab.to_tokens + + +def adapt_vocab_size(args): + if args.src_vocab: + src_vocab = Vocab.load_vocabulary( + filepath=args.src_vocab, bos_token=args.bos_token, eos_token=args.eos_token, pad_token=args.pad_token + ) + elif not args.benchmark: + from paddlenlp.datasets import load_dataset + + datasets = load_dataset("wmt14ende", splits=("test")) + src_vocab = Vocab.load_vocabulary(**datasets.vocab_info["bpe"]) + else: + from paddlenlp.datasets import load_dataset + + datasets = load_dataset("wmt14ende", splits=("test")) + src_vocab = Vocab.load_vocabulary(**datasets.vocab_info["benchmark"]) + + if not args.joined_dictionary: + if args.trg_vocab is not None: + trg_vocab = Vocab.load_vocabulary( + filepath=args.trg_vocab, bos_token=args.bos_token, eos_token=args.eos_token, pad_token=args.pad_token + ) + else: + raise ValueError("The --trg_vocab must be specified when the dict is not joined. ") + else: + trg_vocab = src_vocab + + args.src_vocab_size = padding_vocab(len(src_vocab), args) + args.trg_vocab_size = padding_vocab(len(trg_vocab), args) + + if args.bos_token is not None: + args.bos_idx = src_vocab.get_bos_token_id() + if args.eos_token is not None: + args.eos_idx = src_vocab.get_eos_token_id() + if args.pad_token is not None: + args.pad_idx = src_vocab.get_pad_token_id() + else: + args.pad_idx = args.bos_idx + + +def prepare_train_input(insts, bos_idx, eos_idx, pad_idx, pad_seq=1, dtype="int64"): + """ + Put all padded data needed by training into a list. 
+ """ + word_pad = Pad(pad_idx, dtype=dtype) + + src_max_len = (max([len(inst["source"]) for inst in insts]) + pad_seq) // pad_seq * pad_seq + trg_max_len = (max([len(inst["target"]) for inst in insts]) + pad_seq) // pad_seq * pad_seq + src_word = word_pad( + [inst["source"] + [eos_idx] + [pad_idx] * (src_max_len - 1 - len(inst["source"])) for inst in insts] + ) + trg_word = word_pad( + [[bos_idx] + inst["target"] + [pad_idx] * (trg_max_len - 1 - len(inst["target"])) for inst in insts] + ) + lbl_word = np.expand_dims( + word_pad([inst["target"] + [eos_idx] + [pad_idx] * (trg_max_len - 1 - len(inst["target"])) for inst in insts]), + axis=2, + ) + + data_inputs = [src_word, trg_word, lbl_word] + + return data_inputs + + +def prepare_infer_input(insts, bos_idx, eos_idx, pad_idx, pad_seq=1, dtype="int64"): + """ + Put all padded data needed by beam search decoder into a list. + """ + word_pad = Pad(pad_idx, dtype=dtype) + + src_max_len = (max([len(inst["source"]) for inst in insts]) + pad_seq) // pad_seq * pad_seq + src_word = word_pad( + [inst["source"] + [eos_idx] + [pad_idx] * (src_max_len - 1 - len(inst["source"])) for inst in insts] + ) + + return [ + src_word, + ] + + +class SortType(object): + GLOBAL = "global" + POOL = "pool" + NONE = "none" + + +class SentenceBatchCreator(object): + def __init__(self, batch_size): + self.batch = [] + self._batch_size = batch_size + + def append(self, info): + self.batch.append(info) + if len(self.batch) == self._batch_size: + tmp = self.batch + self.batch = [] + return tmp + + +class TokenBatchCreator(object): + def __init__(self, batch_size, bsz_multi=1): + self._batch = [] + self.max_len = -1 + self._batch_size = batch_size + self._bsz_multi = bsz_multi + + def append(self, info): + cur_len = info.max_len + max_len = max(self.max_len, cur_len) + if max_len * (len(self._batch) + 1) > self._batch_size: + # Make sure the batch size won't be empty. 
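+            # Emit the samples collected so far (kept to a multiple of
+            # _bsz_multi when possible) and carry the remainder plus the new
+            # sample over to the next batch.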
+ mode_len = max(len(self._batch) // self._bsz_multi * self._bsz_multi, len(self._batch) % self._bsz_multi) + result = self._batch[:mode_len] + self._batch = self._batch[mode_len:] + self._batch.append(info) + self.max_len = max([b.max_len for b in self._batch]) + return result + else: + self.max_len = max_len + self._batch.append(info) + + @property + def batch(self): + return self._batch + + +class SampleInfo(object): + def __init__(self, i, lens, pad_seq=1): + self.i = i + # Take bos and eos into account + self.min_len = min(lens[0], lens[1]) + 1 + self.max_len = (max(lens[0], lens[1]) + pad_seq) // pad_seq * pad_seq + self.seq_max_len = max(lens[0], lens[1]) + 1 + self.src_len = lens[0] + 1 + self.trg_len = lens[1] + 1 + + +class TransformerBatchSampler(BatchSampler): + def __init__( + self, + dataset, + batch_size, + pool_size=10000, + sort_type=SortType.NONE, + min_length=0, + max_length=100, + shuffle=False, + shuffle_batch=False, + use_token_batch=False, + clip_last_batch=False, + distribute_mode=True, + seed=0, + world_size=1, + rank=0, + pad_seq=1, + bsz_multi=8, + ): + for arg, value in locals().items(): + if arg != "self": + setattr(self, "_" + arg, value) + self._random = np.random + self._random.seed(seed) + # for multi-devices + self._distribute_mode = distribute_mode + self._nranks = world_size + self._local_rank = rank + self._sample_infos = [] + for i, data in enumerate(self._dataset): + lens = [len(data["source"]), len(data["target"])] + self._sample_infos.append(SampleInfo(i, lens, self._pad_seq)) + + def __iter__(self): + # global sort or global shuffle + if self._sort_type == SortType.GLOBAL: + infos = sorted(self._sample_infos, key=lambda x: x.trg_len) + infos = sorted(infos, key=lambda x: x.src_len) + else: + if self._shuffle: + infos = self._sample_infos + self._random.shuffle(infos) + else: + infos = self._sample_infos + + if self._sort_type == SortType.POOL: + reverse = True + for i in range(0, len(infos), self._pool_size): + # To avoid placing short next to long sentences + reverse = not reverse + infos[i : i + self._pool_size] = sorted( + infos[i : i + self._pool_size], key=lambda x: x.seq_max_len, reverse=reverse + ) + + batches = [] + batch_creator = ( + TokenBatchCreator(self._batch_size, self._bsz_multi) + if self._use_token_batch + else SentenceBatchCreator(self._batch_size * self._nranks) + ) + + for info in infos: + batch = batch_creator.append(info) + if batch is not None: + batches.append(batch) + + if not self._clip_last_batch and len(batch_creator.batch) != 0: + batches.append(batch_creator.batch) + + if self._shuffle_batch: + self._random.shuffle(batches) + + if not self._use_token_batch: + # When producing batches according to sequence number, to confirm + # neighbor batches which would be feed and run parallel have similar + # length (thus similar computational cost) after shuffle, we as take + # them as a whole when shuffling and split here + batches = [ + [batch[self._batch_size * i : self._batch_size * (i + 1)] for i in range(self._nranks)] + for batch in batches + ] + batches = list(itertools.chain.from_iterable(batches)) + self.batch_number = (len(batches) + self._nranks - 1) // self._nranks + + # for multi-device + for batch_id, batch in enumerate(batches): + if not self._distribute_mode or (batch_id % self._nranks == self._local_rank): + batch_indices = [info.i for info in batch] + yield batch_indices + if self._distribute_mode and len(batches) % self._nranks != 0: + if self._local_rank >= len(batches) % self._nranks: + # use previous data 
to pad + yield batch_indices + + def __len__(self): + if hasattr(self, "batch_number"): # + return self.batch_number + if not self._use_token_batch: + batch_number = (len(self._dataset) + self._batch_size * self._nranks - 1) // ( + self._batch_size * self._nranks + ) + else: + # For uncertain batch number, the actual value is self.batch_number + batch_number = sys.maxsize + return batch_number diff --git a/examples/machine_translation/transformer/static/predict.py b/examples/machine_translation/transformer/static/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..41bada379268d109fdc2dc30d76e0258b721d8b3 --- /dev/null +++ b/examples/machine_translation/transformer/static/predict.py @@ -0,0 +1,231 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +from pprint import pprint + +import numpy as np +import paddle +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.transformers import InferTransformerModel + +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))) +import reader # noqa: E402 + + +def cast_parameters_to_fp32(place, program, scope=None): + all_parameters = [] + for block in program.blocks: + all_parameters.extend(block.all_parameters()) + + var_scope = scope if scope else paddle.static.global_scope() + for param in all_parameters: + tensor = var_scope.find_var(param.name).get_tensor() + if "fp16" in str(tensor._dtype()).lower() and "fp32" in str(param.dtype).lower(): + data = np.array(tensor) + tensor.set(np.float32(data), place) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--test_file", + nargs="+", + default=None, + type=str, + help="The files for test. Can be set by using --test_file source_language_file. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. 
If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + paddle.enable_static() + if args.device == "gpu": + place = paddle.set_device("gpu") + else: + place = paddle.set_device("cpu") + + # Define data loader + test_loader, to_tokens = reader.create_infer_loader(args) + + test_program = paddle.static.Program() + startup_program = paddle.static.Program() + with paddle.static.program_guard(test_program, startup_program): + src_word = paddle.static.data(name="src_word", shape=[None, None], dtype=args.input_dtype) + + # Define model + transformer = InferTransformerModel( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + beam_size=args.beam_size, + max_out_len=args.max_out_len, + ) + + finished_seq = transformer(src_word=src_word) + + test_program = test_program.clone(for_test=True) + + exe = paddle.static.Executor(place) + exe.run(startup_program) + + assert args.init_from_params, "must set init_from_params to load parameters" + paddle.static.load(test_program, os.path.join(args.init_from_params, "transformer"), exe) + print("finish initing model from params from %s" % (args.init_from_params)) + + # cast weights from fp16 to fp32 after loading + if args.use_pure_fp16: + cast_parameters_to_fp32(place, test_program) + + f = open(args.output_file, "w") + for data in test_loader: + (finished_sequence,) = exe.run(test_program, feed={"src_word": data[0]}, fetch_list=finished_seq.name) + finished_sequence = finished_sequence.transpose([0, 2, 1]) + for ins in finished_sequence: + for beam_idx, beam in enumerate(ins): + if beam_idx >= args.n_best: + break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + word_list = to_tokens(id_list) + sequence = " ".join(word_list) + "\n" + f.write(sequence) + + paddle.disable_static() + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + args.data_dir = 
ARGS.data_dir + args.test_file = ARGS.test_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + + do_predict(args) diff --git a/examples/machine_translation/transformer/static/train.py b/examples/machine_translation/transformer/static/train.py new file mode 100644 index 0000000000000000000000000000000000000000..13ca011345aaba6f51c0130ad43d0b1d5c46a05e --- /dev/null +++ b/examples/machine_translation/transformer/static/train.py @@ -0,0 +1,414 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +import time +from pprint import pprint + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.distributed.fleet as fleet +import yaml +from easydict import EasyDict as AttrDict + +from paddlenlp.transformers import CrossEntropyCriterion, TransformerModel +from paddlenlp.utils import profiler +from paddlenlp.utils.log import logger + +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))) +import reader # noqa: E402 +from tls.record import AverageStatistical # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="../configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument("--distributed", action="store_true", help="Whether to use fleet to launch. ") + parser.add_argument("--max_iter", default=None, type=int, help="The maximum iteration for training. ") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. 
If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--train_file", + nargs="+", + default=None, + type=str, + help="The files for training, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--dev_file", + nargs="+", + default=None, + type=str, + help="The files for validation, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. ", + ) + parser.add_argument("--weight_decay", default=None, type=float, help="Weight Decay for optimizer. ") + + # For benchmark. 
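+    # The value of --profiler_options is passed unchanged to
+    # paddlenlp.utils.profiler.add_profiler_step() on each training step inside do_train().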
+ parser.add_argument( + "--profiler_options", + type=str, + default=None, + help='The option of profiler, which should be in format "key1=value1;key2=value2;key3=value3".', + ) + args = parser.parse_args() + return args + + +def do_train(args): + paddle.enable_static() + if args.is_distributed: + fleet.init(is_collective=True) + assert args.device != "xpu", "xpu doesn't support distributed training" + places = [paddle.set_device("gpu")] if args.device == "gpu" else paddle.static.cpu_places() + trainer_count = len(places) + else: + if args.device == "gpu": + places = paddle.static.cuda_places() + elif args.device == "xpu": + places = paddle.static.xpu_places() + paddle.set_device("xpu") + else: + places = paddle.static.cpu_places() + paddle.set_device("cpu") + trainer_count = len(places) + + # Set seed for CE + random_seed = eval(str(args.random_seed)) + if random_seed is not None: + paddle.seed(random_seed) + + # Define data loader + (train_loader), (eval_loader) = reader.create_data_loader(args, places=places) + + train_program = paddle.static.Program() + startup_program = paddle.static.Program() + with paddle.static.program_guard(train_program, startup_program): + src_word = paddle.static.data(name="src_word", shape=[None, None], dtype=args.input_dtype) + trg_word = paddle.static.data(name="trg_word", shape=[None, None], dtype=args.input_dtype) + lbl_word = paddle.static.data(name="lbl_word", shape=[None, None, 1], dtype=args.input_dtype) + + # Define model + transformer = TransformerModel( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + ) + # Define loss + criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx) + + logits = transformer(src_word=src_word, trg_word=trg_word) + + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + scheduler = paddle.optimizer.lr.NoamDecay(args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0) + + # Define optimizer + optimizer = paddle.optimizer.Adam( + learning_rate=scheduler, + beta1=args.beta1, + beta2=args.beta2, + epsilon=float(args.eps), + parameters=transformer.parameters(), + weight_decay=args.weight_decay, + ) + + if args.is_distributed: + build_strategy = paddle.static.BuildStrategy() + exec_strategy = paddle.static.ExecutionStrategy() + dist_strategy = fleet.DistributedStrategy() + dist_strategy.build_strategy = build_strategy + dist_strategy.execution_strategy = exec_strategy + dist_strategy.fuse_grad_size_in_MB = 16 + + if args.use_amp: + dist_strategy.amp = True + dist_strategy.amp_configs = { + "custom_white_list": ["softmax", "layer_norm"], + "init_loss_scaling": args.scale_loss, + "custom_black_list": ["lookup_table_v2"], + "use_pure_fp16": args.use_pure_fp16, + } + + optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) + else: + if args.use_amp: + amp_list = paddle.static.amp.AutoMixedPrecisionLists( + custom_white_list=["softmax", "layer_norm"], custom_black_list=["lookup_table_v2"] + ) + optimizer = paddle.static.amp.decorate( + optimizer, + amp_list, + init_loss_scaling=args.scale_loss, + use_dynamic_loss_scaling=True, + use_pure_fp16=args.use_pure_fp16, + ) + optimizer.minimize(avg_cost) + + if args.is_distributed: + exe = 
paddle.static.Executor(places[0]) + else: + exe = paddle.static.Executor() + build_strategy = paddle.static.BuildStrategy() + exec_strategy = paddle.static.ExecutionStrategy() + + compiled_train_program = paddle.static.CompiledProgram(train_program, build_strategy=build_strategy) + exe.run(startup_program) + + if args.use_amp: + optimizer.amp_init(places[0]) + + # the best cross-entropy value with label smoothing + loss_normalizer = -( + (1.0 - args.label_smooth_eps) * np.log((1.0 - args.label_smooth_eps)) + + args.label_smooth_eps * np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20) + ) + + step_idx = 0 + + # For benchmark + reader_cost_avg = AverageStatistical() + batch_cost_avg = AverageStatistical() + batch_ips_avg = AverageStatistical() + + for pass_id in range(args.epoch): + batch_id = 0 + batch_start = time.time() + for data in train_loader: + # NOTE: used for benchmark and use None as default. + if args.max_iter and step_idx == args.max_iter: + break + if trainer_count == 1: + data = [data] + train_reader_cost = time.time() - batch_start + + if args.is_distributed: + outs = exe.run( + train_program, + feed=[ + { + "src_word": data[i][0], + "trg_word": data[i][1], + "lbl_word": data[i][2], + } + for i in range(trainer_count) + ], + fetch_list=[sum_cost.name, token_num.name], + ) + train_batch_cost = time.time() - batch_start + batch_ips_avg.record(train_batch_cost, np.asarray(outs[1]).sum()) + else: + outs = exe.run( + compiled_train_program, + feed=[ + { + "src_word": data[i][0], + "trg_word": data[i][1], + "lbl_word": data[i][2], + } + for i in range(trainer_count) + ], + fetch_list=[sum_cost.name, token_num.name], + ) + train_batch_cost = time.time() - batch_start + batch_ips_avg.record(train_batch_cost, np.asarray(outs[1]).sum() / trainer_count) + scheduler.step() + + reader_cost_avg.record(train_reader_cost) + batch_cost_avg.record(train_batch_cost) + + # Profile for model benchmark + if args.profiler_options is not None: + profiler.add_profiler_step(args.profiler_options) + + if step_idx % args.print_step == 0 and ( + args.benchmark or (args.is_distributed and dist.get_rank() == 0) or not args.is_distributed + ): + sum_cost_val, token_num_val = np.array(outs[0]), np.array(outs[1]) + # Sum the cost from multi-devices + total_sum_cost = sum_cost_val.sum() + total_token_num = token_num_val.sum() + total_avg_cost = total_sum_cost / total_token_num + + if step_idx == 0: + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f" + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + ) + ) + else: + train_avg_batch_cost = args.print_step / batch_cost_avg.get_total_time() + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f, avg_speed: %.2f step/s, " + "batch_cost: %.5f sec, reader_cost: %.5f sec, tokens: %d, " + "ips: %.5f words/sec" + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + train_avg_batch_cost, + batch_cost_avg.get_average(), + reader_cost_avg.get_average(), + batch_ips_avg.get_total_cnt(), + batch_ips_avg.get_average_per_sec(), + ) + ) + reader_cost_avg.reset() + batch_cost_avg.reset() + batch_ips_avg.reset() + + if step_idx % args.save_step == 0 and step_idx != 0: + if args.save_model and dist.get_rank() == 0: + model_path = os.path.join(args.save_model, "step_" + str(step_idx), "transformer") + 
paddle.static.save(train_program, model_path) + + batch_id += 1 + step_idx += 1 + batch_start = time.time() + + # NOTE: used for benchmark and use None as default. + if args.max_iter and step_idx == args.max_iter: + break + + if args.save_model and dist.get_rank() == 0: + model_path = os.path.join(args.save_model, "step_final", "transformer") + paddle.static.save(train_program, model_path) + + paddle.disable_static() + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + args.is_distributed = ARGS.distributed + if ARGS.max_iter: + args.max_iter = ARGS.max_iter + args.weight_decay = ARGS.weight_decay + + args.data_dir = ARGS.data_dir + args.train_file = ARGS.train_file + args.dev_file = ARGS.dev_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + pprint(args) + args.profiler_options = ARGS.profiler_options + + do_train(args) diff --git a/examples/machine_translation/transformer/tls/distributed_utils.py b/examples/machine_translation/transformer/tls/distributed_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..26d6c0ca8d90683eeb918f9c1bdf9c2e855c6b1f --- /dev/null +++ b/examples/machine_translation/transformer/tls/distributed_utils.py @@ -0,0 +1,19 @@ +import paddle +import paddle.distributed as dist + + +def all_gather_tokens(data): + """Gathers num of tokens from all nodes. + `data` should be a tensor of num of tokens. 
+ """ + if dist.get_world_size() < 2: + return data + if not hasattr(all_gather_tokens, "_in_buffer") or all_gather_tokens._in_buffer is None: + all_gather_tokens._in_buffer = data + all_gather_tokens._out_buffers = [] + in_buffer = all_gather_tokens._in_buffer + out_buffers = all_gather_tokens._out_buffers + + dist.all_gather(out_buffers, in_buffer) + + return paddle.add_n(out_buffers) diff --git a/examples/machine_translation/transformer/tls/record.py b/examples/machine_translation/transformer/tls/record.py new file mode 100644 index 0000000000000000000000000000000000000000..d1ddc738a5280978255d02eed4682adc59543794 --- /dev/null +++ b/examples/machine_translation/transformer/tls/record.py @@ -0,0 +1,29 @@ +class AverageStatistical(object): + def __init__(self): + self.reset() + + def reset(self): + self.total_cnt = 0 + self.time = 0 + + def record(self, val, cnt=1): + self.time += val + self.total_cnt += cnt + + def get_average(self): + if self.total_cnt == 0: + return 0 + + return self.time / self.total_cnt + + def get_average_per_sec(self): + if self.time == 0.0: + return 0.0 + + return float(self.total_cnt) / self.time + + def get_total_cnt(self): + return self.total_cnt + + def get_total_time(self): + return self.time diff --git a/examples/machine_translation/transformer/tls/to_static.py b/examples/machine_translation/transformer/tls/to_static.py new file mode 100644 index 0000000000000000000000000000000000000000..96a41a6159aa53861d20a0d1dfa9ce36bf8aee63 --- /dev/null +++ b/examples/machine_translation/transformer/tls/to_static.py @@ -0,0 +1,31 @@ +# copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from paddle.jit import to_static + + +def create_input_specs(): + src_word = paddle.static.InputSpec(name="src_word", shape=[None, None], dtype="int64") + trg_word = paddle.static.InputSpec(name="trg_word", shape=[None, None], dtype="int64") + return [src_word, trg_word] + + +def apply_to_static(config, model): + support_to_static = config.get("to_static", False) + if support_to_static: + specs = create_input_specs() + model = to_static(model, input_spec=specs) + print("Successfully to apply @to_static with specs: {}".format(specs)) + return model diff --git a/examples/machine_translation/transformer/train.py b/examples/machine_translation/transformer/train.py new file mode 100644 index 0000000000000000000000000000000000000000..67465e0b8bae42aec4b31f9579b7eebc40020309 --- /dev/null +++ b/examples/machine_translation/transformer/train.py @@ -0,0 +1,476 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import inspect +import os +import time +from pprint import pprint + +import numpy as np +import paddle +import paddle.distributed as dist +import reader +import yaml +from easydict import EasyDict as AttrDict +from tls.record import AverageStatistical +from tls.to_static import apply_to_static + +from paddlenlp.transformers import CrossEntropyCriterion, TransformerModel +from paddlenlp.utils import profiler +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--config", default="./configs/transformer.big.yaml", type=str, help="Path of the config file. " + ) + parser.add_argument( + "--benchmark", + action="store_true", + help="Whether to print logs on each cards and use benchmark vocab. Normally, not necessary to set --benchmark. ", + ) + parser.add_argument("--max_iter", default=None, type=int, help="The maximum iteration for training. ") + parser.add_argument( + "--data_dir", + default=None, + type=str, + help="The dir of train, dev and test datasets. If data_dir is given, train_file and dev_file and test_file will be replaced by data_dir/[train|dev|test].\{src_lang\}-\{trg_lang\}.[\{src_lang\}|\{trg_lang\}]. ", + ) + parser.add_argument( + "--train_file", + nargs="+", + default=None, + type=str, + help="The files for training, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--dev_file", + nargs="+", + default=None, + type=str, + help="The files for validation, including [source language file, target language file]. If it's None, the default WMT14 en-de dataset will be used. ", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + help="The vocab file. Normally, it shouldn't be set and in this case, the default WMT14 dataset will be used.", + ) + parser.add_argument( + "--src_vocab", + default=None, + type=str, + help="The vocab file for source language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument( + "--trg_vocab", + default=None, + type=str, + help="The vocab file for target language. If --vocab_file is given, the --vocab_file will be used. ", + ) + parser.add_argument("-s", "--src_lang", default=None, type=str, help="Source language. ") + parser.add_argument("-t", "--trg_lang", default=None, type=str, help="Target language. ") + parser.add_argument( + "--unk_token", + default=None, + type=str, + help="The unknown token. It should be provided when use custom vocab_file. ", + ) + parser.add_argument( + "--bos_token", default=None, type=str, help="The bos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--eos_token", default=None, type=str, help="The eos token. It should be provided when use custom vocab_file. " + ) + parser.add_argument( + "--pad_token", + default=None, + type=str, + help="The pad token. It should be provided when use custom vocab_file. And if it's None, bos_token will be used. 
", + ) + parser.add_argument("--batch_size", default=None, type=int, help="The maximum tokens per batch. ") + parser.add_argument( + "--use_amp", + default=None, + type=str, + choices=["true", "false", "True", "False"], + help="Whether to use amp to train Transformer. ", + ) + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu", "npu", "mlu"], help="Device selected for inference." + ) + parser.add_argument( + "--amp_level", + default=None, + type=str, + choices=["O1", "O2"], + help="The amp level if --use_amp is on. Can be one of [O1, O2]. ", + ) + parser.add_argument("--weight_decay", default=None, type=float, help="Weight Decay for optimizer. ") + + # For benchmark. + parser.add_argument( + "--profiler_options", + type=str, + default=None, + help='The option of profiler, which should be in format "key1=value1;key2=value2;key3=value3".', + ) + parser.add_argument("--to_static", action="store_true", help="Whether use to_static to train Transformer. ") + args = parser.parse_args() + return args + + +def do_train(args): + if args.device == "gpu": + rank = dist.get_rank() + trainer_count = dist.get_world_size() + elif args.device == "npu": + rank = dist.get_rank() + trainer_count = dist.get_world_size() + paddle.set_device("npu") + elif args.device == "xpu": + rank = dist.get_rank() + trainer_count = dist.get_world_size() + paddle.set_device("xpu") + elif args.device == "mlu": + rank = dist.get_rank() + trainer_count = dist.get_world_size() + paddle.set_device("mlu") + else: + rank = 0 + trainer_count = 1 + paddle.set_device("cpu") + + if trainer_count > 1: + dist.init_parallel_env() + + # Set seed for CE + random_seed = eval(str(args.random_seed)) + if random_seed is not None: + paddle.seed(random_seed) + + # Define data loader + (train_loader), (eval_loader) = reader.create_data_loader(args) + + # Define model + transformer = TransformerModel( + src_vocab_size=args.src_vocab_size, + trg_vocab_size=args.trg_vocab_size, + max_length=args.max_length + 1, + num_encoder_layers=args.n_layer, + num_decoder_layers=args.n_layer, + n_head=args.n_head, + d_model=args.d_model, + d_inner_hid=args.d_inner_hid, + dropout=args.dropout, + weight_sharing=args.weight_sharing, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + pad_id=args.pad_idx, + normalize_before=args.get("normalize_before", True), + ) + + transformer = apply_to_static(args, transformer) + + # Define loss + criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx if args.pad_idx is None else args.pad_idx) + + scheduler = paddle.optimizer.lr.NoamDecay(args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0) + + # Define optimizer + if "use_multi_tensor" not in inspect.getfullargspec(paddle.optimizer.Adam.__init__).args: + optimizer = paddle.optimizer.Adam( + learning_rate=scheduler, + beta1=args.beta1, + beta2=args.beta2, + epsilon=float(args.eps), + parameters=transformer.parameters(), + weight_decay=args.weight_decay, + ) + else: + optimizer = paddle.optimizer.Adam( + learning_rate=scheduler, + beta1=args.beta1, + beta2=args.beta2, + epsilon=float(args.eps), + parameters=transformer.parameters(), + use_multi_tensor=True, + weight_decay=args.weight_decay, + ) + + # Init from some checkpoint, to resume the previous training + if args.init_from_checkpoint: + model_dict = paddle.load(os.path.join(args.init_from_checkpoint, "transformer.pdparams")) + opt_dict = paddle.load(os.path.join(args.init_from_checkpoint, "transformer.pdopt")) + transformer.set_state_dict(model_dict) + 
optimizer.set_state_dict(opt_dict) + print("loaded from checkpoint.") + # Init from some pretrain models, to better solve the current task + if args.init_from_pretrain_model: + model_dict = paddle.load(os.path.join(args.init_from_pretrain_model, "transformer.pdparams")) + transformer.set_state_dict(model_dict) + print("loaded from pre-trained model.") + + # for amp training + if args.use_amp: + amp_level = "O2" if args.use_pure_fp16 else "O1" + scaler = paddle.amp.GradScaler(enable=True, init_loss_scaling=args.scale_loss) + transformer = paddle.amp.decorate(models=transformer, level=amp_level, save_dtype="float32") + + # for distributed training + if trainer_count > 1: + transformer = paddle.DataParallel(transformer) + + # The best cross-entropy value with label smoothing + loss_normalizer = -( + (1.0 - args.label_smooth_eps) * np.log((1.0 - args.label_smooth_eps)) + + args.label_smooth_eps * np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20) + ) + + step_idx = 0 + tokens_sum = 0 + + # For benchmark + reader_cost_avg = AverageStatistical() + batch_cost_avg = AverageStatistical() + batch_ips_avg = AverageStatistical() + + # Train loop + for pass_id in range(args.epoch): + epoch_start = time.time() + + batch_id = 0 + batch_start = time.time() + for input_data in train_loader: + train_reader_cost = time.time() - batch_start + (src_word, trg_word, lbl_word) = input_data + + if args.use_amp: + with paddle.amp.auto_cast( + custom_black_list={"scale", "reduce_sum", "elementwise_div"} if amp_level == "O2" else {}, + level=amp_level, + ): + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + scaled = scaler.scale(avg_cost) # scale the loss + scaled.backward() # do backward + + scaler.minimize(optimizer, scaled) # update parameters + if "set_to_zero" in inspect.getfullargspec(optimizer.clear_grad).args: + optimizer.clear_grad(set_to_zero=False) + else: + optimizer.clear_grad() + else: + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + avg_cost.backward() + + optimizer.step() + optimizer.clear_grad() + + train_batch_cost = time.time() - batch_start + reader_cost_avg.record(train_reader_cost) + batch_cost_avg.record(train_batch_cost) + batch_ips_avg.record(train_batch_cost, 0) + + tokens_sum += token_num + + # Profile for model benchmark + if args.profiler_options is not None: + profiler.add_profiler_step(args.profiler_options) + + # NOTE: For benchmark, loss infomation on all cards will be printed. 
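+            # Quantities logged below:
+            #   avg loss        - the criterion's avg_cost on this batch (label-smoothed cross entropy)
+            #   normalized loss - avg loss minus loss_normalizer, the best cross entropy
+            #                     achievable with label smoothing (computed above)
+            #   ppl             - exp(avg loss), with the exponent clipped at 100 to avoid overflow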
+ if step_idx % args.print_step == 0 and (args.benchmark or rank == 0): + total_avg_cost = avg_cost.numpy() + tokens_sum_val = tokens_sum.numpy() + batch_ips_avg.record(0, tokens_sum_val) + tokens_sum = 0 + + if step_idx == 0: + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f " + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + ) + ) + else: + train_avg_batch_cost = args.print_step / batch_cost_avg.get_total_time() + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f, avg_speed: %.2f step/sec, " + "batch_cost: %.5f sec, reader_cost: %.5f sec, tokens: %d, " + "ips: %.5f words/sec" + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + train_avg_batch_cost, + batch_cost_avg.get_average(), + reader_cost_avg.get_average(), + batch_ips_avg.get_total_cnt(), + batch_ips_avg.get_average_per_sec(), + ) + ) + reader_cost_avg.reset() + batch_cost_avg.reset() + batch_ips_avg.reset() + + if step_idx % args.save_step == 0 and step_idx != 0: + # Validation + transformer.eval() + total_sum_cost = 0 + total_token_num = 0 + with paddle.no_grad(): + for input_data in eval_loader: + (src_word, trg_word, lbl_word) = input_data + if args.use_amp: + with paddle.amp.auto_cast( + custom_black_list={"scale", "reduce_sum", "elementwise_div"} + if amp_level == "O2" + else {}, + level=amp_level, + ): + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + else: + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + total_sum_cost += sum_cost.numpy() + total_token_num += token_num.numpy() + total_avg_cost = total_sum_cost / total_token_num + logger.info( + "validation, step_idx: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f" + % ( + step_idx, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + ) + ) + transformer.train() + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_" + str(step_idx)) + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(transformer.state_dict(), os.path.join(model_dir, "transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "transformer.pdopt")) + + # NOTE: Used for benchmark and use None as default. + if args.max_iter and step_idx == args.max_iter: + break + batch_id += 1 + step_idx += 1 + scheduler.step() + batch_start = time.time() + + # NOTE: Used for benchmark and use None as default. 
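+        # Also leave the epoch loop once --max_iter training steps have been reached.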
+ if args.max_iter and step_idx == args.max_iter: + break + + train_epoch_cost = time.time() - epoch_start + logger.info("train epoch: %d, epoch_cost: %.5f s" % (pass_id, train_epoch_cost)) + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_final") + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(transformer.state_dict(), os.path.join(model_dir, "transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "transformer.pdopt")) + + +if __name__ == "__main__": + ARGS = parse_args() + yaml_file = ARGS.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + args.benchmark = ARGS.benchmark + if ARGS.max_iter: + args.max_iter = ARGS.max_iter + if ARGS.batch_size: + args.batch_size = ARGS.batch_size + if ARGS.use_amp: + ARGS.use_amp = ARGS.use_amp.lower() + if ARGS.use_amp == "true": + args.use_amp = True + else: + args.use_amp = False + if ARGS.amp_level: + args.use_pure_fp16 = ARGS.amp_level == "O2" + args.weight_decay = ARGS.weight_decay + + args.data_dir = ARGS.data_dir + args.train_file = ARGS.train_file + args.dev_file = ARGS.dev_file + + if ARGS.vocab_file is not None: + args.src_vocab = ARGS.vocab_file + args.trg_vocab = ARGS.vocab_file + args.joined_dictionary = True + elif ARGS.src_vocab is not None and ARGS.trg_vocab is None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.src_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + elif ARGS.src_vocab is None and ARGS.trg_vocab is not None: + args.vocab_file = args.trg_vocab = args.src_vocab = ARGS.trg_vocab + args.joined_dictionary = True + else: + args.src_vocab = ARGS.src_vocab + args.trg_vocab = ARGS.trg_vocab + args.joined_dictionary = not ( + args.src_vocab is not None and args.trg_vocab is not None and args.src_vocab != args.trg_vocab + ) + if args.weight_sharing != args.joined_dictionary: + if args.weight_sharing: + raise ValueError("The src_vocab and trg_vocab must be consistency when weight_sharing is True. ") + else: + raise ValueError( + "The src_vocab and trg_vocab must be specified respectively when weight sharing is False. " + ) + + if ARGS.src_lang is not None: + args.src_lang = ARGS.src_lang + if ARGS.trg_lang is not None: + args.trg_lang = ARGS.trg_lang + + args.unk_token = ARGS.unk_token + args.bos_token = ARGS.bos_token + args.eos_token = ARGS.eos_token + args.pad_token = ARGS.pad_token + if ARGS.to_static: + args.to_static = ARGS.to_static + args.device = ARGS.device + pprint(args) + + args.profiler_options = ARGS.profiler_options + + do_train(args) diff --git a/examples/model_compression/distill_lstm/README.md b/examples/model_compression/distill_lstm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..56e63c435339821e886d2457b65fd9cc06698efe --- /dev/null +++ b/examples/model_compression/distill_lstm/README.md @@ -0,0 +1,194 @@ +# Distilling Knowledge From Fine-tuned BERT into Bi-LSTM + +以下是本例的简要目录结构及说明: +``` +. 
+├── small.py # 小模型结构以及对小模型单独训练的脚本 +├── bert_distill.py # 用教师模型BERT蒸馏学生模型的蒸馏脚本 +├── data.py # 定义了dataloader等数据读取接口 +├── utils.py # 定义了将样本转成id的转换接口 +├── args.py # 参数配置脚本 +└── README.md # 文档,本文件 +``` + +## 简介 +本目录下的实验是将特定任务下BERT模型的知识蒸馏到基于Bi-LSTM的小模型中,主要参考论文 [Distilling Task-Specific Knowledge from BERT into Simple Neural Networks](https://arxiv.org/abs/1903.12136)实现。 + +在模型蒸馏中,较大的模型(在本例中是BERT)通常被称为教师模型,较小的模型(在本例中是Bi-LSTM)通常被称为学生模型。知识的蒸馏通常是通过模型学习蒸馏相关的损失函数实现,在本实验中,损失函数是均方误差损失函数,传入函数的两个参数分别是学生模型的输出和教师模型的输出。 + +在[论文](https://arxiv.org/abs/1903.12136)的模型蒸馏阶段,作者为了能让教师模型表达出更多的知识供学生模型学习,对训练数据进行了数据增强。作者使用了三种数据增强方式,分别是: + +1. Masking,即以一定的概率将原数据中的word token替换成`[MASK]`; + +2. POS—guided word replacement,即以一定的概率将原数据中的词用与其有相同POS tag的词替换; + +3. n-gram sampling,即以一定的概率,从每条数据中采样n-gram,其中n的范围可通过人工设置。 + +通过数据增强,可以产生更多无标签的训练数据,在训练过程中,学生模型可借助教师模型的“暗知识”,在更大的数据集上进行训练,产生更好的蒸馏效果。需要指出的是,实验只使用了第1和第3种数据增强方式。 +在英文数据集任务上,本文使用了Google News语料[预训练的Word Embedding](https://code.google.com/archive/p/word2vec/)初始化小模型的Embedding层。 + +本实验分为三个训练过程:在特定任务上对BERT的fine-tuning、在特定任务上对基于Bi-LSTM的小模型的训练(用于评价蒸馏效果)、将BERT模型的知识蒸馏到基于Bi-LSTM的小模型上。 + +## 数据、预训练模型介绍及获取 + +本实验使用GLUE中的SST-2、QQP以及中文情感分类数据集ChnSentiCorp中的训练集作为训练语料,用数据集中的验证集评估模型的效果。运行本目录下的实验,数据集会被自动下载到`paddlenlp.utils.env.DATA_HOME` 路径下,例如在linux系统下,例如对于GLUE中的QQP数据集,默认存储路径是`~/.paddlenlp/datasets/glue/QQP`,对于ChnSentiCorp数据集,则会下载到 `~/.paddlenlp/datasets/chnsenticorp`。 + +对于BERT的fine-tuning任务,本实验中使用了预训练模型`bert-bas-uncased`、`bert-wwm-ext-chinese`、`bert-base-chinese`。同样,这几个模型在训练时会被自动下载到`paddlenlp.utils.env.MODEL_HOME`路径下。例如,对于`bert-base-uncased`模型,在linux系统下,会被下载到`~/.paddlenlp/models/bert-base-uncased`下。 + +在中文数据集上的小模型训练的输入利用jieba分词,其中词表同本repo下[文本分类项目](../../text_classification/rnn)的词表,可通过运行以下命令进行下载: + +```shell +wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt +``` + +为了节省显存和运行时间,可以对ChnSentiCorp中未出现的词先进行过滤,并将最后的词表文件名和词表大小配置在下面的参数中。 + + +## 蒸馏实验过程 +### 训练BERT fine-tuning模型 +训练BERT的fine-tuning模型,可以去本repo下example中的[glue目录](../../benchmark/glue)下。关于glue的更多详细说明,可见glue目录下的README文档。 + +以GLUE的SST-2任务为例,调用BERT fine-tune的训练脚本,配置如下的参数,训练SST-2任务: + +```shell +cd ../../benchmark/glue +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=SST-2 +python -u ./run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 128 \ + --learning_rate 3e-5 \ + --num_train_epochs 3 \ + --logging_steps 10 \ + --save_steps 10 \ + --output_dir ../model_compression/distill_lstm/pretrained_models/$TASK_NAME/ \ + --device gpu \ + +``` + +如果需要训练基于ChnSentiCorp数据集的BERT finetuning模型,可以进入[文本分类目录](../../text_classification/pretrained_models)下,将预训练模型改成BERT,并基于bert-base-chinese和bert-wwm-ext-chinese模型进行fine-tuning训练。 + +训练完成之后,可将训练效果最好的模型保存在本项目下的`pretrained_models/$TASK_NAME/`下。模型目录下有`model_config.json`, `model_state.pdparams`, `tokenizer_config.json`及`vocab.txt`这几个文件。 + + +### 训练小模型 + +尝试运行下面的脚本可以分别基于ChnSentiCorp、SST-2、QQP数据集对基于BiLSTM的小模型进行训练。 + + +```shell +CUDA_VISIBLE_DEVICES=0 python small.py \ + --task_name chnsenticorp \ + --max_epoch 20 \ + --vocab_size 1256608 \ + --batch_size 64 \ + --model_name bert-wwm-ext-chinese \ + --optimizer adam \ + --lr 3e-4 \ + --dropout_prob 0.2 \ + --vocab_path senta_word_dict.txt \ + --save_steps 10000 \ + --output_dir small_models/chnsenticorp/ + +``` + +```shell +CUDA_VISIBLE_DEVICES=0 python small.py \ + --task_name sst-2 \ + --vocab_size 30522 \ + --max_epoch 10 \ + --batch_size 64 \ + --lr 1.0 \ + --dropout_prob 0.4 \ + --output_dir small_models/SST-2 \ + --save_steps 10000 \ + --embedding_name 
w2v.google_news.target.word-word.dim300.en + +``` + +```shell +CUDA_VISIBLE_DEVICES=0 python small.py \ + --task_name qqp \ + --vocab_size 30522 \ + --max_epoch 35 \ + --batch_size 256 \ + --lr 2.0 \ + --dropout_prob 0.4 \ + --output_dir small_models/QQP \ + --save_steps 10000 \ + --embedding_name w2v.google_news.target.word-word.dim300.en + +``` + +### 蒸馏模型 +这一步是将教师模型BERT的知识蒸馏到基于BiLSTM的学生模型中,可以运行下面的命令分别基于ChnSentiCorp、SST-2、QQP数据集对基于BiLSTM的学生模型进行蒸馏。 + +```shell +CUDA_VISIBLE_DEVICES=0 python bert_distill.py \ + --task_name chnsenticorp \ + --vocab_size 1256608 \ + --max_epoch 6 \ + --lr 1.0 \ + --dropout_prob 0.1 \ + --batch_size 64 \ + --model_name bert-wwm-ext-chinese \ + --teacher_dir pretrained_models/chnsenticorp/best_bert_wwm_ext_model_880 \ + --vocab_path senta_word_dict.txt \ + --output_dir distilled_models/chnsenticorp \ + --save_steps 10000 \ + +``` + +```shell +CUDA_VISIBLE_DEVICES=0 python bert_distill.py \ + --task_name sst-2 \ + --vocab_size 30522 \ + --max_epoch 6 \ + --lr 1.0 \ + --task_name sst-2 \ + --dropout_prob 0.2 \ + --batch_size 128 \ + --model_name bert-base-uncased \ + --output_dir distilled_models/SST-2 \ + --teacher_dir pretrained_models/SST-2/best_model_610 \ + --save_steps 10000 \ + --embedding_name w2v.google_news.target.word-word.dim300.en \ + +``` + +```shell +CUDA_VISIBLE_DEVICES=0 python bert_distill.py \ + --task_name qqp \ + --vocab_size 30522 \ + --max_epoch 6 \ + --lr 1.0 \ + --dropout_prob 0.2 \ + --batch_size 256 \ + --model_name bert-base-uncased \ + --n_iter 10 \ + --output_dir distilled_models/QQP \ + --teacher_dir pretrained_models/QQP/best_model_17000 \ + --save_steps 10000 \ + --embedding_name w2v.google_news.target.word-word.dim300.en \ + +``` + +各参数的具体说明请参阅 `args.py` ,注意在训练不同任务时,需要调整对应的超参数。 + + +## 蒸馏实验结果 +本蒸馏实验基于GLUE的SST-2、QQP、中文情感分类ChnSentiCorp数据集。实验效果均使用每个数据集的验证集(dev)进行评价,评价指标是准确率(acc),其中QQP中包含f1值。利用基于BERT的教师模型去蒸馏基于Bi-LSTM的学生模型,对比Bi-LSTM小模型单独训练,在SST-2、QQP、ChnSentiCorp(中文情感分类)任务上分别有3.3%、1.9%、1.4%的提升。 + +| Model | SST-2(dev acc) | QQP(dev acc/f1) | ChnSentiCorp(dev acc) | ChnSentiCorp(dev acc) | +| ----------------- | ----------------- | -------------------------- | --------------------- | --------------------- | +| Teacher model | bert-base-uncased | bert-base-uncased | bert-base-chinese | bert-wwm-ext-chinese | +| BERT-base | 0.930046 | 0.905813(acc)/0.873472(f1) | 0.951667 | 0.955000 | +| Bi-LSTM | 0.854358 | 0.856616(acc)/0.799682(f1) | 0.920000 | 0.920000 | +| Distilled Bi-LSTM | 0.887615 | 0.875216(acc)/0.831254(f1) | 0.932500 | 0.934167 | + +## 参考文献 + +Tang R, Lu Y, Liu L, Mou L, Vechtomova O, Lin J. [Distilling Task-Specific Knowledge from BERT into Simple Neural Networks](https://arxiv.org/abs/1903.12136)[J]. arXiv preprint arXiv:1903.12136, 2019. diff --git a/examples/model_compression/distill_lstm/args.py b/examples/model_compression/distill_lstm/args.py new file mode 100644 index 0000000000000000000000000000000000000000..07fd4b1bb1914d386f6e353774ff85ad7a939b3b --- /dev/null +++ b/examples/model_compression/distill_lstm/args.py @@ -0,0 +1,108 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse + +from paddlenlp.utils.env import MODEL_HOME + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + + parser.add_argument("--task_name", type=str, default="sst-2", help="Task name.") + + parser.add_argument( + "--optimizer", type=str, default="adadelta", help="Optimizer to use, only support[adam|adadelta]." + ) + + parser.add_argument("--lr", type=float, default=1.0, help="Learning rate for optimizer.") + + parser.add_argument("--num_layers", type=int, default=1, help="Layers number of LSTM.") + + parser.add_argument("--emb_dim", type=int, default=300, help="Embedding dim.") + + parser.add_argument("--output_dim", type=int, default=2, help="Number of classifications.") + + parser.add_argument("--hidden_size", type=int, default=300, help="Hidden size of LSTM") + + parser.add_argument("--batch_size", type=int, default=64, help="Batch size of training.") + + parser.add_argument("--max_epoch", type=int, default=12, help="Max number of epochs for training.") + + parser.add_argument("--max_seq_length", type=int, default=128, help="Max length for sentence.") + + parser.add_argument( + "--n_iter", type=int, default=20, help="Number of iterations for one sample in data augmentation." + ) + + parser.add_argument("--dropout_prob", type=float, default=0.0, help="Drop probability.") + + parser.add_argument("--init_scale", type=float, default=0.1, help="Init scale for parameter") + + parser.add_argument("--log_freq", type=int, default=10, help="The frequency to print evaluation logs.") + + parser.add_argument("--save_steps", type=int, default=100, help="The frequency to print evaluation logs.") + + parser.add_argument("--padding_idx", type=int, default=0, help="The padding index of embedding.") + + parser.add_argument( + "--model_name", + type=str, + default="bert-base-uncased", + help="Teacher model's name. Maybe its tokenizer would be loaded and used by small model.", + ) + + parser.add_argument("--teacher_dir", type=str, help="Teacher model's directory.") + + parser.add_argument( + "--vocab_path", + type=str, + default=os.path.join(MODEL_HOME, "bert-base-uncased", "bert-base-uncased-vocab.txt"), + help="Student model's vocab path.", + ) + + parser.add_argument("--output_dir", type=str, default="models", help="Directory to save models .") + + parser.add_argument( + "--init_from_ckpt", type=str, default=None, help="The path of layer and optimizer to be loaded." + ) + + parser.add_argument( + "--whole_word_mask", + action="store_true", + help="If True, use whole word masking method in data augmentation in distilling.", + ) + + parser.add_argument("--embedding_name", type=str, default=None, help="The name of pretrained word embedding.") + + parser.add_argument("--vocab_size", type=int, default=10000, help="Student model's vocab size.") + + parser.add_argument( + "--alpha", type=float, default=0.0, help="Weight balance between cross entropy loss and mean square loss." 
+ ) + + parser.add_argument( + "--seed", + type=int, + default=2021, + help="Random seed for model parameter initialization, data augmentation and so on.", + ) + + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference." + ) + + args = parser.parse_args() + return args diff --git a/examples/model_compression/distill_lstm/bert_distill.py b/examples/model_compression/distill_lstm/bert_distill.py new file mode 100644 index 0000000000000000000000000000000000000000..9f253a31b8f568e884da5c8185eb9817ca052d9a --- /dev/null +++ b/examples/model_compression/distill_lstm/bert_distill.py @@ -0,0 +1,172 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import paddle +import paddle.nn as nn +from args import parse_args +from data import create_distill_loader +from paddle.metric import Accuracy +from small import BiLSTM + +from paddlenlp.metrics import AccuracyAndF1 +from paddlenlp.transformers import BertForSequenceClassification + +METRIC_CLASSES = {"sst-2": Accuracy, "qqp": AccuracyAndF1, "chnsenticorp": Accuracy} + + +class TeacherModel(object): + def __init__(self, teacher_dir): + self.model = BertForSequenceClassification.from_pretrained(teacher_dir) + self.model.eval() + + +def evaluate(task_name, model, metric, data_loader): + model.eval() + metric.reset() + for i, batch in enumerate(data_loader): + if task_name == "qqp": + _, _, student_input_ids_1, seq_len_1, student_input_ids_2, seq_len_2, labels = batch + logits = model(student_input_ids_1, seq_len_1, student_input_ids_2, seq_len_2) + else: + _, _, student_input_ids, seq_len, labels = batch + logits = model(student_input_ids, seq_len) + + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + else: + print("acc: %s, " % (res), end="") + model.train() + + +def do_train(agrs): + paddle.set_device(args.device) + train_data_loader, dev_data_loader = create_distill_loader( + args.task_name, + model_name=args.model_name, + vocab_path=args.vocab_path, + batch_size=args.batch_size, + max_seq_length=args.max_seq_length, + n_iter=args.n_iter, + whole_word_mask=args.whole_word_mask, + seed=args.seed, + ) + + model = BiLSTM( + args.emb_dim, + args.hidden_size, + args.vocab_size, + args.output_dim, + args.vocab_path, + args.padding_idx, + args.num_layers, + args.dropout_prob, + args.init_scale, + args.embedding_name, + ) + + if args.optimizer == "adadelta": + optimizer = paddle.optimizer.Adadelta(learning_rate=args.lr, rho=0.95, parameters=model.parameters()) + else: + optimizer = paddle.optimizer.Adam(learning_rate=args.lr, parameters=model.parameters()) + + ce_loss = nn.CrossEntropyLoss() + mse_loss = nn.MSELoss() + + metric_class = METRIC_CLASSES[args.task_name] + metric = 
metric_class() + + teacher = TeacherModel(args.teacher_dir) + + print("Start to distill student model.") + + if args.init_from_ckpt: + model.set_state_dict(paddle.load(args.init_from_ckpt + ".pdparams")) + optimizer.set_state_dict(paddle.load(args.init_from_ckpt + ".pdopt")) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.max_epoch): + model.train() + for i, batch in enumerate(train_data_loader): + global_step += 1 + if args.task_name == "qqp": + ( + bert_input_ids, + bert_segment_ids, + student_input_ids_1, + seq_len_1, + student_input_ids_2, + seq_len_2, + labels, + ) = batch + else: + bert_input_ids, bert_segment_ids, student_input_ids, seq_len, labels = batch + + # Calculate teacher model's forward. + with paddle.no_grad(): + teacher_logits = teacher.model(bert_input_ids, bert_segment_ids) + + # Calculate student model's forward. + if args.task_name == "qqp": + logits = model(student_input_ids_1, seq_len_1, student_input_ids_2, seq_len_2) + else: + logits = model(student_input_ids, seq_len) + + loss = args.alpha * ce_loss(logits, labels) + (1 - args.alpha) * mse_loss(logits, teacher_logits) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + + if global_step % args.log_freq == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.4f step/s" + % (global_step, epoch, i, loss, args.log_freq / (time.time() - tic_train)) + ) + tic_eval = time.time() + evaluate(args.task_name, model, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + tic_train = time.time() + + if global_step % args.save_steps == 0: + paddle.save( + model.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdparams") + ) + paddle.save( + optimizer.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdopt") + ) + + +if __name__ == "__main__": + args = parse_args() + print(args) + paddle.seed(args.seed) + do_train(args) diff --git a/examples/model_compression/distill_lstm/data.py b/examples/model_compression/distill_lstm/data.py new file mode 100644 index 0000000000000000000000000000000000000000..dec2b358260bc02e9047cee95b294873f8237256 --- /dev/null +++ b/examples/model_compression/distill_lstm/data.py @@ -0,0 +1,322 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
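+#
+# This module builds the data pipelines for the distillation example:
+#   * load_vocab / ngram_sampling: vocabulary loading and n-gram sampling helpers;
+#   * apply_data_augmentation / apply_data_augmentation_for_cn: token masking plus
+#     n-gram sampling, generating extra unlabeled samples for distillation;
+#   * create_data_loader_for_small_model / create_distill_loader /
+#     create_pair_loader_for_small_model: DataLoaders feeding the Bi-LSTM student
+#     and, for distillation, the BERT teacher as well.
+#
+# Illustrative call (hypothetical values, mirroring the README's chnsenticorp setup):
+#   train_loader, dev_loader = create_distill_loader(
+#       "chnsenticorp", model_name="bert-wwm-ext-chinese",
+#       vocab_path="senta_word_dict.txt", batch_size=64)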
+ +from functools import partial + +import jieba +import numpy as np +import paddle +from utils import ( + convert_example_for_distill, + convert_example_for_lstm, + convert_pair_example, +) + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import BertTokenizer + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = {} + with open(vocab_file, "r", encoding="utf-8") as reader: + tokens = reader.readlines() + for index, token in enumerate(tokens): + token = token.rstrip("\n").split("\t")[0] + vocab[token] = index + return vocab + + +def ngram_sampling(words, words_2=None, p_ng=0.25, ngram_range=(2, 6)): + if np.random.rand() < p_ng: + ngram_len = np.random.randint(ngram_range[0], ngram_range[1] + 1) + ngram_len = min(ngram_len, len(words)) + start = np.random.randint(0, len(words) - ngram_len + 1) + words = words[start : start + ngram_len] + if words_2: + words_2 = words_2[start : start + ngram_len] + return words if not words_2 else (words, words_2) + + +def flatten(list_of_list): + final_list = [] + for each_list in list_of_list: + final_list += each_list + return final_list + + +def apply_data_augmentation( + data, task_name, tokenizer, n_iter=20, p_mask=0.1, p_ng=0.25, ngram_range=(2, 6), whole_word_mask=False, seed=0 +): + """ + Data Augmentation contains Masking and n-gram sampling. Tokenization and + Masking are performed at the same time, so that the masked token can be + directly replaced by `mask_token`, after what sampling is performed. + """ + + def _data_augmentation(data, tokenized_list, whole_word_mask=whole_word_mask): + # 1. Masking + words = [] + if not whole_word_mask: + words = [tokenizer.mask_token if np.random.rand() < p_mask else word for word in tokenized_list] + else: + for word in data.split(): + words += [[tokenizer.mask_token]] if np.random.rand() < p_mask else [tokenizer.tokenize(word)] + # 2. N-gram sampling + words = ngram_sampling(words, p_ng=p_ng, ngram_range=ngram_range) + words = flatten(words) if isinstance(words[0], list) else words + return words + + np.random.seed(seed) + new_data = [] + for example in data: + if task_name == "qqp": + data_list = tokenizer.tokenize(example["sentence1"]) + data_list_2 = tokenizer.tokenize(example["sentence2"]) + new_data.append({"sentence1": data_list, "sentence2": data_list_2, "labels": example["labels"]}) + else: + data_list = tokenizer.tokenize(example["sentence"]) + new_data.append({"sentence": data_list, "labels": example["labels"]}) + + for example in data: + for _ in range(n_iter): + if task_name == "qqp": + words = _data_augmentation(example["sentence1"], data_list) + words_2 = _data_augmentation(example["sentence2"], data_list_2) + new_data.append({"sentence1": words, "sentence2": words_2, "labels": example["labels"]}) + else: + words = _data_augmentation(example["sentence"], data_list) + new_data.append({"sentence": words, "labels": example["labels"]}) + return new_data + + +def apply_data_augmentation_for_cn( + data, tokenizer, vocab, n_iter=20, p_mask=0.1, p_ng=0.25, ngram_range=(2, 10), seed=0 +): + """ + Because BERT and jieba have different `tokenize` function, it returns + jieba_tokenizer(example['text'], bert_tokenizer(example['text']) and + example['label]) for each example in data. 
+ jieba tokenization and Masking are performed at the same time, so that the + masked token can be directly replaced by `mask_token`, and other tokens + could be tokenized by BERT's tokenizer, from which tokenized example for + student model and teacher model would get at the same time. + """ + np.random.seed(seed) + new_data = [] + + for example in data: + if not example["text"]: + continue + text_tokenized = list(jieba.cut(example["text"])) + lstm_tokens = text_tokenized + bert_tokens = tokenizer.tokenize(example["text"]) + new_data.append({"lstm_tokens": lstm_tokens, "bert_tokens": bert_tokens, "label": example["label"]}) + for _ in range(n_iter): + # 1. Masking + lstm_tokens, bert_tokens = [], [] + for word in text_tokenized: + if np.random.rand() < p_mask: + lstm_tokens.append([vocab.unk_token]) + bert_tokens.append([tokenizer.unk_token]) + else: + lstm_tokens.append([word]) + bert_tokens.append(tokenizer.tokenize(word)) + # 2. N-gram sampling + lstm_tokens, bert_tokens = ngram_sampling(lstm_tokens, bert_tokens, p_ng, ngram_range) + lstm_tokens, bert_tokens = flatten(lstm_tokens), flatten(bert_tokens) + if lstm_tokens and bert_tokens: + new_data.append({"lstm_tokens": lstm_tokens, "bert_tokens": bert_tokens, "label": example["label"]}) + return new_data + + +def create_data_loader_for_small_model( + task_name, vocab_path, model_name=None, batch_size=64, max_seq_length=128, shuffle=True +): + """Data loader for bi-lstm, not bert.""" + if task_name == "chnsenticorp": + train_ds, dev_ds = load_dataset(task_name, splits=["train", "dev"]) + else: + train_ds, dev_ds = load_dataset("glue", task_name, splits=["train", "dev"]) + if task_name == "chnsenticorp": + vocab = Vocab.load_vocabulary( + vocab_path, + unk_token="[UNK]", + pad_token="[PAD]", + bos_token=None, + eos_token=None, + ) + pad_val = vocab["[PAD]"] + + else: + vocab = BertTokenizer.from_pretrained(model_name) + pad_val = vocab.pad_token_id + + trans_fn = partial( + convert_example_for_lstm, task_name=task_name, vocab=vocab, max_seq_length=max_seq_length, is_test=False + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_val), Stack(dtype="int64"), Stack(dtype="int64") # input_ids # seq len # label + ): fn(samples) + + train_ds = train_ds.map(trans_fn, lazy=True) + dev_ds = dev_ds.map(trans_fn, lazy=True) + + train_data_loader, dev_data_loader = create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle) + + return train_data_loader, dev_data_loader + + +def create_distill_loader( + task_name, + model_name, + vocab_path, + batch_size=64, + max_seq_length=128, + shuffle=True, + n_iter=20, + whole_word_mask=False, + seed=0, +): + """ + Returns batch data for bert and small model. + Bert and small model have different input representations. 
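+    For single-sentence tasks each batch is (bert_input_ids, bert_segment_ids,
+    small_input_ids, seq_len, label); for QQP the small-model input_ids and
+    seq_len appear twice, once per sentence.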
+ """ + tokenizer = BertTokenizer.from_pretrained(model_name) + if task_name == "chnsenticorp": + train_ds, dev_ds = load_dataset(task_name, splits=["train", "dev"]) + vocab = Vocab.load_vocabulary( + vocab_path, + unk_token="[UNK]", + pad_token="[PAD]", + bos_token=None, + eos_token=None, + ) + pad_val = vocab["[PAD]"] + data_aug_fn = partial( + apply_data_augmentation_for_cn, tokenizer=tokenizer, vocab=vocab, n_iter=n_iter, seed=seed + ) + else: + train_ds, dev_ds = load_dataset("glue", task_name, splits=["train", "dev"]) + vocab = tokenizer + pad_val = tokenizer.pad_token_id + data_aug_fn = partial( + apply_data_augmentation, + task_name=task_name, + tokenizer=tokenizer, + n_iter=n_iter, + whole_word_mask=whole_word_mask, + seed=seed, + ) + train_ds = train_ds.map(data_aug_fn, batched=True) + print("Data augmentation has been applied.") + + trans_fn = partial( + convert_example_for_distill, + task_name=task_name, + tokenizer=tokenizer, + label_list=train_ds.label_list, + max_seq_length=max_seq_length, + vocab=vocab, + ) + + trans_fn_dev = partial( + convert_example_for_distill, + task_name=task_name, + tokenizer=tokenizer, + label_list=train_ds.label_list, + max_seq_length=max_seq_length, + vocab=vocab, + is_tokenized=False, + ) + + if task_name == "qqp": + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # bert input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # bert segment + Pad(axis=0, pad_val=pad_val), # small input_ids + Stack(dtype="int64"), # small seq len + Pad(axis=0, pad_val=pad_val), # small input_ids + Stack(dtype="int64"), # small seq len + Stack(dtype="int64"), # small label + ): fn(samples) + else: + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # bert input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # bert segment + Pad(axis=0, pad_val=pad_val), # small input_ids + Stack(dtype="int64"), # small seq len + Stack(dtype="int64"), # small label + ): fn(samples) + + train_ds = train_ds.map(trans_fn, lazy=True) + dev_ds = dev_ds.map(trans_fn_dev, lazy=True) + train_data_loader, dev_data_loader = create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle) + return train_data_loader, dev_data_loader + + +def create_pair_loader_for_small_model( + task_name, model_name, vocab_path, batch_size=64, max_seq_length=128, shuffle=True, is_test=False +): + """Only support QQP now.""" + tokenizer = BertTokenizer.from_pretrained(model_name) + train_ds, dev_ds = load_dataset("glue", task_name, splits=["train", "dev"]) + vocab = Vocab.load_vocabulary( + vocab_path, + unk_token="[UNK]", + pad_token="[PAD]", + bos_token=None, + eos_token=None, + ) + + trans_func = partial( + convert_pair_example, + task_name=task_name, + vocab=tokenizer, + is_tokenized=False, + max_seq_length=max_seq_length, + is_test=is_test, + ) + train_ds = train_ds.map(trans_func, lazy=True) + dev_ds = dev_ds.map(trans_func, lazy=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab["[PAD]"]), # input + Stack(), # length + Pad(axis=0, pad_val=vocab["[PAD]"]), # input + Stack(), # length + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_data_loader, dev_data_loader = create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle) + return train_data_loader, dev_data_loader + + +def create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle=True): + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, 
batch_size=batch_size, shuffle=shuffle) + + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=batch_size, shuffle=False) + + train_data_loader = paddle.io.DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + dev_data_loader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + return train_data_loader, dev_data_loader diff --git a/examples/model_compression/distill_lstm/small.py b/examples/model_compression/distill_lstm/small.py new file mode 100644 index 0000000000000000000000000000000000000000..92681bd039912f1b2daba0aff5ddea1103e381c7 --- /dev/null +++ b/examples/model_compression/distill_lstm/small.py @@ -0,0 +1,211 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import paddle +import paddle.nn as nn +import paddle.nn.initializer as I +from args import parse_args +from data import create_data_loader_for_small_model, create_pair_loader_for_small_model +from paddle.metric import Accuracy + +from paddlenlp.embeddings import TokenEmbedding +from paddlenlp.metrics import AccuracyAndF1 + +METRIC_CLASSES = {"sst-2": Accuracy, "qqp": AccuracyAndF1, "chnsenticorp": Accuracy} + + +class BiLSTM(nn.Layer): + def __init__( + self, + embed_dim, + hidden_size, + vocab_size, + output_dim, + vocab_path, + padding_idx=0, + num_layers=1, + dropout_prob=0.0, + init_scale=0.1, + embedding_name=None, + ): + super(BiLSTM, self).__init__() + if embedding_name is not None: + self.embedder = TokenEmbedding( + embedding_name, extended_vocab_path=vocab_path, keep_extended_vocab_only=True + ) + embed_dim = self.embedder.embedding_dim + else: + self.embedder = nn.Embedding(vocab_size, embed_dim, padding_idx) + + self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers, "bidirectional", dropout=dropout_prob) + + self.fc = nn.Linear( + hidden_size * 2, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.fc_1 = nn.Linear( + hidden_size * 8, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.output_layer = nn.Linear( + hidden_size, + output_dim, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + def forward(self, x_1, seq_len_1, x_2=None, seq_len_2=None): + x_embed_1 = self.embedder(x_1) + lstm_out_1, (hidden_1, _) = self.lstm(x_embed_1, sequence_length=seq_len_1) + out_1 = paddle.concat((hidden_1[-2, :, :], hidden_1[-1, :, :]), axis=1) + if x_2 is not None: + x_embed_2 = self.embedder(x_2) + lstm_out_2, (hidden_2, _) = self.lstm(x_embed_2, sequence_length=seq_len_2) + out_2 = paddle.concat((hidden_2[-2, :, :], hidden_2[-1, :, :]), axis=1) + out = paddle.concat(x=[out_1, out_2, out_1 + out_2, paddle.abs(out_1 - out_2)], axis=1) + out = paddle.tanh(self.fc_1(out)) + else: + out = 
paddle.tanh(self.fc(out_1)) + logits = self.output_layer(out) + + return logits + + +def evaluate(task_name, model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + if task_name == "qqp": + input_ids_1, seq_len_1, input_ids_2, seq_len_2, labels = batch + logits = model(input_ids_1, seq_len_1, input_ids_2, seq_len_2) + else: + input_ids, seq_len, labels = batch + logits = model(input_ids, seq_len) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + else: + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + return res[0] if isinstance(metric, AccuracyAndF1) else res + + +def do_train(args): + paddle.set_device(args.device) + metric_class = METRIC_CLASSES[args.task_name] + metric = metric_class() + if args.task_name == "qqp": + train_data_loader, dev_data_loader = create_pair_loader_for_small_model( + task_name=args.task_name, + vocab_path=args.vocab_path, + model_name=args.model_name, + batch_size=args.batch_size, + ) + else: + train_data_loader, dev_data_loader = create_data_loader_for_small_model( + task_name=args.task_name, + vocab_path=args.vocab_path, + model_name=args.model_name if args.task_name == "sst-2" else None, + batch_size=args.batch_size, + ) + + model = BiLSTM( + args.emb_dim, + args.hidden_size, + args.vocab_size, + args.output_dim, + args.vocab_path, + args.padding_idx, + args.num_layers, + args.dropout_prob, + args.init_scale, + args.embedding_name, + ) + + loss_fct = nn.CrossEntropyLoss() + + if args.optimizer == "adadelta": + optimizer = paddle.optimizer.Adadelta(learning_rate=args.lr, rho=0.95, parameters=model.parameters()) + else: + optimizer = paddle.optimizer.Adam(learning_rate=args.lr, parameters=model.parameters()) + + if args.init_from_ckpt: + model.set_state_dict(paddle.load(args.init_from_ckpt + ".pdparams")) + optimizer.set_state_dict(paddle.load(args.init_from_ckpt + ".pdopt")) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.max_epoch): + for i, batch in enumerate(train_data_loader): + global_step += 1 + if args.task_name == "qqp": + input_ids_1, seq_len_1, input_ids_2, seq_len_2, labels = batch + logits = model(input_ids_1, seq_len_1, input_ids_2, seq_len_2) + else: + input_ids, seq_len, labels = batch + logits = model(input_ids, seq_len) + + loss = loss_fct(logits, labels) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + + if global_step % args.log_freq == 0: + with paddle.no_grad(): + print( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.4f step/s" + % (global_step, epoch, i, loss, args.log_freq / (time.time() - tic_train)) + ) + tic_eval = time.time() + + evaluate(args.task_name, model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + tic_train = time.time() + + if global_step % args.save_steps == 0: + paddle.save( + model.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdparams") + ) + paddle.save( + optimizer.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdopt") + ) + + +if __name__ == "__main__": + args = parse_args() + print(args) + paddle.seed(args.seed) + 
do_train(args) diff --git a/examples/model_compression/distill_lstm/utils.py b/examples/model_compression/distill_lstm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..0243d97a5a64d2dea77711a5702e62f930f4358c --- /dev/null +++ b/examples/model_compression/distill_lstm/utils.py @@ -0,0 +1,117 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import jieba + +import numpy as np + + +def convert_example_for_lstm(example, task_name, vocab, is_tokenized=False, max_seq_length=128, is_test=False): + """convert a example for lstm's input""" + input_ids = [] + if task_name == "chnsenticorp": + if is_tokenized: + lstm_tokens = example["lstm_tokens"][:max_seq_length] + input_ids = [vocab[token] for token in lstm_tokens] + else: + tokenized_text = list(jieba.cut(example["text"]))[:max_seq_length] + input_ids = vocab[tokenized_text] + else: + if is_tokenized: + tokens = example["sentence"][:max_seq_length] + else: + tokens = vocab.tokenize(example["sentence"])[:max_seq_length] + input_ids = vocab.convert_tokens_to_ids(tokens) + + valid_length = np.array(len(input_ids), dtype="int64") + if not is_test: + label = ( + np.array(example["label"], dtype="int64") + if task_name == "chnsenticorp" + else np.array(example["labels"], dtype="int64") + ) + return input_ids, valid_length, label + return input_ids, valid_length + + +def convert_pair_example(example, task_name, vocab, is_tokenized=True, max_seq_length=128, is_test=False): + seq1 = convert_example_for_lstm( + {"sentence": example["sentence1"], "labels": example["labels"]}, + task_name, + vocab, + is_tokenized, + max_seq_length, + is_test, + )[:2] + + seq2 = convert_example_for_lstm( + {"sentence": example["sentence2"], "labels": example["labels"]}, + task_name, + vocab, + is_tokenized, + max_seq_length, + is_test, + ) + pair_features = seq1 + seq2 + + return pair_features + + +def convert_example_for_distill( + example, task_name, tokenizer, label_list, max_seq_length, vocab, is_tokenized=True, is_test=False +): + bert_features = convert_example_for_bert( + example, + tokenizer=tokenizer, + label_list=label_list, + is_tokenized=is_tokenized, + max_seq_length=max_seq_length, + is_test=is_test, + ) + if task_name == "qqp": + small_features = convert_pair_example(example, task_name, vocab, is_tokenized, max_seq_length, is_test) + else: + small_features = convert_example_for_lstm(example, task_name, vocab, is_tokenized, max_seq_length, is_test) + return bert_features[:2] + small_features + + +def convert_example_for_bert(example, tokenizer, label_list, is_tokenized=False, max_seq_length=512, is_test=False): + """convert a example for bert's input""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] if "labels" in example else example["label"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if "sentence1" in example: + 
example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_seq_len=max_seq_length, + is_split_into_words=is_tokenized, + ) + else: + if "sentence" in example: + text = example["sentence"] + elif "text" in example: + text = example["text"] + else: + text = example["bert_tokens"] + example = tokenizer(text, max_seq_len=max_seq_length, is_split_into_words=is_tokenized) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] diff --git a/examples/model_compression/minilmv2/README.md b/examples/model_compression/minilmv2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..62bfc39fa1d48b8fb862bb5fffc691a602744d6f --- /dev/null +++ b/examples/model_compression/minilmv2/README.md @@ -0,0 +1,124 @@ +# MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers + +以下是本例的简要目录结构及说明: +``` +. +├── general_distill.py # 通用蒸馏脚本 +├── run_clue.py # 在下游任务上的微调脚本 +└── README.md # 文档,本文件 +``` +## 简介 +本目录下的实验主要参考论文[《MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers》](https://arxiv.org/abs/2012.15828)实现。 + +MiniLMv2也是从层数深的Transformer类模型到层数较浅的Transformer类模型的蒸馏策略。它的优势是只需要取教师模型和学生模型中的各一层进行蒸馏训练,而不像其他方法需要蒸馏更多的层,避免面对更加复杂的layer mapping问题,并且效果优于TinyBert的蒸馏策略。 + +MiniLMv2蒸馏的目标是教师模型某层的q与q, k与k, v与v的矩阵乘结果和学生模型最后一层的q与q, k与k, v与v的矩阵乘之间的kl散度loss。其中教师模型是large size时,选择实验并选取倒数某一层,当教师模型是base size时,选择最后一层进行蒸馏即可。 + +为了防止教师模型是large size时,head size与学生模型不同,蒸馏目标的shape无法匹配,MiniLMv2还需要对head进行重组,先合并再按relation_head_num重新分割head_num和head_size。 + +## 数据、预训练模型介绍及获取 + +### 数据获取 +由于本实验是通用场景下的蒸馏,因此数据和预训练类似。可以参考[NLP Chinese Corpus](https://github.com/brightmart/nlp_chinese_corpus)中提供的数据。 +数据下载完成后,需要将所有数据集整理成每行一条文本数据,再将数据切分成多个小文件,并放在一个目录下,以便使用多卡并行训练。 + +### 训练启动方式 + +假设我们把切分好的预训练数据文件都放在`${dataset}`下,那么我们可以运行如下命令用单机八卡进行预训练蒸馏: +```shell + +dataset=/PaddleNLP/dataset +output_dir=./pretrain + +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" general_distill.py \ + --student_model_type tinybert \ + --num_relation_heads 48 \ + --student_model_name_or_path tinybert-6l-768d-zh \ + --init_from_student False \ + --teacher_model_type bert \ + --teacher_model_name_or_path bert-base-chinese \ + --max_seq_length 128 \ + --batch_size 256 \ + --learning_rate 6e-4 \ + --logging_steps 20 \ + --max_steps 100000 \ + --warmup_steps 4000 \ + --save_steps 5000 \ + --teacher_layer_index 11 \ + --student_layer_index 5 \ + --weight_decay 1e-2 \ + --output_dir ${output_dir} \ + --device gpu \ + --input_dir ${dataset} \ + +``` + +其中参数释义如下: + +- `student_model_type` 学生模型的类型 +- `num_relation_heads` head重新组合之后的head数 +- `student_model_name_or_path` 学生模型的名字(需要与学生模型类型对应),或者是学生模型的路径 +- `init_from_student` 本次蒸馏的学生模型是否用`student_model_name_or_path`中的参数进行初始化,是个bool类型的参数。默认是False +- `teacher_model_type bert` 教师模型的类型 +- `teacher_model_name_or_path` 教师模型的名字 +- `max_seq_length 128` 表示最大句子长度,超过该长度将被截断。 +- `warmup_steps` 学习率warmup up的步数 +- `save_steps` 保存模型的频率 +- `teacher_layer_index`表示学生模型从教师模型学习的教师层 +- `student_layer_index` 表示学生模型从教师模型学习的学生层 +- `output_dir` 模型输出的目录 +- `device gpu` 表示运行该程序的设备,默认是gpu +- `input_dir` 预训练数据的存放地址 + + + +### 评价方法 + +假设预训练完成后的模型存储在`${pretrained_models}`下,这里也提供了我们已经预训练完成的一版[模型](https://bj.bcebos.com/paddlenlp/models/general_distill/minilmv2_6l_768d_ch.tar.gz)可供参考,模型与`tinybert-6l-768d-zh`结构相同,因此可以使用`TinyBertForSequenceClassification.from_pretrained()`对模型直接进行加载。 +本示例训练出的通用模型需要在下游任务上Fine-tuning,利用下游任务上的指标进行评价。 
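+
+下面给出一个加载该通用模型的最小示意(仅为示例:`./minilmv2_6l_768d_ch` 为假设的本地解压目录,`num_classes` 需按具体下游任务设置):
+
+```python
+from paddlenlp.transformers import TinyBertForSequenceClassification, TinyBertTokenizer
+
+# 假设模型压缩包已下载并解压到该目录(示例路径,请替换为实际路径)
+model_dir = "./minilmv2_6l_768d_ch"
+# 结构与 tinybert-6l-768d-zh 相同,因此可以直接用 TinyBert 系列接口加载
+model = TinyBertForSequenceClassification.from_pretrained(model_dir, num_classes=2)
+# 若该目录中不含 tokenizer 文件,可改用 TinyBertTokenizer.from_pretrained("tinybert-6l-768d-zh")
+tokenizer = TinyBertTokenizer.from_pretrained(model_dir)
+```
+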
+我们可以运行如下脚本在单卡上进行Fine-tuning: + +```shell + +export CUDA_VISIBLE_DEVICES="0" + +python -u ./run_clue.py \ + --model_type tinybert \ + --model_name_or_path ${pretrained_models} \ + --task_name ${TASK_NAME} \ + --max_seq_length ${max_seq_len} \ + --batch_size 16 \ + --learning_rate ${learning_rate} \ + --num_train_epochs ${num_train_epochs} \ + --logging_steps 100 \ + --seed 42 \ + --save_steps 100 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --adam_epsilon 1e-8 \ + --device gpu \ + +``` + + +其中不同的任务下,`${learning_rate}`、`${num_train_epochs}`、`${max_seq_len}`,我们推荐不同的Fine-tuning的超参数,可以参考以下配置: + +| TASK_NAME | AFQMC | TNEWS | IFLYTEK | OCNLI | CMNLI | CLUEWSC2020 | CSL | +| ---------------- | ----- | ----- | ------- | ----- | ----- | ----------- | ---- | +| learning_rate | 2e-5 | 2e-5 | 2e-5 | 3e-5 | 3e-5 | 1e-5 | 1e-5 | +| num_train_epochs | 3 | 3 | 6 | 6 | 3 | 50 | 8 | +| max_seq_len | 128 | 128 | 128 | 128 | 128 | 128 | 256 | + + +### 蒸馏实验结果 + +本示例选择的是CLUE中的分类任务,以`bert-base-chinese`作教师模型,利用MiniLMv2策略对6层模型进行蒸馏,可以得到的通用模型在CLUE上的指标为: + +| CLUE | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | +| ------- | ----- | ----- | ------- | ----- | ----- | ----------- | ----- | +| Acc (%) | 71.38 | 56.46 | 58.87 | 79.01 | 73.02 | 68.42 | 77.73 | + + +## 参考文献 + +Wang W, Bao H, Huang S, Dong L, Wei F. [MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers](https://arxiv.org/abs/2012.15828)[J]. arXiv preprint arXiv:2012.15828v2, 2021. diff --git a/examples/model_compression/minilmv2/general_distill.py b/examples/model_compression/minilmv2/general_distill.py new file mode 100644 index 0000000000000000000000000000000000000000..bc5636ac926fecf09fb0159c5daa8bb1c8379671 --- /dev/null +++ b/examples/model_compression/minilmv2/general_distill.py @@ -0,0 +1,407 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
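+
+# Rough sketch of the MiniLMv2 relation distillation performed below. The actual
+# computation lives in paddlenlp.transformers.distill_utils.calc_minilm_loss; this
+# comment is only illustrative, not the implementation itself:
+#   1. Take the per-head Q, K and V tensors from one chosen teacher layer
+#      (--teacher_layer_index) and one student layer (--student_layer_index,
+#      by default the student's last layer).
+#   2. Merge and re-split the head axis into `num_relation_heads` heads so that a
+#      teacher with a different head size still matches the student's shapes.
+#   3. Form self-attention relations such as softmax(Q @ Q^T / sqrt(head_size)),
+#      and likewise for K-K and V-V.
+#   4. Sum the KL divergences between teacher and student relations and normalize
+#      by num_relation_heads * seq_len * batch_size (see the training loop below).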
+ +import argparse +import os +import random +import time +from concurrent.futures import ThreadPoolExecutor + +import numpy as np +import paddle +from paddle.io import DataLoader + +from paddlenlp.data import Pad, Tuple +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + LinearDecayWithWarmup, + TinyBertForPretraining, + TinyBertModel, + TinyBertTokenizer, +) +from paddlenlp.transformers.distill_utils import calc_minilm_loss, to_distill +from paddlenlp.utils.log import logger +from paddlenlp.utils.tools import TimeCostAverage + +MODEL_CLASSES = { + "tinybert": (TinyBertForPretraining, TinyBertTokenizer), + "bert": (BertForSequenceClassification, BertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--student_model_type", + default="tinybert", + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--teacher_model_type", + default="bert", + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--student_model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--init_from_student", + type=strtobool, + default=False, + help="Whether to use the parameters of student model to initialize.", + ) + parser.add_argument( + "--teacher_model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model." + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=6e-4, type=float, help="The initial learning rate for AdamW.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=512, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--num_relation_heads", + default=64, + type=int, + help="The number of relation heads is 48 and 64 for base and large-size teacher model.", + ) + parser.add_argument( + "--teacher_layer_index", + default=11, + type=int, + help="The transformer layer index of teacher model to distill.", + ) + parser.add_argument( + "--student_layer_index", + default=5, + type=int, + help="The transformer layer index of student model to distill.", + ) + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=-1, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.01, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for AdamW optimizer.") + parser.add_argument( + "--max_steps", + default=400000, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +def create_pretraining_dataset(input_file, args, worker_init, tokenizer): + train_data = PretrainingDataset(input_file=input_file, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + # files have been sharded, no need to dispatch again + train_batch_sampler = paddle.io.BatchSampler(train_data, batch_size=args.batch_size, shuffle=True) + + # DataLoader cannot be pickled because of its place. + # If it can be pickled, use global function instead of lambda and use + # ProcessPoolExecutor instead of ThreadPoolExecutor to prefetch. 
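+    # Each sample is one tokenized line of text; Pad() pads every batch to its
+    # longest sequence using the tokenizer's pad token id.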
+ batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_data, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + worker_init_fn=worker_init, + return_list=True, + ) + return train_data_loader, input_file + + +class PretrainingDataset(paddle.io.Dataset): + def __init__(self, input_file, tokenizer, max_seq_length): + self.input_file = input_file + f = open(input_file, "r") + input_ids = [] + for i, line in enumerate(f): + line = line[:max_seq_length] + tokenized_example = tokenizer(line, max_seq_len=max_seq_length) + input_ids.append(tokenized_example["input_ids"]) + self.inputs = np.asarray(input_ids) + f.close() + + def __len__(self): + "Denotes the total number of samples" + return len(self.inputs) + + def __getitem__(self, index): + input_ids = [np.asarray(self.inputs[index])] + return input_ids + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank()) + args.student_model_type = args.student_model_type.lower() + + # For student + model_class, tokenizer_class = MODEL_CLASSES[args.student_model_type] + tokenizer = tokenizer_class.from_pretrained(args.student_model_name_or_path) + if args.init_from_student: + student = model_class.from_pretrained(args.student_model_name_or_path) + else: + tinybert = TinyBertModel(vocab_size=21128, num_hidden_layers=6) + student = model_class(tinybert) + + # For teacher + teacher_model_class, _ = MODEL_CLASSES[args.teacher_model_type] + teacher = teacher_model_class.from_pretrained(args.teacher_model_name_or_path) + pad_token_id = student.pretrained_init_configuration[args.student_model_name_or_path]["pad_token_id"] + if paddle.distributed.get_world_size() > 1: + student = paddle.DataParallel(student, find_unused_parameters=True) + teacher = paddle.DataParallel(teacher, find_unused_parameters=True) + + num_training_steps = args.max_steps + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.max_grad_norm) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
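+    # Only the student's parameters are optimized here; the teacher is run under
+    # paddle.no_grad() in the training loop and stays frozen.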
+ decay_params = [p.name for n, p in student.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.98, + epsilon=args.adam_epsilon, + parameters=student.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=clip, + ) + + pool = ThreadPoolExecutor(1) + + teacher = to_distill(teacher, return_qkv=True, layer_index=args.teacher_layer_index) + student = to_distill(student, return_qkv=True, layer_index=args.student_layer_index) + + global_step = 0 + for epoch in range(args.num_train_epochs): + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if os.path.isfile(os.path.join(args.input_dir, f)) + ] + files.sort() + num_files = len(files) + random.Random(args.seed + epoch).shuffle(files) + f_start_id = 0 + + if paddle.distributed.get_world_size() > num_files: + remainder = paddle.distributed.get_world_size() % num_files + + data_file = files[ + ( + f_start_id * paddle.distributed.get_world_size() + + paddle.distributed.get_rank() + + remainder * f_start_id + ) + % num_files + ] + else: + data_file = files[ + (f_start_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + + train_data_loader, _ = create_pretraining_dataset(data_file, args, worker_init, tokenizer) + + # TODO(guosheng): better way to process single file + single_file = True if f_start_id + 1 == len(files) else False + + for f_id in range(f_start_id, len(files)): + if not single_file and f_id == f_start_id: + continue + if paddle.distributed.get_world_size() > num_files: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank() + remainder * f_id) + % num_files + ] + else: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + dataset_future = pool.submit(create_pretraining_dataset, data_file, args, worker_init, tokenizer) + + kl_loss_func = paddle.nn.KLDivLoss("sum") + train_cost_avg = TimeCostAverage() + total_samples = 0 + batch_start = time.time() + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids = batch[0] + attention_mask = paddle.unsqueeze( + (input_ids == pad_token_id).astype(paddle.get_default_dtype()) * -1e9, axis=[1, 2] + ) + student(input_ids) + with paddle.no_grad(): + teacher(input_ids) + # Q-Q relation + q_t, q_s = teacher.outputs.q, student.outputs.q + batch_size = q_t.shape[0] + pad_seq_len = q_t.shape[2] + loss_qr = calc_minilm_loss(kl_loss_func, q_s, q_t, attention_mask, args.num_relation_heads) + del q_t, q_s + # K-K relation + k_t, k_s = teacher.outputs.k, student.outputs.k + loss_kr = calc_minilm_loss(kl_loss_func, k_s, k_t, attention_mask, args.num_relation_heads) + del k_t, k_s + # V-V relation + v_t, v_s = teacher.outputs.v, student.outputs.v + loss_vr = calc_minilm_loss(kl_loss_func, v_s, v_t, attention_mask, args.num_relation_heads) + + del v_t, v_s + + loss = loss_qr + loss_kr + loss_vr + loss /= args.num_relation_heads * pad_seq_len * batch_size + loss.backward() + + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_samples += args.batch_size + train_run_cost = time.time() - batch_start + train_cost_avg.record(train_run_cost) + if global_step % args.logging_steps == 0: + logger.info( + "global step: %d, epoch: %d, batch: %d, loss: %f, " + "lr: %f, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sequences/sec" + % 
( + global_step, + epoch, + step, + loss, + optimizer.get_lr(), + train_cost_avg.get_average(), + total_samples / args.logging_steps, + total_samples / (args.logging_steps * train_cost_avg.get_average()), + ) + ) + total_samples = 0 + train_cost_avg.reset() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = student._layers if isinstance(student, paddle.DataParallel) else student + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if global_step >= args.max_steps: + del train_data_loader + return + batch_start = time.time() + + del train_data_loader + train_data_loader, data_file = dataset_future.result(timeout=None) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/minilmv2/run_clue.py b/examples/model_compression/minilmv2/run_clue.py new file mode 100644 index 0000000000000000000000000000000000000000..8ab58327a6b818fc71007959e0562baa9ae5274d --- /dev/null +++ b/examples/model_compression/minilmv2/run_clue.py @@ -0,0 +1,327 @@ +# Copyright (c) 2021s PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
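+
+# Fine-tunes a (distilled) TinyBERT / BERT sequence classification model on the
+# CLUE classification tasks and reports dev-set accuracy; this is how the general
+# model produced by general_distill.py is evaluated on downstream tasks.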
+ +import argparse +import logging +import math +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + LinearDecayWithWarmup, + TinyBertForSequenceClassification, + TinyBertTokenizer, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +METRIC_CLASSES = { + "afqmc": Accuracy, + "tnews": Accuracy, + "iflytek": Accuracy, + "ocnli": Accuracy, + "cmnli": Accuracy, + "cluewsc2020": Accuracy, + "csl": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "tinybert": (TinyBertForSequenceClassification, TinyBertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." 
+ ) + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + return res + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["label"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + elif "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = tokenizer(sentence1, text_pair=example["abst"], max_seq_len=max_seq_length) + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + # print(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example = tokenizer(text, max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("clue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + 
train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + best_acc = 0.0 + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + acc = evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if acc > best_acc: + best_acc = acc + if global_step >= num_training_steps: + print("best_acc: ", best_acc) + return + print("best_acc: ", best_acc) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + 
print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/ofa/README.md b/examples/model_compression/ofa/README.md new file mode 100644 index 0000000000000000000000000000000000000000..79eccdf150869ab8956304e475586fbf87db6a2b --- /dev/null +++ b/examples/model_compression/ofa/README.md @@ -0,0 +1,329 @@ +# BERT Compression Based on PaddleSlim + +BERT-base模型是一个迁移能力很强的通用语义表示模型,但是模型中也有一些参数冗余。本教程将介绍如何使用PaddleSlim对BERT-base模型进行压缩。 + +## 压缩原理 + +1. 对Fine-tuning得到模型通过计算参数及其梯度的乘积得到参数的重要性,把模型参数根据重要性进行重排序。 +2. 超网络中最大的子网络选择和Bert-base模型网络结构一致的网络结构,其他小的子网络是对最大网络的进行不同的宽度选择来得到的,宽度选择具体指的是网络中的参数进行裁剪,所有子网络在整个训练过程中都是参数共享的。 +2. 用重排序之后的模型参数作为超网络模型的初始化参数。 +3. Fine-tuning之后的模型作为教师网络,超网络作为学生网络,进行知识蒸馏。 + +<p align="center"> +<img src="./imgs/ofa_bert.jpg" width="950"/><br /> +整体流程图 +</p> + + +## 压缩结果 + +利用`bert-base-uncased`模型首先在GLUE数据集上进行finetune,得到需要压缩的模型,之后基于此模型进行压缩。压缩后模型参数大小减小26%(从110M减少到81M),压缩后模型在GLUE dev数据集上的精度和压缩前模型在GLUE dev数据集上的精度对比如下表所示: + +| Task | Metric | Result | Result with PaddleSlim | +|:-----:|:----------------------------:|:-----------------:|:----------------------:| +| SST-2 | Accuracy | 0.93005 | 0.931193 | +| QNLI | Accuracy | 0.91781 | 0.920740 | +| CoLA | Mattehew's corr | 0.59557 | 0.601244 | +| MRPC | F1/Accuracy | 0.91667/0.88235 | 0.91740/0.88480 | +| STS-B | Person/Spearman corr | 0.88847/0.88350 | 0.89271/0.88958 | +| QQP | Accuracy/F1 | 0.90581/0.87347 | 0.90994/0.87947 | +| MNLI | Matched acc/MisMatched acc | 0.84422/0.84825 | 0.84687/0.85242 | +| RTE | Accuracy | 0.711191 | 0.718412 | + +<p align="center"> +<strong>表1-1: GLUE数据集精度对比</strong> +</p> + +压缩前后模型的耗时如下表所示: + +<table style="width:100%;" cellpadding="2" cellspacing="0" border="1" bordercolor="#000000"> + <tbody> + <tr> + <td style="text-align:center"> + <span style="font-size:18px;">Device</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px;">Batch Size</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px;">Model</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px;">TRT(FP16)</span> + </td> + <td style="text-align:center;"> + <span style="font-size:18px;">Latency(ms)</span> + </td> + </tr> + <tr> + <td rowspan=8 align=center> T4 </td> + <td rowspan=4 align=center> 16 </td> + <td rowspan=2 align=center> BERT </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">110.71</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px">Y</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">22.0</span> + </td> + </tr> + <tr> + <td rowspan=2 align=center>Compressed BERT </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">69.62</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px">Y</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">14.93</span> + </td> + </tr> + <tr> + <td rowspan=4 align=center> 40 </td> + <td rowspan=2 align=center> BERT </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">252.78</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px">Y</span> + </td> + <td 
style="text-align:center"> + <span style="font-size:18px">53.67</span> + </td> + </tr> + <tr> + <td rowspan=2 align=center>Compressed BERT </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">168.71</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px">Y</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">37.22</span> + </td> + </tr> + <tr> + <td rowspan=2 align=center> V100 </td> + <td rowspan=2 align=center> 16 </td> + <td style="text-align:center"> + <span style="font-size:18px;" align=center>BERT</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">33.28</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px;">Compressed BERT</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">21.83</span> + </td> + </tr> + <tr> + <td rowspan=2 align=center> Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz </td> + <td rowspan=2 align=center> 16 </td> + <td style="text-align:center"> + <span style="font-size:18px;" align=center>BERT</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">10831.73</span> + </td> + </tr> + <tr> + <td style="text-align:center"> + <span style="font-size:18px;">Compressed BERT</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">N</span> + </td> + <td style="text-align:center"> + <span style="font-size:18px">7682.93</span> + </td> + </tr> + </tbody> +</table> +<br /> +<p align="center"> +<strong>表1-2: 模型速度对比</strong> +</p> + +压缩后模型在T4机器上相比原始模型在FP32的情况下加速59%,在TensorRT FP16的情况下加速47.3%。 +压缩后模型在V100机器上相比原始模型在FP32的情况下加速52.5%。 +压缩后模型在Intel(R) Xeon(R) Gold 5117 CPU上相比原始模型在FP32的情况下加速41%。 + +## 快速开始 +本教程示例以GLUE/SST-2 数据集为例。 + +### 环境依赖 + +模型压缩功能依赖最新版本的PaddleSlim +```shell +git clone https://github.com/PaddlePaddle/PaddleSlim +python setup.py build && python setup.py install +``` + +### Fine-tuing +首先需要对Pretrain-Model在实际的下游任务上进行Finetuning,得到需要压缩的模型。Fine-tuning流程参考[Fine-tuning教程](../../benchmark/glue) + +```shell +cd ../../benchmark/glue/ +``` + +```python +export CUDA_VISIBLE_DEVICES=0 +export TASK_NAME=SST-2 + +python -u ./run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/$TASK_NAME/ \ + --device gpu \ +``` +参数详细含义参考[README.md](../../benchmark/glue/README.md) +Fine-tuning 在dev上的结果如压缩结果表1-1中Result那一列所示。 + + +### 压缩训练 + +单卡训练 +```shell +python -u ./run_glue_ofa.py --model_type bert \ + --model_name_or_path ${task_pretrained_model_dir} \ + --task_name $TASK_NAME --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 6 \ + --logging_steps 10 \ + --save_steps 100 \ + --output_dir ./tmp/$TASK_NAME \ + --device gpu \ + --width_mult_list 1.0 0.8333333333333334 0.6666666666666666 0.5 +``` + +多卡训练 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1" run_glue_ofa.py \ + --model_type bert \ + --model_name_or_path ${task_pretrained_model_dir} \ + --task_name $TASK_NAME --max_seq_length 
128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 6 \ + --logging_steps 10 \ + --save_steps 100 \ + --output_dir ./tmp/$TASK_NAME \ + --device gpu \ + --width_mult_list 1.0 0.8333333333333334 0.6666666666666666 0.5 +``` + + +其中参数释义如下: +- `model_type` 指示了模型类型,当前仅支持BERT模型。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `task_name` 表示 Fine-tuning 的任务。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `width_mult_list` 表示压缩训练过程中,对每层Transformer Block的宽度选择的范围。 + +压缩训练之后在dev上的结果如压缩结果表格中Result with PaddleSlim那一列所示,延时情况如表1-2所示。 + + +### 导出子模型 +根据传入的config导出相应的子模型并转为静态图模型。 + +启动命令: + +```shell +python -u ./export_model.py --model_type bert \ + --model_name_or_path ${PATH_OF_MODEL_AFTER_OFA} \ + --max_seq_length 128 \ + --sub_model_output_dir ./tmp/$TASK_NAME/dynamic_model \ + --static_sub_model ./tmp/$TASK_NAME/static_model \ + --device gpu \ + --width_mult 0.6666666666666666 +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,当前仅支持BERT模型。 +- `model_name_or_path` 指示了某种特定配置的经过OFA训练后保存的模型,对应有其预训练模型和预训练时使用的tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。默认:128. +- `sub_model_output_dir` 指示了导出子模型动态图参数的目录。 +- `static_sub_model` 指示了导出子模型静态图模型及参数的目录,设置为None,则表示不导出静态图模型。默认:None。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `width_mult` 表示导出子模型的宽度。默认:1.0. + + +### OFA接口介绍 + +OFA API介绍参考[API](https://github.com/PaddlePaddle/PaddleSlim/blob/release/2.0.0/docs/zh_cn/api_cn/dygraph/ofa/ofa_api.rst) + +## 另附:基于本代码对TinyBERT(L=4, D=312)进行压缩 +下游任务模型是从TinyBERT官方repo转换得到。 + +### 压缩结果 + +| Task | Metric | TinyBERT(L=4, D=312) | Result with OFA | +|:-----:|:----------------------------:|:--------------------:|:----------------------:| +| SST-2 | Accuracy | [0.9234]() | [0.9220]() | +| QNLI | Accuracy | [0.8746]() | [0.8720]() | +| CoLA | Mattehew's corr | [0.4961]() | [0.5048]() | +| MRPC | F1/Accuracy | [0.8998/0.8554]() | [0.9003/0.8578]() | +| STS-B | Person/Spearman corr | [0.8635/0.8631]() | [0.8717/0.8706]() | +| QQP | Accuracy/F1 | [0.9047/0.8751]() | [0.9034/0.8733]() | +| MNLI | Matched acc/MisMatched acc | [0.8256/0.8294]() | [0.8211/0.8261]() | +| RTE | Accuracy | [0.6534]() | [0.6787]() | + + +## 参考论文 + +1. Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu. DynaBERT: Dynamic BERT with Adaptive Width and Depth. +2. H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han. Once for all: Train one network and specialize it for efficient deployment. diff --git a/examples/model_compression/ofa/export_model.py b/examples/model_compression/ofa/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..9f0254faea089dd4a45abafa5e7a5fd55d040b46 --- /dev/null +++ b/examples/model_compression/ofa/export_model.py @@ -0,0 +1,205 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import math +import os + +import paddle +from paddleslim.nas.ofa import OFA, utils +from paddleslim.nas.ofa.convert_super import Convert, supernet + +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertModel, + BertTokenizer, +) + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), +} + + +def bert_forward( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, output_hidden_states=False +): + wtype = self.pooler.dense.fn.weight.dtype if hasattr(self.pooler.dense, "fn") else self.pooler.dense.weight.dtype + if attention_mask is None: + attention_mask = paddle.unsqueeze((input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + else: + if attention_mask.ndim == 2: + # attention_mask [batch_size, sequence_length] -> [batch_size, 1, 1, sequence_length] + attention_mask = attention_mask.unsqueeze(axis=[1, 2]) + + embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids) + if output_hidden_states: + output = embedding_output + encoder_outputs = [] + for mod in self.encoder.layers: + output = mod(output, src_mask=attention_mask) + encoder_outputs.append(output) + if self.encoder.norm is not None: + encoder_outputs[-1] = self.encoder.norm(encoder_outputs[-1]) + pooled_output = self.pooler(encoder_outputs[-1]) + else: + sequence_output = self.encoder(embedding_output, attention_mask) + pooled_output = self.pooler(sequence_output) + if output_hidden_states: + return encoder_outputs, pooled_output + else: + return sequence_output, pooled_output + + +BertModel.forward = bert_forward + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--sub_model_output_dir", + default=None, + type=str, + required=True, + help="The output directory where the sub model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--static_sub_model", + default=None, + type=str, + help="The output directory where the sub static model will be written. If set to None, not export static model", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--n_gpu", type=int, default=1, help="number of gpus to use, 0 for cpu.") + parser.add_argument("--width_mult", type=float, default=1.0, help="width mult you want to export") + parser.add_argument("--depth_mult", type=float, default=1.0, help="depth mult you want to export") + args = parser.parse_args() + return args + + +def export_static_model(model, model_path, max_seq_length): + input_shape = [ + paddle.static.InputSpec(shape=[None, max_seq_length], dtype="int64"), + paddle.static.InputSpec(shape=[None, max_seq_length], dtype="int64"), + ] + net = paddle.jit.to_static(model, input_spec=input_shape) + paddle.jit.save(net, model_path) + + +def do_train(args): + paddle.set_device("gpu" if args.n_gpu else "cpu") + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config_path = os.path.join(args.model_name_or_path, "model_config.json") + cfg_dict = dict(json.loads(open(config_path).read())) + + kept_layers_index = {} + if args.depth_mult < 1.0: + depth = round(cfg_dict["init_args"][0]["num_hidden_layers"] * args.depth_mult) + cfg_dict["init_args"][0]["num_hidden_layers"] = depth + for idx, i in enumerate(range(1, depth + 1)): + kept_layers_index[idx] = math.floor(i / args.depth_mult) - 1 + + os.rename(config_path, config_path + "_bak") + with open(config_path, "w", encoding="utf-8") as f: + f.write(json.dumps(cfg_dict, ensure_ascii=False)) + + num_labels = cfg_dict["num_classes"] + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + origin_model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + os.rename(config_path + "_bak", config_path) + + sp_config = supernet(expand_ratio=[1.0, args.width_mult]) + model = Convert(sp_config).convert(model) + + ofa_model = OFA(model) + + sd = paddle.load(os.path.join(args.model_name_or_path, "model_state.pdparams")) + + if len(kept_layers_index) == 0: + ofa_model.model.set_state_dict(sd) + else: + for name, params in ofa_model.model.named_parameters(): + if "encoder" not in name: + params.set_value(sd[name]) + else: + idx = int(name.strip().split(".")[3]) + mapping_name = name.replace("." + str(idx) + ".", "." 
+ str(kept_layers_index[idx]) + ".") + params.set_value(sd[mapping_name]) + + best_config = utils.dynabert_config(ofa_model, args.width_mult) + for name, sublayer in ofa_model.model.named_sublayers(): + if isinstance(sublayer, paddle.nn.MultiHeadAttention): + sublayer.num_heads = int(args.width_mult * sublayer.num_heads) + + ofa_model.export( + best_config, + input_shapes=[[1, args.max_seq_length], [1, args.max_seq_length]], + input_dtypes=["int64", "int64"], + origin_model=origin_model, + ) + for name, sublayer in origin_model.named_sublayers(): + if isinstance(sublayer, paddle.nn.MultiHeadAttention): + sublayer.num_heads = int(args.width_mult * sublayer.num_heads) + + output_dir = os.path.join(args.sub_model_output_dir, "model_width_%.5f" % args.width_mult) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = origin_model + model_to_save.save_pretrained(output_dir) + + if args.static_sub_model is not None: + export_static_model(origin_model, args.static_sub_model, args.max_seq_length) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/ofa/imgs/ofa_bert.jpg b/examples/model_compression/ofa/imgs/ofa_bert.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8683c7d8bc21803e13654231763a3680940d63cd Binary files /dev/null and b/examples/model_compression/ofa/imgs/ofa_bert.jpg differ diff --git a/examples/model_compression/ofa/run_glue_ofa.py b/examples/model_compression/ofa/run_glue_ofa.py new file mode 100644 index 0000000000000000000000000000000000000000..4ed7725cddf611acb31ab3c9be4b1b63d44a8a87 --- /dev/null +++ b/examples/model_compression/ofa/run_glue_ofa.py @@ -0,0 +1,488 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
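# Overview of the width-adaptive (DynaBERT-style) compression implemented below,
# mirroring the numbered "Step" comments inside do_train():
#   Step1-2: load the fine-tuned BERT model and convert it into a width-elastic
#            supernet (paddleslim Convert/supernet), re-using the original weights.
#   Step3-4: keep an unconverted copy as the distillation teacher and register the
#            embedding/encoder layers used for hidden-state distillation.
#   Step5-6: wrap the supernet with OFA, then reorder attention heads and FFN
#            neurons by importance so the most useful units survive narrowing.
#   Step7-8: for every batch, iterate over --width_mult_list and train each
#            sub-network with the distillation loss plus a soft cross-entropy
#            loss against the teacher logits.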
+ +import argparse +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertModel, + BertTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# paddleslim needs to be imported last for some overrides to kick in +from paddleslim.nas.ofa import OFA, DistillConfig, utils # isort: skip +from paddleslim.nas.ofa.convert_super import Convert, supernet # isort: skip +from paddleslim.nas.ofa.utils import nlp_utils # isort: skip + + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--lambda_logit", default=1.0, type=float, help="lambda for logit loss.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["gpu", "cpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--width_mult_list", nargs="+", type=float, default=[1.0, 5 / 6, 2 / 3, 0.5], help="width mult in compress" + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader, width_mult=1.0): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids, attention_mask=[None, None]) + if isinstance(logits, tuple): + logits = logits[0] + loss = criterion(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + # Teacher model's evaluation + if width_mult == 100: + if isinstance(metric, AccuracyAndF1): + print( + "teacher model, eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("teacher model, eval loss: %f, mcc: %s, " % (loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "teacher model, eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("teacher model, eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + else: + if isinstance(metric, AccuracyAndF1): + print( + "width_mult: %s, eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + width_mult, + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("width_mult: %s, eval loss: %f, mcc: %s, " % (str(width_mult), loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "width_mult: %s, eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (str(width_mult), loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("width_mult: %s, eval loss: %f, acc: %s, " % (str(width_mult), loss.numpy(), res), end="") + model.train() + + +# monkey patch for bert forward to accept [attention_mask, head_mask] as attention_mask +def bert_forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=[None, None]): + wtype = self.pooler.dense.fn.weight.dtype if hasattr(self.pooler.dense, "fn") else self.pooler.dense.weight.dtype + if attention_mask[0] is None: + attention_mask[0] = paddle.unsqueeze((input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + embedding_output = self.embeddings(input_ids=input_ids, 
position_ids=position_ids, token_type_ids=token_type_ids) + encoder_outputs = self.encoder(embedding_output, attention_mask) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + +BertModel.forward = bert_forward + + +# reorder weights according head importance and neuron importance +def reorder_neuron_head(model, head_importance, neuron_importance): + # reorder heads and ffn neurons + for layer, current_importance in enumerate(neuron_importance): + # reorder heads + idx = paddle.argsort(head_importance[layer], descending=True) + nlp_utils.reorder_head(model.bert.encoder.layers[layer].self_attn, idx) + # reorder neurons + idx = paddle.argsort(paddle.to_tensor(current_importance), descending=True) + nlp_utils.reorder_neuron(model.bert.encoder.layers[layer].linear1.fn, idx, dim=1) + nlp_utils.reorder_neuron(model.bert.encoder.layers[layer].linear2.fn, idx, dim=0) + + +def soft_cross_entropy(inp, target): + inp_likelihood = F.log_softmax(inp, axis=-1) + target_prob = F.softmax(target, axis=-1) + return -1.0 * paddle.mean(paddle.sum(inp_likelihood * target_prob, axis=-1)) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + 
dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_labels = 1 if train_ds.label_list is None else len(train_ds.label_list) + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + # Step1: Initialize a dictionary to save the weights from the origin BERT model. + origin_weights = model.state_dict() + + # Step2: Convert origin model to supernet. + sp_config = supernet(expand_ratio=args.width_mult_list) + model = Convert(sp_config).convert(model) + # Use weights saved in the dictionary to initialize supernet. + utils.set_state_dict(model, origin_weights) + del origin_weights + + # Step3: Define teacher model. + teacher_model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + # Step4: Config about distillation. + mapping_layers = ["bert.embeddings"] + for idx in range(model.bert.config["num_hidden_layers"]): + mapping_layers.append("bert.encoder.layers.{}".format(idx)) + + default_distill_config = { + "lambda_distill": 0.1, + "teacher_model": teacher_model, + "mapping_layers": mapping_layers, + } + distill_config = DistillConfig(**default_distill_config) + + # Step5: Config in supernet training. + ofa_model = OFA(model, distill_config=distill_config, elastic_order=["width"]) + + criterion = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + if args.task_name == "mnli": + dev_data_loader = (dev_data_loader_matched, dev_data_loader_mismatched) + + # Step6: Calculate the importance of neurons and head, + # and then reorder them according to the importance. + head_importance, neuron_importance = nlp_utils.compute_neuron_head_importance( + args.task_name, + ofa_model.model, + dev_data_loader, + loss_fct=criterion, + num_layers=model.bert.config["num_hidden_layers"], + num_heads=model.bert.config["num_attention_heads"], + ) + reorder_neuron_head(ofa_model.model, head_importance, neuron_importance) + + if paddle.distributed.get_world_size() > 1: + ofa_model.model = paddle.DataParallel(ofa_model.model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
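    # The names collected below are handed to AdamW via apply_decay_param_fun, so
    # weight decay is applied only to parameters whose names are in decay_params
    # (i.e. everything except *bias* and *norm* parameters).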
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=ofa_model.model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + # Step7: Set current epoch and task. + ofa_model.set_epoch(epoch) + ofa_model.set_task("width") + + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, labels = batch + + for width_mult in args.width_mult_list: + # Step8: Broadcast supernet config from width_mult, + # and use this config in supernet training. + net_config = utils.dynabert_config(ofa_model, width_mult) + ofa_model.set_net_config(net_config) + logits, teacher_logits = ofa_model(input_ids, segment_ids, attention_mask=[None, None]) + rep_loss = ofa_model.calc_distill_loss() + if args.task_name == "sts-b": + logit_loss = paddle.zeros(shape=[1], dtype="float32") + else: + logit_loss = soft_cross_entropy(logits, teacher_logits.detach()) + loss = rep_loss + args.lambda_logit * logit_loss + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0: + if paddle.distributed.get_rank() == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + if global_step % args.save_steps == 0: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(teacher_model, criterion, metric, dev_data_loader_matched, width_mult=100) + evaluate(teacher_model, criterion, metric, dev_data_loader_mismatched, width_mult=100) + else: + evaluate(teacher_model, criterion, metric, dev_data_loader, width_mult=100) + print("eval done total : %s s" % (time.time() - tic_eval)) + for idx, width_mult in enumerate(args.width_mult_list): + net_config = utils.dynabert_config(ofa_model, width_mult) + ofa_model.set_net_config(net_config) + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(ofa_model, criterion, metric, dev_data_loader_matched, width_mult) + evaluate(ofa_model, criterion, metric, dev_data_loader_mismatched, width_mult) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(ofa_model, criterion, metric, dev_data_loader, width_mult) + print("eval done total : %s s" % (time.time() - tic_eval)) + + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/ofa/run_glue_ofa_depth.py b/examples/model_compression/ofa/run_glue_ofa_depth.py new file mode 100644 index 
0000000000000000000000000000000000000000..09f8e8d9d1c995368e4a2b0ea09bc36cb61a9a13 --- /dev/null +++ b/examples/model_compression/ofa/run_glue_ofa_depth.py @@ -0,0 +1,459 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import math +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.io import DataLoader +from paddle.metric import Accuracy +from paddleslim.nas.ofa import OFA, DistillConfig, RunConfig, utils +from paddleslim.nas.ofa.convert_super import Convert, supernet + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertModel, + BertTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--lambda_logit", default=1.0, type=float, help="lambda for logit loss.") + parser.add_argument("--lambda_rep", default=0.1, type=float, help="lambda for hidden state distillation loss.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["gpu", "cpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--width_mult_list", nargs="+", type=float, default=[1.0, 5 / 6, 2 / 3, 0.5], help="width mult in compress" + ) + parser.add_argument( + "--depth_mult_list", nargs="+", type=float, default=[1.0, 0.75, 0.5], help="width mult in compress" + ) + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +def evaluate(model, criterion, metric, data_loader, width_mult=1.0, depth_mult=1.0): + with paddle.no_grad(): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids, attention_mask=[None, None]) + if isinstance(logits, tuple): + logits = logits[0] + loss = criterion(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + results = metric.accumulate() + # Teacher model's evaluation + if width_mult == 100: + print("teacher_model, eval loss: %f, %s: %s\n" % (loss.numpy(), metric.name(), results), end="") + else: + print( + "depth_mult: %f, width_mult: %f, eval loss: %f, %s: %s\n" + % (depth_mult, width_mult, loss.numpy(), metric.name(), results), + end="", + ) + model.train() + + +# monkey patch for bert forward to accept [attention_mask, head_mask] as attention_mask +def bert_forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=[None, None], depth_mult=1.0): + wtype = self.pooler.dense.fn.weight.dtype if hasattr(self.pooler.dense, "fn") else self.pooler.dense.weight.dtype + if attention_mask[0] is None: + attention_mask[0] = paddle.unsqueeze((input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + embedding_output = self.embeddings(input_ids=input_ids, 
position_ids=position_ids, token_type_ids=token_type_ids) + encoder_outputs = self.encoder(embedding_output, attention_mask, depth_mult=depth_mult) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + +BertModel.forward = bert_forward + + +def transformer_encoder_forward(self, src, src_mask=None, depth_mult=1.0): + output = src + + depth = round(self.num_layers * depth_mult) + kept_layers_index = [] + for i in range(1, depth + 1): + kept_layers_index.append(math.floor(i / depth_mult) - 1) + + for i in kept_layers_index: + output = self.layers[i](output, src_mask=src_mask) + + if self.norm is not None: + output = self.norm(output) + + return output + + +paddle.nn.TransformerEncoder.forward = transformer_encoder_forward + + +def sequence_forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=[None, None], depth=1.0): + _, pooled_output = self.bert( + input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + depth_mult=depth, + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + +BertForSequenceClassification.forward = sequence_forward + + +def soft_cross_entropy(inp, target): + inp_likelihood = F.log_softmax(inp, axis=-1) + target_prob = F.softmax(target, axis=-1) + return -1.0 * paddle.mean(paddle.sum(inp_likelihood * target_prob, axis=-1)) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, 
splits=["dev_matched", "dev_mismatched"] + ) + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_labels = 1 if train_ds.label_list is None else len(train_ds.label_list) + + # Step1: Initialize the origin BERT model. + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + origin_weights = model.state_dict() + + # Step2: Convert origin model to supernet. + sp_config = supernet(expand_ratio=args.width_mult_list) + model = Convert(sp_config).convert(model) + # Use weights saved in the dictionary to initialize supernet. + utils.set_state_dict(model, origin_weights) + + # Step3: Define teacher model. + teacher_model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + new_dict = utils.utils.remove_model_fn(teacher_model, origin_weights) + teacher_model.set_state_dict(new_dict) + del origin_weights, new_dict + + default_run_config = {"elastic_depth": args.depth_mult_list} + run_config = RunConfig(**default_run_config) + + # Step4: Config about distillation. + mapping_layers = ["bert.embeddings"] + for idx in range(model.bert.config["num_hidden_layers"]): + mapping_layers.append("bert.encoder.layers.{}".format(idx)) + + default_distill_config = { + "lambda_distill": args.lambda_rep, + "teacher_model": teacher_model, + "mapping_layers": mapping_layers, + } + distill_config = DistillConfig(**default_distill_config) + + # Step5: Config in supernet training. + ofa_model = OFA(model, run_config=run_config, distill_config=distill_config, elastic_order=["depth"]) + # elastic_order=['width']) + + criterion = paddle.nn.CrossEntropyLoss() if train_ds.label_list else paddle.nn.MSELoss() + + metric = metric_class() + + if args.task_name == "mnli": + dev_data_loader = (dev_data_loader_matched, dev_data_loader_mismatched) + + if paddle.distributed.get_world_size() > 1: + ofa_model.model = paddle.DataParallel(ofa_model.model, find_unused_parameters=True) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=ofa_model.model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + # Step6: Set current epoch and task. + ofa_model.set_epoch(epoch) + ofa_model.set_task("depth") + + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, labels = batch + + for depth_mult in args.depth_mult_list: + for width_mult in args.width_mult_list: + # Step7: Broadcast supernet config from width_mult, + # and use this config in supernet training. + net_config = utils.dynabert_config(ofa_model, width_mult, depth_mult) + ofa_model.set_net_config(net_config) + logits, teacher_logits = ofa_model(input_ids, segment_ids, attention_mask=[None, None]) + rep_loss = ofa_model.calc_distill_loss() + if args.task_name == "sts-b": + logit_loss = 0.0 + else: + logit_loss = soft_cross_entropy(logits, teacher_logits.detach()) + loss = rep_loss + args.lambda_logit * logit_loss + loss.backward() + optimizer.step() + lr_scheduler.step() + ofa_model.model.clear_gradients() + + if global_step % args.logging_steps == 0: + if paddle.distributed.get_rank() == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + if global_step % args.save_steps == 0: + if args.task_name == "mnli": + evaluate(teacher_model, criterion, metric, dev_data_loader_matched, width_mult=100) + evaluate(teacher_model, criterion, metric, dev_data_loader_mismatched, width_mult=100) + else: + evaluate(teacher_model, criterion, metric, dev_data_loader, width_mult=100) + for depth_mult in args.depth_mult_list: + for width_mult in args.width_mult_list: + net_config = utils.dynabert_config(ofa_model, width_mult, depth_mult) + ofa_model.set_net_config(net_config) + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(ofa_model, criterion, metric, dev_data_loader_matched, width_mult, depth_mult) + evaluate(ofa_model, criterion, metric, dev_data_loader_mismatched, width_mult, depth_mult) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(ofa_model, criterion, metric, dev_data_loader, width_mult, depth_mult) + print("eval done total : %s s" % (time.time() - tic_eval)) + + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/pp-minilm/README.md b/examples/model_compression/pp-minilm/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..bb3e9598ef025e1a33992362c13ce4cd86e1529b --- /dev/null +++ b/examples/model_compression/pp-minilm/README.md @@ -0,0 +1,430 @@ + **目录** + +* [PP-MiniLM 中文小模型](#PP-MiniLM中文小模型) + * [导入 PP-MiniLM](#导入PP-MiniLM) + * [在下游任务上使用 PP-MiniLM](#在下游任务上使用PP-MiniLM) + * [数据介绍](#数据介绍) + * [环境依赖](#环境依赖) + * [微调](#微调) + * [运行方式](#运行方式) + * [微调后模型精度](#微调后模型精度) + * [导出微调后模型](#导出微调后模型) + * [裁剪](#裁剪) + * [原理简介](#原理简介) + * [运行方式](#运行方式) + * [裁剪后模型精度](#裁剪后模型精度) + * [导出裁剪后的模型](#导出裁剪后的模型) + * [量化](#量化) + * [原理简介](#原理简介) + * [运行方式](#运行方式) + * [量化后模型精度](#量化后模型精度) + * [使用 Paddle Inference 进行推理部署](#使用PaddleInference推理部署) + * [环境要求](#环境要求) + * [运行方式](#运行方式) + * [性能测试](#性能测试) + * [使用 Paddle Serving 进行服务化部署](#使用PaddleServing服务化部署) + * [参考文献](#参考文献) + +<a name="PP-MiniLM中文小模型"></a> + +# PP-MiniLM 中文小模型 +[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) 联合 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 通过模型蒸馏、剪裁、量化等级联模型压缩技术发布中文特色小模型 PP-MiniLM(6L768H) 及压缩方案,保证模型精度的同时模型推理速度达 BERT(12L768H) 的 8.88 倍,参数量相比减少 52%,模型精度在中文语言理解评测基准 CLUE 高 0.62。 + +PP-MiniLM 压缩方案以面向预训练模型的任务无关知识蒸馏(Task-agnostic Distillation)技术、裁剪(Pruning)技术、量化(Quantization)技术为核心,使得 PP-MiniLM **又快**、**又准**、**又小**。 + +1. **推理速度快**: 依托 PaddleSlim 的裁剪、量化技术对 PP-MiniLM 小模型进行压缩、加速, 使得 PP-MiniLM 量化后模型 GPU 推理速度相比 BERT base 加速比高达 8.88; + +2. **精度高**: 我们以 [MiniLMv2](https://arxiv.org/abs/2012.15828) 提出的 Multi-Head Self-Attention Relation Distillation 技术为基础,通过引入样本间关系知识蒸馏做了进一步算法优化,6 层 PP-MiniLM 模型在 CLUE 数据集上比 12 层 `bert-base-chinese` 高 0.62%,比同等规模的 TinyBERT<sub>6,</sub>、UER-py RoBERTa 分别高 2.57%、2.24%; + +3. **参数规模小**:依托 Task-agnostic Distillation 技术和 PaddleSlim 裁剪技术,模型参数量相比 BERT 减少 52%。 + +**整体效果** + + +| Model | #Params | #FLOPs | Speedup (w/o FasterTokenizer) | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | Avg | +| ----------------------------- | --------- | --------- | ---------------- | --------- | --------- | --------- | --------- | --------- | ----------- | --------- | --------- | +| BERT-base, Chinese | 102.3M | 10.87B | 1.00x | 74.14 | 56.81 | 61.10 | 81.19 | 74.85 | 79.93 | 81.47 | 72.78 | +| TinyBERT<sub>6,</sub> Chinese | 59.7M | 5.44B | 1.90x | 72.59 | 55.70 | 57.64 | 79.57 | 73.97 | 76.32 | 80.00 | 70.83 | +| UER-py RoBERTa L6-H768 | 59.7M | 5.44B | 1.90x | 69.62 | **66.45** | 59.91 | 76.89 | 71.36 | 71.05 | **82.87** | 71.16 | +| RBT6, Chinese | 59.7M | 5.44B | 1.90x | 73.93 | 56.63 | 59.79 | 79.28 | 73.12 | 77.30 | 80.80 | 71.55 | +| ERNIE-Tiny | 90.7M | 4.83B | 2.22x | 71.55 | 58.34 | 61.41 | 76.81 | 71.46 | 72.04 | 79.13 | 70.11 | +| PP-MiniLM | 59.7M | 5.44B | 2.15x (1.90x) | 74.14 | 57.43 | **61.75** | 81.01 | **76.17** | 86.18 | 79.17 | **73.69** | +| PP-MiniLM + 裁剪 | **49.1M** | **4.08B** | 2.74x (2.48x) | 73.91 | 57.44 | 61.64 | 81.10 | 75.59 | **85.86** | 78.53 | 73.44 | +| PP-MiniLM + 量化 | 59.8M | - | 7.34x (4.63x) | **74.19** | 57.13 | 61.10 | **81.20** | 76.10 | 85.20 | 78.03 | 73.28 | +| PP-MiniLM + 裁剪 + 量化 | **49.2M** | - | **8.88x** (5.36x) | 74.00 | 57.37 | 61.33 | 81.09 | 75.56 | 85.85 | 78.57 | 73.40 | + + +**NOTE:** + +1.上表所有模型的精度测试均是基于下方超参数范围进行的 Grid Search 超参寻优。在每个配置下训练时,每隔 100 个 steps 在验证集上评估一次,取验证集上最佳准确率作为当前超参数配置下的准确率; +- batch sizes: 16, 32, 64; +- learning rates: 3e-5, 5e-5, 1e-4 + +2.量化后比量化前模型参数量多了 0.1M 是因为保存了 scale 值; + +3.性能测试的环境: + +- 硬件:NVIDIA Tesla T4 单卡; +- 软件:CUDA 11.1, cuDNN 8.1, TensorRT 7.2, PaddlePaddle 2.2.2; +- 实验配置:batch_size: 32, max_seq_len: 128; + +其中,除上表最后两行的模型是对 INT8 模型进行预测,其余模型均基于 FP32 精度测试。 + +4.PP-MiniLM 的加速比(见表中 Speedup 列)均测试了接入与未接入 
FasterTokenizer 的数据,其中括号内的加速比代表未接入 FasterTokenizer 的加速比数据。接入 FasterTokenizer 对模型的精度没有影响,裁剪、量化后的模型相对 BERT-base 的加速比从 5.36 倍增加到 8.88 倍。 + +**方案流程** + +<p align="center"> +<img src="./pp-minilm.png" width="950"/><br /> +方案流程图 +</p> + +如上流程图所示,完整的中文小模型方案分为:导入 PP-MiniLM 中文预训练小模型、下游任务微调、裁剪、离线量化、预测部署五大步。下面会对这里的每一个步骤进行介绍。除了下游任务微调步骤,其余步骤均可以省略,但我们建议保留下面的每一个步骤。 + +以下是本范例模型的简要目录结构及说明: + +```shell +. +├── general_distill # 任务无关知识蒸馏目录 +│ └── general_distill.py # 任务无关知识蒸馏脚本 +│ └── run.sh # 任务无关知识蒸馏启动脚本 +│ └── README.md # 任务无关知识蒸馏文档 +├── finetuning # 下游任务训练目录 +│ └── run_clue.py # CLUE 上的微调脚本 +│ └── run_clue.sh # CLUE 上的微调启动脚本 +│ └── run_one_search.sh # 单数据集下精调脚本 +│ └── run_all_search.sh # CLUE数据集下精调脚本 +│ └── export_model.py # 导出 fine-tuned 部署模型脚本 +├── pruning # 裁剪、蒸馏目录 +│ └── prune.py # 裁剪、蒸馏脚本 +│ └── prune.sh # 裁剪、蒸馏启动脚本 +│ └── export_model.py # 导出裁剪训练得到的子模型(动、静态图模型) +├── quantization # 离线量化目录 +│ └── quant_post.py # 离线量化脚本 +│ └── quant.sh # 离线量化启动脚本 +├── deploy # 部署目录 +│ └── python # Paddle Inference 预测目录 +│ └── infer.py # Paddle Inference 预测脚本 +│ └── infer_all.sh # 批量预测量化模型启动脚本 +│ └── infer_perf.sh # 量化模型性能测试启动脚本 +│ └── serving # Paddle Serving 预测目录 +│ └── export_to_serving.py # 导出 Paddle Serving 预测模型脚本 +│ └── web_service.py # Paddle Serving 服务端启动脚本 +│ └── rpc_client.py # Paddle Serving 客户端启动脚本 +│ └── config_nlp.yml # Paddle Serving 预测配置文件 +│ └── README.md # Paddle Serving 预测文档 +├── data.py # 数据处理脚本 +├── pp-minilm.png # PP-MiniLM 方案流程图 +└── README.md # 文档,本文件 + +``` + +<a name="导入PP-MiniLM"></a> + +## 导入 PP-MiniLM + +PP-MiniLM 是使用任务无关蒸馏方法,以 `roberta-wwm-ext-large` 做教师模型蒸馏产出的含 6 层 Transformer Encoder Layer、Hidden Size 为 768 的预训练小模型,在 CLUE 上 7 个分类任务上的模型精度超过 BERT<sub>base</sub>、TinyBERT<sub>6</sub>、UER-py RoBERTa L6-H768、RBT6。 + +可以这样导入 PP-MiniLM: + +```python + +from paddlenlp.transformers import PPMiniLMModel, PPMiniLMForSequenceClassification + +model = PPMiniLMModel.from_pretrained('ppminilm-6l-768h') +model = PPMiniLMForSequenceClassification.from_pretrained('ppminilm-6l-768h') # 用于分类任务 +``` + +PP-MiniLM 是一个 6 层的预训练模型,使用 `from_pretrained`导入 PP-MiniLM 之后,就可以在自己的数据集上进行 fine-tuning。接下来会介绍如何用下游任务数据在导入的 PP-MiniLM 上进行微调、进一步压缩及推理部署。 + +**NOTE:** 如果对 PP-MiniLM 的训练过程感兴趣,可以查看[任务无关蒸馏文档](general_distill/README.md)了解相关细节。 + +<a name="在下游任务上使用PP-MiniLM"></a> + +## 在下游任务上使用 PP-MiniLM + +PP-MiniLM 预训练小模型在 CLUE 中的 7 个分类数据集的平均精度上比 12 层 `bert-base-chinese` 高 0.62%,比同等规模的 TinyBERT、UER-py RoBERTa 分别高 2.57%、2.24%,因此我们推荐将 PP-MiniLM 运用在中文下游任务上。当然,如果想对已有模型进一步压缩,也可以参考这里的压缩方案,因为压缩方案是通用的。 + +本案例中会以 CLUE 中 7 个分类数据集为例介绍如何在下游任务上使用 PP-MiniLM。首先用 CLUE 中的数据集对预训练小模型 PP-MiniLM 进行微调,然后提供了一套压缩方案,即借助 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 进行裁剪和量化,进一步对模型规模进行压缩,最终使用基于 TensorRT 的 [Paddle Inference](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/05_inference_deployment/inference/inference_cn.html) 预测库对量化后的模型进行预测部署。裁剪、量化前,6 层 PP-MiniLM 的推理速度达 `bert-base-chinese` 的 2.15 倍,在下游任务上压缩完成后,模型推理速度高达`bert-base-chinese`的 8.88 倍。 + +<a name="数据介绍"></a> + +### 数据介绍 + +本案例中下游任务使用的数据是 CLUE 中 7 个分类数据集,包括 AFQMC、TNEWS、IFLYTEK、OCNLI、CMNLI、CSL、CLUEWSC2020。在 Linux 环境下,运行 `run_clue.py` 这个 fine-tuning 脚本会将该数据集自动下载到`~/.paddlenlp/datasets/Clue/`目录下。 + +<a name="环境依赖"></a> + +### 环境依赖 + +压缩方案依赖 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 提供的裁剪、量化功能,在本案例中需要安装 paddleslim 2.2.2 及之后的版本(使用命令 `pip install paddleslim>=2.2.2` 安装即可)。PaddleSlim 是个专注于深度学习模型压缩的工具库,提供剪裁、量化、蒸馏、和模型结构搜索等模型压缩策略,帮助用户快速实现模型的小型化。安装命令如下: + +```shell +pip install paddleslim>=2.2.2 +``` + +<a name="微调"></a> + +### 微调 + +基于如下超参范围对 PP-MiniLM 在各个任务上进行 Grid 
Search 超参寻优: + +- batch sizes: 16, 32, 64 +- learning rates: 3e-5, 5e-5, 1e-4 + +<a name="运行方式"></a> + +#### 运行方式 + +```shell +cd finetuning +mkdir ppminilm-6l-768h +sh run_all_search.sh ppminilm-6l-768h +``` + +如果只是在单个数据集上用特定 `batch_size`、`learning_rate` 微调,可以使用如下命令: + +``` +sh run_clue.sh CLUEWSC2020 1e-4 32 50 128 0 ppminilm-6l-768h +``` + +其中每个参数依次表示:CLUE 中的任务名称、学习率、batch size、epoch 数、最大序列长度、gpu id、模型名称(模型保存目录)。 + +<a name="微调后模型精度"></a> + +#### 微调后模型精度 + +经过超参寻优后,我们可以得到在 CLUE 每个任务上验证集上有最高准确率的模型,CLUE 上各个任务上的最高准确率如下表: + +| AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | Avg | +| ----- | ----- | ------- | ----- | ----- | ----------- | ----- | ----- | +| 74.14 | 57.43 | 61.75 | 81.01 | 76.17 | 86.18 | 79.17 | 73.69 | + + + +超参寻优完成后,保存下每个数据集下有最高准确率的模型,以及其对应的超参数,因裁剪、量化等后续步骤需要用到最好的模型和超参数。 + +<a name="导出微调后模型"></a> + +#### 导出微调后模型 + +模型在训练完成之后,可以选择每个数据集下效果最好的模型进行导出: + +```shell +export TASK_NAME=CLUEWSC2020 +export MODEL_PATH=ppminilm-6l-768h +export LR=1e-4 +export BS=32 + +python export_model.py --task_name ${TASK_NAME} --model_path ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ +``` + +静态图(部署)模型路径与动态图模型的路径相同,文件名为 `inference.pdmodel` , `inference.pdiparams` 和 `inference.pdiparams.info` 。 + +<a name="裁剪"></a> + +### 裁剪 + +这一步主要使用 PaddleSlim 对下游任务上的模型宽度进行裁剪,进一步压缩模型的大小。 + +该过程会以上一步的模型(即 fine-tuning 后得到的最好模型)当作教师模型,蒸馏宽度为 3/4 的学生模型。经过我们的实验,在 6L768H 条件下,模型宽度压缩为原来的 3/4,平均精度下降 0.25。 + +<a name="原理简介"></a> + +#### 原理简介 + +本方案采取的裁剪方法参考了 [DynaBERT-Dynamic BERT with Adaptive Width and Depth](https://arxiv.org/pdf/2004.04037) 中的策略。首先对预训练模型和 Head 进行重要性排序,保证更重要的 Head 不容易被裁掉,然后用原模型作为蒸馏过程中的教师模型,宽度更小的(本案例是 3/4 宽度)模型作为学生模型,蒸馏得到的学生模型就是我们裁剪得到的模型。 + +<a name="运行方式"></a> + +#### 运行方式 + +假设需要对上一步 fine-tuned 模型 `../finetuning/ppminilm-6l-768h/models/CLUEWSC2020/1e-4_32` 进行裁剪,其中 `learning_rate`、`batch_size` 可以继续使用 fine-tuning 时的参数,这里执行的是宽度 `0.75` 的裁剪,可以使用如下命令运行: + +```shell +cd pruning +export FT_MODELS=../finetuning/ppminilm-6l-768h/models/CLUEWSC2020/1e-4_32 + +sh prune.sh CLUEWSC2020 1e-4 32 50 128 0 ${FT_MODELS} 0.75 +``` +其中每个参数依次表示:CLUE 中的任务名称、学习率、batch size、epoch 数、最大序列长度、gpu id、学生模型的地址、裁剪后宽度比例列表。执行完成后,模型保存的路径位于 `pruned_models/CLUEWSC2020/0.75/best_model/`。 + +<a name="裁剪后模型精度"></a> + +#### 裁剪后模型精度 + +经过裁剪后,CLUE 上各个任务上的精度如下表所示。相比起裁剪前,CLUE 数据集上平均值下降 0.25。模型的参数量由 59.7M 下降到 49.1M。 + +| Model | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | Avg | +| ---------------- | ----- | ----- | ------- | ----- | ----- | ----------- | ----- | ----- | +| PP-MiniLM 裁剪后 | 73.91 | 57.44 | 61.64 | 81.10 | 75.59 | 85.86 | 78.53 | 73.44 | + + +<a name="导出裁剪后的模型"></a> + +#### 导出裁剪后的模型 + +这一步可以同时导出经过裁剪后特定宽度下模型的动、静态图的模型和参数等文件。 + +以 CLUEWSC2020 数据集为例,导出模型: + +```shell + +export MODEL_PATH=pruned_models +export TASK_NAME=CLUEWSC2020 +sh export.sh ${MODEL_PATH} ${TASK_NAME} +``` + +或者可以批量导出 CLUE 各个任务上的模型: + +```shell + +sh export_all.sh +cd .. 
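# 批量导出完成后,各任务的静态图子模型位于 pruned_models/${TASK_NAME}/0.75/sub_static/ 目录下(详见下文说明)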
+``` + +导出的静态图模型、参数等文件位于 `${MODEL_PATH}/${TASK_NAME}/0.75/sub_static/` 目录下,有 `float.pdmodel`、`float.pdiparams`、`float.pdiparams.info` 三个文件。 + +导出的动态图参数等文件位于 `${MODEL_PATH}/${TASK_NAME}/0.75/sub/model_width_0.75000/` 目录下,有 `model_state.pdparams` 和 `model_config.json` 两个文件。需要注意的是,此动态图模型不能通过原始 Transformer API 将参数正确载入,因为裁剪后模型不再符合 Transformer 的组网,例如 q、k、v 的 weight 的 shape 是 `[hidden_size, hidden_size * width_mul]` ,不再是 `[hidden_size, hidden_size]`。 + +<a name="量化"></a> + +### 量化 + +```shell +cd quantization +``` + +<a name="原理简介"></a> + +#### 原理简介 + +这里的量化采用的是静态离线量化方法,即不需要训练,只使用少量校准数据计算量化因子,就可快速得到量化模型。这一步需要有训练好的预测(静态图)模型。因此,需要对前序步骤产出的模型进行导出(参考上方导出模型的运行方式)。 + +量化我们可以借助 PaddleSlim 提供的离线量化 API `paddleslim.quant.quant_post_static` 实现,我们这一步使用了 `mse`、`avg`、`abs_max`、`hist` 四种对于激活 Tensor 的量化方法,以及 `channel_wise_abs_max` 对权重 Tensor 的量化方法,并使用 4、8 两种校准集数量,对 `matmul`、`matmul_v2` 算子进行量化。 + +<a name="运行方式"></a> + +#### 运行方式 + +运行如下脚本可以对导出的静态图模型进行量化: + +```shell +export MODEL_DIR=../pruning/pruned_models/ +python quant_post.py --task_name $TASK_NAME --input_dir ${MODEL_DIR}/${TASK_NAME}/0.75/sub_static +``` + +可以批量对所有数据集下的 FP32 模型进行量化: + +```shell +sh quant_all.sh +cd .. +``` + +<a name="量化后模型精度"></a> + +#### 量化后模型精度 + +经过量化后,CLUE 上各个任务上的精度如下表,对 PP-MiniLM 进行量化后,精度比原 FP32 模型下降 0.19;对裁剪后的模型进行量化,精度几乎无损(-0.04): + +| NO | Model | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | Avg | +| ---- | ----------------------- | ----- | ----- | ------- | ----- | ----- | ----------- | ----- | ----- | +| 1 | PP-MiniLM | 74.24 | 57.21 | 61.1 | 81.16 | 76.17 | 85.53 | 78.90 | 73.47 | +| 1 | PP-MiniLM + 量化 | 74.19 | 57.13 | 61.10 | 81.20 | 76.10 | 85.20 | 78.03 | 73.28 | +| 2 | PP-MiniLM + 裁剪 | 73.91 | 57.44 | 61.64 | 81.10 | 75.59 | 85.86 | 78.53 | 73.44 | +| 2 | PP-MiniLM + 裁剪 + 量化 | 74.00 | 57.37 | 61.33 | 81.09 | 75.56 | 85.85 | 78.57 | 73.40 | + + +**NOTE:** 实验 1 是补充实验,PP-MiniLM 和 实验 2 中裁剪前的 PP-MiniLM 模型精度不同。 + +最后,值得注意的是,PP-MiniLM 是基于 `roberta-wwm-ext-large` 做教师模型蒸馏得到的学生模型,如果你有更好的 24 层中文预训练模型,可以基于[任务无关蒸馏文档](general_distill/README.md)中介绍的蒸馏过程,训练出一个比 PP-MiniLM 精度更高,在下游任务上表现更好的 6 层小模型。 + +<a name="使用PaddleInference进行推理部署"></a> + +### 使用 Paddle Inference 进行推理部署 + +预测部署借助 PaddlePaddle 安装包中自带的 [Paddle Inference](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/05_inference_deployment/inference/inference_cn.html) 进行预测。 + +<a name="环境要求"></a> + +#### 环境要求 + +这一步依赖安装有预测库的 PaddlePaddle 2.2.2。可以在 [PaddlePaddle 官网](https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html#python) 根据机器环境选择合适的 Python 预测库进行安装。 + +想要得到更明显的加速效果,推荐在 NVIDA Tensor Core GPU(如 T4、A10、A100)上进行测试,本案例基于 T4 测试。若在 V 系列 GPU 卡上测试,由于其不支持 Int8 Tensor Core,加速效果将达不到本文档表格中的效果。 + +本案例是在 NVIDIA Tesla T4 单卡上,使用 CUDA 11.1、cuDNN 8.1、TensorRT 7.2 进行预测。 + +<a name="运行方式"></a> + +#### 运行方式 + +这里使用了动态 shape 功能,因此需要设置 TensorRT 子图输入shape 的范围。用户需要事先根据自己的模型结构和数据 shape 的范围,设置 TensorRT 子图输入的 shape 的最大、最小、以及最优的范围,其中最优范围可以按照数据分布选择最常见的来设置。动态 shape 的设置可以参考[官方文档](https://paddleinference.paddlepaddle.org.cn/optimize/paddle_trt.html#dynamic-shape)中的教程,以及本案例中 infer.py 脚本中的 160 行 - 206 行)。 + +INT8 预测运行脚本: + +```shell + +cd deploy/python + +export task=tnews +export algo=mse +export bs=4 +python infer.py --task_name ${task} --model_path ../../quantization/${task}_quant_models/${algo}${bs}/int8 --int8 --use_trt +``` +如果想要批量对量化模型进行预测并输出不同量化策略产出模型的精度,可以使用如下的脚本批量预测: + +```shell +sh infer_all.sh +``` + +FP32 预测运行脚本: + +```shell +python infer.py --task_name ${task} --model_path $MODEL_PATH --use_trt +``` + +<a name="性能测试"></a> + +#### 性能测试 + 
测试性能环境同上。本案例测试采用的是 CLUE TNEWS 数据集下量化方法为 `mse`、校准集数量为 4 得到的量化模型,在 TNEWS 的验证集上统计 5 次端到端预测的总耗时(前 20 个 steps 作为 warmup steps)并求平均。其中 `batch_size` 为 32,`max_seq_len` 为 128。

启动性能测试需要对 `infer.py` 脚本传入参数 `--perf`,运行性能测试脚本可以得到 PP-MiniLM、PP-MiniLM 裁剪后、PP-MiniLM 量化后模型预测的耗时:

```shell
bash infer_perf.sh
cd ../../
```

下表展示了微调后、裁剪后、量化后模型相对 BERT<sub>base</sub> 的加速情况。

对 5 次测试耗时取平均并计算加速比,可以发现裁剪、量化后的模型是原 BERT<sub>base</sub> 模型推理速度的 8.88 倍,其中只经过裁剪的模型是 BERT<sub>base</sub> 推理速度的 2.74 倍,只经过量化的模型是 BERT<sub>base</sub> 推理速度的 7.34 倍;接入 FasterTokenizer 前,裁剪、量化后的推理速度是原 BERT<sub>base</sub> 模型推理速度的 5.36 倍。

|                          | 加速比    | 加速比(w/o FasterTokenizer) |
| ------------------------ | --------- | ----------------------------- |
| BERT<sub>base</sub>      | 1.00x     | 1.00x                         |
| PP-MiniLM                | 2.15x     | 1.90x                         |
| PP-MiniLM + 裁剪         | 2.74x     | 2.48x                         |
| PP-MiniLM + 量化         | 7.34x     | 4.63x                         |
| PP-MiniLM + 裁剪 + 量化  | **8.88x** | **5.36x**                     |


<a name="使用PaddleServing服务化部署"></a>

### 使用 Paddle Serving 进行服务化部署

上面介绍的 Paddle Inference 为使用本地模型推理,Paddle Serving 可以实现在服务器端部署推理模型,客户端远程通过 RPC/HTTP 方式发送数据进行推理,实现模型推理的服务化。准备好静态图(推理)模型后,可参考 [Paddle Serving](deploy/serving/README.md) 部署步骤。

<a name="参考文献"></a>

## 参考文献

1. Wang W, Bao H, Huang S, Dong L, Wei F. MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers[J]. arXiv preprint arXiv:2012.15828v2, 2021.

2. Hou L, Huang Z, Shang L, Jiang X, Chen X and Liu Q. DynaBERT: Dynamic BERT with Adaptive Width and Depth[J]. arXiv preprint arXiv:2004.04037, 2020.

3. Cai H, Gan C, Wang T, Zhang Z, and Han S. Once for all: Train one network and specialize it for efficient deployment[J]. arXiv preprint arXiv:1908.09791, 2020.

4. Wu H, Judd P, Zhang X, Isaev M and Micikevicius P. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation[J]. arXiv preprint arXiv:2004.09602v1, 2020.
diff --git a/examples/model_compression/pp-minilm/data.py b/examples/model_compression/pp-minilm/data.py
new file mode 100644
index 0000000000000000000000000000000000000000..819fc29a91a234e7e08812612b6d1158a2cf618c
--- /dev/null
+++ b/examples/model_compression/pp-minilm/data.py
@@ -0,0 +1,84 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
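# data.py provides the shared CLUE data helpers used by the PP-MiniLM examples.
# convert_example() turns a raw CLUE sample into (input_ids, token_type_ids[, label]):
#   - CSL: the keyword list is joined with spaces as sentence1 and paired with the abstract;
#   - CLUEWSC2020: the pronoun span is wrapped with "[ ]" and the query span with "_"
#     before the sentence is tokenized;
#   - the remaining tasks are tokenized directly as single sentences or sentence pairs.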
+ +import numpy as np +from paddle.metric import Accuracy + +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + PPMiniLMForSequenceClassification, + PPMiniLMTokenizer, +) + +MODEL_CLASSES = { + "ppminilm": (PPMiniLMForSequenceClassification, PPMiniLMTokenizer), + "bert": (BertForSequenceClassification, BertTokenizer), +} + +METRIC_CLASSES = { + "afqmc": Accuracy, + "tnews": Accuracy, + "iflytek": Accuracy, + "ocnli": Accuracy, + "cmnli": Accuracy, + "cluewsc2020": Accuracy, + "csl": Accuracy, +} + + +def convert_example(example, label_list, tokenizer=None, is_test=False, max_seq_length=512, **kwargs): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + # Get the label + example["label"] = np.array(example["label"], dtype="int64") + label = example["label"] + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + if tokenizer is None: + return example + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] diff --git a/examples/model_compression/pp-minilm/deploy/python/infer.py b/examples/model_compression/pp-minilm/deploy/python/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..15b497df8b24e3814da0bf06aab43ff61b342467 --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/python/infer.py @@ -0,0 +1,344 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
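+
+# Paddle Inference deployment script: builds a predictor (optionally with
+# TensorRT / INT8 and dynamic-shape settings), runs the CLUE dev set and
+# reports accuracy, or end-to-end prediction time when --perf is passed.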
+ +import argparse +import sys +import time +from functools import partial + +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool + +sys.path.append("../../") +from data import METRIC_CLASSES, MODEL_CLASSES, convert_example # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default="afqmc", + type=str, + help="The name of the task to perform predict, selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default="ppminilm", + type=str, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default="ppminilm-6l-768h", + type=str, + help="The directory or name of model.", + ) + parser.add_argument( + "--model_path", + default="./quant_models/model", + type=str, + required=True, + help="The path prefix of inference model to be used.", + ) + parser.add_argument( + "--device", + default="gpu", + choices=["gpu", "cpu", "xpu"], + help="Device selected for inference.", + ) + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size for predict.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--perf_warmup_steps", + default=20, + type=int, + help="Warmup steps for performance test.", + ) + parser.add_argument( + "--use_trt", + action="store_true", + help="Whether to use inference engin TensorRT.", + ) + parser.add_argument( + "--perf", + action="store_true", + help="Whether to test performance.", + ) + parser.add_argument( + "--collect_shape", + action="store_true", + help="Whether collect shape range info.", + ) + parser.add_argument( + "--use_faster_tokenizer", + type=strtobool, + default=True, + help="Whether to use FasterTokenizer to accelerate training or further inference.", + ) + parser.add_argument( + "--int8", + action="store_true", + help="Whether to use int8 inference.", + ) + args = parser.parse_args() + return args + + +@paddle.no_grad() +def evaluate(outputs, metric, data_loader): + metric.reset() + for i, batch in enumerate(data_loader): + input_ids, segment_ids, labels = batch + logits = paddle.to_tensor(outputs[i][0]) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("acc: %s, " % res, end="") + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + config = paddle.inference.Config(args.model_path + ".pdmodel", args.model_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + cls.device = paddle.set_device("gpu") + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + cls.device = paddle.set_device("cpu") + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + if args.use_trt: + if args.int8: + config.enable_tensorrt_engine( + workspace_size=1 << 30, + 
precision_mode=inference.PrecisionType.Int8, + max_batch_size=args.batch_size, + min_subgraph_size=5, + use_static=False, + use_calib_mode=False, + ) + else: + config.enable_tensorrt_engine( + workspace_size=1 << 30, + precision_mode=inference.PrecisionType.Float32, + max_batch_size=args.batch_size, + min_subgraph_size=5, + use_static=False, + use_calib_mode=False, + ) + print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled())) + # Set min/max/opt tensor shape of each trt subgraph input according + # to dataset. + # For example, the config of TNEWS data should be 16, 32, 32, 31, 128, 32. + min_batch_size, max_batch_size, opt_batch_size = 1, 32, 32 + min_seq_len, max_seq_len, opt_seq_len = 1, 128, 32 + if args.use_faster_tokenizer: + min_input_shape = { + "faster_tokenizer_1.tmp_0": [min_batch_size, min_seq_len], + "faster_tokenizer_1.tmp_1": [min_batch_size, min_seq_len], + "tmp_4": [min_batch_size, min_seq_len], + "unsqueeze2_0.tmp_0": [min_batch_size, 1, 1, min_seq_len], + } + max_input_shape = { + "faster_tokenizer_1.tmp_0": [max_batch_size, max_seq_len], + "faster_tokenizer_1.tmp_1": [max_batch_size, max_seq_len], + "tmp_4": [max_batch_size, max_seq_len], + "unsqueeze2_0.tmp_0": [max_batch_size, 1, 1, max_seq_len], + } + opt_input_shape = { + "faster_tokenizer_1.tmp_0": [opt_batch_size, opt_seq_len], + "faster_tokenizer_1.tmp_1": [opt_batch_size, opt_seq_len], + "tmp_4": [opt_batch_size, opt_seq_len], + "unsqueeze2_0.tmp_0": [opt_batch_size, 1, 1, opt_seq_len], + } + else: + min_input_shape = { + "input_ids": [min_batch_size, min_seq_len], + "token_type_ids": [min_batch_size, min_seq_len], + "tmp_4": [min_batch_size, min_seq_len], + "unsqueeze2_0.tmp_0": [min_batch_size, 1, 1, min_seq_len], + } + max_input_shape = { + "input_ids": [max_batch_size, max_seq_len], + "token_type_ids": [max_batch_size, max_seq_len], + "tmp_4": [max_batch_size, max_seq_len], + "unsqueeze2_0.tmp_0": [max_batch_size, 1, 1, max_seq_len], + } + opt_input_shape = { + "input_ids": [opt_batch_size, opt_seq_len], + "token_type_ids": [opt_batch_size, opt_seq_len], + "tmp_4": [opt_batch_size, opt_seq_len], + "unsqueeze2_0.tmp_0": [opt_batch_size, 1, 1, opt_seq_len], + } + config.set_trt_dynamic_shape_info(min_input_shape, max_input_shape, opt_input_shape) + + predictor = paddle.inference.create_predictor(config) + + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + return cls(predictor, input_handles, output_handles) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def faster_predict(self, dataset, args): + batch_num = 0 + if "sentence" in dataset[0]: + data = [example["sentence"] for example in dataset] + batches = [data[idx : idx + args.batch_size] for idx in range(0, len(data), args.batch_size)] + batch_num = len(batches) + else: + data1 = [example["sentence1"] for example in dataset] + data2 = [example["sentence2"] for example in dataset] + batches1 = [data1[idx : idx + args.batch_size] for idx in range(0, len(data1), args.batch_size)] + batches2 = [data2[idx : idx + args.batch_size] for idx in range(0, len(data1), args.batch_size)] + batch_num = len(batches1) + if args.perf: + for i in range(batch_num): + if "sentence" in dataset[0]: + output = 
self.predict_batch([batches[i]]) + else: + output = self.predict_batch([batches1[i], batches2[i]]) + if i > args.perf_warmup_steps: + break + time1 = time.time() + if "sentence" in dataset[0]: + for i in range(batch_num): + output = self.predict_batch([batches[i]]) + else: + for i in range(batch_num): + output = self.predict_batch([batches1[i], batches2[i]]) + print("task name: %s, time: %s, " % (args.task_name, time.time() - time1)) + return output + + else: + labels = [example["label"] for example in dataset] + + batched_labels = [labels[idx : idx + args.batch_size] for idx in range(0, len(labels), args.batch_size)] + metric = METRIC_CLASSES[args.task_name]() + metric.reset() + + for i in range(batch_num): + if "sentence" in dataset[0]: + logits = self.predict_batch([batches[i]]) + else: + logits = self.predict_batch([batches1[i], batches2[i]]) + correct = metric.compute(paddle.to_tensor(logits), paddle.to_tensor(batched_labels[i])) + metric.update(correct) + + res = metric.accumulate() + print("task name: %s, acc: %s, " % (args.task_name, res), end="") + + def convert_predict_batch(self, args, data, tokenizer, batchify_fn, label_list): + examples = [] + for example in data: + example = convert_example(example, label_list, tokenizer, max_seq_length=args.max_seq_length) + examples.append(example) + + return examples + + def predict(self, dataset, tokenizer, batchify_fn, args): + batches = [dataset[idx : idx + args.batch_size] for idx in range(0, len(dataset), args.batch_size)] + if args.perf: + for i, batch in enumerate(batches): + examples = self.convert_predict_batch(args, batch, tokenizer, batchify_fn, dataset.label_list) + input_ids, segment_ids, label = batchify_fn(examples) + output = self.predict_batch([input_ids, segment_ids]) + if i > args.perf_warmup_steps: + break + time1 = time.time() + for batch in batches: + examples = self.convert_predict_batch(args, batch, tokenizer, batchify_fn, dataset.label_list) + input_ids, segment_ids, _ = batchify_fn(examples) + output = self.predict_batch([input_ids, segment_ids]) + + print("task name: %s, time: %s, " % (args.task_name, time.time() - time1)) + + else: + metric = METRIC_CLASSES[args.task_name]() + metric.reset() + for i, batch in enumerate(batches): + examples = self.convert_predict_batch(args, batch, tokenizer, batchify_fn, dataset.label_list) + input_ids, segment_ids, label = batchify_fn(examples) + output = self.predict_batch([input_ids, segment_ids]) + correct = metric.compute(paddle.to_tensor(output), paddle.to_tensor(label)) + metric.update(correct) + + res = metric.accumulate() + print("task name: %s, acc: %s, " % (args.task_name, res), end="") + + +def main(): + paddle.seed(42) + args = parse_args() + + args.task_name = args.task_name.lower() + args.model_type = args.model_type.lower() + + predictor = Predictor.create_predictor(args) + + _, tokenizer_class = MODEL_CLASSES[args.model_type] + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + + if not args.use_faster_tokenizer: + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + else: + trans_func = partial(convert_example, label_list=dev_ds.label_list, is_test=False) + dev_ds = dev_ds.map(trans_func, lazy=True) + if not args.use_faster_tokenizer: + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + Stack(dtype="int64" if dev_ds.label_list else "float32"), # label + ): fn(samples) + predictor.predict(dev_ds, tokenizer, batchify_fn, args) + 
else: + predictor.faster_predict(dev_ds, args=args) + + +if __name__ == "__main__": + main() diff --git a/examples/model_compression/pp-minilm/deploy/python/infer_all.sh b/examples/model_compression/pp-minilm/deploy/python/infer_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..9a069793d8778457ce7a33d914769ddbd2b8a26e --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/python/infer_all.sh @@ -0,0 +1,26 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +for task in AFQMC TNEWS IFLYTEK CMNLI OCNLI CLUEWSC2020 CSL +do + for bs in 4 8 + do + for algo in abs_max avg hist mse + do + python infer.py --task_name ${task} --model_path ../quantization/${task}_quant_models/${algo}${bs}/int8 --int8 --use_trt + echo this is ${task}, ${algo}, ${bs} + done + done +done diff --git a/examples/model_compression/pp-minilm/deploy/python/infer_perf.sh b/examples/model_compression/pp-minilm/deploy/python/infer_perf.sh new file mode 100644 index 0000000000000000000000000000000000000000..c8469d9d9117359df4588d7d6e015a8c42f0904c --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/python/infer_perf.sh @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+export task=TNEWS
+
+echo Inference of original FP32 model
+for ((i=0;i<=4;i++));
+do
+    python infer.py --task_name ${task} --model_path ../finetuning/ppminilm-6l-768h/models/${task}/1e-4_64/inference --use_trt --perf
+done
+
+echo After pruning
+for ((i=0;i<=4;i++));
+do
+    python infer.py --task_name ${task} --model_path ../pruning/pruned_models/${task}/0.75/sub_static/float --use_trt --perf
+done
+
+echo After quantization
+for ((i=0;i<=4;i++));
+do
+    python infer.py --task_name ${task} --model_path ../quantization/${task}_quant_models/mse4/int8 --int8 --use_trt --perf
+done
+
diff --git a/examples/model_compression/pp-minilm/deploy/serving/README.md b/examples/model_compression/pp-minilm/deploy/serving/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..3fed44ed9d92f1152630b954eb0a5a6f89ca426d
--- /dev/null
+++ b/examples/model_compression/pp-minilm/deploy/serving/README.md
@@ -0,0 +1,82 @@
+# PP-MiniLM 使用 Paddle Serving 进行服务化部署
+
+Paddle Serving 可以实现在服务器端部署推理模型,客户端远程通过 RPC/HTTP 方式发送数据进行推理,实现模型推理的服务化,下面以 RPC 方式为例进行说明。
+
+## 前提条件
+准备好 Inference 模型,需要 2 个文件:
+| 文件 | 说明 |
+|-------------------------------|----------------------------------------|
+| ppminilm.pdiparams | 模型权重文件,供推理时加载使用 |
+| ppminilm.pdmodel | 模型结构文件,供推理时加载使用 |
+
+假设这 2 个文件已生成,并放在目录 `$MODEL_DIR` 下。
+
+## 环境要求
+
+使用 Paddle Serving 需要在服务器端安装相关模块,需要 v0.8.0 之后的版本:
+```shell
+pip install paddle-serving-app paddle-serving-client paddle-serving-server
+```
+
+如果服务器端可以使用 GPU 进行推理,则安装 server 的 gpu 版本,安装时要注意参考服务器当前 CUDA、TensorRT 的版本来安装对应的版本:[Serving readme](https://github.com/PaddlePaddle/Serving/tree/v0.8.0)
+
+```shell
+pip install paddle-serving-app paddle-serving-client paddle-serving-server-gpu
+```
+
+还需要在客户端安装相关模块,也需要 v0.8.0 之后的版本:
+```shell
+pip install paddle-serving-app paddle-serving-client
+```
+
+## 从 Inference 模型生成 Serving 模型和配置
+
+以前提条件中准备好的 Inference 模型 `ppminilm.pdmodel`、`ppminilm.pdiparams` 为例:
+
+```shell
+python export_to_serving.py \
+    --dirname ${MODEL_DIR} \
+    --model_filename ppminilm.pdmodel \
+    --params_filename ppminilm.pdiparams \
+    --server_path serving_server \
+    --client_path serving_client \
+    --fetch_alias_names logits
+```
+
+其中参数释义如下:
+- `dirname` : 表示 Inference 推理模型所在目录,这里是位于 `${MODEL_DIR}`。
+- `model_filename` : 表示推理需要加载的模型结构文件。例如前提中得到的 `ppminilm.pdmodel`。如果设置为 `None` ,则使用 `__model__` 作为默认的文件名。
+- `params_filename` : 表示推理需要加载的模型权重文件。例如前提中得到的 `ppminilm.pdiparams`。
+- `server_path`: 转换后的模型文件和配置文件的存储路径。默认值为 serving_server。
+- `client_path`: 转换后的客户端配置文件存储路径。默认值为 serving_client。
+- `fetch_alias_names`: 模型输出的别名设置,比如输出的 logits 等,都可以指定成其他名字,默认不指定。
+- `feed_alias_names`: 模型输入的别名设置,比如输入的 input_ids 等,都可以重新指定成其他名字,默认不指定。
+
+执行命令后,会在当前目录下生成 2 个目录:serving_server 和 serving_client。serving_server 目录包含服务器端所需的模型和配置,需将其拷贝到服务器端;serving_client 目录包含客户端所需的配置,需将其拷贝到客户端。
+
+
+## 配置 config 文件
+
+在启动预测之前,需要按照自己的情况修改 config 文件中的配置,主要需要修改的配置释义如下:
+
+- `rpc_port` : rpc 端口。
+- `device_type` : 0 代表 CPU, 1 代表 GPU, 2 代表 TensorRT, 3 代表 Arm CPU, 4 代表 Kunlun XPU。
+- `devices` : 计算硬件 ID,当 devices 为 "" 或不写时,为 CPU 预测;当 devices 为 "0"、"0,1,2" 时为 GPU 预测。
+- `fetch_list` : fetch 结果列表,以 client_config 中 fetch_var 的 alias_name 为准, 如果没有设置则全部返回。
+- `model_config` : 模型路径。
+
+## 启动 server
+
+在服务器端容器中,使用上一步得到的 serving_server 目录启动 server:
+
+```shell
+python web_service.py
+```
+
+## 启动 client 发起推理请求
+在客户端容器中,使用前面得到的 serving_client 目录启动 client 发起 RPC 推理请求。从命令行读取输入数据发起推理请求:
+
+```shell
+python rpc_client.py
+```
diff --git a/examples/model_compression/pp-minilm/deploy/serving/config_nlp.yml 
b/examples/model_compression/pp-minilm/deploy/serving/config_nlp.yml new file mode 100644 index 0000000000000000000000000000000000000000..6356840acb25554e6320f21523c9a9ba20b91850 --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/serving/config_nlp.yml @@ -0,0 +1,34 @@ +# worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +# 当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 +# build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + # op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + # 使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + tracer: + interval_s: 10 +# http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 18083 +# 18082 +# rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 8091 +op: + ppminilm: + # 并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 1 + # 当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + # client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + # 计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: '0' + # Fetch结果列表,以client_config中fetch_var的alias_name为准, 如果没有设置则全部返回 + fetch_list: ['logits'] + # 模型路径 + model_config: ./serving_server/ + diff --git a/examples/model_compression/pp-minilm/deploy/serving/export_to_serving.py b/examples/model_compression/pp-minilm/deploy/serving/export_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..79786ee23942a9ffce5ce8eb01aa0a8282e72219 --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/serving/export_to_serving.py @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +import paddle_serving_client.io as serving_io + +parser = argparse.ArgumentParser() +parser.add_argument( + "--dirname", + type=str, + required=True, + default="./output", + help="Path of saved model files. Program file and parameter files are saved in this directory.", +) +parser.add_argument( + "--model_filename", + type=str, + required=True, + default="inference.get_pooled_embedding.pdmodel", + help="The name of file to load the inference program. If it is None, the default filename __model__ will be used.", +) +parser.add_argument( + "--params_filename", + type=str, + required=True, + default="inference.get_pooled_embedding.pdiparams", + help="The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. 
Default: None.", +) +parser.add_argument( + "--server_path", + type=str, + default="./serving_server", + help="The path of server parameter in static graph to be saved.", +) +parser.add_argument( + "--client_path", + type=str, + default="./serving_client", + help="The path of client parameter in static graph to be saved.", +) +parser.add_argument( + "--feed_alias_names", + type=str, + default=None, + help="set alias names for feed vars, split by comma ',', you should run --show_proto to check the number of feed vars", +) +parser.add_argument( + "--fetch_alias_names", + type=str, + default=None, + help="set alias names for feed vars, split by comma ',', you should run --show_proto to check the number of fetch vars", +) +parser.add_argument( + "--show_proto", + type=bool, + default=False, + help="If yes, you can preview the proto and then determine your feed var alias name and fetch var alias name.", +) + +if __name__ == "__main__": + paddle.enable_static() + args = parser.parse_args() + feed_names, fetch_names = serving_io.inference_model_to_serving( + dirname=args.dirname, + serving_server=args.server_path, + serving_client=args.client_path, + model_filename=args.model_filename, + params_filename=args.params_filename, + show_proto=args.show_proto, + feed_alias_names=args.feed_alias_names, + fetch_alias_names=args.fetch_alias_names, + ) diff --git a/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py b/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..c975d13265e7f90d0f78af753a46be38930412f2 --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py @@ -0,0 +1,32 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddle_serving_server.pipeline import PipelineClient +import numpy as np + +client = PipelineClient() +client.connect(["127.0.0.1:8091"]) + +list_data = ["国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据", "试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比"] +feed = {} +for i, item in enumerate(list_data): + feed[str(i)] = item + +print(feed) +ret = client.predict(feed_dict=feed) +print(ret) +result = np.array(eval(ret.value[0])) +print(ret.key) +print(result.shape) +print(result) diff --git a/examples/model_compression/pp-minilm/deploy/serving/web_service.py b/examples/model_compression/pp-minilm/deploy/serving/web_service.py new file mode 100644 index 0000000000000000000000000000000000000000..01da8bc3b0f3c9999442f0bdec17a6923e6eb8eb --- /dev/null +++ b/examples/model_compression/pp-minilm/deploy/serving/web_service.py @@ -0,0 +1,46 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging + +from paddle_serving_server.web_service import Op, WebService + +_LOGGER = logging.getLogger() + + +class PPMiniLMOp(Op): + def init_op(self): + pass + + def preprocess(self, input_dicts, data_id, log_id): + ((_, input_dict),) = input_dicts.items() + feed_dict = {} + feed_dict["text"] = list(input_dict.values()) + return feed_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + new_dict = {} + new_dict["logits"] = str(fetch_dict["logits"].tolist()) + return new_dict, None, "" + + +class PPMiniLMService(WebService): + def get_pipeline_response(self, read_op): + ppminilm_op = PPMiniLMOp(name="ppminilm", input_ops=[read_op]) + return ppminilm_op + + +ppminilm_service = PPMiniLMService(name="ppminilm") +ppminilm_service.prepare_pipeline_config("config_nlp.yml") +ppminilm_service.run_service() diff --git a/examples/model_compression/pp-minilm/finetuning/export_model.py b/examples/model_compression/pp-minilm/finetuning/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..b7e01ca67bab3c5900c07bebd754770ebfd3de00 --- /dev/null +++ b/examples/model_compression/pp-minilm/finetuning/export_model.py @@ -0,0 +1,80 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
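+
+# Exports a fine-tuned PPMiniLMForSequenceClassification checkpoint to a
+# static-graph inference model via paddle.jit.to_static and paddle.jit.save.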
+import argparse +import os +import sys + +import paddle + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import PPMiniLMForSequenceClassification + +sys.path.append("../") +from data import METRIC_CLASSES # noqa: E402 + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_path", + default="best_clue_model", + type=str, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--save_inference_model_with_tokenizer", + type=strtobool, + default=True, + help="Whether to save inference model with tokenizer.", + ) + + args = parser.parse_args() + return args + + +def do_export(args): + save_path = os.path.join(os.path.dirname(args.model_path), "inference") + model = PPMiniLMForSequenceClassification.from_pretrained(args.model_path) + args.task_name = args.task_name.lower() + + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + ] + model = paddle.jit.to_static(model, input_spec=input_spec) + + paddle.jit.save(model, save_path) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_export(args) diff --git a/examples/model_compression/pp-minilm/finetuning/run_all_search.sh b/examples/model_compression/pp-minilm/finetuning/run_all_search.sh new file mode 100644 index 0000000000000000000000000000000000000000..c09a288a2fadee454a3bda186724fd391db0db69 --- /dev/null +++ b/examples/model_compression/pp-minilm/finetuning/run_all_search.sh @@ -0,0 +1,42 @@ +# $1 means GENERAL_DIR +mkdir -p $1/afqmc +mkdir -p $1/tnews +mkdir -p $1/ifly +mkdir -p $1/ocnli +mkdir -p $1/cmnli +mkdir -p $1/wsc +mkdir -p $1/csl + +# The penultimate parameter is the card id, this script can be changed if necessary +bash run_one_search.sh $1 afqmc 0 & +bash run_one_search.sh $1 tnews 1 & +bash run_one_search.sh $1 ifly 2 & +bash run_one_search.sh $1 ocnli 3 & +bash run_one_search.sh $1 csl 4 & +bash run_one_search.sh $1 wsc 5 & + +# Because the CMNLI data set is significantly larger than other data sets, +# It needs to be placed on different cards. 
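+# Each run below covers one (learning rate, batch size) combination on its own
+# card; the positional arguments follow run_clue.sh:
+# TASK_NAME LR BATCH_SIZE EPOCHS MAX_SEQ_LEN CARD_ID MODEL_PATH.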
+lr=1e-4 +bs=16 +sh run_clue.sh CMNLI $lr $bs 3 128 0 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=32 +sh run_clue.sh CMNLI $lr $bs 3 128 1 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=64 +sh run_clue.sh CMNLI $lr $bs 3 128 2 $1 > $1/cmnli/${lr}_${bs}_3_128.log & + +lr=5e-5 +bs=16 +sh run_clue.sh CMNLI $lr $bs 3 128 3 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=32 +sh run_clue.sh CMNLI $lr $bs 3 128 4 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=64 +sh run_clue.sh CMNLI $lr $bs 3 128 5 $1 > $1/cmnli/${lr}_${bs}_3_128.log & + +lr=3e-5 +bs=16 +sh run_clue.sh CMNLI $lr $bs 3 128 6 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=32 +sh run_clue.sh CMNLI $lr $bs 3 128 5 $1 > $1/cmnli/${lr}_${bs}_3_128.log & +bs=64 +sh run_clue.sh CMNLI $lr $bs 3 128 7 $1 > $1/cmnli/${lr}_${bs}_3_128.log & diff --git a/examples/model_compression/pp-minilm/finetuning/run_clue.py b/examples/model_compression/pp-minilm/finetuning/run_clue.py new file mode 100644 index 0000000000000000000000000000000000000000..897edc3089f6c434293cc0df9962ed33c9663d87 --- /dev/null +++ b/examples/model_compression/pp-minilm/finetuning/run_clue.py @@ -0,0 +1,335 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import logging +import math +import os +import random +import sys +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.io import DataLoader + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import LinearDecayWithWarmup + +sys.path.append("../") +from data import METRIC_CLASSES, MODEL_CLASSES, convert_example # noqa: E402 + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default="best_clue_model", + type=str, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--do_train", type=strtobool, default=True, help="Whether do train.") + parser.add_argument("--do_eval", type=strtobool, default=False, help="Whether do train.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + return res + + +def do_eval(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, label_list=dev_ds.label_list, tokenizer=tokenizer, max_seq_length=args.max_seq_length + ) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if dev_ds.label_list else "float32"), # label + ): fn(samples) + + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if dev_ds.label_list is None else len(dev_ds.label_list) + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + metric = metric_class() + model.eval() + metric.reset() + for batch in dev_data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + print("acc: %s\n, " % (res), end="") + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds, dev_ds = load_dataset("clue", args.task_name, splits=("train", "dev")) + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, label_list=train_ds.label_list, tokenizer=tokenizer, max_seq_length=args.max_seq_length + ) + + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, 
batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + best_acc = 0.0 + global_step = 0 + tic_train = time.time() + for epoch in range(num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + acc = evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if acc > best_acc: + best_acc = acc + output_dir = args.output_dir + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + print("best_acc: ", best_acc) + return + print("best_acc: ", best_acc) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + if args.do_train: + do_train(args) + if args.do_eval: + do_eval(args) diff --git a/examples/model_compression/pp-minilm/finetuning/run_clue.sh b/examples/model_compression/pp-minilm/finetuning/run_clue.sh new 
file mode 100644 index 0000000000000000000000000000000000000000..de7f577ee6af7e3b3effefdc632478c3c493e460 --- /dev/null +++ b/examples/model_compression/pp-minilm/finetuning/run_clue.sh @@ -0,0 +1,25 @@ + +export TASK_NAME=$1 +export LR=$2 +export BS=$3 +export EPOCH=$4 +export MAX_SEQ_LEN=$5 +export CUDA_VISIBLE_DEVICES=$6 +export MODEL_PATH=$7 + +python -u ./run_clue.py \ + --model_type ppminilm \ + --model_name_or_path ${MODEL_PATH} \ + --task_name ${TASK_NAME} \ + --max_seq_length ${MAX_SEQ_LEN} \ + --batch_size ${BS} \ + --learning_rate ${LR} \ + --num_train_epochs ${EPOCH} \ + --logging_steps 100 \ + --seed 42 \ + --save_steps 100 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --adam_epsilon 1e-8 \ + --output_dir ${MODEL_PATH}/models/${TASK_NAME}/${LR}_${BS}/ \ + --device gpu \ diff --git a/examples/model_compression/pp-minilm/finetuning/run_one_search.sh b/examples/model_compression/pp-minilm/finetuning/run_one_search.sh new file mode 100644 index 0000000000000000000000000000000000000000..fbb5261d2f31136ec988b7d7e922440ff40ee1e9 --- /dev/null +++ b/examples/model_compression/pp-minilm/finetuning/run_one_search.sh @@ -0,0 +1,55 @@ +OUTPUT_DIR=$1 +TASK_NAME=$2 + +mkdir ${OUTPUT_DIR}/afqmc +mkdir ${OUTPUT_DIR}/tnews +mkdir ${OUTPUT_DIR}/ifly +mkdir ${OUTPUT_DIR}/ocnli +mkdir ${OUTPUT_DIR}/wsc +mkdir ${OUTPUT_DIR}/csl +mkdir ${OUTPUT_DIR}/cmnli + + +for lr in 1e-4 5e-5 3e-5 +do + for bs in 16 32 64 + do + echo bs: $bs, lr: $lr + + if [ $TASK_NAME == afqmc ] + then + sh run_clue.sh AFQMC $lr $bs 3 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/afqmc/${lr}_${bs}_3_128.log + fi + + if [ $TASK_NAME == tnews ] + then + sh run_clue.sh TNEWS $lr $bs 3 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/tnews/${lr}_${bs}_3_128.log + fi + + if [ $TASK_NAME == ifly ] + then + sh run_clue.sh IFLYTEK $lr $bs 6 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/ifly/${lr}_${bs}_6_128.log + fi + + if [ $TASK_NAME == ocnli ] + then + sh run_clue.sh OCNLI $lr $bs 6 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/ocnli/${lr}_${bs}_6_128.log + fi + + if [ $TASK_NAME == wsc ] + then + sh run_clue.sh CLUEWSC2020 $lr $bs 50 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/wsc/${lr}_${bs}_50_128.log + fi + + if [ $TASK_NAME == csl ] + then + sh run_clue.sh CSL $lr $bs 8 256 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/csl/${lr}_${bs}_8_256.log + fi + + if [ $TASK_NAME == cmnli ] + then + sh run_clue.sh CMNLI $lr $bs 3 128 $3 ${OUTPUT_DIR} > ${OUTPUT_DIR}/cmnli/${lr}_${bs}_3_128.log + fi + done +done + diff --git a/examples/model_compression/pp-minilm/general_distill/README.md b/examples/model_compression/pp-minilm/general_distill/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c4be10fb4d562cd37554da93d2eef547ae5a70e3 --- /dev/null +++ b/examples/model_compression/pp-minilm/general_distill/README.md @@ -0,0 +1,64 @@ +# PP-MiniLM 任务无关蒸馏 + +## 环境要求 + +本实验基于 NVIDIA Tesla V100 32G 8 卡进行,训练周期约为 2-3 天。 + +## 原理介绍 + +任务无关知识蒸馏是用较大(层数更多、宽度更宽的)的基于 Transformer Layer 的预训练模型对较小(层数更少、宽度更窄的)的基于 Transformer Layer 的预训练模型进行蒸馏,从而得到更小、效果与较大模型更接近的预训练模型。 + +PP-MiniLM 参考了 MiniLMv2 提出的 Multi-Head Self-Attention Relation Distillation 蒸馏策略。MiniLMv2 算法是用 24 层 large-size 的教师模型倒数几层的 Q-Q、K-K、V-V 之间的relation对6层学生模型最后一层 Q-Q、K-K、V-V 之间的 relation 进行蒸馏。具体的做法是,首先将学生、教师用于蒸馏的层上的 Q、K、V 的 Head 数进行统一,然后计算各自 Q-Q、K-K、V-V 的点积,最后对教师和学生的点积计算KL散度损失。由于 relation 的 shape 是 `[batch_size, head_num, seq_len, seq_len]`,因此可以认为这里的relation是一种Token与Token之间的关系。 + +本方案在 MiniLMv2 策略的基础上,做了进一步优化: 通过引入多视角的注意力关系知识来进一步提升模型效果。MiniLMv2 的自注意力关系知识仅建模了 Token 与 Token 之间的关系,PP-MiniLM 
在此基础上额外引入了样本与样本间的自注意力关系知识,也就是挖掘出更多教师模型所蕴含的知识,从而进一步优化模型效果。
+
+具体来说,PP-MiniLM 利用了 `roberta-wwm-ext-large` 第 20 层的 Q-Q、K-K、V-V 之间的 Sample 与 Sample 之间的关系,对 6 层学生模型 PP-MiniLM 第 6 层的 Q-Q、K-K、V-V 之间的 Sample 与 Sample 之间的关系进行蒸馏。与 MiniLMv2 不同的是,PP-MiniLM 的策略需要在统一 Q、K、V 的 Head 数之后,将 Q、K、V 转置为 `[seq_len, head_num, batch_size, head_dim]`,这样 Q-Q、K-K、V-V 的点积就可以表达样本间的关系。经过我们的实验,这种方法在 CLUE 上的平均准确率比使用原始 MiniLMv2 算法高 0.36。
+
+### 数据介绍
+
+任务无关知识蒸馏的训练数据一般是预训练语料,可以使用公开的预训练语料 [CLUECorpus2020](https://github.com/CLUEbenchmark/CLUECorpus2020/)。需要将数据处理成一行一个句子的格式,再将数据文件分割成多个子文件(例如 64 个),放在同一个目录下。
+
+### 运行方式
+
+```shell
+sh run.sh # 包含 general_distill.py 的运行配置
+cd ..
+```
+
+其中 `general_distill.py` 参数释义如下:
+
+- `model_type` 指示了学生模型类型,当前仅支持 'ppminilm'、'roberta'。
+- `num_relation_heads` relation head 的个数,一般对于 large-size 的教师模型是 64,对于 base-size 的教师模型是 48。
+- `teacher_model_type` 指示了教师模型类型,当前仅支持 'roberta'。
+- `teacher_layer_index` 蒸馏时使用的教师模型的层。
+- `student_layer_index` 蒸馏时使用的学生模型的层。
+- `teacher_model_name_or_path` 教师模型的名称,例如 `'roberta-wwm-ext-large'`。
+- `max_seq_length` 最大的样本长度。
+- `num_layers` 学生模型的层数,目前仅支持 2、4、6。
+- `logging_steps` 日志间隔。
+- `max_steps` 最大迭代次数。
+- `warmup_steps` 学习率增长到 `learning_rate` 所需要的步数。
+- `save_steps` 保存模型的间隔步数。
+- `weight_decay` 表示 AdamW 优化器中使用的 weight_decay 的系数。
+- `output_dir` 训练相关文件以及模型保存的输出路径。
+- `device` 设备选择,推荐使用 gpu。
+- `input_dir` 训练数据目录。
+- `use_amp` 是否使用混合精度训练,默认 False。
+- `alpha` head 间关系的权重,默认 0.0。
+- `beta` 样本间关系的权重,默认 0.0。
+
+将最终得到的模型绝对路径保存至 `$GENERAL_MODEL_DIR`,例如:
+
+```shell
+GENERAL_MODEL_DIR=PaddleNLP/examples/model_compression/PP-MiniLM/general_distill/pretrain/model_400000
+```
+
+## 模型精度
+
+在 CLUE 各数据集上经过超参寻优后,各个任务上的最高准确率如下表:
+
+| AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL   | Avg   |
+| ----- | ----- | ------- | ----- | ----- | ----------- | ----- | ----- |
+| 74.14 | 57.43 | 61.75   | 81.01 | 76.17 | 86.18       | 79.17 | 73.69 |
diff --git a/examples/model_compression/pp-minilm/general_distill/general_distill.py b/examples/model_compression/pp-minilm/general_distill/general_distill.py
new file mode 100644
index 0000000000000000000000000000000000000000..83ceed266944e64e92d65fdd6783de77d0666c67
--- /dev/null
+++ b/examples/model_compression/pp-minilm/general_distill/general_distill.py
@@ -0,0 +1,455 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
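+
+# Task-agnostic distillation script: a PP-MiniLM student is trained to match
+# the Q-Q / K-K / V-V self-attention relations of one layer of a RoBERTa
+# teacher (see calc_multi_relation_loss) on plain pre-training text.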
+ +import argparse +import os +import random +import time +from concurrent.futures import ThreadPoolExecutor + +import numpy as np +import paddle +from paddle.io import DataLoader + +from paddlenlp.data import Pad, Tuple +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + PPMiniLMForSequenceClassification, + PPMiniLMModel, + PPMiniLMTokenizer, + RobertaModel, + RobertaTokenizer, +) +from paddlenlp.transformers.distill_utils import calc_multi_relation_loss, to_distill +from paddlenlp.utils.log import logger +from paddlenlp.utils.tools import TimeCostAverage + +MODEL_CLASSES = { + "roberta": (RobertaModel, RobertaTokenizer), + "ppminilm": (PPMiniLMForSequenceClassification, PPMiniLMTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default="ppminilm", + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--teacher_model_type", + default="roberta", + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--student_model_name_or_path", + default=None, + type=str, + required=False, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--teacher_model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model." + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=6e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_layers", + default=6, + type=int, + help="Number layers of student model.", + ) + parser.add_argument( + "--teacher_layer_index", + default=19, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--student_layer_index", + default=5, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=512, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--num_relation_heads", + default=64, + type=int, + help="The number of relation heads is 48 and 64 for base and large-size teacher model.", + ) + parser.add_argument("--beta", default=0.0, type=float, help="0.0 usually") + parser.add_argument("--alpha", default=0.0, type=float, help="0.0 usually") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=-1, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.01, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=400000, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu." + ) + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +def create_pretraining_dataset(input_file, shared_list, args, worker_init, tokenizer): + train_data = PretrainingDataset(input_file=input_file, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + # files have been sharded, no need to dispatch again + train_batch_sampler = paddle.io.BatchSampler(train_data, batch_size=args.batch_size, shuffle=True) + + # DataLoader cannot be pickled because of its place. + # If it can be pickled, use global function instead of lambda and use + # ProcessPoolExecutor instead of ThreadPoolExecutor to prefetch. 
+ batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_data, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + worker_init_fn=worker_init, + return_list=True, + ) + return train_data_loader, input_file + + +class PretrainingDataset(paddle.io.Dataset): + def __init__(self, input_file, tokenizer, max_seq_length): + self.input_file = input_file + f = open(input_file, "r") + input_ids = [] + for i, line in enumerate(f): + line = line[:max_seq_length] + tokenized_example = tokenizer(line, max_seq_len=max_seq_length) + input_ids.append(tokenized_example["input_ids"]) + + self.inputs = np.asarray(input_ids) + f.close() + + def __len__(self): + "Denotes the total number of samples" + return len(self.inputs) + + def __getitem__(self, index): + input_ids = [np.asarray(self.inputs[index])] + return input_ids + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank()) + args.model_type = args.model_type.lower() + + # For teacher + teacher_model_class, tokenizer_class = MODEL_CLASSES[args.teacher_model_type] + tokenizer = tokenizer_class.from_pretrained(args.teacher_model_name_or_path) + + # For student + model_class, _ = MODEL_CLASSES[args.model_type] + if args.num_layers == 6: + ppminilm = PPMiniLMModel( + vocab_size=tokenizer.vocab_size, + num_hidden_layers=6, + hidden_act="relu", + intermediate_size=3072, + hidden_size=768, + ) # layer: 6 + elif args.num_layers == 4: + ppminilm = PPMiniLMModel( + vocab_size=tokenizer.vocab_size, + num_hidden_layers=4, + hidden_act="relu", + intermediate_size=1024, + hidden_size=256, + num_attention_heads=16, + ) # layer: 4 + else: + ppminilm = PPMiniLMModel( + vocab_size=tokenizer.vocab_size, + num_hidden_layers=2, + hidden_act="relu", + hidden_size=128, + intermediate_size=512, + ) # layer: 2 + student = model_class(ppminilm) + + teacher = teacher_model_class.from_pretrained(args.teacher_model_name_or_path) + pad_token_id = 0 + + if paddle.distributed.get_world_size() > 1: + student = paddle.DataParallel(student, find_unused_parameters=True) + teacher = paddle.DataParallel(teacher, find_unused_parameters=True) + + num_training_steps = args.max_steps + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in student.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=student.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + pool = ThreadPoolExecutor(1) + + teacher = to_distill(teacher, return_qkv=True, layer_index=args.teacher_layer_index) + student = to_distill(student, return_qkv=True, layer_index=args.student_layer_index) + + global_step = 0 + for epoch in range(args.num_train_epochs): + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if os.path.isfile(os.path.join(args.input_dir, f)) + ] + files.sort() + num_files = len(files) + random.Random(args.seed + epoch).shuffle(files) + f_start_id = 0 + + shared_file_list = {} + + if paddle.distributed.get_world_size() > num_files: + remainder = paddle.distributed.get_world_size() % num_files + + data_file = files[ + ( + f_start_id * paddle.distributed.get_world_size() + + paddle.distributed.get_rank() + + remainder * f_start_id + ) + % num_files + ] + else: + data_file = files[ + (f_start_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + + train_data_loader, _ = create_pretraining_dataset(data_file, shared_file_list, args, worker_init, tokenizer) + + # TODO(guosheng): better way to process single file + single_file = True if f_start_id + 1 == len(files) else False + + for f_id in range(f_start_id, len(files)): + if not single_file and f_id == f_start_id: + continue + if paddle.distributed.get_world_size() > num_files: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank() + remainder * f_id) + % num_files + ] + else: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + dataset_future = pool.submit( + create_pretraining_dataset, data_file, shared_file_list, args, worker_init, tokenizer + ) + + kl_loss_fct = paddle.nn.KLDivLoss("sum") + train_cost_avg = TimeCostAverage() + total_samples = 0 + batch_start = time.time() + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids = batch[0] + attention_mask = paddle.unsqueeze( + (input_ids == pad_token_id).astype(paddle.get_default_dtype()) * -1e4, axis=[1, 2] + ) + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "gelu", "softmax"]): + student(input_ids) + with paddle.no_grad(): + teacher(input_ids) + # Q-Q relation + q_t, q_s = teacher.outputs.q, student.outputs.q + batch_size = q_t.shape[0] + pad_seq_len = q_t.shape[2] + loss_q = calc_multi_relation_loss( + kl_loss_fct, q_s, q_t, attention_mask, args.num_relation_heads, args.alpha, args.beta + ) + del q_t, q_s + # K-K relation + k_t, k_s = teacher.outputs.k, student.outputs.k + loss_k = calc_multi_relation_loss( + kl_loss_fct, k_s, k_t, attention_mask, args.num_relation_heads, args.alpha, args.beta + ) + del k_t, k_s + + # V-V relation + v_t, v_s = teacher.outputs.v, student.outputs.v + loss_v = calc_multi_relation_loss( + kl_loss_fct, v_s, v_t, attention_mask, args.num_relation_heads, args.alpha, args.beta + ) + + del v_t, v_s + + loss = loss_q + loss_k + loss_v + loss /= args.num_relation_heads * pad_seq_len * batch_size + + 
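The three losses above compare teacher and student self-attention relations (Q·Qᵀ, K·Kᵀ, V·Vᵀ) after re-splitting the hidden states into `num_relation_heads` relation heads, following the MiniLMv2 recipe. Below is a simplified, standalone sketch of that idea; the actual `calc_multi_relation_loss` in `paddlenlp.transformers.distill_utils` additionally applies the padding mask and the `alpha`/`beta` weighting, and the shapes here are assumptions for illustration only:

```python
# Editor's simplified sketch of MiniLMv2-style relation distillation (not the
# original implementation). Relation matrices are seq x seq per relation head,
# so teacher and student may have different hidden sizes.
import paddle
import paddle.nn.functional as F

def relation_kl(kl_loss_fct, student_x, teacher_x, num_relation_heads):
    def to_relation(x):
        b, s, h = x.shape                                   # [batch, seq_len, hidden]
        x = x.reshape([b, s, num_relation_heads, h // num_relation_heads])
        x = x.transpose([0, 2, 1, 3])                       # [batch, heads, seq, dim]
        return paddle.matmul(x, x, transpose_y=True) / (h // num_relation_heads) ** 0.5

    s_rel, t_rel = to_relation(student_x), to_relation(teacher_x)   # [b, heads, seq, seq]
    return kl_loss_fct(F.log_softmax(s_rel, axis=-1), F.softmax(t_rel, axis=-1))

# toy usage with random hidden states
kl = paddle.nn.KLDivLoss(reduction="sum")
student_x = paddle.randn([2, 8, 64])
teacher_x = paddle.randn([2, 8, 128])
print(relation_kl(kl, student_x, teacher_x, num_relation_heads=4))
```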
if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_samples += args.batch_size + train_run_cost = time.time() - batch_start + train_cost_avg.record(train_run_cost) + if global_step % args.logging_steps == 0: + logger.info( + "global step: %d, epoch: %d, batch: %d, loss: %f, " + "lr: %f, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sequences/sec" + % ( + global_step, + epoch, + step, + loss, + optimizer.get_lr(), + train_cost_avg.get_average(), + total_samples / args.logging_steps, + total_samples / (args.logging_steps * train_cost_avg.get_average()), + ) + ) + total_samples = 0 + train_cost_avg.reset() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = student._layers if isinstance(student, paddle.DataParallel) else student + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if global_step >= args.max_steps: + del train_data_loader + return + batch_start = time.time() + + del train_data_loader + train_data_loader, data_file = dataset_future.result(timeout=None) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/pp-minilm/general_distill/run.sh b/examples/model_compression/pp-minilm/general_distill/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..be940e7c6d8bdd54f78989c1ee148abb93eda339 --- /dev/null +++ b/examples/model_compression/pp-minilm/general_distill/run.sh @@ -0,0 +1,70 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
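One detail from the training loop in `general_distill.py` above is worth a quick standalone check: the additive attention mask maps padding positions to a large negative value and real tokens to zero, so padded positions are suppressed when the mask is added to the attention scores. An editor's toy illustration (here with `float32` standing in for `paddle.get_default_dtype()`):

```python
# Editor's sketch of the additive padding mask built in the loop above.
import paddle

pad_token_id = 0
input_ids = paddle.to_tensor([[5, 7, 9, 0, 0]])   # last two positions are padding
attention_mask = paddle.unsqueeze(
    (input_ids == pad_token_id).astype("float32") * -1e4, axis=[1, 2]
)
print(attention_mask.shape)    # [1, 1, 1, 5]
print(attention_mask.numpy())  # real tokens -> 0, pad positions -> -10000
```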
+ + +set -eux + +unset CUDA_VISIBLE_DEVICES + +bs=128 +maxlen=128 +numH=64 +lr=6e-4 +maxStep=400000 +warmStep=4000 +wd=1e-2 + +teacher=roberta +teacherModel=roberta-wwm-ext-large + +alpha=0 +beta=1.0 +mode=hardest +use_amp=True +teacher_layer_index=19 +student_layer_index=5 +num_layers=6 + +hp_config=bs${bs}_maxlen${maxlen}_lr${lr}_wd${wd}_numH${numH}_maxStep${maxStep}_warmStep${warmStep}_adamW_maxnorm1p0_teacher_${teacherModel}_coldboot_teacher_vocab_index${teacher_layer_index}_4l-312d-batchbatch + +export PYTHONPATH=../../../../:$PYTHONPATH +output_dir="./pretrain_${hp_config}" + +mkdir -p ${output_dir} +cp ./general_distill.py ${output_dir}/ +cp ../../../../paddlenlp/transformers/distill_utils.py ${output_dir}/ + + +python3 -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" general_distill.py \ + --model_type ppminilm \ + --num_relation_heads ${numH} \ + --teacher_model_type ${teacher} \ + --teacher_layer_index ${teacher_layer_index} \ + --student_layer_index ${student_layer_index} \ + --teacher_model_name_or_path ${teacherModel} \ + --max_seq_length ${maxlen} \ + --num_layers ${num_layers} \ + --batch_size ${bs} \ + --learning_rate ${lr} \ + --logging_steps 20 \ + --max_steps ${maxStep} \ + --warmup_steps ${warmStep} \ + --save_steps 20000 \ + --weight_decay ${wd} \ + --output_dir ${output_dir} \ + --device gpu \ + --input_dir dataset/ \ + --use_amp ${use_amp} \ + --alpha ${alpha} \ + --beta ${beta} \ diff --git a/examples/model_compression/pp-minilm/pp-minilm.png b/examples/model_compression/pp-minilm/pp-minilm.png new file mode 100644 index 0000000000000000000000000000000000000000..8fc8431697883e968f4602150f217a80a0c28f0e Binary files /dev/null and b/examples/model_compression/pp-minilm/pp-minilm.png differ diff --git a/examples/model_compression/pp-minilm/pruning/export.sh b/examples/model_compression/pp-minilm/pruning/export.sh new file mode 100644 index 0000000000000000000000000000000000000000..ee7e5a5867755dbb458646d84f58336889befe8b --- /dev/null +++ b/examples/model_compression/pp-minilm/pruning/export.sh @@ -0,0 +1,22 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +MODEL_PATH=$1 +TASK_NAME=$2 +python export_model.py --model_type ppminilm \ + --task_name ${TASK_NAME} \ + --model_name_or_path ${MODEL_PATH}/${TASK_NAME}/0.75/best_model \ + --sub_model_output_dir ${MODEL_PATH}/${TASK_NAME}/0.75/sub/ \ + --static_sub_model ${MODEL_PATH}/${TASK_NAME}/0.75/sub_static/float \ + --n_gpu 1 --width_mult 0.75 diff --git a/examples/model_compression/pp-minilm/pruning/export_all.sh b/examples/model_compression/pp-minilm/pruning/export_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..78a782a74c563768cb35ac3cc6004209f21429f9 --- /dev/null +++ b/examples/model_compression/pp-minilm/pruning/export_all.sh @@ -0,0 +1,26 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +MODEL_PATH=pruned_models + +for TASK_NAME in AFQMC TNEWS IFLYTEK CMNLI OCNLI CLUEWSC2020 CSL + +do + python export_model.py --model_type ppminilm \ + --model_name_or_path ${MODEL_PATH}/${TASK_NAME}/0.75/best_model \ + --sub_model_output_dir ${MODEL_PATH}/${TASK_NAME}/0.75/sub/ \ + --static_sub_model ${MODEL_PATH}/${TASK_NAME}/0.75/sub_static/float \ + --n_gpu 1 --width_mult 0.75 + +done diff --git a/examples/model_compression/pp-minilm/pruning/export_model.py b/examples/model_compression/pp-minilm/pruning/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..e2088d1547f588cd516a16168b8a58616a78b908 --- /dev/null +++ b/examples/model_compression/pp-minilm/pruning/export_model.py @@ -0,0 +1,187 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
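The `--width_mult 0.75` passed by the export scripts above keeps roughly three quarters of the attention heads and FFN neurons of each layer of the fine-tuned supernet. For the 6-layer, 768-hidden PP-MiniLM student configured in `general_distill.py` above, an editor's illustration of the arithmetic (12 attention heads per layer is the model's default and an assumption here):

```python
# Editor's illustration of what width_mult 0.75 keeps per transformer layer,
# assuming the 6L-768H student config (12 attention heads, 3072 FFN dim).
num_attention_heads, intermediate_size, width_mult = 12, 3072, 0.75
kept_heads = int(num_attention_heads * width_mult)   # 9 heads kept
kept_ffn = int(intermediate_size * width_mult)       # 2304 FFN neurons kept
print(kept_heads, kept_ffn)
```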
+ +import argparse +import json +import math +import os +import sys + +import paddle +from paddleslim.nas.ofa import OFA, utils +from paddleslim.nas.ofa.convert_super import Convert, supernet + +from paddlenlp.transformers import PPMiniLMModel + +sys.path.append("../") +from data import METRIC_CLASSES, MODEL_CLASSES # noqa: E402 + + +def ppminilm_forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + wtype = self.pooler.dense.fn.weight.dtype if hasattr(self.pooler.dense, "fn") else self.pooler.dense.weight.dtype + + if attention_mask is None: + attention_mask = paddle.unsqueeze((input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids) + + encoder_outputs = self.encoder(embedding_output, attention_mask) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + +PPMiniLMModel.forward = ppminilm_forward + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--sub_model_output_dir", + default=None, + type=str, + required=True, + help="The output directory where the sub model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--static_sub_model", + default=None, + type=str, + help="The output directory where the sub static model will be written. If set to None, not export static model", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--n_gpu", type=int, default=1, help="number of gpus to use, 0 for cpu.") + parser.add_argument("--width_mult", type=float, default=1.0, help="width mult you want to export") + parser.add_argument("--depth_mult", type=float, default=1.0, help="depth mult you want to export") + args = parser.parse_args() + return args + + +def do_export(args): + paddle.set_device("gpu" if args.n_gpu else "cpu") + args.model_type = args.model_type.lower() + args.task_name = args.task_name.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + config_path = os.path.join(args.model_name_or_path, "config.json") + cfg_dict = dict(json.loads(open(config_path).read())) + + kept_layers_index = {} + if args.depth_mult < 1.0: + depth = round(cfg_dict["init_args"][0]["num_hidden_layers"] * args.depth_mult) + cfg_dict["init_args"][0]["num_hidden_layers"] = depth + for idx, i in enumerate(range(1, depth + 1)): + kept_layers_index[idx] = math.floor(i / args.depth_mult) - 1 + + os.rename(config_path, config_path + "_bak") + with open(config_path, "w", encoding="utf-8") as f: + f.write(json.dumps(cfg_dict, ensure_ascii=False)) + + model = model_class.from_pretrained(args.model_name_or_path) + + origin_model = model_class.from_pretrained(args.model_name_or_path) + + os.rename(config_path + "_bak", config_path) + + sp_config = supernet(expand_ratio=[1.0, args.width_mult]) + model = Convert(sp_config).convert(model) + + ofa_model = OFA(model) + + sd = paddle.load(os.path.join(args.model_name_or_path, "model_state.pdparams")) + + if len(kept_layers_index) == 0: + ofa_model.model.set_state_dict(sd) + else: + for name, params in ofa_model.model.named_parameters(): + if "encoder" not in name: + params.set_value(sd[name]) + else: + idx = int(name.strip().split(".")[3]) + mapping_name = name.replace("." + str(idx) + ".", "." 
+ str(kept_layers_index[idx]) + ".") + params.set_value(sd[mapping_name]) + + best_config = utils.dynabert_config(ofa_model, args.width_mult) + for name, sublayer in ofa_model.model.named_sublayers(): + if isinstance(sublayer, paddle.nn.MultiHeadAttention): + sublayer.num_heads = int(args.width_mult * sublayer.num_heads) + + origin_model_new = ofa_model.export( + best_config, + input_shapes=[[1, args.max_seq_length], [1, args.max_seq_length]], + input_dtypes=["int64", "int64"], + origin_model=origin_model, + ) + + for name, sublayer in origin_model_new.named_sublayers(): + if isinstance(sublayer, paddle.nn.MultiHeadAttention): + sublayer.num_heads = int(args.width_mult * sublayer.num_heads) + + output_dir = os.path.join(args.sub_model_output_dir, "model_width_%.5f" % args.width_mult) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = origin_model_new + model_to_save.save_pretrained(output_dir) + + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + ] + origin_model_new = paddle.jit.to_static(origin_model_new, input_spec=input_spec) + + paddle.jit.save(origin_model_new, args.static_sub_model) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_export(args) diff --git a/examples/model_compression/pp-minilm/pruning/prune.py b/examples/model_compression/pp-minilm/pruning/prune.py new file mode 100644 index 0000000000000000000000000000000000000000..19427a8e4ea5e0ce6e30cfaae96bf49f1b96e565 --- /dev/null +++ b/examples/model_compression/pp-minilm/pruning/prune.py @@ -0,0 +1,384 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
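Once `export_model.py` above has written the static sub-model (e.g. `.../sub_static/float.pdmodel`), it can be smoke-tested by loading it back in dygraph mode. A hypothetical quick check (editor's sketch, not part of the original scripts; the path prefix follows the layout used by `export_all.sh`):

```python
# Editor's sketch: load the exported static sub-model and run a dummy batch.
import numpy as np
import paddle

model = paddle.jit.load("pruned_models/AFQMC/0.75/sub_static/float")
model.eval()
input_ids = paddle.to_tensor(np.zeros([1, 128], dtype="int64"))
token_type_ids = paddle.to_tensor(np.zeros([1, 128], dtype="int64"))
outputs = model(input_ids, token_type_ids)   # classification output for the dummy batch
print(outputs)
```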
+ +import argparse +import math +import os +import random +import sys +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddle.io import DataLoader +from paddleslim.nas.ofa import OFA, DistillConfig, utils +from paddleslim.nas.ofa.convert_super import Convert, supernet + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup, PPMiniLMModel +from paddlenlp.transformers.ofa_utils import ( + compute_neuron_head_importance, + encoder_layer_ofa_forward, + encoder_ofa_forward, + mha_ofa_forward, + prepare_qkv_ofa, + reorder_neuron_head, +) +from paddlenlp.utils.log import logger + +sys.path.append("../") +from data import METRIC_CLASSES, MODEL_CLASSES, convert_example # noqa: E402 + +paddle.nn.MultiHeadAttention.forward = mha_ofa_forward +paddle.nn.MultiHeadAttention._prepare_qkv = prepare_qkv_ofa +paddle.nn.TransformerEncoder.forward = encoder_ofa_forward +paddle.nn.TransformerEncoderLayer.forward = encoder_layer_ofa_forward + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--glue_dir", + default="/root/.paddlenlp/datasets/Clue/", + type=str, + required=False, + help="The Glue directory.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument("--lambda_logit", default=1.0, type=float, help="lambda for logit loss.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["gpu", "cpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--width_mult_list", + nargs="+", + type=str, + default=["1.0", "5 / 6", "2 / 3", "0.5"], + help="width mult of compression", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, width_mult, student=False): + model.eval() + metric.reset() + for i, batch in enumerate(data_loader): + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids, attention_mask=[None, None]) + if isinstance(logits, tuple): + logits = logits[0] + correct = metric.compute(logits, labels) + metric.update(correct) + + res = metric.accumulate() + print("width_mult: %s, acc: %s, " % (str(width_mult), res), end="") + model.train() + return res + + +# monkey patch for ppminilm forward to accept [attention_mask, head_mask] as attention_mask +def ppminilm_forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=[None, None]): + wtype = self.pooler.dense.fn.weight.dtype if hasattr(self.pooler.dense, "fn") else self.pooler.dense.weight.dtype + if attention_mask[0] is None: + attention_mask[0] = paddle.unsqueeze((input_ids == self.pad_token_id).astype(wtype) * -1e9, axis=[1, 2]) + embedding_output = self.embeddings(input_ids, token_type_ids, position_ids) + encoded_layer = self.encoder(embedding_output, attention_mask) + pooled_output = self.pooler(encoded_layer) + + return encoded_layer, pooled_output + + +PPMiniLMModel.forward = ppminilm_forward + + +def soft_cross_entropy(inp, target): + inp_likelihood = F.log_softmax(inp, axis=-1) + target_prob = F.softmax(target, axis=-1) + return -1.0 * paddle.mean(paddle.sum(inp_likelihood * target_prob, axis=-1)) + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + train_ds = load_dataset("clue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, label_list=train_ds.label_list, tokenizer=tokenizer, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, 
shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + num_labels = 1 if train_ds.label_list is None else len(train_ds.label_list) + + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + # Step1: Initialize a dictionary to save the weights from the origin PPMiniLM model. + origin_weights = model.state_dict() + + # Step2: Convert origin model to supernet. + sp_config = supernet(expand_ratio=[1.0]) + model = Convert(sp_config).convert(model) + # Use weights saved in the dictionary to initialize supernet. + utils.set_state_dict(model, origin_weights) + del origin_weights + + super_sd = paddle.load(os.path.join(args.model_name_or_path, "model_state.pdparams")) + model.set_state_dict(super_sd) + + # Step3: Define teacher model. + teacher_model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_labels) + + # Step4: Config about distillation. + mapping_layers = ["ppminilm.embeddings"] + for idx in range(model.ppminilm.config["num_hidden_layers"]): + mapping_layers.append("ppminilm.encoder.layers.{}".format(idx)) + + default_distill_config = { + "lambda_distill": 0.1, + "teacher_model": teacher_model, + "mapping_layers": mapping_layers, + } + distill_config = DistillConfig(**default_distill_config) + + # Step5: Config in supernet training. + ofa_model = OFA(model, distill_config=distill_config, elastic_order=["width"]) + + criterion = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + # Step6: Calculate the importance of neurons and head, + # and then reorder them according to the importance. + head_importance, neuron_importance = compute_neuron_head_importance( + ofa_model.model, + dev_data_loader, + loss_fct=criterion, + num_layers=model.ppminilm.config["num_hidden_layers"], + num_heads=model.ppminilm.config["num_attention_heads"], + ) + reorder_neuron_head(ofa_model.model, head_importance, neuron_importance) + + if paddle.distributed.get_world_size() > 1: + ofa_model.model = paddle.DataParallel(ofa_model.model) + + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.num_train_epochs + num_train_epochs = args.num_train_epochs + + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + global_step = 0 + tic_train = time.time() + best_res = 0.0 + args.width_mult_list = [eval(width_mult) for width_mult in args.width_mult_list] + for epoch in range(num_train_epochs): + # Step7: Set current epoch and task. + ofa_model.set_epoch(epoch) + ofa_model.set_task("width") + + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, segment_ids, _ = batch + + for width_mult in args.width_mult_list: + # Step8: Broadcast supernet config from width_mult, + # and use this config in supernet training. + net_config = utils.dynabert_config(ofa_model, width_mult) + ofa_model.set_net_config(net_config) + logits, teacher_logits = ofa_model(input_ids, segment_ids, attention_mask=[None, None]) + rep_loss = ofa_model.calc_distill_loss() + logit_loss = soft_cross_entropy(logits, teacher_logits.detach()) + loss = rep_loss + args.lambda_logit * logit_loss + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.logging_steps == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + evaluate(teacher_model, metric, dev_data_loader, width_mult=100) + print("eval done total : %s s" % (time.time() - tic_eval)) + for idx, width_mult in enumerate(args.width_mult_list): + net_config = utils.dynabert_config(ofa_model, width_mult) + ofa_model.set_net_config(net_config) + tic_eval = time.time() + res = evaluate(ofa_model, metric, dev_data_loader, width_mult) + print("eval done total : %s s" % (time.time() - tic_eval)) + + if best_res < res: + output_dir = args.output_dir + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + best_res = res + if global_step >= num_training_steps: + print("best_res: ", best_res) + return + print("best_res: ", best_res) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/model_compression/pp-minilm/pruning/prune.sh b/examples/model_compression/pp-minilm/pruning/prune.sh new file mode 100644 index 0000000000000000000000000000000000000000..f760b429bed57e6c24abcc9adc59fd17d6f0fb2c --- /dev/null +++ b/examples/model_compression/pp-minilm/pruning/prune.sh @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export TASK_NAME=$1 +export LR=$2 +export BATCH_SIZE=$3 +export PRE_EPOCHS=$4 +export SEQ_LEN=$5 +export CUDA_VISIBLE_DEVICES=$6 +export STUDENT_DIR=$7 +export WIDTH_LIST=$8 + +python -u ./prune.py --model_type ppminilm \ + --model_name_or_path ${STUDENT_DIR} \ + --task_name $TASK_NAME --max_seq_length ${SEQ_LEN} \ + --batch_size ${BATCH_SIZE} \ + --learning_rate ${LR} \ + --num_train_epochs ${PRE_EPOCHS} \ + --logging_steps 100 \ + --save_steps 100 \ + --output_dir ./pruned_models/$TASK_NAME/0.75/best_model \ + --device gpu \ + --width_mult_list ${WIDTH_LIST} + diff --git a/examples/model_compression/pp-minilm/quantization/quant_all.sh b/examples/model_compression/pp-minilm/quantization/quant_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..1b39c8ca0e5dd7c46119f3eb89053c1763448e1c --- /dev/null +++ b/examples/model_compression/pp-minilm/quantization/quant_all.sh @@ -0,0 +1,20 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +MODEL_DIR=../pruning/pruned_models/ + +for task in AFQMC TNEWS IFLYTEK CMNLI OCNLI CLUEWSC2020 CSL +do + python quant_post.py --task_name ${task} --input_dir ${MODEL_DIR}/${task}/0.75/sub_static +done diff --git a/examples/model_compression/pp-minilm/quantization/quant_post.py b/examples/model_compression/pp-minilm/quantization/quant_post.py new file mode 100644 index 0000000000000000000000000000000000000000..7436f4f9f5aa2e9093fea640cccc05047d61257d --- /dev/null +++ b/examples/model_compression/pp-minilm/quantization/quant_post.py @@ -0,0 +1,154 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
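The `quant_post.py` below calls `paddleslim.quant.quant_post_static` with `weight_quantize_type="channel_wise_abs_max"` and 8-bit weights for the matmul ops. As a toy illustration of what abs-max int8 quantization does to a single weight channel (editor's sketch, pure NumPy):

```python
# Editor's toy illustration of abs-max int8 weight quantization
# ("channel_wise_abs_max" applies this per output channel).
import numpy as np

w = np.array([0.02, -0.5, 0.31, -0.07], dtype=np.float32)   # one weight channel
scale = np.abs(w).max() / 127.0                             # abs-max scale
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale               # values used at inference time
print(w_int8)
print(np.abs(w - w_dequant).max())                          # small quantization error
```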
+ +import argparse +import os +import sys +from functools import partial + +import paddle +import paddleslim + +from paddlenlp.data import Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import PPMiniLMTokenizer + +sys.path.append("../") +from data import convert_example # noqa: E402 + +parser = argparse.ArgumentParser() + +parser.add_argument("--task_name", type=str, required=True, help="task_name") +parser.add_argument( + "--input_dir", type=str, default="../pruning/pruned_models/", required=True, help="Input task model directory." +) +parser.add_argument("--output_dir", type=str, default="./", required=False, help="Output model directory.") + +parser.add_argument( + "--save_model_filename", type=str, default="int8.pdmodel", required=False, help="File name of quantified model." +) + +parser.add_argument( + "--save_params_filename", + type=str, + default="int8.pdiparams", + required=False, + help="File name of quantified model's parameters.", +) + +parser.add_argument( + "--input_model_filename", type=str, default="float.pdmodel", required=False, help="File name of float model." +) + +parser.add_argument( + "--input_param_filename", + type=str, + default="float.pdiparams", + required=False, + help="File name of float model's parameters.", +) +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", +) + +parser.add_argument( + "--use_faster_tokenizer", + type=strtobool, + default=False, + help="Whether to use FasterTokenizer to accelerate training or further inference.", +) + +parser.add_argument( + "--model_name_or_path", + default="ppminilm-6l-768h", + type=str, + help="Model name or the directory of model directory.", +) + +args = parser.parse_args() + + +def quant_post(args, batch_size=8, algo="avg"): + place = paddle.set_device("gpu") + exe = paddle.static.Executor(place) + args.task_name = args.task_name.lower() + + dev_ds = load_dataset("clue", args.task_name, splits="dev") + if args.use_faster_tokenizer: + trans_func = partial(convert_example, label_list=dev_ds.label_list) + else: + tokenizer = PPMiniLMTokenizer.from_pretrained("ppminilm-6l-768h") + trans_func = partial( + convert_example, label_list=dev_ds.label_list, tokenizer=tokenizer, max_seq_length=128, is_test=True + ) + dev_ds = dev_ds.map(trans_func, lazy=True) + + def batch_generator_func(): + batch_data = [[], []] + for data in dev_ds: + batch_data[0].append(data[0]) + batch_data[1].append(data[1]) + if len(batch_data[0]) == batch_size: + input_ids = Pad(axis=0, pad_val=0)(batch_data[0]) + segment_ids = Pad(axis=0, pad_val=0)(batch_data[1]) + yield [input_ids, segment_ids] + batch_data = [[], []] + + def batch_generator_func_using_faster_tokenizer(): + if "sentence" in dev_ds[0]: + batch_data = [] + else: + batch_data = [[], []] + for data in dev_ds: + if "sentence" in data: + batch_data.append(data["sentence"]) + if len(batch_data) == batch_size: + yield {"text": batch_data} + batch_data = [] + else: + batch_data[0].append(data["sentence1"]) + batch_data[1].append(data["sentence2"]) + if len(batch_data[0]) == batch_size: + yield {"text": batch_data[0], "text_pair": batch_data[1]} + batch_data = [[], []] + + paddleslim.quant.quant_post_static( + exe, + args.input_dir, + os.path.join(args.output_dir, args.task_name + "_quant_models", algo + str(batch_size)), + 
save_model_filename=args.save_model_filename, + save_params_filename=args.save_params_filename, + algo=algo, + hist_percent=0.9999, + batch_generator=batch_generator_func if not args.use_faster_tokenizer else None, + data_loader=batch_generator_func_using_faster_tokenizer if args.use_faster_tokenizer else None, + model_filename=args.input_model_filename, + params_filename=args.input_param_filename, + quantizable_op_type=["matmul", "matmul_v2"], + weight_bits=8, + weight_quantize_type="channel_wise_abs_max", + batch_nums=1, + ) + + +if __name__ == "__main__": + paddle.enable_static() + for batch_size in [4, 8]: + for algo in ["abs_max", "avg", "mse", "hist"]: + quant_post(args, batch_size, algo) diff --git a/examples/model_interpretation/README.md b/examples/model_interpretation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ab3adf4eddabf6a9e60f10cbaf583934703efaf1 --- /dev/null +++ b/examples/model_interpretation/README.md @@ -0,0 +1,255 @@ +NLP可解释评估 +=== +深度学习模型在很多NLP任务上已经取得巨大成功,但其常被当作一个黑盒使用,内部预测机制对使用者是不透明的。这使得深度学习模型结果不被人信任,增加落地难度,尤其是在医疗、法律等特殊领域。同时,当模型出现效果不好或鲁棒性差等问题时,由于不了解其内部机制,导致很难对模型进行优化。近期,深度学习模型的可解释性被越来越多的人关注。但模型的可解释性评估还不够完善,本模块提供了3个NLP任务的评测数据和相关评测指标,旨在评估模型的可解释性。模块包含以下功能: + + 1. 完善可解释性评估体系,提供了评测数据和对应的评测指标 + 2. 提供了3种典型的证据抽取方法,分别是基于注意力(attention-based)、梯度(gradient-based)和线性模型(LIME)的证据抽取方法,并在LSTM、Transformer(RoBERTa-base和RoBERTa-large)等常用模型网络结构上完成实验验证,分别验证模型结构复杂度、模型参数规模对模型可解释的影响 + 3. 提供模型较全面的评估报告,含模型本身准确率等效果、以及在3个可解释评测指标上的结果 + +<p align="center"> +<img src="imgs/structure.png" /> <br> +</p> + +可解释评估体系 +--- +### 评测数据 +我们提供了情感分析、相似度计算、阅读理解等三个NLP任务上的中英文数据集。对于每一个数据集,人工标注了证据数据和扰动数据。 + + 证据数据:给出模型预测依赖的证据(从人类认知角度),其由输入中的若干词构成。我们的标注标准包含3个维度:充分性(sufficiency)、简洁性(concision)、可理解性(understandability)。 + 扰动数据:旨在评估模型在扰动下的证据一致性。我们从抗干扰性、敏感性和泛化性等角度构建了扰动数据,其中,“敏感性”和“泛化性”维度下构建的数据可能会改变证据。 + +#### 样例数据(来自中文情感分析任务): + +<p align="center"> +<img src="imgs/example1.png" /> <br> +</p> + +#### 数据规模 +<table> + <tr> + <td rowspan="2">任务</td> + <td colspan="3">英文模型</td> + <td colspan="3">中文模型</td> + </tr> + <tr> + <td>规模</td> + <td>证据平均长度比例</td> + <td>证据平均数量</td> + <td>规模</td> + <td>证据平均长度比例</td> + <td>证据平均数量</td> + </tr> + <tr> + <td>情感分析</td> + <td>1,499</td> + <td>19.20%</td> + <td>2.1</td> + <td>1,646</td> + <td>30.10%</td> + <td>1.4</td> + </tr> + <tr> + <td>相似度任务</td> + <td>1,659</td> + <td>52.20%</td> + <td>1.0</td> + <td>1,629</td> + <td>70.50%</td> + <td>1.0</td> + </tr> + <tr> + <td>阅读理解</td> + <td>1,507</td> + <td>10.20%</td> + <td>1.0</td> + <td>1,762</td> + <td>9.60%</td> + <td>1.0</td> + </tr> +</table> + +### 评估指标 +__合理性__:评估模型预测依赖的证据与人工标注证据的拟合度,我们这里使用macro-F1作为评估指标,其中模型预测依赖证据可以由本模块提供的证据分析方法(位于/model_interpretation/task/目录下)给出。<br> + +<p align="center"> +<img src="imgs/equation1.png" /> <br> +</p> +其中S<sub>i</sub><sup>p</sup>和S<sub>i</sub><sup>g</sup>分别代表针对第i条输入模型预测证据和人工标注证据,N代表数据集中数据的数量<br> + +__一致性__:评估(原始输入,对应扰动输入)对中词重要度排序的一致性。证据分析方法对输入中每个词赋予一个重要度,基于该重要度对输入中所有词进行排序。我们使用搜索排序中的MAP(mean average precision)指标来计算两个排序的一致性。这里给出了MAP的两种计算方式,分别见以下两个公式:<br> +公式一(正在使用):<br> +<p align="center"> +<img src="imgs/equation5.png" /> <br> +</p> +公式二:<br> +<p align="center"> +<img src="imgs/equation2.png" /> <br> +</p> +其中X<sup>o</sup>和X<sup>d</sup>分别代表原始输入和扰动输入的词重要度排序序列。|X<sup>d</sup>|代表X<sup>d</sup>中词的个数,X<sup>o</sup><sub>1:j</sub>表示X<sup>o</sup>中前j最重要的词。函数G(x, Y)检查词x是否存在于列表Y中,如果存在则G(x, Y)=1。MAP越高表示两个序列排序一致性越高<br> + +__忠诚性__:评估模型给出的证据的忠诚性,即模型是否真的基于给出的证据进行预测的。这里从充分性和完备性两个角度进行评估。充分性,即模型给出的证据是否包含了预测需要的全部信息(即y<sub>r<sub>i</sub></sub> = 
y<sub>x<sub>i</sub></sub>,其中r<sub>i</sub>表示输入x<sub>i</sub>的证据,y<sub>x</sub>表示模型对输入x的预测结果);完备性,即模型对输入x的预测结果(即y<sub>x<sub>i</sub>\r<sub>i</sub></sub> ≠ y<sub>x<sub>i</sub></sub>,其中x<sub>i</sub>\r<sub>i</sub>表示从输入x<sub>i</sub>中去除证据r<sub>i</sub>)。基于这两个维度,我们提出了一个新的指标New-P,计算方式如下:<br> + +<p align="center"> +<img src="imgs/equation3.png" /> <br> +</p> +<p align="center"> +<img src="imgs/equation4.png" /> <br> +</p> + +### 证据抽取方法 +证据抽取方法(rationale-extraction),顾名思义,就是从输入中抽取对模型预测至关重要的词,又被称为后验解释方法(post-hoc explanation methods)。 +该平台提供了3种典型的证据抽取方法,分别是:基于注意力机制(attention-based)的解释方法、基于梯度(gradient-based)的解释方法,和基于线性模型(linear-based)的解释方法:<br> + +Attention-based([Jain and Wallace, 2019](https://arxiv.org/pdf/1902.10186.pdf)): + + 将注意力分数作为词重要度。注意力分数的获取取决于具体模型架构,我们提供了基于LSTM和transformer框架的提取方法,见每个具体任务下的saliency_map目录。 + +Gradient-based([Sundararajan et al., 2017](https://arxiv.org/pdf/1703.01365.pdf)): + + 基于梯度给出每个词重要度。我们这里给出了integrated gradient计算方式,具体见saliency_map目录或论文[Axiomatic attribution for deep networks](https://arxiv.org/pdf/1703.01365.pdf)。 + +Linear-based([Ribeiro et al.. 2016](https://arxiv.org/pdf/1602.04938.pdf)): + + 使用线性模型局部模拟待验证模型,线性模型学习到的词的权重作为该词对预测结果的重要度,详细见论文[" why should i trust you?" explaining the predictions of any classifier](https://arxiv.org/pdf/1602.04938.pdf)。 + +### 三个任务的被评估模型 +为验证模型复杂度、参数规模对可解释的影响,针对每个任务,我们分别提供了基于LSTM(简单结构)的模型、及Transformer-based预训练模型(复杂结构),其中,对于预训练模型,提供了base版本和large版本。<br> +模型代码位置:/model_interpretation/task/{task}/,({task}可取值为["senti","similarity","mrc"],其中senti代表情感分析,similarity代表相似度计算,mrc代表阅读理解)<br> +模型运行及依赖环境请参考下方的“平台使用”。 + + +## 平台使用 +### 环境准备 +代码运行需要 Linux 主机,Python 3.8(推荐,其他低版本未测试过) 和 PaddlePaddle 2.1 以上版本。 + +### 推荐的环境 + +* 操作系统 CentOS 7.5 +* Python 3.8.12 +* PaddlePaddle 2.1.0 +* PaddleNLP 2.2.4 + +除此之外,需要使用支持 GPU 的硬件环境。 + +### PaddlePaddle + +需要安装GPU版的PaddlePaddle。 + +``` +# GPU 版本 +pip3 install paddlepaddle-gpu +``` + +更多关于 PaddlePaddle 的安装教程、使用方法等请参考[官方文档](https://www.paddlepaddle.org.cn/#quick-start). 
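Stepping back to the plausibility metric defined in the 评估指标 section above: in practice the macro-F1 is a token-level F1 between each predicted rationale S<sub>i</sub><sup>p</sup> and the gold rationale S<sub>i</sub><sup>g</sup>, averaged over all N instances. A minimal, standalone sketch of that computation (editor's illustration only; the official scorer is `model_interpretation/evaluation/plausibility/run_f1.sh`):

```python
# Editor's sketch of the plausibility metric: token-level F1 between predicted
# and gold rationale tokens, macro-averaged over the dataset.
def token_f1(pred_tokens, gold_tokens):
    pred, gold = set(pred_tokens), set(gold_tokens)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(pred_rationales, gold_rationales):
    scores = [token_f1(p, g) for p, g in zip(pred_rationales, gold_rationales)]
    return sum(scores) / len(scores)

# toy example with two instances
print(macro_f1([["不", "是"], ["很", "严", "峻"]], [["不", "是", "红", "薯"], ["严", "峻"]]))
```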
+ +### 第三方 Python 库 +除 PaddlePaddle 及其依赖之外,还依赖其它第三方 Python 库,位于代码根目录的 requirements.txt 文件中。 + +可使用 pip 一键安装 + +```pip3 install -r requirements.txt``` + +## 数据准备 +### 模型训练数据 +#### 情感分析任务: + +中文推荐使用ChnSentiCorp,英文推荐使用SST-2。本模块提供的中英文情感分析模型就是基于这两个数据集的。若修改训练数据集,请修改/model_interpretation/task/senti/pretrained_models/train.py (RoBERTa) 以及 /model_interpretation/task/senti/rnn/train.py (LSTM)。 + +[//]:数据集会被缓存到/home/work/.paddlenlp/datasets/目录下 + +#### 相似度计算: + +中文推荐使用LCQMC,英文推荐使用QQP。本模块提供的中英文相似度计算模型就是基于这两个数据集的,若修改训练数据集,请修改/model_interpretation/task/similarity/pretrained_models/train_pointwise.py(RoBERTa)以及/model_interpretation/task/similarity/simnet/train.py(LSTM)。 + +#### 阅读理解中英文: + +中文推荐使用[DuReader_Checklist](https://dataset-bj.cdn.bcebos.com/lic2021/dureader_checklist.dataset.tar.gz),英文推荐使用[SQUDA2](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)。请将阅读理解训练数据放置在/model_interpretation/task/mrc/data目录下。 + +### 下载预训练模型 + +使用paddlenlp框架自动缓存模型文件。 + +### 其他数据下载 +请运行download.sh自动下载 + +### 评测数据 +评测数据样例位于/model_interpretation/data/目录下,每一行为一条JSON格式的数据。 +#### 情感分析数据格式: + id: 数据的编号,作为该条数据识别key; + context:原文本数据; + sent_token:原文本数据的标准分词,注意:golden证据是基于该分词的,预测证据也需要与该分词对应; + sample_type: 数据的类性,分为原始数据(ori)和扰动数据(disturb); + rel_ids:与原始数据关联的扰动数据的id列表(只有原始数据有); + +#### 相似度数据格式: + id:数据的编号,作为该条数据识别key; + query(英文中为sentence1):句子1的原文本数据; + title(英文中为sentence2):句子2的原文本数据; + text_q_seg:句子1的标准分词,注意:golden证据是基于该分词的,预测证据也需要与该分词对应; + text_t_seg:句子2的标准分词,注意:golden证据是基于该分词的,预测证据也需要与该分词对应; + sample_type: 数据的类性,分为原始数据(ori)和扰动数据(disturb); + rel_ids:与原始数据关联的扰动数据的id列表(只有原始数据有); + +#### 阅读理解数据格式: + id:数据的编号,作为该条数据识别key; + title:文章标题; + context:文章主体; + question:文章的问题; + sent_token:原文本数据的标准分词,注意:golden证据是基于该分词的,预测证据也需要与该分词对应; + sample_type: 数据的类性,分为原始数据(ori)和扰动数据(disturb); + rel_ids:与原始数据关联的扰动数据的id列表(只有原始数据有); +## 模型运行 +### 模型预测: + + model_interpretation/task/{task}/run_inter_all.sh (生成所有结果) + model_interpretation/task/{task}/run_inter.sh (生成单个配置的结果,配置可以选择不同的评估模型,以及不同的证据抽取方法、语言) + +(注:{task}可取值为["senti","similarity","mrc"],其中senti代表情感分析,similarity代表相似度计算,mrc代表阅读理解) + +### 证据抽取: + cd model_interpretation/rationale_extraction + ./generate.sh + +### 可解释评估: +#### 合理性(plausibility): + model_interpretation/evaluation/plausibility/run_f1.sh +#### 一致性(consistency): + model_interpretation/evaluation/consistency/run_map.sh +#### 忠诚性(faithfulness): + model_interpretation/evaluation/faithfulness/run_newp.sh + +### 评估报告 +中文情感分析评估报告样例: +<table> + <tr> + <td rowspan="2">模型 + 证据抽取方法</td> + <td colspan="4">情感分析</td> + </tr> + <tr> + <td>Acc</td> + <td>Macro-F1</td> + <td>MAP</td> + <td>New_P</td> + </tr> + <tr> + <td>LSTM + IG</td> + <td>56.8</td> + <td>36.8</td> + <td>59.8</td> + <td>91.4</td> + </tr> + <tr> + <td>RoBERTa-base + IG</td> + <td>62.4</td> + <td>36.4</td> + <td>48.7</td> + <td>48.9</td> + </tr> + <tr> + <td>RoBERTa-large + IG</td> + <td>65.3</td> + <td>38.3</td> + <td>41.9</td> + <td>37.8</td> + </tr> +</table> diff --git a/examples/model_interpretation/data/mrc_ch b/examples/model_interpretation/data/mrc_ch new file mode 100644 index 0000000000000000000000000000000000000000..09a10298751f398085cafadd46d7b111eb79e0ad --- /dev/null +++ b/examples/model_interpretation/data/mrc_ch @@ -0,0 +1,100 @@ +{"id": 1, "title": "地瓜是红薯吗", "context": "地瓜不是红薯。地瓜一般生吃或者凉拌,外形是纺锤型的,有明显的瓣状结构,内里的肉是白色的,有清淡的药香味,生吃又脆又甜,常食用可以预防肝癌、胃癌,营养价值非常高。红薯是粗粮,也叫番薯山芋。它是一种属管状花目,旋花科一年生的草本植物,富含丰富的矿物质和维生素,而且非常耐饱。", "question": "地瓜和红薯一样吗", "sent_token": ["地", "瓜", "不", "是", "红", "薯", "。", "地", "瓜", "一", "般", "生", "吃", "或", "者", "凉", "拌", ",", "外", "形", "是", "纺", "锤", 
"型", "的", ",", "有", "明", "显", "的", "瓣", "状", "结", "构", ",", "内", "里", "的", "肉", "是", "白", "色", "的", ",", "有", "清", "淡", "的", "药", "香", "味", ",", "生", "吃", "又", "脆", "又", "甜", ",", "常", "食", "用", "可", "以", "预", "防", "肝", "癌", "、", "胃", "癌", ",", "营", "养", "价", "值", "非", "常", "高", "。", "红", "薯", "是", "粗", "粮", ",", "也", "叫", "番", "薯", "山", "芋", "。", "它", "是", "一", "种", "属", "管", "状", "花", "目", ",", "旋", "花", "科", "一", "年", "生", "的", "草", "本", "植", "物", ",", "富", "含", "丰", "富", "的", "矿", "物", "质", "和", "维", "生", "素", ",", "而", "且", "非", "常", "耐", "饱", "。", "地", "瓜", "是", "红", "薯", "吗"], "sample_type": "ori", "rel_ids": [1763]} +{"id": 5, "title": "已满多少岁的人犯贩卖毒品罪应负刑事责任", "context": "根据《刑法》第十七条:已满十六周岁的人犯罪,应当负刑事责任。已满十四周岁不满十六周岁的人,犯故意杀人、故意伤害致人重伤或者死亡、强奸、抢劫、贩卖毒品、放火、爆炸、投放危险物质罪的,应当负刑事责任。", "question": "已满几周岁的人贩卖毒品罪应当负刑事责任", "sent_token": ["根", "据", "《", "刑", "法", "》", "第", "十", "七", "条", ":", "已", "满", "十", "六", "周", "岁", "的", "人", "犯", "罪", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "十", "四", "周", "岁", "不", "满", "十", "六", "周", "岁", "的", "人", ",", "犯", "故", "意", "杀", "人", "、", "故", "意", "伤", "害", "致", "人", "重", "伤", "或", "者", "死", "亡", "、", "强", "奸", "、", "抢", "劫", "、", "贩", "卖", "毒", "品", "、", "放", "火", "、", "爆", "炸", "、", "投", "放", "危", "险", "物", "质", "罪", "的", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "多", "少", "岁", "的", "人", "犯", "贩", "卖", "毒", "品", "罪", "应", "负", "刑", "事", "责", "任"], "sample_type": "ori", "rel_ids": [1767]} +{"id": 10, "title": "读研跟考研有什么区别", "context": "考研和读研的区别在于概念和意义不同。考研是指考生通过考试来得到研究生的入学资格,而考生并不是硕士研究生;而读研是指学生在高校攻读硕士研究生的过程,学生身份已经是硕士研究生。这二者并不等同,而是有先后关系,也就是说考生只有通过考研,才能成为硕士研究生,然后在规定的学习时间内读研。", "question": "考研跟读研有什么区别", "sent_token": ["考", "研", "和", "读", "研", "的", "区", "别", "在", "于", "概", "念", "和", "意", "义", "不", "同", "。", "考", "研", "是", "指", "考", "生", "通", "过", "考", "试", "来", "得", "到", "研", "究", "生", "的", "入", "学", "资", "格", ",", "而", "考", "生", "并", "不", "是", "硕", "士", "研", "究", "生", ";", "而", "读", "研", "是", "指", "学", "生", "在", "高", "校", "攻", "读", "硕", "士", "研", "究", "生", "的", "过", "程", ",", "学", "生", "身", "份", "已", "经", "是", "硕", "士", "研", "究", "生", "。", "这", "二", "者", "并", "不", "等", "同", ",", "而", "是", "有", "先", "后", "关", "系", ",", "也", "就", "是", "说", "考", "生", "只", "有", "通", "过", "考", "研", ",", "才", "能", "成", "为", "硕", "士", "研", "究", "生", ",", "然", "后", "在", "规", "定", "的", "学", "习", "时", "间", "内", "读", "研", "。", "读", "研", "跟", "考", "研", "有", "什", "么", "区", "别"], "sample_type": "ori", "rel_ids": [1772]} +{"id": 12, "title": "多效唑能和磷酸二氢钾一起用吗", "context": "多效唑能和磷酸二氢钾一起用。多效唑是植物的生长调节剂,主要是控制作物疯长的。而磷酸二氢钾属于叶面肥,施用后可促使作物的叶色更加浓绿,根系发达,药效完全不同,也并不排斥,可以混合使用。不过要注意施用时要严格按照说明施加,不可过量,否则会阻碍生长。", "question": "磷酸二氢钾能和多效唑一起用吗", "sent_token": ["多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "。", "多", "效", "唑", "是", "植", "物", "的", "生", "长", "调", "节", "剂", ",", "主", "要", "是", "控", "制", "作", "物", "疯", "长", "的", "。", "而", "磷", "酸", "二", "氢", "钾", "属", "于", "叶", "面", "肥", ",", "施", "用", "后", "可", "促", "使", "作", "物", "的", "叶", "色", "更", "加", "浓", "绿", ",", "根", "系", "发", "达", ",", "药", "效", "完", "全", "不", "同", ",", "也", "并", "不", "排", "斥", ",", "可", "以", "混", "合", "使", "用", "。", "不", "过", "要", "注", "意", "施", "用", "时", "要", "严", "格", "按", "照", "说", "明", "施", "加", ",", "不", "可", "过", "量", ",", "否", "则", "会", "阻", "碍", "生", "长", "。", "多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "吗"], "sample_type": "ori", "rel_ids": [1774]} +{"id": 14, "title": "猫能吃蛋黄吗", "context": 
"猫咪是可以吃蛋黄的。这里特定煮熟的白水蛋,猫咪不能吃生鸡蛋,因为生鸡蛋中有细菌,常见的是沙门氏菌,容易引起猫腹泻脱水,而且饲喂猫咪最好的只饲喂蛋黄。虽然可以吃蛋黄,但是需要掌握好量,一般一周最多吃两三次就可了。蛋黄中也含有丰富的胆固醇,易引发猫咪患脂肪肝和高脂血病。", "question": "猫咪可以吃生蛋黄吗", "sent_token": ["猫", "咪", "是", "可", "以", "吃", "蛋", "黄", "的", "。", "这", "里", "特", "定", "煮", "熟", "的", "白", "水", "蛋", ",", "猫", "咪", "不", "能", "吃", "生", "鸡", "蛋", ",", "因", "为", "生", "鸡", "蛋", "中", "有", "细", "菌", ",", "常", "见", "的", "是", "沙", "门", "氏", "菌", ",", "容", "易", "引", "起", "猫", "腹", "泻", "脱", "水", ",", "而", "且", "饲", "喂", "猫", "咪", "最", "好", "的", "只", "饲", "喂", "蛋", "黄", "。", "虽", "然", "可", "以", "吃", "蛋", "黄", ",", "但", "是", "需", "要", "掌", "握", "好", "量", ",", "一", "般", "一", "周", "最", "多", "吃", "两", "三", "次", "就", "可", "了", "。", "蛋", "黄", "中", "也", "含", "有", "丰", "富", "的", "胆", "固", "醇", ",", "易", "引", "发", "猫", "咪", "患", "脂", "肪", "肝", "和", "高", "脂", "血", "病", "。", "猫", "能", "吃", "蛋", "黄", "吗"], "sample_type": "ori", "rel_ids": [1776]} +{"id": 18, "title": "最近深圳限行吗", "context": "现在由于疫情的影响,深圳市不限行的了,但是没有必要尽量还是少出门,出门也要做好一系列的防护措施才可以。因为虽然目前国内疫情形势有所缓和,但是这并不意味着疫情的结束,国外疫情形势还是很严峻的,境外输入案例较多。", "question": "最近深圳没有限行吗", "sent_token": ["现", "在", "由", "于", "疫", "情", "的", "影", "响", ",", "深", "圳", "市", "不", "限", "行", "的", "了", ",", "但", "是", "没", "有", "必", "要", "尽", "量", "还", "是", "少", "出", "门", ",", "出", "门", "也", "要", "做", "好", "一", "系", "列", "的", "防", "护", "措", "施", "才", "可", "以", "。", "因", "为", "虽", "然", "目", "前", "国", "内", "疫", "情", "形", "势", "有", "所", "缓", "和", ",", "但", "是", "这", "并", "不", "意", "味", "着", "疫", "情", "的", "结", "束", ",", "国", "外", "疫", "情", "形", "势", "还", "是", "很", "严", "峻", "的", ",", "境", "外", "输", "入", "案", "例", "较", "多", "。", "最", "近", "深", "圳", "限", "行", "吗"], "sample_type": "ori", "rel_ids": [1780]} +{"id": 19, "title": "合同签字不盖章有效吗", "context": "可能有效可能无效。只有签字没有公章的合同是否有法律效力要根据具体情况分析:如果合同是由单位的委托代理人在其权限范围内、或单位的法定代表人签的字,则合同有效。", "question": "合同不签字不盖章有效吗", "sent_token": ["可", "能", "有", "效", "可", "能", "无", "效", "。", "只", "有", "签", "字", "没", "有", "公", "章", "的", "合", "同", "是", "否", "有", "法", "律", "效", "力", "要", "根", "据", "具", "体", "情", "况", "分", "析", ":", "如", "果", "合", "同", "是", "由", "单", "位", "的", "委", "托", "代", "理", "人", "在", "其", "权", "限", "范", "围", "内", "、", "或", "单", "位", "的", "法", "定", "代", "表", "人", "签", "的", "字", ",", "则", "合", "同", "有", "效", "。", "合", "同", "签", "字", "不", "盖", "章", "有", "效", "吗"], "sample_type": "ori", "rel_ids": [1781]} +{"id": 27, "title": "", "context": "吴三桂(1612年-1678年10月2日),字长伯,一字月所,明朝辽东人,明末清初著名政治军事人物,吴周政权建立者吴周太祖。", "question": "吴三贵什么朝代", "sent_token": ["吴", "三", "桂", "(", "1612", "年", "-", "1678", "年", "10", "月", "2", "日", ")", ",", "字", "长", "伯", ",", "一", "字", "月", "所", ",", "明", "朝", "辽", "东", "人", ",", "明", "末", "清", "初", "著", "名", "政", "治", "军", "事", "人", "物", ",", "吴", "周", "政", "权", "建", "立", "者", "吴", "周", "太", "祖", "。"], "sample_type": "ori", "rel_ids": [1789]} +{"id": 34, "title": "狗狗为什么互相闻屁股", "context": "相互闻屁股是狗狗打招呼的一种方式。狗狗的嗅觉很敏感,它们可以用相互闻屁股来了解狗狗的配偶状况、饮食习惯等,因为狗狗的屁股后面有两个肛门腺,在肛门腺里面涵盖了很多的信息素。处在发情期的狗狗也会通过闻屁股来挑选自己的配偶。", "question": "狗狗为什么总是闻屁股", "sent_token": ["相", "互", "闻", "屁", "股", "是", "狗", "狗", "打", "招", "呼", "的", "一", "种", "方", "式", "。", "狗", "狗", "的", "嗅", "觉", "很", "敏", "感", ",", "它", "们", "可", "以", "用", "相", "互", "闻", "屁", "股", "来", "了", "解", "狗", "狗", "的", "配", "偶", "状", "况", "、", "饮", "食", "习", "惯", "等", ",", "因", "为", "狗", "狗", "的", "屁", "股", "后", "面", "有", "两", "个", "肛", "门", "腺", ",", "在", "肛", "门", "腺", "里", "面", "涵", "盖", "了", "很", "多", "的", "信", "息", "素", "。", "处", "在", "发", "情", "期", "的", "狗", "狗", "也", "会", "通", "过", "闻", "屁", "股", "来", "挑", "选", "自", "己", "的", "配", 
"偶", "。", "狗", "狗", "为", "什", "么", "互", "相", "闻", "屁", "股"], "sample_type": "ori", "rel_ids": [1796]} +{"id": 36, "title": "出租房隔音差怎么解决", "context": "可以在窗户上贴一层隔音膜,在粘贴过程中要注意,不要出现气泡,以免影响隔音效果。若想要隔音效果更好点,还可以购买一些密封条安装在窗户缝隙处,这也能起到更好的隔音效果。另外,室内使用的家具可以更换成木质的,这样同样能起到一定的吸音效果。", "question": "出租房隔音不好怎么解决", "sent_token": ["可", "以", "在", "窗", "户", "上", "贴", "一", "层", "隔", "音", "膜", ",", "在", "粘", "贴", "过", "程", "中", "要", "注", "意", ",", "不", "要", "出", "现", "气", "泡", ",", "以", "免", "影", "响", "隔", "音", "效", "果", "。", "若", "想", "要", "隔", "音", "效", "果", "更", "好", "点", ",", "还", "可", "以", "购", "买", "一", "些", "密", "封", "条", "安", "装", "在", "窗", "户", "缝", "隙", "处", ",", "这", "也", "能", "起", "到", "更", "好", "的", "隔", "音", "效", "果", "。", "另", "外", ",", "室", "内", "使", "用", "的", "家", "具", "可", "以", "更", "换", "成", "木", "质", "的", ",", "这", "样", "同", "样", "能", "起", "到", "一", "定", "的", "吸", "音", "效", "果", "。", "出", "租", "房", "隔", "音", "差", "怎", "么", "解", "决"], "sample_type": "ori", "rel_ids": [1798]} +{"id": 40, "title": "鬼迷心窍(李宗盛演唱歌曲)_百度百科", "context": "《鬼迷心窍》是1992年黄日华、周海媚主演台湾电视剧《末代皇孙》的主题曲,是由李宗盛作词、作曲、演唱,收录于1992年影视剧音乐合辑《滚石九大天王之十二出好戏》当中。", "question": "鬼迷心窍原唱", "sent_token": ["《", "鬼", "迷", "心", "窍", "》", "是", "1992", "年", "黄", "日", "华", "、", "周", "海", "媚", "主", "演", "台", "湾", "电", "视", "剧", "《", "末", "代", "皇", "孙", "》", "的", "主", "题", "曲", ",", "是", "由", "李", "宗", "盛", "作", "词", "、", "作", "曲", "、", "演", "唱", ",", "收", "录", "于", "1992", "年", "影", "视", "剧", "音", "乐", "合", "辑", "《", "滚", "石", "九", "大", "天", "王", "之", "十", "二", "出", "好", "戏", "》", "当", "中", "。", "鬼", "迷", "心", "窍", "(", "李", "宗", "盛", "演", "唱", "歌", "曲", ")", "_", "百", "度", "百", "科"], "sample_type": "ori", "rel_ids": [1802]} +{"id": 41, "title": "", "context": "白龙马,名著小说《西游记》中的重要角色。本是西海龙王三太子,因纵火烧毁玉帝赏赐的明珠而被西海龙王上天告忤逆,要被斩首。后因南海观世菩萨出面才免于死罪,被贬到蛇盘山鹰愁涧等待唐僧取经。之后又误吃唐僧所骑的白马,被菩萨点化,变身为白龙。", "question": "白龙马的真正身份", "sent_token": ["白", "龙", "马", ",", "名", "著", "小", "说", "《", "西", "游", "记", "》", "中", "的", "重", "要", "角", "色", "。", "本", "是", "西", "海", "龙", "王", "三", "太", "子", ",", "因", "纵", "火", "烧", "毁", "玉", "帝", "赏", "赐", "的", "明", "珠", "而", "被", "西", "海", "龙", "王", "上", "天", "告", "忤", "逆", ",", "要", "被", "斩", "首", "。", "后", "因", "南", "海", "观", "世", "菩", "萨", "出", "面", "才", "免", "于", "死", "罪", ",", "被", "贬", "到", "蛇", "盘", "山", "鹰", "愁", "涧", "等", "待", "唐", "僧", "取", "经", "。", "之", "后", "又", "误", "吃", "唐", "僧", "所", "骑", "的", "白", "马", ",", "被", "菩", "萨", "点", "化", ",", "变", "身", "为", "白", "龙", "。"], "sample_type": "ori", "rel_ids": [1803]} +{"id": 43, "title": "", "context": "《湮灭》是由派拉蒙影业出品的科幻惊悚片,由亚历克斯·加兰执导,娜塔莉·波特曼、詹妮弗·杰森·李、吉娜·罗德里格兹、泰莎·汤普森联合主演。该片于2018年2月23日在美国上映。影片根据杰夫·梵德米尔所著《遗落的南境》三部曲的首部同名小说改编,讲述了生物学家莉娜为了自己的丈夫,她自愿加入了科学考察探险小队,去研究美国领土一块被检疫隔离的生态灾害区域的故事。", "question": "湮灭什么类型", "sent_token": ["《", "湮", "灭", "》", "是", "由", "派", "拉", "蒙", "影", "业", "出", "品", "的", "科", "幻", "惊", "悚", "片", ",", "由", "亚", "历", "克", "斯", "·", "加", "兰", "执", "导", ",", "娜", "塔", "莉", "·", "波", "特", "曼", "、", "詹", "妮", "弗", "·", "杰", "森", "·", "李", "、", "吉", "娜", "·", "罗", "德", "里", "格", "兹", "、", "泰", "莎", "·", "汤", "普", "森", "联", "合", "主", "演", "。", "该", "片", "于", "2018", "年", "2", "月", "23", "日", "在", "美", "国", "上", "映", "。", "影", "片", "根", "据", "杰", "夫", "·", "梵", "德", "米", "尔", "所", "著", "《", "遗", "落", "的", "南", "境", "》", "三", "部", "曲", "的", "首", "部", "同", "名", "小", "说", "改", "编", ",", "讲", "述", "了", "生", "物", "学", "家", "莉", "娜", "为", "了", "自", "己", "的", "丈", "夫", ",", "她", "自", "愿", "加", "入", "了", "科", "学", "考", "察", "探", "险", "小", "队", ",", "去", "研", "究", "美", "国", "领", "土", "一", "块", "被", "检", 
"疫", "隔", "离", "的", "生", "态", "灾", "害", "区", "域", "的", "故", "事", "。"], "sample_type": "ori", "rel_ids": [1805]} +{"id": 45, "title": "", "context": "网球运动的起源及演变可以用四句话来概括:网球孕育在法国,诞生在英国,开始普及和形成高潮在美国,现盛行全世界。", "question": "网球起源于哪国?", "sent_token": ["网", "球", "运", "动", "的", "起", "源", "及", "演", "变", "可", "以", "用", "四", "句", "话", "来", "概", "括", ":", "网", "球", "孕", "育", "在", "法", "国", ",", "诞", "生", "在", "英", "国", ",", "开", "始", "普", "及", "和", "形", "成", "高", "潮", "在", "美", "国", ",", "现", "盛", "行", "全", "世", "界", "。"], "sample_type": "ori", "rel_ids": [1807]} +{"id": 48, "title": "单人挑战巫女大蛇悲鸣需要多少体力_单人挑战巫女大蛇悲鸣需要体力", "context": "阴阳师巫女大蛇悲鸣单人通关需要12点体力组队通关的话只需要8点体力,挑战巫女大蛇悲鸣的体力消耗是普通御魂副本的2倍。奖励掉落5星与6星御魂,经验强化狗粮4星青吉鬼。在御魂副本1-10层原本掉落的基础上,巫女大蛇·悲鸣新增了蚌精、幽谷响、轮入道、蝠翼、狂骨这5种御魂的掉落,每日掉落御魂种类增加到5。", "question": "阴阳师 组队挑战大蛇悲鸣需要多少体力", "sent_token": ["阴", "阳", "师", "巫", "女", "大", "蛇", "悲", "鸣", "单", "人", "通", "关", "需", "要", "12", "点", "体", "力", "组", "队", "通", "关", "的", "话", "只", "需", "要", "8", "点", "体", "力", ",", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "的", "体", "力", "消", "耗", "是", "普", "通", "御", "魂", "副", "本", "的", "2", "倍", "。", "奖", "励", "掉", "落", "5", "星", "与", "6", "星", "御", "魂", ",", "经", "验", "强", "化", "狗", "粮", "4", "星", "青", "吉", "鬼", "。", "在", "御", "魂", "副", "本", "1", "-", "10", "层", "原", "本", "掉", "落", "的", "基", "础", "上", ",", "巫", "女", "大", "蛇", "·", "悲", "鸣", "新", "增", "了", "蚌", "精", "、", "幽", "谷", "响", "、", "轮", "入", "道", "、", "蝠", "翼", "、", "狂", "骨", "这", "5", "种", "御", "魂", "的", "掉", "落", ",", "每", "日", "掉", "落", "御", "魂", "种", "类", "增", "加", "到", "5", "。", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "多", "少", "体", "力", "_", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "体", "力"], "sample_type": "ori", "rel_ids": [1810]} +{"id": 53, "title": "", "context": "人类的心脏位于胸腔中部偏左,体积约相当于一个拳头大小,重量约350克。女性的心脏通常要比男性的体积小且重量轻。人的心脏外形像桃子,位于横膈之上,两肺间而偏左。", "question": "人类心脏多少斤", "sent_token": ["人", "类", "的", "心", "脏", "位", "于", "胸", "腔", "中", "部", "偏", "左", ",", "体", "积", "约", "相", "当", "于", "一", "个", "拳", "头", "大", "小", ",", "重", "量", "约", "350", "克", "。", "女", "性", "的", "心", "脏", "通", "常", "要", "比", "男", "性", "的", "体", "积", "小", "且", "重", "量", "轻", "。", "人", "的", "心", "脏", "外", "形", "像", "桃", "子", ",", "位", "于", "横", "膈", "之", "上", ",", "两", "肺", "间", "而", "偏", "左", "。"], "sample_type": "ori", "rel_ids": [1815]} +{"id": 54, "title": "紫菜变成紫色还能吃吗-有来医生", "context": "如果紫菜变成紫色的情况下,主要考虑还是紫菜受潮引起的,紫菜受潮以后容易滋生细菌,营养物质也会丧失,口感也会变差,一般情况下,建议不要食用,以免导致消化道的不良反应。紫菜中含有的营养物质是很丰富的,含有丰富的锌元素和铁元素,每天适当的吃一点,可以预防缺铁性贫血,可以预防缺锌引起的反复性口腔溃疡,可以增进食欲。", "question": "海苔回潮了还能吃吗", "sent_token": ["如", "果", "紫", "菜", "变", "成", "紫", "色", "的", "情", "况", "下", ",", "主", "要", "考", "虑", "还", "是", "紫", "菜", "受", "潮", "引", "起", "的", ",", "紫", "菜", "受", "潮", "以", "后", "容", "易", "滋", "生", "细", "菌", ",", "营", "养", "物", "质", "也", "会", "丧", "失", ",", "口", "感", "也", "会", "变", "差", ",", "一", "般", "情", "况", "下", ",", "建", "议", "不", "要", "食", "用", ",", "以", "免", "导", "致", "消", "化", "道", "的", "不", "良", "反", "应", "。", "紫", "菜", "中", "含", "有", "的", "营", "养", "物", "质", "是", "很", "丰", "富", "的", ",", "含", "有", "丰", "富", "的", "锌", "元", "素", "和", "铁", "元", "素", ",", "每", "天", "适", "当", "的", "吃", "一", "点", ",", "可", "以", "预", "防", "缺", "铁", "性", "贫", "血", ",", "可", "以", "预", "防", "缺", "锌", "引", "起", "的", "反", "复", "性", "口", "腔", "溃", "疡", ",", "可", "以", "增", "进", "食", "欲", "。", "紫", "菜", "变", "成", "紫", "色", "还", "能", "吃", "吗", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1816]} +{"id": 68, "title": "", "context": 
"穿上盔甲后,托尼变身成了复仇者联盟中惩恶扬善的钢铁侠。复仇者联盟2:奥创纪元钢铁侠是美国演员小罗伯特·唐尼演的。小罗伯特唐尼的电影钢铁侠扮演者小罗伯特·。", "question": "谁演过钢铁侠", "sent_token": ["穿", "上", "盔", "甲", "后", ",", "托", "尼", "变", "身", "成", "了", "复", "仇", "者", "联", "盟", "中", "惩", "恶", "扬", "善", "的", "钢", "铁", "侠", "。", "复", "仇", "者", "联", "盟", "2", ":", "奥", "创", "纪", "元", "钢", "铁", "侠", "是", "美", "国", "演", "员", "小", "罗", "伯", "特", "·", "唐", "尼", "演", "的", "。", "小", "罗", "伯", "特", "唐", "尼", "的", "电", "影", "钢", "铁", "侠", "扮", "演", "者", "小", "罗", "伯", "特", "·", "。"], "sample_type": "ori", "rel_ids": [1830]} +{"id": 69, "title": "人间正道是沧桑是什么意思_酷知经验网", "context": "天若有情天亦老,人间正道是沧桑:上句借用李贺《金铜仙人辞汉歌》中诗句,原诗说的是汉武帝时制作的极贵重的宝物金铜仙人像,在三国时被魏明帝由长安迁往洛阳的传说。原句的意思是,对于这样的人间恨事,天若有情,也要因悲伤而衰老。", "question": "人间正道是沧桑上一句", "sent_token": ["天", "若", "有", "情", "天", "亦", "老", ",", "人", "间", "正", "道", "是", "沧", "桑", ":", "上", "句", "借", "用", "李", "贺", "《", "金", "铜", "仙", "人", "辞", "汉", "歌", "》", "中", "诗", "句", ",", "原", "诗", "说", "的", "是", "汉", "武", "帝", "时", "制", "作", "的", "极", "贵", "重", "的", "宝", "物", "金", "铜", "仙", "人", "像", ",", "在", "三", "国", "时", "被", "魏", "明", "帝", "由", "长", "安", "迁", "往", "洛", "阳", "的", "传", "说", "。", "原", "句", "的", "意", "思", "是", ",", "对", "于", "这", "样", "的", "人", "间", "恨", "事", ",", "天", "若", "有", "情", ",", "也", "要", "因", "悲", "伤", "而", "衰", "老", "。", "人", "间", "正", "道", "是", "沧", "桑", "是", "什", "么", "意", "思", "_", "酷", "知", "经", "验", "网"], "sample_type": "ori", "rel_ids": [1831]} +{"id": 72, "title": "", "context": "《艺妓回忆录》根据美国作家阿瑟-高顿的同名小说改编。于2005年12月1日上映,由章子怡·巩俐·杨紫琼等共同演绎。是一部时长约140分钟的电影。全篇充满着古典美,时代背景从1929年开始延续到二战结束,女主人公回忆了自己从小拼命挣扎、历尽荣辱的人生经历。", "question": "艺妓回忆录多长时间", "sent_token": ["《", "艺", "妓", "回", "忆", "录", "》", "根", "据", "美", "国", "作", "家", "阿", "瑟", "-", "高", "顿", "的", "同", "名", "小", "说", "改", "编", "。", "于", "2005", "年", "12", "月", "1", "日", "上", "映", ",", "由", "章", "子", "怡", "·", "巩", "俐", "·", "杨", "紫", "琼", "等", "共", "同", "演", "绎", "。", "是", "一", "部", "时", "长", "约", "140", "分", "钟", "的", "电", "影", "。", "全", "篇", "充", "满", "着", "古", "典", "美", ",", "时", "代", "背", "景", "从", "1929", "年", "开", "始", "延", "续", "到", "二", "战", "结", "束", ",", "女", "主", "人", "公", "回", "忆", "了", "自", "己", "从", "小", "拼", "命", "挣", "扎", "、", "历", "尽", "荣", "辱", "的", "人", "生", "经", "历", "。"], "sample_type": "ori", "rel_ids": [1834]} +{"id": 77, "title": "痛风挂哪个科室比较好?_39健康问答_39健康网", "context": "痛风属于代谢风湿性疾病,目前主要是在风湿免疫科治疗,所以患者需要挂风湿免疫科。风湿免疫科在绝大多数三级甲等医院都有独立的科室。由于这个科是一个新兴学科,在很多县级医院还没有成立,患者可以到内分泌科就诊,挂内分泌科。如果这两个科都没有患者,可以到骨科就诊,因为痛风首发表现是急性痛风性关节炎,骨科大夫对痛风也有一定的了解。", "question": "痛风属于什么类型疾病", "sent_token": ["痛", "风", "属", "于", "代", "谢", "风", "湿", "性", "疾", "病", ",", "目", "前", "主", "要", "是", "在", "风", "湿", "免", "疫", "科", "治", "疗", ",", "所", "以", "患", "者", "需", "要", "挂", "风", "湿", "免", "疫", "科", "。", "风", "湿", "免", "疫", "科", "在", "绝", "大", "多", "数", "三", "级", "甲", "等", "医", "院", "都", "有", "独", "立", "的", "科", "室", "。", "由", "于", "这", "个", "科", "是", "一", "个", "新", "兴", "学", "科", ",", "在", "很", "多", "县", "级", "医", "院", "还", "没", "有", "成", "立", ",", "患", "者", "可", "以", "到", "内", "分", "泌", "科", "就", "诊", ",", "挂", "内", "分", "泌", "科", "。", "如", "果", "这", "两", "个", "科", "都", "没", "有", "患", "者", ",", "可", "以", "到", "骨", "科", "就", "诊", ",", "因", "为", "痛", "风", "首", "发", "表", "现", "是", "急", "性", "痛", "风", "性", "关", "节", "炎", ",", "骨", "科", "大", "夫", "对", "痛", "风", "也", "有", "一", "定", "的", "了", "解", "。", "痛", "风", "挂", "哪", "个", "科", "室", "比", "较", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [1839]} +{"id": 82, "title": "阴阳师武士之灵生前被谁所杀_游侠网", "context": 
"从武士之灵的传记中可以得知,武士之灵生前是被茨木童子所击杀。该问题来自游戏内的逢魔密信,正确回答问题之后就有机会获得包括金币、体力、勾玉和结界卡在内的多种游戏内道具物资奖励。", "question": "武士之灵生前被谁所杀", "sent_token": ["从", "武", "士", "之", "灵", "的", "传", "记", "中", "可", "以", "得", "知", ",", "武", "士", "之", "灵", "生", "前", "是", "被", "茨", "木", "童", "子", "所", "击", "杀", "。", "该", "问", "题", "来", "自", "游", "戏", "内", "的", "逢", "魔", "密", "信", ",", "正", "确", "回", "答", "问", "题", "之", "后", "就", "有", "机", "会", "获", "得", "包", "括", "金", "币", "、", "体", "力", "、", "勾", "玉", "和", "结", "界", "卡", "在", "内", "的", "多", "种", "游", "戏", "内", "道", "具", "物", "资", "奖", "励", "。", "阴", "阳", "师", "武", "士", "之", "灵", "生", "前", "被", "谁", "所", "杀", "_", "游", "侠", "网"], "sample_type": "ori", "rel_ids": [1844]} +{"id": 88, "title": "中医肾主什么-有来医生", "context": "根据中医基础理论,肾主水、主纳气、主二便、主藏精。肾主水,是指全身的水液代谢都是在肾阳的气化温煦作用下,从而分布到全身,然后再通过呼吸、二便将代谢废物排除体外。肾主纳气,是指肾能够使人体维持正常的呼吸深度。肾主二便,人的大小便需要在肾的作用下,才能够正常的排泄,否则就会出现异常的改变,比如大小便失禁、大便稀薄等情况。肾主藏精,是指五脏六腑化生的精气,最后都是储存在肾脏,反过来肾脏所藏的精气,又能够推动各脏腑的功能。", "question": "肾主什么", "sent_token": ["根", "据", "中", "医", "基", "础", "理", "论", ",", "肾", "主", "水", "、", "主", "纳", "气", "、", "主", "二", "便", "、", "主", "藏", "精", "。", "肾", "主", "水", ",", "是", "指", "全", "身", "的", "水", "液", "代", "谢", "都", "是", "在", "肾", "阳", "的", "气", "化", "温", "煦", "作", "用", "下", ",", "从", "而", "分", "布", "到", "全", "身", ",", "然", "后", "再", "通", "过", "呼", "吸", "、", "二", "便", "将", "代", "谢", "废", "物", "排", "除", "体", "外", "。", "肾", "主", "纳", "气", ",", "是", "指", "肾", "能", "够", "使", "人", "体", "维", "持", "正", "常", "的", "呼", "吸", "深", "度", "。", "肾", "主", "二", "便", ",", "人", "的", "大", "小", "便", "需", "要", "在", "肾", "的", "作", "用", "下", ",", "才", "能", "够", "正", "常", "的", "排", "泄", ",", "否", "则", "就", "会", "出", "现", "异", "常", "的", "改", "变", ",", "比", "如", "大", "小", "便", "失", "禁", "、", "大", "便", "稀", "薄", "等", "情", "况", "。", "肾", "主", "藏", "精", ",", "是", "指", "五", "脏", "六", "腑", "化", "生", "的", "精", "气", ",", "最", "后", "都", "是", "储", "存", "在", "肾", "脏", ",", "反", "过", "来", "肾", "脏", "所", "藏", "的", "精", "气", ",", "又", "能", "够", "推", "动", "各", "脏", "腑", "的", "功", "能", "。", "中", "医", "肾", "主", "什", "么", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1850]} +{"id": 91, "title": "1963年属什么生肖年_十二生肖_卜易居", "context": "1963年属什么生肖年,葵卯兔年,属兔之人举止文雅,谈吐随和,为人恭良谦逊,与人交往如慕春风,学习能力超群,敏捷果断,安贫乐道。虽性子柔弱,但韧性极强,绝境之中能力惊人,缺点则是难以坚持原则,随波逐流。", "question": "1963年属什么生肖", "sent_token": ["1963", "年", "属", "什", "么", "生", "肖", "年", ",", "葵", "卯", "兔", "年", ",", "属", "兔", "之", "人", "举", "止", "文", "雅", ",", "谈", "吐", "随", "和", ",", "为", "人", "恭", "良", "谦", "逊", ",", "与", "人", "交", "往", "如", "慕", "春", "风", ",", "学", "习", "能", "力", "超", "群", ",", "敏", "捷", "果", "断", ",", "安", "贫", "乐", "道", "。", "虽", "性", "子", "柔", "弱", ",", "但", "韧", "性", "极", "强", ",", "绝", "境", "之", "中", "能", "力", "惊", "人", ",", "缺", "点", "则", "是", "难", "以", "坚", "持", "原", "则", ",", "随", "波", "逐", "流", "。", "1963", "年", "属", "什", "么", "生", "肖", "年", "_", "十", "二", "生", "肖", "_", "卜", "易", "居"], "sample_type": "ori", "rel_ids": [1853]} +{"id": 92, "title": "食管和食道一样吗-有来医生", "context": "食管和食道是没有区别的,食管是医学上的称谓,而食道是民间的一种说法。两者都指从咽喉部到胃贲门之间的管道。食管可以分为颈段和胸段,而胸段又分为胸上段、胸中段和胸下段。食管本身有3个生理性的狭窄,这也是某些食管疾病发生的基础。常见的食管疾病包括食管炎、食管息肉、食管癌、食管狭窄、胃食管反流症、巴雷特食管等。可以通过消化道造影以及胃镜来进一步明确。", "question": "食管跟食道一样吗", "sent_token": ["食", "管", "和", "食", "道", "是", "没", "有", "区", "别", "的", ",", "食", "管", "是", "医", "学", "上", "的", "称", "谓", ",", "而", "食", "道", "是", "民", "间", "的", "一", "种", "说", "法", "。", "两", "者", "都", "指", "从", "咽", "喉", "部", "到", "胃", "贲", "门", "之", "间", "的", "管", "道", "。", "食", "管", "可", "以", "分", "为", "颈", "段", "和", "胸", "段", ",", "而", "胸", "段", 
"又", "分", "为", "胸", "上", "段", "、", "胸", "中", "段", "和", "胸", "下", "段", "。", "食", "管", "本", "身", "有", "3", "个", "生", "理", "性", "的", "狭", "窄", ",", "这", "也", "是", "某", "些", "食", "管", "疾", "病", "发", "生", "的", "基", "础", "。", "常", "见", "的", "食", "管", "疾", "病", "包", "括", "食", "管", "炎", "、", "食", "管", "息", "肉", "、", "食", "管", "癌", "、", "食", "管", "狭", "窄", "、", "胃", "食", "管", "反", "流", "症", "、", "巴", "雷", "特", "食", "管", "等", "。", "可", "以", "通", "过", "消", "化", "道", "造", "影", "以", "及", "胃", "镜", "来", "进", "一", "步", "明", "确", "。", "食", "管", "和", "食", "道", "一", "样", "吗", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1854]} +{"id": 101, "title": "农历六月二十四是什么星座-星座乐", "context": "农历六月二十四是狮子座。狮子座,火象星座,位于黄道十二宫之第五宫,出生日期为阳历7月23日-8月22日。狮子座是英雄主义者,他们乐观,乐于助人,喜欢帮助弱势群体。他们天生自带光环,特立独行,做事豪爽大气,讲话淡定从容,从不扭扭捏捏畏畏缩缩。而且心思细腻,做事完整准确,善于将自己的优点发挥到极致。", "question": "农历六月二十四是什么星座", "sent_token": ["农", "历", "六", "月", "二", "十", "四", "是", "狮", "子", "座", "。", "狮", "子", "座", ",", "火", "象", "星", "座", ",", "位", "于", "黄", "道", "十", "二", "宫", "之", "第", "五", "宫", ",", "出", "生", "日", "期", "为", "阳", "历", "7", "月", "23", "日", "-", "8", "月", "22", "日", "。", "狮", "子", "座", "是", "英", "雄", "主", "义", "者", ",", "他", "们", "乐", "观", ",", "乐", "于", "助", "人", ",", "喜", "欢", "帮", "助", "弱", "势", "群", "体", "。", "他", "们", "天", "生", "自", "带", "光", "环", ",", "特", "立", "独", "行", ",", "做", "事", "豪", "爽", "大", "气", ",", "讲", "话", "淡", "定", "从", "容", ",", "从", "不", "扭", "扭", "捏", "捏", "畏", "畏", "缩", "缩", "。", "而", "且", "心", "思", "细", "腻", ",", "做", "事", "完", "整", "准", "确", ",", "善", "于", "将", "自", "己", "的", "优", "点", "发", "挥", "到", "极", "致", "。", "农", "历", "六", "月", "二", "十", "四", "是", "什", "么", "星", "座", "-", "星", "座", "乐"], "sample_type": "ori", "rel_ids": [1863]} +{"id": 105, "title": "", "context": "非法持有海洛因10克以上就构成非法持有毒品罪非法持有毒品罪,是指明知是鸦片、海洛因、甲基苯丙胺或者其他毒品,而非法持有且数量较大的行为。非法持有毒品达到一定数量才构成犯罪。", "question": "海洛因几克属于犯罪", "sent_token": ["非", "法", "持", "有", "海", "洛", "因", "10", "克", "以", "上", "就", "构", "成", "非", "法", "持", "有", "毒", "品", "罪", "非", "法", "持", "有", "毒", "品", "罪", ",", "是", "指", "明", "知", "是", "鸦", "片", "、", "海", "洛", "因", "、", "甲", "基", "苯", "丙", "胺", "或", "者", "其", "他", "毒", "品", ",", "而", "非", "法", "持", "有", "且", "数", "量", "较", "大", "的", "行", "为", "。", "非", "法", "持", "有", "毒", "品", "达", "到", "一", "定", "数", "量", "才", "构", "成", "犯", "罪", "。"], "sample_type": "ori", "rel_ids": [1867]} +{"id": 115, "title": "地方志书每几年左右编修一次_高三网", "context": "地方志书每20年左右编修一次。每一轮地方志书编修工作完成后,负责地方志工作的机构在编纂地方综合年鉴、搜集资料以及向社会提供咨询服务的同时,启动新一轮地方志书的续修工作。", "question": "地方质数没几年编修一次", "sent_token": ["地", "方", "志", "书", "每", "20", "年", "左", "右", "编", "修", "一", "次", "。", "每", "一", "轮", "地", "方", "志", "书", "编", "修", "工", "作", "完", "成", "后", ",", "负", "责", "地", "方", "志", "工", "作", "的", "机", "构", "在", "编", "纂", "地", "方", "综", "合", "年", "鉴", "、", "搜", "集", "资", "料", "以", "及", "向", "社", "会", "提", "供", "咨", "询", "服", "务", "的", "同", "时", ",", "启", "动", "新", "一", "轮", "地", "方", "志", "书", "的", "续", "修", "工", "作", "。", "地", "方", "志", "书", "每", "几", "年", "左", "右", "编", "修", "一", "次", "_", "高", "三", "网"], "sample_type": "ori", "rel_ids": [1877]} +{"id": 117, "title": "", "context": "《正气歌》是南宋诗人文天祥在狱中写的一首五言古诗。诗的开头即点出浩然正气存乎天地之间,至时穷之际,必然会显示出来。随后连用十二个典故,都是历史上有名的人物,他们的所作所为凛然显示出浩然正气的力量。接下来八句说明浩然正气贯日月,立天地,为三纲之命,道义之根。最后联系到自己的命运,自己虽然兵败被俘,处在极其恶劣的牢狱之中,但是由于自己一身正气,各种邪气和疾病都不能侵犯自己,因此自己能够坦然面对自己的命运。全诗感情深沉、气壮山河、直抒胸臆、毫无雕饰,充分体现了作者崇高的民族气节和强烈的爱国主义精神。", "question": "正气歌》的作者是", "sent_token": ["《", "正", "气", "歌", "》", "是", "南", "宋", "诗", "人", "文", "天", "祥", "在", "狱", "中", "写", "的", "一", "首", "五", "言", "古", "诗", "。", "诗", 
"的", "开", "头", "即", "点", "出", "浩", "然", "正", "气", "存", "乎", "天", "地", "之", "间", ",", "至", "时", "穷", "之", "际", ",", "必", "然", "会", "显", "示", "出", "来", "。", "随", "后", "连", "用", "十", "二", "个", "典", "故", ",", "都", "是", "历", "史", "上", "有", "名", "的", "人", "物", ",", "他", "们", "的", "所", "作", "所", "为", "凛", "然", "显", "示", "出", "浩", "然", "正", "气", "的", "力", "量", "。", "接", "下", "来", "八", "句", "说", "明", "浩", "然", "正", "气", "贯", "日", "月", ",", "立", "天", "地", ",", "为", "三", "纲", "之", "命", ",", "道", "义", "之", "根", "。", "最", "后", "联", "系", "到", "自", "己", "的", "命", "运", ",", "自", "己", "虽", "然", "兵", "败", "被", "俘", ",", "处", "在", "极", "其", "恶", "劣", "的", "牢", "狱", "之", "中", ",", "但", "是", "由", "于", "自", "己", "一", "身", "正", "气", ",", "各", "种", "邪", "气", "和", "疾", "病", "都", "不", "能", "侵", "犯", "自", "己", ",", "因", "此", "自", "己", "能", "够", "坦", "然", "面", "对", "自", "己", "的", "命", "运", "。", "全", "诗", "感", "情", "深", "沉", "、", "气", "壮", "山", "河", "、", "直", "抒", "胸", "臆", "、", "毫", "无", "雕", "饰", ",", "充", "分", "体", "现", "了", "作", "者", "崇", "高", "的", "民", "族", "气", "节", "和", "强", "烈", "的", "爱", "国", "主", "义", "精", "神", "。"], "sample_type": "ori", "rel_ids": [1879]} +{"id": 121, "title": "狗狗皮肤上长小脓包怎么回事", "context": "狗狗身上长脓包,是因为真菌感染或是寄生虫感染所致。如不及时处理脓包,会导致扩散全身,甚至溃烂。建议方法:戴上手套,把狗狗身上长脓包的地方挤一挤;然后用碘伏直接喷在患处;如有脓血可用医用纱布给它包在患处,等药效吸收后,取掉纱布;碘伏具有抗菌、消炎的作用,一天可以喷两三次;处理完狗狗伤口后用肥皂洗手。狗狗洗澡要用狗狗专门的沐浴露;洗后立即做吹干处理;定时用狗狗专用梳子,清理身上多余的杂毛;尽量带狗狗去干净的地方玩,回家后把狗狗的脚用抹布抹一次;多注意狗舍卫生,定时做消毒处理。", "question": "狗狗身上长小脓包是怎么回事", "sent_token": ["狗", "狗", "身", "上", "长", "脓", "包", ",", "是", "因", "为", "真", "菌", "感", "染", "或", "是", "寄", "生", "虫", "感", "染", "所", "致", "。", "如", "不", "及", "时", "处", "理", "脓", "包", ",", "会", "导", "致", "扩", "散", "全", "身", ",", "甚", "至", "溃", "烂", "。", "建", "议", "方", "法", ":", "戴", "上", "手", "套", ",", "把", "狗", "狗", "身", "上", "长", "脓", "包", "的", "地", "方", "挤", "一", "挤", ";", "然", "后", "用", "碘", "伏", "直", "接", "喷", "在", "患", "处", ";", "如", "有", "脓", "血", "可", "用", "医", "用", "纱", "布", "给", "它", "包", "在", "患", "处", ",", "等", "药", "效", "吸", "收", "后", ",", "取", "掉", "纱", "布", ";", "碘", "伏", "具", "有", "抗", "菌", "、", "消", "炎", "的", "作", "用", ",", "一", "天", "可", "以", "喷", "两", "三", "次", ";", "处", "理", "完", "狗", "狗", "伤", "口", "后", "用", "肥", "皂", "洗", "手", "。", "狗", "狗", "洗", "澡", "要", "用", "狗", "狗", "专", "门", "的", "沐", "浴", "露", ";", "洗", "后", "立", "即", "做", "吹", "干", "处", "理", ";", "定", "时", "用", "狗", "狗", "专", "用", "梳", "子", ",", "清", "理", "身", "上", "多", "余", "的", "杂", "毛", ";", "尽", "量", "带", "狗", "狗", "去", "干", "净", "的", "地", "方", "玩", ",", "回", "家", "后", "把", "狗", "狗", "的", "脚", "用", "抹", "布", "抹", "一", "次", ";", "多", "注", "意", "狗", "舍", "卫", "生", ",", "定", "时", "做", "消", "毒", "处", "理", "。", "狗", "狗", "皮", "肤", "上", "长", "小", "脓", "包", "怎", "么", "回", "事"], "sample_type": "ori", "rel_ids": [1883]} +{"id": 123, "title": "", "context": "新梓学校成立于2007年9月,是一所公办九年一贯制学校,座落在龙岗街道新生社区,紧邻水岸新都花园,交通十分便利。校园占地27500平方米,建筑面积16285平方米。", "question": "新梓学校地址", "sent_token": ["新", "梓", "学", "校", "成", "立", "于", "2007", "年", "9", "月", ",", "是", "一", "所", "公", "办", "九", "年", "一", "贯", "制", "学", "校", ",", "座", "落", "在", "龙", "岗", "街", "道", "新", "生", "社", "区", ",", "紧", "邻", "水", "岸", "新", "都", "花", "园", ",", "交", "通", "十", "分", "便", "利", "。", "校", "园", "占", "地", "27500", "平", "方", "米", ",", "建", "筑", "面", "积", "16285", "平", "方", "米", "。"], "sample_type": "ori", "rel_ids": [1885]} +{"id": 124, "title": "敷面膜脸痒是缺水吗?教你正确的认识_皮肤", "context": 
"当我们在洗完澡的时候,或者是敷面膜发现皮肤有一种痒痒的感觉,如果你确定面膜的质量是没有问题的,并且也确定你对这款面膜的物质没有过敏的情况下,皮肤出现痒的感觉,那可能的原因就是由于皮肤缺水。因为你的皮肤太缺水了,在给皮肤补水的时候就会出现一种痒的情况严重的时候,甚至会有刺痛的感觉。会让人觉得很不舒服,水分充足后会缓解。", "question": "脸痒是缺水吗", "sent_token": ["当", "我", "们", "在", "洗", "完", "澡", "的", "时", "候", ",", "或", "者", "是", "敷", "面", "膜", "发", "现", "皮", "肤", "有", "一", "种", "痒", "痒", "的", "感", "觉", ",", "如", "果", "你", "确", "定", "面", "膜", "的", "质", "量", "是", "没", "有", "问", "题", "的", ",", "并", "且", "也", "确", "定", "你", "对", "这", "款", "面", "膜", "的", "物", "质", "没", "有", "过", "敏", "的", "情", "况", "下", ",", "皮", "肤", "出", "现", "痒", "的", "感", "觉", ",", "那", "可", "能", "的", "原", "因", "就", "是", "由", "于", "皮", "肤", "缺", "水", "。", "因", "为", "你", "的", "皮", "肤", "太", "缺", "水", "了", ",", "在", "给", "皮", "肤", "补", "水", "的", "时", "候", "就", "会", "出", "现", "一", "种", "痒", "的", "情", "况", "严", "重", "的", "时", "候", ",", "甚", "至", "会", "有", "刺", "痛", "的", "感", "觉", "。", "会", "让", "人", "觉", "得", "很", "不", "舒", "服", ",", "水", "分", "充", "足", "后", "会", "缓", "解", "。", "敷", "面", "膜", "脸", "痒", "是", "缺", "水", "吗", "?", "教", "你", "正", "确", "的", "认", "识", "_", "皮", "肤"], "sample_type": "ori", "rel_ids": [1886]} +{"id": 126, "title": "无痛人流和药流哪个伤害比较小-有来医生", "context": "无痛人工流产手术和药物流产手术,相对比来说,还是药物流产伤害比较大。因为药物流产,阴道流血时间会比人工流产的阴道流血时间要长,一般人工流产,阴道流血时间不超过7天,而药物流产阴道流血的时间往往在15-20天左右才会干净。一直在有流血的状况下,宫口就是开放的,阴道又跟外界相通,跟宫颈又相通,这样造成细菌侵入感染的机会就会增加,所以容易导致生殖道的感染。另外,药物流产造成不全流产的可能性会大一些,需要做清宫手术。这样就可以想象出药物流产会比无痛人流伤害更大一些。", "question": "无痛人流和药流哪个伤害比较小", "sent_token": ["无", "痛", "人", "工", "流", "产", "手", "术", "和", "药", "物", "流", "产", "手", "术", ",", "相", "对", "比", "来", "说", ",", "还", "是", "药", "物", "流", "产", "伤", "害", "比", "较", "大", "。", "因", "为", "药", "物", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "会", "比", "人", "工", "流", "产", "的", "阴", "道", "流", "血", "时", "间", "要", "长", ",", "一", "般", "人", "工", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "不", "超", "过", "7", "天", ",", "而", "药", "物", "流", "产", "阴", "道", "流", "血", "的", "时", "间", "往", "往", "在", "15", "-", "20", "天", "左", "右", "才", "会", "干", "净", "。", "一", "直", "在", "有", "流", "血", "的", "状", "况", "下", ",", "宫", "口", "就", "是", "开", "放", "的", ",", "阴", "道", "又", "跟", "外", "界", "相", "通", ",", "跟", "宫", "颈", "又", "相", "通", ",", "这", "样", "造", "成", "细", "菌", "侵", "入", "感", "染", "的", "机", "会", "就", "会", "增", "加", ",", "所", "以", "容", "易", "导", "致", "生", "殖", "道", "的", "感", "染", "。", "另", "外", ",", "药", "物", "流", "产", "造", "成", "不", "全", "流", "产", "的", "可", "能", "性", "会", "大", "一", "些", ",", "需", "要", "做", "清", "宫", "手", "术", "。", "这", "样", "就", "可", "以", "想", "象", "出", "药", "物", "流", "产", "会", "比", "无", "痛", "人", "流", "伤", "害", "更", "大", "一", "些", "。", "无", "痛", "人", "流", "和", "药", "流", "哪", "个", "伤", "害", "比", "较", "小", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1888]} +{"id": 128, "title": "长期吃葡萄籽的副作用?_39健康问答_39健康网", "context": "长期吃葡萄籽不会有副作用,不用担心,葡萄籽中含有丰富的花青素,有美容养颜的功效。葡萄籽含有丰富的多种氨基酸、维生素及矿物质等,原花青素含量最高,有促进血液循环、保护视力、抗氧化去除自由基、降低血、保护心血管的作用,可以用于保健、美容。", "question": "葡萄籽能长期吃吗?有什么副作用?", "sent_token": ["长", "期", "吃", "葡", "萄", "籽", "不", "会", "有", "副", "作", "用", ",", "不", "用", "担", "心", ",", "葡", "萄", "籽", "中", "含", "有", "丰", "富", "的", "花", "青", "素", ",", "有", "美", "容", "养", "颜", "的", "功", "效", "。", "葡", "萄", "籽", "含", "有", "丰", "富", "的", "多", "种", "氨", "基", "酸", "、", "维", "生", "素", "及", "矿", "物", "质", "等", ",", "原", "花", "青", "素", "含", "量", "最", "高", ",", "有", "促", "进", "血", "液", "循", "环", "、", "保", "护", "视", "力", "、", "抗", "氧", "化", "去", "除", "自", "由", "基", "、", "降", "低", "血", "、", "保", "护", "心", "血", "管", "的", "作", "用", ",", "可", "以", "用", "于", "保", "健", 
"、", "美", "容", "。", "长", "期", "吃", "葡", "萄", "籽", "的", "副", "作", "用", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [1890]} +{"id": 132, "title": "红花哪里产的最好?_39健康问答_39健康网", "context": "红花在中国很多地方都是有种植的,比如河南,江苏,四川,河北等等。但是在众多产地中河南的商丘生产的红花应该是最好的了。红花有一种特殊的气味,特别香,味道稍微有点苦。红花是一种很好的植物,对人体有很好的保健作用。高血压患者可以服用一些,红花是有一定的降压作用的,另外还可以促进人体血液的循环,降低血脂。", "question": "世界上哪里的红花最好", "sent_token": ["红", "花", "在", "中", "国", "很", "多", "地", "方", "都", "是", "有", "种", "植", "的", ",", "比", "如", "河", "南", ",", "江", "苏", ",", "四", "川", ",", "河", "北", "等", "等", "。", "但", "是", "在", "众", "多", "产", "地", "中", "河", "南", "的", "商", "丘", "生", "产", "的", "红", "花", "应", "该", "是", "最", "好", "的", "了", "。", "红", "花", "有", "一", "种", "特", "殊", "的", "气", "味", ",", "特", "别", "香", ",", "味", "道", "稍", "微", "有", "点", "苦", "。", "红", "花", "是", "一", "种", "很", "好", "的", "植", "物", ",", "对", "人", "体", "有", "很", "好", "的", "保", "健", "作", "用", "。", "高", "血", "压", "患", "者", "可", "以", "服", "用", "一", "些", ",", "红", "花", "是", "有", "一", "定", "的", "降", "压", "作", "用", "的", ",", "另", "外", "还", "可", "以", "促", "进", "人", "体", "血", "液", "的", "循", "环", ",", "降", "低", "血", "脂", "。", "红", "花", "哪", "里", "产", "的", "最", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [1894]} +{"id": 135, "title": "", "context": "梳妆台指用来化妆的家具装饰。梳妆台一词,在现代家居中,已经被业主、客户、家居设计师广泛用到,现在泛指家具梳妆台。梳妆台尺寸标准的是总高度为1500mm左右,宽为700mm到1200mm,这样的梳妆台尺寸是大小正合适的,在家庭装修之前的前期准备时,就应该确定好梳妆台尺寸大小,同时梳妆台尺寸也要和房间的格调和风格统一起来。", "question": "梳妆台整体高度一般是多少", "sent_token": ["梳", "妆", "台", "指", "用", "来", "化", "妆", "的", "家", "具", "装", "饰", "。", "梳", "妆", "台", "一", "词", ",", "在", "现", "代", "家", "居", "中", ",", "已", "经", "被", "业", "主", "、", "客", "户", "、", "家", "居", "设", "计", "师", "广", "泛", "用", "到", ",", "现", "在", "泛", "指", "家", "具", "梳", "妆", "台", "。", "梳", "妆", "台", "尺", "寸", "标", "准", "的", "是", "总", "高", "度", "为", "1500mm", "左", "右", ",", "宽", "为", "700mm", "到", "1200mm", ",", "这", "样", "的", "梳", "妆", "台", "尺", "寸", "是", "大", "小", "正", "合", "适", "的", ",", "在", "家", "庭", "装", "修", "之", "前", "的", "前", "期", "准", "备", "时", ",", "就", "应", "该", "确", "定", "好", "梳", "妆", "台", "尺", "寸", "大", "小", ",", "同", "时", "梳", "妆", "台", "尺", "寸", "也", "要", "和", "房", "间", "的", "格", "调", "和", "风", "格", "统", "一", "起", "来", "。"], "sample_type": "ori", "rel_ids": [1897]} +{"id": 137, "title": "感冒能不能吃燕窝_妈妈网小百科", "context": "在感冒的时候尽量不要吃燕窝,虽然燕窝比较滋补,但是在感冒期间吃燕窝的话,并不利于感冒的恢复。在感冒期间应该吃得清淡一些,补充身体需要的水分,如果没有食欲的话可以多喝一些粥。在感冒期间可能吃药物的话,也不能够起到很好的效果,但是也要坚持吃药。", "question": "感冒可以吃燕窝吗?有效果吗?", "sent_token": ["在", "感", "冒", "的", "时", "候", "尽", "量", "不", "要", "吃", "燕", "窝", ",", "虽", "然", "燕", "窝", "比", "较", "滋", "补", ",", "但", "是", "在", "感", "冒", "期", "间", "吃", "燕", "窝", "的", "话", ",", "并", "不", "利", "于", "感", "冒", "的", "恢", "复", "。", "在", "感", "冒", "期", "间", "应", "该", "吃", "得", "清", "淡", "一", "些", ",", "补", "充", "身", "体", "需", "要", "的", "水", "分", ",", "如", "果", "没", "有", "食", "欲", "的", "话", "可", "以", "多", "喝", "一", "些", "粥", "。", "在", "感", "冒", "期", "间", "可", "能", "吃", "药", "物", "的", "话", ",", "也", "不", "能", "够", "起", "到", "很", "好", "的", "效", "果", ",", "但", "是", "也", "要", "坚", "持", "吃", "药", "。", "感", "冒", "能", "不", "能", "吃", "燕", "窝", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "ori", "rel_ids": [1899]} +{"id": 138, "title": "房颤会引起脑梗吗-有来医生", "context": "房颤会引起脑血管疾病,在医学上不叫脑梗叫脑栓塞,脑梗是脑血管本身病变引起的脑供血不足的情况,而脑栓塞是由于房颤心脏上形成了附壁血栓,当血栓的栓子脱落之后,就有可能堵塞在脑血管形成了脑拴塞,也是一种脑缺血的表现。治疗方法可以应用改善循环和营养神经的药物治疗,必须应用阿司匹林和氯吡格雷口服抗血小板聚集治疗,对于心房纤颤的患者,要控制心室率,应用阿司匹林和氯吡格雷等口服抗血小板聚集治疗,预防心脏附壁血栓的形成。", "question": 
"房颤会引起脑梗吗", "sent_token": ["房", "颤", "会", "引", "起", "脑", "血", "管", "疾", "病", ",", "在", "医", "学", "上", "不", "叫", "脑", "梗", "叫", "脑", "栓", "塞", ",", "脑", "梗", "是", "脑", "血", "管", "本", "身", "病", "变", "引", "起", "的", "脑", "供", "血", "不", "足", "的", "情", "况", ",", "而", "脑", "栓", "塞", "是", "由", "于", "房", "颤", "心", "脏", "上", "形", "成", "了", "附", "壁", "血", "栓", ",", "当", "血", "栓", "的", "栓", "子", "脱", "落", "之", "后", ",", "就", "有", "可", "能", "堵", "塞", "在", "脑", "血", "管", "形", "成", "了", "脑", "拴", "塞", ",", "也", "是", "一", "种", "脑", "缺", "血", "的", "表", "现", "。", "治", "疗", "方", "法", "可", "以", "应", "用", "改", "善", "循", "环", "和", "营", "养", "神", "经", "的", "药", "物", "治", "疗", ",", "必", "须", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "对", "于", "心", "房", "纤", "颤", "的", "患", "者", ",", "要", "控", "制", "心", "室", "率", ",", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "等", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "预", "防", "心", "脏", "附", "壁", "血", "栓", "的", "形", "成", "。", "房", "颤", "会", "引", "起", "脑", "梗", "吗", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1900]} +{"id": 144, "title": "二十天的婴儿能看多远_妈妈网小百科", "context": "20天的宝宝能够看到的距离大概是15厘米-20厘米左右,一般能够看到18厘米左右的事物。宝宝刚出生的时候视力极其差,有的甚至没有睁开眼,可以说基本什么都看不清楚,视力比较好的新生儿,也只能感受到光和影或大致的轮廓。", "question": "二十天的宝宝能看多远?", "sent_token": ["20", "天", "的", "宝", "宝", "能", "够", "看", "到", "的", "距", "离", "大", "概", "是", "15", "厘", "米", "-", "20", "厘", "米", "左", "右", ",", "一", "般", "能", "够", "看", "到", "18", "厘", "米", "左", "右", "的", "事", "物", "。", "宝", "宝", "刚", "出", "生", "的", "时", "候", "视", "力", "极", "其", "差", ",", "有", "的", "甚", "至", "没", "有", "睁", "开", "眼", ",", "可", "以", "说", "基", "本", "什", "么", "都", "看", "不", "清", "楚", ",", "视", "力", "比", "较", "好", "的", "新", "生", "儿", ",", "也", "只", "能", "感", "受", "到", "光", "和", "影", "或", "大", "致", "的", "轮", "廓", "。", "二", "十", "天", "的", "婴", "儿", "能", "看", "多", "远", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "ori", "rel_ids": [1906]} +{"id": 156, "title": "4价宫颈疫苗多少钱-有来医生", "context": "4价宫颈癌疫苗有国产疫苗和进口疫苗,国产疫苗价格比较便宜,预防宫颈癌的疫苗只有4价疫苗,具体价格不同地区以及不同生产厂家生产的疫苗,所定价格也不一样。在北京4价宫颈癌疫苗,价格大概是800元左右,总共需要接种三针,需要在半年内接种完,分别在第一个月,第2个月和第6个月各接种一针次,接种年龄是20-45周岁,建议咨询当地疾病预防控制机构,所进疫苗的具体价格比较准确。比如江苏省从2019年开始,所有有价疫苗都是零差价出售,每接种一针次,收取20元材料费和注射费,目前接种宫颈癌疫苗,应该先预约才可以接种。", "question": "国产宫颈疫苗有几价", "sent_token": ["4", "价", "宫", "颈", "癌", "疫", "苗", "有", "国", "产", "疫", "苗", "和", "进", "口", "疫", "苗", ",", "国", "产", "疫", "苗", "价", "格", "比", "较", "便", "宜", ",", "预", "防", "宫", "颈", "癌", "的", "疫", "苗", "只", "有", "4", "价", "疫", "苗", ",", "具", "体", "价", "格", "不", "同", "地", "区", "以", "及", "不", "同", "生", "产", "厂", "家", "生", "产", "的", "疫", "苗", ",", "所", "定", "价", "格", "也", "不", "一", "样", "。", "在", "北", "京", "4", "价", "宫", "颈", "癌", "疫", "苗", ",", "价", "格", "大", "概", "是", "800", "元", "左", "右", ",", "总", "共", "需", "要", "接", "种", "三", "针", ",", "需", "要", "在", "半", "年", "内", "接", "种", "完", ",", "分", "别", "在", "第", "一", "个", "月", ",", "第", "2", "个", "月", "和", "第", "6", "个", "月", "各", "接", "种", "一", "针", "次", ",", "接", "种", "年", "龄", "是", "20", "-", "45", "周", "岁", ",", "建", "议", "咨", "询", "当", "地", "疾", "病", "预", "防", "控", "制", "机", "构", ",", "所", "进", "疫", "苗", "的", "具", "体", "价", "格", "比", "较", "准", "确", "。", "比", "如", "江", "苏", "省", "从", "2019", "年", "开", "始", ",", "所", "有", "有", "价", "疫", "苗", "都", "是", "零", "差", "价", "出", "售", ",", "每", "接", "种", "一", "针", "次", ",", "收", "取", "20", "元", "材", "料", "费", "和", "注", "射", "费", ",", "目", "前", "接", "种", "宫", "颈", "癌", "疫", "苗", ",", "应", "该", "先", "预", "约", 
"才", "可", "以", "接", "种", "。", "4", "价", "宫", "颈", "疫", "苗", "多", "少", "钱", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1918]} +{"id": 183, "title": "hiit是什么", "context": "hiit是高强度间歇训练,主要是通过进行多组高强度的间隙,和低强度的动作组合训练,这种训练方式能够在短时间内高速燃烧脂肪,非常适合锻炼时间较少或无法长时间坚持锻炼的人。", "question": "什么是HIIT", "sent_token": ["hiit", "是", "高", "强", "度", "间", "歇", "训", "练", ",", "主", "要", "是", "通", "过", "进", "行", "多", "组", "高", "强", "度", "的", "间", "隙", ",", "和", "低", "强", "度", "的", "动", "作", "组", "合", "训", "练", ",", "这", "种", "训", "练", "方", "式", "能", "够", "在", "短", "时", "间", "内", "高", "速", "燃", "烧", "脂", "肪", ",", "非", "常", "适", "合", "锻", "炼", "时", "间", "较", "少", "或", "无", "法", "长", "时", "间", "坚", "持", "锻", "炼", "的", "人", "。", "hiit", "是", "什", "么"], "sample_type": "ori", "rel_ids": [1945]} +{"id": 187, "title": "民生信用卡的客服电话多少?-其他问题知识问答-我爱卡", "context": "民生银行的信用卡的24小时客服电话为400-669-5568,持卡人在办卡或用卡的过程中,有任何疑问,都可以拨打民生银行信用卡客服电话,通过人工客服,来进行咨询。同时,持卡人也可以通过客服电话,办理信用卡激活、修改密码、更改账单日等业务。", "question": "民生信用卡客服", "sent_token": ["民", "生", "银", "行", "的", "信", "用", "卡", "的", "24", "小", "时", "客", "服", "电", "话", "为", "400", "-", "669", "-", "5568", ",", "持", "卡", "人", "在", "办", "卡", "或", "用", "卡", "的", "过", "程", "中", ",", "有", "任", "何", "疑", "问", ",", "都", "可", "以", "拨", "打", "民", "生", "银", "行", "信", "用", "卡", "客", "服", "电", "话", ",", "通", "过", "人", "工", "客", "服", ",", "来", "进", "行", "咨", "询", "。", "同", "时", ",", "持", "卡", "人", "也", "可", "以", "通", "过", "客", "服", "电", "话", ",", "办", "理", "信", "用", "卡", "激", "活", "、", "修", "改", "密", "码", "、", "更", "改", "账", "单", "日", "等", "业", "务", "。", "民", "生", "信", "用", "卡", "的", "客", "服", "电", "话", "多", "少", "?", "-", "其", "他", "问", "题", "知", "识", "问", "答", "-", "我", "爱", "卡"], "sample_type": "ori", "rel_ids": [1949]} +{"id": 194, "title": "", "context": "法令纹位於鼻翼两侧往下延伸至嘴的附近,也称寿带,法令若垂长,亦为长寿之象徵。不过女性多半不喜欢脸上出现法令纹,因为这意味脸部皮肤松弛,是老化的迹象。", "question": "哪里是法令纹?", "sent_token": ["法", "令", "纹", "位", "於", "鼻", "翼", "两", "侧", "往", "下", "延", "伸", "至", "嘴", "的", "附", "近", ",", "也", "称", "寿", "带", ",", "法", "令", "若", "垂", "长", ",", "亦", "为", "长", "寿", "之", "象", "徵", "。", "不", "过", "女", "性", "多", "半", "不", "喜", "欢", "脸", "上", "出", "现", "法", "令", "纹", ",", "因", "为", "这", "意", "味", "脸", "部", "皮", "肤", "松", "弛", ",", "是", "老", "化", "的", "迹", "象", "。"], "sample_type": "ori", "rel_ids": [1956]} +{"id": 204, "title": "婴儿轻微肠炎能自愈吗_妈妈网小百科", "context": "婴儿轻微肠炎不能自愈。肠炎是一种炎症,其发病的原因与胃肠道失调有关联。婴儿胃肠道菌群出现了失调的异常,就会引发肠炎的出现。尽管是比较轻微的肠炎,但还是有炎症的存在。婴儿轻微肠炎需要就医进行治疗,需要吃药促使炎症的消除。", "question": "婴儿轻度肠炎能自愈吗", "sent_token": ["婴", "儿", "轻", "微", "肠", "炎", "不", "能", "自", "愈", "。", "肠", "炎", "是", "一", "种", "炎", "症", ",", "其", "发", "病", "的", "原", "因", "与", "胃", "肠", "道", "失", "调", "有", "关", "联", "。", "婴", "儿", "胃", "肠", "道", "菌", "群", "出", "现", "了", "失", "调", "的", "异", "常", ",", "就", "会", "引", "发", "肠", "炎", "的", "出", "现", "。", "尽", "管", "是", "比", "较", "轻", "微", "的", "肠", "炎", ",", "但", "还", "是", "有", "炎", "症", "的", "存", "在", "。", "婴", "儿", "轻", "微", "肠", "炎", "需", "要", "就", "医", "进", "行", "治", "疗", ",", "需", "要", "吃", "药", "促", "使", "炎", "症", "的", "消", "除", "。", "婴", "儿", "轻", "微", "肠", "炎", "能", "自", "愈", "吗", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "ori", "rel_ids": [1966]} +{"id": 215, "title": "", "context": "珍珠鸟作者简介冯骥才,当代作家,1942年生于天津,原籍浙江慈溪市人。从小喜爱美术、文学和球类活动。曾当过专业篮球运动员,从事过绘画。", "question": "冯骥才什么时候出生", "sent_token": ["珍", "珠", "鸟", "作", "者", "简", "介", "冯", "骥", "才", ",", "当", "代", "作", "家", ",", "1942", "年", "生", "于", "天", "津", ",", "原", "籍", "浙", "江", "慈", "溪", "市", "人", "。", "从", "小", "喜", "爱", "美", "术", "、", "文", "学", "和", "球", 
"类", "活", "动", "。", "曾", "当", "过", "专", "业", "篮", "球", "运", "动", "员", ",", "从", "事", "过", "绘", "画", "。"], "sample_type": "ori", "rel_ids": [1977]} +{"id": 221, "title": "哺乳期可以吃维生素b2吗_有问必答_快速问医生", "context": "你好,口腔溃疡一般都是由于维生素缺乏导致的,与口腔炎症和上火也有关,可以服用维生素b2和维生素c治疗。用西瓜皮煮水喝,可以清热去火。局部用口腔溃疡散或者用维生素c研磨成粉末涂抹,都可以有效缓解疼痛。孕妇正常也要补充维生素的,服用维生素b2没有问题的。平时一定要多吃新鲜蔬菜水果,补充维生素,注意口腔卫生,早晚刷牙,饭后用温水漱口,每天早上起床用淡盐水漱口。", "question": "哺乳期能吃维生素b2片吗", "sent_token": ["你", "好", ",", "口", "腔", "溃", "疡", "一", "般", "都", "是", "由", "于", "维", "生", "素", "缺", "乏", "导", "致", "的", ",", "与", "口", "腔", "炎", "症", "和", "上", "火", "也", "有", "关", ",", "可", "以", "服", "用", "维", "生", "素", "b2", "和", "维", "生", "素", "c", "治", "疗", "。", "用", "西", "瓜", "皮", "煮", "水", "喝", ",", "可", "以", "清", "热", "去", "火", "。", "局", "部", "用", "口", "腔", "溃", "疡", "散", "或", "者", "用", "维", "生", "素", "c", "研", "磨", "成", "粉", "末", "涂", "抹", ",", "都", "可", "以", "有", "效", "缓", "解", "疼", "痛", "。", "孕", "妇", "正", "常", "也", "要", "补", "充", "维", "生", "素", "的", ",", "服", "用", "维", "生", "素", "b2", "没", "有", "问", "题", "的", "。", "平", "时", "一", "定", "要", "多", "吃", "新", "鲜", "蔬", "菜", "水", "果", ",", "补", "充", "维", "生", "素", ",", "注", "意", "口", "腔", "卫", "生", ",", "早", "晚", "刷", "牙", ",", "饭", "后", "用", "温", "水", "漱", "口", ",", "每", "天", "早", "上", "起", "床", "用", "淡", "盐", "水", "漱", "口", "。", "哺", "乳", "期", "可", "以", "吃", "维", "生", "素", "b2", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "ori", "rel_ids": [1983]} +{"id": 231, "title": "6岁儿童吃几颗肠虫清,吃肠虫清需要忌口吗_孕育常识_亲子宝典库_", "context": "肠虫清是六岁儿童就可以服用的一次吃两片,是吃饱饭后吃,肠虫清的主要是驱虫的药物,一般在晚上睡前服用的是比较好的,服药期间要多喝开水,多吃清淡易消化的食物,忌辛辣刺激性食物和油腻煎炸的食物,注意保暖避免着凉。", "question": "6岁儿童吃几颗肠虫清", "sent_token": ["肠", "虫", "清", "是", "六", "岁", "儿", "童", "就", "可", "以", "服", "用", "的", "一", "次", "吃", "两", "片", ",", "是", "吃", "饱", "饭", "后", "吃", ",", "肠", "虫", "清", "的", "主", "要", "是", "驱", "虫", "的", "药", "物", ",", "一", "般", "在", "晚", "上", "睡", "前", "服", "用", "的", "是", "比", "较", "好", "的", ",", "服", "药", "期", "间", "要", "多", "喝", "开", "水", ",", "多", "吃", "清", "淡", "易", "消", "化", "的", "食", "物", ",", "忌", "辛", "辣", "刺", "激", "性", "食", "物", "和", "油", "腻", "煎", "炸", "的", "食", "物", ",", "注", "意", "保", "暖", "避", "免", "着", "凉", "。", "6", "岁", "儿", "童", "吃", "几", "颗", "肠", "虫", "清", ",", "吃", "肠", "虫", "清", "需", "要", "忌", "口", "吗", "_", "孕", "育", "常", "识", "_", "亲", "子", "宝", "典", "库", "_"], "sample_type": "ori", "rel_ids": [1993]} +{"id": 241, "title": "隔阂意味着是什么意思", "context": "隔阂意味着很多意思,通常隔阂就意味着可能双方之间沟通有问题,比如有些夫妻或者是男女朋友之间吵架,两个人一起冷战,两个人由于没有沟通,双方之间的误会和矛盾就会越来越多了,也有可能是两个人总是以争吵的方式来解决问题,像这样的话就达不到有效的沟通,两个人两个人越不沟通,双方之间的矛盾和争吵就会越来越多,这个时候就会产生深深的隔阂。也有可能是双峰之间的价值观完全不同,比如对待某些问题的时候,有些人比较理性,但是有些人会比较感性,这个时候价值观不同的话就非常容易产生隔阂。", "question": "隔阂什么意思", "sent_token": ["隔", "阂", "意", "味", "着", "很", "多", "意", "思", ",", "通", "常", "隔", "阂", "就", "意", "味", "着", "可", "能", "双", "方", "之", "间", "沟", "通", "有", "问", "题", ",", "比", "如", "有", "些", "夫", "妻", "或", "者", "是", "男", "女", "朋", "友", "之", "间", "吵", "架", ",", "两", "个", "人", "一", "起", "冷", "战", ",", "两", "个", "人", "由", "于", "没", "有", "沟", "通", ",", "双", "方", "之", "间", "的", "误", "会", "和", "矛", "盾", "就", "会", "越", "来", "越", "多", "了", ",", "也", "有", "可", "能", "是", "两", "个", "人", "总", "是", "以", "争", "吵", "的", "方", "式", "来", "解", "决", "问", "题", ",", "像", "这", "样", "的", "话", "就", "达", "不", "到", "有", "效", "的", "沟", "通", ",", "两", "个", "人", "两", "个", "人", "越", "不", "沟", "通", ",", "双", "方", "之", "间", "的", "矛", "盾", "和", "争", "吵", "就", "会", "越", "来", "越", "多", ",", "这", "个", "时", "候", "就", "会", "产", "生", "深", "深", "的", "隔", "阂", "。", "也", "有", "可", 
"能", "是", "双", "峰", "之", "间", "的", "价", "值", "观", "完", "全", "不", "同", ",", "比", "如", "对", "待", "某", "些", "问", "题", "的", "时", "候", ",", "有", "些", "人", "比", "较", "理", "性", ",", "但", "是", "有", "些", "人", "会", "比", "较", "感", "性", ",", "这", "个", "时", "候", "价", "值", "观", "不", "同", "的", "话", "就", "非", "常", "容", "易", "产", "生", "隔", "阂", "。", "隔", "阂", "意", "味", "着", "是", "什", "么", "意", "思"], "sample_type": "ori", "rel_ids": [2003]} +{"id": 242, "title": "小儿癫痫病能彻底治愈的吗_有问必答_快速问医生", "context": "你好,很高兴为你服务,目前小儿癫痫是可以治愈的,不同的癫痫类型以及患者的实际病情不同,其适合的治疗方法也是不尽相同的。现在常见的小儿癫痫治疗都是采用中医为基础的治疗方法,这样对患儿的伤害较小,而西医则有很大的副作用,好吧", "question": "小儿癫痫能治愈吗", "sent_token": ["你", "好", ",", "很", "高", "兴", "为", "你", "服", "务", ",", "目", "前", "小", "儿", "癫", "痫", "是", "可", "以", "治", "愈", "的", ",", "不", "同", "的", "癫", "痫", "类", "型", "以", "及", "患", "者", "的", "实", "际", "病", "情", "不", "同", ",", "其", "适", "合", "的", "治", "疗", "方", "法", "也", "是", "不", "尽", "相", "同", "的", "。", "现", "在", "常", "见", "的", "小", "儿", "癫", "痫", "治", "疗", "都", "是", "采", "用", "中", "医", "为", "基", "础", "的", "治", "疗", "方", "法", ",", "这", "样", "对", "患", "儿", "的", "伤", "害", "较", "小", ",", "而", "西", "医", "则", "有", "很", "大", "的", "副", "作", "用", ",", "好", "吧", "小", "儿", "癫", "痫", "病", "能", "彻", "底", "治", "愈", "的", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "ori", "rel_ids": [2004]} +{"id": 250, "title": "脑内多发腔隙性脑梗死严重吗_39健康问答_39健康网", "context": "脑内多发腔隙性脑梗死,部分软化灶形成,一般不严重,是细枝血管梗塞,引起小灶脑组织坏死,脑组织软化灶,其他部位的脑组织会替代坏死部位的脑组织功能,所以一般没有不适的症状。注意控制血压,清淡饮食,控制血脂,血粘度,精神放松,解除思想顾虑,多做室外文娱体育活动,精神愉快,多接受紫外线照射,多喝开水,会有利于康复。可以根据情况使用疏通血管的药物。", "question": "多发腔隙性脑梗死吃什么中药", "sent_token": ["脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", ",", "部", "分", "软", "化", "灶", "形", "成", ",", "一", "般", "不", "严", "重", ",", "是", "细", "枝", "血", "管", "梗", "塞", ",", "引", "起", "小", "灶", "脑", "组", "织", "坏", "死", ",", "脑", "组", "织", "软", "化", "灶", ",", "其", "他", "部", "位", "的", "脑", "组", "织", "会", "替", "代", "坏", "死", "部", "位", "的", "脑", "组", "织", "功", "能", ",", "所", "以", "一", "般", "没", "有", "不", "适", "的", "症", "状", "。", "注", "意", "控", "制", "血", "压", ",", "清", "淡", "饮", "食", ",", "控", "制", "血", "脂", ",", "血", "粘", "度", ",", "精", "神", "放", "松", ",", "解", "除", "思", "想", "顾", "虑", ",", "多", "做", "室", "外", "文", "娱", "体", "育", "活", "动", ",", "精", "神", "愉", "快", ",", "多", "接", "受", "紫", "外", "线", "照", "射", ",", "多", "喝", "开", "水", ",", "会", "有", "利", "于", "康", "复", "。", "可", "以", "根", "据", "情", "况", "使", "用", "疏", "通", "血", "管", "的", "药", "物", "。", "脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", "严", "重", "吗", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [2012]} +{"id": 1763, "title": "地瓜是红薯吗", "context": "地瓜不是红薯。地瓜一般生吃或者凉拌,外形是纺锤型的,有明显的瓣状结构,内里的肉是白色的,有清淡的药香味,生吃又脆又甜,常食用可以预防肝癌、胃癌,营养价值非常高。红薯是粗粮,也叫番薯山芋。它是一种属管状花目,旋花科一年生的草本植物,富含丰富的矿物质和维生素,而且非常耐饱。", "question": "马铃薯和红苕指的是同一个物种吗", "sent_token": ["地", "瓜", "不", "是", "红", "薯", "。", "地", "瓜", "一", "般", "生", "吃", "或", "者", "凉", "拌", ",", "外", "形", "是", "纺", "锤", "型", "的", ",", "有", "明", "显", "的", "瓣", "状", "结", "构", ",", "内", "里", "的", "肉", "是", "白", "色", "的", ",", "有", "清", "淡", "的", "药", "香", "味", ",", "生", "吃", "又", "脆", "又", "甜", ",", "常", "食", "用", "可", "以", "预", "防", "肝", "癌", "、", "胃", "癌", ",", "营", "养", "价", "值", "非", "常", "高", "。", "红", "薯", "是", "粗", "粮", ",", "也", "叫", "番", "薯", "山", "芋", "。", "它", "是", "一", "种", "属", "管", "状", "花", "目", ",", "旋", "花", "科", "一", "年", "生", "的", "草", "本", "植", "物", ",", "富", "含", "丰", "富", "的", "矿", "物", "质", "和", "维", "生", "素", ",", "而", "且", "非", "常", "耐", "饱", "。", "地", 
"瓜", "是", "红", "薯", "吗"], "sample_type": "disturb"} +{"id": 1767, "title": "已满多少岁的人犯贩卖毒品罪应负刑事责任", "context": "根据《刑法》第十七条:已满十六周岁的人犯罪,应当负刑事责任。已满十四周岁不满十六周岁的人,犯故意杀人、故意伤害致人重伤或者死亡、强奸、抢劫、贩卖毒品、放火、爆炸、投放危险物质罪的,应当负刑事责任。", "question": "贩卖毒品需要负刑事责任的人要满几周岁", "sent_token": ["根", "据", "《", "刑", "法", "》", "第", "十", "七", "条", ":", "已", "满", "十", "六", "周", "岁", "的", "人", "犯", "罪", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "十", "四", "周", "岁", "不", "满", "十", "六", "周", "岁", "的", "人", ",", "犯", "故", "意", "杀", "人", "、", "故", "意", "伤", "害", "致", "人", "重", "伤", "或", "者", "死", "亡", "、", "强", "奸", "、", "抢", "劫", "、", "贩", "卖", "毒", "品", "、", "放", "火", "、", "爆", "炸", "、", "投", "放", "危", "险", "物", "质", "罪", "的", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "多", "少", "岁", "的", "人", "犯", "贩", "卖", "毒", "品", "罪", "应", "负", "刑", "事", "责", "任"], "sample_type": "disturb"} +{"id": 1772, "title": "读研跟考研有什么区别", "context": "考研和读研的区别在于概念和意义不同。考研是指考生通过考试来得到研究生的入学资格,而考生并不是硕士研究生;而读研是指学生在高校攻读硕士研究生的过程,学生身份已经是硕士研究生。这二者并不等同,而是有先后关系,也就是说考生只有通过考研,才能成为硕士研究生,然后在规定的学习时间内读研。", "question": "考取研究生跟攻读研究生,具体什么区别?", "sent_token": ["考", "研", "和", "读", "研", "的", "区", "别", "在", "于", "概", "念", "和", "意", "义", "不", "同", "。", "考", "研", "是", "指", "考", "生", "通", "过", "考", "试", "来", "得", "到", "研", "究", "生", "的", "入", "学", "资", "格", ",", "而", "考", "生", "并", "不", "是", "硕", "士", "研", "究", "生", ";", "而", "读", "研", "是", "指", "学", "生", "在", "高", "校", "攻", "读", "硕", "士", "研", "究", "生", "的", "过", "程", ",", "学", "生", "身", "份", "已", "经", "是", "硕", "士", "研", "究", "生", "。", "这", "二", "者", "并", "不", "等", "同", ",", "而", "是", "有", "先", "后", "关", "系", ",", "也", "就", "是", "说", "考", "生", "只", "有", "通", "过", "考", "研", ",", "才", "能", "成", "为", "硕", "士", "研", "究", "生", ",", "然", "后", "在", "规", "定", "的", "学", "习", "时", "间", "内", "读", "研", "。", "读", "研", "跟", "考", "研", "有", "什", "么", "区", "别"], "sample_type": "disturb"} +{"id": 1774, "title": "多效唑能和磷酸二氢钾一起用吗", "context": "多效唑能和磷酸二氢钾一起用。多效唑是植物的生长调节剂,主要是控制作物疯长的。而磷酸二氢钾属于叶面肥,施用后可促使作物的叶色更加浓绿,根系发达,药效完全不同,也并不排斥,可以混合使用。不过要注意施用时要严格按照说明施加,不可过量,否则会阻碍生长。", "question": "磷酸一钾能和氯丁唑一起用OK吗", "sent_token": ["多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "。", "多", "效", "唑", "是", "植", "物", "的", "生", "长", "调", "节", "剂", ",", "主", "要", "是", "控", "制", "作", "物", "疯", "长", "的", "。", "而", "磷", "酸", "二", "氢", "钾", "属", "于", "叶", "面", "肥", ",", "施", "用", "后", "可", "促", "使", "作", "物", "的", "叶", "色", "更", "加", "浓", "绿", ",", "根", "系", "发", "达", ",", "药", "效", "完", "全", "不", "同", ",", "也", "并", "不", "排", "斥", ",", "可", "以", "混", "合", "使", "用", "。", "不", "过", "要", "注", "意", "施", "用", "时", "要", "严", "格", "按", "照", "说", "明", "施", "加", ",", "不", "可", "过", "量", ",", "否", "则", "会", "阻", "碍", "生", "长", "。", "多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "吗"], "sample_type": "disturb"} +{"id": 1776, "title": "猫能吃蛋黄吗", "context": "猫咪是可以吃蛋黄的。这里特定煮熟的白水蛋,猫咪不能吃生鸡蛋,因为生鸡蛋中有细菌,常见的是沙门氏菌,容易引起猫腹泻脱水,而且饲喂猫咪最好的只饲喂蛋黄。虽然可以吃蛋黄,但是需要掌握好量,一般一周最多吃两三次就可了。蛋黄中也含有丰富的胆固醇,易引发猫咪患脂肪肝和高脂血病。", "question": "小猫咪可以吃蛋黄吗,生的", "sent_token": ["猫", "咪", "是", "可", "以", "吃", "蛋", "黄", "的", "。", "这", "里", "特", "定", "煮", "熟", "的", "白", "水", "蛋", ",", "猫", "咪", "不", "能", "吃", "生", "鸡", "蛋", ",", "因", "为", "生", "鸡", "蛋", "中", "有", "细", "菌", ",", "常", "见", "的", "是", "沙", "门", "氏", "菌", ",", "容", "易", "引", "起", "猫", "腹", "泻", "脱", "水", ",", "而", "且", "饲", "喂", "猫", "咪", "最", "好", "的", "只", "饲", "喂", "蛋", "黄", "。", "虽", "然", "可", "以", "吃", "蛋", "黄", ",", "但", "是", "需", "要", "掌", "握", "好", "量", ",", "一", "般", "一", "周", "最", "多", "吃", "两", "三", "次", "就", "可", 
"了", "。", "蛋", "黄", "中", "也", "含", "有", "丰", "富", "的", "胆", "固", "醇", ",", "易", "引", "发", "猫", "咪", "患", "脂", "肪", "肝", "和", "高", "脂", "血", "病", "。", "猫", "能", "吃", "蛋", "黄", "吗"], "sample_type": "disturb"} +{"id": 1780, "title": "最近深圳限行吗", "context": "现在由于疫情的影响,深圳市不限行的了,但是没有必要尽量还是少出门,出门也要做好一系列的防护措施才可以。因为虽然目前国内疫情形势有所缓和,但是这并不意味着疫情的结束,国外疫情形势还是很严峻的,境外输入案例较多。", "question": "近期深圳没有限行吗", "sent_token": ["现", "在", "由", "于", "疫", "情", "的", "影", "响", ",", "深", "圳", "市", "不", "限", "行", "的", "了", ",", "但", "是", "没", "有", "必", "要", "尽", "量", "还", "是", "少", "出", "门", ",", "出", "门", "也", "要", "做", "好", "一", "系", "列", "的", "防", "护", "措", "施", "才", "可", "以", "。", "因", "为", "虽", "然", "目", "前", "国", "内", "疫", "情", "形", "势", "有", "所", "缓", "和", ",", "但", "是", "这", "并", "不", "意", "味", "着", "疫", "情", "的", "结", "束", ",", "国", "外", "疫", "情", "形", "势", "还", "是", "很", "严", "峻", "的", ",", "境", "外", "输", "入", "案", "例", "较", "多", "。", "最", "近", "深", "圳", "限", "行", "吗"], "sample_type": "disturb"} +{"id": 1781, "title": "合同签字不盖章有效吗", "context": "可能有效可能无效。只有签字没有公章的合同是否有法律效力要根据具体情况分析:如果合同是由单位的委托代理人在其权限范围内、或单位的法定代表人签的字,则合同有效。", "question": "一没有签字,二没有盖章的合同,还有法律效用吗", "sent_token": ["可", "能", "有", "效", "可", "能", "无", "效", "。", "只", "有", "签", "字", "没", "有", "公", "章", "的", "合", "同", "是", "否", "有", "法", "律", "效", "力", "要", "根", "据", "具", "体", "情", "况", "分", "析", ":", "如", "果", "合", "同", "是", "由", "单", "位", "的", "委", "托", "代", "理", "人", "在", "其", "权", "限", "范", "围", "内", "、", "或", "单", "位", "的", "法", "定", "代", "表", "人", "签", "的", "字", ",", "则", "合", "同", "有", "效", "。", "合", "同", "签", "字", "不", "盖", "章", "有", "效", "吗"], "sample_type": "disturb"} +{"id": 1789, "title": "", "context": "吴三桂(1612年-1678年10月2日),字长伯,一字月所,明朝辽东人,明末清初著名政治军事人物,吴周政权建立者吴周太祖。", "question": "平西王吴三贵什么朝代", "sent_token": ["吴", "三", "桂", "(", "1612", "年", "-", "1678", "年", "10", "月", "2", "日", ")", ",", "字", "长", "伯", ",", "一", "字", "月", "所", ",", "明", "朝", "辽", "东", "人", ",", "明", "末", "清", "初", "著", "名", "政", "治", "军", "事", "人", "物", ",", "吴", "周", "政", "权", "建", "立", "者", "吴", "周", "太", "祖", "。"], "sample_type": "disturb"} +{"id": 1796, "title": "狗狗为什么互相闻屁股", "context": "相互闻屁股是狗狗打招呼的一种方式。狗狗的嗅觉很敏感,它们可以用相互闻屁股来了解狗狗的配偶状况、饮食习惯等,因为狗狗的屁股后面有两个肛门腺,在肛门腺里面涵盖了很多的信息素。处在发情期的狗狗也会通过闻屁股来挑选自己的配偶。", "question": "狗狗为何总是闻屁股", "sent_token": ["相", "互", "闻", "屁", "股", "是", "狗", "狗", "打", "招", "呼", "的", "一", "种", "方", "式", "。", "狗", "狗", "的", "嗅", "觉", "很", "敏", "感", ",", "它", "们", "可", "以", "用", "相", "互", "闻", "屁", "股", "来", "了", "解", "狗", "狗", "的", "配", "偶", "状", "况", "、", "饮", "食", "习", "惯", "等", ",", "因", "为", "狗", "狗", "的", "屁", "股", "后", "面", "有", "两", "个", "肛", "门", "腺", ",", "在", "肛", "门", "腺", "里", "面", "涵", "盖", "了", "很", "多", "的", "信", "息", "素", "。", "处", "在", "发", "情", "期", "的", "狗", "狗", "也", "会", "通", "过", "闻", "屁", "股", "来", "挑", "选", "自", "己", "的", "配", "偶", "。", "狗", "狗", "为", "什", "么", "互", "相", "闻", "屁", "股"], "sample_type": "disturb"} +{"id": 1798, "title": "出租房隔音差怎么解决", "context": "可以在窗户上贴一层隔音膜,在粘贴过程中要注意,不要出现气泡,以免影响隔音效果。若想要隔音效果更好点,还可以购买一些密封条安装在窗户缝隙处,这也能起到更好的隔音效果。另外,室内使用的家具可以更换成木质的,这样同样能起到一定的吸音效果。", "question": "出租房隔音不好如何解决", "sent_token": ["可", "以", "在", "窗", "户", "上", "贴", "一", "层", "隔", "音", "膜", ",", "在", "粘", "贴", "过", "程", "中", "要", "注", "意", ",", "不", "要", "出", "现", "气", "泡", ",", "以", "免", "影", "响", "隔", "音", "效", "果", "。", "若", "想", "要", "隔", "音", "效", "果", "更", "好", "点", ",", "还", "可", "以", "购", "买", "一", "些", "密", "封", "条", "安", "装", "在", "窗", "户", "缝", "隙", "处", ",", "这", "也", "能", "起", "到", "更", "好", "的", "隔", "音", "效", "果", "。", "另", "外", ",", "室", "内", 
"使", "用", "的", "家", "具", "可", "以", "更", "换", "成", "木", "质", "的", ",", "这", "样", "同", "样", "能", "起", "到", "一", "定", "的", "吸", "音", "效", "果", "。", "出", "租", "房", "隔", "音", "差", "怎", "么", "解", "决"], "sample_type": "disturb"} +{"id": 1802, "title": "鬼迷心窍(李宗盛演唱歌曲)_百度百科", "context": "《鬼迷心窍》是1992年黄日华、周海媚主演台湾电视剧《末代皇孙》的主题曲,是由李宗盛作词、作曲、演唱,收录于1992年影视剧音乐合辑《滚石九大天王之十二出好戏》当中。1993年,李宗盛凭借该曲获得第一届新加坡醉心金曲奖最佳作词奖", "question": "谁是鬼迷心窍的原唱", "sent_token": ["《", "鬼", "迷", "心", "窍", "》", "是", "1992", "年", "黄", "日", "华", "、", "周", "海", "媚", "主", "演", "台", "湾", "电", "视", "剧", "《", "末", "代", "皇", "孙", "》", "的", "主", "题", "曲", ",", "是", "由", "李", "宗", "盛", "作", "词", "、", "作", "曲", "、", "演", "唱", ",", "收", "录", "于", "1992", "年", "影", "视", "剧", "音", "乐", "合", "辑", "《", "滚", "石", "九", "大", "天", "王", "之", "十", "二", "出", "好", "戏", "》", "当", "中", "。", "1993", "年", ",", "李", "宗", "盛", "凭", "借", "该", "曲", "获", "得", "第", "一", "届", "新", "加", "坡", "醉", "心", "金", "曲", "奖", "最", "佳", "作", "词", "奖", "鬼", "迷", "心", "窍", "(", "李", "宗", "盛", "演", "唱", "歌", "曲", ")", "_", "百", "度", "百", "科"], "sample_type": "disturb"} +{"id": 1803, "title": "", "context": "白龙马,名著小说《西游记》中的重要角色。本是西海龙王三太子,因纵火烧毁玉帝赏赐的明珠而被西海龙王上天告忤逆,要被斩首。后因南海观世菩萨出面才免于死罪,被贬到蛇盘山鹰愁涧等待唐僧取经。之后又误吃唐僧所骑的白马,被菩萨点化,变身为白龙。", "question": "西游记中的白龙马,它的原始身份是什么", "sent_token": ["白", "龙", "马", ",", "名", "著", "小", "说", "《", "西", "游", "记", "》", "中", "的", "重", "要", "角", "色", "。", "本", "是", "西", "海", "龙", "王", "三", "太", "子", ",", "因", "纵", "火", "烧", "毁", "玉", "帝", "赏", "赐", "的", "明", "珠", "而", "被", "西", "海", "龙", "王", "上", "天", "告", "忤", "逆", ",", "要", "被", "斩", "首", "。", "后", "因", "南", "海", "观", "世", "菩", "萨", "出", "面", "才", "免", "于", "死", "罪", ",", "被", "贬", "到", "蛇", "盘", "山", "鹰", "愁", "涧", "等", "待", "唐", "僧", "取", "经", "。", "之", "后", "又", "误", "吃", "唐", "僧", "所", "骑", "的", "白", "马", ",", "被", "菩", "萨", "点", "化", ",", "变", "身", "为", "白", "龙", "。"], "sample_type": "disturb"} +{"id": 1805, "title": "", "context": "《湮灭》是由派拉蒙影业出品的科幻惊悚片,这部电影集合了科幻、悬疑、惊悚等时下流行的元素,由亚历克斯·加兰执导,娜塔莉·波特曼、詹妮弗·杰森·李、吉娜·罗德里格兹、泰莎·汤普森联合主演。该片于2018年2月23日在美国上映。影片根据杰夫·梵德米尔所著《遗落的南境》三部曲的首部同名小说改编,讲述了生物学家莉娜为了自己的丈夫,她自愿加入了科学考察探险小队,去研究美国领土一块被检疫隔离的生态灾害区域的故事。", "question": "湮灭是什么类型的电影", "sent_token": ["《", "湮", "灭", "》", "是", "由", "派", "拉", "蒙", "影", "业", "出", "品", "的", "科", "幻", "惊", "悚", "片", ",", "这", "部", "电", "影", "集", "合", "了", "科", "幻", "、", "悬", "疑", "、", "惊", "悚", "等", "时", "下", "流", "行", "的", "元", "素", ",", "由", "亚", "历", "克", "斯", "·", "加", "兰", "执", "导", ",", "娜", "塔", "莉", "·", "波", "特", "曼", "、", "詹", "妮", "弗", "·", "杰", "森", "·", "李", "、", "吉", "娜", "·", "罗", "德", "里", "格", "兹", "、", "泰", "莎", "·", "汤", "普", "森", "联", "合", "主", "演", "。", "该", "片", "于", "2018", "年", "2", "月", "23", "日", "在", "美", "国", "上", "映", "。", "影", "片", "根", "据", "杰", "夫", "·", "梵", "德", "米", "尔", "所", "著", "《", "遗", "落", "的", "南", "境", "》", "三", "部", "曲", "的", "首", "部", "同", "名", "小", "说", "改", "编", ",", "讲", "述", "了", "生", "物", "学", "家", "莉", "娜", "为", "了", "自", "己", "的", "丈", "夫", ",", "她", "自", "愿", "加", "入", "了", "科", "学", "考", "察", "探", "险", "小", "队", ",", "去", "研", "究", "美", "国", "领", "土", "一", "块", "被", "检", "疫", "隔", "离", "的", "生", "态", "灾", "害", "区", "域", "的", "故", "事", "。"], "sample_type": "disturb"} +{"id": 1807, "title": "", "context": "网球与高尔夫、保龄球、桌球并成为世界四大绅士运动,他的起源可以追溯到12-13世纪的法国。网球运动的起源及演变可以用四句话来概括:网球孕育在法国,诞生在英国,开始普及和形成高潮在美国,现盛行全世界。", "question": "网球发源于哪国", "sent_token": ["网", "球", "与", "高", "尔", "夫", "、", "保", "龄", "球", "、", "桌", "球", "并", "成", "为", "世", "界", "四", "大", "绅", "士", "运", "动", ",", "他", "的", "起", "源", "可", "以", "追", "溯", "到", 
"12", "-", "13", "世", "纪", "的", "法", "国", "。", "网", "球", "运", "动", "的", "起", "源", "及", "演", "变", "可", "以", "用", "四", "句", "话", "来", "概", "括", ":", "网", "球", "孕", "育", "在", "法", "国", ",", "诞", "生", "在", "英", "国", ",", "开", "始", "普", "及", "和", "形", "成", "高", "潮", "在", "美", "国", ",", "现", "盛", "行", "全", "世", "界", "。"], "sample_type": "disturb"} +{"id": 1810, "title": "单人挑战巫女大蛇悲鸣需要多少体力_单人挑战巫女大蛇悲鸣需要体力", "context": "玩家挑战巫女大蛇悲鸣的体力消耗是普通御魂副本的2倍。阴阳师巫女大蛇悲鸣单人通关需要12点体力组队通关的话只需要8点体力,挑战巫女大蛇悲鸣的体力消耗是普通御魂副本的2倍。奖励掉落5星与6星御魂,经验强化狗粮4星青吉鬼。在御魂副本1-10层原本掉落的基础上,巫女大蛇·悲鸣新增了蚌精、幽谷响、轮入道、蝠翼、狂骨这5种御魂的掉落,每日掉落御魂种类增加到5。", "question": "阴阳师 组队挑战大蛇悲鸣需要多少体力", "sent_token": ["玩", "家", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "的", "体", "力", "消", "耗", "是", "普", "通", "御", "魂", "副", "本", "的", "2", "倍", "。", "阴", "阳", "师", "巫", "女", "大", "蛇", "悲", "鸣", "单", "人", "通", "关", "需", "要", "12", "点", "体", "力", "组", "队", "通", "关", "的", "话", "只", "需", "要", "8", "点", "体", "力", ",", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "的", "体", "力", "消", "耗", "是", "普", "通", "御", "魂", "副", "本", "的", "2", "倍", "。", "奖", "励", "掉", "落", "5", "星", "与", "6", "星", "御", "魂", ",", "经", "验", "强", "化", "狗", "粮", "4", "星", "青", "吉", "鬼", "。", "在", "御", "魂", "副", "本", "1", "-", "10", "层", "原", "本", "掉", "落", "的", "基", "础", "上", ",", "巫", "女", "大", "蛇", "·", "悲", "鸣", "新", "增", "了", "蚌", "精", "、", "幽", "谷", "响", "、", "轮", "入", "道", "、", "蝠", "翼", "、", "狂", "骨", "这", "5", "种", "御", "魂", "的", "掉", "落", ",", "每", "日", "掉", "落", "御", "魂", "种", "类", "增", "加", "到", "5", "。", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "多", "少", "体", "力", "_", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "体", "力"], "sample_type": "disturb"} +{"id": 1815, "title": "", "context": "心脏是脊椎动物身体中最重要的一个器官,人类的心脏位于胸腔中部偏左,体积约相当于一个拳头大小,重量约350克。女性的心脏通常要比男性的体积小且重量轻。人的心脏外形像桃子,位于横膈之上,两肺间而偏左。", "question": "人类心脏有多重", "sent_token": ["心", "脏", "是", "脊", "椎", "动", "物", "身", "体", "中", "最", "重", "要", "的", "一", "个", "器", "官", ",", "人", "类", "的", "心", "脏", "位", "于", "胸", "腔", "中", "部", "偏", "左", ",", "体", "积", "约", "相", "当", "于", "一", "个", "拳", "头", "大", "小", ",", "重", "量", "约", "350", "克", "。", "女", "性", "的", "心", "脏", "通", "常", "要", "比", "男", "性", "的", "体", "积", "小", "且", "重", "量", "轻", "。", "人", "的", "心", "脏", "外", "形", "像", "桃", "子", ",", "位", "于", "横", "膈", "之", "上", ",", "两", "肺", "间", "而", "偏", "左", "。"], "sample_type": "disturb"} +{"id": 1816, "title": "紫菜变成紫色还能吃吗-有来医生", "context": "如果紫菜变成紫色的情况下,主要考虑还是紫菜受潮引起的,紫菜受潮以后容易滋生细菌,营养物质也会丧失,口感也会变差,一般情况下,建议不要食用,以免导致消化道的不良反应。紫菜中含有的营养物质是很丰富的,含有丰富的锌元素和铁元素,每天适当的吃一点,可以预防缺铁性贫血,可以预防缺锌引起的反复性口腔溃疡,可以增进食欲。", "question": "海苔回潮了还能吃不", "sent_token": ["如", "果", "紫", "菜", "变", "成", "紫", "色", "的", "情", "况", "下", ",", "主", "要", "考", "虑", "还", "是", "紫", "菜", "受", "潮", "引", "起", "的", ",", "紫", "菜", "受", "潮", "以", "后", "容", "易", "滋", "生", "细", "菌", ",", "营", "养", "物", "质", "也", "会", "丧", "失", ",", "口", "感", "也", "会", "变", "差", ",", "一", "般", "情", "况", "下", ",", "建", "议", "不", "要", "食", "用", ",", "以", "免", "导", "致", "消", "化", "道", "的", "不", "良", "反", "应", "。", "紫", "菜", "中", "含", "有", "的", "营", "养", "物", "质", "是", "很", "丰", "富", "的", ",", "含", "有", "丰", "富", "的", "锌", "元", "素", "和", "铁", "元", "素", ",", "每", "天", "适", "当", "的", "吃", "一", "点", ",", "可", "以", "预", "防", "缺", "铁", "性", "贫", "血", ",", "可", "以", "预", "防", "缺", "锌", "引", "起", "的", "反", "复", "性", "口", "腔", "溃", "疡", ",", "可", "以", "增", "进", "食", "欲", "。", "紫", "菜", "变", "成", "紫", "色", "还", "能", "吃", "吗", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1830, "title": "", "context": 
"钢铁侠是由美国漫威电影工作室出品的一部科幻冒险电影,改编自同名系列漫画。穿上盔甲后,托尼变身成了复仇者联盟中惩恶扬善的钢铁侠。复仇者联盟2:奥创纪元钢铁侠是美国演员小罗伯特·唐尼演的。小罗伯特唐尼的电影钢铁侠扮演者小罗伯特·。", "question": "谁演过钢铁侠", "sent_token": ["钢", "铁", "侠", "是", "由", "美", "国", "漫", "威", "电", "影", "工", "作", "室", "出", "品", "的", "一", "部", "科", "幻", "冒", "险", "电", "影", ",", "改", "编", "自", "同", "名", "系", "列", "漫", "画", "。", "穿", "上", "盔", "甲", "后", ",", "托", "尼", "变", "身", "成", "了", "复", "仇", "者", "联", "盟", "中", "惩", "恶", "扬", "善", "的", "钢", "铁", "侠", "。", "复", "仇", "者", "联", "盟", "2", ":", "奥", "创", "纪", "元", "钢", "铁", "侠", "是", "美", "国", "演", "员", "小", "罗", "伯", "特", "·", "唐", "尼", "演", "的", "。", "小", "罗", "伯", "特", "唐", "尼", "的", "电", "影", "钢", "铁", "侠", "扮", "演", "者", "小", "罗", "伯", "特", "·", "。"], "sample_type": "disturb"} +{"id": 1831, "title": "人间正道是沧桑是什么意思_酷知经验网", "context": "天若有情天亦老,人间正道是沧桑:上句借用李贺《金铜仙人辞汉歌》中诗句,原诗说的是汉武帝时制作的极贵重的宝物金铜仙人像,在三国时被魏明帝由长安迁往洛阳的传说。原句的意思是,对于这样的人间恨事,天若有情,也要因悲伤而衰老。人间正道,社会发展的正常规律。沧桑,沧海(大海)变为桑田,多指巨大的变化,这里比喻的是革命的道路艰难曲折。", "question": "人间正道是沧桑前面是什么", "sent_token": ["天", "若", "有", "情", "天", "亦", "老", ",", "人", "间", "正", "道", "是", "沧", "桑", ":", "上", "句", "借", "用", "李", "贺", "《", "金", "铜", "仙", "人", "辞", "汉", "歌", "》", "中", "诗", "句", ",", "原", "诗", "说", "的", "是", "汉", "武", "帝", "时", "制", "作", "的", "极", "贵", "重", "的", "宝", "物", "金", "铜", "仙", "人", "像", ",", "在", "三", "国", "时", "被", "魏", "明", "帝", "由", "长", "安", "迁", "往", "洛", "阳", "的", "传", "说", "。", "原", "句", "的", "意", "思", "是", ",", "对", "于", "这", "样", "的", "人", "间", "恨", "事", ",", "天", "若", "有", "情", ",", "也", "要", "因", "悲", "伤", "而", "衰", "老", "。", "人", "间", "正", "道", ",", "社", "会", "发", "展", "的", "正", "常", "规", "律", "。", "沧", "桑", ",", "沧", "海", "(", "大", "海", ")", "变", "为", "桑", "田", ",", "多", "指", "巨", "大", "的", "变", "化", ",", "这", "里", "比", "喻", "的", "是", "革", "命", "的", "道", "路", "艰", "难", "曲", "折", "。", "人", "间", "正", "道", "是", "沧", "桑", "是", "什", "么", "意", "思", "_", "酷", "知", "经", "验", "网"], "sample_type": "disturb"} +{"id": 1834, "title": "", "context": "《艺妓回忆录》根据美国作家阿瑟-高顿的同名小说改编。于2005年12月1日上映,由章子怡·巩俐·杨紫琼等共同演绎。是一部时长约140分钟的电影。全篇充满着古典美,时代背景从1929年开始延续到二战结束,女主人公回忆了自己从小拼命挣扎、历尽荣辱的人生经历。该片获得2006年第78届奥斯卡金像奖最佳摄影、最佳艺术指导、最佳服装设计三项奖项。", "question": "艺伎回忆录片长有多久", "sent_token": ["《", "艺", "妓", "回", "忆", "录", "》", "根", "据", "美", "国", "作", "家", "阿", "瑟", "-", "高", "顿", "的", "同", "名", "小", "说", "改", "编", "。", "于", "2005", "年", "12", "月", "1", "日", "上", "映", ",", "由", "章", "子", "怡", "·", "巩", "俐", "·", "杨", "紫", "琼", "等", "共", "同", "演", "绎", "。", "是", "一", "部", "时", "长", "约", "140", "分", "钟", "的", "电", "影", "。", "全", "篇", "充", "满", "着", "古", "典", "美", ",", "时", "代", "背", "景", "从", "1929", "年", "开", "始", "延", "续", "到", "二", "战", "结", "束", ",", "女", "主", "人", "公", "回", "忆", "了", "自", "己", "从", "小", "拼", "命", "挣", "扎", "、", "历", "尽", "荣", "辱", "的", "人", "生", "经", "历", "。", "该", "片", "获", "得", "2006", "年", "第", "78", "届", "奥", "斯", "卡", "金", "像", "奖", "最", "佳", "摄", "影", "、", "最", "佳", "艺", "术", "指", "导", "、", "最", "佳", "服", "装", "设", "计", "三", "项", "奖", "项", "。"], "sample_type": "disturb"} +{"id": 1839, "title": "痛风挂哪个科室比较好?_39健康问答_39健康网", "context": "痛风属于代谢风湿性疾病,目前主要是在风湿免疫科治疗,所以患者需要挂风湿免疫科。风湿免疫科在绝大多数三级甲等医院都有独立的科室。由于这个科是一个新兴学科,在很多县级医院还没有成立,患者可以到内分泌科就诊,挂内分泌科。如果这两个科都没有患者,可以到骨科就诊,因为痛风首发表现是急性痛风性关节炎,骨科大夫对痛风也有一定的了解。", "question": "痛风属于什么类型疾病呢", "sent_token": ["痛", "风", "属", "于", "代", "谢", "风", "湿", "性", "疾", "病", ",", "目", "前", "主", "要", "是", "在", "风", "湿", "免", "疫", "科", "治", "疗", ",", "所", "以", "患", "者", "需", "要", "挂", "风", "湿", "免", "疫", "科", "。", "风", "湿", "免", "疫", "科", "在", "绝", "大", "多", "数", "三", "级", "甲", "等", "医", "院", 
"都", "有", "独", "立", "的", "科", "室", "。", "由", "于", "这", "个", "科", "是", "一", "个", "新", "兴", "学", "科", ",", "在", "很", "多", "县", "级", "医", "院", "还", "没", "有", "成", "立", ",", "患", "者", "可", "以", "到", "内", "分", "泌", "科", "就", "诊", ",", "挂", "内", "分", "泌", "科", "。", "如", "果", "这", "两", "个", "科", "都", "没", "有", "患", "者", ",", "可", "以", "到", "骨", "科", "就", "诊", ",", "因", "为", "痛", "风", "首", "发", "表", "现", "是", "急", "性", "痛", "风", "性", "关", "节", "炎", ",", "骨", "科", "大", "夫", "对", "痛", "风", "也", "有", "一", "定", "的", "了", "解", "。", "痛", "风", "挂", "哪", "个", "科", "室", "比", "较", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "disturb"} +{"id": 1844, "title": "阴阳师武士之灵生前被谁所杀_游侠网", "context": "由武士死后的灵魂化成。生前一直为主人效忠,最后献出了生命。从武士之灵的传记中可以得知,武士之灵生前是被茨木童子所击杀。该问题来自游戏内的逢魔密信,正确回答问题之后就有机会获得包括金币、体力、勾玉和结界卡在内的多种游戏内道具物资奖励。", "question": "武士之灵生前被谁所杀", "sent_token": ["由", "武", "士", "死", "后", "的", "灵", "魂", "化", "成", "。", "生", "前", "一", "直", "为", "主", "人", "效", "忠", ",", "最", "后", "献", "出", "了", "生", "命", "。", "从", "武", "士", "之", "灵", "的", "传", "记", "中", "可", "以", "得", "知", ",", "武", "士", "之", "灵", "生", "前", "是", "被", "茨", "木", "童", "子", "所", "击", "杀", "。", "该", "问", "题", "来", "自", "游", "戏", "内", "的", "逢", "魔", "密", "信", ",", "正", "确", "回", "答", "问", "题", "之", "后", "就", "有", "机", "会", "获", "得", "包", "括", "金", "币", "、", "体", "力", "、", "勾", "玉", "和", "结", "界", "卡", "在", "内", "的", "多", "种", "游", "戏", "内", "道", "具", "物", "资", "奖", "励", "。", "阴", "阳", "师", "武", "士", "之", "灵", "生", "前", "被", "谁", "所", "杀", "_", "游", "侠", "网"], "sample_type": "disturb"} +{"id": 1850, "title": "中医肾主什么-有来医生", "context": "根据中医基础理论,肾主水、主纳气、主二便、主藏精。人体的生长的生命过程与肾中精气的盛衰有着密切的关系,肾主水,是指全身的水液代谢都是在肾阳的气化温煦作用下,从而分布到全身,然后再通过呼吸、二便将代谢废物排除体外。肾主纳气,是指肾能够使人体维持正常的呼吸深度。肾主二便,人的大小便需要在肾的作用下,才能够正常的排泄,否则就会出现异常的改变,比如大小便失禁、大便稀薄等情况。肾主藏精,是指五脏六腑化生的精气,最后都是储存在肾脏,反过来肾脏所藏的精气,又能够推动各脏腑的功能。", "question": "肾主什么", "sent_token": ["根", "据", "中", "医", "基", "础", "理", "论", ",", "肾", "主", "水", "、", "主", "纳", "气", "、", "主", "二", "便", "、", "主", "藏", "精", "。", "人", "体", "的", "生", "长", "的", "生", "命", "过", "程", "与", "肾", "中", "精", "气", "的", "盛", "衰", "有", "着", "密", "切", "的", "关", "系", ",", "肾", "主", "水", ",", "是", "指", "全", "身", "的", "水", "液", "代", "谢", "都", "是", "在", "肾", "阳", "的", "气", "化", "温", "煦", "作", "用", "下", ",", "从", "而", "分", "布", "到", "全", "身", ",", "然", "后", "再", "通", "过", "呼", "吸", "、", "二", "便", "将", "代", "谢", "废", "物", "排", "除", "体", "外", "。", "肾", "主", "纳", "气", ",", "是", "指", "肾", "能", "够", "使", "人", "体", "维", "持", "正", "常", "的", "呼", "吸", "深", "度", "。", "肾", "主", "二", "便", ",", "人", "的", "大", "小", "便", "需", "要", "在", "肾", "的", "作", "用", "下", ",", "才", "能", "够", "正", "常", "的", "排", "泄", ",", "否", "则", "就", "会", "出", "现", "异", "常", "的", "改", "变", ",", "比", "如", "大", "小", "便", "失", "禁", "、", "大", "便", "稀", "薄", "等", "情", "况", "。", "肾", "主", "藏", "精", ",", "是", "指", "五", "脏", "六", "腑", "化", "生", "的", "精", "气", ",", "最", "后", "都", "是", "储", "存", "在", "肾", "脏", ",", "反", "过", "来", "肾", "脏", "所", "藏", "的", "精", "气", ",", "又", "能", "够", "推", "动", "各", "脏", "腑", "的", "功", "能", "。", "中", "医", "肾", "主", "什", "么", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1853, "title": "1963年属什么生肖年_十二生肖_卜易居", "context": "1963年中苏公开论战、美国黑人民权运动兴起、肯尼迪遇刺等事件震动世界。1963年属什么生肖年,葵卯兔年,属兔之人举止文雅,谈吐随和,为人恭良谦逊,与人交往如慕春风,学习能力超群,敏捷果断,安贫乐道。虽性子柔弱,但韧性极强,绝境之中能力惊人,缺点则是难以坚持原则,随波逐流。", "question": "1963年属什么生肖", "sent_token": ["1963", "年", "中", "苏", "公", "开", "论", "战", "、", "美", "国", "黑", "人", "民", "权", "运", "动", "兴", "起", "、", "肯", "尼", "迪", "遇", "刺", "等", "事", "件", "震", "动", "世", "界", 
"。", "1963", "年", "属", "什", "么", "生", "肖", "年", ",", "葵", "卯", "兔", "年", ",", "属", "兔", "之", "人", "举", "止", "文", "雅", ",", "谈", "吐", "随", "和", ",", "为", "人", "恭", "良", "谦", "逊", ",", "与", "人", "交", "往", "如", "慕", "春", "风", ",", "学", "习", "能", "力", "超", "群", ",", "敏", "捷", "果", "断", ",", "安", "贫", "乐", "道", "。", "虽", "性", "子", "柔", "弱", ",", "但", "韧", "性", "极", "强", ",", "绝", "境", "之", "中", "能", "力", "惊", "人", ",", "缺", "点", "则", "是", "难", "以", "坚", "持", "原", "则", ",", "随", "波", "逐", "流", "。", "1963", "年", "属", "什", "么", "生", "肖", "年", "_", "十", "二", "生", "肖", "_", "卜", "易", "居"], "sample_type": "disturb"} +{"id": 1854, "title": "食管和食道一样吗-有来医生", "context": "食管和食道是没有区别的,食管是医学上的称谓,而食道是民间的一种说法。两者都指从咽喉部到胃贲门之间的管道。食管是距门齿15cm处为食管的入口处,经过胸腔之后通过贲门口也就是膈肌孔与胃相连。食管可以分为颈段和胸段,而胸段又分为胸上段、胸中段和胸下段。食管本身有3个生理性的狭窄,这也是某些食管疾病发生的基础。常见的食管疾病包括食管炎、食管息肉、食管癌、食管狭窄、胃食管反流症、巴雷特食管等。可以通过消化道造影以及胃镜来进一步明确。", "question": "食管跟食道有什么不同", "sent_token": ["食", "管", "和", "食", "道", "是", "没", "有", "区", "别", "的", ",", "食", "管", "是", "医", "学", "上", "的", "称", "谓", ",", "而", "食", "道", "是", "民", "间", "的", "一", "种", "说", "法", "。", "两", "者", "都", "指", "从", "咽", "喉", "部", "到", "胃", "贲", "门", "之", "间", "的", "管", "道", "。", "食", "管", "是", "距", "门", "齿", "15cm", "处", "为", "食", "管", "的", "入", "口", "处", ",", "经", "过", "胸", "腔", "之", "后", "通", "过", "贲", "门", "口", "也", "就", "是", "膈", "肌", "孔", "与", "胃", "相", "连", "。", "食", "管", "可", "以", "分", "为", "颈", "段", "和", "胸", "段", ",", "而", "胸", "段", "又", "分", "为", "胸", "上", "段", "、", "胸", "中", "段", "和", "胸", "下", "段", "。", "食", "管", "本", "身", "有", "3", "个", "生", "理", "性", "的", "狭", "窄", ",", "这", "也", "是", "某", "些", "食", "管", "疾", "病", "发", "生", "的", "基", "础", "。", "常", "见", "的", "食", "管", "疾", "病", "包", "括", "食", "管", "炎", "、", "食", "管", "息", "肉", "、", "食", "管", "癌", "、", "食", "管", "狭", "窄", "、", "胃", "食", "管", "反", "流", "症", "、", "巴", "雷", "特", "食", "管", "等", "。", "可", "以", "通", "过", "消", "化", "道", "造", "影", "以", "及", "胃", "镜", "来", "进", "一", "步", "明", "确", "。", "食", "管", "和", "食", "道", "一", "样", "吗", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1863, "title": "农历六月二十四是什么星座-星座乐", "context": "农历六月二十四是狮子座。狮子座,火象星座,位于黄道十二宫之第五宫,出生日期为阳历7月23日-8月22日。狮子座是英雄主义者,他们乐观,乐于助人,喜欢帮助弱势群体。他们天生自带光环,特立独行,做事豪爽大气,讲话淡定从容,从不扭扭捏捏畏畏缩缩。而且心思细腻,做事完整准确,善于将自己的优点发挥到极致。", "question": "星座查询:中国阴历六月二十四", "sent_token": ["农", "历", "六", "月", "二", "十", "四", "是", "狮", "子", "座", "。", "狮", "子", "座", ",", "火", "象", "星", "座", ",", "位", "于", "黄", "道", "十", "二", "宫", "之", "第", "五", "宫", ",", "出", "生", "日", "期", "为", "阳", "历", "7", "月", "23", "日", "-", "8", "月", "22", "日", "。", "狮", "子", "座", "是", "英", "雄", "主", "义", "者", ",", "他", "们", "乐", "观", ",", "乐", "于", "助", "人", ",", "喜", "欢", "帮", "助", "弱", "势", "群", "体", "。", "他", "们", "天", "生", "自", "带", "光", "环", ",", "特", "立", "独", "行", ",", "做", "事", "豪", "爽", "大", "气", ",", "讲", "话", "淡", "定", "从", "容", ",", "从", "不", "扭", "扭", "捏", "捏", "畏", "畏", "缩", "缩", "。", "而", "且", "心", "思", "细", "腻", ",", "做", "事", "完", "整", "准", "确", ",", "善", "于", "将", "自", "己", "的", "优", "点", "发", "挥", "到", "极", "致", "。", "农", "历", "六", "月", "二", "十", "四", "是", "什", "么", "星", "座", "-", "星", "座", "乐"], "sample_type": "disturb"} +{"id": 1867, "title": "", "context": "非法持有海洛因10克以上就构成非法持有毒品罪非法持有毒品罪,是指明知是鸦片、海洛因、甲基苯丙胺或者其他毒品,而非法持有且数量较大的行为。非法持有毒品达到一定数量才构成犯罪。", "question": "携带多少克吗啡类毒品,就已经算犯罪了", "sent_token": ["非", "法", "持", "有", "海", "洛", "因", "10", "克", "以", "上", "就", "构", "成", "非", "法", "持", "有", "毒", "品", "罪", "非", "法", "持", "有", "毒", "品", "罪", ",", "是", "指", "明", "知", "是", "鸦", "片", "、", "海", "洛", "因", "、", "甲", "基", "苯", 
"丙", "胺", "或", "者", "其", "他", "毒", "品", ",", "而", "非", "法", "持", "有", "且", "数", "量", "较", "大", "的", "行", "为", "。", "非", "法", "持", "有", "毒", "品", "达", "到", "一", "定", "数", "量", "才", "构", "成", "犯", "罪", "。"], "sample_type": "disturb"} +{"id": 1877, "title": "地方志书每几年左右编修一次_高三网", "context": "地方志书每20年左右编修一次。每一轮地方志书编修工作完成后,负责地方志工作的机构在编纂地方综合年鉴、搜集资料以及向社会提供咨询服务的同时,启动新一轮地方志书的续修工作。", "question": "那种用来记述地方情况的史志,一般都是多少年修一次", "sent_token": ["地", "方", "志", "书", "每", "20", "年", "左", "右", "编", "修", "一", "次", "。", "每", "一", "轮", "地", "方", "志", "书", "编", "修", "工", "作", "完", "成", "后", ",", "负", "责", "地", "方", "志", "工", "作", "的", "机", "构", "在", "编", "纂", "地", "方", "综", "合", "年", "鉴", "、", "搜", "集", "资", "料", "以", "及", "向", "社", "会", "提", "供", "咨", "询", "服", "务", "的", "同", "时", ",", "启", "动", "新", "一", "轮", "地", "方", "志", "书", "的", "续", "修", "工", "作", "。", "地", "方", "志", "书", "每", "几", "年", "左", "右", "编", "修", "一", "次", "_", "高", "三", "网"], "sample_type": "disturb"} +{"id": 1879, "title": "", "context": "《正气歌》是南宋诗人文天祥在狱中写的一首五言古诗。表达了作者忠君爱国、为国捐躯,忧国之痛和愿意以死明志、为国捐躯的豪情壮志的思想感情。诗的开头即点出浩然正气存乎天地之间,至时穷之际,必然会显示出来。随后连用十二个典故,都是历史上有名的人物,他们的所作所为凛然显示出浩然正气的力量。接下来八句说明浩然正气贯日月,立天地,为三纲之命,道义之根。最后联系到自己的命运,自己虽然兵败被俘,处在极其恶劣的牢狱之中,但是由于自己一身正气,各种邪气和疾病都不能侵犯自己,因此自己能够坦然面对自己的命运。全诗感情深沉、气壮山河、直抒胸臆、毫无雕饰,充分体现了作者崇高的民族气节和强烈的爱国主义精神。", "question": "正气歌》的作者是", "sent_token": ["《", "正", "气", "歌", "》", "是", "南", "宋", "诗", "人", "文", "天", "祥", "在", "狱", "中", "写", "的", "一", "首", "五", "言", "古", "诗", "。", "表", "达", "了", "作", "者", "忠", "君", "爱", "国", "、", "为", "国", "捐", "躯", ",", "忧", "国", "之", "痛", "和", "愿", "意", "以", "死", "明", "志", "、", "为", "国", "捐", "躯", "的", "豪", "情", "壮", "志", "的", "思", "想", "感", "情", "。", "诗", "的", "开", "头", "即", "点", "出", "浩", "然", "正", "气", "存", "乎", "天", "地", "之", "间", ",", "至", "时", "穷", "之", "际", ",", "必", "然", "会", "显", "示", "出", "来", "。", "随", "后", "连", "用", "十", "二", "个", "典", "故", ",", "都", "是", "历", "史", "上", "有", "名", "的", "人", "物", ",", "他", "们", "的", "所", "作", "所", "为", "凛", "然", "显", "示", "出", "浩", "然", "正", "气", "的", "力", "量", "。", "接", "下", "来", "八", "句", "说", "明", "浩", "然", "正", "气", "贯", "日", "月", ",", "立", "天", "地", ",", "为", "三", "纲", "之", "命", ",", "道", "义", "之", "根", "。", "最", "后", "联", "系", "到", "自", "己", "的", "命", "运", ",", "自", "己", "虽", "然", "兵", "败", "被", "俘", ",", "处", "在", "极", "其", "恶", "劣", "的", "牢", "狱", "之", "中", ",", "但", "是", "由", "于", "自", "己", "一", "身", "正", "气", ",", "各", "种", "邪", "气", "和", "疾", "病", "都", "不", "能", "侵", "犯", "自", "己", ",", "因", "此", "自", "己", "能", "够", "坦", "然", "面", "对", "自", "己", "的", "命", "运", "。", "全", "诗", "感", "情", "深", "沉", "、", "气", "壮", "山", "河", "、", "直", "抒", "胸", "臆", "、", "毫", "无", "雕", "饰", ",", "充", "分", "体", "现", "了", "作", "者", "崇", "高", "的", "民", "族", "气", "节", "和", "强", "烈", "的", "爱", "国", "主", "义", "精", "神", "。"], "sample_type": "disturb"} +{"id": 1883, "title": "狗狗皮肤上长小脓包怎么回事", "context": "狗狗身上长脓包,是因为真菌感染或是寄生虫感染所致。如不及时处理脓包,会导致扩散全身,甚至溃烂。建议方法:戴上手套,把狗狗身上长脓包的地方挤一挤;然后用碘伏直接喷在患处;如有脓血可用医用纱布给它包在患处,等药效吸收后,取掉纱布;碘伏具有抗菌、消炎的作用,一天可以喷两三次;处理完狗狗伤口后用肥皂洗手。狗狗洗澡要用狗狗专门的沐浴露;洗后立即做吹干处理;定时用狗狗专用梳子,清理身上多余的杂毛;尽量带狗狗去干净的地方玩,回家后把狗狗的脚用抹布抹一次;多注意狗舍卫生,定时做消毒处理。宠物皮肤疾病也是会有一定传染性的,所以一定要进行定期消毒,选用专门的宠物消毒液,每周消毒1-2次,能有效预防传染", "question": "狗狗身上长小脓包是怎么回事", "sent_token": ["狗", "狗", "身", "上", "长", "脓", "包", ",", "是", "因", "为", "真", "菌", "感", "染", "或", "是", "寄", "生", "虫", "感", "染", "所", "致", "。", "如", "不", "及", "时", "处", "理", "脓", "包", ",", "会", "导", "致", "扩", "散", "全", "身", ",", "甚", "至", "溃", "烂", "。", "建", "议", "方", "法", ":", "戴", "上", "手", "套", ",", "把", "狗", "狗", "身", "上", "长", "脓", "包", "的", "地", "方", "挤", "一", 
"挤", ";", "然", "后", "用", "碘", "伏", "直", "接", "喷", "在", "患", "处", ";", "如", "有", "脓", "血", "可", "用", "医", "用", "纱", "布", "给", "它", "包", "在", "患", "处", ",", "等", "药", "效", "吸", "收", "后", ",", "取", "掉", "纱", "布", ";", "碘", "伏", "具", "有", "抗", "菌", "、", "消", "炎", "的", "作", "用", ",", "一", "天", "可", "以", "喷", "两", "三", "次", ";", "处", "理", "完", "狗", "狗", "伤", "口", "后", "用", "肥", "皂", "洗", "手", "。", "狗", "狗", "洗", "澡", "要", "用", "狗", "狗", "专", "门", "的", "沐", "浴", "露", ";", "洗", "后", "立", "即", "做", "吹", "干", "处", "理", ";", "定", "时", "用", "狗", "狗", "专", "用", "梳", "子", ",", "清", "理", "身", "上", "多", "余", "的", "杂", "毛", ";", "尽", "量", "带", "狗", "狗", "去", "干", "净", "的", "地", "方", "玩", ",", "回", "家", "后", "把", "狗", "狗", "的", "脚", "用", "抹", "布", "抹", "一", "次", ";", "多", "注", "意", "狗", "舍", "卫", "生", ",", "定", "时", "做", "消", "毒", "处", "理", "。", "宠", "物", "皮", "肤", "疾", "病", "也", "是", "会", "有", "一", "定", "传", "染", "性", "的", ",", "所", "以", "一", "定", "要", "进", "行", "定", "期", "消", "毒", ",", "选", "用", "专", "门", "的", "宠", "物", "消", "毒", "液", ",", "每", "周", "消", "毒", "1", "-", "2", "次", ",", "能", "有", "效", "预", "防", "传", "染", "狗", "狗", "皮", "肤", "上", "长", "小", "脓", "包", "怎", "么", "回", "事"], "sample_type": "disturb"} +{"id": 1885, "title": "", "context": "新梓学校成立于2007年9月,是一所公办九年一贯制学校,座落在龙岗街道新生社区,紧邻水岸新都花园,交通十分便利。校园占地27500平方米,建筑面积16285平方米。学校设计办学规模36班,学生人数1800人", "question": "新梓学校地址", "sent_token": ["新", "梓", "学", "校", "成", "立", "于", "2007", "年", "9", "月", ",", "是", "一", "所", "公", "办", "九", "年", "一", "贯", "制", "学", "校", ",", "座", "落", "在", "龙", "岗", "街", "道", "新", "生", "社", "区", ",", "紧", "邻", "水", "岸", "新", "都", "花", "园", ",", "交", "通", "十", "分", "便", "利", "。", "校", "园", "占", "地", "27500", "平", "方", "米", ",", "建", "筑", "面", "积", "16285", "平", "方", "米", "。", "学", "校", "设", "计", "办", "学", "规", "模", "36", "班", ",", "学", "生", "人", "数", "1800", "人"], "sample_type": "disturb"} +{"id": 1886, "title": "敷面膜脸痒是缺水吗?教你正确的认识_皮肤", "context": "当我们在洗完澡的时候,或者是敷面膜发现皮肤有一种痒痒的感觉,如果你确定面膜的质量是没有问题的,并且也确定你对这款面膜的物质没有过敏的情况下,皮肤出现痒的感觉,那可能的原因就是由于皮肤缺水。因为你的皮肤太缺水了,在给皮肤补水的时候就会出现一种痒的情况严重的时候,甚至会有刺痛的感觉。会让人觉得很不舒服,水分充足后会缓解。", "question": "脸痒是缺水么", "sent_token": ["当", "我", "们", "在", "洗", "完", "澡", "的", "时", "候", ",", "或", "者", "是", "敷", "面", "膜", "发", "现", "皮", "肤", "有", "一", "种", "痒", "痒", "的", "感", "觉", ",", "如", "果", "你", "确", "定", "面", "膜", "的", "质", "量", "是", "没", "有", "问", "题", "的", ",", "并", "且", "也", "确", "定", "你", "对", "这", "款", "面", "膜", "的", "物", "质", "没", "有", "过", "敏", "的", "情", "况", "下", ",", "皮", "肤", "出", "现", "痒", "的", "感", "觉", ",", "那", "可", "能", "的", "原", "因", "就", "是", "由", "于", "皮", "肤", "缺", "水", "。", "因", "为", "你", "的", "皮", "肤", "太", "缺", "水", "了", ",", "在", "给", "皮", "肤", "补", "水", "的", "时", "候", "就", "会", "出", "现", "一", "种", "痒", "的", "情", "况", "严", "重", "的", "时", "候", ",", "甚", "至", "会", "有", "刺", "痛", "的", "感", "觉", "。", "会", "让", "人", "觉", "得", "很", "不", "舒", "服", ",", "水", "分", "充", "足", "后", "会", "缓", "解", "。", "敷", "面", "膜", "脸", "痒", "是", "缺", "水", "吗", "?", "教", "你", "正", "确", "的", "认", "识", "_", "皮", "肤"], "sample_type": "disturb"} +{"id": 1888, "title": "无痛人流和药流哪个伤害比较小-有来医生", "context": "无痛人工流产手术和药物流产手术,相对比来说,还是药物流产伤害比较大。因为药物流产,阴道流血时间会比人工流产的阴道流血时间要长,一般人工流产,阴道流血时间不超过7天,而药物流产阴道流血的时间往往在15-20天左右才会干净。一直在有流血的状况下,宫口就是开放的,阴道又跟外界相通,跟宫颈又相通,这样造成细菌侵入感染的机会就会增加,所以容易导致生殖道的感染。另外,药物流产造成不全流产的可能性会大一些,需要做清宫手术。这样就可以想象出药物流产会比无痛人流伤害更大一些。人流手术都是属于微创无痛性质的,具有无痛、创伤极小,出血少、手术时间短,无需住院,手术后即可回家,不影响工作和生活等优势。", "question": "无痛人流和药流哪个伤害比较小", "sent_token": ["无", "痛", "人", "工", "流", "产", "手", "术", "和", "药", "物", "流", "产", "手", "术", ",", "相", "对", "比", "来", 
"说", ",", "还", "是", "药", "物", "流", "产", "伤", "害", "比", "较", "大", "。", "因", "为", "药", "物", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "会", "比", "人", "工", "流", "产", "的", "阴", "道", "流", "血", "时", "间", "要", "长", ",", "一", "般", "人", "工", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "不", "超", "过", "7", "天", ",", "而", "药", "物", "流", "产", "阴", "道", "流", "血", "的", "时", "间", "往", "往", "在", "15", "-", "20", "天", "左", "右", "才", "会", "干", "净", "。", "一", "直", "在", "有", "流", "血", "的", "状", "况", "下", ",", "宫", "口", "就", "是", "开", "放", "的", ",", "阴", "道", "又", "跟", "外", "界", "相", "通", ",", "跟", "宫", "颈", "又", "相", "通", ",", "这", "样", "造", "成", "细", "菌", "侵", "入", "感", "染", "的", "机", "会", "就", "会", "增", "加", ",", "所", "以", "容", "易", "导", "致", "生", "殖", "道", "的", "感", "染", "。", "另", "外", ",", "药", "物", "流", "产", "造", "成", "不", "全", "流", "产", "的", "可", "能", "性", "会", "大", "一", "些", ",", "需", "要", "做", "清", "宫", "手", "术", "。", "这", "样", "就", "可", "以", "想", "象", "出", "药", "物", "流", "产", "会", "比", "无", "痛", "人", "流", "伤", "害", "更", "大", "一", "些", "。", "人", "流", "手", "术", "都", "是", "属", "于", "微", "创", "无", "痛", "性", "质", "的", ",", "具", "有", "无", "痛", "、", "创", "伤", "极", "小", ",", "出", "血", "少", "、", "手", "术", "时", "间", "短", ",", "无", "需", "住", "院", ",", "手", "术", "后", "即", "可", "回", "家", ",", "不", "影", "响", "工", "作", "和", "生", "活", "等", "优", "势", "。", "无", "痛", "人", "流", "和", "药", "流", "哪", "个", "伤", "害", "比", "较", "小", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1890, "title": "长期吃葡萄籽的副作用?_39健康问答_39健康网", "context": "长期吃葡萄籽不会有副作用,不用担心,葡萄籽中含有丰富的花青素,有美容养颜的功效。葡萄籽含有丰富的多种氨基酸、维生素及矿物质等,原花青素含量最高,有促进血液循环、保护视力、抗氧化去除自由基、降低血、保护心血管的作用,可以用于保健、美容。", "question": "葡萄籽能长期吃么?有什么副作用呀?", "sent_token": ["长", "期", "吃", "葡", "萄", "籽", "不", "会", "有", "副", "作", "用", ",", "不", "用", "担", "心", ",", "葡", "萄", "籽", "中", "含", "有", "丰", "富", "的", "花", "青", "素", ",", "有", "美", "容", "养", "颜", "的", "功", "效", "。", "葡", "萄", "籽", "含", "有", "丰", "富", "的", "多", "种", "氨", "基", "酸", "、", "维", "生", "素", "及", "矿", "物", "质", "等", ",", "原", "花", "青", "素", "含", "量", "最", "高", ",", "有", "促", "进", "血", "液", "循", "环", "、", "保", "护", "视", "力", "、", "抗", "氧", "化", "去", "除", "自", "由", "基", "、", "降", "低", "血", "、", "保", "护", "心", "血", "管", "的", "作", "用", ",", "可", "以", "用", "于", "保", "健", "、", "美", "容", "。", "长", "期", "吃", "葡", "萄", "籽", "的", "副", "作", "用", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "disturb"} +{"id": 1894, "title": "红花哪里产的最好?_39健康问答_39健康网", "context": "红花在中国很多地方都是有种植的,比如河南,江苏,四川,河北等等。但是在众多产地中河南的商丘生产的红花应该是最好的了。红花有一种特殊的气味,特别香,味道稍微有点苦。红花是一种很好的植物,对人体有很好的保健作用。高血压患者可以服用一些,红花是有一定的降压作用的,另外还可以促进人体血液的循环,降低血脂。", "question": "最好的刺红花生产自哪里", "sent_token": ["红", "花", "在", "中", "国", "很", "多", "地", "方", "都", "是", "有", "种", "植", "的", ",", "比", "如", "河", "南", ",", "江", "苏", ",", "四", "川", ",", "河", "北", "等", "等", "。", "但", "是", "在", "众", "多", "产", "地", "中", "河", "南", "的", "商", "丘", "生", "产", "的", "红", "花", "应", "该", "是", "最", "好", "的", "了", "。", "红", "花", "有", "一", "种", "特", "殊", "的", "气", "味", ",", "特", "别", "香", ",", "味", "道", "稍", "微", "有", "点", "苦", "。", "红", "花", "是", "一", "种", "很", "好", "的", "植", "物", ",", "对", "人", "体", "有", "很", "好", "的", "保", "健", "作", "用", "。", "高", "血", "压", "患", "者", "可", "以", "服", "用", "一", "些", ",", "红", "花", "是", "有", "一", "定", "的", "降", "压", "作", "用", "的", ",", "另", "外", "还", "可", "以", "促", "进", "人", "体", "血", "液", "的", "循", "环", ",", "降", "低", "血", "脂", "。", "红", "花", "哪", "里", "产", "的", "最", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": 
"disturb"} +{"id": 1897, "title": "", "context": "梳妆台指用来化妆的家具装饰。梳妆台一词,在现代家居中,已经被业主、客户、家居设计师广泛用到,现在泛指家具梳妆台。梳妆台尺寸标准的是总高度为1500mm左右,宽为700mm到1200mm,这样的梳妆台尺寸是大小正合适的,在家庭装修之前的前期准备时,就应该确定好梳妆台尺寸大小,同时梳妆台尺寸也要和房间的格调和风格统一起来。每个人都有自己不同的审美眼光,所以在外观选择上只要是个人喜欢就行,但梳妆台的外表最好选择用油漆刷过的,这样容易清理,不至于化妆品渗透到梳妆台内,影响梳妆台的外观", "question": "梳妆台整体高度一般是多少", "sent_token": ["梳", "妆", "台", "指", "用", "来", "化", "妆", "的", "家", "具", "装", "饰", "。", "梳", "妆", "台", "一", "词", ",", "在", "现", "代", "家", "居", "中", ",", "已", "经", "被", "业", "主", "、", "客", "户", "、", "家", "居", "设", "计", "师", "广", "泛", "用", "到", ",", "现", "在", "泛", "指", "家", "具", "梳", "妆", "台", "。", "梳", "妆", "台", "尺", "寸", "标", "准", "的", "是", "总", "高", "度", "为", "1500mm", "左", "右", ",", "宽", "为", "700mm", "到", "1200mm", ",", "这", "样", "的", "梳", "妆", "台", "尺", "寸", "是", "大", "小", "正", "合", "适", "的", ",", "在", "家", "庭", "装", "修", "之", "前", "的", "前", "期", "准", "备", "时", ",", "就", "应", "该", "确", "定", "好", "梳", "妆", "台", "尺", "寸", "大", "小", ",", "同", "时", "梳", "妆", "台", "尺", "寸", "也", "要", "和", "房", "间", "的", "格", "调", "和", "风", "格", "统", "一", "起", "来", "。", "每", "个", "人", "都", "有", "自", "己", "不", "同", "的", "审", "美", "眼", "光", ",", "所", "以", "在", "外", "观", "选", "择", "上", "只", "要", "是", "个", "人", "喜", "欢", "就", "行", ",", "但", "梳", "妆", "台", "的", "外", "表", "最", "好", "选", "择", "用", "油", "漆", "刷", "过", "的", ",", "这", "样", "容", "易", "清", "理", ",", "不", "至", "于", "化", "妆", "品", "渗", "透", "到", "梳", "妆", "台", "内", ",", "影", "响", "梳", "妆", "台", "的", "外", "观"], "sample_type": "disturb"} +{"id": 1899, "title": "感冒能不能吃燕窝_妈妈网小百科", "context": "在感冒的时候尽量不要吃燕窝,燕窝性平味甘,归肺胃肾经,功能养阴润燥,益气补中,填精补髓。虽然燕窝比较滋补,但是在感冒期间吃燕窝的话,并不利于感冒的恢复。在感冒期间应该吃得清淡一些,补充身体需要的水分,如果没有食欲的话可以多喝一些粥。在感冒期间可能吃药物的话,也不能够起到很好的效果,但是也要坚持吃药。", "question": "感冒可以吃燕窝吗?有效果吗?", "sent_token": ["在", "感", "冒", "的", "时", "候", "尽", "量", "不", "要", "吃", "燕", "窝", ",", "燕", "窝", "性", "平", "味", "甘", ",", "归", "肺", "胃", "肾", "经", ",", "功", "能", "养", "阴", "润", "燥", ",", "益", "气", "补", "中", ",", "填", "精", "补", "髓", "。", "虽", "然", "燕", "窝", "比", "较", "滋", "补", ",", "但", "是", "在", "感", "冒", "期", "间", "吃", "燕", "窝", "的", "话", ",", "并", "不", "利", "于", "感", "冒", "的", "恢", "复", "。", "在", "感", "冒", "期", "间", "应", "该", "吃", "得", "清", "淡", "一", "些", ",", "补", "充", "身", "体", "需", "要", "的", "水", "分", ",", "如", "果", "没", "有", "食", "欲", "的", "话", "可", "以", "多", "喝", "一", "些", "粥", "。", "在", "感", "冒", "期", "间", "可", "能", "吃", "药", "物", "的", "话", ",", "也", "不", "能", "够", "起", "到", "很", "好", "的", "效", "果", ",", "但", "是", "也", "要", "坚", "持", "吃", "药", "。", "感", "冒", "能", "不", "能", "吃", "燕", "窝", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "disturb"} +{"id": 1900, "title": "房颤会引起脑梗吗-有来医生", "context": "房颤会引起脑血管疾病,在医学上不叫脑梗叫脑栓塞,脑梗是脑血管本身病变引起的脑供血不足的情况,而脑栓塞是由于房颤心脏上形成了附壁血栓,当血栓的栓子脱落之后,就有可能堵塞在脑血管形成了脑拴塞,也是一种脑缺血的表现。治疗方法可以应用改善循环和营养神经的药物治疗,必须应用阿司匹林和氯吡格雷口服抗血小板聚集治疗,对于心房纤颤的患者,要控制心室率,应用阿司匹林和氯吡格雷等口服抗血小板聚集治疗,预防心脏附壁血栓的形成。", "question": "房颤不会引起脑梗吗", "sent_token": ["房", "颤", "会", "引", "起", "脑", "血", "管", "疾", "病", ",", "在", "医", "学", "上", "不", "叫", "脑", "梗", "叫", "脑", "栓", "塞", ",", "脑", "梗", "是", "脑", "血", "管", "本", "身", "病", "变", "引", "起", "的", "脑", "供", "血", "不", "足", "的", "情", "况", ",", "而", "脑", "栓", "塞", "是", "由", "于", "房", "颤", "心", "脏", "上", "形", "成", "了", "附", "壁", "血", "栓", ",", "当", "血", "栓", "的", "栓", "子", "脱", "落", "之", "后", ",", "就", "有", "可", "能", "堵", "塞", "在", "脑", "血", "管", "形", "成", "了", "脑", "拴", "塞", ",", "也", "是", "一", "种", "脑", "缺", "血", "的", "表", "现", "。", "治", "疗", "方", "法", "可", "以", "应", "用", "改", "善", "循", "环", "和", "营", "养", "神", "经", "的", "药", "物", "治", "疗", ",", "必", 
"须", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "对", "于", "心", "房", "纤", "颤", "的", "患", "者", ",", "要", "控", "制", "心", "室", "率", ",", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "等", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "预", "防", "心", "脏", "附", "壁", "血", "栓", "的", "形", "成", "。", "房", "颤", "会", "引", "起", "脑", "梗", "吗", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1906, "title": "二十天的婴儿能看多远_妈妈网小百科", "context": "20天的宝宝能够看到的距离大概是15厘米-20厘米左右,一般能够看到18厘米左右的事物。宝宝刚出生的时候视力极其差,有的甚至没有睁开眼,可以说基本什么都看不清楚,视力比较好的新生儿,也只能感受到光和影或大致的轮廓。随着宝宝的眼球、视神经和大脑的不断发育,他们看到的景物会越来越清楚,视野也会不断扩大,在出生6-8个月后,宝宝眼中的世界,就基本和成人一样了。", "question": "二十天的宝宝能看多远?", "sent_token": ["20", "天", "的", "宝", "宝", "能", "够", "看", "到", "的", "距", "离", "大", "概", "是", "15", "厘", "米", "-", "20", "厘", "米", "左", "右", ",", "一", "般", "能", "够", "看", "到", "18", "厘", "米", "左", "右", "的", "事", "物", "。", "宝", "宝", "刚", "出", "生", "的", "时", "候", "视", "力", "极", "其", "差", ",", "有", "的", "甚", "至", "没", "有", "睁", "开", "眼", ",", "可", "以", "说", "基", "本", "什", "么", "都", "看", "不", "清", "楚", ",", "视", "力", "比", "较", "好", "的", "新", "生", "儿", ",", "也", "只", "能", "感", "受", "到", "光", "和", "影", "或", "大", "致", "的", "轮", "廓", "。", "随", "着", "宝", "宝", "的", "眼", "球", "、", "视", "神", "经", "和", "大", "脑", "的", "不", "断", "发", "育", ",", "他", "们", "看", "到", "的", "景", "物", "会", "越", "来", "越", "清", "楚", ",", "视", "野", "也", "会", "不", "断", "扩", "大", ",", "在", "出", "生", "6", "-", "8", "个", "月", "后", ",", "宝", "宝", "眼", "中", "的", "世", "界", ",", "就", "基", "本", "和", "成", "人", "一", "样", "了", "。", "二", "十", "天", "的", "婴", "儿", "能", "看", "多", "远", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "disturb"} +{"id": 1918, "title": "4价宫颈疫苗多少钱-有来医生", "context": "4价宫颈癌疫苗有国产疫苗和进口疫苗,国产疫苗价格比较便宜,预防宫颈癌的疫苗只有4价疫苗,具体价格不同地区以及不同生产厂家生产的疫苗,所定价格也不一样。在北京4价宫颈癌疫苗,价格大概是800元左右,总共需要接种三针,需要在半年内接种完,分别在第一个月,第2个月和第6个月各接种一针次,接种年龄是20-45周岁,建议咨询当地疾病预防控制机构,所进疫苗的具体价格比较准确。比如江苏省从2019年开始,所有有价疫苗都是零差价出售,每接种一针次,收取20元材料费和注射费,目前接种宫颈癌疫苗,应该先预约才可以接种。", "question": "中国自己生产的HPV疫苗都有哪些", "sent_token": ["4", "价", "宫", "颈", "癌", "疫", "苗", "有", "国", "产", "疫", "苗", "和", "进", "口", "疫", "苗", ",", "国", "产", "疫", "苗", "价", "格", "比", "较", "便", "宜", ",", "预", "防", "宫", "颈", "癌", "的", "疫", "苗", "只", "有", "4", "价", "疫", "苗", ",", "具", "体", "价", "格", "不", "同", "地", "区", "以", "及", "不", "同", "生", "产", "厂", "家", "生", "产", "的", "疫", "苗", ",", "所", "定", "价", "格", "也", "不", "一", "样", "。", "在", "北", "京", "4", "价", "宫", "颈", "癌", "疫", "苗", ",", "价", "格", "大", "概", "是", "800", "元", "左", "右", ",", "总", "共", "需", "要", "接", "种", "三", "针", ",", "需", "要", "在", "半", "年", "内", "接", "种", "完", ",", "分", "别", "在", "第", "一", "个", "月", ",", "第", "2", "个", "月", "和", "第", "6", "个", "月", "各", "接", "种", "一", "针", "次", ",", "接", "种", "年", "龄", "是", "20", "-", "45", "周", "岁", ",", "建", "议", "咨", "询", "当", "地", "疾", "病", "预", "防", "控", "制", "机", "构", ",", "所", "进", "疫", "苗", "的", "具", "体", "价", "格", "比", "较", "准", "确", "。", "比", "如", "江", "苏", "省", "从", "2019", "年", "开", "始", ",", "所", "有", "有", "价", "疫", "苗", "都", "是", "零", "差", "价", "出", "售", ",", "每", "接", "种", "一", "针", "次", ",", "收", "取", "20", "元", "材", "料", "费", "和", "注", "射", "费", ",", "目", "前", "接", "种", "宫", "颈", "癌", "疫", "苗", ",", "应", "该", "先", "预", "约", "才", "可", "以", "接", "种", "。", "4", "价", "宫", "颈", "疫", "苗", "多", "少", "钱", "-", "有", "来", "医", "生"], "sample_type": "disturb"} +{"id": 1945, "title": "hiit是什么", "context": 
"hiit是高强度间歇训练,主要是通过进行多组高强度的间隙,和低强度的动作组合训练,这种训练方式能够在短时间内高速燃烧脂肪,简单说就是中间有休息的高强度训练,非常适合锻炼时间较少或无法长时间坚持锻炼的人。", "question": "什么是HIIT", "sent_token": ["hiit", "是", "高", "强", "度", "间", "歇", "训", "练", ",", "主", "要", "是", "通", "过", "进", "行", "多", "组", "高", "强", "度", "的", "间", "隙", ",", "和", "低", "强", "度", "的", "动", "作", "组", "合", "训", "练", ",", "这", "种", "训", "练", "方", "式", "能", "够", "在", "短", "时", "间", "内", "高", "速", "燃", "烧", "脂", "肪", ",", "简", "单", "说", "就", "是", "中", "间", "有", "休", "息", "的", "高", "强", "度", "训", "练", ",", "非", "常", "适", "合", "锻", "炼", "时", "间", "较", "少", "或", "无", "法", "长", "时", "间", "坚", "持", "锻", "炼", "的", "人", "。", "hiit", "是", "什", "么"], "sample_type": "disturb"} +{"id": 1949, "title": "民生信用卡的客服电话多少?-其他问题知识问答-我爱卡", "context": "民生银行是中国大陆第一家由民间资本设立的全国性商业银行,成立于1996年1月12日。民生银行的信用卡的24小时客服电话为400-669-5568,持卡人在办卡或用卡的过程中,有任何疑问,都可以拨打民生银行信用卡客服电话,通过人工客服,来进行咨询。同时,持卡人也可以通过客服电话,办理信用卡激活、修改密码、更改账单日等业务。", "question": "民生信用卡客服", "sent_token": ["民", "生", "银", "行", "是", "中", "国", "大", "陆", "第", "一", "家", "由", "民", "间", "资", "本", "设", "立", "的", "全", "国", "性", "商", "业", "银", "行", ",", "成", "立", "于", "1996", "年", "1", "月", "12", "日", "。", "民", "生", "银", "行", "的", "信", "用", "卡", "的", "24", "小", "时", "客", "服", "电", "话", "为", "400", "-", "669", "-", "5568", ",", "持", "卡", "人", "在", "办", "卡", "或", "用", "卡", "的", "过", "程", "中", ",", "有", "任", "何", "疑", "问", ",", "都", "可", "以", "拨", "打", "民", "生", "银", "行", "信", "用", "卡", "客", "服", "电", "话", ",", "通", "过", "人", "工", "客", "服", ",", "来", "进", "行", "咨", "询", "。", "同", "时", ",", "持", "卡", "人", "也", "可", "以", "通", "过", "客", "服", "电", "话", ",", "办", "理", "信", "用", "卡", "激", "活", "、", "修", "改", "密", "码", "、", "更", "改", "账", "单", "日", "等", "业", "务", "。", "民", "生", "信", "用", "卡", "的", "客", "服", "电", "话", "多", "少", "?", "-", "其", "他", "问", "题", "知", "识", "问", "答", "-", "我", "爱", "卡"], "sample_type": "disturb"} +{"id": 1956, "title": "", "context": "法令纹位於鼻翼两侧往下延伸至嘴的附近,也称寿带,是典型的皮肤组织老化,造成肌肤表面凹陷的现象。法令若垂长,亦为长寿之象徵。不过女性多半不喜欢脸上出现法令纹,因为这意味脸部皮肤松弛,是老化的迹象。", "question": "哪里是法令纹?", "sent_token": ["法", "令", "纹", "位", "於", "鼻", "翼", "两", "侧", "往", "下", "延", "伸", "至", "嘴", "的", "附", "近", ",", "也", "称", "寿", "带", ",", "是", "典", "型", "的", "皮", "肤", "组", "织", "老", "化", ",", "造", "成", "肌", "肤", "表", "面", "凹", "陷", "的", "现", "象", "。", "法", "令", "若", "垂", "长", ",", "亦", "为", "长", "寿", "之", "象", "徵", "。", "不", "过", "女", "性", "多", "半", "不", "喜", "欢", "脸", "上", "出", "现", "法", "令", "纹", ",", "因", "为", "这", "意", "味", "脸", "部", "皮", "肤", "松", "弛", ",", "是", "老", "化", "的", "迹", "象", "。"], "sample_type": "disturb"} +{"id": 1966, "title": "婴儿轻微肠炎能自愈吗_妈妈网小百科", "context": "婴儿轻微肠炎不能自愈。肠炎是一种炎症,其发病的原因与胃肠道失调有关联。临床表现主要有腹痛、腹泻、稀水便或黏液脓血便。婴儿胃肠道菌群出现了失调的异常,就会引发肠炎的出现。尽管是比较轻微的肠炎,但还是有炎症的存在。婴儿轻微肠炎需要就医进行治疗,需要吃药促使炎症的消除。", "question": "婴儿轻度肠炎能自愈吗", "sent_token": ["婴", "儿", "轻", "微", "肠", "炎", "不", "能", "自", "愈", "。", "肠", "炎", "是", "一", "种", "炎", "症", ",", "其", "发", "病", "的", "原", "因", "与", "胃", "肠", "道", "失", "调", "有", "关", "联", "。", "临", "床", "表", "现", "主", "要", "有", "腹", "痛", "、", "腹", "泻", "、", "稀", "水", "便", "或", "黏", "液", "脓", "血", "便", "。", "婴", "儿", "胃", "肠", "道", "菌", "群", "出", "现", "了", "失", "调", "的", "异", "常", ",", "就", "会", "引", "发", "肠", "炎", "的", "出", "现", "。", "尽", "管", "是", "比", "较", "轻", "微", "的", "肠", "炎", ",", "但", "还", "是", "有", "炎", "症", "的", "存", "在", "。", "婴", "儿", "轻", "微", "肠", "炎", "需", "要", "就", "医", "进", "行", "治", "疗", ",", "需", "要", "吃", "药", "促", "使", "炎", "症", "的", "消", "除", "。", "婴", "儿", "轻", "微", "肠", "炎", "能", "自", "愈", "吗", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": 
"disturb"} +{"id": 1977, "title": "", "context": "珍珠鸟作者简介冯骥才,当代作家,1942年生于天津,兄妹六人,排行第三,为长子。原籍浙江慈溪市人。从小喜爱美术、文学和球类活动。曾当过专业篮球运动员,从事过绘画。", "question": "冯骥才什么时候出生", "sent_token": ["珍", "珠", "鸟", "作", "者", "简", "介", "冯", "骥", "才", ",", "当", "代", "作", "家", ",", "1942", "年", "生", "于", "天", "津", ",", "兄", "妹", "六", "人", ",", "排", "行", "第", "三", ",", "为", "长", "子", "。", "原", "籍", "浙", "江", "慈", "溪", "市", "人", "。", "从", "小", "喜", "爱", "美", "术", "、", "文", "学", "和", "球", "类", "活", "动", "。", "曾", "当", "过", "专", "业", "篮", "球", "运", "动", "员", ",", "从", "事", "过", "绘", "画", "。"], "sample_type": "disturb"} +{"id": 1983, "title": "哺乳期可以吃维生素b2吗_有问必答_快速问医生", "context": "你好,口腔溃疡一般都是由于维生素缺乏导致的,与口腔炎症和上火也有关,可以服用维生素b2和维生素c治疗。用西瓜皮煮水喝,可以清热去火。局部用口腔溃疡散或者用维生素c研磨成粉末涂抹,都可以有效缓解疼痛。孕妇正常也要补充维生素的,服用维生素b2没有问题的。平时一定要多吃新鲜蔬菜水果,补充维生素,注意口腔卫生,早晚刷牙,饭后用温水漱口,每天早上起床用淡盐水漱口。", "question": "哺乳期间,能吃维生素b2吗", "sent_token": ["你", "好", ",", "口", "腔", "溃", "疡", "一", "般", "都", "是", "由", "于", "维", "生", "素", "缺", "乏", "导", "致", "的", ",", "与", "口", "腔", "炎", "症", "和", "上", "火", "也", "有", "关", ",", "可", "以", "服", "用", "维", "生", "素", "b2", "和", "维", "生", "素", "c", "治", "疗", "。", "用", "西", "瓜", "皮", "煮", "水", "喝", ",", "可", "以", "清", "热", "去", "火", "。", "局", "部", "用", "口", "腔", "溃", "疡", "散", "或", "者", "用", "维", "生", "素", "c", "研", "磨", "成", "粉", "末", "涂", "抹", ",", "都", "可", "以", "有", "效", "缓", "解", "疼", "痛", "。", "孕", "妇", "正", "常", "也", "要", "补", "充", "维", "生", "素", "的", ",", "服", "用", "维", "生", "素", "b2", "没", "有", "问", "题", "的", "。", "平", "时", "一", "定", "要", "多", "吃", "新", "鲜", "蔬", "菜", "水", "果", ",", "补", "充", "维", "生", "素", ",", "注", "意", "口", "腔", "卫", "生", ",", "早", "晚", "刷", "牙", ",", "饭", "后", "用", "温", "水", "漱", "口", ",", "每", "天", "早", "上", "起", "床", "用", "淡", "盐", "水", "漱", "口", "。", "哺", "乳", "期", "可", "以", "吃", "维", "生", "素", "b2", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "disturb"} +{"id": 1993, "title": "6岁儿童吃几颗肠虫清,吃肠虫清需要忌口吗_孕育常识_亲子宝典库_", "context": "肠虫清一般指阿苯达唑。阿苯达唑是一种咪唑衍生物类广谱驱肠虫药物。是六岁儿童就可以服用的一次吃两片,是吃饱饭后吃,肠虫清的主要是驱虫的药物,一般在晚上睡前服用的是比较好的,服药期间要多喝开水,多吃清淡易消化的食物,忌辛辣刺激性食物和油腻煎炸的食物,注意保暖避免着凉。", "question": "6岁儿童吃几颗肠虫清", "sent_token": ["肠", "虫", "清", "一", "般", "指", "阿", "苯", "达", "唑", "。", "阿", "苯", "达", "唑", "是", "一", "种", "咪", "唑", "衍", "生", "物", "类", "广", "谱", "驱", "肠", "虫", "药", "物", "。", "是", "六", "岁", "儿", "童", "就", "可", "以", "服", "用", "的", "一", "次", "吃", "两", "片", ",", "是", "吃", "饱", "饭", "后", "吃", ",", "肠", "虫", "清", "的", "主", "要", "是", "驱", "虫", "的", "药", "物", ",", "一", "般", "在", "晚", "上", "睡", "前", "服", "用", "的", "是", "比", "较", "好", "的", ",", "服", "药", "期", "间", "要", "多", "喝", "开", "水", ",", "多", "吃", "清", "淡", "易", "消", "化", "的", "食", "物", ",", "忌", "辛", "辣", "刺", "激", "性", "食", "物", "和", "油", "腻", "煎", "炸", "的", "食", "物", ",", "注", "意", "保", "暖", "避", "免", "着", "凉", "。", "6", "岁", "儿", "童", "吃", "几", "颗", "肠", "虫", "清", ",", "吃", "肠", "虫", "清", "需", "要", "忌", "口", "吗", "_", "孕", "育", "常", "识", "_", "亲", "子", "宝", "典", "库", "_"], "sample_type": "disturb"} +{"id": 2003, "title": "隔阂意味着是什么意思", "context": "隔阂是一个汉语词汇,一指彼此情意沟通的障碍或是情意不通,思想有距离,彼此之间有间隔,又指阻隔、隔绝。隔阂意味着很多意思,通常隔阂就意味着可能双方之间沟通有问题,比如有些夫妻或者是男女朋友之间吵架,两个人一起冷战,两个人由于没有沟通,双方之间的误会和矛盾就会越来越多了,也有可能是两个人总是以争吵的方式来解决问题,像这样的话就达不到有效的沟通,两个人两个人越不沟通,双方之间的矛盾和争吵就会越来越多,这个时候就会产生深深的隔阂。也有可能是双峰之间的价值观完全不同,比如对待某些问题的时候,有些人比较理性,但是有些人会比较感性,这个时候价值观不同的话就非常容易产生隔阂。", "question": "隔阂什么意思", "sent_token": ["隔", "阂", "是", "一", "个", "汉", "语", "词", "汇", ",", "一", "指", "彼", "此", "情", "意", "沟", "通", "的", "障", "碍", "或", "是", "情", "意", "不", "通", ",", "思", "想", "有", "距", "离", ",", "彼", "此", "之", 
"间", "有", "间", "隔", ",", "又", "指", "阻", "隔", "、", "隔", "绝", "。", "隔", "阂", "意", "味", "着", "很", "多", "意", "思", ",", "通", "常", "隔", "阂", "就", "意", "味", "着", "可", "能", "双", "方", "之", "间", "沟", "通", "有", "问", "题", ",", "比", "如", "有", "些", "夫", "妻", "或", "者", "是", "男", "女", "朋", "友", "之", "间", "吵", "架", ",", "两", "个", "人", "一", "起", "冷", "战", ",", "两", "个", "人", "由", "于", "没", "有", "沟", "通", ",", "双", "方", "之", "间", "的", "误", "会", "和", "矛", "盾", "就", "会", "越", "来", "越", "多", "了", ",", "也", "有", "可", "能", "是", "两", "个", "人", "总", "是", "以", "争", "吵", "的", "方", "式", "来", "解", "决", "问", "题", ",", "像", "这", "样", "的", "话", "就", "达", "不", "到", "有", "效", "的", "沟", "通", ",", "两", "个", "人", "两", "个", "人", "越", "不", "沟", "通", ",", "双", "方", "之", "间", "的", "矛", "盾", "和", "争", "吵", "就", "会", "越", "来", "越", "多", ",", "这", "个", "时", "候", "就", "会", "产", "生", "深", "深", "的", "隔", "阂", "。", "也", "有", "可", "能", "是", "双", "峰", "之", "间", "的", "价", "值", "观", "完", "全", "不", "同", ",", "比", "如", "对", "待", "某", "些", "问", "题", "的", "时", "候", ",", "有", "些", "人", "比", "较", "理", "性", ",", "但", "是", "有", "些", "人", "会", "比", "较", "感", "性", ",", "这", "个", "时", "候", "价", "值", "观", "不", "同", "的", "话", "就", "非", "常", "容", "易", "产", "生", "隔", "阂", "。", "隔", "阂", "意", "味", "着", "是", "什", "么", "意", "思"], "sample_type": "disturb"} +{"id": 2004, "title": "小儿癫痫病能彻底治愈的吗_有问必答_快速问医生", "context": "你好,很高兴为你服务,目前小儿癫痫是可以治愈的,不同的癫痫类型以及患者的实际病情不同,其适合的治疗方法也是不尽相同的。现在常见的小儿癫痫治疗都是采用中医为基础的治疗方法,这样对患儿的伤害较小,而西医则有很大的副作用,好吧", "question": "能彻底治愈羊儿风吗", "sent_token": ["你", "好", ",", "很", "高", "兴", "为", "你", "服", "务", ",", "目", "前", "小", "儿", "癫", "痫", "是", "可", "以", "治", "愈", "的", ",", "不", "同", "的", "癫", "痫", "类", "型", "以", "及", "患", "者", "的", "实", "际", "病", "情", "不", "同", ",", "其", "适", "合", "的", "治", "疗", "方", "法", "也", "是", "不", "尽", "相", "同", "的", "。", "现", "在", "常", "见", "的", "小", "儿", "癫", "痫", "治", "疗", "都", "是", "采", "用", "中", "医", "为", "基", "础", "的", "治", "疗", "方", "法", ",", "这", "样", "对", "患", "儿", "的", "伤", "害", "较", "小", ",", "而", "西", "医", "则", "有", "很", "大", "的", "副", "作", "用", ",", "好", "吧", "小", "儿", "癫", "痫", "病", "能", "彻", "底", "治", "愈", "的", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "disturb"} +{"id": 2012, "title": "脑内多发腔隙性脑梗死严重吗_39健康问答_39健康网", "context": "脑内多发腔隙性脑梗死,部分软化灶形成,一般不严重,是细枝血管梗塞,引起小灶脑组织坏死,脑组织软化灶,其他部位的脑组织会替代坏死部位的脑组织功能,所以一般没有不适的症状。属于脑梗死症型中症状最轻微的,也是唯一一种能够通过可靠用药、饮食调节、康复锻炼、控制血压和血脂等综合性治疗措施达到彻底治愈的脑梗死。注意控制血压,清淡饮食,控制血脂,血粘度,精神放松,解除思想顾虑,多做室外文娱体育活动,精神愉快,多接受紫外线照射,多喝开水,会有利于康复。可以根据情况使用疏通血管的药物。", "question": "多发腔隙性脑梗死吃什么中药", "sent_token": ["脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", ",", "部", "分", "软", "化", "灶", "形", "成", ",", "一", "般", "不", "严", "重", ",", "是", "细", "枝", "血", "管", "梗", "塞", ",", "引", "起", "小", "灶", "脑", "组", "织", "坏", "死", ",", "脑", "组", "织", "软", "化", "灶", ",", "其", "他", "部", "位", "的", "脑", "组", "织", "会", "替", "代", "坏", "死", "部", "位", "的", "脑", "组", "织", "功", "能", ",", "所", "以", "一", "般", "没", "有", "不", "适", "的", "症", "状", "。", "属", "于", "脑", "梗", "死", "症", "型", "中", "症", "状", "最", "轻", "微", "的", ",", "也", "是", "唯", "一", "一", "种", "能", "够", "通", "过", "可", "靠", "用", "药", "、", "饮", "食", "调", "节", "、", "康", "复", "锻", "炼", "、", "控", "制", "血", "压", "和", "血", "脂", "等", "综", "合", "性", "治", "疗", "措", "施", "达", "到", "彻", "底", "治", "愈", "的", "脑", "梗", "死", "。", "注", "意", "控", "制", "血", "压", ",", "清", "淡", "饮", "食", ",", "控", "制", "血", "脂", ",", "血", "粘", "度", ",", "精", "神", "放", "松", ",", "解", "除", "思", "想", "顾", "虑", ",", "多", "做", "室", "外", "文", "娱", "体", "育", "活", "动", ",", "精", "神", "愉", "快", 
",", "多", "接", "受", "紫", "外", "线", "照", "射", ",", "多", "喝", "开", "水", ",", "会", "有", "利", "于", "康", "复", "。", "可", "以", "根", "据", "情", "况", "使", "用", "疏", "通", "血", "管", "的", "药", "物", "。", "脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", "严", "重", "吗", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/mrc_en b/examples/model_interpretation/data/mrc_en new file mode 100644 index 0000000000000000000000000000000000000000..d95bef1dedbdd64c494a8219276478d3dfbc0ae4 --- /dev/null +++ b/examples/model_interpretation/data/mrc_en @@ -0,0 +1,100 @@ +{"id": 1, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is itself borrowed from Old Low Franconian Nortmann \" Northman \" or directly from Old Norse Norðmaðr , Latinized variously as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "What is the original meaning of the word Norman ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "itself", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "directly", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "variously", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "ori", "rel_ids": [1508]} +{"id": 2, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is itself borrowed from Old Low Franconian Nortmann \" Northman \" or directly from Old Norse Norðmaðr , Latinized variously as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "When was the Latin version of the word Norman first recorded ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "itself", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "directly", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "variously", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "ori", "rel_ids": [1509]} +{"id": 3, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage with Old Norse traditions and customs to synthesize a unique \" Norman \" culture in the north of France . 
The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "What was the Norman religion ?", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "synthesize", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "ori", "rel_ids": [1510]} +{"id": 4, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage with Old Norse traditions and customs to synthesize a unique \" Norman \" culture in the north of France . The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "What part of France were the Normans located ?", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "synthesize", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "ori", "rel_ids": [1511]} +{"id": 5, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . 
Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Herve serve as a Byzantine general ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "ori", "rel_ids": [1512]} +{"id": 6, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Robert Crispin go up against the Turks ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "ori", "rel_ids": [1513]} +{"id": 7, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . 
Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "Who ruined Roussel de Bailleul 's plans for an independent state ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "ori", "rel_ids": [1514]} +{"id": 8, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "When did the Normans attack Dyrrachium ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "ori", "rel_ids": [1515]} +{"id": 9, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "What was the naval base called ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "ori", "rel_ids": [1516]} +{"id": 10, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . 
Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "Where was Dyrrachium located ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "ori", "rel_ids": [1517]} +{"id": 11, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was Margaret 's brother ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1518]} +{"id": 12, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was Margaret 's husband ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1519]} +{"id": 13, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "When was Scotland invaded by William ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1520]} +{"id": 14, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was the hostage ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1521]} +{"id": 15, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Where was Ralph earl of ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "ori", "rel_ids": [1522]} +{"id": 16, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who was Ralph in charge of being at war with ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "ori", "rel_ids": [1523]} +{"id": 17, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . 
In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who made Ralph earl ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "ori", "rel_ids": [1524]} +{"id": 18, "title": "", "context": "Subsequent to the Conquest , however , the Marches came completely under the dominance of William 's most trusted Norman barons , including Bernard de Neufmarché , Roger of Montgomery in Shropshire and Hugh Lupus in Cheshire . These Normans began a long period of slow conquest during which almost all of Wales was at some point subject to Norman interference . Norman words , such as baron ( barwn ) , first entered Welsh at that time .", "question": "What country was under the control of Norman barons ?", "sent_token": ["Subsequent", "to", "the", "Conquest", ",", "however", ",", "the", "Marches", "came", "completely", "under", "the", "dominance", "of", "William", "'s", "most", "trusted", "Norman", "barons", ",", "including", "Bernard", "de", "Neufmarché", ",", "Roger", "of", "Montgomery", "in", "Shropshire", "and", "Hugh", "Lupus", "in", "Cheshire", ".", "These", "Normans", "began", "a", "long", "period", "of", "slow", "conquest", "during", "which", "almost", "all", "of", "Wales", "was", "at", "some", "point", "subject", "to", "Norman", "interference", ".", "Norman", "words", ",", "such", "as", "baron", "(", "barwn", ")", ",", "first", "entered", "Welsh", "at", "that", "time", "."], "sample_type": "ori", "rel_ids": [1525]} +{"id": 19, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "What year did Roger de Tosny fail to accomplish what he set out to do ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "ori", "rel_ids": [1526]} +{"id": 20, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . 
In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "Who was in charge of the papal army in the War of Barbastro ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "ori", "rel_ids": [1527]} +{"id": 21, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "When did the Siege of Antioch take place ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "ori", "rel_ids": [1528]} +{"id": 22, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . 
Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What was the name of Bohemond 's nephew ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "ori", "rel_ids": [1529]} +{"id": 23, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What major conquest did Tancred play a roll in ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "ori", "rel_ids": [1530]} +{"id": 24, "title": "", "context": "The conquest of Cyprus by the Anglo - Norman forces of the Third Crusade opened a new chapter in the history of the island , which would be under Western European domination for the following 380 years . 
Although not part of a planned operation , the conquest had much more permanent results than initially expected .", "question": "How long did Western Europe control Cyprus ?", "sent_token": ["The", "conquest", "of", "Cyprus", "by", "the", "Anglo", "-", "Norman", "forces", "of", "the", "Third", "Crusade", "opened", "a", "new", "chapter", "in", "the", "history", "of", "the", "island", ",", "which", "would", "be", "under", "Western", "European", "domination", "for", "the", "following", "380", "years", ".", "Although", "not", "part", "of", "a", "planned", "operation", ",", "the", "conquest", "had", "much", "more", "permanent", "results", "than", "initially", "expected", "."], "sample_type": "ori", "rel_ids": [1531]} +{"id": 25, "title": "", "context": "Between 1402 and 1405 , the expedition led by the Norman noble Jean de Bethencourt and the Poitevine Gadifer de la Salle conquered the Canarian islands of Lanzarote , Fuerteventura and El Hierro off the Atlantic coast of Africa . Their troops were gathered in Normandy , Gascony and were later reinforced by Castilian colonists .", "question": "What continent are the Canarian Islands off the coast of ?", "sent_token": ["Between", "1402", "and", "1405", ",", "the", "expedition", "led", "by", "the", "Norman", "noble", "Jean", "de", "Bethencourt", "and", "the", "Poitevine", "Gadifer", "de", "la", "Salle", "conquered", "the", "Canarian", "islands", "of", "Lanzarote", ",", "Fuerteventura", "and", "El", "Hierro", "off", "the", "Atlantic", "coast", "of", "Africa", ".", "Their", "troops", "were", "gathered", "in", "Normandy", ",", "Gascony", "and", "were", "later", "reinforced", "by", "Castilian", "colonists", "."], "sample_type": "ori", "rel_ids": [1532]} +{"id": 26, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who became the King of the Canary Islands ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "ori", "rel_ids": [1533]} +{"id": 27, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who bought the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "ori", "rel_ids": [1534]} +{"id": 28, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . 
In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who sold the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "ori", "rel_ids": [1535]} +{"id": 29, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "Where are Jersey and Guernsey", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "ori", "rel_ids": [1536]} +{"id": 30, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . 
Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "How many customaries does Norman customary law have ?", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "ori", "rel_ids": [1537]} +{"id": 31, "title": "", "context": "Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What is the Norman architecture idiom ?", "sent_token": ["Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "ori", "rel_ids": [1538]} +{"id": 32, "title": "", "context": "Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . 
Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What kind of arches does Norman architecture have ?", "sent_token": ["Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "ori", "rel_ids": [1539]} +{"id": 33, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came after Norman in England ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "ori", "rel_ids": [1540]} +{"id": 34, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came before Norman in England ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "ori", "rel_ids": [1541]} +{"id": 35, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . 
In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What place had the Norman Arab architectural style ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "ori", "rel_ids": [1542]} +{"id": 36, "title": "", "context": "The French Wars of Religion in the 16th century and French Revolution in the 18th successively destroyed much of what existed in the way of the architectural and artistic remnant of this Norman creativity . The former , with their violence , caused the wanton destruction of many Norman edifices ; the latter , with its assault on religion , caused the purposeful destruction of religious objects of any type , and its destabilisation of society resulted in rampant pillaging .", "question": "When were the French wars of religion ?", "sent_token": ["The", "French", "Wars", "of", "Religion", "in", "the", "16th", "century", "and", "French", "Revolution", "in", "the", "18th", "successively", "destroyed", "much", "of", "what", "existed", "in", "the", "way", "of", "the", "architectural", "and", "artistic", "remnant", "of", "this", "Norman", "creativity", ".", "The", "former", ",", "with", "their", "violence", ",", "caused", "the", "wanton", "destruction", "of", "many", "Norman", "edifices", ";", "the", "latter", ",", "with", "its", "assault", "on", "religion", ",", "caused", "the", "purposeful", "destruction", "of", "religious", "objects", "of", "any", "type", ",", "and", "its", "destabilisation", "of", "society", "resulted", "in", "rampant", "pillaging", "."], "sample_type": "ori", "rel_ids": [1543]} +{"id": 37, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What kind of needlework was used in the creation of the Bayeux Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "ori", "rel_ids": [1544]} +{"id": 38, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . 
It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What is Norman art 's most well known piece ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "ori", "rel_ids": [1545]} +{"id": 39, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "Who commissioned the Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "ori", "rel_ids": [1546]} +{"id": 40, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "Where did the monks flee to ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1547]} +{"id": 41, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . 
There they continued the tradition of singing .", "question": "What monastery did the Saint - Evroul monks establish in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1548]} +{"id": 42, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "Who patronized the monks in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1549]} +{"id": 43, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "What tradition were the Saint - Evroul monks known for ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1550]} +{"id": 44, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . 
A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "What branch of theoretical computer science deals with broadly classifying computational problems by difficulty and class of relationship ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "ori", "rel_ids": [1551]} +{"id": 45, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "By what main attribute are computational problems classified utilizing computational complexity theory ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "ori", "rel_ids": [1552]} +{"id": 46, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . 
A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "What is the term for a task that generally lends itself to being solved by a computer ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "ori", "rel_ids": [1553]} +{"id": 47, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "By how many kilometers does the traveling salesman problem seek to classify a route between the 15 largest cities in Germany ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "ori", "rel_ids": [1554]} +{"id": 48, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . 
For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What is one example of an instance that the quantitative answer to the traveling salesman problem fails to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "ori", "rel_ids": [1555]} +{"id": 49, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What does computational complexity theory most specifically seek to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "ori", "rel_ids": [1556]} +{"id": 50, "title": "", "context": "When considering computational problems , a problem instance is a string over an alphabet . Usually , the alphabet is taken to be the binary alphabet ( i.e. , the set { 0,1 } ) , and thus the strings are bitstrings . As in a real - world computer , mathematical objects other than bitstrings must be suitably encoded . 
For example , integers can be represented in binary notation , and graphs can be encoded directly via their adjacency matrices , or by encoding their adjacency lists in binary .", "question": "In a computational problem , what can be described as a string over an alphabet ?", "sent_token": ["When", "considering", "computational", "problems", ",", "a", "problem", "instance", "is", "a", "string", "over", "an", "alphabet", ".", "Usually", ",", "the", "alphabet", "is", "taken", "to", "be", "the", "binary", "alphabet", "(", "i.e.", ",", "the", "set", "{", "0,1", "}", ")", ",", "and", "thus", "the", "strings", "are", "bitstrings", ".", "As", "in", "a", "real", "-", "world", "computer", ",", "mathematical", "objects", "other", "than", "bitstrings", "must", "be", "suitably", "encoded", ".", "For", "example", ",", "integers", "can", "be", "represented", "in", "binary", "notation", ",", "and", "graphs", "can", "be", "encoded", "directly", "via", "their", "adjacency", "matrices", ",", "or", "by", "encoding", "their", "adjacency", "lists", "in", "binary", "."], "sample_type": "ori", "rel_ids": [1557]} +{"id": 1508, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is itself borrowed from Old Low Franconian Nortmann \" Northman \" or directly from Old Norse Norðmaðr , Latinized variously as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "what is the original denotation of the word Norman ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "itself", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "directly", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "variously", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "disturb"} +{"id": 1509, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is borrowed from Old Low Franconian Nortmann \" Northman \" or from Old Norse Norðmaðr , Latinized as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "When was the Latin version of the word Norman first recorded ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "disturb"} +{"id": 1510, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage 
with Old Norse traditions and customs to compose a unique \" Norman \" culture in the north of France . The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "What was the Norman religion ?", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "compose", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "disturb"} +{"id": 1511, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage with Old Norse traditions and customs to synthesize a unique \" Norman \" culture in the north of France . The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "Where in France were the Normans located", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "synthesize", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "disturb"} +{"id": 1512, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . 
Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Herve assume the role of Byzantine general ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "disturb"} +{"id": 1513, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Robert Crispin fought against the Turks ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "disturb"} +{"id": 1514, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos . 
Roussel de Bailleul revolted against Isaac Comnene during one expedition and began the conquest of Lycaonia and Galatia for himself .", "question": "Who ruined Roussel de Bailleul 's plans for an independent state ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", ".", "Roussel", "de", "Bailleul", "revolted", "against", "Isaac", "Comnene", "during", "one", "expedition", "and", "began", "the", "conquest", "of", "Lycaonia", "and", "Galatia", "for", "himself", "."], "sample_type": "disturb"} +{"id": 1515, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "When did the Normans assault Dyrrachium ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "disturb"} +{"id": 1516, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "What was the naval base 's name ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "disturb"} +{"id": 1517, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . 
Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "Where was Dyrrachium situated ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "disturb"} +{"id": 1518, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he joined up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was Margaret 's brother ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "joined", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} +{"id": 1519, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was married to Margaret ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} +{"id": 1520, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "When was Scotland invaded by William ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} +{"id": 1521, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a string of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was the hostage ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "string", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} +{"id": 1522, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had appointed the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Where was Ralph earl of ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "appointed", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "disturb"} +{"id": 1523, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who was Ralph in charge of being at war with ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "disturb"} +{"id": 1524, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . 
In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who made Ralph become earl ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "disturb"} +{"id": 1525, "title": "", "context": "Subsequent to the Conquest , however , the Marches came completely under the dominance of William 's most trusted Norman barons , including Bernard de Neufmarché , Roger of Montgomery in Shropshire and Hugh Lupus in Cheshire . These Normans began a long period of slow conquest during which almost all of Wales was in some degree subject to Norman interference . Norman words , such as baron ( barwn ) , first entered Welsh at that time .", "question": "What country was under the control of Norman barons ?", "sent_token": ["Subsequent", "to", "the", "Conquest", ",", "however", ",", "the", "Marches", "came", "completely", "under", "the", "dominance", "of", "William", "'s", "most", "trusted", "Norman", "barons", ",", "including", "Bernard", "de", "Neufmarché", ",", "Roger", "of", "Montgomery", "in", "Shropshire", "and", "Hugh", "Lupus", "in", "Cheshire", ".", "These", "Normans", "began", "a", "long", "period", "of", "slow", "conquest", "during", "which", "almost", "all", "of", "Wales", "was", "in", "some", "degree", "subject", "to", "Norman", "interference", ".", "Norman", "words", ",", "such", "as", "baron", "(", "barwn", ")", ",", "first", "entered", "Welsh", "at", "that", "time", "."], "sample_type": "disturb"} +{"id": 1526, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "What year did Roger de Tosny not succeed accomplishing what he set out to do ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "disturb"} +{"id": 1527, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . 
In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "Who was the leader of the papal army in the War of Barbastro ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "disturb"} +{"id": 1528, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . Antioch lay on the crusaders ' route to Palestine , and anticipating that it would be attacked the Muslim governor of the city , Yaghi - Siyan , began stockpiling food and sending requests for help . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "When did the Siege of Antioch take place ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "Antioch", "lay", "on", "the", "crusaders", "'", "route", "to", "Palestine", ",", "and", "anticipating", "that", "it", "would", "be", "attacked", "the", "Muslim", "governor", "of", "the", "city", ",", "Yaghi", "-", "Siyan", ",", "began", "stockpiling", "food", "and", "sending", "requests", "for", "help", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "disturb"} +{"id": 1529, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . A politique , Bohemond was resolved to engineer the enthusiasm of the crusaders to his own ends ; and when his nephew Tancred left the main army at Heraclea Cybistra , and attempted to establish a footing in Cilicia , the movement may have been already intended as a preparation for Bohemond ’s eastern principality . 
Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What was the name of Bohemond 's nephew ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "A", "politique", ",", "Bohemond", "was", "resolved", "to", "engineer", "the", "enthusiasm", "of", "the", "crusaders", "to", "his", "own", "ends", ";", "and", "when", "his", "nephew", "Tancred", "left", "the", "main", "army", "at", "Heraclea", "Cybistra", ",", "and", "attempted", "to", "establish", "a", "footing", "in", "Cilicia", ",", "the", "movement", "may", "have", "been", "already", "intended", "as", "a", "preparation", "for", "Bohemond", "’s", "eastern", "principality", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "disturb"} +{"id": 1530, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What major conquest did Tancred play a part in ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "disturb"} +{"id": 1531, "title": "", "context": "The conquest of Cyprus by the Anglo - Norman forces of the Third Crusade opened a new chapter in the history of the island , which would be under Western European domination for 380 years . 
Although not part of a planned operation , the conquest had more permanent results than expected .", "question": "How long did Western Europe control Cyprus ?", "sent_token": ["The", "conquest", "of", "Cyprus", "by", "the", "Anglo", "-", "Norman", "forces", "of", "the", "Third", "Crusade", "opened", "a", "new", "chapter", "in", "the", "history", "of", "the", "island", ",", "which", "would", "be", "under", "Western", "European", "domination", "for", "380", "years", ".", "Although", "not", "part", "of", "a", "planned", "operation", ",", "the", "conquest", "had", "more", "permanent", "results", "than", "expected", "."], "sample_type": "disturb"} +{"id": 1532, "title": "", "context": "Between 1402 and 1405 , the expedition led by the Norman noble Jean de Bethencourt and the Poitevine Gadifer de la Salle conquered the Canarian islands of Lanzarote , Fuerteventura and El Hierro off the Atlantic coast of Africa . Their troops were assembled in Normandy , Gascony and were later reinforced by Castilian colonists .", "question": "What continent are the Canarian Islands off the coast of ?", "sent_token": ["Between", "1402", "and", "1405", ",", "the", "expedition", "led", "by", "the", "Norman", "noble", "Jean", "de", "Bethencourt", "and", "the", "Poitevine", "Gadifer", "de", "la", "Salle", "conquered", "the", "Canarian", "islands", "of", "Lanzarote", ",", "Fuerteventura", "and", "El", "Hierro", "off", "the", "Atlantic", "coast", "of", "Africa", ".", "Their", "troops", "were", "assembled", "in", "Normandy", ",", "Gascony", "and", "were", "later", "reinforced", "by", "Castilian", "colonists", "."], "sample_type": "disturb"} +{"id": 1533, "title": "", "context": "Jean de Béthencourt was a French explorer who was responsible for the expedition to the Canaries . Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who became the King of the Canary Islands ?", "sent_token": ["Jean", "de", "Béthencourt", "was", "a", "French", "explorer", "who", "was", "responsible", "for", "the", "expedition", "to", "the", "Canaries", ".", "Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "disturb"} +{"id": 1534, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who purchased the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "disturb"} +{"id": 1535, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . 
In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla . Maciot de Bethencourt was born illegitimate circa 1390 at France .", "question": "Who sold the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", ".", "Maciot", "de", "Bethencourt", "was", "born", "illegitimate", "circa", "1390", "at", "France", "."], "sample_type": "disturb"} +{"id": 1536, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . Just off the Normandy coast , the Channel Islands comprising of Jersey , Guernsey , Alderney , Sark and Herm are a short hop away from Britain and mainland Europe . Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "Where are Jersey and Guernsey", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Just", "off", "the", "Normandy", "coast", ",", "the", "Channel", "Islands", "comprising", "of", "Jersey", ",", "Guernsey", ",", "Alderney", ",", "Sark", "and", "Herm", "are", "a", "short", "hop", "away", "from", "Britain", "and", "mainland", "Europe", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "disturb"} +{"id": 1537, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . 
Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "How many customaries does Norman customary law possess ?", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "disturb"} +{"id": 1538, "title": "", "context": "The term Norman architecture is used to categorise styles of Romanesque architecture developed by the Normans in the various lands under their dominion or influence in the 11th and 12th centuries . Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What is the Norman architecture idiom ?", "sent_token": ["The", "term", "Norman", "architecture", "is", "used", "to", "categorise", "styles", "of", "Romanesque", "architecture", "developed", "by", "the", "Normans", "in", "the", "various", "lands", "under", "their", "dominion", "or", "influence", "in", "the", "11th", "and", "12th", "centuries", ".", "Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "disturb"} +{"id": 1539, "title": "", "context": "Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . 
Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What type of arches does Norman architecture have ?", "sent_token": ["Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "disturb"} +{"id": 1540, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans integrated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came after Norman in England ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "integrated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "disturb"} +{"id": 1541, "title": "", "context": "Norman Castles were typically built on the highest ground in the area , often adjoined Rivers and overlooking towns and harbours . In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came before Norman in England ?", "sent_token": ["Norman", "Castles", "were", "typically", "built", "on", "the", "highest", "ground", "in", "the", "area", ",", "often", "adjoined", "Rivers", "and", "overlooking", "towns", "and", "harbours", ".", "In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "disturb"} +{"id": 1542, "title": "", "context": "In England , the period of Norman architecture succeeds that of the Anglo - Saxon and precedes the Early Gothic . 
In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What place had the Norman Arab architectural style ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "disturb"} +{"id": 1543, "title": "", "context": "The French Wars of Religion in the 16th century and French Revolution in the 18th successively destroyed much of what existed in the way of the architectural and artistic remnant of this Norman creativity . The former , with their violence , caused the wanton destruction of many Norman edifices ; the latter , with its assault on religion , caused the purposeful destruction of religious objects of any type , and its destabilisation of society led to rampant pillaging .", "question": "When were the French wars of religion ?", "sent_token": ["The", "French", "Wars", "of", "Religion", "in", "the", "16th", "century", "and", "French", "Revolution", "in", "the", "18th", "successively", "destroyed", "much", "of", "what", "existed", "in", "the", "way", "of", "the", "architectural", "and", "artistic", "remnant", "of", "this", "Norman", "creativity", ".", "The", "former", ",", "with", "their", "violence", ",", "caused", "the", "wanton", "destruction", "of", "many", "Norman", "edifices", ";", "the", "latter", ",", "with", "its", "assault", "on", "religion", ",", "caused", "the", "purposeful", "destruction", "of", "religious", "objects", "of", "any", "type", ",", "and", "its", "destabilisation", "of", "society", "led", "to", "rampant", "pillaging", "."], "sample_type": "disturb"} +{"id": 1544, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . The Bayeux Tapestry is a narrative embroidery of about 70 meters long and 50 centimeters wide . 
It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What kind of needlework was used in the creation of the Bayeux Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "The", "Bayeux", "Tapestry", "is", "a", "narrative", "embroidery", "of", "about", "70", "meters", "long", "and", "50", "centimeters", "wide", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "disturb"} +{"id": 1545, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What is Norman art 's world - renowned piece ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "disturb"} +{"id": 1546, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , hiring natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "Who commissioned the Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "hiring", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "disturb"} +{"id": 1547, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved reputation in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . 
There they continued the tradition of singing .", "question": "Where did the monks flee to ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "reputation", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} +{"id": 1548, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were supported by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "What monastery did the Saint - Evroul monks establish in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "supported", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} +{"id": 1549, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . Robert Guiscard was a Norman adventurer remembered for the conquest of southern Italy and Sicily . There they continued the tradition of singing .", "question": "Who patronized the monks in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "Robert", "Guiscard", "was", "a", "Norman", "adventurer", "remembered", "for", "the", "conquest", "of", "southern", "Italy", "and", "Sicily", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} +{"id": 1550, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . 
There they proceeded with the tradition of singing .", "question": "What tradition were the Saint - Evroul monks known for ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "proceeded", "with", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} +{"id": 1551, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "What branch of theoretical computer science handles broadly classifying computational problems by difficulty and class of relationship ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "disturb"} +{"id": 1552, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . 
A computational problem is understood to be a task that is in theory amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "By what main attribute are computational problems classified utilizing computational complexity theory ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "theory", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "disturb"} +{"id": 1553, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm . Informally , a computational problem consists of problem instances and solutions to these problem instances .", "question": "What is the term for a task that generally lends itself to being solved by a computer ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", ".", "Informally", ",", "a", "computational", "problem", "consists", "of", "problem", "instances", "and", "solutions", "to", "these", "problem", "instances", "."], "sample_type": "disturb"} +{"id": 1554, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . 
Therefore , complexity theory addresses computational problems and not particular problem instances .", "question": "By how many kilometers does the traveling salesman problem seek to classify a route between the 15 largest cities in Germany ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "Therefore", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "disturb"} +{"id": 1555, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What is one example of an instance that the quantitative answer to the traveling salesman problem is unable to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "disturb"} +{"id": 1556, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . 
For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What does computational complexity theory most specifically want to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "disturb"} +{"id": 1557, "title": "", "context": "When considering computational problems , a problem instance is a string over an alphabet . Generally , the alphabet is taken to be the binary alphabet ( i.e. , the set { 0,1 } ) , and thus the strings are bitstrings . As in a real - world computer , mathematical objects other than bitstrings must be suitably encoded . For example , integers can be represented in binary notation , and graphs can be encoded directly via their adjacency matrices , or by encoding their adjacency lists in binary .", "question": "In a computational problem , what can be described as a string over an alphabet ?", "sent_token": ["When", "considering", "computational", "problems", ",", "a", "problem", "instance", "is", "a", "string", "over", "an", "alphabet", ".", "Generally", ",", "the", "alphabet", "is", "taken", "to", "be", "the", "binary", "alphabet", "(", "i.e.", ",", "the", "set", "{", "0,1", "}", ")", ",", "and", "thus", "the", "strings", "are", "bitstrings", ".", "As", "in", "a", "real", "-", "world", "computer", ",", "mathematical", "objects", "other", "than", "bitstrings", "must", "be", "suitably", "encoded", ".", "For", "example", ",", "integers", "can", "be", "represented", "in", "binary", "notation", ",", "and", "graphs", "can", "be", "encoded", "directly", "via", "their", "adjacency", "matrices", ",", "or", "by", "encoding", "their", "adjacency", "lists", "in", "binary", "."], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/senti_ch b/examples/model_interpretation/data/senti_ch new file mode 100644 index 0000000000000000000000000000000000000000..d17704e850549f19b6489930a536c816691dd1ab --- /dev/null +++ b/examples/model_interpretation/data/senti_ch @@ -0,0 +1,100 @@ +{"id": 1, "context": "特别垃圾的摄影店,服务态度差", "sent_token": ["特", "别", "垃", "圾", "的", "摄", "影", "店", ",", "服", "务", "态", "度", "差"], "sample_type": "ori", "rel_ids": [1647]} +{"id": 4, "context": "加油员服务态度特别好!加油站的油价合理!我经常在这里加油", "sent_token": ["加", "油", "员", "服", "务", "态", "度", "特", "别", "好", "!", "加", "油", "站", "的", "油", "价", "合", "理", "!", "我", "经", "常", "在", "这", "里", "加", "油"], "sample_type": "ori", "rel_ids": [1650]} +{"id": 5, "context": "不错,交通便利,出行方便!", "sent_token": ["不", "错", ",", "交", "通", "便", "利", ",", "出", "行", "方", "便", "!"], "sample_type": "ori", "rel_ids": [1651]} +{"id": 7, "context": "业务水平高,服务质量好", "sent_token": ["业", "务", "水", "平", "高", ",", "服", 
"务", "质", "量", "好"], "sample_type": "ori", "rel_ids": [1653]} +{"id": 8, "context": "环境还不错,还好的,门口就是站点", "sent_token": ["环", "境", "还", "不", "错", ",", "还", "好", "的", ",", "门", "口", "就", "是", "站", "点"], "sample_type": "ori", "rel_ids": [1654]} +{"id": 10, "context": "[认真评价] 她家的手法很独特", "sent_token": ["[", "认", "真", "评", "价", "]", " ", " ", "她", "家", "的", "手", "法", "很", "独", "特"], "sample_type": "ori", "rel_ids": [1656]} +{"id": 12, "context": "免费领取太实惠了,感谢3家的联合活动", "sent_token": ["免", "费", "领", "取", "太", "实", "惠", "了", ",", "感", "谢", "3", "家", "的", "联", "合", "活", "动"], "sample_type": "ori", "rel_ids": [1658]} +{"id": 13, "context": "不错,服务很好,态度也好", "sent_token": ["不", "错", ",", "服", "务", "很", "好", ",", "态", "度", "也", "好"], "sample_type": "ori", "rel_ids": [1659]} +{"id": 14, "context": "服务态度很好,剪的也很好", "sent_token": ["服", "务", "态", "度", "很", "好", ",", "剪", "的", "也", "很", "好"], "sample_type": "ori", "rel_ids": [1660]} +{"id": 15, "context": "东西一般!环境也不怎么好!有包间就会好点", "sent_token": ["东", "西", "一", "般", "!", "环", "境", "也", "不", "怎", "么", "好", "!", "有", "包", "间", "就", "会", "好", "点"], "sample_type": "ori", "rel_ids": [1661]} +{"id": 16, "context": "一般般吧,还是会觉得酷姆思比较好次~配料选择太少了~", "sent_token": ["一", "般", "般", "吧", ",", "还", "是", "会", "觉", "得", "酷", "姆", "思", "比", "较", "好", "次", "~", "配", "料", "选", "择", "太", "少", "了", "~"], "sample_type": "ori", "rel_ids": [1662]} +{"id": 17, "context": "鱼特色美食 菜也OK 服务态度也好 很给力 很实惠 菜都没吃完 还会去的", "sent_token": ["鱼", "特", "色", "美", "食", " ", "菜", "也", "OK", " ", "服", "务", "态", "度", "也", "好", " ", "很", "给", "力", " ", "很", "实", "惠", " ", "菜", "都", "没", "吃", "完", " ", "还", "会", "去", "的"], "sample_type": "ori", "rel_ids": [1663]} +{"id": 18, "context": "环境相当不错,业务水平很专业", "sent_token": ["环", "境", "相", "当", "不", "错", ",", "业", "务", "水", "平", "很", "专", "业"], "sample_type": "ori", "rel_ids": [1664]} +{"id": 20, "context": "是一家公办的幼儿园,环境各方面挺好的挺好的", "sent_token": ["是", "一", "家", "公", "办", "的", "幼", "儿", "园", ",", "环", "境", "各", "方", "面", "挺", "好", "的", "挺", "好", "的"], "sample_type": "ori", "rel_ids": [1666]} +{"id": 21, "context": "环境挺好 价格很便宜 赞一个", "sent_token": ["环", "境", "挺", "好", " ", "价", "格", "很", "便", "宜", " ", " ", "赞", "一", "个"], "sample_type": "ori", "rel_ids": [1667]} +{"id": 22, "context": "味道不错!团购很实惠", "sent_token": ["味", "道", "不", "错", "!", "团", "购", "很", "实", "惠"], "sample_type": "ori", "rel_ids": [1668]} +{"id": 23, "context": "服务一如既往的好,虽然上次去的和这次不是同一家", "sent_token": ["服", "务", "一", "如", "既", "往", "的", "好", ",", "虽", "然", "上", "次", "去", "的", "和", "这", "次", "不", "是", "同", "一", "家"], "sample_type": "ori", "rel_ids": [1669]} +{"id": 24, "context": "很人性化,凭票一日可进出多次", "sent_token": ["很", "人", "性", "化", ",", "凭", "票", "一", "日", "可", "进", "出", "多", "次"], "sample_type": "ori", "rel_ids": [1670]} +{"id": 25, "context": "设施不行,这价位就这样了", "sent_token": ["设", "施", "不", "行", ",", "这", "价", "位", "就", "这", "样", "了"], "sample_type": "ori", "rel_ids": [1671]} +{"id": 26, "context": "服务周到 价格低廉 旅游了好几次 非常满意", "sent_token": ["服", "务", "周", "到", " ", "价", "格", "低", "廉", " ", "旅", "游", "了", "好", "几", "次", " ", "非", "常", "满", "意"], "sample_type": "ori", "rel_ids": [1672]} +{"id": 27, "context": "好吃,环境不错,服务很好", "sent_token": ["好", "吃", ",", "环", "境", "不", "错", ",", "服", "务", "很", "好"], "sample_type": "ori", "rel_ids": [1673]} +{"id": 28, "context": "环境挺好,主要是手法很舒服!做完后皮肤水水的!", "sent_token": ["环", "境", "挺", "好", ",", "主", "要", "是", "手", "法", "很", "舒", "服", "!", "做", "完", "后", "皮", "肤", "水", "水", "的", "!"], "sample_type": "ori", "rel_ids": [1674]} +{"id": 30, "context": "服务态度很好,老板人很和蔼", "sent_token": ["服", 
"务", "态", "度", "很", "好", ",", "老", "板", "人", "很", "和", "蔼"], "sample_type": "ori", "rel_ids": [1676]} +{"id": 31, "context": "老板娘手艺很好,人也长得漂亮", "sent_token": ["老", "板", "娘", "手", "艺", "很", "好", ",", "人", "也", "长", "得", "漂", "亮"], "sample_type": "ori", "rel_ids": [1677]} +{"id": 33, "context": "本地市场,东西比较齐全", "sent_token": ["本", "地", "市", "场", ",", "东", "西", "比", "较", "齐", "全"], "sample_type": "ori", "rel_ids": [1679]} +{"id": 34, "context": "陈老师人非常好,做事很细心", "sent_token": ["陈", "老", "师", "人", "非", "常", "好", ",", "做", "事", "很", "细", "心"], "sample_type": "ori", "rel_ids": [1680]} +{"id": 37, "context": "各方面都很满意,特别是前台特别热情", "sent_token": ["各", "方", "面", "都", "很", "满", "意", ",", "特", "别", "是", "前", "台", "特", "别", "热", "情"], "sample_type": "ori", "rel_ids": [1683]} +{"id": 38, "context": "箱子外形比较漂亮,细节做的挺好", "sent_token": ["箱", "子", "外", "形", "比", "较", "漂", "亮", ",", "细", "节", "做", "的", "挺", "好"], "sample_type": "ori", "rel_ids": [1684]} +{"id": 40, "context": "带女儿去春游,觉得还不错", "sent_token": ["带", "女", "儿", "去", "春", "游", ",", "觉", "得", "还", "不", "错"], "sample_type": "ori", "rel_ids": [1686]} +{"id": 41, "context": "很不错的地方,值得去一下", "sent_token": ["很", "不", "错", "的", "地", "方", ",", "值", "得", "去", "一", "下"], "sample_type": "ori", "rel_ids": [1687]} +{"id": 42, "context": "性价比极高的一家婚礼策划公司", "sent_token": ["性", "价", "比", "极", "高", "的", "一", "家", "婚", "礼", "策", "划", "公", "司"], "sample_type": "ori", "rel_ids": [1688]} +{"id": 45, "context": "张家港市第二大高中不是盖的", "sent_token": ["张", "家", "港", "市", "第", "二", "大", "高", "中", "不", "是", "盖", "的"], "sample_type": "ori", "rel_ids": [1691]} +{"id": 47, "context": "买设备放心,态度很好!!!!!!", "sent_token": ["买", "设", "备", "放", "心", ",", "态", "度", "很", "好", "!", "!", "!", "!", "!", "!"], "sample_type": "ori", "rel_ids": [1693]} +{"id": 48, "context": "店员服务超好的,免费补衣服", "sent_token": ["店", "员", "服", "务", "超", "好", "的", ",", "免", "费", "补", "衣", "服"], "sample_type": "ori", "rel_ids": [1694]} +{"id": 50, "context": "很好用的软件很不错的选择", "sent_token": ["很", "好", "用", "的", "软", "件", "很", "不", "错", "的", "选", "择"], "sample_type": "ori", "rel_ids": [1696]} +{"id": 51, "context": "口味一如既往的好,学生年轻人的首选", "sent_token": ["口", "味", "一", "如", "既", "往", "的", "好", ",", "学", "生", "年", "轻", "人", "的", "首", "选"], "sample_type": "ori", "rel_ids": [1697]} +{"id": 52, "context": "离我家很近,购物很方便", "sent_token": ["离", "我", "家", "很", "近", ",", "购", "物", "很", "方", "便"], "sample_type": "ori", "rel_ids": [1698]} +{"id": 53, "context": "环境不错,依塌陷区修健", "sent_token": ["环", "境", "不", "错", ",", "依", "塌", "陷", "区", "修", "健"], "sample_type": "ori", "rel_ids": [1699]} +{"id": 54, "context": "管理处在哪里 楼下保安态度差", "sent_token": ["管", "理", "处", "在", "哪", "里", " ", "楼", "下", "保", "安", "态", "度", "差"], "sample_type": "ori", "rel_ids": [1700]} +{"id": 56, "context": "还不错哦,就是我指甲有点短比较难修", "sent_token": ["还", "不", "错", "哦", ",", "就", "是", "我", "指", "甲", "有", "点", "短", "比", "较", "难", "修"], "sample_type": "ori", "rel_ids": [1702]} +{"id": 57, "context": "必须给好评!!这家店可太棒了", "sent_token": ["必", "须", "给", "好", "评", "!", "!", "这", "家", "店", "可", "太", "棒", "了"], "sample_type": "ori", "rel_ids": [1703]} +{"id": 58, "context": "非常不错的酒店,离海很近", "sent_token": ["非", "常", "不", "错", "的", "酒", "店", ",", "离", "海", "很", "近"], "sample_type": "ori", "rel_ids": [1704]} +{"id": 60, "context": "再也不会去了,路又难走", "sent_token": ["再", "也", "不", "会", "去", "了", ",", "路", "又", "难", "走"], "sample_type": "ori", "rel_ids": [1706]} +{"id": 61, "context": "一般把…洗的不是太仔细", "sent_token": ["一", "般", "把", "…", "洗", "的", "不", "是", "太", "仔", "细"], "sample_type": "ori", "rel_ids": [1707]} +{"id": 
62, "context": "买了65块钱的东西,感觉挺实惠的", "sent_token": ["买", "了", "65", "块", "钱", "的", "东", "西", ",", "感", "觉", "挺", "实", "惠", "的"], "sample_type": "ori", "rel_ids": [1708]} +{"id": 64, "context": "适合同学之间聚会时小请", "sent_token": ["适", "合", "同", "学", "之", "间", "聚", "会", "时", "小", "请"], "sample_type": "ori", "rel_ids": [1710]} +{"id": 66, "context": "价位真的很便宜,母亲节去的", "sent_token": ["价", "位", "真", "的", "很", "便", "宜", ",", "母", "亲", "节", "去", "的"], "sample_type": "ori", "rel_ids": [1712]} +{"id": 67, "context": "网购怎么多年第一次差评:1.实物与描述不符", "sent_token": ["网", "购", "怎", "么", "多", "年", "第", "一", "次", "差", "评", ":", "1", ".", "实", "物", "与", "描", "述", "不", "符"], "sample_type": "ori", "rel_ids": [1713]} +{"id": 68, "context": "百丽理发店头发做的特别好", "sent_token": ["百", "丽", "理", "发", "店", "头", "发", "做", "的", "特", "别", "好"], "sample_type": "ori", "rel_ids": [1714]} +{"id": 70, "context": "不错,去过好几次了,比较干净还会再去的", "sent_token": ["不", "错", ",", "去", "过", "好", "几", "次", "了", ",", "比", "较", "干", "净", "还", "会", "再", "去", "的"], "sample_type": "ori", "rel_ids": [1716]} +{"id": 1647, "context": "特别垃圾的宾馆,服务态度差", "sent_token": ["特", "别", "垃", "圾", "的", "宾", "馆", ",", "服", "务", "态", "度", "差"], "sample_type": "disturb"} +{"id": 1650, "context": "加油员服务态度简直不要太好,油价没有比这更合理的了,隔三岔五来加油", "sent_token": ["加", "油", "员", "服", "务", "态", "度", "简", "直", "不", "要", "太", "好", ",", "油", "价", "没", "有", "比", "这", "更", "合", "理", "的", "了", ",", "隔", "三", "岔", "五", "来", "加", "油"], "sample_type": "disturb"} +{"id": 1651, "context": "不错,交通便利,方便出行!", "sent_token": ["不", "错", ",", "交", "通", "便", "利", ",", "方", "便", "出", "行", "!"], "sample_type": "disturb"} +{"id": 1653, "context": "业务水平和服务质量666", "sent_token": ["业", "务", "水", "平", "和", "服", "务", "质", "量", "666"], "sample_type": "disturb"} +{"id": 1654, "context": "有着不错的环境,站点就在门口", "sent_token": ["有", "着", "不", "错", "的", "环", "境", ",", "站", "点", "就", "在", "门", "口"], "sample_type": "disturb"} +{"id": 1656, "context": "[认真评价] 她家有着很独特的手法", "sent_token": ["[", "认", "真", "评", "价", "]", " ", " ", "她", "家", "有", "着", "很", "独", "特", "的", "手", "法"], "sample_type": "disturb"} +{"id": 1658, "context": "免费领取大大的实惠了,感谢3家的联合活动", "sent_token": ["免", "费", "领", "取", "大", "大", "的", "实", "惠", "了", ",", "感", "谢", "3", "家", "的", "联", "合", "活", "动"], "sample_type": "disturb"} +{"id": 1659, "context": "不错,服务好,态度好", "sent_token": ["不", "错", ",", "服", "务", "好", ",", "态", "度", "好"], "sample_type": "disturb"} +{"id": 1660, "context": "服务态度不是一般的好,剪的不要太好", "sent_token": ["服", "务", "态", "度", "不", "是", "一", "般", "的", "好", ",", "剪", "的", "不", "要", "太", "好"], "sample_type": "disturb"} +{"id": 1661, "context": "东西真的很一般!环境也真的不怎么好!有包间就会好点", "sent_token": ["东", "西", "真", "的", "很", "一", "般", "!", "环", "境", "也", "真", "的", "不", "怎", "么", "好", "!", "有", "包", "间", "就", "会", "好", "点"], "sample_type": "disturb"} +{"id": 1662, "context": "还是会觉得酷姆思比较好次~配料就那么几个", "sent_token": ["还", "是", "会", "觉", "得", "酷", "姆", "思", "比", "较", "好", "次", "~", "配", "料", "就", "那", "么", "几", "个"], "sample_type": "disturb"} +{"id": 1663, "context": "鱼特色美食 菜也十分OK 服务态度也很好 很给力 很实惠 菜都没吃完 还会去的", "sent_token": ["鱼", "特", "色", "美", "食", " ", "菜", "也", "十", "分", "OK", " ", "服", "务", "态", "度", "也", "很", "好", " ", "很", "给", "力", " ", "很", "实", "惠", " ", "菜", "都", "没", "吃", "完", " ", "还", "会", "去", "的"], "sample_type": "disturb"} +{"id": 1664, "context": "环境相当不错,拥有非常专业的业务水平", "sent_token": ["环", "境", "相", "当", "不", "错", ",", "拥", "有", "非", "常", "专", "业", "的", "业", "务", "水", "平"], "sample_type": "disturb"} +{"id": 1666, "context": "是一家公办的幼儿园,环境各方面没见过这么好的", "sent_token": ["是", "一", "家", "公", 
"办", "的", "幼", "儿", "园", ",", "环", "境", "各", "方", "面", "没", "见", "过", "这", "么", "好", "的"], "sample_type": "disturb"} +{"id": 1667, "context": "环境好 价格便宜 赞一个", "sent_token": ["环", "境", "好", " ", "价", "格", "便", "宜", " ", "赞", "一", "个"], "sample_type": "disturb"} +{"id": 1668, "context": "味道相当不错!团购实惠", "sent_token": ["味", "道", "相", "当", "不", "错", "!", "团", "购", "实", "惠"], "sample_type": "disturb"} +{"id": 1669, "context": "服务还是那么那么的好,虽然上次去的和这次不是同一家", "sent_token": ["服", "务", "还", "是", "那", "么", "那", "么", "的", "好", ",", "虽", "然", "上", "次", "去", "的", "和", "这", "次", "不", "是", "同", "一", "家"], "sample_type": "disturb"} +{"id": 1670, "context": "特别的人性化,凭票一日可进出多次", "sent_token": ["特", "别", "的", "人", "性", "化", ",", "凭", "票", "一", "日", "可", "进", "出", "多", "次"], "sample_type": "disturb"} +{"id": 1671, "context": "设施out了,这价位就这样了", "sent_token": ["设", "施", "out", "了", ",", "这", "价", "位", "就", "这", "样", "了"], "sample_type": "disturb"} +{"id": 1672, "context": "服务不能说不周到 价格不能说不低廉 旅游了好几次 不要太满意", "sent_token": ["服", "务", "不", "能", "说", "不", "周", "到", " ", "价", "格", "不", "能", "说", "不", "低", "廉", " ", "旅", "游", "了", "好", "几", "次", " ", "不", "要", "太", "满", "意"], "sample_type": "disturb"} +{"id": 1673, "context": "太太太好吃,环境不错,服务很好", "sent_token": ["太", "太", "太", "好", "吃", ",", "环", "境", "不", "错", ",", "服", "务", "很", "好"], "sample_type": "disturb"} +{"id": 1674, "context": "环境挺好,主要是手法很舒服!皮肤做完后还水水的!", "sent_token": ["环", "境", "挺", "好", ",", "主", "要", "是", "手", "法", "很", "舒", "服", "!", "皮", "肤", "做", "完", "后", "还", "水", "水", "的", "!"], "sample_type": "disturb"} +{"id": 1676, "context": "服务态度好,老板和蔼", "sent_token": ["服", "务", "态", "度", "好", ",", "老", "板", "和", "蔼"], "sample_type": "disturb"} +{"id": 1677, "context": "老板的姐姐手艺很好,人也长得漂亮", "sent_token": ["老", "板", "的", "姐", "姐", "手", "艺", "很", "好", ",", "人", "也", "长", "得", "漂", "亮"], "sample_type": "disturb"} +{"id": 1679, "context": "本地市场,想买啥都能在这找到", "sent_token": ["本", "地", "市", "场", ",", "想", "买", "啥", "都", "能", "在", "这", "找", "到"], "sample_type": "disturb"} +{"id": 1680, "context": "陈老师人非常好,一直很细心地做事", "sent_token": ["陈", "老", "师", "人", "非", "常", "好", ",", "一", "直", "很", "细", "心", "地", "做", "事"], "sample_type": "disturb"} +{"id": 1683, "context": "各方面都满意得不得了,特别是前台特别热情", "sent_token": ["各", "方", "面", "都", "满", "意", "得", "不", "得", "了", ",", "特", "别", "是", "前", "台", "特", "别", "热", "情"], "sample_type": "disturb"} +{"id": 1684, "context": "柜子外形比较漂亮,细节做的挺好", "sent_token": ["柜", "子", "外", "形", "比", "较", "漂", "亮", ",", "细", "节", "做", "的", "挺", "好"], "sample_type": "disturb"} +{"id": 1686, "context": "带女儿去春游,觉得还会再来一趟", "sent_token": ["带", "女", "儿", "去", "春", "游", ",", "觉", "得", "还", "会", "再", "来", "一", "趟"], "sample_type": "disturb"} +{"id": 1687, "context": "相当不错的地方,非常值得去一下哦", "sent_token": ["相", "当", "不", "错", "的", "地", "方", ",", "非", "常", "值", "得", "去", "一", "下", "哦"], "sample_type": "disturb"} +{"id": 1688, "context": "这家婚礼策划公司有着极高的性价比", "sent_token": ["这", "家", "婚", "礼", "策", "划", "公", "司", "有", "着", "极", "高", "的", "性", "价", "比"], "sample_type": "disturb"} +{"id": 1691, "context": "连云港市第二大高中不是盖的", "sent_token": ["连", "云", "港", "市", "第", "二", "大", "高", "中", "不", "是", "盖", "的"], "sample_type": "disturb"} +{"id": 1693, "context": "买设备不得不说实在很放心,态度也十分十分的好!!!!!!", "sent_token": ["买", "设", "备", "不", "得", "不", "说", "实", "在", "很", "放", "心", ",", "态", "度", "也", "十", "分", "十", "分", "的", "好", "!", "!", "!", "!", "!", "!"], "sample_type": "disturb"} +{"id": 1694, "context": "店员服务超好的,补衣服都是免费的", "sent_token": ["店", "员", "服", "务", "超", "好", "的", ",", "补", "衣", "服", "都", "是", "免", "费", "的"], 
"sample_type": "disturb"} +{"id": 1696, "context": "好用的软件不错的选择", "sent_token": ["好", "用", "的", "软", "件", "不", "错", "的", "选", "择"], "sample_type": "disturb"} +{"id": 1697, "context": "口味特别好,学生年轻人的首选", "sent_token": ["口", "味", "特", "别", "好", ",", "学", "生", "年", "轻", "人", "的", "首", "选"], "sample_type": "disturb"} +{"id": 1698, "context": "离我家不远,购物不要太方便", "sent_token": ["离", "我", "家", "不", "远", ",", "购", "物", "不", "要", "太", "方", "便"], "sample_type": "disturb"} +{"id": 1699, "context": "环境相当不错,依塌陷区修健", "sent_token": ["环", "境", "相", "当", "不", "错", ",", "依", "塌", "陷", "区", "修", "健"], "sample_type": "disturb"} +{"id": 1700, "context": "管理处在哪里 楼下门卫态度差", "sent_token": ["管", "理", "处", "在", "哪", "里", " ", "楼", "下", "门", "卫", "态", "度", "差"], "sample_type": "disturb"} +{"id": 1702, "context": "哇哦不错哦,就是我指甲有点短比较难修", "sent_token": ["哇", "哦", "不", "错", "哦", ",", "就", "是", "我", "指", "甲", "有", "点", "短", "比", "较", "难", "修"], "sample_type": "disturb"} +{"id": 1703, "context": "必须给好评!!这家店可太棒了,不这么写不给返现", "sent_token": ["必", "须", "给", "好", "评", "!", "!", "这", "家", "店", "可", "太", "棒", "了", ",", "不", "这", "么", "写", "不", "给", "返", "现"], "sample_type": "disturb"} +{"id": 1704, "context": "非常不错的民宿,离海很近", "sent_token": ["非", "常", "不", "错", "的", "民", "宿", ",", "离", "海", "很", "近"], "sample_type": "disturb"} +{"id": 1706, "context": "再也不会去了,路有一点点难走", "sent_token": ["再", "也", "不", "会", "去", "了", ",", "路", "有", "一", "点", "点", "难", "走"], "sample_type": "disturb"} +{"id": 1707, "context": "一般把…洗的不要太敷衍", "sent_token": ["一", "般", "把", "…", "洗", "的", "不", "要", "太", "敷", "衍"], "sample_type": "disturb"} +{"id": 1708, "context": "买东西用了65块钱,感觉挺实惠的", "sent_token": ["买", "东", "西", "用", "了", "65", "块", "钱", ",", "感", "觉", "挺", "实", "惠", "的"], "sample_type": "disturb"} +{"id": 1710, "context": "同学之间聚会小请还是很适合的", "sent_token": ["同", "学", "之", "间", "聚", "会", "小", "请", "还", "是", "很", "适", "合", "的"], "sample_type": "disturb"} +{"id": 1712, "context": "价位适合工薪族,母亲节去的", "sent_token": ["价", "位", "适", "合", "工", "薪", "族", ",", "母", "亲", "节", "去", "的"], "sample_type": "disturb"} +{"id": 1713, "context": "真的想给好评,实物不允许呀", "sent_token": ["真", "的", "想", "给", "好", "评", ",", "实", "物", "不", "允", "许", "呀"], "sample_type": "disturb"} +{"id": 1714, "context": "一丝风尚理发店头发做的特别好", "sent_token": ["一", "丝", "风", "尚", "理", "发", "店", "头", "发", "做", "的", "特", "别", "好"], "sample_type": "disturb"} +{"id": 1716, "context": "去过好几次了,比较干净,但是不是心思全都用在卫生上了", "sent_token": ["去", "过", "好", "几", "次", "了", ",", "比", "较", "干", "净", ",", "但", "是", "不", "是", "心", "思", "全", "都", "用", "在", "卫", "生", "上", "了"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/senti_en b/examples/model_interpretation/data/senti_en new file mode 100644 index 0000000000000000000000000000000000000000..89da58aa5dbb09695b3de2d7cd4c2232044a1830 --- /dev/null +++ b/examples/model_interpretation/data/senti_en @@ -0,0 +1,100 @@ +{"id": 1, "context": "it 's a charming and often affecting journey .", "sent_token": ["it", "'s", "a", "charming", "and", "often", "affecting", "journey", "."], "sample_type": "ori", "rel_ids": [1500]} +{"id": 2, "context": "unflinchingly bleak and desperate", "sent_token": ["unflinchingly", "bleak", "and", "desperate"], "sample_type": "ori", "rel_ids": [1501]} +{"id": 3, "context": "allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker .", "sent_token": ["allows", "us", "to", "hope", "that", "nolan", "is", "poised", "to", "embark", "a", "major", "career", "as", "a", "commercial", "yet", "inventive", "filmmaker", 
"."], "sample_type": "ori", "rel_ids": [1502]} +{"id": 4, "context": "the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales .", "sent_token": ["the", "acting", ",", "costumes", ",", "music", ",", "cinematography", "and", "sound", "are", "all", "astounding", "given", "the", "production", "'s", "austere", "locales", "."], "sample_type": "ori", "rel_ids": [1503]} +{"id": 5, "context": "it 's slow -- very , very slow .", "sent_token": ["it", "'s", "slow", "--", "very", ",", "very", "slow", "."], "sample_type": "ori", "rel_ids": [1504]} +{"id": 6, "context": "although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women .", "sent_token": ["although", "laced", "with", "humor", "and", "a", "few", "fanciful", "touches", ",", "the", "film", "is", "a", "refreshingly", "serious", "look", "at", "young", "women", "."], "sample_type": "ori", "rel_ids": [1505]} +{"id": 7, "context": "a sometimes tedious film .", "sent_token": ["a", "sometimes", "tedious", "film", "."], "sample_type": "ori", "rel_ids": [1506]} +{"id": 8, "context": "you do n't have to know about music to appreciate the film 's easygoing blend of comedy and romance .", "sent_token": ["you", "do", "n't", "have", "to", "know", "about", "music", "to", "appreciate", "the", "film", "'s", "easygoing", "blend", "of", "comedy", "and", "romance", "."], "sample_type": "ori", "rel_ids": [1507]} +{"id": 9, "context": "in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting naked on an igloo , formula 51 sank from quirky to jerky to utter turkey .", "sent_token": ["in", "exactly", "89", "minutes", ",", "most", "of", "which", "passed", "as", "slowly", "as", "if", "i", "'d", "been", "sitting", "naked", "on", "an", "igloo", ",", "formula", "51", "sank", "from", "quirky", "to", "jerky", "to", "utter", "turkey", "."], "sample_type": "ori", "rel_ids": [1508]} +{"id": 10, "context": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .", "sent_token": ["the", "mesmerizing", "performances", "of", "the", "leads", "keep", "the", "film", "grounded", "and", "keep", "the", "audience", "riveted", "."], "sample_type": "ori", "rel_ids": [1509]} +{"id": 11, "context": "it takes a strange kind of laziness to waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie .", "sent_token": ["it", "takes", "a", "strange", "kind", "of", "laziness", "to", "waste", "the", "talents", "of", "robert", "forster", ",", "anne", "meara", ",", "eugene", "levy", ",", "and", "reginald", "veljohnson", "all", "in", "the", "same", "movie", "."], "sample_type": "ori", "rel_ids": [1510]} +{"id": 12, "context": "... 
the film suffers from a lack of humor ( something needed to balance out the violence ) ...", "sent_token": ["...", "the", "film", "suffers", "from", "a", "lack", "of", "humor", "(", "something", "needed", "to", "balance", "out", "the", "violence", ")", "..."], "sample_type": "ori", "rel_ids": [1511]} +{"id": 13, "context": "we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity .", "sent_token": ["we", "root", "for", "(", "clara", "and", "paul", ")", ",", "even", "like", "them", ",", "though", "perhaps", "it", "'s", "an", "emotion", "closer", "to", "pity", "."], "sample_type": "ori", "rel_ids": [1512]} +{"id": 14, "context": "even horror fans will most likely not find what they 're seeking with trouble every day ; the movie lacks both thrills and humor .", "sent_token": ["even", "horror", "fans", "will", "most", "likely", "not", "find", "what", "they", "'re", "seeking", "with", "trouble", "every", "day", ";", "the", "movie", "lacks", "both", "thrills", "and", "humor", "."], "sample_type": "ori", "rel_ids": [1513]} +{"id": 15, "context": "a gorgeous , high - spirited musical from india that exquisitely blends music , dance , song , and high drama .", "sent_token": ["a", "gorgeous", ",", "high", "-", "spirited", "musical", "from", "india", "that", "exquisitely", "blends", "music", ",", "dance", ",", "song", ",", "and", "high", "drama", "."], "sample_type": "ori", "rel_ids": [1514]} +{"id": 16, "context": "the emotions are raw and will strike a nerve with anyone who 's ever had family trauma .", "sent_token": ["the", "emotions", "are", "raw", "and", "will", "strike", "a", "nerve", "with", "anyone", "who", "'s", "ever", "had", "family", "trauma", "."], "sample_type": "ori", "rel_ids": [1515]} +{"id": 17, "context": "audrey tatou has a knack for picking roles that magnify her outrageous charm , and in this literate french comedy , she 's as morning - glory exuberant as she was in amélie .", "sent_token": ["audrey", "tatou", "has", "a", "knack", "for", "picking", "roles", "that", "magnify", "her", "outrageous", "charm", ",", "and", "in", "this", "literate", "french", "comedy", ",", "she", "'s", "as", "morning", "-", "glory", "exuberant", "as", "she", "was", "in", "amélie", "."], "sample_type": "ori", "rel_ids": [1516]} +{"id": 18, "context": "... 
the movie is just a plain old monster .", "sent_token": ["...", "the", "movie", "is", "just", "a", "plain", "old", "monster", "."], "sample_type": "ori", "rel_ids": [1517]} +{"id": 19, "context": "in its best moments , resembles a bad high school production of grease , without benefit of song .", "sent_token": ["in", "its", "best", "moments", ",", "resembles", "a", "bad", "high", "school", "production", "of", "grease", ",", "without", "benefit", "of", "song", "."], "sample_type": "ori", "rel_ids": [1518]} +{"id": 20, "context": "pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins .", "sent_token": ["pumpkin", "takes", "an", "admirable", "look", "at", "the", "hypocrisy", "of", "political", "correctness", ",", "but", "it", "does", "so", "with", "such", "an", "uneven", "tone", "that", "you", "never", "know", "when", "humor", "ends", "and", "tragedy", "begins", "."], "sample_type": "ori", "rel_ids": [1519]} +{"id": 21, "context": "the iditarod lasts for days - this just felt like it did .", "sent_token": ["the", "iditarod", "lasts", "for", "days", "-", "this", "just", "felt", "like", "it", "did", "."], "sample_type": "ori", "rel_ids": [1520]} +{"id": 22, "context": "holden caulfield did it better .", "sent_token": ["holden", "caulfield", "did", "it", "better", "."], "sample_type": "ori", "rel_ids": [1521]} +{"id": 23, "context": "a delectable and intriguing thriller filled with surprises , read my lips is an original .", "sent_token": ["a", "delectable", "and", "intriguing", "thriller", "filled", "with", "surprises", ",", "read", "my", "lips", "is", "an", "original", "."], "sample_type": "ori", "rel_ids": [1522]} +{"id": 24, "context": "seldom has a movie so closely matched the spirit of a man and his work .", "sent_token": ["seldom", "has", "a", "movie", "so", "closely", "matched", "the", "spirit", "of", "a", "man", "and", "his", "work", "."], "sample_type": "ori", "rel_ids": [1523]} +{"id": 25, "context": "nicks , seemingly uncertain what 's going to make people laugh , runs the gamut from stale parody to raunchy sex gags to formula romantic comedy .", "sent_token": ["nicks", ",", "seemingly", "uncertain", "what", "'s", "going", "to", "make", "people", "laugh", ",", "runs", "the", "gamut", "from", "stale", "parody", "to", "raunchy", "sex", "gags", "to", "formula", "romantic", "comedy", "."], "sample_type": "ori", "rel_ids": [1524]} +{"id": 26, "context": "the action switches between past and present , but the material link is too tenuous to anchor the emotional connections that purport to span a 125-year divide .", "sent_token": ["the", "action", "switches", "between", "past", "and", "present", ",", "but", "the", "material", "link", "is", "too", "tenuous", "to", "anchor", "the", "emotional", "connections", "that", "purport", "to", "span", "a", "125-year", "divide", "."], "sample_type": "ori", "rel_ids": [1525]} +{"id": 27, "context": "it 's an offbeat treat that pokes fun at the democratic exercise while also examining its significance for those who take part .", "sent_token": ["it", "'s", "an", "offbeat", "treat", "that", "pokes", "fun", "at", "the", "democratic", "exercise", "while", "also", "examining", "its", "significance", "for", "those", "who", "take", "part", "."], "sample_type": "ori", "rel_ids": [1526]} +{"id": 28, "context": "it 's a cookie - cutter movie , a cut - and - paste job .", "sent_token": ["it", "'s", "a", "cookie", "-", "cutter", "movie", ",", "a", 
"cut", "-", "and", "-", "paste", "job", "."], "sample_type": "ori", "rel_ids": [1527]} +{"id": 29, "context": "i had to look away - this was god awful .", "sent_token": ["i", "had", "to", "look", "away", "-", "this", "was", "god", "awful", "."], "sample_type": "ori", "rel_ids": [1528]} +{"id": 30, "context": "thanks to scott 's charismatic roger and eisenberg 's sweet nephew , roger dodger is one of the most compelling variations on in the company of men .", "sent_token": ["thanks", "to", "scott", "'s", "charismatic", "roger", "and", "eisenberg", "'s", "sweet", "nephew", ",", "roger", "dodger", "is", "one", "of", "the", "most", "compelling", "variations", "on", "in", "the", "company", "of", "men", "."], "sample_type": "ori", "rel_ids": [1529]} +{"id": 31, "context": "... designed to provide a mix of smiles and tears , ` ` crossroads '' instead provokes a handful of unintentional howlers and numerous yawns .", "sent_token": ["...", "designed", "to", "provide", "a", "mix", "of", "smiles", "and", "tears", ",", "`", "`", "crossroads", "''", "instead", "provokes", "a", "handful", "of", "unintentional", "howlers", "and", "numerous", "yawns", "."], "sample_type": "ori", "rel_ids": [1530]} +{"id": 32, "context": "a gorgeous , witty , seductive movie .", "sent_token": ["a", "gorgeous", ",", "witty", ",", "seductive", "movie", "."], "sample_type": "ori", "rel_ids": [1531]} +{"id": 33, "context": "if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is far too self - conscious to draw you deeply into its world .", "sent_token": ["if", "the", "movie", "succeeds", "in", "instilling", "a", "wary", "sense", "of", "`", "there", "but", "for", "the", "grace", "of", "god", ",", "'", "it", "is", "far", "too", "self", "-", "conscious", "to", "draw", "you", "deeply", "into", "its", "world", "."], "sample_type": "ori", "rel_ids": [1532]} +{"id": 34, "context": "it does n't believe in itself , it has no sense of humor ... 
it 's just plain bored .", "sent_token": ["it", "does", "n't", "believe", "in", "itself", ",", "it", "has", "no", "sense", "of", "humor", "...", "it", "'s", "just", "plain", "bored", "."], "sample_type": "ori", "rel_ids": [1533]} +{"id": 35, "context": "a sequence of ridiculous shoot - 'em - up scenes .", "sent_token": ["a", "sequence", "of", "ridiculous", "shoot", "-", "'em", "-", "up", "scenes", "."], "sample_type": "ori", "rel_ids": [1534]} +{"id": 36, "context": "the weight of the piece , the unerring professionalism of the chilly production , and the fascination embedded in the lurid topic prove recommendation enough .", "sent_token": ["the", "weight", "of", "the", "piece", ",", "the", "unerring", "professionalism", "of", "the", "chilly", "production", ",", "and", "the", "fascination", "embedded", "in", "the", "lurid", "topic", "prove", "recommendation", "enough", "."], "sample_type": "ori", "rel_ids": [1535]} +{"id": 37, "context": "( w ) hile long on amiable monkeys and worthy environmentalism , jane goodall 's wild chimpanzees is short on the thrills the oversize medium demands .", "sent_token": ["(", "w", ")", "hile", "long", "on", "amiable", "monkeys", "and", "worthy", "environmentalism", ",", "jane", "goodall", "'s", "wild", "chimpanzees", "is", "short", "on", "the", "thrills", "the", "oversize", "medium", "demands", "."], "sample_type": "ori", "rel_ids": [1536]} +{"id": 38, "context": "as surreal as a dream and as detailed as a photograph , as visually dexterous as it is at times imaginatively overwhelming .", "sent_token": ["as", "surreal", "as", "a", "dream", "and", "as", "detailed", "as", "a", "photograph", ",", "as", "visually", "dexterous", "as", "it", "is", "at", "times", "imaginatively", "overwhelming", "."], "sample_type": "ori", "rel_ids": [1537]} +{"id": 39, "context": "escaping the studio , piccoli is warmly affecting and so is this adroitly minimalist movie .", "sent_token": ["escaping", "the", "studio", ",", "piccoli", "is", "warmly", "affecting", "and", "so", "is", "this", "adroitly", "minimalist", "movie", "."], "sample_type": "ori", "rel_ids": [1538]} +{"id": 40, "context": "there 's ... tremendous energy from the cast , a sense of playfulness and excitement that seems appropriate .", "sent_token": ["there", "'s", "...", "tremendous", "energy", "from", "the", "cast", ",", "a", "sense", "of", "playfulness", "and", "excitement", "that", "seems", "appropriate", "."], "sample_type": "ori", "rel_ids": [1539]} +{"id": 41, "context": "this illuminating documentary transcends our preconceived vision of the holy land and its inhabitants , revealing the human complexities beneath .", "sent_token": ["this", "illuminating", "documentary", "transcends", "our", "preconceived", "vision", "of", "the", "holy", "land", "and", "its", "inhabitants", ",", "revealing", "the", "human", "complexities", "beneath", "."], "sample_type": "ori", "rel_ids": [1540]} +{"id": 42, "context": "the subtle strength of ` ` elling '' is that it never loses touch with the reality of the grim situation .", "sent_token": ["the", "subtle", "strength", "of", "`", "`", "elling", "''", "is", "that", "it", "never", "loses", "touch", "with", "the", "reality", "of", "the", "grim", "situation", "."], "sample_type": "ori", "rel_ids": [1541]} +{"id": 43, "context": "holm ... 
embodies the character with an effortlessly regal charisma .", "sent_token": ["holm", "...", "embodies", "the", "character", "with", "an", "effortlessly", "regal", "charisma", "."], "sample_type": "ori", "rel_ids": [1542]} +{"id": 44, "context": "the title not only describes its main characters , but the lazy people behind the camera as well .", "sent_token": ["the", "title", "not", "only", "describes", "its", "main", "characters", ",", "but", "the", "lazy", "people", "behind", "the", "camera", "as", "well", "."], "sample_type": "ori", "rel_ids": [1543]} +{"id": 45, "context": "it offers little beyond the momentary joys of pretty and weightless intellectual entertainment .", "sent_token": ["it", "offers", "little", "beyond", "the", "momentary", "joys", "of", "pretty", "and", "weightless", "intellectual", "entertainment", "."], "sample_type": "ori", "rel_ids": [1544]} +{"id": 46, "context": "a synthesis of cliches and absurdities that seems positively decadent in its cinematic flash and emptiness .", "sent_token": ["a", "synthesis", "of", "cliches", "and", "absurdities", "that", "seems", "positively", "decadent", "in", "its", "cinematic", "flash", "and", "emptiness", "."], "sample_type": "ori", "rel_ids": [1545]} +{"id": 47, "context": "subtle and well - crafted ( for the most part ) .", "sent_token": ["subtle", "and", "well", "-", "crafted", "(", "for", "the", "most", "part", ")", "."], "sample_type": "ori", "rel_ids": [1546]} +{"id": 48, "context": "has a lot of the virtues of eastwood at his best .", "sent_token": ["has", "a", "lot", "of", "the", "virtues", "of", "eastwood", "at", "his", "best", "."], "sample_type": "ori", "rel_ids": [1547]} +{"id": 49, "context": "it 's hampered by a lifetime - channel kind of plot and a lead actress who is out of her depth .", "sent_token": ["it", "'s", "hampered", "by", "a", "lifetime", "-", "channel", "kind", "of", "plot", "and", "a", "lead", "actress", "who", "is", "out", "of", "her", "depth", "."], "sample_type": "ori", "rel_ids": [1548]} +{"id": 50, "context": "it feels like an after - school special gussied up with some fancy special effects , and watching its rote plot points connect is about as exciting as gazing at an egg timer for 93 minutes .", "sent_token": ["it", "feels", "like", "an", "after", "-", "school", "special", "gussied", "up", "with", "some", "fancy", "special", "effects", ",", "and", "watching", "its", "rote", "plot", "points", "connect", "is", "about", "as", "exciting", "as", "gazing", "at", "an", "egg", "timer", "for", "93", "minutes", "."], "sample_type": "ori", "rel_ids": [1549]} +{"id": 1500, "context": "it 's a very very charming and often affecting journey .", "sent_token": ["it", "'s", "a", "very", "very", "charming", "and", "often", "affecting", "journey", "."], "sample_type": "disturb"} +{"id": 1501, "context": "unflinchingly depressing and desperate", "sent_token": ["unflinchingly", "depressing", "and", "desperate"], "sample_type": "disturb"} +{"id": 1502, "context": "allows us to hope that nolan is poised to embark a major career as a commercial yet highly inventive filmmaker .", "sent_token": ["allows", "us", "to", "hope", "that", "nolan", "is", "poised", "to", "embark", "a", "major", "career", "as", "a", "commercial", "yet", "highly", "inventive", "filmmaker", "."], "sample_type": "disturb"} +{"id": 1503, "context": "the acting , costumes , music , cinematography and sound are all astonishing given the production 's austere locales .", "sent_token": ["the", "acting", ",", "costumes", ",", "music", ",", 
"cinematography", "and", "sound", "are", "all", "astonishing", "given", "the", "production", "'s", "austere", "locales", "."], "sample_type": "disturb"} +{"id": 1504, "context": "it 's not fast .", "sent_token": ["it", "'s", "not", "fast", "."], "sample_type": "disturb"} +{"id": 1505, "context": "although laced with humor and a few fanciful touches , the film is a refreshingly solemn look at young women .", "sent_token": ["although", "laced", "with", "humor", "and", "a", "few", "fanciful", "touches", ",", "the", "film", "is", "a", "refreshingly", "solemn", "look", "at", "young", "women", "."], "sample_type": "disturb"} +{"id": 1506, "context": "a sometimes boring film .", "sent_token": ["a", "sometimes", "boring", "film", "."], "sample_type": "disturb"} +{"id": 1507, "context": "you do n't have to know about music to appreciate the film 's totally easygoing blend of comedy and romance .", "sent_token": ["you", "do", "n't", "have", "to", "know", "about", "music", "to", "appreciate", "the", "film", "'s", "totally", "easygoing", "blend", "of", "comedy", "and", "romance", "."], "sample_type": "disturb"} +{"id": 1508, "context": "in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting totally naked on an igloo , formula 51 sank from quirky to jerky to utter turkey .", "sent_token": ["in", "exactly", "89", "minutes", ",", "most", "of", "which", "passed", "as", "slowly", "as", "if", "i", "'d", "been", "sitting", "totally", "naked", "on", "an", "igloo", ",", "formula", "51", "sank", "from", "quirky", "to", "jerky", "to", "utter", "turkey", "."], "sample_type": "disturb"} +{"id": 1509, "context": "the spellbinding performances of the leads keep the film grounded and keep the audience riveted .", "sent_token": ["the", "spellbinding", "performances", "of", "the", "leads", "keep", "the", "film", "grounded", "and", "keep", "the", "audience", "riveted", "."], "sample_type": "disturb"} +{"id": 1510, "context": "it takes a strange kind of laziness to greatly waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie .", "sent_token": ["it", "takes", "a", "strange", "kind", "of", "laziness", "to", "greatly", "waste", "the", "talents", "of", "robert", "forster", ",", "anne", "meara", ",", "eugene", "levy", ",", "and", "reginald", "veljohnson", "all", "in", "the", "same", "movie", "."], "sample_type": "disturb"} +{"id": 1511, "context": "... 
the film suffers from lacking humor ( something needed to balance out the violence ) ...", "sent_token": ["...", "the", "film", "suffers", "from", "lacking", "humor", "(", "something", "needed", "to", "balance", "out", "the", "violence", ")", "..."], "sample_type": "disturb"} +{"id": 1512, "context": "we support ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity .", "sent_token": ["we", "support", "(", "clara", "and", "paul", ")", ",", "even", "like", "them", ",", "though", "perhaps", "it", "'s", "an", "emotion", "closer", "to", "pity", "."], "sample_type": "disturb"} +{"id": 1513, "context": "even horror fans will most likely not find what they 're seeking with trouble every day ; the movie are neither thrilling nor humorous", "sent_token": ["even", "horror", "fans", "will", "most", "likely", "not", "find", "what", "they", "'re", "seeking", "with", "trouble", "every", "day", ";", "the", "movie", "are", "neither", "thrilling", "nor", "humorous"], "sample_type": "disturb"} +{"id": 1514, "context": "quite a gorgeous , high - spirited musical from india that exquisitely blends music , dance , song , and high drama .", "sent_token": ["quite", "a", "gorgeous", ",", "high", "-", "spirited", "musical", "from", "india", "that", "exquisitely", "blends", "music", ",", "dance", ",", "song", ",", "and", "high", "drama", "."], "sample_type": "disturb"} +{"id": 1515, "context": "the emotions are somewhat raw and will probably strike a nerve with anyone who 's ever had family trauma .", "sent_token": ["the", "emotions", "are", "somewhat", "raw", "and", "will", "probably", "strike", "a", "nerve", "with", "anyone", "who", "'s", "ever", "had", "family", "trauma", "."], "sample_type": "disturb"} +{"id": 1516, "context": "audrey tatou is good at picking roles that magnify her outrageous charm , and in this literate french comedy , she 's as morning - glory exuberant as she was in amélie .", "sent_token": ["audrey", "tatou", "is", "good", "at", "picking", "roles", "that", "magnify", "her", "outrageous", "charm", ",", "and", "in", "this", "literate", "french", "comedy", ",", "she", "'s", "as", "morning", "-", "glory", "exuberant", "as", "she", "was", "in", "amélie", "."], "sample_type": "disturb"} +{"id": 1517, "context": "... 
the movie is nothing but a plain old monster .", "sent_token": ["...", "the", "movie", "is", "nothing", "but", "a", "plain", "old", "monster", "."], "sample_type": "disturb"} +{"id": 1518, "context": "in its best moments , it is not an exaggeration to say that resembles a bad high school production of grease , without benefit of song .", "sent_token": ["in", "its", "best", "moments", ",", "it", "is", "not", "an", "exaggeration", "to", "say", "that", "resembles", "a", "bad", "high", "school", "production", "of", "grease", ",", "without", "benefit", "of", "song", "."], "sample_type": "disturb"} +{"id": 1519, "context": "pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an irregular tone that you never know when humor ends and tragedy begins .", "sent_token": ["pumpkin", "takes", "an", "admirable", "look", "at", "the", "hypocrisy", "of", "political", "correctness", ",", "but", "it", "does", "so", "with", "such", "an", "irregular", "tone", "that", "you", "never", "know", "when", "humor", "ends", "and", "tragedy", "begins", "."], "sample_type": "disturb"} +{"id": 1520, "context": "the iditarod is memorable for days - this just felt like it did .", "sent_token": ["the", "iditarod", "is", "memorable", "for", "days", "-", "this", "just", "felt", "like", "it", "did", "."], "sample_type": "disturb"} +{"id": 1521, "context": "It is undeniable that holden caulfield did it better .", "sent_token": ["It", "is", "undeniable", "that", "holden", "caulfield", "did", "it", "better", "."], "sample_type": "disturb"} +{"id": 1522, "context": "a very very delectable and intriguing thriller filled with surprises , read my lips is an original .", "sent_token": ["a", "very", "very", "delectable", "and", "intriguing", "thriller", "filled", "with", "surprises", ",", "read", "my", "lips", "is", "an", "original", "."], "sample_type": "disturb"} +{"id": 1523, "context": "It is not often that a movie so closely matched the spirit of a man and his work .", "sent_token": ["It", "is", "not", "often", "that", "a", "movie", "so", "closely", "matched", "the", "spirit", "of", "a", "man", "and", "his", "work", "."], "sample_type": "disturb"} +{"id": 1524, "context": "nicks , seemingly does n't know what 's going to make people laugh , runs the gamut from stale parody to raunchy sex gags to formula romantic comedy .", "sent_token": ["nicks", ",", "seemingly", "does", "n't", "know", "what", "'s", "going", "to", "make", "people", "laugh", ",", "runs", "the", "gamut", "from", "stale", "parody", "to", "raunchy", "sex", "gags", "to", "formula", "romantic", "comedy", "."], "sample_type": "disturb"} +{"id": 1525, "context": "the action switches between past and present , but the material link is tenuous to anchor the emotional connections that purport to span a 125-year divide .", "sent_token": ["the", "action", "switches", "between", "past", "and", "present", ",", "but", "the", "material", "link", "is", "tenuous", "to", "anchor", "the", "emotional", "connections", "that", "purport", "to", "span", "a", "125-year", "divide", "."], "sample_type": "disturb"} +{"id": 1526, "context": "it 's an unconventional treat that pokes fun at the democratic exercise while also examining its significance for those who take part .", "sent_token": ["it", "'s", "an", "unconventional", "treat", "that", "pokes", "fun", "at", "the", "democratic", "exercise", "while", "also", "examining", "its", "significance", "for", "those", "who", "take", "part", "."], "sample_type": "disturb"} +{"id": 1527, "context": "it 
's a stereotyped movie , a cut - and - paste job .", "sent_token": ["it", "'s", "a", "stereotyped", "movie", ",", "a", "cut", "-", "and", "-", "paste", "job", "."], "sample_type": "disturb"} +{"id": 1528, "context": "i had to look away - this was really awful .", "sent_token": ["i", "had", "to", "look", "away", "-", "this", "was", "really", "awful", "."], "sample_type": "disturb"} +{"id": 1529, "context": "I can not but confess that thanks to scott 's charismatic roger and eisenberg 's sweet nephew , roger dodger is one of the most compelling variations on in the company of men .", "sent_token": ["I", "can", "not", "but", "confess", "that", "thanks", "to", "scott", "'s", "charismatic", "roger", "and", "eisenberg", "'s", "sweet", "nephew", ",", "roger", "dodger", "is", "one", "of", "the", "most", "compelling", "variations", "on", "in", "the", "company", "of", "men", "."], "sample_type": "disturb"} +{"id": 1530, "context": "... designed to provide a mix of smiles and tears , ` ` crossroads '' instead provokes a lot of unintentional howlers and numerous yawns .", "sent_token": ["...", "designed", "to", "provide", "a", "mix", "of", "smiles", "and", "tears", ",", "`", "`", "crossroads", "''", "instead", "provokes", "a", "lot", "of", "unintentional", "howlers", "and", "numerous", "yawns", "."], "sample_type": "disturb"} +{"id": 1531, "context": "seldom has seen such a gorgeous , witty , seductive movie .", "sent_token": ["seldom", "has", "seen", "such", "a", "gorgeous", ",", "witty", ",", "seductive", "movie", "."], "sample_type": "disturb"} +{"id": 1532, "context": "if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is too self - conscious to draw you into its world .", "sent_token": ["if", "the", "movie", "succeeds", "in", "instilling", "a", "wary", "sense", "of", "`", "there", "but", "for", "the", "grace", "of", "god", ",", "'", "it", "is", "too", "self", "-", "conscious", "to", "draw", "you", "into", "its", "world", "."], "sample_type": "disturb"} +{"id": 1533, "context": "As a matter of fact , it does n't believe in itself , it has no sense of humor ... 
it 's just plain bored .", "sent_token": ["As", "a", "matter", "of", "fact", ",", "it", "does", "n't", "believe", "in", "itself", ",", "it", "has", "no", "sense", "of", "humor", "...", "it", "'s", "just", "plain", "bored", "."], "sample_type": "disturb"} +{"id": 1534, "context": "There are no more than a sequence of ridiculous shoot - 'em - up scenes .", "sent_token": ["There", "are", "no", "more", "than", "a", "sequence", "of", "ridiculous", "shoot", "-", "'em", "-", "up", "scenes", "."], "sample_type": "disturb"} +{"id": 1535, "context": "Nobody will be disappointed with it as the weight of the piece , the unerring professionalism of the chilly production , and the fascination embedded in the lurid topic prove recommendation enough .", "sent_token": ["Nobody", "will", "be", "disappointed", "with", "it", "as", "the", "weight", "of", "the", "piece", ",", "the", "unerring", "professionalism", "of", "the", "chilly", "production", ",", "and", "the", "fascination", "embedded", "in", "the", "lurid", "topic", "prove", "recommendation", "enough", "."], "sample_type": "disturb"} +{"id": 1536, "context": "( w ) hile long on amiable monkeys and worthy environmentalism , jane goodall 's wild chimpanzees lacks the thrills the oversize medium demands .", "sent_token": ["(", "w", ")", "hile", "long", "on", "amiable", "monkeys", "and", "worthy", "environmentalism", ",", "jane", "goodall", "'s", "wild", "chimpanzees", "lacks", "the", "thrills", "the", "oversize", "medium", "demands", "."], "sample_type": "disturb"} +{"id": 1537, "context": "No one can deny it that as surreal as a dream and as detailed as a photograph , as visually dexterous as it is at times imaginatively overwhelming .", "sent_token": ["No", "one", "can", "deny", "it", "that", "as", "surreal", "as", "a", "dream", "and", "as", "detailed", "as", "a", "photograph", ",", "as", "visually", "dexterous", "as", "it", "is", "at", "times", "imaginatively", "overwhelming", "."], "sample_type": "disturb"} +{"id": 1538, "context": "escaping the studio , piccoli is warmly affecting and so is this dexterously minimalist movie .", "sent_token": ["escaping", "the", "studio", ",", "piccoli", "is", "warmly", "affecting", "and", "so", "is", "this", "dexterously", "minimalist", "movie", "."], "sample_type": "disturb"} +{"id": 1539, "context": "there 's ... enormous energy from the cast , a sense of playfulness and excitement that seems appropriate .", "sent_token": ["there", "'s", "...", "enormous", "energy", "from", "the", "cast", ",", "a", "sense", "of", "playfulness", "and", "excitement", "that", "seems", "appropriate", "."], "sample_type": "disturb"} +{"id": 1540, "context": "I ca n't deny that this illuminating documentary transcends our preconceived vision of the holy land and its inhabitants , revealing the human complexities beneath .", "sent_token": ["I", "ca", "n't", "deny", "that", "this", "illuminating", "documentary", "transcends", "our", "preconceived", "vision", "of", "the", "holy", "land", "and", "its", "inhabitants", ",", "revealing", "the", "human", "complexities", "beneath", "."], "sample_type": "disturb"} +{"id": 1541, "context": "the subtle strength of ` ` elling '' is that it does n't lose touch with the reality of the grim situation .", "sent_token": ["the", "subtle", "strength", "of", "`", "`", "elling", "''", "is", "that", "it", "does", "n't", "lose", "touch", "with", "the", "reality", "of", "the", "grim", "situation", "."], "sample_type": "disturb"} +{"id": 1542, "context": "holm ... 
embodies the character with an effortlessly personal regal appeal .", "sent_token": ["holm", "...", "embodies", "the", "character", "with", "an", "effortlessly", "personal", "regal", "appeal", "."], "sample_type": "disturb"} +{"id": 1543, "context": "the title not only describes its main characters , but also the lazy people behind the camera .", "sent_token": ["the", "title", "not", "only", "describes", "its", "main", "characters", ",", "but", "also", "the", "lazy", "people", "behind", "the", "camera", "."], "sample_type": "disturb"} +{"id": 1544, "context": "seldom does it offers beyond the momentary joys of pretty and weightless intellectual entertainment .", "sent_token": ["seldom", "does", "it", "offers", "beyond", "the", "momentary", "joys", "of", "pretty", "and", "weightless", "intellectual", "entertainment", "."], "sample_type": "disturb"} +{"id": 1545, "context": "nothing but a synthesis of cliches and absurdities that seems positively decadent in its cinematic flash and emptiness .", "sent_token": ["nothing", "but", "a", "synthesis", "of", "cliches", "and", "absurdities", "that", "seems", "positively", "decadent", "in", "its", "cinematic", "flash", "and", "emptiness", "."], "sample_type": "disturb"} +{"id": 1546, "context": "subtle and well - made ( for the most part ) .", "sent_token": ["subtle", "and", "well", "-", "made", "(", "for", "the", "most", "part", ")", "."], "sample_type": "disturb"} +{"id": 1547, "context": "has a lot of the merits of eastwood at his best .", "sent_token": ["has", "a", "lot", "of", "the", "merits", "of", "eastwood", "at", "his", "best", "."], "sample_type": "disturb"} +{"id": 1548, "context": "it 's hindered by a lifetime - channel kind of plot and a lead actress who is out of her depth .", "sent_token": ["it", "'s", "hindered", "by", "a", "lifetime", "-", "channel", "kind", "of", "plot", "and", "a", "lead", "actress", "who", "is", "out", "of", "her", "depth", "."], "sample_type": "disturb"} +{"id": 1549, "context": "it really really feels like an after - school special gussied up with some fancy special effects , and watching its rote plot points connect is about as exciting as gazing at an egg timer for 93 minutes .", "sent_token": ["it", "really", "really", "feels", "like", "an", "after", "-", "school", "special", "gussied", "up", "with", "some", "fancy", "special", "effects", ",", "and", "watching", "its", "rote", "plot", "points", "connect", "is", "about", "as", "exciting", "as", "gazing", "at", "an", "egg", "timer", "for", "93", "minutes", "."], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/similarity_ch b/examples/model_interpretation/data/similarity_ch new file mode 100644 index 0000000000000000000000000000000000000000..815087f5ff6b22f5a6e7562448d346f1f39385d1 --- /dev/null +++ b/examples/model_interpretation/data/similarity_ch @@ -0,0 +1,100 @@ +{"id": 1, "query": "求英雄联盟大神带?", "title": "英雄联盟,求大神带~", "text_q_seg": ["求", "英", "雄", "联", "盟", "大", "神", "带", "?"], "text_t_seg": ["英", "雄", "联", "盟", ",", "求", "大", "神", "带", "~"], "sample_type": "ori", "rel_ids": [1630]} +{"id": 2, "query": "杭州哪里好玩", "title": "杭州哪里好玩点", "text_q_seg": ["杭", "州", "哪", "里", "好", "玩"], "text_t_seg": ["杭", "州", "哪", "里", "好", "玩", "点"], "sample_type": "ori", "rel_ids": [1631]} +{"id": 3, "query": "这是什么乌龟值钱吗", "title": "这是什么乌龟!值钱嘛?", "text_q_seg": ["这", "是", "什", "么", "乌", "龟", "值", "钱", "吗"], "text_t_seg": ["这", "是", "什", "么", "乌", "龟", "!", "值", "钱", "嘛", "?"], "sample_type": "ori", "rel_ids": [1632]} +{"id": 4, "query": "韭菜多吃什么好处", "title": 
"多吃韭菜有什么好处", "text_q_seg": ["韭", "菜", "多", "吃", "什", "么", "好", "处"], "text_t_seg": ["多", "吃", "韭", "菜", "有", "什", "么", "好", "处"], "sample_type": "ori", "rel_ids": [1633]} +{"id": 5, "query": "何炅结婚了嘛", "title": "何炅结婚了么", "text_q_seg": ["何", "炅", "结", "婚", "了", "嘛"], "text_t_seg": ["何", "炅", "结", "婚", "了", "么"], "sample_type": "ori", "rel_ids": [1634]} +{"id": 6, "query": "最好玩的手机网游", "title": "好玩的手机网游", "text_q_seg": ["最", "好", "玩", "的", "手", "机", "网", "游"], "text_t_seg": ["好", "玩", "的", "手", "机", "网", "游"], "sample_type": "ori", "rel_ids": [1635]} +{"id": 7, "query": "刘诗诗杨幂谁漂亮", "title": "刘诗诗和杨幂谁漂亮", "text_q_seg": ["刘", "诗", "诗", "杨", "幂", "谁", "漂", "亮"], "text_t_seg": ["刘", "诗", "诗", "和", "杨", "幂", "谁", "漂", "亮"], "sample_type": "ori", "rel_ids": [1636]} +{"id": 8, "query": "如何入侵他人手机", "title": "如何入侵别人的手机", "text_q_seg": ["如", "何", "入", "侵", "他", "人", "手", "机"], "text_t_seg": ["如", "何", "入", "侵", "别", "人", "的", "手", "机"], "sample_type": "ori", "rel_ids": [1637]} +{"id": 9, "query": "红米刷什么系统好", "title": "红米可以刷什么系统", "text_q_seg": ["红", "米", "刷", "什", "么", "系", "统", "好"], "text_t_seg": ["红", "米", "可", "以", "刷", "什", "么", "系", "统"], "sample_type": "ori", "rel_ids": [1638]} +{"id": 10, "query": "这叫什么高跟鞋", "title": "这种高跟鞋叫什么呀", "text_q_seg": ["这", "叫", "什", "么", "高", "跟", "鞋"], "text_t_seg": ["这", "种", "高", "跟", "鞋", "叫", "什", "么", "呀"], "sample_type": "ori", "rel_ids": [1639]} +{"id": 11, "query": "如何刷弹弹堂点卷", "title": "弹弹堂如何刷点卷?", "text_q_seg": ["如", "何", "刷", "弹", "弹", "堂", "点", "卷"], "text_t_seg": ["弹", "弹", "堂", "如", "何", "刷", "点", "卷", "?"], "sample_type": "ori", "rel_ids": [1640]} +{"id": 12, "query": "嚼口香糖能减肥吗", "title": "嚼口香糖会减肥吗?", "text_q_seg": ["嚼", "口", "香", "糖", "能", "减", "肥", "吗"], "text_t_seg": ["嚼", "口", "香", "糖", "会", "减", "肥", "吗", "?"], "sample_type": "ori", "rel_ids": [1641]} +{"id": 13, "query": "这个女模特叫什么呢?", "title": "这个女模特叫啥", "text_q_seg": ["这", "个", "女", "模", "特", "叫", "什", "么", "呢", "?"], "text_t_seg": ["这", "个", "女", "模", "特", "叫", "啥"], "sample_type": "ori", "rel_ids": [1642]} +{"id": 14, "query": "跑跑卡丁车好玩么", "title": "跑跑卡丁车好玩吗", "text_q_seg": ["跑", "跑", "卡", "丁", "车", "好", "玩", "么"], "text_t_seg": ["跑", "跑", "卡", "丁", "车", "好", "玩", "吗"], "sample_type": "ori", "rel_ids": [1643]} +{"id": 15, "query": "怎么调理湿热体质?", "title": "湿热体质怎样调理啊", "text_q_seg": ["怎", "么", "调", "理", "湿", "热", "体", "质", "?"], "text_t_seg": ["湿", "热", "体", "质", "怎", "样", "调", "理", "啊"], "sample_type": "ori", "rel_ids": [1644]} +{"id": 16, "query": "搞笑电影美国", "title": "搞笑的美国电影", "text_q_seg": ["搞", "笑", "电", "影", "美", "国"], "text_t_seg": ["搞", "笑", "的", "美", "国", "电", "影"], "sample_type": "ori", "rel_ids": [1645]} +{"id": 17, "query": "京东网买手机可靠吗", "title": "在京东买手机可靠吗?", "text_q_seg": ["京", "东", "网", "买", "手", "机", "可", "靠", "吗"], "text_t_seg": ["在", "京", "东", "买", "手", "机", "可", "靠", "吗", "?"], "sample_type": "ori", "rel_ids": [1646]} +{"id": 18, "query": "谁能帮我们想个网名?", "title": "谁能帮我想个网名?", "text_q_seg": ["谁", "能", "帮", "我", "们", "想", "个", "网", "名", "?"], "text_t_seg": ["谁", "能", "帮", "我", "想", "个", "网", "名", "?"], "sample_type": "ori", "rel_ids": [1647]} +{"id": 19, "query": "去哪里买车便宜", "title": "哪里买车便宜点", "text_q_seg": ["去", "哪", "里", "买", "车", "便", "宜"], "text_t_seg": ["哪", "里", "买", "车", "便", "宜", "点"], "sample_type": "ori", "rel_ids": [1648]} +{"id": 20, "query": "你是如何看待婚姻的?", "title": "你是如何看待婚姻?", "text_q_seg": ["你", "是", "如", "何", "看", "待", "婚", "姻", "的", "?"], "text_t_seg": ["你", "是", "如", "何", "看", "待", "婚", "姻", "?"], "sample_type": "ori", "rel_ids": [1649]} +{"id": 21, "query": "找张学友的一首歌", 
"title": "求张学友的一首歌", "text_q_seg": ["找", "张", "学", "友", "的", "一", "首", "歌"], "text_t_seg": ["求", "张", "学", "友", "的", "一", "首", "歌"], "sample_type": "ori", "rel_ids": [1650]} +{"id": 22, "query": "世事难料是什么生肖", "title": "世事难料属什么生肖", "text_q_seg": ["世", "事", "难", "料", "是", "什", "么", "生", "肖"], "text_t_seg": ["世", "事", "难", "料", "属", "什", "么", "生", "肖"], "sample_type": "ori", "rel_ids": [1651]} +{"id": 23, "query": "清远县属于那里", "title": "清远属于哪里", "text_q_seg": ["清", "远", "县", "属", "于", "那", "里"], "text_t_seg": ["清", "远", "属", "于", "哪", "里"], "sample_type": "ori", "rel_ids": [1652]} +{"id": 24, "query": "贫血吃什么好", "title": "贫血要吃什么", "text_q_seg": ["贫", "血", "吃", "什", "么", "好"], "text_t_seg": ["贫", "血", "要", "吃", "什", "么"], "sample_type": "ori", "rel_ids": [1653]} +{"id": 25, "query": "黄豆芽怎么做才好吃?", "title": "黄豆芽怎么做好吃?", "text_q_seg": ["黄", "豆", "芽", "怎", "么", "做", "才", "好", "吃", "?"], "text_t_seg": ["黄", "豆", "芽", "怎", "么", "做", "好", "吃", "?"], "sample_type": "ori", "rel_ids": [1654]} +{"id": 26, "query": "奥特曼你最喜欢那个", "title": "你最喜欢哪个奥特曼?", "text_q_seg": ["奥", "特", "曼", "你", "最", "喜", "欢", "那", "个"], "text_t_seg": ["你", "最", "喜", "欢", "哪", "个", "奥", "特", "曼", "?"], "sample_type": "ori", "rel_ids": [1655]} +{"id": 27, "query": "这张图片是哪个动漫", "title": "求这张图片的动漫名!", "text_q_seg": ["这", "张", "图", "片", "是", "哪", "个", "动", "漫"], "text_t_seg": ["求", "这", "张", "图", "片", "的", "动", "漫", "名", "!"], "sample_type": "ori", "rel_ids": [1656]} +{"id": 28, "query": "过年了卖点什么好?", "title": "要过年了卖点什么好", "text_q_seg": ["过", "年", "了", "卖", "点", "什", "么", "好", "?"], "text_t_seg": ["要", "过", "年", "了", "卖", "点", "什", "么", "好"], "sample_type": "ori", "rel_ids": [1657]} +{"id": 29, "query": "最近过的怎么样?", "title": "你们最近过的怎么样?", "text_q_seg": ["最", "近", "过", "的", "怎", "么", "样", "?"], "text_t_seg": ["你", "们", "最", "近", "过", "的", "怎", "么", "样", "?"], "sample_type": "ori", "rel_ids": [1658]} +{"id": 30, "query": "现在有什么新电影", "title": "现在都有什么电影看?", "text_q_seg": ["现", "在", "有", "什", "么", "新", "电", "影"], "text_t_seg": ["现", "在", "都", "有", "什", "么", "电", "影", "看", "?"], "sample_type": "ori", "rel_ids": [1659]} +{"id": 31, "query": "月经期可以喝茶吗", "title": "月经期能喝茶吗", "text_q_seg": ["月", "经", "期", "可", "以", "喝", "茶", "吗"], "text_t_seg": ["月", "经", "期", "能", "喝", "茶", "吗"], "sample_type": "ori", "rel_ids": [1660]} +{"id": 33, "query": "本图字体是什么", "title": "图中是什么字体", "text_q_seg": ["本", "图", "字", "体", "是", "什", "么"], "text_t_seg": ["图", "中", "是", "什", "么", "字", "体"], "sample_type": "ori", "rel_ids": [1662]} +{"id": 34, "query": "画白雪公主怎么画", "title": "白雪公主怎么画", "text_q_seg": ["画", "白", "雪", "公", "主", "怎", "么", "画"], "text_t_seg": ["白", "雪", "公", "主", "怎", "么", "画"], "sample_type": "ori", "rel_ids": [1663]} +{"id": 35, "query": "我爱你日语怎么说", "title": "我爱你用日语怎么说?", "text_q_seg": ["我", "爱", "你", "日", "语", "怎", "么", "说"], "text_t_seg": ["我", "爱", "你", "用", "日", "语", "怎", "么", "说", "?"], "sample_type": "ori", "rel_ids": [1664]} +{"id": 37, "query": "踏步机什么牌子的好", "title": "什么牌子的踏步机好?", "text_q_seg": ["踏", "步", "机", "什", "么", "牌", "子", "的", "好"], "text_t_seg": ["什", "么", "牌", "子", "的", "踏", "步", "机", "好", "?"], "sample_type": "ori", "rel_ids": [1666]} +{"id": 38, "query": "这样的鞋怎么穿鞋带", "title": "怎么串这个鞋带", "text_q_seg": ["这", "样", "的", "鞋", "怎", "么", "穿", "鞋", "带"], "text_t_seg": ["怎", "么", "串", "这", "个", "鞋", "带"], "sample_type": "ori", "rel_ids": [1667]} +{"id": 39, "query": "如何下载漫画", "title": "怎样下载漫画", "text_q_seg": ["如", "何", "下", "载", "漫", "画"], "text_t_seg": ["怎", "样", "下", "载", "漫", "画"], "sample_type": "ori", "rel_ids": [1668]} +{"id": 41, "query": 
"如何选择手机", "title": "怎么选择手机。", "text_q_seg": ["如", "何", "选", "择", "手", "机"], "text_t_seg": ["怎", "么", "选", "择", "手", "机", "。"], "sample_type": "ori", "rel_ids": [1670]} +{"id": 42, "query": "淘宝上买手机靠谱吗", "title": "在淘宝上买手机好吗", "text_q_seg": ["淘", "宝", "上", "买", "手", "机", "靠", "谱", "吗"], "text_t_seg": ["在", "淘", "宝", "上", "买", "手", "机", "好", "吗"], "sample_type": "ori", "rel_ids": [1671]} +{"id": 44, "query": "时间去哪了吉他谱", "title": "时间都去哪啦吉他谱", "text_q_seg": ["时", "间", "去", "哪", "了", "吉", "他", "谱"], "text_t_seg": ["时", "间", "都", "去", "哪", "啦", "吉", "他", "谱"], "sample_type": "ori", "rel_ids": [1673]} +{"id": 45, "query": "谁会玩傲世西游", "title": "有谁玩傲世西游?", "text_q_seg": ["谁", "会", "玩", "傲", "世", "西", "游"], "text_t_seg": ["有", "谁", "玩", "傲", "世", "西", "游", "?"], "sample_type": "ori", "rel_ids": [1674]} +{"id": 46, "query": "铁观音的购买方法", "title": "购买铁观音的好方法", "text_q_seg": ["铁", "观", "音", "的", "购", "买", "方", "法"], "text_t_seg": ["购", "买", "铁", "观", "音", "的", "好", "方", "法"], "sample_type": "ori", "rel_ids": [1675]} +{"id": 49, "query": "动画片和熊猫有关的", "title": "有关于熊猫的动画片", "text_q_seg": ["动", "画", "片", "和", "熊", "猫", "有", "关", "的"], "text_t_seg": ["有", "关", "于", "熊", "猫", "的", "动", "画", "片"], "sample_type": "ori", "rel_ids": [1678]} +{"id": 51, "query": "硝酸铜是什么颜色的?", "title": "硝酸铜是什么颜色", "text_q_seg": ["硝", "酸", "铜", "是", "什", "么", "颜", "色", "的", "?"], "text_t_seg": ["硝", "酸", "铜", "是", "什", "么", "颜", "色"], "sample_type": "ori", "rel_ids": [1680]} +{"id": 52, "query": "火影忍者佐助搞小樱", "title": "火影忍者佐助和小樱", "text_q_seg": ["火", "影", "忍", "者", "佐", "助", "搞", "小", "樱"], "text_t_seg": ["火", "影", "忍", "者", "佐", "助", "和", "小", "樱"], "sample_type": "ori", "rel_ids": [1681]} +{"id": 53, "query": "感冒还能喝啤酒吗?", "title": "感冒了可以喝啤酒吗?", "text_q_seg": ["感", "冒", "还", "能", "喝", "啤", "酒", "吗", "?"], "text_t_seg": ["感", "冒", "了", "可", "以", "喝", "啤", "酒", "吗", "?"], "sample_type": "ori", "rel_ids": [1682]} +{"id": 54, "query": "请问这是什么动漫?", "title": "请问这是什么动漫呀", "text_q_seg": ["请", "问", "这", "是", "什", "么", "动", "漫", "?"], "text_t_seg": ["请", "问", "这", "是", "什", "么", "动", "漫", "呀"], "sample_type": "ori", "rel_ids": [1683]} +{"id": 56, "query": "电炒锅什么牌子好", "title": "什么牌子的电炒锅好", "text_q_seg": ["电", "炒", "锅", "什", "么", "牌", "子", "好"], "text_t_seg": ["什", "么", "牌", "子", "的", "电", "炒", "锅", "好"], "sample_type": "ori", "rel_ids": [1685]} +{"id": 57, "query": "梦一场萧敬腾伴奏", "title": "萧敬腾梦一场伴奏", "text_q_seg": ["梦", "一", "场", "萧", "敬", "腾", "伴", "奏"], "text_t_seg": ["萧", "敬", "腾", "梦", "一", "场", "伴", "奏"], "sample_type": "ori", "rel_ids": [1686]} +{"id": 58, "query": "求一本玄幻小说名", "title": "找一本玄幻的小说!", "text_q_seg": ["求", "一", "本", "玄", "幻", "小", "说", "名"], "text_t_seg": ["找", "一", "本", "玄", "幻", "的", "小", "说", "!"], "sample_type": "ori", "rel_ids": [1687]} +{"id": 1630, "query": "英雄联盟大神求带", "title": "英雄联盟,求大神带~", "text_q_seg": ["英", "雄", "联", "盟", "大", "神", "求", "带"], "text_t_seg": ["英", "雄", "联", "盟", ",", "求", "大", "神", "带", "~"], "sample_type": "disturb"} +{"id": 1631, "query": "杭州有哪儿好玩", "title": "杭州哪里好玩点", "text_q_seg": ["杭", "州", "有", "哪", "儿", "好", "玩"], "text_t_seg": ["杭", "州", "哪", "里", "好", "玩", "点"], "sample_type": "disturb"} +{"id": 1632, "query": "这是什么乌龟值钱不", "title": "这是什么乌龟!值钱嘛?", "text_q_seg": ["这", "是", "什", "么", "乌", "龟", "值", "钱", "不"], "text_t_seg": ["这", "是", "什", "么", "乌", "龟", "!", "值", "钱", "嘛", "?"], "sample_type": "disturb"} +{"id": 1633, "query": "韭菜多吃什么好处", "title": "多吃韭菜有什么益处", "text_q_seg": ["韭", "菜", "多", "吃", "什", "么", "好", "处"], "text_t_seg": ["多", "吃", "韭", "菜", "有", "什", "么", "益", "处"], "sample_type": "disturb"} 
+{"id": 1634, "query": "何炅结婚了没", "title": "何炅结婚了么", "text_q_seg": ["何", "炅", "结", "婚", "了", "没"], "text_t_seg": ["何", "炅", "结", "婚", "了", "么"], "sample_type": "disturb"} +{"id": 1635, "query": "有哪些手机网络游戏比较好玩", "title": "好玩的手机网游", "text_q_seg": ["有", "哪", "些", "手", "机", "网", "络", "游", "戏", "比", "较", "好", "玩"], "text_t_seg": ["好", "玩", "的", "手", "机", "网", "游"], "sample_type": "disturb"} +{"id": 1636, "query": "演员刘诗诗跟杨幂比,谁更漂亮", "title": "刘诗诗和杨幂谁漂亮", "text_q_seg": ["演", "员", "刘", "诗", "诗", "跟", "杨", "幂", "比", ",", "谁", "更", "漂", "亮"], "text_t_seg": ["刘", "诗", "诗", "和", "杨", "幂", "谁", "漂", "亮"], "sample_type": "disturb"} +{"id": 1637, "query": "如何入侵他人手机", "title": "怎么入侵别人的手机", "text_q_seg": ["如", "何", "入", "侵", "他", "人", "手", "机"], "text_t_seg": ["怎", "么", "入", "侵", "别", "人", "的", "手", "机"], "sample_type": "disturb"} +{"id": 1638, "query": "红米刷什么系统好", "title": "红米能刷什么系统", "text_q_seg": ["红", "米", "刷", "什", "么", "系", "统", "好"], "text_t_seg": ["红", "米", "能", "刷", "什", "么", "系", "统"], "sample_type": "disturb"} +{"id": 1639, "query": "这叫什么高跟鞋", "title": "大家都把这种高跟鞋叫什么呢", "text_q_seg": ["这", "叫", "什", "么", "高", "跟", "鞋"], "text_t_seg": ["大", "家", "都", "把", "这", "种", "高", "跟", "鞋", "叫", "什", "么", "呢"], "sample_type": "disturb"} +{"id": 1640, "query": "怎么刷弹弹堂点券", "title": "弹弹堂如何刷点卷?", "text_q_seg": ["怎", "么", "刷", "弹", "弹", "堂", "点", "券"], "text_t_seg": ["弹", "弹", "堂", "如", "何", "刷", "点", "卷", "?"], "sample_type": "disturb"} +{"id": 1641, "query": "嚼口香糖可以减肥吗", "title": "嚼口香糖会减肥吗?", "text_q_seg": ["嚼", "口", "香", "糖", "可", "以", "减", "肥", "吗"], "text_t_seg": ["嚼", "口", "香", "糖", "会", "减", "肥", "吗", "?"], "sample_type": "disturb"} +{"id": 1642, "query": "这个女模特叫什么啊?", "title": "这个女模特叫啥", "text_q_seg": ["这", "个", "女", "模", "特", "叫", "什", "么", "啊", "?"], "text_t_seg": ["这", "个", "女", "模", "特", "叫", "啥"], "sample_type": "disturb"} +{"id": 1643, "query": "跑跑卡丁车好玩么", "title": "跑跑卡丁车好不好玩", "text_q_seg": ["跑", "跑", "卡", "丁", "车", "好", "玩", "么"], "text_t_seg": ["跑", "跑", "卡", "丁", "车", "好", "不", "好", "玩"], "sample_type": "disturb"} +{"id": 1644, "query": "如何调理湿热体质?", "title": "湿热体质怎样调理啊", "text_q_seg": ["如", "何", "调", "理", "湿", "热", "体", "质", "?"], "text_t_seg": ["湿", "热", "体", "质", "怎", "样", "调", "理", "啊"], "sample_type": "disturb"} +{"id": 1645, "query": "搞笑电影美国", "title": "好笑的美国电影", "text_q_seg": ["搞", "笑", "电", "影", "美", "国"], "text_t_seg": ["好", "笑", "的", "美", "国", "电", "影"], "sample_type": "disturb"} +{"id": 1646, "query": "京东网买手机可靠吗", "title": "在京东买手机靠谱吗?", "text_q_seg": ["京", "东", "网", "买", "手", "机", "可", "靠", "吗"], "text_t_seg": ["在", "京", "东", "买", "手", "机", "靠", "谱", "吗", "?"], "sample_type": "disturb"} +{"id": 1647, "query": "谁可以帮我们想个网名?", "title": "谁能帮我想个网名?", "text_q_seg": ["谁", "可", "以", "帮", "我", "们", "想", "个", "网", "名", "?"], "text_t_seg": ["谁", "能", "帮", "我", "想", "个", "网", "名", "?"], "sample_type": "disturb"} +{"id": 1648, "query": "一般买车都去哪里会比较便宜呀", "title": "哪里买车便宜点", "text_q_seg": ["一", "般", "买", "车", "都", "去", "哪", "里", "会", "比", "较", "便", "宜", "呀"], "text_t_seg": ["哪", "里", "买", "车", "便", "宜", "点"], "sample_type": "disturb"} +{"id": 1649, "query": "你是如何看待婚姻的呢?", "title": "你如何看待婚姻", "text_q_seg": ["你", "是", "如", "何", "看", "待", "婚", "姻", "的", "呢", "?"], "text_t_seg": ["你", "如", "何", "看", "待", "婚", "姻"], "sample_type": "disturb"} +{"id": 1650, "query": "请帮我找一首歌,张学友的,谢谢", "title": "求张学友的一首歌", "text_q_seg": ["请", "帮", "我", "找", "一", "首", "歌", ",", "张", "学", "友", "的", ",", "谢", "谢"], "text_t_seg": ["求", "张", "学", "友", "的", "一", "首", "歌"], "sample_type": "disturb"} +{"id": 1651, "query": "世事难料猜一生肖", 
"title": "世事难料属什么生肖", "text_q_seg": ["世", "事", "难", "料", "猜", "一", "生", "肖"], "text_t_seg": ["世", "事", "难", "料", "属", "什", "么", "生", "肖"], "sample_type": "disturb"} +{"id": 1652, "query": "清远县是属于哪里的", "title": "清远属于哪里", "text_q_seg": ["清", "远", "县", "是", "属", "于", "哪", "里", "的"], "text_t_seg": ["清", "远", "属", "于", "哪", "里"], "sample_type": "disturb"} +{"id": 1653, "query": "贫血的话,补血需要吃什么呢", "title": "贫血要吃什么", "text_q_seg": ["贫", "血", "的", "话", ",", "补", "血", "需", "要", "吃", "什", "么", "呢"], "text_t_seg": ["贫", "血", "要", "吃", "什", "么"], "sample_type": "disturb"} +{"id": 1654, "query": "黄豆芽怎么做才好吃呢?", "title": "黄豆芽怎么做好吃?", "text_q_seg": ["黄", "豆", "芽", "怎", "么", "做", "才", "好", "吃", "呢", "?"], "text_t_seg": ["黄", "豆", "芽", "怎", "么", "做", "好", "吃", "?"], "sample_type": "disturb"} +{"id": 1655, "query": "奥特曼你最喜欢那个", "title": "你最爱哪个奥特曼?", "text_q_seg": ["奥", "特", "曼", "你", "最", "喜", "欢", "那", "个"], "text_t_seg": ["你", "最", "爱", "哪", "个", "奥", "特", "曼", "?"], "sample_type": "disturb"} +{"id": 1656, "query": "这张图片是什么动漫", "title": "求这张图片的动漫名!", "text_q_seg": ["这", "张", "图", "片", "是", "什", "么", "动", "漫"], "text_t_seg": ["求", "这", "张", "图", "片", "的", "动", "漫", "名", "!"], "sample_type": "disturb"} +{"id": 1657, "query": "在过年的时候,什么好卖点呢", "title": "要过年了卖点什么好", "text_q_seg": ["在", "过", "年", "的", "时", "候", ",", "什", "么", "好", "卖", "点", "呢"], "text_t_seg": ["要", "过", "年", "了", "卖", "点", "什", "么", "好"], "sample_type": "disturb"} +{"id": 1658, "query": "最近过的怎么样呀,好不好啊", "title": "你们最近过的怎么样?", "text_q_seg": ["最", "近", "过", "的", "怎", "么", "样", "呀", ",", "好", "不", "好", "啊"], "text_t_seg": ["你", "们", "最", "近", "过", "的", "怎", "么", "样", "?"], "sample_type": "disturb"} +{"id": 1659, "query": "现在有什么新电影", "title": "现在可以看的电影都有什么呀?", "text_q_seg": ["现", "在", "有", "什", "么", "新", "电", "影"], "text_t_seg": ["现", "在", "可", "以", "看", "的", "电", "影", "都", "有", "什", "么", "呀", "?"], "sample_type": "disturb"} +{"id": 1660, "query": "生理期可以喝茶吗", "title": "来大姨妈的时候能喝茶吗", "text_q_seg": ["生", "理", "期", "可", "以", "喝", "茶", "吗"], "text_t_seg": ["来", "大", "姨", "妈", "的", "时", "候", "能", "喝", "茶", "吗"], "sample_type": "disturb"} +{"id": 1662, "query": "本图字体是什么", "title": "图中为什么字体", "text_q_seg": ["本", "图", "字", "体", "是", "什", "么"], "text_t_seg": ["图", "中", "为", "什", "么", "字", "体"], "sample_type": "disturb"} +{"id": 1663, "query": "画白雪公主怎么画", "title": "白雪公主如何画", "text_q_seg": ["画", "白", "雪", "公", "主", "怎", "么", "画"], "text_t_seg": ["白", "雪", "公", "主", "如", "何", "画"], "sample_type": "disturb"} +{"id": 1664, "query": "我爱你 日语", "title": "我爱你用日语如何说?", "text_q_seg": ["我", "爱", "你", " ", "日", "语"], "text_t_seg": ["我", "爱", "你", "用", "日", "语", "如", "何", "说", "?"], "sample_type": "disturb"} +{"id": 1666, "query": "踏步机什么牌子的好", "title": "踏步机比较好的牌子都有哪些?", "text_q_seg": ["踏", "步", "机", "什", "么", "牌", "子", "的", "好"], "text_t_seg": ["踏", "步", "机", "比", "较", "好", "的", "牌", "子", "都", "有", "哪", "些", "?"], "sample_type": "disturb"} +{"id": 1667, "query": "这样的鞋怎么穿鞋带", "title": "这个鞋带要怎么串起来呢", "text_q_seg": ["这", "样", "的", "鞋", "怎", "么", "穿", "鞋", "带"], "text_t_seg": ["这", "个", "鞋", "带", "要", "怎", "么", "串", "起", "来", "呢"], "sample_type": "disturb"} +{"id": 1668, "query": "漫画下载的好方法", "title": "怎么下载漫画", "text_q_seg": ["漫", "画", "下", "载", "的", "好", "方", "法"], "text_t_seg": ["怎", "么", "下", "载", "漫", "画"], "sample_type": "disturb"} +{"id": 1670, "query": "如何选择手机", "title": "怎样选择手机", "text_q_seg": ["如", "何", "选", "择", "手", "机"], "text_t_seg": ["怎", "样", "选", "择", "手", "机"], "sample_type": "disturb"} +{"id": 1671, "query": "在淘宝上买电子产品如手机,体验怎么样,手机可靠吗?", "title": "在淘宝上买手机好吗", 
"text_q_seg": ["在", "淘", "宝", "上", "买", "电", "子", "产", "品", "如", "手", "机", ",", "体", "验", "怎", "么", "样", ",", "手", "机", "可", "靠", "吗", "?"], "text_t_seg": ["在", "淘", "宝", "上", "买", "手", "机", "好", "吗"], "sample_type": "disturb"} +{"id": 1673, "query": "歌曲时间去哪了吉他谱", "title": "时间都去哪啦吉他谱", "text_q_seg": ["歌", "曲", "时", "间", "去", "哪", "了", "吉", "他", "谱"], "text_t_seg": ["时", "间", "都", "去", "哪", "啦", "吉", "他", "谱"], "sample_type": "disturb"} +{"id": 1674, "query": "谁会玩傲世西游", "title": "有谁玩傲世西游吗?", "text_q_seg": ["谁", "会", "玩", "傲", "世", "西", "游"], "text_t_seg": ["有", "谁", "玩", "傲", "世", "西", "游", "吗", "?"], "sample_type": "disturb"} +{"id": 1675, "query": "铁观音的购买方法", "title": "有没有购买铁观音的好的渠道", "text_q_seg": ["铁", "观", "音", "的", "购", "买", "方", "法"], "text_t_seg": ["有", "没", "有", "购", "买", "铁", "观", "音", "的", "好", "的", "渠", "道"], "sample_type": "disturb"} +{"id": 1678, "query": "哪些动画片是跟国宝大熊猫相关的", "title": "有关于熊猫的动画片", "text_q_seg": ["哪", "些", "动", "画", "片", "是", "跟", "国", "宝", "大", "熊", "猫", "相", "关", "的"], "text_t_seg": ["有", "关", "于", "熊", "猫", "的", "动", "画", "片"], "sample_type": "disturb"} +{"id": 1680, "query": "硝酸铜是什么颜色的?", "title": "硝酸铜颜色是什么", "text_q_seg": ["硝", "酸", "铜", "是", "什", "么", "颜", "色", "的", "?"], "text_t_seg": ["硝", "酸", "铜", "颜", "色", "是", "什", "么"], "sample_type": "disturb"} +{"id": 1681, "query": "火影忍者佐助搞小樱", "title": "请帮忙搜索火影忍者佐助跟小樱", "text_q_seg": ["火", "影", "忍", "者", "佐", "助", "搞", "小", "樱"], "text_t_seg": ["请", "帮", "忙", "搜", "索", "火", "影", "忍", "者", "佐", "助", "跟", "小", "樱"], "sample_type": "disturb"} +{"id": 1682, "query": "感冒还能喝啤酒吗?", "title": "感冒了能够喝啤酒吗?", "text_q_seg": ["感", "冒", "还", "能", "喝", "啤", "酒", "吗", "?"], "text_t_seg": ["感", "冒", "了", "能", "够", "喝", "啤", "酒", "吗", "?"], "sample_type": "disturb"} +{"id": 1683, "query": "请问这是什么动漫呢?", "title": "请问这个动漫是哪个呀", "text_q_seg": ["请", "问", "这", "是", "什", "么", "动", "漫", "呢", "?"], "text_t_seg": ["请", "问", "这", "个", "动", "漫", "是", "哪", "个", "呀"], "sample_type": "disturb"} +{"id": 1685, "query": "电炒锅什么牌子好", "title": "什么牌子的电炒锅最好", "text_q_seg": ["电", "炒", "锅", "什", "么", "牌", "子", "好"], "text_t_seg": ["什", "么", "牌", "子", "的", "电", "炒", "锅", "最", "好"], "sample_type": "disturb"} +{"id": 1686, "query": "梦一场萧敬腾伴奏", "title": "求萧敬腾的梦一场的伴奏部分", "text_q_seg": ["梦", "一", "场", "萧", "敬", "腾", "伴", "奏"], "text_t_seg": ["求", "萧", "敬", "腾", "的", "梦", "一", "场", "的", "伴", "奏", "部", "分"], "sample_type": "disturb"} +{"id": 1687, "query": "求一本玄幻小说名", "title": "寻一本玄幻的小说!", "text_q_seg": ["求", "一", "本", "玄", "幻", "小", "说", "名"], "text_t_seg": ["寻", "一", "本", "玄", "幻", "的", "小", "说", "!"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/similarity_en b/examples/model_interpretation/data/similarity_en new file mode 100644 index 0000000000000000000000000000000000000000..82cf67742d7a894ac8902820c6273dabf7855921 --- /dev/null +++ b/examples/model_interpretation/data/similarity_en @@ -0,0 +1,100 @@ +{"id": 1, "sentence1": "Is there a reason why we should travel alone ?", "sentence2": "What are some reasons to travel alone ?", "text_q_seg": ["Is", "there", "a", "reason", "why", "we", "should", "travel", "alone", "?"], "text_t_seg": ["What", "are", "some", "reasons", "to", "travel", "alone", "?"], "sample_type": "ori", "rel_ids": [1660]} +{"id": 2, "sentence1": "I am 25 year old guy and never had a girlfriend . Is this weird ?", "sentence2": "I am 25 years old . I have never had a girlfriend . 
Is something wrong with me ?", "text_q_seg": ["I", "am", "25", "year", "old", "guy", "and", "never", "had", "a", "girlfriend", ".", "Is", "this", "weird", "?"], "text_t_seg": ["I", "am", "25", "years", "old", ".", "I", "have", "never", "had", "a", "girlfriend", ".", "Is", "something", "wrong", "with", "me", "?"], "sample_type": "ori", "rel_ids": [1661]} +{"id": 3, "sentence1": "What does a good answer on Quora look like ? What does it mean to \" be helpful \" ?", "sentence2": "How do you write a good answer on Quora ?", "text_q_seg": ["What", "does", "a", "good", "answer", "on", "Quora", "look", "like", "?", "What", "does", "it", "mean", "to", "\"", "be", "helpful", "\"", "?"], "text_t_seg": ["How", "do", "you", "write", "a", "good", "answer", "on", "Quora", "?"], "sample_type": "ori", "rel_ids": [1662]} +{"id": 4, "sentence1": "What was the deadliest battle in history ?", "sentence2": "What was the bloodiest battle in history ?", "text_q_seg": ["What", "was", "the", "deadliest", "battle", "in", "history", "?"], "text_t_seg": ["What", "was", "the", "bloodiest", "battle", "in", "history", "?"], "sample_type": "ori", "rel_ids": [1663]} +{"id": 5, "sentence1": "What are your views about demonetisation in India ?", "sentence2": "What do you think about the ban on 500 and 1000 denomination notes in India ?", "text_q_seg": ["What", "are", "your", "views", "about", "demonetisation", "in", "India", "?"], "text_t_seg": ["What", "do", "you", "think", "about", "the", "ban", "on", "500", "and", "1000", "denomination", "notes", "in", "India", "?"], "sample_type": "ori", "rel_ids": [1664]} +{"id": 6, "sentence1": "Is it a bad time to buy a condo or a house in the Bay Area in 2017 ?", "sentence2": "Would 2017 be a good time to buy a house in Bay Area ?", "text_q_seg": ["Is", "it", "a", "bad", "time", "to", "buy", "a", "condo", "or", "a", "house", "in", "the", "Bay", "Area", "in", "2017", "?"], "text_t_seg": ["Would", "2017", "be", "a", "good", "time", "to", "buy", "a", "house", "in", "Bay", "Area", "?"], "sample_type": "ori", "rel_ids": [1665]} +{"id": 7, "sentence1": "What books should I read as an aspiring entrepreneur ?", "sentence2": "What are the top books an aspiring teen entrepreneur should read ?", "text_q_seg": ["What", "books", "should", "I", "read", "as", "an", "aspiring", "entrepreneur", "?"], "text_t_seg": ["What", "are", "the", "top", "books", "an", "aspiring", "teen", "entrepreneur", "should", "read", "?"], "sample_type": "ori", "rel_ids": [1666]} +{"id": 8, "sentence1": "If universe is expanding without a limit and dark and vacuum energy are created as it expands … ?", "sentence2": "If universe can expand without limit and it creates dark / vacuum / gravitational energy with it , then is the potential energy infinite ?", "text_q_seg": ["If", "universe", "is", "expanding", "without", "a", "limit", "and", "dark", "and", "vacuum", "energy", "are", "created", "as", "it", "expands", "…", "?"], "text_t_seg": ["If", "universe", "can", "expand", "without", "limit", "and", "it", "creates", "dark", "/", "vacuum", "/", "gravitational", "energy", "with", "it", ",", "then", "is", "the", "potential", "energy", "infinite", "?"], "sample_type": "ori", "rel_ids": [1667]} +{"id": 9, "sentence1": "What people who you 've never met have influenced your life the most ?", "sentence2": "Who are people you have never met who have had the greatest influence on your life ?", "text_q_seg": ["What", "people", "who", "you", "'ve", "never", "met", "have", "influenced", "your", "life", "the", "most", "?"], 
"text_t_seg": ["Who", "are", "people", "you", "have", "never", "met", "who", "have", "had", "the", "greatest", "influence", "on", "your", "life", "?"], "sample_type": "ori", "rel_ids": [1668]} +{"id": 10, "sentence1": "I 'm going to be US President one day . What should I start doing now to achieve this ?", "sentence2": "I 'm 16 and I want to become the US president someday . What should I start doing ?", "text_q_seg": ["I", "'m", "going", "to", "be", "US", "President", "one", "day", ".", "What", "should", "I", "start", "doing", "now", "to", "achieve", "this", "?"], "text_t_seg": ["I", "'m", "16", "and", "I", "want", "to", "become", "the", "US", "president", "someday", ".", "What", "should", "I", "start", "doing", "?"], "sample_type": "ori", "rel_ids": [1669]} +{"id": 11, "sentence1": "Why MS Dhoni leave captaincy of ODI & T-20 ?", "sentence2": "Why does M.S Dhoni left captaincy for ODI and T20 ?", "text_q_seg": ["Why", "MS", "Dhoni", "leave", "captaincy", "of", "ODI", "&", "T-20", "?"], "text_t_seg": ["Why", "does", "M.S", "Dhoni", "left", "captaincy", "for", "ODI", "and", "T20", "?"], "sample_type": "ori", "rel_ids": [1670]} +{"id": 12, "sentence1": "What are the procedures for becoming an actuary ?", "sentence2": "What is the procedure of becoming an actuary ?", "text_q_seg": ["What", "are", "the", "procedures", "for", "becoming", "an", "actuary", "?"], "text_t_seg": ["What", "is", "the", "procedure", "of", "becoming", "an", "actuary", "?"], "sample_type": "ori", "rel_ids": [1671]} +{"id": 13, "sentence1": "How do smart and successful people control their emotions ?", "sentence2": "How can I control my emotions ?", "text_q_seg": ["How", "do", "smart", "and", "successful", "people", "control", "their", "emotions", "?"], "text_t_seg": ["How", "can", "I", "control", "my", "emotions", "?"], "sample_type": "ori", "rel_ids": [1672]} +{"id": 14, "sentence1": "What are the best tips for outlining / planning a novel ?", "sentence2": "How do I best outline my novel ?", "text_q_seg": ["What", "are", "the", "best", "tips", "for", "outlining", "/", "planning", "a", "novel", "?"], "text_t_seg": ["How", "do", "I", "best", "outline", "my", "novel", "?"], "sample_type": "ori", "rel_ids": [1673]} +{"id": 15, "sentence1": "What will happen if Donald Trump became the president of America ?", "sentence2": "What will happen now that President - elect Donald Trump has won the election ?", "text_q_seg": ["What", "will", "happen", "if", "Donald", "Trump", "became", "the", "president", "of", "America", "?"], "text_t_seg": ["What", "will", "happen", "now", "that", "President", "-", "elect", "Donald", "Trump", "has", "won", "the", "election", "?"], "sample_type": "ori", "rel_ids": [1674]} +{"id": 16, "sentence1": "Why did n't Ned Stark bring more men to the Tower of Joy ?", "sentence2": "Why did Ned Stark go to the Tower of Joy with so few men ? 
Why not bring a small guard ( say 20 more men ) of loyal and discreet northerners ?", "text_q_seg": ["Why", "did", "n't", "Ned", "Stark", "bring", "more", "men", "to", "the", "Tower", "of", "Joy", "?"], "text_t_seg": ["Why", "did", "Ned", "Stark", "go", "to", "the", "Tower", "of", "Joy", "with", "so", "few", "men", "?", "Why", "not", "bring", "a", "small", "guard", "(", "say", "20", "more", "men", ")", "of", "loyal", "and", "discreet", "northerners", "?"], "sample_type": "ori", "rel_ids": [1675]} +{"id": 17, "sentence1": "How do you get better grades ?", "sentence2": "How can I dramatically improve my grades ?", "text_q_seg": ["How", "do", "you", "get", "better", "grades", "?"], "text_t_seg": ["How", "can", "I", "dramatically", "improve", "my", "grades", "?"], "sample_type": "ori", "rel_ids": [1676]} +{"id": 18, "sentence1": "What is your new year resolution , short term and long term goal for 2017 ?", "sentence2": "What will be your New Year 's resolution for 2017 ?", "text_q_seg": ["What", "is", "your", "new", "year", "resolution", ",", "short", "term", "and", "long", "term", "goal", "for", "2017", "?"], "text_t_seg": ["What", "will", "be", "your", "New", "Year", "'s", "resolution", "for", "2017", "?"], "sample_type": "ori", "rel_ids": [1677]} +{"id": 19, "sentence1": "What will happen to the next Star Wars movies after Carrie Fisher 's death ?", "sentence2": "What will Carrie Fisher 's death mean for the next Star Wars movies ?", "text_q_seg": ["What", "will", "happen", "to", "the", "next", "Star", "Wars", "movies", "after", "Carrie", "Fisher", "'s", "death", "?"], "text_t_seg": ["What", "will", "Carrie", "Fisher", "'s", "death", "mean", "for", "the", "next", "Star", "Wars", "movies", "?"], "sample_type": "ori", "rel_ids": [1678]} +{"id": 20, "sentence1": "What is an analogy for a smooth ER ?", "sentence2": "What is an analogy for smooth ER ?", "text_q_seg": ["What", "is", "an", "analogy", "for", "a", "smooth", "ER", "?"], "text_t_seg": ["What", "is", "an", "analogy", "for", "smooth", "ER", "?"], "sample_type": "ori", "rel_ids": [1679]} +{"id": 21, "sentence1": "What is the best business to start in Bangalore ?", "sentence2": "What is the best business in Bangalore to start up with ?", "text_q_seg": ["What", "is", "the", "best", "business", "to", "start", "in", "Bangalore", "?"], "text_t_seg": ["What", "is", "the", "best", "business", "in", "Bangalore", "to", "start", "up", "with", "?"], "sample_type": "ori", "rel_ids": [1680]} +{"id": 22, "sentence1": "Why does gst bill so important ?", "sentence2": "What is the effect of GST bill on a common man ?", "text_q_seg": ["Why", "does", "gst", "bill", "so", "important", "?"], "text_t_seg": ["What", "is", "the", "effect", "of", "GST", "bill", "on", "a", "common", "man", "?"], "sample_type": "ori", "rel_ids": [1681]} +{"id": 23, "sentence1": "Which aircraft was superior - the Douglas DC8 or the Boeing 707 ?", "sentence2": "Was the Douglas DC8 a superior aircraft to the Boeing 707 ?", "text_q_seg": ["Which", "aircraft", "was", "superior", "-", "the", "Douglas", "DC8", "or", "the", "Boeing", "707", "?"], "text_t_seg": ["Was", "the", "Douglas", "DC8", "a", "superior", "aircraft", "to", "the", "Boeing", "707", "?"], "sample_type": "ori", "rel_ids": [1682]} +{"id": 24, "sentence1": "How can I expand my IQ ?", "sentence2": "What can I do to increase my IQ ?", "text_q_seg": ["How", "can", "I", "expand", "my", "IQ", "?"], "text_t_seg": ["What", "can", "I", "do", "to", "increase", "my", "IQ", "?"], "sample_type": "ori", "rel_ids": [1683]} +{"id": 25, 
"sentence1": "What does it mean when a girl take a day to reply to your text ?", "sentence2": "What does it mean when girls reply to a text a day after ?", "text_q_seg": ["What", "does", "it", "mean", "when", "a", "girl", "take", "a", "day", "to", "reply", "to", "your", "text", "?"], "text_t_seg": ["What", "does", "it", "mean", "when", "girls", "reply", "to", "a", "text", "a", "day", "after", "?"], "sample_type": "ori", "rel_ids": [1684]} +{"id": 26, "sentence1": "How can I stop myself from watching too much of porn ?", "sentence2": "How shall I stop watching porn ?", "text_q_seg": ["How", "can", "I", "stop", "myself", "from", "watching", "too", "much", "of", "porn", "?"], "text_t_seg": ["How", "shall", "I", "stop", "watching", "porn", "?"], "sample_type": "ori", "rel_ids": [1685]} +{"id": 27, "sentence1": "What will be the effect of banning 500 and 1000 Rs notes on real estate sector in India ? Can we expect sharp fall in prices in short / long term ?", "sentence2": "What will the real estate look like now after the 500 and 1000 scraping ?", "text_q_seg": ["What", "will", "be", "the", "effect", "of", "banning", "500", "and", "1000", "Rs", "notes", "on", "real", "estate", "sector", "in", "India", "?", "Can", "we", "expect", "sharp", "fall", "in", "prices", "in", "short", "/", "long", "term", "?"], "text_t_seg": ["What", "will", "the", "real", "estate", "look", "like", "now", "after", "the", "500", "and", "1000", "scraping", "?"], "sample_type": "ori", "rel_ids": [1686]} +{"id": 28, "sentence1": "Is it worth it to pay for PhD from my pocket ?", "sentence2": "Is it foolish to pay for your PhD out of your own pocket ?", "text_q_seg": ["Is", "it", "worth", "it", "to", "pay", "for", "PhD", "from", "my", "pocket", "?"], "text_t_seg": ["Is", "it", "foolish", "to", "pay", "for", "your", "PhD", "out", "of", "your", "own", "pocket", "?"], "sample_type": "ori", "rel_ids": [1687]} +{"id": 29, "sentence1": "What is the maximum file size that can be uploaded in Whatsapp ?", "sentence2": "What is the maximum file size on WhatsApp ?", "text_q_seg": ["What", "is", "the", "maximum", "file", "size", "that", "can", "be", "uploaded", "in", "Whatsapp", "?"], "text_t_seg": ["What", "is", "the", "maximum", "file", "size", "on", "WhatsApp", "?"], "sample_type": "ori", "rel_ids": [1688]} +{"id": 30, "sentence1": "What are the best ways to learn to cook ?", "sentence2": "How do I learn to cook ?", "text_q_seg": ["What", "are", "the", "best", "ways", "to", "learn", "to", "cook", "?"], "text_t_seg": ["How", "do", "I", "learn", "to", "cook", "?"], "sample_type": "ori", "rel_ids": [1689]} +{"id": 31, "sentence1": "What was first word spoken by human ?", "sentence2": "What is the first word ever spoken ?", "text_q_seg": ["What", "was", "first", "word", "spoken", "by", "human", "?"], "text_t_seg": ["What", "is", "the", "first", "word", "ever", "spoken", "?"], "sample_type": "ori", "rel_ids": [1690]} +{"id": 32, "sentence1": "Should I give my JEE Main exam offline or online ?", "sentence2": "Which mode is best for JEE MAIN 2017 online exam or offline ?", "text_q_seg": ["Should", "I", "give", "my", "JEE", "Main", "exam", "offline", "or", "online", "?"], "text_t_seg": ["Which", "mode", "is", "best", "for", "JEE", "MAIN", "2017", "online", "exam", "or", "offline", "?"], "sample_type": "ori", "rel_ids": [1691]} +{"id": 33, "sentence1": "Is literally infinite number of unique human DNAs possible ?", "sentence2": "What is the maximum number of genetically unique individuals that human genome allows ?", "text_q_seg": ["Is", 
"literally", "infinite", "number", "of", "unique", "human", "DNAs", "possible", "?"], "text_t_seg": ["What", "is", "the", "maximum", "number", "of", "genetically", "unique", "individuals", "that", "human", "genome", "allows", "?"], "sample_type": "ori", "rel_ids": [1692]} +{"id": 34, "sentence1": "What is motive of Mulayam Singh Yadav behind expelling Akhilesh Yadav from Samajwadi party ?", "sentence2": "Why did Mulayam Singh Yadav expel Akhilesh Yadav from the Samajwadi Party for 6 years ?", "text_q_seg": ["What", "is", "motive", "of", "Mulayam", "Singh", "Yadav", "behind", "expelling", "Akhilesh", "Yadav", "from", "Samajwadi", "party", "?"], "text_t_seg": ["Why", "did", "Mulayam", "Singh", "Yadav", "expel", "Akhilesh", "Yadav", "from", "the", "Samajwadi", "Party", "for", "6", "years", "?"], "sample_type": "ori", "rel_ids": [1693]} +{"id": 35, "sentence1": "Why do we need to philosophize ?", "sentence2": "Why do we need to philosophize with others ?", "text_q_seg": ["Why", "do", "we", "need", "to", "philosophize", "?"], "text_t_seg": ["Why", "do", "we", "need", "to", "philosophize", "with", "others", "?"], "sample_type": "ori", "rel_ids": [1694]} +{"id": 36, "sentence1": "Is there any way to recover e - mails that were deleted from a Gmail account ?", "sentence2": "Is there any way to retrieve my deleted emails from my Gmail account ?", "text_q_seg": ["Is", "there", "any", "way", "to", "recover", "e", "-", "mails", "that", "were", "deleted", "from", "a", "Gmail", "account", "?"], "text_t_seg": ["Is", "there", "any", "way", "to", "retrieve", "my", "deleted", "emails", "from", "my", "Gmail", "account", "?"], "sample_type": "ori", "rel_ids": [1695]} +{"id": 37, "sentence1": "How do I find my own gmail accounts list ?", "sentence2": "How can you find all of your Gmail accounts ?", "text_q_seg": ["How", "do", "I", "find", "my", "own", "gmail", "accounts", "list", "?"], "text_t_seg": ["How", "can", "you", "find", "all", "of", "your", "Gmail", "accounts", "?"], "sample_type": "ori", "rel_ids": [1696]} +{"id": 38, "sentence1": "Where can I get sparkling and well maintained cleaning service in Sydney ?", "sentence2": "Where can I get cleaning services in Sydney ?", "text_q_seg": ["Where", "can", "I", "get", "sparkling", "and", "well", "maintained", "cleaning", "service", "in", "Sydney", "?"], "text_t_seg": ["Where", "can", "I", "get", "cleaning", "services", "in", "Sydney", "?"], "sample_type": "ori", "rel_ids": [1697]} +{"id": 39, "sentence1": "Can Fast and Furious 7 gross $ 1 billion worldwide ?", "sentence2": "Will Furious 7 be the first movie in the franchise to gross a billion dollars ?", "text_q_seg": ["Can", "Fast", "and", "Furious", "7", "gross", "$", "1", "billion", "worldwide", "?"], "text_t_seg": ["Will", "Furious", "7", "be", "the", "first", "movie", "in", "the", "franchise", "to", "gross", "a", "billion", "dollars", "?"], "sample_type": "ori", "rel_ids": [1698]} +{"id": 40, "sentence1": "Which is the best book for learning language c++ ?", "sentence2": "What is a good book for learning the basics of C++ programming ?", "text_q_seg": ["Which", "is", "the", "best", "book", "for", "learning", "language", "c++", "?"], "text_t_seg": ["What", "is", "a", "good", "book", "for", "learning", "the", "basics", "of", "C++", "programming", "?"], "sample_type": "ori", "rel_ids": [1699]} +{"id": 41, "sentence1": "What will be Barack Obama 's legacy ?", "sentence2": "Based on what we know now , what will Barack Obama 's historical legacy be ?", "text_q_seg": ["What", "will", "be", "Barack", "Obama", 
"'s", "legacy", "?"], "text_t_seg": ["Based", "on", "what", "we", "know", "now", ",", "what", "will", "Barack", "Obama", "'s", "historical", "legacy", "be", "?"], "sample_type": "ori", "rel_ids": [1700]} +{"id": 42, "sentence1": "Why do so many people hate Hilary Clinton ?", "sentence2": "What are the reasons that people dislike Hillary Clinton ?", "text_q_seg": ["Why", "do", "so", "many", "people", "hate", "Hilary", "Clinton", "?"], "text_t_seg": ["What", "are", "the", "reasons", "that", "people", "dislike", "Hillary", "Clinton", "?"], "sample_type": "ori", "rel_ids": [1701]} +{"id": 43, "sentence1": "How do l see who viewed my videos on Instagram ?", "sentence2": "How can I see who viewed my video on Instagram but did n't like my video ?", "text_q_seg": ["How", "do", "l", "see", "who", "viewed", "my", "videos", "on", "Instagram", "?"], "text_t_seg": ["How", "can", "I", "see", "who", "viewed", "my", "video", "on", "Instagram", "but", "did", "n't", "like", "my", "video", "?"], "sample_type": "ori", "rel_ids": [1702]} +{"id": 44, "sentence1": "Why is that the sky is so blue ?", "sentence2": "Why is the sky is blue ?", "text_q_seg": ["Why", "is", "that", "the", "sky", "is", "so", "blue", "?"], "text_t_seg": ["Why", "is", "the", "sky", "is", "blue", "?"], "sample_type": "ori", "rel_ids": [1703]} +{"id": 45, "sentence1": "How can I learn English well in a short time ?", "sentence2": "How can I learn English in a short time ?", "text_q_seg": ["How", "can", "I", "learn", "English", "well", "in", "a", "short", "time", "?"], "text_t_seg": ["How", "can", "I", "learn", "English", "in", "a", "short", "time", "?"], "sample_type": "ori", "rel_ids": [1704]} +{"id": 46, "sentence1": "How can I stop eating junk and processed food addiction and stay healthy ?", "sentence2": "How do I stop my cravings for junk food ?", "text_q_seg": ["How", "can", "I", "stop", "eating", "junk", "and", "processed", "food", "addiction", "and", "stay", "healthy", "?"], "text_t_seg": ["How", "do", "I", "stop", "my", "cravings", "for", "junk", "food", "?"], "sample_type": "ori", "rel_ids": [1705]} +{"id": 47, "sentence1": "What are the movies one should see ?", "sentence2": "What are the greatest movies I have to see ?", "text_q_seg": ["What", "are", "the", "movies", "one", "should", "see", "?"], "text_t_seg": ["What", "are", "the", "greatest", "movies", "I", "have", "to", "see", "?"], "sample_type": "ori", "rel_ids": [1706]} +{"id": 48, "sentence1": "What is an accurate way to calculate your IQ ?", "sentence2": "What 's the most accurate way to test my IQ ?", "text_q_seg": ["What", "is", "an", "accurate", "way", "to", "calculate", "your", "IQ", "?"], "text_t_seg": ["What", "'s", "the", "most", "accurate", "way", "to", "test", "my", "IQ", "?"], "sample_type": "ori", "rel_ids": [1707]} +{"id": 49, "sentence1": "Is our PM Modi doing the correct thing with 500 and 1000 Rs notes ?", "sentence2": "What do you think about ban on Rs . 500 and Rs . 
1000 currency notes ?", "text_q_seg": ["Is", "our", "PM", "Modi", "doing", "the", "correct", "thing", "with", "500", "and", "1000", "Rs", "notes", "?"], "text_t_seg": ["What", "do", "you", "think", "about", "ban", "on", "Rs", ".", "500", "and", "Rs", ".", "1000", "currency", "notes", "?"], "sample_type": "ori", "rel_ids": [1708]} +{"id": 50, "sentence1": "Why is the firm 's marginal cost curve equal supply curve ?", "sentence2": "How can supply curve tell about marginal cost ?", "text_q_seg": ["Why", "is", "the", "firm", "'s", "marginal", "cost", "curve", "equal", "supply", "curve", "?"], "text_t_seg": ["How", "can", "supply", "curve", "tell", "about", "marginal", "cost", "?"], "sample_type": "ori", "rel_ids": [1709]} +{"id": 1660, "sentence1": "Is there any reason that we should travel alone ?", "sentence2": "What are some reasons to travel alone ?", "text_q_seg": ["Is", "there", "any", "reason", "that", "we", "should", "travel", "alone", "?"], "text_t_seg": ["What", "are", "some", "reasons", "to", "travel", "alone", "?"], "sample_type": "disturb"} +{"id": 1661, "sentence1": "I am 25 year old guy and never had a girlfriend . Is this odd ?", "sentence2": "I am 25 years old . I have never had a girlfriend . Is something wrong with me ?", "text_q_seg": ["I", "am", "25", "year", "old", "guy", "and", "never", "had", "a", "girlfriend", ".", "Is", "this", "odd", "?"], "text_t_seg": ["I", "am", "25", "years", "old", ".", "I", "have", "never", "had", "a", "girlfriend", ".", "Is", "something", "wrong", "with", "me", "?"], "sample_type": "disturb"} +{"id": 1662, "sentence1": "what is a good answer on Quora that is helpful ?", "sentence2": "How do you write a good answer on Quora ?", "text_q_seg": ["what", "is", "a", "good", "answer", "on", "Quora", "that", "is", "helpful", "?"], "text_t_seg": ["How", "do", "you", "write", "a", "good", "answer", "on", "Quora", "?"], "sample_type": "disturb"} +{"id": 1663, "sentence1": "What was the most fatal battle in history ?", "sentence2": "What was the bloodiest battle in history ?", "text_q_seg": ["What", "was", "the", "most", "fatal", "battle", "in", "history", "?"], "text_t_seg": ["What", "was", "the", "bloodiest", "battle", "in", "history", "?"], "sample_type": "disturb"} +{"id": 1664, "sentence1": "What are your opions on demonetisation in India ?", "sentence2": "What do you think about the ban on 500 and 1000 denomination notes in India ?", "text_q_seg": ["What", "are", "your", "opions", "on", "demonetisation", "in", "India", "?"], "text_t_seg": ["What", "do", "you", "think", "about", "the", "ban", "on", "500", "and", "1000", "denomination", "notes", "in", "India", "?"], "sample_type": "disturb"} +{"id": 1665, "sentence1": "Is it a bad time to buy a condo or a house in the Bay Area in 2017 ?", "sentence2": "Is 2017 a good time to buy a house in Bay Area ?", "text_q_seg": ["Is", "it", "a", "bad", "time", "to", "buy", "a", "condo", "or", "a", "house", "in", "the", "Bay", "Area", "in", "2017", "?"], "text_t_seg": ["Is", "2017", "a", "good", "time", "to", "buy", "a", "house", "in", "Bay", "Area", "?"], "sample_type": "disturb"} +{"id": 1666, "sentence1": "What books should an aspiring entrepreneur read ?", "sentence2": "What are the top books an aspiring teen entrepreneur should read ?", "text_q_seg": ["What", "books", "should", "an", "aspiring", "entrepreneur", "read", "?"], "text_t_seg": ["What", "are", "the", "top", "books", "an", "aspiring", "teen", "entrepreneur", "should", "read", "?"], "sample_type": "disturb"} +{"id": 1667, "sentence1": "If universe is 
expanding infinitely and dark and vacuum energy are created as it expands … ?", "sentence2": "If universe can expand without limit and it creates dark / vacuum / gravitational energy with it , then is the potential energy infinite ?", "text_q_seg": ["If", "universe", "is", "expanding", "infinitely", "and", "dark", "and", "vacuum", "energy", "are", "created", "as", "it", "expands", "…", "?"], "text_t_seg": ["If", "universe", "can", "expand", "without", "limit", "and", "it", "creates", "dark", "/", "vacuum", "/", "gravitational", "energy", "with", "it", ",", "then", "is", "the", "potential", "energy", "infinite", "?"], "sample_type": "disturb"} +{"id": 1668, "sentence1": "Who 's the greatest influencer on your life that you have never met ?", "sentence2": "Who are people you have never met who have had the greatest influence on your life ?", "text_q_seg": ["Who", "'s", "the", "greatest", "influencer", "on", "your", "life", "that", "you", "have", "never", "met", "?"], "text_t_seg": ["Who", "are", "people", "you", "have", "never", "met", "who", "have", "had", "the", "greatest", "influence", "on", "your", "life", "?"], "sample_type": "disturb"} +{"id": 1669, "sentence1": "I 'm going to be US President in the future . What should I start doing now to achieve this ?", "sentence2": "I 'm 16 and I want to become the US president someday . What should I start doing ?", "text_q_seg": ["I", "'m", "going", "to", "be", "US", "President", "in", "the", "future", ".", "What", "should", "I", "start", "doing", "now", "to", "achieve", "this", "?"], "text_t_seg": ["I", "'m", "16", "and", "I", "want", "to", "become", "the", "US", "president", "someday", ".", "What", "should", "I", "start", "doing", "?"], "sample_type": "disturb"} +{"id": 1670, "sentence1": "For what reason did MS Dhoni leave captaincy of ODI & T-20 ?", "sentence2": "Why does M.S Dhoni left captaincy for ODI and T20 ?", "text_q_seg": ["For", "what", "reason", "did", "MS", "Dhoni", "leave", "captaincy", "of", "ODI", "&", "T-20", "?"], "text_t_seg": ["Why", "does", "M.S", "Dhoni", "left", "captaincy", "for", "ODI", "and", "T20", "?"], "sample_type": "disturb"} +{"id": 1671, "sentence1": "How to become an actuary ?", "sentence2": "What is the procedure of becoming an actuary ?", "text_q_seg": ["How", "to", "become", "an", "actuary", "?"], "text_t_seg": ["What", "is", "the", "procedure", "of", "becoming", "an", "actuary", "?"], "sample_type": "disturb"} +{"id": 1672, "sentence1": "Are there any smart ways to control emotions ?", "sentence2": "How can I control my emotions ?", "text_q_seg": ["Are", "there", "any", "smart", "ways", "to", "control", "emotions", "?"], "text_t_seg": ["How", "can", "I", "control", "my", "emotions", "?"], "sample_type": "disturb"} +{"id": 1673, "sentence1": "What are the best methods for outlining / planning a novel ?", "sentence2": "How do I best outline my novel ?", "text_q_seg": ["What", "are", "the", "best", "methods", "for", "outlining", "/", "planning", "a", "novel", "?"], "text_t_seg": ["How", "do", "I", "best", "outline", "my", "novel", "?"], "sample_type": "disturb"} +{"id": 1674, "sentence1": "What will happen if Donald Trump was elected the president of US ?", "sentence2": "What will happen now that President - elect Donald Trump has won the election ?", "text_q_seg": ["What", "will", "happen", "if", "Donald", "Trump", "was", "elected", "the", "president", "of", "US", "?"], "text_t_seg": ["What", "will", "happen", "now", "that", "President", "-", "elect", "Donald", "Trump", "has", "won", "the", "election", "?"], 
"sample_type": "disturb"} +{"id": 1675, "sentence1": "Why did Ned Stark bring very few men to the Tower of Joy ?", "sentence2": "Why did Ned Stark go to the Tower of Joy with so few men ? Why not bring a small guard ( say 20 more men ) of loyal and discreet northerners ?", "text_q_seg": ["Why", "did", "Ned", "Stark", "bring", "very", "few", "men", "to", "the", "Tower", "of", "Joy", "?"], "text_t_seg": ["Why", "did", "Ned", "Stark", "go", "to", "the", "Tower", "of", "Joy", "with", "so", "few", "men", "?", "Why", "not", "bring", "a", "small", "guard", "(", "say", "20", "more", "men", ")", "of", "loyal", "and", "discreet", "northerners", "?"], "sample_type": "disturb"} +{"id": 1676, "sentence1": "How do you get better grades ?", "sentence2": "How can I improve my grades ?", "text_q_seg": ["How", "do", "you", "get", "better", "grades", "?"], "text_t_seg": ["How", "can", "I", "improve", "my", "grades", "?"], "sample_type": "disturb"} +{"id": 1677, "sentence1": "What is your new year resolution , short term and long term goal for 2017 ?", "sentence2": "what will be your goals to reach in 2017", "text_q_seg": ["What", "is", "your", "new", "year", "resolution", ",", "short", "term", "and", "long", "term", "goal", "for", "2017", "?"], "text_t_seg": ["what", "will", "be", "your", "goals", "to", "reach", "in", "2017"], "sample_type": "disturb"} +{"id": 1678, "sentence1": "What will happen to the next Star Wars movies after Carrie Fisher 's death ?", "sentence2": "What will Carrie Fisher 's death mean for later Star Wars movies ?", "text_q_seg": ["What", "will", "happen", "to", "the", "next", "Star", "Wars", "movies", "after", "Carrie", "Fisher", "'s", "death", "?"], "text_t_seg": ["What", "will", "Carrie", "Fisher", "'s", "death", "mean", "for", "later", "Star", "Wars", "movies", "?"], "sample_type": "disturb"} +{"id": 1679, "sentence1": "Can you give me an analogy for a smooth ER ?", "sentence2": "What is an analogy for smooth ER ?", "text_q_seg": ["Can", "you", "give", "me", "an", "analogy", "for", "a", "smooth", "ER", "?"], "text_t_seg": ["What", "is", "an", "analogy", "for", "smooth", "ER", "?"], "sample_type": "disturb"} +{"id": 1680, "sentence1": "What is the best business to launch in Bangalore ?", "sentence2": "What is the best business in Bangalore to start up with ?", "text_q_seg": ["What", "is", "the", "best", "business", "to", "launch", "in", "Bangalore", "?"], "text_t_seg": ["What", "is", "the", "best", "business", "in", "Bangalore", "to", "start", "up", "with", "?"], "sample_type": "disturb"} +{"id": 1681, "sentence1": "Why does gst bill so important ?", "sentence2": "What is the impact of GST bill on a common man ?", "text_q_seg": ["Why", "does", "gst", "bill", "so", "important", "?"], "text_t_seg": ["What", "is", "the", "impact", "of", "GST", "bill", "on", "a", "common", "man", "?"], "sample_type": "disturb"} +{"id": 1682, "sentence1": "Which aircraft was better - the Douglas DC8 or the Boeing 707 ?", "sentence2": "Was the Douglas DC8 a superior aircraft to the Boeing 707 ?", "text_q_seg": ["Which", "aircraft", "was", "better", "-", "the", "Douglas", "DC8", "or", "the", "Boeing", "707", "?"], "text_t_seg": ["Was", "the", "Douglas", "DC8", "a", "superior", "aircraft", "to", "the", "Boeing", "707", "?"], "sample_type": "disturb"} +{"id": 1683, "sentence1": "How can I expand my IQ ?", "sentence2": "Are there any ways to increase my IQ ?", "text_q_seg": ["How", "can", "I", "expand", "my", "IQ", "?"], "text_t_seg": ["Are", "there", "any", "ways", "to", "increase", "my", "IQ", "?"], 
"sample_type": "disturb"} +{"id": 1684, "sentence1": "What does it mean when a girl take a day to reply to your text ?", "sentence2": "What does it imply when girls reply to a text a day after ?", "text_q_seg": ["What", "does", "it", "mean", "when", "a", "girl", "take", "a", "day", "to", "reply", "to", "your", "text", "?"], "text_t_seg": ["What", "does", "it", "imply", "when", "girls", "reply", "to", "a", "text", "a", "day", "after", "?"], "sample_type": "disturb"} +{"id": 1685, "sentence1": "How can I stop myself from watching too much of porn ?", "sentence2": "How shall I quit watching porn ?", "text_q_seg": ["How", "can", "I", "stop", "myself", "from", "watching", "too", "much", "of", "porn", "?"], "text_t_seg": ["How", "shall", "I", "quit", "watching", "porn", "?"], "sample_type": "disturb"} +{"id": 1686, "sentence1": "What will be the consequence of banning 500 and 1000 Rs notes on real estate sector in India ? Can we expect sharp fall in prices in short / long term ?", "sentence2": "What will the real estate look like now after the 500 and 1000 scraping ?", "text_q_seg": ["What", "will", "be", "the", "consequence", "of", "banning", "500", "and", "1000", "Rs", "notes", "on", "real", "estate", "sector", "in", "India", "?", "Can", "we", "expect", "sharp", "fall", "in", "prices", "in", "short", "/", "long", "term", "?"], "text_t_seg": ["What", "will", "the", "real", "estate", "look", "like", "now", "after", "the", "500", "and", "1000", "scraping", "?"], "sample_type": "disturb"} +{"id": 1687, "sentence1": "Is it worthwhile to pay for PhD from my pocket ?", "sentence2": "Is it foolish to pay for your PhD out of your own pocket ?", "text_q_seg": ["Is", "it", "worthwhile", "to", "pay", "for", "PhD", "from", "my", "pocket", "?"], "text_t_seg": ["Is", "it", "foolish", "to", "pay", "for", "your", "PhD", "out", "of", "your", "own", "pocket", "?"], "sample_type": "disturb"} +{"id": 1688, "sentence1": "What is the maximum file size that is allowed to be uploaded in Whatsapp ?", "sentence2": "What is the maximum file size on WhatsApp ?", "text_q_seg": ["What", "is", "the", "maximum", "file", "size", "that", "is", "allowed", "to", "be", "uploaded", "in", "Whatsapp", "?"], "text_t_seg": ["What", "is", "the", "maximum", "file", "size", "on", "WhatsApp", "?"], "sample_type": "disturb"} +{"id": 1689, "sentence1": "What are the best ways to learn to cook ?", "sentence2": "How can I learn to cook", "text_q_seg": ["What", "are", "the", "best", "ways", "to", "learn", "to", "cook", "?"], "text_t_seg": ["How", "can", "I", "learn", "to", "cook"], "sample_type": "disturb"} +{"id": 1690, "sentence1": "What was the first word uttered by human ?", "sentence2": "What is the first word ever spoken ?", "text_q_seg": ["What", "was", "the", "first", "word", "uttered", "by", "human", "?"], "text_t_seg": ["What", "is", "the", "first", "word", "ever", "spoken", "?"], "sample_type": "disturb"} +{"id": 1691, "sentence1": "Should I attend JEE Main exam offline or online ?", "sentence2": "Which mode is best for JEE MAIN 2017 online exam or offline ?", "text_q_seg": ["Should", "I", "attend", "JEE", "Main", "exam", "offline", "or", "online", "?"], "text_t_seg": ["Which", "mode", "is", "best", "for", "JEE", "MAIN", "2017", "online", "exam", "or", "offline", "?"], "sample_type": "disturb"} +{"id": 1692, "sentence1": "Is literally infinite number of unique human DNAs possible ?", "sentence2": "What is the maximum number of genetically unique human individuals ?", "text_q_seg": ["Is", "literally", "infinite", "number", "of", 
"unique", "human", "DNAs", "possible", "?"], "text_t_seg": ["What", "is", "the", "maximum", "number", "of", "genetically", "unique", "human", "individuals", "?"], "sample_type": "disturb"} +{"id": 1693, "sentence1": "What is motive of Mulayam Singh Yadav behind expelling Akhilesh Yadav from Samajwadi party ?", "sentence2": "What 's the reason for Mulayam Singh Yadav expelling Akhilesh Yadav from the Samajwadi Party for 6 years ?", "text_q_seg": ["What", "is", "motive", "of", "Mulayam", "Singh", "Yadav", "behind", "expelling", "Akhilesh", "Yadav", "from", "Samajwadi", "party", "?"], "text_t_seg": ["What", "'s", "the", "reason", "for", "Mulayam", "Singh", "Yadav", "expelling", "Akhilesh", "Yadav", "from", "the", "Samajwadi", "Party", "for", "6", "years", "?"], "sample_type": "disturb"} +{"id": 1694, "sentence1": "Why do we need to talk with eloquence ?", "sentence2": "Why do we need to philosophize with others ?", "text_q_seg": ["Why", "do", "we", "need", "to", "talk", "with", "eloquence", "?"], "text_t_seg": ["Why", "do", "we", "need", "to", "philosophize", "with", "others", "?"], "sample_type": "disturb"} +{"id": 1695, "sentence1": "How to recover e - mails that were deleted from a Gmail account ?", "sentence2": "Is there any way to retrieve my deleted emails from my Gmail account ?", "text_q_seg": ["How", "to", "recover", "e", "-", "mails", "that", "were", "deleted", "from", "a", "Gmail", "account", "?"], "text_t_seg": ["Is", "there", "any", "way", "to", "retrieve", "my", "deleted", "emails", "from", "my", "Gmail", "account", "?"], "sample_type": "disturb"} +{"id": 1696, "sentence1": "How to find my own gmail accounts list ?", "sentence2": "How can you find all of your Gmail accounts ?", "text_q_seg": ["How", "to", "find", "my", "own", "gmail", "accounts", "list", "?"], "text_t_seg": ["How", "can", "you", "find", "all", "of", "your", "Gmail", "accounts", "?"], "sample_type": "disturb"} +{"id": 1697, "sentence1": "Where can I get sparkling and well maintained cleaning service in Sydney ?", "sentence2": "Where are cleaning services provided in Sydney ?", "text_q_seg": ["Where", "can", "I", "get", "sparkling", "and", "well", "maintained", "cleaning", "service", "in", "Sydney", "?"], "text_t_seg": ["Where", "are", "cleaning", "services", "provided", "in", "Sydney", "?"], "sample_type": "disturb"} +{"id": 1698, "sentence1": "Can Fast and Furious 7 take $ 1 billion at the box office worldwide ?", "sentence2": "Will Furious 7 be the first movie in the franchise to gross a billion dollars ?", "text_q_seg": ["Can", "Fast", "and", "Furious", "7", "take", "$", "1", "billion", "at", "the", "box", "office", "worldwide", "?"], "text_t_seg": ["Will", "Furious", "7", "be", "the", "first", "movie", "in", "the", "franchise", "to", "gross", "a", "billion", "dollars", "?"], "sample_type": "disturb"} +{"id": 1699, "sentence1": "Is there a book suitable to learn language c++ ?", "sentence2": "What is a good book for learning the basics of C++ programming ?", "text_q_seg": ["Is", "there", "a", "book", "suitable", "to", "learn", "language", "c++", "?"], "text_t_seg": ["What", "is", "a", "good", "book", "for", "learning", "the", "basics", "of", "C++", "programming", "?"], "sample_type": "disturb"} +{"id": 1700, "sentence1": "What will be Barack Obama 's legacy when he leaves office ?", "sentence2": "Based on what we know now , what will Barack Obama 's historical legacy be ?", "text_q_seg": ["What", "will", "be", "Barack", "Obama", "'s", "legacy", "when", "he", "leaves", "office", "?"], "text_t_seg": ["Based", 
"on", "what", "we", "know", "now", ",", "what", "will", "Barack", "Obama", "'s", "historical", "legacy", "be", "?"], "sample_type": "disturb"} +{"id": 1701, "sentence1": "Why do n't people like Hilary Clinton ?", "sentence2": "What are the reasons that people dislike Hillary Clinton ?", "text_q_seg": ["Why", "do", "n't", "people", "like", "Hilary", "Clinton", "?"], "text_t_seg": ["What", "are", "the", "reasons", "that", "people", "dislike", "Hillary", "Clinton", "?"], "sample_type": "disturb"} +{"id": 1702, "sentence1": "How to see who viewed my videos on Instagram ?", "sentence2": "How can I see who viewed my video on Instagram but did n't like my video ?", "text_q_seg": ["How", "to", "see", "who", "viewed", "my", "videos", "on", "Instagram", "?"], "text_t_seg": ["How", "can", "I", "see", "who", "viewed", "my", "video", "on", "Instagram", "but", "did", "n't", "like", "my", "video", "?"], "sample_type": "disturb"} +{"id": 1703, "sentence1": "why is the sky so blue ?", "sentence2": "Why is the sky is blue ?", "text_q_seg": ["why", "is", "the", "sky", "so", "blue", "?"], "text_t_seg": ["Why", "is", "the", "sky", "is", "blue", "?"], "sample_type": "disturb"} +{"id": 1704, "sentence1": "How can I learn English well in a short time ?", "sentence2": "How can I learn English efficiently ?", "text_q_seg": ["How", "can", "I", "learn", "English", "well", "in", "a", "short", "time", "?"], "text_t_seg": ["How", "can", "I", "learn", "English", "efficiently", "?"], "sample_type": "disturb"} +{"id": 1705, "sentence1": "How can I stop eating junk and processed food addiction and stay healthy ?", "sentence2": "How to quit junk food ?", "text_q_seg": ["How", "can", "I", "stop", "eating", "junk", "and", "processed", "food", "addiction", "and", "stay", "healthy", "?"], "text_t_seg": ["How", "to", "quit", "junk", "food", "?"], "sample_type": "disturb"} +{"id": 1706, "sentence1": "What are the movies one should see ?", "sentence2": "What are the greatest movies I must see ?", "text_q_seg": ["What", "are", "the", "movies", "one", "should", "see", "?"], "text_t_seg": ["What", "are", "the", "greatest", "movies", "I", "must", "see", "?"], "sample_type": "disturb"} +{"id": 1707, "sentence1": "What is an accurate way to calculate your IQ ?", "sentence2": "How to test my IQ accurately ?", "text_q_seg": ["What", "is", "an", "accurate", "way", "to", "calculate", "your", "IQ", "?"], "text_t_seg": ["How", "to", "test", "my", "IQ", "accurately", "?"], "sample_type": "disturb"} +{"id": 1708, "sentence1": "Is our PM Modi doing the correct thing with 500 and 1000 Rs notes ?", "sentence2": "What is your view on the ban on Rs . 500 and Rs . 
1000 currency notes ?", "text_q_seg": ["Is", "our", "PM", "Modi", "doing", "the", "correct", "thing", "with", "500", "and", "1000", "Rs", "notes", "?"], "text_t_seg": ["What", "is", "your", "view", "on", "the", "ban", "on", "Rs", ".", "500", "and", "Rs", ".", "1000", "currency", "notes", "?"], "sample_type": "disturb"} +{"id": 1709, "sentence1": "Why is the firm 's marginal cost curve equal supply curve ?", "sentence2": "How can supply curve reflect marginal cost ?", "text_q_seg": ["Why", "is", "the", "firm", "'s", "marginal", "cost", "curve", "equal", "supply", "curve", "?"], "text_t_seg": ["How", "can", "supply", "curve", "reflect", "marginal", "cost", "?"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/download.sh b/examples/model_interpretation/download.sh new file mode 100644 index 0000000000000000000000000000000000000000..7d98bfaceecc0fe0c685353f2d9cb57607c4d860 --- /dev/null +++ b/examples/model_interpretation/download.sh @@ -0,0 +1,10 @@ +wget https://paddlenlp.bj.bcebos.com/data/model_interpretation.tar +wait +tar -xvf model_interpretation.tar +wait +mv ./model_interpretation/vocab.char ./task/similarity/simnet/ +mv ./model_interpretation/vocab_QQP ./task/similarity/simnet/ +mv ./model_interpretation/simnet_vocab.txt ./task/similarity/simnet/ + +mv ./model_interpretation/vocab.sst2_train ./task/senti/rnn/ +mv ./model_interpretation/vocab.txt ./task/senti/rnn \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/accuracy/cal_acc.py b/examples/model_interpretation/evaluation/accuracy/cal_acc.py new file mode 100644 index 0000000000000000000000000000000000000000..93c32b46568d9b214c9aae7af95251faa453205d --- /dev/null +++ b/examples/model_interpretation/evaluation/accuracy/cal_acc.py @@ -0,0 +1,92 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + This script includes code to calculating accuracy for results form textual similarity task +""" +import argparse +import json + + +def get_args(): + """ + get args + """ + parser = argparse.ArgumentParser("Acc eval") + parser.add_argument("--golden_path", required=True) + parser.add_argument("--pred_path", required=True) + parser.add_argument("--language", required=True, choices=["ch", "en"]) + + args = parser.parse_args() + return args + + +def load_from_file(args): + """ + load golden and pred data form file + :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list}, + golden_label: {sent_id, label}, pred_label: {sent_id, label} + """ + golden_f = open(args.golden_path, "r") + pred_f = open(args.pred_path, "r") + + golden_labels, pred_labels = {}, {} + + for golden_line in golden_f.readlines(): + golden_dict = json.loads(golden_line) + id = golden_dict["sent_id"] + golden_labels[id] = int(golden_dict["sent_label"]) + + for pred_line in pred_f.readlines(): + pred_dict = json.loads(pred_line) + id = pred_dict["id"] + pred_labels[id] = int(pred_dict["pred_label"]) + + result = {} + result["golden_labels"] = golden_labels + result["pred_labels"] = pred_labels + + return result + + +def cal_acc(golden_label, pred_label): + """ + The function actually calculate the accuracy. + """ + acc = 0.0 + for ids in pred_label: + if ids not in golden_label: + continue + if pred_label[ids] == golden_label[ids]: + acc += 1 + if len(golden_label): + acc /= len(golden_label) + return acc + + +def main(args): + """ + main function + """ + result = load_from_file(args) + golden_label = result["golden_labels"] + pred_label = result["pred_labels"] + + acc = cal_acc(golden_label, pred_label) + return acc, len(pred_label) + + +if __name__ == "__main__": + args = get_args() + acc, num = main(args) + print("total\tnum: %d\tacc: %.1f" % (num, acc * 100)) diff --git a/examples/model_interpretation/evaluation/accuracy/mrc_f1_evaluate.py b/examples/model_interpretation/evaluation/accuracy/mrc_f1_evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..21ae6808c94af5909da6cc7726e38aa117ac6a92 --- /dev/null +++ b/examples/model_interpretation/evaluation/accuracy/mrc_f1_evaluate.py @@ -0,0 +1,265 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + This script is used to evaluate the performance of the mrc model (F1) +""" +from __future__ import print_function + +import argparse +import json +from collections import OrderedDict + +from paddlenlp.metrics.squad import squad_evaluate + + +def _tokenize_chinese_chars(text): + """ + :param text: input text, unicode string + :return: + tokenized text, list + """ + + def _is_chinese_char(cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ( + (cp >= 0x4E00 and cp <= 0x9FFF) + or (cp >= 0x3400 and cp <= 0x4DBF) # + or (cp >= 0x20000 and cp <= 0x2A6DF) # + or (cp >= 0x2A700 and cp <= 0x2B73F) # + or (cp >= 0x2B740 and cp <= 0x2B81F) # + or (cp >= 0x2B820 and cp <= 0x2CEAF) # + or (cp >= 0xF900 and cp <= 0xFAFF) + or (cp >= 0x2F800 and cp <= 0x2FA1F) # + ): # + return True + + return False + + output = [] + buff = "" + for char in text: + cp = ord(char) + if _is_chinese_char(cp) or char == "=": + if buff != "": + output.append(buff) + buff = "" + output.append(char) + else: + buff += char + + if buff != "": + output.append(buff) + + return output + + +def _normalize(in_str): + """ + normalize the input unicode string + """ + in_str = in_str.lower() + sp_char = [ + ":", + "_", + "`", + ",", + "。", + ":", + "?", + "!", + "(", + ")", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + ",", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + "|", + ] + out_segs = [] + for char in in_str: + if char in sp_char: + continue + else: + out_segs.append(char) + return "".join(out_segs) + + +def find_lcs(s1, s2): + """find the longest common subsequence between s1 ans s2""" + m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)] + max_len = 0 + p = 0 + for i in range(len(s1)): + for j in range(len(s2)): + if s1[i] == s2[j]: + m[i + 1][j + 1] = m[i][j] + 1 + if m[i + 1][j + 1] > max_len: + max_len = m[i + 1][j + 1] + p = i + 1 + return s1[p - max_len : p], max_len + + +def evaluate_ch(ref_ans, pred_ans): + """ + ref_ans: reference answers, dict + pred_ans: predicted answer, dict + return: + f1_score: averaged F1 score + em_score: averaged EM score + total_count: number of samples in the reference dataset + skip_count: number of samples skipped in the calculation due to unknown errors + """ + f1 = 0 + em = 0 + total_count = 0 + skip_count = 0 + for query_id in ref_ans: + sample = ref_ans[query_id] + total_count += 1 + answers = sample["sent_label"] + try: + prediction = pred_ans[query_id]["pred_label"] + except: + skip_count += 1 + continue + if prediction == "": + _f1 = 1.0 + _em = 1.0 + else: + _f1 = calc_f1_score([answers], prediction) + _em = calc_em_score([answers], prediction) + f1 += _f1 + em += _em + + f1_score = 100.0 * f1 / total_count + em_score = 100.0 * em / total_count + return f1_score, em_score, total_count, skip_count + + +def calc_f1_score(answers, prediction): + f1_scores = [] + for ans in answers: + ans_segs = _tokenize_chinese_chars(_normalize(ans)) + prediction_segs = _tokenize_chinese_chars(_normalize(prediction)) + if args.debug: + 
print(json.dumps(ans_segs, ensure_ascii=False)) + print(json.dumps(prediction_segs, ensure_ascii=False)) + lcs, lcs_len = find_lcs(ans_segs, prediction_segs) + if lcs_len == 0: + f1_scores.append(0) + continue + prec = 1.0 * lcs_len / len(prediction_segs) + rec = 1.0 * lcs_len / len(ans_segs) + f1 = (2 * prec * rec) / (prec + rec) + f1_scores.append(f1) + return max(f1_scores) + + +def calc_em_score(answers, prediction): + em = 0 + for ans in answers: + ans_ = _normalize(ans) + prediction_ = _normalize(prediction) + if ans_ == prediction_: + em = 1 + break + return em + + +def read_dataset(file_path): + f = open(file_path, "r") + golden = {} + for l in f.readlines(): + ins = json.loads(l) + golden[ins["sent_id"]] = ins + f.close() + return golden + + +def read_model_prediction(file_path): + f = open(file_path, "r") + predict = {} + for l in f.readlines(): + ins = json.loads(l) + predict[ins["id"]] = ins + f.close() + return predict + + +def read_temp(file_path): + with open(file_path) as f1: + result = json.loads(f1.read()) + return result + + +def get_args(): + parser = argparse.ArgumentParser("mrc baseline performance eval") + parser.add_argument("--golden_path", help="dataset file") + parser.add_argument("--pred_file", help="model prediction file") + parser.add_argument("--language", help="the language of the model") + parser.add_argument("--debug", action="store_true", help="debug mode") + args = parser.parse_args() + return args + + +if __name__ == "__main__": + args = get_args() + + if args.language == "ch": + ref_ans = read_dataset(args.golden_path) + pred_ans = read_model_prediction(args.pred_file) + F1, EM, TOTAL, SKIP = evaluate_ch(ref_ans, pred_ans) + + output_result = OrderedDict() + output_result["F1"] = "%.3f" % F1 + output_result["EM"] = "%.3f" % EM + output_result["TOTAL"] = TOTAL + output_result["SKIP"] = SKIP + print(json.dumps(output_result)) + else: + ref_ans = read_dataset(args.golden_path) + pred_ans = read_temp(args.pred_file) + res = [] + for i in ref_ans: + ins = ref_ans[i] + ins["id"] = str(ins["sent_id"]) + ins["answers"] = [ins["sent_label"]] + if ins["answers"] == [""]: + ins["is_impossible"] = True + else: + ins["is_impossible"] = False + res.append(ins) + squad_evaluate(examples=res, preds=pred_ans) diff --git a/examples/model_interpretation/evaluation/accuracy/run_acc.sh b/examples/model_interpretation/evaluation/accuracy/run_acc.sh new file mode 100644 index 0000000000000000000000000000000000000000..cfa26fa204f0ce9ad0f0e8c3de9c3bcb889846b1 --- /dev/null +++ b/examples/model_interpretation/evaluation/accuracy/run_acc.sh @@ -0,0 +1,31 @@ +### + # This script evaluates plausibility of the results generated by our models +### + +TASK=senti +if [[ $TASK == "mrc" ]]; then + MODELS=("roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient") +else + MODELS=("lstm" "roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient" "lime") +fi + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "ch" "en"; + do + GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv + PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + echo $BASE_MODEL$'_'$INTER_MODE$'_'$LANGUAGE + + python3 ./cal_acc.py \ + --language $LANGUAGE \ + --golden_path $GOLDEN_PATH \ + --pred_path $PRED_PATH + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/accuracy/run_mrc_f1.sh b/examples/model_interpretation/evaluation/accuracy/run_mrc_f1.sh 
new file mode 100644 index 0000000000000000000000000000000000000000..204bc6b4c20705e550583fa4d32656fcb8f2ff1c --- /dev/null +++ b/examples/model_interpretation/evaluation/accuracy/run_mrc_f1.sh @@ -0,0 +1,29 @@ +### + # This script is used to evaluate the performance of the mrc model (F1) +### +MODELS=("roberta_base" "roberta_large") +MODES=("attention" "integrated_gradient") + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "en" "ch"; + do + echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + GOLDEN_PATH=../golden/mrc_${LANGUAGE}.tsv + if [[ $LANGUAGE == "ch" ]]; then + PRED_FILE=../../rationale_extraction/evaluation_data/mrc/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + else + PRED_FILE=../../task/mrc/output/mrc_en.${BASE_MODEL}/predict_ans + fi + + python3 mrc_f1_evaluate.py \ + --golden_path $GOLDEN_PATH \ + --pred_file $PRED_FILE \ + --language $LANGUAGE + done + done +done + diff --git a/examples/model_interpretation/evaluation/consistency/cal_map.py b/examples/model_interpretation/evaluation/consistency/cal_map.py new file mode 100644 index 0000000000000000000000000000000000000000..a6ed80d8058ade4e756f48e1655133f3981ce41c --- /dev/null +++ b/examples/model_interpretation/evaluation/consistency/cal_map.py @@ -0,0 +1,141 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +This script includes code to calculating MAP score for results form +sentiment analysis, textual similarity, and mrc task +""" +import argparse +import json +import math +import os + + +def get_args(): + parser = argparse.ArgumentParser("map eval") + parser.add_argument("--pred_path", required=True) + parser.add_argument("--golden_path", required=True) + parser.add_argument("--language", type=str, required=True, help="language that the model is built for") + args = parser.parse_args() + return args + + +def evids_load(args, path): + golden_f = open(args.golden_path, "r") + golden = {} + ins_num = 0 + for golden_line in golden_f.readlines(): + line = json.loads(golden_line) + if line["sample_type"] == "disturb": + ins_num += 1 + golden[line["sent_id"]] = line + + evids = {} + with open(path, "r") as f: + for line in f.readlines(): + dic = json.loads(line) + dic["sample_type"] = golden[dic["id"]]["sample_type"] + if "rel_ids" in golden[dic["id"]]: + dic["rel_ids"] = golden[dic["id"]]["rel_ids"] + evids[dic["id"]] = dic + return evids, ins_num + + +def _calc_MAP_by_bin(top_p, length_adv, adv_attriRank_list, ori_attriRank_list): + """ + This is our old way to calculate MAP, + which follows equation two in consistency section of README + """ + hits = 0 + sum_precs = 0.0 + length_t = math.ceil(length_adv * top_p) + adv_t = adv_attriRank_list[:length_t] + for char_idx, char in enumerate(adv_t): + if char in ori_attriRank_list[: char_idx + 1]: + hits += 1 + sum_precs += hits / (char_idx + 1) + if length_t > 0: + sum_precs /= length_t + return sum_precs + + +def _calc_MAP_by_bin_paper(top_p, length_adv, adv_attriRank_list, ori_attriRank_list): + """ + This function calculates MAP using the equation in our paper, + which follows equation one in consistency section of README + """ + total_precs = 0.0 + for i in range(length_adv): + hits = 0.0 + i += 1 + adv_t = adv_attriRank_list[:i] + for char_idx, char in enumerate(adv_t): + if char in ori_attriRank_list[:i]: + hits += 1 + hits = hits / i + total_precs += hits + if length_adv == 0: + return 0 + return total_precs / length_adv + + +def _calc_map(evids, key, ins_num): + t_map = 0.0 + + adv_num = 0 + ori_num = 0 + for ori_idx in evids: + if evids[ori_idx]["sample_type"] == "ori": + ori = evids[ori_idx] + ori_num += 1 + # One original instance can be related to several disturbed instance + for adv_idx in evids[ori_idx]["rel_ids"]: + if adv_idx in evids: + adv_num += 1 + adv = evids[adv_idx] + ori_attriRank_list = list(ori["rationale_token"][key]) + adv_attriRank_list = list(adv["rationale_token"][key]) + length_adv = len(adv_attriRank_list) + + sum_precs = _calc_MAP_by_bin_paper(1, length_adv, adv_attriRank_list, ori_attriRank_list) + t_map += sum_precs + + return t_map / ins_num, ori_num + adv_num + + +def cal_MAP(args, pred_path, la): + evids, ins_num = evids_load(args, pred_path) + if not evids: + print(pred_path + " file empty!") + return 0 + first_key = list(evids.keys())[0] + t_map = 0 + num = 0 + for i in range(len(evids[first_key]["rationale"])): + t_map_tmp, num_tmp = _calc_map(evids, i, ins_num) + t_map += t_map_tmp + num += num_tmp + t_map /= len(evids[first_key]["rationale"]) + num /= len(evids[first_key]["rationale"]) + print("total\t%d\t%.1f" % (num, 100 * t_map)) + return 0 + + +if __name__ == "__main__": + args = get_args() + la = args.language + pred_path = args.pred_path + if os.path.exists(pred_path): + cal_MAP(args, pred_path, la) + else: + print("Prediction file does not exists!") diff --git 
a/examples/model_interpretation/evaluation/consistency/run_map.sh b/examples/model_interpretation/evaluation/consistency/run_map.sh new file mode 100644 index 0000000000000000000000000000000000000000..8ed9f114c5a21483a685241debd84b7d6983e371 --- /dev/null +++ b/examples/model_interpretation/evaluation/consistency/run_map.sh @@ -0,0 +1,31 @@ +### + # This script evaluates consistency of the results generated by our models +### + +TASK=senti +if [[ $TASK == "mrc" ]]; then + MODELS=("roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient") +else + MODELS=("lstm" "roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient" "lime") +fi + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "ch" "en"; + do + echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv + PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + python3 ./cal_map.py \ + --golden_path $GOLDEN_PATH \ + --pred_path $PRED_PATH \ + --language $LANGUAGE + + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/faithfulness/newp_analysis.py b/examples/model_interpretation/evaluation/faithfulness/newp_analysis.py new file mode 100644 index 0000000000000000000000000000000000000000..f4ad0e56f236718a6ad4f9b235b0f9a0fd3dba69 --- /dev/null +++ b/examples/model_interpretation/evaluation/faithfulness/newp_analysis.py @@ -0,0 +1,78 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
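The faithfulness score computed by `newp_analysis.py` below counts a prediction as faithful when the model keeps its original label on the rationale-only input but loses it on the complement (non-rationale) input. A minimal sketch of the criterion with hypothetical records:

```python
def newp(records, num_golden):
    """Fraction of instances whose rationale preserves the prediction while its complement does not."""
    faithful = sum(
        1
        for r in records
        if r["rationale_pred"] == r["pred_label"] and r["no_rationale_pred"] != r["pred_label"]
    )
    return faithful / num_golden if num_golden else 0.0


records = [
    {"pred_label": 1, "rationale_pred": 1, "no_rationale_pred": 0},  # counts as faithful
    {"pred_label": 0, "rationale_pred": 0, "no_rationale_pred": 0},  # complement still predicts 0, not faithful
]
print("NewP = %.2f" % newp(records, num_golden=2))  # NewP = 0.50
```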
+""" + This script includes code to calculating NewP score for results form + sentiment analysis, textual similarity, and mrc task +""" +import argparse +import json + +import numpy as np + + +def get_args(): + """ + get args + """ + parser = argparse.ArgumentParser("NewP eval") + + parser.add_argument("--pred_path", required=True) + parser.add_argument("--golden_path", required=True) + + args = parser.parse_args() + return args + + +def data_load(args): + """ + load result data from file + """ + pred_path = args.pred_path + golden_path = args.golden_path + + with open(pred_path, "r") as f_text: + pred_list = [] + for line in f_text.readlines(): + line_dict = json.loads(line) + pred_list.append(line_dict) + + with open(golden_path, "r") as f_text: + gold_list = {} + for line in f_text.readlines(): + line_dict = json.loads(line) + gold_list[line_dict["sent_id"]] = line_dict + return pred_list, gold_list + + +def analysis(args, instance, gold_list): + """ + Analysis result according to result data + """ + New_P_list = [] + for ins in instance: + golden_label = ins["pred_label"] + text_correct = 1 if ins["rationale_pred"] == golden_label else 0 + text_exclusive_correct = 1 if ins["no_rationale_pred"] == golden_label else 0 + New_P_correct = 1 if (text_correct == 1 and text_exclusive_correct == 0) else 0 + New_P_list.append(New_P_correct) + + total_New_P = np.sum(New_P_list) / len(gold_list) if len(gold_list) else 0 + + print("total\t%d\t%.1f" % (len(New_P_list), 100 * total_New_P)) + + +if __name__ == "__main__": + args = get_args() + pred_list, gold_list = data_load(args) + analysis(args, pred_list, gold_list) diff --git a/examples/model_interpretation/evaluation/faithfulness/run_newp.sh b/examples/model_interpretation/evaluation/faithfulness/run_newp.sh new file mode 100644 index 0000000000000000000000000000000000000000..5110ea61ff715b366102c89f5d01c365b96db5db --- /dev/null +++ b/examples/model_interpretation/evaluation/faithfulness/run_newp.sh @@ -0,0 +1,30 @@ +### + # This script evaluates faithfulness of the results generated by our models +### + +TASK=senti +if [[ $TASK == "mrc" ]]; then + MODELS=("roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient") +else + MODELS=("lstm" "roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient" "lime") +fi + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "ch" "en"; + do + GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv + PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + python3 ./newp_analysis.py \ + --pred_path $PRED_PATH \ + --golden_path $GOLDEN_PATH + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/plausibility/eval_mrc.py b/examples/model_interpretation/evaluation/plausibility/eval_mrc.py new file mode 100644 index 0000000000000000000000000000000000000000..b3bc04a5ba5bf5a3ba9ed45da90e0ba076f241b8 --- /dev/null +++ b/examples/model_interpretation/evaluation/plausibility/eval_mrc.py @@ -0,0 +1,112 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+    This script includes code to calculate the F1 score for results from the mrc task
+"""
+import argparse
+import json
+
+
+def get_args():
+    parser = argparse.ArgumentParser("F1 eval")
+
+    parser.add_argument("--golden_path", required=True)
+    parser.add_argument("--pred_path", required=True)
+    parser.add_argument("--language", required=True, choices=["ch", "en"])
+
+    args = parser.parse_args()
+    return args
+
+
+def load_from_file(args):
+    """
+    Load golden and pred data from file
+    :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list},
+             golden_label: {sent_id, label}, pred_label: {sent_id, label}
+    """
+    golden_f = open(args.golden_path, "r")
+    pred_f = open(args.pred_path, "r")
+
+    golden_raw_rationale, pred_rationale = {}, {}
+
+    for golden_line in golden_f.readlines():
+        golden_dict = json.loads(golden_line)
+        sent_id = golden_dict["sent_id"]
+        golden_raw_rationale[sent_id] = [int(x) for x in golden_dict["rationales"]]
+
+    for pred_line in pred_f.readlines():
+        pred_dict = json.loads(pred_line)
+        senti_id = pred_dict["id"]
+        pred_rationale[senti_id] = pred_dict["rationale"][0]
+
+    return golden_raw_rationale, pred_rationale
+
+
+def _f1(_p, _r):
+    if _p == 0 or _r == 0:
+        return 0
+    return 2 * _p * _r / (_p + _r)
+
+
+def calc_f1(golden_evid, pred_evid):
+    tp = set(pred_evid) & set(golden_evid)
+    prec = len(tp) / len(pred_evid) if len(pred_evid) else 0
+    rec = len(tp) / len(golden_evid) if len(golden_evid) else 0
+    f1 = _f1(prec, rec)
+    return f1
+
+
+def calc_model_f1(golden_dict, pred_dict):
+    """
+    :param golden_dict: dict
+    :param pred_dict: dict
+    :return: macro-f1, per-instance scores
+    """
+
+    scores = {}
+
+    for s_id in pred_dict.keys():
+        if s_id not in golden_dict:
+            continue
+        golden_evid = golden_dict[s_id]
+        pred_evid = pred_dict[s_id]
+
+        tp = set(golden_evid) & set(pred_evid)
+        prec = len(tp) / len(pred_evid) if len(pred_evid) else 0
+        rec = len(tp) / len(golden_evid) if len(golden_evid) else 0
+        f1 = _f1(prec, rec)
+        scores[s_id] = {
+            "tp_count": len(tp),
+            "pred_count": len(pred_evid),
+            "golden_count": len(golden_evid),
+            "prec": prec,
+            "rec": rec,
+            "f1": f1,
+        }
+
+    macro_f1 = sum(score["f1"] for score in scores.values()) / len(golden_dict) if len(golden_dict) else 0
+
+    return macro_f1, scores
+
+
+def main(args):
+    golden_raw, pred_raw = load_from_file(args)
+    macro_f1, scores = calc_model_f1(golden_raw, pred_raw)
+    return macro_f1, len(golden_raw), scores
+
+
+if __name__ == "__main__":
+    args = get_args()
+    macro_f1, num, scores = main(args)
+    print("total\tnum: %d\tmacro_f1: %.1f" % (num, macro_f1 * 100))
diff --git a/examples/model_interpretation/evaluation/plausibility/eval_senti.py b/examples/model_interpretation/evaluation/plausibility/eval_senti.py
new file mode 100644
index 0000000000000000000000000000000000000000..449755cf972c701b89c0ac968290914ffa4352e0
--- /dev/null
+++ b/examples/model_interpretation/evaluation/plausibility/eval_senti.py
@@ -0,0 +1,178 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
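Both `eval_mrc.py` above and `eval_senti.py`, whose diff begins just above, rate plausibility with the same token-level F1 over rationale token indices (`calc_f1`): precision over the predicted indices, recall over the golden ones. A toy example with hypothetical indices:

```python
def token_f1(golden, pred):
    tp = set(golden) & set(pred)
    prec = len(tp) / len(pred) if pred else 0
    rec = len(tp) / len(golden) if golden else 0
    return 2 * prec * rec / (prec + rec) if prec and rec else 0


# three overlapping tokens out of 4 golden / 5 predicted indices
print(round(token_f1(golden=[3, 4, 5, 6], pred=[4, 5, 6, 7, 8]), 3))  # 0.667
```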
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + This script includes code to calculating F1 score for results form sentiment analysis task +""" +import argparse +import json + + +def get_args(): + parser = argparse.ArgumentParser("F1 eval") + + parser.add_argument("--language", required=True, choices=["en", "ch"]) + parser.add_argument("--golden_path", required=True) + parser.add_argument("--pred_path", required=True) + + args = parser.parse_args() + return args + + +def load_from_file(args): + """ + Load golden and pred data form file + :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list}, + golden_label: {sent_id, label}, pred_label: {sent_id, label} + """ + golden_f = open(args.golden_path, "r") + pred_f = open(args.pred_path, "r") + + golden_raw_rationale, golden_label, pred_rationale, pred_label = {}, {}, {}, {} + + for golden_line in golden_f.readlines(): + golden_dict = json.loads(golden_line) + sent_id = golden_dict["sent_id"] + golden_raw_rationale[sent_id] = [] + for x in golden_dict["rationales"]: + temp = [int(y) for y in x] + golden_raw_rationale[sent_id].append(temp) + golden_label[sent_id] = int(golden_dict["sent_label"]) + + for pred_line in pred_f.readlines(): + pred_dict = json.loads(pred_line) + senti_id = pred_dict["id"] + pred_rationale[senti_id] = pred_dict["rationale"][0] + pred_label[senti_id] = int(pred_dict["pred_label"]) + + golden_f.close() + pred_f.close() + return golden_raw_rationale, pred_rationale, golden_label, pred_label + + +def _f1(_p, _r): + if _p == 0 or _r == 0: + return 0 + return 2 * _p * _r / (_p + _r) + + +def calc_f1(golden_evid, pred_evid): + tp = set(pred_evid) & set(golden_evid) + prec = len(tp) / len(pred_evid) if len(pred_evid) else 0 + rec = len(tp) / len(golden_evid) if len(golden_evid) else 0 + f1 = _f1(prec, rec) + return f1 + + +def combine(cur_max_f1, union_set, golden_evid, pred_evid): + """ + Args: + cur_max_f1 float: 当前最大f1 + union_set set(): 已合并集合 + golden_evid list(): 标注证据 + pred_evid list(): 预测证据 + """ + if len(union_set & set(golden_evid)) < len(golden_evid) and calc_f1(golden_evid, pred_evid) > 0: + new_union_set = union_set | set(golden_evid) + new_f1 = calc_f1(new_union_set, pred_evid) + if new_f1 > cur_max_f1: # 若union_set合并golden_evid后f1未超过cur_max_f1,则不更新union_set + cur_max_f1 = new_f1 + union_set = new_union_set + + return cur_max_f1, union_set + + +def pick_max_golden_evid(golden_raw, pred_raw): + """ + 从golden_evids中找出与pred_evid f1最大的golden_evid + """ + golden_dict = {} + err_rationale = [] + + for s_id in pred_raw.keys(): + if s_id not in golden_raw: + continue + golden_evids = golden_raw[s_id] + pred_evid = pred_raw[s_id] + max_f1 = 0 + + # 找f1最大的单条golden_evid + for golden_evid in golden_evids: + f1 = calc_f1(golden_evid, pred_evid) + if f1 > max_f1: + max_f1 = f1 + golden_dict[s_id] = golden_evid + + # 找f1最大的组合golden_evid + for start_id in range(len(golden_evids) - 1): + union_set = set() + cur_max_f1 = 0 + for id in range(start_id, len(golden_evids)): + golden_evid = 
golden_evids[id] + cur_max_f1, union_set = combine(cur_max_f1, union_set, golden_evid, pred_evid) + + if cur_max_f1 > max_f1: + max_f1 = cur_max_f1 + golden_dict[s_id] = list(union_set) + + if max_f1 == 0: + golden_dict[s_id] = [] + err_rationale.append(s_id) + + return golden_dict + + +def calc_model_f1(golden_dict, pred_dict, golden_len): + """ + :param golden_dict: dict + :param pred_dict: dict + :return: macro-f1, micro-f1 + """ + + scores = {} + + for s_id in pred_dict.keys(): + if s_id not in golden_dict: + continue + golden_evid = golden_dict[s_id] + pred_evid = pred_dict[s_id] + + tp = set(golden_evid) & set(pred_evid) + prec = len(tp) / len(pred_evid) if len(pred_evid) else 0 + rec = len(tp) / len(golden_evid) if len(golden_evid) else 0 + f1 = _f1(prec, rec) + scores[s_id] = { + "tp_count": len(tp), + "pred_count": len(pred_evid), + "golden_count": len(golden_evid), + "prec": prec, + "rec": rec, + "f1": f1, + } + + macro_f1 = (sum(score["f1"] for score in scores.values()) / golden_len) if golden_len else 0 + + return macro_f1, scores + + +def main(args): + golden_raw, pred_raw, golden_label, pred_label = load_from_file(args) + golden_dict = pick_max_golden_evid(golden_raw, pred_raw) + macro_f1, scores = calc_model_f1(golden_dict, pred_raw, len(golden_raw)) + return macro_f1, len(golden_raw) + + +if __name__ == "__main__": + args = get_args() + macro_f1, num = main(args) + print("num\t%.2f\tmacor_f1: %.1f" % (num, macro_f1 * 100)) diff --git a/examples/model_interpretation/evaluation/plausibility/eval_similarity.py b/examples/model_interpretation/evaluation/plausibility/eval_similarity.py new file mode 100644 index 0000000000000000000000000000000000000000..0307248514bd3eae5c453e1a1c77b42af60d79cc --- /dev/null +++ b/examples/model_interpretation/evaluation/plausibility/eval_similarity.py @@ -0,0 +1,133 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
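Sentiment instances can carry several annotated rationales, so `pick_max_golden_evid` in `eval_senti.py` above scores each prediction against the single annotation, or union of overlapping annotations, that yields the highest F1. A simplified sketch of the selection idea (indices are hypothetical, and only the per-annotation maximum is shown, not the full union search):

```python
def token_f1(golden, pred):
    tp = set(golden) & set(pred)
    prec = len(tp) / len(pred) if pred else 0
    rec = len(tp) / len(golden) if golden else 0
    return 2 * prec * rec / (prec + rec) if prec and rec else 0


golden_annotations = [[0, 1, 2], [5, 6], [0, 1, 2, 5, 6]]  # alternative golden rationales
pred = [1, 2, 5]
best = max(golden_annotations, key=lambda g: token_f1(g, pred))
print(best, round(token_f1(best, pred), 3))  # [0, 1, 2, 5, 6] 0.75
```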
+""" + This script includes code to calculating F1 score for results form textual similarity task +""" +import argparse +import json + + +def get_args(): + """ + get args + """ + parser = argparse.ArgumentParser("F1 eval") + parser.add_argument("--golden_path", required=True) + parser.add_argument("--pred_path", required=True) + parser.add_argument("--language", required=True, choices=["ch", "en"]) + + args = parser.parse_args() + return args + + +def load_from_file(args): + """ + Load golden and pred data form file + :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list}, + golden_label: {sent_id, label}, pred_label: {sent_id, label} + """ + golden_f = open(args.golden_path, "r") + pred_f = open(args.pred_path, "r") + + golden_q_rationales, golden_t_rationales = {}, {} + pred_q_rationales, pred_t_rationales = {}, {} + golden_labels, pred_labels = {}, {} + + for golden_line in golden_f.readlines(): + golden_dict = json.loads(golden_line) + id = golden_dict["sent_id"] + # golden_rationale id + golden_q_rationales[id] = [int(x) for x in golden_dict["rationale_q_idx"]] + golden_t_rationales[id] = [int(x) for x in golden_dict["rationale_t_idx"]] + golden_labels[id] = int(golden_dict["sent_label"]) + + for pred_line in pred_f.readlines(): + pred_dict = json.loads(pred_line) + id = pred_dict["id"] + pred_q_rationales[id] = pred_dict["rationale"][0] + pred_t_rationales[id] = pred_dict["rationale"][1] + pred_labels[id] = int(pred_dict["pred_label"]) + + result = {} + result["golden_q_rationales"] = golden_q_rationales + result["golden_t_rationales"] = golden_t_rationales + result["pred_q_rationales"] = pred_q_rationales + result["pred_t_rationales"] = pred_t_rationales + result["golden_labels"] = golden_labels + result["pred_labels"] = pred_labels + + return result + + +def _f1(_p, _r): + if _p == 0 or _r == 0: + return 0 + return 2 * _p * _r / (_p + _r) + + +def calc_model_f1(golden_a_rationales, golden_b_rationales, pred_a_rationales, pred_b_rationales): + """ + :param golden_dict: dict + :param pred_dict: dict + :return: macro-f1, micro-f1 + """ + + scores = {} + + for id in pred_a_rationales.keys(): + golden_a_ratioanl = golden_a_rationales[id] + pred_a_rationale = pred_a_rationales[id] + tp_a = set(golden_a_ratioanl) & set(pred_a_rationale) + prec_a = len(tp_a) / len(pred_a_rationale) if len(pred_a_rationale) else 0 + rec_a = len(tp_a) / len(golden_a_ratioanl) if len(golden_a_ratioanl) else 0 + f1_a = _f1(prec_a, rec_a) + + golden_b_rationale = golden_b_rationales[id] + pred_b_rationale = pred_b_rationales[id] + tp_b = set(golden_b_rationale) & set(pred_b_rationale) + prec_b = len(tp_b) / len(pred_b_rationale) if len(pred_b_rationale) else 0 + rec_b = len(tp_b) / len(golden_b_rationale) if len(golden_b_rationale) else 0 + f1_b = _f1(prec_b, rec_b) + + scores[id] = { + "tp_count": (len(tp_a) + len(tp_b)) / 2, + "pred_count": (len(pred_a_rationale) + len(pred_b_rationale)) / 2, + "golden_count": (len(golden_a_ratioanl) + len(golden_b_rationale)) / 2, + "prec": (prec_a + prec_b) / 2, + "rec": (rec_a + rec_b) / 2, + "f1": (f1_a + f1_b) / 2, + } + + macro_f1 = ( + sum(score["f1"] for score in scores.values()) / len(golden_a_rationales) if len(golden_a_rationales) else 0 + ) + + return macro_f1, scores + + +def main(args): + result = load_from_file(args) + golden_a_rationales = result["golden_q_rationales"] + golden_b_rationales = result["golden_t_rationales"] + pred_a_rationales = result["pred_q_rationales"] + pred_b_rationales = result["pred_t_rationales"] + + 
macro_f1, scores = calc_model_f1(golden_a_rationales, golden_b_rationales, pred_a_rationales, pred_b_rationales) + return macro_f1, len(scores) + + +if __name__ == "__main__": + args = get_args() + macro_f1, num = main(args) + print("total\tnum: %d\tmacor_f1: %.1f" % (num, macro_f1 * 100)) diff --git a/examples/model_interpretation/evaluation/plausibility/run_f1.sh b/examples/model_interpretation/evaluation/plausibility/run_f1.sh new file mode 100644 index 0000000000000000000000000000000000000000..8d5bd2e7a9f2e9b4169e1175da2475838ee38755 --- /dev/null +++ b/examples/model_interpretation/evaluation/plausibility/run_f1.sh @@ -0,0 +1,34 @@ +### + # This script evaluates plausibility of the results generated by our models +### + +TASK=senti +if [[ $TASK == "mrc" ]]; then + MODELS=("roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient") +else + MODELS=("lstm" "roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient" "lime") +fi + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "ch" "en"; + do + GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv + PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + SAVE_PATH=res/ + [ -d $SAVE_PATH ] || mkdir -p $SAVE_PATH + + echo $BASE_MODEL$'_'$INTER_MODE$'_'$LANGUAGE + + python3 ./eval_${TASK}.py \ + --language $LANGUAGE \ + --golden_path $GOLDEN_PATH \ + --pred_path $PRED_PATH + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/imgs/equation1.png b/examples/model_interpretation/imgs/equation1.png new file mode 100644 index 0000000000000000000000000000000000000000..e1db9780248dd0a955a1a67bef66b262af988c18 Binary files /dev/null and b/examples/model_interpretation/imgs/equation1.png differ diff --git a/examples/model_interpretation/imgs/equation2.png b/examples/model_interpretation/imgs/equation2.png new file mode 100644 index 0000000000000000000000000000000000000000..fbb26c60e35a85814d6a25277fdca6ee17325fcd Binary files /dev/null and b/examples/model_interpretation/imgs/equation2.png differ diff --git a/examples/model_interpretation/imgs/equation3.png b/examples/model_interpretation/imgs/equation3.png new file mode 100644 index 0000000000000000000000000000000000000000..bf4f28f7c48a9a64b403bfc5db1f64249bc602bc Binary files /dev/null and b/examples/model_interpretation/imgs/equation3.png differ diff --git a/examples/model_interpretation/imgs/equation4.png b/examples/model_interpretation/imgs/equation4.png new file mode 100644 index 0000000000000000000000000000000000000000..a4743a67deb4365704a4a350047cbb57b221811a Binary files /dev/null and b/examples/model_interpretation/imgs/equation4.png differ diff --git a/examples/model_interpretation/imgs/equation5.png b/examples/model_interpretation/imgs/equation5.png new file mode 100644 index 0000000000000000000000000000000000000000..75bbe3be4ad581485696e6a12d88bf537a33f0f6 Binary files /dev/null and b/examples/model_interpretation/imgs/equation5.png differ diff --git a/examples/model_interpretation/imgs/example1.png b/examples/model_interpretation/imgs/example1.png new file mode 100644 index 0000000000000000000000000000000000000000..f0b7dda4dfef00b563601d590df3a54d547343c1 Binary files /dev/null and b/examples/model_interpretation/imgs/example1.png differ diff --git a/examples/model_interpretation/imgs/structure.png b/examples/model_interpretation/imgs/structure.png new file mode 100644 index 
0000000000000000000000000000000000000000..b7573e09ba02e16fa4d7a80755aac059aada6394 Binary files /dev/null and b/examples/model_interpretation/imgs/structure.png differ diff --git a/examples/model_interpretation/punctuations b/examples/model_interpretation/punctuations new file mode 100644 index 0000000000000000000000000000000000000000..11d057b89103d3c9efff3c0c3bd6021309d25d68 --- /dev/null +++ b/examples/model_interpretation/punctuations @@ -0,0 +1,82 @@ +” +。 +, +∈ +] +√ + +! +( +≥ +【 +“ +「 +÷ +《 +】 +! +ˊ +」 +. +_ +@ +~ +– +〕 +∶ +) +’ +℃ +》 +〈 +→ +、 ++ +| +; +: +∠ +' +‘ +, +? +× +△ +- +• +· +— +° +> +′ +● +; +… +" +Ⅱ +/ +< ++ += +^ +Ⅰ +? +[ +﹑ +﹐ +* +〔 +~ +: +( +) +〉 +◎ += +- +\ +% +% +& +≠ +. \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/available_gpu.py b/examples/model_interpretation/rationale_extraction/available_gpu.py new file mode 100644 index 0000000000000000000000000000000000000000..e05ecd3c666a240a5a6362126cb9151e7d27fb13 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/available_gpu.py @@ -0,0 +1,46 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific l +"""print available_gpu id, using nvgpu +""" + +import logging +import traceback + +import nvgpu + +logging.basicConfig( + level=logging.DEBUG, + format="%(levelname)s: %(asctime)s %(filename)s" " [%(funcName)s:%(lineno)d][%(process)d] %(message)s", + datefmt="%m-%d %H:%M:%S", + filename=None, + filemode="a", +) + +if __name__ == "__main__": + from argparse import ArgumentParser + + try: + arg_parser = ArgumentParser(description="print available_gpu id, using nvgpu") + arg_parser.add_argument("-b", "--best", default=None, type=int, help="output best N") + args = arg_parser.parse_args() + + if args.best is not None: + gpus = sorted(nvgpu.gpu_info(), key=lambda x: (x["mem_used"], x["index"])) + ids = [x["index"] for x in gpus] + print(",".join(ids[: args.best])) + else: + print(",".join(nvgpu.available_gpus())) + + except Exception: + traceback.print_exc() + exit(-1) diff --git a/examples/model_interpretation/rationale_extraction/generate.sh b/examples/model_interpretation/rationale_extraction/generate.sh new file mode 100644 index 0000000000000000000000000000000000000000..d72b20bda984ae91cf2f13fc79427f51dd0adc20 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/generate.sh @@ -0,0 +1,57 @@ +TASK=similarity + +if [[ $TASK == "mrc" ]]; then + MODELS=("roberta_base" "roberta_large") + MODES=("attention" "integrated_gradient") +else + MODELS=("roberta_large" "roberta_base" "lstm") + MODES=("lime" "attention" "integrated_gradient") +fi + +for BASE_MODEL in ${MODELS[*]}; +do + for INTER_MODE in ${MODES[*]}; + do + for LANGUAGE in "ch" "en"; + do + if [[ $LANGUAGE == "ch" ]]; then + if [[ $TASK == "senti" ]]; then + RATIO_DIC="[0.311]" + elif [[ $TASK == "similarity" ]]; then + RATIO_DIC="[0.701,0.709]" + elif [[ $TASK == "mrc" ]]; then + RATIO_DIC="[0.096]" + fi + elif [[ $LANGUAGE == "en" ]]; then + if [[ $TASK == "senti" ]]; then + RATIO_DIC="[0.192]" + 
elif [[ $TASK == "similarity" ]]; then + RATIO_DIC="[0.511,0.505]" + elif [[ $TASK == "mrc" ]]; then + RATIO_DIC="[0.102]" + fi + fi + echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + + PRED_PATH=../task/${TASK}/output/${TASK}_${LANGUAGE}.${BASE_MODEL}/interpret.${INTER_MODE} + SAVE_PATH=./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + [ -d $SAVE_PATH ] || mkdir -p $SAVE_PATH + + python3 ./newp_text_generate.py \ + --pred_path $PRED_PATH \ + --save_path $SAVE_PATH \ + --task $TASK \ + --language $LANGUAGE \ + --ratio $RATIO_DIC + wait + + sh ./run_2_pred_${TASK}_per.sh $BASE_MODEL $INTER_MODE $LANGUAGE + wait + + sh ./generate_evaluation_data.sh $BASE_MODEL $INTER_MODE $LANGUAGE $TASK + wait + + echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}_finished + done + done +done diff --git a/examples/model_interpretation/rationale_extraction/generate_evaluation_data.py b/examples/model_interpretation/rationale_extraction/generate_evaluation_data.py new file mode 100644 index 0000000000000000000000000000000000000000..162b7fb00f70fa6914293af0e664b5d0689b0435 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/generate_evaluation_data.py @@ -0,0 +1,113 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json + + +def get_args(): + parser = argparse.ArgumentParser("generate data") + + parser.add_argument("--pred_path", required=True) + parser.add_argument("--data_dir", required=True) + parser.add_argument("--data_dir2", required=True) + parser.add_argument("--save_path", required=True) + parser.add_argument("--inter_mode", required=True) + parser.add_argument("--base_model", required=True) + parser.add_argument("--language", required=True) + + args = parser.parse_args() + return args + + +def evids_load(path): + evids = [] + with open(path, "r") as f: + for line in f.readlines(): + dic = json.loads(line) + evids.append(dic) + return evids + + +def dataLoad(args): + base_path = args.data_dir + "/" + text_path = base_path + "rationale_text/dev/dev" + text_exclusive_path = base_path + "rationale_exclusive_text/dev/dev" + + with open(text_path, "r") as f_text: + text_dict_list = {} + for line in f_text.readlines(): + line_dict = json.loads(line) + text_dict_list[line_dict["id"]] = line_dict + + with open(text_exclusive_path, "r") as f_exclusive_text: + text_exclusive_dict_list = {} + for line in f_exclusive_text.readlines(): + line_dict = json.loads(line) + text_exclusive_dict_list[line_dict["id"]] = line_dict + + base_path = args.data_dir2 + "/" + text_path = base_path + "rationale_text/dev/dev" + text_exclusive_path = base_path + "rationale_exclusive_text/dev/dev" + + with open(text_path, "r") as f_text: + text_dict_list2 = {} + for line in f_text.readlines(): + line_dict = json.loads(line) + text_dict_list2[line_dict["id"]] = line_dict + + with open(text_exclusive_path, "r") as f_exclusive_text: + text_exclusive_dict_list2 = {} + for line in f_exclusive_text.readlines(): + line_dict = json.loads(line) + text_exclusive_dict_list2[line_dict["id"]] = line_dict + + return text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 + + +def r_data_generation( + args, evids, text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 +): + save_path = args.save_path + f_save = open(save_path, "w") + + res_data = [] + for ins in evids: + temp = {} + temp["id"] = ins["id"] + temp["pred_label"] = ins["pred_label"] + temp["rationale"] = text_dict_list2[ins["id"]]["context_idx"] + temp["no_rationale"] = text_exclusive_dict_list2[ins["id"]]["context_idx"] + if len(temp["rationale"]) > 1 and args.inter_mode != "lime" and not (args.base_model.startswith("roberta")): + for i in range(len(temp["rationale"][1])): + temp["rationale"][1][i] -= len(temp["rationale"][0]) + len(temp["no_rationale"][0]) + for i in range(len(temp["no_rationale"][1])): + temp["no_rationale"][1][i] -= len(temp["rationale"][0]) + len(temp["no_rationale"][0]) + temp["rationale_pred"] = text_dict_list[ins["id"]]["pred_label"] + temp["no_rationale_pred"] = text_exclusive_dict_list[ins["id"]]["pred_label"] + temp["rationale_token"] = text_dict_list2[ins["id"]]["context_token"] + + res_data.append(temp) + + f_save.write(json.dumps(temp, ensure_ascii=False) + "\n") + f_save.close() + + +if __name__ == "__main__": + args = get_args() + text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 = dataLoad(args) + evids = evids_load(args.pred_path) + r_data_generation( + args, evids, text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 + ) diff --git a/examples/model_interpretation/rationale_extraction/generate_evaluation_data.sh b/examples/model_interpretation/rationale_extraction/generate_evaluation_data.sh 
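One non-obvious step in `r_data_generation` above: for two-segment inputs scored by the non-RoBERTa, non-LIME settings, the second segment's token indices appear to be counted over the whole concatenated input, so they are re-based by subtracting the total token count of the first segment (its rationale plus non-rationale parts). A hedged sketch of that shift with made-up indices:

```python
rationale = [[0, 2], [9, 11]]    # per-segment rationale token indices
no_rationale = [[1, 3], [10]]    # per-segment non-rationale token indices

offset = len(rationale[0]) + len(no_rationale[0])   # tokens in segment one
rationale[1] = [i - offset for i in rationale[1]]
no_rationale[1] = [i - offset for i in no_rationale[1]]
print(rationale, no_rationale)   # [[0, 2], [5, 7]] [[1, 3], [6]]
```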
new file mode 100644 index 0000000000000000000000000000000000000000..fa26d3beb8f9a7962159e7edf34e0fac26270399 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/generate_evaluation_data.sh @@ -0,0 +1,23 @@ +### + # This script concatenates results from previous running to generate a formated result for evaluation use +### + +BASE_MODEL=$1 +INTER_MODE=$2 +LANGUAGE=$3 +TASK=$4 + +PRED_PATH=../task/${TASK}/output/${TASK}_${LANGUAGE}.${BASE_MODEL}/interpret.${INTER_MODE} +SAVE_PATH=./evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} + +SAVE_DIR=./evaluation_data/${TASK}/ +[ -d $SAVE_DIR ] || mkdir -p $SAVE_DIR + +python3 generate_evaluation_data.py \ + --data_dir ./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} \ + --data_dir2 ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} \ + --pred_path $PRED_PATH \ + --save_path $SAVE_PATH \ + --inter_mode $INTER_MODE \ + --base_model $BASE_MODEL \ + --language $LANGUAGE \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/mrc_pred.py b/examples/model_interpretation/rationale_extraction/mrc_pred.py new file mode 100644 index 0000000000000000000000000000000000000000..2868c86b12408267128795fd57c87291b858793d --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/mrc_pred.py @@ -0,0 +1,207 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import functools +import json +import os +import sys +import time +from pathlib import Path + +import paddle + +from paddlenlp.data import Dict, Pad +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../task/mrc") +from saliency_map.squad import RCInterpret, compute_prediction # noqa: E402 + +sys.path.append("..") +from roberta.modeling import RobertaForQuestionAnswering # noqa: E402 + +sys.path.remove("..") +sys.path.remove("../task/mrc") +sys.path.append("../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, +) + +sys.path.remove("../..") + + +def get_args(): + parser = argparse.ArgumentParser("mrc predict with roberta") + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=32, help="batchsize") + parser.add_argument("--epoch", type=int, default=3, help="epoch") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--warmup_proportion", type=float, default=0.1) + parser.add_argument("--lr", type=float, default=5e-5, help="learning rate") + parser.add_argument("--eval", action="store_true") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument("--language", type=str, required=True, help="language that the model based on") + parser.add_argument("--input_data", type=str, required=True) + args = parser.parse_args() + return args + + +def map_fn_DuCheckList(examples, args, tokenizer): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. 
+ contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + # For validation, there is no need to compute start and end positions + for i, tokenized_example in enumerate(tokenized_examples): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + tokenized_examples[i]["example_id"] = examples[sample_index]["id"] + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. + if args.language == "ch": + tokenized_examples[i]["offset_mapping"] = [ + (o if sequence_ids[k] == 1 else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + else: + n = tokenized_example["offset_mapping"].index((0, 0), 1) + 2 # context start position + m = len(tokenized_example["offset_mapping"]) - 1 # context end position + 1 + tokenized_examples[i]["offset_mapping"] = [ + (o if n <= k <= m else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + + return tokenized_examples + + +def load_data(path): + data = {} + f = open(path, "r") + for line in f.readlines(): + line_split = json.loads(line) + data[line_split["id"]] = line_split + f.close() + return data + + +def init_roberta_var(args): + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + + model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained) + map_fn = functools.partial(map_fn_DuCheckList, args=args, tokenizer=tokenizer) + dev_ds = RCInterpret().read(os.path.join(args.data_dir, "dev")) + # dev_ds = load_dataset('squad', splits='dev_v2', data_files=None) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + + dev_dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dev_dataloader, dev_ds + + +@paddle.no_grad() +def evaluate(model, data_loader, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + input_ids, token_type_ids = batch + loss, start_logits_tensor, end_logits_tensor, cls_logits = model(input_ids, token_type_ids) + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + print("time per 1000:", time.time() - tic_eval) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json, all_feature_index = compute_prediction( + data_loader.dataset.data, + 
data_loader.dataset.new_data, + (all_start_logits, all_end_logits), + True, + 20, + args.max_seq_len, + 0.0, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open(os.path.join(args.output_dir, "dev"), "w") as f: + for id in all_predictions: + temp = {} + temp["id"] = int(id) + temp["pred_label"] = all_predictions[id] + temp["pred_feature"] = all_feature_index[id] + f.write(json.dumps(temp, ensure_ascii=False) + "\n") + + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader, dev_ds = init_roberta_var(args) + else: + raise ValueError("unsupported base model name.") + + with paddle.amp.auto_cast(enable=args.use_amp): + + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + print("load model from %s" % args.init_checkpoint) + + evaluate(model, dataloader, args) diff --git a/examples/model_interpretation/rationale_extraction/newp_text_generate.py b/examples/model_interpretation/rationale_extraction/newp_text_generate.py new file mode 100644 index 0000000000000000000000000000000000000000..28e8e98157d8c498fa7139715a0e0b5533d4969e --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/newp_text_generate.py @@ -0,0 +1,269 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
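`newp_text_generate.py` below turns ranked attributions into a rationale / complement text pair: tokens arrive ordered by importance, the top `ratio` fraction (rounded up) becomes the rationale and the remainder the complement, and each side is re-sorted by token position before joining (space-joined for English, concatenated for Chinese). A minimal sketch, using a hypothetical attribution dict that maps token position to token (the real files store a list per position):

```python
import math


def split_by_ratio(char_attri, ratio, language="en"):
    ranked = list(char_attri.keys())            # keys already ordered by attribution
    cut = math.ceil(len(ranked) * ratio)
    sep = " " if language == "en" else ""
    rationale = sep.join(char_attri[k] for k in sorted(ranked[:cut], key=int))
    complement = sep.join(char_attri[k] for k in sorted(ranked[cut:], key=int))
    return rationale or "['UNK']", complement or "['UNK']"


attri = {"3": "terrible", "0": "the", "2": "was", "1": "movie"}  # most to least important
print(split_by_ratio(attri, ratio=0.311))  # ('the terrible', 'movie was')
```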
+ +import argparse +import json +import math +import os + + +def get_args(): + parser = argparse.ArgumentParser("generate data") + + parser.add_argument("--pred_path", required=True) + parser.add_argument("--save_path", required=True) + parser.add_argument("--language", required=True) + parser.add_argument("--task", required=True) + parser.add_argument("--ratio", type=str, required=True) + + args = parser.parse_args() + return args + + +def evids_load(path): + evids = [] + with open(path, "r") as f: + for line in f.readlines(): + dic = json.loads(line) + evids.append(dic) + return evids + + +def generate_for_senti(args, evid_dict, ratio): + r = {} + ex_r = {} + + label = evid_dict["pred_label"] + char_attri = list(evid_dict["char_attri"].keys()) + length = len(char_attri) + + rationale_ratio = ratio[0] + toprationale_text, toprationale_exclusive_text = [], [] + + keys = [int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]] + keys.sort() + for key in keys: + toprationale_text.append(evid_dict["char_attri"][str(key)][0].strip()) + + keys = [int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]] + keys.sort() + for key in keys: + toprationale_exclusive_text.append(evid_dict["char_attri"][str(key)][0].strip()) + + if args.language == "en": + toprationale_text = " ".join(toprationale_text) + toprationale_exclusive_text = " ".join(toprationale_exclusive_text) + else: + toprationale_text = "".join(toprationale_text) + toprationale_exclusive_text = "".join(toprationale_exclusive_text) + + if len(toprationale_text) == 0: + toprationale_text = "['UNK']" + if len(toprationale_exclusive_text) == 0: + toprationale_exclusive_text = "['UNK']" + + r["id"] = evid_dict["id"] + r["context"] = toprationale_text + r["context_idx"] = [[int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]]] + r["context_token"] = [[evid_dict["char_attri"][x][0] for x in char_attri[: math.ceil(length * rationale_ratio)]]] + r["label"] = label + ex_r["id"] = evid_dict["id"] + ex_r["context"] = toprationale_exclusive_text + ex_r["context_idx"] = [[int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]]] + ex_r["context_token"] = [ + [evid_dict["char_attri"][x][0] for x in char_attri[math.ceil(length * rationale_ratio) :]] + ] + ex_r["label"] = label + return r, ex_r + + +def generate_for_similarity(args, evid_dict, ratio): + r = {} + ex_r = {} + q_rationale_ratio = ratio[0] + t_rationale_ratio = ratio[1] + + label = evid_dict["pred_label"] + # query + q_char_attri = list(evid_dict["query_char_attri"].keys()) + q_length = len(q_char_attri) + + q_topR_Rtext, q_topR_noRtext = [], [] + keys = [int(x) for x in q_char_attri[: math.ceil(q_length * q_rationale_ratio)]] + keys.sort() + for key in keys: + q_topR_Rtext.append(evid_dict["query_char_attri"][str(key)][0].strip()) + + keys = [int(x) for x in q_char_attri[math.ceil(q_length * q_rationale_ratio) :]] + keys.sort() + for key in keys: + q_topR_noRtext.append(evid_dict["query_char_attri"][str(key)][0].strip()) + + if args.language == "ch": + q_topR_Rtext = "".join(q_topR_Rtext) + q_topR_noRtext = "".join(q_topR_noRtext) + else: + q_topR_Rtext = " ".join(q_topR_Rtext) + q_topR_noRtext = " ".join(q_topR_noRtext) + + if len(q_topR_Rtext) == 0: + q_topR_Rtext = "['UNK']" + if len(q_topR_noRtext) == 0: + q_topR_noRtext = "['UNK']" + + # title + t_char_attri = list(evid_dict["title_char_attri"].keys()) + t_length = len(t_char_attri) + + t_topR_Rtext, t_topR_noRtext = [], [] + keys = [int(x) for x in t_char_attri[: math.ceil(t_length * 
t_rationale_ratio)]] + keys.sort() + for key in keys: + t_topR_Rtext.append(evid_dict["title_char_attri"][str(key)][0]) + + keys = [int(x) for x in t_char_attri[math.ceil(t_length * t_rationale_ratio) :]] + keys.sort() + for key in keys: + t_topR_noRtext.append(evid_dict["title_char_attri"][str(key)][0]) + + if args.language == "ch": + t_topR_Rtext = "".join(t_topR_Rtext) + t_topR_noRtext = "".join(t_topR_noRtext) + else: + t_topR_Rtext = " ".join(t_topR_Rtext) + t_topR_noRtext = " ".join(t_topR_noRtext) + + if len(t_topR_Rtext) == 0: + t_topR_Rtext = "['UNK']" + if len(t_topR_noRtext) == 0: + t_topR_noRtext = "['UNK']" + + r["id"] = evid_dict["id"] + r["context"] = [q_topR_Rtext, t_topR_Rtext] + r["context_idx"] = [ + [int(x) for x in q_char_attri[: math.ceil(q_length * q_rationale_ratio)]], + [int(x) for x in t_char_attri[: math.ceil(t_length * t_rationale_ratio)]], + ] + r["context_token"] = [ + [evid_dict["query_char_attri"][x][0] for x in q_char_attri[: math.ceil(q_length * q_rationale_ratio)]], + [evid_dict["title_char_attri"][x][0] for x in t_char_attri[: math.ceil(t_length * t_rationale_ratio)]], + ] + r["label"] = label + ex_r["id"] = evid_dict["id"] + ex_r["context"] = [q_topR_noRtext, t_topR_noRtext] + ex_r["context_idx"] = [ + [int(x) for x in q_char_attri[math.ceil(q_length * q_rationale_ratio) :]], + [int(x) for x in t_char_attri[math.ceil(t_length * t_rationale_ratio) :]], + ] + ex_r["context_token"] = [ + [evid_dict["query_char_attri"][x][0] for x in q_char_attri[math.ceil(q_length * q_rationale_ratio) :]], + [evid_dict["title_char_attri"][x][0] for x in t_char_attri[math.ceil(t_length * t_rationale_ratio) :]], + ] + ex_r["label"] = label + return r, ex_r + + +def generate_for_MRC(args, evid_dict, ratio): + id = evid_dict["id"] + question = evid_dict["question"] + char_attri = list(evid_dict["char_attri"].keys()) + length = len(char_attri) + + rationale_ratio = ratio[0] + toprationale_text, toprationale_exclusive_text = [], [] + keys = [int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]] + keys.sort() + for key in keys: + toprationale_text.append(evid_dict["char_attri"][str(key)][0].strip()) + + keys = [int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]] + keys.sort() + for key in keys: + toprationale_exclusive_text.append(evid_dict["char_attri"][str(key)][0].strip()) + + if args.language == "en": + toprationale_text = " ".join(toprationale_text) + toprationale_exclusive_text = " ".join(toprationale_exclusive_text) + else: + toprationale_text = "".join(toprationale_text) + toprationale_exclusive_text = "".join(toprationale_exclusive_text) + + if len(toprationale_text) == 0: + toprationale_text = "['UNK']" + if len(toprationale_exclusive_text) == 0: + toprationale_exclusive_text = "['UNK']" + + data_R_dict, Rdata_noR_dict = {}, {} + + data_R_dict["id"] = id + data_R_dict["title"] = "" + data_R_dict["context"] = toprationale_text + data_R_dict["question"] = question + data_R_dict["answers"] = [""] + data_R_dict["answer_starts"] = [-1] + data_R_dict["is_impossible"] = False + data_R_dict["context_idx"] = [[int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]]] + data_R_dict["context_token"] = [ + [evid_dict["char_attri"][x][0] for x in char_attri[: math.ceil(length * rationale_ratio)]] + ] + + Rdata_noR_dict["id"] = id + Rdata_noR_dict["title"] = "" + Rdata_noR_dict["context"] = toprationale_exclusive_text + Rdata_noR_dict["question"] = question + Rdata_noR_dict["answers"] = [""] + Rdata_noR_dict["answer_starts"] = [-1] + 
Rdata_noR_dict["is_impossible"] = False + Rdata_noR_dict["context_idx"] = [[int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]]] + Rdata_noR_dict["context_token"] = [ + [evid_dict["char_attri"][x][0] for x in char_attri[math.ceil(length * rationale_ratio) :]] + ] + + return data_R_dict, Rdata_noR_dict + + +def r_text_generation(evids, args): + print("num: {}".format(len(evids))) + + f_rationale_path = os.path.join(args.save_path, "rationale_text/dev") + f_rationale_exclusive_path = os.path.join(args.save_path, "rationale_exclusive_text/dev") + + if not os.path.exists(f_rationale_path): + os.makedirs(f_rationale_path) + if not os.path.exists(f_rationale_exclusive_path): + os.makedirs(f_rationale_exclusive_path) + + f_rationale = open(os.path.join(f_rationale_path, "dev"), "w") + f_rationale_exclusive = open(os.path.join(f_rationale_exclusive_path, "dev"), "w") + + rationale_ratio = json.loads(args.ratio) + for id, evid_dict in enumerate(evids): + if args.task == "senti": + data_R_dict, Rdata_noR_dict = generate_for_senti(args, evid_dict, rationale_ratio) + elif args.task == "similarity": + data_R_dict, Rdata_noR_dict = generate_for_similarity(args, evid_dict, rationale_ratio) + elif args.task == "mrc": + data_R_dict, Rdata_noR_dict = generate_for_MRC(args, evid_dict, rationale_ratio) + f_rationale.write(json.dumps(data_R_dict, ensure_ascii=False) + "\n") + f_rationale_exclusive.write(json.dumps(Rdata_noR_dict, ensure_ascii=False) + "\n") + + f_rationale.close() + f_rationale_exclusive.close() + + +if __name__ == "__main__": + args = get_args() + + evids = evids_load(args.pred_path) + r_text_generation(evids, args) diff --git a/examples/model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh b/examples/model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh new file mode 100644 index 0000000000000000000000000000000000000000..c672ca7a1b60e5e7912d7589587d1b81aed225ca --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh @@ -0,0 +1,48 @@ +### + # This script generates mrc predictions for texts contains rationales only and contains non-rationales only +### +export CUDA_VISIBLE_DEVICES=`python ./available_gpu.py --best 1` +export PYTHONPATH=./:$PYTHONPATH + +BASE_MODEL=$1 +INTER_MODE=$2 +LANGUAGE=$3 +TASK=mrc + +for RATIONAL_TYPE in "rationale_text" "rationale_exclusive_text"; +do + if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + CKPT=../task/${TASK}/models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3 epoch + + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + CKPT=../task/${TASK}/models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin # 3 epoch + fi + elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=../task/${TASK}/models/roberta_base_squad2_20211113_104225/ckpt.bin + + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=../task/${TASK}/models/roberta_large_squad2_20211113_111300/ckpt.bin + fi + fi + + OUTPUT=./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + python3 ./mrc_pred.py \ + --input_data ../data/${TASK}_${LANGUAGE} \ + --base_model $BASE_MODEL \ + --data_dir ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev \ + --output_dir $OUTPUT \ + --from_pretrained $FROM_PRETRAIN \ + 
--batch_size 1 \ + --init_checkpoint $CKPT \ + --n-samples 300 \ + --doc_stride 128 \ + --language $LANGUAGE +done diff --git a/examples/model_interpretation/rationale_extraction/run_2_pred_senti_per.sh b/examples/model_interpretation/rationale_extraction/run_2_pred_senti_per.sh new file mode 100644 index 0000000000000000000000000000000000000000..06dfea7790d8d47e5260b8d26b4c77036ead7648 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/run_2_pred_senti_per.sh @@ -0,0 +1,62 @@ +### + # This script generates sentiment predictions for texts contains rationales only and contains non-rationales only +### + +export CUDA_VISIBLE_DEVICES=`python ./available_gpu.py --best 1` +export PYTHONPATH=./:$PYTHONPATH + +BASE_MODEL=$1 +INTER_MODE=$2 +LANGUAGE=$3 +TASK=senti + +FROM_PRETRAIN='test' +VOCAB_PATH='test' +for RATIONAL_TYPE in "rationale_text" "rationale_exclusive_text"; +do + if [[ $LANGUAGE == "en" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=../task/${TASK}/pretrained_models/saved_model_en/roberta_base_20220318_185322/model_10000/model_state.pdparams + #CKPT=../../../${TASK}/pretrained_models/saved_model_en/roberta_base_20211206_164443/model_10000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=../task/${TASK}/pretrained_models/saved_model_en/roberta_large_20220318_183813/model_4000/model_state.pdparams + #CKPT=../../../${TASK}/pretrained_models/saved_model_en/roberta_large_20211207_174631/model_4000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH=../task/${TASK}/rnn/vocab.sst2_train + CKPT=../task/${TASK}/rnn/checkpoints_en/final.pdparams + fi + + elif [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=../task/${TASK}/pretrained_models/saved_model_ch/roberta_base_20220318_155933/model_900/model_state.pdparams + #CKPT=../../../${TASK}/pretrained_models/saved_model_ch/roberta_base_20211206_180737/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=../task/${TASK}/pretrained_models/saved_model_ch/roberta_large_20220318_170123/model_900/model_state.pdparams + #CKPT=../../../${TASK}/pretrained_models/saved_model_ch/roberta_large_20211207_143351/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH=../task/${TASK}/rnn/vocab.txt + CKPT=../task/${TASK}/rnn/checkpoints_ch/final.pdparams + fi + fi + + OUTPUT=./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + python3 ./sentiment_pred.py \ + --base_model $BASE_MODEL \ + --data_dir ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev \ + --output_dir $OUTPUT \ + --vocab_path $VOCAB_PATH \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE \ + --n-samples 200 \ + --language $LANGUAGE +done \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh b/examples/model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh new file mode 100644 index 0000000000000000000000000000000000000000..9f0fecd865b7e054e62f835edaf569b562e2bc62 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh @@ -0,0 +1,54 @@ +### + # This script generates textual similarity 
predictions for texts contains rationales only and contains non-rationales only +### +export CUDA_VISIBLE_DEVICES=`python ./available_gpu.py --best 1` +export PYTHONPATH=./:$PYTHONPATH + +BASE_MODEL=$1 +INTER_MODE=$2 +LANGUAGE=$3 +TASK=similarity + +for RATIONAL_TYPE in "rationale_text" "rationale_exclusive_text"; +do + if [[ $LANGUAGE == "en" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_base_20211109_205245/model_54000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_large_20211109_205649/model_46000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN=../task/${TASK}/skep_ernie_1.0_large_ch + CKPT=../task/${TASK}/simnet/checkpoints_${LANGUAGE}/final.pdparams + fi + + elif [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_base_20211018_104038/model_11400/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_large_20211018_152833/model_22000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN='skep_ernie_1.0_large_ch' + CKPT=../task/${TASK}/simnet/checkpoints_${LANGUAGE}/final.pdparams + fi + fi + + OUTPUT=./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + python3 similarity_pred.py \ + --base_model $BASE_MODEL \ + --data_dir ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev \ + --output_dir $OUTPUT \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --max_seq_len 256 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE \ + --language $LANGUAGE +done \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/sentiment_pred.py b/examples/model_interpretation/rationale_extraction/sentiment_pred.py new file mode 100644 index 0000000000000000000000000000000000000000..4ab1397ed30439463063f7e136042f3f7d4a2b97 --- /dev/null +++ b/examples/model_interpretation/rationale_extraction/sentiment_pred.py @@ -0,0 +1,255 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
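The predictors in this patch (`mrc_pred.py` above, `sentiment_pred.py` below) collate batches the same way: samples are dicts of token-id lists, and `paddlenlp.data.Dict` plus `Pad` pad each field to the longest sequence in the batch. A small usage sketch; the pad ids here are placeholders where the real scripts use the tokenizer's values:

```python
from paddlenlp.data import Dict, Pad

# hypothetical pad ids; the scripts pass tokenizer.pad_token_id / pad_token_type_id
batchify_fn = Dict(
    {
        "input_ids": Pad(axis=0, pad_val=0),
        "token_type_ids": Pad(axis=0, pad_val=0),
    }
)

samples = [
    {"input_ids": [1, 5, 7, 2], "token_type_ids": [0, 0, 0, 0]},
    {"input_ids": [1, 9, 2], "token_type_ids": [0, 0, 0]},
]
input_ids, token_type_ids = batchify_fn(samples)
print(input_ids.shape)  # (2, 4): the shorter sequence is padded to length 4
```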
+ +import argparse +import json +import os +import sys +from functools import partial +from pathlib import Path + +import paddle +from tqdm import tqdm + +from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import DatasetBuilder +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../task/senti") +from rnn.model import BiLSTMAttentionModel, SelfInteractiveAttention # noqa: E402 +from rnn.utils import CharTokenizer, convert_example # noqa: E402 + +sys.path.append("..") +from roberta.modeling import RobertaForSequenceClassification # noqa: E402 + +sys.path.remove("..") +sys.path.remove("../task/senti") +sys.path.append("../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, +) + +sys.path.remove("../..") + + +def get_args(): + parser = argparse.ArgumentParser("sentiment analysis prediction") + + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batchsize") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--eval", action="store_true") + + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument( + "--inter_mode", + type=str, + default="attention", + choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], + help="appoint the mode of interpretable.", + ) + parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument("--start_id", type=int, default=0) + parser.add_argument("--vocab_path", type=str) + parser.add_argument("--language", type=str, required=True, help="Language that the model is built for") + args = parser.parse_args() + return args + + +class SentiData(DatasetBuilder): + def _read(self, filename, language): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + line_split = json.loads(line) + yield {"id": line_split["id"], "context": line_split["context"]} + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. 
+ batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) + return dataloader + + +def map_fn_senti(examples, tokenizer, language): + print("load data %d" % len(examples)) + + contexts = [example["context"] for example in examples] + tokenized_examples = tokenizer(contexts, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + return tokenized_examples + + +def truncate_offset(seg, start_offset, end_offset): + seg_len = len(seg) + for n in range(len(start_offset) - 1, -1, -1): + if start_offset[n] < seg_len: + end_offset[n] = seg_len + break + start_offset.pop(n) + end_offset.pop(n) + + +def init_lstm_var(args): + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = CharTokenizer(vocab, args.language, "../punctuations") + padding_idx = vocab.token_to_idx.get("[PAD]", 0) + + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=True, language=args.language) + + # init attention layer + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=len(tokenizer.vocab), + lstm_hidden_size=lstm_hidden_size, + num_classes=2, + padding_idx=padding_idx, + ) + + # Reads data and generates mini-batches. 
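+ # The dev split is read from <data_dir>/dev; mode="validation" keeps the sampler
+ # unshuffled, so the step index in the main loop stays aligned with
+ # dataloader.dataset.data when each example's id is recovered.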
+ dev_ds = SentiData().read(os.path.join(args.data_dir, "dev"), args.language) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=padding_idx), # input_ids + Stack(dtype="int64"), # seq len + ): [data for data in fn(samples)] + + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + + return model, tokenizer, dev_loader + + +def init_roberta_var(args): + tokenizer = None + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + model = RobertaForSequenceClassification.from_pretrained( + args.from_pretrained, + hidden_dropout_prob=0, + attention_probs_dropout_prob=0, + dropout=0, + num_labels=2, + name="", + return_inter_score=True, + ) + + map_fn = partial(map_fn_senti, tokenizer=tokenizer, language=args.language) + + dev_ds = SentiData().read(os.path.join(args.data_dir, "dev"), args.language) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + } + ): fn(samples) + + dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dataloader + + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader = init_roberta_var(args) + + elif args.base_model == "lstm": + model, tokenizer, dataloader = init_lstm_var(args) + else: + raise ValueError("unsupported base model name.") + + with paddle.amp.auto_cast(enable=args.use_amp), open(str(args.output_dir) + "/dev", "w") as out_handle: + # Load model + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + model.train() # 为了取梯度,加载模型时dropout设为0 + print("load model from %s" % args.init_checkpoint) + + get_sub_word_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))) + + for step, d in tqdm(enumerate(dataloader)): + if step + 1 < args.start_id: + continue + + result = {} + if args.base_model.startswith("roberta"): + input_ids, token_type_ids = d + fwd_args = [input_ids, token_type_ids] + fwd_kwargs = {} + + tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:-1].tolist()) # list + + elif args.base_model == "lstm": + input_ids, seq_lens = d + fwd_args = [input_ids, seq_lens] + fwd_kwargs = {} + tokens = [tokenizer.vocab.idx_to_token[input_id] for input_id in input_ids.tolist()[0]] + + result["id"] = dataloader.dataset.data[step]["id"] + + probs, atts, embedded = model.forward_interpet(*fwd_args, **fwd_kwargs) + pred_label = paddle.argmax(probs, axis=-1).tolist()[0] + + result["pred_label"] = pred_label + result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] + if args.language == "en": + result["context"] = tokenizer.convert_tokens_to_string(tokens) + else: + result["context"] = "".join(tokens) + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") diff --git a/examples/model_interpretation/rationale_extraction/similarity_pred.py b/examples/model_interpretation/rationale_extraction/similarity_pred.py new file mode 100644 index 0000000000000000000000000000000000000000..c6771189b1eeab1ebbae0ed060817cad06648b78 --- /dev/null +++ 
b/examples/model_interpretation/rationale_extraction/similarity_pred.py @@ -0,0 +1,229 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import sys +from functools import partial +from pathlib import Path + +import paddle +from tqdm import tqdm + +from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import DatasetBuilder +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("..") +from roberta.modeling import RobertaForSequenceClassification # noqa: E402 + +sys.path.remove("..") +from simnet.model import SimNet # noqa: E402 +from simnet.utils import CharTokenizer, preprocess_data # noqa: E402 + +sys.path.remove("../task/similarity") +sys.path.append("../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, +) + +sys.path.remove("../..") + + +def get_args(): + parser = argparse.ArgumentParser("textual similarity prediction") + + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batchsize") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument( + "--inter_mode", + type=str, + default="attention", + choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], + help="appoint the mode of interpretable.", + ) + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument("--language", type=str, required=True) + args = parser.parse_args() + return args + + +class SimilarityData(DatasetBuilder): + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + line_split = json.loads(line) + if args.language == "ch": + yield { + "id": line_split["id"], + "query": line_split["context"][0], + "title": line_split["context"][1], + } + else: + yield { + "id": line_split["id"], + "sentence1": line_split["context"][0], + "sentence2": line_split["context"][1], + } + + +def map_fn_senti(examples, tokenizer): + print("load data %d" % len(examples)) + if args.language == "ch": + query = "query" + title = "title" + else: + query = "sentence1" + 
title = "sentence2" + queries = [example[query] for example in examples] + titles = [example[title] for example in examples] + tokenized_examples = tokenizer(queries, titles, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + return tokenized_examples + + +def init_roberta_var(args): + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + model = RobertaForSequenceClassification.from_pretrained( + args.from_pretrained, + hidden_dropout_prob=0, + attention_probs_dropout_prob=0, + dropout=0, + num_labels=2, + name="", + return_inter_score=True, + ) + + map_fn = partial(map_fn_senti, tokenizer=tokenizer) + + dev_ds = SimilarityData().read(os.path.join(args.data_dir, "dev")) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + } + ): fn(samples) + + dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dataloader, dev_ds + + +def init_lstm_var(args): + if args.language == "ch": + vocab = Vocab.load_vocabulary("../task/similarity/simnet/vocab.char", unk_token="[UNK]", pad_token="[PAD]") + else: + vocab = Vocab.load_vocabulary("../task/similarity/simnet/vocab_QQP", unk_token="[UNK]", pad_token="[PAD]") + + tokenizer = CharTokenizer(vocab, args.language, "../punctuations") + model = SimNet(network="lstm", vocab_size=len(vocab), num_classes=2) + + dev_ds = SimilarityData().read(os.path.join(args.data_dir, "dev")) + dev_examples = preprocess_data(dev_ds.data, tokenizer, language=args.language) + batches = [dev_examples[idx : idx + args.batch_size] for idx in range(0, len(dev_examples), args.batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + ): [data for data in fn(samples)] + + return model, tokenizer, batches, batchify_fn, vocab, dev_ds + + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader, dev_ds = init_roberta_var(args) + + elif args.base_model == "lstm": + model, tokenizer, dataloader, batchify_fn, vocab, dev_ds = init_lstm_var(args) + else: + raise ValueError("unsupported base model name.") + + with paddle.amp.auto_cast(enable=args.use_amp), open(str(args.output_dir) + "/dev", "w") as out_handle: + # Load model + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + model.train() # 为了取梯度,加载模型时dropout设为0 + print("load model from %s" % args.init_checkpoint) + + for step, d in tqdm(enumerate(dataloader)): + + result = {} + if args.base_model.startswith("roberta"): + input_ids, token_type_ids = d + fwd_args = [input_ids, token_type_ids] + fwd_kwargs = {} + + SEP_idx = input_ids.tolist()[0].index(tokenizer.sep_token_id) + q_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:SEP_idx].tolist()) # list + if args.language == "ch": + t_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, SEP_idx + 1 : -1].tolist()) # list + else: + t_tokens = 
tokenizer.convert_ids_to_tokens(input_ids[0, SEP_idx + 2 : -1].tolist()) # list + + elif args.base_model == "lstm": + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(d) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + + fwd_args = [query_ids, title_ids, query_seq_lens, title_seq_lens] + fwd_kwargs = {} + q_tokens = [vocab._idx_to_token[idx] for idx in query_ids.tolist()[0]] + t_tokens = [vocab._idx_to_token[idx] for idx in title_ids.tolist()[0]] + + result["id"] = dev_ds.data[step]["id"] + + probs, atts, embedded = model.forward_interpret(*fwd_args, **fwd_kwargs) + pred_label = paddle.argmax(probs, axis=-1).tolist()[0] + + result["pred_label"] = pred_label + result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] + if args.language == "ch": + result["query"] = "".join(q_tokens) + result["title"] = "".join(t_tokens) + else: + result["query"] = tokenizer.convert_tokens_to_string(q_tokens) + result["title"] = tokenizer.convert_tokens_to_string(t_tokens) + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") diff --git a/examples/model_interpretation/requirements.txt b/examples/model_interpretation/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..6a6e0abed45786316dbcd88e99c32272e8f8a87e --- /dev/null +++ b/examples/model_interpretation/requirements.txt @@ -0,0 +1,5 @@ +nvgpu>=0.9.0 +regex>=2021.11.10 +spacy>=2.3.7 +tqdm>=4.62.3 +visualdl>=2.2.2 diff --git a/examples/model_interpretation/task/README.md b/examples/model_interpretation/task/README.md new file mode 100644 index 0000000000000000000000000000000000000000..03f1edca0dc88d58fcd18a95ffd165d9dde3c3ee --- /dev/null +++ b/examples/model_interpretation/task/README.md @@ -0,0 +1,19 @@ +### 基线模型预测 +#### 情感分析: + 预测:model_interpretation/rationale_extraction/sentiment_pred.py + 参数设置参考:model_interpretation/rationale_extraction/run_2_pred_senti_per.sh (参数涉及模型、文件等路径,以及语言的,请根据实际情况进行修改) +#### 文本相似度: + 预测:model_interpretation/rationale_extraction/similarity_pred.py + 参数设置参考:model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh(参数涉及模型、文件等路径,以及语言的,请根据实际情况进行修改) +#### 阅读理解: + 预测:model_interpretation/rationale_extraction/mrc_pred.py + 参数设置参考:model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh(参数涉及模型、文件等路径,以及语言的,请根据实际情况进行修改) +### 三个任务的基线模型训练 +#### 情感分析 + RoBERTa:model_interpretation/task/senti/pretrained_models/run_train.sh + LSTM:model_interpretation/task/senti/rnn/lstm_train.sh +#### 文本相似度 + RoBERTa:model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh + LSTM:model_interpretation/task/similarity/simnet/lstm_train.sh +#### 阅读理解 + RoBERTa:model_interpretation/task/mrc/run_train_rc.sh diff --git a/examples/model_interpretation/task/mrc/roberta/modeling.py b/examples/model_interpretation/task/mrc/roberta/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..4b376e43dabc0ecc0581de6a46de92f99857db85 --- /dev/null +++ b/examples/model_interpretation/task/mrc/roberta/modeling.py @@ -0,0 +1,719 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import paddle +import paddle.nn as nn + +from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model + +sys.path.append("../..") +from task.transformer import TransformerEncoder, TransformerEncoderLayer # noqa: E402 + +sys.path.remove("../..") + +__all__ = [ + "RobertaModel", + "RobertaPretrainedModel", + "RobertaForSequenceClassification", + "RobertaForTokenClassification", + "RobertaForQuestionAnswering", +] + + +class RobertaEmbeddings(nn.Layer): + r""" + Include embeddings from word, position and token_type embeddings. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + pad_token_id=0, + ): + super(RobertaEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id) + self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size) + self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size) + self.layer_norm = nn.LayerNorm(hidden_size) + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None, position_ids=None): + if position_ids is None: + # maybe need use shape op to unify static graph and dynamic graph + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=-1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + if token_type_ids is None: + token_type_ids = paddle.zeros_like(input_ids, dtype="int64") + + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = input_embedings + position_embeddings + token_type_embeddings + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class RobertaPooler(nn.Layer): + def __init__(self, hidden_size): + super(RobertaPooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class RobertaPretrainedModel(PretrainedModel): + r""" + An abstract class for pretrained RoBerta models. It provides RoBerta related + `model_config_file`, `pretrained_resource_files_map`, `resource_files_names`, + `pretrained_init_configuration`, `base_model_prefix` for downloading and + loading pretrained models. + Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
+ + """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + "roberta-wwm-ext": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "roberta-wwm-ext-large": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbt3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbtl3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + "roberta-wwm-ext": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/roberta_chn_base.pdparams", + "roberta-wwm-ext-large": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams", + "rbt3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbt3/rbt3_chn_large.pdparams", + "rbtl3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbtl3/rbtl3_chn_large.pdparams", + } + } + base_model_prefix = "roberta" + + def _init_weights(self, layer): + """Initialization hook""" + if isinstance(layer, (nn.Linear, nn.Embedding)): + # only support dygraph, use truncated_normal and make it inplace + # and configurable later + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.roberta.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + elif isinstance(layer, nn.LayerNorm): + layer._epsilon = 1e-12 + + +@register_base_model +class RobertaModel(RobertaPretrainedModel): + r""" + The bare Roberta Model outputting raw hidden-states. + + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + + This model is also a Paddle `paddle.nn.Layer <https://www.paddlepaddle.org.cn/documentation + /docs/zh/api/paddle/nn/Layer_cn.html>`__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + + Args: + vocab_size (int): + Vocabulary size of `inputs_ids` in `RobertaModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `RobertaModel`. + hidden_size (int, optional): + Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. 
+ num_hidden_layers (int, optional): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + num_attention_heads (int, optional): + Number of attention heads for each attention layer in the Transformer encoder. + Defaults to `12`. + intermediate_size (int, optional): + Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors + to ff layers are firstly projected from `hidden_size` to `intermediate_size`, + and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. + Defaults to `3072`. + hidden_act (str, optional): + The non-linear activation function in the feed-forward layer. + ``"gelu"``, ``"relu"`` and any other paddle supported activation functions + are supported. Defaults to ``"gelu"``. + hidden_dropout_prob (float, optional): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float, optional): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + max_position_embeddings (int, optional): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids` passed when calling `~transformers.RobertaModel`. + Defaults to `2`. + initializer_range (float, optional): + The standard deviation of the normal initializer. Defaults to 0.02. + + .. note:: + A normal_initializer initializes weight matrices as normal distributions. + See :meth:`RobertaPretrainedModel._init_weights()` for how weights are initialized in `RobertaModel`. + + pad_token_id(int, optional): + The index of padding token in the token vocabulary. + Defaults to `0`. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.01, + layer_norm_eps=1e-12, + pad_token_id=0, + ): + super(RobertaModel, self).__init__() + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + self.embeddings = RobertaEmbeddings( + vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings, type_vocab_size, pad_token_id + ) + encoder_layer = TransformerEncoderLayer( + hidden_size, + num_attention_heads, + intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=0, + ) + self.encoder = TransformerEncoder(encoder_layer, num_hidden_layers) + self.pooler = RobertaPooler(hidden_size) + + def forward( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + token_type_ids (Tensor, optional): + Segment token indices to indicate first and second portions of the inputs. + Indices can be either 0 or 1: + + - 0 corresponds to a **sentence A** token, + - 1 corresponds to a **sentence B** token. + + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. 
+ Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, max_position_embeddings - 1]``. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + Defaults to `None`. + attention_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + Defaults to `None`, which means nothing needed to be prevented attention to. + + Returns: + tuple: Returns tuple (`sequence_output`, `pooled_output`). + + With the fields: + + - sequence_output (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + + - pooled_output (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaModel, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaModel.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + sequence_output, pooled_output = model(**inputs) + + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2] + ) + # CLS: 101; SEP: 102; PAD: 0 + baseline_ids = paddle.to_tensor( + [101] + [0] * (input_ids.shape[1] - 2) + [102], + dtype=input_ids.dtype, + place=input_ids.place, + stop_gradient=input_ids.stop_gradient, + ) + + embedding_output = self.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + baseline_embedding_output = self.embeddings( + input_ids=baseline_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + if noise is not None: + if noise.upper() == "GAUSSIAN": + pass + # stdev_spread = 0.15 + # stdev = stdev_spread * (orig_embedded.max() - orig_embedded.min()).numpy() + # noise = paddle.to_tensor(np.random.normal(0, stdev, orig_embedded.shape).astype(np.float32), + # stop_gradient=False) + # orig_embedded = orig_embedded + noise + if noise.upper() == "INTEGRATED": + embedding_output = baseline_embedding_output + i / (n_samples - 1) * ( + embedding_output - baseline_embedding_output + ) + else: + raise ValueError("unsupported noise method: %s" % (noise)) + + # encoder_outputs = self.encoder(embedding_output, attention_mask) + encoder_outputs, att_weights_list = self.encoder(embedding_output, attention_mask) # interpret + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output, att_weights_list, embedding_output + + +class RobertaForQuestionAnswering(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output to + compute `span_start_logits` and `span_end_logits`, designed for question-answering tasks like SQuAD. + + Args: + roberta (:class:`RobertaModel`): + An instance of RobertaModel. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` of `RobertaModel` + instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, dropout=None): + super(RobertaForQuestionAnswering, self).__init__() + self.roberta = roberta # allow roberta to be config + self.classifier = nn.Linear(self.roberta.config["hidden_size"], 2) + self.classifier_cls = nn.Linear(self.roberta.config["hidden_size"], 2) + self.criterion = CrossEntropyLossForChecklist() + + # def forward(self, input_ids, token_type_ids=None): + def forward(self, *args, **kwargs): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. 
+ + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + Example: + .. code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + start_pos = kwargs.pop("start_pos", None) + end_pos = kwargs.pop("end_pos", None) + cls_label = kwargs.pop("labels", None) + + # sequence_output, pooled_output, _, _ = self.roberta( + # input_ids, + # token_type_ids=token_type_ids, + # position_ids=None, + # attention_mask=None) + # print(kwargs) + sequence_output, pooled_output, _, _ = self.roberta(*args, **kwargs) + + logits = self.classifier(sequence_output) # (bsz, seq, 2) + logits = paddle.transpose(logits, perm=[2, 0, 1]) # (2, bsz, seq) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + cls_logits = self.classifier_cls(pooled_output) + + if start_pos is not None and end_pos is not None: + if len(start_pos.shape) != 1: + start_pos = start_pos.squeeze() + if len(end_pos.shape) != 1: + end_pos = end_pos.squeeze() + loss = self.criterion((start_logits, end_logits, cls_logits), (start_pos, end_pos, cls_label)) + else: + loss = None + + # return start_logit, end_logits + return loss, start_logits, end_logits, cls_logits + + def forward_interpret(self, *args, **kwargs): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + start_pos = kwargs.pop("start_pos", None) + end_pos = kwargs.pop("end_pos", None) + cls_label = kwargs.pop("labels", None) + + # sequence_output, pooled_output, _, _ = self.roberta( + # input_ids, + # token_type_ids=token_type_ids, + # position_ids=None, + # attention_mask=None) + # print(kwargs) + sequence_output, pooled_output, att_weights_list, embedding_output = self.roberta(*args, **kwargs) + + logits = self.classifier(sequence_output) # (bsz, seq, 2) + logits = paddle.transpose(logits, perm=[2, 0, 1]) # (2, bsz, seq) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + cls_logits = self.classifier_cls(pooled_output) + + if start_pos is not None and end_pos is not None: + if len(start_pos.shape) != 1: + start_pos = start_pos.squeeze() + if len(end_pos.shape) != 1: + end_pos = end_pos.squeeze() + loss = self.criterion((start_logits, end_logits, cls_logits), (start_pos, end_pos, cls_label)) + else: + loss = None + + # return start_logit, end_logits + return loss, start_logits, end_logits, cls_logits, att_weights_list, embedding_output + + +class CrossEntropyLossForChecklist(nn.Layer): + def __init__(self): + super(CrossEntropyLossForChecklist, self).__init__() + + def forward(self, y, label): + start_logits, end_logits, cls_logits = y # [(bsz, seq), (bsz, seq), (bsz, 2)] + start_position, end_position, answerable_label = label # [(bsz), (bsz), (bsz)] + + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + answerable_label = paddle.unsqueeze(answerable_label, axis=-1) + + start_loss = nn.functional.cross_entropy(input=start_logits, label=start_position, soft_label=False) + end_loss = nn.functional.cross_entropy(input=end_logits, label=end_position, soft_label=False) + cls_loss = nn.functional.cross_entropy(input=cls_logits, label=answerable_label, soft_label=False) + + mrc_loss = (start_loss + end_loss) / 2 + loss = (mrc_loss + cls_loss) / 2 + return loss + + +class RobertaForSequenceClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + self.softmax = nn.Softmax() + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. 
+ token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input text classification logits. + Its data type should be float32 and it has a shape of [batch_size, num_classes]. + + Example: + .. code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + _, pooled_output, _, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + def forward_interpet( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + _, pooled_output, att_weights_list, embedding_output = self.roberta( + input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + noise=noise, + i=i, + n_samples=n_samples, + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + probs = self.softmax(logits) + + return probs, att_weights_list, embedding_output + + +class RobertaForTokenClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForTokenClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input token classification logits. + Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForTokenClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + sequence_output, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + return logits diff --git a/examples/model_interpretation/task/mrc/run_1_predict_rc.sh b/examples/model_interpretation/task/mrc/run_1_predict_rc.sh new file mode 100644 index 0000000000000000000000000000000000000000..1039a2e085464223f256782601cbe4abc181d0ff --- /dev/null +++ b/examples/model_interpretation/task/mrc/run_1_predict_rc.sh @@ -0,0 +1,51 @@ +### + # This file contains script to run prediction of a specific baseline model and language on given input data + # The result of this script will be used to evaluate the performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=7 +export PYTHONPATH=./:$PYTHONPATH + +LANGUAGE=ch # LANGUAGE choose in [en, ch] +BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large] + +if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3epoch + #CKPT=models/roberta_base_ch_20211220_202953/ckpt.bin #new fine_tune + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + # CKPT=models/ernie_large_DuReader-Checklist_20211007_163424/ckpt.bin # 3 epoch F1: 63.465 EM: 52.832 + # CKPT=models/ernie_large_DuReader-Checklist_20211009_115837/ckpt.bin # 4 epoch F1: 63.323 EM: 52.920 + # CKPT=models/ernie_large_DuReader-Checklist_20211009_142730/ckpt.bin # 3 epoch F1: 66.613 EM: 57.168 + CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin + #CKPT=models/roberta_large_ch_20211220_203809/ckpt.bin #new fine_tune + fi +elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin + #CKPT=models/roberta_base_en_20211221_201720/ckpt.bin #new fine_tune + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin + #CKPT=models/roberta_large_en_20211223_114421/ckpt.bin #new fine_tune + fi +fi + +OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL} +[ -d $OUTPUT ] || mkdir -p $OUTPUT +set -x +python3 ./saliency_map/rc_prediction.py \ + --base_model $BASE_MODEL \ + --data_dir ../../data/mrc_${LANGUAGE} \ + --from_pretrained $FROM_PRETRAIN \ + --init_checkpoint $CKPT \ + --output_dir $OUTPUT \ + --n-samples 300 \ + --doc_stride 128 \ + --language $LANGUAGE \ + --max_seq_len 384 \ + --batch_size 32 \ + --epoch 2 \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/run_1_predict_rc_all.sh b/examples/model_interpretation/task/mrc/run_1_predict_rc_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..b504072d49bd9f997201b2ff5d699d70fdfadc43 --- /dev/null +++ b/examples/model_interpretation/task/mrc/run_1_predict_rc_all.sh @@ -0,0 +1,57 @@ +### + # This file contains script to run predictions 
of all baseline models and languages on given input data + # The result of this script will be used to evaluate the performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=4 +export PYTHONPATH=./:$PYTHONPATH + +for BASE_MODEL in "roberta_base" "roberta_large"; +do + for LANGUAGE in "ch" "en"; + do + if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3epoch + #CKPT=models/roberta_base_ch_20211220_202953/ckpt.bin #new fine_tune + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + # CKPT=models/ernie_large_DuReader-Checklist_20211007_163424/ckpt.bin # 3 epoch F1: 63.465 EM: 52.832 + # CKPT=models/ernie_large_DuReader-Checklist_20211009_115837/ckpt.bin # 4 epoch F1: 63.323 EM: 52.920 + # CKPT=models/ernie_large_DuReader-Checklist_20211009_142730/ckpt.bin # 3 epoch F1: 66.613 EM: 57.168 + CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin + #CKPT=models/roberta_large_ch_20211220_203809/ckpt.bin #new fine_tune + fi + elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin + #CKPT=models/roberta_base_en_20211221_201720/ckpt.bin #new fine_tune + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin + #CKPT=models/roberta_large_en_20211223_114421/ckpt.bin #new fine_tune + fi + fi + + OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL} + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + + if [[ ! -f ${OUTPUT}/predict_feature_index ]]; then + python3 ./saliency_map/rc_prediction.py \ + --base_model $BASE_MODEL \ + --data_dir ../../data/mrc_${LANGUAGE} \ + --from_pretrained $FROM_PRETRAIN \ + --init_checkpoint $CKPT \ + --output_dir $OUTPUT \ + --n-samples 300 \ + --doc_stride 128 \ + --language $LANGUAGE \ + --max_seq_len 384 \ + --batch_size 32 \ + --epoch 2 + fi + done +done \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/run_2_inter_rc.sh b/examples/model_interpretation/task/mrc/run_2_inter_rc.sh new file mode 100644 index 0000000000000000000000000000000000000000..5f038bfcaa987a894298c1f91d98380aed7e699c --- /dev/null +++ b/examples/model_interpretation/task/mrc/run_2_inter_rc.sh @@ -0,0 +1,53 @@ +### + # This file contains script to generate saliency map of a specific baseline model and language on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=4 +export PYTHONPATH=./:$PYTHONPATH + +TASK=mrc +LANGUAGE=en # LANGUAGE choose in [ch, en] +BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large] +INTER_MODE=integrated_gradient # INTER_MODE choice in [attention, integrated_gradient] +START=0 + +if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3 epoch + + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin # 3 epoch + fi +elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin + + elif [[ $BASE_MODEL == 
"roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin + fi +fi + + +OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL} +[ -d $OUTPUT ] || mkdir -p $OUTPUT +set -x +python3 ./saliency_map/rc_interpretable.py \ + --ans_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_ans\ + --ans_idx_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_feature_index\ + --base_model $BASE_MODEL \ + --data_dir ../../data/mrc_${LANGUAGE} \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE\ + --output_dir $OUTPUT \ + --n-samples 300 \ + --doc_stride 128 \ + --start_step $START \ + --language $LANGUAGE \ + --num_classes 2 \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/run_2_inter_rc_all.sh b/examples/model_interpretation/task/mrc/run_2_inter_rc_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..5908512f7ba94872fbc54ca5601c79629e38b95a --- /dev/null +++ b/examples/model_interpretation/task/mrc/run_2_inter_rc_all.sh @@ -0,0 +1,61 @@ +### + # This file contains script to generate saliency map of all baseline models and languages on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=6 +export PYTHONPATH=./:$PYTHONPATH + +START=0 +TASK=mrc +for BASE_MODEL in "roberta_base" "roberta_large"; +do + for INTER_MODE in "attention" "integrated_gradient"; + do + for LANGUAGE in "ch" "en"; + do + if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3 epoch + + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin # 3 epoch + fi + elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin + + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin + fi + fi + + + OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL} + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + + if [[ ! -f ${OUTPUT}/interpret.${INTER_MODE} ]]; then + python3 ./saliency_map/rc_interpretable.py \ + --ans_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_ans\ + --ans_idx_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_feature_index\ + --base_model $BASE_MODEL \ + --data_dir ../../data/mrc_${LANGUAGE} \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE\ + --output_dir $OUTPUT \ + --n-samples 300 \ + --doc_stride 128 \ + --start_step $START \ + --language $LANGUAGE\ + --num_classes 2 + fi + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/run_train_rc.sh b/examples/model_interpretation/task/mrc/run_train_rc.sh new file mode 100644 index 0000000000000000000000000000000000000000..ff7d95db9342b5bdd4519042b1cf377cbfd445ca --- /dev/null +++ b/examples/model_interpretation/task/mrc/run_train_rc.sh @@ -0,0 +1,51 @@ +### + # This script is used to run fine-tunning of mrc roberta models. 
+### + +export CUDA_VISIBLE_DEVICES=7 +export PYTHONPATH=.:$PYTHONPATH + +LANGUAGE=ch # LANGUAGE choose in [ch, en] +BASE_MODEL=roberta_base # chooices [roberta_base, roberta_large] + +[ -d "logs" ] || mkdir -p "logs" +set -x + +if [[ $LANGUAGE == "ch" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-wwm-ext + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-wwm-ext-large + fi + EPOCH=3 + BSZ=2 + LR=3e-5 + MAX_SEQLEN=512 + DATA=DuReader-Checklist +elif [[ $LANGUAGE == 'en' ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + fi + EPOCH=2 + BSZ=16 + LR=5e-6 + MAX_SEQLEN=384 + DATA=squad2 +fi + +timestamp=`date +"%Y%m%d_%H%M%S"` +python3 saliency_map/rc_finetune.py \ + --train_data_dir ./data/$DATA/train/train.json \ + --dev_data_dir ./data/$DATA/dev/dev.json \ + --max_steps -1 \ + --from_pretrained $FROM_PRETRAIN \ + --epoch $EPOCH \ + --bsz $BSZ \ + --lr $LR \ + --max_seq_len $MAX_SEQLEN \ + --save_dir models/${BASE_MODEL}_${LANGUAGE}_${timestamp} \ + --language $LANGUAGE \ + --init_checkpoint models/${BASE_MODEL}_${LANGUAGE}_${timestamp}/ckpt.bin >> logs/log_${BASE_MODEL}_$timestamp 2>&1 + \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/saliency_map/rc_finetune.py b/examples/model_interpretation/task/mrc/saliency_map/rc_finetune.py new file mode 100644 index 0000000000000000000000000000000000000000..f676df12d781da7e11ddf1af132da4748cd9ba54 --- /dev/null +++ b/examples/model_interpretation/task/mrc/saliency_map/rc_finetune.py @@ -0,0 +1,280 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
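+
+# Fine-tunes the MRC baseline: RobertaForQuestionAnswering with span start/end heads plus
+# an answerable/unanswerable head trained via CrossEntropyLossForChecklist. Normally
+# launched through run_train_rc.sh, which selects DuReader-Checklist (ch) or squad2 (en)
+# data and the matching per-language hyper-parameters.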
+ +import argparse +import logging +import os +import re +import sys +import time +from pathlib import Path + +import paddle +from paddle.io import DataLoader +from roberta.modeling import RobertaForQuestionAnswering +from saliency_map.utils import create_if_not_exists, get_warmup_and_linear_decay +from squad import DuReaderChecklist +from visualdl import LogWriter + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, +) + +sys.path.remove("../../..") + +log = logging.getLogger(__name__) +log.setLevel(logging.DEBUG) +logging.getLogger().setLevel(logging.DEBUG) + + +def get_args(): + parser = argparse.ArgumentParser("mrc task with roberta") + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument("--bsz", type=int, default=32, help="batchsize") + parser.add_argument("--epoch", type=int, default=3, help="epoch") + parser.add_argument("--train_data_dir", type=str, required=True, help="train data file") + parser.add_argument("--dev_data_dir", type=str, required=True, help="develop data file") + parser.add_argument( + "--max_steps", type=int, required=True, help="max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE" + ) + parser.add_argument("--warmup_proportion", type=float, default=0.1) + parser.add_argument("--lr", type=float, default=5e-5, help="learning rate") + parser.add_argument("--save_dir", type=Path, required=True, help="model output directory") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument("--language", type=str, required=True, help="language that the model based on") + args = parser.parse_args() + return args + + +def map_fn_DuCheckList_finetune(examples): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + questions = [examples[i]["question"] for i in range(len(examples))] + contexts = [examples[i]["context"] + examples[i]["title"] for i in range(len(examples))] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + for i, tokenized_example in enumerate(tokenized_examples): + + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_example["input_ids"] # list(seq) + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). 
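+        # Here token_type_ids double as "sequence ids": with the Chinese (BERT-style) tokenizer they are
+        # 0 over the question segment and 1 over the context, which is what the `ch` branch below relies on.
+        # The English BPE tokenizer does not provide that split, so the `en` branch instead finds the context
+        # boundary from the (0, 0) offsets of the special tokens.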
+ sequence_ids = tokenized_example["token_type_ids"] # list(seq) + + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offsets = tokenized_example["offset_mapping"] # list(seq) + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] # int + if args.language == "ch": + answers = examples[sample_index]["answers"] # list + answer_starts = examples[sample_index]["answer_starts"] # list + else: + example = examples[sample_index] + example["question_len"] = len(example["question"].split()) + example["context_len"] = len(example["context"].split()) + + answers = example["answers"] # list + answer_starts = example["answer_starts"] # list + + # If no answers are given, set the cls_index as answer. + if len(answer_starts) == 0: + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + tokenized_examples[i]["answerable_label"] = 0 + else: + # Start/end character index of the answer in the text. + start_char = answer_starts[0] + end_char = start_char + len(answers[0]) + if args.language == "en": + # Start token index of the current span in the text. + token_start_index = 0 + while not (offsets[token_start_index] == (0, 0) and offsets[token_start_index + 1] == (0, 0)): + token_start_index += 1 + token_start_index += 2 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 2 + else: + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 2 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples[i]["start_positions"] = cls_index + tokenized_examples[i]["end_positions"] = cls_index + tokenized_examples[i]["answerable_label"] = 0 + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). 
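+                    # Walk token_start_index right until its token starts after start_char, then step back one
+                    # token; walk token_end_index left until its token ends before end_char, then step forward one
+                    # token. The recorded span is thus the tightest token range that covers the answer characters.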
+ while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples[i]["start_positions"] = token_start_index - 1 + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples[i]["end_positions"] = token_end_index + 1 + tokenized_examples[i]["answerable_label"] = 1 + + return tokenized_examples + + +if __name__ == "__main__": + args = get_args() + + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained, num_classes=2) + + train_ds = DuReaderChecklist().read(args.train_data_dir) + dev_ds = DuReaderChecklist().read(args.dev_data_dir) + + train_ds.map(map_fn_DuCheckList_finetune, batched=True) + dev_ds.map(map_fn_DuCheckList_finetune, batched=True) + + log.debug("train set: %d" % len(train_ds)) + log.debug("dev set: %d" % len(dev_ds)) + + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.bsz, shuffle=True) + dev_batch_sample = paddle.io.DistributedBatchSampler(dev_ds, batch_size=args.bsz, shuffle=False) + + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "start_positions": Stack(dtype="int64"), + "end_positions": Stack(dtype="int64"), + "answerable_label": Stack(dtype="int64"), + } + ): fn(samples) + + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sample, collate_fn=batchify_fn, return_list=True + ) + + max_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epoch + lr_scheduler = paddle.optimizer.lr.LambdaDecay( + args.lr, get_warmup_and_linear_decay(max_steps, int(args.warmup_proportion * max_steps)) + ) + + param_name_to_exclue_from_weight_decay = re.compile(r".*layer_norm_scale|.*layer_norm_bias|.*b_0") + + opt = paddle.optimizer.AdamW( + lr_scheduler, + parameters=model.parameters(), + weight_decay=args.wd, + apply_decay_param_fun=lambda n: not param_name_to_exclue_from_weight_decay.match(n), + grad_clip=paddle.nn.ClipGradByGlobalNorm(1.0) if args.language == "ch" else None, + ) + + scaler = paddle.amp.GradScaler(enable=args.use_amp) + + with LogWriter(logdir=str(create_if_not_exists(args.save_dir / "vdl"))) as log_writer: + with paddle.amp.auto_cast(enable=args.use_amp): + max_acc = 0.0 + log.debug("start training...") + for epoch in range(args.epoch): + s_time = time.time() + for step, d in enumerate(train_data_loader, start=1): + # input_ids: paddle.Tensor(bsz, seq) + # token_type_ids: paddle.Tensor(bsz, seq) + # start_positions: paddle.Tensor(bsz) + # end_positions: paddle.Tensor(bsz) + # answerable_label: paddle.Tensor(bsz) + input_ids, token_type_ids, start_positions, end_positions, answerable_label = d + loss, _, _, _ = model( + input_ids=input_ids, + token_type_ids=token_type_ids, + start_pos=start_positions, + end_pos=end_positions, + labels=answerable_label, + ) + loss = scaler.scale(loss) + loss.backward() + scaler.minimize(opt, loss) + opt.clear_grad() + lr_scheduler.step() + + if step % 100 == 0: + _lr = lr_scheduler.get_lr() + time_cost = time.time() - s_time + s_time = time.time() + if args.use_amp: + _l = (loss / scaler._scale).numpy() + msg 
= "[epoch-%d step-%d] train loss %.5f lr %.3e scaling %.3e" % ( + epoch, + step, + _l, + _lr, + scaler._scale.numpy(), + ) + else: + _l = loss.numpy() + msg = "[epoch-%d step-%d] train loss %.5f lr %.3e time_cost: %.1fs" % ( + epoch, + step, + _l, + _lr, + time_cost, + ) + log.debug(msg) + log_writer.add_scalar("loss", _l, step=step) + log_writer.add_scalar("lr", _lr, step=step) + + if step % 1000 == 0: + if args.save_dir is not None: + paddle.save(model.state_dict(), os.path.join(args.save_dir, "ckpt.bin")) + log.debug("save model!") + + if args.save_dir is not None: + paddle.save(model.state_dict(), os.path.join(args.save_dir, "ckpt.bin")) + log.debug("save model!") diff --git a/examples/model_interpretation/task/mrc/saliency_map/rc_interpretable.py b/examples/model_interpretation/task/mrc/saliency_map/rc_interpretable.py new file mode 100644 index 0000000000000000000000000000000000000000..7df2bc45d51f23ca1ba38ec9149621aa624cfb24 --- /dev/null +++ b/examples/model_interpretation/task/mrc/saliency_map/rc_interpretable.py @@ -0,0 +1,497 @@ +# !/usr/bin/env python3 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import collections +import json +import logging +import os +import sys +from functools import partial +from pathlib import Path + +import paddle +from roberta.modeling import RobertaForQuestionAnswering +from squad import RCInterpret +from tqdm import tqdm + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, + match, +) + +sys.path.remove("../../..") + +log = logging.getLogger(__name__) +log.setLevel(logging.DEBUG) +logging.getLogger().setLevel(logging.DEBUG) + + +def get_args(): + parser = argparse.ArgumentParser("mrc task with roberta") + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=512, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=32, help="batchsize") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument( + "--inter_mode", + type=str, + default="attention", + choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], + help="appoint the mode of interpretable.", + ) + parser.add_argument("--n-samples", type=int, default=25, 
help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument("--start_step", type=int, default=0, help="start from which instance") + parser.add_argument("--language", type=str, required=True, help="language that the model based on") + parser.add_argument( + "--ans_path", + type=str, + required=True, + help="the path of the file which stores the predicted answer from last step", + ) + parser.add_argument( + "--ans_idx_path", + type=str, + required=True, + help="the path of the file which stores the predicted answer index from last step", + ) + parser.add_argument("--num_classes", type=int, required=True, help="number of class") + args = parser.parse_args() + return args + + +def truncate_offset(seg, start_offset, end_offset): + seg_len = len(seg) + for n in range(len(start_offset) - 1, -1, -1): + if start_offset[n] < seg_len: + end_offset[n] = seg_len + break + start_offset.pop(n) + end_offset.pop(n) + + +def map_fn_DuCheckList(examples, args, tokenizer): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + if args.language == "en": + questions = [ + examples[i]["question"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples)) + ] + contexts = [ + examples[i]["context"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples)) + ] + else: + questions = [examples[i]["question"] for i in range(len(examples))] + contexts = [examples[i]["context"] for i in range(len(examples))] + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + log.debug("\nexample: %d" % len(examples)) + log.debug("feature: %d\n" % len(tokenized_examples)) + + # For validation, there is no need to compute start and end positions + for i, tokenized_example in enumerate(tokenized_examples): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + # One example can give several spans, this is the index of the example containing this span of text. 
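+        # overflow_to_sample links every chunked feature back to its source example, so the example-level
+        # fields (id, question, context, sent_token) are copied onto each feature for the evidence matching
+        # performed after interpretation.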
+ sample_index = tokenized_example["overflow_to_sample"] + tokenized_examples[i]["example_id"] = examples[sample_index]["id"] + tokenized_examples[i]["question"] = examples[sample_index]["question"] + tokenized_examples[i]["context"] = examples[sample_index]["context"] + tokenized_examples[i]["sent_token"] = examples[sample_index]["sent_token"] + + return tokenized_examples + + +def init_roberta_var(args): + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + + model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained, num_classes=args.num_classes) + map_fn = partial(map_fn_DuCheckList, args=args, tokenizer=tokenizer) + dev_ds = RCInterpret().read(args.data_dir) + + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "overflow_to_sample": Stack(dtype="int32"), + } + ): fn(samples) + + dev_dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dev_dataloader, dev_ds + + +def ch_per_example( + args, + scores_in_one_example, + prev_context_tokens, + dev_ds, + prev_example_idx, + ans_dic, + ans_idx_dic, + offset, + out_handle, +): + total_score = scores_in_one_example[-1] + assert len(prev_context_tokens) == len(total_score) + token_score_dict = [] + for idx in range(len(total_score)): + token_score_dict.append([idx, offset[idx], total_score[idx]]) + + prev_example = dev_ds.data[prev_example_idx] + char_attribution_dict = match( + prev_example["context"] + prev_example["title"], prev_example["sent_token"], token_score_dict + ) + result["id"] = prev_example["id"] + result["question"] = prev_example["question"] + result["title"] = prev_example["title"] + result["context"] = prev_example["context"] + prev_example["title"] + result["pred_label"] = ans_dic[str(result["id"])] + result["pred_feature"] = ans_idx_dic[str(result["id"])] + + result["char_attri"] = collections.OrderedDict() + for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): + result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + + +def en_per_example(inter_score, result, ans_dic, ans_idx_dic, offset, out_handle): + sorted_token = [] + for i in range(len(inter_score)): + sorted_token.append([i, offset[i], inter_score[i]]) + char_attribution_dict = match(result["context"], result["sent_token"], sorted_token) + + result["pred_label"] = ans_dic[str(result["id"])] + result["pred_feature"] = ans_idx_dic[str(result["id"])] + result["char_attri"] = collections.OrderedDict() + for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): + result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("sent_token") + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + + +def load_pred_data(ans_path, ans_idx_path): + f = open(ans_path, "r") + ans_dic = json.loads(f.read()) + f.close() + f = open(ans_idx_path, "r") + ans_idx_dic = json.loads(f.read()) + f.close() + return ans_dic, 
ans_idx_dic + + +def extract_attention_scores( + args, + model, + result, + fwd_args, + fwd_kwargs, + prev_example_idx, + example_idx, + prev_context_tokens, + scores_in_one_example, + dev_ds, + ans_dic, + ans_idx_dic, + context_tokens, + offset, + prev_offset, + out_handle, +): + with paddle.no_grad(): + # start_logits: (bsz, seq); end_logits: (bsz, seq); cls_logits: (bsz, 2) + # attention: list((bsz, head, seq, seq) * 12); embedded: (bsz, seq, emb) + _, start_logits, end_logits, cls_logits, attentions, embedded = model.forward_interpret( + *fwd_args, **fwd_kwargs + ) + + # Attention score equals to the mean of attention of each token in the question + attentions = attentions[-1][:, :, 1:SEP_idx, :].mean(2).mean(1) # attentions: (bsz, seq_len) + context_score = attentions[0, SEP_idx + add_idx : -1] # context_score: Tensor(context) + context_norm_score = context_score / context_score.sum(-1) + + if args.language == "ch": + if prev_example_idx is None or prev_example_idx == example_idx: + scores_in_one_example.append(context_norm_score.numpy().tolist()) + else: + ch_per_example( + args, + scores_in_one_example, + prev_context_tokens, + dev_ds, + prev_example_idx, + ans_dic, + ans_idx_dic, + prev_offset, + out_handle, + ) + scores_in_one_example = [context_norm_score.numpy().tolist()] + prev_example_idx = example_idx + prev_context_tokens = context_tokens + prev_offset = offset + else: + en_per_example(context_norm_score, result, ans_dic, ans_idx_dic, offset, out_handle) + return prev_example_idx, prev_context_tokens, scores_in_one_example, prev_offset + + +def extract_integrated_gradient_scores( + args, + dev_ds, + model, + result, + fwd_args, + fwd_kwargs, + SEP_idx, + add_idx, + prev_example_idx, + example_idx, + scores_in_one_example, + prev_context_tokens, + ans_dic, + ans_idx_dic, + context_tokens, + offset, + prev_offset, + out_handle, +): + embedded_grads_list = [] # [Tensor(1, seq_len, embed_size)] + with open(os.path.join(args.output_dir, "predict_feature_index"), "r") as f_feature_index: + feature_index_dict = json.load(f_feature_index) + example = dev_ds.data[example_idx] + example_id = example["id"] + start_index, end_index = feature_index_dict[str(example_id)] + + for i in range(args.n_samples): + # embedded_start_grad + # start_logits: (bsz, seq); embedded: (bsz, seq, emb) + _, start_logits, _, _, _, embedded = model.forward_interpret( + *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples + ) + + start_logit = start_logits[:, start_index].sum() + start_logit.backward(retain_graph=False) + embedded_start_grad = embedded.grad + model.clear_gradients() + # embedded_end_grad + # end_logits: (bsz, seq); embedded: (bsz, seq, emb) + _, _, end_logits, _, _, embedded = model.forward_interpret( + *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples + ) + end_logit = end_logits[:, end_index].sum() + end_logit.backward(retain_graph=False) + embedded_end_grad = embedded.grad + model.clear_gradients() + + embedded_grad = (embedded_start_grad + embedded_end_grad) / 2 + embedded_grads_list.append(embedded_grad) + + if i == 0: + baseline_embedded = embedded # Tensor(1, seq_len, embed_size) + elif i == args.n_samples - 1: + pred_embedded = embedded # Tensor(1, seq_len, embed_size) + + embedded_grads_tensor = paddle.to_tensor( + embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True + ) + + trapezoidal_grads = ( + embedded_grads_tensor[1:] + embedded_grads_tensor[:-1] + ) / 2 # Tensor(n_samples-1, 1, seq_len, embed_size) + 
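+        # Integrated Gradients: the gradients collected at n_samples points along the straight-line path from
+        # the baseline embedding to the actual embedding are averaged with the trapezoidal rule to approximate
+        # the path integral; multiplying by (pred_embedded - baseline_embedded) below distributes the score
+        # difference over the embedding dimensions.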
integral_grads = trapezoidal_grads.sum(0) / trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size)xw + + inter_score = (pred_embedded - baseline_embedded) * integral_grads # Tensor(1, seq_len, embed_size) + inter_score = inter_score.sum(-1) # Tensor(1, seq_len) + inter_score.stop_gradient = True + + context_score = inter_score[0, SEP_idx + add_idx : -1] + context_norm_score = context_score / context_score.sum(-1) + if args.language == "ch": + if prev_example_idx is None or prev_example_idx == example_idx: + scores_in_one_example.append(context_norm_score.numpy().tolist()) + else: + ch_per_example( + args, + scores_in_one_example, + prev_context_tokens, + dev_ds, + prev_example_idx, + ans_dic, + ans_idx_dic, + prev_offset, + out_handle, + ) + scores_in_one_example = [context_norm_score.numpy().tolist()] + prev_example_idx = example_idx + prev_context_tokens = context_tokens + prev_offset = offset + else: + en_per_example(context_norm_score, result, ans_dic, ans_idx_dic, offset, out_handle) + return prev_example_idx, prev_context_tokens, scores_in_one_example, prev_offset + + +if __name__ == "__main__": + args = get_args() + if args.language == "ch": + add_idx = 1 + else: + add_idx = 2 + + ans_dic, ans_idx_dic = load_pred_data(args.ans_path, args.ans_idx_path) + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader, dev_ds = init_roberta_var(args) + else: + raise ValueError("unsupported base model name.") + + with paddle.amp.auto_cast(enable=args.use_amp), open( + os.path.join(args.output_dir, "interpret" + f".{args.inter_mode}"), "w" + ) as out_handle: + + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + log.debug("load model from %s" % args.init_checkpoint) + + err_total = [] + lime_score_total = [] + lime_relative_err_total = [] + lime_err_total = [] + + # Second forward: evidence extraction + scores_in_one_example = [] + prev_example_idx = None + prev_context_tokens = None + prev_offset = None + + get_subword_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))) + for step, d in tqdm(enumerate(dataloader)): + if step < args.start_step: + continue + + model.train() + + result = {} + input_ids, segment_ids, offset_map, example_idx = d + fwd_args = [input_ids, segment_ids] + fwd_kwargs = {} + + SEP_idx = input_ids.numpy()[0].tolist().index(tokenizer.sep_token_id) + context_ids = input_ids[0, SEP_idx + add_idx : -1] + offset = offset_map[0, SEP_idx + add_idx : -1] + context_tokens = tokenizer.convert_ids_to_tokens(context_ids.numpy().tolist()) + + if args.language == "en": + example = dev_ds.data[step] + result["id"] = example["id"] + result["question"] = example["question"] + result["title"] = example["title"] + result["context"] = example["context"] + example["title"] + result["sent_token"] = example["sent_token"] + + if args.inter_mode == "attention": + prev_example_idx, prev_context_tokens, scores_in_one_example, prev_offset = extract_attention_scores( + args, + model, + result, + fwd_args, + fwd_kwargs, + prev_example_idx, + example_idx, + prev_context_tokens, + scores_in_one_example, + dev_ds, + ans_dic, + ans_idx_dic, + context_tokens, + offset, + prev_offset, + out_handle, + ) + + elif args.inter_mode == "integrated_gradient": + ( + prev_example_idx, + prev_context_tokens, + scores_in_one_example, + prev_offset, + ) = extract_integrated_gradient_scores( + args, + dev_ds, + model, + result, + fwd_args, + fwd_kwargs, + SEP_idx, + add_idx, + prev_example_idx, + example_idx, + scores_in_one_example, + 
prev_context_tokens, + ans_dic, + ans_idx_dic, + context_tokens, + offset, + prev_offset, + out_handle, + ) + else: + raise KeyError(f"Unkonwn interpretable mode: {args.inter_mode}") + + # Deal with last example + if args.language == "ch": + + feature = dev_ds.new_data[-1] + input_ids = feature["input_ids"] + SEP_idx = input_ids.index(tokenizer.sep_token_id) + context_ids = input_ids[SEP_idx + 1 : -1] + offset = feature["offset_mapping"][SEP_idx + 1 : -1] + context_tokens = tokenizer.convert_ids_to_tokens(context_ids) + + ch_per_example( + args, scores_in_one_example, context_tokens, dev_ds, -1, ans_dic, ans_idx_dic, offset, out_handle + ) diff --git a/examples/model_interpretation/task/mrc/saliency_map/rc_prediction.py b/examples/model_interpretation/task/mrc/saliency_map/rc_prediction.py new file mode 100644 index 0000000000000000000000000000000000000000..c1557de19d109b47884993a40c26d569670559a2 --- /dev/null +++ b/examples/model_interpretation/task/mrc/saliency_map/rc_prediction.py @@ -0,0 +1,195 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import logging +import os +import sys +import time +from functools import partial +from pathlib import Path + +import paddle +from roberta.modeling import RobertaForQuestionAnswering +from squad import RCInterpret, compute_prediction + +from paddlenlp.data import Dict, Pad +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, +) + +sys.path.remove("../../..") + +log = logging.getLogger(__name__) +log.setLevel(logging.DEBUG) +logging.getLogger().setLevel(logging.DEBUG) + + +def get_args(): + parser = argparse.ArgumentParser("mrc task with roberta") + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=32, help="batchsize") + parser.add_argument("--epoch", type=int, default=3, help="epoch") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument( + "--doc_stride", + type=int, + default=128, + help="When 
splitting up a long document into chunks, how much stride to take between chunks.", + ) + parser.add_argument("--language", type=str, required=True, help="language that the model based on") + args = parser.parse_args() + return args + + +def map_fn_DuCheckList(examples, args, tokenizer): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + if args.language == "en": + contexts = [ + examples[i]["context"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples)) + ] + questions = [ + examples[i]["question"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples)) + ] + else: + contexts = [examples[i]["context"] for i in range(len(examples))] + questions = [examples[i]["question"] for i in range(len(examples))] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + # For validation, there is no need to compute start and end positions + for i, tokenized_example in enumerate(tokenized_examples): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_example["token_type_ids"] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = tokenized_example["overflow_to_sample"] + tokenized_examples[i]["example_id"] = examples[sample_index]["id"] + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
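+        # For Chinese, token_type_ids reliably mark the context (== 1), so non-context offsets are masked with
+        # None directly. For English, the BPE tokenizer's token_type_ids are not informative, so the context
+        # window is located positionally: it starts two tokens after the first (0, 0) special-token offset and
+        # ends just before the final special token; everything outside that window is masked with None.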
+ if args.language == "ch": + tokenized_examples[i]["offset_mapping"] = [ + (o if sequence_ids[k] == 1 else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + else: + n = tokenized_example["offset_mapping"].index((0, 0), 1) + 2 # context start position + m = len(tokenized_example["offset_mapping"]) - 1 # context end position + 1 + tokenized_examples[i]["offset_mapping"] = [ + (o if n <= k <= m else None) for k, o in enumerate(tokenized_example["offset_mapping"]) + ] + return tokenized_examples + + +def init_roberta_var(args): + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + + model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained) + map_fn = partial(map_fn_DuCheckList, args=args, tokenizer=tokenizer) + dev_ds = RCInterpret().read(args.data_dir) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + } + ): fn(samples) + + dev_dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dev_dataloader, dev_ds + + +@paddle.no_grad() +def evaluate(model, data_loader, args): + model.eval() + + all_start_logits = [] + all_end_logits = [] + tic_eval = time.time() + + for batch in data_loader: + input_ids, token_type_ids = batch + loss, start_logits_tensor, end_logits_tensor, cls_logits = model(input_ids, token_type_ids) + for idx in range(start_logits_tensor.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + log.debug("Processing example: %d" % len(all_start_logits)) + log.debug("time per 1000:%.1f" % (time.time() - tic_eval)) + tic_eval = time.time() + + all_start_logits.append(start_logits_tensor.numpy()[idx]) + all_end_logits.append(end_logits_tensor.numpy()[idx]) + + all_predictions, all_nbest_json, scores_diff_json, all_feature_index = compute_prediction( + data_loader.dataset.data, + data_loader.dataset.new_data, + (all_start_logits, all_end_logits), + True, + 20, + args.max_seq_len, + 0.0, + ) + + # Can also write all_nbest_json and scores_diff_json files if needed + with open(os.path.join(args.output_dir, "predict_ans"), "w") as f_ans_pred: + f_ans_pred.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + with open(os.path.join(args.output_dir, "predict_feature_index"), "w") as f_feature_index: + f_feature_index.write(json.dumps(all_feature_index, ensure_ascii=False, indent=4) + "\n") + + # squad_evaluate(examples=data_loader.dataset.data, preds=all_predictions, na_probs=scores_diff_json) + # model.train() + + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader, dev_ds = init_roberta_var(args) + else: + raise ValueError("unsupported base model name.") + + with paddle.amp.auto_cast(enable=args.use_amp): + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + log.debug("load model from %s" % args.init_checkpoint) + evaluate(model, dataloader, args) diff --git a/examples/model_interpretation/task/mrc/saliency_map/squad.py b/examples/model_interpretation/task/mrc/saliency_map/squad.py new file mode 100644 index 
0000000000000000000000000000000000000000..3ae811de5e5bddc666c36274dbb25c7a5224623a --- /dev/null +++ b/examples/model_interpretation/task/mrc/saliency_map/squad.py @@ -0,0 +1,476 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# !/usr/bin/env python3 +import collections +import json + +import numpy as np + +from paddlenlp.datasets import DatasetBuilder + + +class Similarity(DatasetBuilder): + # similarity test 21.10.3 + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + line_split = line.strip().split("\t") + assert len(line_split) == 3 + yield {"text_a": line_split[0], "text_b": line_split[1], "label": line_split[2]} + + +class RCInterpret(DatasetBuilder): + # interpret 21.9.24 + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + example_dic = json.loads(line) + id = example_dic["id"] + title = example_dic["title"] + context = example_dic["context"] + question = example_dic["question"] + if "sent_token" in example_dic: + sent_token = example_dic["sent_token"] + yield { + "id": id, + "title": title, + "context": context, + "question": question, + "sent_token": sent_token, + } + else: + yield {"id": id, "title": title, "context": context, "question": question} + + +class DuReaderChecklist(DatasetBuilder): + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + input_data = json.load(f)["data"] + + for entry in input_data: + # title = entry.get("title", "").strip() + for paragraph in entry["paragraphs"]: + context = paragraph["context"].strip() + title = paragraph.get("title", "").strip() + for qa in paragraph["qas"]: + qas_id = qa["id"] + question = qa["question"].strip() + answer_starts = [] + answers = [] + is_impossible = False + + if "is_impossible" in qa.keys(): + is_impossible = qa["is_impossible"] + + answer_starts = [answer["answer_start"] for answer in qa.get("answers", [])] + answers = [answer["text"].strip() for answer in qa.get("answers", [])] + + yield { + "id": qas_id, + "title": title, + "context": context, + "question": question, + "answers": answers, + "answer_starts": answer_starts, + "is_impossible": is_impossible, + } + + +def compute_prediction_checklist( + examples, + features, + predictions, + version_2_with_negative: bool = False, + n_best_size: int = 20, + max_answer_length: int = 30, + cls_threshold: float = 0.5, +): + """ + Post-processes the predictions of a question-answering model to convert them to answers that are substrings of the + original contexts. This is the base postprocessing functions for models that only return start and end logits. + + Args: + examples: The non-preprocessed dataset (see the main script for more information). + features: The processed dataset (see the main script for more information). 
+ predictions (:obj:`Tuple[np.ndarray, np.ndarray]`): + The predictions of the model: two arrays containing the start logits and the end logits respectively. Its + first dimension must match the number of elements of :obj:`features`. + version_2_with_negative (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether or not the underlying dataset contains examples with no answers. + n_best_size (:obj:`int`, `optional`, defaults to 20): + The total number of n-best predictions to generate when looking for an answer. + max_answer_length (:obj:`int`, `optional`, defaults to 30): + The maximum length of an answer that can be generated. This is needed because the start and end predictions + are not conditioned on one another. + null_score_diff_threshold (:obj:`float`, `optional`, defaults to 0): + The threshold used to select the null answer: if the best answer has a score that is less than the score of + the null answer minus this threshold, the null answer is selected for this example (note that the score of + the null answer for an example giving several features is the minimum of the scores for the null answer on + each feature: all features must be aligned on the fact they `want` to predict a null answer). + + Only useful when :obj:`version_2_with_negative` is :obj:`True`. + """ + + assert ( + len(predictions) == 3 + ), "`predictions` should be a tuple with two elements (start_logits, end_logits, cls_logits)." + all_start_logits, all_end_logits, all_cls_logits = predictions + + assert len(predictions[0]) == len(features), "Number of predictions should be equal to number of features." # 样本数 + + # Build a map example to its corresponding features. + features_per_example = collections.defaultdict(list) + for i, feature in enumerate(features): + features_per_example[feature["example_id"]].append( + i + ) # feature: dict(keys: 'input_ids', 'token_type_ids', 'offset_mapping', 'overflow_to_sample', 'example_id') + + # The dictionaries we have to fill. + all_predictions = collections.OrderedDict() + all_feature_index = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + all_cls_predictions = [] + + # Let's loop over all the examples! + for example_index, example in enumerate(examples): + # Those are the indices of the features associated to the current example. + feature_indices = features_per_example[example["id"]] + + # if len(feature_indices) > 1: + # print('example_index: %s' % example_index) + + min_null_prediction = None + prelim_predictions = [] + score_answerable = -1 + # Looping through all the features associated to the current example. + for feature_index in feature_indices: + # We grab the predictions of the model for this feature. + start_logits = all_start_logits[feature_index] + end_logits = all_end_logits[feature_index] + cls_logits = all_cls_logits[feature_index] + # This is what will allow us to map some the positions in our logits to span of texts in the original context. + offset_mapping = features[feature_index][ + "offset_mapping" + ] # list[tuple(2)], list长度与input_ids, start_logits, end_logits相同 + + # if len(feature_indices) > 1: + # print('offset_mapping: %s' % offset_mapping) + + # Optional `token_is_max_context`, if provided we will remove answers that do not have the maximum context + # available in the current feature. 
+ token_is_max_context = features[feature_index].get("token_is_max_context", None) + + exp_answerable_scores = np.exp(cls_logits - np.max(cls_logits)) + feature_answerable_score = exp_answerable_scores / exp_answerable_scores.sum() + if feature_answerable_score[-1] > score_answerable: + score_answerable = feature_answerable_score[-1] + answerable_probs = feature_answerable_score + + # Update minimum null prediction. + feature_null_score = start_logits[0] + end_logits[0] + if min_null_prediction is None or min_null_prediction["score"] > feature_null_score: + min_null_prediction = { + "feature_index": (0, 0), + "offsets": (0, 0), + "score": feature_null_score, + "start_logit": start_logits[0], + "end_logit": end_logits[0], + } + + # Go through all possibilities for the `n_best_size` greater start and end logits. + start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist() # list(n_best_size) 从大到小 + end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist() # list(n_best_size) 从大到小 + for start_index in start_indexes: + for end_index in end_indexes: + # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond + # to part of the input_ids that are not in the context. + if ( + start_index >= len(offset_mapping) + or end_index >= len(offset_mapping) + or offset_mapping[start_index] is None + or offset_mapping[end_index] is None # CLS、Question和第一个SEP的位置 + or offset_mapping[start_index] == (0, 0) + or offset_mapping[end_index] == (0, 0) # 第二个SEP的位置 + ): + continue + # Don't consider answers with a length that is either < 0 or > max_answer_length. + if end_index < start_index or end_index - start_index + 1 > max_answer_length: + continue + # Don't consider answer that don't have the maximum context available (if such information is + # provided). + if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False): + continue + prelim_predictions.append( + { + "feature_index": (start_index, end_index), + "offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]), + "score": start_logits[start_index] + end_logits[end_index], + "start_logit": start_logits[start_index], + "end_logit": end_logits[end_index], + } + ) + if version_2_with_negative: + # Add the minimum null prediction + prelim_predictions.append(min_null_prediction) + pred_cls_label = np.argmax(np.array(answerable_probs)) + all_cls_predictions.append([example["id"], pred_cls_label, answerable_probs[0], answerable_probs[1]]) + + # Only keep the best `n_best_size` predictions. + predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size] + + # Add back the minimum null prediction if it was removed because of its low score. + if version_2_with_negative and not any(p["offsets"] == (0, 0) for p in predictions): + predictions.append(min_null_prediction) + + # Use the offsets to gather the answer text in the original context. + context = example["context"] + for pred in predictions: + # offsets = pred.pop("offsets") + offsets = pred["offsets"] + pred["text"] = context[offsets[0] : offsets[1]] if context[offsets[0] : offsets[1]] != "" else "no answer" + + # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid + # failure. 
+ if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == "no answer"): + predictions.insert( + 0, + { + "feature_index": (0, 0), + "offsets": (0, 0), + "text": "no answer", + "start_logit": 0.0, + "end_logit": 0.0, + "score": 0.0, + }, + ) + + # Compute the softmax of all scores (we do it with numpy to stay independent from torch/tf in this file, using + # the LogSumExp trick). + scores = np.array([pred.pop("score") for pred in predictions]) + exp_scores = np.exp(scores - np.max(scores)) + probs = exp_scores / exp_scores.sum() + + # Include the probabilities in our predictions. + for prob, pred in zip(probs, predictions): + pred["probability"] = prob + + # Pick the best prediction. If the null answer is not possible, this is easy. + if not version_2_with_negative: + all_predictions[example["id"]] = predictions[0]["text"] + all_feature_index[example["id"]] = predictions[0]["feature_index"] + else: + # Otherwise we first need to find the best non-empty prediction. + i = 0 + while predictions[i]["text"] == "no answer": + i += 1 + best_non_null_pred = predictions[i] + + if answerable_probs[1] < cls_threshold: + all_predictions[example["id"]] = "no answer" + else: + all_predictions[example["id"]] = best_non_null_pred["text"] + all_feature_index[example["id"]] = predictions[i]["feature_index"] + + # Make `predictions` JSON-serializable by casting np.float back to float. + all_nbest_json[example["id"]] = [ + {k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()} + for pred in predictions + ] + + return all_predictions, all_nbest_json, all_cls_predictions, all_feature_index + + +def compute_prediction( + examples, + features, + predictions, + version_2_with_negative=False, + n_best_size=20, + max_answer_length=30, + null_score_diff_threshold=0.0, +): + """ + Post-processes the predictions of a question-answering model to convert + them to answers that are substrings of the original contexts. This is + the base postprocessing functions for models that only return start and + end logits. + + Args: + examples (list): List of raw squad-style data (see `run_squad.py + <https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/ + machine_reading_comprehension/SQuAD/run_squad.py>`__ for more + information). + features (list): List of processed squad-style features (see + `run_squad.py <https://github.com/PaddlePaddle/PaddleNLP/blob/ + develop/examples/machine_reading_comprehension/SQuAD/run_squad.py>`__ + for more information). + predictions (tuple): The predictions of the model. Should be a tuple + of two list containing the start logits and the end logits. + version_2_with_negative (bool, optional): Whether the dataset contains + examples with no answers. Defaults to False. + n_best_size (int, optional): The total number of candidate predictions + to generate. Defaults to 20. + max_answer_length (int, optional): The maximum length of predicted answer. + Defaults to 20. + null_score_diff_threshold (float, optional): The threshold used to select + the null answer. Only useful when `version_2_with_negative` is True. + Defaults to 0.0. + + Returns: + A tuple of three dictionaries containing final selected answer, all n_best + answers along with their probability and scores, and the score_diff of each + example. + """ + assert len(predictions) == 2, "`predictions` should be a tuple with two elements (start_logits, end_logits)." 
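+    # Unlike compute_prediction_checklist above, this variant has no answerable-classification logits: when
+    # version_2_with_negative is set, the null answer is selected purely by comparing the score difference
+    # against null_score_diff_threshold.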
+ all_start_logits, all_end_logits = predictions + + assert len(predictions[0]) == len(features), "Number of predictions should be equal to number of features." + + # Build a map example to its corresponding features. + features_per_example = collections.defaultdict(list) + for i, feature in enumerate(features): + features_per_example[feature["example_id"]].append(i) + + # The dictionaries we have to fill. + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + all_feature_index = collections.OrderedDict() + scores_diff_json = collections.OrderedDict() + + # Let's loop over all the examples! + for example_index, example in enumerate(examples): + # Those are the indices of the features associated to the current example. + feature_indices = features_per_example[example["id"]] + + min_null_prediction = None + prelim_predictions = [] + + # Looping through all the features associated to the current example. + for feature_index in feature_indices: + # We grab the predictions of the model for this feature. + start_logits = all_start_logits[feature_index] + end_logits = all_end_logits[feature_index] + # This is what will allow us to map some the positions in our logits to span of texts in the original + # context. + offset_mapping = features[feature_index]["offset_mapping"] + # Optional `token_is_max_context`, if provided we will remove answers that do not have the maximum context + # available in the current feature. + token_is_max_context = features[feature_index].get("token_is_max_context", None) + + # Update minimum null prediction. + feature_null_score = start_logits[0] + end_logits[0] + if min_null_prediction is None or min_null_prediction["score"] > feature_null_score: + min_null_prediction = { + "feature_index": (0, 0), + "offsets": (0, 0), + "score": feature_null_score, + "start_logit": start_logits[0], + "end_logit": end_logits[0], + } + + # Go through all possibilities for the `n_best_size` greater start and end logits. + start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist() + end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist() + for start_index in start_indexes: + for end_index in end_indexes: + # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond + # to part of the input_ids that are not in the context. + if ( + start_index >= len(offset_mapping) + or end_index >= len(offset_mapping) + or offset_mapping[start_index] is None + or offset_mapping[end_index] is None + or offset_mapping[start_index] == (0, 0) + or offset_mapping[end_index] == (0, 0) + ): + continue + # Don't consider answers with a length that is either < 0 or > max_answer_length. + if end_index < start_index or end_index - start_index + 1 > max_answer_length: + continue + # Don't consider answer that don't have the maximum context available (if such information is + # provided). + if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False): + continue + prelim_predictions.append( + { + "feature_index": (start_index, end_index), + "offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]), + "score": start_logits[start_index] + end_logits[end_index], + "start_logit": start_logits[start_index], + "end_logit": end_logits[end_index], + } + ) + if version_2_with_negative: + # Add the minimum null prediction + prelim_predictions.append(min_null_prediction) + null_score = min_null_prediction["score"] + + # Only keep the best `n_best_size` predictions. 
+ predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size] + + # Add back the minimum null prediction if it was removed because of its low score. + if version_2_with_negative and not any(p["offsets"] == (0, 0) for p in predictions): + predictions.append(min_null_prediction) + + # Use the offsets to gather the answer text in the original context. + context = example["context"] + for pred in predictions: + offsets = pred.pop("offsets") + pred["text"] = context[offsets[0] : offsets[1]] + + # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid + # failure. + if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == ""): + predictions.insert( + 0, {"feature_index": (0, 0), "text": "empty", "start_logit": 0.0, "end_logit": 0.0, "score": 0.0} + ) + + # Compute the softmax of all scores (we do it with numpy to stay independent from torch/tf in this file, using + # the LogSumExp trick). + scores = np.array([pred.pop("score") for pred in predictions]) + exp_scores = np.exp(scores - np.max(scores)) + probs = exp_scores / exp_scores.sum() + + # Include the probabilities in our predictions. + for prob, pred in zip(probs, predictions): + pred["probability"] = prob + + # Pick the best prediction. If the null answer is not possible, this is easy. + if not version_2_with_negative: + all_predictions[example["id"]] = predictions[0]["text"] + all_feature_index[example["id"]] = predictions[0]["feature_index"] + else: + # Otherwise we first need to find the best non-empty prediction. + i = 0 + while predictions[i]["text"] == "": + i += 1 + best_non_null_pred = predictions[i] + + # Then we compare to the null prediction using the threshold. + score_diff = null_score - best_non_null_pred["start_logit"] - best_non_null_pred["end_logit"] + scores_diff_json[example["id"]] = float(score_diff) # To be JSON-serializable. + if score_diff > null_score_diff_threshold: + all_predictions[example["id"]] = "" + else: + all_predictions[example["id"]] = best_non_null_pred["text"] + all_feature_index[example["id"]] = predictions[i]["feature_index"] + + # Make `predictions` JSON-serializable by casting np.float back to float. + all_nbest_json[example["id"]] = [ + {k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()} + for pred in predictions + ] + + return all_predictions, all_nbest_json, scores_diff_json, all_feature_index diff --git a/examples/model_interpretation/task/mrc/saliency_map/utils.py b/examples/model_interpretation/task/mrc/saliency_map/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..88c1619769ee079c715bf84044ca22e2aec95e0d --- /dev/null +++ b/examples/model_interpretation/task/mrc/saliency_map/utils.py @@ -0,0 +1,37 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
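The `get_warmup_and_linear_decay` helper defined below returns the multiplier that `rc_finetune.py` wraps in `paddle.optimizer.lr.LambdaDecay`. A small, self-contained sketch of the schedule it produces, with made-up step counts chosen only for illustration:

```python
# Illustrative numbers only: 1000 total steps with 100 warmup steps.
def lr_factor(step, max_steps=1000, warmup_steps=100):
    # Linear warmup from 0 to 1 over warmup_steps, then linear decay back to 0 at max_steps.
    return min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps))

for step in (0, 50, 100, 550, 1000):
    print(step, lr_factor(step))
# 0 -> 0.0, 50 -> 0.5, 100 -> 1.0, 550 -> 0.5, 1000 -> 0.0
# The effective learning rate at each step is args.lr multiplied by this factor.
```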
+ +from __future__ import absolute_import, division, print_function, unicode_literals + +import paddle + + +class UnpackDataLoader(paddle.io.DataLoader): + def __init__(self, *args, **kwargs): + super(UnpackDataLoader, self).__init__(*args, batch_size=1, **kwargs) + + def __iter__(self): + return ([yy[0] for yy in y] for y in super(UnpackDataLoader, self).__iter__()) + + +def create_if_not_exists(dir): + try: + dir.mkdir(parents=True) + except: + pass + return dir + + +def get_warmup_and_linear_decay(max_steps, warmup_steps): + return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps)) diff --git a/examples/model_interpretation/task/senti/LIME/exceptions.py b/examples/model_interpretation/task/senti/LIME/exceptions.py new file mode 100644 index 0000000000000000000000000000000000000000..c5fa1a29924ad795104c6ce7c124a58d1fa06dfe --- /dev/null +++ b/examples/model_interpretation/task/senti/LIME/exceptions.py @@ -0,0 +1,2 @@ +class LimeError(Exception): + """Raise for errors""" diff --git a/examples/model_interpretation/task/senti/LIME/explanation.py b/examples/model_interpretation/task/senti/LIME/explanation.py new file mode 100644 index 0000000000000000000000000000000000000000..6e212b1613ca84438ad37222b5ea09dd234d6a7b --- /dev/null +++ b/examples/model_interpretation/task/senti/LIME/explanation.py @@ -0,0 +1,344 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Explanation class, with visualization functions. +""" +from io import open +import os +import os.path +import json +import string +import numpy as np + +# from .exceptions import LimeError +from LIME.exceptions import LimeError + +from sklearn.utils import check_random_state + + +def id_generator(size=15, random_state=None): + """Helper function to generate random div ids. This is useful for embedding + HTML into ipython notebooks.""" + chars = list(string.ascii_uppercase + string.digits) + return "".join(random_state.choice(chars, size, replace=True)) + + +class DomainMapper(object): + """Class for mapping features to the specific domain. + + The idea is that there would be a subclass for each domain (text, tables, + images, etc), so that we can have a general Explanation class, and separate + out the specifics of visualizing features in here. + """ + + def __init__(self): + pass + + def map_exp_ids(self, exp, **kwargs): + """Maps the feature ids to concrete names. + + Default behaviour is the identity function. Subclasses can implement + this as they see fit. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + kwargs: optional keyword arguments + + Returns: + exp: list of tuples [(name, weight), (name, weight)...] + """ + return exp + + def visualize_instance_html(self, exp, label, div_name, exp_object_name, **kwargs): + """Produces html for visualizing the instance. + + Default behaviour does nothing. Subclasses can implement this as they + see fit. 
+ + Args: + exp: list of tuples [(id, weight), (id,weight)] + label: label id (integer) + div_name: name of div object to be used for rendering(in js) + exp_object_name: name of js explanation object + kwargs: optional keyword arguments + + Returns: + js code for visualizing the instance + """ + return "" + + +class Explanation(object): + """Object returned by explainers.""" + + def __init__(self, domain_mapper, mode="classification", class_names=None, random_state=None): + """ + + Initializer. + + Args: + domain_mapper: must inherit from DomainMapper class + type: "classification" or "regression" + class_names: list of class names (only used for classification) + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + """ + self.random_state = random_state + self.mode = mode + self.domain_mapper = domain_mapper + self.local_exp = {} + self.intercept = {} + self.score = {} + self.local_pred = {} + if mode == "classification": + self.class_names = class_names + self.top_labels = None + self.predict_proba = None + elif mode == "regression": + self.class_names = ["negative", "positive"] + self.predicted_value = None + self.min_value = 0.0 + self.max_value = 1.0 + self.dummy_label = 1 + else: + raise LimeError( + 'Invalid explanation mode "{}". ' 'Should be either "classification" ' 'or "regression".'.format(mode) + ) + + def available_labels(self): + """ + Returns the list of classification labels for which we have any explanations. + """ + try: + assert self.mode == "classification" + except AssertionError: + raise NotImplementedError("Not supported for regression explanations.") + else: + ans = self.top_labels if self.top_labels else self.local_exp.keys() + return list(ans) + + def as_list(self, label=1, **kwargs): + """Returns the explanation as a list. + + Args: + label: desired label. If you ask for a label for which an + explanation wasn't computed, will throw an exception. + Will be ignored for regression explanations. + kwargs: keyword arguments, passed to domain_mapper + + Returns: + list of tuples (representation, weight), where representation is + given by domain_mapper. Weight is a float. + """ + label_to_use = label if self.mode == "classification" else self.dummy_label + ans = self.domain_mapper.map_exp_ids(self.local_exp[label_to_use], **kwargs) + ans = [(x[0], float(x[1])) for x in ans] + return ans + + def as_map(self): + """Returns the map of explanations. + + Returns: + Map from label to list of tuples (feature_id, weight). + """ + return self.local_exp + + def as_pyplot_figure(self, label=1, **kwargs): + """Returns the explanation as a pyplot figure. + + Will throw an error if you don't have matplotlib installed + Args: + label: desired label. If you ask for a label for which an + explanation wasn't computed, will throw an exception. + Will be ignored for regression explanations. + kwargs: keyword arguments, passed to domain_mapper + + Returns: + pyplot figure (barchart). 
+ """ + import matplotlib.pyplot as plt + + exp = self.as_list(label=label, **kwargs) + fig = plt.figure() + vals = [x[1] for x in exp] + names = [x[0] for x in exp] + vals.reverse() + names.reverse() + colors = ["green" if x > 0 else "red" for x in vals] + pos = np.arange(len(exp)) + 0.5 + plt.barh(pos, vals, align="center", color=colors) + plt.yticks(pos, names) + if self.mode == "classification": + title = "Local explanation for class %s" % self.class_names[label] + else: + title = "Local explanation" + plt.title(title) + return fig + + def show_in_notebook(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Shows html explanation in ipython notebook. + + See as_html() for parameters. + This will throw an error if you don't have IPython installed""" + + from IPython.core.display import display, HTML + + display( + HTML( + self.as_html( + labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs + ) + ) + ) + + def save_to_file(self, file_path, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Saves html explanation to file. . + + Params: + file_path: file to save explanations to + + See as_html() for additional parameters. + + """ + file_ = open(file_path, "w", encoding="utf8") + file_.write( + self.as_html( + labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs + ) + ) + file_.close() + + def as_html(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Returns the explanation as an html page. + + Args: + labels: desired labels to show explanations for (as barcharts). + If you ask for a label for which an explanation wasn't + computed, will throw an exception. If None, will show + explanations for all available labels. (only used for classification) + predict_proba: if true, add barchart with prediction probabilities + for the top classes. (only used for classification) + show_predicted_value: if true, add barchart with expected value + (only used for regression) + kwargs: keyword arguments, passed to domain_mapper + + Returns: + code for an html page, including javascript includes. 
+ """ + + def jsonize(x): + return json.dumps(x, ensure_ascii=False) + + if labels is None and self.mode == "classification": + labels = self.available_labels() + + this_dir, _ = os.path.split(__file__) + bundle = open(os.path.join(this_dir, "bundle.js"), encoding="utf8").read() + + out = ( + """<html> + <meta http-equiv="content-type" content="text/html; charset=UTF8"> + <head><script>%s </script></head><body>""" + % bundle + ) + random_id = id_generator(size=15, random_state=check_random_state(self.random_state)) + out += ( + """ + <div class="lime top_div" id="top_div%s"></div> + """ + % random_id + ) + + predict_proba_js = "" + if self.mode == "classification" and predict_proba: + predict_proba_js = """ + var pp_div = top_div.append('div') + .classed('lime predict_proba', true); + var pp_svg = pp_div.append('svg').style('width', '100%%'); + var pp = new lime.PredictProba(pp_svg, %s, %s); + """ % ( + jsonize([str(x) for x in self.class_names]), + jsonize(list(self.predict_proba.astype(float))), + ) + + predict_value_js = "" + if self.mode == "regression" and show_predicted_value: + # reference self.predicted_value + # (svg, predicted_value, min_value, max_value) + predict_value_js = """ + var pp_div = top_div.append('div') + .classed('lime predicted_value', true); + var pp_svg = pp_div.append('svg').style('width', '100%%'); + var pp = new lime.PredictedValue(pp_svg, %s, %s, %s); + """ % ( + jsonize(float(self.predicted_value)), + jsonize(float(self.min_value)), + jsonize(float(self.max_value)), + ) + + exp_js = """var exp_div; + var exp = new lime.Explanation(%s); + """ % ( + jsonize([str(x) for x in self.class_names]) + ) + + if self.mode == "classification": + for label in labels: + exp = jsonize(self.as_list(label)) + exp_js += """ + exp_div = top_div.append('div').classed('lime explanation', true); + exp.show(%s, %d, exp_div); + """ % ( + exp, + label, + ) + else: + exp = jsonize(self.as_list()) + exp_js += """ + exp_div = top_div.append('div').classed('lime explanation', true); + exp.show(%s, %s, exp_div); + """ % ( + exp, + self.dummy_label, + ) + + raw_js = """var raw_div = top_div.append('div');""" + + if self.mode == "classification": + html_data = self.local_exp[labels[0]] + else: + html_data = self.local_exp[self.dummy_label] + + raw_js += self.domain_mapper.visualize_instance_html( + html_data, labels[0] if self.mode == "classification" else self.dummy_label, "raw_div", "exp", **kwargs + ) + out += """ + <script> + var top_div = d3.select('#top_div%s').classed('lime top_div', true); + %s + %s + %s + %s + </script> + """ % ( + random_id, + predict_proba_js, + predict_value_js, + exp_js, + raw_js, + ) + out += "</body></html>" + + return out diff --git a/examples/model_interpretation/task/senti/LIME/lime_base.py b/examples/model_interpretation/task/senti/LIME/lime_base.py new file mode 100644 index 0000000000000000000000000000000000000000..2c9104f69b54343b7c71db6defc59650c749fc9a --- /dev/null +++ b/examples/model_interpretation/task/senti/LIME/lime_base.py @@ -0,0 +1,226 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Contains abstract functionality for learning locally linear sparse model. +""" +import numpy as np +import scipy as sp +from sklearn.linear_model import Ridge, lars_path +from sklearn.utils import check_random_state + + +class LimeBase(object): + """Class for learning a locally linear sparse model from perturbed data""" + + def __init__(self, kernel_fn, verbose=False, random_state=None): + """Init function + + Args: + kernel_fn: function that transforms an array of distances into an + array of proximity values (floats). + verbose: if true, print local prediction values from linear model. + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + """ + self.kernel_fn = kernel_fn + self.verbose = verbose + self.random_state = check_random_state(random_state) + + @staticmethod + def generate_lars_path(weighted_data, weighted_labels): + """Generates the lars path for weighted data. + + Args: + weighted_data: data that has been weighted by kernel + weighted_label: labels, weighted by kernel + + Returns: + (alphas, coefs), both are arrays corresponding to the + regularization parameter and coefficients, respectively + """ + x_vector = weighted_data + alphas, _, coefs = lars_path(x_vector, weighted_labels, method="lasso", verbose=False) + return alphas, coefs + + def forward_selection(self, data, labels, weights, num_features): + """Iteratively adds features to the model""" + clf = Ridge(alpha=0, fit_intercept=True, random_state=self.random_state) + used_features = [] + for _ in range(min(num_features, data.shape[1])): + max_ = -100000000 + best = 0 + for feature in range(data.shape[1]): + if feature in used_features: + continue + clf.fit(data[:, used_features + [feature]], labels, sample_weight=weights) + score = clf.score(data[:, used_features + [feature]], labels, sample_weight=weights) + if score > max_: + best = feature + max_ = score + used_features.append(best) + return np.array(used_features) + + def feature_selection(self, data, labels, weights, num_features, method): + """Selects features for the model. 
see explain_instance_with_data to + understand the parameters.""" + if method == "none": + return np.array(range(data.shape[1])) + + elif method == "forward_selection": + return self.forward_selection(data, labels, weights, num_features) + + elif method == "highest_weights": + clf = Ridge(alpha=0.01, fit_intercept=True, random_state=self.random_state) + clf.fit(data, labels, sample_weight=weights) + + coef = clf.coef_ + if sp.sparse.issparse(data): + coef = sp.sparse.csr_matrix(clf.coef_) + weighted_data = coef.multiply(data[0]) + # Note: most efficient to slice the data before reversing + sdata = len(weighted_data.data) + argsort_data = np.abs(weighted_data.data).argsort() + # Edge case where data is more sparse than requested number of feature importances + # In that case, we just pad with zero-valued features + if sdata < num_features: + nnz_indexes = argsort_data[::-1] + indices = weighted_data.indices[nnz_indexes] + num_to_pad = num_features - sdata + indices = np.concatenate((indices, np.zeros(num_to_pad, dtype=indices.dtype))) + indices_set = set(indices) + pad_counter = 0 + for i in range(data.shape[1]): + if i not in indices_set: + indices[pad_counter + sdata] = i + pad_counter += 1 + if pad_counter >= num_to_pad: + break + else: + nnz_indexes = argsort_data[sdata - num_features : sdata][::-1] + indices = weighted_data.indices[nnz_indexes] + return indices + else: + weighted_data = coef * data[0] + feature_weights = sorted( + zip(range(data.shape[1]), weighted_data), # zip(特征的编号, Ridge的w值) + key=lambda x: np.abs(x[1]), + reverse=True, + ) + return np.array([x[0] for x in feature_weights[:num_features]]) # 返回Ridge的前num_features大的w的值对应的特征编号 + + elif method == "lasso_path": + weighted_data = (data - np.average(data, axis=0, weights=weights)) * np.sqrt(weights[:, np.newaxis]) + weighted_labels = (labels - np.average(labels, weights=weights)) * np.sqrt(weights) + nonzero = range(weighted_data.shape[1]) + _, coefs = self.generate_lars_path(weighted_data, weighted_labels) + for i in range(len(coefs.T) - 1, 0, -1): + nonzero = coefs.T[i].nonzero()[0] + if len(nonzero) <= num_features: + break + used_features = nonzero + return used_features + + elif method == "auto": + if num_features <= 6: + n_method = "forward_selection" + else: + n_method = "highest_weights" + return self.feature_selection(data, labels, weights, num_features, n_method) + + def explain_instance_with_data( + self, + neighborhood_data, + neighborhood_labels, + distances, + label, + num_features, + feature_selection="auto", + model_regressor=None, + ): + """Takes perturbed data, labels and distances, returns explanation. + + Args: + neighborhood_data: perturbed data, 2d array. first element is + assumed to be the original data point. + neighborhood_labels: corresponding perturbed labels. should have as + many columns as the number of possible labels. + distances: distances to original data point. + label: label for which we want an explanation + num_features: maximum number of features in explanation + feature_selection: how to select num_features. options are: + 'forward_selection': iteratively add features to the model. 
+                    This is costly when num_features is high
+                'highest_weights': selects the features that have the highest
+                    product of absolute weight * original data point when
+                    learning with all the features
+                'lasso_path': chooses features based on the lasso
+                    regularization path
+                'none': uses all features, ignores num_features
+                'auto': uses forward_selection if num_features <= 6, and
+                    'highest_weights' otherwise.
+            model_regressor: sklearn regressor to use in explanation.
+                Defaults to Ridge regression if None. Must have
+                model_regressor.coef_ and 'sample_weight' as a parameter
+                to model_regressor.fit()
+
+        Returns:
+            (intercept, exp, score, local_pred):
+            intercept is a float.
+            exp is a sorted list of tuples, where each tuple (x,y) corresponds to the feature id (x)
+            and the local weight (y). The list is sorted by decreasing absolute value of y.
+            score is the R^2 value of the returned explanation
+            local_pred is the prediction of the explanation model on the original instance
+        """
+
+        weights = self.kernel_fn(distances)  # weights of the perturbed samples
+        labels_column = neighborhood_labels[:, label]  # softmax scores of class `label`
+
+        used_features = self.feature_selection(
+            neighborhood_data, labels_column, weights, num_features, feature_selection
+        )
+        if model_regressor is None:
+            # alpha: L2 regularization coefficient; fit_intercept: whether to fit the
+            # bias term b; random_state: seed used by the regressor.
+            model_regressor = Ridge(alpha=1, fit_intercept=True, random_state=self.random_state)
+        easy_model = model_regressor
+        easy_model.fit(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
+        prediction_score = easy_model.score(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
+
+        local_pred = easy_model.predict(neighborhood_data[0, used_features].reshape(1, -1))
+
+        ridge_pred = easy_model.predict(neighborhood_data[:, used_features])
+        err_np = np.abs(labels_column - ridge_pred)
+        # relative_err_np = err_np / labels_column
+        relative_err_np = err_np / ridge_pred
+        err = np.average(err_np, weights=weights)
+        relative_err = np.average(relative_err_np, weights=weights)
+
+        if self.verbose:
+            print("Intercept", easy_model.intercept_)
+            print(
+                "Prediction_local",
+                local_pred,
+            )
+            print("Right:", neighborhood_labels[0, label])
+        return (
+            easy_model.intercept_,
+            sorted(
+                zip(used_features, easy_model.coef_), key=lambda x: np.abs(x[1]), reverse=True
+            ),  # (feature_id, weight) pairs sorted by decreasing absolute weight
+            prediction_score,  # R^2 of easy_model on the perturbed set; closer to 1 means a better local fit
+            local_pred,  # easy_model's predicted probability for the original instance
+            relative_err,
+            err,
+        )
diff --git a/examples/model_interpretation/task/senti/LIME/lime_text.py b/examples/model_interpretation/task/senti/LIME/lime_text.py
new file mode 100644
index 0000000000000000000000000000000000000000..7ef6d3bc40decf94ee5c30a461583b0119d05122
--- /dev/null
+++ b/examples/model_interpretation/task/senti/LIME/lime_text.py
@@ -0,0 +1,664 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# !/usr/bin/env python3
+"""
+Functions for explaining text classifiers.
+""" +import itertools +import json +import re +import time +import math +import paddle +from functools import partial + +import numpy as np +import scipy as sp +import sklearn +from sklearn.utils import check_random_state + +import LIME.explanation as explanation +import LIME.lime_base as lime_base + + +class TextDomainMapper(explanation.DomainMapper): + """Maps feature ids to words or word-positions""" + + def __init__(self, indexed_string, language): + """Initializer. + + Args: + indexed_string: lime_text.IndexedString, original string + """ + self.indexed_string = indexed_string + self.language = language + + def map_exp_ids(self, exp, positions=False): + """Maps ids to words or word-position strings. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + positions: if True, also return word positions + + Returns: + list of tuples (word, weight), or (word_positions, weight) if + examples: ('bad', 1) or ('bad_3-6-12', 1) + """ + if positions: + exp = [ + ( + "%s_%s" + % (self.indexed_string.word(x[0]), "-".join(map(str, self.indexed_string.string_position(x[0])))), + x[1], + ) + for x in exp + ] + else: + exp = [(self.indexed_string.word(x[0]), x[1]) for x in exp] + return exp + + def visualize_instance_html(self, exp, label, div_name, exp_object_name, text=True, opacity=True): + """Adds text with highlighted words to visualization. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + label: label id (integer) + div_name: name of div object to be used for rendering(in js) + exp_object_name: name of js explanation object + text: if False, return empty + opacity: if True, fade colors according to weight + """ + if not text: + return "" + text = self.indexed_string.raw_string().encode("utf-8", "xmlcharrefreplace").decode("utf-8") + text = re.sub(r"[<>&]", "|", text) + exp = [(self.indexed_string.word(x[0]), self.indexed_string.string_position(x[0]), x[1]) for x in exp] + all_occurrences = list(itertools.chain.from_iterable([itertools.product([x[0]], x[1], [x[2]]) for x in exp])) + all_occurrences = [(x[0], int(x[1]), x[2]) for x in all_occurrences] + ret = """ + %s.show_raw_text(%s, %d, %s, %s, %s); + """ % ( + exp_object_name, + json.dumps(all_occurrences), + label, + json.dumps(text), + div_name, + json.dumps(opacity), + ) + return ret + + +class IndexedString(object): + """String with various indexes.""" + + def __init__(self, raw_string, split_expression=r"\W+", bow=True, mask_string=None, language="en"): + """Initializer. + + Args: + raw_string: string with raw text in it + split_expression: Regex string or callable. If regex string, will be used with re.split. + If callable, the function should return a list of tokens. + bow: if True, a word is the same everywhere in the text - i.e. we + will index multiple occurrences of the same word. If False, + order matters, so that the same word will have different ids + according to position. + mask_string: If not None, replace words with this if bow=False + if None, default value is UNKWORDZ + """ + self.raw = raw_string + self.mask_string = "UNKWORDZ" if mask_string is None else mask_string + self.language = language + + if callable(split_expression): + tokens = split_expression(self.raw) + self.as_list = self._segment_with_tokens(self.raw, tokens) + tokens = set(tokens) + + def non_word(string): + return string not in tokens + + else: + # with the split_expression as a non-capturing group (?:), we don't need to filter out + # the separator character from the split results. 
+ # splitter = re.compile(r'(%s)|$' % split_expression) + # self.as_list = [s for s in splitter.split(self.raw) if s] + if self.language == "ch": + splitter = re.compile(r"([\u4e00-\u9fa5])") + self.as_list = [w for w in splitter.split(self.raw) if len(w.strip()) > 0] + else: + splitter = re.compile(split_expression) + self.as_list = [w for w in self.raw.strip().split() if len(w.strip()) > 0] + valid_word = splitter.match + + self.as_np = np.array(self.as_list) + self.string_start = np.hstack(([0], np.cumsum([len(x) for x in self.as_np[:-1]]))) + vocab = {} + self.inverse_vocab = [] + self.positions = [] + self.bow = bow + non_vocab = set() + for i, word in enumerate(self.as_np): + if word in non_vocab: + continue + if (valid_word(word) and self.language == "en") or (not valid_word(word) and self.language == "ch"): + non_vocab.add(word) + continue + if bow: + if word not in vocab: + vocab[word] = len(vocab) + self.inverse_vocab.append(word) + self.positions.append([]) + idx_word = vocab[word] + self.positions[idx_word].append(i) + else: + self.inverse_vocab.append(word) + self.positions.append(i) + if not bow: + self.positions = np.array(self.positions) + + def raw_string(self): + """Returns the original raw string""" + return self.raw + + def num_words(self): + """Returns the number of tokens in the vocabulary for this document.""" + return len(self.inverse_vocab) + + def word(self, id_): + """Returns the word that corresponds to id_ (int)""" + return self.inverse_vocab[id_] + + def string_position(self, id_): + """Returns a np array with indices to id_ (int) occurrences""" + if self.bow: + return self.string_start[self.positions[id_]] + else: + return self.string_start[[self.positions[id_]]] + + def inverse_removing(self, words_to_remove): + """Returns a string after removing the appropriate words. + + If self.bow is false, replaces word with UNKWORDZ instead of removing it. + + Args: + words_to_remove: list of ids (ints) to remove + + Returns: + original raw string with appropriate words removed. + """ + mask = np.ones(self.as_np.shape[0], dtype="bool") + mask[self.__get_idxs(words_to_remove)] = False + if self.language == "ch": + if not self.bow: + return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) + return "".join([self.as_list[v] for v in mask.nonzero()[0]]) + else: + if not self.bow: + return " ".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) + return " ".join([self.as_list[v] for v in mask.nonzero()[0]]) + + @staticmethod + def _segment_with_tokens(text, tokens): + """Segment a string around the tokens created by a passed-in tokenizer""" + list_form = [] + text_ptr = 0 + for token in tokens: + inter_token_string = [] + while not text[text_ptr:].startswith(token): + inter_token_string.append(text[text_ptr]) + text_ptr += 1 + if text_ptr >= len(text): + raise ValueError("Tokenization produced tokens that do not belong in string!") + text_ptr += len(token) + if inter_token_string: + list_form.append("".join(inter_token_string)) + list_form.append(token) + if text_ptr < len(text): + list_form.append(text[text_ptr:]) + return list_form + + def __get_idxs(self, words): + """Returns indexes to appropriate words.""" + if self.bow: + return list(itertools.chain.from_iterable([self.positions[z] for z in words])) + else: + return self.positions[words] + + +class IndexedCharacters(object): + """String with various indexes.""" + + def __init__(self, raw_string, bow=True, mask_string=None): + """Initializer. 
+ + Args: + raw_string: string with raw text in it + bow: if True, a char is the same everywhere in the text - i.e. we + will index multiple occurrences of the same character. If False, + order matters, so that the same word will have different ids + according to position. + mask_string: If not None, replace characters with this if bow=False + if None, default value is chr(0) + """ + self.raw = raw_string + self.as_list = list(self.raw) + self.as_np = np.array(self.as_list) + self.mask_string = chr(0) if mask_string is None else mask_string + self.string_start = np.arange(len(self.raw)) + vocab = {} + self.inverse_vocab = [] + self.positions = [] + self.bow = bow + non_vocab = set() + for i, char in enumerate(self.as_np): + if char in non_vocab: + continue + if bow: + if char not in vocab: + vocab[char] = len(vocab) + self.inverse_vocab.append(char) + self.positions.append([]) + idx_char = vocab[char] + self.positions[idx_char].append(i) + else: + self.inverse_vocab.append(char) + self.positions.append(i) + if not bow: + self.positions = np.array(self.positions) + + def raw_string(self): + """Returns the original raw string""" + return self.raw + + def num_words(self): + """Returns the number of tokens in the vocabulary for this document.""" + return len(self.inverse_vocab) + + def word(self, id_): + """Returns the word that corresponds to id_ (int)""" + return self.inverse_vocab[id_] + + def string_position(self, id_): + """Returns a np array with indices to id_ (int) occurrences""" + if self.bow: + return self.string_start[self.positions[id_]] + else: + return self.string_start[[self.positions[id_]]] + + def inverse_removing(self, words_to_remove): + """Returns a string after removing the appropriate words. + + If self.bow is false, replaces word with UNKWORDZ instead of removing + it. + + Args: + words_to_remove: list of ids (ints) to remove + + Returns: + original raw string with appropriate words removed. + """ + mask = np.ones(self.as_np.shape[0], dtype="bool") + mask[self.__get_idxs(words_to_remove)] = False + if not self.bow: + return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) + return "".join([self.as_list[v] for v in mask.nonzero()[0]]) + + def __get_idxs(self, words): + """Returns indexes to appropriate words.""" + if self.bow: + return list(itertools.chain.from_iterable([self.positions[z] for z in words])) + else: + return self.positions[words] + + +class LimeTextExplainer(object): + """Explains text classifiers. + Currently, we are using an exponential kernel on cosine distance, and + restricting explanations to words that are present in documents.""" + + def __init__( + self, + kernel_width=25, + kernel=None, + verbose=False, + class_names=None, + feature_selection="auto", + split_expression=r"\W+", + bow=True, + mask_string=None, + random_state=None, + char_level=False, + language="en", + ): + """Init function. + + Args: + kernel_width: kernel width for the exponential kernel. + kernel: similarity kernel that takes euclidean distances and kernel + width as input and outputs weights in (0,1). If None, defaults to + an exponential kernel. + verbose: if true, print local prediction values from linear model + class_names: list of class names, ordered according to whatever the + classifier is using. If not present, class names will be '0', + '1', ... + feature_selection: feature selection method. can be + 'forward_selection', 'lasso_path', 'none' or 'auto'. 
+ See function 'explain_instance_with_data' in lime_base.py for + details on what each of the options does. + split_expression: Regex string or callable. If regex string, will be used with re.split. + If callable, the function should return a list of tokens. + bow: if True (bag of words), will perturb input data by removing + all occurrences of individual words or characters. + Explanations will be in terms of these words. Otherwise, will + explain in terms of word-positions, so that a word may be + important the first time it appears and unimportant the second. + Only set to false if the classifier uses word order in some way + (bigrams, etc), or if you set char_level=True. + mask_string: String used to mask tokens or characters if bow=False + if None, will be 'UNKWORDZ' if char_level=False, chr(0) + otherwise. + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + char_level: an boolean identifying that we treat each character + as an independent occurence in the string + """ + + if kernel is None: + + def kernel(d, kernel_width): + return np.sqrt(np.exp(-(d**2) / kernel_width**2)) + + kernel_fn = partial(kernel, kernel_width=kernel_width) + + self.random_state = check_random_state(random_state) + self.base = lime_base.LimeBase(kernel_fn, verbose, random_state=self.random_state) + self.class_names = class_names + self.vocabulary = None + self.feature_selection = feature_selection + self.bow = bow + self.mask_string = mask_string + self.split_expression = split_expression + self.char_level = char_level + self.language = language + + def explain_instance( + self, + text_instance: str, + tokenizer, + pred_label: int, + classifier_fn, + labels=(0, 1), + top_labels=None, + num_features=10, + num_samples=5000, + distance_metric="cosine", + model_regressor=None, + if_lstm=False, + ): + """Generates explanations for a prediction. + + First, we generate neighborhood data by randomly hiding features from + the instance (see __data_labels_distance_mapping). We then learn + locally weighted linear models on this neighborhood data to explain + each of the classes in an interpretable way (see lime_base.py). + + Args: + text_instance: raw text string to be explained. + classifier_fn: classifier prediction probability function, which + takes a list of d strings and outputs a (d, k) numpy array with + prediction probabilities, where k is the number of classes. + For ScikitClassifiers , this is classifier.predict_proba. + labels: iterable with labels to be explained. + top_labels: if not None, ignore labels and produce explanations for + the K labels with highest prediction probabilities, where K is + this parameter. + num_features: maximum number of features present in explanation + num_samples: size of the neighborhood to learn the linear model + distance_metric: the distance metric to use for sample weighting, + defaults to cosine similarity + model_regressor: sklearn regressor to use in explanation. Defaults + to Ridge regression in LimeBase. Must have model_regressor.coef_ + and 'sample_weight' as a parameter to model_regressor.fit() + Returns: + An Explanation object (see explanation.py) with the corresponding + explanations. 
+ """ + indexed_string = ( + IndexedCharacters(text_instance, bow=self.bow, mask_string=self.mask_string) + if self.char_level + else IndexedString( + text_instance, + bow=self.bow, + split_expression=self.split_expression, + mask_string=self.mask_string, + language=self.language, + ) + ) + domain_mapper = TextDomainMapper(indexed_string, self.language) + + # 产生扰动数据集 第一条是原始数据 + # data: 解释器训练特征 list (num_samples, doc_size) + # yss: 解释器训练标签 list (num_samples, class_num(2)) + # distances: 扰动样本到原始样本的距离 np.array(float) (num_samples, ) + data, yss, distances = self.__data_labels_distances( + indexed_string, tokenizer, classifier_fn, num_samples, distance_metric=distance_metric, if_lstm=if_lstm + ) + + if self.class_names is None: + self.class_names = [str(x) for x in range(yss[0].shape[0])] + ret_exp = explanation.Explanation( + domain_mapper=domain_mapper, class_names=self.class_names, random_state=self.random_state + ) + ret_exp.predict_proba = yss[0] + if top_labels: + labels = np.argsort(yss[0])[-top_labels:] + ret_exp.top_labels = list(labels) + ret_exp.top_labels.reverse() + + num_features = indexed_string.num_words() # 特征数量跟word_num相同 + + ( + ret_exp.intercept[pred_label], + ret_exp.local_exp[pred_label], + ret_exp.score[pred_label], + ret_exp.local_pred[pred_label], + relative_err, + err, + ) = self.base.explain_instance_with_data( + data, + yss, + distances, + pred_label, + num_features, + model_regressor=model_regressor, + feature_selection=self.feature_selection, + ) + + return ret_exp, indexed_string, relative_err, err + + def __data_labels_distances( + self, indexed_string, tokenizer, classifier_fn, num_samples, distance_metric="cosine", if_lstm=False + ): + """Generates a neighborhood around a prediction. + + Generates neighborhood data by randomly removing words from + the instance, and predicting with the classifier. Uses cosine distance + to compute distances between original and perturbed instances. + Args: + indexed_string: document (IndexedString) to be explained, + classifier_fn: classifier prediction probability function, which + takes a string and outputs prediction probabilities. For + ScikitClassifier, this is classifier.predict_proba. + num_samples: size of the neighborhood to learn the linear model + distance_metric: the distance metric to use for sample weighting, + defaults to cosine similarity. + + Returns: + A tuple (data, labels, distances), where: + data: dense num_samples * K binary matrix, where K is the + number of tokens in indexed_string. The first row is the + original instance, and thus a row of ones. + labels: num_samples * L matrix, where L is the number of target + labels + distances: cosine distance between the original instance and + each perturbed instance (computed in the binary 'data' + matrix), times 100. 
+ """ + + def distance_fn(x): + return sklearn.metrics.pairwise.pairwise_distances(x, x[0], metric=distance_metric).ravel() * 100 + + doc_size = indexed_string.num_words() + + if doc_size > 1: + sample = self.random_state.randint( + 1, doc_size, num_samples - 1 + ) # sample: [int(1 ~ doc_size-1) * num_samples-1] + else: + sample = [0 for i in range(num_samples - 1)] + data = np.ones((num_samples, doc_size)) + data[0] = np.ones(doc_size) + features_range = range(doc_size) + perturb_text = [indexed_string.raw_string()] # [文本 * num_samples] + + for i, size in enumerate(sample, start=1): + # inactive: 从range(0, doc_size)中随机取出的size个数组成的list, 要去掉的字的id + inactive = self.random_state.choice( + features_range, size, replace=False # [0, doc_size) # int: 该扰动样本中remove token的数量 + ) + + text = indexed_string.inverse_removing(inactive) # 原文本去掉了inactive中的字后的文本 + + data[i, inactive] = 0 + perturb_text.append(text) + + prev_time = time.time() + # inverse_data: 扰动数据集 [扰动样本 str] * num_samples + labels = [] + token_ids_list, s_ids_list, seq_len_list = [], [], [] + token_ids_max_len = 0 + + valid_idxs = [] + + for idx, text in enumerate(perturb_text): + if self.language == "en": + if if_lstm: + pad_id = [tokenizer.vocab.token_to_idx.get("[PAD]", 0)] + + token_ids = tokenizer.encode(text) + token_ids_max_len = max(token_ids_max_len, len(token_ids)) + seq_len = len(token_ids) + if seq_len == 0: + continue + else: + valid_idxs.append(idx) + seq_len_list.append(seq_len) + pad_id = [tokenizer.vocab.token_to_idx.get("[PAD]", 0)] + + else: + pad_id = tokenizer.convert_tokens_to_ids(["[PAD]"]) + + tokens = tokenizer.tokenize(text) + token_ids = tokenizer.convert_tokens_to_ids(tokens) + token_ids = ( + tokenizer.convert_tokens_to_ids(["[CLS]"]) + + token_ids + + tokenizer.convert_tokens_to_ids(["[SEP]"]) + ) + token_ids_max_len = max(token_ids_max_len, len(token_ids)) + + token_ids_list.append(token_ids) + else: + if len(text) == 0: # TODO + text = perturb_text[0] + tokens = tokenizer.tokenize(text) + token_ids = tokenizer.convert_tokens_to_ids(tokens) + + if if_lstm: + seq_len = len(token_ids) + if seq_len == 0: + continue + else: + valid_idxs.append(idx) + seq_len_list.append(seq_len) + else: + token_ids = ( + tokenizer.convert_tokens_to_ids(["[CLS]"]) + + token_ids + + tokenizer.convert_tokens_to_ids(["[SEP]"]) + ) + + # padding + token_ids = token_ids + tokenizer.convert_tokens_to_ids(["[PAD]"]) * ( + len(perturb_text[0]) + 2 - len(token_ids) + ) + token_ids_list.append(token_ids) + s_ids = [0 for _ in range(len(token_ids))] + s_ids_list.append(s_ids) + + if self.language == "en": + for token_ids in token_ids_list: + while len(token_ids) < token_ids_max_len: + token_ids += pad_id + + s_ids = [0 for _ in range(len(token_ids))] + s_ids_list.append(s_ids) + + token_ids_np = np.array(token_ids_list) + s_ids_np = np.array(s_ids_list) + seq_len_np = np.array(seq_len_list) + + prev_time = time.time() + + batch = 0 + if self.language == "ch": + length = len(perturb_text[0]) + + if if_lstm: + batch = 128 + else: + batch = 64 if length < 130 else 50 + else: + batch = 32 + + epoch_num = math.ceil(len(token_ids_np) / batch) + for idx in range(epoch_num): + token_ids_tensor = paddle.Tensor( + value=token_ids_np[idx * batch : (idx + 1) * batch], place=paddle.CUDAPlace(0), stop_gradient=True + ) + if if_lstm: + seq_len_tensor = paddle.Tensor( + value=seq_len_np[idx * batch : (idx + 1) * batch], + place=token_ids_tensor.place, + stop_gradient=token_ids_tensor.stop_gradient, + ) + label = classifier_fn(token_ids_tensor, 
seq_len_tensor)[0] # label: Tensor[num_samples, 2] + else: + s_ids_tensor = paddle.Tensor( + value=s_ids_np[idx * batch : (idx + 1) * batch], + place=token_ids_tensor.place, + stop_gradient=token_ids_tensor.stop_gradient, + ) + label = classifier_fn(token_ids_tensor, s_ids_tensor)[0] # label: Tensor[num_samples, 2] + + labels.extend(label.numpy().tolist()) + + labels = np.array(labels) # labels: nsp.array(num_samples, 2) + + print("mode forward time: %.5f" % (time.time() - prev_time)) + + distances = distance_fn(sp.sparse.csr_matrix(data)) + + return data, labels, distances diff --git a/examples/model_interpretation/task/senti/pretrained_models/run_train.sh b/examples/model_interpretation/task/senti/pretrained_models/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..b13d03c6486ef7a79f2718484cd6b976e767851a --- /dev/null +++ b/examples/model_interpretation/task/senti/pretrained_models/run_train.sh @@ -0,0 +1,30 @@ +### + # This script is used to finetune pretrained models +### + +export CUDA_VISIBLE_DEVICES=5 + +LANGUAGE=en +BASE_MODEL=roberta_base # [roberta_base, roberta_large] +timestamp=`date +"%Y%m%d_%H%M%S"` + +if [[ $LANGUAGE == "ch" ]]; then + LEARNING_RATE=2e-5 + MAX_SEQ_LENGTH=128 +elif [[ $LANGUAGE == "en" ]]; then + LEARNING_RATE=5e-6 + MAX_SEQ_LENGTH=512 +fi + +[ -d "logs" ] || mkdir -p "logs" +set -x + +python3 ./train.py \ + --learning_rate ${LEARNING_RATE} \ + --max_seq_length ${MAX_SEQ_LENGTH} \ + --batch_size 32 \ + --epochs 5 \ + --base_model $BASE_MODEL \ + --save_dir saved_model_${LANGUAGE}/${BASE_MODEL}_${timestamp} \ + --language $LANGUAGE >> logs/log_${BASE_MODEL}_${timestamp} + diff --git a/examples/model_interpretation/task/senti/pretrained_models/train.py b/examples/model_interpretation/task/senti/pretrained_models/train.py new file mode 100644 index 0000000000000000000000000000000000000000..61dcb01ada0890de7cb44d2e13e07292fc13a8da --- /dev/null +++ b/examples/model_interpretation/task/senti/pretrained_models/train.py @@ -0,0 +1,230 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
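+#
+# This trainer is normally launched through run_train.sh in the same directory;
+# for reference, the English RoBERTa-base configuration from that script is:
+#
+#     python3 ./train.py --learning_rate 5e-6 --max_seq_length 512 \
+#         --batch_size 32 --epochs 5 --base_model roberta_base \
+#         --save_dir saved_model_en/roberta_base_${timestamp} --language en
+#
+# where ${timestamp} is generated by the shell script.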
+""" + This file is used to fine-tune pretrained models +""" +import argparse +import os +import random +import sys +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("..") +sys.path.append("../../..") +from roberta.modeling import RobertaForSequenceClassification # noqa: E402 + +sys.path.remove("../../..") +sys.path.remove("..") +from utils import convert_example # noqa: E402 + +parser = argparse.ArgumentParser() +parser.add_argument("--base_model", type=str, choices=["roberta_base", "roberta_large"]) +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument( + "--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process." +) +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +parser.add_argument( + "--language", choices=["ch", "en"], required=True, default=None, help="Language that the model is built for" +) +args = parser.parse_args() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+    """
+    model.eval()
+    metric.reset()
+    losses = []
+    for batch in data_loader:
+        input_ids, token_type_ids, labels = batch
+        logits = model(input_ids, token_type_ids)
+        loss = criterion(logits, labels)
+        losses.append(loss.numpy())
+        correct = metric.compute(logits, labels)
+        metric.update(correct)
+        accu = metric.accumulate()
+    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
+    model.train()
+    metric.reset()
+
+
+def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None):
+    """
+    This function creates the dataloader which feeds data into the model
+    """
+    if trans_fn:
+        dataset = dataset.map(trans_fn)
+
+    shuffle = True if mode == "train" else False
+    if mode == "train":
+        batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
+    else:
+        batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
+
+    return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True)
+
+
+def do_train():
+    """
+    This function is the main part of the fine-tuning process
+    """
+    paddle.set_device(args.device)
+    rank = paddle.distributed.get_rank()
+    if paddle.distributed.get_world_size() > 1:
+        paddle.distributed.init_parallel_env()
+
+    set_seed(args.seed)
+    if args.language == "ch":
+        train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"])
+        if args.base_model == "roberta_base":
+            tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext")
+            model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext", num_classes=2)
+        elif args.base_model == "roberta_large":
+            tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large")
+            model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext-large", num_classes=2)
+    else:
+        train_ds, dev_ds = load_dataset("glue", "sst-2", splits=["train", "dev"])
+        # for English version, we load models from local machine
+        if args.base_model == "roberta_base":
+            tokenizer = RobertaBPETokenizer.from_pretrained("roberta-base")
+            model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_classes=2)
+        elif args.base_model == "roberta_large":
+            tokenizer = RobertaBPETokenizer.from_pretrained("roberta-large")
+            model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_classes=2)
+
+    trans_func = partial(
+        convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=args.language
+    )
+    batchify_fn = lambda samples, fn=Tuple(
+        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
+        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
+        Stack(dtype="int64"),  # label
+    ): [data for data in fn(samples)]
+    train_data_loader = create_dataloader(
+        train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func
+    )
+    dev_data_loader = create_dataloader(
+        dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func
+    )
+
+    if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt):
+        state_dict = paddle.load(args.init_from_ckpt)
+        model.set_dict(state_dict)
+    model = paddle.DataParallel(model)
+
+    num_training_steps = len(train_data_loader) * args.epochs
+
+    lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion)
+
+    # Generate parameter names needed to perform weight decay.
+    # All bias and LayerNorm parameters are excluded.
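+    # AdamW calls `apply_decay_param_fun` with each parameter name and applies
+    # weight decay only when it returns True, i.e. only to the names collected
+    # in `decay_params` below (parameters whose names contain no "bias"/"norm").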
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + log_per_step = 100 if args.language == "en" else 10 + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch + logits = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(logits, labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + if global_step % log_per_step == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, log_per_step / (time.time() - tic_train)), + flush=True, + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % (log_per_step * 10) == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + evaluate(model, criterion, metric, dev_data_loader) + model._layers.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/model_interpretation/task/senti/pretrained_models/utils.py b/examples/model_interpretation/task/senti/pretrained_models/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d8c0bad17bd678f7e6bec12ea55abea6a2be6c27 --- /dev/null +++ b/examples/model_interpretation/task/senti/pretrained_models/utils.py @@ -0,0 +1,59 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + This file contains some public function +""" +import numpy as np + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False, language="ch"): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. And creates a mask from the two sequences passed + to be used in a sequence-pair classification task. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + It returns the first portion of the mask (0's). + + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. 
+ max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): List of sequence pair mask. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + if language == "ch": + text = "text" + label = "label" + else: + text = "sentence" + label = "labels" + encoded_inputs = tokenizer(text=example[text], max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if is_test: + return input_ids, token_type_ids + label = np.array([example[label]], dtype="int64") + return input_ids, token_type_ids, label diff --git a/examples/model_interpretation/task/senti/rnn/lstm_train.sh b/examples/model_interpretation/task/senti/rnn/lstm_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..fd3b4a4cc2b836f1c85f4b88f7e9baddaa204d20 --- /dev/null +++ b/examples/model_interpretation/task/senti/rnn/lstm_train.sh @@ -0,0 +1,20 @@ +### + # This script is used to train lstm models +### + +unset CUDA_VISIBLE_DEVICES +LANGUAGE=en + +if [[ $LANGUAGE == 'ch' ]]; then + VOCAB_PATH='./vocab.txt' +else + VOCAB_PATH='vocab.sst2_train' +fi +python -m paddle.distributed.launch --gpus "5" train.py \ + --device=gpu \ + --lr=4e-4 \ + --batch_size=64 \ + --epochs=12 \ + --vocab_path=$VOCAB_PATH \ + --language=$LANGUAGE \ + --save_dir="./checkpoints_"${LANGUAGE} diff --git a/examples/model_interpretation/task/senti/rnn/model.py b/examples/model_interpretation/task/senti/rnn/model.py new file mode 100644 index 0000000000000000000000000000000000000000..9c509e72432e33d4501158b8e439110b1ad1ac22 --- /dev/null +++ b/examples/model_interpretation/task/senti/rnn/model.py @@ -0,0 +1,267 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
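+#
+# This module implements the RNN baselines used by the sentiment-interpretation
+# scripts: `LSTMModel` and `BiLSTMAttentionModel`, the latter with either a
+# `SelfAttention` or a `SelfInteractiveAttention` layer. Besides the usual
+# `forward` (logits), each model exposes `forward_interpet`, which additionally
+# returns intermediate representations (attention weights or the sequence
+# representation) and the token embeddings consumed by the interpretation methods.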
+ +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +INF = 1.0 * 1e12 + + +class LSTMModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + lstm_hidden_size=198, + direction="forward", + lstm_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + + self.direction = direction + + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + + # self.lstm_encoder = nlp.seq2vec.LSTMEncoder(emb_dim, + # lstm_hidden_size, + # num_layers=lstm_layers, + # direction=direction, + # dropout=dropout_rate, + # pooling_type=pooling_type) + + self.lstm_layer = nn.LSTM( + input_size=emb_dim, + hidden_size=lstm_hidden_size, + num_layers=lstm_layers, + direction=direction, + dropout=dropout_rate, + ) + + self.fc = nn.Linear(lstm_hidden_size * (2 if direction == "bidirect" else 1), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + self.softmax = nn.Softmax(axis=1) + + def forward(self, text, seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, num_tokens, num_directions*lstm_hidden_size) + # num_directions = 2 if direction is 'bidirect' + # if not, num_directions = 1 + + # text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len) + + encoded_text, (last_hidden, last_cell) = self.lstm_layer(embedded_text, sequence_length=seq_len) + if self.direction == "bidirect": + text_repr = paddle.concat((last_hidden[-2, :, :], last_hidden[-1, :, :]), axis=1) + else: + text_repr = last_hidden[-1, :, :] + + fc_out = paddle.tanh(self.fc(text_repr)) # Shape: (batch_size, fc_hidden_size) + logits = self.output_layer(fc_out) # Shape: (batch_size, num_classes) + return logits + + def forward_interpet(self, text, seq_len): + embedded_text = self.embedder(text) # Shape: (batch_size, num_tokens, embedding_dim) + + # text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len) # Shape: (batch_size, num_tokens, num_directions * hidden) + + # encoded_text: tensor[batch, seq_len, num_directions * hidden] + # last_hidden: tensor[2, batch, hiddens] + encoded_text, (last_hidden, last_cell) = self.lstm_layer(embedded_text, sequence_length=seq_len) + if self.direction == "bidirect": + text_repr = paddle.concat( + (last_hidden[-2, :, :], last_hidden[-1, :, :]), axis=1 + ) # text_repr: tensor[batch, 2 * hidden] 双向 + else: + text_repr = last_hidden[-1, :, :] # text_repr: tensor[1, hidden_size] 单向 + + fc_out = paddle.tanh(self.fc(text_repr)) # Shape: (batch_size, fc_hidden_size) + logits = self.output_layer(fc_out) # Shape: (batch_size, num_classes) + probs = self.softmax(logits) + + return probs, text_repr, embedded_text + + +class BiLSTMAttentionModel(nn.Layer): + def __init__( + self, + attention_layer, + vocab_size, + num_classes, + emb_dim=128, + lstm_hidden_size=196, + fc_hidden_size=96, + lstm_layers=1, + dropout_rate=0.0, + padding_idx=0, + ): + super().__init__() + self.padding_idx = padding_idx + + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.bilstm = nn.LSTM( + input_size=emb_dim, + hidden_size=lstm_hidden_size, + num_layers=lstm_layers, + dropout=dropout_rate, + direction="bidirect", + ) + self.attention = attention_layer + if isinstance(attention_layer, SelfAttention): + self.fc = nn.Linear(lstm_hidden_size, fc_hidden_size) + elif isinstance(attention_layer, 
SelfInteractiveAttention): + self.fc = nn.Linear(lstm_hidden_size * 2, fc_hidden_size) + else: + raise RuntimeError("Unknown attention type %s." % attention_layer.__class__.__name__) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + self.softmax = nn.Softmax(axis=1) + + def forward(self, text, seq_len): + mask = text != self.padding_idx + embedded_text = self.embedder(text) + # Encode text, shape: (batch, max_seq_len, num_directions * hidden_size) + encoded_text, (last_hidden, last_cell) = self.bilstm(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, lstm_hidden_size) + hidden, att_weights = self.attention(encoded_text, mask) # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(hidden)) # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits + + def forward_interpet(self, text, seq_len, noise=None, i=None, n_samples=None): + mask = text != self.padding_idx + + baseline_text = paddle.to_tensor( + [[0] * text.shape[1]], dtype=text.dtype, place=text.place, stop_gradient=text.stop_gradient + ) + + embedded_text = self.embedder(text) + baseline_embedded = self.embedder(baseline_text) + + if noise is not None: + if noise.upper() == "GAUSSIAN": + stdev_spread = 0.15 + stdev = stdev_spread * (embedded_text.max() - embedded_text.min()).numpy() + noise = paddle.to_tensor( + np.random.normal(0, stdev, embedded_text.shape).astype(np.float32), stop_gradient=False + ) + embedded_text = embedded_text + noise + + elif noise.upper() == "INTEGRATED": + embedded_text = baseline_embedded + (i / (n_samples - 1)) * (embedded_text - baseline_embedded) + + else: + raise ValueError("unsupported noise method: %s" % (noise)) + + # Encode text, shape: (batch, max_seq_len, num_directions * hidden_size) + encoded_text, (last_hidden, last_cell) = self.bilstm(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, lstm_hidden_size) + hidden, att_weights = self.attention(encoded_text, mask) # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(hidden)) # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + probs = self.softmax(logits) + return probs, att_weights.squeeze(axis=-1), embedded_text + + +class SelfAttention(nn.Layer): + """ + A close implementation of attention network of ACL 2016 paper, + Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (Zhou et al., 2016). + ref: https://www.aclweb.org/anthology/P16-2034/ + Args: + hidden_size (int): The number of expected features in the input x. + """ + + def __init__(self, hidden_size): + super().__init__() + self.hidden_size = hidden_size + self.att_weight = self.create_parameter(shape=[1, hidden_size, 1], dtype="float32") + + def forward(self, input, mask=None): + """ + Args: + input (paddle.Tensor) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence. + mask (paddle.Tensor) of shape (batch, seq_len) : + Tensor is a bool tensor, whose each element identifies whether the input word id is pad token or not. + Defaults to `None`. 
+ """ + forward_input, backward_input = paddle.chunk(input, chunks=2, axis=2) + # elementwise-sum forward_x and backward_x + # Shape: (batch_size, max_seq_len, hidden_size) + h = paddle.add_n([forward_input, backward_input]) + # Shape: (batch_size, hidden_size, 1) + att_weight = self.att_weight.tile(repeat_times=(paddle.shape(h)[0], 1, 1)) + # Shape: (batch_size, max_seq_len, 1) + att_score = paddle.bmm(paddle.tanh(h), att_weight) + if mask is not None: + # mask, remove the effect of 'PAD' + mask = paddle.cast(mask, dtype="float32") + mask = mask.unsqueeze(axis=-1) + inf_tensor = paddle.full(shape=mask.shape, dtype="float32", fill_value=-INF) + att_score = paddle.multiply(att_score, mask) + paddle.multiply(inf_tensor, (1 - mask)) + # Shape: (batch_size, max_seq_len, 1) + att_weight = F.softmax(att_score, axis=1) + # Shape: (batch_size, lstm_hidden_size) + reps = paddle.bmm(h.transpose(perm=(0, 2, 1)), att_weight).squeeze(axis=-1) + reps = paddle.tanh(reps) + return reps, att_weight + + +class SelfInteractiveAttention(nn.Layer): + """ + A close implementation of attention network of NAACL 2016 paper, Hierarchical Attention Networks for Document Classification (Yang et al., 2016). + ref: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf + Args: + hidden_size (int): The number of expected features in the input x. + """ + + def __init__(self, hidden_size): + super().__init__() + self.input_weight = self.create_parameter(shape=[1, hidden_size, hidden_size], dtype="float32") + self.bias = self.create_parameter(shape=[1, 1, hidden_size], dtype="float32") + self.att_context_vector = self.create_parameter(shape=[1, hidden_size, 1], dtype="float32") + + def forward(self, input, mask=None): + """ + Args: + input (paddle.Tensor) of shape (batch, seq_len, hidden_size): Tensor containing the features of the input sequence. + mask (paddle.Tensor) of shape (batch, seq_len) : + Tensor is a bool tensor, whose each element identifies whether the input word id is pad token or not. 
+ Defaults to `None + """ + weight = self.input_weight.tile( + repeat_times=(paddle.shape(input)[0], 1, 1) + ) # tensor[batch, hidden_size, hidden_size] + bias = self.bias.tile(repeat_times=(paddle.shape(input)[0], 1, 1)) # tensor[batch, 1, hidden_size] + word_squish = paddle.bmm(input, weight) + bias # Shape: (batch_size, seq_len, hidden_size) + att_context_vector = self.att_context_vector.tile( + repeat_times=(paddle.shape(input)[0], 1, 1) + ) # Shape: (batch_size, hidden_size, 1) + att_score = paddle.bmm(word_squish, att_context_vector) # tensor[batch_size, seq_len, 1] + if mask is not None: + # mask, remove the effect of 'PAD' + mask = paddle.cast(mask, dtype="float32") + mask = mask.unsqueeze(axis=-1) + inf_tensor = paddle.full(shape=paddle.shape(mask), dtype="float32", fill_value=-INF) + att_score = paddle.multiply(att_score, mask) + paddle.multiply(inf_tensor, (1 - mask)) + att_weight = F.softmax(att_score, axis=1) # tensor[batch_size, seq_len, 1] + + reps = paddle.bmm(input.transpose(perm=(0, 2, 1)), att_weight).squeeze(-1) # Shape: (batch_size, hidden_size) + return reps, att_weight diff --git a/examples/model_interpretation/task/senti/rnn/tokenizer_config.json b/examples/model_interpretation/task/senti/rnn/tokenizer_config.json new file mode 100644 index 0000000000000000000000000000000000000000..1b15a346024173ce58a2770fe03c27f2e0db3c32 --- /dev/null +++ b/examples/model_interpretation/task/senti/rnn/tokenizer_config.json @@ -0,0 +1 @@ +{"model":"LSTM"} \ No newline at end of file diff --git a/examples/model_interpretation/task/senti/rnn/train.py b/examples/model_interpretation/task/senti/rnn/train.py new file mode 100644 index 0000000000000000000000000000000000000000..570334a5d94e9ac5d17fb088a85821a8d37c1610 --- /dev/null +++ b/examples/model_interpretation/task/senti/rnn/train.py @@ -0,0 +1,142 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
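+"""
+Trains the BiLSTM-Attention sentiment baseline that the interpretation scripts
+in this task later load as the LSTM model.
+
+Illustrative invocation from this directory (paths and values are examples only,
+not the required ones):
+
+    python train.py --device gpu --language ch --vocab_path vocab.txt \
+        --epochs 10 --batch_size 64 --lr 5e-5 --save_dir checkpoints/
+"""
+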
+import argparse +import os +import random +from functools import partial + +import numpy as np +import paddle +from model import BiLSTMAttentionModel, SelfInteractiveAttention +from utils import CharTokenizer, convert_example + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default="checkpoints/", help="Directory to save model checkpoint") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--vocab_path", type=str, default=None) +parser.add_argument("--language", choices=["ch", "en"], default=None, help="Language that the model is built for") +args = parser.parse_args() + + +def set_seed(seed=1000): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) + return dataloader + + +if __name__ == "__main__": + paddle.set_device(args.device) + set_seed() + + if args.language == "ch": + train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"]) + else: + train_ds, dev_ds = load_dataset("glue", "sst-2", splits=["train", "dev"]) + + # Loads vocab. + if not os.path.exists(args.vocab_path): + raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + + tokenizer = CharTokenizer(vocab, args.language, "../../../punctuations") + + # Constructs the newtork. 
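+    # The encoder below runs in "bidirect" mode, so the attention layer is built
+    # with hidden_size = 2 * lstm_hidden_size to match the concatenated
+    # forward/backward hidden states (see SelfInteractiveAttention in model.py).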
+ vocab_size = len(vocab) + num_classes = len(train_ds.label_list) + pad_token_id = 0 + pad_value = vocab.token_to_idx.get("[PAD]", 0) + + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=vocab_size, + lstm_hidden_size=lstm_hidden_size, + num_classes=num_classes, + padding_idx=pad_token_id, + ) + + model = paddle.Model(model) + + # Reads data and generates mini-batches. + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False, language=args.language) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_value), Stack(dtype="int64"), Stack(dtype="int64") # input_ids # seq len # label + ): [data for data in fn(samples)] + + train_loader = create_dataloader( + train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn + ) + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Defines loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Loads pre-trained parameters. + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + # Starts training and evaluating. + callback = paddle.callbacks.ProgBarLogger(log_freq=10, verbose=3) + model.fit(train_loader, dev_loader, epochs=args.epochs, save_dir=args.save_dir, callbacks=callback) diff --git a/examples/model_interpretation/task/senti/rnn/utils.py b/examples/model_interpretation/task/senti/rnn/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..4d574423d48a793a1948c7178e77bcf60af0f4a9 --- /dev/null +++ b/examples/model_interpretation/task/senti/rnn/utils.py @@ -0,0 +1,166 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + + +def convert_example(example, tokenizer, is_test=False, language="en"): + """ + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + valid_length(obj:`int`): The input sequence valid length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. 
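+
+    Note:
+        English examples are expected to provide `sentence`/`labels` fields
+        (SST-2 style), while Chinese examples provide `text`/`label`
+        (ChnSentiCorp style). When `is_test` is True, only the `context` field
+        is read and no label is returned.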
+ """ + if is_test: + input_ids = tokenizer.encode(example["context"]) + valid_length = np.array(len(input_ids), dtype="int64") + input_ids = np.array(input_ids, dtype="int64") + return input_ids, valid_length + else: + if language == "en": + input_ids = tokenizer.encode(example["sentence"]) + label = np.array(example["labels"], dtype="int64") + else: + input_ids = tokenizer.encode(example["text"]) + label = np.array(example["label"], dtype="int64") + valid_length = np.array(len(input_ids), dtype="int64") + input_ids = np.array(input_ids, dtype="int64") + return input_ids, valid_length, label + + +def preprocess_prediction_data(data, tokenizer): + """ + It process the prediction data as the format used as training. + + Args: + data (obj:`List[str]`): The prediction data whose each element is a tokenized text. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + + Returns: + examples (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + + """ + examples = [] + for text in data: + # ids = tokenizer.encode(text) # JiebaTokenizer + ids = tokenizer.encode(text)[0].tolist()[1:-1] # ErnieTokenizer list[ids] + examples.append([ids, len(ids)]) + + return examples + + +def get_idx_from_word(word, word_to_idx, unk_word): + if word in word_to_idx: + return word_to_idx[word] + return word_to_idx[unk_word] + + +class CharTokenizer: + def __init__(self, vocab, language, vocab_path): + self.tokenizer = list + self.vocab = vocab + self.language = language + self.vocab_path = vocab_path + self.unk_token = [] + + def encode(self, sentence): + if self.language == "ch": + words = tokenizer_punc(sentence, self.vocab_path) + else: + words = sentence.strip().split() + return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in words] + + def tokenize(self, sentence, wo_unk=True): + if self.language == "ch": + return tokenizer_punc(sentence, self.vocab_path) + else: + return sentence.strip().split() + + def convert_tokens_to_string(self, tokens): + return " ".join(tokens) + + def convert_tokens_to_ids(self, tokens): + return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in tokens] + + +def tokenizer_lac(string, lac): + temp = "" + res = [] + for c in string: + if "\u4e00" <= c <= "\u9fff": + if temp != "": + res.extend(lac.run(temp)) + temp = "" + res.append(c) + else: + temp += c + if temp != "": + res.extend(lac.run(temp)) + return res + + +def tokenizer_punc(string, vocab_path): + res = [] + sub_string_list = string.strip().split("[MASK]") + for idx, sub_string in enumerate(sub_string_list): + temp = "" + for c in sub_string: + if "\u4e00" <= c <= "\u9fff": + if temp != "": + temp_seg = punc_split(temp, vocab_path) + res.extend(temp_seg) + temp = "" + res.append(c) + else: + temp += c + if temp != "": + temp_seg = punc_split(temp, vocab_path) + res.extend(temp_seg) + if idx < len(sub_string_list) - 1: + res.append("[MASK]") + return res + + +def punc_split(string, vocab_path): + punc_set = set() + with open(vocab_path, "r") as f: + for token in f: + punc_set.add(token.strip()) + punc_set.add(" ") + for ascii_num in range(65296, 65306): + punc_set.add(chr(ascii_num)) + for ascii_num in range(48, 58): + punc_set.add(chr(ascii_num)) + + res = [] + temp = "" + for c in string: + if c in punc_set: + if temp != "": + res.append(temp) + temp = "" + res.append(c) + else: + temp += c + if temp != "": 
+ res.append(temp) + return res diff --git a/examples/model_interpretation/task/senti/roberta/modeling.py b/examples/model_interpretation/task/senti/roberta/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..02f2bec87d8554295771910cc55fabd261ba91b2 --- /dev/null +++ b/examples/model_interpretation/task/senti/roberta/modeling.py @@ -0,0 +1,608 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import paddle +import paddle.nn as nn + +from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model + +sys.path.append("../..") +from task.transformer import TransformerEncoder, TransformerEncoderLayer # noqa: E402 + +sys.path.remove("../..") + +__all__ = [ + "RobertaModel", + "RobertaPretrainedModel", + "RobertaForSequenceClassification", + "RobertaForTokenClassification", + "RobertaForQuestionAnswering", +] + + +class RobertaEmbeddings(nn.Layer): + r""" + Include embeddings from word, position and token_type embeddings. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + pad_token_id=0, + ): + super(RobertaEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id) + self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size) + self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size) + self.layer_norm = nn.LayerNorm(hidden_size) + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None, position_ids=None): + if position_ids is None: + # maybe need use shape op to unify static graph and dynamic graph + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=-1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + if token_type_ids is None: + token_type_ids = paddle.zeros_like(input_ids, dtype="int64") + + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = input_embedings + position_embeddings + token_type_embeddings + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class RobertaPooler(nn.Layer): + def __init__(self, hidden_size): + super(RobertaPooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class RobertaPretrainedModel(PretrainedModel): + r""" + An abstract class for pretrained RoBerta models. 
It provides RoBerta related + `model_config_file`, `pretrained_resource_files_map`, `resource_files_names`, + `pretrained_init_configuration`, `base_model_prefix` for downloading and + loading pretrained models. + Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. + + """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + "roberta-wwm-ext": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "roberta-wwm-ext-large": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbt3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbtl3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + "roberta-wwm-ext": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/roberta_chn_base.pdparams", + "roberta-wwm-ext-large": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams", + "rbt3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbt3/rbt3_chn_large.pdparams", + "rbtl3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbtl3/rbtl3_chn_large.pdparams", + } + } + base_model_prefix = "roberta" + + def _init_weights(self, layer): + """Initialization hook""" + if isinstance(layer, (nn.Linear, nn.Embedding)): + # only support dygraph, use truncated_normal and make it inplace + # and configurable later + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.roberta.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + elif isinstance(layer, nn.LayerNorm): + layer._epsilon = 1e-12 + + +@register_base_model +class RobertaModel(RobertaPretrainedModel): + r""" + The bare Roberta Model outputting raw hidden-states. + + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + + This model is also a Paddle `paddle.nn.Layer <https://www.paddlepaddle.org.cn/documentation + /docs/zh/api/paddle/nn/Layer_cn.html>`__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + + Args: + vocab_size (int): + Vocabulary size of `inputs_ids` in `RobertaModel`. 
Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `RobertaModel`. + hidden_size (int, optional): + Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. + num_hidden_layers (int, optional): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + num_attention_heads (int, optional): + Number of attention heads for each attention layer in the Transformer encoder. + Defaults to `12`. + intermediate_size (int, optional): + Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors + to ff layers are firstly projected from `hidden_size` to `intermediate_size`, + and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. + Defaults to `3072`. + hidden_act (str, optional): + The non-linear activation function in the feed-forward layer. + ``"gelu"``, ``"relu"`` and any other paddle supported activation functions + are supported. Defaults to ``"gelu"``. + hidden_dropout_prob (float, optional): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float, optional): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + max_position_embeddings (int, optional): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids` passed when calling `~transformers.RobertaModel`. + Defaults to `2`. + initializer_range (float, optional): + The standard deviation of the normal initializer. Defaults to 0.02. + + .. note:: + A normal_initializer initializes weight matrices as normal distributions. + See :meth:`RobertaPretrainedModel._init_weights()` for how weights are initialized in `RobertaModel`. + + pad_token_id(int, optional): + The index of padding token in the token vocabulary. + Defaults to `0`. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.02, + layer_norm_eps=1e-12, + pad_token_id=0, + ): + super(RobertaModel, self).__init__() + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + self.embeddings = RobertaEmbeddings( + vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings, type_vocab_size, pad_token_id + ) + encoder_layer = TransformerEncoderLayer( + hidden_size, + num_attention_heads, + intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=0, + ) + self.encoder = TransformerEncoder(encoder_layer, num_hidden_layers) + self.pooler = RobertaPooler(hidden_size) + + def forward( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. 
+ token_type_ids (Tensor, optional): + Segment token indices to indicate first and second portions of the inputs. + Indices can be either 0 or 1: + + - 0 corresponds to a **sentence A** token, + - 1 corresponds to a **sentence B** token. + + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, max_position_embeddings - 1]``. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + Defaults to `None`. + attention_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + Defaults to `None`, which means nothing needed to be prevented attention to. + + Returns: + tuple: Returns tuple (`sequence_output`, `pooled_output`). + + With the fields: + + - sequence_output (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + + - pooled_output (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaModel, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaModel.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + sequence_output, pooled_output = model(**inputs) + + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2] + ) + # CLS: 101; SEP: 102; PAD: 0 + baseline_ids = paddle.to_tensor( + [101] + [0] * (input_ids.shape[1] - 2) + [102], + dtype=input_ids.dtype, + place=input_ids.place, + stop_gradient=input_ids.stop_gradient, + ) + + embedding_output = self.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + baseline_embedding_output = self.embeddings( + input_ids=baseline_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + if noise is not None: + if noise.upper() == "GAUSSIAN": + pass + # stdev_spread = 0.15 + # stdev = stdev_spread * (orig_embedded.max() - orig_embedded.min()).numpy() + # noise = paddle.to_tensor(np.random.normal(0, stdev, orig_embedded.shape).astype(np.float32), + # stop_gradient=False) + # orig_embedded = orig_embedded + noise + if noise.upper() == "INTEGRATED": + embedding_output = baseline_embedding_output + i / (n_samples - 1) * ( + embedding_output - baseline_embedding_output + ) + else: + raise ValueError("unsupported noise method: %s" % (noise)) + + # encoder_outputs = self.encoder(embedding_output, attention_mask) + encoder_outputs, att_weights_list = self.encoder(embedding_output, attention_mask) # interpret + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output, att_weights_list, embedding_output + + +class RobertaForQuestionAnswering(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output to + compute `span_start_logits` and `span_end_logits`, designed for question-answering tasks like SQuAD. + + Args: + roberta (:class:`RobertaModel`): + An instance of RobertaModel. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` of `RobertaModel` + instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, dropout=None): + super(RobertaForQuestionAnswering, self).__init__() + self.roberta = roberta # allow roberta to be config + self.classifier = nn.Linear(self.roberta.config["hidden_size"], 2) + + def forward(self, input_ids, token_type_ids=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + sequence_output, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=None, attention_mask=None + ) + + logits = self.classifier(sequence_output) + logits = paddle.transpose(logits, perm=[2, 0, 1]) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + + return start_logits, end_logits + + +class RobertaForSequenceClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + self.softmax = nn.Softmax() + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input text classification logits. + Its data type should be float32 and it has a shape of [batch_size, num_classes]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + _, pooled_output, _, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + def forward_interpet( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + _, pooled_output, att_weights_list, embedding_output = self.roberta( + input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + noise=noise, + i=i, + n_samples=n_samples, + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + probs = self.softmax(logits) + + return probs, att_weights_list, embedding_output + + +class RobertaForTokenClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForTokenClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input token classification logits. + Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForTokenClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + sequence_output, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + return logits diff --git a/examples/model_interpretation/task/senti/run_inter.sh b/examples/model_interpretation/task/senti/run_inter.sh new file mode 100644 index 0000000000000000000000000000000000000000..c7b71e78d21213174c6fdc21ff97cc34688729a8 --- /dev/null +++ b/examples/model_interpretation/task/senti/run_inter.sh @@ -0,0 +1,65 @@ +### + # This file contains script to generate saliency map of a specific baseline model and language on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=4 +export PYTHONPATH=./:$PYTHONPATH + +LANGUAGE=en # LANGUAGE choose in [ch, en] +BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large, lstm] +INTER_MODE=attention # INTER_MODE choice in [attention, integrated_gradient, lime] +TASK=senti_${LANGUAGE} +DATA=../../data/${TASK} +START_ID=0 +FROM_PRETRAIN='test' +VOCAB_PATH='test' + +if [[ $LANGUAGE == "en" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-base' + CKPT=pretrained_models/saved_model_en/roberta_base_20211105_135732/model_10000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_base_20211206_164443/model_10000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-large' + CKPT=pretrained_models/saved_model_en/roberta_large_20211105_160323/model_4000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_large_20211207_174631/model_4000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH='rnn/vocab.sst2_train' + CKPT=rnn/checkpoints_en/final.pdparams + fi + +elif [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=pretrained_models/saved_model_ch/roberta_base/model_900/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_base_20211229_101252/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=pretrained_models/saved_model_ch/roberta_large_20211014_192021/model_900/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_large_20211229_105019/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH='rnn/vocab.txt' + CKPT=rnn/checkpoints_ch/final.pdparams + fi +fi + +OUTPUT=./output/${TASK}.${BASE_MODEL} +[ -d $OUTPUT ] || mkdir -p $OUTPUT +set -x + +python3 ./saliency_map/sentiment_interpretable.py \ + --language $LANGUAGE \ + --base_model $BASE_MODEL \ + --data_dir $DATA \ + --vocab_path $VOCAB_PATH \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE\ + --output_dir $OUTPUT \ + --n-samples 200 \ + --start_id $START_ID \ + --eval $@ diff --git 
a/examples/model_interpretation/task/senti/run_inter_all.sh b/examples/model_interpretation/task/senti/run_inter_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..8b0a1d98bf0113b413d7d4698733dd0dc8551359 --- /dev/null +++ b/examples/model_interpretation/task/senti/run_inter_all.sh @@ -0,0 +1,75 @@ +### + # This file contains script to generate saliency map of all baseline models and languages on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### + +export CUDA_VISIBLE_DEVICES=1 +export PYTHONPATH=./:$PYTHONPATH +START_ID=0 +FROM_PRETRAIN='test' +VOCAB_PATH='test' + +for BASE_MODEL in "lstm" "roberta_base" "roberta_large"; +do + for INTER_MODE in "attention" "integrated_gradient" "lime"; + do + for LANGUAGE in "ch" "en"; + do + TASK=senti_${LANGUAGE} + DATA=../../data/${TASK} + + if [[ $LANGUAGE == "en" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-base' + CKPT=pretrained_models/saved_model_en/roberta_base_20211105_135732/model_10000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_base_20211206_164443/model_10000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-large' + CKPT=pretrained_models/saved_model_en/roberta_large_20211105_160323/model_4000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_large_20211207_174631/model_4000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH='rnn/vocab.sst2_train' + CKPT=rnn/checkpoints_en/final.pdparams + #CKPT=rnn/checkpoints_en/final.pdparams + fi + + elif [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=pretrained_models/saved_model_ch/roberta_base/model_900/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_base_20211229_101252/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=pretrained_models/saved_model_ch/roberta_large_20211014_192021/model_900/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_large_20211229_105019/model_900/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + VOCAB_PATH='rnn/vocab.txt' + CKPT=rnn/checkpoints_ch/final.pdparams + #CKPT=rnn/checkpoints_ch/final.pdparams + fi + fi + + OUTPUT=./output/${TASK}.${BASE_MODEL} + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + + if [[ ! -f ${OUTPUT}/interpret.${INTER_MODE} ]]; then + python3 ./saliency_map/sentiment_interpretable.py \ + --language $LANGUAGE \ + --base_model $BASE_MODEL \ + --data_dir $DATA \ + --vocab_path $VOCAB_PATH \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE\ + --output_dir $OUTPUT \ + --n-samples 200 \ + --start_id $START_ID \ + --eval $@ + fi + done + done +done \ No newline at end of file diff --git a/examples/model_interpretation/task/senti/saliency_map/sentiment_interpretable.py b/examples/model_interpretation/task/senti/saliency_map/sentiment_interpretable.py new file mode 100644 index 0000000000000000000000000000000000000000..61afefc70ec598d94f0e52ad66d345379a594131 --- /dev/null +++ b/examples/model_interpretation/task/senti/saliency_map/sentiment_interpretable.py @@ -0,0 +1,502 @@ +# !/usr/bin/env python3 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import collections +import json +import logging +import os +import sys +from functools import partial +from pathlib import Path + +import numpy as np +import paddle +from LIME.lime_text import LimeTextExplainer +from rnn.model import BiLSTMAttentionModel, SelfInteractiveAttention +from rnn.utils import CharTokenizer, convert_example +from roberta.modeling import RobertaForSequenceClassification +from tqdm import tqdm + +from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import DatasetBuilder +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, + match, +) + +sys.path.remove("../../..") + +log = logging.getLogger(__name__) +log.setLevel(logging.DEBUG) +logging.getLogger().setLevel(logging.DEBUG) + + +def get_args(): + parser = argparse.ArgumentParser("interpret sentiment analysis task") + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batchsize") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--eval", action="store_true") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument( + "--inter_mode", + type=str, + default="attention", + choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], + help="appoint the mode of interpretable.", + ) + parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument("--start_id", type=int, default=0) + parser.add_argument("--vocab_path", type=str, required=True) + parser.add_argument("--language", type=str, required=True, help="language that the model is built for") + args = parser.parse_args() + return args + + +class Senti_data(DatasetBuilder): + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + line_split = json.loads(line) + yield { + "id": line_split["id"], + "context": line_split["context"], + "sent_token": line_split["sent_token"], + } + + +def 
create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) + return dataloader + + +def map_fn_senti(examples, tokenizer, args): + log.debug("load data %d" % len(examples)) + if args.language == "en": + contexts = [example["context"].encode("ascii", errors="replace").decode("UTF-8") for example in examples] + else: + contexts = [example["context"] for example in examples] + tokenized_examples = tokenizer(contexts, max_seq_len=args.max_seq_len) + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + for i in range(len(tokenized_examples)): + tokenized_examples[i]["offset_mapping"] = ( + [(0, 0)] + tokenizer.get_offset_mapping(contexts[i])[: args.max_seq_len - 2] + [(0, 0)] + ) + return tokenized_examples + + +def init_lstm_var(args): + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = CharTokenizer(vocab, args.language, "../../punctuations") + padding_idx = vocab.token_to_idx.get("[PAD]", 0) + + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=True, language=args.language) + + # Init attention layer + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=len(tokenizer.vocab), + lstm_hidden_size=lstm_hidden_size, + num_classes=2, + padding_idx=padding_idx, + ) + + # Reads data and generates mini-batches. 
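+    # convert_example is called with is_test=True above, so each mini-batch is
+    # only (padded input_ids, sequence lengths); no labels are produced here.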
+ dev_ds = Senti_data().read(args.data_dir) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=padding_idx), # input_ids + Stack(dtype="int64"), # seq len + ): [data for data in fn(samples)] + + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + + return model, tokenizer, dev_loader + + +def init_roberta_var(args): + tokenizer = None + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + model = RobertaForSequenceClassification.from_pretrained( + args.from_pretrained, + hidden_dropout_prob=0, + attention_probs_dropout_prob=0, + dropout=0, + num_labels=2, + name="", + return_inter_score=True, + ) + + map_fn = partial(map_fn_senti, tokenizer=tokenizer, args=args) + + dev_ds = Senti_data().read(args.data_dir) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), + } + ): fn(samples) + + dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dataloader + + +def extract_attention_scores(args, atts, input_ids, tokens, sub_word_id_dict, result, offset, out_handle): + if args.base_model.startswith("roberta"): + inter_score = atts[-1][:, :, 0, :].mean(1) # (bsz, seq) + inter_score = inter_score[0][1:-1] # remove CLS and SEP + input_ids = input_ids[0][1:-1] + + elif args.base_model == "lstm": + inter_score = atts[0] + input_ids = input_ids[0] + + length = (inter_score > 0).cast("int32").sum(-1).tolist()[0] + assert len(tokens) == length, f"%s: {len(tokens)} != {length}" % (step + 1) + + char_attribution_dict = {} + # Collect scores in different situation + if args.base_model.startswith("roberta"): + assert len(inter_score) == len(offset), str(len(inter_score)) + "not equal to" + str(len(offset)) + sorted_token = [] + for i in range(len(inter_score)): + sorted_token.append([i, offset[i], inter_score[i]]) + + char_attribution_dict = match(result["context"], result["sent_token"], sorted_token) + + result["char_attri"] = collections.OrderedDict() + for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): + result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("sent_token") + else: + if args.language == "ch": + idx = 0 + for token, score in zip(tokens, inter_score.numpy().tolist()): + char_attribution_dict[idx] = (token, score) + idx += 1 + else: + idx = 0 + for word, sub_word_score in zip(tokens, inter_score.tolist()): + char_attribution_dict[idx] = (word, sub_word_score) + idx += 1 + + result["char_attri"] = collections.OrderedDict() + for token_id, token_info in sorted(char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): + result["char_attri"][token_id] = token_info + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + + +def extract_integrated_gradient_scores( + args, + atts, + input_ids, + tokens, + sub_word_id_dict, + fwd_args, + fwd_kwargs, + model, + result, + pred_label, + err_total, + offset, + out_handle, +): + embedded_grads_list = [] + for i in 
range(args.n_samples): + probs, _, embedded = model.forward_interpet( + *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples + ) + predicted_class_prob = probs[0][pred_label] + predicted_class_prob.backward(retain_graph=False) + embedded_grad = embedded.grad + model.clear_gradients() + embedded_grads_list.append(embedded_grad) + + if i == 0: + baseline_pred_confidence = probs.tolist()[0][pred_label] # scalar + baseline_embedded = embedded # Tensor(1, seq_len, embed_size) + elif i == args.n_samples - 1: + pred_confidence = probs.tolist()[0][pred_label] # scalar + pred_embedded = embedded # Tensor(1, seq_len, embed_size) + + embedded_grads_tensor = paddle.to_tensor( + embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True + ) + + trapezoidal_grads = (embedded_grads_tensor[1:] + embedded_grads_tensor[:-1]) / 2 + integral_grads = trapezoidal_grads.sum(0) / trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size) + + inter_score = (pred_embedded - baseline_embedded) * integral_grads # Tensor(1, seq_len, embed_size) + inter_score = inter_score.sum(-1) # Tensor(1, seq_len) + + # eval err + delta_pred_confidence = pred_confidence - baseline_pred_confidence + sum_gradient = inter_score.sum().tolist()[0] + err = (delta_pred_confidence - sum_gradient + 1e-12) / (delta_pred_confidence + 1e-12) + err_total.append(np.abs(err)) + + print_str = "%s\t%d\t%.3f\t%.3f\t%.3f\t%.3f" + print_vals = (result["id"], args.n_samples, delta_pred_confidence, sum_gradient, err, np.average(err_total)) + log.debug(print_str % print_vals) + + inter_score.stop_gradient = True + + char_attribution_dict = {} + if args.base_model.startswith("roberta"): + inter_score = inter_score[0][1:-1] + sorted_token = [] + for i in range(len(inter_score)): + sorted_token.append([i, offset[i], inter_score[i]]) + char_attribution_dict = match(result["context"], result["sent_token"], sorted_token) + + result["char_attri"] = collections.OrderedDict() + for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): + result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("sent_token") + + elif args.base_model == "lstm": + inter_score = inter_score[0] + idx = 0 + for word, sub_word_score in zip(tokens, inter_score.tolist()): + char_attribution_dict[idx] = (word, sub_word_score) + idx += 1 + + result["char_attri"] = collections.OrderedDict() + for token_id, token_info in sorted(char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): + result["char_attri"][token_id] = token_info + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + return err_total + + +def extract_LIME_scores( + args, + tokenizer, + tokens, + pred_label, + model, + probs, + result, + lime_err_total, + lime_score_total, + lime_relative_err_total, + out_handle, +): + explainer = LimeTextExplainer(class_names=["neg", "pos"], verbose=False, language=args.language) + + if_lstm = args.base_model == "lstm" + explain_res = None + + text_instance = result["context"] + + explain_res = explainer.explain_instance( + text_instance=text_instance, + tokenizer=tokenizer, + pred_label=pred_label, + classifier_fn=model.forward_interpet, + num_samples=5000, + if_lstm=if_lstm, + ) + + exp, indexed_string, relative_err, err = explain_res + + score = exp.score[pred_label] + local_exps = exp.local_exp + ridge_pred = exp.local_pred[pred_label] + model_pred = probs.numpy().tolist()[0][pred_label] + + lime_score_total.append(score) + 
lime_relative_err_total.append(relative_err) + lime_err_total.append(err) + log.debug("score: %.2f" % score) + log.debug("relative_err: %.2f" % relative_err) + log.debug("err: %.2f" % err) + log.debug("ridge_pred: %.2f\tpred: %.2f\tdelta: %.2f" % (ridge_pred, model_pred, ridge_pred - model_pred)) + + for kind, local_exp in local_exps.items(): # only have one iteration here + char_attribution_dict = [] + + for idx in range(len(result["sent_token"])): + t = result["sent_token"][idx] # .replace('Ġ', '') + got_score = False + for word_id, attribution in local_exp: + if indexed_string.inverse_vocab[word_id] == t: + char_attribution_dict.append((idx, t, attribution)) + got_score = True + break + if not got_score: + char_attribution_dict.append((idx, t, 0)) + char_attribution_dict = sorted(char_attribution_dict, key=lambda x: x[2], reverse=True) + + result["char_attri"] = collections.OrderedDict() + for s in char_attribution_dict: + result["char_attri"][s[0]] = (s[1], s[2]) + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + return lime_err_total, lime_score_total, lime_relative_err_total + + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader = init_roberta_var(args) + elif args.base_model == "lstm": + model, tokenizer, dataloader = init_lstm_var(args) + else: + raise ValueError("unsupported base model name.") + + assert args.eval, "INTERPRETER must be run in eval mode" + with paddle.amp.auto_cast(enable=args.use_amp), open( + os.path.join(args.output_dir, "interpret" + f".{args.inter_mode}"), "w" + ) as out_handle: + + # Load model + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + model.train() # set dropout to 0 in order to get the gradient + log.debug("load model from %s" % args.init_checkpoint) + + get_sub_word_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))) + for step, d in tqdm(enumerate(dataloader)): + if step + 1 < args.start_id: # start from the step's instance + continue + # Initialize input_ids, fwd_args, tokens + result = {} + offset = None + if args.base_model.startswith("roberta"): + input_ids, token_type_ids, offset_map = d + fwd_args = [input_ids, token_type_ids] + fwd_kwargs = {} + tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:-1].tolist()) # list + offset = offset_map[0, 1:-1] + + elif args.base_model == "lstm": + input_ids, seq_lens = d + fwd_args = [input_ids, seq_lens] + fwd_kwargs = {} + tokens = [tokenizer.vocab.idx_to_token[input_id] for input_id in input_ids.tolist()[0]] + + result["id"] = dataloader.dataset.data[step]["id"] + + probs, atts, embedded = model.forward_interpet(*fwd_args, **fwd_kwargs) + pred_label = paddle.argmax(probs, axis=-1).tolist()[0] + + result["pred_label"] = pred_label + result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] + sub_word_id_dict = [] + err_total = [] + lime_err_total, lime_score_total, lime_relative_err_total = [], [], [] + + result["context"] = dataloader.dataset.data[step]["context"] + result["sent_token"] = dataloader.dataset.data[step]["sent_token"] + + # Attention + if args.inter_mode == "attention": + # extract attention scores and write resutls to file + extract_attention_scores(args, atts, input_ids, tokens, sub_word_id_dict, result, offset, out_handle) + + # Integrated_gradient + elif args.inter_mode == "integrated_gradient": + err_total = extract_integrated_gradient_scores( + args, + atts, + input_ids, + tokens, + sub_word_id_dict, + fwd_args, 
+                fwd_kwargs,
+                model,
+                result,
+                pred_label,
+                err_total,
+                offset,
+                out_handle,
+            )
+
+        # LIME
+        elif args.inter_mode == "lime":
+            lime_err_total, lime_score_total, lime_relative_err_total = extract_LIME_scores(
+                args,
+                tokenizer,
+                tokens,
+                pred_label,
+                model,
+                probs,
+                result,
+                lime_err_total,
+                lime_score_total,
+                lime_relative_err_total,
+                out_handle,
+            )
+
+        else:
+            raise KeyError(f"Unknown interpretable mode: {args.inter_mode}")
+
+        if args.inter_mode == "lime":
+            log.debug(np.average(np.array(lime_relative_err_total)))
diff --git a/examples/model_interpretation/task/senti/saliency_map/utils.py b/examples/model_interpretation/task/senti/saliency_map/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..da76e25bfa59af4140d2068880c6ce5aade8ee7f
--- /dev/null
+++ b/examples/model_interpretation/task/senti/saliency_map/utils.py
@@ -0,0 +1,38 @@
+# !/usr/bin/env python3
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import paddle
+
+
+class UnpackDataLoader(paddle.io.DataLoader):
+    def __init__(self, *args, **kwargs):
+        super(UnpackDataLoader, self).__init__(*args, batch_size=1, **kwargs)
+
+    def __iter__(self):
+        return ([yy[0] for yy in y] for y in super(UnpackDataLoader, self).__iter__())
+
+
+def create_if_not_exists(dir):
+    try:
+        dir.mkdir(parents=True)
+    except:
+        pass
+    return dir
+
+
+def get_warmup_and_linear_decay(max_steps, warmup_steps):
+    return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps))
diff --git a/examples/model_interpretation/task/similarity/LIME/exceptions.py b/examples/model_interpretation/task/similarity/LIME/exceptions.py
new file mode 100644
index 0000000000000000000000000000000000000000..c5fa1a29924ad795104c6ce7c124a58d1fa06dfe
--- /dev/null
+++ b/examples/model_interpretation/task/similarity/LIME/exceptions.py
@@ -0,0 +1,2 @@
+class LimeError(Exception):
+    """Raise for errors"""
diff --git a/examples/model_interpretation/task/similarity/LIME/explanation.py b/examples/model_interpretation/task/similarity/LIME/explanation.py
new file mode 100644
index 0000000000000000000000000000000000000000..46b3f0463fa6fb6b48271cbbc2652d3a5cec1b18
--- /dev/null
+++ b/examples/model_interpretation/task/similarity/LIME/explanation.py
@@ -0,0 +1,343 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and +# limitations under the License. +""" +Explanation class, with visualization functions. +""" +from io import open +import os +import os.path +import json +import string +import numpy as np + +from sklearn.utils import check_random_state + +from LIME.exceptions import LimeError + + +def id_generator(size=15, random_state=None): + """Helper function to generate random div ids. This is useful for embedding + HTML into ipython notebooks.""" + chars = list(string.ascii_uppercase + string.digits) + return "".join(random_state.choice(chars, size, replace=True)) + + +class DomainMapper(object): + """Class for mapping features to the specific domain. + + The idea is that there would be a subclass for each domain (text, tables, + images, etc), so that we can have a general Explanation class, and separate + out the specifics of visualizing features in here. + """ + + def __init__(self): + pass + + def map_exp_ids(self, exp, **kwargs): + """Maps the feature ids to concrete names. + + Default behaviour is the identity function. Subclasses can implement + this as they see fit. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + kwargs: optional keyword arguments + + Returns: + exp: list of tuples [(name, weight), (name, weight)...] + """ + return exp + + def visualize_instance_html(self, exp, label, div_name, exp_object_name, **kwargs): + """Produces html for visualizing the instance. + + Default behaviour does nothing. Subclasses can implement this as they + see fit. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + label: label id (integer) + div_name: name of div object to be used for rendering(in js) + exp_object_name: name of js explanation object + kwargs: optional keyword arguments + + Returns: + js code for visualizing the instance + """ + return "" + + +class Explanation(object): + """Object returned by explainers.""" + + def __init__(self, domain_mapper, mode="classification", class_names=None, random_state=None): + """ + + Initializer. + + Args: + domain_mapper: must inherit from DomainMapper class + type: "classification" or "regression" + class_names: list of class names (only used for classification) + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + """ + self.random_state = random_state + self.mode = mode + self.domain_mapper = domain_mapper + self.local_exp = {} + self.intercept = {} + self.score = {} + self.local_pred = {} + if mode == "classification": + self.class_names = class_names + self.top_labels = None + self.predict_proba = None + elif mode == "regression": + self.class_names = ["negative", "positive"] + self.predicted_value = None + self.min_value = 0.0 + self.max_value = 1.0 + self.dummy_label = 1 + else: + raise LimeError( + 'Invalid explanation mode "{}". ' 'Should be either "classification" ' 'or "regression".'.format(mode) + ) + + def available_labels(self): + """ + Returns the list of classification labels for which we have any explanations. + """ + try: + assert self.mode == "classification" + except AssertionError: + raise NotImplementedError("Not supported for regression explanations.") + else: + ans = self.top_labels if self.top_labels else self.local_exp.keys() + return list(ans) + + def as_list(self, label=1, **kwargs): + """Returns the explanation as a list. + + Args: + label: desired label. 
If you ask for a label for which an + explanation wasn't computed, will throw an exception. + Will be ignored for regression explanations. + kwargs: keyword arguments, passed to domain_mapper + + Returns: + list of tuples (representation, weight), where representation is + given by domain_mapper. Weight is a float. + """ + label_to_use = label if self.mode == "classification" else self.dummy_label + ans = self.domain_mapper.map_exp_ids(self.local_exp[label_to_use], **kwargs) + ans = [(x[0], float(x[1])) for x in ans] + return ans + + def as_map(self): + """Returns the map of explanations. + + Returns: + Map from label to list of tuples (feature_id, weight). + """ + return self.local_exp + + def as_pyplot_figure(self, label=1, **kwargs): + """Returns the explanation as a pyplot figure. + + Will throw an error if you don't have matplotlib installed + Args: + label: desired label. If you ask for a label for which an + explanation wasn't computed, will throw an exception. + Will be ignored for regression explanations. + kwargs: keyword arguments, passed to domain_mapper + + Returns: + pyplot figure (barchart). + """ + import matplotlib.pyplot as plt + + exp = self.as_list(label=label, **kwargs) + fig = plt.figure() + vals = [x[1] for x in exp] + names = [x[0] for x in exp] + vals.reverse() + names.reverse() + colors = ["green" if x > 0 else "red" for x in vals] + pos = np.arange(len(exp)) + 0.5 + plt.barh(pos, vals, align="center", color=colors) + plt.yticks(pos, names) + if self.mode == "classification": + title = "Local explanation for class %s" % self.class_names[label] + else: + title = "Local explanation" + plt.title(title) + return fig + + def show_in_notebook(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Shows html explanation in ipython notebook. + + See as_html() for parameters. + This will throw an error if you don't have IPython installed""" + + from IPython.core.display import display, HTML + + display( + HTML( + self.as_html( + labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs + ) + ) + ) + + def save_to_file(self, file_path, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Saves html explanation to file. . + + Params: + file_path: file to save explanations to + + See as_html() for additional parameters. + + """ + file_ = open(file_path, "w", encoding="utf8") + file_.write( + self.as_html( + labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs + ) + ) + file_.close() + + def as_html(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): + """Returns the explanation as an html page. + + Args: + labels: desired labels to show explanations for (as barcharts). + If you ask for a label for which an explanation wasn't + computed, will throw an exception. If None, will show + explanations for all available labels. (only used for classification) + predict_proba: if true, add barchart with prediction probabilities + for the top classes. (only used for classification) + show_predicted_value: if true, add barchart with expected value + (only used for regression) + kwargs: keyword arguments, passed to domain_mapper + + Returns: + code for an html page, including javascript includes. 
+ """ + + def jsonize(x): + return json.dumps(x, ensure_ascii=False) + + if labels is None and self.mode == "classification": + labels = self.available_labels() + + this_dir, _ = os.path.split(__file__) + bundle = open(os.path.join(this_dir, "bundle.js"), encoding="utf8").read() + + out = ( + """<html> + <meta http-equiv="content-type" content="text/html; charset=UTF8"> + <head><script>%s </script></head><body>""" + % bundle + ) + random_id = id_generator(size=15, random_state=check_random_state(self.random_state)) + out += ( + """ + <div class="lime top_div" id="top_div%s"></div> + """ + % random_id + ) + + predict_proba_js = "" + if self.mode == "classification" and predict_proba: + predict_proba_js = """ + var pp_div = top_div.append('div') + .classed('lime predict_proba', true); + var pp_svg = pp_div.append('svg').style('width', '100%%'); + var pp = new lime.PredictProba(pp_svg, %s, %s); + """ % ( + jsonize([str(x) for x in self.class_names]), + jsonize(list(self.predict_proba.astype(float))), + ) + + predict_value_js = "" + if self.mode == "regression" and show_predicted_value: + # reference self.predicted_value + # (svg, predicted_value, min_value, max_value) + predict_value_js = """ + var pp_div = top_div.append('div') + .classed('lime predicted_value', true); + var pp_svg = pp_div.append('svg').style('width', '100%%'); + var pp = new lime.PredictedValue(pp_svg, %s, %s, %s); + """ % ( + jsonize(float(self.predicted_value)), + jsonize(float(self.min_value)), + jsonize(float(self.max_value)), + ) + + exp_js = """var exp_div; + var exp = new lime.Explanation(%s); + """ % ( + jsonize([str(x) for x in self.class_names]) + ) + + if self.mode == "classification": + for label in labels: + exp = jsonize(self.as_list(label)) + exp_js += """ + exp_div = top_div.append('div').classed('lime explanation', true); + exp.show(%s, %d, exp_div); + """ % ( + exp, + label, + ) + else: + exp = jsonize(self.as_list()) + exp_js += """ + exp_div = top_div.append('div').classed('lime explanation', true); + exp.show(%s, %s, exp_div); + """ % ( + exp, + self.dummy_label, + ) + + raw_js = """var raw_div = top_div.append('div');""" + + if self.mode == "classification": + html_data = self.local_exp[labels[0]] + else: + html_data = self.local_exp[self.dummy_label] + + raw_js += self.domain_mapper.visualize_instance_html( + html_data, labels[0] if self.mode == "classification" else self.dummy_label, "raw_div", "exp", **kwargs + ) + out += """ + <script> + var top_div = d3.select('#top_div%s').classed('lime top_div', true); + %s + %s + %s + %s + </script> + """ % ( + random_id, + predict_proba_js, + predict_value_js, + exp_js, + raw_js, + ) + out += "</body></html>" + + return out diff --git a/examples/model_interpretation/task/similarity/LIME/lime_base.py b/examples/model_interpretation/task/similarity/LIME/lime_base.py new file mode 100644 index 0000000000000000000000000000000000000000..ca9ce28389191ac81364be6c07f33518d4a5f3f5 --- /dev/null +++ b/examples/model_interpretation/task/similarity/LIME/lime_base.py @@ -0,0 +1,225 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Contains abstract functionality for learning locally linear sparse model. +""" +import numpy as np +import scipy as sp +from sklearn.linear_model import Ridge, lars_path +from sklearn.utils import check_random_state + + +class LimeBase(object): + """Class for learning a locally linear sparse model from perturbed data""" + + def __init__(self, kernel_fn, verbose=False, random_state=None): + """Init function + + Args: + kernel_fn: function that transforms an array of distances into an + array of proximity values (floats). + verbose: if true, print local prediction values from linear model. + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + """ + self.kernel_fn = kernel_fn + self.verbose = verbose + self.random_state = check_random_state(random_state) + + @staticmethod + def generate_lars_path(weighted_data, weighted_labels): + """Generates the lars path for weighted data. + + Args: + weighted_data: data that has been weighted by kernel + weighted_label: labels, weighted by kernel + + Returns: + (alphas, coefs), both are arrays corresponding to the + regularization parameter and coefficients, respectively + """ + x_vector = weighted_data + alphas, _, coefs = lars_path(x_vector, weighted_labels, method="lasso", verbose=False) + return alphas, coefs + + def forward_selection(self, data, labels, weights, num_features): + """Iteratively adds features to the model""" + clf = Ridge(alpha=0, fit_intercept=True, random_state=self.random_state) + used_features = [] + for _ in range(min(num_features, data.shape[1])): + max_ = -100000000 + best = 0 + for feature in range(data.shape[1]): + if feature in used_features: + continue + clf.fit(data[:, used_features + [feature]], labels, sample_weight=weights) + score = clf.score(data[:, used_features + [feature]], labels, sample_weight=weights) + if score > max_: + best = feature + max_ = score + used_features.append(best) + return np.array(used_features) + + def feature_selection(self, data, labels, weights, num_features, method): + """Selects features for the model. 
see explain_instance_with_data to
+           understand the parameters."""
+        if method == "none":
+            return np.array(range(data.shape[1]))
+
+        elif method == "forward_selection":
+            return self.forward_selection(data, labels, weights, num_features)
+
+        elif method == "highest_weights":
+            clf = Ridge(alpha=0.01, fit_intercept=True, random_state=self.random_state)
+            clf.fit(data, labels, sample_weight=weights)
+
+            coef = clf.coef_
+            if sp.sparse.issparse(data):
+                coef = sp.sparse.csr_matrix(clf.coef_)
+                weighted_data = coef.multiply(data[0])
+                # Note: most efficient to slice the data before reversing
+                sdata = len(weighted_data.data)
+                argsort_data = np.abs(weighted_data.data).argsort()
+                # Edge case where data is more sparse than requested number of feature importances
+                # In that case, we just pad with zero-valued features
+                if sdata < num_features:
+                    nnz_indexes = argsort_data[::-1]
+                    indices = weighted_data.indices[nnz_indexes]
+                    num_to_pad = num_features - sdata
+                    indices = np.concatenate((indices, np.zeros(num_to_pad, dtype=indices.dtype)))
+                    indices_set = set(indices)
+                    pad_counter = 0
+                    for i in range(data.shape[1]):
+                        if i not in indices_set:
+                            indices[pad_counter + sdata] = i
+                            pad_counter += 1
+                            if pad_counter >= num_to_pad:
+                                break
+                else:
+                    nnz_indexes = argsort_data[sdata - num_features : sdata][::-1]
+                    indices = weighted_data.indices[nnz_indexes]
+                return indices
+            else:
+                weighted_data = coef * data[0]
+                feature_weights = sorted(
+                    zip(range(data.shape[1]), weighted_data),  # zip(feature index, Ridge coefficient)
+                    key=lambda x: np.abs(x[1]),
+                    reverse=True,
+                )
+                # return the indices of the num_features features with the largest absolute Ridge coefficients
+                return np.array([x[0] for x in feature_weights[:num_features]])
+
+        elif method == "lasso_path":
+            weighted_data = (data - np.average(data, axis=0, weights=weights)) * np.sqrt(weights[:, np.newaxis])
+            weighted_labels = (labels - np.average(labels, weights=weights)) * np.sqrt(weights)
+            nonzero = range(weighted_data.shape[1])
+            _, coefs = self.generate_lars_path(weighted_data, weighted_labels)
+            for i in range(len(coefs.T) - 1, 0, -1):
+                nonzero = coefs.T[i].nonzero()[0]
+                if len(nonzero) <= num_features:
+                    break
+            used_features = nonzero
+            return used_features
+
+        elif method == "auto":
+            if num_features <= 6:
+                n_method = "forward_selection"
+            else:
+                n_method = "highest_weights"
+            return self.feature_selection(data, labels, weights, num_features, n_method)
+
+    def explain_instance_with_data(
+        self,
+        neighborhood_data,
+        neighborhood_labels,
+        distances,
+        label,
+        num_features,
+        feature_selection="auto",
+        model_regressor=None,
+    ):
+        """Takes perturbed data, labels and distances, returns explanation.
+
+        Args:
+            neighborhood_data: perturbed data, 2d array. first element is
+                assumed to be the original data point.
+            neighborhood_labels: corresponding perturbed labels. should have as
+                many columns as the number of possible labels.
+            distances: distances to original data point.
+            label: label for which we want an explanation
+            num_features: maximum number of features in explanation
+            feature_selection: how to select num_features. options are:
+                'forward_selection': iteratively add features to the model.
+                    This is costly when num_features is high
+                'highest_weights': selects the features that have the highest
+                    product of absolute weight * original data point when
+                    learning with all the features
+                'lasso_path': chooses features based on the lasso
+                    regularization path
+                'none': uses all features, ignores num_features
+                'auto': uses forward_selection if num_features <= 6, and
+                    'highest_weights' otherwise.
+            model_regressor: sklearn regressor to use in explanation.
+                Defaults to Ridge regression if None. Must have
+                model_regressor.coef_ and 'sample_weight' as a parameter
+                to model_regressor.fit()
+
+        Returns:
+            (intercept, exp, score, local_pred):
+            intercept is a float.
+            exp is a sorted list of tuples, where each tuple (x,y) corresponds to the feature id (x)
+            and the local weight (y). The list is sorted by decreasing absolute value of y.
+            score is the R^2 value of the returned explanation
+            local_pred is the prediction of the explanation model on the original instance
+        """
+
+        weights = self.kernel_fn(distances)  # weights of the perturbed samples
+        labels_column = neighborhood_labels[:, label]  # softmax score of class `label` for each perturbed sample
+
+        used_features = self.feature_selection(
+            neighborhood_data, labels_column, weights, num_features, feature_selection
+        )
+        if model_regressor is None:
+            model_regressor = Ridge(
+                alpha=1,  # L2 regularization strength
+                fit_intercept=True,  # whether to fit the intercept term b
+                random_state=self.random_state,  # pseudo-random seed
+            )
+        easy_model = model_regressor
+        easy_model.fit(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
+        prediction_score = easy_model.score(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
+
+        local_pred = easy_model.predict(neighborhood_data[0, used_features].reshape(1, -1))
+
+        ridge_pred = easy_model.predict(neighborhood_data[:, used_features])
+        err_np = np.abs(labels_column - ridge_pred)
+        relative_err_np = err_np / ridge_pred
+        err = np.average(err_np, weights=weights)
+        relative_err = np.average(relative_err_np, weights=weights)
+
+        if self.verbose:
+            print("Intercept", easy_model.intercept_)
+            print(
+                "Prediction_local",
+                local_pred,
+            )
+            print("Right:", neighborhood_labels[0, label])
+        return (
+            easy_model.intercept_,
+            sorted(
+                zip(used_features, easy_model.coef_), key=lambda x: np.abs(x[1]), reverse=True
+            ),  # feature ids sorted by decreasing absolute weight
+            prediction_score,  # R^2 of easy_model against the labels: the higher (i.e. the smaller the error) the better, 1.0 at most
+            local_pred,  # easy_model's prediction for the original instance
+            relative_err,
+            err,
+        )
diff --git a/examples/model_interpretation/task/similarity/LIME/lime_text.py b/examples/model_interpretation/task/similarity/LIME/lime_text.py
new file mode 100644
index 0000000000000000000000000000000000000000..b702a68d8de17fd3d4eb7c5390a709448b82b708
--- /dev/null
+++ b/examples/model_interpretation/task/similarity/LIME/lime_text.py
@@ -0,0 +1,660 @@
+# !/usr/bin/env python3
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Functions for explaining text classifiers.
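+
+The main entry point is LimeTextExplainer.explain_instance, which perturbs the
+input text, queries the classifier on the perturbed samples, and fits a locally
+weighted linear model (LIME.lime_base.LimeBase) to estimate per-token attributions.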
+""" +from functools import partial +import itertools +import json +import re +import time +import math +import paddle + +import numpy as np +import scipy as sp +import sklearn +from sklearn.utils import check_random_state + +import LIME.explanation as explanation +import LIME.lime_base as lime_base + + +class TextDomainMapper(explanation.DomainMapper): + """Maps feature ids to words or word-positions""" + + def __init__(self, indexed_string): + """Initializer. + + Args: + indexed_string: lime_text.IndexedString, original string + """ + self.indexed_string = indexed_string + + def map_exp_ids(self, exp, positions=False): + """Maps ids to words or word-position strings. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + positions: if True, also return word positions + + Returns: + list of tuples (word, weight), or (word_positions, weight) if + examples: ('bad', 1) or ('bad_3-6-12', 1) + """ + if positions: + exp = [ + ( + "%s_%s" + % (self.indexed_string.word(x[0]), "-".join(map(str, self.indexed_string.string_position(x[0])))), + x[1], + ) + for x in exp + ] + else: + exp = [(self.indexed_string.word(x[0]), x[1]) for x in exp] + return exp + + def visualize_instance_html(self, exp, label, div_name, exp_object_name, text=True, opacity=True): + """Adds text with highlighted words to visualization. + + Args: + exp: list of tuples [(id, weight), (id,weight)] + label: label id (integer) + div_name: name of div object to be used for rendering(in js) + exp_object_name: name of js explanation object + text: if False, return empty + opacity: if True, fade colors according to weight + """ + if not text: + return "" + text = self.indexed_string.raw_string().encode("utf-8", "xmlcharrefreplace").decode("utf-8") + text = re.sub(r"[<>&]", "|", text) + exp = [(self.indexed_string.word(x[0]), self.indexed_string.string_position(x[0]), x[1]) for x in exp] + all_occurrences = list(itertools.chain.from_iterable([itertools.product([x[0]], x[1], [x[2]]) for x in exp])) + all_occurrences = [(x[0], int(x[1]), x[2]) for x in all_occurrences] + ret = """ + %s.show_raw_text(%s, %d, %s, %s, %s); + """ % ( + exp_object_name, + json.dumps(all_occurrences), + label, + json.dumps(text), + div_name, + json.dumps(opacity), + ) + return ret + + +class IndexedString(object): + """String with various indexes.""" + + def __init__(self, raw_string, split_expression=r"\W+", bow=True, mask_string=None, language="ch"): + """Initializer. + + Args: + raw_string: string with raw text in it + split_expression: Regex string or callable. If regex string, will be used with re.split. + If callable, the function should return a list of tokens. + bow: if True, a word is the same everywhere in the text - i.e. we + will index multiple occurrences of the same word. If False, + order matters, so that the same word will have different ids + according to position. + mask_string: If not None, replace words with this if bow=False + if None, default value is UNKWORDZ + """ + self.raw = raw_string + self.mask_string = "UNKWORDZ" if mask_string is None else mask_string + self.language = language + + if callable(split_expression): + tokens = split_expression(self.raw) + self.as_list = self._segment_with_tokens(self.raw, tokens) + tokens = set(tokens) + + def non_word(string): + return string not in tokens + + else: + # with the split_expression as a non-capturing group (?:), we don't need to filter out + # the separator character from the split results. 
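+            # For Chinese ("ch") the raw string is split so that every CJK character
+            # in the range \u4e00-\u9fa5 becomes its own token; otherwise the
+            # user-supplied split_expression regex is used.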
+ if self.language == "ch": + splitter = re.compile(r"([\u4e00-\u9fa5])") + else: + splitter = re.compile(split_expression) + self.as_list = [w for w in splitter.split(self.raw) if len(w.strip()) > 0] + valid_word = splitter.match + + self.as_np = np.array(self.as_list) + self.string_start = np.hstack(([0], np.cumsum([len(x) for x in self.as_np[:-1]]))) + vocab = {} + self.inverse_vocab = [] + self.positions = [] + self.bow = bow + non_vocab = set() + for i, word in enumerate(self.as_np): + if word in non_vocab: + continue + if (self.language == "ch" and not valid_word(word)) or (self.language == "en" and valid_word(word)): + non_vocab.add(word) + continue + if bow: + if word not in vocab: + vocab[word] = len(vocab) + self.inverse_vocab.append(word) + self.positions.append([]) + idx_word = vocab[word] + self.positions[idx_word].append(i) + else: + self.inverse_vocab.append(word) + self.positions.append(i) + if not bow: + self.positions = np.array(self.positions) + + def raw_string(self): + """Returns the original raw string""" + return self.raw + + def num_words(self): + """Returns the number of tokens in the vocabulary for this document.""" + return len(self.inverse_vocab) + + def word(self, id_): + """Returns the word that corresponds to id_ (int)""" + return self.inverse_vocab[id_] + + def string_position(self, id_): + """Returns a np array with indices to id_ (int) occurrences""" + if self.bow: + return self.string_start[self.positions[id_]] + else: + return self.string_start[[self.positions[id_]]] + + def inverse_removing(self, words_to_remove): + """Returns a string after removing the appropriate words. + + If self.bow is false, replaces word with UNKWORDZ instead of removing it. + + Args: + words_to_remove: list of ids (ints) to remove + + Returns: + original raw string with appropriate words removed. + """ + mask = np.ones(self.as_np.shape[0], dtype="bool") + mask[self.__get_idxs(words_to_remove)] = False + if not self.bow: + return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) + return "".join([self.as_list[v] for v in mask.nonzero()[0]]) + + @staticmethod + def _segment_with_tokens(text, tokens): + """Segment a string around the tokens created by a passed-in tokenizer""" + list_form = [] + text_ptr = 0 + for token in tokens: + inter_token_string = [] + while not text[text_ptr:].startswith(token): + inter_token_string.append(text[text_ptr]) + text_ptr += 1 + if text_ptr >= len(text): + raise ValueError("Tokenization produced tokens that do not belong in string!") + text_ptr += len(token) + if inter_token_string: + list_form.append("".join(inter_token_string)) + list_form.append(token) + if text_ptr < len(text): + list_form.append(text[text_ptr:]) + return list_form + + def __get_idxs(self, words): + """Returns indexes to appropriate words.""" + if self.bow: + return list(itertools.chain.from_iterable([self.positions[z] for z in words])) + else: + return self.positions[words] + + +class IndexedCharacters(object): + """String with various indexes.""" + + def __init__(self, raw_string, bow=True, mask_string=None): + """Initializer. + + Args: + raw_string: string with raw text in it + bow: if True, a char is the same everywhere in the text - i.e. we + will index multiple occurrences of the same character. If False, + order matters, so that the same word will have different ids + according to position. 
+ mask_string: If not None, replace characters with this if bow=False + if None, default value is chr(0) + """ + self.raw = raw_string + self.as_list = list(self.raw) + self.as_np = np.array(self.as_list) + self.mask_string = chr(0) if mask_string is None else mask_string + self.string_start = np.arange(len(self.raw)) + vocab = {} + self.inverse_vocab = [] + self.positions = [] + self.bow = bow + non_vocab = set() + for i, char in enumerate(self.as_np): + if char in non_vocab: + continue + if bow: + if char not in vocab: + vocab[char] = len(vocab) + self.inverse_vocab.append(char) + self.positions.append([]) + idx_char = vocab[char] + self.positions[idx_char].append(i) + else: + self.inverse_vocab.append(char) + self.positions.append(i) + if not bow: + self.positions = np.array(self.positions) + + def raw_string(self): + """Returns the original raw string""" + return self.raw + + def num_words(self): + """Returns the number of tokens in the vocabulary for this document.""" + return len(self.inverse_vocab) + + def word(self, id_): + """Returns the word that corresponds to id_ (int)""" + return self.inverse_vocab[id_] + + def string_position(self, id_): + """Returns a np array with indices to id_ (int) occurrences""" + if self.bow: + return self.string_start[self.positions[id_]] + else: + return self.string_start[[self.positions[id_]]] + + def inverse_removing(self, words_to_remove): + """Returns a string after removing the appropriate words. + + If self.bow is false, replaces word with UNKWORDZ instead of removing + it. + + Args: + words_to_remove: list of ids (ints) to remove + + Returns: + original raw string with appropriate words removed. + """ + mask = np.ones(self.as_np.shape[0], dtype="bool") + mask[self.__get_idxs(words_to_remove)] = False + if not self.bow: + return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) + return "".join([self.as_list[v] for v in mask.nonzero()[0]]) + + def __get_idxs(self, words): + """Returns indexes to appropriate words.""" + if self.bow: + return list(itertools.chain.from_iterable([self.positions[z] for z in words])) + else: + return self.positions[words] + + +class LimeTextExplainer(object): + """Explains text classifiers. + Currently, we are using an exponential kernel on cosine distance, and + restricting explanations to words that are present in documents.""" + + def __init__( + self, + kernel_width=25, + kernel=None, + verbose=False, + class_names=None, + feature_selection="auto", + split_expression=r"\W+", + bow=True, + mask_string=None, + random_state=None, + char_level=False, + language="ch", + ): + """Init function. + + Args: + kernel_width: kernel width for the exponential kernel. + kernel: similarity kernel that takes euclidean distances and kernel + width as input and outputs weights in (0,1). If None, defaults to + an exponential kernel. + verbose: if true, print local prediction values from linear model + class_names: list of class names, ordered according to whatever the + classifier is using. If not present, class names will be '0', + '1', ... + feature_selection: feature selection method. can be + 'forward_selection', 'lasso_path', 'none' or 'auto'. + See function 'explain_instance_with_data' in lime_base.py for + details on what each of the options does. + split_expression: Regex string or callable. If regex string, will be used with re.split. + If callable, the function should return a list of tokens. 
+ bow: if True (bag of words), will perturb input data by removing + all occurrences of individual words or characters. + Explanations will be in terms of these words. Otherwise, will + explain in terms of word-positions, so that a word may be + important the first time it appears and unimportant the second. + Only set to false if the classifier uses word order in some way + (bigrams, etc), or if you set char_level=True. + mask_string: String used to mask tokens or characters if bow=False + if None, will be 'UNKWORDZ' if char_level=False, chr(0) + otherwise. + random_state: an integer or numpy.RandomState that will be used to + generate random numbers. If None, the random state will be + initialized using the internal numpy seed. + char_level: an boolean identifying that we treat each character + as an independent occurence in the string + """ + + if kernel is None: + + def kernel(d, kernel_width): + return np.sqrt(np.exp(-(d**2) / kernel_width**2)) + + kernel_fn = partial(kernel, kernel_width=kernel_width) + + self.random_state = check_random_state(random_state) + self.base = lime_base.LimeBase(kernel_fn, verbose, random_state=self.random_state) + self.class_names = class_names + self.vocabulary = None + self.feature_selection = feature_selection + self.bow = bow + self.mask_string = mask_string + self.split_expression = split_expression + self.char_level = char_level + self.language = language + + def explain_instance( + self, + text_instance_q: str, + text_instance_t: str, + analysis_query, + tokenizer, + pred_label: int, + classifier_fn, + labels=(0, 1), + top_labels=None, + num_features=10, + num_samples=5000, + distance_metric="cosine", + model_regressor=None, + if_lstm=False, + ): + """Generates explanations for a prediction. + + First, we generate neighborhood data by randomly hiding features from + the instance (see __data_labels_distance_mapping). We then learn + locally weighted linear models on this neighborhood data to explain + each of the classes in an interpretable way (see lime_base.py). + + Args: + text_instance: raw text string to be explained. + classifier_fn: classifier prediction probability function, which + takes a list of d strings and outputs a (d, k) numpy array with + prediction probabilities, where k is the number of classes. + For ScikitClassifiers , this is classifier.predict_proba. + labels: iterable with labels to be explained. + top_labels: if not None, ignore labels and produce explanations for + the K labels with highest prediction probabilities, where K is + this parameter. + num_features: maximum number of features present in explanation + num_samples: size of the neighborhood to learn the linear model + distance_metric: the distance metric to use for sample weighting, + defaults to cosine similarity + model_regressor: sklearn regressor to use in explanation. Defaults + to Ridge regression in LimeBase. Must have model_regressor.coef_ + and 'sample_weight' as a parameter to model_regressor.fit() + Returns: + An Explanation object (see explanation.py) with the corresponding + explanations. 
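+
+        Example (illustrative sketch only; `explainer`, `query`, `title`, `tokenizer`,
+        `pred` and `predict_fn` are placeholders for objects created by the caller):
+
+            exp, indexed_string, relative_err, err = explainer.explain_instance(
+                query, title, analysis_query=True, tokenizer=tokenizer,
+                pred_label=pred, classifier_fn=predict_fn, num_samples=5000)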
+        """
+        # prev_time = time.time()
+
+        text_instance = text_instance_q if analysis_query else text_instance_t
+        text_support = text_instance_t if analysis_query else text_instance_q
+
+        indexed_string = (
+            IndexedCharacters(text_instance, bow=self.bow, mask_string=self.mask_string)
+            if self.char_level
+            else IndexedString(
+                text_instance,
+                bow=self.bow,
+                split_expression=self.split_expression,
+                mask_string=self.mask_string,
+                language=self.language,
+            )
+        )
+        domain_mapper = TextDomainMapper(indexed_string)
+
+        # Build the perturbed dataset; the first entry is the original instance
+        # data: training features for the interpreter, list (num_samples, doc_size)
+        # yss: training labels for the interpreter, list (num_samples, class_num(2))
+        # distances: distance from each perturbed sample to the original one, np.array(float) (num_samples, )
+        data, yss, distances = self.__data_labels_distances(
+            indexed_string,
+            text_support,
+            analysis_query,
+            tokenizer,
+            classifier_fn,
+            num_samples,
+            distance_metric=distance_metric,
+            if_lstm=if_lstm,
+        )
+
+        if self.class_names is None:
+            self.class_names = [str(x) for x in range(yss[0].shape[0])]
+        ret_exp = explanation.Explanation(
+            domain_mapper=domain_mapper, class_names=self.class_names, random_state=self.random_state
+        )
+        ret_exp.predict_proba = yss[0]
+        if top_labels:
+            labels = np.argsort(yss[0])[-top_labels:]
+            ret_exp.top_labels = list(labels)
+            ret_exp.top_labels.reverse()
+
+        num_features = indexed_string.num_words()  # the number of features equals the number of words
+
+        (
+            ret_exp.intercept[pred_label],
+            ret_exp.local_exp[pred_label],
+            ret_exp.score[pred_label],
+            ret_exp.local_pred[pred_label],
+            relative_err,
+            err,
+        ) = self.base.explain_instance_with_data(
+            data,
+            yss,
+            distances,
+            pred_label,
+            num_features,
+            model_regressor=model_regressor,
+            feature_selection=self.feature_selection,
+        )
+
+        return ret_exp, indexed_string, relative_err, err
+
+    def __data_labels_distances(
+        self,
+        indexed_string,
+        text_support,
+        analysis_query,
+        tokenizer,
+        classifier_fn,
+        num_samples,
+        distance_metric="cosine",
+        if_lstm=False,
+    ):
+        """Generates a neighborhood around a prediction.
+
+        Generates neighborhood data by randomly removing words from
+        the instance, and predicting with the classifier. Uses cosine distance
+        to compute distances between original and perturbed instances.
+        Args:
+            indexed_string: document (IndexedString) to be explained,
+            classifier_fn: classifier prediction probability function, which
+                takes a string and outputs prediction probabilities. For
+                ScikitClassifier, this is classifier.predict_proba.
+            num_samples: size of the neighborhood to learn the linear model
+            distance_metric: the distance metric to use for sample weighting,
+                defaults to cosine similarity.
+
+        Returns:
+            A tuple (data, labels, distances), where:
+                data: dense num_samples * K binary matrix, where K is the
+                    number of tokens in indexed_string. The first row is the
+                    original instance, and thus a row of ones.
+                labels: num_samples * L matrix, where L is the number of target
+                    labels
+                distances: cosine distance between the original instance and
+                    each perturbed instance (computed in the binary 'data'
+                    matrix), times 100.
+        """
+
+        def distance_fn(x):
+            return sklearn.metrics.pairwise.pairwise_distances(x, x[0], metric=distance_metric).ravel() * 100
+
+        doc_size = indexed_string.num_words()
+
+        sample = self.random_state.randint(
+            1, doc_size, num_samples - 1
+        )  # sample: [int(1 ~ doc_size-1) * num_samples-1]
+        data = np.ones((num_samples, doc_size))
+        data[0] = np.ones(doc_size)
+        features_range = range(doc_size)
+        perturb_text = [indexed_string.raw_string()]  # [text str] * num_samples
+
+        for i, size in enumerate(sample, start=1):
+            # inactive: a list of `size` ids drawn at random from range(0, doc_size); the ids of the tokens to remove
+            inactive = self.random_state.choice(
+                features_range,  # [0, doc_size)
+                size,  # int: number of tokens removed in this perturbed sample
+                replace=False,
+            )
+
+            text = indexed_string.inverse_removing(inactive)  # the original text with the tokens in `inactive` removed
+
+            data[i, inactive] = 0
+            perturb_text.append(text)
+
+        # print('doc size: %d' % doc_size)
+
+        prev_time = time.time()
+        # inverse_data: the perturbed dataset, [perturbed sample str] * num_samples
+        labels = []
+        query_list, title_list, query_len_list, title_len_list = [], [], [], []  # for lstm
+        token_ids_list, s_ids_list = [], []  # for roberta
+        max_len = 0
+
+        support_token_ids = tokenizer.encode(text_support)  # for lstm
+        support_len = len(support_token_ids)  # for lstm
+        for idx, text in enumerate(perturb_text):
+            if if_lstm:
+                text_token_ids = tokenizer.encode(text)
+                text_len = len(text_token_ids)
+                if idx == 0:
+                    max_len = len(text_token_ids)
+                while len(text_token_ids) < max_len:
+                    text_token_ids.append(0)
+
+                query_token_ids = text_token_ids if analysis_query else support_token_ids
+                title_token_ids = support_token_ids if analysis_query else text_token_ids
+                query_len = text_len if analysis_query else support_len
+                title_len = support_len if analysis_query else text_len
+
+                query_list.append(query_token_ids)
+                title_list.append(title_token_ids)
+                query_len_list.append(query_len)
+                title_len_list.append(title_len)
+
+            else:
+                text_tokens = tokenizer.tokenize(text)
+                text_token_ids = tokenizer.convert_tokens_to_ids(text_tokens)
+                support_tokens = tokenizer.tokenize(text_support)
+                support_ids = tokenizer.convert_tokens_to_ids(support_tokens)
+                if analysis_query:
+                    token_ids = (
+                        [tokenizer.cls_token_id]
+                        + text_token_ids
+                        + [tokenizer.sep_token_id]
+                        + support_ids
+                        + [tokenizer.sep_token_id]
+                    )
+                else:
+                    token_ids = (
+                        [tokenizer.cls_token_id]
+                        + support_ids
+                        + [tokenizer.sep_token_id]
+                        + text_token_ids
+                        + [tokenizer.sep_token_id]
+                    )
+                if len(token_ids) > max_len:
+                    max_len = len(token_ids)
+                token_ids_list.append(token_ids)
+
+        token_ids_np = []
+        if not if_lstm:
+            for token_ids in token_ids_list:
+                # token_ids = token_ids[:max_len]
+                token_ids = token_ids + [tokenizer.pad_token_id] * (max_len - len(token_ids))
+                token_ids_np.append(token_ids)
+                s_ids = [0 for _ in range(len(token_ids))]
+                s_ids_list.append(s_ids)
+
+            token_ids_np = np.array(token_ids_np)
+            s_ids_np = np.array(s_ids_list)
+
+        length = len(perturb_text[0])
+        if if_lstm:
+            batch = 128
+        else:
+            batch = 64 if length < 130 else 50
+
+        prev_time = time.time()
+        epoch_num = math.ceil(len(perturb_text) / batch)
+        for idx in range(epoch_num):
+            if if_lstm:
+                query_list_tensor = paddle.to_tensor(query_list[idx * batch : (idx + 1) * batch])
+                title_list_tensor = paddle.to_tensor(title_list[idx * batch : (idx + 1) * batch])
+                query_len_list_tensor = paddle.to_tensor(query_len_list[idx * batch : (idx + 1) * batch])
+                title_len_list_tensor = paddle.to_tensor(title_len_list[idx * batch : (idx + 1) * batch])
+                label = classifier_fn(
+                    query_list_tensor, title_list_tensor,
query_len_list_tensor, title_len_list_tensor + )[ + 0 + ] # label: Tensor[num_samples, 2] + else: + token_ids_tensor = paddle.Tensor( + value=token_ids_np[idx * batch : (idx + 1) * batch], place=paddle.CUDAPlace(0), stop_gradient=True + ) + s_ids_tensor = paddle.Tensor( + value=s_ids_np[idx * batch : (idx + 1) * batch], + place=token_ids_tensor.place, + stop_gradient=token_ids_tensor.stop_gradient, + ) + label = classifier_fn(token_ids_tensor, s_ids_tensor)[0] # label: Tensor[num_samples, 2] + + labels.extend(label.numpy().tolist()) + + labels = np.array(labels) # labels: nsp.array(num_samples, 2) + print("mode forward time: %.5f" % (time.time() - prev_time)) + distances = distance_fn(sp.sparse.csr_matrix(data)) + + return data, labels, distances diff --git a/examples/model_interpretation/task/similarity/pretrained_models/data.py b/examples/model_interpretation/task/similarity/pretrained_models/data.py new file mode 100644 index 0000000000000000000000000000000000000000..37f2c12781f47ea1151427bb9f2df9c708d75e81 --- /dev/null +++ b/examples/model_interpretation/task/similarity/pretrained_models/data.py @@ -0,0 +1,138 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import numpy as np + +from paddlenlp.datasets import MapDataset + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {"query": data[0], "title": data[1]} + + +def convert_pointwise_example(example, tokenizer, max_seq_length=512, is_test=False, language="en"): + if language == "ch": + q_name = "query" + t_name = "title" + l_name = "label" + else: + q_name = "sentence1" + t_name = "sentence2" + l_name = "labels" + + query, title = example[q_name], example[t_name] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example[l_name]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +def convert_pairwise_example(example, tokenizer, max_seq_length=512, phase="train"): + + if phase == "train": + query, pos_title, neg_title = example["query"], example["title"], example["neg_title"] + + pos_inputs = tokenizer(text=query, text_pair=pos_title, 
max_seq_len=max_seq_length) + neg_inputs = tokenizer(text=query, text_pair=neg_title, max_seq_len=max_seq_length) + + pos_input_ids = pos_inputs["input_ids"] + pos_token_type_ids = pos_inputs["token_type_ids"] + neg_input_ids = neg_inputs["input_ids"] + neg_token_type_ids = neg_inputs["token_type_ids"] + + return (pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids) + + else: + query, title = example["query"], example["title"] + + inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = inputs["input_ids"] + token_type_ids = inputs["token_type_ids"] + if phase == "eval": + return input_ids, token_type_ids, example["label"] + elif phase == "predict": + return input_ids, token_type_ids + else: + raise ValueError("not supported phase:{}".format(phase)) + + +def gen_pair(dataset, pool_size=100): + """ + Generate triplet randomly based on dataset + + Args: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: example["query"], example["title"] + pool_size: the number of example to sample negative example randomly + + Return: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: example["query"], example["pos_title"]、example["neg_title"] + """ + + if len(dataset) < pool_size: + pool_size = len(dataset) + + new_examples = [] + pool = [] + tmp_examples = [] + + for example in dataset: + label = example["label"] + + # Filter negative example + if label == 0: + continue + + tmp_examples.append(example) + pool.append(example["title"]) + + if len(pool) >= pool_size: + np.random.shuffle(pool) + for idx, example in enumerate(tmp_examples): + example["neg_title"] = pool[idx] + new_examples.append(example) + tmp_examples = [] + pool = [] + else: + continue + return MapDataset(new_examples) diff --git a/examples/model_interpretation/task/similarity/pretrained_models/model.py b/examples/model_interpretation/task/similarity/pretrained_models/model.py new file mode 100644 index 0000000000000000000000000000000000000000..cf886ba69a85b2e656aa0d6a6c483a73abe875fa --- /dev/null +++ b/examples/model_interpretation/task/similarity/pretrained_models/model.py @@ -0,0 +1,89 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
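+# Matching heads used by the similarity interpretation task:
+#   * PointwiseMatching feeds the pooled [CLS] representation of a (query, title)
+#     pair through a 2-way classifier and returns softmax probabilities.
+#   * PairwiseMatching scores positive and negative pairs with a sigmoid similarity
+#     head and is trained with a margin ranking loss.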
+
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+
+
+class PointwiseMatching(nn.Layer):
+    def __init__(self, pretrained_model, dropout=None):
+        super().__init__()
+        self.ptm = pretrained_model
+        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)
+
+        # num_labels = 2 (similar or dissimilar)
+        self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2)
+
+    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
+
+        _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask)
+
+        cls_embedding = self.dropout(cls_embedding)
+        logits = self.classifier(cls_embedding)
+        probs = F.softmax(logits)
+
+        return probs
+
+
+class PairwiseMatching(nn.Layer):
+    def __init__(self, pretrained_model, dropout=None, margin=0.1):
+        super().__init__()
+        self.ptm = pretrained_model
+        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)
+        self.margin = margin
+
+        # hidden_size -> 1, calculate similarity
+        self.similarity = nn.Linear(self.ptm.config["hidden_size"], 1)
+
+    def predict(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
+
+        _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask)
+
+        cls_embedding = self.dropout(cls_embedding)
+        sim_score = self.similarity(cls_embedding)
+        sim_score = F.sigmoid(sim_score)
+
+        return sim_score
+
+    def forward(
+        self,
+        pos_input_ids,
+        neg_input_ids,
+        pos_token_type_ids=None,
+        neg_token_type_ids=None,
+        pos_position_ids=None,
+        neg_position_ids=None,
+        pos_attention_mask=None,
+        neg_attention_mask=None,
+    ):
+
+        _, pos_cls_embedding = self.ptm(pos_input_ids, pos_token_type_ids, pos_position_ids, pos_attention_mask)
+
+        _, neg_cls_embedding = self.ptm(neg_input_ids, neg_token_type_ids, neg_position_ids, neg_attention_mask)
+
+        pos_embedding = self.dropout(pos_cls_embedding)
+        neg_embedding = self.dropout(neg_cls_embedding)
+
+        pos_sim = self.similarity(pos_embedding)
+        neg_sim = self.similarity(neg_embedding)
+
+        pos_sim = F.sigmoid(pos_sim)
+        neg_sim = F.sigmoid(neg_sim)
+
+        labels = paddle.full(shape=[pos_cls_embedding.shape[0]], fill_value=1.0, dtype="float32")
+
+        loss = F.margin_ranking_loss(pos_sim, neg_sim, labels, margin=self.margin)
+
+        return loss
diff --git a/examples/model_interpretation/task/similarity/pretrained_models/predict_pointwise.py b/examples/model_interpretation/task/similarity/pretrained_models/predict_pointwise.py
new file mode 100644
index 0000000000000000000000000000000000000000..39c56528c9ec62b81dc71a35f9a31cd9a08870cb
--- /dev/null
+++ b/examples/model_interpretation/task/similarity/pretrained_models/predict_pointwise.py
@@ -0,0 +1,115 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" + This script is used for predicting results +""" +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_pointwise_example as convert_example +from data import create_dataloader, read_text_pair +from model import PointwiseMatching + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument( + "--max_seq_length", + default=64, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +parser.add_argument("--language", choices=["ch", "en"], required=True, help="Language that the model is built for") +args = parser.parse_args() + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: + [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + batch_probs = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + batch_prob = model(input_ids=input_ids, token_type_ids=token_type_ids).numpy() + + batch_probs.append(batch_prob) + + batch_probs = np.concatenate(batch_probs, axis=0) + + return batch_probs + + +if __name__ == "__main__": + paddle.set_device(args.device) + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial( + convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True, language=args.language + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment_ids + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = PointwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + y_probs = predict(model, valid_data_loader) + y_preds = np.argmax(y_probs, axis=1) + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + for idx, y_pred in enumerate(y_preds): + text_pair = valid_ds[idx] + text_pair["pred_label"] = y_pred + 
print(text_pair) diff --git a/examples/model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh b/examples/model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh new file mode 100644 index 0000000000000000000000000000000000000000..13771c1837edee94e630c6806d20fc908669e5d3 --- /dev/null +++ b/examples/model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh @@ -0,0 +1,32 @@ +### + # This script is used to finetune pretrained models +### + +export CUDA_VISIBLE_DEVICES=7 + +LANGUAGE="ch" # ['ch', 'en'] +BASE_MODEL=roberta_large # [roberta_base, roberta_large] +timestamp=`date +"%Y%m%d_%H%M%S"` + +if [[ $LANGUAGE == "ch" ]]; then + LEARNING_RATE=3e-5 + MAX_SEQ_LENGTH=256 +elif [[ $LANGUAGE == "en" ]]; then + LEARNING_RATE=5e-6 + MAX_SEQ_LENGTH=128 +fi + +[ -d "logs" ] || mkdir -p "logs" +set -x + +python3 ./train_pointwise.py \ + --learning_rate $LEARNING_RATE \ + --max_seq_length $MAX_SEQ_LENGTH \ + --batch_size 32 \ + --epochs 5 \ + --save_step 1000 \ + --warmup_proportion 0.1 \ + --base_model $BASE_MODEL \ + --language $LANGUAGE \ + --save_dir saved_model_${LANGUAGE}/${BASE_MODEL}_${timestamp} >> logs/log_${BASE_MODEL}_${timestamp} + diff --git a/examples/model_interpretation/task/similarity/pretrained_models/train_pointwise.py b/examples/model_interpretation/task/similarity/pretrained_models/train_pointwise.py new file mode 100644 index 0000000000000000000000000000000000000000..7e7bbc190efcfc6f883ee8b9e5f7eb28f61b2a61 --- /dev/null +++ b/examples/model_interpretation/task/similarity/pretrained_models/train_pointwise.py @@ -0,0 +1,215 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import sys +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_pointwise_example as convert_example +from data import create_dataloader + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("..") +sys.path.append("../../..") +from roberta.modeling import RobertaForSequenceClassification # noqa: E402 + +sys.path.remove("../../..") +sys.path.remove("..") + +parser = argparse.ArgumentParser() +parser.add_argument("--base_model", type=str, choices=["roberta_base", "roberta_large"]) +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
" + "Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=1000, type=int, help="Step interval for evaluation.") +parser.add_argument("--save_step", default=1000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument( + "--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process." +) +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +parser.add_argument("--language", choices=["ch", "en"], required=True, help="Language that the model is built for") +args = parser.parse_args() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+ """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, labels = batch + probs = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(probs, labels) + losses.append(loss.numpy()) + correct = metric.compute(probs, labels) + metric.update(correct) + accu = metric.accumulate() + print("eval {} loss: {:.5}, accu: {:.5}".format(phase, np.mean(losses), accu)) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + if args.language == "ch": + train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"]) + + if args.base_model == "roberta_base": + tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext") + pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext", num_classes=2) + elif args.base_model == "roberta_large": + tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large") + pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext-large", num_classes=2) + else: + train_ds, dev_ds = load_dataset("glue", "qqp", splits=["train", "dev"]) + + if args.base_model == "roberta_base": + tokenizer = RobertaBPETokenizer.from_pretrained("roberta-base") + pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_classes=2) + elif args.base_model == "roberta_large": + tokenizer = RobertaBPETokenizer.from_pretrained("roberta-large") + pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_classes=2) + + trans_func = partial( + convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=args.language + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = pretrained_model + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch + probs = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(probs, labels) + correct = metric.compute(probs, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + if global_step % 100 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, 100 / (time.time() - tic_train)), + flush=True, + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, criterion, metric, dev_data_loader) + + if global_step % args.save_step == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/model_interpretation/task/similarity/roberta/modeling.py b/examples/model_interpretation/task/similarity/roberta/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..c5824a443f0a81cdeab7879e00ef6be7633fca3e --- /dev/null +++ b/examples/model_interpretation/task/similarity/roberta/modeling.py @@ -0,0 +1,618 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + This script defines the model structure of roberta +""" +import sys + +import paddle +import paddle.nn as nn + +from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model + +sys.path.append("../..") +from task.transformer import TransformerEncoder, TransformerEncoderLayer # noqa: E402 + +sys.path.remove("../..") + +__all__ = [ + "RobertaModel", + "RobertaPretrainedModel", + "RobertaForSequenceClassification", + "RobertaForTokenClassification", + "RobertaForQuestionAnswering", +] + + +class RobertaEmbeddings(nn.Layer): + r""" + Include embeddings from word, position and token_type embeddings. 
+ """ + + def __init__( + self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + pad_token_id=0, + ): + super(RobertaEmbeddings, self).__init__() + self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id) + self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size) + self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size) + self.layer_norm = nn.LayerNorm(hidden_size) + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None, position_ids=None): + """ + forward function + """ + if position_ids is None: + # maybe need use shape op to unify static graph and dynamic graph + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=-1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + if token_type_ids is None: + token_type_ids = paddle.zeros_like(input_ids, dtype="int64") + + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = input_embedings + position_embeddings + token_type_embeddings + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class RobertaPooler(nn.Layer): + """ + An abstract class for RobertaPooler + """ + + def __init__(self, hidden_size): + super(RobertaPooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + """ + We "pool" the model by simply taking the hidden state corresponding + to the first token. + """ + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class RobertaPretrainedModel(PretrainedModel): + r""" + An abstract class for pretrained RoBerta models. It provides RoBerta related + `model_config_file`, `pretrained_resource_files_map`, `resource_files_names`, + `pretrained_init_configuration`, `base_model_prefix` for downloading and + loading pretrained models. + Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
+ + """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + "roberta-wwm-ext": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "roberta-wwm-ext-large": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbt3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + "rbtl3": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 21128, + "pad_token_id": 0, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + "roberta-wwm-ext": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/roberta_chn_base.pdparams", + "roberta-wwm-ext-large": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams", + "rbt3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbt3/rbt3_chn_large.pdparams", + "rbtl3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbtl3/rbtl3_chn_large.pdparams", + } + } + base_model_prefix = "roberta" + + def _init_weights(self, layer): + """Initialization hook""" + if isinstance(layer, (nn.Linear, nn.Embedding)): + # only support dygraph, use truncated_normal and make it inplace + # and configurable later + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.roberta.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + elif isinstance(layer, nn.LayerNorm): + layer._epsilon = 1e-12 + + +@register_base_model +class RobertaModel(RobertaPretrainedModel): + r""" + The bare Roberta Model outputting raw hidden-states. + + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + + This model is also a Paddle `paddle.nn.Layer <https://www.paddlepaddle.org.cn/documentation + /docs/zh/api/paddle/nn/Layer_cn.html>`__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + + Args: + vocab_size (int): + Vocabulary size of `inputs_ids` in `RobertaModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `RobertaModel`. + hidden_size (int, optional): + Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. 
+ num_hidden_layers (int, optional): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + num_attention_heads (int, optional): + Number of attention heads for each attention layer in the Transformer encoder. + Defaults to `12`. + intermediate_size (int, optional): + Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors + to ff layers are firstly projected from `hidden_size` to `intermediate_size`, + and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. + Defaults to `3072`. + hidden_act (str, optional): + The non-linear activation function in the feed-forward layer. + ``"gelu"``, ``"relu"`` and any other paddle supported activation functions + are supported. Defaults to ``"gelu"``. + hidden_dropout_prob (float, optional): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float, optional): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + max_position_embeddings (int, optional): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids` passed when calling `~transformers.RobertaModel`. + Defaults to `2`. + initializer_range (float, optional): + The standard deviation of the normal initializer. Defaults to 0.02. + + .. note:: + A normal_initializer initializes weight matrices as normal distributions. + See :meth:`RobertaPretrainedModel._init_weights()` for how weights are initialized in `RobertaModel`. + + pad_token_id(int, optional): + The index of padding token in the token vocabulary. + Defaults to `0`. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=16, + initializer_range=0.02, + layer_norm_eps=1e-12, + pad_token_id=0, + ): + super(RobertaModel, self).__init__() + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + self.embeddings = RobertaEmbeddings( + vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings, type_vocab_size, pad_token_id + ) + encoder_layer = TransformerEncoderLayer( + hidden_size, + num_attention_heads, + intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=0, + ) + self.encoder = TransformerEncoder(encoder_layer, num_hidden_layers) + self.pooler = RobertaPooler(hidden_size) + + def forward( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + token_type_ids (Tensor, optional): + Segment token indices to indicate first and second portions of the inputs. + Indices can be either 0 or 1: + + - 0 corresponds to a **sentence A** token, + - 1 corresponds to a **sentence B** token. + + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. 
+ Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. + Selected in the range ``[0, max_position_embeddings - 1]``. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + Defaults to `None`. + attention_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + Defaults to `None`, which means nothing needed to be prevented attention to. + + Returns: + tuple: Returns tuple (`sequence_output`, `pooled_output`). + + With the fields: + + - sequence_output (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + + - pooled_output (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaModel, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaModel.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + sequence_output, pooled_output = model(**inputs) + + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2] + ) + # CLS: 101; SEP: 102; PAD: 0 + baseline_ids = paddle.to_tensor( + [101] + [0] * (input_ids.shape[1] - 2) + [102], + dtype=input_ids.dtype, + place=input_ids.place, + stop_gradient=input_ids.stop_gradient, + ) + + embedding_output = self.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + baseline_embedding_output = self.embeddings( + input_ids=baseline_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + if noise is not None: + if noise.upper() == "GAUSSIAN": + pass + if noise.upper() == "INTEGRATED": + embedding_output = baseline_embedding_output + i / (n_samples - 1) * ( + embedding_output - baseline_embedding_output + ) + else: + raise ValueError("unsupported noise method: %s" % (noise)) + + encoder_outputs, att_weights_list = self.encoder(embedding_output, attention_mask) # interpret + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + result = [sequence_output, pooled_output, att_weights_list] + result.append(embedding_output) + return result + + +class RobertaForQuestionAnswering(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output to + compute `span_start_logits` and `span_end_logits`, designed for question-answering tasks like SQuAD. + + Args: + roberta (:class:`RobertaModel`): + An instance of RobertaModel. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` of `RobertaModel` + instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, dropout=None): + super(RobertaForQuestionAnswering, self).__init__() + self.roberta = roberta # allow roberta to be config + self.classifier = nn.Linear(self.roberta.config["hidden_size"], 2) + + def forward(self, input_ids, token_type_ids=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + sequence_output, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=None, attention_mask=None + ) + + logits = self.classifier(sequence_output) + logits = paddle.transpose(logits, perm=[2, 0, 1]) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + + return start_logits, end_logits + + +class RobertaForSequenceClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + self.softmax = nn.Softmax() + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input text classification logits. + Its data type should be float32 and it has a shape of [batch_size, num_classes]. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + _, pooled_output, _, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + def forward_interpret( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + noise=None, + i=None, + n_samples=None, + ): + """ + The forward function used when we are interpreting the model + """ + _, pooled_output, att_weights_list, embedding_output = self.roberta( + input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + noise=noise, + i=i, + n_samples=n_samples, + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + probs = self.softmax(logits) + + return probs, att_weights_list, embedding_output + + +class RobertaForTokenClassification(RobertaPretrainedModel): + r""" + Roberta Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + + Args: + roberta (:class:`RobertaModel`): + An instance of `RobertaModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Roberta. + If None, use the same value as `hidden_dropout_prob` + of `RobertaModel` instance `roberta`. Defaults to `None`. + """ + + def __init__(self, roberta, num_classes=2, dropout=None): + super(RobertaForTokenClassification, self).__init__() + self.num_classes = num_classes + self.roberta = roberta # allow roberta to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`RobertaModel`. + token_type_ids (Tensor, optional): + See :class:`RobertaModel`. + position_ids (Tensor, optional): + See :class:`RobertaModel`. + attention_mask (Tensor, optional): + See :class:`RobertaModel`. + + Returns: + Tensor: Returns tensor `logits`, a tensor of the input token classification logits. + Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. + + Example: + .. 
code-block:: + + import paddle + from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer + + tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') + model = RobertaForTokenClassification.from_pretrained('roberta-wwm-ext') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + + """ + sequence_output, _ = self.roberta( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + return logits diff --git a/examples/model_interpretation/task/similarity/run_inter.sh b/examples/model_interpretation/task/similarity/run_inter.sh new file mode 100644 index 0000000000000000000000000000000000000000..e9de8e11df879b1965cba7d888d6fb05a4869e38 --- /dev/null +++ b/examples/model_interpretation/task/similarity/run_inter.sh @@ -0,0 +1,61 @@ +### + # This file contains script to generate saliency map of a specific baseline model and language on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### +export CUDA_VISIBLE_DEVICES=7 +export PYTHONPATH=./:$PYTHONPATH + +LANGUAGE=ch # LANGUAGE choose in [ch, en] +BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large, lstm] +INTER_MODE=lime # INTER_MODE choice in [attention, integrated_gradient, lime] +TASK=similarity_${LANGUAGE} +DATA=../../data/${TASK} +START_ID=0 + +if [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=pretrained_models/saved_model_ch/roberta_base_20211018_104038/model_11400/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_base_20211208_121026/model_12000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=pretrained_models/saved_model_ch/roberta_large_20211018_152833/model_22000/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_large_20211208_131546/model_22000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN='skep_ernie_1.0_large_ch' + CKPT=simnet/checkpoints_ch/final.pdparams + fi + +elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=pretrained_models/saved_model_en/roberta_base_20211109_205245/model_54000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_base_20211208_121339/model_54000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=pretrained_models/saved_model_en/roberta_large_20211109_205649/model_46000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_large_20211208_131440/model_42000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN='data/skep_ernie_1.0_large_ch' + CKPT=simnet/checkpoints_en/final.pdparams + fi +fi + +OUTPUT=./output/$TASK.$BASE_MODEL +[ -d $OUTPUT ] || mkdir -p $OUTPUT +set -x + +python3 ./saliency_map/similarity_interpretable.py \ + --base_model $BASE_MODEL \ + --data_dir $DATA \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --max_seq_len 256 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE \ + --start_id $START_ID \ + --output_dir $OUTPUT \ + --n-samples 500 \ + --language $LANGUAGE \ + --eval $@ diff --git 
a/examples/model_interpretation/task/similarity/run_inter_all.sh b/examples/model_interpretation/task/similarity/run_inter_all.sh new file mode 100644 index 0000000000000000000000000000000000000000..edabd07d6f41596e5eecac84e49e4e0a5b300cee --- /dev/null +++ b/examples/model_interpretation/task/similarity/run_inter_all.sh @@ -0,0 +1,69 @@ +### + # This file contains script to generate saliency map of all baseline models and languages on given input data + # The result of this script will be used to evaluate the interpretive performance of the baseline model +### +export CUDA_VISIBLE_DEVICES=4 +export PYTHONPATH=./:$PYTHONPATH + +START_ID=0 + +for BASE_MODEL in "lstm" "roberta_base" "roberta_large"; +do + for INTER_MODE in "attention" "integrated_gradient" "lime"; + do + for LANGUAGE in "ch" "en"; + do + TASK=similarity_${LANGUAGE} + DATA=../../data/${TASK} + + if [[ $LANGUAGE == "ch" ]]; then + + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN='roberta-wwm-ext' + CKPT=pretrained_models/saved_model_ch/roberta_base_20211018_104038/model_11400/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_base_20211208_121026/model_12000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN='roberta-wwm-ext-large' + CKPT=pretrained_models/saved_model_ch/roberta_large_20211018_152833/model_22000/model_state.pdparams + #CKPT=pretrained_models/saved_model_ch/roberta_large_20211208_131546/model_22000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN='data/skep_ernie_1.0_large_ch' + CKPT=simnet/checkpoints_ch/final.pdparams + fi + + elif [[ $LANGUAGE == "en" ]]; then + if [[ $BASE_MODEL == "roberta_base" ]]; then + FROM_PRETRAIN=roberta-base + CKPT=pretrained_models/saved_model_en/roberta_base_20211109_205245/model_54000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_base_20211208_121339/model_54000/model_state.pdparams + elif [[ $BASE_MODEL == "roberta_large" ]]; then + FROM_PRETRAIN=roberta-large + CKPT=pretrained_models/saved_model_en/roberta_large_20211109_205649/model_46000/model_state.pdparams + #CKPT=pretrained_models/saved_model_en/roberta_large_20211208_131440/model_42000/model_state.pdparams + elif [[ $BASE_MODEL == "lstm" ]]; then + FROM_PRETRAIN='data/skep_ernie_1.0_large_ch' + CKPT=simnet/checkpoints_en/final.pdparams + fi + fi + + OUTPUT=./output/$TASK.$BASE_MODEL + [ -d $OUTPUT ] || mkdir -p $OUTPUT + set -x + if [[ ! -f ${OUTPUT}/interpret.${INTER_MODE} ]]; then + python3 ./saliency_map/similarity_interpretable.py \ + --base_model $BASE_MODEL \ + --data_dir $DATA \ + --from_pretrained $FROM_PRETRAIN \ + --batch_size 1 \ + --max_seq_len 256 \ + --init_checkpoint $CKPT \ + --inter_mode $INTER_MODE \ + --start_id $START_ID \ + --output_dir $OUTPUT \ + --n-samples 500 \ + --language $LANGUAGE \ + --eval $@ + fi + done + done +done diff --git a/examples/model_interpretation/task/similarity/saliency_map/similarity_interpretable.py b/examples/model_interpretation/task/similarity/saliency_map/similarity_interpretable.py new file mode 100644 index 0000000000000000000000000000000000000000..73064096219034b1254dfee10f7bf32fed222d45 --- /dev/null +++ b/examples/model_interpretation/task/similarity/saliency_map/similarity_interpretable.py @@ -0,0 +1,646 @@ +# !/usr/bin/env python3 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import collections +import json +import logging +import os +import re +import sys +from functools import partial +from pathlib import Path + +import numpy as np +import paddle +from LIME.lime_text import LimeTextExplainer +from roberta.modeling import RobertaForSequenceClassification +from simnet.model import SimNet +from simnet.utils import CharTokenizer, preprocess_data +from tqdm import tqdm + +from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import DatasetBuilder +from paddlenlp.transformers.roberta.tokenizer import ( + RobertaBPETokenizer, + RobertaTokenizer, +) + +sys.path.append("../../..") +from model_interpretation.utils import ( # noqa: E402 + convert_tokenizer_res_to_old_version, + match, +) + +sys.path.remove("../../..") + +log = logging.getLogger(__name__) +log.setLevel(logging.DEBUG) +logging.getLogger().setLevel(logging.DEBUG) + + +def get_args(): + parser = argparse.ArgumentParser("interpret textual similarity task") + parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"]) + parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") + parser.add_argument( + "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batchsize") + parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") + parser.add_argument("--eval", action="store_true") + parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") + parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") + parser.add_argument( + "--use_amp", + action="store_true", + help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", + ) + parser.add_argument( + "--inter_mode", + type=str, + default="attention", + choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], + help="appoint the mode of interpretable.", + ) + parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") + parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") + parser.add_argument("--start_id", type=int, default=0) + parser.add_argument("--language", type=str, required=True, help="Language that the model is based on") + args = parser.parse_args() + return args + + +class Similarity_data(DatasetBuilder): + def _read(self, filename): + with open(filename, "r", encoding="utf8") as f: + for line in f.readlines(): + line_split = json.loads(line) + if args.language == "ch": + yield { + "id": line_split["id"], + "query": line_split["query"], + "title": line_split["title"], + "text_q_seg": line_split["text_q_seg"], + "text_t_seg": line_split["text_t_seg"], + } + else: + yield { + "id": line_split["id"], + "sentence1": line_split["sentence1"], + "sentence2": line_split["sentence2"], + "text_q_seg": 
line_split["text_q_seg"], + "text_t_seg": line_split["text_t_seg"], + } + + +def map_fn_senti(examples, tokenizer, language): + print("load data %d" % len(examples)) + if language == "ch": + q_name = "query" + t_name = "title" + queries = [example[q_name] for example in examples] + titles = [example[t_name] for example in examples] + else: + q_name = "sentence1" + t_name = "sentence2" + queries = [example[q_name].encode("ascii", errors="replace").decode("UTF-8") for example in examples] + titles = [example[t_name].encode("ascii", errors="replace").decode("UTF-8") for example in examples] + tokenized_examples = tokenizer(queries, titles, max_seq_len=args.max_seq_len) + + tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) + + for i in range(len(tokenized_examples)): + tokenized_examples[i]["query_offset_mapping"] = ( + [(0, 0)] + tokenizer.get_offset_mapping(queries[i])[: args.max_seq_len - 2] + [(0, 0)] + ) + tokenized_examples[i]["title_offset_mapping"] = ( + [(0, 0)] + tokenizer.get_offset_mapping(titles[i])[: args.max_seq_len - 2] + [(0, 0)] + ) + + return tokenized_examples + + +def init_roberta_var(args): + if args.language == "ch": + tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) + else: + tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) + + model = RobertaForSequenceClassification.from_pretrained( + args.from_pretrained, + hidden_dropout_prob=0, + attention_probs_dropout_prob=0, + dropout=0, + num_labels=2, + name="", + return_inter_score=True, + ) + + map_fn = partial(map_fn_senti, tokenizer=tokenizer, language=args.language) + + dev_ds = Similarity_data().read(args.data_dir) + dev_ds.map(map_fn, batched=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "query_offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "title_offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), + } + ): fn(samples) + + dataloader = paddle.io.DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True + ) + + return model, tokenizer, dataloader, dev_ds + + +def init_lstm_var(args): + if args.language == "ch": + vocab = Vocab.load_vocabulary("simnet/vocab.char", unk_token="[UNK]", pad_token="[PAD]") + else: + vocab = Vocab.load_vocabulary("simnet/vocab_QQP", unk_token="[UNK]", pad_token="[PAD]") + + tokenizer = CharTokenizer(vocab, args.language, "../../punctuations") + model = SimNet(network="lstm", vocab_size=len(vocab), num_classes=2) + + dev_ds = Similarity_data().read(args.data_dir) + dev_examples = preprocess_data(dev_ds.data, tokenizer, args.language) + batches = [dev_examples[idx : idx + args.batch_size] for idx in range(0, len(dev_examples), args.batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + ): [data for data in fn(samples)] + + return model, tokenizer, batches, batchify_fn, vocab, dev_ds + + +def get_seq_token_num(language): + if language == "ch": + add_idx = 1 + else: + add_idx = 2 + return add_idx + + +def get_qt_tokens(base_model, d, add_idx=None, tokenizer=None, batchify_fn=None, vocab=None): + SEP_idx = 0 + 
if base_model == "roberta": + input_ids, token_type_ids, query_offset_map, title_offset_map = d + fwd_args = [input_ids, token_type_ids] + fwd_kwargs = {} + + SEP_idx = input_ids.tolist()[0].index(tokenizer.sep_token_id) + q_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:SEP_idx].tolist()) # list + t_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, SEP_idx + add_idx : -1].tolist()) # list + q_offset = query_offset_map[0, 1:-1].tolist() + t_offset = title_offset_map[0, 1:-1].tolist() + return q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs, q_offset, t_offset + + if base_model == "lstm": + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(d) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + + fwd_args = [query_ids, title_ids, query_seq_lens, title_seq_lens] + fwd_kwargs = {} + q_tokens = [vocab._idx_to_token[idx] for idx in query_ids.tolist()[0]] + t_tokens = [vocab._idx_to_token[idx] for idx in title_ids.tolist()[0]] + return q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs + + +def extract_attention_scores(args, result, atts, q_tokens, t_tokens, out_handle, SEP_idx, q_offset, t_offset, add_idx): + if args.base_model.startswith("roberta"): + inter_score = atts[-1][:, :, 0, :].mean(1) # (bsz, seq) + q_inter_score = inter_score[0][1:SEP_idx] # remove CLS and SEP + t_inter_score = inter_score[0][SEP_idx + add_idx : -1] # remove CLS and SEP + elif args.base_model == "lstm": + q_inter_score = atts[0][0] + t_inter_score = atts[1][0] + + q_length = (q_inter_score > 0).cast("int32").sum(-1)[0] + t_length = (t_inter_score > 0).cast("int32").sum(-1)[0] + assert len(q_tokens) == q_length, f"{len(q_tokens)} != {q_length}" + assert len(t_tokens) == t_length, f"{len(t_tokens)} != {t_length}" + + q_char_attribution_dict, t_char_attribution_dict = {}, {} + if args.base_model.startswith("roberta"): + # Query + sorted_token = [] + for i in range(len(q_inter_score)): + sorted_token.append([i, q_offset[i], q_inter_score[i]]) + q_char_attribution_dict = match(result["query"], result["text_q_seg"], sorted_token) + result["query_char_attri"] = collections.OrderedDict() + for token_info in sorted(q_char_attribution_dict, key=lambda x: x[2], reverse=True): + result["query_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("text_q_seg") + + # Title + sorted_token = [] + for i in range(len(t_inter_score)): + sorted_token.append([i, t_offset[i], t_inter_score[i]]) + t_char_attribution_dict = match(result["title"], result["text_t_seg"], sorted_token) + result["title_char_attri"] = collections.OrderedDict() + for token_info in sorted(t_char_attribution_dict, key=lambda x: x[2], reverse=True): + result["title_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("text_t_seg") + + else: + idx = 0 + for token, score in zip(q_tokens, q_inter_score.tolist()): + q_char_attribution_dict[idx] = (token, score) + idx += 1 + for token, score in zip(t_tokens, t_inter_score.tolist()): + t_char_attribution_dict[idx] = (token, score) + idx += 1 + + result["query_char_attri"], result["title_char_attri"] = collections.OrderedDict(), collections.OrderedDict() + for token, attri in sorted(q_char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): + result["query_char_attri"][token] = attri + for token, attri in sorted(t_char_attribution_dict.items(), key=lambda x: x[1][1], 
reverse=True): + result["title_char_attri"][token] = attri + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + + +def IG_roberta_inter_score( + args, + embedded_grads_list, + pred_embedded, + baseline_embedded, + pred_confidence, + baseline_pred_confidence, + SEP_idx, + add_idx, + err_total, +): + embedded_grads_tensor = paddle.to_tensor( + embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True + ) + + # Tensor(n_samples-1, 1, seq_len, embed_size) + trapezoidal_grads = (embedded_grads_tensor[1:] + embedded_grads_tensor[:-1]) / 2 + integral_grads = trapezoidal_grads.sum(0) / trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size) + inter_score = (pred_embedded - baseline_embedded) * integral_grads # Tensor(1, seq_len, embed_size) + inter_score = inter_score.sum(-1) # Tensor(1, seq_len) + + # eval err + delta_pred_confidence = pred_confidence - baseline_pred_confidence + sum_gradient = inter_score.sum().tolist()[0] + err = (delta_pred_confidence - sum_gradient + 1e-12) / (delta_pred_confidence + 1e-12) + err_total.append(np.abs(err)) + + print_str = "%s\t%d\t%.3f\t%.3f\t%.3f\t%.3f" + print_vals = (result["id"], args.n_samples, delta_pred_confidence, sum_gradient, err, np.average(err_total)) + print(print_str % print_vals) + + inter_score.stop_gradient = True + q_inter_score = inter_score[0][1:SEP_idx] # remove CLS and SEP + t_inter_score = inter_score[0][SEP_idx + add_idx : -1] # remove CLS and SEP + + return q_inter_score, t_inter_score + + +def IG_lstm_inter_score(q_embedded_grads_list, pred_embedded, baseline_embedded, idx): + # query + q_embedded_grads_tensor = paddle.to_tensor( + q_embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True + ) + q_trapezoidal_grads = ( + q_embedded_grads_tensor[1:] + q_embedded_grads_tensor[:-1] + ) / 2 # Tensor(n_samples-1, 1, seq_len, embed_size) + q_integral_grads = q_trapezoidal_grads.sum(0) / q_trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size) + q_inter_score = (pred_embedded[idx] - baseline_embedded[idx]) * q_integral_grads # Tensor(1, seq_len, embed_size) + q_inter_score = q_inter_score.sum(-1) # Tensor(1, seq_len) + q_inter_score.stop_gradient = True + q_inter_score = q_inter_score[0] + + return q_inter_score + + +def extract_integrated_gradient_scores( + args, + result, + fwd_args, + fwd_kwargs, + model, + q_tokens, + t_tokens, + out_handle, + SEP_idx, + add_idx, + q_offset, + t_offset, + err_total, +): + embedded_grads_list = [] + q_embedded_grads_list, t_embedded_grads_list = [], [] + for i in range(args.n_samples): + probs, _, embedded = model.forward_interpret( + *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples + ) + predicted_class_prob = probs[0][pred_label] + predicted_class_prob.backward(retain_graph=False) + + if args.base_model.startswith("roberta"): + embedded_grad = embedded.grad + embedded_grads_list.append(embedded_grad) + elif args.base_model == "lstm": + q_embedded, t_embedded = embedded + q_embedded_grad = q_embedded.grad + t_embedded_grad = t_embedded.grad + q_embedded_grads_list.append(q_embedded_grad) + t_embedded_grads_list.append(t_embedded_grad) + model.clear_gradients() + if i == 0: + baseline_pred_confidence = probs.tolist()[0][pred_label] # scalar + baseline_embedded = embedded # Tensor(1, seq_len, embed_size) + elif i == args.n_samples - 1: + pred_confidence = probs.tolist()[0][pred_label] # scalar + pred_embedded = embedded # Tensor(1, seq_len, embed_size) + + if args.base_model.startswith("roberta"): + 
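+        # For roberta models the gradients are taken over the joint [CLS] query [SEP] title sequence; IG_roberta_inter_score then splits the resulting scores back into query and title parts.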
q_inter_score, t_inter_score = IG_roberta_inter_score( + args, + embedded_grads_list, + pred_embedded, + baseline_embedded, + pred_confidence, + baseline_pred_confidence, + SEP_idx, + add_idx, + err_total, + ) + elif args.base_model == "lstm": + q_inter_score = IG_lstm_inter_score(q_embedded_grads_list, pred_embedded, baseline_embedded, 0) + t_inter_score = IG_lstm_inter_score(t_embedded_grads_list, pred_embedded, baseline_embedded, 1) + + q_char_attribution_dict, t_char_attribution_dict = {}, {} + if args.base_model.startswith("roberta"): + # Query + sorted_token = [] + for i in range(len(q_inter_score)): + sorted_token.append([i, q_offset[i], q_inter_score[i]]) + q_char_attribution_dict = match(result["query"], result["text_q_seg"], sorted_token) + result["query_char_attri"] = collections.OrderedDict() + for token_info in sorted(q_char_attribution_dict, key=lambda x: x[2], reverse=True): + result["query_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("text_q_seg") + + # Title + sorted_token = [] + for i in range(len(t_inter_score)): + sorted_token.append([i, t_offset[i], t_inter_score[i]]) + t_char_attribution_dict = match(result["title"], result["text_t_seg"], sorted_token) + result["title_char_attri"] = collections.OrderedDict() + for token_info in sorted(t_char_attribution_dict, key=lambda x: x[2], reverse=True): + result["title_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] + result.pop("text_t_seg") + else: + idx = 0 + for token, score in zip(q_tokens, q_inter_score.tolist()): + q_char_attribution_dict[idx] = (token, score) + idx += 1 + for token, score in zip(t_tokens, t_inter_score.tolist()): + t_char_attribution_dict[idx] = (token, score) + idx += 1 + + result["query_char_attri"], result["title_char_attri"] = collections.OrderedDict(), collections.OrderedDict() + for token, attri in sorted(q_char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): + result["query_char_attri"][token] = attri + for token, attri in sorted(t_char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): + result["title_char_attri"][token] = attri + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + + +def extract_LIME_scores( + args, q_tokens, t_tokens, result, tokenizer, pred_label, fwd_args, fwd_kwargs, model, probs, out_handle +): + explainer = LimeTextExplainer(class_names=["neg", "pos"], verbose=False, language=args.language) + if_lstm = args.base_model == "lstm" + + explain_res_q = explainer.explain_instance( + text_instance_q=result["query"], + text_instance_t=result["title"], + analysis_query=True, + tokenizer=tokenizer, + pred_label=pred_label, + classifier_fn=model.forward_interpret, + num_samples=5000, + if_lstm=if_lstm, + ) + exp_q, indexed_string_q, relative_err, err = explain_res_q + local_exps_q = exp_q.local_exp + + explain_res_t = explainer.explain_instance( + text_instance_q=result["query"], + text_instance_t=result["title"], + analysis_query=False, + tokenizer=tokenizer, + pred_label=pred_label, + classifier_fn=model.forward_interpret, + num_samples=5000, + if_lstm=if_lstm, + ) + exp_t, indexed_string_t, _, _ = explain_res_t + local_exps_t = exp_t.local_exp + + # query + char_attribution_dict = [] + for kind, local_exp in local_exps_q.items(): + for idx in range(len(result["text_q_seg"])): + t = result["text_q_seg"][idx] # .replace('Ġ', '') + got_score = False + for word_id, attribution in local_exp: + if indexed_string_q.inverse_vocab[word_id] == t: + 
char_attribution_dict.append((idx, t, attribution)) + got_score = True + break + if not got_score: + char_attribution_dict.append((idx, t, 0)) + char_attribution_dict = sorted(char_attribution_dict, key=lambda x: x[2], reverse=True) + result["query_char_attri"] = collections.OrderedDict() + for s in char_attribution_dict: + result["query_char_attri"][s[0]] = (s[1], s[2]) + + # title + char_attribution_dict = [] + for kind, local_exp in local_exps_t.items(): + for idx in range(len(result["text_t_seg"])): + t = result["text_t_seg"][idx] # .replace('Ġ', '') + got_score = False + for word_id, attribution in local_exp: + if indexed_string_t.inverse_vocab[word_id] == t: + char_attribution_dict.append((idx, t, attribution)) + got_score = True + break + if not got_score: + char_attribution_dict.append((idx, t, 0)) + char_attribution_dict = sorted(char_attribution_dict, key=lambda x: x[2], reverse=True) + result["title_char_attri"] = collections.OrderedDict() + for s in char_attribution_dict: + result["title_char_attri"][s[0]] = (s[1], s[2]) + + out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") + return exp_q, exp_t, relative_err, err + + +def LIME_error_evaluation( + exp_q, pred_label, probs, lime_score_total, lime_relative_err_total, lime_err_total, relative_err, err +): + # err evaluation + score = exp_q.score[pred_label] + ridge_pred = exp_q.local_pred[pred_label] + model_pred = probs.numpy().tolist()[0][pred_label] + + lime_score_total.append(score) + lime_relative_err_total.append(relative_err) + lime_err_total.append(err) + print("score: %.2f" % score) + print("relative_err: %.2f" % relative_err) + print("err: %.2f" % err) + print("ridge_pred: %.2f\tpred: %.2f\tdelta: %.2f" % (ridge_pred, model_pred, ridge_pred - model_pred)) + return lime_score_total, lime_relative_err_total, lime_err_total + + +g_splitter = re.compile(r"([\u4e00-\u9fa5])") + +if __name__ == "__main__": + args = get_args() + if args.base_model.startswith("roberta"): + model, tokenizer, dataloader, dev_ds = init_roberta_var(args) + elif args.base_model == "lstm": + model, tokenizer, dataloader, batchify_fn, vocab, dev_ds = init_lstm_var(args) + else: + raise ValueError("unsupported base model name.") + + assert args.eval, "INTERPRETER must be run in eval mode" + with paddle.amp.auto_cast(enable=args.use_amp), open( + os.path.join(args.output_dir, "interpret" + f".{args.inter_mode}"), "w" + ) as out_handle: + # Load model + sd = paddle.load(args.init_checkpoint) + model.set_dict(sd) + model.train() # Set dropout to 0 when init the model to collect the gradient + print("load model from %s" % args.init_checkpoint) + + # For IG + err_total = [] + # For LIME + lime_score_total = [] + lime_relative_err_total = [] + lime_err_total = [] + # For Roberta + sub_word_id_dict_query = [] + sub_word_id_dict_title = [] + # For LSTM + q_offset, t_offset = None, None + + get_sub_word_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))) + for step, d in tqdm(enumerate(dataloader)): + if step + 1 < args.start_id: + continue + + result = {} + # English and Chinese models have different numbers of [SEQ] tokens between query and title + add_idx = get_seq_token_num(args.language) + + if args.base_model.startswith("roberta"): + q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs, q_offset, t_offset = get_qt_tokens( + base_model="roberta", d=d, add_idx=add_idx, tokenizer=tokenizer + ) + elif args.base_model == "lstm": + q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs = get_qt_tokens( + 
base_model="lstm", d=d, batchify_fn=batchify_fn, vocab=vocab + ) + + result["id"] = dev_ds.data[step]["id"] + result["text_q_seg"] = dev_ds.data[step]["text_q_seg"] + result["text_t_seg"] = dev_ds.data[step]["text_t_seg"] + + probs, atts, embedded = model.forward_interpret(*fwd_args, **fwd_kwargs) + pred_label = paddle.argmax(probs, axis=-1).tolist()[0] + + result["pred_label"] = pred_label + result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] + + if args.language == "ch": + result["query"] = dev_ds.data[step]["query"] + result["title"] = dev_ds.data[step]["title"] + else: + result["query"] = dev_ds.data[step]["sentence1"] + result["title"] = dev_ds.data[step]["sentence2"] + + # Attention + if args.inter_mode == "attention": + extract_attention_scores( + args, result, atts, q_tokens, t_tokens, out_handle, SEP_idx, q_offset, t_offset, add_idx + ) + + elif args.inter_mode == "integrated_gradient": + extract_integrated_gradient_scores( + args, + result, + fwd_args, + fwd_kwargs, + model, + q_tokens, + t_tokens, + out_handle, + SEP_idx, + add_idx, + q_offset, + t_offset, + err_total, + ) + + elif args.inter_mode == "lime": + exp_q, exp_t, relative_err, err = extract_LIME_scores( + args, + q_tokens, + t_tokens, + result, + tokenizer, + pred_label, + fwd_args, + fwd_kwargs, + model, + probs, + out_handle, + ) + lime_score_total, lime_relative_err_total, lime_err_total = LIME_error_evaluation( + exp_q, + pred_label, + probs, + lime_score_total, + lime_relative_err_total, + lime_err_total, + relative_err, + err, + ) + + else: + raise KeyError(f"Unkonwn interpretable mode: {args.inter_mode}") + + if args.inter_mode == "lime": + print(np.average(np.array(lime_relative_err_total))) diff --git a/examples/model_interpretation/task/similarity/saliency_map/utils.py b/examples/model_interpretation/task/similarity/saliency_map/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9e6dd7e1a61b2c79cb4a968866a11bb3a9c90c51 --- /dev/null +++ b/examples/model_interpretation/task/similarity/saliency_map/utils.py @@ -0,0 +1,38 @@ +# !/usr/bin/env python3 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from __future__ import absolute_import, division, print_function, unicode_literals + +import paddle + + +class UnpackDataLoader(paddle.io.DataLoader): + def __init__(self, *args, **kwargs): + super(UnpackDataLoader, self).__init__(*args, batch_size=1, **kwargs) + + def __iter__(self): + return ([yy[0] for yy in y] for y in super(UnpackDataLoader, self).__iter__()) + + +def create_if_not_exists(dir): + try: + dir.mkdir(parents=True) + except FileExistsError: + pass + return dir + + +def get_warmup_and_linear_decay(max_steps, warmup_steps): + return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps)) diff --git a/examples/model_interpretation/task/similarity/simnet/gen_vocab.py b/examples/model_interpretation/task/similarity/simnet/gen_vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..4359902825317f3ce3c7886536ba60cb2f74e56a --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/gen_vocab.py @@ -0,0 +1,60 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# !/usr/bin/env python +# coding=utf-8 + +import sys +from collections import defaultdict + +import spacy + +from paddlenlp.datasets import load_dataset + +if sys.argv[1] == "ch": + train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"]) + + vocab = defaultdict(int) + for example in train_ds.data: + query = example["query"] + title = example["title"] + for c in query: + vocab[c] += 1 + for c in title: + vocab[c] += 1 + with open("vocab.char", "w") as f: + for k, v in vocab.items(): + if v > 3: + f.write(k + "\n") + +else: + tokenizer = spacy.load("en_core_web_sm") + vocab = defaultdict(int) + + with open("../data/QQP/train/train.tsv", "r") as f_dataset: + for idx, line in enumerate(f_dataset.readlines()): + if idx == 0: + continue + line_split = line.strip().split("\t") + query = [token.text for token in tokenizer(line_split[0])] + title = [token.text for token in tokenizer(line_split[1])] + + for word in query: + vocab[word] += 1 + for word in title: + vocab[word] += 1 + + with open("vocab_QQP", "w") as f: + for k, v in vocab.items(): + if v > 3: + f.write(k + "\n") diff --git a/examples/model_interpretation/task/similarity/simnet/interpreter_attention.py b/examples/model_interpretation/task/similarity/simnet/interpreter_attention.py new file mode 100644 index 0000000000000000000000000000000000000000..e2ed642e836b5bc8651a12c1b3ed81f9452f4884 --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/interpreter_attention.py @@ -0,0 +1,121 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import sys + +import paddle + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +sys.path.append("../../..") +from model import SimNet # noqa: E402 +from utils import CharTokenizer, preprocess_data # noqa: E402 + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./vocab.char", help="The path to vocabulary.") +parser.add_argument( + "--network", type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?" +) +parser.add_argument( + "--params_path", type=str, default="./checkpoints/final.pdparams", help="The path of model parameter to be loaded." +) +parser.add_argument("--language", type=str, required=True, help="Language that this model based on") +args = parser.parse_args() + + +def interpret(model, data, label_map, batch_size=1, pad_token_id=0, vocab=None): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_token_id), # query_ids + Pad(axis=0, pad_val=pad_token_id), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + ): [data for data in fn(samples)] + + model.eval() + results = [] + for batch in batches: + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + + logits, attention, _ = model.forward_interpret(query_ids, title_ids, query_seq_lens, title_seq_lens) + query_att = attention[0] + title_att = attention[1] + + model.clear_gradients() + for query_id, title_id in zip(query_ids.numpy().tolist(), title_ids.numpy().tolist()): + query = [vocab._idx_to_token[idx] for idx in query_id] + title = [vocab._idx_to_token[idx] for idx in title_id] + results.append([query_att, query, title_att, title]) + + print("query_att: %s" % query_att.shape) + print("title_att: %s" % title_att.shape) + + return results + + +if __name__ == "__main__": + paddle.set_device(args.device + ":2") + # Loads vocab. 
+ vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = CharTokenizer(vocab, args.language) + label_map = {0: "dissimilar", 1: "similar"} + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map)) + + # Loads model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. + dev_ds, test_ds = load_dataset("lcqmc", splits=["dev", "test"]) + + dev_examples = preprocess_data(dev_ds.data, tokenizer, args.language) + test_examples = preprocess_data(test_ds.data, tokenizer, args.language) + results = interpret( + model, + dev_examples, + label_map=label_map, + batch_size=args.batch_size, + pad_token_id=vocab.token_to_idx.get("[PAD]", 0), + vocab=vocab, + ) diff --git a/examples/model_interpretation/task/similarity/simnet/interpreter_grad.py b/examples/model_interpretation/task/similarity/simnet/interpreter_grad.py new file mode 100644 index 0000000000000000000000000000000000000000..8da2733bee65e439fc0d99e078f9b95a911a1dc3 --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/interpreter_grad.py @@ -0,0 +1,131 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import sys + +import paddle + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +sys.path.append("../../..") +from model import SimNet # noqa: E402 +from utils import CharTokenizer, preprocess_data # noqa: E402 + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./vocab.char", help="The path to vocabulary.") +parser.add_argument( + "--network", type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?" +) +parser.add_argument( + "--params_path", type=str, default="./checkpoints/final.pdparams", help="The path of model parameter to be loaded." +) +parser.add_argument("--language", type=str, required=True, help="Language that this model based on") +args = parser.parse_args() + + +def interpret(model, data, label_map, batch_size=1, pad_token_id=0, vocab=None): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. 
+ pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_token_id), # query_ids + Pad(axis=0, pad_val=pad_token_id), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + Stack(dtype="int64"), + ): [data for data in fn(samples)] + + model.train() + results = [] + for batch in batches: + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + probs, addiational_info = model.forward_interpreter(query_ids, title_ids, query_seq_lens, title_seq_lens) + query_emb = addiational_info["embedded"][0] + title_emb = addiational_info["embedded"][1] + + predicted_class_probs = paddle.max(probs, axis=-1) + predicted_class_probs = predicted_class_probs.sum() + paddle.autograd.backward([predicted_class_probs]) + q_gradients = ((query_emb * query_emb.grad).sum(-1).detach()).abs() # gradients: (1, seq_len) + q_grad_output = q_gradients / q_gradients.sum(-1, keepdim=True) + t_gradients = ((title_emb * title_emb.grad).sum(-1).detach()).abs() # gradients: (1, seq_len) + t_grad_output = t_gradients / t_gradients.sum(-1, keepdim=True) + + model.clear_gradients() + for query_id, title_id in zip(query_ids.numpy().tolist(), title_ids.numpy().tolist()): + query = [vocab._idx_to_token[idx] for idx in query_id] + title = [vocab._idx_to_token[idx] for idx in title_id] + results.append([q_grad_output, query, t_grad_output, title]) + print([q_grad_output, query, t_grad_output, title]) + + return results + + +if __name__ == "__main__": + paddle.set_device(args.device + ":1") + # Loads vocab. + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = CharTokenizer(vocab, args.language) + label_map = {0: "dissimilar", 1: "similar"} + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map)) + + # Loads model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. 
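+    # Note on interpret() (called below): for each example it backpropagates the
+    # predicted-class probability and scores every token by |sum_d(emb_d * grad_d)|,
+    # normalized over the sequence, i.e. an input-x-gradient saliency for both the
+    # query and the title.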
+ + dev_ds, test_ds = load_dataset("lcqmc", splits=["dev", "test"]) + + dev_examples = preprocess_data(dev_ds.data, tokenizer, args.language) + test_examples = preprocess_data(test_ds.data, tokenizer, args.language) + results = interpret( + model, + dev_examples, + label_map=label_map, + batch_size=args.batch_size, + pad_token_id=vocab.token_to_idx.get("[PAD]", 0), + vocab=vocab, + ) + + # for idx, text in enumerate(data): + # print('Data: {} \t Label: {}'.format(text, results[idx])) diff --git a/examples/model_interpretation/task/similarity/simnet/lstm_train.sh b/examples/model_interpretation/task/similarity/simnet/lstm_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..5c1b671f09308f434bae7b085a3ee68d74e12f45 --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/lstm_train.sh @@ -0,0 +1,21 @@ +### + # This script is used to train lstm models +### + +unset CUDA_VISIBLE_DEVICES +LANGUAGE=en + +if [[ $LANGUAGE == "ch" ]]; then + VOCAB_PATH=vocab.char +elif [[ $LANGUAGE == "en" ]]; then + VOCAB_PATH=vocab_QQP +fi + +python -m paddle.distributed.launch --gpus "5" train.py \ + --device=gpu \ + --lr=4e-4 \ + --batch_size=64 \ + --epochs=12 \ + --vocab_path=$VOCAB_PATH \ + --language=$LANGUAGE \ + --save_dir="./checkpoints_"${LANGUAGE} diff --git a/examples/model_interpretation/task/similarity/simnet/model.py b/examples/model_interpretation/task/similarity/simnet/model.py new file mode 100644 index 0000000000000000000000000000000000000000..e3c86ad21c4ecaae914e0d0d1b63775393b59efc --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/model.py @@ -0,0 +1,270 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +import paddlenlp as nlp + + +class SimNet(nn.Layer): + def __init__(self, network, vocab_size, num_classes, emb_dim=128, pad_token_id=0): + super().__init__() + + network = network.lower() + if network == "bow": + self.model = BoWModel(vocab_size, num_classes, emb_dim, padding_idx=pad_token_id) + elif network == "cnn": + self.model = CNNModel(vocab_size, num_classes, emb_dim, padding_idx=pad_token_id) + elif network == "gru": + self.model = GRUModel(vocab_size, num_classes, emb_dim, direction="forward", padding_idx=pad_token_id) + elif network == "lstm": + self.model = LSTMModel(vocab_size, num_classes, emb_dim, direction="forward", padding_idx=pad_token_id) + else: + raise ValueError("Unknown network: %s, it must be one of bow, cnn, lstm or gru." 
% network) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + logits = self.model(query, title, query_seq_len, title_seq_len) + return logits + + def forward_interpret( + self, query, title, query_seq_len=None, title_seq_len=None, noise=None, i=None, n_samples=None + ): + + logits, addiational_info = self.model.forward_interpreter( + query, title, query_seq_len, title_seq_len, noise=noise, i=i, n_samples=n_samples + ) + + return logits, addiational_info["attention"], addiational_info["embedded"] + + +class BoWModel(nn.Layer): + """ + This class implements the Bag of Words Classification Network model to classify texts. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `BoWEncoder`. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + Args: + vocab_size (obj:`int`): The vocabulary size. + emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension. + padding_idx (obj:`int`, optional, defaults to 0) : The pad token index. + hidden_size (obj:`int`, optional, defaults to 128): The first full-connected layer hidden size. + fc_hidden_size (obj:`int`, optional, defaults to 96): The second full-connected layer hidden size. + num_classes (obj:`int`): All the labels that the data has. + """ + + def __init__(self, vocab_size, num_classes, emb_dim=128, padding_idx=0, fc_hidden_size=128): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.bow_encoder = nlp.seq2vec.BoWEncoder(emb_dim) + self.fc = nn.Linear(self.bow_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, embedding_dim) + summed_query = self.bow_encoder(embedded_query) + summed_title = self.bow_encoder(embedded_title) + encoded_query = paddle.tanh(summed_query) + encoded_title = paddle.tanh(summed_title) + # Shape: (batch_size, embedding_dim*2) + contacted = paddle.concat([encoded_query, encoded_title], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + return logits + + +class LSTMModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + lstm_hidden_size=128, + direction="forward", + lstm_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=128, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.lstm_encoder = nlp.seq2vec.LSTMEncoder( + emb_dim, lstm_hidden_size, num_layers=lstm_layers, direction=direction, dropout=dropout_rate + ) + self.fc = nn.Linear(self.lstm_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + self.pad_token_id = padding_idx + + def forward(self, query, title, query_seq_len, title_seq_len): + assert query_seq_len is not None and title_seq_len is not None + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = 
self.embedder(title) + # Shape: (batch_size, lstm_hidden_size) + query_repr = self.lstm_encoder(embedded_query, sequence_length=query_seq_len) + title_repr = self.lstm_encoder(embedded_title, sequence_length=title_seq_len) + # Shape: (batch_size, 2*lstm_hidden_size) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + + return logits + + def forward_interpreter(self, query, title, query_seq_len, title_seq_len, noise=None, i=None, n_samples=None): + assert query_seq_len is not None and title_seq_len is not None + # Shape: (batch_size, num_tokens, embedding_dim) + + query_baseline = paddle.to_tensor([self.pad_token_id] * query.shape[1]).unsqueeze(0) + title_baseline = paddle.to_tensor([self.pad_token_id] * title.shape[1]).unsqueeze(0) + + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + embedded_query_baseline = self.embedder(query_baseline) + embedded_title_baseline = self.embedder(title_baseline) + + if noise is not None and noise.upper() == "INTEGRATED": + embedded_query = embedded_query_baseline + i / (n_samples - 1) * (embedded_query - embedded_query_baseline) + embedded_title = embedded_title_baseline + i / (n_samples - 1) * (embedded_title - embedded_title_baseline) + + # Shape: (batch_size, lstm_hidden_size) + query_repr = self.lstm_encoder(embedded_query, sequence_length=query_seq_len) + title_repr = self.lstm_encoder(embedded_title, sequence_length=title_seq_len) + # Shape: (batch_size, 2*lstm_hidden_size) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + probs = F.softmax(logits, axis=-1) + + q_att = paddle.matmul(fc_out, embedded_query, transpose_y=True).squeeze(axis=[1]) # (bsz, query_len) + q_att = F.softmax(q_att, axis=-1) + t_att = paddle.matmul(fc_out, embedded_title, transpose_y=True).squeeze(axis=[1]) # (bsz, title_len) + t_att = F.softmax(t_att, axis=-1) + + addiational_info = { + "embedded": [embedded_query, embedded_title], + "attention": [q_att, t_att], + } + # return logits, addiational_info + return probs, addiational_info + + +class GRUModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + gru_hidden_size=128, + direction="forward", + gru_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.gru_encoder = nlp.seq2vec.GRUEncoder( + emb_dim, gru_hidden_size, num_layers=gru_layers, direction=direction, dropout=dropout_rate + ) + self.fc = nn.Linear(self.gru_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len, title_seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, gru_hidden_size) + query_repr = self.gru_encoder(embedded_query, sequence_length=query_seq_len) + title_repr = self.gru_encoder(embedded_title, sequence_length=title_seq_len) + # Shape: (batch_size, 2*gru_hidden_size) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # 
Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + + return logits + + +class CNNModel(nn.Layer): + """ + This class implements the + + + Convolution Neural Network model. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `CNNEncoder`. + The CNN has one convolution layer for each ngram filter size. Each convolution operation gives + out a vector of size num_filter. The number of times a convolution layer will be used + is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these + outputs from the convolution layer and outputs the max. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + Args: + vocab_size (obj:`int`): The vocabulary size. + emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension. + padding_idx (obj:`int`, optional, defaults to 0) : The pad token index. + num_classes (obj:`int`): All the labels that the data has. + """ + + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + num_filter=256, + ngram_filter_sizes=(3,), + fc_hidden_size=128, + ): + super().__init__() + self.padding_idx = padding_idx + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.encoder = nlp.seq2vec.CNNEncoder( + emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes + ) + self.fc = nn.Linear(self.encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, num_filter) + query_repr = self.encoder(embedded_query) + title_repr = self.encoder(embedded_title) + # Shape: (batch_size, 2*num_filter) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + return logits diff --git a/examples/model_interpretation/task/similarity/simnet/predict.py b/examples/model_interpretation/task/similarity/simnet/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..dec464bf413007dff271ca2865ac4ffe39ecf69c --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/predict.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
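+
+# Example invocation (paths are illustrative; point them at your own vocab file
+# and trained checkpoint):
+#   python predict.py --device gpu --network lstm \
+#       --vocab_path ./simnet_vocab.txt --params_path ./checkpoints/final.pdparams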
+ +import argparse + +import paddle +import paddle.nn.functional as F +from model import SimNet +from utils import preprocess_prediction_data + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./simnet_vocab.txt", help="The path to vocabulary.") +parser.add_argument('--network', type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data, label_map, batch_size=1, pad_token_id=0): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_token_id), # query_ids + Pad(axis=0, pad_val=pad_token_id), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + ): [data for data in fn(samples)] + + results = [] + model.eval() + for batch in batches: + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + logits = model(query_ids, title_ids, query_seq_lens, title_seq_lens) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + # Loads vocab. + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = JiebaTokenizer(vocab) + label_map = {0: "dissimilar", 1: "similar"} + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map)) + + # Loads model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. 
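+    # `data` below is a small set of hard-coded [query, title] pairs; each pair is
+    # tokenized by preprocess_prediction_data with the Jieba-based tokenizer, and
+    # predict() maps the resulting ids to a "similar"/"dissimilar" label.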
+ data = [ + ["世界上什么东西最小", "世界上什么东西最小?"], + ["光眼睛大就好看吗", "眼睛好看吗?"], + ["小蝌蚪找妈妈怎么样", "小蝌蚪找妈妈是谁画的"], + ] + examples = preprocess_prediction_data(data, tokenizer) + results = predict( + model, + examples, + label_map=label_map, + batch_size=args.batch_size, + pad_token_id=vocab.token_to_idx.get("[PAD]", 0), + ) + + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/model_interpretation/task/similarity/simnet/train.py b/examples/model_interpretation/task/similarity/simnet/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ec36090726ccba08f0891d4ca21d0296f284c236 --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/train.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import sys +from functools import partial + +import paddle + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +sys.path.append("../../../") +from model import SimNet # noqa: E402 +from utils import CharTokenizer, convert_example # noqa: E402 + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") +parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." +) +parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default="checkpoints/", help="Directory to save model checkpoint") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument( + "--vocab_path", + type=str, + default="./vocab.char", + help="The directory to dataset. Chinese version uses vocab.char while English version uses vocab_QQP", +) +parser.add_argument( + "--network", type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?" +) +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--language", type=str, required=True, help="Language that this model based on") +args = parser.parse_args() + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. 
+ batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True, collate_fn=batchify_fn) + return dataloader + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # Loads vocab. + if not os.path.exists(args.vocab_path): + raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + + # Loads dataset. + if args.language == "ch": + train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"]) + else: + train_ds, dev_ds, test_ds = load_dataset("glue", "qqp", splits=["train", "dev", "test"]) + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(train_ds.label_list)) + model = paddle.Model(model) + + # Reads data and generates mini-batches. + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + tokenizer = CharTokenizer(vocab, args.language, "../../../punctuations") + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False, language=args.language) + train_loader = create_dataloader( + train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn + ) + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + test_loader = create_dataloader( + test_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="test", batchify_fn=batchify_fn + ) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Defines loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Loads pre-trained parameters. + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + # Starts training and evaluating. + model.fit( + train_loader, + dev_loader, + epochs=args.epochs, + save_dir=args.save_dir, + ) diff --git a/examples/model_interpretation/task/similarity/simnet/utils.py b/examples/model_interpretation/task/similarity/simnet/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..b2161cd48ce24c8794db827dd8fac7d4ec6d5ce4 --- /dev/null +++ b/examples/model_interpretation/task/similarity/simnet/utils.py @@ -0,0 +1,211 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + + +def convert_example(example, tokenizer, is_test=False, language="en"): + """ + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + query_ids(obj:`list[int]`): The list of query ids. + title_ids(obj:`list[int]`): The list of title ids. + query_seq_len(obj:`int`): The input sequence query length. + title_seq_len(obj:`int`): The input sequence title length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + if language == "ch": + q_name = "query" + t_name = "title" + label = "label" + else: + q_name = "sentence1" + t_name = "sentence2" + label = "labels" + + query, title = example[q_name], example[t_name] + query_ids = np.array(tokenizer.encode(query), dtype="int64") + query_seq_len = np.array(len(query_ids), dtype="int64") + title_ids = np.array(tokenizer.encode(title), dtype="int64") + title_seq_len = np.array(len(title_ids), dtype="int64") + result = [query_ids, title_ids, query_seq_len, title_seq_len] + if not is_test: + label = np.array(example[label], dtype="int64") + result.append(label) + return result + + +def preprocess_prediction_data(data, tokenizer): + """ + It process the prediction data as the format used as training. + + Args: + data (obj:`List[List[str, str]]`): + The prediction data whose each element is a text pair. + Each text will be tokenized by jieba.lcut() function. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + + Returns: + examples (obj:`list`): The processed data whose each element + is a `list` object, which contains + + - query_ids(obj:`list[int]`): The list of query ids. + - title_ids(obj:`list[int]`): The list of title ids. + - query_seq_len(obj:`int`): The input sequence query length. + - title_seq_len(obj:`int`): The input sequence title length. + + """ + examples = [] + for query, title in data: + query_ids = tokenizer.encode(query) + title_ids = tokenizer.encode(title) + examples.append([query_ids, title_ids, len(query_ids), len(title_ids)]) + return examples + + +def preprocess_data(data, tokenizer, language): + """ + It process the prediction data as the format used as training. + + Args: + data (obj:`List[List[str, str]]`): + The prediction data whose each element is a text pair. + Each text will be tokenized by jieba.lcut() function. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + + Returns: + examples (obj:`list`): The processed data whose each element + is a `list` object, which contains + + - query_ids(obj:`list[int]`): The list of query ids. + - title_ids(obj:`list[int]`): The list of title ids. + - query_seq_len(obj:`int`): The input sequence query length. + - title_seq_len(obj:`int`): The input sequence title length. 
+ + """ + if language == "ch": + q_name = "query" + t_name = "title" + else: + q_name = "sentence1" + t_name = "sentence2" + examples = [] + for example in data: + query_ids = tokenizer.encode(example[q_name]) + title_ids = tokenizer.encode(example[t_name]) + examples.append([query_ids, title_ids, len(query_ids), len(title_ids)]) + return examples + + +def get_idx_from_word(word, word_to_idx, unk_word): + if word in word_to_idx: + return word_to_idx[word] + return word_to_idx[unk_word] + + +class CharTokenizer: + def __init__(self, vocab, language, vocab_path): + self.vocab = vocab + self.language = language + self.vocab_path = vocab_path + self.unk_token = [] + + def encode(self, sentence): + if self.language == "ch": + words = tokenizer_punc(sentence, self.vocab_path) + else: + words = sentence.strip().split() + return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in words] + + def tokenize(self, sentence, wo_unk=True): + if self.language == "ch": + return tokenizer_punc(sentence, self.vocab_path) + else: + return sentence.strip().split() + + def convert_tokens_to_string(self, tokens): + return " ".join(tokens) + + def convert_tokens_to_ids(self, tokens): + return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in tokens] + + +def tokenizer_lac(string, lac): + temp = "" + res = [] + for c in string: + if "\u4e00" <= c <= "\u9fff": + if temp != "": + res.extend(lac.run(temp)) + temp = "" + res.append(c) + else: + temp += c + if temp != "": + res.extend(lac.run(temp)) + return res + + +def tokenizer_punc(string, vocab_path): + res = [] + sub_string_list = string.strip().split("[MASK]") + for idx, sub_string in enumerate(sub_string_list): + temp = "" + for c in sub_string: + if "\u4e00" <= c <= "\u9fff": + if temp != "": + temp_seg = punc_split(temp, vocab_path) + res.extend(temp_seg) + temp = "" + res.append(c) + else: + temp += c + if temp != "": + temp_seg = punc_split(temp, vocab_path) + res.extend(temp_seg) + if idx < len(sub_string_list) - 1: + res.append("[MASK]") + return res + + +def punc_split(string, vocab_path): + punc_set = set() + with open(vocab_path, "r") as f: + for token in f: + punc_set.add(token.strip()) + punc_set.add(" ") + for ascii_num in range(65296, 65306): + punc_set.add(chr(ascii_num)) + for ascii_num in range(48, 58): + punc_set.add(chr(ascii_num)) + + res = [] + temp = "" + for c in string: + if c in punc_set: + if temp != "": + res.append(temp) + temp = "" + res.append(c) + else: + temp += c + if temp != "": + res.append(temp) + return res diff --git a/examples/model_interpretation/task/transformer.py b/examples/model_interpretation/task/transformer.py new file mode 100644 index 0000000000000000000000000000000000000000..2504503739b054ca3eb24c137a5828dc857ac6e9 --- /dev/null +++ b/examples/model_interpretation/task/transformer.py @@ -0,0 +1,1329 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
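+
+# Note on attention masks used throughout this file: layers accept bool/int masks
+# (True/1 = attend, False/0 = block) as well as float additive masks (0 = attend,
+# -INF = block). _convert_attention_mask normalizes the former into the latter,
+# e.g. a bool mask [True, True, False] becomes roughly [0, 0, -1e9] before being
+# added to the attention logits.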
+ +# TODO: define the classes of Transformer neural network + +import collections +import copy + +import numpy as np +import paddle +from paddle import ParamAttr, tensor +from paddle.common_ops_import import convert_dtype +from paddle.nn import Layer, LayerList +from paddle.nn import functional as F +from paddle.nn.layer.common import Dropout, Linear +from paddle.nn.layer.norm import LayerNorm + +__all__ = [] + + +def _convert_param_attr_to_list(param_attr, n): + """ + If `param_attr` is a list or tuple, convert every element in it to a + ParamAttr instance. Otherwise, repeat `param_attr` `n` times to + construct a list, and rename every one by appending a increasing index + suffix to avoid having same names when `param_attr` contains a name. + + Parameters: + param_attr (list|tuple|ParamAttr): A list, tuple or something can be + converted to a ParamAttr instance by `ParamAttr._to_attr`. + n (int): The times to repeat to construct a list when `param_attr` + is not a list or tuple. + + Returns: + list: A list composed of each including cell's `param_attr`. + """ + if isinstance(param_attr, (list, tuple)): + assert len(param_attr) == n, "length of param_attr should be %d when it is a list/tuple" % n + param_attrs = [] + for attr in param_attr: + if isinstance(attr, bool): + if attr: + param_attrs.append(ParamAttr._to_attr(None)) + else: + param_attrs.append(False) + else: + param_attrs.append(ParamAttr._to_attr(attr)) + # param_attrs = [ParamAttr._to_attr(attr) for attr in param_attr] + elif isinstance(param_attr, bool): + param_attrs = [] + if param_attr: + param_attrs = [ParamAttr._to_attr(None) for i in range(n)] + else: + param_attrs = [False] * n + else: + param_attrs = [] + attr = ParamAttr._to_attr(param_attr) + for i in range(n): + attr_i = copy.deepcopy(attr) + if attr.name: + attr_i.name = attr_i.name + "_" + str(i) + param_attrs.append(attr_i) + return param_attrs + + +def _convert_attention_mask(attn_mask, dtype): + """ + Convert the attention mask to the target dtype we expect. + + Parameters: + attn_mask (Tensor, optional): A tensor used in multi-head attention + to prevents attention to some unwanted positions, usually the + paddings or the subsequent positions. It is a tensor with shape + broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + dtype (VarType): The target type of `attn_mask` we expect. + + Returns: + Tensor: A Tensor with shape same as input `attn_mask`, with data type `dtype`. + """ + if attn_mask is not None and attn_mask.dtype != dtype: + attn_mask_dtype = convert_dtype(attn_mask.dtype) + if attn_mask_dtype == "bool" or "int" in attn_mask_dtype: + attn_mask = (paddle.cast(attn_mask, dtype) - 1.0) * 1e9 + else: + attn_mask = paddle.cast(attn_mask, dtype) + return attn_mask + + +class MultiHeadAttention(Layer): + """ + Attention mapps queries and a set of key-value pairs to outputs, and + Multi-Head Attention performs multiple parallel attention to jointly attending + to information from different representation subspaces. + + Please refer to `Attention Is All You Need <https://arxiv.org/pdf/1706.03762.pdf>`_ + for more details. 
+ + Parameters: + embed_dim (int): The expected feature size in the input and output. + num_heads (int): The number of heads in multi-head attention. + dropout (float, optional): The dropout probability used on attention + weights to drop some attention targets. 0 for no dropout. Default 0 + kdim (int, optional): The feature size in key. If None, assumed equal to + `embed_dim`. Default None. + vdim (int, optional): The feature size in value. If None, assumed equal to + `embed_dim`. Default None. + need_weights (bool, optional): Indicate whether to return the attention + weights. Default False. + weight_attr(ParamAttr, optional): To specify the weight parameter property. + Default: None, which means the default weight parameter property is used. + See usage for details in :code:`ParamAttr` . + bias_attr (ParamAttr|bool, optional): To specify the bias parameter property. + Default: None, which means the default bias parameter property is used. + If it is set to False, this layer will not have trainable bias parameter. + See usage for details in :code:`ParamAttr` . + + Examples: + + .. code-block:: python + + import paddle + + # encoder input: [batch_size, sequence_length, d_model] + query = paddle.rand((2, 4, 128)) + # self attention mask: [batch_size, num_heads, query_len, query_len] + attn_mask = paddle.rand((2, 2, 4, 4)) + multi_head_attn = paddle.nn.MultiHeadAttention(128, 2) + output = multi_head_attn(query, None, None, attn_mask=attn_mask) # [2, 4, 128] + """ + + Cache = collections.namedtuple("Cache", ["k", "v"]) + StaticCache = collections.namedtuple("StaticCache", ["k", "v"]) + + def __init__( + self, + embed_dim, + num_heads, + dropout=0.0, + kdim=None, + vdim=None, + need_weights=False, + weight_attr=None, + bias_attr=None, + ): + super(MultiHeadAttention, self).__init__() + self.embed_dim = embed_dim + self.kdim = kdim if kdim is not None else embed_dim + self.vdim = vdim if vdim is not None else embed_dim + self.num_heads = num_heads + self.dropout = dropout + self.need_weights = need_weights + + self.head_dim = embed_dim // num_heads + assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads" + + self.q_proj = Linear(embed_dim, embed_dim, weight_attr, bias_attr=bias_attr) + self.k_proj = Linear(self.kdim, embed_dim, weight_attr, bias_attr=bias_attr) + self.v_proj = Linear(self.vdim, embed_dim, weight_attr, bias_attr=bias_attr) + self.out_proj = Linear(embed_dim, embed_dim, weight_attr, bias_attr=bias_attr) + + def _prepare_qkv(self, query, key, value, cache=None): + r""" + Prapares linear projected queries, keys and values for usage of subsequnt + multiple parallel attention. If `cache` is not None, using cached results + to reduce redundant calculations. + + Parameters: + query (Tensor): The queries for multi-head attention. It is a + tensor with shape `[batch_size, query_length, embed_dim]`. The + data type should be float32 or float64. + key (Tensor): The keys for multi-head attention. It is + a tensor with shape `[batch_size, key_length, kdim]`. The + data type should be float32 or float64. If None, use `query` as + `key`. + value (Tensor): The values for multi-head attention. It + is a tensor with shape `[batch_size, value_length, vdim]`. + The data type should be float32 or float64. If None, use `query` as + `value`. 
+ cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional): + It is a namedtuple with `k` and `v` as fields, and stores tensors + shaped `[batch_size, num_heads, length, embed_dim]` which are results + of linear projection, reshape and transpose calculations in + MultiHeadAttention. If is an instance of `Cache`, `k` and `v` + fields reserve intermediate results of previous positions, which + mostly used for decoder self attention. If it is an instance of + `StaticCache`, `key` and `value` args would be ignored, `k` and + `v` fields would be used as calculated results on `key` and + `value`, which mostly used for decoder-encoder cross attention. + It is only used for inference and should be None for training. + Default None. + + Returns: + tuple: A tuple including linear projected keys and values. These two \ + tensors have shapes `[batch_size, n_head, sequence_length, d_key]` \ + and `[batch_size, n_head, sequence_length, d_value]` separately, \ + and their data types are same as inputs. + """ + q = self.q_proj(query) + q = tensor.reshape(x=q, shape=[0, 0, self.num_heads, self.head_dim]) + q = tensor.transpose(x=q, perm=[0, 2, 1, 3]) + + if isinstance(cache, self.StaticCache): + # for encoder-decoder attention in inference and has cached + k, v = cache.k, cache.v + else: + k, v = self.compute_kv(key, value) + + if isinstance(cache, self.Cache): + # for decoder self-attention in inference + k = tensor.concat([cache.k, k], axis=2) + v = tensor.concat([cache.v, v], axis=2) + cache = self.Cache(k, v) + + return (q, k, v) if cache is None else (q, k, v, cache) + + def compute_kv(self, key, value): + r""" + Applies linear projection on input keys and values, then splits heads + (reshape and transpose) to get keys and values from different representation + subspaces. The results are used as key-values pairs for subsequent multiple + parallel attention. + + It is part of calculations in multi-head attention, and is provided as + a method to pre-compute and prefetch these results, thus we can use them + to construct cache for inference. + + Parameters: + key (Tensor): The keys for multi-head attention. It is a tensor + with shape `[batch_size, sequence_length, kdim]`. The data type + should be float32 or float64. + value (Tensor): The values for multi-head attention. It is a tensor + with shape `[batch_size, sequence_length, vdim]`. The data type + should be float32 or float64. + + Returns: + tuple: A tuple including transformed keys and values. Their shapes \ + both are `[batch_size, num_heads, sequence_length, embed_dim // num_heads]`, \ + and their data types are same as inputs. + """ + k = self.k_proj(key) + v = self.v_proj(value) + k = tensor.reshape(x=k, shape=[0, 0, self.num_heads, self.head_dim]) + k = tensor.transpose(x=k, perm=[0, 2, 1, 3]) + v = tensor.reshape(x=v, shape=[0, 0, self.num_heads, self.head_dim]) + v = tensor.transpose(x=v, perm=[0, 2, 1, 3]) + return k, v + + def gen_cache(self, key, value=None, type=Cache): + """ + Generates cache for `forward` usage in inference accroding to arguments. + The generated cache is an instance of `MultiHeadAttention.Cache` or an + instance of `MultiHeadAttention.StaticCache`. + + `Cache` or `StaticCache` is namedtuple with `k` and `v` as fields, + and it stores tensors shaped `[batch_size, num_heads, length, embed_dim]` + which are results of linear projection, reshape and transpose calculations + in MultiHeadAttention. 
+ + If the generated cache is an instance of `Cache`, `k` and `v` fields + reserve intermediate result tensors of previous positions, and the tensors + are incremental among decoding steps, which mostly are used for decoder + decoder self attention. + + If the generated cache is an instance of `StaticCache`, `k` and `v` fields + would be used as calculated result tensors on keys an values in `forward`, + and the tensors keep unchanged among decoding steps, which are mostly used + for decoder-encoder cross attention. + + The cache is generated as follows: + + 1. If `type` is `StaticCache`, apply `compute_kv(key, value)` and use the + results to create an instance of `StaticCache`. + + 2. If `type` is `Cache` and `value` is None, generate empty tensors shaped + `[batch_size, num_heads, 0, embed_dim // num_heads]` and use the results + to create an instance of `Cache`, where `batch_size` is from the first + dimension of `key`. + + 3. If `type` is `Cache` and `value` is not None, use `key`, `value` to create + an instance of `Cache`. + + Parameters: + key (Tensor): The keys for multi-head attention. It is + a tensor with shape `[batch_size, key_length, kdim]`. The + data type should be float32 or float64. If `value` is None, + it is only for batch size and data type reference. + value (Tensor, optional): The values for multi-head attention. It + is a tensor with shape `[batch_size, value_length, vdim]`. + The data type should be float32 or float64. If None, `key` is only + for batch size reference. Default None. + type (type): It should be `MultiHeadAttention.StaticCache` or + `MultiHeadAttention.Cache` to indicate the cache type to generate. + + Returns: + namedtuple: an instance of `Cache` or `StaticCache` accordingly. + """ + if type == MultiHeadAttention.StaticCache: # static_kv + k, v = self.compute_kv(key, value) + return self.StaticCache(k, v) + elif value is None: # incremental_state + k = paddle.full(shape=[key.shape[0], self.num_heads, 0, self.head_dim], dtype=key.dtype, fill_value=0) + v = paddle.full(shape=[key.shape[0], 2, self.num_heads, 0, self.head_dim], dtype=key.dtype, fill_value=0) + return self.Cache(k, v) + else: + # incremental_state with initial value, mainly for usage like UniLM + return self.Cache(key, value) + + def forward(self, query, key=None, value=None, attn_mask=None, cache=None): + r""" + Applies multi-head attention to map queries and a set of key-value pairs + to outputs. + + Parameters: + query (Tensor): The queries for multi-head attention. It is a + tensor with shape `[batch_size, query_length, embed_dim]`. The + data type should be float32 or float64. + key (Tensor, optional): The keys for multi-head attention. It is + a tensor with shape `[batch_size, key_length, kdim]`. The + data type should be float32 or float64. If None, use `query` as + `key`. Default None. + value (Tensor, optional): The values for multi-head attention. It + is a tensor with shape `[batch_size, value_length, vdim]`. + The data type should be float32 or float64. If None, use `query` as + `value`. Default None. + attn_mask (Tensor, optional): A tensor used in multi-head attention + to prevents attention to some unwanted positions, usually the + paddings or the subsequent positions. It is a tensor with shape + broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. 
When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional): + It is a namedtuple with `k` and `v` as fields, and stores tensors + shaped `[batch_size, num_heads, length, embed_dim]` which are results + of linear projection, reshape and transpose calculations in + MultiHeadAttention. If it is an instance of `Cache`, `k` and `v` + fields reserve intermediate results of previous positions, which + mostly used for decoder self attention. If it is an instance of + `StaticCache`, `key` and `value` args would be ignored, `k` and + `v` fields would be used as calculated results on `key` and + `value`, which mostly used for decoder-encoder cross attention. + It is only used for inference and should be None for training. + Default None. + + Returns: + Tensor|tuple: It is a tensor that has the same shape and data type \ + as `query`, representing attention output. Or a tuple if \ + `need_weights` is True or `cache` is not None. If `need_weights` \ + is True, except for attention output, the tuple also includes \ + the attention weights tensor shaped `[batch_size, num_heads, query_length, key_length]`. \ + If `cache` is not None, the tuple then includes the new cache \ + having the same type as `cache`, and if it is `StaticCache`, it \ + is same as the input `cache`, if it is `Cache`, the new cache \ + reserves tensors concatanating raw tensors with intermediate \ + results of current query. + """ + key = query if key is None else key + value = query if value is None else value + # compute q ,k ,v + if cache is None: + q, k, v = self._prepare_qkv(query, key, value, cache) + else: + q, k, v, cache = self._prepare_qkv(query, key, value, cache) + + # scale dot product attention + product = paddle.matmul(x=q * (self.head_dim**-0.5), y=k, transpose_y=True) + if attn_mask is not None: + # Support bool or int mask + attn_mask = _convert_attention_mask(attn_mask, product.dtype) + product = product + attn_mask + weights = F.softmax(product) + if self.dropout: + weights = F.dropout(weights, self.dropout, training=self.training, mode="upscale_in_train") + + out = tensor.matmul(weights, v) + + # combine heads + out = tensor.transpose(out, perm=[0, 2, 1, 3]) + out = tensor.reshape(x=out, shape=[0, 0, out.shape[2] * out.shape[3]]) + + # project to output + out = self.out_proj(out) + + outs = [out] + if self.need_weights: + outs.append(weights) + if cache is not None: + outs.append(cache) + return out if len(outs) == 1 else tuple(outs) + + +class TransformerEncoderLayer(Layer): + """ + TransformerEncoderLayer is composed of two sub-layers which are self (multi-head) + attention and feedforward network. Before and after each sub-layer, pre-process + and post-precess would be applied on the input and output accordingly. If + `normalize_before` is True, pre-process is layer normalization and post-precess + includes dropout, residual connection. Otherwise, no pre-process and post-precess + includes dropout, residual connection, layer normalization. + + Parameters: + d_model (int): The expected feature size in the input and output. + nhead (int): The number of heads in multi-head attention(MHA). + dim_feedforward (int): The hidden layer size in the feedforward network(FFN). 
+ dropout (float, optional): The dropout probability used in pre-process + and post-precess of MHA and FFN sub-layer. Default 0.1 + activation (str, optional): The activation function in the feedforward + network. Default relu. + attn_dropout (float, optional): The dropout probability used + in MHA to drop some attention target. If None, use the value of + `dropout`. Default None + act_dropout (float, optional): The dropout probability used after FFN + activition. If None, use the value of `dropout`. Default None + normalize_before (bool, optional): Indicate whether to put layer normalization + into preprocessing of MHA and FFN sub-layers. If True, pre-process is layer + normalization and post-precess includes dropout, residual connection. + Otherwise, no pre-process and post-precess includes dropout, residual + connection, layer normalization. Default False + weight_attr(ParamAttr|list|tuple, optional): To specify the weight parameter property. + If it is a list/tuple, `weight_attr[0]` would be used as `weight_attr` for + MHA, and `weight_attr[1]` would be used as `weight_attr` for linear in FFN. + Otherwise, MHA and FFN both use it as `weight_attr` to create parameters. + Default: None, which means the default weight parameter property is used. + See usage for details in :code:`ParamAttr` . + bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property. + If it is a list/tuple, `bias_attr[0]` would be used as `bias_attr` for + MHA, and `bias_attr[1]` would be used as `bias_attr` for linear in FFN. + Otherwise, MHA and FFN both use it as `bias_attr` to create parameters. + The `False` value means the corresponding layer would not have trainable + bias parameter. See usage for details in :code:`ParamAttr` . Default: None, + which means the default bias parameter property is used. + + + Examples: + + .. 
code-block:: python + + import paddle + from paddle.nn import TransformerEncoderLayer + + # encoder input: [batch_size, src_len, d_model] + enc_input = paddle.rand((2, 4, 128)) + # self attention mask: [batch_size, n_head, src_len, src_len] + attn_mask = paddle.rand((2, 2, 4, 4)) + encoder_layer = TransformerEncoderLayer(128, 2, 512) + enc_output = encoder_layer(enc_input, attn_mask) # [2, 4, 128] + """ + + def __init__( + self, + d_model, + nhead, + dim_feedforward, + dropout=0.1, + activation="relu", + attn_dropout=None, + act_dropout=None, + normalize_before=False, + weight_attr=None, + bias_attr=None, + ): + self._config = locals() + self._config.pop("self") + self._config.pop("__class__", None) # py3 + + super(TransformerEncoderLayer, self).__init__() + attn_dropout = dropout if attn_dropout is None else attn_dropout + act_dropout = dropout if act_dropout is None else act_dropout + self.normalize_before = normalize_before + + weight_attrs = _convert_param_attr_to_list(weight_attr, 2) + bias_attrs = _convert_param_attr_to_list(bias_attr, 2) + + self.self_attn = MultiHeadAttention( + d_model, + nhead, + dropout=attn_dropout, + need_weights=True, # interpret + weight_attr=weight_attrs[0], + bias_attr=bias_attrs[0], + ) + self.linear1 = Linear(d_model, dim_feedforward, weight_attrs[1], bias_attr=bias_attrs[1]) + self.dropout = Dropout(act_dropout, mode="upscale_in_train") + self.linear2 = Linear(dim_feedforward, d_model, weight_attrs[1], bias_attr=bias_attrs[1]) + self.norm1 = LayerNorm(d_model) + self.norm2 = LayerNorm(d_model) + self.dropout1 = Dropout(dropout, mode="upscale_in_train") + self.dropout2 = Dropout(dropout, mode="upscale_in_train") + self.activation = getattr(F, activation) + + def forward(self, src, src_mask=None, cache=None): + r""" + Applies a Transformer encoder layer on the input. + + Parameters: + src (Tensor): The input of Transformer encoder layer. It is + a tensor with shape `[batch_size, sequence_length, d_model]`. + The data type should be float32 or float64. + src_mask (Tensor, optional): A tensor used in multi-head attention + to prevents attention to some unwanted positions, usually the + paddings or the subsequent positions. It is a tensor with shape + broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + cache (Tensor, optional): It is an instance of `MultiHeadAttention.Cache`. + See `TransformerEncoderLayer.gen_cache` for more details. It is + only used for inference and should be None for training. Default + None. + + Returns: + Tensor|tuple: It is a tensor that has the same shape and data type \ + as `enc_input`, representing the output of Transformer encoder \ + layer. Or a tuple if `cache` is not None, except for encoder \ + layer output, the tuple includes the new cache which is same \ + as input `cache` argument but `incremental_cache` has an \ + incremental length. See `MultiHeadAttention.gen_cache` and \ + `MultiHeadAttention.forward` for more details. 
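+
+        Note that this layer is modified for model interpretation (see the
+        `# interpret` comments in the implementation): its `MultiHeadAttention`
+        is created with `need_weights=True`, so the attention weights are always
+        returned together with the layer output.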
+ """ + src_mask = _convert_attention_mask(src_mask, src.dtype) + + residual = src + if self.normalize_before: + src = self.norm1(src) + # Add cache for encoder for the usage like UniLM + if cache is None: + # src = self.self_attn(src, src, src, src_mask) + src, att_weights = self.self_attn(src, src, src, src_mask) # interpret + else: + # src, incremental_cache = self.self_attn(src, src, src, src_mask, cache) + src, att_weights, incremental_cache = self.self_attn(src, src, src, src_mask, cache) # interpret + + src = residual + self.dropout1(src) + if not self.normalize_before: + src = self.norm1(src) + + residual = src + if self.normalize_before: + src = self.norm2(src) + src = self.linear2(self.dropout(self.activation(self.linear1(src)))) + src = residual + self.dropout2(src) + if not self.normalize_before: + src = self.norm2(src) + # return src if cache is None else (src, incremental_cache) + return (src, att_weights) if cache is None else (src, att_weights, incremental_cache) # interpret + + def gen_cache(self, src): + r""" + Generates cache for `forward` usage. The generated cache is an + instance of `MultiHeadAttention.Cache`. + + Parameters: + src (Tensor): The input of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data + type should be float32 or float64. + + Returns: + incremental_cache: It is an instance of `MultiHeadAttention.Cache` \ + produced by `self_attn.gen_cache`, it reserves two tensors + shaped `[batch_size, nhead, 0, d_model // nhead]`. See \ + `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ + for more details. + """ + incremental_cache = self.self_attn.gen_cache(src, type=self.self_attn.Cache) + return incremental_cache + + +class TransformerEncoder(Layer): + """ + TransformerEncoder is a stack of N encoder layers. + + Parameters: + encoder_layer (Layer): an instance of the `TransformerEncoderLayer`. It + would be used as the first layer, and the other layers would be created + according to the configurations of it. + num_layers (int): The number of encoder layers to be stacked. + norm (LayerNorm, optional): the layer normalization component. If provided, + apply layer normalization on the output of last encoder layer. + + Examples: + + .. code-block:: python + + import paddle + from paddle.nn import TransformerEncoderLayer, TransformerEncoder + + # encoder input: [batch_size, src_len, d_model] + enc_input = paddle.rand((2, 4, 128)) + # self attention mask: [batch_size, n_head, src_len, src_len] + attn_mask = paddle.rand((2, 2, 4, 4)) + encoder_layer = TransformerEncoderLayer(128, 2, 512) + encoder = TransformerEncoder(encoder_layer, 2) + enc_output = encoder(enc_input, attn_mask) # [2, 4, 128] + """ + + def __init__(self, encoder_layer, num_layers, norm=None): + super(TransformerEncoder, self).__init__() + self.layers = LayerList( + [(encoder_layer if i == 0 else type(encoder_layer)(**encoder_layer._config)) for i in range(num_layers)] + ) + self.num_layers = num_layers + self.norm = norm + + def forward(self, src, src_mask=None, cache=None): + r""" + Applies a stack of N Transformer encoder layers on inputs. If `norm` is + provided, also applies layer normalization on the output of last encoder + layer. + + Parameters: + src (Tensor): The input of Transformer encoder. It is a tensor + with shape `[batch_size, sequence_length, d_model]`. The data + type should be float32 or float64. 
+ src_mask (Tensor, optional): A tensor used in multi-head attention + to prevents attention to some unwanted positions, usually the + paddings or the subsequent positions. It is a tensor with shape + broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + cache (list, optional): It is a list, and each element in the list + is `incremental_cache` produced by `TransformerEncoderLayer.gen_cache`. + See `TransformerEncoder.gen_cache` for more details. It is only + used for inference and should be None for training. Default None. + + Returns: + Tensor|tuple: It is a tensor that has the same shape and data type \ + as `src`, representing the output of Transformer encoder. \ + Or a tuple if `cache` is not None, except for encoder output, \ + the tuple includes the new cache which is same as input `cache` \ + argument but `incremental_cache` in it has an incremental length. \ + See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ + for more details. + """ + src_mask = _convert_attention_mask(src_mask, src.dtype) + + output = src + att_weights_list = [] # interpret + new_caches = [] + for i, mod in enumerate(self.layers): + if cache is None: + # output = mod(output, src_mask=src_mask) + output, att_weights = mod(output, src_mask=src_mask) # interpret + att_weights_list.append(att_weights) + else: + # output, new_cache = mod(output, src_mask=src_mask, cache=cache[i]) + output, att_weights, new_cache = mod(output, src_mask=src_mask, cache=cache[i]) # interpret + att_weights_list.append(att_weights) + new_caches.append(new_cache) + + if self.norm is not None: + output = self.norm(output) + + # return output if cache is None else (output, new_caches) + return (output, att_weights_list) if cache is None else (output, att_weights_list, new_caches) # interpret + + def gen_cache(self, src): + r""" + Generates cache for `forward` usage. The generated cache is a list, and + each element in it is `incremental_cache` produced by + `TransformerEncoderLayer.gen_cache`. See `TransformerEncoderLayer.gen_cache` + for more details. + + Parameters: + src (Tensor): The input of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + + Returns: + list: It is a list, and each element in the list is `incremental_cache` + produced by `TransformerEncoderLayer.gen_cache`. See + `TransformerEncoderLayer.gen_cache` for more details. + """ + cache = [layer.gen_cache(src) for layer in self.layers] + return cache + + +class TransformerDecoderLayer(Layer): + """ + TransformerDecoderLayer is composed of three sub-layers which are decoder + self (multi-head) attention, decoder-encoder cross attention and feedforward + network. Before and after each sub-layer, pre-process and post-precess would + be applied on the input and output accordingly. If `normalize_before` is True, + pre-process is layer normalization and post-precess includes dropout, residual + connection. Otherwise, no pre-process and post-precess includes dropout, residual + connection, layer normalization. 
+ + Parameters: + d_model (int): The expected feature size in the input and output. + nhead (int): The number of heads in multi-head attention(MHA). + dim_feedforward (int): The hidden layer size in the feedforward network(FFN). + dropout (float, optional): The dropout probability used in pre-process + and post-precess of MHA and FFN sub-layer. Default 0.1 + activation (str, optional): The activation function in the feedforward + network. Default relu. + attn_dropout (float, optional): The dropout probability used + in MHA to drop some attention target. If None, use the value of + `dropout`. Default None + act_dropout (float, optional): The dropout probability used after FFN + activition. If None, use the value of `dropout`. Default None + normalize_before (bool, optional): Indicate whether to put layer normalization + into preprocessing of MHA and FFN sub-layers. If True, pre-process is layer + normalization and post-precess includes dropout, residual connection. + Otherwise, no pre-process and post-precess includes dropout, residual + connection, layer normalization. Default False + weight_attr(ParamAttr|list|tuple, optional): To specify the weight parameter property. + If it is a list/tuple, `weight_attr[0]` would be used as `weight_attr` for + self attention, `weight_attr[1]` would be used as `weight_attr` for + cross attention, and `weight_attr[2]` would be used as `weight_attr` + for linear in FFN. Otherwise, the three sub-layers all uses it as + `weight_attr` to create parameters. Default: None, which means the + default weight parameter property is used. See usage for details + in :ref:`api_paddle_ParamAttr` . + bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property. + If it is a list/tuple, `bias_attr[0]` would be used as `bias_attr` for + self attention, `bias_attr[1]` would be used as `bias_attr` for + cross attention, and `bias_attr[2]` would be used as `bias_attr` + for linear in FFN. Otherwise, the three sub-layers all uses it as + `bias_attr` to create parameters. The `False` value means the + corresponding layer would not have trainable bias parameter. See + usage for details in :code:`ParamAttr` . Default: None,which means + the default bias parameter property is used. + + Examples: + + .. 
code-block:: python + + import paddle + from paddle.nn import TransformerDecoderLayer + + # decoder input: [batch_size, tgt_len, d_model] + dec_input = paddle.rand((2, 4, 128)) + # encoder output: [batch_size, src_len, d_model] + enc_output = paddle.rand((2, 6, 128)) + # self attention mask: [batch_size, n_head, tgt_len, tgt_len] + self_attn_mask = paddle.rand((2, 2, 4, 4)) + # cross attention mask: [batch_size, n_head, tgt_len, src_len] + cross_attn_mask = paddle.rand((2, 2, 4, 6)) + decoder_layer = TransformerDecoderLayer(128, 2, 512) + output = decoder_layer(dec_input, + enc_output, + self_attn_mask, + cross_attn_mask) # [2, 4, 128] + """ + + def __init__( + self, + d_model, + nhead, + dim_feedforward, + dropout=0.1, + activation="relu", + attn_dropout=None, + act_dropout=None, + normalize_before=False, + weight_attr=None, + bias_attr=None, + ): + self._config = locals() + self._config.pop("self") + self._config.pop("__class__", None) # py3 + + super(TransformerDecoderLayer, self).__init__() + attn_dropout = dropout if attn_dropout is None else attn_dropout + act_dropout = dropout if act_dropout is None else act_dropout + self.normalize_before = normalize_before + + weight_attrs = _convert_param_attr_to_list(weight_attr, 3) + bias_attrs = _convert_param_attr_to_list(bias_attr, 3) + + self.self_attn = MultiHeadAttention( + d_model, nhead, dropout=attn_dropout, weight_attr=weight_attrs[0], bias_attr=bias_attrs[0] + ) + self.cross_attn = MultiHeadAttention( + d_model, nhead, dropout=attn_dropout, weight_attr=weight_attrs[1], bias_attr=bias_attrs[1] + ) + self.linear1 = Linear(d_model, dim_feedforward, weight_attrs[2], bias_attr=bias_attrs[2]) + self.dropout = Dropout(act_dropout, mode="upscale_in_train") + self.linear2 = Linear(dim_feedforward, d_model, weight_attrs[2], bias_attr=bias_attrs[2]) + self.norm1 = LayerNorm(d_model) + self.norm2 = LayerNorm(d_model) + self.norm3 = LayerNorm(d_model) + self.dropout1 = Dropout(dropout, mode="upscale_in_train") + self.dropout2 = Dropout(dropout, mode="upscale_in_train") + self.dropout3 = Dropout(dropout, mode="upscale_in_train") + self.activation = getattr(F, activation) + + def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None): + r""" + Applies a Transformer decoder layer on the input. + + Parameters: + tgt (Tensor): The input of Transformer decoder layer. It is a tensor + with shape `[batch_size, target_length, d_model]`. The data type + should be float32 or float64. + memory (Tensor): The output of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + tgt_mask (Tensor, optional): A tensor used in self attention + to prevents attention to some unwanted positions, usually the + the subsequent positions. It is a tensor with shape broadcasted + to `[batch_size, n_head, target_length, target_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + memory_mask (Tensor, optional): A tensor used in decoder-encoder + cross attention to prevents attention to some unwanted positions, + usually the paddings. It is a tensor with shape broadcasted to + `[batch_size, n_head, target_length, source_length]`. 
When the + data type is bool, the unwanted positions have `False` values + and the others have `True` values. When the data type is int, + the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + cache (tuple, optional): It is a tuple( :code:`(incremental_cache, static_cache)` ), + `incremental_cache` is an instance of `MultiHeadAttention.Cache`, + `static_cache` is an instance of `MultiHeadAttention.StaticCache. + See `TransformerDecoderLayer.gen_cache` for more details. It is + only used for inference and should be None for training. Default + None. + + Returns: + Tensor|tuple: It is a tensor that has the same shape and data type \ + as `tgt`, representing the output of Transformer decoder layer. \ + Or a tuple if `cache` is not None, except for decoder layer output, \ + the tuple includes the new cache which is same as input `cache` \ + argument but `incremental_cache` in it has an incremental length. \ + See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ + for more details. + """ + tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype) + memory_mask = _convert_attention_mask(memory_mask, memory.dtype) + + residual = tgt + if self.normalize_before: + tgt = self.norm1(tgt) + if cache is None: + tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None) + else: + tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask, cache[0]) + tgt = residual + self.dropout1(tgt) + if not self.normalize_before: + tgt = self.norm1(tgt) + + residual = tgt + if self.normalize_before: + tgt = self.norm2(tgt) + if cache is None: + tgt = self.cross_attn(tgt, memory, memory, memory_mask, None) + else: + tgt, static_cache = self.cross_attn(tgt, memory, memory, memory_mask, cache[1]) + tgt = residual + self.dropout2(tgt) + if not self.normalize_before: + tgt = self.norm2(tgt) + + residual = tgt + if self.normalize_before: + tgt = self.norm3(tgt) + tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt)))) + tgt = residual + self.dropout3(tgt) + if not self.normalize_before: + tgt = self.norm3(tgt) + return tgt if cache is None else (tgt, (incremental_cache, static_cache)) + + def gen_cache(self, memory): + r""" + Generates cache for `forward` usage. The generated cache is a tuple + composed of an instance of `MultiHeadAttention.Cache` and an instance + of `MultiHeadAttention.StaticCache`. + + Parameters: + memory (Tensor): The output of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + + Returns: + tuple: It is a tuple( :code:`(incremental_cache, static_cache)` ). \ + `incremental_cache` is an instance of `MultiHeadAttention.Cache` \ + produced by `self_attn.gen_cache(memory, MultiHeadAttention.Cache)`, \ + it reserves two tensors shaped `[batch_size, nhead, 0, d_model // nhead]`. \ + `static_cache` is an instance of `MultiHeadAttention.StaticCache` \ + produced by `cross_attn.gen_cache(memory, MultiHeadAttention.StaticCache)`, \ + it reserves two tensors shaped `[batch_size, nhead, source_length, d_model // nhead]`. + See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ + for more details. 
+ """ + incremental_cache = self.self_attn.gen_cache(memory, type=self.self_attn.Cache) + static_cache = self.cross_attn.gen_cache(memory, memory, type=self.cross_attn.StaticCache) + return incremental_cache, static_cache + + +class TransformerDecoder(Layer): + """ + TransformerDecoder is a stack of N decoder layers. + + Parameters: + decoder_layer (Layer): an instance of the `TransformerDecoderLayer`. It + would be used as the first layer, and the other layers would be created + according to the configurations of it. + num_layers (int): The number of decoder layers to be stacked. + norm (LayerNorm, optional): the layer normalization component. If provided, + apply layer normalization on the output of last encoder layer. + + Examples: + + .. code-block:: python + + import paddle + from paddle.nn import TransformerDecoderLayer, TransformerDecoder + + # decoder input: [batch_size, tgt_len, d_model] + dec_input = paddle.rand((2, 4, 128)) + # encoder output: [batch_size, src_len, d_model] + enc_output = paddle.rand((2, 6, 128)) + # self attention mask: [batch_size, n_head, tgt_len, tgt_len] + self_attn_mask = paddle.rand((2, 2, 4, 4)) + # cross attention mask: [batch_size, n_head, tgt_len, src_len] + cross_attn_mask = paddle.rand((2, 2, 4, 6)) + decoder_layer = TransformerDecoderLayer(128, 2, 512) + decoder = TransformerDecoder(decoder_layer, 2) + output = decoder(dec_input, + enc_output, + self_attn_mask, + cross_attn_mask) # [2, 4, 128] + """ + + def __init__(self, decoder_layer, num_layers, norm=None): + super(TransformerDecoder, self).__init__() + self.layers = LayerList( + [(decoder_layer if i == 0 else type(decoder_layer)(**decoder_layer._config)) for i in range(num_layers)] + ) + self.num_layers = num_layers + self.norm = norm + + def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None): + r""" + Applies a stack of N Transformer decoder layers on inputs. If `norm` is + provided, also applies layer normalization on the output of last decoder + layer. + + Parameters: + tgt (Tensor): The input of Transformer decoder. It is a tensor + with shape `[batch_size, target_length, d_model]`. The data type + should be float32 or float64. + memory (Tensor): The output of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + tgt_mask (Tensor, optional): A tensor used in self attention + to prevents attention to some unwanted positions, usually the + the subsequent positions. It is a tensor with shape broadcasted + to `[batch_size, n_head, target_length, target_length]`. When + the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + memory_mask (Tensor, optional): A tensor used in decoder-encoder + cross attention to prevents attention to some unwanted positions, + usually the paddings. It is a tensor with shape broadcasted to + `[batch_size, n_head, target_length, source_length]`. When the + data type is bool, the unwanted positions have `False` values + and the others have `True` values. When the data type is int, + the unwanted positions have 0 values and the others have 1 + values. 
When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + cache (list, optional): It is a list, and each element in the list + is a tuple( :code:`(incremental_cache, static_cache)` ). See + `TransformerDecoder.gen_cache` for more details. It is only + used for inference and should be None for training. Default None. + + Returns: + Tensor|tuple: It is a tensor that has the same shape and data type \ + as `tgt`, representing the output of Transformer decoder. \ + Or a tuple if `cache` is not None, except for decoder output, \ + the tuple includes the new cache which is same as input `cache` \ + argument but `incremental_cache` in it has an incremental length. \ + See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ + for more details. + """ + tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype) + memory_mask = _convert_attention_mask(memory_mask, memory.dtype) + + output = tgt + new_caches = [] + for i, mod in enumerate(self.layers): + if cache is None: + output = mod(output, memory, tgt_mask=tgt_mask, memory_mask=memory_mask, cache=None) + else: + output, new_cache = mod(output, memory, tgt_mask=tgt_mask, memory_mask=memory_mask, cache=cache[i]) + new_caches.append(new_cache) + + if self.norm is not None: + output = self.norm(output) + + return output if cache is None else (output, new_caches) + + def gen_cache(self, memory, do_zip=False): + r""" + Generates cache for `forward` usage. The generated cache is a list, and + each element in it is a tuple( :code:`(incremental_cache, static_cache)` ) + produced by `TransformerDecoderLayer.gen_cache`. See `TransformerDecoderLayer.gen_cache` + for more details. If `do_zip` is True, apply `zip` on these tuples to get + a list with two elements. + + + Parameters: + memory (Tensor): The output of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + do_zip (bool, optional): Indicate whether to apply `zip` on the tuples. + If True, return a list with two elements. Default False + + Returns: + list: It is a list, and each element in the list is a tuple produced \ + by `TransformerDecoderLayer.gen_cache(memory)`. See `TransformerDecoderLayer.gen_cache` \ + for more details. If `do_zip` is True, apply `zip` on these tuples \ + and return a list with two elements. + """ + cache = [layer.gen_cache(memory) for layer in self.layers] + if do_zip: + cache = list(zip(*cache)) + return cache + + +class Transformer(Layer): + """ + A Transformer model composed of an instance of `TransformerEncoder` and an + instance of `TransformerDecoder`. While the embedding layer and output layer + are not included. + + Please refer to `Attention is all you need <http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf>`_ , + and see `TransformerEncoder` and `TransformerDecoder` for more details. + + Users can configurate the model architecture with corresponding parameters. + Note the usage of `normalize_before` representing where to apply layer + normalization (in pre-process or post-precess of multi-head attention or FFN), + and some transformer like models are different on this, such as + `BERT <https://arxiv.org/abs/1810.04805>`_ and `GPT2 <https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf>`_ . 
+ The default architecture here places layer normalization in post-process and + applies another layer normalization on the output of last encoder/decoder layer. + + Parameters: + d_model (int, optional): The expected feature size in the encoder/decoder input + and output. Default 512 + nhead (int, optional): The number of heads in multi-head attention(MHA). Default 8 + num_encoder_layers (int, optional): The number of layers in encoder. Default 6 + num_decoder_layers (int, optional): The number of layers in decoder. Default 6 + dim_feedforward (int, optional): The hidden layer size in the feedforward network(FFN). Default 2048 + dropout (float, optional): The dropout probability used in pre-process + and post-precess of MHA and FFN sub-layer. Default 0.1 + activation (str, optional): The activation function in the feedforward + network. Default relu. + attn_dropout (float, optional): The dropout probability used + in MHA to drop some attention target. If None, use the value of + `dropout`. Default None + act_dropout (float, optional): The dropout probability used after FFN + activition. If None, use the value of `dropout`. Default None + normalize_before (bool, optional): Indicate whether to put layer normalization + into preprocessing of MHA and FFN sub-layers. If True, pre-process is layer + normalization and post-precess includes dropout, residual connection. + Otherwise, no pre-process and post-precess includes dropout, residual + connection, layer normalization. Default False + weight_attr(ParamAttr|list|tuple, optional): To specify the weight parameter property. + If it is a list/tuple, the length of `weight_attr` could be 1, 2 or 3. If it is 3, + `weight_attr[0]` would be used as `weight_attr` for self attention, `weight_attr[1]` + would be used as `weight_attr` for cross attention of `TransformerDecoder`, + and `weight_attr[2]` would be used as `weight_attr` for linear in FFN. + If it is 2, `weight_attr[0]` would be used as `weight_attr` both for self attention + and cross attntion and `weight_attr[1]` would be used as `weight_attr` for + linear in FFN. If it is 1, `weight_attr[0]` would be used as `weight_attr` + for self attention, cross attention and linear in FFN. Otherwise, + the three sub-layers all uses it as `weight_attr` to create parameters. + Default: None, which means the default weight parameter property is used. + See usage for details + in :code:`ParamAttr` . + bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property. + If it is a list/tuple, the length of `bias_attr` could be 1, 2 or 3. If it is 3, + `bias_attr[0]` would be used as `bias_attr` for self attention, `bias_attr[1]` + would be used as `bias_attr` for cross attention of `TransformerDecoder`, + and `bias_attr[2]` would be used as `bias_attr` for linear in FFN. + If it is 2, `bias_attr[0]` would be used as `bias_attr` both for self attention + and cross attntion and `bias_attr[1]` would be used as `bias_attr` for + linear in FFN. If it is 1, `bias_attr[0]` would be used as `bias_attr` + for self attention, cross attention and linear in FFN. Otherwise, + the three sub-layers all uses it as `bias_attr` to create parameters. + The `False` value means the corresponding layer would not have trainable + bias parameter. See usage for details in :code:`ParamAttr` . + Default: None,which means the default bias parameter property is used. + custom_encoder (Layer, optional): If custom encoder is provided, use it as the encoder. 
+ Default None + custom_decoder (Layer, optional): If custom decoder is provided, use it as the decoder. + Default None + + Examples: + + .. code-block:: python + + import paddle + from paddle.nn import Transformer + + # src: [batch_size, tgt_len, d_model] + enc_input = paddle.rand((2, 4, 128)) + # tgt: [batch_size, src_len, d_model] + dec_input = paddle.rand((2, 6, 128)) + # src_mask: [batch_size, n_head, src_len, src_len] + enc_self_attn_mask = paddle.rand((2, 2, 4, 4)) + # tgt_mask: [batch_size, n_head, tgt_len, tgt_len] + dec_self_attn_mask = paddle.rand((2, 2, 6, 6)) + # memory_mask: [batch_size, n_head, tgt_len, src_len] + cross_attn_mask = paddle.rand((2, 2, 6, 4)) + transformer = Transformer(128, 2, 4, 4, 512) + output = transformer(enc_input, + dec_input, + enc_self_attn_mask, + dec_self_attn_mask, + cross_attn_mask) # [2, 6, 128] + """ + + def __init__( + self, + d_model=512, + nhead=8, + num_encoder_layers=6, + num_decoder_layers=6, + dim_feedforward=2048, + dropout=0.1, + activation="relu", + attn_dropout=None, + act_dropout=None, + normalize_before=False, + weight_attr=None, + bias_attr=None, + custom_encoder=None, + custom_decoder=None, + ): + super(Transformer, self).__init__() + + if isinstance(bias_attr, (list, tuple)): + if len(bias_attr) == 1: + encoder_bias_attr = [bias_attr[0]] * 2 + decoder_bias_attr = [bias_attr[0]] * 3 + elif len(bias_attr) == 2: + encoder_bias_attr = bias_attr + decoder_bias_attr = [bias_attr[0], bias_attr[0], bias_attr[-1]] + elif len(bias_attr) == 3: + encoder_bias_attr = [bias_attr[0], bias_attr[-1]] + decoder_bias_attr = bias_attr + else: + assert False, "length of bias_attr should be 1 or 2 or 3 when it is a list/tuple" + else: + encoder_bias_attr = bias_attr + decoder_bias_attr = bias_attr + + if isinstance(weight_attr, (list, tuple)): + if len(weight_attr) == 1: + encoder_weight_attr = [weight_attr[0]] * 2 + decoder_weight_attr = [weight_attr[0]] * 3 + elif len(weight_attr) == 2: + encoder_weight_attr = weight_attr + decoder_weight_attr = [weight_attr[0], weight_attr[0], weight_attr[-1]] + elif len(weight_attr) == 3: + encoder_weight_attr = [weight_attr[0], weight_attr[-1]] + decoder_weight_attr = weight_attr + else: + assert False, "length of weight_attr should be 1 or 2 or 3 when it is a list/tuple" + else: + encoder_weight_attr = weight_attr + decoder_weight_attr = weight_attr + + if custom_encoder is not None: + self.encoder = custom_encoder + else: + encoder_layer = TransformerEncoderLayer( + d_model, + nhead, + dim_feedforward, + dropout, + activation, + attn_dropout, + act_dropout, + normalize_before, + encoder_weight_attr, + encoder_bias_attr, + ) + encoder_norm = LayerNorm(d_model) + self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm) + + if custom_decoder is not None: + self.decoder = custom_decoder + else: + decoder_layer = TransformerDecoderLayer( + d_model, + nhead, + dim_feedforward, + dropout, + activation, + attn_dropout, + act_dropout, + normalize_before, + decoder_weight_attr, + decoder_bias_attr, + ) + decoder_norm = LayerNorm(d_model) + self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm) + + self.d_model = d_model + self.nhead = nhead + + def forward(self, src, tgt, src_mask=None, tgt_mask=None, memory_mask=None): + r""" + Applies a Transformer model on the inputs. + + Parameters: + src (Tensor): The input of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. 
+ tgt (Tensor): The input of Transformer decoder. It is a tensor + with shape `[batch_size, target_length, d_model]`. The data type + should be float32 or float64. + memory (Tensor): The output of Transformer encoder. It is a tensor + with shape `[batch_size, source_length, d_model]`. The data type + should be float32 or float64. + src_mask (Tensor, optional): A tensor used in multi-head attention + to prevents attention to some unwanted positions, usually the + paddings or the subsequent positions. It is a tensor with shape + broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. + When the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + tgt_mask (Tensor, optional): A tensor used in self attention + to prevents attention to some unwanted positions, usually the + the subsequent positions. It is a tensor with shape broadcasted + to `[batch_size, n_head, target_length, target_length]`. When + the data type is bool, the unwanted positions have `False` + values and the others have `True` values. When the data type is + int, the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + memory_mask (Tensor, optional): A tensor used in decoder-encoder + cross attention to prevents attention to some unwanted positions, + usually the paddings. It is a tensor with shape broadcasted to + `[batch_size, n_head, target_length, source_length]`. When the + data type is bool, the unwanted positions have `False` values + and the others have `True` values. When the data type is int, + the unwanted positions have 0 values and the others have 1 + values. When the data type is float, the unwanted positions have + `-INF` values and the others have 0 values. It can be None when + nothing wanted or needed to be prevented attention to. Default None. + + Returns: + Tensor: It is a tensor that has the same shape and data type \ + as `tgt`, representing the output of Transformer decoder. + """ + src_mask = _convert_attention_mask(src_mask, src.dtype) + memory = self.encoder(src, src_mask=src_mask) + + tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype) + memory_mask = _convert_attention_mask(memory_mask, memory.dtype) + output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask) + return output + + def generate_square_subsequent_mask(self, length): + """ + Generate a square mask for the sequence. The mask ensures that the + predictions for position i can depend only on the known outputs at + positions less than i. + + Parameters: + length (int|Tensor): The length of sequence. + + Returns: + Tensor: Generated square mask according to the given length. + + Examples: + .. code-block:: python + + import paddle + from paddle.nn.layer.transformer import Transformer + length = 5 + d_model, n_head, dim_feedforward = 8, 4, 64 + transformer_paddle = Transformer( + d_model, n_head, dim_feedforward=dim_feedforward) + mask = transformer_paddle.generate_square_subsequent_mask(length) + print(mask) + + # [[ 0. -inf -inf -inf -inf] + # [ 0. 0. 
-inf -inf -inf] + # [ 0. 0. 0. -inf -inf] + # [ 0. 0. 0. 0. -inf] + # [ 0. 0. 0. 0. 0.]] + + """ + return paddle.tensor.triu((paddle.ones((length, length), dtype=paddle.get_default_dtype()) * -np.inf), 1) diff --git a/examples/model_interpretation/utils.py b/examples/model_interpretation/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..469dc6f797f1c915749a842089a6322f0c666401 --- /dev/null +++ b/examples/model_interpretation/utils.py @@ -0,0 +1,88 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""This file contains some public functions +""" + + +def convert_tokenizer_res_to_old_version(tokenized_res): + if isinstance(tokenized_res, list): + return tokenized_res + if isinstance(tokenized_res, dict): + if len(tokenized_res["input_ids"]) == 0 or not isinstance(tokenized_res["input_ids"][0], list): + return tokenized_res + else: + res = [] + for idx in range(len(tokenized_res["input_ids"])): + temp_dict = {} + key_list = list(tokenized_res.keys()) + for key in key_list: + temp_dict[key] = tokenized_res[key][idx] + res.append(temp_dict) + return res + else: + raise ValueError("unsupported result type") + + +def cal_score(match_list, sorted_token): + over_all = [] + miss = 0 + for i in match_list: + over_all.extend(i[0]) + + score_dic = {} + for i in sorted_token: + split_time = over_all.count(i[0]) + if split_time: + score_dic[i[0]] = i[2] / split_time + else: + score_dic[i[0]] = 0.0 + if miss != 0: + print(miss) + + score = [] + for i in range(len(match_list)): + cur_score = 0.0 + for j in match_list[i][0]: + if j == -1: + continue + cur_score += score_dic[j] + score.append([str(match_list[i][1]), match_list[i][2], cur_score]) + return score + + +def match(context, context_seg, sorted_token): + result = [] + pointer1 = 0 # point at the context + pointer2 = 0 # point at the sorted_token array + for i in range(len(context_seg)): + seg_start_idx = context.find(context_seg[i], pointer1) + if seg_start_idx < 0: + print("Error: token not in context") + seg_end_idx = seg_start_idx + len(context_seg[i]) + + cur_set = [] + while pointer2 < len(sorted_token): + while pointer2 < len(sorted_token) and sorted_token[pointer2][1][1] <= seg_start_idx: + pointer2 += 1 + if pointer2 >= len(sorted_token): + break + if sorted_token[pointer2][1][0] >= seg_end_idx: + break + cur_set.append(sorted_token[pointer2][0]) + pointer2 += 1 + result.append([cur_set, i, context_seg[i]]) + pointer2 -= 1 + pointer1 = seg_end_idx + score = cal_score(result, sorted_token) + return score diff --git a/examples/multimodal/layoutlm/README.md b/examples/multimodal/layoutlm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f1f46392d4cdb8ea561d543a2952d8088cd965f0 --- /dev/null +++ b/examples/multimodal/layoutlm/README.md @@ -0,0 +1,44 @@ +# LayoutLM + +## 模型简介 +本项目是 [LayoutLM:Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/pdf/1912.13318v5.pdf) 在 Paddle 2.2上的开源实现, +包含了在 
[FUNSD数据集](https://guillaumejaume.github.io/FUNSD/) 上的微调代码。
+
+## 快速开始
+### 配置环境
+环境依赖:
+- cv2
+- sentencepiece
+- yacs
+
+安装命令:
+```shell
+pip install opencv-python
+pip install sentencepiece
+pip install yacs
+```
+
+### 数据准备
+处理好的FUNSD数据集下载地址:https://bj.bcebos.com/v1/paddlenlp/datasets/FUNSD.zip 。
+
+下载并解压该数据集,将解压后的数据放置在当前目录下。
+
+### 执行Fine-tuning
+1. ``Sequence Labeling`` 任务启动Fine-tuning的方式如下:
+    ```shell
+    bash train_funsd.sh
+
+    # 结果如下:
+    # best metrics: {'precision': 0.7642124883504194, 'recall': 0.8204102051025512, 'f1': 0.7913148371531967}
+    ```
+
+### 数据处理
+FUNSD数据集是常用的表格理解数据集,原始数据集下载地址:https://guillaumejaume.github.io/FUNSD/dataset.zip 。
+该数据集包含 training_data 和 testing_data 两个子文件夹,分别包含149个训练样本和50个测试样本。数据预处理方式如下:
+```shell
+bash preprocess.sh
+```
+
+## Reference
+- [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/pdf/1912.13318v5.pdf)
+- [microsoft/unilm/layoutlm](https://github.com/microsoft/unilm/tree/master/layoutlm)
diff --git a/examples/multimodal/layoutlm/funsd.py b/examples/multimodal/layoutlm/funsd.py
new file mode 100644
index 0000000000000000000000000000000000000000..4421cd3710b4f11bb2fa19e99971b6df7d546725
--- /dev/null
+++ b/examples/multimodal/layoutlm/funsd.py
@@ -0,0 +1,317 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
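+
+# `FunsdDataset` below reads the preprocessed FUNSD files produced by
+# `preprocess.py` / `preprocess.sh` ({mode}.txt, {mode}_box.txt and
+# {mode}_image.txt under `args.data_dir`) and yields
+# (input_ids, input_mask, segment_ids, label_ids, bboxes) tensors per example.
+# A minimal usage sketch (the argument values here are illustrative only):
+#
+#     dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id=-100, mode="train")
+#     loader = paddle.io.DataLoader(dataset, batch_size=8)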
+ +import logging +import os + +import paddle +from paddle.io import Dataset + +logger = logging.getLogger(__name__) + + +class FunsdDataset(Dataset): + def __init__(self, args, tokenizer, labels, pad_token_label_id, mode): + logger.info("Creating features from dataset file at %s", args.data_dir) + examples = read_examples_from_file(args.data_dir, mode) + features = convert_examples_to_features( + examples, + labels, + args.max_seq_length, + tokenizer, + cls_token_at_end=False, + cls_token=tokenizer.cls_token, + cls_token_segment_id=0, + sep_token=tokenizer.sep_token, + sep_token_extra=False, + pad_on_left=False, + pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0], + pad_token_segment_id=0, + pad_token_label_id=pad_token_label_id, + ) + + self.features = features + # Convert to Tensors and build dataset + self.all_input_ids = paddle.to_tensor([f.input_ids for f in features], dtype="int64") + self.all_input_mask = paddle.to_tensor([f.input_mask for f in features], dtype="int64") + self.all_segment_ids = paddle.to_tensor([f.segment_ids for f in features], dtype="int64") + self.all_label_ids = paddle.to_tensor([f.label_ids for f in features], dtype="int64") + self.all_bboxes = paddle.to_tensor([f.boxes for f in features], dtype="int64") + + def __len__(self): + return len(self.features) + + def __getitem__(self, index): + return ( + self.all_input_ids[index], + self.all_input_mask[index], + self.all_segment_ids[index], + self.all_label_ids[index], + self.all_bboxes[index], + ) + + +class InputExample(object): + """A single training/test example for token classification.""" + + def __init__(self, guid, words, labels, boxes, actual_bboxes, file_name, page_size): + """Constructs a InputExample. + Args: + guid: Unique id for the example. + words: list. The words of the sequence. + labels: (Optional) list. The labels for each word of the sequence. This should be + specified for train and dev examples, but not for test examples. 
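+            boxes: list. The 0-1000 normalized bounding box of each word.
+            actual_bboxes: list. The original bounding box of each word in image
+                pixel coordinates.
+            file_name: The name of the image file this example comes from.
+            page_size: The (width, height) of the page image.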
+ """ + self.guid = guid + self.words = words + self.labels = labels + self.boxes = boxes + self.actual_bboxes = actual_bboxes + self.file_name = file_name + self.page_size = page_size + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__( + self, + input_ids, + input_mask, + segment_ids, + label_ids, + boxes, + actual_bboxes, + file_name, + page_size, + ): + assert ( + 0 <= all(boxes) <= 1000 + ), "Error with input bbox ({}): the coordinate value is not between 0 and 1000".format(boxes) + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.label_ids = label_ids + self.boxes = boxes + self.actual_bboxes = actual_bboxes + self.file_name = file_name + self.page_size = page_size + + +def read_examples_from_file(data_dir, mode): + file_path = os.path.join(data_dir, "{}.txt".format(mode)) + box_file_path = os.path.join(data_dir, "{}_box.txt".format(mode)) + image_file_path = os.path.join(data_dir, "{}_image.txt".format(mode)) + guid_index = 1 + examples = [] + with open(file_path, encoding="utf-8") as f, open(box_file_path, encoding="utf-8") as fb, open( + image_file_path, encoding="utf-8" + ) as fi: + words = [] + boxes = [] + actual_bboxes = [] + file_name = None + page_size = None + labels = [] + for line, bline, iline in zip(f, fb, fi): + if line.startswith("-DOCSTART-") or line == "" or line == "\n": + if words: + examples.append( + InputExample( + guid="{}-{}".format(mode, guid_index), + words=words, + labels=labels, + boxes=boxes, + actual_bboxes=actual_bboxes, + file_name=file_name, + page_size=page_size, + ) + ) + guid_index += 1 + words = [] + boxes = [] + actual_bboxes = [] + file_name = None + page_size = None + labels = [] + else: + splits = line.split("\t") + bsplits = bline.split("\t") + isplits = iline.split("\t") + assert len(splits) == 2 + assert len(bsplits) == 2 + assert len(isplits) == 4 + assert splits[0] == bsplits[0] + words.append(splits[0]) + if len(splits) > 1: + labels.append(splits[-1].replace("\n", "")) + box = bsplits[-1].replace("\n", "") + box = [int(b) for b in box.split()] + boxes.append(box) + actual_bbox = [int(b) for b in isplits[1].split()] + actual_bboxes.append(actual_bbox) + page_size = [int(i) for i in isplits[2].split()] + file_name = isplits[3].strip() + else: + # Examples could have no label for mode = "test" + labels.append("O") + if words: + examples.append( + InputExample( + guid=f"{mode}-{guid_index}", + words=words, + labels=labels, + boxes=boxes, + actual_bboxes=actual_bboxes, + file_name=file_name, + page_size=page_size, + ) + ) + return examples + + +def convert_examples_to_features( + examples, + label_list, + max_seq_length, + tokenizer, + cls_token_at_end=False, + cls_token="[CLS]", + cls_token_segment_id=1, + sep_token="[SEP]", + sep_token_extra=False, + pad_on_left=False, + pad_token=0, + cls_token_box=[0, 0, 0, 0], + sep_token_box=[1000, 1000, 1000, 1000], + pad_token_box=[0, 0, 0, 0], + pad_token_segment_id=0, + pad_token_label_id=-1, + sequence_a_segment_id=0, + mask_padding_with_zero=True, +): + + label_map = {label: i for i, label in enumerate(label_list)} + + features = [] + for (ex_index, example) in enumerate(examples): + file_name = example.file_name + page_size = example.page_size + width, height = page_size + if ex_index % 10000 == 0: + logger.info("Writing example %d of %d", ex_index, len(examples)) + + tokens = [] + token_boxes = [] + actual_bboxes = [] + label_ids = [] + for word, label, box, actual_bbox in zip(example.words, 
example.labels, example.boxes, example.actual_bboxes): + word_tokens = tokenizer.tokenize(word) + tokens.extend(word_tokens) + token_boxes.extend([box] * len(word_tokens)) + actual_bboxes.extend([actual_bbox] * len(word_tokens)) + # Use the real label id for the first token of the word, and padding ids for the remaining tokens + label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1)) + + # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa. + special_tokens_count = 3 if sep_token_extra else 2 + if len(tokens) > max_seq_length - special_tokens_count: + tokens = tokens[: (max_seq_length - special_tokens_count)] + token_boxes = token_boxes[: (max_seq_length - special_tokens_count)] + actual_bboxes = actual_bboxes[: (max_seq_length - special_tokens_count)] + label_ids = label_ids[: (max_seq_length - special_tokens_count)] + + # The convention in BERT is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . [SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambiguously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. + # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. + tokens += [sep_token] + token_boxes += [sep_token_box] + actual_bboxes += [[0, 0, width, height]] + label_ids += [pad_token_label_id] + if sep_token_extra: + # roberta uses an extra separator b/w pairs of sentences + tokens += [sep_token] + token_boxes += [sep_token_box] + actual_bboxes += [[0, 0, width, height]] + label_ids += [pad_token_label_id] + segment_ids = [sequence_a_segment_id] * len(tokens) + + if cls_token_at_end: + tokens += [cls_token] + token_boxes += [cls_token_box] + actual_bboxes += [[0, 0, width, height]] + label_ids += [pad_token_label_id] + segment_ids += [cls_token_segment_id] + else: + tokens = [cls_token] + tokens + token_boxes = [cls_token_box] + token_boxes + actual_bboxes = [[0, 0, width, height]] + actual_bboxes + label_ids = [pad_token_label_id] + label_ids + segment_ids = [cls_token_segment_id] + segment_ids + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids) + + # Zero-pad up to the sequence length. 
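+        # For example, with max_seq_length = 512 and 100 real tokens, 412 pad
+        # entries are appended (or prepended when `pad_on_left` is True) to
+        # input_ids, input_mask, segment_ids, label_ids and token_boxes, so that
+        # every feature has the same length.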
+ padding_length = max_seq_length - len(input_ids) + if pad_on_left: + input_ids = ([pad_token] * padding_length) + input_ids + input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask + segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids + label_ids = ([pad_token_label_id] * padding_length) + label_ids + token_boxes = ([pad_token_box] * padding_length) + token_boxes + else: + input_ids += [pad_token] * padding_length + input_mask += [0 if mask_padding_with_zero else 1] * padding_length + segment_ids += [pad_token_segment_id] * padding_length + label_ids += [pad_token_label_id] * padding_length + token_boxes += [pad_token_box] * padding_length + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + assert len(label_ids) == max_seq_length + assert len(token_boxes) == max_seq_length + + features.append( + InputFeatures( + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + label_ids=label_ids, + boxes=token_boxes, + actual_bboxes=actual_bboxes, + file_name=file_name, + page_size=page_size, + ) + ) + return features diff --git a/examples/multimodal/layoutlm/preprocess.py b/examples/multimodal/layoutlm/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..28b07e5ca5d842e555ab4b485cab82653f2ea1a6 --- /dev/null +++ b/examples/multimodal/layoutlm/preprocess.py @@ -0,0 +1,166 @@ +import argparse +import json +import os + +from PIL import Image +from paddlenlp.transformers import AutoTokenizer + + +def bbox_string(box, width, length): + return ( + str(int(1000 * (box[0] / width))) + + " " + + str(int(1000 * (box[1] / length))) + + " " + + str(int(1000 * (box[2] / width))) + + " " + + str(int(1000 * (box[3] / length))) + ) + + +def actual_bbox_string(box, width, length): + return ( + str(box[0]) + " " + str(box[1]) + " " + str(box[2]) + " " + str(box[3]) + "\t" + str(width) + " " + str(length) + ) + + +def convert(args): + with open(os.path.join(args.output_dir, args.data_split + ".txt.tmp"), "w", encoding="utf8",) as fw, open( + os.path.join(args.output_dir, args.data_split + "_box.txt.tmp"), + "w", + encoding="utf8", + ) as fbw, open( + os.path.join(args.output_dir, args.data_split + "_image.txt.tmp"), + "w", + encoding="utf8", + ) as fiw: + for file in os.listdir(args.data_dir): + file_path = os.path.join(args.data_dir, file) + with open(file_path, "r", encoding="utf8") as f: + data = json.load(f) + image_path = file_path.replace("annotations", "images") + image_path = image_path.replace("json", "png") + file_name = os.path.basename(image_path) + image = Image.open(image_path) + width, length = image.size + for item in data["form"]: + words, label = item["words"], item["label"] + words = [w for w in words if w["text"].strip() != ""] + if len(words) == 0: + continue + if label == "other": + for w in words: + fw.write(w["text"] + "\tO\n") + fbw.write(w["text"] + "\t" + bbox_string(w["box"], width, length) + "\n") + fiw.write( + w["text"] + "\t" + actual_bbox_string(w["box"], width, length) + "\t" + file_name + "\n" + ) + else: + if len(words) == 1: + fw.write(words[0]["text"] + "\tS-" + label.upper() + "\n") + fbw.write(words[0]["text"] + "\t" + bbox_string(words[0]["box"], width, length) + "\n") + fiw.write( + words[0]["text"] + + "\t" + + actual_bbox_string(words[0]["box"], width, length) + + "\t" + + file_name + + "\n" + ) + else: + fw.write(words[0]["text"] + "\tB-" + label.upper() + "\n") + fbw.write(words[0]["text"] + "\t" 
+ bbox_string(words[0]["box"], width, length) + "\n") + fiw.write( + words[0]["text"] + + "\t" + + actual_bbox_string(words[0]["box"], width, length) + + "\t" + + file_name + + "\n" + ) + for w in words[1:-1]: + fw.write(w["text"] + "\tI-" + label.upper() + "\n") + fbw.write(w["text"] + "\t" + bbox_string(w["box"], width, length) + "\n") + fiw.write( + w["text"] + + "\t" + + actual_bbox_string(w["box"], width, length) + + "\t" + + file_name + + "\n" + ) + fw.write(words[-1]["text"] + "\tE-" + label.upper() + "\n") + fbw.write(words[-1]["text"] + "\t" + bbox_string(words[-1]["box"], width, length) + "\n") + fiw.write( + words[-1]["text"] + + "\t" + + actual_bbox_string(words[-1]["box"], width, length) + + "\t" + + file_name + + "\n" + ) + fw.write("\n") + fbw.write("\n") + fiw.write("\n") + + +def seg_file(file_path, tokenizer, max_len): + subword_len_counter = 0 + output_path = file_path[:-4] + with open(file_path, "r", encoding="utf8") as f_p, open(output_path, "w", encoding="utf8") as fw_p: + for line in f_p: + line = line.rstrip() + + if not line: + fw_p.write(line + "\n") + subword_len_counter = 0 + continue + token = line.split("\t")[0] + + current_subwords_len = len(tokenizer.tokenize(token)) + + # Token contains strange control characters like \x96 or \x95 + # Just filter out the complete line + if current_subwords_len == 0: + continue + + if (subword_len_counter + current_subwords_len) > max_len: + fw_p.write("\n" + line + "\n") + subword_len_counter = current_subwords_len + continue + + subword_len_counter += current_subwords_len + + fw_p.write(line + "\n") + + +def seg(args): + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, do_lower_case=True) + seg_file( + os.path.join(args.output_dir, args.data_split + ".txt.tmp"), + tokenizer, + args.max_len, + ) + seg_file( + os.path.join(args.output_dir, args.data_split + "_box.txt.tmp"), + tokenizer, + args.max_len, + ) + seg_file( + os.path.join(args.output_dir, args.data_split + "_image.txt.tmp"), + tokenizer, + args.max_len, + ) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--data_dir", type=str, default="data/training_data/annotations") + parser.add_argument("--data_split", type=str, default="train") + parser.add_argument("--output_dir", type=str, default="data") + parser.add_argument("--model_name_or_path", type=str, default="bert-base-uncased") + parser.add_argument("--max_len", type=int, default=510) + args = parser.parse_args() + + convert(args) + seg(args) diff --git a/examples/multimodal/layoutlm/preprocess.sh b/examples/multimodal/layoutlm/preprocess.sh new file mode 100644 index 0000000000000000000000000000000000000000..2ff8dc4e317aa87c5a176de5af9b5571415239c5 --- /dev/null +++ b/examples/multimodal/layoutlm/preprocess.sh @@ -0,0 +1,13 @@ +python preprocess.py --data_dir data/training_data/annotations \ + --data_split train \ + --output_dir data \ + --model_name_or_path bert-base-uncased \ + --max_len 510 + +python preprocess.py --data_dir data/testing_data/annotations \ + --data_split test \ + --output_dir data \ + --model_name_or_path bert-base-uncased \ + --max_len 510 + +cat data/train.txt | cut -d$'\t' -f 2 | grep -v "^$"| sort | uniq > data/labels.txt \ No newline at end of file diff --git a/examples/multimodal/layoutlm/train_funsd.py b/examples/multimodal/layoutlm/train_funsd.py new file mode 100644 index 0000000000000000000000000000000000000000..8021e0f752f12406a3937eeec78ed85253d0b255 --- /dev/null +++ b/examples/multimodal/layoutlm/train_funsd.py @@ -0,0 
+1,282 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os +import random + +import numpy as np +import paddle +from funsd import FunsdDataset +from seqeval.metrics import ( + classification_report, + f1_score, + precision_score, + recall_score, +) +from tqdm import tqdm, trange + +# relative reference +from utils import parse_args + +from paddlenlp.transformers import ( + LayoutLMForTokenClassification, + LayoutLMModel, + LayoutLMTokenizer, +) + +logger = logging.getLogger(__name__) + + +def get_labels(path): + with open(path, "r") as f: + labels = f.read().splitlines() + if "O" not in labels: + labels = ["O"] + labels + return labels + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def train(args): + logging.basicConfig( + filename=os.path.join(args.output_dir, "train.log") if paddle.distributed.get_rank() == 0 else None, + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO if paddle.distributed.get_rank() == 0 else logging.WARN, + ) + + all_labels = get_labels(args.labels) + + pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index + + tokenizer = LayoutLMTokenizer.from_pretrained(args.model_name_or_path) + + # for training process, model is needed for the bert class + # else it can directly loaded for the downstream task + if not args.do_train: + model = LayoutLMForTokenClassification.from_pretrained(args.model_name_or_path) + else: + model = LayoutLMModel.from_pretrained(args.model_name_or_path) + model = LayoutLMForTokenClassification(model, num_classes=len(all_labels), dropout=None) + + train_dataset = FunsdDataset(args, tokenizer, all_labels, pad_token_label_id, mode="train") + train_sampler = paddle.io.DistributedBatchSampler( + train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=True + ) + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size()) + train_dataloader = paddle.io.DataLoader( + train_dataset, + batch_sampler=train_sampler, + collate_fn=None, + ) + + t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs + + # build linear decay with warmup lr sch + lr_scheduler = paddle.optimizer.lr.PolynomialDecay( + learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0 + ) + if args.warmup_steps > 0: + lr_scheduler = paddle.optimizer.lr.LinearWarmup( + lr_scheduler, + args.warmup_steps, + start_lr=0, + end_lr=args.learning_rate, + ) + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + epsilon=args.adam_epsilon, + weight_decay=args.weight_decay, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=pad_token_label_id) + + # Train + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + 
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info( + " Total train batch size (w. parallel, distributed & accumulation) = %d", + args.train_batch_size * paddle.distributed.get_world_size(), + ) + logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss = 0.0 + model.clear_gradients() + train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) + set_seed(args) + for _ in train_iterator: + epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) + for step, batch in enumerate(epoch_iterator): + model.train() + inputs = { + "input_ids": batch[0], + "attention_mask": batch[1], + "token_type_ids": batch[2], + "bbox": batch[4], + } + labels = batch[3] + logits = model(**inputs) + loss = loss_fct( + logits.reshape([-1, len(all_labels)]), + labels.reshape( + [ + -1, + ] + ), + ) + + loss = loss.mean() + logger.info("train loss: {}".format(loss.numpy())) + loss.backward() + + tr_loss += loss.item() + if (step + 1) % args.gradient_accumulation_steps == 0: + optimizer.step() + lr_scheduler.step() # Update learning rate schedule + model.clear_gradients() + global_step += 1 + + if ( + paddle.distributed.get_rank() == 0 + and args.logging_steps > 0 + and global_step % args.logging_steps == 0 + ): + # Log metrics + if ( + paddle.distributed.get_rank() == 0 and args.evaluate_during_training + ): # Only evaluate when single GPU otherwise metrics may not average well + results, _ = evaluate( + args, + model, + tokenizer, + all_labels, + loss_fct, + pad_token_label_id, + mode="test", + ) + logger.info("results: {}".format(results)) + + if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step)) + os.makedirs(output_dir, exist_ok=True) + if paddle.distributed.get_rank() == 0: + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info("Saving model checkpoint to %s", output_dir) + + if args.max_steps > 0 and global_step > args.max_steps: + epoch_iterator.close() + break + if args.max_steps > 0 and global_step > args.max_steps: + train_iterator.close() + break + + return global_step, tr_loss / global_step + + +def evaluate(args, model, tokenizer, all_labels, loss_fct, pad_token_label_id, mode, prefix=""): + eval_dataset = FunsdDataset(args, tokenizer, all_labels, pad_token_label_id, mode=mode) + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, paddle.distributed.get_world_size()) + eval_dataloader = paddle.io.DataLoader( + eval_dataset, + batch_size=args.eval_batch_size, + collate_fn=None, + ) + + # Eval + logger.info("***** Running evaluation %s *****", prefix) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + nb_eval_steps = 0 + preds = None + out_label_ids = None + model.eval() + for batch in tqdm(eval_dataloader, desc="Evaluating"): + with paddle.no_grad(): + inputs = { + "input_ids": batch[0], + "attention_mask": batch[1], + "token_type_ids": batch[2], + "bbox": batch[4], + } + labels = batch[3] + logits = model(**inputs) + tmp_eval_loss = loss_fct( + logits.reshape([-1, len(all_labels)]), + labels.reshape( + [ + -1, + ] + ), + ) + 
tmp_eval_loss = tmp_eval_loss.mean() + eval_loss += tmp_eval_loss.item() + + nb_eval_steps += 1 + if preds is None: + preds = logits.numpy() + out_label_ids = labels.numpy() + else: + preds = np.append(preds, logits.numpy(), axis=0) + out_label_ids = np.append(out_label_ids, labels.numpy(), axis=0) + + eval_loss = eval_loss / nb_eval_steps + preds = np.argmax(preds, axis=2) + + label_map = {i: label for i, label in enumerate(all_labels)} + out_label_list = [[] for _ in range(out_label_ids.shape[0])] + preds_list = [[] for _ in range(out_label_ids.shape[0])] + + for i in range(out_label_ids.shape[0]): + for j in range(out_label_ids.shape[1]): + if out_label_ids[i, j] != pad_token_label_id: + out_label_list[i].append(label_map[out_label_ids[i][j]]) + preds_list[i].append(label_map[preds[i][j]]) + + results = { + "loss": eval_loss, + "precision": precision_score(out_label_list, preds_list), + "recall": recall_score(out_label_list, preds_list), + "f1": f1_score(out_label_list, preds_list), + } + + report = classification_report(out_label_list, preds_list) + logger.info("\n" + report) + + logger.info("***** Eval results %s *****", prefix) + for key in sorted(results.keys()): + logger.info(" %s = %s", key, str(results[key])) + + return results, preds + + +if __name__ == "__main__": + args = parse_args() + os.makedirs(args.output_dir, exist_ok=True) + train(args) diff --git a/examples/multimodal/layoutlm/train_funsd.sh b/examples/multimodal/layoutlm/train_funsd.sh new file mode 100644 index 0000000000000000000000000000000000000000..cfd65d6c3ba1710349f1fc94c2a262eab35f6caa --- /dev/null +++ b/examples/multimodal/layoutlm/train_funsd.sh @@ -0,0 +1,17 @@ +export CUDA_VISIBLE_DEVICES=7 + +python3.7 train_funsd.py \ + --data_dir "./data/" \ + --model_name_or_path "layoutlm-base-uncased" \ + --do_lower_case \ + --max_seq_length 512 \ + --do_train \ + --do_eval \ + --num_train_epochs 100 \ + --logging_steps 10 \ + --save_steps 500 \ + --output_dir "output/" \ + --labels "./data/labels.txt" \ + --per_gpu_train_batch_size 16 \ + --per_gpu_eval_batch_size 16 \ + --evaluate_during_training diff --git a/examples/multimodal/layoutlm/utils.py b/examples/multimodal/layoutlm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..6e3c4bce74049b1a402687edfd7ca60a054b7e98 --- /dev/null +++ b/examples/multimodal/layoutlm/utils.py @@ -0,0 +1,188 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import, division, print_function + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--data_dir", + default=None, + type=str, + required=True, + help="The input data dir. 
Should contain the training files for the CoNLL-2003 NER task.", + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + ) + parser.add_argument( + "--weights_path", + default=None, + type=str, + required=False, + ) + + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + + # Other parameters + parser.add_argument( + "--labels", + default="", + type=str, + help="Path to a file containing all labels. If not specified, CoNLL-2003 labels are used.", + ) + parser.add_argument( + "--config_name", + default="", + type=str, + help="Pretrained config name or path if not the same as model_name", + ) + parser.add_argument( + "--tokenizer_name", + default="", + type=str, + help="Pretrained tokenizer name or path if not the same as model_name", + ) + parser.add_argument( + "--cache_dir", + default="", + type=str, + help="Where do you want to store the pre-trained models downloaded from s3", + ) + parser.add_argument( + "--max_seq_length", + default=512, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--do_train", action="store_true", help="Whether to run training.") + parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.") + parser.add_argument( + "--do_predict", + action="store_true", + help="Whether to run predictions on the test set.", + ) + parser.add_argument( + "--evaluate_during_training", + action="store_true", + help="Whether to run evaluation during training at each logging step.", + ) + parser.add_argument( + "--do_lower_case", + action="store_true", + help="Set this flag if you are using an uncased model.", + ) + + parser.add_argument( + "--per_gpu_train_batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--per_gpu_eval_batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for evaluation.", + ) + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.", + ) + parser.add_argument( + "--learning_rate", + default=5e-5, + type=float, + help="The initial learning rate for Adam.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + + parser.add_argument("--logging_steps", type=int, default=10, help="Log every X updates steps.") + parser.add_argument( + "--save_steps", + type=int, + default=50, + help="Save checkpoint every X updates steps.", + ) + parser.add_argument( + "--eval_all_checkpoints", + action="store_true", + help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number", + ) + parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available") + parser.add_argument( + "--overwrite_output_dir", + action="store_true", + help="Overwrite the content of the output directory", + ) + parser.add_argument( + "--overwrite_cache", + action="store_true", + help="Overwrite the cached training and evaluation sets", + ) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--fp16", + action="store_true", + help="Whether to use 16-bit (mixed) precision instead of 32-bit", + ) + parser.add_argument( + "--fp16_opt_level", + type=str, + default="O1", + help="For fp16: AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "See details at https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/amp/auto_cast_cn.html", + ) + parser.add_argument( + "--local_rank", + type=int, + default=-1, + help="For distributed training: local_rank", + ) + parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.") + parser.add_argument("--server_port", type=str, default="", help="For distant debugging.") + args = parser.parse_args() + return args diff --git a/examples/multimodal/layoutxlm/README.md b/examples/multimodal/layoutxlm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..03c0a93c6a20e6a67a83283d2001a3a7e05d4b0a --- /dev/null +++ b/examples/multimodal/layoutxlm/README.md @@ -0,0 +1,45 @@ +# LayoutXLM + +## 模型简介 +本项目是 [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) 在 Paddle 2.2上的开源实现, +包含了在 [XFUND数据集](https://github.com/doc-analysis/XFUND) 上的微调代码。 + +## 快速开始 +### 配置环境 +环境依赖 +- cv2 +- sentencepiece +- yacs + +安装命令: +```shell +pip install opencv-python +pip install sentencepiece +pip install yacs +``` + +### 数据准备 +处理好的XFUND中文数据集下载地址:https://bj.bcebos.com/v1/paddlenlp/datasets/XFUND.zip 。 + +下载并解压该数据集,解压后将数据集放置在当前目录下。 + +### 执行Fine-tuning +1. ``Semantic Entity Recognition`` 任务启动Fine-tuning的方式如下: + ```shell + bash run_xfun_ser.sh + + # 结果如下: + # best metrics: {'precision': 0.8514686248331108, 'recall': 0.9354602126879354, 'f1': 0.8914904770225406} + ``` + +2. 
``Relation Extraction`` 任务启动Fine-tuning的方式如下: + ```shell + bash run_xfun_re.sh + + # 结果如下: + # best metrics: {'precision': 0.6788935658448587, 'recall': 0.7743484224965707, 'f1': 0.7234860621595642} + ``` + +## Reference +- [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) +- [microsoft/unilm/layoutxlm](https://github.com/microsoft/unilm/tree/master/layoutxlm) diff --git a/examples/multimodal/layoutxlm/compare.py b/examples/multimodal/layoutxlm/compare.py new file mode 100644 index 0000000000000000000000000000000000000000..120f651c177a11c30b5b21d99cce213578eaa83d --- /dev/null +++ b/examples/multimodal/layoutxlm/compare.py @@ -0,0 +1,105 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +import numpy as np +import paddle +import torch + +sys.path.insert(0, "../../../") + + +def get_input_demo(platform="paddle", device="cpu"): + info = paddle.load("fake_input_paddle_xlm.data") + # imgs = np.random.rand(info["input_ids"].shape[0], 3, 224, 224).astype(np.float32) + # info["image"] = paddle.to_tensor(imgs) + if platform == "torch": + info = {key: torch.tensor(info[key].numpy()) for key in info} + if device == "gpu": + info = {key: info[key].cuda() for key in info} + return info + + +def test_layoutlm_paddle(): + from paddlenlp.transformers import LayoutXLMModel + + model = LayoutXLMModel.from_pretrained("layoutxlm-base-uncased") + model.eval() + + paddle.save(model.state_dict(), "v2.pdparams") + + batch_input = get_input_demo(platform="paddle", device="gpu") + with paddle.no_grad(): + outputs = model( + input_ids=batch_input["input_ids"], + bbox=batch_input["bbox"], + image=batch_input["image"], + attention_mask=batch_input["attention_mask"], + ) + sequence_output = outputs[0] + pooled_output = outputs[1] + return sequence_output, pooled_output + + +def test_layoutlm_torch(): + # import pytorch models + from layoutlmft.models.layoutxlm import LayoutXLMModel + + model = LayoutXLMModel.from_pretrained("microsoft/layoutxlm-base") + model.eval() + model = model.cuda() + + batch_input = get_input_demo(platform="torch", device="gpu") + + outputs = model( + input_ids=batch_input["input_ids"], + bbox=batch_input["bbox"], + image=batch_input["image"], + attention_mask=batch_input["attention_mask"], + ) + sequence_output = outputs[0] + pooled_output = outputs[1] + return sequence_output, pooled_output + + +def get_statistic_info(x, y): + mean_abs_diff = np.mean(np.abs(x - y)) + max_abs_diff = np.max(np.abs(x - y)) + return mean_abs_diff, max_abs_diff + + +if __name__ == "__main__": + + print("\n====test_layoutxlm_torch=====") + torch_hidden_out, torch_pool_out = test_layoutlm_torch() + torch_hidden_out = torch_hidden_out.cpu().detach().numpy() + torch_pool_out = torch_pool_out.cpu().detach().numpy() + print(torch_hidden_out.shape, torch_pool_out.shape) + + print("\n====test_layoutxlm_paddle=====") + paddle_hidden_out, paddle_pool_out = 
test_layoutlm_paddle() + paddle_hidden_out = paddle_hidden_out.numpy() + paddle_pool_out = paddle_pool_out.numpy() + print(paddle_hidden_out.shape, paddle_pool_out.shape) + + mean_abs_diff, max_abs_diff = get_statistic_info(torch_hidden_out, paddle_hidden_out) + print("======hidden_out diff info====") + print("\t mean_abs_diff: {}".format(mean_abs_diff)) + print("\t max_abs_diff: {}".format(max_abs_diff)) + + mean_abs_diff, max_abs_diff = get_statistic_info(torch_pool_out, paddle_pool_out) + print("======pool_out diff info====") + print("\t mean_abs_diff: {}".format(mean_abs_diff)) + print("\t max_abs_diff: {}".format(max_abs_diff)) diff --git a/examples/multimodal/layoutxlm/run_xfun_re.py b/examples/multimodal/layoutxlm/run_xfun_re.py new file mode 100644 index 0000000000000000000000000000000000000000..13e31b27b99c2d66c4482a870af187f7f506bd4c --- /dev/null +++ b/examples/multimodal/layoutxlm/run_xfun_re.py @@ -0,0 +1,406 @@ +import sys +import os +import random +import numbers +import logging + +import argparse +import paddle +import numpy as np +from paddlenlp.transformers import LayoutXLMModel, LayoutXLMTokenizer, LayoutXLMForRelationExtraction +from xfun import XFUN + +# Todo: delete the following line after the release of v2.2 +sys.path.insert(0, "../../../") +logger = logging.getLogger(__name__) + + +class DataCollator: + def __call__(self, batch): + data_dict = {} + to_tensor_keys = [] + for sample in batch: + for k, v in sample.items(): + if k not in data_dict: + data_dict[k] = [] + if isinstance(v, (np.ndarray, paddle.Tensor, numbers.Number)): + if k not in to_tensor_keys: + to_tensor_keys.append(k) + data_dict[k].append(v) + for k in to_tensor_keys: + data_dict[k] = paddle.to_tensor(data_dict[k]) + return data_dict + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + # yapf: disable + parser.add_argument("--model_name_or_path", default=None, type=str, required=True,) + parser.add_argument("--train_data_dir", default=None, type=str, required=False,) + parser.add_argument("--train_label_path", default=None, type=str, required=False,) + parser.add_argument("--eval_data_dir", default=None, type=str, required=False,) + parser.add_argument("--eval_label_path", default=None, type=str, required=False,) + parser.add_argument("--use_vdl", default=False, type=bool, required=False,) + parser.add_argument("--output_dir", default=None, type=str, required=True,) + parser.add_argument("--max_seq_length", default=512, type=int,) + parser.add_argument("--evaluate_during_training", action="store_true",) + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",) + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for eval.",) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.",) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.",) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.",) + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.",) + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.",) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.",) + parser.add_argument("--eval_steps", type=int, default=10, help="eval every X updates steps.",) + 
parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.",) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization",) + # yapf: enable + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def get_label_maps(): + labels = ["O", "B-QUESTION", "B-ANSWER", "B-HEADER", "I-ANSWER", "I-QUESTION", "I-HEADER"] + label2id_map = {label: idx for idx, label in enumerate(labels)} + id2label_map = {idx: label for idx, label in enumerate(labels)} + return label2id_map, id2label_map + + +def cal_metric(re_preds, re_labels, entities): + gt_relations = [] + for b in range(len(re_labels)): + rel_sent = [] + for head, tail in zip(re_labels[b]["head"], re_labels[b]["tail"]): + rel = {} + rel["head_id"] = head + rel["head"] = (entities[b]["start"][rel["head_id"]], entities[b]["end"][rel["head_id"]]) + rel["head_type"] = entities[b]["label"][rel["head_id"]] + + rel["tail_id"] = tail + rel["tail"] = (entities[b]["start"][rel["tail_id"]], entities[b]["end"][rel["tail_id"]]) + rel["tail_type"] = entities[b]["label"][rel["tail_id"]] + + rel["type"] = 1 + rel_sent.append(rel) + gt_relations.append(rel_sent) + re_metrics = re_score(re_preds, gt_relations, mode="boundaries") + return re_metrics + + +def re_score(pred_relations, gt_relations, mode="strict"): + """Evaluate RE predictions + + Args: + pred_relations (list) : list of list of predicted relations (several relations in each sentence) + gt_relations (list) : list of list of ground truth relations + + rel = { "head": (start_idx (inclusive), end_idx (exclusive)), + "tail": (start_idx (inclusive), end_idx (exclusive)), + "head_type": ent_type, + "tail_type": ent_type, + "type": rel_type} + + vocab (Vocab) : dataset vocabulary + mode (str) : in 'strict' or 'boundaries'""" + + assert mode in ["strict", "boundaries"] + + relation_types = [v for v in [0, 1] if not v == 0] + scores = {rel: {"tp": 0, "fp": 0, "fn": 0} for rel in relation_types + ["ALL"]} + + # Count GT relations and Predicted relations + n_sents = len(gt_relations) + n_rels = sum([len([rel for rel in sent]) for sent in gt_relations]) + n_found = sum([len([rel for rel in sent]) for sent in pred_relations]) + + # Count TP, FP and FN per type + for pred_sent, gt_sent in zip(pred_relations, gt_relations): + for rel_type in relation_types: + # strict mode takes argument types into account + if mode == "strict": + pred_rels = { + (rel["head"], rel["head_type"], rel["tail"], rel["tail_type"]) + for rel in pred_sent + if rel["type"] == rel_type + } + gt_rels = { + (rel["head"], rel["head_type"], rel["tail"], rel["tail_type"]) + for rel in gt_sent + if rel["type"] == rel_type + } + + # boundaries mode only takes argument spans into account + elif mode == "boundaries": + pred_rels = {(rel["head"], rel["tail"]) for rel in pred_sent if rel["type"] == rel_type} + gt_rels = {(rel["head"], rel["tail"]) for rel in gt_sent if rel["type"] == rel_type} + + scores[rel_type]["tp"] += len(pred_rels & gt_rels) + scores[rel_type]["fp"] += len(pred_rels - gt_rels) + scores[rel_type]["fn"] += len(gt_rels - pred_rels) + + # Compute per entity Precision / Recall / F1 + for rel_type in scores.keys(): + if scores[rel_type]["tp"]: + scores[rel_type]["p"] = scores[rel_type]["tp"] / (scores[rel_type]["fp"] + scores[rel_type]["tp"]) + scores[rel_type]["r"] = scores[rel_type]["tp"] / (scores[rel_type]["fn"] + scores[rel_type]["tp"]) + else: + 
scores[rel_type]["p"], scores[rel_type]["r"] = 0, 0 + + if not scores[rel_type]["p"] + scores[rel_type]["r"] == 0: + scores[rel_type]["f1"] = ( + 2 * scores[rel_type]["p"] * scores[rel_type]["r"] / (scores[rel_type]["p"] + scores[rel_type]["r"]) + ) + else: + scores[rel_type]["f1"] = 0 + + # Compute micro F1 Scores + tp = sum([scores[rel_type]["tp"] for rel_type in relation_types]) + fp = sum([scores[rel_type]["fp"] for rel_type in relation_types]) + fn = sum([scores[rel_type]["fn"] for rel_type in relation_types]) + + if tp: + precision = tp / (tp + fp) + recall = tp / (tp + fn) + f1 = 2 * precision * recall / (precision + recall) + + else: + precision, recall, f1 = 0, 0, 0 + + scores["ALL"]["p"] = precision + scores["ALL"]["r"] = recall + scores["ALL"]["f1"] = f1 + scores["ALL"]["tp"] = tp + scores["ALL"]["fp"] = fp + scores["ALL"]["fn"] = fn + + # Compute Macro F1 Scores + scores["ALL"]["Macro_f1"] = np.mean([scores[ent_type]["f1"] for ent_type in relation_types]) + scores["ALL"]["Macro_p"] = np.mean([scores[ent_type]["p"] for ent_type in relation_types]) + scores["ALL"]["Macro_r"] = np.mean([scores[ent_type]["r"] for ent_type in relation_types]) + + logger.info(f"RE Evaluation in *** {mode.upper()} *** mode") + + logger.info( + "processed {} sentences with {} relations; found: {} relations; correct: {}.".format( + n_sents, n_rels, n_found, tp + ) + ) + logger.info( + "\tALL\t TP: {};\tFP: {};\tFN: {}".format(scores["ALL"]["tp"], scores["ALL"]["fp"], scores["ALL"]["fn"]) + ) + logger.info("\t\t(m avg): precision: {:.2f};\trecall: {:.2f};\tf1: {:.2f} (micro)".format(precision, recall, f1)) + logger.info( + "\t\t(M avg): precision: {:.2f};\trecall: {:.2f};\tf1: {:.2f} (Macro)\n".format( + scores["ALL"]["Macro_p"], scores["ALL"]["Macro_r"], scores["ALL"]["Macro_f1"] + ) + ) + + for rel_type in relation_types: + logger.info( + "\t{}: \tTP: {};\tFP: {};\tFN: {};\tprecision: {:.2f};\trecall: {:.2f};\tf1: {:.2f};\t{}".format( + rel_type, + scores[rel_type]["tp"], + scores[rel_type]["fp"], + scores[rel_type]["fn"], + scores[rel_type]["p"], + scores[rel_type]["r"], + scores[rel_type]["f1"], + scores[rel_type]["tp"] + scores[rel_type]["fp"], + ) + ) + + return scores + + +def evaluate(model, eval_dataloader, logger, prefix=""): + # Eval! 
+ logger.info(f"***** Running evaluation {prefix} *****") + logger.info(f" Num examples = {len(eval_dataloader.dataset)}") + + re_preds = [] + re_labels = [] + entities = [] + eval_loss = 0.0 + model.eval() + for idx, batch in enumerate(eval_dataloader): + with paddle.no_grad(): + outputs = model(**batch) + loss = outputs["loss"].mean().item() + if paddle.distributed.get_rank() == 0: + logger.info(f"[Eval] process: {idx}/{len(eval_dataloader)}, loss: {loss:.5f}") + + eval_loss += loss + re_preds.extend(outputs["pred_relations"]) + re_labels.extend(batch["relations"]) + entities.extend(outputs["entities"]) + re_metrics = cal_metric(re_preds, re_labels, entities) + re_metrics = { + "precision": re_metrics["ALL"]["p"], + "recall": re_metrics["ALL"]["r"], + "f1": re_metrics["ALL"]["f1"], + } + model.train() + return re_metrics + + +def train(args): + os.makedirs(args.output_dir, exist_ok=True) + set_seed(args) + + label2id_map, id2label_map = get_label_maps() + pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index + + # dist mode + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = LayoutXLMTokenizer.from_pretrained(args.model_name_or_path) + base_model = LayoutXLMModel.from_pretrained(args.model_name_or_path) + model = LayoutXLMForRelationExtraction(base_model, dropout=None) + + # dist mode + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_dataset = XFUN( + tokenizer, + data_dir=args.train_data_dir, + label_path=args.train_label_path, + label2id_map=label2id_map, + img_size=(224, 224), + max_seq_len=args.max_seq_length, + pad_token_label_id=pad_token_label_id, + contains_re=True, + add_special_ids=False, + return_attention_mask=True, + load_mode="all", + ) + + eval_dataset = XFUN( + tokenizer, + data_dir=args.eval_data_dir, + label_path=args.eval_label_path, + label2id_map=label2id_map, + img_size=(224, 224), + max_seq_len=args.max_seq_length, + pad_token_label_id=pad_token_label_id, + contains_re=True, + add_special_ids=False, + return_attention_mask=True, + load_mode="all", + ) + + train_sampler = paddle.io.DistributedBatchSampler( + train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=True + ) + args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size()) + train_dataloader = paddle.io.DataLoader( + train_dataset, batch_sampler=train_sampler, num_workers=8, use_shared_memory=True, collate_fn=DataCollator() + ) + + eval_dataloader = paddle.io.DataLoader( + eval_dataset, batch_size=args.per_gpu_eval_batch_size, num_workers=8, shuffle=False, collate_fn=DataCollator() + ) + + t_total = len(train_dataloader) * args.num_train_epochs + + # build linear decay with warmup lr sch + lr_scheduler = paddle.optimizer.lr.PolynomialDecay( + learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0 + ) + if args.warmup_steps > 0: + lr_scheduler = paddle.optimizer.lr.LinearWarmup( + lr_scheduler, + args.warmup_steps, + start_lr=0, + end_lr=args.learning_rate, + ) + grad_clip = paddle.nn.ClipGradByNorm(clip_norm=10) + optimizer = paddle.optimizer.Adam( + learning_rate=args.learning_rate, + parameters=model.parameters(), + epsilon=args.adam_epsilon, + grad_clip=grad_clip, + weight_decay=args.weight_decay, + ) + + # Train! 
+ logger.info("***** Running training *****") + logger.info(f" Num examples = {len(train_dataset)}") + logger.info(f" Num Epochs = {args.num_train_epochs}") + logger.info(f" Instantaneous batch size per GPU = {args.per_gpu_train_batch_size}") + logger.info( + f" Total train batch size (w. parallel, distributed & accumulation) = {args.train_batch_size * paddle.distributed.get_world_size()}" + ) + logger.info(f" Total optimization steps = {t_total}") + + global_step = 0 + train_dataloader_len = len(train_dataloader) + best_metirc = {"f1": 0} + model.train() + + for epoch in range(int(args.num_train_epochs)): + for step, batch in enumerate(train_dataloader): + outputs = model(**batch) + # model outputs are always tuple in ppnlp (see doc) + loss = outputs["loss"] + loss = loss.mean() + + logger.info( + f"epoch: [{epoch}/{args.num_train_epochs}], iter: [{step}/{train_dataloader_len}], global_step:{global_step}, train loss: {np.mean(loss.numpy())}, lr: {optimizer.get_lr()}" + ) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + # lr_scheduler.step() # Update learning rate schedule + + global_step += 1 + + if paddle.distributed.get_rank() == 0 and args.eval_steps > 0 and global_step % args.eval_steps == 0: + # Log metrics + if paddle.distributed.get_rank() == 0 and args.evaluate_during_training: + results = evaluate(model, eval_dataloader, logger) + if results["f1"] > best_metirc["f1"]: + best_metirc = results + output_dir = os.path.join(args.output_dir, "checkpoint-best") + os.makedirs(output_dir, exist_ok=True) + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info(f"Saving model checkpoint to {output_dir}") + logger.info(f"eval results: {results}") + logger.info(f"best_metirc: {best_metirc}") + + if paddle.distributed.get_rank() == 0 and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, "checkpoint-latest") + os.makedirs(output_dir, exist_ok=True) + if paddle.distributed.get_rank() == 0: + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info(f"Saving model checkpoint to {output_dir}") + logger.info(f"best_metirc: {best_metirc}") + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + train(args) diff --git a/examples/multimodal/layoutxlm/run_xfun_re.sh b/examples/multimodal/layoutxlm/run_xfun_re.sh new file mode 100644 index 0000000000000000000000000000000000000000..4aeea52f5dc960cbc464242cd9bd818ca4ae58ff --- /dev/null +++ b/examples/multimodal/layoutxlm/run_xfun_re.sh @@ -0,0 +1,19 @@ +export CUDA_VISIBLE_DEVICES=0 + +python ./run_xfun_re.py \ + --model_name_or_path "layoutxlm-base-uncased" \ + --max_seq_length 512 \ + --train_data_dir "XFUND/zh_train/image" \ + --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ + --eval_data_dir "XFUND/zh_val/image" \ + --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ + --num_train_epochs 200 \ + --eval_steps 50 \ + --save_steps 500 \ + --output_dir "./output/re/" \ + --learning_rate 5e-5 \ + --warmup_steps 50 \ + --per_gpu_train_batch_size 8 \ + 
--per_gpu_eval_batch_size 8 \ + --evaluate_during_training \ + --seed 2048 diff --git a/examples/multimodal/layoutxlm/run_xfun_ser.py b/examples/multimodal/layoutxlm/run_xfun_ser.py new file mode 100644 index 0000000000000000000000000000000000000000..36b0b988822d2b22e3a3764ccb99c9fbf68be9bd --- /dev/null +++ b/examples/multimodal/layoutxlm/run_xfun_ser.py @@ -0,0 +1,353 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import logging +import os +import random +import sys + +import numpy as np +import paddle +from seqeval.metrics import ( + classification_report, + f1_score, + precision_score, + recall_score, +) +from xfun import XFUN + +from paddlenlp.transformers import ( + LayoutXLMForTokenClassification, + LayoutXLMModel, + LayoutXLMTokenizer, +) + +# Todo: delete the following line after the release of v2.2 +sys.path.insert(0, "../../../") +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + # yapf: disable + parser.add_argument("--model_name_or_path", default=None, type=str, required=True,) + parser.add_argument("--train_data_dir", default=None, type=str, required=False,) + parser.add_argument("--train_label_path", default=None, type=str, required=False,) + parser.add_argument("--eval_data_dir", default=None, type=str, required=False,) + parser.add_argument("--eval_label_path", default=None, type=str, required=False,) + parser.add_argument("--use_vdl", default=False, type=bool, required=False,) + parser.add_argument("--output_dir", default=None, type=str, required=True,) + parser.add_argument("--max_seq_length", default=512, type=int,) + parser.add_argument("--evaluate_during_training", action="store_true",) + parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",) + parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for eval.",) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.",) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.",) + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.",) + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.",) + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.",) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.",) + parser.add_argument("--eval_steps", type=int, default=10, help="eval every X updates steps.",) + parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.",) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization",) + # yapf: enable + args = parser.parse_args() + return args + + +def 
set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def get_label_maps(): + labels = ["O", "B-QUESTION", "B-ANSWER", "B-HEADER", "I-ANSWER", "I-QUESTION", "I-HEADER"] + label2id_map = {label: idx for idx, label in enumerate(labels)} + id2label_map = {idx: label for idx, label in enumerate(labels)} + return label2id_map, id2label_map + + +def train(args): + os.makedirs(args.output_dir, exist_ok=True) + logging.basicConfig( + filename=os.path.join(args.output_dir, "train.log") if paddle.distributed.get_rank() == 0 else None, + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO if paddle.distributed.get_rank() == 0 else logging.WARN, + ) + + ch = logging.StreamHandler() + ch.setLevel(logging.DEBUG) + logger.addHandler(ch) + + label2id_map, id2label_map = get_label_maps() + pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index + + # dist mode + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = LayoutXLMTokenizer.from_pretrained(args.model_name_or_path) + base_model = LayoutXLMModel.from_pretrained(args.model_name_or_path) + model = LayoutXLMForTokenClassification(base_model, num_classes=len(label2id_map), dropout=None) + + # dist mode + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + train_dataset = XFUN( + tokenizer, + data_dir=args.train_data_dir, + label_path=args.train_label_path, + label2id_map=label2id_map, + img_size=(224, 224), + pad_token_label_id=pad_token_label_id, + contains_re=False, + add_special_ids=False, + return_attention_mask=True, + load_mode="all", + ) + + train_sampler = paddle.io.DistributedBatchSampler( + train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=True + ) + + args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size()) + + train_dataloader = paddle.io.DataLoader( + train_dataset, + batch_sampler=train_sampler, + num_workers=0, + use_shared_memory=True, + collate_fn=None, + ) + + t_total = len(train_dataloader) * args.num_train_epochs + + # build linear decay with warmup lr sch + lr_scheduler = paddle.optimizer.lr.PolynomialDecay( + learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0 + ) + if args.warmup_steps > 0: + lr_scheduler = paddle.optimizer.lr.LinearWarmup( + lr_scheduler, + args.warmup_steps, + start_lr=0, + end_lr=args.learning_rate, + ) + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + epsilon=args.adam_epsilon, + weight_decay=args.weight_decay, + ) + + # Train! + logger.info("***** Running training *****") + logger.info(" Num examples = %d", len(train_dataset)) + logger.info(" Num Epochs = %d", args.num_train_epochs) + logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) + logger.info( + " Total train batch size (w. 
parallel, distributed) = %d", + args.train_batch_size * paddle.distributed.get_world_size(), + ) + logger.info(" Total optimization steps = %d", t_total) + + global_step = 0 + tr_loss = 0.0 + set_seed(args) + best_metrics = None + + for epoch_id in range(args.num_train_epochs): + for step, batch in enumerate(train_dataloader): + model.train() + outputs = model(**batch) + # model outputs are always tuple in ppnlp (see doc) + loss = outputs[0] + loss = loss.mean() + logger.info( + "[epoch {}/{}][iter: {}/{}] lr: {:.5f}, train loss: {:.5f}, ".format( + epoch_id, + args.num_train_epochs, + step, + len(train_dataloader), + lr_scheduler.get_lr(), + float(loss), + ) + ) + + loss.backward() + tr_loss += loss.item() + optimizer.step() + lr_scheduler.step() # Update learning rate schedule + optimizer.clear_grad() + global_step += 1 + + if paddle.distributed.get_rank() == 0 and args.eval_steps > 0 and global_step % args.eval_steps == 0: + # Log metrics + # Only evaluate when single GPU otherwise metrics may not average well + if paddle.distributed.get_rank() == 0 and args.evaluate_during_training: + results, _ = evaluate( + args, + model, + tokenizer, + label2id_map, + id2label_map, + pad_token_label_id, + ) + + if best_metrics is None or results["f1"] >= best_metrics["f1"]: + best_metrics = copy.deepcopy(results) + output_dir = os.path.join(args.output_dir, "best_model") + os.makedirs(output_dir, exist_ok=True) + if paddle.distributed.get_rank() == 0: + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info("Saving model checkpoint to %s", output_dir) + + logger.info( + "[epoch {}/{}][iter: {}/{}] results: {}".format( + epoch_id, args.num_train_epochs, step, len(train_dataloader), results + ) + ) + if best_metrics is not None: + logger.info("best metrics: {}".format(best_metrics)) + + if paddle.distributed.get_rank() == 0 and args.save_steps > 0 and global_step % args.save_steps == 0: + # Save model checkpoint + output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step)) + os.makedirs(output_dir, exist_ok=True) + if paddle.distributed.get_rank() == 0: + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(args, os.path.join(output_dir, "training_args.bin")) + logger.info("Saving model checkpoint to %s", output_dir) + + return global_step, tr_loss / global_step + + +def evaluate(args, model, tokenizer, label2id_map, id2label_map, pad_token_label_id, prefix=""): + eval_dataset = XFUN( + tokenizer, + data_dir=args.eval_data_dir, + label_path=args.eval_label_path, + label2id_map=label2id_map, + img_size=(224, 224), + pad_token_label_id=pad_token_label_id, + contains_re=False, + add_special_ids=False, + return_attention_mask=True, + load_mode="all", + ) + + args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, paddle.distributed.get_world_size()) + + eval_dataloader = paddle.io.DataLoader( + eval_dataset, + batch_size=args.eval_batch_size, + num_workers=0, + use_shared_memory=True, + collate_fn=None, + ) + + # Eval! 
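+ # Accumulate logits and gold label ids over the whole eval set, take the argmax, and map
+ # ids back to label strings (skipping pad positions) before computing the seqeval
+ # precision / recall / f1 and the classification report.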
+ logger.info("***** Running evaluation %s *****", prefix) + logger.info(" Num examples = %d", len(eval_dataset)) + logger.info(" Batch size = %d", args.eval_batch_size) + eval_loss = 0.0 + nb_eval_steps = 0 + preds = None + out_label_ids = None + model.eval() + for idx, batch in enumerate(eval_dataloader): + with paddle.no_grad(): + outputs = model(**batch) + tmp_eval_loss, logits = outputs[:2] + + tmp_eval_loss = tmp_eval_loss.mean() + + if paddle.distributed.get_rank() == 0: + logger.info( + "[Eval]process: {}/{}, loss: {:.5f}".format(idx, len(eval_dataloader), float(tmp_eval_loss)) + ) + + eval_loss += tmp_eval_loss.item() + nb_eval_steps += 1 + if preds is None: + preds = logits.numpy() + out_label_ids = batch["labels"].numpy() + else: + preds = np.append(preds, logits.numpy(), axis=0) + out_label_ids = np.append(out_label_ids, batch["labels"].numpy(), axis=0) + + eval_loss = eval_loss / nb_eval_steps + preds = np.argmax(preds, axis=2) + + # label_map = {i: label.upper() for i, label in enumerate(labels)} + + out_label_list = [[] for _ in range(out_label_ids.shape[0])] + preds_list = [[] for _ in range(out_label_ids.shape[0])] + + for i in range(out_label_ids.shape[0]): + for j in range(out_label_ids.shape[1]): + if out_label_ids[i, j] != pad_token_label_id: + out_label_list[i].append(id2label_map[out_label_ids[i][j]]) + preds_list[i].append(id2label_map[preds[i][j]]) + + results = { + "loss": eval_loss, + "precision": precision_score(out_label_list, preds_list), + "recall": recall_score(out_label_list, preds_list), + "f1": f1_score(out_label_list, preds_list), + } + + with open(os.path.join(args.output_dir, "test_gt.txt"), "w") as fout: + for lbl in out_label_list: + for l in lbl: + fout.write(l + "\t") + fout.write("\n") + with open(os.path.join(args.output_dir, "test_pred.txt"), "w") as fout: + for lbl in preds_list: + for l in lbl: + fout.write(l + "\t") + fout.write("\n") + + report = classification_report(out_label_list, preds_list) + logger.info("\n" + report) + + logger.info("***** Eval results %s *****", prefix) + for key in sorted(results.keys()): + logger.info(" %s = %s", key, str(results[key])) + + return results, preds_list + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + train(args) diff --git a/examples/multimodal/layoutxlm/run_xfun_ser.sh b/examples/multimodal/layoutxlm/run_xfun_ser.sh new file mode 100644 index 0000000000000000000000000000000000000000..43454abfc26491ef463a7138468a15587fb9fa7f --- /dev/null +++ b/examples/multimodal/layoutxlm/run_xfun_ser.sh @@ -0,0 +1,19 @@ +export CUDA_VISIBLE_DEVICES=0 + +python ./run_xfun_ser.py \ + --model_name_or_path "layoutxlm-base-uncased" \ + --max_seq_length 512 \ + --train_data_dir "XFUND/zh_train/image" \ + --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ + --eval_data_dir "XFUND/zh_val/image" \ + --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ + --num_train_epochs 200 \ + --eval_steps 10 \ + --save_steps 500 \ + --output_dir "./output/ser/" \ + --learning_rate 5e-5 \ + --warmup_steps 50 \ + --per_gpu_train_batch_size 8 \ + --per_gpu_eval_batch_size 8 \ + --evaluate_during_training \ + --seed 2048 diff --git a/examples/multimodal/layoutxlm/xfun.py b/examples/multimodal/layoutxlm/xfun.py new file mode 100644 index 
0000000000000000000000000000000000000000..3bb5be92e91346b2e98fcad3b6ef1ecb2d56346c --- /dev/null +++ b/examples/multimodal/layoutxlm/xfun.py @@ -0,0 +1,410 @@ +import json +import os +import cv2 +import numpy as np +import paddle +import copy +from paddle.io import Dataset + +__all__ = ["XFUN"] + + +class XFUN(Dataset): + """ + Example: + print("=====begin to build dataset=====") + from paddlenlp.transformers import LayoutXLMTokenizer + tokenizer = LayoutXLMTokenizer.from_pretrained("/paddle/models/transformers/layoutxlm-base-paddle/") + tok_res = tokenizer.tokenize("Maribyrnong") + # res = tokenizer.convert_ids_to_tokens(val_data["input_ids"][0]) + dataset = XfunDatasetForSer( + tokenizer, + data_dir="./zh.val/", + label_path="zh.val/xfun_normalize_val.json", + img_size=(224,224)) + print(len(dataset)) + + data = dataset[0] + print(data.keys()) + print("input_ids: ", data["input_ids"]) + print("labels: ", data["labels"]) + print("token_type_ids: ", data["token_type_ids"]) + print("words_list: ", data["words_list"]) + print("image shape: ", data["image"].shape) + """ + + def __init__( + self, + tokenizer, + data_dir, + label_path, + contains_re=False, + label2id_map=None, + img_size=(224, 224), + pad_token_label_id=None, + add_special_ids=False, + return_attention_mask=True, + load_mode="all", + max_seq_len=512, + ): + super(XFUN, self).__init__() + self.tokenizer = tokenizer + self.data_dir = data_dir + self.label_path = label_path + self.contains_re = contains_re + self.label2id_map = label2id_map + self.img_size = img_size + self.pad_token_label_id = pad_token_label_id + self.add_special_ids = add_special_ids + self.return_attention_mask = return_attention_mask + self.load_mode = load_mode + self.max_seq_len = max_seq_len + + if self.pad_token_label_id is None: + self.pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index + + self.all_lines = self.read_all_lines() + + self.entities_labels = {"HEADER": 0, "QUESTION": 1, "ANSWER": 2} + self.return_keys = { + "bbox": "np", + "input_ids": "np", + "labels": "np", + "attention_mask": "np", + "image": "np", + "token_type_ids": "np", + "entities": "dict", + "relations": "dict", + } + + if load_mode == "all": + self.encoded_inputs_all = self._parse_label_file_all() + + def pad_sentences( + self, + encoded_inputs, + max_seq_len=512, + pad_to_max_seq_len=True, + return_attention_mask=True, + return_token_type_ids=True, + truncation_strategy="longest_first", + return_overflowing_tokens=False, + return_special_tokens_mask=False, + ): + # Padding + needs_to_be_padded = pad_to_max_seq_len and max_seq_len and len(encoded_inputs["input_ids"]) < max_seq_len + + if needs_to_be_padded: + difference = max_seq_len - len(encoded_inputs["input_ids"]) + if self.tokenizer.padding_side == "right": + if return_attention_mask: + encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) + [0] * difference + if return_token_type_ids: + encoded_inputs["token_type_ids"] = ( + encoded_inputs["token_type_ids"] + [self.tokenizer.pad_token_type_id] * difference + ) + if return_special_tokens_mask: + encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference + encoded_inputs["input_ids"] = encoded_inputs["input_ids"] + [self.tokenizer.pad_token_id] * difference + encoded_inputs["labels"] = encoded_inputs["labels"] + [self.pad_token_label_id] * difference + encoded_inputs["bbox"] = encoded_inputs["bbox"] + [[0, 0, 0, 0]] * difference + elif self.tokenizer.padding_side == "left": + if return_attention_mask: + 
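+ # Left padding: prepend zeros for the padded prefix, ones for the real tokens.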
encoded_inputs["attention_mask"] = [0] * difference + [1] * len(encoded_inputs["input_ids"]) + if return_token_type_ids: + encoded_inputs["token_type_ids"] = [ + self.tokenizer.pad_token_type_id + ] * difference + encoded_inputs["token_type_ids"] + if return_special_tokens_mask: + encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"] + encoded_inputs["input_ids"] = [self.tokenizer.pad_token_id] * difference + encoded_inputs["input_ids"] + encoded_inputs["labels"] = [self.pad_token_label_id] * difference + encoded_inputs["labels"] + encoded_inputs["bbox"] = [[0, 0, 0, 0]] * difference + encoded_inputs["bbox"] + else: + if return_attention_mask: + encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) + + return encoded_inputs + + def truncate_inputs(self, encoded_inputs, max_seq_len=512): + for key in encoded_inputs: + if key == "sample_id": + continue + length = min(len(encoded_inputs[key]), max_seq_len) + encoded_inputs[key] = encoded_inputs[key][:length] + return encoded_inputs + + def read_all_lines( + self, + ): + with open(self.label_path, "r") as fin: + lines = fin.readlines() + return lines + + def _parse_label_file_all(self): + """ + parse all samples + """ + encoded_inputs_all = [] + for line in self.all_lines: + encoded_inputs_all.extend(self._parse_label_file(line)) + return encoded_inputs_all + + def _parse_label_file(self, line): + """ + parse single sample + """ + + image_name, info_str = line.split("\t") + image_path = os.path.join(self.data_dir, image_name) + + def add_imgge_path(x): + x["image_path"] = image_path + return x + + encoded_inputs = self._read_encoded_inputs_sample(info_str) + if self.contains_re: + encoded_inputs = self._chunk_re(encoded_inputs) + else: + encoded_inputs = self._chunk_ser(encoded_inputs) + encoded_inputs = list(map(add_imgge_path, encoded_inputs)) + return encoded_inputs + + def _read_encoded_inputs_sample(self, info_str): + """ + parse label info + """ + # read text info + info_dict = json.loads(info_str) + height = info_dict["height"] + width = info_dict["width"] + + words_list = [] + bbox_list = [] + input_ids_list = [] + token_type_ids_list = [] + gt_label_list = [] + + if self.contains_re: + # for re + entities = [] + relations = [] + id2label = {} + entity_id_to_index_map = {} + empty_entity = set() + for info in info_dict["ocr_info"]: + if self.contains_re: + # for re + if len(info["text"]) == 0: + empty_entity.add(info["id"]) + continue + id2label[info["id"]] = info["label"] + relations.extend([tuple(sorted(l)) for l in info["linking"]]) + + # x1, y1, x2, y2 + bbox = info["bbox"] + label = info["label"] + bbox[0] = int(bbox[0] * 1000.0 / width) + bbox[2] = int(bbox[2] * 1000.0 / width) + bbox[1] = int(bbox[1] * 1000.0 / height) + bbox[3] = int(bbox[3] * 1000.0 / height) + + text = info["text"] + encode_res = self.tokenizer.encode( + text, pad_to_max_seq_len=False, return_token_type_ids=True, return_attention_mask=True + ) + + gt_label = [] + if not self.add_special_ids: + # TODO: use tok.all_special_ids to remove + encode_res["input_ids"] = encode_res["input_ids"][1:-1] + encode_res["token_type_ids"] = encode_res["token_type_ids"][1:-1] + encode_res["attention_mask"] = encode_res["attention_mask"][1:-1] + if label.lower() == "other": + gt_label.extend([0] * len(encode_res["input_ids"])) + else: + gt_label.append(self.label2id_map[("b-" + label).upper()]) + gt_label.extend([self.label2id_map[("i-" + label).upper()]] * (len(encode_res["input_ids"]) - 1)) + if self.contains_re: 
+ if gt_label[0] != self.label2id_map["O"]: + entity_id_to_index_map[info["id"]] = len(entities) + entities.append( + { + "start": len(input_ids_list), + "end": len(input_ids_list) + len(encode_res["input_ids"]), + "label": label.upper(), + } + ) + input_ids_list.extend(encode_res["input_ids"]) + token_type_ids_list.extend(encode_res["token_type_ids"]) + bbox_list.extend([bbox] * len(encode_res["input_ids"])) + gt_label_list.extend(gt_label) + words_list.append(text) + + encoded_inputs = { + "input_ids": input_ids_list, + "labels": gt_label_list, + "token_type_ids": token_type_ids_list, + "bbox": bbox_list, + "attention_mask": [1] * len(input_ids_list), + # "words_list": words_list, + } + encoded_inputs = self.pad_sentences( + encoded_inputs, max_seq_len=self.max_seq_len, return_attention_mask=self.return_attention_mask + ) + encoded_inputs = self.truncate_inputs(encoded_inputs) + + if self.contains_re: + relations = self._relations(entities, relations, id2label, empty_entity, entity_id_to_index_map) + encoded_inputs["relations"] = relations + encoded_inputs["entities"] = entities + return encoded_inputs + + def _chunk_ser(self, encoded_inputs): + encoded_inputs_all = [] + seq_len = len(encoded_inputs["input_ids"]) + chunk_size = 512 + for chunk_id, index in enumerate(range(0, seq_len, chunk_size)): + chunk_beg = index + chunk_end = min(index + chunk_size, seq_len) + encoded_inputs_example = {} + for key in encoded_inputs: + encoded_inputs_example[key] = encoded_inputs[key][chunk_beg:chunk_end] + + encoded_inputs_all.append(encoded_inputs_example) + return encoded_inputs_all + + def _chunk_re(self, encoded_inputs): + # prepare data + entities = encoded_inputs.pop("entities") + relations = encoded_inputs.pop("relations") + encoded_inputs_all = [] + chunk_size = 512 + for chunk_id, index in enumerate(range(0, len(encoded_inputs["input_ids"]), chunk_size)): + item = {} + for k in encoded_inputs: + item[k] = encoded_inputs[k][index : index + chunk_size] + + # select entity in current chunk + entities_in_this_span = [] + global_to_local_map = {} # + for entity_id, entity in enumerate(entities): + if index <= entity["start"] < index + chunk_size and index <= entity["end"] < index + chunk_size: + entity["start"] = entity["start"] - index + entity["end"] = entity["end"] - index + global_to_local_map[entity_id] = len(entities_in_this_span) + entities_in_this_span.append(entity) + + # select relations in current chunk + relations_in_this_span = [] + for relation in relations: + if ( + index <= relation["start_index"] < index + chunk_size + and index <= relation["end_index"] < index + chunk_size + ): + relations_in_this_span.append( + { + "head": global_to_local_map[relation["head"]], + "tail": global_to_local_map[relation["tail"]], + "start_index": relation["start_index"] - index, + "end_index": relation["end_index"] - index, + } + ) + item.update( + { + "entities": reformat(entities_in_this_span), + "relations": reformat(relations_in_this_span), + } + ) + item["entities"]["label"] = [self.entities_labels[x] for x in item["entities"]["label"]] + encoded_inputs_all.append(item) + return encoded_inputs_all + + def _relations(self, entities, relations, id2label, empty_entity, entity_id_to_index_map): + """ + build relations + """ + relations = list(set(relations)) + relations = [rel for rel in relations if rel[0] not in empty_entity and rel[1] not in empty_entity] + kv_relations = [] + for rel in relations: + pair = [id2label[rel[0]], id2label[rel[1]]] + if pair == ["question", "answer"]: + 
kv_relations.append({"head": entity_id_to_index_map[rel[0]], "tail": entity_id_to_index_map[rel[1]]}) + elif pair == ["answer", "question"]: + kv_relations.append({"head": entity_id_to_index_map[rel[1]], "tail": entity_id_to_index_map[rel[0]]}) + else: + continue + relations = sorted( + [ + { + "head": rel["head"], + "tail": rel["tail"], + "start_index": get_relation_span(rel, entities)[0], + "end_index": get_relation_span(rel, entities)[1], + } + for rel in kv_relations + ], + key=lambda x: x["head"], + ) + return relations + + def load_img(self, image_path): + # read img + img = cv2.imread(image_path) + img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) + resize_h, resize_w = self.img_size + im_shape = img.shape[0:2] + im_scale_y = resize_h / im_shape[0] + im_scale_x = resize_w / im_shape[1] + img_new = cv2.resize(img, None, None, fx=im_scale_x, fy=im_scale_y, interpolation=2) + mean = np.array([0.485, 0.456, 0.406])[np.newaxis, np.newaxis, :] + std = np.array([0.229, 0.224, 0.225])[np.newaxis, np.newaxis, :] + img_new = img_new / 255.0 + img_new -= mean + img_new /= std + img = img_new.transpose((2, 0, 1)) + return img + + def __getitem__(self, idx): + if self.load_mode == "all": + data = copy.deepcopy(self.encoded_inputs_all[idx]) + else: + data = self._parse_label_file(self.all_lines[idx])[0] + + image_path = data.pop("image_path") + data["image"] = self.load_img(image_path) + + return_data = {} + for k, v in data.items(): + if k in self.return_keys: + if self.return_keys[k] == "np": + v = np.array(v) + return_data[k] = v + return return_data + + def __len__( + self, + ): + if self.load_mode == "all": + return len(self.encoded_inputs_all) + else: + return len(self.all_lines) + + +def get_relation_span(rel, entities): + bound = [] + for entity_index in [rel["head"], rel["tail"]]: + bound.append(entities[entity_index]["start"]) + bound.append(entities[entity_index]["end"]) + return min(bound), max(bound) + + +def reformat(data): + new_data = {} + for item in data: + for k, v in item.items(): + if k not in new_data: + new_data[k] = [] + new_data[k].append(v) + return new_data diff --git a/examples/multimodal/minigpt4/README.md b/examples/multimodal/minigpt4/README.md new file mode 100644 index 0000000000000000000000000000000000000000..48c9f73840762b0575b2a52bd537278abac45c43 --- /dev/null +++ b/examples/multimodal/minigpt4/README.md @@ -0,0 +1,47 @@ +# MiniGPT4 + +## 1. 模型简介 + +MiniGPT4 是一个具有图像理解能力的开源模型,其基于 Vicuna 大语言模型 以及 BLIP-2 中的VIT和Qformer模块进行训练,使得MiniGPT4 拥有类似于GPT4的非凡能力,例如详细的图像描述生成和从手写草稿创建网站。 此外 MiniGPT4 还具备一些的其他新的功能,包括根据给定图像写故事和诗歌,为图像中显示的问题提供解决方案,教用户如何根据食物照片做饭等。下图展示了MiniGPT4的模型结构, 更多信息请参考[MiniGPT4](https://arxiv.org/abs/2304.10592)。 + +<center><img src="https://github.com/PaddlePaddle/Paddle/assets/35913314/f0306cb6-4837-4f52-8f57-a0e7e35238f6" /></center> + + +## 2. 获取MiniGPT4 权重以及相关配置 +这里可以分两步:1. 获取MiniGPT4权重;2. 获取相关配置,包括模型参数说明以及tokenizer相关文件等。 +### 2.1 获取MiniGPT4权重 +目前需要用户手动下载MiniGPT4权重和并转换为相应的 Paddle 版权重,为方便转换,本项目提供了相应的操作说明和转换脚本,详情请参考[MiniGPT4 权重下载和转换说明](./paddle_minigpt4_instrction.md)。 + +### 2.2 获取相关配置 +下载相关的配置文件,这里提供了两版配置文件,请根据你的需要,点击下载即可。 +| files Aligned with MiniGPT4-7B | files Aligned with MiniGPT4-13B | +:-------------------------------------:|:-----------------------------------: + [Download](https://paddlenlp.bj.bcebos.com/models/community/minigpt4-7b/minigpt4_7b.tar.gz)|[Download](https://paddlenlp.bj.bcebos.com/models/community/minigpt4-13b/minigpt4_13b.tar.gz) | + + +下载之后进行解压,请将其中相关文件放至 与 MiniGPT4 权重相同的目录中。 + + +## 3. 
模型预测 +在下载和转换好上述模型权重之后,可执行以下命令进行模型预测。其中参数 `pretrained_name_or_path` 用于指定 MiniGPT4 的保存目录。 + +``` +python run_predict.py \ + -- pretrained_name_or_path "your minigpt4 path" + +``` + +下图这个示例展示了在使用MiniGPT-7b时的效果: + +输入图片:<center><img src="https://github.com/PaddlePaddle/Paddle/assets/35913314/d8070644-4713-465d-9c7e-9585024c1819" /></center> + +输入文本:“describe this image” + +输出: +``` +The image shows two mugs with cats on them, one is black and white and the other is blue and white. The mugs are sitting on a table with a book in the background. The mugs have a whimsical, cartoon-like appearance. The cats on the mugs are looking at each other with a playful expression. The overall mood of the image is lighthearted and fun.### +``` + + +## Reference +- [MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models](https://minigpt-4.github.io/) diff --git a/examples/multimodal/minigpt4/merge_weight.py b/examples/multimodal/minigpt4/merge_weight.py new file mode 100644 index 0000000000000000000000000000000000000000..8f74d7c6a960520be922ee00ca69d88b3fcc3fe0 --- /dev/null +++ b/examples/multimodal/minigpt4/merge_weight.py @@ -0,0 +1,88 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +os.environ["CUDA_VISIBLE_DEVICES"] = "0" +os.environ["FLAGS_use_cuda_managed_memory"] = "true" + +import paddle +import torch + +from paddlenlp.transformers import LlamaForCausalLM + + +def merge(args): + model_dict = {} + # load the first item: blip2-flan-t5-xxl + state_dict = paddle.load(args.blip2_path) + for n, p in state_dict.items(): + if n.startswith("vision_model") or n.startswith("qformer") or n == "query_tokens": + model_dict[n] = p + print("[1/3] load ViT, qformer and query_tokens from blip2-flan-t5-xxl done!") + + # load the second item: vicuna + llama_model = LlamaForCausalLM.from_pretrained(args.vicuna_path) + + for n, p in llama_model.named_parameters(): + new_name = "language_model." 
+ n + model_dict[new_name] = p + print("[2/3] load vicuna(llama typel) done!") + + # load the third item: minigpt4 + minigpt4_state_dict = torch.load(args.minigpt4_path) + for n, p in minigpt4_state_dict["model"].items(): + if n.startswith("llama_model.model"): + new_name = n.replace("llama_model.model", "language_model.llama") + new_p = paddle.to_tensor(p.cpu().numpy()) + model_dict[new_name] = new_p + + if n.startswith("llama_proj"): + new_name = n.replace("llama_proj", "language_projection") + if n.endswith("weight"): + new_p = paddle.to_tensor(p.cpu().numpy()).transpose([1, 0]) + else: + new_p = paddle.to_tensor(p.cpu().numpy()) + model_dict[new_name] = new_p + + print("[3/3] load language_projection, some llama weights from minigpt4 done!") + + save_path = os.path.join(args.save_path, "model_state.pdparams") + paddle.save(model_dict, save_path) + print("The checkpoint of minigpt4 has been saved to :{}".format(save_path)) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + + parser.add_argument("--blip2_path", default="/blip2/dirname", type=str, help="The dir name of blip2-flan-t5-xxl.") + parser.add_argument("--vicuna_path", default="/vicuna/dirname", type=str, help="The dir name of vicuna.") + parser.add_argument( + "--minigpt4_path", default="/minigpt4/prerained_minigpt4.pth", type=str, help="The checkpoint path of vicuna." + ) + parser.add_argument("--save_path", default="/save/to/dirname", type=str, help="The saving path of minigpt4.") + args = parser.parse_args() + + args.blip2_path = os.path.join(args.blip2_path, "model_state.pdparams") + if not os.path.exists(args.blip2_path): + raise ValueError("Not found the file: {}".format(args.blip2_path)) + if not os.path.isdir(args.vicuna_path): + raise ValueError("It is not a directory: {}".format(args.vicuna_path)) + if not os.path.exists(args.minigpt4_path): + raise ValueError("Not found the file: {}".format(args.minigpt4_path)) + if not os.path.exists(args.save_path): + os.makedirs(args.save_path) + + merge(args) diff --git a/examples/multimodal/minigpt4/paddle_minigpt4_instrction.md b/examples/multimodal/minigpt4/paddle_minigpt4_instrction.md new file mode 100644 index 0000000000000000000000000000000000000000..7b84aea48bd7c6e1b6c5c55d10d77ef0e1509500 --- /dev/null +++ b/examples/multimodal/minigpt4/paddle_minigpt4_instrction.md @@ -0,0 +1,117 @@ +# 获取和转换 Paddle 版 MiniGPT4 权重 + +## 1. 
准备 MiniGPT4 中所有模块的权重 + +你需要下载3个权重,以获取最终 MiniGPT4的权重,分别是: +- Pretrained MiniGPT-4 +- Vicuna Weight +- Blip2 Weight + +### 1.1 下载 MiniGPT4 的预训练权重 + +根据你准备的Vicuna模型版本,下载预训练的MiniGPT4 权重。 + +| Checkpoint Aligned with Vicuna 7B | Checkpoint Aligned with Vicuna 13B | +:-------------------------------------:|:-----------------------------------: +[Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing) | [Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link) + +### 1.2准备 ViT and Qformer 权重 +MiniGPT4中使用的ViT和Qformer Weight来自blip2-flan-t5-xxl,这个weight在PaddleNLP中进行了转换。 所以你可以从 PaddleNLP 下载它,你有两种下载方式进行下载: + +#### 1.2.1 通过 paddlenlp 方式加载 +直接通过paddlenlp的模型加载方法进行下载,下载后一般会存入 `PPNLP_HOME` 指定的目录。 + +```python +import os +os.environ["CUDA_VISIBLE_DEVICES"]="0" + +import paddle +from paddlenlp.transformers import Blip2Model, Blip2VisionModel, Blip2VisionConfig, Blip2QFormerConfig, Blip2QFormerModel + +Blip2Model.from_pretrained("Salesforce/blip2-flan-t5-xxl") +``` + +#### 1.2.2 直接点击下载 +可以直接进行点击下载: + +| blip2-flan-t5-xxl 权重 | 点击下载 | +:-------------------------------------:|:-----------------------------------: +| model_state.pdparams | [Download](https://paddlenlp.bj.bcebos.com/models/community/Salesforce/blip2-flan-t5-xxl/model_state.pdparams) | + +### 1.3 准备 Vicuna 权重 + +这里需要下载两个权重:Vicuna delta Weight和huggingface-formated Llama Weight。 然后你应该结合这两个重量来获得可以使用的Vicuna 权重。 + +#### 1.3.1 下载 Vicuna delta 权重 + +这里展示两种Vicuna delta 权重,请根据需要选择一种并点击下载。 + +| vicuna-7b-delta-v0 | vicuna-13b-delta-v0 | +:-------------------------------------:|:-----------------------------------: + [Download](https://huggingface.co/lmsys/vicuna-7b-delta-v0/tree/main) | [Download](https://huggingface.co/lmsys/vicuna-13b-delta-v0g) + +#### 1.3.2 根据以上选择的vicuna delta 权重,下载 相应的 llama 权重。 + +| llama-7b | llama-13b | +:-------------------------------------:|:-----------------------------------: + [Download](https://huggingface.co/decapoda-research/llama-7b-hf/tree/main) | [Download](https://huggingface.co/decapoda-research/llama-13b-hf) + + +#### 1.3.3 结合上面的两个权重,得到可以使用的 vicuna 权重 +- 为组合如上两个权重,请安装以下工具: + +```shell +pip install git+https://github.com/lm-sys/FastChat.git@v0.1.10 +``` +- 运行以下命令,获取最终可用的vicuna 权重 + +```shell +python -m fastchat.model.apply_delta --base /path/to/llama-13bOR7b-hf/ --target /path/to/save/working/vicuna-13b/weight/ --delta /path/to/vicuna-13bOR7b-delta-v0/ +``` + +## 2. 将多个 pytorch 子权重文件合并为一个权重文件 + +Pytorch版的权重文件可能是由多个子权重文件组合而成,为使用PaddleNLP进行加载并自动转换为Paddle版,需要将其合并为一个文件: + +### 2.1 下载MiniGPT库 +在开始之前,请确保已经下载了 [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4.git) 库: + +``` +git clone https://github.com/Vision-CAIR/MiniGPT-4.git +``` + +### 2.2 获取完整的 vicuna 权重 +进入到MiniGPT4文件夹,执行以下代码,获取完整的 vicuna 权重文件: +```python +import argparse +import os +os.environ["CUDA_VISIBLE_DEVICES"]="0" +os.environ["FLAGS_use_cuda_managed_memory"]="true" + +import torch +from minigpt4.models.modeling_llama import LlamaForCausalLM + +llama_model = LlamaForCausalLM.from_pretrained("/path/to/save/working/vicuna-13b/") +torch.save(llama_model.state_dict(), "/path/to/save/working/vicuna-13b/pytorch_model.bin") +``` + +## 3. 
合并以上所有权重,获取最终的 Paddle 版 MiniGPT4 权重 +这里提供了一个合并以上权重的脚本,你可以通过设置相关权重路径 以获取最终的 MiniGPT4 权重。 + +```shell +python merge_weight.py \ + --blip2_path "your dir name of blip2" \ + --vicuna_path "your dir name of vicuna" \ + --minigpt4_path "your ckpt path of minigpt4" \ + --save_path "your dir name saving the final minigpt4" +``` + +**参数说明**: +- `blip2_path`: 存放 blip2 权重的目录名 +- `vicuna_path`: 存放 vicuna_path 权重的目录名 +- `minigpt4_path`: 存放 blip2 权重的文件地址,比如./prerained_minigpt4_7b.pth +- `save_path`: 保存 Paddle 版 MiniGPT3 权重的目录名 + +## 3. More Reference + +- [MiniGPT Official Site](https://github.com/Vision-CAIR/MiniGPT-4) diff --git a/examples/multimodal/minigpt4/run_predict.py b/examples/multimodal/minigpt4/run_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..4b36089f3c91a8fdf1340b966b86150dab110c9a --- /dev/null +++ b/examples/multimodal/minigpt4/run_predict.py @@ -0,0 +1,68 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +os.environ["CUDA_VISIBLE_DEVICES"] = "0" +os.environ["FLAGS_use_cuda_managed_memory"] = "true" +import requests +from PIL import Image + +from paddlenlp.transformers import MiniGPT4ForConditionalGeneration, MiniGPT4Processor + + +def predict(args): + # load MiniGPT4 moel and processor + model = MiniGPT4ForConditionalGeneration.from_pretrained(args.pretrained_name_or_path) + model.eval() + processor = MiniGPT4Processor.from_pretrained(args.pretrained_name_or_path) + print("load processor and model done!") + + # prepare model inputs for MiniGPT4 + url = "https://paddlenlp.bj.bcebos.com/data/images/mugs.png" + image = Image.open(requests.get(url, stream=True).raw) + + text = "describe this image" + prompt = "Give the following image: <Img>ImageContent</Img>. You will be able to see the image once I provide it to you. 
Please answer my questions.###Human: <Img><ImageHere></Img> <TextHere>###Assistant:" + inputs = processor([image], text, prompt) + + # generate with MiniGPT4 + # breakpoint + generate_kwargs = { + "max_length": 300, + "num_beams": 1, + "top_p": 1.0, + "repetition_penalty": 1.0, + "length_penalty": 0, + "temperature": 1, + "decode_strategy": "greedy_search", + "eos_token_id": [[835], [2277, 29937]], + } + outputs = model.generate(**inputs, **generate_kwargs) + msg = processor.batch_decode(outputs[0]) + print("Inference result: ", msg) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pretrained_name_or_path", + default="your directory of minigpt4", + type=str, + help="The dir name of minigpt4 checkpoint.", + ) + args = parser.parse_args() + + predict(args) diff --git a/examples/question_generation/README.md b/examples/question_generation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..796b415797b2005bc94c152fba609f49dad28937 --- /dev/null +++ b/examples/question_generation/README.md @@ -0,0 +1,20 @@ +# 问题生成 + +Question Generation(QG),即问题生成,指的是给定一段上下文和答案,自动生成一个流畅且符合上下文主题的问句。问题生成技术在教育、咨询、搜索、问答等多个领域均有着巨大的应用价值。 + +PaddleNLP提供英文和中文问题生成任务示例,分别基于英文预训练语言模型[t5](./t5)和中文预训练语言模型[unimo-text](./unimo-text)。 + + +## 英文 + +[t5](./t5) 展示了如何使用英文预训练模型T5完成问题生成任务,支持模型微调预测评估,并提供相关预训练模型。 + +## 中文 + +[unimo-text](./unimo-text) 展示了如何使用中文预训练模型UNIMO-Text完成问题生成任务,提供数据准备、训练、预测、推理部署全流程定制化训练,并提供相关预训练模型。 + +# 参考文献 + +1. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140), pp.1-67. + +2. Li, Wei, et al. "Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning." arXiv preprint arXiv:2012.15409 (2020). diff --git a/examples/question_generation/t5/README.md b/examples/question_generation/t5/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0b0578d3cb58bd26b528912bca5d6f2fc8309801 --- /dev/null +++ b/examples/question_generation/t5/README.md @@ -0,0 +1,208 @@ +# 问题生成(Question Generation) + +## 简介 + +Question Generation(QG),即问题生成,指的是给定一段上下文(passage或sentence),自动生成一个流畅且符合上下文主题的问句。问题生成通常可以分为两个分支,即无答案问题生成(answer-agnostic question generation)和有答案问题生成(answer-aware question generation)。 + +本项目是T5在 PaddlePaddle上开源实现的有答案问题生成的例子,包含了在SQuAD数据集上微调和生成的代码。 + +## 快速开始 + +### 环境依赖 + +- nltk +- evaluate + + +安装方式:`pip install -r requirements.txt` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +. +├── finetune.py # 模型微调主程序入口 +├── generate.py # 模型生成主程序入口 +├── utils.py # 定义参数及一些工具函数 +├── requirements.txt # 环境依赖文件 +└── README.md # 文档说明 +``` + +### 数据准备 + +#### 数据加载 +**SQuAD**(Stanford Question Answering Dataset)数据集是一个英文问答数据集,现有的问题生成研究主要在该数据集上进行评价。**SQuAD**中的数据由段落、问题、答案3个主要部分组成,其中段落从维基百科中获取,问题和答案通过众包的方式由人工标注。 + +为了方便用户快速测试,PaddleNLP Dataset API内置了Squad数据集,一键即可完成数据集加载,示例代码如下: + +```python +from paddlenlp.datasets import load_dataset +train_set, dev_set, test_set = load_dataset("squad", splits=["train_v1", "dev_v1"]) +``` + +#### 数据处理 +针对**SQuAD**数据集,我们需要将QA任务格式的数据进行转换从而得到text2text形式的数据,默认构造方式如下,其他形式输入数据用户可以在convert_example函数中自行定义 +```text +answer: {answer_text} context: {context_text} +question: {question_text} +``` +具体案例如下, +```text +answer: the Miller–Rabin primality test context: The property of being prime (or not) is called primality. A simple but slow method of verifying the primality of a given number n is known as trial division. 
It consists of testing whether n is a multiple of any integer between 2 and . Algorithms much more efficient than trial division have been devised to test the primality of large numbers. These include the Miller–Rabin primality test, which is fast but has a small probability of error, and the AKS primality test, which always produces the correct answer in polynomial time but is too slow to be practical. Particularly fast methods are available for numbers of special forms, such as Mersenne numbers. As of January 2016[update], the largest known prime number has 22,338,618 decimal digits. + +question: What is the name of the process which confirms the primality of a number n? +``` + +### 模型训练 + +运行如下命令即可在训练集上进行finetune,并在验证集上进行验证 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus 1,2 train.py \ + --model_name_or_path=t5-base \ + --dataset_name=squad \ + --output_dir=output \ + --max_source_length=1024 \ + --max_target_length=142 \ + --learning_rate=1e-4 \ + --num_train_epochs=6 \ + --logging_steps=100 \ + --save_steps=1000 \ + --seed=42 \ + --train_batch_size=4 \ + --eval_batch_size=64 \ + --warmup_proportion=0.1 \ + --ignore_pad_token_for_loss \ + --device=gpu +``` + +其中参数释义如下: +- `gpus` 指示了训练所用的GPU + +- `model_name_or_path` 指示了finetune使用的预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | t5-base | + | t5-large | + +- `dataset_name` 表示训练的数据集。 + +- `output_dir` 表示模型的保存路径。 + +- `max_source_length` 表示输入序列的长度,超过该长度将被截断。 + +- `max_target_length` 表示输出的最大长度。 + +- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。 + +- `num_train_epochs` 表示训练轮数。 + +- `epochs` 表示训练轮数。 + +- `logging_steps` 表示日志打印间隔。 + +- `save_steps` 表示模型保存及评估间隔。 + +- `seed` 表示随机数生成器的种子。 + +- `train_batch_size` 表示训练每张卡上的样本数目。 + +- `eval_batch_size` 表示预测单卡上的样本数目。 + +- `warmup_proportion` 表示warmup_steps所占总步数的比例。学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数。 + +- `device` 表示使用的设备。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`output_dir`中。如: + +```text +./output/ +├── t5_model_1000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── special_tokens_map.json +│ ├── spiece.model +│ └── tokenizer_config.json +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,只需指定`model_name_or_path`为本地微调模型的路径即可。 + +### 模型预测 + +运行如下命令即可在验证集上进行测试 + +```shell +# GPU启动,预测仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python predict.py \ + --model_name_or_path=./checkpoints/model_xx/ \ + --dataset_name=squad \ + --output_path=generate.txt \ + --max_source_length=1024 \ + --max_target_length=142 \ + --decode_strategy=greedy_search \ + --top_k=2 \ + --top_p=1.0 \ + --num_beams=1 \ + --length_penalty=0.0 \ + --batch_size=64 \ + --seed=42 \ + --ignore_pad_token_for_loss \ + --logging_steps=20 \ + --device=gpu +``` + +其中参数释义如下: +- `model_name_or_path` 指示了预测使用的模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | t5-base | + | t5-large | + | mrm8488/t5-base-finetuned-question-generation-ap | + +- `dataset_name` 表示预测的数据集。 + +- `output_path` 表示预测结果的保存路径。 + +- `max_source_length` 表示输入序列的长度,超过该长度将被截断。 + +- `max_target_length` 表示输出的最大长度。 + +- `decode_strategy` 表示预测解码时采取的策略,可选"sampling"、"greedy_search"和"beam_search"之一。 + +- `top_k` 表示采用"sampling"解码策略时,token的概率按从大到小排序,生成的token只从前`top_k`个中进行采样。 + +- `top_p` 表示采用"sampling"解码策略时,从词表中采样并选择概率之和大于给定阈值`top_p`的token。 + +- `num_beams` 表示besm search的beam size。 + +- `length_penalty` 表示besm search生成长度的指数惩罚。 + +- `batch_size` 表示每次迭代**单卡**上的样本数目。 + +- `seed` 表示随机数生成器的种子。 + +- `logging_steps` 表示日志打印间隔。 + +- `device` 表示使用的设备。 + +程序运行结束后会将预测生成的问题保存在`output_path`中。同时终端中会输出评估结果。 + +采用社区微调模型mrm8488/t5-base-finetuned-question-generation-ap在验证集上有如下结果: + +| model_name_or_path | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | +| :----------------------: | :-------------: | :-------------: |:-------------: |:-------------: | +| [mrm8488/t5-base-finetuned-question-generation-ap](https://huggingface.co/mrm8488/t5-base-finetuned-question-generation-ap ) | 50.11 | 35.83 | 27.68 | 22.03 | + + + + +## 参考文献 +1. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J., 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140), pp.1-67. diff --git a/examples/question_generation/t5/predict.py b/examples/question_generation/t5/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..7931059b887c38865489602f060bb5ba15894c94 --- /dev/null +++ b/examples/question_generation/t5/predict.py @@ -0,0 +1,160 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
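+# predict.py: generate questions on the SQuAD dev set with a fine-tuned T5 model.
+# Predictions are written to --output_path, references to <output_path>.reference.txt,
+# and BLEU-1..4 scores are printed to stdout.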
+import argparse +import random +import time +from functools import partial +from pprint import pprint + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from utils import compute_metrics, convert_example + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument("--model_name_or_path", default="t5-base", type=str, required=True, help="Path to pre-trained model. ") + parser.add_argument("--dataset_name", default="squad", type=str, required=True, help="The name of the dataset to use. Selected in the list: " + "squad") + parser.add_argument('--output_path', type=str, default='generate.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--max_source_length", default=1024, type=int, help="The maximum total input sequence length after tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.",) + parser.add_argument("--min_target_length", default=0, type=int, help="The minimum total sequence length for target text when generating. ") + parser.add_argument("--max_target_length", default=142, type=int, help="The maximum total sequence length for target text after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded during ``evaluate`` and ``predict``.",) + parser.add_argument('--decode_strategy', default='greedy_search', type=str, help='The decode strategy in generation.') + parser.add_argument('--top_k', default=2, type=int, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--top_p', default=1.0, type=float, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', default=1, type=int, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', default=0.6, type=float, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--early_stopping', default=False, type=eval, help='Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.') + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity of beam search. ") + parser.add_argument('--faster', action='store_true', help='Whether to process inference using FastGeneration. ') + parser.add_argument('--use_fp16_decoding', action='store_true', help='Whether to use fp16 when using FastGeneration. Only works when using FastGeneration. 
') + parser.add_argument("--batch_size", default=64, type=int, help="Batch size per GPU/CPU for testing or evaluation.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--is_debug", action='store_true', help="Whether to debug.") + parser.add_argument("--ignore_pad_token_for_loss", action='store_true', help="Whether to ignore the tokens corresponding to padded labels in the loss computation or not.") + + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def generate(args): + paddle.set_device(args.device) + set_seed(args) + tokenizer = T5Tokenizer.from_pretrained(args.model_name_or_path) + model = T5ForConditionalGeneration.from_pretrained(args.model_name_or_path) + dataset = load_dataset(args.dataset_name, splits=["dev_v1"]) + # dataset = load_dataset(args.dataset_name, splits=["dev_v2"]) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + decoder_start_token_id=model.t5.bos_token_id, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ignore_pad_token_for_loss=args.ignore_pad_token_for_loss, + is_train=False) + + def batchify_fn(samples, tokenizer): + fn = Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64" + ), # attention_mask + Pad(axis=0, pad_val=-100, dtype="int64"), # mem_seq_lens + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64" + ), # decoder_input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # labels + ) + return fn(samples) + + dataset = dataset.map(trans_func, lazy=True) + + # debug + if args.is_debug: + dataset.data = dataset.data[:20] + dataset.new_data = dataset.new_data[:20] + + batch_sampler = BatchSampler(dataset, + batch_size=args.batch_size, + shuffle=False) + data_loader = DataLoader(dataset=dataset, + batch_sampler=batch_sampler, + num_workers=0, + collate_fn=batchify_fn, + return_list=True) + data_loader.pin_memory = False + + model.eval() + total_time = 0.0 + start_time = time.time() + all_preds = [] + all_labels = [] + for step, batch in enumerate(data_loader): + input_ids, _, mem_seq_lens, _, labels = batch + preds, _ = model.generate(input_ids=input_ids, + max_length=args.max_target_length, + min_length=args.min_target_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + diversity_rate=args.diversity_rate, + use_fast=args.faster) + total_time += (time.time() - start_time) + if step % args.logging_steps == 0: + print('step %d - %.3fs/step' % + (step, total_time / args.logging_steps)) + total_time = 0.0 + all_preds.extend(preds.numpy()) + all_labels.extend(labels.numpy()) + start_time = time.time() + + bleu_result, decoded_preds, decoded_labels = compute_metrics( + 
all_preds, all_labels, tokenizer, args.ignore_pad_token_for_loss) + print("BLEU result: ", bleu_result) + with open(args.output_path, 'w', encoding='utf-8') as fout: + for decoded_pred in decoded_preds: + fout.write(' '.join(decoded_pred) + '\n') + print('Save generated result into: %s' % args.output_path) + with open(args.output_path + '.reference.txt', 'w', + encoding='utf-8') as fout: + for decoded_label in decoded_labels: + fout.write(' '.join(decoded_label) + '\n') + print('Save referenced labels into: {}.reference.txt'.format(args.output_path)) + + +if __name__ == '__main__': + args = parse_args() + pprint(args) + generate(args) diff --git a/examples/question_generation/t5/requirements.txt b/examples/question_generation/t5/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..40abc64257a3f4627085af9e6c851de3553f4aef --- /dev/null +++ b/examples/question_generation/t5/requirements.txt @@ -0,0 +1,2 @@ +nltk==3.6.2 +evaluate==0.2.2 \ No newline at end of file diff --git a/examples/question_generation/t5/train.py b/examples/question_generation/t5/train.py new file mode 100644 index 0000000000000000000000000000000000000000..cdd124370fff788b624a039897d60278425f4551 --- /dev/null +++ b/examples/question_generation/t5/train.py @@ -0,0 +1,236 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import random +import time +from functools import partial +from pprint import pprint + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from tqdm import tqdm +from utils import compute_metrics, convert_example + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + T5ForConditionalGeneration, + T5Tokenizer, +) +from paddlenlp.utils.log import logger + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument("--model_name_or_path", default="t5-base", type=str, required=True, help="Path to pre-trained model. ") + parser.add_argument("--dataset_name", default="squad", type=str, required=True, help="The name of the dataset to use. Selected in the list: " + "squad") + parser.add_argument("--output_dir", default="output", type=str, required=True, help="The output directory where the model predictions and checkpoints will be written.",) + parser.add_argument("--max_source_length", default=1024, type=int, help="The maximum total input sequence length after tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.",) + parser.add_argument("--min_target_length", default=0, type=int, help="The minimum total sequence length for target text when generating. 
") + parser.add_argument("--max_target_length", default=142, type=int, help="The maximum total sequence length for target text after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. during ``evaluate`` and ``predict``.",) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.") + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument("--train_batch_size", default=20, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--eval_batch_size", default=12, type=int, help="Batch size per GPU/CPU for evaluation.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion") + parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") + parser.add_argument("--use_amp", default=False, type=strtobool, help="Enable mixed precision training.") + parser.add_argument("--scale_loss", default=2**15, type=float, help="The value of scale_loss for fp16.") + parser.add_argument("--ignore_pad_token_for_loss", action='store_true', help="Whether to ignore the tokens corresponding to padded labels in the loss computation or not.") + + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, data_loader, tokenizer, ignore_pad_token_for_loss, + min_target_length, max_target_length): + model.eval() + all_preds = [] + all_labels = [] + model = model._layers if isinstance(model, paddle.DataParallel) else model + for batch in tqdm(data_loader, total=len(data_loader), desc="Eval step"): + input_ids, _, _, labels = batch + preds = model.generate(input_ids=input_ids, + min_length=min_target_length, + max_length=max_target_length, + use_cache=True)[0] + all_preds.extend(preds.numpy()) + all_labels.extend(labels.numpy()) + bleu_result, decoded_preds, decoded_labels = compute_metrics( + all_preds, all_labels, tokenizer, ignore_pad_token_for_loss) + logger.info(bleu_result) + model.train() + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + tokenizer = T5Tokenizer.from_pretrained(args.model_name_or_path) + model = T5ForConditionalGeneration.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + decoder_start_token_id=model.t5.bos_token_id, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ignore_pad_token_for_loss=args.ignore_pad_token_for_loss) + logger.info("Loading train and dev dataset: %s" % args.dataset_name) + train_set, dev_set = load_dataset(args.dataset_name, + splits=["train_v1", "dev_v1"]) + logger.info("Loaded train and dev dataset: %s" % args.dataset_name) + train_set = train_set.map(trans_func, lazy=True) + train_batch_sampler = DistributedBatchSampler( + train_set, batch_size=args.train_batch_size, shuffle=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64" + ), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64" + ), # decoder_input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # labels + ): fn(samples) + train_data_loader = DataLoader(dataset=train_set, + batch_sampler=train_batch_sampler, + num_workers=0, + collate_fn=batchify_fn, + return_list=True) + dev_set = dev_set.map(trans_func, lazy=True) + dev_batch_sampler = BatchSampler(dev_set, + batch_size=args.eval_batch_size, + shuffle=False) + dev_data_loader = DataLoader(dataset=dev_set, + batch_sampler=dev_batch_sampler, + num_workers=0, + collate_fn=batchify_fn, + return_list=True) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else ( + len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
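+    # `apply_decay_param_fun` below receives each parameter's name and applies weight
+    # decay only when the name is in this list (i.e. not a bias/LayerNorm parameter).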
+ decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + global_step = 0 + tic_train = time.time() + for epoch in tqdm(range(args.num_train_epochs), desc="Epoch"): + for step, batch in tqdm(enumerate(train_data_loader), + desc="Train step", + total=len(train_data_loader)): + global_step += 1 + input_ids, attention_mask, decoder_input_ids, labels = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu"]): + output = model(input_ids, + attention_mask, + decoder_input_ids, + labels=labels) + loss = output[0] + if args.use_amp: + scaled_loss = scaler.scale(loss) + scaled_loss.backward() + scaler.minimize(optimizer, scaled_loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % (global_step, num_training_steps, epoch, step, + paddle.distributed.get_rank(), loss, optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train))) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + evaluate(model, dev_data_loader, tokenizer, + args.ignore_pad_token_for_loss, args.min_target_length, + args.max_target_length) + logger.info("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "t5_model_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance( + model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, + "t5_model_final_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance( + model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + do_train(args) diff --git a/examples/question_generation/t5/utils.py b/examples/question_generation/t5/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..b5ee191b669319b392e2fdaac30c44a91945e9c2 --- /dev/null +++ b/examples/question_generation/t5/utils.py @@ -0,0 +1,166 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import evaluate +import nltk +import numpy as np + +from paddlenlp.metrics import BLEU + + +def convert_example( + example, + tokenizer, + decoder_start_token_id, + max_source_length, + max_target_length, + ignore_pad_token_for_loss=True, + is_train=True, +): + """ + Convert an example into necessary features. + """ + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + context = example["context"] + question = example["question"] + try: + answer = example["answers"][0] + except: + print(example["context"]) + print(example["question"]) + print(example["answers"]) + print(example["answer_starts"]) + print(example["is_impossible"]) + + input_seq = f"answer: {answer} context: {context} </s>" + output_seq = f"question: {question} </s>" + + outputs = tokenizer( + output_seq, + max_seq_len=max_target_length, + pad_to_max_seq_len=True, + truncation_strategy="longest_first", + ) + + output_ids = [decoder_start_token_id] + outputs["input_ids"][:-1] + + if ignore_pad_token_for_loss: + # Replace all tokenizer.pad_token_id in the outputs by -100 when we want to ignore padding in the loss. 
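+        # (-100 is the conventional ignore index; compute_metrics maps it back to
+        # pad_token_id before decoding.)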
+ outputs["input_ids"] = [(l if l != tokenizer.pad_token_id else -100) for l in outputs["input_ids"]] + + if is_train: + inputs = tokenizer( + input_seq, + max_seq_len=max_source_length, + pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=True, + return_length=False, + ) + return inputs["input_ids"], inputs["attention_mask"], output_ids, outputs["input_ids"] + else: + inputs = tokenizer( + input_seq, + max_seq_len=max_source_length, + pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=True, + return_length=True, + ) + return inputs["input_ids"], inputs["attention_mask"], inputs["length"], output_ids, outputs["input_ids"] + + +def compute_metrics(preds, labels, tokenizer, ignore_pad_token_for_loss=True): + def compute_bleu(predictions, references, rouge_types=None, use_stemmer=True): + bleu1 = BLEU(n_size=1) + bleu2 = BLEU(n_size=2) + bleu3 = BLEU(n_size=3) + bleu4 = BLEU(n_size=4) + assert len(predictions) == len(references) + for i in range(len(predictions)): + bleu1.add_inst(predictions[i], [references[i]]) + bleu2.add_inst(predictions[i], [references[i]]) + bleu3.add_inst(predictions[i], [references[i]]) + bleu4.add_inst(predictions[i], [references[i]]) + result = { + "BLEU-1": bleu1.score() * 100, + "BLEU-2": bleu2.score() * 100, + "BLEU-3": bleu3.score() * 100, + "BLEU-4": bleu4.score() * 100, + } + return result + + def compute_bleu_hf(predictions, references, rouge_types=None, use_stemmer=True): + predictions = [" ".join(prediction) for prediction in predictions] + references = [[" ".join(reference)] for reference in references] + + bleu = evaluate.load("bleu") + assert len(predictions) == len(references) + bleu1_results = bleu.compute(predictions=predictions, references=references, max_order=1) + bleu2_results = bleu.compute(predictions=predictions, references=references, max_order=2) + bleu3_results = bleu.compute(predictions=predictions, references=references, max_order=3) + bleu4_results = bleu.compute(predictions=predictions, references=references, max_order=4) + + result = { + "BLEU-1": bleu1_results["bleu"] * 100, + "BLEU-2": bleu2_results["bleu"] * 100, + "BLEU-3": bleu3_results["bleu"] * 100, + "BLEU-4": bleu4_results["bleu"] * 100, + } + return result + + def post_process_text(preds, labels): + preds = [pred.strip() for pred in preds] + labels = [label.strip() for label in labels] + preds = [pred.strip("question:") for pred in preds] + labels = [label.strip("question:") for label in labels] + labels = [label.strip() for label in labels] + + # expects newline after each sentence + preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds] + labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels] + + preds = [pred.split() for pred in preds] + labels = [label.split() for label in labels] + + return preds, labels + + def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. 
+ """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + if ignore_pad_token_for_loss: + labels = np.asarray(labels) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + decoded_preds, decoded_labels = [], [] + for pred, label in zip(preds, labels): + pred_id = post_process_seq(pred, tokenizer.bos_token_id, tokenizer.eos_token_id) + label_id = post_process_seq(label, tokenizer.bos_token_id, tokenizer.eos_token_id) + decoded_preds.append(tokenizer.decode(pred_id)) + decoded_labels.append(tokenizer.decode(label_id)) + decoded_preds, decoded_labels = post_process_text(decoded_preds, decoded_labels) + # bleu_result = compute_bleu(decoded_preds, decoded_labels) + bleu_result = compute_bleu_hf(decoded_preds, decoded_labels) + return bleu_result, decoded_preds, decoded_labels diff --git a/examples/question_generation/unimo-text/README.md b/examples/question_generation/unimo-text/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fdf6ba023b194042e90fd17943c1a153be62cee5 --- /dev/null +++ b/examples/question_generation/unimo-text/README.md @@ -0,0 +1,343 @@ +# 问题生成 + + +**目录** +- [问题生成](#问题生成) + - [简介](#简介) + <!-- - [基于预训练语言模型的问题生成](#基于预训练语言模型的问题生成) --> + <!-- - [效果展示](#效果展示) --> + - [开箱即用](#开箱即用) + - [训练定制](#训练定制) + - [环境依赖](#环境依赖) + - [代码结构说明](#代码结构说明) + - [问题生成应用定制训练全流程介绍](#问题生成定制训练全流程介绍) + - [数据准备](#数据准备) + - [数据加载](#数据加载) + - [数据处理](#数据处理) + - [从本地文件创建数据集-可选](#从本地文件创建数据集-可选) + - [模型训练](#模型训练) + - [模型预测](#模型预测) + - [模型转换部署](#模型转换部署) + - [FasterTransformer加速及模型静态图导出](#fastertransformer加速及模型静态图导出) + - [模型部署](#模型部署) + - [References](#references) + +## 简介 +Question Generation(QG),即问题生成,指的是给定一段上下文,自动生成一个流畅且符合上下文主题的问句。问题生成通常可以分为,无答案问题生成和有答案问题生成,这里只关注应用更广的有答案问题生成。 + +问题生成技术在教育、咨询、搜索、推荐等多个领域均有着巨大的应用价值。具体来说,问题生成可广泛应用于问答系统语料库构建,事实性问题生成,教育行业题库生成,对话提问,聊天机器人意图理解,对话式搜索意图提问,闲聊机器人主动提问等等场景。 + +本项目是基于预训练语言模型UNIMO-Text的问题生成,具有以下优势: + +- 效果领先。基于百度自研中文预训练语言模型UNIMO-Text,并提供基于模版策略和大规模多领域问题生成数据集训练的通用问题生成预训练模型`unimo-text-1.0-question-generation`。 +- 开箱即用。本项目提供TaskFlow接口,无需训练,仅需几行代码便可预测。 +- 高性能推理。本项目基于FasterTransformer进行推理加速,能够提供更高性能的推理体验,优化后的推理模型在dureader_qg开发集的推理耗时缩短为优化前的1/5。 +- 训练推理部署全流程打通。本项目提供了全面的定制训练流程,从数据准备、模型训练预测,到模型推理部署,一应俱全。 + +<!-- ### 基于预训练语言模型的问题生成 + +基于预训练语言模型(Pretrained Language Models, PLMs)范式的问题生成是目前最常用、效果最好(SOTA)的方式。 +预训练模型是在超大规模的语料采用无监督或者弱监督的方式进行预训练,能够学习如何准确地理解自然语言并以自然语言的形式流畅表达,这两项都是完成文本生成任务的重要能力。 + +PaddleNLP提供了方便易用的接口,可指定模型名或模型参数文件路径通过from_pretrained()方法加载不同网络结构的预训练模型,且相应预训练模型权重下载速度快速、稳定。 +Transformer预训练模型汇总包含了如 ERNIE、BERT、T5、UNIMO等主流预训练模型。下面以中文unimo-text-1.0模型为例,演示如何加载预训练模型和分词器: +``` +from paddlenlp.transformers import ErnieForGeneration, ErnieTokenizer +model_name = "ernie-1.0" +model = UNIMOLMHeadModel.from_pretrained(model_name) +tokenizer = UNIMOTokenizer.from_pretrained(model_name) +``` --> + +## 开箱即用 +PaddleNLP提供开箱即用的产业级NLP预置任务能力,无需训练,一键预测。 +#### 支持单条、批量预测 +```python +>>> from paddlenlp import Taskflow +# 默认模型为 unimo-text-1.0-dureader_qg +>>> question_generator = Taskflow("question_generation") +# 单条输入 +>>> question_generator([ + {"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"} + ]) +''' + ['黄山最高峰是什么'] +''' +# 多条输入 +>>> question_generator([ + {"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"}, + {"context": 
"弗朗索瓦·韦达外文名:franciscusvieta国籍:法国出生地:普瓦图出生日期:1540年逝世日期:1603年12月13日职业:数学家主要成就:为近代数学的发展奠定了基础。", "answer": "法国"} + ]) +''' + ['黄山最高峰是什么', '弗朗索瓦是哪里人'] +''' +``` +关键配置参数说明: +* `model`:可选模型,默认为unimo-text-1.0-dureader_qg,支持的模型有["unimo-text-1.0", "unimo-text-1.0-dureader_qg", "unimo-text-1.0-question-generation", "unimo-text-1.0-question-generation-dureader_qg"]。 + +具体参数配置可参考[Taskflow文档](../../../docs/model_zoo/taskflow.md)。 + +## 训练定制 + +### 环境依赖 +- nltk +- evaluate +- tqdm + +安装方式:`pip install -r requirements.txt` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +├── deploy # 部署 +│ ├── paddle_inference # PaddleInference高性能推理部署 +│ │ ├── inference_unimo_text.py # 推理部署脚本 +│ │ └── README.md # 说明文档 +│ └── paddle_serving +│ ├── config.yml # 配置文件 +│ ├── pipeline_client.py # 客户端程序 +│ ├── pipeline_service.py # 服务器程序 +│ └── README.md # 说明文档 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── train.py # 训练脚本 +├── predict.py # 预测评估脚本 +├── utils.py # 工具函数脚本 +└── README.md # 说明文档 +``` + +### 问题生成定制训练全流程介绍 +接下来,我们将按数据准备、训练、预测、推理部署等四个阶段对问题生成应用的全流程进行介绍。 +1. **数据准备** +- 默认使用中文问题生成数据集DuReader_QG进行实验,该数据集已集成到PaddleNLP。 +- 如果已有标注好的本地数据集,我们需要根据将数据集整理为文档要求的格式,请参考[从本地文件创建数据集(可选)](#从本地文件创建数据集(可选))。 + +2. **模型训练** + +- 数据准备完成后,可以开始使用我们的数据集对预训练模型进行微调训练。我们可以根据任务需求,调整可配置参数,选择使用GPU或CPU进行模型训练,脚本默认保存在开发集最佳表现模型。中文任务默认使用`unimo-text-1.0`模型,unimo-text-1.0还支持large模型。此外本项目还提供基于大规模多领域问题生成数据集训练的通用问题生成预训练模型`unimo-text-1.0-question-generation`,详见[UNIMO模型汇总](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers/UNIMO/contents.html),用户可以根据任务和设备需求进行选择。 + + +3. **模型预测** + +- 训练结束后,我们可以加载保存的最佳模型进行模型测试,打印模型预测结果。 + +4. **模型转换部署** +- 在现实部署场景中,我们通常不仅对模型的精度表现有要求,也需要考虑模型性能上的表现。我们可以使用模型裁剪进一步压缩模型体积,问题生成应用已提供裁剪API对上一步微调后的模型进行裁剪,模型裁剪之后会默认导出静态图模型。 + +- 模型部署需要将保存的最佳模型参数(动态图)导出成静态图参数,用于后续的推理部署。 + +- 问题生成应用提供了基于Paddle Serving的本地部署predictor,并且支持在GPU设备使用Faster Generation进行加速。 + +- 问题生成应用提供了基于Paddle Serving的服务端部署方案。 + +### 数据准备 +#### 数据加载 +[**DuReader_QG**数据集](https://www.luge.ai/#/luge/dataDetail?id=8)是一个中文问题生成数据集,我们使用该数据集作为应用案例进行实验。**DuReader_QG**中的数据主要由由上下文、问题、答案3个主要部分组成,其任务描述为给定上下文p和答案a,生成自然语言表述的问题q,且该问题符合段落和上下文的限制。 + +为了方便用户快速测试,PaddleNLP Dataset API内置了DuReader_QG数据集,一键即可完成数据集加载,示例代码如下: + +```python +from paddlenlp.datasets import load_dataset +train_ds, dev_ds = load_dataset('dureader_qg', splits=('train', 'dev')) +``` + +#### 数据处理 +针对**DuReader_QG**数据集,我们需要将QA任务格式的数据进行转换从而得到text2text形式的数据,我们默认使用模版的方式构造输入数据,默认模版如下,其他形式输入数据用户可以在convert_example函数中自行定义。 +```text +答案: <answer_text> 上下文: <context_text> +问题: <question_text> +``` + +#### 从本地文件创建数据集-可选 +在许多情况下,我们需要使用本地数据集来训练我们的问题生成模型,本项目支持使用固定格式本地数据集文件进行训练。 +使用本地文件,只需要在模型训练时指定`train_file` 为本地训练数据地址,`predict_file` 为本地测试数据地址即可。 + +本地数据集目录结构如下: + +```text +data/ +├── train.json # 训练数据集文件 +├── dev.json # 开发数据集文件 +└── test.json # 可选,待预测数据文件 +``` +本地数据集文件格式如下: +- train.json/dev.json/test.json 文件格式: +```text +{ + "context": <context_text>, + "answer": <answer_text>, + "question": <question_text>, +} +... +``` +- train.json/dev.json/test.json 文件样例: +```text +{ + "context": "欠条是永久有效的,未约定还款期限的借款合同纠纷,诉讼时效自债权人主张债权之日起计算,时效为2年。 根据《中华人民共和国民法通则》第一百三十五条:向人民法院请求保护民事权利的诉讼时效期间为二年,法律另有规定的除外。 第一百三十七条:诉讼时效期间从知道或者应当知道权利被侵害时起计算。但是,从权利被侵害之日起超过二十年的,人民法院不予保护。有特殊情况的,人民法院可以延长诉讼时效期间。 第六十二条第(四)项:履行期限不明确的,债务人可以随时履行,债权人也可以随时要求履行,但应当给对方必要的准备时间。", + "answer": "永久有效", + "question": "欠条的有效期是多久" +} +... 
+``` + +更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#)和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + +### 模型训练 +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。 +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "1,2" --log_dir ./unimo/finetune/log train.py \ + --dataset_name=dureader_qg \ + --model_name_or_path="unimo-text-1.0" \ + --save_dir=./unimo/finetune/checkpoints \ + --output_path ./unimo/finetune/predict.txt \ + --logging_steps=100 \ + --save_steps=500 \ + --epochs=20 \ + --batch_size=16 \ + --learning_rate=1e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=512 \ + --max_target_len=30 \ + --do_train \ + --do_predict \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --num_return_sequences=1 \ + --template=1 \ + --device=gpu +``` + + +关键参数释义如下: +- `gpus` 指示了训练所用的GPU,使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"。 +- `dataset_name` 数据集名称,当`train_file`和`predict_file`为None时将加载`dataset_name`的训练集和开发集,默认为`dureader_qg`。 +- `train_file` 本地训练数据地址,数据格式必须与`dataset_name`所指数据集格式相同,默认为None。 +- `predict_file` 本地测试数据地址,数据格式必须与`dataset_name`所指数据集格式相同,默认为None。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + | 可选预训练模型 | + |---------------------------------| + | unimo-text-1.0 | + | unimo-text-1.0-large | + | unimo-text-1.0-question-generation | + + <!-- | T5-PEGASUS | + | ernie-1.0 | + | ernie-gen-base-en | + | ernie-gen-large-en | + | ernie-gen-large-en-430g | --> + +- `save_dir` 表示模型的保存路径。 +- `output_path` 表示预测结果的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_proportion` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例。 +- `max_seq_len` 模型输入序列的最大长度。 +- `max_target_len` 模型训练时标签的最大长度。 +- `min_dec_len` 模型生成序列的最小长度。 +- `max_dec_len` 模型生成序列的最大长度。 +- `do_train` 是否进行训练。 +- `do_predict` 是否进行预测,在验证集上会自动评估。 +- `device` 表示使用的设备,从gpu和cpu中选择。 +- `template` 表示使用的模版,从[0, 1, 2, 3, 4]中选择,0表示不选择模版,1表示使用默认模版。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。如: + +```text +./unimo/finetune/checkpoints +├── model_1000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── special_tokens_map.json +│ ├── tokenizer_config.json +│ └── vocab.txt +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + +微调的模型在dureader_qg验证集上有如下结果(指标为BLEU-4),其中`unimo-text-1.0-dureader_qg-w/o-template`表示不使用模版策略微调的结果,`unimo-text-1.0-large-dureader_qg`表示使用large模型微调的结果,`unimo-text-1.0-question-generation-dureader_qg`表示在通用问题生成预训练模型`unimo-text-1.0-question-generation`上微调的结果: + +| model_name | DuReaderQG | +| :-----------------------------: | :-----------: | +| unimo-text-1.0-dureader_qg-w/o-template | 39.61 | +| unimo-text-1.0-dureader_qg | 41.08 | +| unimo-text-1.0-large-dureader_qg | 41.51 | +| unimo-text-1.0-question-generation-dureader_qg | 44.02 | + +### 模型预测 + +运行下方脚本可以使用训练好的模型进行预测。 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -u predict.py \ + --dataset_name=dureader_qg \ + --model_name_or_path=your_model_path \ + --output_path=./predict.txt \ + --logging_steps=100 \ + --batch_size=16 \ + --max_seq_len=512 \ + --max_target_len=30 \ + --do_predict \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --template=1 \ + --device=gpu +``` +关键参数释义如下: +- `output_path` 表示预测输出结果保存的文件路径,默认为./predict.txt。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的微调好的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。 + +### 模型转换部署 + +#### FasterTransformer加速及模型静态图导出 + +使用动态图训练结束之后,可以通过[静态图导出脚本](export_model.py)实现基于FasterTransformer的高性能预测加速,并将动态图参数导出成静态图参数,静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python export_model.py \ + --model_name_or_path ./checkpoint \ + --inference_model_dir ./export_checkpoint \ + --max_dec_len 50 \ + --use_fp16_decoding +``` +关键参数释义如下: + +* `model_name_or_path`:动态图训练保存的参数路径;默认为"./checkpoint"。 +* `inference_model_dir`:静态图图保存的参数路径;默认为"./export_checkpoint"。 +* `max_dec_len`:最大输出长度。 +* `use_fp16_decoding`:是否使用fp16解码进行预测。 + +执行命令后将会自动导出模型到指定的 `inference_model_dir` 中,保存模型文件结构如下所示: + +```text +├── unimo_text.pdiparams +├── unimo_text.pdiparams.info +└── unimo_text.pdmodel +``` + +#### 模型部署 +本项目提供多种不同场景的部署方案,请根据实际情况进行选择: +|部署方案|特色|场景|硬件| +|-|-|-|-| +|Paddle Inference<br>服务端/云端|通用性|模型算法复杂<br>硬件高性能|X86 CPU<br>NVIDIA 全系列 GPU<br>龙芯/飞腾等国产CPU<br>昆仑/昇腾/海光DCU等AI加速芯片 +|Paddle Serving<br>服务化|高并发|大流量、高并发、低延时、高吞吐<br>资源弹性调控应对服务流量变化<br>支持模型组合、加密、热更新等|X86/Arm CPU<br>NVIDIA GPU<br>昆仑/昇腾等 + + +问题生成应用已打通多种场景部署方案,点击链接获取具体的使用教程。 +- [Paddle Inference 推理 (Python)](./deploy/paddle_inference/README.md) +- [Paddle Serving 服务化部署(Python)](./deploy/paddle_serving/README.md) + + +## References +Zheng, Chujie, and Minlie Huang. "Exploring prompt-based few-shot learning for grounded dialog generation." arXiv preprint arXiv:2109.06513 (2021). +Li, Wei, et al. "Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning." arXiv preprint arXiv:2012.15409 (2020). 
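+
+## 附录:本地数据集构造示例
+
+前文「从本地文件创建数据集-可选」一节给出了 train.json / dev.json / test.json 的字段格式。下面是一个仅作示意的最小脚本,演示如何用 Python 标准库把已有的(上下文、答案、问题)三元组逐行写成该格式(每行一个 JSON 对象,与 predict.py 中按行 json.loads 的读取方式一致);其中的样例文本和输出路径 data/train.json 均为演示用的假设值,请按实际数据替换。
+
+```python
+import json
+
+# 演示用的假设样例:每条样本包含 context / answer / question 三个字段
+samples = [
+    {
+        "context": "欠条是永久有效的,未约定还款期限的借款合同纠纷……",
+        "answer": "永久有效",
+        "question": "欠条的有效期是多久",
+    },
+]
+
+# 每行写入一个 JSON 对象,对应 train.json / dev.json / test.json 中的一条样本
+with open("data/train.json", "w", encoding="utf-8") as f:
+    for sample in samples:
+        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
+```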
diff --git a/examples/question_generation/unimo-text/deploy/paddle_inference/README.md b/examples/question_generation/unimo-text/deploy/paddle_inference/README.md new file mode 100644 index 0000000000000000000000000000000000000000..93f1eaf349407215a8a0eaf1f37719a13e488a45 --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_inference/README.md @@ -0,0 +1,54 @@ +# Paddle Inference部署 +本文档将介绍如何使用[Paddle Inference](https://paddle-inference.readthedocs.io/en/latest/guides/introduction/index_intro.html#paddle-inference)工具进行问题生成应用高性能推理推理部署。 + +**目录** + * [背景介绍](#背景介绍) + * [导出预测部署模型](#导出预测部署模型) + * [基于Python预测](#基于Python预测) + + +## 背景介绍 +Paddle inference和主框架的Model.predict均可实现推理预测,Paddle Inference 是飞桨的原生推理库, 作用于服务器端和云端,提供高性能的推理能力,主框架的Model 对象是一个具备训练、测试、推理的神经网络。相比于Model.predict,inference可使用MKLDNN、CUDNN、TensorRT进行预测加速。Model.predict适用于训练好的模型直接进行预测,paddle inference适用于对推理性能、通用性有要求的用户,针对不同平台不同的应用场景进行了深度的适配优化,保证模型在服务器端即训即用,快速部署。由于 Paddle Inference 能力直接基于飞桨的训练算子,因此它支持飞桨训练出的所有模型的推理。 + + + +Paddle Inference Python端预测部署主要包含两个步骤: +- 导出预测部署模型 +- 基于Python预测 + + +## 导出预测部署模型 +部署时需要使用预测格式的模型(即动态图转静态图操作)。预测格式模型相对训练格式模型而言,在拓扑上裁剪掉了预测不需要的算子,并且会做特定部署优化。具体操作详见[FasterTransformer加速及模型静态图导出](../../README.md)。 + +## 基于Python预测 +<!-- 同上,高性能预测的默认输入和输出形式也为文件,可分别通过 test_path 和 save_path 进行指定,通过如下命令便可以基于Paddle Inference 进行高性能预测: --> + +在终端输入以下命令可在GPU上进行预测: +```shell +python deploy/paddle_inference/inference.py \ + --inference_model_dir ./export_checkpoint \ + --model_name_or_path "unimo-text-1.0" \ + --predict_file predict_file_name \ + --output_path output_path_name \ + --device gpu \ +``` + +<!-- 在终端输入以下命令可在CPU上进行预测: +```shell +python deploy/paddle_inference/inference_unimo_text.py --inference_model_dir ./export_checkpoint --device cpu +``` --> +经静态图转换,FastTransformer性能优化,Paddle Inference加速后的部署模型在dureader_qg devset的预测时间为27.74秒,相较于未优化前169.24秒,耗时缩减为原来的16.39%。 +关键参数释义如下: +* `inference_model_dir`:用于高性能推理的静态图模型参数路径,默认为"./export_checkpoint"。 +* `model_name_or_path`:tokenizer对应模型或路径,默认为"unimo-text-1.0"。 +* `dataset_name`:数据集名称,默认为`dureader_qg`。 +* `predict_file`:本地预测数据地址,数据格式必须与`dataset_name`所指数据集格式相同,默认为None,当为None时默认加载`dataset_name`的dev集。 +* `output_path`:表示预测结果的保存路径。 +* `device`:推理时使用的设备,可选项["gpu"],默认为"gpu"。 +* `batch_size`:进行推理时的批大小,默认为16。 +* `precision`:当使用TensorRT进行加速推理时,所使用的TensorRT精度,可选项["fp32", "fp16"],默认为"fp32"。 +<!-- * `precision`:当使用TensorRT进行加速推理时,所使用的TensorRT精度,可选项["fp32", "fp16", "int8"],默认为"fp32"。 --> +<!-- * `device`:推理时使用的设备,可选项["gpu", "cpu", "xpu"],默认为"gpu"。 --> +<!-- * `enable_mkldnn`:当使用cpu时,选择是否使用MKL-DNN(oneDNN)进行加速推理,默认为False。 --> +<!-- * `cpu_threads`:当使用cpu时,推理所用的进程数,默认为10。 --> +<!-- * `use_tensorrt`:当使用gpu时,选择是否使用TensorRT进行加速推理,默认为False。 --> diff --git a/examples/question_generation/unimo-text/deploy/paddle_inference/infer_utils.py b/examples/question_generation/unimo-text/deploy/paddle_inference/infer_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..f043e3f932dcb7292f5811d4721a611cc5564f6c --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_inference/infer_utils.py @@ -0,0 +1,260 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +from functools import partial + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler + +from paddlenlp.data import Pad + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first <eos>.""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. + paddle.seed(seed + dist.get_rank()) + + +def convert_example( + example, tokenizer, max_seq_len=512, max_target_len=128, max_title_len=256, mode="test", template=0 +): + """Convert all examples into necessary features.""" + if mode == "pretrain" or mode == "pretrain_test": + context = example["context"] + answer = example["answer"] + target = example["target"] + + source = "答案:" + answer + tokenizer.sep_token + "上下文:" + context + title = None + + elif mode == "train" or mode == "test": + target = None + if "source" in example and "title" in example: + source = example["source"] + title = None + if "title" in example.keys(): + title = example["title"] + elif "context" in example and "answer" in example: + source = example["context"] + title = None + if "answer" in example.keys(): + title = example["answer"] + else: + assert False, "Source and title are not in the input dictionary, nor are context and answer." 
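+        # At this point `source` holds the context passage and `title` holds the answer
+        # text; the reference question (if provided via "target") is picked up below as
+        # the generation target for training or kept for later evaluation in test mode.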
+ if "target" in example.keys(): + target = example["target"] + + if template == 1: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 2: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "在已知答案的前提下,问题:" + target + elif template == 3: + source = "这是一个问题生成任务,根据提供的答案和上下文,来生成问题。" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + + if mode == "train" or mode == "pretrain": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + target=target, + max_seq_len=max_seq_len, + max_target_len=max_target_len, + max_title_len=max_title_len, + return_position_ids=True, + return_length=True, + ) + target_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + + return tokenized_example + + elif mode == "test" or mode == "pretrain_test": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + max_seq_len=max_seq_len, + max_title_len=max_title_len, + add_start_token_for_decoding=True, + return_position_ids=True, + return_length=True, + ) + + if "target" in example and example["target"]: + tokenized_example["target"] = example["target"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode="test"): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). 
+ attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + seq_len = np.asarray([example["seq_len"] for example in batch_examples], dtype="int32") + + if mode == "train" or mode == "pretrain": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + elif mode == "test" or mode == "pretrain_test": + return input_ids, token_type_ids, position_ids, attention_mask, seq_len + + +def create_data_loader(dataset, tokenizer, args, mode="test"): + trans_func = partial(convert_example, tokenizer=tokenizer, mode="test", template=1) + dataset = dataset.map(trans_func, lazy=True) + if mode == "pretrain": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "test" or mode == "pretrain_test": + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first <eos>.""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + special_tokens = ["[UNK]"] + tokens = [token for token in tokens if token not in special_tokens] + return token_ids, tokens + + +def remove_template(instr): + """Remove template prefix of decoded sequence.""" + outstr = instr.strip("问题:") + outstr = instr.strip("在已知答案的前提下,问题:") + return outstr + + +def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + + target = "".join(pred_tokens) + target = remove_template(target) + + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + + tmp.append([target, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + response = "".join(pred_tokens) + response = remove_template(response) + + # TODO: Support return scores in FT. + tmp.append([response]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + + return results diff --git a/examples/question_generation/unimo-text/deploy/paddle_inference/inference.py b/examples/question_generation/unimo-text/deploy/paddle_inference/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..5e15b4a81441913eb6a2071b45f9e5cbbb8f5fa3 --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_inference/inference.py @@ -0,0 +1,223 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from pprint import pprint + +import numpy as np +import paddle +from infer_utils import create_data_loader, postprocess_response, select_sum +from paddle import inference + +from paddlenlp.datasets import load_dataset +from paddlenlp.ops.ext_utils import load +from paddlenlp.transformers import UNIMOTokenizer + + +def setup_args(): + """Setup arguments.""" + parser = argparse.ArgumentParser() + parser.add_argument( + "--inference_model_dir", default="./infer_model", type=str, help="Path to save inference model of UNIMOText. 
" + ) + parser.add_argument( + "--model_name_or_path", type=str, default="unimo-text-1.0", help="The path or shortcut name of the tokenizer." + ) + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference." + ) + parser.add_argument( + "--use_tensorrt", + default=False, + type=eval, + choices=[True, False], + help="Whether to use inference engin TensorRT when using gpu.", + ) + parser.add_argument( + "--enable_mkldnn", + default=False, + type=eval, + choices=[True, False], + help="Enable to use mkldnn to speed up when using cpu.", + ) + parser.add_argument("--cpu_threads", default=10, type=int, help="Number of threads to predict when using cpu.") + parser.add_argument( + "--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help="The tensorrt precision." + ) + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument( + "--output_path", type=str, default="./predict.txt", help="The file path where the infer result will be saved." + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--dataset_name", type=str, default="dureader_qg", help="The name of the dataset to load.") + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument("--max_dec_len", type=int, default=20, help="The maximum sequence length of decoding.") + parser.add_argument( + "--num_return_sequences", + type=int, + default=1, + help="The numbers of returned sequences for one input in generation.", + ) + + args = parser.parse_args() + return args + + +def setup_predictor(args): + """Setup inference predictor.""" + # Load FastGeneration lib. 
+ load("FastGeneration", verbose=True) + model_file = os.path.join(args.inference_model_dir, "unimo_text.pdmodel") + params_file = os.path.join(args.inference_model_dir, "unimo_text.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = inference.Config(model_file, params_file) + if args.device == "gpu": + config.enable_use_gpu(100, 0) + config.switch_ir_optim() + config.enable_memory_optim() + config.disable_glog_info() + + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[args.precision] + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=args.batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif args.device == "cpu": + config.disable_gpu() + if args.enable_mkldnn: + config.enable_mkldnn() + config.set_mkldnn_cache_capacity(10) + + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif args.device == "xpu": + config.enable_xpu(100) + predictor = inference.create_predictor(config) + return predictor + + +@paddle.no_grad() +def infer_one(args, predictor, inputs=None): + """Use predictor to inference.""" + tokenizer = UNIMOTokenizer.from_pretrained("unimo-text-1.0") + + if not inputs: + inputs = { + "context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", + "answer": "莲花峰", + } + + inputs = "答案:" + inputs["answer"] + tokenizer.sep_token + "上下文:" + inputs["context"] + data = tokenizer.gen_encode( + inputs, add_start_token_for_decoding=True, return_length=True, is_split_into_words=False + ) + + input_handles = {} + for name in predictor.get_input_names(): + input_handles[name] = predictor.get_input_handle(name) + if name == "attention_mask": + input_handles[name].copy_from_cpu(np.expand_dims(np.asarray(data[name], dtype="float32"), axis=(0, 1))) + else: + input_handles[name].copy_from_cpu(np.asarray(data[name], dtype="int32").reshape([1, -1])) + + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + predictor.run() + + output = [output_handle.copy_to_cpu() for output_handle in output_handles] + + for sample in output[0][:, :, 0].tolist(): + print("".join(postprocess_response(sample, tokenizer))) + + +@paddle.no_grad() +def infer(args, predictor, data_loader, tokenizer): + print("Infer begin...") + pred_ref = [] + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask, seq_len = inputs + data = { + "input_ids": input_ids, + "token_type_ids": token_type_ids, + "position_ids": position_ids, + "attention_mask": attention_mask, + "seq_len": seq_len, + } + + input_handles = {} + for name in predictor.get_input_names(): + input_handles[name] = predictor.get_input_handle(name) + if name == "attention_mask": + input_handles[name].copy_from_cpu(np.asarray(data[name], dtype="float32")) + else: + input_handles[name].copy_from_cpu(np.asarray(data[name], dtype="int32")) + + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + predictor.run() + + output = [output_handle.copy_to_cpu() for output_handle in output_handles] + + ids = output[0] + scores = output[1] + + ids = paddle.to_tensor(ids, dtype="int32")[:, 0, :] + scores = 
paddle.to_tensor(scores, dtype="float32") + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + + pred_ref.extend(results) + start_time = time.time() + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + print("\nSave inference result into: %s" % args.output_path) + + if "target" in data_loader.dataset[0].keys(): + with open(args.output_path + ".reference.txt", "w", encoding="utf-8") as fout: + targets = [example["target"] for example in data_loader.dataset] + for target in targets: + fout.write(target + "\n") + + +if __name__ == "__main__": + args = setup_args() + pprint(args) + + predictor = setup_predictor(args) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + ds, data_loader = create_data_loader(ds, tokenizer, args, "test") + + time_begin = time.time() + infer(args, predictor, data_loader, tokenizer) + print("inference cost time:", time.time() - time_begin) diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/README.md b/examples/question_generation/unimo-text/deploy/paddle_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ad375abab6ed0ebe9ebe6f158adcab9414af314e --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_serving/README.md @@ -0,0 +1,150 @@ +# Paddle Serving服务化部署 + +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具部署问题生成在线服务。 + +## 目录 +- [Paddle Serving服务化部署](#paddle-serving服务化部署) + - [目录](#目录) + - [背景介绍](#背景介绍) + - [环境准备](#环境准备) + - [安装Paddle Serving](#安装paddle-serving) + <!-- - [安装FastTokenizer文本处理加速库(可选)](#安装fastertokenizer文本处理加速库可选) --> + - [模型转换](#模型转换) + - [pipeline部署](#pipeline部署) + - [修改配置文件](#修改配置文件) + - [server启动服务](#server启动服务) + - [client发送服务请求](#client发送服务请求) + +## 背景介绍 +Paddle Serving 依托深度学习框架 PaddlePaddle 旨在帮助深度学习开发者和企业提供高性能、灵活易用的工业级在线推理服务。Paddle Serving 支持 RESTful、gRPC、bRPC 等多种协议,提供多种异构硬件和多种操作系统环境下推理解决方案,和多种经典预训练模型示例。集成高性能服务端推理引擎 Paddle Inference 和端侧引擎 Paddle Lite。设计并实现基于有向无环图(DAG) 的异步流水线高性能推理框架,具有多模型组合、异步调度、并发推理、动态批量、多卡多流推理、请求缓存等特性。 + +Paddle Serving Python端预测部署主要包含以下步骤: +- 环境准备 +- 模型转换 +- 部署模型 + +## 环境准备 +### 安装Paddle Serving +安装client和serving app,用于向服务发送请求: +```shell +pip install paddle_serving_app paddle_serving_client +``` +安装server,用于启动服务,根据服务器设备选择安装CPU server或GPU server: + +- 安装CPU server +```shell +pip install paddle_serving_server +``` +- 安装GPU server, 注意选择跟本地环境一致的命令 +```shell +# CUDA10.2 + Cudnn7 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post102 # -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 # -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 # -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +**NOTE:** +- 可以开启国内清华镜像源来加速下载 +- 如果要安装最新版本的PaddleServing参考[链接](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Latest_Packages_CN.md)。 + + +<!-- ### 安装FastTokenizer文本处理加速库(可选) +如果部署环境是Linux,推荐安装fast_tokenizer可以得到更极致的文本处理效率,进一步提升服务性能。目前暂不支持Windows设备安装,将会在下个版本支持。 +```shell +pip install fast-tokenizer-python +``` --> + + +## 模型转换 + +使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 
+ +用已安装的paddle_serving_client将静态图参数模型转换成serving格式。关于如何使用将训练后的动态图模型转为静态图模型详见[FasterTransformer加速及模型静态图导出](../../README.md)。 + +模型转换命令如下: +```shell +python -m paddle_serving_client.convert --dirname ./export_checkpoint \ + --model_filename unimo_text.pdmodel \ + --params_filename unimo_text.pdiparams \ + --serving_server ./deploy/paddle_serving/export_checkpoint_server \ + --serving_client ./deploy/paddle_serving/export_checkpoint_client +``` +关键参数释义如下: +* `dirname`:静态图模型文件夹地址。 +* `model_filename`:模型文件名。 +* `params_filename`:模型参数名。 +* `serving_server`:server的模型文件和配置文件路径,默认"serving_server"。 +* `serving_client`:client的配置文件路径,默认"serving_client"。 + +更多参数可通过以下命令查询: +```shell +python -m paddle_serving_client.convert --help +``` +模型转换完成后,会在./delopy/paddle_serving文件夹多出export_checkpoint_server和export_checkpoint_client的文件夹,文件夹目录格式如下: +``` +export_checkpoint_server/ +├── unimo_text.pdiparams +├── unimo_text.pdmodel +├── serving_server_conf.prototxt +└── serving_server_conf.stream.prototxt +export_checkpoint_server/ +├── serving_client_conf.prototxt +└── serving_client_conf.stream.prototxt +``` + +## pipeline部署 + +paddle_serving目录包含启动pipeline服务和发送预测请求的代码,包括: +``` +paddle_serving/ +├──config.yml # 启动服务端的配置文件 +├──pipeline_client.py # 发送pipeline预测请求的脚本 +└──pipeline_service.py # 启动pipeline服务端的脚本 +``` + +### 修改配置文件 +目录中的`config.yml`文件解释了每一个参数的含义,可以根据实际需要修改其中的配置。 + +### server启动服务 +修改好配置文件后,执行下面命令启动服务: +```shell +cd deploy/paddle_serving +# 启动服务,运行日志保存在log.txt +python pipeline_service.py &> log.txt & +``` +成功启动服务后,log.txt中会打印类似如下日志 +``` +--- Running analysis [ir_graph_to_program_pass] +I0901 12:09:27.248943 12190 analysis_predictor.cc:1035] ======= optimize end ======= +I0901 12:09:27.249596 12190 naive_executor.cc:102] --- skip [feed], feed -> seq_len +I0901 12:09:27.249608 12190 naive_executor.cc:102] --- skip [feed], feed -> attention_mask +I0901 12:09:27.249614 12190 naive_executor.cc:102] --- skip [feed], feed -> token_type_ids +I0901 12:09:27.249617 12190 naive_executor.cc:102] --- skip [feed], feed -> input_ids +I0901 12:09:27.250080 12190 naive_executor.cc:102] --- skip [_generated_var_3], fetch -> fetch +I0901 12:09:27.250090 12190 naive_executor.cc:102] --- skip [transpose_0.tmp_0], fetch -> fetch +[2022-09-01 12:09:27,251] [ INFO] - Already cached /root/.paddlenlp/models/unimo-text-1.0/unimo-text-1.0-vocab.txt +[2022-09-01 12:09:27,269] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/unimo-text-1.0/tokenizer_config.json +[2022-09-01 12:09:27,269] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/unimo-text-1.0/special_tokens_map.json +[PipelineServicer] succ init +[OP Object] init success +2022/09/01 12:09:27 start proxy service +``` + +### client发送服务请求 +执行以下命令发送文本摘要服务请求: +```shell +cd deploy/paddle_serving +python pipeline_client.py +``` +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器) + +成功运行后,输出打印如下: +``` +time cost :0.03429532051086426 seconds +-------------------- +input: {'context': '平安银行95511电话按9转报案人工服务。 1.寿险 :95511转1 2.信用卡 95511转2 3.平安银行 95511转3 4.一账通 95511转4转8 5.产险 95511转5 6.养老险团体险 95511转6 7.健康险 95511转7 8.证券 95511转8 9.车险报案95511转9 0.重听', 'answer': '95511'} +output: 问题:平安银行人工服务电话 +-------------------- +``` diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/config.yml b/examples/question_generation/unimo-text/deploy/paddle_serving/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..1cc918e1ba0c293aa9717aecda580a2c9d871c0a --- /dev/null +++ 
b/examples/question_generation/unimo-text/deploy/paddle_serving/config.yml @@ -0,0 +1,59 @@ +#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 18011 + +#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 9999 + +#worker_num, 最大并发数。 +#当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +#当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 10 + +#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + #op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: True + + #重试次数 + retry: 1 + + #使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + use_profile: false + tracer: + interval_s: 10 + +op: + question_generation: + #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 11 + + #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + #client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + + #模型路径 + model_config: ../../unimo/serving/export_checkpoint_server + + #Fetch结果列表,以client_config中fetch_var的alias_name为准,不设置默认取全部输出变量 + # fetch_list: ["_generated_var_3", "slice_0.tmp_0"] + + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + + #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" + + #开启MKLDNN加速 + use_mkldnn: False + + #thread_num + thread_num: 12 + + #ir_optim + ir_optim: False + + #开启tensorrt后,进行优化的子图包含的最少节点数 + #min_subgraph_size: 10 \ No newline at end of file diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/infer_utils.py b/examples/question_generation/unimo-text/deploy/paddle_serving/infer_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..f043e3f932dcb7292f5811d4721a611cc5564f6c --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_serving/infer_utils.py @@ -0,0 +1,260 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +from functools import partial + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler + +from paddlenlp.data import Pad + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first <eos>.""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. + paddle.seed(seed + dist.get_rank()) + + +def convert_example( + example, tokenizer, max_seq_len=512, max_target_len=128, max_title_len=256, mode="test", template=0 +): + """Convert all examples into necessary features.""" + if mode == "pretrain" or mode == "pretrain_test": + context = example["context"] + answer = example["answer"] + target = example["target"] + + source = "答案:" + answer + tokenizer.sep_token + "上下文:" + context + title = None + + elif mode == "train" or mode == "test": + target = None + if "source" in example and "title" in example: + source = example["source"] + title = None + if "title" in example.keys(): + title = example["title"] + elif "context" in example and "answer" in example: + source = example["context"] + title = None + if "answer" in example.keys(): + title = example["answer"] + else: + assert False, "Source and title are not in the input dictionary, nor are context and answer." + if "target" in example.keys(): + target = example["target"] + + if template == 1: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 2: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "在已知答案的前提下,问题:" + target + elif template == 3: + source = "这是一个问题生成任务,根据提供的答案和上下文,来生成问题。" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + + if mode == "train" or mode == "pretrain": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + target=target, + max_seq_len=max_seq_len, + max_target_len=max_target_len, + max_title_len=max_title_len, + return_position_ids=True, + return_length=True, + ) + target_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + + return tokenized_example + + elif mode == "test" or mode == "pretrain_test": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + max_seq_len=max_seq_len, + max_title_len=max_title_len, + add_start_token_for_decoding=True, + return_position_ids=True, + return_length=True, + ) + + if "target" in example and example["target"]: + tokenized_example["target"] = example["target"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode="test"): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, 
max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + seq_len = np.asarray([example["seq_len"] for example in batch_examples], dtype="int32") + + if mode == "train" or mode == "pretrain": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + elif mode == "test" or mode == "pretrain_test": + return input_ids, token_type_ids, position_ids, attention_mask, seq_len + + +def create_data_loader(dataset, tokenizer, args, mode="test"): + trans_func = partial(convert_example, tokenizer=tokenizer, mode="test", template=1) + dataset = dataset.map(trans_func, lazy=True) + if mode == "pretrain": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "test" or mode == "pretrain_test": + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first <eos>.""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + special_tokens = ["[UNK]"] + tokens = [token for token in tokens if token not in special_tokens] + return token_ids, tokens + + +def remove_template(instr): + """Remove template prefix of decoded sequence.""" + outstr = instr.strip("问题:") + outstr = instr.strip("在已知答案的前提下,问题:") + return outstr + + +def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + + target = "".join(pred_tokens) + target = remove_template(target) + + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + + tmp.append([target, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + response = "".join(pred_tokens) + response = remove_template(response) + + # TODO: Support return scores in FT. + tmp.append([response]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + + return results diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_client.py b/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_client.py new file mode 100644 index 0000000000000000000000000000000000000000..9172d68cfdda147cf4b6988cb1af0db7ec52cf1e --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_client.py @@ -0,0 +1,50 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
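+
+# Minimal Paddle Serving pipeline client for the question generation service.
+# It connects to the gRPC endpoint started by pipeline_service.py, sends a batch of
+# {"context": ..., "answer": ...} dicts and prints the questions returned by the server.
+# Run it with `python pipeline_client.py`, adjusting `server_url` below if the service
+# runs on another machine or port.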
+import time + +from paddle_serving_server.pipeline import PipelineClient + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.client = PipelineClient() + self.client.connect([server_url]) + + def Run(self, data): + inputs = data + start_time = time.time() + ret = self.client.predict(feed_dict={"inputs": inputs}) + end_time = time.time() + print("time cost :{} seconds".format(end_time - start_time)) + if not ret.value: + print("Fail to fetch summary.") + # ret is special class but a dict + for d, s in zip(data, eval(ret.value[0])): + print("--------------------") + print("input: ", d) + print("output: ", s) + print("--------------------") + return + + +if __name__ == "__main__": + server_url = "127.0.0.1:18011" + runner = Runner(server_url) + requests = [ + {"context": "奇峰黄山千米以上的山峰有77座,整座黄山就是一座花岗岩的峰林,自古有36大峰,36小峰,最高峰莲花峰、最险峰天都峰和观日出的最佳点光明顶构成黄山的三大主峰。", "answer": "莲花峰"} + ] + runner.Run(requests) diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_service.py b/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_service.py new file mode 100644 index 0000000000000000000000000000000000000000..e9b6af9fe5d4744df5b6ea7bf2526bfe88cda9a2 --- /dev/null +++ b/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_service.py @@ -0,0 +1,74 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
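+
+# Paddle Serving pipeline service for question generation.
+# UnimoTextOp tokenizes incoming {"context": ..., "answer": ...} dicts with the UNIMO
+# tokenizer, runs the exported static graph model configured in config.yml through the
+# local predictor, and decodes the returned token ids back into question text.
+# Start the service with `python pipeline_service.py` (see config.yml for ports/devices).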
+ +import logging + +from infer_utils import batchify_fn, convert_example, postprocess_response +from paddle_serving_server.web_service import Op, WebService + +from paddlenlp.ops.ext_utils import load +from paddlenlp.transformers import UNIMOTokenizer + +_LOGGER = logging.getLogger(__name__) + + +class UnimoTextOp(Op): + """Op for unimo_text.""" + + def init_op(self): + self.tokenizer = UNIMOTokenizer.from_pretrained("unimo-text-1.0") + + def preprocess(self, input_dicts, data_id, log_id): + # Convert input format + ((_, input_dict),) = input_dicts.items() + data = input_dict["inputs"] + if isinstance(data, str) and "array(" in data: + data = eval(data) + else: + _LOGGER.error("input value {}is not supported.".format(data)) + examples = [convert_example(i, self.tokenizer) for i in data] + input_ids, token_type_ids, position_ids, attention_mask, seq_len = batchify_fn( + examples, self.tokenizer.pad_token_id + ) + new_dict = {} + new_dict["input_ids"] = input_ids + new_dict["token_type_ids"] = token_type_ids + new_dict["attention_mask"] = attention_mask + new_dict["seq_len"] = seq_len + # the first return must be a dict or a list of dict, the dict corresponding to a batch of model input + return new_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + # keyname refer to export_checkpoint_client/serving_client_conf.prototxt + ids = fetch_dict["transpose_0.tmp_0"][:, 0, :].tolist() + # scores = fetch_dict["_generated_var_3"][:, 0].tolist() + + results = ["".join(postprocess_response(sample, self.tokenizer)) for sample in ids] + new_dict = {} + new_dict["outputs"] = str(results) + # the first return must be a dict or a list of dict, the dict corresponding to a batch of model output + return new_dict, None, "" + + +class UnimoTextService(WebService): + def get_pipeline_response(self, read_op): + return UnimoTextOp(name="question_generation", input_ops=[read_op]) + + +if __name__ == "__main__": + # Load FastGeneration lib. + load("FastGeneration", verbose=True) + service = UnimoTextService(name="question_generation") + service.prepare_pipeline_config("config.yml") + service.run_service() diff --git a/examples/question_generation/unimo-text/export_model.py b/examples/question_generation/unimo-text/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..ea8d9c1a2e4013a2092f00eee036f21271c2126f --- /dev/null +++ b/examples/question_generation/unimo-text/export_model.py @@ -0,0 +1,99 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from pprint import pprint + +import paddle + +from paddlenlp.ops import FasterUNIMOText +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer +from paddlenlp.utils.log import logger + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", default="checkpoint", type=str, help="The model name to specify the UNIMOText to use. 
") + parser.add_argument("--inference_model_dir", default="./export_checkpoint", type=str, help="Path to save inference model of UNIMOText. ") + parser.add_argument("--topk", default=4, type=int, help="The number of candidate to procedure top_k sampling. ") + parser.add_argument("--topp", default=1.0, type=float, help="The probability threshold to procedure top_p sampling. ") + parser.add_argument("--max_dec_len", default=20, type=int, help="Maximum output length. ") + parser.add_argument("--min_dec_len", default=3, type=int, help="Minimum output length. ") + parser.add_argument("--temperature", default=1.0, type=float, help="The temperature to set. ") + parser.add_argument("--num_return_sequences", default=1, type=int, help="The number of returned sequences. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + parser.add_argument("--decoding_strategy", default="beam_search", choices=["sampling", "beam_search"], type=str, help="The main strategy to decode. ") + parser.add_argument("--num_beams", default=6, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity rate to procedure beam search. ") + parser.add_argument("--length_penalty", default=1.2, type=float, help="The diversity rate to procedure beam search. ") + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + model_name_or_path = args.model_name_or_path + model = UNIMOLMHeadModel.from_pretrained(model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(model_name_or_path) + + unimo_text = FasterUNIMOText(model=model, + use_fp16_decoding=args.use_fp16_decoding, + trans_out=True) + + # Set evaluate mode + unimo_text.eval() + + # Convert dygraph model to static graph model + unimo_text = paddle.jit.to_static( + unimo_text, + input_spec=[ + # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # token_type_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # attention_mask + paddle.static.InputSpec(shape=[None, 1, None, None], + dtype="float32"), + # seq_len + paddle.static.InputSpec(shape=[None], dtype="int64"), + args.max_dec_len, + args.min_dec_len, + args.topk, + args.topp, + args.num_beams, # num_beams. Used for beam_search. + args.decoding_strategy, + tokenizer.cls_token_id, # cls/bos + tokenizer.mask_token_id, # mask/eos + tokenizer.pad_token_id, # pad + args.diversity_rate, # diversity rate. Used for beam search. + args.temperature, + args.num_return_sequences, + args.length_penalty, + ]) + + # Save converted static graph model + paddle.jit.save(unimo_text, + os.path.join(args.inference_model_dir, "unimo_text")) + logger.info("UNIMOText has been saved to {}.".format( + args.inference_model_dir)) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + do_predict(args) diff --git a/examples/question_generation/unimo-text/gen_utils.py b/examples/question_generation/unimo-text/gen_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..ecc75584d89fa1cd3ed5c621ea7302302363e8d3 --- /dev/null +++ b/examples/question_generation/unimo-text/gen_utils.py @@ -0,0 +1,316 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +from functools import partial + +import numpy as np + +import paddle +import paddle.distributed as dist +from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler +from paddlenlp.data import Pad + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. + paddle.seed(seed + dist.get_rank()) + + +def convert_example( + example, tokenizer, max_seq_len=512, max_target_len=128, max_title_len=256, mode="train", template=0 +): + """Convert all examples into necessary features.""" + if mode == "pretrain" or mode == "pretrain_test": + context = example["context"] + answer = example["answer"] + target = example["target"] + source = "答案:" + answer + tokenizer.sep_token + "上下文:" + context + title = None + + elif mode == "train" or mode == "test": + target = None + title = None + if "source" in example and "title" in example: + source = example["source"] + if "title" in example.keys(): + title = example["title"] + elif "context" in example and "answer" in example: + source = example["context"] + if "answer" in example.keys(): + title = example["answer"] + else: + assert False, "Source and title are not in the input dictionary, nor are context and answer." 
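+        # Here `source` holds the context passage and `title` holds the answer text; the
+        # reference question (from the "target" or "question" field, if present) becomes
+        # the generation target for training or is kept for evaluation at test time.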
+ if "target" in example.keys(): + target = example["target"] + elif "question" in example.keys(): + target = example["question"] + + if template == 1: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 2: + source = "答案:" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "在已知答案的前提下,问题:" + target + elif template == 3: + source = "这是一个问题生成任务,根据提供的答案和上下文,来生成问题。" + title + tokenizer.sep_token + "上下文:" + source + title = None + if target: + target = "问题:" + target + elif template == 4: + prompt_common = example["prompt_common"] + prompt_domain = example["prompt_domain"] + source = ( + prompt_common + + " " + + tokenizer.sep_token + + " " + + "".join( + [" " + tokenizer.cls_token + " " + one + " " + tokenizer.sep_token + " " for one in prompt_domain] + ) + + " " + + tokenizer.cls_token + + " " + + "答案:" + + title + + " " + + tokenizer.sep_token + + " " + + tokenizer.cls_token + + "上下文:" + + source + ) + + title = None + if target: + target = "问题:" + target + + if mode == "train" or mode == "pretrain": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + target=target, + max_seq_len=max_seq_len, + max_target_len=max_target_len, + max_title_len=max_title_len, + return_position_ids=True, + return_length=True, + ) + temp_tokens = tokenizer.convert_ids_to_tokens(tokenized_example["input_ids"]) + index_list = [] + count = tokenized_example["input_ids"].count(tokenizer.cls_token_id) + # If template==4, count must be equal to 7, otherwise count must be equal to 2 + assert count == 7 or count == 2, ( + str(count) + " is not in [2, 7], temp_tokens: " + " ".join(temp_tokens) + "source: " + source + ) + index = -1 + for i in range(0, count): + index = tokenized_example["input_ids"].index(tokenizer.cls_token_id, index + 1) + index_list.append(index) + if template == 4: + tokenized_example["token_type_ids"] = ( + [2] * (index_list[1] - index_list[0]) + + [3] * (index_list[4] - index_list[1]) + + [0] * (index_list[6] - index_list[4]) + + [1] * (len(tokenized_example["input_ids"]) - index_list[6]) + ) + target_start = index_list[-1] + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + if template == 4: + tokenized_example["token_type_ids"] + return tokenized_example + + elif mode == "test" or mode == "pretrain_test": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + max_seq_len=max_seq_len, + max_title_len=max_title_len, + add_start_token_for_decoding=True, + return_position_ids=True, + ) + + if template == 4: + # temp_tokens = tokenizer.convert_ids_to_tokens(tokenized_example['input_ids']) + index_list = [] + count = tokenized_example["input_ids"].count(tokenizer.cls_token_id) + assert count == 7, str(count) + " is not in [7]" + index = -1 + for i in range(0, count): + index = tokenized_example["input_ids"].index(tokenizer.cls_token_id, index + 1) + index_list.append(index) + tokenized_example["token_type_ids"] = ( + [2] * (index_list[1] - index_list[0]) + + [3] * (index_list[4] - index_list[1]) + + [0] * (index_list[6] - index_list[4]) + + [1] * (len(tokenized_example["input_ids"]) - index_list[6]) + ) + + if "target" in example and example["target"]: + tokenized_example["target"] = example["target"] + elif 
"question" in example and example["question"]: + tokenized_example["target"] = example["question"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + if mode == "train" or mode == "pretrain": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + elif mode == "test" or mode == "pretrain_test": + return input_ids, token_type_ids, position_ids, attention_mask + + +def create_data_loader(dataset, tokenizer, args, mode): + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + max_target_len=args.max_target_len, + max_title_len=args.max_title_len, + mode=mode, + template=args.template, + ) + dataset = dataset.map(trans_func, lazy=True) + if mode == "pretrain": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + elif mode == "test" or mode == "pretrain_test": + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first <eos>."""
+    eos_pos = len(token_ids)
+    for i, tok_id in enumerate(token_ids):
+        if tok_id == tokenizer.mask_token_id:
+            eos_pos = i
+            break
+    token_ids = token_ids[:eos_pos]
+    tokens = tokenizer.convert_ids_to_tokens(token_ids)
+    tokens = tokenizer.merge_subword(tokens)
+    special_tokens = ["[UNK]"]
+    tokens = [token for token in tokens if token not in special_tokens]
+    return token_ids, tokens
+
+
+def remove_template(instr):
+    """Remove the template prefix from a decoded sequence."""
+    # Check the longer prefix first so the shorter one does not strip only part of it.
+    for prefix in ("在已知答案的前提下,问题:", "问题:"):
+        if instr.startswith(prefix):
+            return instr[len(prefix) :]
+    return instr
+
+
+def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1):
+    results = []
+    group = []
+    tmp = []
+    if scores is not None:
+        ids = ids.numpy()
+        scores = scores.numpy()
+
+        if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0:
+            raise ValueError(
+                "the length of `ids` ({}) must equal the length of `scores` ({}) and "
+                "be divisible by `num_return_sequences` ({})".format(len(ids), len(scores), num_return_sequences)
+            )
+
+        for pred, score in zip(ids, scores):
+            pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer)
+            num_token = len(pred_token_ids)
+
+            target = "".join(pred_tokens)
+            target = remove_template(target)
+
+            # Penalize sequences that reach max_dec_len without generating the end token.
+            if max_dec_len is not None and num_token >= max_dec_len:
+                score -= 1e3
+
+            tmp.append([target, score])
+            if len(tmp) == num_return_sequences:
+                group.append(tmp)
+                tmp = []
+
+        for preds in group:
+            preds = sorted(preds, key=lambda x: -x[1])
+            results.append(preds[0][0])
+    else:
+        ids = ids.numpy()
+
+        for pred in ids:
+            pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer)
+            num_token = len(pred_token_ids)
+            response = "".join(pred_tokens)
+            response = remove_template(response)
+
+            # TODO: Support returning scores in FT.
+            tmp.append([response])
+            if len(tmp) == num_return_sequences:
+                group.append(tmp)
+                tmp = []
+
+        for preds in group:
+            results.append(preds[0][0])
+
+    return results
diff --git a/examples/question_generation/unimo-text/predict.py b/examples/question_generation/unimo-text/predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ec590f45b08016a2642fbb37253aae48e33ee09
--- /dev/null
+++ b/examples/question_generation/unimo-text/predict.py
@@ -0,0 +1,141 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
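+
+"""Prediction script for UNIMO-text question generation.
+
+Loads a fine-tuned UNIMO-text model and tokenizer, builds a test data loader
+from ``--predict_file`` (or the dev split of ``--dataset_name``), generates
+questions with ``model.generate`` and writes one result per line to
+``--output_path``.
+"""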
+ +import argparse +import json +import time + +import paddle +import paddle.distributed as dist +from gen_utils import create_data_loader, print_args, select_sum, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--dataset_name', type=str, default='dureader_qg', help='The name of the dataset to load.') + parser.add_argument('--model_name_or_path', type=str, default='unimo-text-1.0', help='The path or shortcut name of the pre-trained model.') + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--seed', type=int, default=1, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_target_len', type=int, default=30, help='The maximum target sequence length of training.') + parser.add_argument('--max_title_len', type=int, default=30, help='The maximum title sequence length of training.') + parser.add_argument('--max_dec_len', type=int, default=20, help='The maximum sequence length of decoding.') + parser.add_argument('--min_dec_len', type=int, default=3, help='The minimal sequence length of decoding.') + parser.add_argument('--num_return_sequences', type=int, default=1, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='beam_search', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=6, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.2, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--do_predict", action='store_true', help="Whether to eval and predict.") + parser.add_argument("--template", type=int, default=1, help="The template used during training, select from [0, 1, 2, 3, 4].") + + args = parser.parse_args() + return args +# yapf: enable + + +def read_file(file): + with open(file, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip() + if not line: + continue + line = json.loads(line) + yield line + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = 
UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + if args.predict_file: + dev_ds = load_dataset(read_file, file=args.predict_file, lazy=False) + else: + dev_ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + if args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + prediction(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def prediction(model, data_loader, args, tokenizer): + print("\nPred begin...") + model.eval() + pred_ref = [] + time_begin = time.time() + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = time.time() + print("Generation cost time:", time.time() - time_begin) + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/examples/question_generation/unimo-text/requirements.txt b/examples/question_generation/unimo-text/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..48ff8faab77ad3c024f841a75915a17ff475880d --- /dev/null +++ b/examples/question_generation/unimo-text/requirements.txt @@ -0,0 +1,3 @@ +nltk==3.6.2 +evaluate==0.2.2 +tqdm==4.64.0 \ No newline at end of file diff --git a/examples/question_generation/unimo-text/train.py b/examples/question_generation/unimo-text/train.py new file mode 100644 index 0000000000000000000000000000000000000000..73e2c1544328e42275af2d0886e44e66a61df7cf --- /dev/null +++ b/examples/question_generation/unimo-text/train.py @@ -0,0 +1,281 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
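+
+"""Fine-tuning script for UNIMO-text question generation.
+
+Builds train/dev data loaders, trains with AdamW, linear warmup/decay and
+label smoothing, saves checkpoints every ``--save_steps`` steps and, when
+``--do_predict`` is set, keeps the checkpoint with the best BLEU-4 score on
+the dev set.
+"""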
+ +import argparse +import json +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn.functional as F +from gen_utils import create_data_loader, print_args, select_sum, set_seed +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import BLEU +from paddlenlp.transformers import ( + BasicTokenizer, + LinearDecayWithWarmup, + UNIMOLMHeadModel, + UNIMOTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--dataset_name', type=str, default='dureader_qg', help='The name of the dataset to load.') + parser.add_argument('--model_name_or_path', type=str, default='unimo-text-1.0', help='The path or shortcut name of the pre-trained model.') + parser.add_argument("--train_file", type=str, required=False, default=None, help="Train data path.") + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--save_steps', type=int, default=1000, help='Save checkpoint every X updates steps.') + parser.add_argument('--seed', type=int, default=1, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--learning_rate', type=float, default=5e-5, help='The initial learning rate.') + parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') + parser.add_argument('--epochs', type=int, default=3, help='Total number of training epochs to perform.') + parser.add_argument('--warmup_proportion', type=float, default=0.02, help='The number of warmup steps.') + parser.add_argument('--max_grad_norm', type=float, default=1.0, help='The max value of grad norm.') + parser.add_argument('--beta1', type=float, default=0.9, help='beta1') + parser.add_argument('--beta2', type=float, default=0.98, help='beta2') + parser.add_argument('--epsilon', type=float, default=1e-6, help='epsilon') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_target_len', type=int, default=30, help='The maximum target sequence length of training.') + parser.add_argument('--max_title_len', type=int, default=30, help='The maximum title sequence length of training.') + parser.add_argument('--max_dec_len', type=int, default=20, help='The maximum sequence length of decoding.') + parser.add_argument('--min_dec_len', type=int, default=3, help='The minimal sequence length of decoding.') + parser.add_argument('--num_return_sequences', type=int, default=1, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='beam_search', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, 
default=6, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.2, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--do_train", action='store_true', help="Whether to train the model.") + parser.add_argument("--do_predict", action='store_true', help="Whether to eval and predict.") + parser.add_argument("--template", type=int, default=1, help="The template used during training, select from [0, 1, 2, 3, 4].") + + args = parser.parse_args() + return args +# yapf: enable + + +def calc_bleu_n(preds, targets, n_size=4): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. But received {} and {}.".format(len(preds), len(targets)) + ) + bleu = BLEU(n_size=n_size) + tokenizer = BasicTokenizer() + + for pred, target in zip(preds, targets): + pred_tokens = tokenizer.tokenize(pred) + target_token = tokenizer.tokenize(target) + + bleu.add_inst(pred_tokens, [target_token]) + + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("BLEU-" + str(n_size) + ":", bleu.score()) + return bleu.score() + + +def calc_bleu(preds, targets): + calc_bleu_n(preds, targets, 1) + calc_bleu_n(preds, targets, 2) + calc_bleu_n(preds, targets, 3) + bleu4_score = calc_bleu_n(preds, targets, 4) + return bleu4_score + + +def read_file(file): + with open(file, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip() + if not line: + continue + line = json.loads(line) + yield line + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + if args.train_file: + train_ds = load_dataset(read_file, file=args.train_file, lazy=False) + else: + train_ds = load_dataset(args.dataset_name, splits="train", data_files=args.train_file) + if args.predict_file: + dev_ds = load_dataset(read_file, file=args.predict_file, lazy=False) + else: + dev_ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + + train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + if args.do_train: + num_training_steps = args.epochs * len(train_data_loader) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
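+        # ``apply_decay_param_fun`` below is called with each parameter name and
+        # applies weight decay only when that name appears in ``decay_params``.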
+ + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.epsilon, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + step = 0 + total_time = 0.0 + best_bleu4 = 0 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for inputs in train_data_loader: + step += 1 + labels = inputs[-1] + logits = model(*inputs[:-1]) + labels = paddle.nn.functional.one_hot(labels, num_classes=logits.shape[-1]) + labels = paddle.nn.functional.label_smooth(labels) + loss = F.cross_entropy(logits, labels, soft_label=True) + loss.backward() + + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + + if step % args.save_steps == 0 or step >= num_training_steps: + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + print("Saved step {} model.\n".format(step)) + if args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + bleu4 = evaluation(model_eval, dev_data_loader, args, tokenizer) + if bleu4 > best_bleu4: + print("best BLEU-4 performence has been updated: %.5f --> %.5f" % (best_bleu4, bleu4)) + best_bleu4 = bleu4 + save_ckpt(model, tokenizer, args.save_dir, "best") + + batch_start_time = time.time() + + print("\nTraining completed.") + elif args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def evaluation(model, data_loader, args, tokenizer): + print("\nEval begin...") + model.eval() + pred_ref = [] + time_begin = time.time() + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = time.time() + print("Generation cost time:", time.time() - time_begin) + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + with open(args.output_path + ".reference.txt", "w", encoding="utf-8") as fout: + targets = [example["target"] for example in data_loader.dataset] + for target in targets: + 
fout.write(target + "\n") + + print("\nSave inference result into: %s" % args.output_path) + + if "target" in data_loader.dataset[0].keys(): + targets = [example["target"] for example in data_loader.dataset] + bleu4_score = calc_bleu(pred_ref, targets) + + model.train() + return bleu4_score + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/examples/semantic_indexing/NQdataset.py b/examples/semantic_indexing/NQdataset.py new file mode 100644 index 0000000000000000000000000000000000000000..58efe8156ce12ba266d5c45ecfa40a39521b20ad --- /dev/null +++ b/examples/semantic_indexing/NQdataset.py @@ -0,0 +1,251 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import json +import random +from typing import List + +import numpy as np +import paddle +from paddle.io import Dataset + +from paddlenlp.transformers.bert.tokenizer import BertTokenizer + +BiEncoderPassage = collections.namedtuple("BiEncoderPassage", ["text", "title"]) + +BiENcoderBatch = collections.namedtuple( + "BiEncoderInput", + [ + "questions_ids", + "question_segments", + "context_ids", + "ctx_segments", + "is_positive", + "hard_negatives", + "encoder_type", + ], +) + + +def normalize_question(question: str) -> str: + question = question.replace("’", "'") + return question + + +def normalize_passage(ctx_text: str): + ctx_text = ctx_text.replace("\n", " ").replace("’", "'") + if ctx_text.startswith('"'): + ctx_text = ctx_text[1:] + if ctx_text.endswith('"'): + ctx_text = ctx_text[:-1] + return ctx_text + + +class BiEncoderSample(object): + query: str + positive_passages: List[BiEncoderPassage] + negative_passages: List[BiEncoderPassage] + hard_negative_passages: List[BiEncoderPassage] + + +class NQdataSetForDPR(Dataset): + """ + class for managing dataset + """ + + def __init__(self, dataPath, query_special_suffix=None): + super(NQdataSetForDPR, self).__init__() + self.data = self._read_json_data(dataPath) + self.tokenizer = BertTokenizer + self.query_special_suffix = query_special_suffix + self.new_data = [] + for i in range(0, self.__len__()): + self.new_data.append(self.__getitem__(i)) + + def _read_json_data(self, dataPath): + results = [] + with open(dataPath, "r", encoding="utf-8") as f: + print("Reading file %s" % dataPath) + data = json.load(f) + results.extend(data) + print("Aggregated data size: {}".format(len(results))) + return results + + def __getitem__(self, index): + json_sample_data = self.data[index] + r = BiEncoderSample() + r.query = self._porcess_query(json_sample_data["question"]) + + positive_ctxs = json_sample_data["positive_ctxs"] + + negative_ctxs = json_sample_data["negative_ctxs"] if "negative_ctxs" in json_sample_data else [] + hard_negative_ctxs = json_sample_data["hard_negative_ctxs"] if "hard_negative_ctxs" in json_sample_data else [] + + for ctx in positive_ctxs + negative_ctxs + hard_negative_ctxs: + if "title" not in ctx: + ctx["title"] = None + + def create_passage(ctx): 
+ return BiEncoderPassage(normalize_passage(ctx["text"]), ctx["title"]) + + r.positive_passages = [create_passage(ctx) for ctx in positive_ctxs] + r.negative_passages = [create_passage(ctx) for ctx in negative_ctxs] + r.hard_negative_passages = [create_passage(ctx) for ctx in hard_negative_ctxs] + + return r + + def _porcess_query(self, query): + query = normalize_question(query) + + if self.query_special_suffix and not query.endswith(self.query_special_suffix): + query += self.query_special_suffix + + return query + + def __len__(self): + return len(self.data) + + +class DataUtil: + """ + Class for working with datasets + """ + + def __init__(self): + self.tensorizer = BertTensorizer() + + def create_biencoder_input( + self, + samples: List[BiEncoderSample], + inserted_title, + num_hard_negatives=0, + num_other_negatives=0, + shuffle=True, + shuffle_positives=False, + hard_neg_positives=False, + hard_neg_fallback=True, + query_token=None, + ): + + question_tensors = [] + ctx_tensors = [] + positive_ctx_indices = [] + hard_neg_ctx_indices = [] + + for sample in samples: + + if shuffle and shuffle_positives: + positive_ctxs = sample.positive_passages + positive_ctx = positive_ctxs[np.random.choice(len(positive_ctxs))] + else: + positive_ctx = sample.positive_passages[0] + + neg_ctxs = sample.negative_passages + hard_neg_ctxs = sample.hard_negative_passages + question = sample.query + + if shuffle: + random.shuffle(neg_ctxs) + random.shuffle(hard_neg_ctxs) + + if hard_neg_fallback and len(hard_neg_ctxs) == 0: + hard_neg_ctxs = neg_ctxs[0:num_hard_negatives] + + neg_ctxs = neg_ctxs[0:num_other_negatives] + hard_neg_ctxs = hard_neg_ctxs[0:num_hard_negatives] + + all_ctxs = [positive_ctx] + neg_ctxs + hard_neg_ctxs + hard_negative_start_idx = 1 + hard_negative_end_idx = 1 + len(hard_neg_ctxs) + + current_ctxs_len = len(ctx_tensors) + + sample_ctxs_tensors = [ + self.tensorizer.text_to_tensor(ctx.text, title=ctx.title if (inserted_title and ctx.title) else None) + for ctx in all_ctxs + ] + + ctx_tensors.extend(sample_ctxs_tensors) + positive_ctx_indices.append(current_ctxs_len) + hard_neg_ctx_indices.append( + i + for i in range( + current_ctxs_len + hard_negative_start_idx, + current_ctxs_len + hard_negative_end_idx, + ) + ) + """if query_token: + if query_token == "[START_END]": + query_span = _select_span + else: + question_tensors.append(self.tensorizer.text_to_tensor(" ".join([query_token, question]))) + else:""" + + question_tensors.append(self.tensorizer.text_to_tensor(question)) + + ctxs_tensor = paddle.concat([paddle.reshape(ctx, [1, -1]) for ctx in ctx_tensors], axis=0) + questions_tensor = paddle.concat([paddle.reshape(q, [1, -1]) for q in question_tensors], axis=0) + + ctx_segments = paddle.zeros_like(ctxs_tensor) + question_segments = paddle.zeros_like(questions_tensor) + + return BiENcoderBatch( + questions_tensor, + question_segments, + ctxs_tensor, + ctx_segments, + positive_ctx_indices, + hard_neg_ctx_indices, + "question", + ) + + +class BertTensorizer: + def __init__(self, pad_to_max=True, max_length=256): + self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") + self.max_length = max_length + self.pad_to_max = pad_to_max + + def text_to_tensor( + self, + text: str, + title=None, + ): + text = text.strip() + + if title: + token_ids = self.tokenizer.encode( + text, + text_pair=title, + max_seq_len=self.max_length, + pad_to_max_seq_len=False, + truncation_strategy="longest_first", + )["input_ids"] + else: + token_ids = self.tokenizer.encode( + text, + 
max_seq_len=self.max_length, + pad_to_max_seq_len=False, + truncation_strategy="longest_first", + )["input_ids"] + + seq_len = self.max_length + if self.pad_to_max and len(token_ids) < seq_len: + token_ids = token_ids + [self.tokenizer.pad_token_type_id] * (seq_len - len(token_ids)) + if len(token_ids) >= seq_len: + token_ids = token_ids[0:seq_len] + token_ids[-1] = 102 + + return paddle.to_tensor(token_ids) diff --git a/examples/semantic_indexing/README.md b/examples/semantic_indexing/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9b37dd24b7372a81d70b08ae5c3def72669bad8d --- /dev/null +++ b/examples/semantic_indexing/README.md @@ -0,0 +1,297 @@ +# 语义索引 + +语义索引技术是搜索引擎、推荐系统、广告系统在召回阶段的核心技术之一, 语义索引模型的效果直接决定了语义相关的物料能否被成功召回进入系统参与上层排序,从基础层面影响整个系统的效果。 + +语义索引库提供了前沿语义索引策略的训练、语义索引模型的效果评估方案、支持用户基于我们开源的语义索引模型进行文本 Pair 的相似度计算或者 Embedding 语义表示抽取。 + +我们基于 ERNIE1.0 热启,分别采用 [In-batch negatives](https://arxiv.org/abs/2004.04906) 策略和 HardestNeg 策略开源了 [batch_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/batch_neg_v1.0.tar) 和 [hardest_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/hardest_neg_v1.0.tar) 模型,相比 Baseline 模型效果有显著提升: + +## 效果评估 +| 模型 | Recall@10 | Recall@50 |策略简要说明| +| ------------ | ------------ | ------------ |--------- | +| Baseline | 46.99 | 60.84 | 标准 pair-wise 训练范式,通过随机采样产生负样本| +| [In-batch negatives](https://arxiv.org/abs/2004.04906) | 51.20(**+4.21**) | 67.24(**+6.4**) | 在 Batch 内同时使用 batch_size 个负样本进行训练| +| HardestNeg | 50.22(**+3.23**) | 65.17(**+4.33**) |<div style="width: 340pt"> 在 Batch 内先挖掘最难负样本,然后进行 pair-wise 训练</div>| + + +## 语义索引预训练模型下载 +以下模型结构参数为: +`TrasformerLayer:12, Hidden:768, Heads:12, OutputEmbSize: 256` + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[batch_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/batch_neg_v1.0.tar)|<div style="width: 150pt">margin:0.2 scale:30 epoch:3 lr:5E-5 bs:128 max_len:64 </div>|<div style="width: 100pt">单卡v100-16g</div>|da1bb1487bd3fd6a53b8ef95c278f3e6| +|[hardest_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/hardest_neg_v1.0.tar)|margin:0.2 epoch:3 lr:5E-5 bs:128 max_len:64 |单卡v100-16g|b535d890110ea608c8562c525a0b84b5| + + +## 数据准备 +### 数据生成 +我们基于开源语义匹配数据集构造生成了面向语义索引的训练集、评估集、召回库。 +#### 构造训练集 +从开源语义相似度任务评测数据集([LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html)、[BQ Corpus](http://icrc.hitsz.edu.cn/Article/show/175.html)、[PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx))的训练集和测试集中抽取出所有语义相似的文本 Pair 作为训练集 [semantic_pair_train.tsv](https://bj.bcebos.com/paddlenlp/models/semantic_index/semantic_pair_train.tsv)。 + +[In-batch negatives](https://arxiv.org/abs/2004.04906) 策略和 HardestNeg 策略训练数据每一行由 `tab` 分隔的语义相似的文本 Pair 对,样例数据如下: +``` +欢打篮球的男生喜欢什么样的女生 爱打篮球的男生喜欢什么样的女生 +我手机丢了,我想换个手机 我想买个新手机,求推荐 +求秋色之空漫画全集 求秋色之空全集漫画 +学日语软件手机上的 手机学日语的软件 +``` + + +#### 构造评估集 +从开源语义相似度数据集([LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html)、[BQ Corpus](http://icrc.hitsz.edu.cn/Article/show/175.html)、[PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx)) 的验证集中抽取出正例文本 Pair 生成评估集 [same_semantic.tsv](https://bj.bcebos.com/paddlenlp/models/semantic_index/same_semantic.tsv),其中第 1 列文本作为输入模型的源文本 *Source Text*、第 2 列文本作为语义相似的目标文本 *Target Text*。 + +#### 构造召回库 +抽取出开源语义相似度数据集([LCQMC](http://icrc.hitsz.edu.cn/Article/show/171.html)、[BQ Corpus](http://icrc.hitsz.edu.cn/Article/show/175.html)、[PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx))训练集中的所有文本和验证集中所有文本 Pair 的第 2 列 
*Target Text* 生成召回库 [corpus_file](https://bj.bcebos.com/paddlenlp/models/semantic_index/corpus_file) + + +### 数据下载 +|数据|描述|数量|MD5| +| ------------ | ------------ | ------------ | -------- | +|<div style="width: 180pt">[训练集(semantic_pair_train.tsv)](https://bj.bcebos.com/paddlenlp/models/semantic_index/semantic_pair_train.tsv)</div>|<div style="width: 220pt">每行为语义相似的文本 Pair 构成的训练集</div>|222546|590286f695200160350cc5838cb34f00| +|[评估集(same_semantic.tsv)](https://bj.bcebos.com/paddlenlp/models/semantic_index/same_semantic.tsv)|每行为语义相似文本 Pair 构成的评估集|10255|86ec1fd5234d944177574372dcf780c5| +|[召回库(corpus_file)](https://bj.bcebos.com/paddlenlp/models/semantic_index/corpus_file)|每行为单条文本构成的召回库|313714|a3fbc3421b5aeb939809876fc7beeaa8| + + +## 项目依赖: +- [hnswlib](https://github.com/nmslib/hnswlib) + +## 代码结构及说明 +``` +|—— train_batch_neg.py # In-batch negatives 策略的训练主脚本 +|—— train_hardest_neg.py # HardestNeg 策略的训练主脚本 +|—— batch_negative + |—— model.py # In-batch negatives 策略核心网络结构 +|——hardest_negative + |—— model.py # HardestNeg 策略核心网络结构 +|—— ann_util.py # Ann 建索引库相关函数 +|—— base_model.py # 语义索引模型基类 +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +``` + +## 模型训练 +### 基于 [In-batch negatives](https://arxiv.org/abs/2004.04906) 策略训练 +以我们提供的语义相似度训练数据为例,通过如下命令,指定 GPU 0,1,2,3 卡, 基于 In-batch negatives 策略开始训练模型 + +``` +python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ + train_batch_neg.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --output_emb_size 256 \ + --save_steps 500 \ + --max_seq_length 64 \ + --margin 0.2 \ + --train_set_file semantic_pair_train.tsv \ +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `save_dir`: 模型存储路径 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `margin`: 正样本相似度与负样本之间的目标 Gap +* `train_set_file`: 训练集文件 + + +### 基于 HardestNeg 策略训练 +以我们提供的语义相似度训练集为例子,通过如下命令,指定 GPU 0,1,2,3 卡, 开始模型训练 + +``` +python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ + train_hardest_neg.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --output_emb_size 256 \ + --save_steps 500 \ + --max_seq_length 64 \ + --margin 0.2 \ + --train_set_file semantic_pair_train.tsv \ + +``` + +## 效果评估 +语义索引模型的目标是: 给定输入文本,模型可以从海量候选召回库中快速、准确地召回一批语义相关文本。 + +### 评估指标 +采用 Recall@10 和 Recall@50 指标来评估语义索引模型的召回效果 + +### 开始评估 +效果评估分为 3 个步骤: +1. ANN 建库 +首先基于语义索引模型抽取出召回库的文本向量,然后使用 ANN 引擎建索引库(当前基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +2. 召回 +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 1 步中建立的索引库中进行 ANN 查询召回 Top50 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件 + +3. 
评估: 基于评估集 [same_semantic.tsv](https://bj.bcebos.com/paddlenlp/models/semantic_index/same_semantic.tsv) 和召回结果 `recall_result` 计算评估指标 R@10 和 R@50 + +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +python -u -m paddle.distributed.launch --gpus "0" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${checkpoints_params_file}" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "semantic_similar_pair.tsv" \ + --corpus_file "corpus_file" \ +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `params_path`: 待评估模型的参数文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `corpus_file`: 召回库数据 corpus_file + +成功运行结束后,会在 `./recall_result_dir/` 目录下产出 `recall_result.txt` 文件,部分召回示例结果如下: +``` +开初婚未育证明怎么弄? 初婚未育证明怎么开? 0.9878678917884827 +开初婚未育证明怎么弄? 初婚未育情况证明怎么开? 0.955365777015686 +开初婚未育证明怎么弄? 初婚未育证明在哪里办理 0.9508345723152161 +开初婚未育证明怎么弄? 到哪里开初婚未育证明? 0.949864387512207 +``` + +接下来,运行如下命令进行效果评估,产出 R@10 和 R@50 指标: +``` + python -u evaluate.py \ + --similar_pair_file "semantic_similar_pair.tsv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +成功运行结束后,会输出如下评估指标, 分别为 R@10 和 R@50 +``` +51.2 67.242 +``` + +## 开始预测 +我们可以基于语义索引模型抽取文本的语义向量或者计算文本 Pair 的语义相似度,我们以计算文本 Pair 的语义相似度为例: + +### 准备预测数据 +待预测数据为 tab 分隔的 tsv 文件,每一行为 1 个文本 Pair,部分示例如下: +``` +西安下雪了?是不是很冷啊? 西安的天气怎么样啊?还在下雪吗? +第一次去见女朋友父母该如何表现? 第一次去见家长该怎么做 +猪的护心肉怎么切 猪的护心肉怎么吃 +显卡驱动安装不了,为什么? 
显卡驱动安装不了怎么回事 +``` + +### 开始预测 +以上述 demo 数据为例,运行如下命令基于我们开源的 [In-batch negatives](https://arxiv.org/abs/2004.04906) 策略语义索引模型开始计算文本 Pair 的语义相似度: +``` +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "./checkpoints/batch_neg_v1.0.0/model_state.pdparams" \ + --output_emb_size 256 + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file ${your_input_file} +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `params_path`: 预训练模型的参数文件名 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `text_pair_file`: 由文本 Pair 构成的待预测数据集 + +产出如下结果 +``` +0.8121148943901062 +0.6034126281738281 +0.968634843826294 +0.9800204038619995 +``` + +## 使用 FastGeneration 加速预测 + +我们基于 Paddle 自定义算子功能集成了[NVIDIA FasterTransformer](https://github.com/NVIDIA/FasterTransformer) 的高性能加速能力,通过简单易用的 Python API 即可得到 GPU 上更高性能预测能力。 +- FT FP32 相比 Paddle 前向加速比为 1.13 ~ 4.36 +- FT FP16 相比 Paddle 前向加速比为 3.65 ~ 5.42 +- 支持 Post-Normalization 和 Pre-Normalizaiton 2 种 Transformer 结构 +- 支持 GELU 和 RELU 2 个激活函数 + +详细性能评测数据如下表: + +| batch size | max_seq_len | Paddle 前向(ms)|FT FP32(ms) | FT FP16(ms) |Speedup(FT FP32/Paddle)|Speedup(FT FP16/Paddle)| +| ---------- | ----------- | ------------------- | ------------------- |------------------ |------------------ |------------------ | +| 16 | 16 | 23.56 | 5.40 | 5.38 | 4.36| 4.38| +| 16 | 32 | 22.34 | 8.11 | 5.57|2.75|4.01| +| 16 | 64 | 22.79 | 14.84 |5.39|1.54|4.23| +| 32 | 16 | 23.41 | 8.16 |5.30|2.87|4.42| +| 32 | 32 | 22.67 | 14.84 |6.21|1.53|3.65| +| 32 | 64 | 33.49 | 28.53 |6.05|1.17|5.54| +| 64 | 16 | 22.60 | 14.81 |5.59|1.53|4.04| +| 64 | 32 | 33.52 | 28.22 |6.24|1.19|5.37| +| 64 | 64 | 62.62 | 55.25 |11.55|1.13|5.42| + +Note: 测试环境如下 +``` +硬件: NVIDIA Tesla V100 16G 单卡 +Paddle Version: 2.2.1 +CUDA: 10.1 +cuDNN: 7.6 +``` + +可参考如下命令使用高性能预测能力 +```shell +python -u -m paddle.distributed.launch --gpus "0" faster_predict.py \ + --params_path "batch_neg_v1.0/model_state.pdparams" \ + --output_emb_size 256 \ + --batch_size 32 \ + --max_seq_length 64 \ + --use_fp16 \ + --text_pair_file ${your_input_file} \ +``` + +## 模型介绍 +简要介绍 In-batch negatives 策略和 HardestNeg 策略思路 + +### [In-batch negatives](https://arxiv.org/abs/2004.04906) 核心思路 + +In-batch negatives 策略的训练数据为语义相似的 Pair 对,如下所示为 Batch size = 4 的训练数据样例: +``` +我手机丢了,我想换个手机 我想买个新手机,求推荐 +求秋色之空漫画全集 求秋色之空全集漫画 +学日语软件手机上的 手机学日语的软件 +侠盗飞车罪恶都市怎样改车 侠盗飞车罪恶都市怎么改车 +``` +In-batch negatives 策略核心是在 1 个 Batch 内同时基于 N 个负例进行梯度更新,将Batch 内除自身之外其它所有 *Source Text* 的相似文本 *Target Text* 作为负例,例如: 上例中 `我手机丢了,我想换个手机` 有 1 个正例(`1.我想买个新手机,求推荐`),3 个负例(`1.求秋色之空全集漫画`,`2.手机学日语的软件`,`3.侠盗飞车罪恶都市怎么改车`)。 + +### HardestNeg 核心思路 +HardestNeg 策略核心是在 1 个 Batch 内的所有负样本中先挖掘出最难区分的负样本,基于最难负样本进行梯度更新。例如: 上例中 *Source Text*: `我手机丢了,我想换个手机` 有 3 个负例(`1.求秋色之空全集漫画`,`2.手机学日语的软件`,`3.侠盗飞车罪恶都市怎么改车`),其中最难区分的负例是 `手机学日语的软件`,模型训练过程中不断挖掘出类似这样的最难负样本,然后基于最难负样本进行梯度更新。 + +## Reference +[1] Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, Buzhou Tang, LCQMC: A Large-scale Chinese Question Matching Corpus,COLING2018. +[2] Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, Buzhou Tang, The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification EMNLP2018. +[3] Yang, Y., Zhang, Y., Tar, C., and Baldridge, J., “PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification”, <i>arXiv e-prints</i>, 2019. +[4] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. 
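+
+## 附录: In-batch negatives loss (minimal sketch)
+
+The snippet below is a minimal, self-contained sketch of the In-batch negatives loss described above. It is for illustration only, not the training script itself: the batch size and embedding size are toy values, and ``margin=0.2`` / ``scale=30`` follow the default settings listed in this README; the computation mirrors the idea implemented in `batch_negative/model.py`.
+
+```python
+import paddle
+import paddle.nn.functional as F
+
+# Toy batch: 4 source texts and their semantically similar target texts,
+# each represented by an L2-normalized 256-dim embedding.
+query_emb = F.normalize(paddle.randn([4, 256]), axis=-1)
+title_emb = F.normalize(paddle.randn([4, 256]), axis=-1)
+
+# Row i holds the cosine similarity of source i to every target in the batch;
+# only the diagonal entries correspond to positive pairs.
+cosine_sim = paddle.matmul(query_emb, title_emb, transpose_y=True)
+
+# Subtract the margin from the positive (diagonal) similarities, scale the
+# logits, and treat training as a batch_size-way classification problem.
+margin, scale = 0.2, 30
+cosine_sim = (cosine_sim - margin * paddle.eye(4)) * scale
+labels = paddle.arange(0, 4, dtype="int64").reshape([-1, 1])
+loss = F.cross_entropy(input=cosine_sim, label=labels)
+print(float(loss))
+```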
diff --git a/examples/semantic_indexing/README_gradient_cache.md b/examples/semantic_indexing/README_gradient_cache.md new file mode 100644 index 0000000000000000000000000000000000000000..1f5223c6dfb7c64458a7552a69cc4b86875570cb --- /dev/null +++ b/examples/semantic_indexing/README_gradient_cache.md @@ -0,0 +1,129 @@ +# Gradient Cache策略 [DPR](https://arxiv.org/abs/2004.04906) + + +### 实验结果 + +`Gradient Cache` 的实验结果如下,使用的评估指标是`Accuracy`: + +| DPR method | TOP-5 | TOP-10 | TOP-50| 说明 | +| :-----: | :----: | :----: | :----: | :---- | +| Gradient_cache | 68.1 | 79.4| 86.2 | DPR结合GC策略训练 +| GC_Batch_size_512 | 67.3 | 79.6| 86.3| DPR结合GC策略训练,且batch_size设置为512| + +实验对应的超参数如下: + +| Hyper Parameter | batch_size| learning_rate| warmup_steps| epoches| chunk_size|max_grad_norm | +| :----: | :----: | :----: | :----: | :---: | :----: | :----: | +| \ | 128/512| 2e-05 | 1237 | 40 | 2| 16/8 | + +## 数据准备 +我们使用Dense Passage Retrieval的[原始仓库](https://github.com/Elvisambition/DPR) +中提供的数据集进行训练和评估。可以使用[download_data.py](https://github.com/Elvisambition/DPR/blob/main/dpr/data/download_data.py) +脚本下载所需数据集。 数据集详细介绍见[原仓库](https://github.com/Elvisambition/DPR) 。 + +### 数据格式 +``` +[ + { + "question": "....", + "answers": ["...", "...", "..."], + "positive_ctxs": [{ + "title": "...", + "text": "...." + }], + "negative_ctxs": ["..."], + "hard_negative_ctxs": ["..."] + }, + ... +] +``` + +### 数据下载 +在[原始仓库](https://github.com/Elvisambition/DPR) +下使用命令 +``` +python data/download_data.py --resource data.wikipedia_split.psgs_w100 +python data/download_data.py --resource data.retriever.nq +python data/download_data.py --resource data.retriever.qas.nq +``` +### 单独下载链接 +[data.retriever.nq-train](https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz) +[data.retriever.nq-dev](https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz) +[data.retriever.qas.nq-dev](https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-dev.qa.csv) +[data.retriever.qas.nq-test](https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-test.qa.csv) +[data.retriever.qas.nq-train](https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-train.qa.csv) +[psgs_w100.tsv](https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz) + + +## 代码结构及说明 +``` +|—— train_gradient_cache_DPR.py # gradient_cache实现dense passage retrieval训练脚本 +|—— train_gradient_cache.py # gradient_cache算法简单实现 +|—— NQdataset.py # NQ数据集封装 +|—— generate_dense_embeddings.py # 生成文本的稠密表示 +|—— faiss_indexer.py # faiss相关indexer封装 +|—— dense_retriever.py # 召回,指标检测 +|—— qa_validation.py # 相关计算匹配函数 +|—— tokenizers.py # tokenizer封装 +``` + +## 模型训练 +### 基于 [Dense Passage Retriever](https://arxiv.org/abs/2004.04906) 策略训练 +``` +python train_gradient_cache_DPR.py \ + --batch_size 128 \ + --learning_rate 2e-05 \ + --save_dir save_biencoder + --warmup_steps 1237 \ + --epoches 40 \ + --max_grad_norm 2 \ + --train_data_path ./dataset_dir/biencoder-nq-train.json \ + --chunk_size 16 \ +``` + +参数含义说明 +* `batch_size`: 批次大小 +* `learning_rate`: 学习率 +* `save_dir`: 模型保存位置 +* `warmupsteps`: 预热学习率参数 +* `epoches`: 训练批次大小 +* `max_grad_norm`: 详见ClipGradByGlobalNorm +* `train_data_path`: 训练数据存放地址 +* `chunk_size`: chunk的大小 + +## 生成文章稠密向量表示 + +``` +python generate_dense_embeddings.py \ + --ctx_file ./dataset_dir/psgs_w100.tsv \ + --out_file test_generate \ + --que_model_path ./save_dir/question_model_40 \ + --con_model_path ./save_dir/context_model_40 +``` + + +参数含义说明 +* `ctx_file`: ctx文件读取地址 +* `out_file`: 生成后的文件输出地址 +* `que_model_path`: question model path +* `con_model_path`: context 
model path + + +## 针对全部文档的检索器验证 +``` +python dense_retriever.py --hnsw_index \ + --out_file out_file \ + --encoded_ctx_file ./test_generate \ + --ctx_file ./dataset_dir/psgs_w100.tsv \ + --qa_file ./dataset_dir/nq.qa.csv \ + --que_model_path ./save_dir/question_model_40 \ + --con_model_path ./save_dir/context_model_40 +``` +参数含义说明 +* `hnsw_index`:使用hnsw_index +* `outfile`: 输出文件地址 +* `encoded_ctx_file`: 编码后的ctx文件 +* `ctx_file`: ctx文件 +* `qa_file`: qa_file文件 +* `que_model_path`: question encoder model +* `con_model_path`: context encoder model diff --git a/examples/semantic_indexing/ance/model.py b/examples/semantic_indexing/ance/model.py new file mode 100644 index 0000000000000000000000000000000000000000..bfee796e7d0789b0e02da0a0663805c0220a3d46 --- /dev/null +++ b/examples/semantic_indexing/ance/model.py @@ -0,0 +1,63 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexANCE(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + self.margin = margin + + def forward( + self, + text_input_ids, + pos_sample_input_ids, + neg_sample_input_ids, + text_token_type_ids=None, + text_position_ids=None, + text_attention_mask=None, + pos_sample_token_type_ids=None, + pos_sample_position_ids=None, + pos_sample_attention_mask=None, + neg_sample_token_type_ids=None, + neg_sample_position_ids=None, + neg_sample_attention_mask=None, + ): + + text_cls_embedding = self.get_pooled_embedding( + text_input_ids, text_token_type_ids, text_position_ids, text_attention_mask + ) + + pos_sample_cls_embedding = self.get_pooled_embedding( + pos_sample_input_ids, pos_sample_token_type_ids, pos_sample_position_ids, pos_sample_attention_mask + ) + + neg_sample_cls_embedding = self.get_pooled_embedding( + neg_sample_input_ids, neg_sample_token_type_ids, neg_sample_position_ids, neg_sample_attention_mask + ) + + pos_sample_sim = paddle.sum(text_cls_embedding * pos_sample_cls_embedding, axis=-1) + + # Note: The negatives samples is sampled by ANN engine in global corpus + # Please refer to run_ann_data_gen.py + global_neg_sample_sim = paddle.sum(text_cls_embedding * neg_sample_cls_embedding, axis=-1) + + labels = paddle.full(shape=[text_cls_embedding.shape[0]], fill_value=1.0, dtype=paddle.get_default_dtype()) + + loss = F.margin_ranking_loss(pos_sample_sim, global_neg_sample_sim, labels, margin=self.margin) + + return loss diff --git a/examples/semantic_indexing/ann_util.py b/examples/semantic_indexing/ann_util.py new file mode 100644 index 0000000000000000000000000000000000000000..55c608d3e58c37c0d9baf884b270178d3ac5da7f --- /dev/null +++ b/examples/semantic_indexing/ann_util.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import numpy as np +import hnswlib +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space="ip", dim=args.output_emb_size) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index(max_elements=args.hnsw_max_elements, ef_construction=args.hnsw_ef, M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/examples/semantic_indexing/base_model.py b/examples/semantic_indexing/base_model.py new file mode 100644 index 0000000000000000000000000000000000000000..b7b1dcd27449a0a1a02c0de455ca3ff6fecc406c --- /dev/null +++ b/examples/semantic_indexing/base_model.py @@ -0,0 +1,106 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
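+
+"""Base class shared by the semantic indexing models.
+
+``SemanticIndexBase`` pools the [CLS] representation of a pretrained encoder,
+optionally projects it to ``output_emb_size`` dimensions, L2-normalizes it,
+and provides the ``cosine_sim`` and ``get_semantic_embedding`` helpers;
+subclasses implement ``forward`` to define their training loss.
+"""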
+ +import abc + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None, use_fp16=False): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + self.use_fp16 = use_fp16 + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + if self.use_fp16: + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype) * -1e4, axis=[1, 2] + ) + + embedding_output = self.ptm.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + embedding_output = paddle.cast(embedding_output, "float16") + attention_mask = paddle.cast(attention_mask, "float16") + + encoder_outputs = self.ptm.encoder(embedding_output, attention_mask) + + if self.use_fp16: + encoder_outputs = paddle.cast(encoder_outputs, "float32") + cls_embedding = self.ptm.pooler(encoder_outputs) + else: + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding(input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass diff --git a/examples/semantic_indexing/batch_negative/model.py b/examples/semantic_indexing/batch_negative/model.py new file mode 100644 index 0000000000000000000000000000000000000000..fd87c6d8363efc4f54db6c6bd5d7b623ea68ab59 --- /dev/null +++ b/examples/semantic_indexing/batch_negative/model.py @@ -0,0 +1,65 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import paddle
+import paddle.nn.functional as F
+from base_model import SemanticIndexBase
+
+
+class SemanticIndexBatchNeg(SemanticIndexBase):
+    def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None):
+        super().__init__(pretrained_model, dropout, output_emb_size)
+
+        self.margin = margin
+        # Scale the cosine similarities to ease convergence during training.
+        self.scale = scale
+
+    def forward(
+        self,
+        query_input_ids,
+        title_input_ids,
+        query_token_type_ids=None,
+        query_position_ids=None,
+        query_attention_mask=None,
+        title_token_type_ids=None,
+        title_position_ids=None,
+        title_attention_mask=None,
+    ):
+
+        query_cls_embedding = self.get_pooled_embedding(
+            query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask
+        )
+
+        title_cls_embedding = self.get_pooled_embedding(
+            title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask
+        )
+
+        cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True)
+
+        # Subtract the margin from the positive-pair (diagonal) cosine similarities.
+        margin_diag = paddle.full(
+            shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype()
+        )
+
+        cosine_sim = cosine_sim - paddle.diag(margin_diag)
+
+        # Scale the similarities to ease training convergence.
+        cosine_sim *= self.scale
+
+        labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64")
+        labels = paddle.reshape(labels, shape=[-1, 1])
+
+        loss = F.cross_entropy(input=cosine_sim, label=labels)
+
+        return loss
diff --git a/examples/semantic_indexing/biencoder_base_model.py b/examples/semantic_indexing/biencoder_base_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..a3649db7c2be67688e0b6b359b7fb6155bb4c299
--- /dev/null
+++ b/examples/semantic_indexing/biencoder_base_model.py
@@ -0,0 +1,97 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
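+
+"""Dual-encoder (DPR-style) model definition.
+
+``BiEncoder`` wraps a question encoder and a context encoder and returns their
+pooled [CLS] embeddings; ``BiEncoderNllLoss`` scores every question against all
+passages in the batch and computes the negative log-likelihood of the positive
+passage for each question.
+"""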
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class BiEncoder(nn.Layer): + """dual-encoder model + + Attributes: + state: for question or for context + question_encoder: used to code the problem + context_encoder: used to code the context + + """ + + def __init__(self, question_encoder, context_encoder, state=None): + super(BiEncoder, self).__init__() + self.state = state + if self.state is None: + self.question_encoder = question_encoder + self.context_encoder = context_encoder + elif self.state == "FORQUESTION": + self.question_encoder = question_encoder + elif self.state == "FORCONTEXT": + self.context_encoder = context_encoder + + def get_question_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + _, cls_embedding = self.question_encoder(input_ids, token_type_ids, position_ids, attention_mask) + """cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1)""" + + return cls_embedding + + def get_context_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + _, cls_embedding = self.context_encoder(input_ids, token_type_ids, position_ids, attention_mask) + """cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1)""" + + return cls_embedding + + def forward( + self, + question_id, + question_segments, + question_attn_mask, + context_ids, + context_segments, + context_attn_mask, + ): + + question_pooled_out = self.get_question_pooled_embedding(question_id, question_segments, question_attn_mask) + context_pooled_out = self.get_context_pooled_embedding(context_ids, context_segments, context_attn_mask) + + return question_pooled_out, context_pooled_out + + +class BiEncoderNllLoss(object): + """ + calculate the nll loss for dual-encoder model + """ + + def calc(self, q_vectors, ctx_vectors, positive_idx_per_question, loss_scale=None): + + scorces = paddle.matmul(q_vectors, paddle.transpose(ctx_vectors, [1, 0])) + + # if len(q_vectors.shape()) > 1: + q_num = q_vectors.shape[0] + scores = scorces.reshape([q_num, -1]) + + softmax_scorces = F.log_softmax(scores, axis=1) + + loss = F.nll_loss(softmax_scorces, paddle.to_tensor(positive_idx_per_question)) + + correct_predictions_count = None + + if loss_scale: + loss.mul_(loss_scale) + + return loss, correct_predictions_count diff --git a/examples/semantic_indexing/data.py b/examples/semantic_indexing/data.py new file mode 100644 index 0000000000000000000000000000000000000000..c8e2e232f370bc986751bc56eda2e195f7fc3a36 --- /dev/null +++ b/examples/semantic_indexing/data.py @@ -0,0 +1,156 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
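+
+"""Data utilities for the semantic indexing examples.
+
+Includes dataloader construction, tokenization of text pairs and triplets,
+readers for TSV pair/triplet files, and helpers used by the ANN (active
+learning) loop to locate the latest checkpoint and generated ann data.
+"""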
+ +import os + +import paddle + +from paddlenlp.utils.log import logger + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def read_text_triplet(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {"text": data[0], "pos_sample": data[1], "neg_sample": data[2]} + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists(succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, str(max(trained_steps)), "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, 
"new_ann_data") + # succed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data") + logger.info("Using lateset ann_data_file:{}".format(latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, "r", encoding="utf-8") as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, "r", encoding="utf-8") as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/examples/semantic_indexing/dense_retriever.py b/examples/semantic_indexing/dense_retriever.py new file mode 100644 index 0000000000000000000000000000000000000000..030e3726c48884add5f8f9e191033b1fd34dade7 --- /dev/null +++ b/examples/semantic_indexing/dense_retriever.py @@ -0,0 +1,288 @@ +#!/usr/bin/env python3 + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Copyright GC-DPR authors. +# Copyright (c) Facebook, Inc. and its affiliates. +# All rights reserved. +# +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. 
+""" + Command line tool to get dense results and validate them +""" + +import argparse +import csv +import glob +import gzip +import json +import logging +import pickle +import time +from typing import Dict, Iterator, List, Tuple + +import numpy as np +import paddle +from biencoder_base_model import BiEncoder +from faiss_indexer import DenseFlatIndexer, DenseHNSWFlatIndexer, DenseIndexer +from NQdataset import BertTensorizer +from paddle import Tensor as T +from paddle import nn +from qa_validation import calculate_matches + +from paddlenlp.transformers.bert.modeling import BertModel + +logger = logging.getLogger() +logger.setLevel(logging.INFO) +if logger.hasHandlers(): + logger.handlers.clear() +console = logging.StreamHandler() +logger.addHandler(console) + + +class DenseRetriever(object): + """ + Does passage retrieving over the provided index and question encoder + """ + + def __init__(self, question_encoder: nn.Layer, batch_size: int, tensorizer: BertTensorizer, index: DenseIndexer): + self.question_encoder = question_encoder + self.batch_size = batch_size + self.tensorizer = tensorizer + self.index = index + + def generate_question_vectors(self, questions: List[str]) -> T: + n = len(questions) + bsz = self.batch_size + query_vectors = [] + + self.question_encoder.eval() + + with paddle.no_grad(): + for j, batch_start in enumerate(range(0, n, bsz)): + + batch_token_tensors = [ + self.tensorizer.text_to_tensor(q) for q in questions[batch_start : batch_start + bsz] + ] + q_ids_batch = paddle.stack(batch_token_tensors, axis=0) + q_seg_batch = paddle.zeros_like(q_ids_batch) + out = self.question_encoder.get_question_pooled_embedding(q_ids_batch, q_seg_batch) + query_vectors.extend(out) + if len(query_vectors) % 100 == 0: + logger.info("Encoded queries %d", len(query_vectors)) + + query_tensor = paddle.to_tensor(query_vectors) + logger.info("Total encoded queries tensor %s", query_tensor.shape[0]) + assert query_tensor.shape[0] == len(questions) + return query_tensor + + def get_top_docs(self, query_vectors: np.array, top_docs: int = 100) -> List[Tuple[List[object], List[float]]]: + """ + Does the retrieval of the best matching passages given the query vectors batch + :param query_vectors: + :param top_docs: + :return: + """ + time0 = time.time() + results = self.index.search_knn(query_vectors, top_docs) + logger.info("index search time: %f sec.", time.time() - time0) + return results + + +def parse_qa_csv_file(location) -> Iterator[Tuple[str, List[str]]]: + with open(location) as ifile: + reader = csv.reader(ifile, delimiter="\t") + for row in reader: + question = row[0] + answers = eval(row[1]) + yield question, answers + + +def validate( + passages: Dict[object, Tuple[str, str]], + answers: List[List[str]], + result_ctx_ids: List[Tuple[List[object], List[float]]], + workers_num: int, + match_type: str, +) -> List[List[bool]]: + match_stats = calculate_matches(passages, answers, result_ctx_ids, workers_num, match_type) + top_k_hits = match_stats.top_k_hits + + logger.info("Validation results: top k documents hits %s", top_k_hits) + top_k_hits = [v / len(result_ctx_ids) for v in top_k_hits] + logger.info("Validation results: top k documents hits accuracy %s", top_k_hits) + return match_stats.questions_doc_hits + + +def load_passages(ctx_file: str) -> Dict[object, Tuple[str, str]]: + docs = {} + logger.info("Reading data from: %s", ctx_file) + if ctx_file.endswith(".gz"): + with gzip.open(ctx_file, "rt") as tsvfile: + reader = csv.reader( + tsvfile, + delimiter="\t", + ) + # file format: 
doc_id, doc_text, title + for row in reader: + if row[0] != "id": + docs[row[0]] = (row[1], row[2]) + else: + with open(ctx_file) as tsvfile: + reader = csv.reader( + tsvfile, + delimiter="\t", + ) + # file format: doc_id, doc_text, title + for row in reader: + if row[0] != "id": + docs[row[0]] = (row[1], row[2]) + return docs + + +def save_results( + passages: Dict[object, Tuple[str, str]], + questions: List[str], + answers: List[List[str]], + top_passages_and_scores: List[Tuple[List[object], List[float]]], + per_question_hits: List[List[bool]], + out_file: str, +): + # join passages text with the result ids, their questions and assigning has|no answer labels + merged_data = [] + assert len(per_question_hits) == len(questions) == len(answers) + for i, q in enumerate(questions): + q_answers = answers[i] + results_and_scores = top_passages_and_scores[i] + hits = per_question_hits[i] + docs = [passages[doc_id] for doc_id in results_and_scores[0]] + scores = [str(score) for score in results_and_scores[1]] + ctxs_num = len(hits) + + merged_data.append( + { + "question": q, + "answers": q_answers, + "ctxs": [ + { + "id": results_and_scores[0][c], + "title": docs[c][1], + "text": docs[c][0], + "score": scores[c], + "has_answer": hits[c], + } + for c in range(ctxs_num) + ], + } + ) + + with open(out_file, "w") as writer: + writer.write(json.dumps(merged_data, indent=4) + "\n") + logger.info("Saved results * scores to %s", out_file) + + +def iterate_encoded_files(vector_files: list) -> Iterator[Tuple[object, np.array]]: + for i, file in enumerate(vector_files): + logger.info("Reading file %s", file) + with open(file, "rb") as reader: + doc_vectors = pickle.load(reader) + for doc in doc_vectors: + db_id, doc_vector = doc + yield db_id, doc_vector + + +def main(args): + + tensorizer = BertTensorizer() + question_model = BertModel.from_pretrained(args.que_model_path) + context_model = BertModel.from_pretrained(args.con_model_path) + model = BiEncoder(question_encoder=question_model, context_encoder=context_model) + model.eval() + if args.hnsw_index: + index = DenseHNSWFlatIndexer(768, args.index_buffer) + else: + index = DenseFlatIndexer(768, args.index_buffer) + + retriever = DenseRetriever(model, args.batch_size, tensorizer, index) + # get questions & answers + questions = [] + question_answers = [] + for ds_item in parse_qa_csv_file(args.qa_file): + question, answers = ds_item + questions.append(question) + question_answers.append(answers) + questions_tensor = retriever.generate_question_vectors(questions) + # index all passages + ctx_files_pattern = args.encoded_ctx_file + input_paths = glob.glob(ctx_files_pattern) + + logger.info("Reading all passages data from files: %s", input_paths) + retriever.index.index_data(input_paths) + + # get top k results + top_ids_and_scores = retriever.get_top_docs(questions_tensor.numpy(), args.n_docs) + all_passages = load_passages(args.ctx_file) + if len(all_passages) == 0: + raise RuntimeError("No passages data found. 
Please specify ctx_file param properly.") + questions_doc_hits = validate( + all_passages, question_answers, top_ids_and_scores, args.validation_workers, args.match + ) + if args.out_file: + save_results(all_passages, questions, question_answers, top_ids_and_scores, questions_doc_hits, args.out_file) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--qa_file", + required=True, + type=str, + default=None, + help="Question and answers file of the format: question \\t ['answer1','answer2', ...]", + ) + parser.add_argument( + "--ctx_file", + required=True, + type=str, + default=None, + help="All passages file in the tsv format: id \\t passage_text \\t title", + ) + parser.add_argument( + "--encoded_ctx_file", + type=str, + default=None, + help="Glob path to encoded passages (from generate_dense_embeddings tool)", + ) + parser.add_argument("--out_file", type=str, default=None, help="output .json file path to write results to ") + parser.add_argument( + "--match", type=str, default="string", choices=["regex", "string"], help="Answer matching logic type" + ) + parser.add_argument("--n-docs", type=int, default=200, help="Amount of top docs to return") + parser.add_argument( + "--validation_workers", type=int, default=16, help="Number of parallel processes to validate results" + ) + parser.add_argument("--batch_size", type=int, default=32, help="Batch size for question encoder forward pass") + parser.add_argument( + "--index_buffer", type=int, default=50000, help="Temporal memory data buffer size (in samples) for indexer" + ) + parser.add_argument( + "--hnsw_index", action="store_true", help="If enabled, use inference time efficient HNSW index" + ) + parser.add_argument("--que_model_path", required=True, type=str) + parser.add_argument("--con_model_path", required=True, type=str) + args = parser.parse_args() + + main(args) diff --git a/examples/semantic_indexing/evaluate.py b/examples/semantic_indexing/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..7107c08961aa61f529f78044d5337924f0a83735 --- /dev/null +++ b/examples/semantic_indexing/evaluate.py @@ -0,0 +1,77 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
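+
+# NOTE: illustrative summary added for readers; not part of the original example code.
+# This script reports Recall@N for the output of recall.py. Each line of --recall_result_file
+# is "query \t recalled_text \t distance", and every --recall_num consecutive lines belong to
+# one query; a line counts as a hit when recalled_text equals the gold paraphrase recorded for
+# that query in --similar_text_pair.
+#
+# Worked example of the metric itself: with per-query hit flags
+#     rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
+# recall(rs, N=1) averages [0, 0, 1] -> 0.333..., while recall(rs, N=3) averages [1, 1, 1] -> 1.0.
+# The script prints Recall@10 and Recall@50 (as percentages) separated by a tab.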
+ +import argparse + +import numpy as np + +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default="", help="The full path of similat pair file") +parser.add_argument("--recall_result_file", type=str, default="", help="The full path of recall result file") +parser.add_argument( + "--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query" +) +args = parser.parse_args() + + +def recall(rs, N=10): + """ + Ratio of recalled Ground Truth at topN Recalled Docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + >>> 0.6666667 + >>> recall(rs, N=3) + >>> 1.0 + Args: + rs: Iterator of recalled flag() + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, "r", encoding="utf-8") as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + + rs = [] + + with open(args.recall_result_file, "r", encoding="utf-8") as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text == recalled_text: + continue + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + + recall_N = [] + for topN in (10, 50): + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + print("\t".join(recall_N)) diff --git a/examples/semantic_indexing/faiss_indexer.py b/examples/semantic_indexing/faiss_indexer.py new file mode 100644 index 0000000000000000000000000000000000000000..a0a2eb9aa7dbaad477e2469dff8bb887ffe03734 --- /dev/null +++ b/examples/semantic_indexing/faiss_indexer.py @@ -0,0 +1,216 @@ +#!/usr/bin/env python3 + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Copyright (c) Facebook, Inc. and its affiliates. +# All rights reserved. +# +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. 
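+
+# NOTE: illustrative summary added for readers; not part of the upstream DPR source.
+# Two indexers are provided below:
+#   * DenseFlatIndexer: exact inner-product search via faiss.IndexFlatIP.
+#   * DenseHNSWFlatIndexer: approximate search via faiss.IndexHNSWFlat, which supports only L2
+#     distance. Inner-product search is reduced to L2 search by appending one auxiliary
+#     dimension: each document vector d is stored as [d, sqrt(phi - ||d||^2)] with
+#     phi = max_d ||d||^2, and each query q is searched as [q, 0]. Then
+#         ||[q, 0] - [d, aux]||^2 = ||q||^2 + phi - 2 * <q, d>,
+#     so ranking by ascending L2 distance is equivalent to ranking by descending inner product.
+#     This is why _set_phi() must scan all vectors before any batch is indexed.
+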
+""" + FAISS-based index components for dense retriver +""" + +import os +import time +import logging +import pickle +from typing import List, Tuple, Iterator + +import faiss +import numpy as np + +logger = logging.getLogger() + + +class DenseIndexer(object): + """ + Class for building, saving, and finding indexes + """ + + def __init__(self, buffer_size: int = 50000): + self.buffer_size = buffer_size + self.index_id_to_db_id = [] + self.index = None + + def index_data(self, vector_files: List[str]): + start_time = time.time() + buffer = [] + for i, item in enumerate(iterate_encoded_files(vector_files)): + db_id, doc_vector = item + buffer.append((db_id, doc_vector)) + if 0 < self.buffer_size == len(buffer): + # indexing in batches is beneficial for many faiss index types + self._index_batch(buffer) + logger.info( + "data indexed %d, used_time: %f sec.", len(self.index_id_to_db_id), time.time() - start_time + ) + buffer = [] + self._index_batch(buffer) + + indexed_cnt = len(self.index_id_to_db_id) + logger.info("Total data indexed %d", indexed_cnt) + logger.info("Data indexing completed.") + + def _index_batch(self, data: List[Tuple[object, np.array]]): + raise NotImplementedError + + def search_knn(self, query_vectors: np.array, top_docs: int) -> List[Tuple[List[object], List[float]]]: + raise NotImplementedError + + def serialize(self, file: str): + logger.info("Serializing index to %s", file) + + if os.path.isdir(file): + index_file = os.path.join(file, "index.dpr") + meta_file = os.path.join(file, "index_meta.dpr") + else: + index_file = file + ".index.dpr" + meta_file = file + ".index_meta.dpr" + + faiss.write_index(self.index, index_file) + with open(meta_file, mode="wb") as f: + pickle.dump(self.index_id_to_db_id, f) + + def deserialize_from(self, file: str): + logger.info("Loading index from %s", file) + + if os.path.isdir(file): + index_file = os.path.join(file, "index.dpr") + meta_file = os.path.join(file, "index_meta.dpr") + else: + index_file = file + ".index.dpr" + meta_file = file + ".index_meta.dpr" + + self.index = faiss.read_index(index_file) + logger.info("Loaded index of type %s and size %d", type(self.index), self.index.ntotal) + + with open(meta_file, "rb") as reader: + self.index_id_to_db_id = pickle.load(reader) + assert ( + len(self.index_id_to_db_id) == self.index.ntotal + ), "Deserialized index_id_to_db_id should match faiss index size" + + def _update_id_mapping(self, db_ids: List): + self.index_id_to_db_id.extend(db_ids) + + +class DenseFlatIndexer(DenseIndexer): + def __init__(self, vector_sz: int, buffer_size: int = 50000): + super(DenseFlatIndexer, self).__init__(buffer_size=buffer_size) + self.index = faiss.IndexFlatIP(vector_sz) + + def _index_batch(self, data: List[Tuple[object, np.array]]): + db_ids = [t[0] for t in data] + vectors = [np.reshape(t[1], (1, -1)) for t in data] + vectors = np.concatenate(vectors, axis=0) + self._update_id_mapping(db_ids) + self.index.add(vectors) + + def search_knn(self, query_vectors: np.array, top_docs: int) -> List[Tuple[List[object], List[float]]]: + scores, indexes = self.index.search(query_vectors, top_docs) + # convert to external ids + db_ids = [[self.index_id_to_db_id[i] for i in query_top_idxs] for query_top_idxs in indexes] + result = [(db_ids[i], scores[i]) for i in range(len(db_ids))] + return result + + +class DenseHNSWFlatIndexer(DenseIndexer): + """ + Efficient index for retrieval. 
Note: default settings are for hugh accuracy but also high RAM usage + """ + + def __init__( + self, + vector_sz: int, + buffer_size: int = 50000, + store_n: int = 512, + ef_search: int = 128, + ef_construction: int = 200, + ): + super(DenseHNSWFlatIndexer, self).__init__(buffer_size=buffer_size) + + # IndexHNSWFlat supports L2 similarity only + # so we have to apply DOT -> L2 similairy space conversion with the help of an extra dimension + index = faiss.IndexHNSWFlat(vector_sz + 1, store_n) + index.hnsw.efSearch = ef_search + index.hnsw.efConstruction = ef_construction + self.index = index + self.phi = None + + def index_data(self, vector_files: List[str]): + self._set_phi(vector_files) + + super(DenseHNSWFlatIndexer, self).index_data(vector_files) + + def _set_phi(self, vector_files: List[str]): + """ + Calculates the max norm from the whole data and assign it to self.phi: necessary to transform IP -> L2 space + :param vector_files: file names to get passages vectors from + :return: + """ + phi = 0 + for i, item in enumerate(iterate_encoded_files(vector_files)): + id, doc_vector = item + norms = (doc_vector**2).sum() + phi = max(phi, norms) + logger.info("HNSWF DotProduct -> L2 space phi={}".format(phi)) + self.phi = phi + + def _index_batch(self, data: List[Tuple[object, np.array]]): + # max norm is required before putting all vectors in the index to convert inner product similarity to L2 + if self.phi is None: + raise RuntimeError( + "Max norm needs to be calculated from all data at once," + "results will be unpredictable otherwise." + "Run `_set_phi()` before calling this method." + ) + + db_ids = [t[0] for t in data] + vectors = [np.reshape(t[1], (1, -1)) for t in data] + + norms = [(doc_vector**2).sum() for doc_vector in vectors] + aux_dims = [np.sqrt(self.phi - norm) for norm in norms] + hnsw_vectors = [np.hstack((doc_vector, aux_dims[i].reshape(-1, 1))) for i, doc_vector in enumerate(vectors)] + hnsw_vectors = np.concatenate(hnsw_vectors, axis=0) + + self._update_id_mapping(db_ids) + self.index.add(hnsw_vectors) + + def search_knn(self, query_vectors: np.array, top_docs: int) -> List[Tuple[List[object], List[float]]]: + + aux_dim = np.zeros(len(query_vectors), dtype="float32") + query_nhsw_vectors = np.hstack((query_vectors, aux_dim.reshape(-1, 1))) + logger.info("query_hnsw_vectors %s", query_nhsw_vectors.shape) + scores, indexes = self.index.search(query_nhsw_vectors, top_docs) + # convert to external ids + db_ids = [[self.index_id_to_db_id[i] for i in query_top_idxs] for query_top_idxs in indexes] + result = [(db_ids[i], scores[i]) for i in range(len(db_ids))] + return result + + def deserialize_from(self, file: str): + super(DenseHNSWFlatIndexer, self).deserialize_from(file) + # to trigger warning on subsequent indexing + self.phi = None + + +def iterate_encoded_files(vector_files: list) -> Iterator[Tuple[object, np.array]]: + for i, file in enumerate(vector_files): + logger.info("Reading file %s", file) + with open(file, "rb") as reader: + doc_vectors = pickle.load(reader) + for doc in doc_vectors: + db_id, doc_vector = doc + yield db_id, doc_vector diff --git a/examples/semantic_indexing/fast_predict.py b/examples/semantic_indexing/fast_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..fb63539b208406390ed96a87daf38d6d11a92e6b --- /dev/null +++ b/examples/semantic_indexing/fast_predict.py @@ -0,0 +1,184 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial +from pprint import pprint + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.ops import disable_fast_encoder, enable_fast_encoder +from paddlenlp.transformers import ErnieModel, ErnieTokenizer + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") + parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") + parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") + parser.add_argument( + "--max_seq_length", + default=64, + type=int, + help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--dropout", default=0.0, type=float, help="Dropout probability.") + parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") + parser.add_argument("--seed", default=42, type=int, help="Random seed.") + parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max_seq_len.") + parser.add_argument("--use_fp16", action="store_true", help="Whether to use fp16.") + + args = parser.parse_args() + return args + + +class SemanticIndexingPredictor(nn.Layer): + def __init__(self, pretrained_model, output_emb_size, bos_id=0, dropout=0, use_fp16=False): + super(SemanticIndexingPredictor, self).__init__() + self.bos_id = bos_id + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.0) + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + self.use_fp16 = use_fp16 + + def get_pooled_embedding(self, input_ids, token_type_ids=None, position_ids=None): + src_mask = input_ids == self.bos_id + src_mask = paddle.cast(src_mask, "float32") + # [bs, 1, 1, max_len] + src_mask = paddle.unsqueeze(src_mask, axis=[1, 2]) + src_mask.stop_gradient = True + + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + + embedding_output = self.ptm.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + if self.use_fp16: + embedding_output = paddle.cast(embedding_output, "float16") + + sequence_output = self.ptm.encoder(embedding_output, src_mask) + + if self.use_fp16: + sequence_output = paddle.cast(sequence_output, "float32") + + 
cls_embedding = self.ptm.pooler(sequence_output) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + title_token_type_ids=None, + title_position_ids=None, + ): + query_cls_embedding = self.get_pooled_embedding(query_input_ids, query_token_type_ids, query_position_ids) + title_cls_embedding = self.get_pooled_embedding(title_input_ids, title_token_type_ids, title_position_ids) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def load(self, init_from_params): + if init_from_params and os.path.isfile(init_from_params): + state_dict = paddle.load(init_from_params) + self.set_state_dict(state_dict) + print("Loaded parameters from %s" % init_from_params) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + +def do_predict(args): + paddle.set_device("gpu") + paddle.seed(args.seed) + tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0") + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + + def batchify_fn(samples): + fn = Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ) + return [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + pretrained_model = ErnieModel.from_pretrained("ernie-1.0") + + model = SemanticIndexingPredictor( + pretrained_model, args.output_emb_size, dropout=args.dropout, use_fp16=args.use_fp16 + ) + model.eval() + model.load(args.params_path) + model = enable_fast_encoder(model, use_fp16=args.use_fp16) + + cosine_sims = [] + for batch_data in valid_data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + batch_cosine_sim = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + for cosine in cosine_sims: + print("{}".format(cosine)) + model = disable_fast_encoder(model) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + do_predict(args) diff --git a/examples/semantic_indexing/generate_dense_embeddings.py b/examples/semantic_indexing/generate_dense_embeddings.py new file mode 100644 index 0000000000000000000000000000000000000000..b28d5a4e24e5546add92c1a70eede3abddae04a4 --- /dev/null +++ b/examples/semantic_indexing/generate_dense_embeddings.py @@ -0,0 +1,143 @@ +#!/usr/bin/env python3 + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Copyright GC-DPR authors. +# Copyright (c) Facebook, Inc. and its affiliates. +# All rights reserved. +# +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. +""" + Command line tool that produces embeddings for a large documents base based on the pretrained ctx & question encoders + Supposed to be used in a 'sharded' way to speed up the process. +""" +import argparse +import csv +import logging +import os +import pathlib +import pickle +from typing import List, Tuple + +import numpy as np +import paddle +from biencoder_base_model import BiEncoder +from NQdataset import BertTensorizer +from paddle import nn +from paddle.io import DataLoader, Dataset +from tqdm import tqdm + +from paddlenlp.transformers.bert.modeling import BertModel + +logger = logging.getLogger() +logger.setLevel(logging.INFO) +if logger.hasHandlers(): + logger.handlers.clear() +console = logging.StreamHandler() +logger.addHandler(console) + + +class CtxDataset(Dataset): + def __init__(self, ctx_rows: List[Tuple[object, str, str]], tensorizer: BertTensorizer, insert_title: bool = True): + self.rows = ctx_rows + self.tensorizer = tensorizer + self.insert_title = insert_title + + def __len__(self): + return len(self.rows) + + def __getitem__(self, item): + ctx = self.rows[item] + + return self.tensorizer.text_to_tensor(ctx[1], title=ctx[2] if self.insert_title else None) + + +def no_op_collate(xx: List[object]): + return xx + + +def gen_ctx_vectors( + ctx_rows: List[Tuple[object, str, str]], model: nn.Layer, tensorizer: BertTensorizer, insert_title: bool = True +) -> List[Tuple[object, np.array]]: + bsz = args.batch_size + total = 0 + results = [] + + dataset = CtxDataset(ctx_rows, tensorizer, insert_title) + loader = DataLoader( + dataset, shuffle=False, num_workers=2, collate_fn=no_op_collate, drop_last=False, batch_size=bsz + ) + + for batch_id, batch_token_tensors in enumerate(tqdm(loader)): + ctx_ids_batch = paddle.stack(batch_token_tensors, axis=0) + ctx_seg_batch = paddle.zeros_like(ctx_ids_batch) + with paddle.no_grad(): + out = model.get_context_pooled_embedding(ctx_ids_batch, ctx_seg_batch) + + out = out.astype("float32").cpu() + batch_start = batch_id * bsz + ctx_ids = [r[0] for r in ctx_rows[batch_start : batch_start + bsz]] + assert len(ctx_ids) == out.shape[0] + total += len(ctx_ids) + results.extend([(ctx_ids[i], out[i].reshape([-1]).numpy()) for i in range(out.shape[0])]) + + return results + + +def main(args): + + tensorizer = BertTensorizer() + question_model = BertModel.from_pretrained(args.que_model_path) + context_model = BertModel.from_pretrained(args.con_model_path) + model = BiEncoder(question_encoder=question_model, context_encoder=context_model) + + rows = [] + with open(args.ctx_file) as tsvfile: + reader = csv.reader(tsvfile, delimiter="\t") + # file format: doc_id, doc_text, title + rows.extend([(row[0], row[1], row[2]) for row in reader if row[0] != "id"]) + + shard_size = 
int(len(rows) / args.num_shards) + start_idx = args.shard_id * shard_size + end_idx = start_idx + shard_size + + logger.info("Producing encodings for passages range: %d to %d (out of total %d)", start_idx, end_idx, len(rows)) + rows = rows[start_idx:end_idx] + data = gen_ctx_vectors(rows, model, tensorizer, True) + file = args.out_file + "_" + str(args.shard_id) + ".pkl" + pathlib.Path(os.path.dirname(file)).mkdir(parents=True, exist_ok=True) + logger.info("Writing results to %s" % file) + with open(file, mode="wb") as f: + pickle.dump(data, f) + + logger.info("Total passages processed %d. Written to %s", len(data), file) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + + parser.add_argument("--ctx_file", type=str, default=None, help="Path to passages set .tsv file") + parser.add_argument( + "--out_file", required=True, type=str, default=None, help="output file path to write results to" + ) + parser.add_argument("--shard_id", type=int, default=0, help="Number(0-based) of data shard to process") + parser.add_argument("--num_shards", type=int, default=1, help="Total amount of data shards") + parser.add_argument("--batch_size", type=int, default=32, help="Batch size for the passage encoder forward pass") + parser.add_argument("--que_model_path", type=str) + parser.add_argument("--con_model_path", type=str) + args = parser.parse_args() + + main(args) diff --git a/examples/semantic_indexing/gradient_cache/model.py b/examples/semantic_indexing/gradient_cache/model.py new file mode 100644 index 0000000000000000000000000000000000000000..04745d09788948faa9fbaf69b3d9dc17b752f0e2 --- /dev/null +++ b/examples/semantic_indexing/gradient_cache/model.py @@ -0,0 +1,93 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
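+
+# NOTE: illustrative summary added for readers; not part of the original example code.
+# This variant of the in-batch-negative model supports gradient-cache training (GC-DPR style):
+# instead of returning a loss, forward() returns the margin-adjusted, scaled similarity matrix
+# together with the labels and both embedding tensors. A training loop can then
+#   1) run a memory-cheap, no-grad forward pass over the full batch
+#      (get_pooled_embedding_with_no_grad),
+#   2) compute the contrastive loss and the gradients with respect to the cached embeddings, and
+#   3) replay the encoder over small sub-batches, back-propagating through each chunk,
+# which keeps the effective batch size large while bounding peak GPU memory.
+# Attributes such as self.ptm, self.use_fp16 and self.emb_reduce_linear are inherited from
+# base_model.SemanticIndexBase.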
+ +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexCacheNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def get_pooled_embedding_with_no_grad( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None + ): + if self.use_fp16: + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype) * -1e4, axis=[1, 2] + ) + + with paddle.no_grad(): + embedding_output = self.ptm.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + + embedding_output = paddle.cast(embedding_output, "float16") + attention_mask = paddle.cast(attention_mask, "float16") + + with paddle.no_grad(): + encoder_outputs = self.ptm.encoder(embedding_output, attention_mask) + if self.use_fp16: + encoder_outputs = paddle.cast(encoder_outputs, "float32") + cls_embedding = self.ptm.pooler(encoder_outputs) + else: + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + return cls_embedding + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + return cosine_sim, labels, query_cls_embedding, title_cls_embedding diff --git a/examples/semantic_indexing/hardest_negative/model.py b/examples/semantic_indexing/hardest_negative/model.py new file mode 100644 index 0000000000000000000000000000000000000000..ce4db41341c2df248be7e9f52fd4dc4733a70ab2 --- /dev/null +++ b/examples/semantic_indexing/hardest_negative/model.py @@ -0,0 +1,59 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn.functional as F +from base_model import SemanticIndexBase + + +class SemanticIndexHardestNeg(SemanticIndexBase): + def __init__(self, pretrained_model, dropout=None, margin=0.3, output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + self.margin = margin + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + pos_sim = paddle.max(cosine_sim, axis=-1) + + # subtract 10000 from all diagnal elements of cosine_sim + mask_socre = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=10000, dtype=paddle.get_default_dtype() + ) + tmp_cosin_sim = cosine_sim - paddle.diag(mask_socre) + hardest_negative_sim = paddle.max(tmp_cosin_sim, axis=-1) + + labels = paddle.full(shape=[query_cls_embedding.shape[0]], fill_value=1.0, dtype="float32") + + loss = F.margin_ranking_loss(pos_sim, hardest_negative_sim, labels, margin=self.margin) + return loss diff --git a/examples/semantic_indexing/predict.py b/examples/semantic_indexing/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..741bb4ffdf45e9a043ab19be6b5e08bed94b00cd --- /dev/null +++ b/examples/semantic_indexing/predict.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.ops import convert_to_fp16 +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max seq length.") +parser.add_argument("--use_fp16", action="store_true", help="Whether to use_fp16") +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + cosine_sims = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size, use_fp16=args.use_fp16) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + if args.use_fp16: + convert_to_fp16(model.ptm.encoder) + + cosin_sim = predict(model, valid_data_loader) + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) diff --git a/examples/semantic_indexing/qa_validation.py b/examples/semantic_indexing/qa_validation.py new file mode 100644 index 0000000000000000000000000000000000000000..e4be203ec57a8acfd9391e9bbdf3f50107ad67ad --- /dev/null 
+++ b/examples/semantic_indexing/qa_validation.py @@ -0,0 +1,158 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + Set of utilities for Q&A results validation tasks - Retriver passage validation and Reader predicted answer validation +""" + +import collections +import logging +import string +import unicodedata +from functools import partial +from multiprocessing import Pool as ProcessPool +from typing import Tuple, List, Dict +import regex as re +from tokenizers import SimpleTokenizer + +logger = logging.getLogger(__name__) +QAMatchStats = collections.namedtuple("QAMatchStats", ["top_k_hits", "questions_doc_hits"]) + + +def calculate_matches( + all_docs: Dict[object, Tuple[str, str]], + answers: List[List[str]], + closest_docs: List[Tuple[List[object], List[float]]], + workers_num: int, + match_type: str, +) -> QAMatchStats: + """ + Evaluates answers presence in the set of documents. This function is supposed to be used with a large collection of + documents and results. It internally forks multiple sub-processes for evaluation and then merges results + :param all_docs: dictionary of the entire documents database. doc_id -> (doc_text, title) + :param answers: list of answers's list. One list per question + :param closest_docs: document ids of the top results along with their scores + :param workers_num: amount of parallel threads to process data + :param match_type: type of answer matching. Refer to has_answer code for available options + :return: matching information tuple. + top_k_hits - a list where the index is the amount of top documents retrieved and the value is the total amount of + valid matches across an entire dataset. 
+ questions_doc_hits - more detailed info with answer matches for every question and every retrieved document + """ + global dpr_all_documents + dpr_all_documents = all_docs + tok_opts = {} + tokenizer = SimpleTokenizer(**tok_opts) + processes = ProcessPool( + processes=workers_num, + ) + logger.info("Matching answers in top docs...") + get_score_partial = partial(check_answer, match_type=match_type, tokenizer=tokenizer) + + questions_answers_docs = zip(answers, closest_docs) + scores = processes.map(get_score_partial, questions_answers_docs) + logger.info("Per question validation results len=%d", len(scores)) + n_docs = len(closest_docs[0][0]) + top_k_hits = [0] * n_docs + for question_hits in scores: + best_hit = next((i for i, x in enumerate(question_hits) if x), None) + if best_hit is not None: + top_k_hits[best_hit:] = [v + 1 for v in top_k_hits[best_hit:]] + + return QAMatchStats(top_k_hits, scores) + + +def check_answer(questions_answers_docs, tokenizer, match_type) -> List[bool]: + """Search through all the top docs to see if they have any of the answers.""" + answers, (doc_ids, doc_scores) = questions_answers_docs + global dpr_all_documents + hits = [] + for i, doc_id in enumerate(doc_ids): + doc = dpr_all_documents[doc_id] + text = doc[0] + + answer_found = False + if text is None: # cannot find the document for some reason + logger.warning("no doc in db") + hits.append(False) + continue + if has_answer(answers, text, tokenizer, match_type): + answer_found = True + hits.append(answer_found) + return hits + + +def has_answer(answers, text, tokenizer, match_type) -> bool: + """Check if a document contains an answer string. + If `match_type` is string, token matching is done between the text and answer. + If `match_type` is regex, we search the whole text with the regex. 
+ """ + text = _normalize(text) + if match_type == "string": + # Answer is a list of possible strings + text = tokenizer.tokenize(text).words(uncased=True) + + for single_answer in answers: + single_answer = _normalize(single_answer) + single_answer = tokenizer.tokenize(single_answer) + single_answer = single_answer.words(uncased=True) + + for i in range(0, len(text) - len(single_answer) + 1): + if single_answer == text[i : i + len(single_answer)]: + return True + + elif match_type == "regex": + # Answer is a regex + for single_answer in answers: + single_answer = _normalize(single_answer) + if regex_match(text, single_answer): + return True + return False + + +def regex_match(text, pattern): + """Test if a regex pattern is contained within a text.""" + try: + pattern = re.compile( + pattern, + flags=re.IGNORECASE + re.UNICODE + re.MULTILINE, + ) + except BaseException: + return False + return pattern.search(text) is not None + + +# function for the reader model answer validation +def exact_match_score(prediction, ground_truth): + return _normalize_answer(prediction) == _normalize_answer(ground_truth) + + +def _normalize_answer(s): + def remove_articles(text): + return re.sub(r"\b(a|an|the)\b", " ", text) + + def white_space_fix(text): + return " ".join(text.split()) + + def remove_punc(text): + exclude = set(string.punctuation) + return "".join(ch for ch in text if ch not in exclude) + + def lower(text): + return text.lower() + + return white_space_fix(remove_articles(remove_punc(lower(s)))) + + +def _normalize(text): + return unicodedata.normalize("NFD", text) diff --git a/examples/semantic_indexing/recall.py b/examples/semantic_indexing/recall.py new file mode 100644 index 0000000000000000000000000000000000000000..79a3d8b8a0ee9399f876fb48e6f7fef7a67aec2a --- /dev/null +++ b/examples/semantic_indexing/recall.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# coding=UTF-8 + +import argparse +import os +from functools import partial + +import paddle +from ann_util import build_index +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader, gen_id2corpus, gen_text_file + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + model = SemanticIndexBase(pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + 
) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, args.recall_result_file) + with open(recall_result_file, "w", encoding="utf-8") as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write( + "{}\t{}\t{}\n".format( + text_list[text_index]["text"], id2corpus[doc_idx], 1.0 - cosine_sims[row_index][idx] + ) + ) diff --git a/examples/semantic_indexing/requirements.txt b/examples/semantic_indexing/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..4c314ba430a859acd10457d796ff21ee97602eb8 --- /dev/null +++ b/examples/semantic_indexing/requirements.txt @@ -0,0 +1,9 @@ +faiss==1.5.3 +hnswlib==0.6.2 +numpy==1.22.4 +paddle==1.0.2 +paddlenlp==2.3.4 +paddlepaddle==2.3.1 +regex==2022.7.25 +spacy==3.4.1 +tqdm==4.64.0 diff --git a/examples/semantic_indexing/run_ann_data_gen.py b/examples/semantic_indexing/run_ann_data_gen.py new file mode 100644 index 0000000000000000000000000000000000000000..503a49be1eb2ebd5f5bd9b1bf6e0a60314c9010c --- /dev/null +++ b/examples/semantic_indexing/run_ann_data_gen.py @@ -0,0 +1,191 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
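+
+# run_ann_data_gen.py: iteratively generates ANCE-style training data. It polls
+# --save_dir for the newest SemanticIndexANCE checkpoint, embeds the corpus and the
+# query side of --similar_text_pair_file with it, builds an HNSW ANN index, takes hard
+# negatives from the tail of each query's top --topk_training recalls, and writes
+# (query, similar_text, hard_negative) triples to a new folder under --ann_data_dir.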
+ +import argparse +import os +import time +from functools import partial + +import paddle +from ance.model import SemanticIndexANCE +from ann_util import build_index +from data import ( + convert_example, + create_dataloader, + gen_id2corpus, + gen_text_file, + get_latest_ann_data, + get_latest_checkpoint, +) + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import MapDataset +from paddlenlp.transformers import AutoModel, AutoTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() + +# Required parameters +parser.add_argument("--similar_text_pair_file", default=None, type=str, required=True, help="The train_set tsv file that each line is simialr text pair") +parser.add_argument("--corpus_file", default=None, type=str, required=True, help="The corpus file that each line is a text for buinding indexing") +parser.add_argument("--save_dir", default=None, type=str, required=True, help="Saved model dir, will look for latest checkpoint dir in here") +parser.add_argument("--ann_data_dir", default=None, type=str, required=True, help="The output directory where the training data will be written") + +parser.add_argument("--init_from_ckpt", default=None, type=str, help="Initial model dir, will use this if no checkpoint is found in model_dir") +parser.add_argument("--end_ann_step", default=1000000, type=int, help="Stop after this number of data versions has been generated, default run forever") +parser.add_argument("--batch_size", default=128, type=int, help="Batch size for predicting embedding of texts") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") + +parser.add_argument("--max_seq_length", default=128, type=int, help="Batch size for predicting embedding of texts") +parser.add_argument("--topk_training", default=500, type=int, help="top k from which negative samples are collected") +parser.add_argument("--num_negative_sample", default=5, type=int, help="at each resample, how many negative samples per query do I use") + +# hnsw argument +parser.add_argument("--hnsw_m", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=10, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +args = parser.parse_args() +# yapf: enable + + +def generate_new_ann(args, data_loader_dict, checkpoint_path, latest_step_num): + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + model = SemanticIndexANCE(pretrained_model, output_emb_size=args.output_emb_size) + + logger.info("checkpoint_path:{}".format(checkpoint_path)) + state_dict = paddle.load(checkpoint_path) + + model.set_dict(state_dict) + logger.info("load params from:{}".format(checkpoint_path)) + + logger.info("***** inference of corpus *****") + final_index = build_index(args, data_loader_dict["corpus_data_loader"], model) + + logger.info("***** inference of query *****") + query_embedding = model.get_semantic_embedding(data_loader_dict["text_data_loader"]) + + text_list = data_loader_dict["text_list"] + id2corpus = data_loader_dict["id2corpus"] + text2similar_text = data_loader_dict["text2similar_text"] + + new_ann_data_path = os.path.join(args.ann_data_dir, str(latest_step_num)) + if not os.path.exists(new_ann_data_path): + os.mkdir(new_ann_data_path) + + with open(os.path.join(new_ann_data_path, "new_ann_data"), "w") as f: + for 
batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query(batch_query_embedding, args.topk_training) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + + hard_neg_samples = recalled_idx[row_index][-1 * args.num_negative_sample :] + + for idx, hard_neg_doc_idx in enumerate(hard_neg_samples): + text = text_list[text_index]["text"] + similar_text = text2similar_text[text] + hard_neg_sample = id2corpus[hard_neg_doc_idx] + f.write("{}\t{}\t{}\n".format(text, similar_text, hard_neg_sample)) + + succeed_flag_file = os.path.join(new_ann_data_path, "succeed_flag_file") + open(succeed_flag_file, "a").close() + logger.info("finish generate ann data step:{}".format(latest_step_num)) + + +def build_data_loader(args, tokenizer): + """build corpus_data_loader and text_data_loader""" + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + corpus_data_loader = create_dataloader( + corpus_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + # build text data_loader + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + text_ds = MapDataset(text_list) + + text_data_loader = create_dataloader( + text_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + d = { + "text_data_loader": text_data_loader, + "corpus_data_loader": corpus_data_loader, + "id2corpus": id2corpus, + "text2similar_text": text2similar_text, + "text_list": text_list, + } + + return d + + +def ann_data_gen(args): + # use init_from_ckpt as last_checkpoint + last_checkpoint = args.init_from_ckpt + + # get latest_ann_data_step to decide when stop gen_ann_data + _, latest_ann_data_step = get_latest_ann_data(args.ann_data_dir) + + rank = paddle.distributed.get_rank() + if rank == 0: + if not os.path.exists(args.ann_data_dir): + os.makedirs(args.ann_data_dir) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + data_load_dict = build_data_loader(args, tokenizer) + + while latest_ann_data_step <= args.end_ann_step: + next_checkpoint, latest_step_num = get_latest_checkpoint(args) + logger.info("next_checkpoint:{}".format(next_checkpoint)) + + if next_checkpoint == last_checkpoint: + logger.info("next_checkpoint == lase_checkpoint:{}".format(next_checkpoint)) + logger.info("sleep 10s") + time.sleep(10) + else: + logger.info("start generate ann data using checkpoint:{}".format(next_checkpoint)) + + generate_new_ann(args, data_load_dict, next_checkpoint, latest_step_num) + + logger.info("finished generating ann data step {}".format(latest_step_num)) + + last_checkpoint = next_checkpoint + + +def main(): + ann_data_gen(args) + + +if __name__ == "__main__": + main() diff --git a/examples/semantic_indexing/tokenizers.py b/examples/semantic_indexing/tokenizers.py new file mode 100644 index 0000000000000000000000000000000000000000..c7bd48a09bd79bfba995a1798147cd335e4bee3a --- /dev/null +++ b/examples/semantic_indexing/tokenizers.py @@ 
-0,0 +1,252 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Most of the tokenizers code here is copied from DrQA codebase to avoid adding extra dependency +""" + +import copy +import logging + +import regex +import spacy + +logger = logging.getLogger(__name__) + + +class Tokens(object): + """A class to represent a list of tokenized text.""" + + TEXT = 0 + TEXT_WS = 1 + SPAN = 2 + POS = 3 + LEMMA = 4 + NER = 5 + + def __init__(self, data, annotators, opts=None): + self.data = data + self.annotators = annotators + self.opts = opts or {} + + def __len__(self): + """The number of tokens.""" + return len(self.data) + + def slice(self, i=None, j=None): + """Return a view of the list of tokens from [i, j).""" + new_tokens = copy.copy(self) + new_tokens.data = self.data[i:j] + return new_tokens + + def untokenize(self): + """Returns the original text (with whitespace reinserted).""" + return "".join([t[self.TEXT_WS] for t in self.data]).strip() + + def words(self, uncased=False): + """Returns a list of the text of each token + + Args: + uncased: lower cases text + """ + if uncased: + return [t[self.TEXT].lower() for t in self.data] + else: + return [t[self.TEXT] for t in self.data] + + def offsets(self): + """Returns a list of [start, end) character offsets of each token.""" + return [t[self.SPAN] for t in self.data] + + def pos(self): + """Returns a list of part-of-speech tags of each token. + Returns None if this annotation was not included. + """ + if "pos" not in self.annotators: + return None + return [t[self.POS] for t in self.data] + + def lemmas(self): + """Returns a list of the lemmatized text of each token. + Returns None if this annotation was not included. + """ + if "lemma" not in self.annotators: + return None + return [t[self.LEMMA] for t in self.data] + + def entities(self): + """Returns a list of named-entity-recognition tags of each token. + Returns None if this annotation was not included. + """ + if "ner" not in self.annotators: + return None + return [t[self.NER] for t in self.data] + + def ngrams(self, n=1, uncased=False, filter_fn=None, as_strings=True): + """Returns a list of all ngrams from length 1 to n. 
+ + Args: + n: upper limit of ngram length + uncased: lower cases text + filter_fn: user function that takes in an ngram list and returns + True or False to keep or not keep the ngram + as_string: return the ngram as a string vs list + """ + + def _skip(gram): + if not filter_fn: + return False + return filter_fn(gram) + + words = self.words(uncased) + ngrams = [ + (s, e + 1) + for s in range(len(words)) + for e in range(s, min(s + n, len(words))) + if not _skip(words[s : e + 1]) + ] + + # Concatenate into strings + if as_strings: + ngrams = ["{}".format(" ".join(words[s:e])) for (s, e) in ngrams] + + return ngrams + + def entity_groups(self): + """Group consecutive entity tokens with the same NER tag.""" + entities = self.entities() + if not entities: + return None + non_ent = self.opts.get("non_ent", "O") + groups = [] + idx = 0 + while idx < len(entities): + ner_tag = entities[idx] + # Check for entity tag + if ner_tag != non_ent: + # Chomp the sequence + start = idx + while idx < len(entities) and entities[idx] == ner_tag: + idx += 1 + groups.append((self.slice(start, idx).untokenize(), ner_tag)) + else: + idx += 1 + return groups + + +class Tokenizer(object): + """Base tokenizer class. + Tokenizers implement tokenize, which should return a Tokens class. + """ + + def tokenize(self, text): + raise NotImplementedError + + def shutdown(self): + pass + + def __del__(self): + self.shutdown() + + +class SimpleTokenizer(Tokenizer): + ALPHA_NUM = r"[\p{L}\p{N}\p{M}]+" + NON_WS = r"[^\p{Z}\p{C}]" + + def __init__(self, **kwargs): + """ + Args: + annotators: None or empty set (only tokenizes). + """ + self._regexp = regex.compile( + "(%s)|(%s)" % (self.ALPHA_NUM, self.NON_WS), flags=regex.IGNORECASE + regex.UNICODE + regex.MULTILINE + ) + if len(kwargs.get("annotators", {})) > 0: + logger.warning( + "%s only tokenizes! Skipping annotators: %s" % (type(self).__name__, kwargs.get("annotators")) + ) + self.annotators = set() + + def tokenize(self, text): + data = [] + matches = [m for m in self._regexp.finditer(text)] + for i in range(len(matches)): + # Get text + token = matches[i].group() + + # Get whitespace + span = matches[i].span() + start_ws = span[0] + if i + 1 < len(matches): + end_ws = matches[i + 1].span()[0] + else: + end_ws = span[1] + + # Format data + data.append( + ( + token, + text[start_ws:end_ws], + span, + ) + ) + return Tokens(data, self.annotators) + + +class SpacyTokenizer(Tokenizer): + def __init__(self, **kwargs): + """ + Args: + annotators: set that can include pos, lemma, and ner. + model: spaCy model to use (either path, or keyword like 'en'). + """ + model = kwargs.get("model", "en") + self.annotators = copy.deepcopy(kwargs.get("annotators", set())) + nlp_kwargs = {"parser": False} + if not any([p in self.annotators for p in ["lemma", "pos", "ner"]]): + nlp_kwargs["tagger"] = False + if "ner" not in self.annotators: + nlp_kwargs["entity"] = False + self.nlp = spacy.load(model, **nlp_kwargs) + + def tokenize(self, text): + # We don't treat new lines as tokens. 
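+        # Each token is emitted as a 6-tuple (text, text_with_ws, span, pos, lemma, ner)
+        # whose order must match the Tokens.TEXT ... Tokens.NER column indices above.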
+ clean_text = text.replace("\n", " ") + tokens = self.nlp.tokenizer(clean_text) + if any([p in self.annotators for p in ["lemma", "pos", "ner"]]): + self.nlp.tagger(tokens) + if "ner" in self.annotators: + self.nlp.entity(tokens) + + data = [] + for i in range(len(tokens)): + # Get whitespace + start_ws = tokens[i].idx + if i + 1 < len(tokens): + end_ws = tokens[i + 1].idx + else: + end_ws = tokens[i].idx + len(tokens[i].text) + + data.append( + ( + tokens[i].text, + text[start_ws:end_ws], + (tokens[i].idx, tokens[i].idx + len(tokens[i].text)), + tokens[i].tag_, + tokens[i].lemma_, + tokens[i].ent_type_, + ) + ) + + # Set special option for non-entity tag: '' vs 'O' in spaCy + return Tokens(data, self.annotators, opts={"non_ent": ""}) diff --git a/examples/semantic_indexing/train_ance.py b/examples/semantic_indexing/train_ance.py new file mode 100644 index 0000000000000000000000000000000000000000..a06b5847429f1b8fb73edc1721dfa32b5270134c --- /dev/null +++ b/examples/semantic_indexing/train_ance.py @@ -0,0 +1,184 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from ance.model import SemanticIndexANCE +from data import ( + convert_example, + create_dataloader, + get_latest_ann_data, + get_latest_checkpoint, + read_text_triplet, +) + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoints', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--ann_data_dir", default='./ann_data', type=str, help="The output directory where the ann generated training data will be saved.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--max_training_steps", default=1000000, type=int, help="The maximum total steps for training") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Inteval steps to save checkpoint") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file") +parser.add_argument("--margin", default=0.3, type=float, help="Margin for pair-wise margin_rank_loss") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + latest_checkpoint, latest_global_step = get_latest_checkpoint(args) + logger.info("get latest_checkpoint:{}".format(latest_checkpoint)) + + model = SemanticIndexANCE(pretrained_model, margin=args.margin, output_emb_size=args.output_emb_size) + + if latest_checkpoint: + state_dict = paddle.load(latest_checkpoint) + model.set_dict(state_dict) + print("warmup from:{}".format(latest_checkpoint)) + + model = paddle.DataParallel(model) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # pos_sample_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # pos_sample_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # neg_sample_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # neg_sample_segment + ): [data for data in fn(samples)] + + global_step = 0 + + while global_step < args.max_training_steps: + latest_ann_data, latest_ann_data_step = get_latest_ann_data(args.ann_data_dir) + + if latest_ann_data_step == -1: + # No ann_data generated yet + latest_ann_data = args.train_set_file + logger.info("No ann_data generated yet, Use training_set:{}".format(args.train_set_file)) + else: + # Using ann_data to training model + logger.info("Latest ann_data is ready for training: [{}]".format(latest_ann_data)) + + train_ds = load_dataset(read_text_triplet, 
data_path=latest_ann_data, lazy=False) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0) + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=clip, + ) + + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + ( + text_input_ids, + text_token_type_ids, + pos_sample_input_ids, + pos_sample_token_type_ids, + neg_sample_input_ids, + neg_sample_token_type_ids, + ) = batch + + loss = model( + text_input_ids=text_input_ids, + pos_sample_input_ids=pos_sample_input_ids, + neg_sample_input_ids=neg_sample_input_ids, + text_token_type_ids=text_token_type_ids, + pos_sample_token_type_ids=pos_sample_token_type_ids, + neg_sample_token_type_ids=neg_sample_token_type_ids, + ) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s, trainning_file: %s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train), latest_ann_data) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, str(global_step)) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + # Flag to indicate succeefully save model + succeed_flag_file = os.path.join(save_dir, "succeed_flag_file") + open(succeed_flag_file, "a").close() + + +if __name__ == "__main__": + do_train() diff --git a/examples/semantic_indexing/train_batch_neg.py b/examples/semantic_indexing/train_batch_neg.py new file mode 100644 index 0000000000000000000000000000000000000000..760e67101309280b18652fc103ab9fd89445a6e2 --- /dev/null +++ b/examples/semantic_indexing/train_batch_neg.py @@ -0,0 +1,158 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
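+
+# train_batch_neg.py: trains a SemanticIndexBatchNeg model on (query, title) text pairs,
+# treating the other titles in the same batch as negatives (in-batch negatives);
+# --margin and --scale control the contrastive loss, and --use_amp enables
+# mixed-precision training.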
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from batch_negative.model import SemanticIndexBatchNeg +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Inteval steps to save checkpoint.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.3, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss") +parser.add_argument("--use_amp", action="store_true", help="Whether to use AMP.") +parser.add_argument("--amp_loss_scale", default=32768, type=float, help="The value of scale_loss for fp16. 
This is only used for AMP training.") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = SemanticIndexBatchNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
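+    # apply_decay_param_fun below receives each parameter name and returns True only for
+    # names collected in decay_params, so bias and LayerNorm weights skip weight decay.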
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.amp_loss_scale) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + + if args.use_amp: + scaled = scaler.scale(loss) + scaled.backward() + scaler.minimize(optimizer, scaled) + else: + loss.backward() + optimizer.step() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/semantic_indexing/train_gradient_cache.py b/examples/semantic_indexing/train_gradient_cache.py new file mode 100644 index 0000000000000000000000000000000000000000..838054f7c0188cb5fc0aa8dd71c5bd946dcb6686 --- /dev/null +++ b/examples/semantic_indexing/train_gradient_cache.py @@ -0,0 +1,239 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_text_pair +from gradient_cache.model import SemanticIndexCacheNeg + +import paddlenlp as ppnlp +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Inteval steps to save checkpoint.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--margin", default=0.3, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss") +parser.add_argument("--use_amp", action="store_true", help="Whether to use AMP.") +parser.add_argument("--amp_loss_scale", default=32768, type=float, help="The value of scale_loss for fp16. This is only used for AMP training.") +parser.add_argument("--chunk_numbers", type=int, default=50, help="The number of the chunks for model") + +args = parser.parse_args() + + +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + + # If you wanna use bert/roberta pretrained model, + # pretrained_model = ppnlp.transformers.BertModel.from_pretrained('bert-base-chinese') + # pretrained_model = ppnlp.transformers.RobertaModel.from_pretrained('roberta-wwm-ext') + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0") + + # If you wanna use bert/roberta pretrained model, + # tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained('bert-base-chinese') + # tokenizer = ppnlp.transformers.RobertaTokenizer.from_pretrained('roberta-wwm-ext') + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained("ernie-1.0") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # query_# query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # query_# title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # title_segment + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", 
batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = SemanticIndexCacheNeg( + pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + model = paddle.DataParallel(model) + num_training_steps = len(train_data_loader) * args.epochs + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.amp_loss_scale) + + if args.batch_size % args.chunk_numbers == 0: + chunk_numbers = args.chunk_numbers + + def split(inputs, chunk_numbers, axis=0): + if inputs.shape[0] % chunk_numbers == 0: + return paddle.split(inputs, chunk_numbers, axis=0) + else: + return paddle.split(inputs, inputs.shape[0], axis=0) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + chunked_x = [split(t, chunk_numbers, axis=0) for t in batch] + sub_batchs = [list(s) for s in zip(*chunked_x)] + + all_reps = [] + all_grads = [] + all_labels = [] + all_CUDA_rnd_state = [] + all_query = [] + all_title = [] + + for sub_batch in sub_batchs: + all_reps = [] + all_labels = [] + ( + sub_query_input_ids, + sub_query_token_type_ids, + sub_title_input_ids, + sub_title_token_type_ids, + ) = sub_batch + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + + with paddle.no_grad(): + sub_CUDA_rnd_state = paddle.framework.random.get_cuda_rng_state() + all_CUDA_rnd_state.append(sub_CUDA_rnd_state) + sub_cosine_sim, sub_label, query_embedding, title_embedding = model( + query_input_ids=sub_query_input_ids, + title_input_ids=sub_title_input_ids, + query_token_type_ids=sub_query_token_type_ids, + title_token_type_ids=sub_title_token_type_ids, + ) + all_reps.append(sub_cosine_sim) + all_labels.append(sub_label) + all_title.append(title_embedding) + all_query.append(query_embedding) + + model_reps = paddle.concat(all_reps, axis=0) + model_title = paddle.concat(all_title) + model_query = paddle.concat(all_query) + + model_title = model_title.detach() + model_query = model_query.detach() + + model_query.stop_gtadient = False + model_title.stop_gradient = False + model_reps.stop_gradient = False + + model_label = paddle.concat(all_labels, axis=0) + loss = F.cross_entropy(input=model_reps, label=model_label) + loss.backward() + all_grads.append(model_reps.grad) + + for sub_batch, CUDA_state, grad in zip(sub_batchs, all_CUDA_rnd_state, all_grads): + + ( + sub_query_input_ids, + sub_query_token_type_ids, + sub_title_input_ids, + sub_title_token_type_ids, + ) = sub_batch + paddle.framework.random.set_cuda_rng_state(CUDA_state) + cosine_sim, _ = model( + query_input_ids=sub_query_input_ids, + title_input_ids=sub_title_input_ids, + query_token_type_ids=sub_query_token_type_ids, + title_token_type_ids=sub_title_token_type_ids, 
+ ) + surrogate = paddle.dot(cosine_sim, grad) + + if args.use_amp: + scaled = scaler.scale(surrogate) + scaled.backward() + else: + surrogate.backward() + + if args.use_amp: + scaler.minimize(optimizer, scaled) + else: + optimizer.step() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/semantic_indexing/train_gradient_cache_DPR.py b/examples/semantic_indexing/train_gradient_cache_DPR.py new file mode 100644 index 0000000000000000000000000000000000000000..59ef25024a9fa739df029fe930aadaa8776e041a --- /dev/null +++ b/examples/semantic_indexing/train_gradient_cache_DPR.py @@ -0,0 +1,214 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
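+
+# train_gradient_cache_DPR.py: trains a DPR-style BiEncoder (question/context BertModel
+# pair) with gradient caching. Every batch is split into --chunk_size sized chunks: a
+# first pass under paddle.no_grad() embeds each chunk while saving the CUDA RNG state,
+# BiEncoderNllLoss is computed on the concatenated (detached) embeddings, and a second
+# pass replays each chunk with its saved RNG state so the cached embedding gradients can
+# be back-propagated through the encoders one chunk at a time.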
+ +import argparse + +import numpy as np +import paddle +from biencoder_base_model import BiEncoder, BiEncoderNllLoss +from NQdataset import DataUtil, NQdataSetForDPR +from paddle.optimizer.lr import LambdaDecay + +from paddlenlp.transformers.bert.modeling import BertModel + +parser = argparse.ArgumentParser() + +parser.add_argument("--batch_size", required=True, type=int, default=None) +parser.add_argument("--learning_rate", required=True, type=float, default=None) +parser.add_argument("--save_dir", required=True, type=str, default=None) +parser.add_argument("--warmup_steps", required=True, type=int) +parser.add_argument("--epoches", required=True, type=int) +parser.add_argument("--max_grad_norm", required=True, type=int) +parser.add_argument("--train_data_path", required=True, type=str) +parser.add_argument("--chunk_size", required=True, type=int) +args = parser.parse_args() + +chunk_nums = args.batch_size // args.chunk_size +data_path = args.train_data_path +batch_size = args.batch_size +learning_rate = args.learning_rate +epoches = args.epoches + + +def dataLoader_for_DPR(batch_size, source_data: list, epochs): + index = np.arange(0, len(source_data)) + np.random.shuffle(index) + batch_data = [] + for i in index: + try: + batch_data.append(source_data[i]) + + if len(batch_data) == batch_size: + yield batch_data + batch_data = [] + + except Exception: + import traceback + + traceback.print_exc() + continue + + +def get_model(model_name: str): + question_model = BertModel.from_pretrained(model_name) + context_model = BertModel.from_pretrained(model_name) + model = BiEncoder(question_model, context_model) + return model + + +model = get_model("bert-base-uncased") + + +def get_linear_scheduler(warmup_steps, training_steps): + def lr_lambda(current_step): + if current_step < warmup_steps: + return float(current_step) / float(max(1, warmup_steps)) + return max(0.0, float(training_steps - current_step) / float(max(1, training_steps - warmup_steps))) + + return LambdaDecay(learning_rate=args.learning_rate, lr_lambda=lr_lambda, last_epoch=-1, verbose=False) + + +training_steps = 58880 * args.epoches / args.batch_size +scheduler = get_linear_scheduler(args.warmup_steps, training_steps) +optimizer = paddle.optimizer.AdamW(learning_rate=scheduler, parameters=model.parameters()) + + +def get_dataset(data_path: str): + data = NQdataSetForDPR(data_path) + dataset = data.new_data + return dataset + + +util = DataUtil() +LOSS = BiEncoderNllLoss() +batch_data = [] +dataset = get_dataset(data_path) + + +def train(): + + for epoch in range(epoches): + + index = np.arange(0, len(dataset)) + np.random.shuffle(index) + + batch_data = [] + + for i in index: + # dataLoader + batch_data.append(dataset[i]) + if len(batch_data) == batch_size: + all_questions = [] + all_contexts = [] + all_batch_input = util.create_biencoder_input(batch_data, inserted_title=True) + + all_positions = all_batch_input.is_positive + + all_inputs_questions_id = all_batch_input.questions_ids + all_inputs_questions_segment = all_batch_input.question_segments + + all_inputs_contexts_id = all_batch_input.context_ids + all_inputs_contexts_segment = all_batch_input.ctx_segments + + sub_q_ids = paddle.split(all_inputs_questions_id, chunk_nums, axis=0) + sub_c_ids = paddle.split(all_inputs_contexts_id, chunk_nums, axis=0) + sub_q_segments = paddle.split(all_inputs_questions_segment, chunk_nums, axis=0) + sub_c_segments = paddle.split(all_inputs_contexts_segment, chunk_nums, axis=0) + + all_questions = [] + all_contexts = [] + 
all_CUDA_rnd_state_question = [] + all_CUDA_rnd_state_context = [] + + for sub_q_id, sub_q_segment in zip(sub_q_ids, sub_q_segments): + with paddle.no_grad(): + sub_CUDA_rnd_state = paddle.framework.random.get_cuda_rng_state() + all_CUDA_rnd_state_question.append(sub_CUDA_rnd_state) + sub_question_output = model.get_question_pooled_embedding(sub_q_id, sub_q_segment) + all_questions.append(sub_question_output) + for sub_c_id, sub_c_segment in zip(sub_c_ids, sub_c_segments): + with paddle.no_grad(): + sub_CUDA_rnd_state = paddle.framework.random.get_cuda_rng_state() + all_CUDA_rnd_state_context.append(sub_CUDA_rnd_state) + sub_context_output = model.get_context_pooled_embedding(sub_c_id, sub_c_segment) + all_contexts.append(sub_context_output) + + model_questions = paddle.concat(all_questions, axis=0) + all_questions = [] + + model_questions = model_questions.detach() + + model_questions.stop_gradient = False + + model_contexts = paddle.concat(all_contexts, axis=0) + + model_contexts = model_contexts.detach() + + model_contexts.stop_gradient = False + + all_contexts = [] + + model_positions = all_positions + + loss, _ = LOSS.calc(model_questions, model_contexts, model_positions) + + print("loss is:") + print(loss.item()) + + loss.backward() + + grads_for_questions = paddle.split(model_questions.grad, chunk_nums, axis=0) + grads_for_contexts = paddle.split(model_contexts.grad, chunk_nums, axis=0) + + for sub_q_id, sub_q_segment, CUDA_state, grad_for_each_question in zip( + sub_q_ids, sub_q_segments, all_CUDA_rnd_state_question, grads_for_questions + ): + + paddle.framework.random.set_cuda_rng_state(CUDA_state) + + sub_question_output = model.get_question_pooled_embedding(sub_q_id, sub_q_segment) + + finally_question_res_for_backward = paddle.dot(sub_question_output, grad_for_each_question) + finally_question_res_for_backward = finally_question_res_for_backward * (1 / 8.0) + + finally_question_res_for_backward.backward(retain_graph=True) + + for sub_c_id, sub_c_segment, CUDA_state, grad_for_each_context in zip( + sub_c_ids, sub_c_segments, all_CUDA_rnd_state_context, grads_for_contexts + ): + paddle.framework.random.set_cuda_rng_state(CUDA_state) + + sub_context_output = model.get_context_pooled_embedding(sub_c_id, sub_q_segment) + + finally_context_res_for_backward = paddle.dot(sub_question_output, grad_for_each_context) + finally_context_res_for_backward = finally_context_res_for_backward * (1 / 8.0) + + finally_context_res_for_backward.backward(retain_graph=True) + + paddle.nn.ClipGradByGlobalNorm(clip_norm=args.max_grad_norm, group_name=model.parameters()) + optimizer.step() + scheduler.step() + optimizer.clear_grad() + + batch_data = [] + + EPOCH = str(epoch) + save_path_que = args.save_dir + "/question_model_" + EPOCH + save_path_con = args.save_dir + "/context_model_" + EPOCH + model.question_encoder.save_pretrained(save_path_que) + model.context_encoder.save_pretrained(save_path_con) + + +if __name__ == "__main__": + train() diff --git a/examples/semantic_indexing/train_hardest_neg.py b/examples/semantic_indexing/train_hardest_neg.py new file mode 100644 index 0000000000000000000000000000000000000000..e4e9522d303e6157376503050278f9e9d904b20a --- /dev/null +++ b/examples/semantic_indexing/train_hardest_neg.py @@ -0,0 +1,141 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair +from hardest_negative.model import SemanticIndexHardestNeg + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Inteval steps to save checkpoint") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file") +parser.add_argument("--margin", default=0.3, type=float, help="Margin for pair-wise margin_rank_loss") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset(read_text_pair, data_path=args.train_set_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in 
fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = SemanticIndexHardestNeg(pretrained_model, margin=args.margin, output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/sentiment_analysis/skep/README.md b/examples/sentiment_analysis/skep/README.md new file mode 100644 index 0000000000000000000000000000000000000000..de57627169db8a2d89bf7c11b03306fe6f5a180a --- /dev/null +++ b/examples/sentiment_analysis/skep/README.md @@ -0,0 +1,271 @@ +# SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis + +情感分析旨在自动识别和提取文本中的倾向、立场、评价、观点等主观信息。它包含各式各样的任务,比如句子级情感分类、评价对象级情感分类、观点抽取、情绪分类等。情感分析是人工智能的重要研究方向,具有很高的学术价值。同时,情感分析在消费决策、舆情分析、个性化推荐等领域均有重要的应用,具有很高的商业价值。 + +情感预训练模型SKEP(Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis)。SKEP利用情感知识增强预训练模型, 在14项中英情感分析典型任务上全面超越SOTA,此工作已经被ACL 2020录用。SKEP是百度研究团队提出的基于情感知识增强的情感预训练算法,此算法采用无监督方法自动挖掘情感知识,然后利用情感知识构建预训练目标,从而让机器学会理解情感语义。SKEP为各类情感分析任务提供统一且强大的情感语义表示。 + +论文地址:https://arxiv.org/abs/2005.05635 + +<p align="center"> +<img src="https://bj.bcebos.com/paddlenlp/models/transformers/skep/skep.png" width="80%" height="60%"> <br /> +</p> + +百度研究团队在三个典型情感分析任务,语句级情感分类(Sentence-level Sentiment Classification),评价对象级情感分类(Aspect-level Sentiment Classification)、观点抽取(Opinion Role Labeling),共计14个中英文数据上进一步验证了情感预训练模型SKEP的效果。实验表明,下表展示了在模型分别在数据集SST-2、ChnSentiCorp、SE-ABSA16_PHNS、COTE_DP上的实验结果,同时标明了各项数据集对应的任务类型、语言类别、下载地址等信息。 + +<table> + <tr> + <td><strong><center>任务</strong></td> + <td><strong><center>数据集合</strong></td> + 
<td><strong><center>语言</strong></td> + <td><strong><center>指标</strong></td> + <td><strong><center>SKEP</strong></td> + <td><strong><center>数据集地址</strong></td> + </tr> + <tr> + <td rowspan="2"><center>语句级情感分类<br /><center>分类</td> + <td><center>SST-2</td> + <td><center>英文</td> + <td><center>ACC</td> + <td><center>97.60</td> + <td><center><a href="https://gluebenchmark.com/tasks" >下载地址</a></td> + </tr> + <tr> + <td><center>ChnSentiCorp</td> + <td><center>中文</td> + <td><center>ACC</td> + <td><center>96.08</td> + <td><center><a href="https://dataset-bj.cdn.bcebos.com/qianyan/ChnSentiCorp.zip" >下载地址</a></td> + </tr> + <tr> + <td rowspan="1"><center>评价对象级<br /><center>情感分类</td> + <td><center>SE-ABSA16_PHNS</td> + <td><center>中文</td> + <td><center>ACC</td> + <td><center>65.22</td> + <td><center><a href="http://alt.qcri.org/semeval2016/task5/" >下载地址</a></td> + </tr> + <tr> + <td rowspan="1"><center>观点<br /><center>抽取</td> + <td><center>COTE_DP</td> + <td><center>中文</td> + <td><center>F1</td> + <td><center>86.30</td> + <td><center><a href="https://github.com/lsvih/chinese-customer-review" >下载地址</a></td> + </tr> +</table> + + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +skep/ +├── deploy # 部署 +│   └── python +│   └── predict.py # python预测部署示例 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── predict_aspect.py # 对象级的情感分类任务预测脚本 +├── predict_opinion.py # 观点抽取任务预测脚本 +├── predict_sentence.py # 句子级情感分类任务预测脚本 +├── README.md # 使用说明 +├── train_aspect.py # 对象级的情感分类任务训练脚本 +├── train_opinion.py # 观点抽取任务训练脚本 +└── train_sentence.py # 句子级情感分类任务训练脚本 +``` + +下面以语句级情感分类、评价对象级情感分类,观点抽取等任务类型为例,分别说明相应的训练和测试方式。 + +### 语句级情感分类 +#### 数据下载 +本示例采用常用开源数据集ChnSenticorp中文数据集、GLUE-SST2英文数据集作为语句级情感分类数据集。这两项数据集已经内置于PaddleNLP。可以通过以下方式进行加载。 + +```python +from paddlenlp.datasets import load_dataset + +train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"]) +train_ds, dev_ds = load_dataset("glue", "sst-2", splits=["train", "dev"]) +``` + +#### 模型训练 + +可以通过如下命令开启语句级情感分析任务训练,需要特别说明的是,如果想要基于数据集ChnSentiCorp训练中文情感分析模型,请指定model_name为:`skep_ernie_1.0_large_ch`; 基于数据集GLUE-SST2训练英文情感分析模型请指定model_name为:`skep_ernie_2.0_large_en`。下面以中文情感分析为例进行说明。 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train_sentence.py \ + --model_name "skep_ernie_1.0_large_ch" \ + --device "gpu" \ + --save_dir "./checkpoints" \ + --epochs 3 \ + --max_seq_len 128 \ + --batch_size 16 \ + --learning_rate 5e-5 +``` + +可支持配置的参数: + +* `model_name`: 使用预训练模型的名称,可选skep_ernie_1.0_large_ch和skep_ernie_2.0_large_en。 + skep_ernie_1.0_large_ch:是SKEP模型在预训练ernie_1.0_large_ch基础之上在海量中文数据上继续预训练得到的中文预训练模型; + skep_ernie_2.0_large_en:是SKEP模型在预训练ernie_2.0_large_en基础之上在海量英文数据上继续预训练得到的中文预训练模型。 +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_len`:可选,ERNIE/BERT模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为16。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.00。 +* `epochs`: 训练轮次,默认为3。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. 
+* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练,则可通过参数gpus指定GPU卡号。
+
+程序运行时将会自动进行训练和评估,同时训练过程中会自动将模型保存在指定的`save_dir`中。
+
+#### 模型预测
+使用如下命令进行模型预测:
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python predict_sentence.py \
+    --model_name "skep_ernie_1.0_large_ch" \
+    --ckpt_dir "checkpoints/model_100" \
+    --batch_size 16 \
+    --max_seq_len 128 \
+    --device "gpu"
+```
+
+下面展示了模型的预测示例结果:
+
+```text
+Data: 这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般      Label: negative
+Data: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片      Label: negative
+Data: 作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。      Label: positive
+```
+
+#### 基于Taskflow一键预测
+当前PaddleNLP已将训练好的SKEP中文语句级情感分析模型集成至Taskflow中,可以使用Taskflow对输入的文本进行一键式情感分析,使用方法如下:
+
+```python
+from paddlenlp import Taskflow
+
+senta = Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch")
+senta("这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般")
+'''
+[{'text': '这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般', 'label': 'negative', 'score': 0.9894790053367615}]
+'''
+```
+
+如果想将自己训练好的模型加载进Taskflow进行预测,可以通过参数`task_path`指定模型路径。需要注意的是,该路径下需要同时存放模型文件以及相应的Tokenizer文件(训练过程中已自动保存这两类文件)。
+
+```python
+from paddlenlp import Taskflow
+
+senta = Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch", task_path="./checkpoints/model_100")
+senta("这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般")
+'''
+[{'text': '这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般', 'label': 'negative', 'score': 0.9686369299888611}]
+'''
+```
+
+#### 模型部署
+
+使用动态图训练结束之后,还可以将动态图参数导出成静态图参数。在进行模型转换时,需要通过参数`ckpt_dir`指定训练好的模型存放目录,通过`output_path`指定静态图模型参数保存路径,详情请参考export_model.py。模型转换命令如下:
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python export_model.py \
+    --model_name="skep_ernie_1.0_large_ch" \
+    --ckpt_dir="./checkpoints/model_100" \
+    --output_path="./static/static_graph_params"
+```
+
+导出的静态图模型可用于部署,deploy/python/predict.py展示了Python部署预测示例。运行方式如下:
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python deploy/python/predict.py \
+    --model_name="skep_ernie_1.0_large_ch" \
+    --model_file="./static/static_graph_params.pdmodel" \
+    --params_file="./static/static_graph_params.pdiparams"
+```
+
+### 评价对象级情感分类
+本节将以数据集SE-ABSA16_PHNS为例展示评价对象级情感分类模型的训练和测试。该数据集已内置于PaddleNLP中,可以通过与语句级情感分类类似的方式进行加载,这里不再赘述。下面展示了SE-ABSA16_PHNS数据集中的一条数据。
+
+```text
+label   text_a  text_b
+1       phone#design_features   今天有幸拿到了港版白色iPhone 5真机,试玩了一下,说说感受吧:1. 真机尺寸宽度与4/4s保持一致没有变化,长度多了大概一厘米,也就是之前所说的多了一排的图标。2. 真机重量比上一代轻了很多,个人感觉跟i9100的重量差不多。(用惯上一代的朋友可能需要一段时间适应了)3. 由于目前还没有版的SIM卡,无法插卡使用,有购买的朋友要注意了,并非简单的剪卡就可以用,而是需要去运营商更换新一代的SIM卡。4. 屏幕显示效果确实比上一代有进步,不论是从清晰度还是不同角度的视角,iPhone 5绝对要更上一层,我想这也许是相对上一代最有意义的升级了。5. 新的数据接口更小,比上一代更好用更方便,使用的过程会有这样的体会。6. 从简单的几个操作来讲速度比4s要快,这个不用测试软件也能感受出来,比如程序的调用以及照片的拍摄和浏览。不过,目前水货市场上坑爹的价格,最好大家可以再观望一下,不要急着出手。
+```
+
+#### 模型训练
+
+可以通过如下命令开启评价对象级情感分类任务训练。
+
+```shell
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0" train_aspect.py \
+    --model_name "skep_ernie_1.0_large_ch" \
+    --save_dir "./checkpoints" \
+    --epochs 50 \
+    --max_seq_len 128 \
+    --batch_size 16 \
+    --learning_rate 5e-5 \
+    --device "gpu"
+```
+
+#### 模型预测
+使用如下命令进行模型预测:
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python predict_aspect.py \
+    --model_name "skep_ernie_1.0_large_ch" \
+    --ckpt_dir "./checkpoints/model_100" \
+    --batch_size 16 \
+    --max_seq_len 128 \
+    --device "gpu"
+```
+
+### 观点抽取
+本节将以数据集COTE_DP为例展示观点抽取模型的训练和测试。该数据集已内置于PaddleNLP中,可以通过与语句级情感分类类似的方式进行加载,这里不再赘述。下面展示了COTE_DP数据中的前3条数据。
+
+```text
+label   text_a
+重庆老灶火锅    重庆老灶火锅还是很赞的,有机会可以尝试一下!
+炉鱼来了        一入店内,就看到招牌特别大的炉鱼来了,餐桌上还摆了五颜六色的小蜡烛,挺有调调的。
+外婆家  只能说是聚餐圣地外婆家一个需要提前来取号的地方。
+```
+
+#### 模型训练
+
+可以通过如下命令开启观点抽取任务训练。
+
+```shell
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0" train_opinion.py \
+    --model_name "skep_ernie_1.0_large_ch" \
+    --save_dir "./checkpoints" \
+    --epochs 10 \
+    --max_seq_len 128 \
+    --batch_size 16 \
+    --learning_rate 5e-5 \
+    --device "gpu"
+```
+
+#### 模型预测
+使用如下命令进行模型预测:
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python predict_opinion.py \
+    --model_name "skep_ernie_1.0_large_ch" \
+    --ckpt_dir "./checkpoints/model_100" \
+    --batch_size 16 \
+    --max_seq_len 128 \
+    --device "gpu"
+```
+
+**备注**:
+1. 评价对象级情感分类和观点抽取两类任务的模型部署方式可参考语句级情感分类,这里不再赘述。
+2. 评价对象级情感分类以及观点抽取任务,暂不支持SKEP模型的Taskflow离线模型加载。如需使用此类功能,请参考:[unified_sentiment_analysis](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/sentiment_analysis/unified_sentiment_extraction)。
diff --git a/examples/sentiment_analysis/skep/deploy/python/predict.py b/examples/sentiment_analysis/skep/deploy/python/predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..2acba4857bd28d86bfd2393f9114f9bfcbcc4d18
--- /dev/null
+++ b/examples/sentiment_analysis/skep/deploy/python/predict.py
@@ -0,0 +1,151 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+
+import numpy as np
+import paddle
+from scipy.special import softmax
+
+from paddlenlp.data import DataCollatorWithPadding
+from paddlenlp.transformers import SkepTokenizer
+
+parser = argparse.ArgumentParser()
+parser.add_argument(
+    "--model_name",
+    choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"],
+    default="skep_ernie_1.0_large_ch",
+    help="Select which model to train, defaults to skep_ernie_1.0_large_ch.",
+)
+parser.add_argument(
+    "--model_file",
+    type=str,
+    required=True,
+    default="./static_graph_params.pdmodel",
+    help="The path to model info in static graph.",
+)
+parser.add_argument(
+    "--params_file",
+    type=str,
+    required=True,
+    default="./static_graph_params.pdiparams",
+    help="The path to parameters in static graph.",
+)
+parser.add_argument(
+    "--max_seq_len",
+    default=128,
+    type=int,
+    help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +def convert_example(example, tokenizer, label_list, max_seq_len=512, is_test=False): + text = example + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + return {"input_ids": input_ids, "token_type_ids": token_type_ids} + + +class Predictor(object): + def __init__(self, model_file, params_file, device, max_seq_len): + self.max_seq_len = max_seq_len + + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, label_map, batch_size=1): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `se_len`(sequence length). + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + examples = [] + for text in data: + encoded_inputs = convert_example( + text, tokenizer, label_list=label_map.values(), max_seq_len=self.max_seq_len, is_test=True + ) + examples.append(encoded_inputs) + + # Separates data into some batches. + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + data_collator = DataCollatorWithPadding(tokenizer, padding=True, return_tensors="np") + + results = [] + for raw_batch in batches: + batch = data_collator(raw_batch) + input_ids, token_type_ids = batch["input_ids"], batch["token_type_ids"] + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(token_type_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + probs = softmax(logits, axis=1) + idx = np.argmax(probs, axis=1) + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_len) + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + + # These data samples is in Chinese. 
+ # If you use the english model, you should change the test data in English. + data = [ + "这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般", + "怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片", + "作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。", + ] + label_map = {0: "negative", 1: "positive"} + + results = predictor.predict(data, tokenizer, label_map, batch_size=args.batch_size) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/skep/export_model.py b/examples/sentiment_analysis/skep/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..4164c2e603ffb560c1138d5bf009fb9c4ce3ca8f --- /dev/null +++ b/examples/sentiment_analysis/skep/export_model.py @@ -0,0 +1,61 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle + +from paddlenlp.transformers import SkepForSequenceClassification + +parser = argparse.ArgumentParser() +parser.add_argument( + "--ckpt_dir", + type=str, + required=True, + default="./checkpoint/model_100", + help="The directory of saved model checkpoint.", +) +parser.add_argument( + "--output_path", + type=str, + default="./static_graph_params", + help="The path of model parameter in static graph to be saved.", +) +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +args = parser.parse_args() + +if __name__ == "__main__": + # The number of labels should be in accordance with the training dataset. + label_map = {0: "negative", 1: "positive"} + model = SkepForSequenceClassification.from_pretrained(args.ckpt_dir, num_labels=len(label_map)) + print("Loaded model from %s" % args.ckpt_dir) + + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + paddle.jit.save(model, args.output_path) + print("Static Model has been saved to: {}".format(args.output_path)) diff --git a/examples/sentiment_analysis/skep/predict_aspect.py b/examples/sentiment_analysis/skep/predict_aspect.py new file mode 100644 index 0000000000000000000000000000000000000000..7ce7c19496a677a3c86ca5f9b6b5dc540bda4ecd --- /dev/null +++ b/examples/sentiment_analysis/skep/predict_aspect.py @@ -0,0 +1,133 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +import paddle.nn.functional as F +from tqdm import tqdm + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +parser.add_argument("--ckpt_dir", type=str, default=None, help="The directory of saved model checkpoint.") +parser.add_argument( + "--max_seq_len", + default=400, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=6, type=int, help="Batch size per GPU/CPU for prediction.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +@paddle.no_grad() +def predict(model, data_loader, label_map): + """ + Given a prediction dataset, it gives the prediction results. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + label_map(obj:`dict`): The label id (key) to label str (value) map. + """ + model.eval() + results = [] + for batch in tqdm(data_loader): + input_ids, token_type_ids = batch["input_ids"], batch["token_type_ids"] + logits = model(input_ids, token_type_ids) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +def convert_example_to_feature(example, tokenizer, max_seq_len=512, is_test=False): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`dict`): Dict of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + label(obj:`int`, optional): The input label if not is_test. 
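+
+    Note: for the SE-ABSA16_PHNS data used by this script, `example["text"]` is assumed to hold the
+    aspect description (e.g. "phone#design_features") and `example["text_pair"]` the full review text,
+    matching the sample shown in the README; the two are encoded together as a text pair.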
+ """ + encoded_inputs = tokenizer(text=example["text"], text_pair=example["text_pair"], max_seq_len=max_seq_len) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if is_test: + return {"input_ids": input_ids, "token_type_ids": token_type_ids} + else: + label = example["label"] + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "labels": label} + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + + test_ds = load_dataset("seabsa16", "phns", splits=["test"]) + label_map = {0: "negative", 1: "positive"} + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepForSequenceClassification.from_pretrained(args.ckpt_dir, num_labels=len(label_map)) + + trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, max_seq_len=args.max_seq_len, is_test=True) + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + test_data_loader = create_dataloader( + test_ds, mode="test", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + results = predict(model, test_data_loader, label_map) + for idx, text in enumerate(test_ds.data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/skep/predict_opinion.py b/examples/sentiment_analysis/skep/predict_opinion.py new file mode 100644 index 0000000000000000000000000000000000000000..5c172d26ff86510bcc35e7974cd6a490ecc58eaf --- /dev/null +++ b/examples/sentiment_analysis/skep/predict_opinion.py @@ -0,0 +1,154 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from tqdm import tqdm + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import SkepCrfForTokenClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +parser.add_argument("--ckpt_dir", type=str, default=None, help="The directory of saved model checkpoint.") +parser.add_argument( + "--max_seq_len", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +@paddle.no_grad() +def predict(model, data_loader, label_map): + """ + Given a prediction dataset, it gives the prediction results. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + label_map(obj:`dict`): The label id (key) to label str (value) map. + """ + model.eval() + results = [] + for batch in tqdm(data_loader): + input_ids, token_type_ids, seq_lens = batch["input_ids"], batch["token_type_ids"], batch["seq_lens"] + preds = model(input_ids, token_type_ids, seq_lens=seq_lens) + tags = parse_predict_result(preds.numpy(), seq_lens.numpy(), label_map) + results.extend(tags) + return results + + +def convert_example_to_feature(example, tokenizer, max_seq_len=512): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`dict`): Dict of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + """ + tokens = example["tokens"] + new_tokens = [tokenizer.cls_token] + + for index, token in enumerate(tokens): + sub_tokens = tokenizer.tokenize(token) + if not sub_tokens: + sub_tokens = [tokenizer.unk_token] + new_tokens.extend(sub_tokens) + + new_tokens = new_tokens[: max_seq_len - 1] + new_tokens.append(tokenizer.sep_token) + + input_ids = [tokenizer.convert_tokens_to_ids(token) for token in new_tokens] + token_type_ids = [0] * len(input_ids) + seq_len = len(input_ids) + + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "seq_lens": seq_len} + + +def parse_predict_result(predictions, seq_lens, label_map): + """ + Parses the prediction results to the label tag. 
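+
+    For example, with label_map = {0: "B", 1: "I", 2: "O"}, a prediction row [2, 0, 1, 2, 2] with
+    seq_len 5 is parsed to ["B", "I", "O"]: the first and last positions, which correspond to the
+    "[CLS]" and "[SEP]" tokens, are dropped and the remaining ids are mapped to their tags.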
+ """ + pred_tag = [] + for idx, pred in enumerate(predictions): + seq_len = seq_lens[idx] + # drop the "[CLS]" and "[SEP]" token + tag = [label_map[i] for i in pred[1 : seq_len - 1]] + pred_tag.append(tag) + return pred_tag + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + paddle.set_device(args.device) + + test_ds = load_dataset("cote", "dp", splits=["test"]) + label_list = test_ds.label_list + # The COTE_DP dataset labels with "BIO" schema. + label_map = {0: "B", 1: "I", 2: "O"} + # `no_entity_label` represents that the token isn't an entity. + no_entity_label_idx = 2 + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepCrfForTokenClassification.from_pretrained(args.ckpt_dir, num_labels=len(label_list)) + print("Loaded model from %s" % args.ckpt_dir) + + trans_func = partial(convert_example_to_feature, tokenizer=tokenizer, max_seq_len=args.max_seq_len) + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=no_entity_label_idx) + + test_data_loader = create_dataloader( + test_ds, mode="test", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + results = predict(model, test_data_loader, label_map) + for idx, example in enumerate(test_ds.data): + print(len(example["tokens"]), len(results[idx])) + print("Data: {} \t Label: {}".format(example, results[idx])) diff --git a/examples/sentiment_analysis/skep/predict_sentence.py b/examples/sentiment_analysis/skep/predict_sentence.py new file mode 100644 index 0000000000000000000000000000000000000000..a4dbbc69d1b25eaff26ddde51fcff62459912c05 --- /dev/null +++ b/examples/sentiment_analysis/skep/predict_sentence.py @@ -0,0 +1,130 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +import paddle.nn.functional as F + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--max_seq_len", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--ckpt_dir", type=str, default=None, help="The directory of saved model checkpoint.") +parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +args = parser.parse_args() + + +def convert_example_to_feature(example, tokenizer, max_seq_len=512): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`str`): The input text to sentiment analysis. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + dataset_name((obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2". + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + label(obj:`int`, optional): The input label if not is_test. + """ + encoded_inputs = tokenizer(text=example, max_seq_len=max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + return {"input_ids": input_ids, "token_type_ids": token_type_ids} + + +@paddle.no_grad() +def predict(model, data, tokenizer, label_map, batch_size=1): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`list`): All the predictions labels. + """ + examples = [] + for text in data: + encoded_inputs = convert_example_to_feature(text, tokenizer, max_seq_len=args.max_seq_len) + examples.append(encoded_inputs) + + # Separates data into some batches. 
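+    # DataCollatorWithPadding (padding=True) later pads each batch to the length of its longest sequence.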
+ batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + results = [] + model.eval() + for raw_batch in batches: + batch = data_collator(raw_batch) + input_ids, token_type_ids = batch["input_ids"], batch["token_type_ids"] + logits = model(input_ids, token_type_ids) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy().tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # These data samples is in Chinese. + # If you use the english model, you should change the test data in English. + data = [ + "这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般", + "怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片", + "作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。", + ] + label_map = {0: "negative", 1: "positive"} + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepForSequenceClassification.from_pretrained(args.ckpt_dir, num_labels=len(label_map)) + print("Loaded model from %s" % args.ckpt_dir) + + results = predict(model, data, tokenizer, label_map, batch_size=args.batch_size) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/skep/train_aspect.py b/examples/sentiment_analysis/skep/train_aspect.py new file mode 100644 index 0000000000000000000000000000000000000000..f5f6b7720b99fad0c57566f1898250abfc5e5374 --- /dev/null +++ b/examples/sentiment_analysis/skep/train_aspect.py @@ -0,0 +1,180 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument( + "--max_seq_len", + default=400, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=6, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=3e-6, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=50, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +def set_seed(seed): + """Sets random seed.""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def convert_example(example, tokenizer, max_seq_len=512, is_test=False): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`dict`): Dict of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + label(obj:`int`, optional): The input label if not is_test. 
+ """ + + encoded_inputs = tokenizer(text=example["text"], text_pair=example["text_pair"], max_seq_len=max_seq_len) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if is_test: + return {"input_ids": input_ids, "token_type_ids": token_type_ids} + else: + label = example["label"] + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "labels": label} + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + set_seed(args.seed) + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + train_ds = load_dataset("seabsa16", "phns", splits=["train"]) + label_list = train_ds.label_list + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepForSequenceClassification.from_pretrained(args.model_name, num_labels=len(label_list)) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len) + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + model.train() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch["input_ids"], batch["token_type_ids"], batch["labels"] + loss, logits = model(input_ids, token_type_ids, labels=labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + optimizer.clear_grad() + if global_step % 100 == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + # Need better way to get inner model of DataParallel + model._layers.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) diff --git a/examples/sentiment_analysis/skep/train_opinion.py b/examples/sentiment_analysis/skep/train_opinion.py new file mode 100644 index 0000000000000000000000000000000000000000..e61f394ac50f6036be96f30002326dc69630840b --- /dev/null +++ b/examples/sentiment_analysis/skep/train_opinion.py @@ -0,0 +1,211 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import SkepCrfForTokenClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +parser.add_argument( + "--save_dir", + default="./checkpoints", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument( + "--max_seq_len", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-7, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +def set_seed(seed): + """Sets random seed.""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def convert_example_to_feature(example, tokenizer, max_seq_len=512, no_entity_label="O", is_test=False): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`dict`): Dict of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + no_entity_label(obj:`int`): The label to pad label sequence by default. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + label(obj:`List[int]`, optional): The input label if not is_test. + """ + tokens = example["tokens"] + labels = example["labels"] + assert len(tokens) == len(labels) + + # 1. tokenize the tokens into sub-tokens, and align the length of tokens and labels + new_labels, new_tokens = [no_entity_label], [tokenizer.cls_token] + for index, token in enumerate(tokens): + sub_tokens = tokenizer.tokenize(token) + if not sub_tokens: + sub_tokens = [tokenizer.unk_token] + + # repeate the labels n-times + new_labels.extend([labels[index]] * len(sub_tokens)) + new_tokens.extend(sub_tokens) + + # 2. check the max-length of tokens and labels + new_tokens = new_tokens[: max_seq_len - 1] + new_labels = new_labels[: max_seq_len - 1] + + # 3. 
construct the input data + new_labels.append(no_entity_label) + new_tokens.append(tokenizer.sep_token) + input_ids = [tokenizer.convert_tokens_to_ids(token) for token in new_tokens] + token_type_ids = [0] * len(input_ids) + seq_len = len(input_ids) + + if is_test: + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "seq_lens": seq_len} + else: + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "seq_lens": seq_len, "labels": new_labels} + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + set_seed(args.seed) + + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + train_ds = load_dataset("cote", "dp", splits=["train"]) + label_list = train_ds.label_list + # The COTE_DP dataset labels with "BIO" schema. + label_map = {label: idx for idx, label in enumerate(label_list)} + # `no_entity_label` represents that the token isn't an entity. + no_entity_label_idx = label_map.get("O", 2) + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepCrfForTokenClassification.from_pretrained(args.model_name, num_labels=len(label_list)) + + trans_func = partial( + convert_example_to_feature, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + no_entity_label=no_entity_label_idx, + is_test=False, + ) + + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=no_entity_label_idx) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + tic_train = time.time() + model.train() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + # print(batch) + input_ids, token_type_ids, seq_lens, labels = ( + batch["input_ids"], + batch["token_type_ids"], + batch["seq_lens"], + batch["labels"], + ) + loss = model(input_ids, token_type_ids, seq_lens=seq_lens, labels=labels) + avg_loss = paddle.mean(loss) + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, avg_loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + optimizer.clear_grad() + if global_step % 100 == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + # Need better way to get inner model of DataParallel + model._layers.save_pretrained(save_dir) + print("Model saved to: {}.".format(save_dir)) diff --git a/examples/sentiment_analysis/skep/train_sentence.py b/examples/sentiment_analysis/skep/train_sentence.py new file mode 100644 index 0000000000000000000000000000000000000000..5c05f8948b8cf33c97d9495717dc552555cb003a --- /dev/null +++ b/examples/sentiment_analysis/skep/train_sentence.py @@ -0,0 +1,229 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_name", + choices=["skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en"], + default="skep_ernie_1.0_large_ch", + help="Select which model to train, defaults to skep_ernie_1.0_large_ch.", +) +parser.add_argument( + "--save_dir", + default="./checkpoints", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument( + "--max_seq_len", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +def set_seed(seed): + """Sets random seed.""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, labels = batch["input_ids"], batch["token_type_ids"], batch["labels"] + loss, logits = model(input_ids, token_type_ids, labels=labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + acc = metric.accumulate() + print("eval loss: %.5f, accuracy: %.5f" % (np.mean(losses), acc)) + model.train() + metric.reset() + + +def convert_example_to_feature(example, tokenizer, max_seq_len=512, is_test=False, dataset_name="chnsenticorp"): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. + + Args: + example(obj:`dict`): Dict of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + dataset_name((obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2". + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + token_type_ids(obj: `list[int]`): The list of token_type_ids. + label(obj:`int`, optional): The input label if not is_test. 
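+
+    For "chnsenticorp" the text is read from example["text"] and the label from example["label"];
+    for "sst-2" the text is read from example["sentence"] and the label from example["labels"].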
+ """ + + if dataset_name == "sst-2": + encoded_inputs = tokenizer(text=example["sentence"], max_seq_len=max_seq_len) + elif dataset_name == "chnsenticorp": + encoded_inputs = tokenizer(text=example["text"], max_seq_len=max_seq_len) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + if dataset_name == "sst-2": + label = example["labels"] + elif dataset_name == "chnsenticorp": + label = example["label"] + else: + raise RuntimeError(f"Got unkown datatset name {dataset_name}, it must be processed on your own.") + + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "label": label} + else: + return {"input_ids": input_ids, "token_type_ids": token_type_ids} + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + if args.model_name == "skep_ernie_1.0_large_ch": + dataset_name = "chnsenticorp" + train_ds, dev_ds = load_dataset(dataset_name, splits=["train", "dev"]) + + else: + dataset_name = "sst-2" + train_ds, dev_ds = load_dataset("glue", dataset_name, splits=["train", "dev"]) + label_map = {0: "negative", 1: "positive"} + + tokenizer = SkepTokenizer.from_pretrained(args.model_name) + model = SkepForSequenceClassification.from_pretrained(args.model_name, num_labels=len(label_map)) + + trans_func = partial( + convert_example_to_feature, tokenizer=tokenizer, max_seq_len=args.max_seq_len, dataset_name=dataset_name + ) + + data_collator = DataCollatorWithPadding(tokenizer, padding=True) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=data_collator, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
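+    # The exclusion below is a substring match on parameter names: any parameter whose name contains
+    # "bias" or "norm" (e.g. LayerNorm weights) is left out of decay_params and gets no weight decay.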
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=args.learning_rate, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + metric = paddle.metric.Accuracy() + + # start to train model + model.train() + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch["input_ids"], batch["token_type_ids"], batch["labels"] + loss, logits = model(input_ids, token_type_ids, labels=labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accuracy: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + optimizer.clear_grad() + if global_step % 100 == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + evaluate(model, metric, dev_data_loader) + # Need better way to get inner model of DataParallel + model._layers.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) diff --git a/examples/sentiment_analysis/textcnn/README.md b/examples/sentiment_analysis/textcnn/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d4cd1599322a0aa12d18a561bd07686bc77b1f91 --- /dev/null +++ b/examples/sentiment_analysis/textcnn/README.md @@ -0,0 +1,192 @@ +# 使用TextCNN模型完成中文对话情绪识别任务 + +情感分析旨在自动识别和提取文本中的倾向、立场、评价、观点等主观信息。情感分析其中的一个任务就是对话情绪识别,针对智能对话中的用户文本,自动判断该文本的情绪类别并给出相应的置信度,情绪类型分为积极(positive)、消极(negative)和中性(neutral)。 + +本示例展示了如何用TextCNN预训练模型在机器人聊天数据集上进行Finetune完成中文对话情绪识别任务。 + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +textcnn/ +├── deploy # 部署 +│   └── python +│   └── predict.py # python预测部署示例 +├── data.py # 数据处理脚本 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── model.py # 模型组网脚本 +├── predict.py # 模型预测脚本 +├── README.md # 文档说明 +└── train.py # 对话情绪识别任务训练脚本 +``` + +### 数据准备 + +这里我们提供一份已标注的机器人聊天数据集,包括训练集(train.tsv),开发集(dev.tsv)和测试集(test.tsv)。 +完整数据集可以通过以下命令下载并解压: + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/RobotChat.tar.gz +tar xvf RobotChat.tar.gz +``` + +### 词表下载 + +在模型训练之前,需要先下载词汇表文件word_dict.txt,用于构造词-id映射关系。 + +```shell +wget https://bj.bcebos.com/paddlenlp/robot_chat_word_dict.txt +``` + +**NOTE:** 词表的选择和实际应用数据相关,需根据实际数据选择词表。 + +### 预训练模型下载 + +这里我们提供了一个百度基于海量数据训练好的TextCNN模型,用户通过以下方式下载预训练模型。 + +```shell +wget https://bj.bcebos.com/paddlenlp/models/textcnn.pdparams +``` + +### 模型训练 + +在下载好词表和预训练模型后就可以在机器人聊天数据集上进行finetune,通过运行以下命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)验证,这里通过`--init_from_ckpt=./textcnn.pdparams`指定TextCNN预训练模型。 + +CPU 启动: + +```shell +python train.py --vocab_path=./robot_chat_word_dict.txt \ + --init_from_ckpt=./textcnn.pdparams \ + --device=cpu \ + --lr=5e-5 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir=./checkpoints \ + --data_path=./RobotChat +``` + +GPU 启动: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --vocab_path=./robot_chat_word_dict.txt \ + --init_from_ckpt=./textcnn.pdparams \ + --device=gpu \ + --lr=5e-5 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir=./checkpoints 
\ + --data_path=./RobotChat +``` + +XPU启动: + +```shell +python train.py --vocab_path=./robot_chat_word_dict.txt \ + --init_from_ckpt=./textcnn.pdparams \ + --device=xpu \ + --lr=5e-5 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir=./checkpoints \ + --data_path=./RobotChat +``` + +以上参数表示: + +* `vocab_path`: 词汇表文件路径。 +* `init_from_ckpt`: 恢复模型训练的断点路径。 +* `device`: 选用什么设备进行训练,可选cpu、gpu或xpu。如使用gpu训练则参数gpus指定GPU卡号。 +* `lr`: 学习率, 默认为5e-5。 +* `batch_size`: 运行一个batch大小,默认为64。 +* `epochs`: 训练轮次,默认为10。 +* `save_dir`: 训练保存模型的文件路径。 +* `data_path`: 数据集文件路径。 + + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── 0.pdopt +├── 0.pdparams +├── 1.pdopt +├── 1.pdparams +├── ... +└── final.pdparams +``` + +**NOTE:** + +* 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/0`即可,程序会自动加载模型参数`checkpoints/0.pdparams`,也会自动加载优化器状态`checkpoints/0.pdopt`。 +* 使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在`output_path`指定路径中。 + 运行方式: + +```shell +python export_model.py --vocab_path=./robot_chat_word_dict.txt --params_path=./checkpoints/final.pdparams --output_path=./static_graph_params +``` + +其中`params_path`是指动态图训练保存的参数路径,`output_path`是指静态图参数导出路径。 + +导出模型之后,可以用于部署,deploy/python/predict.py文件提供了python部署预测示例。运行方式: + +```shell +python deploy/python/predict.py --model_file=static_graph_params.pdmodel --params_file=static_graph_params.pdiparams +``` + +### 模型预测 + +启动预测: + +CPU启动: + +```shell +python predict.py --vocab_path=./robot_chat_word_dict.txt \ + --device=cpu \ + --params_path=./checkpoints/final.pdparams +``` + +GPU启动: + +```shell +export CUDA_VISIBLE_DEVICES=0 +python predict.py --vocab_path=./robot_chat_word_dict.txt \ + --device=gpu \ + --params_path=./checkpoints/final.pdparams +``` + +XPU启动: + +```shell +python predict.py --vocab_path=./robot_chat_word_dict.txt \ + --device=xpu \ + --params_path=./checkpoints/final.pdparams +``` + +待预测数据如以下示例: + +```text +你再骂我我真的不跟你聊了 +你看看我附近有什么好吃的 +我喜欢画画也喜欢唱歌 +``` + +经过`preprocess_prediction_data`函数处理后,调用`predict`函数即可输出预测结果。 + +如 + +```text +Data: 你再骂我我真的不跟你聊了 Label: negative +Data: 你看看我附近有什么好吃的 Label: neutral +Data: 我喜欢画画也喜欢唱歌 Label: positive +``` + +## Reference + +TextCNN参考论文: + +- [EMNLP2014-Convolutional Neural Networks for Sentence Classification](https://aclanthology.org/D14-1181.pdf) diff --git a/examples/sentiment_analysis/textcnn/data.py b/examples/sentiment_analysis/textcnn/data.py new file mode 100644 index 0000000000000000000000000000000000000000..3426f065afec8c5ae15762d32ac24aa5a0c6ccb9 --- /dev/null +++ b/examples/sentiment_analysis/textcnn/data.py @@ -0,0 +1,93 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + """ + Create dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. 
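+            Typically built with `paddlenlp.datasets.load_dataset` (as done in train.py of this example).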
+        mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
+        batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
+        batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging
+            the sample list; `None` means only stacking each field of the samples along axis 0
+            (same as :attr:`np.stack(..., axis=0)`).
+        trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
+
+    Returns:
+        dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
+    """
+    if trans_fn:
+        dataset = dataset.map(trans_fn)
+
+    shuffle = True if mode == "train" else False
+    if mode == "train":
+        sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
+    else:
+        sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
+    dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn)
+    return dataloader
+
+
+def preprocess_prediction_data(data, tokenizer, pad_token_id=0, max_ngram_filter_size=3):
+    """
+    It processes the prediction data into the same format as used in training.
+
+    Args:
+        data (obj:`list[str]`): The prediction data, each element of which is a tokenized text.
+        tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string into tokens.
+        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
+        max_ngram_filter_size (obj:`int`, optional, defaults to 3): Max n-gram size in the TextCNN model.
+            Users should refer to the ngram_filter_sizes setting in TextCNN; if ngram_filter_sizes=(1, 2, 3),
+            then max_ngram_filter_size=3.
+
+    Returns:
+        examples (obj:`list`): The processed data, each element of which
+            is a `list` object containing
+
+            - word_ids(obj:`list[int]`): The list of word ids.
+    """
+    examples = []
+    for text in data:
+        ids = tokenizer.encode(text)
+        seq_len = len(ids)
+        # The sequence length should be larger than or equal to the maximum ngram_filter_size in the TextCNN model
+        if seq_len < max_ngram_filter_size:
+            ids.extend([pad_token_id] * (max_ngram_filter_size - seq_len))
+        examples.append(ids)
+    return examples
+
+
+def convert_example(example, tokenizer):
+    """Convert a raw example into (input_ids, label) arrays."""
+    input_ids = tokenizer.encode(example["text"])
+    input_ids = np.array(input_ids, dtype="int64")
+
+    label = np.array(example["label"], dtype="int64")
+    return input_ids, label
+
+
+def read_custom_data(filename):
+    """Reads data."""
+    with open(filename, "r", encoding="utf-8") as f:
+        # Skip the header line
+        next(f)
+        for line in f:
+            data = line.strip().split("\t")
+            label, text = data
+            yield {"text": text, "label": label}
diff --git a/examples/sentiment_analysis/textcnn/deploy/python/predict.py b/examples/sentiment_analysis/textcnn/deploy/python/predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..35d6e15ecf2cb205284cb0a08f500b3df3744dcb
--- /dev/null
+++ b/examples/sentiment_analysis/textcnn/deploy/python/predict.py
@@ -0,0 +1,141 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np +import paddle +import paddle.nn.functional as F + +from paddlenlp.data import JiebaTokenizer, Pad, Vocab + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_file", + type=str, + required=True, + default="./static_graph_params.pdmodel", + help="The path to model info in static graph.", +) +parser.add_argument( + "--params_file", + type=str, + required=True, + default="./static_graph_params.pdiparams", + help="The path to parameters in static graph.", +) +parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The path to vocabulary.") +parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.", +) +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu"], + default="gpu", + help="Select which device to train model, defaults to gpu.", +) +args = parser.parse_args() + + +def convert_example(data, tokenizer, pad_token_id=0, max_ngram_filter_size=3): + """convert_example""" + input_ids = tokenizer.encode(data) + seq_len = len(input_ids) + # Sequence length should larger or equal than the maximum ngram_filter_size in TextCNN model + if seq_len < max_ngram_filter_size: + input_ids.extend([pad_token_id] * (max_ngram_filter_size - seq_len)) + input_ids = np.array(input_ids, dtype="int64") + return input_ids + + +class Predictor(object): + def __init__(self, model_file, params_file, device, max_seq_length): + self.max_seq_length = max_seq_length + + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, label_map, batch_size=1, pad_token_id=0): + """ + Predicts the data labels. + + Args: + data (obj:`list(str)`): Data to be predicted. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + examples = [] + for text in data: + input_ids = convert_example(text, tokenizer) + examples.append(input_ids) + + # Separates data into some batches. 
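+        # The texts are grouped into fixed-size batches below; `batchify_fn` (a `Pad`
+        # collator) then pads every sample in a batch to the batch's longest length
+        # with `pad_token_id`, so each batch becomes one rectangular id array.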
+ batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + batchify_fn = lambda samples, fn=Pad(axis=0, pad_val=pad_token_id): fn(samples) # input + + results = [] + for batch in batches: + input_ids = batchify_fn(batch) + self.input_handles[0].copy_from_cpu(input_ids) + self.predictor.run() + logits = paddle.to_tensor(self.output_handle.copy_to_cpu()) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_length) + + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + pad_token_id = vocab.to_indices("[PAD]") + tokenizer = JiebaTokenizer(vocab) + label_map = {0: "negative", 1: "neutral", 2: "positive"} + + # Firstly pre-processing prediction data and then do predict. + data = ["你再骂我我真的不跟你聊了", "你看看我附近有什么好吃的", "我喜欢画画也喜欢唱歌"] + + results = predictor.predict(data, tokenizer, label_map, batch_size=args.batch_size, pad_token_id=pad_token_id) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/textcnn/export_model.py b/examples/sentiment_analysis/textcnn/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..0953a40020532f6f263d6fbcc2dbde375dbaad32 --- /dev/null +++ b/examples/sentiment_analysis/textcnn/export_model.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from paddlenlp.data import Vocab +from model import TextCNNModel + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The path to vocabulary.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +parser.add_argument("--output_path", type=str, default='./static_graph_params', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + + +def main(): + # Load vocab. + if not os.path.exists(args.vocab_path): + raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) + + vocab = Vocab.load_vocabulary(args.vocab_path) + label_map = {0: "negative", 1: "neutral", 2: "positive"} + + # Construct the newtork. + vocab_size = len(vocab) + num_classes = len(label_map) + pad_token_id = vocab.to_indices("[PAD]") + + model = TextCNNModel(vocab_size, num_classes, padding_idx=pad_token_id, ngram_filter_sizes=(1, 2, 3)) + + # Load model parameters. 
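+    # The dygraph parameters saved by train.py are restored into the model,
+    # which is switched to eval mode before being traced to a static graph below.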
+ state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + model.eval() + + inputs = [paddle.static.InputSpec(shape=[None, None], dtype="int64")] + + model = paddle.jit.to_static(model, input_spec=inputs) + # Save in static graph model. + paddle.jit.save(model, args.output_path) + + +if __name__ == "__main__": + main() diff --git a/examples/sentiment_analysis/textcnn/model.py b/examples/sentiment_analysis/textcnn/model.py new file mode 100644 index 0000000000000000000000000000000000000000..655f1e2b8492a7748ed581c1413ebe73880433ce --- /dev/null +++ b/examples/sentiment_analysis/textcnn/model.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + +from paddlenlp.seq2vec import CNNEncoder + + +class TextCNNModel(nn.Layer): + """ + This class implements the Text Convolution Neural Network model. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `CNNEncoder`. + The CNN has one convolution layer for each ngram filter size. Each convolution operation gives + out a vector of size num_filter. The number of times a convolution layer will be used + is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these + outputs from the convolution layer and outputs the max. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + + """ + + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + num_filter=128, + ngram_filter_sizes=(1, 2, 3), + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.encoder = CNNEncoder(emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes) + self.fc = nn.Linear(self.encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, len(ngram_filter_sizes) * num_filter) + encoder_out = paddle.tanh(self.encoder(embedded_text)) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(encoder_out)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits diff --git a/examples/sentiment_analysis/textcnn/predict.py b/examples/sentiment_analysis/textcnn/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..ba3dd295814933e05f57ed7fa6e47e53ed54263b --- /dev/null +++ b/examples/sentiment_analysis/textcnn/predict.py @@ -0,0 +1,94 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse + +import paddle +import paddle.nn.functional as F +from paddlenlp.data import JiebaTokenizer, Pad, Vocab + +from model import TextCNNModel +from data import preprocess_prediction_data + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The path to vocabulary.") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data, label_map, batch_size=1, pad_token_id=0): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`list`): The processed data whose each element + is a `list` object, which contains + + - word_ids(obj:`list[int]`): The list of word ids. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + batchify_fn = lambda samples, fn=Pad(axis=0, pad_val=pad_token_id): [data for data in fn(samples)] + + results = [] + model.eval() + for batch in batches: + texts = paddle.to_tensor(batchify_fn(batch)) + logits = model(texts) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # Load vocab. + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + label_map = {0: "negative", 1: "neutral", 2: "positive"} + + # Construct the newtork. + vocab_size = len(vocab) + num_classes = len(label_map) + pad_token_id = vocab.to_indices("[PAD]") + + model = TextCNNModel(vocab_size, num_classes, padding_idx=pad_token_id, ngram_filter_sizes=(1, 2, 3)) + + # Load model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. 
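+    # preprocess_prediction_data (see data.py) cuts each text with the Jieba tokenizer,
+    # maps tokens to vocabulary ids, and pads inputs shorter than the largest n-gram size.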
+ data = ["你再骂我我真的不跟你聊了", "你看看我附近有什么好吃的", "我喜欢画画也喜欢唱歌"] + tokenizer = JiebaTokenizer(vocab) + examples = preprocess_prediction_data(data, tokenizer, pad_token_id) + + results = predict(model, examples, label_map=label_map, batch_size=args.batch_size, pad_token_id=pad_token_id) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/textcnn/train.py b/examples/sentiment_analysis/textcnn/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e80d2180af759c725d9f158910d5248ecfb12a30 --- /dev/null +++ b/examples/sentiment_analysis/textcnn/train.py @@ -0,0 +1,108 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +import argparse +import os +import random + +import numpy as np +import paddle +from paddlenlp.datasets import load_dataset +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab + +from data import create_dataloader, convert_example, read_custom_data +from model import TextCNNModel + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--data_path", type=str, default='./RobotChat', help="The path of datasets to be loaded") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The directory to dataset.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed=1000): + """Sets random seed.""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +if __name__ == "__main__": + paddle.set_device(args.device) + set_seed() + + # Load vocab. + if not os.path.exists(args.vocab_path): + raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) + + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + + # Load datasets. 
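+    # Each *.tsv file contains a header line followed by `label<TAB>text` rows,
+    # which is the format expected by read_custom_data in data.py.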
+ dataset_names = ["train.tsv", "dev.tsv", "test.tsv"] + train_ds, dev_ds, test_ds = [ + load_dataset(read_custom_data, filename=os.path.join(args.data_path, dataset_name), lazy=False) + for dataset_name in dataset_names + ] + + tokenizer = JiebaTokenizer(vocab) + trans_fn = partial(convert_example, tokenizer=tokenizer) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), Stack(dtype="int64") # label + ): [data for data in fn(samples)] + train_loader = create_dataloader( + train_ds, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn, trans_fn=trans_fn + ) + dev_loader = create_dataloader( + dev_ds, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn, trans_fn=trans_fn + ) + test_loader = create_dataloader( + test_ds, batch_size=args.batch_size, mode="test", batchify_fn=batchify_fn, trans_fn=trans_fn + ) + + label_map = {0: "negative", 1: "neutral", 2: "positive"} + vocab_size = len(vocab) + num_classes = len(label_map) + pad_token_id = vocab.to_indices("[PAD]") + + model = TextCNNModel(vocab_size, num_classes, padding_idx=pad_token_id, ngram_filter_sizes=(1, 2, 3)) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.Model(model) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Define loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Start training and evaluating. + callback = paddle.callbacks.ProgBarLogger(log_freq=10, verbose=3) + model.fit(train_loader, dev_loader, epochs=args.epochs, save_dir=args.save_dir, callbacks=callback) + + # Evaluate on test dataset + print("Start to evaluate on test dataset...") + model.evaluate(test_loader, log_freq=len(test_loader)) diff --git a/examples/simultaneous_translation/stacl/README.md b/examples/simultaneous_translation/stacl/README.md new file mode 100644 index 0000000000000000000000000000000000000000..554b251a4b6177e7fb4d7b6761dc82f6af1c6688 --- /dev/null +++ b/examples/simultaneous_translation/stacl/README.md @@ -0,0 +1,160 @@ +# Text Simultaneous Translation using Prefix-to-Prefix Framework: STACL + +同声传译(Simultaneous Translation),即在句子完成之前进行翻译,同声传译的目标是实现同声传译的自动化,它可以与源语言同时翻译,延迟时间只有几秒钟。 + +同声传译的难点在于源语言和目标语言之间词序的差异带来的翻译延迟。 例如,考虑将SOV(主宾谓)语言(如日语或德语)翻译为SVO(主谓宾)语言(如英语或汉语),必须等到源语言动词出现才可以准确翻译。因此,翻译系统必须求助于传统的全句翻译,因此造成至少一句话的延迟。 + +本项目是基于机器翻译领域主流模型 Transformer[1]网络结构的同传模型STACL的PaddlePaddle 实现,包含模型训练,预测以及使用自定义数据等内容。用户可以基于发布的内容搭建自己的同传翻译模型。 + +## 模型介绍 + +### 模型特点 + +STACL 是论文 [STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework](https://www.aclweb.org/anthology/P19-1289/) 中针对同传提出的适用于所有同传场景的翻译架构[2],该架构基于Transformer实现,可参考PaddleNLP的[Transformer](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_translation/transformer)。 + +STACL 主要具有以下优势: + +- Prefix-to-Prefix架构拥有预测能力,即在未看到源词的情况下仍然可以翻译出对应的目标词,克服了SOV→SVO等词序差异; + +- Wait-k策略可以不需要全句的源句,直接预测目标句,可以实现任意的字级延迟,同时保持较高的翻译质量。 + +#### Prefix-to-Prefix架构 +<p align="center"> +<img src="images/STACL_architecture.png" height=300 hspace='10'/> <br /> +图 1. Seq2Seq vs. 
STACL
+</p>
+
+STACL与传统的机器翻译模型的主要区别在于翻译时是否需要利用全句的源句。上图中,Seq2Seq模型需要等到全句的源句(1-5)全部输入Encoder后,Decoder才开始解码进行翻译;而STACL架构采用了Wait-k(图中Wait-2)的策略,当源句只有两个词(1和2)输入到Encoder后,Decoder即可开始解码预测目标句的第一个词。
+
+#### Wait-k 策略
+Wait-k策略首先等待k个源句单词,然后与源句的其余部分同步翻译,即输出总是落后于输入k个词。这是受到同声传译人员的启发:同声传译人员通常会在演讲开始几秒钟后开始翻译,并在演讲者结束几秒钟后完成翻译。例如,如果k=2,第一个目标词使用前2个源词预测,第二个目标词使用前3个源词预测,以此类推。下图2中,(a) simultaneous: our wait-2 等到"布什"和"总统"输入后就开始解码预测"pres.",而(b) non-simultaneous baseline 为传统的翻译模型,需要等到整句"布什 总统 在 莫斯科 与 普京 会晤"输入后才开始解码预测。
+<p align="center">
+<img src="images/example.png" height=100 hspace='10'/> <br />
+图 2. Wait-k 例子
+</p>
+
+## 环境依赖
+ - attrdict==2.0.1
+ - PyYAML==5.4.1
+ - subword_nmt==0.3.7
+ - jieba==0.42.1
+
+安装命令:`pip install -r requirements.txt`
+
+## 数据准备
+
+### 数据分词
+中文需要首先经过jieba分词,然后经过BPE分词(Byte Pair Encoding);英文仅需要经过BPE分词。
+BPE分词需要对应的BPE词典,这里提供下载链接:[中文BPE词典](https://bj.bcebos.com/paddlenlp/models/stacl/2M.zh2en.dict4bpe.zh) ,[英文BPE词典](https://bj.bcebos.com/paddlenlp/models/stacl/2M.zh2en.dict4bpe.en) 。
+
+我们提供分词的接口,下面给出分词的具体操作:
+```python
+from utils.tokenizer import STACLTokenizer
+
+tokenizer_zh = STACLTokenizer('2M.zh2en.dict4bpe.zh', is_chinese=True)
+# 处理中文字符串
+print(tokenizer_zh.tokenize('玻利维亚举行总统与国会选举'))
+# 输出是: 玻@@ 利@@ 维亚 举行 总统 与 国会 选举
+
+# 处理英文字符串
+tokenizer_en = STACLTokenizer('2M.zh2en.dict4bpe.en', is_chinese=False)
+print(tokenizer_en.tokenize('bolivia holds presidential and parliament elections'))
+# 输出是:bol@@ i@@ via holds presidential and parliament elections
+```
+
+### 数据格式
+每行数据为分词后的中英文,用制表符分割。
+```
+兵营 是 双@@ 枪 老@@ 大@@ 爷 的 前提 建筑 之一 。 it serves as a prerequisite for Re@@ apers to be built at the Bar@@ rac@@ ks .
+```
+
+## 单机训练
+
+### 单机单卡/单机多卡
+可以执行以下命令进行模型训练:
+``` sh
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0" train.py --config ./config/transformer.yaml
+```
+
+可以在`config/transformer.yaml` 文件中设置相应的参数。如果执行时不提供 `--config` 选项,程序将默认使用`config/transformer.yaml` 的配置。
+
+建议:为了取得更好的效果,可先训练整句模型(即`waitk=-1`)作为预训练模型,然后在此基础上根据不同的wait-k进行微调,得到不同的wait-k模型,训练的命令都同上。下面给出具体的流程以及主要的参数配置:
+- Pretrain
+  用来训练整句模型(即`waitk=-1`),可在`config/transformer.yaml`文件中配置参数:
+  - `waitk`表示wait-k策略,这里设置为-1
+  - `training_file`表示训练集,数据格式同上文
+  - `validation_file`表示验证集,数据格式同上文
+  - `init_from_checkpoint`表示模型目录,从该checkpoint恢复训练,这里设置为空
+  - `init_from_pretrain_model`表示模型目录,从该checkpoint开始finetune下游任务,这里设置为空
+  - `device`选择训练用的设备,支持cpu/gpu/xpu,默认为gpu
+  - `use_amp`表示是否使用混合精度训练,示例设置为False
+- Finetune
+  用来训练wait-k模型(即`waitk=1,2,3,4...`),可在`config/transformer.yaml`文件中配置参数:
+  - `waitk`表示wait-k策略,这里设置为3(以wait-3模型为例)
+  - `training_file`表示训练集,数据格式同上文
+  - `validation_file`表示验证集,数据格式同上文
+  - `init_from_checkpoint`表示模型目录,从该checkpoint恢复训练,这里设置为`waitk=-1`模型的checkpoint
+  - `init_from_pretrain_model`表示模型目录,从该checkpoint开始finetune下游任务,这里设置为空
+  - `device`选择训练用的设备,支持cpu/gpu/xpu,默认为gpu
+  - `use_amp`表示是否使用混合精度训练,示例设置为False
+
+## 模型推理
+
+模型训练完成后可以执行以下命令对指定文件中的文本进行翻译:
+
+``` sh
+# setting visible devices for prediction
+export CUDA_VISIBLE_DEVICES=0
+python predict.py --config ./config/transformer.yaml
+```
+- Predict
+  根据具体的wait-k策略来进行翻译,可在`config/transformer.yaml`文件中配置参数,预测的命令同上。下面给出主要的参数说明:
+  - `waitk`表示wait-k策略,这里设置为3(以wait-3模型为例)
+  - `predict_file`表示测试集,数据格式是BPE分词后的源语言(中文为Jieba+BPE分词),按行区分
+  - `output_file`表示输出文件,翻译结果会输出到该参数指定的文件
+  - `init_from_params`表示模型所在的目录,根据具体的`waitk`来设置,这里设置为wait-3模型的目录
+  - 更多参数的使用可以在 `config/transformer.yaml`文件中查阅注释说明并进行更改设置。如果执行时不提供 `--config` 选项,程序将默认使用 `config/transformer.yaml` 的配置。
+
+需要注意的是,目前预测仅支持单卡:后续的模型评估依赖于预测结果写入文件的顺序,多卡情况下暂未支持将结果按照指定顺序写入文件。
+
+
+## 模型评估
+
+预测结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE
表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): + +``` sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +git clone https://github.com/moses-smt/mosesdecoder.git +# 以中英翻译 newstest2017 测试数据为例 +perl mosesdecoder/scripts/generic/multi-bleu.perl newstest2017.tok.en < predict.tok.txt +``` + +## 模型下载(更新中) +我们提供基于NIST(中->英,共2M中英句对)预训练模型,供大家下载,下载后需解压使用。 +| Wait-k策略 | 模型连接 | 4-ref BLEU on NIST 2008| +| ------------ | --------------- |---------| +| Wait-1 | [下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w1.tar.gz) |30.94| +| Wait-3 |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w3.tar.gz) |34.24 | +| Wait-5 |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w5.tar.gz) |36.30 | +| Wait-7 |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w7.tar.gz) |37.84 | +| Wait_-1(整句模型) |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_sent.tar.gz) |41.41 | +词表下载:[source vocab](https://bj.bcebos.com/paddlenlp/models/stacl/nist.20k.zh.vocab) ,[target vocab](https://bj.bcebos.com/paddlenlp/models/stacl/nist.10k.en.vocab) + +## Demo展示 +通过GUI界面的Demo来模拟STACL实时翻译的效果,下图为Demo示例,实现细节可查看[demo](./demo) +<p align="center"> +<img src="demo/images/text_demo_show.gif" height=350 hspace='10'/> <br /> +图 3. 文本同传 +</p> +<p align="center"> +<img src="demo/images/speech_demo_show.gif" height=350 hspace='10'/> <br /> +图 4. 语音同传 +</p> + +## 参考文献 +1. Vaswani A, Shazeer N, Parmar N, et al. [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010. +2. Ma M , Huang L , Xiong H , et al. [STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework](https://www.aclweb.org/anthology/P19-1289/)[J]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2018: 3025–3036. +3. He K, Zhang X, Ren S, et al. [Deep residual learning for image recognition](http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778. +4. Ba J L, Kiros J R, Hinton G E. [Layer normalization](https://arxiv.org/pdf/1607.06450.pdf)[J]. arXiv preprint arXiv:1607.06450, 2016. diff --git a/examples/simultaneous_translation/stacl/config/transformer.yaml b/examples/simultaneous_translation/stacl/config/transformer.yaml new file mode 100644 index 0000000000000000000000000000000000000000..2bdb04cdd2037d9b157120715f4af232d5c47019 --- /dev/null +++ b/examples/simultaneous_translation/stacl/config/transformer.yaml @@ -0,0 +1,99 @@ +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# path of trained parameter, to make prediction +init_from_params: "trained_models/step_final/" +# the directory for saving model +save_model: "trained_models" +# Set seed for CE or debug +random_seed: 42 +# The pattern to match training data files. +training_file: "data/nist2m/train.zh-en.bpe" +# The pattern to match validation data files. +validation_file: "data/nist2m/dev.zhen.bpe" +# The pattern to match test data files. 
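+# Each line of this file should be one BPE-tokenized source sentence
+# (for Chinese: Jieba segmentation followed by BPE), see the README.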
+predict_file: "data/nist2m/test_08.zh.bpe" +# The file to output the translation results of predict_file to. +output_file: "predict.txt" +# The path of vocabulary file of source language. +src_vocab_fpath: "data/nist2m/nist.20k.zh.vocab" +# The path of vocabulary file of target language. +trg_vocab_fpath: "data/nist2m/nist.10k.en.vocab" +# The <bos>, <eos> and <unk> tokens in the dictionary. +special_token: ["<s>", "<e>", "<unk>"] + +# Use which device to train or predict(cpu,gpu,xpu) +device: gpu + +# Args for reader, see reader.py for details +pool_size: 200000 +sort_type: "pool" +shuffle: True +shuffle_batch: True +batch_size: 4096 + +# Hyparams for training: +# the number of epoches for training +epoch: 30 +# the hyper parameters for Adam optimizer. +# This static learning_rate will be multiplied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 2.0 +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# the parameters for learning rate scheduling. +warmup_steps: 8000 +# the weight used to mix up the ground-truth distribution and the fixed +# uniform distribution in label smoothing when training. +# Set this as zero if label smoothing is not wanted. +label_smooth_eps: 0.1 + +# Hyparams for generation: +# the parameters for beam search. +beam_size: 5 +max_out_len: 256 +# the number of decoded sentences to output. +n_best: 1 + +# Hyparams for model: +# These following five vocabularies related configurations will be set +# automatically according to the passed vocabulary path and special tokens. +# size of source word dictionary. +src_vocab_size: 10000 +# size of target word dictionay +trg_vocab_size: 10000 +# index for <bos> token +bos_idx: 0 +# index for <eos> token +eos_idx: 1 +# index for <unk> token +unk_idx: 2 +# max length of sequences deciding the size of position encoding table. +max_length: 256 +# the dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# number of head used in multi-head attention. +n_head: 8 +# number of sub-layers to be stacked in the encoder and decoder. +n_layer: 6 +# dropout rates. +dropout: 0.1 +# the flag indicating whether to share embedding and softmax weights. +# vocabularies in source and target should be same for weight sharing. +weight_sharing: False +# Wait-k policy +waitk: -1 +# Mixed precision training +use_amp: False +# Maximum iteration for training. +max_iter: None \ No newline at end of file diff --git a/examples/simultaneous_translation/stacl/demo/README.md b/examples/simultaneous_translation/stacl/demo/README.md new file mode 100644 index 0000000000000000000000000000000000000000..84c2ff473ecf488f6f44d37f9cd33847f37f1775 --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/README.md @@ -0,0 +1,92 @@ +# Demo for STACL + +该Demo模拟同传模型STACL实时翻译的效果。 +<p align="center"> +<img src="images/text_demo_show.gif" height=350 hspace='10'/> <br /> +图 1. 文本同传 +</p> +<p align="center"> +<img src="images/speech_demo_show.gif" height=350 hspace='10'/> <br /> +图 2. 
语音同传 +</p> + +用户通过Chinese input文本框**打字输入**或者**语音输入即本地麦克风收音**,然后通过Jieba和BPE得到分词结果。 + +- Simultaneous Translation (wait 1)是读取1个token(分词后)后开始实时翻译; +- Simultaneous Translation (wait 3)是读取3个token(分词后)后开始实时翻译; +- Simultaneous Translation (wait 5)是读取5个token(分词后)后开始实时翻译; +- Full Sentence Translation(wait -1)是读取所有的token(分词后)即整句后开始翻译。 + +一般来说,waitk越大(waitk=-1可看作waitk=∞),读入的信息越多,实时翻译效果越好。由上图可见,STACL具有较好的预测性,较小的waitk也能得到较好的翻译结果。 + +### 目录结构 +```text +. +├── README.md # 说明文档,本文件 +├── const.py # 语音识别应用鉴权信息 +├── demo.py # 启动demo的主程序文件 +├── images +│ ├── speech_demo_show.gif # 语音同传Demo效果图 +│ ├── paddlenlp.png # Demo界面logo +│ └── text_demo_show.gif # 文本同传Demo效果图 +├── model_demo.py # STACL模型文件 +├── models # 预训练模型路径 +│ ├── nist_wait_1 # waitk=1模型 +│ ├── nist_wait_3 # waitk=3模型 +│ ├── nist_wait_5 # waitk=5模型 +│ └── nist_wait_-1 # waitk=-1(整句模型) +├── requirements.txt # 环境依赖文件 +└── transformer_demo.yaml # 参数配置文件 + +``` + +上述models目录下的模型可以在这里[下载](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/simultaneous_translation/stacl/README.md#%E6%A8%A1%E5%9E%8B%E4%B8%8B%E8%BD%BD%E6%9B%B4%E6%96%B0%E4%B8%AD) ,下载完后将解压后的`transformer.pdparams`分别放在不同的waitk策略对应的子目录下面。 + +### 参数说明与配置 + +##### 1. 模型参数配置 +可以在`transformer_demo.yaml` 文件中设置相应的参数,下面给出主要的参数配置: + +- `src_bpe_dict`配置源语言(这里是中文)的BPE词表,[中文BPE词表下载](https://bj.bcebos.com/paddlenlp/models/stacl/2M.zh2en.dict4bpe.zh) +- `src_vocab_fpath`配置源语言(这里是中文)词表,[source vocab](https://bj.bcebos.com/paddlenlp/models/stacl/nist.20k.zh.vocab) +- `trg_vocab_fpath`配置目标语言(这里是英文)词表,[target vocab](https://bj.bcebos.com/paddlenlp/models/stacl/nist.10k.en.vocab) +- `device`选择预测用的设备,支持cpu/gpu/xpu,默认为cpu + +##### 2. 语音同传参数配置 +需要配置`const.py`里面语音识别的应用鉴权信息,只需要将`APPID`和`APPKEY`设置为自己所申请的。 +申请教程:[教程](./README_ai.md) + +### 环境依赖 +##### 1. 基本环境 +- attrdict==2.0.1 +- PyYAML==5.4.1 +- subword_nmt==0.3.7 +- jieba==0.42.1 +- websocket-client==1.0.1 + +可通过安装命令:`pip install -r requirements.txt`来进行安装。 + +注意:本项目依赖于Python内置包`tkinter >= 8.6` +- 查看`tkinter`的版本: + ```python + python -c "import tkinter; print(tkinter.TkVersion)" +- [`tkinter`官方文档](https://tkdocs.com/tutorial/index.html) + +##### 2. 语音同传环境 +需要安装`pyaudio==0.2.11`来调用本地麦克风,安装教程参考[官网](http://people.csail.mit.edu/hubert/pyaudio/) +安装失败,则只会启动文本同传。 + + +### 使用说明 + +1. 下载[预训练模型](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/simultaneous_translation/stacl/README.md#%E6%A8%A1%E5%9E%8B%E4%B8%8B%E8%BD%BD%E6%9B%B4%E6%96%B0%E4%B8%AD) ,并放在models目录下对应的子目录里; +2. 下载词表(源语言词表,目标语言词表,BPE词表),并在配置文件`transformer_demo.yaml`中修改相应的参数; +3. 运行`demo.py`; +4. 出现界面,在Chinese input文本框中输入中文,按【回车键】开始实时翻译,或者按【REC】开始录音并开始实时翻译,遇到【。!?】结束整句,按【CLEAR】清空所有的输入和输出。 + +### 常见问题 +**Q:** 出现`_tkinter.TclError: couldn't recognize data in image file`错误 +**A:** 升级`tkinter`,确保`tkinter >= 8.6` + +**Q:** 出现Chinese input文本框无法输入中文 +**A:** 升级`tkinter`,确保`tkinter >= 8.6` diff --git a/examples/simultaneous_translation/stacl/demo/README_ai.md b/examples/simultaneous_translation/stacl/demo/README_ai.md new file mode 100644 index 0000000000000000000000000000000000000000..20ded3903b7b075959049e0c5be8730b8c34b314 --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/README_ai.md @@ -0,0 +1,42 @@ +# AI接入指南 +### 1. 
成为开发者 +- 1.1 参考[AI接入指南](https://ai.baidu.com/ai-doc/REFERENCE/Ck3dwjgn3) 完成第一步 + +### 2.创建应用 +- 2.1 进入[控制台](https://console.bce.baidu.com/?fromai=1#/index/overview_v3) ,选择【语音技术】 +<p align="center"> +<img src="images/step1.png"/> <br /> +</p> + +- 2.2 [创建应用](https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/app/create) ,参数配置 +<p align="center"> +<img src="images/step2.png"/> <br /> +</p> + +<p align="center"> +<img src="images/step3.png"/> <br /> +</p> +<p align="center"> +<img src="images/step4.png"/> <br /> +</p> + +### 3.领取免费资源 +- 3.1 [领取免费资源](https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/overview/resource/getFree) ,需要实名认证 +<p align="center"> +<img src="images/step5.png"/> <br /> +</p> + +<p align="center"> +<img src="images/step6.png"/> <br /> +</p> + +- 3.2 开通付费,这里因为有10个小时的免费资源,所以不收费 +<p align="center"> +<img src="images/step7.png"/> <br /> +</p> + +### 4.获取密钥 +- 4.1 本项目主要用到AppID和API Key +<p align="center"> +<img src="images/step8.png"/> <br /> +</p> diff --git a/examples/simultaneous_translation/stacl/demo/const.py b/examples/simultaneous_translation/stacl/demo/const.py new file mode 100644 index 0000000000000000000000000000000000000000..06e89a1ae55ae497a9488cd90e5163310979b8c5 --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/const.py @@ -0,0 +1,25 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Your APPID (type: int) +APPID = None + +# Your APPKEY(type: str) +APPKEY = None + +# Do not modify: Chinese Putonghua PID +DEV_PID = 15372 + +# Do not modify: wss link +URI = "wss://vop.baidu.com/realtime_asr" diff --git a/examples/simultaneous_translation/stacl/demo/demo.py b/examples/simultaneous_translation/stacl/demo/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..caa1b317c4029bb9b9a70111fcd9be900305ac60 --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/demo.py @@ -0,0 +1,650 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
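+
+# GUI demo: each Chinese input (typed, or recognized from the local microphone)
+# is segmented with Jieba + BPE and then translated in parallel by four STACL
+# models (wait-1, wait-3, wait-5 and the full-sentence wait -1 model).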
+import argparse +import json +import os +import threading +import time +import uuid +from tkinter import END, LEFT, Button, E, Entry, Label, PhotoImage, Tk, W + +import _locale +import jieba +import paddle +import websocket +import yaml +from attrdict import AttrDict +from subword_nmt import subword_nmt + +from paddlenlp.data import Vocab +from paddlenlp.transformers import position_encoding_init +from paddlenlp.utils.log import logger + +open_speech = True +try: + from pyaudio import PyAudio, paInt16 +except ImportError: + open_speech = False + logger.warning("No module named 'pyaudio', so no audio demo.") + +import const # noqa: E402 +from model_demo import SimultaneousTransformerDemo # noqa: E402 + +# By default, the Windows system opens the file with GBK code, +# and the subword_nmt package does not support setting open encoding, +# so it is set to UTF-8 uniformly. +_locale._getdefaultlocale = lambda *args: ["en_US", "utf8"] + +is_win = False +if os.name == "nt": + is_win = True + + +class STACLTokenizer: + """ + Jieba+BPE, and convert tokens to ids. + """ + + def __init__(self, args, is_chinese): + bpe_parser = subword_nmt.create_apply_bpe_parser() + bpe_args = bpe_parser.parse_args(args=["-c", args.src_bpe_dict]) + self.bpe = subword_nmt.BPE(bpe_args.codes, bpe_args.merges, bpe_args.separator, None, bpe_args.glossaries) + self.is_chinese = is_chinese + + self.src_vocab = Vocab.load_vocabulary( + args.src_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + + self.trg_vocab = Vocab.load_vocabulary( + args.trg_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + + args.src_vocab_size = len(self.src_vocab) + args.trg_vocab_size = len(self.trg_vocab) + self.args = args + + def tokenize(self, raw_string): + raw_string = raw_string.strip("\n") + if not raw_string: + return raw_string, raw_string + if self.is_chinese: + raw_string = " ".join(jieba.cut(raw_string)) + bpe_str = self.bpe.process_line(raw_string) + ids = self.src_vocab.to_indices(bpe_str.split()) + return bpe_str.split(), ids + + +def init_model(args, init_from_params): + # Define model + args.init_from_params = init_from_params + transformer = SimultaneousTransformerDemo( + args.src_vocab_size, + args.trg_vocab_size, + args.max_length + 1, + args.n_layer, + args.n_head, + args.d_model, + args.d_inner_hid, + args.dropout, + args.weight_sharing, + args.bos_idx, + args.eos_idx, + args.waitk, + ) + + # Load the trained model + assert args.init_from_params, "Please set init_from_params to load the infer model." + + model_dict = paddle.load(os.path.join(args.init_from_params, "transformer.pdparams")) + + # To avoid a longer length than training, reset the size of position + # encoding to max_length + model_dict["src_pos_embedding.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + model_dict["trg_pos_embedding.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + + transformer.load_dict(model_dict) + return transformer + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. 
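+    Everything after the first <eos> id is dropped; the <bos>/<eos> ids themselves
+    are kept only when output_bos/output_eos are set to True.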
+ """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def translate( + args, tokenizer, tokenized_src, transformers, waitks, decoder_max_length, is_last, caches, bos_id, all_result +): + # Set evaluate mode + for transformer in transformers: + transformer.eval() + + for idx, (waitk, transformer) in enumerate(zip(waitks, transformers)): + if len(tokenized_src) < waitk or (waitk == -1 and not is_last): + continue + with paddle.no_grad(): + input_src = tokenized_src + if is_last: + decoder_max_length[idx] = args.max_out_len + input_src += [args.eos_idx] + src_word = paddle.to_tensor(input_src).unsqueeze(axis=0) + finished_seq, finished_scores, cache = transformer.greedy_search( + src_word, max_len=decoder_max_length[idx], waitk=waitk, caches=caches[idx], bos_id=bos_id[idx] + ) + caches[idx] = cache + finished_seq = finished_seq.numpy() + for beam_idx, beam in enumerate(finished_seq[0]): + if beam_idx >= args.n_best: + break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + if len(id_list) == 0: + continue + bos_id[idx] = id_list[-1] + word_list = tokenizer.trg_vocab.to_tokens(id_list) + for word in word_list: + all_result[idx].append(word) + res = " ".join(word_list).replace("@@ ", "") + logger.debug("[waitk={}] {}".format(waitk, res)) + + +def cut_line(str, line_len): + """ + Wrap output + """ + result = [] + temp = [] + for idx, item in enumerate(str.split()): + temp.append(item) + if (idx + 1) % line_len == 0: + result.append(" ".join(temp)) + temp = [] + if len(temp) != 0: + result.append(" ".join(temp)) + return "\n".join(result) + + +def process(args, tokenizer, transformers, waitks): + """ + GUI and main waitk program + :param args: + :param tokenizer: + :param transformers: + :param waitks: + :return: + """ + font_align = ("Courier", 20) + font_label = ("Times", 14) + + if is_win: + font_align = ("Courier", 15) + font_label = ("Times", 11) + + window = Tk() + + window.title("Welcome to Simultaneous Translation") + window.geometry("1200x600") + + logo = PhotoImage(file="images/paddlenlp.png") + button = Label(window, image=logo, compound="center") + button.place(x=0, y=0) + + # for chinese input + lbl1 = Label(window, text="Chinese input:", fg="green", font=font_label, anchor=E, width=28) + lbl1.place(x=0, y=60) + txt = Entry(window, font=font_align) + txt.place(x=250, y=50, width=800, height=50) + + button_on = Button(window, text="REC", relief="raised", cursor="hand2") + if open_speech: + button_on.place(x=1090, y=52) + + s_x, s_y = 0, 130 + x, y = 250, 120 + + # for jieba+BPE + lbl2_s = Label(window, text="Jieba+BPE:", fg="black", font=font_label, anchor=E, width=28) + lbl2_s.place(x=s_x, y=s_y) + lbl2 = Label(window, text="", font=font_align, background="pale green", anchor=E) + lbl2.place(x=x, y=y, width=800, height=50) + + # for wait-1 + waitnum = "1" + lbl3_s = Label( + window, text="Simultaneous\nTranslation (wait " + waitnum + "):", fg="red", font=font_label, anchor=E, width=28 + ) + lbl3_s.place(x=s_x, y=s_y + 70) + + lbl3 = Label(window, text="", font=font_align, background="linen") + lbl3.place(x=x, y=y + 75, width=800, height=50) + + # for wait-3 + waitnum = "3" + lbl4_s = Label( + window, text="Simultaneous\nTranslation (wait " + waitnum + "):", fg="red", font=font_label, anchor=E, width=28 + ) + lbl4_s.place(x=s_x, y=s_y + 140) + lbl4 = Label(window, text="", font=font_align, 
background="linen") + lbl4.place(x=x, y=y + 145, width=800, height=50) + + # for wait-5 + waitnum = "5" + lbl5_s = Label( + window, text="Simultaneous\nTranslation (wait " + waitnum + "):", fg="red", font=font_label, anchor=E, width=28 + ) + lbl5_s.place(x=s_x, y=s_y + 210) + lbl5 = Label(window, text="", font=font_align, background="linen") + lbl5.place(x=x, y=y + 215, width=800, height=50) + + # for wait--1 + lbl6_s = Label( + window, text="Full Sentence\nTranslation (wait -1):", fg="blue", font=font_label, anchor=E, width=28 + ) + lbl6_s.place(x=s_x, y=s_y + 280) + + lbl6 = Label(window, text="", font=font_align, background="sky blue") + lbl6.place(x=x, y=y + 285, width=800, height=50) + + def set_val(event=None): + """ + Start translating + """ + global i + global caches + global bos_id + global decoder_max_length + global all_result + global is_last + global user_input_bpe + global user_input_tokenized + bpe_str, tokenized_src = tokenizer.tokenize(txt.get()) + while i < len(tokenized_src): + user_input_bpe.append(bpe_str[i]) + user_input_tokenized.append(tokenized_src[i]) + lbl2.configure( + text=cut_line((lbl2.cget("text") + " " + bpe_str[i]).strip(), 20), fg="black", anchor=W, justify=LEFT + ) + window.update() + if bpe_str[i] in ["。", "?", "!"]: + is_last = True + translate( + args, + tokenizer, + user_input_tokenized, + transformers, + waitks, + decoder_max_length, + is_last, + caches, + bos_id, + all_result, + ) + lbl3.configure( + text=cut_line(" ".join(all_result[0]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl4.configure( + text=cut_line(" ".join(all_result[1]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl5.configure( + text=cut_line(" ".join(all_result[2]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl6.configure( + text=cut_line(" ".join(all_result[3]).replace("@@ ", ""), 11), fg="blue", anchor=W, justify=LEFT + ) + window.update() + if is_last: + caches = [None] * len(waitks) + bos_id = [None] * len(waitks) + decoder_max_length = [1] * len(waitks) + is_last = False + user_input_bpe = [] + user_input_tokenized = [] + i += 1 + + def set_val_voice(event=None): + """ + Start translating + """ + + def send_start_params(ws): + """ + Send start frame + :param websocket.WebSocket ws: + :return: + """ + req = { + "type": "START", + "data": { + "appid": const.APPID, + "appkey": const.APPKEY, + "dev_pid": const.DEV_PID, + "cuid": "yourself_defined_user_id", + "sample": 16000, + "format": "pcm", + }, + } + body = json.dumps(req) + ws.send(body, websocket.ABNF.OPCODE_TEXT) + logger.info("send START frame with params:" + body) + + def send_audio(ws): + """ + Send audio + :param websocket.WebSocket ws: + :return: + """ + # 160ms record + chunk_ms = 160 + + # 160ms * 16000 * 2bytes / 1000ms = 5120bytes + chunk_len = int(16000 * 2 / 1000 * chunk_ms) + + pa = PyAudio() + stream = pa.open(format=paInt16, channels=1, rate=16000, input=True, frames_per_buffer=chunk_len // 2) + + while True: + frames = [] + frame = stream.read(chunk_len // 2, exception_on_overflow=False) + frames.append(frame) + body = b"".join(frames) + if len(body) == 0: + logger.info("empty body") + continue + logger.debug("try to send audio length {}".format(len(body))) + ws.send(body, websocket.ABNF.OPCODE_BINARY) + + def send_finish(ws): + """ + Send finished frame + :param websocket.WebSocket ws: + :return: + """ + req = {"type": "FINISH"} + body = json.dumps(req) + ws.send(body, websocket.ABNF.OPCODE_TEXT) + logger.info("send FINISH frame") + + def 
close_websocket(ws_app): + if ws_app: + logger.info("close ws_app.") + send_finish(ws_app) + ws_app.close() + logger.info("ws_app closed.") + + def on_open(ws): + """ + Send data frame after connected + :param websocket.WebSocket ws: + :return: + """ + + def run(*args): + """ + Send data frame + :param args: + :return: + """ + send_start_params(ws) + send_audio(ws) + send_finish(ws) + logger.debug("thread terminating") + + threading.Thread(target=run).start() + + def on_error(ws, error): + """ + For error + :param ws: + :param error: json + :return: + """ + logger.error("error: " + str(error)) + + def on_close(ws): + """ + Close websocket + :param websocket.WebSocket ws: + :return: + """ + logger.info("ws close ...") + # ws.close() + + def on_message(ws, message): + """ + Response from server + :param ws: + :param message: json + :return: + """ + global i + global text + global caches + global bos_id + global decoder_max_length + global all_result + global is_last + global user_input_bpe + global user_input_tokenized + global ws_app + global start_time + + logger.info("Response: " + message) + message = json.loads(message) + if is_last and ws_app: + close_websocket(ws_app) + end_time = time.time() + if end_time - start_time > 10 and ws_app: + close_websocket(ws_app) + logger.info( + "ws_app started at: {} closed at: {}, cost {}s.".format( + start_time, end_time, end_time - start_time + ) + ) + if "result" in message: + start_time = time.time() + text = message["result"] + txt.delete(0, END) + txt.insert(0, text) + bpe_str, tokenized_src = tokenizer.tokenize(txt.get()) + while i < len(tokenized_src): + user_input_bpe.append(bpe_str[i]) + user_input_tokenized.append(tokenized_src[i]) + lbl2.configure( + text=cut_line((lbl2.cget("text") + " " + bpe_str[i]).strip(), 20), + fg="black", + anchor=W, + justify=LEFT, + ) + window.update() + if bpe_str[i] in ["。", "?", "!"]: + is_last = True + translate( + args, + tokenizer, + user_input_tokenized, + transformers, + waitks, + decoder_max_length, + is_last, + caches, + bos_id, + all_result, + ) + lbl3.configure( + text=cut_line(" ".join(all_result[0]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl4.configure( + text=cut_line(" ".join(all_result[1]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl5.configure( + text=cut_line(" ".join(all_result[2]).replace("@@ ", ""), 11), fg="red", anchor=W, justify=LEFT + ) + lbl6.configure( + text=cut_line(" ".join(all_result[3]).replace("@@ ", ""), 11), + fg="blue", + anchor=W, + justify=LEFT, + ) + window.update() + if is_last: + caches = [None] * len(waitks) + bos_id = [None] * len(waitks) + decoder_max_length = [1] * len(waitks) + is_last = False + user_input_bpe = [] + user_input_tokenized = [] + if ws_app: + close_websocket(ws_app) + i += 1 + + logger.info("begin") + uri = const.URI + "?sn=" + str(uuid.uuid1()) + logger.info("uri is " + uri) + global start_time + start_time = time.time() + global ws_app + ws_app = websocket.WebSocketApp( + uri, on_open=on_open, on_message=on_message, on_error=on_error, on_close=on_close + ) + ws_app.run_forever() + + def clear(): + """ + Clear input and output + """ + txt.delete(0, END) + global i + global text + global caches + global bos_id + global decoder_max_length + global all_result + global is_last + global user_input_bpe + global user_input_tokenized + global ws_app + global start_time + if ws_app: + ws_app.close() + decoder_max_length = [1] * len(waitks) + caches = [None] * len(waitks) + bos_id = [None] * len(waitks) + 
all_result = [[], [], [], []] + i = 0 + is_last = False + user_input_bpe = [] + user_input_tokenized = [] + start_time = 0 + logger.info("CLEAR") + logger.info(f"i: {i}") + logger.info(f"caches: {caches}") + logger.info(f"bos_id: {bos_id}") + logger.info(f"decoder_max_length: {decoder_max_length}") + logger.info(f"all_result: {all_result}") + logger.info(f"is_last: {is_last}") + lbl2.configure(text="", fg="black", anchor=W, justify=LEFT) + lbl3.configure(text="", fg="red", anchor=W, justify=LEFT) + lbl4.configure(text="", fg="red", anchor=W, justify=LEFT) + lbl5.configure(text="", fg="red", anchor=W, justify=LEFT) + lbl6.configure(text="", fg="blue", anchor=W, justify=LEFT) + window.update() + + txt.bind("<Return>", set_val) + button_on.bind("<Button-1>", set_val_voice) + + desc1 = Label(window, text="使用说明:1. 在Chinese input输入中文,按【回车键】开始实时翻译," "遇到【。!?】结束整句,按【CLEAR】清空所有的输入和输出;", anchor=E) + desc1.place(x=s_x + 100, y=s_y + 380) + + backspace_cnt = 19 + if is_win: + backspace_cnt = 15 + + desc2 = Label( + window, text=" " * backspace_cnt + "2. 按【REC】开始录音并开始实时翻译,遇到【。!?】结束整句," "按【CLEAR】清空所有的输入和输出。", anchor=E + ) + if open_speech: + desc2.place(x=s_x + 100, y=s_y + 410) + + button_clear = Button(window, text="CLEAR", relief="raised", cursor="hand2", command=clear) + + button_clear.place(x=x + 840, y=y + 380) + + window.mainloop() + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="./transformer_demo.yaml", type=str, help="Path of the config file. ") + args = parser.parse_args() + return args + + +if __name__ == "__main__": + args = parse_args() + yaml_file = args.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + + if args.device == "gpu": + place = "gpu:0" + elif args.device == "xpu": + place = "xpu:0" + elif args.device == "cpu": + place = "cpu" + paddle.set_device(place) + + tokenizer = STACLTokenizer(args, is_chinese=True) + waitks = [1, 3, 5, -1] + + transformers = [] + for waitk in waitks: + transformers.append(init_model(args, f"models/nist_wait_{waitk}")) + logger.info(f"Loaded wait_{waitk} model.") + + # for decoding max length + decoder_max_length = [1] * len(waitks) + # for decoding cache + caches = [None] * len(waitks) + # for decoding start token id + bos_id = [None] * len(waitks) + # for result + all_result = [[], [], [], []] + # current source word index + i = 0 + # for decoding: is_last=True, max_len=256 + is_last = False + # subword after bpe + user_input_bpe = [] + # tokenized id + user_input_tokenized = [] + # for stream input + text = "" + # websocket app + ws_app = None + # start time + start_time = 0 + + process(args, tokenizer, transformers, waitks) diff --git a/examples/simultaneous_translation/stacl/demo/images/paddlenlp.png b/examples/simultaneous_translation/stacl/demo/images/paddlenlp.png new file mode 100644 index 0000000000000000000000000000000000000000..9b6f7c53332c7ded62890d020263216c259be4b0 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/paddlenlp.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/speech_demo_show.gif b/examples/simultaneous_translation/stacl/demo/images/speech_demo_show.gif new file mode 100644 index 0000000000000000000000000000000000000000..01f946ed776a7360fbb08a7dd9c8277812c240c0 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/speech_demo_show.gif differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step1.png 
b/examples/simultaneous_translation/stacl/demo/images/step1.png new file mode 100644 index 0000000000000000000000000000000000000000..8d293f7627385c1968f4c498468b3f0db039f9c2 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step1.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step2.png b/examples/simultaneous_translation/stacl/demo/images/step2.png new file mode 100644 index 0000000000000000000000000000000000000000..c922d2c8869220239ebaf43603a1b267514ec746 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step2.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step3.png b/examples/simultaneous_translation/stacl/demo/images/step3.png new file mode 100644 index 0000000000000000000000000000000000000000..89a34a03109122146cf204ef79dbfbd7649b37b7 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step3.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step4.png b/examples/simultaneous_translation/stacl/demo/images/step4.png new file mode 100644 index 0000000000000000000000000000000000000000..b128f390a100050bbcb2445bc9d71c394ece8078 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step4.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step5.png b/examples/simultaneous_translation/stacl/demo/images/step5.png new file mode 100644 index 0000000000000000000000000000000000000000..3f14ae13da41bc19a4cf35c3b71be55a10f09aca Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step5.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step6.png b/examples/simultaneous_translation/stacl/demo/images/step6.png new file mode 100644 index 0000000000000000000000000000000000000000..adad2ff27f59db8db6dcff0b6c2003f1bc5fcd76 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step6.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step7.png b/examples/simultaneous_translation/stacl/demo/images/step7.png new file mode 100644 index 0000000000000000000000000000000000000000..a545dbfd8642508445faccd98bb11a8cfd71d487 Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step7.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/step8.png b/examples/simultaneous_translation/stacl/demo/images/step8.png new file mode 100644 index 0000000000000000000000000000000000000000..1b97efcc67c8a019ed123034facdebb4d103e04e Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/step8.png differ diff --git a/examples/simultaneous_translation/stacl/demo/images/text_demo_show.gif b/examples/simultaneous_translation/stacl/demo/images/text_demo_show.gif new file mode 100644 index 0000000000000000000000000000000000000000..ecbfccf8ffc18d715c0ef6615847ccb01eace24d Binary files /dev/null and b/examples/simultaneous_translation/stacl/demo/images/text_demo_show.gif differ diff --git a/examples/simultaneous_translation/stacl/demo/model_demo.py b/examples/simultaneous_translation/stacl/demo/model_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..6f7a2cc7dfb40226e1c337a53a1dfd9107ecda9e --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/model_demo.py @@ -0,0 +1,108 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys + +import paddle +import paddle.nn.functional as F + +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))) +from model import SimultaneousTransformer # noqa: E402 + + +class SimultaneousTransformerDemo(SimultaneousTransformer): + """ + model + """ + + def greedy_search(self, src_word, max_len=256, waitk=-1, caches=None, bos_id=None): + """ + greedy_search uses streaming reader. It doesn't need calling + encoder many times, an a sub-sentence just needs calling encoder once. + So, it needsprevious state(caches) and last one of generated + tokens id last time. + """ + src_max_len = paddle.shape(src_word)[-1] + base_attn_bias = ( + paddle.cast(src_word == self.bos_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9 + ) + src_slf_attn_bias = base_attn_bias + src_slf_attn_bias.stop_gradient = True + trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, 1, 1]) + src_pos = paddle.cast(src_word != self.bos_id, dtype="int64") * paddle.arange(start=0, end=src_max_len) + src_emb = self.src_word_embedding(src_word) + src_pos_emb = self.src_pos_embedding(src_pos) + src_emb = src_emb + src_pos_emb + enc_input = F.dropout(src_emb, p=self.dropout, training=self.training) if self.dropout else src_emb + enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)] + + # constant number + batch_size = enc_outputs[-1].shape[0] + max_len = (enc_outputs[-1].shape[1] + 20) if max_len is None else max_len + end_token_tensor = paddle.full(shape=[batch_size, 1], fill_value=self.eos_id, dtype="int64") + + predict_ids = [] + log_probs = paddle.full(shape=[batch_size, 1], fill_value=0, dtype="float32") + if not bos_id: + trg_word = paddle.full(shape=[batch_size, 1], fill_value=self.bos_id, dtype="int64") + else: + trg_word = paddle.full(shape=[batch_size, 1], fill_value=bos_id, dtype="int64") + + # init states (caches) for transformer + if not caches: + caches = self.decoder.gen_cache(enc_outputs[-1], do_zip=False) + + for i in range(max_len): + trg_pos = paddle.full(shape=trg_word.shape, fill_value=i, dtype="int64") + trg_emb = self.trg_word_embedding(trg_word) + trg_pos_emb = self.trg_pos_embedding(trg_pos) + trg_emb = trg_emb + trg_pos_emb + dec_input = F.dropout(trg_emb, p=self.dropout, training=self.training) if self.dropout else trg_emb + + if waitk < 0 or i >= len(enc_outputs): + # if the decoder step is full sent or longer than all source + # step, then read the whole src + _e = enc_outputs[-1] + dec_output, caches = self.decoder( + dec_input, [_e], None, trg_src_attn_bias[:, :, :, : _e.shape[1]], caches + ) + else: + _e = enc_outputs[i] + dec_output, caches = self.decoder( + dec_input, [_e], None, trg_src_attn_bias[:, :, :, : _e.shape[1]], caches + ) + + dec_output = paddle.reshape(dec_output, shape=[-1, dec_output.shape[-1]]) + + logits = self.linear(dec_output) + step_log_probs = paddle.log(F.softmax(logits, axis=-1)) + log_probs = paddle.add(x=step_log_probs, y=log_probs) + scores = 
log_probs + topk_scores, topk_indices = paddle.topk(x=scores, k=1) + + finished = paddle.equal(topk_indices, end_token_tensor) + trg_word = topk_indices + log_probs = topk_scores + + predict_ids.append(topk_indices) + + if paddle.all(finished).numpy(): + break + + predict_ids = paddle.stack(predict_ids, axis=0) + finished_seq = paddle.transpose(predict_ids, [1, 2, 0]) + finished_scores = topk_scores + + return finished_seq, finished_scores, caches diff --git a/examples/simultaneous_translation/stacl/demo/requirements.txt b/examples/simultaneous_translation/stacl/demo/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..c29fc68a8e962a6e40c426e110ba5498b9f0063e --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/requirements.txt @@ -0,0 +1,5 @@ +attrdict==2.0.1 +PyYAML==5.4.1 +subword-nmt==0.3.7 +jieba==0.42.1 +websocket-client==1.0.1 diff --git a/examples/simultaneous_translation/stacl/demo/transformer_demo.yaml b/examples/simultaneous_translation/stacl/demo/transformer_demo.yaml new file mode 100644 index 0000000000000000000000000000000000000000..b162c57d3cabf5968e71ae62553df99a51a9db7d --- /dev/null +++ b/examples/simultaneous_translation/stacl/demo/transformer_demo.yaml @@ -0,0 +1,51 @@ +# path of trained parameter, to make prediction +init_from_params: "" +# The path of vocabulary file of source language. +src_vocab_fpath: "nist.20k.zh.vocab" +# The path of vocabulary file of target language. +trg_vocab_fpath: "nist.10k.en.vocab" +# The <bos>, <eos> and <unk> tokens in the dictionary. +special_token: [ "<s>", "<e>", "<unk>" ] + +# Use which device to train or predict(cpu,gpu,xpu) +device: cpu + +# Hyparams for generation: +max_out_len: 256 +# the number of decoded sentences to output. +n_best: 1 + +# Hyparams for model: +# These following five vocabularies related configurations will be set +# automatically according to the passed vocabulary path and special tokens. +# size of source word dictionary. +src_vocab_size: 10000 +# size of target word dictionay +trg_vocab_size: 10000 +# index for <bos> token +bos_idx: 0 +# index for <eos> token +eos_idx: 1 +# index for <unk> token +unk_idx: 2 +# max length of sequences deciding the size of position encoding table. +max_length: 256 +# the dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# number of head used in multi-head attention. +n_head: 8 +# number of sub-layers to be stacked in the encoder and decoder. +n_layer: 6 +# dropout rates. +dropout: 0.1 +# the flag indicating whether to share embedding and softmax weights. +# vocabularies in source and target should be same for weight sharing. 
+weight_sharing: False +# Wait-k policy +waitk: -1 +# Source bpe dict for tokenizer +src_bpe_dict: 2M.zh2en.dict4bpe.zh diff --git a/examples/simultaneous_translation/stacl/images/STACL_architecture.png b/examples/simultaneous_translation/stacl/images/STACL_architecture.png new file mode 100644 index 0000000000000000000000000000000000000000..4246bb20af4f553086edcb66be40d8d2e1bd4175 Binary files /dev/null and b/examples/simultaneous_translation/stacl/images/STACL_architecture.png differ diff --git a/examples/simultaneous_translation/stacl/images/example.png b/examples/simultaneous_translation/stacl/images/example.png new file mode 100644 index 0000000000000000000000000000000000000000..1438cf96dba443384f39f373a070038bbee2f011 Binary files /dev/null and b/examples/simultaneous_translation/stacl/images/example.png differ diff --git a/examples/simultaneous_translation/stacl/model.py b/examples/simultaneous_translation/stacl/model.py new file mode 100644 index 0000000000000000000000000000000000000000..e987178dd87e9ca529b658dad884b50e614ed3b7 --- /dev/null +++ b/examples/simultaneous_translation/stacl/model.py @@ -0,0 +1,314 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function + +import numpy as np + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddlenlp.transformers import WordEmbedding, PositionalEmbedding + + +class CrossEntropyCriterion(nn.Layer): + def __init__(self, label_smooth_eps, pad_idx=0): + super(CrossEntropyCriterion, self).__init__() + self.label_smooth_eps = label_smooth_eps + self.pad_idx = pad_idx + + def forward(self, predict, label): + weights = paddle.cast(label != self.pad_idx, dtype=paddle.get_default_dtype()) + if self.label_smooth_eps: + label = F.label_smooth( + label=F.one_hot(x=label, num_classes=predict.shape[-1]), epsilon=self.label_smooth_eps + ) + + cost = F.cross_entropy( + input=predict, label=label, reduction="none", soft_label=True if self.label_smooth_eps else False + ).squeeze() + weighted_cost = cost * weights + sum_cost = paddle.sum(weighted_cost) + token_num = paddle.sum(weights) + token_num.stop_gradient = True + avg_cost = sum_cost / token_num + return sum_cost, avg_cost, token_num + + +class DecoderLayer(nn.TransformerDecoderLayer): + def __init__(self, *args, **kwargs): + super(DecoderLayer, self).__init__(*args, **kwargs) + + def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None): + residual = tgt + if self.normalize_before: + tgt = self.norm1(tgt) + if cache is None: + tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None) + else: + tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask, cache[0]) + tgt = residual + self.dropout1(tgt) + if not self.normalize_before: + tgt = self.norm1(tgt) + + residual = tgt + if self.normalize_before: + tgt = self.norm2(tgt) + if len(memory) == 1: + # Full sent + tgt = self.cross_attn(tgt, memory[0], memory[0], memory_mask, None) + else: + # Wait-k policy + 
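+                # When called from SimultaneousTransformer.forward, memory[j] holds the encoder
+                # states of the source prefix of length waitk + j, so target step i attends only
+                # to the first waitk + i source tokens and falls back to the full-sentence
+                # states once i exceeds the number of cached prefixes.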
cross_attn_outputs = [] + for i in range(tgt.shape[1]): + q = tgt[:, i : i + 1, :] + if i >= len(memory): + e = memory[-1] + else: + e = memory[i] + cross_attn_outputs.append(self.cross_attn(q, e, e, memory_mask[:, :, i : i + 1, : e.shape[1]], None)) + tgt = paddle.concat(cross_attn_outputs, axis=1) + tgt = residual + self.dropout2(tgt) + if not self.normalize_before: + tgt = self.norm2(tgt) + + residual = tgt + if self.normalize_before: + tgt = self.norm3(tgt) + tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt)))) + tgt = residual + self.dropout3(tgt) + if not self.normalize_before: + tgt = self.norm3(tgt) + return tgt if cache is None else (tgt, (incremental_cache,)) + + +class Decoder(nn.TransformerDecoder): + """ + PaddlePaddle 2.1 casts memory_mask.dtype to memory.dtype, but in STACL, + type of memory is list, having no dtype attribute. + """ + + def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None): + output = tgt + new_caches = [] + for i, mod in enumerate(self.layers): + if cache is None: + output = mod(output, memory, tgt_mask=tgt_mask, memory_mask=memory_mask, cache=None) + else: + output, new_cache = mod(output, memory, tgt_mask=tgt_mask, memory_mask=memory_mask, cache=cache[i]) + new_caches.append(new_cache) + + if self.norm is not None: + output = self.norm(output) + + return output if cache is None else (output, new_caches) + + +class SimultaneousTransformer(nn.Layer): + """ + model + """ + + def __init__( + self, + src_vocab_size, + trg_vocab_size, + max_length, + n_layer, + n_head, + d_model, + d_inner_hid, + dropout, + weight_sharing, + bos_id=0, + eos_id=1, + waitk=-1, + ): + super(SimultaneousTransformer, self).__init__() + self.trg_vocab_size = trg_vocab_size + self.emb_dim = d_model + self.bos_id = bos_id + self.eos_id = eos_id + self.dropout = dropout + self.waitk = waitk + self.n_layer = n_layer + self.n_head = n_head + self.d_model = d_model + + self.src_word_embedding = WordEmbedding(vocab_size=src_vocab_size, emb_dim=d_model, bos_id=self.bos_id) + self.src_pos_embedding = PositionalEmbedding(emb_dim=d_model, max_length=max_length) + if weight_sharing: + assert ( + src_vocab_size == trg_vocab_size + ), "Vocabularies in source and target should be same for weight sharing." 
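+            # Tie the target-side embeddings to the source-side ones; the output projection
+            # below reuses the same weight matrix through a transposed matmul.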
+ self.trg_word_embedding = self.src_word_embedding + self.trg_pos_embedding = self.src_pos_embedding + else: + self.trg_word_embedding = WordEmbedding(vocab_size=trg_vocab_size, emb_dim=d_model, bos_id=self.bos_id) + self.trg_pos_embedding = PositionalEmbedding(emb_dim=d_model, max_length=max_length) + + encoder_layer = nn.TransformerEncoderLayer( + d_model=d_model, + nhead=n_head, + dim_feedforward=d_inner_hid, + dropout=dropout, + activation="relu", + normalize_before=True, + bias_attr=[False, True], + ) + encoder_norm = nn.LayerNorm(d_model) + self.encoder = nn.TransformerEncoder(encoder_layer=encoder_layer, num_layers=n_layer, norm=encoder_norm) + + decoder_layer = DecoderLayer( + d_model=d_model, + nhead=n_head, + dim_feedforward=d_inner_hid, + dropout=dropout, + activation="relu", + normalize_before=True, + bias_attr=[False, False, True], + ) + decoder_norm = nn.LayerNorm(d_model) + self.decoder = Decoder(decoder_layer=decoder_layer, num_layers=n_layer, norm=decoder_norm) + + if weight_sharing: + self.linear = lambda x: paddle.matmul( + x=x, y=self.trg_word_embedding.word_embedding.weight, transpose_y=True + ) + else: + self.linear = nn.Linear(in_features=d_model, out_features=trg_vocab_size, bias_attr=False) + + def forward(self, src_word, trg_word): + src_max_len = paddle.shape(src_word)[-1] + trg_max_len = paddle.shape(trg_word)[-1] + base_attn_bias = ( + paddle.cast(src_word == self.bos_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9 + ) + src_slf_attn_bias = base_attn_bias + src_slf_attn_bias.stop_gradient = True + trg_slf_attn_bias = paddle.tensor.triu( + (paddle.ones((trg_max_len, trg_max_len), dtype=paddle.get_default_dtype()) * -np.inf), 1 + ) + trg_slf_attn_bias.stop_gradient = True + trg_src_attn_bias = paddle.tile(base_attn_bias, [1, 1, trg_max_len, 1]) + src_pos = paddle.cast(src_word != self.bos_id, dtype="int64") * paddle.arange(start=0, end=src_max_len) + trg_pos = paddle.cast(trg_word != self.bos_id, dtype="int64") * paddle.arange(start=0, end=trg_max_len) + src_emb = self.src_word_embedding(src_word) + src_pos_emb = self.src_pos_embedding(src_pos) + src_emb = src_emb + src_pos_emb + enc_input = F.dropout(src_emb, p=self.dropout, training=self.training) if self.dropout else src_emb + with paddle.static.amp.fp16_guard(): + if self.waitk >= src_max_len or self.waitk == -1: + # Full sentence + enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)] + else: + # Wait-k policy + enc_outputs = [] + for i in range(self.waitk, src_max_len + 1): + enc_output = self.encoder(enc_input[:, :i, :], src_mask=src_slf_attn_bias[:, :, :, :i]) + enc_outputs.append(enc_output) + + trg_emb = self.trg_word_embedding(trg_word) + trg_pos_emb = self.trg_pos_embedding(trg_pos) + trg_emb = trg_emb + trg_pos_emb + dec_input = F.dropout(trg_emb, p=self.dropout, training=self.training) if self.dropout else trg_emb + dec_output = self.decoder( + dec_input, enc_outputs, tgt_mask=trg_slf_attn_bias, memory_mask=trg_src_attn_bias + ) + + predict = self.linear(dec_output) + + return predict + + def beam_search(self, src_word, beam_size=4, max_len=256, waitk=-1): + # TODO: "Speculative Beam Search for Simultaneous Translation" + raise NotImplementedError + + def greedy_search(self, src_word, max_len=256, waitk=-1): + src_max_len = paddle.shape(src_word)[-1] + base_attn_bias = ( + paddle.cast(src_word == self.bos_id, dtype=paddle.get_default_dtype()).unsqueeze([1, 2]) * -1e9 + ) + src_slf_attn_bias = base_attn_bias + src_slf_attn_bias.stop_gradient = True + trg_src_attn_bias = 
paddle.tile(base_attn_bias, [1, 1, 1, 1]) + src_pos = paddle.cast(src_word != self.bos_id, dtype="int64") * paddle.arange(start=0, end=src_max_len) + src_emb = self.src_word_embedding(src_word) + src_pos_emb = self.src_pos_embedding(src_pos) + src_emb = src_emb + src_pos_emb + enc_input = F.dropout(src_emb, p=self.dropout, training=self.training) if self.dropout else src_emb + if waitk < 0 or waitk > src_max_len: + enc_outputs = [self.encoder(enc_input, src_mask=src_slf_attn_bias)] + else: + enc_outputs = [] + for i in range(waitk, src_max_len + 1): + enc_output = self.encoder(enc_input[:, :i, :], src_mask=src_slf_attn_bias[:, :, :, :i]) + enc_outputs.append(enc_output) + + # constant number + batch_size = enc_outputs[-1].shape[0] + max_len = (enc_outputs[-1].shape[1] + 20) if max_len is None else max_len + end_token_tensor = paddle.full(shape=[batch_size, 1], fill_value=self.eos_id, dtype="int64") + + predict_ids = [] + log_probs = paddle.full(shape=[batch_size, 1], fill_value=0, dtype="float32") + trg_word = paddle.full(shape=[batch_size, 1], fill_value=self.bos_id, dtype="int64") + + # init states (caches) for transformer + caches = self.decoder.gen_cache(enc_outputs[-1], do_zip=False) + + for i in range(max_len): + trg_pos = paddle.full(shape=trg_word.shape, fill_value=i, dtype="int64") + trg_emb = self.trg_word_embedding(trg_word) + trg_pos_emb = self.trg_pos_embedding(trg_pos) + trg_emb = trg_emb + trg_pos_emb + dec_input = F.dropout(trg_emb, p=self.dropout, training=self.training) if self.dropout else trg_emb + + if waitk < 0 or i >= len(enc_outputs): + # Avoid getting the whole source in advance, a diff from: + # https://github.com/autosimtrans/SimulTransBaseline/blob/master/model.py#L1207 + # if the decoder step is full sent or longer than all source + # step, then read the whole src + _e = enc_outputs[-1] + dec_output, caches = self.decoder( + dec_input, [_e], None, trg_src_attn_bias[:, :, :, : _e.shape[1]], caches + ) + else: + _e = enc_outputs[i] + dec_output, caches = self.decoder( + dec_input, [_e], None, trg_src_attn_bias[:, :, :, : _e.shape[1]], caches + ) + + dec_output = paddle.reshape(dec_output, shape=[-1, dec_output.shape[-1]]) + + logits = self.linear(dec_output) + step_log_probs = paddle.log(F.softmax(logits, axis=-1)) + log_probs = paddle.add(x=step_log_probs, y=log_probs) + scores = log_probs + topk_scores, topk_indices = paddle.topk(x=scores, k=1) + + finished = paddle.equal(topk_indices, end_token_tensor) + trg_word = topk_indices + log_probs = topk_scores + + predict_ids.append(topk_indices) + + if paddle.all(finished).numpy(): + break + + predict_ids = paddle.stack(predict_ids, axis=0) + finished_seq = paddle.transpose(predict_ids, [1, 2, 0]) + finished_scores = topk_scores + + return finished_seq, finished_scores diff --git a/examples/simultaneous_translation/stacl/predict.py b/examples/simultaneous_translation/stacl/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..8f2e3da9e404389ac00ee3bae635a26a7f056012 --- /dev/null +++ b/examples/simultaneous_translation/stacl/predict.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse +from pprint import pprint +import yaml +from attrdict import AttrDict + +import paddle +from paddlenlp.transformers import position_encoding_init +import reader +from model import SimultaneousTransformer + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="./config/transformer.yaml", type=str, help="Path of the config file. ") + args = parser.parse_args() + return args + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + if args.device == "gpu": + place = "gpu:0" + elif args.device == "xpu": + place = "xpu:0" + elif args.device == "cpu": + place = "cpu" + + paddle.set_device(place) + + # Define data loader + test_loader, to_tokens = reader.create_infer_loader(args) + + # Define model + transformer = SimultaneousTransformer( + args.src_vocab_size, + args.trg_vocab_size, + args.max_length + 1, + args.n_layer, + args.n_head, + args.d_model, + args.d_inner_hid, + args.dropout, + args.weight_sharing, + args.bos_idx, + args.eos_idx, + args.waitk, + ) + + # Load the trained model + assert args.init_from_params, "Please set init_from_params to load the infer model." 
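+    # The checkpoint directory is expected to contain transformer.pdparams, e.g. one of the
+    # step_<N> directories that train.py writes under save_model.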
+ + model_dict = paddle.load(os.path.join(args.init_from_params, "transformer.pdparams")) + + # To avoid a longer length than training, reset the size of position + # encoding to max_length + model_dict["src_pos_embedding.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + model_dict["trg_pos_embedding.pos_encoder.weight"] = position_encoding_init(args.max_length + 1, args.d_model) + + transformer.load_dict(model_dict) + + # Set evaluate mode + transformer.eval() + + f = open(args.output_file, "w", encoding="utf8") + + with paddle.no_grad(): + for input_data in test_loader: + (src_word,) = input_data + + finished_seq, finished_scores = transformer.greedy_search( + src_word, max_len=args.max_out_len, waitk=args.waitk + ) + finished_seq = finished_seq.numpy() + finished_scores = finished_scores.numpy() + for idx, ins in enumerate(finished_seq): + for beam_idx, beam in enumerate(ins): + if beam_idx >= args.n_best: + break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + word_list = to_tokens(id_list) + sequence = " ".join(word_list) + "\n" + f.write(sequence) + f.close() + + +if __name__ == "__main__": + args = parse_args() + yaml_file = args.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + + do_predict(args) diff --git a/examples/simultaneous_translation/stacl/reader.py b/examples/simultaneous_translation/stacl/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..cb71b2bfb212c202311626d5755baba7f8746835 --- /dev/null +++ b/examples/simultaneous_translation/stacl/reader.py @@ -0,0 +1,194 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +from paddle.io import DataLoader +from paddlenlp.data import Vocab, Pad +from paddlenlp.data.sampler import SamplerHelper +from paddlenlp.datasets import load_dataset + + +def read(src_tgt_file, only_src=False): + with open(src_tgt_file, "r", encoding="utf8") as src_tgt_f: + for line in src_tgt_f: + line = line.strip("\n") + if not line: + continue + line_split = line.split("\t") + if only_src: + yield {"src": line_split[0]} + else: + if len(line_split) != 2: + continue + yield {"src": line_split[0], "trg": line_split[1]} + + +def min_max_filer(data, max_len, min_len=0): + # 1 for special tokens. 
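+    # Keep a (src, trg) pair only if both lengths, counting the appended special token,
+    # fall inside [min_len, max_len].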
+ data_min_len = min(len(data[0]), len(data[1])) + 1 + data_max_len = max(len(data[0]), len(data[1])) + 1 + return (data_min_len >= min_len) and (data_max_len <= max_len) + + +def create_data_loader(args, places=None): + data_files = {"train": args.training_file, "dev": args.validation_file} + + datasets = [load_dataset(read, src_tgt_file=filename, lazy=False) for split, filename in data_files.items()] + + src_vocab = Vocab.load_vocabulary( + args.src_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + trg_vocab = Vocab.load_vocabulary( + args.trg_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + + args.src_vocab_size = len(src_vocab) + args.trg_vocab_size = len(trg_vocab) + + def convert_samples(sample): + source = [item.strip() for item in sample["src"].split()] + target = [item.strip() for item in sample["trg"].split()] + + source = src_vocab.to_indices(source) + [args.eos_idx] + target = [args.bos_idx] + trg_vocab.to_indices(target) + [args.eos_idx] + + return source, target + + data_loaders = [(None)] * 2 + for i, dataset in enumerate(datasets): + dataset = dataset.map(convert_samples, lazy=False).filter(partial(min_max_filer, max_len=args.max_length)) + + sampler = SamplerHelper(dataset) + + if args.sort_type == SortType.GLOBAL: + src_key = lambda x, data_source: len(data_source[x][0]) + trg_key = lambda x, data_source: len(data_source[x][1]) + # Sort twice + sampler = sampler.sort(key=trg_key).sort(key=src_key) + else: + if args.shuffle: + sampler = sampler.shuffle(seed=args.random_seed) + max_key = lambda x, data_source: max(len(data_source[x][0]), len(data_source[x][1])) + if args.sort_type == SortType.POOL: + sampler = sampler.sort(key=max_key, buffer_size=args.pool_size) + + batch_size_fn = lambda new, count, sofar, data_source: max( + sofar, len(data_source[new][0]), len(data_source[new][1]) + ) + batch_sampler = sampler.batch( + batch_size=args.batch_size, + drop_last=False, + batch_size_fn=batch_size_fn, + key=lambda size_so_far, minibatch_len: size_so_far * minibatch_len, + ) + + if args.shuffle_batch: + batch_sampler = batch_sampler.shuffle(seed=args.random_seed) + + if i == 0: + batch_sampler = batch_sampler.shard() + + data_loader = DataLoader( + dataset=dataset, + places=places, + batch_sampler=batch_sampler, + collate_fn=partial(prepare_train_input, pad_idx=args.bos_idx), + num_workers=0, + ) + + data_loaders[i] = data_loader + + return data_loaders + + +def create_infer_loader(args, places=None): + data_files = { + "test": args.predict_file, + } + dataset = load_dataset(read, src_tgt_file=data_files["test"], only_src=True, lazy=False) + + src_vocab = Vocab.load_vocabulary( + args.src_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + + trg_vocab = Vocab.load_vocabulary( + args.trg_vocab_fpath, + bos_token=args.special_token[0], + eos_token=args.special_token[1], + unk_token=args.special_token[2], + ) + + args.src_vocab_size = len(src_vocab) + args.trg_vocab_size = len(trg_vocab) + + def convert_samples(sample): + source = [item.strip() for item in sample["src"].split()] + source = src_vocab.to_indices(source) + [args.eos_idx] + target = [args.bos_idx] + return source, target + + dataset = dataset.map(convert_samples, lazy=False) + + batch_sampler = SamplerHelper(dataset).batch(batch_size=args.batch_size, drop_last=False) + + data_loader = DataLoader( + 
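+        # prepare_infer_input pads each batch with bos_idx; the matching trg_vocab.to_tokens
+        # callable is returned alongside the loader for turning predicted ids back into tokens.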
dataset=dataset, + places=places, + batch_sampler=batch_sampler, + collate_fn=partial(prepare_infer_input, pad_idx=args.bos_idx), + num_workers=0, + return_list=True, + ) + + return data_loader, trg_vocab.to_tokens + + +def prepare_train_input(insts, pad_idx): + """ + Put all padded data needed by training into a list. + """ + word_pad = Pad(pad_idx) + src_word = word_pad([inst[0] for inst in insts]) + trg_word = word_pad([inst[1][:-1] for inst in insts]) + lbl_word = word_pad([inst[1][1:] for inst in insts]) + data_inputs = [src_word, trg_word, lbl_word] + + return data_inputs + + +def prepare_infer_input(insts, pad_idx): + """ + Put all padded data needed by beam search decoder into a list. + """ + word_pad = Pad(pad_idx) + src_word = word_pad([inst[0] for inst in insts]) + + return [ + src_word, + ] + + +class SortType(object): + GLOBAL = "global" + POOL = "pool" + NONE = "none" diff --git a/examples/simultaneous_translation/stacl/requirements.txt b/examples/simultaneous_translation/stacl/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..d4fa621ca5572b3e00ffd1bda3266a546a580225 --- /dev/null +++ b/examples/simultaneous_translation/stacl/requirements.txt @@ -0,0 +1,4 @@ +attrdict==2.0.1 +PyYAML==5.4.1 +subword_nmt==0.3.7 +jieba==0.42.1 diff --git a/examples/simultaneous_translation/stacl/train.py b/examples/simultaneous_translation/stacl/train.py new file mode 100644 index 0000000000000000000000000000000000000000..09ecb03001a9ce332ed520b609758f137a562d29 --- /dev/null +++ b/examples/simultaneous_translation/stacl/train.py @@ -0,0 +1,249 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time + +import argparse +from pprint import pprint +import numpy as np +import yaml +from attrdict import AttrDict + +import paddle +import paddle.distributed as dist +from paddlenlp.utils.log import logger + +import reader +from model import SimultaneousTransformer, CrossEntropyCriterion +from utils.record import AverageStatistical + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="./config/transformer.yaml", type=str, help="Path of the config file. 
") + args = parser.parse_args() + return args + + +def do_train(args): + paddle.set_device(args.device) + trainer_count = dist.get_world_size() + rank = dist.get_rank() + + if trainer_count > 1: + dist.init_parallel_env() + + # Set seed for CE + random_seed = eval(str(args.random_seed)) + if random_seed is not None: + paddle.seed(random_seed) + + # Define data loader + (train_loader), (eval_loader) = reader.create_data_loader(args, places=paddle.get_device()) + + # Define model + transformer = SimultaneousTransformer( + args.src_vocab_size, + args.trg_vocab_size, + args.max_length + 1, + args.n_layer, + args.n_head, + args.d_model, + args.d_inner_hid, + args.dropout, + args.weight_sharing, + args.bos_idx, + args.eos_idx, + args.waitk, + ) + + print("waitk=", args.waitk) + + # Define loss + criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx) + + # Define optimizer + scheduler = paddle.optimizer.lr.NoamDecay(args.d_model, args.warmup_steps, args.learning_rate) + + optimizer = paddle.optimizer.Adam( + learning_rate=scheduler, + beta1=args.beta1, + beta2=args.beta2, + epsilon=float(args.eps), + parameters=transformer.parameters(), + ) + + # Init from some checkpoint, to resume the previous training + if args.init_from_checkpoint: + model_dict = paddle.load(os.path.join(args.init_from_checkpoint, "transformer.pdparams")) + opt_dict = paddle.load(os.path.join(args.init_from_checkpoint, "transformer.pdopt")) + transformer.set_state_dict(model_dict) + optimizer.set_state_dict(opt_dict) + print("loaded from checkpoint.") + # Init from some pretrain models, to better solve the current task + if args.init_from_pretrain_model: + model_dict = paddle.load(os.path.join(args.init_from_pretrain_model, "transformer.pdparams")) + transformer.set_state_dict(model_dict) + print("loaded from pre-trained model.") + + if trainer_count > 1: + transformer = paddle.DataParallel(transformer) + + # The best cross-entropy value with label smoothing + loss_normalizer = -( + (1.0 - args.label_smooth_eps) * np.log((1.0 - args.label_smooth_eps)) + + args.label_smooth_eps * np.log(args.label_smooth_eps / (args.trg_vocab_size - 1) + 1e-20) + ) + + step_idx = 0 + + # For logging + reader_cost_avg = AverageStatistical() + batch_cost_avg = AverageStatistical() + batch_ips_avg = AverageStatistical() + + # Train loop + for pass_id in range(args.epoch): + epoch_start = time.time() + batch_id = 0 + batch_start = time.time() + for input_data in train_loader: + train_reader_cost = time.time() - batch_start + (src_word, trg_word, lbl_word) = input_data + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + with paddle.amp.auto_cast(): + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + scaled_loss = scaler.scale(avg_cost) # scale the loss + scaled_loss.backward() # do backward + + scaler.minimize(optimizer, scaled_loss) # update parameters + optimizer.clear_grad() + else: + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + + avg_cost.backward() + + optimizer.step() + optimizer.clear_grad() + + if args.max_iter and step_idx + 1 == args.max_iter: + return + + tokens_per_cards = token_num.numpy() + + train_batch_cost = time.time() - batch_start + reader_cost_avg.record(train_reader_cost) + batch_cost_avg.record(train_batch_cost) + batch_ips_avg.record(train_batch_cost, tokens_per_cards) + + if step_idx % args.print_step == 0: + 
total_avg_cost = avg_cost.numpy() + if step_idx == 0: + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f " + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + ) + ) + else: + train_avg_batch_cost = args.print_step / batch_cost_avg.get_total_time() + logger.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f, avg_speed: %.2f step/sec, " + "batch_cost: %.5f sec, reader_cost: %.5f sec, tokens: %d, " + "ips: %.5f words/sec" + % ( + step_idx, + pass_id, + batch_id, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + train_avg_batch_cost, + batch_cost_avg.get_average(), + reader_cost_avg.get_average(), + batch_ips_avg.get_total_cnt(), + batch_ips_avg.get_average_per_sec(), + ) + ) + reader_cost_avg.reset() + batch_cost_avg.reset() + batch_ips_avg.reset() + + if step_idx % args.save_step == 0 and step_idx != 0: + # Validation + transformer.eval() + total_sum_cost = 0 + total_token_num = 0 + with paddle.no_grad(): + for input_data in eval_loader: + (src_word, trg_word, lbl_word) = input_data + logits = transformer(src_word=src_word, trg_word=trg_word) + sum_cost, avg_cost, token_num = criterion(logits, lbl_word) + total_sum_cost += sum_cost.numpy() + total_token_num += token_num.numpy() + total_avg_cost = total_sum_cost / total_token_num + logger.info( + "validation, step_idx: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f" + % ( + step_idx, + total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + ) + ) + transformer.train() + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_" + str(step_idx)) + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(transformer.state_dict(), os.path.join(model_dir, "transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "transformer.pdopt")) + + batch_id += 1 + step_idx += 1 + scheduler.step() + batch_start = time.time() + + train_epoch_cost = time.time() - epoch_start + logger.info("train epoch: %d, epoch_cost: %.5f s" % (pass_id, train_epoch_cost)) + + if args.save_model and rank == 0: + model_dir = os.path.join(args.save_model, "step_final") + if not os.path.exists(model_dir): + os.makedirs(model_dir) + paddle.save(transformer.state_dict(), os.path.join(model_dir, "transformer.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(model_dir, "transformer.pdopt")) + + +if __name__ == "__main__": + args = parse_args() + yaml_file = args.config + with open(yaml_file, "rt") as f: + args = AttrDict(yaml.safe_load(f)) + pprint(args) + do_train(args) diff --git a/examples/simultaneous_translation/stacl/utils/__init__.py b/examples/simultaneous_translation/stacl/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/examples/simultaneous_translation/stacl/utils/record.py b/examples/simultaneous_translation/stacl/utils/record.py new file mode 100644 index 0000000000000000000000000000000000000000..1147f0d433f246a3a07503f2e7dd29801677a538 --- /dev/null +++ b/examples/simultaneous_translation/stacl/utils/record.py @@ -0,0 +1,44 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +class AverageStatistical(object): + def __init__(self): + self.reset() + + def reset(self): + self.total_cnt = 0 + self.time = 0 + + def record(self, val, cnt=1): + self.time += val + self.total_cnt += cnt + + def get_average(self): + if self.total_cnt == 0: + return 0 + + return self.time / self.total_cnt + + def get_average_per_sec(self): + if self.time == 0.0: + return 0.0 + + return float(self.total_cnt) / self.time + + def get_total_cnt(self): + return self.total_cnt + + def get_total_time(self): + return self.time diff --git a/examples/simultaneous_translation/stacl/utils/tokenizer.py b/examples/simultaneous_translation/stacl/utils/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..010d91dea922ecd7b1797283cb5384b1798e8dd9 --- /dev/null +++ b/examples/simultaneous_translation/stacl/utils/tokenizer.py @@ -0,0 +1,47 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import _locale +import jieba +from subword_nmt import subword_nmt + +# By default, the Windows system opens the file with GBK code, +# and the subword_nmt package does not support setting open encoding, +# so it is set to UTF-8 uniformly. 
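+# Note that this monkey-patch changes the default encoding seen by open() for the whole
+# process, not only for this module.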
+_locale._getdefaultlocale = lambda *args: ["en_US", "utf8"]
+
+
+class STACLTokenizer:
+    def __init__(self, bpe_dict, is_chinese):
+        bpe_parser = subword_nmt.create_apply_bpe_parser()
+        bpe_args = bpe_parser.parse_args(args=["-c", bpe_dict])
+        self.bpe = subword_nmt.BPE(bpe_args.codes, bpe_args.merges, bpe_args.separator, None, bpe_args.glossaries)
+        self.is_chinese = is_chinese
+
+    def tokenize(self, raw_string):
+        """
+        Tokenize string (jieba + BPE for Chinese, BPE otherwise)
+        """
+        raw_string = raw_string.strip("\n")
+        if not raw_string:
+            return raw_string
+        if self.is_chinese:
+            raw_string = " ".join(jieba.cut(raw_string))
+        bpe_str = self.bpe.process_line(raw_string)
+        return " ".join(bpe_str.split())
+
+
+if __name__ == "__main__":
+    tokenizer_zh = STACLTokenizer("data/nist2m/2M.zh2en.dict4bpe.zh", is_chinese=True)
+    print(tokenizer_zh.tokenize("玻利维亚举行总统与国会选举"))
diff --git a/examples/text_classification/README.md b/examples/text_classification/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f02dae18865f80e7671de0bd305c02540d901d11
--- /dev/null
+++ b/examples/text_classification/README.md
@@ -0,0 +1,15 @@
+# 文本分类
+
+本目录提供了多个文本分类任务示例,涵盖基于ERNIE 3.0预训练模型、传统序列模型以及ERNIE-Doc超长文本预训练模型的文本分类。
+
+## Pretrained Models (PTMs)
+
+[Pretrained Models](./pretrained_models) 展示了如何使用以ERNIE 3.0 为代表的预训练模型,在多分类、多标签、层次分类场景下,基于预训练模型微调、提示学习(小样本)、语义索引等三种不同方案进行文本分类。预训练模型文本分类打通数据标注-模型训练-模型调优-模型压缩-预测部署全流程,旨在解决细分场景应用的痛点和难点,快速实现文本分类产品落地。
+
+## RNN Models
+
+[Recurrent Neural Networks](./rnn) 展示了如何使用传统序列模型RNN、LSTM、GRU等网络完成文本分类任务。
+
+## ERNIE-Doc Text Classification
+
+[ERNIE-Doc Text Classification](./ernie_doc) 展示了如何使用预训练模型ERNIE-Doc完成**超长文本**分类任务。
diff --git a/examples/text_classification/ernie_doc/README.md b/examples/text_classification/ernie_doc/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7f16c92c585f93e4f48ee1a4da4d222b5a205ce3
--- /dev/null
+++ b/examples/text_classification/ernie_doc/README.md
@@ -0,0 +1,86 @@
+# ERNIE-Doc 在iflytek数据集上的使用
+
+## 简介
+
+本示例将使用ERNIE-Doc模型,演示如何在长文本数据集(如iflytek)上完成分类任务的训练、预测以及动转静过程。以下是本例的简要目录结构及说明:
+
+```shell
+.
+├── LICENSE +├── README.md #文档 +├── data.py #数据处理 +├── export_model.py #将动态图参数导出成静态图参数 +├── metrics.py #ERNIE-Doc下游任务指标 +├── modeling.py #ERNIE-Doc模型实现(针对实现静态图修改) +├── predict.py #分类任务预测脚本(包括动态图预测和动转静) +└── train.py #分类任务训练脚本(包括数据下载,模型导出和测试集结果导出) +``` + +## 快速开始 + +### 通用参数释义 + +除[ERNIE_DOC](../../../model_zoo/ernie-doc/README.md) +展示的通用参数之外,本例还有如下参数: + +- `static_mode` 在 `predict.py` 表示是否使用静态图进行预测。 +- `test_results_file` 在`train.py`和`predict.py`中表示测试集预测结果所存储的地址,默认为`./test_restuls.json`。 +- `static_path` 在`export_model.py`和`predict.py`中表示要将转化完成的静态图存储的地址,如果改地址已经有静态图模型参数,`predict.py` + 会直接读取该模型参数,而`export_model.py`会覆盖掉该模型参数。默认路径为`{HOME}/.paddlenlp/static/inference`。 + +### 分类任务训练 + +iflytek的数据示例如下: + +```shell +{"label": "110", "label_des": "社区超市", "sentence": "朴朴快送超市创立于2016年,专注于打造移动端30分钟即时配送一站式购物平台,商品品类包含水果、蔬菜、肉禽蛋奶、海鲜水产、粮油调味、酒水饮料、休闲食品、日用品、外卖等。朴朴公司希望能以全新的商业模式,更高效快捷的仓储配送模式,致力于成为更快、更好、更多、更省的在线零售平台,带给消费者更好的消费体验,同时推动中国食品安全进程,成为一家让社会尊敬的互联网公司。,朴朴一下,又好又快,1.配送时间提示更加清晰友好2.保障用户隐私的一些优化3.其他提高使用体验的调整4.修复了一些已知bug"} +``` + +该数据集共有1.7万多条关于app应用描述的长文本标注数据,包含和日常生活相关的各类应用主题,共119个类别。 使用训练脚本 + +```shell +python train.py --batch_size 4 \ + --model_name_or_path ernie-doc-base-zh \ + --epoch 5 \ + --output_dir ./checkpoints/ +``` + +根据通用参数释义可自行更改训练超参数和模型保存地址。 + +### 模型导出和预测 + +可以使用模型导出脚本将动态图模型转化成静态图: + +```shell +python export_model.py --batch_size 16 \ + --model_name_or_path finetuned_model \ + --max_seq_lenght 512 \ + --memory_length 128 \ + --static_path ./my_static_model/ +``` + +也可以直接使用预测脚本将`static_mode`设为True (设置成False则使用动态图预测),直接完成转化静态图和使用静态图预测的步骤: + +```shell +python predict.py --static_mode True \ + --dataset iflytek \ + --batch_size 16 \ + --model_name_or_path finetuned_model \ + --max_seq_lenght 512 \ + --memory_length 128 \ + --static_path ./my_static_model/ \ + --test_results_file ./test_results.json +``` + +模型输出的`test_results_file`示例: + +```shell +{"id": "2590", "label": "70"} +{"id": "2591", "label": "91"} +{"id": "2592", "label": "20"} +{"id": "2593", "label": "28"} +{"id": "2594", "label": "95"} +{"id": "2595", "label": "116"} +{"id": "2596", "label": "59"} +{"id": "2597", "label": "22"} +``` diff --git a/examples/text_classification/ernie_doc/__init__.py b/examples/text_classification/ernie_doc/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/examples/text_classification/ernie_doc/data.py b/examples/text_classification/ernie_doc/data.py new file mode 100644 index 0000000000000000000000000000000000000000..85409d9e84e06da03633bbb8b9840e35bc8f82e6 --- /dev/null +++ b/examples/text_classification/ernie_doc/data.py @@ -0,0 +1,1208 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
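+# This module prepares the ERNIE-Doc classification inputs: get_related_pos builds the
+# relative position ids used together with the recurrence memory, pad_batch_data pads a
+# batch and produces its attention mask, and ClassifierIterator slices every long document
+# into spans of at most max_seq_length tokens with a memory_len stride so the spans can be
+# fed to the model in reading order.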
+import itertools +import json +from collections import namedtuple + +import numpy as np +from paddle.utils import try_import + +from paddlenlp.transformers import tokenize_chinese_chars +from paddlenlp.utils.log import logger + + +def get_related_pos(insts, seq_len, memory_len=128): + """generate relative postion ids""" + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype("int64").reshape([len(insts), beg, 1]) + + +def pad_batch_data( + insts, + insts_data_type="int64", + pad_idx=0, + final_cls=False, + pad_max_len=None, + return_pos=False, + return_input_mask=False, + return_max_len=False, + return_num_token=False, + return_seq_lens=False, +): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and attention bias. + """ + return_list = [] + if pad_max_len: + max_len = pad_max_len + else: + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + + # Input id + if final_cls: + inst_data = np.array([inst[:-1] + list([pad_idx] * (max_len - len(inst))) + [inst[-1]] for inst in insts]) + else: + inst_data = np.array([inst + list([pad_idx] * (max_len - len(inst))) for inst in insts]) + return_list += [inst_data.astype(insts_data_type).reshape([-1, max_len, 1])] + + # Position id + if return_pos: + inst_pos = np.array([list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts]) + + return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])] + + if return_input_mask: + # This is used to avoid attention on paddings. 
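+        # With final_cls=True the last id of every instance is the trailing [CLS] token, so it
+        # keeps mask 1 while the padding inserted before it gets mask 0.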
+ if final_cls: + input_mask_data = np.array([[1] * len(inst[:-1]) + [0] * (max_len - len(inst)) + [1] for inst in insts]) + else: + input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts]) + input_mask_data = np.expand_dims(input_mask_data, axis=-1) + return_list += [input_mask_data.astype("float32")] + + if return_max_len: + return_list += [max_len] + + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + + if return_seq_lens: + seq_lens_type = [-1] + seq_lens = np.array([len(inst) for inst in insts]) + return_list += [seq_lens.astype("int64").reshape(seq_lens_type)] + + return return_list if len(return_list) > 1 else return_list[0] + + +class TextPreprocessor(object): + def __call__(self, text): + raise NotImplementedError("TextPreprocessor object can't be called") + + +class ImdbTextPreprocessor(TextPreprocessor): + def __call__(self, text): + text = text.strip().replace("<br /><br />", " ") + text = text.replace("\t", "") + return text + + +class HYPTextPreprocessor(TextPreprocessor): + def __init__(self): + self.bs4 = try_import("bs4") + + def __call__(self, text): + text = self.bs4.BeautifulSoup(text, "html.parser").get_text() + text = text.strip().replace("\n", "").replace("\t", "") + return text + + +class ClassifierIterator(object): + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + preprocess_text_fn=None, + ): + self.batch_size = batch_size + self.tokenizer = tokenizer + self.trainer_num = trainer_num + self.trainer_id = trainer_id + self.max_seq_length = max_seq_length + self.memory_len = memory_len + self.repeat_input = repeat_input + self.in_tokens = in_tokens + self.dataset = [data for data in dataset] + self.num_examples = None + self.mode = mode + self.shuffle = True if mode == "train" else False + if random_seed is None: + random_seed = 12345 + self.random_seed = random_seed + self.preprocess_text_fn = preprocess_text_fn + + def shuffle_sample(self): + if self.shuffle: + self.global_rng = np.random.RandomState(self.random_seed) + self.global_rng.shuffle(self.dataset) + + def _cnt_list(self, inp): + """Cnt_list""" + cnt = 0 + for lit in inp: + if lit: + cnt += 1 + return cnt + + def _convert_to_features(self, example, qid): + """ + Convert example to features fed into model + """ + if "text" in example: # imdb + text = example["text"] + elif "sentence" in example: # iflytek + text = example["sentence"] + + if self.preprocess_text_fn: + text = self.preprocess_text_fn(text) + if "label" in example: + label = example["label"] + else: + label = "-1" + doc_spans = [] + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + start_offset = 0 + max_tokens_for_doc = self.max_seq_length - 2 + tokens_a = self.tokenizer.tokenize(text) + while start_offset < len(tokens_a): + length = len(tokens_a) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(tokens_a): + break + start_offset += min(length, self.memory_len) + + features = [] + Feature = namedtuple("Feature", ["src_ids", "label_id", "qid", "cal_loss"]) + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = tokens_a[doc_span.start : doc_span.start + doc_span.length] + ["[SEP]"] + ["[CLS]"] + token_ids = 
self.tokenizer.convert_tokens_to_ids(tokens) + features.append(Feature(src_ids=token_ids, label_id=label, qid=qid, cal_loss=1)) + + if self.repeat_input: + features_repeat = features + features = list(map(lambda x: x._replace(cal_loss=0), features)) + features = features + features_repeat + return features + + def _get_samples(self, pre_batch_list, is_last=False): + if is_last: + # Pad batch + len_doc = [len(doc) for doc in pre_batch_list] + max_len_idx = len_doc.index(max(len_doc)) + dirty_sample = pre_batch_list[max_len_idx][-1]._replace(cal_loss=0) + for sample_list in pre_batch_list: + sample_list.extend([dirty_sample] * (max(len_doc) - len(sample_list))) + + samples = [] + min_len = min([len(doc) for doc in pre_batch_list]) + for cnt in range(min_len): + for batch_idx in range(self.batch_size * self.trainer_num): + sample = pre_batch_list[batch_idx][cnt] + samples.append(sample) + + for idx in range(len(pre_batch_list)): + pre_batch_list[idx] = pre_batch_list[idx][min_len:] + return samples + + def _pad_batch_records(self, batch_records, gather_idx=[]): + batch_token_ids = [record.src_ids for record in batch_records] + if batch_records[0].label_id is not None: + batch_labels = [record.label_id for record in batch_records] + batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + if batch_records[-1].qid is not None: + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + else: + batch_qids = np.array([]).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + # Padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + final_cls=True, + return_input_mask=True, + ) + padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64") + padded_position_ids = get_related_pos(padded_token_ids, self.max_seq_length, self.memory_len) + + return_list = [ + padded_token_ids, + padded_position_ids, + padded_task_ids, + input_mask, + batch_labels, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + return return_list + + def _prepare_batch_data(self, examples): + batch_records, max_len, gather_idx = [], 0, [] + for index, example in enumerate(examples): + max_len = max(max_len, len(example.src_ids)) + if self.in_tokens: + to_append = (len(batch_records) + 1) * max_len <= self.batch_size + else: + to_append = len(batch_records) < self.batch_size + if to_append: + batch_records.append(example) + if example.cal_loss == 1: + gather_idx.append(index % self.batch_size) + else: + yield self._pad_batch_records(batch_records, gather_idx) + batch_records, max_len = [example], len(example.src_ids) + gather_idx = [index % self.batch_size] if example.cal_loss == 1 else [] + yield self._pad_batch_records(batch_records, gather_idx) + + def _create_instances(self): + examples = self.dataset + pre_batch_list = [] + insert_idx = [] + for qid, example in enumerate(examples): + features = self._convert_to_features(example, qid) + if self._cnt_list(pre_batch_list) < self.batch_size * self.trainer_num: + if insert_idx: + pre_batch_list[insert_idx[0]] = features + insert_idx.pop(0) + else: + 
pre_batch_list.append(features) + if self._cnt_list(pre_batch_list) == self.batch_size * self.trainer_num: + assert self._cnt_list(pre_batch_list) == len(pre_batch_list), "the two value must be equal" + assert not insert_idx, "the insert_idx must be null" + sample_batch = self._get_samples(pre_batch_list) + + for idx, lit in enumerate(pre_batch_list): + if not lit: + insert_idx.append(idx) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + if self.mode != "train": + if self._cnt_list(pre_batch_list): + pre_batch_list += [ + [] for _ in range(self.batch_size * self.trainer_num - self._cnt_list(pre_batch_list)) + ] + sample_batch = self._get_samples(pre_batch_list, is_last=True) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + def __call__(self): + curr_id = 0 + for batch_records in self._create_instances(): + if curr_id == self.trainer_id or self.mode != "train": + yield batch_records + curr_id = (curr_id + 1) % self.trainer_num + + def get_num_examples(self): + if self.num_examples is None: + self.num_examples = 0 + for qid, example in enumerate(self.dataset): + self.num_examples += len(self._convert_to_features(example, qid)) + return self.num_examples + + +class MRCIterator(ClassifierIterator): + """ + Machine Reading Comprehension iterator. Only for answer extraction. + """ + + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + doc_stride=128, + max_query_length=64, + ): + super(MRCIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + preprocess_text_fn=None, + ) + self.doc_stride = doc_stride + self.max_query_length = max_query_length + self.examples = [] + self.features = [] + self.features_all = [] + self._preprocess_data() + + def shuffle_sample(self): + if self.shuffle: + self.global_rng = np.random.RandomState(self.random_seed) + self.global_rng.shuffle(self.features_all) + + def _convert_qa_to_examples(self): + Example = namedtuple( + "Example", ["qas_id", "question_text", "doc_tokens", "orig_answer_text", "start_position", "end_position"] + ) + examples = [] + for qa in self.dataset: + qas_id = qa["id"] + question_text = qa["question"] + context = qa["context"] + start_pos = None + end_pos = None + orig_answer_text = None + if self.mode == "train": + if len(qa["answers"]) != 1: + raise ValueError("For training, each question should have exactly 1 answer.") + orig_answer_text = qa["answers"][0] + answer_offset = qa["answer_starts"][0] + answer_length = len(orig_answer_text) + doc_tokens = [ + context[:answer_offset], + context[answer_offset : answer_offset + answer_length], + context[answer_offset + answer_length :], + ] + + start_pos = 1 + end_pos = 1 + + actual_text = " ".join(doc_tokens[start_pos : (end_pos + 1)]) + if orig_answer_text.islower(): + actual_text = actual_text.lower() + if actual_text.find(orig_answer_text) == -1: + logger.info("Could not find answer: '%s' vs. 
'%s'" % (actual_text, orig_answer_text)) + continue + + else: + doc_tokens = tokenize_chinese_chars(context) + + example = Example( + qas_id=qas_id, + question_text=question_text, + doc_tokens=doc_tokens, + orig_answer_text=orig_answer_text, + start_position=start_pos, + end_position=end_pos, + ) + examples.append(example) + return examples + + def _convert_example_to_feature(self, examples): + Feature = namedtuple( + "Feature", + [ + "qid", + "example_index", + "doc_span_index", + "tokens", + "token_to_orig_map", + "token_is_max_context", + "src_ids", + "start_position", + "end_position", + "cal_loss", + ], + ) + features = [] + self.features_all = [] + unique_id = 1000 + is_training = self.mode == "train" + print("total {} examples".format(len(examples)), flush=True) + for (example_index, example) in enumerate(examples): + query_tokens = self.tokenizer.tokenize(example.question_text) + if len(query_tokens) > self.max_query_length: + query_tokens = query_tokens[0 : self.max_query_length] + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + for (i, token) in enumerate(example.doc_tokens): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = self.tokenizer.tokenize(token) + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + + tok_start_position = None + tok_end_position = None + if is_training: + tok_start_position = orig_to_tok_index[example.start_position] + if example.end_position < len(example.doc_tokens) - 1: + tok_end_position = orig_to_tok_index[example.end_position + 1] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = self._improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, example.orig_answer_text + ) + + max_tokens_for_doc = self.max_seq_length - len(query_tokens) - 3 + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = [] + token_to_orig_map = {} + token_is_max_context = {} + tokens.append("[CLS]") + for i in range(doc_span.length): + split_token_index = doc_span.start + i + token_to_orig_map[i + 1] = tok_to_orig_index[split_token_index] + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[i + 1] = is_max_context + tokens += all_doc_tokens[doc_span.start : doc_span.start + doc_span.length] + tokens.append("[SEP]") + + for token in query_tokens: + tokens.append(token) + tokens.append("[SEP]") + + token_ids = self.tokenizer.convert_tokens_to_ids(tokens) + start_position = None + end_position = None + if is_training: + doc_start = doc_span.start + doc_end = doc_span.start + doc_span.length - 1 + out_of_span = False + if not (tok_start_position >= doc_start and tok_end_position <= doc_end): + out_of_span = True + if out_of_span: + start_position = 0 + end_position = 0 + else: + doc_offset = 1 # len(query_tokens) + 2 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + + feature = Feature( + qid=unique_id, + example_index=example_index, + 
doc_span_index=doc_span_index, + tokens=tokens, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + src_ids=token_ids, + start_position=start_position, + end_position=end_position, + cal_loss=1, + ) + features.append(feature) + features_each.append(feature) + if example_index % 1000 == 0: + print("processing {} examples".format(example_index), flush=True) + + unique_id += 1 + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return features + + def _preprocess_data(self): + # Construct examples + self.examples = self._convert_qa_to_examples() + # Construct features + self.features = self._convert_example_to_feature(self.examples) + + def get_num_examples(self): + if not self.features_all: + self._preprocess_data() + return len(sum(self.features_all, [])) + + def _improve_answer_span(self, doc_tokens, input_start, input_end, orig_answer_text): + """Improve answer span""" + tok_answer_text = " ".join(self.tokenizer.tokenize(orig_answer_text)) + + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = " ".join(doc_tokens[new_start : (new_end + 1)]) + if text_span == tok_answer_text: + return (new_start, new_end) + + return (input_start, input_end) + + def _check_is_max_context(self, doc_spans, cur_span_index, position): + """Check is max context""" + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + break + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + if best_span_index > cur_span_index: + return False + + return cur_span_index == best_span_index + + def _pad_batch_records(self, batch_records, gather_idx=[]): + """Pad batch data""" + batch_token_ids = [record.src_ids for record in batch_records] + + if self.mode == "train": + batch_start_position = [record.start_position for record in batch_records] + batch_end_position = [record.end_position for record in batch_records] + batch_start_position = np.array(batch_start_position).astype("int64").reshape([-1, 1]) + batch_end_position = np.array(batch_end_position).astype("int64").reshape([-1, 1]) + else: + batch_size = len(batch_token_ids) + batch_start_position = np.zeros(shape=[batch_size, 1], dtype="int64") + batch_end_position = np.zeros(shape=[batch_size, 1], dtype="int64") + + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + # padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64") + padded_position_ids = get_related_pos(padded_task_ids, 
self.max_seq_length, self.memory_len) + + return_list = [ + padded_token_ids, + padded_position_ids, + padded_task_ids, + input_mask, + batch_start_position, + batch_end_position, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + + return return_list + + def _create_instances(self): + """Generate batch records""" + pre_batch_list = [] + insert_idx = [] + for qid, features in enumerate(self.features_all): + if self._cnt_list(pre_batch_list) < self.batch_size * self.trainer_num: + if insert_idx: + pre_batch_list[insert_idx[0]] = features + insert_idx.pop(0) + else: + pre_batch_list.append(features) + if self._cnt_list(pre_batch_list) == self.batch_size * self.trainer_num: + assert self._cnt_list(pre_batch_list) == len(pre_batch_list), "the two value must be equal" + assert not insert_idx, "the insert_idx must be null" + sample_batch = self._get_samples(pre_batch_list) + + for idx, lit in enumerate(pre_batch_list): + if not lit: + insert_idx.append(idx) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + if self.mode != "train": + if self._cnt_list(pre_batch_list): + pre_batch_list += [ + [] for _ in range(self.batch_size * self.trainer_num - self._cnt_list(pre_batch_list)) + ] + sample_batch = self._get_samples(pre_batch_list, is_last=True) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + +class MCQIterator(MRCIterator): + """ + Multiple choice question iterator. + """ + + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + doc_stride=128, + max_query_length=64, + choice_num=4, + ): + self.choice_num = choice_num + super(MCQIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + ) + + def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. 
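+        # Work on copies so that the caller's token lists are not modified in place.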
+ tokens_a = list(tokens_a) + tokens_b = list(tokens_b) + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + return tokens_a, tokens_b + + def _convert_qa_to_examples(self): + Example = namedtuple("Example", ["qas_id", "context", "question", "choice", "label"]) + examples = [] + for qas_id, qa in enumerate(self.dataset): + context = "\n".join(qa["context"]).lower() + question = qa["question"].lower() + choice = [c.lower() for c in qa["choice"]] + # pad empty choice + for k in range(len(choice), self.choice_num): + choice.append("") + label = qa["label"] + + example = Example(qas_id=qas_id, context=context, question=question, choice=choice, label=label) + examples.append(example) + return examples + + def _convert_example_to_feature(self, examples): + Feature = namedtuple("Feature", ["qid", "src_ids", "segment_ids", "label", "cal_loss"]) + features = [] + self.features_all = [] + for (ex_index, example) in enumerate(examples): + context_tokens = self.tokenizer.tokenize(example.context) + question_tokens = self.tokenizer.tokenize(example.question) + choice_tokens_lst = [self.tokenizer.tokenize(choice) for choice in example.choice] + # nums = 4 + question_choice_pairs = [ + self._truncate_seq_pair(question_tokens, choice_tokens, self.max_query_length - 2) + for choice_tokens in choice_tokens_lst + ] + total_qc_num = sum([(len(q) + len(c)) for q, c in question_choice_pairs]) + max_tokens_for_doc = self.max_seq_length - total_qc_num - 4 + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + + while start_offset < len(context_tokens): + length = len(context_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(context_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + qa_features = [] + for q_tokens, c_tokens in question_choice_pairs: + segment_tokens = ["[CLS]"] + token_type_ids = [0] + + segment_tokens += context_tokens[doc_span.start : doc_span.start + doc_span.length] + token_type_ids += [0] * doc_span.length + + segment_tokens += ["[SEP]"] + token_type_ids += [0] + + segment_tokens += q_tokens + token_type_ids += [1] * len(q_tokens) + + segment_tokens += ["[SEP]"] + token_type_ids += [1] + + segment_tokens += c_tokens + token_type_ids += [1] * len(c_tokens) + + segment_tokens += ["[SEP]"] + token_type_ids += [1] + + input_ids = self.tokenizer.convert_tokens_to_ids(segment_tokens) + feature = Feature( + qid=example.qas_id, + label=example.label, + src_ids=input_ids, + segment_ids=token_type_ids, + cal_loss=1, + ) + qa_features.append(feature) + + features.append(qa_features) + features_each.append(qa_features) + + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return features + + def _pad_batch_records(self, batch_records, gather_idx=[]): + batch_token_ids = [[record.src_ids for record in records] for records in batch_records] + if batch_records[0][0].label is not None: + batch_labels = [[record.label for record in records] for records in batch_records] + batch_labels = 
np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + batch_qids = [[record.qid for record in records] for records in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + batch_task_ids = [[record.segment_ids for record in records] for records in batch_records] + + # Padding + batch_padded_token_ids = [] + batch_input_mask = [] + batch_padded_task_ids = [] + batch_padded_position_ids = [] + batch_size = len(batch_token_ids) + for i in range(batch_size): + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids[i], + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = pad_batch_data( + batch_task_ids[i], pad_idx=self.tokenizer.pad_token_id, pad_max_len=self.max_seq_length + ) + + padded_position_ids = get_related_pos(padded_task_ids, self.max_seq_length, self.memory_len) + + batch_padded_token_ids.append(padded_token_ids) + batch_input_mask.append(input_mask) + batch_padded_task_ids.append(padded_task_ids) + batch_padded_position_ids.append(padded_position_ids) + + batch_padded_token_ids = ( + np.array(batch_padded_token_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_padded_position_ids = ( + np.array(batch_padded_position_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_padded_task_ids = ( + np.array(batch_padded_task_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_input_mask = np.array(batch_input_mask).astype("float32").reshape([batch_size * self.choice_num, -1, 1]) + + return_list = [ + batch_padded_token_ids, + batch_padded_position_ids, + batch_padded_task_ids, + batch_input_mask, + batch_labels, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + return return_list + + def _prepare_batch_data(self, examples_list): + batch_records, max_len, gather_idx = [], 0, [] + real_batch_size = self.batch_size * self.choice_num + index = 0 + for examples in examples_list: + records = [] + gather_idx_candidate = [] + for example in examples: + if example.cal_loss == 1: + gather_idx_candidate.append(index % real_batch_size) + max_len = max(max_len, len(example.src_ids)) + records.append(example) + index += 1 + + if self.in_tokens: + to_append = (len(batch_records) + 1) * self.choice_num * max_len <= self.batch_size + else: + to_append = len(batch_records) < self.batch_size + if to_append: + batch_records.append(records) + gather_idx += gather_idx_candidate + else: + yield self._pad_batch_records(batch_records, gather_idx) + batch_records, max_len = [records], max(len(record.src_ids) for record in records) + gather_idx = gather_idx_candidate + if len(batch_records) > 0: + yield self._pad_batch_records(batch_records, gather_idx) + + def _get_samples(self, pre_batch_list, is_last=False): + if is_last: + # Pad batch + len_doc = [[len(doc) for doc in doc_list] for doc_list in pre_batch_list] + len_doc = list(itertools.chain(*len_doc)) + max_len_idx = len_doc.index(max(len_doc)) + doc_idx = max_len_idx % self.choice_num + doc_list_idx = max_len_idx // self.choice_num + dirty_sample = 
pre_batch_list[doc_list_idx][doc_idx][-1]._replace(cal_loss=0) + for sample_list in pre_batch_list: + for samples in sample_list: + samples.extend([dirty_sample] * (max(len_doc) - len(samples))) + samples = [] + min_len = min([len(doc) for doc in pre_batch_list]) + for cnt in range(min_len): + for batch_idx in range(self.batch_size * self.trainer_num): + sample = pre_batch_list[batch_idx][cnt] + samples.append(sample) + + for idx in range(len(pre_batch_list)): + pre_batch_list[idx] = pre_batch_list[idx][min_len:] + return samples + + +class SemanticMatchingIterator(MRCIterator): + def _convert_qa_to_examples(self): + Example = namedtuple("Example", ["qid", "text_a", "text_b", "text_c", "label"]) + examples = [] + for qid, qa in enumerate(self.dataset): + text_a, text_b, text_c = list( + map(lambda x: x.replace("\n", "").strip(), [qa["text_a"], qa["text_b"], qa["text_c"]]) + ) + + example = Example(qid=qid, text_a=text_a, text_b=text_b, text_c=text_c, label=qa["label"]) + examples += [example] + return examples + + def _create_tokens_and_type_id(self, text_a_tokens, text_b_tokens, start, length): + tokens = ( + ["[CLS]"] + + text_a_tokens[start : start + length] + + ["[SEP]"] + + text_b_tokens[start : start + length] + + ["[SEP]"] + ) + token_type_ids = [0] + [0] * (length + 1) + [1] * (length + 1) + return tokens, token_type_ids + + def _convert_example_to_feature(self, examples): + Feature = namedtuple( + "Feature", ["qid", "src_ids", "segment_ids", "pair_src_ids", "pair_segment_ids", "label", "cal_loss"] + ) + features = [] + self.features_all = [] + for (ex_index, example) in enumerate(examples): + text_a_tokens = self.tokenizer.tokenize(example.text_a) + text_b_tokens = self.tokenizer.tokenize(example.text_b) + text_c_tokens = self.tokenizer.tokenize(example.text_c) + a_len, b_len, c_len = list(map(lambda x: len(x), [text_a_tokens, text_b_tokens, text_c_tokens])) + + # Align 3 text + min_text_len = min([a_len, b_len, c_len]) + text_a_tokens = text_a_tokens[:min_text_len] + text_b_tokens = text_b_tokens[:min_text_len] + text_c_tokens = text_c_tokens[:min_text_len] + + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + + max_tokens_for_doc = (self.max_seq_length - 3) // 2 + + while start_offset < len(text_a_tokens): + length = len(text_a_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(text_a_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens1, token_type_ids1 = self._create_tokens_and_type_id( + text_a_tokens, text_b_tokens, doc_span.start, doc_span.length + ) + tokens2, token_type_ids2 = self._create_tokens_and_type_id( + text_a_tokens, text_c_tokens, doc_span.start, doc_span.length + ) + + input_ids1 = self.tokenizer.convert_tokens_to_ids(tokens1) + input_ids2 = self.tokenizer.convert_tokens_to_ids(tokens2) + feature = Feature( + qid=example.qid, + label=example.label, + src_ids=input_ids1, + segment_ids=token_type_ids1, + pair_src_ids=input_ids2, + pair_segment_ids=token_type_ids2, + cal_loss=1, + ) + + features.append(feature) + features_each.append(feature) + + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return 
features + + def _create_pad_ids(self, batch_records, prefix=""): + src_ids = prefix + "src_ids" + segment_ids = prefix + "segment_ids" + batch_token_ids = [getattr(record, src_ids) for record in batch_records] + batch_task_ids = [getattr(record, segment_ids) for record in batch_records] + + # Padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = pad_batch_data( + batch_task_ids, pad_idx=self.tokenizer.pad_token_id, pad_max_len=self.max_seq_length + ) + + padded_position_ids = get_related_pos(padded_task_ids, self.max_seq_length, self.memory_len) + + return [padded_token_ids, padded_position_ids, padded_task_ids, input_mask] + + def _pad_batch_records(self, batch_records, gather_idx=[]): + if batch_records[0].label is not None: + batch_labels = [record.label for record in batch_records] + batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + return_list = ( + self._create_pad_ids(batch_records) + + self._create_pad_ids(batch_records, "pair_") + + [batch_labels, batch_qids, batch_gather_idx, need_cal_loss] + ) + return return_list + + +class SequenceLabelingIterator(ClassifierIterator): + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + no_entity_id=-1, + ): + super(SequenceLabelingIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + preprocess_text_fn=None, + ) + self.no_entity_id = no_entity_id + + def _convert_to_features(self, example, qid): + """ + Convert example to features fed into model + """ + tokens = example["tokens"] + label = example["labels"] + doc_spans = [] + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + start_offset = 0 + max_tokens_for_doc = self.max_seq_length - 2 + while start_offset < len(tokens): + length = len(tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(tokens): + break + start_offset += min(length, self.memory_len) + + features = [] + Feature = namedtuple("Feature", ["src_ids", "label_ids", "qid", "cal_loss"]) + for (doc_span_index, doc_span) in enumerate(doc_spans): + curr_tokens = ["[CLS]"] + tokens[doc_span.start : doc_span.start + doc_span.length] + ["[SEP]"] + token_ids = self.tokenizer.convert_tokens_to_ids(curr_tokens) + label = ( + [self.no_entity_id] + label[doc_span.start : doc_span.start + doc_span.length] + [self.no_entity_id] + ) + + features.append(Feature(src_ids=token_ids, label_ids=label, qid=qid, cal_loss=1)) + + if self.repeat_input: + features_repeat = features + features = list(map(lambda x: x._replace(cal_loss=0), features)) + features = features + features_repeat 
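+            # features now holds every span twice: the leading copies carry cal_loss=0
+            # (excluded from the loss), the trailing originals carry cal_loss=1.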
+ return features + + def _pad_batch_records(self, batch_records, gather_idx=[]): + batch_token_ids = [record.src_ids for record in batch_records] + batch_length = [len(record.src_ids) for record in batch_records] + batch_length = np.array(batch_length).astype("int64").reshape([-1, 1]) + + if batch_records[0].label_ids is not None: + batch_labels = [record.label_ids for record in batch_records] + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + if batch_records[-1].qid is not None: + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + else: + batch_qids = np.array([]).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + # Padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + if batch_records[0].label_ids is not None: + padded_batch_labels = pad_batch_data( + batch_labels, pad_idx=self.no_entity_id, pad_max_len=self.max_seq_length + ) + padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64") + padded_position_ids = get_related_pos(padded_token_ids, self.max_seq_length, self.memory_len) + + return_list = [ + padded_token_ids, + padded_position_ids, + padded_task_ids, + input_mask, + padded_batch_labels, + batch_length, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + return return_list + + +def to_json_file(task, label_dict, file_path): + if task == "iflytek": + filename = file_path + + with open(filename, "w+") as f_obj: + for i, j in label_dict.items(): + tmp = dict() + tmp["id"] = str(i) + tmp["label"] = str(j) + json.dump(tmp, f_obj) + f_obj.write("\n") diff --git a/examples/text_classification/ernie_doc/export_model.py b/examples/text_classification/ernie_doc/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..187f8efe2dd8153623d6861573e125af1c54a4ed --- /dev/null +++ b/examples/text_classification/ernie_doc/export_model.py @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
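+
+# Convert a fine-tuned ERNIE-Doc classifier into a static-graph inference model:
+# LongDocClassifier from predict.py is built with static_mode=True and saves the
+# converted model under --static_path (an existing directory there is removed first).
+# A typical invocation (checkpoint and output paths below are only illustrative):
+#   python export_model.py --model_name_or_path ./checkpoints/best --dataset iflytek --static_path ./export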
+ +import argparse +import os +import paddle +import shutil +from paddlenlp.utils.log import logger +from predict import LongDocClassifier + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=16, type=int, + help="Batch size per GPU/CPU for predicting (In static mode, it should be the same as in model training process.)") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", + help="Pretraining or finetuned model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, + help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--device", type=str, default="cpu", choices=["cpu", "gpu"], + help="Select cpu, gpu devices to train model.") +parser.add_argument("--dataset", default="iflytek", choices=["imdb", "iflytek", "thucnews", "hyp"], type=str, + help="The training dataset") +parser.add_argument("--static_path", default=None, type=str, + help="The path which your static model is at or where you want to save after converting.") + +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + + if args.static_path and os.path.exists(args.static_path): + logger.info("will remove the old model") + shutil.rmtree(args.static_path) + + predictor = LongDocClassifier( + model_name_or_path=args.model_name_or_path, + batch_size=args.batch_size, + max_seq_length=args.max_seq_length, + memory_len=args.memory_length, + static_mode=True, + static_path=args.static_path, + ) diff --git a/examples/text_classification/ernie_doc/metrics.py b/examples/text_classification/ernie_doc/metrics.py new file mode 100644 index 0000000000000000000000000000000000000000..a60509380b5fd06c3b9936868dc1938fc4acc8cc --- /dev/null +++ b/examples/text_classification/ernie_doc/metrics.py @@ -0,0 +1,367 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
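+
+# Metrics used by the ERNIE-Doc examples: an F1 metric over a designated positive
+# label for classification, EM/F1 scoring for Chinese machine reading comprehension,
+# and n-best answer span extraction for QA predictions.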
+ +import collections +import sys + +import numpy as np +import paddle +from paddle.utils import try_import + +from paddlenlp.metrics.dureader import ( + _compute_softmax, + _get_best_indexes, + get_final_text, +) + +# Metric for ERNIE-DOCs + + +class F1(object): + def __init__(self, positive_label=1): + self.positive_label = positive_label + self.reset() + + def compute(self, preds, labels): + if isinstance(preds, paddle.Tensor): + preds = preds.numpy() + elif isinstance(preds, list): + preds = np.array(preds, dtype="float32") + if isinstance(labels, list): + labels = np.array(labels, dtype="int64") + elif isinstance(labels, paddle.Tensor): + labels = labels.numpy() + preds = np.argmax(preds, axis=1) + tp = ((preds == labels) & (labels == self.positive_label)).sum() + fn = ((preds != labels) & (labels == self.positive_label)).sum() + fp = ((preds != labels) & (preds == self.positive_label)).sum() + return tp, fp, fn + + def update(self, statistic): + tp, fp, fn = statistic + self.tp += tp + self.fp += fp + self.fn += fn + + def accumulate(self): + recall = self.tp / (self.tp + self.fn) + precision = self.tp / (self.tp + self.fp) + f1 = 2 * recall * precision / (recall + precision) + return f1 + + def reset(self): + self.tp = 0 + self.fp = 0 + self.fn = 0 + + +class EM_AND_F1(object): + def __init__(self): + self.nltk = try_import("nltk") + self.re = try_import("re") + + def _mixed_segmentation(self, in_str, rm_punc=False): + """mixed_segmentation""" + in_str = in_str.lower().strip() + segs_out = [] + temp_str = "" + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + for char in in_str: + if rm_punc and char in sp_char: + continue + pattern = "[\\u4e00-\\u9fa5]" + if self.re.search(pattern, char) or char in sp_char: + if temp_str != "": + ss = self.nltk.word_tokenize(temp_str) + segs_out.extend(ss) + temp_str = "" + segs_out.append(char) + else: + temp_str += char + + # Handling last part + if temp_str != "": + ss = self.nltk.word_tokenize(temp_str) + segs_out.extend(ss) + + return segs_out + + # Remove punctuation + def _remove_punctuation(self, in_str): + """remove_punctuation""" + in_str = in_str.lower().strip() + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + out_segs = [] + for char in in_str: + if char in sp_char: + continue + else: + out_segs.append(char) + return "".join(out_segs) + + # Find longest common string + def _find_lcs(self, s1, s2): + m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)] + mmax = 0 + p = 0 + for i in range(len(s1)): + for j in range(len(s2)): + if s1[i] == s2[j]: + m[i + 1][j + 1] = m[i][j] + 1 + if m[i + 1][j + 1] > mmax: + mmax = m[i + 1][j + 1] + p = i + 1 + return s1[p - mmax : p], mmax + + def _calc_f1_score(self, answers, prediction): + f1_scores = [] + for ans in answers: + ans_segs = self._mixed_segmentation(ans, rm_punc=True) + prediction_segs = self._mixed_segmentation(prediction, rm_punc=True) + lcs, lcs_len = self._find_lcs(ans_segs, prediction_segs) + if lcs_len == 0: + f1_scores.append(0) + continue + precision = 1.0 * lcs_len / len(prediction_segs) + recall = 1.0 * lcs_len / len(ans_segs) + f1 = (2 * precision * recall) / 
(precision + recall) + f1_scores.append(f1) + return max(f1_scores) + + def _calc_em_score(self, answers, prediction): + em = 0 + for ans in answers: + ans_ = self._remove_punctuation(ans) + prediction_ = self._remove_punctuation(prediction) + if ans_ == prediction_: + em = 1 + break + return em + + def __call__(self, prediction, ground_truth): + f1 = 0 + em = 0 + total_count = 0 + skip_count = 0 + for instance in ground_truth: + total_count += 1 + query_id = instance["id"] + answers = instance["answers"] + if query_id not in prediction: + sys.stderr.write("Unanswered question: {}\n".format(query_id)) + skip_count += 1 + continue + preds = str(prediction[query_id]) + f1 += self._calc_f1_score(answers, preds) + em += self._calc_em_score(answers, preds) + + f1_score = 100.0 * f1 / total_count + em_score = 100.0 * em / total_count + + avg_score = (f1_score + em_score) * 0.5 + return em_score, f1_score, avg_score, total_count + + +def compute_qa_predictions( + all_examples, all_features, all_results, n_best_size, max_answer_length, do_lower_case, tokenizer, verbose +): + """Write final predictions to the json file and log-odds of null if needed.""" + + example_index_to_features = collections.defaultdict(list) + for feature in all_features: + example_index_to_features[feature.example_index].append(feature) + + unique_id_to_result = {} + for result in all_results: + unique_id_to_result[result.unique_id] = result + + _PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", ["feature_index", "start_index", "end_index", "start_logit", "end_logit"] + ) + + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + + for (example_index, example) in enumerate(all_examples): + features = example_index_to_features[example_index] + + prelim_predictions = [] + # Keep track of the minimum score of null start+end of position 0 + for (feature_index, feature) in enumerate(features): + result = unique_id_to_result[feature.qid] + start_indexes = _get_best_indexes(result.start_logits, n_best_size) + end_indexes = _get_best_indexes(result.end_logits, n_best_size) + + for start_index in start_indexes: + for end_index in end_indexes: + # We could hypothetically create invalid predictions, e.g., predict + # that the start of the span is in the question. We throw out all + # invalid predictions. 
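+                    # A candidate span is kept only if both indices fall inside this
+                    # feature's tokens, both map back to original document tokens, the
+                    # start token has maximum context in this feature, start <= end,
+                    # and the span is no longer than max_answer_length.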
+ if start_index >= len(feature.tokens): + continue + if end_index >= len(feature.tokens): + continue + if start_index not in feature.token_to_orig_map: + continue + if end_index not in feature.token_to_orig_map: + continue + if not feature.token_is_max_context.get(start_index, False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + _PrelimPrediction( + feature_index=feature_index, + start_index=start_index, + end_index=end_index, + start_logit=result.start_logits[start_index], + end_logit=result.end_logits[end_index], + ) + ) + + prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True) + + _NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name + "NbestPrediction", ["text", "start_logit", "end_logit"] + ) + + seen_predictions = {} + nbest = [] + for pred in prelim_predictions: + if len(nbest) >= n_best_size: + break + feature = features[pred.feature_index] + if pred.start_index > 0: # this is a non-null prediction + tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)] + orig_doc_start = feature.token_to_orig_map[pred.start_index] + orig_doc_end = feature.token_to_orig_map[pred.end_index] + orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)] + tok_text = " ".join(tok_tokens) + + # De-tokenize WordPieces that have been split off. + tok_text = tok_text.replace(" ##", "") + tok_text = tok_text.replace("##", "") + + # Clean whitespace + tok_text = tok_text.strip() + tok_text = " ".join(tok_text.split()) + orig_text = "".join(orig_tokens) + + final_text = get_final_text(tok_text, orig_text, tokenizer, verbose) + if final_text in seen_predictions: + continue + + seen_predictions[final_text] = True + else: + final_text = "" + seen_predictions[final_text] = True + + nbest.append(_NbestPrediction(text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit)) + + # In very rare edge cases we could have no valid predictions. So we + # just create a nonce prediction in this case to avoid failure. + if not nbest: + nbest.append(_NbestPrediction(text="empty", start_logit=0.0, end_logit=0.0)) + + total_scores = [] + for entry in nbest: + total_scores.append(entry.start_logit + entry.end_logit) + + probs = _compute_softmax(total_scores) + + nbest_json = [] + for (i, entry) in enumerate(nbest): + output = collections.OrderedDict() + output["text"] = entry.text + output["probability"] = probs[i] + output["start_logit"] = entry.start_logit + output["end_logit"] = entry.end_logit + nbest_json.append(output) + + assert len(nbest_json) >= 1 + + all_predictions[example.qas_id] = nbest_json[0]["text"] + all_nbest_json[example.qas_id] = nbest_json + return all_predictions, all_nbest_json diff --git a/examples/text_classification/ernie_doc/modeling.py b/examples/text_classification/ernie_doc/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..05b725767e8fa042b8613d721bba652e6616f09d --- /dev/null +++ b/examples/text_classification/ernie_doc/modeling.py @@ -0,0 +1,940 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +from paddlenlp.transformers import PretrainedModel, register_base_model +from paddlenlp.transformers.attention_utils import _convert_param_attr_to_list + +__all__ = [ + "ErnieDocModel", + "ErnieDocPretrainedModel", + "ErnieDocForSequenceClassification", + "ErnieDocForTokenClassification", + "ErnieDocForQuestionAnswering", +] + + +class PointwiseFFN(nn.Layer): + def __init__(self, d_inner_hid, d_hid, dropout_rate, hidden_act, weight_attr=None, bias_attr=None): + super(PointwiseFFN, self).__init__() + self.linear1 = nn.Linear(d_hid, d_inner_hid, weight_attr, bias_attr=bias_attr) + self.dropout = nn.Dropout(dropout_rate, mode="upscale_in_train") + self.linear2 = nn.Linear(d_inner_hid, d_hid, weight_attr, bias_attr=bias_attr) + self.activation = getattr(F, hidden_act) + + def forward(self, x): + return self.linear2(self.dropout(self.activation(self.linear1(x)))) + + +class MultiHeadAttention(nn.Layer): + def __init__( + self, + d_key, + d_value, + d_model, + n_head=1, + r_w_bias=None, + r_r_bias=None, + r_t_bias=None, + dropout_rate=0.0, + weight_attr=None, + bias_attr=None, + ): + super(MultiHeadAttention, self).__init__() + self.d_key = d_key + self.d_value = d_value + self.d_model = d_model + self.n_head = n_head + + assert d_key * n_head == d_model, "d_model must be divisible by n_head" + + self.q_proj = nn.Linear(d_model, d_key * n_head, weight_attr=weight_attr, bias_attr=bias_attr) + self.k_proj = nn.Linear(d_model, d_key * n_head, weight_attr=weight_attr, bias_attr=bias_attr) + self.v_proj = nn.Linear(d_model, d_value * n_head, weight_attr=weight_attr, bias_attr=bias_attr) + self.r_proj = nn.Linear(d_model, d_key * n_head, weight_attr=weight_attr, bias_attr=bias_attr) + self.t_proj = nn.Linear(d_model, d_key * n_head, weight_attr=weight_attr, bias_attr=bias_attr) + self.out_proj = nn.Linear(d_model, d_model, weight_attr=weight_attr, bias_attr=bias_attr) + self.r_w_bias = r_w_bias + self.r_r_bias = r_r_bias + self.r_t_bias = r_t_bias + self.dropout = nn.Dropout(dropout_rate, mode="upscale_in_train") if dropout_rate else None + + def _compute_qkv(self, queries, keys, values, rel_pos, rel_task): + q = self.q_proj(queries) + k = self.k_proj(keys) + v = self.v_proj(values) + r = self.r_proj(rel_pos) + t = self.t_proj(rel_task) + return q, k, v, r, t + + def _split_heads(self, x, d_model, n_head): + # x shape: [B, T, H] + x = x.reshape(shape=[0, 0, n_head, d_model // n_head]) + # shape: [B, N, T, HH] + return paddle.transpose(x=x, perm=[0, 2, 1, 3]) + + def _rel_shift(self, x, klen=-1): + """ + To perform relative attention, it should relatively shift the attention score matrix + See more details on: https://github.com/kimiyoung/transformer-xl/issues/8#issuecomment-454458852 + """ + # input shape: [B, N, T, 2 * T + M] + x_shape = x.shape + x = x.reshape([x_shape[0], x_shape[1], x_shape[3], x_shape[2]]) + x = x[:, :, 1:, :] + x = x.reshape([x_shape[0], x_shape[1], x_shape[2], x_shape[3] - 1]) + # output shape: [B, N, T, T + M] + return x[:, :, :, :klen] + + def _scaled_dot_product_attention(self, q, k, v, r, 
t, attn_mask): + q_w, q_r, q_t = q + score_w = paddle.matmul(q_w, k, transpose_y=True) + score_r = paddle.matmul(q_r, r, transpose_y=True) + score_r = self._rel_shift(score_r, k.shape[2]) + + score_t = paddle.matmul(q_t, t, transpose_y=True) + score = score_w + score_r + score_t + score = score * (self.d_key**-0.5) + + if attn_mask is not None: + score += attn_mask + weights = F.softmax(score) + if self.dropout: + weights = self.dropout(weights) + out = paddle.matmul(weights, v) + return out + + def _combine_heads(self, x): + sign = len(x.shape) == 3 + # Directly using len(tensor.shape) as an if condition + # would not act functionally when applying paddle.jit.save api to save static graph. + if sign: + return x + sign = len(x.shape) != 4 + if sign: + raise ValueError("Input(x) should be a 4-D Tensor.") + # x shape: [B, N, T, HH] + x = paddle.transpose(x, [0, 2, 1, 3]) + # target shape:[B, T, H] + return x.reshape([0, 0, x.shape[2] * x.shape[3]]) + + def forward(self, queries, keys, values, rel_pos, rel_task, memory, attn_mask): + sign = memory is not None and len(memory.shape) > 1 + if sign: + cat = paddle.concat([memory, queries], 1) + else: + cat = queries + keys, values = cat, cat + + sign = ( + len(queries.shape) + == len(keys.shape) + == len(values.shape) + == len(rel_pos.shape) + == len(rel_task.shape) + == 3 + ) + + if not sign: + raise ValueError("Inputs: quries, keys, values, rel_pos and rel_task should all be 3-D tensors.") + + q, k, v, r, t = self._compute_qkv(queries, keys, values, rel_pos, rel_task) + q_w, q_r, q_t = list(map(lambda x: q + x.unsqueeze([0, 1]), [self.r_w_bias, self.r_r_bias, self.r_t_bias])) + q_w, q_r, q_t = list(map(lambda x: self._split_heads(x, self.d_model, self.n_head), [q_w, q_r, q_t])) + k, v, r, t = list(map(lambda x: self._split_heads(x, self.d_model, self.n_head), [k, v, r, t])) + ctx_multiheads = self._scaled_dot_product_attention([q_w, q_r, q_t], k, v, r, t, attn_mask) + out = self._combine_heads(ctx_multiheads) + out = self.out_proj(out) + return out + + +class ErnieDocEncoderLayer(nn.Layer): + def __init__( + self, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + normalize_before=False, + epsilon=1e-5, + rel_pos_params_sharing=False, + r_w_bias=None, + r_r_bias=None, + r_t_bias=None, + weight_attr=None, + bias_attr=None, + ): + self._config = locals() + self._config.pop("self") + self._config.pop("__class__", None) # py3 + super(ErnieDocEncoderLayer, self).__init__() + if not rel_pos_params_sharing: + r_w_bias, r_r_bias, r_t_bias = list( + map( + lambda x: self.create_parameter(shape=[n_head * d_key], dtype="float32"), + ["r_w_bias", "r_r_bias", "r_t_bias"], + ) + ) + + weight_attrs = _convert_param_attr_to_list(weight_attr, 2) + bias_attrs = _convert_param_attr_to_list(bias_attr, 2) + self.attn = MultiHeadAttention( + d_key, + d_value, + d_model, + n_head, + r_w_bias, + r_r_bias, + r_t_bias, + attention_dropout, + weight_attr=weight_attrs[0], + bias_attr=bias_attrs[0], + ) + self.ffn = PointwiseFFN( + d_inner_hid, d_model, relu_dropout, hidden_act, weight_attr=weight_attrs[1], bias_attr=bias_attrs[1] + ) + self.norm1 = nn.LayerNorm(d_model, epsilon=epsilon) + self.norm2 = nn.LayerNorm(d_model, epsilon=epsilon) + self.dropout1 = nn.Dropout(prepostprocess_dropout, mode="upscale_in_train") + self.dropout2 = nn.Dropout(prepostprocess_dropout, mode="upscale_in_train") + self.d_model = d_model + self.epsilon = epsilon + self.normalize_before = normalize_before + + 
def forward(self, enc_input, memory, rel_pos, rel_task, attn_mask): + residual = enc_input + if self.normalize_before: + enc_input = self.norm1(enc_input) + attn_output = self.attn(enc_input, enc_input, enc_input, rel_pos, rel_task, memory, attn_mask) + attn_output = residual + self.dropout1(attn_output) + if not self.normalize_before: + attn_output = self.norm1(attn_output) + residual = attn_output + if self.normalize_before: + attn_output = self.norm2(attn_output) + ffn_output = self.ffn(attn_output) + output = residual + self.dropout2(ffn_output) + if not self.normalize_before: + output = self.norm2(output) + return output + + +class ErnieDocEncoder(nn.Layer): + def __init__(self, num_layers, encoder_layer, mem_len): + super(ErnieDocEncoder, self).__init__() + self.layers = nn.LayerList( + [(encoder_layer if i == 0 else type(encoder_layer)(**encoder_layer._config)) for i in range(num_layers)] + ) + self.num_layers = num_layers + self.normalize_before = self.layers[0].normalize_before + self.mem_len = mem_len + + def _cache_mem(self, curr_out, prev_mem): + if self.mem_len is None or self.mem_len == 0: + return None + if prev_mem is None: + new_mem = curr_out[:, -self.mem_len :, :] + else: + new_mem = paddle.concat([prev_mem, curr_out], 1)[:, -self.mem_len :, :] + new_mem.stop_gradient = True + return new_mem + + def forward(self, enc_input, memories, rel_pos, rel_task, attn_mask): + # memories shape: [N, B, M, H] + # no need to normalize enc_input, cause it's already normalized outside. + new_mem = None + for _, encoder_layer in enumerate(self.layers): + # Since in static mode, the memories should be set as tensor, + # so we use paddle.slice to free the old memories explicitly to save gpu memory. + enc_input = encoder_layer(enc_input, memories[0], rel_pos, rel_task, attn_mask) + if new_mem is None: + new_mem = paddle.unsqueeze(self._cache_mem(enc_input, memories[0]), axis=0) + else: + new_mem = paddle.concat( + [new_mem, paddle.unsqueeze(self._cache_mem(enc_input, memories[0]), axis=0)], axis=0 + ) + sign = memories.shape[0] + if sign > 1: + axis = [0] + start = [1] + end = [memories.shape[0]] + memories = paddle.slice(memories, axes=axis, starts=start, ends=end) + else: + memories = None + return enc_input, new_mem + + +class ErnieDocPretrainedModel(PretrainedModel): + """ + An abstract class for pretrained ErnieDoc models. It provides ErnieDoc related + `model_config_file`, `pretrained_init_configuration`, `resource_files_names`, + `pretrained_resource_files_map`, `base_model_prefix` for downloading + and loading pretrained models. + See :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
+ """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + "ernie-doc-base-en": { + "attention_dropout_prob": 0.0, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.0, + "relu_dropout": 0.0, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "task_type_vocab_size": 3, + "vocab_size": 50265, + "memory_len": 128, + "epsilon": 1e-12, + "pad_token_id": 1, + }, + "ernie-doc-base-zh": { + "attention_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "relu_dropout": 0.0, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "task_type_vocab_size": 3, + "vocab_size": 28000, + "memory_len": 128, + "epsilon": 1e-12, + "pad_token_id": 0, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + "ernie-doc-base-en": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie-doc-base-en/ernie-doc-base-en.pdparams", + "ernie-doc-base-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie-doc-base-zh/ernie-doc-base-zh.pdparams", + } + } + base_model_prefix = "ernie_doc" + + def _init_weights(self, layer): + # Initialization hook + if isinstance(layer, (nn.Linear, nn.Embedding)): + # In the dygraph mode, use the `set_value` to reset the parameter directly, + # and reset the `state_dict` to update parameter in static mode. + if isinstance(layer.weight, paddle.Tensor): + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.ernie_doc.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + + +class ErnieDocEmbeddings(nn.Layer): + def __init__( + self, + vocab_size, + d_model, + hidden_dropout_prob, + memory_len, + max_position_embeddings=512, + type_vocab_size=3, + padding_idx=0, + ): + super(ErnieDocEmbeddings, self).__init__() + self.word_emb = nn.Embedding(vocab_size, d_model) + self.pos_emb = nn.Embedding(max_position_embeddings * 2 + memory_len, d_model) + self.token_type_emb = nn.Embedding(type_vocab_size, d_model) + self.memory_len = memory_len + self.dropouts = nn.LayerList([nn.Dropout(hidden_dropout_prob) for i in range(3)]) + self.norms = nn.LayerList([nn.LayerNorm(d_model) for i in range(3)]) + + def forward(self, input_ids, token_type_ids, position_ids): + # input_embeddings: [B, T, H] + input_embeddings = self.word_emb(input_ids.squeeze(-1)) + # position_embeddings: [B, 2 * T + M, H] + position_embeddings = self.pos_emb(position_ids.squeeze(-1)) + batch_size = input_ids.shape[0] + token_type_ids = paddle.concat( + [ + paddle.zeros(shape=[batch_size, self.memory_len, 1], dtype="int64") + token_type_ids[0, 0, 0], + token_type_ids, + ], + axis=1, + ) + token_type_ids.stop_gradient = True + # token_type_embeddings: [B, M + T, H] + token_type_embeddings = self.token_type_emb(token_type_ids.squeeze(-1)) + embs = [input_embeddings, position_embeddings, token_type_embeddings] + for i in range(len(embs)): + embs[i] = self.dropouts[i](self.norms[i](embs[i])) + return embs + + +class ErnieDocPooler(nn.Layer): + """ + get pool output + """ + + def __init__(self, hidden_size, cls_token_idx=-1): + super(ErnieDocPooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size) + self.activation = nn.Tanh() + self.cls_token_idx = cls_token_idx + + def forward(self, hidden_states): + # We 
"pool" the model by simply taking the hidden state corresponding + # to the last token. + cls_token_tensor = hidden_states[:, self.cls_token_idx] + pooled_output = self.dense(cls_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +@register_base_model +class ErnieDocModel(ErnieDocPretrainedModel): + """ + The bare ERNIE-Doc Model outputting raw hidden-states. + + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + + This model is also a `paddle.nn.Layer <https://www.paddlepaddle.org.cn/documentation + /docs/zh/api/paddle/nn/Layer_cn.html>`__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + + Args: + num_hidden_layers (int): + The number of hidden layers in the Transformer encoder. + num_attention_heads (int): + Number of attention heads for each attention layer in the Transformer encoder. + hidden_size (int): + Dimensionality of the embedding layers, encoder layers and pooler layer. + hidden_dropout_prob (int): + The dropout probability for all fully connected layers in the embeddings and encoder. + attention_dropout_prob (int): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + relu_dropout (int): + The dropout probability of FFN. + hidden_act (str): + The non-linear activation function of FFN. + memory_len (int): + The number of tokens to cache. If not 0, the last `memory_len` hidden states + in each layer will be cached into memory. + vocab_size (int): + Vocabulary size of `inputs_ids` in `ErnieDocModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `ErnieDocModel`. + max_position_embeddings (int): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + task_type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids`. Defaults to `3`. + normalize_before (bool, optional): + Indicate whether to put layer normalization into preprocessing of MHA and FFN sub-layers. + If True, pre-process is layer normalization and post-precess includes dropout, + residual connection. Otherwise, no pre-process and post-precess includes dropout, + residual connection, layer normalization. Defaults to `False`. + epsilon (float, optional): + The `epsilon` parameter used in :class:`paddle.nn.LayerNorm` for + initializing layer normalization layers. Defaults to `1e-5`. + rel_pos_params_sharing (bool, optional): + Whether to share the relative position parameters. + Defaults to `False`. + initializer_range (float, optional): + The standard deviation of the normal initializer for initializing all weight matrices. + Defaults to `0.02`. + pad_token_id (int, optional): + The token id of [PAD] token whose parameters won't be updated when training. + Defaults to `0`. + cls_token_idx (int, optional): + The token id of [CLS] token. Defaults to `-1`. 
+ """ + + def __init__( + self, + num_hidden_layers, + num_attention_heads, + hidden_size, + hidden_dropout_prob, + attention_dropout_prob, + relu_dropout, + hidden_act, + memory_len, + vocab_size, + max_position_embeddings, + task_type_vocab_size=3, + normalize_before=False, + epsilon=1e-5, + rel_pos_params_sharing=False, + initializer_range=0.02, + pad_token_id=0, + cls_token_idx=-1, + ): + super(ErnieDocModel, self).__init__() + + r_w_bias, r_r_bias, r_t_bias = None, None, None + if rel_pos_params_sharing: + r_w_bias, r_r_bias, r_t_bias = list( + map( + lambda x: self.create_parameter(shape=[num_attention_heads * d_key], dtype="float32"), + ["r_w_bias", "r_r_bias", "r_t_bias"], + ) + ) + d_key = hidden_size // num_attention_heads + d_value = hidden_size // num_attention_heads + d_inner_hid = hidden_size * 4 + encoder_layer = ErnieDocEncoderLayer( + num_attention_heads, + d_key, + d_value, + hidden_size, + d_inner_hid, + hidden_dropout_prob, + attention_dropout_prob, + relu_dropout, + hidden_act, + normalize_before=normalize_before, + epsilon=epsilon, + rel_pos_params_sharing=rel_pos_params_sharing, + r_w_bias=r_w_bias, + r_r_bias=r_r_bias, + r_t_bias=r_t_bias, + ) + self.n_head = num_attention_heads + self.d_model = hidden_size + self.memory_len = memory_len + self.encoder = ErnieDocEncoder(num_hidden_layers, encoder_layer, memory_len) + self.pad_token_id = pad_token_id + self.embeddings = ErnieDocEmbeddings( + vocab_size, + hidden_size, + hidden_dropout_prob, + memory_len, + max_position_embeddings, + task_type_vocab_size, + pad_token_id, + ) + self.pooler = ErnieDocPooler(hidden_size, cls_token_idx) + + def _create_n_head_attn_mask(self, attn_mask, batch_size): + # attn_mask shape: [B, T, 1] + # concat an data_mask, shape: [B, M + T, 1] + data_mask = paddle.concat( + [paddle.ones(shape=[batch_size, self.memory_len, 1], dtype=attn_mask.dtype), attn_mask], axis=1 + ) + data_mask.stop_gradient = True + # create a self_attn_mask, shape: [B, T, M + T] + self_attn_mask = paddle.matmul(attn_mask, data_mask, transpose_y=True) + self_attn_mask = (self_attn_mask - 1) * 1e8 + n_head_self_attn_mask = paddle.stack([self_attn_mask] * self.n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + return n_head_self_attn_mask + + def forward(self, input_ids, memories, token_type_ids, position_ids, attn_mask): + r""" + The ErnieDocModel forward method, overrides the `__call__()` special method. + + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length, 1]. + memories (Tensor): + Pre-computed hidden-states for each layer. + It's data type should be `float32` and has a shape of [num_hidden_layers, batch_size, memory_len, hidden_size]. + token_type_ids (Tensor): + Segment token indices to indicate first and second portions of the inputs. + Indices can be either 0 or 1: + + - 0 corresponds to a **sentence A** token, + - 1 corresponds to a **sentence B** token. + + It's data type should be `int64` and has a shape of [batch_size, sequence_length, 1]. + Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + config.max_position_embeddings - 1]``. Shape as `(batch_sie, num_tokens)` and dtype as `int32` or `int64`. 
+ attn_mask (Tensor): + Mask used in multi-head attention to avoid performing attention on to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + We use whole-word-mask in ERNIE, so the whole word will have the same value. For example, "使用" as a word, + "使" and "用" will have the same value. + Defaults to `None`, which means nothing needed to be prevented attention to. + + Returns: + tuple : Returns tuple (``encoder_output``, ``pooled_output``, ``new_mem``). + + With the fields: + + - `encoder_output` (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + + - `pooled_output` (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + + - `new_mem` (List[Tensor]): + A list of pre-computed hidden-states. The length of the list is `n_layers`. + Each element in the list is a Tensor with dtype `float32` and shape as [batch_size, memory_length, hidden_size]. + + Example: + .. 
code-block:: + + import numpy as np + import paddle + from paddlenlp.transformers import ErnieDocModel + from paddlenlp.transformers import ErnieDocTokenizer + + def get_related_pos(insts, seq_len, memory_len=128): + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + \ + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype('int64').reshape([len(insts), beg, 1]) + + tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh') + model = ErnieDocModel.from_pretrained('ernie-doc-base-zh') + + inputs = tokenizer("欢迎使用百度飞桨!") + inputs = {k:paddle.to_tensor([v + [0] * (128-len(v))]).unsqueeze(-1) for (k, v) in inputs.items()} + + memories = paddle.zeros([12, 1, 128, 768], dtype="float32") + position_ids = paddle.to_tensor(get_related_pos(inputs['input_ids'], 128, 128)) + attn_mask = paddle.ones([1, 128, 1]) + + inputs['memories'] = memories + inputs['position_ids'] = position_ids + inputs['attn_mask'] = attn_mask + + outputs = model(**inputs) + + encoder_output = outputs[0] + pooled_output = outputs[1] + new_mem = outputs[2] + + """ + input_embeddings, position_embeddings, token_embeddings = self.embeddings( + input_ids, token_type_ids, position_ids + ) + + batch_size = input_embeddings.shape[0] + # [B, N, T, M + T] + n_head_self_attn_mask = self._create_n_head_attn_mask(attn_mask, batch_size) + # memories contain n_layer memory whose shape is [B, M, H] + encoder_output, new_mem = self.encoder( + enc_input=input_embeddings, + memories=memories, + rel_pos=position_embeddings, + rel_task=token_embeddings, + attn_mask=n_head_self_attn_mask, + ) + pooled_output = self.pooler(encoder_output) + return encoder_output, pooled_output, new_mem + + +class ErnieDocForSequenceClassification(ErnieDocPretrainedModel): + """ + ErnieDoc Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + + Args: + ernie_doc (:class:`ErnieDocModel`): + An instance of :class:`ErnieDocModel`. + num_classes (int): + The number of classes. + dropout (float, optional) + The dropout ratio of last output. Default to `0.1`. + """ + + def __init__(self, ernie_doc, num_classes, dropout=0.1): + super(ErnieDocForSequenceClassification, self).__init__() + self.ernie_doc = ernie_doc + self.linear = nn.Linear(self.ernie_doc.config["hidden_size"], num_classes) + self.dropout = nn.Dropout(dropout, mode="upscale_in_train") + + def forward(self, input_ids, memories, token_type_ids, position_ids, attn_mask): + r""" + The ErnieDocForSequenceClassification forward method, overrides the `__call__()` special method. + + Args: + input_ids (Tensor): + See :class:`ErnieDocModel`. + memories (Tensor): + See :class:`ErnieDocModel`. + token_type_ids (Tensor): + See :class:`ErnieDocModel`. + position_ids (Tensor): + See :class:`ErnieDocModel`. + attn_mask (Tensor): + See :class:`ErnieDocModel`. + + Returns: + tuple : Returns tuple (`logits`, `mem`). + + With the fields: + + - `logits` (Tensor): + A tensor containing the [CLS] of hidden-states of the model at the output of last layer. + Each Tensor has a data type of `float32` and has a shape of [batch_size, num_classes]. + + - `mem` (List[Tensor]): + A list of pre-computed hidden-states. The length of the list is `n_layers`. + Each element in the list is a Tensor with dtype `float32` and has a shape of + [batch_size, memory_length, hidden_size]. + + Example: + .. 
code-block:: + + import numpy as np + import paddle + from paddlenlp.transformers import ErnieDocForSequenceClassification + from paddlenlp.transformers import ErnieDocTokenizer + + def get_related_pos(insts, seq_len, memory_len=128): + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + \ + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype('int64').reshape([len(insts), beg, 1]) + + tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh') + model = ErnieDocForSequenceClassification.from_pretrained('ernie-doc-base-zh', num_classes=2) + + inputs = tokenizer("欢迎使用百度飞桨!") + inputs = {k:paddle.to_tensor([v + [0] * (128-len(v))]).unsqueeze(-1) for (k, v) in inputs.items()} + + memories = paddle.zeros([12, 1, 128, 768], dtype="float32") + position_ids = paddle.to_tensor(get_related_pos(inputs['input_ids'], 128, 128)) + attn_mask = paddle.ones([1, 128, 1]) + + inputs['memories'] = memories + inputs['position_ids'] = position_ids + inputs['attn_mask'] = attn_mask + + outputs = model(**inputs) + + logits = outputs[0] + mem = outputs[1] + + """ + _, pooled_output, mem = self.ernie_doc(input_ids, memories, token_type_ids, position_ids, attn_mask) + pooled_output = self.dropout(pooled_output) + logits = self.linear(pooled_output) + return logits, mem + + +class ErnieDocForTokenClassification(ErnieDocPretrainedModel): + """ + ErnieDoc Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + + Args: + ernie_doc (:class:`ErnieDocModel`): + An instance of :class:`ErnieDocModel`. + num_classes (int): + The number of classes. + dropout (float, optional) + The dropout ratio of last output. Default to 0.1. + """ + + def __init__(self, ernie_doc, num_classes, dropout=0.1): + super(ErnieDocForTokenClassification, self).__init__() + self.num_classes = num_classes + self.ernie_doc = ernie_doc # allow ernie_doc to be config + self.dropout = nn.Dropout(dropout, mode="upscale_in_train") + self.linear = nn.Linear(self.ernie_doc.config["hidden_size"], num_classes) + + def forward(self, input_ids, memories, token_type_ids, position_ids, attn_mask): + r""" + The ErnieDocForTokenClassification forward method, overrides the `__call__()` special method. + + Args: + input_ids (Tensor): + See :class:`ErnieDocModel`. + memories (Tensor): + See :class:`ErnieDocModel`. + token_type_ids (Tensor): + See :class:`ErnieDocModel`. + Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor): + See :class:`ErnieDocModel`. + attn_mask (Tensor): + See :class:`ErnieDocModel`. + + Returns: + tuple : Returns tuple (`logits`, `mem`). + + With the fields: + + - `logits` (Tensor): + A tensor containing the hidden-states of the model at the output of last layer. + Each Tensor has a data type of `float32` and has a shape of [batch_size, sequence_length, num_classes]. + + - `mem` (List[Tensor]): + A list of pre-computed hidden-states. The length of the list is `n_layers`. + Each element in the list is a Tensor with dtype `float32` and has a shape of + [batch_size, memory_length, hidden_size]. + + Example: + .. 
code-block:: + + import numpy as np + import paddle + from paddlenlp.transformers import ErnieDocForTokenClassification + from paddlenlp.transformers import ErnieDocTokenizer + + def get_related_pos(insts, seq_len, memory_len=128): + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + \ + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype('int64').reshape([len(insts), beg, 1]) + + tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh') + model = ErnieDocForTokenClassification.from_pretrained('ernie-doc-base-zh', num_classes=2) + + inputs = tokenizer("欢迎使用百度飞桨!") + inputs = {k:paddle.to_tensor([v + [0] * (128-len(v))]).unsqueeze(-1) for (k, v) in inputs.items()} + + memories = paddle.zeros([12, 1, 128, 768], dtype="float32") + position_ids = paddle.to_tensor(get_related_pos(inputs['input_ids'], 128, 128)) + attn_mask = paddle.ones([1, 128, 1]) + + inputs['memories'] = memories + inputs['position_ids'] = position_ids + inputs['attn_mask'] = attn_mask + + outputs = model(**inputs) + + logits = outputs[0] + mem = outputs[1] + + """ + sequence_output, _, mem = self.ernie_doc(input_ids, memories, token_type_ids, position_ids, attn_mask) + sequence_output = self.dropout(sequence_output) + logits = self.linear(sequence_output) + return logits, mem + + +class ErnieDocForQuestionAnswering(ErnieDocPretrainedModel): + """ + ErnieDoc Model with a linear layer on top of the hidden-states + output to compute `span_start_logits` and `span_end_logits`, + designed for question-answering tasks like SQuAD. + + Args: + ernie_doc (:class:`ErnieDocModel`): + An instance of :class:`ErnieDocModel`. + dropout (float, optional) + The dropout ratio of last output. Default to 0.1. + """ + + def __init__(self, ernie_doc, dropout=0.1): + super(ErnieDocForQuestionAnswering, self).__init__() + self.ernie_doc = ernie_doc # allow ernie_doc to be config + self.dropout = nn.Dropout(dropout, mode="upscale_in_train") + self.linear = nn.Linear(self.ernie_doc.config["hidden_size"], 2) + + def forward(self, input_ids, memories, token_type_ids, position_ids, attn_mask): + r""" + The ErnieDocForQuestionAnswering forward method, overrides the `__call__()` special method. + + Args: + input_ids (Tensor): + See :class:`ErnieDocModel`. + memories (Tensor): + See :class:`ErnieDocModel`. + token_type_ids (Tensor): + See :class:`ErnieDocModel`. + position_ids (Tensor): + See :class:`ErnieDocModel`. + attn_mask (Tensor): + See :class:`ErnieDocModel`. + + Returns: + tuple : Returns tuple (`start_logits`, `end_logits`, `mem`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `mem` (List[Tensor]): + A list of pre-computed hidden-states. The length of the list is `n_layers`. + Each element in the list is a Tensor with dtype `float32` and has a shape of + [batch_size, memory_length, hidden_size]. + + Example: + .. 
code-block:: + + import numpy as np + import paddle + from paddlenlp.transformers import ErnieDocForQuestionAnswering + from paddlenlp.transformers import ErnieDocTokenizer + + def get_related_pos(insts, seq_len, memory_len=128): + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + \ + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype('int64').reshape([len(insts), beg, 1]) + + tokenizer = ErnieDocTokenizer.from_pretrained('ernie-doc-base-zh') + model = ErnieDocForQuestionAnswering.from_pretrained('ernie-doc-base-zh') + + inputs = tokenizer("欢迎使用百度飞桨!") + inputs = {k:paddle.to_tensor([v + [0] * (128-len(v))]).unsqueeze(-1) for (k, v) in inputs.items()} + + memories = paddle.zeros([12, 1, 128, 768], dtype="float32") + position_ids = paddle.to_tensor(get_related_pos(inputs['input_ids'], 128, 128)) + attn_mask = paddle.ones([1, 128, 1]) + + inputs['memories'] = memories + inputs['position_ids'] = position_ids + inputs['attn_mask'] = attn_mask + + outputs = model(**inputs) + + start_logits = outputs[0] + end_logits = outputs[1] + mem = outputs[2] + + """ + sequence_output, _, mem = self.ernie_doc(input_ids, memories, token_type_ids, position_ids, attn_mask) + sequence_output = self.dropout(sequence_output) + logits = self.linear(sequence_output) + start_logits, end_logits = paddle.transpose(logits, perm=[2, 0, 1]) + return start_logits, end_logits, mem diff --git a/examples/text_classification/ernie_doc/predict.py b/examples/text_classification/ernie_doc/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..9bfa6b626ac7db9f377b7bd75a9d2782556ba91c --- /dev/null +++ b/examples/text_classification/ernie_doc/predict.py @@ -0,0 +1,301 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
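+
+# predict.py runs long-document text classification inference with a pretrained or
+# fine-tuned ERNIE-Doc model. A long document is consumed as a sequence of fixed-length
+# segments: the `memories` returned for one segment are fed back in together with the
+# next segment, so information can flow across segment boundaries. Inference can run on
+# the dygraph model directly, or on a static-graph model exported via --static_mode.
+#
+# Typical recurrence over segments (sketch):
+#     memories = init_memory(batch_size, memory_len, hidden_size, num_hidden_layers)
+#     for segment in segments:
+#         logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask)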
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import ( + ClassifierIterator, + HYPTextPreprocessor, + ImdbTextPreprocessor, + to_json_file, +) +from modeling import ErnieDocForSequenceClassification +from train import init_memory + +from paddlenlp.datasets import load_dataset +from paddlenlp.taskflow.utils import dygraph_mode_guard +from paddlenlp.transformers import ErnieDocBPETokenizer, ErnieDocTokenizer +from paddlenlp.utils.env import PPNLP_HOME +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=16, type=int, + help="Batch size per GPU/CPU for predicting (In static mode, it should be the same as in model training process.)") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", + help="Pretraining or finetuned model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, + help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], + help="Select cpu, gpu devices to train model.") +parser.add_argument("--test_results_file", default="./test_restuls.json", type=str, + help="The file path you would like to save the model outputs on test dataset.") +parser.add_argument("--static_mode", default=False, type=bool, + help="Whether you would like to perform predicting by static model or dynamic model.") +parser.add_argument("--dataset", default="iflytek", choices=["imdb", "iflytek", "thucnews", "hyp"], type=str, + help="The training dataset") +parser.add_argument("--static_path", default=None, type=str, + help="The path which your static model is at or where you want to save after converting.") + +args = parser.parse_args() +# yapf: enable + +DATASET_INFO = { + "imdb": (ErnieDocBPETokenizer, "test", ImdbTextPreprocessor()), + "hyp": (ErnieDocBPETokenizer, "test", HYPTextPreprocessor()), + "iflytek": (ErnieDocTokenizer, "test", None), + "thucnews": (ErnieDocTokenizer, "test", None), +} + + +def predict( + model, test_dataloader, file_path, memories, label_list, static_mode, input_handles=None, output_handles=None +): + label_dict = dict() + if not static_mode: + model.eval() + for _, batch in enumerate(test_dataloader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, _, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, qids])) + probs = nn.functional.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_list[i] for i in idx] + for i, qid in enumerate(qids.numpy().flatten()): + label_dict[str(qid)] = labels[i] + else: + for _, batch in enumerate(test_dataloader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, _, qids, gather_idxs, need_cal_loss = batch + input_handles[0].copy_from_cpu(input_ids.numpy()) + input_handles[1].copy_from_cpu(memories) + input_handles[2].copy_from_cpu(token_type_ids.numpy()) + input_handles[3].copy_from_cpu(position_ids.numpy()) + input_handles[4].copy_from_cpu(attn_mask.numpy()) + model.run() + logits = paddle.to_tensor(output_handles[0].copy_to_cpu()) + memories = 
paddle.to_tensor(output_handles[1].copy_to_cpu()) + logits, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, qids])) + probs = nn.functional.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_list[i] for i in idx] + for i, qid in enumerate(qids.numpy().flatten()): + label_dict[str(qid)] = labels[i] + to_json_file("iflytek", label_dict, file_path) + + +class LongDocClassifier: + def __init__( + self, + model_name_or_path, + trainer_num=1, + rank=0, + batch_size=16, + max_seq_length=512, + memory_len=128, + static_mode=False, + dataset="iflytek", + static_path=None, + ): + self.model_name_or_path = model_name_or_path + self.batch_size = batch_size + self.trainer_num = trainer_num + self.rank = rank + self.max_seq_length = max_seq_length + self.memory_len = memory_len + self.static_mode = static_mode + self.static_path = static_path if static_path else PPNLP_HOME + + tokenizer_class, test_name, preprocess_text_fn = DATASET_INFO[dataset] + self._construct_tokenizer(tokenizer_class) + self._input_preparation(args.dataset, test_name, preprocess_text_fn) + self._construct_model() + if static_mode: + logger.info("Loading the static model from {}".format(self.static_path)) + self._load_static_model() + + def _input_preparation(self, dataset="iflytek", test_name="test", preprocess_text_fn=None): + test_ds = load_dataset("clue", name=dataset, splits=[test_name]) + self.label_list = test_ds.label_list + self.num_classes = len(test_ds.label_list) + self.test_ds_iter = ClassifierIterator( + test_ds, + self.batch_size, + self._tokenizer, + self.trainer_num, + trainer_id=self.rank, + memory_len=self.memory_len, + max_seq_length=self.max_seq_length, + mode="eval", + preprocess_text_fn=preprocess_text_fn, + ) + self.test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + self.test_dataloader.set_batch_generator(self.test_ds_iter, paddle.get_device()) + + def _construct_tokenizer(self, tokenizer_class): + """ + Construct the tokenizer for the predictor. + :return: + """ + tokenizer_instance = tokenizer_class.from_pretrained(self.model_name_or_path) + self._tokenizer = tokenizer_instance + + def _construct_model(self): + """ + Construct the inference model for the predictor + :param model_name_or_path: str + :return: model instance + """ + model_instance = ErnieDocForSequenceClassification.from_pretrained( + self.model_name_or_path, num_classes=self.num_classes + ) + self.model_config = model_instance.ernie_doc.config + self._model = model_instance + + def _load_static_model(self, params_path=None): + """Load static model""" + inference_model_path = os.path.join(self.static_path, "static", "inference") + with dygraph_mode_guard(): + self._construct_model() + if params_path: + state_dict = paddle.load(params_path) + self._model.set_dict(state_dict) + self._construct_input_spec() + self._convert_dygraph_to_static() + + model_file = inference_model_path + ".pdmodel" + params_file = inference_model_path + ".pdiparams" + self._config = paddle.inference.Config(model_file, params_file) + + def _prepare_static_mode(self): + """ + Construct the input data and predictor in the PaddlePaddele static mode. 
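+        Assumes `_load_static_model` has already exported the static model and set
+        `self._config` to point at the resulting `inference.pdmodel` / `inference.pdiparams` files.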
+ """ + place = paddle.get_device() + if place == "cpu": + self._config.disable_gpu() + else: + self._config.enable_use_gpu(100) + self._config.switch_use_feed_fetch_ops(False) + self._config.disable_glog_info() + self.predictor = paddle.inference.create_predictor(self._config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = [self.predictor.get_output_handle(name) for name in self.predictor.get_output_names()] + + def _construct_input_spec(self): + """ + Construct the input spec for the convert dygraph model to static model. + """ + B, T, H, M, N = ( + self.batch_size, + self.max_seq_length, + self.model_config["hidden_size"], + self.memory_len, + self.model_config["num_hidden_layers"], + ) + self._input_spec = [ + paddle.static.InputSpec(shape=[B, T, 1], dtype="int64", name="input_ids"), # input_ids + paddle.static.InputSpec(shape=[N, B, M, H], dtype="float32", name="memories"), # memories + paddle.static.InputSpec(shape=[B, T, 1], dtype="int64", name="token_type_ids"), # token_type_ids + paddle.static.InputSpec(shape=[B, 2 * T + M, 1], dtype="int64", name="position_ids"), # position_ids + paddle.static.InputSpec(shape=[B, T, 1], dtype="float32", name="attn_mask"), # attn_mask + ] + + def _convert_dygraph_to_static(self): + """ + Convert the dygraph model to static model. + """ + assert ( + self._model is not None + ), "The dygraph model must be created before converting the dygraph model to static model." + assert ( + self._input_spec is not None + ), "The input spec must be created before converting the dygraph model to static model." + logger.info("Converting to the inference model cost a little time.") + static_model = paddle.jit.to_static(self._model, input_spec=self._input_spec) + save_path = os.path.join(self.static_path, "static", "inference") + paddle.jit.save(static_model, save_path) + logger.info("The inference model save in the path:{}".format(save_path)) + + def run_model(self, saved_path): + if not self.static_mode: + create_memory = partial( + init_memory, + self.batch_size, + self.memory_len, + self.model_config["hidden_size"], + self.model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + else: + memories = np.zeros( + [ + self.model_config["num_hidden_layers"], + self.batch_size, + self.memory_len, + self.model_config["hidden_size"], + ], + dtype="float32", + ) + file_path = saved_path + if not self.static_mode: + self.input_handles, self.output_handle, self.predictor = None, None, self._model + else: + self._prepare_static_mode() + predict( + self.predictor, + self.test_dataloader, + file_path, + memories, + self.label_list, + self.static_mode, + self.input_handles, + self.output_handle, + ) + + +def do_predict(args): + # Initialize model + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + + predictor = LongDocClassifier( + model_name_or_path=args.model_name_or_path, + rank=rank, + trainer_num=trainer_num, + batch_size=args.batch_size, + max_seq_length=args.max_seq_length, + memory_len=args.memory_length, + static_mode=args.static_mode, + static_path=args.static_path, + ) + predictor.run_model(saved_path=args.test_results_file) + + +if __name__ == "__main__": + do_predict(args) diff --git 
a/examples/text_classification/ernie_doc/train.py b/examples/text_classification/ernie_doc/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ea527016a9978312ef47771e508e55705337d871 --- /dev/null +++ b/examples/text_classification/ernie_doc/train.py @@ -0,0 +1,345 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from collections import defaultdict +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import ( + ClassifierIterator, + HYPTextPreprocessor, + ImdbTextPreprocessor, + to_json_file, +) +from metrics import F1 +from modeling import ErnieDocForSequenceClassification +from paddle.metric import Accuracy +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocBPETokenizer, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=16, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", + help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, + help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=1.5e-4, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=3, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], + help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, + help="Linear warmup proportion over the training process.") +parser.add_argument("--dataset", default="iflytek", choices=["imdb", "iflytek", "thucnews", "hyp"], type=str, + help="The training dataset") +parser.add_argument("--layerwise_decay", default=1.0, type=float, help="Layerwise decay ratio") +parser.add_argument("--max_steps", default=-1, type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", ) +parser.add_argument("--test_results_file", default="./test_restuls.json", type=str, + help="The file path you would like to save the model outputs on test dataset.") + +args = parser.parse_args() +# yapf: enable + +DATASET_INFO = { + "imdb": (ErnieDocBPETokenizer, "test", "test", ImdbTextPreprocessor(), Accuracy()), + "hyp": (ErnieDocBPETokenizer, "dev", "test", HYPTextPreprocessor(), F1()), + "iflytek": (ErnieDocTokenizer, "dev", "test", None, Accuracy()), + "thucnews": (ErnieDocTokenizer, "dev", "test", None, Accuracy()), +} + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return paddle.zeros([n_layers, batch_size, memory_length, d_model], dtype="float32") + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, memories): + model.eval() + losses = [] + # copy the memory + tic_train = time.time() + eval_logging_step = 500 + + probs_dict = defaultdict(list) + label_dict = dict() + for step, batch in enumerate(data_loader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, labels, qids])) + # Need to collect probs for each qid, so use softmax_with_cross_entropy + loss, probs = nn.functional.softmax_with_cross_entropy(logits, labels, return_softmax=True) + losses.append(loss.mean().numpy()) + # Shape: [B, NUM_LABELS] + np_probs = probs.numpy() + # Shape: [B, 1] + np_qids = qids.numpy() + np_labels = labels.numpy().flatten() + for i, qid in enumerate(np_qids.flatten()): + probs_dict[qid].append(np_probs[i]) + label_dict[qid] = np_labels[i] # Same qid share same label. 
+ + if step % eval_logging_step == 0: + logger.info( + "Step %d: loss: %.5f, speed: %.5f steps/s" + % (step, np.mean(losses), eval_logging_step / (time.time() - tic_train)) + ) + tic_train = time.time() + + # Collect predicted labels + preds = [] + labels = [] + for qid, probs in probs_dict.items(): + mean_prob = np.mean(np.array(probs), axis=0) + preds.append(mean_prob) + labels.append(label_dict[qid]) + + preds = paddle.to_tensor(np.array(preds, dtype="float32")) + labels = paddle.to_tensor(np.array(labels, dtype="int64")) + + metric.update(metric.compute(preds, labels)) + acc_or_f1 = metric.accumulate() + logger.info("Eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric.__class__.__name__, acc_or_f1)) + metric.reset() + model.train() + return acc_or_f1 + + +def predict(model, test_dataloader, file_path, memories, label_list): + label_dict = dict() + model.eval() + for _, batch in enumerate(test_dataloader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, _, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, qids])) + probs = nn.functional.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_list[i] for i in idx] + for i, qid in enumerate(qids.numpy().flatten()): + label_dict[str(qid)] = labels[i] + to_json_file("iflytek", label_dict, file_path) + + +def do_train(args): + set_seed(args) + + tokenizer_class, eval_name, test_name, preprocess_text_fn, eval_metric = DATASET_INFO[args.dataset] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + train_ds, eval_ds, test_ds = load_dataset("clue", name=args.dataset, splits=["train", eval_name, test_name]) + num_classes = len(train_ds.label_list) + + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + model = ErnieDocForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + + train_ds_iter = ClassifierIterator( + train_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + preprocess_text_fn=preprocess_text_fn, + ) + eval_ds_iter = ClassifierIterator( + eval_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="eval", + preprocess_text_fn=preprocess_text_fn, + ) + test_ds_iter = ClassifierIterator( + test_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="test", + preprocess_text_fn=preprocess_text_fn, + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + test_dataloader = 
paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_steps = 0 + best_acc = -1 + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + tic_train = time.time() + stop_training = False + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + global_steps += 1 + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idx, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + + logits, labels = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels])) + loss = criterion(logits, labels) * need_cal_loss + mean_loss = loss.mean() + mean_loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + # Rough acc result, not a precise acc + acc = metric.compute(logits, labels) * need_cal_loss + metric.update(acc) + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, acc:%f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + mean_loss, + metric.accumulate(), + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0: + # Evaluate + logger.info("Eval:") + eval_acc = evaluate(model, eval_metric, eval_dataloader, create_memory()) + # Save + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if eval_acc > best_acc: + logger.info("Save best model......") + best_acc = eval_acc + best_model_dir = os.path.join(output_dir, "best_model") + if not os.path.exists(best_model_dir): + os.makedirs(best_model_dir) + model_to_save.save_pretrained(best_model_dir) + tokenizer.save_pretrained(best_model_dir) + + if args.max_steps > 0 and global_steps 
>= args.max_steps: + stop_training = True + break + if stop_training: + break + logger.info("Final test result:") + eval_acc = evaluate(model, eval_metric, eval_dataloader, create_memory()) + logger.info("start predict the test data") + + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + predict(model, test_dataloader, args.file_path, memories, test_ds.label_list) + logger.info("Done Predicting the results has been saved in file: {}".format(args.file_path)) + + +if __name__ == "__main__": + do_train(args) diff --git a/examples/text_classification/pretrained_models/README.md b/examples/text_classification/pretrained_models/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d1351d7803884389302f6a06c4a3af3b2cb22683 --- /dev/null +++ b/examples/text_classification/pretrained_models/README.md @@ -0,0 +1 @@ +[pretrained_models](../../../applications/text_classification) diff --git a/examples/text_classification/rnn/README.md b/examples/text_classification/rnn/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e87390a5b1c911e9cfb218470336b7e77be1d3bd --- /dev/null +++ b/examples/text_classification/rnn/README.md @@ -0,0 +1,316 @@ +# 使用传统Recurrent Neural Networks完成中文文本分类任务 + +文本分类是NLP应用最广的任务之一,可以被应用到多个领域中,包括但不仅限于:情感分析、垃圾邮件识别、商品评价分类... + +情感分析是一个自然语言处理中老生常谈的任务。情感分析的目的是为了找出说话者/作者在某些话题上,或者针对一个文本两极的观点的态度。这个态度或许是他或她的个人判断或是评估,也许是他当时的情感状态(就是说,作者在做出这个言论时的情绪状态),或是作者有意向的情感交流(就是作者想要读者所体验的情绪)。其可以用于数据挖掘、Web 挖掘、文本挖掘和信息检索方面得到了广泛的研究。可通过 [AI开放平台-情感倾向分析](http://ai.baidu.com/tech/nlp_apply/sentiment_classify) 线上体验。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/febb8a1478e34258953e56611ddc76cd20b412fec89845b0a4a2e6b9f8aae774" hspace='10'/> <br /> +</p> + +本项目开源了一系列模型用于进行文本建模,用户可通过参数配置灵活使用。效果上,我们基于开源情感倾向分类数据集ChnSentiCorp对多个模型进行评测。 + +## paddlenlp.seq2vec + +情感分析任务中关键技术是如何将文本表示成一个**携带语义的文本向量**。随着深度学习技术的快速发展,目前常用的文本表示技术有LSTM,GRU,RNN等方法。 +PaddleNLP提供了一系列的文本表示技术,如`seq2vec`模块。 + +[`paddlenlp.seq2vec`](../../../paddlenlp/seq2vec) 模块作用为将输入的序列文本表征成一个语义向量。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/bbf00931c7534ab48a5e7dff5fbc2ba3ff8d459940434628ad21e9195da5d4c6" width = "500" height = "200" hspace='10'/> <br /> +</p> + + +## 模型简介 + +本项目通过调用[seq2vec](../../../paddlenlp/seq2vec/)中内置的模型进行序列建模,完成句子的向量表示。包含最简单的词袋模型和一系列经典的RNN类模型。 + +`seq2vec`模块 + +* 功能是将序列Embedding Tensor(shape是(batch_size, num_token, emb_dim) )转化成文本语义表征Enocded Texts Tensor(shape 是(batch_sie,encoding_size)) +* 提供了`BoWEncoder`,`CNNEncoder`,`GRUEncoder`,`LSTMEncoder`,`RNNEncoder`等模型 + - `BoWEncoder` 是将输入序列Embedding Tensor在num_token维度上叠加,得到文本语义表征Enocded Texts Tensor。 + - `CNNEncoder` 是将输入序列Embedding Tensor进行卷积操作,在对卷积结果进行max_pooling,得到文本语义表征Enocded Texts Tensor。 + - `GRUEncoder` 是对输入序列Embedding Tensor进行GRU运算,在运算结果上进行pooling或者取最后一个step的隐表示,得到文本语义表征Enocded Texts Tensor。 + - `LSTMEncoder` 是对输入序列Embedding Tensor进行LSTM运算,在运算结果上进行pooling或者取最后一个step的隐表示,得到文本语义表征Enocded Texts Tensor。 + - `RNNEncoder` 是对输入序列Embedding Tensor进行RNN运算,在运算结果上进行pooling或者取最后一个step的隐表示,得到文本语义表征Enocded Texts Tensor。 + + +`seq2vec`提供了许多语义表征方法,那么这些方法在什么时候更加适合呢? 
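+在逐一比较之前,先给出一个最小的组网示例,展示 Encoder 在分类网络中的位置(以下词表大小、向量维度等均为演示用的假设值,完整实现请参考本目录下的 `model.py`):
+
+```python
+import paddle
+import paddle.nn as nn
+import paddlenlp as nlp
+
+class BoWTextClassifier(nn.Layer):
+    """演示用的最小网络:Embedding -> BoWEncoder -> 线性分类层。"""
+    def __init__(self, vocab_size, num_classes, emb_dim=128, padding_idx=0):
+        super().__init__()
+        self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx)
+        # BoWEncoder 在 num_tokens 维度上将词向量求和,得到句子表示
+        self.encoder = nlp.seq2vec.BoWEncoder(emb_dim)
+        self.fc = nn.Linear(self.encoder.get_output_dim(), num_classes)
+
+    def forward(self, token_ids):
+        embedded = self.embedder(token_ids)  # [batch_size, num_tokens, emb_dim]
+        text_repr = self.encoder(embedded)   # [batch_size, emb_dim]
+        return self.fc(text_repr)            # [batch_size, num_classes]
+
+model = BoWTextClassifier(vocab_size=1000, num_classes=2)
+logits = model(paddle.randint(low=0, high=1000, shape=[4, 16]))  # 随机 token id,仅作演示
+```
+
+把示例中的 `BoWEncoder` 替换为 `LSTMEncoder`、`GRUEncoder` 等即可得到其余模型(RNN 类 Encoder 还需要额外传入 `sequence_length`)。下面分别来看各个 Encoder 的特点与适用场景: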
+ +* `BoWEncoder`采用Bag of Word Embedding方法,其特点是简单。但其缺点是没有考虑文本的语境,所以对文本语义的表征不足以表意。 + +* `CNNEncoder`采用卷积操作,提取局部特征,其特点是可以共享权重。但其缺点同样只考虑了局部语义,上下文信息没有充分利用。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/2b2498edd83e49d3b017c4a14e1be68506349249b8a24cdaa214755fb51eadcd" width = "300" height = "150" hspace='10'/> <br /> +</p> + +* `RNNEnocder`采用RNN方法,在计算下一个token语义信息时,利用上一个token语义信息作为其输入。但其缺点容易产生梯度消失和梯度爆炸。 + +<p align="center"> +<img src="http://colah.github.io/posts/2015-09-NN-Types-FP/img/RNN-general.png" width = "50%" height = "30%" hspace='10'/> <br /> +</p> + +* `LSTMEnocder`采用LSTM方法,LSTM是RNN的一种变种。为了学到长期依赖关系,LSTM 中引入了门控机制来控制信息的累计速度, + 包括有选择地加入新的信息,并有选择地遗忘之前累计的信息。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/a5af1d93c69f422d963e094397a2f6ce978c30a26ab6480ab70d688dd1929de0" width = "50%" height = "30%" hspace='10'/> <br /> +</p> + +* `GRUEncoder`采用GRU方法,GRU也是RNN的一种变种。一个LSTM单元有四个输入 ,因而参数是RNN的四倍,带来的结果是训练速度慢。 + GRU对LSTM进行了简化,在不影响效果的前提下加快了训练速度。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/fc848bc2cb494b40ae42af892b756f5888770320a1fa42348cec10d3df64ee2f" width = "40%" height = "25%" hspace='10'/> <br /> +</p> + + +| 模型 | 模型介绍 | +| ------------------------------------------------ | ------------------------------------------------------------ | +| BOW(Bag Of Words) | 非序列模型,将句子表示为其所包含词的向量的加和 | +| RNN (Recurrent Neural Network) | 序列模型,能够有效地处理序列信息 | +| GRU(Gated Recurrent Unit) | 序列模型,能够较好地解决序列文本中长距离依赖的问题 | +| LSTM(Long Short Term Memory) | 序列模型,能够较好地解决序列文本中长距离依赖的问题 | +| Bi-LSTM(Bidirectional Long Short Term Memory) | 序列模型,采用双向LSTM结构,更好地捕获句子中的语义特征 | +| Bi-GRU(Bidirectional Gated Recurrent Unit) | 序列模型,采用双向GRU结构,更好地捕获句子中的语义特征 | +| Bi-RNN(Bidirectional Recurrent Neural Network) | 序列模型,采用双向RNN结构,更好地捕获句子中的语义特征 | +| Bi-LSTM Attention | 序列模型,在双向LSTM结构之上加入Attention机制,结合上下文更好地表征句子语义特征 | +| TextCNN | 序列模型,使用多种卷积核大小,提取局部区域地特征 | + + +| 模型 | dev acc | test acc | +| ---- | ------- | -------- | +| BoW | 0.8970 | 0.8908 | +| Bi-LSTM | 0.9098 | 0.8983 | +| Bi-GRU | 0.9014 | 0.8785 | +| Bi-RNN | 0.8649 | 0.8504 | +| Bi-LSTM Attention | 0.8992 | 0.8856 | +| TextCNN | 0.9102 | 0.9107 | + + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/ecf309c20e5347399c55f1e067821daa088842fa46ad49be90de4933753cd3cf" width = "600" height = "200" hspace='10'/> <br /> +</p> + + + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +rnn/ +├── deploy # 部署 +│   └── python +│   └── predict.py # python预测部署示例 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── model.py # 模型组网脚本 +├── predict.py # 模型预测 +├── utils.py # 数据处理工具 +├── train.py # 训练模型主程序入口,包括训练、评估 +└── README.md # 文档说明 +``` + +### 数据准备 + +#### 使用PaddleNLP内置数据集 + +```python +from paddlenlp.datasets import load_dataset + +train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"]) +``` + + +### 模型训练 + +我们以中文情感分类公开数据集ChnSentiCorp为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)验证 + +CPU 启动: + +```shell +python train.py --vocab_path='./vocab.json' \ + --device=cpu \ + --network=bilstm \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir='./checkpoints' +``` + +GPU 启动: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py \ + --vocab_path='./vocab.json' \ + --device=gpu \ + --network=bilstm \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir='./checkpoints' +``` + +XPU 启动: + +```shell +python train.py --vocab_path='./vocab.json' \ + --device=xpu \ + --network=lstm \ + 
--lr=5e-4 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir='./checkpoints' +``` + +MLU 启动: + +```shell +python train.py --vocab_path='./vocab.json' \ + --device=mlu \ + --network=lstm \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=10 \ + --save_dir='./checkpoints' +``` + +Ascend NPU 启动: + +```shell +python train.py --vocab_path='./vocab.json' \ + --device=npu \ + --network=bow \ + --lr=5e-4 \ + --batch_size=32 \ + --epochs=10 \ + --save_dir='./checkpoints' +``` + +以上参数表示: + +* `vocab_path`: 用于保存根据语料库构建的词汇表的文件路径。 +* `device`: 选用什么设备进行训练,可选cpu、gpu、xpu、mlu或者npu。如使用gpu训练则参数gpus指定GPU卡号。目前xpu只支持模型网络设置为lstm,npu只支持模型网络设置为bow。 +* `network`: 模型网络名称,默认为`bilstm`, 可更换为bilstm,bigru,birnn,bow,lstm,rnn,gru,bilstm_attn,cnn等。 +* `lr`: 学习率, 默认为5e-5。 +* `batch_size`: 运行一个batch大小,默认为64。 +* `epochs`: 训练轮次,默认为10。 +* `save_dir`: 训练保存模型的文件路径。 +* `init_from_ckpt`: 恢复模型训练的断点路径。 + + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── 0.pdopt +├── 0.pdparams +├── 1.pdopt +├── 1.pdparams +├── ... +└── final.pdparams +``` + +**NOTE:** + +* 训练脚本中停用词`stopwords`仅仅是示例作用,具体停用词使用需要根据实际应用数据进行选择。 + +* 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/0`即可,程序会自动加载模型参数`checkpoints/0.pdparams`,也会自动加载优化器状态`checkpoints/0.pdopt`。 +* 使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在`output_path`指定路径中。 + 运行方式: + +```shell +python export_model.py --vocab_path=./vocab.json --network=bilstm --params_path=./checkpoints/final.pdparams --output_path=./static_graph_params +``` + +其中`params_path`是指动态图训练保存的参数路径,`output_path`是指静态图参数导出路径。 + +导出模型之后,可以用于部署,deploy/python/predict.py文件提供了python部署预测示例。运行方式: + +```shell +python deploy/python/predict.py --model_file=static_graph_params.pdmodel --params_file=static_graph_params.pdiparams --network=bilstm +``` + +### 模型预测 + +启动预测: + +CPU启动: + +```shell +python predict.py --vocab_path='./vocab.json' \ + --device=cpu \ + --network=bilstm \ + --params_path=checkpoints/final.pdparams +``` + +GPU启动: + +```shell +export CUDA_VISIBLE_DEVICES=0 +python predict.py --vocab_path='./vocab.json' \ + --device=gpu \ + --network=bilstm \ + --params_path='./checkpoints/final.pdparams' +``` + +XPU启动: + +```shell +python predict.py --vocab_path='./vocab.json' \ + --device=xpu \ + --network=lstm \ + --params_path=checkpoints/final.pdparams +``` + +MLU启动: + +```shell +python predict.py --vocab_path='./vocab.json' \ + --device=mlu \ + --network=lstm \ + --params_path=checkpoints/final.pdparams +``` + +Ascend NPU启动: + +```shell +python predict.py --vocab_path='./vocab.json' \ + --device=npu \ + --network=bow \ + --params_path=checkpoints/final.pdparams +``` + +将待预测数据分词完毕后,如以下示例: + +```text +非常不错,服务很好,位于市中心区,交通方便,不过价格也高! +怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片 +作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。 +``` + +处理成模型所需的`Tensor`,如可以直接调用`preprocess_prediction_data`函数既可处理完毕。之后传入`predict`函数即可输出预测结果。 + +如 + +```text +Data: 非常不错,服务很好,位于市中心区,交通方便,不过价格也高! 
Label: negative +Data: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片 Label: negative +Data: 作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。 Label: positive +``` + +## Reference + +关于LSTM、GRU、CNN更多信息参考: + +- https://canvas.stanford.edu/files/1090785/download +- https://colah.github.io/posts/2015-08-Understanding-LSTMs/ +- https://arxiv.org/abs/1412.3555 +- https://arxiv.org/pdf/1506.00019 +- https://arxiv.org/abs/1404.2188 diff --git a/examples/text_classification/rnn/deploy/python/predict.py b/examples/text_classification/rnn/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..bc14155c44b53f2945e9533e4e36a86c7a9774d0 --- /dev/null +++ b/examples/text_classification/rnn/deploy/python/predict.py @@ -0,0 +1,137 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import numpy as np +import paddle +from scipy.special import softmax + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_file", type=str, required=True, default='./static_graph_params.pdmodel', help="The path to model info in static graph.") +parser.add_argument("--params_file", type=str, required=True, default='./static_graph_params.pdiparams', help="The path to parameters in static graph.") +parser.add_argument('--network', choices=['bow', 'lstm', 'bilstm', 'gru', 'bigru', 'rnn', 'birnn', 'bilstm_attn', 'cnn', 'textcnn'], default="bilstm", help="Select which network to train, defaults to bilstm.") +parser.add_argument("--vocab_path", type=str, default="./vocab.json", help="The file path to save vocabulary.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def preprocess_prediction_data(text, tokenizer): + """ + It process the prediction data as the format used as training. + + Args: + text (obj:`str`): The input text. + tokenizer(obj: `paddlenlp.data.JiebaTokenizer`): It use jieba to cut the chinese string. + + Returns: + input_ids (obj: `list[int]`): The word ids of the `text`. + seq_len (obj: `int`): The length of words. 
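+        Example:
+            input_id, seq_len = preprocess_prediction_data("这家酒店的位置很方便", tokenizer)
+            # `input_id` is the list of token ids produced by the Jieba tokenizer,
+            # and `seq_len` is the number of tokens.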
+ """ + input_id = tokenizer.encode(text) + seq_len = len(input_id) + return input_id, seq_len + + +class Predictor(object): + def __init__(self, model_file, params_file, device, max_seq_length): + self.max_seq_length = max_seq_length + + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer, label_map, batch_size=1, network="bilstm"): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `se_len`(sequence length). + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from + :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most of the methods. + Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + examples = [] + for text in data: + input_id, seq_len = preprocess_prediction_data(text, tokenizer) + examples.append((input_id, seq_len)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.vocab.token_to_idx.get("[PAD]", 0)), Stack() # input_id # seq_len + ): fn(samples) + + # Separates data into some batches. + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + results = [] + for batch in batches: + input_ids, seq_lens = batchify_fn(batch) + self.input_handles[0].copy_from_cpu(input_ids) + if network in ["lstm", "bilstm", "gru", "bigru", "rnn", "birnn", "bilstm_attn"]: + self.input_handles[1].copy_from_cpu(seq_lens) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + probs = softmax(logits, axis=1) + print(probs) + idx = np.argmax(probs, axis=1) + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_length) + + # Firstly pre-processing prediction data and then do predict. 
+ data = [ + "非常不错,服务很好,位于市中心区,交通方便,不过价格也高!", + "怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片", + "作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。", + ] + vocab = Vocab.from_json(args.vocab_path) + tokenizer = JiebaTokenizer(vocab) + label_map = {0: "negative", 1: "positive"} + + results = predictor.predict(data, tokenizer, label_map, batch_size=args.batch_size, network=args.network) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/text_classification/rnn/export_model.py b/examples/text_classification/rnn/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..37fbb37ddcb3eb99b9d360c96ed70127759f6e75 --- /dev/null +++ b/examples/text_classification/rnn/export_model.py @@ -0,0 +1,99 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +from model import ( + BiLSTMAttentionModel, + BoWModel, + CNNModel, + GRUModel, + LSTMModel, + RNNModel, + SelfInteractiveAttention, +) + +from paddlenlp.data import Vocab + +# fmt: off +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--vocab_path", type=str, default="./vocab.json", help="The file path to vocabulary.") +parser.add_argument('--network', choices=['bow', 'lstm', 'bilstm', 'gru', 'bigru', 'rnn', 'birnn', 'bilstm_attn', 'cnn'], default="bilstm", help="Select which network to train, defaults to bilstm.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +parser.add_argument("--output_path", type=str, default='./static_graph_params', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# fmt: on + + +def main(): + # Load vocab. + vocab = Vocab.from_json(args.vocab_path) + label_map = {0: "negative", 1: "positive"} + + # Constructs the newtork. 
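+    # The --network choice is mapped onto the model classes imported from model.py below;
+    # recurrent variants (the lstm/gru/rnn families and bilstm_attn) later receive an
+    # extra seq_len InputSpec when the model is converted to static graph.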
+ network = args.network.lower() + vocab_size = len(vocab) + num_classes = len(label_map) + pad_token_id = vocab.to_indices("[PAD]") + if network == "bow": + model = BoWModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "bigru": + model = GRUModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm": + model = LSTMModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm_attn": + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=vocab_size, + lstm_hidden_size=lstm_hidden_size, + num_classes=num_classes, + padding_idx=pad_token_id, + ) + elif network == "birnn": + model = RNNModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "cnn": + model = CNNModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "gru": + model = GRUModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "lstm": + model = LSTMModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "rnn": + model = RNNModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + else: + raise ValueError( + "Unknown network: %s, it must be one of bow, lstm, bilstm, cnn, gru, bigru, rnn, birnn and bilstm_attn." + % network + ) + + # Load model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + model.eval() + + inputs = [paddle.static.InputSpec(shape=[None, None], dtype="int64")] + # Convert to static graph with specific input description + if args.network in ["lstm", "bilstm", "gru", "bigru", "rnn", "birnn", "bilstm_attn"]: + inputs.append(paddle.static.InputSpec(shape=[None], dtype="int64")) # seq_len + + model = paddle.jit.to_static(model, input_spec=inputs) + # Save in static graph model. + paddle.jit.save(model, args.output_path) + + +if __name__ == "__main__": + main() diff --git a/examples/text_classification/rnn/model.py b/examples/text_classification/rnn/model.py new file mode 100644 index 0000000000000000000000000000000000000000..7d2e4950db0bb6efefd247ed36c9b042bdd1811c --- /dev/null +++ b/examples/text_classification/rnn/model.py @@ -0,0 +1,403 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +import paddlenlp as nlp + +INF = 1.0 * 1e12 + + +class BoWModel(nn.Layer): + """ + This class implements the Bag of Words Classification Network model to classify texts. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `BoWEncoder`. 
+ Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + + """ + + def __init__(self, vocab_size, num_classes, emb_dim=128, padding_idx=0, hidden_size=128, fc_hidden_size=96): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.bow_encoder = nlp.seq2vec.BoWEncoder(emb_dim) + self.fc1 = nn.Linear(self.bow_encoder.get_output_dim(), hidden_size) + self.fc2 = nn.Linear(hidden_size, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + + # Shape: (batch_size, embedding_dim) + summed = self.bow_encoder(embedded_text) + encoded_text = paddle.tanh(summed) + + # Shape: (batch_size, hidden_size) + fc1_out = paddle.tanh(self.fc1(encoded_text)) + # Shape: (batch_size, fc_hidden_size) + fc2_out = paddle.tanh(self.fc2(fc1_out)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc2_out) + return logits + + +class LSTMModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + lstm_hidden_size=198, + direction="forward", + lstm_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.lstm_encoder = nlp.seq2vec.LSTMEncoder( + emb_dim, + lstm_hidden_size, + num_layers=lstm_layers, + direction=direction, + dropout=dropout_rate, + pooling_type=pooling_type, + ) + self.fc = nn.Linear(self.lstm_encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, num_tokens, num_directions*lstm_hidden_size) + # num_directions = 2 if direction is 'bidirect' + # if not, num_directions = 1 + text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(text_repr)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits + + +class GRUModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + gru_hidden_size=198, + direction="forward", + gru_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.gru_encoder = nlp.seq2vec.GRUEncoder( + emb_dim, + gru_hidden_size, + num_layers=gru_layers, + direction=direction, + dropout=dropout_rate, + pooling_type=pooling_type, + ) + self.fc = nn.Linear(self.gru_encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, num_tokens, num_directions*gru_hidden_size) + # num_directions = 2 if direction is 'bidirect' + # if not, num_directions = 1 + text_repr = self.gru_encoder(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(text_repr)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return 
logits + + +class RNNModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + rnn_hidden_size=198, + direction="forward", + rnn_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.rnn_encoder = nlp.seq2vec.RNNEncoder( + emb_dim, + rnn_hidden_size, + num_layers=rnn_layers, + direction=direction, + dropout=dropout_rate, + pooling_type=pooling_type, + ) + self.fc = nn.Linear(self.rnn_encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, num_tokens, num_directions*rnn_hidden_size) + # num_directions = 2 if direction is 'bidirect' + # if not, num_directions = 1 + text_repr = self.rnn_encoder(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(text_repr)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits + + +class BiLSTMAttentionModel(nn.Layer): + def __init__( + self, + attention_layer, + vocab_size, + num_classes, + emb_dim=128, + lstm_hidden_size=196, + fc_hidden_size=96, + lstm_layers=1, + dropout_rate=0.0, + padding_idx=0, + ): + super().__init__() + self.padding_idx = padding_idx + + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.bilstm = nn.LSTM( + input_size=emb_dim, + hidden_size=lstm_hidden_size, + num_layers=lstm_layers, + dropout=dropout_rate, + direction="bidirect", + ) + self.attention = attention_layer + if isinstance(attention_layer, SelfAttention): + self.fc = nn.Linear(lstm_hidden_size, fc_hidden_size) + elif isinstance(attention_layer, SelfInteractiveAttention): + self.fc = nn.Linear(lstm_hidden_size * 2, fc_hidden_size) + else: + raise RuntimeError("Unknown attention type %s." % attention_layer.__class__.__name__) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len): + mask = text != self.padding_idx + embedded_text = self.embedder(text) + # Encode text, shape: (batch, max_seq_len, num_directions * hidden_size) + encoded_text, (last_hidden, last_cell) = self.bilstm(embedded_text, sequence_length=seq_len) + # Shape: (batch_size, lstm_hidden_size) + hidden, att_weights = self.attention(encoded_text, mask) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(hidden)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits + + +class SelfAttention(nn.Layer): + """ + A close implementation of attention network of ACL 2016 paper, + Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (Zhou et al., 2016). + ref: https://www.aclweb.org/anthology/P16-2034/ + Args: + hidden_size (int): The number of expected features in the input x. + """ + + def __init__(self, hidden_size): + super().__init__() + self.hidden_size = hidden_size + self.att_weight = self.create_parameter(shape=[1, hidden_size, 1], dtype="float32") + + def forward(self, input, mask=None): + """ + Args: + input (paddle.Tensor) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence. 
+ mask (paddle.Tensor) of shape (batch, seq_len) : + Tensor is a bool tensor, whose each element identifies whether the input word id is pad token or not. + Defaults to `None`. + """ + forward_input, backward_input = paddle.chunk(input, chunks=2, axis=2) + # elementwise-sum forward_x and backward_x + # Shape: (batch_size, max_seq_len, hidden_size) + h = paddle.add_n([forward_input, backward_input]) + # Shape: (batch_size, hidden_size, 1) + att_weight = self.att_weight.tile(repeat_times=(paddle.shape(h)[0], 1, 1)) + # Shape: (batch_size, max_seq_len, 1) + att_score = paddle.bmm(paddle.tanh(h), att_weight) + if mask is not None: + # mask, remove the effect of 'PAD' + mask = paddle.cast(mask, dtype="float32") + mask = mask.unsqueeze(axis=-1) + inf_tensor = paddle.full(shape=mask.shape, dtype="float32", fill_value=-INF) + att_score = paddle.multiply(att_score, mask) + paddle.multiply(inf_tensor, (1 - mask)) + # Shape: (batch_size, max_seq_len, 1) + att_weight = F.softmax(att_score, axis=1) + # Shape: (batch_size, lstm_hidden_size) + reps = paddle.bmm(h.transpose(perm=(0, 2, 1)), att_weight).squeeze(axis=-1) + reps = paddle.tanh(reps) + return reps, att_weight + + +class SelfInteractiveAttention(nn.Layer): + """ + A close implementation of attention network of NAACL 2016 paper, Hierarchical Attention Networks for Document Classification (Yang et al., 2016). + ref: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf + Args: + hidden_size (int): The number of expected features in the input x. + """ + + def __init__(self, hidden_size): + super().__init__() + self.input_weight = self.create_parameter(shape=[1, hidden_size, hidden_size], dtype="float32") + self.bias = self.create_parameter(shape=[1, 1, hidden_size], dtype="float32") + self.att_context_vector = self.create_parameter(shape=[1, hidden_size, 1], dtype="float32") + + def forward(self, input, mask=None): + """ + Args: + input (paddle.Tensor) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence. + mask (paddle.Tensor) of shape (batch, seq_len) : + Tensor is a bool tensor, whose each element identifies whether the input word id is pad token or not. + Defaults to `None + """ + weight = self.input_weight.tile(repeat_times=(paddle.shape(input)[0], 1, 1)) + bias = self.bias.tile(repeat_times=(paddle.shape(input)[0], 1, 1)) + # Shape: (batch_size, max_seq_len, hidden_size) + word_squish = paddle.bmm(input, weight) + bias + + att_context_vector = self.att_context_vector.tile(repeat_times=(paddle.shape(input)[0], 1, 1)) + # Shape: (batch_size, max_seq_len, 1) + att_score = paddle.bmm(word_squish, att_context_vector) + if mask is not None: + # mask, remove the effect of 'PAD' + mask = paddle.cast(mask, dtype="float32") + mask = mask.unsqueeze(axis=-1) + inf_tensor = paddle.full(shape=paddle.shape(mask), dtype="float32", fill_value=-INF) + att_score = paddle.multiply(att_score, mask) + paddle.multiply(inf_tensor, (1 - mask)) + att_weight = F.softmax(att_score, axis=1) + + # Shape: (batch_size, hidden_size) + reps = paddle.bmm(input.transpose(perm=(0, 2, 1)), att_weight).squeeze(-1) + return reps, att_weight + + +class CNNModel(nn.Layer): + """ + This class implements the Convolution Neural Network model. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `CNNEncoder`. + The CNN has one convolution layer for each ngram filter size. 
Each convolution operation gives + out a vector of size num_filter. The number of times a convolution layer will be used + is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these + outputs from the convolution layer and outputs the max. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + + """ + + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + num_filter=128, + ngram_filter_sizes=(3,), + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.encoder = nlp.seq2vec.CNNEncoder( + emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes + ) + self.fc = nn.Linear(self.encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, len(ngram_filter_sizes)*num_filter) + encoder_out = self.encoder(embedded_text) + encoder_out = paddle.tanh(encoder_out) + # Shape: (batch_size, fc_hidden_size) + fc_out = self.fc(encoder_out) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits + + +class TextCNNModel(nn.Layer): + """ + This class implements the Text Convolution Neural Network model. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `CNNEncoder`. + The CNN has one convolution layer for each ngram filter size. Each convolution operation gives + out a vector of size num_filter. The number of times a convolution layer will be used + is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these + outputs from the convolution layer and outputs the max. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + + """ + + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + num_filter=128, + ngram_filter_sizes=(1, 2, 3), + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.encoder = nlp.seq2vec.CNNEncoder( + emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes + ) + self.fc = nn.Linear(self.encoder.get_output_dim(), fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + # Shape: (batch_size, len(ngram_filter_sizes)*num_filter) + encoder_out = self.encoder(embedded_text) + encoder_out = paddle.tanh(encoder_out) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(encoder_out)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + return logits diff --git a/examples/text_classification/rnn/predict.py b/examples/text_classification/rnn/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..843884fd19b30ff7c73b8332bbffcae9659f6dd3 --- /dev/null +++ b/examples/text_classification/rnn/predict.py @@ -0,0 +1,147 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse + +import paddle +import paddle.nn.functional as F +from model import ( + BiLSTMAttentionModel, + BoWModel, + CNNModel, + GRUModel, + LSTMModel, + RNNModel, + SelfInteractiveAttention, +) +from utils import preprocess_prediction_data + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu', 'npu', 'mlu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", type=int, default=1, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./vocab.json", help="The file path to vocabulary.") +parser.add_argument('--network', choices=['bow', 'lstm', 'bilstm', 'gru', 'bigru', 'rnn', 'birnn', 'bilstm_attn', 'cnn'], + default="bilstm", help="Select which network to train, defaults to bilstm.") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data, label_map, batch_size=1, pad_token_id=0): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `se_len`(sequence length). + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_token_id), # input_ids + Stack(dtype="int64"), # seq len + ): [data for data in fn(samples)] + + results = [] + model.eval() + for batch in batches: + texts, seq_lens = batchify_fn(batch) + texts = paddle.to_tensor(texts) + seq_lens = paddle.to_tensor(seq_lens) + logits = model(texts, seq_lens) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device.lower()) + + # Loads vocab. + vocab = Vocab.from_json(args.vocab_path) + label_map = {0: "negative", 1: "positive"} + + # Constructs the newtork. 
+ network = args.network.lower() + vocab_size = len(vocab) + num_classes = len(label_map) + pad_token_id = vocab.to_indices("[PAD]") + if network == "bow": + model = BoWModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "bigru": + model = GRUModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm": + model = LSTMModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm_attn": + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=vocab_size, + lstm_hidden_size=lstm_hidden_size, + num_classes=num_classes, + padding_idx=pad_token_id, + ) + elif network == "birnn": + model = RNNModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "cnn": + model = CNNModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "gru": + model = GRUModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "lstm": + model = LSTMModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "rnn": + model = RNNModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + else: + raise ValueError( + "Unknown network: %s, it must be one of bow, lstm, bilstm, cnn, gru, bigru, rnn, birnn and bilstm_attn." + % network + ) + + # Loads model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. + data = [ + "非常不错,服务很好,位于市中心区,交通方便,不过价格也高!", + "怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片", + "作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。", + ] + tokenizer = JiebaTokenizer(vocab) + examples = preprocess_prediction_data(data, tokenizer) + + results = predict( + model, + examples, + label_map=label_map, + batch_size=args.batch_size, + pad_token_id=vocab.token_to_idx.get("[PAD]", 0), + ) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/text_classification/rnn/train.py b/examples/text_classification/rnn/train.py new file mode 100644 index 0000000000000000000000000000000000000000..1c4c64890131682e03ce237f990dd70a14935398 --- /dev/null +++ b/examples/text_classification/rnn/train.py @@ -0,0 +1,174 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
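+# Typical single-card invocation (flag names follow the argparse definitions below;
+# the values shown are only illustrative defaults):
+#   python train.py --device gpu --network bilstm --lr 5e-5 --batch_size 64 --epochs 15 \
+#       --vocab_path ./vocab.json --save_dir ./checkpoints/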
+import argparse +import random +from functools import partial + +import numpy as np +import paddle +from model import ( + BiLSTMAttentionModel, + BoWModel, + CNNModel, + GRUModel, + LSTMModel, + RNNModel, + SelfInteractiveAttention, +) +from utils import build_vocab, convert_example + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--epochs", type=int, default=15, help="Number of epoches for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu', 'mlu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./vocab.json", help="The file path to save vocabulary.") +parser.add_argument('--network', choices=['bow', 'lstm', 'bilstm', 'gru', 'bigru', 'rnn', 'birnn', 'bilstm_attn', 'cnn'], + default="bilstm", help="Select which network to train, defaults to bilstm.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed=1000): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) + return dataloader + + +if __name__ == "__main__": + paddle.set_device(args.device) + set_seed(1000) + + # Loads dataset. + train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"]) + texts = [] + for data in train_ds: + texts.append(data["text"]) + for data in dev_ds: + texts.append(data["text"]) + + # Reads stop words. + # Stopwords are just for example. + # It should be updated according to the corpus. + stopwords = set(["的", "吗", "吧", "呀", "呜", "呢", "呗"]) + # Builds vocab. + word2idx = build_vocab(texts, stopwords, min_freq=5, unk_token="[UNK]", pad_token="[PAD]") + vocab = Vocab.from_dict(word2idx, unk_token="[UNK]", pad_token="[PAD]") + # Saves vocab. 
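+    # The JSON vocabulary written here (default ./vocab.json) is reloaded later by
+    # predict.py and export_model.py through their --vocab_path arguments.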
+ vocab.to_json(args.vocab_path) + + # Constructs the network. + network = args.network.lower() + vocab_size = len(vocab) + num_classes = len(train_ds.label_list) + pad_token_id = vocab.to_indices("[PAD]") + if network == "bow": + model = BoWModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "bigru": + model = GRUModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm": + model = LSTMModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "bilstm_attn": + lstm_hidden_size = 196 + attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) + model = BiLSTMAttentionModel( + attention_layer=attention, + vocab_size=vocab_size, + lstm_hidden_size=lstm_hidden_size, + num_classes=num_classes, + padding_idx=pad_token_id, + ) + elif network == "birnn": + model = RNNModel(vocab_size, num_classes, direction="bidirect", padding_idx=pad_token_id) + elif network == "cnn": + model = CNNModel(vocab_size, num_classes, padding_idx=pad_token_id) + elif network == "gru": + model = GRUModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "lstm": + model = LSTMModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + elif network == "rnn": + model = RNNModel(vocab_size, num_classes, direction="forward", padding_idx=pad_token_id, pooling_type="max") + else: + raise ValueError( + "Unknown network: %s, it must be one of bow, lstm, bilstm, cnn, gru, bigru, rnn, birnn and bilstm_attn." + % network + ) + model = paddle.Model(model) + + # Reads data and generates mini-batches. + tokenizer = JiebaTokenizer(vocab) + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # input_ids + Stack(dtype="int64"), # seq len + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + train_loader = create_dataloader( + train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn + ) + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Defines loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Loads pre-trained parameters. + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + # Starts training and evaluating. + callback = paddle.callbacks.ProgBarLogger(log_freq=10, verbose=3) + model.fit(train_loader, dev_loader, epochs=args.epochs, save_dir=args.save_dir, callbacks=callback) diff --git a/examples/text_classification/rnn/utils.py b/examples/text_classification/rnn/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..c33d521e4e11f8bcf2ba474881661f3dbf299da7 --- /dev/null +++ b/examples/text_classification/rnn/utils.py @@ -0,0 +1,109 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from collections import defaultdict + +import numpy as np + +from paddlenlp import Taskflow + +word_segmenter = Taskflow("word_segmentation", mode="fast") + + +def convert_example(example, tokenizer, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of token ids. + valid_length(obj:`int`): The input sequence valid length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + + input_ids = tokenizer.encode(example["text"]) + valid_length = np.array(len(input_ids), dtype="int64") + input_ids = np.array(input_ids, dtype="int64") + + if not is_test: + label = np.array(example["label"], dtype="int64") + return input_ids, valid_length, label + else: + return input_ids, valid_length + + +def preprocess_prediction_data(data, tokenizer): + """ + It process the prediction data as the format used as training. + + Args: + data (obj:`List[str]`): The prediction data whose each element is a tokenized text. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + + Returns: + examples (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + + """ + examples = [] + for text in data: + ids = tokenizer.encode(text) + examples.append([ids, len(ids)]) + return examples + + +def build_vocab(texts, stopwords=[], num_words=None, min_freq=10, unk_token="[UNK]", pad_token="[PAD]"): + """ + According to the texts, it is to build vocabulary. + + Args: + texts (obj:`List[str]`): The raw corpus data. + num_words (obj:`int`): the maximum size of vocabulary. + stopwords (obj:`List[str]`): The list where each element is a word that will be + filtered from the texts. + min_freq (obj:`int`): the minimum word frequency of words to be kept. + unk_token (obj:`str`): Special token for unknow token. + pad_token (obj:`str`): Special token for padding token. + + Returns: + word_index (obj:`Dict`): The vocabulary from the corpus data. + + """ + word_counts = defaultdict(int) + for text in texts: + if not text: + continue + for word in word_segmenter(text): + if word in stopwords: + continue + word_counts[word] += 1 + + wcounts = [] + for word, count in word_counts.items(): + if count < min_freq: + continue + wcounts.append((word, count)) + wcounts.sort(key=lambda x: x[1], reverse=True) + # -2 for the pad_token and unk_token which will be added to vocab. 
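+    # For example, with num_words=50002 the 50000 most frequent remaining words are kept,
+    # leaving two slots for pad_token and unk_token below.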
+ if num_words is not None and len(wcounts) > (num_words - 2): + wcounts = wcounts[: (num_words - 2)] + # add the special pad_token and unk_token to the vocabulary + sorted_voc = [pad_token, unk_token] + sorted_voc.extend(wc[0] for wc in wcounts) + word_index = dict(zip(sorted_voc, list(range(len(sorted_voc))))) + return word_index diff --git a/examples/text_correction/ernie-csc/README.md b/examples/text_correction/ernie-csc/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4ff3a8da14417d847c816f0e5d3b85f94de08dde --- /dev/null +++ b/examples/text_correction/ernie-csc/README.md @@ -0,0 +1,163 @@ +# ERNIE for Chinese Spelling Correction + +## 简介 + +中文文本纠错任务是一项NLP基础任务,其输入是一个可能含有语法错误的中文句子,输出是一个正确的中文句子。语法错误类型很多,有多字、少字、错别字等,目前最常见的错误类型是`错别字`。大部分研究工作围绕错别字这一类型进行研究。本文实现了百度在ACL 2021上提出结合拼音特征的Softmask策略的中文错别字纠错的下游任务网络,并提供预训练模型,模型结构如下: + +![image](https://user-images.githubusercontent.com/10826371/131974040-fc84ec04-566f-4310-9839-862bfb27172e.png) + +以下是本项目的简要目录结构及说明: + +```text +. +├── README.md # 文档 +├── download.py # 下载SIGHAN测试集 +├── pinyin_vocab.txt # 拼音字表 +├── predict.py # 预测标准输入的句子 +├── predict_sighan.py # 生成SIGHAN测试集的预测结果 +├── model.py # 纠错模型实现 +├── requirements.txt # 本项目的Python依赖项 +├── run_sighan_predict.sh # 生成训练后模型在SIGHAN测试集的预测结果并输出预测效果 +├── sighan_evaluate.py # 评估模型在SIGHAN测试集上预测效果 +├── train.py # 训练脚本 +└── utils.py # 通用函数工具 +``` + +* 注:论文中暂未开源融合字音特征的预训练模型参数(即MLM-phonetics),所以本文提供的纠错模型是在ERNIE-1.0的参数上进行Finetune,纠错模型结构与论文保持一致。 + +## 安装依赖项 +``` +pip install -r requirements.txt +``` + +## 模型训练 + +### 参数 +- `model_name_or_path` 目前支持的预训练模型有:"ernie-1.0-base-zh"。 +- `max_seq_length` 表示最大句子长度,超过该长度的部分将被切分成下一个样本。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔步数。 +- `save_steps` 表示模型保存及评估间隔步数。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `seed` 表示随机数种子。 +- `weight_decay` 表示AdamW的权重衰减系数。 +- `warmup_proportion` 表示学习率warmup系数。 +- `pinyin_vocab_file_path` 拼音字表路径。默认为当前目录下的`pinyin_vocab.txt`文件。 +- `extra_train_ds_dir` 额外纠错训练集目录。用户可在该目录下提供文件名以`txt`为后缀的纠错数据集文件,以增大训练样本。默认为None。 + +### 训练数据 + +该模型在SIGHAN简体版数据集以及[Automatic Corpus Generation生成的中文纠错数据集](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml)上进行Finetune训练。PaddleNLP已经集成SIGHAN简体版数据集,以下将介绍如何使用Automatic Corpus Generation生成的中文纠错数据集。 + +#### 下载数据集 + +Automatic Corpus Generation生成的中文纠错数据集比较大,下载时间比较长,请耐心等候。运行以下命令完成数据集下载: + +``` +python download.py --data_dir ./extra_train_ds/ --url https://github.com/wdimmy/Automatic-Corpus-Generation/raw/master/corpus/train.sgml +``` + +#### 预处理数据集 + +训练脚本要求训练集文件内容以句子对形式呈现,这里提供一个转换脚本,将Automatic Corpus Generation提供的XML文件转换成句子对形式的文件,运行以下命令: + +``` +python change_sgml_to_txt.py -i extra_train_ds/train.sgml -o extra_train_ds/train.txt +``` + +### 单卡训练 + +```python +python train.py --batch_size 32 --logging_steps 100 --epochs 10 --learning_rate 5e-5 --model_name_or_path ernie-1.0-base-zh --output_dir ./checkpoints/ --extra_train_ds_dir ./extra_train_ds/ --max_seq_length 192 +``` + +### 多卡训练 + +```python +python -m paddle.distributed.launch --gpus "0,1" train.py --batch_size 32 --logging_steps 100 --epochs 10 --learning_rate 5e-5 --model_name_or_path ernie-1.0-base-zh --output_dir ./checkpoints/ --extra_train_ds_dir ./extra_train_ds/ --max_seq_length 192 +``` + +## 模型预测 + +### 预测SIGHAN测试集 + +SIGHAN 13,SIGHAN 14,SIGHAN 15是目前中文错别字纠错任务常用的benchmark数据。由于SIGHAN官方提供的是繁体字数据集,PaddleNLP将提供简体版本的SIGHAN测试数据。以下运行SIGHAN预测脚本: + 
+```shell +sh run_sighan_predict.sh +``` + +该脚本会下载SIGHAN数据集,加载checkpoint的模型参数运行模型,输出SIGHAN测试集的预测结果到predict_sighan文件,并输出预测效果。 + +**预测效果** + +| Metric | SIGHAN 13 | SIGHAN 14 | SIGHAN 15 | +| -------------| --------- | --------- |--------- | +| Detection F1 | 0.8348 | 0.6534 | 0.7464 | +| Correction F1| 0.8217 | 0.6302 | 0.7296 | + +### 预测部署 + +#### 模型导出 + +使用动态图训练结束之后,预测部署需要导出静态图参数,具体做法需要运行模型导出脚本`export_model.py`。以下是脚本参数介绍以及运行方式: + +**参数** +- `params_path` 是指动态图训练保存的参数路径。 +- `output_path` 是指静态图参数导出路径。 +- `pinyin_vocab_file_path` 指拼音表路径。 +- `model_name_or_path` 目前支持的预训练模型有:"ernie-1.0-base-zh"。 + +**运行方式** + +```shell +python export_model.py --params_path checkpoints/best_model.pdparams --output_path ./infer_model/static_graph_params +``` + +其中`checkpoints/best_model.pdparams`是训练过程中保存的参数文件,请更换为实际得到的训练保存路径。 + +#### 预测 + +导出模型之后,可以用于预测部署,predict.py文件提供了python预测部署示例。运行方式: + +```python +python predict.py --model_file infer_model/static_graph_params.pdmodel --params_file infer_model/static_graph_params.pdiparams +``` + +输出如下: +``` +Source: 遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。 +Target: 遇到逆境时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。 +Source: 人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。 +Target: 人生就是如此,经过磨练才能让自己更加茁壮,才能使自己更加乐观。 +``` + +### Taskflow一键预测 +可以使用PaddleNLP提供的Taskflow工具来对输入的文本进行一键纠错,具体使用方法如下: + +```python +from paddlenlp import Taskflow +text_correction = Taskflow("text_correction") +text_correction('遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。') +''' +[{'source': '遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。', + 'target': '遇到逆境时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。', + 'errors': [{'position': 3, 'correction': {'竟': '境'}}]}] +''' + +text_correction('人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。') +''' +[{'source': '人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。', + 'target': '人生就是如此,经过磨练才能让自己更加茁壮,才能使自己更加乐观。', + 'errors': [{'position': 18, 'correction': {'拙': '茁'}}]}] +''' + +``` + + +## 参考文献 +* Ruiqing Zhang, Chao Pang et al. "Correcting Chinese Spelling Errors with Phonetic Pre-training", ACL, 2021 +* DingminWang et al. "A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check", EMNLP, 2018 diff --git a/examples/text_correction/ernie-csc/change_sgml_to_txt.py b/examples/text_correction/ernie-csc/change_sgml_to_txt.py new file mode 100644 index 0000000000000000000000000000000000000000..ae9c063bc084c243779ec87f38313e37afce6a08 --- /dev/null +++ b/examples/text_correction/ernie-csc/change_sgml_to_txt.py @@ -0,0 +1,47 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
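+# main() below rewrites the Automatic Corpus Generation SGML file into the sentence-pair
+# text format expected by the training script (per the README): one example per line,
+# with the raw sentence and its corrected sentence separated by a tab.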
+ +import argparse +import xml.dom.minidom + +parser = argparse.ArgumentParser() +parser.add_argument("--input", "-i", default="train.sgml", type=str) +parser.add_argument("--output", "-o", default="train.txt", type=str) + +args = parser.parse_args() + + +def main(): + with open(args.output, "w", encoding="utf-8") as fw: + with open(args.input, "r", encoding="utf-8") as f: + input_str = f.read() + # Add fake root node <SENTENCES> + input_str = "<SENTENCES>" + input_str + "</SENTENCES>" + dom = xml.dom.minidom.parseString(input_str) + example_nodes = dom.documentElement.getElementsByTagName("SENTENCE") + for example in example_nodes: + raw_text = example.getElementsByTagName("TEXT")[0].childNodes[0].data + correct_text = list(raw_text) + mistakes = example.getElementsByTagName("MISTAKE") + for mistake in mistakes: + loc = int(mistake.getElementsByTagName("LOCATION")[0].childNodes[0].data) - 1 + correction = mistake.getElementsByTagName("CORRECTION")[0].childNodes[0].data + correct_text[loc] = correction + + correct_text = "".join(correct_text) + fw.write("{}\t{}\n".format(raw_text, correct_text)) + + +if __name__ == "__main__": + main() diff --git a/examples/text_correction/ernie-csc/download.py b/examples/text_correction/ernie-csc/download.py new file mode 100644 index 0000000000000000000000000000000000000000..051947db728f5a1905060bb5495ae461328ec3fb --- /dev/null +++ b/examples/text_correction/ernie-csc/download.py @@ -0,0 +1,33 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import sys + +from paddle.utils.download import get_path_from_url + +parser = argparse.ArgumentParser() +parser.add_argument("-d", "--data_dir", help="directory to save data to", type=str, default="./") +parser.add_argument( + "-u", "--url", help="URL of target", type=str, default="https://bj.bcebos.com/paddlenlp/datasets/sighan_test.zip" +) +args = parser.parse_args() + + +def main(): + get_path_from_url(args.url, args.data_dir) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/examples/text_correction/ernie-csc/export_model.py b/examples/text_correction/ernie-csc/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..81aa3fd06ac37ad5edb01a5e4f51639378c3e928 --- /dev/null +++ b/examples/text_correction/ernie-csc/export_model.py @@ -0,0 +1,57 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
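+# Example invocation from this example's README (replace the checkpoint path with the
+# parameters actually saved during training):
+#   python export_model.py --params_path checkpoints/best_model.pdparams --output_path ./infer_model/static_graph_params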
+import argparse + +import paddle +from model import ErnieForCSC +from paddle.static import InputSpec + +from paddlenlp.data import Vocab +from paddlenlp.transformers import ErnieModel + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +parser.add_argument("--output_path", type=str, default='./infer_model/static_graph_params', help="The path of model parameter in static graph to be saved.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-1.0", choices=["ernie-1.0"], help="Pretraining model name or path") +parser.add_argument("--pinyin_vocab_file_path", type=str, default="pinyin_vocab.txt", help="pinyin vocab file path") +args = parser.parse_args() +# yapf: enable + + +def main(): + pinyin_vocab = Vocab.load_vocabulary(args.pinyin_vocab_file_path, unk_token="[UNK]", pad_token="[PAD]") + + ernie = ErnieModel.from_pretrained(args.model_name_or_path) + + model = ErnieForCSC(ernie, pinyin_vocab_size=len(pinyin_vocab), pad_pinyin_id=pinyin_vocab[pinyin_vocab.pad_token]) + + model_dict = paddle.load(args.params_path) + model.set_dict(model_dict) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + InputSpec(shape=[None, None], dtype="int64", name="pinyin_ids"), + ], + ) + + paddle.jit.save(model, args.output_path) + + +if __name__ == "__main__": + main() diff --git a/examples/text_correction/ernie-csc/model.py b/examples/text_correction/ernie-csc/model.py new file mode 100644 index 0000000000000000000000000000000000000000..d303be144158aafc394955d746360db2503cb508 --- /dev/null +++ b/examples/text_correction/ernie-csc/model.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + + +class ErnieForCSC(nn.Layer): + r""" + ErnieForCSC is a model specified for Chinese Spelling Correction task. + + It integrates phonetic features into language model by leveraging the powerful + pre-training and fine-tuning method. + + See more details on https://aclanthology.org/2021.findings-acl.198.pdf. + Args: + ernie (ErnieModel): + An instance of `paddlenlp.transformers.ErnieModel`. + pinyin_vocab_size (int): + The vocab size of pinyin vocab. + pad_pinyin_id (int, optional): + The pad token id of pinyin vocab. Defaults to 0. 
+ """ + + def __init__(self, ernie, pinyin_vocab_size, pad_pinyin_id=0): + super(ErnieForCSC, self).__init__() + self.ernie = ernie + emb_size = self.ernie.config["hidden_size"] + hidden_size = self.ernie.config["hidden_size"] + vocab_size = self.ernie.config["vocab_size"] + + self.pad_token_id = self.ernie.config["pad_token_id"] + self.pinyin_vocab_size = pinyin_vocab_size + self.pad_pinyin_id = pad_pinyin_id + self.pinyin_embeddings = nn.Embedding(self.pinyin_vocab_size, emb_size, padding_idx=pad_pinyin_id) + self.detection_layer = nn.Linear(hidden_size, 2) + self.correction_layer = nn.Linear(hidden_size, vocab_size) + self.softmax = nn.Softmax() + + def forward(self, input_ids, pinyin_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + pinyin_ids (Tensor): + Indices of pinyin tokens of input sequence in the pinyin vocabulary. They are + numerical representations of tokens that build the pinyin input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + token_type_ids (Tensor, optional): + Segment token indices to indicate first and second portions of the inputs. + Indices can be either 0 or 1: + + - 0 corresponds to a **sentence A** token, + - 1 corresponds to a **sentence B** token. + + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + Defaults to None, which means no segment embeddings is added to token embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + config.max_position_embeddings - 1]``. + Defaults to `None`. Shape as `(batch_sie, num_tokens)` and dtype as `int32` or `int64`. + attention_mask (Tensor, optional): + Mask to indicate whether to perform attention on each input token or not. + The values should be either 0 or 1. The attention scores will be set + to **-infinity** for any positions in the mask that are **0**, and will be + **unchanged** for positions that are **1**. + + - **1** for tokens that are **not masked**, + - **0** for tokens that are **masked**. + + It's data type should be `float32` and has a shape of [batch_size, sequence_length]. + Defaults to `None`. + + + Returns: + detection_error_probs (Tensor): + A Tensor of the detection probablity of each tokens. + Shape as `(batch_size, sequence_length, 2)` and dtype as `int`. + + correction_logits (Tensor): + A Tensor of the correction logits of each tokens. + Shape as `(batch_size, sequence_length, vocab_size)` and dtype as `int`. + + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(self.detection_layer.weight.dtype) * -1e9, axis=[1, 2] + ) + + embedding_output = self.ernie.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids + ) + pinyin_embedding_output = self.pinyin_embeddings(pinyin_ids) + + # Detection module aims to detect whether each Chinese charater has spelling error. + detection_outputs = self.ernie.encoder(embedding_output, attention_mask) + # detection_error_probs shape: [B, T, 2]. It indicates the erroneous probablity of each + # word in the sequence from 0 to 1. 
+ detection_error_probs = self.softmax(self.detection_layer(detection_outputs)) + # Correction module aims to correct each potential wrong charater to right charater. + word_pinyin_embedding_output = ( + detection_error_probs[:, :, 0:1] * embedding_output + + detection_error_probs[:, :, 1:2] * pinyin_embedding_output + ) + + correction_outputs = self.ernie.encoder(word_pinyin_embedding_output, attention_mask) + # correction_logits shape: [B, T, V]. It indicates the correct score of each token in vocab + # according to each word in the sequence. + correction_logits = self.correction_layer(correction_outputs) + return detection_error_probs, correction_logits diff --git a/examples/text_correction/ernie-csc/pinyin_vocab.txt b/examples/text_correction/ernie-csc/pinyin_vocab.txt new file mode 100644 index 0000000000000000000000000000000000000000..7cae38b4f7d4b4c04da9206b46aa54bba3d75fad --- /dev/null +++ b/examples/text_correction/ernie-csc/pinyin_vocab.txt @@ -0,0 +1,417 @@ +[PAD] +[UNK] +a +ai +an +ang +ao +ba +bai +ban +bang +bao +bei +ben +beng +bi +bian +biang +biao +bie +bin +bing +bo +bu +ca +cai +can +cang +cao +ce +cei +cen +ceng +cha +chai +chan +chang +chao +che +chen +cheng +chi +chong +chou +chu +chua +chuai +chuan +chuang +chui +chun +chuo +ci +cong +cou +cu +cuan +cui +cun +cuo +da +dai +dan +dang +dao +de +den +deng +di +dian +diao +die +din +ding +diu +dong +dou +du +duan +dui +dun +duo +e +ei +en +eng +er +fa +fan +fang +fei +fen +feng +fiao +fo +fou +fu +ga +gai +gan +gang +gao +ge +gei +gen +geng +gong +gou +gu +gua +guai +guan +guang +gui +gun +guo +ha +hai +han +hang +hao +he +hei +hen +heng +hm +hong +hou +hu +hua +huai +huan +huang +hui +hun +huo +ji +jia +jian +jiang +jiao +jie +jin +jing +jiong +jiu +ju +juan +jue +jun +ka +kai +kan +kang +kao +ke +ken +keng +kong +kou +ku +kua +kuai +kuan +kuang +kui +kun +kuo +la +lai +lan +lang +lao +le +lei +leng +li +lia +lian +liang +liao +lie +lin +ling +liu +lo +long +lou +lu +luan +lun +luo +lv +lve +m +ma +mai +man +mang +mao +me +mei +men +meng +mi +mian +miao +mie +min +ming +miu +mo +mou +mu +n +na +nai +nan +nang +nao +ne +nei +nen +neng +ni +nian +niang +niao +nie +nin +ning +niu +nong +nou +nu +nuan +nun +nuo +nv +nve +o +ou +pa +pai +pan +pang +pao +pei +pen +peng +pi +pian +piao +pie +pin +ping +po +pou +pu +qi +qia +qian +qiang +qiao +qie +qin +qing +qiong +qiu +qu +quan +que +qun +ran +rang +rao +re +ren +reng +ri +rong +rou +ru +rua +ruan +rui +run +ruo +sa +sai +san +sang +sao +se +sen +seng +sha +shai +shan +shang +shao +she +shei +shen +sheng +shi +shou +shu +shua +shuai +shuan +shuang +shui +shun +shuo +si +song +sou +su +suan +sui +sun +suo +ta +tai +tan +tang +tao +te +teng +ti +tian +tiao +tie +ting +tong +tou +tu +tuan +tui +tun +tuo +wa +wai +wan +wang +wei +wen +weng +wo +wong +wu +xi +xia +xian +xiang +xiao +xie +xin +xing +xiong +xiu +xu +xuan +xue +xun +ya +yan +yang +yao +ye +yi +yin +ying +yo +yong +you +yu +yuan +yue +yun +za +zai +zan +zang +zao +ze +zei +zen +zeng +zha +zhai +zhan +zhang +zhao +zhe +zhen +zheng +zhi +zhong +zhou +zhu +zhua +zhuai +zhuan +zhuang +zhui +zhun +zhuo +zi +zong +zou +zu +zuan +zui +zun +zuo diff --git a/examples/text_correction/ernie-csc/predict.py b/examples/text_correction/ernie-csc/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..9e5d7588240203a21f984397879a4e36de924fdd --- /dev/null +++ b/examples/text_correction/ernie-csc/predict.py @@ -0,0 +1,148 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +from functools import partial + +import paddle +from utils import convert_example, parse_decode + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.transformers import ErnieTokenizer + +parser = argparse.ArgumentParser(__doc__) +parser.add_argument( + "--model_file", + type=str, + required=True, + default="./static_graph_params.pdmodel", + help="The path to model info in static graph.", +) +parser.add_argument( + "--params_file", + type=str, + required=True, + default="./static_graph_params.pdiparams", + help="The path to parameters in static graph.", +) +parser.add_argument("--batch_size", type=int, default=2, help="The number of sequences contained in a mini-batch.") +parser.add_argument("--max_seq_len", type=int, default=64, help="Number of words of the longest seqence.") +parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", +) +parser.add_argument("--pinyin_vocab_file_path", type=str, default="pinyin_vocab.txt", help="pinyin vocab file path") + +args = parser.parse_args() + + +class Predictor(object): + def __init__(self, model_file, params_file, device, max_seq_length, tokenizer, pinyin_vocab): + self.max_seq_length = max_seq_length + + config = paddle.inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + config.switch_use_feed_fetch_ops(False) + config.delete_pass("fused_multi_transformer_encoder_pass") + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.det_error_probs_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + self.corr_logits_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[1]) + self.tokenizer = tokenizer + self.pinyin_vocab = pinyin_vocab + + def predict(self, data, batch_size=1): + """ + Predicts the data labels. + + Args: + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + examples = [] + texts = [] + trans_func = partial( + convert_example, + tokenizer=self.tokenizer, + pinyin_vocab=self.pinyin_vocab, + max_seq_length=self.max_seq_length, + is_test=True, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=self.pinyin_vocab.token_to_idx[self.pinyin_vocab.pad_token], dtype="int64"), # pinyin + Stack(axis=0, dtype="int64"), # length + ): [data for data in fn(samples)] + + for text in data: + example = {"source": text.strip()} + input_ids, token_type_ids, pinyin_ids, length = trans_func(example) + examples.append((input_ids, token_type_ids, pinyin_ids, length)) + texts.append(example["source"]) + + batch_examples = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + batch_texts = [texts[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + results = [] + + for examples, texts in zip(batch_examples, batch_texts): + token_ids, token_type_ids, pinyin_ids, length = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(token_ids) + self.input_handles[1].copy_from_cpu(pinyin_ids) + self.predictor.run() + det_error_probs = self.det_error_probs_handle.copy_to_cpu() + corr_logits = self.corr_logits_handle.copy_to_cpu() + + det_pred = det_error_probs.argmax(axis=-1) + char_preds = corr_logits.argmax(axis=-1) + + for i in range(len(length)): + pred_result = parse_decode( + texts[i], char_preds[i], det_pred[i], length[i], self.tokenizer, self.max_seq_length + ) + + results.append("".join(pred_result)) + return results + + +if __name__ == "__main__": + tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0") + pinyin_vocab = Vocab.load_vocabulary(args.pinyin_vocab_file_path, unk_token="[UNK]", pad_token="[PAD]") + predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_len, tokenizer, pinyin_vocab) + + samples = [ + "遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。", + "人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。", + ] + + results = predictor.predict(samples, batch_size=args.batch_size) + for source, target in zip(samples, results): + print("Source:", source) + print("Target:", target) diff --git a/examples/text_correction/ernie-csc/predict_sighan.py b/examples/text_correction/ernie-csc/predict_sighan.py new file mode 100644 index 0000000000000000000000000000000000000000..3816a17533d0168ea7ef2465934571c06d0045a2 --- /dev/null +++ b/examples/text_correction/ernie-csc/predict_sighan.py @@ -0,0 +1,122 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import argparse +from functools import partial + +import paddle +from model import ErnieForCSC +from utils import convert_example, create_dataloader, parse_decode, read_test_ds + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ErnieModel, ErnieTokenizer +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_name_or_path", type=str, default="ernie-1.0", choices=["ernie-1.0"], help="Pretraining model name or path") +parser.add_argument("--ckpt_path", default=None, type=str, help="The model checkpoint path.", ) +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer " "than this will be truncated, sequences shorter will be padded.", ) +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.", ) +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") +parser.add_argument("--pinyin_vocab_file_path", type=str, default="pinyin_vocab.txt", help="pinyin vocab file path") +parser.add_argument("--test_file", type=str, default="test.txt", help="test set file") +parser.add_argument("--predict_file", type=str, default="predict.txt", help="predict result file") + +# yapf: enable +args = parser.parse_args() + + +def write_sighan_result_to_file(args, corr_preds, det_preds, lengths, tokenizer): + with open(args.test_file, "r", encoding="utf-8") as fin: + with open(args.predict_file, "w", encoding="utf-8") as fout: + for i, line in enumerate(fin.readlines()): + ids, words = line.strip("\n").split("\t")[0:2] + ids = ids.split("=")[1][:-1] + pred_result = parse_decode( + words, corr_preds[i], det_preds[i], lengths[i], tokenizer, args.max_seq_length + ) + words = list(words) + pred_result = list(pred_result) + result = ids + if pred_result == words: + result += ", 0" + else: + assert len(pred_result) == len(words), "pred_result: {}, words: {}".format(pred_result, words) + for i, word in enumerate(pred_result): + if word != words[i]: + result += ", {}, {}".format(i + 1, word) + fout.write("{}\n".format(result)) + + +@paddle.no_grad() +def do_predict(args): + paddle.set_device(args.device) + + pinyin_vocab = Vocab.load_vocabulary(args.pinyin_vocab_file_path, unk_token="[UNK]", pad_token="[PAD]") + + tokenizer = ErnieTokenizer.from_pretrained(args.model_name_or_path) + ernie = ErnieModel.from_pretrained(args.model_name_or_path) + + model = ErnieForCSC(ernie, pinyin_vocab_size=len(pinyin_vocab), pad_pinyin_id=pinyin_vocab[pinyin_vocab.pad_token]) + + eval_ds = load_dataset(read_test_ds, data_path=args.test_file, lazy=False) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + pinyin_vocab=pinyin_vocab, + max_seq_length=args.max_seq_length, + is_test=True, + ) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=pinyin_vocab.token_to_idx[pinyin_vocab.pad_token], dtype="int64"), # pinyin + Stack(axis=0, dtype="int64"), # length + ): [data for data in fn(samples)] + + test_data_loader = create_dataloader( + eval_ds, mode="test", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.ckpt_path: + model_dict = 
paddle.load(args.ckpt_path) + model.set_dict(model_dict) + logger.info("Load model from checkpoints: {}".format(args.ckpt_path)) + + model.eval() + corr_preds = [] + det_preds = [] + lengths = [] + for step, batch in enumerate(test_data_loader): + input_ids, token_type_ids, pinyin_ids, length = batch + det_error_probs, corr_logits = model(input_ids, pinyin_ids, token_type_ids) + # corr_logits shape: [B, T, V] + det_pred = det_error_probs.argmax(axis=-1) + det_pred = det_pred.numpy() + + char_preds = corr_logits.argmax(axis=-1) + char_preds = char_preds.numpy() + + length = length.numpy() + + corr_preds += [pred for pred in char_preds] + det_preds += [prob for prob in det_pred] + lengths += [l for l in length] + + write_sighan_result_to_file(args, corr_preds, det_preds, lengths, tokenizer) + + +if __name__ == "__main__": + do_predict(args) diff --git a/examples/text_correction/ernie-csc/requirements.txt b/examples/text_correction/ernie-csc/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..97c0c2339ce30743d0a6f2c5a68c163512b5f351 --- /dev/null +++ b/examples/text_correction/ernie-csc/requirements.txt @@ -0,0 +1 @@ +pypinyin \ No newline at end of file diff --git a/examples/text_correction/ernie-csc/run_sighan_predict.sh b/examples/text_correction/ernie-csc/run_sighan_predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..ef33a20b294779c1bd786dd9203abfa38eee7bd8 --- /dev/null +++ b/examples/text_correction/ernie-csc/run_sighan_predict.sh @@ -0,0 +1,26 @@ +export CUDA_VISIBLE_DEVICES=0 + +model_name_or_path=ernie-1.0 +checkpoints_path=checkpoints +model_file=best_model.pdparams + +# Download SIGHAN test dataset +if [ ! -d "./sighan_test" ]; then + python download.py +fi + +# Predict the test input from sighan13, sighan14, sighan15 +for version in 13 14 15 +do +python predict_sighan.py --model_name_or_path $model_name_or_path \ + --test_file sighan_test/sighan$version/input.txt --batch_size 32 \ + --ckpt_path $checkpoints_path/$model_file \ + --predict_file predict_sighan$version.txt +done + +# Evaluate the prediction result of the model +for version in 13 14 15 +do +echo -e "Sighan$version Performace\n" +python sighan_evaluate.py -p predict_sighan$version.txt -t sighan_test/sighan$version/truth.txt +done \ No newline at end of file diff --git a/examples/text_correction/ernie-csc/sighan_evaluate.py b/examples/text_correction/ernie-csc/sighan_evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..981f83e2279961608cae87317d784edd540bcc31 --- /dev/null +++ b/examples/text_correction/ernie-csc/sighan_evaluate.py @@ -0,0 +1,116 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
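sighan_evaluate.py below compares a prediction file against a truth file line by line. Both use the comma-separated format that predict_sighan.py writes: a sentence id followed by `0` when no error is reported, or by alternating 1-based positions and corrected characters. A small parsing sketch (the sentence id and characters are illustrative, not taken from the real data):

```python
# Format sketch (id is made up for illustration):
#   "<pid>, 0"                   -> sentence judged error-free
#   "<pid>, <pos>, <char>, ..."  -> alternating 1-based positions / corrections
truth_line = "A2-0011-1, 2, 拜"
pred_line = "A2-0011-1, 2, 拜"  # identical to the truth -> detection and correction both hit

def parse(line):
    tokens = line.strip().split(" ")
    pid = tokens[0].strip(",")
    rest = [t.strip(",") for t in tokens[1:]]
    positions = [int(t) for t in rest[0::2]]
    chars = rest[1::2]
    return pid, positions, chars

print(parse(truth_line))  # ('A2-0011-1', [2], ['拜'])
```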
+ +import argparse + +parser = argparse.ArgumentParser() +parser.add_argument("--pred_file", "-p", required=True, type=str, help="") +parser.add_argument("--truth_file", "-t", required=True, type=str, help="") +args = parser.parse_args() + + +def main(args): + detect_tp, correct_tp, pos, neg, fp = 0, 0, 0, 0, 0 + + pred_dict = dict() + truth_dict = dict() + fpred = open(args.pred_file, "r", encoding="utf-8") + ftruth = open(args.truth_file, "r", encoding="utf-8") + for idx, (pred, truth) in enumerate(zip(fpred, ftruth)): + pred_tokens = pred.strip().split(" ") + truth_tokens = truth.strip().split(" ") + + pred_id = pred_tokens[0] + truth_id = truth_tokens[0] + + pred_tokens = pred_tokens[1:] + truth_tokens = truth_tokens[1:] + + detect_truth_positions = [ + int(truth_token.strip(",")) for i, truth_token in enumerate(truth_tokens) if i % 2 == 0 + ] + correct_truth_tokens = [truth_token.strip(",") for i, truth_token in enumerate(truth_tokens) if i % 2 == 1] + detect_pred_positions = [int(pred_token.strip(",")) for i, pred_token in enumerate(pred_tokens) if i % 2 == 0] + correct_pred_tokens = [pred_token.strip(",") for i, pred_token in enumerate(pred_tokens) if i % 2 == 1] + + pred_dict[pred_id] = (detect_pred_positions, correct_pred_tokens) + truth_dict[truth_id] = (detect_truth_positions, correct_truth_tokens) + + assert sorted(pred_dict.keys()) == sorted( + truth_dict.keys() + ), "Prediction file should have all prediction result in truth file" + + for pid, predition in pred_dict.items(): + truth = truth_dict[pid] + if predition[0][0] != 0: + pos += 1 + if sorted(zip(*predition)) == sorted(zip(*truth)): + correct_tp += 1 + if truth[0][0] == 0: + fp += 1 + + if truth[0][0] != 0: + if sorted(predition[0]) == sorted(truth[0]): + detect_tp += 1 + neg += 1 + + eps = 1e-9 + + # Detection level + detect_pos = detect_tp + fp + if detect_pos > 0 and neg > 0: + detect_precision = detect_tp * 1.0 / detect_pos + detect_recall = detect_tp * 1.0 / neg + detect_f1 = 2.0 * detect_precision * detect_recall / (detect_precision + detect_recall + eps) + else: + detect_precision = 0 + detect_recall = 0 + detect_f1 = 0 + + # Correction level + correct_pos = correct_tp + fp + if correct_pos > 0 and neg > 0: + correct_precision = correct_tp * 1.0 / correct_pos + correct_recall = correct_tp * 1.0 / neg + correct_f1 = 2.0 * correct_precision * correct_recall / (correct_precision + correct_recall + eps) + else: + correct_precision = 0 + correct_recall = 0 + correct_f1 = 0 + + print("==========================================================") + print("Overall Performance") + print("==========================================================") + print("\nDetection Level") + print("\tPrecision = {:.4f} ({}/{})".format(detect_precision, detect_tp, detect_pos)) + print("\tRecall = {:.4f} ({}/{})".format(detect_recall, detect_tp, neg)) + print( + "\tF1-Score = {:.4f} ((2*{:.4f}*{:.4f})/({:.4f}+{:.4f}))".format( + detect_f1, detect_precision, detect_recall, detect_precision, detect_recall + ) + ) + + print("\nCorrection Level") + print("\tPrecision = {:.4f} ({}/{})".format(correct_precision, correct_tp, correct_pos)) + print("\tRecall = {:.4f} ({}/{})".format(correct_recall, correct_tp, neg)) + print( + "\tF1-Score = {:.4f} ((2*{:.4f}*{:.4f})/({:.4f}+{:.4f}))".format( + correct_f1, correct_precision, correct_recall, correct_precision, correct_recall + ) + ) + print("==========================================================\n") + + +if __name__ == "__main__": + main(args) diff --git 
a/examples/text_correction/ernie-csc/train.py b/examples/text_correction/ernie-csc/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e554ef8d2327774914afa4371aca1052721c7437 --- /dev/null +++ b/examples/text_correction/ernie-csc/train.py @@ -0,0 +1,207 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from model import ErnieForCSC +from utils import convert_example, create_dataloader, read_train_ds + +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import MapDataset, load_dataset +from paddlenlp.metrics import CorrectionF1, DetectionF1 +from paddlenlp.transformers import ErnieModel, ErnieTokenizer, LinearDecayWithWarmup +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-1.0-base-zh", choices=["ernie-1.0-base-zh"], help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=128, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=3, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",) +parser.add_argument("--pinyin_vocab_file_path", type=str, default="pinyin_vocab.txt", help="pinyin vocab file path") +parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") +parser.add_argument("--ignore_label", default=-1, type=int, help="Ignore label for CrossEntropyLoss") +parser.add_argument("--extra_train_ds_dir", default=None, type=str, help="The directory of extra train dataset.") + +# yapf: enable +args = parser.parse_args() + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, eval_data_loader): + model.eval() + det_metric = DetectionF1() + corr_metric = CorrectionF1() + for step, batch in enumerate(eval_data_loader, start=1): + input_ids, token_type_ids, pinyin_ids, det_labels, corr_labels, length = batch + # det_error_probs shape: [B, T, 2] + # corr_logits shape: [B, T, V] + det_error_probs, corr_logits = model(input_ids, pinyin_ids, token_type_ids) + det_metric.update(det_error_probs, det_labels, length) + corr_metric.update(det_error_probs, det_labels, corr_logits, corr_labels, length) + + det_f1, det_precision, det_recall = det_metric.accumulate() + corr_f1, corr_precision, corr_recall = corr_metric.accumulate() + logger.info("Sentence-Level Performance:") + logger.info( + "Detection metric: F1={:.4f}, Recall={:.4f}, Precision={:.4f}".format(det_f1, det_recall, det_precision) + ) + logger.info( + "Correction metric: F1={:.4f}, Recall={:.4f}, Precision={:.4f}".format(corr_f1, corr_recall, corr_precision) + ) + model.train() + return det_f1, corr_f1 + + +def do_train(args): + set_seed(args) + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + pinyin_vocab = Vocab.load_vocabulary(args.pinyin_vocab_file_path, unk_token="[UNK]", pad_token="[PAD]") + + tokenizer = ErnieTokenizer.from_pretrained(args.model_name_or_path) + ernie = ErnieModel.from_pretrained(args.model_name_or_path) + + model = ErnieForCSC(ernie, pinyin_vocab_size=len(pinyin_vocab), pad_pinyin_id=pinyin_vocab[pinyin_vocab.pad_token]) + + train_ds, eval_ds = load_dataset("sighan-cn", splits=["train", "dev"]) + + # Extend current training dataset by providing extra training + # datasets directory. The suffix of dataset file name in extra + # dataset directory has to be ".txt". 
The data format of + # dataset need to be a couple of senteces at every line, such as: + # "城府宫员表示,这是过去三十六小时内第三期强烈的余震。\t政府官员表示,这是过去三十六小时内第三起强烈的余震。\n" + if args.extra_train_ds_dir is not None and os.path.exists(args.extra_train_ds_dir): + data = train_ds.data + data_files = [ + os.path.join(args.extra_train_ds_dir, data_file) + for data_file in os.listdir(args.extra_train_ds_dir) + if data_file.endswith(".txt") + ] + for data_file in data_files: + ds = load_dataset(read_train_ds, data_path=data_file, splits=["train"], lazy=False) + data += ds.data + train_ds = MapDataset(data) + + det_loss_act = paddle.nn.CrossEntropyLoss(ignore_index=args.ignore_label, use_softmax=False) + corr_loss_act = paddle.nn.CrossEntropyLoss(ignore_index=args.ignore_label, reduction="none") + + trans_func = partial( + convert_example, tokenizer=tokenizer, pinyin_vocab=pinyin_vocab, max_seq_length=args.max_seq_length + ) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Pad(axis=0, pad_val=pinyin_vocab.token_to_idx[pinyin_vocab.pad_token]), # pinyin + Pad(axis=0, dtype="int64"), # detection label + Pad(axis=0, dtype="int64"), # correction label + Stack(axis=0, dtype="int64"), # length + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + eval_data_loader = create_dataloader( + eval_ds, mode="eval", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + logger.info("Total training step: {}".format(num_training_steps)) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_steps = 1 + best_f1 = -1 + tic_train = time.time() + for epoch in range(args.epochs): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, pinyin_ids, det_labels, corr_labels, length = batch + det_error_probs, corr_logits = model(input_ids, pinyin_ids, token_type_ids) + # Chinese Spelling Correction has 2 tasks: detection task and correction task. + # Detection task aims to detect whether each Chinese charater has spelling error. + # Correction task aims to correct each potential wrong charater to right charater. + # So we need to minimize detection loss and correction loss simultaneously. 
+ # See more loss design details on https://aclanthology.org/2021.findings-acl.198.pdf + det_loss = det_loss_act(det_error_probs, det_labels) + corr_loss = corr_loss_act(corr_logits, corr_labels) * det_error_probs.max(axis=-1) + loss = (det_loss + corr_loss).mean() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_steps % args.logging_steps == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_steps, epoch, step, loss, args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + if global_steps % args.save_steps == 0: + if paddle.distributed.get_rank() == 0: + logger.info("Eval:") + det_f1, corr_f1 = evaluate(model, eval_data_loader) + f1 = (det_f1 + corr_f1) / 2 + model_file = "model_%d" % global_steps + if f1 > best_f1: + # save best model + paddle.save(model.state_dict(), os.path.join(args.output_dir, "best_model.pdparams")) + logger.info("Save best model at {} step.".format(global_steps)) + best_f1 = f1 + model_file = model_file + "_best" + model_file = model_file + ".pdparams" + paddle.save(model.state_dict(), os.path.join(args.output_dir, model_file)) + logger.info("Save model at {} step.".format(global_steps)) + if args.max_steps > 0 and global_steps >= args.max_steps: + return + global_steps += 1 + + +if __name__ == "__main__": + do_train(args) diff --git a/examples/text_correction/ernie-csc/utils.py b/examples/text_correction/ernie-csc/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..83273deff34b6ccaa712092344ac985111cf4427 --- /dev/null +++ b/examples/text_correction/ernie-csc/utils.py @@ -0,0 +1,116 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
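utils.py below aligns every Chinese character with a pinyin id via pypinyin (the example's only extra dependency, see requirements.txt). A quick sketch of what `convert_example` does with the pinyin strings; the printed values are what `lazy_pinyin` typically returns for this input:

```python
from pypinyin import Style, lazy_pinyin

text = "人生就是如此"
pinyins = lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)
print(pinyins)  # e.g. ['ren2', 'sheng1', 'jiu4', 'shi4', 'ru2', 'ci3']

# convert_example drops the trailing tone digit ([:-1]) and looks the syllable
# up in pinyin_vocab; non-Chinese characters fall back to [UNK].
print([p[:-1] for p in pinyins])  # ['ren', 'sheng', 'jiu', 'shi', 'ru', 'ci']
```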
+ +from pypinyin import lazy_pinyin, Style +import paddle + +from paddlenlp.transformers import is_chinese_char + + +def read_train_ds(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + source, target = line.strip("\n").split("\t")[0:2] + yield {"source": source, "target": target} + + +def read_test_ds(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + ids, words = line.strip("\n").split("\t")[0:2] + yield {"source": words} + + +def convert_example(example, tokenizer, pinyin_vocab, max_seq_length=128, ignore_label=-1, is_test=False): + source = example["source"] + words = list(source) + if len(words) > max_seq_length - 2: + words = words[: max_seq_length - 2] + length = len(words) + words = ["[CLS]"] + words + ["[SEP]"] + input_ids = tokenizer.convert_tokens_to_ids(words) + token_type_ids = [0] * len(input_ids) + + # Use pad token in pinyin emb to map word emb [CLS], [SEP] + pinyins = lazy_pinyin(source, style=Style.TONE3, neutral_tone_with_five=True) + pinyin_ids = [0] + # Align pinyin and chinese char + pinyin_offset = 0 + for i, word in enumerate(words[1:-1]): + pinyin = "[UNK]" if word != "[PAD]" else "[PAD]" + if len(word) == 1 and is_chinese_char(ord(word)): + while pinyin_offset < len(pinyins): + current_pinyin = pinyins[pinyin_offset][:-1] + pinyin_offset += 1 + if current_pinyin in pinyin_vocab: + pinyin = current_pinyin + break + pinyin_ids.append(pinyin_vocab[pinyin]) + + pinyin_ids.append(0) + assert len(input_ids) == len(pinyin_ids), "length of input_ids must be equal to length of pinyin_ids" + + if not is_test: + target = example["target"] + correction_labels = list(target) + if len(correction_labels) > max_seq_length - 2: + correction_labels = correction_labels[: max_seq_length - 2] + correction_labels = tokenizer.convert_tokens_to_ids(correction_labels) + correction_labels = [ignore_label] + correction_labels + [ignore_label] + + detection_labels = [] + for input_id, label in zip(input_ids[1:-1], correction_labels[1:-1]): + detection_label = 0 if input_id == label else 1 + detection_labels += [detection_label] + detection_labels = [ignore_label] + detection_labels + [ignore_label] + return input_ids, token_type_ids, pinyin_ids, detection_labels, correction_labels, length + else: + return input_ids, token_type_ids, pinyin_ids, length + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def parse_decode(words, corr_preds, det_preds, lengths, tokenizer, max_seq_length): + UNK = tokenizer.unk_token + UNK_id = tokenizer.convert_tokens_to_ids(UNK) + + corr_pred = corr_preds[1 : 1 + lengths].tolist() + det_pred = det_preds[1 : 1 + lengths].tolist() + words = list(words) + rest_words = [] + if len(words) > max_seq_length - 2: + rest_words = words[max_seq_length - 2 :] + words = words[: max_seq_length - 2] + pred_result = "" + for j, word in enumerate(words): + candidates = tokenizer.convert_ids_to_tokens(corr_pred[j] if corr_pred[j] < tokenizer.vocab_size else UNK_id) + word_icc = is_chinese_char(ord(word)) + cand_icc = 
is_chinese_char(ord(candidates)) if len(candidates) == 1 else False + if not word_icc or det_pred[j] == 0 or candidates in [UNK, "[PAD]"] or (word_icc and not cand_icc): + pred_result += word + else: + pred_result += candidates.lstrip("##") + pred_result += "".join(rest_words) + return pred_result diff --git a/examples/text_generation/couplet/README.md b/examples/text_generation/couplet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f9f18d0d4f4763b8731a3824051571667d74e92f --- /dev/null +++ b/examples/text_generation/couplet/README.md @@ -0,0 +1,92 @@ +# 使用Seq2Seq模型完成自动对联 + +以下是本范例模型的简要目录结构及说明: + +``` +. +├── README.md # 文档,本文件 +├── args.py # 训练、预测以及模型参数配置程序 +├── data.py # 数据读入程序 +├── train.py # 训练主程序 +├── predict.py # 预测主程序 +└── model.py # 带注意力机制的对联生成程序 +``` + +## 简介 + +Sequence to Sequence (Seq2Seq),使用编码器-解码器(Encoder-Decoder)结构,用编码器将源序列编码成vector,再用解码器将该vector解码为目标序列。Seq2Seq 广泛应用于机器翻译,自动对话机器人,文档摘要自动生成,图片描述自动生成等任务中。 + +本目录包含Seq2Seq的一个经典样例:自动对联生成,带attention机制的文本生成模型。 + + +## 模型概览 + +本模型中,在编码器方面,我们采用了基于LSTM的多层的RNN encoder;在解码器方面,我们使用了带注意力(Attention)机制的RNN decoder,在预测时我们使用Beam Search算法来生对联的下联。 + +## 数据介绍 + +本教程使用[couplet数据集](https://bj.bcebos.com/paddlenlp/datasets/couplet.tar.gz)作为训练语料,该数据集来源于[这个github repo](https://github.com/v-zich/couplet-clean-dataset),其中train_src.tsv及train_tgt.tsv为训练集,dev_src.tsv及dev_tgt.tsv为开发集,test_src.tsv及test_tgt.tsv为测试集。 + +数据集会在调用`paddlenlp.datasets.load_dataset`时自动下载,在linux系统下,数据集会自动下载到`~/.paddlenlp/datasets/Couplet/`目录下 + + +## 模型训练 + +执行以下命令即可训练带有注意力机制的Seq2Seq模型: + +```sh +python train.py \ + --num_layers 2 \ + --hidden_size 512 \ + --batch_size 128 \ + --device gpu \ + --model_path ./couplet_models \ + --max_epoch 20 + +``` + +各参数的具体说明请参阅 `args.py` 。训练程序会在每个epoch训练结束之后,保存一次模型。 + +**NOTE:** 如需恢复模型训练,则`init_from_ckpt`只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=couplet_models/19`即可,程序会自动加载模型参数`couplet_models/19.pdparams`,也会自动加载优化器状态`couplet_models/19.pdopt`。 + +## 模型预测 + +训练完成之后,可以使用保存的模型(由 `--init_from_ckpt` 指定)对测试集进行beam search解码,命令如下: + +```sh +python predict.py \ + --num_layers 2 \ + --hidden_size 512 \ + --batch_size 128 \ + --init_from_ckpt couplet_models/19 \ + --infer_output_file infer_output.txt \ + --beam_size 10 \ + --device gpu + +``` + +各参数的具体说明请参阅 `args.py` ,注意预测时所用模型超参数需和训练时一致。 + +## 生成对联样例 + +上联:崖悬风雨骤 下联:月落水云寒 + +上联:约春章柳下 下联:邀月醉花间 + +上联:箬笠红尘外 下联:扁舟明月中 + +上联:书香醉倒窗前月 下联:烛影摇红梦里人 + +上联:踏雪寻梅求雅趣 下联:临风把酒觅知音 + +上联:未出南阳天下论 下联:先登北斗汉中书 + +上联:朱联妙语千秋颂 下联:赤胆忠心万代传 + +上联:月半举杯圆月下 下联:花间对酒醉花间 + +上联:挥笔如剑倚麓山豪气干云揽月去 下联:落笔似龙飞沧海龙吟破浪乘风来 + +## 参考的开源数据集 + +我们的数据集采用了开源对联数据集[couplet-clean-dataset](https://github.com/v-zich/couplet-clean-dataset),地址:https://github.com/v-zich/couplet-clean-dataset ,该数据集过滤了[couplet-dataset](https://github.com/wb14123/couplet-dataset)(地址:https://github.com/wb14123/couplet-dataset )中的低俗、敏感内容。 diff --git a/examples/text_generation/couplet/args.py b/examples/text_generation/couplet/args.py new file mode 100644 index 0000000000000000000000000000000000000000..179f76f05339b23bab06b12774be39ddc46c7e4a --- /dev/null +++ b/examples/text_generation/couplet/args.py @@ -0,0 +1,50 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + + parser.add_argument("--learning_rate", type=float, default=0.001, help="learning rate for optimizer") + + parser.add_argument("--num_layers", type=int, default=1, help="layers number of encoder and decoder") + + parser.add_argument("--hidden_size", type=int, default=100, help="hidden size of encoder and decoder") + + parser.add_argument("--batch_size", type=int, default=128, help="Batch size of each step") + + parser.add_argument("--max_epoch", type=int, default=50, help="max epoch for the training") + + parser.add_argument("--max_len", type=int, default=50, help="max length for source and target sentence") + + parser.add_argument("--max_grad_norm", type=float, default=5.0, help="max grad norm for global norm clip") + + parser.add_argument("--log_freq", type=int, default=200, help="The frequency to print training logs") + + parser.add_argument("--model_path", type=str, default="model", help="model path for model to save") + + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + + parser.add_argument("--infer_output_file", type=str, default="infer_output", help="file name for inference output") + + parser.add_argument("--beam_size", type=int, default=10, help="file name for inference") + + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference." + ) + + args = parser.parse_args() + return args diff --git a/examples/text_generation/couplet/data.py b/examples/text_generation/couplet/data.py new file mode 100644 index 0000000000000000000000000000000000000000..d8ac9b1ce842688a2573e26c8cd1cab5a8e3cc67 --- /dev/null +++ b/examples/text_generation/couplet/data.py @@ -0,0 +1,65 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
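data.py below reads the couplet dataset, whose `first`/`second` fields store each half of a couplet as `'\x02'`-separated characters; `convert_example` splits on that separator and wraps the resulting token ids with the bos/eos ids. A tiny sketch of that input format (the couplet text is invented):

```python
# Invented example in the couplet dataset's field format.
example = {"first": "上\x02联\x02示\x02例", "second": "下\x02联\x02示\x02例"}

tokens = example["first"].split("\x02")
print(tokens)  # ['上', '联', '示', '例']
# convert_example then maps these tokens to vocabulary ids and prepends/appends
# the bos/eos ids, mirroring the code in data.py below.
```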
+ +from functools import partial + +import numpy as np +import paddle + +from paddlenlp.data import Pad, SamplerHelper, Vocab +from paddlenlp.datasets import load_dataset + + +def convert_example(example, vocab): + bos_id = vocab[vocab.bos_token] + eos_id = vocab[vocab.eos_token] + + source = [bos_id] + vocab.to_indices(example["first"].split("\x02")) + [eos_id] + target = [bos_id] + vocab.to_indices(example["second"].split("\x02")) + [eos_id] + return source, target + + +def create_train_loader(batch_size=128): + train_ds = load_dataset("couplet", splits="train") + vocab = Vocab.load_vocabulary(**train_ds.vocab_info) + pad_id = vocab[vocab.eos_token] + trans_func = partial(convert_example, vocab=vocab) + train_ds = train_ds.map(trans_func, lazy=False) + train_batch_sampler = SamplerHelper(train_ds).shuffle().batch(batch_size=batch_size) + + train_loader = paddle.io.DataLoader( + train_ds, batch_sampler=train_batch_sampler, collate_fn=partial(prepare_input, pad_id=pad_id) + ) + return train_loader, vocab + + +def create_infer_loader(batch_size=128): + test_ds = load_dataset("couplet", splits="test") + vocab = Vocab.load_vocabulary(**test_ds.vocab_info) + pad_id = vocab[vocab.eos_token] + trans_func = partial(convert_example, vocab=vocab) + test_ds = test_ds.map(trans_func, lazy=False) + test_batch_sampler = SamplerHelper(test_ds).batch(batch_size=batch_size) + + test_loader = paddle.io.DataLoader( + test_ds, batch_sampler=test_batch_sampler, collate_fn=partial(prepare_input, pad_id=pad_id) + ) + return test_loader, vocab + + +def prepare_input(insts, pad_id): + src, src_length = Pad(pad_val=pad_id, ret_length=True)([inst[0] for inst in insts]) + tgt, tgt_length = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([inst[1] for inst in insts]) + tgt_mask = (tgt[:, :-1] != pad_id).astype(paddle.get_default_dtype()) + return src, src_length, tgt[:, :-1], tgt[:, 1:, np.newaxis], tgt_mask diff --git a/examples/text_generation/couplet/model.py b/examples/text_generation/couplet/model.py new file mode 100644 index 0000000000000000000000000000000000000000..533477b1dfc98e610f27e532834bc9c5b3bb2952 --- /dev/null +++ b/examples/text_generation/couplet/model.py @@ -0,0 +1,198 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
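model.py below defines the attention-based Seq2Seq model together with `CrossEntropyCriterion`, which masks out padded target positions before reducing the loss. A NumPy sketch of that masked reduction with invented per-token losses:

```python
import numpy as np

token_loss = np.array([[0.5, 0.7, 0.9],
                       [0.4, 0.6, 0.8]])   # invented per-token cross-entropy, [batch, time]
trg_mask = np.array([[1.0, 1.0, 0.0],
                     [1.0, 1.0, 1.0]])     # first sample has one padded target position

# Same reduction order as CrossEntropyCriterion below: mask, mean over the
# batch dimension, then sum over time steps.
masked = token_loss * trg_mask
print(masked.mean(axis=0).sum())  # ≈ 1.5
```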
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class CrossEntropyCriterion(nn.Layer): + def __init__(self): + super(CrossEntropyCriterion, self).__init__() + + def forward(self, predict, label, trg_mask): + cost = F.cross_entropy(input=predict, label=label, reduction="none", soft_label=False) + cost = paddle.squeeze(cost, axis=[2]) + masked_cost = cost * trg_mask + batch_mean_cost = paddle.mean(masked_cost, axis=[0]) + seq_cost = paddle.sum(batch_mean_cost) + + return seq_cost + + +class Seq2SeqEncoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, num_layers): + super(Seq2SeqEncoder, self).__init__() + self.embedder = nn.Embedding(vocab_size, embed_dim) + + self.lstm = nn.LSTM( + input_size=embed_dim, + hidden_size=hidden_size, + num_layers=num_layers, + dropout=0.2 if num_layers > 1 else 0.0, + ) + + def forward(self, sequence, sequence_length): + inputs = self.embedder(sequence) + encoder_output, encoder_state = self.lstm(inputs, sequence_length=sequence_length) + + return encoder_output, encoder_state + + +class AttentionLayer(nn.Layer): + def __init__(self, hidden_size): + super(AttentionLayer, self).__init__() + self.input_proj = nn.Linear(hidden_size, hidden_size) + self.output_proj = nn.Linear(hidden_size + hidden_size, hidden_size) + + def forward(self, hidden, encoder_output, encoder_padding_mask): + encoder_output = self.input_proj(encoder_output) + attn_scores = paddle.matmul(paddle.unsqueeze(hidden, [1]), encoder_output, transpose_y=True) + + if encoder_padding_mask is not None: + attn_scores = paddle.add(attn_scores, encoder_padding_mask) + + attn_scores = F.softmax(attn_scores) + attn_out = paddle.squeeze(paddle.matmul(attn_scores, encoder_output), [1]) + attn_out = paddle.concat([attn_out, hidden], 1) + attn_out = self.output_proj(attn_out) + return attn_out + + +class Seq2SeqDecoderCell(nn.RNNCellBase): + def __init__(self, num_layers, input_size, hidden_size): + super(Seq2SeqDecoderCell, self).__init__() + self.dropout = nn.Dropout(0.2) + self.lstm_cells = nn.LayerList( + [ + nn.LSTMCell(input_size=input_size + hidden_size if i == 0 else hidden_size, hidden_size=hidden_size) + for i in range(num_layers) + ] + ) + + self.attention_layer = AttentionLayer(hidden_size) + + def forward(self, step_input, states, encoder_output, encoder_padding_mask=None): + lstm_states, input_feed = states + new_lstm_states = [] + step_input = paddle.concat([step_input, input_feed], 1) + for i, lstm_cell in enumerate(self.lstm_cells): + out, new_lstm_state = lstm_cell(step_input, lstm_states[i]) + step_input = self.dropout(out) + new_lstm_states.append(new_lstm_state) + out = self.attention_layer(step_input, encoder_output, encoder_padding_mask) + return out, [new_lstm_states, out] + + +class Seq2SeqDecoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, num_layers): + super(Seq2SeqDecoder, self).__init__() + self.embedder = nn.Embedding(vocab_size, embed_dim) + self.lstm_attention = nn.RNN(Seq2SeqDecoderCell(num_layers, embed_dim, hidden_size)) + self.output_layer = nn.Linear(hidden_size, vocab_size) + + def forward(self, trg, decoder_initial_states, encoder_output, encoder_padding_mask): + inputs = self.embedder(trg) + + decoder_output, _ = self.lstm_attention( + inputs, + initial_states=decoder_initial_states, + encoder_output=encoder_output, + encoder_padding_mask=encoder_padding_mask, + ) + predict = self.output_layer(decoder_output) + + return predict + + +class Seq2SeqAttnModel(nn.Layer): + def __init__(self, 
vocab_size, embed_dim, hidden_size, num_layers, eos_id=1): + super(Seq2SeqAttnModel, self).__init__() + self.hidden_size = hidden_size + self.eos_id = eos_id + self.num_layers = num_layers + self.INF = 1e9 + self.encoder = Seq2SeqEncoder(vocab_size, embed_dim, hidden_size, num_layers) + self.decoder = Seq2SeqDecoder(vocab_size, embed_dim, hidden_size, num_layers) + + def forward(self, src, src_length, trg): + encoder_output, encoder_final_state = self.encoder(src, src_length) + + # Transfer shape of encoder_final_states to [num_layers, 2, batch_size, hidden_size] + encoder_final_states = [(encoder_final_state[0][i], encoder_final_state[1][i]) for i in range(self.num_layers)] + # Construct decoder initial states: use input_feed and the shape is + # [[h,c] * num_layers, input_feed], consistent with Seq2SeqDecoderCell.states + decoder_initial_states = [ + encoder_final_states, + self.decoder.lstm_attention.cell.get_initial_states(batch_ref=encoder_output, shape=[self.hidden_size]), + ] + # Build attention mask to avoid paying attention on padddings + src_mask = (src != self.eos_id).astype(paddle.get_default_dtype()) + encoder_padding_mask = (src_mask - 1.0) * self.INF + encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1]) + + predict = self.decoder(trg, decoder_initial_states, encoder_output, encoder_padding_mask) + + return predict + + +class Seq2SeqAttnInferModel(Seq2SeqAttnModel): + def __init__( + self, vocab_size, embed_dim, hidden_size, num_layers, bos_id=0, eos_id=1, beam_size=4, max_out_len=256 + ): + self.bos_id = bos_id + self.beam_size = beam_size + self.max_out_len = max_out_len + self.num_layers = num_layers + super(Seq2SeqAttnInferModel, self).__init__(vocab_size, embed_dim, hidden_size, num_layers, eos_id) + + # Dynamic decoder for inference + self.beam_search_decoder = nn.BeamSearchDecoder( + self.decoder.lstm_attention.cell, + start_token=bos_id, + end_token=eos_id, + beam_size=beam_size, + embedding_fn=self.decoder.embedder, + output_fn=self.decoder.output_layer, + ) + + def forward(self, src, src_length): + encoder_output, encoder_final_state = self.encoder(src, src_length) + + encoder_final_state = [(encoder_final_state[0][i], encoder_final_state[1][i]) for i in range(self.num_layers)] + + # Initial decoder initial states + decoder_initial_states = [ + encoder_final_state, + self.decoder.lstm_attention.cell.get_initial_states(batch_ref=encoder_output, shape=[self.hidden_size]), + ] + # Build attention mask to avoid paying attention on paddings + src_mask = (src != self.eos_id).astype(paddle.get_default_dtype()) + + encoder_padding_mask = (src_mask - 1.0) * self.INF + encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1]) + + # Tile the batch dimension with beam_size + encoder_output = nn.BeamSearchDecoder.tile_beam_merge_with_batch(encoder_output, self.beam_size) + encoder_padding_mask = nn.BeamSearchDecoder.tile_beam_merge_with_batch(encoder_padding_mask, self.beam_size) + + # Dynamic decoding with beam search + seq_output, _ = nn.dynamic_decode( + decoder=self.beam_search_decoder, + inits=decoder_initial_states, + max_step_num=self.max_out_len, + encoder_output=encoder_output, + encoder_padding_mask=encoder_padding_mask, + ) + return seq_output diff --git a/examples/text_generation/couplet/predict.py b/examples/text_generation/couplet/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..307af6f1233df1055f792c9605b2316fef08ce9c --- /dev/null +++ b/examples/text_generation/couplet/predict.py @@ -0,0 +1,83 @@ +# Copyright 
(c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import io + +import numpy as np +import paddle +from args import parse_args +from data import create_infer_loader +from model import Seq2SeqAttnInferModel + + +def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + +def do_predict(args): + paddle.set_device(args.device) + + test_loader, vocab = create_infer_loader(args.batch_size) + vocab_size = len(vocab) + bos_id = vocab[vocab.bos_token] + eos_id = vocab[vocab.eos_token] + trg_idx2word = vocab.idx_to_token + + model = paddle.Model( + Seq2SeqAttnInferModel( + vocab_size, + args.hidden_size, + args.hidden_size, + args.num_layers, + bos_id=bos_id, + eos_id=eos_id, + beam_size=args.beam_size, + max_out_len=256, + ) + ) + + model.prepare() + + # Load the trained model + assert args.init_from_ckpt, "Please set reload_model to load the infer model." + model.load(args.init_from_ckpt) + + # TODO(guosheng): use model.predict when support variant length + with io.open(args.infer_output_file, "w", encoding="utf-8") as f: + for data in test_loader(): + inputs = data[:2] + finished_seq = model.predict_batch(inputs=list(inputs))[0] + finished_seq = finished_seq[:, :, np.newaxis] if len(finished_seq.shape) == 2 else finished_seq + finished_seq = np.transpose(finished_seq, [0, 2, 1]) + for ins in finished_seq: + for beam_idx, beam in enumerate(ins): + id_list = post_process_seq(beam, bos_id, eos_id) + word_list = [trg_idx2word[id] for id in id_list] + sequence = "\x02".join(word_list) + "\n" + f.write(sequence) + break + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/examples/text_generation/couplet/train.py b/examples/text_generation/couplet/train.py new file mode 100644 index 0000000000000000000000000000000000000000..5e1914ea247d92e5d6d3182c5f3aaeecd3c54043 --- /dev/null +++ b/examples/text_generation/couplet/train.py @@ -0,0 +1,50 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
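train.py below trains the model through the high-level `paddle.Model` API and tracks `Perplexity`, which is simply the exponential of the average per-token cross-entropy. A two-line sketch with invented numbers:

```python
import math

token_nll = [2.1, 1.7, 2.4, 1.9]  # invented per-token negative log-likelihoods
print(math.exp(sum(token_nll) / len(token_nll)))  # ≈ 7.58
```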
+ +import paddle +from args import parse_args +from data import create_train_loader +from model import CrossEntropyCriterion, Seq2SeqAttnModel + +from paddlenlp.metrics import Perplexity + + +def do_train(args): + paddle.set_device(args.device) + + # Define dataloader + train_loader, vocab = create_train_loader(args.batch_size) + vocab_size = len(vocab) + pad_id = vocab[vocab.eos_token] + + model = paddle.Model(Seq2SeqAttnModel(vocab_size, args.hidden_size, args.hidden_size, args.num_layers, pad_id)) + + optimizer = paddle.optimizer.Adam(learning_rate=args.learning_rate, parameters=model.parameters()) + ppl_metric = Perplexity() + model.prepare(optimizer, CrossEntropyCriterion(), ppl_metric) + + print(args) + model.fit( + train_data=train_loader, + epochs=args.max_epoch, + eval_freq=1, + save_freq=1, + save_dir=args.model_path, + log_freq=args.log_freq, + ) + + +if __name__ == "__main__": + args = parse_args() + do_train(args) diff --git a/examples/text_generation/ctrl/README.md b/examples/text_generation/ctrl/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e68bed08e0776fc1a928b2c9c54713bde9dc77bb --- /dev/null +++ b/examples/text_generation/ctrl/README.md @@ -0,0 +1,59 @@ +# [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/pdf/1909.05858.pdf) + +## 摘要 +大规模语言模型显示出很有前景的文本生成能力,但用户无法轻松控制生成文本的特定方面。我们发布了CTRL,一个包含 16.3 亿个参数的条件转换器语言模型,经过训练以调节控制样式、内容和特定任务行为的控制代码。 控制代码源自与原始文本自然共同出现的结构,保留了无监督学习的优势,同时对文本生成提供了更明确的控制。 这些代码还允许CTRL预测训练数据的哪些部分最有可能给定序列。 这提供了一种通过基于模型的来源归因分析大量数据的潜在方法。 我们在 https://github.com/salesforce/ctrl 上发布了多个全尺寸、预训练版本的CTRL。 + +## 文本生成测试 +```sh +python demo.py +``` +模型生成使用到的参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。 +- `max_predict_len` 表示最大生成的句子长度。 +- `repetition_penalty` 表示生成重复token的惩罚参数,详细信息可查看[这篇论文](https://arxiv.org/pdf/1909.05858.pdf)。 + +## 生成结果样例 + +``` +input text: Diet English : I lost 10 kgs! ; German : +output text: Ich habe zehn Kilogramm abgenommen! + + Als ich das erste Mal mit meinem Smartphone war, war es ein wenig schwierig zu finden, wo man die App herunterladen kann. Aber jetzt ist sie da. + + Das Smartphone hat mich auch sehr beeindruckt. Es machte mir viel Spaß. Und so funktioniert mein Leben heute ganz einfach und ohne große Probleme. + + Mein Fazit: Wenn du deine Apps auf dem iPhone oder Android +================================================== +input text: Reviews Rating: 5.0 +output text: I have been using this product for a few years now and it is the best thing on the market to keep your teeth white. It does not taste bad at all like some of these other products do. The only problem with this product is that you need to use it every day or else they will start coming back in after about 2 weeks. But if you do that, then it's worth it. You can also buy them from Amazon but shipping takes forever. So just make sure you order enough so you don't run out. + Rating: 5.0 + This stuff works great. My dentist recommended it, and I'm glad he did. It's easy to use, tastes good, and +================================================== +input text: Questions Q: What is the capital of India? +output text: A: mumbai. + Q: Who was a British politician who served as Prime Minister from 1922 to 1924? + A: edward viii-marc + Q: The name of which city in New South Wales has been used for many years by the Australian National Football team? + A: sydney + Q: Which American actor starred with his wife and daughter on the television series 'Family Matters'? 
+ A: james coburn + Q: In what year did the first edition of this book appear? + A: 1962 + Q: How long does it take to make one pound of sausage? +================================================== +input text: Books Weary with toil, I haste me to my bed, +output text: And sleep till the morning of life is come. + The sun has risen and his beams are bright, + But still he shines upon a world forlorn; + He sees no more its joys or griefs below, + Nor hears their murmur as they pass below. + My heart grows weary for the world's delight, + For all that makes it dear in human eyes; + It feels like one who wanders through an empty land, + With nothing left but desolation there. + O God! how long shall this be mine abode, + Where every joy hath passed away from me? + How long, O God, must I thus wander here, + In sorrow +================================================== +``` diff --git a/examples/text_generation/ctrl/demo.py b/examples/text_generation/ctrl/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..534531936c2a486c7db80832597fb148fae35d04 --- /dev/null +++ b/examples/text_generation/ctrl/demo.py @@ -0,0 +1,43 @@ +import paddle +from paddlenlp.transformers import CTRLLMHeadModel, CTRLTokenizer + + +class Demo: + def __init__(self, model_name_or_path="ctrl", max_predict_len=128, repetition_penalty=1.2): + self.tokenizer = CTRLTokenizer.from_pretrained(model_name_or_path) + print("Loading the model parameters, please wait...") + self.model = CTRLLMHeadModel.from_pretrained(model_name_or_path) + self.model.eval() + self.max_predict_len = max_predict_len + self.repetition_penalty = repetition_penalty + print("Model loaded.") + + # prediction function + @paddle.no_grad() + def generate(self, inputs, max_predict_len=None, repetition_penalty=None): + max_predict_len = max_predict_len if max_predict_len is not None else self.max_predict_len + repetition_penalty = repetition_penalty if repetition_penalty is not None else self.repetition_penalty + + ids = self.tokenizer(inputs)["input_ids"] + input_ids = paddle.to_tensor([ids], dtype="int64") + max_length = max(self.max_predict_len - input_ids.shape[1], 20) + outputs = self.model.generate(input_ids, max_length=max_length, repetition_penalty=self.repetition_penalty)[0][ + 0 + ] + decode_outputs = self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(outputs.cpu())) + + print(f"input text: {inputs}") + print(f"output text: {decode_outputs}") + print("=" * 50) + + +if __name__ == "__main__": + demo = Demo(model_name_or_path="ctrl", max_predict_len=128, repetition_penalty=1.2) + input_text_list = [ + "Diet English : I lost 10 kgs! 
; German : ", + "Reviews Rating: 5.0", + "Questions Q: What is the capital of India?", + "Books Weary with toil, I haste me to my bed,", + ] + for text in input_text_list: + demo.generate(text) diff --git a/examples/text_generation/ernie-gen b/examples/text_generation/ernie-gen new file mode 100644 index 0000000000000000000000000000000000000000..9fcf590439dda626e45d037afc967fc0f80b98d6 --- /dev/null +++ b/examples/text_generation/ernie-gen @@ -0,0 +1 @@ +../../model_zoo/ernie-gen/ \ No newline at end of file diff --git a/examples/text_generation/opt/README.md b/examples/text_generation/opt/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5066072849447e63d3350057156d6cddf751a151 --- /dev/null +++ b/examples/text_generation/opt/README.md @@ -0,0 +1,62 @@ +# [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/pdf/1909.05858.pdf) + +## 摘要 + +Meta AI 实验室高调宣布,将开放自己的 OPT(Open Pretrained Transformer,预训练变换模型)预训练模型,并贡献出所有代码,此模型对标GPT3,从模型性能、多个下有任务以及小样本中都取得了与GPT-3可比的成绩,PaddleNLP也是及时接入此模型,各位开发者只需要简单的调用即可使用此大模型。 + +## 文本生成测试 +```sh +python demo.py +``` +模型生成使用到的参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。 +- `max_predict_len` 表示最大生成的句子长度。 + +## 生成结果样例 + +``` +input text: +Question:If x is 2 and y is 5, what is x+y? +Answer: 7 + +Question: if x is 12 and y is 9, what is x+y? +Answer:21 + +Question: if x is 3 and y is 4, what is x+y? + +output text: +Answer:7 + +Question: if x is +================================================== +input text: +a chat between a curious human and Statue of Liberty. +Human: What is your name? +Statue: I am statue of liberty. + +Human: where do you live? +Statue: New york city. + +Human: how long have you lived there? + +output text: +Statue: I have lived here for a long +================================================== +input text: +Chinese: 我想回家。 +English: I want to go home. + +Chinese: 我不知道。 +English: I don't know. + +Chinese: 我饿了。 +English: I am hungry. + +Chinese: 我累了。 + +output text: +English: I am tired. + +Chinese: +================================================== +``` diff --git a/examples/text_generation/opt/demo.py b/examples/text_generation/opt/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..2ab891ed6153b81547444a2a7a2bd937330f2fcb --- /dev/null +++ b/examples/text_generation/opt/demo.py @@ -0,0 +1,67 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle + +from paddlenlp.transformers.gpt.tokenizer import GPTTokenizer +from paddlenlp.transformers.opt.modeling import OPTForCausalLM +from paddlenlp.utils.log import logger + + +class Demo: + def __init__(self, model_name_or_path, max_predict_len=128): + self.tokenizer = GPTTokenizer.from_pretrained(model_name_or_path) + logger.info("Loading the model parameters, please wait...") + self.model = OPTForCausalLM.from_pretrained(model_name_or_path, load_state_as_np=True) + self.model.eval() + self.max_predict_len = max_predict_len + logger.info("Model loaded.") + + @paddle.no_grad() + def generate(self, inputs): + ids = self.tokenizer(inputs)["input_ids"] + input_ids = paddle.to_tensor([ids], dtype="int64") + outputs = self.model.generate(input_ids, max_length=self.max_predict_len)[0][0] + decode_outputs = self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(outputs.cpu())) + + print(f"input text: \n{inputs}") + print(f"output text: \n{decode_outputs}") + print("=" * 50) + + +if __name__ == "__main__": + + demo = Demo(model_name_or_path="facebook/opt-1.3b", max_predict_len=10) + input_text_list = [ + "Question:If x is 2 and y is 5, what is x+y?\n" + "Answer: 7\n\n" + "Question: if x is 12 and y is 9, what is x+y?\n" + "Answer:21\n\n" + "Question: if x is 3 and y is 4, what is x+y?\n", + "a chat between a curious human and Statue of Liberty.\n" + "Human: What is your name?\n" + "Statue: I am statue of liberty.\n\n" + "Human: where do you live?\n" + "Statue: New york city.\n\n" + "Human: how long have you lived there?\n", + "Chinese: 我想回家。\n" + "English: I want to go home.\n\n" + "Chinese: 我不知道。\n" + "English: I don't know.\n\n" + "Chinese: 我饿了。\n" + "English: I am hungry.\n\n" + "Chinese: 我累了。\n", + ] + for text in input_text_list: + demo.generate(text) diff --git a/examples/text_generation/reformer/README.md b/examples/text_generation/reformer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8b7d8260a450a2cc58edb1ce2ec5f2bebf228823 --- /dev/null +++ b/examples/text_generation/reformer/README.md @@ -0,0 +1,19 @@ +# [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) + +## 摘要 +大型 Transformer 模型通常会在许多任务上获得最先进的结果,但训练这些模型的成本可能高得惊人,尤其是在长序列上。 我们介绍了两种技术来提高 Transformer 的效率。 一方面,我们将点积注意力替换为使用局部敏感哈希的注意力,将其复杂度从 O(L²) 降为 O(LlogL),其中 L 是序列的长度。 此外,我们使用可逆残差层而不是标准残差,这允许在训练过程中仅存储一次激活而不是 N 次,其中 N 是层数。 生成的模型,Reformer,与 Transformer 模型的性能相当,同时在长序列上的内存效率更高,速度更快。 + +## 文本生成测试 +```sh +python demo.py +``` +模型生成使用到的参数释义如下: +- `decode_strategy` 解码策略,可选择`greedy_search`和`sampling`。 +- `max_predict_len` 表示最大生成的句子长度。 +- `repetition_penalty` 表示生成重复token的惩罚参数,详细信息可查看[这篇论文](https://arxiv.org/pdf/1909.05858.pdf)。 + +## 生成结果样例 + +``` +In 1965, Brooks left IBM to found the Department of Defense. The Department was able to convince the Department to resign from the Department's constitutional amendments to the Department of Defense.\n\n +``` diff --git a/examples/text_generation/reformer/demo.py b/examples/text_generation/reformer/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..3af5cad2f50e11b4d2e38a9503b7a207bf2f18a4 --- /dev/null +++ b/examples/text_generation/reformer/demo.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from paddlenlp.transformers import ReformerModelWithLMHead + + +# Encoding +def encode(list_of_strings, pad_token_id=0): + max_length = max([len(string) for string in list_of_strings]) + + # create emtpy tensors + attention_masks = paddle.zeros((len(list_of_strings), max_length), dtype="int64") + input_ids = paddle.full((len(list_of_strings), max_length), pad_token_id, dtype="int64") + + for idx, string in enumerate(list_of_strings): + # make sure string is in byte format + if not isinstance(string, bytes): + string = str.encode(string) + + input_ids[idx, : len(string)] = paddle.to_tensor([x + 2 for x in string], dtype="int64") + attention_masks[idx, : len(string)] = 1 + + return input_ids, attention_masks + + +# Decoding +def decode(outputs_ids): + decoded_outputs = [] + for output_ids in outputs_ids.tolist(): + # transform id back to char IDs < 2 are simply transformed to "" + decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids])) + return decoded_outputs + + +if __name__ == "__main__": + model = ReformerModelWithLMHead.from_pretrained("reformer-enwik8") + model.eval() + encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"]) + output = decode( + model.generate(encoded, decode_strategy="greedy_search", max_length=150, repetition_penalty=1.2)[0] + ) + print(output) + # expected: + # [" Defense. 
The Department was able to convince the Department to resign from the Department's constitutional amendments to the Department of Defense.\n\n"] diff --git a/examples/text_generation/unimo-text/README.md b/examples/text_generation/unimo-text/README.md new file mode 100644 index 0000000000000000000000000000000000000000..135d9957dcc07a76d10b22333b7f82194606730d --- /dev/null +++ b/examples/text_generation/unimo-text/README.md @@ -0,0 +1,144 @@ +# 千言:面向事实一致性的生成评测比赛基线 + +## 比赛简介 + +自然语言生成旨在让机器能够像人一样使用自然语言进行表达和交互,它是人工智能领域重要的前沿课题,也是全球热点技术AIGC(AI Generated Content,人工智能内容生成)的核心问题之一。 + +随着神经网络生成模型特别是预训练语言模型的迅速发展,机器生成文本的可读性和流畅性不断提升。然而,自动生成的文本中依然经常出现不符合原文或背景的错误事实描述,这种生成的事实一致性问题是自然语言生成进行落地应用的主要障碍之一,并逐渐受到研究学者的关注。鉴于当前国内外关于事实一致性的生成评测比赛十分匮乏,为了促进自然语言生成的技术发展和实际应用,[千言](https://www.luge.ai/#/)组织了面向事实一致性的生成评测比赛。 + +第一届面向事实一致性的生成评测比赛,一共吸引了577名高校、企业的参赛者,其中有57支参赛队提交了有效的正式赛结果,30支参赛队自动评测指标超过基线系统,在排名Top10的队伍中,收到9份参赛系统总结报告。在正式赛的人工评估过程中,我们进一步确认了事实一致性问题的广泛存在性,并且通过与参赛队伍的深入交流,也积累了更多对于事实一致性自动和人工评测的宝贵经验。 + +2023年,千言举办[第二届面向事实一致性的生成评测比赛](https://aistudio.baidu.com/aistudio/competition/detail/726/0/introduction),在数据集、自动评测指标等方面均有升级。在此比赛中,将提供三个对事实一致性有较高要求的生成任务,包括文案生成、摘要生成和对话生成。同时,在系统评价中,将结合文本流畅性和事实一致性两项指标综合评估参赛生成系统的水平,同时进一步提升事实一致性评测指标的先进性和丰富性。通过这样的任务设定和评价方式,此评测将有助于研究者和开发者更多关注自然语言生成的事实一致性难题,并为大家提供学术交流平台,从而进一步提升自然语言生成的研究水平,推动相关技术的应用发展。 + +本比赛得到中国中文信息学会自然语言生成与智能写作专业委员会(筹)支持,将在2023年7月16日第二届中国自然语言生成与智能写作大会(NLGIW 2023)召开评测研讨会,并在大会上对获奖团队颁奖。 + +## 模型简介 +本次比赛提供的基线系统,基于百度提出的ERNIE-UNIMO统一模态预训练框架。在本次比赛的三个文本生成任务中,我们基于本基线使用的模型是UNIMO-text,是基于[ERNIE-UNIMO](https://arxiv.org/pdf/2012.15409.pdf)框架在文本数据上预训练得到模型。 + +## 快速开始 + +本基线基于 **PaddleNLP 2.0.8** 版本,该版本包含了基线使用的最新版UNIMO-text模型以及升级后的生成API。更多详细升级信息请查看[Release Note](https://github.com/PaddlePaddle/PaddleNLP/releases/tag/v2.0.8)。请选手们**升级PaddleNLP后使用**。 + +### 数据准备 + +比赛使用三个任务数据集测试参赛系统的生成能力,包括文案生成(AdvertiseGen)、摘要生成(LCSTS_new)和问题生成(DuReaderQG): + +- 文案生成根据结构化的商品信息生成合适的广告文案; +- 摘要生成是为输入文档生成简洁且包含关键信息的简洁文本; +- 问题生成则是根据给定段落以及答案生成适合的问题。 + +为了方便用户快速使用基线,PaddleNLP Dataset API内置了数据集,一键即可完成数据集加载,示例代码如下: + +```python +from paddlenlp.datasets import load_dataset +train_ds, dev_ds = load_dataset('dureader_qg', splits=('train', 'dev')) +``` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +. 
+├── run_gen.py # 模型finetune主程序入口 +├── gen_utils.py # 定义参数及一些工具函数 +├── scripts # 三个任务的基线训练脚本 +└── README.md # 文档说明 +``` + +### 模型训练 + +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。也可以使用./scripts目录下面的训练脚本分别启动三个任务的训练。 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" --log_dir ./log run_gen.py \ + --dataset_name=dureader_qg \ + --model_name_or_path=unimo-text-1.0 \ + --save_dir=./unimo/checkpoints \ + --logging_steps=100 \ + --save_steps=100000 \ + --epochs=6 \ + --batch_size=16 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=512 \ + --max_target_len=30 \ + --do_train \ + --do_predict \ + --device=gpu +``` + +关键参数释义如下: +- `gpus` 指示了训练所用的GPU卡号。 +- `dataset_name` 数据集名称,`dureader_qg`、`advertisegen`和`lcsts_new`分别对应问题生成、文案生成和摘要生成三个任务。 +- `train_file` 本地训练数据地址,数据格式必须与`dataset_name`所指数据集格式相同。 +- `predict_file` 本地测试数据地址,数据格式必须与`dataset_name`所指数据集格式相同。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unimo-text-1.0 | + | unimo-text-1.0-large | + +- `save_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_proportion` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `max_seq_len` 模型输入序列的最大长度。 +- `max_target_len` 模型训练时标签的最大长度。 +- `min_dec_len` 模型生成序列的最小长度。 +- `max_dec_len` 模型生成序列的最大长度。 +- `do_train` 是否进行训练。 +- `do_predict` 是否进行预测,在验证集上会自动评估。 +- `device` 表示使用的设备,从gpu和cpu中选择。 + +更多参数详情和参数的默认值请参考`args.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +./checkpoints/ +├── model_8000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── tokenizer_config.json +│ └── vocab.txt +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + +### 模型预测 + +运行下方脚本可以使用训练好的模型进行预测。 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python run_gen.py \ + --dataset_name=dureader_qg \ + --model_name_or_path=your_model_path \ + --logging_steps=100 \ + --batch_size=16 \ + --max_seq_len=512 \ + --max_target_len=30 \ + --do_predict \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --device=gpu +``` + +程序运行结束后会将预测结果保存在`output_path`中。将预测结果准备成比赛官网要求的格式,提交评估即可得评估结果。 + +Finetuned baseline的模型在各任务验证集上有如下结果(指标为BLEU-4): + +| model_name | LCSTS_new | DuLeMon | AdvertiseGen | +| :-----------------------------: | :---: | :-----------: | :-------------------: | +| finetuned unimo-text-1.0 | 18.82 | 5.52 | 10.03 | diff --git a/examples/text_generation/unimo-text/gen_utils.py b/examples/text_generation/unimo-text/gen_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..b69f1830754d5d977808bbc2c733bcf59f0619b3 --- /dev/null +++ b/examples/text_generation/unimo-text/gen_utils.py @@ -0,0 +1,187 @@ +import random +from functools import partial + +import numpy as np + +import paddle +import paddle.distributed as dist +from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler +from paddlenlp.data import Pad + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. + paddle.seed(seed + dist.get_rank()) + + +def convert_example(example, tokenizer, max_seq_len=512, max_target_len=128, max_title_len=256, mode="train"): + """Convert all examples into necessary features.""" + source = example["source"] + title = None + if "title" in example.keys(): + title = example["title"] + + if mode != "test": + tokenized_example = tokenizer.gen_encode( + source, + title=title, + target=example["target"], + max_seq_len=max_seq_len, + max_target_len=max_target_len, + max_title_len=max_title_len, + return_position_ids=True, + return_length=True, + ) + target_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + + return tokenized_example + else: + tokenized_example = tokenizer.gen_encode( + source, + title=title, + max_seq_len=max_seq_len, + max_title_len=max_title_len, + add_start_token_for_decoding=True, + return_position_ids=True, + ) + + if "target" in example and example["target"]: + tokenized_example["target"] = example["target"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # 
dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + if mode != "test": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + else: + return input_ids, token_type_ids, position_ids, attention_mask + + +def create_data_loader(dataset, tokenizer, args, mode): + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + max_target_len=args.max_target_len, + max_title_len=args.max_title_len, + mode=mode, + ) + dataset = dataset.map(trans_func, lazy=True) + if mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + else: + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first <eos>.""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + special_tokens = ["[UNK]"] + tokens = [token for token in tokens if token not in special_tokens] + return token_ids, tokens + + +def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + + target = "".join(pred_tokens) + + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + + tmp.append([target, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + response = "".join(pred_tokens) + + # TODO: Support return scores in FT. 
+ tmp.append([response]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + + return results diff --git a/examples/text_generation/unimo-text/run_gen.py b/examples/text_generation/unimo-text/run_gen.py new file mode 100644 index 0000000000000000000000000000000000000000..ad172403ce5990be3ccc41790f86755fa4ff393c --- /dev/null +++ b/examples/text_generation/unimo-text/run_gen.py @@ -0,0 +1,242 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn.functional as F +from gen_utils import create_data_loader, print_args, select_sum, set_seed +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import BLEU +from paddlenlp.transformers import ( + BasicTokenizer, + LinearDecayWithWarmup, + UNIMOLMHeadModel, + UNIMOTokenizer, +) + + +# yapf: disable +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument('--dataset_name', type=str, default='dureader_qg', help='The name of the dataset to load.') + parser.add_argument('--model_name_or_path', type=str, default='unimo-text-1.0', help='The path or shortcut name of the pre-trained model.') + parser.add_argument("--train_file", type=str, required=False, default=None, help="Train data path.") + parser.add_argument("--predict_file", type=str, required=False, default=None, help="Predict data path.") + parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') + parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') + parser.add_argument('--save_steps', type=int, default=1000, help='Save checkpoint every X updates steps.') + parser.add_argument('--seed', type=int, default=1, help='Random seed for initialization.') + parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') + parser.add_argument('--learning_rate', type=float, default=5e-5, help='The initial learning rate.') + parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') + parser.add_argument('--epochs', type=int, default=3, help='Total number of training epochs to perform.') + parser.add_argument('--warmup_proportion', type=float, default=0.02, help='The number of warmup steps.') + parser.add_argument('--max_grad_norm', type=float, default=1.0, help='The max value of grad norm.') + parser.add_argument('--beta1', type=float, default=0.9, help='beta1') + parser.add_argument('--beta2', type=float, default=0.98, help='beta2') + parser.add_argument('--epsilon', type=float, default=1e-6, help='epsilon') + parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') + parser.add_argument('--max_dec_len', type=int, default=20, help='The 
maximum sequence length of decoding.') + parser.add_argument('--min_dec_len', type=int, default=3, help='The minimal sequence length of decoding.') + parser.add_argument('--max_target_len', type=int, default=30, help='The maximum target sequence length of training.') + parser.add_argument('--max_title_len', type=int, default=30, help='The maximum title sequence length of training.') + parser.add_argument('--num_return_sequences', type=int, default=1, help='The numbers of returned sequences for one input in generation.') + parser.add_argument('--decode_strategy', type=str, default='beam_search', help='The decode strategy in generation.') + parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') + parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') + parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') + parser.add_argument('--num_beams', type=int, default=6, help='The number of beams for beam search.') + parser.add_argument('--length_penalty', type=float, default=1.2, help='The exponential penalty to the sequence length for beam search.') + parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') + parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') + parser.add_argument("--do_train", action='store_true', help="Whether to train the model.") + parser.add_argument("--do_predict", action='store_true', help="Whether to eval and predict.") + + args = parser.parse_args() + return args +# yapf: enable + + +def calc_bleu(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + bleu4 = BLEU(n_size=4) + tokenizer = BasicTokenizer() + + for pred, target in zip(preds, targets): + pred_tokens = tokenizer.tokenize(pred) + target_token = tokenizer.tokenize(target) + + bleu4.add_inst(pred_tokens, [target_token]) + + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("BLEU-4:", bleu4.score()) + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + train_ds = load_dataset(args.dataset_name, splits="train", data_files=args.train_file) + dev_ds = load_dataset(args.dataset_name, splits="dev", data_files=args.predict_file) + + train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + if args.do_train: + num_training_steps = args.epochs * len(train_data_loader) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
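        # Note: the name list built below is passed to AdamW through `apply_decay_param_fun`,
        # so weight decay is applied only to parameters whose names contain neither
        # "bias" nor "norm"; all other parameters are left undecayed.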
+ + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.epsilon, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + + step = 0 + total_time = 0.0 + for epoch in range(args.epochs): + print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) + batch_start_time = time.time() + for inputs in train_data_loader: + step += 1 + labels = inputs[-1] + logits = model(*inputs[:-1]) + labels = paddle.nn.functional.one_hot(labels, num_classes=logits.shape[-1]) + labels = paddle.nn.functional.label_smooth(labels) + loss = F.cross_entropy(logits, labels, soft_label=True) + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + + if step % args.save_steps == 0 or step >= num_training_steps: + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + print("Saved step {} model.\n".format(step)) + if args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + + batch_start_time = time.time() + + print("\nTraining completed.") + elif args.do_predict: + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def evaluation(model, data_loader, args, tokenizer): + print("\nEval begin...") + model.eval() + pred_ref = [] + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = time.time() + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + print("\nSave inference result into: %s" % args.output_path) + + if "target" in data_loader.dataset[0].keys(): + targets = [example["target"] for example in data_loader.dataset] + calc_bleu(pred_ref, targets) + + model.train() + return + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/examples/text_generation/unimo-text/scripts/lcsts_train.sh b/examples/text_generation/unimo-text/scripts/lcsts_train.sh new file mode 100644 index 
0000000000000000000000000000000000000000..16472c28edec66d8ac3ba515822e99f4e15ffb7f --- /dev/null +++ b/examples/text_generation/unimo-text/scripts/lcsts_train.sh @@ -0,0 +1,39 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +log_dir=./lcsts-log +rm -rf ${log_dir} +mkdir -p ${log_dir} + +python -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir ${log_dir} run_gen.py \ + --dataset_name=lcsts_new \ + --model_name_or_path=unimo-text-1.0 \ + --save_dir=${log_dir}/checkpoints \ + --logging_steps=100 \ + --save_steps=10000 \ + --epochs=6 \ + --batch_size=64 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=360 \ + --max_target_len=30 \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --do_train \ + --do_predict \ + --device=gpu >> ${log_dir}/lanch.log 2>&1 diff --git a/examples/text_generation/unimo-text/scripts/qg_train.sh b/examples/text_generation/unimo-text/scripts/qg_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..88e8c4ec8725599ce211f380843e86f36d4de427 --- /dev/null +++ b/examples/text_generation/unimo-text/scripts/qg_train.sh @@ -0,0 +1,39 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +log_dir=./qg-log +rm -rf ${log_dir} +mkdir -p ${log_dir} + +python -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir ${log_dir} run_gen.py \ + --dataset_name=dureader_qg \ + --model_name_or_path=unimo-text-1.0 \ + --save_dir=${log_dir}/checkpoints \ + --logging_steps=10 \ + --save_steps=1000 \ + --epochs=6 \ + --batch_size=8 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=360 \ + --max_target_len=30 \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --do_train \ + --do_predict \ + --device=gpu >> ${log_dir}/lanch.log 2>&1 diff --git a/examples/text_generation/unimo-text/scripts/table_train.sh b/examples/text_generation/unimo-text/scripts/table_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..4865358949d550a273be7410f794f641397c7269 --- /dev/null +++ b/examples/text_generation/unimo-text/scripts/table_train.sh @@ -0,0 +1,39 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +log_dir=./table-log +rm -rf ${log_dir} +mkdir -p ${log_dir} + +python -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir ${log_dir} run_gen.py \ + --dataset_name=advertisegen \ + --model_name_or_path=unimo-text-1.0 \ + --save_dir=${log_dir}/checkpoints \ + --logging_steps=100 \ + --save_steps=1000 \ + --epochs=6 \ + --batch_size=8 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=512 \ + --max_target_len=200 \ + --max_dec_len=200 \ + --min_dec_len=10 \ + --do_train \ + --do_predict \ + --device=gpu >> ${log_dir}/lanch.log 2>&1 diff --git a/examples/text_generation/vae-seq2seq/README.md b/examples/text_generation/vae-seq2seq/README.md new file mode 100644 index 0000000000000000000000000000000000000000..11e26d67ace3adb0c24fe86e6c9fe8555cc88f63 --- /dev/null +++ b/examples/text_generation/vae-seq2seq/README.md @@ -0,0 +1,132 @@ +# Variational Autoencoder (VAE) for Text Generation +以下是本范例模型的简要目录结构及说明: + +```text +. +├── README.md # 文档 +├── args.py # 训练、预测以及模型参数配置程序 +├── data.py # 数据读入程序 +├── train.py # 训练主程序 +├── predict.py # 预测主程序 +└── model.py # VAE模型组网部分,以及Metric等 +``` + +## 简介 + +本目录下此范例模型的实现,旨在展示如何用Paddle构建用于文本生成的VAE示例,其中LSTM作为编码器和解码器。分别对PTB数据集和Yahoo Answer(采样100k)数据集进行训练。 + +关于VAE的详细介绍参照: [(Bowman et al., 2015) Generating Sentences from a Continuous Space](https://arxiv.org/pdf/1511.06349.pdf) + +## 数据介绍 + +本教程使用了两个文本数据集: + +PTB数据集由华尔街日报的文章组成,包含929k个训练tokens,词汇量为10k。下载地址为: [PTB](https://dataset.bj.bcebos.com/imikolov%2Fsimple-examples.tgz)。 + +Yahoo数据集来自[(Yang et al., 2017) Improved Variational Autoencoders for Text Modeling using Dilated Convolutions](https://arxiv.org/pdf/1702.08139.pdf),该数据集从原始Yahoo Answer数据中采样100k个文档,数据集的平均文档长度为78,词汇量为200k。下载地址为:[YahooAnswer100k](https://bj.bcebos.com/paddlenlp/datasets/yahoo-answer-100k.tar.gz),运行本例程序后,数据集会自动下载到`~/.paddlenlp/datasets/YahooAnswer100k`目录下。 + + +## 模型训练 + +如果使用ptb数据集训练,可以通过下面命令配置: + +``` +export CUDA_VISIBLE_DEVICES=0 +python train.py \ + --batch_size 32 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --dataset ptb \ + --model_path ptb_model\ + --device gpu \ + --max_epoch 50 \ + +``` + +如果需要多卡运行,可以运行如下命令: + +``` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1,2,3" train.py \ + --batch_size 32 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --dataset ptb \ + --model_path ptb_model \ + --device gpu \ + --max_epoch 50 \ + +``` + +如果需要使用yahoo数据集进行多卡运行,可以将参数配置如下: + +``` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1,2,3" train.py \ + --batch_size 32 \ + --embed_dim 512 \ + --hidden_size 550 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --dataset yahoo \ + --model_path yahoo_model \ + --device gpu \ + --max_epoch 50 \ + +``` + +**NOTE:** 如需恢复模型训练,则`init_from_ckpt`只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt ptb_model/49`即可,程序会自动加载模型参数`ptb_model/49.pdparams`,也会自动加载优化器状态`ptb_model/49.pdopt`。 + + +## 模型预测 + 
+当模型训练完成之后,可以选择加载模型保存目录下的第 50 个epoch的模型进行预测,生成batch_size条短文本。生成的文本位于参数`infer_output_file`指定的路径下。如果使用ptb数据集,可以通过下面命令配置: + +``` +export CUDA_VISIBLE_DEVICES=0 +python predict.py \ + --batch_size 32 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --dataset ptb \ + --device gpu \ + --infer_output_file infer_output.txt \ + --init_from_ckpt ptb_model/49 \ + +``` + +使用yahoo数据集,需要配置embed_dim和hidden_size: + +``` +python predict.py \ + --batch_size 32 \ + --init_scale 0.1 \ + --embed_dim 512 \ + --hidden_size 550 \ + --max_grad_norm 5.0 \ + --dataset yahoo \ + --device gpu \ + --infer_output_file infer_output.txt \ + --init_from_ckpt yahoo_model/49 \ + +``` + +## 效果评价 + +||Test PPL|Test NLL| +|:-|:-:|:-:| +|ptb dataset|108.71|102.76| +|yahoo dataset|78.38|349.48| + + +## 生成样例 + +shareholders were spent about N shares to spend $ N million to ual sell this trust stock last week + +new york stock exchange composite trading trading outnumbered closed at $ N a share down N cents + +the company cited pressure to pursue up existing facilities in the third quarter was for <unk> and four N million briefly stocks for so-called unusual liability + +people had <unk> down out the kind of and much why your relationship are anyway + +there are a historic investment giant chips which ran the <unk> benefit the attempting to original maker diff --git a/examples/text_generation/vae-seq2seq/args.py b/examples/text_generation/vae-seq2seq/args.py new file mode 100644 index 0000000000000000000000000000000000000000..aa26a23dad7137dece13ff12af4c85a13d34ff7c --- /dev/null +++ b/examples/text_generation/vae-seq2seq/args.py @@ -0,0 +1,68 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + + parser.add_argument("--dataset", type=str, help="Dataset name. 
Now ptb|yahoo is supported.") + + parser.add_argument("--learning_rate", type=float, default=0.001, help="Learning rate of optimizer.") + + parser.add_argument("--num_layers", type=int, default=1, help="The number of layers of encoder and decoder.") + + parser.add_argument("--embed_dim", type=int, default=256, help="Embedding dim of encoder and decoder.") + + parser.add_argument("--hidden_size", type=int, default=256, help="Hidden size of encoder and decoder.") + + parser.add_argument("--latent_size", type=int, default=32, help="Latent size of Variational Auto Encoder.") + + parser.add_argument("--batch_size", type=int, help="Batch size.") + + parser.add_argument("--max_epoch", type=int, default=20, help="Max epoch of training.") + + parser.add_argument("--max_len", type=int, default=1280, help="Max length of source and target sentence.") + + parser.add_argument("--log_freq", type=int, default=200, help="Log frequency") + + parser.add_argument("--dec_dropout", type=float, default=0.5, help="Drop probability of decoder") + + parser.add_argument("--enc_dropout", type=float, default=0.0, help="Drop probability of encoder.") + + parser.add_argument("--init_scale", type=float, default=0.0, help="Init scale for parameter.") + + parser.add_argument("--max_grad_norm", type=float, default=5.0, help="Max grad norm of global norm clip.") + + parser.add_argument("--model_path", type=str, default="model", help="Model path for model to save.") + + parser.add_argument( + "--infer_output_file", type=str, default="infer_output.txt", help="File name to save inference output." + ) + + parser.add_argument("--beam_size", type=int, default=1, help="Beam size for Beam search.") + + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference." + ) + + parser.add_argument("--warm_up", type=int, default=10, help="The number of warm up epoch for KL.") + + parser.add_argument("--kl_start", type=float, default=0.1, help="KL start value, up to 1.0.") + + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + + args = parser.parse_args() + return args diff --git a/examples/text_generation/vae-seq2seq/data.py b/examples/text_generation/vae-seq2seq/data.py new file mode 100644 index 0000000000000000000000000000000000000000..8fd17ca1c211cbb0eba16d9dec9978e151f3a306 --- /dev/null +++ b/examples/text_generation/vae-seq2seq/data.py @@ -0,0 +1,92 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
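上面 args.py 中的 `--kl_start` 与 `--warm_up` 共同决定 KL 项权重的退火节奏:train.py 会据此计算 `anneal_r`,model.py 中的 `CrossEntropyWithKL.update_kl_weight` 再逐步累加。下面是一个独立的小示意(其中 `steps_per_epoch` 等具体数值仅作说明,并非项目配置):

```python
# Standalone sketch of the KL-weight schedule: anneal_r = 1 / (warm_up * steps_per_epoch),
# and the weight grows by anneal_r every step until it reaches 1.0
# (mirroring train.py and CrossEntropyWithKL.update_kl_weight shown later in this diff).
def kl_weight_schedule(kl_start, warm_up, steps_per_epoch):
    anneal_r = 1.0 / (warm_up * steps_per_epoch)
    weight = kl_start
    while weight < 1.0:
        weight = min(1.0, weight + anneal_r)
        yield weight

weights = list(kl_weight_schedule(kl_start=0.1, warm_up=10, steps_per_epoch=100))
print(len(weights), weights[-1])  # about 900 steps to ramp from 0.1 up to 1.0
```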
+ +from functools import partial + +import numpy as np +import paddle + +from paddlenlp.data import Pad, SamplerHelper, Vocab +from paddlenlp.datasets import load_dataset + + +def create_data_loader(args): + batch_size = args.batch_size + max_len = args.max_len + if args.dataset == "yahoo": + train_ds, dev_ds, test_ds = load_dataset("yahoo_answer_100k", splits=("train", "valid", "test")) + vocab = Vocab.load_vocabulary(**train_ds.vocab_info) + else: + train_ds, dev_ds, test_ds = load_dataset("ptb", splits=("train", "valid", "test")) + examples = [train_ds[i]["sentence"].split() for i in range(len(train_ds))] + vocab = Vocab.build_vocab(examples) + + vocab_size = len(vocab) + bos_id = vocab_size + eos_id = vocab_size + 1 + pad_id = vocab_size + 1 + + def convert_example(example): + features = vocab.to_indices(example["sentence"].split()[:max_len]) + return features + + key = lambda x, data_source: len(data_source[x]) + # Truncate and convert example to ids + train_ds = train_ds.map(convert_example, lazy=False) + dev_ds = dev_ds.map(convert_example, lazy=False) + test_ds = test_ds.map(convert_example, lazy=False) + + train_batch_sampler = ( + SamplerHelper(train_ds).shuffle().sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) + ) + + dev_batch_sampler = SamplerHelper(dev_ds).sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) + + # test_batch_sampler = SamplerHelper(dev_ds).sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) + + train_loader = paddle.io.DataLoader( + train_ds, + batch_sampler=train_batch_sampler, + collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + + dev_loader = paddle.io.DataLoader( + dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + + test_loader = paddle.io.DataLoader( + test_ds, + batch_sampler=dev_batch_sampler, + collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), + ) + + return train_loader, dev_loader, test_loader, vocab, bos_id, pad_id, len(train_ds) + + +def prepare_train_input(insts, bos_id, eos_id, pad_id): + # Add eos token id and bos token id. + src = [[bos_id] + inst + [eos_id] for inst in insts] + trg = [inst[:-1] for inst in insts] + label = [inst[1:] for inst in insts] + + # Pad sequence using eos id. + src, src_length = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([ids for ids in src]) + trg, trg_length = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([ids for ids in trg]) + label, _ = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([ids for ids in label]) + + label = np.array(label) + label = label.reshape((label.shape[0], label.shape[1], 1)) + return src, src_length, trg, trg_length, label diff --git a/examples/text_generation/vae-seq2seq/model.py b/examples/text_generation/vae-seq2seq/model.py new file mode 100644 index 0000000000000000000000000000000000000000..fabed6164a9e67cc4c8249dc6da39bcf34d22926 --- /dev/null +++ b/examples/text_generation/vae-seq2seq/model.py @@ -0,0 +1,356 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.nn.initializer as I + + +class CrossEntropyWithKL(nn.Layer): + """ + backward_loss = kl_loss * kl_weight + cross_entropy_loss + """ + + def __init__(self, base_kl_weight, anneal_r): + super(CrossEntropyWithKL, self).__init__() + self.kl_weight = base_kl_weight + self.anneal_r = anneal_r + self.loss = 0.0 + self.kl_loss = 0.0 + self.rec_loss = 0.0 + + def update_kl_weight(self): + self.kl_weight = min(1.0, self.kl_weight + self.anneal_r) + + def forward(self, kl_loss, dec_output, trg_mask, label): + self.update_kl_weight() + self.kl_loss = kl_loss + + rec_loss = F.cross_entropy(input=dec_output, label=label, reduction="none", soft_label=False) + + rec_loss = paddle.squeeze(rec_loss, axis=[2]) + rec_loss = rec_loss * trg_mask + rec_loss = paddle.mean(rec_loss, axis=[0]) + rec_loss = paddle.sum(rec_loss) + self.rec_loss = rec_loss + + self.loss = self.kl_loss * self.kl_weight + self.rec_loss + return self.loss + + +class Perplexity(paddle.metric.Metric): + def __init__(self, name="ppl", reset_freq=100, *args, **kwargs): + self.cross_entropy = kwargs.pop("loss") + super(Perplexity, self).__init__(*args, **kwargs) + self._name = name + self.total_ce = 0 + self.word_count = 0 + self.reset_freq = reset_freq + self.batch_size = 0 + + def update(self, kl_loss, dec_output, trg_mask, label, *args): + # Perplexity is calculated using cross entropy + self.batch_size = dec_output.shape[0] + loss = self.cross_entropy.loss.numpy() + self.total_ce += loss[0] * self.batch_size + self.word_count += np.sum(trg_mask) + + def reset(self): + self.total_ce = 0 + self.word_count = 0 + + def accumulate(self): + return np.exp(self.total_ce / self.word_count) + + def name(self): + return self._name + + +class NegativeLogLoss(paddle.metric.Metric): + def __init__(self, name="nll", reset_freq=100, *args, **kwargs): + self.cross_entropy = kwargs.pop("loss") + super(NegativeLogLoss, self).__init__(*args, **kwargs) + self._name = name + self.total_ce = 0 + self.batch_count = 0 + self.reset_freq = reset_freq + self.batch_size = 0 + self.sample_count = 0 + + def update(self, kl_loss, dec_output, trg_mask, label, *args): + self.batch_size = dec_output.shape[0] + loss = self.cross_entropy.loss.numpy() + self.total_ce += loss[0] * self.batch_size + self.sample_count += self.batch_size + + def reset(self): + self.total_ce = 0 + self.sample_count = 0 + + def accumulate(self): + return self.total_ce / self.sample_count + + def name(self): + return self._name + + +class TrainCallback(paddle.callbacks.ProgBarLogger): + def __init__(self, ppl, nll, log_freq=200, verbose=2): + super(TrainCallback, self).__init__(log_freq, verbose) + self.ppl = ppl + self.nll = nll + + def on_train_begin(self, logs=None): + super(TrainCallback, self).on_train_begin(logs) + self.train_metrics = ["loss", "ppl", "nll", "kl weight", "kl loss", "rec loss"] + + def on_epoch_begin(self, epoch=None, logs=None): + super(TrainCallback, self).on_epoch_begin(epoch, logs) + self.ppl.reset() + self.nll.reset() + + def on_train_batch_end(self, step, 
logs=None): + # loss and kl weight are not accumulated + logs["kl weight"] = self.ppl.cross_entropy.kl_weight + logs["kl loss"] = float(self.ppl.cross_entropy.kl_loss) + logs["rec loss"] = float(self.ppl.cross_entropy.rec_loss) + super(TrainCallback, self).on_train_batch_end(step, logs) + + def on_eval_begin(self, logs=None): + super(TrainCallback, self).on_eval_begin(logs) + self.eval_metrics = ["loss", "ppl", "nll"] + + def on_eval_batch_end(self, step, logs=None): + super(TrainCallback, self).on_eval_batch_end(step, logs) + + +class LSTMEncoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, init_scale=0.1, enc_dropout=0.0): + super(LSTMEncoder, self).__init__() + self.src_embedder = nn.Embedding( + vocab_size, + embed_dim, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_size, num_layers=num_layers, dropout=enc_dropout) + if enc_dropout > 0.0: + self.dropout = nn.Dropout(enc_dropout) + else: + self.dropout = None + + def forward(self, src, src_length): + src_emb = self.src_embedder(src) + + if self.dropout: + src_emb = self.dropout(src_emb) + enc_output, enc_final_state = self.lstm(src_emb, sequence_length=src_length) + if self.dropout: + enc_output = self.dropout(enc_output) + + enc_final_state = [[h, c] for h, c in zip(enc_final_state[0], enc_final_state[1])] + return enc_output, enc_final_state + + +class LSTMDecoderCell(nn.Layer): + def __init__(self, num_layers, embed_dim, hidden_size, latent_size, dropout=None): + super(LSTMDecoderCell, self).__init__() + self.dropout = dropout + self.lstm_cells = nn.LayerList( + [nn.LSTMCell(input_size=embed_dim + latent_size, hidden_size=hidden_size) for i in range(num_layers)] + ) + + def forward(self, step_input, lstm_states, latent_z): + new_lstm_states = [] + step_input = paddle.concat([step_input, latent_z], 1) + for i, lstm_cell in enumerate(self.lstm_cells): + out, new_lstm_state = lstm_cell(step_input, lstm_states[i]) + if self.dropout: + step_input = self.dropout(out) + else: + step_input = out + new_lstm_states.append(new_lstm_state) + if self.dropout: + step_input = self.dropout(step_input) + out = step_input + return out, new_lstm_states + + +class LSTMDecoder(nn.Layer): + def __init__(self, vocab_size, embed_dim, hidden_size, latent_size, num_layers, init_scale=0.1, dec_dropout=0.0): + super(LSTMDecoder, self).__init__() + self.num_layers = num_layers + self.embed_dim = embed_dim + self.hidden_size = hidden_size + self.latent_size = latent_size + self.trg_embedder = nn.Embedding( + vocab_size, + embed_dim, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + self.output_fc = nn.Linear( + hidden_size, + vocab_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + if dec_dropout > 0.0: + self.dropout = nn.Dropout(dec_dropout) + else: + self.dropout = None + + self.lstm = nn.RNN( + LSTMDecoderCell(self.num_layers, self.embed_dim, self.hidden_size, self.latent_size, self.dropout) + ) + + def forward(self, trg, dec_initial_states, latent_z): + trg_emb = self.trg_embedder(trg) + if self.dropout: + trg_emb = self.dropout(trg_emb) + lstm_output, _ = self.lstm(inputs=trg_emb, initial_states=dec_initial_states, latent_z=latent_z) + dec_output = self.output_fc(lstm_output) + return dec_output + + +class VAESeq2SeqModel(nn.Layer): + def __init__( + self, + embed_dim, + hidden_size, + latent_size, + vocab_size, + 
num_layers=1, + init_scale=0.1, + PAD_ID=0, + enc_dropout=0.0, + dec_dropout=0.0, + ): + super(VAESeq2SeqModel, self).__init__() + self.PAD_ID = PAD_ID + self.latent_size = latent_size + self.vocab_size = vocab_size + self.num_layers = num_layers + self.hidden_size = hidden_size + self.encoder = LSTMEncoder(vocab_size, embed_dim, hidden_size, num_layers, init_scale, enc_dropout) + self.decoder = LSTMDecoder( + vocab_size, embed_dim, hidden_size, latent_size, num_layers, init_scale, dec_dropout + ) + self.distributed_fc = nn.Linear( + hidden_size * 2, + latent_size * 2, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + self.fc = nn.Linear( + latent_size, + 2 * hidden_size * num_layers, + weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), + ) + + def sampling(self, z_mean, z_log_var): + """ + Reparameterization trick + """ + # By default, random_normal has mean=0 and std=1.0 + epsilon = paddle.normal(shape=(z_mean.shape[0], self.latent_size)) + epsilon.stop_gradient = True + return z_mean + paddle.exp(0.5 * z_log_var) * epsilon + + def build_distribution(self, enc_final_state=None): + enc_hidden = [paddle.concat(state, axis=-1) for state in enc_final_state] + + enc_hidden = paddle.concat(enc_hidden, axis=-1) + z_mean_log_var = self.distributed_fc(enc_hidden) + z_mean, z_log_var = paddle.split(z_mean_log_var, 2, -1) + return z_mean, z_log_var + + def calc_kl_dvg(self, means, logvars): + """ + Compute the KL divergence between Gaussian distribution + """ + kl_cost = -0.5 * (logvars - paddle.square(means) - paddle.exp(logvars) + 1.0) + kl_cost = paddle.mean(kl_cost, 0) + + return paddle.sum(kl_cost) + + def forward(self, src, src_length, trg, trg_length): + # Encoder + _, enc_final_state = self.encoder(src, src_length) + + # Build distribution + z_mean, z_log_var = self.build_distribution(enc_final_state) + + # Decoder + latent_z = self.sampling(z_mean, z_log_var) + + dec_first_hidden_cell = self.fc(latent_z) + dec_first_hidden, dec_first_cell = paddle.split(dec_first_hidden_cell, 2, axis=-1) + if self.num_layers > 1: + dec_first_hidden = paddle.split(dec_first_hidden, self.num_layers) + dec_first_cell = paddle.split(dec_first_cell, self.num_layers) + else: + dec_first_hidden = [dec_first_hidden] + dec_first_cell = [dec_first_cell] + dec_initial_states = [[h, c] for h, c in zip(dec_first_hidden, dec_first_cell)] + + dec_output = self.decoder(trg, dec_initial_states, latent_z) + + kl_loss = self.calc_kl_dvg(z_mean, z_log_var) + trg_mask = (self.PAD_ID != trg).astype(paddle.get_default_dtype()) + return kl_loss, dec_output, trg_mask + + +class VAESeq2SeqInferModel(VAESeq2SeqModel): + def __init__( + self, embed_dim, hidden_size, latent_size, vocab_size, start_token=1, end_token=2, beam_size=1, max_out_len=100 + ): + self.start_token = start_token + self.end_token = end_token + self.beam_size = beam_size + self.max_out_len = max_out_len + super(VAESeq2SeqInferModel, self).__init__(embed_dim, hidden_size, latent_size, vocab_size) + + def forward(self, trg): + # Encoder + latent_z = paddle.normal(shape=(trg.shape[0], self.latent_size)) + dec_first_hidden_cell = self.fc(latent_z) + dec_first_hidden, dec_first_cell = paddle.split(dec_first_hidden_cell, 2, axis=-1) + if self.num_layers > 1: + dec_first_hidden = paddle.split(dec_first_hidden, self.num_layers) + dec_first_cell = paddle.split(dec_first_cell, self.num_layers) + else: + dec_first_hidden = [dec_first_hidden] + dec_first_cell = [dec_first_cell] + 
dec_initial_states = [[h, c] for h, c in zip(dec_first_hidden, dec_first_cell)] + + output_fc = lambda x: F.one_hot( # noqa: E731 + paddle.multinomial(F.softmax(paddle.squeeze(self.decoder.output_fc(x), [1]))), num_classes=self.vocab_size + ) + + latent_z = nn.BeamSearchDecoder.tile_beam_merge_with_batch(latent_z, self.beam_size) + + decoder = nn.BeamSearchDecoder( + cell=self.decoder.lstm.cell, + start_token=self.start_token, + end_token=self.end_token, + beam_size=self.beam_size, + embedding_fn=self.decoder.trg_embedder, + output_fn=output_fc, + ) + + outputs, _ = nn.dynamic_decode( + decoder, inits=dec_initial_states, max_step_num=self.max_out_len, latent_z=latent_z + ) + return outputs diff --git a/examples/text_generation/vae-seq2seq/predict.py b/examples/text_generation/vae-seq2seq/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..0c22b8abed7b1f603ce49b5fc6a40edafff98e47 --- /dev/null +++ b/examples/text_generation/vae-seq2seq/predict.py @@ -0,0 +1,52 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import io + +import numpy as np +import paddle +from args import parse_args +from data import create_data_loader +from model import VAESeq2SeqInferModel + + +def infer(args): + print(args) + paddle.set_device(args.device) + _, _, _, vocab, bos_id, eos_id, _ = create_data_loader(args) + + net = VAESeq2SeqInferModel(args.embed_dim, args.hidden_size, args.latent_size, len(vocab) + 2) + + model = paddle.Model(net) + model.prepare() + model.load(args.init_from_ckpt) + + infer_output = paddle.ones((args.batch_size, 1), dtype="int64") * bos_id + + space_token = " " + line_token = "\n" + with io.open(args.infer_output_file, "w", encoding="utf-8") as out_file: + predict_lines = model.predict_batch(infer_output)[0] + for line in predict_lines: + end_id = -1 + if eos_id in line: + end_id = np.where(line == eos_id)[0][0] + new_line = [vocab.to_tokens(e[0]) for e in line[:end_id]] + out_file.write(space_token.join(new_line)) + out_file.write(line_token) + + +if __name__ == "__main__": + args = parse_args() + infer(args) diff --git a/examples/text_generation/vae-seq2seq/train.py b/examples/text_generation/vae-seq2seq/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ec3acb3f3cbc6640e210486e186dd52d56f01e7c --- /dev/null +++ b/examples/text_generation/vae-seq2seq/train.py @@ -0,0 +1,77 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from args import parse_args +from data import create_data_loader +from model import ( + CrossEntropyWithKL, + NegativeLogLoss, + Perplexity, + TrainCallback, + VAESeq2SeqModel, +) + + +def train(args): + print(args) + paddle.set_device(args.device) + train_loader, dev_loader, test_loader, vocab, bos_id, pad_id, train_data_len = create_data_loader(args) + + net = VAESeq2SeqModel( + embed_dim=args.embed_dim, + hidden_size=args.hidden_size, + latent_size=args.latent_size, + vocab_size=len(vocab) + 2, + num_layers=args.num_layers, + init_scale=args.init_scale, + enc_dropout=args.enc_dropout, + dec_dropout=args.dec_dropout, + ) + + gloabl_norm_clip = paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm) + + anneal_r = 1.0 / (args.warm_up * train_data_len / args.batch_size) + cross_entropy = CrossEntropyWithKL(base_kl_weight=args.kl_start, anneal_r=anneal_r) + model = paddle.Model(net) + + optimizer = paddle.optimizer.Adam(args.learning_rate, parameters=model.parameters(), grad_clip=gloabl_norm_clip) + + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + ppl_metric = Perplexity(loss=cross_entropy) + nll_metric = NegativeLogLoss(loss=cross_entropy) + + model.prepare(optimizer=optimizer, loss=cross_entropy, metrics=[ppl_metric, nll_metric]) + + model.fit( + train_data=train_loader, + eval_data=dev_loader, + epochs=args.max_epoch, + save_dir=args.model_path, + shuffle=False, + callbacks=[TrainCallback(ppl_metric, nll_metric, args.log_freq)], + log_freq=args.log_freq, + ) + + # Evaluation + print("Start to evaluate on test dataset...") + model.evaluate(test_loader, log_freq=len(test_loader)) + + +if __name__ == "__main__": + args = parse_args() + train(args) diff --git a/examples/text_graph/erniesage/README.md b/examples/text_graph/erniesage/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a25780be05860d776182e0be89b7a3be13086284 --- /dev/null +++ b/examples/text_graph/erniesage/README.md @@ -0,0 +1,59 @@ +# 基于PaddleNLP的ErnieSage模型介绍 + +## 背景介绍 + +在很多工业应用中,往往出现如下图所示的一种特殊的图:Text Graph。顾名思义,图的节点属性由文本构成,而边的构建提供了结构信息。如搜索场景下的Text Graph,节点可由搜索词、网页标题、网页正文来表达,用户反馈和超链信息则可构成边关系。 +<img src="https://raw.githubusercontent.com/PaddlePaddle/PGL/static_stable/examples/erniesage/docs/source/_static/text_graph.png" alt="Text Graph" width="800"> + +**ErnieSage** 由飞桨PGL团队提出,是ERNIE SAmple aggreGatE的简称,该模型可以同时建模文本语义与图结构信息,有效提升 Text Graph 的应用效果。其中 [**ERNIE**](https://github.com/PaddlePaddle/ERNIE) 是百度推出的基于知识增强的持续学习语义理解框架。 + +**ErnieSage** 是 ERNIE 与 GraphSAGE 碰撞的结果,是 ERNIE SAmple aggreGatE 的简称,它的结构如下图所示,主要思想是通过 ERNIE 作为聚合函数(Aggregators),建模自身节点和邻居节点的语义与结构关系。ErnieSage 对于文本的建模是构建在邻居聚合的阶段,中心节点文本会与所有邻居节点文本进行拼接;然后通过预训练的 ERNIE 模型进行消息汇聚,捕捉中心节点以及邻居节点之间的相互关系;最后使用 ErnieSage 搭配独特的邻居互相看不见的 Attention Mask 和独立的 Position Embedding 体系,就可以轻松构建 TextGraph 中句子之间以及词之间的关系。 + +<img src="https://raw.githubusercontent.com/PaddlePaddle/PGL/static_stable/examples/erniesage/docs/source/_static/ernie_aggregator.png" alt="ERNIESage" width="800"> + +使用ID特征的GraphSAGE只能够建模图的结构信息,而单独的ERNIE只能处理文本信息。通过飞桨PGL搭建的图与文本的桥梁,**ErnieSage**能够很简单的把GraphSAGE以及ERNIE的优点结合一起。以下面TextGraph的场景,**ErnieSage**的效果能够比单独的ERNIE以及GraphSAGE模型都要好。 + +**ErnieSage**可以很轻松地在基于PaddleNLP构建基于Ernie的图神经网络,目前PaddleNLP提供了V2版本的ErnieSage模型: + +- **ErnieSage V2**: ERNIE 作用在text graph的边上; + +<img 
src="https://raw.githubusercontent.com/PaddlePaddle/PGL/static_stable/examples/erniesage/docs/source/_static/ERNIESage_v1_4.png" alt="ERNIESage_v1_4" width="800"> + +## 环境依赖 + +- pgl >= 2.1 +安装命令 `pip install pgl\>=2.1` + +## 数据准备 +示例数据```data.txt```中使用了NLPCC2016-DBQA的部分数据,格式为每行"query \t answer"。 +```text +NLPCC2016-DBQA 是由国际自然语言处理和中文计算会议 NLPCC 于 2016 年举办的评测任务,其目标是从候选中找到合适的文档作为问题的答案。[链接: http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf] +``` + +## 如何运行 + +我们采用了[PaddlePaddle Fleet](https://github.com/PaddlePaddle/Fleet)作为我们的分布式训练框架,在```config/*.yaml```中,目前支持的[ERNIE](https://github.com/PaddlePaddle/ERNIE)预训练语义模型包括**ernie**以及**ernie_tiny**,通过config/erniesage_link_prediction.yaml中的ernie_name指定。 + + +```sh +# 数据预处理,建图 +python ./preprocessing/dump_graph.py --conf ./config/erniesage_link_prediction.yaml +# GPU多卡或单卡模式ErnieSage +python -m paddle.distributed.launch --gpus "0" link_prediction.py --conf ./config/erniesage_link_prediction.yaml +# 对图节点的embeding进行预测, 单卡或多卡 +python -m paddle.distributed.launch --gpus "0" link_prediction.py --conf ./config/erniesage_link_prediction.yaml --do_predict +``` + +## 超参数设置 + +- epochs: 训练的轮数 +- graph_data: 训练模型时用到的图结构数据,使用“text1 \t text"格式。 +- train_data: 训练时的边,与graph_data格式相同,一般可以直接用graph_data。 +- graph_work_path: 临时存储graph数据中间文件的目录。 +- samples: 采样邻居数 +- model_type: 模型类型,包括ErnieSageV2。 +- ernie_name: 热启模型类型,支持“ernie”和"ernie_tiny",后者速度更快,指定该参数后会自动从服务器下载预训练模型文件。 +- num_layers: 图神经网络层数。 +- hidden_size: 隐藏层大小。 +- batch_size: 训练时的batchsize。 +- infer_batch_size: 预测时batchsize。 diff --git a/examples/text_graph/erniesage/config/erniesage_link_prediction.yaml b/examples/text_graph/erniesage/config/erniesage_link_prediction.yaml new file mode 100644 index 0000000000000000000000000000000000000000..970f5de365b7338142d276248a4bb3ccfbf8f354 --- /dev/null +++ b/examples/text_graph/erniesage/config/erniesage_link_prediction.yaml @@ -0,0 +1,40 @@ +# Global Environment Settings + +# trainer config ------ +device: "gpu" # use cpu or gpu devices to train. +seed: 2020 + +task: "link_prediction" +model_name_or_path: "ernie-tiny" # ernie-tiny or ernie-1.0 avaiable +sample_workers: 1 +optimizer_type: "adam" +lr: 0.00005 +batch_size: 32 +CPU_NUM: 10 +epoch: 30 +log_per_step: 10 +save_per_step: 200 +output_path: "./output" + +# data config ------ +train_data: "./example_data/graph_data.txt" +graph_data: "./example_data/train_data.txt" +graph_work_path: "./graph_workdir" +input_type: "text" +encoding: "utf8" + +# model config ------ +samples: [10] +model_type: "ErnieSageV2" +max_seqlen: 40 +num_layers: 1 +hidden_size: 128 +final_fc: true +final_l2_norm: true +loss_type: "hinge" +margin: 0.1 +neg_type: "batch_neg" + +# infer config ------ +infer_model: "./output/last" +infer_batch_size: 128 diff --git a/examples/text_graph/erniesage/data/__init__.py b/examples/text_graph/erniesage/data/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..bd14d69548398963093d935c4c5bd7b6683716ec --- /dev/null +++ b/examples/text_graph/erniesage/data/__init__.py @@ -0,0 +1,19 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from data import dataset, graph_reader + +__all__ = [] +__all__ += dataset.__all__ +__all__ += graph_reader.__all__ diff --git a/examples/text_graph/erniesage/data/dataset.py b/examples/text_graph/erniesage/data/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..2a3733851e63b63d82773e28725dd8f98b4bf791 --- /dev/null +++ b/examples/text_graph/erniesage/data/dataset.py @@ -0,0 +1,115 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import numpy as np +import paddle +import pgl +from paddle.io import Dataset +from pgl.sampling import graphsage_sample + +__all__ = [ + "TrainData", + "PredictData", + "batch_fn", +] + + +class TrainData(Dataset): + def __init__(self, graph_work_path): + trainer_id = paddle.distributed.get_rank() + trainer_count = paddle.distributed.get_world_size() + print("trainer_id: %s, trainer_count: %s." % (trainer_id, trainer_count)) + + edges = np.load(os.path.join(graph_work_path, "train_data.npy"), allow_pickle=True) + # edges is bidirectional. 
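+        # Shard the edges across distributed workers with a strided slice:
+        # trainer k keeps rows k, k + trainer_count, k + 2 * trainer_count, ...,
+        # so every worker trains on a disjoint, evenly sized subset.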
+ train_src = edges[trainer_id::trainer_count, 0] + train_dst = edges[trainer_id::trainer_count, 1] + returns = {"train_data": [train_src, train_dst]} + + if os.path.exists(os.path.join(graph_work_path, "neg_samples.npy")): + neg_samples = np.load(os.path.join(graph_work_path, "neg_samples.npy"), allow_pickle=True) + if neg_samples.size != 0: + train_negs = neg_samples[trainer_id::trainer_count] + returns["train_data"].append(train_negs) + print("Load train_data done.") + self.data = returns + + def __getitem__(self, index): + return [data[index] for data in self.data["train_data"]] + + def __len__(self): + return len(self.data["train_data"][0]) + + +class PredictData(Dataset): + def __init__(self, num_nodes): + trainer_id = paddle.distributed.get_rank() + trainer_count = paddle.distributed.get_world_size() + self.data = np.arange(trainer_id, num_nodes, trainer_count) + + def __getitem__(self, index): + return [self.data[index], self.data[index]] + + def __len__(self): + return len(self.data) + + +def batch_fn(batch_ex, samples, base_graph, term_ids): + batch_src = [] + batch_dst = [] + batch_neg = [] + for batch in batch_ex: + batch_src.append(batch[0]) + batch_dst.append(batch[1]) + if len(batch) == 3: # default neg samples + batch_neg.append(batch[2]) + + batch_src = np.array(batch_src, dtype="int64") + batch_dst = np.array(batch_dst, dtype="int64") + if len(batch_neg) > 0: + batch_neg = np.unique(np.concatenate(batch_neg)) + else: + batch_neg = batch_dst + + nodes = np.unique(np.concatenate([batch_src, batch_dst, batch_neg], 0)) + subgraphs = graphsage_sample(base_graph, nodes, samples) + + subgraph, sample_index, node_index = subgraphs[0] + from_reindex = {int(x): i for i, x in enumerate(sample_index)} + + term_ids = term_ids[sample_index].astype(np.int64) + + sub_src_idx = pgl.graph_kernel.map_nodes(batch_src, from_reindex) + sub_dst_idx = pgl.graph_kernel.map_nodes(batch_dst, from_reindex) + sub_neg_idx = pgl.graph_kernel.map_nodes(batch_neg, from_reindex) + + user_index = np.array(sub_src_idx, dtype="int64") + pos_item_index = np.array(sub_dst_idx, dtype="int64") + neg_item_index = np.array(sub_neg_idx, dtype="int64") + + user_real_index = np.array(batch_src, dtype="int64") + pos_item_real_index = np.array(batch_dst, dtype="int64") + + return ( + np.array([subgraph.num_nodes], dtype="int32"), + subgraph.edges.astype("int32"), + term_ids, + user_index, + pos_item_index, + neg_item_index, + user_real_index, + pos_item_real_index, + ) diff --git a/examples/text_graph/erniesage/data/graph_reader.py b/examples/text_graph/erniesage/data/graph_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..ca82d5c78f66bad525f5d37314c7a1d020e9a2ee --- /dev/null +++ b/examples/text_graph/erniesage/data/graph_reader.py @@ -0,0 +1,59 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
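+
+# GraphDataLoader below wraps paddle.io.DataLoader; its construct() callback
+# takes the first two tensors of each batch produced by batch_fn (num_nodes,
+# edges), rebuilds them into a pgl.Graph, and returns that graph together with
+# the remaining term_ids / index tensors for the model.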
+ +import pgl +from paddle.io import DataLoader + +__all__ = ["GraphDataLoader"] + + +class GraphDataLoader(object): + def __init__(self, dataset, batch_size=1, shuffle=True, num_workers=1, collate_fn=None, **kwargs): + self.loader = DataLoader( + dataset=dataset, + batch_size=batch_size, + shuffle=shuffle, + num_workers=num_workers, + collate_fn=collate_fn, + **kwargs, + ) + + def __iter__(self): + func = self.__callback__() + for data in self.loader(): + yield func(data) + + def __call__(self): + return self.__iter__() + + def __callback__(self): + """callback function, for recontruct a dict or graph.""" + + def construct(tensors): + """tensor list to ([graph_tensor, graph_tensor, ...], + other tensor) + """ + graph_num = 1 + start_len = 0 + data = [] + graph_list = [] + for graph in range(graph_num): + graph_list.append(pgl.Graph(num_nodes=tensors[start_len], edges=tensors[start_len + 1])) + start_len += 2 + + for i in range(start_len, len(tensors)): + data.append(tensors[i]) + return graph_list, data + + return construct diff --git a/examples/text_graph/erniesage/example_data/graph_data.txt b/examples/text_graph/erniesage/example_data/graph_data.txt new file mode 100644 index 0000000000000000000000000000000000000000..e9aead6c89fa2fdbed64e5dada352106a8deb349 --- /dev/null +++ b/examples/text_graph/erniesage/example_data/graph_data.txt @@ -0,0 +1,1000 @@ +黑缘粗角肖叶甲触角有多大? 体长卵形,棕红色;鞘翅棕黄或淡棕色,外缘和中缝黑色或黑褐色;触角基部3、4节棕黄,余节棕色。 +黑缘粗角肖叶甲触角有多大? 头部刻点粗大,分布不均匀,头顶刻点十分稀疏;触角基部的内侧有一个三角形光瘤,唇基前缘呈半圆形凹切。 +黑缘粗角肖叶甲触角有多大? 触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。 +黑缘粗角肖叶甲触角有多大? 前胸背板横宽,宽约为长的两倍,侧缘敞出较宽,圆形,敞边与盘区之间有一条细纵沟;盘区刻点相当密,前半部刻点较大于后半部。 +黑缘粗角肖叶甲触角有多大? 小盾片舌形,光亮,末端圆钝。 +黑缘粗角肖叶甲触角有多大? 鞘翅刻点粗大,不规则排列,肩部之后的刻点更为粗大,具皱褶,近中缝的刻点较小,略呈纵行排列。 +黑缘粗角肖叶甲触角有多大? 前胸前侧片前缘直;前胸后侧片具粗大刻点。 +黑缘粗角肖叶甲触角有多大? 足粗壮;胫节具纵脊,外端角向外延伸,呈弯角状;爪具附齿。 +暮光闪闪的姐姐是谁? 暮光闪闪是一匹雌性独角兽,后来在神秘魔法的影响下变成了空角兽(公主),她是《我的小马驹:友情是魔法》(英文名:My Little Pony:Friendship is Magic)中的主角之一。 +暮光闪闪的姐姐是谁? 她是银甲闪闪(Shining Armor)的妹妹,同时也是韵律公主(Princess Cadance)的小姑子。 +暮光闪闪的姐姐是谁? 在该系列中,她与最好的朋友与助手斯派克(Spike)一起生活在小马镇(Ponyville)的金橡图书馆(Golden Oak Library),研究友谊的魔法。 +暮光闪闪的姐姐是谁? 在暮光闪闪成为天角兽之前(即S3E13前),常常给塞拉丝蒂娅公主(Princess Celestia)关于友谊的报告。[1] +暮光闪闪的姐姐是谁? 《我的小马驹:友谊是魔法》(英文名称:My Little Pony:Friendship is Magic)(简称MLP) +暮光闪闪的姐姐是谁? 动画讲述了一只名叫做暮光闪闪(Twilight Sparkle)的独角兽(在SE3E13 +暮光闪闪的姐姐是谁? My Little Pony:Friendship is Magic[2] +暮光闪闪的姐姐是谁? 后成为了天角兽),执行她的导师塞拉斯蒂娅公主(PrincessCelestia)的任务,在小马镇(Ponyville)学习关于友谊的知识。 +暮光闪闪的姐姐是谁? 她与另外五只小马,苹果杰克(Applejack)、瑞瑞(Rarity)、云宝黛西(Rainbow Dash)、小蝶(Fluttershy)与萍琪派(Pinkie Pie),成为了最要好的朋友。 +暮光闪闪的姐姐是谁? 每匹小马都分别代表了协律精华的6个元素:诚实,慷慨,忠诚,善良,欢笑,魔法,各自扮演着属于自己的重要角色。 +暮光闪闪的姐姐是谁? 此后,暮光闪闪(Twilight Sparkle)便与她认识的新朋友们开始了有趣的日常生活。 +暮光闪闪的姐姐是谁? 在动画中,随时可见她们在小马镇(Ponyville)的种种冒险、奇遇、日常等等。 +暮光闪闪的姐姐是谁? 同时,也在她们之间的互动和冲突中,寻找着最适合最合理的完美解决方案。 +暮光闪闪的姐姐是谁? “尽管小马国并不太平,六位主角之间也常常有这样那样的问题,但是他们之间的真情对待,使得这个童话世界已经成为不少人心中理想的世外桃源。” +暮光闪闪的姐姐是谁? 暮光闪闪在剧情刚开始的时候生活在中心城(Canterlot),后来在夏日 +暮光闪闪的姐姐是谁? 暮光闪闪与斯派克(Spike) +暮光闪闪的姐姐是谁? 庆典的时候被塞拉丝蒂娅公主派遣到小马镇执行检查夏日庆典的准备工作的任务。 +暮光闪闪的姐姐是谁? 在小马镇交到了朋友(即其余5个主角),并和她们一起使用协律精华(Elements of harmony)击败了梦魇之月。 +暮光闪闪的姐姐是谁? 并在塞拉丝蒂亚公主的许可下,留在小马镇继续研究友谊的魔法。 +暮光闪闪的姐姐是谁? 暮光闪闪的知识基本来自于书本,并且她相当不相信书本以外的“迷信”,因为这样她在S1E15里吃足了苦头。 +暮光闪闪的姐姐是谁? 在这之后,她也开始慢慢学会相信一些书本以外的东西。 +暮光闪闪的姐姐是谁? 暮光闪闪热爱学习,并且学习成绩相当好(从她可以立刻算出 +暮光闪闪的姐姐是谁? 的结果可以看 +暮光闪闪的姐姐是谁? 暮光闪闪的原型 +暮光闪闪的姐姐是谁? 出)。 +暮光闪闪的姐姐是谁? 相当敬爱自己的老师塞拉丝蒂亚公主甚至到了精神失常的地步。 +暮光闪闪的姐姐是谁? 在第二季中,曾因为无法交出关于友谊的报告而做出了疯狂的行为,后来被塞拉丝蒂亚公主制止,在这之后,暮光闪闪得到了塞拉丝蒂亚公主“不用定期交友谊报告”的许可。 +暮光闪闪的姐姐是谁? 于是暮光闪闪在后面的剧情中的主角地位越来越得不到明显的体现。 +暮光闪闪的姐姐是谁? 在SE3E13中,因为破解了白胡子星璇留下的神秘魔法而被加冕成为了天角兽(公主),被尊称为“闪闪公主”。 +暮光闪闪的姐姐是谁? 
当小星座熊在小马镇引起恐慌的时候,暮光闪闪运用了自身强大的魔法将水库举起后装满牛奶,用牛奶将小星座熊安抚后,连着巨型奶瓶和小星座熊一起送回了小星座熊居住的山洞。 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭是由汪氏集团旗下子公司江西尤金房地产开发有限公司携手城发投资共同开发的精品社区,项目占地面积约380亩,总建筑面积约41万平方米。 +我想知道红谷十二庭有哪些金融机构? 项目以建设人文型、生态型居住环境为规划目标;创造一个布局合理、功能齐全、交通便捷、绿意盎然、生活方便,有文化内涵的居住区。 +我想知道红谷十二庭有哪些金融机构? 金融机构:工商银行、建设银行、农业银行、中国银行红谷滩支行、商业银行红谷滩支行等 +我想知道红谷十二庭有哪些金融机构? 周边公园:沿乌砂河50米宽绿化带、乌砂河水岸公园、秋水广场、赣江市民公园 +我想知道红谷十二庭有哪些金融机构? 周边医院:新建县人民医院、开心人药店、中寰医院 +我想知道红谷十二庭有哪些金融机构? 周边学校:育新小学红谷滩校区、南师附小红谷滩校区、实验小学红谷滩校区中学:南昌二中红谷滩校区、南昌五中、新建二中、竞秀贵族学校 +我想知道红谷十二庭有哪些金融机构? 周边公共交通:112、204、211、219、222、227、238、501等20多辆公交车在本项目社区门前停靠 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭处在南昌一江两城中的西城中心,位属红谷滩CBD文化公园中心——马兰圩中心组团,红谷滩中心区、红角洲、新建县三区交汇处,南临南期友好路、东接红谷滩中心区、西靠乌砂河水岸公园(50米宽,1000米长)。 +我想知道红谷十二庭有哪些金融机构? 交通便捷,景观资源丰富,生活配套设施齐全,出则繁华,入则幽静,是现代人居的理想地段。 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭户型图 +苏琳最开始进入智通实业是担任什么职位? 现任广东智通人才连锁股份有限公司总裁,清华大学高级工商管理硕士。 +苏琳最开始进入智通实业是担任什么职位? 1994年,加入智通实业,从总经理秘书做起。 +苏琳最开始进入智通实业是担任什么职位? 1995年,智通实业决定进入人才服务行业,被启用去负责新公司的筹建及运营工作,在苏琳的努力下,智通人才智力开发有限公司成立。 +苏琳最开始进入智通实业是担任什么职位? 2003年,面对同城对手的激烈竞争,苏琳冷静对待,领导智通先后接管、并购了同城的腾龙、安达盛人才市场,,“品牌运作,连锁经营,差异制胜”成为苏琳屡屡制胜的法宝。 +苏琳最开始进入智通实业是担任什么职位? 2006年,苏琳先是将智通人才升级为“东莞市智通人才连锁有限公司”,一举成为广东省人才市场目前惟一的连锁机构,随后在东莞同时开设长安、松山湖、清溪等镇区分部,至此智通在东莞共有6个分部。 +苏琳最开始进入智通实业是担任什么职位? 一番大刀阔斧完成东莞布局后,苏琳确定下一个更为高远的目标——进军珠三角,向全国发展连锁机构。 +苏琳最开始进入智通实业是担任什么职位? 到2011年末,苏琳领导的智通人才已在珠三角的东莞、佛山、江门、中山等地,长三角的南京、宁波、合肥等地,中西部的南昌、长沙、武汉、重庆、西安等地设立了20多家连锁经营网点。 +苏琳最开始进入智通实业是担任什么职位? 除了财务副总裁之外,苏琳是智通人才核心管理高层当中唯一的女性,不管是要约采访的记者还是刚刚加入智通的员工,见到苏琳的第一面,都会有一种惊艳的感觉,“一位女企业家居然非常美丽和时尚?!” +苏琳最开始进入智通实业是担任什么职位? 智通管理高层的另外6位男性成员,有一次同时接受一家知名媒体采访时,共同表达了对自己老板的“爱慕”之情,苏琳听后莞尔一笑,指着在座的这几位高层说道“其实,我更爱他们!” +苏琳最开始进入智通实业是担任什么职位? 这种具有独特领导魅力的表述让这位记者唏嘘不已,同时由这样的一个细节让他感受到了智通管理团队的协作力量。 +谁知道黄沙中心小学的邮政编码是多少? 学校于1954年始建于棕树湾村,当时借用一间民房做教室,取名为“黄沙小学”,只有教师1人,学生8人。 +谁知道黄沙中心小学的邮政编码是多少? 1958年在大跃进精神的指导下,实行大集体,全乡集中办学,发展到12个班,300多学生,20名教职工。 +谁知道黄沙中心小学的邮政编码是多少? 1959年解散。 +谁知道黄沙中心小学的邮政编码是多少? 1959年下半年,在上级的扶持下,建了6间木房,搬到1960年学校所在地,有6名教师,3个班,60名学生。 +谁知道黄沙中心小学的邮政编码是多少? 1968年,开始招收一个初中班,“黄沙小学”改名为 “附小”。 +谁知道黄沙中心小学的邮政编码是多少? 当时已发展到5个班,8名教师,110多名学生。 +谁知道黄沙中心小学的邮政编码是多少? 增建土木结构教室两间。 +谁知道黄沙中心小学的邮政编码是多少? 1986年,初中、小学分开办学。 +谁知道黄沙中心小学的邮政编码是多少? 增建部分教师宿舍和教室,办学条件稍有改善,学校初具规模。 +谁知道黄沙中心小学的邮政编码是多少? 1996年,我校在市、县领导及希望工程主管部门的关怀下,决定改为“黄沙希望小学”并拨款32万元,新建一栋4层,12间教室的教学楼,教学条件大有改善。 +谁知道黄沙中心小学的邮政编码是多少? 当时发展到10个班,学生300多人,教职工19人,小学高级教师3人,一级教师7人,二级教师9人。 +谁知道黄沙中心小学的邮政编码是多少? 2003年下半年由于农村教育体制改革,撤销教育组,更名为“黄沙中心小学”。 +谁知道黄沙中心小学的邮政编码是多少? 学校现有在校生177人(含学前42人),设有学前至六年级共7个教学班。 +谁知道黄沙中心小学的邮政编码是多少? 有教师19人,其中大专以上学历11人,中师6人;小学高级教师14人,一级教师5人。 +谁知道黄沙中心小学的邮政编码是多少? 学校校园占地面积2050平方米,生均达15.29平方米,校舍建筑面积1645平方米,生均12.27平方米;设有教师办公室、自然实验、电教室(合二为一)、微机室、图书阅览室(合二为一)、体育室、广播室、少先队活动室。 +谁知道黄沙中心小学的邮政编码是多少? 广西壮族自治区桂林市临桂县黄沙瑶族乡黄沙街 邮编:541113[1] +伊藤实华的职业是什么? 伊藤实华(1984年3月25日-)是日本的女性声优。 +伊藤实华的职业是什么? THREE TREE所属,东京都出身,身长149cm,体重39kg,血型AB型。 +伊藤实华的职业是什么? ポルノグラフィティのLION(森男) +伊藤实华的职业是什么? 2000年 +伊藤实华的职业是什么? 犬夜叉(枫(少女时代)) +伊藤实华的职业是什么? 幻影死神(西亚梨沙) +伊藤实华的职业是什么? 2001年 +伊藤实华的职业是什么? NOIR(ロザリー) +伊藤实华的职业是什么? 2002年 +伊藤实华的职业是什么? 水瓶战记(柠檬) +伊藤实华的职业是什么? 返乡战士(エイファ) +伊藤实华的职业是什么? 2003年 +伊藤实华的职业是什么? 奇诺之旅(女子A(悲しい国)) +伊藤实华的职业是什么? 2004年 +伊藤实华的职业是什么? 爱你宝贝(坂下ミキ) +伊藤实华的职业是什么? Get Ride! アムドライバー(イヴァン・ニルギース幼少期) +伊藤实华的职业是什么? スクールランブル(花井春树(幼少时代)) +伊藤实华的职业是什么? 2005年 +伊藤实华的职业是什么? 光速蒙面侠21(虎吉) +伊藤实华的职业是什么? 搞笑漫画日和(男子トイレの精、パン美先生) +伊藤实华的职业是什么? 银牙伝说WEED(テル) +伊藤实华的职业是什么? 魔女的考验(真部カレン、守山太郎) +伊藤实华的职业是什么? BUZZER BEATER(レニー) +伊藤实华的职业是什么? 虫师(“眼福眼祸”さき、“草を踏む音”沢(幼少时代)) +伊藤实华的职业是什么? 2006年 +伊藤实华的职业是什么? 魔女之刃(娜梅) +伊藤实华的职业是什么? 反斗小王子(远藤レイラ) +伊藤实华的职业是什么? 搞笑漫画日和2(パン美先生、フグ子、ダンサー、ヤマトの妹、女性) +伊藤实华的职业是什么? 人造昆虫カブトボーグ V×V(ベネチアンの弟、东ルリ、园儿A) +伊藤实华的职业是什么? 2007年 +爆胎监测与安全控制系统英文是什么? 
爆胎监测与安全控制系统(Blow-out Monitoring and Brake System),是吉利全球首创,并拥有自主知识产权及专利的一项安全技术。 +爆胎监测与安全控制系统英文是什么? 这项技术主要是出于防止高速爆胎所导致的车辆失控而设计。 +爆胎监测与安全控制系统英文是什么? BMBS爆胎监测与安全控制系统技术于2004年1月28日正式获得中国发明专利授权。 +爆胎监测与安全控制系统英文是什么? 2008年第一代BMBS系统正式与世人见面,BMBS汇集国内外汽车力学、控制学、人体生理学、电子信息学等方面的专家和工程技术人员经过一百余辆试验车累计行程超过五百万公里的可靠性验证,以确保产品的可靠性。 +爆胎监测与安全控制系统英文是什么? BMBS技术方案的核心即是采用智能化自动控制系统,弥补驾驶员生理局限,在爆胎后反应时间为0.5秒,替代驾驶员实施行车制动,保障行车安全。 +爆胎监测与安全控制系统英文是什么? BMBS系统由控制系统和显示系统两大部分组成,控制系统由BMBS开关、BMBS主机、BMBS分机、BMBS真空助力器四部分组成;显示系统由GPS显示、仪表指示灯、语言提示、制动双闪灯组成。 +爆胎监测与安全控制系统英文是什么? 当轮胎气压高于或低于限值时,控制器声光提示胎压异常。 +爆胎监测与安全控制系统英文是什么? 轮胎温度过高时,控制器发出信号提示轮胎温度过高。 +爆胎监测与安全控制系统英文是什么? 发射器电量不足时,控制器显示低电压报警。 +爆胎监测与安全控制系统英文是什么? 发射器受到干扰长期不发射信号时,控制器显示无信号报警。 +爆胎监测与安全控制系统英文是什么? 当汽车电门钥匙接通时,BMBS首先进入自检程序,检测系统各部分功能是否正常,如不正常,BMBS报警灯常亮。 +走读干部现象在哪里比较多? 走读干部一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么晚出早归,要么周一去单位上班、周五回家过周末。 +走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 +走读干部现象在哪里比较多? 截至2014年10月,共有6484名“走读干部”在专项整治中被查处。 +走读干部现象在哪里比较多? 这是中央首次大规模集中处理这一长期遭诟病的干部作风问题。 +走读干部现象在哪里比较多? 干部“走读”问题主要在乡镇地区比较突出,城市地区则较少。 +走读干部现象在哪里比较多? 从历史成因和各地反映的情况来看,产生“走读”现象的主要原因大致有四种: +走读干部现象在哪里比较多? 现今绝大多数乡村都有通往乡镇和县城的石子公路甚至柏油公路,这无疑为农村干部的出行创造了便利条件,为“干部像候鸟,频往家里跑”创造了客观条件。 +走读干部现象在哪里比较多? 选调生、公务员队伍大多是学历较高的大学毕业生,曾在高校所在地的城市生活,不少人向往城市生活,他们不安心长期扎根基层,而是将基层当作跳板,因此他们往往成为“走读”的主力军。 +走读干部现象在哪里比较多? 公仆意识、服务意识淡化,是“走读”现象滋生的主观原因。 +走读干部现象在哪里比较多? 有些党员干部感到自己长期在基层工作,该为自己和家庭想想了。 +走读干部现象在哪里比较多? 于是,不深入群众认真调查研究、认真听取群众意见、认真解决群众的实际困难,也就不难理解了。 +走读干部现象在哪里比较多? 县级党政组织对乡镇领导干部管理的弱化和为基层服务不到位,导致“走读”问题得不到应有的制度约束,是“走读”问题滋长的组织原因。[2] +走读干部现象在哪里比较多? 近些年来,我国一些地方的“干部走读”现象较为普遍,社会上对此议走读干部论颇多。 +走读干部现象在哪里比较多? 所谓“干部走读”,一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么早出晚归,要么周一去单位上班、周五回家过周末。 +走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 +走读干部现象在哪里比较多? 干部走读之所以成为“千夫所指”,是因为这种行为增加了行政成本。 +走读干部现象在哪里比较多? 从根子上说,干部走读是城乡发展不平衡的产物,“人往高处走,水往低处流”,有了更加舒适的生活环境,不管是为了自己生活条件改善也好,还是因为子女教育也好,农村人口向城镇转移,这是必然结果。 +走读干部现象在哪里比较多? “干部走读”的另一个重要原因,是干部人事制度改革。 +走读干部现象在哪里比较多? 目前公务员队伍“凡进必考”,考上公务员的大多是学历较高的大学毕业生,这些大学毕业生来自各个全国各地,一部分在本地结婚生子,沉淀下来;一部分把公务员作为跳板,到基层后或考研,或再参加省考、国考,或想办法调回原籍。 +走读干部现象在哪里比较多? 再加上一些下派干部、异地交流任职干部,构成了看似庞大的“走读”队伍。 +走读干部现象在哪里比较多? 那么,“干部走读”有哪些弊端呢? +走读干部现象在哪里比较多? 一是这些干部人在基层,心在城市,缺乏长期作战的思想,工作不安心。 +走读干部现象在哪里比较多? 周一来上班,周五回家转,对基层工作缺乏热情和感情;二是长期在省市直机关工作,对基层工作不熟悉不了解,工作不热心;三是长期走读,基层干群有工作难汇报,有困难难解决,群众不开心;四是干部来回走读,公车私驾,私费公报,把大量的经济负担转嫁给基层;五是对这些走读干部,基层管不了,上级监督难,节假日期间到哪里去、做什么事,基本处于失控和真空状态,各级组织和基层干群不放心。 +走读干部现象在哪里比较多? 特别需要引起警觉的是,由于少数走读干部有临时思想,满足于“当维持会长”,得过且过混日子,热衷于做一些急功近利、砸锅求铁的短期行为和政绩工程,不愿做打基础、管长远的实事好事,甚至怠政、疏政和懒于理政,影响了党和政府各项方针政策措施的落实,导致基层无政府主义、自由主义抬头,削弱了党和政府的领导,等到矛盾激化甚至不可收拾的时候,处理已是来之不及。 +走读干部现象在哪里比较多? 权利要与义务相等,不能只有义务而没有权利,或是只有权利没有义务。 +走读干部现象在哪里比较多? 如何真正彻底解决乡镇干部“走读”的现象呢? +走读干部现象在哪里比较多? 那就必须让乡镇基层干部义务与权利相等。 +走读干部现象在哪里比较多? 如果不能解决基层干部待遇等问题,即使干部住村,工作上也不会有什么进展的。 +走读干部现象在哪里比较多? 所以,在政治上关心,在生活上照顾,在待遇上提高。 +走读干部现象在哪里比较多? 如,提高基层干部的工资待遇,增加通讯、交通补助;帮助解决子女入学及老人赡养问题;提拔干部优先考虑基层干部;干部退休时的待遇至少不低于机关干部等等。 +化州市良光镇东岸小学学风是什么? 学校全体教职工爱岗敬业,团结拼搏,勇于开拓,大胆创新,进行教育教学改革,努力开辟第二课堂的教学路子,并开通了网络校校通的交流合作方式。 +化州市良光镇东岸小学学风是什么? 现学校教师正在为创建安全文明校园而努力。 +化州市良光镇东岸小学学风是什么? 东岸小学位置偏僻,地处贫穷落后,是良光镇最偏远的学校,学校,下辖分教点——东心埇小学,[1]?。 +化州市良光镇东岸小学学风是什么? 学校2011年有教师22人,学生231人。 +化州市良光镇东岸小学学风是什么? 小学高级教师8人,小学一级教师10人,未定级教师4人,大专学历的教师6人,其余的都具有中师学历。 +化州市良光镇东岸小学学风是什么? 全校共设12个班,学校课程按标准开设。 +化州市良光镇东岸小学学风是什么? 东岸小学原来是一所破旧不堪,教学质量非常差的薄弱学校。 +化州市良光镇东岸小学学风是什么? 近几年来,在各级政府、教育部门及社会各界热心人士鼎力支持下,学校领导大胆改革创新,致力提高教学质量和教师水平,并加大经费投入,大大改善了办学条件,使学校由差变好,实现了大跨越。 +化州市良光镇东岸小学学风是什么? 学校建设性方面。 +化州市良光镇东岸小学学风是什么? 东岸小学属于革命老区学校,始建于1980年,从东心埇村祠堂搬到这个校址,1990年建造一幢建筑面积为800平方米的南面教学楼, 1998年老促会支持从北面建造一幢1800平方米的教学大楼。 +化州市良光镇东岸小学学风是什么? 学校在管理方面表现方面颇具特色,实现了各项制度的日常化和规范化。 +化州市良光镇东岸小学学风是什么? 
学校领导有较强的事业心和责任感,讲求民主与合作,勤政廉政,依法治校,树立了服务意识。 +化州市良光镇东岸小学学风是什么? 学校一贯实施“德育为先,以人为本”的教育方针,制定了“团结,律已,拼搏,创新”的校训。 +化州市良光镇东岸小学学风是什么? 教育风为“爱岗敬业,乐于奉献”,学风为“乐学,勤学,巧学,会学”。 +化州市良光镇东岸小学学风是什么? 校内营造了尊师重教的氛围,形成了良好的校风和学风。 +化州市良光镇东岸小学学风是什么? 教师们爱岗敬业,师德高尚,治学严谨,教研教改气氛浓厚,获得喜人的教研成果。 +化州市良光镇东岸小学学风是什么? 近几年来,教师撰写的教育教学论文共10篇获得县市级以上奖励,获了镇级以上奖励的有100人次。 +化州市良光镇东岸小学学风是什么? 学校德育工作成绩显著,多年被评为“安全事故为零”的学校,良光镇先进学校。 +化州市良光镇东岸小学学风是什么? 特别是教学质量大大提高了。 +化州市良光镇东岸小学学风是什么? 这些成绩得到了上级及群众的充分肯定。 +化州市良光镇东岸小学学风是什么? 1.学校环境欠美观有序,学校大门口及校道有待改造。 +化州市良光镇东岸小学学风是什么? 2.学校管理制度有待改进,部分教师业务水平有待提高。 +化州市良光镇东岸小学学风是什么? 3.教师宿舍、教室及学生宿舍欠缺。 +化州市良光镇东岸小学学风是什么? 4.运动场不够规范,各类体育器材及设施需要增加。 +化州市良光镇东岸小学学风是什么? 5.学生活动空间少,见识面窄,视野不够开阔。 +化州市良光镇东岸小学学风是什么? 1.努力营造和谐的教育教学新气氛。 +化州市良光镇东岸小学学风是什么? 建立科学的管理制度,坚持“与时俱进,以人为本”,真正实现领导对教师,教师对学生之间进行“德治与情治”;学校的人文环境做到“文明,和谐,清新”;德育环境做到“自尊,律已,律人”;心理环境做到“安全,谦虚,奋发”;交际环境做到“团结合作,真诚助人”;景物环境做到“宜人,有序。” +化州市良光镇东岸小学学风是什么? 营造学校与育人的新特色。 +我很好奇发射管的输出功率怎么样? 产生或放大高频功率的静电控制电子管,有时也称振荡管。 +我很好奇发射管的输出功率怎么样? 用于音频或开关电路中的发射管称调制管。 +我很好奇发射管的输出功率怎么样? 发射管是无线电广播、通信、电视发射设备和工业高频设备中的主要电子器件。 +我很好奇发射管的输出功率怎么样? 输出功率和工作频率是发射管的基本技术指标。 +我很好奇发射管的输出功率怎么样? 广播、通信和工业设备的发射管,工作频率一般在30兆赫以下,输出功率在1919年为2千瓦以下,1930年达300千瓦,70年代初已超过1000千瓦,效率高达80%以上。 +我很好奇发射管的输出功率怎么样? 发射管工作频率提高时,输出功率和效率都会降低,因此1936年首次实用的脉冲雷达工作频率仅28兆赫,80年代则已达 400兆赫以上。 +我很好奇发射管的输出功率怎么样? 40年代电视发射管的工作频率为数十兆赫,而80年代初,优良的电视发射管可在1000兆赫下工作,输出功率达20千瓦,效率为40%。 +我很好奇发射管的输出功率怎么样? 平面电极结构的小功率发射三极管可在更高的频率下工作。 +我很好奇发射管的输出功率怎么样? 发射管多采用同心圆筒电极结构。 +我很好奇发射管的输出功率怎么样? 阴极在最内层,向外依次为各个栅极和阳极。 +我很好奇发射管的输出功率怎么样? 图中,自左至右为阴极、第一栅、第二栅、栅极阴极组装件和装入阳极后的整个管子。 +我很好奇发射管的输出功率怎么样? 发射管 +我很好奇发射管的输出功率怎么样? 中小功率发射管多采用间热式氧化物阴极。 +我很好奇发射管的输出功率怎么样? 大功率发射管一般采用碳化钍钨丝阴极,有螺旋、直条或网笼等结构形式。 +我很好奇发射管的输出功率怎么样? 图为网笼式阴极。 +我很好奇发射管的输出功率怎么样? 栅极多用钼丝或钨丝绕制,或用钼片经电加工等方法制造。 +我很好奇发射管的输出功率怎么样? 栅极表面经镀金(或铂)或涂敷锆粉等处理,以降低栅极电子发射,使发射管稳定工作。 +我很好奇发射管的输出功率怎么样? 用气相沉积方法制造的石墨栅极,具有良好的性能。 +我很好奇发射管的输出功率怎么样? 发射管阳极直流输入功率转化为高频输出功率的部分约为75%,其余25%成为阳极热损耗,因此对发射管的阳极必须进行冷却。 +我很好奇发射管的输出功率怎么样? 中小功率发射管的阳极采取自然冷却方式,用镍、钼或石墨等材料制造,装在管壳之内,工作温度可达 600℃。 +我很好奇发射管的输出功率怎么样? 大功率发射管的阳极都用铜制成,并作为真空密封管壳的一部分,采用各种强制冷却方式。 +我很好奇发射管的输出功率怎么样? 各种冷却方式下每平方厘米阳极内表面的散热能力为:水冷100瓦;风冷30瓦;蒸发冷却250瓦;超蒸发冷却1000瓦以上,80年代已制成阳极损耗功率为1250千瓦的超蒸发冷却发射管。 +我很好奇发射管的输出功率怎么样? 发射管也常以冷却方式命名,如风冷发射管、水冷发射管和蒸发冷却发射管。 +我很好奇发射管的输出功率怎么样? 发射管管壳用玻璃或陶瓷制造。 +我很好奇发射管的输出功率怎么样? 小功率发射管内使用含钡的吸气剂;大功率发射管则采用锆、钛、钽等吸气材料,管内压强约为10帕量级。 +我很好奇发射管的输出功率怎么样? 发射管寿命取决于阴极发射电子的能力。 +我很好奇发射管的输出功率怎么样? 大功率发射管寿命最高记录可达8万小时。 +我很好奇发射管的输出功率怎么样? 发射四极管的放大作用和输出输入电路间的隔离效果优于三极管,应用最广。 +我很好奇发射管的输出功率怎么样? 工业高频振荡器普遍采用三极管。 +我很好奇发射管的输出功率怎么样? 五极管多用在小功率范围中。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 鲁能领秀城中央公园位于鲁能领秀城景观中轴之上,总占地161.55亩,总建筑面积约40万平米,容积率为2.70,由22栋小高层、高层组成;其绿地率高达35.2%,环境优美,产品更加注重品质化、人性化和自然生态化,是鲁能领秀城的生态人居典范。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 中央公园[1] 学区准现房,坐享鲁能领秀城成熟配套,成熟生活一步到位。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 经典板式小高层,103㎡2+1房仅22席,稀市推出,错过再无;92㎡经典两房、137㎡舒适三房压轴登场! +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 物业公司: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 济南凯瑞物业公司;深圳长城物业公司;北京盛世物业有限公司 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 绿化率: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 42% +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 容积率: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 2.70 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 暖气: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 集中供暖 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 楼座展示:中央公园由22栋小高层、高层组成,3、16、17号楼分别是11层小高层,18层和28层的高层。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 4号楼是23层,2梯3户。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 项目位置: +鬼青蛙在哪里有收录详情? 鬼青蛙这张卡可以从手卡把这张卡以外的1只水属性怪兽丢弃,从手卡特殊召唤。 +鬼青蛙在哪里有收录详情? 这张卡召唤·反转召唤·特殊召唤成功时,可以从自己的卡组·场上选1只水族·水属性·2星以下的怪兽送去墓地。 +鬼青蛙在哪里有收录详情? 此外,1回合1次,可以通过让自己场上1只怪兽回到手卡,这个回合通常召唤外加上只有1次,自己可以把「鬼青蛙」以外的1只名字带有「青蛙」的怪兽召唤。[1] +鬼青蛙在哪里有收录详情? 游戏王卡包收录详情 +鬼青蛙在哪里有收录详情? [09/09/18] +西湖区有多大? 西湖区是江西省南昌市市辖区。 +西湖区有多大? 为南昌市中心城区之一,有着2200多年历史,是一个物华天宝、人杰地灵的古老城区。 +西湖区有多大? 
2004年南昌市老城区区划调整后,西湖区东起京九铁路线与青山湖区毗邻,南以洪城路东段、抚河路南段、象湖以及南隔堤为界与青云谱区、南昌县接壤,西凭赣江中心线与红谷滩新区交界,北沿中山路、北京西路与东湖区相连,所辖面积34.5平方公里,常住人口43万,管辖1个镇、10个街道办事处,设12个行政村、100个社区。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖原为汉代豫章群古太湖的一部分,唐贞元15年(公元799年)洪恩桥的架设将东太湖分隔成东西两部分,洪恩桥以西谓之西湖,西湖区由此而得名。 +西湖区有多大? 西湖区在1926年南昌设市后分别称第四、五部分,六、七部分。 +西湖区有多大? 1949年解放初期分别称第三、四区。 +西湖区有多大? 1955年分别称抚河区、西湖区。 +西湖区有多大? 1980年两区合并称西湖区。[1] +西湖区有多大? 辖:西湖街道、丁公路街道、广外街道、系马桩街道、绳金塔街道、朝阳洲街道、禾草街街道、十字街街道、瓦子角街道、三眼井街道、上海路街道、筷子巷街道、南站街道。[1] +西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 +西湖区有多大? 2002年12月1日设立桃源街道。 +西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。[1] +西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 +西湖区有多大? 2002年12月1日设立桃源街道。 +西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。 +西湖区有多大? 2004年9月7日,国务院批准(国函[2004]70号)调整南昌市市辖区部分行政区划:将西湖区朝阳洲街道的西船居委会划归东湖区管辖。 +西湖区有多大? 将青山湖区的桃花镇和湖坊镇的同盟村划归西湖区管辖。 +西湖区有多大? 将西湖区十字街街道的谷市街、洪城路、南关口、九四、新丰5个居委会,上海路街道的草珊瑚集团、南昌肠衣厂、电子计算机厂、江西涤纶厂、江地基础公司、曙光、商标彩印厂、南昌市染整厂、江南蓄电池厂、四机床厂、二进、国乐新村12个居委会,南站街道的解放西路东居委会划归青云谱区管辖。 +西湖区有多大? 将西湖区上海路街道的轻化所、洪钢、省人民检察院、电信城东分局、安康、省机械施工公司、省水利设计院、省安装公司、南方电动工具厂、江西橡胶厂、上海路北、南昌电池厂、东华计量所、南昌搪瓷厂、上海路新村、华安针织总厂、江西五金厂、三波电机厂、水文地质大队、二六○厂、省卫生学校、新世纪、上海路住宅区北、塔子桥北、南航、上海路住宅区南、沿河、南昌阀门厂28个居委会,丁公路街道的新魏路、半边街、师大南路、顺化门、岔道口东路、师大、广电厅、手表厂、鸿顺9个居委会,南站街道的工人新村北、工人新村南、商苑、洪都中大道、铁路第三、铁路第四、铁路第六7个居委会划归青山湖区管辖。 +西湖区有多大? 调整后,西湖区辖绳金塔、桃源、朝阳洲、广润门、南浦、西湖、系马桩、十字街、丁公路、南站10个街道和桃花镇,区人民政府驻孺子路。 +西湖区有多大? 调整前,西湖区面积31平方千米,人口52万。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖区位于江西省省会南昌市的中心地带,具有广阔的发展空间和庞大的消费群体,商贸旅游、娱乐服务业等到各个行业都蕴藏着无限商机,投资前景十分广阔。 +西湖区有多大? 不仅水、电价格低廉,劳动力资源丰富,人均工资和房产价格都比沿海城市低,城区拥有良好的人居环境、低廉的投资成本,巨大的发展潜力。 +西湖区有多大? 105、316、320国道和京九铁路贯穿全境,把南北东西交通连成一线;民航可与上海、北京、广州、深圳、厦门、温州等到地通航,并开通了南昌-新加坡第一条国际航线;水运依托赣江可直达长江各港口;邮电通讯便捷,程控电话、数字微波、图文传真进入国际通讯网络;商检、海关、口岸等涉外机构齐全;水、电、气供应充足。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖区,是江西省省会南昌市的中心城区,面积34.8平方公里,常住人口51.9万人,辖桃花镇、朝农管理处及10个街道,设13个行政村,116个社区居委会,20个家委会。[2] +西湖区有多大? 2005年11月16日,南昌市《关于同意西湖区桃花镇、桃源、十字街街道办事处行政区划进行调整的批复》 +西湖区有多大? 1、同意将桃花镇的三道闸居委会划归桃源街道办事处管辖。 +青藏虎耳草花期什么时候? 青藏虎耳草多年生草本,高4-11.5厘米,丛生。 +青藏虎耳草花期什么时候? 花期7-8月。 +青藏虎耳草花期什么时候? 分布于甘肃(祁连山地)、青海(黄南、海南、海北)和西藏(加查)。 +青藏虎耳草花期什么时候? 生于海拔3 700-4 250米的林下、高山草甸和高山碎石隙。[1] +青藏虎耳草花期什么时候? 多年生草本,高4-11.5厘米,丛生。 +青藏虎耳草花期什么时候? 茎不分枝,具褐色卷曲柔毛。 +青藏虎耳草花期什么时候? 基生叶具柄,叶片卵形、椭圆形至长圆形,长15-25毫米,宽4-8毫米,腹面无毛,背面和边缘具褐色卷曲柔毛,叶柄长1-3厘米,基部扩大,边缘具褐色卷曲柔毛;茎生叶卵形至椭圆形,长1.5-2厘米,向上渐变小。 +青藏虎耳草花期什么时候? 聚伞花序伞房状,具2-6花;花梗长5-19毫米,密被褐色卷曲柔毛;萼片在花期反曲,卵形至狭卵形,长2.5-4.2毫米,宽1.5-2毫米,先端钝,两面无毛,边缘具褐色卷曲柔毛,3-5脉于先端不汇合;花瓣腹面淡黄色且其中下部具红色斑点,背面紫红色,卵形、狭卵形至近长圆形,长2.5-5.2毫米,宽1.5-2.1毫米,先端钝,基部具长0.5-1毫米之爪,3-5(-7)脉,具2痂体;雄蕊长2-3.6毫米,花丝钻形;子房半下位,周围具环状花盘,花柱长1-1.5毫米。 +青藏虎耳草花期什么时候? 生于高山草甸、碎石间。 +青藏虎耳草花期什么时候? 分布青海、西藏、甘肃、四川等地。 +青藏虎耳草花期什么时候? [1] +青藏虎耳草花期什么时候? 顶峰虎耳草Saxifraga cacuminum Harry Sm. +青藏虎耳草花期什么时候? 对叶虎耳Saxifraga contraria Harry Sm. +青藏虎耳草花期什么时候? 狭瓣虎耳草Saxifraga pseudohirculus Engl. +青藏虎耳草花期什么时候? 唐古特虎耳草Saxifraga tangutica Engl. +青藏虎耳草花期什么时候? 宽叶虎耳草(变种)Saxifraga tangutica Engl. var. platyphylla (Harry Sm.) J. T. Pan +青藏虎耳草花期什么时候? 唐古特虎耳草(原变种)Saxifraga tangutica Engl. var. tangutica +青藏虎耳草花期什么时候? 西藏虎耳草Saxifraga tibetica Losinsk.[1] +青藏虎耳草花期什么时候? Saxifraga przewalskii Engl. in Bull. Acad. Sci. St. -Petersb. 29:115. 1883: Engl et Irmsch. in Bot. Jahrb. 48:580. f. 5E-H. 1912 et in Engl. Pflanzenr. 67(IV. 117): 107. f. 21 E-H. 1916; J. T. Pan in Acta Phytotax. Sin. 16(2): 16. 1978;中国高等植物图鉴补编2: 30. 1983; 西藏植物志 2: 483. 1985. [1] +生产一支欧文冲锋枪需要多少钱? 
欧文冲锋枪 Owen Gun 1945年,在新不列颠手持欧文冲锋枪的澳大利亚士兵 类型 冲锋枪 原产国 ?澳大利亚 服役记录 服役期间 1941年-1960年代 用户 参见使用国 参与战役 第二次世界大战 马来亚紧急状态 朝鲜战争 越南战争 1964年罗德西亚布什战争 生产历史 研发者 伊夫林·欧文(Evelyn Owen) 研发日期 1931年-1939年 生产商 约翰·莱萨特工厂 利特高轻武器工厂 单位制造费用 $ 30/枝 生产日期 1941年-1945年 制造数量 45,000-50,000 枝 衍生型 Mk 1/42 Mk 1/43 Mk 2/43 基本规格 总重 空枪: Mk 1/42:4.24 千克(9.35 磅) Mk 1/43:3.99 千克(8.8 磅) Mk 2/43:3.47 千克(7.65 磅) 全长 806 毫米(31.73 英吋) 枪管长度 247 毫米(9.72 英吋) 弹药 制式:9 × 19 毫米 原型:.38/200 原型:.45 ACP 口径 9 × 19 毫米:9 毫米(.357 英吋) .38/200:9.65 毫米(.38 英吋) .45 ACP:11.43 毫米(.45 英吋) 枪管 1 根,膛线7 条,右旋 枪机种类 直接反冲作用 开放式枪机 发射速率 理论射速: Mk 1/42:700 发/分钟 Mk 1/43:680 发/分钟 Mk 2/43:600 发/分钟 实际射速:120 发/分钟 枪口初速 380-420 米/秒(1,246.72-1,377.95 英尺/秒) 有效射程 瞄具装定射程:91.44 米(100 码) 最大有效射程:123 米(134.51 码) 最大射程 200 米(218.72 码) 供弹方式 32/33 发可拆卸式弹匣 瞄准具型式 机械瞄具:向右偏置的觇孔式照门和片状准星 欧文冲锋枪(英语:Owen Gun,正式名称:Owen Machine Carbine,以下简称为“欧文枪”)是一枝由伊夫林·(埃沃)·欧文(英语:Evelyn (Evo) Owen)于1939年研制、澳大利亚的首枝冲锋枪,制式型发射9 × 19 毫米鲁格手枪子弹。 +生产一支欧文冲锋枪需要多少钱? 欧文冲锋枪是澳大利亚唯一设计和主要服役的二战冲锋枪,并从1943年由澳大利亚陆军所使用,直到1960年代中期。 +生产一支欧文冲锋枪需要多少钱? 由新南威尔士州卧龙岗市出身的欧文枪发明者,伊夫林·欧文,在24岁时于1939年7月向悉尼维多利亚军营的澳大利亚陆军军械官员展示了他所设计的.22 LR口径“卡宾机枪”原型枪。 +生产一支欧文冲锋枪需要多少钱? 该枪却被澳大利亚陆军所拒绝,因为澳大利亚陆军在当时没有承认冲锋枪的价值。 +生产一支欧文冲锋枪需要多少钱? 随着战争的爆发,欧文加入了澳大利亚军队,并且成为一名列兵。 +生产一支欧文冲锋枪需要多少钱? 1940年9月,欧文的邻居,文森特·沃德尔(英语:Vincent Wardell),看到欧文家楼梯后面搁著一个麻布袋,里面放著一枝欧文枪的原型枪。 +生产一支欧文冲锋枪需要多少钱? 而文森特·沃德尔是坎布拉港的大型钢制品厂莱萨特公司的经理,他向欧文的父亲表明了他对其儿子的粗心大意感到痛心,但无论如何仍然解释了这款武器的历史。 +生产一支欧文冲锋枪需要多少钱? 沃德尔对欧文枪的简洁的设计留下了深刻的印象。 +生产一支欧文冲锋枪需要多少钱? 沃德尔安排欧文转调到陆军发明部(英语:Army Inventions Board),并重新开始在枪上的工作。 +生产一支欧文冲锋枪需要多少钱? 军队仍然持续地从负面角度查看该武器,但同时政府开始采取越来越有利的观点。 +生产一支欧文冲锋枪需要多少钱? 该欧文枪原型配备了装在顶部的弹鼓,后来让位给装在顶部的弹匣使用。 +生产一支欧文冲锋枪需要多少钱? 口径的选择亦花了一些时间去解决。 +生产一支欧文冲锋枪需要多少钱? 由于陆军有大批量的柯尔特.45 ACP子弹,它们决定欧文枪需要采用这种口径。 +生产一支欧文冲锋枪需要多少钱? 直到在1941年9月19日官方举办试验时,约翰·莱萨特工厂制成了9 毫米、.38/200和.45 ACP三种口径版本。 +生产一支欧文冲锋枪需要多少钱? 而从美、英进口的斯登冲锋枪和汤普森冲锋枪在试验中作为基准使用。 +生产一支欧文冲锋枪需要多少钱? 作为测试的一部分,所有的枪支都浸没在泥浆里,并以沙土覆盖,以模拟他们将会被使用时最恶劣的环境。 +生产一支欧文冲锋枪需要多少钱? 欧文枪是唯一在这测试中这样对待以后仍可正常操作的冲锋枪。 +生产一支欧文冲锋枪需要多少钱? 虽然测试表现出欧文枪具有比汤普森冲锋枪和司登冲锋枪更优秀的可靠性,陆军没有对其口径作出决定。 +生产一支欧文冲锋枪需要多少钱? 结果它在上级政府干预以后,陆军才下令9 毫米的衍生型为正式口径,并在1941年11月20日正式被澳大利亚陆军采用。 +生产一支欧文冲锋枪需要多少钱? 在欧文枪的寿命期间,其可靠性在澳大利亚部队中赢得了“军人的至爱”(英语:Digger's Darling)的绰号,亦有人传言它受到美军高度青睐。 +生产一支欧文冲锋枪需要多少钱? 欧文枪是在1942年开始正式由坎布拉港和纽卡斯尔的约翰·莱萨特工厂投入生产,在生产高峰期每个星期生产800 支。 +生产一支欧文冲锋枪需要多少钱? 1942年3月至1943年2月之间,莱萨特生产了28,000 枝欧文枪。 +生产一支欧文冲锋枪需要多少钱? 然而,最初的一批弹药类型竟然是错误的,以至10,000 枝欧文枪无法提供弹药。 +生产一支欧文冲锋枪需要多少钱? 政府再一次推翻军方的官僚主义作风??,并让弹药通过其最后的生产阶段,以及运送到当时在新几内亚与日军战斗的澳大利亚部队的手中。 +生产一支欧文冲锋枪需要多少钱? 在1941年至1945年间生产了约50,000 枝欧文枪。 +生产一支欧文冲锋枪需要多少钱? 在战争期间,欧文枪的平均生产成本为$ 30。[1] +生产一支欧文冲锋枪需要多少钱? 虽然它是有点笨重,因为其可靠性,欧文枪在士兵当中变得非常流行。 +生产一支欧文冲锋枪需要多少钱? 它是如此成功,它也被新西兰、英国和美国订购。[2] +生产一支欧文冲锋枪需要多少钱? 欧文枪后来也被澳大利亚部队在朝鲜战争和越南战争,[3]特别是步兵组的侦察兵。 +生产一支欧文冲锋枪需要多少钱? 这仍然是一枝制式的澳大利亚陆军武器,直到1960年代中期,它被F1冲锋枪所取代。 +第二届中国光伏摄影大赛因为什么政策而开始的? 光伏发电不仅是全球能源科技和产业发展的重要方向,也是我国具有国际竞争优势的战略性新兴产业,是我国保障能源安全、治理环境污染、应对气候变化的战略性选择。 +第二届中国光伏摄影大赛因为什么政策而开始的? 2013年7月以来,国家出台了《关于促进光伏产业健康发展的若干意见》等一系列政策,大力推进分布式光伏发电的应用,光伏发电有望走进千家万户,融入百姓民生。 +第二届中国光伏摄影大赛因为什么政策而开始的? 大赛主办方以此为契机,开启了“第二届中国光伏摄影大赛”的征程。 +悬赏任务有哪些类型? 悬赏任务,威客网站上一种任务模式,由雇主在威客网站发布任务,提供一定数额的赏金,以吸引威客们参与。 +悬赏任务有哪些类型? 悬赏任务数额一般在几十到几千不等,但也有几万甚至几十万的任务。 +悬赏任务有哪些类型? 主要以提交的作品的质量好坏作为中标标准,当然其中也带有雇主的主观喜好,中标人数较少,多为一个或几个,因此竞争激烈。 +悬赏任务有哪些类型? 大型悬赏任务赏金数额巨大,中标者也较多,但参与人也很多,对于身有一技之长的威客来讲,悬赏任务十分适合。 +悬赏任务有哪些类型? 悬赏任务的类型主要包括:设计类、文案类、取名类、网站类、编程类、推广类等等。 +悬赏任务有哪些类型? 每一类所适合的威客人群不同,报酬的多少也不同,比如设计类的报酬就比较高,一般都几百到几千,而推广类的计件任务报酬比较少,一般也就几块钱,但花费的时间很少,技术要求也很低。 +悬赏任务有哪些类型? 1.注册—登陆 +悬赏任务有哪些类型? 2.点击“我要发悬赏”—按照发布流程及提示提交任务要求。 +悬赏任务有哪些类型? 悬赏模式选择->网站托管赏金模式。 +悬赏任务有哪些类型? 威客网站客服稍后会跟发布者联系确认任务要求。 +悬赏任务有哪些类型? 3.没有问题之后就可以预付赏金进行任务发布。 +悬赏任务有哪些类型? 
4.会员参与并提交稿件。 +悬赏任务有哪些类型? 5.发布者需要跟会员互动(每个提交稿件的会员都可以),解决问题,完善稿件,初步筛选稿件。 +悬赏任务有哪些类型? 6.任务发布期结束,进入选稿期(在筛选的稿件中选择最后满意的) +悬赏任务有哪些类型? 7.发布者不满意现有稿件可选定一个会员修改至满意为止,或者加价延期重新开放任务进行征稿。 +悬赏任务有哪些类型? (重复第六步)没有问题后进入下一步。 +悬赏任务有哪些类型? 8:中标会员交源文件给发布者—发布者确认—任务结束—网站将赏金付给中标会员。 +悬赏任务有哪些类型? 1、任务发布者自由定价,自由确定悬赏时间,自由发布任务要求,自主确定中标会员和中标方案。 +悬赏任务有哪些类型? 2、任务发布者100%预付任务赏金,让竞标者坚信您的诚意和诚信。 +悬赏任务有哪些类型? 3、任务赏金分配原则:任务一经发布,网站收取20%发布费,中标会员获得赏金的80%。 +悬赏任务有哪些类型? 4、每个任务最终都会选定至少一个作品中标,至少一个竞标者获得赏金。 +悬赏任务有哪些类型? 5、任务发布者若未征集到满意作品,可以加价延期征集,也可让会员修改,会员也可以删除任务。 +悬赏任务有哪些类型? 6、任务发布者自己所在组织的任何人均不能以任何形式参加自己所发布的任务,一经发现则视为任务发布者委托威客网按照网站规则选稿。 +悬赏任务有哪些类型? 7、任务悬赏总金额低于100元(含100元)的任务,悬赏时间最多为7天。 +悬赏任务有哪些类型? 所有任务最长时间不超过30天(特殊任务除外),任务总金额不得低于50元。 +悬赏任务有哪些类型? 8、网赚类、注册类任务总金额不能低于300元人民币,计件任务每个稿件的平均单价不能低于1元人民币。 +悬赏任务有哪些类型? 9、延期任务只有3次加价机会,第1次加价不得低于任务金额的10%,第2次加价不得低于任务总金额的20%,第3次不得低于任务总金额的50%。 +悬赏任务有哪些类型? 每次延期不能超过15天,加价金额不低于50元,特殊任务可以适当加长。 +悬赏任务有哪些类型? 如果为计件任务,且不是网赚类任务,将免费延期,直至征集完规定数量的作品为止。 +悬赏任务有哪些类型? 10、如果威客以交接源文件要挟任务发布者,威客网将扣除威客相关信用值,并取消其中标资格,同时任务将免费延长相应的时间继续征集作品 。 +江湖令由哪些平台运营? 《江湖令》是以隋唐时期为背景的RPG角色扮演类网页游戏。 +江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作。 +江湖令由哪些平台运营? 由ya247平台、91wan游戏平台、2918、4399游戏平台、37wan、6711、兄弟玩网页游戏平台,49you、Y8Y9平台、8090游戏等平台运营的,由07177游戏网发布媒体资讯的网页游戏。 +江湖令由哪些平台运营? 网页游戏《江湖令》由51游戏社区运营,是以隋唐时期为背景的RPG角色扮演类网页游戏。 +江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作… +江湖令由哪些平台运营? 背景故事: +江湖令由哪些平台运营? 隋朝末年,隋炀帝暴政,天下民不聊生,义军四起。 +江湖令由哪些平台运营? 在这动荡的时代中,百姓生活苦不堪言,多少人流离失所,家破人亡。 +江湖令由哪些平台运营? 天下三大势力---飞羽营、上清宫、侠隐岛,也值此机会扩张势力,派出弟子出来行走江湖。 +江湖令由哪些平台运营? 你便是这些弟子中的普通一员,在这群雄并起的年代,你将如何选择自己的未来。 +江湖令由哪些平台运营? 所有的故事,便从瓦岗寨/江都大营开始…… +江湖令由哪些平台运营? 势力: +江湖令由哪些平台运营? ①、飞羽营:【外功、根骨】 +江湖令由哪些平台运营? 南北朝时期,由北方政权创立的一个民间军事团体,经过多年的发展,逐渐成为江湖一大势力。 +江湖令由哪些平台运营? ②、上清宫:【外功、身法】 +江湖令由哪些平台运营? 道家圣地,宫中弟子讲求清静无为,以一种隐世的方式修炼,但身在此乱世,亦也不能独善其身。 +江湖令由哪些平台运营? ③、侠隐岛:【根骨、内力】 +江湖令由哪些平台运营? 位于偏远海岛上的一个世家,岛内弟子大多武功高强,但从不进入江湖行走,适逢乱世,现今岛主也决意作一翻作为。 +江湖令由哪些平台运营? 两大阵营: +江湖令由哪些平台运营? 义军:隋唐末期,百姓生活苦不堪言,有多个有志之士组成义军,对抗当朝暴君,希望建立一个适合百姓安居乐业的天地。 +江湖令由哪些平台运营? 隋军:战争一起即天下打乱,隋军首先要镇压四起的义军,同时在内部慢慢改变现有的朝廷,让天下再次恢复到昔日的安定。 +江湖令由哪些平台运营? 一、宠物品质 +江湖令由哪些平台运营? 宠物的品质分为:灵兽,妖兽,仙兽,圣兽,神兽 +江湖令由哪些平台运营? 二、宠物获取途径 +江湖令由哪些平台运营? 完成任务奖励宠物(其他途径待定)。 +江湖令由哪些平台运营? 三、宠物融合 +江湖令由哪些平台运营? 1、在主界面下方的【宠/骑】按钮进入宠物界面,再点击【融合】即可进入融合界面进行融合,在融合界面可选择要融合的宠物进行融合 +江湖令由哪些平台运营? 2、融合后主宠的形态不变; +江湖令由哪些平台运营? 3、融合后宠物的成长,品质,技能,经验,成长经验,等级都继承成长高的宠物; +江湖令由哪些平台运营? 4、融合宠物技能冲突,则保留成长值高的宠物技能,如果不冲突则叠加在空余的技能位置。 +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(土耳其文:Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称“土超”,也是土耳其足球联赛中最高级别。 +请问土耳其足球超级联赛是什么时候成立的? 目前,土超联赛队伍共有18支。 +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛 +请问土耳其足球超级联赛是什么时候成立的? 运动项目 足球 +请问土耳其足球超级联赛是什么时候成立的? 成立年份 1959年 +请问土耳其足球超级联赛是什么时候成立的? 参赛队数 18队 +请问土耳其足球超级联赛是什么时候成立的? 国家 土耳其 +请问土耳其足球超级联赛是什么时候成立的? 现任冠军 费内巴切足球俱乐部(2010-2011) +请问土耳其足球超级联赛是什么时候成立的? 夺冠最多队伍 费内巴切足球俱乐部(18次) +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称「土超」,也是土耳其足球联赛中最高级别。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛队伍共有18支。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立于1959年,成立之前土耳其国有多个地区性联赛。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立后便把各地方联赛制度统一起来。 +请问土耳其足球超级联赛是什么时候成立的? 一般土超联赛由八月开始至五月结束,12月至1月会有歇冬期。 +请问土耳其足球超级联赛是什么时候成立的? 十八支球队会互相对叠,各有主场和作客两部分,采计分制。 +请问土耳其足球超级联赛是什么时候成立的? 联赛枋最底的三支球队会降到土耳其足球甲级联赛作赛。 +请问土耳其足球超级联赛是什么时候成立的? 由2005-06年球季起,土超联赛的冠、亚军会取得参加欧洲联赛冠军杯的资格。 +请问土耳其足球超级联赛是什么时候成立的? 成立至今土超联赛乃由两支著名球会所垄断──加拉塔萨雷足球俱乐部和费内巴切足球俱乐部,截至2009-2010赛季,双方各赢得冠军均为17次。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛共有18支球队,采取双循环得分制,每场比赛胜方得3分,负方0分,平局双方各得1分。 +请问土耳其足球超级联赛是什么时候成立的? 如果两支球队积分相同,对战成绩好的排名靠前,其次按照净胜球来决定;如果有三支以上的球队分数相同,则按照以下标准来确定排名:1、几支队伍间对战的得分,2、几支队伍间对战的净胜球数,3、总净胜球数。 +请问土耳其足球超级联赛是什么时候成立的? 
联赛第1名直接参加下个赛季冠军杯小组赛,第2名参加下个赛季冠军杯资格赛第三轮,第3名进入下个赛季欧洲联赛资格赛第三轮,第4名进入下个赛季欧洲联赛资格赛第二轮,最后三名降入下个赛季的土甲联赛。 +请问土耳其足球超级联赛是什么时候成立的? 该赛季的土耳其杯冠军可参加下个赛季欧洲联赛资格赛第四轮,如果冠军已获得冠军杯资格,则亚军可参加下个赛季欧洲联赛资格赛第四轮,否则名额递补给联赛。 +请问土耳其足球超级联赛是什么时候成立的? 2010年/2011年 费内巴切 +请问土耳其足球超级联赛是什么时候成立的? 2009年/2010年 布尔萨体育(又译贝莎) +请问土耳其足球超级联赛是什么时候成立的? 2008年/2009年 贝西克塔斯 +请问土耳其足球超级联赛是什么时候成立的? 2007年/2008年 加拉塔萨雷 +请问土耳其足球超级联赛是什么时候成立的? 2006年/2007年 费内巴切 +请问土耳其足球超级联赛是什么时候成立的? 2005年/2006年 加拉塔沙雷 +请问土耳其足球超级联赛是什么时候成立的? 2004年/2005年 费内巴切(又译费伦巴治) +请问土耳其足球超级联赛是什么时候成立的? 2003年/2004年 费内巴切 +cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 +cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 +cid 作Customer IDentity解时是什么意思? ? (英)刑事调查局,香港警察的重案组 +cid 作Customer IDentity解时是什么意思? ? Criminal Investigation Department +cid 作Customer IDentity解时是什么意思? ? 佩枪: +cid 作Customer IDentity解时是什么意思? ? 香港警察的CID(刑事侦缉队),各区重案组的探员装备短管点38左轮手枪,其特点是便于收藏,而且不容易卡壳,重量轻,其缺点是装弹量少,只有6发,而且换子弹较慢,威力也一般,如果碰上54式手枪或者M9手枪明显处于下风。 +cid 作Customer IDentity解时是什么意思? ? 香港警察的“刑事侦查”(Criminal Investigation Department)部门,早于1983年起已经不叫做C.I.D.的了,1983年香港警察队的重整架构,撤销了C.I.D. ( Criminal Investigation Dept.) “刑事侦缉处”,将“刑事侦查”部门归入去“行动处”内,是“行动处”内的一个分支部门,叫“刑事部”( Crime Wing )。 +cid 作Customer IDentity解时是什么意思? ? 再于90年代的一次警队重整架构,香港警队成立了新的「刑事及保安处」,再将“刑事侦查”部门归入目前的「刑事及保安处」的“处”级单位,是归入这个“处”下的一个部门,亦叫“刑事部” ( Crime Wing ),由一个助理警务处长(刑事)领导。 +cid 作Customer IDentity解时是什么意思? ? 但是时至今天,CID虽已经是一个老旧的名称,香港市民、甚至香港警察都是习惯性的沿用这个历史上的叫法 . +cid 作Customer IDentity解时是什么意思? ? CID格式是美国Adobe公司发表的最新字库格式,它具有易扩充、速度快、兼容性好、简便、灵活等特点,已成为国内开发中文字库的热点,也为用户使用字库提供质量更好,数量更多的字体。 +cid 作Customer IDentity解时是什么意思? ? CID (Character identifier)就是字符识别码,在组成方式上分成CIDFont,CMap表两部分。 +cid 作Customer IDentity解时是什么意思? ? CIDFont文件即总字符集,包括了一种特定语言中所有常用的字符,把这些字符排序,它们在总字符集中排列的次序号就是各个字符的CID标识码(Index);CMap(Character Map)表即字符映像文件,将字符的编码(Code)映像到字符的CID标识码(Index)。 +cid 作Customer IDentity解时是什么意思? ? CID字库完全针对大字符集市场设计,其基本过程为:先根据Code,在CMap表查到Index,然后在CIDFont文件找到相应的字形数据。 +本町位于什么地方? 本条目记述台湾日治时期,各都市之本町。 +本町位于什么地方? 为台湾日治时期台北市之行政区,共分一~四丁目,在表町之西。 +本町位于什么地方? 以现在的位置来看,本町位于现台北市中正区的西北角,约位于忠孝西路一段往西至台北邮局东侧。 +本町位于什么地方? 再向南至开封街一段,沿此路线向西至开封街一段60号,顺60号到汉口街一段向东到现在华南银行总行附近画一条直线到衡阳路。 +本町位于什么地方? 再向东至重庆南路一段,由重庆南路一段回到原点这个范围内。 +本町位于什么地方? 另外,重庆南路一段在当时名为“本町通”。 +本町位于什么地方? 此地方自日治时期起,就是繁华的商业地区,当时也有三和银行、台北专卖分局、日本石油等重要商业机构。 +本町位于什么地方? 其中,专卖分局是战后二二八事件的主要起始点。 +本町位于什么地方? 台湾贮蓄银行(一丁目) +本町位于什么地方? 三和银行(二丁目) +本町位于什么地方? 专卖局台北分局(三丁目) +本町位于什么地方? 日本石油(四丁目) +本町位于什么地方? 为台湾日治时期台南市之行政区。 +本町位于什么地方? 范围包括清代旧街名枋桥头前、枋桥头后、鞋、草花、天公埕、竹仔、下大埕、帽仔、武馆、统领巷、大井头、内宫后、内南町。 +本町位于什么地方? 为清代台南城最繁华的区域。 +本町位于什么地方? 台南公会堂 +本町位于什么地方? 北极殿 +本町位于什么地方? 开基武庙 +本町位于什么地方? 町名改正 +本町位于什么地方? 这是一个与台湾相关的小作品。 +本町位于什么地方? 你可以通过编辑或修订扩充其内容。 +《行走的观点:埃及》的条形码是多少? 出版社: 上海社会科学院出版社; 第1版 (2006年5月1日) +《行走的观点:埃及》的条形码是多少? 丛书名: 时代建筑视觉旅行丛书 +《行走的观点:埃及》的条形码是多少? 条形码: 9787806818640 +《行走的观点:埃及》的条形码是多少? 尺寸: 18 x 13.1 x 0.7 cm +《行走的观点:埃及》的条形码是多少? 重量: 181 g +《行走的观点:埃及》的条形码是多少? 漂浮在沙与海市蜃楼之上的金字塔曾经是否是你的一个梦。 +《行走的观点:埃及》的条形码是多少? 埃及,这片蕴蓄了5000年文明的土地,本书为你撩开它神秘的纱。 +《行走的观点:埃及》的条形码是多少? 诸神、金字塔、神庙、狮身人面像、法老、艳后吸引着我们的注意力;缠绵悱恻的象形文字、医学、雕刻等留给我们的文明,不断引发我们对古代文明的惊喜和赞叹。 +《行走的观点:埃及》的条形码是多少? 尼罗河畔的奇异之旅,数千年的古老文明,尽收在你的眼底…… +《行走的观点:埃及》的条形码是多少? 本书集历史、文化、地理等知识于一体,并以优美、流畅文笔,简明扼要地阐述了埃及的地理环境、政治经济、历史沿革、文化艺术,以大量富有艺术感染力的彩色照片,生动形象地展示了埃及最具特色的名胜古迹、风土人情和自然风光。 +《行走的观点:埃及》的条形码是多少? 古埃及历史 +老挝人民军的工兵部队有几个营? 
老挝人民军前身为老挝爱国战线领导的“寮国战斗部队”(即“巴特寮”),始建于1949年1月20日,1965年10月改名为老挝人民解放军,1982年7月改称现名。 +老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会,朱马里·赛雅颂任主席,隆再·皮吉任国防部长。 +老挝人民军的工兵部队有几个营? 实行义务兵役制,服役期最少18个月。[1] +老挝人民军的工兵部队有几个营? ?老挝军队在老挝社会中有较好的地位和保障,工资待遇比地方政府工作人员略高。 +老挝人民军的工兵部队有几个营? 武装部队总兵力约6万人,其中陆军约5万人,主力部队编为5个步兵师;空军2000多人;海军(内河巡逻部队)1000多人;部队机关院校5000人。[1] +老挝人民军的工兵部队有几个营? 老挝人民军军旗 +老挝人民军的工兵部队有几个营? 1991年8月14日通过的《老挝人民民主共和国宪法》第11条规定:国家执行保卫国防和维护社会安宁的政策。 +老挝人民军的工兵部队有几个营? 全体公民和国防力量、治安力量必须发扬忠于祖国、忠于人民的精神,履行保卫革命成果、保卫人民生命财产及和平劳动的任务,积极参加国家建设事业。 +老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会。 +老挝人民军的工兵部队有几个营? 主席由老挝人民革命党中央委员会总书记兼任。 +老挝人民军的工兵部队有几个营? 老挝陆军成立最早,兵力最多,约有5万人。 +老挝人民军的工兵部队有几个营? 其中主力部队步兵师5个、7个独立团、30多个营、65个独立连。 +老挝人民军的工兵部队有几个营? 地方部队30余个营及县属部队。 +老挝人民军的工兵部队有几个营? 地面炮兵2个团,10多个营。 +老挝人民军的工兵部队有几个营? 高射炮兵1个团9个营。 +老挝人民军的工兵部队有几个营? 导弹部队2个营。 +老挝人民军的工兵部队有几个营? 装甲兵7个营。 +老挝人民军的工兵部队有几个营? 特工部队6个营。 +老挝人民军的工兵部队有几个营? 通讯部队9个营。 +老挝人民军的工兵部队有几个营? 工兵部队6个营。 +老挝人民军的工兵部队有几个营? 基建工程兵2个团13个营。 +老挝人民军的工兵部队有几个营? 运输部队7个营。 +老挝人民军的工兵部队有几个营? 陆军的装备基本是中国和前苏联援助的装备和部分从抗美战争中缴获的美式装备。 +老挝人民军的工兵部队有几个营? 老挝内河部队总兵力约1700人,装备有内河船艇110多艘,编成4个艇队。 +老挝人民军的工兵部队有几个营? 有芒宽、巴能、纳坎、他曲、南盖、巴色等8个基地。 +老挝人民军的工兵部队有几个营? 空军于1975年8月组建,现有2个团、11个飞行大队,总兵力约2000人。 +老挝人民军的工兵部队有几个营? 装备有各种飞机140架,其中主要由前苏联提供和从万象政权的皇家空军手中接管。 +老挝人民军的工兵部队有几个营? 随着军队建设质量的提高,老挝人民军对外军事合作步伐也日益扩大,近年来先后与俄罗斯、印度、马来西亚、越南、菲律宾等国拓展了军事交流与合作的内容。 +老挝人民军的工兵部队有几个营? 2003年1月,印度决定向老挝援助一批军事装备和物资,并承诺提供技术帮助。 +老挝人民军的工兵部队有几个营? 2003年6月,老挝向俄罗斯订购了一批新式防空武器;2003年4月,老挝与越南签署了越南帮助老挝培训军事指挥干部和特种部队以及完成军队通信系统改造等多项协议。 +《焚心之城》的主角是谁? 《焚心之城》[1] 为网络作家老子扛过枪创作的一部都市类小说,目前正在创世中文网连载中。 +《焚心之城》的主角是谁? 乡下大男孩薛城,是一个不甘于生活现状的混混,他混过、爱过、也深深地被伤害过。 +《焚心之城》的主角是谁? 本料此生当浑浑噩噩,拼搏街头。 +《焚心之城》的主角是谁? 高考的成绩却给了他一点渺茫的希望,二月后,大学如期吹响了他进城的号角。 +《焚心之城》的主角是谁? 繁华的都市,热血的人生,冷眼嘲笑中,他发誓再不做一个平常人! +《焚心之城》的主角是谁? 江北小城,黑河大地,他要行走过的每一个角落都有他的传说。 +《焚心之城》的主角是谁? 扯出一面旗,拉一帮兄弟,做男人,就要多一份担当,活一口傲气。 +《焚心之城》的主角是谁? (日期截止到2014年10月23日凌晨) +请问香港利丰集团是什么时候成立的? 香港利丰集团前身是广州的华资贸易 (1906 - 1949) ,利丰是香港历史最悠久的出口贸易商号之一。 +请问香港利丰集团是什么时候成立的? 于1906年,冯柏燎先生和李道明先生在广州创立了利丰贸易公司;是当时中国第一家华资的对外贸易出口商。 +请问香港利丰集团是什么时候成立的? 利丰于1906年创立,初时只从事瓷器及丝绸生意;一年之后,增添了其它的货品,包括竹器、藤器、玉石、象牙及其它手工艺品,包括烟花爆竹类别。 +请问香港利丰集团是什么时候成立的? 在早期的对外贸易,中国南方内河港因水深不足不能行驶远洋船,反之香港港口水深岸阔,占尽地利。 +请问香港利丰集团是什么时候成立的? 因此,在香港成立分公司的责任,落在冯柏燎先生的三子冯汉柱先生身上。 +请问香港利丰集团是什么时候成立的? 1937年12月28日,利丰(1937)有限公司正式在香港创立。 +请问香港利丰集团是什么时候成立的? 第二次世界大战期间,利丰暂停贸易业务。 +请问香港利丰集团是什么时候成立的? 1943年,随着创办人冯柏燎先生去世后,业务移交给冯氏家族第二代。 +请问香港利丰集团是什么时候成立的? 之后,向来不参与业务管理的合伙人李道明先生宣布退休,将所拥有的利丰股权全部卖给冯氏家族。 +请问香港利丰集团是什么时候成立的? 目前由哈佛冯家两兄弟William Fung , Victor Fung和CEO Bruce Rockowitz 管理。 +请问香港利丰集团是什么时候成立的? 截止到2012年,集团旗下有利亚﹝零售﹞有限公司、利和集团、利邦时装有限公司、利越时装有限公司、利丰贸易有限公司。 +请问香港利丰集团是什么时候成立的? 利亚(零售)连锁,业务包括大家所熟悉的:OK便利店、玩具〝反〞斗城和圣安娜饼屋;范围包括香港、台湾、新加坡、马来西亚、至中国大陆及东南亚其它市场逾600多家店 +请问香港利丰集团是什么时候成立的? 利和集团,IDS以专业物流服务为根基,为客户提供经销,物流,制造服务领域内的一系列服务项目。 +请问香港利丰集团是什么时候成立的? 业务网络覆盖大中华区,东盟,美国及英国,经营着90多个经销中心,在中国设有18个经销公司,10,000家现代经销门店。 +请问香港利丰集团是什么时候成立的? 利邦(上海)时装贸易有限公司为大中华区其中一家大型男士服装零售集团。 +请问香港利丰集团是什么时候成立的? 现在在中国大陆、香港、台湾和澳门收购经营11个包括Cerruti 1881,Gieves & Hawkes,Kent & curwen和D’urban 等中档到高档的男士服装品牌,全国有超过350间门店设于各一线城市之高级商场及百货公司。 +请问香港利丰集团是什么时候成立的? 利越(上海)服装商贸有限公司隶属于Branded Lifestyle,负责中国大陆地区LEO里奥(意大利)、GIBO捷宝(意大利)、UFFIZI古杰师(意大利)、OVVIO奥维路(意大利)、Roots绿适(加拿大,全球服装排名第四)品牌销售业务 +请问香港利丰集团是什么时候成立的? 利丰(贸易)1995年收购了英之杰采购服务,1999年收购太古贸易有限公司(Swire & Maclain) 和金巴莉有限公司(Camberley),2000年和2002年分别收购香港采购出口集团Colby Group及Janco Oversea Limited,大大扩张了在美国及欧洲的顾客群,自2008年经济危机起一直到现在,收购多家欧、美、印、非等地区的时尚品牌,如英国品牌Visage,仅2011年上半年6个月就完成26个品牌的收购。 +请问香港利丰集团是什么时候成立的? 2004年利丰与Levi Strauss & Co.签订特许经营协议 +请问香港利丰集团是什么时候成立的? 2005年利丰伙拍Daymon Worldwide为全球供应私有品牌和特许品牌 +请问香港利丰集团是什么时候成立的? 2006年收购Rossetti手袋业务及Oxford Womenswear Group 强化美国批发业务 +请问香港利丰集团是什么时候成立的? 
2007年收购Tommy Hilfiher全球采购业务,收购CGroup、Peter Black International LTD、Regetta USA LLC和American Marketing Enterprice +请问香港利丰集团是什么时候成立的? 2008年收购Kent&Curwen全球特许经营权,收购Van Zeeland,Inc和Miles Fashion Group +请问香港利丰集团是什么时候成立的? 2009年收购加拿大休闲品牌Roots ,收购Wear Me Appearl,LLC。 +请问香港利丰集团是什么时候成立的? 与Hudson's Bay、Wolverine Worldwide Inc、Talbots、Liz Claiborne达成了采购协议 +请问香港利丰集团是什么时候成立的? 2010年收购Oxford apparel Visage Group LTD +请问香港利丰集团是什么时候成立的? 2011年一月收购土耳其Modium、美国女性时尚Beyond Productions,三月收购贸易公司Celissa 、玩具公司Techno Source USA, Inc.、卡通品牌产品TVMania和法国著名时装一线品牌Cerruti 1881,五月收购Loyaltex Apparel Ltd.、女装Hampshire Designers和英国彩妆Collection 2000,六月收购家私贸易Exim Designs Co., Ltd.,七月收购家庭旅行产业Union Rich USA, LLC和设计公司Lloyd Textile Fashion Company Limited,八月收购童装Fishman & Tobin和Crimzon Rose,九月收购家私贸易True Innovations, LLC、日用品企业Midway Enterprises和Wonderful World。 +请问香港利丰集团是什么时候成立的? 十二月与USPA – U.S. Polo Association签署授权协议。 +请问香港利丰集团是什么时候成立的? 利丰的精神:积极进取,不断认识并争取有利于客户和自身进步的机会;以行动为主导,对客户、供应商及职工的需求作出快速的决定。 +请问香港利丰集团是什么时候成立的? 利丰的最终目标:在产品采购、销售、流转的各环节建立全球性队伍提供多元化服务,利丰成员有效合作,共达目标。 +如何使魔兽变种akt不被查杀? Trojan/PSW.Moshou.akt“魔兽”变种akt是“魔兽”木马家族的最新成员之一,采用Delphi 6.0-7.0编写,并经过加壳处理。 +如何使魔兽变种akt不被查杀? “魔兽”变种akt运行后,自我复制到被感染计算机的指定目录下。 +如何使魔兽变种akt不被查杀? 修改注册表,实现木马开机自动运行。 +如何使魔兽变种akt不被查杀? 自我注入到被感染计算机的“explorer.exe”、“notepad.exe”等用户级权限的进程中加载运行,隐藏自我,防止被查杀。 +如何使魔兽变种akt不被查杀? 在后台秘密监视用户打开的窗口标题,盗取网络游戏《魔兽世界》玩家的游戏帐号、游戏密码、角色等级、装备信息、金钱数量等信息,并在后台将窃取到的玩家信息发送到骇客指定的远程服务器上,致使玩家游戏帐号、装备物品、金钱等丢失,给游戏玩家造成非常大的损失。 +丙种球蛋白能预防什么病情? 丙种球蛋白预防传染性肝炎,预防麻疹等病毒性疾病感染,治疗先天性丙种球蛋白缺乏症 ,与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 +丙种球蛋白能预防什么病情? 中文简称:“丙球” +丙种球蛋白能预防什么病情? 英文名称:γ-globulin、gamma globulin +丙种球蛋白能预防什么病情? 【别名】 免疫血清球蛋白,普通免疫球蛋白,人血丙种球蛋白,丙种球蛋白,静脉注射用人免疫球蛋白(pH4) +丙种球蛋白能预防什么病情? 注:由于人血中的免疫球蛋白大多数为丙种球蛋白(γ-球蛋白),有时丙种球蛋白也被混称为“免疫球蛋白”(immunoglobulin) 。 +丙种球蛋白能预防什么病情? 冻干制剂应为白色或灰白色的疏松体,液体制剂和冻干制剂溶解后,溶液应为接近无色或淡黄色的澄明液体,微带乳光。 +丙种球蛋白能预防什么病情? 但不应含有异物或摇不散的沉淀。 +丙种球蛋白能预防什么病情? 注射丙种球蛋白是一种被动免疫疗法。 +丙种球蛋白能预防什么病情? 它是把免疫球蛋白内含有的大量抗体输给受者,使之从低或无免疫状态很快达到暂时免疫保护状态。 +丙种球蛋白能预防什么病情? 由于抗体与抗原相互作用起到直接中和毒素与杀死细菌和病毒。 +丙种球蛋白能预防什么病情? 因此免疫球蛋白制品对预防细菌、病毒性感染有一定的作用[1]。 +丙种球蛋白能预防什么病情? 人免疫球蛋白的生物半衰期为16~24天。 +丙种球蛋白能预防什么病情? 1、丙种球蛋白[2]含有健康人群血清所具有的各种抗体,因而有增强机体抵抗力以预防感染的作用。 +丙种球蛋白能预防什么病情? 2、主要治疗先天性丙种球蛋白缺乏症和免疫缺陷病 +丙种球蛋白能预防什么病情? 3、预防传染性肝炎,如甲型肝炎和乙型肝炎等。 +丙种球蛋白能预防什么病情? 4、用于麻疹、水痘、腮腺炎、带状疱疹等病毒感染和细菌感染的防治 +丙种球蛋白能预防什么病情? 5、也可用于哮喘、过敏性鼻炎、湿疹等内源性过敏性疾病。 +丙种球蛋白能预防什么病情? 6、与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 +丙种球蛋白能预防什么病情? 7、川崎病,又称皮肤粘膜淋巴结综合征,常见于儿童,丙种球蛋白是主要的治疗药物。 +丙种球蛋白能预防什么病情? 1、对免疫球蛋白过敏或有其他严重过敏史者。 +丙种球蛋白能预防什么病情? 2、有IgA抗体的选择性IgA缺乏者。 +丙种球蛋白能预防什么病情? 3、发烧患者禁用或慎用。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (1997年9月1日浙江省第八届人民代表大会常务委员会第三十九次会议通过 1997年9月9日浙江省第八届人民代表大会常务委员会公告第六十九号公布自公布之日起施行) +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 为了保护人的生命和健康,发扬人道主义精神,促进社会发展与和平进步事业,根据《中华人民共和国红十字会法》,结合本省实际,制定本办法。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 本省县级以上按行政区域建立的红十字会,是中国红十字会的地方组织,是从事人道主义工作的社会救助团体,依法取得社会团体法人资格,设置工作机构,配备专职工作人员,依照《中国红十字会章程》独立自主地开展工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全省性行业根据需要可以建立行业红十字会,配备专职或兼职工作人员。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 街道、乡(镇)、机关、团体、学校、企业、事业单位根据需要,可以依照《中国红十字会章程》建立红十字会的基层组织。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 上级红十字会指导下级红十字会的工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上地方红十字会指导所在行政区域行业红十字会和基层红十字会的工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 人民政府对红十字会给予支持和资助,保障红十字会依法履行职责,并对其活动进行监督;红十字会协助人民政府开展与其职责有关的活动。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全社会都应当关心和支持红十字事业。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 
本省公民和单位承认《中国红十字会章程》并缴纳会费的,可以自愿参加红十字会,成为红十字会的个人会员或团体会员。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员由本人申请,基层红十字会批准,发给会员证;团体会员由单位申请,县级以上红十字会批准,发给团体会员证。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员和团体会员应当遵守《中华人民共和国红十字会法》和《中国红十字会章程》,热心红十字事业,履行会员的义务,并享有会员的权利。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会理事会由会员代表大会民主选举产生。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 理事会民主选举产生会长和副会长;根据会长提名,决定秘书长、副秘书长人选。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会可以设名誉会长、名誉副会长和名誉理事,由同级红十字会理事会聘请。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 省、市(地)红十字会根据独立、平等、互相尊重的原则,发展同境外、国外地方红十字会和红新月会的友好往来和合作关系。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 红十字会履行下列职责: +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)宣传、贯彻《中华人民共和国红十字会法》和本办法; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)开展救灾的准备工作,筹措救灾款物;在自然灾害和突发事件中,对伤病人员和其他受害者进行救助; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)普及卫生救护和防病知识,进行初级卫生救护培训,对交通、电力、建筑、矿山等容易发生意外伤害的单位进行现场救护培训; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (四)组织群众参加现场救护; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (五)参与输血献血工作,推动无偿献血; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (六)开展红十字青少年活动; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (七)根据中国红十字会总会部署,参加国际人道主义救援工作; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (八)依照国际红十字和红新月运动的基本原则,完成同级人民政府和上级红十字会委托的有关事宜; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (九)《中华人民共和国红十宇会法》和《中国红十字会章程》规定的其他职责。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 第八条 红十字会经费的主要来源: +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)红十字会会员缴纳的会费; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)接受国内外组织和个人捐赠的款物; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)红十字会的动产、不动产以及兴办社会福利事业和经济实体的收入; +宝湖庭院绿化率多少? 建发·宝湖庭院位于银川市金凤区核心地带—正源南街与长城中路交汇处向东500米。 +宝湖庭院绿化率多少? 项目已于2012年4月开工建设,总占地约4.2万平方米,总建筑面积约11.2万平方米,容积率2.14,绿化率35%,预计可入住630户。 +宝湖庭院绿化率多少? “建发·宝湖庭院”是银川建发集团股份有限公司继“建发·宝湖湾”之后,在宝湖湖区的又一力作。 +宝湖庭院绿化率多少? 项目周边发展成熟,东有唐徕渠景观水道,西临银川市交通主干道正源街;南侧与宝湖湿地公园遥相呼应。 +宝湖庭院绿化率多少? “宝湖庭院”项目公共交通资源丰富:15路、21路、35路、38路、43路公交车贯穿银川市各地,出行便利。 +宝湖庭院绿化率多少? 距离新百良田购物广场约1公里,工人疗养院600米,宝湖公园1公里,唐徕渠景观水道500米。 +宝湖庭院绿化率多少? 项目位置优越,购物、餐饮、医疗、交通、休闲等生活资源丰富。[1] +宝湖庭院绿化率多少? 建发·宝湖庭院建筑及景观设置传承建发一贯“简约、大气”的风格:搂间距宽广,确保每一座楼宇视野开阔通透。 +宝湖庭院绿化率多少? 楼宇位置错落有置,外立面设计大气沉稳别致。 +宝湖庭院绿化率多少? 项目内部休闲绿地、景观小品点缀其中,道路及停车系统设计合理,停车及通行条件便利。 +宝湖庭院绿化率多少? 社区会所、幼儿园、活动室、医疗服务中心等生活配套一应俱全。 +宝湖庭院绿化率多少? 行政区域:金凤区 +大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔是荷兰“大黄鸭”之父弗洛伦泰因·霍夫曼打造的大型装置艺术作品,该作品首次亮相于台湾桃园大园乡海军基地,为了迎接中秋节的到来;在展览期间,海军基地也首次对外开放。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼觉得中国神话中捣杵的玉兔很有想象力,于是特别创作了“月兔”,这也是“月兔”新作第一次展出。[1] +大月兔(中秋艺术作品)的作者还有哪些代表作? ?2014年9月15日因工人施工不慎,遭火烧毁。[2] +大月兔(中秋艺术作品)的作者还有哪些代表作? “大月兔”外表采用的杜邦防水纸、会随风飘动,内部以木材加保丽龙框架支撑做成。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 兔毛用防水纸做成,材质完全防水,不怕日晒雨淋。[3 +大月兔(中秋艺术作品)的作者还有哪些代表作? -4] +大月兔(中秋艺术作品)的作者还有哪些代表作? 25米的“月兔”倚靠在机 +大月兔(中秋艺术作品)的作者还有哪些代表作? 堡上望着天空,像在思考又像赏月。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 月兔斜躺在机堡上,意在思考生命、边做白日梦,编织自己的故事。[3] +大月兔(中秋艺术作品)的作者还有哪些代表作? 台湾桃园大园乡海军基地也首度对外开放。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 428公顷的海军基地中,地景艺术节使用约40公顷,展场包括过去军机机堡、跑道等,由于这处基地过去警备森严,不对外开放,这次结合地景艺术展出,也可一窥过去是黑猫中队基地的神秘面纱。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月2日,桃园县政府文化局举行“踩线团”,让 +大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔 +大月兔(中秋艺术作品)的作者还有哪些代表作? 各项地景艺术作品呈现在媒体眼中,虽然“月兔”仍在进行最后的细节赶工,但横躺在机堡上的“月兔”雏形已经完工。[5] +大月兔(中秋艺术作品)的作者还有哪些代表作? “这么大”、“好可爱呦”是不少踩线团成员对“月兔”的直觉;尤其在蓝天的衬托及前方绿草的组合下,呈现犹如真实版的爱丽丝梦游仙境。[6] +大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼的作品大月兔,“从平凡中,创作出不平凡的视觉”,创造出观赏者打从心中油然而生的幸福感,拉近观赏者的距离。[6] +大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月15日早 +大月兔(中秋艺术作品)的作者还有哪些代表作? 上,施工人员要将月兔拆解,搬离海军基地草皮时,疑施工拆除的卡车,在拆除过程,故障起火,起火的卡车不慎延烧到兔子,造成兔子起火燃烧,消防队员即刻抢救,白色的大月兔立即变成焦黑的火烧兔。[7] +大月兔(中秋艺术作品)的作者还有哪些代表作? 
桃园县府表示相当遗憾及难过,也不排除向包商求偿,也已将此事告知霍夫曼。[2] +大月兔(中秋艺术作品)的作者还有哪些代表作? ?[8] +大月兔(中秋艺术作品)的作者还有哪些代表作? 弗洛伦泰因·霍夫曼,荷兰艺术家,以在公共空间创作巨大造型 +大月兔(中秋艺术作品)的作者还有哪些代表作? 物的艺术项目见长。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 代表作品包括“胖猴子”(2010年在巴西圣保罗展出)、“大黄兔”(2011年在瑞典厄勒布鲁展出)、粉红猫(2014年5月在上海亮相)、大黄鸭(Rubber Duck)、月兔等。 +英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual plc)成立于1845年,一直在伦敦证券交易所(伦敦证券交易所:OML)作第一上市,也是全球排名第32位(按营业收入排名)的保险公司(人寿/健康)。 +英国耆卫保险公司有多少保险客户? 公司是全球财富500强公司之一,也是被列入英国金融时报100指数的金融服务集团之一。 +英国耆卫保险公司有多少保险客户? Old Mutual 是一家国际金融服务公司,拥有近320万个保险客户,240万个银行储户,270,000个短期保险客户以及700,000个信托客户 +英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual)是一家国际金融服务公司,总部设在伦敦,主要为全球客户提供长期储蓄的解决方案、资产管理、短期保险和金融服务等,目前业务遍及全球34个国家。[1] +英国耆卫保险公司有多少保险客户? 主要包括人寿保险,资产管理,银行等。 +英国耆卫保险公司有多少保险客户? 1845年,Old Mutual在好望角成立。 +英国耆卫保险公司有多少保险客户? 1870年,董事长Charles Bell设计了Old Mutual公司的标记。 +英国耆卫保险公司有多少保险客户? 1910年,南非从英联邦独立出来。 +英国耆卫保险公司有多少保险客户? Old Mutual的董事长John X. Merriman被选为国家总理。 +英国耆卫保险公司有多少保险客户? 1927年,Old Mutual在Harare成立它的第一个事务所。 +英国耆卫保险公司有多少保险客户? 1960年,Old Mutual在南非成立了Mutual Unit信托公司,用来管理公司的信托业务。 +英国耆卫保险公司有多少保险客户? 1970年,Old Mutual的收入超过100百万R。 +英国耆卫保险公司有多少保险客户? 1980年,Old Mutual成为南非第一大人寿保险公司,年收入达10亿R。 +英国耆卫保险公司有多少保险客户? 1991年,Old Mutual在美国财富周刊上评选的全球保险公司中名列第38位。 +英国耆卫保险公司有多少保险客户? 1995年,Old Mutual在美国波士顿建立投资顾问公司,同年、又在香港和Guernsey建立事务所。 +英国耆卫保险公司有多少保险客户? 作为一项加强与其母公司联系的举措,OMNIA公司(百慕大)荣幸的更名为Old Mutual 公司(百慕大) 。 +英国耆卫保险公司有多少保险客户? 这一新的名称和企业识别清晰地展示出公司成为其世界金融机构合作伙伴强有力支持的决心。 +英国耆卫保险公司有多少保险客户? 2003 年4月,该公司被Old Mutual plc公司收购,更名为Sage Life(百慕大)公司并闻名于世,公司为Old Mutual公司提供了一个新的销售渠道,补充了其现有的以美元计价的产品线和分销系统。 +英国耆卫保险公司有多少保险客户? 达到了一个重要里程碑是公司成功的一个例证: 2005年6月3日公司资产超过10亿美元成为公司的一个主要里程碑,也是公司成功的一个例证。 +英国耆卫保险公司有多少保险客户? Old Mutual (百慕大)为客户提供一系列的投资产品。 +英国耆卫保险公司有多少保险客户? 在其开放的结构下,客户除了能够参与由Old Mutual会员管理的方案外,还能够参与由一些世界顶尖投资机构提供的投资选择。 +英国耆卫保险公司有多少保险客户? 首席执行官John Clifford对此发表评论说:“过去的两年对于Old Mutual家族来说是稳固发展的两年,更名是迫在眉睫的事情。 +英国耆卫保险公司有多少保险客户? 通过采用其名字和形象上的相似,Old Mutual (百慕大)进一步强化了与母公司的联系。” +英国耆卫保险公司有多少保险客户? Clifford补充道:“我相信Old Mutual全球品牌认可度和Old Mutual(百慕大)产品专业知识的结合将在未来的日子里进一步推动公司的成功。” +英国耆卫保险公司有多少保险客户? 随着公司更名而来的是公司网站的全新改版,设计投资选择信息、陈述、销售方案、营销材料和公告板块。 +英国耆卫保险公司有多少保险客户? 在美国购买不到OMNIA投资产品,该产品也不向美国公民或居民以及百慕大居民提供。 +英国耆卫保险公司有多少保险客户? 这些产品不对任何要约未得到批准的区域中的任何人,以及进行此要约或询价为非法行为的个人构成要约或询价。 +英国耆卫保险公司有多少保险客户? 关于Old Mutual(百慕大)公司 +英国耆卫保险公司有多少保险客户? Old Mutual(百慕大)公司总部位于百慕大,公司面向非美国居民及公民以及非百慕大居民,通过遍布世界的各个市场的金融机构开发和销售保险和投资方案。 +英国耆卫保险公司有多少保险客户? 这些方案由Old Mutual(百慕大)公司直接做出,向投资者提供各种投资选择和战略,同时提供死亡和其他受益保证。 +谁知道北京的淡定哥做了什么? 尼日利亚足球队守门员恩耶马被封淡定哥,原因是2010年南非世界杯上1:2落后希腊队时,对方前锋已经突破到禁区,其仍头依门柱发呆,其从容淡定令人吃惊。 +谁知道北京的淡定哥做了什么? 淡定哥 +谁知道北京的淡定哥做了什么? 在2010年6月17日的世界杯赛场上,尼日利亚1比2不敌希腊队,但尼日利亚门将恩耶马(英文名:Vincent Enyeama)在赛场上的“淡定”表现令人惊奇。 +谁知道北京的淡定哥做了什么? 随后,网友将赛场照片发布于各大论坛,恩耶马迅速窜红,并被网友称为“淡定哥”。 +谁知道北京的淡定哥做了什么? 淡定哥 +谁知道北京的淡定哥做了什么? 从网友上传得照片中可以看到,“淡定哥”在面临对方前锋突袭至小禁区之时,还靠在球门柱上发呆,其“淡定”程度的确非一般人所能及。 +谁知道北京的淡定哥做了什么? 恩耶马是尼日利亚国家队的主力守门员,目前效力于以色列的特拉维夫哈普尔队。 +谁知道北京的淡定哥做了什么? 1999年,恩耶马在尼日利亚国内的伊波姆星队开始职业生涯,后辗转恩伊姆巴、Iwuanyanwu民族等队,从07年开始,他为特拉维夫效力。 +谁知道北京的淡定哥做了什么? 恩耶马的尼日利亚国脚生涯始于2002年,截至2010年1月底,他为国家队出场已超过50次。 +谁知道北京的淡定哥做了什么? 当地时间2011年1月4日,国际足球历史与统计协会(IFFHS)公布了2010年度世界最佳门将,恩耶马(尼日利亚,特拉维夫夏普尔)10票排第十一 +谁知道北京的淡定哥做了什么? 此词经国家语言资源监测与研究中心等机构专家审定入选2010年年度新词语,并收录到《中国语言生活状况报告》中。 +谁知道北京的淡定哥做了什么? 提示性释义:对遇事从容镇定、处变不惊的男性的戏称。 +谁知道北京的淡定哥做了什么? 例句:上海现“淡定哥”:百米外爆炸他仍专注垂钓(2010年10月20日腾讯网http://news.qq.com/a/20101020/000646.htm) +谁知道北京的淡定哥做了什么? 2011年度新人物 +谁知道北京的淡定哥做了什么? 1、淡定哥(北京) +谁知道北京的淡定哥做了什么? 7月24日傍晚,北京市出现大范围降雨天气,位于通州北苑路出现积水,公交车也难逃被淹。 +谁知道北京的淡定哥做了什么? 李欣摄图片来源:新华网一辆私家车深陷积水,车主索性盘坐在自己的汽车上抽烟等待救援。 +谁知道北京的淡定哥做了什么? 私家车主索性盘坐在自己的车上抽烟等待救援,被网友称“淡定哥” +谁知道北京的淡定哥做了什么? 2、淡定哥——林峰 +谁知道北京的淡定哥做了什么? 
在2011年7月23日的动车追尾事故中,绍兴人杨峰(@杨峰特快)在事故中失去了5位亲人:怀孕7个月的妻子、未出世的孩子、岳母、妻姐和外甥女,他的岳父也在事故中受伤正在治疗。 +谁知道北京的淡定哥做了什么? 他披麻戴孝出现在事故现场,要求将家人的死因弄个明白。 +谁知道北京的淡定哥做了什么? 但在第一轮谈判过后,表示:“请原谅我,如果我再坚持,我将失去我最后的第六个亲人。” +谁知道北京的淡定哥做了什么? 如果他继续“纠缠”铁道部,他治疗中的岳父将会“被死亡”。 +谁知道北京的淡定哥做了什么? 很多博友就此批评杨峰,并讽刺其为“淡定哥”。 +071型船坞登陆舰的北约代号是什么? 071型船坞登陆舰(英语:Type 071 Amphibious Transport Dock,北约代号:Yuzhao-class,中文:玉昭级,或以首舰昆仑山号称之为昆仑山级船坞登陆舰),是中国人民解放军海军隶下的大型多功能两栖船坞登陆舰,可作为登陆艇的母舰,用以运送士兵、步兵战车、主战坦克等展开登陆作战,也可搭载两栖车辆,具备大型直升机起降甲板及操作设施。 +071型船坞登陆舰的北约代号是什么? 071型两栖登陆舰是中国首次建造的万吨级作战舰艇,亦为中国大型多功能两栖舰船的开山之作,也可以说是中国万吨级以上大型作战舰艇的试验之作,该舰的建造使中国海军的两栖舰船实力有了质的提升。 +071型船坞登陆舰的北约代号是什么? 在本世纪以前中国海军原有的两栖舰队以一 +071型船坞登陆舰的北约代号是什么? 早期071模型 +071型船坞登陆舰的北约代号是什么? 千至四千吨级登陆舰为主要骨干,这些舰艇吨位小、筹载量有限,直升机操作能力非常欠缺,舰上自卫武装普遍老旧,对于现代化两栖登陆作战可说有很多不足。 +071型船坞登陆舰的北约代号是什么? 为了应对新时期的国际国内形势,中国在本世纪初期紧急强化两栖作战能力,包括短时间内密集建造072、074系列登陆舰,同时也首度设计一种新型船坞登陆舰,型号为071。[1] +071型船坞登陆舰的北约代号是什么? 在两栖作战行动中,这些舰只不得不采取最危险的 +071型船坞登陆舰的北约代号是什么? 舾装中的昆仑山号 +071型船坞登陆舰的北约代号是什么? 敌前登陆方式实施两栖作战行动,必须与敌人预定阻击力量进行面对面的战斗,在台湾地区或者亚洲其他国家的沿海,几乎没有可用而不设防的海滩登陆地带,并且各国或者地区的陆军在战时,可能会很快控制这些易于登陆的海难和港口,这样就限制住了中国海军两栖登陆部队的实际登陆作战能力。 +071型船坞登陆舰的北约代号是什么? 071型登陆舰正是为了更快和更多样化的登陆作战而开发的新型登陆舰艇。[2] +071型船坞登陆舰的北约代号是什么? 071型两栖船坞登陆舰具有十分良好的整体隐身能力, +071型船坞登陆舰的北约代号是什么? 071型概念图 +071型船坞登陆舰的北约代号是什么? 该舰外部线条简洁干练,而且舰体外形下部外倾、上部带有一定角度的内倾,从而形成雷达隐身性能良好的菱形横剖面。 +071型船坞登陆舰的北约代号是什么? 舰体为高干舷平甲板型,长宽比较小,舰身宽满,采用大飞剪型舰首及楔形舰尾,舰的上层建筑位于舰体中间部位,后部是大型直升机甲板,适航性能非常突出。 +071型船坞登陆舰的北约代号是什么? 顶甲板上各类电子设备和武器系统布局十分简洁干净,各系统的突出物很少。 +071型船坞登陆舰的北约代号是什么? 该舰的两座烟囱实行左右分布式设置在舰体两侧,既考虑了隐身特点,也十分新颖。[3] +071型船坞登陆舰的北约代号是什么? 1号甲板及上层建筑物主要设置有指挥室、控 +071型船坞登陆舰的北约代号是什么? 舰尾俯视 +071型船坞登陆舰的北约代号是什么? 制舱、医疗救护舱及一些居住舱,其中医疗救护舱设置有完备的战场救护设施,可以在舰上为伤病员提供紧急手术和野战救护能力。 +071型船坞登陆舰的北约代号是什么? 2号甲板主要是舰员和部分登陆人员的居住舱、办公室及厨房。 +071型船坞登陆舰的北约代号是什么? 主甲板以下则是登陆舱,分前后两段,前段是装甲车辆储存舱,共两层,可以储存登陆装甲车辆和一些其它物资,在进出口处还设有一小型升降机,用于两层之间的移动装卸用。 +071型船坞登陆舰的北约代号是什么? 前段车辆储存舱外壁左右各设有一折叠式装载舱门,所有装载车辆在码头可通过该门直接装载或者登陆上岸。 +071型船坞登陆舰的北约代号是什么? 后段是一个巨型船坞登陆舱,总长约70米,主要用来停泊大小型气垫登陆艇、机械登陆艇或车辆人员登陆艇。[4] +071型船坞登陆舰的北约代号是什么? 自卫武装方面,舰艏设有一门PJ-26型76mm舰炮( +071型船坞登陆舰的北约代号是什么? 井冈山号舰首主炮 +071型船坞登陆舰的北约代号是什么? 俄罗斯AK-176M的中国仿制版,亦被054A采用) , 四具与052B/C相同的726-4 18联装干扰弹发射器分置于舰首两侧以及上层结构两侧,近迫防御则依赖四座布置于上层结构的AK-630 30mm防空机炮 。 +071型船坞登陆舰的北约代号是什么? 原本071模型的舰桥前方设有一座八联装海红-7短程防空导弹发射器,不过071首舰直到出海试航与2009年4月下旬的海上阅兵式中,都未装上此一武器。 +071型船坞登陆舰的北约代号是什么? 电子装备方面, 舰桥后方主桅杆顶配置一具363S型E/F频2D对空/平面搜索雷达 、一具Racal Decca RM-1290 I频导航雷达,后桅杆顶装备一具拥有球型外罩的364型(SR-64)X频2D对空/对海搜索雷达,此外还有一具LR-66C舰炮射控雷达、一具负责导引AK-630机炮的TR-47C型火炮射控雷达等。[5] +071型船坞登陆舰的北约代号是什么? 071型自卫武装布置 +071型船坞登陆舰的北约代号是什么? 071首舰昆仑山号于2006年6月开 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 竹溪县人大常委会办公室:承担人民代表大会会议、常委会会议、主任会议和常委会党组会议(简称“四会”)的筹备和服务工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会组成人员视察活动的联系服务工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 受主任会议委托,拟定有关议案草案。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担常委会人事任免的具体工作,负责机关人事管理和离退休干部的管理与服务。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大机关的行政事务和后勤保障工作,负责机关的安全保卫、文电处理、档案、保密、文印工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大常委会同市人大常委会及乡镇人大的工作联系。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责信息反馈工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 了解宪法、法律、法规和本级人大及其常委会的决议、决定实施情况及常委会成员提出建议办理情况,及时向常委会和主任会议报告。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大宣传工作,负责人大常委会会议宣传的组织和联系。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 组织协调各专门工作委员会开展工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 办公室下设五个科,即秘书科、调研科、人事任免科、综合科、老干部科。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 教科文卫工作委员会:负责人大教科文卫工作的日常联系、督办、信息收集反馈和业务指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责教科文卫方面法律法规贯彻和人大工作情况的宣传、调研工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大常委会教科文卫方面会议议题调查的组织联系和调研材料的起草工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担教科文卫方面规范性备案文件的初审工作,侧重对教科文卫行政执法个案监督业务承办工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 
负责常委会组成人员和人大代表对教科文卫工作方面检查、视察的组织联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 代表工作委员会:负责与县人大代表和上级人大代表的联系、情况收集交流工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责《代表法》的宣传贯彻和贯彻实施情况的调查研究工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责县人大代表法律法规和人民代表大会制度知识学习的组织和指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会主任、副主任和委员走访联系人大代表的组织、联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责组织人大系统的干部培训。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责乡镇人大主席团工作的联系和指导。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表建议、批评和意见办理工作的联系和督办落实。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表开展活动的组织、联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 财政经济工作委员会:负责人大财政经济工作的日常联系、督办、信息收集反馈和业务指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责财政经济方面法律法规贯彻和人大工作情况的宣传、调研工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 对国民经济计划和财政预算编制情况进行初审。 +我想知道武汉常住人口有多少? 武汉,简称“汉”,湖北省省会。 +我想知道武汉常住人口有多少? 它是武昌、汉口、汉阳三镇统称。 +我想知道武汉常住人口有多少? 世界第三大河长江及其最长支流汉江横贯市区,将武汉一分为三,形成武昌、汉口、汉阳,三镇跨江鼎立的格局。 +我想知道武汉常住人口有多少? 唐朝诗人李白在此写下“黄鹤楼中吹玉笛,江城五月落梅花”,因此武汉自古又称“江城”。 +我想知道武汉常住人口有多少? 武汉是中国15个副省级城市之一,全国七大中心城市之一,全市常住人口858万人。 +我想知道武汉常住人口有多少? 华中地区最大都市,华中金融中心、交通中心、文化中心,长江中下游特大城市。 +我想知道武汉常住人口有多少? 武汉城市圈的中心城市。 +我想知道武汉常住人口有多少? [3]武昌、汉口、汉阳三地被俗称武汉三镇。 +我想知道武汉常住人口有多少? 武汉西与仙桃市、洪湖市相接,东与鄂州市、黄石市接壤,南与咸宁市相连,北与孝感市相接,形似一只自西向东的蝴蝶形状。 +我想知道武汉常住人口有多少? 在中国经济地理圈内,武汉处于优越的中心位置是中国地理上的“心脏”,故被称为“九省通衢”之地。 +我想知道武汉常住人口有多少? 武汉市历史悠久,古有夏汭、鄂渚之名。 +我想知道武汉常住人口有多少? 武汉地区考古发现的历史可以上溯距今6000年的新石器时代,其考古发现有东湖放鹰台遗址的含有稻壳的红烧土、石斧、石锛以及鱼叉。 +我想知道武汉常住人口有多少? 市郊黄陂区境内的盘龙城遗址是距今约3500年前的商朝方国宫城,是迄今中国发现及保存最完整的商代古城之一。 +我想知道武汉常住人口有多少? 现代武汉的城市起源,是东汉末年的位于今汉阳的卻月城、鲁山城,和在今武昌蛇山的夏口城。 +我想知道武汉常住人口有多少? 东汉末年,地方军阀刘表派黄祖为江夏太守,将郡治设在位于今汉阳龟山的卻月城中。 +我想知道武汉常住人口有多少? 卻月城是武汉市区内已知的最早城堡。 +我想知道武汉常住人口有多少? 223年,东吴孙权在武昌蛇山修筑夏口城,同时在城内的黄鹄矶上修筑了一座瞭望塔——黄鹤楼。 +我想知道武汉常住人口有多少? 苏轼在《前赤壁赋》中说的“西望夏口,东望武昌”中的夏口就是指武汉(而当时的武昌则是今天的鄂州)。 +我想知道武汉常住人口有多少? 南朝时,夏口扩建为郢州,成为郢州的治所。 +我想知道武汉常住人口有多少? 隋置江夏县和汉阳县,分别以武昌,汉阳为治所。 +我想知道武汉常住人口有多少? 唐时江夏和汉阳分别升为鄂州和沔州的州治,成为长江沿岸的商业重镇。 +我想知道武汉常住人口有多少? 江城之称亦始于隋唐。 +我想知道武汉常住人口有多少? 两宋时武昌属鄂州,汉阳汉口属汉阳郡。 +我想知道武汉常住人口有多少? 经过发掘,武汉出土了大量唐朝墓葬,在武昌马房山和岳家咀出土了灰陶四神砖以及灰陶十二生肖俑等。 +我想知道武汉常住人口有多少? 宋代武汉的制瓷业发达。 +我想知道武汉常住人口有多少? 在市郊江夏区梁子湖旁发现了宋代瓷窑群100多座,烧制的瓷器品种很多,釉色以青白瓷为主。 +我想知道武汉常住人口有多少? 南宋诗人陆游在经过武昌时,写下“市邑雄富,列肆繁错,城外南市亦数里,虽钱塘、建康不能过,隐然一大都会也”来描写武昌的繁华。 +我想知道武汉常住人口有多少? 南宋抗金将领岳飞驻防鄂州(今武昌)8年,在此兴师北伐。 +我想知道武汉常住人口有多少? 元世祖至元十八年(1281年),武昌成为湖广行省的省治。 +我想知道武汉常住人口有多少? 这是武汉第一次成为一级行政单位(相当于现代的省一级)的治所。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇,托洛茨基是联共(布)党内和第三国际时期反对派的领导人,托派"第四国际"的创始人和领导人。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基(俄国与国际历史上最重要的无产阶级革命家之一,二十世纪国际共产主义运动中最具争议的、也是备受污蔑的左翼反对派领袖,他以对古典马克思主义“不断革命论”的独创性发展闻名于世,第三共产国际和第四国际的主要缔造者之一(第三国际前三次代表大会的宣言执笔人)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在1905年俄国革命中被工人群众推举为彼得堡苏维埃主席(而当时布尔什维克多数干部却还在讨论是否支持苏维埃,这些干部后来被赶回俄国的列宁痛击)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1917年革命托洛茨基率领“区联派”与列宁派联合,并再次被工人推举为彼得格勒苏维埃主席。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 对于十月革命这场20世纪最重大的社会革命,托洛茨基赢得了不朽的历史地位。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 后来成了托洛茨基死敌的斯大林,当时作为革命组织领导者之一却写道:“起义的一切实际组织工作是在彼得格勒苏维埃主席托洛茨基同志直接指挥之下完成的。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 我们可以确切地说,卫戍部队之迅速站在苏维埃方面来,革命军事委员会的工作之所以搞得这样好,党认为这首先要归功于托洛茨基同志。” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (值得一提的是,若干年后,当反托成为政治需要时,此类评价都从斯大林文章中删掉了。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )甚至连后来狂热的斯大林派雅克·沙杜尔,当时却也写道:“托洛茨基在十月起义中居支配地位,是起义的钢铁灵魂。” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (苏汉诺夫《革命札记》第6卷P76。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )不仅在起义中,而且在无产阶级政权的捍卫、巩固方面和国际共产主义革命方面,托洛茨基也作出了极其卓越的贡献(外交官-苏联国际革命政策的负责人、苏联红军缔造者以及共产国际缔造者)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 革命后若干年里,托洛茨基与列宁的画像时常双双并列挂在一起;十月革命之后到列宁病逝之前,布尔什维克历次全国代表大会上,代表大会发言结束均高呼口号:“我们的领袖列宁和托洛茨基万岁!” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在欧美共运中托洛茨基的威望非常高。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 
后人常常认为托洛茨基只是一个知识分子文人,实际上他文武双全,而且谙熟军事指挥艺术,并且亲临战场。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 正是他作为十月革命的最高军事领袖(在十月革命期间他与士兵一起在战壕里作战),并且在1918年缔造并指挥苏联红军,是一个杰出的军事家(列宁曾对朋友说,除了托洛茨基,谁还能给我迅速地造成一支上百万人的强大军队? +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在内战期间,他甚至坐装甲列车冒着枪林弹雨亲临战场指挥作战,差点挨炸死;当反革命军队进攻彼得堡时,当时的彼得堡领导人季诺维也夫吓得半死,托洛茨基却从容不迫指挥作战。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 同时托洛茨基又是一个高明的外交家,他曾强硬地要求英国政府释放因反战宣传被囚禁在英国的俄国流亡革命者,否则就不许英国公民离开俄国,连英国政府方面都觉得此举无懈可击;他并且把居高临下的法国到访者当场轰出他的办公室(革命前法国一直是俄国的头号债主与政治操纵者),却彬彬有礼地欢迎前来缓和冲突的法国大使;而在十月革命前夕,他对工人代表议会质询的答复既保守了即将起义的军事秘密,又鼓舞了革命者的战斗意志,同时严格遵循现代民主与公开原则,这些政治答复被波兰人多伊彻誉为“外交辞令的杰作”(伊·多伊彻的托氏传记<先知三部曲·武装的先知>第九章P335,第十一章P390)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基在国民经济管理与研究工作中颇有创造:是苏俄新经济政策的首先提议者以及社会主义计划经济的首先实践者。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1928年斯大林迟迟开始的计划经济实验,是对1923年以托洛茨基为首的左翼反对派经济纲领的拙劣剽窃和粗暴翻版。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 因为统治者的政策迟到,使得新经济政策到1928年已产生了一个威胁政权生存的农村资产阶级,而苏俄工人阶级国家不得不强力解决——而且是不得不借助已蜕化为官僚集团的强力来解决冲突——结果导致了1929年到30年代初的大饥荒和对农民的大量冤枉错杀。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 另外,他还对文学理论有很高的造诣,其著作<文学与革命>甚至影响了整整一代的国际左翼知识分子(包括中国的鲁迅、王实味等人)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 他在哈佛大学图书馆留下了100多卷的<托洛茨基全集>,其生动而真诚的自传和大量私人日记、信件,给人留下了研究人类生活各个方面的宝贵财富,更是追求社会进步与解放的历史道路上的重要知识库之一。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基1879年10月26日生于乌克兰赫尔松县富裕农民家庭,祖籍是犹太人。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 原姓布隆施泰因。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1896年开始参加工人运动。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1897年 ,参加建立南俄工人协会 ,反对沙皇专制制度。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1898年 在尼古拉也夫组织工人团体,被流放至西伯利亚。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1902年秋以署名托洛茨基之假护照逃到伦敦,参加V.I.列宁、G.V.普列汉诺夫等人主编的<火星报>的工作。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥,位于洞庭湖与长江交汇处,东接岳阳市区洞庭大道和107国道、京珠高速公路,西连省道306线,是国内目前最长的内河公路桥。 +谁知道洞庭湖大桥有多长? 路桥全长10173.82m,其中桥长5747.82m,桥宽20m,西双向四车道,是我国第一座三塔双索面斜拉大桥,亚洲首座不等高三塔双斜索面预应力混凝土漂浮体系斜拉桥。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥是我国最长的内河公路桥,大桥横跨东洞庭湖区,全长10174.2米,主桥梁长5747.8米。 +谁知道洞庭湖大桥有多长? 大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区运输抗洪抢险物资提供了一条快速通道该桥设计先进,新颖,造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥是湖区人民的造福桥,装点湘北门户的形象桥,对优化交通网络绪构,发展区域经济,保障防汛救灾,缩短鄂、豫、陕等省、市西部车辆南下的运距,拓展岳阳城区的主骨架,提升岳阳城市品位,增强城市辐射力,有着十分重要的意义。 +谁知道洞庭湖大桥有多长? 自1996年12月开工以来,共有10支施工队伍和两支监理队伍参与了大桥的建设。 +谁知道洞庭湖大桥有多长? 主桥桥面高52米(黄海),设计通航等级Ⅲ级。 +谁知道洞庭湖大桥有多长? 主桥桥型为不等高三塔、双索面空间索、全飘浮体系的预应力钢筋混凝土肋板梁式结构的斜拉桥,跨径为130+310+310+130米。 +谁知道洞庭湖大桥有多长? 索塔为双室宝石型断面,中塔高为125.684米,两边塔高为99.311米。 +谁知道洞庭湖大桥有多长? 三塔基础为3米和3.2米大直径钻孔灌注桩。 +谁知道洞庭湖大桥有多长? 引桥为连续梁桥,跨径20至50米,基础直径为1.8和2.5米钻孔灌注桩。 +谁知道洞庭湖大桥有多长? 该桥设计先进、新颖、造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系,岳阳洞庭湖大桥是我国首次采用不等高三塔斜拉桥桥型的特大桥,设计先进,施工难度大位居亚洲之首,是湖南省桥梁界的一大科研项目。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥设计为三塔斜拉桥,空间双斜面索,主梁采用前支点挂篮施工,并按各种工况模拟挂篮受力进行现场试验,获得了大量有关挂篮受力性能和实际刚度的计算参数,作为施工控制参数。 +谁知道洞庭湖大桥有多长? 利用组合式模型单元,推导了斜拉桥分离式双肋平板主梁的单元刚度矩阵,并进行了岳阳洞庭湖大桥的空间受力分析,结果表明此种单元精度满足工程要求,同时在施工工艺方面也积累了成功经验。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区抗洪抢险物资运输提供了一条快速通道。 +谁知道洞庭湖大桥有多长? 湖大桥设计先进,造型美丽,科技含量高。 +谁知道洞庭湖大桥有多长? 洞庭大桥还是一道美丽的风景线,大桥沿岸风景与岳阳楼,君山岛、洞庭湖等风景名胜融为一体,交相辉映,成为世人了解岳阳的又一崭新窗口,也具有特别旅游资源。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥多塔斜拉桥新技术研究荣获国家科学技术进步二等奖、湖南省科学技术进步一等奖,并获第五届詹天佑大奖。 +谁知道洞庭湖大桥有多长? 大桥在中国土木工程学会2004年第16届年会上入选首届《中国十佳桥梁》,名列斜拉桥第二位。 +谁知道洞庭湖大桥有多长? 2001年荣获湖南省建设厅优秀设计一等奖,省优秀勘察一等奖。 +谁知道洞庭湖大桥有多长? 2003年荣获国家优秀工程设计金奖, "十佳学术活动"奖。 +天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 +天气预报员的布景师是谁? ?不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 +天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 +天气预报员的布景师是谁? 不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 +天气预报员的布景师是谁? 在电视节目上,大卫永远微笑,自信而光鲜,就像每一个成功的电视人一样,说起收入,他也绝对不落人后。 +天气预报员的布景师是谁? 不过,大卫的个人生活可就不那么如意了。 +天气预报员的布景师是谁? 
与妻子劳伦(霍普·戴维斯)的离婚一直让他痛苦;儿子迈克吸大麻上瘾,正在进行戒毒,可戒毒顾问却对迈克有着异样的感情;女儿雪莉则体重惊人,总是愁眉苦脸、孤独寂寞;大卫的父亲罗伯特(迈克尔·凯恩),一个世界著名的小说家,虽然罗伯特不想再让大卫觉得负担过重,可正是他的名声让大卫的一生都仿佛处在他的阴影之下,更何况,罗伯特就快重病死了。 +天气预报员的布景师是谁? 和妻子的离婚、父亲的疾病、和孩子之间完全不和谐的关系,都让大卫每天头疼,而每次当他越想控制局面,一切就越加复杂。 +天气预报员的布景师是谁? 然而就在最后人们再也不会向他扔快餐,或许是因为他总是背着弓箭在大街上走。 +天气预报员的布景师是谁? 最后,面对那份高额工作的接受意味着又一个新生活的开始。 +天气预报员的布景师是谁? 也许,生活就像天气,想怎么样就怎么样,完全不可预料。 +天气预报员的布景师是谁? 导 演:戈尔·维宾斯基 Gore Verbinski +天气预报员的布景师是谁? 编 剧:Steve Conrad .....(written by) +天气预报员的布景师是谁? 演 员:尼古拉斯·凯奇 Nicolas Cage .....David Spritz +天气预报员的布景师是谁? 尼古拉斯·霍尔特 Nicholas Hoult .....Mike +天气预报员的布景师是谁? 迈克尔·凯恩 Michael Caine .....Robert Spritzel +天气预报员的布景师是谁? 杰蒙妮·德拉佩纳 Gemmenne de la Peña .....Shelly +天气预报员的布景师是谁? 霍普·戴维斯 Hope Davis .....Noreen +天气预报员的布景师是谁? 迈克尔·瑞斯玻利 Michael Rispoli .....Russ +天气预报员的布景师是谁? 原创音乐:James S. Levine .....(co-composer) (as James Levine) +天气预报员的布景师是谁? 汉斯·兹米尔 Hans Zimmer +天气预报员的布景师是谁? 摄 影:Phedon Papamichael +天气预报员的布景师是谁? 剪 辑:Craig Wood +天气预报员的布景师是谁? 选角导演:Denise Chamian +天气预报员的布景师是谁? 艺术指导:Tom Duffield +天气预报员的布景师是谁? 美术设计:Patrick M. Sullivan Jr. .....(as Patrick Sullivan) +天气预报员的布景师是谁? 布景师 :Rosemary Brandenburg +天气预报员的布景师是谁? 服装设计:Penny Rose +天气预报员的布景师是谁? 视觉特效:Charles Gibson +天气预报员的布景师是谁? David Sosalla .....Pacific Title & Art Studio +韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足球协会。 +韩国国家男子足球队教练是谁? 韩国队自1986年世界杯开始,从未缺席任何一届决赛周。 +韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 +韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 +韩国国家男子足球队教练是谁? 北京时间2014年6月27日3时,巴西世界杯小组赛H组最后一轮赛事韩国对阵比利时,韩国队0-1不敌比利时,3场1平2负积1分垫底出局。 +韩国国家男子足球队教练是谁? 球队教练:洪明甫 +韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(韩国国家男子足球队???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足联。 +韩国国家男子足球队教练是谁? 韩国队是众多亚洲球队中,在世界杯表现最好,他们自1986年世界杯开始,从未缺席任何一届决赛周。 +韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 +韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 +韩国国家男子足球队教练是谁? 2014年世界杯外围赛,韩国在首轮分组赛以首名出线次轮分组赛,与伊朗、卡塔尔、乌兹别克以及黎巴嫩争逐两个直接出线决赛周资格,最后韩国仅以较佳的得失球差压倒乌兹别克,以小组次名取得2014年世界杯决赛周参赛资格,也是韩国连续八次晋身世界杯决赛周。 +韩国国家男子足球队教练是谁? 虽然韩国队在世界杯成绩为亚洲之冠,但在亚洲杯足球赛的成绩却远不及世界杯。 +韩国国家男子足球队教练是谁? 韩国只在首两届亚洲杯(1956年及1960年)夺冠,之后五十多年未能再度称霸亚洲杯,而自1992年更从未打入过决赛,与另一支东亚强队日本近二十年来四度在亚洲杯夺冠成强烈对比。[1] +韩国国家男子足球队教练是谁? 人物简介 +韩国国家男子足球队教练是谁? 车范根(1953年5月22日-)曾是大韩民国有名的锋线选手,他被欧洲媒体喻为亚洲最佳输出球员之一,他也被认为是世界最佳足球员之一。 +韩国国家男子足球队教练是谁? 他被国际足球史料与数据协会评选为20世纪亚洲最佳球员。 +韩国国家男子足球队教练是谁? 他在85-86赛季是德甲的最有价值球员,直到1999年为止他都是德甲外国球员入球纪录保持者。 +韩国国家男子足球队教练是谁? 德国的球迷一直没办法正确说出他名字的发音,所以球车范根(左)迷都以炸弹车(Cha Boom)称呼他。 +韩国国家男子足球队教练是谁? 这也代表了他强大的禁区得分能力。 +韩国国家男子足球队教练是谁? 职业生涯 +韩国国家男子足球队教练是谁? 车范根生于大韩民国京畿道的华城市,他在1971年于韩国空军俱乐部开始了他的足球员生涯;同年他入选了韩国19岁以下国家足球队(U-19)。 +韩国国家男子足球队教练是谁? 隔年他就加入了韩国国家足球队,他是有史以来加入国家队最年轻的球员。 +韩国国家男子足球队教练是谁? 车范根在27岁时前往德国发展,当时德甲被认为是世界上最好的足球联赛。 +韩国国家男子足球队教练是谁? 他在1978年12月加入了达姆施塔特,不过他在那里只待了不到一年就转到当时的德甲巨人法兰克福。 +韩国国家男子足球队教练是谁? 车范根很快在新俱乐部立足,他帮助球队赢得79-80赛季的欧洲足协杯。 +韩国国家男子足球队教练是谁? 在那个赛季过后,他成为德甲薪水第三高的球员,不过在1981年对上勒沃库森的一场比赛上,他的膝盖严重受伤,几乎毁了他的足球生涯。 +韩国国家男子足球队教练是谁? 在1983年车范根转投勒沃库森;他在这取得很高的成就,他成为85-86赛季德甲的最有价值球员,并且在1988年帮助球队拿下欧洲足协杯,也是他个人第二个欧洲足协杯。 +韩国国家男子足球队教练是谁? 他在决赛对垒西班牙人扮演追平比分的关键角色,而球会才在点球大战上胜出。 +韩国国家男子足球队教练是谁? 车范根在1989年退休,他在308场的德甲比赛中进了98球,一度是德甲外国球员的入球纪录。 +韩国国家男子足球队教练是谁? 执教生涯 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学,简称台湾科大、台科大或台科,是位于台湾台北市大安区的台湾第一所高等技职体系大专院校,现为台湾最知名的科技大学,校本部比邻国立台湾大学。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 该校已于2005年、2008年持续入选教育部的“发展国际一流大学及顶尖研究中心计划”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 
“国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? “国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本校校地约44.5公顷,校本部位于台北市基隆路四段四十三号,。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 民国68年成立硕士班,民国71年成立博士班,现有大学部学生5,664人,研究生4,458人,专任教师451位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2001年在台湾地区教育部筹划之研究型大学(“国立”大学研究所基础教育重点改善计画)中,成为全台首批之9所大学之一 。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 自2005年更在“教育部”所推动“五年五百亿 顶尖大学”计划下,遴选为适合发展成“顶尖研究中心”的11所研究型大学之一。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学部设有二年制、四年制及工程在职人员进修班等三种学制;凡二专、三专及五专等专科学校以上之毕业生,皆可以报考本校大学部二年制,而高职、高中毕业生,可以报考本校大学部四年制。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工业管理、电子工程、机械工程、营建工程及应用外语系等,则设有在职人员进修班学制,其招生对象为在职人员,利用夜间及暑假期间上课。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 凡在本校大学部修毕应修学分且成绩及格者皆授予学士学位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学目前设有工程、电资、管理、设计、人文社会及精诚荣誉等六个学院,分别有机械、材料科学与工程、营建、化工、电子、电机、资工、工管、企管、资管、建筑、工商业设计、应用外语等13个系及校内招生之财务金融学士学位学程、科技管理学士学位学程;全校、工程、电资、管理、创意设计等五个不分系菁英班及光电研究所、管理研究所、财务金融研究所、科技管理研究所、管理学院MBA、数位学习教育研究所、医学工程研究所、自动化及控制研究所、工程技术研究所、专利研究所等独立研究所,此外尚有人文学科负责人文及社会类等课程之教学,通识学科负责法律、音乐、环保类等课程之教学,以及师资培育中心专以培养学生未来担任中等学校工、商、管理、设计等科之合格教师,合计23个独立系所、师资培育中心、人文学科及通识学科等教学单位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学至今各系所毕业校友已达约56,456位,毕业生出路包含出国继续深造、在台深造以及投身于产业界。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 由于实作经验丰富,理论基础完备,工作态度认真,毕业校友担任政府要职、大学教授、大学校长及企业主管者众多,深受各界的肯定。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工商业设计系副教授孙春望与硕一生全明远耗时两个月自制之三分钟动画短片“立体悲剧”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本片入选有“动画奥斯卡”之称的“ACM SIGGRAPH”国际动画展,并获得观众票选第一名,这也是台湾首次入选及获奖的短片。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 击败了好莱坞知名导演史蒂芬·史匹柏的“世界大战”、乔治卢卡斯的“星际大战三部曲”、梦工厂出品的动画“马达加斯加”、军机缠斗片“机战未来”及美国太空总署、柏克莱加州大学等好莱坞名片及顶尖学术单位制作的短片。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2009年荣获有工业设计界奥斯卡奖之称的“德国iF设计大奖”国立台湾科技大学设计学院获得大学排名的全球第二,仅次于韩国三星美术设计学院“SADI”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 总体排名 依据《泰晤士高等教育》(THES-QS)在2009年的世界大学排名调查,台科大排名全世界第351名,在台湾所有大学中排名第五,仅次于台大,清大,成大及阳明,并且是台湾唯一进入世界四百大名校的科技大学。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 依据在欧洲拥有广大声誉的“Eduniversal商学院排名网”2008年的资料,台湾有七所大学的商管学院被分别列入世界1000大商学院,其中台科大位在“卓越商学院”(EXCELLENT Business Schools,国内主要)之列,“推荐程度”(Recommendation Rate)为全台第四,仅次于台大、政大、中山,与交大并列。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 目前设有工程、电资、管理、设计、人文社会及精诚荣誉学院等六个学院。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 预计于竹北新校区设立产学合作学院及应用理学院。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾建筑科技中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●智慧型机械人研究中心科技成果展示(15张) +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾彩卷与博彩研究中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●电力电子技术研发中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●NCP-Taiwan办公室 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●资通安全研究与教学中心 +在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 +在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 +在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 +在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 +在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 +在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 +在日本,神道最初属于什么信仰? 别名冲道。 +在日本,神道最初属于什么信仰? 属督脉。 +在日本,神道最初属于什么信仰? 
宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 +在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 +在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 +在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 +在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 +在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 +在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 +在日本,神道最初属于什么信仰? 别名冲道。 +在日本,神道最初属于什么信仰? 属督脉。 +在日本,神道最初属于什么信仰? 宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 +在日本,神道最初属于什么信仰? 谓鬼神赐福降灾神妙莫测之道。 +在日本,神道最初属于什么信仰? 《易·观》:“观天之神道,而四时不忒,圣人以神道设教,而天下服矣。” +在日本,神道最初属于什么信仰? 孔颖达 疏:“微妙无方,理不可知,目不可见,不知所以然而然,谓之神道。” +在日本,神道最初属于什么信仰? 《文选·王延寿<鲁灵光殿赋>》:“敷皇极以创业,协神道而大宁。” +在日本,神道最初属于什么信仰? 张载 注:“协和神明之道,而天下大宁。” +在日本,神道最初属于什么信仰? 南朝 梁 刘勰 《文心雕龙·正纬》:“夫神道阐幽,天命微显。” +在日本,神道最初属于什么信仰? 鲁迅 《中国小说史略》第五篇:“﹝ 干宝 ﹞尝感於其父婢死而再生,及其兄气绝复苏,自言见天神事,乃撰《搜神记》二十卷,以‘发明神道之不诬’。” +在日本,神道最初属于什么信仰? 神道设教 观卦里面蕴含着《易经》固有的诸如神道设教、用舍行藏、以德化民等思想,是孔子把这些思想发掘出来。 +在日本,神道最初属于什么信仰? 「据此是孔子见当时之人,惑于吉凶祸福,而卜筮之史,加以穿凿傅会,故演易系辞,明义理,切人事,借卜筮以教后人,所谓以神道设教,其所发明者,实即羲文之义理,而非别有义理,亦非羲文并无义理,至孔子始言义理也,当即朱子之言而小变之曰,易为卜筮作,实为义理作,伏羲文王之易,有占而无文,与今人用火珠林起课者相似,孔子加卦爻辞如签辞,纯以理言,实即羲文本意,则其说分明无误矣。」 +在日本,神道最初属于什么信仰? 孔子所发掘的《易经》思想与孔子在《论语》书中表现出来的思想完全一致。 +在日本,神道最初属于什么信仰? 《易传》的思想反映了孔子的思想,这个思想是《周易》的,也是孔子的。 +在日本,神道最初属于什么信仰? 在《周易》和孔子看来,神不是有意识的人格化的上帝。 +奥林匹克里昂获得了几连霸? 里昂 Lyon 全名 Olympique lyonnais 绰号 Les Gones、OL 成立 1950年 城市 法国,里昂 主场 热尔兰球场(Stade Gerland) 容纳人数 41,044人 主席 奥拉斯 主教练 雷米·加尔德 联赛 法国足球甲级联赛 2013–14 法甲,第 5 位 网站 官方网站 主场球衣 客场球衣 第三球衣 日尔兰体育场 奥林匹克里昂(Olympique lyonnais,简称:OL及Lyon,中文简称里昂)是一间位于法国东南部罗纳-阿尔卑斯区的里昂市的足球会,成立于1950年8月3日,前身为里昂·奥林匹克(Lyon Olympique)体育俱乐部其中一个分支的足球队,1889年离开体育俱乐部自立门户成立新俱乐部,但官方网站表示俱乐部于1950年正式成立。 +奥林匹克里昂获得了几连霸? 现时在法国足球甲级联赛比赛,俱乐部同时设立男子及女子足球队。 +奥林匹克里昂获得了几连霸? 里昂是首届法国足球甲级联赛成员之一,可惜名列第十五位而降落乙组,1951年以乙级联赛冠军获得创会后首次锦标。 +奥林匹克里昂获得了几连霸? 球队在法国足球史上没有取得辉煌成绩,比较优异的算是六十年代曾杀入欧洲杯赛冠军杯四强,及3度晋身法国杯决赛并2次成功获冠。 +奥林匹克里昂获得了几连霸? 直至九十年代末里昂由辛天尼带领,先连续取得联赛头三名,到2002年终于首次登上法国顶级联赛冠军宝座,同年勒冈(Paul Le Guen)接替执教法国国家足球队的辛天尼,他其后继续带领里昂保持气势,加上队中球员小儒尼尼奧、迪亚拉、克里斯蒂亞諾·馬克斯·戈麥斯、迈克尔·埃辛、西德尼·戈武及门将格雷戈里·库佩表现突出,2003年至2005年横扫3届联赛冠军,创下连续四年夺得联赛锦标,平了1960年代末圣艾蒂安及1990年代初马赛的四连冠纪录。 +奥林匹克里昂获得了几连霸? 2005年前利物浦主教练热拉尔·霍利尔重返法国担任新任主教练,并加入葡萄牙中场蒂亚戈,和前巴伦西亚前锋约翰·卡鲁。 +奥林匹克里昂获得了几连霸? 他亦成功带领里昂赢得一届法甲冠军。 +奥林匹克里昂获得了几连霸? 2007年里昂成为首支上市的法国足球俱乐部,招股价21至24.4欧元,发行370万股,集资8400万欧元[1]。 +奥林匹克里昂获得了几连霸? 2007年4月21日,联赛次名图卢兹二比三不敌雷恩,令处于榜首的里昂领先次席多达17分距离,里昂因此提前六轮联赛庆祝俱乐部连续第六年夺得联赛冠军,亦是欧洲五大联赛(英格兰、德国、西班牙、意大利及法国)历史上首支联赛六连冠队伍[2]。 +奥林匹克里昂获得了几连霸? 在2007-08年赛季,里昂再一次成功卫冕联赛锦标,达成七连霸伟业。 +奥林匹克里昂获得了几连霸? 不过在2008-09赛季,里昂排名法甲第三位,联赛冠军被波尔多所获得。 +奥林匹克里昂获得了几连霸? 于2010年4月,里昂以两回合3比2的比分于欧洲冠军联赛击败波尔多跻身四强,此乃里昂首次晋级此项顶级杯赛的四强阶段。 +奥林匹克里昂获得了几连霸? 粗体字为新加盟球员 +奥林匹克里昂获得了几连霸? 以下球员名单更新于2014年8月27日,球员编号参照 官方网站,夏季转会窗为6月9日至8月31日 +火柴人刺杀行动怎么才能过关? 移动鼠标控制瞄准,点击鼠标左键进行射击。 +火柴人刺杀行动怎么才能过关? 游戏加载完成后点击STARTGAME-然后点击STARTMISSION即可开始游戏。 +火柴人刺杀行动怎么才能过关? 这里不仅仅考验的是你的枪法而且最重要的是你的智慧,喜欢火柴人类型游戏的玩家可以进来小试身手。 +火柴人刺杀行动怎么才能过关? 控制瞄准,刺杀游戏中的目标人物即可过关哦。 +你知道2月14日西方情人节是因何起源的吗? 情人节(英语:Valentine's Day),情人节的起源有多个版本,其中一个说法是在公元三世纪,古罗马暴君为了征召更多士兵,禁止婚礼,一名叫瓦伦丁Valentine的修士不理禁令,秘密替人主持婚礼,结果被收监,最后处死。 +你知道2月14日西方情人节是因何起源的吗? 而他死的那天就是2月14日,为纪念Valentine的勇敢精神,人们将每年的2月14日定为Valentine的纪念日。 +你知道2月14日西方情人节是因何起源的吗? 因此成了后来的“情人节”。 +你知道2月14日西方情人节是因何起源的吗? 另外,据记载,教宗在公元496年废除牧神节,把2月14日定为圣瓦伦丁日,即是St.Valentine's Day,后来成为是西方的节日之一。 +你知道2月14日西方情人节是因何起源的吗? 中文名称:情人节 +你知道2月14日西方情人节是因何起源的吗? 外文名称:Valentine‘s Day +你知道2月14日西方情人节是因何起源的吗? 别名:情人节圣瓦伦丁节 +你知道2月14日西方情人节是因何起源的吗? 公历日期:2月14日 +你知道2月14日西方情人节是因何起源的吗? 起源时间:公元270年2月14日 +你知道2月14日西方情人节是因何起源的吗? 起源事件:人们为了纪念为情人做主而牺牲的瓦伦丁神父,把他遇害的那一天(2月14日)称为情人节。 +你知道2月14日西方情人节是因何起源的吗? 地区:欧美地区 +你知道2月14日西方情人节是因何起源的吗? 宗教:基督教 +你知道2月14日西方情人节是因何起源的吗? 其他信息:西方的传统节日之一。 +你知道2月14日西方情人节是因何起源的吗? 男女在这一天互送礼物(如贺卡和玫瑰花等)用以表达爱意或友好。 +你知道2月14日西方情人节是因何起源的吗? 
据台湾“今日台湾人讨厌情人节新闻网”报道,西洋情人节即将来到,求职网进行“办公室恋情及情人节调查”发现,在目前全台上班族的感情状态中,有情人相伴的比率约5成5,4成5的上班族单身;较出乎意料的结果是,情人节以近3成(28%)的占比,登上最讨厌的节日第一名,端午节以24.3%居第二;农历年则以18.2%居第三;第四名是圣诞节,占12.4%。 +你知道2月14日西方情人节是因何起源的吗? 调查指出,情人节对单身族来说,不仅成为压力,也显得更加孤单,在情人节当天,单身的上班族有将近4成(39.1%)的人在家看电视度过,近两成(18.7%)上网聊天,有1成4(14.8%)的人,不畏满街闪光,勇气十足出门看电影,近1成(9.7%)的上班族选择留在公司加班;另外有 5.4%的人,会在情人节当天积极参加联谊,希望能改变自己的感情状态。 +你知道2月14日西方情人节是因何起源的吗? 情侣们在情人节当天,庆祝方式以吃浪漫大餐最多(37.1%),不过有近3成(27%)的情侣,在情人节当天不会特别庆祝情人节,且这个比率远比第三名的旅游(占比11.5%)高出1倍以上。 +你知道2月14日西方情人节是因何起源的吗? 在情人节当天庆祝的开销上,可以说是小资男女当道,选择1000元(新台币,下同)以内的上班族最多占33.1%,情人节当天的花费上班族的平均花费是2473元,大手笔花费上万元以上庆祝情人节的,占比只有2.5%。 +你知道2月14日西方情人节是因何起源的吗? 情人节的起源众说纷纭,而为纪念罗马教士瓦伦丁是其中一个普遍的说法。 +你知道2月14日西方情人节是因何起源的吗? 据《世界图书百科全书》(World Book Encyclopedia)数据指出:“在公元200年时期,罗马皇帝克劳狄二世禁止年轻男子结婚。 +你知道2月14日西方情人节是因何起源的吗? 他认为未婚男子可以成为更优良的士兵。 +你知道2月14日西方情人节是因何起源的吗? 一位名叫瓦伦丁的教士违反了皇帝的命令,秘密为年轻男子主持婚礼,引起皇帝不满,结果被收监,据说瓦伦丁于公元269年2月14日被处决。 +你知道2月14日西方情人节是因何起源的吗? 另外,据《天主教百科全书》(The Catholic情人节 Encyclopedia)指出,公元496年,教宗圣基拉西乌斯一世在公元第五世纪末叶废除了牧神节,把2月14日定为圣瓦伦丁日。” +你知道2月14日西方情人节是因何起源的吗? 这个节日现今以“圣瓦伦丁节”——亦即情人节的姿态盛行起来。 +你知道2月14日西方情人节是因何起源的吗? 但是在第2次梵蒂冈大公会议后,1969年的典礼改革上,整理了一堆在史实上不确定是否真实存在的人物以后,圣瓦伦丁日就被废除了。 +你知道2月14日西方情人节是因何起源的吗? 现在天主教圣人历已经没有圣瓦伦丁日(St. Valentine's Day)。 +你知道2月14日西方情人节是因何起源的吗? 根据《布卢姆尔的警句与寓言辞典》记载:“圣瓦伦丁是个罗马教士,由于援助受逼害的基督徒而身陷险境,后来他归信基督教,最后被处死,卒于二月十四日”古代庆祝情人节的习俗与瓦伦丁拉上关系,可能是纯属巧合而已。 +你知道2月14日西方情人节是因何起源的吗? 事实上,这个节日很可能与古罗马的牧神节或雀鸟交配的季节有关。 +你知道2月14日西方情人节是因何起源的吗? 情人节的特色是情侣互相馈赠礼物。 +你知道2月14日西方情人节是因何起源的吗? 时至今日,人们则喜欢以情人卡向爱人表达情意。 +防卫大学每年招收多少学生? 防卫大学的前身是保安大学。 +防卫大学每年招收多少学生? 防卫大学是日本自卫队培养陆、海、空三军初级军官的学校,被称为日军"军官的摇篮"。 +防卫大学每年招收多少学生? 防卫大学是日军的重点院校。 +防卫大学每年招收多少学生? 日本历届内阁首相都要到防卫大学视察"训示",并亲自向学生颁发毕业证书。 +防卫大学每年招收多少学生? 日军四分之一的军官、三分之一的将官从这里走出。 +防卫大学每年招收多少学生? 防卫大学毕业生已成为日军军官的中坚力量。 +防卫大学每年招收多少学生? 防卫大学每年从地方招收18岁至21岁的应届高中毕业生和同等学历的青年。 +防卫大学每年招收多少学生? 每年招生名额为530名。 +防卫大学每年招收多少学生? 1950年 8月,日本组建警察预备队,1952年改为保安队。 +防卫大学每年招收多少学生? 为了充实保安队干部队伍,提高干部军政素质,1953年4月成立了保安大学,校址设在三浦半岛的久里滨。 +防卫大学每年招收多少学生? 1954年7月1日保安厅改为防卫厅。 +防卫大学每年招收多少学生? 在保安队基础上,日本建立了陆、海、空三军自卫队,保安大学遂改名为防卫大学,1955年迁至三浦半岛东南方的小原台。 +防卫大学每年招收多少学生? 学校直属防卫厅领导。 +防卫大学每年招收多少学生? 防卫大学的教育方针是:要求学生德智体全面发展,倡导学生崇尚知识和正义,培养学生具有指挥各种部队的能力。 +防卫大学每年招收多少学生? 防卫大学每年招生名额为530名,其中陆军300名,海军100名,空军130名。 +防卫大学每年招收多少学生? 根据自卫队向妇女敞开军官大门的决定,防卫大学1992年首次招收女学员35名。 +防卫大学每年招收多少学生? 考试分两次进行。 +防卫大学每年招收多少学生? 第一次,每年11月份进行学科考试;第二次,12月份进行口试和体检。 +防卫大学每年招收多少学生? 学校按陆、海、空三军分别设大学本科班和理工科研究生班。 +防卫大学每年招收多少学生? 本科班学制4年,又分为理工和人文社会学两大科。 +防卫大学每年招收多少学生? 学员入学后先分科,530人中有460人专攻理科,70人专攻文科。 +防卫大学每年招收多少学生? 第1学年按专科学习一般大学课程和一般军事知识。 +防卫大学每年招收多少学生? 第2学年以后在军事上开始区分军种,学员分别学习陆、海、空军的专门课程。 +防卫大学每年招收多少学生? 文化课和军事课的比例是6:l。 +防卫大学每年招收多少学生? 文化课程有人文、社会、自然、外语、电气工程、机械工程、土木建筑工程、应用化学、应用物理、航空、航海等。 +防卫大学每年招收多少学生? 军事训练课每学年6周,按一年四季有比例地安排教学内容,对学生进行军事技术和体能训练。 +防卫大学每年招收多少学生? 理工科研究生班,每年招生1期,学制2年,每期招收90人,设电子工程、航空工程、兵器制造等7个专业,课程按一般大学硕士课程标准设置。 +防卫大学每年招收多少学生? 防卫大学的课程和训练都十分紧张。 +防卫大学每年招收多少学生? 近年来,为了增强防卫大学的吸引力,克服考生逐年减少的倾向广泛征集优秀人才,学校进行了一些改革,改变入学考试办法,各高中校长以内部呈报的形式向防卫大学推荐品学兼优的学生;减少学生入学考试科目,放宽对报考防卫大学的学生的视力要求;降低学分数(大约降低30学分);改善学生宿舍条件。 +防卫大学每年招收多少学生? 防卫大学的学生生活紧张而愉快。 +《威鲁贝鲁的物语》官网是什么? 10年前大战后,威鲁贝鲁国一致辛勤的保护着得来不易的和平,但是与邻国圣卡特拉斯国的关系却不断的紧张,战争即将爆发。 +《威鲁贝鲁的物语》官网是什么? 为了避免战争,威鲁贝鲁国王海特鲁王决定将自己最大的女儿公主莉塔嫁给圣卡特拉斯国的王子格鲁尼亚。 +《威鲁贝鲁的物语》官网是什么? 但是莉塔却刺伤了政治婚姻的对象格鲁尼亚王子逃了出去,这事激怒了圣卡特拉斯国的国王兰帕诺夫王,并下令14天之内抓到王女并执行公开处刑来谢罪,不然两国就要开战。 +《威鲁贝鲁的物语》官网是什么? 《威鲁贝鲁的物语~Sisters of Wellber~》 +《威鲁贝鲁的物语》官网是什么? (Sisters of Wellber) +《威鲁贝鲁的物语》官网是什么? 日文名 ウエルベールの物语 +《威鲁贝鲁的物语》官网是什么? 官方网站 http://www.avexmovie.jp/lineup/wellber/ +《威鲁贝鲁的物语》官网是什么? 
为了回避发生战争这个最坏的结果,莉塔下定决心去中立国古利达姆。 diff --git a/examples/text_graph/erniesage/example_data/train_data.txt b/examples/text_graph/erniesage/example_data/train_data.txt new file mode 100644 index 0000000000000000000000000000000000000000..e9aead6c89fa2fdbed64e5dada352106a8deb349 --- /dev/null +++ b/examples/text_graph/erniesage/example_data/train_data.txt @@ -0,0 +1,1000 @@ +黑缘粗角肖叶甲触角有多大? 体长卵形,棕红色;鞘翅棕黄或淡棕色,外缘和中缝黑色或黑褐色;触角基部3、4节棕黄,余节棕色。 +黑缘粗角肖叶甲触角有多大? 头部刻点粗大,分布不均匀,头顶刻点十分稀疏;触角基部的内侧有一个三角形光瘤,唇基前缘呈半圆形凹切。 +黑缘粗角肖叶甲触角有多大? 触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。 +黑缘粗角肖叶甲触角有多大? 前胸背板横宽,宽约为长的两倍,侧缘敞出较宽,圆形,敞边与盘区之间有一条细纵沟;盘区刻点相当密,前半部刻点较大于后半部。 +黑缘粗角肖叶甲触角有多大? 小盾片舌形,光亮,末端圆钝。 +黑缘粗角肖叶甲触角有多大? 鞘翅刻点粗大,不规则排列,肩部之后的刻点更为粗大,具皱褶,近中缝的刻点较小,略呈纵行排列。 +黑缘粗角肖叶甲触角有多大? 前胸前侧片前缘直;前胸后侧片具粗大刻点。 +黑缘粗角肖叶甲触角有多大? 足粗壮;胫节具纵脊,外端角向外延伸,呈弯角状;爪具附齿。 +暮光闪闪的姐姐是谁? 暮光闪闪是一匹雌性独角兽,后来在神秘魔法的影响下变成了空角兽(公主),她是《我的小马驹:友情是魔法》(英文名:My Little Pony:Friendship is Magic)中的主角之一。 +暮光闪闪的姐姐是谁? 她是银甲闪闪(Shining Armor)的妹妹,同时也是韵律公主(Princess Cadance)的小姑子。 +暮光闪闪的姐姐是谁? 在该系列中,她与最好的朋友与助手斯派克(Spike)一起生活在小马镇(Ponyville)的金橡图书馆(Golden Oak Library),研究友谊的魔法。 +暮光闪闪的姐姐是谁? 在暮光闪闪成为天角兽之前(即S3E13前),常常给塞拉丝蒂娅公主(Princess Celestia)关于友谊的报告。[1] +暮光闪闪的姐姐是谁? 《我的小马驹:友谊是魔法》(英文名称:My Little Pony:Friendship is Magic)(简称MLP) +暮光闪闪的姐姐是谁? 动画讲述了一只名叫做暮光闪闪(Twilight Sparkle)的独角兽(在SE3E13 +暮光闪闪的姐姐是谁? My Little Pony:Friendship is Magic[2] +暮光闪闪的姐姐是谁? 后成为了天角兽),执行她的导师塞拉斯蒂娅公主(PrincessCelestia)的任务,在小马镇(Ponyville)学习关于友谊的知识。 +暮光闪闪的姐姐是谁? 她与另外五只小马,苹果杰克(Applejack)、瑞瑞(Rarity)、云宝黛西(Rainbow Dash)、小蝶(Fluttershy)与萍琪派(Pinkie Pie),成为了最要好的朋友。 +暮光闪闪的姐姐是谁? 每匹小马都分别代表了协律精华的6个元素:诚实,慷慨,忠诚,善良,欢笑,魔法,各自扮演着属于自己的重要角色。 +暮光闪闪的姐姐是谁? 此后,暮光闪闪(Twilight Sparkle)便与她认识的新朋友们开始了有趣的日常生活。 +暮光闪闪的姐姐是谁? 在动画中,随时可见她们在小马镇(Ponyville)的种种冒险、奇遇、日常等等。 +暮光闪闪的姐姐是谁? 同时,也在她们之间的互动和冲突中,寻找着最适合最合理的完美解决方案。 +暮光闪闪的姐姐是谁? “尽管小马国并不太平,六位主角之间也常常有这样那样的问题,但是他们之间的真情对待,使得这个童话世界已经成为不少人心中理想的世外桃源。” +暮光闪闪的姐姐是谁? 暮光闪闪在剧情刚开始的时候生活在中心城(Canterlot),后来在夏日 +暮光闪闪的姐姐是谁? 暮光闪闪与斯派克(Spike) +暮光闪闪的姐姐是谁? 庆典的时候被塞拉丝蒂娅公主派遣到小马镇执行检查夏日庆典的准备工作的任务。 +暮光闪闪的姐姐是谁? 在小马镇交到了朋友(即其余5个主角),并和她们一起使用协律精华(Elements of harmony)击败了梦魇之月。 +暮光闪闪的姐姐是谁? 并在塞拉丝蒂亚公主的许可下,留在小马镇继续研究友谊的魔法。 +暮光闪闪的姐姐是谁? 暮光闪闪的知识基本来自于书本,并且她相当不相信书本以外的“迷信”,因为这样她在S1E15里吃足了苦头。 +暮光闪闪的姐姐是谁? 在这之后,她也开始慢慢学会相信一些书本以外的东西。 +暮光闪闪的姐姐是谁? 暮光闪闪热爱学习,并且学习成绩相当好(从她可以立刻算出 +暮光闪闪的姐姐是谁? 的结果可以看 +暮光闪闪的姐姐是谁? 暮光闪闪的原型 +暮光闪闪的姐姐是谁? 出)。 +暮光闪闪的姐姐是谁? 相当敬爱自己的老师塞拉丝蒂亚公主甚至到了精神失常的地步。 +暮光闪闪的姐姐是谁? 在第二季中,曾因为无法交出关于友谊的报告而做出了疯狂的行为,后来被塞拉丝蒂亚公主制止,在这之后,暮光闪闪得到了塞拉丝蒂亚公主“不用定期交友谊报告”的许可。 +暮光闪闪的姐姐是谁? 于是暮光闪闪在后面的剧情中的主角地位越来越得不到明显的体现。 +暮光闪闪的姐姐是谁? 在SE3E13中,因为破解了白胡子星璇留下的神秘魔法而被加冕成为了天角兽(公主),被尊称为“闪闪公主”。 +暮光闪闪的姐姐是谁? 当小星座熊在小马镇引起恐慌的时候,暮光闪闪运用了自身强大的魔法将水库举起后装满牛奶,用牛奶将小星座熊安抚后,连着巨型奶瓶和小星座熊一起送回了小星座熊居住的山洞。 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭是由汪氏集团旗下子公司江西尤金房地产开发有限公司携手城发投资共同开发的精品社区,项目占地面积约380亩,总建筑面积约41万平方米。 +我想知道红谷十二庭有哪些金融机构? 项目以建设人文型、生态型居住环境为规划目标;创造一个布局合理、功能齐全、交通便捷、绿意盎然、生活方便,有文化内涵的居住区。 +我想知道红谷十二庭有哪些金融机构? 金融机构:工商银行、建设银行、农业银行、中国银行红谷滩支行、商业银行红谷滩支行等 +我想知道红谷十二庭有哪些金融机构? 周边公园:沿乌砂河50米宽绿化带、乌砂河水岸公园、秋水广场、赣江市民公园 +我想知道红谷十二庭有哪些金融机构? 周边医院:新建县人民医院、开心人药店、中寰医院 +我想知道红谷十二庭有哪些金融机构? 周边学校:育新小学红谷滩校区、南师附小红谷滩校区、实验小学红谷滩校区中学:南昌二中红谷滩校区、南昌五中、新建二中、竞秀贵族学校 +我想知道红谷十二庭有哪些金融机构? 周边公共交通:112、204、211、219、222、227、238、501等20多辆公交车在本项目社区门前停靠 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭处在南昌一江两城中的西城中心,位属红谷滩CBD文化公园中心——马兰圩中心组团,红谷滩中心区、红角洲、新建县三区交汇处,南临南期友好路、东接红谷滩中心区、西靠乌砂河水岸公园(50米宽,1000米长)。 +我想知道红谷十二庭有哪些金融机构? 交通便捷,景观资源丰富,生活配套设施齐全,出则繁华,入则幽静,是现代人居的理想地段。 +我想知道红谷十二庭有哪些金融机构? 红谷十二庭户型图 +苏琳最开始进入智通实业是担任什么职位? 现任广东智通人才连锁股份有限公司总裁,清华大学高级工商管理硕士。 +苏琳最开始进入智通实业是担任什么职位? 1994年,加入智通实业,从总经理秘书做起。 +苏琳最开始进入智通实业是担任什么职位? 1995年,智通实业决定进入人才服务行业,被启用去负责新公司的筹建及运营工作,在苏琳的努力下,智通人才智力开发有限公司成立。 +苏琳最开始进入智通实业是担任什么职位? 
2003年,面对同城对手的激烈竞争,苏琳冷静对待,领导智通先后接管、并购了同城的腾龙、安达盛人才市场,,“品牌运作,连锁经营,差异制胜”成为苏琳屡屡制胜的法宝。 +苏琳最开始进入智通实业是担任什么职位? 2006年,苏琳先是将智通人才升级为“东莞市智通人才连锁有限公司”,一举成为广东省人才市场目前惟一的连锁机构,随后在东莞同时开设长安、松山湖、清溪等镇区分部,至此智通在东莞共有6个分部。 +苏琳最开始进入智通实业是担任什么职位? 一番大刀阔斧完成东莞布局后,苏琳确定下一个更为高远的目标——进军珠三角,向全国发展连锁机构。 +苏琳最开始进入智通实业是担任什么职位? 到2011年末,苏琳领导的智通人才已在珠三角的东莞、佛山、江门、中山等地,长三角的南京、宁波、合肥等地,中西部的南昌、长沙、武汉、重庆、西安等地设立了20多家连锁经营网点。 +苏琳最开始进入智通实业是担任什么职位? 除了财务副总裁之外,苏琳是智通人才核心管理高层当中唯一的女性,不管是要约采访的记者还是刚刚加入智通的员工,见到苏琳的第一面,都会有一种惊艳的感觉,“一位女企业家居然非常美丽和时尚?!” +苏琳最开始进入智通实业是担任什么职位? 智通管理高层的另外6位男性成员,有一次同时接受一家知名媒体采访时,共同表达了对自己老板的“爱慕”之情,苏琳听后莞尔一笑,指着在座的这几位高层说道“其实,我更爱他们!” +苏琳最开始进入智通实业是担任什么职位? 这种具有独特领导魅力的表述让这位记者唏嘘不已,同时由这样的一个细节让他感受到了智通管理团队的协作力量。 +谁知道黄沙中心小学的邮政编码是多少? 学校于1954年始建于棕树湾村,当时借用一间民房做教室,取名为“黄沙小学”,只有教师1人,学生8人。 +谁知道黄沙中心小学的邮政编码是多少? 1958年在大跃进精神的指导下,实行大集体,全乡集中办学,发展到12个班,300多学生,20名教职工。 +谁知道黄沙中心小学的邮政编码是多少? 1959年解散。 +谁知道黄沙中心小学的邮政编码是多少? 1959年下半年,在上级的扶持下,建了6间木房,搬到1960年学校所在地,有6名教师,3个班,60名学生。 +谁知道黄沙中心小学的邮政编码是多少? 1968年,开始招收一个初中班,“黄沙小学”改名为 “附小”。 +谁知道黄沙中心小学的邮政编码是多少? 当时已发展到5个班,8名教师,110多名学生。 +谁知道黄沙中心小学的邮政编码是多少? 增建土木结构教室两间。 +谁知道黄沙中心小学的邮政编码是多少? 1986年,初中、小学分开办学。 +谁知道黄沙中心小学的邮政编码是多少? 增建部分教师宿舍和教室,办学条件稍有改善,学校初具规模。 +谁知道黄沙中心小学的邮政编码是多少? 1996年,我校在市、县领导及希望工程主管部门的关怀下,决定改为“黄沙希望小学”并拨款32万元,新建一栋4层,12间教室的教学楼,教学条件大有改善。 +谁知道黄沙中心小学的邮政编码是多少? 当时发展到10个班,学生300多人,教职工19人,小学高级教师3人,一级教师7人,二级教师9人。 +谁知道黄沙中心小学的邮政编码是多少? 2003年下半年由于农村教育体制改革,撤销教育组,更名为“黄沙中心小学”。 +谁知道黄沙中心小学的邮政编码是多少? 学校现有在校生177人(含学前42人),设有学前至六年级共7个教学班。 +谁知道黄沙中心小学的邮政编码是多少? 有教师19人,其中大专以上学历11人,中师6人;小学高级教师14人,一级教师5人。 +谁知道黄沙中心小学的邮政编码是多少? 学校校园占地面积2050平方米,生均达15.29平方米,校舍建筑面积1645平方米,生均12.27平方米;设有教师办公室、自然实验、电教室(合二为一)、微机室、图书阅览室(合二为一)、体育室、广播室、少先队活动室。 +谁知道黄沙中心小学的邮政编码是多少? 广西壮族自治区桂林市临桂县黄沙瑶族乡黄沙街 邮编:541113[1] +伊藤实华的职业是什么? 伊藤实华(1984年3月25日-)是日本的女性声优。 +伊藤实华的职业是什么? THREE TREE所属,东京都出身,身长149cm,体重39kg,血型AB型。 +伊藤实华的职业是什么? ポルノグラフィティのLION(森男) +伊藤实华的职业是什么? 2000年 +伊藤实华的职业是什么? 犬夜叉(枫(少女时代)) +伊藤实华的职业是什么? 幻影死神(西亚梨沙) +伊藤实华的职业是什么? 2001年 +伊藤实华的职业是什么? NOIR(ロザリー) +伊藤实华的职业是什么? 2002年 +伊藤实华的职业是什么? 水瓶战记(柠檬) +伊藤实华的职业是什么? 返乡战士(エイファ) +伊藤实华的职业是什么? 2003年 +伊藤实华的职业是什么? 奇诺之旅(女子A(悲しい国)) +伊藤实华的职业是什么? 2004年 +伊藤实华的职业是什么? 爱你宝贝(坂下ミキ) +伊藤实华的职业是什么? Get Ride! アムドライバー(イヴァン・ニルギース幼少期) +伊藤实华的职业是什么? スクールランブル(花井春树(幼少时代)) +伊藤实华的职业是什么? 2005年 +伊藤实华的职业是什么? 光速蒙面侠21(虎吉) +伊藤实华的职业是什么? 搞笑漫画日和(男子トイレの精、パン美先生) +伊藤实华的职业是什么? 银牙伝说WEED(テル) +伊藤实华的职业是什么? 魔女的考验(真部カレン、守山太郎) +伊藤实华的职业是什么? BUZZER BEATER(レニー) +伊藤实华的职业是什么? 虫师(“眼福眼祸”さき、“草を踏む音”沢(幼少时代)) +伊藤实华的职业是什么? 2006年 +伊藤实华的职业是什么? 魔女之刃(娜梅) +伊藤实华的职业是什么? 反斗小王子(远藤レイラ) +伊藤实华的职业是什么? 搞笑漫画日和2(パン美先生、フグ子、ダンサー、ヤマトの妹、女性) +伊藤实华的职业是什么? 人造昆虫カブトボーグ V×V(ベネチアンの弟、东ルリ、园儿A) +伊藤实华的职业是什么? 2007年 +爆胎监测与安全控制系统英文是什么? 爆胎监测与安全控制系统(Blow-out Monitoring and Brake System),是吉利全球首创,并拥有自主知识产权及专利的一项安全技术。 +爆胎监测与安全控制系统英文是什么? 这项技术主要是出于防止高速爆胎所导致的车辆失控而设计。 +爆胎监测与安全控制系统英文是什么? BMBS爆胎监测与安全控制系统技术于2004年1月28日正式获得中国发明专利授权。 +爆胎监测与安全控制系统英文是什么? 2008年第一代BMBS系统正式与世人见面,BMBS汇集国内外汽车力学、控制学、人体生理学、电子信息学等方面的专家和工程技术人员经过一百余辆试验车累计行程超过五百万公里的可靠性验证,以确保产品的可靠性。 +爆胎监测与安全控制系统英文是什么? BMBS技术方案的核心即是采用智能化自动控制系统,弥补驾驶员生理局限,在爆胎后反应时间为0.5秒,替代驾驶员实施行车制动,保障行车安全。 +爆胎监测与安全控制系统英文是什么? BMBS系统由控制系统和显示系统两大部分组成,控制系统由BMBS开关、BMBS主机、BMBS分机、BMBS真空助力器四部分组成;显示系统由GPS显示、仪表指示灯、语言提示、制动双闪灯组成。 +爆胎监测与安全控制系统英文是什么? 当轮胎气压高于或低于限值时,控制器声光提示胎压异常。 +爆胎监测与安全控制系统英文是什么? 轮胎温度过高时,控制器发出信号提示轮胎温度过高。 +爆胎监测与安全控制系统英文是什么? 发射器电量不足时,控制器显示低电压报警。 +爆胎监测与安全控制系统英文是什么? 发射器受到干扰长期不发射信号时,控制器显示无信号报警。 +爆胎监测与安全控制系统英文是什么? 当汽车电门钥匙接通时,BMBS首先进入自检程序,检测系统各部分功能是否正常,如不正常,BMBS报警灯常亮。 +走读干部现象在哪里比较多? 走读干部一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么晚出早归,要么周一去单位上班、周五回家过周末。 +走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 +走读干部现象在哪里比较多? 截至2014年10月,共有6484名“走读干部”在专项整治中被查处。 +走读干部现象在哪里比较多? 这是中央首次大规模集中处理这一长期遭诟病的干部作风问题。 +走读干部现象在哪里比较多? 
干部“走读”问题主要在乡镇地区比较突出,城市地区则较少。 +走读干部现象在哪里比较多? 从历史成因和各地反映的情况来看,产生“走读”现象的主要原因大致有四种: +走读干部现象在哪里比较多? 现今绝大多数乡村都有通往乡镇和县城的石子公路甚至柏油公路,这无疑为农村干部的出行创造了便利条件,为“干部像候鸟,频往家里跑”创造了客观条件。 +走读干部现象在哪里比较多? 选调生、公务员队伍大多是学历较高的大学毕业生,曾在高校所在地的城市生活,不少人向往城市生活,他们不安心长期扎根基层,而是将基层当作跳板,因此他们往往成为“走读”的主力军。 +走读干部现象在哪里比较多? 公仆意识、服务意识淡化,是“走读”现象滋生的主观原因。 +走读干部现象在哪里比较多? 有些党员干部感到自己长期在基层工作,该为自己和家庭想想了。 +走读干部现象在哪里比较多? 于是,不深入群众认真调查研究、认真听取群众意见、认真解决群众的实际困难,也就不难理解了。 +走读干部现象在哪里比较多? 县级党政组织对乡镇领导干部管理的弱化和为基层服务不到位,导致“走读”问题得不到应有的制度约束,是“走读”问题滋长的组织原因。[2] +走读干部现象在哪里比较多? 近些年来,我国一些地方的“干部走读”现象较为普遍,社会上对此议走读干部论颇多。 +走读干部现象在哪里比较多? 所谓“干部走读”,一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么早出晚归,要么周一去单位上班、周五回家过周末。 +走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 +走读干部现象在哪里比较多? 干部走读之所以成为“千夫所指”,是因为这种行为增加了行政成本。 +走读干部现象在哪里比较多? 从根子上说,干部走读是城乡发展不平衡的产物,“人往高处走,水往低处流”,有了更加舒适的生活环境,不管是为了自己生活条件改善也好,还是因为子女教育也好,农村人口向城镇转移,这是必然结果。 +走读干部现象在哪里比较多? “干部走读”的另一个重要原因,是干部人事制度改革。 +走读干部现象在哪里比较多? 目前公务员队伍“凡进必考”,考上公务员的大多是学历较高的大学毕业生,这些大学毕业生来自各个全国各地,一部分在本地结婚生子,沉淀下来;一部分把公务员作为跳板,到基层后或考研,或再参加省考、国考,或想办法调回原籍。 +走读干部现象在哪里比较多? 再加上一些下派干部、异地交流任职干部,构成了看似庞大的“走读”队伍。 +走读干部现象在哪里比较多? 那么,“干部走读”有哪些弊端呢? +走读干部现象在哪里比较多? 一是这些干部人在基层,心在城市,缺乏长期作战的思想,工作不安心。 +走读干部现象在哪里比较多? 周一来上班,周五回家转,对基层工作缺乏热情和感情;二是长期在省市直机关工作,对基层工作不熟悉不了解,工作不热心;三是长期走读,基层干群有工作难汇报,有困难难解决,群众不开心;四是干部来回走读,公车私驾,私费公报,把大量的经济负担转嫁给基层;五是对这些走读干部,基层管不了,上级监督难,节假日期间到哪里去、做什么事,基本处于失控和真空状态,各级组织和基层干群不放心。 +走读干部现象在哪里比较多? 特别需要引起警觉的是,由于少数走读干部有临时思想,满足于“当维持会长”,得过且过混日子,热衷于做一些急功近利、砸锅求铁的短期行为和政绩工程,不愿做打基础、管长远的实事好事,甚至怠政、疏政和懒于理政,影响了党和政府各项方针政策措施的落实,导致基层无政府主义、自由主义抬头,削弱了党和政府的领导,等到矛盾激化甚至不可收拾的时候,处理已是来之不及。 +走读干部现象在哪里比较多? 权利要与义务相等,不能只有义务而没有权利,或是只有权利没有义务。 +走读干部现象在哪里比较多? 如何真正彻底解决乡镇干部“走读”的现象呢? +走读干部现象在哪里比较多? 那就必须让乡镇基层干部义务与权利相等。 +走读干部现象在哪里比较多? 如果不能解决基层干部待遇等问题,即使干部住村,工作上也不会有什么进展的。 +走读干部现象在哪里比较多? 所以,在政治上关心,在生活上照顾,在待遇上提高。 +走读干部现象在哪里比较多? 如,提高基层干部的工资待遇,增加通讯、交通补助;帮助解决子女入学及老人赡养问题;提拔干部优先考虑基层干部;干部退休时的待遇至少不低于机关干部等等。 +化州市良光镇东岸小学学风是什么? 学校全体教职工爱岗敬业,团结拼搏,勇于开拓,大胆创新,进行教育教学改革,努力开辟第二课堂的教学路子,并开通了网络校校通的交流合作方式。 +化州市良光镇东岸小学学风是什么? 现学校教师正在为创建安全文明校园而努力。 +化州市良光镇东岸小学学风是什么? 东岸小学位置偏僻,地处贫穷落后,是良光镇最偏远的学校,学校,下辖分教点——东心埇小学,[1]?。 +化州市良光镇东岸小学学风是什么? 学校2011年有教师22人,学生231人。 +化州市良光镇东岸小学学风是什么? 小学高级教师8人,小学一级教师10人,未定级教师4人,大专学历的教师6人,其余的都具有中师学历。 +化州市良光镇东岸小学学风是什么? 全校共设12个班,学校课程按标准开设。 +化州市良光镇东岸小学学风是什么? 东岸小学原来是一所破旧不堪,教学质量非常差的薄弱学校。 +化州市良光镇东岸小学学风是什么? 近几年来,在各级政府、教育部门及社会各界热心人士鼎力支持下,学校领导大胆改革创新,致力提高教学质量和教师水平,并加大经费投入,大大改善了办学条件,使学校由差变好,实现了大跨越。 +化州市良光镇东岸小学学风是什么? 学校建设性方面。 +化州市良光镇东岸小学学风是什么? 东岸小学属于革命老区学校,始建于1980年,从东心埇村祠堂搬到这个校址,1990年建造一幢建筑面积为800平方米的南面教学楼, 1998年老促会支持从北面建造一幢1800平方米的教学大楼。 +化州市良光镇东岸小学学风是什么? 学校在管理方面表现方面颇具特色,实现了各项制度的日常化和规范化。 +化州市良光镇东岸小学学风是什么? 学校领导有较强的事业心和责任感,讲求民主与合作,勤政廉政,依法治校,树立了服务意识。 +化州市良光镇东岸小学学风是什么? 学校一贯实施“德育为先,以人为本”的教育方针,制定了“团结,律已,拼搏,创新”的校训。 +化州市良光镇东岸小学学风是什么? 教育风为“爱岗敬业,乐于奉献”,学风为“乐学,勤学,巧学,会学”。 +化州市良光镇东岸小学学风是什么? 校内营造了尊师重教的氛围,形成了良好的校风和学风。 +化州市良光镇东岸小学学风是什么? 教师们爱岗敬业,师德高尚,治学严谨,教研教改气氛浓厚,获得喜人的教研成果。 +化州市良光镇东岸小学学风是什么? 近几年来,教师撰写的教育教学论文共10篇获得县市级以上奖励,获了镇级以上奖励的有100人次。 +化州市良光镇东岸小学学风是什么? 学校德育工作成绩显著,多年被评为“安全事故为零”的学校,良光镇先进学校。 +化州市良光镇东岸小学学风是什么? 特别是教学质量大大提高了。 +化州市良光镇东岸小学学风是什么? 这些成绩得到了上级及群众的充分肯定。 +化州市良光镇东岸小学学风是什么? 1.学校环境欠美观有序,学校大门口及校道有待改造。 +化州市良光镇东岸小学学风是什么? 2.学校管理制度有待改进,部分教师业务水平有待提高。 +化州市良光镇东岸小学学风是什么? 3.教师宿舍、教室及学生宿舍欠缺。 +化州市良光镇东岸小学学风是什么? 4.运动场不够规范,各类体育器材及设施需要增加。 +化州市良光镇东岸小学学风是什么? 5.学生活动空间少,见识面窄,视野不够开阔。 +化州市良光镇东岸小学学风是什么? 1.努力营造和谐的教育教学新气氛。 +化州市良光镇东岸小学学风是什么? 建立科学的管理制度,坚持“与时俱进,以人为本”,真正实现领导对教师,教师对学生之间进行“德治与情治”;学校的人文环境做到“文明,和谐,清新”;德育环境做到“自尊,律已,律人”;心理环境做到“安全,谦虚,奋发”;交际环境做到“团结合作,真诚助人”;景物环境做到“宜人,有序。” +化州市良光镇东岸小学学风是什么? 营造学校与育人的新特色。 +我很好奇发射管的输出功率怎么样? 产生或放大高频功率的静电控制电子管,有时也称振荡管。 +我很好奇发射管的输出功率怎么样? 用于音频或开关电路中的发射管称调制管。 +我很好奇发射管的输出功率怎么样? 
发射管是无线电广播、通信、电视发射设备和工业高频设备中的主要电子器件。 +我很好奇发射管的输出功率怎么样? 输出功率和工作频率是发射管的基本技术指标。 +我很好奇发射管的输出功率怎么样? 广播、通信和工业设备的发射管,工作频率一般在30兆赫以下,输出功率在1919年为2千瓦以下,1930年达300千瓦,70年代初已超过1000千瓦,效率高达80%以上。 +我很好奇发射管的输出功率怎么样? 发射管工作频率提高时,输出功率和效率都会降低,因此1936年首次实用的脉冲雷达工作频率仅28兆赫,80年代则已达 400兆赫以上。 +我很好奇发射管的输出功率怎么样? 40年代电视发射管的工作频率为数十兆赫,而80年代初,优良的电视发射管可在1000兆赫下工作,输出功率达20千瓦,效率为40%。 +我很好奇发射管的输出功率怎么样? 平面电极结构的小功率发射三极管可在更高的频率下工作。 +我很好奇发射管的输出功率怎么样? 发射管多采用同心圆筒电极结构。 +我很好奇发射管的输出功率怎么样? 阴极在最内层,向外依次为各个栅极和阳极。 +我很好奇发射管的输出功率怎么样? 图中,自左至右为阴极、第一栅、第二栅、栅极阴极组装件和装入阳极后的整个管子。 +我很好奇发射管的输出功率怎么样? 发射管 +我很好奇发射管的输出功率怎么样? 中小功率发射管多采用间热式氧化物阴极。 +我很好奇发射管的输出功率怎么样? 大功率发射管一般采用碳化钍钨丝阴极,有螺旋、直条或网笼等结构形式。 +我很好奇发射管的输出功率怎么样? 图为网笼式阴极。 +我很好奇发射管的输出功率怎么样? 栅极多用钼丝或钨丝绕制,或用钼片经电加工等方法制造。 +我很好奇发射管的输出功率怎么样? 栅极表面经镀金(或铂)或涂敷锆粉等处理,以降低栅极电子发射,使发射管稳定工作。 +我很好奇发射管的输出功率怎么样? 用气相沉积方法制造的石墨栅极,具有良好的性能。 +我很好奇发射管的输出功率怎么样? 发射管阳极直流输入功率转化为高频输出功率的部分约为75%,其余25%成为阳极热损耗,因此对发射管的阳极必须进行冷却。 +我很好奇发射管的输出功率怎么样? 中小功率发射管的阳极采取自然冷却方式,用镍、钼或石墨等材料制造,装在管壳之内,工作温度可达 600℃。 +我很好奇发射管的输出功率怎么样? 大功率发射管的阳极都用铜制成,并作为真空密封管壳的一部分,采用各种强制冷却方式。 +我很好奇发射管的输出功率怎么样? 各种冷却方式下每平方厘米阳极内表面的散热能力为:水冷100瓦;风冷30瓦;蒸发冷却250瓦;超蒸发冷却1000瓦以上,80年代已制成阳极损耗功率为1250千瓦的超蒸发冷却发射管。 +我很好奇发射管的输出功率怎么样? 发射管也常以冷却方式命名,如风冷发射管、水冷发射管和蒸发冷却发射管。 +我很好奇发射管的输出功率怎么样? 发射管管壳用玻璃或陶瓷制造。 +我很好奇发射管的输出功率怎么样? 小功率发射管内使用含钡的吸气剂;大功率发射管则采用锆、钛、钽等吸气材料,管内压强约为10帕量级。 +我很好奇发射管的输出功率怎么样? 发射管寿命取决于阴极发射电子的能力。 +我很好奇发射管的输出功率怎么样? 大功率发射管寿命最高记录可达8万小时。 +我很好奇发射管的输出功率怎么样? 发射四极管的放大作用和输出输入电路间的隔离效果优于三极管,应用最广。 +我很好奇发射管的输出功率怎么样? 工业高频振荡器普遍采用三极管。 +我很好奇发射管的输出功率怎么样? 五极管多用在小功率范围中。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 鲁能领秀城中央公园位于鲁能领秀城景观中轴之上,总占地161.55亩,总建筑面积约40万平米,容积率为2.70,由22栋小高层、高层组成;其绿地率高达35.2%,环境优美,产品更加注重品质化、人性化和自然生态化,是鲁能领秀城的生态人居典范。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 中央公园[1] 学区准现房,坐享鲁能领秀城成熟配套,成熟生活一步到位。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 经典板式小高层,103㎡2+1房仅22席,稀市推出,错过再无;92㎡经典两房、137㎡舒适三房压轴登场! +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 物业公司: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 济南凯瑞物业公司;深圳长城物业公司;北京盛世物业有限公司 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 绿化率: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 42% +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 容积率: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 2.70 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 暖气: +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 集中供暖 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 楼座展示:中央公园由22栋小高层、高层组成,3、16、17号楼分别是11层小高层,18层和28层的高层。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 4号楼是23层,2梯3户。 +鲁能领秀城中央公园有23层,2梯3户的是几号楼? 项目位置: +鬼青蛙在哪里有收录详情? 鬼青蛙这张卡可以从手卡把这张卡以外的1只水属性怪兽丢弃,从手卡特殊召唤。 +鬼青蛙在哪里有收录详情? 这张卡召唤·反转召唤·特殊召唤成功时,可以从自己的卡组·场上选1只水族·水属性·2星以下的怪兽送去墓地。 +鬼青蛙在哪里有收录详情? 此外,1回合1次,可以通过让自己场上1只怪兽回到手卡,这个回合通常召唤外加上只有1次,自己可以把「鬼青蛙」以外的1只名字带有「青蛙」的怪兽召唤。[1] +鬼青蛙在哪里有收录详情? 游戏王卡包收录详情 +鬼青蛙在哪里有收录详情? [09/09/18] +西湖区有多大? 西湖区是江西省南昌市市辖区。 +西湖区有多大? 为南昌市中心城区之一,有着2200多年历史,是一个物华天宝、人杰地灵的古老城区。 +西湖区有多大? 2004年南昌市老城区区划调整后,西湖区东起京九铁路线与青山湖区毗邻,南以洪城路东段、抚河路南段、象湖以及南隔堤为界与青云谱区、南昌县接壤,西凭赣江中心线与红谷滩新区交界,北沿中山路、北京西路与东湖区相连,所辖面积34.5平方公里,常住人口43万,管辖1个镇、10个街道办事处,设12个行政村、100个社区。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖原为汉代豫章群古太湖的一部分,唐贞元15年(公元799年)洪恩桥的架设将东太湖分隔成东西两部分,洪恩桥以西谓之西湖,西湖区由此而得名。 +西湖区有多大? 西湖区在1926年南昌设市后分别称第四、五部分,六、七部分。 +西湖区有多大? 1949年解放初期分别称第三、四区。 +西湖区有多大? 1955年分别称抚河区、西湖区。 +西湖区有多大? 1980年两区合并称西湖区。[1] +西湖区有多大? 辖:西湖街道、丁公路街道、广外街道、系马桩街道、绳金塔街道、朝阳洲街道、禾草街街道、十字街街道、瓦子角街道、三眼井街道、上海路街道、筷子巷街道、南站街道。[1] +西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 +西湖区有多大? 2002年12月1日设立桃源街道。 +西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。[1] +西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 +西湖区有多大? 2002年12月1日设立桃源街道。 +西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。 +西湖区有多大? 2004年9月7日,国务院批准(国函[2004]70号)调整南昌市市辖区部分行政区划:将西湖区朝阳洲街道的西船居委会划归东湖区管辖。 +西湖区有多大? 将青山湖区的桃花镇和湖坊镇的同盟村划归西湖区管辖。 +西湖区有多大? 
将西湖区十字街街道的谷市街、洪城路、南关口、九四、新丰5个居委会,上海路街道的草珊瑚集团、南昌肠衣厂、电子计算机厂、江西涤纶厂、江地基础公司、曙光、商标彩印厂、南昌市染整厂、江南蓄电池厂、四机床厂、二进、国乐新村12个居委会,南站街道的解放西路东居委会划归青云谱区管辖。 +西湖区有多大? 将西湖区上海路街道的轻化所、洪钢、省人民检察院、电信城东分局、安康、省机械施工公司、省水利设计院、省安装公司、南方电动工具厂、江西橡胶厂、上海路北、南昌电池厂、东华计量所、南昌搪瓷厂、上海路新村、华安针织总厂、江西五金厂、三波电机厂、水文地质大队、二六○厂、省卫生学校、新世纪、上海路住宅区北、塔子桥北、南航、上海路住宅区南、沿河、南昌阀门厂28个居委会,丁公路街道的新魏路、半边街、师大南路、顺化门、岔道口东路、师大、广电厅、手表厂、鸿顺9个居委会,南站街道的工人新村北、工人新村南、商苑、洪都中大道、铁路第三、铁路第四、铁路第六7个居委会划归青山湖区管辖。 +西湖区有多大? 调整后,西湖区辖绳金塔、桃源、朝阳洲、广润门、南浦、西湖、系马桩、十字街、丁公路、南站10个街道和桃花镇,区人民政府驻孺子路。 +西湖区有多大? 调整前,西湖区面积31平方千米,人口52万。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖区位于江西省省会南昌市的中心地带,具有广阔的发展空间和庞大的消费群体,商贸旅游、娱乐服务业等到各个行业都蕴藏着无限商机,投资前景十分广阔。 +西湖区有多大? 不仅水、电价格低廉,劳动力资源丰富,人均工资和房产价格都比沿海城市低,城区拥有良好的人居环境、低廉的投资成本,巨大的发展潜力。 +西湖区有多大? 105、316、320国道和京九铁路贯穿全境,把南北东西交通连成一线;民航可与上海、北京、广州、深圳、厦门、温州等到地通航,并开通了南昌-新加坡第一条国际航线;水运依托赣江可直达长江各港口;邮电通讯便捷,程控电话、数字微波、图文传真进入国际通讯网络;商检、海关、口岸等涉外机构齐全;水、电、气供应充足。 +西湖区有多大? (图)西湖区[南昌市] +西湖区有多大? 西湖区,是江西省省会南昌市的中心城区,面积34.8平方公里,常住人口51.9万人,辖桃花镇、朝农管理处及10个街道,设13个行政村,116个社区居委会,20个家委会。[2] +西湖区有多大? 2005年11月16日,南昌市《关于同意西湖区桃花镇、桃源、十字街街道办事处行政区划进行调整的批复》 +西湖区有多大? 1、同意将桃花镇的三道闸居委会划归桃源街道办事处管辖。 +青藏虎耳草花期什么时候? 青藏虎耳草多年生草本,高4-11.5厘米,丛生。 +青藏虎耳草花期什么时候? 花期7-8月。 +青藏虎耳草花期什么时候? 分布于甘肃(祁连山地)、青海(黄南、海南、海北)和西藏(加查)。 +青藏虎耳草花期什么时候? 生于海拔3 700-4 250米的林下、高山草甸和高山碎石隙。[1] +青藏虎耳草花期什么时候? 多年生草本,高4-11.5厘米,丛生。 +青藏虎耳草花期什么时候? 茎不分枝,具褐色卷曲柔毛。 +青藏虎耳草花期什么时候? 基生叶具柄,叶片卵形、椭圆形至长圆形,长15-25毫米,宽4-8毫米,腹面无毛,背面和边缘具褐色卷曲柔毛,叶柄长1-3厘米,基部扩大,边缘具褐色卷曲柔毛;茎生叶卵形至椭圆形,长1.5-2厘米,向上渐变小。 +青藏虎耳草花期什么时候? 聚伞花序伞房状,具2-6花;花梗长5-19毫米,密被褐色卷曲柔毛;萼片在花期反曲,卵形至狭卵形,长2.5-4.2毫米,宽1.5-2毫米,先端钝,两面无毛,边缘具褐色卷曲柔毛,3-5脉于先端不汇合;花瓣腹面淡黄色且其中下部具红色斑点,背面紫红色,卵形、狭卵形至近长圆形,长2.5-5.2毫米,宽1.5-2.1毫米,先端钝,基部具长0.5-1毫米之爪,3-5(-7)脉,具2痂体;雄蕊长2-3.6毫米,花丝钻形;子房半下位,周围具环状花盘,花柱长1-1.5毫米。 +青藏虎耳草花期什么时候? 生于高山草甸、碎石间。 +青藏虎耳草花期什么时候? 分布青海、西藏、甘肃、四川等地。 +青藏虎耳草花期什么时候? [1] +青藏虎耳草花期什么时候? 顶峰虎耳草Saxifraga cacuminum Harry Sm. +青藏虎耳草花期什么时候? 对叶虎耳Saxifraga contraria Harry Sm. +青藏虎耳草花期什么时候? 狭瓣虎耳草Saxifraga pseudohirculus Engl. +青藏虎耳草花期什么时候? 唐古特虎耳草Saxifraga tangutica Engl. +青藏虎耳草花期什么时候? 宽叶虎耳草(变种)Saxifraga tangutica Engl. var. platyphylla (Harry Sm.) J. T. Pan +青藏虎耳草花期什么时候? 唐古特虎耳草(原变种)Saxifraga tangutica Engl. var. tangutica +青藏虎耳草花期什么时候? 西藏虎耳草Saxifraga tibetica Losinsk.[1] +青藏虎耳草花期什么时候? Saxifraga przewalskii Engl. in Bull. Acad. Sci. St. -Petersb. 29:115. 1883: Engl et Irmsch. in Bot. Jahrb. 48:580. f. 5E-H. 1912 et in Engl. Pflanzenr. 67(IV. 117): 107. f. 21 E-H. 1916; J. T. Pan in Acta Phytotax. Sin. 16(2): 16. 1978;中国高等植物图鉴补编2: 30. 1983; 西藏植物志 2: 483. 1985. [1] +生产一支欧文冲锋枪需要多少钱? 欧文冲锋枪 Owen Gun 1945年,在新不列颠手持欧文冲锋枪的澳大利亚士兵 类型 冲锋枪 原产国 ?澳大利亚 服役记录 服役期间 1941年-1960年代 用户 参见使用国 参与战役 第二次世界大战 马来亚紧急状态 朝鲜战争 越南战争 1964年罗德西亚布什战争 生产历史 研发者 伊夫林·欧文(Evelyn Owen) 研发日期 1931年-1939年 生产商 约翰·莱萨特工厂 利特高轻武器工厂 单位制造费用 $ 30/枝 生产日期 1941年-1945年 制造数量 45,000-50,000 枝 衍生型 Mk 1/42 Mk 1/43 Mk 2/43 基本规格 总重 空枪: Mk 1/42:4.24 千克(9.35 磅) Mk 1/43:3.99 千克(8.8 磅) Mk 2/43:3.47 千克(7.65 磅) 全长 806 毫米(31.73 英吋) 枪管长度 247 毫米(9.72 英吋) 弹药 制式:9 × 19 毫米 原型:.38/200 原型:.45 ACP 口径 9 × 19 毫米:9 毫米(.357 英吋) .38/200:9.65 毫米(.38 英吋) .45 ACP:11.43 毫米(.45 英吋) 枪管 1 根,膛线7 条,右旋 枪机种类 直接反冲作用 开放式枪机 发射速率 理论射速: Mk 1/42:700 发/分钟 Mk 1/43:680 发/分钟 Mk 2/43:600 发/分钟 实际射速:120 发/分钟 枪口初速 380-420 米/秒(1,246.72-1,377.95 英尺/秒) 有效射程 瞄具装定射程:91.44 米(100 码) 最大有效射程:123 米(134.51 码) 最大射程 200 米(218.72 码) 供弹方式 32/33 发可拆卸式弹匣 瞄准具型式 机械瞄具:向右偏置的觇孔式照门和片状准星 欧文冲锋枪(英语:Owen Gun,正式名称:Owen Machine Carbine,以下简称为“欧文枪”)是一枝由伊夫林·(埃沃)·欧文(英语:Evelyn (Evo) Owen)于1939年研制、澳大利亚的首枝冲锋枪,制式型发射9 × 19 毫米鲁格手枪子弹。 +生产一支欧文冲锋枪需要多少钱? 欧文冲锋枪是澳大利亚唯一设计和主要服役的二战冲锋枪,并从1943年由澳大利亚陆军所使用,直到1960年代中期。 +生产一支欧文冲锋枪需要多少钱? 
由新南威尔士州卧龙岗市出身的欧文枪发明者,伊夫林·欧文,在24岁时于1939年7月向悉尼维多利亚军营的澳大利亚陆军军械官员展示了他所设计的.22 LR口径“卡宾机枪”原型枪。 +生产一支欧文冲锋枪需要多少钱? 该枪却被澳大利亚陆军所拒绝,因为澳大利亚陆军在当时没有承认冲锋枪的价值。 +生产一支欧文冲锋枪需要多少钱? 随着战争的爆发,欧文加入了澳大利亚军队,并且成为一名列兵。 +生产一支欧文冲锋枪需要多少钱? 1940年9月,欧文的邻居,文森特·沃德尔(英语:Vincent Wardell),看到欧文家楼梯后面搁著一个麻布袋,里面放著一枝欧文枪的原型枪。 +生产一支欧文冲锋枪需要多少钱? 而文森特·沃德尔是坎布拉港的大型钢制品厂莱萨特公司的经理,他向欧文的父亲表明了他对其儿子的粗心大意感到痛心,但无论如何仍然解释了这款武器的历史。 +生产一支欧文冲锋枪需要多少钱? 沃德尔对欧文枪的简洁的设计留下了深刻的印象。 +生产一支欧文冲锋枪需要多少钱? 沃德尔安排欧文转调到陆军发明部(英语:Army Inventions Board),并重新开始在枪上的工作。 +生产一支欧文冲锋枪需要多少钱? 军队仍然持续地从负面角度查看该武器,但同时政府开始采取越来越有利的观点。 +生产一支欧文冲锋枪需要多少钱? 该欧文枪原型配备了装在顶部的弹鼓,后来让位给装在顶部的弹匣使用。 +生产一支欧文冲锋枪需要多少钱? 口径的选择亦花了一些时间去解决。 +生产一支欧文冲锋枪需要多少钱? 由于陆军有大批量的柯尔特.45 ACP子弹,它们决定欧文枪需要采用这种口径。 +生产一支欧文冲锋枪需要多少钱? 直到在1941年9月19日官方举办试验时,约翰·莱萨特工厂制成了9 毫米、.38/200和.45 ACP三种口径版本。 +生产一支欧文冲锋枪需要多少钱? 而从美、英进口的斯登冲锋枪和汤普森冲锋枪在试验中作为基准使用。 +生产一支欧文冲锋枪需要多少钱? 作为测试的一部分,所有的枪支都浸没在泥浆里,并以沙土覆盖,以模拟他们将会被使用时最恶劣的环境。 +生产一支欧文冲锋枪需要多少钱? 欧文枪是唯一在这测试中这样对待以后仍可正常操作的冲锋枪。 +生产一支欧文冲锋枪需要多少钱? 虽然测试表现出欧文枪具有比汤普森冲锋枪和司登冲锋枪更优秀的可靠性,陆军没有对其口径作出决定。 +生产一支欧文冲锋枪需要多少钱? 结果它在上级政府干预以后,陆军才下令9 毫米的衍生型为正式口径,并在1941年11月20日正式被澳大利亚陆军采用。 +生产一支欧文冲锋枪需要多少钱? 在欧文枪的寿命期间,其可靠性在澳大利亚部队中赢得了“军人的至爱”(英语:Digger's Darling)的绰号,亦有人传言它受到美军高度青睐。 +生产一支欧文冲锋枪需要多少钱? 欧文枪是在1942年开始正式由坎布拉港和纽卡斯尔的约翰·莱萨特工厂投入生产,在生产高峰期每个星期生产800 支。 +生产一支欧文冲锋枪需要多少钱? 1942年3月至1943年2月之间,莱萨特生产了28,000 枝欧文枪。 +生产一支欧文冲锋枪需要多少钱? 然而,最初的一批弹药类型竟然是错误的,以至10,000 枝欧文枪无法提供弹药。 +生产一支欧文冲锋枪需要多少钱? 政府再一次推翻军方的官僚主义作风??,并让弹药通过其最后的生产阶段,以及运送到当时在新几内亚与日军战斗的澳大利亚部队的手中。 +生产一支欧文冲锋枪需要多少钱? 在1941年至1945年间生产了约50,000 枝欧文枪。 +生产一支欧文冲锋枪需要多少钱? 在战争期间,欧文枪的平均生产成本为$ 30。[1] +生产一支欧文冲锋枪需要多少钱? 虽然它是有点笨重,因为其可靠性,欧文枪在士兵当中变得非常流行。 +生产一支欧文冲锋枪需要多少钱? 它是如此成功,它也被新西兰、英国和美国订购。[2] +生产一支欧文冲锋枪需要多少钱? 欧文枪后来也被澳大利亚部队在朝鲜战争和越南战争,[3]特别是步兵组的侦察兵。 +生产一支欧文冲锋枪需要多少钱? 这仍然是一枝制式的澳大利亚陆军武器,直到1960年代中期,它被F1冲锋枪所取代。 +第二届中国光伏摄影大赛因为什么政策而开始的? 光伏发电不仅是全球能源科技和产业发展的重要方向,也是我国具有国际竞争优势的战略性新兴产业,是我国保障能源安全、治理环境污染、应对气候变化的战略性选择。 +第二届中国光伏摄影大赛因为什么政策而开始的? 2013年7月以来,国家出台了《关于促进光伏产业健康发展的若干意见》等一系列政策,大力推进分布式光伏发电的应用,光伏发电有望走进千家万户,融入百姓民生。 +第二届中国光伏摄影大赛因为什么政策而开始的? 大赛主办方以此为契机,开启了“第二届中国光伏摄影大赛”的征程。 +悬赏任务有哪些类型? 悬赏任务,威客网站上一种任务模式,由雇主在威客网站发布任务,提供一定数额的赏金,以吸引威客们参与。 +悬赏任务有哪些类型? 悬赏任务数额一般在几十到几千不等,但也有几万甚至几十万的任务。 +悬赏任务有哪些类型? 主要以提交的作品的质量好坏作为中标标准,当然其中也带有雇主的主观喜好,中标人数较少,多为一个或几个,因此竞争激烈。 +悬赏任务有哪些类型? 大型悬赏任务赏金数额巨大,中标者也较多,但参与人也很多,对于身有一技之长的威客来讲,悬赏任务十分适合。 +悬赏任务有哪些类型? 悬赏任务的类型主要包括:设计类、文案类、取名类、网站类、编程类、推广类等等。 +悬赏任务有哪些类型? 每一类所适合的威客人群不同,报酬的多少也不同,比如设计类的报酬就比较高,一般都几百到几千,而推广类的计件任务报酬比较少,一般也就几块钱,但花费的时间很少,技术要求也很低。 +悬赏任务有哪些类型? 1.注册—登陆 +悬赏任务有哪些类型? 2.点击“我要发悬赏”—按照发布流程及提示提交任务要求。 +悬赏任务有哪些类型? 悬赏模式选择->网站托管赏金模式。 +悬赏任务有哪些类型? 威客网站客服稍后会跟发布者联系确认任务要求。 +悬赏任务有哪些类型? 3.没有问题之后就可以预付赏金进行任务发布。 +悬赏任务有哪些类型? 4.会员参与并提交稿件。 +悬赏任务有哪些类型? 5.发布者需要跟会员互动(每个提交稿件的会员都可以),解决问题,完善稿件,初步筛选稿件。 +悬赏任务有哪些类型? 6.任务发布期结束,进入选稿期(在筛选的稿件中选择最后满意的) +悬赏任务有哪些类型? 7.发布者不满意现有稿件可选定一个会员修改至满意为止,或者加价延期重新开放任务进行征稿。 +悬赏任务有哪些类型? (重复第六步)没有问题后进入下一步。 +悬赏任务有哪些类型? 8:中标会员交源文件给发布者—发布者确认—任务结束—网站将赏金付给中标会员。 +悬赏任务有哪些类型? 1、任务发布者自由定价,自由确定悬赏时间,自由发布任务要求,自主确定中标会员和中标方案。 +悬赏任务有哪些类型? 2、任务发布者100%预付任务赏金,让竞标者坚信您的诚意和诚信。 +悬赏任务有哪些类型? 3、任务赏金分配原则:任务一经发布,网站收取20%发布费,中标会员获得赏金的80%。 +悬赏任务有哪些类型? 4、每个任务最终都会选定至少一个作品中标,至少一个竞标者获得赏金。 +悬赏任务有哪些类型? 5、任务发布者若未征集到满意作品,可以加价延期征集,也可让会员修改,会员也可以删除任务。 +悬赏任务有哪些类型? 6、任务发布者自己所在组织的任何人均不能以任何形式参加自己所发布的任务,一经发现则视为任务发布者委托威客网按照网站规则选稿。 +悬赏任务有哪些类型? 7、任务悬赏总金额低于100元(含100元)的任务,悬赏时间最多为7天。 +悬赏任务有哪些类型? 所有任务最长时间不超过30天(特殊任务除外),任务总金额不得低于50元。 +悬赏任务有哪些类型? 8、网赚类、注册类任务总金额不能低于300元人民币,计件任务每个稿件的平均单价不能低于1元人民币。 +悬赏任务有哪些类型? 9、延期任务只有3次加价机会,第1次加价不得低于任务金额的10%,第2次加价不得低于任务总金额的20%,第3次不得低于任务总金额的50%。 +悬赏任务有哪些类型? 每次延期不能超过15天,加价金额不低于50元,特殊任务可以适当加长。 +悬赏任务有哪些类型? 如果为计件任务,且不是网赚类任务,将免费延期,直至征集完规定数量的作品为止。 +悬赏任务有哪些类型? 10、如果威客以交接源文件要挟任务发布者,威客网将扣除威客相关信用值,并取消其中标资格,同时任务将免费延长相应的时间继续征集作品 。 +江湖令由哪些平台运营? 
《江湖令》是以隋唐时期为背景的RPG角色扮演类网页游戏。 +江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作。 +江湖令由哪些平台运营? 由ya247平台、91wan游戏平台、2918、4399游戏平台、37wan、6711、兄弟玩网页游戏平台,49you、Y8Y9平台、8090游戏等平台运营的,由07177游戏网发布媒体资讯的网页游戏。 +江湖令由哪些平台运营? 网页游戏《江湖令》由51游戏社区运营,是以隋唐时期为背景的RPG角色扮演类网页游戏。 +江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作… +江湖令由哪些平台运营? 背景故事: +江湖令由哪些平台运营? 隋朝末年,隋炀帝暴政,天下民不聊生,义军四起。 +江湖令由哪些平台运营? 在这动荡的时代中,百姓生活苦不堪言,多少人流离失所,家破人亡。 +江湖令由哪些平台运营? 天下三大势力---飞羽营、上清宫、侠隐岛,也值此机会扩张势力,派出弟子出来行走江湖。 +江湖令由哪些平台运营? 你便是这些弟子中的普通一员,在这群雄并起的年代,你将如何选择自己的未来。 +江湖令由哪些平台运营? 所有的故事,便从瓦岗寨/江都大营开始…… +江湖令由哪些平台运营? 势力: +江湖令由哪些平台运营? ①、飞羽营:【外功、根骨】 +江湖令由哪些平台运营? 南北朝时期,由北方政权创立的一个民间军事团体,经过多年的发展,逐渐成为江湖一大势力。 +江湖令由哪些平台运营? ②、上清宫:【外功、身法】 +江湖令由哪些平台运营? 道家圣地,宫中弟子讲求清静无为,以一种隐世的方式修炼,但身在此乱世,亦也不能独善其身。 +江湖令由哪些平台运营? ③、侠隐岛:【根骨、内力】 +江湖令由哪些平台运营? 位于偏远海岛上的一个世家,岛内弟子大多武功高强,但从不进入江湖行走,适逢乱世,现今岛主也决意作一翻作为。 +江湖令由哪些平台运营? 两大阵营: +江湖令由哪些平台运营? 义军:隋唐末期,百姓生活苦不堪言,有多个有志之士组成义军,对抗当朝暴君,希望建立一个适合百姓安居乐业的天地。 +江湖令由哪些平台运营? 隋军:战争一起即天下打乱,隋军首先要镇压四起的义军,同时在内部慢慢改变现有的朝廷,让天下再次恢复到昔日的安定。 +江湖令由哪些平台运营? 一、宠物品质 +江湖令由哪些平台运营? 宠物的品质分为:灵兽,妖兽,仙兽,圣兽,神兽 +江湖令由哪些平台运营? 二、宠物获取途径 +江湖令由哪些平台运营? 完成任务奖励宠物(其他途径待定)。 +江湖令由哪些平台运营? 三、宠物融合 +江湖令由哪些平台运营? 1、在主界面下方的【宠/骑】按钮进入宠物界面,再点击【融合】即可进入融合界面进行融合,在融合界面可选择要融合的宠物进行融合 +江湖令由哪些平台运营? 2、融合后主宠的形态不变; +江湖令由哪些平台运营? 3、融合后宠物的成长,品质,技能,经验,成长经验,等级都继承成长高的宠物; +江湖令由哪些平台运营? 4、融合宠物技能冲突,则保留成长值高的宠物技能,如果不冲突则叠加在空余的技能位置。 +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(土耳其文:Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称“土超”,也是土耳其足球联赛中最高级别。 +请问土耳其足球超级联赛是什么时候成立的? 目前,土超联赛队伍共有18支。 +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛 +请问土耳其足球超级联赛是什么时候成立的? 运动项目 足球 +请问土耳其足球超级联赛是什么时候成立的? 成立年份 1959年 +请问土耳其足球超级联赛是什么时候成立的? 参赛队数 18队 +请问土耳其足球超级联赛是什么时候成立的? 国家 土耳其 +请问土耳其足球超级联赛是什么时候成立的? 现任冠军 费内巴切足球俱乐部(2010-2011) +请问土耳其足球超级联赛是什么时候成立的? 夺冠最多队伍 费内巴切足球俱乐部(18次) +请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称「土超」,也是土耳其足球联赛中最高级别。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛队伍共有18支。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立于1959年,成立之前土耳其国有多个地区性联赛。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立后便把各地方联赛制度统一起来。 +请问土耳其足球超级联赛是什么时候成立的? 一般土超联赛由八月开始至五月结束,12月至1月会有歇冬期。 +请问土耳其足球超级联赛是什么时候成立的? 十八支球队会互相对叠,各有主场和作客两部分,采计分制。 +请问土耳其足球超级联赛是什么时候成立的? 联赛枋最底的三支球队会降到土耳其足球甲级联赛作赛。 +请问土耳其足球超级联赛是什么时候成立的? 由2005-06年球季起,土超联赛的冠、亚军会取得参加欧洲联赛冠军杯的资格。 +请问土耳其足球超级联赛是什么时候成立的? 成立至今土超联赛乃由两支著名球会所垄断──加拉塔萨雷足球俱乐部和费内巴切足球俱乐部,截至2009-2010赛季,双方各赢得冠军均为17次。 +请问土耳其足球超级联赛是什么时候成立的? 土超联赛共有18支球队,采取双循环得分制,每场比赛胜方得3分,负方0分,平局双方各得1分。 +请问土耳其足球超级联赛是什么时候成立的? 如果两支球队积分相同,对战成绩好的排名靠前,其次按照净胜球来决定;如果有三支以上的球队分数相同,则按照以下标准来确定排名:1、几支队伍间对战的得分,2、几支队伍间对战的净胜球数,3、总净胜球数。 +请问土耳其足球超级联赛是什么时候成立的? 联赛第1名直接参加下个赛季冠军杯小组赛,第2名参加下个赛季冠军杯资格赛第三轮,第3名进入下个赛季欧洲联赛资格赛第三轮,第4名进入下个赛季欧洲联赛资格赛第二轮,最后三名降入下个赛季的土甲联赛。 +请问土耳其足球超级联赛是什么时候成立的? 该赛季的土耳其杯冠军可参加下个赛季欧洲联赛资格赛第四轮,如果冠军已获得冠军杯资格,则亚军可参加下个赛季欧洲联赛资格赛第四轮,否则名额递补给联赛。 +请问土耳其足球超级联赛是什么时候成立的? 2010年/2011年 费内巴切 +请问土耳其足球超级联赛是什么时候成立的? 2009年/2010年 布尔萨体育(又译贝莎) +请问土耳其足球超级联赛是什么时候成立的? 2008年/2009年 贝西克塔斯 +请问土耳其足球超级联赛是什么时候成立的? 2007年/2008年 加拉塔萨雷 +请问土耳其足球超级联赛是什么时候成立的? 2006年/2007年 费内巴切 +请问土耳其足球超级联赛是什么时候成立的? 2005年/2006年 加拉塔沙雷 +请问土耳其足球超级联赛是什么时候成立的? 2004年/2005年 费内巴切(又译费伦巴治) +请问土耳其足球超级联赛是什么时候成立的? 2003年/2004年 费内巴切 +cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 +cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 +cid 作Customer IDentity解时是什么意思? ? (英)刑事调查局,香港警察的重案组 +cid 作Customer IDentity解时是什么意思? ? 
Criminal Investigation Department +cid 作Customer IDentity解时是什么意思? ? 佩枪: +cid 作Customer IDentity解时是什么意思? ? 香港警察的CID(刑事侦缉队),各区重案组的探员装备短管点38左轮手枪,其特点是便于收藏,而且不容易卡壳,重量轻,其缺点是装弹量少,只有6发,而且换子弹较慢,威力也一般,如果碰上54式手枪或者M9手枪明显处于下风。 +cid 作Customer IDentity解时是什么意思? ? 香港警察的“刑事侦查”(Criminal Investigation Department)部门,早于1983年起已经不叫做C.I.D.的了,1983年香港警察队的重整架构,撤销了C.I.D. ( Criminal Investigation Dept.) “刑事侦缉处”,将“刑事侦查”部门归入去“行动处”内,是“行动处”内的一个分支部门,叫“刑事部”( Crime Wing )。 +cid 作Customer IDentity解时是什么意思? ? 再于90年代的一次警队重整架构,香港警队成立了新的「刑事及保安处」,再将“刑事侦查”部门归入目前的「刑事及保安处」的“处”级单位,是归入这个“处”下的一个部门,亦叫“刑事部” ( Crime Wing ),由一个助理警务处长(刑事)领导。 +cid 作Customer IDentity解时是什么意思? ? 但是时至今天,CID虽已经是一个老旧的名称,香港市民、甚至香港警察都是习惯性的沿用这个历史上的叫法 . +cid 作Customer IDentity解时是什么意思? ? CID格式是美国Adobe公司发表的最新字库格式,它具有易扩充、速度快、兼容性好、简便、灵活等特点,已成为国内开发中文字库的热点,也为用户使用字库提供质量更好,数量更多的字体。 +cid 作Customer IDentity解时是什么意思? ? CID (Character identifier)就是字符识别码,在组成方式上分成CIDFont,CMap表两部分。 +cid 作Customer IDentity解时是什么意思? ? CIDFont文件即总字符集,包括了一种特定语言中所有常用的字符,把这些字符排序,它们在总字符集中排列的次序号就是各个字符的CID标识码(Index);CMap(Character Map)表即字符映像文件,将字符的编码(Code)映像到字符的CID标识码(Index)。 +cid 作Customer IDentity解时是什么意思? ? CID字库完全针对大字符集市场设计,其基本过程为:先根据Code,在CMap表查到Index,然后在CIDFont文件找到相应的字形数据。 +本町位于什么地方? 本条目记述台湾日治时期,各都市之本町。 +本町位于什么地方? 为台湾日治时期台北市之行政区,共分一~四丁目,在表町之西。 +本町位于什么地方? 以现在的位置来看,本町位于现台北市中正区的西北角,约位于忠孝西路一段往西至台北邮局东侧。 +本町位于什么地方? 再向南至开封街一段,沿此路线向西至开封街一段60号,顺60号到汉口街一段向东到现在华南银行总行附近画一条直线到衡阳路。 +本町位于什么地方? 再向东至重庆南路一段,由重庆南路一段回到原点这个范围内。 +本町位于什么地方? 另外,重庆南路一段在当时名为“本町通”。 +本町位于什么地方? 此地方自日治时期起,就是繁华的商业地区,当时也有三和银行、台北专卖分局、日本石油等重要商业机构。 +本町位于什么地方? 其中,专卖分局是战后二二八事件的主要起始点。 +本町位于什么地方? 台湾贮蓄银行(一丁目) +本町位于什么地方? 三和银行(二丁目) +本町位于什么地方? 专卖局台北分局(三丁目) +本町位于什么地方? 日本石油(四丁目) +本町位于什么地方? 为台湾日治时期台南市之行政区。 +本町位于什么地方? 范围包括清代旧街名枋桥头前、枋桥头后、鞋、草花、天公埕、竹仔、下大埕、帽仔、武馆、统领巷、大井头、内宫后、内南町。 +本町位于什么地方? 为清代台南城最繁华的区域。 +本町位于什么地方? 台南公会堂 +本町位于什么地方? 北极殿 +本町位于什么地方? 开基武庙 +本町位于什么地方? 町名改正 +本町位于什么地方? 这是一个与台湾相关的小作品。 +本町位于什么地方? 你可以通过编辑或修订扩充其内容。 +《行走的观点:埃及》的条形码是多少? 出版社: 上海社会科学院出版社; 第1版 (2006年5月1日) +《行走的观点:埃及》的条形码是多少? 丛书名: 时代建筑视觉旅行丛书 +《行走的观点:埃及》的条形码是多少? 条形码: 9787806818640 +《行走的观点:埃及》的条形码是多少? 尺寸: 18 x 13.1 x 0.7 cm +《行走的观点:埃及》的条形码是多少? 重量: 181 g +《行走的观点:埃及》的条形码是多少? 漂浮在沙与海市蜃楼之上的金字塔曾经是否是你的一个梦。 +《行走的观点:埃及》的条形码是多少? 埃及,这片蕴蓄了5000年文明的土地,本书为你撩开它神秘的纱。 +《行走的观点:埃及》的条形码是多少? 诸神、金字塔、神庙、狮身人面像、法老、艳后吸引着我们的注意力;缠绵悱恻的象形文字、医学、雕刻等留给我们的文明,不断引发我们对古代文明的惊喜和赞叹。 +《行走的观点:埃及》的条形码是多少? 尼罗河畔的奇异之旅,数千年的古老文明,尽收在你的眼底…… +《行走的观点:埃及》的条形码是多少? 本书集历史、文化、地理等知识于一体,并以优美、流畅文笔,简明扼要地阐述了埃及的地理环境、政治经济、历史沿革、文化艺术,以大量富有艺术感染力的彩色照片,生动形象地展示了埃及最具特色的名胜古迹、风土人情和自然风光。 +《行走的观点:埃及》的条形码是多少? 古埃及历史 +老挝人民军的工兵部队有几个营? 老挝人民军前身为老挝爱国战线领导的“寮国战斗部队”(即“巴特寮”),始建于1949年1月20日,1965年10月改名为老挝人民解放军,1982年7月改称现名。 +老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会,朱马里·赛雅颂任主席,隆再·皮吉任国防部长。 +老挝人民军的工兵部队有几个营? 实行义务兵役制,服役期最少18个月。[1] +老挝人民军的工兵部队有几个营? ?老挝军队在老挝社会中有较好的地位和保障,工资待遇比地方政府工作人员略高。 +老挝人民军的工兵部队有几个营? 武装部队总兵力约6万人,其中陆军约5万人,主力部队编为5个步兵师;空军2000多人;海军(内河巡逻部队)1000多人;部队机关院校5000人。[1] +老挝人民军的工兵部队有几个营? 老挝人民军军旗 +老挝人民军的工兵部队有几个营? 1991年8月14日通过的《老挝人民民主共和国宪法》第11条规定:国家执行保卫国防和维护社会安宁的政策。 +老挝人民军的工兵部队有几个营? 全体公民和国防力量、治安力量必须发扬忠于祖国、忠于人民的精神,履行保卫革命成果、保卫人民生命财产及和平劳动的任务,积极参加国家建设事业。 +老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会。 +老挝人民军的工兵部队有几个营? 主席由老挝人民革命党中央委员会总书记兼任。 +老挝人民军的工兵部队有几个营? 老挝陆军成立最早,兵力最多,约有5万人。 +老挝人民军的工兵部队有几个营? 其中主力部队步兵师5个、7个独立团、30多个营、65个独立连。 +老挝人民军的工兵部队有几个营? 地方部队30余个营及县属部队。 +老挝人民军的工兵部队有几个营? 地面炮兵2个团,10多个营。 +老挝人民军的工兵部队有几个营? 高射炮兵1个团9个营。 +老挝人民军的工兵部队有几个营? 导弹部队2个营。 +老挝人民军的工兵部队有几个营? 装甲兵7个营。 +老挝人民军的工兵部队有几个营? 特工部队6个营。 +老挝人民军的工兵部队有几个营? 通讯部队9个营。 +老挝人民军的工兵部队有几个营? 工兵部队6个营。 +老挝人民军的工兵部队有几个营? 基建工程兵2个团13个营。 +老挝人民军的工兵部队有几个营? 运输部队7个营。 +老挝人民军的工兵部队有几个营? 陆军的装备基本是中国和前苏联援助的装备和部分从抗美战争中缴获的美式装备。 +老挝人民军的工兵部队有几个营? 老挝内河部队总兵力约1700人,装备有内河船艇110多艘,编成4个艇队。 +老挝人民军的工兵部队有几个营? 
有芒宽、巴能、纳坎、他曲、南盖、巴色等8个基地。 +老挝人民军的工兵部队有几个营? 空军于1975年8月组建,现有2个团、11个飞行大队,总兵力约2000人。 +老挝人民军的工兵部队有几个营? 装备有各种飞机140架,其中主要由前苏联提供和从万象政权的皇家空军手中接管。 +老挝人民军的工兵部队有几个营? 随着军队建设质量的提高,老挝人民军对外军事合作步伐也日益扩大,近年来先后与俄罗斯、印度、马来西亚、越南、菲律宾等国拓展了军事交流与合作的内容。 +老挝人民军的工兵部队有几个营? 2003年1月,印度决定向老挝援助一批军事装备和物资,并承诺提供技术帮助。 +老挝人民军的工兵部队有几个营? 2003年6月,老挝向俄罗斯订购了一批新式防空武器;2003年4月,老挝与越南签署了越南帮助老挝培训军事指挥干部和特种部队以及完成军队通信系统改造等多项协议。 +《焚心之城》的主角是谁? 《焚心之城》[1] 为网络作家老子扛过枪创作的一部都市类小说,目前正在创世中文网连载中。 +《焚心之城》的主角是谁? 乡下大男孩薛城,是一个不甘于生活现状的混混,他混过、爱过、也深深地被伤害过。 +《焚心之城》的主角是谁? 本料此生当浑浑噩噩,拼搏街头。 +《焚心之城》的主角是谁? 高考的成绩却给了他一点渺茫的希望,二月后,大学如期吹响了他进城的号角。 +《焚心之城》的主角是谁? 繁华的都市,热血的人生,冷眼嘲笑中,他发誓再不做一个平常人! +《焚心之城》的主角是谁? 江北小城,黑河大地,他要行走过的每一个角落都有他的传说。 +《焚心之城》的主角是谁? 扯出一面旗,拉一帮兄弟,做男人,就要多一份担当,活一口傲气。 +《焚心之城》的主角是谁? (日期截止到2014年10月23日凌晨) +请问香港利丰集团是什么时候成立的? 香港利丰集团前身是广州的华资贸易 (1906 - 1949) ,利丰是香港历史最悠久的出口贸易商号之一。 +请问香港利丰集团是什么时候成立的? 于1906年,冯柏燎先生和李道明先生在广州创立了利丰贸易公司;是当时中国第一家华资的对外贸易出口商。 +请问香港利丰集团是什么时候成立的? 利丰于1906年创立,初时只从事瓷器及丝绸生意;一年之后,增添了其它的货品,包括竹器、藤器、玉石、象牙及其它手工艺品,包括烟花爆竹类别。 +请问香港利丰集团是什么时候成立的? 在早期的对外贸易,中国南方内河港因水深不足不能行驶远洋船,反之香港港口水深岸阔,占尽地利。 +请问香港利丰集团是什么时候成立的? 因此,在香港成立分公司的责任,落在冯柏燎先生的三子冯汉柱先生身上。 +请问香港利丰集团是什么时候成立的? 1937年12月28日,利丰(1937)有限公司正式在香港创立。 +请问香港利丰集团是什么时候成立的? 第二次世界大战期间,利丰暂停贸易业务。 +请问香港利丰集团是什么时候成立的? 1943年,随着创办人冯柏燎先生去世后,业务移交给冯氏家族第二代。 +请问香港利丰集团是什么时候成立的? 之后,向来不参与业务管理的合伙人李道明先生宣布退休,将所拥有的利丰股权全部卖给冯氏家族。 +请问香港利丰集团是什么时候成立的? 目前由哈佛冯家两兄弟William Fung , Victor Fung和CEO Bruce Rockowitz 管理。 +请问香港利丰集团是什么时候成立的? 截止到2012年,集团旗下有利亚﹝零售﹞有限公司、利和集团、利邦时装有限公司、利越时装有限公司、利丰贸易有限公司。 +请问香港利丰集团是什么时候成立的? 利亚(零售)连锁,业务包括大家所熟悉的:OK便利店、玩具〝反〞斗城和圣安娜饼屋;范围包括香港、台湾、新加坡、马来西亚、至中国大陆及东南亚其它市场逾600多家店 +请问香港利丰集团是什么时候成立的? 利和集团,IDS以专业物流服务为根基,为客户提供经销,物流,制造服务领域内的一系列服务项目。 +请问香港利丰集团是什么时候成立的? 业务网络覆盖大中华区,东盟,美国及英国,经营着90多个经销中心,在中国设有18个经销公司,10,000家现代经销门店。 +请问香港利丰集团是什么时候成立的? 利邦(上海)时装贸易有限公司为大中华区其中一家大型男士服装零售集团。 +请问香港利丰集团是什么时候成立的? 现在在中国大陆、香港、台湾和澳门收购经营11个包括Cerruti 1881,Gieves & Hawkes,Kent & curwen和D’urban 等中档到高档的男士服装品牌,全国有超过350间门店设于各一线城市之高级商场及百货公司。 +请问香港利丰集团是什么时候成立的? 利越(上海)服装商贸有限公司隶属于Branded Lifestyle,负责中国大陆地区LEO里奥(意大利)、GIBO捷宝(意大利)、UFFIZI古杰师(意大利)、OVVIO奥维路(意大利)、Roots绿适(加拿大,全球服装排名第四)品牌销售业务 +请问香港利丰集团是什么时候成立的? 利丰(贸易)1995年收购了英之杰采购服务,1999年收购太古贸易有限公司(Swire & Maclain) 和金巴莉有限公司(Camberley),2000年和2002年分别收购香港采购出口集团Colby Group及Janco Oversea Limited,大大扩张了在美国及欧洲的顾客群,自2008年经济危机起一直到现在,收购多家欧、美、印、非等地区的时尚品牌,如英国品牌Visage,仅2011年上半年6个月就完成26个品牌的收购。 +请问香港利丰集团是什么时候成立的? 2004年利丰与Levi Strauss & Co.签订特许经营协议 +请问香港利丰集团是什么时候成立的? 2005年利丰伙拍Daymon Worldwide为全球供应私有品牌和特许品牌 +请问香港利丰集团是什么时候成立的? 2006年收购Rossetti手袋业务及Oxford Womenswear Group 强化美国批发业务 +请问香港利丰集团是什么时候成立的? 2007年收购Tommy Hilfiher全球采购业务,收购CGroup、Peter Black International LTD、Regetta USA LLC和American Marketing Enterprice +请问香港利丰集团是什么时候成立的? 2008年收购Kent&Curwen全球特许经营权,收购Van Zeeland,Inc和Miles Fashion Group +请问香港利丰集团是什么时候成立的? 2009年收购加拿大休闲品牌Roots ,收购Wear Me Appearl,LLC。 +请问香港利丰集团是什么时候成立的? 与Hudson's Bay、Wolverine Worldwide Inc、Talbots、Liz Claiborne达成了采购协议 +请问香港利丰集团是什么时候成立的? 2010年收购Oxford apparel Visage Group LTD +请问香港利丰集团是什么时候成立的? 2011年一月收购土耳其Modium、美国女性时尚Beyond Productions,三月收购贸易公司Celissa 、玩具公司Techno Source USA, Inc.、卡通品牌产品TVMania和法国著名时装一线品牌Cerruti 1881,五月收购Loyaltex Apparel Ltd.、女装Hampshire Designers和英国彩妆Collection 2000,六月收购家私贸易Exim Designs Co., Ltd.,七月收购家庭旅行产业Union Rich USA, LLC和设计公司Lloyd Textile Fashion Company Limited,八月收购童装Fishman & Tobin和Crimzon Rose,九月收购家私贸易True Innovations, LLC、日用品企业Midway Enterprises和Wonderful World。 +请问香港利丰集团是什么时候成立的? 十二月与USPA – U.S. Polo Association签署授权协议。 +请问香港利丰集团是什么时候成立的? 利丰的精神:积极进取,不断认识并争取有利于客户和自身进步的机会;以行动为主导,对客户、供应商及职工的需求作出快速的决定。 +请问香港利丰集团是什么时候成立的? 利丰的最终目标:在产品采购、销售、流转的各环节建立全球性队伍提供多元化服务,利丰成员有效合作,共达目标。 +如何使魔兽变种akt不被查杀? 
Trojan/PSW.Moshou.akt“魔兽”变种akt是“魔兽”木马家族的最新成员之一,采用Delphi 6.0-7.0编写,并经过加壳处理。 +如何使魔兽变种akt不被查杀? “魔兽”变种akt运行后,自我复制到被感染计算机的指定目录下。 +如何使魔兽变种akt不被查杀? 修改注册表,实现木马开机自动运行。 +如何使魔兽变种akt不被查杀? 自我注入到被感染计算机的“explorer.exe”、“notepad.exe”等用户级权限的进程中加载运行,隐藏自我,防止被查杀。 +如何使魔兽变种akt不被查杀? 在后台秘密监视用户打开的窗口标题,盗取网络游戏《魔兽世界》玩家的游戏帐号、游戏密码、角色等级、装备信息、金钱数量等信息,并在后台将窃取到的玩家信息发送到骇客指定的远程服务器上,致使玩家游戏帐号、装备物品、金钱等丢失,给游戏玩家造成非常大的损失。 +丙种球蛋白能预防什么病情? 丙种球蛋白预防传染性肝炎,预防麻疹等病毒性疾病感染,治疗先天性丙种球蛋白缺乏症 ,与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 +丙种球蛋白能预防什么病情? 中文简称:“丙球” +丙种球蛋白能预防什么病情? 英文名称:γ-globulin、gamma globulin +丙种球蛋白能预防什么病情? 【别名】 免疫血清球蛋白,普通免疫球蛋白,人血丙种球蛋白,丙种球蛋白,静脉注射用人免疫球蛋白(pH4) +丙种球蛋白能预防什么病情? 注:由于人血中的免疫球蛋白大多数为丙种球蛋白(γ-球蛋白),有时丙种球蛋白也被混称为“免疫球蛋白”(immunoglobulin) 。 +丙种球蛋白能预防什么病情? 冻干制剂应为白色或灰白色的疏松体,液体制剂和冻干制剂溶解后,溶液应为接近无色或淡黄色的澄明液体,微带乳光。 +丙种球蛋白能预防什么病情? 但不应含有异物或摇不散的沉淀。 +丙种球蛋白能预防什么病情? 注射丙种球蛋白是一种被动免疫疗法。 +丙种球蛋白能预防什么病情? 它是把免疫球蛋白内含有的大量抗体输给受者,使之从低或无免疫状态很快达到暂时免疫保护状态。 +丙种球蛋白能预防什么病情? 由于抗体与抗原相互作用起到直接中和毒素与杀死细菌和病毒。 +丙种球蛋白能预防什么病情? 因此免疫球蛋白制品对预防细菌、病毒性感染有一定的作用[1]。 +丙种球蛋白能预防什么病情? 人免疫球蛋白的生物半衰期为16~24天。 +丙种球蛋白能预防什么病情? 1、丙种球蛋白[2]含有健康人群血清所具有的各种抗体,因而有增强机体抵抗力以预防感染的作用。 +丙种球蛋白能预防什么病情? 2、主要治疗先天性丙种球蛋白缺乏症和免疫缺陷病 +丙种球蛋白能预防什么病情? 3、预防传染性肝炎,如甲型肝炎和乙型肝炎等。 +丙种球蛋白能预防什么病情? 4、用于麻疹、水痘、腮腺炎、带状疱疹等病毒感染和细菌感染的防治 +丙种球蛋白能预防什么病情? 5、也可用于哮喘、过敏性鼻炎、湿疹等内源性过敏性疾病。 +丙种球蛋白能预防什么病情? 6、与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 +丙种球蛋白能预防什么病情? 7、川崎病,又称皮肤粘膜淋巴结综合征,常见于儿童,丙种球蛋白是主要的治疗药物。 +丙种球蛋白能预防什么病情? 1、对免疫球蛋白过敏或有其他严重过敏史者。 +丙种球蛋白能预防什么病情? 2、有IgA抗体的选择性IgA缺乏者。 +丙种球蛋白能预防什么病情? 3、发烧患者禁用或慎用。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (1997年9月1日浙江省第八届人民代表大会常务委员会第三十九次会议通过 1997年9月9日浙江省第八届人民代表大会常务委员会公告第六十九号公布自公布之日起施行) +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 为了保护人的生命和健康,发扬人道主义精神,促进社会发展与和平进步事业,根据《中华人民共和国红十字会法》,结合本省实际,制定本办法。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 本省县级以上按行政区域建立的红十字会,是中国红十字会的地方组织,是从事人道主义工作的社会救助团体,依法取得社会团体法人资格,设置工作机构,配备专职工作人员,依照《中国红十字会章程》独立自主地开展工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全省性行业根据需要可以建立行业红十字会,配备专职或兼职工作人员。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 街道、乡(镇)、机关、团体、学校、企业、事业单位根据需要,可以依照《中国红十字会章程》建立红十字会的基层组织。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 上级红十字会指导下级红十字会的工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上地方红十字会指导所在行政区域行业红十字会和基层红十字会的工作。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 人民政府对红十字会给予支持和资助,保障红十字会依法履行职责,并对其活动进行监督;红十字会协助人民政府开展与其职责有关的活动。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全社会都应当关心和支持红十字事业。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 本省公民和单位承认《中国红十字会章程》并缴纳会费的,可以自愿参加红十字会,成为红十字会的个人会员或团体会员。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员由本人申请,基层红十字会批准,发给会员证;团体会员由单位申请,县级以上红十字会批准,发给团体会员证。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员和团体会员应当遵守《中华人民共和国红十字会法》和《中国红十字会章程》,热心红十字事业,履行会员的义务,并享有会员的权利。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会理事会由会员代表大会民主选举产生。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 理事会民主选举产生会长和副会长;根据会长提名,决定秘书长、副秘书长人选。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会可以设名誉会长、名誉副会长和名誉理事,由同级红十字会理事会聘请。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 省、市(地)红十字会根据独立、平等、互相尊重的原则,发展同境外、国外地方红十字会和红新月会的友好往来和合作关系。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 红十字会履行下列职责: +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)宣传、贯彻《中华人民共和国红十字会法》和本办法; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)开展救灾的准备工作,筹措救灾款物;在自然灾害和突发事件中,对伤病人员和其他受害者进行救助; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)普及卫生救护和防病知识,进行初级卫生救护培训,对交通、电力、建筑、矿山等容易发生意外伤害的单位进行现场救护培训; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (四)组织群众参加现场救护; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 
(五)参与输血献血工作,推动无偿献血; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (六)开展红十字青少年活动; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (七)根据中国红十字会总会部署,参加国际人道主义救援工作; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (八)依照国际红十字和红新月运动的基本原则,完成同级人民政府和上级红十字会委托的有关事宜; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (九)《中华人民共和国红十宇会法》和《中国红十字会章程》规定的其他职责。 +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 第八条 红十字会经费的主要来源: +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)红十字会会员缴纳的会费; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)接受国内外组织和个人捐赠的款物; +浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)红十字会的动产、不动产以及兴办社会福利事业和经济实体的收入; +宝湖庭院绿化率多少? 建发·宝湖庭院位于银川市金凤区核心地带—正源南街与长城中路交汇处向东500米。 +宝湖庭院绿化率多少? 项目已于2012年4月开工建设,总占地约4.2万平方米,总建筑面积约11.2万平方米,容积率2.14,绿化率35%,预计可入住630户。 +宝湖庭院绿化率多少? “建发·宝湖庭院”是银川建发集团股份有限公司继“建发·宝湖湾”之后,在宝湖湖区的又一力作。 +宝湖庭院绿化率多少? 项目周边发展成熟,东有唐徕渠景观水道,西临银川市交通主干道正源街;南侧与宝湖湿地公园遥相呼应。 +宝湖庭院绿化率多少? “宝湖庭院”项目公共交通资源丰富:15路、21路、35路、38路、43路公交车贯穿银川市各地,出行便利。 +宝湖庭院绿化率多少? 距离新百良田购物广场约1公里,工人疗养院600米,宝湖公园1公里,唐徕渠景观水道500米。 +宝湖庭院绿化率多少? 项目位置优越,购物、餐饮、医疗、交通、休闲等生活资源丰富。[1] +宝湖庭院绿化率多少? 建发·宝湖庭院建筑及景观设置传承建发一贯“简约、大气”的风格:搂间距宽广,确保每一座楼宇视野开阔通透。 +宝湖庭院绿化率多少? 楼宇位置错落有置,外立面设计大气沉稳别致。 +宝湖庭院绿化率多少? 项目内部休闲绿地、景观小品点缀其中,道路及停车系统设计合理,停车及通行条件便利。 +宝湖庭院绿化率多少? 社区会所、幼儿园、活动室、医疗服务中心等生活配套一应俱全。 +宝湖庭院绿化率多少? 行政区域:金凤区 +大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔是荷兰“大黄鸭”之父弗洛伦泰因·霍夫曼打造的大型装置艺术作品,该作品首次亮相于台湾桃园大园乡海军基地,为了迎接中秋节的到来;在展览期间,海军基地也首次对外开放。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼觉得中国神话中捣杵的玉兔很有想象力,于是特别创作了“月兔”,这也是“月兔”新作第一次展出。[1] +大月兔(中秋艺术作品)的作者还有哪些代表作? ?2014年9月15日因工人施工不慎,遭火烧毁。[2] +大月兔(中秋艺术作品)的作者还有哪些代表作? “大月兔”外表采用的杜邦防水纸、会随风飘动,内部以木材加保丽龙框架支撑做成。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 兔毛用防水纸做成,材质完全防水,不怕日晒雨淋。[3 +大月兔(中秋艺术作品)的作者还有哪些代表作? -4] +大月兔(中秋艺术作品)的作者还有哪些代表作? 25米的“月兔”倚靠在机 +大月兔(中秋艺术作品)的作者还有哪些代表作? 堡上望着天空,像在思考又像赏月。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 月兔斜躺在机堡上,意在思考生命、边做白日梦,编织自己的故事。[3] +大月兔(中秋艺术作品)的作者还有哪些代表作? 台湾桃园大园乡海军基地也首度对外开放。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 428公顷的海军基地中,地景艺术节使用约40公顷,展场包括过去军机机堡、跑道等,由于这处基地过去警备森严,不对外开放,这次结合地景艺术展出,也可一窥过去是黑猫中队基地的神秘面纱。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月2日,桃园县政府文化局举行“踩线团”,让 +大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔 +大月兔(中秋艺术作品)的作者还有哪些代表作? 各项地景艺术作品呈现在媒体眼中,虽然“月兔”仍在进行最后的细节赶工,但横躺在机堡上的“月兔”雏形已经完工。[5] +大月兔(中秋艺术作品)的作者还有哪些代表作? “这么大”、“好可爱呦”是不少踩线团成员对“月兔”的直觉;尤其在蓝天的衬托及前方绿草的组合下,呈现犹如真实版的爱丽丝梦游仙境。[6] +大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼的作品大月兔,“从平凡中,创作出不平凡的视觉”,创造出观赏者打从心中油然而生的幸福感,拉近观赏者的距离。[6] +大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月15日早 +大月兔(中秋艺术作品)的作者还有哪些代表作? 上,施工人员要将月兔拆解,搬离海军基地草皮时,疑施工拆除的卡车,在拆除过程,故障起火,起火的卡车不慎延烧到兔子,造成兔子起火燃烧,消防队员即刻抢救,白色的大月兔立即变成焦黑的火烧兔。[7] +大月兔(中秋艺术作品)的作者还有哪些代表作? 桃园县府表示相当遗憾及难过,也不排除向包商求偿,也已将此事告知霍夫曼。[2] +大月兔(中秋艺术作品)的作者还有哪些代表作? ?[8] +大月兔(中秋艺术作品)的作者还有哪些代表作? 弗洛伦泰因·霍夫曼,荷兰艺术家,以在公共空间创作巨大造型 +大月兔(中秋艺术作品)的作者还有哪些代表作? 物的艺术项目见长。 +大月兔(中秋艺术作品)的作者还有哪些代表作? 代表作品包括“胖猴子”(2010年在巴西圣保罗展出)、“大黄兔”(2011年在瑞典厄勒布鲁展出)、粉红猫(2014年5月在上海亮相)、大黄鸭(Rubber Duck)、月兔等。 +英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual plc)成立于1845年,一直在伦敦证券交易所(伦敦证券交易所:OML)作第一上市,也是全球排名第32位(按营业收入排名)的保险公司(人寿/健康)。 +英国耆卫保险公司有多少保险客户? 公司是全球财富500强公司之一,也是被列入英国金融时报100指数的金融服务集团之一。 +英国耆卫保险公司有多少保险客户? Old Mutual 是一家国际金融服务公司,拥有近320万个保险客户,240万个银行储户,270,000个短期保险客户以及700,000个信托客户 +英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual)是一家国际金融服务公司,总部设在伦敦,主要为全球客户提供长期储蓄的解决方案、资产管理、短期保险和金融服务等,目前业务遍及全球34个国家。[1] +英国耆卫保险公司有多少保险客户? 主要包括人寿保险,资产管理,银行等。 +英国耆卫保险公司有多少保险客户? 1845年,Old Mutual在好望角成立。 +英国耆卫保险公司有多少保险客户? 1870年,董事长Charles Bell设计了Old Mutual公司的标记。 +英国耆卫保险公司有多少保险客户? 1910年,南非从英联邦独立出来。 +英国耆卫保险公司有多少保险客户? Old Mutual的董事长John X. Merriman被选为国家总理。 +英国耆卫保险公司有多少保险客户? 1927年,Old Mutual在Harare成立它的第一个事务所。 +英国耆卫保险公司有多少保险客户? 1960年,Old Mutual在南非成立了Mutual Unit信托公司,用来管理公司的信托业务。 +英国耆卫保险公司有多少保险客户? 1970年,Old Mutual的收入超过100百万R。 +英国耆卫保险公司有多少保险客户? 
1980年,Old Mutual成为南非第一大人寿保险公司,年收入达10亿R。 +英国耆卫保险公司有多少保险客户? 1991年,Old Mutual在美国财富周刊上评选的全球保险公司中名列第38位。 +英国耆卫保险公司有多少保险客户? 1995年,Old Mutual在美国波士顿建立投资顾问公司,同年、又在香港和Guernsey建立事务所。 +英国耆卫保险公司有多少保险客户? 作为一项加强与其母公司联系的举措,OMNIA公司(百慕大)荣幸的更名为Old Mutual 公司(百慕大) 。 +英国耆卫保险公司有多少保险客户? 这一新的名称和企业识别清晰地展示出公司成为其世界金融机构合作伙伴强有力支持的决心。 +英国耆卫保险公司有多少保险客户? 2003 年4月,该公司被Old Mutual plc公司收购,更名为Sage Life(百慕大)公司并闻名于世,公司为Old Mutual公司提供了一个新的销售渠道,补充了其现有的以美元计价的产品线和分销系统。 +英国耆卫保险公司有多少保险客户? 达到了一个重要里程碑是公司成功的一个例证: 2005年6月3日公司资产超过10亿美元成为公司的一个主要里程碑,也是公司成功的一个例证。 +英国耆卫保险公司有多少保险客户? Old Mutual (百慕大)为客户提供一系列的投资产品。 +英国耆卫保险公司有多少保险客户? 在其开放的结构下,客户除了能够参与由Old Mutual会员管理的方案外,还能够参与由一些世界顶尖投资机构提供的投资选择。 +英国耆卫保险公司有多少保险客户? 首席执行官John Clifford对此发表评论说:“过去的两年对于Old Mutual家族来说是稳固发展的两年,更名是迫在眉睫的事情。 +英国耆卫保险公司有多少保险客户? 通过采用其名字和形象上的相似,Old Mutual (百慕大)进一步强化了与母公司的联系。” +英国耆卫保险公司有多少保险客户? Clifford补充道:“我相信Old Mutual全球品牌认可度和Old Mutual(百慕大)产品专业知识的结合将在未来的日子里进一步推动公司的成功。” +英国耆卫保险公司有多少保险客户? 随着公司更名而来的是公司网站的全新改版,设计投资选择信息、陈述、销售方案、营销材料和公告板块。 +英国耆卫保险公司有多少保险客户? 在美国购买不到OMNIA投资产品,该产品也不向美国公民或居民以及百慕大居民提供。 +英国耆卫保险公司有多少保险客户? 这些产品不对任何要约未得到批准的区域中的任何人,以及进行此要约或询价为非法行为的个人构成要约或询价。 +英国耆卫保险公司有多少保险客户? 关于Old Mutual(百慕大)公司 +英国耆卫保险公司有多少保险客户? Old Mutual(百慕大)公司总部位于百慕大,公司面向非美国居民及公民以及非百慕大居民,通过遍布世界的各个市场的金融机构开发和销售保险和投资方案。 +英国耆卫保险公司有多少保险客户? 这些方案由Old Mutual(百慕大)公司直接做出,向投资者提供各种投资选择和战略,同时提供死亡和其他受益保证。 +谁知道北京的淡定哥做了什么? 尼日利亚足球队守门员恩耶马被封淡定哥,原因是2010年南非世界杯上1:2落后希腊队时,对方前锋已经突破到禁区,其仍头依门柱发呆,其从容淡定令人吃惊。 +谁知道北京的淡定哥做了什么? 淡定哥 +谁知道北京的淡定哥做了什么? 在2010年6月17日的世界杯赛场上,尼日利亚1比2不敌希腊队,但尼日利亚门将恩耶马(英文名:Vincent Enyeama)在赛场上的“淡定”表现令人惊奇。 +谁知道北京的淡定哥做了什么? 随后,网友将赛场照片发布于各大论坛,恩耶马迅速窜红,并被网友称为“淡定哥”。 +谁知道北京的淡定哥做了什么? 淡定哥 +谁知道北京的淡定哥做了什么? 从网友上传得照片中可以看到,“淡定哥”在面临对方前锋突袭至小禁区之时,还靠在球门柱上发呆,其“淡定”程度的确非一般人所能及。 +谁知道北京的淡定哥做了什么? 恩耶马是尼日利亚国家队的主力守门员,目前效力于以色列的特拉维夫哈普尔队。 +谁知道北京的淡定哥做了什么? 1999年,恩耶马在尼日利亚国内的伊波姆星队开始职业生涯,后辗转恩伊姆巴、Iwuanyanwu民族等队,从07年开始,他为特拉维夫效力。 +谁知道北京的淡定哥做了什么? 恩耶马的尼日利亚国脚生涯始于2002年,截至2010年1月底,他为国家队出场已超过50次。 +谁知道北京的淡定哥做了什么? 当地时间2011年1月4日,国际足球历史与统计协会(IFFHS)公布了2010年度世界最佳门将,恩耶马(尼日利亚,特拉维夫夏普尔)10票排第十一 +谁知道北京的淡定哥做了什么? 此词经国家语言资源监测与研究中心等机构专家审定入选2010年年度新词语,并收录到《中国语言生活状况报告》中。 +谁知道北京的淡定哥做了什么? 提示性释义:对遇事从容镇定、处变不惊的男性的戏称。 +谁知道北京的淡定哥做了什么? 例句:上海现“淡定哥”:百米外爆炸他仍专注垂钓(2010年10月20日腾讯网http://news.qq.com/a/20101020/000646.htm) +谁知道北京的淡定哥做了什么? 2011年度新人物 +谁知道北京的淡定哥做了什么? 1、淡定哥(北京) +谁知道北京的淡定哥做了什么? 7月24日傍晚,北京市出现大范围降雨天气,位于通州北苑路出现积水,公交车也难逃被淹。 +谁知道北京的淡定哥做了什么? 李欣摄图片来源:新华网一辆私家车深陷积水,车主索性盘坐在自己的汽车上抽烟等待救援。 +谁知道北京的淡定哥做了什么? 私家车主索性盘坐在自己的车上抽烟等待救援,被网友称“淡定哥” +谁知道北京的淡定哥做了什么? 2、淡定哥——林峰 +谁知道北京的淡定哥做了什么? 在2011年7月23日的动车追尾事故中,绍兴人杨峰(@杨峰特快)在事故中失去了5位亲人:怀孕7个月的妻子、未出世的孩子、岳母、妻姐和外甥女,他的岳父也在事故中受伤正在治疗。 +谁知道北京的淡定哥做了什么? 他披麻戴孝出现在事故现场,要求将家人的死因弄个明白。 +谁知道北京的淡定哥做了什么? 但在第一轮谈判过后,表示:“请原谅我,如果我再坚持,我将失去我最后的第六个亲人。” +谁知道北京的淡定哥做了什么? 如果他继续“纠缠”铁道部,他治疗中的岳父将会“被死亡”。 +谁知道北京的淡定哥做了什么? 很多博友就此批评杨峰,并讽刺其为“淡定哥”。 +071型船坞登陆舰的北约代号是什么? 071型船坞登陆舰(英语:Type 071 Amphibious Transport Dock,北约代号:Yuzhao-class,中文:玉昭级,或以首舰昆仑山号称之为昆仑山级船坞登陆舰),是中国人民解放军海军隶下的大型多功能两栖船坞登陆舰,可作为登陆艇的母舰,用以运送士兵、步兵战车、主战坦克等展开登陆作战,也可搭载两栖车辆,具备大型直升机起降甲板及操作设施。 +071型船坞登陆舰的北约代号是什么? 071型两栖登陆舰是中国首次建造的万吨级作战舰艇,亦为中国大型多功能两栖舰船的开山之作,也可以说是中国万吨级以上大型作战舰艇的试验之作,该舰的建造使中国海军的两栖舰船实力有了质的提升。 +071型船坞登陆舰的北约代号是什么? 在本世纪以前中国海军原有的两栖舰队以一 +071型船坞登陆舰的北约代号是什么? 早期071模型 +071型船坞登陆舰的北约代号是什么? 千至四千吨级登陆舰为主要骨干,这些舰艇吨位小、筹载量有限,直升机操作能力非常欠缺,舰上自卫武装普遍老旧,对于现代化两栖登陆作战可说有很多不足。 +071型船坞登陆舰的北约代号是什么? 为了应对新时期的国际国内形势,中国在本世纪初期紧急强化两栖作战能力,包括短时间内密集建造072、074系列登陆舰,同时也首度设计一种新型船坞登陆舰,型号为071。[1] +071型船坞登陆舰的北约代号是什么? 在两栖作战行动中,这些舰只不得不采取最危险的 +071型船坞登陆舰的北约代号是什么? 舾装中的昆仑山号 +071型船坞登陆舰的北约代号是什么? 敌前登陆方式实施两栖作战行动,必须与敌人预定阻击力量进行面对面的战斗,在台湾地区或者亚洲其他国家的沿海,几乎没有可用而不设防的海滩登陆地带,并且各国或者地区的陆军在战时,可能会很快控制这些易于登陆的海难和港口,这样就限制住了中国海军两栖登陆部队的实际登陆作战能力。 +071型船坞登陆舰的北约代号是什么? 
071型登陆舰正是为了更快和更多样化的登陆作战而开发的新型登陆舰艇。[2] +071型船坞登陆舰的北约代号是什么? 071型两栖船坞登陆舰具有十分良好的整体隐身能力, +071型船坞登陆舰的北约代号是什么? 071型概念图 +071型船坞登陆舰的北约代号是什么? 该舰外部线条简洁干练,而且舰体外形下部外倾、上部带有一定角度的内倾,从而形成雷达隐身性能良好的菱形横剖面。 +071型船坞登陆舰的北约代号是什么? 舰体为高干舷平甲板型,长宽比较小,舰身宽满,采用大飞剪型舰首及楔形舰尾,舰的上层建筑位于舰体中间部位,后部是大型直升机甲板,适航性能非常突出。 +071型船坞登陆舰的北约代号是什么? 顶甲板上各类电子设备和武器系统布局十分简洁干净,各系统的突出物很少。 +071型船坞登陆舰的北约代号是什么? 该舰的两座烟囱实行左右分布式设置在舰体两侧,既考虑了隐身特点,也十分新颖。[3] +071型船坞登陆舰的北约代号是什么? 1号甲板及上层建筑物主要设置有指挥室、控 +071型船坞登陆舰的北约代号是什么? 舰尾俯视 +071型船坞登陆舰的北约代号是什么? 制舱、医疗救护舱及一些居住舱,其中医疗救护舱设置有完备的战场救护设施,可以在舰上为伤病员提供紧急手术和野战救护能力。 +071型船坞登陆舰的北约代号是什么? 2号甲板主要是舰员和部分登陆人员的居住舱、办公室及厨房。 +071型船坞登陆舰的北约代号是什么? 主甲板以下则是登陆舱,分前后两段,前段是装甲车辆储存舱,共两层,可以储存登陆装甲车辆和一些其它物资,在进出口处还设有一小型升降机,用于两层之间的移动装卸用。 +071型船坞登陆舰的北约代号是什么? 前段车辆储存舱外壁左右各设有一折叠式装载舱门,所有装载车辆在码头可通过该门直接装载或者登陆上岸。 +071型船坞登陆舰的北约代号是什么? 后段是一个巨型船坞登陆舱,总长约70米,主要用来停泊大小型气垫登陆艇、机械登陆艇或车辆人员登陆艇。[4] +071型船坞登陆舰的北约代号是什么? 自卫武装方面,舰艏设有一门PJ-26型76mm舰炮( +071型船坞登陆舰的北约代号是什么? 井冈山号舰首主炮 +071型船坞登陆舰的北约代号是什么? 俄罗斯AK-176M的中国仿制版,亦被054A采用) , 四具与052B/C相同的726-4 18联装干扰弹发射器分置于舰首两侧以及上层结构两侧,近迫防御则依赖四座布置于上层结构的AK-630 30mm防空机炮 。 +071型船坞登陆舰的北约代号是什么? 原本071模型的舰桥前方设有一座八联装海红-7短程防空导弹发射器,不过071首舰直到出海试航与2009年4月下旬的海上阅兵式中,都未装上此一武器。 +071型船坞登陆舰的北约代号是什么? 电子装备方面, 舰桥后方主桅杆顶配置一具363S型E/F频2D对空/平面搜索雷达 、一具Racal Decca RM-1290 I频导航雷达,后桅杆顶装备一具拥有球型外罩的364型(SR-64)X频2D对空/对海搜索雷达,此外还有一具LR-66C舰炮射控雷达、一具负责导引AK-630机炮的TR-47C型火炮射控雷达等。[5] +071型船坞登陆舰的北约代号是什么? 071型自卫武装布置 +071型船坞登陆舰的北约代号是什么? 071首舰昆仑山号于2006年6月开 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 竹溪县人大常委会办公室:承担人民代表大会会议、常委会会议、主任会议和常委会党组会议(简称“四会”)的筹备和服务工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会组成人员视察活动的联系服务工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 受主任会议委托,拟定有关议案草案。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担常委会人事任免的具体工作,负责机关人事管理和离退休干部的管理与服务。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大机关的行政事务和后勤保障工作,负责机关的安全保卫、文电处理、档案、保密、文印工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大常委会同市人大常委会及乡镇人大的工作联系。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责信息反馈工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 了解宪法、法律、法规和本级人大及其常委会的决议、决定实施情况及常委会成员提出建议办理情况,及时向常委会和主任会议报告。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大宣传工作,负责人大常委会会议宣传的组织和联系。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 组织协调各专门工作委员会开展工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 办公室下设五个科,即秘书科、调研科、人事任免科、综合科、老干部科。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 教科文卫工作委员会:负责人大教科文卫工作的日常联系、督办、信息收集反馈和业务指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责教科文卫方面法律法规贯彻和人大工作情况的宣传、调研工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大常委会教科文卫方面会议议题调查的组织联系和调研材料的起草工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担教科文卫方面规范性备案文件的初审工作,侧重对教科文卫行政执法个案监督业务承办工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会组成人员和人大代表对教科文卫工作方面检查、视察的组织联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 代表工作委员会:负责与县人大代表和上级人大代表的联系、情况收集交流工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责《代表法》的宣传贯彻和贯彻实施情况的调查研究工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责县人大代表法律法规和人民代表大会制度知识学习的组织和指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会主任、副主任和委员走访联系人大代表的组织、联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责组织人大系统的干部培训。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责乡镇人大主席团工作的联系和指导。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表建议、批评和意见办理工作的联系和督办落实。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表开展活动的组织、联系工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 财政经济工作委员会:负责人大财政经济工作的日常联系、督办、信息收集反馈和业务指导工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责财政经济方面法律法规贯彻和人大工作情况的宣传、调研工作。 +我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 对国民经济计划和财政预算编制情况进行初审。 +我想知道武汉常住人口有多少? 武汉,简称“汉”,湖北省省会。 +我想知道武汉常住人口有多少? 它是武昌、汉口、汉阳三镇统称。 +我想知道武汉常住人口有多少? 世界第三大河长江及其最长支流汉江横贯市区,将武汉一分为三,形成武昌、汉口、汉阳,三镇跨江鼎立的格局。 +我想知道武汉常住人口有多少? 唐朝诗人李白在此写下“黄鹤楼中吹玉笛,江城五月落梅花”,因此武汉自古又称“江城”。 +我想知道武汉常住人口有多少? 武汉是中国15个副省级城市之一,全国七大中心城市之一,全市常住人口858万人。 +我想知道武汉常住人口有多少? 华中地区最大都市,华中金融中心、交通中心、文化中心,长江中下游特大城市。 +我想知道武汉常住人口有多少? 武汉城市圈的中心城市。 +我想知道武汉常住人口有多少? 
[3]武昌、汉口、汉阳三地被俗称武汉三镇。 +我想知道武汉常住人口有多少? 武汉西与仙桃市、洪湖市相接,东与鄂州市、黄石市接壤,南与咸宁市相连,北与孝感市相接,形似一只自西向东的蝴蝶形状。 +我想知道武汉常住人口有多少? 在中国经济地理圈内,武汉处于优越的中心位置是中国地理上的“心脏”,故被称为“九省通衢”之地。 +我想知道武汉常住人口有多少? 武汉市历史悠久,古有夏汭、鄂渚之名。 +我想知道武汉常住人口有多少? 武汉地区考古发现的历史可以上溯距今6000年的新石器时代,其考古发现有东湖放鹰台遗址的含有稻壳的红烧土、石斧、石锛以及鱼叉。 +我想知道武汉常住人口有多少? 市郊黄陂区境内的盘龙城遗址是距今约3500年前的商朝方国宫城,是迄今中国发现及保存最完整的商代古城之一。 +我想知道武汉常住人口有多少? 现代武汉的城市起源,是东汉末年的位于今汉阳的卻月城、鲁山城,和在今武昌蛇山的夏口城。 +我想知道武汉常住人口有多少? 东汉末年,地方军阀刘表派黄祖为江夏太守,将郡治设在位于今汉阳龟山的卻月城中。 +我想知道武汉常住人口有多少? 卻月城是武汉市区内已知的最早城堡。 +我想知道武汉常住人口有多少? 223年,东吴孙权在武昌蛇山修筑夏口城,同时在城内的黄鹄矶上修筑了一座瞭望塔——黄鹤楼。 +我想知道武汉常住人口有多少? 苏轼在《前赤壁赋》中说的“西望夏口,东望武昌”中的夏口就是指武汉(而当时的武昌则是今天的鄂州)。 +我想知道武汉常住人口有多少? 南朝时,夏口扩建为郢州,成为郢州的治所。 +我想知道武汉常住人口有多少? 隋置江夏县和汉阳县,分别以武昌,汉阳为治所。 +我想知道武汉常住人口有多少? 唐时江夏和汉阳分别升为鄂州和沔州的州治,成为长江沿岸的商业重镇。 +我想知道武汉常住人口有多少? 江城之称亦始于隋唐。 +我想知道武汉常住人口有多少? 两宋时武昌属鄂州,汉阳汉口属汉阳郡。 +我想知道武汉常住人口有多少? 经过发掘,武汉出土了大量唐朝墓葬,在武昌马房山和岳家咀出土了灰陶四神砖以及灰陶十二生肖俑等。 +我想知道武汉常住人口有多少? 宋代武汉的制瓷业发达。 +我想知道武汉常住人口有多少? 在市郊江夏区梁子湖旁发现了宋代瓷窑群100多座,烧制的瓷器品种很多,釉色以青白瓷为主。 +我想知道武汉常住人口有多少? 南宋诗人陆游在经过武昌时,写下“市邑雄富,列肆繁错,城外南市亦数里,虽钱塘、建康不能过,隐然一大都会也”来描写武昌的繁华。 +我想知道武汉常住人口有多少? 南宋抗金将领岳飞驻防鄂州(今武昌)8年,在此兴师北伐。 +我想知道武汉常住人口有多少? 元世祖至元十八年(1281年),武昌成为湖广行省的省治。 +我想知道武汉常住人口有多少? 这是武汉第一次成为一级行政单位(相当于现代的省一级)的治所。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇,托洛茨基是联共(布)党内和第三国际时期反对派的领导人,托派"第四国际"的创始人和领导人。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基(俄国与国际历史上最重要的无产阶级革命家之一,二十世纪国际共产主义运动中最具争议的、也是备受污蔑的左翼反对派领袖,他以对古典马克思主义“不断革命论”的独创性发展闻名于世,第三共产国际和第四国际的主要缔造者之一(第三国际前三次代表大会的宣言执笔人)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在1905年俄国革命中被工人群众推举为彼得堡苏维埃主席(而当时布尔什维克多数干部却还在讨论是否支持苏维埃,这些干部后来被赶回俄国的列宁痛击)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1917年革命托洛茨基率领“区联派”与列宁派联合,并再次被工人推举为彼得格勒苏维埃主席。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 对于十月革命这场20世纪最重大的社会革命,托洛茨基赢得了不朽的历史地位。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 后来成了托洛茨基死敌的斯大林,当时作为革命组织领导者之一却写道:“起义的一切实际组织工作是在彼得格勒苏维埃主席托洛茨基同志直接指挥之下完成的。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 我们可以确切地说,卫戍部队之迅速站在苏维埃方面来,革命军事委员会的工作之所以搞得这样好,党认为这首先要归功于托洛茨基同志。” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (值得一提的是,若干年后,当反托成为政治需要时,此类评价都从斯大林文章中删掉了。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )甚至连后来狂热的斯大林派雅克·沙杜尔,当时却也写道:“托洛茨基在十月起义中居支配地位,是起义的钢铁灵魂。” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (苏汉诺夫《革命札记》第6卷P76。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )不仅在起义中,而且在无产阶级政权的捍卫、巩固方面和国际共产主义革命方面,托洛茨基也作出了极其卓越的贡献(外交官-苏联国际革命政策的负责人、苏联红军缔造者以及共产国际缔造者)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 革命后若干年里,托洛茨基与列宁的画像时常双双并列挂在一起;十月革命之后到列宁病逝之前,布尔什维克历次全国代表大会上,代表大会发言结束均高呼口号:“我们的领袖列宁和托洛茨基万岁!” +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在欧美共运中托洛茨基的威望非常高。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 后人常常认为托洛茨基只是一个知识分子文人,实际上他文武双全,而且谙熟军事指挥艺术,并且亲临战场。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 正是他作为十月革命的最高军事领袖(在十月革命期间他与士兵一起在战壕里作战),并且在1918年缔造并指挥苏联红军,是一个杰出的军事家(列宁曾对朋友说,除了托洛茨基,谁还能给我迅速地造成一支上百万人的强大军队? +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在内战期间,他甚至坐装甲列车冒着枪林弹雨亲临战场指挥作战,差点挨炸死;当反革命军队进攻彼得堡时,当时的彼得堡领导人季诺维也夫吓得半死,托洛茨基却从容不迫指挥作战。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 同时托洛茨基又是一个高明的外交家,他曾强硬地要求英国政府释放因反战宣传被囚禁在英国的俄国流亡革命者,否则就不许英国公民离开俄国,连英国政府方面都觉得此举无懈可击;他并且把居高临下的法国到访者当场轰出他的办公室(革命前法国一直是俄国的头号债主与政治操纵者),却彬彬有礼地欢迎前来缓和冲突的法国大使;而在十月革命前夕,他对工人代表议会质询的答复既保守了即将起义的军事秘密,又鼓舞了革命者的战斗意志,同时严格遵循现代民主与公开原则,这些政治答复被波兰人多伊彻誉为“外交辞令的杰作”(伊·多伊彻的托氏传记<先知三部曲·武装的先知>第九章P335,第十一章P390)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基在国民经济管理与研究工作中颇有创造:是苏俄新经济政策的首先提议者以及社会主义计划经济的首先实践者。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1928年斯大林迟迟开始的计划经济实验,是对1923年以托洛茨基为首的左翼反对派经济纲领的拙劣剽窃和粗暴翻版。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 因为统治者的政策迟到,使得新经济政策到1928年已产生了一个威胁政权生存的农村资产阶级,而苏俄工人阶级国家不得不强力解决——而且是不得不借助已蜕化为官僚集团的强力来解决冲突——结果导致了1929年到30年代初的大饥荒和对农民的大量冤枉错杀。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 另外,他还对文学理论有很高的造诣,其著作<文学与革命>甚至影响了整整一代的国际左翼知识分子(包括中国的鲁迅、王实味等人)。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 
他在哈佛大学图书馆留下了100多卷的<托洛茨基全集>,其生动而真诚的自传和大量私人日记、信件,给人留下了研究人类生活各个方面的宝贵财富,更是追求社会进步与解放的历史道路上的重要知识库之一。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基1879年10月26日生于乌克兰赫尔松县富裕农民家庭,祖籍是犹太人。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 原姓布隆施泰因。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1896年开始参加工人运动。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1897年 ,参加建立南俄工人协会 ,反对沙皇专制制度。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1898年 在尼古拉也夫组织工人团体,被流放至西伯利亚。 +列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1902年秋以署名托洛茨基之假护照逃到伦敦,参加V.I.列宁、G.V.普列汉诺夫等人主编的<火星报>的工作。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥,位于洞庭湖与长江交汇处,东接岳阳市区洞庭大道和107国道、京珠高速公路,西连省道306线,是国内目前最长的内河公路桥。 +谁知道洞庭湖大桥有多长? 路桥全长10173.82m,其中桥长5747.82m,桥宽20m,西双向四车道,是我国第一座三塔双索面斜拉大桥,亚洲首座不等高三塔双斜索面预应力混凝土漂浮体系斜拉桥。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥是我国最长的内河公路桥,大桥横跨东洞庭湖区,全长10174.2米,主桥梁长5747.8米。 +谁知道洞庭湖大桥有多长? 大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区运输抗洪抢险物资提供了一条快速通道该桥设计先进,新颖,造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥是湖区人民的造福桥,装点湘北门户的形象桥,对优化交通网络绪构,发展区域经济,保障防汛救灾,缩短鄂、豫、陕等省、市西部车辆南下的运距,拓展岳阳城区的主骨架,提升岳阳城市品位,增强城市辐射力,有着十分重要的意义。 +谁知道洞庭湖大桥有多长? 自1996年12月开工以来,共有10支施工队伍和两支监理队伍参与了大桥的建设。 +谁知道洞庭湖大桥有多长? 主桥桥面高52米(黄海),设计通航等级Ⅲ级。 +谁知道洞庭湖大桥有多长? 主桥桥型为不等高三塔、双索面空间索、全飘浮体系的预应力钢筋混凝土肋板梁式结构的斜拉桥,跨径为130+310+310+130米。 +谁知道洞庭湖大桥有多长? 索塔为双室宝石型断面,中塔高为125.684米,两边塔高为99.311米。 +谁知道洞庭湖大桥有多长? 三塔基础为3米和3.2米大直径钻孔灌注桩。 +谁知道洞庭湖大桥有多长? 引桥为连续梁桥,跨径20至50米,基础直径为1.8和2.5米钻孔灌注桩。 +谁知道洞庭湖大桥有多长? 该桥设计先进、新颖、造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系,岳阳洞庭湖大桥是我国首次采用不等高三塔斜拉桥桥型的特大桥,设计先进,施工难度大位居亚洲之首,是湖南省桥梁界的一大科研项目。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥设计为三塔斜拉桥,空间双斜面索,主梁采用前支点挂篮施工,并按各种工况模拟挂篮受力进行现场试验,获得了大量有关挂篮受力性能和实际刚度的计算参数,作为施工控制参数。 +谁知道洞庭湖大桥有多长? 利用组合式模型单元,推导了斜拉桥分离式双肋平板主梁的单元刚度矩阵,并进行了岳阳洞庭湖大桥的空间受力分析,结果表明此种单元精度满足工程要求,同时在施工工艺方面也积累了成功经验。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区抗洪抢险物资运输提供了一条快速通道。 +谁知道洞庭湖大桥有多长? 湖大桥设计先进,造型美丽,科技含量高。 +谁知道洞庭湖大桥有多长? 洞庭大桥还是一道美丽的风景线,大桥沿岸风景与岳阳楼,君山岛、洞庭湖等风景名胜融为一体,交相辉映,成为世人了解岳阳的又一崭新窗口,也具有特别旅游资源。 +谁知道洞庭湖大桥有多长? 洞庭湖大桥多塔斜拉桥新技术研究荣获国家科学技术进步二等奖、湖南省科学技术进步一等奖,并获第五届詹天佑大奖。 +谁知道洞庭湖大桥有多长? 大桥在中国土木工程学会2004年第16届年会上入选首届《中国十佳桥梁》,名列斜拉桥第二位。 +谁知道洞庭湖大桥有多长? 2001年荣获湖南省建设厅优秀设计一等奖,省优秀勘察一等奖。 +谁知道洞庭湖大桥有多长? 2003年荣获国家优秀工程设计金奖, "十佳学术活动"奖。 +天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 +天气预报员的布景师是谁? ?不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 +天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 +天气预报员的布景师是谁? 不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 +天气预报员的布景师是谁? 在电视节目上,大卫永远微笑,自信而光鲜,就像每一个成功的电视人一样,说起收入,他也绝对不落人后。 +天气预报员的布景师是谁? 不过,大卫的个人生活可就不那么如意了。 +天气预报员的布景师是谁? 与妻子劳伦(霍普·戴维斯)的离婚一直让他痛苦;儿子迈克吸大麻上瘾,正在进行戒毒,可戒毒顾问却对迈克有着异样的感情;女儿雪莉则体重惊人,总是愁眉苦脸、孤独寂寞;大卫的父亲罗伯特(迈克尔·凯恩),一个世界著名的小说家,虽然罗伯特不想再让大卫觉得负担过重,可正是他的名声让大卫的一生都仿佛处在他的阴影之下,更何况,罗伯特就快重病死了。 +天气预报员的布景师是谁? 和妻子的离婚、父亲的疾病、和孩子之间完全不和谐的关系,都让大卫每天头疼,而每次当他越想控制局面,一切就越加复杂。 +天气预报员的布景师是谁? 然而就在最后人们再也不会向他扔快餐,或许是因为他总是背着弓箭在大街上走。 +天气预报员的布景师是谁? 最后,面对那份高额工作的接受意味着又一个新生活的开始。 +天气预报员的布景师是谁? 也许,生活就像天气,想怎么样就怎么样,完全不可预料。 +天气预报员的布景师是谁? 导 演:戈尔·维宾斯基 Gore Verbinski +天气预报员的布景师是谁? 编 剧:Steve Conrad .....(written by) +天气预报员的布景师是谁? 演 员:尼古拉斯·凯奇 Nicolas Cage .....David Spritz +天气预报员的布景师是谁? 尼古拉斯·霍尔特 Nicholas Hoult .....Mike +天气预报员的布景师是谁? 迈克尔·凯恩 Michael Caine .....Robert Spritzel +天气预报员的布景师是谁? 杰蒙妮·德拉佩纳 Gemmenne de la Peña .....Shelly +天气预报员的布景师是谁? 霍普·戴维斯 Hope Davis .....Noreen +天气预报员的布景师是谁? 迈克尔·瑞斯玻利 Michael Rispoli .....Russ +天气预报员的布景师是谁? 原创音乐:James S. Levine .....(co-composer) (as James Levine) +天气预报员的布景师是谁? 汉斯·兹米尔 Hans Zimmer +天气预报员的布景师是谁? 摄 影:Phedon Papamichael +天气预报员的布景师是谁? 剪 辑:Craig Wood +天气预报员的布景师是谁? 选角导演:Denise Chamian +天气预报员的布景师是谁? 艺术指导:Tom Duffield +天气预报员的布景师是谁? 美术设计:Patrick M. Sullivan Jr. .....(as Patrick Sullivan) +天气预报员的布景师是谁? 布景师 :Rosemary Brandenburg +天气预报员的布景师是谁? 服装设计:Penny Rose +天气预报员的布景师是谁? 视觉特效:Charles Gibson +天气预报员的布景师是谁? 
David Sosalla .....Pacific Title & Art Studio +韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足球协会。 +韩国国家男子足球队教练是谁? 韩国队自1986年世界杯开始,从未缺席任何一届决赛周。 +韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 +韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 +韩国国家男子足球队教练是谁? 北京时间2014年6月27日3时,巴西世界杯小组赛H组最后一轮赛事韩国对阵比利时,韩国队0-1不敌比利时,3场1平2负积1分垫底出局。 +韩国国家男子足球队教练是谁? 球队教练:洪明甫 +韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(韩国国家男子足球队???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足联。 +韩国国家男子足球队教练是谁? 韩国队是众多亚洲球队中,在世界杯表现最好,他们自1986年世界杯开始,从未缺席任何一届决赛周。 +韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 +韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 +韩国国家男子足球队教练是谁? 2014年世界杯外围赛,韩国在首轮分组赛以首名出线次轮分组赛,与伊朗、卡塔尔、乌兹别克以及黎巴嫩争逐两个直接出线决赛周资格,最后韩国仅以较佳的得失球差压倒乌兹别克,以小组次名取得2014年世界杯决赛周参赛资格,也是韩国连续八次晋身世界杯决赛周。 +韩国国家男子足球队教练是谁? 虽然韩国队在世界杯成绩为亚洲之冠,但在亚洲杯足球赛的成绩却远不及世界杯。 +韩国国家男子足球队教练是谁? 韩国只在首两届亚洲杯(1956年及1960年)夺冠,之后五十多年未能再度称霸亚洲杯,而自1992年更从未打入过决赛,与另一支东亚强队日本近二十年来四度在亚洲杯夺冠成强烈对比。[1] +韩国国家男子足球队教练是谁? 人物简介 +韩国国家男子足球队教练是谁? 车范根(1953年5月22日-)曾是大韩民国有名的锋线选手,他被欧洲媒体喻为亚洲最佳输出球员之一,他也被认为是世界最佳足球员之一。 +韩国国家男子足球队教练是谁? 他被国际足球史料与数据协会评选为20世纪亚洲最佳球员。 +韩国国家男子足球队教练是谁? 他在85-86赛季是德甲的最有价值球员,直到1999年为止他都是德甲外国球员入球纪录保持者。 +韩国国家男子足球队教练是谁? 德国的球迷一直没办法正确说出他名字的发音,所以球车范根(左)迷都以炸弹车(Cha Boom)称呼他。 +韩国国家男子足球队教练是谁? 这也代表了他强大的禁区得分能力。 +韩国国家男子足球队教练是谁? 职业生涯 +韩国国家男子足球队教练是谁? 车范根生于大韩民国京畿道的华城市,他在1971年于韩国空军俱乐部开始了他的足球员生涯;同年他入选了韩国19岁以下国家足球队(U-19)。 +韩国国家男子足球队教练是谁? 隔年他就加入了韩国国家足球队,他是有史以来加入国家队最年轻的球员。 +韩国国家男子足球队教练是谁? 车范根在27岁时前往德国发展,当时德甲被认为是世界上最好的足球联赛。 +韩国国家男子足球队教练是谁? 他在1978年12月加入了达姆施塔特,不过他在那里只待了不到一年就转到当时的德甲巨人法兰克福。 +韩国国家男子足球队教练是谁? 车范根很快在新俱乐部立足,他帮助球队赢得79-80赛季的欧洲足协杯。 +韩国国家男子足球队教练是谁? 在那个赛季过后,他成为德甲薪水第三高的球员,不过在1981年对上勒沃库森的一场比赛上,他的膝盖严重受伤,几乎毁了他的足球生涯。 +韩国国家男子足球队教练是谁? 在1983年车范根转投勒沃库森;他在这取得很高的成就,他成为85-86赛季德甲的最有价值球员,并且在1988年帮助球队拿下欧洲足协杯,也是他个人第二个欧洲足协杯。 +韩国国家男子足球队教练是谁? 他在决赛对垒西班牙人扮演追平比分的关键角色,而球会才在点球大战上胜出。 +韩国国家男子足球队教练是谁? 车范根在1989年退休,他在308场的德甲比赛中进了98球,一度是德甲外国球员的入球纪录。 +韩国国家男子足球队教练是谁? 执教生涯 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学,简称台湾科大、台科大或台科,是位于台湾台北市大安区的台湾第一所高等技职体系大专院校,现为台湾最知名的科技大学,校本部比邻国立台湾大学。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 该校已于2005年、2008年持续入选教育部的“发展国际一流大学及顶尖研究中心计划”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? “国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? “国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本校校地约44.5公顷,校本部位于台北市基隆路四段四十三号,。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 民国68年成立硕士班,民国71年成立博士班,现有大学部学生5,664人,研究生4,458人,专任教师451位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2001年在台湾地区教育部筹划之研究型大学(“国立”大学研究所基础教育重点改善计画)中,成为全台首批之9所大学之一 。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 自2005年更在“教育部”所推动“五年五百亿 顶尖大学”计划下,遴选为适合发展成“顶尖研究中心”的11所研究型大学之一。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学部设有二年制、四年制及工程在职人员进修班等三种学制;凡二专、三专及五专等专科学校以上之毕业生,皆可以报考本校大学部二年制,而高职、高中毕业生,可以报考本校大学部四年制。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工业管理、电子工程、机械工程、营建工程及应用外语系等,则设有在职人员进修班学制,其招生对象为在职人员,利用夜间及暑假期间上课。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 凡在本校大学部修毕应修学分且成绩及格者皆授予学士学位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 
国立台湾科技大学目前设有工程、电资、管理、设计、人文社会及精诚荣誉等六个学院,分别有机械、材料科学与工程、营建、化工、电子、电机、资工、工管、企管、资管、建筑、工商业设计、应用外语等13个系及校内招生之财务金融学士学位学程、科技管理学士学位学程;全校、工程、电资、管理、创意设计等五个不分系菁英班及光电研究所、管理研究所、财务金融研究所、科技管理研究所、管理学院MBA、数位学习教育研究所、医学工程研究所、自动化及控制研究所、工程技术研究所、专利研究所等独立研究所,此外尚有人文学科负责人文及社会类等课程之教学,通识学科负责法律、音乐、环保类等课程之教学,以及师资培育中心专以培养学生未来担任中等学校工、商、管理、设计等科之合格教师,合计23个独立系所、师资培育中心、人文学科及通识学科等教学单位。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学至今各系所毕业校友已达约56,456位,毕业生出路包含出国继续深造、在台深造以及投身于产业界。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 由于实作经验丰富,理论基础完备,工作态度认真,毕业校友担任政府要职、大学教授、大学校长及企业主管者众多,深受各界的肯定。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工商业设计系副教授孙春望与硕一生全明远耗时两个月自制之三分钟动画短片“立体悲剧”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本片入选有“动画奥斯卡”之称的“ACM SIGGRAPH”国际动画展,并获得观众票选第一名,这也是台湾首次入选及获奖的短片。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 击败了好莱坞知名导演史蒂芬·史匹柏的“世界大战”、乔治卢卡斯的“星际大战三部曲”、梦工厂出品的动画“马达加斯加”、军机缠斗片“机战未来”及美国太空总署、柏克莱加州大学等好莱坞名片及顶尖学术单位制作的短片。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2009年荣获有工业设计界奥斯卡奖之称的“德国iF设计大奖”国立台湾科技大学设计学院获得大学排名的全球第二,仅次于韩国三星美术设计学院“SADI”。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 总体排名 依据《泰晤士高等教育》(THES-QS)在2009年的世界大学排名调查,台科大排名全世界第351名,在台湾所有大学中排名第五,仅次于台大,清大,成大及阳明,并且是台湾唯一进入世界四百大名校的科技大学。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 依据在欧洲拥有广大声誉的“Eduniversal商学院排名网”2008年的资料,台湾有七所大学的商管学院被分别列入世界1000大商学院,其中台科大位在“卓越商学院”(EXCELLENT Business Schools,国内主要)之列,“推荐程度”(Recommendation Rate)为全台第四,仅次于台大、政大、中山,与交大并列。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 目前设有工程、电资、管理、设计、人文社会及精诚荣誉学院等六个学院。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 预计于竹北新校区设立产学合作学院及应用理学院。 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾建筑科技中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●智慧型机械人研究中心科技成果展示(15张) +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾彩卷与博彩研究中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●电力电子技术研发中心 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●NCP-Taiwan办公室 +国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●资通安全研究与教学中心 +在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 +在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 +在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 +在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 +在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 +在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 +在日本,神道最初属于什么信仰? 别名冲道。 +在日本,神道最初属于什么信仰? 属督脉。 +在日本,神道最初属于什么信仰? 宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 +在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 +在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 +在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 +在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 +在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 +在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 +在日本,神道最初属于什么信仰? 别名冲道。 +在日本,神道最初属于什么信仰? 属督脉。 +在日本,神道最初属于什么信仰? 宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 +在日本,神道最初属于什么信仰? 谓鬼神赐福降灾神妙莫测之道。 +在日本,神道最初属于什么信仰? 《易·观》:“观天之神道,而四时不忒,圣人以神道设教,而天下服矣。” +在日本,神道最初属于什么信仰? 孔颖达 疏:“微妙无方,理不可知,目不可见,不知所以然而然,谓之神道。” +在日本,神道最初属于什么信仰? 《文选·王延寿<鲁灵光殿赋>》:“敷皇极以创业,协神道而大宁。” +在日本,神道最初属于什么信仰? 张载 注:“协和神明之道,而天下大宁。” +在日本,神道最初属于什么信仰? 南朝 梁 刘勰 《文心雕龙·正纬》:“夫神道阐幽,天命微显。” +在日本,神道最初属于什么信仰? 鲁迅 《中国小说史略》第五篇:“﹝ 干宝 ﹞尝感於其父婢死而再生,及其兄气绝复苏,自言见天神事,乃撰《搜神记》二十卷,以‘发明神道之不诬’。” +在日本,神道最初属于什么信仰? 神道设教 观卦里面蕴含着《易经》固有的诸如神道设教、用舍行藏、以德化民等思想,是孔子把这些思想发掘出来。 +在日本,神道最初属于什么信仰? 「据此是孔子见当时之人,惑于吉凶祸福,而卜筮之史,加以穿凿傅会,故演易系辞,明义理,切人事,借卜筮以教后人,所谓以神道设教,其所发明者,实即羲文之义理,而非别有义理,亦非羲文并无义理,至孔子始言义理也,当即朱子之言而小变之曰,易为卜筮作,实为义理作,伏羲文王之易,有占而无文,与今人用火珠林起课者相似,孔子加卦爻辞如签辞,纯以理言,实即羲文本意,则其说分明无误矣。」 +在日本,神道最初属于什么信仰? 孔子所发掘的《易经》思想与孔子在《论语》书中表现出来的思想完全一致。 +在日本,神道最初属于什么信仰? 《易传》的思想反映了孔子的思想,这个思想是《周易》的,也是孔子的。 +在日本,神道最初属于什么信仰? 在《周易》和孔子看来,神不是有意识的人格化的上帝。 +奥林匹克里昂获得了几连霸? 
里昂 Lyon 全名 Olympique lyonnais 绰号 Les Gones、OL 成立 1950年 城市 法国,里昂 主场 热尔兰球场(Stade Gerland) 容纳人数 41,044人 主席 奥拉斯 主教练 雷米·加尔德 联赛 法国足球甲级联赛 2013–14 法甲,第 5 位 网站 官方网站 主场球衣 客场球衣 第三球衣 日尔兰体育场 奥林匹克里昂(Olympique lyonnais,简称:OL及Lyon,中文简称里昂)是一间位于法国东南部罗纳-阿尔卑斯区的里昂市的足球会,成立于1950年8月3日,前身为里昂·奥林匹克(Lyon Olympique)体育俱乐部其中一个分支的足球队,1889年离开体育俱乐部自立门户成立新俱乐部,但官方网站表示俱乐部于1950年正式成立。 +奥林匹克里昂获得了几连霸? 现时在法国足球甲级联赛比赛,俱乐部同时设立男子及女子足球队。 +奥林匹克里昂获得了几连霸? 里昂是首届法国足球甲级联赛成员之一,可惜名列第十五位而降落乙组,1951年以乙级联赛冠军获得创会后首次锦标。 +奥林匹克里昂获得了几连霸? 球队在法国足球史上没有取得辉煌成绩,比较优异的算是六十年代曾杀入欧洲杯赛冠军杯四强,及3度晋身法国杯决赛并2次成功获冠。 +奥林匹克里昂获得了几连霸? 直至九十年代末里昂由辛天尼带领,先连续取得联赛头三名,到2002年终于首次登上法国顶级联赛冠军宝座,同年勒冈(Paul Le Guen)接替执教法国国家足球队的辛天尼,他其后继续带领里昂保持气势,加上队中球员小儒尼尼奧、迪亚拉、克里斯蒂亞諾·馬克斯·戈麥斯、迈克尔·埃辛、西德尼·戈武及门将格雷戈里·库佩表现突出,2003年至2005年横扫3届联赛冠军,创下连续四年夺得联赛锦标,平了1960年代末圣艾蒂安及1990年代初马赛的四连冠纪录。 +奥林匹克里昂获得了几连霸? 2005年前利物浦主教练热拉尔·霍利尔重返法国担任新任主教练,并加入葡萄牙中场蒂亚戈,和前巴伦西亚前锋约翰·卡鲁。 +奥林匹克里昂获得了几连霸? 他亦成功带领里昂赢得一届法甲冠军。 +奥林匹克里昂获得了几连霸? 2007年里昂成为首支上市的法国足球俱乐部,招股价21至24.4欧元,发行370万股,集资8400万欧元[1]。 +奥林匹克里昂获得了几连霸? 2007年4月21日,联赛次名图卢兹二比三不敌雷恩,令处于榜首的里昂领先次席多达17分距离,里昂因此提前六轮联赛庆祝俱乐部连续第六年夺得联赛冠军,亦是欧洲五大联赛(英格兰、德国、西班牙、意大利及法国)历史上首支联赛六连冠队伍[2]。 +奥林匹克里昂获得了几连霸? 在2007-08年赛季,里昂再一次成功卫冕联赛锦标,达成七连霸伟业。 +奥林匹克里昂获得了几连霸? 不过在2008-09赛季,里昂排名法甲第三位,联赛冠军被波尔多所获得。 +奥林匹克里昂获得了几连霸? 于2010年4月,里昂以两回合3比2的比分于欧洲冠军联赛击败波尔多跻身四强,此乃里昂首次晋级此项顶级杯赛的四强阶段。 +奥林匹克里昂获得了几连霸? 粗体字为新加盟球员 +奥林匹克里昂获得了几连霸? 以下球员名单更新于2014年8月27日,球员编号参照 官方网站,夏季转会窗为6月9日至8月31日 +火柴人刺杀行动怎么才能过关? 移动鼠标控制瞄准,点击鼠标左键进行射击。 +火柴人刺杀行动怎么才能过关? 游戏加载完成后点击STARTGAME-然后点击STARTMISSION即可开始游戏。 +火柴人刺杀行动怎么才能过关? 这里不仅仅考验的是你的枪法而且最重要的是你的智慧,喜欢火柴人类型游戏的玩家可以进来小试身手。 +火柴人刺杀行动怎么才能过关? 控制瞄准,刺杀游戏中的目标人物即可过关哦。 +你知道2月14日西方情人节是因何起源的吗? 情人节(英语:Valentine's Day),情人节的起源有多个版本,其中一个说法是在公元三世纪,古罗马暴君为了征召更多士兵,禁止婚礼,一名叫瓦伦丁Valentine的修士不理禁令,秘密替人主持婚礼,结果被收监,最后处死。 +你知道2月14日西方情人节是因何起源的吗? 而他死的那天就是2月14日,为纪念Valentine的勇敢精神,人们将每年的2月14日定为Valentine的纪念日。 +你知道2月14日西方情人节是因何起源的吗? 因此成了后来的“情人节”。 +你知道2月14日西方情人节是因何起源的吗? 另外,据记载,教宗在公元496年废除牧神节,把2月14日定为圣瓦伦丁日,即是St.Valentine's Day,后来成为是西方的节日之一。 +你知道2月14日西方情人节是因何起源的吗? 中文名称:情人节 +你知道2月14日西方情人节是因何起源的吗? 外文名称:Valentine‘s Day +你知道2月14日西方情人节是因何起源的吗? 别名:情人节圣瓦伦丁节 +你知道2月14日西方情人节是因何起源的吗? 公历日期:2月14日 +你知道2月14日西方情人节是因何起源的吗? 起源时间:公元270年2月14日 +你知道2月14日西方情人节是因何起源的吗? 起源事件:人们为了纪念为情人做主而牺牲的瓦伦丁神父,把他遇害的那一天(2月14日)称为情人节。 +你知道2月14日西方情人节是因何起源的吗? 地区:欧美地区 +你知道2月14日西方情人节是因何起源的吗? 宗教:基督教 +你知道2月14日西方情人节是因何起源的吗? 其他信息:西方的传统节日之一。 +你知道2月14日西方情人节是因何起源的吗? 男女在这一天互送礼物(如贺卡和玫瑰花等)用以表达爱意或友好。 +你知道2月14日西方情人节是因何起源的吗? 据台湾“今日台湾人讨厌情人节新闻网”报道,西洋情人节即将来到,求职网进行“办公室恋情及情人节调查”发现,在目前全台上班族的感情状态中,有情人相伴的比率约5成5,4成5的上班族单身;较出乎意料的结果是,情人节以近3成(28%)的占比,登上最讨厌的节日第一名,端午节以24.3%居第二;农历年则以18.2%居第三;第四名是圣诞节,占12.4%。 +你知道2月14日西方情人节是因何起源的吗? 调查指出,情人节对单身族来说,不仅成为压力,也显得更加孤单,在情人节当天,单身的上班族有将近4成(39.1%)的人在家看电视度过,近两成(18.7%)上网聊天,有1成4(14.8%)的人,不畏满街闪光,勇气十足出门看电影,近1成(9.7%)的上班族选择留在公司加班;另外有 5.4%的人,会在情人节当天积极参加联谊,希望能改变自己的感情状态。 +你知道2月14日西方情人节是因何起源的吗? 情侣们在情人节当天,庆祝方式以吃浪漫大餐最多(37.1%),不过有近3成(27%)的情侣,在情人节当天不会特别庆祝情人节,且这个比率远比第三名的旅游(占比11.5%)高出1倍以上。 +你知道2月14日西方情人节是因何起源的吗? 在情人节当天庆祝的开销上,可以说是小资男女当道,选择1000元(新台币,下同)以内的上班族最多占33.1%,情人节当天的花费上班族的平均花费是2473元,大手笔花费上万元以上庆祝情人节的,占比只有2.5%。 +你知道2月14日西方情人节是因何起源的吗? 情人节的起源众说纷纭,而为纪念罗马教士瓦伦丁是其中一个普遍的说法。 +你知道2月14日西方情人节是因何起源的吗? 据《世界图书百科全书》(World Book Encyclopedia)数据指出:“在公元200年时期,罗马皇帝克劳狄二世禁止年轻男子结婚。 +你知道2月14日西方情人节是因何起源的吗? 他认为未婚男子可以成为更优良的士兵。 +你知道2月14日西方情人节是因何起源的吗? 一位名叫瓦伦丁的教士违反了皇帝的命令,秘密为年轻男子主持婚礼,引起皇帝不满,结果被收监,据说瓦伦丁于公元269年2月14日被处决。 +你知道2月14日西方情人节是因何起源的吗? 另外,据《天主教百科全书》(The Catholic情人节 Encyclopedia)指出,公元496年,教宗圣基拉西乌斯一世在公元第五世纪末叶废除了牧神节,把2月14日定为圣瓦伦丁日。” +你知道2月14日西方情人节是因何起源的吗? 这个节日现今以“圣瓦伦丁节”——亦即情人节的姿态盛行起来。 +你知道2月14日西方情人节是因何起源的吗? 但是在第2次梵蒂冈大公会议后,1969年的典礼改革上,整理了一堆在史实上不确定是否真实存在的人物以后,圣瓦伦丁日就被废除了。 +你知道2月14日西方情人节是因何起源的吗? 现在天主教圣人历已经没有圣瓦伦丁日(St. 
Valentine's Day)。 +你知道2月14日西方情人节是因何起源的吗? 根据《布卢姆尔的警句与寓言辞典》记载:“圣瓦伦丁是个罗马教士,由于援助受逼害的基督徒而身陷险境,后来他归信基督教,最后被处死,卒于二月十四日”古代庆祝情人节的习俗与瓦伦丁拉上关系,可能是纯属巧合而已。 +你知道2月14日西方情人节是因何起源的吗? 事实上,这个节日很可能与古罗马的牧神节或雀鸟交配的季节有关。 +你知道2月14日西方情人节是因何起源的吗? 情人节的特色是情侣互相馈赠礼物。 +你知道2月14日西方情人节是因何起源的吗? 时至今日,人们则喜欢以情人卡向爱人表达情意。 +防卫大学每年招收多少学生? 防卫大学的前身是保安大学。 +防卫大学每年招收多少学生? 防卫大学是日本自卫队培养陆、海、空三军初级军官的学校,被称为日军"军官的摇篮"。 +防卫大学每年招收多少学生? 防卫大学是日军的重点院校。 +防卫大学每年招收多少学生? 日本历届内阁首相都要到防卫大学视察"训示",并亲自向学生颁发毕业证书。 +防卫大学每年招收多少学生? 日军四分之一的军官、三分之一的将官从这里走出。 +防卫大学每年招收多少学生? 防卫大学毕业生已成为日军军官的中坚力量。 +防卫大学每年招收多少学生? 防卫大学每年从地方招收18岁至21岁的应届高中毕业生和同等学历的青年。 +防卫大学每年招收多少学生? 每年招生名额为530名。 +防卫大学每年招收多少学生? 1950年 8月,日本组建警察预备队,1952年改为保安队。 +防卫大学每年招收多少学生? 为了充实保安队干部队伍,提高干部军政素质,1953年4月成立了保安大学,校址设在三浦半岛的久里滨。 +防卫大学每年招收多少学生? 1954年7月1日保安厅改为防卫厅。 +防卫大学每年招收多少学生? 在保安队基础上,日本建立了陆、海、空三军自卫队,保安大学遂改名为防卫大学,1955年迁至三浦半岛东南方的小原台。 +防卫大学每年招收多少学生? 学校直属防卫厅领导。 +防卫大学每年招收多少学生? 防卫大学的教育方针是:要求学生德智体全面发展,倡导学生崇尚知识和正义,培养学生具有指挥各种部队的能力。 +防卫大学每年招收多少学生? 防卫大学每年招生名额为530名,其中陆军300名,海军100名,空军130名。 +防卫大学每年招收多少学生? 根据自卫队向妇女敞开军官大门的决定,防卫大学1992年首次招收女学员35名。 +防卫大学每年招收多少学生? 考试分两次进行。 +防卫大学每年招收多少学生? 第一次,每年11月份进行学科考试;第二次,12月份进行口试和体检。 +防卫大学每年招收多少学生? 学校按陆、海、空三军分别设大学本科班和理工科研究生班。 +防卫大学每年招收多少学生? 本科班学制4年,又分为理工和人文社会学两大科。 +防卫大学每年招收多少学生? 学员入学后先分科,530人中有460人专攻理科,70人专攻文科。 +防卫大学每年招收多少学生? 第1学年按专科学习一般大学课程和一般军事知识。 +防卫大学每年招收多少学生? 第2学年以后在军事上开始区分军种,学员分别学习陆、海、空军的专门课程。 +防卫大学每年招收多少学生? 文化课和军事课的比例是6:l。 +防卫大学每年招收多少学生? 文化课程有人文、社会、自然、外语、电气工程、机械工程、土木建筑工程、应用化学、应用物理、航空、航海等。 +防卫大学每年招收多少学生? 军事训练课每学年6周,按一年四季有比例地安排教学内容,对学生进行军事技术和体能训练。 +防卫大学每年招收多少学生? 理工科研究生班,每年招生1期,学制2年,每期招收90人,设电子工程、航空工程、兵器制造等7个专业,课程按一般大学硕士课程标准设置。 +防卫大学每年招收多少学生? 防卫大学的课程和训练都十分紧张。 +防卫大学每年招收多少学生? 近年来,为了增强防卫大学的吸引力,克服考生逐年减少的倾向广泛征集优秀人才,学校进行了一些改革,改变入学考试办法,各高中校长以内部呈报的形式向防卫大学推荐品学兼优的学生;减少学生入学考试科目,放宽对报考防卫大学的学生的视力要求;降低学分数(大约降低30学分);改善学生宿舍条件。 +防卫大学每年招收多少学生? 防卫大学的学生生活紧张而愉快。 +《威鲁贝鲁的物语》官网是什么? 10年前大战后,威鲁贝鲁国一致辛勤的保护着得来不易的和平,但是与邻国圣卡特拉斯国的关系却不断的紧张,战争即将爆发。 +《威鲁贝鲁的物语》官网是什么? 为了避免战争,威鲁贝鲁国王海特鲁王决定将自己最大的女儿公主莉塔嫁给圣卡特拉斯国的王子格鲁尼亚。 +《威鲁贝鲁的物语》官网是什么? 但是莉塔却刺伤了政治婚姻的对象格鲁尼亚王子逃了出去,这事激怒了圣卡特拉斯国的国王兰帕诺夫王,并下令14天之内抓到王女并执行公开处刑来谢罪,不然两国就要开战。 +《威鲁贝鲁的物语》官网是什么? 《威鲁贝鲁的物语~Sisters of Wellber~》 +《威鲁贝鲁的物语》官网是什么? (Sisters of Wellber) +《威鲁贝鲁的物语》官网是什么? 日文名 ウエルベールの物语 +《威鲁贝鲁的物语》官网是什么? 官方网站 http://www.avexmovie.jp/lineup/wellber/ +《威鲁贝鲁的物语》官网是什么? 为了回避发生战争这个最坏的结果,莉塔下定决心去中立国古利达姆。 diff --git a/examples/text_graph/erniesage/link_prediction.py b/examples/text_graph/erniesage/link_prediction.py new file mode 100644 index 0000000000000000000000000000000000000000..2ad8b2faecfad6b294703e8f9271905d887de245 --- /dev/null +++ b/examples/text_graph/erniesage/link_prediction.py @@ -0,0 +1,177 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
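+# Entry script for the ErnieSage link prediction example. It reads a YAML config passed
+# via --conf, then either trains the model (do_train) or, when --do_predict is given,
+# dumps node embeddings from a saved checkpoint (do_predict). The graph workspace
+# (config.graph_work_path) is expected to be prepared beforehand, e.g. by
+# preprocessing/dump_graph.py.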
+ +import argparse +import io +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import pgl +import yaml +from data import GraphDataLoader, PredictData, TrainData, batch_fn +from easydict import EasyDict as edict +from models import ErnieSageForLinkPrediction + +from paddlenlp.transformers import ErnieTinyTokenizer, ErnieTokenizer +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie-tiny": (ErnieSageForLinkPrediction, ErnieTinyTokenizer), + "ernie-1.0": (ErnieSageForLinkPrediction, ErnieTokenizer), +} + + +def set_seed(config): + random.seed(config.seed) + np.random.seed(config.seed) + paddle.seed(config.seed) + + +def load_data(graph_data_path): + base_graph = pgl.Graph.load(graph_data_path) + term_ids = np.load(os.path.join(graph_data_path, "term_ids.npy"), mmap_mode="r") + return base_graph, term_ids + + +def do_train(config): + paddle.set_device(config.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(config) + + base_graph, term_ids = load_data(config.graph_work_path) + collate_fn = partial(batch_fn, samples=config.samples, base_graph=base_graph, term_ids=term_ids) + + # mode = "train" + train_ds = TrainData(config.graph_work_path) + + model_class, tokenizer_class = MODEL_CLASSES[config.model_name_or_path] + tokenizer = tokenizer_class.from_pretrained(config.model_name_or_path) + config.cls_token_id = tokenizer.cls_token_id + + model = model_class.from_pretrained(config.model_name_or_path, config_file=config) + model = paddle.DataParallel(model) + + train_loader = GraphDataLoader( + train_ds, batch_size=config.batch_size, shuffle=True, num_workers=config.sample_workers, collate_fn=collate_fn + ) + + optimizer = paddle.optimizer.Adam(learning_rate=config.lr, parameters=model.parameters()) + + rank = paddle.distributed.get_rank() + global_step = 0 + tic_train = time.time() + for epoch in range(config.epoch): + for step, (graphs, datas) in enumerate(train_loader): + global_step += 1 + loss, outputs = model(graphs, datas) + if global_step % config.log_per_step == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss, config.log_per_step / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + optimizer.clear_grad() + if global_step % config.save_per_step == 0: + if rank == 0: + output_dir = os.path.join(config.output_path, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model._layers.save_pretrained(output_dir) + if rank == 0: + output_dir = os.path.join(config.output_path, "last") + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model._layers.save_pretrained(output_dir) + + +def tostr(data_array): + return " ".join(["%.5lf" % d for d in data_array]) + + +@paddle.no_grad() +def do_predict(config): + paddle.set_device(config.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + set_seed(config) + + # mode = "predict" + num_nodes = int(np.load(os.path.join(config.graph_work_path, "num_nodes.npy"))) + + base_graph, term_ids = load_data(config.graph_work_path) + collate_fn = partial(batch_fn, samples=config.samples, base_graph=base_graph, term_ids=term_ids) + + model_class, tokenizer_class = MODEL_CLASSES[config.model_name_or_path] + tokenizer = tokenizer_class.from_pretrained(config.model_name_or_path) + config.cls_token_id = tokenizer.cls_token_id + + 
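# config.infer_model is expected to be a checkpoint directory saved by do_train (for example the "last" directory under config.output_path); it is loaded here in place of the base pretrained weights. +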
model = model_class.from_pretrained(config.infer_model, config_file=config) + + model = paddle.DataParallel(model) + predict_ds = PredictData(num_nodes) + + predict_loader = GraphDataLoader( + predict_ds, + batch_size=config.infer_batch_size, + shuffle=True, + num_workers=config.sample_workers, + collate_fn=collate_fn, + ) + + trainer_id = paddle.distributed.get_rank() + id2str = io.open(os.path.join(config.graph_work_path, "terms.txt"), encoding=config.encoding).readlines() + if not os.path.exists(config.output_path): + os.mkdir(config.output_path) + fout = io.open("%s/part-%s" % (config.output_path, trainer_id), "w", encoding="utf8") + + global_step = 0 + epoch = 0 + tic_train = time.time() + model.eval() + for step, (graphs, datas) in enumerate(predict_loader): + global_step += 1 + loss, outputs = model(graphs, datas) + for user_feat, user_real_index in zip(outputs[0].numpy(), outputs[3].numpy()): + sri = id2str[int(user_real_index)].strip("\n") + line = "{}\t{}\n".format(sri, tostr(user_feat)) + fout.write(line) + if global_step % config.log_per_step == 0: + logger.info( + "predict step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss, config.log_per_step / (time.time() - tic_train)) + ) + tic_train = time.time() + fout.close() + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="main") + parser.add_argument("--conf", type=str, default="./config.yaml") + parser.add_argument("--do_predict", action="store_true", default=False) + args = parser.parse_args() + config = edict(yaml.load(open(args.conf), Loader=yaml.FullLoader)) + + assert config.device in ["gpu", "cpu"], "Device should be gpu/cpu, but got %s." % config.device + logger.info(config) + if args.do_predict: + do_predict(config) + else: + do_train(config) diff --git a/examples/text_graph/erniesage/models/__init__.py b/examples/text_graph/erniesage/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..4b02ff01793be5a8840cc144dabc13beaff989b8 --- /dev/null +++ b/examples/text_graph/erniesage/models/__init__.py @@ -0,0 +1,18 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from models import model + +__all__ = [] +__all__ += model.__all__ diff --git a/examples/text_graph/erniesage/models/conv.py b/examples/text_graph/erniesage/models/conv.py new file mode 100644 index 0000000000000000000000000000000000000000..8ec0c61d7b0a4bbcdf2e76761d928fe7f48e09fe --- /dev/null +++ b/examples/text_graph/erniesage/models/conv.py @@ -0,0 +1,174 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class GraphSageConv(nn.Layer): + """GraphSAGE is a general inductive framework that leverages node feature + information (e.g., text attributes) to efficiently generate node embeddings + for previously unseen data. + + Paper reference: + Hamilton, Will, Zhitao Ying, and Jure Leskovec. + "Inductive representation learning on large graphs." + Advances in neural information processing systems. 2017. + """ + + def __init__(self, input_size, hidden_size, learning_rate, aggr_func="sum"): + super(GraphSageConv, self).__init__() + assert aggr_func in [ + "sum", + "mean", + "max", + "min", + ], "Only support 'sum', 'mean', 'max', 'min' built-in receive function." + self.aggr_func = "reduce_%s" % aggr_func + + self.self_linear = nn.Linear( + input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) + ) + self.neigh_linear = nn.Linear( + input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) + ) + + def forward(self, graph, feature, act=None): + def _send_func(src_feat, dst_feat, edge_feat): + return {"msg": src_feat["h"]} + + def _recv_func(message): + return getattr(message, self.aggr_func)(message["msg"]) + + msg = graph.send(_send_func, src_feat={"h": feature}) + neigh_feature = graph.recv(reduce_func=_recv_func, msg=msg) + + self_feature = self.self_linear(feature) + neigh_feature = self.neigh_linear(neigh_feature) + output = self_feature + neigh_feature + if act is not None: + output = getattr(F, act)(output) + + output = F.normalize(output, axis=1) + return output + + +class ErnieSageV2Conv(nn.Layer): + """ErnieSage (abbreviation of ERNIE SAmple aggreGatE), a model proposed by the PGL team. + ErnieSageV2: Ernie is applied to the EDGE of the text graph. + """ + + def __init__(self, ernie, input_size, hidden_size, learning_rate, cls_token_id=1, aggr_func="sum"): + """ErnieSageV2: Ernie is applied to the EDGE of the text graph. + + Args: + ernie (nn.Layer): the ernie model. + input_size (int): input size of feature tensor. + hidden_size (int): hidden size of the Conv layers. + learning_rate (float): learning rate. + aggr_func (str): aggregate function. 'sum', 'mean', 'max' avaliable. + """ + super(ErnieSageV2Conv, self).__init__() + assert aggr_func in [ + "sum", + "mean", + "max", + "min", + ], "Only support 'sum', 'mean', 'max', 'min' built-in receive function." + self.aggr_func = "reduce_%s" % aggr_func + self.cls_token_id = cls_token_id + self.self_linear = nn.Linear( + input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) + ) + self.neigh_linear = nn.Linear( + input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) + ) + + self.ernie = ernie + + def ernie_send(self, src_feat, dst_feat, edge_feat): + """Apply ernie model on the edge. + + Args: + src_feat (Tensor Dict): src feature tensor dict. + dst_feat (Tensor Dict): dst feature tensor dict. + edge_feat (Tensor Dict): edge feature tensor dict. + + Returns: + Tensor Dict: tensor dict which use 'msg' as the key. 
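+ Note: the message for each edge is the ERNIE pooled output over the concatenated pair ([CLS] + src term_ids, dst term_ids), with segment ids 0 for the src part and 1 for the dst part, and position ids derived from the non-padding mask.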
+ """ + # input_ids + cls = paddle.full(shape=[src_feat["term_ids"].shape[0], 1], dtype="int64", fill_value=self.cls_token_id) + src_ids = paddle.concat([cls, src_feat["term_ids"]], 1) + + dst_ids = dst_feat["term_ids"] + + # sent_ids + sent_ids = paddle.concat([paddle.zeros_like(src_ids), paddle.ones_like(dst_ids)], 1) + term_ids = paddle.concat([src_ids, dst_ids], 1) + + # build position_ids + input_mask = paddle.cast(term_ids > 0, "int64") + position_ids = paddle.cumsum(input_mask, axis=1) - 1 + + outputs = self.ernie(term_ids, sent_ids, position_ids) + feature = outputs[1] + return {"msg": feature} + + def send_recv(self, graph, term_ids): + """Message Passing of erniesage v2. + + Args: + graph (Graph): the Graph object. + feature (Tensor): the node feature tensor. + + Returns: + Tensor: the self and neighbor feature tensors. + """ + + def _recv_func(message): + return getattr(message, self.aggr_func)(message["msg"]) + + msg = graph.send(self.ernie_send, node_feat={"term_ids": term_ids}) + neigh_feature = graph.recv(reduce_func=_recv_func, msg=msg) + + cls = paddle.full(shape=[term_ids.shape[0], 1], dtype="int64", fill_value=self.cls_token_id) + term_ids = paddle.concat([cls, term_ids], 1) + term_ids.stop_gradient = True + outputs = self.ernie(term_ids, paddle.zeros_like(term_ids)) + self_feature = outputs[1] + + return self_feature, neigh_feature + + def forward(self, graph, term_ids, act="relu"): + """Forward funciton of Conv layer. + + Args: + graph (Graph): Graph object. + feature (Tensor): node feture. + act (str, optional): activation function. Defaults to 'relu'. + + Returns: + Tensor: feature after conv. + """ + + self_feature, neigh_feature = self.send_recv(graph, term_ids) + self_feature = self.self_linear(self_feature) + neigh_feature = self.neigh_linear(neigh_feature) + output = self_feature + neigh_feature + if act is not None: + output = getattr(F, act)(output) + output = F.normalize(output, axis=1) + return output diff --git a/examples/text_graph/erniesage/models/encoder.py b/examples/text_graph/erniesage/models/encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..9363beb43a4585d5406b82f90105a45deafd30bc --- /dev/null +++ b/examples/text_graph/erniesage/models/encoder.py @@ -0,0 +1,133 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from models.conv import ErnieSageV2Conv, GraphSageConv + + +class Encoder(nn.Layer): + """Base class + Chose different type ErnieSage class. + """ + + def __init__(self, config): + """init function + + Args: + config (Dict): all configs. + """ + super(Encoder, self).__init__() + self.config = config + # Don't add ernie to self, oterwise, there will be more copies of ernie weights + # self.ernie = ernie + + @classmethod + def factory(cls, config, ernie): + """Classmethod for ernie sage model. + + Args: + config (Dict): all configs. + ernie (nn.Layer): the ernie model. 
+ + Raises: + ValueError: Invalid ernie sage model type. + + Returns: + Class: real model class. + """ + model_type = config.model_type + if model_type == "ErnieSageV2": + return ErnieSageV2Encoder(config, ernie) + else: + raise ValueError("Invalid ernie sage model type") + + def forward(self, *args, **kwargs): + raise NotImplementedError + + +class ErnieSageV2Encoder(Encoder): + def __init__(self, config, ernie): + """Ernie sage v2 encoder + + Args: + config (Dict): all config. + ernie (nn.Layer): the ernie model. + """ + super(ErnieSageV2Encoder, self).__init__(config) + # Don't add ernie to self, oterwise, there will be more copies of ernie weights + # self.ernie = ernie + self.convs = nn.LayerList() + fc_lr = self.config.lr / 0.001 + erniesage_conv = ErnieSageV2Conv( + ernie, + ernie.config["hidden_size"], + self.config.hidden_size, + learning_rate=fc_lr, + cls_token_id=self.config.cls_token_id, + aggr_func="sum", + ) + self.convs.append(erniesage_conv) + for i in range(1, self.config.num_layers): + layer = GraphSageConv( + self.config.hidden_size, self.config.hidden_size, learning_rate=fc_lr, aggr_func="sum" + ) + self.convs.append(layer) + + if self.config.final_fc: + self.linear = nn.Linear( + self.config.hidden_size, self.config.hidden_size, weight_attr=paddle.ParamAttr(learning_rate=fc_lr) + ) + + def take_final_feature(self, feature, index): + """Gather the final feature. + + Args: + feature (Tensor): the total featue tensor. + index (Tensor): the index to gather. + + Returns: + Tensor: final result tensor. + """ + feat = paddle.gather(feature, index) + if self.config.final_fc: + feat = self.linear(feat) + if self.config.final_l2_norm: + feat = F.normalize(feat, axis=1) + return feat + + def forward(self, graphs, term_ids, inputs): + """forward train function of the model. + + Args: + graphs (Graph List): list of graph tensors. + inputs (Tensor List): list of input tensors. + + Returns: + Tensor List: list of final feature tensors. + """ + # term_ids for ErnieSageConv is the raw feature. + feature = term_ids + for i in range(len(graphs), self.config.num_layers): + graphs.append(graphs[0]) + for i in range(0, self.config.num_layers): + if i == self.config.num_layers - 1 and i != 0: + act = None + else: + act = "leaky_relu" + feature = self.convs[i](graphs[i], feature, act) + + final_feats = [self.take_final_feature(feature, x) for x in inputs] + return final_feats diff --git a/examples/text_graph/erniesage/models/loss.py b/examples/text_graph/erniesage/models/loss.py new file mode 100644 index 0000000000000000000000000000000000000000..3648c27821c185a973e983ee7fe4298356cba1ff --- /dev/null +++ b/examples/text_graph/erniesage/models/loss.py @@ -0,0 +1,69 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +def LossFactory(config): + """Choose different type of loss by config + + Args: + config (Dict): config file. 
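+ Supported config.loss_type values are "hinge" (which also reads config.margin) and "softmax_with_cross_entropy".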
+ + Raises: + ValueError: invalid loss type. + + Returns: + Class: the real class object. + """ + loss_type = config.loss_type + if loss_type == "hinge": + return HingeLoss(config.margin) + elif loss_type == "softmax_with_cross_entropy": + return SoftmaxWithCrossEntropy() + else: + raise ValueError("invalid loss type") + + +class SoftmaxWithCrossEntropy(nn.Layer): + """softmax with cross entropy loss""" + + def __init__(self, config): + super(SoftmaxWithCrossEntropy, self).__init__() + + def forward(self, logits, label): + return F.cross_entropy(logits, label, reduction="mean") + + +class HingeLoss(nn.Layer): + """Hinge Loss for the pos and neg.""" + + def __init__(self, margin): + super(HingeLoss, self).__init__() + self.margin = margin + + def forward(self, pos, neg): + """forward function + + Args: + pos (Tensor): pos score. + neg (Tensor): neg score. + + Returns: + Tensor: final hinge loss. + """ + loss = paddle.mean(F.relu(neg - pos + self.margin)) + return loss diff --git a/examples/text_graph/erniesage/models/model.py b/examples/text_graph/erniesage/models/model.py new file mode 100644 index 0000000000000000000000000000000000000000..4884baacc8609350d2871f06de1d5d724b5d3037 --- /dev/null +++ b/examples/text_graph/erniesage/models/model.py @@ -0,0 +1,68 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from models.encoder import Encoder +from models.loss import LossFactory + +from paddlenlp.transformers import ErnieModel, ErniePretrainedModel + +__all__ = ["ErnieSageForLinkPrediction"] + + +class ErnieSageForLinkPrediction(ErniePretrainedModel): + """ErnieSage for link prediction task.""" + + def __init__(self, config, config_file): + """Model which Based on the PaddleNLP PretrainedModel + + Note: + 1. the ernie must be the first argument. + 2. must set self.XX = ernie to load weights. + 3. the self.config keyword is taken by PretrainedModel class. + + Args: + ernie (nn.Layer): the submodule layer of ernie model. + config (Dict): the config file + """ + super(ErnieSageForLinkPrediction, self).__init__(config) + self.config_file = config_file + self.ernie = ErnieModel(config) + self.encoder = Encoder.factory(self.config_file, self.ernie) + self.loss_func = LossFactory(self.config_file) + + def forward(self, graphs, data): + """Forward function of link prediction task. + + Args: + graphs (Graph List): the Graph list. + data (Tensor List): other input of the model. + + Returns: + Tensor: loss and output tensors. 
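+ The outputs list returned together with the loss is [user_feat, pos_item_feat, neg_item_feat, user_real_index, pos_item_real_index].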
+ """ + term_ids, user_index, pos_item_index, neg_item_index, user_real_index, pos_item_real_index = data + # encoder model + outputs = self.encoder(graphs, term_ids, [user_index, pos_item_index, neg_item_index]) + user_feat, pos_item_feat, neg_item_feat = outputs + + # calc loss + if self.config_file.neg_type == "batch_neg": + neg_item_feat = pos_item_feat + + pos = paddle.sum(user_feat * pos_item_feat, -1, keepdim=True) # [B, 1] + neg = paddle.matmul(user_feat, neg_item_feat, transpose_y=True) # [B, B] + loss = self.loss_func(pos, neg) + # return loss, outputs + return loss, outputs + [user_real_index, pos_item_real_index] diff --git a/examples/text_graph/erniesage/preprocessing/dump_graph.py b/examples/text_graph/erniesage/preprocessing/dump_graph.py new file mode 100644 index 0000000000000000000000000000000000000000..d2de5674a63ffde9dec10f21dab352a5367bb36b --- /dev/null +++ b/examples/text_graph/erniesage/preprocessing/dump_graph.py @@ -0,0 +1,154 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import io +import os +from functools import partial +from io import open + +import numpy as np +import pgl +import yaml +from easydict import EasyDict as edict +from pgl.graph_kernel import alias_sample_build_table +from pgl.utils.logger import log + +from paddlenlp.transformers import ErnieTinyTokenizer, ErnieTokenizer + +TOKENIZER_CLASSES = { + "ernie-tiny": ErnieTinyTokenizer, + "ernie-1.0": ErnieTokenizer, +} + + +def term2id(string, tokenizer, max_seqlen): + # string = string.split("\t")[1] + tokens = tokenizer._tokenize(string) + ids = tokenizer.convert_tokens_to_ids(tokens) + ids = ids[: max_seqlen - 1] + ids = ids + [tokenizer.sep_token_id] + ids = ids + [tokenizer.pad_token_id] * (max_seqlen - len(ids)) + return ids + + +def load_graph(config, str2id, term_file, terms, item_distribution): + edges = [] + with io.open(config.graph_data, encoding=config.encoding) as f: + for idx, line in enumerate(f): + if idx % 100000 == 0: + log.info("%s readed %s lines" % (config.graph_data, idx)) + slots = [] + for col_idx, col in enumerate(line.strip("\n").split("\t")): + s = col[: config.max_seqlen] + if s not in str2id: + str2id[s] = len(str2id) + term_file.write(str(col_idx) + "\t" + col + "\n") + item_distribution.append(0) + slots.append(str2id[s]) + + src = slots[0] + dst = slots[1] + edges.append((src, dst)) + edges.append((dst, src)) + item_distribution[dst] += 1 + edges = np.array(edges, dtype="int64") + return edges + + +def load_link_prediction_train_data(config, str2id, term_file, terms, item_distribution): + train_data = [] + neg_samples = [] + with io.open(config.train_data, encoding=config.encoding) as f: + for idx, line in enumerate(f): + if idx % 100000 == 0: + log.info("%s readed %s lines" % (config.train_data, idx)) + slots = [] + for col_idx, col in enumerate(line.strip("\n").split("\t")): + s = col[: config.max_seqlen] + if s not in str2id: + str2id[s] = len(str2id) + term_file.write(str(col_idx) 
+ "\t" + col + "\n") + item_distribution.append(0) + slots.append(str2id[s]) + + src = slots[0] + dst = slots[1] + neg_samples.append(slots[2:]) + train_data.append((src, dst)) + train_data = np.array(train_data, dtype="int64") + np.save(os.path.join(config.graph_work_path, "train_data.npy"), train_data) + if len(neg_samples) != 0: + np.save(os.path.join(config.graph_work_path, "neg_samples.npy"), np.array(neg_samples)) + + +def dump_graph(config): + if not os.path.exists(config.graph_work_path): + os.makedirs(config.graph_work_path) + str2id = dict() + term_file = io.open(os.path.join(config.graph_work_path, "terms.txt"), "w", encoding=config.encoding) + terms = [] + item_distribution = [] + + edges = load_graph(config, str2id, term_file, terms, item_distribution) + if config.task == "link_prediction": + load_link_prediction_train_data(config, str2id, term_file, terms, item_distribution) + else: + raise ValueError + + term_file.close() + num_nodes = len(str2id) + str2id.clear() + + log.info("Building graph...") + graph = pgl.graph.Graph(num_nodes=num_nodes, edges=edges) + # indegree = graph.indegree() + graph.indegree() + graph.outdegree() + graph.dump(config.graph_work_path) + + # dump alias sample table + item_distribution = np.array(item_distribution) + item_distribution = np.sqrt(item_distribution) + distribution = 1.0 * item_distribution / item_distribution.sum() + alias, events = alias_sample_build_table(distribution) + np.save(os.path.join(config.graph_work_path, "alias.npy"), alias) + np.save(os.path.join(config.graph_work_path, "events.npy"), events) + log.info("End Build Graph") + + +def dump_node_feat(config): + log.info("Dump node feat starting...") + id2str = [ + line.strip("\n").split("\t")[-1] + for line in io.open(os.path.join(config.graph_work_path, "terms.txt"), encoding=config.encoding) + ] + # pool = multiprocessing.Pool() + + tokenizer_class = TOKENIZER_CLASSES[config.model_name_or_path] + tokenizer = tokenizer_class.from_pretrained(config.model_name_or_path) + fn = partial(term2id, tokenizer=tokenizer, max_seqlen=config.max_seqlen) + term_ids = [fn(x) for x in id2str] + + np.save(os.path.join(config.graph_work_path, "term_ids.npy"), np.array(term_ids, np.uint16)) + log.info("Dump node feat done.") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="main") + parser.add_argument("--conf", type=str, default="./config.yaml") + args = parser.parse_args() + config = edict(yaml.load(open(args.conf), Loader=yaml.FullLoader)) + dump_graph(config) + dump_node_feat(config) diff --git a/examples/text_matching/README.md b/examples/text_matching/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5f82c98c009f2aa271e01ce75f61dcd051a5cc57 --- /dev/null +++ b/examples/text_matching/README.md @@ -0,0 +1,31 @@ +# 文本匹配 + +**文本匹配一直是自然语言处理(NLP)领域一个基础且重要的方向,一般研究两段文本之间的关系。文本相似度计算、自然语言推理、问答系统、信息检索等,都可以看作针对不同数据和场景的文本匹配应用。这些自然语言处理任务在很大程度上都可以抽象成文本匹配问题,比如信息检索可以归结为搜索词和文档资源的匹配,问答系统可以归结为问题和候选答案的匹配,复述问题可以归结为两个同义句的匹配。** + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/1d24ea95d560465995515f8a3040202b092b07c6d03e4501b64a16dce01a1bbe" hspace='10'/> <br /> +</p> + + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/ff58769b237444b89bde5fec9d7215e02825b7d1f2864269986f1daa01b9f497" hspace='10'/> <br /> +</p> + + +文本匹配任务数据每一个样本通常由两个文本组成(query,title)。类别形式为 0 或 1,0 表示 query 与 title 不匹配; 1 表示匹配。 + +本项目包含面向搜索、推荐系统排序模块、召回模块的常规解决方案,具体如下: +- 基于单塔 Point-wise 范式的语义匹配模型 
[ernie_matching](./ernie_matching/train_pointwise.py): 模型精度高、计算复杂度高, 适合直接进行语义匹配 2 分类的应用场景。 +- 基于单塔 Pair-wise 范式的语义匹配模型 [ernie_matching](./ernie_matching/train_pairwise.py): 模型精度高、计算复杂度高, 对文本相似度大小的`序关系`建模能力更强,适合将相似度特征作为上层排序模块输入特征的应用场景。 +- 基于双塔 Point-wise 范式的语义匹配模型 [SimNet](./simnet) 和 [Sentence Transformers](./sentence_transformers), 这 2 种方案计算效率更高,适合对延时要求高、根据语义相似度进行粗排的应用场景。 + +## ernie_matching +[ernie_matching](./ernie_matching) 展示了基于预训练模型 ERNIE-Gram 训练单塔 Point-wise & Pair-wise 语义匹配模型。 + +## SimNet + +[SimNet](./simnet) 展示了如何使用CNN、LSTM、GRU等网络完成文本匹配任务。 + +## Sentence Transformers + +[Sentence Transformers](./sentence_transformers) 展示了如何使用以 ERNIE 为代表的模型Fine-tune完成文本匹配任务。 diff --git a/examples/text_matching/diffcse/README.md b/examples/text_matching/diffcse/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d08bd3bbe6d01cacbe6f85b0ff7c1b9a89d2182c --- /dev/null +++ b/examples/text_matching/diffcse/README.md @@ -0,0 +1,169 @@ +# 无监督语义匹配模型 [DiffCSE](https://arxiv.org/pdf/2204.10298.pdf) + +借鉴 [DiffCSE](https://arxiv.org/pdf/2204.10298.pdf) 的思路,实现了 DiffCSE 模型。相比于 SimCSE 模型,DiffCSE模型会更关注语句之间的差异性,具有精确的向量表示能力。DiffCSE 模型同样适合缺乏监督数据,但是又有大量无监督数据的匹配和检索场景。 + +## 快速开始 +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +``` +DiffCSE/ +├── model.py # DiffCSE 模型组网代码 +├── custom_ernie.py # 为适配 DiffCSE 模型,对ERNIE模型进行了部分修改 +├── data.py # 无监督语义匹配训练数据、测试数据的读取逻辑 +├── run_diffcse.py # 模型训练、评估、预测的主脚本 +├── utils.py # 包括一些常用的工具式函数 +├── run_train.sh # 模型训练的脚本 +├── run_eval.sh # 模型评估的脚本 +└── run_infer.sh # 模型预测的脚本 +``` + +### 模型训练 +默认使用无监督模式进行训练 DiffCSE,模型训练数据的数据样例如下所示,每行表示一条训练样本: +```shell +全年地方财政总收入3686.81亿元,比上年增长12.3%。 +“我对案情并不十分清楚,所以没办法提出批评,建议,只能希望通过质询,要求检察院对此做出说明。”他说。 +据调查结果显示:2015年微商行业总体市场规模达到1819.5亿元,预计2016年将达到3607.3亿元,增长率为98.3%。 +前往冈仁波齐需要办理目的地包含日喀则和阿里地区的边防证,外转沿途有一些补给点,可购买到干粮和饮料。 +``` + +可以运行如下命令,开始模型训练并且进行模型测试。 + +```shell +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_train" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "train" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --generator_name "ernie-3.0-base-zh" \ + --discriminator_name "ernie-3.0-base-zh" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --train_set_file "your train_set path" \ + --eval_set_file "your dev_set path" \ + --save_dir "./checkpoints" \ + --log_dir ${log_dir} \ + --save_steps "50000" \ + --eval_steps "1000" \ + --epochs "3" \ + --batch_size "32" \ + --mlm_probability "0.15" \ + --lambda_weight "0.15" \ + --learning_rate "3e-5" \ + --weight_decay "0.01" \ + --warmup_proportion "0.01" \ + --seed "0" \ + --device "gpu" +``` + +可支持配置的参数: +* `mode`:可选,用于指明本次运行是模型训练、模型评估还是模型预测,仅支持[train, eval, infer]三种模式;默认为 infer。 +* `encoder_name`:可选,DiffCSE模型中用于向量抽取的模型名称;默认为 ernie-3.0-base-zh。 +* `generator_name`: 可选,DiffCSE模型中生成器的模型名称;默认为 ernie-3.0-base-zh。 +* `discriminator_name`: 可选,DiffCSE模型中判别器的模型名称;默认为 rocketqa-zh-dureader-query-encoder。 +* `max_seq_length`:可选,ERNIE-Gram 模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `output_emb_size`:可选,向量抽取模型输出向量的维度;默认为32。 +* `train_set_file`:可选,用于指定训练集的路径。 +* `eval_set_file`:可选,用于指定验证集的路径。 +* `save_dir`:可选,保存训练模型的目录; +* `log_dir`:可选,训练训练过程中日志的输出目录; +* `save_steps`:可选,用于指定模型训练过程中每隔多少 step 保存一次模型。 +* `eval_steps`:可选,用于指定模型训练过程中每隔多少 step,使用验证集评估一次模型。 +* `epochs`: 模型训练轮次,默认为3。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `mlm_probability`:可选,利用生成器预测时,控制单词掩码的比例,默认为0.15。 +* `lambda_weight`:可选,控制RTD任务loss的占比,默认为0.15。 +* `learning_rate`:可选,Fine-tune 的最大学习率;默认为5e-5。 +* 
`weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.01。 +* `warmup_proportion`:可选,学习率 warmup 策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到 learning_rate, 而后再缓慢衰减,默认为0.01。 +* `seed`:可选,随机种子,默认为1000. +* `device`: 选用什么设备进行训练,可选 cpu 或 gpu。如使用 gpu 训练则参数 gpus 指定GPU卡号。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── best +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   ├── special_tokens_map.json +│   └── vocab.txt +└── ... +``` + +### 模型评估 +在模型评估时,需要使用带有标签的数据,以下展示了几条模型评估数据样例,每行表示一条训练样本,每行共计包含3列,分别是query1, query2, label: +```shell +右键单击此电脑选择属性,如下图所示 右键单击此电脑选择属性,如下图所示 5 +好医生解密||是什么,让美洲大蠊能美容还能救命 解密美洲大蠊巨大药用价值 1 +蒜香蜜汁烤鸡翅的做法 外香里嫩一口爆汁蒜蓉蜜汁烤鸡翅的做法 3 +项目计划书 篇2 简易项目计划书(参考模板) 2 +夏天幼儿园如何正确使用空调? 老师们该如何正确使用空调,让孩子少生病呢? 3 +``` + + +可以运行如下命令,进行模型评估。 + +```shell +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_eval" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "eval" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --eval_set_file "your dev_set path" \ + --ckpt_dir "./checkpoints/best" \ + --batch_size "32" \ + --seed "0" \ + --device "gpu" +``` +可支持配置的参数: +* `ckpt_dir`: 用于指定进行模型评估的checkpoint路径。 + +其他参数解释同上。 + +### 基于动态图模型预测 +在模型预测时,需要给定待预测的两条文本,以下展示了几条模型预测的数据样例,每行表示一条训练样本,每行共计包含2列,分别是query1, query2: +```shell +韩国现代摩比斯2015招聘 韩国现代摩比斯2015校园招聘信息 +《DNF》封号减刑方法 被封一年怎么办? DNF封号减刑方法 封号一年怎么减刑 +原神手鞠游戏三个刷新位置一览 手鞠游戏三个刷新位置一览 +``` + +可以运行如下命令,进行模型预测: +```shell +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_infer" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "infer" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --infer_set_file "your test_set path \ + --ckpt_dir "./checkpoints/best" \ + --save_infer_path "./infer_result.txt" \ + --batch_size "32" \ + --seed "0" \ + --device "gpu" +``` + +可支持配置的参数: +* `infer_set_file`: 可选,用于指定测试集的路径。 +* `save_infer_path`: 可选,用于保存模型预测结果的文件路径。 + +其他参数解释同上。 待模型预测结束后,会将结果保存至save_infer_path参数指定的文件中。 + + +## Reference +[1] Chuang Y S , Dangovski R , Luo H , et al. DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings[J]. arXiv e-prints, 2022. https://arxiv.org/pdf/2204.10298.pdf. diff --git a/examples/text_matching/diffcse/custom_ernie.py b/examples/text_matching/diffcse/custom_ernie.py new file mode 100644 index 0000000000000000000000000000000000000000..2ba36f3bb3be08577a22c5e6b1f465df672b62d9 --- /dev/null +++ b/examples/text_matching/diffcse/custom_ernie.py @@ -0,0 +1,1082 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
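+# NOTE: this module is a local copy of the PaddleNLP ERNIE implementation
+# (paddlenlp.transformers), with partial modifications made to fit the DiffCSE
+# model, as noted in the example README.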
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +from paddlenlp.transformers import PretrainedModel, register_base_model + +__all__ = [ + "ErnieModel", + "ErniePretrainedModel", + "ErnieForSequenceClassification", + "ErnieForTokenClassification", + "ErnieForQuestionAnswering", + "ErnieForPretraining", + "ErniePretrainingCriterion", + "ErnieForMaskedLM", + "ErnieForMultipleChoice", +] + + +class ErnieEmbeddings(nn.Layer): + r""" + Include embeddings from word, position and token_type embeddings. + """ + + def __init__( + self, + vocab_size, + hidden_size=768, + hidden_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + pad_token_id=0, + weight_attr=None, + task_type_vocab_size=3, + task_id=0, + use_task_id=False, + ): + super(ErnieEmbeddings, self).__init__() + + self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id, weight_attr=weight_attr) + self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size, weight_attr=weight_attr) + self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size, weight_attr=weight_attr) + self.use_task_id = use_task_id + self.task_id = task_id + if self.use_task_id: + self.task_type_embeddings = nn.Embedding(task_type_vocab_size, hidden_size, weight_attr=weight_attr) + self.layer_norm = nn.LayerNorm(hidden_size) + self.dropout = nn.Dropout(hidden_dropout_prob) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, task_type_ids=None): + if position_ids is None: + # maybe need use shape op to unify static graph and dynamic graph + # seq_length = input_ids.shape[1] + ones = paddle.ones_like(input_ids, dtype="int64") + seq_length = paddle.cumsum(ones, axis=1) + position_ids = seq_length - ones + position_ids.stop_gradient = True + if token_type_ids is None: + token_type_ids = paddle.zeros_like(input_ids, dtype="int64") + input_embedings = self.word_embeddings(input_ids) + position_embeddings = self.position_embeddings(position_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = input_embedings + position_embeddings + token_type_embeddings + if self.use_task_id: + if task_type_ids is None: + task_type_ids = paddle.ones_like(input_ids, dtype="int64") * self.task_id + task_type_embeddings = self.task_type_embeddings(task_type_ids) + embeddings = embeddings + task_type_embeddings + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class ErniePooler(nn.Layer): + def __init__(self, hidden_size, weight_attr=None): + super(ErniePooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size, weight_attr=weight_attr) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class ErniePretrainedModel(PretrainedModel): + r""" + An abstract class for pretrained ERNIE models. It provides ERNIE related + `model_config_file`, `pretrained_init_configuration`, `resource_files_names`, + `pretrained_resource_files_map`, `base_model_prefix` for downloading and + loading pretrained models. + Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
+ """ + + model_config_file = "model_config.json" + pretrained_init_configuration = { + # Deprecated, alias for ernie-1.0-base-zh + "ernie-1.0": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "ernie-1.0-base-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "ernie-1.0-large-zh-cw": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 3072, # it is 3072 instead of 4096 + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "ernie-tiny": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "intermediate_size": 4096, + "max_position_embeddings": 600, + "num_attention_heads": 16, + "num_hidden_layers": 3, + "type_vocab_size": 2, + "vocab_size": 50006, + "pad_token_id": 0, + }, + "ernie-2.0-base-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 18000, + }, + "ernie-2.0-large-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "intermediate_size": 4096, # special for large model + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 4, + "vocab_size": 12800, + }, + "ernie-2.0-base-en": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "ernie-2.0-base-en-finetuned-squad": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "ernie-2.0-large-en": { + "attention_probs_dropout_prob": 0.1, + "intermediate_size": 4096, # special for ernie-2.0-large-en + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 1024, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 16, + "num_hidden_layers": 24, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "rocketqa-zh-dureader-query-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + 
"num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "rocketqa-zh-dureader-para-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "rocketqa-v1-marco-query-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "rocketqa-v1-marco-para-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "rocketqa-zh-dureader-cross-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "relu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 513, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 2, + "vocab_size": 18000, + "pad_token_id": 0, + }, + "rocketqa-v1-marco-cross-encoder": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "type_vocab_size": 4, + "vocab_size": 30522, + "pad_token_id": 0, + }, + "ernie-3.0-xbase-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "intermediate_size": 4096, # special for large model + "hidden_size": 1024, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 16, + "num_hidden_layers": 20, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + "ernie-3.0-base-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 12, + "task_type_vocab_size": 3, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + "ernie-3.0-medium-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 768, + "intermediate_size": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 6, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + "ernie-3.0-mini-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 384, + "intermediate_size": 1536, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 6, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + "ernie-3.0-micro-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 384, + 
"intermediate_size": 1536, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 4, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + "ernie-3.0-nano-zh": { + "attention_probs_dropout_prob": 0.1, + "hidden_act": "gelu", + "hidden_dropout_prob": 0.1, + "hidden_size": 312, + "intermediate_size": 1248, + "initializer_range": 0.02, + "max_position_embeddings": 2048, + "num_attention_heads": 12, + "num_hidden_layers": 4, + "task_type_vocab_size": 16, + "type_vocab_size": 4, + "use_task_id": True, + "vocab_size": 40000, + }, + } + resource_files_names = {"model_state": "model_state.pdparams"} + pretrained_resource_files_map = { + "model_state": { + # Deprecated, alias for ernie-1.0-base-zh + "ernie-1.0": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/ernie_v1_chn_base.pdparams", + "ernie-1.0-base-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/ernie_v1_chn_base.pdparams", + "ernie-1.0-large-zh-cw": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/ernie_1.0_large_zh_cw.pdparams", + "ernie-tiny": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_tiny/ernie_tiny.pdparams", + "ernie-2.0-base-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_2.0/ernie_2.0_base_zh.pdparams", + "ernie-2.0-large-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_2.0/ernie_2.0_large_zh.pdparams", + "ernie-2.0-base-en": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_v2_base/ernie_v2_eng_base.pdparams", + "ernie-2.0-base-en-finetuned-squad": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_v2_base/ernie_v2_eng_base_finetuned_squad.pdparams", + "ernie-2.0-large-en": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_v2_large/ernie_v2_eng_large.pdparams", + "rocketqa-zh-dureader-query-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_query_encoder.pdparams", + "rocketqa-zh-dureader-para-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_para_encoder.pdparams", + "rocketqa-v1-marco-query-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_v1_marco_query_encoder.pdparams", + "rocketqa-v1-marco-para-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_v1_marco_para_encoder.pdparams", + "rocketqa-zh-dureader-cross-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_zh_dureader_cross_encoder.pdparams", + "rocketqa-v1-marco-cross-encoder": "https://bj.bcebos.com/paddlenlp/models/transformers/rocketqa/rocketqa_v1_marco_cross_encoder.pdparams", + "ernie-3.0-base-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams", + "ernie-3.0-xbase-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_xbase_zh.pdparams", + "ernie-3.0-medium-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh.pdparams", + "ernie-3.0-mini-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_mini_zh.pdparams", + "ernie-3.0-micro-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_micro_zh.pdparams", + "ernie-3.0-nano-zh": "https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_nano_zh.pdparams", + } + } + base_model_prefix = "ernie" + + def _init_weights(self, layer): + """Initialization hook""" + if 
isinstance(layer, (nn.Linear, nn.Embedding)): + # only support dygraph, use truncated_normal and make it inplace + # and configurable later + if isinstance(layer.weight, paddle.Tensor): + layer.weight.set_value( + paddle.tensor.normal( + mean=0.0, + std=self.initializer_range + if hasattr(self, "initializer_range") + else self.ernie.config["initializer_range"], + shape=layer.weight.shape, + ) + ) + elif isinstance(layer, nn.LayerNorm): + layer._epsilon = 1e-12 + + +@register_base_model +class ErnieModel(ErniePretrainedModel): + r""" + The bare ERNIE Model transformer outputting raw hidden-states. + This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. + Refer to the superclass documentation for the generic methods. + This model is also a Paddle `paddle.nn.Layer <https://www.paddlepaddle.org.cn/documentation + /docs/zh/api/paddle/nn/Layer_cn.html>`__ subclass. Use it as a regular Paddle Layer + and refer to the Paddle documentation for all matter related to general usage and behavior. + Args: + vocab_size (int): + Vocabulary size of `inputs_ids` in `ErnieModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `ErnieModel`. + hidden_size (int, optional): + Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. + num_hidden_layers (int, optional): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + num_attention_heads (int, optional): + Number of attention heads for each attention layer in the Transformer encoder. + Defaults to `12`. + intermediate_size (int, optional): + Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors + to ff layers are firstly projected from `hidden_size` to `intermediate_size`, + and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. + Defaults to `3072`. + hidden_act (str, optional): + The non-linear activation function in the feed-forward layer. + ``"gelu"``, ``"relu"`` and any other paddle supported activation functions + are supported. Defaults to `"gelu"`. + hidden_dropout_prob (float, optional): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float, optional): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + max_position_embeddings (int, optional): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + type_vocab_size (int, optional): + The vocabulary size of the `token_type_ids`. + Defaults to `2`. + initializer_range (float, optional): + The standard deviation of the normal initializer for initializing all weight matrices. + Defaults to `0.02`. + + .. note:: + A normal_initializer initializes weight matrices as normal distributions. + See :meth:`ErniePretrainedModel._init_weights()` for how weights are initialized in `ErnieModel`. + pad_token_id(int, optional): + The index of padding token in the token vocabulary. + Defaults to `0`. 
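+        task_type_vocab_size (int, optional):
+            The vocabulary size of `task_type_ids` used by the ERNIE 3.0 style
+            task embeddings. Only takes effect when `use_task_id` is True.
+            Defaults to `3`.
+        task_id (int, optional):
+            The task id used to build the default `task_type_ids`.
+            Defaults to `0`.
+        use_task_id (bool, optional):
+            Whether to add task type embeddings on top of the word, position and
+            token type embeddings. Defaults to `False`.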
+ """ + + def __init__( + self, + vocab_size, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + initializer_range=0.02, + pad_token_id=0, + task_type_vocab_size=3, + task_id=0, + use_task_id=False, + ): + super(ErnieModel, self).__init__() + self.pad_token_id = pad_token_id + self.initializer_range = initializer_range + weight_attr = paddle.ParamAttr( + initializer=nn.initializer.TruncatedNormal(mean=0.0, std=self.initializer_range) + ) + self.embeddings = ErnieEmbeddings( + vocab_size, + hidden_size, + hidden_dropout_prob, + max_position_embeddings, + type_vocab_size, + pad_token_id, + weight_attr, + task_type_vocab_size, + task_id, + use_task_id, + ) + encoder_layer = nn.TransformerEncoderLayer( + hidden_size, + num_attention_heads, + intermediate_size, + dropout=hidden_dropout_prob, + activation=hidden_act, + attn_dropout=attention_probs_dropout_prob, + act_dropout=0, + weight_attr=weight_attr, + normalize_before=False, + ) + self.encoder = nn.TransformerEncoder(encoder_layer, num_hidden_layers) + self.pooler = ErniePooler(hidden_size, weight_attr) + + def forward( + self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + task_type_ids=None, + cls_input=None, + ): + r""" + Args: + input_ids (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + It's data type should be `int64` and has a shape of [batch_size, sequence_length]. + token_type_ids (Tensor, optional): + Segment token indices to indicate different portions of the inputs. + Selected in the range ``[0, type_vocab_size - 1]``. + If `type_vocab_size` is 2, which means the inputs have two portions. + Indices can either be 0 or 1: + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + Its data type should be `int64` and it has a shape of [batch_size, sequence_length]. + Defaults to `None`, which means we don't add segment embeddings. + position_ids (Tensor, optional): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + max_position_embeddings - 1]``. + Shape as `[batch_size, num_tokens]` and dtype as int64. Defaults to `None`. + attention_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention on to some unwanted positions, + usually the paddings or the subsequent positions. + Its data type can be int, float and bool. + When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. + When the data type is int, the `masked` tokens have `0` values and the others have `1` values. + When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. + It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. + For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], + [batch_size, num_attention_heads, sequence_length, sequence_length]. + We use whole-word-mask in ERNIE, so the whole word will have the same value. For example, "使用" as a word, + "使" and "用" will have the same value. + Defaults to `None`, which means nothing needed to be prevented attention to. + Returns: + tuple: Returns tuple (``sequence_output``, ``pooled_output``). 
+ With the fields: + - `sequence_output` (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. + - `pooled_output` (Tensor): + The output of first token (`[CLS]`) in sequence. + We "pool" the model by simply taking the hidden state corresponding to the first token. + Its data type should be float32 and its shape is [batch_size, hidden_size]. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieModel, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieModel.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + sequence_output, pooled_output = model(**inputs) + """ + if attention_mask is None: + attention_mask = paddle.unsqueeze( + (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e4, axis=[1, 2] + ) + # For 2D attention_mask from tokenizer + elif attention_mask.ndim == 2: + attention_mask = paddle.unsqueeze(attention_mask, axis=[1, 2]).astype(paddle.get_default_dtype()) + attention_mask = (1.0 - attention_mask) * -1e4 + attention_mask.stop_gradient = True + + embedding_output = self.embeddings( + input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, task_type_ids=task_type_ids + ) + + if cls_input is not None: + embedding_output = paddle.concat([cls_input.unsqueeze(1), embedding_output[:, 1:, :]], axis=1) + + encoder_outputs = self.encoder(embedding_output, attention_mask) + sequence_output = encoder_outputs + pooled_output = self.pooler(sequence_output) + return sequence_output, pooled_output + + +class ErnieForSequenceClassification(ErniePretrainedModel): + r""" + Ernie Model with a linear layer on top of the output layer, + designed for sequence classification/regression tasks like GLUE tasks. + Args: + ernie (ErnieModel): + An instance of `paddlenlp.transformers.ErnieModel`. + num_classes (int, optional): + The number of classes. Default to `2`. + dropout (float, optional): + The dropout probability for output of ERNIE. + If None, use the same value as `hidden_dropout_prob` + of `paddlenlp.transformers.ErnieModel` instance. Defaults to `None`. + """ + + def __init__(self, ernie, num_classes=2, dropout=None): + super(ErnieForSequenceClassification, self).__init__() + self.num_classes = num_classes + self.ernie = ernie # allow ernie to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.ernie.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + Tensor: Returns tensor `logits`, a tensor of the input text classification logits. + Shape as `[batch_size, num_classes]` and dtype as float32. + Example: + .. 
code-block:: + import paddle + from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForSequenceClassification.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + """ + _, pooled_output = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + return logits + + +class ErnieForQuestionAnswering(ErniePretrainedModel): + """ + Ernie Model with a linear layer on top of the hidden-states + output to compute `span_start_logits` and `span_end_logits`, + designed for question-answering tasks like SQuAD. + Args: + ernie (`ErnieModel`): + An instance of `ErnieModel`. + """ + + def __init__(self, ernie): + super(ErnieForQuestionAnswering, self).__init__() + self.ernie = ernie # allow ernie to be config + self.classifier = nn.Linear(self.ernie.config["hidden_size"], 2) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + With the fields: + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieForQuestionAnswering, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForQuestionAnswering.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + """ + + sequence_output, _ = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + logits = self.classifier(sequence_output) + logits = paddle.transpose(logits, perm=[2, 0, 1]) + start_logits, end_logits = paddle.unstack(x=logits, axis=0) + + return start_logits, end_logits + + +class ErnieForTokenClassification(ErniePretrainedModel): + r""" + ERNIE Model with a linear layer on top of the hidden-states output layer, + designed for token classification tasks like NER tasks. + Args: + ernie (`ErnieModel`): + An instance of `ErnieModel`. + num_classes (int, optional): + The number of classes. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of ERNIE. + If None, use the same value as `hidden_dropout_prob` + of `ErnieModel` instance `ernie`. Defaults to `None`. 
+ """ + + def __init__(self, ernie, num_classes=2, dropout=None): + super(ErnieForTokenClassification, self).__init__() + self.num_classes = num_classes + self.ernie = ernie # allow ernie to be config + self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.ernie.config["hidden_size"], num_classes) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + Tensor: Returns tensor `logits`, a tensor of the input token classification logits. + Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForTokenClassification.from_pretrained('ernie-1.0') + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + """ + sequence_output, _ = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + return logits + + +class ErnieLMPredictionHead(nn.Layer): + r""" + Ernie Model with a `language modeling` head on top. + """ + + def __init__( + self, + hidden_size, + vocab_size, + activation, + embedding_weights=None, + weight_attr=None, + ): + super(ErnieLMPredictionHead, self).__init__() + + self.transform = nn.Linear(hidden_size, hidden_size, weight_attr=weight_attr) + self.activation = getattr(nn.functional, activation) + self.layer_norm = nn.LayerNorm(hidden_size) + self.decoder_weight = ( + self.create_parameter( + shape=[vocab_size, hidden_size], dtype=self.transform.weight.dtype, attr=weight_attr, is_bias=False + ) + if embedding_weights is None + else embedding_weights + ) + self.decoder_bias = self.create_parameter(shape=[vocab_size], dtype=self.decoder_weight.dtype, is_bias=True) + + def forward(self, hidden_states, masked_positions=None): + if masked_positions is not None: + hidden_states = paddle.reshape(hidden_states, [-1, hidden_states.shape[-1]]) + hidden_states = paddle.tensor.gather(hidden_states, masked_positions) + # gather masked tokens might be more quick + hidden_states = self.transform(hidden_states) + hidden_states = self.activation(hidden_states) + hidden_states = self.layer_norm(hidden_states) + hidden_states = paddle.tensor.matmul(hidden_states, self.decoder_weight, transpose_y=True) + self.decoder_bias + return hidden_states + + +class ErniePretrainingHeads(nn.Layer): + def __init__( + self, + hidden_size, + vocab_size, + activation, + embedding_weights=None, + weight_attr=None, + ): + super(ErniePretrainingHeads, self).__init__() + self.predictions = ErnieLMPredictionHead(hidden_size, vocab_size, activation, embedding_weights, weight_attr) + self.seq_relationship = nn.Linear(hidden_size, 2, weight_attr=weight_attr) + + def forward(self, sequence_output, pooled_output, masked_positions=None): + prediction_scores = self.predictions(sequence_output, masked_positions) + seq_relationship_score = 
self.seq_relationship(pooled_output) + return prediction_scores, seq_relationship_score + + +class ErnieForPretraining(ErniePretrainedModel): + r""" + Ernie Model with a `masked language modeling` head and a `sentence order prediction` head + on top. + """ + + def __init__(self, ernie): + super(ErnieForPretraining, self).__init__() + self.ernie = ernie + weight_attr = paddle.ParamAttr( + initializer=nn.initializer.TruncatedNormal(mean=0.0, std=self.ernie.initializer_range) + ) + self.cls = ErniePretrainingHeads( + self.ernie.config["hidden_size"], + self.ernie.config["vocab_size"], + self.ernie.config["hidden_act"], + embedding_weights=self.ernie.embeddings.word_embeddings.weight, + weight_attr=weight_attr, + ) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, masked_positions=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + tuple: Returns tuple (``prediction_scores``, ``seq_relationship_score``). + With the fields: + - `prediction_scores` (Tensor): + The scores of masked token prediction. Its data type should be float32. + If `masked_positions` is None, its shape is [batch_size, sequence_length, vocab_size]. + Otherwise, its shape is [batch_size, mask_token_num, vocab_size]. + - `seq_relationship_score` (Tensor): + The scores of next sentence prediction. + Its data type should be float32 and its shape is [batch_size, 2]. + """ + with paddle.static.amp.fp16_guard(): + outputs = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + sequence_output, pooled_output = outputs[:2] + prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output, masked_positions) + return prediction_scores, seq_relationship_score + + +class ErniePretrainingCriterion(paddle.nn.Layer): + r""" + The loss output of Ernie Model during the pretraining: + a `masked language modeling` head and a `next sentence prediction (classification)` head. + """ + + def __init__(self, with_nsp_loss=True): + super(ErniePretrainingCriterion, self).__init__() + self.with_nsp_loss = with_nsp_loss + # self.loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=-1) + + def forward(self, prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels=None): + """ + Args: + prediction_scores(Tensor): + The scores of masked token prediction. Its data type should be float32. + If `masked_positions` is None, its shape is [batch_size, sequence_length, vocab_size]. + Otherwise, its shape is [batch_size, mask_token_num, vocab_size] + seq_relationship_score(Tensor): + The scores of next sentence prediction. Its data type should be float32 and + its shape is [batch_size, 2] + masked_lm_labels(Tensor): + The labels of the masked language modeling, its dimensionality is equal to `prediction_scores`. + Its data type should be int64. If `masked_positions` is None, its shape is [batch_size, sequence_length, 1]. + Otherwise, its shape is [batch_size, mask_token_num, 1] + next_sentence_labels(Tensor): + The labels of the next sentence prediction task, the dimensionality of `next_sentence_labels` + is equal to `seq_relation_labels`. 
Its data type should be int64 and + its shape is [batch_size, 1] + Returns: + Tensor: The pretraining loss, equals to the sum of `masked_lm_loss` plus the mean of `next_sentence_loss`. + Its data type should be float32 and its shape is [1]. + """ + + with paddle.static.amp.fp16_guard(): + masked_lm_loss = F.cross_entropy(prediction_scores, masked_lm_labels, ignore_index=-1, reduction="none") + + if not self.with_nsp_loss: + return paddle.mean(masked_lm_loss) + + next_sentence_loss = F.cross_entropy(seq_relationship_score, next_sentence_labels, reduction="none") + return paddle.mean(masked_lm_loss), paddle.mean(next_sentence_loss) + + +class ErnieOnlyMLMHead(nn.Layer): + def __init__(self, hidden_size, vocab_size, activation, embedding_weights): + super().__init__() + self.predictions = ErnieLMPredictionHead( + hidden_size=hidden_size, vocab_size=vocab_size, activation=activation, embedding_weights=embedding_weights + ) + + def forward(self, sequence_output, masked_positions=None): + prediction_scores = self.predictions(sequence_output, masked_positions) + return prediction_scores + + +class ErnieForMaskedLM(ErniePretrainedModel): + """ + Ernie Model with a `masked language modeling` head on top. + Args: + ernie (:class:`ErnieModel`): + An instance of :class:`ErnieModel`. + """ + + def __init__(self, ernie): + super(ErnieForMaskedLM, self).__init__() + self.ernie = ernie + self.cls = ErnieOnlyMLMHead( + self.ernie.config["hidden_size"], + self.ernie.config["vocab_size"], + self.ernie.config["hidden_act"], + embedding_weights=self.ernie.embeddings.word_embeddings.weight, + ) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + Args: + input_ids (Tensor): + See :class:`ErnieModel`. + token_type_ids (Tensor, optional): + See :class:`ErnieModel`. + position_ids (Tensor, optional): + See :class:`ErnieModel`. + attention_mask (Tensor, optional): + See :class:`ErnieModel`. + Returns: + Tensor: Returns tensor `prediction_scores`, The scores of masked token prediction. + Its data type should be float32 and shape is [batch_size, sequence_length, vocab_size]. + Example: + .. code-block:: + import paddle + from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer + tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0') + model = ErnieForMaskedLM.from_pretrained('ernie-1.0') + + inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") + inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} + logits = model(**inputs) + print(logits.shape) + # [1, 17, 18000] + """ + + outputs = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + sequence_output = outputs[0] + prediction_scores = self.cls(sequence_output, masked_positions=None) + return prediction_scores + + +class ErnieForMultipleChoice(ErniePretrainedModel): + """ + Ernie Model with a linear layer on top of the hidden-states output layer, + designed for multiple choice tasks like RocStories/SWAG tasks. + + Args: + ernie (:class:`ErnieModel`): + An instance of ErnieModel. + num_choices (int, optional): + The number of choices. Defaults to `2`. + dropout (float, optional): + The dropout probability for output of Ernie. + If None, use the same value as `hidden_dropout_prob` of `ErnieModel` + instance `ernie`. Defaults to None. 
+ """ + + def __init__(self, ernie, num_choices=2, dropout=None): + super(ErnieForMultipleChoice, self).__init__() + self.num_choices = num_choices + self.ernie = ernie + self.dropout = nn.Dropout(dropout if dropout is not None else self.ernie.config["hidden_dropout_prob"]) + self.classifier = nn.Linear(self.ernie.config["hidden_size"], 1) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + r""" + The ErnieForMultipleChoice forward method, overrides the __call__() special method. + Args: + input_ids (Tensor): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + token_type_ids(Tensor, optional): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + position_ids(Tensor, optional): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + attention_mask (list, optional): + See :class:`ErnieModel` and shape as [batch_size, num_choice, sequence_length]. + Returns: + Tensor: Returns tensor `reshaped_logits`, a tensor of the multiple choice classification logits. + Shape as `[batch_size, num_choice]` and dtype as `float32`. + """ + # input_ids: [bs, num_choice, seq_l] + input_ids = input_ids.reshape(shape=(-1, input_ids.shape[-1])) # flat_input_ids: [bs*num_choice,seq_l] + + if position_ids is not None: + position_ids = position_ids.reshape(shape=(-1, position_ids.shape[-1])) + if token_type_ids is not None: + token_type_ids = token_type_ids.reshape(shape=(-1, token_type_ids.shape[-1])) + + if attention_mask is not None: + attention_mask = attention_mask.reshape(shape=(-1, attention_mask.shape[-1])) + + _, pooled_output = self.ernie( + input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask + ) + pooled_output = self.dropout(pooled_output) + + logits = self.classifier(pooled_output) # logits: (bs*num_choice,1) + reshaped_logits = logits.reshape(shape=(-1, self.num_choices)) # logits: (bs, num_choice) + + return reshaped_logits diff --git a/examples/text_matching/diffcse/data.py b/examples/text_matching/diffcse/data.py new file mode 100644 index 0000000000000000000000000000000000000000..f2ff0b1749dab43957a83735802b3514bc400e40 --- /dev/null +++ b/examples/text_matching/diffcse/data.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
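+
+# Helpers for building the DiffCSE dataloaders:
+# - read_text_single / read_text_pair yield unsupervised, labelled or inference
+#   examples from plain text files (one example per line);
+# - convert_example tokenizes an example into input_ids / token_type_ids /
+#   attention_mask (plus the label when one is present);
+# - mask_tokens applies the BERT-style MLM corruption (80% [MASK], 10% random
+#   token, 10% unchanged) that is fed to the DiffCSE generator.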
+ +import paddle + + +def get_special_tokens(): + return ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + + +def get_special_token_ids(tokenizer): + special_tokens = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + return tokenizer.convert_tokens_to_ids(special_tokens) + + +def get_special_token_dict(tokenizer): + special_tokens = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + special_token_dict = dict(zip(special_tokens, tokenizer.convert_tokens_to_ids(special_tokens))) + return special_token_dict + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + result = [] + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length, return_attention_mask=True) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + attention_mask = encoded_inputs["attention_mask"] + result += [input_ids, token_type_ids, attention_mask] + return result + + +def read_text_single(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def masked_fill(x, mask, value): + y = paddle.full(x.shape, value, x.dtype) + return paddle.where(mask, y, x) + + +def mask_tokens(batch_inputs, tokenizer, mlm_probability=0.15): + """ + Description: Mask input_ids for masked language modeling: 80% MASK, 10% random, 10% original + """ + mlm_inputs = batch_inputs.clone() + mlm_labels = batch_inputs.clone() + + probability_matrix = paddle.full(mlm_inputs.shape, mlm_probability) + + special_tokens_mask = paddle.cast(paddle.zeros(mlm_inputs.shape), dtype=bool) + for special_token_id in get_special_token_ids(tokenizer): + special_tokens_mask |= mlm_inputs == special_token_id + + probability_matrix = masked_fill(probability_matrix, special_tokens_mask, 0.0) + + masked_indices = paddle.cast(paddle.bernoulli(probability_matrix), dtype=bool) + mlm_labels = masked_fill(mlm_labels, ~masked_indices, -100) + + # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK]) + indices_replaced = paddle.cast(paddle.bernoulli(paddle.full(mlm_inputs.shape, 0.8)), dtype=bool) & masked_indices + mlm_inputs = masked_fill(mlm_inputs, indices_replaced, tokenizer.mask_token_id) + + # 10% of the time, we replace masked input tokens with random word + indices_random = ( + paddle.cast(paddle.bernoulli(paddle.full(mlm_inputs.shape, 0.5)), dtype=bool) + & masked_indices + & ~indices_replaced + ) + random_words = paddle.randint(0, len(tokenizer), mlm_inputs.shape, dtype=mlm_inputs.dtype) + mlm_inputs = paddle.where(indices_random, random_words, mlm_inputs) + + # The rest of the time (10% of the time) we keep the masked input tokens unchanged + return mlm_inputs, mlm_labels + + +def read_text_pair(data_path, is_infer=False): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if 
is_infer: + if len(data[0]) == 0 or len(data[1]) == 0: + continue + yield {"text_a": data[0], "text_b": data[1]} + else: + if len(data[0]) == 0 or len(data[1]) == 0 or len(data[2]) == 0: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} diff --git a/examples/text_matching/diffcse/model.py b/examples/text_matching/diffcse/model.py new file mode 100644 index 0000000000000000000000000000000000000000..0deb60ee4830adeb738b79a17322ef7a9b3da245 --- /dev/null +++ b/examples/text_matching/diffcse/model.py @@ -0,0 +1,313 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from custom_ernie import ErnieModel as CustomErnie +from data import mask_tokens + +from paddlenlp.transformers import AutoModel, ErnieForMaskedLM + + +class ProjectionMLP(nn.Layer): + def __init__(self, in_dim): + super(ProjectionMLP, self).__init__() + hidden_dim = in_dim * 2 + out_dim = in_dim + list_layers = [nn.Linear(in_dim, hidden_dim, bias_attr=False), nn.BatchNorm1D(hidden_dim), nn.ReLU()] + list_layers += [nn.Linear(hidden_dim, out_dim, bias_attr=False), nn.BatchNorm1D(out_dim)] + self.net = nn.Sequential(*list_layers) + + def forward(self, x): + return self.net(x) + + +class Similarity(nn.Layer): + """ + Dot product or cosine similarity + """ + + def __init__(self, temp): + super(Similarity, self).__init__() + self.temp = temp + self.cos = nn.CosineSimilarity(axis=-1) + self.record = None + self.pos_avg = 0.0 + self.neg_avg = 0.0 + + def forward(self, x, y, one_vs_one=False): + if one_vs_one: + sim = self.cos(x, y) + return sim + + x = x.unsqueeze(1) + y = y.unsqueeze(0) + sim = self.cos(x, y) + self.record = sim.detach() + min_size = min(self.record.shape[0], self.record.shape[1]) + num_item = self.record.shape[0] * self.record.shape[1] + self.pos_avg = paddle.diag(self.record).sum().item() / min_size + self.neg_avg = (self.record.sum().item() - paddle.diag(self.record).sum().item()) / (num_item - min_size) + return sim / self.temp + + +class Encoder(nn.Layer): + def __init__(self, pretrained_model_name, temp=0.05, output_emb_size=None): + super(Encoder, self).__init__() + self.ptm = AutoModel.from_pretrained(pretrained_model_name) + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size + self.output_emb_size = output_emb_size + self.mlp = ProjectionMLP(self.ptm.config["hidden_size"]) + + if output_emb_size is not None: + self.emb_reduce_linear = nn.Linear(self.ptm.config["hidden_size"], output_emb_size) + + self.temp = temp + self.sim = Similarity(temp) + + def get_pooled_embedding( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, with_pooler=False + ): + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + if not with_pooler: + ori_cls_embedding = sequence_output[:, 0, :] + else: + 
ori_cls_embedding = cls_embedding + + mlp_cls_embedding = self.mlp(ori_cls_embedding) + if self.output_emb_size is not None: + cls_embedding = self.emb_reduce_linear(mlp_cls_embedding) + + return cls_embedding, mlp_cls_embedding + + def cosine_sim( + self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + key_token_type_ids=None, + key_position_ids=None, + key_attention_mask=None, + with_pooler=False, + ): + query_cls_embedding, _ = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + key_cls_embedding, _ = self.get_pooled_embedding( + key_input_ids, key_token_type_ids, key_position_ids, key_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = self.sim(query_cls_embedding, key_cls_embedding, one_vs_one=True) + return cosine_sim + + def forward( + self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + key_token_type_ids=None, + key_position_ids=None, + key_attention_mask=None, + with_pooler=False, + ): + query_cls_embedding, mlp_query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + key_cls_embedding, mlp_key_cls_embedding = self.get_pooled_embedding( + key_input_ids, key_token_type_ids, key_position_ids, key_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = self.sim(query_cls_embedding, key_cls_embedding) + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + loss = F.cross_entropy(input=cosine_sim, label=labels) + + mlp_cls_embedding = paddle.concat([mlp_query_cls_embedding, mlp_key_cls_embedding], axis=0) + return loss, mlp_cls_embedding + + +class Discriminator(nn.Layer): + def __init__(self, ptm_model_name): + super(Discriminator, self).__init__() + self.ptm = CustomErnie.from_pretrained(ptm_model_name) + self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2) + + def forward(self, input_ids, labels, cls_input, token_type_ids=None, attention_mask=None): + sequence_output, _ = self.ptm( + input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, cls_input=cls_input + ) + pred_scores = self.classifier(sequence_output) + loss = F.cross_entropy(input=pred_scores, label=labels) + + return loss, pred_scores.argmax(-1) + + +class DiffCSE(nn.Layer): + def __init__( + self, + encoder_name, + generator_name, + discriminator_name, + enc_tokenizer, + gen_tokenizer, + dis_tokenizer, + temp=0.05, + output_emb_size=32, + mlm_probability=0.15, + lambda_weight=0.15, + ): + super(DiffCSE, self).__init__() + self.encoder_name = encoder_name + self.generator_name = generator_name + self.discriminator_name = discriminator_name + self.enc_tokenizer = enc_tokenizer + self.gen_tokenizer = gen_tokenizer + self.dis_tokenizer = dis_tokenizer + self.temp = temp + self.output_emb_size = output_emb_size + self.mlm_probability = mlm_probability + self.lambda_weight = lambda_weight + + self.encoder = Encoder(encoder_name, temp=temp, output_emb_size=output_emb_size) + self.generator = ErnieForMaskedLM.from_pretrained(generator_name) + self.discriminator = Discriminator(discriminator_name) + + self.rtd_acc = 0.0 + self.rtd_rep_acc = 0.0 + self.rtd_fix_acc = 0.0 + + def train_forward( + self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + 
key_token_type_ids=None, + query_attention_mask=None, + key_attention_mask=None, + ): + + # extract senmantic vector with encoder and then comput CL loss + loss, mlp_cls_embedding = self.encoder( + query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + + with paddle.no_grad(): + # mask tokens for query and key input_ids and then predict mask token with generator + input_ids = paddle.concat([query_input_ids, key_input_ids], axis=0) + if self.encoder_name != self.generator_name: + input_ids = self.encode_by_generator(input_ids) + attention_mask = paddle.concat([query_attention_mask, key_attention_mask], axis=0) + mlm_input_ids, _ = mask_tokens(input_ids, self.gen_tokenizer, mlm_probability=self.mlm_probability) + # predict tokens using generator + pred_tokens = self.generator(mlm_input_ids, attention_mask=attention_mask).argmax(axis=-1) + + pred_tokens = pred_tokens.detach() + + if self.generator_name != self.discriminator_name: + pred_tokens = self.encode_by_discriminator(pred_tokens) + input_ids = self.encode_by_discriminator(input_ids) + + pred_tokens[:, 0] = self.dis_tokenizer.cls_token_id + e_inputs = pred_tokens * attention_mask + replaced = pred_tokens != input_ids + e_labels = paddle.cast(replaced, dtype="int64") * attention_mask + rtd_loss, prediction = self.discriminator(e_inputs, e_labels, cls_input=mlp_cls_embedding) + loss = loss + self.lambda_weight * rtd_loss + + rep = (e_labels == 1) * attention_mask + fix = (e_labels == 0) * attention_mask + self.rtd_rep_acc = float((prediction * rep).sum() / rep.sum()) + self.rtd_fix_acc = float(1.0 - (prediction * fix).sum() / fix.sum()) + self.rtd_acc = float(((prediction == e_labels) * attention_mask).sum() / attention_mask.sum()) + + return loss, rtd_loss + + def encode_by_generator(self, batch_tokens): + new_tokens = [] + for one_tokens in batch_tokens: + one_gen_tokens = self.enc_tokenizer.convert_ids_to_tokens(one_tokens.tolist()) + new_tokens.append(self.gen_tokenizer.convert_tokens_to_ids(one_gen_tokens)) + + return paddle.to_tensor(new_tokens) + + def encode_by_discriminator(self, batch_tokens): + new_tokens = [] + for one_tokens in batch_tokens: + one_gen_tokens = self.gen_tokenizer.convert_ids_to_tokens(one_tokens.tolist()) + new_tokens.append(self.dis_tokenizer.convert_tokens_to_ids(one_gen_tokens)) + + return paddle.to_tensor(new_tokens) + + def test_forward( + self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + key_token_type_ids=None, + query_attention_mask=None, + key_attention_mask=None, + ): + + # compute cosine similarity for query and key text + cos_sim = self.encoder.cosine_sim( + query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + + return cos_sim + + def forward( + self, + query_input_ids, + key_input_ids, + query_token_type_ids=None, + key_token_type_ids=None, + query_attention_mask=None, + key_attention_mask=None, + mode="train", + ): + if mode == "train": + return self.train_forward( + query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + else: + return self.test_forward( + query_input_ids, + key_input_ids, + 
query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) diff --git a/examples/text_matching/diffcse/run_diffcse.py b/examples/text_matching/diffcse/run_diffcse.py new file mode 100644 index 0000000000000000000000000000000000000000..c567308db3d744ca3d546ae78a7e7a349ecc7532 --- /dev/null +++ b/examples/text_matching/diffcse/run_diffcse.py @@ -0,0 +1,374 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair, read_text_single +from model import DiffCSE, Encoder +from utils import eval_metric, set_seed +from visualdl import LogWriter + +import paddlenlp as ppnlp +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--mode", choices=["train", "eval", "infer"], default="infer", help="Select which mode to run model, defaults to infer.") +parser.add_argument("--encoder_name", type=str, help="The sentence_encoder name or path that you wanna train based on.") +parser.add_argument("--generator_name", type=str, help="The generator model name or path that you wanna train based on.") +parser.add_argument("--discriminator_name", type=str, help="The discriminator model name or path that you wanna train based on.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--train_set_file", type=str, help="The full path of train_set_file.") +parser.add_argument("--eval_set_file", type=str, help="The full path of eval_set_file.") +parser.add_argument("--infer_set_file", type=str, help="The full path of infer_set_file.") +parser.add_argument("--ckpt_dir", default=None, type=str, help="The ckpt directory where the model checkpoints will be loaded when doing evaluation/inference.") +parser.add_argument("--save_dir", default="./checkpoints", type=str, help="The directory where the model checkpoints will be written.") +parser.add_argument("--log_dir", default=None, type=str, help="The directory where log will be written.") +parser.add_argument("--save_infer_path", default="./infer_result.txt", type=str, help="The save directory where the inference result will be written.") +parser.add_argument("--save_steps", type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument("--eval_steps", type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number 
of training steps to perform. Overrides epochs.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--temp", default=0.05, type=float, help="Temperature for softmax.") +parser.add_argument("--mlm_probability", default=0.15, type=float, help="The ratio for masked language model.") +parser.add_argument("--lambda_weight", default=0.15, type=float, help="The weight for RTD loss.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def do_infer(model, tokenizer, data_loader): + assert isinstance(model, Encoder), "please make sure that model is an instance of Encoder." + sims = [] + model.eval() + with paddle.no_grad(): + for batch in data_loader: + ( + query_input_ids, + query_token_type_ids, + query_attention_mask, + key_input_ids, + key_token_type_ids, + key_attention_mask, + ) = batch + cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + key_input_ids=key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + sims.append(cosine_sim.numpy()) + sims = np.concatenate(sims, axis=0) + model.train() + return sims + + +def do_eval(model, tokenizer, data_loader): + assert isinstance(model, Encoder), "please make sure that model is an instance of Encoder."
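+ # score every dev batch with Encoder.cosine_sim, collect the gold labels, then report eval_metric (Spearman correlation, see utils.py) over the full set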
+ sims, labels = [], [] + model.eval() + with paddle.no_grad(): + for batch in data_loader: + ( + query_input_ids, + query_token_type_ids, + query_attention_mask, + key_input_ids, + key_token_type_ids, + key_attention_mask, + label, + ) = batch + cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + key_input_ids=key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + sims.append(cosine_sim.numpy()) + labels.append(label.numpy()) + + sims = np.concatenate(sims, axis=0) + labels = np.concatenate(labels, axis=0) + score = eval_metric(labels, sims) + model.train() + return score + + +def do_train(model, tokenizer, train_data_loader, dev_data_loader, writer=None): + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + global_step = 0 + best_score = 0.0 + tic_train = time.time() + model = paddle.DataParallel(model) + model.train() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + ( + query_input_ids, + query_token_type_ids, + query_attention_mask, + key_input_ids, + key_token_type_ids, + key_attention_mask, + ) = batch + + loss, rtd_loss = model( + query_input_ids, + key_input_ids, + query_token_type_ids=query_token_type_ids, + key_token_type_ids=key_token_type_ids, + query_attention_mask=query_attention_mask, + key_attention_mask=key_attention_mask, + ) + + global_step += 1 + if global_step % (args.eval_steps // 10) == 0 and rank == 0: + print( + "global step {}, epoch: {}, batch: {}, loss: {:.5f}, rtd_loss: {:.5f}, rtd_acc: {:.5f}, rtd_rep_acc: {:.5f}, rtd_fix_acc: {:.5f}, pos_avg: {:.5f}, neg_avg: {:.5f}, speed: {:.2f} step/s".format( + global_step, + epoch, + step, + loss.item(), + rtd_loss.item(), + model._layers.rtd_acc, + model._layers.rtd_rep_acc, + model._layers.rtd_fix_acc, + model._layers.encoder.sim.pos_avg, + model._layers.encoder.sim.neg_avg, + (args.eval_steps // 10) / (time.time() - tic_train), + ) + ) + writer.add_scalar(tag="train/loss", step=global_step, value=loss.item()) + writer.add_scalar(tag="train/rtd_loss", step=global_step, value=rtd_loss.item()) + writer.add_scalar(tag="train/rtd_acc", step=global_step, value=model._layers.rtd_acc) + writer.add_scalar(tag="train/rtd_rep_acc", step=global_step, value=model._layers.rtd_rep_acc) + writer.add_scalar(tag="train/rtd_fix_acc", step=global_step, value=model._layers.rtd_fix_acc) + + tic_train = time.time() + + if global_step % args.eval_steps == 0 and rank == 0: + score = do_eval(model._layers.encoder, tokenizer, dev_data_loader) + print("Evaluation - score:{:.5f}".format(score)) + + if best_score < score: + print( + "best checkpoint has been updated: from last best_score {} --> new score {}.".format( + best_score, score + ) + ) + best_score = score + # save best model + save_dir = os.path.join(args.save_dir, "best") + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + 
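+ # note: only the encoder weights are saved here; the "eval" and "infer" modes rebuild a standalone Encoder and reload this file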
paddle.save(model._layers.encoder.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + writer.add_scalar(tag="eval/score", step=global_step, value=score) + model.train() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "checkpoint_{}".format(global_step)) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model._layers.encoder.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + if args.max_steps > 0 and global_step >= args.max_steps: + return model + + +if __name__ == "__main__": + # set running environment + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + if not os.path.exists(args.save_dir): + os.makedirs(args.save_dir) + + # define tokenizer for processing data + tokenizer = ppnlp.transformers.AutoTokenizer.from_pretrained(args.encoder_name) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + if args.mode == "train": + start_time = time.time() + + # load data + train_ds = load_dataset(read_text_single, data_path=args.train_set_file, lazy=False) + dev_ds = load_dataset(read_text_pair, data_path=args.eval_set_file, lazy=False) + gen_tokenizer = ppnlp.transformers.AutoTokenizer.from_pretrained(args.generator_name) + dis_tokenizer = ppnlp.transformers.AutoTokenizer.from_pretrained(args.discriminator_name) + + # initializing DiffCSE model + model = DiffCSE( + encoder_name=args.encoder_name, + generator_name=args.generator_name, + discriminator_name=args.discriminator_name, + enc_tokenizer=tokenizer, + gen_tokenizer=gen_tokenizer, + dis_tokenizer=dis_tokenizer, + temp=args.temp, + output_emb_size=args.output_emb_size, + mlm_probability=args.mlm_probability, + lambda_weight=args.lambda_weight, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Pad(axis=0, pad_val=0), # attention_mask + ): [data for data in fn(samples)] + dev_batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Pad(axis=0, pad_val=0), # attention_mask + Stack(dtype="int64"), # labels + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + dev_data_loader = create_dataloader( + dev_ds, mode="eval", batch_size=args.batch_size, batchify_fn=dev_batchify_fn, trans_fn=trans_func + ) + + with LogWriter(logdir=os.path.join(args.log_dir, "scalar")) as writer: + do_train(model, tokenizer, train_data_loader, dev_data_loader, writer=writer) + + end_time = time.time() + print("running time {} s".format(end_time - start_time)) + + if args.mode == "eval": + start_time = time.time() + # initalizing 
encoder model for eval + model = Encoder(args.encoder_name, temp=args.temp, output_emb_size=args.output_emb_size) + # load model from saved checkpoint + if args.ckpt_dir: + init_from_ckpt = os.path.join(args.ckpt_dir, "model_state.pdparams") + if os.path.isfile(init_from_ckpt): + print( + "*************************initializing model from {}*****************************".format( + init_from_ckpt + ) + ) + state_dict = paddle.load(init_from_ckpt) + model.set_dict(state_dict) + + dev_ds = load_dataset(read_text_pair, data_path=args.eval_set_file, lazy=False) + + dev_batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Pad(axis=0, pad_val=0), # attention_mask + Stack(dtype="int64"), # labels + ): [data for data in fn(samples)] + + dev_data_loader = create_dataloader( + dev_ds, mode="eval", batch_size=args.batch_size, batchify_fn=dev_batchify_fn, trans_fn=trans_func + ) + + score = do_eval(model, tokenizer, dev_data_loader) + print("Evaluation - score:{:.5f}".format(score)) + + end_time = time.time() + print("running time {} s".format(end_time - start_time)) + + if args.mode == "infer": + start_time = time.time() + # initalizing encoder model for eval + model = Encoder(args.encoder_name, temp=args.temp, output_emb_size=args.output_emb_size) + # load model from saved checkpoint + if args.ckpt_dir: + init_from_ckpt = os.path.join(args.ckpt_dir, "model_state.pdparams") + if os.path.isfile(init_from_ckpt): + print( + "*************************initializing model from {}*****************************".format( + init_from_ckpt + ) + ) + state_dict = paddle.load(init_from_ckpt) + model.set_dict(state_dict) + + infer_ds = load_dataset(read_text_pair, data_path=args.infer_set_file, lazy=False, is_infer=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=0), # attention_mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # key_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Pad(axis=0, pad_val=0), # attention_mask + ): [data for data in fn(samples)] + + infer_data_loader = create_dataloader( + infer_ds, mode="infer", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + cosin_sim = do_infer(model, tokenizer, infer_data_loader) + + with open(args.save_infer_path, "w", encoding="utf-8") as f: + for idx, cos in enumerate(cosin_sim): + msg = "{} --> {}\n".format(idx, cos) + f.write(msg) + print("Inference result has been saved to : {}".format(args.save_infer_path)) + + end_time = time.time() + print("running time {} s".format(end_time - start_time)) diff --git a/examples/text_matching/diffcse/run_eval.sh b/examples/text_matching/diffcse/run_eval.sh new file mode 100644 index 0000000000000000000000000000000000000000..0c5512c2f4d3c6fd5ff0d1c1ae6a58da4454c2fd --- /dev/null +++ b/examples/text_matching/diffcse/run_eval.sh @@ -0,0 +1,15 @@ +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_eval" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "eval" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --max_seq_length "128" \ + --output_emb_size "32" 
\ + --eval_set_file "your dev_set path" \ + --ckpt_dir "./checkpoints/best" \ + --batch_size "32" \ + --seed "0" \ + --device "gpu" diff --git a/examples/text_matching/diffcse/run_infer.sh b/examples/text_matching/diffcse/run_infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..7df8a573cd8a3cdd8ced45b4871e45a5f5169881 --- /dev/null +++ b/examples/text_matching/diffcse/run_infer.sh @@ -0,0 +1,16 @@ +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_infer" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "infer" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --infer_set_file "your test_set path" \ + --ckpt_dir "./checkpoints/best" \ + --save_infer_path "./infer_result.txt" \ + --batch_size "32" \ + --seed "0" \ + --device "gpu" diff --git a/examples/text_matching/diffcse/run_train.sh b/examples/text_matching/diffcse/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..9ed49a4b37e8db24b68e93a8c519c8552b7e4da7 --- /dev/null +++ b/examples/text_matching/diffcse/run_train.sh @@ -0,0 +1,27 @@ +gpu_ids=0 +export CUDA_VISIBLE_DEVICES=${gpu_ids} + +log_dir="log_train" +python -u -m paddle.distributed.launch --gpus ${gpu_ids} --log_dir ${log_dir} \ + run_diffcse.py \ + --mode "train" \ + --encoder_name "rocketqa-zh-dureader-query-encoder" \ + --generator_name "ernie-3.0-base-zh" \ + --discriminator_name "ernie-3.0-base-zh" \ + --max_seq_length "128" \ + --output_emb_size "32" \ + --train_set_file "your train_set path" \ + --eval_set_file "your dev_set path" \ + --save_dir "./checkpoints" \ + --log_dir ${log_dir} \ + --save_steps "50000" \ + --eval_steps "1000" \ + --batch_size "32" \ + --epochs "3" \ + --mlm_probability "0.15" \ + --lambda_weight "0.15" \ + --learning_rate "3e-5" \ + --weight_decay "0.01" \ + --warmup_proportion "0.01" \ + --seed "0" \ + --device "gpu" diff --git a/examples/text_matching/diffcse/utils.py b/examples/text_matching/diffcse/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d70b434f82869675883fa245c7e1d58e2ffbec1d --- /dev/null +++ b/examples/text_matching/diffcse/utils.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
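+# Shared helpers for the DiffCSE example: random-seed setup, a masked_fill utility, and the Spearman-correlation evaluation metric.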
+ +import random +import numpy as np + +import paddle +from scipy import stats + + +def set_seed(seed=0): + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def masked_fill(x, mask, value): + y = paddle.full(x.shape, value, x.dtype) + return paddle.where(mask, y, x) + + +def eval_metric(labels, preds): + spearman_corr = stats.spearmanr(labels, preds).correlation + return spearman_corr diff --git a/examples/text_matching/ernie_matching/README.md b/examples/text_matching/ernie_matching/README.md new file mode 100644 index 0000000000000000000000000000000000000000..dd24b9a664b8ca02de02db6eac4a45c13e7b98d9 --- /dev/null +++ b/examples/text_matching/ernie_matching/README.md @@ -0,0 +1,175 @@ +# 基于预训练模型 ERNIE-Gram 的单塔文本匹配 + +我们基于预训练模型 ERNIE-Gram 给出了单塔文本匹配的 2 种训练范式: Point-wise 和 Pair-wise。其中单塔 Point-wise 匹配模型适合直接对文本对进行 2 分类的应用场景: 例如判断 2 个文本是否为语义相似;Pair-wise 匹配模型适合将文本对相似度作为特征之一输入到上层排序模块进行排序的应用场景。 + +## 模型下载 +本项目使用语义匹配数据集 LCQMC 作为训练集 , 基于 ERNIE-Gram 预训练模型热启训练并开源了单塔 Point-wise 语义匹配模型, 用户可以直接基于这个模型对文本对进行语义匹配的 2 分类任务。 + +| 模型 | dev acc | +| ---- | ------- | +| [ERNIE-1.0-Base](https://bj.bcebos.com/paddlenlp/models/text_matching/ernie1.0_zh_pointwise_matching_model.tar) | 89.43 | +| [ERNIE-Gram-Base](https://bj.bcebos.com/paddlenlp/models/text_matching/ernie_gram_zh_pointwise_matching_model.tar) | 90.60 | + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +``` +ernie_matching/ +├── deply # 部署 +| └── python +| └── predict.py # python 预测部署示例 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── model.py # Point-wise & Pair-wise 匹配模型组网 +├── data.py # Point-wise & Pair-wise 训练样本的转换逻辑 、Pair-wise 生成随机负例的逻辑 +├── train_pointwise.py # Point-wise 单塔匹配模型训练脚本 +├── train_pairwise.py # Pair-wise 单塔匹配模型训练脚本 +├── predict_pointwise.py # Point-wise 单塔匹配模型预测脚本,输出文本对是否相似: 0、1 分类 +├── predict_pairwise.py # Pair-wise 单塔匹配模型预测脚本,输出文本对的相似度打分 +└── train.py # 模型训练评估 +``` + +### 模型训练 + +我们以中文文本匹配公开数据集 LCQMC 为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行单塔 Point-wise 模型训练,并在开发集(dev.tsv)验证。Pair-wise 匹配模型只需要采用 `train_pairwise.py` 脚本即可。 + +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0" train_pointwise.py \ + --device gpu \ + --save_dir ./checkpoints \ + --batch_size 32 \ + --learning_rate 2E-5 +``` + +可支持配置的参数: + +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_length`:可选,ERNIE-Gram 模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `epochs`: 训练轮次,默认为3。 +* `warmup_proption`:可选,学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. 
+* `device`: 选用什么设备进行训练,可选cpu、gpu或npu。如使用gpu训练则参数gpus指定GPU卡号。 + +代码示例中使用的预训练模型是 ERNIE-Gram,如果想要使用其他预训练模型如 ERNIE, BERT,RoBERTa,Electra等,只需更换`model` 和 `tokenizer`即可。 + +```python + +# 使用 ERNIE-3.0-medium-zh 预训练模型 +model = AutoModel.from_pretrained('ernie-3.0-medium-zh') +tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') + +# 使用 ERNIE-Gram 预训练模型 +model = AutoModel.from_pretrained('ernie-gram-zh') +tokenizer = AutoTokenizer.from_pretrained('ernie-gram-zh') + +# 使用 ERNIE 预训练模型 +# ernie-1.0 +#model = AutoModel.from_pretrained('ernie-1.0-base-zh')) +#tokenizer = AutoTokenizer.from_pretrained('ernie-1.0-base-zh') + +# ernie-tiny +# model = AutoModel.from_pretrained('ernie-tiny')) +# tokenizer = AutoTokenizer.from_pretrained('ernie-tiny') + + +# 使用 BERT 预训练模型 +# bert-base-chinese +# model = AutoModel.from_pretrained('bert-base-chinese') +# tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese') + +# bert-wwm-chinese +# model = AutoModel.from_pretrained('bert-wwm-chinese') +# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-chinese') + +# bert-wwm-ext-chinese +# model = AutoModel.from_pretrained('bert-wwm-ext-chinese') +# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-ext-chinese') + + +# 使用 RoBERTa 预训练模型 +# roberta-wwm-ext +# model = AutoModel.from_pretrained('roberta-wwm-ext') +# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext') + +# roberta-wwm-ext +# model = AutoModel.from_pretrained('roberta-wwm-ext-large') +# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext-large') + +``` +更多预训练模型,参考[transformers](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── model_100 +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   └── vocab.txt +└── ... +``` + +**NOTE:** +* 如需恢复模型训练,则可以设置`init_from_ckpt`, 如`init_from_ckpt=checkpoints/model_100/model_state.pdparams`。 +* 如需使用ernie-tiny模型,则需要提前先安装sentencepiece依赖,如`pip install sentencepiece` + +### 基于动态图模型预测 + +我们用 LCQMC 的测试集作为预测数据, 测试数据示例如下,: +```text +谁有狂三这张高清的 这张高清图,谁有 +英雄联盟什么英雄最好 英雄联盟最好英雄是什么 +这是什么意思,被蹭网吗 我也是醉了,这是什么意思 +现在有什么动画片好看呢? 现在有什么好看的动画片吗? +请问晶达电子厂现在的工资待遇怎么样要求有哪些 三星电子厂工资待遇怎么样啊 +``` + +启动预测: + +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0" \ + predict_pointwise.py \ + --device gpu \ + --params_path "./checkpoints/model_4400/model_state.pdparams"\ + --batch_size 128 \ + --max_seq_length 64 \ + --input_file 'test.tsv' +``` + +输出预测结果如下: +```text +{'query': '谁有狂三这张高清的', 'title': '这张高清图,谁有', 'pred_label': 1} +{'query': '英雄联盟什么英雄最好', 'title': '英雄联盟最好英雄是什么', 'pred_label': 1} +{'query': '这是什么意思,被蹭网吗', 'title': '我也是醉了,这是什么意思', 'pred_label': 1} +{'query': '现在有什么动画片好看呢?', 'title': '现在有什么好看的动画片吗?', 'pred_label': 1} +{'query': '请问晶达电子厂现在的工资待遇怎么样要求有哪些', 'title': '三星电子厂工资待遇怎么样啊', 'pred_label': 0} +``` + +### 基于静态图部署预测 +#### 模型导出 +使用动态图训练结束之后,可以使用静态图导出工具 `export_model.py` 将动态图参数导出成静态图参数。 执行如下命令: + +```shell +python export_model.py --params_path checkpoints/model_300/model_state.pdparams --output_path=./output +``` + +其中`params_path`是指动态图训练保存的参数路径,`output_path`是指静态图参数导出路径。 + +#### 预测部署 +导出静态图模型之后,可以基于静态图模型进行预测,`deploy/python/predict.py` 文件提供了静态图预测示例。执行如下命令: + +```shell +python deploy/python/predict.py --model_dir ./output +``` + +## Reference +[1] Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, Buzhou Tang, LCQMC: A Large-scale Chinese Question Matching Corpus,COLING2018. 
+[2] Xiao, Dongling, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. “ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding.” ArXiv:2010.12148 [Cs]. diff --git a/examples/text_matching/ernie_matching/data.py b/examples/text_matching/ernie_matching/data.py new file mode 100644 index 0000000000000000000000000000000000000000..aabe8f71a29be91d6d05e8266a5990feb62b413d --- /dev/null +++ b/examples/text_matching/ernie_matching/data.py @@ -0,0 +1,130 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import numpy as np + +from paddlenlp.datasets import MapDataset + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {"query": data[0], "title": data[1]} + + +def convert_pointwise_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +def convert_pairwise_example(example, tokenizer, max_seq_length=512, phase="train"): + + if phase == "train": + query, pos_title, neg_title = example["query"], example["title"], example["neg_title"] + + pos_inputs = tokenizer(text=query, text_pair=pos_title, max_seq_len=max_seq_length) + neg_inputs = tokenizer(text=query, text_pair=neg_title, max_seq_len=max_seq_length) + + pos_input_ids = pos_inputs["input_ids"] + pos_token_type_ids = pos_inputs["token_type_ids"] + neg_input_ids = neg_inputs["input_ids"] + neg_token_type_ids = neg_inputs["token_type_ids"] + + return (pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids) + + else: + query, title = example["query"], example["title"] + + inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = inputs["input_ids"] + token_type_ids = inputs["token_type_ids"] + if phase == "eval": + return input_ids, token_type_ids, example["label"] + elif phase == "predict": + return input_ids, token_type_ids + else: + raise ValueError("not supported 
phase:{}".format(phase)) + + +def gen_pair(dataset, pool_size=100): + """ + Generate triplet randomly based on dataset + + Args: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: example["query"], example["title"] + pool_size: the number of example to sample negative example randomly + + Return: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: example["query"], example["pos_title"]、example["neg_title"] + """ + + if len(dataset) < pool_size: + pool_size = len(dataset) + + new_examples = [] + pool = [] + tmp_examples = [] + + for example in dataset: + label = example["label"] + + # Filter negative example + if label == 0: + continue + + tmp_examples.append(example) + pool.append(example["title"]) + + if len(pool) >= pool_size: + np.random.shuffle(pool) + for idx, example in enumerate(tmp_examples): + example["neg_title"] = pool[idx] + new_examples.append(example) + tmp_examples = [] + pool = [] + else: + continue + return MapDataset(new_examples) diff --git a/examples/text_matching/ernie_matching/deploy/python/predict.py b/examples/text_matching/ernie_matching/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..aa9be63b026b69419c096b40a51ecf7562f13ef7 --- /dev/null +++ b/examples/text_matching/ernie_matching/deploy/python/predict.py @@ -0,0 +1,210 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import paddle +from paddle import inference + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, help="The directory to static model.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', default=10, type=int, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--benchmark", type=eval, default=False, help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path to save log.") +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +class Predictor(object): + def __init__( + self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False, + ): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as initialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8, + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode + ) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + + pid = os.getpid() + self.autolog = 
auto_log.AutoLogger( + model_name="ernie-tiny", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=["preprocess_time", "inference_time", "postprocess_time"], + warmup=0, + logger=logger, + ) + + def predict(self, data, tokenizer, label_map): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example(text, tokenizer, max_seq_length=self.max_seq_length, is_test=True) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + probs = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + + # probs = softmax(logits, axis=1) + idx = np.argmax(probs, axis=1) + idx = idx.tolist() + labels = [label_map[i] for i in idx] + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return labels + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor( + args.model_dir, + args.device, + args.max_seq_length, + args.batch_size, + args.use_tensorrt, + args.precision, + args.cpu_threads, + args.enable_mkldnn, + ) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + test_ds = load_dataset("lcqmc", splits=["test"]) + + data = [{"query": d["query"], "title": d["title"]} for d in test_ds] + + batches = [data[idx : idx + args.batch_size] for idx in range(0, len(data), args.batch_size)] + label_map = {0: "dissimilar", 1: "similar"} + + results = [] + for batch_data in batches: + results.extend(predictor.predict(batch_data, tokenizer, label_map)) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) + if args.benchmark: + predictor.autolog.report() diff --git a/examples/text_matching/ernie_matching/export_model.py b/examples/text_matching/ernie_matching/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..696a700a3f61029dca0569d69549f9f09bda4087 --- /dev/null +++ b/examples/text_matching/ernie_matching/export_model.py @@ -0,0 +1,52 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from model import PointwiseMatching + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + model = PointwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/text_matching/ernie_matching/model.py b/examples/text_matching/ernie_matching/model.py new file mode 100644 index 0000000000000000000000000000000000000000..ba412b9f78d6ab3671e9b3ee5857d1987001d4dc --- /dev/null +++ b/examples/text_matching/ernie_matching/model.py @@ -0,0 +1,88 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class PointwiseMatching(nn.Layer): + def __init__(self, pretrained_model, dropout=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # num_labels = 2 (similar or dissimilar) + self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2) + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + cls_embedding = self.dropout(cls_embedding) + logits = self.classifier(cls_embedding) + + return logits + + +class PairwiseMatching(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.1): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + self.margin = margin + + # hidden_size -> 1, calculate similarity + self.similarity = nn.Linear(self.ptm.config["hidden_size"], 1) + + def predict(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): + + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + cls_embedding = self.dropout(cls_embedding) + sim_score = self.similarity(cls_embedding) + sim_score = F.sigmoid(sim_score) + + return sim_score + + def forward( + self, + pos_input_ids, + neg_input_ids, + pos_token_type_ids=None, + neg_token_type_ids=None, + pos_position_ids=None, + neg_position_ids=None, + pos_attention_mask=None, + neg_attention_mask=None, + ): + + _, pos_cls_embedding = self.ptm(pos_input_ids, pos_token_type_ids, pos_position_ids, pos_attention_mask) + + _, neg_cls_embedding = self.ptm(neg_input_ids, neg_token_type_ids, neg_position_ids, neg_attention_mask) + + pos_embedding = self.dropout(pos_cls_embedding) + neg_embedding = self.dropout(neg_cls_embedding) + + pos_sim = self.similarity(pos_embedding) + neg_sim = self.similarity(neg_embedding) + + pos_sim = F.sigmoid(pos_sim) + neg_sim = F.sigmoid(neg_sim) + + labels = paddle.full(shape=[pos_cls_embedding.shape[0]], fill_value=1.0, dtype="float32") + + loss = F.margin_ranking_loss(pos_sim, neg_sim, labels, margin=self.margin) + + return loss diff --git a/examples/text_matching/ernie_matching/predict_pairwise.py b/examples/text_matching/ernie_matching/predict_pairwise.py new file mode 100644 index 0000000000000000000000000000000000000000..042089eec670e0bc2ea385593ac0e5d402539d4b --- /dev/null +++ b/examples/text_matching/ernie_matching/predict_pairwise.py @@ -0,0 +1,105 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
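+# Pair-wise inference entry point: loads a PairwiseMatching checkpoint from --params_path and prints a sigmoid similarity score (pred_prob) for each query/title pair read from --input_file.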
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_pairwise_example as convert_example +from data import create_dataloader, read_text_pair +from model import PairwiseMatching + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + batch_probs = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + batch_prob = model.predict(input_ids=input_ids, token_type_ids=token_type_ids).numpy() + + batch_probs.append(batch_prob) + + batch_probs = np.concatenate(batch_probs, axis=0) + + return batch_probs + + +if __name__ == "__main__": + paddle.set_device(args.device) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="predict") + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment_ids + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = PairwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + y_probs = predict(model, valid_data_loader) + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + for idx, prob in enumerate(y_probs): + text_pair = valid_ds[idx] + text_pair["pred_prob"] = prob[0] + print(text_pair) diff --git a/examples/text_matching/ernie_matching/predict_pointwise.py b/examples/text_matching/ernie_matching/predict_pointwise.py new file mode 100644 index 
0000000000000000000000000000000000000000..7931087bfb99586e7a20106d6f6560ee012201c5 --- /dev/null +++ b/examples/text_matching/ernie_matching/predict_pointwise.py @@ -0,0 +1,105 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_pointwise_example as convert_example +from data import create_dataloader, read_text_pair +from model import PointwiseMatching + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. 
+ """ + batch_probs = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + batch_prob = model(input_ids=input_ids, token_type_ids=token_type_ids).numpy() + + batch_probs.append(batch_prob) + + batch_probs = np.concatenate(batch_probs, axis=0) + + return batch_probs + + +if __name__ == "__main__": + paddle.set_device(args.device) + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment_ids + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = PointwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + y_probs = predict(model, valid_data_loader) + y_preds = np.argmax(y_probs, axis=1) + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + for idx, y_pred in enumerate(y_preds): + text_pair = valid_ds[idx] + text_pair["pred_label"] = y_pred + print(text_pair) diff --git a/examples/text_matching/ernie_matching/train_pairwise.py b/examples/text_matching/ernie_matching/train_pairwise.py new file mode 100644 index 0000000000000000000000000000000000000000..32d16eef867adb2fdd0b41bf4551218384092125 --- /dev/null +++ b/examples/text_matching/ernie_matching/train_pairwise.py @@ -0,0 +1,192 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
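+# Pair-wise training entry point: gen_pair builds random negatives from LCQMC, the PairwiseMatching model is optimized with a margin ranking loss, and the dev split is scored with AUC.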
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_pairwise_example as convert_example +from data import create_dataloader, gen_pair +from model import PairwiseMatching + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--margin", default=0.2, type=float, help="Margin for pos_score and neg_score.") +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=100, type=int, help="Step interval for evaluation.") +parser.add_argument('--save_step', default=10000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument('--max_step', default=10000, type=int, help="Max steps for training.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
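+ phase(obj:`str`): Name of the dataset split being evaluated; only used in the printed log message.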
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model.predict(input_ids=input_ids, token_type_ids=token_type_ids) + + neg_probs = 1.0 - pos_probs + + preds = np.concatenate((neg_probs, pos_probs), axis=1) + metric.update(preds=preds, labels=labels) + + print("eval_{} auc:{:.2}".format(phase, metric.accumulate())) + metric.reset() + model.train() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"]) + + train_ds = gen_pair(train_ds) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func_train = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + trans_func_eval = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="eval") + + batchify_fn_train = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # pos_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # pos_pair_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # neg_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # neg_pair_segment + ): [data for data in fn(samples)] + + batchify_fn_eval = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn_train, trans_fn=trans_func_train + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn_eval, trans_fn=trans_func_eval + ) + + model = PairwiseMatching(pretrained_model, margin=args.margin) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + metric = paddle.metric.Auc() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids = batch + + loss = model( + pos_input_ids=pos_input_ids, + neg_input_ids=neg_input_ids, + pos_token_type_ids=pos_token_type_ids, + neg_token_type_ids=neg_token_type_ids, + ) + + global_step += 1 + + if global_step > args.max_step: + print("Training steps have achieved max_step, training is stopped.") + return + + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, metric, dev_data_loader, "dev") + + if global_step % args.save_step == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_matching/ernie_matching/train_pointwise.py b/examples/text_matching/ernie_matching/train_pointwise.py new file mode 100644 index 0000000000000000000000000000000000000000..24cb9c78428e9cc919579d969e048def2b99a7cb --- /dev/null +++ b/examples/text_matching/ernie_matching/train_pointwise.py @@ -0,0 +1,187 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_pointwise_example as convert_example +from data import create_dataloader +from model import PointwiseMatching + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. 
" + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=100, type=int, help="Step interval for evaluation.") +parser.add_argument('--save_step', default=10000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument('--max_step', default=10000, type=int, help="Max steps for training.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu', "npu"], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+ """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, labels = batch + logits = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + accu = metric.accumulate() + print("eval {} loss: {:.5}, accu: {:.5}".format(phase, np.mean(losses), accu)) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"]) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = PointwiseMatching(pretrained_model) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch + logits = model(input_ids=input_ids, token_type_ids=token_type_ids) + loss = criterion(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + + if global_step > args.max_step: + print("Training steps have achieved max_step, training is stopped.") + return + + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, criterion, metric, dev_data_loader) + + if global_step % args.save_step == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_matching/question_matching/README.md b/examples/text_matching/question_matching/README.md new file mode 100644 index 0000000000000000000000000000000000000000..24f037843a4dd4bf67b6437ba476ba684038e13f --- /dev/null +++ b/examples/text_matching/question_matching/README.md @@ -0,0 +1,144 @@ +# 千言-问题匹配鲁棒性评测基线 + +我们基于预训练模型 ERNIE-Gram 结合正则化策略 [R-Drop](https://arxiv.org/abs/2106.14448) 在 [2021 CCF BDCI 千言-问题匹配鲁棒性评测](https://aistudio.baidu.com/aistudio/competition/detail/116/0/introduction) 竞赛上建立了 Baseline 方案和评测结果。 + +## 赛题背景 + +问题匹配(Question Matching)任务旨在判断两个自然问句之间的语义是否等价,是自然语言处理领域一个重要研究方向。问题匹配同时也具有很高的商业价值,在信息检索、智能客服等领域发挥重要作用。 + +近年来,神经网络模型虽然在一些标准的问题匹配评测集合上已经取得与人类相仿甚至超越人类的准确性,但在处理真实应用场景问题时,性能大幅下降,在简单(人类很容易判断)的问题上无法做出正确判断(如下图),影响产品体验的同时也会造成相应的经济损失。 + +| 问题1 | 问题2 | 标签(Label) | Model | +| :----------------: | :------------------: | :---------: | :-----: | +| 婴儿吃什么蔬菜好 | 婴儿吃什么`绿色`蔬菜好 | 0 | 1 | +| 关于`牢房`的电视剧 | 关于`监狱`的电视剧 | 1 | 0 | +| 心率过`快`有什么问题 | 心率过`慢`有什么问题 | 0 | 1 | +| 黑色`裤子`配什么`上衣` | 黑色`上衣`配什么`裤子` | 0 | 1 | + +当前大多数问题匹配任务采用单一指标,在同分布的测试集上评测模型的好坏,这种评测方式可能夸大了模型能力,并且缺乏对模型鲁棒性的细粒度优劣势评估。本次评测关注问题匹配模型在真实应用场景中的鲁棒性,从词汇理解、句法结构、错别字、口语化、对话理解五个维度检测模型的能力,从而发现模型的不足之处,推动语义匹配技术的发展。本次竞赛主要基于[千言数据集](https://luge.ai),采用的数据集包括哈尔滨工业大学(深圳)的LCQMC和BQ数据集、OPPO的小布对话短文本数据集以及百度的DuQM数据集,期望从多维度、多领域出发,全面评价模型的鲁棒性,进一步提升问题匹配技术的研究水平。本次竞赛将在第九届“CCF大数据与计算智能大赛”举办技术交流论坛和颁奖仪式,诚邀学术界和工业界的研究者和开发者参加本次竞赛! 
+ +## 基线评测效果 +本项目分别基于ERNIE-1.0、Bert-base-chinese、ERNIE-Gram 3 个中文预训练模型训练了单塔 Point-wise 的匹配模型, 基于 ERNIE-Gram 的模型效果显著优于其它 2 个预训练模型。 + +此外,在 ERNIE-Gram 模型基础上我们也对最新的正则化策略 [R-Drop](https://arxiv.org/abs/2106.14448) 进行了相关评测, [R-Drop](https://arxiv.org/abs/2106.14448) 策略的核心思想是针对同 1 个训练样本过多次前向网络得到的输出加上正则化的 Loss 约束。 + +我们开源了效果最好的 2 个策略对应模型的 checkpoint 作为本次比赛的基线方案: 基于 ERNIE-Gram 预训练模型 R-Drop 系数分别为 0.0 和 0.1 的 2 个模型, 用户可以下载相应的模型来复现我们的评测结果。 + +| 模型 | rdrop_coef | dev acc | test-A acc | test-B acc| +| ---- | ---- |-----|--------|------- | +| ernie-1.0-base |0.0| 86.96 |76.20 | 77.50| +| bert-base-chinese |0.0| 86.93| 76.90 |77.60 | +| [ernie-gram-zh](https://bj.bcebos.com/paddlenlp/models/text_matching/question_matching_rdrop0p0_baseline_model.tar) | 0.0 |87.66 | **80.80** | **81.20** | +| [ernie-gram-zh](https://bj.bcebos.com/paddlenlp/models/text_matching/question_matching_rdrop0p1_baseline_model.tar) | 0.1 |87.91 | 80.20 | 80.80 | +| ernie-gram-zh | 0.2 |87.47 | 80.10 | 81.00 | + + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: +``` +question_matching/ +├── model.py # 匹配模型组网 +├── data.py # 训练样本的数据读取、转换逻辑 +├── predict.py # 模型预测脚本,输出测试集的预测结果: 0,1 +└── train.py # 模型训练评估 +``` + +### 数据准备 +本项目使用竞赛提供的 LCQMC、BQ、OPPO 这 3 个数据集的训练集合集作为训练集,使用这 3 个数据集的验证集合集作为验证集。 + +运行如下命令生成本项目所使用的训练集和验证集,您在参赛过程中可以探索采取其它的训练集和验证集组合,不需要和基线方案完全一致。 +```shell +cat ./data/train/LCQMC/train ./data/train/BQ/train ./data/train/OPPO/train > train.txt +cat ./data/train/LCQMC/dev ./data/train/BQ/dev ./data/train/OPPO/dev > dev.txt +``` +训练集数据格式为 3 列: text_a \t text_b \t label, 样例数据如下: +```text +喜欢打篮球的男生喜欢什么样的女生 爱打篮球的男生喜欢什么样的女生 1 +我手机丢了,我想换个手机 我想买个新手机,求推荐 1 +大家觉得她好看吗 大家觉得跑男好看吗? 0 +求秋色之空漫画全集 求秋色之空全集漫画 1 +晚上睡觉带着耳机听音乐有什么害处吗? 孕妇可以戴耳机听音乐吗? 0 +``` +验证集的数据格式和训练集相同,样例如下: +``` +开初婚未育证明怎么弄? 初婚未育情况证明怎么开? 1 +谁知道她是网络美女吗? 爱情这杯酒谁喝都会醉是什么歌 0 +男孩喝女孩的尿的故事 怎样才知道是生男孩还是女孩 0 +这种图片是用什么软件制作的? 这种图片制作是用什么软件呢? 1 +``` + +### 模型训练 +运行如下命令,即可复现本项目中基于 ERNIE-Gram 的基线模型: + +```shell +$unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0,1,2,3" train.py \ + --train_set train.txt \ + --dev_set dev.txt \ + --device gpu \ + --eval_step 100 \ + --save_dir ./checkpoints \ + --train_batch_size 32 \ + --learning_rate 2E-5 \ + --rdrop_coef 0.0 +``` + +可支持配置的参数: +* `train_set`: 训练集的文件。 +* `dev_set`:验证集数据文件。 +* `rdrop_coef`:可选,控制 R-Drop 策略正则化 KL-Loss 的系数;默认为 0.0, 即不使用 R-Drop 策略。 +* `train_batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `epochs`: 训练轮次,默认为3。 +* `warmup_proption`:可选,学习率 warmup 策略的比例,如果 0.1,则学习率会在前 10% 训练 step 的过程中从 0 慢慢增长到 learning_rate, 而后再缓慢衰减,默认为 0.0。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000。 +* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 + +训练过程中每一次在验证集上进行评估之后,程序会根据验证集的评估指标是否优于之前最优的模型指标来决定是否存储当前模型,如果优于之前最优的验证集指标则会存储当前模型,否则则不存储,因此训练过程结束之后,模型存储路径下 step 数最大的模型则对应验证集指标最高的模型, 一般我们选择验证集指标最高的模型进行预测。 + +如: +```text +checkpoints/ +├── model_10000 +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   └── vocab.txt +└── ... 
+``` + +**NOTE:** +* 如需恢复模型训练,则可以设置`init_from_ckpt`, 如`init_from_ckpt=checkpoints/model_100/model_state.pdparams`。 + + +### 开始预测 +训练完成后,在指定的 checkpoints 路径下会自动存储在验证集评估指标最高的模型,运行如下命令开始生成预测结果: +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u \ + predict.py \ + --device gpu \ + --params_path "./checkpoints/model_10000/model_state.pdparams" \ + --batch_size 128 \ + --input_file "${test_set}" \ + --result_file "predict_result" +``` + +输出预测结果示例如下: +```text +0 +1 +0 +1 +``` +### 提交进行评测 +提交预测结果进行评测 + +## Reference +[1] Liang, Xiaobo, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, and Tie-Yan Liu. “R-Drop: Regularized Dropout for Neural Networks.” ArXiv:2106.14448 [Cs], June 28, 2021. http://arxiv.org/abs/2106.14448. diff --git a/examples/text_matching/question_matching/data.py b/examples/text_matching/question_matching/data.py new file mode 100644 index 0000000000000000000000000000000000000000..19560ab78a86b62a33d774c8863e47290e467be7 --- /dev/null +++ b/examples/text_matching/question_matching/data.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_test is False: + if len(data) != 3: + continue + yield {"query1": data[0], "query2": data[1], "label": data[2]} + else: + if len(data) != 2: + continue + yield {"query1": data[0], "query2": data[1]} + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query1"], example["query2"] + + encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids diff --git a/examples/text_matching/question_matching/model.py b/examples/text_matching/question_matching/model.py new file mode 100644 index 0000000000000000000000000000000000000000..8c28ac57b70d237dbe7369af58ed4906f8579850 --- /dev/null +++ b/examples/text_matching/question_matching/model.py @@ -0,0 +1,47 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle.nn as nn + +import paddlenlp as ppnlp + + +class QuestionMatching(nn.Layer): + def __init__(self, pretrained_model, dropout=None, rdrop_coef=0.0): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # num_labels = 2 (similar or dissimilar) + self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2) + self.rdrop_coef = rdrop_coef + self.rdrop_loss = ppnlp.losses.RDropLoss() + + def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, do_evaluate=False): + + _, cls_embedding1 = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + cls_embedding1 = self.dropout(cls_embedding1) + logits1 = self.classifier(cls_embedding1) + + # For more information about R-drop please refer to this paper: https://arxiv.org/abs/2106.14448 + # Original implementation please refer to this code: https://github.com/dropreg/R-Drop + if self.rdrop_coef > 0 and not do_evaluate: + _, cls_embedding2 = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + cls_embedding2 = self.dropout(cls_embedding2) + logits2 = self.classifier(cls_embedding2) + kl_loss = self.rdrop_loss(logits1, logits2) + else: + kl_loss = 0.0 + + return logits1, kl_loss diff --git a/examples/text_matching/question_matching/predict.py b/examples/text_matching/question_matching/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..3d6f350a48f7bdb4603c23c18b7c4ebe122b4922 --- /dev/null +++ b/examples/text_matching/question_matching/predict.py @@ -0,0 +1,103 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
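+
+"""Generates 0/1 predictions for a test file of text pairs with a trained QuestionMatching checkpoint and writes them to the result file."""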
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair +from model import QuestionMatching + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--result_file", type=str, required=True, help="The result file name") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=256, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`QuestionMatching`): A model to calculate whether the question pair is semantic similar or not. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + batch_logits = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + batch_logit, _ = model(input_ids=input_ids, token_type_ids=token_type_ids) + + batch_logits.append(batch_logit.numpy()) + + batch_logits = np.concatenate(batch_logits, axis=0) + + return batch_logits + + +if __name__ == "__main__": + paddle.set_device(args.device) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment_ids + ): [data for data in fn(samples)] + + test_ds = load_dataset(read_text_pair, data_path=args.input_file, is_test=True, lazy=False) + + test_data_loader = create_dataloader( + test_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = QuestionMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + y_probs = predict(model, test_data_loader) + y_preds = np.argmax(y_probs, axis=1) + + with open(args.result_file, "w", encoding="utf-8") as f: + for y_pred in y_preds: + f.write(str(y_pred) + "\n") diff --git a/examples/text_matching/question_matching/train.py b/examples/text_matching/question_matching/train.py new file mode 100644 index 
0000000000000000000000000000000000000000..336dc663666215da4318074180cfc83cdcbfd3d3 --- /dev/null +++ b/examples/text_matching/question_matching/train.py @@ -0,0 +1,196 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair +from model import QuestionMatching + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--train_set", type=str, required=True, help="The full path of train_set_file") +parser.add_argument("--dev_set", type=str, required=True, help="The full path of dev_set_file") +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=256, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument('--max_steps', default=-1, type=int, help="If > 0, set total number of training steps to perform.") +parser.add_argument("--train_batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--eval_batch_size", default=128, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=100, type=int, help="Step interval for evaluation.") +parser.add_argument('--save_step', default=10000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--rdrop_coef", default=0.0, type=float, help="The coefficient of KL-Divergence loss in R-Drop paper, for more detail please refer to https://arxiv.org/abs/2106.14448), if rdrop_coef > 0 then R-Drop works") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, 
criterion, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + """ + model.eval() + metric.reset() + losses = [] + total_num = 0 + + for batch in data_loader: + input_ids, token_type_ids, labels = batch + total_num += len(labels) + logits, _ = model(input_ids=input_ids, token_type_ids=token_type_ids, do_evaluate=True) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + accu = metric.accumulate() + + print("dev_loss: {:.5}, accuracy: {:.5}, total_num:{}".format(np.mean(losses), accu, total_num)) + model.train() + metric.reset() + return accu + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset(read_text_pair, data_path=args.train_set, is_test=False, lazy=False) + + dev_ds = load_dataset(read_text_pair, data_path=args.dev_set, is_test=False, lazy=False) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_pair_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.train_batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.eval_batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = QuestionMatching(pretrained_model, rdrop_coef=args.rdrop_coef) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + + metric = paddle.metric.Accuracy() + + global_step = 0 + best_accuracy = 0.0 + + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, labels = batch + logits1, kl_loss = model(input_ids=input_ids, token_type_ids=token_type_ids) + correct = metric.compute(logits1, labels) + metric.update(correct) + acc = metric.accumulate() + + ce_loss = criterion(logits1, labels) + if kl_loss > 0: + loss = ce_loss + kl_loss * args.rdrop_coef + else: + loss = ce_loss + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.4f, ce_loss: %.4f., kl_loss: %.4f, accu: %.4f, speed: %.2f step/s" + % (global_step, epoch, step, loss, ce_loss, kl_loss, acc, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + accuracy = evaluate(model, criterion, metric, dev_data_loader) + if accuracy > best_accuracy: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + best_accuracy = accuracy + + if global_step == args.max_steps: + return + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_matching/sentence_transformers/README.md b/examples/text_matching/sentence_transformers/README.md new file mode 100644 index 0000000000000000000000000000000000000000..53133e788470860ae9f7818befbfee10535ca0e4 --- /dev/null +++ b/examples/text_matching/sentence_transformers/README.md @@ -0,0 +1,182 @@ +# 使用预训练模型Fine-tune完成中文文本匹配任务 + +随着深度学习的发展,模型参数的数量飞速增长。为了训练这些参数,需要更大的数据集来避免过拟合。然而,对于大部分NLP任务来说,构建大规模的标注数据集非常困难(成本过高),特别是对于句法和语义相关的任务。相比之下,大规模的未标注语料库的构建则相对容易。为了利用这些数据,我们可以先从其中学习到一个好的表示,再将这些表示应用到其他任务中。最近的研究表明,基于大规模未标注语料库的预训练模型(Pretrained Models, PTM) 在NLP任务上取得了很好的表现。 + +近年来,大量的研究表明基于大型语料库的预训练模型(Pretrained Models, PTM)可以学习通用的语言表示,有利于下游NLP任务,同时能够避免从零开始训练模型。随着计算能力的发展,深度模型的出现(即 Transformer)和训练技巧的增强使得 PTM 不断发展,由浅变深。 + +百度的预训练模型ERNIE经过海量的数据训练后,其特征抽取的工作已经做的非常好。借鉴迁移学习的思想,我们可以利用其在海量数据中学习的语义信息辅助小数据集(如本示例中的医疗文本数据集)上的任务。 + +<center> <img width="600px" src="https://ai-studio-static-online.cdn.bcebos.com/d96c602338044ee8bcd4171f38ea6d49506d1f3253f3496b802ec56cb654ecf5" /> </center> + +使用预训练模型ERNIE完成文本匹配任务,大家可能会想到将query和title文本拼接,之后输入ERNIE中,取`CLS`特征(pooled_output),之后输出全连接层,进行二分类。如下图ERNIE用于句对分类任务的用法: + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/45440029c07240ad89d665c5b176e63297e9584e1da24e02b79dd54fb990f74a" width='30%'/> <br /> +</p> + +然而,以上用法的问题在于,**ERNIE的模型参数非常庞大,导致计算量非常大,预测的速度也不够理想**。从而达不到线上业务的要求。针对该问题,可以使用PaddleNLP工具搭建Sentence Transformer网络。 + +<p align="center"> +<img src="https://ai-studio-static-online.cdn.bcebos.com/103998703e134a7184883511a538620e16fed045e2614dcc8afacec446600438" width='30%'/> <br /> +</p> + +Sentence Transformer采用了双塔(Siamese)的网络结构。Query和Title分别输入ERNIE,共享一个ERNIE参数,得到各自的token embedding特征。之后对token 
embedding进行pooling(此处教程使用mean pooling操作),将输出分别记作u,v,再将三个表征(u,v,|u-v|)拼接起来进行二分类。网络结构如上图所示。
+
+更多关于Sentence Transformer的信息可以参考论文:https://arxiv.org/abs/1908.10084
+
+**同时,不仅可以使用ERNIE作为文本语义特征提取器,也可以使用BERT/RoBERTa/Electra等模型作为文本语义特征提取器。**
+
+**那么Sentence Transformer采用Siamese的网络结构,是如何提升预测速度的呢?**
+
+**Siamese网络结构的好处在于query和title分别输入同一套网络。例如在信息检索任务中,可以将数据库中的title文本提前计算好对应的sequence_output特征并保存在数据库中。当用户搜索query时,只需计算query的sequence_output特征,再与数据库中保存的title sequence_output特征一起,通过一个简单的mean pooling和全连接层完成二分类即可,从而大幅提升预测效率,同时也保障了模型效果。**
+
+关于匹配任务常用的Siamese网络结构可以参考:https://blog.csdn.net/thriving_fcl/article/details/73730552
+
+PaddleNLP提供了丰富的预训练模型,并且可以便捷地获取PaddlePaddle生态下的所有预训练模型。下面展示如何使用PaddleNLP一键加载ERNIE,优化文本匹配任务。
+
+## 模型简介
+
+本项目针对中文文本匹配问题,开源了一系列模型,供用户可配置地使用:
+
++ BERT([Bidirectional Encoder Representations from Transformers](https://arxiv.org/abs/1810.04805))中文模型,简写`bert-base-chinese`,由12层Transformer网络组成。
++ ERNIE([Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223)),支持ERNIE 1.0中文模型(简写`ernie-1.0`)和ERNIE Tiny中文模型(简写`ernie-tiny`)。
+   其中`ernie`由12层Transformer网络组成,`ernie-tiny`由3层Transformer网络组成。
++ RoBERTa([A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)),支持12层Transformer网络的`roberta-wwm-ext`。
+
+| 模型 | dev acc | test acc |
+| ---- | ------- | -------- |
+| bert-base-chinese | 0.86537 | 0.84440 |
+| bert-wwm-chinese | 0.86333 | 0.84128 |
+| bert-wwm-ext-chinese | 0.86049 | 0.83848 |
+| ernie-1.0 | 0.87480 | 0.84760 |
+| ernie-tiny | 0.86071 | 0.83352 |
+| roberta-wwm-ext | 0.87526 | 0.84904 |
+| rbt3 | 0.85367 | 0.83464 |
+| rbtl3 | 0.85174 | 0.83744 |
+
+## 快速开始
+
+### 代码结构说明
+
+以下是本项目主要代码结构及说明:
+
+```text
+sentence_transformers/
+├── model.py # Sentence Transformer 组网文件
+├── README.md # 文本说明
+└── train.py # 模型训练评估
+```
+
+### 模型训练
+
+我们以中文文本匹配公开数据集LCQMC为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)上验证。
+```shell
+$ unset CUDA_VISIBLE_DEVICES
+$ python -m paddle.distributed.launch --gpus "0" train.py --device gpu --save_dir ./checkpoints
+```
+
+可支持配置的参数:
+
+* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。
+* `max_seq_length`:可选,ERNIE/BERT模型使用的最大序列长度,最大不能超过512,若出现显存不足,请适当调低这一参数;默认为128。
+* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。
+* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。
+* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。
+* `epochs`: 训练轮次,默认为3。
+* `warmup_proportion`:可选,学习率warmup策略的比例,如果设为0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate,而后再缓慢衰减;默认为0.0。
+* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。
+* `seed`:可选,随机种子,默认为1000。
+* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。
+
+代码示例中使用的预训练模型是ERNIE,如果想要使用其他预训练模型如BERT、RoBERTa、Electra等,只需更换`model`和`tokenizer`即可。
+
+```python
+# 使用 ERNIE 预训练模型
+# ernie-3.0-medium-zh
+model = AutoModel.from_pretrained('ernie-3.0-medium-zh')
+tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+
+# ernie-1.0
+# model = AutoModel.from_pretrained('ernie-1.0-base-zh')
+# tokenizer = AutoTokenizer.from_pretrained('ernie-1.0-base-zh')
+
+# ernie-tiny
+# model = AutoModel.from_pretrained('ernie-tiny')
+# tokenizer = AutoTokenizer.from_pretrained('ernie-tiny')
+
+
+# 使用 BERT 预训练模型
+# bert-base-chinese
+# model = AutoModel.from_pretrained('bert-base-chinese')
+# tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
+
+# bert-wwm-chinese
+# model = AutoModel.from_pretrained('bert-wwm-chinese')
+# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-chinese')
+
+# bert-wwm-ext-chinese
+# model = AutoModel.from_pretrained('bert-wwm-ext-chinese')
+# tokenizer = AutoTokenizer.from_pretrained('bert-wwm-ext-chinese')
+
+
+# 使用 RoBERTa 预训练模型
+# roberta-wwm-ext
+# model = AutoModel.from_pretrained('roberta-wwm-ext')
+# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext')
+
+# roberta-wwm-ext-large
+# model = AutoModel.from_pretrained('roberta-wwm-ext-large')
+# tokenizer = AutoTokenizer.from_pretrained('roberta-wwm-ext-large')
+
+```
+更多预训练模型,参考[transformers](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer)。
+
+程序运行时将会自动进行训练、评估、测试。同时训练过程中会自动保存模型在指定的`save_dir`中。
+如:
+```text
+checkpoints/
+├── model_100
+│   ├── model_config.json
+│   ├── model_state.pdparams
+│   ├── tokenizer_config.json
+│   └── vocab.txt
+└── ...
+```
+
+**NOTE:**
+* 如需恢复模型训练,则可以设置`init_from_ckpt`,如`init_from_ckpt=checkpoints/model_100/model_state.pdparams`。
+* 如需使用ernie-tiny模型,则需要提前安装sentencepiece依赖,如`pip install sentencepiece`。
+
+### 模型预测
+
+启动预测:
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python predict.py --device gpu --params_path checkpoints/model_400/model_state.pdparams
+```
+
+待预测数据示例如下:
+
+```text
+世界上什么东西最小 世界上什么东西最小?
+光眼睛大就好看吗 眼睛好看吗?
+小蝌蚪找妈妈怎么样 小蝌蚪找妈妈是谁画的 +``` + +可以直接调用`predict`函数即可输出预测结果。 + +如 + +```text +Data: ['世界上什么东西最小', '世界上什么东西最小?'] Label: similar +Data: ['光眼睛大就好看吗', '眼睛好看吗?'] Label: dissimilar +Data: ['小蝌蚪找妈妈怎么样', '小蝌蚪找妈妈是谁画的'] Label: dissimilar +``` + + +## Reference + +关于Sentence Transformer更多信息参考[www.SBERT.net](https://www.sbert.net)以及论文: +- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP 2019) +- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) (EMNLP 2020) +- [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (arXiv 2020) diff --git a/examples/text_matching/sentence_transformers/deploy/simple_serving/README.md b/examples/text_matching/sentence_transformers/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..37f76205b1e7b6386ecc4f23ae3a656b023c97ba --- /dev/null +++ b/examples/text_matching/sentence_transformers/deploy/simple_serving/README.md @@ -0,0 +1,39 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 +```shell +pip install paddlenlp >= 2.4.4 +``` +## Server服务启动 +### 分类任务启动 +#### 启动分类 Server 服务 +```bash +paddlenlp server server:app --host 0.0.0.0 --port 8189 +``` + +#### 启动分类 Client 服务 +```bash +python client.py +``` + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size`, `prob_limit` 参数 +```python + data = { + 'data': { + 'text': texts, + 'text_pair': text_pairs, + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size, + 'prob_limit': args.prob_limit + } + } +``` diff --git a/examples/text_matching/sentence_transformers/deploy/simple_serving/client.py b/examples/text_matching/sentence_transformers/deploy/simple_serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..08de26f80d8933bb8383e879a292a1eb1c2038fa --- /dev/null +++ b/examples/text_matching/sentence_transformers/deploy/simple_serving/client.py @@ -0,0 +1,44 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +import requests + +parser = argparse.ArgumentParser() +parser.add_argument( + "--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization." 
+)
+parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.")
+parser.add_argument("--prob_limit", default=0.5, type=float, help="Probability threshold above which a pair is predicted as similar.")
+args = parser.parse_args()
+
+url = "http://0.0.0.0:8189/models/text_matching"
+headers = {"Content-Type": "application/json"}
+
+if __name__ == "__main__":
+    texts = ["三亚是一个美丽的城市", "北京烤鸭怎么样"]
+    text_pair = ["三亚是个漂亮的城市", "北京烤鸭多少钱"]
+
+    data = {
+        "data": {
+            "text": texts,
+            "text_pair": text_pair,
+        },
+        "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size, "prob_limit": args.prob_limit},
+    }
+    r = requests.post(url=url, headers=headers, data=json.dumps(data))
+    result_json = json.loads(r.text)
+    print(result_json)
diff --git a/examples/text_matching/sentence_transformers/deploy/simple_serving/server.py b/examples/text_matching/sentence_transformers/deploy/simple_serving/server.py
new file mode 100644
index 0000000000000000000000000000000000000000..61356d61f05e6b8878747c74e6c23c57385f84f7
--- /dev/null
+++ b/examples/text_matching/sentence_transformers/deploy/simple_serving/server.py
@@ -0,0 +1,135 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import numpy as np
+from scipy.special import softmax
+
+from paddlenlp import SimpleServer
+from paddlenlp.data import Pad, Tuple
+from paddlenlp.server import BaseModelHandler, BasePostHandler
+
+
+class TextMatchingModelHandler(BaseModelHandler):
+    def __init__(self):
+        super().__init__()
+
+    @classmethod
+    def process(cls, predictor, tokenizer, data, parameters):
+
+        max_seq_len = 128
+        batch_size = 1
+        if "max_seq_len" in parameters:
+            max_seq_len = parameters["max_seq_len"]
+        if "batch_size" in parameters:
+            batch_size = parameters["batch_size"]
+        text = None
+        if "text" in data:
+            text = data["text"]
+        if text is None:
+            return {}
+        if isinstance(text, str):
+            text = [text]
+        has_pair = False
+        if "text_pair" in data and data["text_pair"] is not None:
+            text_pair = data["text_pair"]
+            if isinstance(text_pair, str):
+                text_pair = [text_pair]
+            if len(text) != len(text_pair):
+                raise ValueError("The length of text and text_pair must be the same.")
+            has_pair = True
+
+        # Get the result of tokenizer
+        examples = []
+        for idx, _ in enumerate(text):
+            if has_pair:
+                text_a = tokenizer(text=text[idx], max_length=max_seq_len)
+                text_b = tokenizer(text=text_pair[idx], max_length=max_seq_len)
+
+                examples.append((text_a["input_ids"], text_b["input_ids"]))
+
+        # Separates data into some batches.
+ batches = [examples[i : i + batch_size] for i in range(0, len(examples), batch_size)] + + def batchify_fn(samples): + return Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + )(samples) + + results = [[]] * predictor._output_num + for batch in batches: + query_input_ids, title_input_ids = batchify_fn(batch) + if predictor._predictor_type == "paddle_inference": + predictor._input_handles[0].copy_from_cpu(query_input_ids) + predictor._input_handles[1].copy_from_cpu(title_input_ids) + predictor._predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in predictor._output_handles] + for i, out in enumerate(output): + results[i].append(out) + print(results) + + # Resolve the logits result and get the predict label and confidence + results_concat = [] + for i in range(0, len(results)): + results_concat.append(np.concatenate(results[i], axis=0)) + + out_dict = {"logits": results_concat[0].tolist(), "data": data} + + return out_dict + + +class TextMatchingPostHandler(BasePostHandler): + def __init__(self): + super().__init__() + + @classmethod + def process(cls, data, parameters): + if "logits" not in data: + raise ValueError( + "The output of model handler do not include the 'logits', " + " please check the model handler output. The model handler output:\n{}".format(data) + ) + + prob_limit = 0.5 + if "prob_limit" in parameters: + prob_limit = parameters["prob_limit"] + logits = data["logits"] + # softmax for probs + logits = softmax(logits, axis=-1) + + print(logits) + + labels = [] + probs = [] + for logit in logits: + if logit[1] > prob_limit: + labels.append(1) + else: + labels.append(0) + probs.append(logit[1]) + + out_dict = {"label": labels, "similarity": probs} + return out_dict + + +app = SimpleServer() +app.register( + task_name="models/text_matching", + model_path="../../export_model", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=TextMatchingModelHandler, + post_handler=TextMatchingPostHandler, + precision="fp32", + device_id=0, +) diff --git a/examples/text_matching/sentence_transformers/export_model.py b/examples/text_matching/sentence_transformers/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..2f132929ee758fd9881db29e5c232237e0a0a696 --- /dev/null +++ b/examples/text_matching/sentence_transformers/export_model.py @@ -0,0 +1,101 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +import paddle.nn as nn + +from paddlenlp.transformers import AutoModel, AutoTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, default="ernie-1.0", help="The path to model parameters to be loaded.") +parser.add_argument( + "--output_path", type=str, default="./export", help="The path of model parameter in static graph to be saved." 
+) +args = parser.parse_args() + + +class SentenceTransformer(nn.Layer): + def __init__(self, pretrained_model, dropout=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + # num_labels = 2 (similar or dissimilar) + self.classifier = nn.Linear(self.ptm.config["hidden_size"] * 3, 2) + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + query_token_embedding, _ = self.ptm( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + query_token_embedding = self.dropout(query_token_embedding) + query_attention_mask = paddle.unsqueeze( + (query_input_ids != self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype), axis=2 + ) + # Set token embeddings to 0 for padding tokens + query_token_embedding = query_token_embedding * query_attention_mask + query_sum_embedding = paddle.sum(query_token_embedding, axis=1) + query_sum_mask = paddle.sum(query_attention_mask, axis=1) + query_mean = query_sum_embedding / query_sum_mask + + title_token_embedding, _ = self.ptm( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + title_token_embedding = self.dropout(title_token_embedding) + title_attention_mask = paddle.unsqueeze( + (title_input_ids != self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype), axis=2 + ) + # Set token embeddings to 0 for padding tokens + title_token_embedding = title_token_embedding * title_attention_mask + title_sum_embedding = paddle.sum(title_token_embedding, axis=1) + title_sum_mask = paddle.sum(title_attention_mask, axis=1) + title_mean = title_sum_embedding / title_sum_mask + + sub = paddle.abs(paddle.subtract(query_mean, title_mean)) + projection = paddle.concat([query_mean, title_mean, sub], axis=-1) + + logits = self.classifier(projection) + + return logits + + +if __name__ == "__main__": + + tokenizer = AutoTokenizer.from_pretrained(args.params_path) + pretrained_model = AutoModel.from_pretrained(args.params_path) + + model = SentenceTransformer(pretrained_model) + model.eval() + + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="query_input_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="title_input_ids"), + ] + # Convert to static graph with specific input description + model = paddle.jit.to_static(model, input_spec=input_spec) + + # Save in static graph model. + save_path = os.path.join(args.output_path, "float32") + paddle.jit.save(model, save_path) diff --git a/examples/text_matching/sentence_transformers/model.py b/examples/text_matching/sentence_transformers/model.py new file mode 100644 index 0000000000000000000000000000000000000000..4e132e557b3237875c9c910a18f96954d88a0f20 --- /dev/null +++ b/examples/text_matching/sentence_transformers/model.py @@ -0,0 +1,69 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + + +class SentenceTransformer(nn.Layer): + def __init__(self, pretrained_model, dropout=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + # num_labels = 2 (similar or dissimilar) + self.classifier = nn.Linear(self.ptm.config["hidden_size"] * 3, 2) + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + query_token_embedding, _ = self.ptm( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + query_token_embedding = self.dropout(query_token_embedding) + query_attention_mask = paddle.unsqueeze( + (query_input_ids != self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype), axis=2 + ) + # Set token embeddings to 0 for padding tokens + query_token_embedding = query_token_embedding * query_attention_mask + query_sum_embedding = paddle.sum(query_token_embedding, axis=1) + query_sum_mask = paddle.sum(query_attention_mask, axis=1) + query_mean = query_sum_embedding / query_sum_mask + + title_token_embedding, _ = self.ptm( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + title_token_embedding = self.dropout(title_token_embedding) + title_attention_mask = paddle.unsqueeze( + (title_input_ids != self.ptm.pad_token_id).astype(self.ptm.pooler.dense.weight.dtype), axis=2 + ) + # Set token embeddings to 0 for padding tokens + title_token_embedding = title_token_embedding * title_attention_mask + title_sum_embedding = paddle.sum(title_token_embedding, axis=1) + title_sum_mask = paddle.sum(title_attention_mask, axis=1) + title_mean = title_sum_embedding / title_sum_mask + + sub = paddle.abs(paddle.subtract(query_mean, title_mean)) + projection = paddle.concat([query_mean, title_mean, sub], axis=-1) + + logits = self.classifier(projection) + + return logits diff --git a/examples/text_matching/sentence_transformers/predict.py b/examples/text_matching/sentence_transformers/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..b9294ff2c0f86d37d4845508639dc0e41f17a92e --- /dev/null +++ b/examples/text_matching/sentence_transformers/predict.py @@ -0,0 +1,158 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
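+
+"""Loads a trained SentenceTransformer checkpoint and predicts similar/dissimilar labels for a few example text pairs."""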
+ +import argparse +import os + +import paddle +from model import SentenceTransformer + +from paddlenlp.data import Pad, Tuple +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, default='./checkpoint/model_2700/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=50, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def convert_example(example, tokenizer, max_seq_length=512): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. And creates a mask from the two sequences passed + to be used in a sequence-pair classification task. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + - pair of sequences: ``[CLS] A [SEP] B [SEP]`` + + A BERT sequence pair mask has the following format: + :: + 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 + | first sequence | second sequence | + + If only one sequence, only returns the first portion of the mask (0's). + + + Args: + example(obj:`list[str]`): List of input data, containing query, title and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + + Returns: + query_input_ids(obj:`list[int]`): The list of query token ids. + query_token_type_ids(obj: `list[int]`): List of query sequence pair mask. + title_input_ids(obj:`list[int]`): The list of title token ids. + title_token_type_ids(obj: `list[int]`): List of title sequence pair mask. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + query, title = example[0], example[1] + + query_encoded_inputs = tokenizer(text=query, max_seq_len=max_seq_length) + query_input_ids = query_encoded_inputs["input_ids"] + query_token_type_ids = query_encoded_inputs["token_type_ids"] + + title_encoded_inputs = tokenizer(text=title, max_seq_len=max_seq_length) + title_input_ids = title_encoded_inputs["input_ids"] + title_token_type_ids = title_encoded_inputs["token_type_ids"] + + return query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids + + +def predict(model, data, tokenizer, label_map, batch_size=1): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `se_len`(sequence length). + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. 
Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + examples = [] + for text_pair in data: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = convert_example( + text_pair, tokenizer, max_seq_length=args.max_seq_length + ) + examples.append((query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids)) + + # Separates data into some batches. + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + results = [] + model.eval() + for batch in batches: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batchify_fn(batch) + + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + + probs = model( + query_input_ids, + title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # ErnieTinyTokenizer is special for ernie-tiny pretained model. + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + data = [ + ["世界上什么东西最小", "世界上什么东西最小?"], + ["光眼睛大就好看吗", "眼睛好看吗?"], + ["小蝌蚪找妈妈怎么样", "小蝌蚪找妈妈是谁画的"], + ] + label_map = {0: "dissimilar", 1: "similar"} + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + model = SentenceTransformer(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + results = predict(model, data, tokenizer, label_map, batch_size=args.batch_size) + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/text_matching/sentence_transformers/train.py b/examples/text_matching/sentence_transformers/train.py new file mode 100644 index 0000000000000000000000000000000000000000..d681bab0f91b38de06ac0bce5e450fb663d9e5fb --- /dev/null +++ b/examples/text_matching/sentence_transformers/train.py @@ -0,0 +1,239 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
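+
+"""Fine-tunes a Siamese (Sentence Transformer style) text matching model on the LCQMC dataset and evaluates it on the dev set."""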
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from model import SentenceTransformer + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. + """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, labels = batch + logits = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + accu = metric.accumulate() + print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu)) + model.train() + metric.reset() + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + """ + Builds model inputs from a sequence or a pair of sequence for sequence classification tasks + by concatenating and adding special tokens. And creates a mask from the two sequences passed + to be used in a sequence-pair classification task. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + - pair of sequences: ``[CLS] A [SEP] B [SEP]`` + + A BERT sequence pair mask has the following format: + :: + 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 + | first sequence | second sequence | + + If only one sequence, only returns the first portion of the mask (0's). 
+ + + Args: + example(obj:`list[str]`): List of input data, containing query, title and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + query_input_ids(obj:`list[int]`): The list of query token ids. + query_token_type_ids(obj: `list[int]`): List of query sequence pair mask. + title_input_ids(obj:`list[int]`): The list of title token ids. + title_token_type_ids(obj: `list[int]`): List of title sequence pair mask. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + query, title = example["query"], example["title"] + + query_encoded_inputs = tokenizer(text=query, max_seq_len=max_seq_length) + query_input_ids = query_encoded_inputs["input_ids"] + query_token_type_ids = query_encoded_inputs["token_type_ids"] + + title_encoded_inputs = tokenizer(text=title, max_seq_len=max_seq_length) + title_input_ids = title_encoded_inputs["input_ids"] + title_token_type_ids = title_encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, label + else: + return query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"]) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + model = SentenceTransformer(pretrained_model) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = 
paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, labels = batch + logits = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ) + loss = criterion(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + acc = metric.accumulate() + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, acc, 10 / (time.time() - tic_train)) + ) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % 100 == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + evaluate(model, criterion, metric, dev_data_loader) + save_param_path = os.path.join(save_dir, "model_state.pdparams") + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_matching/simbert/README.md b/examples/text_matching/simbert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..751cfaa17067643c71aa4f85fbf5cef45d301a31 --- /dev/null +++ b/examples/text_matching/simbert/README.md @@ -0,0 +1,50 @@ +# SimBERT模型 + +## 模型简介 +[SimBERT](https://github.com/ZhuiyiTechnology/simbert)的模型权重是以Google开源的BERT模型为基础,基于微软的UniLM思想设计了融检索与生成于一体的任务,来进一步微调后得到的模型,所以它同时具备相似问生成和相似句检索能力。 + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +simbert/ +├── data.py #训练样本的数据加载以及转换 +├── predict.py # 模型预测 +└── README.md # 文档说明 +``` + +### 模型预测 + +启动预测: +```shell +export CUDA_VISIBLE_DEVICES=0 +python predict.py --input_file ./datasets/lcqmc/dev.tsv +``` + +待预测数据如以下示例: + + +```text +世界上什么东西最小 世界上什么东西最小? +光眼睛大就好看吗 眼睛好看吗? 
+小蝌蚪找妈妈怎么样 小蝌蚪找妈妈是谁画的
+```
+
+按照 predict.py 进行预测,得到相似度结果。
+
+如:
+
+```text
+{'query': '世界上什么东西最小', 'title': '世界上什么东西最小?', 'similarity': 0.992725}
+{'query': '光眼睛大就好看吗', 'title': '眼睛好看吗?', 'similarity': 0.74502724}
+{'query': '小蝌蚪找妈妈怎么样', 'title': '小蝌蚪找妈妈是谁画的', 'similarity': 0.8192148}
+```
+
+## Reference
+
+关于 SimBERT 的更多信息,请参考[科学空间](https://spaces.ac.cn/archives/7427)。
+
+SimBERT 项目地址:https://github.com/ZhuiyiTechnology/simbert
diff --git a/examples/text_matching/simbert/data.py b/examples/text_matching/simbert/data.py
new file mode 100644
index 0000000000000000000000000000000000000000..82b104d282d864a648d06bbbec4313edd8bb6f12
--- /dev/null
+++ b/examples/text_matching/simbert/data.py
@@ -0,0 +1,53 @@
+# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import paddle
+
+
+def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None):
+    if trans_fn:
+        dataset = dataset.map(trans_fn)
+
+    shuffle = True if mode == "train" else False
+    if mode == "train":
+        batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
+    else:
+        batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
+
+    return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True)
+
+
+def read_text_pair(data_path):
+    """Reads data."""
+    with open(data_path, "r", encoding="utf-8") as f:
+        for line in f:
+            data = line.rstrip().split("\t")
+            if len(data) != 2:
+                continue
+            yield {"query": data[0], "title": data[1]}
+
+
+def convert_example(example, tokenizer, max_seq_length=512, phase="train"):
+    """Converts a query-title pair into input ids and token type ids with the given tokenizer."""
+    query, title = example["query"], example["title"]
+
+    query_encoded_inputs = tokenizer(text=query, max_seq_len=max_seq_length)
+    query_input_ids = query_encoded_inputs["input_ids"]
+    query_token_type_ids = query_encoded_inputs["token_type_ids"]
+
+    title_encoded_inputs = tokenizer(text=title, max_seq_len=max_seq_length)
+    title_input_ids = title_encoded_inputs["input_ids"]
+    title_token_type_ids = title_encoded_inputs["token_type_ids"]
+
+    return query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids
diff --git a/examples/text_matching/simbert/predict.py b/examples/text_matching/simbert/predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..4db2a487d1dd12b81d6063e7152f9425a7ddebe9
--- /dev/null
+++ b/examples/text_matching/simbert/predict.py
@@ -0,0 +1,100 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +from functools import partial + +import paddle +from data import convert_example, create_dataloader, read_text_pair + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +# parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +def predict(model, data_loader): + """ + Predicts the similarity. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + results = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + + vecs_query = model(input_ids=query_input_ids, token_type_ids=query_token_type_ids) + vecs_title = model(input_ids=title_input_ids, token_type_ids=title_token_type_ids) + vecs_query = vecs_query[1].numpy() + vecs_title = vecs_title[1].numpy() + + vecs_query = vecs_query / (vecs_query**2).sum(axis=1, keepdims=True) ** 0.5 + vecs_title = vecs_title / (vecs_title**2).sum(axis=1, keepdims=True) ** 0.5 + sims = (vecs_query * vecs_title).sum(axis=1) + + results.extend(sims) + + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + model = AutoModel.from_pretrained("simbert-base-chinese", pool_act="linear") + tokenizer = AutoTokenizer.from_pretrained("simbert-base-chinese") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, phase="predict") + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + y_sims = predict(model, valid_data_loader) + + valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) + + for idx, prob in enumerate(y_sims): + text_pair = 
valid_ds[idx] + text_pair["similarity"] = y_sims[idx] + print(text_pair) diff --git a/examples/text_matching/simcse/README.md b/examples/text_matching/simcse/README.md new file mode 100644 index 0000000000000000000000000000000000000000..10ed7c5f5f63407bfc384f9918b46225091ce434 --- /dev/null +++ b/examples/text_matching/simcse/README.md @@ -0,0 +1,126 @@ +# 无监督语义匹配模型 [SimCSE](https://arxiv.org/abs/2104.08821) + +我们实现了 SimCSE 模型,并借鉴 ESimCSE 论文思想,通过 Word Repetition(WR) 策略进一步提升了 SimCSE 模型效果,在 4 个权威中文语义匹配数据集上做了充分效果评测。SimCSE 模型适合缺乏监督数据,但是又有大量无监督数据的匹配和检索场景。 + +## 效果评估 +本项目分别使用 LCQMC、BQ_Corpus、STS-B、ATEC 这 4 个中文语义匹配数据集的训练集作为无监督训练集(仅使用文本信息,不使用 Label),并且在各自数据集上的验证集上进行效果评估,评估指标采用 SimCSE 论文中采用的 Spearman 相关系数,Spearman 相关系数越高,表示模型效果越好。中文数据集的下载地址为:[下载地址](https://paddlenlp.bj.bcebos.com/datasets/senteval_cn.zip) + +### 中文语义匹配数据集效果 + +| 模型| LCQMC | BQ_Corpus|STS-B|ATEC| +|-------|-------|-----|------|-----| +|SimCSE| 57.01 | **51.72** | 74.76 | 33.56 | +| SimCSE + WR| **58.97** | 51.58 | **78.32** | **33.73** | + +SimCSE + WR 策略在中文数据集训练的超参数设置如下: + +| 数据集|epoch | learning rate | dropout|batch size| dup rate| +|-------|-------|-----|------|-----|-----| +|LCQMC|1| 5E-5 | 0.3 |64| 0.32 | +|BQ_Corpus|1| 1E-5 | 0.3 |64|0.32 | +|STS-B|8| 5E-5 | 0.1 |64| 0.32 | +|ATEC|1| 5E-5 | 0.3 | 64| 0.32 | + + + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +``` +simcse/ +├── model.py # SimCSE 模型组网代码 +├── data.py # 无监督语义匹配训练数据、测试数据的读取逻辑 +├── predict.py # 基于训练好的无监督语义匹配模型计算文本 Pair 相似度 +├── train.sh # 模型训练的脚本 +└── train.py # SimCSE 模型训练、评估逻辑 +``` + +### 模型训练 +我们以中文文本匹配公开数据集 LCQMC 为示例数据集, 仅使用 LCQMC 的文本数据构造生成了无监督的训练数据。可以运行如下命令,开始模型训练并且在 LCQMC 的验证集上进行 Spearman 相关系数评估。 + +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus '0' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 1 \ + --save_steps 100 \ + --eval_steps 100 \ + --max_seq_length 64 \ + --dropout 0.3 \ + --train_set_file "./senteval_cn/LCQMC/train.txt" \ + --test_set_file "./senteval_cn/LCQMC/dev.tsv" +``` + +可支持配置的参数: + +* `infer_with_fc_pooler`:可选,在预测阶段计算文本 embedding 表示的时候网络前向是否会过训练阶段最后一层的 fc; 建议关闭模型效果最好。 +* `dup_rate`: 可选,word reptition 的比例,默认是0.32,根据论文 Word Repetition 比例采用 0.32 效果最佳。 +* `scale`:可选,在计算 cross_entropy loss 之前对 cosine 相似度进行缩放的因子;默认为 20。 +* `dropout`:可选,SimCSE 网络前向使用的 dropout 取值;默认 0.1。 +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_length`:可选,ERNIE-Gram 模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `epochs`: 训练轮次,默认为1。 +* `warmup_proption`:可选,学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. +* `device`: 选用什么设备进行训练,可选cpu、gpu或npu。如使用gpu训练则参数gpus指定GPU卡号。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── model_100 +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   └── vocab.txt +└── ... +``` + +**NOTE:** +* 如需恢复模型训练,则可以设置`init_from_ckpt`, 如`init_from_ckpt=checkpoints/model_100/model_state.pdparams`。 + +### 基于动态图模型预测 + +我们用 LCQMC 的测试集作为预测数据, 测试数据示例如下,: +```text +谁有狂三这张高清的 这张高清图,谁有 +英雄联盟什么英雄最好 英雄联盟最好英雄是什么 +这是什么意思,被蹭网吗 我也是醉了,这是什么意思 +现在有什么动画片好看呢? 现在有什么好看的动画片吗? 
+请问晶达电子厂现在的工资待遇怎么样要求有哪些 三星电子厂工资待遇怎么样啊 +``` + +执行如下命令开始预测: +```shell +python -u -m paddle.distributed.launch --gpus "0" \ + predict.py \ + --device gpu \ + --params_path "./checkpoints/model_4400/model_state.pdparams"\ + --batch_size 64 \ + --max_seq_length 64 \ + --text_pair_file 'test.tsv' +``` + +输出预测结果如下: +```text +0.7201147675514221 +0.9010907411575317 +0.5393891334533691 +0.9698929786682129 +0.6056119203567505 +``` + +## Reference +[1] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” ArXiv:2104.08821 [Cs], April 18, 2021. http://arxiv.org/abs/2104.08821. + +[2] Wu, Xing, et al. "ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding." arXiv preprint arXiv:2109.04380 (2021). https://arxiv.org/abs/2109.04380. diff --git a/examples/text_matching/simcse/data.py b/examples/text_matching/simcse/data.py new file mode 100644 index 0000000000000000000000000000000000000000..6f1f49d93a8fd1f7288e508355219784c631a81c --- /dev/null +++ b/examples/text_matching/simcse/data.py @@ -0,0 +1,135 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random + +import numpy as np +import paddle + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + + for key, text in example.items(): + if "label" in key: + # do_evaluate + result += [example["label"]] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +def read_simcse_text(data_path): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip() + yield {"text_a": data, "text_b": data} + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + data = line.rstrip().split("\t") + if is_test is False: + if len(data) != 3: + continue + yield {"text_a": data[0], "text_b": data[1], "label": data[2]} + else: + if len(data) != 2: + continue + yield {"text_a": data[0], "text_b": data[1]} + + +def word_repetition(input_ids, token_type_ids, dup_rate=0.32): + """Word Repetition strategy.""" + input_ids = input_ids.numpy().tolist() + token_type_ids = token_type_ids.numpy().tolist() + + batch_size, seq_len = len(input_ids), len(input_ids[0]) + repetitied_input_ids = [] + repetitied_token_type_ids = [] + rep_seq_len = seq_len + for batch_id in range(batch_size): + cur_input_id = input_ids[batch_id] + actual_len = np.count_nonzero(cur_input_id) + dup_word_index = [] + # If sequence length is less than 5, skip it + if actual_len > 5: + dup_len = random.randint(a=0, b=max(2, int(dup_rate * actual_len))) + # Skip cls and sep position + dup_word_index = random.sample(list(range(1, actual_len - 1)), k=dup_len) + + r_input_id = [] + r_token_type_id = [] + for idx, word_id in enumerate(cur_input_id): + # Insert duplicate word + if idx in dup_word_index: + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + r_input_id.append(word_id) + r_token_type_id.append(token_type_ids[batch_id][idx]) + after_dup_len = len(r_input_id) + repetitied_input_ids.append(r_input_id) + repetitied_token_type_ids.append(r_token_type_id) + + if after_dup_len > rep_seq_len: + rep_seq_len = after_dup_len + # Padding the data to the same length + for batch_id in range(batch_size): + after_dup_len = len(repetitied_input_ids[batch_id]) + pad_len = rep_seq_len - after_dup_len + repetitied_input_ids[batch_id] += [0] * pad_len + repetitied_token_type_ids[batch_id] += [0] * pad_len + + return paddle.to_tensor(repetitied_input_ids), paddle.to_tensor(repetitied_token_type_ids) diff --git a/examples/text_matching/simcse/model.py b/examples/text_matching/simcse/model.py new file mode 100644 index 0000000000000000000000000000000000000000..59364f2035807c23d29a77d802f2bda23fb8ceac --- /dev/null +++ b/examples/text_matching/simcse/model.py @@ -0,0 +1,119 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
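上文 simcse/data.py 中的 `word_repetition` 实现的是 ESimCSE 提出的 Word Repetition(WR)增强:在每条序列里随机复制少量非 [CLS]/[SEP] 位置的 token,然后把整个 batch 重新补齐。下面是一个假设性的最小调用示意(token id 为占位数字,仅用于观察形状变化),调用方式与后文 train.py 中一致。

```python
import paddle

from data import word_repetition  # the function defined in simcse/data.py above

# A toy, already padded batch: 1/2 stand for [CLS]/[SEP], 0 is padding (placeholder ids).
input_ids = paddle.to_tensor([
    [1, 11, 12, 13, 14, 15, 2, 0],
    [1, 21, 22, 23, 24, 25, 2, 0],
])
token_type_ids = paddle.zeros_like(input_ids)

dup_input_ids, dup_token_type_ids = word_repetition(input_ids, token_type_ids, dup_rate=0.32)

# A few tokens get duplicated, so sequences grow and are re-padded
# to the new maximum length within the batch.
print(input_ids.shape, "->", dup_input_ids.shape)
```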
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SimCSE(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.0, scale=20, output_emb_size=None): + + super().__init__() + + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off between + # recall performance and efficiency + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear(768, output_emb_size, weight_attr=weight_attr) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def get_pooled_embedding( + self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None, with_pooler=True + ): + + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask) + + if with_pooler is False: + cls_embedding = sequence_output[:, 0, :] + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def cosine_sim( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + with_pooler=True, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask, with_pooler=with_pooler + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask, with_pooler=with_pooler + ) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + return cosine_sim + + def forward( + self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask + ) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask + ) + + cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype() + ) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64") + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/examples/text_matching/simcse/predict.py b/examples/text_matching/simcse/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..c94eaf0164bb85f112a0e3e4d2b102a1639314c2 --- /dev/null +++ b/examples/text_matching/simcse/predict.py @@ 
-0,0 +1,116 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import numpy as np +import paddle +from data import convert_example, create_dataloader, read_text_pair +from model import SimCSE + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', choices=['cpu', 'gpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") + +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SimCSE`): A model to extract text embedding or calculate similarity of text pair. + data_loader (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. 
+ """ + + cosine_sims = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids, + ).numpy() + + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset(read_text_pair, data_path=args.text_pair_file, lazy=False, is_test=True) + + valid_data_loader = create_dataloader( + valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") + + model = SimCSE(pretrained_model, margin=args.margin, scale=args.scale, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError("Please set --params_path with correct pretrained model file") + + cosin_sim = predict(model, valid_data_loader) + + for idx, cosine in enumerate(cosin_sim): + print("{}".format(cosine)) diff --git a/examples/text_matching/simcse/train.py b/examples/text_matching/simcse/train.py new file mode 100644 index 0000000000000000000000000000000000000000..d279be57b258daa5fe5b2d9608d153f6f83c1ab7 --- /dev/null +++ b/examples/text_matching/simcse/train.py @@ -0,0 +1,233 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
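下面的 train.py 优化的正是上文 model.py 中 `SimCSE.forward` 定义的目标:同一 batch 内,第 i 条 query 的正例是第 i 条 title,其余 title 充当负例(in-batch negatives),再对缩放后的余弦相似度矩阵做交叉熵。下面用随机向量给出该损失的等价计算示意,其中 `batch_size`、`emb_size`、`scale`、`margin` 均为示例取值,并非脚本的固定配置。

```python
import paddle
import paddle.nn.functional as F

batch_size, emb_size, scale, margin = 4, 256, 20.0, 0.0

# Stand-ins for the L2-normalized query/title [CLS] embeddings produced by the encoder.
query_emb = F.normalize(paddle.randn([batch_size, emb_size]), axis=-1)
title_emb = F.normalize(paddle.randn([batch_size, emb_size]), axis=-1)

# Row i holds the cosine similarities of query i against every title in the batch.
cosine_sim = paddle.matmul(query_emb, title_emb, transpose_y=True)

# Optionally subtract a margin from the positive (diagonal) entries, then scale.
cosine_sim -= paddle.diag(paddle.full([batch_size], margin, dtype=cosine_sim.dtype))
cosine_sim *= scale

# Query i's positive is title i, so the label of row i is simply i.
labels = paddle.arange(0, batch_size, dtype="int64").reshape([-1, 1])
loss = F.cross_entropy(input=cosine_sim, label=labels)
print(loss.numpy())
```

正例恰好落在相似度矩阵的对角线上,因此标签就是行号;`scale` 把余弦值放大,便于交叉熵损失更快收敛。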
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data import ( + convert_example, + create_dataloader, + read_simcse_text, + read_text_pair, + word_repetition, +) +from model import SimCSE +from scipy import stats + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoModel, AutoTokenizer, LinearDecayWithWarmup + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization." + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override ecpochs.") +parser.add_argument('--eval_steps', type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--test_set_file", type=str, required=True, help="The full path of test_set_file.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin between pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout for pretrained model encoder.") +parser.add_argument("--dup_rate", default=0.32, type=float, help="duplicate rate for word repetition.") +parser.add_argument("--infer_with_fc_pooler", action='store_true', help="Whether use fc layer after cls embedding or not for when infer.") + +args = parser.parse_args() + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_evaluate(model, tokenizer, data_loader, with_pooler=False): + model.eval() + + total_num = 0 + spearman_corr = 0.0 + sims = [] + labels = [] + + for batch in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, label = batch + total_num += len(label) + + query_cls_embedding = model.get_pooled_embedding( + query_input_ids, query_token_type_ids, with_pooler=with_pooler) + + title_cls_embedding = model.get_pooled_embedding(title_input_ids, title_token_type_ids, with_pooler=with_pooler) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + + sims.append(cosine_sim.numpy()) + labels.append(label.numpy()) + + sims = np.concatenate(sims, axis=0) + labels = np.concatenate(labels, axis=0) + + spearman_corr = stats.spearmanr(labels, sims).correlation + model.train() + return spearman_corr, total_num + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset( + read_simcse_text, data_path=args.train_set_file, lazy=False) + + dev_ds = load_dataset( + read_text_pair, data_path=args.test_set_file, lazy=False) + + pretrained_model = AutoModel.from_pretrained( + 'ernie-3.0-medium-zh', + hidden_dropout_prob=args.dropout, + attention_probs_dropout_prob=args.dropout) + + tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + ): [data for data in fn(samples)] + + dev_batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # title_segment + Stack(dtype="int64"), # labels + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, + mode='train', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + dev_data_loader = create_dataloader( + dev_ds, + 
mode='eval', + batch_size=args.batch_size, + batchify_fn=dev_batchify_fn, + trans_fn=trans_func) + + model = SimCSE( + pretrained_model, + margin=args.margin, + scale=args.scale, + output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len( + train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + if args.dup_rate > 0: + query_input_ids, query_token_type_ids = word_repetition(query_input_ids, query_token_type_ids, args.dup_rate) + title_input_ids, title_token_type_ids = word_repetition(title_input_ids, title_token_type_ids, args.dup_rate) + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, + 10 / (time.time() - tic_train))) + tic_train = time.time() + + if global_step % args.eval_steps == 0 and rank == 0: + # need better way to get model Layers + spearman_corr, total_num = do_evaluate(model._layers, tokenizer, dev_data_loader, args.infer_with_fc_pooler) + print("global step: {}, spearman_corr: {:.4f}, total_num: {}".format(global_step, spearman_corr, total_num)) + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + if args.max_steps > 0 and global_step >= args.max_steps: + return + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_matching/simcse/train.sh b/examples/text_matching/simcse/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..145be555a3b1eec1132d0beadf2abafa7b3b144d --- /dev/null +++ b/examples/text_matching/simcse/train.sh @@ -0,0 +1,15 @@ +python -u -m paddle.distributed.launch --gpus '4' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 8 \ + --save_steps 2000 \ + --eval_steps 100 \ + --max_seq_length 64 \ + --dropout 0.3 \ + --output_emb_size 256 \ + --dup_rate 0.32 \ + --train_set_file "./senteval_cn/STS-B/train.txt" \ + --test_set_file "./senteval_cn/STS-B/dev.tsv" \ No newline at 
end of file diff --git a/examples/text_matching/simnet/README.md b/examples/text_matching/simnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9b346888e8a40e659530762cb0a0fc3d7318bf03 --- /dev/null +++ b/examples/text_matching/simnet/README.md @@ -0,0 +1,168 @@ +# 使用SimNet完成文本匹配任务 + +短文本语义匹配(SimilarityNet, SimNet)是一个计算短文本相似度的框架,可以根据用户输入的两个文本,计算出相似度得分。 +SimNet框架在百度各产品上广泛应用,主要包括BOW、CNN、RNN、MMDNN等核心网络结构形式,提供语义相似度计算训练和预测框架, +适用于信息检索、新闻推荐、智能客服等多个应用场景,帮助企业解决语义匹配问题。 +可通过[AI开放平台-短文本相似度](https://ai.baidu.com/tech/nlp_basic/simnet)线上体验。 + +## 模型简介 + + +本项目通过调用[Seq2Vec](../../../paddlenlp/seq2vec/)中内置的模型进行序列建模,完成句子的向量表示。包含最简单的词袋模型和一系列经典的RNN类模型。 + +| 模型 | 模型介绍 | +| ------------------------------------------------ | ------------------------------------------------------------ | +| BOW(Bag Of Words) | 非序列模型,将句子表示为其所包含词的向量的加和 | +| CNN | 序列模型,使用卷积操作,提取局部区域地特征 | +| GRU(Gated Recurrent Unit) | 序列模型,能够较好地解决序列文本中长距离依赖的问题 | +| LSTM(Long Short Term Memory) | 序列模型,能够较好地解决序列文本中长距离依赖的问题 | + + +| 模型 | dev acc | test acc | +| ---- | ------- | -------- | +| BoW | 0.7290 | 0.75232 | +| CNN | 0.7042 | 0.73760 | +| GRU | 0.7781 | 0.77808 | +| LSTM | 0.73760 | 0.77320 | + + + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +simnet/ +├── model.py # 模型组网 +├── predict.py # 模型预测 +├── utils.py # 数据处理工具 +├── train.py # 训练模型主程序入口,包括训练、评估 +└── README.md # 文档说明 +``` + +### 数据准备 + +#### 使用PaddleNLP内置数据集 + +```python +from paddlenlp.datasets import load_dataset + +train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"]) +``` + +部分样例数据如下: + +```text +query title label +最近有什么好看的电视剧,推荐一下 近期有什么好看的电视剧,求推荐? 1 +大学生验证仅针对在读学生,已毕业学生不能申请的哦。 通过了大学生验证的用户,可以在支付宝的合作商户,享受学生优惠 0 +如何在网上查户口 如何网上查户口 1 +关于故事的成语 来自故事的成语 1 + 湖北农村信用社手机银行客户端下载 湖北长阳农村商业银行手机银行客户端下载 0 +草泥马是什么动物 草泥马是一种什么动物 1 +``` + +### 模型训练 + +在模型训练之前,需要先下载词汇表文件simnet_vocab.txt,用于构造词-id映射关系。 + +```shell +wget https://bj.bcebos.com/paddlenlp/data/simnet_vocab.txt +``` + +**NOTE:** 词表的选择和实际应用数据相关,需根据实际数据选择词表。 + +我们以中文文本匹配数据集LCQMC为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)验证 + +CPU启动: + +```shell +python train.py --vocab_path='./simnet_vocab.txt' \ + --device=cpu \ + --network=lstm \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=5 \ + --save_dir='./checkpoints' +``` + +GPU启动: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" train.py --vocab_path='./simnet_vocab.txt' \ + --device=gpu \ + --network=lstm \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=5 \ + --save_dir='./checkpoints' +``` + +以上参数表示: + +* `vocab_path`: 词汇表文件路径。 +* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。 +* `network`: 模型网络名称,默认为`lstm`, 可更换为lstm, gru, rnn,bow,cnn等。 +* `lr`: 学习率, 默认为5e-4。 +* `batch_size`: 运行一个batch大小,默认为64。 +* `epochs`: 训练轮次,默认为5。 +* `save_dir`: 训练保存模型的文件路径。 +* `init_from_ckpt`: 恢复模型训练的断点路径。 + + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── 0.pdopt +├── 0.pdparams +├── 1.pdopt +├── 1.pdparams +├── ... 
+└── final.pdparams +``` + +**NOTE:** 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/0`即可,程序会自动加载模型参数`checkpoints/0.pdparams`,也会自动加载优化器状态`checkpoints/0.pdopt`。 + +### 模型预测 + +启动预测 + +CPU启动: + +```shell +python predict.py --vocab_path='./simnet_vocab.txt' \ + --device=cpu \ + --network=lstm \ + --params_path=checkpoints/final.pdparams +``` + +GPU启动: + +```shell +CUDA_VISIBLE_DEVICES=0 python predict.py --vocab_path='./simnet_vocab.txt' \ + --device=gpu \ + --network=lstm \ + --params_path='./checkpoints/final.pdparams' +``` + +将待预测数据分词完毕后,如以下示例: + +```text +世界上什么东西最小 世界上什么东西最小? +光眼睛大就好看吗 眼睛好看吗? +小蝌蚪找妈妈怎么样 小蝌蚪找妈妈是谁画的 +``` + +处理成模型所需的`Tensor`,如可以直接调用`preprocess_prediction_data`函数既可处理完毕。之后传入`predict`函数即可输出预测结果。 + +如 + +```text +Data: ['世界上什么东西最小', '世界上什么东西最小?'] Label: similar +Data: ['光眼睛大就好看吗', '眼睛好看吗?'] Label: dissimilar +Data: ['小蝌蚪找妈妈怎么样', '小蝌蚪找妈妈是谁画的'] Label: dissimilar +``` diff --git a/examples/text_matching/simnet/model.py b/examples/text_matching/simnet/model.py new file mode 100644 index 0000000000000000000000000000000000000000..ae029705140be947cfa2e94f7e475863e3374716 --- /dev/null +++ b/examples/text_matching/simnet/model.py @@ -0,0 +1,219 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + +import paddlenlp as nlp + + +class SimNet(nn.Layer): + def __init__(self, network, vocab_size, num_classes, emb_dim=128, pad_token_id=0): + super().__init__() + + network = network.lower() + if network == "bow": + self.model = BoWModel(vocab_size, num_classes, emb_dim, padding_idx=pad_token_id) + elif network == "cnn": + self.model = CNNModel(vocab_size, num_classes, emb_dim, padding_idx=pad_token_id) + elif network == "gru": + self.model = GRUModel(vocab_size, num_classes, emb_dim, direction="forward", padding_idx=pad_token_id) + elif network == "lstm": + self.model = LSTMModel(vocab_size, num_classes, emb_dim, direction="forward", padding_idx=pad_token_id) + else: + raise ValueError("Unknown network: %s, it must be one of bow, cnn, lstm or gru." % network) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + logits = self.model(query, title, query_seq_len, title_seq_len) + return logits + + +class BoWModel(nn.Layer): + """ + This class implements the Bag of Words Classification Network model to classify texts. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `BoWEncoder`. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + Args: + vocab_size (obj:`int`): The vocabulary size. + emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension. + padding_idx (obj:`int`, optional, defaults to 0) : The pad token index. + hidden_size (obj:`int`, optional, defaults to 128): The first full-connected layer hidden size. 
+ fc_hidden_size (obj:`int`, optional, defaults to 96): The second full-connected layer hidden size. + num_classes (obj:`int`): All the labels that the data has. + """ + + def __init__(self, vocab_size, num_classes, emb_dim=128, padding_idx=0, fc_hidden_size=128): + super().__init__() + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.bow_encoder = nlp.seq2vec.BoWEncoder(emb_dim) + self.fc = nn.Linear(self.bow_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, embedding_dim) + summed_query = self.bow_encoder(embedded_query) + summed_title = self.bow_encoder(embedded_title) + encoded_query = paddle.tanh(summed_query) + encoded_title = paddle.tanh(summed_title) + # Shape: (batch_size, embedding_dim*2) + contacted = paddle.concat([encoded_query, encoded_title], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + return logits + + +class LSTMModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + lstm_hidden_size=128, + direction="forward", + lstm_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=128, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.lstm_encoder = nlp.seq2vec.LSTMEncoder( + emb_dim, lstm_hidden_size, num_layers=lstm_layers, direction=direction, dropout=dropout_rate + ) + self.fc = nn.Linear(self.lstm_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len, title_seq_len): + assert query_seq_len is not None and title_seq_len is not None + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, lstm_hidden_size) + query_repr = self.lstm_encoder(embedded_query, sequence_length=query_seq_len) + title_repr = self.lstm_encoder(embedded_title, sequence_length=title_seq_len) + # Shape: (batch_size, 2*lstm_hidden_size) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + + return logits + + +class GRUModel(nn.Layer): + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + gru_hidden_size=128, + direction="forward", + gru_layers=1, + dropout_rate=0.0, + pooling_type=None, + fc_hidden_size=96, + ): + super().__init__() + self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) + self.gru_encoder = nlp.seq2vec.GRUEncoder( + emb_dim, gru_hidden_size, num_layers=gru_layers, direction=direction, dropout=dropout_rate + ) + self.fc = nn.Linear(self.gru_encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len, title_seq_len): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = 
self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, gru_hidden_size) + query_repr = self.gru_encoder(embedded_query, sequence_length=query_seq_len) + title_repr = self.gru_encoder(embedded_title, sequence_length=title_seq_len) + # Shape: (batch_size, 2*gru_hidden_size) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + + return logits + + +class CNNModel(nn.Layer): + """ + This class implements the + + + Convolution Neural Network model. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `CNNEncoder`. + The CNN has one convolution layer for each ngram filter size. Each convolution operation gives + out a vector of size num_filter. The number of times a convolution layer will be used + is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these + outputs from the convolution layer and outputs the max. + Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + Args: + vocab_size (obj:`int`): The vocabulary size. + emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension. + padding_idx (obj:`int`, optional, defaults to 0) : The pad token index. + num_classes (obj:`int`): All the labels that the data has. + """ + + def __init__( + self, + vocab_size, + num_classes, + emb_dim=128, + padding_idx=0, + num_filter=256, + ngram_filter_sizes=(3,), + fc_hidden_size=128, + ): + super().__init__() + self.padding_idx = padding_idx + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.encoder = nlp.seq2vec.CNNEncoder( + emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes + ) + self.fc = nn.Linear(self.encoder.get_output_dim() * 2, fc_hidden_size) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, query, title, query_seq_len=None, title_seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_query = self.embedder(query) + embedded_title = self.embedder(title) + # Shape: (batch_size, num_filter) + query_repr = self.encoder(embedded_query) + title_repr = self.encoder(embedded_title) + # Shape: (batch_size, 2*num_filter) + contacted = paddle.concat([query_repr, title_repr], axis=-1) + # Shape: (batch_size, fc_hidden_size) + fc_out = paddle.tanh(self.fc(contacted)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc_out) + # probs = F.softmax(logits, axis=-1) + return logits diff --git a/examples/text_matching/simnet/predict.py b/examples/text_matching/simnet/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..3455a239226bab4f5696cabce4bdff0d38185414 --- /dev/null +++ b/examples/text_matching/simnet/predict.py @@ -0,0 +1,109 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +import paddle.nn.functional as F +from model import SimNet +from utils import preprocess_prediction_data + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./simnet_vocab.txt", help="The path to vocabulary.") +parser.add_argument('--network', type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?") +parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data, label_map, batch_size=1, pad_token_id=0): + """ + Predicts the data labels. + + Args: + model (obj:`paddle.nn.Layer`): A model to classify texts. + data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + label_map(obj:`dict`): The label id (key) to label str (value) map. + batch_size(obj:`int`, defaults to 1): The number of batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + + # Separates data into some batches. + batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=pad_token_id), # query_ids + Pad(axis=0, pad_val=pad_token_id), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + ): [data for data in fn(samples)] + + results = [] + model.eval() + for batch in batches: + query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch) + query_ids = paddle.to_tensor(query_ids) + title_ids = paddle.to_tensor(title_ids) + query_seq_lens = paddle.to_tensor(query_seq_lens) + title_seq_lens = paddle.to_tensor(title_seq_lens) + logits = model(query_ids, title_ids, query_seq_lens, title_seq_lens) + probs = F.softmax(logits, axis=1) + idx = paddle.argmax(probs, axis=1).numpy() + idx = idx.tolist() + labels = [label_map[i] for i in idx] + results.extend(labels) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + # Loads vocab. + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + tokenizer = JiebaTokenizer(vocab) + label_map = {0: "dissimilar", 1: "similar"} + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map)) + + # Loads model parameters. + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + # Firstly pre-processing prediction data and then do predict. 
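+    # Each element of `data` below is a [query, title] pair of raw Chinese text;
+    # `preprocess_prediction_data` tokenizes both texts with the JiebaTokenizer created above
+    # and returns [query_ids, title_ids, query_seq_len, title_seq_len] for every pair.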
+ data = [ + ["世界上什么东西最小", "世界上什么东西最小?"], + ["光眼睛大就好看吗", "眼睛好看吗?"], + ["小蝌蚪找妈妈怎么样", "小蝌蚪找妈妈是谁画的"], + ] + examples = preprocess_prediction_data(data, tokenizer) + results = predict( + model, + examples, + label_map=label_map, + batch_size=args.batch_size, + pad_token_id=vocab.token_to_idx.get("[PAD]", 0), + ) + + for idx, text in enumerate(data): + print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/text_matching/simnet/train.py b/examples/text_matching/simnet/train.py new file mode 100644 index 0000000000000000000000000000000000000000..848311004be0881ec76f46f9ef8d1e3c46a64220 --- /dev/null +++ b/examples/text_matching/simnet/train.py @@ -0,0 +1,122 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +from model import SimNet +from utils import convert_example + +from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--vocab_path", type=str, default="./simnet_vocab.txt", help="The directory to dataset.") +parser.add_argument('--network', type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +args = parser.parse_args() +# yapf: enable + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging + the sample list, None for only stack each fields of sample in axis + 0(same as :attr::`np.stack(..., axis=0)`). + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. 
+ """ + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) + else: + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True, collate_fn=batchify_fn) + return dataloader + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # Loads vocab. + if not os.path.exists(args.vocab_path): + raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) + vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") + + # Loads dataset. + train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"]) + + # Constructs the newtork. + model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(train_ds.label_list)) + model = paddle.Model(model) + + # Reads data and generates mini-batches. + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids + Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids + Stack(dtype="int64"), # query_seq_lens + Stack(dtype="int64"), # title_seq_lens + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + tokenizer = JiebaTokenizer(vocab) + trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False) + train_loader = create_dataloader( + train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn + ) + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn + ) + test_loader = create_dataloader( + test_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="test", batchify_fn=batchify_fn + ) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Defines loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Loads pre-trained parameters. + if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + # Starts training and evaluating. + model.fit( + train_loader, + dev_loader, + epochs=args.epochs, + save_dir=args.save_dir, + ) diff --git a/examples/text_matching/simnet/utils.py b/examples/text_matching/simnet/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..5de0876215d191025cc15298077d003345c17d89 --- /dev/null +++ b/examples/text_matching/simnet/utils.py @@ -0,0 +1,73 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + + +def convert_example(example, tokenizer, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. 
+ + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + query_ids(obj:`list[int]`): The list of query ids. + title_ids(obj:`list[int]`): The list of title ids. + query_seq_len(obj:`int`): The input sequence query length. + title_seq_len(obj:`int`): The input sequence title length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + + query, title = example["query"], example["title"] + query_ids = np.array(tokenizer.encode(query), dtype="int64") + query_seq_len = np.array(len(query_ids), dtype="int64") + title_ids = np.array(tokenizer.encode(title), dtype="int64") + title_seq_len = np.array(len(title_ids), dtype="int64") + + if not is_test: + label = np.array(example["label"], dtype="int64") + return query_ids, title_ids, query_seq_len, title_seq_len, label + else: + return query_ids, title_ids, query_seq_len, title_seq_len + + +def preprocess_prediction_data(data, tokenizer): + """ + It process the prediction data as the format used as training. + + Args: + data (obj:`List[List[str, str]]`): + The prediction data whose each element is a text pair. + Each text will be tokenized by jieba.lcut() function. + tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. + + Returns: + examples (obj:`list`): The processed data whose each element + is a `list` object, which contains + + - query_ids(obj:`list[int]`): The list of query ids. + - title_ids(obj:`list[int]`): The list of title ids. + - query_seq_len(obj:`int`): The input sequence query length. + - title_seq_len(obj:`int`): The input sequence title length. + + """ + examples = [] + for query, title in data: + query_ids = tokenizer.encode(query) + title_ids = tokenizer.encode(title) + examples.append([query_ids, title_ids, len(query_ids), len(title_ids)]) + return examples diff --git a/examples/text_summarization/bart/README.md b/examples/text_summarization/bart/README.md new file mode 100644 index 0000000000000000000000000000000000000000..db140d4d8866c3dfe0fb387c13b31b67efb3cfc9 --- /dev/null +++ b/examples/text_summarization/bart/README.md @@ -0,0 +1,227 @@ +# BART + +## 模型简介 + +BART是一种Seq2Seq结构的降噪自编码器,通过增加噪声来破环文本然后重建原文本来训练模型。它使用一个标准的Transformer结构,可以被看作泛化的BERT(由于是双向编码器),GPT(由于是从左到右解码器),和一些其他的预训练模型结构。 + +本项目是BART在 PaddlePaddle 2.2上开源实现的文本摘要的例子,包含了在[CNN/DailyMail](https://arxiv.org/pdf/1704.04368.pdf)数据集上微调和生成的代码。 + +## 快速开始 + +### 环境依赖 + +- nltk +- rouge_score + +安装方式:`pip install -r requirements.txt` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +. 
+├── run_summarization.py # 模型finetune主程序入口 +├── generate.py # 模型生成主程序入口 +├── utils.py # 定义参数及一些工具函数 +├── requirements.txt # 环境依赖文件 +└── README.md # 文档说明 +``` + +### 数据准备 + +**CNN/DailyMail**数据集是一个英文数据集,包含CNN和《每日邮报》记者撰写的30多万篇独特新闻文章,常用来做文本摘要。 + +为了方便用户快速测试,PaddleNLP Dataset API内置了CNN/DailyMail数据集,一键即可完成数据集加载,示例代码如下: + +```python +from paddlenlp.datasets import load_dataset +train_set, dev_set, test_set = load_dataset("cnn_dailymail", splits=["train", "dev", "test"]) +``` + +### 模型训练 + +运行如下命令即可在训练集上进行finetune,并在验证集上进行验证 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +# 例如使用1号和2号卡,则:`--gpu 1,2` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus 1,2 run_summarization.py \ + --model_name_or_path=bart-base \ + --dataset_name=cnn_dailymail \ + --output_dir=output \ + --max_source_length=1024 \ + --max_target_length=142 \ + --learning_rate=1e-4 \ + --num_train_epochs=6 \ + --logging_steps=100 \ + --save_steps=1000 \ + --seed=42 \ + --train_batch_size=20 \ + --eval_batch_size=64 \ + --warmup_proportion=0.1 \ + --ignore_pad_token_for_loss=True \ + --device=gpu +``` + +其中参数释义如下: +- `gpus` 指示了训练所用的GPU + +- `model_name_or_path` 指示了finetune使用的预训练模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | bart-base | + | bart-large | + +- `dataset_name` 表示训练的数据集。 + +- `output_dir` 表示模型的保存路径。 + +- `max_source_length` 表示输入article的最大长度。 + +- `max_target_length` 表示输入highlights的最大长度。 + +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 + +- `num_train_epochs` 表示训练轮数。 + +- `logging_steps` 表示日志打印间隔。 + +- `save_steps` 表示模型保存及评估间隔。 + +- `seed` 表示随机数生成器的种子。 + +- `epochs` 表示训练轮数。 + +- `train_batch_size` 表示训练**每张卡**上的样本数目。 + +- `eval_batch_size` 表示预测**单卡**上的样本数目。 + +- `warmup_proportion` 表示warmup_steps所占总步数的比例。学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数。 + +- `ignore_pad_token_for_loss` 表示计算loss时忽略padding。 + +- `device` 表示使用的设备。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`output_dir`中。如: + +```text +./output/ +├── bart_model_1000.pdparams +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── merges.txt +│ ├── tokenizer_config.json +│ └── vocab.json +└── ... 
+``` + +**NOTE:** 如需恢复模型训练,只需指定`model_name_or_path`为本地微调模型的路径即可。 + +### 模型预测 + +运行如下命令即可在验证集上进行测试 + +```shell +# GPU启动,预测仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python generate.py \ + --model_name_or_path=bart-base-cnndm-model \ + --dataset_name=cnn_dailymail \ + --output_path=generate.txt \ + --max_source_length=1024 \ + --max_target_length=142 \ + --decode_strategy=greedy_search \ + --top_k=2 \ + --top_p=1.0 \ + --num_beams=1 \ + --length_penalty=0.0 \ + --batch_size=64 \ + --seed=42 \ + --ignore_pad_token_for_loss=True \ + --logging_steps=100 \ + --device=gpu +``` + +其中参数释义如下: +- `model_name_or_path` 指示了预测使用的模型,可以是PaddleNLP提供的预训练模型,或者是本地的模型。如果使用本地的模型,则配置为本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle模型参数model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | bart-base | + | bart-large | + +- `dataset_name` 表示预测的数据集。 + +- `output_path` 表示预测结果的保存路径。 + +- `max_source_length` 表示输入article的最大长度。 + +- `max_target_length` 表示输入highlights的最大长度。 + +- `decode_strategy` 表示预测解码时采取的策略,可选"sampling"、"greedy_search"和"beam_search"之一。 + +- `top_k` 表示采用"sampling"解码策略时,token的概率按从大到小排序,生成的token只从前`top_k`个中进行采样。 + +- `top_p` 表示采用"sampling"解码策略时,从词表中采样并选择概率之和大于给定阈值`top_p`的token。 + +- `num_beams` 表示besm search的beam size。 + +- `length_penalty` 表示besm search生成长度的指数惩罚。 + +- `batch_size` 表示每次迭代**单卡**上的样本数目。 + +- `seed` 表示随机数生成器的种子。 + +- `ignore_pad_token_for_loss` 表示训练时计算loss时忽略padding。如果训练时设置为True,那么预测时的label需要还原来计算评估指标。 + +- `logging_steps` 表示日志打印间隔。 + +- `device` 表示使用的设备。 + +程序运行结束后会将预测生成的摘要保存在`output_path`中。同时终端中会输出评估结果。 + +采用预训练模型及微调模型在验证集上有如下结果: + +| model_name_or_path | Rouge-1 | Rouge-2 | Rouge-L | +| :----------------------: | :-------------: | :-------------: |:-------------: | +| [bart-base-cnndm-model](https://bj.bcebos.com/paddlenlp/models/transformers/bart/bart-base-cnndm-model.tar.gz ) | 43.6446 | 20.1447 | 41.0132 | + +**NOTE:** `bart-base-cnndm-model`是按本项目中的超参finetune得到的结果。 + +### 模型高性能预测 + +在模型预测阶段,我们提供了基于 FastGeneration 的高性能预测的选项,可以选择性开启是否需要采用高性能预测。只需在上述模型预测上添加两个参数即可:分别是`faster`,`use_fp16_decoding`。 + +```shell +# GPU启动,预测仅支持单卡 +export CUDA_VISIBLE_DEVICES=0 +python generate.py \ + --model_name_or_path=bart-base-cnndm-model \ + --dataset_name=cnn_dailymail \ + --output_path=generate.txt \ + --max_source_length=1024 \ + --max_target_length=142 \ + --decode_strategy=greedy_search \ + --top_k=2 \ + --top_p=1.0 \ + --num_beams=1 \ + --length_penalty=0.0 \ + --batch_size=64 \ + --seed=42 \ + --ignore_pad_token_for_loss=True \ + --logging_steps=100 \ + --faster \ + --use_fp16_decoding \ + --device=gpu +``` +其中新增参数释义如下: +- `faster` 表示是否开启高性能预测。设置 `--faster` 即表示开启。 +- `use_fp16_decoding` 表示在开启高性能预测的时候,是否使用 fp16 来完成预测过程。设置 `--use_fp16_decoding` 即表示使用 fp16 进行预测,否则使用 fp32。 + +## 参考文献 +1. Lewis M , Liu Y , Goyal N , et al. [BART: Denoising Sequence-to-Sequence Pre-training for Natural +Language Generation, Translation, and Comprehension](https://aclanthology.org/2020.acl-main.703.pdf)[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 7871-7880. +2. See A , Liu P J , CD Manning. [Get To The Point: Summarization with Pointer-Generator Networks](https://aclanthology.org/P17-1099.pdf)[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017: 1073–1083. 
diff --git a/examples/text_summarization/bart/generate.py b/examples/text_summarization/bart/generate.py new file mode 100644 index 0000000000000000000000000000000000000000..12c7018cf89bc028149b9d51dfac549f2434ed90 --- /dev/null +++ b/examples/text_summarization/bart/generate.py @@ -0,0 +1,206 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import random +import time +from functools import partial +from pprint import pprint + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from utils import compute_metrics, convert_example + +from paddlenlp.data import Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import BartForConditionalGeneration, BartTokenizer + +summarization_name_mapping = {"cnn_dailymail": ("article", "highlights")} + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--model_name_or_path", default="bart-base", type=str, required=True, help="Path to pre-trained model. " + ) + parser.add_argument( + "--dataset_name", + default="cnn_dailymail", + type=str, + required=True, + help="The name of the dataset to use. Selected in the list: " + ", ".join(summarization_name_mapping.keys()), + ) + parser.add_argument( + "--output_path", type=str, default="generate.txt", help="The file path where the infer result will be saved." + ) + parser.add_argument( + "--max_source_length", + default=1024, + type=int, + help="The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--min_target_length", + default=0, + type=int, + help="The minimum total sequence length for target text when generating. ", + ) + parser.add_argument( + "--max_target_length", + default=142, + type=int, + help="The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``.", + ) + parser.add_argument( + "--decode_strategy", default="greedy_search", type=str, help="The decode strategy in generation." 
+ ) + parser.add_argument( + "--top_k", + default=2, + type=int, + help="The number of highest probability vocabulary tokens to keep for top-k sampling.", + ) + parser.add_argument("--top_p", default=1.0, type=float, help="The cumulative probability for top-p sampling.") + parser.add_argument("--num_beams", default=1, type=int, help="The number of beams for beam search.") + parser.add_argument( + "--length_penalty", + default=0.6, + type=float, + help="The exponential penalty to the sequence length for beam search.", + ) + parser.add_argument( + "--early_stopping", + default=False, + type=eval, + help="Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.", + ) + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity of beam search. ") + parser.add_argument("--faster", action="store_true", help="Whether to process inference using FastGeneration. ") + parser.add_argument( + "--use_fp16_decoding", + action="store_true", + help="Whether to use fp16 when using FastGeneration. Only works when using FastGeneration. ", + ) + parser.add_argument("--batch_size", default=64, type=int, help="Batch size per GPU/CPU for testing or evaluation.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--ignore_pad_token_for_loss", + default=True, + type=bool, + help="Whether to ignore the tokens corresponding to " "padded labels in the loss computation or not.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def batchify_fn(samples): + fn = Tuple( + Stack(dtype="int64"), # input_ids + Stack(dtype="int64"), # attention mask + Stack(dtype="int32"), # mem_seq_lens + Stack(dtype="int64"), # decoder_input_ids + Stack(dtype="int64"), # labels + ) + return fn(samples) + + +@paddle.no_grad() +def generate(args): + paddle.set_device(args.device) + set_seed(args) + tokenizer = BartTokenizer.from_pretrained(args.model_name_or_path) + model = BartForConditionalGeneration.from_pretrained(args.model_name_or_path) + dataset = load_dataset(args.dataset_name, splits=["dev"]) + trans_func = partial( + convert_example, + text_column=summarization_name_mapping[args.dataset_name][0], + summary_column=summarization_name_mapping[args.dataset_name][1], + tokenizer=tokenizer, + decoder_start_token_id=model.bart.decoder_start_token_id, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ignore_pad_token_for_loss=args.ignore_pad_token_for_loss, + is_train=False, + ) + + dataset = dataset.map(trans_func, lazy=True) + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size, shuffle=False) + data_loader = DataLoader( + dataset=dataset, batch_sampler=batch_sampler, num_workers=0, collate_fn=batchify_fn, return_list=True + ) + data_loader.pin_memory = False + + model.eval() + total_time = 0.0 + start_time = time.time() + all_preds = [] + all_labels = [] + for step, batch in enumerate(data_loader): + input_ids, _, mem_seq_lens, _, labels = batch + preds, _ = model.generate( + input_ids=input_ids, + seq_lens=mem_seq_lens, + max_length=args.max_target_length, + min_length=args.min_target_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + diversity_rate=args.diversity_rate, + use_fast=args.faster, + ) + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + all_preds.extend(preds.numpy()) + all_labels.extend(labels.numpy()) + start_time = time.time() + + rouge_result, decoded_preds = compute_metrics(all_preds, all_labels, tokenizer, args.ignore_pad_token_for_loss) + print("Rouge result: ", rouge_result) + with open(args.output_path, "w", encoding="utf-8") as fout: + for decoded_pred in decoded_preds: + fout.write(decoded_pred + "\n") + print("Save generated result into: %s" % args.output_path) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + generate(args) diff --git a/examples/text_summarization/bart/requirements.txt b/examples/text_summarization/bart/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..659847365c1ff958cb7acafc2ba9ce9c8f40d234 --- /dev/null +++ b/examples/text_summarization/bart/requirements.txt @@ -0,0 +1,2 @@ +nltk==3.6.2 +rouge_score==0.0.4 diff --git a/examples/text_summarization/bart/run_summarization.py b/examples/text_summarization/bart/run_summarization.py new file mode 100644 index 0000000000000000000000000000000000000000..3951f9778f52ccc3efbe69f7c2e20400a643adef --- /dev/null +++ b/examples/text_summarization/bart/run_summarization.py @@ -0,0 +1,295 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import argparse +import random +import time +import distutils.util +from pprint import pprint +from functools import partial +from tqdm import tqdm +import numpy as np + +import paddle +import paddle.nn as nn +from paddle.io import BatchSampler, DistributedBatchSampler, DataLoader +from paddlenlp.transformers import BartForConditionalGeneration, BartTokenizer +from paddlenlp.transformers import LinearDecayWithWarmup +from paddlenlp.utils.log import logger +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Tuple, Stack +from utils import convert_example, compute_metrics + +summarization_name_mapping = {"cnn_dailymail": ("article", "highlights")} + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--model_name_or_path", default="bart-base", type=str, required=True, help="Path to pre-trained model. " + ) + parser.add_argument( + "--dataset_name", + default="cnn_dailymail", + type=str, + required=True, + help="The name of the dataset to use. Selected in the list: " + ", ".join(summarization_name_mapping.keys()), + ) + parser.add_argument( + "--output_dir", + default="output", + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_source_length", + default=1024, + type=int, + help="The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--min_target_length", + default=0, + type=int, + help="The minimum total sequence length for target text when generating. ", + ) + parser.add_argument( + "--max_target_length", + default=142, + type=int, + help="The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--train_batch_size", + default=20, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--eval_batch_size", + default=12, + type=int, + help="Batch size per GPU/CPU for evaluation.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." 
+ ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--use_amp", default=False, type=distutils.util.strtobool, help="Enable mixed precision training." + ) + parser.add_argument("--scale_loss", default=2**15, type=float, help="The value of scale_loss for fp16.") + parser.add_argument( + "--ignore_pad_token_for_loss", + default=True, + type=bool, + help="Whether to ignore the tokens corresponding to " "padded labels in the loss computation or not.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, data_loader, tokenizer, ignore_pad_token_for_loss, min_target_length, max_target_length): + model.eval() + all_preds = [] + all_labels = [] + model = model._layers if isinstance(model, paddle.DataParallel) else model + for batch in tqdm(data_loader, total=len(data_loader), desc="Eval step"): + input_ids, _, _, labels = batch + preds = model.generate( + input_ids=input_ids, min_length=min_target_length, max_length=max_target_length, use_cache=True + )[0] + all_preds.extend(preds.numpy()) + all_labels.extend(labels.numpy()) + rouge_result, decoded_preds = compute_metrics(all_preds, all_labels, tokenizer, ignore_pad_token_for_loss) + logger.info(rouge_result) + model.train() + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + tokenizer = BartTokenizer.from_pretrained(args.model_name_or_path) + model = BartForConditionalGeneration.from_pretrained(args.model_name_or_path) + trans_func = partial( + convert_example, + text_column=summarization_name_mapping[args.dataset_name][0], + summary_column=summarization_name_mapping[args.dataset_name][1], + tokenizer=tokenizer, + decoder_start_token_id=model.bart.decoder_start_token_id, + max_source_length=args.max_source_length, + max_target_length=args.max_target_length, + ignore_pad_token_for_loss=args.ignore_pad_token_for_loss, + ) + logger.info("Loading train and dev dataset: %s" % args.dataset_name) + train_set, dev_set = load_dataset(args.dataset_name, splits=["train", "dev"]) + logger.info("Loaded train and dev dataset: %s" % args.dataset_name) + train_set = train_set.map(trans_func, lazy=True) + train_batch_sampler = DistributedBatchSampler(train_set, batch_size=args.train_batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Stack(dtype="int64"), # input_ids + Stack(dtype="int64"), # attention mask + Stack(dtype="int64"), # decoder_input_ids + Stack(dtype="int64"), # labels + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_set, batch_sampler=train_batch_sampler, num_workers=0, collate_fn=batchify_fn, return_list=True + ) + 
dev_set = dev_set.map(trans_func, lazy=True) + dev_batch_sampler = BatchSampler(dev_set, batch_size=args.eval_batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_set, batch_sampler=dev_batch_sampler, num_workers=0, collate_fn=batchify_fn, return_list=True + ) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = nn.CrossEntropyLoss() + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + global_step = 0 + tic_train = time.time() + for epoch in tqdm(range(args.num_train_epochs), desc="Epoch"): + for step, batch in tqdm(enumerate(train_data_loader), desc="Train step", total=len(train_data_loader)): + global_step += 1 + input_ids, attention_mask, decoder_input_ids, labels = batch + with paddle.amp.auto_cast(args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"]): + logits = model(input_ids, attention_mask, decoder_input_ids) + loss = loss_fct(logits, labels) + if args.use_amp: + scaled_loss = scaler.scale(loss) + scaled_loss.backward() + scaler.minimize(optimizer, scaled_loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + logger.info( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + evaluate( + model, + dev_data_loader, + tokenizer, + args.ignore_pad_token_for_loss, + args.min_target_length, + args.max_target_length, + ) + logger.info("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "bart_model_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "bart_model_final_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +if __name__ == "__main__": + args = 
parse_args() + pprint(args) + do_train(args) diff --git a/examples/text_summarization/bart/utils.py b/examples/text_summarization/bart/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..892e0f520c5707b5bc6ee31278f2438d01de5ca7 --- /dev/null +++ b/examples/text_summarization/bart/utils.py @@ -0,0 +1,115 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np +import nltk +from rouge_score import rouge_scorer, scoring + + +def convert_example( + example, + text_column, + summary_column, + tokenizer, + decoder_start_token_id, + max_source_length, + max_target_length, + ignore_pad_token_for_loss=True, + is_train=True, +): + """ + Convert a example into necessary features. + """ + inputs = example[text_column] + targets = example[summary_column] + labels = tokenizer(targets, max_length=max_target_length, padding="max_length", truncation=True) + decoder_input_ids = [decoder_start_token_id] + labels["input_ids"][:-1] + if ignore_pad_token_for_loss: + labels["input_ids"] = [(l if l != tokenizer.pad_token_id else -100) for l in labels["input_ids"]] + if is_train: + model_inputs = tokenizer( + inputs, + max_length=max_source_length, + padding="max_length", + truncation=True, + return_attention_mask=True, + return_length=False, + ) + return model_inputs["input_ids"], model_inputs["attention_mask"], decoder_input_ids, labels["input_ids"] + else: + model_inputs = tokenizer( + inputs, + max_length=max_source_length, + padding="max_length", + truncation=True, + return_attention_mask=True, + return_length=True, + ) + return ( + model_inputs["input_ids"], + model_inputs["attention_mask"], + model_inputs["length"], + decoder_input_ids, + labels["input_ids"], + ) + + +def compute_metrics(preds, labels, tokenizer, ignore_pad_token_for_loss=True): + def compute_rouge(predictions, references, rouge_types=None, use_stemmer=True): + if rouge_types is None: + rouge_types = ["rouge1", "rouge2", "rougeLsum"] + + scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=use_stemmer) + aggregator = scoring.BootstrapAggregator() + + for ref, pred in zip(references, predictions): + score = scorer.score(ref, pred) + aggregator.add_scores(score) + result = aggregator.aggregate() + result = {key: round(value.mid.fmeasure * 100, 4) for key, value in result.items()} + return result + + def post_process_text(preds, labels): + preds = [pred.strip() for pred in preds] + labels = [label.strip() for label in labels] + + # rougeLSum expects newline after each sentence + preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds] + labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels] + + return preds, labels + + def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. 
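+        Truncates the sequence at the first `eos_idx` and drops the bos/eos ids
+        unless `output_bos`/`output_eos` is set.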
+        """
+        eos_pos = len(seq) - 1
+        for i, idx in enumerate(seq):
+            if idx == eos_idx:
+                eos_pos = i
+                break
+        seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)]
+        return seq
+
+    if ignore_pad_token_for_loss:
+        labels = np.asarray(labels)
+        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
+    decoded_preds, decoded_labels = [], []
+    for pred, label in zip(preds, labels):
+        pred_id = post_process_seq(pred, tokenizer.bos_token_id, tokenizer.eos_token_id)
+        label_id = post_process_seq(label, tokenizer.bos_token_id, tokenizer.eos_token_id)
+        decoded_preds.append(tokenizer.convert_ids_to_string(pred_id))
+        decoded_labels.append(tokenizer.convert_ids_to_string(label_id))
+    decoded_preds, decoded_labels = post_process_text(decoded_preds, decoded_labels)
+    rouge_result = compute_rouge(decoded_preds, decoded_labels)
+    return rouge_result, decoded_preds
diff --git a/examples/text_summarization/pointer_summarizer/.gitignore b/examples/text_summarization/pointer_summarizer/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..4725dc6de854f8d59b6600aa4b1f9b4f6d197d56
--- /dev/null
+++ b/examples/text_summarization/pointer_summarizer/.gitignore
@@ -0,0 +1,2 @@
+log/*
+finished_files/*
diff --git a/examples/text_summarization/pointer_summarizer/README.md b/examples/text_summarization/pointer_summarizer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..52bbdf8253586cae88119ae66bf46c7b480c616f
--- /dev/null
+++ b/examples/text_summarization/pointer_summarizer/README.md
@@ -0,0 +1,63 @@
+# Pointer Generator Network for Text Summarization
+
+This code is the Paddle v2.0 implementation of *[Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/abs/1704.04368)*.
+The code adapts and aligns with [a previous PyTorch implementation](https://github.com/atulkum/pointer_summarizer).
+
+To reach the state-of-the-art performance stated in the source paper, please use the default hyper-parameters listed in *config.py*.
+
+## Model performance (with pointer generation and coverage loss enabled)
+After training for 100k iterations with *batch_size=8*, the Paddle implementation achieves a ROUGE-1-f1 of 0.3980 (0.3907 by [a previous PyTorch implementation](https://github.com/atulkum/pointer_summarizer) and 0.3953 by [the source paper](https://arxiv.org/abs/1704.04368)). 
+ +``` +ROUGE-1: +rouge_1_f_score: 0.3980 with confidence interval (0.3959, 0.4002) +rouge_1_recall: 0.4639 with confidence interval (0.4613, 0.4667) +rouge_1_precision: 0.3707 with confidence interval (0.3683, 0.3732) + +ROUGE-2: +rouge_2_f_score: 0.1726 with confidence interval (0.1704, 0.1749) +rouge_2_recall: 0.2008 with confidence interval (0.1984, 0.2034) +rouge_2_precision: 0.1615 with confidence interval (0.1593, 0.1638) + +ROUGE-l: +rouge_l_f_score: 0.3617 with confidence interval (0.3597, 0.3640) +rouge_l_recall: 0.4214 with confidence interval (0.4188, 0.4242) +rouge_l_precision: 0.3371 with confidence interval (0.3348, 0.3396) + +``` + +## Prerequisites: +* The code is tested on Python 3.7.1 and Paddle 2.0.0 +* Training takes around 1s/iter on a single Tesla V100 (\~28 hours to train 100k iters) +* Decoding the entire test set takes 2-3 hours + +## Data Preprocessing: +1) Follow data generation instruction from https://github.com/abisee/cnn-dailymail **but place the *make_datafiles_json.py* script provided in this repo into https://github.com/abisee/cnn-dailymail and run *make_datafiles_json.py* instead of *make_datafiles.py* to minimize package dependencies.** +2) place the output folder *finished_files_json/* as a subfolder in this repo +3) You might need to change some paths and parameters in *config.py* + + +## How to run training: +* To train the model from start: +``` +python train.py +``` +* To continue training using a previously trained model: +``` +python train.py -m path/to/model/dir/ +``` + +## Set up ROUGE +* You need to setup [pyrouge](https://github.com/andersjo/pyrouge) to get the rouge score +* Also see [this tutorial](https://poojithansl7.wordpress.com/2018/08/04/setting-up-rouge/) to set up rouge and pyrouge. + + +## How to decode & evaluate: +* To decode using a previously trained model: +``` +python decode.py path/to/model/dir/ +``` +* If you already have the summaries generated using *decode.py* and only needs to run rouge evaluation: +``` +python rouge_eval.py path/to/decoded/dir/ +``` diff --git a/examples/text_summarization/pointer_summarizer/__init__.py b/examples/text_summarization/pointer_summarizer/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/examples/text_summarization/pointer_summarizer/config.py b/examples/text_summarization/pointer_summarizer/config.py new file mode 100644 index 0000000000000000000000000000000000000000..795fb03fdb339895bff7960397ed6342d7a0cae9 --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/config.py @@ -0,0 +1,31 @@ +# Data directories +train_data_path = "./finished_files/chunked/train_*.json" +eval_data_path = "./finished_files/val.json" +decode_data_path = "./finished_files/test.json" +vocab_path = "./finished_files/vocab" +log_root = "./log" + +# Hyperparameters +hidden_dim = 256 +emb_dim = 128 +batch_size = 8 +max_enc_steps = 400 +max_dec_steps = 100 +beam_size = 4 +min_dec_steps = 35 +vocab_size = 50000 + +lr = 0.15 +adagrad_init_acc = 0.1 +rand_unif_init_mag = 0.02 +trunc_norm_init_std = 1e-4 +max_grad_norm = 2.0 + +pointer_gen = True +is_coverage = True +cov_loss_wt = 1.0 + +eps = 1e-12 +max_iterations = 100000 + +lr_coverage = 0.15 diff --git a/examples/text_summarization/pointer_summarizer/data.py b/examples/text_summarization/pointer_summarizer/data.py new file mode 100644 index 0000000000000000000000000000000000000000..a8479943e691c3cfb68df383a503e9bb1fedf333 --- /dev/null +++ 
b/examples/text_summarization/pointer_summarizer/data.py @@ -0,0 +1,503 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# This file is adapted from https://github.com/abisee/pointer-generator/blob/master/data.py + +import csv +import glob +import io +import json +import queue +import random +import time +from random import shuffle +from threading import Thread + +import config +import data +import numpy as np + +random.seed(123) + + +# <s> and </s> are used in the data files to segment the abstracts into sentences. They don't receive vocab ids. +SENTENCE_START = "<s>" +SENTENCE_END = "</s>" + +PAD_TOKEN = "[PAD]" # This has a vocab id, which is used to pad the encoder input, decoder input and target sequence +UNKNOWN_TOKEN = "[UNK]" # This has a vocab id, which is used to represent out-of-vocabulary words +START_DECODING = "[START]" # This has a vocab id, which is used at the start of every decoder input sequence +STOP_DECODING = "[STOP]" # This has a vocab id, which is used at the end of untruncated target sequences + +# Note: none of <s>, </s>, [PAD], [UNK], [START], [STOP] should appear in the vocab file. + + +class Example(object): + def __init__(self, article, abstract_sentences, vocab): + # Get ids of special tokens + start_decoding = vocab.word2id(data.START_DECODING) + stop_decoding = vocab.word2id(data.STOP_DECODING) + + # Process the article + article_words = article.split() + if len(article_words) > config.max_enc_steps: + article_words = article_words[: config.max_enc_steps] + self.enc_len = len(article_words) # store the length after truncation but before padding + self.enc_input = [ + vocab.word2id(w) for w in article_words + ] # list of word ids; OOVs are represented by the id for UNK token + + # Process the abstract + abstract = " ".join(abstract_sentences) # string + abstract_words = abstract.split() # list of strings + abs_ids = [ + vocab.word2id(w) for w in abstract_words + ] # list of word ids; OOVs are represented by the id for UNK token + + # Get the decoder input sequence and target sequence + self.dec_input, self.target = self.get_dec_inp_targ_seqs( + abs_ids, config.max_dec_steps, start_decoding, stop_decoding + ) + self.dec_len = len(self.dec_input) + + # If using pointer-generator mode, we need to store some extra info + if config.pointer_gen: + # Store a version of the enc_input where in-article OOVs are represented by their temporary OOV id; also store the in-article OOVs words themselves + self.enc_input_extend_vocab, self.article_oovs = data.article2ids(article_words, vocab) + + # Get a version of the reference summary where in-article OOVs are represented by their temporary article OOV id + abs_ids_extend_vocab = data.abstract2ids(abstract_words, vocab, self.article_oovs) + + # Overwrite decoder target sequence so it uses the temp article OOV ids + _, self.target = self.get_dec_inp_targ_seqs( + abs_ids_extend_vocab, config.max_dec_steps, start_decoding, stop_decoding + ) + + # Store the 
original strings + self.original_article = article + self.original_abstract = abstract + self.original_abstract_sents = abstract_sentences + + def get_dec_inp_targ_seqs(self, sequence, max_len, start_id, stop_id): + inp = [start_id] + sequence[:] + target = sequence[:] + if len(inp) > max_len: # truncate + inp = inp[:max_len] + target = target[:max_len] # no end_token + else: # no truncation + target.append(stop_id) # end token + assert len(inp) == len(target) + return inp, target + + def pad_decoder_inp_targ(self, max_len, pad_id): + while len(self.dec_input) < max_len: + self.dec_input.append(pad_id) + while len(self.target) < max_len: + self.target.append(pad_id) + + def pad_encoder_input(self, max_len, pad_id): + while len(self.enc_input) < max_len: + self.enc_input.append(pad_id) + if config.pointer_gen: + while len(self.enc_input_extend_vocab) < max_len: + self.enc_input_extend_vocab.append(pad_id) + + +class Batch(object): + def __init__(self, example_list, vocab, batch_size): + self.batch_size = batch_size + self.pad_id = vocab.word2id(data.PAD_TOKEN) # id of the PAD token used to pad sequences + self.init_encoder_seq(example_list) # initialize the input to the encoder + self.init_decoder_seq(example_list) # initialize the input and targets for the decoder + self.store_orig_strings(example_list) # store the original strings + + def init_encoder_seq(self, example_list): + # Determine the maximum length of the encoder input sequence in this batch + max_enc_seq_len = max([ex.enc_len for ex in example_list]) + + # Pad the encoder input sequences up to the length of the longest sequence + for ex in example_list: + ex.pad_encoder_input(max_enc_seq_len, self.pad_id) + + # Initialize the numpy arrays + # Note: our enc_batch can have different length (second dimension) for each batch because we use dynamic_rnn for the encoder. + self.enc_batch = np.zeros((self.batch_size, max_enc_seq_len), dtype=np.int32) + self.enc_lens = np.zeros((self.batch_size), dtype=np.int32) + self.enc_padding_mask = np.zeros((self.batch_size, max_enc_seq_len), dtype=np.float32) + + # Fill in the numpy arrays + for i, ex in enumerate(example_list): + self.enc_batch[i, :] = ex.enc_input[:] + self.enc_lens[i] = ex.enc_len + for j in range(ex.enc_len): + self.enc_padding_mask[i][j] = 1 + + # For pointer-generator mode, need to store some extra info + if config.pointer_gen: + # Determine the max number of in-article OOVs in this batch + self.max_art_oovs = max([len(ex.article_oovs) for ex in example_list]) + # Store the in-article OOVs themselves + self.art_oovs = [ex.article_oovs for ex in example_list] + # Store the version of the enc_batch that uses the article OOV ids + self.enc_batch_extend_vocab = np.zeros((self.batch_size, max_enc_seq_len), dtype=np.int32) + for i, ex in enumerate(example_list): + self.enc_batch_extend_vocab[i, :] = ex.enc_input_extend_vocab[:] + + def init_decoder_seq(self, example_list): + # Pad the inputs and targets + for ex in example_list: + ex.pad_decoder_inp_targ(config.max_dec_steps, self.pad_id) + + # Initialize the numpy arrays. 
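+        # Unlike the encoder arrays (sized to the longest sequence in this batch),
+        # the decoder arrays always use the fixed width config.max_dec_steps,
+        # because pad_decoder_inp_targ above pads every example to that length.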
+ self.dec_batch = np.zeros((self.batch_size, config.max_dec_steps), dtype=np.int32) + self.target_batch = np.zeros((self.batch_size, config.max_dec_steps), dtype=np.int32) + self.dec_padding_mask = np.zeros((self.batch_size, config.max_dec_steps), dtype=np.float32) + self.dec_lens = np.zeros((self.batch_size), dtype=np.int32) + + # Fill in the numpy arrays + for i, ex in enumerate(example_list): + self.dec_batch[i, :] = ex.dec_input[:] + self.target_batch[i, :] = ex.target[:] + self.dec_lens[i] = ex.dec_len + for j in range(ex.dec_len): + self.dec_padding_mask[i][j] = 1 + + def store_orig_strings(self, example_list): + self.original_articles = [ex.original_article for ex in example_list] # list of lists + self.original_abstracts = [ex.original_abstract for ex in example_list] # list of lists + self.original_abstracts_sents = [ex.original_abstract_sents for ex in example_list] # list of lists + + +class Batcher(object): + BATCH_QUEUE_MAX = 100 # max number of batches the batch_queue can hold + + def __init__(self, data_path, vocab, mode, batch_size, single_pass): + self._data_path = data_path + self._vocab = vocab + self._single_pass = single_pass + self.mode = mode + self.batch_size = batch_size + # Initialize a queue of Batches waiting to be used, and a queue of Examples waiting to be batched + self._batch_queue = queue.Queue(self.BATCH_QUEUE_MAX) + self._example_queue = queue.Queue(self.BATCH_QUEUE_MAX * self.batch_size) + + # Different settings depending on whether we're in single_pass mode or not + if single_pass: + self._num_example_q_threads = 1 # just one thread, so we read through the dataset just once + self._num_batch_q_threads = 1 # just one thread to batch examples + self._bucketing_cache_size = ( + 1 # only load one batch's worth of examples before bucketing; this essentially means no bucketing + ) + self._finished_reading = False # this will tell us when we're finished reading the dataset + else: + self._num_example_q_threads = 1 # 16 # num threads to fill example queue + self._num_batch_q_threads = 1 # 4 # num threads to fill batch queue + self._bucketing_cache_size = ( + 1 # 100 # how many batches-worth of examples to load into cache before bucketing + ) + + # Start the threads that load the queues + self._example_q_threads = [] + for _ in range(self._num_example_q_threads): + self._example_q_threads.append(Thread(target=self.fill_example_queue)) + self._example_q_threads[-1].daemon = True + self._example_q_threads[-1].start() + self._batch_q_threads = [] + for _ in range(self._num_batch_q_threads): + self._batch_q_threads.append(Thread(target=self.fill_batch_queue)) + self._batch_q_threads[-1].daemon = True + self._batch_q_threads[-1].start() + + # Start a thread that watches the other threads and restarts them if they're dead + if not single_pass: # We don't want a watcher in single_pass mode because the threads shouldn't run forever + self._watch_thread = Thread(target=self.watch_threads) + self._watch_thread.daemon = True + self._watch_thread.start() + + def next_batch(self): + # If the batch queue is empty, print a warning + if self._batch_queue.qsize() == 0: + print( + "Bucket input queue is empty when calling next_batch. 
Bucket queue size: %i, Input queue size: %i" + % (self._batch_queue.qsize(), self._example_queue.qsize()) + ) + if self._single_pass and self._finished_reading: + print("Finished reading dataset in single_pass mode.") + return None + + batch = self._batch_queue.get() # get the next Batch + return batch + + def fill_example_queue(self): + input_gen = self.text_generator(data.example_generator(self._data_path, self._single_pass)) + + while True: + try: + (article, abstract) = next( + input_gen + ) # read the next example from file. article and abstract are both strings. + except StopIteration: # if there are no more examples: + print("The example generator for this example queue filling thread has exhausted data.") + if self._single_pass: + print("single_pass mode is on, so we've finished reading dataset. This thread is stopping.") + self._finished_reading = True + break + else: + raise Exception("single_pass mode is off but the example generator is out of data; error.") + + abstract_sentences = [ + sent.strip() for sent in data.abstract2sents(abstract) + ] # Use the <s> and </s> tags in abstract to get a list of sentences. + example = Example(article, abstract_sentences, self._vocab) # Process into an Example. + self._example_queue.put(example) # place the Example in the example queue. + + def fill_batch_queue(self): + while True: + if self.mode == "decode": + # Beam search decode mode where a single example is repeated in the batch + ex = self._example_queue.get() + b = [ex for _ in range(self.batch_size)] + self._batch_queue.put(Batch(b, self._vocab, self.batch_size)) + else: + # Get bucketing_cache_size-many batches of Examples into a list, then sort + inputs = [] + for _ in range(self.batch_size * self._bucketing_cache_size): + inputs.append(self._example_queue.get()) + inputs = sorted( + inputs, key=lambda inp: inp.enc_len, reverse=True + ) # sort by length of encoder sequence + + # Group the sorted Examples into batches, optionally shuffle the batches, and place in the batch queue. + batches = [] + for i in range(0, len(inputs), self.batch_size): + batches.append(inputs[i : i + self.batch_size]) + if not self._single_pass: + shuffle(batches) + for b in batches: # each b is a list of Example objects + self._batch_queue.put(Batch(b, self._vocab, self.batch_size)) + + def watch_threads(self): + while True: + print( + "Bucket queue size: %i, Input queue size: %i" + % (self._batch_queue.qsize(), self._example_queue.qsize()) + ) + + time.sleep(60) + for idx, t in enumerate(self._example_q_threads): + if not t.is_alive(): # if the thread is dead + print("Found example queue thread dead. Restarting.") + new_t = Thread(target=self.fill_example_queue) + self._example_q_threads[idx] = new_t + new_t.daemon = True + new_t.start() + for idx, t in enumerate(self._batch_q_threads): + if not t.is_alive(): # if the thread is dead + print("Found batch queue thread dead. Restarting.") + new_t = Thread(target=self.fill_batch_queue) + self._batch_q_threads[idx] = new_t + new_t.daemon = True + new_t.start() + + def text_generator(self, example_generator): + while True: + e = next(example_generator) + try: + article_text = e["article"] + abstract_text = e["abstract"] + except ValueError: + print("Failed to get article or abstract from example") + continue + if len(article_text) == 0: # See https://github.com/abisee/pointer-generator/issues/1 + print("Found an example with empty article text. 
Skipping it.") + continue + else: + yield (article_text, abstract_text) + + +class Vocab(object): + def __init__(self, vocab_file, max_size): + self._word_to_id = {} + self._id_to_word = {} + self._count = 0 # keeps track of total number of words in the Vocab + + # [UNK], [PAD], [START] and [STOP] get the ids 0,1,2,3. + for w in [UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING]: + self._word_to_id[w] = self._count + self._id_to_word[self._count] = w + self._count += 1 + + # Read the vocab file and add words up to max_size + with open(vocab_file, "r", encoding="utf8") as vocab_f: + for line in vocab_f: + pieces = line.split() + if len(pieces) != 2: + print("Warning: incorrectly formatted line in vocabulary file: %s\n" % line) + continue + w = pieces[0] + if w in [SENTENCE_START, SENTENCE_END, UNKNOWN_TOKEN, PAD_TOKEN, START_DECODING, STOP_DECODING]: + raise Exception( + "<s>, </s>, [UNK], [PAD], [START] and [STOP] shouldn't be in the vocab file, but %s is" % w + ) + if w in self._word_to_id: + raise Exception("Duplicated word in vocabulary file: %s" % w) + self._word_to_id[w] = self._count + self._id_to_word[self._count] = w + self._count += 1 + if max_size != 0 and self._count >= max_size: + print( + "max_size of vocab was specified as %i; we now have %i words. Stopping reading." + % (max_size, self._count) + ) + break + + print( + "Finished constructing vocabulary of %i total words. Last word added: %s" + % (self._count, self._id_to_word[self._count - 1]) + ) + + def word2id(self, word): + if word not in self._word_to_id: + return self._word_to_id[UNKNOWN_TOKEN] + return self._word_to_id[word] + + def id2word(self, word_id): + if word_id not in self._id_to_word: + raise ValueError("Id not found in vocab: %d" % word_id) + return self._id_to_word[word_id] + + def size(self): + return self._count + + def write_metadata(self, fpath): + print("Writing word embedding metadata file to %s..." % (fpath)) + with open(fpath, "w") as f: + fieldnames = ["word"] + writer = csv.DictWriter(f, delimiter="\t", fieldnames=fieldnames) + for i in range(self.size()): + writer.writerow({"word": self._id_to_word[i]}) + + +def example_generator(data_path, single_pass): + while True: + filelist = glob.glob(data_path) # get the list of datafiles + assert filelist, "Error: Empty filelist at %s" % data_path # check filelist isn't empty + if single_pass: + filelist = sorted(filelist) + else: + random.shuffle(filelist) + for f in filelist: + reader = io.open(f, "r", encoding="utf8") + while True: + reader_str = reader.readline() + if reader_str == "": + break + reader_json = json.loads(reader_str) + yield reader_json + if single_pass: + print("example_generator completed reading all datafiles. No more data.") + break + + +def article2ids(article_words, vocab): + ids = [] + oovs = [] + unk_id = vocab.word2id(UNKNOWN_TOKEN) + for w in article_words: + i = vocab.word2id(w) + if i == unk_id: # If w is OOV + if w not in oovs: # Add to list of OOVs + oovs.append(w) + oov_num = oovs.index(w) # This is 0 for the first article OOV, 1 for the second article OOV... + ids.append(vocab.size() + oov_num) # This is e.g. 50000 for the first article OOV, 50001 for the second... 
+ else: + ids.append(i) + return ids, oovs + + +def abstract2ids(abstract_words, vocab, article_oovs): + ids = [] + unk_id = vocab.word2id(UNKNOWN_TOKEN) + for w in abstract_words: + i = vocab.word2id(w) + if i == unk_id: # If w is an OOV word + if w in article_oovs: # If w is an in-article OOV + vocab_idx = vocab.size() + article_oovs.index(w) # Map to its temporary article OOV number + ids.append(vocab_idx) + else: # If w is an out-of-article OOV + ids.append(unk_id) # Map to the UNK token id + else: + ids.append(i) + return ids + + +def outputids2words(id_list, vocab, article_oovs): + words = [] + for i in id_list: + try: + w = vocab.id2word(i) # might be [UNK] + except ValueError: # w is OOV + assert ( + article_oovs is not None + ), "Error: model produced a word ID that isn't in the vocabulary. This should not happen in baseline (no pointer-generator) mode" + article_oov_idx = i - vocab.size() + try: + w = article_oovs[article_oov_idx] + except ValueError: # i doesn't correspond to an article oov + raise ValueError( + "Error: model produced word ID %i which corresponds to article OOV %i but this example only has %i article OOVs" + % (i, article_oov_idx, len(article_oovs)) + ) + words.append(w) + return words + + +def abstract2sents(abstract): + cur = 0 + sents = [] + while True: + try: + start_p = abstract.index(SENTENCE_START, cur) + end_p = abstract.index(SENTENCE_END, start_p + 1) + cur = end_p + len(SENTENCE_END) + sents.append(abstract[start_p + len(SENTENCE_START) : end_p]) + except ValueError: # no more sentences + return sents + + +def show_art_oovs(article, vocab): + unk_token = vocab.word2id(UNKNOWN_TOKEN) + words = article.split(" ") + words = [("__%s__" % w) if vocab.word2id(w) == unk_token else w for w in words] + out_str = " ".join(words) + return out_str + + +def show_abs_oovs(abstract, vocab, article_oovs): + unk_token = vocab.word2id(UNKNOWN_TOKEN) + words = abstract.split(" ") + new_words = [] + for w in words: + if vocab.word2id(w) == unk_token: # w is oov + if article_oovs is None: # baseline mode + new_words.append("__%s__" % w) + else: # pointer-generator mode + if w in article_oovs: + new_words.append("__%s__" % w) + else: + new_words.append("!!__%s__!!" % w) + else: # w is in-vocab word + new_words.append(w) + out_str = " ".join(new_words) + return out_str diff --git a/examples/text_summarization/pointer_summarizer/decode.py b/examples/text_summarization/pointer_summarizer/decode.py new file mode 100644 index 0000000000000000000000000000000000000000..e66d9442f1fde5911f8ce60ce543efd356645a66 --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/decode.py @@ -0,0 +1,232 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
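To make the extended-vocabulary mapping in `data.py` above concrete, here is a small standalone sketch (a made-up three-word vocabulary, not the `Vocab` class) that mirrors the logic of `article2ids`: every in-article OOV receives a temporary id of `vocab_size + oov_index`, which `outputids2words` can later map back to the original string.

```python
# Illustrative sketch only: toy vocabulary and article, mirroring article2ids in data.py.
vocab = {"[UNK]": 0, "the": 1, "storm": 2}  # toy vocab of size 3
article = ["the", "cyclone", "hit", "the", "coast"]

ids, oovs = [], []
for w in article:
    if w in vocab:
        ids.append(vocab[w])
    else:
        if w not in oovs:
            oovs.append(w)
        ids.append(len(vocab) + oovs.index(w))  # temporary extended-vocab id

print(ids)   # [1, 3, 4, 1, 5]
print(oovs)  # ['cyclone', 'hit', 'coast']
```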
+ +# Except for the paddle part, content of this file is copied from https://github.com/abisee/pointer-generator/blob/master/ + +import os +import re +import sys +import time + +import config +import data +import paddle +from data import Batcher, Vocab +from model import Model +from train_util import get_input_from_batch +from utils import rouge_eval, rouge_log, write_for_rouge + + +class Beam(object): + def __init__(self, tokens, log_probs, state, context, coverage): + self.tokens = tokens + self.log_probs = log_probs + self.state = state + self.context = context + self.coverage = coverage + + def extend(self, token, log_prob, state, context, coverage): + return Beam( + tokens=self.tokens + [token], + log_probs=self.log_probs + [log_prob], + state=state, + context=context, + coverage=coverage, + ) + + @property + def latest_token(self): + return self.tokens[-1] + + @property + def avg_log_prob(self): + return sum(self.log_probs) / len(self.tokens) + + +class BeamSearch(object): + def __init__(self, model_file_path): + model_name = ( + re.findall(r"train_\d+", model_file_path)[0] + "_" + re.findall(r"model_\d+_\d+\.\d+", model_file_path)[0] + ) + self._decode_dir = os.path.join(config.log_root, "decode_%s" % (model_name)) + self._rouge_ref_dir = os.path.join(self._decode_dir, "rouge_ref") + self._rouge_dec_dir = os.path.join(self._decode_dir, "rouge_dec_dir") + for p in [self._decode_dir, self._rouge_ref_dir, self._rouge_dec_dir]: + if not os.path.exists(p): + os.mkdir(p) + + self.vocab = Vocab(config.vocab_path, config.vocab_size) + self.batcher = Batcher( + config.decode_data_path, self.vocab, mode="decode", batch_size=config.beam_size, single_pass=True + ) + self.model = Model(model_file_path, is_eval=True) + + def sort_beams(self, beams): + return sorted(beams, key=lambda h: h.avg_log_prob, reverse=True) + + def decode(self): + start = time.time() + counter = 0 + batch = self.batcher.next_batch() + while batch is not None: + # Run beam search to get best Hypothesis + best_summary = self.beam_search(batch) + + # Extract the output ids from the hypothesis and convert back to words + output_ids = [int(t) for t in best_summary.tokens[1:]] + decoded_words = data.outputids2words( + output_ids, self.vocab, (batch.art_oovs[0] if config.pointer_gen else None) + ) + + # Remove the [STOP] token from decoded_words, if necessary + try: + fst_stop_idx = decoded_words.index(data.STOP_DECODING) + decoded_words = decoded_words[:fst_stop_idx] + except ValueError: + decoded_words = decoded_words + + original_abstract_sents = batch.original_abstracts_sents[0] + + write_for_rouge(original_abstract_sents, decoded_words, counter, self._rouge_ref_dir, self._rouge_dec_dir) + counter += 1 + print("global step %d, %.2f step/s" % (counter, 1.0 / (time.time() - start))) + start = time.time() + batch = self.batcher.next_batch() + + print("Decoder has finished reading dataset for single_pass.") + print("Now starting ROUGE eval...") + results_dict = rouge_eval(self._rouge_ref_dir, self._rouge_dec_dir) + rouge_log(results_dict, self._decode_dir) + + def beam_search(self, batch): + # The batch should have only one example + ( + enc_batch, + enc_padding_mask, + enc_lens, + enc_batch_extend_vocab, + extra_zeros, + c_t_0, + coverage_t_0, + ) = get_input_from_batch(batch) + + encoder_outputs, encoder_feature, encoder_hidden = self.model.encoder(enc_batch, enc_lens) + s_t_0 = self.model.reduce_state(encoder_hidden) + + dec_h, dec_c = s_t_0 # 1 x 2*hidden_size + dec_h = dec_h.squeeze() + dec_c = dec_c.squeeze() + + # Prepare 
decoder batch + beams = [ + Beam( + tokens=[self.vocab.word2id(data.START_DECODING)], + log_probs=[0.0], + state=(dec_h[0], dec_c[0]), + context=c_t_0[0], + coverage=(coverage_t_0[0] if config.is_coverage else None), + ) + for _ in range(config.beam_size) + ] + results = [] + steps = 0 + while steps < config.max_dec_steps and len(results) < config.beam_size: + latest_tokens = [h.latest_token for h in beams] + latest_tokens = [ + t if t < self.vocab.size() else self.vocab.word2id(data.UNKNOWN_TOKEN) for t in latest_tokens + ] + y_t_1 = paddle.to_tensor(latest_tokens) + all_state_h = [] + all_state_c = [] + + all_context = [] + + for h in beams: + state_h, state_c = h.state + all_state_h.append(state_h) + all_state_c.append(state_c) + + all_context.append(h.context) + + s_t_1 = (paddle.stack(all_state_h, 0).unsqueeze(0), paddle.stack(all_state_c, 0).unsqueeze(0)) + c_t_1 = paddle.stack(all_context, 0) + + coverage_t_1 = None + if config.is_coverage: + all_coverage = [] + for h in beams: + all_coverage.append(h.coverage) + coverage_t_1 = paddle.stack(all_coverage, 0) + + final_dist, s_t, c_t, attn_dist, p_gen, coverage_t = self.model.decoder( + y_t_1, + s_t_1, + encoder_outputs, + encoder_feature, + enc_padding_mask, + c_t_1, + extra_zeros, + enc_batch_extend_vocab, + coverage_t_1, + steps, + ) + log_probs = paddle.log(final_dist) + topk_log_probs, topk_ids = paddle.topk(log_probs, config.beam_size * 2) + + dec_h, dec_c = s_t + dec_h = dec_h.squeeze() + dec_c = dec_c.squeeze() + + all_beams = [] + num_orig_beams = 1 if steps == 0 else len(beams) + for i in range(num_orig_beams): + h = beams[i] + state_i = (dec_h[i], dec_c[i]) + context_i = c_t[i] + coverage_i = coverage_t[i] if config.is_coverage else None + + for j in range(config.beam_size * 2): # for each of the top 2*beam_size hyps: + new_beam = h.extend( + token=int(topk_ids[i, j]), + log_prob=float(topk_log_probs[i, j]), + state=state_i, + context=context_i, + coverage=coverage_i, + ) + all_beams.append(new_beam) + + beams = [] + for h in self.sort_beams(all_beams): + if h.latest_token == self.vocab.word2id(data.STOP_DECODING): + if steps >= config.min_dec_steps: + results.append(h) + else: + beams.append(h) + if len(beams) == config.beam_size or len(results) == config.beam_size: + break + + steps += 1 + + if len(results) == 0: + results = beams + + beams_sorted = self.sort_beams(results) + + return beams_sorted[0] + + +if __name__ == "__main__": + model_filename = sys.argv[1] + beam_search_processor = BeamSearch(model_filename) + beam_search_processor.decode() diff --git a/examples/text_summarization/pointer_summarizer/make_datafiles_json.py b/examples/text_summarization/pointer_summarizer/make_datafiles_json.py new file mode 100644 index 0000000000000000000000000000000000000000..e266a6544247810d72da68f117267d9720078eb4 --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/make_datafiles_json.py @@ -0,0 +1,282 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
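The beam search in `decode.py` above keeps `beam_size` partial hypotheses, extends each with the top-scoring next tokens, and ranks them by `avg_log_prob` so longer hypotheses are not penalized merely for accumulating more log-probability terms. The toy class below is an illustrative sketch of that bookkeeping only; `ToyBeam` and the token ids are invented and it is not part of `decode.py`.

```python
# Illustrative sketch of Beam.extend and avg_log_prob ranking used by BeamSearch above.
class ToyBeam:
    def __init__(self, tokens, log_probs):
        self.tokens = tokens        # token ids decoded so far
        self.log_probs = log_probs  # per-token log-probabilities

    def extend(self, token, log_prob):
        # Extending never mutates the parent hypothesis, so one beam can spawn many children.
        return ToyBeam(self.tokens + [token], self.log_probs + [log_prob])

    @property
    def avg_log_prob(self):
        return sum(self.log_probs) / len(self.tokens)


start = ToyBeam(tokens=[2], log_probs=[0.0])            # 2 = hypothetical [START] id
candidates = [start.extend(17, -0.1), start.extend(42, -0.9)]
best = sorted(candidates, key=lambda h: h.avg_log_prob, reverse=True)[0]
print(best.tokens, best.avg_log_prob)                    # [2, 17] -0.05
```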
+ +import collections +import hashlib +import json +import os +import subprocess +import sys + +dm_single_close_quote = "\u2019" # unicode +dm_double_close_quote = "\u201d" +END_TOKENS = [ + ".", + "!", + "?", + "...", + "'", + "`", + '"', + dm_single_close_quote, + dm_double_close_quote, + ")", +] # acceptable ways to end a sentence + +# We use these to separate the summary sentences in the json datafiles +SENTENCE_START = "<s>" +SENTENCE_END = "</s>" + +all_train_urls = "url_lists/all_train.txt" +all_val_urls = "url_lists/all_val.txt" +all_test_urls = "url_lists/all_test.txt" + +cnn_tokenized_stories_dir = "cnn_stories_tokenized_json" +dm_tokenized_stories_dir = "dm_stories_tokenized_json" +finished_files_dir = "finished_files_json" +chunks_dir = os.path.join(finished_files_dir, "chunked") + +# These are the number of .story files we expect there to be in cnn_stories_dir and dm_stories_dir +num_expected_cnn_stories = 92579 +num_expected_dm_stories = 219506 + +VOCAB_SIZE = 200000 +CHUNK_SIZE = 1000 # num examples per chunk, for the chunked data + + +def chunk_file(set_name): + in_file = finished_files_dir + os.sep + "%s.json" % set_name + reader = open(in_file, "r") + chunk = 0 + finished = False + while not finished: + chunk_fname = os.path.join(chunks_dir, "%s_%03d.json" % (set_name, chunk)) # new chunk + with open(chunk_fname, "w") as writer: + for _ in range(CHUNK_SIZE): + len_line = reader.readline() + if not len_line: + finished = True + break + writer.write(len_line) + chunk += 1 + + +def chunk_all(): + # Make a dir to hold the chunks + if not os.path.isdir(chunks_dir): + os.mkdir(chunks_dir) + # Chunk the data + for set_name in ["train", "val", "test"]: + print("Splitting %s data into chunks..." % set_name) + chunk_file(set_name) + print("Saved chunked data in %s" % chunks_dir) + + +def tokenize_stories(stories_dir, tokenized_stories_dir): + """Maps a whole directory of .story files to a tokenized version using Stanford CoreNLP Tokenizer""" + print("Preparing to tokenize %s to %s..." % (stories_dir, tokenized_stories_dir)) + stories = os.listdir(stories_dir) + # make IO list file + print("Making list of files to tokenize...") + with open("mapping.txt", "w") as f: + for s in stories: + f.write("%s \t %s\n" % (os.path.join(stories_dir, s), os.path.join(tokenized_stories_dir, s))) + command = ["java", "edu.stanford.nlp.process.PTBTokenizer", "-ioFileList", "-preserveLines", "mapping.txt"] + print("Tokenizing %i files in %s and saving in %s..." % (len(stories), stories_dir, tokenized_stories_dir)) + subprocess.call(command) + print("Stanford CoreNLP Tokenizer has finished.") + os.remove("mapping.txt") + + # Check that the tokenized stories directory contains the same number of files as the original directory + num_orig = len(os.listdir(stories_dir)) + num_tokenized = len(os.listdir(tokenized_stories_dir)) + if num_orig != num_tokenized: + raise Exception( + "The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" 
+ % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig) + ) + print("Successfully finished tokenizing %s to %s.\n" % (stories_dir, tokenized_stories_dir)) + + +def read_text_file(text_file): + lines = [] + with open(text_file, "r") as f: + for line in f: + lines.append(line.strip()) + return lines + + +def hashhex(s): + """Returns a heximal formated SHA1 hash of the input string.""" + h = hashlib.sha1() + h.update(s.encode()) + return h.hexdigest() + + +def get_url_hashes(url_list): + return [hashhex(url) for url in url_list] + + +def fix_missing_period(line): + """Adds a period to a line that is missing a period""" + if "@highlight" in line: + return line + if line == "": + return line + if line[-1] in END_TOKENS: + return line + # print line[-1] + return line + " ." + + +def get_art_abs(story_file): + lines = read_text_file(story_file) + + # Lowercase everything + lines = [line.lower() for line in lines] + + # Put periods on the ends of lines that are missing them (this is a problem in the dataset because many image captions don't end in periods; consequently they end up in the body of the article as run-on sentences) + lines = [fix_missing_period(line) for line in lines] + + # Separate out article and abstract sentences + article_lines = [] + highlights = [] + next_is_highlight = False + for idx, line in enumerate(lines): + if line == "": + continue # empty line + elif line.startswith("@highlight"): + next_is_highlight = True + elif next_is_highlight: + highlights.append(line) + else: + article_lines.append(line) + + # Make article into a single string + article = " ".join(article_lines) + + # Make abstract into a signle string, putting <s> and </s> tags around the sentences + abstract = " ".join(["%s %s %s" % (SENTENCE_START, sent, SENTENCE_END) for sent in highlights]) + + return article, abstract + + +def write_to_bin(url_file, out_file, makevocab=False): + """Reads the tokenized .story files corresponding to the urls listed in the url_file and writes them to a out_file.""" + print("Making bin file for URLs listed in %s..." % url_file) + url_list = read_text_file(url_file) + url_hashes = get_url_hashes(url_list) + story_fnames = [s + ".story" for s in url_hashes] + num_stories = len(story_fnames) + + if makevocab: + vocab_counter = collections.Counter() + + with open(out_file, "w") as writer: + for idx, s in enumerate(story_fnames): + if idx % 1000 == 0: + print( + "Writing story %i of %i; %.2f percent done" + % (idx, num_stories, float(idx) * 100.0 / float(num_stories)) + ) + + # Look in the tokenized story dirs to find the .story file corresponding to this url + if os.path.isfile(os.path.join(cnn_tokenized_stories_dir, s)): + story_file = os.path.join(cnn_tokenized_stories_dir, s) + elif os.path.isfile(os.path.join(dm_tokenized_stories_dir, s)): + story_file = os.path.join(dm_tokenized_stories_dir, s) + else: + print( + "Error: Couldn't find tokenized story file %s in either tokenized story directories %s and %s. Was there an error during tokenization?" + % (s, cnn_tokenized_stories_dir, dm_tokenized_stories_dir) + ) + # Check again if tokenized stories directories contain correct number of files + print( + "Checking that the tokenized stories directories %s and %s contain correct number of files..." 
+ % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir) + ) + check_num_stories(cnn_tokenized_stories_dir, num_expected_cnn_stories) + check_num_stories(dm_tokenized_stories_dir, num_expected_dm_stories) + raise Exception( + "Tokenized stories directories %s and %s contain correct number of files but story file %s found in neither." + % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir, s) + ) + + # Get the strings to write to .json file + article, abstract = get_art_abs(story_file) + + # Write to json + writer.write(json.dumps({"article": article, "abstract": abstract}) + "\n") + + # Write the vocab to file, if applicable + if makevocab: + art_tokens = article.split(" ") + abs_tokens = abstract.split(" ") + abs_tokens = [ + t for t in abs_tokens if t not in [SENTENCE_START, SENTENCE_END] + ] # remove these tags from vocab + tokens = art_tokens + abs_tokens + tokens = [t.strip() for t in tokens] # strip + tokens = [t for t in tokens if t != ""] # remove empty + vocab_counter.update(tokens) + + print("Finished writing file %s\n" % out_file) + + # write vocab to file + if makevocab: + print("Writing vocab file...") + with open(os.path.join(finished_files_dir, "vocab"), "w") as writer: + for word, count in vocab_counter.most_common(VOCAB_SIZE): + writer.write(word + " " + str(count) + "\n") + print("Finished writing vocab file") + + +def check_num_stories(stories_dir, num_expected): + num_stories = len(os.listdir(stories_dir)) + if num_stories != num_expected: + raise Exception( + "stories directory %s contains %i files but should contain %i" % (stories_dir, num_stories, num_expected) + ) + + +if __name__ == "__main__": + if len(sys.argv) != 3: + print("USAGE: python make_datafiles.py <cnn_stories_dir> <dailymail_stories_dir>") + sys.exit() + cnn_stories_dir = sys.argv[1] + dm_stories_dir = sys.argv[2] + + # Check the stories directories contain the correct number of .story files + check_num_stories(cnn_stories_dir, num_expected_cnn_stories) + check_num_stories(dm_stories_dir, num_expected_dm_stories) + + # Create some new directories + if not os.path.exists(cnn_tokenized_stories_dir): + os.makedirs(cnn_tokenized_stories_dir) + if not os.path.exists(dm_tokenized_stories_dir): + os.makedirs(dm_tokenized_stories_dir) + if not os.path.exists(finished_files_dir): + os.makedirs(finished_files_dir) + + # Run stanford tokenizer on both stories dirs, outputting to tokenized stories directories + tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir) + tokenize_stories(dm_stories_dir, dm_tokenized_stories_dir) + + # Read the tokenized stories, do a little postprocessing then write to bin files + write_to_bin(all_test_urls, os.path.join(finished_files_dir, "test.json")) + write_to_bin(all_val_urls, os.path.join(finished_files_dir, "val.json")) + write_to_bin(all_train_urls, os.path.join(finished_files_dir, "train.json"), makevocab=True) + + # Chunk the data. This splits each of train.json, val.json and test.json into smaller chunks, each containing e.g. 
1000 examples, and saves them in finished_files/chunks + chunk_all() diff --git a/examples/text_summarization/pointer_summarizer/model.py b/examples/text_summarization/pointer_summarizer/model.py new file mode 100644 index 0000000000000000000000000000000000000000..413df876e05be34ccfc97f0cd289ef028bd65abb --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/model.py @@ -0,0 +1,263 @@ +import os +import sys + +import paddle +import paddle.nn.initializer as I + +import paddle.nn as nn +import paddle.nn.functional as F +import config + + +def paddle2D_scatter_add(x_tensor, index_tensor, update_tensor, dim=0): + dim0, dim1 = update_tensor.shape + update_tensor = paddle.flatten(update_tensor, start_axis=0, stop_axis=1) + index_tensor = paddle.reshape(index_tensor, [-1, 1]) + if dim == 0: + index_tensor = paddle.concat(x=[index_tensor, (paddle.arange(dim1 * dim0) % dim0).unsqueeze(1)], axis=1) + elif dim == 1: + index_tensor = paddle.concat(x=[(paddle.arange(dim1 * dim0) // dim1).unsqueeze(1), index_tensor], axis=1) + output_tensor = paddle.scatter_nd_add(x_tensor, index_tensor, update_tensor) + return output_tensor + + +class Encoder(paddle.nn.Layer): + def __init__(self): + super(Encoder, self).__init__() + + # Initialized embeddings + self.embedding = nn.Embedding( + config.vocab_size, + config.emb_dim, + weight_attr=paddle.ParamAttr(initializer=I.Normal(std=config.trunc_norm_init_std)), + ) + + # Initialized lstm weights + self.lstm = nn.LSTM( + config.emb_dim, + config.hidden_dim, + num_layers=1, + direction="bidirect", + weight_ih_attr=paddle.ParamAttr( + initializer=I.Uniform(low=-config.rand_unif_init_mag, high=config.rand_unif_init_mag) + ), + bias_ih_attr=paddle.ParamAttr(initializer=I.Constant(value=0.0)), + ) + + # Initialized linear weights + self.W_h = nn.Linear(config.hidden_dim * 2, config.hidden_dim * 2, bias_attr=False) + + # The variable seq_lens should be in descending order + def forward(self, input, seq_lens): + embedded = self.embedding(input) + self.embedded = embedded + + output, hidden = self.lstm(embedded, sequence_length=paddle.to_tensor(seq_lens, dtype="int32")) + + encoder_feature = paddle.reshape(output, [-1, 2 * config.hidden_dim]) # B * t_k x 2*hidden_dim + encoder_feature = self.W_h(encoder_feature) + + return output, encoder_feature, hidden + + +class ReduceState(paddle.nn.Layer): + def __init__(self): + super(ReduceState, self).__init__() + + self.reduce_h = nn.Linear( + config.hidden_dim * 2, + config.hidden_dim, + weight_attr=paddle.ParamAttr(initializer=I.Normal(std=config.trunc_norm_init_std)), + ) + self.reduce_c = nn.Linear( + config.hidden_dim * 2, + config.hidden_dim, + weight_attr=paddle.ParamAttr(initializer=I.Normal(std=config.trunc_norm_init_std)), + ) + + def forward(self, hidden): + h, c = hidden # h, c dim = 2 x b x hidden_dim + h_in = paddle.reshape(h.transpose([1, 0, 2]), [-1, config.hidden_dim * 2]) + hidden_reduced_h = F.relu(self.reduce_h(h_in)) + c_in = paddle.reshape(c.transpose([1, 0, 2]), [-1, config.hidden_dim * 2]) + hidden_reduced_c = F.relu(self.reduce_c(c_in)) + + return (hidden_reduced_h.unsqueeze(0), hidden_reduced_c.unsqueeze(0)) # h, c dim = 1 x b x hidden_dim + + +class Attention(paddle.nn.Layer): + def __init__(self): + super(Attention, self).__init__() + # Attention + if config.is_coverage: + self.W_c = nn.Linear(1, config.hidden_dim * 2, bias_attr=False) + self.decode_proj = nn.Linear(config.hidden_dim * 2, config.hidden_dim * 2) + self.v = nn.Linear(config.hidden_dim * 2, 1, bias_attr=False) + + def 
forward(self, s_t_hat, encoder_outputs, encoder_feature, enc_padding_mask, coverage): + b, t_k, n = encoder_outputs.shape + + dec_fea = self.decode_proj(s_t_hat) # B x 2*hidden_dim + dec_fea_expanded = paddle.expand(dec_fea.unsqueeze(1), [b, t_k, n]) # B x t_k x 2*hidden_dim + dec_fea_expanded = paddle.reshape(dec_fea_expanded, [-1, n]) # B * t_k x 2*hidden_dim + + att_features = encoder_feature + dec_fea_expanded # B * t_k x 2*hidden_dim + if config.is_coverage: + coverage_input = paddle.reshape(coverage, [-1, 1]) # B * t_k x 1 + coverage_feature = self.W_c(coverage_input) # B * t_k x 2*hidden_dim + att_features = att_features + coverage_feature + + e = F.tanh(att_features) # B * t_k x 2*hidden_dim + scores = self.v(e) # B * t_k x 1 + scores = paddle.reshape(scores, [-1, t_k]) # B x t_k + + attn_dist_ = F.softmax(scores, axis=1) * enc_padding_mask # B x t_k + normalization_factor = attn_dist_.sum(1, keepdim=True) + # attn_dist = attn_dist_ / normalization_factor + attn_dist = attn_dist_ / ( + paddle.reshape(normalization_factor, [-1, 1]) + + paddle.ones_like(paddle.reshape(normalization_factor, [-1, 1])) * sys.float_info.epsilon + ) + # See the issue: https://github.com/atulkum/pointer_summarizer/issues/54 + + attn_dist = attn_dist.unsqueeze(1) # B x 1 x t_k + c_t = paddle.bmm(attn_dist, encoder_outputs) # B x 1 x n + c_t = paddle.reshape(c_t, [-1, config.hidden_dim * 2]) # B x 2*hidden_dim + + attn_dist = paddle.reshape(attn_dist, [-1, t_k]) # B x t_k + + if config.is_coverage: + coverage = paddle.reshape(coverage, [-1, t_k]) + coverage = coverage + attn_dist + + return c_t, attn_dist, coverage + + +class Decoder(paddle.nn.Layer): + def __init__(self): + super(Decoder, self).__init__() + self.attention_network = Attention() + # Decoder + self.embedding = nn.Embedding( + config.vocab_size, + config.emb_dim, + weight_attr=paddle.ParamAttr(initializer=I.Normal(std=config.trunc_norm_init_std)), + ) + + self.x_context = nn.Linear(config.hidden_dim * 2 + config.emb_dim, config.emb_dim) + + self.lstm = nn.LSTM( + config.emb_dim, + config.hidden_dim, + num_layers=1, + direction="forward", + weight_ih_attr=paddle.ParamAttr( + initializer=I.Uniform(low=-config.rand_unif_init_mag, high=config.rand_unif_init_mag) + ), + bias_ih_attr=paddle.ParamAttr(initializer=I.Constant(value=0.0)), + ) + + if config.pointer_gen: + self.p_gen_linear = nn.Linear(config.hidden_dim * 4 + config.emb_dim, 1) + + self.out1 = nn.Linear(config.hidden_dim * 3, config.hidden_dim) + self.out2 = nn.Linear( + config.hidden_dim, + config.vocab_size, + weight_attr=paddle.ParamAttr(initializer=I.Normal(std=config.trunc_norm_init_std)), + ) + + def forward( + self, + y_t_1, + s_t_1, + encoder_outputs, + encoder_feature, + enc_padding_mask, + c_t_1, + extra_zeros, + enc_batch_extend_vocab, + coverage, + step, + ): + if not self.training and step == 0: + h_decoder, c_decoder = s_t_1 + s_t_hat = paddle.concat( + ( + paddle.reshape(h_decoder, [-1, config.hidden_dim]), + paddle.reshape(c_decoder, [-1, config.hidden_dim]), + ), + 1, + ) # B x 2*hidden_dim + c_t, _, coverage_next = self.attention_network( + s_t_hat, encoder_outputs, encoder_feature, enc_padding_mask, coverage + ) + coverage = coverage_next + + y_t_1_embd = self.embedding(y_t_1) + x = self.x_context(paddle.concat((c_t_1, y_t_1_embd), 1)) + lstm_out, s_t = self.lstm(x.unsqueeze(1), s_t_1) + + h_decoder, c_decoder = s_t + s_t_hat = paddle.concat( + (paddle.reshape(h_decoder, [-1, config.hidden_dim]), paddle.reshape(c_decoder, [-1, config.hidden_dim])), 1 + ) # B x 
2*hidden_dim + c_t, attn_dist, coverage_next = self.attention_network( + s_t_hat, encoder_outputs, encoder_feature, enc_padding_mask, coverage + ) + + if self.training or step > 0: + coverage = coverage_next + + p_gen = None + if config.pointer_gen: + p_gen_input = paddle.concat((c_t, s_t_hat, x), 1) # B x (2*2*hidden_dim + emb_dim) + p_gen = self.p_gen_linear(p_gen_input) + p_gen = F.sigmoid(p_gen) + + output = paddle.concat((paddle.reshape(lstm_out, [-1, config.hidden_dim]), c_t), 1) # B x hidden_dim * 3 + output1 = self.out1(output) # B x hidden_dim + output2 = self.out2(output1) # B x vocab_size + vocab_dist = F.softmax(output2, axis=1) + + if config.pointer_gen: + vocab_dist_ = p_gen * vocab_dist + attn_dist_ = (1 - p_gen) * attn_dist + + if extra_zeros is not None: + vocab_dist_ = paddle.concat([vocab_dist_, extra_zeros], 1) + final_dist = paddle2D_scatter_add(vocab_dist_, enc_batch_extend_vocab, attn_dist_, 1) + else: + final_dist = vocab_dist + + return final_dist, s_t, c_t, attn_dist, p_gen, coverage + + +class Model(object): + def __init__(self, model_file_path=None, is_eval=False): + super(Model, self).__init__() + encoder = Encoder() + decoder = Decoder() + reduce_state = ReduceState() + + # Shared the embedding between encoder and decoder + decoder.embedding.weight = encoder.embedding.weight + + if paddle.distributed.get_world_size() > 1: + encoder = paddle.DataParallel(encoder) + decoder = paddle.DataParallel(decoder) + reduce_state = paddle.DataParallel(reduce_state) + + if is_eval: + encoder.eval() + decoder.eval() + reduce_state.eval() + + self.encoder = encoder + self.decoder = decoder + self.reduce_state = reduce_state + + if model_file_path is not None: + self.decoder.set_state_dict(paddle.load(os.path.join(model_file_path, "decoder.params"))) + self.encoder.set_state_dict(paddle.load(os.path.join(model_file_path, "encoder.params"))) + self.reduce_state.set_state_dict(paddle.load(os.path.join(model_file_path, "reduce_state.params"))) diff --git a/examples/text_summarization/pointer_summarizer/rouge_eval.py b/examples/text_summarization/pointer_summarizer/rouge_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..1d0dd29b9d833a33d74904e68b2b891374344f6c --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/rouge_eval.py @@ -0,0 +1,24 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
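For reference, the pointer-generator head in `model.py` above forms the final distribution by scatter-adding the attention mass onto the extended-vocabulary ids of the source tokens (`paddle2D_scatter_add` with `dim=1`). The snippet below is a standalone numeric sketch of that operation; the tensor values and the helper name `scatter_add_2d` are made up for illustration.

```python
# Illustrative sketch of the row-wise scatter-add behind final_dist in model.py.
import paddle

vocab_dist = paddle.to_tensor([[0.5, 0.3, 0.2, 0.0],   # batch of 2, extended vocab of 4
                               [0.4, 0.4, 0.2, 0.0]])
attn_dist = paddle.to_tensor([[0.6, 0.4],              # attention over 2 source positions
                              [0.9, 0.1]])
src_ids = paddle.to_tensor([[3, 1],                    # extended-vocab id of each source token
                            [0, 3]])


def scatter_add_2d(x, index, updates):
    # Same trick as paddle2D_scatter_add(dim=1): build (row, col) pairs, then scatter_nd_add.
    rows, cols = updates.shape
    flat_updates = paddle.flatten(updates)
    col_index = paddle.reshape(index, [-1, 1])
    row_index = (paddle.arange(rows * cols) // cols).unsqueeze(1)
    nd_index = paddle.concat([row_index, col_index], axis=1)
    return paddle.scatter_nd_add(x, nd_index, flat_updates)


final_dist = scatter_add_2d(vocab_dist, src_ids, attn_dist)
print(final_dist)  # row 0 -> [0.5, 0.7, 0.2, 0.6], row 1 -> [1.3, 0.4, 0.2, 0.1]
```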
+ +import sys + +from utils import rouge_eval, rouge_log + +decode_dir = sys.argv[1] + +print("Decoder has finished reading dataset for single_pass.") +print("Now starting ROUGE eval...") +results_dict = rouge_eval(decode_dir + "rouge_ref", decode_dir + "rouge_dec_dir") +rouge_log(results_dict, decode_dir + "rouge_dec_dir") diff --git a/examples/text_summarization/pointer_summarizer/train.py b/examples/text_summarization/pointer_summarizer/train.py new file mode 100644 index 0000000000000000000000000000000000000000..6c9406cc0686bfb919d892ca2f5207ab7849590a --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/train.py @@ -0,0 +1,202 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys +import time + +import config +import paddle +from data import Batcher, Vocab +from model import Model +from paddle.optimizer import Adagrad +from train_util import get_input_from_batch, get_output_from_batch +from utils import calc_running_avg_loss + +# Flush out immediately +sys.stdout.flush() + + +class Trainer(object): + def __init__(self): + self.vocab = Vocab(config.vocab_path, config.vocab_size) + self.batcher = Batcher( + config.train_data_path, self.vocab, mode="train", batch_size=config.batch_size, single_pass=False + ) + + train_dir = os.path.join(config.log_root, "train_%d" % (int(time.time()))) + + if not os.path.exists(config.log_root): + os.mkdir(config.log_root) + + if not os.path.exists(train_dir): + os.mkdir(train_dir) + + self.model_dir = os.path.join(train_dir, "model") + if not os.path.exists(self.model_dir): + os.mkdir(self.model_dir) + + def save_model(self, running_avg_loss, iter): + state = { + "encoder": self.model.encoder.state_dict(), + "decoder": self.model.decoder.state_dict(), + "reduce_state": self.model.reduce_state.state_dict(), + "optimizer": self.optimizer.state_dict(), + } + model_save_dir = os.path.join(self.model_dir, "model_%06d_%.8f" % (iter, running_avg_loss)) + for k in state: + model_save_path = os.path.join(model_save_dir, "%s.params" % k) + paddle.save(state[k], model_save_path) + return model_save_dir + + def setup_train(self, model_file_path=None): + self.model = Model(model_file_path) + + initial_lr = config.lr_coverage if config.is_coverage else config.lr + params = ( + list(self.model.encoder.parameters()) + + list(self.model.decoder.parameters()) + + list(self.model.reduce_state.parameters()) + ) + + self.optimizer = Adagrad( + parameters=params, + learning_rate=initial_lr, + initial_accumulator_value=config.adagrad_init_acc, + epsilon=1.0e-10, + grad_clip=paddle.nn.ClipGradByGlobalNorm(clip_norm=config.max_grad_norm), + ) + + start_iter, start_loss = 0, 0 + + if model_file_path is not None: + start_iter = int(model_file_path.split("_")[-2]) + start_loss = float(model_file_path.split("_")[-1].replace(os.sep, "")) + + if not config.is_coverage: + self.optimizer.set_state_dict(paddle.load(os.path.join(model_file_path, "optimizer.params"))) + + 
return start_iter, start_loss + + def train_one_batch(self, batch, iter): + + ( + enc_batch, + enc_padding_mask, + enc_lens, + enc_batch_extend_vocab, + extra_zeros, + c_t_1, + coverage, + ) = get_input_from_batch(batch) + dec_batch, dec_padding_mask, max_dec_len, dec_lens_var, target_batch = get_output_from_batch(batch) + + self.optimizer.clear_gradients() + + encoder_outputs, encoder_feature, encoder_hidden = self.model.encoder(enc_batch, enc_lens) + s_t_1 = self.model.reduce_state(encoder_hidden) + + step_losses = [] + for di in range(min(max_dec_len, config.max_dec_steps)): + y_t_1 = dec_batch[:, di] + + final_dist, s_t_1, c_t_1, attn_dist, p_gen, next_coverage = self.model.decoder( + y_t_1, + s_t_1, + encoder_outputs, + encoder_feature, + enc_padding_mask, + c_t_1, + extra_zeros, + enc_batch_extend_vocab, + coverage, + di, + ) + + target = target_batch[:, di] + add_index = paddle.arange(0, target.shape[0]) + new_index = paddle.stack([add_index, target], axis=1) + gold_probs = paddle.gather_nd(final_dist, new_index).squeeze() + step_loss = -paddle.log(gold_probs + config.eps) + + if config.is_coverage: + step_coverage_loss = paddle.sum(paddle.minimum(attn_dist, coverage), 1) + step_loss = step_loss + config.cov_loss_wt * step_coverage_loss + coverage = next_coverage + + step_mask = dec_padding_mask[:, di] + step_loss = step_loss * step_mask + step_losses.append(step_loss) + + sum_losses = paddle.sum(paddle.stack(step_losses, 1), 1) + batch_avg_loss = sum_losses / dec_lens_var + loss = paddle.mean(batch_avg_loss) + + loss.backward() + self.optimizer.minimize(loss) + + return float(loss) + + def trainIters(self, n_iters, model_file_path=None): + iter, running_avg_loss = self.setup_train(model_file_path) + start = time.time() + while iter < n_iters: + batch = self.batcher.next_batch() + loss = self.train_one_batch(batch, iter) + running_avg_loss = calc_running_avg_loss(loss, running_avg_loss, iter) + iter += 1 + print( + "global step %d/%d, step loss: %.8f, running avg loss: %.8f, speed: %.2f step/s" + % (iter, n_iters, loss, running_avg_loss, 1.0 / (time.time() - start)) + ) + start = time.time() + if iter % 5000 == 0 or iter == 1: + if paddle.distributed.get_rank() == 0: + model_save_dir = self.save_model(running_avg_loss, iter) + print( + "Saved model for iter %d with running avg loss %.8f to directory: %s" + % (iter, running_avg_loss, model_save_dir) + ) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Train script") + parser.add_argument( + "-m", dest="model_file_path", required=False, default=None, help="Model file for retraining (default: None)." + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override config.max_iterations.", + ) + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + + args = parser.parse_args() + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + train_processor = Trainer() + if args.max_steps > 0: + train_processor.trainIters(args.max_steps, args.model_file_path) + else: + train_processor.trainIters(config.max_iterations, args.model_file_path) diff --git a/examples/text_summarization/pointer_summarizer/train_util.py b/examples/text_summarization/pointer_summarizer/train_util.py new file mode 100644 index 0000000000000000000000000000000000000000..63773ce4a58fb1c8033004569853e5e68c6707a7 --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/train_util.py @@ -0,0 +1,38 @@ +import numpy as np +import paddle +import config + + +def get_input_from_batch(batch): + batch_size = len(batch.enc_lens) + enc_batch = paddle.to_tensor(batch.enc_batch, dtype="int64") + enc_padding_mask = paddle.to_tensor(batch.enc_padding_mask, dtype="float32") + enc_lens = batch.enc_lens + extra_zeros = None + enc_batch_extend_vocab = None + + if config.pointer_gen: + enc_batch_extend_vocab = paddle.to_tensor(batch.enc_batch_extend_vocab, dtype="int64") + # The variable max_art_oovs is the max over all the article oov list in the batch + if batch.max_art_oovs > 0: + extra_zeros = paddle.zeros((batch_size, batch.max_art_oovs)) + + c_t_1 = paddle.zeros((batch_size, 2 * config.hidden_dim)) + + coverage = None + if config.is_coverage: + coverage = paddle.zeros(enc_batch.shape) + + return enc_batch, enc_padding_mask, enc_lens, enc_batch_extend_vocab, extra_zeros, c_t_1, coverage + + +def get_output_from_batch(batch): + dec_batch = paddle.to_tensor(batch.dec_batch, dtype="int64") + dec_padding_mask = paddle.to_tensor(batch.dec_padding_mask, dtype="float32") + dec_lens = batch.dec_lens + max_dec_len = np.max(dec_lens) + dec_lens_var = paddle.to_tensor(dec_lens, dtype="float32") + + target_batch = paddle.to_tensor(batch.target_batch, dtype="int64") + + return dec_batch, dec_padding_mask, max_dec_len, dec_lens_var, target_batch diff --git a/examples/text_summarization/pointer_summarizer/utils.py b/examples/text_summarization/pointer_summarizer/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..2cc081d905d3652b3b2fdd33a041e824b8f46182 --- /dev/null +++ b/examples/text_summarization/pointer_summarizer/utils.py @@ -0,0 +1,99 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
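The per-step loss in `train_one_batch` above gathers the probability of each gold token with `paddle.gather_nd`, takes its negative log, and zeroes out padded positions with the decoder mask. Below is a minimal numeric sketch of that single step on made-up tensors; the literal `1e-12` merely stands in for `config.eps`.

```python
# Illustrative sketch of the masked negative log-likelihood step in train.py.
import paddle

final_dist = paddle.to_tensor([[0.7, 0.2, 0.1],
                               [0.1, 0.6, 0.3]])         # batch of 2, vocab of 3
target = paddle.to_tensor([0, 2])                         # gold token id for each example
step_mask = paddle.to_tensor([1.0, 0.0])                  # second example is padding at this step

add_index = paddle.arange(0, target.shape[0])
new_index = paddle.stack([add_index, target], axis=1)     # [[0, 0], [1, 2]]
gold_probs = paddle.gather_nd(final_dist, new_index)      # [0.7, 0.3]
step_loss = -paddle.log(gold_probs + 1e-12) * step_mask   # padded position contributes 0
print(step_loss)
```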
+
+# Content of this file is copied from https://github.com/abisee/pointer-generator/blob/master/
+import logging
+import os
+
+import pyrouge
+
+
+def print_results(article, abstract, decoded_output):
+    print("")
+    print("ARTICLE: %s" % article)
+    print("REFERENCE SUMMARY: %s" % abstract)
+    print("GENERATED SUMMARY: %s" % decoded_output)
+    print("")
+
+
+def make_html_safe(s):
+    # Escape angle brackets so the text does not break the HTML files produced by pyrouge.
+    s = s.replace("<", "&lt;")
+    s = s.replace(">", "&gt;")
+    return s
+
+
+def rouge_eval(ref_dir, dec_dir):
+    r = pyrouge.Rouge155()
+    r.model_filename_pattern = "#ID#_reference.txt"
+    r.system_filename_pattern = r"(\d+)_decoded.txt"
+    r.model_dir = ref_dir
+    r.system_dir = dec_dir
+    logging.getLogger("global").setLevel(logging.WARNING)  # silence pyrouge logging
+    rouge_results = r.convert_and_evaluate()
+    return r.output_to_dict(rouge_results)
+
+
+def rouge_log(results_dict, dir_to_write):
+    log_str = ""
+    for x in ["1", "2", "l"]:
+        log_str += "\nROUGE-%s:\n" % x
+        for y in ["f_score", "recall", "precision"]:
+            key = "rouge_%s_%s" % (x, y)
+            key_cb = key + "_cb"
+            key_ce = key + "_ce"
+            val = results_dict[key]
+            val_cb = results_dict[key_cb]
+            val_ce = results_dict[key_ce]
+            log_str += "%s: %.4f with confidence interval (%.4f, %.4f)\n" % (key, val, val_cb, val_ce)
+    print(log_str)
+    results_file = os.path.join(dir_to_write, "ROUGE_results.txt")
+    print("Writing final ROUGE results to %s..." % results_file)
+    with open(results_file, "w") as f:
+        f.write(log_str)
+
+
+def calc_running_avg_loss(loss, running_avg_loss, step, decay=0.99):
+    if running_avg_loss == 0:  # on the first iteration just take the loss
+        running_avg_loss = loss
+    else:
+        running_avg_loss = running_avg_loss * decay + (1 - decay) * loss
+    running_avg_loss = min(running_avg_loss, 12)  # clip
+    return running_avg_loss
+
+
+def write_for_rouge(reference_sents, decoded_words, ex_index, _rouge_ref_dir, _rouge_dec_dir):
+    decoded_sents = []
+    while len(decoded_words) > 0:
+        try:
+            fst_period_idx = decoded_words.index(".")
+        except ValueError:
+            fst_period_idx = len(decoded_words)
+        sent = decoded_words[: fst_period_idx + 1]
+        decoded_words = decoded_words[fst_period_idx + 1 :]
+        decoded_sents.append(" ".join(sent))
+
+    # Pyrouge calls a perl script that puts the data into HTML files.
+    # Therefore we need to make our output HTML safe.
+ decoded_sents = [make_html_safe(w) for w in decoded_sents] + reference_sents = [make_html_safe(w) for w in reference_sents] + + ref_file = os.path.join(_rouge_ref_dir, "%06d_reference.txt" % ex_index) + decoded_file = os.path.join(_rouge_dec_dir, "%06d_decoded.txt" % ex_index) + + with open(ref_file, "w") as f: + for idx, sent in enumerate(reference_sents): + f.write(sent) if idx == len(reference_sents) - 1 else f.write(sent + "\n") + with open(decoded_file, "w") as f: + for idx, sent in enumerate(decoded_sents): + f.write(sent) if idx == len(decoded_sents) - 1 else f.write(sent + "\n") diff --git a/examples/text_summarization/prophetnet/README.md b/examples/text_summarization/prophetnet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a96213df8b809faa3be61e12e60d3c4ff2b311a0 --- /dev/null +++ b/examples/text_summarization/prophetnet/README.md @@ -0,0 +1,250 @@ +# Prophetnet + +## 模型简介 + +ProphetNet(先知网络)是一种新型的 seq2seq 预训练模型。在训练时,Prophetnet 每一时刻将会学习同时预测未来的 N 个字符,这种自监督学习目标可以使得模型考虑未来更远的字符,防止模型对强局部相关(strong +local correlation)过拟合。 + +本项目是 Prophetnet 在 PaddlePaddle 2.4 上开源实现的文本摘要的例子,包含了在 CNN/DailyMail 数据集,Gigaword 数据集上微调和生成的代码。 + +### 项目依赖 + +``` +pip install -r requirements.txt +python -m pip install paddlepaddle-gpu==2.4.1.post117 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html +pip install paddlenlp==2.5.2 +``` + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +├── train_prophetnet.py # 模型finetune主程序入口 +├── generate.py # 模型生成主程序入口 +├── eval.py # 生成结果评估入口 +├── uncase_tokenize_data.py # 数据预处理 +├── uncompress_data.sh # 数据解压脚本 +├── run_train.sh # 模型训练脚本 +├── run_eval.sh # 模型评估脚本 +├── requirements.txt # 环境依赖文件 +└── README.md # 文档说明 +``` + +### 数据准备 + +GLGE 数据集下载:[链接](https://drive.google.com/file/d/1F4zppa9Gqrh6iNyVsZJkxfbm5waalqEA/view) + +GLGE 测试集下载:[链接](https://drive.google.com/file/d/11lDXIG87dChIfukq3x2Wx4r5_duCRm_J/view) + +将glge_public.tar与glge_hidden_v1.1.tar.gz放入到项目根目录下。 + +``` +bash uncompress_data.sh +``` + +### 数据预处理 + +``` +python uncase_tokenize_data.py --dataset <DATASET> +``` + +说明: + +- `<DATASET>`可选`cnndm`, `gigaword`. 
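例如,以 cnndm 数据集为例(示例命令,处理 gigaword 时将 `--dataset` 替换为 `gigaword` 即可):

```
python uncase_tokenize_data.py --dataset cnndm
```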
+ +### 模型训练 + +``` +bash run_train.sh <DATASET> +``` + +或直接运行finetune程序 + +- cnndm: + +``` +python -m paddle.distributed.launch --gpus 0 python train_prophetnet.py \ + --dataset=cnndm \ + --model_name_or_path=prophetnet-large-uncased \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=8 \ + --num_train_epochs=4 \ + --learning_rate=0.0001 \ + --warmup_init_lr=1e-07 \ + --warmup_steps=1000 \ + --max_grad_norm=0.1 \ + --dataloader_num_workers=4 \ + --logging_steps 10 \ + --save_steps 100 \ + --do_train \ + --do_eval \ + --output_dir=./ckpt/cnndm +``` + +- gigaword: + +``` +python -m paddle.distributed.launch --gpus 0 python train_prophetnet.py \ + --dataset=gigaword \ + --model_name_or_path=prophetnet-large-uncased \ + --per_device_train_batch_size=16 \ + --per_device_eval_batch_size=32 \ + --num_train_epochs=6 \ + --learning_rate=0.0001 \ + --warmup_init_lr=1e-07 \ + --warmup_steps=1000 \ + --max_grad_norm=0.1 \ + --dataloader_num_workers=8 \ + --logging_steps 10 \ + --save_steps 100 \ + --do_train \ + --do_eval \ + --output_dir=./ckpt/gigaword +``` + +其中参数释义如下: + +- `dataset` 指定数据集,可选cnndm和gigaword + +- `model_name_or_path` 预训练模型名称或本地预训练模型初始化权重文件路径 + +- `per_device_train_batch_size` 表示单卡训练样本批大小 + +- `per_device_eval_batch_size` 表示单卡验证样本批大小 + +- `num_train_epochs` 表示训练轮数 + +- `learning_rate` 表示学习率 + +- `warmup_init_lr` 表示预热学习率 + +- `warmup_steps` 表示预热学习步数 + +- `max_grad_norm` 表示梯度裁剪 + +- `dataloader_num_workers` 指定数据加载规模 + +- `logging_steps` 表示打印结果间隔 + +- `save_steps`表示验证间隔 + +- `do_train` 表示是否训练 + +- `do_eval` 表示是否验证 + +- `output_idr` 指定微调结果权重存放路径 + +已经finetune好的模型权重: + +- cnndm : [链接](https://pan.baidu.com/s/1cemrUDxkqEW9raoasJ_VKw), 提取码:1egi + +- gigaword : [链接](https://pan.baidu.com/s/1qRH2FStT3vNQtDjZLkYJBQ), 提取码:on5v + +### 模型评估 + +使用prophetNet源码的[评估脚本](https://pan.baidu.com/s/1FOnd01rNvDJoONYegacq1Q), 此脚本依赖于pyrouge,需要提前安装rouge。 + +``` +pip install git+https://github.com/pltrdy/pyrouge +``` + +``` +bash run_eval.sh <DATASET> +``` + +或直接运行模型生成程序 + +- cnndm: + +``` +python generate.py \ + --dataset=cnndm \ + --model_name_or_path=prophetnet-large-uncased \ + --output_path=./generate/cnndm/generate.txt \ + --min_target_length=45 \ + --max_target_length=110 \ + --decode_strategy=beam_search \ + --num_beams=4 \ + --length_penalty=1.2 \ + --batch_size=16 \ + --ignore_pad_token_for_loss=True \ + --early_stopping=True \ + --logging_steps=100 \ + --device=gpu + +python eval.py --dataset cnndm --generated ./generate/cnndm/generate.txt +``` + +- gigaword: + +``` +python generate.py \ + --dataset=gigaword \ + --model_name_or_path=prophetnet-large-uncased \ + --output_path=./generate/gigaword/generate.txt \ + --min_target_length=1 \ + --max_target_length=200 \ + --decode_strategy=beam_search \ + --num_beams=4 \ + --length_penalty=1.6 \ + --batch_size=16 \ + --ignore_pad_token_for_loss=True \ + --early_stopping=True \ + --logging_steps=100 \ + --device=gpu + +python eval.py --dataset gigaword --generated ./generate/gigaword/generate.txt +``` + +其中参数释义如下: + +- `dataset` 指定数据集,可选cnndm和gigaword + +- `vocab_file` 指定词表文件 + +- `output_path` 指定生成结果存放路径 + +- `min_target_length` 指定解码最短长度 + +- `max_target_length` 指定解码最大长度 + +- `decode_strategy` 指定解码策略 + +- `num_beams` 指定beam_search解码宽度 + +- `length_penalty` 指定beam_search解码的长度指数惩罚 + +- `batch_size` 指定评估样本批大小 + +- `ignore_pad_token_for_loss` 表示计算loss时忽略padding + +- `early_stopping` 指定生成结束符是否停止预测 + +- `logging_steps` 指定日志打印间隔 + +- `device` 指定使用设备 + +### 微调测试精度 + +> #### 在CNN/DM数据集的测试效果如下表。 + +|网络 |opt|batch_size|数据集|ROUGE_1|ROUGE_2|ROUGE_L| +| 
:---: | :---: | :---: | :---: | :---: | :---: | :---: | +|prophetnet-large-uncased|Adam|4|CNN/DM|44.17|21.24|41.36| + +> #### 在gigaword数据集的测试效果如下表。 + +|网络 |opt|batch_size|数据集|ROUGE_1|ROUGE_2|ROUGE_L| +| :---: | :---: | :---: | :---: | :---: | :---: | :---: | +|prophetnet-large-uncased|Adam|16|gigaword|38.92|19.81|36.06| + +### 实验环境 + +- GPU RTX3090 * 1, CPU Intel i7-11700k +- Ubuntu 18.04 + +### 参考文献 + +1. Qi W, Yan Y, Gong Y, et al. Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training[J]. arXiv + preprint arXiv:2001.04063, 2020. diff --git a/examples/text_summarization/prophetnet/eval.py b/examples/text_summarization/prophetnet/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..01a1042c4eca395aa801650297d802069f35757b --- /dev/null +++ b/examples/text_summarization/prophetnet/eval.py @@ -0,0 +1,62 @@ +import argparse +import os +import re +import sys +from os import listdir +from os.path import isfile, join + +parser = argparse.ArgumentParser() +parser.add_argument("--dataset", type=str, help="choose from all, or 1 of 8 dataset like cnndm, gigaword etc.") +parser.add_argument("--generated", type=str, help="generated output file.") + +args = parser.parse_args() + +data_root_path = "data" + +support_dataset = ["cnndm", "gigaword"] +files2rouge_template = ".*ROUGE-1 Average_F: (?P<rouge1_f>\d+(\.\d*)?|\.\d+).*ROUGE-2 Average_F: (?P<rouge2_f>\d+(\.\d*)?|\.\d+).*ROUGE-L Average_F: (?P<rougeL_f>\d+(\.\d*)?|\.\d+).*" +# gigaword_template='.*ROUGE-1: (?P<rouge1_f>\d+(\.\d*)?|\.\d+).*ROUGE-2: (?P<rouge2_f>\d+(\.\d*)?|\.\d+).*ROUGE-L: (?P<rougeL_f>\d+(\.\d*)?|\.\d+).*' +qg_template = ".*Bleu_4: (?P<bleu4>\d+(\.\d*)?|\.\d+).*METEOR: (?P<meteor>\d+(\.\d*)?|\.\d+).*ROUGE_L: (?P<rougeL>\d+(\.\d*)?|\.\d+).*" +personachat_template = ".*?(?P<d1>[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?).*?(?P<d2>[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?).*Bleu_1: (?P<bleu1>\d+(\.\d*)?|\.\d+).*Bleu_2: (?P<bleu2>\d+(\.\d*)?|\.\d+).*" + + +def scale_up(d): + return {k: float(d[k]) * 100 for k in d.keys()} + + +def eval_one_dataset(): + golden_file = f"{data_root_path}/{args.dataset}_data/test.tgt" + + eval_template = { + "cnndm": f"python ./evaluate/cnndm/postprocess_cnn_dm.py --generated {generated_file} --golden {golden_file}", + "gigaword": f"python ./evaluate/gigaword/eval.py --perl --pred {generated_file} --gold {golden_file}", + } + + cmd = eval_template[args.dataset] + try: + output = os.popen(cmd).read() + if args.dataset in ["cnndm", "gigaword"]: + d = re.search(files2rouge_template, output.replace("\n", " ")).groupdict() + d = scale_up(d) + print(f"{args.dataset}\trouge1/rouge2/rougeL\t{d['rouge1_f']:.2f}/{d['rouge2_f']:.2f}/{d['rougeL_f']:.2f}") + except: + print("Unexpected error:", sys.exc_info()[0]) + print(f"{args.dataset} evaluate failed!") + + +if args.dataset != "all": + generated_file = args.generated + eval_one_dataset() +else: + output_root_path = args.generated + onlyfolders = [f for f in listdir(output_root_path) if not isfile(join(args.generated, f))] + for dataset in support_dataset: + for folder in onlyfolders: + if folder.startswith(dataset): + for hypo_file in listdir(args.generated + "/" + folder): + if "hypo" in hypo_file or "score" in hypo_file: + generated_file = args.generated + "/" + folder + "/" + hypo_file + print(f"{dataset}\tpredict_file:{generated_file}") + args.dataset = dataset + args.gnerated = generated_file + eval_one_dataset() diff --git a/examples/text_summarization/prophetnet/evaluate/cnndm/bs_pyrouge.py 
b/examples/text_summarization/prophetnet/evaluate/cnndm/bs_pyrouge.py new file mode 100644 index 0000000000000000000000000000000000000000..a4859890760a2a07e45890d9056a1d3dae131e88 --- /dev/null +++ b/examples/text_summarization/prophetnet/evaluate/cnndm/bs_pyrouge.py @@ -0,0 +1,649 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function, unicode_literals, division + +import codecs +import os +import platform +import re +from functools import partial +from subprocess import check_output +from tempfile import mkdtemp + +try: + from configparser import ConfigParser +except ImportError: + from ConfigParser import ConfigParser + +import logging +from pyrouge.utils import log +from pyrouge.utils.file_utils import verify_dir + +REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}", "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'} + + +def clean(x): + return re.sub(r"-lrb-|-rrb-|-lcb-|-rcb-|-lsb-|-rsb-|``|''", lambda m: REMAP.get(m.group()), x) + + +class DirectoryProcessor: + @staticmethod + def process(input_dir, output_dir, function): + """ + Apply function to all files in input_dir and save the resulting output + files in output_dir. + + """ + if not os.path.exists(output_dir): + os.makedirs(output_dir) + logger = log.get_global_console_logger() + logger.info("Processing files in {}.".format(input_dir)) + input_file_names = os.listdir(input_dir) + for input_file_name in input_file_names: + input_file = os.path.join(input_dir, input_file_name) + with codecs.open(input_file, "r", encoding="UTF-8") as f: + input_string = f.read() + output_string = function(input_string) + output_file = os.path.join(output_dir, input_file_name) + with codecs.open(output_file, "w", encoding="UTF-8") as f: + f.write(clean(output_string.lower())) + logger.info("Saved processed files to {}.".format(output_dir)) + + +class Rouge155(object): + """ + This is a wrapper for the ROUGE 1.5.5 summary evaluation package. + This class is designed to simplify the evaluation process by: + + 1) Converting summaries into a format ROUGE understands. + 2) Generating the ROUGE configuration file automatically based + on filename patterns. + + This class can be used within Python like this: + + rouge = Rouge155() + rouge.system_dir = 'test/systems' + rouge.model_dir = 'test/models' + + # The system filename pattern should contain one group that + # matches the document ID. + rouge.system_filename_pattern = 'SL.P.10.R.11.SL062003-(\d+).html' + + # The model filename pattern has '#ID#' as a placeholder for the + # document ID. If there are multiple model summaries, pyrouge + # will use the provided regex to automatically match them with + # the corresponding system summary. Here, [A-Z] matches + # multiple model summaries for a given #ID#. 
+ rouge.model_filename_pattern = 'SL.P.10.R.[A-Z].SL062003-#ID#.html' + + rouge_output = rouge.evaluate() + print(rouge_output) + output_dict = rouge.output_to_dict(rouge_output) + print(output_dict) + -> {'rouge_1_f_score': 0.95652, + 'rouge_1_f_score_cb': 0.95652, + 'rouge_1_f_score_ce': 0.95652, + 'rouge_1_precision': 0.95652, + [...] + + + To evaluate multiple systems: + + rouge = Rouge155() + rouge.system_dir = '/PATH/TO/systems' + rouge.model_dir = 'PATH/TO/models' + for system_id in ['id1', 'id2', 'id3']: + rouge.system_filename_pattern = \ + 'SL.P/.10.R.{}.SL062003-(\d+).html'.format(system_id) + rouge.model_filename_pattern = \ + 'SL.P.10.R.[A-Z].SL062003-#ID#.html' + rouge_output = rouge.evaluate(system_id) + print(rouge_output) + + """ + + def __init__(self, rouge_dir=None, rouge_args=None, temp_dir=None): + """ + Create a Rouge155 object. + + rouge_dir: Directory containing Rouge-1.5.5.pl + rouge_args: Arguments to pass through to ROUGE if you + don't want to use the default pyrouge + arguments. + + """ + self.temp_dir = temp_dir + self.log = log.get_global_console_logger() + self.log.setLevel(logging.WARNING) + self.__set_dir_properties() + self._config_file = None + self._settings_file = self.__get_config_path() + self.__set_rouge_dir(rouge_dir) + self.args = self.__clean_rouge_args(rouge_args) + self._system_filename_pattern = None + self._model_filename_pattern = None + + def save_home_dir(self): + config = ConfigParser() + section = "pyrouge settings" + config.add_section(section) + config.set(section, "home_dir", self._home_dir) + with open(self._settings_file, "w") as f: + config.write(f) + self.log.info("Set ROUGE home directory to {}.".format(self._home_dir)) + + @property + def settings_file(self): + """ + Path of the settings file, which stores the ROUGE home dir. + + """ + return self._settings_file + + @property + def bin_path(self): + """ + The full path of the ROUGE binary (although it's technically + a script), i.e. rouge_home_dir/ROUGE-1.5.5.pl + + """ + if self._bin_path is None: + raise Exception( + "ROUGE path not set. Please set the ROUGE home directory " + "and ensure that ROUGE-1.5.5.pl exists in it." + ) + return self._bin_path + + @property + def system_filename_pattern(self): + """ + The regular expression pattern for matching system summary + filenames. The regex string. + + E.g. "SL.P.10.R.11.SL062003-(\d+).html" will match the system + filenames in the SPL2003/system folder of the ROUGE SPL example + in the "sample-test" folder. + + Currently, there is no support for multiple systems. + + """ + return self._system_filename_pattern + + @system_filename_pattern.setter + def system_filename_pattern(self, pattern): + self._system_filename_pattern = pattern + + @property + def model_filename_pattern(self): + """ + The regular expression pattern for matching model summary + filenames. The pattern needs to contain the string "#ID#", + which is a placeholder for the document ID. + + E.g. "SL.P.10.R.[A-Z].SL062003-#ID#.html" will match the model + filenames in the SPL2003/system folder of the ROUGE SPL + example in the "sample-test" folder. + + "#ID#" is a placeholder for the document ID which has been + matched by the "(\d+)" part of the system filename pattern. + The different model summaries for a given document ID are + matched by the "[A-Z]" part. 
+ + """ + return self._model_filename_pattern + + @model_filename_pattern.setter + def model_filename_pattern(self, pattern): + self._model_filename_pattern = pattern + + @property + def config_file(self): + return self._config_file + + @config_file.setter + def config_file(self, path): + config_dir, _ = os.path.split(path) + verify_dir(config_dir, "configuration file") + self._config_file = path + + def split_sentences(self): + """ + ROUGE requires texts split into sentences. In case the texts + are not already split, this method can be used. + + """ + from pyrouge.utils.sentence_splitter import PunktSentenceSplitter + + self.log.info("Splitting sentences.") + ss = PunktSentenceSplitter() + + def sent_split_to_string(s): + return "\n".join(ss.split(s)) + + process_func = partial(DirectoryProcessor.process, function=sent_split_to_string) + self.__process_summaries(process_func) + + @staticmethod + def convert_summaries_to_rouge_format(input_dir, output_dir): + """ + Convert all files in input_dir into a format ROUGE understands + and saves the files to output_dir. The input files are assumed + to be plain text with one sentence per line. + + input_dir: Path of directory containing the input files. + output_dir: Path of directory in which the converted files + will be saved. + + """ + DirectoryProcessor.process(input_dir, output_dir, Rouge155.convert_text_to_rouge_format) + + @staticmethod + def convert_text_to_rouge_format(text, title="dummy title"): + """ + Convert a text to a format ROUGE understands. The text is + assumed to contain one sentence per line. + + text: The text to convert, containg one sentence per line. + title: Optional title for the text. The title will appear + in the converted file, but doesn't seem to have + any other relevance. + + Returns: The converted text as string. + + """ + sentences = text.split("\n") + sent_elems = [ + '<a name="{i}">[{i}]</a> <a href="#{i}" id={i}>' "{text}</a>".format(i=i, text=sent) + for i, sent in enumerate(sentences, start=1) + ] + html = """<html> +<head> +<title>{title} + + +{elems} + +""".format( + title=title, elems="\n".join(sent_elems) + ) + + return html + + @staticmethod + def write_config_static( + system_dir, system_filename_pattern, model_dir, model_filename_pattern, config_file_path, system_id=None + ): + """ + Write the ROUGE configuration file, which is basically a list + of system summary files and their corresponding model summary + files. + + pyrouge uses regular expressions to automatically find the + matching model summary files for a given system summary file + (cf. docstrings for system_filename_pattern and + model_filename_pattern). + + system_dir: Path of directory containing + system summaries. + system_filename_pattern: Regex string for matching + system summary filenames. + model_dir: Path of directory containing + model summaries. + model_filename_pattern: Regex string for matching model + summary filenames. + config_file_path: Path of the configuration file. + system_id: Optional system ID string which + will appear in the ROUGE output. 
+ + """ + system_filenames = [f for f in os.listdir(system_dir)] + system_models_tuples = [] + + system_filename_pattern = re.compile(system_filename_pattern) + for system_filename in sorted(system_filenames): + match = system_filename_pattern.match(system_filename) + if match: + id = match.groups(0)[0] + model_filenames = [model_filename_pattern.replace("#ID#", id)] + # model_filenames = Rouge155.__get_model_filenames_for_id( + # id, model_dir, model_filename_pattern) + system_models_tuples.append((system_filename, sorted(model_filenames))) + if not system_models_tuples: + raise Exception( + "Did not find any files matching the pattern {} " + "in the system summaries directory {}.".format(system_filename_pattern.pattern, system_dir) + ) + + with codecs.open(config_file_path, "w", encoding="utf-8") as f: + f.write('') + for task_id, (system_filename, model_filenames) in enumerate(system_models_tuples, start=1): + eval_string = Rouge155.__get_eval_string( + task_id, system_id, system_dir, system_filename, model_dir, model_filenames + ) + f.write(eval_string) + f.write("") + + def write_config(self, config_file_path=None, system_id=None): + """ + Write the ROUGE configuration file, which is basically a list + of system summary files and their matching model summary files. + + This is a non-static version of write_config_file_static(). + + config_file_path: Path of the configuration file. + system_id: Optional system ID string which will + appear in the ROUGE output. + + """ + if not system_id: + system_id = 1 + if (not config_file_path) or (not self._config_dir): + self._config_dir = mkdtemp(dir=self.temp_dir) + config_filename = "rouge_conf.xml" + else: + config_dir, config_filename = os.path.split(config_file_path) + verify_dir(config_dir, "configuration file") + self._config_file = os.path.join(self._config_dir, config_filename) + Rouge155.write_config_static( + self._system_dir, + self._system_filename_pattern, + self._model_dir, + self._model_filename_pattern, + self._config_file, + system_id, + ) + self.log.info("Written ROUGE configuration to {}".format(self._config_file)) + + def evaluate(self, system_id=1, rouge_args=None): + """ + Run ROUGE to evaluate the system summaries in system_dir against + the model summaries in model_dir. The summaries are assumed to + be in the one-sentence-per-line HTML format ROUGE understands. + + system_id: Optional system ID which will be printed in + ROUGE's output. + + Returns: Rouge output as string. + + """ + self.write_config(system_id=system_id) + options = self.__get_options(rouge_args) + command = [self._bin_path] + options + self.log.info("Running ROUGE with command {}".format(" ".join(command))) + rouge_output = check_output(command).decode("UTF-8") + return rouge_output + + def convert_and_evaluate(self, system_id=1, split_sentences=False, rouge_args=None): + """ + Convert plain text summaries to ROUGE format and run ROUGE to + evaluate the system summaries in system_dir against the model + summaries in model_dir. Optionally split texts into sentences + in case they aren't already. + + This is just a convenience method combining + convert_summaries_to_rouge_format() and evaluate(). + + split_sentences: Optional argument specifying if + sentences should be split. + system_id: Optional system ID which will be printed + in ROUGE's output. + + Returns: ROUGE output as string. 
+ + """ + if split_sentences: + self.split_sentences() + self.__write_summaries() + rouge_output = self.evaluate(system_id, rouge_args) + return rouge_output + + def output_to_dict(self, output): + """ + Convert the ROUGE output into python dictionary for further + processing. + + """ + # 0 ROUGE-1 Average_R: 0.02632 (95%-conf.int. 0.02632 - 0.02632) + pattern = re.compile(r"(\d+) (ROUGE-\S+) (Average_\w): (\d.\d+) " r"\(95%-conf.int. (\d.\d+) - (\d.\d+)\)") + results = {} + for line in output.split("\n"): + match = pattern.match(line) + if match: + sys_id, rouge_type, measure, result, conf_begin, conf_end = match.groups() + measure = {"Average_R": "recall", "Average_P": "precision", "Average_F": "f_score"}[measure] + rouge_type = rouge_type.lower().replace("-", "_") + key = "{}_{}".format(rouge_type, measure) + results[key] = float(result) + results["{}_cb".format(key)] = float(conf_begin) + results["{}_ce".format(key)] = float(conf_end) + return results + + ################################################################### + # Private methods + + def __set_rouge_dir(self, home_dir=None): + """ + Verify presence of ROUGE-1.5.5.pl and data folder, and set + those paths. + + """ + if not home_dir: + self._home_dir = self.__get_rouge_home_dir_from_settings() + else: + self._home_dir = home_dir + self.save_home_dir() + self._bin_path = os.path.join(self._home_dir, "ROUGE-1.5.5.pl") + self.data_dir = os.path.join(self._home_dir, "data") + if not os.path.exists(self._bin_path): + raise Exception( + "ROUGE binary not found at {}. Please set the " + "correct path by running pyrouge_set_rouge_path " + "/path/to/rouge/home.".format(self._bin_path) + ) + + def __get_rouge_home_dir_from_settings(self): + config = ConfigParser() + with open(self._settings_file) as f: + if hasattr(config, "read_file"): + config.read_file(f) + else: + # use deprecated python 2.x method + config.readfp(f) + rouge_home_dir = config.get("pyrouge settings", "home_dir") + return rouge_home_dir + + @staticmethod + def __get_eval_string(task_id, system_id, system_dir, system_filename, model_dir, model_filenames): + """ + ROUGE can evaluate several system summaries for a given text + against several model summaries, i.e. there is an m-to-n + relation between system and model summaries. The system + summaries are listed in the tag and the model summaries + in the tag. pyrouge currently only supports one system + summary per text, i.e. it assumes a 1-to-n relation between + system and model summaries. + + """ + peer_elems = '

<P ID="{id}">{name}</P>

'.format(id=system_id, name=system_filename) + + model_elems = [ + '{name}'.format(id=chr(65 + i), name=name) for i, name in enumerate(model_filenames) + ] + + model_elems = "\n\t\t\t".join(model_elems) + eval_string = """ + + {model_root} + {peer_root} + + + + {peer_elems} + + + {model_elems} + + +""".format( + task_id=task_id, model_root=model_dir, model_elems=model_elems, peer_root=system_dir, peer_elems=peer_elems + ) + return eval_string + + def __process_summaries(self, process_func): + """ + Helper method that applies process_func to the files in the + system and model folders and saves the resulting files to new + system and model folders. + + """ + temp_dir = mkdtemp(dir=self.temp_dir) + new_system_dir = os.path.join(temp_dir, "system") + os.mkdir(new_system_dir) + new_model_dir = os.path.join(temp_dir, "model") + os.mkdir(new_model_dir) + self.log.info( + "Processing summaries. Saving system files to {} and " + "model files to {}.".format(new_system_dir, new_model_dir) + ) + process_func(self._system_dir, new_system_dir) + process_func(self._model_dir, new_model_dir) + self._system_dir = new_system_dir + self._model_dir = new_model_dir + + def __write_summaries(self): + self.log.info("Writing summaries.") + self.__process_summaries(self.convert_summaries_to_rouge_format) + + @staticmethod + def __get_model_filenames_for_id(id, model_dir, model_filenames_pattern): + pattern = re.compile(model_filenames_pattern.replace("#ID#", id)) + model_filenames = [f for f in os.listdir(model_dir) if pattern.match(f)] + if not model_filenames: + raise Exception( + "Could not find any model summaries for the system" + " summary with ID {}. Specified model filename pattern was: " + "{}".format(id, model_filenames_pattern) + ) + return model_filenames + + def __get_options(self, rouge_args=None): + """ + Get supplied command line arguments for ROUGE or use default + ones. + + """ + if self.args: + options = self.args.split() + elif rouge_args: + options = rouge_args.split() + else: + options = [ + "-e", + self._data_dir, + "-c", + 95, + # '-2', + # '-1', + # '-U', + "-m", + # '-v', + "-r", + 1000, + "-n", + 2, + # '-w', 1.2, + "-a", + ] + options = list(map(str, options)) + + options = self.__add_config_option(options) + return options + + def __create_dir_property(self, dir_name, docstring): + """ + Generate getter and setter for a directory property. + + """ + property_name = "{}_dir".format(dir_name) + private_name = "_" + property_name + setattr(self, private_name, None) + + def fget(self): + return getattr(self, private_name) + + def fset(self, path): + verify_dir(path, dir_name) + setattr(self, private_name, path) + + p = property(fget=fget, fset=fset, doc=docstring) + setattr(self.__class__, property_name, p) + + def __set_dir_properties(self): + """ + Automatically generate the properties for directories. + + """ + directories = [ + ("home", "The ROUGE home directory."), + ("data", "The path of the ROUGE 'data' directory."), + ("system", "Path of the directory containing system summaries."), + ("model", "Path of the directory containing model summaries."), + ] + for (dirname, docstring) in directories: + self.__create_dir_property(dirname, docstring) + + def __clean_rouge_args(self, rouge_args): + """ + Remove enclosing quotation marks, if any. 
+ + """ + if not rouge_args: + return + quot_mark_pattern = re.compile('"(.+)"') + match = quot_mark_pattern.match(rouge_args) + if match: + cleaned_args = match.group(1) + return cleaned_args + else: + return rouge_args + + def __add_config_option(self, options): + return options + [self._config_file] + + def __get_config_path(self): + if platform.system() == "Windows": + parent_dir = os.getenv("APPDATA") + config_dir_name = "pyrouge" + elif os.name == "posix": + parent_dir = os.path.expanduser("~") + config_dir_name = ".pyrouge" + else: + parent_dir = os.path.dirname(__file__) + config_dir_name = "" + config_dir = os.path.join(parent_dir, config_dir_name) + if not os.path.exists(config_dir): + os.makedirs(config_dir) + return os.path.join(config_dir, "settings.ini") + + +if __name__ == "__main__": + import argparse + from utils.argparsers import rouge_path_parser + + parser = argparse.ArgumentParser(parents=[rouge_path_parser]) + args = parser.parse_args() + + rouge = Rouge155(args.rouge_home) + rouge.save_home_dir() diff --git a/examples/text_summarization/prophetnet/evaluate/cnndm/postprocess_cnn_dm.py b/examples/text_summarization/prophetnet/evaluate/cnndm/postprocess_cnn_dm.py new file mode 100644 index 0000000000000000000000000000000000000000..67b22c98c28983ce5ecc4c5af91df183c2ddd21e --- /dev/null +++ b/examples/text_summarization/prophetnet/evaluate/cnndm/postprocess_cnn_dm.py @@ -0,0 +1,275 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
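+
+# Overview: this script post-processes CNN/DM generation output and reports ROUGE.
+#   1. Each prediction line is split on "[X_SEP]" and passed through fix_tokenization(),
+#      which restores PTB-style tokens (-LRB-/-RRB-, "n't", "--", "U.N.", "3,000", ...).
+#   2. Very short sentences are dropped, near-duplicate sentences are filtered via
+#      get_f1() / remove_duplicate() (--duplicate_rate), and the output can be cut
+#      to --trunc_len tokens.
+#   3. The cleaned hypotheses are scored against --golden with ROUGE-1.5.5 through
+#      bs_pyrouge.Rouge155 (see test_rouge()).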
+ +import argparse +import os +import shutil +import string +import tempfile +import time + +from bs_pyrouge import Rouge155 + +parser = argparse.ArgumentParser() +parser.add_argument("--generated", type=str, help="generated output file.") +parser.add_argument("--golden", type=str, help="Gold output file.") +parser.add_argument( + "--duplicate_rate", + type=float, + default=0.7, + help="If the duplicate rate (compared with history) is large, we can discard the current sentence.", +) +parser.add_argument("--trunc_len", type=int, default=0, help="Truncate line by the maximum length.") +args = parser.parse_args() + +fin = open(args.generated, "r", encoding="utf-8") +fgolden = open(args.golden, "r", encoding="utf-8") +dedup_rate = args.duplicate_rate +trunc_len = args.trunc_len + +_tok_dict = {"(": "-LRB-", ")": "-RRB-", "[": "-LSB-", "]": "-RSB-", "{": "-LCB-", "}": "-RCB-"} + + +def _is_digit(w): + for ch in w: + if not (ch.isdigit() or ch == ","): + return False + return True + + +def fix_tokenization(text): + input_tokens = text.split() + output_tokens = [] + has_left_quote = False + has_left_single_quote = False + + i = 0 + prev_dash = False + while i < len(input_tokens): + tok = input_tokens[i] + flag_prev_dash = False + if tok in _tok_dict.keys(): + output_tokens.append(_tok_dict[tok]) + i += 1 + elif tok == '"': + if has_left_quote: + output_tokens.append("''") + else: + output_tokens.append("``") + has_left_quote = not has_left_quote + i += 1 + elif ( + tok == "'" + and len(output_tokens) > 0 + and output_tokens[-1].endswith("n") + and i < len(input_tokens) - 1 + and input_tokens[i + 1] == "t" + ): + output_tokens[-1] = output_tokens[-1][:-1] + output_tokens.append("n't") + i += 2 + elif tok == "'" and i < len(input_tokens) - 1 and input_tokens[i + 1] in ("s", "d", "ll"): + output_tokens.append("'" + input_tokens[i + 1]) + i += 2 + elif tok == "'": + if has_left_single_quote: + output_tokens.append("'") + else: + output_tokens.append("`") + has_left_single_quote = not has_left_single_quote + i += 1 + elif tok == "." and i < len(input_tokens) - 2 and input_tokens[i + 1] == "." and input_tokens[i + 2] == ".": + output_tokens.append("...") + i += 3 + elif ( + tok == "," + and len(output_tokens) > 0 + and _is_digit(output_tokens[-1]) + and i < len(input_tokens) - 1 + and _is_digit(input_tokens[i + 1]) + ): + # $ 3 , 000 -> $ 3,000 + output_tokens[-1] += "," + input_tokens[i + 1] + i += 2 + elif ( + tok == "." + and len(output_tokens) > 0 + and output_tokens[-1].isdigit() + and i < len(input_tokens) - 1 + and input_tokens[i + 1].isdigit() + ): + # 3 . 03 -> $ 3.03 + output_tokens[-1] += "." + input_tokens[i + 1] + i += 2 + elif ( + tok == "." + and len(output_tokens) > 0 + and len(output_tokens[-1]) == 1 + and output_tokens[-1].isupper() + and i < len(input_tokens) - 2 + and len(input_tokens[i + 1]) == 1 + and input_tokens[i + 1].isupper() + and input_tokens[i + 2] == "." + ): + # U . N . -> U.N. 
+ k = i + 3 + while k + 2 < len(input_tokens): + if len(input_tokens[k + 1]) == 1 and input_tokens[k + 1].isupper() and input_tokens[k + 2] == ".": + k += 2 + else: + break + output_tokens[-1] += "".join(input_tokens[i:k]) + i += 2 + elif tok == "-": + if i < len(input_tokens) - 1 and input_tokens[i + 1] == "-": + output_tokens.append("--") + i += 2 + elif i == len(input_tokens) - 1 or i == 0: + output_tokens.append("-") + i += 1 + elif output_tokens[-1] not in string.punctuation and input_tokens[i + 1][0] not in string.punctuation: + output_tokens[-1] += "-" + i += 1 + flag_prev_dash = True + else: + output_tokens.append("-") + i += 1 + elif prev_dash and len(output_tokens) > 0 and tok[0] not in string.punctuation: + output_tokens[-1] += tok + i += 1 + else: + output_tokens.append(tok) + i += 1 + prev_dash = flag_prev_dash + return " ".join(output_tokens) + + +def remove_duplicate(l_list, duplicate_rate): + tk_list = [l.lower().split() for l in l_list] + r_list = [] + history_set = set() + for i, w_list in enumerate(tk_list): + w_set = set(w_list) + if len(w_set & history_set) / len(w_set) <= duplicate_rate: + r_list.append(l_list[i]) + history_set |= w_set + return r_list + + +def test_rouge(cand, ref): + temp_dir = tempfile.mkdtemp() + candidates = cand + references = ref + assert len(candidates) == len(references) + + cnt = len(candidates) + current_time = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()) + tmp_dir = os.path.join(temp_dir, "rouge-tmp-{}".format(current_time)) + if not os.path.isdir(tmp_dir): + os.mkdir(tmp_dir) + os.mkdir(tmp_dir + "/candidate") + os.mkdir(tmp_dir + "/reference") + try: + for i in range(cnt): + if len(references[i]) < 1: + continue + with open(tmp_dir + "/candidate/cand.{}.txt".format(i), "w", encoding="utf-8") as f: + f.write(candidates[i]) + with open(tmp_dir + "/reference/ref.{}.txt".format(i), "w", encoding="utf-8") as f: + f.write(references[i]) + r = Rouge155(temp_dir=temp_dir) + r.model_dir = tmp_dir + "/reference/" + r.system_dir = tmp_dir + "/candidate/" + r.model_filename_pattern = "ref.#ID#.txt" + r.system_filename_pattern = r"cand.(\d+).txt" + rouge_results = r.convert_and_evaluate() + print(rouge_results) + results_dict = r.output_to_dict(rouge_results) + finally: + if os.path.isdir(tmp_dir): + shutil.rmtree(tmp_dir) + return results_dict + + +def rouge_results_to_str(results_dict): + return ">> ROUGE-F(1/2/l): {:.2f}/{:.2f}/{:.2f}\nROUGE-R(1/2/3/l): {:.2f}/{:.2f}/{:.2f}\n".format( + results_dict["rouge_1_f_score"] * 100, + results_dict["rouge_2_f_score"] * 100, + results_dict["rouge_l_f_score"] * 100, + results_dict["rouge_1_recall"] * 100, + results_dict["rouge_2_recall"] * 100, + results_dict["rouge_l_recall"] * 100, + ) + + +def count_tokens(tokens): + counter = {} + for t in tokens: + if t in counter.keys(): + counter[t] += 1 + else: + counter[t] = 1 + return counter + + +def get_f1(text_a, text_b): + tokens_a = text_a.lower().split() + tokens_b = text_b.lower().split() + if len(tokens_a) == 0 or len(tokens_b) == 0: + return 1 if len(tokens_a) == len(tokens_b) else 0 + set_a = count_tokens(tokens_a) + set_b = count_tokens(tokens_b) + match = 0 + for token in set_a.keys(): + if token in set_b.keys(): + match += min(set_a[token], set_b[token]) + p = match / len(tokens_a) + r = match / len(tokens_b) + return 2.0 * p * r / (p + r + 1e-5) + + +generated_list = [] +for line in fin: + buf = [] + for sentence in line.strip().split("[X_SEP]"): + sentence = fix_tokenization(sentence) + if any(get_f1(sentence, s) > 1.0 for s in buf): + 
continue + s_len = len(sentence.split()) + if s_len <= 4: + continue + buf.append(sentence) + if dedup_rate < 1: + buf = remove_duplicate(buf, dedup_rate) + if trunc_len: + num_left = trunc_len + trunc_list = [] + for bit in buf: + tk_list = bit.split() + n = min(len(tk_list), num_left) + trunc_list.append(" ".join(tk_list[:n])) + num_left -= n + if num_left <= 0: + break + else: + trunc_list = buf + generated_list.append("\n".join(trunc_list)) + +golden_list = [] +for line in fgolden: + line = line.strip().replace(" ", "\n") + golden_list.append(line) + +scores = test_rouge(generated_list, golden_list) +print(rouge_results_to_str(scores)) diff --git a/examples/text_summarization/prophetnet/evaluate/gigaword/bs_pyrouge.py b/examples/text_summarization/prophetnet/evaluate/gigaword/bs_pyrouge.py new file mode 100644 index 0000000000000000000000000000000000000000..df4cbbecbcb26f646ee9d7249b50dd06a6736132 --- /dev/null +++ b/examples/text_summarization/prophetnet/evaluate/gigaword/bs_pyrouge.py @@ -0,0 +1,649 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function, unicode_literals, division + +import codecs +import logging +import os +import platform +import re +from functools import partial +from subprocess import check_output +from tempfile import mkdtemp + +try: + from configparser import ConfigParser +except ImportError: + from ConfigParser import ConfigParser + +from pyrouge.utils import log +from pyrouge.utils.file_utils import verify_dir + +REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}", "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'} + + +def clean(x): + return re.sub(r"-lrb-|-rrb-|-lcb-|-rcb-|-lsb-|-rsb-|``|''", lambda m: REMAP.get(m.group()), x) + + +class DirectoryProcessor: + @staticmethod + def process(input_dir, output_dir, function): + """ + Apply function to all files in input_dir and save the resulting output + files in output_dir. + + """ + if not os.path.exists(output_dir): + os.makedirs(output_dir) + logger = log.get_global_console_logger() + logger.info("Processing files in {}.".format(input_dir)) + input_file_names = os.listdir(input_dir) + for input_file_name in input_file_names: + input_file = os.path.join(input_dir, input_file_name) + with codecs.open(input_file, "r", encoding="UTF-8") as f: + input_string = f.read() + output_string = function(input_string) + output_file = os.path.join(output_dir, input_file_name) + with codecs.open(output_file, "w", encoding="UTF-8") as f: + f.write(clean(output_string.lower())) + logger.info("Saved processed files to {}.".format(output_dir)) + + +class Rouge155(object): + """ + This is a wrapper for the ROUGE 1.5.5 summary evaluation package. + This class is designed to simplify the evaluation process by: + + 1) Converting summaries into a format ROUGE understands. + 2) Generating the ROUGE configuration file automatically based + on filename patterns. 
+ + This class can be used within Python like this: + + rouge = Rouge155() + rouge.system_dir = 'test/systems' + rouge.model_dir = 'test/models' + + # The system filename pattern should contain one group that + # matches the document ID. + rouge.system_filename_pattern = 'SL.P.10.R.11.SL062003-(\d+).html' + + # The model filename pattern has '#ID#' as a placeholder for the + # document ID. If there are multiple model summaries, pyrouge + # will use the provided regex to automatically match them with + # the corresponding system summary. Here, [A-Z] matches + # multiple model summaries for a given #ID#. + rouge.model_filename_pattern = 'SL.P.10.R.[A-Z].SL062003-#ID#.html' + + rouge_output = rouge.evaluate() + print(rouge_output) + output_dict = rouge.output_to_dict(rouge_output) + print(output_dict) + -> {'rouge_1_f_score': 0.95652, + 'rouge_1_f_score_cb': 0.95652, + 'rouge_1_f_score_ce': 0.95652, + 'rouge_1_precision': 0.95652, + [...] + + + To evaluate multiple systems: + + rouge = Rouge155() + rouge.system_dir = '/PATH/TO/systems' + rouge.model_dir = 'PATH/TO/models' + for system_id in ['id1', 'id2', 'id3']: + rouge.system_filename_pattern = \ + 'SL.P/.10.R.{}.SL062003-(\d+).html'.format(system_id) + rouge.model_filename_pattern = \ + 'SL.P.10.R.[A-Z].SL062003-#ID#.html' + rouge_output = rouge.evaluate(system_id) + print(rouge_output) + + """ + + def __init__(self, rouge_dir=None, rouge_args=None, temp_dir=None): + """ + Create a Rouge155 object. + + rouge_dir: Directory containing Rouge-1.5.5.pl + rouge_args: Arguments to pass through to ROUGE if you + don't want to use the default pyrouge + arguments. + + """ + self.temp_dir = temp_dir + self.log = log.get_global_console_logger() + self.log.setLevel(logging.WARNING) + self.__set_dir_properties() + self._config_file = None + self._settings_file = self.__get_config_path() + self.__set_rouge_dir(rouge_dir) + self.args = self.__clean_rouge_args(rouge_args) + self._system_filename_pattern = None + self._model_filename_pattern = None + + def save_home_dir(self): + config = ConfigParser() + section = "pyrouge settings" + config.add_section(section) + config.set(section, "home_dir", self._home_dir) + with open(self._settings_file, "w") as f: + config.write(f) + self.log.info("Set ROUGE home directory to {}.".format(self._home_dir)) + + @property + def settings_file(self): + """ + Path of the settings file, which stores the ROUGE home dir. + + """ + return self._settings_file + + @property + def bin_path(self): + """ + The full path of the ROUGE binary (although it's technically + a script), i.e. rouge_home_dir/ROUGE-1.5.5.pl + + """ + if self._bin_path is None: + raise Exception( + "ROUGE path not set. Please set the ROUGE home directory " + "and ensure that ROUGE-1.5.5.pl exists in it." + ) + return self._bin_path + + @property + def system_filename_pattern(self): + """ + The regular expression pattern for matching system summary + filenames. The regex string. + + E.g. "SL.P.10.R.11.SL062003-(\d+).html" will match the system + filenames in the SPL2003/system folder of the ROUGE SPL example + in the "sample-test" folder. + + Currently, there is no support for multiple systems. + + """ + return self._system_filename_pattern + + @system_filename_pattern.setter + def system_filename_pattern(self, pattern): + self._system_filename_pattern = pattern + + @property + def model_filename_pattern(self): + """ + The regular expression pattern for matching model summary + filenames. 
The pattern needs to contain the string "#ID#", + which is a placeholder for the document ID. + + E.g. "SL.P.10.R.[A-Z].SL062003-#ID#.html" will match the model + filenames in the SPL2003/system folder of the ROUGE SPL + example in the "sample-test" folder. + + "#ID#" is a placeholder for the document ID which has been + matched by the "(\d+)" part of the system filename pattern. + The different model summaries for a given document ID are + matched by the "[A-Z]" part. + + """ + return self._model_filename_pattern + + @model_filename_pattern.setter + def model_filename_pattern(self, pattern): + self._model_filename_pattern = pattern + + @property + def config_file(self): + return self._config_file + + @config_file.setter + def config_file(self, path): + config_dir, _ = os.path.split(path) + verify_dir(config_dir, "configuration file") + self._config_file = path + + def split_sentences(self): + """ + ROUGE requires texts split into sentences. In case the texts + are not already split, this method can be used. + + """ + from pyrouge.utils.sentence_splitter import PunktSentenceSplitter + + self.log.info("Splitting sentences.") + ss = PunktSentenceSplitter() + + def sent_split_to_string(s): + return "\n".join(ss.split(s)) + + process_func = partial(DirectoryProcessor.process, function=sent_split_to_string) + self.__process_summaries(process_func) + + @staticmethod + def convert_summaries_to_rouge_format(input_dir, output_dir): + """ + Convert all files in input_dir into a format ROUGE understands + and saves the files to output_dir. The input files are assumed + to be plain text with one sentence per line. + + input_dir: Path of directory containing the input files. + output_dir: Path of directory in which the converted files + will be saved. + + """ + DirectoryProcessor.process(input_dir, output_dir, Rouge155.convert_text_to_rouge_format) + + @staticmethod + def convert_text_to_rouge_format(text, title="dummy title"): + """ + Convert a text to a format ROUGE understands. The text is + assumed to contain one sentence per line. + + text: The text to convert, containg one sentence per line. + title: Optional title for the text. The title will appear + in the converted file, but doesn't seem to have + any other relevance. + + Returns: The converted text as string. + + """ + sentences = text.split("\n") + sent_elems = [ + '[{i}] ' "{text}".format(i=i, text=sent) + for i, sent in enumerate(sentences, start=1) + ] + html = """ + +{title} + + +{elems} + +""".format( + title=title, elems="\n".join(sent_elems) + ) + + return html + + @staticmethod + def write_config_static( + system_dir, system_filename_pattern, model_dir, model_filename_pattern, config_file_path, system_id=None + ): + """ + Write the ROUGE configuration file, which is basically a list + of system summary files and their corresponding model summary + files. + + pyrouge uses regular expressions to automatically find the + matching model summary files for a given system summary file + (cf. docstrings for system_filename_pattern and + model_filename_pattern). + + system_dir: Path of directory containing + system summaries. + system_filename_pattern: Regex string for matching + system summary filenames. + model_dir: Path of directory containing + model summaries. + model_filename_pattern: Regex string for matching model + summary filenames. + config_file_path: Path of the configuration file. + system_id: Optional system ID string which + will appear in the ROUGE output. 
+ + """ + system_filenames = [f for f in os.listdir(system_dir)] + system_models_tuples = [] + + system_filename_pattern = re.compile(system_filename_pattern) + for system_filename in sorted(system_filenames): + match = system_filename_pattern.match(system_filename) + if match: + id = match.groups(0)[0] + model_filenames = [model_filename_pattern.replace("#ID#", id)] + # model_filenames = Rouge155.__get_model_filenames_for_id( + # id, model_dir, model_filename_pattern) + system_models_tuples.append((system_filename, sorted(model_filenames))) + if not system_models_tuples: + raise Exception( + "Did not find any files matching the pattern {} " + "in the system summaries directory {}.".format(system_filename_pattern.pattern, system_dir) + ) + + with codecs.open(config_file_path, "w", encoding="utf-8") as f: + f.write('') + for task_id, (system_filename, model_filenames) in enumerate(system_models_tuples, start=1): + eval_string = Rouge155.__get_eval_string( + task_id, system_id, system_dir, system_filename, model_dir, model_filenames + ) + f.write(eval_string) + f.write("") + + def write_config(self, config_file_path=None, system_id=None): + """ + Write the ROUGE configuration file, which is basically a list + of system summary files and their matching model summary files. + + This is a non-static version of write_config_file_static(). + + config_file_path: Path of the configuration file. + system_id: Optional system ID string which will + appear in the ROUGE output. + + """ + if not system_id: + system_id = 1 + if (not config_file_path) or (not self._config_dir): + self._config_dir = mkdtemp(dir=self.temp_dir) + config_filename = "rouge_conf.xml" + else: + config_dir, config_filename = os.path.split(config_file_path) + verify_dir(config_dir, "configuration file") + self._config_file = os.path.join(self._config_dir, config_filename) + Rouge155.write_config_static( + self._system_dir, + self._system_filename_pattern, + self._model_dir, + self._model_filename_pattern, + self._config_file, + system_id, + ) + self.log.info("Written ROUGE configuration to {}".format(self._config_file)) + + def evaluate(self, system_id=1, rouge_args=None): + """ + Run ROUGE to evaluate the system summaries in system_dir against + the model summaries in model_dir. The summaries are assumed to + be in the one-sentence-per-line HTML format ROUGE understands. + + system_id: Optional system ID which will be printed in + ROUGE's output. + + Returns: Rouge output as string. + + """ + self.write_config(system_id=system_id) + options = self.__get_options(rouge_args) + command = [self._bin_path] + options + self.log.info("Running ROUGE with command {}".format(" ".join(command))) + rouge_output = check_output(command).decode("UTF-8") + return rouge_output + + def convert_and_evaluate(self, system_id=1, split_sentences=False, rouge_args=None): + """ + Convert plain text summaries to ROUGE format and run ROUGE to + evaluate the system summaries in system_dir against the model + summaries in model_dir. Optionally split texts into sentences + in case they aren't already. + + This is just a convenience method combining + convert_summaries_to_rouge_format() and evaluate(). + + split_sentences: Optional argument specifying if + sentences should be split. + system_id: Optional system ID which will be printed + in ROUGE's output. + + Returns: ROUGE output as string. 
+ + """ + if split_sentences: + self.split_sentences() + self.__write_summaries() + rouge_output = self.evaluate(system_id, rouge_args) + return rouge_output + + def output_to_dict(self, output): + """ + Convert the ROUGE output into python dictionary for further + processing. + + """ + # 0 ROUGE-1 Average_R: 0.02632 (95%-conf.int. 0.02632 - 0.02632) + pattern = re.compile(r"(\d+) (ROUGE-\S+) (Average_\w): (\d.\d+) " r"\(95%-conf.int. (\d.\d+) - (\d.\d+)\)") + results = {} + for line in output.split("\n"): + match = pattern.match(line) + if match: + sys_id, rouge_type, measure, result, conf_begin, conf_end = match.groups() + measure = {"Average_R": "recall", "Average_P": "precision", "Average_F": "f_score"}[measure] + rouge_type = rouge_type.lower().replace("-", "_") + key = "{}_{}".format(rouge_type, measure) + results[key] = float(result) + results["{}_cb".format(key)] = float(conf_begin) + results["{}_ce".format(key)] = float(conf_end) + return results + + ################################################################### + # Private methods + + def __set_rouge_dir(self, home_dir=None): + """ + Verify presence of ROUGE-1.5.5.pl and data folder, and set + those paths. + + """ + if not home_dir: + self._home_dir = self.__get_rouge_home_dir_from_settings() + else: + self._home_dir = home_dir + self.save_home_dir() + self._bin_path = os.path.join(self._home_dir, "ROUGE-1.5.5.pl") + self.data_dir = os.path.join(self._home_dir, "data") + if not os.path.exists(self._bin_path): + raise Exception( + "ROUGE binary not found at {}. Please set the " + "correct path by running pyrouge_set_rouge_path " + "/path/to/rouge/home.".format(self._bin_path) + ) + + def __get_rouge_home_dir_from_settings(self): + config = ConfigParser() + with open(self._settings_file) as f: + if hasattr(config, "read_file"): + config.read_file(f) + else: + # use deprecated python 2.x method + config.readfp(f) + rouge_home_dir = config.get("pyrouge settings", "home_dir") + return rouge_home_dir + + @staticmethod + def __get_eval_string(task_id, system_id, system_dir, system_filename, model_dir, model_filenames): + """ + ROUGE can evaluate several system summaries for a given text + against several model summaries, i.e. there is an m-to-n + relation between system and model summaries. The system + summaries are listed in the tag and the model summaries + in the tag. pyrouge currently only supports one system + summary per text, i.e. it assumes a 1-to-n relation between + system and model summaries. + + """ + peer_elems = '

<P ID="{id}">{name}</P>

'.format(id=system_id, name=system_filename) + + model_elems = [ + '{name}'.format(id=chr(65 + i), name=name) for i, name in enumerate(model_filenames) + ] + + model_elems = "\n\t\t\t".join(model_elems) + eval_string = """ + + {model_root} + {peer_root} + + + + {peer_elems} + + + {model_elems} + + +""".format( + task_id=task_id, model_root=model_dir, model_elems=model_elems, peer_root=system_dir, peer_elems=peer_elems + ) + return eval_string + + def __process_summaries(self, process_func): + """ + Helper method that applies process_func to the files in the + system and model folders and saves the resulting files to new + system and model folders. + + """ + temp_dir = mkdtemp(dir=self.temp_dir) + new_system_dir = os.path.join(temp_dir, "system") + os.mkdir(new_system_dir) + new_model_dir = os.path.join(temp_dir, "model") + os.mkdir(new_model_dir) + self.log.info( + "Processing summaries. Saving system files to {} and " + "model files to {}.".format(new_system_dir, new_model_dir) + ) + process_func(self._system_dir, new_system_dir) + process_func(self._model_dir, new_model_dir) + self._system_dir = new_system_dir + self._model_dir = new_model_dir + + def __write_summaries(self): + self.log.info("Writing summaries.") + self.__process_summaries(self.convert_summaries_to_rouge_format) + + @staticmethod + def __get_model_filenames_for_id(id, model_dir, model_filenames_pattern): + pattern = re.compile(model_filenames_pattern.replace("#ID#", id)) + model_filenames = [f for f in os.listdir(model_dir) if pattern.match(f)] + if not model_filenames: + raise Exception( + "Could not find any model summaries for the system" + " summary with ID {}. Specified model filename pattern was: " + "{}".format(id, model_filenames_pattern) + ) + return model_filenames + + def __get_options(self, rouge_args=None): + """ + Get supplied command line arguments for ROUGE or use default + ones. + + """ + if self.args: + options = self.args.split() + elif rouge_args: + options = rouge_args.split() + else: + options = [ + "-e", + self._data_dir, + "-c", + 95, + # '-2', + # '-1', + # '-U', + "-m", + # '-v', + "-r", + 1000, + "-n", + 2, + # '-w', 1.2, + "-a", + ] + options = list(map(str, options)) + + options = self.__add_config_option(options) + return options + + def __create_dir_property(self, dir_name, docstring): + """ + Generate getter and setter for a directory property. + + """ + property_name = "{}_dir".format(dir_name) + private_name = "_" + property_name + setattr(self, private_name, None) + + def fget(self): + return getattr(self, private_name) + + def fset(self, path): + verify_dir(path, dir_name) + setattr(self, private_name, path) + + p = property(fget=fget, fset=fset, doc=docstring) + setattr(self.__class__, property_name, p) + + def __set_dir_properties(self): + """ + Automatically generate the properties for directories. + + """ + directories = [ + ("home", "The ROUGE home directory."), + ("data", "The path of the ROUGE 'data' directory."), + ("system", "Path of the directory containing system summaries."), + ("model", "Path of the directory containing model summaries."), + ] + for (dirname, docstring) in directories: + self.__create_dir_property(dirname, docstring) + + def __clean_rouge_args(self, rouge_args): + """ + Remove enclosing quotation marks, if any. 
+ + """ + if not rouge_args: + return + quot_mark_pattern = re.compile('"(.+)"') + match = quot_mark_pattern.match(rouge_args) + if match: + cleaned_args = match.group(1) + return cleaned_args + else: + return rouge_args + + def __add_config_option(self, options): + return options + [self._config_file] + + def __get_config_path(self): + if platform.system() == "Windows": + parent_dir = os.getenv("APPDATA") + config_dir_name = "pyrouge" + elif os.name == "posix": + parent_dir = os.path.expanduser("~") + config_dir_name = ".pyrouge" + else: + parent_dir = os.path.dirname(__file__) + config_dir_name = "" + config_dir = os.path.join(parent_dir, config_dir_name) + if not os.path.exists(config_dir): + os.makedirs(config_dir) + return os.path.join(config_dir, "settings.ini") + + +if __name__ == "__main__": + import argparse + from utils.argparsers import rouge_path_parser + + parser = argparse.ArgumentParser(parents=[rouge_path_parser]) + args = parser.parse_args() + + rouge = Rouge155(args.rouge_home) + rouge.save_home_dir() diff --git a/examples/text_summarization/prophetnet/evaluate/gigaword/eval.py b/examples/text_summarization/prophetnet/evaluate/gigaword/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..51ed27e6804af685a6c54704a1c004e25ed518fa --- /dev/null +++ b/examples/text_summarization/prophetnet/evaluate/gigaword/eval.py @@ -0,0 +1,368 @@ +"""BERT finetuning runner.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import argparse +import glob +import json +import logging +import os +import shutil +import string +import tempfile +import time +from multiprocessing import Pool, cpu_count +from pathlib import Path + +# pip install py-rouge +import rouge + +# from pytorch_pretrained_bert.tokenization import BertTokenizer +# pip install pyrouge +from bs_pyrouge import Rouge155 + +logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S", level=logging.INFO +) +logger = logging.getLogger(__name__) +logger.setLevel(logging.WARNING) + +parser = argparse.ArgumentParser() + +# Required parameters +parser.add_argument("--gold", type=str, help="Gold output file.") +parser.add_argument("--pred", type=str, help="Input prediction file.") +parser.add_argument("--split", type=str, default="", help="Data split (train/dev/test).") +parser.add_argument("--save_best", action="store_true", help="Save best epoch.") +parser.add_argument("--only_eval_best", action="store_true", help="Only evaluate best epoch.") +parser.add_argument("--trunc_len", type=int, default=0, help="Truncate line by the maximum length.") +default_process_count = max(1, cpu_count() - 1) +parser.add_argument( + "--processes", type=int, default=default_process_count, help="Number of processes to use (default %(default)s)" +) +parser.add_argument("--perl", action="store_true", help="Using the perl script.") +parser.add_argument("--lazy_eval", action="store_true", help="Skip evaluation if the .rouge file exists.") +args = parser.parse_args() + +SPECIAL_TOKEN = ["[UNK]", "[PAD]", "[CLS]", "[MASK]"] +evaluator = rouge.Rouge(metrics=["rouge-n", "rouge-l"], max_n=2, limit_length=False, apply_avg=True, weight_factor=1.2) + + +def test_rouge(cand, ref): + temp_dir = tempfile.mkdtemp() + candidates = cand + references = ref + assert len(candidates) == len(references) + + cnt = len(candidates) + current_time = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()) + tmp_dir = os.path.join(temp_dir, 
"rouge-tmp-{}".format(current_time)) + if not os.path.isdir(tmp_dir): + os.mkdir(tmp_dir) + os.mkdir(tmp_dir + "/candidate") + os.mkdir(tmp_dir + "/reference") + try: + for i in range(cnt): + if len(references[i]) < 1: + continue + with open(tmp_dir + "/candidate/cand.{}.txt".format(i), "w", encoding="utf-8") as f: + f.write(candidates[i]) + with open(tmp_dir + "/reference/ref.{}.txt".format(i), "w", encoding="utf-8") as f: + f.write(references[i]) + r = Rouge155(temp_dir=temp_dir) + r.model_dir = tmp_dir + "/reference/" + r.system_dir = tmp_dir + "/candidate/" + r.model_filename_pattern = "ref.#ID#.txt" + r.system_filename_pattern = r"cand.(\d+).txt" + rouge_results = r.convert_and_evaluate() + print(rouge_results) + results_dict = r.output_to_dict(rouge_results) + finally: + if os.path.isdir(tmp_dir): + shutil.rmtree(tmp_dir) + return results_dict + + +def rouge_results_to_str(results_dict): + return ">> ROUGE-F(1/2/l): {:.2f}/{:.2f}/{:.2f}\nROUGE-R(1/2/3/l): {:.2f}/{:.2f}/{:.2f}\n".format( + results_dict["rouge_1_f_score"] * 100, + results_dict["rouge_2_f_score"] * 100, + results_dict["rouge_l_f_score"] * 100, + results_dict["rouge_1_recall"] * 100, + results_dict["rouge_2_recall"] * 100, + results_dict["rouge_l_recall"] * 100, + ) + + +def count_tokens(tokens): + counter = {} + for t in tokens: + if t in counter.keys(): + counter[t] += 1 + else: + counter[t] = 1 + return counter + + +def get_f1(text_a, text_b): + tokens_a = text_a.lower().split() + tokens_b = text_b.lower().split() + if len(tokens_a) == 0 or len(tokens_b) == 0: + return 1 if len(tokens_a) == len(tokens_b) else 0 + set_a = count_tokens(tokens_a) + set_b = count_tokens(tokens_b) + match = 0 + for token in set_a.keys(): + if token in set_b.keys(): + match += min(set_a[token], set_b[token]) + p = match / len(tokens_a) + r = match / len(tokens_b) + return 2.0 * p * r / (p + r + 1e-5) + + +_tok_dict = { + "(": "-lrb-", + ")": "-rrb-", + "[": "-lsb-", + "]": "-rsb-", + "{": "-lcb-", + "}": "-rcb-", + "[UNK]": "UNK", + "&": "&", + "<": "<", + ">": ">", +} + + +def _is_digit(w): + for ch in w: + if not (ch.isdigit() or ch == ","): + return False + return True + + +def fix_tokenization(text): + input_tokens = text.split() + output_tokens = [] + has_left_quote = False + has_left_single_quote = False + + i = 0 + prev_dash = False + while i < len(input_tokens): + tok = input_tokens[i] + flag_prev_dash = False + if tok in _tok_dict.keys(): + output_tokens.append(_tok_dict[tok]) + i += 1 + elif tok == '"': + if has_left_quote: + output_tokens.append("''") + else: + output_tokens.append("``") + has_left_quote = not has_left_quote + i += 1 + elif ( + tok == "'" + and len(output_tokens) > 0 + and output_tokens[-1].endswith("n") + and i < len(input_tokens) - 1 + and input_tokens[i + 1] == "t" + ): + output_tokens[-1] = output_tokens[-1][:-1] + output_tokens.append("n't") + i += 2 + elif tok == "'" and i < len(input_tokens) - 1 and input_tokens[i + 1] in ("s", "d", "ll"): + output_tokens.append("'" + input_tokens[i + 1]) + i += 2 + elif tok == "'": + if has_left_single_quote: + output_tokens.append("'") + else: + output_tokens.append("`") + has_left_single_quote = not has_left_single_quote + i += 1 + elif tok == "." and i < len(input_tokens) - 2 and input_tokens[i + 1] == "." 
and input_tokens[i + 2] == ".": + output_tokens.append("...") + i += 3 + elif ( + tok == "," + and len(output_tokens) > 0 + and _is_digit(output_tokens[-1]) + and i < len(input_tokens) - 1 + and _is_digit(input_tokens[i + 1]) + ): + # $ 3 , 000 -> $ 3,000 + output_tokens[-1] += "," + input_tokens[i + 1] + i += 2 + elif ( + tok == "." + and len(output_tokens) > 0 + and output_tokens[-1].isdigit() + and i < len(input_tokens) - 1 + and input_tokens[i + 1].isdigit() + ): + # 3 . 03 -> $ 3.03 + output_tokens[-1] += "." + input_tokens[i + 1] + i += 2 + elif ( + tok == "." + and len(output_tokens) > 0 + and len(output_tokens[-1]) == 1 + and output_tokens[-1].isupper() + and i < len(input_tokens) - 2 + and len(input_tokens[i + 1]) == 1 + and input_tokens[i + 1].isupper() + and input_tokens[i + 2] == "." + ): + # U . N . -> U.N. + k = i + 3 + while k + 2 < len(input_tokens): + if len(input_tokens[k + 1]) == 1 and input_tokens[k + 1].isupper() and input_tokens[k + 2] == ".": + k += 2 + else: + break + output_tokens[-1] += "".join(input_tokens[i:k]) + i += 2 + elif tok == "-": + if i < len(input_tokens) - 1 and input_tokens[i + 1] == "-": + output_tokens.append("--") + i += 2 + elif i == len(input_tokens) - 1 or i == 0: + output_tokens.append("-") + i += 1 + elif output_tokens[-1] not in string.punctuation and input_tokens[i + 1][0] not in string.punctuation: + output_tokens[-1] += "-" + i += 1 + flag_prev_dash = True + else: + output_tokens.append("-") + i += 1 + elif prev_dash and len(output_tokens) > 0 and tok[0] not in string.punctuation: + output_tokens[-1] += tok + i += 1 + else: + output_tokens.append(tok) + i += 1 + prev_dash = flag_prev_dash + return " ".join(output_tokens) + + +def process_eval(eval_fn): + gold_list = [] + with open(args.gold, "r", encoding="utf-8") as f_in: + for l in f_in: + line = l.strip() + gold_list.append(line) + + pred_list = [] + with open(eval_fn, "r", encoding="utf-8") as f_in: + for l in f_in: + buf = [] + sentence = fix_tokenization(l.strip()).replace("1", "#") + buf.append(sentence) + if args.trunc_len: + num_left = args.trunc_len + trunc_list = [] + for bit in buf: + tk_list = bit.split() + n = min(len(tk_list), num_left) + trunc_list.append(" ".join(tk_list[:n])) + num_left -= n + if num_left <= 0: + break + else: + trunc_list = buf + line = "\n".join(trunc_list) + pred_list.append(line) + with open(eval_fn + ".post", "w", encoding="utf-8") as f_out: + for l in pred_list: + f_out.write(l.strip()) + f_out.write("\n") + # rouge scores + if len(pred_list) < len(gold_list): + # evaluate subset + gold_list = gold_list[: len(pred_list)] + assert len(pred_list) == len(gold_list) + if args.perl: + scores = test_rouge(pred_list, gold_list) + else: + scores = evaluator.get_scores(pred_list, [[it] for it in gold_list]) + return eval_fn, scores + + +def main(): + if args.perl: + eval_fn_list = list(glob.glob(args.pred)) + else: + eval_fn_list = [ + eval_fn for eval_fn in glob.glob(args.pred) if not (args.lazy_eval and Path(eval_fn + ".rouge").exists()) + ] + eval_fn_list = list(filter(lambda fn: not (fn.endswith(".post") or fn.endswith(".rouge")), eval_fn_list)) + + if args.only_eval_best: + best_epoch_dict = {} + for dir_path in set(Path(fn).parent for fn in eval_fn_list): + fn_save = os.path.join(dir_path, "save_best.dev") + if Path(fn_save).exists(): + with open(fn_save, "r") as f_in: + __, o_name, __ = f_in.read().strip().split("\n") + epoch = o_name.split(".")[1] + best_epoch_dict[dir_path] = epoch + new_eval_fn_list = [] + for fn in eval_fn_list: + dir_path = 
Path(fn).parent + if dir_path in best_epoch_dict: + if Path(fn).name.split(".")[1] == best_epoch_dict[dir_path]: + new_eval_fn_list.append(fn) + eval_fn_list = new_eval_fn_list + + logger.info("***** Evaluation: %s *****", ",".join(eval_fn_list)) + num_pool = max(1, min(args.processes, len(eval_fn_list))) + logger.info(args.processes, len(eval_fn_list), num_pool) + p = Pool(num_pool) + r_list = p.imap_unordered(process_eval, eval_fn_list) + r_list = sorted([(fn, scores) for fn, scores in r_list], key=lambda x: x[0]) + rg2_dict = {} + for fn, scores in r_list: + logger.info(fn) + if args.perl: + print(rouge_results_to_str(scores)) + else: + rg2_dict[fn] = scores["rouge-2"]["f"] + print( + "ROUGE-1: {}\tROUGE-2: {}\tROUGE-L: {}\n".format( + scores["rouge-1"]["f"], scores["rouge-2"]["f"], scores["rouge-l"]["f"] + ) + ) + with open(fn + ".rouge", "w") as f_out: + f_out.write(json.dumps({"rg1": scores["rouge-1"]["f"], "rg2": scores["rouge-2"]["f"]})) + p.close() + p.join() + + if args.save_best: + # find best results + group_dict = {} + for k, v in rg2_dict.items(): + d_name, o_name = Path(k).parent, Path(k).name + if (d_name not in group_dict) or (v > group_dict[d_name][1]): + group_dict[d_name] = (o_name, v) + # compare and save the best result + for k, v in group_dict.items(): + fn = os.path.join(k, "save_best." + args.split) + o_name_s, rst_s = v + should_save = True + if Path(fn).exists(): + with open(fn, "r") as f_in: + rst_f = float(f_in.read().strip().split("\n")[-1]) + if rst_s <= rst_f: + should_save = False + if should_save: + with open(fn, "w") as f_out: + f_out.write("{0}\n{1}\n{2}\n".format(k, o_name_s, rst_s)) + + +if __name__ == "__main__": + main() diff --git a/examples/text_summarization/prophetnet/generate.py b/examples/text_summarization/prophetnet/generate.py new file mode 100644 index 0000000000000000000000000000000000000000..54f51e851809a977bd143dd054d0d13aba84af2d --- /dev/null +++ b/examples/text_summarization/prophetnet/generate.py @@ -0,0 +1,275 @@ +import argparse +import os +import random +import time +from pprint import pprint + +import numpy as np +import paddle +from paddle.io import BatchSampler, DataLoader +from rouge_score import rouge_scorer, scoring +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers.prophetnet.modeling import ProphetNetForConditionalGeneration +from paddlenlp.transformers.prophetnet.tokenizer import ProphetNetTokenizer + +summarization_name_mapping = {"cnn_dailymail": ("article", "highlights")} + + +def parse_args(): + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--model_name_or_path", + default="prophetnet-large-uncased", + type=str, + required=True, + help="Path to pre-trained model. ", + ) + parser.add_argument( + "--dataset", default="gigaword", choices=["cnndm", "gigaword"], type=str, help="Path to tokenizer vocab file. " + ) + parser.add_argument( + "--output_path", type=str, default="generate.txt", help="The file path where the infer result will be saved." + ) + parser.add_argument( + "--max_source_length", + default=1024, + type=int, + help="The maximum total input sequence length after " + "tokenization.Sequences longer than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--min_target_length", + default=45, + type=int, + help="The minimum total sequence length for target text when generating. 
", + ) + parser.add_argument( + "--max_target_length", + default=110, + type=int, + help="The maximum total sequence length for target text after " + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "during ``evaluate`` and ``predict``.", + ) + parser.add_argument( + "--decode_strategy", default="beam_search", type=str, help="The decode strategy in generation." + ) + parser.add_argument( + "--top_k", + default=2, + type=int, + help="The number of highest probability vocabulary tokens to keep for top-k sampling.", + ) + parser.add_argument("--top_p", default=1.0, type=float, help="The cumulative probability for top-p sampling.") + parser.add_argument("--num_beams", default=5, type=int, help="The number of beams for beam search.") + parser.add_argument( + "--length_penalty", + default=1.2, + type=float, + help="The exponential penalty to the sequence length for beam search.", + ) + parser.add_argument( + "--early_stopping", + default=False, + type=eval, + help="Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.", + ) + parser.add_argument("--diversity_rate", default=0.0, type=float, help="The diversity of beam search. ") + parser.add_argument( + "--num_beam_groups", + default=1, + type=int, + help="Number of groups to divide `num_beams` into in order to use DIVERSE BEAM SEARCH.", + ) + parser.add_argument( + "--repetition_penalty", + default=1.0, + type=float, + help="Number of groups to divide `num_beams` into in order to use DIVERSE BEAM SEARCH.", + ) + parser.add_argument("--batch_size", default=4, type=int, help="Batch size per GPU/CPU for testing or evaluation.") + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", + ) + parser.add_argument( + "--ignore_pad_token_for_loss", + default=True, + type=bool, + help="Whether to ignore the tokens corresponding to " "padded labels in the loss computation or not.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def compute_metrics(preds, labels, tokenizer, ignore_pad_token_for_loss=True, compute_rouge_=True): + def compute_rouge(predictions, references, rouge_types=None, use_stemmer=True): + if rouge_types is None: + rouge_types = ["rouge1", "rouge2", "rougeLsum"] + + scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=use_stemmer) + aggregator = scoring.BootstrapAggregator() + + for ref, pred in zip(references, predictions): + score = scorer.score(ref, pred) + aggregator.add_scores(score) + result = aggregator.aggregate() + result = {key: round(value.mid.fmeasure * 100, 4) for key, value in result.items()} + return result + + def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): + """ + Post-process the decoded sequence. 
+ """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)] + return seq + + if ignore_pad_token_for_loss: + labels = np.asarray(labels) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + decoded_preds, decoded_labels = [], [] + for pred, label in zip(preds, labels): + pred_id = post_process_seq(pred, tokenizer.bos_token_id, tokenizer.eos_token_id) + label_id = post_process_seq(label, tokenizer.bos_token_id, tokenizer.eos_token_id) + decoded_preds.append(tokenizer.convert_ids_to_string(pred_id)) + decoded_labels.append(tokenizer.convert_ids_to_string(label_id)) + + if compute_rouge_: + rouge_result = compute_rouge(decoded_preds, decoded_labels) + return rouge_result, decoded_preds + else: + return decoded_preds, decoded_labels + + +def read(data_path): + data_path_src = data_path[0] + data_path_tgt = data_path[1] + with open(data_path_src, "r", encoding="utf-8") as f_d_s: + src_lines_length = len(f_d_s.readlines()) + with open(data_path_tgt, "r", encoding="utf-8") as f_d_t: + tgt_lines_length = len(f_d_t.readlines()) + assert src_lines_length == tgt_lines_length + with open(data_path_src, "r", encoding="utf-8") as f_d_s: + with open(data_path_tgt, "r", encoding="utf-8") as f_d_t: + for row_d_s, row_d_t in tqdm(zip(f_d_s, f_d_t), total=src_lines_length): + yield {"article": row_d_s, "highlights": row_d_t} + + +def convert_example(is_test=False): + def warpper(example): + """convert an example into necessary features""" + tokens = example["article"] + labels = example["highlights"] + src_ids, src_attention_mask_ids = tokens.split("$1$") + src_ids = [int(i) for i in src_ids.split(" ")] + src_attention_mask_ids = [int(i) for i in src_attention_mask_ids.split(" ")] + + if not is_test: + labels, decoder_input_attention_mask_ids = labels.split("$1$") + labels = [int(i) for i in labels.split(" ")] + decoder_input_attention_mask_ids = [int(i) for i in decoder_input_attention_mask_ids.split(" ")] + decoder_input_ids = [labels[-1]] + labels[:-1] + + return src_ids, src_attention_mask_ids, decoder_input_ids, decoder_input_attention_mask_ids, labels + + else: + labels, _ = labels.split("$1$") + labels = [int(i) for i in labels.split(" ")] + return src_ids, src_attention_mask_ids, labels + + return warpper + + +@paddle.no_grad() +def generate(args): + paddle.set_device(args.device) + tokenizer = ProphetNetTokenizer.from_pretrained(args.model_name_or_path) + model = ProphetNetForConditionalGeneration.from_pretrained(args.model_name_or_path) + + test_data_src = "data/" + args.dataset + "_data/uncased_tok_data/test.src" + test_data_tgt = "data/" + args.dataset + "_data/uncased_tok_data/test.tgt" + + test_dataset = load_dataset(read, data_path=[test_data_src, test_data_tgt], lazy=False) + + trunc = convert_example(is_test=True) + + test_dataset = test_dataset.map(trunc) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_ids + Pad(axis=0, pad_val=0), # attn mask + Pad(axis=0, pad_val=tokenizer.pad_token_id), # labels + ): fn(samples) + + batch_sampler = BatchSampler(test_dataset, batch_size=args.batch_size, shuffle=False) + test_data_loader = DataLoader( + dataset=test_dataset, batch_sampler=batch_sampler, num_workers=0, collate_fn=batchify_fn, return_list=True + ) + + model.eval() + total_time = 0.0 + start_time = time.time() + all_preds = [] + all_labels = [] + for step, batch in 
tqdm(enumerate(test_data_loader), total=len(test_data_loader)): + input_ids, attention_mask, labels = batch + preds, _ = model.generate( + input_ids=input_ids, + attention_mask=attention_mask, + max_length=args.max_target_length, + min_length=args.min_target_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + early_stopping=args.early_stopping, + diversity_rate=args.diversity_rate, + num_beam_groups=args.num_beam_groups, + repetition_penalty=args.repetition_penalty, + ) + total_time += time.time() - start_time + all_preds.extend(preds.numpy()) + all_labels.extend(labels.numpy()) + if step % args.logging_steps == 0: + print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + start_time = time.time() + decoded_preds, _ = compute_metrics( + all_preds, all_labels, tokenizer, args.ignore_pad_token_for_loss, compute_rouge_=False + ) + if not os.path.exists(os.path.abspath(os.path.dirname(args.output_path) + os.path.sep + ".")): + os.makedirs(os.path.abspath(os.path.dirname(args.output_path) + os.path.sep + ".")) + with open(args.output_path, "w", encoding="utf-8") as fout: + for decoded_pred in decoded_preds: + fout.write(decoded_pred + "\n") + print("Save generated result into: %s" % args.output_path) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + generate(args) diff --git a/examples/text_summarization/prophetnet/requirements.txt b/examples/text_summarization/prophetnet/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..36ec00fa8f7284bb1ac36c268554ff8418a5f636 --- /dev/null +++ b/examples/text_summarization/prophetnet/requirements.txt @@ -0,0 +1,5 @@ +configparser==5.2.0 +nltk==3.6.7 +numpy==1.21.0 +tqdm==4.62.3 +py-rouge==1.1 \ No newline at end of file diff --git a/examples/text_summarization/prophetnet/run_eval.sh b/examples/text_summarization/prophetnet/run_eval.sh new file mode 100644 index 0000000000000000000000000000000000000000..39b7bdcf1cfc03849b17860c8b3bd4ad74de8f15 --- /dev/null +++ b/examples/text_summarization/prophetnet/run_eval.sh @@ -0,0 +1,37 @@ +DATASET=$1 + +if [ $DATASET = cnndm ] +then +python generate.py \ + --dataset=cnndm \ + --model_name_or_path=prophetnet-large-uncased \ + --output_path=./generate/cnndm/generate.txt \ + --min_target_length=45 \ + --max_target_length=110 \ + --decode_strategy=beam_search \ + --num_beams=4 \ + --length_penalty=1.2 \ + --batch_size=16 \ + --ignore_pad_token_for_loss=True \ + --early_stopping=True \ + --logging_steps=100 \ + --device=gpu +else +python generate.py \ + --dataset=gigaword \ + --model_name_or_path=prophetnet-large-uncased \ + --output_path=./generate/gigaword/generate.txt \ + --min_target_length=1 \ + --max_target_length=200 \ + --decode_strategy=beam_search \ + --num_beams=4 \ + --length_penalty=1.6 \ + --batch_size=16 \ + --ignore_pad_token_for_loss=True \ + --early_stopping=True \ + --logging_steps=100 \ + --device=gpu +fi + + +python eval.py --dataset $DATASET --generated ./generate/$DATASET/generate.txt \ No newline at end of file diff --git a/examples/text_summarization/prophetnet/run_train.sh b/examples/text_summarization/prophetnet/run_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..625098dc77e205291e9d7e0c996c00ac0df796bb --- /dev/null +++ b/examples/text_summarization/prophetnet/run_train.sh @@ -0,0 +1,54 @@ +#!/bin/bash + +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +DATASET=$1 + +if [ "$DATASET" == cnndm ] +then +python -m paddle.distributed.launch --gpus 0 train_prophetnet.py \ + --dataset=cnndm \ + --model_name_or_path=prophetnet-large-uncased \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=8 \ + --num_train_epochs=4 \ + --learning_rate=0.0001 \ + --warmup_init_lr=1e-07 \ + --warmup_steps=1000 \ + --max_grad_norm=0.1 \ + --dataloader_num_workers=4 \ + --logging_steps 10 \ + --save_steps 100 \ + --do_train \ + --do_eval \ + --output_dir=./ckpt/cnndm +else +python -m paddle.distributed.launch --gpus 0 train_prophetnet.py \ + --dataset=gigaword \ + --model_name_or_path=prophetnet-large-uncased \ + --per_device_train_batch_size=16 \ + --per_device_eval_batch_size=32 \ + --num_train_epochs=6 \ + --learning_rate=0.0001 \ + --warmup_init_lr=1e-07 \ + --warmup_steps=1000 \ + --max_grad_norm=0.1 \ + --dataloader_num_workers=8 \ + --logging_steps 10 \ + --save_steps 100 \ + --do_train \ + --do_eval \ + --output_dir=./ckpt/gigaword +fi \ No newline at end of file diff --git a/examples/text_summarization/prophetnet/train_prophetnet.py b/examples/text_summarization/prophetnet/train_prophetnet.py new file mode 100644 index 0000000000000000000000000000000000000000..f799a6dabbf6ce5341ba9712fdc8dc368858383c --- /dev/null +++ b/examples/text_summarization/prophetnet/train_prophetnet.py @@ -0,0 +1,252 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
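+# The --learning_rate, --warmup_init_lr and --warmup_steps flags passed by
+# run_train.sh configure the InverseSquareRootSchedule defined below: the
+# learning rate rises linearly from warmup_init_lr to learning_rate over the
+# first warmup_steps updates, then decays as learning_rate * sqrt(warmup_steps / step).
+# For example, with the cnndm settings (1e-07 -> 1e-04 over 1000 steps),
+# step 4000 uses 1e-04 * sqrt(1000 / 4000) = 5e-05 and step 16000 uses 2.5e-05.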
+ +from dataclasses import dataclass, field +from typing import Optional + +import paddle +from tqdm import tqdm + +from paddlenlp.data import Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed +from paddlenlp.transformers.prophetnet.modeling import ( + ProphetNetForConditionalGeneration, +) +from paddlenlp.transformers.prophetnet.tokenizer import ProphetNetTokenizer + + +@dataclass +class ModelArguments: + model_name_or_path: Optional[str] = field( + default="prophetnet-large-uncased", + metadata={"help": ("Path to pre-trained model.")}, + ) + warmup_init_lr: Optional[float] = field( + default=1e-07, + ) + + +@dataclass +class DataArguments: + dataset: Optional[str] = field( + default="gigaword", + metadata={"help": ("Path to tokenizer vocab file.")}, + ) + + +def read(data_path): + data_path_src = data_path[0] + data_path_tgt = data_path[1] + with open(data_path_src, "r", encoding="utf-8") as f_d_s: + src_lines_length = len(f_d_s.readlines()) + with open(data_path_tgt, "r", encoding="utf-8") as f_d_t: + tgt_lines_length = len(f_d_t.readlines()) + assert src_lines_length == tgt_lines_length + with open(data_path_src, "r", encoding="utf-8") as f_d_s: + with open(data_path_tgt, "r", encoding="utf-8") as f_d_t: + for row_d_s, row_d_t in tqdm(zip(f_d_s, f_d_t), total=src_lines_length): + yield {"article": row_d_s, "highlights": row_d_t} + + +class InverseSquareRootSchedule(paddle.optimizer.lr.LRScheduler): + def __init__(self, warmup_init_lr, warmup_end_lr, warmup_steps, last_epoch=-1, verbose=False): + self.lr_step = (warmup_end_lr - warmup_init_lr) / warmup_steps + self.decay_factor = warmup_end_lr * warmup_steps**0.5 + self.warmup_steps = warmup_steps + self.warmup_init_lr = warmup_init_lr + super(InverseSquareRootSchedule, self).__init__(warmup_init_lr, last_epoch, verbose) + + def get_lr(self): + if self.last_epoch < self.warmup_steps: + self.base_lr = self.warmup_init_lr + self.last_epoch * self.lr_step + else: + self.base_lr = self.decay_factor * self.last_epoch**-0.5 + return self.base_lr + + +def convert_example(is_test=False): + def warpper(example): + """convert an example into necessary features""" + tokens = example["article"] + labels = example["highlights"] + src_ids, src_attention_mask_ids = tokens.split("$1$") + src_ids = [int(i) for i in src_ids.split(" ")] + src_attention_mask_ids = [int(i) for i in src_attention_mask_ids.split(" ")] + + if not is_test: + labels, decoder_input_attention_mask_ids = labels.split("$1$") + labels = [int(i) for i in labels.split(" ")] + decoder_input_attention_mask_ids = [int(i) for i in decoder_input_attention_mask_ids.split(" ")] + decoder_input_ids = [labels[-1]] + labels[:-1] + return src_ids, src_attention_mask_ids, decoder_input_ids, decoder_input_attention_mask_ids, labels + + else: + return src_ids, src_attention_mask_ids + + return warpper + + +@dataclass +class DataCollator: + tokenizer: ProphetNetTokenizer + + def __call__(self, features): + src_ids = [] + src_pids = [] + tgt_ids = [] + tgt_pids = [] + labels = [] + batch = {} + + for feature in features: + src_idx, src_pid, tgt_idx, tgt_pid, label = feature + src_ids.append(src_idx) + src_pids.append(src_pid) + tgt_ids.append(tgt_idx) + tgt_pids.append(tgt_pid) + labels.append(label) + + src_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(src_ids),) + src_pids = (Pad(axis=0, pad_val=0)(src_pids),) + tgt_ids = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(tgt_ids),) + tgt_pids = 
(Pad(axis=0, pad_val=0)(tgt_pids),) + labels = (Pad(axis=0, pad_val=self.tokenizer.pad_token_id)(labels),) + + batch["src_ids"] = src_ids[0] + batch["src_pids"] = src_pids[0] + batch["tgt_ids"] = tgt_ids[0] + batch["tgt_pids"] = tgt_pids[0] + batch["labels"] = labels[0] + + return batch + + +def loss_func(model, logits, labels, ignore_index=-100): + expend_targets = paddle.cast( + paddle.zeros((model.prophetnet.config["ngram"], labels.shape[0], labels.shape[1])).fill_(ignore_index), + dtype=paddle.int32, + ) + + for i in range(model.prophetnet.config["ngram"]): + if i > 0 and model.prophetnet.disable_ngram_loss: + break + expend_targets[i, :, :] = labels.cast(dtype=paddle.int32) # B,Ngram,Seq + + logits = logits.transpose([1, 0, 2, 3]) + + if model.prophetnet.eps > 0.0: + expend_targets_mask = paddle.cast(expend_targets != ignore_index, dtype=paddle.float32) + expend_targets = paddle.nn.functional.one_hot(expend_targets, num_classes=model.vocab_size) + expend_targets = paddle.nn.functional.label_smooth(expend_targets, epsilon=model.prophetnet.eps) + loss = paddle.nn.functional.cross_entropy(logits, expend_targets, soft_label=True, reduction="none").squeeze() + loss = paddle.sum(expend_targets_mask * loss) / expend_targets_mask.sum() + else: + loss = paddle.nn.functional.cross_entropy( + logits, expend_targets.cast(dtype=paddle.int64), ignore_index=ignore_index + ) + + return loss + + +class ProphetnetTrainer(Trainer): + def compute_loss(self, model, inputs, return_outputs=False): + src_ids, src_attention_mask_ids, decoder_input_ids, decoder_input_attention_mask_ids, label_ids = inputs + src_ids = inputs["src_ids"] + src_attention_mask_ids = inputs["src_pids"] + decoder_input_ids = inputs["tgt_ids"] + decoder_input_attention_mask_ids = inputs["tgt_pids"] + label_ids = inputs["labels"] + + src_ids = src_ids.cast(dtype=paddle.int32) + src_attention_mask_ids = src_attention_mask_ids.cast(dtype=paddle.int32) + decoder_input_ids = decoder_input_ids.cast(dtype=paddle.int32) + decoder_input_attention_mask_ids = decoder_input_attention_mask_ids.cast(dtype=paddle.int32) + label_ids = label_ids.cast(dtype=paddle.int64) + + outputs = model( + input_ids=src_ids, + attention_mask=src_attention_mask_ids, + decoder_input_ids=decoder_input_ids, + decoder_attention_mask=decoder_input_attention_mask_ids, + ) + loss = loss_func(model, outputs[2], label_ids, ignore_index=model.padding_idx) + + return (loss, outputs) if return_outputs else loss + + +def do_train(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(training_args.seed) + + train_data_src = "data/" + data_args.dataset + "_data/uncased_tok_data/train.src" + train_data_tgt = "data/" + data_args.dataset + "_data/uncased_tok_data/train.tgt" + + dev_data_src = "data/" + data_args.dataset + "_data/uncased_tok_data/dev.src" + dev_data_tgt = "data/" + data_args.dataset + "_data/uncased_tok_data/dev.tgt" + + train_dataset = load_dataset(read, data_path=[train_data_src, train_data_tgt], lazy=False) + dev_dataset = load_dataset(read, data_path=[dev_data_src, dev_data_tgt], lazy=False) + + tokenizer = ProphetNetTokenizer.from_pretrained(model_args.model_name_or_path) + + trans_func = convert_example() + + train_dataset = train_dataset.map(trans_func) + dev_dataset = dev_dataset.map(trans_func) + batchify_fn = 
DataCollator(tokenizer) + + model = ProphetNetForConditionalGeneration.from_pretrained(model_args.model_name_or_path) + + lr_scheduler = InverseSquareRootSchedule( + model_args.warmup_init_lr, training_args.learning_rate, training_args.warmup_steps + ) + optimizer = paddle.optimizer.Adam( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=training_args.weight_decay, + grad_clip=paddle.nn.ClipGradByNorm(training_args.max_grad_norm), + ) + + trainer = ProphetnetTrainer( + model=model, + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=dev_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + data_collator=batchify_fn, + optimizers=(optimizer, lr_scheduler), + ) + + if training_args.do_train: + train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + metrics = train_results.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/examples/text_summarization/prophetnet/uncase_tokenize_data.py b/examples/text_summarization/prophetnet/uncase_tokenize_data.py new file mode 100644 index 0000000000000000000000000000000000000000..7a0b8f84ba0f1c5d42aa086d0312d925a3a9b25c --- /dev/null +++ b/examples/text_summarization/prophetnet/uncase_tokenize_data.py @@ -0,0 +1,92 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
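+# This script pre-tokenizes the GLGE cnndm/gigaword text files with the
+# ProphetNet tokenizer and writes one example per line in the form
+# "<input_ids>$1$<attention_mask>", e.g. "102 2003 1037 103$1$1 1 1 1"
+# (the token ids here are made-up placeholders). convert_example() in
+# train_prophetnet.py and generate.py later splits each line on "$1$" to
+# recover the id and attention-mask sequences.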
+ +import argparse +import os + +import tqdm +from nltk.tokenize.treebank import TreebankWordDetokenizer + +from paddlenlp.transformers.prophetnet.tokenizer import ProphetNetTokenizer + + +def uncased_preocess(fin, fout, keep_sep=False, max_len=512): + tokenizer = ProphetNetTokenizer(vocab_file="prophetnet.tokenizer") + fin = open(fin, "r", encoding="utf-8") + fout = open(fout, "w", encoding="utf-8") + twd = TreebankWordDetokenizer() + for line in tqdm.tqdm(fin.readlines()): + line = line.strip().replace("``", '"').replace("''", '"').replace("`", "'") + s_list = [twd.detokenize(x.strip().split(" "), convert_parentheses=True) for x in line.split("")] + if keep_sep: + output_string = " [X_SEP] ".join(s_list) + else: + output_string = " ".join(s_list) + encoded_string = tokenizer(output_string, return_attention_mask=True, max_length=max_len) + ids, attention_mask_ids = encoded_string["input_ids"][:max_len], encoded_string["attention_mask"][:max_len] + output_string = "$1$".join([" ".join([str(i) for i in ids]), " ".join([str(i) for i in attention_mask_ids])]) + fout.write("{}\n".format(output_string)) + + +def tokenize_with_bert_uncase(fin, fout, max_len=512): + fin = open(fin, "r", encoding="utf-8") + fout = open(fout, "w", encoding="utf-8") + tokenizer = ProphetNetTokenizer(vocab_file="prophetnet.tokenizer") + for line in tqdm.tqdm(fin.readlines()): + encoded_string = tokenizer(line, return_attention_mask=True, max_length=max_len) + ids, attention_mask_ids = encoded_string["input_ids"][:max_len], encoded_string["attention_mask"][:max_len] + output_string = "$1$".join([" ".join([str(i) for i in ids]), " ".join([str(i) for i in attention_mask_ids])]) + fout.write("{}\n".format(output_string)) + + +def tokenize_data(dataset): + dataset = dataset + "_data" + input_dir = "./data/%s" % (dataset) + output_dir = "./data/%s/uncased_tok_data" % (dataset) + if not os.path.isdir(output_dir): + os.makedirs(output_dir) + if dataset == "cnndm": + uncased_preocess("%s/train.src" % input_dir, "%s/train.src" % output_dir, keep_sep=False) + uncased_preocess("%s/dev.src" % input_dir, "%s/dev.src" % output_dir, keep_sep=False) + uncased_preocess("%s/test.src" % input_dir, "%s/test.src" % output_dir, keep_sep=False) + uncased_preocess("%s/train.tgt" % input_dir, "%s/train.tgt" % output_dir, keep_sep=True, max_len=128) + uncased_preocess("%s/dev.tgt" % input_dir, "%s/dev.tgt" % output_dir, keep_sep=True) + uncased_preocess("%s/test.tgt" % input_dir, "%s/test.tgt" % output_dir, keep_sep=True) + else: + tokenize_with_bert_uncase("%s/train.src" % input_dir, "%s/train.src" % output_dir) + tokenize_with_bert_uncase("%s/train.tgt" % input_dir, "%s/train.tgt" % output_dir) + tokenize_with_bert_uncase("%s/dev.src" % input_dir, "%s/dev.src" % output_dir) + tokenize_with_bert_uncase("%s/dev.tgt" % input_dir, "%s/dev.tgt" % output_dir) + tokenize_with_bert_uncase("%s/test.src" % input_dir, "%s/test.src" % output_dir) + tokenize_with_bert_uncase("%s/test.tgt" % input_dir, "%s/test.tgt" % output_dir) + + +parser = argparse.ArgumentParser() +parser.add_argument("--dataset", type=str, help="choose dataset from all, or 1 of 8 datasets: cnndm, gigaword") +args = parser.parse_args() + +DATASET_LIST = ["cnndm", "gigaword"] + +if args.dataset != "all" and args.dataset not in DATASET_LIST: + print("please choose dataset from all, or 1 of 8 datasets: cnndm, gigaword") + exit() +else: + if args.dataset == "all": + dataset_list = DATASET_LIST + else: + dataset_list = [args.dataset] + +print(dataset_list) +for dataset in dataset_list: + 
tokenize_data(dataset) diff --git a/examples/text_summarization/prophetnet/uncompress_data.sh b/examples/text_summarization/prophetnet/uncompress_data.sh new file mode 100644 index 0000000000000000000000000000000000000000..392596ae14b29a62bed5bf7dedf0c979b68ea9c4 --- /dev/null +++ b/examples/text_summarization/prophetnet/uncompress_data.sh @@ -0,0 +1,12 @@ +tar -xvf ./glge_public.tar +tar -zxvf ./glge_hidden_v1.1.tar.gz + +DATA=./data +DATASETS=(cnndm gigaword) +mkdir $DATA +for DATASET in ${DATASETS[@]}; do + echo $DATASET +mkdir $DATA/$DATASET\_data +mv ./glge-released-dataset/easy/$DATASET\_data/org_data/* $DATA/$DATASET\_data/ +mv ./glge-hidden-dataset/easy/$DATASET\_data/org_data/* $DATA/$DATASET\_data/ +done diff --git a/examples/text_summarization/unimo-text/README.md b/examples/text_summarization/unimo-text/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c0d336331d03142e490890df9cb2a11b7fada0df --- /dev/null +++ b/examples/text_summarization/unimo-text/README.md @@ -0,0 +1,294 @@ +# 生成式文本摘要应用 + +**目录** +- [生成式文本摘要应用](#生成式文本摘要应用) + - [简介](#简介) + - [基于预训练语言模型的文本摘要](#基于预训练语言模型的文本摘要) + - [效果展示](#效果展示) + - [开箱即用](#开箱即用) + - [支持单条、批量预测](#支持单条批量预测) + - [可配置参数说明](#可配置参数说明) + - [训练定制](#训练定制) + - [文本摘要应用定制训练全流程介绍](#文本摘要应用定制训练全流程介绍) + - [环境依赖](#环境依赖) + - [代码结构说明](#代码结构说明) + - [数据准备](#数据准备) + - [数据加载](#数据加载) + - [从本地文件创建数据集](#从本地文件创建数据集) + - [模型训练](#模型训练) + - [模型预测](#模型预测) + - [模型推理部署](#模型推理部署) + - [FastGeneration加速及模型静态图导出](#fastgeneration加速及模型静态图导出) + - [模型部署](#模型部署) + - [References](#references) + + +## 简介 +文本摘要的目标是自动地将输入文本转换成简短摘要,为用户提供简明扼要的内容描述,是缓解文本信息过载的一个重要手段。 +文本摘要也是自然语言生成领域中的一个重要任务,有很多应用场景,如新闻摘要、论文摘要、财报摘要、传记摘要、专利摘要、对话摘要、评论摘要、观点摘要、电影摘要、文章标题生成、商品名生成、自动报告生成、搜索结果预览等。 + +本项目是基于预训练语言模型UNIMO-Text的文本摘要,具有以下优势: +- 效果领先。 +- 开箱即用。本项目提供TaskFlow接口,无需训练,仅需几行代码便可预测。 +- 高性能推理。本项目基于[FastGeneration](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/fast_generation)进行推理加速,能够提供更高性能的推理体验。 +- 训练推理全流程打通。本项目提供了全面的定制训练流程,从数据准备、模型训练预测,到模型推理部署,一应俱全。 + +### 基于预训练语言模型的文本摘要 + +基于预训练语言模型(Pretrained Language Models, PLMs)范式的自动文本摘要是目前最常用、效果最好(SOTA)的方式。 +预训练模型是在超大规模的语料采用无监督(unsupervised)或者弱监督(weak-supervised)的方式进行预训练,能够学习如何准确地理解自然语言并以自然语言的形式流畅表达,这两项都是完成文本摘要任务的重要能力。 + +PaddleNLP提供了方便易用的接口,可指定模型名或模型参数文件路径通过from_pretrained()方法加载不同网络结构的预训练模型,且相应预训练模型权重下载速度快速、稳定。下面以中文unimo-text-1.0-summary模型为例,演示如何加载预训练模型和分词器: +``` +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer +model_name = "unimo-text-1.0-summary" +model = UNIMOLMHeadModel.from_pretrained(model_name) +tokenizer = UNIMOTokenizer.from_pretrained(model_name) +``` + +## 效果展示 + +## 开箱即用 +PaddleNLP提供开箱即用的产业级NLP预置任务能力,无需训练,一键预测。 +### 支持单条、批量预测 + +```python +>>> from paddlenlp import Taskflow +>>> summarizer = Taskflow("text_summarization") +# 单条输入 +>>> summarizer("雪后的景色可真美丽呀!不管是大树上,屋顶上,还是菜地上,都穿上了一件精美的、洁白的羽绒服。放眼望去,整个世界变成了银装素裹似的,世界就像是粉妆玉砌的一样。") +# 输出:'雪后的景色可真美丽呀!' + +# 多条输入 +>>> summarizer([ + "雪后的景色可真美丽呀!不管是大树上,屋顶上,还是菜地上,都穿上了一件精美的、洁白的羽绒服。放眼望去,整个世界变成了银装素裹似的,世界就像是粉妆玉砌的一样。", + "根据“十个工作日”原则,下轮调价窗口为8月23日24时。卓创资讯分析,原油价格或延续震荡偏弱走势,且新周期的原油变化率仍将负值开局,消息面对国内成品油市场并无提振。受此影响,预计国内成品油批发价格或整体呈现稳中下滑走势,但“金九银十”即将到来,卖方看好后期市场,预计跌幅较为有限。" + ]) +#输出:['雪后的景色可真美丽呀!', '成品油调价窗口8月23日24时开启'] +``` + +### 可配置参数说明 +* `model`:可选模型,默认为`unimo-text-1.0-summary`。 +* `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 + + +## 训练定制 +### 文本摘要应用定制训练全流程介绍 +接下来,我们将按数据准备、训练、预测、推理部署对文本摘要应用的全流程进行介绍。 +1. **数据准备** +- 如果没有已标注的数据集,我们推荐[doccano](https://github.com/doccano/doccano)数据标注工具。 +如果已有标注好的本地数据集,我们需要根据将数据集整理为文档要求的格式,请参考[从本地文件创建数据集](#从本地文件创建数据集)。 + +2. 
**模型训练** + +- 数据准备完成后,可以开始使用我们的数据集对预训练模型进行微调训练。我们可以根据任务需求,调整可配置参数,选择使用GPU或CPU进行模型训练,脚本默认保存在开发集最佳表现模型。中文任务默认使用"unimo-text-1.0-summary"模型,unimo-text-1.0-summary还支持large模型,详见[UNIMO模型汇总](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers/UNIMO/contents.html),可以根据任务和设备需求进行选择。 + + +3. **模型预测** + +- 训练结束后,我们可以加载保存的最佳模型进行模型测试,打印模型预测结果。 + +4. **模型推理部署** + +- 模型部署需要将保存的最佳模型参数(动态图)导出成静态图参数,用于后续的推理部署。 + +- 文本摘要应用提供了基于Paddle Inference的本地部署predictor,并且支持在GPU设备使用FastGeneration进行加速。 + +- 文本摘要应用提供了基于Paddle Serving的服务端部署方案。 + +### 环境依赖 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +text_summarization/ +├── deploy # 部署 +│ ├── paddle_inference # PaddleInference高性能推理部署 +│ │ ├── inference_unimo_text.py # 推理部署脚本 +│ │ └── README.md # 说明文档 +│ └── paddle_serving +│ ├── config.yml # 配置文件 +│ ├── pipeline_client.py # 客户端程序 +│ ├── pipeline_service.py # 服务器程序 +│ ├── export_serving.sh # serving模型导出脚本 +│ └── README.md # 说明文档 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── export_model.sh # 动态图参数导出静态图参数shell脚本 +├── train.py # 训练评估脚本 +├── train.sh # 训练评估shell脚本 +├── utils.py # 工具函数脚本 +└── README.md # 说明文档 +``` + +### 数据准备 + +#### 数据加载 +#### 从本地文件创建数据集 + +在许多情况,我们需要使用本地数据集来训练我们的文本摘要模型,本项目支持使用固定格式本地数据集文件进行训练。 + +本地数据集目录结构如下: + +```text +data/ +├── train.json # 训练数据集文件 +└── test.json # 可选,待预测数据文件 +``` +本地数据集文件格式如下: +- train.json/test.json 文件每行格式: +```text +{ +"title": "任志强抨击政府把土地作为投机品地产业被人为破坏", +"content": "“北京的保障房市场就像一个巨大的赌场,每个人都在期待中奖。”面对中国目前现行的保障性住房政策,华远地产董事长任志强再次语出惊人。(分享自@第一财经-中国房地产金融)" +} +``` + +更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#)和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。 + + +### 模型训练 +运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。 + +```shell +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +log_dir=output +rm -rf ${log_dir} +mkdir -p ${log_dir} + +python -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir ${log_dir} train.py \ + --model_name_or_path=unimo-text-1.0-summary \ + --train_file train.json \ + --eval_file test.json \ + --save_dir=${log_dir}/checkpoints \ + --logging_steps=100 \ + --save_steps=10000 \ + --epochs=10 \ + --batch_size=32 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=60 \ + --max_target_len=30 \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --do_train \ + --do_eval \ + --device=gpu \ +``` +也可以直接使用`train.sh`. 
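如果需要在自定义脚本中加载上文 `train.json` 格式(每行一个 JSON,包含 `title` 与 `content` 字段)的本地数据,可以参考下面的最小示例(仅作示意,并非 `train.py` 的原始实现):

```python
import json

from paddlenlp.datasets import load_dataset


def read(data_path):
    # 每行一个 JSON:"content" 为原文,"title" 为参考摘要
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            yield {"content": example["content"], "title": example["title"]}


train_ds = load_dataset(read, data_path="data/train.json", lazy=False)
print(train_ds[0]["title"])
```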
+ +关键参数释义如下: +- `gpus` 指示了训练所用的GPU卡号。 +- `dataset_name` 数据集名称。 +- `train_file` 本地训练数据地址。 +- `eval_file` 本地测试数据地址。 +- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型(详见[UNIMO模型汇总](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers/UNIMO/contents.html)),或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。 + + | PaddleNLP提供的预训练模型 | + |---------------------------------| + | unimo-text-1.0-summary | + | unimo-text-1.0 | + | unimo-text-1.0-large | + +- `save_dir` 表示模型的保存路径。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `seed` 表示随机数生成器的种子。 +- `epochs` 表示训练轮数。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `warmup_proportion` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `max_seq_len` 模型输入序列的最大长度。 +- `max_target_len` 模型训练时标签的最大长度。 +- `min_dec_len` 模型生成序列的最小长度。 +- `max_dec_len` 模型生成序列的最大长度。 +- `do_train` 是否进行训练。 +- `do_eval` 是否进行预测,在验证集上会自动评估。 +- `device` 表示使用的设备,从gpu和cpu中选择。 + +更多参数详情和参数的默认值请参考`train.py`。 + +程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +./checkpoints/ +├── model_8000 +│ ├── model_config.json +│ ├── model_state.pdparams +│ ├── special_tokens_map.json +│ ├── tokenizer_config.json +│ └── vocab.txt +└── ... +``` + +**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。 + + +### 模型预测 + +运行下方脚本可以使用训练好的模型进行预测。 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python train.py \ + --do_eval \ + --eval_file test.json \ + --model_name_or_path=your_model_path \ + --logging_steps=100 \ + --batch_size=16 \ + --max_seq_len=60 \ + --max_target_len=30 \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --device=gpu +``` + +程序运行结束后会将预测结果保存在`output_path`中。 + + +Finetuned baseline的模型在[LCSTS](https://aclanthology.org/D15-1229/)测试集上有如下结果: +| model_name | Rouge-1 | Rouge-2 | Rouge-L | BLEU-4 | +| :-----------------------------: | :---: | :-----------: | :-------------------: |:-------------------: | +| finetuned unimo-text-1.0-summary | 39.56 | 26.24 | 36.35 | 21.48 | + + +### 模型推理部署 + +#### FastGeneration加速及模型静态图导出 + +使用动态图训练结束之后,可以通过[静态图导出脚本](export_model.py)实现基于FastGeneration的高性能预测加速,并将动态图参数导出成静态图参数,静态图参数保存在`output_path`指定路径中。运行方式: + +```shell +python export_model.py \ + --model_name_or_path unimo-text-1.0-summary \ + --decoding_strategy beam_search \ + --inference_model_dir ./inference_model \ + --max_out_len 30 \ +``` +关键参数释义如下: + +* `model_name_or_path`:动态图训练保存的参数路径;默认为"unimo-text-1.0-summary"。 +* `inference_model_dir`:静态图图保存的参数路径;默认为"./inference_model"。 +* `max_out_len`:最大输出长度。 + +执行命令后将会自动导出模型到指定的 `inference_model` 中,保存模型文件结构如下所示: + +```text +inference_model/ +├── unimo_text.pdiparams +├── unimo_text.pdiparams.info +└── unimo_text.pdmodel +``` + +#### 模型部署 +文本摘要应用已打通多种场景部署方案,点击链接获取具体的使用教程。 +- [Paddle Inference 推理 (Python)](./deploy/paddle_inference/README.md) +- [Paddle Serving 服务化部署(Python)](./deploy/paddle_serving/README.md) + +## References +Li, Wei, et al. "Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning." arXiv preprint arXiv:2012.15409 (2020). 
diff --git a/examples/text_summarization/unimo-text/deploy/paddle_inference/README.md b/examples/text_summarization/unimo-text/deploy/paddle_inference/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ca5577601cf2520aab2be98c4456a2c869840ecc --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_inference/README.md @@ -0,0 +1,31 @@ +# Paddle Inference部署 +本文档将介绍如何使用[Paddle Inference](https://paddle-inference.readthedocs.io/en/latest/guides/introduction/index_intro.html#paddle-inference)工具进行自动文本摘要应用高性能推理推理部署。 + +**目录** + * [背景介绍](#背景介绍) + * [导出预测部署模型](#导出预测部署模型) + * [基于Python预测](#基于Python预测) + + +## 背景介绍 +Paddle inference和主框架的Model.predict均可实现推理预测,Paddle Inference 是飞桨的原生推理库, 作用于服务器端和云端,提供高性能的推理能力,主框架的Model 对象是一个具备训练、测试、推理的神经网络。相比于Model.predict,inference可使用MKLDNN、CUDNN、TensorRT进行预测加速。Model.predict适用于训练好的模型直接进行预测,paddle inference适用于对推理性能、通用性有要求的用户,针对不同平台不同的应用场景进行了深度的适配优化,保证模型在服务器端即训即用,快速部署。由于 Paddle Inference 能力直接基于飞桨的训练算子,因此它支持飞桨训练出的所有模型的推理。 + + +Paddle Inference Python端预测部署主要包含两个步骤: +- 导出预测部署模型 +- 基于Python预测 + + +## 导出预测部署模型 +部署时需要使用预测格式的模型(即动态图转静态图操作)。预测格式模型相对训练格式模型而言,在拓扑上裁剪掉了预测不需要的算子,并且会做特定部署优化。具体操作详见[FastGeneration加速及模型静态图导出](../../README.md)。 + +## 基于Python预测 + + +在终端输入以下命令可在GPU上进行预测: +```shell +python inference_unimo_text.py --inference_model_dir ../../inference_model +``` + +关键参数释义如下: +* `inference_model_dir`:用于高性能推理的静态图模型参数路径;默认为"../../inference_model"。 diff --git a/examples/text_summarization/unimo-text/deploy/paddle_inference/inference_unimo_text.py b/examples/text_summarization/unimo-text/deploy/paddle_inference/inference_unimo_text.py new file mode 100644 index 0000000000000000000000000000000000000000..f60fbf35841b1d4257615d9463aca01ed572638c --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_inference/inference_unimo_text.py @@ -0,0 +1,151 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from pprint import pprint + +import numpy as np +from paddle import inference + +from paddlenlp.data import Pad +from paddlenlp.ops.ext_utils import load +from paddlenlp.transformers import UNIMOTokenizer + + +def setup_args(): + """Setup arguments.""" + parser = argparse.ArgumentParser() + parser.add_argument( + "--inference_model_dir", + default="../../inference_model", + type=str, + help="Path to save inference model of UNIMOText. ", + ) + args = parser.parse_args() + return args + + +def setup_predictor(args): + """Setup inference predictor.""" + # Load FastGeneration lib. 
+ load("FastGeneration", verbose=True) + model_file = os.path.join(args.inference_model_dir, "unimo_text.pdmodel") + params_file = os.path.join(args.inference_model_dir, "unimo_text.pdiparams") + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = inference.Config(model_file, params_file) + config.enable_use_gpu(100, 0) + config.switch_ir_optim() + config.enable_memory_optim() + config.disable_glog_info() + + predictor = inference.create_predictor(config) + return predictor + + +def convert_example(example, tokenizer, max_seq_len=512, return_length=True): + """Convert all examples into necessary features.""" + source = example + tokenized_example = tokenizer.gen_encode( + source, + max_seq_len=max_seq_len, + add_start_token_for_decoding=True, + return_length=True, + is_split_into_words=False, + ) + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, pad_right=False): + """Batchify a batch of examples.""" + + def pad_mask(batch_attention_mask): + """Pad attention_mask.""" + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + if pad_right: + mask_data[:seq_len:, :seq_len] = np.array(batch_attention_mask[i], dtype="float32") + else: + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=pad_right, dtype="int32") + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + seq_len = np.asarray([example["seq_len"] for example in batch_examples], dtype="int32") + input_dict = {} + input_dict["input_ids"] = input_ids + input_dict["token_type_ids"] = token_type_ids + input_dict["attention_mask"] = attention_mask + input_dict["seq_len"] = seq_len + return input_dict + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +def infer(args, predictor): + """Use predictor to inference.""" + tokenizer = UNIMOTokenizer.from_pretrained("unimo-text-1.0-summary") + + inputs = [ + "雪后的景色可真美丽呀!不管是大树上,屋顶上,还是菜地上,都穿上了一件精美的、洁白的羽绒服。放眼望去,整个世界变成了银装素裹似的,世界就像是粉妆玉砌的一样。", + "根据“十个工作日”原则,下轮调价窗口为8月23日24时。卓创资讯分析,原油价格或延续震荡偏弱走势,且新周期的原油变化率仍将负值开局,消息面对国内成品油市场并无提振。受此影响,预计国内成品油批发价格或整体呈现稳中下滑走势,但“金九银十”即将到来,卖方看好后期市场,预计跌幅较为有限。", + ] + + examples = [convert_example(i, tokenizer) for i in inputs] + data = batchify_fn(examples, tokenizer.pad_token_id) + + input_handles = {} + for name in predictor.get_input_names(): + input_handles[name] = predictor.get_input_handle(name) + input_handles[name].copy_from_cpu(data[name]) + + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + predictor.run() + + output = [output_handle.copy_to_cpu() for output_handle in output_handles] + + for idx, sample in enumerate(output[0]): + for beam_idx, beam in enumerate(sample): + if beam_idx > len(sample) // 2: + break + print(f"Example {idx} beam beam_idx {beam_idx}: ", "".join(postprocess_response(beam, tokenizer))) + + +if __name__ == "__main__": + args = setup_args() + pprint(args) + predictor = setup_predictor(args) + infer(args, predictor) diff --git a/examples/text_summarization/unimo-text/deploy/paddle_serving/README.md b/examples/text_summarization/unimo-text/deploy/paddle_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..31061c513704e20e786edb56bac35a6c17d83722 --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_serving/README.md @@ -0,0 +1,140 @@ +# Paddle Serving服务化部署 + +本文档将介绍如何使用[Paddle Serving](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具部署自动文本摘要在线服务。 + +## 目录 +- [Paddle Serving服务化部署](#paddle-serving服务化部署) + - [目录](#目录) + - [背景介绍](#背景介绍) + - [环境准备](#环境准备) + - [安装Paddle Serving](#安装paddle-serving) + - [模型转换](#模型转换) + - [pipeline部署](#pipeline部署) + - [修改配置文件](#修改配置文件) + - [server启动服务](#server启动服务) + - [client发送服务请求](#client发送服务请求) + +## 背景介绍 +Paddle Serving 依托深度学习框架 PaddlePaddle 旨在帮助深度学习开发者和企业提供高性能、灵活易用的工业级在线推理服务。Paddle Serving 支持 RESTful、gRPC、bRPC 等多种协议,提供多种异构硬件和多种操作系统环境下推理解决方案,和多种经典预训练模型示例。集成高性能服务端推理引擎 Paddle Inference 和端侧引擎 Paddle Lite。设计并实现基于有向无环图(DAG) 的异步流水线高性能推理框架,具有多模型组合、异步调度、并发推理、动态批量、多卡多流推理、请求缓存等特性。 + +Paddle Serving Python端预测部署主要包含以下步骤: +- 环境准备 +- 模型转换 +- 部署模型 + +## 环境准备 +### 安装Paddle Serving +安装client和serving app,用于向服务发送请求: +```shell +pip install paddle_serving_app paddle_serving_client +``` +安装GPU server,用于启动服务: + +- 安装GPU server, 注意选择跟本地环境一致的命令 +```shell +# CUDA10.2 + Cudnn7 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post102 # -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA10.1 + TensorRT6 +pip install paddle-serving-server-gpu==0.8.3.post101 # -i https://pypi.tuna.tsinghua.edu.cn/simple +# CUDA11.2 + TensorRT8 +pip install paddle-serving-server-gpu==0.8.3.post112 # -i https://pypi.tuna.tsinghua.edu.cn/simple +``` + +**NOTE:** +- 可以开启国内清华镜像源来加速下载 +- 如果要安装最新版本的PaddleServing参考[链接](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Latest_Packages_CN.md)。 + + +## 模型转换 + +使用Paddle Serving做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 + 
+用已安装的paddle_serving_client将静态图参数模型转换成serving格式。关于如何使用将训练后的动态图模型转为静态图模型详见[FastGeneration加速及模型静态图导出](../../README.md)。 + +模型转换命令如下: +```shell +python -m paddle_serving_client.convert --dirname ../../inference_model \ + --model_filename unimo_text.pdmodel \ + --params_filename unimo_text.pdiparams \ + --serving_server inference_model_server \ + --serving_client inference_model_client +``` +关键参数释义如下: +* `dirname`:模型文件夹地址。 +* `model_filename`:模型文件名。 +* `params_filename`:模型参数名。 +* `serving_server`:server的模型文件和配置文件路径,默认"serving_server"。 +* `serving_client`:client的配置文件路径,默认"serving_client"。 + +也可以直接使用`export_serving.sh`. + +更多参数可通过以下命令查询: +```shell +python -m paddle_serving_client.convert --help +``` +模型转换完成后,会在paddle_serving文件夹多出inference_model_server和inference_model_client的文件夹,文件夹目录格式如下: +``` +inference_model_server/ +├── unimo_text.pdiparams +├── unimo_text.pdmodel +├── serving_server_conf.prototxt +└── serving_server_conf.stream.prototxt + +inference_model_client/ +├── serving_client_conf.prototxt +└── serving_client_conf.stream.prototxt +``` + +## pipeline部署 + +paddle_serving目录包含启动pipeline服务和发送预测请求的代码,包括: +``` +paddle_serving/ +├──config.yml # 启动服务端的配置文件 +├──pipeline_client.py # 发送pipeline预测请求的脚本 +└──pipeline_service.py # 启动pipeline服务端的脚本 +``` + +### 修改配置文件 +目录中的`config.yml`文件解释了每一个参数的含义,可以根据实际需要修改其中的配置。 + +### server启动服务 +修改好配置文件后,执行下面命令启动服务: +```shell +# 启动服务 +python pipeline_service.py +``` +成功启动服务后,log.txt中会打印类似如下日志 +``` +--- Running analysis [ir_graph_to_program_pass] +I0831 12:29:41.132828 28269 analysis_predictor.cc:1035] ======= optimize end ======= +I0831 12:29:41.133375 28269 naive_executor.cc:102] --- skip [feed], feed -> seq_len +I0831 12:29:41.133384 28269 naive_executor.cc:102] --- skip [feed], feed -> attention_mask +I0831 12:29:41.133390 28269 naive_executor.cc:102] --- skip [feed], feed -> token_type_ids +I0831 12:29:41.133401 28269 naive_executor.cc:102] --- skip [feed], feed -> input_ids +I0831 12:29:41.134040 28269 naive_executor.cc:102] --- skip [_generated_var_3], fetch -> fetch +I0831 12:29:41.134049 28269 naive_executor.cc:102] --- skip [gather_tree_0.tmp_0], fetch -> fetch +[2022-08-31 12:29:41,138] [ INFO] - Already cached /root/.paddlenlp/models/unimo-text-1.0-summary/unimo-text-1.0-vocab.txt +[2022-08-31 12:29:41,161] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/unimo-text-1.0-summary/tokenizer_config.json +[2022-08-31 12:29:41,162] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/unimo-text-1.0-summary/special_tokens_map.json +[PipelineServicer] succ init +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +[OP Object] init success +2022/08/31 12:29:41 start proxy service +``` + +### client发送服务请求 +执行以下命令发送文本摘要服务请求: +```shell +python pipeline_client.py +``` +注意执行客户端请求时关闭代理,并根据实际情况修改server_url地址(启动服务所在的机器)。 diff --git a/examples/text_summarization/unimo-text/deploy/paddle_serving/config.yml b/examples/text_summarization/unimo-text/deploy/paddle_serving/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..c6a2744c818a58f4a52d998620dcf6d76b554395 --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_serving/config.yml @@ -0,0 +1,54 @@ +#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 18011 + +#http端口, 
rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 9999 + +#worker_num, 最大并发数。 +#当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +#当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 10 + +#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + #op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: True + + #重试次数 + retry: 1 + + #使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + use_profile: false + tracer: + interval_s: 10 + +op: + text_summarization: + #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 11 + + #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + #client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + + #模型路径 + model_config: ./inference_model_server + + #Fetch结果列表,以client_config中fetch_var的alias_name为准,不设置默认取全部输出变量 + fetch_list: ["_generated_var_3", "transpose_0.tmp_0"] + + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 1 + + #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "0" + + #thread_num + thread_num: 12 + + #ir_optim + ir_optim: False + \ No newline at end of file diff --git a/examples/text_summarization/unimo-text/deploy/paddle_serving/export_serving.sh b/examples/text_summarization/unimo-text/deploy/paddle_serving/export_serving.sh new file mode 100644 index 0000000000000000000000000000000000000000..d34b97fdd5264076c11aacf043e2586496f67f8e --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_serving/export_serving.sh @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -m paddle_serving_client.convert --dirname ../../inference_model \ + --model_filename unimo_text.pdmodel \ + --params_filename unimo_text.pdiparams \ + --serving_server inference_model_server \ + --serving_client inference_model_client \ No newline at end of file diff --git a/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_client.py b/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_client.py new file mode 100644 index 0000000000000000000000000000000000000000..70f11f46d589eb00211c2cf2ef1988b4d4fd9a59 --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_client.py @@ -0,0 +1,52 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import time + +import numpy as np +from paddle_serving_server.pipeline import PipelineClient + +from paddlenlp.utils.log import logger + + +class Runner(object): + def __init__( + self, + server_url: str, + ): + self.client = PipelineClient() + self.client.connect([server_url]) + + def Run(self, data): + inputs = np.array([i.encode("utf-8") for i in data], dtype=np.object_) + start_time = time.time() + ret = self.client.predict(feed_dict={"inputs": inputs}) + end_time = time.time() + logger.info("time cost :{} seconds".format(end_time - start_time)) + if not ret.value: + logger.warning("Fail to fetch summary.") + # ret is special class but a dict + for d, s in zip(data, eval(ret.value[0])): + print("Text: ", d) + print("Summary: ", s[0]) + print("-" * 50) + + +if __name__ == "__main__": + server_url = "127.0.0.1:18011" + runner = Runner(server_url) + texts = [ + "雪后的景色可真美丽呀!不管是大树上,屋顶上,还是菜地上,都穿上了一件精美的、洁白的羽绒服。放眼望去,整个世界变成了银装素裹似的,世界就像是粉妆玉砌的一样。", + "根据“十个工作日”原则,下轮调价窗口为8月23日24时。卓创资讯分析,原油价格或延续震荡偏弱走势,且新周期的原油变化率仍将负值开局,消息面对国内成品油市场并无提振。受此影响,预计国内成品油批发价格或整体呈现稳中下滑走势,但“金九银十”即将到来,卖方看好后期市场,预计跌幅较为有限。", + ] + runner.Run(texts) diff --git a/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_service.py b/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_service.py new file mode 100644 index 0000000000000000000000000000000000000000..0184041c6f930ce07ab15a26fd00e2bc63b56735 --- /dev/null +++ b/examples/text_summarization/unimo-text/deploy/paddle_serving/pipeline_service.py @@ -0,0 +1,129 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
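+# Serving-side pipeline for the summarization model. pipeline_client.py above
+# sends a feed_dict whose "inputs" entry is a numpy object array of UTF-8
+# encoded texts; preprocess() below decodes and batches them, and postprocess()
+# returns the generated summaries as the string repr of a nested list in
+# out_dict["outputs"], which the client converts back to Python objects with
+# eval() and prints one summary per input text.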
+ +import numpy as np +from paddle_serving_server.web_service import Op, WebService + +from paddlenlp.data import Pad +from paddlenlp.ops.ext_utils import load +from paddlenlp.transformers import UNIMOTokenizer +from paddlenlp.utils.log import logger + + +def convert_example(example, tokenizer, max_seq_len=512, return_length=True): + """Convert all examples into necessary features.""" + source = example + tokenized_example = tokenizer.gen_encode( + source, + max_seq_len=max_seq_len, + add_start_token_for_decoding=True, + return_length=True, + is_split_into_words=False, + ) + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, pad_right=False): + """Batchify a batch of examples.""" + + def pad_mask(batch_attention_mask): + """Pad attention_mask.""" + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + if pad_right: + mask_data[:seq_len:, :seq_len] = np.array(batch_attention_mask[i], dtype="float32") + else: + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). + attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=pad_right, dtype="int32") + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + seq_len = np.asarray([example["seq_len"] for example in batch_examples], dtype="int32") + input_dict = {} + input_dict["input_ids"] = input_ids + input_dict["token_type_ids"] = token_type_ids + input_dict["attention_mask"] = attention_mask + input_dict["seq_len"] = seq_len + return input_dict + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +class UnimoTextOp(Op): + """Op for unimo_text.""" + + def init_op(self): + self.tokenizer = UNIMOTokenizer.from_pretrained("unimo-text-1.0-summary") + + def preprocess(self, input_dicts, data_id, log_id): + # Convert input format + ((_, input_dict),) = input_dicts.items() + data = input_dict["inputs"] + if isinstance(data, str) and "array(" in data: + data = eval(data) + else: + logger.error("input value {}is not supported.".format(data)) + data = [i.decode("utf-8") for i in data] + examples = [convert_example(i, self.tokenizer) for i in data] + input_dict = batchify_fn(examples, self.tokenizer.pad_token_id) + # the first return must be a dict or a list of dict, the dict corresponding to a batch of model input + return input_dict, False, None, "" + + def postprocess(self, input_dicts, fetch_dict, data_id, log_id): + outputs = fetch_dict["transpose_0.tmp_0"] + results = [] + for sample in outputs: + result = [] + for idx, beam in enumerate(sample): + if idx >= len(sample) // 2: + break + res = "".join(postprocess_response(beam, self.tokenizer)) + result.append(res) + results.append(result) + out_dict = {} + out_dict["outputs"] = str(results) + # the first return must be a dict or a list of dict, the dict corresponding to a batch of model output + return out_dict, None, "" + + +class UnimoTextService(WebService): + def get_pipeline_response(self, read_op): + return UnimoTextOp(name="text_summarization", input_ops=[read_op]) + + +if __name__ == "__main__": + # Load FastGeneration lib. + load("FastGeneration", verbose=True) + service = UnimoTextService(name="text_summarization") + service.prepare_pipeline_config("config.yml") + service.run_service() diff --git a/examples/text_summarization/unimo-text/export_model.py b/examples/text_summarization/unimo-text/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..4dc1a0b58fe51f0d124833f073ea8b35f0afc554 --- /dev/null +++ b/examples/text_summarization/unimo-text/export_model.py @@ -0,0 +1,114 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from pprint import pprint + +import paddle + +from paddlenlp.ops import FasterUNIMOText +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + default="unimo-text-1.0-summary", + type=str, + help="The model name to specify the UNIMOText to use. ", + ) + parser.add_argument( + "--inference_model_dir", + default="./inference_model", + type=str, + help="Path to save inference model of UNIMOText. 
", + ) + parser.add_argument("--topk", default=4, type=int, help="The number of candidate to procedure top_k sampling. ") + parser.add_argument( + "--topp", default=1.0, type=float, help="The probability threshold to procedure top_p sampling. " + ) + parser.add_argument("--max_out_len", default=64, type=int, help="Maximum output length. ") + parser.add_argument("--min_out_len", default=1, type=int, help="Minimum output length. ") + parser.add_argument("--num_return_sequence", default=1, type=int, help="The number of returned sequence. ") + parser.add_argument("--temperature", default=1.0, type=float, help="The temperature to set. ") + parser.add_argument("--num_return_sequences", default=1, type=int, help="The number of returned sequences. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + parser.add_argument( + "--decoding_strategy", + default="beam_search", + choices=["sampling", "beam_search"], + type=str, + help="The main strategy to decode. ", + ) + parser.add_argument("--num_beams", default=4, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument( + "--diversity_rate", default=0.0, type=float, help="The diversity rate to procedure beam search. " + ) + + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + model_name_or_path = args.model_name_or_path + model = UNIMOLMHeadModel.from_pretrained(model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(model_name_or_path) + + unimo_text = FasterUNIMOText(model=model, use_fp16_decoding=args.use_fp16_decoding, trans_out=True) + + # Set evaluate mode + unimo_text.eval() + + # Convert dygraph model to static graph model + unimo_text = paddle.jit.to_static( + unimo_text, + input_spec=[ + # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # token_type_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + # attention_mask + paddle.static.InputSpec(shape=[None, 1, None, None], dtype="float32"), + # seq_len + paddle.static.InputSpec(shape=[None], dtype="int64"), + args.max_out_len, + args.min_out_len, + args.topk, + args.topp, + args.num_beams, # num_beams. Used for beam_search. + args.decoding_strategy, + tokenizer.cls_token_id, # cls/bos + tokenizer.mask_token_id, # mask/eos + tokenizer.pad_token_id, # pad + args.diversity_rate, # diversity rate. Used for beam search. + args.temperature, + args.num_return_sequences, + ], + ) + + # Save converted static graph model + paddle.jit.save(unimo_text, os.path.join(args.inference_model_dir, "unimo_text")) + logger.info("UNIMOText has been saved to {}.".format(args.inference_model_dir)) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + + do_predict(args) diff --git a/examples/text_summarization/unimo-text/export_model.sh b/examples/text_summarization/unimo-text/export_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..6a51ec14e0df28b10f50e8d5c9717c63aad7294c --- /dev/null +++ b/examples/text_summarization/unimo-text/export_model.sh @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python export_model.py \ + --model_name_or_path unimo-text-1.0-summary \ + --decoding_strategy beam_search \ + --inference_model_dir ./inference_model \ + --max_out_len 30 \ \ No newline at end of file diff --git a/examples/text_summarization/unimo-text/train.py b/examples/text_summarization/unimo-text/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e6f3d0429df06fd9ed95541551d5aa2e31f51bdd --- /dev/null +++ b/examples/text_summarization/unimo-text/train.py @@ -0,0 +1,286 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import math +import os +import time + +import paddle +import paddle.distributed as dist +import paddle.nn.functional as F +from paddle.optimizer import AdamW +from utils import compute_metrics, create_data_loader, print_args, select_sum, set_seed + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + LinearDecayWithWarmup, + UNIMOLMHeadModel, + UNIMOTokenizer, +) + + +def parse_args(): + parser = argparse.ArgumentParser(__doc__) + parser.add_argument( + "--model_name_or_path", + type=str, + default="unimo-text-1.0-summary", + help="The path or shortcut name of the pre-trained model.", + ) + parser.add_argument("--train_file", type=str, required=False, default=None, help="Train data path.") + parser.add_argument("--eval_file", type=str, required=False, default=None, help="Eval data path.") + parser.add_argument( + "--save_dir", type=str, default="./checkpoints", help="The directory where the checkpoints will be saved." 
+ ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") + parser.add_argument("--batch_size", type=int, default=16, help="Batch size per GPU/CPU for training.") + parser.add_argument("--learning_rate", type=float, default=5e-5, help="The initial learning rate.") + parser.add_argument("--weight_decay", type=float, default=0.01, help="The weight decay for optimizer.") + parser.add_argument("--epochs", type=int, default=3, help="Total number of training epochs to perform.") + parser.add_argument("--warmup_proportion", type=float, default=0.02, help="The number of warmup steps.") + parser.add_argument("--max_grad_norm", type=float, default=1.0, help="The max value of grad norm.") + parser.add_argument("--beta1", type=float, default=0.9, help="beta1") + parser.add_argument("--beta2", type=float, default=0.98, help="beta2") + parser.add_argument("--epsilon", type=float, default=1e-6, help="epsilon") + parser.add_argument("--max_seq_len", type=int, default=512, help="The maximum sequence length of training.") + parser.add_argument("--max_dec_len", type=int, default=20, help="The maximum sequence length of decoding.") + parser.add_argument("--min_dec_len", type=int, default=3, help="The minimal sequence length of decoding.") + parser.add_argument( + "--max_target_len", type=int, default=30, help="The maximum target sequence length of training." + ) + parser.add_argument( + "--num_return_sequences", + type=int, + default=1, + help="The numbers of returned sequences for one input in generation.", + ) + parser.add_argument( + "--decode_strategy", type=str, default="beam_search", help="The decode strategy in generation." + ) + parser.add_argument( + "--top_k", + type=int, + default=0, + help="The number of highest probability vocabulary tokens to keep for top-k sampling.", + ) + parser.add_argument( + "--temperature", type=float, default=1.0, help="The value used to module the next token probabilities." + ) + parser.add_argument("--top_p", type=float, default=1.0, help="The cumulative probability for top-p sampling.") + parser.add_argument("--num_beams", type=int, default=6, help="The number of beams for beam search.") + parser.add_argument( + "--length_penalty", + type=float, + default=1.2, + help="The exponential penalty to the sequence length for beam search.", + ) + parser.add_argument("--device", type=str, default="gpu", help="The device to select for training the model.") + parser.add_argument( + "--output_path", type=str, default="./predict.txt", help="The file path where the infer result will be saved." + ) + parser.add_argument("--do_train", action="store_true", help="Whether to train the model.") + parser.add_argument("--do_eval", action="store_true", help="Whether to eval and predict.") + parser.add_argument("--use_amp", action="store_true", help="Enable mixed precision training.") + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + args = parser.parse_args() + return args + + +def save_ckpt(model, tokenizer, save_dir, name): + output_dir = os.path.join(save_dir, "model_{}".format(name)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +def read_file(file): + with open(file, "r", encoding="utf-8") as f: + for line in f.readlines(): + line = line.strip() + if not line: + continue + line = json.loads(line) + yield line + + +def run(args): + paddle.set_device(args.device) + world_size = dist.get_world_size() + + if world_size > 1: + dist.init_parallel_env() + set_seed(args.seed) + + model = UNIMOLMHeadModel.from_pretrained(args.model_name_or_path) + tokenizer = UNIMOTokenizer.from_pretrained(args.model_name_or_path) + + if world_size > 1: + model = paddle.DataParallel(model) + + if args.do_train: + train_ds = load_dataset(read_file, file=args.train_file, lazy=False) + dev_ds = load_dataset(read_file, file=args.eval_file, lazy=False) + + train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + if args.max_steps > 0: + num_training_steps = args.max_steps + num_train_epochs = math.ceil(num_training_steps / len(train_data_loader)) + else: + num_training_steps = len(train_data_loader) * args.epochs + num_train_epochs = args.epochs + + print(f"num_training_steps: {num_training_steps}, num_train_epochs: {num_train_epochs}") + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
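+    # Concretely, decay_params below keeps the names of all parameters whose
+    # name contains neither "bias" nor "norm", and apply_decay_param_fun makes
+    # AdamW apply weight decay only to the parameters in that list.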
+ + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.epsilon, + apply_decay_param_fun=lambda x: x in decay_params, + grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), + ) + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + step = 0 + total_time = 0.0 + for epoch in range(num_train_epochs): + print("\nEpoch %d/%d" % (epoch + 1, num_train_epochs)) + batch_start_time = time.time() + for inputs in train_data_loader: + step += 1 + labels = inputs[-1] + with paddle.amp.auto_cast( + args.use_amp, custom_white_list=["layer_norm", "softmax", "gelu"], level="O1" + ): + logits = model(*inputs[:-1]) + labels = paddle.nn.functional.one_hot(labels, num_classes=logits.shape[-1]) + labels = paddle.nn.functional.label_smooth(labels) + loss = F.cross_entropy(logits, labels, soft_label=True) + if args.use_amp: + scaled_loss = scaler.scale(loss) + scaled_loss.backward() + scaler.step(optimizer) + scaler.update() + optimizer.clear_grad(set_to_zero=False) + else: + loss.backward() + optimizer.step() + optimizer.clear_grad() + lr_scheduler.step() + total_time += time.time() - batch_start_time + if step % args.logging_steps == 0: + ppl = paddle.exp(loss) + print( + "epoch %d - step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" + % (epoch, step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) + ) + total_time = 0.0 + + if step % args.save_steps == 0 or step == num_training_steps: + if dist.get_rank() == 0: + save_ckpt(model, tokenizer, args.save_dir, step) + print("Saved step {} model.\n".format(step)) + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + batch_start_time = time.time() + if step >= num_training_steps: + break + if step >= num_training_steps: + break + + print("\nTraining completed.") + elif args.do_eval: + dev_ds = load_dataset(read_file, file=args.eval_file, lazy=False) + dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "test") + + model_eval = model._layers if isinstance(model, paddle.DataParallel) else model + evaluation(model_eval, dev_data_loader, args, tokenizer) + + +@paddle.no_grad() +def evaluation(model, data_loader, args, tokenizer): + print("\nEval begin...") + model.eval() + pred_ref = [] + total_time = 0.0 + start_time = time.time() + for step, inputs in enumerate(data_loader, 1): + input_ids, token_type_ids, position_ids, attention_mask = inputs + ids, scores = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, + num_return_sequences=args.num_return_sequences, + bos_token_id=tokenizer.cls_token_id, + eos_token_id=tokenizer.mask_token_id, + ) + + total_time += time.time() - start_time + if step % args.logging_steps == 0: + print("eval step %d - %.3fs/step" % (step, total_time / args.logging_steps)) + total_time = 0.0 + + results = select_sum(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) + pred_ref.extend(results) + start_time = 
time.time() + + with open(args.output_path, "w", encoding="utf-8") as fout: + for ref in pred_ref: + fout.write(ref + "\n") + + print("\nSave inference result into: %s" % args.output_path) + + if "title" in data_loader.dataset[0].keys(): + targets = [example["title"] for example in data_loader.dataset] + compute_metrics(pred_ref, targets) + + model.train() + return + + +if __name__ == "__main__": + args = parse_args() + print_args(args) + run(args) diff --git a/examples/text_summarization/unimo-text/train.sh b/examples/text_summarization/unimo-text/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..302b66ede711f38d58f3b40c7e57e52f4262ba46 --- /dev/null +++ b/examples/text_summarization/unimo-text/train.sh @@ -0,0 +1,40 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡 +unset CUDA_VISIBLE_DEVICES + +log_dir=output +rm -rf ${log_dir} +mkdir -p ${log_dir} + +python -m paddle.distributed.launch --gpus "0,1,2,3" --log_dir ${log_dir} train.py \ + --model_name_or_path=unimo-text-1.0-summary \ + --train_file train.json \ + --eval_file test.json \ + --save_dir=${log_dir}/checkpoints \ + --logging_steps=100 \ + --save_steps=10000 \ + --epochs=10 \ + --batch_size=32 \ + --learning_rate=5e-5 \ + --warmup_proportion=0.02 \ + --weight_decay=0.01 \ + --max_seq_len=512 \ + --max_target_len=60 \ + --max_dec_len=20 \ + --min_dec_len=3 \ + --do_train \ + --do_eval \ + --device=gpu \ diff --git a/examples/text_summarization/unimo-text/utils.py b/examples/text_summarization/unimo-text/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..5542758ff8ccb984cf73fd6edca0bda1a8322f90 --- /dev/null +++ b/examples/text_summarization/unimo-text/utils.py @@ -0,0 +1,217 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
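+
+# Module overview: shared helpers for the UNIMO-text summarization example.
+# convert_example builds features with tokenizer.gen_encode (adding
+# masked_positions/labels in train mode), batchify_fn left-pads a batch
+# (Pad with pad_right=False) and expands the attention mask to
+# [batch_size, 1, max_len, max_len], compute_metrics reports ROUGE-1/2/L and
+# BLEU-4, and select_sum keeps the best-scoring decoded sequence per input.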
+ +import random +from functools import partial + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from rouge import Rouge + +from paddlenlp.data import Pad +from paddlenlp.metrics import BLEU + + +def print_args(args): + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. + paddle.seed(seed + dist.get_rank()) + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + print("\n" + "*" * 15) + print("The auto evaluation result is:") + print("rouge-1:", round(rouge1, 4)) + print("rouge-2:", round(rouge2, 4)) + print("rouge-L:", round(rougel, 4)) + print("BLEU-4:", round(bleu4.score(), 4)) + + +def convert_example(example, tokenizer, max_seq_len=512, max_target_len=128, mode="train"): + """Convert all examples into necessary features.""" + source = example["content"] + if mode != "test": + tokenized_example = tokenizer.gen_encode( + source, + target=example["title"], + max_seq_len=max_seq_len, + max_target_len=max_target_len, + return_position_ids=True, + return_length=True, + ) + target_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) + target_end = tokenized_example["seq_len"] + # Use to gather the logits corresponding to the labels during training + tokenized_example["masked_positions"] = list(range(target_start, target_end - 1)) + tokenized_example["labels"] = tokenized_example["input_ids"][target_start + 1 : target_end] + + return tokenized_example + else: + tokenized_example = tokenizer.gen_encode( + source, max_seq_len=max_seq_len, add_start_token_for_decoding=True, return_position_ids=True + ) + + if "title" in example and example["title"]: + tokenized_example["title"] = example["title"] + return tokenized_example + + +def batchify_fn(batch_examples, pad_val, mode): + def pad_mask(batch_attention_mask): + batch_size = len(batch_attention_mask) + max_len = max(map(len, batch_attention_mask)) + attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 + for i, mask_data in enumerate(attention_mask): + seq_len = len(batch_attention_mask[i]) + mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") + # In order to ensure the correct broadcasting mechanism, expand one + # dimension to the second dimension (n_head of Transformer). 
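+        # Rows/columns that keep the -1e9 fill value correspond to left-padding
+        # (Pad below is created with pad_right=False); that large negative bias
+        # effectively removes the padded positions from the attention softmax.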
+ attention_mask = np.expand_dims(attention_mask, axis=1) + return attention_mask + + pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") + + input_ids = pad_func([example["input_ids"] for example in batch_examples]) + token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) + position_ids = pad_func([example["position_ids"] for example in batch_examples]) + + attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) + + if mode != "test": + max_len = max([example["seq_len"] for example in batch_examples]) + masked_positions = np.concatenate( + [ + np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len + for i, example in enumerate(batch_examples) + ] + ) + labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) + return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels + else: + return input_ids, token_type_ids, position_ids, attention_mask + + +def create_data_loader(dataset, tokenizer, args, mode): + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_len=args.max_seq_len, + max_target_len=args.max_target_len, + mode=mode, + ) + dataset = dataset.map(trans_func, lazy=True) + if mode == "train": + batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) + else: + batch_sampler = BatchSampler(dataset, batch_size=args.batch_size // 2, shuffle=False) + collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) + data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) + return dataset, data_loader + + +def post_process_sum(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + special_tokens = ["[UNK]"] + tokens = [token for token in tokens if token not in special_tokens] + return token_ids, tokens + + +def select_sum(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1): + results = [] + group = [] + tmp = [] + if scores is not None: + ids = ids.numpy() + scores = scores.numpy() + + if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: + raise ValueError( + "the length of `ids` is {}, but the `num_return_sequences` is {}".format( + len(ids), num_return_sequences + ) + ) + + for pred, score in zip(ids, scores): + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + + target = "".join(pred_tokens) + + # not ending + if max_dec_len is not None and num_token >= max_dec_len: + score -= 1e3 + + tmp.append([target, score]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + preds = sorted(preds, key=lambda x: -x[1]) + results.append(preds[0][0]) + else: + ids = ids.numpy() + + for pred in ids: + pred_token_ids, pred_tokens = post_process_sum(pred, tokenizer) + num_token = len(pred_token_ids) + response = "".join(pred_tokens) + + # TODO: Support return scores in FT. 
+ tmp.append([response]) + if len(tmp) == num_return_sequences: + group.append(tmp) + tmp = [] + + for preds in group: + results.append(preds[0][0]) + + return results diff --git a/examples/text_to_knowledge/README.md b/examples/text_to_knowledge/README.md new file mode 100644 index 0000000000000000000000000000000000000000..39d41248a007b44f4aeee417436f6a75bd569e77 --- /dev/null +++ b/examples/text_to_knowledge/README.md @@ -0,0 +1,168 @@ +# 解语(Text to Knowledge) + +[解语官网](https://www.paddlepaddle.org.cn/textToKnowledge) + +解语(Text to Knowledge)是首个覆盖中文全词类的知识库(百科知识树)及知识标注与挖掘框架,拥有可描述所有中文词汇的词类体系、中文知识标注工具集,以及更适用于中文挖掘任务的预训练语言模型。 + + +覆盖中文全词类的知识库和知识标注工具能够帮助你面对更加多元的应用场景,方便地融合自有知识体系,显著提升中文文本解析和挖掘效果,并能够更容易地利用知识增强机器学习模型效果。解语经过大规模工业应用验证,在实际业务中取得了良好的应用效果,适合通用领域中文文本理解任务。 + +image + + +**解语由以下三部分构成:** + +- [百科知识树(TermTree)](./termtree) :包括能够描述所有中文词汇的TermType词类体系,以及Term关系和属性值。 +- 中文知识标注工具集:包括[词类知识标注工具(WordTag)](./wordtag) 和[名词短语标注工具(NPTag)](./nptag),[适用于中文文本挖掘的预训练语言模型(ERNIE-CTM)](./ernie-ctm),为中文文本解析提供词类序列标注框架,结合百科知识树可实现定制化词类序列标注。 +- 中文知识挖掘方案:包括[知识模板挖掘工具](./wordtag-ie),旨在提供灵活可配置,可快速定制的中文知识挖掘方案。 + +**本次发布的解语开源试用版包括:** + +- 百科知识树(TermTree)V1.0试用版:包括简化版的TermType词类体系,和约100w的term集。 +- 中文词类知识标注工具(WordTag)V1.0版。 +- 名词短语标注工具(NPTag)V1.0版。 +- 中文预训练语言模型(ERNIE-CTM)V1.0版。 + + +---- + +## 解语的应用场景 + +解语可直接用于各类中文文本解析与挖掘任务,提升文本解析与挖掘精度;也可以作为中文文本特征生成器,为各类机器学习模型提供文本特征。 + +中文词类知识标注工具(WordTag)整合了传统中文解析的**分词**、**词性标注**、**命名实体识别**的能力,能够将任意中文句子解析为**完整的词类序列**。结合百科知识树(TermTree),可为应用提供一套通用的知识关联(term-linking)框架,方便应用适配关联自己的应用知识图谱,更好地将知识用于中文自然语言处理(NLP)任务。 + +![解语示例](doc/img/text_to_knowledge_example.png) + + +### 应用场景A:文本挖掘/解析模板生成与匹配 + +虽然近年来,深度学习模型尤其是预训练语言模型的广泛使用大幅提升了各项中文NLP任务效果,但在实际的工业应用中,单独使用深度学习模型往往达不到应用需求,还需要结合规则模型以提升精度以及解决恶劣case,如,知识图谱构建、query解析、语义一致性判定等应用。 + +在这些应用中,文本挖掘/解析模板是最常用的规则模型。WordTag包含了覆盖中文所有词汇的词类标注体系,在生成模板以及模板匹配上有着天然的优势。用户可以根据WordTag标注的样本词类序列,自动生成或配置更加丰富、精准的挖掘/解析模板,然后对目标文本使用WordTag标注,即可利用模板进行匹配,从而大大降低人工配置模板的代价,显著提升生产效率。 + +例如,输入文本:*美人鱼是周星驰执导的电影*,得到预测结果: + +```json +{ + "text": "美人鱼是周星驰执导的电影", + "items": [ + { + "item": "美人鱼", + "offset": 0, + "wordtag_label": "作品类_实体", + "length": 3, + "termid": "作品与出版物_eb_美人鱼" + }, + { + "item": "是", + "offset": 3, + "wordtag_label": "肯定词", + "length": 1, + "termid": "肯定否定词_cb_是" + }, + { + "item": "周星驰", + "offset": 4, + "wordtag_label": "人物类_实体", + "length": 3, + "termid": "人物_eb_周星驰" + }, + { + "item": "执导", + "offset": 7, + "wordtag_label": "场景事件", + "length": 2, + "termid": "场景事件_cb_执导" + }, + { + "item": "的", + "offset": 9, + "wordtag_label": "助词", + "length": 1, + "termid": "助词_cb_的" + }, + { + "item": "电影", + "offset": 10, + "wordtag_label": "作品类_概念", + "length": 2, + "termid": "影视作品_cb_电影" + } + ] +} +``` + +将上述标注结果中的词类序列取出,去除虚词、标点等与语义无关的词,可将抽取出的词类直接构造成为挖掘匹配模板: + +``` +[作品类_实体][肯定词|是][人物类_实体][场景事件|执导][作品类_概念|电影] +``` + +利用该模板,以及结合TermTree进行概念扩展,可以匹配出所有该句式的文本,例如: + +> 《狂人日记》是鲁迅创作的第一个短篇白话日记体小说 +> +> 《澳门风云》是王晶创作执导的合家欢贺岁喜剧赌片 +> +> 《千王之王2000》是一部王晶于1999年执导的喜剧电影 +> +> 《射雕英雄传》是金庸创作的长篇武侠小说 + +WordTag的标注结果中,区分了“人物类\_实体”和“人物类\_概念”,以及“作品类\_实体”和“作品类\_概念”,使得模板生成更为精准。同时,TermTree中也区分了命名实体词(eb: entity base)与非实体词(cb: concept base),这样,可以利用TermTree分别进行实体扩展(e.g., 周星驰->王晶)和概念扩展(e.g., 电影->小说),生成更加丰富多样的模板,支持更细化的应用场景。 + +### 应用场景B:词类知识增强的深度学习模型 + +词类特征同时也是一类重要的文本特征,可为原始文本token提供有效的边界信息、归组信息,减少样本中的噪音,防止模型过拟合;还可作为层次泛化特征,弥补统计共现特征的不足。 + +在深度学习模型应用中,可将WordTag产出的词类作为embedding特征,直接叠加到文本token上,作为深度学习模型的输入;在BERT等模型中,也可以将词类作为文本序列中的一部分,利用position id和可见性矩阵控制token和词类特征之间的可见性,作为深度学习模型的输入。 + +### 应用场景C:知识图谱关联(term-linking) + 
+随着知识图谱技术的普及和越来越多应用知识图谱数据的发布,如何利用知识提升NLP任务效果,成为近年来NLP研究的热点方向。文本与图谱知识结合的前提是将图谱中的实体准确link到文本上,这是知识图谱应用的一大难点。现有的方案多是基于某个特定图谱实现的,缺乏通用的图谱关联解决方案。我们尝试使用“**WordTag+TermTree**”提供一套通用的图谱关联(term-linking)技术框架。 + +**NOTE:** 为了避免歧义,我们 **用term统一指代图谱收录的各类实体、概念、术语**。 + +为了能够适配应用中的不同实体集(例如,不同的企业有不同的人物实体集合,不同的小说站有不同的小说实体集合),我们将term-linking拆分为两个步骤: + +- 第一步是基于词类的linking,主要解决“同名概念词/实体词”、“不同类的同名词”消歧问题,这一步只使用文本本身特征和词类特征,不使用图谱中的实体属性值(SPO)知识,从而支持切换不同应用图谱; +- 第二步是同类同名实体词的linking,主要解决同类下不同属性值的实体消歧问题,这一步需要使用实体词的SPO知识(一般用于实体特征表示计算,以及文本-实体相似度计算)。 + +“WordTag+TermTree”的开源版提供了第一步的解决示例,第二步由于依赖于特定图谱的SPO知识,暂时无法提供通用工具,未来可能提供通用解决方案。 + +### 应用场景D:文本分类和文本挖掘样本优化 + +工业NLP应用场景中,文本分类、文本挖掘是最常见的任务。虽然,预训练语言模型的技术进步大幅提升了小样本学习的效果,但要达到理想的工业应用效果,还是需要大规模高精度监督训练样本。 + +人工标注可以产出高精度小规模训练样本。半监督学习等技术可以帮助用户基于人工标准样本快速扩充样本规模,但无法保证样本精度。这种情况下,可以使用“WordTag+TermTree”辅助筛选和修正样本,提升样本精度,例如: + +- 使用WordTag产出样本模板,再利用TermTree进行泛化约束,筛选出高置信度的样本,或者过滤不合格的样本; + +- 利用词类关系检测类别与样本的一致性,比如,医疗类文本与“疾病损伤、药物、医疗卫生机构”等词类相关,可以利用TermTree知识筛选出该类别高置信度的样本。 + +此外,统计模型容易拟合高频term,导致在低频term上泛化效果不好,这时可以利用TermTree筛选样本,提升样本平衡性,从而提升模型泛化能力。 + +## 后续计划 + +1. 发布百科知识树(TermTree)正式版数据,建立知识共建社区,支持用户提交应用词表/应用图谱 & 定制化TermTree, [TermTree下载链接](https://kg-concept.bj.bcebos.com/TermTree/TermTree.V1.0.tar.gz); +2. 持续优化ERNIE-CTM预训练模型,支持多种参数规模模型发布,探索更好的适配中文解析挖掘任务的预训练模型; +3. 持续优化中文文本知识标注工具集,提供更加精准的知识标注服务;发布多粒度标注工具,支持更加丰富的应用场景。 + +## 在论文中引用解语 + +如果您的工作成果中使用了解语,请增加下述引用。我们非常乐于看到解语对您的工作带来帮助。 + +``` +@article{zhao2020TermTree, + title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, + author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, + technical report={Baidu, Inc. TR:2020-KG-TermTree}, + year={2020} +} +``` + + + +## 问题与反馈 + +解语在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/doc/img/ernie_ctm_inputs.png b/examples/text_to_knowledge/doc/img/ernie_ctm_inputs.png new file mode 100644 index 0000000000000000000000000000000000000000..f5ec1073b759d83cd1b9e4aa1eeb70c09e4b697f Binary files /dev/null and b/examples/text_to_knowledge/doc/img/ernie_ctm_inputs.png differ diff --git a/examples/text_to_knowledge/doc/img/ernie_ctm_model.png b/examples/text_to_knowledge/doc/img/ernie_ctm_model.png new file mode 100644 index 0000000000000000000000000000000000000000..2d886e91f593140f1d0888d8cd987b2d33800408 Binary files /dev/null and b/examples/text_to_knowledge/doc/img/ernie_ctm_model.png differ diff --git a/examples/text_to_knowledge/doc/img/text_to_knowledge.png b/examples/text_to_knowledge/doc/img/text_to_knowledge.png new file mode 100644 index 0000000000000000000000000000000000000000..2a158a0b256db3246ec5f88b90636dc77c7a4083 Binary files /dev/null and b/examples/text_to_knowledge/doc/img/text_to_knowledge.png differ diff --git a/examples/text_to_knowledge/doc/img/text_to_knowledge_example.png b/examples/text_to_knowledge/doc/img/text_to_knowledge_example.png new file mode 100644 index 0000000000000000000000000000000000000000..bf2e2212268bab759100e5365249e174653e0563 Binary files /dev/null and b/examples/text_to_knowledge/doc/img/text_to_knowledge_example.png differ diff --git a/examples/text_to_knowledge/doc/img/wordtag_example.png b/examples/text_to_knowledge/doc/img/wordtag_example.png new file mode 100644 index 0000000000000000000000000000000000000000..b415962dda24c32c8b0583cd1b15693168543251 Binary files /dev/null and b/examples/text_to_knowledge/doc/img/wordtag_example.png differ diff --git a/examples/text_to_knowledge/doc/img/wordtag_model.png b/examples/text_to_knowledge/doc/img/wordtag_model.png new file mode 100644 
index 0000000000000000000000000000000000000000..705e9b7d05a0eb59f9e8d3941e5a2ebcd3018f2c Binary files /dev/null and b/examples/text_to_knowledge/doc/img/wordtag_model.png differ diff --git a/examples/text_to_knowledge/ernie-ctm/README.md b/examples/text_to_knowledge/ernie-ctm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1e6a85355c606b20325233bb1efb94f0b84b7b89 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/README.md @@ -0,0 +1,167 @@ + +# 解语:ERNIE-CTM(ERNIE for **Chinese Text Mining**) + +ERNIE-CTM是适用于中文文本挖掘任务的预训练语言模型,拥有更全面的汉字字表集合,更优的中文文本挖掘任务表现,与PaddleNLP深度结合,提供更加便捷的应用实践。 + +## ERNIE-CTM特点 + +- 全面的中文汉字字表扩充 + - ERNIE-CTM的字符集包含2万+汉字,以及中文常用符号(常用标点、汉语拼音、编号)、部分外语符号(假名、单位)等,大幅减少中文解析挖掘任务中UNK(未识别字符)引发的标注问题。同时,ERNIE-CTM使用了embedding分解,可以更加灵活地扩充应用字表。 +- 更加适配中文文本挖掘任务 + - ERNIE-CTM中在每个表示后面添加了全局信息,在序列特征上叠加了全局的信息,使得在文本挖掘任务上有更加强力的表现。 +- 支持多种特征训练的模型结构 + - ERNIE-CTM的模型结构中,支持多种特征训练,用户可按照自己的需求任意添加任务及对应特征训练模型,而无需考虑任务之间的冲突所造成的灾难性遗忘。 + + + +## ERNIE-CTM模型介绍 + +### 模型结构 + +ERNIE-CTM的模型结构大体与BERT相同,都是双向transformer结构。区别是,ERNIE-CTM为能灵活扩充字表,采用了ALBERT的embedding分解,将embedding层分解为128维,参数列表如下: + +| 模型 | embedding size | hidden size | hidden layers | vocab size | +| -------------- | -------------- | ----------- | ------------- | ---------- | +| ERNIE-CTM-base | 128 | 768 | 12 | 23000 | + +ERNIE-CTM以字粒度建模,英文区分大小写,其输入表示如下: + +![ERNIE-CTM输入](../doc/img/ernie_ctm_inputs.png) + +其中,`[CLS{n}]`是ERNIE-CTM预留出的全局观察位,其中`n`从0开始计数,该全局观察位用于不同的训练任务,建模不同的语义特征,在下游任务中,可以结合使用,如使用attention筛选/融合特征,以达到更好的效果。而在灵活使用`[CLS{n}]`的时候,为中途增减任务token时不影响文本输入,所有的`[CLS{n}]`的位置编码均为0,且可以使用可见性矩阵(visible matrix)控制`[CLS{n}]`位置的特征对序列中其他位置,以及其他的全局观察位的可见性,以获得更加灵活、独立的特征表示。 + +本次开源的ERNIE-CTM-base模型中,使用了两个全局观察位`[CLS0]`和`[CLS1]`,具体作用见下文预训练任务介绍。 + +### 预训练任务 + +ERNIE-CTM使用的预训练任务为掩码语言模型(Masked Language Model,MLM)及ALBERT所使用的句子顺序预测(Sentence Order Prediction,SOP)。 + +其中`[CLS0]`用于训练SOP任务,训练方式如ALBERT中描述,正例为同一篇文章中的两个连续的句子,负例为用一篇文章中两个连续的句子顺序翻转。 + +`[CLS1]`做为全局的监督信号,应用于MLM任务中。训练MLM任务前,将`[CLS1]`特征表示拼接在所有的序列表示之后,通过线性层融合,成为最终的序列表示,之后预测MLM任务。所以,ERNIE-CTM最终输出的文本序列表示中,都融合了`[CLS1]`的特征表示。最终的序列表示中,带有全句的特征,一定程度可避免序列中全局特征捕捉不足,同时,`[CLS1]`最终的表示中也充分融合了句子内容的信息,弥补了SOP任务对文本主题信息捕捉不足的缺陷。 + +![ERNIE-CTM总体结构](../doc/img/ernie_ctm_model.png) + +### WordTag增量训练 + +在Ernie-Ctm微调任务中我们提供了一个基于[WordTag](../wordtag)的百科知识标注任务,该任务旨在解析中文词汇的知识标注,在该词性体系中覆盖了所有中文词汇的词类体系,包括各类实体词与非实体词(如概念、实体/专名、语法词等)。除了使用已有的WordTag工具对通用中文文本进行词类知识标注,WordTag同样支持用户使用自己的数据进行增量训练,下面是在WordTag模型上进行增量训练的具体示例流程。 + +#### 代码结构说明 + +```text +wordtag/ +├── data.py # 训练数据处理脚本 +├── metric.py # 模型效果验证指标脚本 +├── predict.py # 预测脚本 +├── README.md # 使用说明 +├── train.py # 训练脚本 +└── utils.py # 工具函数 +``` + +#### 数据准备 + +我们提供了少数样本用以示例增量训练。执行以下命令,下载并解压示例数据集: + +```bash +wget https://bj.bcebos.com/paddlenlp/datasets/wordtag_dataset_v3.tar.gz && tar -zxvf wordtag_dataset_v3.tar.gz +``` +解压之后 + +```text +data/ +├── dev.txt # 验证集 +├── tags.txt # WordTag标签集合 +└── train.json # 训练数据 +``` + +训练样本示例如下,每个单词以"/type"的形式标记其词性或实体类别,单词之间使用空格作为切分标记 + +```text +砚台/物体类 与/连词 笔/物体类 、/w 墨/物体类 、/w 纸/物体类 是/肯定词 中国/世界地区类 传统/修饰词 的/助词 文房四宝/词汇用语 。/w +《/w 全球化与中国:理论与发展趋势/作品类_实体 》/w 是/肯定词 2010年/时间类 经济管理出版社/组织机构类 出版/场景事件 的/助词 图书/作品类_概念 ,/w 作者/人物类_概念 是/肯定词 余永定/人物类_实体 、/w 路爱国/人物类_实体 、/w 高海红/人物类_实体 。/w +``` + +#### 模型训练 + +```shell +python -m paddle.distributed.launch --gpus "0" train.py \ + --max_seq_len 128 \ + --batch_size 32 \ + --learning_rate 5e-5 \ + --num_train_epochs 3 \ + --logging_steps 10 \ + --save_steps 100 \ + --output_dir ./output \ + --device "gpu" +``` + +其中参数释义如下: +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 
表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 + + + +### 模型预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -m paddle.distributed.launch --gpus "0" predict.py \ + --params_path ./output/model_300/model_state.pdparams \ + --batch_size 32 \ + --device "gpu" +``` + +## 自定义模型一键预测 + +Taskflow支持加载增量训练后的模型进行一键预测,通过`task_path`定义用户自定义路径即可。 + +文件组成: +```text +custom_task_path/ +├── model_state.pdparams +├── model_config.json +└── tags.txt +``` + +```python +from paddlenlp import Taskflow + +my_wordtag = Taskflow("knowledge_mining", task_path="./custom_task_path/") + +my_wordtag("美人鱼是周星驰执导的一部电影") +# [{'text': '美人鱼是周星驰执导的一部电影', 'items': [{'item': '美人鱼', 'offset': 0, 'wordtag_label': '作品类_实体', 'length': 3, 'termid': '作品与出版物_eb_美人鱼'}, {'item': '是', 'offset': 3, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '周星驰', 'offset': 4, 'wordtag_label': '人物类_实体', 'length': 3, 'termid': '人物_eb_周星驰'}, {'item': '执导', 'offset': 7, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_执导'}, {'item': '的', 'offset': 9, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '一部', 'offset': 10, 'wordtag_label': '数量词', 'length': 2}, {'item': '电影', 'offset': 12, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '影视作品_cb_电影'}]}] +``` + + +## ERNIE-CTM后续计划 + + +1. 提升预训练语料的多样性(开源版主要使用了百度百科语料),持续优化预训练模型 +2. 发布其他参数量的预训练模型(tiny、large等),便于不同场景应用 +3. 维护开源社区,探索模型优化方向,整合优秀idea + + + +## 在论文中引用ERNIE-CTM + +如果您的工作成果中使用了ERNIE-CTM,请增加下述引用。我们非常乐于看到ERNIE-CTM对您的工作带来帮助。 +``` +@article{zhao2020TermTree, + title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, + author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, + technical report={Baidu, Inc. TR:2020-KG-TermTree}, + year={2020} +} +``` + + + +## 问题与反馈 + +ERNIE-CTM在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/ernie-ctm/data_process.py b/examples/text_to_knowledge/ernie-ctm/data_process.py new file mode 100644 index 0000000000000000000000000000000000000000..f40243dabb7064db449266023dc6cd6eb8d1eca5 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/data_process.py @@ -0,0 +1,93 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
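+
+# Module overview: data helpers for WordTag incremental training.
+# read_custom_data reads the space-separated "词/词类" sample format and
+# transfer_str_to_example expands every word into character-level S-/B-/I-/E-
+# tags; convert_example then tokenizes the characters with ErnieCtmTokenizer
+# and pads the tag sequence with "O" for the leading summary observation tokens
+# and the trailing [SEP].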
+ +import paddle + + +def load_dict(dict_path): + vocab = {} + i = 0 + with open(dict_path, "r", encoding="utf-8") as fin: + for line in fin: + vocab[line.strip()] = i + i += 1 + return vocab + + +def convert_example(example, tokenizer, max_seq_len, tags_to_idx=None, summary_num=2, is_test=False): + tokens = example["tokens"] + tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token", max_seq_len=max_seq_len) + + if is_test: + return tokenized_input["input_ids"], tokenized_input["token_type_ids"], tokenized_input["seq_len"] + + tags = example["tags"] + if len(tokenized_input["input_ids"]) - 1 - summary_num < len(tags): + tags = tags[: len(tokenized_input["input_ids"]) - 1 - summary_num] + # '[CLS]' and '[SEP]' will get label 'O' + tags = ["O"] * (summary_num) + tags + ["O"] + tags += ["O"] * (len(tokenized_input["input_ids"]) - len(tags)) + tokenized_input["tags"] = [tags_to_idx[x] for x in tags] + return ( + tokenized_input["input_ids"], + tokenized_input["token_type_ids"], + tokenized_input["seq_len"], + tokenized_input["tags"], + ) + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_custom_data(filename): + """Reads data""" + with open(filename, "r", encoding="utf-8") as f: + for line in f: + example = transfer_str_to_example(line.strip()) + yield example + + +def transfer_str_to_example(sample): + text = "" + tags = [] + items = sample.split(" ") + items = [item.rsplit("/", 1) for item in items] + for w, t in items: + text += w + if len(w) == 1: + tags.append(f"S-{t}") + else: + l = len(w) + for j in range(l): + if j == 0: + tags.append(f"B-{t}") + elif j == l - 1: + tags.append(f"E-{t}") + else: + tags.append(f"I-{t}") + res = { + "tokens": list(text), + "tags": tags, + } + return res diff --git a/examples/text_to_knowledge/ernie-ctm/metric.py b/examples/text_to_knowledge/ernie-ctm/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..4341bd743a2774dbbf7664e9c329ee371dfcea22 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/metric.py @@ -0,0 +1,219 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import List, Tuple + +import paddle + + +class SequenceAccuracy(paddle.metric.Metric): + """ + Masked language model pre-train task accuracy. 
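+
+    In this fine-tuning example it is reused as token-level tagging accuracy:
+    compute() keeps only the positions whose gold label differs from
+    ignore_index (train.py passes the id of the "O" tag), and accumulate()
+    returns the fraction of those positions predicted correctly.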
+ """ + + def __init__(self): + super(SequenceAccuracy, self).__init__() + self.correct_k = 0 + self.total = 0 + + def compute(self, pred, label, ignore_index): + pred = paddle.argmax(pred, 1) + active_acc = label.reshape([-1]) != ignore_index + active_pred = pred.masked_select(active_acc) + active_labels = label.masked_select(active_acc) + correct = active_pred.equal(active_labels) + return correct + + def update(self, correct): + self.correct_k += correct.cast("float32").sum(0) + self.total += correct.shape[0] + + def reset(self): + self.correct_k = 0 + self.total = 0 + + def accumulate(self): + return float(self.correct_k) / self.total + + def name(self): + return "Masked Language Model Accuracy" + + +def wordseg_hard_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: + """ + Calculate extra metrics of word-seg + + Args: + list_a: prediction list + list_b: real list + + Returns: + acc: the extra accuracy + """ + p, q = 0, 0 + a_l, b_l = 0, 0 + acc = 0.0 + while q < len(list_b) and p < len(list_a): + a_r = a_l + len(list_a[p][0]) - 1 + b_r = b_l + len(list_b[q][0]) - 1 + if a_r < b_l: + p += 1 + a_l = a_r + 1 + continue + if b_r < a_l: + q += 1 + b_l = b_r + 1 + continue + if a_l == b_l and a_r == b_r: + acc += 1.0 + p += 1 + q += 1 + a_l = a_r + 1 + b_l = b_r + 1 + continue + p += 1 + return acc + + +def wordtag_hard_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: + """ + Calculate extra metrics of word-tag + + Args: + list_a: prediction list + list_b: real list + + Returns: + acc: the extra accuracy + """ + p, q = 0, 0 + a_l, b_l = 0, 0 + acc = 0.0 + while q < len(list_b) and p < len(list_a): + a_r = a_l + len(list_a[p][0]) - 1 + b_r = b_l + len(list_b[q][0]) - 1 + if a_r < b_l: + p += 1 + a_l = a_r + 1 + continue + if b_r < a_l: + q += 1 + b_l = b_r + 1 + continue + if a_l == b_l and a_r == b_r: + if list_a[p][-1] == list_b[q][-1]: + acc += 1.0 + p += 1 + q += 1 + a_l, b_l = a_r + 1, b_r + 1 + continue + p += 1 + return acc + + +def wordtag_soft_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: + """ + Calculate extra metrics of word-tag + + Args: + list_a: prediction list + list_b: real list + + Returns: + acc: the extra accuracy + """ + p, q = 0, 0 + a_l, b_l = 0, 0 + acc = 0.0 + while q < len(list_b) and p < len(list_a): + a_r = a_l + len(list_a[p][0]) - 1 + b_r = b_l + len(list_b[q][0]) - 1 + if a_r < b_l: + p += 1 + a_l = a_r + 1 + continue + if b_r < a_l: + q += 1 + b_l = b_r + 1 + continue + if a_l == b_l and a_r == b_r: + if list_a[p][-1] == list_b[q][-1]: + acc += 1.0 + elif list_b[q][-1].startswith(list_a[p][-1]): + acc += 1.0 + elif list_b[q] == "词汇用语": + acc += 1.0 + p += 1 + q += 1 + a_l, b_l = a_r + 1, b_r + 1 + continue + p += 1 + return acc + + +def wordseg_soft_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: + """ + Calculate extra metrics of word-seg + + Args: + list_a: prediction list + list_b: real list + + Returns: + acc: the extra accuracy + """ + i, j = 0, 0 + acc = 0.0 + a_l, b_l = 0, 0 + while i < len(list_a) and j < len(list_b): + a_r = a_l + len(list_a[i][0]) - 1 + b_r = b_l + len(list_b[j][0]) - 1 + if a_r < b_l: + i += 1 + a_l = a_r + 1 + continue + if b_r < a_l: + j += 1 + b_l = b_r + 1 + continue + if a_l == b_l and a_r == b_r: + acc += 1.0 + a_l, b_l = a_r + 1, b_r + 1 + i, j = i + 1, j + 1 + continue + if a_l == b_l and a_r < b_r: + cnt = 0.0 + tmp_a_r = a_r + for k in range(i + 1, len(list_a)): + tmp_a_r += len(list_a[k]) + cnt += 1.0 + if 
tmp_a_r == b_r: + acc += cnt + i, j = k + 1, j + 1 + a_l, b_l = tmp_a_r + 1, b_r + 1 + break + i += 1 + continue + if a_l == b_l and a_r > b_r: + tmp_b_r = b_r + for k in range(j + 1, len(list_b)): + tmp_b_r += len(list_b[k]) + if tmp_b_r == a_r: + acc += 1.0 + i, j = i + 1, k + 1 + a_l, b_l = a_r + 1, tmp_b_r + 1 + break + j += 1 + continue + i += 1 + return acc diff --git a/examples/text_to_knowledge/ernie-ctm/predict.py b/examples/text_to_knowledge/ernie-ctm/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..8620f9c393a039372f32d8e2fec5aebc5cb39268 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/predict.py @@ -0,0 +1,88 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from data_process import convert_example, load_dict +from utils import decode + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.transformers import ErnieCtmTokenizer, ErnieCtmWordtagModel + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, default="./output/model_300/model_state.pdparams", required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain name_category_map.json.") +parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', type=str, choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def do_predict(data, model, tokenizer, viterbi_decoder, tags_to_idx, idx_to_tags, batch_size=1, summary_num=2): + + examples = [] + for text in data: + example = {"tokens": list(text)} + input_ids, token_type_ids, seq_len = convert_example(example, tokenizer, args.max_seq_len, is_test=True) + + examples.append((input_ids, token_type_ids, seq_len)) + + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # seq_len + ): fn(samples) + + all_pred_tags = [] + + model.eval() + for batch in batches: + input_ids, token_type_ids, seq_len = batchify_fn(batch) + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + seq_len = paddle.to_tensor(seq_len) + pred_tags = model(input_ids, token_type_ids, lengths=seq_len)[0] + all_pred_tags.extend(pred_tags.numpy().tolist()) + results = decode(data, all_pred_tags, summary_num, idx_to_tags) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + data = [ + "美人鱼是周星驰执导的一部电影", + ] + + tags_to_idx = load_dict(os.path.join(args.data_dir, "tags.txt")) + idx_to_tags = dict(zip(*(tags_to_idx.values(), tags_to_idx.keys()))) + + model = ErnieCtmWordtagModel.from_pretrained("wordtag", num_tag=len(tags_to_idx)) + tokenizer = ErnieCtmTokenizer.from_pretrained("wordtag") + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + results = do_predict( + data, model, tokenizer, model.viterbi_decoder, tags_to_idx, idx_to_tags, batch_size=args.batch_size + ) + print(results) diff --git a/examples/text_to_knowledge/ernie-ctm/train.py b/examples/text_to_knowledge/ernie-ctm/train.py new file mode 100644 index 0000000000000000000000000000000000000000..a67b5d09486fa575f8b28a9d89a13fd6eb2568d0 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/train.py @@ -0,0 +1,203 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
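+
+# Module overview: fine-tuning script for the WordTag task described in the
+# README. It reads train.txt/dev.txt with read_custom_data, fine-tunes
+# ErnieCtmWordtagModel ("wordtag") with AdamW and LinearDecayWithWarmup (weight
+# decay is skipped for bias/norm parameters), logs loss and speed every
+# --logging_steps, and evaluates with SequenceAccuracy and saves a checkpoint
+# every --save_steps steps under --output_dir.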
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from data_process import convert_example, create_dataloader, load_dict, read_custom_data +from metric import SequenceAccuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + ErnieCtmTokenizer, + ErnieCtmWordtagModel, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + + # yapf: disable + parser.add_argument("--data_dir", default="./data", type=str, help="The input data dir, should contain train.json.") + parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") + parser.add_argument("--output_dir", default="./output", type=str, help="The output directory where the model predictions and checkpoints will be written.",) + parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.", ) + parser.add_argument("--logging_steps", type=int, default=5, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.", ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps. 
If > 0: Override warmup_proportion") + parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over total steps.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--seed", default=1000, type=int, help="random seed for initialization") + parser.add_argument("--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu.") + # yapf: enable + + args = parser.parse_args() + return args + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, tags, tags_to_idx): + model.eval() + metric.reset() + losses = [] + for batch in data_loader(): + input_ids, token_type_ids, seq_len, tags = batch + loss, seq_logits = model(input_ids, token_type_ids, lengths=seq_len, tag_labels=tags)[:2] + loss = loss.mean() + losses.append(loss.numpy()) + + correct = metric.compute( + pred=seq_logits.reshape([-1, len(tags_to_idx)]), label=tags.reshape([-1]), ignore_index=tags_to_idx["O"] + ) + metric.update(correct) + acc = metric.accumulate() + logger.info("eval loss: %.5f, acc: %.5f" % (np.mean(losses), acc)) + model.train() + metric.reset() + + +def do_train(args): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset( + read_custom_data, filename=os.path.join(args.data_dir, "train.txt"), is_test=False, lazy=False + ) + dev_ds = load_dataset(read_custom_data, filename=os.path.join(args.data_dir, "dev.txt"), is_test=False, lazy=False) + tags_to_idx = load_dict(os.path.join(args.data_dir, "tags.txt")) + + tokenizer = ErnieCtmTokenizer.from_pretrained("wordtag") + model = ErnieCtmWordtagModel.from_pretrained("wordtag", num_labels=len(tags_to_idx)) + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len, tags_to_idx=tags_to_idx) + + def batchify_fn(samples): + fn = Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # seq_len + Pad(axis=0, pad_val=tags_to_idx["O"], dtype="int64"), # tags + ) + return fn(samples) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.num_train_epochs + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + logger.info("Total 
steps: %s" % num_training_steps) + logger.info("WarmUp steps: %s" % warmup) + + metric = SequenceAccuracy() + + total_loss = 0 + global_step = 0 + + for epoch in range(1, args.num_train_epochs + 1): + logger.info(f"Epoch {epoch} beginnig") + start_time = time.time() + + for total_step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, token_type_ids, seq_len, tags = batch + + loss = model(input_ids, token_type_ids, lengths=seq_len, tag_labels=tags)[0] + + loss = loss.mean() + total_loss += loss + loss.backward() + + optimizer.step() + optimizer.clear_grad() + lr_scheduler.step() + + if global_step % args.logging_steps == 0 and rank == 0: + end_time = time.time() + speed = float(args.logging_steps) / (end_time - start_time) + logger.info( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, total_loss / args.logging_steps, speed) + ) + start_time = time.time() + total_loss = 0 + + if (global_step % args.save_steps == 0 or global_step == num_training_steps) and rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + evaluate(model, metric, dev_data_loader, tags, tags_to_idx) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/text_to_knowledge/ernie-ctm/utils.py b/examples/text_to_knowledge/ernie-ctm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..293945bef20362d069e68125ce6318f6189bf455 --- /dev/null +++ b/examples/text_to_knowledge/ernie-ctm/utils.py @@ -0,0 +1,48 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
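+
+# NOTE: `decode` below turns per-token tag ids into WordTag items. The first
+# `summary_num` positions (the summary tokens, e.g. [CLS0]/[CLS1]) and the final
+# [SEP] position are skipped. Labels look like "B-XXX"/"I-XXX"/"E-XXX"/"S-XXX" or
+# "O": "S"/"O" emit a single-character item, "B"/"I" accumulate characters, and
+# "E" closes the accumulated span. `reset_offset` then recomputes character
+# offsets and lengths of the emitted items.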
+
+
+def reset_offset(pred_words):
+    for i in range(0, len(pred_words)):
+        if i > 0:
+            pred_words[i]["offset"] = pred_words[i - 1]["offset"] + len(pred_words[i - 1]["item"])
+        pred_words[i]["length"] = len(pred_words[i]["item"])
+    return pred_words
+
+
+def decode(texts, all_pred_tags, summary_num, idx_to_tags):
+    batch_results = []
+    for i, pred_tags in enumerate(all_pred_tags):
+        pred_words, pred_word = [], []
+
+        for j, tag in enumerate(pred_tags[summary_num:-1]):
+            if j >= len(texts[i]):
+                break
+            pred_label = idx_to_tags[tag]
+            if pred_label.find("-") != -1:
+                _, label = pred_label.split("-")
+            else:
+                label = pred_label
+            if pred_label.startswith("S") or pred_label.startswith("O"):
+                pred_words.append({"item": texts[i][j], "offset": 0, "wordtag_label": label})
+            else:
+                pred_word.append(texts[i][j])
+                if pred_label.startswith("E"):
+                    pred_words.append({"item": "".join(pred_word), "offset": 0, "wordtag_label": label})
+                    del pred_word[:]
+
+        pred_words = reset_offset(pred_words)
+        result = {"text": texts[i], "items": pred_words}
+        batch_results.append(result)
+    return batch_results
diff --git a/examples/text_to_knowledge/nptag/README.md b/examples/text_to_knowledge/nptag/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..aece4d724dfca330fc2b19c3cdc605f27ac23523
--- /dev/null
+++ b/examples/text_to_knowledge/nptag/README.md
@@ -0,0 +1,160 @@
+# 解语:NPTag(名词短语标注工具)
+
+NPTag(名词短语标注工具)是首个能够覆盖所有中文名词性词汇及短语的细粒度知识标注工具,旨在解决NLP中名词性短语收录不足导致的OOV(out-of-vocabulary,超出收录词表)问题,可直接应用于构造知识特征,辅助NLP任务。
+
+## NPTag特点
+
+- 包含2000+细粒度类别,覆盖所有中文名词性短语的词类体系,更丰富的知识标注结果
+  - NPTag使用的词类体系为覆盖所有中文名词性短语的词类体系,对所有类目做了更细类目的识别(如注射剂、鱼类、博物馆等),共包含2000+细粒度类别,且可以直接关联百科知识树。
+- 可自由定制的分类框架
+  - NPTag开源版标注使用的词类体系是我们在实践中对**百科词条**分类应用较好的一个版本,用户可以自由定制自己的词类体系和训练样本,构建自己的NPTag,以获得更好的适配效果。例如,可按照自定义的类别构造训练样本,使用小学习率、短训练周期微调NPTag模型,即可获得自己定制的NPTag工具。
+
+## NPTag模型介绍
+
+NPTag使用[ERNIE-CTM](../ernie-ctm)+prompt训练而成,使用启发式搜索解码,保证分类结果都在标签体系之内。
+
+### finetune任务
+
+在微调任务中提供了一个中文名词短语标注的任务,旨在对中文名词短语进行细粒度分类。
+
+#### 代码结构说明
+
+```text
+nptag/
+├── deploy # 部署
+│   └── python
+│       └── predict.py # python预测部署示例
+├── data.py # 训练数据处理脚本
+├── export_model.py # 模型导出脚本
+├── metric.py # 模型效果验证指标脚本
+├── predict.py # 预测脚本
+├── README.md # 使用说明
+├── train.py # 训练脚本
+└── utils.py # 工具函数
+```
+
+#### 数据准备
+
+执行以下命令,下载并解压示例数据集:
+
+```bash
+wget https://bj.bcebos.com/paddlenlp/paddlenlp/datasets/nptag_dataset.tar.gz && tar -zxvf nptag_dataset.tar.gz
+```
+
+解压之后
+```text
+data/
+├── name_category_map.json # NPTag标签文件
+├── dev.txt # 验证集
+└── train.txt # 训练集
+```
+
+数据集`train.txt`和`dev.txt`格式示例(text VS label)
+```
+石竹 植物
+杂链聚合物 化学物质
+罗伯特·布雷森 人
+```
+
+标签文件`name_category_map.json`格式示例,其中key为细粒度标签,即NPTag的预测结果;value为粗粒度标签,示例中对应WordTag的标签集合,用户可以根据场景需要自定义修改该标签映射
+```
+{
+    "植物": "生物类_植物",
+    "化学物质": "物体类_化学物质",
+    "人": "人物类_实体"
+}
+```
+
+#### 模型训练
+```bash
+python -m paddle.distributed.launch --gpus "0" train.py \
+    --batch_size 64 \
+    --learning_rate 1e-6 \
+    --num_train_epochs 3 \
+    --logging_steps 10 \
+    --save_steps 100 \
+    --output_dir ./output \
+    --device "gpu"
+```
+
+可支持配置的参数:
+- `data_dir`: 数据集文件路径,默认数据集存放在当前目录data文件夹下。
+- `init_from_ckpt`: 模型参数路径,热启动模型训练,默认为None。
+- `output_dir`: 模型保存路径,默认保存在当前目录的output文件夹下。
+- `max_seq_len`: 模型使用的最大序列长度,默认为64。
+- `learning_rate`: finetune的最大学习率;默认为1e-6。
+- `num_train_epochs`: 表示训练轮数,默认为3。
+- `logging_steps`: 日志打印步数间隔,默认为10。
+- `save_steps`: 模型保存的步数间隔,默认为100。
+- `batch_size`: 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为64。
+- `weight_decay`: 控制正则项力度的参数,用于防止过拟合,默认为0.0。
+- `warmup_proportion`: 
学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +- `adam_epsilon`: Adam优化器的参数,避免分母为零,默认为1e-8。 +- `seed`: 随机种子,默认为1000。 +- `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。 + +### 基于动态图的预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python -m paddle.distributed.launch --gpus "0" predict.py \ + --device=gpu \ + --params_path ./output/model_100/model_state.pdparams +``` + +### 基于静态图的预测部署 + +使用动态图训练结束之后,可以将动态图参数导出成静态图参数,从而获得最优的预测部署性能,执行如下命令完成动态图转换静态图的功能: +```shell +python export_model.py --params_path=./output/model_100/model_state.pdparams --output_path=./export +``` + +导出静态图模型之后,可以用于部署,`deploy/python/predict.py`脚本提供了python部署预测示例。运行方式: +```shell +python deploy/python/predict.py --model_dir=./export +``` + +## Taskflow一键预测 + +除了以上的finetune示例,Taskflow内置了一个百度基于大规模标注汉语短语数据集训练的名词短语标注工具`NPTag`。用户可以方便地使用该工具完成对中文名词短语的一键预测。 + +```python +from paddlenlp import Taskflow + +nptag = Taskflow("knowledge_mining", model="nptag") +nptag("糖醋排骨") +''' +[{'text': '糖醋排骨', 'label': '菜品'}] +''' + +nptag(["糖醋排骨", "红曲霉菌"]) +''' +[{'text': '糖醋排骨', 'label': '菜品'}, {'text': '红曲霉菌', 'label': '微生物'}] +''' + +# 输出粗粒度类别标签`category`,即WordTag的词汇标签。 +nptag = Taskflow("knowledge_mining", model="nptag", linking=True) +nptag(["糖醋排骨", "红曲霉菌"]) + +''' +[{'text': '糖醋排骨', 'label': '菜品', 'category': '饮食类_菜品'}, {'text': '红曲霉菌', 'label': '微生物', 'category': '生物类_微生物'}] +''' +``` + +## 在论文中引用NPTag + +如果您的工作成果中使用了NPTag,请增加下述引用。我们非常乐于看到解语对您的工作带来帮助。 + +``` +@article{zhao2020TermTree, + title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, + author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, + technical report={Baidu, Inc. TR:2020-KG-TermTree}, + year={2020} +} +``` + + +## 问题与反馈 + +解语在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/nptag/data.py b/examples/text_to_knowledge/nptag/data.py new file mode 100644 index 0000000000000000000000000000000000000000..4d482fe69f8ec8a310abcb382297241efbdb3b8b --- /dev/null +++ b/examples/text_to_knowledge/nptag/data.py @@ -0,0 +1,81 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle + + +def convert_example(example, tokenzier, max_seq_len=512, max_cls_len=5, summary_num=2, is_test=False): + """ + Builds model inputs from a sequence for noun phrase classification task. + A prompt template is added to the end of the sequence. + + Prompt template: + + - ``[是] + [MASK] * max_cls_len`` + + Model input example: + + - ``[CLS0][CLS1] X [是][MASK]...[MASK][SEP]`` + + where X is the input text. + + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. 
+ max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + max_cls_len(obj:`int`): The maximum length of labels. + summary_num(obj:`int`): The number of summary tokens, e.g. `[CLS0]` and `[CLS1]`. + is_test(obj:`bool`): If True, it will not return the label. + + """ + + if len(example["text"]) + max_cls_len + 1 + summary_num + 1 > max_seq_len: + example["text"] = example["text"][: (max_seq_len - (max_cls_len + 1 + summary_num + 1))] + + tokens = list(example["text"]) + ["是"] + ["[MASK]"] * max_cls_len + inputs = tokenzier(tokens, return_length=True, is_split_into_words="token", max_length=max_seq_len) + + label_indices = list(range(inputs["seq_len"] - 1 - max_cls_len, inputs["seq_len"] - 1)) + + if is_test: + return inputs["input_ids"], inputs["token_type_ids"], label_indices + + label_tokens = list(example["label"]) + ["[PAD]"] * (max_cls_len - len(example["label"])) + labels = np.full([inputs["seq_len"]], fill_value=-100, dtype=np.int64) + labels[label_indices] = tokenzier.convert_tokens_to_ids(label_tokens) + return inputs["input_ids"], inputs["token_type_ids"], labels + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +def read_custom_data(filename): + """Reads data""" + with open(filename, "r", encoding="utf-8") as f: + for line in f: + text, label = line.strip().split("\t") + yield {"text": text, "label": label} diff --git a/examples/text_to_knowledge/nptag/deploy/python/predict.py b/examples/text_to_knowledge/nptag/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..5bc505639132f257fab9ee835ecbb186d74b00f1 --- /dev/null +++ b/examples/text_to_knowledge/nptag/deploy/python/predict.py @@ -0,0 +1,150 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
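+
+# NOTE: This deployment script runs the static-graph model exported by
+# export_model.py: it loads `inference.pdmodel` / `inference.pdiparams` from
+# --model_dir with paddle.inference, and applies the same prompt decoding
+# (top-k search over label tokens plus BK-tree fallback) as predict.py.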
+ +import argparse +import os +import sys + +import paddle + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.transformers import ErnieCtmTokenizer + +sys.path.append("./") + +from data import convert_example # noqa: E402 +from utils import construct_dict_map, decode, find_topk, search # noqa: E402 + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, default="./export/", help="The directory to static model.") +parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain name_category_map.json.") +parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", type=int, default=3, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# fmt: on + + +class Predictor(object): + def __init__(self, model_dir, device): + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + # Disable IR optimization for NPTag + config.switch_ir_optim(False) + + if device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + + self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) + + def predict(self, data, tokenizer): + examples = [] + for text in data: + example = {"text": text} + input_ids, token_type_ids, label_indices = convert_example( + example, tokenizer, max_seq_len=args.max_seq_len, is_test=True + ) + examples.append((input_ids, token_type_ids, label_indices)) + + batches = [examples[idx : idx + args.batch_size] for idx in range(0, len(examples), args.batch_size)] + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # label_indices + ): fn(samples) + + name_dict, bk_tree, id_vocabs, vocab_ids = construct_dict_map( + tokenizer, os.path.join(args.data_dir, "name_category_map.json") + ) + + all_scores_can = [] + all_preds_can = [] + pred_ids = [] + + for batch in batches: + input_ids, token_type_ids, label_indices = batchify_fn(batch) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(token_type_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + + for i, l in zip(label_indices, logits): + score = l[i[0] : i[-1] + 1, vocab_ids] + # Find topk candidates of scores and predicted indices. 
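+                # `score` holds the logits of the [MASK] positions restricted to the
+                # candidate label-token vocabulary (`vocab_ids`); the top-4 tokens per
+                # position are kept and later combined by `search` / BK-tree fallback.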
+ score_can, pred_id_can = find_topk(score, k=4, axis=-1) + + all_scores_can.extend([score_can.tolist()]) + all_preds_can.extend([pred_id_can.tolist()]) + pred_ids.extend([pred_id_can[:, 0].tolist()]) + + results = [] + for i, d in enumerate(data): + label = decode(pred_ids[i], id_vocabs) + result = { + "text": d, + "label": label, + } + if label not in name_dict: + scores_can = all_scores_can[i] + pred_ids_can = all_preds_can[i] + labels_can = search(scores_can, pred_ids_can, 0, [], 0) + labels_can.sort(key=lambda d: -d[1]) + for labels in labels_can: + cls_label_can = decode(labels[0], id_vocabs) + if cls_label_can in name_dict: + result["label"] = cls_label_can + break + else: + labels_can = bk_tree.search_similar_word(label) + result["label"] = labels_can[0][0] + + result["category"] = name_dict[result["label"]] + results.append(result) + return results + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_dir, args.device) + + tokenizer = ErnieCtmTokenizer.from_pretrained("nptag") + + data = [ + "刘德华", + "快乐薯片", + "自适应共振理论映射", + ] + + results = predictor.predict(data, tokenizer) + print(results) diff --git a/examples/text_to_knowledge/nptag/export_model.py b/examples/text_to_knowledge/nptag/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..7956b003226053a8cb6baade0021ddb27f666b38 --- /dev/null +++ b/examples/text_to_knowledge/nptag/export_model.py @@ -0,0 +1,47 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from paddlenlp.transformers import ErnieCtmNptagModel + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./output/model_100/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + model = ErnieCtmNptagModel.from_pretrained("nptag") + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # token_type_ids + ], + ) + # Save in static graph model. 
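+    # paddle.jit.save writes `inference.pdmodel` / `inference.pdiparams` under
+    # --output_path, which deploy/python/predict.py consumes via --model_dir.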
+ save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/examples/text_to_knowledge/nptag/metric.py b/examples/text_to_knowledge/nptag/metric.py new file mode 100644 index 0000000000000000000000000000000000000000..8e9ceaf02aeed7e622f2926fff0f1f8bb8d19586 --- /dev/null +++ b/examples/text_to_knowledge/nptag/metric.py @@ -0,0 +1,55 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle + + +class NPTagAccuracy(paddle.metric.Metric): + """ + Accuracy for NPTag Prompt Model. + """ + + def __init__(self): + super(NPTagAccuracy, self).__init__() + self.reset() + + def reset(self): + self.corrects = 0 + self.total = 0 + + def compute(self, preds, labels): + correct = [] + for pred, label in zip(preds, labels): + real_pred, real_label = ([] for _ in range(2)) + for i in range(len(label)): + if label[i] == -100 or label[i] == 0: + continue + real_pred.append(pred[i]) + real_label.append(label[i]) + + if all(real_pred[i] == real_label[i] for i in range(len(real_label))): + correct.append(1) + else: + correct.append(0) + return correct + + def update(self, correct): + self.corrects += sum(correct) + self.total += len(correct) + + def accumulate(self): + return float(self.corrects) / self.total + + def name(self): + return "NPTag Prompt Model Accuracy" diff --git a/examples/text_to_knowledge/nptag/predict.py b/examples/text_to_knowledge/nptag/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..ea791b59a4bf6f06e017104fb2b9b595d088ad13 --- /dev/null +++ b/examples/text_to_knowledge/nptag/predict.py @@ -0,0 +1,125 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from data import convert_example +from utils import construct_dict_map, decode, find_topk, search + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.transformers import ErnieCtmNptagModel, ErnieCtmTokenizer + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, default="./output/model_100/model_state.pdparams", required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain name_category_map.json.") +parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', type=str, choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def do_predict(data, model, tokenizer, batch_size=1, max_cls_len=5, summary_num=2): + examples = [] + for text in data: + example = {"text": text} + input_ids, token_type_ids, label_indices = convert_example( + example, tokenizer, max_seq_len=args.max_seq_len, is_test=True + ) + examples.append((input_ids, token_type_ids, label_indices)) + + batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] + + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Stack(dtype="int64"), # label_indices + ): fn(samples) + + name_dict, bk_tree, id_vocabs, vocab_ids = construct_dict_map( + tokenizer, os.path.join(args.data_dir, "name_category_map.json") + ) + + all_scores_can = [] + all_preds_can = [] + pred_ids = [] + + model.eval() + for batch in batches: + input_ids, token_type_ids, label_indices = batchify_fn(batch) + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + logits = model(input_ids, token_type_ids)[0].numpy() + for i, l in zip(label_indices, logits): + score = l[i[0] : i[-1] + 1, vocab_ids] + # Find topk candidates of scores and predicted indices. + score_can, pred_id_can = find_topk(score, k=4, axis=-1) + + all_scores_can.extend([score_can.tolist()]) + all_preds_can.extend([pred_id_can.tolist()]) + pred_ids.extend([pred_id_can[:, 0].tolist()]) + + results = [] + for i, d in enumerate(data): + label = decode(pred_ids[i], id_vocabs) + + result = { + "text": d, + "label": label, + } + + if label not in name_dict: + scores_can = all_scores_can[i] + pred_ids_can = all_preds_can[i] + labels_can = search(scores_can, pred_ids_can, 0, [], 0) + labels_can.sort(key=lambda d: -d[1]) + for labels in labels_can: + cls_label_can = decode(labels[0], id_vocabs) + if cls_label_can in name_dict: + result["label"] = cls_label_can + break + else: + labels_can = bk_tree.search_similar_word(label) + if len(labels_can) != 0: + result["label"] = labels_can[0][0] + + if result["label"] in name_dict: + result["category"] = name_dict[result["label"]] + results.append(result) + return results + + +if __name__ == "__main__": + paddle.set_device(args.device) + + data = [ + "刘德华", + "快乐薯片", + "自适应共振理论映射", + ] + + model = ErnieCtmNptagModel.from_pretrained("nptag") + tokenizer = ErnieCtmTokenizer.from_pretrained("nptag") + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + results = do_predict(data, model, tokenizer, batch_size=args.batch_size) + print(results) diff --git a/examples/text_to_knowledge/nptag/train.py b/examples/text_to_knowledge/nptag/train.py new file mode 100644 index 0000000000000000000000000000000000000000..d76c809d6010bb7f53607b65208a17e4d7dd239a --- /dev/null +++ b/examples/text_to_knowledge/nptag/train.py @@ -0,0 +1,191 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from data import convert_example, create_dataloader, read_custom_data +from metric import NPTagAccuracy + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + ErnieCtmNptagModel, + ErnieCtmTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + + # yapf: disable + parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain train.json and dev.json.") + parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") + parser.add_argument("--output_dir", type=str, default="./output", help="The output directory where the model predictions and checkpoints will be written.",) + parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", ) + parser.add_argument("--learning_rate", type=float, default=1e-6, help="The initial learning rate for Adam.") + parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.", ) + parser.add_argument("--logging_steps", type=int, default=10, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument("--batch_size", type=int, default=64, help="Batch size per GPU/CPU for training.", ) + parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay if we apply some.") + parser.add_argument("--warmup_proportion", type=float, default=0.0, help="Linear warmup proportion over total steps.") + parser.add_argument("--adam_epsilon", type=float, default=1e-8, help="Epsilon for Adam optimizer.") + parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") + # yapf: enable + + args = parser.parse_args() + return args + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, criterion, data_loader, vocab_size): + model.eval() + metric.reset() + losses = [] + for batch in data_loader(): + input_ids, token_type_ids, labels = batch + outputs = model(input_ids, token_type_ids) + logits = outputs[0] + loss = criterion(logits.reshape([-1, vocab_size]), labels.reshape([-1])) + losses.append(loss.numpy()) + probs = F.softmax(logits, axis=-1) + preds = paddle.argmax(probs, axis=-1).numpy() + correct = metric.compute(preds, labels) + 
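+        # NPTagAccuracy.compute returns a 0/1 list per example; positions whose
+        # label is -100 (non-[MASK] positions) or 0 are skipped (see metric.py).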
metric.update(correct) + acc = metric.accumulate() + logger.info("eval loss: %.5f, acc: %.5f" % (np.mean(losses), acc)) + model.train() + metric.reset() + + +def do_train(args): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset( + read_custom_data, filename=os.path.join(args.data_dir, "train.txt"), is_test=False, lazy=False + ) + dev_ds = load_dataset(read_custom_data, filename=os.path.join(args.data_dir, "dev.txt"), is_test=False, lazy=False) + + tokenizer = ErnieCtmTokenizer.from_pretrained("nptag") + model = ErnieCtmNptagModel.from_pretrained("nptag") + vocab_size = model.ernie_ctm.config["vocab_size"] + + trans_func = partial(convert_example, tokenzier=tokenizer, max_seq_len=args.max_seq_len) + + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids + Pad(axis=0, pad_val=-100, dtype="int64"), # labels + ): fn(samples) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + model = paddle.DataParallel(model) + num_training_steps = len(train_data_loader) * args.num_train_epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + logger.info("Total steps: %s" % num_training_steps) + + metric = NPTagAccuracy() + criterion = paddle.nn.CrossEntropyLoss() + + global_step = 0 + for epoch in range(1, args.num_train_epochs + 1): + logger.info(f"Epoch {epoch} beginnig") + start_time = time.time() + + for step, batch in enumerate(train_data_loader): + global_step += 1 + input_ids, token_type_ids, labels = batch + outputs = model(input_ids, token_type_ids) + logits = outputs[0] + loss = criterion(logits.reshape([-1, vocab_size]), labels.reshape([-1])) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + lr_scheduler.step() + + if global_step % args.logging_steps == 0 and rank == 0: + end_time = time.time() + speed = float(args.logging_steps) / (end_time - start_time) + logger.info( + "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, loss.item(), speed) + ) + start_time = time.time() + + if (global_step % args.save_steps == 0 or global_step == num_training_steps) and rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_step)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model._layers.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + evaluate(model, metric, criterion, dev_data_loader, vocab_size) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, 
value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/examples/text_to_knowledge/nptag/utils.py b/examples/text_to_knowledge/nptag/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..ad2e233e322345a82ff72b0d99cf2ccaed21a2f6 --- /dev/null +++ b/examples/text_to_knowledge/nptag/utils.py @@ -0,0 +1,195 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +from collections import OrderedDict +from typing import List + +import numpy as np + + +def construct_dict_map(tokenizer, name_dict_path): + """Construct dict map""" + with open(name_dict_path, encoding="utf-8") as fp: + name_dict = json.load(fp) + cls_vocabs = OrderedDict() + bk_tree = BurkhardKellerTree() + for k in name_dict: + bk_tree.add(k) + for c in k: + if c not in cls_vocabs: + cls_vocabs[c] = len(cls_vocabs) + cls_vocabs["[PAD]"] = len(cls_vocabs) + id_vocabs = dict(zip(cls_vocabs.values(), cls_vocabs.keys())) + vocab_ids = tokenizer.vocab.to_indices(list(cls_vocabs.keys())) + return name_dict, bk_tree, id_vocabs, vocab_ids + + +def decode(pred_ids, id_vocabs): + tokens = [id_vocabs[i] for i in pred_ids] + valid_token = [] + for token in tokens: + if token == "[PAD]": + break + valid_token.append(token) + return "".join(valid_token) + + +def search(scores_can, pred_ids_can, depth, path, score): + if depth >= 5: + return [(path, score)] + res = [] + for i in range(len(pred_ids_can[0])): + tmp_res = search( + scores_can, pred_ids_can, depth + 1, path + [pred_ids_can[depth][i]], score + scores_can[depth][i] + ) + res.extend(tmp_res) + return res + + +def find_topk(a, k, axis=-1, largest=True, sorted=True): + if axis is None: + axis_size = a.size + else: + axis_size = a.shape[axis] + assert 1 <= k <= axis_size + + a = np.asanyarray(a) + if largest: + index_array = np.argpartition(a, axis_size - k, axis=axis) + topk_indices = np.take(index_array, -np.arange(k) - 1, axis=axis) + else: + index_array = np.argpartition(a, k - 1, axis=axis) + topk_indices = np.take(index_array, np.arange(k), axis=axis) + topk_values = np.take_along_axis(a, topk_indices, axis=axis) + if sorted: + sorted_indices_in_topk = np.argsort(topk_values, axis=axis) + if largest: + sorted_indices_in_topk = np.flip(sorted_indices_in_topk, axis=axis) + sorted_topk_values = np.take_along_axis(topk_values, sorted_indices_in_topk, axis=axis) + sorted_topk_indices = np.take_along_axis(topk_indices, sorted_indices_in_topk, axis=axis) + return sorted_topk_values, sorted_topk_indices + return topk_values, topk_indices + + +def levenstein_distance(s1: str, s2: str) -> int: + """Calculate minimal Levenstein distance between s1 and s2. + + Args: + s1 (str): string + s2 (str): string + + Returns: + int: the minimal distance. 
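+
+    Example:
+        >>> levenstein_distance("kitten", "sitting")
+        3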
+ """ + m, n = len(s1) + 1, len(s2) + 1 + + # Initialize + dp = [[0] * n for i in range(m)] + dp[0][0] = 0 + for i in range(1, m): + dp[i][0] = dp[i - 1][0] + 1 + for j in range(1, n): + dp[0][j] = dp[0][j - 1] + 1 + + for i in range(1, m): + for j in range(1, n): + if s1[i - 1] != s2[j - 1]: + dp[i][j] = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) + 1 + else: + dp[i][j] = dp[i - 1][j - 1] + return dp[m - 1][n - 1] + + +class BurkhardKellerNode(object): + """Node implementatation for BK-Tree. A BK-Tree node stores the information of current word, and its approximate words calculated by levenstein distance. + + Args: + word (str): word of current node. + """ + + def __init__(self, word: str): + self.word = word + self.next = {} + + +class BurkhardKellerTree(object): + """Implementataion of BK-Tree""" + + def __init__(self): + self.root = None + self.nodes = {} + + def __add(self, cur_node: BurkhardKellerNode, word: str): + """Insert a word into current tree. If tree is empty, set this word to root. + + Args: + word (str): word to be inserted. + """ + if self.root is None: + self.root = BurkhardKellerNode(word) + return + if word in self.nodes: + return + dist = levenstein_distance(word, cur_node.word) + if dist not in cur_node.next: + self.nodes[word] = cur_node.next[dist] = BurkhardKellerNode(word) + else: + self.__add(cur_node.next[dist], word) + + def add(self, word: str): + """Insert a word into current tree. If tree is empty, set this word to root. + + Args: + word (str): word to be inserted. + """ + return self.__add(self.root, word) + + def __search_similar_word(self, cur_node: BurkhardKellerNode, s: str, threshold: int = 2) -> List[str]: + res = [] + if cur_node is None: + return res + dist = levenstein_distance(cur_node.word, s) + if dist <= threshold: + res.append((cur_node.word, dist)) + start = max(dist - threshold, 1) + while start < dist + threshold: + tmp_res = self.__search_similar_word(cur_node.next.get(start, None), s)[:] + res.extend(tmp_res) + start += 1 + return res + + def search_similar_word(self, word: str) -> List[str]: + """Search the most similar (minimal levenstain distance) word between `s`. + + Args: + s (str): target word + + Returns: + List[str]: similar words. + """ + res = self.__search_similar_word(self.root, word) + + def max_prefix(s1: str, s2: str) -> int: + res = 0 + length = min(len(s1), len(s2)) + for i in range(length): + if s1[i] == s2[i]: + res += 1 + else: + break + return res + + res.sort(key=lambda d: (d[1], -max_prefix(d[0], word))) + return res diff --git a/examples/text_to_knowledge/termtree/README.md b/examples/text_to_knowledge/termtree/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7fd126b43fed7ce3281ded192c94b09662fc0bf2 --- /dev/null +++ b/examples/text_to_knowledge/termtree/README.md @@ -0,0 +1,271 @@ +# 解语:TermTree(百科知识树) +TermTree(百科知识树)是一个描述所有中文词汇(包括概念、实体/专名、领域术语、语法词等,统一称之为Term)的树状知识库,完整的TermTree由两部分构成: + +> I. TermType词类体系:覆盖所有中文词汇词类的树状知识体系,是对中文词汇集合的一种全划分层次表示; +> +> II. Term关系和属性值:描述具体Term之间关系和Term属性值网状图谱,用于整合各应用知识图谱; + +本次发布的TermTreeV1.0试用版是TermTree的一个常用子集,包括两部分内容: + +> A. 简化版的TermType词类体系,由160+ termtype(三层结构)和 7000+ subtype构成。 +> +> B. 
约100w的term集(挂接在TermType词类体系下),包括大多数常用概念(src=cb,基础概念库,termtype准确率为98%)和一部分高频百科实体(src=eb,基础实体库,termtype准确率为95%)。 +> +> 开源版不包括Term关系和属性值,但给出了实体的百科词条链接,应用方可以利用百科链接整合其他知识图谱使用。 + +我们提供了TermTreeV1.0试用版的下载链接供大家使用,[下载链接](https://kg-concept.bj.bcebos.com/TermTree/TermTree.V1.0.tar.gz) 。 + +**注:** 与其他常见应用知识图谱不同,TermTree的核心是概念词,而非专名实体词。因为,在中文文本中,概念词的含义是相对稳定的,而专名实体词随应用变化(例如,不同电商有不同的商品实体集,不同的小说站有不同的小说实体集),因此,TermTree通过 “提供常用概念集 + 可插拔的应用实体集/应用知识图谱” 来达到支持不同的应用适配。 + +## 自定义TermTree + +`termtree.py`文件中的TermTree类支持TermTree的加载、增加、保存操作,因为涉及到数据结构整体性和一致性,暂不支持删除和修改操作。下面提供了离线维护自定义TermTree的代码示例 + +### 文件准备 + +首先下载已有的TermTreeV1.0 +```shell +wget https://kg-concept.bj.bcebos.com/TermTree/TermTree.V1.0.tar.gz && tar -zxvf TermTree.V1.0.tar.gz +``` + +### TermTree维护与修改 + +加载TermTreeV1.0,增加新的term +```python +from termtree import TermTree + +# 加载百科知识树 +termtree = TermTree.from_dir("termtree_type.csv", "TermTree.V1.0") + +# 增加新term: 平原上的火焰 +termtree.add_term(term="平原上的火焰", + base="eb", + term_type="影视作品") + +# 保存修改, 执行后将在当前路径生成文件`termtree_data`,即新的自定义TermTree +termtree.save("./") +``` + +#### API说明 + +- ```python + def add_term() + ``` + +- **参数** + - term (str): 待增加的term名称。 + - base (str): term属于概念词(cb)还是实体词(eb)。 + - term_type (str): term的主类别。 + - sub_type (Optional[List[str]], optional): term的辅助类别或细分类别,非必选。 + - sub_terms (Optional[List[str]], optional): 用于描述同类同名的term集,非必选。 + - alias (Optional[List[str]], optional): term的常用别名,非必选。 + - alias_ext (Optional[List[str]], optional): term的常用扩展别名,非必选。 + - data (Optional[Dict[str, Any]], optional): 以dict形式构造该term节点,非必选。 + + +### 自定义Term-Linking + +Taskflow支持使用自定义TermTree实现自定义Term-Linking,该示例中"平原上的火焰"的Term-Linking如下: +作品类_实体(wordtag_label) -> 影视作品_eb_平原上的火焰(term_id) + +通过`task_path`定义用户自定义路径,文件组成: +```text +custom_task_path/ +├── termtree_type.csv +└── termtree_data +``` + +使用Taskflow加载自定义TermTree来进行预测: + +```python +from paddlenlp import Taskflow + +wordtag = Taskflow("knowledge_mining", task_path="./custom_task_path/") + +wordtag("《平原上的火焰》是今年新上映的电影") +# [{'text': '《平原上的火焰》是今年新上映的电影', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '平原上的火焰', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 6, 'termid': '影视作品_eb_平原上的火焰'}, {'item': '》', 'offset': 7, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 8, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '今年', 'offset': 9, 'wordtag_label': '时间类', 'length': 2, 'termid': '时间阶段_cb_今年'}, {'item': '新', 'offset': 11, 'wordtag_label': '修饰词', 'length': 1, 'termid': '修饰词_cb_新'}, {'item': '上映', 'offset': 12, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_上映'}, {'item': '的', 'offset': 14, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '电影', 'offset': 15, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '影视作品_cb_电影'}]}] +``` + +## 常见问题 + +**常见问题1:为什么TermTree采用树状结构(Tree),而不是网状结构(Net/Graph)?** + +- 树结构是对知识空间的全划分,网状结构是对相关关系的描述和提炼。树结构更方便做到对词类体系的全面描述。 + +- 树结构适合概念层次的泛化推理,网状结构适合相关性的泛化推理。树结构的知识对统计相关知识有很好的互补作用,在应用中能够更好地弥补统计模型的不足。 +- 两者可以结合表示和使用:Term集合整体以树结构组织(TermType词类体系),Term间的关系用网状结构描述(Term关系和属性值)。可以将TermTree视为中文词汇的层次描述框架,应用知识图谱可以基于TermType词类体系方便地整合到TermTree。 + +**常见问题2:为什么TermTree叫做百科知识树?是否只能用于描述百科知识?** + +- 一方面,Term可以泛指任意概念、实体/专名、领域术语、语法词等,用“百科”是为了表达Term的多样性,而不是限定Term的来源,Term可以来自任意中文文本; +- 另一方面,各类别的词汇都可以在百科词条中找到样例,用“百科”也是为了表示对所有中文词汇词类的描述能力。 + +**常见问题3:中文词汇词类描述体系有很多,为什么采用这个体系?** + +- TermTree的词类体系是在大规模工业应用实践(如百科文本解析挖掘、query理解)中打磨出来的中文词类体系,在理论上可能不是一个完备体系,但很适合通用领域中文解析挖掘任务。 + + +## TermTree字段说明 + +| 字段 | 说明 | 备注 | +| ------------ | 
------------------------------------------------------------ | ------------------------------------------------------------ | +| id | 【必有】唯一标识符 | 可基于termid生成 | +| term | 【必有】term的名字 | | +| termid | 【必有】term的id(唯一),构造方式为termtype_src_term | 采用显式构造id的方式,便于应用数据扩展和整合 | +| src | 【必有】term的来源库,当前包括两个基础库cb和eb。其中cb为基础概念库(concept base,收录常用词汇用语,可作为各类应用的基础集),eb为基础实体库(entity base, 收录常见命名实体,可根据应用需求扩展) | cb、eb的划分标准不同应用不一样,可根据需求调整;应用方也可以构造自己的应用库,与cb、eb整合使用。 | +| termtype | 【必有】term的主类别,详细描述参见 [termtree\_type](./termtree_type.csv) | 多上位的term会选择其中一个作为termtype,其他上位作为subtype,方便应用筛选 | +| subtype | 【非必须】term的辅助类别或细分类别 | 如果应用特别关注某个subtype,也可以将其升级为termtype使用(需要相应更新termid和id) | +| subterms | 【非必须】用于描述同类同名的term集,若“termtype+src”下term只对应一个实例,则subterms为空;若“termtype+src”下term对应多个实例,则subterms记录这些实例,其字段与term相同 | 不需要区分subterm的两种常见场景:1. 应用只需词类特征;2. 上下文信息不足,无法区分具体实例 | +| subterms_num | 【非必须】subterms中的subterm数量 | 如果没有subterm,则值为0 | +| alias | 【非必须】term的常用别名 | 通常为歧义小的别名 | +| alias\_ext | 【非必须】term的常用扩展别名,经常是term或alias的一个子片段,单独出现有其他含义,结合上下文可识别为别名。 | 通常为歧义大的别名,便于应用筛选使用。e.g., 四维彩超的alias_ext“四维” | +| links | 【非必须】该term对应的其他term的id,可以是本知识库中的id,也可以是其他知识库如百度百科id | 如果是本知识库中的id,则表示两者可以指代同一实体 | + +## 数据示例 +```json +// 示例1:无subterms的term +{ + "id": "c472a6fe74eb2008c4e5b958a047eb5c", + "termid": "植物_cb_苹果", + "term": "苹果", + "src": "cb", + "termtype": "植物", + "subtype": [], + "subterms": [], + "subterms_num": 0, + "alias": [ + "苹果树" + ], + "alias_ext": [], + "links": [ + { + "bdbkUrl": [ + "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/14822460" + ] + } + ] +} + +// 示例2:有subterms的term +{ + "id": "824716062a4d74efc0897d676700a24e", + "termid": "影视作品_eb_苹果", + "term": "苹果", + "src": "eb", + "termtype": "影视作品", + "subtype": [], + "subterms": [ + { + "id": "9bb5b38dc50233b1ccd28d1c33c37605", + "subtype": [ + "影视作品_cb_电影", + "影视动漫作品_cb_剧情片" + ], + "alias": [], + "alias_ext": [], + "links": [ + { + "bdbkUrl": [ + "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/6011191" + ] + } + ] + }, + { + "id": "688dc07cc98f02cbd4d21e2700290590", + "subtype": [ + "影视作品_cb_韩国电影" + ], + "alias": [], + "alias_ext": [], + "links": [ + { + "bdbkUrl": [ + "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/6011208" + ] + } + ] + }, + { + "id": "bbf4abe6ac412b181eac383333ca9fef", + "subtype": [ + "影视作品_cb_剧情电影" + ], + "alias": [], + "alias_ext": [], + "links": [ + { + "bdbkUrl": [ + "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/6011176" + ] + } + ] + } + ], + "subterms_num": 3, + "alias": [], + "alias_ext": [], + "links": [] +} +``` + +## TermTree特点 + + 1. 将所有中文词汇放在一个统一类别体系下表示,包括**概念、实体/专名、领域术语、语法词**。 +- 解决传统标注技术下(e.g., 词性标注、命名实体识别),概念、实体、词性特征难以统一计算的问题。 + + 2. 为中文精准解析挖掘服务的词汇类别体系,以全面覆盖**百科词条、搜索query、新闻资讯**中出现的中文词汇为目标,支持通用场景文本理解。 + - 应用可以通过指定词表的TermType,方便地整合到TermTree中,定制应用特化版。 + + 3. 尽可能收录常用概念词,并区分常用概念词(src=cb)和专名实体词(src=eb),以解决专名实体与概念在计算中容易混淆的问题。为此,特别补充收录了很多百科中缺少的概念词。 + - 例:“琴房(歌曲类实体)” VS. “琴房(区域场所类概念)” + - 例:“甩掉(歌曲类实体)” VS. “甩掉(场景事件类概念)” + + 4. 将同类同名实体拆分为term和subterm两层(参见数据示例),term作为给定termtype下所有同名实体的表示,subterm作为同类同名实体集中每一个具体实体的表示: + - 一方面解决文本中信息不足无法区分具体实体时的标注问题; + - 一方面减少同名词汇的消歧计算代价(只需要计算同类下的同名实体,有效解决概念词和实体词识别混淆的问题) + + 5. 
为重要的概念/实体构建完整上位归类路径(**注:** TermTreeV1.0试用版暂不包括),用于细粒度特征计算和知识推断,参见以下示例 + + | term | 类别| src| 上位归类路径示例 | + |---|---|---|---| + |苹果 | 植物类|cb|苹果 → 苹果属 → 蔷薇科 → 蔷薇目 → 双子叶植物纲 → 被子植物门 → 种子植物 → 植物界 → 真核生物域 → 生物| + | 黄香蕉苹果| 饮食类|cb|黄香蕉苹果 →苹果 →水果 → 蔬果和菌藻类 →食材 →食物 →饮食| + |甲型流感 | 疾病类|cb|甲型流感 → 流行性感冒 → 感冒 → 呼吸道感染 → 呼吸系统疾病 → 疾病损伤 → 生物疾病| + |甲型流感病毒| 微生物类|cb|甲型流感病毒 → 流行性感冒病毒 → 正粘病毒科 → RNA病毒 → 生物病毒 → 病原微生物 → 微生物 → 生物| + |琴房| 区域场所类|cb|琴房 → 音乐室 → 活动室 →活动场所 →区域场所| + |琴房| 音乐类|eb|琴房 → 歌曲 →音乐作品 →艺术作品 →作品 → 作品与出版物| + |认同感 | 生活用语类|cb|认同感 →正面感受 → 感受 → 知觉感受 → 个体描述 → 生活用语| + | 认同感| 图书类|eb|认同感 →书籍 →图书 →书刊 →出版物 → 作品与出版物| + |佛罗伦萨足球俱乐部| 体育组织机构|eb|佛罗伦萨足球俱乐部 →意大利足球联赛球队→职业足球俱乐部→足球俱乐部 →足球队 →球队 →运动队 →体育组织机构 →组织机构| + |佛罗伦萨市 | 世界地区类|cb|佛罗伦萨市 →托斯卡纳大区 →意大利 →南欧 →欧洲 →地球区域 →世界地区| + |言情小说 | 小说类|cb|言情小说 →情感小说 →小说 →文学作品 →作品 →作品与出版物| + | 言情小说| 音乐类|eb|言情小说 → 歌曲 →音乐作品 →艺术作品 →作品 → 作品与出版物| +> **注:** TermType词类体系可视为所有上位归类路径的集合。 + +## TermTree应用方式 + +1. 直接作为词表使用,利用termtype和subtype筛选应用所需的词表(停用词表、黑白名单、概念扩展词表等)。 +2. 结合中文文本知识标注工具(WordTag等)使用,用于文本词类特征生成、挖掘/解析pattern生成、样本构建和优化等等,参见"[解语的应用场景](../)"。 +3. 整合应用知识图谱,为应用知识图谱提供通用词汇知识补充。 + +## TermTree后续规划 + +1. 数据覆盖扩展到全量百度百科词条,提升TermType归类准确率,便于应用方筛选构建应用适配的TermTree; +2. 建立知识共建社区,支持用户提交自己的term词表,生成定制版TermTree。 + + +## 在论文中引用TermTree +如果您的工作成果中使用了TermTree,请增加下述引用。我们非常乐于看到TermTree对您的工作带来帮助。 +``` +@article{zhao2020TermTree, + title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, + author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, + technical report={Baidu, Inc. TR:2020-KG-TermTree}, + year={2020} +} +``` + +## 问题与反馈 + +百科知识树在持续扩充优化中,如果您有任何建议或发现数据问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/termtree/termtree.py b/examples/text_to_knowledge/termtree/termtree.py new file mode 100644 index 0000000000000000000000000000000000000000..7b09795ef1135f1a4bb4c7ff930c05da49a65da5 --- /dev/null +++ b/examples/text_to_knowledge/termtree/termtree.py @@ -0,0 +1,416 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import csv +import json +import os +import warnings +from typing import Any, Dict, List, Optional, Tuple, Union + + +class TermTreeNode(object): + """Defination of term node. All members are protected, to keep rigorism of data struct. + + Args: + sid (str): term id of node. + term (str): term, common name of this term. + base (str): `cb` indicates concept base, `eb` indicates entity base. + term_type (Optional[str], optional): type of this term, constructs hirechical of `term` node. Defaults to None. + hyper (Optional[str], optional): parent type of a `type` node. Defaults to None. + node_type (str, optional): type statement of node, `type` or `term`. Defaults to "term". + alias (Optional[List[str]], optional): alias of this term. Defaults to None. + alias_ext (Optional[List[str]], optional): extended alias of this term, CANNOT be used in matching. + Defaults to None. + sub_type (Optional[List[str]], optional): grouped by some term. Defaults to None. 
+ sub_term (Optional[List[str]], optional): some lower term. Defaults to None. + data (Optional[Dict[str, Any]], optional): to sore full imformation of a term. Defaults to None. + + """ + + def __init__( + self, + sid: str, + term: str, + base: str, + node_type: str = "term", + term_type: Optional[str] = None, + hyper: Optional[str] = None, + level: Optional[int] = None, + alias: Optional[List[str]] = None, + alias_ext: Optional[List[str]] = None, + sub_type: Optional[List[str]] = None, + sub_term: Optional[List[str]] = None, + data: Optional[Dict[str, Any]] = None, + ): + self._sid = sid + self._term = term + self._base = base + self._term_type = term_type + self._hyper = hyper + self._sub_term = sub_term if sub_term is not None else [] + self._sub_type = sub_type if sub_type is not None else [] + self._alias = alias if alias is not None else [] + self._alias_ext = alias_ext if alias_ext is not None else [] + self._data = data + self._level = level + self._node_type = node_type + self._sons = set() + + def __str__(self): + if self._data is not None: + return json.dumps(self._data, ensure_ascii=False) + else: + res = { + "termid": self._sid, + "term": self._term, + "src": self._base, + "alias": self._alias, + "alias_ext": self._alias_ext, + "termtype": self._term_type, + "subterms": self._sub_term, + "subtype": self._sub_type, + "links": [], + } + return json.dumps(res, ensure_ascii=False) + + @property + def sid(self): + return self._sid + + @property + def term(self): + return self._term + + @property + def base(self): + return self._base + + @property + def alias(self): + return self._alias + + @property + def alias_ext(self): + return self._alias_ext + + @property + def termtype(self): + return self._term_type + + @property + def subtype(self): + return self._sub_type + + @property + def subterm(self): + return self._sub_term + + @property + def hyper(self): + return self._hyper + + @property + def level(self): + return self._level + + @property + def sons(self): + return self._sons + + @property + def node_type(self): + return self._node_type + + def add_son(self, son_name): + self._sons.add(son_name) + + @classmethod + def from_dict(cls, data: Dict[str, Any]): + """Build a node from dictionary data. + + Args: + data (Dict[str, Any]): Dictionary data contain all k-v data. + + Returns: + [type]: TermTree node object. + """ + return cls( + sid=data["termid"], + term=data["term"], + base=data["src"], + term_type=data["termtype"], + sub_type=data["subtype"], + sub_term=data["subterms"], + alias=data["alias"], + alias_ext=data["alias_ext"], + data=data, + ) + + @classmethod + def from_json(cls, json_str: str): + """Build a node from JSON string. + + Args: + json_str (str): JSON string formatted by TermTree data. + + Returns: + [type]: TermTree node object. 
+ """ + dict_data = json.loads(json_str) + return cls.from_dict(dict_data) + + +class TermTree(object): + """TermTree class.""" + + def __init__(self): + self._nodes: Dict[str, TermTreeNode] = {} + self._root = TermTreeNode(sid="root", term="root", base="cb", node_type="root", level=0) + self._nodes["root"] = self.root + self._index = {} + + def __build_sons(self): + for node in self._nodes: + self.__build_son(self._nodes[node]) + + def __getitem__(self, item): + return self._nodes[item] + + def __contains__(self, item): + return item in self._nodes + + def __iter__(self): + return self._nodes.__iter__() + + @property + def root(self): + return self._root + + def __load_type(self, file_path: str): + with open(file_path, "rt", newline="", encoding="utf8") as csvfile: + file_handler = csv.DictReader(csvfile, delimiter="\t") + for row in file_handler: + if row["type-1"] not in self: + self.add_type(type_name=row["type-1"], hyper_type="root") + if row["type-2"] != "" and row["type-2"] not in self: + self.add_type(type_name=row["type-2"], hyper_type=row["type-1"]) + if row["type-3"] != "" and row["type-3"] not in self: + self.add_type(type_name=row["type-3"], hyper_type=row["type-2"]) + + def __judge_term_node(self, node: TermTreeNode) -> bool: + if node.termtype not in self: + raise ValueError(f"Term type of new node {node.termtype} does not exists.") + if node.sid in self: + warnings.warn(f"{node.sid} exists, will be replaced by new node.") + + def add_term( + self, + term: Optional[str] = None, + base: Optional[str] = None, + term_type: Optional[str] = None, + sub_type: Optional[List[str]] = None, + sub_term: Optional[List[str]] = None, + alias: Optional[List[str]] = None, + alias_ext: Optional[List[str]] = None, + data: Optional[Dict[str, Any]] = None, + ): + """Add a term into TermTree. + + Args: + term (str): common name of name. + base (str): term is concept or entity. + term_type (str): term type of this term + sub_type (Optional[List[str]], optional): sub type of this term, must exists in TermTree. Defaults to None. + sub_terms (Optional[List[str]], optional): sub terms of this term. Defaults to None. + alias (Optional[List[str]], optional): alias of this term. Defaults to None. + alias_ext (Optional[List[str]], optional): . Defaults to None. + data (Optional[Dict[str, Any]], optional): [description]. Defaults to None. + """ + if data is not None: + new_node = TermTreeNode.from_dict(data) + else: + new_node = TermTreeNode( + sid=f"{term_type}_{base}_{term}", + term=term, + base=base, + term_type=term_type, + sub_term=sub_term, + sub_type=sub_type, + alias=alias, + alias_ext=alias_ext, + node_type="term", + ) + self.__judge_term_node(new_node) + self._nodes[new_node.sid] = new_node + self.__build_index(new_node) + + def add_type(self, type_name, hyper_type): + if type_name in self._nodes: + raise ValueError(f"Term Type {type_name} exists.") + if hyper_type not in self._nodes: + raise ValueError(f"Hyper type {hyper_type} does not exist, please add it first.") + if self._nodes[hyper_type].level == 3: + raise ValueError( + "Term type schema must be 3-LEVEL, 3rd level type node should not be a parent of type node." 
+ ) + self._nodes[type_name] = TermTreeNode( + sid=type_name, + term=type_name, + base=None, + hyper=hyper_type, + node_type="type", + level=self._nodes[hyper_type].level + 1, + ) + self.__build_index(self._nodes[type_name]) + + def __load_file(self, file_path: str): + with open(file_path, encoding="utf-8") as fp: + for line in fp: + data = json.loads(line) + self.add_term(data=data) + + def __build_son(self, node: TermTreeNode): + """Build sons of a node + + Args: + node (TermTreeNode): son node. + """ + type_node = None + if node.termtype is not None: + type_node = self._nodes[node.termtype] + elif node.hyper is not None: + type_node = self._nodes[node.hyper] + if type_node is not None: + type_node.add_son(node.sid) + for sub_type in node.subtype: + sub_type_node = self._nodes[sub_type] + sub_type_node.add_son(node.sid) + + def build_son(self, node: str): + self.__build_son(self[node]) + + def __build_index(self, node: TermTreeNode): + if node.term not in self._index: + self._index[node.term] = [] + self._index[node.term].append(node.sid) + for alia in node.alias: + if alia not in self._index: + self._index[alia] = [] + self._index[alia].append(node.sid) + + def __judge_hyper(self, source_id, target_id) -> bool: + queue = [source_id] + visited_node = {source_id} + while len(queue) > 0: + cur_id = queue.pop(0) + if cur_id == target_id: + return True + cur_node = self._nodes[cur_id] + edge = [] + if cur_node.hyper is not None: + edge.append(cur_node.hyper) + if cur_node.termtype is not None: + edge.append(cur_node.termtype) + edge.extend(cur_node.subtype) + for next_id in edge: + if next_id not in visited_node: + queue.append(next_id) + visited_node.add(next_id) + return False + + def find_term(self, term: str, term_type: Optional[str] = None) -> Tuple[bool, Union[List[str], None]]: + """Find a term in Term Tree. If term not exists, return None. + If `term_type` is not None, will find term with this type. + + Args: + term (str): term to look up. + term_type (Optional[str], optional): find term in this term_type. Defaults to None. + + Returns: + Union[None, List[str]]: [description] + """ + if term not in self._index: + return False, None + else: + if term_type is None: + return True, self._index[term] + else: + out = [] + for term_id in self._index[term]: + if self.__judge_hyper(term_id, term_type) is True: + out.append(term_id) + if len(out) > 0: + return True, out + else: + return False, None + + def build_from_dir(self, term_schema_path, term_data_path, linking=True): + """Build TermTree from a directory which should contain type schema and term data. + + Args: + dir ([type]): [description] + """ + self.__load_type(term_schema_path) + if linking: + self.__load_file(term_data_path) + self.__build_sons() + + @classmethod + def from_dir(cls, term_schema_path, term_data_path, linking=True) -> "TermTree": + """Build TermTree from a directory which should contain type schema and term data. 
+ + Args: + source_dir ([type]): [description] + + Returns: + TermTree: [description] + """ + term_tree = cls() + term_tree.build_from_dir(term_schema_path, term_data_path, linking) + return term_tree + + def __dfs(self, cur_id: str, depth: int, path: Dict[str, str], writer: csv.DictWriter): + cur_node = self._nodes[cur_id] + if cur_node.node_type == "term": + return + if depth > 0: + path[f"type-{depth}"] = cur_id + if path["type-1"] != "": + writer.writerow(path) + for son in cur_node.sons: + self.__dfs(son, depth + 1, path, writer) + if depth > 0: + path[f"type-{depth}"] = "" + + def save(self, save_dir): + """Save term tree to directory `save_dir` + + Args: + save_dir ([type]): Directory. + """ + if os.path.exists(save_dir) is False: + os.makedirs(save_dir, exist_ok=True) + out_path = {} + for i in range(1, 3): + out_path[f"type-{i}"] = "" + with open(f"{save_dir}/termtree_type.csv", "wt", encoding="utf-8", newline="") as fp: + fieldnames = ["type-1", "type-2", "type-3"] + csv_writer = csv.DictWriter(fp, delimiter="\t", fieldnames=fieldnames) + csv_writer.writeheader() + self.__dfs("root", 0, out_path, csv_writer) + with open(f"{save_dir}/termtree_data", "w", encoding="utf-8", newline="") as fp: + for nid in self: + node = self[nid] + if node.node_type == "term": + print(node, file=fp) diff --git a/examples/text_to_knowledge/termtree/termtree_type.csv b/examples/text_to_knowledge/termtree/termtree_type.csv new file mode 100644 index 0000000000000000000000000000000000000000..0ab88eefe103993d4de2c410488b7b2f886d8970 --- /dev/null +++ b/examples/text_to_knowledge/termtree/termtree_type.csv @@ -0,0 +1,164 @@ +type-1 type-2 type-3 说明 主要对应词性 subtype示例(开放集合) src(C表示cb、E表示eb) +角色 n/普通名词;nr/人名 群体 C/E +角色 人物 人物类概念、人物类实体 nr/人名 职业角色、历史人物、行业人物 C/E +角色 民族族群 民族和族群 五十六个民族 C +角色 虚拟角色 非现实的角色 虚拟人物、虚拟生物 C/E +作品与出版物 作品类概念、作品类实体 nw/作品名 拓片 C/E +作品与出版物 游戏 电子游戏、视频小游戏、网页游戏 C/E +作品与出版物 影视动漫作品 视频作品 C/E +作品与出版物 影视动漫作品 动漫作品 漫画、动画 C/E +作品与出版物 影视动漫作品 影视作品 电影、电视剧 C/E +作品与出版物 影视动漫作品 视频节目 脱口秀节目、新闻类节目、访谈节目 C/E +作品与出版物 音乐 歌曲、音乐专辑 C/E +作品与出版物 小说 网络小说、言情小说 C/E +作品与出版物 诗歌 诗词 C/E +作品与出版物 计算机软件 工具软件、办公软件 C/E +作品与出版物 舞蹈 C/E +作品与出版物 美术 雕塑作品、油画作品、工艺美术作品 C/E +作品与出版物 图书 书籍、词典、教材 C/E +作品与出版物 刊物 报纸、期刊 C/E +作品与出版物 文件 文书 C/E +作品与出版物 作品IP C/E +区域场所 ns/地名 宗教场所、建筑物 C/E +区域场所 景点 公园、植物园、动物园、博物馆 C/E +区域场所 楼盘住宅 商业楼盘、住宅楼盘、住宅小区 C/E +区域场所 交通场所 机场、车站、港口、交通道路、交通线路 C/E +区域场所 住宿场所 酒店、旅馆 C/E +区域场所 餐饮场所 咖啡馆、餐馆 C/E +区域场所 网上组织机构场所 网站、虚拟社区 C/E +位置方位 位置方位词 f/方位名词;s/处所名词 C +组织机构 组织机构类 nt/机构团体名 委员会、论坛 C/E +组织机构 演艺团体 乐队、艺术团、偶像组合 C/E +组织机构 国家机关 政府部门、党政机关 C/E +组织机构 企事业单位 公司、厂商、企业 C/E +组织机构 教育组织机构 学校、大学、幼儿园、培训机构 C/E +组织机构 居民服务机构 母婴护理机构、婚介机构、美容护理机构、家政服务机构 C/E +组织机构 医疗卫生机构 医院、药店、诊所、科室 C/E +组织机构 体育组织机构 运动队、体育俱乐部 C/E +组织机构 金融组织机构 银行、交易所、投资机构 C/E +组织机构 军事组织机构 部队、军区 C/E +品牌 品牌名 n/普通名词;nz/其他专名 C/E +品牌 汽车品牌 C/E +品牌 手机品牌 C/E +品牌 个护用品品牌 护肤品牌、彩妆品牌 C/E +物体与物品 包括物品和物质 n/普通名词;nz/其他专名 物体构造、化学物质 C/E +物体与物品 物品 飞机、船舶、轴承、摄影器材 C/E +物体与物品 物品 汽车 C/E +物体与物品 物品 手机 C/E +物体与物品 物品 美容美发用品 化妆品、美发用品 C/E +物体与物品 物品 电子电器产品 计算机、家用电器 C/E +物体与物品 物品 衣物饰品 服装、箱包、鞋靴、饰品配件 C/E +物体与物品 物品 兵器 武器、导弹、冷兵器.. 
C/E +物体与物品 设施 C/E +饮食 饮食类 n/普通名词;nz/其他专名 食材 C/E +饮食 菜品 菜品类 汤品、面食 C/E +饮食 饮品 饮品类 茶叶、酒、饮料和冷饮类 C/E +生物 生物类 n/普通名词;nz/其他专名 C +生物 动物 动物类 猫、狗、鸟纲、昆虫纲、鱼纲 C +生物 植物 植物类 C +生物 微生物 微生物类 真菌、细菌、生物病毒 C +世界地区 世界地区,包括地球外区域 ns/地名 首都、地球山脉、地球河流、地球岛屿、历史地区 C +世界地区 中国地区 中国地区 中国省级行政区、中国省会 C +世界地区 国家 现代意义的国家 现存国家、历史国家 C +虚拟事物 非现实事物 n/普通名词;nz/其他专名 虚拟场所、虚拟场景 C/E +虚拟事物 虚拟物品 虚拟宝物、游戏卡牌、游戏道具、游戏装备 C/E +文化 文化相关的特定类目 n/普通名词;nz/其他专名 C +文化 姓氏与人名 中文姓氏、英文姓氏 C +文化 语言文字 汉语、方言 C +文化 民俗 方术、数术、十二生肖、占星学星座、周易六十四卦 C +文化 政权朝代 历史政权、中国朝代 C +文化 制度政策协议 C/E +文化 奖项赛事活动 奖项、活动 C/E +事件 事件类 n/普通名词;nz/其他专名 展览、会议、案件、事故、战争 C/E +术语 领域术语、专名 nz/其他专名 C +术语 编码符号指标 价格、符号、信号、度量单位、邮政编码..... C +术语 教育用语 C +术语 教育用语 学科 C +术语 教育用语 学历学位 学历、学位 C +术语 教育用语 专业 C +术语 游戏用语 C +术语 游戏用语 麻将术语 C +术语 医药学术语 中医术语、西医术语、医学指标、诊断治疗方法 C +术语 医药学术语 医疗美容项目 C +术语 医药学术语 药物 中药、西药 C +术语 医药学术语 疾病损伤 疾病、疾病症状 C +术语 医药学术语 动物疾病 C +术语 金融术语 股票术语、证券术语、保险术语、银行术语 C +术语 金融术语 股票 C/E +术语 金融术语 保险 C/E +术语 金融术语 基金 C/E +术语 金融术语 银行卡 借记卡、信用卡 C/E +术语 经济术语 会计术语 C +术语 法律术语 C +术语 法律术语 法律法规 法律体系、法律、法规 C/E +术语 法律术语 罪名 C +术语 体育术语 围棋术语、象棋术语、篮球术语 C +术语 体育术语 体育运动项目 球类运动、武术功夫 C +术语 赌博博彩用语 赌博用语 C +术语 赌博博彩用语 彩票 C +术语 天文学术语 星系、恒星 C +术语 天文学术语 星座 八十八星座 C +术语 天文学术语 星体 小行星 C +术语 生物学术语 C +术语 生物学术语 动物体构造 动物器官系统、骨 C +术语 生物学术语 植物病虫害 植物病害、植物虫害 C +术语 机械工程术语 机械制造术语 C +术语 机械工程术语 汽车术语 C +术语 大气科学术语 气象学术语、气候学术语 C +术语 大气科学术语 台风 C/E +术语 计算机术语 病毒程序、计算机网络术语、编程技术术语 C +术语 文化术语 摄影术语、音乐术语、文学术语 C +术语 数学术语 数学概念、数学公式、几何学术语 C +术语 物理术语 电学术语、力学术语 C +术语 化学术语 化学结构 C +术语 统计术语 数理统计术语 C +术语 地学术语 地理学术语、地质学术语 C +术语 农业学术语 土壤学术语 C +术语 心理学术语 心理现象 C +术语 语言学术语 语法、词法、音韵学术语 C +术语 建筑术语 土木工程术语、装修术语 C +术语 军事术语 C +术语 政治术语 C +术语 哲学术语 哲学理论、伦理学术语、逻辑学术语 C +术语 宗教术语 道教术语、佛教术语 C +术语 通信术语 C +术语 材料科学术语 C +术语 航空科技术语 C +术语 水利科技术语 水利工程 C +术语 测绘学术语 测量术语 C +术语 电力术语 C +术语 社会学术语 C +术语 交通术语 船舶工程术语 C +术语 钓鱼术语 C +术语 ACGN术语 C +生活用语 日常生活中常用词 n/普通名词;nz/其他专名 信息知识资料、标识物、行业、服务 C +生活用语 情绪 C +生活用语 态度 C +生活用语 表情 笑、哭、眼神 C +生活用语 人物造型 妆容、发型 C +生活用语 个性特点 C +生活用语 颜色 C +生活用语 场景事件 包括常见动词 v/普通动词;vn/名动词;vd/动副词 考试 C/E +时间阶段 时间相关词 t/时间名词 时间、年代、世纪... 
C +时间阶段 地质年代 C +时间阶段 特殊日 农历二十四节气、假日、节日、纪念日 C +词汇用语 语法词类、汉字、成语等,用于兜底 n/普通名词 C +词汇用语 汉字 汉字字表 C/E +词汇用语 成语 成语词表 C/E +词汇用语 俗语 非成语的俗语 歇后语、顺口溜、谚语 C/E +词汇用语 诗句 诗句 C/E +词汇用语 介词 介词 p/介词 C +词汇用语 助词 助词 u/助词 C +词汇用语 代词 代词 r/代词 C +词汇用语 连词 连词 c/连词 C +词汇用语 副词 副词 d/副词 C +词汇用语 疑问词 疑问词 C +词汇用语 肯定否定词 常用肯定词和否定词 C +词汇用语 量词 量词 q/量词 C +词汇用语 数量词 数量词 m/数量词 C +词汇用语 叹词 叹词 C +词汇用语 拟声词 拟声词 C +词汇用语 修饰词 修饰词,包括常见形容词 n/普通名词;a/形容词;ad/副形词;an/名形词 C +词汇用语 汉字偏旁部首 汉字偏旁部首 C +词汇用语 日文假名 日文假名 平假名、片假名 C +词汇用语 汉语拼音 汉语拼音字母 C diff --git a/examples/text_to_knowledge/wordtag-ie/README.md b/examples/text_to_knowledge/wordtag-ie/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c5b28912c3098020c3e70f2ee85e886613fef347 --- /dev/null +++ b/examples/text_to_knowledge/wordtag-ie/README.md @@ -0,0 +1,135 @@ +# 解语:WordTag-IE(基于中文词类知识的信息抽取工具) + +WordTag-IE(基于中文词类知识的信息抽取工具)是在WordTag标注结果之上实现的信息抽取工具,旨在提供一个灵活、可配置,能够精准、全面覆盖简单句式的**规则信息抽取工具**。我们已提供了通用配置,可覆盖一些常见的抽取句式。用户也可以根据我们提供的配置方法,完成自己的配置,应用于自己的领域、专业文本。其产出数据,可作为模型的训练样本,也可以直接当作挖掘结果使用。 + +![](https://user-images.githubusercontent.com/1371212/172542329-754cb4f9-3526-400b-be6e-d60e078af872.png) + +## WordTag-IE特点 + +- **灵活、方便的配置,即时生效** + - WordTag-IE是在WordTag标注结果的基础上,完全使用规则实现的关系抽取工具。其配置完全基于WordTag的词类知识以及TermTree中的词表实现,实现了灵活、简单配置,且保证了产出数据的一致性 + +## 使用示例 + +在WordTag的任务中基础上可以打开`with_ie` 开关即可输出信息抽取的结果, 下面是使用PaddleNLP Taskflow使用WordTag-IE的使用示例。 +```python +>>> from paddlenlp import Taskflow +>>> wordtag_ie = Taskflow("knowledge_mining", with_ie=True) +>>> wordtag_ie('《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。') +[[{'text': '《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '忘了所有', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 4}, {'item': '》', 'offset': 5, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 6, 'wordtag_label': '肯定词', 'length': 1}, {'item': '一首', 'offset': 7, 'wordtag_label': '数量词_单位数量词', 'length': 2}, {'item': '由', 'offset': 9, 'wordtag_label': '介词', 'length': 1}, {'item': '王杰', 'offset': 10, 'wordtag_label': '人物类_实体', 'length': 2}, {'item': '作词', 'offset': 12, 'wordtag_label': '场景事件', 'length': 2}, {'item': '、', 'offset': 14, 'wordtag_label': 'w', 'length': 1}, {'item': '作曲', 'offset': 15, 'wordtag_label': '场景事件', 'length': 2}, {'item': '并', 'offset': 17, 'wordtag_label': '连词', 'length': 1}, {'item': '演唱', 'offset': 18, 'wordtag_label': '场景事件', 'length': 2}, {'item': '的', 'offset': 20, 'wordtag_label': '助词', 'length': 1}, {'item': '歌曲', 'offset': 21, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': ',', 'offset': 23, 'wordtag_label': 'w', 'length': 1}, {'item': '收录', 'offset': 24, 'wordtag_label': '场景事件', 'length': 2}, {'item': '在', 'offset': 26, 'wordtag_label': '介词', 'length': 1}, {'item': '专辑', 'offset': 27, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': '同名', 'offset': 29, 'wordtag_label': '场景事件', 'length': 2}, {'item': '《', 'offset': 31, 'wordtag_label': 'w', 'length': 1}, {'item': '忘了所有', 'offset': 32, 'wordtag_label': '作品类_实体', 'length': 4}, {'item': '》', 'offset': 36, 'wordtag_label': 'w', 'length': 1}, {'item': '中', 'offset': 37, 'wordtag_label': '词汇用语', 'length': 1}, {'item': ',', 'offset': 38, 'wordtag_label': 'w', 'length': 1}, {'item': '由', 'offset': 39, 'wordtag_label': '介词', 'length': 1}, {'item': '波丽佳音', 'offset': 40, 'wordtag_label': '人物类_实体', 'length': 4}, {'item': '唱片', 'offset': 44, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': '于', 'offset': 46, 'wordtag_label': '介词', 'length': 1}, {'item': '1996年08月31日', 
'offset': 47, 'wordtag_label': '时间类_具体时间', 'length': 11}, {'item': '发行', 'offset': 58, 'wordtag_label': '场景事件', 'length': 2}, {'item': '。', 'offset': 60, 'wordtag_label': 'w', 'length': 1}]}], [[{'HEAD_ROLE': {'item': '王杰', 'offset': 10, 'type': '人物类_实体'}, 'TAIL_ROLE': [{'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}], 'GROUP': '创作', 'TRIG': [{'item': '作词', 'offset': 12}, {'item': '作曲', 'offset': 15}, {'item': '演唱', 'offset': 18}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '王杰', 'offset': 10, 'type': '人物类_实体'}], 'GROUP': '创作者', 'SRC': 'HTG', 'TRIG': [{'item': '作词', 'offset': 12}, {'item': '作曲', 'offset': 15}, {'item': '演唱', 'offset': 18}]}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '歌曲', 'offset': 21, 'type': '作品类_概念'}], 'GROUP': '类型', 'SRC': 'TAIL'}, {'HEAD_ROLE': {'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}, 'TAIL_ROLE': [{'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}], 'GROUP': '收录', 'TRIG': [{'item': '收录', 'offset': 24}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}], 'GROUP': '收录于', 'SRC': 'HGT', 'TRIG': [{'item': '收录', 'offset': 24}]}, {'HEAD_ROLE': {'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}, 'TAIL_ROLE': [{'item': '王杰', 'type': '人物类_实体', 'offset': 10}], 'GROUP': '创作者', 'TRIG': [{'item': '专辑', 'offset': 27}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '王杰', 'type': '人物类_实体', 'offset': 10}, 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}], 'GROUP': '创作', 'SRC': 'HGT', 'TRIG': [{'item': '专辑', 'offset': 27}]}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 32}, 'TAIL_ROLE': [{'item': '唱片', 'offset': 44, 'type': '作品类_概念'}], 'GROUP': '类型', 'SRC': 'TAIL'}]]] +``` +同时可以通过 `schema` 来配置相关关系类型, 抽取自定义的关系组合 + +``` python +>>> from pprint import pprint +>>> schema = [ + { + "head_role": "作品类_实体", #头实体词类 + "group": "创作者", #关系名 + "tail_role": [ + { + "main": [ + "人物类_实体" #尾实体词类 + ], + "support": [] #相关词类,可作为该关系的补充,不可作为尾实体独立存在 + } + ], + "trig_word": [ + "作词", #触发词,对于没有触发词,而是由头尾实体直接触发的关系,可为null + ], + "trig_type": "trigger", #trigger表明由触发词触发,tail表明为尾实体触发 + "reverse": False, #是否为反向配置,即尾实体实际是头,头实体实际是尾 + "trig_direction": "B", #触发P的方向,表示在自然表达中,尾实体在触发词的哪一边,L为左,R为右,B为双向都有可能,默认为B + "rel_group": "创作" #对应的反关系,即头尾实体对调后,对应的关系,用于逻辑推断 + }] +>>> wordtag_ie.set_schema(schema) +>>> pprint(wordtag_ie('《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。')[1]) +[[{'GROUP': '创作', + 'HEAD_ROLE': {'item': '王杰', 'offset': 10, 'type': '人物类_实体'}, + 'SRC': 'REVERSE', + 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 1, 'type': '作品类_实体'}], + 'TRIG': [{'item': '作词', 'offset': 12}]}, + {'GROUP': '创作者', + 'HEAD_ROLE': {'item': '忘了所有', 'offset': 1, 'type': '作品类_实体'}, + 'SRC': 'HTG', + 'TAIL_ROLE': [{'item': '王杰', 'offset': 10, 'type': '人物类_实体'}], + 'TRIG': [{'item': '作词', 'offset': 12}]}]] +``` + +## 配置示例 + +我们提供了配置示例文件[demo_config.json](./demo_config.json),用户可以直接基于这个文件实现自己想要的配置。 + +我们以“出版方”这个关系为例: + +```json +{ + "head_role": "作品类_实体", //头实体词类 + "group": "出版方", //关系名 + "tail_role": [ + { + "main": [ + "组织机构类" + ], //尾实体词类 + "support": [ + "时间类_具体时间" + ] //相关词类,可作为该关系的补充,不可作为尾实体独立存在 + } + ], //尾实体配置 + "trig_word": [ + "出版" + ], //触发词,对于没有触发词,而是由头尾实体直接触发的关系,可为null + "trig_direction": "L", //触发P的方向,表示在自然表达中,尾实体在触发词的哪一边,L为左,R为右,B为双向都有可能,默认为B + "trig_type": "trigger", //trigger表明由触发词触发,tail表明为尾实体触发 + "reverse": false, //是否为反向配置,即尾实体实际是头,头实体实际是尾 + "rel_group": "出版" 
//对应的反关系,即头尾实体对调后,对应的关系,用于逻辑推断 +} +``` + +### 配置原则 + +1. 文本中的头实体(head_role)一定在尾实体(tail_role)的前面(即左边),可以通过配置反向标记(reverse)和反向关系名(rel_group)生成反关系 +2. 两种触发模式:触发词触发(trig_type为trigger)和尾实体触发(trig_type为tail),两者的触发方向(trig_direction)配置不同 + + - 触发词的触发方向约束的是文本中尾实体在触发词的左边还是右边,默认是双向触发(B),可以配置向左触发(L)或向右(R)触发,以提升挖掘精度 + + - 尾实体触发不用配置方向,因为必然在头实体之后 +## 实现方法 + +使用WordTag的标注结果,相当于已实现将**无限的词收拢到了有限的词类体系中**,而待抽取的关系,则变成了仅发生在词类与词类之间,便可以枚举出来。例如,`人物类_实体`与`作品类_实体`之间的关系可以是“创作”,而“创作”的触发词(如作词、作曲、演唱、执导、出演等)或触发pattern,则可以通过知识库枚举得到,如此,则实现了灵活配置。 + +那么,接下来一个问题则是,我们如何从现在的序列解析结果中,得到关系三元组数据呢? + +要解决这个问题,我们依旧要从中文语言学的成果中寻找答案:中文更偏孤立语,注重**意合**,依靠词序和词之间的意义联系成句,词性、句法特征弱。也就是说,我们在解析的时候,可以尝试摒弃所谓句法特征,只是从次序上下手。于是,我们发现,只需要覆盖好 SPO 的几种常用表达顺序,单向搜索,即可覆盖大部分简单句。 + +例如,对于`<张艺谋,创作,十面埋伏>`这一 SPO 三元组,常用表达顺序有如下几种: + +- S-P-O:张艺谋执导了《十面埋伏》。 +- S-O-P:张艺谋是《十面埋伏》的导演。 +- O-S-P:《十面埋伏》是张艺谋执导的电影。 +- O-P-S:《十面埋伏》的导演是张艺谋。 + +然而,这种模式仍然过于复杂,如遇到多组 SPO 关系并存的文本,如果要完全照顾到这四种表达顺序,则很容易发生混乱,难以得到严格对应的三元组。所以,我们设计了**互反关系**的概念,即头实体和尾实体对调后,对应的反向关系。例如三元组`<张艺谋,创作,十面埋伏>`,则存在一个反向三元组`<十面埋伏,创作者,三元组>`。那么,当我们找到一个头实体之后,只需要考虑它之后的部分(即 `S-P-O` 和 `S-O-P` 两种表达顺序)就行了。 + +另外,我们认为,规范表达中,关系触发和尾实体一定实在同一个短语中出现,所以,触发关系之后,寻找尾实体的过程中,我们仅搜索与触发在同一个短语中的实体及相关元素。 + +## 后续计划 + +- 实现基于语义结构的抽取,覆盖复杂句式 + +## 在论文中引用WordTag-IE + +如果您的工作成果中使用了WordTag-IE,请增加下述引用。我们非常乐于看到WordTag-IE对您的工作带来帮助。 + +``` +@article{qin2022WordTag-IE, + title={WordTag-IE: a Rule-based Tool for Chinese Information Extraction}, + author={Qin, Huapeng and Zhao, Min and Tang, Wei}, + technical report={Baidu, Inc. TR:2022-KG-WordTag-IE}, + year={2022} +} +``` + +## 问题与反馈 + +WordTag-IE在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/wordtag-ie/demo_config.json b/examples/text_to_knowledge/wordtag-ie/demo_config.json new file mode 100644 index 0000000000000000000000000000000000000000..fa39a2ef3e4adc19f87607cf02358436013956af --- /dev/null +++ b/examples/text_to_knowledge/wordtag-ie/demo_config.json @@ -0,0 +1,955 @@ +[ + { + "head_role": "人物类_实体", + "group": "名字", + "tail_role": [ + { + "main": [ + "文化类_姓氏与人名", + "其他角色类", + "人物类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "原名" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "人物类_实体", + "group": "性别", + "tail_role": [ + { + "main": [ + "信息资料" + ], + "support": [] + } + ], + "trig_word": [ + "男", + "女" + ], + "trig_type": "role", + "reverse": false, + "trig_direction": null + }, + { + "head_role": "人物类_实体", + "group": "出生于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [ + "世界地区类" + ] + }, + { + "main": [ + "世界地区类" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "出生", + "出生于", + "生于" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "时间类_具体时间", + "group": "出生于", + "tail_role": [ + { + "main": [ + "人物类_实体" + ], + "support": [ + "世界地区类" + ] + } + ], + "trig_word": [ + "出生", + "出生于", + "生于" + ], + "trig_type": "trigger", + "reverse": true, + "trig_direction": "B" + }, + { + "head_role": "人物类_实体", + "group": "参加工作时间", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [] + } + ], + "trig_word": [ + "参加工作" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "人物类_实体", + "group": "入党时间", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [] + } + ], + "trig_word": [ + "入党" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "人物类_实体", + "group": "加入组织", + "tail_role": [ + { + "main": [ + "组织机构类", + "组织机构类_概念" + ], + 
"support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "加入", + "参加" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "时间类_具体时间", + "group": "加入组织", + "tail_role": [ + { + "main": [ + "人物类_实体" + ], + "support": [ + "组织机构类", + "组织机构类_概念" + ] + } + ], + "trig_word": [ + "加入", + "参加" + ], + "trig_type": "trigger", + "reverse": true, + "trig_direction": "R" + }, + { + "head_role": "人物类_实体", + "group": "享年", + "tail_role": [ + { + "main": [ + "数量词" + ], + "support": [] + } + ], + "trig_word": [ + "年仅" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "人物类_实体", + "group": "创作", + "tail_role": [ + { + "main": [ + "作品类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "创作", + "监制", + "监制人", + "作词", + "作词人", + "作曲", + "作曲人", + "编曲", + "演唱", + "演唱者", + "制作人", + "制作", + "制片人", + "制片", + "主持人", + "主持", + "导演", + "执导", + "编剧", + "作者", + "所著", + "主编", + "撰写", + "编著", + "编撰" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B", + "rel_group": "创作者" + }, + { + "head_role": "人物类_实体", + "group": "出演", + "tail_role": [ + { + "main": [ + "作品类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "配音" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B", + "rel_group": "演员" + }, + { + "head_role": "人物类_实体", + "group": "饰演", + "tail_role": [ + { + "main": [ + "其他角色类", + "人物类_实体" + ], + "support": [ + "作品类_实体" + ] + } + ], + "trig_word": [ + "扮演", + "饰演", + "饰" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "人物类_实体", + "group": "代言", + "tail_role": [ + { + "main": [ + "品牌名" + ], + "support": [] + } + ], + "trig_word": [ + "代言" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B", + "rel_group": "代言人" + }, + { + "head_role": "人物类_实体", + "group": "创建", + "tail_role": [ + { + "main": [ + "组织机构类" + ], + "support": [] + } + ], + "trig_word": [ + "创办", + "创建" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R", + "rel_group": "创建人" + }, + { + "head_role": "人物类_实体", + "group": "获奖", + "tail_role": [ + { + "main": [ + "文化类_奖项赛事活动" + ], + "support": [ + "作品类_实体", + "数量词_序数词" + ] + } + ], + "trig_word": [ + "获", + "获得", + "荣获", + "获颁" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "作品类_实体", + "group": "类型", + "tail_role": [ + { + "main": [ + "作品类_概念" + ], + "support": [] + } + ], + "trig_word": [ + "是" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "作品类_实体", + "group": "出品方", + "tail_role": [ + { + "main": [ + "组织机构类" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "出品" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "作品类_实体", + "group": "出版方", + "tail_role": [ + { + "main": [ + "组织机构类" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "出版" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "作品类_实体", + "group": "发表于", + "tail_role": [ + { + "main": [ + "场所类_网上场所" + ], + "support": [] + } + ], + "trig_word": [ + "发表", + "连载", + "发表于", + "连载于" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "作品类_实体", + "group": "创作者", + "tail_role": [ + { + "main": [ + "人物类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "创作", + "监制", + "监制人", + "作词", + "作词人", + "作曲", + "作曲人", + "编曲", + "演唱", + "演唱者", + "制作人", + 
"制作", + "制片人", + "制片", + "主持人", + "主持", + "导演", + "执导", + "编剧", + "作者", + "所著", + "主编", + "撰写", + "编著", + "编撰" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B", + "rel_group": "创作" + }, + { + "head_role": "作品类_实体", + "group": "演员", + "tail_role": [ + { + "main": [ + "人物类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "配音" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B", + "rel_group": "出演" + }, + { + "head_role": "作品类_实体", + "group": "收录于", + "tail_role": [ + { + "main": [ + "作品类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "收录于" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "作品类_实体", + "group": "改编自", + "tail_role": [ + { + "main": [ + "作品类_实体", + "作品类_概念" + ], + "support": [] + } + ], + "trig_word": [ + "改编", + "改编自" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "作品类_实体", + "group": "获奖", + "tail_role": [ + { + "main": [ + "文化类_奖项赛事活动" + ], + "support": [] + } + ], + "trig_word": [ + "获", + "获得", + "荣获", + "获颁" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "作品类_实体", + "group": "上市于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [] + } + ], + "trig_word": [ + "上市" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "组织机构类", + "group": "创建于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [ + "世界地区类", + "组织机构类_国家机关" + ] + }, + { + "main": [ + "世界地区类", + "组织机构类_国家机关" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "成立", + "创办", + "创建", + "建立", + "登记成立", + "成立登记" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "组织机构类", + "group": "创建人", + "tail_role": [ + { + "main": [ + "人物类_实体" + ], + "support": [] + } + ], + "trig_word": [ + "创办", + "创建" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L", + "rel_group": "创建" + }, + { + "head_role": "组织机构类", + "group": "上市于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [ + "{[education:场外交易市场]}" + ] + }, + { + "main": [ + "{[education:场外交易市场]}" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "上市" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "组织机构类", + "group": "成立于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [ + "世界地区类" + ] + }, + { + "main": [ + "世界地区类" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "成立于", + "成立" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "时间类_具体时间", + "group": "成立于", + "tail_role": [ + { + "main": [ + "组织机构类" + ], + "support": [ + "世界地区类" + ] + } + ], + "trig_word": [ + "成立于", + "成立" + ], + "trig_type": "trigger", + "reverse": true, + "trig_direction": "L" + }, + { + "head_role": "组织机构类", + "group": "所属组织", + "tail_role": [ + { + "main": [ + "组织机构类" + ], + "support": [] + } + ], + "trig_word": [ + "隶属", + "隶属于" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "世界地区类", + "group": "所属地区", + "tail_role": [ + { + "main": [ + "世界地区类" + ], + "support": [] + } + ], + "trig_word": [ + "首都", + "省会", + "首府" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "世界地区类", + "group": "所属地区", + "tail_role": [ + { + "main": [ + "世界地区类" + ], + "support": [] + } + ], + "trig_word": [ + 
"隶属", + "隶属于" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "世界地区类", + "group": "所属地区", + "tail_role": [ + { + "main": [ + "世界地区类" + ], + "support": [] + } + ], + "trig_word": [ + "下辖" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "世界地区类", + "group": "官方语言", + "tail_role": [ + { + "main": [ + "文化类_语言文字" + ], + "support": [] + } + ], + "trig_word": [ + "官方语言" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "世界地区类", + "group": "海拔", + "tail_role": [ + { + "main": [ + "数量词" + ], + "support": [] + } + ], + "trig_word": [ + "海拔" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "世界地区类", + "group": "面积", + "tail_role": [ + { + "main": [ + "数量词" + ], + "support": [] + } + ], + "trig_word": [ + "面积", + "占地" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "场所类", + "group": "类型", + "tail_role": [ + { + "main": [ + "场所类_概念" + ], + "support": [] + } + ], + "trig_word": [ + "是" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "场所类", + "group": "面积", + "tail_role": [ + { + "main": [ + "数量词" + ], + "support": [] + } + ], + "trig_word": [ + "面积", + "占地" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "物体类", + "group": "类型", + "tail_role": [ + { + "main": [ + "物体类_概念" + ], + "support": [] + } + ], + "trig_word": [ + "是" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "物体类_兵器", + "group": "类型", + "tail_role": [ + { + "main": [ + "物体类_兵器" + ], + "support": [] + } + ], + "trig_word": [ + "是" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + }, + { + "head_role": "物体类", + "group": "上市于", + "tail_role": [ + { + "main": [ + "时间类_具体时间" + ], + "support": [] + } + ], + "trig_word": [ + "上市" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "B" + }, + { + "head_role": "物体类", + "group": "制造方", + "tail_role": [ + { + "main": [ + "组织机构类_企事业单位" + ], + "support": [ + "时间类_具体时间" + ] + } + ], + "trig_word": [ + "生产", + "制造", + "推出", + "发布" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "L" + }, + { + "head_role": "品牌名", + "group": "类型", + "tail_role": [ + { + "main": [ + "品牌名_品牌类型" + ], + "support": [ + "世界地区类_国家" + ] + } + ], + "trig_word": [ + "是" + ], + "trig_type": "trigger", + "reverse": false, + "trig_direction": "R" + } +] diff --git a/examples/text_to_knowledge/wordtag/README.md b/examples/text_to_knowledge/wordtag/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ce194b26e07b4c62bb952c75883ea6b1cb1bd2d5 --- /dev/null +++ b/examples/text_to_knowledge/wordtag/README.md @@ -0,0 +1,248 @@ +# 解语:WordTag(中文词类知识标注工具) + +WordTag(中文词类知识标注工具)是首个能够覆盖所有中文词汇的词类知识标注工具,旨在为中文文本解析提供全面、丰富的知识标注结果,可以应用于模板(挖掘模板、解析模板)生成与匹配、知识挖掘(新词发现、关系挖掘)等自然语言处理任务中,提升文本解析与挖掘精度;也可以作为中文文本特征生成器,为各类机器学习模型提供文本特征。 + +![wordtag示例](../doc/img/wordtag_example.png) + +## WordTag特点 + +- **覆盖所有中文词汇的词类体系,更丰富的知识标注结果** + - WordTag使用的词类体系为覆盖所有中文词汇的词类体系,包括各类实体词与非实体词(如概念、实体/专名、语法词等)。WordTag开源版对部分类目(如组织机构等),做了更细类目的划分识别(如,医疗卫生机构、体育组织机构),对仅使用文本信息难以细分的类目(如人物类、作品类、品牌名等),不做更细粒度的词类识别。用户需要细粒度的词类识别时,可利用百科知识树的类别体系自行定制。 +- **整合百科知识树链接结果,获得更丰富的标注知识** + - 如上图示例所示,各个切分标注结果中,除词类标注外,还整合了百科知识树的链接结果,用户可以结合百科知识树数据共同使用:如,利用百科知识树中的subtype获得更细的上位粒度,利用term的百科信息获得更加丰富的知识等。 +- **可定制的词类序列标注框架** 
+ - WordTag开源版标注使用的词类体系是我们在实践中对**百科文本**解析应用较好的一个版本,不同类型文本(如,搜索query、新闻资讯)的词类分布不同,用户可以利用百科知识树定制自己的词类体系和训练样本,构建自己的WordTag应用版,以获得更好的适配效果。例如,可将自定义的词表按照百科知识树的字段定义好,挂接/整合到百科知识树上,即可使用自己的Term数据定制标注样本和标注任务。 + +## 模型结构 + +模型使用[ERNIE-CTM](../ernie-ctm)+CRF训练而成,预测时使用viterbi解码,模型结构如下: + +wordtag模型结构 + + +## Term-Linking实现 + +WordTag提供从文本到百科知识树的链接方法,即Term-Linking,只需将term词类体系与百科知识树数据加载到工具中,即可在解析结果中得到term-linking结果。 + +为了能够适配应用中的不同实体集(例如,不同的企业有不同的人物实体集合,不同的小说站有不同的小说实体集合),我们将term-linking拆分为两个步骤: + +- 第一步是基于词类的linking,主要解决“同名概念词/实体词”、“不同类的同名词”消歧问题,这一步只使用文本本身特征和词类特征,不使用图谱中的实体属性值(SPO)知识,从而支持切换不同应用知识图谱; +- 第二步是同类同名实体词的linking,主要解决同类下不同属性值的实体消歧问题,这一步需要使用实体词的SPO知识(一般用于实体特征表示计算,以及文本-实体相似度计算)。 + +“WordTag+百科知识树”的开源版提供了第一步的解决示例,第二步由于依赖于特定图谱的SPO知识,无法提供通用工具,未来可能提供通用解决方案。 + +WordTag模型对所有的词预测到上位词类之后,会直接根据预测到的词类,映射到term体系(映射表参见代码配置),查找相应的term,进行link。用户也可根据自己的数据分布,定制term-linking策略: + +- link到自己定制的term词表:只需将term词表按照TermTree挂接好之后更换数据即可; +- 调整WordTag预测词类与term词表的映射关系(如,增加自定义类别):在代码配置中直接调整映射表即可。 + +## WordTag类别标签集合 + +WordTag共包含91种词性及专名类别标签,标签集合如下表 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+<table>
+    <thead>
+        <tr><th colspan='7'>WordTag标签集合</th></tr>
+    </thead>
+    <tbody>
+        <tr><td>人物类_实体</td><td>组织机构类_军事组织机构_概念</td><td>文化类_制度政策协议</td><td>位置方位</td><td>术语类_医药学术语</td><td>信息资料_性别</td><td>否定词</td></tr>
+        <tr><td>人物类_概念</td><td>组织机构类_医疗卫生机构</td><td>文化类_姓氏与人名</td><td>世界地区类</td><td>术语类_生物体</td><td>链接地址</td><td>数量词</td></tr>
+        <tr><td>作品类_实体</td><td>组织机构类_医疗卫生机构_概念</td><td>生物类</td><td>世界地区类_国家</td><td>疾病损伤类</td><td>个性特征</td><td>数量词_序数词</td></tr>
+        <tr><td>作品类_概念</td><td>组织机构类_教育组织机构</td><td>生物类_植物</td><td>世界地区类_区划概念</td><td>疾病损伤类_植物病虫害</td><td>感官特征</td><td>数量词_单位数量词</td></tr>
+        <tr><td>组织机构类</td><td>组织机构类_教育组织机构_概念</td><td>生物类_动物</td><td>世界地区类_地理概念</td><td>宇宙类</td><td>场景事件</td><td>叹词</td></tr>
+        <tr><td>组织机构类_概念</td><td>物体类</td><td>品牌名</td><td>饮食类</td><td>事件类</td><td>介词</td><td>拟声词</td></tr>
+        <tr><td>组织机构类_企事业单位</td><td>物体类_概念</td><td>品牌名_品牌类型</td><td>饮食类_菜品</td><td>时间类</td><td>介词_方位介词</td><td>修饰词</td></tr>
+        <tr><td>组织机构类_企事业单位_概念</td><td>物体类_兵器</td><td>场所类</td><td>饮食类_饮品</td><td>时间类_特殊日</td><td>助词</td><td>修饰词_性质</td></tr>
+        <tr><td>组织机构类_国家机关</td><td>物体类_化学物质</td><td>场所类_概念</td><td>药物类</td><td>时间类_朝代</td><td>代词</td><td>修饰词_类型</td></tr>
+        <tr><td>组织机构类_国家机关_概念</td><td>其他角色类</td><td>场所类_交通场所</td><td>药物类_中药</td><td>时间类_具体时间</td><td>连词</td><td>修饰词_化</td></tr>
+        <tr><td>组织机构类_体育组织机构</td><td>文化类</td><td>场所类_交通场所_概念</td><td>术语类</td><td>时间类_时长</td><td>副词</td><td>外语单词</td></tr>
+        <tr><td>组织机构类_体育组织机构_概念</td><td>文化类_语言文字</td><td>场所类_网上场所</td><td>术语类_术语类型</td><td>词汇用语</td><td>疑问词</td><td>汉语拼音</td></tr>
+        <tr><td>组织机构类_军事组织机构</td><td>文化类_奖项赛事活动</td><td>场所类_网上场所_概念</td><td>术语类_符号指标类</td><td>信息资料</td><td>肯定词</td><td>w(标点)</td></tr>
+    </tbody>
+</table>
+ + +## WordTag应用场景 + +参见"[解语的应用场景](../)" + + +## WordTag示例代码 +下面提供了WordTag模型进行文本到百科知识树链接的示例程序。 + +### Term-Linking示例程序 + +Term-Linking示例程序可以对无标签数据启动模型预测, 例如想对下面几段文本进行百科知识树的链接解析 +``` +"《孤女》是2010年九州出版社出版的小说,作者是余兼羽。", +"热梅茶是一道以梅子为主要原料制作的茶饮" +``` + +执行下面的脚本即可快速获取上面两段文本的百科知识树链接的结果 + +```python +from paddlenlp import Taskflow +wordtag = Taskflow("knowledge_mining", model="wordtag", linking=True) +wordtag(["热梅茶是一道以梅子为主要原料制作的茶饮", + "《孤女》是2010年九州出版社出版的小说,作者是余兼羽"]) +# Support the input text directly +wordtag("热梅茶是一道以梅子为主要原料制作的茶饮") + +``` +下面是运行WordTag工具后的知识链接的预测结果 + +```json +[{'text': '《孤女》是2010年九州出版社出版的小说,作者是余兼羽。', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '孤女', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 2, 'termid': '小说_eb_孤女'}, {'item': '》', 'offset': 3, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 4, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '2010年', 'offset': 5, 'wordtag_label': '时间类', 'length': 5, 'termid': '时间阶段_cb_2010年'}, {'item': '九州出版社', 'offset': 10, 'wordtag_label': '组织机构类', 'length': 5, 'termid': '组织机构_eb_九州出版社'}, {'item': '出版', 'offset': 15, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_出版'}, {'item': '的', 'offset': 17, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '小说', 'offset': 18, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '小说_cb_小说'}, {'item': ',', 'offset': 20, 'wordtag_label': 'w', 'length': 1}, {'item': '作者', 'offset': 21, 'wordtag_label': '人物类_概念', 'length': 2, 'termid': '人物_cb_作者'}, {'item': '是', 'offset': 23, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '余兼羽', 'offset': 24, 'wordtag_label': '人物类_实体', 'length': 3}, {'item': '。', 'offset': 27, 'wordtag_label': 'w', 'length': 1}]}, {'text': '热梅茶是一道以梅子为主要原料制作的茶饮', 'items': [{'item': '热梅茶', 'offset': 0, 'wordtag_label': '饮食类_饮品', 'length': 3}, {'item': '是', 'offset': 3, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '一道', 'offset': 4, 'wordtag_label': '数量词', 'length': 2}, {'item': '以', 'offset': 6, 'wordtag_label': '介词', 'length': 1, 'termid': '介词_cb_以'}, {'item': '梅子', 'offset': 7, 'wordtag_label': '饮食类', 'length': 2, 'termid': '饮食_cb_梅'}, {'item': '为', 'offset': 9, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_为'}, {'item': '主要原料', 'offset': 10, 'wordtag_label': '物体类', 'length': 4, 'termid': '物品_cb_主要原料'}, {'item': '制作', 'offset': 14, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_制作'}, {'item': '的', 'offset': 16, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '茶饮', 'offset': 17, 'wordtag_label': '饮食类_饮品', 'length': 2, 'termid': '饮品_cb_茶饮'}]}] +{'text': '热梅茶是一道以梅子为主要原料制作的茶饮', 'items': [{'item': '热梅茶', 'offset': 0, 'wordtag_label': '饮食类_饮品', 'length': 3}, {'item': '是', 'offset': 3, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '一道', 'offset': 4, 'wordtag_label': '数量词', 'length': 2}, {'item': '以', 'offset': 6, 'wordtag_label': '介词', 'length': 1, 'termid': '介词_cb_以'}, {'item': '梅子', 'offset': 7, 'wordtag_label': '饮食类', 'length': 2, 'termid': '饮食_cb_梅'}, {'item': '为', 'offset': 9, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_为'}, {'item': '主要原料', 'offset': 10, 'wordtag_label': '物体类', 'length': 4, 'termid': '物品_cb_主要原料'}, {'item': '制作', 'offset': 14, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_制作'}, {'item': '的', 'offset': 16, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '茶饮', 'offset': 17, 'wordtag_label': '饮食类_饮品', 'length': 2, 'termid': 
'饮品_cb_茶饮'}]} +``` + +同时我们也提供了基于上述taskflow的python执行脚本,具体的执行方式如下: +```shell +python predict.py --max_seq_len 128 --batch_size 2 +``` +其中参数释义如下: +- `max_seq_len` 表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每个预测批次的样本数目。 + +## WordTag进阶使用 + +### 自定义模型一键预测 + +用户可以使用自有数据对WordTag模型进行增量训练,然后使用Taskflow进行一键预测,参见[WordTag增量训练示例](../ernie-ctm)。 + +### 自定义Term-Linking + +Taskflow默认使用TermTreeV1.0实现Term-Linking, 用户也可以基于自己的TermTree实现Term-Linking,参见[自定义TermTree](../termtree)。 + +## Release Note + +- 2022.06:新增25个细化词类,用于下游挖掘任务 + +## WordTag后续计划 + +1. 持续优化知识标注模型,获得更加精准的标注结果; +2. 发布多粒度、多种参数规模的知识标注模型; +3. 提供细粒度term及subterm消歧的解决方案。 + + +## 在论文中引用WordTag + +如果您的工作成果中使用了WordTag,请增加下述引用。我们非常乐于看到WordTag对您的工作带来帮助。 +``` +@article{zhao2020TermTree, + title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, + author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, + technical report={Baidu, Inc. TR:2020-KG-TermTree}, + year={2020} +} +``` + + + +## 问题与反馈 + +WordTag在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/wordtag/predict.py b/examples/text_to_knowledge/wordtag/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..5f4fe353782dfaab4206a0f79412a270c570e364 --- /dev/null +++ b/examples/text_to_knowledge/wordtag/predict.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle + +from paddlenlp import Taskflow + + +def parse_args(): + parser = argparse.ArgumentParser() + + # fmt: off + parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", ) + parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.", ) + parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") + # fmt: on + + args = parser.parse_args() + return args + + +def do_predict(args): + paddle.set_device(args.device) + wordtag = Taskflow( + "knowledge_mining", model="wordtag", batch_size=args.batch_size, max_seq_length=args.max_seq_len, linking=True + ) + txts = ["《孤女》是2010年九州出版社出版的小说,作者是余兼羽。", "热梅茶是一道以梅子为主要原料制作的茶饮"] + res = wordtag(txts) + print(res) + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_predict(args) diff --git a/examples/text_to_sql/IGSQL/README.md b/examples/text_to_sql/IGSQL/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7ae994dbaba1ef4bb9a0cc30aec578736a6658a7 --- /dev/null +++ b/examples/text_to_sql/IGSQL/README.md @@ -0,0 +1,126 @@ +# IGSQL: Database Schema Interaction Graph Based Neural Model for Context-Dependent Text-to-SQL Generation + +## 上下文相关的 Text2SQL 任务 + +语义解析是一种交互式分析技术,其将用户输入的自然语言表述转成可操作执行的语义表示形式,如逻辑表达式(如一阶逻辑表示,lambda表示等)、编程语言(如SQL、python等)、数学公式等。 + +Text2SQL 是语义解析技术中的一类任务,让机器自动将用户输入的自然语言问题转成可与数据库交互的 SQL 查询语言,实现基于数据库的自动问答能力。 + +上下文相关的 Text2SQL 则指在多轮问答、对话等场景中,对问题的解析除了依赖当前轮次的输入语句,往往同时依赖于上文中的用户语句和系统答复等,即要求模型具备上下文的感知(建模)能力,才可以更好地完成 SQL 生成的任务。这种多轮交互的方式更符合人类的行为习惯,所以上下文相关的 Text2SQL 解析技术也日益受到重视,成为学术界、工业界的研究重点和应用方向。 + +## 数据集 + +当前学术界主流的上下文相关的 Text2SQL 数据集包括[SParC](https://yale-lily.github.io/sparc)、[CoSQL](https://yale-lily.github.io/cosql) 等,详细说明可参见上述链接页面及相应的论文。 + +## 基线系统 +本系统基于 PaddlePaddle 动态图复现了 [IGSQL](https://github.com/headacheboy/IGSQL) 模型,其核心是基于预训练模型(ERNIE、BERT等)和LSTM的基础 Encoder,以及针对多轮场景的交互 Schema Encoder 和上下文句子 Encoder,而解码端则是在 EditSQL 基础上扩展而来的、基于门控机制和拷贝机制的 SQL 序列生成 Decoder。 + +# 环境准备 +代码运行需要 Linux 主机,Python 3.7 和 PaddlePaddle 2.1 以上版本。 + +## 推荐的环境 + +* 操作系统 CentOS 7.5 +* Python 3.7.9 +* PaddlePaddle develop + +除此之外,强烈建议使用支持 GPU 的硬件环境。 + +## PaddlePaddle + +可根据机器情况和个人需求在 PaddlePaddle 和 PaddlePaddle-GPU 中二选一安装。 +如果机器支持GPU,则建议安装GPU版本。 + +关于 PaddlePaddle 的安装教程、使用方法等请参考[官方文档](https://www.paddlepaddle.org.cn/#quick-start). 
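+
+下面给出一个基于 pip 的安装命令示意(仅供参考,具体的安装命令、版本与 CUDA 适配请以上述官方文档为准):
+
+```bash
+# CPU 版本
+pip install paddlepaddle
+
+# GPU 版本(需要本机已具备相应的 CUDA 运行环境)
+pip install paddlepaddle-gpu
+```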
+ +## 第三方 Python 库 +除 PaddlePaddle 及其依赖之外,还依赖其它第三方 Python 库,位于代码根目录的 requirements.txt 文件中。 + +可使用 pip 一键安装 + +```pip install -r requirements.txt``` + +# 数据准备 + +```bash +# 下载模型训练、测试数据 +# 得到的sparc,cosql 两个数据集 +wget https://bj.bcebos.com/paddlenlp/paddlenlp/resource/igsql_data.tar.gz +tar xzvf igsql_data.tar.gz +# 下载glove词向量 +wget http://nlp.stanford.edu/data/glove.840B.300d.zip +unzip glove.840B.300d.zip +``` + +# 数据预处理 + +对原始数据进行数据预处理,以适配模型的输入,以sparc为例: + +```bash +python preprocess.py --dataset=sparc --remove_from +``` + +## 训练 + +以训练sparc模型为例: + +```bash +python run.py --raw_train_filename="data/sparc_data_removefrom/train.pkl" \ + --raw_validation_filename="data/sparc_data_removefrom/dev.pkl" \ + --database_schema_filename="data/sparc_data_removefrom/tables.json" \ + --embedding_filename="glove.840B.300d.txt" \ + --data_directory="processed_data_sparc_removefrom" \ + --logdir="logs_sparc" \ + --train=True \ + --evaluate=True +``` + +参数说明: +* raw_train_filename, raw_validation_filename, database_schema_filename: 数据集文件路径。 +* embedding_filename: GLOVE 词向量文件路径。 +* data_directory: 预处理得到的文件夹路径。 +* logdir: 输出日志文件夹路径。 +* train,evaluate: 是否执行train,evaluate。 + + +### 训练阶段的输出日志 +训练过程会输出loss、acc相关日志,内容类似: +``` +total_gold_tokens:13, step:5981================================= ] 99% ETA: 0:00:03 +LOSS:0.4242228865623474 +train [==================================] 100% Time: 1:20:22 +Predicting with file logs_sparc/train-eval_predictions.json +logs_sparc/train-eval[==================================] 100% Time: 0:01:30 +Predicting with file logs_sparc/valid-eval_predictions.json +logs_sparc/valid-eval[==================================]100% Time: 0:04:53 +``` + +## 预测 + +以预测sparc数据集为例: + +```bash +python run.py --raw_train_filename="data/sparc_data_removefrom/train.pkl" \ + --raw_validation_filename="data/sparc_data_removefrom/dev.pkl" \ + --database_schema_filename="data/sparc_data_removefrom/tables.json" \ + --embedding_filename="glove.840B.300d.txt" \ + --data_directory="processed_data_sparc_removefrom" \ + --logdir="logs_sparc_eval" \ + --evaluate=True \ + --save_file="logs_sparc/best_model" +``` + +参数说明: +* save_file: 加载的模型路径,请修改为真实的模型加载路径。 + +执行完上述命令后,预测结果保存在 "logs_sparc_eval/valid_use_gold_queries_predictions.json"。 + +# 评估 + +执行以下命令获得评估结果: + +```bash +python postprocess_eval.py --dataset=sparc --split=dev --pred_file logs_sparc_eval/valid_use_gold_queries_predictions.json --remove_from +``` + +其中的 --pred_file 参数请修改为真实的模型预测输出路径,评估结果保存在 "logs_sparc_eval/valid_use_gold_queries_predictions.json.eval"。 diff --git a/examples/text_to_sql/IGSQL/data_util/anonymization.py b/examples/text_to_sql/IGSQL/data_util/anonymization.py new file mode 100644 index 0000000000000000000000000000000000000000..61686f65d69360fcbccfabf41582a568766703ff --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/anonymization.py @@ -0,0 +1,274 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
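+# Usage sketch (illustration only, not part of the original module): the file
+# path and the "sql" key below follow the defaults visible elsewhere in this
+# repo (e.g. data_util/atis_data.py and atis_batch.py); adjust them as needed.
+#   anonymizer = Anonymizer("data/anonymization.txt")
+#   anon_seq = anonymizer.anonymize(sql_tokens, tok_to_entity_dict, key="sql", add_new_anon_toks=True)
+#   raw_seq = deanonymize(anon_seq, tok_to_entity_dict, key="sql")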
+"""Code for identifying and anonymizing entities in NL and SQL.""" + +import copy +import json + +from . import util + +ENTITY_NAME = "ENTITY" +CONSTANT_NAME = "CONSTANT" +TIME_NAME = "TIME" +SEPARATOR = "#" + + +def timeval(string): + """Returns the numeric version of a time. + + Args: + string (`str`): String representing a time. + + Returns: + `str`: String representing the absolute time. + """ + if string.endswith("am") or string.endswith("pm") and string[:-2].isdigit(): + numval = int(string[:-2]) + if len(string) == 3 or len(string) == 4: + numval *= 100 + if string.endswith("pm"): + numval += 1200 + return str(numval) + return "" + + +def is_time(string): + """Returns whether a string represents a time. + + Args: + string (str): String to check. + + Returns: + `bool`: Whether the string represents a time. + """ + if string.endswith("am") or string.endswith("pm"): + if string[:-2].isdigit(): + return True + + return False + + +def deanonymize(sequence, ent_dict, key): + """Deanonymizes a sequence. + + Args: + sequence (`list`): List of tokens to deanonymize. + ent_dict (`dict`): Maps from tokens to the entity dictionary. + key (`str`): The key to use, in this case either natural language or SQL. + + Returns: + `list`: Deanonymized sequence of tokens. + """ + new_sequence = [] + for token in sequence: + if token in ent_dict: + new_sequence.extend(ent_dict[token][key]) + else: + new_sequence.append(token) + + return new_sequence + + +class Anonymizer: + """Anonymization class for keeping track of entities in this domain and + scripts for anonymizing/deanonymizing. + + Attributes: + anonymization_map (`list`): Containing entities from + the anonymization file. + entity_types (`list`): All entities in the anonymization file. + keys (`set`): Possible keys (types of text handled); in this case it should be + one for natural language and another for SQL. + entity_set (`set`): entity_types as a set. + """ + + def __init__(self, filename): + self.anonymization_map = [] + self.entity_types = [] + self.keys = set() + + pairs = [json.loads(line) for line in open(filename).readlines()] + for pair in pairs: + for key in pair: + if key != "type": + self.keys.add(key) + self.anonymization_map.append(pair) + if pair["type"] not in self.entity_types: + self.entity_types.append(pair["type"]) + + self.entity_types.append(ENTITY_NAME) + self.entity_types.append(CONSTANT_NAME) + self.entity_types.append(TIME_NAME) + + self.entity_set = set(self.entity_types) + + def get_entity_type_from_token(self, token): + """Gets the type of an entity given an anonymized token. + + Args: + token (`str`): The entity token. + + Returns: + `str`: representing the type of the entity. + """ + # these are in the pattern NAME:#, so just strip the thing after the + # colon + colon_loc = token.index(SEPARATOR) + entity_type = token[:colon_loc] + assert entity_type in self.entity_set + + return entity_type + + def is_anon_tok(self, token): + """Returns whether a token is an anonymized token or not. + + Args: + token (`str`): The token to check. + + Returns: + `bool`: whether the token is an anonymized token. + """ + return token.split(SEPARATOR)[0] in self.entity_set + + def get_anon_id(self, token): + """Gets the entity index (unique ID) for a token. + + Args: + token (`str`): The token to get the index from. + + Returns: + `int`: the token ID if it is an anonymized token; otherwise -1. 
+ """ + if self.is_anon_tok(token): + return self.entity_types.index(token.split(SEPARATOR)[0]) + else: + return -1 + + def anonymize(self, sequence, tok_to_entity_dict, key, add_new_anon_toks=False): + """Anonymizes a sequence. + + Args: + sequence (`list`): Sequence to anonymize. + tok_to_entity_dict (`dict`): Existing dictionary mapping from anonymized + tokens to entities. + key (`str`): Which kind of text this is (natural language or SQL) + add_new_anon_toks (`bool`): Whether to add new entities to tok_to_entity_dict. + + Returns: + `list`: The anonymized sequence. + """ + # Sort the token-tok-entity dict by the length of the modality. + sorted_dict = sorted(tok_to_entity_dict.items(), key=lambda k: len(k[1][key]))[::-1] + + anonymized_sequence = copy.deepcopy(sequence) + + if add_new_anon_toks: + type_counts = {} + for entity_type in self.entity_types: + type_counts[entity_type] = 0 + for token in tok_to_entity_dict: + entity_type = self.get_entity_type_from_token(token) + type_counts[entity_type] += 1 + + # First find occurrences of things in the anonymization dictionary. + for token, modalities in sorted_dict: + our_modality = modalities[key] + + # Check if this key's version of the anonymized thing is in our + # sequence. + while util.subsequence(our_modality, anonymized_sequence): + found = False + for startidx in range(len(anonymized_sequence) - len(our_modality) + 1): + if anonymized_sequence[startidx : startidx + len(our_modality)] == our_modality: + anonymized_sequence = ( + anonymized_sequence[:startidx] + + [token] + + anonymized_sequence[startidx + len(our_modality) :] + ) + found = True + break + assert found, ( + "Thought " + str(our_modality) + " was in [" + str(anonymized_sequence) + "] but could not find it" + ) + + # Now add new keys if they are present. + if add_new_anon_toks: + + # For every span in the sequence, check whether it is in the anon map + # for this modality + sorted_anon_map = sorted(self.anonymization_map, key=lambda k: len(k[key]))[::-1] + + for pair in sorted_anon_map: + our_modality = pair[key] + + token_type = pair["type"] + new_token = token_type + SEPARATOR + str(type_counts[token_type]) + + while util.subsequence(our_modality, anonymized_sequence): + found = False + for startidx in range(len(anonymized_sequence) - len(our_modality) + 1): + if anonymized_sequence[startidx : startidx + len(our_modality)] == our_modality: + if new_token not in tok_to_entity_dict: + type_counts[token_type] += 1 + tok_to_entity_dict[new_token] = pair + + anonymized_sequence = ( + anonymized_sequence[:startidx] + + [new_token] + + anonymized_sequence[startidx + len(our_modality) :] + ) + found = True + break + assert found, ( + "Thought " + + str(our_modality) + + " was in [" + + str(anonymized_sequence) + + "] but could not find it" + ) + + # Also replace integers with constants + for index, token in enumerate(anonymized_sequence): + if token.isdigit() or is_time(token): + if token.isdigit(): + entity_type = CONSTANT_NAME + value = new_token + if is_time(token): + entity_type = TIME_NAME + value = timeval(token) + + # First try to find the constant in the entity dictionary already, + # and get the name if it's found. 
+ new_token = "" + new_dict = {} + found = False + for entity, value in tok_to_entity_dict.items(): + if value[key][0] == token: + new_token = entity + new_dict = value + found = True + break + + if not found: + new_token = entity_type + SEPARATOR + str(type_counts[entity_type]) + new_dict = {} + for tempkey in self.keys: + new_dict[tempkey] = [token] + + tok_to_entity_dict[new_token] = new_dict + type_counts[entity_type] += 1 + + anonymized_sequence[index] = new_token + + return anonymized_sequence diff --git a/examples/text_to_sql/IGSQL/data_util/atis_batch.py b/examples/text_to_sql/IGSQL/data_util/atis_batch.py new file mode 100644 index 0000000000000000000000000000000000000000..f0eb9351ad3f1696353dbbed87e7629a06396bd7 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/atis_batch.py @@ -0,0 +1,334 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy + +from . import snippets as snip +from . import sql_util +from . import vocabulary as vocab + + +class UtteranceItem: + def __init__(self, interaction, index): + self.interaction = interaction + self.utterance_index = index + + def __str__(self): + return str(self.interaction.utterances[self.utterance_index]) + + def histories(self, maximum): + if maximum > 0: + history_seqs = [] + for utterance in self.interaction.utterances[: self.utterance_index]: + history_seqs.append(utterance.input_seq_to_use) + + if len(history_seqs) > maximum: + history_seqs = history_seqs[-maximum:] + + return history_seqs + return [] + + def input_sequence(self): + return self.interaction.utterances[self.utterance_index].input_seq_to_use + + def previous_query(self): + if self.utterance_index == 0: + return [] + return self.interaction.utterances[self.utterance_index - 1].anonymized_gold_query + + def anonymized_gold_query(self): + return self.interaction.utterances[self.utterance_index].anonymized_gold_query + + def snippets(self): + return self.interaction.utterances[self.utterance_index].available_snippets + + def original_gold_query(self): + return self.interaction.utterances[self.utterance_index].original_gold_query + + def contained_entities(self): + return self.interaction.utterances[self.utterance_index].contained_entities + + def original_gold_queries(self): + return [q[0] for q in self.interaction.utterances[self.utterance_index].all_gold_queries] + + def gold_tables(self): + return [q[1] for q in self.interaction.utterances[self.utterance_index].all_gold_queries] + + def gold_query(self): + return self.interaction.utterances[self.utterance_index].gold_query_to_use + [vocab.EOS_TOK] + + def gold_edit_sequence(self): + return self.interaction.utterances[self.utterance_index].gold_edit_sequence + + def gold_table(self): + return self.interaction.utterances[self.utterance_index].gold_sql_results + + def all_snippets(self): + return self.interaction.snippets + + def within_limits(self, max_input_length=float("inf"), max_output_length=float("inf")): + return 
self.interaction.utterances[self.utterance_index].length_valid(max_input_length, max_output_length) + + def expand_snippets(self, sequence): + # Remove the EOS + if sequence[-1] == vocab.EOS_TOK: + sequence = sequence[:-1] + + # First remove the snippets + no_snippets_sequence = self.interaction.expand_snippets(sequence) + no_snippets_sequence = sql_util.fix_parentheses(no_snippets_sequence) + return no_snippets_sequence + + def flatten_sequence(self, sequence): + # Remove the EOS + if sequence[-1] == vocab.EOS_TOK: + sequence = sequence[:-1] + + # First remove the snippets + no_snippets_sequence = self.interaction.expand_snippets(sequence) + + # Deanonymize + deanon_sequence = self.interaction.deanonymize(no_snippets_sequence, "sql") + return deanon_sequence + + +class UtteranceBatch: + def __init__(self, items): + self.items = items + + def __len__(self): + return len(self.items) + + def start(self): + self.index = 0 + + def next(self): + item = self.items[self.index] + self.index += 1 + return item + + def done(self): + return self.index >= len(self.items) + + +class PredUtteranceItem: + def __init__(self, input_sequence, interaction_item, previous_query, index, available_snippets): + self.input_seq_to_use = input_sequence + self.interaction_item = interaction_item + self.index = index + self.available_snippets = available_snippets + self.prev_pred_query = previous_query + + def input_sequence(self): + return self.input_seq_to_use + + def histories(self, maximum): + histories = [] + if maximum == 0: + return histories + for utterance in self.interaction_item.processed_utterances[: self.index]: + histories.append(utterance.input_sequence()) + if len(histories) > maximum: + histories = histories[-maximum:] + return histories + + def snippets(self): + return self.available_snippets + + def previous_query(self): + return self.prev_pred_query + + def flatten_sequence(self, sequence): + return self.interaction_item.flatten_sequence(sequence) + + def remove_snippets(self, sequence): + return sql_util.fix_parentheses(self.interaction_item.expand_snippets(sequence)) + + def set_predicted_query(self, query): + self.anonymized_pred_query = query + + +class InteractionItem: + def __init__( + self, + interaction, + max_input_length=float("inf"), + max_output_length=float("inf"), + nl_to_sql_dict={}, + maximum_length=float("inf"), + ): + if maximum_length != float("inf"): + self.interaction = copy.deepcopy(interaction) + self.interaction.utterances = self.interaction.utterances[:maximum_length] + else: + self.interaction = interaction + self.processed_utterances = [] + self.snippet_bank = [] + self.identifier = self.interaction.identifier + + self.max_input_length = max_input_length + self.max_output_length = max_output_length + + self.nl_to_sql_dict = nl_to_sql_dict + + self.index = 0 + + def __len__(self): + return len(self.interaction) + + def __str__(self): + s = "Utterances, gold queries, and predictions:\n" + for i, utterance in enumerate(self.interaction.utterances): + s += " ".join(utterance.input_seq_to_use) + "\n" + pred_utterance = self.processed_utterances[i] + s += " ".join(pred_utterance.gold_query()) + "\n" + s += " ".join(pred_utterance.anonymized_query()) + "\n" + s += "\n" + s += "Snippets:\n" + for snippet in self.snippet_bank: + s += str(snippet) + "\n" + + return s + + def start_interaction(self): + assert len(self.snippet_bank) == 0 + assert len(self.processed_utterances) == 0 + assert self.index == 0 + + def next_utterance(self): + utterance = 
self.interaction.utterances[self.index] + self.index += 1 + + available_snippets = self.available_snippets(snippet_keep_age=1) + + return PredUtteranceItem( + utterance.input_seq_to_use, + self, + self.processed_utterances[-1].anonymized_pred_query if len(self.processed_utterances) > 0 else [], + self.index - 1, + available_snippets, + ) + + def done(self): + return len(self.processed_utterances) == len(self.interaction) + + def finish(self): + self.snippet_bank = [] + self.processed_utterances = [] + self.index = 0 + + def utterance_within_limits(self, utterance_item): + return utterance_item.within_limits(self.max_input_length, self.max_output_length) + + def available_snippets(self, snippet_keep_age): + return [snippet for snippet in self.snippet_bank if snippet.index <= snippet_keep_age] + + def gold_utterances(self): + utterances = [] + for i, utterance in enumerate(self.interaction.utterances): + utterances.append(UtteranceItem(self.interaction, i)) + return utterances + + def get_schema(self): + return self.interaction.schema + + def add_utterance(self, utterance, predicted_sequence, snippets=None, previous_snippets=[], simple=False): + if not snippets: + self.add_snippets(predicted_sequence, previous_snippets=previous_snippets, simple=simple) + else: + for snippet in snippets: + snippet.assign_id(len(self.snippet_bank)) + self.snippet_bank.append(snippet) + + for snippet in self.snippet_bank: + snippet.increase_age() + self.processed_utterances.append(utterance) + + def add_snippets(self, sequence, previous_snippets=[], simple=False): + if sequence: + if simple: + snippets = sql_util.get_subtrees_simple(sequence, oldsnippets=previous_snippets) + else: + snippets = sql_util.get_subtrees(sequence, oldsnippets=previous_snippets) + for snippet in snippets: + snippet.assign_id(len(self.snippet_bank)) + self.snippet_bank.append(snippet) + + for snippet in self.snippet_bank: + snippet.increase_age() + + def expand_snippets(self, sequence): + return sql_util.fix_parentheses(snip.expand_snippets(sequence, self.snippet_bank)) + + def remove_snippets(self, sequence): + if sequence[-1] == vocab.EOS_TOK: + sequence = sequence[:-1] + + no_snippets_sequence = self.expand_snippets(sequence) + no_snippets_sequence = sql_util.fix_parentheses(no_snippets_sequence) + return no_snippets_sequence + + def flatten_sequence(self, sequence, gold_snippets=False): + if sequence[-1] == vocab.EOS_TOK: + sequence = sequence[:-1] + + if gold_snippets: + no_snippets_sequence = self.interaction.expand_snippets(sequence) + else: + no_snippets_sequence = self.expand_snippets(sequence) + no_snippets_sequence = sql_util.fix_parentheses(no_snippets_sequence) + + deanon_sequence = self.interaction.deanonymize(no_snippets_sequence, "sql") + return deanon_sequence + + def gold_query(self, index): + return self.interaction.utterances[index].gold_query_to_use + [vocab.EOS_TOK] + + def original_gold_query(self, index): + return self.interaction.utterances[index].original_gold_query + + def gold_table(self, index): + return self.interaction.utterances[index].gold_sql_results + + +class InteractionBatch: + def __init__(self, items): + self.items = items + + def __len__(self): + return len(self.items) + + def start(self): + self.timestep = 0 + self.current_interactions = [] + + def get_next_utterance_batch(self, snippet_keep_age, use_gold=False): + items = [] + self.current_interactions = [] + for interaction in self.items: + if self.timestep < len(interaction): + utterance_item = 
interaction.original_utterances(snippet_keep_age, use_gold)[self.timestep] + self.current_interactions.append(interaction) + items.append(utterance_item) + + self.timestep += 1 + return UtteranceBatch(items) + + def done(self): + finished = True + for interaction in self.items: + if self.timestep < len(interaction): + finished = False + return finished + return finished diff --git a/examples/text_to_sql/IGSQL/data_util/atis_data.py b/examples/text_to_sql/IGSQL/data_util/atis_data.py new file mode 100644 index 0000000000000000000000000000000000000000..6ac6ec6ba20f266cb7d42c6f75bb2e7b158d70f2 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/atis_data.py @@ -0,0 +1,495 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Utility functions for loading and processing ATIS data.""" + +import os +import random +import json + +from . import anonymization as anon +from . import atis_batch +from . import dataset_split as ds +from .interaction import load_function +from .entities import NLtoSQLDict +from .atis_vocab import ATISVocabulary + +ENTITIES_FILENAME = "data/entities.txt" +ANONYMIZATION_FILENAME = "data/anonymization.txt" + + +class ATISDataset: + """Contains the ATIS data.""" + + def __init__(self, params): + self.anonymizer = None + if params.anonymize: + self.anonymizer = anon.Anonymizer(ANONYMIZATION_FILENAME) + + if not os.path.exists(params.data_directory): + os.mkdir(params.data_directory) + + self.entities_dictionary = NLtoSQLDict(ENTITIES_FILENAME) + + database_schema = None + if params.database_schema_filename: + if "removefrom" not in params.data_directory: + ( + database_schema, + column_names_surface_form, + column_names_embedder_input, + ) = self.read_database_schema_simple(params.database_schema_filename) + else: + database_schema, column_names_surface_form, column_names_embedder_input = self.read_database_schema( + params.database_schema_filename + ) + + int_load_function = load_function( + params, self.entities_dictionary, self.anonymizer, database_schema=database_schema + ) + + def collapse_list(the_list): + """Collapses a list of list into a single list.""" + return [s for i in the_list for s in i] + + if "atis" not in params.data_directory: + self.train_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_train_filename), + params.raw_train_filename, + int_load_function, + ) + self.valid_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_validation_filename), + params.raw_validation_filename, + int_load_function, + ) + + s = set() + for ele in self.train_data.examples: + s.add(ele.schema.table_schema["db_id"]) + db_id_ls = list(s) + db_id_ls.sort() + self.id2db = db_id_ls + self.db2id = {} + for i in range(len(self.id2db)): + self.db2id[self.id2db[i]] = i + + train_input_seqs = collapse_list(self.train_data.get_ex_properties(lambda i: i.input_seqs())) + valid_input_seqs = collapse_list(self.valid_data.get_ex_properties(lambda i: i.input_seqs())) + + 
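+            # For the non-ATIS splits, the input vocabulary below is built from both the train and validation utterances.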
all_input_seqs = train_input_seqs + valid_input_seqs + + self.input_vocabulary = ATISVocabulary( + all_input_seqs, + os.path.join(params.data_directory, params.input_vocabulary_filename), + params, + is_input="input", + anonymizer=self.anonymizer if params.anonymization_scoring else None, + ) + + self.output_vocabulary_schema = ATISVocabulary( + column_names_embedder_input, + os.path.join(params.data_directory, "schema_" + params.output_vocabulary_filename), + params, + is_input="schema", + anonymizer=self.anonymizer if params.anonymization_scoring else None, + ) + + train_output_seqs = collapse_list(self.train_data.get_ex_properties(lambda i: i.output_seqs())) + valid_output_seqs = collapse_list(self.valid_data.get_ex_properties(lambda i: i.output_seqs())) + all_output_seqs = train_output_seqs + valid_output_seqs + + sql_keywords = [ + ".", + "t1", + "t2", + "=", + "select", + "as", + "join", + "on", + ")", + "(", + "where", + "t3", + "by", + ",", + "group", + "distinct", + "t4", + "and", + "limit", + "desc", + ">", + "avg", + "having", + "max", + "in", + "<", + "sum", + "t5", + "intersect", + "not", + "min", + "except", + "or", + "asc", + "like", + "!", + "union", + "between", + "t6", + "-", + "t7", + "+", + "/", + ] + sql_keywords += ["count", "from", "value", "order"] + sql_keywords += ["group_by", "order_by", "limit_value", "!="] + + # skip column_names_surface_form but keep sql_keywords + skip_tokens = list(set(column_names_surface_form) - set(sql_keywords)) + + if params.data_directory == "processed_data_sparc_removefrom": + all_output_seqs = [] + out_vocab_ordered = [ + "select", + "value", + ")", + "(", + "where", + "=", + ",", + "count", + "group_by", + "order_by", + "limit_value", + "desc", + ">", + "distinct", + "avg", + "and", + "having", + "<", + "in", + "max", + "sum", + "asc", + "like", + "not", + "or", + "min", + "intersect", + "except", + "!=", + "union", + "between", + "-", + "+", + ] + for i in range(len(out_vocab_ordered)): + all_output_seqs.append(out_vocab_ordered[: i + 1]) + + self.output_vocabulary = ATISVocabulary( + all_output_seqs, + os.path.join(params.data_directory, params.output_vocabulary_filename), + params, + is_input="output", + anonymizer=self.anonymizer if params.anonymization_scoring else None, + skip=skip_tokens, + ) + else: + self.train_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_train_filename), + params.raw_train_filename, + int_load_function, + ) + if params.train: + self.valid_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_validation_filename), + params.raw_validation_filename, + int_load_function, + ) + if params.evaluate or params.attention: + self.dev_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_dev_filename), + params.raw_dev_filename, + int_load_function, + ) + if params.enable_testing: + self.test_data = ds.DatasetSplit( + os.path.join(params.data_directory, params.processed_test_filename), + params.raw_test_filename, + int_load_function, + ) + + train_input_seqs = [] + train_input_seqs = collapse_list(self.train_data.get_ex_properties(lambda i: i.input_seqs())) + + self.input_vocabulary = ATISVocabulary( + train_input_seqs, + os.path.join(params.data_directory, params.input_vocabulary_filename), + params, + is_input="input", + min_occur=2, + anonymizer=self.anonymizer if params.anonymization_scoring else None, + ) + + train_output_seqs = collapse_list(self.train_data.get_ex_properties(lambda i: i.output_seqs())) + + self.output_vocabulary 
= ATISVocabulary( + train_output_seqs, + os.path.join(params.data_directory, params.output_vocabulary_filename), + params, + is_input="output", + anonymizer=self.anonymizer if params.anonymization_scoring else None, + ) + + self.output_vocabulary_schema = None + + def read_database_schema_simple(self, database_schema_filename): + with open(database_schema_filename, "r") as f: + database_schema = json.load(f) + + database_schema_dict = {} + column_names_surface_form = [] + column_names_embedder_input = [] + for table_schema in database_schema: + db_id = table_schema["db_id"] + database_schema_dict[db_id] = table_schema + + column_names = table_schema["column_names"] + column_names_original = table_schema["column_names_original"] + table_names = table_schema["table_names"] + table_names_original = table_schema["table_names_original"] + + for i, (table_id, column_name) in enumerate(column_names_original): + column_name_surface_form = column_name + column_names_surface_form.append(column_name_surface_form.lower()) + + for table_name in table_names_original: + column_names_surface_form.append(table_name.lower()) + + for i, (table_id, column_name) in enumerate(column_names): + column_name_embedder_input = column_name + column_names_embedder_input.append(column_name_embedder_input.split()) + + for table_name in table_names: + column_names_embedder_input.append(table_name.split()) + + database_schema = database_schema_dict + + return database_schema, column_names_surface_form, column_names_embedder_input + + def read_database_schema(self, database_schema_filename): + with open(database_schema_filename, "r") as f: + database_schema = json.load(f) + + database_schema_dict = {} + column_names_surface_form = [] + column_names_embedder_input = [] + for table_schema in database_schema: + db_id = table_schema["db_id"] + database_schema_dict[db_id] = table_schema + + column_names = table_schema["column_names"] + column_names_original = table_schema["column_names_original"] + table_names = table_schema["table_names"] + table_names_original = table_schema["table_names_original"] + + for i, (table_id, column_name) in enumerate(column_names_original): + if table_id >= 0: + table_name = table_names_original[table_id] + column_name_surface_form = "{}.{}".format(table_name, column_name) + else: + column_name_surface_form = column_name + column_names_surface_form.append(column_name_surface_form.lower()) + + # also add table_name.* + for table_name in table_names_original: + column_names_surface_form.append("{}.*".format(table_name.lower())) + + for i, (table_id, column_name) in enumerate(column_names): + if table_id >= 0: + table_name = table_names[table_id] + column_name_embedder_input = table_name + " . " + column_name + else: + column_name_embedder_input = column_name + column_names_embedder_input.append(column_name_embedder_input.split()) + + for table_name in table_names: + column_name_embedder_input = table_name + " . 
*" + column_names_embedder_input.append(column_name_embedder_input.split()) + + database_schema = database_schema_dict + + return database_schema, column_names_surface_form, column_names_embedder_input + + def get_all_utterances(self, dataset, max_input_length=float("inf"), max_output_length=float("inf")): + """Returns all utterances in a dataset.""" + items = [] + for interaction in dataset.examples: + for i, utterance in enumerate(interaction.utterances): + if utterance.length_valid(max_input_length, max_output_length): + items.append(atis_batch.UtteranceItem(interaction, i)) + return items + + def get_all_interactions( + self, + dataset, + max_interaction_length=float("inf"), + max_input_length=float("inf"), + max_output_length=float("inf"), + sorted_by_length=False, + ): + """Gets all interactions in a dataset that fit the criteria. + + Args: + dataset (`ATISDatasetSplit`): The dataset to use. + max_interaction_length (`int`): Maximum interaction length to keep. + max_input_length (`int`): Maximum input sequence length to keep. + max_output_length (`int`): Maximum output sequence length to keep. + sorted_by_length (`bool`): Whether to sort the examples by interaction length. + + Returns: + `list`: All interactions. + """ + ints = [ + atis_batch.InteractionItem( + interaction, max_input_length, max_output_length, self.entities_dictionary, max_interaction_length + ) + for interaction in dataset.examples + ] + if sorted_by_length: + return sorted(ints, key=lambda x: len(x))[::-1] + else: + return ints + + def get_utterance_batches( + self, batch_size, max_input_length=float("inf"), max_output_length=float("inf"), randomize=False + ): + """Gets batches of utterances in the data. + + Args: + batch_size (`int`): Batch size to use. + max_input_length (`int`): Maximum length of input to keep. + max_output_length (`int`): Maximum length of output to use. + randomize (`bool`): Whether to randomize the ordering. + + Returns: + `list`: Batches of utterances. + """ + # First, get all interactions and the positions of the utterances that are + # possible in them. + items = self.get_all_utterances(self.train_data, max_input_length, max_output_length) + # if randomize: + # random.shuffle(items) + + batches = [] + + current_batch_items = [] + for item in items: + if len(current_batch_items) >= batch_size: + batches.append(atis_batch.UtteranceBatch(current_batch_items)) + current_batch_items = [] + current_batch_items.append(item) + batches.append(atis_batch.UtteranceBatch(current_batch_items)) + + assert sum([len(batch) for batch in batches]) == len(items) + + return batches + + def get_interaction_batches( + self, + batch_size, + max_interaction_length=float("inf"), + max_input_length=float("inf"), + max_output_length=float("inf"), + randomize=False, + ): + """Gets batches of interactions in the data. + + Args: + batch_size (`int`): Batch size to use. + max_interaction_length (`int`): Maximum length of interaction to keep + max_input_length (`int`): Maximum length of input to keep. + max_output_length (`int`): Maximum length of output to keep. + randomize (`bool`): Whether to randomize the ordering. + + Returns: + `list`: Batches of interactions. 
+ + """ + items = self.get_all_interactions( + self.train_data, + max_interaction_length, + max_input_length, + max_output_length, + sorted_by_length=not randomize, + ) + if randomize: + random.shuffle(items) + + batches = [] + current_batch_items = [] + for item in items: + if len(current_batch_items) >= batch_size: + batches.append(atis_batch.InteractionBatch(current_batch_items)) + current_batch_items = [] + current_batch_items.append(item) + batches.append(atis_batch.InteractionBatch(current_batch_items)) + + assert sum([len(batch) for batch in batches]) == len(items) + + return batches + + def get_random_utterances(self, num_samples, max_input_length=float("inf"), max_output_length=float("inf")): + """Gets a random selection of utterances in the data. + + Args: + num_samples (`bool`): Number of random utterances to get. + max_input_length (`int`): Limit of input length. + max_output_length (`int`): Limit on output length. + + Returns: + `list`: A random selection of utterances. + """ + items = self.get_all_utterances(self.train_data, max_input_length, max_output_length) + random.shuffle(items) + return items[:num_samples] + + def get_random_interactions( + self, + num_samples, + max_interaction_length=float("inf"), + max_input_length=float("inf"), + max_output_length=float("inf"), + ): + """Gets a random selection of interactions in the data. + + Args: + num_samples (`bool`): Number of random interactions to get. + max_input_length (`int`): Limit of input length. + max_output_length (`int`): Limit on output length. + + Returns: + A random selection of interactions. + """ + items = self.get_all_interactions(self.train_data, max_interaction_length, max_input_length, max_output_length) + # random.shuffle(items) + return items[:num_samples] + + +def num_utterances(dataset): + """Returns the total number of utterances in the dataset.""" + return sum([len(interaction) for interaction in dataset.examples]) diff --git a/examples/text_to_sql/IGSQL/data_util/atis_vocab.py b/examples/text_to_sql/IGSQL/data_util/atis_vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..9a4e875c5221dcbdb74db3d94aa4bbc9b091878e --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/atis_vocab.py @@ -0,0 +1,84 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Gets and stores vocabulary for the ATIS data.""" + +from . import snippets +from .vocabulary import Vocabulary, UNK_TOK, DEL_TOK, EOS_TOK + +INPUT_FN_TYPES = [UNK_TOK, DEL_TOK, EOS_TOK] +OUTPUT_FN_TYPES = [UNK_TOK, EOS_TOK] + +MIN_INPUT_OCCUR = 1 +MIN_OUTPUT_OCCUR = 1 + + +class ATISVocabulary: + """Stores the vocabulary for the ATIS data. + + Attributes: + raw_vocab (`Vocabulary`): Vocabulary object. + tokens (`set`): Set of all of the strings in the vocabulary. + inorder_tokens (`list`): List of all tokens, with a strict and + unchanging order. 
+ """ + + def __init__(self, token_sequences, filename, params, is_input="input", min_occur=1, anonymizer=None, skip=None): + + if is_input == "input": + functional_types = INPUT_FN_TYPES + elif is_input == "output": + functional_types = OUTPUT_FN_TYPES + elif is_input == "schema": + functional_types = [UNK_TOK] + else: + functional_types = [] + + self.raw_vocab = Vocabulary( + token_sequences, + filename, + functional_types=functional_types, + min_occur=min_occur, + ignore_fn=lambda x: snippets.is_snippet(x) + or (anonymizer and anonymizer.is_anon_tok(x)) + or (skip and x in skip), + ) + self.tokens = set(self.raw_vocab.token_to_id.keys()) + self.inorder_tokens = self.raw_vocab.id_to_token + + assert len(self.inorder_tokens) == len(self.raw_vocab) + + def __len__(self): + return len(self.raw_vocab) + + def token_to_id(self, token): + """Maps from a token to a unique ID. + + Args: + token (`str`): The token to look up. + + Returns: + `int`: Uniquely identifying the token. + """ + return self.raw_vocab.token_to_id[token] + + def id_to_token(self, identifier): + """Maps from a unique integer to an identifier. + + Args: + identifier (`int`): The unique ID. + + Returns: + `str`: Representing the token. + """ + return self.raw_vocab.id_to_token[identifier] diff --git a/examples/text_to_sql/IGSQL/data_util/dataset_split.py b/examples/text_to_sql/IGSQL/data_util/dataset_split.py new file mode 100644 index 0000000000000000000000000000000000000000..7349046ae0e8ef9a2e0293a895e45460fd64fd5b --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/dataset_split.py @@ -0,0 +1,64 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Utility functions for loading and processing ATIS data.""" + +import os +import pickle + + +class DatasetSplit: + """Stores a split of the ATIS dataset. + + Attributes: + examples (`list`): Stores the examples in the split. + """ + + def __init__(self, processed_filename, raw_filename, load_function): + if os.path.exists(processed_filename): + print("Loading preprocessed data from " + processed_filename) + with open(processed_filename, "rb") as infile: + self.examples = pickle.load(infile) + else: + print("Loading raw data from " + raw_filename + " and writing to " + processed_filename) + + infile = open(raw_filename, "rb") + examples_from_file = pickle.load(infile) + assert isinstance(examples_from_file, list), raw_filename + " does not contain a list of examples" + infile.close() + + self.examples = [] + for example in examples_from_file: + obj, keep = load_function(example) + + if keep: + self.examples.append(obj) + + print("Loaded " + str(len(self.examples)) + " examples") + outfile = open(processed_filename, "wb") + pickle.dump(self.examples, outfile) + outfile.close() + + def get_ex_properties(self, function): + """Applies some function to the examples in the dataset. + + Args: + function (`function`): Function to apply to all examples. 
+ + Returns + `list`: The return value of the function + """ + elems = [] + for example in self.examples: + elems.append(function(example)) + return elems diff --git a/examples/text_to_sql/IGSQL/data_util/entities.py b/examples/text_to_sql/IGSQL/data_util/entities.py new file mode 100644 index 0000000000000000000000000000000000000000..7887b24a07cd06274ee4f181edad2f0401c23d78 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/entities.py @@ -0,0 +1,77 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Classes for keeping track of the entities in a natural language string. """ + +import json + + +class NLtoSQLDict: + """ + Entity dict file should contain, on each line, a JSON dictionary with + "input" and "output" keys specifying the string for the input and output + pairs. The idea is that the existence of the key in an input sequence + likely corresponds to the existence of the value in the output sequence. + + The entity_dict should map keys (input strings) to a list of values (output + strings) where this property holds. This allows keys to map to multiple + output strings (e.g. for times). + """ + + def __init__(self, entity_dict_filename): + self.entity_dict = {} + + pairs = [json.loads(line) for line in open(entity_dict_filename).readlines()] + for pair in pairs: + input_seq = pair["input"] + output_seq = pair["output"] + if input_seq not in self.entity_dict: + self.entity_dict[input_seq] = [] + self.entity_dict[input_seq].append(output_seq) + + def get_sql_entities(self, tokenized_nl_string): + """ + Gets the output-side entities which correspond to the input entities in + the input sequence. + + Args: + tokenized_nl_string (`list`): list of tokens in the input string. + + Outputs: + `list`: The output strings. + """ + assert len(tokenized_nl_string) > 0 + flat_input_string = " ".join(tokenized_nl_string) + entities = [] + + # See if any input strings are in our input sequence, and add the + # corresponding output strings if so. + for entry, values in self.entity_dict.items(): + in_middle = " " + entry + " " in flat_input_string + + leftspace = " " + entry + at_end = leftspace in flat_input_string and flat_input_string.endswith(leftspace) + + rightspace = entry + " " + at_beginning = rightspace in flat_input_string and flat_input_string.startswith(rightspace) + if in_middle or at_end or at_beginning: + for out_string in values: + entities.append(out_string) + + # Also add any integers in the input string (these aren't in the entity) + # dict. + for token in tokenized_nl_string: + if token.isnumeric(): + entities.append(token) + + return entities diff --git a/examples/text_to_sql/IGSQL/data_util/interaction.py b/examples/text_to_sql/IGSQL/data_util/interaction.py new file mode 100644 index 0000000000000000000000000000000000000000..0b21b05471c9e1294709127e97333efc5df11ef1 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/interaction.py @@ -0,0 +1,325 @@ +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Contains the class for an interaction in ATIS. """ + +import paddle + +from . import anonymization as anon +from . import sql_util +from .snippets import expand_snippets +from .utterance import Utterance, OUTPUT_KEY, ANON_INPUT_KEY + + +class Schema: + def __init__(self, table_schema, simple=False): + if simple: + self.helper1(table_schema) + else: + self.helper2(table_schema) + + def helper1(self, table_schema): + self.table_schema = table_schema + column_names = table_schema["column_names"] + column_names_original = table_schema["column_names_original"] + table_names = table_schema["table_names"] + table_names_original = table_schema["table_names_original"] + assert len(column_names) == len(column_names_original) and len(table_names) == len(table_names_original) + + column_keep_index = [] + + self.column_names_surface_form = [] + self.column_names_surface_form_to_id = {} + for i, (table_id, column_name) in enumerate(column_names_original): + column_name_surface_form = column_name + column_name_surface_form = column_name_surface_form.lower() + if column_name_surface_form not in self.column_names_surface_form_to_id: + self.column_names_surface_form.append(column_name_surface_form) + self.column_names_surface_form_to_id[column_name_surface_form] = ( + len(self.column_names_surface_form) - 1 + ) + column_keep_index.append(i) + + column_keep_index_2 = [] + for i, table_name in enumerate(table_names_original): + column_name_surface_form = table_name.lower() + if column_name_surface_form not in self.column_names_surface_form_to_id: + self.column_names_surface_form.append(column_name_surface_form) + self.column_names_surface_form_to_id[column_name_surface_form] = ( + len(self.column_names_surface_form) - 1 + ) + column_keep_index_2.append(i) + + self.column_names_embedder_input = [] + self.column_names_embedder_input_to_id = {} + for i, (table_id, column_name) in enumerate(column_names): + column_name_embedder_input = column_name + if i in column_keep_index: + self.column_names_embedder_input.append(column_name_embedder_input) + self.column_names_embedder_input_to_id[column_name_embedder_input] = ( + len(self.column_names_embedder_input) - 1 + ) + + for i, table_name in enumerate(table_names): + column_name_embedder_input = table_name + if i in column_keep_index_2: + self.column_names_embedder_input.append(column_name_embedder_input) + self.column_names_embedder_input_to_id[column_name_embedder_input] = ( + len(self.column_names_embedder_input) - 1 + ) + + max_id_1 = max(v for k, v in self.column_names_surface_form_to_id.items()) + max_id_2 = max(v for k, v in self.column_names_embedder_input_to_id.items()) + assert (len(self.column_names_surface_form) - 1) == max_id_2 == max_id_1 + + self.num_col = len(self.column_names_surface_form) + + def helper2(self, table_schema): + self.table_schema = table_schema + column_names = table_schema["column_names"] + column_names_original = table_schema["column_names_original"] + table_names 
= table_schema["table_names"] + table_names_original = table_schema["table_names_original"] + assert len(column_names) == len(column_names_original) and len(table_names) == len(table_names_original) + + column_keep_index = [] + + self.column_names_surface_form = [] + self.column_names_surface_form_to_id = {} + for i, (table_id, column_name) in enumerate(column_names_original): + if table_id >= 0: + table_name = table_names_original[table_id] + column_name_surface_form = "{}.{}".format(table_name, column_name) + else: + column_name_surface_form = column_name + column_name_surface_form = column_name_surface_form.lower() + if column_name_surface_form not in self.column_names_surface_form_to_id: + self.column_names_surface_form.append(column_name_surface_form) + self.column_names_surface_form_to_id[column_name_surface_form] = ( + len(self.column_names_surface_form) - 1 + ) + column_keep_index.append(i) + + start_i = len(self.column_names_surface_form_to_id) + for i, table_name in enumerate(table_names_original): + column_name_surface_form = "{}.*".format(table_name.lower()) + self.column_names_surface_form.append(column_name_surface_form) + self.column_names_surface_form_to_id[column_name_surface_form] = i + start_i + + self.column_names_embedder_input = [] + self.column_names_embedder_input_to_id = {} + for i, (table_id, column_name) in enumerate(column_names): + if table_id >= 0: + table_name = table_names[table_id] + column_name_embedder_input = table_name + " . " + column_name + else: + column_name_embedder_input = column_name + if i in column_keep_index: + self.column_names_embedder_input.append(column_name_embedder_input) + self.column_names_embedder_input_to_id[column_name_embedder_input] = ( + len(self.column_names_embedder_input) - 1 + ) + + start_i = len(self.column_names_embedder_input_to_id) + for i, table_name in enumerate(table_names): + column_name_embedder_input = table_name + " . 
*" + self.column_names_embedder_input.append(column_name_embedder_input) + self.column_names_embedder_input_to_id[column_name_embedder_input] = i + start_i + + assert ( + len(self.column_names_surface_form) + == len(self.column_names_surface_form_to_id) + == len(self.column_names_embedder_input) + == len(self.column_names_embedder_input_to_id) + ) + + max_id_1 = max(v for k, v in self.column_names_surface_form_to_id.items()) + max_id_2 = max(v for k, v in self.column_names_embedder_input_to_id.items()) + assert (len(self.column_names_surface_form) - 1) == max_id_2 == max_id_1 + + self.num_col = len(self.column_names_surface_form) + + def __len__(self): + return self.num_col + + def in_vocabulary(self, column_name, surface_form=False): + if surface_form: + return column_name in self.column_names_surface_form_to_id + else: + return column_name in self.column_names_embedder_input_to_id + + def column_name_embedder_bow(self, column_name, surface_form=False, column_name_token_embedder=None): + assert self.in_vocabulary(column_name, surface_form) + if surface_form: + column_name_id = self.column_names_surface_form_to_id[column_name] + column_name_embedder_input = self.column_names_embedder_input[column_name_id] + else: + column_name_embedder_input = column_name + + column_name_embeddings = [column_name_token_embedder(token) for token in column_name_embedder_input.split()] + column_name_embeddings = paddle.stack(column_name_embeddings, axis=0) + return paddle.mean(column_name_embeddings, axis=0) + + def set_column_name_embeddings(self, column_name_embeddings): + self.column_name_embeddings = column_name_embeddings + assert len(self.column_name_embeddings) == self.num_col + + def column_name_embedder(self, column_name, surface_form=False): + assert self.in_vocabulary(column_name, surface_form) + if surface_form: + column_name_id = self.column_names_surface_form_to_id[column_name] + else: + column_name_id = self.column_names_embedder_input_to_id[column_name] + + return self.column_name_embeddings[column_name_id] + + +class Interaction: + def __init__(self, utterances, schema, snippets, anon_tok_to_ent, identifier, params): + self.utterances = utterances + self.schema = schema + self.snippets = snippets + self.anon_tok_to_ent = anon_tok_to_ent + self.identifier = identifier + + # Ensure that each utterance's input and output sequences, when remapped + # without anonymization or snippets, are the same as the original + # version. 
+ for i, utterance in enumerate(self.utterances): + deanon_input = self.deanonymize(utterance.input_seq_to_use, ANON_INPUT_KEY) + assert deanon_input == utterance.original_input_seq, ( + "Anonymized sequence [" + + " ".join(utterance.input_seq_to_use) + + "] is not the same as [" + + " ".join(utterance.original_input_seq) + + "] when deanonymized (is [" + + " ".join(deanon_input) + + "] instead)" + ) + desnippet_gold = self.expand_snippets(utterance.gold_query_to_use) + deanon_gold = self.deanonymize(desnippet_gold, OUTPUT_KEY) + assert deanon_gold == utterance.original_gold_query, ( + "Anonymized and/or snippet'd query " + + " ".join(utterance.gold_query_to_use) + + " is not the same as " + + " ".join(utterance.original_gold_query) + ) + + def __str__(self): + string = "Utterances:\n" + for utterance in self.utterances: + string += str(utterance) + "\n" + string += "Anonymization dictionary:\n" + for ent_tok, deanon in self.anon_tok_to_ent.items(): + string += ent_tok + "\t" + str(deanon) + "\n" + + return string + + def __len__(self): + return len(self.utterances) + + def deanonymize(self, sequence, key): + """Deanonymizes a predicted query or an input utterance. + + Args: + sequence (`list`): The sequence to deanonymize. + key (`str`): The key in the anonymization table, e.g. NL or SQL. + """ + return anon.deanonymize(sequence, self.anon_tok_to_ent, key) + + def expand_snippets(self, sequence): + """Expands snippets for a sequence. + + Args: + sequence (`list`): A SQL query. + + """ + return expand_snippets(sequence, self.snippets) + + def input_seqs(self): + in_seqs = [] + for utterance in self.utterances: + in_seqs.append(utterance.input_seq_to_use) + return in_seqs + + def output_seqs(self): + out_seqs = [] + for utterance in self.utterances: + out_seqs.append(utterance.gold_query_to_use) + return out_seqs + + +def load_function(parameters, nl_to_sql_dict, anonymizer, database_schema=None): + def fn(interaction_example): + keep = False + + raw_utterances = interaction_example["interaction"] + + if "database_id" in interaction_example: + database_id = interaction_example["database_id"] + interaction_id = interaction_example["interaction_id"] + identifier = database_id + "/" + str(interaction_id) + else: + identifier = interaction_example["id"] + + schema = None + if database_schema: + if "removefrom" not in parameters.data_directory: + schema = Schema(database_schema[database_id], simple=True) + else: + schema = Schema(database_schema[database_id]) + + snippet_bank = [] + + utterance_examples = [] + + anon_tok_to_ent = {} + + for utterance in raw_utterances: + available_snippets = [snippet for snippet in snippet_bank if snippet.index <= 1] + + proc_utterance = Utterance( + utterance, available_snippets, nl_to_sql_dict, parameters, anon_tok_to_ent, anonymizer + ) + keep_utterance = proc_utterance.keep + + if schema: + assert keep_utterance + + if keep_utterance: + keep = True + utterance_examples.append(proc_utterance) + + # Update the snippet bank, and age each snippet in it. 
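+            # ATIS data uses the full sqlparse-based subtree extraction (get_subtrees); other datasets use the simpler
+            # per-line extraction (get_subtrees_simple). New snippets are registered in the bank, and every snippet
+            # in the bank is then aged by one utterance.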
+ if parameters.use_snippets: + if "atis" in parameters.data_directory: + snippets = sql_util.get_subtrees( + proc_utterance.anonymized_gold_query, proc_utterance.available_snippets + ) + else: + snippets = sql_util.get_subtrees_simple( + proc_utterance.anonymized_gold_query, proc_utterance.available_snippets + ) + + for snippet in snippets: + snippet.assign_id(len(snippet_bank)) + snippet_bank.append(snippet) + + for snippet in snippet_bank: + snippet.increase_age() + + interaction = Interaction(utterance_examples, schema, snippet_bank, anon_tok_to_ent, identifier, parameters) + + return interaction, keep + + return fn diff --git a/examples/text_to_sql/IGSQL/data_util/snippets.py b/examples/text_to_sql/IGSQL/data_util/snippets.py new file mode 100644 index 0000000000000000000000000000000000000000..e9a6fdfd2a2c26d0119335560686f3db617f9218 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/snippets.py @@ -0,0 +1,113 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Contains the Snippet class and methods for handling snippets.""" + +SNIPPET_PREFIX = "SNIPPET_" + + +def is_snippet(token): + """Determines whether a token is a snippet or not. + + Args: + token (`str`): The token to check. + + Returns: + `bool`: Indicating whether it's a snippet. + """ + return token.startswith(SNIPPET_PREFIX) + + +def expand_snippets(sequence, snippets): + """Given a sequence and a list of snippets, expand the snippets in the sequence. + + Args: + sequence (`list`): Query containing snippet references. + snippets (`list`): List of available snippets. + + Returns: + `list`: The expanded sequence list. + """ + snippet_id_to_snippet = {} + for snippet in snippets: + assert snippet.name not in snippet_id_to_snippet + snippet_id_to_snippet[snippet.name] = snippet + expanded_seq = [] + for token in sequence: + if token in snippet_id_to_snippet: + expanded_seq.extend(snippet_id_to_snippet[token].sequence) + else: + assert not is_snippet(token) + expanded_seq.append(token) + + return expanded_seq + + +def snippet_index(token): + """Returns the index of a snippet. + + Args: + token (`str`): The snippet to check. + + Returns: + `int`: The index of the snippet. + """ + assert is_snippet(token) + return int(token.split("_")[-1]) + + +class Snippet: + """Contains a snippet.""" + + def __init__(self, sequence, startpos, sql, age=0): + self.sequence = sequence + self.startpos = startpos + self.sql = sql + + # TODO: age vs. index? 
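+        # As implemented here: `index` counts the utterances since the snippet entered the bank (incremented by
+        # increase_age() and compared against snippet_keep_age), while `age` carries over how many consecutive
+        # gold queries contained the same subtree (see get_subtrees / get_subtrees_simple).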
+ self.age = age + self.index = 0 + + self.name = "" + self.embedding = None + + self.endpos = self.startpos + len(self.sequence) + assert self.endpos < len(self.sql), ( + "End position of snippet is " + + str(self.endpos) + + " which is greater than length of SQL (" + + str(len(self.sql)) + + ")" + ) + assert self.sequence == self.sql[self.startpos : self.endpos], ( + "Value of snippet (" + " ".join(self.sequence) + ") " + "is not the same as SQL at the same positions (" + " ".join(self.sql[self.startpos : self.endpos]) + ")" + ) + + def __str__(self): + return self.name + "\t" + str(self.age) + "\t" + " ".join(self.sequence) + + def __len__(self): + return len(self.sequence) + + def increase_age(self): + """Ages a snippet by one.""" + self.index += 1 + + def assign_id(self, number): + """Assigns the name of the snippet to be the prefix + the number.""" + self.name = SNIPPET_PREFIX + str(number) + + def set_embedding(self, embedding): + """Sets the embedding of the snippet.""" + self.embedding = embedding diff --git a/examples/text_to_sql/IGSQL/data_util/sql_util.py b/examples/text_to_sql/IGSQL/data_util/sql_util.py new file mode 100644 index 0000000000000000000000000000000000000000..eb9b92260d3a118b4c31f69f42ee8c29759dd0d0 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/sql_util.py @@ -0,0 +1,440 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import random +import signal + +import pymysql +import sqlparse +from sqlparse import sql as sql_types +from sqlparse import tokens as token_types + +from . import util +from .snippets import Snippet + +interesting_selects = ["DISTINCT", "MAX", "MIN", "count"] +ignored_subtrees = [["1", "=", "1"]] + + +def strip_whitespace_front(token_list): + """Strips whitespace and punctuation from the front of a SQL token list. + + Args: + token_list(`list`): the token list. + + Outputs: + `list`: New token list. + """ + new_token_list = [] + found_valid = False + + for token in token_list: + if not (token.is_whitespace or token.ttype == token_types.Punctuation) or found_valid: + found_valid = True + new_token_list.append(token) + + return new_token_list + + +def strip_whitespace(token_list): + """Strips whitespace from a token list. + + Args: + token_list(`list`): the token list. + + Returns: + `list`: New token list with no whitespace/punctuation surrounding. + """ + subtokens = strip_whitespace_front(token_list) + subtokens = strip_whitespace_front(subtokens[::-1])[::-1] + return subtokens + + +def token_list_to_seq(token_list): + """Converts a Token list to a sequence of strings, stripping out surrounding + punctuation and all whitespace. + + Args: + token_list(`list`): the list of tokens. 
+ + Outputs: + `list`: sequence of strings + """ + subtokens = strip_whitespace(token_list) + + seq = [] + flat = sqlparse.sql.TokenList(subtokens).flatten() + for i, token in enumerate(flat): + strip_token = str(token).strip() + if len(strip_token) > 0: + seq.append(strip_token) + if len(seq) > 0: + if seq[0] == "(" and seq[-1] == ")": + seq = seq[1:-1] + + return seq + + +def find_subtrees(sequence, current_subtrees, where_parent=False, keep_conj_subtrees=False): + """Finds subtrees for a subsequence of SQL. + + Args: + sequence(`list`): Sequence of SQL tokens. + current_subtrees(`list`): Current list of subtrees. + where_parent(`bool`, optional): Whether the parent of the current sequence was a where clause + keep_conj_subtrees('bool', optional): Whether to look for a conjunction in this sequence and + keep its arguments + """ + + # If the parent of the subsequence was a WHERE clause, keep everything in the + # sequence except for the beginning WHERE and any surrounding parentheses. + if where_parent: + # Strip out the beginning WHERE, and any punctuation or whitespace at the + # beginning or end of the token list. + seq = token_list_to_seq(sequence.tokens[1:]) + if len(seq) > 0 and seq not in current_subtrees: + current_subtrees.append(seq) + + # If the current sequence has subtokens, i.e. if it's a node that can be + # expanded, check for a conjunction in its subtrees, and expand its subtrees. + # Also check for any SELECT statements and keep track of what follows. + if sequence.is_group: + if keep_conj_subtrees: + subtokens = strip_whitespace(sequence.tokens) + + # Check if there is a conjunction in the subsequence. If so, keep the + # children. Also make sure you don't split where AND is used within a + # child -- the subtokens sequence won't treat those ANDs differently (a + # bit hacky but it works) + has_and = False + for i, token in enumerate(subtokens): + if token.value == "OR" or token.value == "AND": + has_and = True + break + + if has_and: + and_subtrees = [] + current_subtree = [] + for i, token in enumerate(subtokens): + if token.value == "OR" or ( + token.value == "AND" + and i - 4 >= 0 + and i - 4 < len(subtokens) + and subtokens[i - 4].value != "BETWEEN" + ): + and_subtrees.append(current_subtree) + current_subtree = [] + else: + current_subtree.append(token) + and_subtrees.append(current_subtree) + + for subtree in and_subtrees: + seq = token_list_to_seq(subtree) + if len(seq) > 0 and seq[0] == "WHERE": + seq = seq[1:] + if seq not in current_subtrees: + current_subtrees.append(seq) + + in_select = False + select_toks = [] + for i, token in enumerate(sequence.tokens): + # Mark whether this current token is a WHERE. + is_where = isinstance(token, sql_types.Where) + + # If you are in a SELECT, start recording what follows until you hit a + # FROM + if token.value == "SELECT": + in_select = True + elif in_select: + select_toks.append(token) + if token.value == "FROM": + in_select = False + + seq = [] + if len(sequence.tokens) > i + 2: + seq = token_list_to_seq(select_toks + [sequence.tokens[i + 2]]) + + if seq not in current_subtrees and len(seq) > 0 and seq[0] in interesting_selects: + current_subtrees.append(seq) + + select_toks = [] + + # Recursively find subtrees in the children of the node. 
+ find_subtrees(token, current_subtrees, is_where, where_parent or keep_conj_subtrees) + + +def get_subtrees(sql, oldsnippets=[]): + parsed = sqlparse.parse(" ".join(sql))[0] + + subtrees = [] + find_subtrees(parsed, subtrees) + + final_subtrees = [] + for subtree in subtrees: + if subtree not in ignored_subtrees: + final_version = [] + keep = True + + parens_counts = 0 + for i, token in enumerate(subtree): + if token == ".": + newtoken = final_version[-1] + "." + subtree[i + 1] + final_version = final_version[:-1] + [newtoken] + keep = False + elif keep: + final_version.append(token) + else: + keep = True + + if token == "(": + parens_counts -= 1 + elif token == ")": + parens_counts += 1 + + if parens_counts == 0: + final_subtrees.append(final_version) + + snippets = [] + sql = [str(tok) for tok in sql] + for subtree in final_subtrees: + startpos = -1 + for i in range(len(sql) - len(subtree) + 1): + if sql[i : i + len(subtree)] == subtree: + startpos = i + if startpos >= 0 and startpos + len(subtree) < len(sql): + age = 0 + for prevsnippet in oldsnippets: + if prevsnippet.sequence == subtree: + age = prevsnippet.age + 1 + snippet = Snippet(subtree, startpos, sql, age=age) + snippets.append(snippet) + + return snippets + + +def get_subtrees_simple(sql, oldsnippets=[]): + sql_string = " ".join(sql) + format_sql = sqlparse.format(sql_string, reindent=True) + + # get subtrees + subtrees = [] + for sub_sql in format_sql.split("\n"): + sub_sql = sub_sql.replace("(", " ( ").replace(")", " ) ").replace(",", " , ") + + subtree = sub_sql.strip().split() + if len(subtree) > 1: + subtrees.append(subtree) + + final_subtrees = subtrees + + snippets = [] + sql = [str(tok) for tok in sql] + for subtree in final_subtrees: + startpos = -1 + for i in range(len(sql) - len(subtree) + 1): + if sql[i : i + len(subtree)] == subtree: + startpos = i + + if startpos >= 0 and startpos + len(subtree) <= len(sql): + age = 0 + for prevsnippet in oldsnippets: + if prevsnippet.sequence == subtree: + age = prevsnippet.age + 1 + new_sql = sql + [";"] + snippet = Snippet(subtree, startpos, new_sql, age=age) + snippets.append(snippet) + + return snippets + + +conjunctions = {"AND", "OR", "WHERE"} + + +def get_all_in_parens(sequence): + if sequence[-1] == ";": + sequence = sequence[:-1] + + if "(" not in sequence: + return [] + + if sequence[0] == "(" and sequence[-1] == ")": + in_parens = sequence[1:-1] + return [in_parens] + get_all_in_parens(in_parens) + else: + paren_subseqs = [] + current_seq = [] + num_parens = 0 + in_parens = False + for token in sequence: + if in_parens: + current_seq.append(token) + if token == ")": + num_parens -= 1 + if num_parens == 0: + in_parens = False + paren_subseqs.append(current_seq) + current_seq = [] + elif token == "(": + in_parens = True + current_seq.append(token) + if token == "(": + num_parens += 1 + + all_subseqs = [] + for subseq in paren_subseqs: + all_subseqs.extend(get_all_in_parens(subseq)) + return all_subseqs + + +def split_by_conj(sequence): + num_parens = 0 + current_seq = [] + subsequences = [] + + for token in sequence: + if num_parens == 0: + if token in conjunctions: + subsequences.append(current_seq) + current_seq = [] + break + current_seq.append(token) + if token == "(": + num_parens += 1 + elif token == ")": + num_parens -= 1 + + assert num_parens >= 0 + + return subsequences + + +def get_sql_snippets(sequence): + # First, get all subsequences of the sequence that are surrounded by + # parentheses. 
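+    # This helper prints each candidate subsequence and then calls exit(); it is only used for manual inspection
+    # of snippet extraction, not in the training or evaluation pipeline.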
+ all_in_parens = get_all_in_parens(sequence) + all_subseq = [] + + # Then for each one, split the sequence on conjunctions (AND/OR). + for seq in all_in_parens: + subsequences = split_by_conj(seq) + all_subseq.append(seq) + all_subseq.extend(subsequences) + + # Finally, also get "interesting" selects + + for i, seq in enumerate(all_subseq): + print(str(i) + "\t" + " ".join(seq)) + exit() + + +def add_snippets_to_query(snippets, ignored_entities, query, prob_align=1.0): + query_copy = copy.copy(query) + + # Replace the longest snippets first, so sort by length descending. + sorted_snippets = sorted(snippets, key=lambda s: len(s.sequence))[::-1] + + for snippet in sorted_snippets: + ignore = False + snippet_seq = snippet.sequence + + # If it contains an ignored entity, then don't use it. + for entity in ignored_entities: + ignore = ignore or util.subsequence(entity, snippet_seq) + + # No NL entities found in snippet, then see if snippet is a substring of + # the gold sequence + if not ignore: + snippet_length = len(snippet_seq) + + # Iterate through gold sequence to see if it's a subsequence. + for start_idx in range(len(query_copy) - snippet_length + 1): + if query_copy[start_idx : start_idx + snippet_length] == snippet_seq: + align = random.random() < prob_align + + if align: + prev_length = len(query_copy) + + # At the start position of the snippet, replace with an + # identifier. + query_copy[start_idx] = snippet.name + + # Then cut out the indices which were collapsed into + # the snippet. + query_copy = query_copy[: start_idx + 1] + query_copy[start_idx + snippet_length :] + + # Make sure the length is as expected + assert len(query_copy) == prev_length - (snippet_length - 1) + + return query_copy + + +def execution_results(query, username, password, timeout=3): + connection = pymysql.connect(user=username, password=password) + + class TimeoutException(Exception): + pass + + def timeout_handler(signum, frame): + raise TimeoutException + + signal.signal(signal.SIGALRM, timeout_handler) + + syntactic = True + semantic = True + + table = [] + + with connection.cursor() as cursor: + signal.alarm(timeout) + try: + cursor.execute("SET sql_mode='IGNORE_SPACE';") + cursor.execute("use atis3;") + cursor.execute(query) + table = cursor.fetchall() + cursor.close() + except TimeoutException: + signal.alarm(0) + cursor.close() + except pymysql.err.ProgrammingError: + syntactic = False + semantic = False + cursor.close() + except pymysql.err.InternalError: + semantic = False + cursor.close() + except Exception: + signal.alarm(0) + signal.alarm(0) + cursor.close() + signal.alarm(0) + + connection.close() + + return (syntactic, semantic, sorted(table)) + + +def executable(query, username, password, timeout=2): + return execution_results(query, username, password, timeout)[1] + + +def fix_parentheses(sequence): + num_left = sequence.count("(") + num_right = sequence.count(")") + + if num_right < num_left: + fixed_sequence = sequence[:-1] + [")" for _ in range(num_left - num_right)] + [sequence[-1]] + return fixed_sequence + + return sequence diff --git a/examples/text_to_sql/IGSQL/data_util/tokenizers.py b/examples/text_to_sql/IGSQL/data_util/tokenizers.py new file mode 100644 index 0000000000000000000000000000000000000000..6cedd55d42ed2241b2310c17459e76795283f9a8 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/tokenizers.py @@ -0,0 +1,98 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenizers for natural language SQL queries, and lambda calculus.""" + +import nltk +import sqlparse + + +def nl_tokenize(string): + """Tokenizes a natural language string into tokens. + Assumes data is space-separated (this is true of ZC07 data in ATIS2/3). + + Args: + string(`str`): the string to tokenize. + Outputs: + `list`: a list of tokens. + """ + return nltk.word_tokenize(string) + + +def sql_tokenize(string): + """Tokenizes a SQL statement into tokens. + + Args: + string(`str`): string to tokenize. + + Outputs: + `list`: a list of tokens. + """ + tokens = [] + statements = sqlparse.parse(string) + + # SQLparse gives you a list of statements. + for statement in statements: + # Flatten the tokens in each statement and add to the tokens list. + flat_tokens = sqlparse.sql.TokenList(statement.tokens).flatten() + for token in flat_tokens: + strip_token = str(token).strip() + if len(strip_token) > 0: + tokens.append(strip_token) + + newtokens = [] + keep = True + for i, token in enumerate(tokens): + if token == ".": + newtoken = newtokens[-1] + "." + tokens[i + 1] + newtokens = newtokens[:-1] + [newtoken] + keep = False + elif keep: + newtokens.append(token) + else: + keep = True + + return newtokens + + +def lambda_tokenize(string): + """Tokenizes a lambda-calculus statement into tokens. + + Args: + string(`str`): a lambda-calculus string + + Outputs: + `list`: a list of tokens. + """ + + space_separated = string.split(" ") + + new_tokens = [] + + # Separate the string by spaces, then separate based on existence of ( or + # ). + for token in space_separated: + tokens = [] + + current_token = "" + for char in token: + if char == ")" or char == "(": + tokens.append(current_token) + tokens.append(char) + current_token = "" + else: + current_token += char + tokens.append(current_token) + new_tokens.extend([tok for tok in tokens if tok]) + + return new_tokens diff --git a/examples/text_to_sql/IGSQL/data_util/util.py b/examples/text_to_sql/IGSQL/data_util/util.py new file mode 100644 index 0000000000000000000000000000000000000000..fc814eb18e810828280af5f1bce3aaa64f2dd92a --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/util.py @@ -0,0 +1,31 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains various utility functions.""" + + +def subsequence(first_sequence, second_sequence): + """ + Returns whether the first sequence is a subsequence of the second sequence. 
+ + Args: + first_sequence (`list`): A sequence. + second_sequence (`list`): Another sequence. + + Returns: + `bool`: Whether first_sequence is a subsequence of second_sequence. + """ + for startidx in range(len(second_sequence) - len(first_sequence) + 1): + if second_sequence[startidx : startidx + len(first_sequence)] == first_sequence: + return True + return False diff --git a/examples/text_to_sql/IGSQL/data_util/utterance.py b/examples/text_to_sql/IGSQL/data_util/utterance.py new file mode 100644 index 0000000000000000000000000000000000000000..b1a0879208bfc43b0d58dfc1dbec04b87e19a196 --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/utterance.py @@ -0,0 +1,102 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Contains the Utterance class. """ + +from . import sql_util +from . import tokenizers + +ANON_INPUT_KEY = "cleaned_nl" +OUTPUT_KEY = "sql" + + +class Utterance: + def process_input_seq(self, anonymize, anonymizer, anon_tok_to_ent): + assert not anon_tok_to_ent or anonymize + assert not anonymize or anonymizer + + if anonymize: + assert anonymizer + + self.input_seq_to_use = anonymizer.anonymize( + self.original_input_seq, anon_tok_to_ent, ANON_INPUT_KEY, add_new_anon_toks=True + ) + else: + self.input_seq_to_use = self.original_input_seq + + def process_gold_seq( + self, output_sequences, nl_to_sql_dict, available_snippets, anonymize, anonymizer, anon_tok_to_ent + ): + # Get entities in the input sequence: + # anonymized entity types + # othe recognized entities (this includes "flight") + entities_in_input = [[tok] for tok in self.input_seq_to_use if tok in anon_tok_to_ent] + entities_in_input.extend(nl_to_sql_dict.get_sql_entities(self.input_seq_to_use)) + + # Get the shortest gold query (this is what we use to train) + shortest_gold_and_results = min(output_sequences, key=lambda x: len(x[0])) + + # Tokenize and anonymize it if necessary. + self.original_gold_query = shortest_gold_and_results[0] + self.gold_sql_results = shortest_gold_and_results[1] + + self.contained_entities = entities_in_input + + # Keep track of all gold queries and the resulting tables so that we can + # give credit if it predicts a different correct sequence. + self.all_gold_queries = output_sequences + + self.anonymized_gold_query = self.original_gold_query + if anonymize: + self.anonymized_gold_query = anonymizer.anonymize( + self.original_gold_query, anon_tok_to_ent, OUTPUT_KEY, add_new_anon_toks=False + ) + + # Add snippets to it. + self.gold_query_to_use = sql_util.add_snippets_to_query( + available_snippets, entities_in_input, self.anonymized_gold_query + ) + + def __init__(self, example, available_snippets, nl_to_sql_dict, params, anon_tok_to_ent={}, anonymizer=None): + # Get output and input sequences from the dictionary representation. 
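+        # An example is kept only when it has a non-empty tokenized input and at least one gold output sequence;
+        # otherwise processing of this utterance stops here.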
+ output_sequences = example[OUTPUT_KEY] + self.original_input_seq = tokenizers.nl_tokenize(example[params.input_key]) + self.available_snippets = available_snippets + self.keep = False + + if len(output_sequences) > 0 and len(self.original_input_seq) > 0: + # Only keep this example if there is at least one output sequence. + self.keep = True + if len(output_sequences) == 0 or len(self.original_input_seq) == 0: + return + + # Process the input sequence + self.process_input_seq(params.anonymize, anonymizer, anon_tok_to_ent) + + # Process the gold sequence + self.process_gold_seq( + output_sequences, nl_to_sql_dict, self.available_snippets, params.anonymize, anonymizer, anon_tok_to_ent + ) + + def __str__(self): + string = "Original input: " + " ".join(self.original_input_seq) + "\n" + string += "Modified input: " + " ".join(self.input_seq_to_use) + "\n" + string += "Original output: " + " ".join(self.original_gold_query) + "\n" + string += "Modified output: " + " ".join(self.gold_query_to_use) + "\n" + string += "Snippets:\n" + for snippet in self.available_snippets: + string += str(snippet) + "\n" + return string + + def length_valid(self, input_limit, output_limit): + return len(self.input_seq_to_use) < input_limit and len(self.gold_query_to_use) < output_limit diff --git a/examples/text_to_sql/IGSQL/data_util/vocabulary.py b/examples/text_to_sql/IGSQL/data_util/vocabulary.py new file mode 100644 index 0000000000000000000000000000000000000000..6b102f084c955c5d5f2ea22088056012395ea0ac --- /dev/null +++ b/examples/text_to_sql/IGSQL/data_util/vocabulary.py @@ -0,0 +1,106 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains class and methods for storing and computing a vocabulary from text.""" + +import operator +import os +import pickle + +# Special sequencing tokens. +UNK_TOK = "_UNK" # Replaces out-of-vocabulary words. +EOS_TOK = "_EOS" # Appended to the end of a sequence to indicate its end. +DEL_TOK = ";" + + +class Vocabulary: + """Vocabulary class: stores information about words in a corpus. + + Attributes: + functional_types (`list`): Functional vocabulary words, such as EOS. + max_size (`int`): The maximum size of vocabulary to keep. + min_occur (`int`): The minimum number of times a word should occur to keep it. + id_to_token (`list`): Ordered list of word types. + token_to_id (`dict`): Maps from each unique word type to its index. + """ + + def get_vocab(self, sequences, ignore_fn): + """Gets vocabulary from a list of sequences. + + Args: + sequences (`list`): Sequences from which to compute the vocabulary. + ignore_fn (`function`): Function used to tell whether to ignore a + token during computation of the vocabulary. + + Returns: + `list`: List of the unique word types in the vocabulary. 
+ """ + type_counts = {} + + for sequence in sequences: + for token in sequence: + if not ignore_fn(token): + if token not in type_counts: + type_counts[token] = 0 + type_counts[token] += 1 + + # Create sorted list of tokens, by their counts. Reverse so it is in order of + # most frequent to least frequent. + sorted_type_counts = sorted(sorted(type_counts.items()), key=operator.itemgetter(1))[::-1] + + sorted_types = [typecount[0] for typecount in sorted_type_counts if typecount[1] >= self.min_occur] + + # Append the necessary functional tokens. + sorted_types = self.functional_types + sorted_types + + # Cut off if vocab_size is set (nonnegative) + if self.max_size >= 0: + vocab = sorted_types[: max(self.max_size, len(sorted_types))] + else: + vocab = sorted_types + + return vocab + + def __init__( + self, sequences, filename, functional_types=None, max_size=-1, min_occur=0, ignore_fn=lambda x: False + ): + self.functional_types = functional_types + self.max_size = max_size + self.min_occur = min_occur + + vocab = self.get_vocab(sequences, ignore_fn) + + self.id_to_token = [] + self.token_to_id = {} + + for i, word_type in enumerate(vocab): + self.id_to_token.append(word_type) + self.token_to_id[word_type] = i + + # Load the previous vocab, if it exists. + if os.path.exists(filename): + infile = open(filename, "rb") + loaded_vocab = pickle.load(infile) + infile.close() + + print("Loaded vocabulary from " + str(filename)) + if loaded_vocab.id_to_token != self.id_to_token or loaded_vocab.token_to_id != self.token_to_id: + print("Loaded vocabulary is different than generated vocabulary.") + else: + print("Writing vocabulary to " + str(filename)) + outfile = open(filename, "wb") + pickle.dump(self, outfile) + outfile.close() + + def __len__(self): + return len(self.id_to_token) diff --git a/examples/text_to_sql/IGSQL/eval_scripts/evaluation.py b/examples/text_to_sql/IGSQL/eval_scripts/evaluation.py new file mode 100644 index 0000000000000000000000000000000000000000..46e988cd5e26a52370d6c77085e8e181ce18ee4f --- /dev/null +++ b/examples/text_to_sql/IGSQL/eval_scripts/evaluation.py @@ -0,0 +1,919 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import sqlite3 + +from process_sql import Schema, get_schema, get_sql + +# Flag to disable value evaluation +DISABLE_VALUE = True +# Flag to disable distinct in select evaluation +DISABLE_DISTINCT = True + +################################ +# val: number(float)/string(str)/sql(dict) +# col_unit: (agg_id, col_id, isDistinct(bool)) +# val_unit: (unit_op, col_unit1, col_unit2) +# table_unit: (table_type, col_unit/sql) +# cond_unit: (not_op, op_id, val_unit, val1, val2) +# condition: [cond_unit1, 'and'/'or', cond_unit2, ...] +# sql { +# 'select': (isDistinct(bool), [(agg_id, val_unit), (agg_id, val_unit), ...]) +# 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} +# 'where': condition +# 'groupBy': [col_unit1, col_unit2, ...] 
+# 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) +# 'having': condition +# 'limit': None/limit value +# 'intersect': None/sql +# 'except': None/sql +# 'union': None/sql +# } +################################ + +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +WHERE_OPS = ("not", "between", "=", ">", "<", ">=", "<=", "!=", "in", "like", "is", "exists") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +COND_OPS = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + +HARDNESS = { + "component1": ("where", "group", "order", "limit", "join", "or", "like"), + "component2": ("except", "union", "intersect"), +} + + +def condition_has_or(conds): + return "or" in conds[1::2] + + +def condition_has_like(conds): + return WHERE_OPS.index("like") in [cond_unit[1] for cond_unit in conds[::2]] + + +def condition_has_sql(conds): + for cond_unit in conds[::2]: + val1, val2 = cond_unit[3], cond_unit[4] + if val1 is not None and type(val1) is dict: + return True + if val2 is not None and type(val2) is dict: + return True + return False + + +def val_has_op(val_unit): + return val_unit[0] != UNIT_OPS.index("none") + + +def has_agg(unit): + return unit[0] != AGG_OPS.index("none") + + +def accuracy(count, total): + if count == total: + return 1 + return 0 + + +def recall(count, total): + if count == total: + return 1 + return 0 + + +def F1(acc, rec): + if (acc + rec) == 0: + return 0 + return (2.0 * acc * rec) / (acc + rec) + + +def get_scores(count, pred_total, label_total): + if pred_total != label_total: + return 0, 0, 0 + elif count == pred_total: + return 1, 1, 1 + return 0, 0, 0 + + +def eval_sel(pred, label): + pred_sel = pred["select"][1] + label_sel = label["select"][1] + label_wo_agg = [unit[1] for unit in label_sel] + pred_total = len(pred_sel) + label_total = len(label_sel) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_sel: + if unit in label_sel: + cnt += 1 + label_sel.remove(unit) + if unit[1] in label_wo_agg: + cnt_wo_agg += 1 + label_wo_agg.remove(unit[1]) + + return label_total, pred_total, cnt, cnt_wo_agg + + +def eval_where(pred, label): + pred_conds = [unit for unit in pred["where"][::2]] + label_conds = [unit for unit in label["where"][::2]] + label_wo_agg = [unit[2] for unit in label_conds] + pred_total = len(pred_conds) + label_total = len(label_conds) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_conds: + if unit in label_conds: + cnt += 1 + label_conds.remove(unit) + if unit[2] in label_wo_agg: + cnt_wo_agg += 1 + label_wo_agg.remove(unit[2]) + + return label_total, pred_total, cnt, cnt_wo_agg + + +def eval_group(pred, label): + pred_cols = [unit[1] for unit in pred["groupBy"]] + label_cols = [unit[1] for unit in label["groupBy"]] + pred_total = len(pred_cols) + label_total = len(label_cols) + cnt = 0 + pred_cols = [pred.split(".")[1] if "." in pred else pred for pred in pred_cols] + label_cols = [label.split(".")[1] if "." 
in label else label for label in label_cols] + for col in pred_cols: + if col in label_cols: + cnt += 1 + label_cols.remove(col) + return label_total, pred_total, cnt + + +def eval_having(pred, label): + pred_total = label_total = cnt = 0 + if len(pred["groupBy"]) > 0: + pred_total = 1 + if len(label["groupBy"]) > 0: + label_total = 1 + + pred_cols = [unit[1] for unit in pred["groupBy"]] + label_cols = [unit[1] for unit in label["groupBy"]] + if pred_total == label_total == 1 and pred_cols == label_cols and pred["having"] == label["having"]: + cnt = 1 + + return label_total, pred_total, cnt + + +def eval_order(pred, label): + pred_total = label_total = cnt = 0 + if len(pred["orderBy"]) > 0: + pred_total = 1 + if len(label["orderBy"]) > 0: + label_total = 1 + if ( + len(label["orderBy"]) > 0 + and pred["orderBy"] == label["orderBy"] + and ( + (pred["limit"] is None and label["limit"] is None) + or (pred["limit"] is not None and label["limit"] is not None) + ) + ): + cnt = 1 + return label_total, pred_total, cnt + + +def eval_and_or(pred, label): + pred_ao = pred["where"][1::2] + label_ao = label["where"][1::2] + pred_ao = set(pred_ao) + label_ao = set(label_ao) + + if pred_ao == label_ao: + return 1, 1, 1 + return len(pred_ao), len(label_ao), 0 + + +def get_nestedSQL(sql): + nested = [] + for cond_unit in sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2]: + if type(cond_unit[3]) is dict: + nested.append(cond_unit[3]) + if type(cond_unit[4]) is dict: + nested.append(cond_unit[4]) + if sql["intersect"] is not None: + nested.append(sql["intersect"]) + if sql["except"] is not None: + nested.append(sql["except"]) + if sql["union"] is not None: + nested.append(sql["union"]) + return nested + + +def eval_nested(pred, label): + label_total = 0 + pred_total = 0 + cnt = 0 + if pred is not None: + pred_total += 1 + if label is not None: + label_total += 1 + if pred is not None and label is not None: + cnt += Evaluator().eval_exact_match(pred, label) + return label_total, pred_total, cnt + + +def eval_IUEN(pred, label): + lt1, pt1, cnt1 = eval_nested(pred["intersect"], label["intersect"]) + lt2, pt2, cnt2 = eval_nested(pred["except"], label["except"]) + lt3, pt3, cnt3 = eval_nested(pred["union"], label["union"]) + label_total = lt1 + lt2 + lt3 + pred_total = pt1 + pt2 + pt3 + cnt = cnt1 + cnt2 + cnt3 + return label_total, pred_total, cnt + + +def get_keywords(sql): + res = set() + if len(sql["where"]) > 0: + res.add("where") + if len(sql["groupBy"]) > 0: + res.add("group") + if len(sql["having"]) > 0: + res.add("having") + if len(sql["orderBy"]) > 0: + res.add(sql["orderBy"][0]) + res.add("order") + if sql["limit"] is not None: + res.add("limit") + if sql["except"] is not None: + res.add("except") + if sql["union"] is not None: + res.add("union") + if sql["intersect"] is not None: + res.add("intersect") + + # or keyword + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + if len([token for token in ao if token == "or"]) > 0: + res.add("or") + + cond_units = sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + # not keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[0]]) > 0: + res.add("not") + + # in keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("in")]) > 0: + res.add("in") + + # like keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("like")]) > 0: + res.add("like") + + return res + + +def eval_keywords(pred, label): + pred_keywords = 
get_keywords(pred) + label_keywords = get_keywords(label) + pred_total = len(pred_keywords) + label_total = len(label_keywords) + cnt = 0 + + for k in pred_keywords: + if k in label_keywords: + cnt += 1 + return label_total, pred_total, cnt + + +def count_agg(units): + return len([unit for unit in units if has_agg(unit)]) + + +def count_component1(sql): + count = 0 + if len(sql["where"]) > 0: + count += 1 + if len(sql["groupBy"]) > 0: + count += 1 + if len(sql["orderBy"]) > 0: + count += 1 + if sql["limit"] is not None: + count += 1 + if len(sql["from"]["table_units"]) > 0: # JOIN + count += len(sql["from"]["table_units"]) - 1 + + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + count += len([token for token in ao if token == "or"]) + cond_units = sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + count += len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("like")]) + + return count + + +def count_component2(sql): + nested = get_nestedSQL(sql) + return len(nested) + + +def count_others(sql): + count = 0 + # number of aggregation + agg_count = count_agg(sql["select"][1]) + agg_count += count_agg(sql["where"][::2]) + agg_count += count_agg(sql["groupBy"]) + if len(sql["orderBy"]) > 0: + agg_count += count_agg( + [unit[1] for unit in sql["orderBy"][1] if unit[1]] + [unit[2] for unit in sql["orderBy"][1] if unit[2]] + ) + agg_count += count_agg(sql["having"]) + if agg_count > 1: + count += 1 + + # number of select columns + if len(sql["select"][1]) > 1: + count += 1 + + # number of where conditions + if len(sql["where"]) > 1: + count += 1 + + # number of group by clauses + if len(sql["groupBy"]) > 1: + count += 1 + + return count + + +class Evaluator: + """A simple evaluator""" + + def __init__(self): + self.partial_scores = None + + def eval_hardness(self, sql): + count_comp1_ = count_component1(sql) + count_comp2_ = count_component2(sql) + count_others_ = count_others(sql) + + if count_comp1_ <= 1 and count_others_ == 0 and count_comp2_ == 0: + return "easy" + elif (count_others_ <= 2 and count_comp1_ <= 1 and count_comp2_ == 0) or ( + count_comp1_ <= 2 and count_others_ < 2 and count_comp2_ == 0 + ): + return "medium" + elif ( + (count_others_ > 2 and count_comp1_ <= 2 and count_comp2_ == 0) + or (2 < count_comp1_ <= 3 and count_others_ <= 2 and count_comp2_ == 0) + or (count_comp1_ <= 1 and count_others_ == 0 and count_comp2_ <= 1) + ): + return "hard" + else: + return "extra" + + def eval_exact_match(self, pred, label): + partial_scores = self.eval_partial_match(pred, label) + self.partial_scores = partial_scores + + for _, score in partial_scores.items(): + if score["f1"] != 1: + return 0 + if len(label["from"]["table_units"]) > 0: + label_tables = sorted(label["from"]["table_units"]) + pred_tables = sorted(pred["from"]["table_units"]) + return label_tables == pred_tables + return 1 + + def eval_partial_match(self, pred, label): + res = {} + + label_total, pred_total, cnt, cnt_wo_agg = eval_sel(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["select"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, label_total) + res["select(no AGG)"] = { + "acc": acc, + "rec": rec, + "f1": f1, + "label_total": label_total, + "pred_total": pred_total, + } + + label_total, pred_total, cnt, cnt_wo_agg = eval_where(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["where"] = {"acc": acc, 
"rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, label_total) + res["where(no OP)"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_group(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["group(no Having)"] = { + "acc": acc, + "rec": rec, + "f1": f1, + "label_total": label_total, + "pred_total": pred_total, + } + + label_total, pred_total, cnt = eval_having(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["group"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_order(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["order"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_and_or(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["and/or"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_IUEN(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["IUEN"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_keywords(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["keywords"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + return res + + +def isValidSQL(sql, db): + conn = sqlite3.connect(db) + cursor = conn.cursor() + try: + cursor.execute(sql) + except Exception: + return False + return True + + +def print_scores(scores, etype): + levels = ["easy", "medium", "hard", "extra", "all"] + partial_types = [ + "select", + "select(no AGG)", + "where", + "where(no OP)", + "group(no Having)", + "group", + "order", + "and/or", + "IUEN", + "keywords", + ] + + print("{:20} {:20} {:20} {:20} {:20} {:20}".format("", *levels)) + counts = [scores[level]["count"] for level in levels] + print("{:20} {:<20d} {:<20d} {:<20d} {:<20d} {:<20d}".format("count", *counts)) + + if etype in ["all", "exec"]: + print("===================== EXECUTION ACCURACY =====================") + this_scores = [scores[level]["exec"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("execution", *this_scores)) + + if etype in ["all", "match"]: + print("\n====================== EXACT MATCHING ACCURACY =====================") + exact_scores = [scores[level]["exact"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("exact match", *exact_scores)) + print("\n---------------------PARTIAL MATCHING ACCURACY----------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["acc"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + print("---------------------- PARTIAL MATCHING RECALL ----------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["rec"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + print("---------------------- PARTIAL MATCHING F1 --------------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["f1"] 
for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + +def evaluate(gold, predict, db_dir, etype, kmaps): + with open(gold) as f: + glist = [l.strip().split("\t") for l in f.readlines() if len(l.strip()) > 0] + + with open(predict) as f: + plist = [l.strip().split("\t") for l in f.readlines() if len(l.strip()) > 0] + evaluator = Evaluator() + + levels = ["easy", "medium", "hard", "extra", "all"] + partial_types = [ + "select", + "select(no AGG)", + "where", + "where(no OP)", + "group(no Having)", + "group", + "order", + "and/or", + "IUEN", + "keywords", + ] + entries = [] + scores = {} + + for level in levels: + scores[level] = {"count": 0, "partial": {}, "exact": 0.0} + scores[level]["exec"] = 0 + for type_ in partial_types: + scores[level]["partial"][type_] = {"acc": 0.0, "rec": 0.0, "f1": 0.0, "acc_count": 0, "rec_count": 0} + + eval_err_num = 0 + for p, g in zip(plist, glist): + p_str = p[0] + g_str, db = g + db_name = db + db = os.path.join(db_dir, db, db + ".sqlite") + schema = Schema(get_schema(db)) + g_sql = get_sql(schema, g_str) + hardness = evaluator.eval_hardness(g_sql) + scores[hardness]["count"] += 1 + scores["all"]["count"] += 1 + + try: + p_sql = get_sql(schema, p_str) + except Exception: + # If p_sql is not valid, then we will use an empty sql to evaluate with the correct sql + p_sql = { + "except": None, + "from": {"conds": [], "table_units": []}, + "groupBy": [], + "having": [], + "intersect": None, + "limit": None, + "orderBy": [], + "select": [False, []], + "union": None, + "where": [], + } + eval_err_num += 1 + print("eval_err_num:{}".format(eval_err_num)) + + # rebuild sql for value evaluation + kmap = kmaps[db_name] + g_valid_col_units = build_valid_col_units(g_sql["from"]["table_units"], schema) + g_sql = rebuild_sql_val(g_sql) + g_sql = rebuild_sql_col(g_valid_col_units, g_sql, kmap) + p_valid_col_units = build_valid_col_units(p_sql["from"]["table_units"], schema) + p_sql = rebuild_sql_val(p_sql) + p_sql = rebuild_sql_col(p_valid_col_units, p_sql, kmap) + + if etype in ["all", "exec"]: + exec_score = eval_exec_match(db, p_str, g_str, p_sql, g_sql) + if exec_score: + scores[hardness]["exec"] += 1 + + if etype in ["all", "match"]: + exact_score = evaluator.eval_exact_match(p_sql, g_sql) + partial_scores = evaluator.partial_scores + if exact_score == 0: + print("{} pred: {}".format(hardness, p_str)) + print("{} gold: {}".format(hardness, g_str)) + print("") + scores[hardness]["exact"] += exact_score + scores["all"]["exact"] += exact_score + for type_ in partial_types: + if partial_scores[type_]["pred_total"] > 0: + scores[hardness]["partial"][type_]["acc"] += partial_scores[type_]["acc"] + scores[hardness]["partial"][type_]["acc_count"] += 1 + if partial_scores[type_]["label_total"] > 0: + scores[hardness]["partial"][type_]["rec"] += partial_scores[type_]["rec"] + scores[hardness]["partial"][type_]["rec_count"] += 1 + scores[hardness]["partial"][type_]["f1"] += partial_scores[type_]["f1"] + if partial_scores[type_]["pred_total"] > 0: + scores["all"]["partial"][type_]["acc"] += partial_scores[type_]["acc"] + scores["all"]["partial"][type_]["acc_count"] += 1 + if partial_scores[type_]["label_total"] > 0: + scores["all"]["partial"][type_]["rec"] += partial_scores[type_]["rec"] + scores["all"]["partial"][type_]["rec_count"] += 1 + scores["all"]["partial"][type_]["f1"] += partial_scores[type_]["f1"] + + entries.append( + { + "predictSQL": p_str, + "goldSQL": g_str, + "hardness": hardness, + "exact": 
exact_score, + "partial": partial_scores, + } + ) + + for level in levels: + if scores[level]["count"] == 0: + continue + if etype in ["all", "exec"]: + scores[level]["exec"] /= scores[level]["count"] + + if etype in ["all", "match"]: + scores[level]["exact"] /= scores[level]["count"] + for type_ in partial_types: + if scores[level]["partial"][type_]["acc_count"] == 0: + scores[level]["partial"][type_]["acc"] = 0 + else: + scores[level]["partial"][type_]["acc"] = ( + scores[level]["partial"][type_]["acc"] / scores[level]["partial"][type_]["acc_count"] * 1.0 + ) + if scores[level]["partial"][type_]["rec_count"] == 0: + scores[level]["partial"][type_]["rec"] = 0 + else: + scores[level]["partial"][type_]["rec"] = ( + scores[level]["partial"][type_]["rec"] / scores[level]["partial"][type_]["rec_count"] * 1.0 + ) + if scores[level]["partial"][type_]["acc"] == 0 and scores[level]["partial"][type_]["rec"] == 0: + scores[level]["partial"][type_]["f1"] = 1 + else: + scores[level]["partial"][type_]["f1"] = ( + 2.0 + * scores[level]["partial"][type_]["acc"] + * scores[level]["partial"][type_]["rec"] + / (scores[level]["partial"][type_]["rec"] + scores[level]["partial"][type_]["acc"]) + ) + + print_scores(scores, etype) + + +def eval_exec_match(db, p_str, g_str, pred, gold): + """ + return 1 if the values between prediction and gold are matching + in the corresponding index. Currently not support multiple col_unit(pairs). + """ + conn = sqlite3.connect(db) + cursor = conn.cursor() + try: + cursor.execute(p_str) + p_res = cursor.fetchall() + except Exception: + return False + + cursor.execute(g_str) + q_res = cursor.fetchall() + + def res_map(res, val_units): + rmap = {} + for idx, val_unit in enumerate(val_units): + key = tuple(val_unit[1]) if not val_unit[2] else (val_unit[0], tuple(val_unit[1]), tuple(val_unit[2])) + rmap[key] = [r[idx] for r in res] + return rmap + + p_val_units = [unit[1] for unit in pred["select"][1]] + q_val_units = [unit[1] for unit in gold["select"][1]] + return res_map(p_res, p_val_units) == res_map(q_res, q_val_units) + + +# Rebuild SQL functions for value evaluation +def rebuild_cond_unit_val(cond_unit): + if cond_unit is None or not DISABLE_VALUE: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + if type(val1) is not dict: + val1 = None + else: + val1 = rebuild_sql_val(val1) + if type(val2) is not dict: + val2 = None + else: + val2 = rebuild_sql_val(val2) + return not_op, op_id, val_unit, val1, val2 + + +def rebuild_condition_val(condition): + if condition is None or not DISABLE_VALUE: + return condition + + res = [] + for idx, it in enumerate(condition): + if idx % 2 == 0: + res.append(rebuild_cond_unit_val(it)) + else: + res.append(it) + return res + + +def rebuild_sql_val(sql): + if sql is None or not DISABLE_VALUE: + return sql + + sql["from"]["conds"] = rebuild_condition_val(sql["from"]["conds"]) + sql["having"] = rebuild_condition_val(sql["having"]) + sql["where"] = rebuild_condition_val(sql["where"]) + sql["intersect"] = rebuild_sql_val(sql["intersect"]) + sql["except"] = rebuild_sql_val(sql["except"]) + sql["union"] = rebuild_sql_val(sql["union"]) + + return sql + + +# Rebuild SQL functions for foreign key evaluation +def build_valid_col_units(table_units, schema): + col_ids = [table_unit[1] for table_unit in table_units if table_unit[0] == TABLE_TYPE["table_unit"]] + prefixs = [col_id[:-2] for col_id in col_ids] + valid_col_units = [] + for value in schema.idMap.values(): + if "." 
in value and value[: value.index(".")] in prefixs: + valid_col_units.append(value) + return valid_col_units + + +def rebuild_col_unit_col(valid_col_units, col_unit, kmap): + if col_unit is None: + return col_unit + + agg_id, col_id, distinct = col_unit + if col_id in kmap and col_id in valid_col_units: + col_id = kmap[col_id] + if DISABLE_DISTINCT: + distinct = None + return agg_id, col_id, distinct + + +def rebuild_val_unit_col(valid_col_units, val_unit, kmap): + if val_unit is None: + return val_unit + + unit_op, col_unit1, col_unit2 = val_unit + col_unit1 = rebuild_col_unit_col(valid_col_units, col_unit1, kmap) + col_unit2 = rebuild_col_unit_col(valid_col_units, col_unit2, kmap) + return unit_op, col_unit1, col_unit2 + + +def rebuild_table_unit_col(valid_col_units, table_unit, kmap): + if table_unit is None: + return table_unit + + table_type, col_unit_or_sql = table_unit + if isinstance(col_unit_or_sql, tuple): + col_unit_or_sql = rebuild_col_unit_col(valid_col_units, col_unit_or_sql, kmap) + return table_type, col_unit_or_sql + + +def rebuild_cond_unit_col(valid_col_units, cond_unit, kmap): + if cond_unit is None: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + val_unit = rebuild_val_unit_col(valid_col_units, val_unit, kmap) + return not_op, op_id, val_unit, val1, val2 + + +def rebuild_condition_col(valid_col_units, condition, kmap): + for idx in range(len(condition)): + if idx % 2 == 0: + condition[idx] = rebuild_cond_unit_col(valid_col_units, condition[idx], kmap) + return condition + + +def rebuild_select_col(valid_col_units, sel, kmap): + if sel is None: + return sel + distinct, _list = sel + new_list = [] + for it in _list: + agg_id, val_unit = it + new_list.append((agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap))) + if DISABLE_DISTINCT: + distinct = None + return distinct, new_list + + +def rebuild_from_col(valid_col_units, from_, kmap): + if from_ is None: + return from_ + + from_["table_units"] = [ + rebuild_table_unit_col(valid_col_units, table_unit, kmap) for table_unit in from_["table_units"] + ] + from_["conds"] = rebuild_condition_col(valid_col_units, from_["conds"], kmap) + return from_ + + +def rebuild_group_by_col(valid_col_units, group_by, kmap): + if group_by is None: + return group_by + + return [rebuild_col_unit_col(valid_col_units, col_unit, kmap) for col_unit in group_by] + + +def rebuild_order_by_col(valid_col_units, order_by, kmap): + if order_by is None or len(order_by) == 0: + return order_by + + direction, val_units = order_by + new_val_units = [rebuild_val_unit_col(valid_col_units, val_unit, kmap) for val_unit in val_units] + return direction, new_val_units + + +def rebuild_sql_col(valid_col_units, sql, kmap): + if sql is None: + return sql + + sql["select"] = rebuild_select_col(valid_col_units, sql["select"], kmap) + sql["from"] = rebuild_from_col(valid_col_units, sql["from"], kmap) + sql["where"] = rebuild_condition_col(valid_col_units, sql["where"], kmap) + sql["groupBy"] = rebuild_group_by_col(valid_col_units, sql["groupBy"], kmap) + sql["orderBy"] = rebuild_order_by_col(valid_col_units, sql["orderBy"], kmap) + sql["having"] = rebuild_condition_col(valid_col_units, sql["having"], kmap) + sql["intersect"] = rebuild_sql_col(valid_col_units, sql["intersect"], kmap) + sql["except"] = rebuild_sql_col(valid_col_units, sql["except"], kmap) + sql["union"] = rebuild_sql_col(valid_col_units, sql["union"], kmap) + + return sql + + +def build_foreign_key_map(entry): + cols_orig = entry["column_names_original"] + tables_orig = 
entry["table_names_original"] + + # rebuild cols corresponding to idmap in Schema + cols = [] + for col_orig in cols_orig: + if col_orig[0] >= 0: + t = tables_orig[col_orig[0]] + c = col_orig[1] + cols.append("__" + t.lower() + "." + c.lower() + "__") + else: + cols.append("__all__") + + def keyset_in_list(k1, k2, k_list): + for k_set in k_list: + if k1 in k_set or k2 in k_set: + return k_set + new_k_set = set() + k_list.append(new_k_set) + return new_k_set + + foreign_key_list = [] + foreign_keys = entry["foreign_keys"] + for fkey in foreign_keys: + key1, key2 = fkey + key_set = keyset_in_list(key1, key2, foreign_key_list) + key_set.add(key1) + key_set.add(key2) + + foreign_key_map = {} + for key_set in foreign_key_list: + sorted_list = sorted(list(key_set)) + midx = sorted_list[0] + for idx in sorted_list: + foreign_key_map[cols[idx]] = cols[midx] + + return foreign_key_map + + +def build_foreign_key_map_from_json(table): + with open(table) as f: + data = json.load(f) + tables = {} + for entry in data: + tables[entry["db_id"]] = build_foreign_key_map(entry) + return tables + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--gold", dest="gold", type=str) + parser.add_argument("--pred", dest="pred", type=str) + parser.add_argument("--db", dest="db", type=str) + parser.add_argument("--table", dest="table", type=str) + parser.add_argument("--etype", dest="etype", type=str) + args = parser.parse_args() + + gold = args.gold + pred = args.pred + db_dir = args.db + table = args.table + etype = args.etype + + assert etype in ["all", "exec", "match"], "Unknown evaluation method" + + kmaps = build_foreign_key_map_from_json(table) + + evaluate(gold, pred, db_dir, etype, kmaps) diff --git a/examples/text_to_sql/IGSQL/eval_scripts/evaluation_sqa.py b/examples/text_to_sql/IGSQL/eval_scripts/evaluation_sqa.py new file mode 100644 index 0000000000000000000000000000000000000000..4330e660ad05e7cf5d88d85b6b2d8f632f2c556d --- /dev/null +++ b/examples/text_to_sql/IGSQL/eval_scripts/evaluation_sqa.py @@ -0,0 +1,995 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import sqlite3 + +from process_sql import Schema, get_schema, get_sql + +# Flag to disable value evaluation +DISABLE_VALUE = True +# Flag to disable distinct in select evaluation +DISABLE_DISTINCT = True + +################################ +# val: number(float)/string(str)/sql(dict) +# col_unit: (agg_id, col_id, isDistinct(bool)) +# val_unit: (unit_op, col_unit1, col_unit2) +# table_unit: (table_type, col_unit/sql) +# cond_unit: (not_op, op_id, val_unit, val1, val2) +# condition: [cond_unit1, 'and'/'or', cond_unit2, ...] +# sql { +# 'select': (isDistinct(bool), [(agg_id, val_unit), (agg_id, val_unit), ...]) +# 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} +# 'where': condition +# 'groupBy': [col_unit1, col_unit2, ...] 
+# 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) +# 'having': condition +# 'limit': None/limit value +# 'intersect': None/sql +# 'except': None/sql +# 'union': None/sql +# } +################################ + +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +WHERE_OPS = ("not", "between", "=", ">", "<", ">=", "<=", "!=", "in", "like", "is", "exists") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +COND_OPS = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + +HARDNESS = { + "component1": ("where", "group", "order", "limit", "join", "or", "like"), + "component2": ("except", "union", "intersect"), +} + + +def condition_has_or(conds): + return "or" in conds[1::2] + + +def condition_has_like(conds): + return WHERE_OPS.index("like") in [cond_unit[1] for cond_unit in conds[::2]] + + +def condition_has_sql(conds): + for cond_unit in conds[::2]: + val1, val2 = cond_unit[3], cond_unit[4] + if val1 is not None and type(val1) is dict: + return True + if val2 is not None and type(val2) is dict: + return True + return False + + +def val_has_op(val_unit): + return val_unit[0] != UNIT_OPS.index("none") + + +def has_agg(unit): + return unit[0] != AGG_OPS.index("none") + + +def accuracy(count, total): + if count == total: + return 1 + return 0 + + +def recall(count, total): + if count == total: + return 1 + return 0 + + +def F1(acc, rec): + if (acc + rec) == 0: + return 0 + return (2.0 * acc * rec) / (acc + rec) + + +def get_scores(count, pred_total, label_total): + if pred_total != label_total: + return 0, 0, 0 + elif count == pred_total: + return 1, 1, 1 + return 0, 0, 0 + + +def eval_sel(pred, label): + pred_sel = pred["select"][1] + label_sel = label["select"][1] + label_wo_agg = [unit[1] for unit in label_sel] + pred_total = len(pred_sel) + label_total = len(label_sel) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_sel: + if unit in label_sel: + cnt += 1 + label_sel.remove(unit) + if unit[1] in label_wo_agg: + cnt_wo_agg += 1 + label_wo_agg.remove(unit[1]) + + return label_total, pred_total, cnt, cnt_wo_agg + + +def eval_where(pred, label): + pred_conds = [unit for unit in pred["where"][::2]] + label_conds = [unit for unit in label["where"][::2]] + label_wo_agg = [unit[2] for unit in label_conds] + pred_total = len(pred_conds) + label_total = len(label_conds) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_conds: + if unit in label_conds: + cnt += 1 + label_conds.remove(unit) + if unit[2] in label_wo_agg: + cnt_wo_agg += 1 + label_wo_agg.remove(unit[2]) + + return label_total, pred_total, cnt, cnt_wo_agg + + +def eval_group(pred, label): + pred_cols = [unit[1] for unit in pred["groupBy"]] + label_cols = [unit[1] for unit in label["groupBy"]] + pred_total = len(pred_cols) + label_total = len(label_cols) + cnt = 0 + pred_cols = [pred.split(".")[1] if "." in pred else pred for pred in pred_cols] + label_cols = [label.split(".")[1] if "." 
in label else label for label in label_cols] + for col in pred_cols: + if col in label_cols: + cnt += 1 + label_cols.remove(col) + return label_total, pred_total, cnt + + +def eval_having(pred, label): + pred_total = label_total = cnt = 0 + if len(pred["groupBy"]) > 0: + pred_total = 1 + if len(label["groupBy"]) > 0: + label_total = 1 + + pred_cols = [unit[1] for unit in pred["groupBy"]] + label_cols = [unit[1] for unit in label["groupBy"]] + if pred_total == label_total == 1 and pred_cols == label_cols and pred["having"] == label["having"]: + cnt = 1 + + return label_total, pred_total, cnt + + +def eval_order(pred, label): + pred_total = label_total = cnt = 0 + if len(pred["orderBy"]) > 0: + pred_total = 1 + if len(label["orderBy"]) > 0: + label_total = 1 + if ( + len(label["orderBy"]) > 0 + and pred["orderBy"] == label["orderBy"] + and ( + (pred["limit"] is None and label["limit"] is None) + or (pred["limit"] is not None and label["limit"] is not None) + ) + ): + cnt = 1 + return label_total, pred_total, cnt + + +def eval_and_or(pred, label): + pred_ao = pred["where"][1::2] + label_ao = label["where"][1::2] + pred_ao = set(pred_ao) + label_ao = set(label_ao) + + if pred_ao == label_ao: + return 1, 1, 1 + return len(pred_ao), len(label_ao), 0 + + +def get_nestedSQL(sql): + nested = [] + for cond_unit in sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2]: + if type(cond_unit[3]) is dict: + nested.append(cond_unit[3]) + if type(cond_unit[4]) is dict: + nested.append(cond_unit[4]) + if sql["intersect"] is not None: + nested.append(sql["intersect"]) + if sql["except"] is not None: + nested.append(sql["except"]) + if sql["union"] is not None: + nested.append(sql["union"]) + return nested + + +def eval_nested(pred, label): + label_total = 0 + pred_total = 0 + cnt = 0 + if pred is not None: + pred_total += 1 + if label is not None: + label_total += 1 + if pred is not None and label is not None: + cnt += Evaluator().eval_exact_match(pred, label) + return label_total, pred_total, cnt + + +def eval_IUEN(pred, label): + lt1, pt1, cnt1 = eval_nested(pred["intersect"], label["intersect"]) + lt2, pt2, cnt2 = eval_nested(pred["except"], label["except"]) + lt3, pt3, cnt3 = eval_nested(pred["union"], label["union"]) + label_total = lt1 + lt2 + lt3 + pred_total = pt1 + pt2 + pt3 + cnt = cnt1 + cnt2 + cnt3 + return label_total, pred_total, cnt + + +def get_keywords(sql): + res = set() + if len(sql["where"]) > 0: + res.add("where") + if len(sql["groupBy"]) > 0: + res.add("group") + if len(sql["having"]) > 0: + res.add("having") + if len(sql["orderBy"]) > 0: + res.add(sql["orderBy"][0]) + res.add("order") + if sql["limit"] is not None: + res.add("limit") + if sql["except"] is not None: + res.add("except") + if sql["union"] is not None: + res.add("union") + if sql["intersect"] is not None: + res.add("intersect") + + # or keyword + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + if len([token for token in ao if token == "or"]) > 0: + res.add("or") + + cond_units = sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + # not keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[0]]) > 0: + res.add("not") + + # in keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("in")]) > 0: + res.add("in") + + # like keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("like")]) > 0: + res.add("like") + + return res + + +def eval_keywords(pred, label): + pred_keywords = 
get_keywords(pred) + label_keywords = get_keywords(label) + pred_total = len(pred_keywords) + label_total = len(label_keywords) + cnt = 0 + + for k in pred_keywords: + if k in label_keywords: + cnt += 1 + return label_total, pred_total, cnt + + +def count_agg(units): + return len([unit for unit in units if has_agg(unit)]) + + +def count_component1(sql): + count = 0 + if len(sql["where"]) > 0: + count += 1 + if len(sql["groupBy"]) > 0: + count += 1 + if len(sql["orderBy"]) > 0: + count += 1 + if sql["limit"] is not None: + count += 1 + if len(sql["from"]["table_units"]) > 0: # JOIN + count += len(sql["from"]["table_units"]) - 1 + + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + count += len([token for token in ao if token == "or"]) + cond_units = sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + count += len([cond_unit for cond_unit in cond_units if cond_unit[1] == WHERE_OPS.index("like")]) + + return count + + +def count_component2(sql): + nested = get_nestedSQL(sql) + return len(nested) + + +def count_others(sql): + count = 0 + # number of aggregation + agg_count = count_agg(sql["select"][1]) + agg_count += count_agg(sql["where"][::2]) + agg_count += count_agg(sql["groupBy"]) + if len(sql["orderBy"]) > 0: + agg_count += count_agg( + [unit[1] for unit in sql["orderBy"][1] if unit[1]] + [unit[2] for unit in sql["orderBy"][1] if unit[2]] + ) + agg_count += count_agg(sql["having"]) + if agg_count > 1: + count += 1 + + # number of select columns + if len(sql["select"][1]) > 1: + count += 1 + + # number of where conditions + if len(sql["where"]) > 1: + count += 1 + + # number of group by clauses + if len(sql["groupBy"]) > 1: + count += 1 + + return count + + +class Evaluator: + """A simple evaluator""" + + def __init__(self): + self.partial_scores = None + + def eval_hardness(self, sql): + count_comp1_ = count_component1(sql) + count_comp2_ = count_component2(sql) + count_others_ = count_others(sql) + + if count_comp1_ <= 1 and count_others_ == 0 and count_comp2_ == 0: + return "easy" + elif (count_others_ <= 2 and count_comp1_ <= 1 and count_comp2_ == 0) or ( + count_comp1_ <= 2 and count_others_ < 2 and count_comp2_ == 0 + ): + return "medium" + elif ( + (count_others_ > 2 and count_comp1_ <= 2 and count_comp2_ == 0) + or (2 < count_comp1_ <= 3 and count_others_ <= 2 and count_comp2_ == 0) + or (count_comp1_ <= 1 and count_others_ == 0 and count_comp2_ <= 1) + ): + return "hard" + else: + return "extra" + + def eval_exact_match(self, pred, label): + partial_scores = self.eval_partial_match(pred, label) + self.partial_scores = partial_scores + + for key, score in partial_scores.items(): + if score["f1"] != 1: + return 0 + + if len(label["from"]["table_units"]) > 0: + label_tables = sorted(label["from"]["table_units"]) + pred_tables = sorted(pred["from"]["table_units"]) + return label_tables == pred_tables + return 1 + + def eval_partial_match(self, pred, label): + res = {} + + label_total, pred_total, cnt, cnt_wo_agg = eval_sel(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["select"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, label_total) + res["select(no AGG)"] = { + "acc": acc, + "rec": rec, + "f1": f1, + "label_total": label_total, + "pred_total": pred_total, + } + + label_total, pred_total, cnt, cnt_wo_agg = eval_where(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["where"] = {"acc": 
acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, label_total) + res["where(no OP)"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_group(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["group(no Having)"] = { + "acc": acc, + "rec": rec, + "f1": f1, + "label_total": label_total, + "pred_total": pred_total, + } + + label_total, pred_total, cnt = eval_having(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["group"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_order(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["order"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_and_or(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["and/or"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_IUEN(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["IUEN"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + label_total, pred_total, cnt = eval_keywords(pred, label) + acc, rec, f1 = get_scores(cnt, pred_total, label_total) + res["keywords"] = {"acc": acc, "rec": rec, "f1": f1, "label_total": label_total, "pred_total": pred_total} + + return res + + +def isValidSQL(sql, db): + conn = sqlite3.connect(db) + cursor = conn.cursor() + try: + cursor.execute(sql) + except Exception: + return False + return True + + +def print_scores(scores, etype): + turns = ["turn 1", "turn 2", "turn 3", "turn 4", "turn >4"] + levels = ["easy", "medium", "hard", "extra", "all", "joint_all"] + partial_types = [ + "select", + "select(no AGG)", + "where", + "where(no OP)", + "group(no Having)", + "group", + "order", + "and/or", + "IUEN", + "keywords", + ] + + print("{:20} {:20} {:20} {:20} {:20} {:20} {:20}".format("", *levels)) + counts = [scores[level]["count"] for level in levels] + print("{:20} {:<20d} {:<20d} {:<20d} {:<20d} {:<20d} {:<20d}".format("count", *counts)) + + if etype in ["all", "exec"]: + print("===================== EXECUTION ACCURACY =====================") + this_scores = [scores[level]["exec"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("execution", *this_scores)) + + if etype in ["all", "match"]: + print("\n====================== EXACT MATCHING ACCURACY =====================") + exact_scores = [scores[level]["exact"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("exact match", *exact_scores)) + print("\n---------------------PARTIAL MATCHING ACCURACY----------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["acc"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + print("---------------------- PARTIAL MATCHING RECALL ----------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["rec"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + print("---------------------- 
PARTIAL MATCHING F1 --------------------------") + for type_ in partial_types: + this_scores = [scores[level]["partial"][type_]["f1"] for level in levels] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format(type_, *this_scores)) + + print("\n\n{:20} {:20} {:20} {:20} {:20} {:20}".format("", *turns)) + counts = [scores[turn]["count"] for turn in turns] + print("{:20} {:<20d} {:<20d} {:<20d} {:<20d} {:<20d}".format("count", *counts)) + + if etype in ["all", "exec"]: + print("===================== TRUN XECUTION ACCURACY =====================") + this_scores = [scores[turn]["exec"] for turn in turns] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("execution", *this_scores)) + + if etype in ["all", "match"]: + print("\n====================== TRUN EXACT MATCHING ACCURACY =====================") + exact_scores = [scores[turn]["exact"] for turn in turns] + print("{:20} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f} {:<20.3f}".format("exact match", *exact_scores)) + + +def evaluate(gold, predict, db_dir, etype, kmaps): + with open(gold) as f: + glist = [] + gseq_one = [] + for l in f.readlines(): + if len(l.strip()) == 0: + glist.append(gseq_one) + gseq_one = [] + else: + lstrip = l.strip().split("\t") + gseq_one.append(lstrip) + + with open(predict) as f: + plist = [] + pseq_one = [] + for l in f.readlines(): + if len(l.strip()) == 0: + plist.append(pseq_one) + pseq_one = [] + else: + pseq_one.append(l.strip().split("\t")) + evaluator = Evaluator() + + turns = ["turn 1", "turn 2", "turn 3", "turn 4", "turn >4"] + levels = ["easy", "medium", "hard", "extra", "all", "joint_all"] + partial_types = [ + "select", + "select(no AGG)", + "where", + "where(no OP)", + "group(no Having)", + "group", + "order", + "and/or", + "IUEN", + "keywords", + ] + entries = [] + scores = {} + + for turn in turns: + scores[turn] = {"count": 0, "exact": 0.0} + scores[turn]["exec"] = 0 + + for level in levels: + scores[level] = {"count": 0, "partial": {}, "exact": 0.0} + scores[level]["exec"] = 0 + for type_ in partial_types: + scores[level]["partial"][type_] = {"acc": 0.0, "rec": 0.0, "f1": 0.0, "acc_count": 0, "rec_count": 0} + + eval_err_num = 0 + for p, g in zip(plist, glist): + print("----------------------interaction begin--------------") + scores["joint_all"]["count"] += 1 + turn_scores = {"exec": [], "exact": []} + for idx, pg in enumerate(zip(p, g)): + p, g = pg + p_str = p[0] + p_str = p_str.replace("value", "1") + g_str, db = g + db_name = db + db = os.path.join(db_dir, db, db + ".sqlite") + schema = Schema(get_schema(db)) + g_sql = get_sql(schema, g_str) + hardness = evaluator.eval_hardness(g_sql) + if idx > 3: + idx = ">4" + else: + idx += 1 + turn_id = "turn " + str(idx) + scores[turn_id]["count"] += 1 + scores[hardness]["count"] += 1 + scores["all"]["count"] += 1 + + try: + p_sql = get_sql(schema, p_str) + except Exception: + # If p_sql is not valid, then we will use an empty sql to evaluate with the correct sql + p_sql = { + "except": None, + "from": {"conds": [], "table_units": []}, + "groupBy": [], + "having": [], + "intersect": None, + "limit": None, + "orderBy": [], + "select": [False, []], + "union": None, + "where": [], + } + eval_err_num += 1 + print("eval_err_num:{}".format(eval_err_num)) + + # rebuild sql for value evaluation + kmap = kmaps[db_name] + g_valid_col_units = build_valid_col_units(g_sql["from"]["table_units"], schema) + g_sql = rebuild_sql_val(g_sql) + g_sql = rebuild_sql_col(g_valid_col_units, g_sql, kmap) + p_valid_col_units = 
build_valid_col_units(p_sql["from"]["table_units"], schema) + p_sql = rebuild_sql_val(p_sql) + p_sql = rebuild_sql_col(p_valid_col_units, p_sql, kmap) + + if etype in ["all", "exec"]: + exec_score = eval_exec_match(db, p_str, g_str, p_sql, g_sql) + if exec_score: + scores[hardness]["exec"] += 1 + scores[turn_id]["exec"] += 1 + turn_scores["exec"].append(1) + else: + turn_scores["exec"].append(0) + + if etype in ["all", "match"]: + exact_score = evaluator.eval_exact_match(p_sql, g_sql) + partial_scores = evaluator.partial_scores + if exact_score == 0: + turn_scores["exact"].append(0) + print("{} pred: {}".format(hardness, p_str)) + print("{} gold: {}".format(hardness, g_str)) + print("") + else: + print("correct") + print("{} pred: {}".format(hardness, p_str)) + print("{} gold: {}".format(hardness, g_str)) + print("") + turn_scores["exact"].append(1) + scores[turn_id]["exact"] += exact_score + scores[hardness]["exact"] += exact_score + scores["all"]["exact"] += exact_score + for type_ in partial_types: + if partial_scores[type_]["pred_total"] > 0: + scores[hardness]["partial"][type_]["acc"] += partial_scores[type_]["acc"] + scores[hardness]["partial"][type_]["acc_count"] += 1 + if partial_scores[type_]["label_total"] > 0: + scores[hardness]["partial"][type_]["rec"] += partial_scores[type_]["rec"] + scores[hardness]["partial"][type_]["rec_count"] += 1 + scores[hardness]["partial"][type_]["f1"] += partial_scores[type_]["f1"] + if partial_scores[type_]["pred_total"] > 0: + scores["all"]["partial"][type_]["acc"] += partial_scores[type_]["acc"] + scores["all"]["partial"][type_]["acc_count"] += 1 + if partial_scores[type_]["label_total"] > 0: + scores["all"]["partial"][type_]["rec"] += partial_scores[type_]["rec"] + scores["all"]["partial"][type_]["rec_count"] += 1 + scores["all"]["partial"][type_]["f1"] += partial_scores[type_]["f1"] + + entries.append( + { + "predictSQL": p_str, + "goldSQL": g_str, + "hardness": hardness, + "exact": exact_score, + "partial": partial_scores, + } + ) + + if all(v == 1 for v in turn_scores["exec"]): + scores["joint_all"]["exec"] += 1 + + if all(v == 1 for v in turn_scores["exact"]): + scores["joint_all"]["exact"] += 1 + print("all correct") + + for turn in turns: + if scores[turn]["count"] == 0: + continue + if etype in ["all", "exec"]: + scores[turn]["exec"] /= scores[turn]["count"] + + if etype in ["all", "match"]: + scores[turn]["exact"] /= scores[turn]["count"] + + for level in levels: + if scores[level]["count"] == 0: + continue + if etype in ["all", "exec"]: + scores[level]["exec"] /= scores[level]["count"] + + if etype in ["all", "match"]: + scores[level]["exact"] /= scores[level]["count"] + for type_ in partial_types: + if scores[level]["partial"][type_]["acc_count"] == 0: + scores[level]["partial"][type_]["acc"] = 0 + else: + scores[level]["partial"][type_]["acc"] = ( + scores[level]["partial"][type_]["acc"] / scores[level]["partial"][type_]["acc_count"] * 1.0 + ) + if scores[level]["partial"][type_]["rec_count"] == 0: + scores[level]["partial"][type_]["rec"] = 0 + else: + scores[level]["partial"][type_]["rec"] = ( + scores[level]["partial"][type_]["rec"] / scores[level]["partial"][type_]["rec_count"] * 1.0 + ) + if scores[level]["partial"][type_]["acc"] == 0 and scores[level]["partial"][type_]["rec"] == 0: + scores[level]["partial"][type_]["f1"] = 1 + else: + scores[level]["partial"][type_]["f1"] = ( + 2.0 + * scores[level]["partial"][type_]["acc"] + * scores[level]["partial"][type_]["rec"] + / (scores[level]["partial"][type_]["rec"] + 
scores[level]["partial"][type_]["acc"]) + ) + + print_scores(scores, etype) + + +def eval_exec_match(db, p_str, g_str, pred, gold): + """ + return 1 if the values between prediction and gold are matching + in the corresponding index. Currently not support multiple col_unit(pairs). + """ + conn = sqlite3.connect(db) + cursor = conn.cursor() + try: + cursor.execute(p_str) + p_res = cursor.fetchall() + except Exception: + return False + + cursor.execute(g_str) + q_res = cursor.fetchall() + + def res_map(res, val_units): + rmap = {} + for idx, val_unit in enumerate(val_units): + key = tuple(val_unit[1]) if not val_unit[2] else (val_unit[0], tuple(val_unit[1]), tuple(val_unit[2])) + rmap[key] = [r[idx] for r in res] + return rmap + + p_val_units = [unit[1] for unit in pred["select"][1]] + q_val_units = [unit[1] for unit in gold["select"][1]] + return res_map(p_res, p_val_units) == res_map(q_res, q_val_units) + + +# Rebuild SQL functions for value evaluation +def rebuild_cond_unit_val(cond_unit): + if cond_unit is None or not DISABLE_VALUE: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + if type(val1) is not dict: + val1 = None + else: + val1 = rebuild_sql_val(val1) + if type(val2) is not dict: + val2 = None + else: + val2 = rebuild_sql_val(val2) + return not_op, op_id, val_unit, val1, val2 + + +def rebuild_condition_val(condition): + if condition is None or not DISABLE_VALUE: + return condition + + res = [] + for idx, it in enumerate(condition): + if idx % 2 == 0: + res.append(rebuild_cond_unit_val(it)) + else: + res.append(it) + return res + + +def rebuild_sql_val(sql): + if sql is None or not DISABLE_VALUE: + return sql + + sql["from"]["conds"] = rebuild_condition_val(sql["from"]["conds"]) + sql["having"] = rebuild_condition_val(sql["having"]) + sql["where"] = rebuild_condition_val(sql["where"]) + sql["intersect"] = rebuild_sql_val(sql["intersect"]) + sql["except"] = rebuild_sql_val(sql["except"]) + sql["union"] = rebuild_sql_val(sql["union"]) + + return sql + + +# Rebuild SQL functions for foreign key evaluation +def build_valid_col_units(table_units, schema): + col_ids = [table_unit[1] for table_unit in table_units if table_unit[0] == TABLE_TYPE["table_unit"]] + prefixs = [col_id[:-2] for col_id in col_ids] + valid_col_units = [] + for value in schema.idMap.values(): + if "." 
in value and value[: value.index(".")] in prefixs: + valid_col_units.append(value) + return valid_col_units + + +def rebuild_col_unit_col(valid_col_units, col_unit, kmap): + if col_unit is None: + return col_unit + + agg_id, col_id, distinct = col_unit + if col_id in kmap and col_id in valid_col_units: + col_id = kmap[col_id] + if DISABLE_DISTINCT: + distinct = None + return agg_id, col_id, distinct + + +def rebuild_val_unit_col(valid_col_units, val_unit, kmap): + if val_unit is None: + return val_unit + + unit_op, col_unit1, col_unit2 = val_unit + col_unit1 = rebuild_col_unit_col(valid_col_units, col_unit1, kmap) + col_unit2 = rebuild_col_unit_col(valid_col_units, col_unit2, kmap) + return unit_op, col_unit1, col_unit2 + + +def rebuild_table_unit_col(valid_col_units, table_unit, kmap): + if table_unit is None: + return table_unit + + table_type, col_unit_or_sql = table_unit + if isinstance(col_unit_or_sql, tuple): + col_unit_or_sql = rebuild_col_unit_col(valid_col_units, col_unit_or_sql, kmap) + return table_type, col_unit_or_sql + + +def rebuild_cond_unit_col(valid_col_units, cond_unit, kmap): + if cond_unit is None: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + val_unit = rebuild_val_unit_col(valid_col_units, val_unit, kmap) + return not_op, op_id, val_unit, val1, val2 + + +def rebuild_condition_col(valid_col_units, condition, kmap): + for idx in range(len(condition)): + if idx % 2 == 0: + condition[idx] = rebuild_cond_unit_col(valid_col_units, condition[idx], kmap) + return condition + + +def rebuild_select_col(valid_col_units, sel, kmap): + if sel is None: + return sel + distinct, _list = sel + new_list = [] + for it in _list: + agg_id, val_unit = it + new_list.append((agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap))) + if DISABLE_DISTINCT: + distinct = None + return distinct, new_list + + +def rebuild_from_col(valid_col_units, from_, kmap): + if from_ is None: + return from_ + + from_["table_units"] = [ + rebuild_table_unit_col(valid_col_units, table_unit, kmap) for table_unit in from_["table_units"] + ] + from_["conds"] = rebuild_condition_col(valid_col_units, from_["conds"], kmap) + return from_ + + +def rebuild_group_by_col(valid_col_units, group_by, kmap): + if group_by is None: + return group_by + + return [rebuild_col_unit_col(valid_col_units, col_unit, kmap) for col_unit in group_by] + + +def rebuild_order_by_col(valid_col_units, order_by, kmap): + if order_by is None or len(order_by) == 0: + return order_by + + direction, val_units = order_by + new_val_units = [rebuild_val_unit_col(valid_col_units, val_unit, kmap) for val_unit in val_units] + return direction, new_val_units + + +def rebuild_sql_col(valid_col_units, sql, kmap): + if sql is None: + return sql + + sql["select"] = rebuild_select_col(valid_col_units, sql["select"], kmap) + sql["from"] = rebuild_from_col(valid_col_units, sql["from"], kmap) + sql["where"] = rebuild_condition_col(valid_col_units, sql["where"], kmap) + sql["groupBy"] = rebuild_group_by_col(valid_col_units, sql["groupBy"], kmap) + sql["orderBy"] = rebuild_order_by_col(valid_col_units, sql["orderBy"], kmap) + sql["having"] = rebuild_condition_col(valid_col_units, sql["having"], kmap) + sql["intersect"] = rebuild_sql_col(valid_col_units, sql["intersect"], kmap) + sql["except"] = rebuild_sql_col(valid_col_units, sql["except"], kmap) + sql["union"] = rebuild_sql_col(valid_col_units, sql["union"], kmap) + + return sql + + +def build_foreign_key_map(entry): + cols_orig = entry["column_names_original"] + tables_orig = 
entry["table_names_original"] + + # rebuild cols corresponding to idmap in Schema + cols = [] + for col_orig in cols_orig: + if col_orig[0] >= 0: + t = tables_orig[col_orig[0]] + c = col_orig[1] + cols.append("__" + t.lower() + "." + c.lower() + "__") + else: + cols.append("__all__") + + def keyset_in_list(k1, k2, k_list): + for k_set in k_list: + if k1 in k_set or k2 in k_set: + return k_set + new_k_set = set() + k_list.append(new_k_set) + return new_k_set + + foreign_key_list = [] + foreign_keys = entry["foreign_keys"] + for fkey in foreign_keys: + key1, key2 = fkey + key_set = keyset_in_list(key1, key2, foreign_key_list) + key_set.add(key1) + key_set.add(key2) + + foreign_key_map = {} + for key_set in foreign_key_list: + sorted_list = sorted(list(key_set)) + midx = sorted_list[0] + for idx in sorted_list: + foreign_key_map[cols[idx]] = cols[midx] + + return foreign_key_map + + +def build_foreign_key_map_from_json(table): + with open(table) as f: + data = json.load(f) + tables = {} + for entry in data: + tables[entry["db_id"]] = build_foreign_key_map(entry) + return tables + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--gold", dest="gold", type=str) + parser.add_argument("--pred", dest="pred", type=str) + parser.add_argument("--db", dest="db", type=str) + parser.add_argument("--table", dest="table", type=str) + parser.add_argument("--etype", dest="etype", type=str) + args = parser.parse_args() + + gold = args.gold + pred = args.pred + db_dir = args.db + table = args.table + etype = args.etype + + assert etype in ["all", "exec", "match"], "Unknown evaluation method" + + kmaps = build_foreign_key_map_from_json(table) + + evaluate(gold, pred, db_dir, etype, kmaps) diff --git a/examples/text_to_sql/IGSQL/eval_scripts/metric_averages.py b/examples/text_to_sql/IGSQL/eval_scripts/metric_averages.py new file mode 100644 index 0000000000000000000000000000000000000000..4eb3fea22bf5cbe347a1526305d46271ecfa0e9c --- /dev/null +++ b/examples/text_to_sql/IGSQL/eval_scripts/metric_averages.py @@ -0,0 +1,66 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import sys + +predictions = [json.loads(line) for line in open(sys.argv[1]).readlines() if line] + +string_count = 0.0 +sem_count = 0.0 +syn_count = 0.0 +table_count = 0.0 +strict_table_count = 0.0 + +precision_denom = 0.0 +precision = 0.0 +recall_denom = 0.0 +recall = 0.0 +f1_score = 0.0 +f1_denom = 0.0 + +time = 0.0 + +for prediction in predictions: + if prediction["correct_string"]: + string_count += 1.0 + if prediction["semantic"]: + sem_count += 1.0 + if prediction["syntactic"]: + syn_count += 1.0 + if prediction["correct_table"]: + table_count += 1.0 + if prediction["strict_correct_table"]: + strict_table_count += 1.0 + if prediction["gold_tables"] != "[[]]": + precision += prediction["table_prec"] + precision_denom += 1 + if prediction["pred_table"] != "[]": + recall += prediction["table_rec"] + recall_denom += 1 + + if prediction["gold_tables"] != "[[]]": + f1_score += prediction["table_f1"] + f1_denom += 1 + +num_p = len(predictions) +print("string precision: " + str(string_count / num_p)) +print("% semantic: " + str(sem_count / num_p)) +print("% syntactic: " + str(syn_count / num_p)) +print("table prec: " + str(table_count / num_p)) +print("strict table prec: " + str(strict_table_count / num_p)) +print("table row prec: " + str(precision / precision_denom)) +print("table row recall: " + str(recall / recall_denom)) +print("table row f1: " + str(f1_score / f1_denom)) +print("inference time: " + str(time / num_p)) diff --git a/examples/text_to_sql/IGSQL/eval_scripts/process_sql.py b/examples/text_to_sql/IGSQL/eval_scripts/process_sql.py new file mode 100644 index 0000000000000000000000000000000000000000..4168932db02bc53df13a40a77a145111a66da2a4 --- /dev/null +++ b/examples/text_to_sql/IGSQL/eval_scripts/process_sql.py @@ -0,0 +1,579 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import sqlite3 + +from nltk import word_tokenize + +################################ +# Assumptions: +# 1. sql is correct +# 2. only table name has alias +# 3. only one intersect/union/except +# +# val: number(float)/string(str)/sql(dict) +# col_unit: (agg_id, col_id, isDistinct(bool)) +# val_unit: (unit_op, col_unit1, col_unit2) +# table_unit: (table_type, col_unit/sql) +# cond_unit: (not_op, op_id, val_unit, val1, val2) +# condition: [cond_unit1, 'and'/'or', cond_unit2, ...] +# sql { +# 'select': (isDistinct(bool), [(agg_id, val_unit), (agg_id, val_unit), ...]) +# 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} +# 'where': condition +# 'groupBy': [col_unit1, col_unit2, ...] 
+# 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) +# 'having': condition +# 'limit': None/limit value +# 'intersect': None/sql +# 'except': None/sql +# 'union': None/sql +# } +################################ +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +WHERE_OPS = ("not", "between", "=", ">", "<", ">=", "<=", "!=", "in", "like", "is", "exists") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +COND_OPS = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + + +class Schema: + """ + Simple schema which maps table&column to a unique identifier + """ + + def __init__(self, schema): + self._schema = schema + self._idMap = self._map(self._schema) + + @property + def schema(self): + return self._schema + + @property + def idMap(self): + return self._idMap + + def _map(self, schema): + idMap = {"*": "__all__"} + id = 1 + for key, vals in schema.items(): + for val in vals: + idMap[key.lower() + "." + val.lower()] = "__" + key.lower() + "." + val.lower() + "__" + id += 1 + + for key in schema: + idMap[key.lower()] = "__" + key.lower() + "__" + id += 1 + + return idMap + + +def get_schema(db): + """ + Get database's schema, which is a dict with table name as key + and list of column names as value + Args: + db(`str`): Database path. + Returns: + `dict`: Schema dict. + """ + + schema = {} + conn = sqlite3.connect(db) + cursor = conn.cursor() + + # fetch table names + cursor.execute("SELECT name FROM sqlite_master WHERE type='table';") + tables = [str(table[0].lower()) for table in cursor.fetchall()] + + # fetch table info + for table in tables: + cursor.execute("PRAGMA table_info({})".format(table)) + schema[table] = [str(col[1].lower()) for col in cursor.fetchall()] + + return schema + + +def get_schema_from_json(fpath): + with open(fpath) as f: + data = json.load(f) + + schema = {} + for entry in data: + table = str(entry["table"].lower()) + cols = [str(col["column_name"].lower()) for col in entry["col_data"]] + schema[table] = cols + + return schema + + +def tokenize(string): + string = str(string) + string = string.replace("'", '"') # ensures all string values wrapped by "" problem?? 
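(The remainder of `tokenize` below first protects quoted string literals with placeholder tokens, lower-cases and word-tokenizes the query with NLTK, restores the literals, and finally re-joins split comparison operators such as `>` `=` into `>=`. A rough behavioural sketch, assuming the NLTK punkt data is installed; exact tokens depend on `nltk.word_tokenize`:)

```python
from process_sql import tokenize  # module path assumed from the diff layout

toks = tokenize('SELECT name FROM singer WHERE age >= 20 AND country = "France"')
# roughly: ['select', 'name', 'from', 'singer', 'where', 'age', '>=', '20',
#           'and', 'country', '=', '"France"']
```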
+ quote_idxs = [idx for idx, char in enumerate(string) if char == '"'] + assert len(quote_idxs) % 2 == 0, "Unexpected quote" + + # keep string value as token + vals = {} + for i in range(len(quote_idxs) - 1, -1, -2): + qidx1 = quote_idxs[i - 1] + qidx2 = quote_idxs[i] + val = string[qidx1 : qidx2 + 1] + key = "__val_{}_{}__".format(qidx1, qidx2) + string = string[:qidx1] + key + string[qidx2 + 1 :] + vals[key] = val + + toks = [word.lower() for word in word_tokenize(string)] + # replace with string value token + for i in range(len(toks)): + if toks[i] in vals: + toks[i] = vals[toks[i]] + + # find if there exists !=, >=, <= + eq_idxs = [idx for idx, tok in enumerate(toks) if tok == "="] + eq_idxs.reverse() + prefix = ("!", ">", "<") + for eq_idx in eq_idxs: + pre_tok = toks[eq_idx - 1] + if pre_tok in prefix: + toks = toks[: eq_idx - 1] + [pre_tok + "="] + toks[eq_idx + 1 :] + + return toks + + +def scan_alias(toks): + """Scan the index of 'as' and build the map for all alias""" + as_idxs = [idx for idx, tok in enumerate(toks) if tok == "as"] + alias = {} + for idx in as_idxs: + alias[toks[idx + 1]] = toks[idx - 1] + return alias + + +def get_tables_with_alias(schema, toks): + tables = scan_alias(toks) + for key in schema: + assert key not in tables, "Alias {} has the same name in table".format(key) + tables[key] = key + return tables + + +def parse_col(toks, start_idx, tables_with_alias, schema, default_tables=None): + tok = toks[start_idx] + if tok == "*": + return start_idx + 1, schema.idMap[tok] + + if "." in tok: # if token is a composite + alias, col = tok.split(".") + key = tables_with_alias[alias] + "." + col + return start_idx + 1, schema.idMap[key] + + assert default_tables is not None and len(default_tables) > 0, "Default tables should not be None or empty" + + for alias in default_tables: + table = tables_with_alias[alias] + if tok in schema.schema[table]: + key = table + "." 
+ tok + return start_idx + 1, schema.idMap[key] + + assert False, "Error col: {}".format(tok) + + +def parse_col_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + isBlock = False + isDistinct = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + assert idx < len_ and toks[idx] == "(" + idx += 1 + if toks[idx] == "distinct": + idx += 1 + isDistinct = True + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + assert idx < len_ and toks[idx] == ")" + idx += 1 + return idx, (agg_id, col_id, isDistinct) + + if toks[idx] == "distinct": + idx += 1 + isDistinct = True + agg_id = AGG_OPS.index("none") + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + + return idx, (agg_id, col_id, isDistinct) + + +def parse_val_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + col_unit1 = None + col_unit2 = None + unit_op = UNIT_OPS.index("none") + + idx, col_unit1 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + if idx < len_ and toks[idx] in UNIT_OPS: + unit_op = UNIT_OPS.index(toks[idx]) + idx += 1 + idx, col_unit2 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + + return idx, (unit_op, col_unit1, col_unit2) + + +def parse_table_unit(toks, start_idx, tables_with_alias, schema): + idx = start_idx + len_ = len(toks) + key = tables_with_alias[toks[idx]] + + if idx + 1 < len_ and toks[idx + 1] == "as": + idx += 3 + else: + idx += 1 + + return idx, schema.idMap[key], key + + +def parse_value(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] == "select": + idx, val = parse_sql(toks, idx, tables_with_alias, schema) + elif '"' in toks[idx]: # token is a string value + val = toks[idx] + idx += 1 + else: + try: + val = float(toks[idx]) + idx += 1 + except Exception: + end_idx = idx + while ( + end_idx < len_ + and toks[end_idx] != "," + and toks[end_idx] != ")" + and toks[end_idx] != "and" + and toks[end_idx] not in CLAUSE_KEYWORDS + and toks[end_idx] not in JOIN_KEYWORDS + ): + end_idx += 1 + + idx, val = parse_col_unit(toks[start_idx:end_idx], 0, tables_with_alias, schema, default_tables) + idx = end_idx + + if isBlock: + assert toks[idx] == ")" + idx += 1 + + return idx, val + + +def parse_condition(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + conds = [] + + while idx < len_: + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + not_op = False + if toks[idx] == "not": + not_op = True + idx += 1 + + assert idx < len_ and toks[idx] in WHERE_OPS, "Error condition: idx: {}, tok: {}".format(idx, toks[idx]) + op_id = WHERE_OPS.index(toks[idx]) + idx += 1 + val1 = val2 = None + if op_id == WHERE_OPS.index("between"): # between..and... 
special case: dual values + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + assert toks[idx] == "and" + idx += 1 + idx, val2 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + else: # normal case: single value + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + val2 = None + + conds.append((not_op, op_id, val_unit, val1, val2)) + + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";") or toks[idx] in JOIN_KEYWORDS): + break + + if idx < len_ and toks[idx] in COND_OPS: + conds.append(toks[idx]) + idx += 1 # skip and/or + + return idx, conds + + +def parse_select(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + + assert toks[idx] == "select", "'select' not found" + idx += 1 + isDistinct = False + if idx < len_ and toks[idx] == "distinct": + idx += 1 + isDistinct = True + val_units = [] + + while idx < len_ and toks[idx] not in CLAUSE_KEYWORDS: + agg_id = AGG_OPS.index("none") + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append((agg_id, val_unit)) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + + return idx, (isDistinct, val_units) + + +def parse_from(toks, start_idx, tables_with_alias, schema): + """ + Assume in the from clause, all table units are combined with join + """ + assert "from" in toks[start_idx:], "'from' not found" + + len_ = len(toks) + idx = toks.index("from", start_idx) + 1 + default_tables = [] + table_units = [] + conds = [] + + while idx < len_: + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] == "select": + idx, sql = parse_sql(toks, idx, tables_with_alias, schema) + table_units.append((TABLE_TYPE["sql"], sql)) + else: + if idx < len_ and toks[idx] == "join": + idx += 1 # skip join + idx, table_unit, table_name = parse_table_unit(toks, idx, tables_with_alias, schema) + table_units.append((TABLE_TYPE["table_unit"], table_unit)) + default_tables.append(table_name) + if idx < len_ and toks[idx] == "on": + idx += 1 # skip on + idx, this_conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + if len(conds) > 0: + conds.append("and") + conds.extend(this_conds) + + if isBlock: + assert toks[idx] == ")" + idx += 1 + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + break + + return idx, table_units, conds, default_tables + + +def parse_where(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "where": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_group_by(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + col_units = [] + + if idx >= len_ or toks[idx] != "group": + return idx, col_units + + idx += 1 + assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + idx, col_unit = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + col_units.append(col_unit) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, col_units + + +def parse_order_by(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + val_units 
= [] + order_type = "asc" # default type is 'asc' + + if idx >= len_ or toks[idx] != "order": + return idx, val_units + + idx += 1 + assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append(val_unit) + if idx < len_ and toks[idx] in ORDER_OPS: + order_type = toks[idx] + idx += 1 + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, (order_type, val_units) + + +def parse_having(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "having": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_limit(toks, start_idx): + idx = start_idx + len_ = len(toks) + + if idx < len_ and toks[idx] == "limit": + idx += 2 + # make limit value can work, cannot assume put 1 as a fake limit number + if type(toks[idx - 1]) != int: + return idx, 1 + + return idx, int(toks[idx - 1]) + + return idx, None + + +def parse_sql(toks, start_idx, tables_with_alias, schema): + isBlock = False # indicate whether this is a block of sql/sub-sql + len_ = len(toks) + idx = start_idx + + sql = {} + if toks[idx] == "(": + isBlock = True + idx += 1 + + # parse from clause in order to get default tables + from_end_idx, table_units, conds, default_tables = parse_from(toks, start_idx, tables_with_alias, schema) + sql["from"] = {"table_units": table_units, "conds": conds} + # select clause + _, select_col_units = parse_select(toks, idx, tables_with_alias, schema, default_tables) + idx = from_end_idx + sql["select"] = select_col_units + # where clause + idx, where_conds = parse_where(toks, idx, tables_with_alias, schema, default_tables) + sql["where"] = where_conds + # group by clause + idx, group_col_units = parse_group_by(toks, idx, tables_with_alias, schema, default_tables) + sql["groupBy"] = group_col_units + # having clause + idx, having_conds = parse_having(toks, idx, tables_with_alias, schema, default_tables) + sql["having"] = having_conds + # order by clause + idx, order_col_units = parse_order_by(toks, idx, tables_with_alias, schema, default_tables) + sql["orderBy"] = order_col_units + # limit clause + idx, limit_val = parse_limit(toks, idx) + sql["limit"] = limit_val + + idx = skip_semicolon(toks, idx) + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + idx = skip_semicolon(toks, idx) + + # intersect/union/except clause + for op in SQL_OPS: # initialize IUE + sql[op] = None + if idx < len_ and toks[idx] in SQL_OPS: + sql_op = toks[idx] + idx += 1 + idx, IUE_sql = parse_sql(toks, idx, tables_with_alias, schema) + sql[sql_op] = IUE_sql + return idx, sql + + +def load_data(fpath): + with open(fpath) as f: + data = json.load(f) + return data + + +def get_sql(schema, query): + toks = tokenize(query) + tables_with_alias = get_tables_with_alias(schema.schema, toks) + _, sql = parse_sql(toks, 0, tables_with_alias, schema) + + return sql + + +def skip_semicolon(toks, start_idx): + idx = start_idx + while idx < len(toks) and toks[idx] == ";": + idx += 1 + return idx diff --git a/examples/text_to_sql/IGSQL/logger.py b/examples/text_to_sql/IGSQL/logger.py new file mode 100644 index 0000000000000000000000000000000000000000..4d0584f7515fcbf28cc3f7afc43222e3090c8e74 --- /dev/null +++ b/examples/text_to_sql/IGSQL/logger.py @@ -0,0 +1,67 @@ +# 
Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains the logging class.""" + + +class Logger: + def __init__(self, filename, option): + self.fileptr = open(filename, option) + if option == "r": + self.lines = self.fileptr.readlines() + else: + self.lines = [] + + def put(self, string): + """Writes to the file.""" + self.fileptr.write(string + "\n") + self.fileptr.flush() + + def close(self): + """Closes the logger.""" + self.fileptr.close() + + def findlast(self, identifier, default=0.0): + """Finds the last line in the log with a certain value.""" + for line in self.lines[::-1]: + if line.lower().startswith(identifier): + string = line.strip().split("\t")[1] + if string.replace(".", "").isdigit(): + return float(string) + elif string.lower() == "true": + return True + elif string.lower() == "false": + return False + else: + return string + return default + + def contains(self, string): + """Determines whether the string is present in the log.""" + for line in self.lines[::-1]: + if string.lower() in line.lower(): + return True + return False + + def findlast_log_before(self, before_str): + """Finds the last entry in the log before another entry.""" + loglines = [] + in_line = False + for line in self.lines[::-1]: + if line.startswith(before_str): + in_line = True + elif in_line: + loglines.append(line) + if line.strip() == "" and in_line: + return "".join(loglines[::-1]) + return "".join(loglines[::-1]) diff --git a/examples/text_to_sql/IGSQL/model/attention.py b/examples/text_to_sql/IGSQL/model/attention.py new file mode 100644 index 0000000000000000000000000000000000000000..a1989ec726cd17754081067b624b69dc66162486 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/attention.py @@ -0,0 +1,103 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains classes for computing and keeping track of attention distributions. +""" +from collections import namedtuple + +import numpy as np +import paddle +import paddle.nn.functional as F + +np.random.seed(0) + + +class AttentionResult(namedtuple("AttentionResult", ("scores", "distribution", "vector"))): + """Stores the result of an attention calculation.""" + + __slots__ = () + + +class Attention(paddle.nn.Layer): + """Attention mechanism class. Stores parameters for and computes attention. 
+ + Attributes: + transform_query (`bool`): Whether or not to transform the query being + passed in with a weight transformation before computing attentino. + transform_key (`bool`): Whether or not to transform the key being + passed in with a weight transformation before computing attentino. + transform_value (`bool`): Whether or not to transform the value being + passed in with a weight transformation before computing attentino. + key_size (`int`): The size of the key vectors. + value_size (`int`): The size of the value vectors. + the query or key. + query_weights (`Parameter`): Weights for transforming the query. + key_weights (`Parameter`): Weights for transforming the key. + value_weights (`Parameter`): Weights for transforming the value. + """ + + def __init__(self, query_size, key_size, value_size): + super().__init__() + self.key_size = key_size + self.value_size = value_size + + _initializer = paddle.nn.initializer.XavierUniform() + + query_weights = paddle.ParamAttr(initializer=_initializer) + + self.query_linear = paddle.nn.Linear(query_size, self.key_size, weight_attr=query_weights, bias_attr=False) + + def transform_arguments(self, query, keys, values): + """Transforms the query/key/value inputs before attention calculations. + + Arguments: + query (`Tensor`): Vector representing the query (e.g., hidden state.) + keys (`list`): List of vectors representing the key + values. + values (`list`): List of vectors representing the values. + + Returns: + `triple`: The first represents the (transformed) + query, the second represents the (transformed and concatenated) + keys, and the third represents the (transformed and concatenated) + values. + """ + assert len(keys) == len(values) + + all_keys = paddle.stack(keys, axis=1) + all_values = paddle.stack(values, axis=1) + + assert all_keys.shape[0] == self.key_size, ( + "Expected key size of " + str(self.key_size) + " but got " + str(all_keys.shape[0]) + ) + assert all_values.shape[0] == self.value_size + + if query.dim() == 1: + query = query.unsqueeze(0) + query = self.query_linear(query) + + return query, all_keys, all_values + + def forward(self, query, keys, values=None): + if not values: + values = keys + + query_t, keys_t, values_t = self.transform_arguments(query, keys, values) + + scores = paddle.t(paddle.mm(query_t, keys_t)) # len(key) x len(query) + + distribution = F.softmax(scores, axis=0) # len(key) x len(query) + + context_vector = paddle.mm(values_t, distribution).squeeze() # value_size x len(query) + + return AttentionResult(scores, distribution, context_vector) diff --git a/examples/text_to_sql/IGSQL/model/bert_utils.py b/examples/text_to_sql/IGSQL/model/bert_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..f1363374684a8e2ea1e1f2044acdf9c00bce9579 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/bert_utils.py @@ -0,0 +1,421 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
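As a quick sanity check of the `Attention` layer defined in `model/attention.py` above, here is a minimal usage sketch; the sizes, import path, and random inputs are illustrative assumptions, not values from this diff:

```python
import paddle

from model.attention import Attention  # import path assumed from the diff layout

attn = Attention(query_size=300, key_size=300, value_size=300)

query = paddle.randn([300])                     # e.g. a decoder hidden state
keys = [paddle.randn([300]) for _ in range(5)]  # e.g. one encoder state per input token

result = attn(query, keys)                      # values default to the keys
print(result.distribution.shape)                # [5, 1]: softmax over the 5 keys
print(result.vector.shape)                      # [300]: attention-weighted context vector
```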
+# modified from https://github.com/naver/sqlova + +import paddle + +from paddlenlp.transformers import BertModel, BertPretrainedModel, BertTokenizer + + +def get_bert(params): + model_bert = BertModel.from_pretrained("bert-base-uncased") + bert_config = BertPretrainedModel.pretrained_init_configuration["bert-base-uncased"] + tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") + + return model_bert, tokenizer, bert_config + + +def generate_inputs(tokenizer, nlu1_tok, hds1): + tokens = [] + segment_ids = [] + sent_ids = [] + + t_to_tt_idx_hds1 = [] + + tokens.append("[CLS]") + i_st_nlu = len(tokens) # to use it later + + segment_ids.append(0) + sent_ids.append(0) + for token in nlu1_tok: + tokens.append(token) + segment_ids.append(0) + sent_ids.append(0) + i_ed_nlu = len(tokens) + tokens.append("[SEP]") + segment_ids.append(0) + sent_ids.append(0) + + i_hds = [] + for i, hds11 in enumerate(hds1): + i_st_hd = len(tokens) + t_to_tt_idx_hds11 = [] + sub_tok = [] + for sub_tok1 in hds11.split(): + t_to_tt_idx_hds11.append(len(sub_tok)) + sub_tok += tokenizer.tokenize(sub_tok1) + t_to_tt_idx_hds1.append(t_to_tt_idx_hds11) + tokens += sub_tok + + i_ed_hd = len(tokens) + i_hds.append((i_st_hd, i_ed_hd)) + segment_ids += [1] * len(sub_tok) + sent_ids += [1] * len(sub_tok) + if i < len(hds1) - 1: + tokens.append("[SEP]") + segment_ids.append(0) + sent_ids.append(1) + elif i == len(hds1) - 1: + tokens.append("[SEP]") + segment_ids.append(1) + sent_ids.append(1) + else: + raise EnvironmentError + + i_nlu = (i_st_nlu, i_ed_nlu) + + return tokens, sent_ids, segment_ids, i_nlu, i_hds, t_to_tt_idx_hds1 + + +def gen_l_hpu(i_hds): + """ + # Treat columns as if it is a batch of natural language utterance with batch-size = # of columns * # of batch_size + i_hds = [(17, 18), (19, 21), (22, 23), (24, 25), (26, 29), (30, 34)]) + """ + l_hpu = [] + for i_hds1 in i_hds: + for i_hds11 in i_hds1: + l_hpu.append(i_hds11[1] - i_hds11[0]) + + return l_hpu + + +def get_bert_output(model_bert, tokenizer, nlu_t, hds, max_seq_length): + """ + Here, input is toknized further by WordPiece (WP) tokenizer and fed into BERT. + """ + + l_n = [] + l_hs = [] # The length of columns for each batch + + input_ids = [] + tokens = [] + segment_ids = [] + sent_ids = [] + input_mask = [] + + i_nlu = [] # index to retreive the position of contextual vector later. + i_hds = [] + + nlu_tt = [] + + t_to_tt_idx = [] + tt_to_t_idx = [] + + t_to_tt_idx_hds = [] + + for b, nlu_t1 in enumerate(nlu_t): + hds1 = hds[b] + l_hs.append(len(hds1)) + + # 1. 2nd tokenization using WordPiece + tt_to_t_idx1 = [] # number indicates where sub-token belongs to in 1st-level-tokens (here, CoreNLP). + t_to_tt_idx1 = [] # orig_to_tok_idx[i] = start index of i-th-1st-level-token in all_tokens. + nlu_tt1 = [] # all_doc_tokens[ orig_to_tok_idx[i] ] returns first sub-token segement of i-th-1st-level-token + for (i, token) in enumerate(nlu_t1): + t_to_tt_idx1.append( + len(nlu_tt1) + ) # all_doc_tokens[ indicate the start position of original 'white-space' tokens. + sub_tokens = tokenizer.tokenize(token) + for sub_token in sub_tokens: + tt_to_t_idx1.append(i) + nlu_tt1.append(sub_token) # all_doc_tokens are further tokenized using WordPiece tokenizer + nlu_tt.append(nlu_tt1) + tt_to_t_idx.append(tt_to_t_idx1) + t_to_tt_idx.append(t_to_tt_idx1) + + l_n.append(len(nlu_tt1)) + + # [CLS] nlu [SEP] col1 [SEP] col2 [SEP] ...col-n [SEP] + # 2. Generate BERT inputs & indices. 
+ tokens1, sent_ids1, segment_ids1, i_nlu1, i_hds1, t_to_tt_idx_hds1 = generate_inputs(tokenizer, nlu_tt1, hds1) + + assert len(t_to_tt_idx_hds1) == len(hds1) + + t_to_tt_idx_hds.append(t_to_tt_idx_hds1) + + input_ids1 = tokenizer.convert_tokens_to_ids(tokens1) + + # Input masks + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask1 = [1] * len(input_ids1) + + # 3. Zero-pad up to the sequence length. + if len(nlu_t) == 1: + max_seq_length = len(input_ids1) + while len(input_ids1) < max_seq_length: + input_ids1.append(0) + input_mask1.append(0) + segment_ids1.append(0) + sent_ids1.append(0) + + assert len(input_ids1) == max_seq_length + assert len(input_mask1) == max_seq_length + assert len(segment_ids1) == max_seq_length + assert len(sent_ids1) == max_seq_length + + input_ids.append(input_ids1) + tokens.append(tokens1) + segment_ids.append(segment_ids1) + sent_ids.append(sent_ids1) + + input_mask.append(input_mask1) + + i_nlu.append(i_nlu1) + i_hds.append(i_hds1) + + # Convert to tensor + all_input_ids = paddle.to_tensor(input_ids, dtype="int64") + all_segment_ids = paddle.to_tensor(segment_ids, dtype="int64") + + # 4. Generate BERT output. + all_encoder_layer, pooled_output = model_bert( + all_input_ids, token_type_ids=all_segment_ids, output_hidden_states=True + ) + + # 5. generate l_hpu from i_hds + l_hpu = gen_l_hpu(i_hds) + + assert len(set(l_n)) == 1 and len(set(i_nlu)) == 1 + assert l_n[0] == i_nlu[0][1] - i_nlu[0][0] + + return ( + all_encoder_layer, + pooled_output, + tokens, + i_nlu, + i_hds, + l_n, + l_hpu, + l_hs, + nlu_tt, + t_to_tt_idx, + tt_to_t_idx, + t_to_tt_idx_hds, + ) + + +def get_wemb_n(i_nlu, l_n, hS, num_hidden_layers, all_encoder_layer, num_out_layers_n): + """ + Get the representation of each tokens. + """ + bS = len(l_n) + l_n_max = max(l_n) + wemb_n = [] + for b in range(bS): + # [B, max_len, dim] + # Fill zero for non-exist part. + i_nlu1 = i_nlu[b] + for i_noln in range(num_out_layers_n): + i_layer = num_hidden_layers - 1 - i_noln + tmp = all_encoder_layer[i_layer][b, i_nlu1[0] : i_nlu1[1], :].unsqueeze(0) + pad_right = l_n_max - (i_nlu1[1] - i_nlu1[0]) + pad_tmp = paddle.nn.functional.pad(tmp, [0, pad_right], data_format="NLC").squeeze(0) + wemb_n.append(pad_tmp) + wemb_n = paddle.stack(wemb_n) + return wemb_n + + +def get_wemb_h(i_hds, l_hpu, l_hs, hS, num_hidden_layers, all_encoder_layer, num_out_layers_h): + """ + As if + [ [table-1-col-1-tok1, t1-c1-t2, ...], + [t1-c2-t1, t1-c2-t2, ...]. + ... + [t2-c1-t1, ...,] + ] + """ + l_hpu_max = max(l_hpu) + wemb_h = [] + b_pu = -1 + + for b, i_hds1 in enumerate(i_hds): + for b1, i_hds11 in enumerate(i_hds1): + b_pu += 1 + for i_nolh in range(num_out_layers_h): + i_layer = num_hidden_layers - 1 - i_nolh + tmp = all_encoder_layer[i_layer][b, i_hds11[0] : i_hds11[1], :].unsqueeze(0) + pad_right = l_hpu_max - (i_hds11[1] - i_hds11[0]) + pad_tmp = paddle.nn.functional.pad(tmp, [0, pad_right], data_format="NLC").squeeze(0) + wemb_h.append(pad_tmp) + wemb_h = paddle.stack(wemb_h) + return wemb_h + + +def get_wemb_bert( + bert_config, model_bert, tokenizer, nlu_t, hds, max_seq_length, num_out_layers_n=1, num_out_layers_h=1 +): + + # get contextual output of all tokens from bert + ( + all_encoder_layer, + pooled_output, + tokens, + i_nlu, + i_hds, + l_n, + l_hpu, + l_hs, + nlu_tt, + t_to_tt_idx, + tt_to_t_idx, + t_to_tt_idx_hds, + ) = get_bert_output(model_bert, tokenizer, nlu_t, hds, max_seq_length) + # all_encoder_layer: BERT outputs from all layers. 
+ # pooled_output: output of [CLS] vec. + # tokens: BERT intput tokens + # i_nlu: start and end indices of question in tokens + # i_hds: start and end indices of headers + + # get the wemb + wemb_n = get_wemb_n( + i_nlu, l_n, bert_config["hidden_size"], bert_config["num_hidden_layers"], all_encoder_layer, num_out_layers_n + ) + + wemb_h = get_wemb_h( + i_hds, + l_hpu, + l_hs, + bert_config["hidden_size"], + bert_config["num_hidden_layers"], + all_encoder_layer, + num_out_layers_h, + ) + + return wemb_n, wemb_h, l_n, l_hpu, l_hs, nlu_tt, t_to_tt_idx, tt_to_t_idx, t_to_tt_idx_hds + + +def prepare_input(tokenizer, input_sequence, input_schema, max_seq_length): + nlu_t = [] + hds = [] + + nlu_t1 = input_sequence + all_hds = input_schema.column_names_embedder_input + + nlu_tt1 = [] + for (i, token) in enumerate(nlu_t1): + nlu_tt1 += tokenizer.tokenize(token) + + current_hds1 = [] + for hds1 in all_hds: + new_hds1 = current_hds1 + [hds1] + tokens1, _, segment_ids1, i_nlu1, i_hds1, t_to_tt_idx_hds1 = generate_inputs(tokenizer, nlu_tt1, new_hds1) + if len(segment_ids1) > max_seq_length: + nlu_t.append(nlu_t1) + hds.append(current_hds1) + current_hds1 = [hds1] + else: + current_hds1 = new_hds1 + + if len(current_hds1) > 0: + nlu_t.append(nlu_t1) + hds.append(current_hds1) + + return nlu_t, hds + + +def prepare_input_v2(tokenizer, input_sequence, input_schema): + nlu_t = [] + hds = [] + max_seq_length = 0 + + nlu_t1 = input_sequence + all_hds = input_schema.column_names_embedder_input + + nlu_tt1 = [] + for (i, token) in enumerate(nlu_t1): + nlu_tt1 += tokenizer.tokenize(token) + + current_hds1 = [] + current_table = "" + for hds1 in all_hds: + hds1_table = hds1.split(".")[0].strip() + if hds1_table == current_table: + current_hds1.append(hds1) + else: + tokens1, segment_ids1, i_nlu1, i_hds1, t_to_tt_idx_hds1 = generate_inputs(tokenizer, nlu_tt1, current_hds1) + max_seq_length = max(max_seq_length, len(segment_ids1)) + + nlu_t.append(nlu_t1) + hds.append(current_hds1) + current_hds1 = [hds1] + current_table = hds1_table + + if len(current_hds1) > 0: + tokens1, segment_ids1, i_nlu1, i_hds1, t_to_tt_idx_hds1 = generate_inputs(tokenizer, nlu_tt1, current_hds1) + max_seq_length = max(max_seq_length, len(segment_ids1)) + nlu_t.append(nlu_t1) + hds.append(current_hds1) + + return nlu_t, hds, max_seq_length + + +def get_bert_encoding( + bert_config, + model_bert, + tokenizer, + input_sequence, + input_schema, + bert_input_version="v1", + max_seq_length=512, + num_out_layers_n=1, + num_out_layers_h=1, +): + if bert_input_version == "v1": + nlu_t, hds = prepare_input(tokenizer, input_sequence, input_schema, max_seq_length) + elif bert_input_version == "v2": + nlu_t, hds, max_seq_length = prepare_input_v2(tokenizer, input_sequence, input_schema) + + wemb_n, wemb_h, l_n, l_hpu, l_hs, nlu_tt, t_to_tt_idx, tt_to_t_idx, t_to_tt_idx_hds = get_wemb_bert( + bert_config, model_bert, tokenizer, nlu_t, hds, max_seq_length, num_out_layers_n, num_out_layers_h + ) + + t_to_tt_idx = t_to_tt_idx[0] + assert len(t_to_tt_idx) == len(input_sequence) + assert sum(len(t_to_tt_idx_hds1) for t_to_tt_idx_hds1 in t_to_tt_idx_hds) == len( + input_schema.column_names_embedder_input + ) + + assert list(wemb_h.shape)[0] == len(input_schema.column_names_embedder_input) + + utterance_states = [] + for i in range(len(t_to_tt_idx)): + start = t_to_tt_idx[i] + if i == len(t_to_tt_idx) - 1: + end = l_n[0] + else: + end = t_to_tt_idx[i + 1] + utterance_states.append(paddle.mean(wemb_n[:, start:end, :], axis=[0, 1])) + assert 
len(utterance_states) == len(input_sequence) + + schema_token_states = [] + cnt = -1 + for t_to_tt_idx_hds1 in t_to_tt_idx_hds: + for t_to_tt_idx_hds11 in t_to_tt_idx_hds1: + cnt += 1 + schema_token_states1 = [] + for i in range(len(t_to_tt_idx_hds11)): + start = t_to_tt_idx_hds11[i] + if i == len(t_to_tt_idx_hds11) - 1: + end = l_hpu[cnt] + else: + end = t_to_tt_idx_hds11[i + 1] + schema_token_states1.append(paddle.mean(wemb_h[cnt, start:end, :], axis=0)) + assert len(schema_token_states1) == len(input_schema.column_names_embedder_input[cnt].split()) + schema_token_states.append(schema_token_states1) + + assert len(schema_token_states) == len(input_schema.column_names_embedder_input) + + return utterance_states, schema_token_states diff --git a/examples/text_to_sql/IGSQL/model/decoder.py b/examples/text_to_sql/IGSQL/model/decoder.py new file mode 100644 index 0000000000000000000000000000000000000000..f0a7bf9c1c46bed0955dd99f1927eb22cd289704 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/decoder.py @@ -0,0 +1,277 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Decoder for the SQL generation problem.""" + +from collections import namedtuple + +import data_util.snippets as snippet_handler +import numpy as np +import paddle +import paddle.nn.functional as F +from data_util.vocabulary import EOS_TOK, UNK_TOK + +from . import embedder +from .token_predictor import PredictionInputWithSchema + +np.random.seed(0) + + +def flatten_distribution(distribution_map, probabilities): + """Flattens a probability distribution given a map of "unique" values. + All values in distribution_map with the same value should get the sum + of the probabilities. + + Arguments: + distribution_map (`list`): List of values to get the probability for. + probabilities (`np.ndarray`): Probabilities corresponding to the values in + distribution_map. + + Returns: + `list`: `np.ndarray` of the same size where probabilities for duplicates + in distribution_map are given the sum of the probabilities in probabilities. 
+ """ + assert len(distribution_map) == len(probabilities) + if len(distribution_map) != len(set(distribution_map)): + idx_first_dup = 0 + seen_set = set() + for i, tok in enumerate(distribution_map): + if tok in seen_set: + idx_first_dup = i + break + seen_set.add(tok) + new_dist_map = distribution_map[:idx_first_dup] + list( + set(distribution_map) - set(distribution_map[:idx_first_dup]) + ) + assert len(new_dist_map) == len(set(new_dist_map)) + new_probs = np.array( + probabilities[:idx_first_dup] + [0.0 for _ in range(len(set(distribution_map)) - idx_first_dup)] + ) + assert len(new_probs) == len(new_dist_map) + + for i, token_name in enumerate(distribution_map[idx_first_dup:]): + if token_name not in new_dist_map: + new_dist_map.append(token_name) + + new_index = new_dist_map.index(token_name) + new_probs[new_index] += probabilities[i + idx_first_dup] + new_probs = new_probs.tolist() + else: + new_dist_map = distribution_map + new_probs = probabilities + + assert len(new_dist_map) == len(new_probs) + + return new_dist_map, new_probs + + +class SQLPrediction(namedtuple("SQLPrediction", ("predictions", "sequence", "probability"))): + """Contains prediction for a sequence.""" + + __slots__ = () + + def __str__(self): + return str(self.probability) + "\t" + " ".join(self.sequence) + + +class SequencePredictorWithSchema(paddle.nn.Layer): + """Predicts a sequence. + + Attributes: + lstms (list of dy.RNNBuilder): The RNN used. + token_predictor (TokenPredictor): Used to actually predict tokens. + """ + + def __init__(self, params, input_size, output_embedder, column_name_token_embedder, token_predictor): + super().__init__() + + self.lstmCell = paddle.nn.LSTMCell(input_size, params.decoder_state_size) + + self.token_predictor = token_predictor + self.output_embedder = output_embedder + self.column_name_token_embedder = column_name_token_embedder + + start_token_embedding = self.create_parameter( + [params.output_embedding_size], + dtype="float32", + default_initializer=paddle.nn.initializer.Uniform(low=-0.1, high=0.1), + ) + self.add_parameter("start_token_embedding", start_token_embedding) + + self.input_size = input_size + self.params = params + + def _initialize_decoder_lstm(self, encoder_state): + decoder_lstm_states = [] + + # check which one is h_0, which is c_0 + c_0 = encoder_state[1].reshape([1, -1]) + h_0 = encoder_state[0].reshape([1, -1]) + + decoder_lstm_states.append((h_0, c_0)) + return decoder_lstm_states + + def get_output_token_embedding(self, output_token, input_schema, snippets): + if self.params.use_snippets and snippet_handler.is_snippet(output_token): + output_token_embedding = embedder.bow_snippets(output_token, snippets, self.output_embedder, input_schema) + else: + if input_schema: + assert self.output_embedder.in_vocabulary(output_token) or input_schema.in_vocabulary( + output_token, surface_form=True + ) + # 经过 + if self.output_embedder.in_vocabulary(output_token): + output_token_embedding = self.output_embedder(output_token) + else: + output_token_embedding = input_schema.column_name_embedder(output_token, surface_form=True) + else: + output_token_embedding = self.output_embedder(output_token) + return output_token_embedding + + def get_decoder_input(self, output_token_embedding, prediction): + if self.params.use_schema_attention and self.params.use_query_attention: + decoder_input = paddle.concat( + [ + output_token_embedding, + prediction.utterance_attention_results.vector, + prediction.schema_attention_results.vector, + 
prediction.query_attention_results.vector, + ], + axis=0, + ) + elif self.params.use_schema_attention: + decoder_input = paddle.concat( + [ + output_token_embedding, + prediction.utterance_attention_results.vector, + prediction.schema_attention_results.vector, + ], + axis=0, + ) + else: + decoder_input = paddle.concat( + [output_token_embedding, prediction.utterance_attention_results.vector], axis=0 + ) + return decoder_input + + def forward( + self, + final_encoder_state, + encoder_states, + schema_states, + max_generation_length, + snippets=None, + gold_sequence=None, + input_sequence=None, + previous_queries=None, + previous_query_states=None, + input_schema=None, + dropout_amount=0.0, + ): + """Generates a sequence.""" + index = 0 + + context_vector_size = self.input_size - self.params.output_embedding_size + + # Decoder states: just the initialized decoder. + # Current input to decoder: phi(start_token) ; zeros the size of the + # context vector + predictions = [] + sequence = [] + probability = 1.0 + + decoder_states = self._initialize_decoder_lstm(final_encoder_state)[0] + + decoder_input = paddle.concat([self.start_token_embedding, paddle.zeros([context_vector_size])], axis=0) + continue_generating = True + while continue_generating: + if len(sequence) == 0 or sequence[-1] != EOS_TOK: + + decoder_state, decoder_states = self.lstmCell(decoder_input.unsqueeze(0), decoder_states) + decoder_state = decoder_state.squeeze() + + prediction_input = PredictionInputWithSchema( + decoder_state=decoder_state, + input_hidden_states=encoder_states, + schema_states=schema_states, + snippets=snippets, + input_sequence=input_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + ) + + prediction = self.token_predictor(prediction_input, dropout_amount=dropout_amount) + + predictions.append(prediction) + # 经过 + if gold_sequence: + output_token = gold_sequence[index] + + output_token_embedding = self.get_output_token_embedding(output_token, input_schema, snippets) + + decoder_input = self.get_decoder_input(output_token_embedding, prediction) + + sequence.append(gold_sequence[index]) + + if index >= len(gold_sequence) - 1: + continue_generating = False + else: + assert prediction.scores.dim() == 1 + probabilities = F.softmax(prediction.scores, axis=0).cpu().numpy().tolist() + + distribution_map = prediction.aligned_tokens + assert len(probabilities) == len(distribution_map) + + if self.params.use_previous_query and self.params.use_copy_switch and len(previous_queries) > 0: + assert prediction.query_scores.dim() == 1 + query_token_probabilities = ( + F.softmax(prediction.query_scores, axis=0).cpu().data.numpy().tolist() + ) + + query_token_distribution_map = prediction.query_tokens + + assert len(query_token_probabilities) == len(query_token_distribution_map) + + copy_switch = prediction.copy_switch.cpu().data.numpy() + + # Merge the two + probabilities = (np.array(probabilities) * (1 - copy_switch)).tolist() + ( + np.array(query_token_probabilities) * copy_switch + ).tolist() + distribution_map = distribution_map + query_token_distribution_map + assert len(probabilities) == len(distribution_map) + + # Get a new probabilities and distribution_map consolidating duplicates + distribution_map, probabilities = flatten_distribution(distribution_map, probabilities) + + # Modify the probability distribution so that the UNK token can never be produced + probabilities[distribution_map.index(UNK_TOK)] = 0.0 + argmax_index = 
int(np.argmax(probabilities)) + + argmax_token = distribution_map[argmax_index] + sequence.append(argmax_token) + + output_token_embedding = self.get_output_token_embedding(argmax_token, input_schema, snippets) + + decoder_input = self.get_decoder_input(output_token_embedding, prediction) + + probability *= probabilities[argmax_index] + + continue_generating = False + if index < max_generation_length and argmax_token != EOS_TOK: + continue_generating = True + + index += 1 + + return SQLPrediction(predictions, sequence, probability) diff --git a/examples/text_to_sql/IGSQL/model/embedder.py b/examples/text_to_sql/IGSQL/model/embedder.py new file mode 100644 index 0000000000000000000000000000000000000000..cfa23cd3a66985cfbde4275e053d6f77ed749b65 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/embedder.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Embedder for tokens. """ + +import data_util.snippets as snippet_handler +import data_util.vocabulary as vocabulary_handler +import paddle + + +class Embedder(paddle.nn.Layer): + """Embeds tokens.""" + + def __init__( + self, + embedding_size, + name="", + initializer=None, + vocabulary=None, + num_tokens=-1, + anonymizer=None, + freeze=False, + use_unk=True, + ): + super().__init__() + + if vocabulary: + assert num_tokens < 0, "Specified a vocabulary but also set number of tokens to " + str(num_tokens) + self.in_vocabulary = lambda token: token in vocabulary.tokens + self.vocab_token_lookup = lambda token: vocabulary.token_to_id(token) + if use_unk: + self.unknown_token_id = vocabulary.token_to_id(vocabulary_handler.UNK_TOK) + else: + self.unknown_token_id = -1 + self.vocabulary_size = len(vocabulary) + else: + + def check_vocab(index): + """Makes sure the index is in the vocabulary.""" + assert index < num_tokens, ( + "Passed token ID " + str(index) + "; expecting something less than " + str(num_tokens) + ) + return index < num_tokens + + self.in_vocabulary = check_vocab + self.vocab_token_lookup = lambda x: x + self.unknown_token_id = num_tokens # Deliberately throws an error here, + # But should crash before this + self.vocabulary_size = num_tokens + + self.anonymizer = anonymizer + + emb_name = name + "-tokens" + print( + "Creating token embedder called " + + emb_name + + " of size " + + str(self.vocabulary_size) + + " x " + + str(embedding_size) + ) + + if initializer is not None: + self.token_embedding_matrix = paddle.nn.Embedding(initializer.shape[0], initializer.shape[1]) + self.token_embedding_matrix.weight.set_value(initializer) + else: + initializer = paddle.nn.initializer.Uniform(low=-0.1, high=0.1) + self.token_embedding_matrix = paddle.nn.Embedding( + self.vocabulary_size, embedding_size, weight_attr=initializer + ) + + def forward(self, token): + assert isinstance(token, int) or not snippet_handler.is_snippet( + token + ), "embedder should only be called on flat tokens; use snippet_bow if you are trying to encode snippets" + + if 
self.in_vocabulary(token): + index_list = paddle.to_tensor(self.vocab_token_lookup(token), "int64") + return self.token_embedding_matrix(index_list).squeeze() + else: + index_list = paddle.to_tensor(self.unknown_token_id, "int64") + return self.token_embedding_matrix(index_list).squeeze() + + +def bow_snippets(token, snippets, output_embedder, input_schema): + """Bag of words embedding for snippets""" + assert snippet_handler.is_snippet(token) and snippets + + snippet_sequence = [] + for snippet in snippets: + if snippet.name == token: + snippet_sequence = snippet.sequence + break + assert snippet_sequence + + if input_schema: + snippet_embeddings = [] + for output_token in snippet_sequence: + assert output_embedder.in_vocabulary(output_token) or input_schema.in_vocabulary( + output_token, surface_form=True + ) + if output_embedder.in_vocabulary(output_token): + snippet_embeddings.append(output_embedder(output_token)) + else: + snippet_embeddings.append(input_schema.column_name_embedder(output_token, surface_form=True)) + else: + snippet_embeddings = [output_embedder(subtoken) for subtoken in snippet_sequence] + + snippet_embeddings = paddle.stack(snippet_embeddings, axis=0) # len(snippet_sequence) x emb_size + return paddle.mean(snippet_embeddings, axis=0) # emb_size diff --git a/examples/text_to_sql/IGSQL/model/model.py b/examples/text_to_sql/IGSQL/model/model.py new file mode 100644 index 0000000000000000000000000000000000000000..0f52ab8066c17d840758ba6a8a5164ccd391c7f9 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/model.py @@ -0,0 +1,376 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Class for the Sequence to sequence model for ATIS.""" + +import numpy as np +import paddle +from data_util.vocabulary import DEL_TOK, UNK_TOK + +from . import bert_utils +from .embedder import Embedder + + +def get_token_indices(token, index_to_token): + """Maps from a gold token (string) to a list of indices. + + Args: + token (`string`): String to look up. + index_to_token (`list`): Ordered list of tokens. + + Returns: + `list`: Representing the indices of the token in the probability + distribution. + """ + if token in index_to_token: + if len(set(index_to_token)) == len(index_to_token): # no duplicates + return [index_to_token.index(token)] + else: + indices = [] + for index, other_token in enumerate(index_to_token): + if token == other_token: + indices.append(index) + assert len(indices) == len(set(indices)) + return indices + else: + return [index_to_token.index(UNK_TOK)] + + +def flatten_utterances(utterances): + """Gets a flat sequence from a sequence of utterances. + + Args: + utterances (`list`): Utterances to concatenate. + + Returns: + `list`: Representing the flattened sequence with separating + delimiter tokens. 
+ """ + sequence = [] + for i, utterance in enumerate(utterances): + sequence.extend(utterance) + if i < len(utterances) - 1: + sequence.append(DEL_TOK) + + return sequence + + +def encode_snippets_with_states(snippets, states): + """Encodes snippets by using previous query states instead. + + Args: + snippets (`list`): Input snippets. + states (`list`): Previous hidden states to use. + """ + for snippet in snippets: + snippet.set_embedding(paddle.concat([states[snippet.startpos], states[snippet.endpos]], axis=0)) + return snippets + + +def load_word_embeddings(input_vocabulary, output_vocabulary, output_vocabulary_schema, params): + print(output_vocabulary.inorder_tokens) + print() + + if params.reload_embedding == 1: + input_vocabulary_embeddings = np.load(params.data_directory + "/input_embeddings.npy") + output_vocabulary_embeddings = np.load(params.data_directory + "/output_embeddings.npy") + output_vocabulary_schema_embeddings = np.load(params.data_directory + "/output_schema_embeddings.npy") + input_embedding_size = 300 + return ( + input_vocabulary_embeddings, + output_vocabulary_embeddings, + output_vocabulary_schema_embeddings, + input_embedding_size, + ) + + def read_glove_embedding(embedding_filename, embedding_size): + glove_embeddings = {} + + with open(embedding_filename) as f: + cnt = 1 + for line in f: + cnt += 1 + if params.debug or not params.train: + if cnt == 1000: + print("Read 1000 word embeddings") + break + l_split = line.split() + word = " ".join(l_split[0 : len(l_split) - embedding_size]) + embedding = np.array([float(val) for val in l_split[len(l_split) - embedding_size :]]) + glove_embeddings[word] = embedding + + return glove_embeddings + + print("Loading Glove Embedding from", params.embedding_filename) + glove_embedding_size = 300 + glove_embeddings = read_glove_embedding(params.embedding_filename, glove_embedding_size) + print("Done") + + input_embedding_size = glove_embedding_size + + def create_word_embeddings(vocab): + + vocabulary_embeddings = np.zeros((len(vocab), glove_embedding_size), dtype=np.float32) + vocabulary_tokens = vocab.inorder_tokens + + glove_oov = 0 + para_oov = 0 + for token in vocabulary_tokens: + token_id = vocab.token_to_id(token) + if token in glove_embeddings: + vocabulary_embeddings[token_id][:glove_embedding_size] = glove_embeddings[token] + else: + glove_oov += 1 + + print("Glove OOV:", glove_oov, "Para OOV", para_oov, "Total", len(vocab)) + + return vocabulary_embeddings + + input_vocabulary_embeddings = create_word_embeddings(input_vocabulary) + output_vocabulary_embeddings = create_word_embeddings(output_vocabulary) + output_vocabulary_schema_embeddings = None + if output_vocabulary_schema: + output_vocabulary_schema_embeddings = create_word_embeddings(output_vocabulary_schema) + + np.save(params.data_directory + "/input_embeddings", input_vocabulary_embeddings) + np.save(params.data_directory + "/output_embeddings", output_vocabulary_embeddings) + np.save(params.data_directory + "/output_schema_embeddings", output_vocabulary_schema_embeddings) + + return ( + input_vocabulary_embeddings, + output_vocabulary_embeddings, + output_vocabulary_schema_embeddings, + input_embedding_size, + ) + + +class ATISModel(paddle.nn.Layer): + """Sequence-to-sequence model for predicting a SQL query given an utterance + and an interaction prefix. 
+ """ + + def __init__(self, params, input_vocabulary, output_vocabulary, output_vocabulary_schema, anonymizer): + super().__init__() + + self.params = params + + self.dropout = 0.0 + + if params.use_bert: + self.model_bert, self.tokenizer, self.bert_config = bert_utils.get_bert(params) + + if "atis" not in params.data_directory: + if params.use_bert: + ( + input_vocabulary_embeddings, + output_vocabulary_embeddings, + output_vocabulary_schema_embeddings, + input_embedding_size, + ) = load_word_embeddings(input_vocabulary, output_vocabulary, output_vocabulary_schema, params) + + # Create the output embeddings + self.output_embedder = Embedder( + params.output_embedding_size, + name="output-embedding", + initializer=output_vocabulary_embeddings, + vocabulary=output_vocabulary, + anonymizer=anonymizer, + freeze=False, + ) + self.column_name_token_embedder = None + + # Create the encoder + encoder_input_size = params.input_embedding_size + encoder_output_size = params.encoder_state_size + if params.use_bert: + encoder_input_size = self.bert_config["hidden_size"] + + if params.discourse_level_lstm: + encoder_input_size += params.encoder_state_size // 2 + + self.utterance_encoder = paddle.nn.LSTM( + encoder_input_size, encoder_output_size // 2, num_layers=params.encoder_num_layers, direction="bidirect" + ) + + # Positional embedder for utterances + attention_key_size = params.encoder_state_size + self.schema_attention_key_size = attention_key_size + if params.state_positional_embeddings: + attention_key_size += params.positional_embedding_size + self.positional_embedder = Embedder( + params.positional_embedding_size, name="positional-embedding", num_tokens=params.maximum_utterances + ) + + self.utterance_attention_key_size = attention_key_size + + # Create the discourse-level LSTM parameters + if params.discourse_level_lstm: + self.discourse_lstms = paddle.nn.LSTMCell(params.encoder_state_size, params.encoder_state_size // 2) + + initial_discourse_state = self.create_parameter( + [params.encoder_state_size // 2], + dtype="float32", + default_initializer=paddle.nn.initializer.Uniform(low=-0.1, high=0.1), + ) + self.add_parameter("initial_discourse_state", initial_discourse_state) + + # Snippet encoder + final_snippet_size = 0 + + # Previous query Encoder + if params.use_previous_query: + self.query_encoder = paddle.nn.LSTM( + params.output_embedding_size, + params.encoder_state_size // 2, + num_layers=params.encoder_num_layers, + direction="bidirect", + ) + + self.final_snippet_size = final_snippet_size + + def _initialize_discourse_states(self): + discourse_state = self.initial_discourse_state + + hidden_size = self.discourse_lstms.weight_hh.shape[1] + + h_0 = paddle.zeros([1, hidden_size]) + c_0 = paddle.zeros([1, hidden_size]) + + return discourse_state, (h_0, c_0) + + def _add_positional_embeddings(self, hidden_states, utterances, group=False): + grouped_states = [] + + start_index = 0 + for utterance in utterances: + grouped_states.append(hidden_states[start_index : start_index + len(utterance)]) + start_index += len(utterance) + assert ( + len(hidden_states) + == sum([len(seq) for seq in grouped_states]) + == sum([len(utterance) for utterance in utterances]) + ) + + new_states = [] + flat_sequence = [] + + num_utterances_to_keep = min(self.params.maximum_utterances, len(utterances)) + for i, (states, utterance) in enumerate( + zip(grouped_states[-num_utterances_to_keep:], utterances[-num_utterances_to_keep:]) + ): + positional_sequence = [] + index = num_utterances_to_keep - i - 1 + + for 
state in states: + positional_sequence.append(paddle.concat([state, self.positional_embedder(index)], axis=0)) + + assert len(positional_sequence) == len(utterance), ( + "Expected utterance and state sequence length to be the same, " + + "but they were " + + str(len(utterance)) + + " and " + + str(len(positional_sequence)) + ) + + if group: + new_states.append(positional_sequence) + else: + new_states.extend(positional_sequence) + flat_sequence.extend(utterance) + + return new_states, flat_sequence + + def build_optim(self): + params_trainer = [] + params_bert_trainer = [] + for name, param in self.named_parameters(): + if not param.stop_gradient: + if self.params.all_in_one_trainer: + param.name = name + params_trainer.append(param) + else: + if "model_bert" in name: + params_bert_trainer.append(param) + else: + params_trainer.append(param) + clip = paddle.nn.ClipGradByNorm(clip_norm=self.params.clip) + + if self.params.scheduler: + self.scheduler = paddle.optimizer.lr.ReduceOnPlateau( + learning_rate=self.params.initial_learning_rate, + mode="min", + ) + self.trainer = paddle.optimizer.Adam( + parameters=params_trainer, learning_rate=self.scheduler, grad_clip=clip + ) + else: + self.trainer = paddle.optimizer.Adam(parameters=params_trainer, learning_rate=1.0, grad_clip=clip) + if self.params.fine_tune_bert: + if self.params.scheduler: + self.scheduler = paddle.optimizer.lr.ReduceOnPlateau( + learning_rate=self.params.initial_learning_rate, + mode="min", + ) + self.bert_trainer = paddle.optimizer.Adam( + parameters=params_bert_trainer, learning_rate=self.scheduler, grad_clip=clip + ) + else: + self.bert_trainer = paddle.optimizer.Adam( + parameters=params_bert_trainer, learning_rate=1.0, grad_clip=clip + ) + + def set_dropout(self, value): + """Sets the dropout to a specified value. + + Args: + value (`float`): Value to set dropout to. + """ + self.dropout = value + + def set_learning_rate(self, value): + """Sets the learning rate for the trainer. + + Args: + value (`float`): The new learning rate. + """ + # return + for param_group in self.trainer._parameter_list: + if self.params.all_in_one_trainer: + if "model_bert" in param_group.name: + param_group.optimize_attr["learning_rate"] = value * 0.01 + else: + param_group.optimize_attr["learning_rate"] = value + else: + param_group.optimize_attr["learning_rate"] = value + + if self.params.use_bert: + if not self.params.all_in_one_trainer: + for param_group in self.bert_trainer._parameter_list: + param_group.optimize_attr["learning_rate"] = value * 0.01 + + def save(self, filename): + """Saves the model to the specified filename. + + Args: + filename (`str`): The filename to save to. + """ + paddle.save(self.state_dict(), filename) + + def load(self, filename): + """Loads saved parameters into the parameter collection. + + Args: + filename (`str`): Name of file containing parameters. + """ + self.load_dict(paddle.load(filename)) + print("Loaded model from file " + filename) diff --git a/examples/text_to_sql/IGSQL/model/model_utils.py b/examples/text_to_sql/IGSQL/model/model_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..68fb9969673a8bfde5bbf70b65b9a30a6057b2d4 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/model_utils.py @@ -0,0 +1,196 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains various utility functions for Dynet models.""" + +import numpy as np +import paddle +import paddle.nn.functional as F + + +def compute_loss(gold_seq, scores, index_to_token_maps, gold_tok_to_id, noise=0.00000001): + """Computes the loss of a gold sequence given scores. + + Args: + gold_seq (`list`): A sequence of gold tokens. + scores (`list`): Expressions representing the scores of + potential output tokens for each token in gold_seq. + index_to_token_maps (`list`): Maps from index in the + sequence to a dictionary mapping from a string to a set of integers. + gold_tok_to_id (`func`): Maps from the gold token + and some lookup function to the indices in the probability distribution + where the gold token occurs. + noise (`float`, optional): The amount of noise to add to the loss. + + Returns: + `Tensor`: representing the sum of losses over the sequence. + """ + assert len(gold_seq) == len(scores) == len(index_to_token_maps) + + losses = [] + for i, gold_tok in enumerate(gold_seq): + score = scores[i] + token_map = index_to_token_maps[i] + + gold_indices = gold_tok_to_id(gold_tok, token_map) + + assert len(gold_indices) > 0 + noise_i = noise + """ + if len(gold_indices) == 1: + noise_i = 0 + """ + + probdist = score + + prob_of_tok = paddle.sum(paddle.index_select(probdist, paddle.to_tensor(gold_indices))) + + if prob_of_tok < noise_i: + prob_of_tok = prob_of_tok + noise_i + elif prob_of_tok > 1 - noise_i: + prob_of_tok = prob_of_tok - noise_i + losses.append(-paddle.log(prob_of_tok)) + + return paddle.sum(paddle.stack(losses)) + + +def get_seq_from_scores(scores, index_to_token_maps): + """Gets the argmax sequence from a set of scores. + + Args: + scores (`list`): Sequences of output scores. + index_to_token_maps (`list`): For each output token, maps + the index in the probability distribution to a string. + + Returns: + `list`: Representing the argmax sequence. + """ + seq = [] + for score, tok_map in zip(scores, index_to_token_maps): + # score_numpy_list = score.cpu().detach().numpy() + score_numpy_list = score.cpu().numpy() + assert score.shape[0] == len(tok_map) == len(list(score_numpy_list)) + seq.append(tok_map[np.argmax(score_numpy_list)]) + return seq + + +def per_token_accuracy(gold_seq, pred_seq): + """Returns the per-token accuracy comparing two strings (recall). + + Args: + gold_seq (`list`): A list of gold tokens. + pred_seq (`list`): A list of predicted tokens. + + Returns: + `float`: Representing the accuracy. + """ + num_correct = 0 + for i, gold_token in enumerate(gold_seq): + if i < len(pred_seq) and pred_seq[i] == gold_token: + num_correct += 1 + + return float(num_correct) / len(gold_seq) + + +def forward_one_multilayer(rnns, lstm_input, layer_states, dropout_amount=0.0): + """Goes forward for one multilayer RNN cell step. + + Args: + lstm_input (`Tensor`): Some input to the step. + layer_states (`list`): The states of each layer in the cell. + dropout_amount (`float`, optional): The amount of dropout to apply, in + between the layers. 
+ + Returns: + (`list` , `list`), `Tensor`, (`list`): Representing (each layer's cell memory, + each layer's cell hidden state), the final hidden state, and (each layer's updated RNNState). + """ + num_layers = len(layer_states) + new_states = [] + cell_states = [] + hidden_states = [] + state = lstm_input + for i in range(num_layers): + layer_h, new_state = rnns[i](paddle.unsqueeze(state, 0), layer_states[i]) + new_states.append(new_state) + + layer_h = layer_h.squeeze() + layer_c = new_state[1].squeeze() + + state = layer_h + if i < num_layers - 1: + # p stands for probability of an element to be zeroed. i.e. p=1 means switch off all activations. + state = F.dropout(state, p=dropout_amount) + + cell_states.append(layer_c) + hidden_states.append(layer_h) + + return (cell_states, hidden_states), state, new_states + + +def encode_sequence(sequence, rnns, embedder, dropout_amount=0.0): + """Encodes a sequence given RNN cells and an embedding function. + + Args: + seq (`list`): The sequence to encode. + rnns (`list`): The RNNs to use. + emb_fn (`func`): Function that embeds strings to + word vectors. + size (`int`): The size of the RNN. + dropout_amount (`float`, optional): The amount of dropout to apply. + + Returns: + (`list`, `list`), `list`: The first pair is the (final cell memories, final cell states) + of all layers, and the second list is a list of the final layer's cell + state for all tokens in the sequence. + """ + + batch_size = 1 + layer_states = [] + for rnn in rnns: + hidden_size = rnn.weight_hh.shape[1] + + h_0 = paddle.zeros([batch_size, hidden_size]) + c_0 = paddle.zeros([batch_size, hidden_size]) + + layer_states.append((h_0, c_0)) + + outputs = [] + for token in sequence: + rnn_input = embedder(token) + (cell_states, hidden_states), output, layer_states = forward_one_multilayer( + rnns, rnn_input, layer_states, dropout_amount + ) + outputs.append(output) + + return (cell_states, hidden_states), outputs + + +def mask_fill(input, mask, value): + return input * paddle.cast(paddle.logical_not(mask), input.dtype) + paddle.cast(mask, input.dtype) * value + + +def LSTM_output_transfer(utterance_states, final_utterance_state): + + if len(utterance_states) != 0: + utterance_states = utterance_states.squeeze(0) + utterance_states = paddle.split(utterance_states, utterance_states.shape[0]) + for idx in range(len(utterance_states)): + utterance_states[idx] = utterance_states[idx].squeeze(0) + + if len(final_utterance_state) != 0: + (hidden_state, cell_memory) = final_utterance_state + hidden_states = paddle.concat([hidden_state[0], hidden_state[1]], axis=-1).squeeze(0) + cell_memories = paddle.concat([cell_memory[0], cell_memory[1]], axis=-1).squeeze(0) + final_utterance_state = (hidden_states.squeeze(0), cell_memories.squeeze(0)) + return utterance_states, final_utterance_state diff --git a/examples/text_to_sql/IGSQL/model/schema_interaction_model.py b/examples/text_to_sql/IGSQL/model/schema_interaction_model.py new file mode 100644 index 0000000000000000000000000000000000000000..2a7a94e4be161c7a5442d74a0eb6e4554b3da3e5 --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/schema_interaction_model.py @@ -0,0 +1,936 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Class for the Sequence to sequence model for ATIS.""" + +import data_util.snippets as snippet_handler +import data_util.vocabulary as vocab +import numpy as np +import paddle +import paddle.nn.functional as F +from data_util import sql_util +from data_util.vocabulary import EOS_TOK + +from . import bert_utils, model_utils +from .attention import Attention +from .decoder import SequencePredictorWithSchema +from .model import ATISModel, get_token_indices +from .token_predictor import construct_token_predictor + +np.random.seed(0) + + +class GraphNN(paddle.nn.Layer): + def __init__(self, params): + super(GraphNN, self).__init__() + self.params = params + + weight_attr_final_fc = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) + bias_attr_final_fc = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(value=0.0)) + + weight_attr_fc = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) + bias_attr_fc = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(value=0.0)) + + weight_attr_qfc = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) + bias_attr_qfc = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(value=0.0)) + + self.final_fc = paddle.nn.Linear( + params.encoder_state_size, + params.encoder_state_size, + weight_attr=weight_attr_final_fc, + bias_attr=bias_attr_final_fc, + ) + self.fc = paddle.nn.Linear( + params.encoder_state_size, params.encoder_state_size, weight_attr=weight_attr_fc, bias_attr=bias_attr_fc + ) + self.qfc = paddle.nn.Linear( + params.encoder_state_size, params.encoder_state_size, weight_attr=weight_attr_qfc, bias_attr=bias_attr_qfc + ) + self.dropout = paddle.nn.Dropout(0.1) + self.leakyReLU = paddle.nn.LeakyReLU(0.2) + self.elu = paddle.nn.ELU() + self.relu = paddle.nn.ReLU() + + def forward(self, x, adj_matrix, previous_x=None): + # x: [len_tokens, d] + # adj_matrix: [len_tokens, len_tokens] + if previous_x is not None: + x_new = self.leakyReLU(self.fc(paddle.concat([previous_x, x], axis=0))).unsqueeze(0) + else: + x_new = self.leakyReLU(self.fc(x).unsqueeze(0)) # [1, len_tokens, d] + q = self.leakyReLU(self.qfc(x_new)) + x_ = paddle.concat(paddle.split(x_new, 3, axis=2), axis=0) + q_ = paddle.concat(paddle.split(q, 3, axis=2), axis=0) + outputs = paddle.matmul(q_, x_.transpose([0, 2, 1])) / 10.0 # [3, len_tokens, len_tokens] + tmp_adj_matrix = (adj_matrix == 0).expand(shape=[3, adj_matrix.shape[0], adj_matrix.shape[1]]) + outputs = model_utils.mask_fill(input=outputs, mask=tmp_adj_matrix, value=-1e9) + outputs = self.dropout(F.softmax(outputs, axis=-1)) + outputs = paddle.matmul(outputs, x_) + outputs = paddle.concat(paddle.split(outputs, 3, axis=0), axis=2) + if previous_x is not None: + outputs = paddle.split(outputs, 2, axis=1)[1] + outputs = x.unsqueeze(0) + outputs + ret = x + self.dropout(self.leakyReLU(self.final_fc(outputs).squeeze(0))) + return ret + + +LIMITED_INTERACTIONS = { + "raw/atis2/12-1.1/ATIS2/TEXT/TRAIN/SRI/QS0/1": 22, + "raw/atis3/17-1.1/ATIS3/SP_TRN/MIT/8K7/5": 14, + "raw/atis2/12-1.1/ATIS2/TEXT/TEST/NOV92/770/5": -1, +} + +END_OF_INTERACTION = {"quit", 
"exit", "done"} + + +class SchemaInteractionATISModel(ATISModel): + """Interaction ATIS model, where an interaction is processed all at once.""" + + def __init__(self, params, input_vocabulary, output_vocabulary, output_vocabulary_schema, anonymizer): + ATISModel.__init__(self, params, input_vocabulary, output_vocabulary, output_vocabulary_schema, anonymizer) + + if self.params.use_schema_encoder: + # Create the schema encoder + schema_encoder_input_size = params.input_embedding_size + schema_encoder_state_size = params.encoder_state_size + if params.use_bert: + schema_encoder_input_size = self.bert_config["hidden_size"] + + self.schema_encoder = paddle.nn.LSTM( + schema_encoder_input_size, + schema_encoder_state_size // 2, + num_layers=params.encoder_num_layers, + direction="bidirect", + ) + + # self-attention + if self.params.use_schema_self_attention: + self.schema2schema_attention_module = Attention( + self.schema_attention_key_size, self.schema_attention_key_size, self.schema_attention_key_size + ) + + # utterance level attention + if self.params.use_utterance_attention: + self.utterance_attention_module = Attention( + self.params.encoder_state_size, self.params.encoder_state_size, self.params.encoder_state_size + ) + + # Use attention module between input_hidden_states and schema_states + # schema_states: self.schema_attention_key_size x len(schema) + # input_hidden_states: self.utterance_attention_key_size x len(input) + if params.use_encoder_attention: + self.utterance2schema_attention_module = Attention( + self.schema_attention_key_size, self.utterance_attention_key_size, self.utterance_attention_key_size + ) + self.schema2utterance_attention_module = Attention( + self.utterance_attention_key_size, self.schema_attention_key_size, self.schema_attention_key_size + ) + + new_attention_key_size = self.schema_attention_key_size + self.utterance_attention_key_size + self.schema_attention_key_size = new_attention_key_size + self.utterance_attention_key_size = new_attention_key_size + + self.token_predictor = construct_token_predictor( + params, + output_vocabulary, + self.utterance_attention_key_size, + self.schema_attention_key_size, + self.final_snippet_size, + anonymizer, + ) + + # Use schema_attention in decoder + if params.use_schema_attention and params.use_query_attention: + decoder_input_size = ( + params.output_embedding_size + + self.utterance_attention_key_size + + self.schema_attention_key_size + + params.encoder_state_size + ) + elif params.use_schema_attention: + decoder_input_size = ( + params.output_embedding_size + self.utterance_attention_key_size + self.schema_attention_key_size + ) + else: + decoder_input_size = params.output_embedding_size + self.utterance_attention_key_size + + self.decoder = SequencePredictorWithSchema( + params, decoder_input_size, self.output_embedder, self.column_name_token_embedder, self.token_predictor + ) + + if params.gnn_layer_number: + self.gnn_history = paddle.nn.LayerList([GraphNN(params) for _ in range(2 * params.gnn_layer_number)]) + self.gnn = paddle.nn.LayerList([GraphNN(params) for _ in range(params.gnn_layer_number)]) + + def predict_turn( + self, + utterance_final_state, + input_hidden_states, + schema_states, + max_generation_length, + gold_query=None, + snippets=None, + input_sequence=None, + previous_queries=None, + previous_query_states=None, + input_schema=None, + feed_gold_tokens=False, + training=False, + ): + """Gets a prediction for a single turn -- calls decoder and updates loss, etc. 
+ + TODO: this can probably be split into two methods, one that just predicts + and another that computes the loss. + """ + predicted_sequence = [] + fed_sequence = [] + loss = None + token_accuracy = 0.0 + if self.params.use_encoder_attention: + schema_attention = self.utterance2schema_attention_module( + paddle.stack(schema_states, axis=0), input_hidden_states + ).vector # input_value_size x len(schema) + utterance_attention = self.schema2utterance_attention_module( + paddle.stack(input_hidden_states, axis=0), schema_states + ).vector # schema_value_size x len(input) + + if schema_attention.dim() == 1: + schema_attention = schema_attention.unsqueeze(1) + if utterance_attention.dim() == 1: + utterance_attention = utterance_attention.unsqueeze(1) + + new_schema_states = paddle.concat( + [paddle.stack(schema_states, axis=1), schema_attention], axis=0 + ) # (input_value_size+schema_value_size) x len(schema) + schema_states = list(paddle.split(new_schema_states, num_or_sections=new_schema_states.shape[1], axis=1)) + schema_states = [schema_state.squeeze() for schema_state in schema_states] + + new_input_hidden_states = paddle.concat( + [paddle.stack(input_hidden_states, axis=1), utterance_attention], axis=0 + ) # (input_value_size+schema_value_size) x len(input) + input_hidden_states = list( + paddle.split(new_input_hidden_states, num_or_sections=new_input_hidden_states.shape[1], axis=1) + ) + input_hidden_states = [input_hidden_state.squeeze() for input_hidden_state in input_hidden_states] + + if feed_gold_tokens: + decoder_results = self.decoder( + utterance_final_state, + input_hidden_states, + schema_states, + max_generation_length, + gold_sequence=gold_query, + input_sequence=input_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + snippets=snippets, + dropout_amount=self.dropout, + ) + + all_scores = [] + all_alignments = [] + for prediction in decoder_results.predictions: + scores = F.softmax(prediction.scores, axis=0) + alignments = prediction.aligned_tokens + if self.params.use_previous_query and self.params.use_copy_switch and len(previous_queries) > 0: + query_scores = F.softmax(prediction.query_scores, axis=0) + copy_switch = prediction.copy_switch + scores = paddle.concat([scores * (1 - copy_switch), query_scores * copy_switch], axis=0) + alignments = alignments + prediction.query_tokens + + all_scores.append(scores) + all_alignments.append(alignments) + + # Compute the loss + gold_sequence = gold_query + + loss = model_utils.compute_loss(gold_sequence, all_scores, all_alignments, get_token_indices) + if not training: + predicted_sequence = model_utils.get_seq_from_scores(all_scores, all_alignments) + token_accuracy = model_utils.per_token_accuracy(gold_sequence, predicted_sequence) + fed_sequence = gold_sequence + else: + decoder_results = self.decoder( + utterance_final_state, + input_hidden_states, + schema_states, + max_generation_length, + input_sequence=input_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + snippets=snippets, + dropout_amount=self.dropout, + ) + predicted_sequence = decoder_results.sequence + fed_sequence = predicted_sequence + + decoder_states = [pred.decoder_state for pred in decoder_results.predictions] + + # fed_sequence contains EOS, which we don't need when encoding snippets. + # also ignore the first state, as it contains the BEG encoding. 
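+ # Note (added comment, describing the loop below): for each fed token we record the matching decoder
+ # state; a snippet token is expanded so that every token of the snippet's underlying sequence receives
+ # a copy of that state, keeping the collected states aligned token-for-token with the expanded query.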
+ + for token, state in zip(fed_sequence[:-1], decoder_states[1:]): + if snippet_handler.is_snippet(token): + snippet_length = 0 + for snippet in snippets: + if snippet.name == token: + snippet_length = len(snippet.sequence) + break + assert snippet_length > 0 + decoder_states.extend([state for _ in range(snippet_length)]) + else: + decoder_states.append(state) + + return (predicted_sequence, loss, token_accuracy, decoder_states, decoder_results) + + def encode_schema_bow_simple(self, input_schema): + schema_states = [] + for column_name in input_schema.column_names_embedder_input: + schema_states.append( + input_schema.column_name_embedder_bow( + column_name, surface_form=False, column_name_token_embedder=self.column_name_token_embedder + ) + ) + input_schema.set_column_name_embeddings(schema_states) + return schema_states + + def encode_schema_self_attention(self, schema_states): + schema_self_attention = self.schema2schema_attention_module( + paddle.stack(schema_states, axis=0), schema_states + ).vector + if schema_self_attention.dim() == 1: + schema_self_attention = schema_self_attention.unsqueeze(1) + residual_schema_states = list( + paddle.split(schema_self_attention, num_or_sections=schema_self_attention.shape[1], axis=1) + ) + residual_schema_states = [schema_state.squeeze() for schema_state in residual_schema_states] + + new_schema_states = [ + schema_state + residual_schema_state + for schema_state, residual_schema_state in zip(schema_states, residual_schema_states) + ] + + return new_schema_states + + def encode_schema(self, input_schema, dropout=False): + schema_states = [] + for column_name_embedder_input in input_schema.column_names_embedder_input: + tokens = column_name_embedder_input.split() + + schema_states_one, final_schema_state_one = self.schema_encoder(paddle.stack(tokens).unsqueeze(0)) + schema_states_one, final_schema_state_one = model_utils.LSTM_output_transfer( + schema_states_one, final_schema_state_one + ) + + # final_schema_state_one: 1 means hidden_states instead of cell_memories, -1 means last layer + schema_states.append(final_schema_state_one[0]) + + input_schema.set_column_name_embeddings(schema_states) + + # self-attention over schema_states + if self.params.use_schema_self_attention: + schema_states = self.encode_schema_self_attention(schema_states) + + return schema_states + + def get_bert_encoding(self, input_sequence, input_schema, discourse_state, dropout): + utterance_states, schema_token_states = bert_utils.get_bert_encoding( + self.bert_config, + self.model_bert, + self.tokenizer, + input_sequence, + input_schema, + bert_input_version=self.params.bert_input_version, + num_out_layers_n=1, + num_out_layers_h=1, + ) + + if self.params.discourse_level_lstm: + utterance_token_embedder = lambda x: paddle.concat([x, discourse_state], axis=0) + for idx in range(len(utterance_states)): + utterance_states[idx] = utterance_token_embedder(utterance_states[idx]) + + utterance_states, final_utterance_state = self.utterance_encoder(paddle.stack(utterance_states).unsqueeze(0)) + utterance_states, final_utterance_state = model_utils.LSTM_output_transfer( + utterance_states, final_utterance_state + ) + + schema_states = [] + for schema_token_states1 in schema_token_states: + schema_states_one, final_schema_state_one = self.schema_encoder( + paddle.stack(schema_token_states1).unsqueeze(0) + ) + schema_states_one, final_schema_state_one = model_utils.LSTM_output_transfer( + schema_states_one, final_schema_state_one + ) + + # final_schema_state_one: 1 means 
hidden_states instead of cell_memories, -1 means last layer + schema_states.append(sum(schema_states_one) / len(schema_states_one)) + + input_schema.set_column_name_embeddings(schema_states) + + # self-attention over schema_states + if self.params.use_schema_self_attention: + schema_states = self.encode_schema_self_attention(schema_states) + + return final_utterance_state, utterance_states, schema_states + + def get_query_token_embedding(self, output_token, input_schema): + if input_schema: + if not ( + self.output_embedder.in_vocabulary(output_token) + or input_schema.in_vocabulary(output_token, surface_form=True) + ): + output_token = "value" + if self.output_embedder.in_vocabulary(output_token): + output_token_embedding = self.output_embedder(output_token) + else: + output_token_embedding = input_schema.column_name_embedder(output_token, surface_form=True) + else: + output_token_embedding = self.output_embedder(output_token) + return output_token_embedding + + def get_utterance_attention( + self, final_utterance_states_c, final_utterance_states_h, final_utterance_state, num_utterances_to_keep + ): + # self-attention between utterance_states + final_utterance_states_h.append(final_utterance_state[0]) + final_utterance_states_c.append(final_utterance_state[1]) + final_utterance_states_c = final_utterance_states_c[-num_utterances_to_keep:] + final_utterance_states_h = final_utterance_states_h[-num_utterances_to_keep:] + + attention_result = self.utterance_attention_module(final_utterance_states_c[-1], final_utterance_states_c) + final_utterance_state_attention_c = final_utterance_states_c[-1] + attention_result.vector.squeeze() + + attention_result = self.utterance_attention_module(final_utterance_states_h[-1], final_utterance_states_h) + final_utterance_state_attention_h = final_utterance_states_h[-1] + attention_result.vector.squeeze() + + final_utterance_state = (final_utterance_state_attention_h, final_utterance_state_attention_c) + + return final_utterance_states_c, final_utterance_states_h, final_utterance_state + + def get_previous_queries(self, previous_queries, previous_query_states, previous_query, input_schema): + + query_token_embedder = lambda query_token: self.get_query_token_embedding(query_token, input_schema) + + previous_query_embedding = [] + + for output_token in previous_query: + previous_query_embedding.append(query_token_embedder(output_token)) + previous_query_embedding = paddle.stack(previous_query_embedding) + + previous_queries.append(previous_query) + num_queries_to_keep = min(self.params.maximum_queries, len(previous_queries)) + previous_queries = previous_queries[-num_queries_to_keep:] + + previous_outputs, _ = self.query_encoder(previous_query_embedding.unsqueeze(0)) + previous_outputs, _ = model_utils.LSTM_output_transfer(previous_outputs, _) + + assert len(previous_outputs) == len(previous_query) + previous_query_states.append(previous_outputs) + previous_query_states = previous_query_states[-num_queries_to_keep:] + + return previous_queries, previous_query_states + + def get_adj_matrix(self, inner, foreign_keys, num_col): + ret = np.eye(num_col) + all_keys = inner + foreign_keys + for ele in all_keys: + ret[ele[0]][ele[1]] = 1 + ret[ele[1]][ele[0]] = 1 + return ret + + def get_adj_utterance_matrix(self, inner, foreign_keys, num_col): + ret = np.eye(2 * num_col) + + all_keys = inner + foreign_keys + for i in range(num_col): + ret[i][num_col + i] = 1 + ret[num_col + i][i] = 1 + for ele in all_keys: + + # self graph connect + ret[ele[0]][ele[1]] = 1 + 
ret[ele[1]][ele[0]] = 1 + ret[num_col + ele[0]][num_col + ele[1]] = 1 + ret[num_col + ele[1]][num_col + ele[0]] = 1 + + ret[ele[0]][num_col + ele[1]] = 1 + ret[num_col + ele[1]][ele[0]] = 1 + ret[num_col + ele[0]][ele[1]] = 1 + ret[ele[1]][num_col + ele[0]] = 1 + ret = ret.dot(ret) + return ret + + def train_step( + self, interaction, max_generation_length, snippet_alignment_probability=1.0, db2id=None, id2db=None, step=None + ): + """Trains the interaction-level model on a single interaction. + + Args: + interaction (Interaction): The interaction to train on. + learning_rate (float): Learning rate to use. + snippet_keep_age (int): Age of oldest snippets to use. + snippet_alignment_probability (float): The probability that a snippet will + be used in constructing the gold sequence. + """ + # assert self.params.discourse_level_lstm + + losses = [] + total_gold_tokens = 0 + + input_hidden_states = [] + input_sequences = [] + + final_utterance_states_c = [] + final_utterance_states_h = [] + + previous_query_states = [] + previous_queries = [] + + discourse_state = None + if self.params.discourse_level_lstm: + discourse_state, discourse_lstm_states = self._initialize_discourse_states() + + # Schema and schema embeddings + input_schema = interaction.get_schema() + schema_states = [] + + if input_schema and not self.params.use_bert: + schema_states = self.encode_schema_bow_simple(input_schema) + + # Get the intra-turn graph and cross-turn graph + inner = [] + for i, ele in enumerate(interaction.interaction.schema.column_names_surface_form): + for j in range(i + 1, len(interaction.interaction.schema.column_names_surface_form)): + if ele.split(".")[0] == interaction.interaction.schema.column_names_surface_form[j].split(".")[0]: + inner.append([i, j]) + adjacent_matrix = self.get_adj_matrix(inner, input_schema.table_schema["foreign_keys"], input_schema.num_col) + adjacent_matrix_cross = self.get_adj_utterance_matrix( + inner, input_schema.table_schema["foreign_keys"], input_schema.num_col + ) + adjacent_matrix = paddle.to_tensor(adjacent_matrix) + adjacent_matrix_cross = paddle.to_tensor(adjacent_matrix_cross) + + previous_schema_states = paddle.zeros([input_schema.num_col, self.params.encoder_state_size]) + + for utterance_index, utterance in enumerate(interaction.gold_utterances()): + + if ( + interaction.identifier in LIMITED_INTERACTIONS + and utterance_index > LIMITED_INTERACTIONS[interaction.identifier] + ): + break + + input_sequence = utterance.input_sequence() + + available_snippets = utterance.snippets() + previous_query = utterance.previous_query() + + # Get the gold query: reconstruct if the alignment probability is less than one + if snippet_alignment_probability < 1.0: + gold_query = sql_util.add_snippets_to_query( + available_snippets, + utterance.contained_entities(), + utterance.anonymized_gold_query(), + prob_align=snippet_alignment_probability, + ) + [vocab.EOS_TOK] + else: + gold_query = utterance.gold_query() + + final_utterance_state, utterance_states, schema_states = self.get_bert_encoding( + input_sequence, input_schema, discourse_state, dropout=True + ) + + # temp1=final_utterance_state + + schema_states = paddle.stack(schema_states, axis=0) + for i in range(self.params.gnn_layer_number): + schema_states = self.gnn_history[2 * i](schema_states, adjacent_matrix_cross, previous_schema_states) + schema_states = self.gnn_history[2 * i + 1]( + schema_states, adjacent_matrix_cross, previous_schema_states + ) + schema_states = self.gnn[i](schema_states, adjacent_matrix) + 
previous_schema_states = schema_states + schema_states_ls = paddle.split(schema_states, schema_states.shape[0], axis=0) + schema_states = [ele.squeeze(0) for ele in schema_states_ls] + + input_hidden_states.extend(utterance_states) + input_sequences.append(input_sequence) + + num_utterances_to_keep = min(self.params.maximum_utterances, len(input_sequences)) + + if self.params.discourse_level_lstm: + discourse_state, discourse_lstm_states = self.discourse_lstms( + final_utterance_state[0].unsqueeze(0), discourse_lstm_states + ) + discourse_state = discourse_state.squeeze() + + if self.params.use_utterance_attention: + + ( + final_utterance_states_c, + final_utterance_states_h, + final_utterance_state, + ) = self.get_utterance_attention( + final_utterance_states_c, final_utterance_states_h, final_utterance_state, num_utterances_to_keep + ) + + if self.params.state_positional_embeddings: + utterance_states, flat_sequence = self._add_positional_embeddings(input_hidden_states, input_sequences) + + snippets = None + + if self.params.use_previous_query: + if len(previous_query) > 0: + previous_queries, previous_query_states = self.get_previous_queries( + previous_queries, previous_query_states, previous_query, input_schema + ) + + if len(gold_query) <= max_generation_length and len(previous_query) <= max_generation_length: + prediction = self.predict_turn( + final_utterance_state, + utterance_states, + schema_states, + max_generation_length, + gold_query=gold_query, + snippets=snippets, + input_sequence=flat_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + feed_gold_tokens=True, + training=True, + ) + loss = prediction[1] + total_gold_tokens += len(gold_query) + losses.append(loss) + else: + # Break if previous decoder snippet encoding -- because the previous + # sequence was too long to run the decoder. + if self.params.previous_decoder_snippet_encoding: + break + continue + + if losses: + average_loss = paddle.sum(paddle.stack(losses)) / total_gold_tokens + print(f"total_gold_tokens:{total_gold_tokens}, step:{step}") + print(f"LOSS:{float(average_loss.numpy())}") + if paddle.sum(paddle.cast(paddle.isinf(average_loss), "int32")) == paddle.ones([1]): + self.save("./inf_checkpoint") + + # Renormalize so the effect is normalized by the batch size. 
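+ # Note (added comment): when reweight_batch is set, the averaged per-token loss is scaled by the number
+ # of turns in this interaction and divided by the configured batch size before backpropagation.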
+ normalized_loss = average_loss + if self.params.reweight_batch: + normalized_loss = len(losses) * average_loss / float(self.params.batch_size) + + normalized_loss.backward() + + if step <= self.params.warmup_step: + self.set_learning_rate(step / self.params.warmup_step * self.params.initial_learning_rate) + step += 1 + + self.trainer.step() + if self.params.fine_tune_bert: + self.bert_trainer.step() + self.bert_trainer.clear_grad() + self.trainer.clear_grad() + + loss_scalar = float(normalized_loss.numpy()) + isNan = sum(paddle.cast(paddle.isnan(normalized_loss), "float32").numpy().tolist()) == 0 + if paddle.isnan(normalized_loss): + print("nan error but keep running") + assert isNan + + else: + loss_scalar = 0.0 + + return loss_scalar, step + + def predict_with_predicted_queries(self, interaction, max_generation_length, syntax_restrict=True): + """Predicts an interaction, using the predicted queries to get snippets.""" + + syntax_restrict = False + + predictions = [] + + input_hidden_states = [] + input_sequences = [] + + final_utterance_states_c = [] + final_utterance_states_h = [] + + previous_query_states = [] + previous_queries = [] + + discourse_state = None + if self.params.discourse_level_lstm: + discourse_state, discourse_lstm_states = self._initialize_discourse_states() + + # Schema and schema embeddings + input_schema = interaction.get_schema() + schema_states = [] + + # Get the intra-turn graph and cross-turn graph + inner = [] + for i, ele in enumerate(interaction.interaction.schema.column_names_surface_form): + for j in range(i + 1, len(interaction.interaction.schema.column_names_surface_form)): + if ele.split(".")[0] == interaction.interaction.schema.column_names_surface_form[j].split(".")[0]: + inner.append([i, j]) + adjacent_matrix = self.get_adj_matrix(inner, input_schema.table_schema["foreign_keys"], input_schema.num_col) + adjacent_matrix_cross = self.get_adj_utterance_matrix( + inner, input_schema.table_schema["foreign_keys"], input_schema.num_col + ) + + adjacent_matrix = paddle.to_tensor(adjacent_matrix) + adjacent_matrix_cross = paddle.to_tensor(adjacent_matrix_cross) + previous_schema_states = paddle.zeros([input_schema.num_col, self.params.encoder_state_size]) + + if input_schema and not self.params.use_bert: + schema_states = self.encode_schema_bow_simple(input_schema) + + interaction.start_interaction() + + while not interaction.done(): + utterance = interaction.next_utterance() + + available_snippets = utterance.snippets() + previous_query = utterance.previous_query() + + input_sequence = utterance.input_sequence() + + if not self.params.use_bert: + if self.params.discourse_level_lstm: + utterance_token_embedder = lambda token: paddle.concat( + [self.input_embedder(token), discourse_state], axis=0 + ) + else: + utterance_token_embedder = self.input_embedder + final_utterance_state, utterance_states = self.utterance_encoder( + input_sequence, utterance_token_embedder + ) + else: + final_utterance_state, utterance_states, schema_states = self.get_bert_encoding( + input_sequence, input_schema, discourse_state, dropout=False + ) + + schema_states = paddle.stack(schema_states, axis=0) + for i in range(self.params.gnn_layer_number): + schema_states = self.gnn_history[2 * i](schema_states, adjacent_matrix_cross, previous_schema_states) + schema_states = self.gnn_history[2 * i + 1]( + schema_states, adjacent_matrix_cross, previous_schema_states + ) + schema_states = self.gnn[i](schema_states, adjacent_matrix) + previous_schema_states = schema_states + + 
schema_states_ls = paddle.split(schema_states, schema_states.shape[0], axis=0) + schema_states = [ele.squeeze(0) for ele in schema_states_ls] + + input_hidden_states.extend(utterance_states) + input_sequences.append(input_sequence) + + num_utterances_to_keep = min(self.params.maximum_utterances, len(input_sequences)) + + if self.params.discourse_level_lstm: + + discourse_state, discourse_lstm_states = self.discourse_lstms( + final_utterance_state[0].unsqueeze(0), discourse_lstm_states + ) + + if self.params.use_utterance_attention: + ( + final_utterance_states_c, + final_utterance_states_h, + final_utterance_state, + ) = self.get_utterance_attention( + final_utterance_states_c, final_utterance_states_h, final_utterance_state, num_utterances_to_keep + ) + + if self.params.state_positional_embeddings: + utterance_states, flat_sequence = self._add_positional_embeddings(input_hidden_states, input_sequences) + else: + flat_sequence = [] + for utt in input_sequences[-num_utterances_to_keep:]: + flat_sequence.extend(utt) + + snippets = None + if self.params.use_snippets: + snippets = self._encode_snippets(previous_query, available_snippets, input_schema) + + if self.params.use_previous_query and len(previous_query) > 0: + previous_queries, previous_query_states = self.get_previous_queries( + previous_queries, previous_query_states, previous_query, input_schema + ) + + results = self.predict_turn( + final_utterance_state, + utterance_states, + schema_states, + max_generation_length, + input_sequence=flat_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + snippets=snippets, + ) + + predicted_sequence = results[0] + predictions.append(results) + + # Update things necessary for using predicted queries + anonymized_sequence = utterance.remove_snippets(predicted_sequence) + if EOS_TOK in anonymized_sequence: + anonymized_sequence = anonymized_sequence[:-1] # Remove _EOS + else: + anonymized_sequence = ["select", "*", "from", "t1"] + + if not syntax_restrict: + utterance.set_predicted_query(interaction.remove_snippets(predicted_sequence)) + if input_schema: + # on SParC + interaction.add_utterance( + utterance, anonymized_sequence, previous_snippets=utterance.snippets(), simple=True + ) + else: + # on ATIS + interaction.add_utterance( + utterance, anonymized_sequence, previous_snippets=utterance.snippets(), simple=False + ) + else: + utterance.set_predicted_query(utterance.previous_query()) + interaction.add_utterance( + utterance, utterance.previous_query(), previous_snippets=utterance.snippets() + ) + + return predictions + + def predict_with_gold_queries(self, interaction, max_generation_length, feed_gold_query=False): + """Predicts SQL queries for an interaction. + + Args: + interaction (Interaction): Interaction to predict for. + feed_gold_query (bool): Whether or not to feed the gold token to the + generation step. 
+ """ + + predictions = [] + + input_hidden_states = [] + input_sequences = [] + + final_utterance_states_c = [] + final_utterance_states_h = [] + + previous_query_states = [] + previous_queries = [] + + discourse_state = None + if self.params.discourse_level_lstm: + discourse_state, discourse_lstm_states = self._initialize_discourse_states() + + # Schema and schema embeddings + input_schema = interaction.get_schema() + schema_states = [] + if input_schema and not self.params.use_bert: + schema_states = self.encode_schema_bow_simple(input_schema) + + # Get the intra-turn graph and cross-turn graph + inner = [] + for i, ele in enumerate(interaction.interaction.schema.column_names_surface_form): + for j in range(i + 1, len(interaction.interaction.schema.column_names_surface_form)): + if ele.split(".")[0] == interaction.interaction.schema.column_names_surface_form[j].split(".")[0]: + inner.append([i, j]) + adjacent_matrix = self.get_adj_matrix(inner, input_schema.table_schema["foreign_keys"], input_schema.num_col) + adjacent_matrix_cross = self.get_adj_utterance_matrix( + inner, input_schema.table_schema["foreign_keys"], input_schema.num_col + ) + + adjacent_matrix = paddle.to_tensor(adjacent_matrix) + adjacent_matrix_cross = paddle.to_tensor(adjacent_matrix_cross) + previous_schema_states = paddle.zeros([input_schema.num_col, self.params.encoder_state_size]) + + for _, utterance in enumerate(interaction.gold_utterances()): + + input_sequence = utterance.input_sequence() + + previous_query = utterance.previous_query() + + final_utterance_state, utterance_states, schema_states = self.get_bert_encoding( + input_sequence, input_schema, discourse_state, dropout=True + ) + + schema_states = paddle.stack(schema_states, axis=0) + for i in range(self.params.gnn_layer_number): + schema_states = self.gnn_history[2 * i](schema_states, adjacent_matrix_cross, previous_schema_states) + schema_states = self.gnn_history[2 * i + 1]( + schema_states, adjacent_matrix_cross, previous_schema_states + ) + schema_states = self.gnn[i](schema_states, adjacent_matrix) + previous_schema_states = schema_states + + schema_states_ls = paddle.split(schema_states, schema_states.shape[0], axis=0) + schema_states = [ele.squeeze(0) for ele in schema_states_ls] + + input_hidden_states.extend(utterance_states) + input_sequences.append(input_sequence) + + num_utterances_to_keep = min(self.params.maximum_utterances, len(input_sequences)) + + if self.params.discourse_level_lstm: + discourse_state, discourse_lstm_states = self.discourse_lstms( + final_utterance_state[0].unsqueeze(0), discourse_lstm_states + ) + discourse_state = discourse_state.squeeze() + + if self.params.use_utterance_attention: + ( + final_utterance_states_c, + final_utterance_states_h, + final_utterance_state, + ) = self.get_utterance_attention( + final_utterance_states_c, final_utterance_states_h, final_utterance_state, num_utterances_to_keep + ) + + if self.params.state_positional_embeddings: + utterance_states, flat_sequence = self._add_positional_embeddings(input_hidden_states, input_sequences) + else: + flat_sequence = [] + for utt in input_sequences[-num_utterances_to_keep:]: + flat_sequence.extend(utt) + + snippets = None + + if self.params.use_previous_query and len(previous_query) > 0: + previous_queries, previous_query_states = self.get_previous_queries( + previous_queries, previous_query_states, previous_query, input_schema + ) + + prediction = self.predict_turn( + final_utterance_state, + utterance_states, + schema_states, + max_generation_length, + 
gold_query=utterance.gold_query(), + snippets=snippets, + input_sequence=flat_sequence, + previous_queries=previous_queries, + previous_query_states=previous_query_states, + input_schema=input_schema, + feed_gold_tokens=feed_gold_query, + ) + + predictions.append(prediction) + + return predictions diff --git a/examples/text_to_sql/IGSQL/model/token_predictor.py b/examples/text_to_sql/IGSQL/model/token_predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..7561cd9aac016bbcb5c3b57cf56c057546f519cd --- /dev/null +++ b/examples/text_to_sql/IGSQL/model/token_predictor.py @@ -0,0 +1,447 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Predicts a token.""" + +from collections import namedtuple + +import paddle +import paddle.nn.functional as F + +from .attention import Attention, AttentionResult + + +class PredictionInput( + namedtuple("PredictionInput", ("decoder_state", "input_hidden_states", "snippets", "input_sequence")) +): + """Inputs to the token predictor.""" + + __slots__ = () + + +class PredictionInputWithSchema( + namedtuple( + "PredictionInputWithSchema", + ( + "decoder_state", + "input_hidden_states", + "schema_states", + "snippets", + "input_sequence", + "previous_queries", + "previous_query_states", + "input_schema", + ), + ) +): + """Inputs to the token predictor.""" + + __slots__ = () + + +class TokenPrediction( + namedtuple( + "TokenPrediction", + ( + "scores", + "aligned_tokens", + "utterance_attention_results", + "schema_attention_results", + "query_attention_results", + "copy_switch", + "query_scores", + "query_tokens", + "decoder_state", + ), + ) +): + """A token prediction.""" + + __slots__ = () + + +def score_schema_tokens(input_schema, schema_states, scorer): + # schema_states: emd_dim x num_tokens + scores = paddle.t(paddle.mm(paddle.t(scorer), schema_states)) # num_tokens x 1 + if scores.shape[0] != len(input_schema): + raise ValueError("Got " + str(scores.shape[0]) + " scores for " + str(len(input_schema)) + " schema tokens") + return scores, input_schema.column_names_surface_form + + +def score_query_tokens(previous_query, previous_query_states, scorer): + scores = paddle.t(paddle.mm(paddle.t(scorer), previous_query_states)) # num_tokens x 1 + if scores.shape[0] != len(previous_query): + raise ValueError("Got " + str(scores.shape[0]) + " scores for " + str(len(previous_query)) + " query tokens") + return scores, previous_query + + +class TokenPredictor(paddle.nn.Layer): + """Predicts a token given a (decoder) state. + + Attributes: + vocabulary (`Vocabulary`): A vocabulary object for the output. + attention_module (`Attention`): An attention module. + state_transformation_weights (`Parameter`): Transforms the input state + before predicting a token. + vocabulary_weights (`Parameter`): Final layer weights. + vocabulary_biases (`Parameter`): Final layer biases. 
+ """ + + def __init__(self, params, vocabulary, attention_key_size): + super().__init__() + self.params = params + self.vocabulary = vocabulary + self.attention_module = Attention(params.decoder_state_size, attention_key_size, attention_key_size) + + bias_initializer = paddle.nn.initializer.Uniform(low=-0.1, high=0.1) + + _initializer = paddle.nn.initializer.XavierUniform() + + state_transform_weights = paddle.ParamAttr(initializer=_initializer) + + vocabulary_biases = paddle.ParamAttr(initializer=bias_initializer) + + self.state_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size + attention_key_size, + out_features=params.decoder_state_size, + weight_attr=state_transform_weights, + bias_attr=False, + ) + + self.vocabulary_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size, + out_features=len(vocabulary), + weight_attr=state_transform_weights, + bias_attr=vocabulary_biases, + ) + + def _get_intermediate_state(self, state, dropout_amount=0.0): + intermediate_state = paddle.tanh(self.state_transform_Linear(state)) + return F.dropout(intermediate_state, dropout_amount) + + def _score_vocabulary_tokens(self, state): + scores = paddle.t(self.vocabulary_Linear(state)) + + if scores.shape[0] != len(self.vocabulary.inorder_tokens): + raise ValueError( + "Got " + + str(scores.shape[0]) + + " scores for " + + str(len(self.vocabulary.inorder_tokens)) + + " vocabulary items" + ) + + return scores, self.vocabulary.inorder_tokens + + def forward(self, prediction_input, dropout_amount=0.0): + decoder_state = prediction_input.decoder_state + input_hidden_states = prediction_input.input_hidden_states + + attention_results = self.attention_module(decoder_state, input_hidden_states) + + state_and_attn = paddle.concat([decoder_state, attention_results.vector], axis=0) + + intermediate_state = self._get_intermediate_state(state_and_attn, dropout_amount=dropout_amount) + vocab_scores, vocab_tokens = self._score_vocabulary_tokens(intermediate_state) + + return TokenPrediction(vocab_scores, vocab_tokens, attention_results, decoder_state) + + +class SchemaTokenPredictor(TokenPredictor): + """Token predictor that also predicts snippets. + + Attributes: + snippet_weights (`Parameter`): Weights for scoring snippets against some + state. 
+ """ + + def __init__(self, params, vocabulary, utterance_attention_key_size, schema_attention_key_size, snippet_size): + TokenPredictor.__init__(self, params, vocabulary, utterance_attention_key_size) + + _initializer = paddle.nn.initializer.XavierUniform() + + if params.use_schema_attention: + self.utterance_attention_module = self.attention_module + self.schema_attention_module = Attention( + params.decoder_state_size, schema_attention_key_size, schema_attention_key_size + ) + + if self.params.use_query_attention: + self.query_attention_module = Attention( + params.decoder_state_size, params.encoder_state_size, params.encoder_state_size + ) + + self.start_query_attention_vector = self.create_parameter( + [params.encoder_state_size], + dtype="float32", + default_initializer=paddle.nn.initializer.Uniform(low=-0.1, high=0.1), + ) + + state_transform_weights = paddle.ParamAttr(initializer=_initializer) + if params.use_schema_attention and self.params.use_query_attention: + self.state_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size + + utterance_attention_key_size + + schema_attention_key_size + + params.encoder_state_size, + out_features=params.decoder_state_size, + weight_attr=state_transform_weights, + bias_attr=False, + ) + elif params.use_schema_attention: + self.state_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size + utterance_attention_key_size + schema_attention_key_size, + out_features=params.decoder_state_size, + weight_attr=state_transform_weights, + bias_attr=False, + ) + + schema_token_weights = paddle.ParamAttr(initializer=_initializer) + self.schema_token_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size, + out_features=schema_attention_key_size, + weight_attr=schema_token_weights, + bias_attr=False, + ) + + if self.params.use_previous_query: + + query_token_weights = paddle.ParamAttr(initializer=_initializer) + self.query_token_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size, + out_features=self.params.encoder_state_size, + weight_attr=query_token_weights, + bias_attr=False, + ) + + if self.params.use_copy_switch: + state2copyswitch_transform_weights = paddle.ParamAttr(initializer=_initializer) + if self.params.use_query_attention: + self.state2copyswitch_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size + + utterance_attention_key_size + + schema_attention_key_size + + params.encoder_state_size, + out_features=1, + weight_attr=state2copyswitch_transform_weights, + bias_attr=False, + ) + else: + self.state2copyswitch_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size + utterance_attention_key_size + schema_attention_key_size, + out_features=1, + weight_attr=state2copyswitch_transform_weights, + bias_attr=False, + ) + + state2copy_transform_weights = paddle.ParamAttr(initializer=_initializer) + self.state2copy_transform_Linear = paddle.nn.Linear( + in_features=params.decoder_state_size, + out_features=3, + weight_attr=state2copy_transform_weights, + bias_attr=False, + ) + + def _get_schema_token_scorer(self, state): + scorer = paddle.t(self.schema_token_Linear(state)) + return scorer + + def _get_query_token_scorer(self, state): + scorer = paddle.t(self.query_token_Linear(state)) + return scorer + + def _get_copy_switch(self, state): + copy_switch = F.sigmoid(self.state2copyswitch_transform_Linear(state)) + return copy_switch.squeeze() + + def forward(self, prediction_input, dropout_amount=0.0): + decoder_state = 
prediction_input.decoder_state + input_hidden_states = prediction_input.input_hidden_states + + input_schema = prediction_input.input_schema + schema_states = prediction_input.schema_states + + if self.params.use_schema_attention: + schema_attention_results = self.schema_attention_module(decoder_state, schema_states) + utterance_attention_results = self.utterance_attention_module(decoder_state, input_hidden_states) + else: + utterance_attention_results = self.attention_module(decoder_state, input_hidden_states) + schema_attention_results = None + + query_attention_results = None + if self.params.use_query_attention: + previous_query_states = prediction_input.previous_query_states + if len(previous_query_states) > 0: + query_attention_results = self.query_attention_module(decoder_state, previous_query_states[-1]) + else: + query_attention_results = self.start_query_attention_vector + query_attention_results = AttentionResult(None, None, query_attention_results) + + if self.params.use_schema_attention and self.params.use_query_attention: + state_and_attn = paddle.concat( + [ + decoder_state, + utterance_attention_results.vector, + schema_attention_results.vector, + query_attention_results.vector, + ], + axis=0, + ) + elif self.params.use_schema_attention: + state_and_attn = paddle.concat( + [decoder_state, utterance_attention_results.vector, schema_attention_results.vector], axis=0 + ) + else: + state_and_attn = paddle.concat([decoder_state, utterance_attention_results.vector], axis=0) + + intermediate_state = self._get_intermediate_state(state_and_attn, dropout_amount=dropout_amount) + copy_score = F.sigmoid(self.state2copy_transform_Linear(intermediate_state).squeeze(0)) + + vocab_scores, vocab_tokens = self._score_vocabulary_tokens(intermediate_state) + + final_scores = vocab_scores + aligned_tokens = [] + aligned_tokens.extend(vocab_tokens) + + schema_states = paddle.stack(schema_states, axis=1) + schema_scores, schema_tokens = score_schema_tokens( + input_schema, schema_states, self._get_schema_token_scorer(intermediate_state) + ) + + final_scores = paddle.concat([copy_score[0] * final_scores, copy_score[1] * schema_scores], axis=0) + aligned_tokens.extend(schema_tokens) + + # Previous Queries + previous_queries = prediction_input.previous_queries + previous_query_states = prediction_input.previous_query_states + + copy_switch = None + query_scores = None + query_tokens = None + if self.params.use_previous_query and len(previous_queries) > 0: + if self.params.use_copy_switch: + copy_switch = self._get_copy_switch(state_and_attn) + for turn, (previous_query, previous_query_state) in enumerate( + zip(previous_queries, previous_query_states) + ): + + assert len(previous_query) == len(previous_query_state) + previous_query_state = paddle.stack(previous_query_state, axis=1) + query_scores, query_tokens = score_query_tokens( + previous_query, previous_query_state, self._get_query_token_scorer(intermediate_state) + ) + query_scores = query_scores.squeeze() + + if query_scores is not None: + final_scores = paddle.concat([final_scores, copy_score[2] * query_scores], axis=0) + aligned_tokens += query_tokens + + return TokenPrediction( + final_scores, + aligned_tokens, + utterance_attention_results, + schema_attention_results, + query_attention_results, + copy_switch, + query_scores, + query_tokens, + decoder_state, + ) + + +class AnonymizationTokenPredictor(TokenPredictor): + """Token predictor that also predicts anonymization tokens. 
+
+    Attributes:
+        anonymizer (`Anonymizer`): The anonymization object.
+
+    """
+
+    def __init__(self, params, vocabulary, attention_key_size, anonymizer):
+        TokenPredictor.__init__(self, params, vocabulary, attention_key_size)
+        if not anonymizer:
+            raise ValueError("Expected an anonymizer, but was None")
+        self.anonymizer = anonymizer
+
+    def _score_anonymized_tokens(self, input_sequence, attention_scores):
+        scores = []
+        tokens = []
+        for i, token in enumerate(input_sequence):
+            if self.anonymizer.is_anon_tok(token):
+                scores.append(attention_scores[i])
+                tokens.append(token)
+
+        if len(scores) > 0:
+            if len(scores) != len(tokens):
+                raise ValueError("Got " + str(len(scores)) + " scores for " + str(len(tokens)) + " anonymized tokens")
+
+            anonymized_scores = paddle.concat(scores, axis=0)
+            if anonymized_scores.dim() == 1:
+                anonymized_scores = anonymized_scores.unsqueeze(1)
+            return anonymized_scores, tokens
+        else:
+            return None, []
+
+    def forward(self, prediction_input, dropout_amount=0.0):
+        decoder_state = prediction_input.decoder_state
+        input_hidden_states = prediction_input.input_hidden_states
+        input_sequence = prediction_input.input_sequence
+        assert input_sequence
+
+        attention_results = self.attention_module(decoder_state, input_hidden_states)
+
+        state_and_attn = paddle.concat([decoder_state, attention_results.vector], axis=0)
+
+        intermediate_state = self._get_intermediate_state(state_and_attn, dropout_amount=dropout_amount)
+        vocab_scores, vocab_tokens = self._score_vocabulary_tokens(intermediate_state)
+
+        final_scores = vocab_scores
+        aligned_tokens = []
+        aligned_tokens.extend(vocab_tokens)
+
+        anonymized_scores, anonymized_tokens = self._score_anonymized_tokens(input_sequence, attention_results.scores)
+
+        if anonymized_scores is not None:
+            final_scores = paddle.concat([final_scores, anonymized_scores], axis=0)
+            aligned_tokens.extend(anonymized_tokens)
+
+        final_scores = final_scores.squeeze()
+
+        return TokenPrediction(
+            final_scores, aligned_tokens, attention_results, None, None, None, None, None, decoder_state
+        )
+
+
+def construct_token_predictor(
+    params, vocabulary, utterance_attention_key_size, schema_attention_key_size, snippet_size, anonymizer=None
+):
+    """Constructs a token predictor given the parameters.
+
+    Args:
+        params (`dict`): Contains the command line parameters/hyperparameters.
+        vocabulary (`Vocabulary`): Vocabulary object for output generation.
+        utterance_attention_key_size (`int`): The size of the utterance attention keys.
+        schema_attention_key_size (`int`): The size of the schema attention keys.
+        snippet_size (`int`): The size of the snippet representations.
+        anonymizer (`Anonymizer`): An anonymization object.
+    """
+
+    if not anonymizer and not params.previous_decoder_snippet_encoding:
+        print("using SchemaTokenPredictor")
+        return SchemaTokenPredictor(
+            params, vocabulary, utterance_attention_key_size, schema_attention_key_size, snippet_size
+        )
+    elif params.use_snippets and anonymizer and not params.previous_decoder_snippet_encoding:
+        print("using AnonymizationTokenPredictor")
+        return AnonymizationTokenPredictor(params, vocabulary, utterance_attention_key_size, anonymizer)
+    else:
+        print("Unknown token_predictor")
+        exit()
diff --git a/examples/text_to_sql/IGSQL/model_util.py b/examples/text_to_sql/IGSQL/model_util.py
new file mode 100644
index 0000000000000000000000000000000000000000..a30608b6263286f6b489a7513160d7121ea8f201
--- /dev/null
+++ b/examples/text_to_sql/IGSQL/model_util.py
@@ -0,0 +1,639 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Basic model training and evaluation functions.""" + +import json +import random +from enum import Enum + +import paddle +import progressbar + +from . import model_utils +from .data_util import sql_util + + +def write_prediction( + fileptr, + identifier, + input_seq, + probability, + prediction, + flat_prediction, + gold_query, + flat_gold_queries, + gold_tables, + index_in_interaction, + database_username, + database_password, + database_timeout, + compute_metrics=True, +): + pred_obj = {} + pred_obj["identifier"] = identifier + if len(identifier.split("/")) == 2: + database_id, interaction_id = identifier.split("/") + else: + database_id = "atis" + interaction_id = identifier + pred_obj["database_id"] = database_id + pred_obj["interaction_id"] = interaction_id + + pred_obj["input_seq"] = input_seq + pred_obj["probability"] = probability + pred_obj["prediction"] = prediction + pred_obj["flat_prediction"] = flat_prediction + pred_obj["gold_query"] = gold_query + pred_obj["flat_gold_queries"] = flat_gold_queries + pred_obj["index_in_interaction"] = index_in_interaction + pred_obj["gold_tables"] = str(gold_tables) + + # Now compute the metrics we want. + + if compute_metrics: + # First metric: whether flat predicted query is in the gold query set. 
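+        # When the exact string match fails, the predicted SQL is executed against the
+        # database and its result table is scored against the gold tables (precision/recall/F1).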
+        correct_string = " ".join(flat_prediction) in [" ".join(q) for q in flat_gold_queries]
+        pred_obj["correct_string"] = correct_string
+
+        # Database metrics
+        if not correct_string:
+            syntactic, semantic, pred_table = sql_util.execution_results(
+                " ".join(flat_prediction), database_username, database_password, database_timeout
+            )
+            pred_table = sorted(pred_table)
+            best_prec = 0.0
+            best_rec = 0.0
+            best_f1 = 0.0
+
+            for gold_table in gold_tables:
+                num_overlap = float(len(set(pred_table) & set(gold_table)))
+
+                if len(set(gold_table)) > 0:
+                    prec = num_overlap / len(set(gold_table))
+                else:
+                    prec = 1.0
+
+                if len(set(pred_table)) > 0:
+                    rec = num_overlap / len(set(pred_table))
+                else:
+                    rec = 1.0
+
+                if prec > 0.0 and rec > 0.0:
+                    f1 = (2 * (prec * rec)) / (prec + rec)
+                else:
+                    f1 = 1.0
+
+                best_prec = max(best_prec, prec)
+                best_rec = max(best_rec, rec)
+                best_f1 = max(best_f1, f1)
+
+        else:
+            syntactic = True
+            semantic = True
+            pred_table = []
+            best_prec = 1.0
+            best_rec = 1.0
+            best_f1 = 1.0
+
+        assert best_prec <= 1.0
+        assert best_rec <= 1.0
+        assert best_f1 <= 1.0
+        pred_obj["syntactic"] = syntactic
+        pred_obj["semantic"] = semantic
+        correct_table = (pred_table in gold_tables) or correct_string
+        pred_obj["correct_table"] = correct_table
+        pred_obj["strict_correct_table"] = correct_table and syntactic
+        pred_obj["pred_table"] = str(pred_table)
+        pred_obj["table_prec"] = best_prec
+        pred_obj["table_rec"] = best_rec
+        pred_obj["table_f1"] = best_f1
+
+    fileptr.write(json.dumps(pred_obj) + "\n")
+
+
+class Metrics(Enum):
+    """Definitions of simple metrics to compute."""
+
+    LOSS = 1
+    TOKEN_ACCURACY = 2
+    STRING_ACCURACY = 3
+    CORRECT_TABLES = 4
+    STRICT_CORRECT_TABLES = 5
+    SEMANTIC_QUERIES = 6
+    SYNTACTIC_QUERIES = 7
+    FIRST_ACC = 8
+    AFTER_FIRST_ACC = 9
+
+
+def get_progressbar(name, size):
+    """Gets a progress bar object given a name and the total size.
+
+    Args:
+        name (`str`): The name to display on the side.
+        size (`int`): The maximum size of the progress bar.
+
+    """
+    return progressbar.ProgressBar(
+        maxval=size,
+        widgets=[name, progressbar.Bar("=", "[", "]"), " ", progressbar.Percentage(), " ", progressbar.ETA()],
+    )
+
+
+def train_epoch_with_utterances(batches, model, randomize=True):
+    """Trains model for a single epoch given batches of utterance data.
+
+    Args:
+        batches (`UtteranceBatch`): The batches to give to training.
+        model (`ATISModel`): The model object.
+        randomize (`bool`): Whether or not to randomize the order that the batches are seen.
+    """
+    if randomize:
+        random.shuffle(batches)
+    progbar = get_progressbar("train ", len(batches))
+    progbar.start()
+    loss_sum = 0.0
+
+    for i, batch in enumerate(batches):
+        batch_loss = model.train_step(batch)
+        loss_sum += batch_loss
+
+        progbar.update(i)
+
+    progbar.finish()
+
+    total_loss = loss_sum / len(batches)
+
+    return total_loss
+
+
+def train_epoch_with_interactions(
+    interaction_batches, params, model, randomize=True, db2id=None, id2db=None, step=None
+):
+    """Trains model for single epoch given batches of interactions.
+
+    Args:
+        interaction_batches (`list`): The batches to train on.
+        params (`namespace`): Parameters to run with.
+        model (`ATISModel`): Model to train.
+        randomize (`bool`): Whether or not to randomize the order that batches are seen.
+    """
+    if randomize:
+        random.shuffle(interaction_batches)
+    progbar = get_progressbar("train ", len(interaction_batches))
+    progbar.start()
+    loss_sum = 0.0
+
+    skip_ls = ["sakila_1", "baseball_1", "soccer_1", "cre_Drama_Workshop_Groups", "formula_1", "assets_maintenance/8"]
+    skip_num = 0
+
+    for i, interaction_batch in enumerate(interaction_batches):
+        assert len(interaction_batch) == 1
+
+        interaction = interaction_batch.items[0]
+
+        if interaction.identifier == "raw/atis2/12-1.1/ATIS2/TEXT/TEST/NOV92/770/5":
+            continue
+
+        if "sparc" in params.data_directory and "baseball_1" in interaction.identifier:
+            continue
+
+        skip = False
+        if "cosql" in params.data_directory:
+            print(interaction.identifier, i, skip_num)
+            for ele in skip_ls:
+                if ele in interaction.identifier:
+                    print("skip")
+                    skip = True
+                    continue
+        if skip:
+            print("skip, length:", len(interaction.gold_utterances()))
+            skip_num += 1
+            continue
+
+        batch_loss, step = model.train_step(
+            interaction, params.train_maximum_sql_length, db2id=db2id, id2db=id2db, step=step
+        )
+
+        loss_sum += batch_loss
+
+        progbar.update(i)
+
+    progbar.finish()
+
+    total_loss = loss_sum / len(interaction_batches)
+
+    return total_loss, step
+
+
+# Compute accuracy metrics.
+def update_sums(
+    metrics,
+    metrics_sums,
+    predicted_sequence,
+    flat_sequence,
+    gold_query,
+    original_gold_query,
+    gold_forcing=False,
+    loss=None,
+    token_accuracy=0.0,
+    database_username="",
+    database_password="",
+    database_timeout=0,
+    gold_table=None,
+):
+    """Updates summing for metrics in an aggregator.
+
+    TODO: don't use sums, just keep the raw value.
+    """
+    if Metrics.LOSS in metrics:
+        metrics_sums[Metrics.LOSS] += loss
+
+    if Metrics.TOKEN_ACCURACY in metrics:
+        if gold_forcing:
+            metrics_sums[Metrics.TOKEN_ACCURACY] += token_accuracy
+        else:
+            num_tokens_correct = 0.0
+            for j, token in enumerate(gold_query):
+                if len(predicted_sequence) > j and predicted_sequence[j] == token:
+                    num_tokens_correct += 1
+            metrics_sums[Metrics.TOKEN_ACCURACY] += num_tokens_correct / len(gold_query)
+
+    if Metrics.STRING_ACCURACY in metrics:
+
+        metrics_sums[Metrics.STRING_ACCURACY] += int(flat_sequence == original_gold_query)
+
+    if Metrics.CORRECT_TABLES in metrics:
+        assert database_username, "You did not provide a database username"
+        assert database_password, "You did not provide a database password"
+        assert database_timeout > 0, "Database timeout is 0 seconds"
+
+        # Evaluate SQL
+        if flat_sequence != original_gold_query:
+            syntactic, semantic, table = sql_util.execution_results(
+                " ".join(flat_sequence), database_username, database_password, database_timeout
+            )
+        else:
+            syntactic = True
+            semantic = True
+            table = gold_table
+
+        metrics_sums[Metrics.CORRECT_TABLES] += int(table == gold_table)
+        if Metrics.SYNTACTIC_QUERIES in metrics:
+            metrics_sums[Metrics.SYNTACTIC_QUERIES] += int(syntactic)
+        if Metrics.SEMANTIC_QUERIES in metrics:
+            metrics_sums[Metrics.SEMANTIC_QUERIES] += int(semantic)
+        if Metrics.STRICT_CORRECT_TABLES in metrics:
+            metrics_sums[Metrics.STRICT_CORRECT_TABLES] += int(table == gold_table and syntactic)
+
+
+def construct_averages(metrics_sums, total_num):
+    """Computes the averages for metrics.
+
+    Args:
+        metrics_sums (`dict`): Sums for a metric.
+        total_num (`int`): Number to divide by (average).
+ """ + metrics_averages = {} + if isinstance(total_num, int): + for metric, value in metrics_sums.items(): + metrics_averages[metric] = value / total_num + if metric != "loss": + metrics_averages[metric] *= 100.0 + else: + for metric, value in metrics_sums.items(): + metrics_averages[metric] = value / total_num + if metric != "loss": + metrics_averages[metric] *= 100.0 + + return metrics_averages + + +def evaluate_utterance_sample( + sample, + model, + max_generation_length, + name="", + gold_forcing=False, + metrics=None, + total_num=-1, + database_username="", + database_password="", + database_timeout=0, + write_results=False, +): + """Evaluates a sample of utterance examples. + + Args: + sample (`list`): Examples to evaluate. + model (`ATISModel`): Model to predict with. + max_generation_length (`int`): Maximum length to generate. + name (`str`): Name to log with. + gold_forcing (`bool`): Whether to force the gold tokens during decoding. + metrics (`list`): Metrics to evaluate with. + total_num (`int`): Number to divide by when reporting results. + database_username (`str`): Username to use for executing queries. + database_password (`str`): Password to use when executing queries. + database_timeout (`float`): Timeout on queries when executing. + write_results (`bool`): Whether to write the results to a file. + """ + assert metrics + + if total_num < 0: + total_num = len(sample) + + metrics_sums = {} + for metric in metrics: + metrics_sums[metric] = 0.0 + + predictions_file = open(name + "_predictions.json", "w") + print("Predicting with filename " + str(name) + "_predictions.json") + progbar = get_progressbar(name, len(sample)) + progbar.start() + + predictions = [] + for i, item in enumerate(sample): + _, loss, predicted_seq = model.eval_step(item, max_generation_length, feed_gold_query=gold_forcing) + loss = loss / len(item.gold_query()) + predictions.append(predicted_seq) + + flat_sequence = item.flatten_sequence(predicted_seq) + token_accuracy = model_utils.per_token_accuracy(item.gold_query(), predicted_seq) + + if write_results: + write_prediction( + predictions_file, + identifier=item.interaction.identifier, + input_seq=item.input_sequence(), + probability=0, + prediction=predicted_seq, + flat_prediction=flat_sequence, + gold_query=item.gold_query(), + flat_gold_queries=item.original_gold_queries(), + gold_tables=item.gold_tables(), + index_in_interaction=item.utterance_index, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + ) + + update_sums( + metrics, + metrics_sums, + predicted_seq, + flat_sequence, + item.gold_query(), + item.original_gold_queries()[0], + gold_forcing, + loss, + token_accuracy, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + gold_table=item.gold_tables()[0], + ) + + progbar.update(i) + + progbar.finish() + predictions_file.close() + + return construct_averages(metrics_sums, total_num), None + + +def evaluate_interaction_sample( + sample, + model, + max_generation_length, + name="", + gold_forcing=False, + metrics=None, + total_num=-1, + database_username="", + database_password="", + database_timeout=0, + use_predicted_queries=False, + write_results=False, + use_gpu=False, + compute_metrics=False, +): + """Evaluates a sample of interactions.""" + predictions_file = open(name + "_predictions.json", "w") + print("Predicting with file " + str(name + "_predictions.json")) + metrics_sums = {} + for metric in metrics: + 
metrics_sums[metric] = 0.0 + progbar = get_progressbar(name, len(sample)) + progbar.start() + + num_utterances = 0 + predictions = [] + + model.eval() + + for i, interaction in enumerate(sample): + try: + with paddle.no_grad(): + if use_predicted_queries: + example_preds = model.predict_with_predicted_queries(interaction, max_generation_length) + else: + example_preds = model.predict_with_gold_queries( + interaction, max_generation_length, feed_gold_query=gold_forcing + ) + except RuntimeError as exception: + print("Failed on interaction: " + str(interaction.identifier)) + print(exception) + print("\n\n") + exit() + + predictions.extend(example_preds) + + assert len(example_preds) == len(interaction.interaction.utterances) or not example_preds + for j, pred in enumerate(example_preds): + num_utterances += 1 + + sequence, loss, token_accuracy, _, decoder_results = pred + + if use_predicted_queries: + item = interaction.processed_utterances[j] + original_utt = interaction.interaction.utterances[item.index] + + gold_query = original_utt.gold_query_to_use + original_gold_query = original_utt.original_gold_query + + gold_table = original_utt.gold_sql_results + gold_queries = [q[0] for q in original_utt.all_gold_queries] + gold_tables = [q[1] for q in original_utt.all_gold_queries] + index = item.index + else: + item = interaction.gold_utterances()[j] + + gold_query = item.gold_query() + original_gold_query = item.original_gold_query() + + gold_table = item.gold_table() + gold_queries = item.original_gold_queries() + gold_tables = item.gold_tables() + index = item.utterance_index + if loss: + loss = loss / len(gold_query) + + flat_sequence = item.flatten_sequence(sequence) + + # if isinstance(flat_sequence[-1],int): + # if flat_sequence[-1]==0: + # num_first_utterances += 1 + # else: + # num_after_first_utterances += 1 + + if write_results: + write_prediction( + predictions_file, + identifier=interaction.identifier, + input_seq=item.input_sequence(), + probability=decoder_results.probability, + prediction=sequence, + flat_prediction=flat_sequence, + gold_query=gold_query, + flat_gold_queries=gold_queries, + gold_tables=gold_tables, + index_in_interaction=index, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + compute_metrics=compute_metrics, + ) + + update_sums( + metrics, + metrics_sums, + sequence, + flat_sequence, + gold_query, + original_gold_query, + gold_forcing, + loss, + token_accuracy, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + gold_table=gold_table, + ) + + progbar.update(i) + + progbar.finish() + + if total_num < 0: + total_num = num_utterances + + predictions_file.close() + return construct_averages(metrics_sums, total_num), predictions + + +def evaluate_using_predicted_queries( + sample, + model, + name="", + gold_forcing=False, + metrics=None, + total_num=-1, + database_username="", + database_password="", + database_timeout=0, + snippet_keep_age=1, +): + predictions_file = open(name + "_predictions.json", "w") + print("Predicting with file " + str(name + "_predictions.json")) + assert not gold_forcing + metrics_sums = {} + for metric in metrics: + metrics_sums[metric] = 0.0 + progbar = get_progressbar(name, len(sample)) + progbar.start() + + num_utterances = 0 + predictions = [] + for i, item in enumerate(sample): + int_predictions = [] + item.start_interaction() + while not item.done(): + utterance = item.next_utterance(snippet_keep_age) + 
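+            # Decode the next utterance; the prediction is kept only if it is executable
+            # and clears the probability threshold, otherwise an empty query is recorded.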
+ predicted_sequence, loss, _, probability = model.eval_step(utterance) + int_predictions.append((utterance, predicted_sequence)) + + flat_sequence = utterance.flatten_sequence(predicted_sequence) + + if ( + sql_util.executable( + flat_sequence, username=database_username, password=database_password, timeout=database_timeout + ) + and probability >= 0.24 + ): + utterance.set_pred_query(item.remove_snippets(predicted_sequence)) + item.add_utterance( + utterance, item.remove_snippets(predicted_sequence), previous_snippets=utterance.snippets() + ) + else: + # Add the /previous/ predicted query, guaranteed to be syntactically + # correct + seq = [] + utterance.set_pred_query(seq) + item.add_utterance(utterance, seq, previous_snippets=utterance.snippets()) + + original_utt = item.interaction.utterances[utterance.index] + write_prediction( + predictions_file, + identifier=item.interaction.identifier, + input_seq=utterance.input_sequence(), + probability=probability, + prediction=predicted_sequence, + flat_prediction=flat_sequence, + gold_query=original_utt.gold_query_to_use, + flat_gold_queries=[q[0] for q in original_utt.all_gold_queries], + gold_tables=[q[1] for q in original_utt.all_gold_queries], + index_in_interaction=utterance.index, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + ) + + update_sums( + metrics, + metrics_sums, + predicted_sequence, + flat_sequence, + original_utt.gold_query_to_use, + original_utt.original_gold_query, + gold_forcing, + loss, + token_accuracy=0, + database_username=database_username, + database_password=database_password, + database_timeout=database_timeout, + gold_table=original_utt.gold_sql_results, + ) + + predictions.append(int_predictions) + progbar.update(i) + + progbar.finish() + + if total_num < 0: + total_num = num_utterances + predictions_file.close() + + return construct_averages(metrics_sums, total_num), predictions diff --git a/examples/text_to_sql/IGSQL/parse_args.py b/examples/text_to_sql/IGSQL/parse_args.py new file mode 100644 index 0000000000000000000000000000000000000000..1ab7d9eb373f9f3aba4cc24cba72c2e27857029d --- /dev/null +++ b/examples/text_to_sql/IGSQL/parse_args.py @@ -0,0 +1,167 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
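+
+"""Command line argument definitions for the IGSQL example."""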
+ +import argparse +import os +import sys + +args = sys.argv + + +def interpret_args(): + """Interprets the command line arguments, and returns a dictionary.""" + parser = argparse.ArgumentParser() + + parser.add_argument("--no_gpus", type=bool, default=1) + + # Data parameters + parser.add_argument( + "--raw_train_filename", type=str, default="../atis_data/data/resplit/processed/train_with_tables.pkl" + ) + parser.add_argument( + "--raw_dev_filename", type=str, default="../atis_data/data/resplit/processed/dev_with_tables.pkl" + ) + parser.add_argument( + "--raw_validation_filename", type=str, default="../atis_data/data/resplit/processed/valid_with_tables.pkl" + ) + parser.add_argument( + "--raw_test_filename", type=str, default="../atis_data/data/resplit/processed/test_with_tables.pkl" + ) + + parser.add_argument("--data_directory", type=str, default="processed_data") + + parser.add_argument("--processed_train_filename", type=str, default="train.pkl") + parser.add_argument("--processed_dev_filename", type=str, default="dev.pkl") + parser.add_argument("--processed_validation_filename", type=str, default="validation.pkl") + parser.add_argument("--processed_test_filename", type=str, default="test.pkl") + + parser.add_argument("--database_schema_filename", type=str, default=None) + parser.add_argument("--embedding_filename", type=str, default=None) + + parser.add_argument("--input_vocabulary_filename", type=str, default="input_vocabulary.pkl") + parser.add_argument("--output_vocabulary_filename", type=str, default="output_vocabulary.pkl") + + parser.add_argument("--input_key", type=str, default="utterance") + + parser.add_argument("--anonymize", type=bool, default=False) + parser.add_argument("--anonymization_scoring", type=bool, default=False) + parser.add_argument("--use_snippets", type=bool, default=False) + + parser.add_argument("--use_previous_query", type=bool, default=True) + parser.add_argument("--maximum_queries", type=int, default=1) + parser.add_argument("--use_copy_switch", type=bool, default=False) + parser.add_argument("--use_query_attention", type=bool, default=True) + + parser.add_argument("--use_utterance_attention", type=bool, default=True) + + parser.add_argument("--scheduler", type=bool, default=False) + + parser.add_argument("--use_bert", type=bool, default=True) + parser.add_argument("--bert_input_version", type=str, default="v1") + parser.add_argument("--fine_tune_bert", type=bool, default=True) + parser.add_argument("--lr_bert", default=1e-5, type=float, help="BERT model learning rate.") + + # Debugging/logging parameters + parser.add_argument("--reload_embedding", type=bool, default=False) + parser.add_argument("--logdir", type=str, default="logs") + parser.add_argument("--deterministic", type=bool, default=False) + parser.add_argument("--num_train", type=int, default=-1) + + parser.add_argument("--logfile", type=str, default="log.txt") + parser.add_argument("--results_file", type=str, default="results.txt") + + # Model architecture + parser.add_argument("--input_embedding_size", type=int, default=300) + parser.add_argument("--output_embedding_size", type=int, default=300) + + parser.add_argument("--encoder_state_size", type=int, default=300) + parser.add_argument("--decoder_state_size", type=int, default=300) + + parser.add_argument("--encoder_num_layers", type=int, default=1) + parser.add_argument("--decoder_num_layers", type=int, default=1) + parser.add_argument("--snippet_num_layers", type=int, default=1) + + parser.add_argument("--maximum_utterances", type=int, 
default=5) + parser.add_argument("--state_positional_embeddings", type=bool, default=True) + parser.add_argument("--positional_embedding_size", type=int, default=50) + + parser.add_argument("--snippet_age_embedding", type=bool, default=False) + parser.add_argument("--snippet_age_embedding_size", type=int, default=64) + parser.add_argument("--max_snippet_age_embedding", type=int, default=4) + parser.add_argument("--previous_decoder_snippet_encoding", type=bool, default=False) + + parser.add_argument("--discourse_level_lstm", type=bool, default=True) + + parser.add_argument("--use_schema_attention", type=bool, default=True) + parser.add_argument("--use_encoder_attention", type=bool, default=True) + + parser.add_argument("--use_schema_encoder", type=bool, default=True) + parser.add_argument("--use_schema_self_attention", type=bool, default=False) + parser.add_argument("--use_schema_encoder_2", type=bool, default=False) + + # Training parameters + parser.add_argument("--batch_size", type=int, default=16) + parser.add_argument("--train_maximum_sql_length", type=int, default=400) # 200 + parser.add_argument("--train_evaluation_size", type=int, default=100) + + parser.add_argument("--dropout_amount", type=float, default=0.5) + + parser.add_argument("--initial_patience", type=float, default=10.0) + parser.add_argument("--patience_ratio", type=float, default=1.01) + + parser.add_argument("--initial_learning_rate", type=float, default=1e-3) + parser.add_argument("--learning_rate_ratio", type=float, default=0.9) + + parser.add_argument("--interaction_level", type=bool, default=True) + parser.add_argument("--reweight_batch", type=bool, default=True) + parser.add_argument("--gnn_layer_number", type=int, default=1) + parser.add_argument("--clip", type=float, default=5.0) + parser.add_argument("--warmup_step", type=int, default=1000) + + # Setting + parser.add_argument("--train", type=bool, default=False) + parser.add_argument("--debug", type=bool, default=False) + + parser.add_argument("--evaluate", type=bool, default=False) + parser.add_argument("--attention", type=bool, default=False) + parser.add_argument("--save_file", type=str, default="") + parser.add_argument("--enable_testing", type=bool, default=False) + parser.add_argument("--use_predicted_queries", type=bool, default=False) + parser.add_argument("--evaluate_split", type=str, default="valid") + parser.add_argument("--evaluate_with_gold_forcing", type=bool, default=False) + parser.add_argument("--eval_maximum_sql_length", type=int, default=400) + parser.add_argument("--results_note", type=str, default="") + parser.add_argument("--compute_metrics", type=bool, default=False) + + parser.add_argument("--reference_results", type=str, default="") + + parser.add_argument("--interactive", type=bool, default=False) + + parser.add_argument("--database_username", type=str, default="aviarmy") + parser.add_argument("--database_password", type=str, default="aviarmy") + parser.add_argument("--database_timeout", type=int, default=2) + + parser.add_argument("--all_in_one_trainer", type=bool, default=False) + + args = parser.parse_args() + + if not os.path.exists(args.logdir): + os.makedirs(args.logdir) + + if not (args.train or args.evaluate or args.interactive or args.attention): + raise ValueError("You need to be training or evaluating") + if args.enable_testing and not args.evaluate: + raise ValueError("You should evaluate the model if enabling testing") + + return args diff --git a/examples/text_to_sql/IGSQL/postprocess_eval.py 
b/examples/text_to_sql/IGSQL/postprocess_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..f2c41b383e12aa0e1d29012955b73f1773ca74e5 --- /dev/null +++ b/examples/text_to_sql/IGSQL/postprocess_eval.py @@ -0,0 +1,512 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import subprocess +import traceback +from collections import defaultdict + +import sqlparse + + +def find_shortest_path(start, end, graph): + stack = [[start, []]] + visited = set() + while len(stack) > 0: + ele, history = stack.pop() + if ele == end: + return history + for node in graph[ele]: + if node[0] not in visited: + stack.append((node[0], history + [(node[0], node[1])])) + visited.add(node[0]) + + +def gen_from(candidate_tables, schema): + if len(candidate_tables) <= 1: + if len(candidate_tables) == 1: + ret = "from {}".format(schema["table_names_original"][list(candidate_tables)[0]]) + else: + ret = "from {}".format(schema["table_names_original"][0]) + return {}, ret + + table_alias_dict = {} + uf_dict = {} + for t in candidate_tables: + uf_dict[t] = -1 + idx = 1 + graph = defaultdict(list) + for acol, bcol in schema["foreign_keys"]: + t1 = schema["column_names"][acol][0] + t2 = schema["column_names"][bcol][0] + graph[t1].append((t2, (acol, bcol))) + graph[t2].append((t1, (bcol, acol))) + candidate_tables = list(candidate_tables) + start = candidate_tables[0] + table_alias_dict[start] = idx + idx += 1 + ret = "from {} as T1".format(schema["table_names_original"][start]) + try: + for end in candidate_tables[1:]: + if end in table_alias_dict: + continue + path = find_shortest_path(start, end, graph) + prev_table = start + if not path: + table_alias_dict[end] = idx + idx += 1 + ret = "{} join {} as T{}".format( + ret, + schema["table_names_original"][end], + table_alias_dict[end], + ) + continue + for node, (acol, bcol) in path: + if node in table_alias_dict: + prev_table = node + continue + table_alias_dict[node] = idx + idx += 1 + ret = "{} join {} as T{} on T{}.{} = T{}.{}".format( + ret, + schema["table_names_original"][node], + table_alias_dict[node], + table_alias_dict[prev_table], + schema["column_names_original"][acol][1], + table_alias_dict[node], + schema["column_names_original"][bcol][1], + ) + prev_table = node + except Exception: + traceback.print_exc() + print("db:{}".format(schema["db_id"])) + return table_alias_dict, ret + return table_alias_dict, ret + + +def normalize_space(format_sql): + format_sql_1 = [ + " ".join(sub_sql.strip().replace(",", " , ").replace("(", " ( ").replace(")", " ) ").split()) + for sub_sql in format_sql.split("\n") + ] + format_sql_1 = "\n".join(format_sql_1) + format_sql_2 = ( + format_sql_1.replace("\njoin", " join") + .replace(",\n", ", ") + .replace(" where", "\nwhere") + .replace(" intersect", "\nintersect") + .replace("union ", "union\n") + .replace("\nand", " and") + .replace("order by t2 .\nstart desc", "order by t2 . 
start desc") + ) + return format_sql_2 + + +def get_candidate_tables(format_sql, schema): + candidate_tables = [] + + tokens = format_sql.split() + for ii, token in enumerate(tokens): + if "." in token: + table_name = token.split(".")[0] + candidate_tables.append(table_name) + + candidate_tables = list(set(candidate_tables)) + + table_names_original = [table_name.lower() for table_name in schema["table_names_original"]] + candidate_tables_id = [table_names_original.index(table_name) for table_name in candidate_tables] + + assert -1 not in candidate_tables_id + table_names_original = schema["table_names_original"] + + return candidate_tables_id, table_names_original + + +def get_surface_form_orig(format_sql_2, schema): + column_names_surface_form = [] + column_names_surface_form_original = [] + + column_names_original = schema["column_names_original"] + table_names_original = schema["table_names_original"] + for i, (table_id, column_name) in enumerate(column_names_original): + if table_id >= 0: + table_name = table_names_original[table_id] + column_name_surface_form = "{}.{}".format(table_name, column_name) + else: + # this is just * + column_name_surface_form = column_name + column_names_surface_form.append(column_name_surface_form.lower()) + column_names_surface_form_original.append(column_name_surface_form) + + # also add table_name.* + for table_name in table_names_original: + column_names_surface_form.append("{}.*".format(table_name.lower())) + column_names_surface_form_original.append("{}.*".format(table_name)) + + assert len(column_names_surface_form) == len(column_names_surface_form_original) + for surface_form, surface_form_original in zip(column_names_surface_form, column_names_surface_form_original): + format_sql_2 = format_sql_2.replace(surface_form, surface_form_original) + + return format_sql_2 + + +def add_from_clase(sub_sql, from_clause): + select_right_sub_sql = [] + left_sub_sql = [] + left = True + num_left_parathesis = 0 # in select_right_sub_sql + num_right_parathesis = 0 # in select_right_sub_sql + tokens = sub_sql.split() + for ii, token in enumerate(tokens): + if token == "select": + left = False + if left: + left_sub_sql.append(token) + continue + select_right_sub_sql.append(token) + if token == "(": + num_left_parathesis += 1 + elif token == ")": + num_right_parathesis += 1 + + def remove_missing_tables_from_select(select_statement): + tokens = select_statement.split(",") + + stop_idx = -1 + for i in range(len(tokens)): + idx = len(tokens) - 1 - i + token = tokens[idx] + if ".*" in token and "count " not in token: + pass + else: + stop_idx = idx + 1 + break + + if stop_idx > 0: + new_select_statement = ",".join(tokens[:stop_idx]).strip() + else: + new_select_statement = select_statement + + return new_select_statement + + if num_left_parathesis == num_right_parathesis or num_left_parathesis > num_right_parathesis: + sub_sqls = [] + sub_sqls.append(remove_missing_tables_from_select(sub_sql)) + sub_sqls.append(from_clause) + else: + assert num_left_parathesis < num_right_parathesis + select_sub_sql = [] + right_sub_sql = [] + for i in range(len(select_right_sub_sql)): + token_idx = len(select_right_sub_sql) - 1 - i + token = select_right_sub_sql[token_idx] + if token == ")": + num_right_parathesis -= 1 + if num_right_parathesis == num_left_parathesis: + select_sub_sql = select_right_sub_sql[:token_idx] + right_sub_sql = select_right_sub_sql[token_idx:] + break + + sub_sqls = [] + + if len(left_sub_sql) > 0: + sub_sqls.append(" ".join(left_sub_sql)) + if 
len(select_sub_sql) > 0: + new_select_statement = remove_missing_tables_from_select(" ".join(select_sub_sql)) + sub_sqls.append(new_select_statement) + + sub_sqls.append(from_clause) + + if len(right_sub_sql) > 0: + sub_sqls.append(" ".join(right_sub_sql)) + + return sub_sqls + + +def postprocess_single(format_sql_2, schema, start_alias_id=0): + candidate_tables_id, table_names_original = get_candidate_tables(format_sql_2, schema) + format_sql_2 = get_surface_form_orig(format_sql_2, schema) + + if len(candidate_tables_id) == 0: + final_sql = format_sql_2.replace("\n", " ") + elif len(candidate_tables_id) == 1: + # easy case + table_name = table_names_original[candidate_tables_id[0]] + from_clause = "from {}".format(table_name) + format_sql_3 = [] + for sub_sql in format_sql_2.split("\n"): + if "select" in sub_sql: + format_sql_3 += add_from_clase(sub_sql, from_clause) + else: + format_sql_3.append(sub_sql) + final_sql = " ".join(format_sql_3).replace("{}.".format(table_name), "") + else: + # more than 1 candidate_tables + table_alias_dict, ret = gen_from(candidate_tables_id, schema) + + from_clause = ret + for i in range(len(table_alias_dict)): + from_clause = from_clause.replace("T{}".format(i + 1), "T{}".format(i + 1 + start_alias_id)) + + table_name_to_alias = {} + for table_id, alias_id in table_alias_dict.items(): + table_name = table_names_original[table_id] + alias = "T{}".format(alias_id + start_alias_id) + table_name_to_alias[table_name] = alias + start_alias_id = start_alias_id + len(table_alias_dict) + + format_sql_3 = [] + for sub_sql in format_sql_2.split("\n"): + if "select" in sub_sql: + format_sql_3 += add_from_clase(sub_sql, from_clause) + else: + format_sql_3.append(sub_sql) + format_sql_3 = " ".join(format_sql_3) + + for table_name, alias in table_name_to_alias.items(): + format_sql_3 = format_sql_3.replace("{}.".format(table_name), "{}.".format(alias)) + + final_sql = format_sql_3 + + for i in range(5): + final_sql = final_sql.replace("select count ( T{}.* ) ".format(i), "select count ( * ) ") + final_sql = final_sql.replace("count ( T{}.* ) from ".format(i), "count ( * ) from ") + final_sql = final_sql.replace("order by count ( T{}.* ) ".format(i), "order by count ( * ) ") + final_sql = final_sql.replace("having count ( T{}.* ) ".format(i), "having count ( * ) ") + + return final_sql, start_alias_id + + +def postprocess_nested(format_sql_2, schema): + candidate_tables_id, table_names_original = get_candidate_tables(format_sql_2, schema) + if len(candidate_tables_id) == 1: + format_sql_2 = get_surface_form_orig(format_sql_2, schema) + # easy case + table_name = table_names_original[candidate_tables_id[0]] + from_clause = "from {}".format(table_name) + format_sql_3 = [] + for sub_sql in format_sql_2.split("\n"): + if "select" in sub_sql: + format_sql_3 += add_from_clase(sub_sql, from_clause) + else: + format_sql_3.append(sub_sql) + final_sql = " ".join(format_sql_3).replace("{}.".format(table_name), "") + else: + # case 1: easy case, except / union / intersect + # case 2: nested queries in condition + final_sql = [] + + def postprocess_subquery(sub_query_one, schema, start_alias_id_1): + final_sub_sql = [] + sub_query = [] + for sub_sql in sub_query_one.split("\n"): + if "select" in sub_sql: + if len(sub_query) > 0: + sub_query = "\n".join(sub_query) + sub_query, start_alias_id_1 = postprocess_single(sub_query, schema, start_alias_id_1) + final_sub_sql.append(sub_query) + sub_query = [] + sub_query.append(sub_sql) + else: + sub_query.append(sub_sql) + if len(sub_query) > 
0: + sub_query = "\n".join(sub_query) + sub_query, start_alias_id_1 = postprocess_single(sub_query, schema, start_alias_id_1) + final_sub_sql.append(sub_query) + + final_sub_sql = " ".join(final_sub_sql) + return final_sub_sql, False, start_alias_id_1 + + start_alias_id = 0 + sub_query = [] + for sub_sql in format_sql_2.split("\n"): + if "except" in sub_sql or "union" in sub_sql or "intersect" in sub_sql: + sub_query = "\n".join(sub_query) + sub_query, _, start_alias_id = postprocess_subquery(sub_query, schema, start_alias_id) + final_sql.append(sub_query) + final_sql.append(sub_sql) + sub_query = [] + else: + sub_query.append(sub_sql) + if len(sub_query) > 0: + sub_query = "\n".join(sub_query) + sub_query, _, start_alias_id = postprocess_subquery(sub_query, schema, start_alias_id) + final_sql.append(sub_query) + + final_sql = " ".join(final_sql) + + # special case of from a subquery + final_sql = final_sql.replace("select count ( * ) (", "select count ( * ) from (") + + return final_sql + + +def postprocess_one(pred_sql, schema): + pred_sql = ( + pred_sql.replace("group_by", "group by") + .replace("order_by", "order by") + .replace("limit_value", "limit 1") + .replace("_EOS", "") + .replace(" value ", " 1 ") + .replace("distinct", "") + .strip(",") + .strip() + ) + if pred_sql.endswith("value"): + pred_sql = pred_sql[: -len("value")] + "1" + + try: + format_sql = sqlparse.format(pred_sql, reindent=True) + except Exception: + return pred_sql + format_sql_2 = normalize_space(format_sql) + + num_select = format_sql_2.count("select") + + if num_select > 1: + final_sql = postprocess_nested(format_sql_2, schema) + else: + final_sql, _ = postprocess_single(format_sql_2, schema) + + return final_sql + + +def postprocess(predictions, database_schema, remove_from=False): + correct = 0 + total = 0 + postprocess_sqls = {} + + for pred in predictions: + db_id = pred["database_id"] + schema = database_schema[db_id] + if db_id not in postprocess_sqls: + postprocess_sqls[db_id] = [] + + interaction_id = pred["interaction_id"] + turn_id = pred["index_in_interaction"] + total += 1 + + pred_sql_str = " ".join(pred["flat_prediction"]) + + gold_sql_str = " ".join(pred["flat_gold_queries"][0]) + if pred_sql_str == gold_sql_str: + correct += 1 + + postprocess_sql = pred_sql_str + if remove_from: + postprocess_sql = postprocess_one(pred_sql_str, schema) + + postprocess_sqls[db_id].append((postprocess_sql, interaction_id, turn_id)) + + # print (correct, total, float(correct)/total) + return postprocess_sqls + + +def read_prediction(pred_file): + print("Read prediction from", pred_file) + predictions = [] + with open(pred_file) as f: + for line in f: + pred = json.loads(line) + predictions.append(pred) + print("Number of predictions", len(predictions)) + return predictions + + +def read_schema(table_schema_path): + with open(table_schema_path) as f: + database_schema = json.load(f) + + database_schema_dict = {} + for table_schema in database_schema: + db_id = table_schema["db_id"] + database_schema_dict[db_id] = table_schema + + return database_schema_dict + + +def write_and_evaluate(postprocess_sqls, db_path, table_schema_path, gold_path, dataset): + db_list = [] + with open(gold_path) as f: + for line in f: + line_split = line.strip().split("\t") + if len(line_split) != 2: + continue + db = line.strip().split("\t")[1] + if db not in db_list: + db_list.append(db) + + output_file = "output_temp.txt" + if dataset == "spider": + with open(output_file, "w") as f: + for db in db_list: + for postprocess_sql, 
interaction_id, turn_id in postprocess_sqls[db]: + f.write(postprocess_sql + "\n") + + command = "python3 eval_scripts/evaluation.py --db {} --table {} --etype match --gold {} --pred {}".format( + db_path, table_schema_path, gold_path, os.path.abspath(output_file) + ) + elif dataset in ["sparc", "cosql"]: + cnt = 0 + with open(output_file, "w") as f: + for db in db_list: + for postprocess_sql, interaction_id, turn_id in postprocess_sqls[db]: + if turn_id == 0 and cnt > 0: + f.write("\n") + f.write("{}\n".format(postprocess_sql)) + cnt += 1 + + command = "python eval_scripts/evaluation_sqa.py --db {} --table {} --etype match --gold {} --pred {}".format( + db_path, table_schema_path, gold_path, os.path.abspath(output_file) + ) + command += "; rm output_temp.txt" + return command + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--dataset", choices=("spider", "sparc", "cosql"), default="sparc") + parser.add_argument("--split", type=str, default="dev") + parser.add_argument("--pred_file", type=str, default="") + parser.add_argument("--remove_from", action="store_true", default=False) + args = parser.parse_args() + + db_path = "data/database/" + if args.dataset == "spider": + table_schema_path = "data/spider/tables.json" + if args.split == "dev": + gold_path = "data/spider/dev_gold.sql" + elif args.dataset == "sparc": + table_schema_path = "data/sparc/tables.json" + if args.split == "dev": + gold_path = "data/sparc/dev_gold.txt" + elif args.dataset == "cosql": + table_schema_path = "data/cosql/tables.json" + if args.split == "dev": + gold_path = "data/cosql/dev_gold.txt" + + pred_file = args.pred_file + + database_schema = read_schema(table_schema_path) + predictions = read_prediction(pred_file) + postprocess_sqls = postprocess(predictions, database_schema, args.remove_from) + + command = write_and_evaluate(postprocess_sqls, db_path, table_schema_path, gold_path, args.dataset) + + eval_output = subprocess.check_output(command, stderr=subprocess.STDOUT, shell=True) + with open(pred_file + ".eval", "w") as f: + f.write(eval_output.decode("utf-8")) + print("Eval result in", pred_file + ".eval") diff --git a/examples/text_to_sql/IGSQL/preprocess.py b/examples/text_to_sql/IGSQL/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..9fceaf9e9c3fcc401b1acda6643e215943d8696e --- /dev/null +++ b/examples/text_to_sql/IGSQL/preprocess.py @@ -0,0 +1,759 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
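+
+"""Preprocesses the raw text-to-SQL data into the interaction format (JSON and pickle) used for training."""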
+ +import argparse +import json +import os +import pickle +import shutil + +import sqlparse +from postprocess_eval import get_candidate_tables + + +def write_interaction(interaction_list, split, output_dir): + json_split = os.path.join(output_dir, split + ".json") + pkl_split = os.path.join(output_dir, split + ".pkl") + + with open(json_split, "w") as outfile: + for interaction in interaction_list: + json.dump(interaction, outfile, indent=4) + outfile.write("\n") + + new_objs = [] + for i, obj in enumerate(interaction_list): + new_interaction = [] + for ut in obj["interaction"]: + sql = ut["sql"] + sqls = [sql] + tok_sql_list = [] + for sql in sqls: + results = [] + tokenized_sql = sql.split() + tok_sql_list.append((tokenized_sql, results)) + ut["sql"] = tok_sql_list + new_interaction.append(ut) + obj["interaction"] = new_interaction + new_objs.append(obj) + + with open(pkl_split, "wb") as outfile: + pickle.dump(new_objs, outfile) + + return + + +def read_database_schema(database_schema_filename, schema_tokens, column_names, database_schemas_dict): + with open(database_schema_filename) as f: + database_schemas = json.load(f) + + def get_schema_tokens(table_schema): + column_names_surface_form = [] + column_names = [] + column_names_original = table_schema["column_names_original"] + table_names_original = table_schema["table_names_original"] + for i, (table_id, column_name) in enumerate(column_names_original): + if table_id >= 0: + table_name = table_names_original[table_id] + column_name_surface_form = "{}.{}".format(table_name, column_name) + else: + # this is just * + column_name_surface_form = column_name + column_names_surface_form.append(column_name_surface_form.lower()) + column_names.append(column_name.lower()) + + # also add table_name.* + for table_name in table_names_original: + column_names_surface_form.append("{}.*".format(table_name.lower())) + + return column_names_surface_form, column_names + + for table_schema in database_schemas: + database_id = table_schema["db_id"] + database_schemas_dict[database_id] = table_schema + schema_tokens[database_id], column_names[database_id] = get_schema_tokens(table_schema) + + return schema_tokens, column_names, database_schemas_dict + + +def remove_from_with_join(format_sql_2): + used_tables_list = [] + format_sql_3 = [] + table_to_name = {} + table_list = [] + old_table_to_name = {} + old_table_list = [] + for sub_sql in format_sql_2.split("\n"): + if "select " in sub_sql: + # only replace alias: t1 -> table_name, t2 -> table_name, etc... 
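+            # Apply the alias map collected from the previous from clause to the lines
+            # gathered so far before this new sub-select resets it.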
+ if len(table_list) > 0: + for i in range(len(format_sql_3)): + for table, name in table_to_name.items(): + format_sql_3[i] = format_sql_3[i].replace(table, name) + + old_table_list = table_list + old_table_to_name = table_to_name + table_to_name = {} + table_list = [] + format_sql_3.append(sub_sql) + elif sub_sql.startswith("from"): + new_sub_sql = None + sub_sql_tokens = sub_sql.split() + for t_i, t in enumerate(sub_sql_tokens): + if t == "as": + table_to_name[sub_sql_tokens[t_i + 1]] = sub_sql_tokens[t_i - 1] + table_list.append(sub_sql_tokens[t_i - 1]) + elif t == ")" and new_sub_sql is None: + # new_sub_sql keeps some trailing parts after ')' + new_sub_sql = " ".join(sub_sql_tokens[t_i:]) + if len(table_list) > 0: + # if it's a from clause with join + if new_sub_sql is not None: + format_sql_3.append(new_sub_sql) + + used_tables_list.append(table_list) + else: + # if it's a from clause without join + table_list = old_table_list + table_to_name = old_table_to_name + assert "join" not in sub_sql + if new_sub_sql is not None: + sub_sub_sql = sub_sql[: -len(new_sub_sql)].strip() + assert len(sub_sub_sql.split()) == 2 + used_tables_list.append([sub_sub_sql.split()[1]]) + format_sql_3.append(sub_sub_sql) + format_sql_3.append(new_sub_sql) + elif "join" not in sub_sql: + assert len(sub_sql.split()) == 2 or len(sub_sql.split()) == 1 + if len(sub_sql.split()) == 2: + used_tables_list.append([sub_sql.split()[1]]) + + format_sql_3.append(sub_sql) + else: + print("bad from clause in remove_from_with_join") + exit() + else: + format_sql_3.append(sub_sql) + + if len(table_list) > 0: + for i in range(len(format_sql_3)): + for table, name in table_to_name.items(): + format_sql_3[i] = format_sql_3[i].replace(table, name) + + used_tables = [] + for t in used_tables_list: + for tt in t: + used_tables.append(tt) + used_tables = list(set(used_tables)) + + return format_sql_3, used_tables, used_tables_list + + +def remove_from_without_join(format_sql_3, column_names, schema_tokens): + format_sql_4 = [] + table_name = None + for sub_sql in format_sql_3.split("\n"): + if "select " in sub_sql: + if table_name: + for i in range(len(format_sql_4)): + tokens = format_sql_4[i].split() + for ii, token in enumerate(tokens): + if token in column_names and tokens[ii - 1] != ".": + if ( + ii + 1 < len(tokens) and tokens[ii + 1] != "." and tokens[ii + 1] != "(" + ) or ii + 1 == len(tokens): + if "{}.{}".format(table_name, token) in schema_tokens: + tokens[ii] = "{} . {}".format(table_name, token) + format_sql_4[i] = " ".join(tokens) + + format_sql_4.append(sub_sql) + elif sub_sql.startswith("from"): + sub_sql_tokens = sub_sql.split() + if len(sub_sql_tokens) == 1: + table_name = None + elif len(sub_sql_tokens) == 2: + table_name = sub_sql_tokens[1] + else: + print("bad from clause in remove_from_without_join") + print(format_sql_3) + exit() + else: + format_sql_4.append(sub_sql) + + if table_name: + for i in range(len(format_sql_4)): + tokens = format_sql_4[i].split() + for ii, token in enumerate(tokens): + if token in column_names and tokens[ii - 1] != ".": + if (ii + 1 < len(tokens) and tokens[ii + 1] != "." and tokens[ii + 1] != "(") or ii + 1 == len( + tokens + ): + if "{}.{}".format(table_name, token) in schema_tokens: + tokens[ii] = "{} . 
{}".format(table_name, token) + format_sql_4[i] = " ".join(tokens) + + return format_sql_4 + + +def add_table_name(format_sql_3, used_tables, column_names, schema_tokens): + # If just one table used, easy case, replace all column_name -> table_name.column_name + if len(used_tables) == 1: + table_name = used_tables[0] + format_sql_4 = [] + for sub_sql in format_sql_3.split("\n"): + if sub_sql.startswith("from"): + format_sql_4.append(sub_sql) + continue + + tokens = sub_sql.split() + for ii, token in enumerate(tokens): + if token in column_names and tokens[ii - 1] != ".": + if (ii + 1 < len(tokens) and tokens[ii + 1] != "." and tokens[ii + 1] != "(") or ii + 1 == len( + tokens + ): + if "{}.{}".format(table_name, token) in schema_tokens: + tokens[ii] = "{} . {}".format(table_name, token) + format_sql_4.append(" ".join(tokens)) + return format_sql_4 + + def get_table_name_for(token): + table_names = [] + for table_name in used_tables: + if "{}.{}".format(table_name, token) in schema_tokens: + table_names.append(table_name) + if len(table_names) == 0: + return "table" + if len(table_names) > 1: + return None + else: + return table_names[0] + + format_sql_4 = [] + for sub_sql in format_sql_3.split("\n"): + if sub_sql.startswith("from"): + format_sql_4.append(sub_sql) + continue + + tokens = sub_sql.split() + for ii, token in enumerate(tokens): + # skip * + if token == "*": + continue + if token in column_names and tokens[ii - 1] != ".": + if (ii + 1 < len(tokens) and tokens[ii + 1] != "." and tokens[ii + 1] != "(") or ii + 1 == len(tokens): + table_name = get_table_name_for(token) + if table_name: + tokens[ii] = "{} . {}".format(table_name, token) + format_sql_4.append(" ".join(tokens)) + + return format_sql_4 + + +def check_oov(format_sql_final, output_vocab, schema_tokens): + for sql_tok in format_sql_final.split(): + if not (sql_tok in schema_tokens or sql_tok in output_vocab): + print("OOV!", sql_tok) + raise Exception("OOV") + + +def normalize_space(format_sql): + format_sql_1 = [ + " ".join( + sub_sql.strip().replace(",", " , ").replace(".", " . ").replace("(", " ( ").replace(")", " ) ").split() + ) + for sub_sql in format_sql.split("\n") + ] + format_sql_1 = "\n".join(format_sql_1) + + format_sql_2 = ( + format_sql_1.replace("\njoin", " join") + .replace(",\n", ", ") + .replace(" where", "\nwhere") + .replace(" intersect", "\nintersect") + .replace("\nand", " and") + .replace("order by t2 .\nstart desc", "order by t2 . start desc") + ) + + format_sql_2 = ( + format_sql_2.replace("select\noperator", "select operator") + .replace("select\nconstructor", "select constructor") + .replace("select\nstart", "select start") + .replace("select\ndrop", "select drop") + .replace("select\nwork", "select work") + .replace("select\ngroup", "select group") + .replace("select\nwhere_built", "select where_built") + .replace("select\norder", "select order") + .replace("from\noperator", "from operator") + .replace("from\nforward", "from forward") + .replace("from\nfor", "from for") + .replace("from\ndrop", "from drop") + .replace("from\norder", "from order") + .replace(".\nstart", ". start") + .replace(".\norder", ". order") + .replace(".\noperator", ". operator") + .replace(".\nsets", ". sets") + .replace(".\nwhere_built", ". where_built") + .replace(".\nwork", ". work") + .replace(".\nconstructor", ". constructor") + .replace(".\ngroup", ". group") + .replace(".\nfor", ". for") + .replace(".\ndrop", ". drop") + .replace(".\nwhere", ". 
where") + ) + + format_sql_2 = ( + format_sql_2.replace("group by", "group_by") + .replace("order by", "order_by") + .replace("! =", "!=") + .replace("limit value", "limit_value") + ) + return format_sql_2 + + +def normalize_final_sql(format_sql_5): + format_sql_final = ( + format_sql_5.replace("\n", " ") + .replace(" . ", ".") + .replace("group by", "group_by") + .replace("order by", "order_by") + .replace("! =", "!=") + .replace("limit value", "limit_value") + ) + + # normalize two bad sqls + if "t1" in format_sql_final or "t2" in format_sql_final or "t3" in format_sql_final or "t4" in format_sql_final: + format_sql_final = format_sql_final.replace("t2.dormid", "dorm.dormid") + + # This is the failure case of remove_from_without_join() + format_sql_final = format_sql_final.replace( + "select city.city_name where city.state_name in ( select state.state_name where state.state_name in ( select river.traverse where river.river_name = value ) and state.area = ( select min ( state.area ) where state.state_name in ( select river.traverse where river.river_name = value ) ) ) order_by population desc limit_value", + "select city.city_name where city.state_name in ( select state.state_name where state.state_name in ( select river.traverse where river.river_name = value ) and state.area = ( select min ( state.area ) where state.state_name in ( select river.traverse where river.river_name = value ) ) ) order_by city.population desc limit_value", + ) + + return format_sql_final + + +def parse_sql(sql_string, db_id, column_names, output_vocab, schema_tokens, schema): + format_sql = sqlparse.format(sql_string, reindent=True) + format_sql_2 = normalize_space(format_sql) + + format_sql_3, used_tables, used_tables_list = remove_from_with_join(format_sql_2) + + format_sql_3 = "\n".join(format_sql_3) + format_sql_4 = add_table_name(format_sql_3, used_tables, column_names, schema_tokens) + + format_sql_4 = "\n".join(format_sql_4) + format_sql_5 = remove_from_without_join(format_sql_4, column_names, schema_tokens) + + format_sql_5 = "\n".join(format_sql_5) + format_sql_final = normalize_final_sql(format_sql_5) + + candidate_tables_id, table_names_original = get_candidate_tables(format_sql_final, schema) + + check_oov(format_sql_final, output_vocab, schema_tokens) + + return format_sql_final + + +def read_spider_split( + split_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from +): + with open(split_json) as f: + split_data = json.load(f) + print("read_spider_split", split_json, len(split_data)) + + for i, ex in enumerate(split_data): + db_id = ex["db_id"] + + final_sql = [] + skip = False + for query_tok in ex["query_toks_no_value"]: + if query_tok != "." and "." in query_tok: + # invalid sql; didn't use table alias in join + final_sql += query_tok.replace(".", " . 
").split() + skip = True + else: + final_sql.append(query_tok) + final_sql = " ".join(final_sql) + + if skip and "train" in split_json: + continue + + if remove_from: + final_sql_parse = parse_sql( + final_sql, db_id, column_names[db_id], output_vocab, schema_tokens[db_id], database_schemas[db_id] + ) + else: + final_sql_parse = final_sql + + final_utterance = " ".join(ex["question_toks"]) + + if db_id not in interaction_list: + interaction_list[db_id] = [] + + interaction = {} + interaction["id"] = "" + interaction["scenario"] = "" + interaction["database_id"] = db_id + interaction["interaction_id"] = len(interaction_list[db_id]) + interaction["final"] = {} + interaction["final"]["utterance"] = final_utterance + interaction["final"]["sql"] = final_sql_parse + interaction["interaction"] = [{"utterance": final_utterance, "sql": final_sql_parse}] + interaction_list[db_id].append(interaction) + + return interaction_list + + +def read_data_json( + split_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from +): + with open(split_json) as f: + split_data = json.load(f) + print("read_data_json", split_json, len(split_data)) + + for interaction_data in split_data: + db_id = interaction_data["database_id"] + final_sql = interaction_data["final"]["query"] + final_utterance = interaction_data["final"]["utterance"] + + if db_id not in interaction_list: + interaction_list[db_id] = [] + + # no interaction_id in train + if "interaction_id" in interaction_data["interaction"]: + interaction_id = interaction_data["interaction"]["interaction_id"] + else: + interaction_id = len(interaction_list[db_id]) + + interaction = {} + interaction["id"] = "" + interaction["scenario"] = "" + interaction["database_id"] = db_id + interaction["interaction_id"] = interaction_id + interaction["final"] = {} + interaction["final"]["utterance"] = final_utterance + interaction["final"]["sql"] = final_sql + interaction["interaction"] = [] + + for turn in interaction_data["interaction"]: + turn_sql = [] + skip = False + for query_tok in turn["query_toks_no_value"]: + if query_tok != "." and "." in query_tok: + # invalid sql; didn't use table alias in join + turn_sql += query_tok.replace(".", " . ").split() + skip = True + else: + turn_sql.append(query_tok) + turn_sql = " ".join(turn_sql) + + # Correct some human sql annotation error + turn_sql = turn_sql.replace( + "select f_id from files as t1 join song as t2 on t1 . f_id = t2 . f_id", + "select t1 . f_id from files as t1 join song as t2 on t1 . f_id = t2 . f_id", + ) + turn_sql = turn_sql.replace("select name from climber mountain", "select name from climber") + turn_sql = turn_sql.replace( + "select sid from sailors as t1 join reserves as t2 on t1 . sid = t2 . sid join boats as t3 on t3 . bid = t2 . bid", + "select t1 . sid from sailors as t1 join reserves as t2 on t1 . sid = t2 . sid join boats as t3 on t3 . bid = t2 . 
bid", + ) + turn_sql = turn_sql.replace("select avg ( price ) from goods )", "select avg ( price ) from goods") + turn_sql = turn_sql.replace( + "select min ( annual_fuel_cost ) , from vehicles", "select min ( annual_fuel_cost ) from vehicles" + ) + turn_sql = turn_sql.replace( + "select * from goods where price < ( select avg ( price ) from goods", + "select * from goods where price < ( select avg ( price ) from goods )", + ) + turn_sql = turn_sql.replace( + "select distinct id , price from goods where price < ( select avg ( price ) from goods", + "select distinct id , price from goods where price < ( select avg ( price ) from goods )", + ) + turn_sql = turn_sql.replace( + "select id from goods where price > ( select avg ( price ) from goods", + "select id from goods where price > ( select avg ( price ) from goods )", + ) + + if skip and "train" in split_json: + continue + + if remove_from: + try: + turn_sql_parse = parse_sql( + turn_sql, + db_id, + column_names[db_id], + output_vocab, + schema_tokens[db_id], + database_schemas[db_id], + ) + except Exception: + print("continue") + continue + else: + turn_sql_parse = turn_sql + + if "utterance_toks" in turn: + turn_utterance = " ".join(turn["utterance_toks"]) # not lower() + else: + turn_utterance = turn["utterance"] + + interaction["interaction"].append({"utterance": turn_utterance, "sql": turn_sql_parse}) + if len(interaction["interaction"]) > 0: + interaction_list[db_id].append(interaction) + + return interaction_list + + +def read_spider(spider_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from): + interaction_list = {} + + train_json = os.path.join(spider_dir, "train.json") + interaction_list = read_spider_split( + train_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + dev_json = os.path.join(spider_dir, "dev.json") + interaction_list = read_spider_split( + dev_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + return interaction_list + + +def read_sparc(sparc_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from): + interaction_list = {} + + train_json = os.path.join(sparc_dir, "train_no_value.json") + interaction_list = read_data_json( + train_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + dev_json = os.path.join(sparc_dir, "dev_no_value.json") + interaction_list = read_data_json( + dev_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + return interaction_list + + +def read_cosql(cosql_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from): + interaction_list = {} + + train_json = os.path.join(cosql_dir, "train.json") + interaction_list = read_data_json( + train_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + dev_json = os.path.join(cosql_dir, "dev.json") + interaction_list = read_data_json( + dev_json, interaction_list, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + return interaction_list + + +def read_db_split(data_dir): + train_database = [] + with open(os.path.join(data_dir, "train_db_ids.txt")) as f: + for line in f: + train_database.append(line.strip()) + + dev_database = [] + with open(os.path.join(data_dir, "dev_db_ids.txt")) as f: + for line in f: + dev_database.append(line.strip()) + + return train_database, dev_database + + +def 
preprocess(dataset, remove_from=False): + # Validate output_vocab + output_vocab = [ + "_UNK", + "_EOS", + ".", + "t1", + "t2", + "=", + "select", + "from", + "as", + "value", + "join", + "on", + ")", + "(", + "where", + "t3", + "by", + ",", + "count", + "group", + "order", + "distinct", + "t4", + "and", + "limit", + "desc", + ">", + "avg", + "having", + "max", + "in", + "<", + "sum", + "t5", + "intersect", + "not", + "min", + "except", + "or", + "asc", + "like", + "!", + "union", + "between", + "t6", + "-", + "t7", + "+", + "/", + ] + if remove_from: + output_vocab = [ + "_UNK", + "_EOS", + "=", + "select", + "value", + ")", + "(", + "where", + ",", + "count", + "group_by", + "order_by", + "distinct", + "and", + "limit_value", + "limit", + "desc", + ">", + "avg", + "having", + "max", + "in", + "<", + "sum", + "intersect", + "not", + "min", + "except", + "or", + "asc", + "like", + "!=", + "union", + "between", + "-", + "+", + "/", + ] + print("size of output_vocab", len(output_vocab)) + print("output_vocab", output_vocab) + print() + + if dataset == "spider": + spider_dir = "data/spider/" + database_schema_filename = "data/spider/tables.json" + output_dir = "data/spider_data" + if remove_from: + output_dir = "data/spider_data_removefrom" + train_database, dev_database = read_db_split(spider_dir) + elif dataset == "sparc": + sparc_dir = "data/sparc/" + database_schema_filename = "data/sparc/tables.json" + output_dir = "data/sparc_data" + if remove_from: + output_dir = "data/sparc_data_removefrom" + train_database, dev_database = read_db_split(sparc_dir) + elif dataset == "cosql": + cosql_dir = "data/cosql/" + database_schema_filename = "data/cosql/tables.json" + output_dir = "data/cosql_data" + if remove_from: + output_dir = "data/cosql_data_removefrom" + train_database, dev_database = read_db_split(cosql_dir) + + if os.path.isdir(output_dir): + shutil.rmtree(output_dir) + os.mkdir(output_dir) + + schema_tokens = {} + column_names = {} + database_schemas = {} + + print("Reading spider database schema file") + schema_tokens, column_names, database_schemas = read_database_schema( + database_schema_filename, schema_tokens, column_names, database_schemas + ) + num_database = len(schema_tokens) + print("num_database", num_database, len(train_database), len(dev_database)) + print("total number of schema_tokens / databases:", len(schema_tokens)) + + output_database_schema_filename = os.path.join(output_dir, "tables.json") + with open(output_database_schema_filename, "w") as outfile: + json.dump([v for k, v in database_schemas.items()], outfile, indent=4) + + if dataset == "spider": + interaction_list = read_spider( + spider_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + elif dataset == "sparc": + interaction_list = read_sparc( + sparc_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + elif dataset == "cosql": + interaction_list = read_cosql( + cosql_dir, database_schemas, column_names, output_vocab, schema_tokens, remove_from + ) + + print("interaction_list length", len(interaction_list)) + + train_interaction = [] + for database_id in interaction_list: + if database_id not in dev_database: + train_interaction += interaction_list[database_id] + + dev_interaction = [] + for database_id in dev_database: + if database_id in interaction_list.keys(): + dev_interaction += interaction_list[database_id] + + print("train interaction: ", len(train_interaction)) + print("dev interaction: ", len(dev_interaction)) + + 
write_interaction(train_interaction, "train", output_dir) + write_interaction(dev_interaction, "dev", output_dir) + + return + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--dataset", choices=("spider", "sparc", "cosql"), default="sparc") + parser.add_argument("--remove_from", action="store_true", default=False) + args = parser.parse_args() + preprocess(args.dataset, args.remove_from) diff --git a/examples/text_to_sql/IGSQL/requirements.txt b/examples/text_to_sql/IGSQL/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..07db382b82e06152829025cf2d1008adb660e19f --- /dev/null +++ b/examples/text_to_sql/IGSQL/requirements.txt @@ -0,0 +1,4 @@ +sqlparse +pymysql +progressbar +nltk diff --git a/examples/text_to_sql/IGSQL/run.py b/examples/text_to_sql/IGSQL/run.py new file mode 100644 index 0000000000000000000000000000000000000000..bf9923c5dcd25f1dec039c377b9796890a2d2192 --- /dev/null +++ b/examples/text_to_sql/IGSQL/run.py @@ -0,0 +1,335 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains a main function for training and/or evaluating a model.""" + +import os +import sys + +sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) +import random # noqa: E402 + +import numpy as np # noqa: E402 +import paddle # noqa: E402 +from data_util import atis_data # noqa: E402 +from logger import Logger # noqa: E402 +from model.schema_interaction_model import SchemaInteractionATISModel # noqa: E402 +from model_util import ( # noqa: E402 + Metrics, + evaluate_interaction_sample, + evaluate_using_predicted_queries, + evaluate_utterance_sample, + train_epoch_with_interactions, + train_epoch_with_utterances, +) +from parse_args import interpret_args # noqa: E402 + +np.random.seed(0) +np.set_printoptions(16) +random.seed(0) + +VALID_EVAL_METRICS = [Metrics.LOSS, Metrics.TOKEN_ACCURACY, Metrics.STRING_ACCURACY] +TRAIN_EVAL_METRICS = [Metrics.LOSS, Metrics.TOKEN_ACCURACY, Metrics.STRING_ACCURACY] +FINAL_EVAL_METRICS = [Metrics.STRING_ACCURACY, Metrics.TOKEN_ACCURACY] + + +def train(model, data, params): + """Trains a model. + + Args: + model (ATISModel): The model to train. + data (ATISData): The data that is used to train. + params (namespace): Training parameters. + """ + # Get the training batches. 
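+    # When params.interaction_level is set, batch size is forced to 1 and the
+    # sampling/evaluation helpers below are swapped for their interaction-level
+    # counterparts, so a "batch" is one full dialogue rather than a single utterance.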
+ log = Logger(os.path.join(params.logdir, params.logfile), "w") + num_train_original = atis_data.num_utterances(data.train_data) + log.put("Original number of training utterances:\t" + str(num_train_original)) + + eval_fn = evaluate_utterance_sample + trainbatch_fn = data.get_utterance_batches + trainsample_fn = data.get_random_utterances + validsample_fn = data.get_all_utterances + batch_size = params.batch_size + if params.interaction_level: + batch_size = 1 + eval_fn = evaluate_interaction_sample + trainbatch_fn = data.get_interaction_batches + trainsample_fn = data.get_random_interactions + validsample_fn = data.get_all_interactions + + maximum_output_length = params.train_maximum_sql_length + train_batches = trainbatch_fn( + batch_size, max_output_length=maximum_output_length, randomize=not params.deterministic + ) + + if params.num_train >= 0: + train_batches = train_batches[: params.num_train] + + training_sample = trainsample_fn(params.train_evaluation_size, max_output_length=maximum_output_length) + valid_examples = validsample_fn(data.valid_data, max_output_length=maximum_output_length) + + num_train_examples = sum([len(batch) for batch in train_batches]) + num_steps_per_epoch = len(train_batches) + + log.put("Actual number of used training examples:\t" + str(num_train_examples)) + log.put("(Shortened by output limit of " + str(maximum_output_length) + ")") + log.put("Number of steps per epoch:\t" + str(num_steps_per_epoch)) + log.put("Batch size:\t" + str(batch_size)) + + print("Kept " + str(num_train_examples) + "/" + str(num_train_original) + " examples") + print("Batch size of " + str(batch_size) + " gives " + str(num_steps_per_epoch) + " steps per epoch") + + # Keeping track of things during training. + epochs = 0 + patience = params.initial_patience + learning_rate_coefficient = 1.0 + previous_epoch_loss = float("inf") + previous_valid_acc = 0.0 + maximum_string_accuracy = 0.0 + + countdown = int(patience) + + keep_training = True + step = 0 + + # init learning_rate + model.set_learning_rate(params.initial_learning_rate) + + while keep_training: + log.put("Epoch:\t" + str(epochs)) + model.set_dropout(params.dropout_amount) + model.train() + + if not params.scheduler: + model.set_learning_rate(learning_rate_coefficient * params.initial_learning_rate) + + # Run a training step. + if params.interaction_level: + epoch_loss, step = train_epoch_with_interactions( + train_batches, + params, + model, + randomize=not params.deterministic, + db2id=data.db2id, + id2db=data.id2db, + step=step, + ) + else: + epoch_loss = train_epoch_with_utterances(train_batches, model, randomize=not params.deterministic) + + log.put("train epoch loss:\t" + str(epoch_loss)) + + model.set_dropout(0.0) + + model.eval() + + with paddle.no_grad(): + + # Run an evaluation step on a sample of the training data. + train_eval_results = eval_fn( + training_sample, + model, + params.train_maximum_sql_length, + name=os.path.join(params.logdir, "train-eval"), + write_results=True, + gold_forcing=True, + metrics=TRAIN_EVAL_METRICS, + )[0] + + for name, value in train_eval_results.items(): + log.put("train final gold-passing " + name.name + ":\t" + "%.2f" % value) + + # Run an evaluation step on the validation set. 
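+            # This validation pass also runs with gold_forcing=True ("gold-passing"
+            # in the log). Its loss/token accuracy drive the learning-rate decay
+            # below, and its string accuracy controls patience and best-model saving.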
+ valid_eval_results = eval_fn( + valid_examples, + model, + params.eval_maximum_sql_length, + name=os.path.join(params.logdir, "valid-eval"), + write_results=True, + gold_forcing=True, + metrics=VALID_EVAL_METRICS, + )[0] + for name, value in valid_eval_results.items(): + log.put("valid gold-passing " + name.name + ":\t" + "%.2f" % value) + + valid_loss = valid_eval_results[Metrics.LOSS] + valid_token_accuracy = valid_eval_results[Metrics.TOKEN_ACCURACY] + string_accuracy = valid_eval_results[Metrics.STRING_ACCURACY] + + if params.scheduler: + model.scheduler.step(valid_loss) + + if ( + valid_loss > previous_epoch_loss + and valid_token_accuracy < previous_valid_acc + and step >= params.warmup_step + ): + learning_rate_coefficient *= params.learning_rate_ratio + log.put("learning rate coefficient:\t" + str(learning_rate_coefficient)) + + previous_epoch_loss = valid_loss + previous_valid_acc = valid_token_accuracy + saved = False + + if not saved and string_accuracy > maximum_string_accuracy: + maximum_string_accuracy = string_accuracy + patience = patience * params.patience_ratio + countdown = int(patience) + last_save_file = os.path.join(params.logdir, "best_model") + model.save(last_save_file) + + log.put("maximum string accuracy:\t" + str(maximum_string_accuracy)) + log.put("patience:\t" + str(patience)) + log.put("save file:\t" + str(last_save_file)) + + if countdown <= 0: + keep_training = False + + countdown -= 1 + log.put("countdown:\t" + str(countdown)) + log.put("") + + epochs += 1 + + log.put("Finished training!") + log.close() + + return last_save_file + + +def evaluate(model, data, params, last_save_file, split): + """Evaluates a pretrained model on a dataset. + + Args: + model (ATISModel): Model class. + data (ATISData): All of the data. + params (namespace): Parameters for the model. + last_save_file (str): Location where the model save file is. 
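+        split (str): Name of the data split to evaluate ("train", "dev", "valid" or "test").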
+ """ + if last_save_file: + model.load(last_save_file) + else: + if not params.save_file: + raise ValueError("Must provide a save file name if not training first.") + model.load(params.save_file) + + filename = split + + if filename == "dev": + split = data.dev_data + elif filename == "train": + split = data.train_data + elif filename == "test": + split = data.test_data + elif filename == "valid": + split = data.valid_data + else: + raise ValueError("Split not recognized: " + str(params.evaluate_split)) + + if params.use_predicted_queries: + filename += "_use_predicted_queries" + else: + filename += "_use_gold_queries" + + full_name = os.path.join(params.logdir, filename) + params.results_note + + if params.interaction_level or params.use_predicted_queries: + examples = data.get_all_interactions(split) + if params.interaction_level: + evaluate_interaction_sample( + examples, + model, + name=full_name, + metrics=FINAL_EVAL_METRICS, + total_num=atis_data.num_utterances(split), + database_username=params.database_username, + database_password=params.database_password, + database_timeout=params.database_timeout, + use_predicted_queries=params.use_predicted_queries, + max_generation_length=params.eval_maximum_sql_length, + write_results=True, + use_gpu=True, + compute_metrics=params.compute_metrics, + ) + else: + evaluate_using_predicted_queries( + examples, + model, + name=full_name, + metrics=FINAL_EVAL_METRICS, + total_num=atis_data.num_utterances(split), + database_username=params.database_username, + database_password=params.database_password, + database_timeout=params.database_timeout, + ) + else: + examples = data.get_all_utterances(split) + evaluate_utterance_sample( + examples, + model, + name=full_name, + gold_forcing=False, + metrics=FINAL_EVAL_METRICS, + total_num=atis_data.num_utterances(split), + max_generation_length=params.eval_maximum_sql_length, + database_username=params.database_username, + database_password=params.database_password, + database_timeout=params.database_timeout, + write_results=True, + ) + + +def main(): + """Main function that trains and/or evaluates a model.""" + params = interpret_args() + + paddle.set_device("gpu") + + # Prepare the dataset into the proper form. + data = atis_data.ATISDataset(params) + params.num_db = len(data.db2id) + + # Construct the model object. 
+ if params.interaction_level: + model_type = SchemaInteractionATISModel + else: + print("not implemented") + exit() + + model = model_type( + params, + data.input_vocabulary, + data.output_vocabulary, + data.output_vocabulary_schema, + data.anonymizer if params.anonymize and params.anonymization_scoring else None, + ) + + model.build_optim() + + sys.stdout.flush() + + last_save_file = "" + + if params.train: + last_save_file = train(model, data, params) + if params.evaluate and "valid" in params.evaluate_split: + evaluate(model, data, params, last_save_file, split="valid") + if params.evaluate and "dev" in params.evaluate_split: + evaluate(model, data, params, last_save_file, split="dev") + if params.evaluate and "test" in params.evaluate_split: + evaluate(model, data, params, last_save_file, split="test") + + +if __name__ == "__main__": + main() diff --git a/examples/text_to_sql/RAT-SQL/README.md b/examples/text_to_sql/RAT-SQL/README.md new file mode 100644 index 0000000000000000000000000000000000000000..48ed9294db7491d1a8071913d9ec78e903be5dc1 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/README.md @@ -0,0 +1,198 @@ +# Enhanced RAT-SQL + +## Text2SQL 任务 + +语义解析是一种交互式分析技术,其将用户输入的自然语言表述转成一种指定的语义表示形式,如图表示(AMR等)、逻辑表达式(一阶逻辑表示,lambda表示等)、编程语言(SQL、python等)、数学公式等。 + +Text2SQL 是语义解析技术中的一类任务,基于给定的数据库,其将用户输入的自然语言问题转成可与数据库交互的 SQL 查询语句,实现基于数据库的自动问答能力。 + + +## 数据集 + +数据集是推动自然语言处理技术进步的基石。为了处理不同场景、不同领域的应用需求,学术界及工业界陆续开放了一些相关数据集。千言项目为了验证模型的鲁棒性、泛化性等,针对每个自然语言处理问题,均收集和整理多个开源数据集,进行统一的处理并提供统一的测评方式。 + +作为千言项目的重要任务之一,语义解析方向收集和整理了NL2SQL、CSpider和DuSQL数据集,详情可参见千言官网的语义解析任务页面。 + + +## 基线系统 + +本基线系统基于PaddlePaddle2.0动态图实现了复杂数据集上的SOTA模型[RAT-SQL](https://arxiv.org/abs/1911.04942),其核心是基于encoder-decoder框架的序列生成模型。本系统编码端使用了 ERNIE + Relation-aware Transformer对问题和数据库schema 进行编码, 解码端实现了基于语法指导的解码算法,具体算法思想请见[TRANX](https://www.aclweb.org/anthology/D18-2002.pdf)。 + +同时,为了兼容上述提及的三个数据集,我们基于RAT-SQL模型进行扩展以丰富其问题解决能力,主要包括:1)多语言处理能力;2)value识别能力。 + +该基线系统除了提供模型的训练、预测外,还提供了评估及数据处理脚本。参赛选手及相关研究者可基于此系统进行更深层的效果优化。 + +# 环境准备 +代码运行需要 Linux 主机,Python 3.7 和 PaddlePaddle 2.0 以上版本。 + +## 推荐的环境 + +* 操作系统 CentOS 7.5 +* Python 3.7.9 +* PaddlePaddle 2.0.0 + +除此之外,强烈建议使用支持 GPU 的硬件环境。 + +## PaddlePaddle + +可根据机器情况和个人需求在 PaddlePaddle 和 PaddlePaddle-GPU 中二选一安装。 +如果机器支持GPU,则建议安装GPU版本。 + +``` +# CPU 版本 +pip3 install paddlepaddle +# GPU 版本 +pip3 install paddlepaddle-gpu +``` + +更多关于 PaddlePaddle 的安装教程、使用方法等请参考[官方文档](https://www.paddlepaddle.org.cn/#quick-start). 
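+
+安装完成后,可参考以下命令快速自检 PaddlePaddle 是否安装成功、GPU 是否可用(仅为可选的检查示例):
+
+```
+python3 -c "import paddle; paddle.utils.run_check()"
+```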
+ +## 第三方 Python 库 +除 PaddlePaddle 及其依赖之外,还依赖其它第三方 Python 库,位于代码根目录的 requirements.txt 文件中。 + +可使用 pip 一键安装 + +```pip3 install -r requirements.txt``` + +# 数据准备 +运行前需要自行下载训练、测试数据。 + +``` +# 下载模型训练、测试数据 +# 得到的数据包括 DuSQL, NL2SQL 和 CSpider 三个数据集(同[千言-语义解析](https://aistudio.baidu.com/aistudio/competition/detail/47)任务的三个数据集) +bash data/download_model_data.sh + +# 下载训练好的 Text2SQL 模型 +# 得到的数据包括: +# data +# ├── trained_model +# │   ├── dusql.pdparams +# │   ├── nl2sql.pdparams +# │   ├── cspider.pdparams +bash data/download_trained_model.sh + +``` + +# 数据预处理 + +对原始数据进行格式转换、依赖信息补充等,以适配模型的输入。下面以DuSQL数据集为例进行说明。 + +## 获取 Schema Linking 结果 +将 schema linking 独立出来,以便于针对这一步进行特定优化,可有效提升模型最终的效果。 + +``` +# 训练集 +./run.sh ./script/schema_linking.py \ + -s data/DuSQL/db_schema.json \ + -c data/DuSQL/db_content.json \ + -o data/DuSQL/match_values_train.json \ + data/DuSQL/train.json --is-train + +# 开发集和测试集 +./run.sh ./script/schema_linking.py \ + -s data/DuSQL/db_schema.json \ + -c data/DuSQL/db_content.json \ + -o data/DuSQL/match_values_dev.json \ + data/DuSQL/dev.json +./run.sh ./script/schema_linking.py \ + -s data/DuSQL/db_schema.json \ + -c data/DuSQL/db_content.json \ + -o data/DuSQL/match_values_test.json \ + data/DuSQL/test.json + +``` + +需要注意的是,对于 NL2SQL 数据集,需要额外指定参数 `--sql-format nl2sql`,以便适配其简化的 SQL Json 格式。 +此参数默认取值为 'dusql',可同时兼容 DuSQL 和 CSpider 数据集。 + +## 获得模型输入 + +对 DuSQL 原始数据和Schema Linking的结果做处理,得到模型的输入,位于 data/DuSQL/preproc 目录下: +``` +./run.sh ./script/text2sql_main.py \ + --mode preproc \ + --config conf/text2sql_dusql.jsonnet \ + --data-root data/DuSQL/ \ + --is-cached false \ + --output data/DuSQL/preproc +``` + +# 运行模型 + +## 模型配置文件 + +模型运行必需的配置位于conf下,默认提供的配置包括:text2sql_dusql.jsonnet, text2sql_nl2sql.jsonnet 和 text2sql_cspider.jsonnet, 分别用于 DuSQL, NL2SQL 和 CSpider 三个数据集的训练、预测等任务。 下文中如无特殊说明,则上述配置统称为 config.jsonnet。 + +## 训练 + +以训练DuSQL 模型为例 + +``` +bash ./train.sh 1 output/dusql_baseline --config conf/text2sql_dusql.jsonnet --data-root data/DuSQL/preproc +``` + +参数说明: +* 1 表示并发数,代码会自动选取剩余显存最多的卡使用,当前仅支持 1 卡训练;也可手动指定使用哪张卡,比如使用卡2,则此参数写为 cuda:2 +* output/dusql_experiment 表示训练过程保存的模型、预测开发集的结果等保存的目录,按需设置即可 +* --config conf/text2sql_dusql.jsonnet 指定本次任务的核心配置。注意 text2sql_dusql.jsonnet 需要替换为特定的配置文件 +* --data-root: 指定数据集的根目录。也可通过 --train-set/--dev-set/--test-set/--db 分别指定不同文件的路径 + +全部的参数可通过 `bash ./run.sh ./script/text2sql_main.py -h` 查看。其中常用参数: +* --pretrain-model: 指定 ERNIE 预训练模型目录 +* --batch-size: batch size 大小 +* --epochs: 总的训练轮数 +* --init-model-params: 热启模型的文件路径 + +命令行参数的优先级高于配置文件,即如果在命令行指定了config文件包含的参数,则以命令行的设置为准。 + +### 训练阶段的输出日志 +训练过程会输出loss、acc相关日志,日志会同时输出到屏幕和 --output 参数指定目录下的 train.log 文件中。 +内容类似: +``` +[train] epoch 1, batch 600. loss is 34.1222593689. cost 442.40s +[train] epoch 1, batch 700. loss is 33.3783876610. cost 424.55s +[train] epoch 1/30 loss is 34.777802, cost 2826.10s. +[eval] dev loss 0.000000, acc 1.0000. got best and saved. 
cost [27.94s] +``` +其中,间隔多少steps输出一次日志在conf中设置(train.log_steps),也可通过命令行参数指定(--log-steps)。 +为了提升训练速度,并非每个 epoch 结束都会执行评估,所以 eval 一行的 acc 实际中使用了 epoch 代替。 + +## 预测 + +以预测 DuSQL 开发集为例,结果保存到 output/dusql_dev_infer_result.json。 + +``` +bash ./run.sh ./script/text2sql_main.py --mode infer \ + --config conf/text2sql_dusql.jsonnet \ + --data-root data/DuSQL/preproc \ + --test-set data/DuSQL/preproc/dev.pkl \ + --init-model-param output/dusql_baseline/....../model.pdparams \ + --output output/dusql_dev_infer_result.sql +``` +其中的 --init-model-param 参数请修改为真实的模型路径。 + +## 评估 + +同样以 DuSQL 开发集的预测结果为例。 + +``` +python ./evaluation/text2sql_evaluation.py \ + -g data/DuSQL/gold_dev.sql \ + -t data/DuSQL/db_schema.json \ + -d DuSQL \ + -p output/dusql_dev_infer_result.sql +``` + +注意,其中的 `data/DuSQL/gold_dev.sql` 需要开发者从 dev.json 中提取得到,格式为“question_id \t sql_query \t db_id”。 + +# 基线效果 + +评价指标:Exact Match Accuracy,即预测的SQL query与标准SQL query相等的问题占比,这里“相等”判断时忽略成分的顺序影响,即SELECT A,B和SELECT B,A是相等的。 +评估数据集:各数据集的开发集合。 +评估设置:基线系统默认代码和配置。 + +| 数据集 | 准确率(%) | +|-------- | --- | +| DuSQL | 64.3 | +| NL2SQL | 73.0 | +| CSpider | 33.6 | diff --git a/examples/text_to_sql/RAT-SQL/conf/DuSQL.asdl b/examples/text_to_sql/RAT-SQL/conf/DuSQL.asdl new file mode 100644 index 0000000000000000000000000000000000000000..0edd7a574b2fb781ac335921c82aae021d446ca5 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/conf/DuSQL.asdl @@ -0,0 +1,119 @@ +-- Assumptions: +-- 1. sql is correct +-- 2. only table name has alias +-- 3. only one intersect/union/except + +module DuSQL +{ + -- val: number(float)/string(str)/sql(dict) + val = Value(value val_id) | ValSql(sql s) | ColUnit(column col_id) + + -- col_unit: (agg_id, col_id, isDistinct(bool)) + col_unit = ( + agg_type agg_id, + -- TODO fix + column col_id + ) + + -- val_unit: (unit_op, col_unit1, col_unit2) + -- val_unit = ( + -- unit_type unit_op, + -- col_unit col_unit1, + -- col_unit col_unit2 + -- ) + val_unit = Column(col_unit col_unit1) + | Minus(col_unit col_unit1, col_unit col_unit2) + | Plus(col_unit col_unit1, col_unit col_unit2) + | Times(col_unit col_unit1, col_unit col_unit2) + | Divide(col_unit col_unit1, col_unit col_unit2) + + -- condition: [cond_unit1, 'and'/'or', cond_unit2, ...] + -- cond_unit: (agg_id, op_id, val_unit, val1, val2) + cond = And(cond left, cond right) + | Or(cond left, cond right) + | NotIn(agg_type agg_id, val_unit val_unit, val val1) + | Between(agg_type agg_id, val_unit val_unit, val val1, val val2) + | Eq(agg_type agg_id, val_unit val_unit, val val1) + | Gt(agg_type agg_id, val_unit val_unit, val val1) + | Lt(agg_type agg_id, val_unit val_unit, val val1) + | Ge(agg_type agg_id, val_unit val_unit, val val1) + | Le(agg_type agg_id, val_unit val_unit, val val1) + | Ne(agg_type agg_id, val_unit val_unit, val val1) + | In(agg_type agg_id, val_unit val_unit, val val1) + | Like(agg_type agg_id, val_unit val_unit, val val1) + + -- sql { + -- 'select': (isDistinct(bool), [(agg_id, val_unit), (agg_id, val_unit), ...]) + -- 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + -- 'where': condition + -- 'groupBy': [col_unit1, col_unit2, ...] + -- 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) + -- 'having': condition + -- 'limit': None/limit value + -- 'intersect': None/sql + -- 'except': None/sql + -- 'union': None/sql + -- } + + sql = ( + select select, + from from, + sql_where sql_where, + sql_groupby sql_groupby, + sql_orderby sql_orderby, + sql_ieu sql_ieu, + ) + + sql_where = ( + cond? 
where, + ) + + sql_groupby = ( + col_unit* group_by, + cond? having, + ) + + sql_orderby = ( + order_by? order_by, + value? limit, + ) + + sql_ieu = ( + sql? intersect, + sql? except, + sql? union, + ) + + -- 'select': ([(agg_id, val_unit), (agg_id, val_unit), ...]) + select = (agg* aggs) + agg = (agg_type agg_id, val_unit val_unit) + + -- 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + from = (table_unit* table_units, cond? conds) + -- table_unit: (table_type, col_unit/sql) + table_unit = TableUnitSql(sql s) | Table(table table_id) + + -- 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) + order_by = (order order, agg* aggs) + + -- CLAUSE_KEYWORDS = ('select', 'from', 'where', 'group', 'order', 'limit', 'intersect', 'union', 'except') + -- JOIN_KEYWORDS = ('join', 'on', 'as') + + -- WHERE_OPS = ('not', 'between', '=', '>', '<', '>=', '<=', '!=', 'in', 'like', 'is', 'exists') + -- cond_type = Between | Eq | Gt | Lt | Ge | Le | Ne | In | Like | Is | Exists + + -- UNIT_OPS = ('none', '-', '+', "*", '/') + --unit_type = NoneUnitOp | Minus | Plus | Times | Divide + + -- AGG_OPS = ('none', 'max', 'min', 'count', 'sum', 'avg') + agg_type = NoneAggOp | Max | Min | Count | Sum | Avg + + -- TABLE_TYPE = { + -- 'sql': "sql", + -- 'table_unit': "table_unit", + -- } + -- COND_OPS = ('and', 'or') + -- SQL_OPS = ('intersect', 'union', 'except') + -- ORDER_OPS = ('desc', 'asc') + order = Asc | Desc +} diff --git a/examples/text_to_sql/RAT-SQL/conf/NL2SQL.asdl b/examples/text_to_sql/RAT-SQL/conf/NL2SQL.asdl new file mode 100644 index 0000000000000000000000000000000000000000..d7c7fdbb7f6686e1a2af2a1960e8664e477e49a3 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/conf/NL2SQL.asdl @@ -0,0 +1,119 @@ +-- Assumptions: +-- 1. sql is correct +-- 2. only table name has alias +-- 3. only one intersect/union/except + +module NL2SQL +{ + -- val: number(float)/string(str)/sql(dict) + val = Value(value val_id) | ValSql(sql s) | ColUnit(column col_id) + + -- col_unit: (agg_id, col_id, isDistinct(bool)) + col_unit = ( + agg_type agg_id, + -- TODO fix + column col_id + ) + + -- val_unit: (unit_op, col_unit1, col_unit2) + -- val_unit = ( + -- unit_type unit_op, + -- col_unit col_unit1, + -- col_unit col_unit2 + -- ) + val_unit = Column(col_unit col_unit1) + | Minus(col_unit col_unit1, col_unit col_unit2) + | Plus(col_unit col_unit1, col_unit col_unit2) + | Times(col_unit col_unit1, col_unit col_unit2) + | Divide(col_unit col_unit1, col_unit col_unit2) + + -- condition: [cond_unit1, 'and'/'or', cond_unit2, ...] + -- cond_unit: (agg_id, op_id, val_unit, val1, val2) + cond = And(cond left, cond right) + | Or(cond left, cond right) + | NotIn(agg_type agg_id, val_unit val_unit, val val1) + | Between(agg_type agg_id, val_unit val_unit, val val1, val val2) + | Eq(agg_type agg_id, val_unit val_unit, val val1) + | Gt(agg_type agg_id, val_unit val_unit, val val1) + | Lt(agg_type agg_id, val_unit val_unit, val val1) + | Ge(agg_type agg_id, val_unit val_unit, val val1) + | Le(agg_type agg_id, val_unit val_unit, val val1) + | Ne(agg_type agg_id, val_unit val_unit, val val1) + | In(agg_type agg_id, val_unit val_unit, val val1) + | Like(agg_type agg_id, val_unit val_unit, val val1) + + -- sql { + -- 'select': (isDistinct(bool), [(agg_id, val_unit), (agg_id, val_unit), ...]) + -- 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + -- 'where': condition + -- 'groupBy': [col_unit1, col_unit2, ...] 
+ -- 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) + -- 'having': condition + -- 'limit': None/limit value + -- 'intersect': None/sql + -- 'except': None/sql + -- 'union': None/sql + -- } + + sql = ( + select select, + from from, + sql_where sql_where, + sql_groupby sql_groupby, + sql_orderby sql_orderby, + sql_ieu sql_ieu, + ) + + sql_where = ( + cond? where, + ) + + sql_groupby = ( + col_unit* group_by, + cond? having, + ) + + sql_orderby = ( + order_by? order_by, + value? limit, + ) + + sql_ieu = ( + sql? intersect, + sql? except, + sql? union, + ) + + -- 'select': ([(agg_id, val_unit), (agg_id, val_unit), ...]) + select = (agg* aggs) + agg = (agg_type agg_id, val_unit val_unit) + + -- 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + from = (table_unit* table_units, cond? conds) + -- table_unit: (table_type, col_unit/sql) + table_unit = TableUnitSql(sql s) | Table(table table_id) + + -- 'orderBy': ('asc'/'desc', [val_unit1, val_unit2, ...]) + order_by = (order order, agg* aggs) + + -- CLAUSE_KEYWORDS = ('select', 'from', 'where', 'group', 'order', 'limit', 'intersect', 'union', 'except') + -- JOIN_KEYWORDS = ('join', 'on', 'as') + + -- WHERE_OPS = ('not', 'between', '=', '>', '<', '>=', '<=', '!=', 'in', 'like', 'is', 'exists') + -- cond_type = Between | Eq | Gt | Lt | Ge | Le | Ne | In | Like | Is | Exists + + -- UNIT_OPS = ('none', '-', '+', "*", '/') + --unit_type = NoneUnitOp | Minus | Plus | Times | Divide + + -- AGG_OPS = ('none', 'avg', 'max', 'min', 'count', 'sum') + agg_type = NoneAggOp | Avg | Max | Min | Count | Sum + + -- TABLE_TYPE = { + -- 'sql': "sql", + -- 'table_unit': "table_unit", + -- } + -- COND_OPS = ('and', 'or') + -- SQL_OPS = ('intersect', 'union', 'except') + -- ORDER_OPS = ('desc', 'asc') + order = Asc | Desc +} diff --git a/examples/text_to_sql/RAT-SQL/conf/text2sql_cspider.jsonnet b/examples/text_to_sql/RAT-SQL/conf/text2sql_cspider.jsonnet new file mode 100644 index 0000000000000000000000000000000000000000..4370bae165f3c431d644492f5e5cee748ad5d0ab --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/conf/text2sql_cspider.jsonnet @@ -0,0 +1,62 @@ + +function(data_path='data/CSpider/preproc') { + general: { + mode: null, + batch_size: 16, + use_cuda: true, + is_cloud: false, + is_debug: false, + use_fp16: 0, + }, + model: { + pretrain_model_type: 'BERT', + pretrain_model: 'bert-base-multilingual-uncased', + init_model_params: null, + init_model_optim: null, + model_name: 'seq2tree_v2', + grammar_type: 'dusql_v2', + rat_layers: 8, + rat_heads: 8, + enc_value_with_col: true, + num_value_col_type: 'q_num', # cls|col_0|q_num + value_memory: true, + predict_value: false, + max_seq_len: 510, + max_question_len: 120, + max_column_num: 100, + max_table_num: 18, + max_column_tokens: 50, # useless + max_table_tokens: 20, # useless + }, + data: { + db: null, + grammar: 'conf/DuSQL.asdl', + train_set: null, + dev_set: null, + test_set: null, + eval_file: null, + output: 'output', + is_cached: false, + }, + train: { + epochs: 50, + log_steps: 10, + trainer_num: 1, + # [begin] config for optimizer + learning_rate: 1e-05, + lr_scheduler: "linear_warmup_decay", + warmup_steps: 0, + warmup_proportion: 0.1, + weight_decay: 0.01, + use_dynamic_loss_scaling: false, + init_loss_scaling: 128, + incr_every_n_steps: 100, + decr_every_n_nan_or_inf: 2, + incr_ratio: 2.0, + decr_ratio: 0.8, + grad_clip: 1.0, + # [end] optimizer + random_seed: null, + use_data_parallel: false, + } +} diff --git a/examples/text_to_sql/RAT-SQL/conf/text2sql_dusql.jsonnet 
b/examples/text_to_sql/RAT-SQL/conf/text2sql_dusql.jsonnet new file mode 100644 index 0000000000000000000000000000000000000000..6c77b0c73ccfed1ff36b6b65f1e5749a28db1762 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/conf/text2sql_dusql.jsonnet @@ -0,0 +1,62 @@ + +function(data_path='data/DuSQL/preproc') { + general: { + mode: null, + batch_size: 16, + use_cuda: true, + is_cloud: false, + is_debug: false, + use_fp16: 0, + }, + model: { + pretrain_model_type: 'ERNIE', + pretrain_model: 'ernie-1.0', + init_model_params: null, + init_model_optim: null, + model_name: 'seq2tree_v2', + grammar_type: 'dusql_v2', + rat_layers: 8, + rat_heads: 8, + enc_value_with_col: true, + num_value_col_type: 'q_num', # cls|col_0|q_num + value_memory: true, + predict_value: true, + max_seq_len: 510, + max_question_len: 120, + max_column_num: 60, + max_table_num: 15, + max_column_tokens: 50, # useless + max_table_tokens: 20, # useless + }, + data: { + db: null, + grammar: 'conf/DuSQL.asdl', + train_set: null, + dev_set: null, + test_set: null, + eval_file: null, + output: 'output', + is_cached: false, + }, + train: { + epochs: 30, + log_steps: 10, + trainer_num: 1, + # [begin] config for optimizer + learning_rate: 1e-05, + lr_scheduler: "linear_warmup_decay", + warmup_steps: 0, + warmup_proportion: 0.1, + weight_decay: 0.01, + use_dynamic_loss_scaling: false, + init_loss_scaling: 128, + incr_every_n_steps: 100, + decr_every_n_nan_or_inf: 2, + incr_ratio: 2.0, + decr_ratio: 0.8, + grad_clip: 1.0, + # [end] optimizer + random_seed: null, + use_data_parallel: false, + } +} diff --git a/examples/text_to_sql/RAT-SQL/conf/text2sql_nl2sql.jsonnet b/examples/text_to_sql/RAT-SQL/conf/text2sql_nl2sql.jsonnet new file mode 100644 index 0000000000000000000000000000000000000000..80550cfd0c181ba2c55e73524f53d55d173469a6 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/conf/text2sql_nl2sql.jsonnet @@ -0,0 +1,62 @@ + +function(data_path='data/NL2SQL/preproc') { + general: { + mode: null, + batch_size: 16, + use_cuda: true, + is_cloud: false, + is_debug: false, + use_fp16: 0, + }, + model: { + pretrain_model_type: 'ERNIE', + pretrain_model: 'ernie-1.0', + init_model_params: null, + init_model_optim: null, + model_name: 'seq2tree_v2', + grammar_type: 'nl2sql', + rat_layers: 8, + rat_heads: 8, + enc_value_with_col: true, + num_value_col_type: 'q_num', # cls|col_0|q_num + value_memory: true, + predict_value: true, + max_seq_len: 510, + max_question_len: 120, + max_column_num: 60, + max_table_num: 15, + max_column_tokens: 50, # useless + max_table_tokens: 20, # useless + }, + data: { + db: null, + grammar: 'conf/NL2SQL.asdl', + train_set: null, + dev_set: null, + test_set: null, + eval_file: null, + output: 'output', + is_cached: false, + }, + train: { + epochs: 12, + log_steps: 10, + trainer_num: 1, + # [begin] config for optimizer + learning_rate: 1e-05, + lr_scheduler: "linear_warmup_decay", + warmup_steps: 0, + warmup_proportion: 0.1, + weight_decay: 0.01, + use_dynamic_loss_scaling: false, + init_loss_scaling: 128, + incr_every_n_steps: 100, + decr_every_n_nan_or_inf: 2, + incr_ratio: 2.0, + decr_ratio: 0.8, + grad_clip: 1.0, + # [end] optimizer + random_seed: null, + use_data_parallel: false, + } +} diff --git a/examples/text_to_sql/RAT-SQL/data/download_model_data.sh b/examples/text_to_sql/RAT-SQL/data/download_model_data.sh new file mode 100644 index 0000000000000000000000000000000000000000..87a77dd1d86a562baf657eef4f337a5c69677115 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/data/download_model_data.sh @@ -0,0 +1,14 @@ 
+#!/bin/bash + +cd `dirname $0` + +set -e + +wget --no-check-certificate https://dataset-bj.cdn.bcebos.com/qianyan/NL2SQL.zip +unzip NL2SQL.zip >/dev/null + +wget --no-check-certificate https://dataset-bj.cdn.bcebos.com/qianyan/CSpider.zip +unzip CSpider.zip >/dev/null + +wget --no-check-certificate https://bj.bcebos.com/v1/dataset-bj/qianyan/DuSQL.zip +unzip DuSQL.zip >/dev/null diff --git a/examples/text_to_sql/RAT-SQL/data/download_trained_model.sh b/examples/text_to_sql/RAT-SQL/data/download_trained_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..47d64e59bf48a0f71c047d7d35b27becc5eae0ce --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/data/download_trained_model.sh @@ -0,0 +1,8 @@ +#!/bin/bash + +cd `dirname $0` + +version='v1.0.0' +target_file="text2sql_trained_model_$version.tar.gz" +wget --no-check-certificate https://dataset-bj.cdn.bcebos.com/qianyan/$target_file +tar xzf $target_file \ No newline at end of file diff --git a/examples/text_to_sql/RAT-SQL/evaluation/README.md b/examples/text_to_sql/RAT-SQL/evaluation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..dc05871a09158aea37838af1aba930e42210c1e3 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/README.md @@ -0,0 +1,34 @@ +# 环境 +建议 python3.7 + +# 评估 +输入文件格式: +1. 文件以.sql结尾 +2. 文件每行的格式:"qid\tsql_query\tdb_id",其中predict文件db_id是可选字段,gold文件db_id是必选字段 +3. 评估指标:exact matching score + +# 使用 + +## 命令行 + + python text2sql_evaluation.py \ + --g 'data/DuSQL/test_gold.sql' \ # gold文件 + --p 'test_DuSQL.sql' \ # predict文件 + --t 'data/DuSQL/db_schema.json' \ # schema文件 + --d 'DuSQL' # 选择dataset(DuSQL、NL2SQL、CSPider可选) + +## 接口 + + from text2sql_evaluation import evaluate + score, score_novalue = evaluate('table.json', 'gold.sql', 'pred.sql', dataset='DuSQL') +其中: + score["all"] = {"exact": exact num, "count": test examples num, "acc": accuracy} + score_novalue["all"] = {"exact": exact num, "count": test examples num, "acc": accuracy} + +## 输出 + with value: + {"exact": exact correct num, "count": test examples num, "acc": accuracy} + without value: + {"exact": exact correct num, "count": test examples num, "acc": accuracy} +其中: + acc表示最终输出准确率 diff --git a/examples/text_to_sql/RAT-SQL/evaluation/data/gold.sql b/examples/text_to_sql/RAT-SQL/evaluation/data/gold.sql new file mode 100644 index 0000000000000000000000000000000000000000..bc6d8e119c55717949579a96111ce220bb0f5aa7 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/data/gold.sql @@ -0,0 +1,18 @@ +1 ( select 书名 from 传记 where 页数 > 400 ) intersect ( select 书名 from 传记 where 出版时间 > "1981-03-24" ) 人物传记 +2 ( select 姓名 from 作者 where 作品数量 >= 50 ) except ( select 姓名 from 作者 order by 出生日期 desc limit 3 ) 小说 +3 ( select 开源课程名称 from 学校的开源课程 order by 课时 desc limit 3 ) except ( select 开源课程名称 from 学校的开源课程 where 主讲教师 != "王建安" ) 在线学习平台 +4 select avg ( 现价格 ) sum ( 原价格 ) from 本月特价书籍 榜单 +5 select max ( 电子书售价 ) from 电子书 购书平台 +6 select min ( 电子书售价 ) avg ( 购买人数 ) max ( 会员价格 ) from 电子书 购书平台 +7 select sum ( 豆瓣评分 ) max ( 1星占比 ) from 书籍 豆瓣读书 +8 select 书名, 出版社 from 传记 where 作者 != "柳润墨" order by 页数 asc 人物传记 +9 select 书名, 类型 from 网络小说 where 评分 == ( select max ( 评分 ) from 网络小说 ) 网易云阅读 +10 select 出版社 from 文集 group by 出版社 order by avg ( 页数 ) desc limit 1 文集 +11 select 名称 from 小说改编话剧 where 演出总场次 < ( select max ( 演出总场次 ) from 小说改编话剧 where 演出剧团 != "开心麻花" ) 小说 +12 select 名称 from 文集 where 页数 < ( select max ( 页数 ) from 文集 where 出版社 != "人民文学出版社" ) 文集 +13 select 名称 from 文集 where 页数 == ( select max ( 页数 ) from 文集 where 出版社 != "人民文学出版社" ) 文集 +14 select 
名称, 作者 from 书籍 where 豆瓣评分 > 5.4 order by 1星占比 desc 豆瓣读书 +15 select 名称, 评价人数 * 1星占比 from 书籍 where 作者 == "塔拉·韦斯特弗" 豆瓣读书 +16 select 姓名, 国籍 from 作者 where 作品数量 == ( select max ( 作品数量 ) from 作者 ) 小说 +17 select 姓名, 逝世日期 - 出生日期 from 作者 where 作品数量 < 50 小说 +18 select 讲述朝代 from 中国朝代历史 历史类书籍 diff --git a/examples/text_to_sql/RAT-SQL/evaluation/data/pred.sql b/examples/text_to_sql/RAT-SQL/evaluation/data/pred.sql new file mode 100644 index 0000000000000000000000000000000000000000..cff97fda2aa84c904c7e44dfbfe0da34af7ab7a6 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/data/pred.sql @@ -0,0 +1,18 @@ +1 ( select 出版社 from 传记 where 页数 > 400 ) intersect ( select 书名 from 传记 where 出版时间 > "1981-03-24" ) 人物传记 +2 ( select 姓名 from 作者 where 作品数量 >= 60 ) except ( select 姓名 from 作者 order by 出生日期 desc limit 3 ) 小说 +3 ( select 开源课程名称 from 学校的开源课程 order by 课时 desc limit 5 ) except ( select 开源课程名称 from 学校的开源课程 where 主讲教师 != "王建安" ) 在线学习平台 +4 select avg ( 现价格 ) max ( 原价格 ) from 本月特价书籍 榜单 +5 select max ( 电子书售价 ) from 电子书 购书平台 +6 select min ( 电子书售价 ) avg ( 购买人数 ) max ( 会员价格 ) from 电子书 购书平台 +7 select sum ( 豆瓣评分 ) max ( 1星占比 ) from 书籍 豆瓣读书 +8 select 书名 from 传记 where 作者 != "柳润墨" order by 页数 asc 人物传记 +9 select 书名, 类型 from 网络小说 where 评分 in ( select max ( 评分 ) from 网络小说 ) 网易云阅读 +10 select 出版社 from 文集 group by 出版社 order by avg ( 页数 ) desc limit 1 文集 +11 select 名称 from 小说改编话剧 where 演出总场次 < ( select max ( 演出总场次 ) from 小说改编话剧 where 演出剧团 != "开心麻花" ) 小说 +12 select 名称 from 文集 where 页数 < ( select max ( 页数 ) from 文集 where 出版社 != "人民文学出版社" ) 文集 +13 select 名称 from 文集 where 页数 == ( select max ( 页数 ) from 文集 where 出版社 != "人民文学出版社" ) 文集 +14 select 名称, 作者 from 书籍 where 豆瓣评分 > 5.4 order by 1星占比 desc 豆瓣读书 +15 select 名称, 评价人数 * 1星占比 from 书籍 where 作者 == "塔拉·韦斯特弗" 豆瓣读书 +16 select 姓名, 国籍 from 作者 where 作品数量 == ( select max ( 作品数量 ) from 作者 ) 小说 +17 select 姓名, 逝世日期 - 出生日期 from 作者 where 作品数量 < 50 小说 +18 select 讲述朝代 from 中国朝代历史 历史类书籍 diff --git a/examples/text_to_sql/RAT-SQL/evaluation/data/table.json b/examples/text_to_sql/RAT-SQL/evaluation/data/table.json new file mode 100644 index 0000000000000000000000000000000000000000..144854f31c369b8f6282a3d27d4fefbc0dca1fec --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/data/table.json @@ -0,0 +1,2049 @@ +[ + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "姓名" + ], + [ + 0, + "国籍" + ], + [ + 0, + "职业" + ], + [ + 0, + "主要成就" + ], + [ + 1, + "词条id" + ], + [ + 1, + "书名" + ], + [ + 1, + "作者" + ], + [ + 1, + "页数" + ], + [ + 1, + "出版社" + ], + [ + 1, + "出版时间" + ], + [ + 2, + "传记id" + ], + [ + 2, + "人物id" + ], + [ + 2, + "记录时间" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "text", + "number", + "text", + "text", + "number", + "text", + "time", + "number", + "number", + "text" + ], + "db_id": "人物传记", + "foreign_keys": [ + [ + 13, + 1 + ], + [ + 12, + 6 + ] + ], + "primary_keys": [ + 1, + 6 + ], + "table_names": [ + "名人", + "传记", + "名人传记" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "成立时间" + ], + [ + 0, + "年营业额" + ], + [ + 0, + "是否自营" + ], + [ + 0, + "会员费" + ], + [ + 1, + "词条id" + ], + [ + 1, + "书名" + ], + [ + 1, + "作者" + ], + [ + 1, + "类型" + ], + [ + 2, + "书名id" + ], + [ + 2, + "平台id" + ], + [ + 2, + "售价" + ], + [ + 2, + "购买人数" + ], + [ + 2, + "评分" + ], + [ + 2, + "评分人数" + ], + [ + 2, + "加入购物车人数" + ], + [ + 2, + "收藏人数" + ], + [ + 2, + "缺货" + ], + [ + 3, + "书名id" + ], + [ + 3, + "平台id" + ], + [ + 3, + "电子书售价" + ], + [ + 3, + "会员价格" + ], + [ + 3, + "购买人数" + ] + ], + 
"column_types": [ + "text", + "number", + "text", + "time", + "number", + "binary", + "number", + "number", + "text", + "text", + "text", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "binary", + "number", + "number", + "number", + "number", + "number" + ], + "db_id": "购书平台", + "foreign_keys": [ + [ + 11, + 7 + ], + [ + 12, + 1 + ], + [ + 20, + 7 + ], + [ + 21, + 1 + ] + ], + "primary_keys": [ + 1, + 7 + ], + "table_names": [ + "平台", + "图书", + "图书与平台", + "电子书" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "姓名" + ], + [ + 0, + "性别" + ], + [ + 0, + "国籍" + ], + [ + 0, + "职业" + ], + [ + 0, + "所在单位" + ], + [ + 1, + "词条id" + ], + [ + 1, + "名称" + ], + [ + 1, + "作者id" + ], + [ + 1, + "会议名称" + ], + [ + 1, + "年份" + ], + [ + 1, + "引用量" + ], + [ + 2, + "论文id" + ], + [ + 2, + "引用论文id" + ], + [ + 2, + "是否对比论文" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "text", + "text", + "number", + "text", + "number", + "text", + "time", + "number", + "number", + "number", + "binary" + ], + "db_id": "论文", + "foreign_keys": [ + [ + 13, + 7 + ], + [ + 9, + 1 + ], + [ + 14, + 7 + ] + ], + "primary_keys": [ + 1, + 7 + ], + "table_names": [ + "作者", + "论文", + "论文引用" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "书名" + ], + [ + 0, + "讲述国家" + ], + [ + 0, + "讲述时代" + ], + [ + 1, + "词条id" + ], + [ + 1, + "书名" + ], + [ + 1, + "讲述朝代" + ], + [ + 2, + "词条id" + ], + [ + 2, + "书名" + ], + [ + 2, + "描述战事" + ], + [ + 3, + "词条id" + ], + [ + 3, + "书名" + ], + [ + 3, + "讲述名人" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "time", + "number", + "text", + "text", + "number", + "text", + "text", + "number", + "text", + "text" + ], + "db_id": "历史类书籍", + "foreign_keys": [], + "primary_keys": [ + 1, + 5, + 8, + 11 + ], + "table_names": [ + "国家历史", + "中国朝代历史", + "战争历史", + "人物历史" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "成立年数" + ], + [ + 0, + "教师数量" + ], + [ + 0, + "课程体系分级" + ], + [ + 1, + "平台id" + ], + [ + 1, + "适合群体" + ], + [ + 1, + "一节课时间" + ], + [ + 1, + "课时数" + ], + [ + 1, + "主题数" + ], + [ + 1, + "词汇量" + ], + [ + 2, + "平台id" + ], + [ + 2, + "外教来自国家" + ], + [ + 2, + "外教数量" + ], + [ + 2, + "教师职业占比" + ], + [ + 3, + "平台id" + ], + [ + 3, + "课时数" + ], + [ + 3, + "原价" + ], + [ + 3, + "折扣" + ] + ], + "column_types": [ + "text", + "number", + "text", + "number", + "number", + "number", + "number", + "text", + "number", + "number", + "number", + "number", + "number", + "text", + "number", + "number", + "number", + "number", + "number", + "number" + ], + "db_id": "在线英语教学", + "foreign_keys": [ + [ + 16, + 1 + ], + [ + 6, + 1 + ], + [ + 12, + 1 + ] + ], + "primary_keys": [ + 1 + ], + "table_names": [ + "平台", + "青少年课程", + "教师", + "学费(平台,课时数,价格,折扣)" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "姓名" + ], + [ + 0, + "国籍" + ], + [ + 0, + "毕业院校" + ], + [ + 0, + "民族" + ], + [ + 1, + "词条id" + ], + [ + 1, + "名称" + ], + [ + 1, + "作者id" + ], + [ + 1, + "页数" + ], + [ + 1, + "定价" + ], + [ + 1, + "出版社" + ], + [ + 1, + "出版时间" + ], + [ + 1, + "开本" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "text", + "number", + "text", + "number", + "number", + "number", + "text", + "time", + "text" + ], + "db_id": "文集", + "foreign_keys": [ + [ + 8, + 1 + ] + ], + "primary_keys": [ + 1, + 6 + ], + "table_names": [ + "作者", + "文集" + ] + }, + { + 
"column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "英文名" + ], + [ + 0, + "原著作者" + ], + [ + 0, + "字数" + ], + [ + 1, + "词条id" + ], + [ + 1, + "姓名" + ], + [ + 1, + "国籍" + ], + [ + 1, + "翻译作品数量" + ], + [ + 2, + "词条id" + ], + [ + 2, + "名称" + ], + [ + 2, + "成立时间" + ], + [ + 2, + "成立地点" + ], + [ + 3, + "书籍id" + ], + [ + 3, + "译者id" + ], + [ + 3, + "出版社id" + ], + [ + 3, + "出版册数" + ], + [ + 3, + "出版时间" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "number", + "number", + "text", + "text", + "number", + "number", + "text", + "time", + "text", + "number", + "number", + "number", + "number", + "time" + ], + "db_id": "外文书籍", + "foreign_keys": [ + [ + 14, + 1 + ], + [ + 16, + 10 + ], + [ + 15, + 6 + ] + ], + "primary_keys": [ + 1, + 6, + 10 + ], + "table_names": [ + "外文书籍", + "译者", + "出版社", + "书籍出版信息" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "书名" + ], + [ + 0, + "作者" + ], + [ + 0, + "评分" + ], + [ + 0, + "总排名" + ], + [ + 1, + "图书id" + ], + [ + 1, + "评价人数" + ], + [ + 2, + "图书id" + ], + [ + 2, + "现价格" + ], + [ + 2, + "原价格" + ], + [ + 3, + "图书id" + ], + [ + 3, + "购买人数" + ], + [ + 3, + "收藏人数" + ], + [ + 4, + "图书id" + ], + [ + 4, + "推荐人数" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number" + ], + "db_id": "榜单", + "foreign_keys": [ + [ + 11, + 1 + ], + [ + 8, + 1 + ], + [ + 6, + 1 + ], + [ + 14, + 1 + ] + ], + "primary_keys": [ + 1, + 6, + 8, + 11, + 14 + ], + "table_names": [ + "图书", + "五星榜单", + "本月特价书籍", + "人气榜单", + "必读榜单" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "语言" + ], + [ + 0, + "类别" + ], + [ + 0, + "主办单位" + ], + [ + 0, + "创刊时间" + ], + [ + 0, + "国家" + ], + [ + 0, + "出版刊数" + ], + [ + 1, + "年份" + ], + [ + 1, + "期刊id" + ], + [ + 1, + "统计平台" + ], + [ + 1, + "出版文献数" + ], + [ + 1, + "被下载数量" + ], + [ + 1, + "被引数量" + ], + [ + 1, + "复合影响因子" + ], + [ + 1, + "综合影响因子" + ], + [ + 2, + "词条id" + ], + [ + 2, + "姓名" + ], + [ + 2, + "职业" + ], + [ + 3, + "人物id" + ], + [ + 3, + "期刊id" + ], + [ + 3, + "次数" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "text", + "time", + "text", + "number", + "time", + "number", + "text", + "number", + "number", + "number", + "number", + "number", + "number", + "text", + "text", + "number", + "number", + "number" + ], + "db_id": "期刊", + "foreign_keys": [ + [ + 20, + 17 + ], + [ + 10, + 1 + ], + [ + 21, + 1 + ] + ], + "primary_keys": [ + 1, + 17 + ], + "table_names": [ + "期刊", + "期刊文献", + "封面人物", + "期刊封面人物" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "作者" + ], + [ + 0, + "作者国籍" + ], + [ + 0, + "豆瓣评分" + ], + [ + 0, + "评价人数" + ], + [ + 0, + "5星占比" + ], + [ + 0, + "1星占比" + ], + [ + 1, + "年份" + ], + [ + 1, + "书籍id" + ], + [ + 1, + "评分" + ], + [ + 1, + "排名" + ], + [ + 2, + "年份" + ], + [ + 2, + "书籍id" + ], + [ + 2, + "关注数" + ], + [ + 2, + "排名" + ], + [ + 3, + "年份" + ], + [ + 3, + "书籍id" + ], + [ + 3, + "购买数" + ], + [ + 3, + "排名" + ], + [ + 4, + "书籍id" + ], + [ + 4, + "平台" + ], + [ + 4, + "售价" + ], + [ + 4, + "是否有货" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "number", + "number", + "number", + "number", + "time", + "number", + "number", + "number", + "time", + "number", + "number", + "number", + "time", + 
"number", + "number", + "number", + "number", + "text", + "number", + "binary" + ], + "db_id": "豆瓣读书", + "foreign_keys": [ + [ + 14, + 1 + ], + [ + 21, + 1 + ], + [ + 10, + 1 + ], + [ + 18, + 1 + ] + ], + "primary_keys": [ + 1 + ], + "table_names": [ + "书籍", + "高分图书榜单", + "最受关注图书榜单", + "最畅销图书榜单", + "购买平台" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "姓名" + ], + [ + 0, + "国籍" + ], + [ + 0, + "出生日期" + ], + [ + 0, + "出生地" + ], + [ + 0, + "逝世日期" + ], + [ + 0, + "作品数量" + ], + [ + 1, + "词条id" + ], + [ + 1, + "小说名" + ], + [ + 1, + "文学体裁" + ], + [ + 1, + "作者id" + ], + [ + 1, + "首版时间" + ], + [ + 1, + "字数" + ], + [ + 2, + "词条id" + ], + [ + 2, + "名称" + ], + [ + 2, + "小说id" + ], + [ + 2, + "演出剧团" + ], + [ + 2, + "导演" + ], + [ + 2, + "演出总场次" + ], + [ + 2, + "观众评分" + ], + [ + 3, + "词条id" + ], + [ + 3, + "剧名" + ], + [ + 3, + "小说id" + ], + [ + 3, + "首播时间" + ], + [ + 3, + "集数" + ], + [ + 3, + "豆瓣评分" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "time", + "text", + "time", + "number", + "number", + "text", + "text", + "number", + "time", + "number", + "number", + "text", + "number", + "text", + "text", + "number", + "number", + "number", + "text", + "number", + "time", + "number", + "number" + ], + "db_id": "小说", + "foreign_keys": [ + [ + 11, + 1 + ], + [ + 16, + 8 + ], + [ + 23, + 8 + ] + ], + "primary_keys": [ + 1, + 8, + 14, + 21 + ], + "table_names": [ + "作者", + "小说", + "小说改编话剧", + "小说改编电视剧" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "姓名" + ], + [ + 0, + "年龄" + ], + [ + 0, + "出版作品数" + ], + [ + 0, + "网络作品数" + ], + [ + 1, + "词条id" + ], + [ + 1, + "书名" + ], + [ + 1, + "作者id" + ], + [ + 1, + "评分" + ], + [ + 1, + "评价人数" + ], + [ + 1, + "字数" + ], + [ + 1, + "点击数" + ], + [ + 1, + "类型" + ], + [ + 2, + "词条id" + ], + [ + 2, + "书名" + ], + [ + 2, + "作者id" + ], + [ + 2, + "评分" + ], + [ + 2, + "类型" + ], + [ + 2, + "状态" + ], + [ + 2, + "价格" + ], + [ + 3, + "网络小说id" + ], + [ + 3, + "周排名" + ], + [ + 3, + "月排名" + ], + [ + 3, + "总排名" + ], + [ + 4, + "网络小说id" + ], + [ + 4, + "周排名" + ], + [ + 4, + "月排名" + ], + [ + 4, + "总排名" + ] + ], + "column_types": [ + "text", + "number", + "text", + "number", + "number", + "number", + "number", + "text", + "number", + "number", + "number", + "number", + "number", + "text", + "number", + "text", + "number", + "number", + "text", + "text", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number", + "number" + ], + "db_id": "网易云阅读", + "foreign_keys": [ + [ + 25, + 14 + ], + [ + 8, + 1 + ], + [ + 21, + 14 + ], + [ + 16, + 1 + ] + ], + "primary_keys": [ + 1, + 6, + 14 + ], + "table_names": [ + "作者", + "出版图书", + "网络小说", + "畅销榜", + "收藏榜" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "类型" + ], + [ + 0, + "适用阶段" + ], + [ + 0, + "适用年级" + ], + [ + 0, + "科目类型" + ], + [ + 0, + "价格" + ], + [ + 0, + "特点" + ], + [ + 1, + "试卷id" + ], + [ + 1, + "套数" + ], + [ + 1, + "押题命中率" + ], + [ + 2, + "省份" + ], + [ + 2, + "参考试卷id" + ], + [ + 2, + "版本" + ], + [ + 2, + "购买数量" + ], + [ + 2, + "平均得分" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "number", + "text", + "number", + "text", + "number", + "text", + "number", + "text", + "number", + "text", + "number", + "number" + ], + "db_id": "教材辅助参考书", + "foreign_keys": [ + [ + 9, + 1 + ], + [ + 13, + 1 + ] + ], + "primary_keys": [ + 1, + 9 + ], + "table_names": [ + "参考书", + "参考试卷", + "适用城市" + ] + }, + { + "column_names": [ 
+ [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "类型" + ], + [ + 0, + "国家" + ], + [ + 0, + "世界排名" + ], + [ + 1, + "词条id" + ], + [ + 1, + "名称" + ], + [ + 1, + "所属专业" + ], + [ + 1, + "适合学者类型" + ], + [ + 2, + "词条id" + ], + [ + 2, + "名称" + ], + [ + 2, + "课程数量" + ], + [ + 2, + "合作学校数量" + ], + [ + 2, + "是否免费" + ], + [ + 3, + "平台id" + ], + [ + 3, + "学校id" + ], + [ + 3, + "合作课程数量" + ], + [ + 4, + "词条id" + ], + [ + 4, + "开源课程名称" + ], + [ + 4, + "课程id" + ], + [ + 4, + "学校id" + ], + [ + 4, + "课时" + ], + [ + 4, + "主讲教师" + ], + [ + 5, + "开源课程id" + ], + [ + 5, + "平台id" + ], + [ + 5, + "是否直播" + ], + [ + 5, + "课程时长" + ], + [ + 5, + "价格" + ] + ], + "column_types": [ + "text", + "number", + "text", + "text", + "text", + "number", + "number", + "text", + "text", + "text", + "number", + "text", + "number", + "number", + "binary", + "number", + "number", + "number", + "number", + "text", + "number", + "number", + "number", + "text", + "number", + "number", + "binary", + "number", + "number" + ], + "db_id": "在线学习平台", + "foreign_keys": [ + [ + 20, + 6 + ], + [ + 21, + 1 + ], + [ + 24, + 18 + ], + [ + 25, + 10 + ], + [ + 16, + 1 + ], + [ + 15, + 10 + ] + ], + "primary_keys": [ + 1, + 6, + 10, + 18 + ], + "table_names": [ + "学校", + "课程", + "平台", + "平台合作学校", + "学校的开源课程", + "平台课程" + ] + }, + { + "column_names": [ + [ + -1, + "*" + ], + [ + 0, + "词条id" + ], + [ + 0, + "名称" + ], + [ + 0, + "成立时间" + ], + [ + 0, + "级别" + ], + [ + 1, + "会议id" + ], + [ + 1, + "年份" + ], + [ + 1, + "长文提交量" + ], + [ + 1, + "长文录取率" + ], + [ + 1, + "短文提交量" + ], + [ + 1, + "短文录取率" + ], + [ + 2, + "会议id" + ], + [ + 2, + "年份" + ], + [ + 2, + "大洲" + ], + [ + 2, + "提交数量占比" + ], + [ + 3, + "会议id" + ], + [ + 3, + "年份" + ], + [ + 3, + "国家" + ], + [ + 3, + "提交数量占比" + ], + [ + 4, + "方向名称" + ], + [ + 4, + "会议id" + ], + [ + 4, + "长文提交量" + ], + [ + 4, + "长文录取率" + ], + [ + 4, + "短文提交量" + ], + [ + 4, + "短文录取率" + ] + ], + "column_types": [ + "text", + "number", + "text", + "time", + "text", + "number", + "time", + "number", + "number", + "number", + "number", + "number", + "time", + "text", + "number", + "number", + "time", + "text", + "number", + "text", + "number", + "number", + "number", + "number", + "number" + ], + "db_id": "NLP会议", + "foreign_keys": [ + [ + 5, + 1 + ], + [ + 11, + 1 + ], + [ + 15, + 1 + ], + [ + 20, + 1 + ] + ], + "primary_keys": [ + 1 + ], + "table_names": [ + "会议", + "各会议论文", + "各会议论文大洲分布", + "各会议论文国家分布", + "2019年会议各方向分布" + ] + } +] \ No newline at end of file diff --git a/examples/text_to_sql/RAT-SQL/evaluation/text2sql_evaluation.py b/examples/text_to_sql/RAT-SQL/evaluation/text2sql_evaluation.py new file mode 100644 index 0000000000000000000000000000000000000000..ee85021e4139d615cc1cfc9f1e7f106cb36b7b04 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/text2sql_evaluation.py @@ -0,0 +1,1602 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Calculating the exact accuracy. 
For select, where and others schema, it will be +seen as right if has different order. This script refers to https://github.com/taoyds/spider。 +""" +import copy +import json +import logging +import re +from collections import defaultdict +from io import open + +from utils import evaluate_NL2SQL, is_float + +""" +val: number(float)/string(str)/sql(dict) +col_unit: (agg_id, col_id, isdistinct(bool)) +val_unit: (unit_op, col_unit1, col_unit2) +table_unit: (table_type, col_unit/sql) +cond_unit: (not_op, cond_op, val_unit, val1, val2) +condition: [cond_unit1, 'and'/'or', cond_unit2, ...] +sql { + 'select': [(agg_id, val_unit), (agg_id, val_unit), ...] + 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + 'where': condition + 'groupBy': [col_unit1, col_unit2, ...] + 'orderBy': ('asc'/'desc', [(agg_id, val_unit), ...]) + 'having': condition + 'limit': None/number(int) + 'intersect': None/sql + 'except': None/sql + 'union': None/sql +} +""" + +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +COND_OPS = ("not_in", "between", "==", ">", "<", ">=", "<=", "!=", "in", "like") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +LOGIC_AND_OR = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + +CONST_COLUMN = set(["time_now"]) + +EXPECT_BRACKET_PRE_TOKENS = set(AGG_OPS + SQL_OPS + COND_OPS + CLAUSE_KEYWORDS + ("from", ",", "distinct")) + +g_empty_sql = { + "select": [], + "from": {"conds": [], "table_units": []}, + "where": [], + "groupBy": [], + "having": [], + "orderBy": [], + "limit": None, + "except": None, + "intersect": None, + "union": None, +} + +VALUE = "1" + + +def tokenize(string, single_equal=False, math=True): + """ + Args: + + Returns: + """ + + string = string.replace("'", '"').lower() + assert string.count('"') % 2 == 0, "Unexpected quote" + + def _extract_value(string): + """extract values in sql""" + fields = string.split('"') + for idx, tok in enumerate(fields): + if idx % 2 == 1: + fields[idx] = '"%s"' % (tok) + return fields + + def _resplit(tmp_tokens, fn_split, fn_omit): + """resplit""" + new_tokens = [] + for token in tmp_tokens: + token = token.strip() + if fn_omit(token): + new_tokens.append(token) + elif re.match(r"\d\d\d\d-\d\d(-\d\d)?", token): + new_tokens.append('"%s"' % (token)) + else: + new_tokens.extend(fn_split(token)) + return new_tokens + + tokens_tmp = _extract_value(string) + + two_bytes_op = ["==", "!=", ">=", "<=", "<>", ""] + if single_equal: + sep1 = re.compile(r"([ \+\-\*/\(\)=,><;])") # 单字节运算符 + else: + sep1 = re.compile(r"([ \+\-\*/\(\),><;])") # 单字节运算符 + sep2 = re.compile("(" + "|".join(two_bytes_op) + ")") # 多字节运算符 + tokens_tmp = _resplit(tokens_tmp, lambda x: x.split(" "), lambda x: x.startswith('"')) + tokens_tmp = _resplit(tokens_tmp, lambda x: re.split(sep2, x), lambda x: x.startswith('"')) + tokens_tmp = _resplit(tokens_tmp, lambda x: re.split(sep1, x), lambda x: x in two_bytes_op or x.startswith('"')) + tokens = list(filter(lambda x: x.strip() not in ("", "distinct", "DISTINCT"), tokens_tmp)) + + def _post_merge(tokens): + """merge: + * col name with "(", ")" + * values with +/- + """ + idx = 1 + while idx < len(tokens): + if tokens[idx] == "(" and tokens[idx - 1] not in EXPECT_BRACKET_PRE_TOKENS and tokens[idx - 1] != "=": + # 兼容单引号,这里可能有问题 + while idx < len(tokens): + tmp_tok = 
tokens.pop(idx) + tokens[idx - 1] += tmp_tok + if tmp_tok == ")": + break + elif tokens[idx] in ("+", "-") and tokens[idx - 1] in COND_OPS and idx + 1 < len(tokens): + tokens[idx] += tokens[idx + 1] + tokens.pop(idx + 1) + idx += 1 + else: + idx += 1 + return tokens + + tokens = _post_merge(tokens) + if single_equal: + tokens = [i if i != "=" else "==" for i in tokens] + return tokens + + +def scan_alias(toks): + """Scan the index of 'as' and build the map for all alias""" + as_idxs = [idx for idx, tok in enumerate(toks) if tok == "as"] + alias = {} + for idx in as_idxs: + alias[toks[idx + 1]] = toks[idx - 1] + return alias + + +def get_tables_with_alias(schema, toks): + """ + Args: + + Returns: + """ + tables = scan_alias(toks) + for key in schema: + assert key not in tables, "Alias {} has the same name in table".format(key) + tables[key] = key + return tables + + +def parse_col(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + :returns next idx, column id + """ + tok = toks[start_idx] + if tok == "*": + return start_idx + 1, schema.id_map[tok] + if tok in CONST_COLUMN: + return start_idx + 1, tok + + if "." in tok: # if token is a composite + alias, col = tok.split(".") + key = tables_with_alias[alias] + "." + col + return start_idx + 1, schema.id_map[key] + + assert default_tables is not None and len(default_tables) > 0, "Default tables should not be None or empty" + + for alias in default_tables: + table = tables_with_alias[alias] + if tok in schema.schema[table]: + key = table + "." + tok + return start_idx + 1, schema.id_map[key] + + raise RuntimeError("Error col: {},{}".format(tok, toks)) + + +def parse_col_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + :returns next idx, (agg_op id, col_id) + """ + idx = start_idx + len_ = len(toks) + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + assert idx < len_ and toks[idx] == "(" + idx += 1 + if toks[idx] == "distinct": + idx += 1 + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + assert idx < len_ and toks[idx] == ")" + idx += 1 + return idx, (agg_id, col_id) + if toks[idx] == "distinct": + idx += 1 + agg_id = AGG_OPS.index("none") + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + + return idx, (agg_id, col_id) + + +def parse_val_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + col_unit1 = None + col_unit2 = None + unit_op = UNIT_OPS.index("none") + + idx, col_unit1 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + if idx < len_ and toks[idx] in UNIT_OPS: + unit_op = UNIT_OPS.index(toks[idx]) + idx += 1 + idx, col_unit2 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + if unit_op in (UNIT_OPS.index("+"), UNIT_OPS.index("*")): + col_unit1, col_unit2 = sorted([col_unit1, col_unit2]) + + return idx, (unit_op, col_unit1, col_unit2) + + +def parse_table_unit(toks, start_idx, tables_with_alias, schema): + """ + :returns next idx, table id, table name + """ + idx = start_idx + len_ = len(toks) + key = tables_with_alias[toks[idx]] + + if idx + 1 < len_ and toks[idx + 1] == "as": + idx += 3 + else: + idx 
+= 1 + + return idx, schema.id_map[key], key + + +def parse_value(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + def _force_float(str_num): + """force float, just for debug""" + last = "" + while len(str_num) > 0: + try: + n = float(str_num) + if last == "%": + n /= 100 + return n + except Exception: + last = str_num[-1] + str_num = str_num[:-1] + raise ValueError("not a float number") + + if toks[idx] == "select": + idx, val = parse_sql(toks, idx, tables_with_alias, schema) + elif toks[idx].startswith('"') and toks[idx].endswith('"'): # token is a string value + val = toks[idx] + idx += 1 + else: + try: + val_str = toks[idx] + # val = float(val_str) if val_str[-1] != '%' else float(val_str[:-1]) / 100 + val = _force_float(val_str) + idx += 1 + except Exception: + end_idx = idx + while ( + end_idx < len_ + and toks[end_idx] != "," + and toks[end_idx] != ")" + and toks[end_idx] != "and" + and toks[end_idx] not in CLAUSE_KEYWORDS + and toks[end_idx] not in JOIN_KEYWORDS + ): + end_idx += 1 + + idx, val = parse_col_unit(toks[start_idx:end_idx], 0, tables_with_alias, schema, default_tables) + idx = end_idx + + if isBlock: + assert toks[idx] == ")" + idx += 1 + + return idx, val + + +def parse_condition(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + conds = [] + + while idx < len_: + agg_id = 0 + if idx < len_ and toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + + op_str = toks[idx] + if op_str == "not": + assert toks[idx + 1] == "in", '"not" must followed by "in"' + op_str = "not_in" + idx += 1 + assert idx < len_ and op_str in COND_OPS, "Error condition: idx: {}, tok: {},, toks: {}".format( + idx, op_str, toks + ) + op_id = COND_OPS.index(op_str) + idx += 1 + val1 = val2 = None + # idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + # val2 = None + if op_id == COND_OPS.index("between"): # between..and... 
special case: dual values + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + assert toks[idx] == "and" + idx += 1 + idx, val2 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + else: # normal case: single value + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + val2 = None + + conds.append((agg_id, op_id, val_unit, val1, val2)) + + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";") or toks[idx] in JOIN_KEYWORDS): + break + + if idx < len_ and toks[idx] in LOGIC_AND_OR: + conds.append(toks[idx]) + idx += 1 # skip and/or + + return idx, conds + + +def parse_select(toks, start_idx, tables_with_alias, schema, default_tables=None): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + + assert toks[idx] == "select", "'select' not found" + idx += 1 + if idx < len_ and toks[idx] == "distinct": + idx += 1 + val_units = [] + + while idx < len_ and toks[idx] not in CLAUSE_KEYWORDS: + agg_id = AGG_OPS.index("none") + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append((agg_id, val_unit)) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + + return idx, val_units + + +def parse_from(toks, start_idx, tables_with_alias, schema): + """ + Assume in the from clause, all table units are combined with join + """ + assert "from" in toks[start_idx:], "'from' not found" + + len_ = len(toks) + idx = toks.index("from", start_idx) + 1 + default_tables = [] + table_units = [] + conds = [] + last_table = None + + while idx < len_: + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] == "select": + idx, sql = parse_sql(toks, idx, tables_with_alias, schema) + table_units.append((TABLE_TYPE["sql"], sql)) + last_table = sql["from"]["table_units"][0][1].strip("_") + else: + if idx < len_ and toks[idx] == "join": + idx += 1 # skip join + idx, table_unit, table_name = parse_table_unit(toks, idx, tables_with_alias, schema) + table_units.append((TABLE_TYPE["table_unit"], table_unit)) + default_tables.append(table_name) + if idx < len_ and toks[idx] == "on": + idx += 1 # skip on + idx, this_conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + if len(conds) > 0: + conds.append("and") + conds.extend(this_conds) + + if isBlock: + assert toks[idx] == ")" + idx += 1 + if idx < len_ and toks[idx] == "a": + assert last_table is not None, "last_table should be a table name strin, not None" + tables_with_alias["a"] = last_table + idx += 2 + elif idx < len_ and toks[idx] == "b": + assert last_table is not None, "last_table should be a table name strin, not None" + tables_with_alias["b"] = last_table + idx += 1 + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + break + + return [idx, table_units, conds, default_tables] + + +def parse_where(toks, start_idx, tables_with_alias, schema, default_tables): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "where": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_group_by(toks, start_idx, tables_with_alias, schema, default_tables): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + col_units = [] + + if idx >= len_ or toks[idx] != "group": + return idx, col_units + + idx += 1 + 
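# after "group" has been consumed, the next token must be "by" +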
assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + idx, col_unit = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + col_units.append(col_unit) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, col_units + + +def parse_order_by(toks, start_idx, tables_with_alias, schema, default_tables): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + val_units = [] + order_type = "asc" # default type is 'asc' + + if idx >= len_ or toks[idx] != "order": + return idx, val_units + + idx += 1 + assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + agg_id = AGG_OPS.index("none") + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append((agg_id, val_unit)) + if idx < len_ and toks[idx] in ORDER_OPS: + order_type = toks[idx] + idx += 1 + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, (order_type, val_units) + + +def parse_having(toks, start_idx, tables_with_alias, schema, default_tables): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "having": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_limit(toks, start_idx): + """ + Args: + + Returns: + """ + idx = start_idx + len_ = len(toks) + + if idx < len_ and toks[idx] == "limit": + idx += 2 + return idx, int(toks[idx - 1]) + + return idx, None + + +def parse_sql(toks, start_idx, tables_with_alias, schema): + """ + Args: + + Returns: + """ + isBlock = False # indicate whether this is a block of sql/sub-sql + len_ = len(toks) + idx = start_idx + + sql = {} + if toks[idx] == "(": + isBlock = True + idx += 1 + + # parse from clause in order to get default tables + from_end_idx, table_units, conds, default_tables = parse_from(toks, start_idx, tables_with_alias, schema) + sql["from"] = {"table_units": table_units, "conds": conds} + # select clause + _, select_col_units = parse_select(toks, idx, tables_with_alias, schema, default_tables) + idx = from_end_idx + sql["select"] = select_col_units + # where clause + idx, where_conds = parse_where(toks, idx, tables_with_alias, schema, default_tables) + sql["where"] = where_conds + # group by clause + idx, group_col_units = parse_group_by(toks, idx, tables_with_alias, schema, default_tables) + sql["groupBy"] = group_col_units + # having clause + idx, having_conds = parse_having(toks, idx, tables_with_alias, schema, default_tables) + sql["having"] = having_conds + # order by clause + idx, order_col_units = parse_order_by(toks, idx, tables_with_alias, schema, default_tables) + sql["orderBy"] = order_col_units + # limit clause + idx, limit_val = parse_limit(toks, idx) + sql["limit"] = limit_val + + idx = skip_semicolon(toks, idx) + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + idx = skip_semicolon(toks, idx) + + # intersect/union/except clause + for op in SQL_OPS: # initialize IUE + sql[op] = None + if idx < len_ and toks[idx] in SQL_OPS: + sql_op = toks[idx] + idx += 1 + idx, IUE_sql = parse_sql(toks, idx, tables_with_alias, schema) + sql[sql_op] = IUE_sql + return idx, sql + + +def load_data(fpath): + """ + Args: + + Returns: + """ + with open(fpath) as f: + data = 
json.load(f) + return data + + +def get_sql(schema, query, single_equal=False): + """ + Args: + + Returns: + """ + toks = tokenize(query, single_equal=single_equal) + tables_with_alias = get_tables_with_alias(schema.schema, toks) + _, sql = parse_sql(toks, 0, tables_with_alias, schema) + + return sql + + +def skip_semicolon(toks, start_idx): + """ + Args: + + Returns: + """ + idx = start_idx + while idx < len(toks) and toks[idx] == ";": + idx += 1 + return idx + + +################################# + + +class Evaluator(object): + """A simple evaluator""" + + def __init__(self): + """init""" + self.partial_scores = None + + def _eval_exact_match(self, pred, gold, value_match=True): + """eval_exact_match""" + partial_scores = self.eval_partial_match(pred, gold, value_match=value_match) + self.partial_scores = partial_scores + + for _, score in partial_scores.items(): + if score["f1"] != 1: + return 0 + + gold_table_units = gold["from"]["table_units"] + pred_table_units = pred["from"]["table_units"] + if len(pred_table_units) != len(gold_table_units) or any( + map(lambda x: type(x[0][1]) != type(x[1][1]), zip(pred_table_units, gold_table_units)) # noqa: E721 + ): + return 0 + if type(gold_table_units[0][1]) is not dict: + return 1 if sorted(gold_table_units) == sorted(pred_table_units) else 0 + + # TODO: 严格考虑顺序 + def __eval_from_sql(pred_tables, gold_tables): + """eval from sql""" + for pred_table_unit, gold_table_unit in zip(pred_tables, gold_tables): + pred_table_sql = pred_table_unit[1] + gold_table_sql = gold_table_unit[1] + _, _, correct = eval_nested(pred_table_sql, gold_table_sql, value_match) + if correct == 0: + return 0 + return 1 + + correct = __eval_from_sql(pred_table_units, gold_table_units) + if len(gold_table_units) > 1 and correct == 0: + return __eval_from_sql(pred_table_units, list(reversed(gold_table_units))) + else: + return correct + + def eval_exact_match(self, pred, gold, value_match=True): + """wrapper of evaluate examct match, to process + `SQL1 intersect/union SQL2` vs `SQL2 intersect/union SQL1` + + Args: + pred (TYPE): NULL + gold (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + score = self._eval_exact_match(pred, gold, value_match=value_match) + if score == 1: + return score + + if gold["union"] is not None: + gold_tmp = copy.deepcopy(gold) + new_gold = gold_tmp["union"] + gold_tmp["union"] = None + new_gold["union"] = gold_tmp + return self._eval_exact_match(pred, new_gold, value_match=value_match) + elif gold["intersect"] is not None: + gold_tmp = copy.deepcopy(gold) + new_gold = gold_tmp["intersect"] + gold_tmp["intersect"] = None + new_gold["intersect"] = gold_tmp + return self._eval_exact_match(pred, new_gold, value_match=value_match) + else: + return 0 + + def eval_partial_match(self, pred, gold, value_match=True): + """eval_partial_match""" + res = {} + + gold_total, pred_total, cnt, cnt_wo_agg = eval_sel(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["select"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, gold_total) + res["select(no AGG)"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt, cnt_wo_agg = eval_where(pred, gold, value_match=value_match) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["where"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, 
pred_total, gold_total) + res["where(no OP)"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_group(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["group"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_having(pred, gold, value_match=value_match) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["having"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_order(pred, gold, value_match=value_match) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["order"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_and_or(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["and/or"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_IUEN(pred, gold, value_match=value_match) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["IUEN"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_keywords(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["keywords"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + return res + + +class Schema(object): + """ + Simple schema which maps table&column to a unique identifier + """ + + def __init__(self, db): + """init""" + self._schema = self._build_schema(db) + self._id_map = self._map(self._schema) + + @property + def schema(self): + """_schema property""" + return self._schema + + @property + def id_map(self): + """_id_map property""" + return self._id_map + + def _build_schema(self, db): + """build schema by input db + + Args: + db (dict): NULL + + Returns: TODO + + Raises: NULL + """ + tables = [x.lower() for x in db.get("table_names_original", db["table_names"])] + dct_table2cols = defaultdict(list) + for table_id, column in db.get("column_names_original", db["column_names"]): + if table_id < 0: + continue + dct_table2cols[tables[table_id]].append(column.lower()) + return dct_table2cols + + def _map(self, schema): + """map""" + id_map = {"*": "__all__"} + for key, vals in schema.items(): + for val in vals: + id_map[key.lower() + "." + val.lower()] = "__" + key.lower() + "." 
+ val.lower() + "__" + + for key in schema: + id_map[key.lower()] = "__" + key.lower() + "__" + + return id_map + + +def get_scores(count, pred_total, gold_total): + """ + Args: + + Returns: + """ + if pred_total != gold_total: + return 0, 0, 0 + elif count == pred_total: + return 1, 1, 1 + return 0, 0, 0 + + +def eval_sel(pred, gold): + """ + Args: + + Returns: + """ + pred_sel = copy.deepcopy(pred["select"]) + gold_sel = copy.deepcopy(gold["select"]) + gold_wo_agg = [unit[1] for unit in gold_sel] + pred_total = len(pred_sel) + gold_total = len(gold_sel) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_sel: + if unit in gold_sel: + cnt += 1 + gold_sel.remove(unit) + if unit[1] in gold_wo_agg: + cnt_wo_agg += 1 + gold_wo_agg.remove(unit[1]) + + return [gold_total, pred_total, cnt, cnt_wo_agg] + + +def eval_nested_cond(pred_cond, gold_cond, value_match=True): + """ + + Args: + pred_cond (TYPE): NULL + gold_cond (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + if pred_cond[:3] != gold_cond[:3] or type(pred_cond[3]) is not dict: + return 0 + + _, _, correct = eval_nested(pred_cond[3], gold_cond[3], value_match) + if correct == 0: + return 0 + + return pred_cond[4] == gold_cond[4] + + +def eval_cond(pred, gold, value_match=True): + """ + + Args: + pred (TYPE): NULL + gold (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + + def _equal(p, g): + p = p.strip("\"'") if type(p) is str else p + g = g.strip("\"'") if type(g) is str else g + if str(p) == str(g): + return True + if is_float(p) and is_float(g) and float(p) == float(g): + return True + return False + + if not value_match: + if not isinstance(pred[3], dict): + pred[3] = VALUE + if pred[4] is not None: + pred[4] = VALUE + + if not isinstance(gold[3], dict): + gold[3] = VALUE + if gold[4] is not None: + gold[4] = VALUE + + if type(gold[3]) is dict: + return eval_nested_cond(pred, gold, value_match) + + if pred[:3] != gold[:3]: + return 0 + + if _equal(pred[3], gold[3]) and _equal(pred[4], gold[4]): + return 1 + else: + return 0 + + +def eval_where(pred, gold, value_match=True): + """ + Args: + + Returns: + """ + pred_conds = copy.deepcopy([unit for unit in sorted(pred["where"][::2], key=lambda x: [str(i) for i in x])]) + gold_conds = copy.deepcopy([unit for unit in sorted(gold["where"][::2], key=lambda x: [str(i) for i in x])]) + pred_total = len(pred_conds) + gold_total = len(gold_conds) + cnt = 0 + cnt_wo_agg = 0 + + # 已经排过序了,可以一一比对 + for unit_p, unit_g in zip(pred_conds, gold_conds): + cnt += eval_cond(unit_p, unit_g, value_match) + + if unit_p[2] == unit_g[2]: + cnt_wo_agg += 1 + + return [gold_total, pred_total, cnt, cnt_wo_agg] + + +def eval_group(pred, gold): + """ + Args: + + Returns: + """ + pred_cols = [unit[1] for unit in pred["groupBy"]] + gold_cols = [unit[1] for unit in gold["groupBy"]] + pred_total = len(pred_cols) + gold_total = len(gold_cols) + cnt = 0 + pred_cols = [pred.split(".")[1] if "." in pred else pred for pred in pred_cols] + gold_cols = [gold.split(".")[1] if "." 
in gold else gold for gold in gold_cols] + for col in pred_cols: + if col in gold_cols: + cnt += 1 + gold_cols.remove(col) + return [gold_total, pred_total, cnt] + + +def eval_having(pred, gold, value_match=True): + """不评估and/or,在其它分支专门评估 + Args: + + Returns: + """ + if len(pred["having"]) != len(gold["having"]): + return [1, 1, 0] + + pred_total = len(pred["having"][::2]) + gold_total = len(gold["having"][::2]) + cnt = 0 + for pred_cond, gold_cond in zip(sorted(pred["having"][::2]), sorted(gold["having"][::2])): + if eval_cond(pred_cond, gold_cond, value_match) == 1: + cnt += 1 + + return [gold_total, pred_total, cnt] + + +def eval_order(pred, gold, value_match=True): + """ + Args: + + Returns: + """ + pred_total = gold_total = cnt = 0 + if len(pred["orderBy"]) > 0: + pred_total = 1 + if len(gold["orderBy"]) > 0: + gold_total = 1 + + if value_match: + if len(gold["orderBy"]) > 0 and pred["orderBy"] == gold["orderBy"] and pred["limit"] == gold["limit"]: + cnt = 1 + else: + if len(gold["orderBy"]) > 0 and pred["orderBy"] == gold["orderBy"]: + cnt = 1 + + return [gold_total, pred_total, cnt] + + +def eval_and_or(pred, gold): + """ + Args: + + Returns: + """ + + def _extract(conds): + """extract condition and/or""" + op_set = set() + for i in range(1, len(conds) - 1, 2): + left = conds[i - 1][:3] + right = conds[i + 1][:3] + left, right = list(sorted([left, right])) + op_set.add(f"{left}{conds[i].lower()}{right}") + return op_set + + # eval where and/or + pred_op_set = _extract(pred["where"]) + gold_op_set = _extract(gold["where"]) + if pred_op_set != gold_op_set: + return [1, 1, 0] + + # eval having and/or + pred_op_set = _extract(pred["having"]) + gold_op_set = _extract(gold["having"]) + if pred_op_set != gold_op_set: + return [1, 1, 0] + + return [1, 1, 1] + + +def get_nestedSQL(sql): + """ + Args: + + Returns: + """ + nested = [] + for cond_unit in sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2]: + if type(cond_unit[3]) is dict: + nested.append(cond_unit[3]) + if type(cond_unit[4]) is dict: + nested.append(cond_unit[4]) + ## + for from_nest_sql in [table_unit[1] for table_unit in sql["from"]["table_units"] if table_unit[0] == "sql"]: + nested.append(from_nest_sql) + + if sql["intersect"] is not None: + nested.append(sql["intersect"]) + if sql["except"] is not None: + nested.append(sql["except"]) + if sql["union"] is not None: + nested.append(sql["union"]) + return nested + + +def eval_nested(pred, gold, value_match=True): + """ + Args: + + Returns: + """ + gold_total = 0 + pred_total = 0 + cnt = 0 + if pred is not None: + pred_total += 1 + if gold is not None: + gold_total += 1 + if pred is not None and gold is not None: + cnt += Evaluator().eval_exact_match(pred, gold, value_match=value_match) + return [gold_total, pred_total, cnt] + + +def eval_IUEN(pred, gold, value_match=True): + """ + Args: + + Returns: + """ + lt1, pt1, cnt1 = eval_nested(pred["intersect"], gold["intersect"], value_match=value_match) + lt2, pt2, cnt2 = eval_nested(pred["except"], gold["except"], value_match=value_match) + lt3, pt3, cnt3 = eval_nested(pred["union"], gold["union"], value_match=value_match) + gold_total = lt1 + lt2 + lt3 + pred_total = pt1 + pt2 + pt3 + cnt = cnt1 + cnt2 + cnt3 + return [gold_total, pred_total, cnt] + + +def get_keywords(sql): + """ + Args: + + Returns: + """ + res = set() + if len(sql["where"]) > 0: + res.add("where") + if len(sql["groupBy"]) > 0: + res.add("group") + if len(sql["having"]) > 0: + res.add("having") + if len(sql["orderBy"]) > 0: + 
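# sql["orderBy"][0] holds the sort direction ("asc"/"desc"), which is counted as a keyword here +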
res.add(sql["orderBy"][0]) + res.add("order") + if sql["limit"] is not None: + res.add("limit") + if sql["except"] is not None: + res.add("except") + if sql["union"] is not None: + res.add("union") + if sql["intersect"] is not None: + res.add("intersect") + + # or keyword + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + if len([token for token in ao if token == "or"]) > 0: + res.add("or") + + # TODO + cond_units = sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + # not keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[0]]) > 0: + res.add("not") + + # in keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == COND_OPS.index("in")]) > 0: + res.add("in") + + # like keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == COND_OPS.index("like")]) > 0: + res.add("like") + + return res + + +def eval_keywords(pred, gold): + """ + Args: + + Returns: + """ + pred_keywords = get_keywords(pred) + gold_keywords = get_keywords(gold) + pred_total = len(pred_keywords) + gold_total = len(gold_keywords) + cnt = 0 + + for k in pred_keywords: + if k in gold_keywords: + cnt += 1 + return [gold_total, pred_total, cnt] + + +# Rebuild SQL functions for foreign key evaluation +def build_valid_col_units(table_units, schema): + """ + Args: + + Returns: + """ + col_ids = [table_unit[1] for table_unit in table_units if table_unit[0] == TABLE_TYPE["table_unit"]] + prefixs = [col_id[:-2] for col_id in col_ids] + valid_col_units = [] + for value in schema.id_map.values(): + if "." in value and value[: value.index(".")] in prefixs: + valid_col_units.append(value) + return valid_col_units + + +def rebuild_col_unit_col(valid_col_units, col_unit, kmap): + """ + Args: + + Returns: + """ + if col_unit is None: + return col_unit + + agg_id, col_id = col_unit[0], col_unit[1] + if col_id in kmap and col_id in valid_col_units: + col_id = kmap[col_id] + return agg_id, col_id + + +def rebuild_val_unit_col(valid_col_units, val_unit, kmap): + """ + Args: + + Returns: + """ + if val_unit is None: + return val_unit + + unit_op, col_unit1, col_unit2 = val_unit + col_unit1 = rebuild_col_unit_col(valid_col_units, col_unit1, kmap) + col_unit2 = rebuild_col_unit_col(valid_col_units, col_unit2, kmap) + return [unit_op, col_unit1, col_unit2] + + +def rebuild_table_unit_col(valid_col_units, table_unit, kmap): + """ + Args: + + Returns: + """ + if table_unit is None: + return table_unit + + table_type, col_unit_or_sql = table_unit + if isinstance(col_unit_or_sql, dict): + col_unit_or_sql = rebuild_sql_col(valid_col_units, col_unit_or_sql, kmap) + elif isinstance(col_unit_or_sql, tuple): + col_unit_or_sql = rebuild_col_unit_col(valid_col_units, col_unit_or_sql, kmap) + return table_type, col_unit_or_sql + + +def rebuild_cond_unit_col(valid_col_units, cond_unit, kmap): + """ + Args: + + Returns: + """ + if cond_unit is None: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + if type(val1) is dict: + rebuild_sql_col(valid_col_units, val1, kmap) + else: + val1 = str(val1) + val2 = str(val2) + + val_unit = rebuild_val_unit_col(valid_col_units, val_unit, kmap) + return [not_op, op_id, val_unit, val1, val2] + + +def rebuild_condition_col(valid_col_units, condition, kmap): + """ + Args: + + Returns: + """ + for idx in range(len(condition)): + if idx % 2 == 0: + condition[idx] = rebuild_cond_unit_col(valid_col_units, condition[idx], kmap) + return condition + + +def rebuild_select_col(valid_col_units, sel, kmap): + """ + Args: + 
+ Returns: + """ + if sel is None: + return sel + new_list = [] + for it in sel: + agg_id, val_unit = it + new_list.append((agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap))) + return new_list + + +def rebuild_from_col(valid_col_units, from_, kmap): + """ + Args: + + Returns: + """ + if from_ is None: + return from_ + + fn_proc = lambda x: rebuild_table_unit_col(valid_col_units, x, kmap) + from_["table_units"] = [fn_proc(table_unit) for table_unit in from_["table_units"]] + from_["conds"] = rebuild_condition_col(valid_col_units, from_["conds"], kmap) + return from_ + + +def rebuild_group_by_col(valid_col_units, group_by, kmap): + """ + Args: + + Returns: + """ + if group_by is None: + return group_by + + return [rebuild_col_unit_col(valid_col_units, col_unit, kmap) for col_unit in group_by] + + +def rebuild_order_by_col(valid_col_units, order_by, kmap): + """ + Args: + + Returns: + """ + if order_by is None or len(order_by) == 0: + return order_by + + direction, val_units = order_by + new_val_units = [(agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap)) for agg_id, val_unit in val_units] + return direction, new_val_units + + +def rebuild_sql_col(valid_col_units, sql, kmap): + """ + Args: + + Returns: + """ + if sql is None: + return sql + + sql["select"] = rebuild_select_col(valid_col_units, sql["select"], kmap) + sql["from"] = rebuild_from_col(valid_col_units, sql["from"], kmap) + sql["where"] = rebuild_condition_col(valid_col_units, sql["where"], kmap) + sql["groupBy"] = rebuild_group_by_col(valid_col_units, sql["groupBy"], kmap) + sql["orderBy"] = rebuild_order_by_col(valid_col_units, sql["orderBy"], kmap) + sql["having"] = rebuild_condition_col(valid_col_units, sql["having"], kmap) + sql["intersect"] = rebuild_sql_col(valid_col_units, sql["intersect"], kmap) + sql["except"] = rebuild_sql_col(valid_col_units, sql["except"], kmap) + sql["union"] = rebuild_sql_col(valid_col_units, sql["union"], kmap) + + return sql + + +def build_foreign_key_map(entry): + """ + Args: + + Returns: + """ + cols_orig = entry["column_names"] + tables_orig = entry["table_names"] + + # rebuild cols corresponding to idmap in Schema + cols = [] + for col_orig in cols_orig: + if col_orig[0] >= 0: + t = tables_orig[col_orig[0]] + c = col_orig[1] + cols.append("__" + t.lower() + "." 
+ c.lower() + "__") + else: + cols.append("__all__") + + def keyset_in_list(k1, k2, k_list): + """keyset_in_list""" + for k_set in k_list: + if k1 in k_set or k2 in k_set: + return k_set + new_k_set = set() + k_list.append(new_k_set) + return new_k_set + + foreign_key_list = [] + foreign_keys = entry["foreign_keys"] + for fkey in foreign_keys: + key1, key2 = fkey + key_set = keyset_in_list(key1, key2, foreign_key_list) + key_set.add(key1) + key_set.add(key2) + + foreign_key_map = {} + for key_set in foreign_key_list: + sorted_list = sorted(list(key_set)) + midx = sorted_list[0] + for idx in sorted_list: + foreign_key_map[cols[idx]] = cols[midx] + + return foreign_key_map + + +def build_foreign_key_map_from_json(table): + """ + Args: + + Returns: + """ + with open(table) as f: + data = json.load(f) + tables = {} + for entry in data: + tables[entry["db_id"]] = build_foreign_key_map(entry) + return tables + + +def evaluate_complex(table, gold, predict, mode="exact", single_equal=False): + """evaluate main + + Args: + table (str): all tables file name + gold (str): gold file name + pred (str): predict file name + mode (str): partial or exact + + Returns: float + exact match acc + """ + kmaps = build_foreign_key_map_from_json(table) + + with open(table) as ifs: + table_list = json.load(ifs) + table_dict = {} + for table in table_list: + if table["db_id"] in table_dict: + continue + table_dict[table["db_id"]] = table + with open(gold) as ifs: + gold_list = [l.strip().split("\t") for l in ifs if len(l.strip()) > 0] + gold_dict = dict([(x[0], x[1:]) for x in gold_list]) + + with open(predict) as ifs: + pred_list = [l.strip().split("\t") for l in ifs if len(l.strip()) > 0] + pred_dict = dict([(x[0], x[1:]) for x in pred_list]) + + evaluator = Evaluator() + + scores = { + "all": {"count": 0, "exact": 0, "acc": 0}, + "select": {"acc": 0, "rec": 0, "f1": 0}, + "select(no AGG)": {"acc": 0, "rec": 0, "f1": 0}, + "where": {"acc": 0, "rec": 0, "f1": 0}, + "where(no OP)": {"acc": 0, "rec": 0, "f1": 0}, + "group": {"acc": 0, "rec": 0, "f1": 0}, + "having": {"acc": 0, "rec": 0, "f1": 0}, + "order": {"acc": 0, "rec": 0, "f1": 0}, + "and/or": {"acc": 0, "rec": 0, "f1": 0}, + "IUEN": {"acc": 0, "rec": 0, "f1": 0}, + "keywords": {"acc": 0, "rec": 0, "f1": 0}, + } + + scores_novalue = { + "all": {"count": 0, "exact": 0, "acc": 0}, + "select": {"acc": 0, "rec": 0, "f1": 0}, + "select(no AGG)": {"acc": 0, "rec": 0, "f1": 0}, + "where": {"acc": 0, "rec": 0, "f1": 0}, + "where(no OP)": {"acc": 0, "rec": 0, "f1": 0}, + "group": {"acc": 0, "rec": 0, "f1": 0}, + "having": {"acc": 0, "rec": 0, "f1": 0}, + "order": {"acc": 0, "rec": 0, "f1": 0}, + "and/or": {"acc": 0, "rec": 0, "f1": 0}, + "IUEN": {"acc": 0, "rec": 0, "f1": 0}, + "keywords": {"acc": 0, "rec": 0, "f1": 0}, + } + + eval_err_num = 0 + for ins_id, g in gold_dict.items(): + scores["all"]["count"] += 1 + scores_novalue["all"]["count"] += 1 + if ins_id not in pred_dict: + continue + p = pred_dict[ins_id] + + pred_str = p[0] + gold_str, db_id = g + schema = Schema(table_dict[db_id]) + gold_str = gold_str.replace("==", "=") + gold_sql = get_sql(schema, gold_str, single_equal=single_equal) + + kmap = kmaps[db_id] + # rebuild sql for value evaluation + g_valid_col_units = build_valid_col_units(gold_sql["from"]["table_units"], schema) + gold_sql = rebuild_sql_col(g_valid_col_units, gold_sql, kmap) + + try: + pred_str = pred_str.replace("==", "=") + pred_sql = get_sql(schema, pred_str, single_equal=single_equal) + + p_valid_col_units = 
build_valid_col_units(pred_sql["from"]["table_units"], schema) + pred_sql = rebuild_sql_col(p_valid_col_units, pred_sql, kmap) + except Exception: + # If pred_sql is not valid, then we will use an empty sql to evaluate with the correct sql + pred_sql = g_empty_sql + eval_err_num += 1 + + exact_score = evaluator.eval_exact_match(pred_sql, gold_sql, value_match=True) + exact_score_novalue = evaluator.eval_exact_match(pred_sql, gold_sql, value_match=False) + if exact_score == 0: + logging.debug("error instance %s:\npred: %s\ngold: %s" % (ins_id, pred_str, gold_str)) + scores["all"]["exact"] += exact_score + scores_novalue["all"]["exact"] += exact_score_novalue + score = evaluator.eval_partial_match(pred_sql, gold_sql, value_match=True) + for k, v in score.items(): + for k1, v1 in v.items(): + if k1 in scores[k].keys(): + scores[k][k1] += v1 + + score_novalue = evaluator.eval_partial_match(pred_sql, gold_sql, value_match=False) + for k, v in score_novalue.items(): + for k1, v1 in v.items(): + if k1 in scores_novalue[k].keys(): + scores_novalue[k][k1] += v1 + + if scores["all"]["count"] == 0: + logging.warn("the number of evaluated instance is zero") + return 0.0, 0.0 + if scores_novalue["all"]["count"] == 0: + logging.warn("the number of evaluated no value instance is zero") + return 0.0, 0.0 + + for k, v in scores.items(): + if "acc" in v.keys() and "rec" in v.keys(): + scores[k]["acc"] = scores[k]["acc"] / scores["all"]["count"] * 1.0 + scores[k]["rec"] = scores[k]["rec"] / scores["all"]["count"] * 1.0 + scores[k]["f1"] = 2 * scores[k]["acc"] * scores[k]["rec"] / (scores[k]["acc"] + scores[k]["rec"]) + + for k, v in scores_novalue.items(): + if "acc" in v.keys() and "rec" in v.keys(): + scores_novalue[k]["acc"] = scores_novalue[k]["acc"] / scores_novalue["all"]["count"] * 1.0 + scores_novalue[k]["rec"] = scores_novalue[k]["rec"] / scores_novalue["all"]["count"] * 1.0 + scores_novalue[k]["f1"] = ( + 2 + * scores_novalue[k]["acc"] + * scores_novalue[k]["rec"] + / (scores_novalue[k]["acc"] + scores_novalue[k]["rec"]) + ) + scores["all"]["acc"] = scores["all"]["exact"] / scores["all"]["count"] + scores_novalue["all"]["acc"] = scores_novalue["all"]["exact"] / scores_novalue["all"]["count"] + + return scores, scores_novalue + + +def evaluate(table, gold, predict, mode="exact", dataset="DuSQL"): + """ + dataset:['CSpider', 'DuSQL', 'NL2SQL'] + """ + if dataset == "NL2SQL": + scores, scores_novalue = evaluate_NL2SQL(table, gold, predict, mode=mode, single_equal=True) + elif dataset == "DuSQL": + scores, scores_novalue = evaluate_complex(table, gold, predict, mode=mode, single_equal=True) + else: + scores, scores_novalue = evaluate_complex(table, gold, predict, mode=mode, single_equal=True) + return scores, scores_novalue + + +if __name__ == "__main__": + from argparse import ArgumentParser + + arg_parser = ArgumentParser() + arg_parser.add_argument("-g", "--gold", dest="gold", type=str) + arg_parser.add_argument("-p", "--pred", dest="pred", type=str) + arg_parser.add_argument("-t", "--table", dest="table", type=str) + arg_parser.add_argument("-d", "--dataset", choices=("NL2SQL", "CSpider", "DuSQL"), required=True, type=str) + arg_parser.add_argument("--debug", default=False, action="store_true") + args = arg_parser.parse_args() + + logging.basicConfig( + level=logging.DEBUG if args.debug else logging.INFO, + format="%(levelname)s: %(asctime)s %(filename)s [%(funcName)s:%(lineno)d][%(process)d] %(message)s", + datefmt="%m-%d %H:%M:%S", + filename="eval.log", + filemode="a", + ) + + out, out_novalue = 
evaluate(args.table, args.gold, args.pred, dataset=args.dataset) + print("with value:") + print(out["all"]) + print("*" * 20) + print("without_value") + print(out_novalue["all"] if out_novalue else "None") diff --git a/examples/text_to_sql/RAT-SQL/evaluation/utils.py b/examples/text_to_sql/RAT-SQL/evaluation/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..081b2df3b1f75822bf6b1117b21ad2fbaa6298e9 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/evaluation/utils.py @@ -0,0 +1,400 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import json +import logging +import re + +op_sql_dict = {0: ">", 1: "<", 2: "==", 3: "!="} +agg_sql_dict = {0: "", 1: "AVG", 2: "MAX", 3: "MIN", 4: "COUNT", 5: "SUM"} +conn_sql_dict = {0: "", 1: "and", 2: "or"} + +# from IRNet keywords, need to be simplify +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +COND_OPS = ("not_in", "between", "==", ">", "<", ">=", "<=", "!=", "in", "like") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +LOGIC_AND_OR = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + +CONST_COLUMN = set(["time_now"]) + +EXPECT_BRACKET_PRE_TOKENS = set(AGG_OPS + SQL_OPS + COND_OPS + CLAUSE_KEYWORDS + ("from", ",")) + +g_empty_sql = { + "select": [], + "from": {"conds": [], "table_units": []}, + "where": [], + "groupBy": [], + "having": [], + "orderBy": [], + "limit": None, + "except": None, + "intersect": None, + "union": None, +} + + +def is_float(value): + """is float""" + try: + float(value) + return True + except ValueError: + return False + except TypeError: + return False + + +def get_scores(count, pred_total, gold_total): + """ + Args: + + Returns: + """ + if pred_total != gold_total: + return 0, 0, 0 + elif count == pred_total: + return 1, 1, 1 + return 0, 0, 0 + + +def tokenize_NL2SQL(string, cols, single_equal=False, math=True): + """ + Args: + + Returns: + """ + + string = string.replace("'", '"').lower() + assert string.count('"') % 2 == 0, "Unexpected quote" + + re_cols = [i.lower() for i in cols] + + def _extract_value(string): + """extract values in sql""" + fields = string.split('"') + for idx, tok in enumerate(fields): + if idx % 2 == 1: + fields[idx] = '"%s"' % (tok) + return fields + + def _resplit(tmp_tokens, fn_split, fn_omit): + """resplit""" + new_tokens = [] + for token in tmp_tokens: + token = token.strip() + if fn_omit(token): + new_tokens.append(token) + elif re.match(r"\d\d\d\d-\d\d(-\d\d)?", token): + new_tokens.append('"%s"' % (token)) + else: + new_tokens.extend(fn_split(token)) + return new_tokens + + def _split_aggs(tmp_tokens): + """split aggs in select""" + new_toks = [] + for i, tok in enumerate(tmp_tokens): + if tok in ("from", "where"): + 
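# once the FROM/WHERE clause starts, keep the remaining tokens unchanged and stop splitting +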
new_toks.extend(tmp_tokens[i:]) + break + if not ((tok.endswith(")") or tok.endswith("),")) and len(tok) > 5): + new_toks.extend(tok.split(",")) + continue + + extra = "" + if tok.endswith(","): + extra = "," + tok = tok[:-1] + + if tok[:4] in ("sum(", "avg(", "max(", "min("): + new_toks.extend([tok[:3], "(", tok[4:-1], ")"]) + elif tok[:6] == "count(": + new_toks.extend(["count", "(", tok[6:-1], ")"]) + else: + new_toks.append(tok) + + if extra: + new_toks.append(extra) + + return new_toks + + def join_by_col(toks, cols): + new_toks = [] + _len = len(toks) + i = 0 + while i < _len - 1: + merge = False + for j in range(10): + if "".join(toks[i : i + j]) in cols: + new_toks.append("".join(toks[i : i + j])) + i += j + merge = True + if not merge: + new_toks.append(toks[i]) + i += 1 + new_toks.append(toks[-1]) + return new_toks + + tokens_tmp = _extract_value(string) + + two_bytes_op = ["==", "!=", ">=", "<=", "<>", ""] + sep2 = re.compile("(" + "|".join(two_bytes_op) + ")") # 多字节运算符 + tokens_tmp = _resplit(tokens_tmp, lambda x: x.split(" "), lambda x: x.startswith('"')) + tokens_tmp = _resplit(tokens_tmp, lambda x: re.split(sep2, x), lambda x: x.startswith('"')) + tokens_tmp = _split_aggs(tokens_tmp) + tokens = list(filter(lambda x: x.strip() != "", tokens_tmp)) + + tokens = join_by_col(tokens, re_cols) + + def _post_merge(tokens): + """merge: + * col name with "(", ")" + * values with +/- + """ + idx = 1 + while idx < len(tokens): + if tokens[idx] == "(" and tokens[idx - 1] not in EXPECT_BRACKET_PRE_TOKENS and tokens[idx - 1] != "=": + while idx < len(tokens): + tmp_tok = tokens.pop(idx) + tokens[idx - 1] += tmp_tok + if tmp_tok == ")": + break + elif tokens[idx] in ("+", "-") and tokens[idx - 1] in COND_OPS and idx + 1 < len(tokens): + tokens[idx] += tokens[idx + 1] + tokens.pop(idx + 1) + idx += 1 + else: + idx += 1 + return tokens + + tokens = _post_merge(tokens) + if single_equal: + tokens = [i if i != "=" else "==" for i in tokens] + return tokens + + +def sql2query(sql, cols): + """ + transform sql json to sql query, this is only for NL2SQL, eg. 
select a, b where a op val1 + """ + + sels = sql["sel"] + aggs = sql["agg"] + op = sql["cond_conn_op"] + conds = sql["conds"] + + condstrs = [f'{cols[cond[0]]} {op_sql_dict[cond[1]]} "{cond[2]}"' for cond in conds] + cond_str = f" {conn_sql_dict[op]} ".join(condstrs) + + def agg_col(agg, col): + if agg == 0: + return cols[col] + else: + return f"{agg_sql_dict[agg]} ( {cols[col]} )" + + selstrs = [agg_col(i, j) for i, j in zip(aggs, sels)] + sel_str = " , ".join(selstrs) + + return f"SELECT {sel_str} WHERE {cond_str}" + + +def query2sql(query, cols, single_equal=False, with_value=True): + + cols = [i.lower() for i in cols] + + sql_op_dict = {} + sql_agg_dict = {} + sql_conn_dict = {} + for k, v in op_sql_dict.items(): + sql_op_dict[v] = k + sql_op_dict[v.lower()] = k + for k, v in agg_sql_dict.items(): + sql_agg_dict[v] = k + sql_agg_dict[v.lower()] = k + for k, v in conn_sql_dict.items(): + sql_conn_dict[v] = k + sql_conn_dict[v.lower()] = k + + query = tokenize_NL2SQL(query, cols, single_equal=single_equal, math=False) + assert query[0] == "select" + + def parse_cols(toks, start_idx): + """ + :returns next idx, (agg, col) + """ + if "from" in toks: + toks = toks[: toks.index("from")] + idx = start_idx + len_ = len(toks) + outs = [] + while idx < len_: + if toks[idx] in AGG_OPS: + idx += 1 + assert idx < len_ and toks[idx] == "(", toks[idx] + idx += 1 + agg, col = toks[start_idx], toks[idx] + idx += 1 + assert idx < len_ and toks[idx] == ")", toks[idx] + "".join(toks) + idx += 1 + outs.append((agg, col)) + elif toks[idx] == ",": + idx += 1 + else: + agg, col = "", toks[idx] + idx += 1 + outs.append(("", col)) + return outs + + def _format_col(old_col): + """format""" + if old_col.lower().startswith("table_"): + return old_col.split(".", 1)[1] + else: + return old_col + + if "where" not in query: + cond_index = len(query) + conn = "" + conds = [] + else: + cond_index = query.index("where") + condstr = query[cond_index + 1 :] + conn = [i for i in condstr[3::4]] + assert len(set(conn)) < 2, conn + conn = list(set(conn))[0] if conn else "" + conds = [condstr[i : i + 3] for i in range(len(condstr))[::4]] + sels = parse_cols(query[:cond_index], 1) + + sql = {} + + sql["agg"] = [sql_agg_dict[i[0]] for i in sels] + sql["cond_conn_op"] = sql_conn_dict[conn] + sql["sel"] = [cols.index(_format_col(i[1])) for i in sels] + if with_value: + sql["conds"] = [[cols.index(_format_col(c[0])), sql_op_dict[c[1]], '"' + c[2].strip('"') + '"'] for c in conds] + else: + sql["conds"] = [[cols.index(_format_col(c[0])), sql_op_dict[c[1]], "1"] for c in conds] + + sql_sels = [(sql_agg_dict[i[0]], cols.index(_format_col(i[1]))) for i in sels] + return sql, sql_sels + + +def evaluate_NL2SQL(table, gold, predict, single_equal=False, mode=None): + scores = {} + scores_novalue = {} + + # load db + with open(table) as ifs: + table_list = json.load(ifs) + table_dict = {} + for table in table_list: + table_dict[table["db_id"]] = table + + # load qa + with open(gold, "r", encoding="utf-8") as f1, open(predict, "r", encoding="utf-8") as f2: + gold_list = [l.strip().split("\t") for l in f1 if len(l.strip()) > 0] + gold_dict = dict([(x[0], x[1:]) for x in gold_list]) + + pred_list = [l.strip().split("\t") for l in f2 if len(l.strip()) > 0] + pred_dict = dict([(x[0], x[1]) for x in pred_list]) + + right = total = 0 + cnt_sel = 0 + cnt_cond = cnt_conn = 0 + + def compare_set(gold, pred): + _pred = copy.deepcopy(pred) + _gold = copy.deepcopy(gold) + + pred_total = len(_pred) + gold_total = len(_gold) + cnt = 0 + + for unit in 
_pred: + if unit in _gold: + cnt += 1 + _gold.remove(unit) + return cnt, pred_total, gold_total + + for qid, item in gold_dict.items(): + total += 1 + if qid not in pred_dict: + continue + sql_gold, db_id = "".join(item[0:-1]), item[-1] + + db = table_dict[db_id] + cols = [i[1] for i in db["column_names"]] + + sql_pred = pred_dict[qid] + + try: + sql_gold = sql_gold.replace("==", "=") + sql_pred = sql_pred.replace("==", "=") + components_gold, sels_gold = query2sql(sql_gold, cols, single_equal=single_equal) + components_pred, sels_pred = query2sql(sql_pred, cols, single_equal=single_equal) + + cnt, pred_total, gold_total = compare_set(sels_gold, sels_pred) + score_sels, _, _ = get_scores(cnt, pred_total, gold_total) + cnt, pred_total, gold_total = compare_set(components_gold["conds"], components_pred["conds"]) + score_conds, _, _ = get_scores(cnt, pred_total, gold_total) + score_conn = components_gold["cond_conn_op"] == components_pred["cond_conn_op"] + + if score_sels: + cnt_sel += 1 + if score_conds: + cnt_cond += 1 + if score_conn: + cnt_conn += 1 + if score_sels and score_conds and score_conn: + right += 1 + else: + logging.debug("error instance %s:\npred: %s\ngold: %s" % (qid, sql_pred, sql_gold)) + except Exception: + # traceback.print_exc() + logging.warning("parse sql error, error sql:") + logging.warning(sql_gold + "|||" + sql_pred) + # raise e + continue + + scores["all"] = dict([("count", total), ("exact", right), ("acc", right * 1.0 / total)]) + scores["select"] = dict([("count", total), ("exact", cnt_sel), ("acc", cnt_sel * 1.0 / total)]) + scores["condition"] = dict([("count", total), ("exact", cnt_cond), ("acc", cnt_cond * 1.0 / total)]) + scores["connection"] = dict([("count", total), ("exact", cnt_conn), ("acc", cnt_conn * 1.0 / total)]) + + return scores, scores_novalue + + +if __name__ == "__main__": + print(query2sql("SELECT 所在省份 , 产线名称 WHERE 日熔量(吨) < 600", [])) + print(query2sql("SELECT MAX ( 货币资金(亿元) ) WHERE 总资产(亿元) > 100 or 净资产(亿元) > 100", [])) + print(query2sql("SELECT 股价 , EPS17A WHERE 铁路公司 = 广深铁路", ["股价", "铁路公司", "EPS17A"], True)) + cols = ["公司", "2014(亿元)", "2015(亿元)", "2016(亿元)"] + print(query2sql("SELECT COUNT ( 公司 ) WHERE 2014(亿元) > 20 and 2015(亿元) > 20 and 2016(亿元) > 20", cols)) + + # print(query2sql("SELECT 书名/Title WHERE 索书号/CallNo. == BF637.U53C555=12010 or ISBN == 9.78142212482e+12", ["书名/Title","索书号/CallNo.",'ISBN'])) + # print(tokenize("SELECT 标称生产企业名称 WHERE 规格(包装规格) == 187.2g/盒 and 标称产品名称 == 富兰克牌西洋参含片", math=False)) + # print(tokenize("SELECT 设备型号 WHERE 生产企业 == AISINAWCO.,LTD. 
or 设备名称 == WCDMA无线数据终端", math=False))
+    # print(tokenize("SELECT sum(t1.amount_claimed) FROM claim_headers AS t1 JOIN claims_documents AS t2 ON t1.claim_header_id = t2.claim_id WHERE t2.created_date = ( SELECT created_date FROM claims_documents ORDER BY created_date LIMIT 1 )"))
+    # print(query2sql("SELECT 书号(ISBN) WHERE 教材名称 == 线性代数 or 教材名称 == 中级有机化学", ["书号(ISBN)", "教材名称" ]))
diff --git a/examples/text_to_sql/RAT-SQL/requirements.txt b/examples/text_to_sql/RAT-SQL/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..4c5c04d646299aa5a3171b4e24153f503b86225d
--- /dev/null
+++ b/examples/text_to_sql/RAT-SQL/requirements.txt
@@ -0,0 +1,11 @@
+nvgpu
+tqdm
+LAC
+cn2an
+setproctitle
+sentencepiece
+attrs
+asdl
+networkx
+pyrsistent
+jsonnet
diff --git a/examples/text_to_sql/RAT-SQL/run.sh b/examples/text_to_sql/RAT-SQL/run.sh
new file mode 100644
index 0000000000000000000000000000000000000000..e625033cf6b322c7e8e43857e71106557b59beed
--- /dev/null
+++ b/examples/text_to_sql/RAT-SQL/run.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+
+WORKROOT=$(cd $(dirname $0); pwd)
+cd $WORKROOT
+
+#### gpu libs ####
+# 添加cuda, cudnn库的路径
+export LD_LIBRARY_PATH=/home/work/cuda-10.0/lib64:$LD_LIBRARY_PATH
+export LD_LIBRARY_PATH=/home/work/cuda-10.0/extras/CUPTI/lib64:$LD_LIBRARY_PATH
+# 单独添加库的路径
+##export LD_LIBRARY_PATH=/home/work/cudnn/cudnn_v7.4/cuda/lib64:$LD_LIBRARY_PATH
+# 添加NCCL库的路径。单卡训练时非必须
+export LD_LIBRARY_PATH=/home/work/nccl_2.3.5/lib/:$LD_LIBRARY_PATH
+
+#### paddle ####
+# 是否是分布式训练,0标识是分布式,1标识是单机
+export PADDLE_IS_LOCAL=1
+# 申请显存比例
+export FLAGS_fraction_of_gpu_memory_to_use=1.0
+# 选择要使用的GPU
+export CUDA_VISIBLE_DEVICES=`python script/available_gpu.py --best 1`
+# CPU 核数
+export CPU_NUM=1
+# 表示是否使用垃圾回收策略来优化网络的内存使用,<0表示禁用,>=0表示启用
+export FLAGS_eager_delete_tensor_gb=1.0
+# 是否使用快速垃圾回收策略
+export FLAGS_fast_eager_deletion_mode=1
+# 垃圾回收策略释放变量的内存大小百分比,范围为[0.0, 1.0]
+export FLAGS_memory_fraction_of_eager_deletion=1
+# 如果为1,则会在allreduce_op_handle中调用cudaStreamSynchronize(nccl_stream),这种模式在某些情况下可以获得更好的性能
+#export FLAGS_sync_nccl_allreduce=1
+
+#### python ####
+export PYTHONPATH=$WORKROOT:$PYTHONPATH
+#echo "PYTHONPATH=$PYTHONPATH"
+## python 3.6/3.7 is recommended
+PYTHON_BIN=`which python3`
+
+echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
+echo "running command: ($PYTHON_BIN $@)"
+$PYTHON_BIN -u $@
+exit $?
+
diff --git a/examples/text_to_sql/RAT-SQL/script/available_gpu.py b/examples/text_to_sql/RAT-SQL/script/available_gpu.py
new file mode 100644
index 0000000000000000000000000000000000000000..1de2aba8171df20c4476402dfbbeefeed1a39740
--- /dev/null
+++ b/examples/text_to_sql/RAT-SQL/script/available_gpu.py
@@ -0,0 +1,45 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+import traceback
+
+import nvgpu
+
+logging.basicConfig(
+    level=logging.DEBUG,
+    format="%(levelname)s: %(asctime)s %(filename)s" " [%(funcName)s:%(lineno)d][%(process)d] %(message)s",
+    datefmt="%m-%d %H:%M:%S",
+    filename=None,
+    filemode="a",
+)
+
+if __name__ == "__main__":
+    from argparse import ArgumentParser
+
+    try:
+        arg_parser = ArgumentParser(description="print available_gpu id, using nvgpu")
+        arg_parser.add_argument("-b", "--best", default=None, type=int, help="output best N")
+        args = arg_parser.parse_args()
+
+        if args.best is not None:
+            gpus = sorted(nvgpu.gpu_info(), key=lambda x: (x["mem_used"], x["index"]))
+            ids = [x["index"] for x in gpus]
+            print(",".join(ids[: args.best]))
+        else:
+            print(",".join(nvgpu.available_gpus()))
+
+    except Exception:
+        traceback.print_exc()
+        exit(-1)
diff --git a/examples/text_to_sql/RAT-SQL/script/schema_linking.py b/examples/text_to_sql/RAT-SQL/script/schema_linking.py
new file mode 100644
index 0000000000000000000000000000000000000000..76ac0c9df30d165874e01e2948e1370c802b1053
--- /dev/null
+++ b/examples/text_to_sql/RAT-SQL/script/schema_linking.py
@@ -0,0 +1,231 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import logging
+import re
+import sys
+import traceback
+from collections import defaultdict
+
+from text2sql.dataproc.dusql_dataset_v2 import load_tables
+
+logging.basicConfig(
+    level=logging.DEBUG,
+    format="%(levelname)s: %(asctime)s %(filename)s" " [%(funcName)s:%(lineno)d][%(process)d] %(message)s",
+    datefmt="%m-%d %H:%M:%S",
+    filename=None,
+    filemode="a",
+)
+
+g_date_patt = re.compile(r"(([0-9]{2})[0-9]{2}年)?[0-9]{1,2}月[0-9]{2}日|([0-9]{2})[0-9]{2}年[0-9]{1,2}月")
+
+
+def get_char_list(sentence):
+    def is_ascii(s):
+        """check if s is an English letter or number
+        Args:
+            s (str): NULL
+        Returns: bool
+        """
+        return ord(s) < 128
+
+    if len(sentence) == 0:
+        return []
+
+    lst_result = [sentence[0]]
+    last_is_ascii = lst_result[-1].isalnum()
+    for char in sentence[1:]:
+        if char == " ":
+            last_is_ascii = False
+            continue
+        elif char == "-":
+            last_is_ascii = False
+            lst_result.append(char)
+            continue
+
+        if is_ascii(char) and last_is_ascii:
+            lst_result[-1] += char
+            continue
+
+        if is_ascii(char):
+            last_is_ascii = True
+        else:
+            last_is_ascii = False
+
+        lst_result.append(char)
+
+    return tuple(lst_result)
+
+
+def _format_date_cell(old_cell):
+    new_cell = old_cell.rstrip("月日")
+    new_cell = new_cell.replace("年", "-")
+    new_cell = new_cell.replace("月", "-")
+    return new_cell
+
+
+def _build(cells):
+    dct_index = defaultdict(set)
+    for cell in set(cells):
+        if type(cell) is not str:
+            continue
+        cell = cell.strip()
+        if re.match(g_date_patt, cell):
+            cell = _format_date_cell(cell)
+        cell_chars = get_char_list(cell.lower())
+        dct_index[cell.lower()].add((cell, len(cell_chars)))
+        for pos in range(len(cell_chars) - 1):
+            bigram = cell_chars[pos : pos + 2]
+            # tri_gram = cell_chars[pos: pos + 3]
+            # four_gram = cell_chars[pos: pos
+ 4] + dct_index[bigram].add((cell, len(cell_chars) - 1)) + # dct_index[tri_gram].add((cell, len(cell_chars) - 2)) + # dct_index[four_gram].add(cell) + return dct_index + + +def build_cell_index(db_dict): + for db in db_dict.values(): + column_cells = [] + for column in db.columns: + cell_index = _build(column.cells) + column_cells.append(cell_index) + db.column_cells_index = column_cells + + +def extract_value_from_sql(sql_json, sql_format="dusql"): + dct_col_values = defaultdict(list) + if sql_format == "nl2sql": + for col, _, val in item["sql"]["conds"]: + dct_col_values[col].append(val) + return dct_col_values + + def _merge_dict(base_dict, extra_dict): + for k, v in extra_dict.items(): + base_dict[k].extend(v) + + def _extract_value_from_sql_cond(cond, dct_col_values): + if type(cond[3]) is dict: + new_col_values = extract_value_from_sql(cond[3]) + _merge_dict(dct_col_values, new_col_values) + return + col_id = cond[2][1][1] + dct_col_values[col_id].append(cond[3]) + if cond[4] is not None: + dct_col_values[col_id].append(cond[4]) + + for table_unit in sql_json["from"]["table_units"]: + if type(table_unit[1]) is dict: + new_col_values = extract_value_from_sql(table_unit[1]) + _merge_dict(dct_col_values, new_col_values) + + for cond in sql_json["where"][::2]: + _extract_value_from_sql_cond(cond, dct_col_values) + for cond in sql_json["having"][::2]: + _extract_value_from_sql_cond(cond, dct_col_values) + + if sql_json["intersect"] is not None: + new_col_values = extract_value_from_sql(sql_json["intersect"]) + _merge_dict(dct_col_values, new_col_values) + if sql_json["union"] is not None: + new_col_values = extract_value_from_sql(sql_json["union"]) + _merge_dict(dct_col_values, new_col_values) + if sql_json["except"] is not None: + new_col_values = extract_value_from_sql(sql_json["except"]) + _merge_dict(dct_col_values, new_col_values) + + return dct_col_values + + +def search_values(query, db, extra_values): + lst_match_values = [] + for column, cell_index in zip(db.columns, db.column_cells_index): + if column.id == 0: + lst_match_values.append([]) + continue + + candi_cnt = defaultdict(float) + query_chars = get_char_list(query.lower()) + appear_set = set() + for pos in range(len(query_chars)): + unigram = query_chars[pos] + if len(unigram) > 2 and unigram not in appear_set and unigram in cell_index: + for cell, base in cell_index[unigram]: + candi_cnt[cell] += 1.0 / base + if pos == len(query_chars) - 1: + break + + bigram = query_chars[pos : pos + 2] + if bigram not in cell_index: + continue + if bigram in appear_set: + continue + appear_set.add(bigram) + for cell, base in cell_index[bigram]: + candi_cnt[cell] += 1.0 / base + + if extra_values is not None and column.id in extra_values: + gold_values = extra_values[column.id] + for gval in gold_values: + candi_cnt[str(gval)] += 2.0 + + lst_match_values.append(list(sorted(candi_cnt.items(), key=lambda x: x[1], reverse=True))[:10]) + + return lst_match_values + + +if __name__ == "__main__": + import argparse + + try: + arg_parser = argparse.ArgumentParser(description="linking candidate values for each column") + arg_parser.add_argument( + "input", nargs="?", type=argparse.FileType("r"), default=sys.stdin, help="input file path" + ) + arg_parser.add_argument("-s", "--db-schema", required=True, help="file path") + arg_parser.add_argument("-c", "--db-content", required=True, help="file path") + arg_parser.add_argument( + "-o", "--output", type=argparse.FileType("w"), default=sys.stdout, help="output file path" + ) + 
arg_parser.add_argument("-t", "--is-train", default=False, action="store_true") + arg_parser.add_argument("-f", "--sql-format", default="dusql", choices=["dusql", "nl2sql", "cspider"]) + args = arg_parser.parse_args() + + sys.stderr.write(">>> loading databases...\n") + dct_db, _ = load_tables(args.db_schema, args.db_content) + build_cell_index(dct_db) + + sys.stderr.write(">>> extracting values...\n") + lst_output = [] + for idx, item in enumerate(json.load(args.input)): + question_id = item.get("question_id", f"qid{idx:06d}") + question = item["question"] + db_id = item["db_id"] + db = dct_db[db_id] + + extra_values = None + if args.is_train: + extra_values = extract_value_from_sql(item["sql"], args.sql_format) + + match_values = search_values(question, db, extra_values) + lst_output.append( + {"question_id": question_id, "question": question, "db_id": db_id, "match_values": match_values} + ) + + json.dump(lst_output, args.output, indent=2, ensure_ascii=False) + except Exception: + traceback.print_exc() + # logging.critical(traceback.format_exc()) + exit(-1) diff --git a/examples/text_to_sql/RAT-SQL/script/text2sql_main.py b/examples/text_to_sql/RAT-SQL/script/text2sql_main.py new file mode 100644 index 0000000000000000000000000000000000000000..14f329a3035d425bf41d72707d43708b8aa07d47 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/script/text2sql_main.py @@ -0,0 +1,218 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import random +import sys +from functools import partial +from pathlib import Path + +import numpy as np +import paddle +import paddle.distributed as dist +import text2sql +from text2sql import dataproc, global_config, launch +from text2sql.grammars.cspider_v2 import CSpiderLanguageV2 +from text2sql.grammars.dusql_v2 import DuSQLLanguageV2 +from text2sql.grammars.nl2sql import NL2SQLLanguage + +ModelClass = None +GrammarClass = None +DataLoaderClass = None +DatasetClass = None +g_input_encoder = None +g_label_encoder = None + + +def preprocess(config): + dataset_config = { + "db_file": config.data.db, + "input_encoder": g_input_encoder, + "label_encoder": g_label_encoder, + "is_cached": False, + } + + output_base = config.data.output + if config.data.train_set is not None: + dataset = DatasetClass(name="train", data_file=config.data.train_set, **dataset_config) + dataset.save(output_base, save_db=True) + g_label_encoder.save(Path(output_base) / "label_vocabs") + + if config.data.dev_set is not None: + dataset = DatasetClass(name="dev", data_file=config.data.dev_set, **dataset_config) + dataset.save(output_base, save_db=False) + + if config.data.test_set is not None: + dataset = DatasetClass(name="test", data_file=config.data.test_set, **dataset_config) + dataset.save(output_base, save_db=False) + + +def train(config): + logging.info("training arguments: %s", config) + if config.train.use_data_parallel: + logging.info("parallel mode. 
init env...") + dist.init_parallel_env() + + dataset_config = { + "db_file": config.data.db, + "input_encoder": g_input_encoder, + "label_encoder": g_label_encoder, + "is_cached": True, + } + train_set = DatasetClass(name="train", data_file=config.data.train_set, **dataset_config) + dev_set = DatasetClass(name="dev", data_file=config.data.dev_set, **dataset_config) + + shuf_train = True if not config.general.is_debug else False + train_reader = DataLoaderClass(config, train_set, batch_size=config.general.batch_size, shuffle=shuf_train) + # dev_reader = dataproc.DataLoader(config, dev_set, batch_size=config.general.batch_size, shuffle=False) + dev_reader = DataLoaderClass(config, dev_set, batch_size=1, shuffle=False) + max_train_steps = config.train.epochs * (len(train_set) // config.general.batch_size // config.train.trainer_num) + + model = ModelClass(config.model, g_label_encoder) + if config.model.init_model_params is not None: + logging.info("loading model param from %s", config.model.init_model_params) + model.set_state_dict(paddle.load(config.model.init_model_params)) + if config.train.use_data_parallel: + logging.info("parallel mode. init model...") + model = paddle.DataParallel(model) + + optimizer = text2sql.optim.init_optimizer(model, config.train, max_train_steps) + if config.model.init_model_optim is not None: + logging.info("loading model optim from %s", config.model.init_model_optim) + optimizer.set_state_dict(paddle.load(config.model.init_model_optim)) + + logging.info("start of training...") + launch.trainer.train(config, model, optimizer, config.train.epochs, train_reader, dev_reader) + logging.info("end of training...") + + +def inference(config): + if config.model.init_model_params is None: + raise RuntimeError("config.init_model_params should be a valid model path") + + dataset_config = { + "db_file": config.data.db, + "input_encoder": g_input_encoder, + "label_encoder": g_label_encoder, + "is_cached": True, + } + test_set = DatasetClass(name="test", data_file=config.data.test_set, **dataset_config) + test_reader = DataLoaderClass(config, test_set, batch_size=1, shuffle=False) + + model = ModelClass(config.model, g_label_encoder) + logging.info("loading model param from %s", config.model.init_model_params) + state_dict = paddle.load(config.model.init_model_params) + model.set_state_dict(state_dict) + + logging.info("start of inference...") + launch.infer.inference( + model, test_reader, config.data.output, beam_size=config.general.beam_size, model_name=config.model.model_name + ) + logging.info("end of inference...") + + +def evaluate(config): + dataset_config = { + "db_file": config.data.db, + "input_encoder": g_input_encoder, + "label_encoder": g_label_encoder, + "is_cached": True, + "schema_file": config.data.db_schema, + } + test_set = DatasetClass(name="test", data_file=config.data.test_set, **dataset_config) + with open(config.data.eval_file) as ifs: + infer_results = list(ifs) + model = None + + logging.info("start of evaluating...") + launch.eval.evaluate(model, test_set, infer_results, eval_value=config.general.is_eval_value) + logging.info("end of evaluating....") + + +def init_env(config): + log_level = logging.INFO if not config.general.is_debug else logging.DEBUG + formatter = logging.Formatter("%(levelname)s %(asctime)s %(filename)s:%(lineno)03d * %(message)s") + logger = logging.getLogger() + logger.setLevel(log_level) + handler = logger.handlers[0] + handler.setLevel(log_level) + handler.setFormatter(formatter) + + seed = config.train.random_seed + if 
seed is not None: + random.seed(seed) + paddle.seed(seed) + np.random.seed(seed) + + global ModelClass + global GrammarClass + global DatasetClass + global DataLoaderClass + global g_input_encoder + global g_label_encoder + + if config.model.grammar_type == "dusql_v2": + GrammarClass = DuSQLLanguageV2 + elif config.model.grammar_type == "nl2sql": + GrammarClass = NL2SQLLanguage + elif config.model.grammar_type == "cspider_v2": + GrammarClass = CSpiderLanguageV2 + else: + raise ValueError("grammar type is not supported: %s" % (config.model.grammar_type)) + g_label_encoder = dataproc.SQLPreproc( + config.data.grammar, + GrammarClass, + predict_value=config.model.predict_value, + is_cached=config.general.mode != "preproc", + ) + + assert config.model.model_name == "seq2tree_v2", "only seq2tree_v2 is supported" + g_input_encoder = dataproc.ErnieInputEncoderV2(config.model) + ModelClass = lambda x1, x2: text2sql.models.EncDecModel(x1, x2, "v2") + DatasetClass = dataproc.DuSQLDatasetV2 + DataLoaderClass = partial(dataproc.DataLoader, collate_fn=dataproc.dataloader.collate_batch_data_v2) + + +def _set_proc_name(config, tag_base): + """ + set process name on local machine + """ + if config.general.is_cloud: + return + if tag_base.startswith("train"): + tag_base = "train" + import setproctitle + + setproctitle.setproctitle(tag_base + "_" + config.data.output.rstrip("/").split("/")[-1]) + + +if __name__ == "__main__": + config = global_config.gen_config() + init_env(config) + + run_mode = config.general.mode + if run_mode == "preproc": + preprocess(config) + sys.exit(0) + + _set_proc_name(config, run_mode) + if run_mode == "test": + evaluate(config) + elif run_mode == "infer": + inference(config) + elif run_mode.startswith("train"): + if config.train.use_data_parallel: + dist.spawn(train, args=(config,)) + else: + train(config) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..2d65733fca8c9689d99c4f5d538b365caa67de56 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/__init__.py @@ -0,0 +1,21 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from . import dataproc +from . import grammars +from . import io +from . import launch +from . import models +from . import optim +from . import utils diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..5aa57bf32d9483441dffe231e6411ca6f34d55f3 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .base_classes import * + +from . import dataloader +from .dataloader import DataLoader +from .dusql_dataset_v2 import DuSQLDatasetV2 +from .ernie_input_encoder_v2 import ErnieInputEncoderV2 +from .sql_preproc_v2 import SQLPreproc diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/base_classes.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/base_classes.py new file mode 100644 index 0000000000000000000000000000000000000000..3514e5c63eb60e05968b0d0a9e0edbc403d90e2b --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/base_classes.py @@ -0,0 +1,29 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +class BaseInputEncoder(object): + """Docstring for BaseInputEncoder.""" + + def __init__(self): + """init of class""" + super(BaseInputEncoder, self).__init__() + + def encode(self, inputs): + raise NotImplementedError + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dataloader.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dataloader.py new file mode 100644 index 0000000000000000000000000000000000000000..144d0c232392314132eed38e68ec9edc5f711258 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dataloader.py @@ -0,0 +1,151 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import sys + +import numpy as np +import paddle +from text2sql.utils import nn_utils + + +def collate_batch_data_v2(origin_batch, config): + """format origin batch data for model forward""" + TOKEN_IDS = [] + SENT_IDS = [] + + QUESTION_TOKENS_INDEX = [] + TABLE_INDEX = [] + COLUMN_INDEX = [] + VALUE_INDEX = [] + RELATION_MATRIXES = [] + + lst_orig_inputs = [] + lst_orig_labels = [] + for orig_input, orig_label in origin_batch: + if orig_input.value_indexes[-1] > 510: + logging.warning( + "sequence is too long: %d. 
question is %s", orig_input.value_indexes[-1] + 2, orig_input.question
+            )
+            continue
+        lst_orig_inputs.append(orig_input)
+        lst_orig_labels.append(orig_label)
+
+        TOKEN_IDS.append(orig_input.token_ids)
+        SENT_IDS.append(orig_input.sent_ids)
+
+        # orig_input.span_lens[0] 即 question 包含 [cls], [sep] 的长度
+        QUESTION_TOKENS_INDEX.append(list(range(1, orig_input.column_indexes[0] - 1)))
+        TABLE_INDEX.append(orig_input.table_indexes)
+        COLUMN_INDEX.append(orig_input.column_indexes)
+        VALUE_INDEX.append(orig_input.value_indexes)
+
+        relations = orig_input.relations
+        RELATION_MATRIXES.append(np.pad(relations, (0, config.max_seq_len - relations.shape[0])))
+
+    TOKEN_IDS = nn_utils.pad_sequences(TOKEN_IDS, max_len=config.max_seq_len)
+    SENT_IDS = nn_utils.pad_sequences(SENT_IDS, max_len=config.max_seq_len)
+
+    QUESTION_TOKENS_INDEX = nn_utils.pad_sequences(QUESTION_TOKENS_INDEX, max_len=config.max_question_len)
+    TABLE_INDEX = nn_utils.pad_sequences(TABLE_INDEX, max_len=config.max_table_num)
+    COLUMN_INDEX = nn_utils.pad_sequences(COLUMN_INDEX, max_len=config.max_column_num)
+    VALUE_INDEX = nn_utils.pad_sequences(VALUE_INDEX, max_len=config.max_column_num * 2)
+
+    inputs = {
+        "src_ids": TOKEN_IDS,
+        "sent_ids": SENT_IDS,
+        "question_tokens_index": QUESTION_TOKENS_INDEX,
+        "table_indexes": TABLE_INDEX,
+        "column_indexes": COLUMN_INDEX,
+        "value_indexes": VALUE_INDEX,
+        "orig_inputs": lst_orig_inputs,
+    }
+    RELATION_MATRIXES = np.array(RELATION_MATRIXES).astype(np.int64)
+    inputs["relations"] = RELATION_MATRIXES
+
+    for key, value in inputs.items():
+        if key in ("orig_inputs",):
+            continue
+        inputs[key] = paddle.to_tensor(value)
+    return (inputs, lst_orig_labels)
+
+
+class DataLoader(object):
+    """Data Loader for train, test and inference"""
+
+    def __init__(
+        self,
+        config,
+        dataset,
+        batch_size=1,
+        collate_fn=collate_batch_data_v2,
+        shuffle=False,
+        drop_last=False,
+        use_data_parallel=False,
+        use_multiprocess=False,
+    ):
+        super(DataLoader, self).__init__()
+        assert batch_size > 0, "batch_size must be an integer > 0"
+
+        self.config = config
+        self._dataset = dataset
+        self._batch_size = batch_size
+        self._collate_fn = collate_fn
+        self._shuffle = shuffle
+        self._drop_last = drop_last
+        self._use_data_parallel = use_data_parallel
+        self._use_multiprocess = use_multiprocess
+
+        self.dataloader = paddle.fluid.reader.DataLoader.from_generator(
+            capacity=1000, return_list=True, use_multiprocess=use_multiprocess
+        )
+        self.dataloader.set_batch_generator(self.create_generator())
+        if use_data_parallel:
+            self.dataloader = paddle.distributed_batch_reader(self.dataloader)
+
+    def __call__(self):
+        """call"""
+        return self.create_generator()()
+
+    def create_generator(self):
+        """Returns a generator, each iteration returns a batch of data"""
+
+        def _reader():
+            range_fn = np.random.permutation if self._shuffle else np.arange
+            batch = []
+            for iid in range_fn(len(self._dataset)):
+                batch.append(self._dataset[iid])
+                if len(batch) == self._batch_size:
+                    outputs = self._collate_fn(batch, self.config.model)
+                    batch = []
+                    if len(outputs[1]) == 0:
+                        continue
+                    yield outputs
+
+            if len(batch) > 0 and not self._drop_last:
+                yield self._collate_fn(batch, self.config.model)
+
+        return _reader
+
+    @property
+    def name(self):
+        """read property of name"""
+        return self._dataset.name
+
+
+if __name__ == "__main__":
+    """run simple tests"""
+    if len(sys.argv) != 5:
+        print("usage: %s schema content data grammar_file" % (sys.argv[0]))
+        sys.exit(1)
diff --git
a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dusql_dataset_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dusql_dataset_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..8093ff4c421f455fa41054f7094f094266691cbe --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/dusql_dataset_v2.py @@ -0,0 +1,336 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import logging +import os +import pickle +import sys +from pathlib import Path + +import attr +import networkx as nx +import paddle +import tqdm +from text2sql.utils import linking_utils, text_utils + +g_ernie_input_parser = None +g_match_score_threshold = 0.3 + + +@attr.s +class DuSQLItem: + text = attr.ib() + code = attr.ib() + schema = attr.ib() + orig = attr.ib() + orig_schema = attr.ib() + + +@attr.s +class Column: + id = attr.ib() + table = attr.ib() + name = attr.ib() + orig_name = attr.ib() + dtype = attr.ib() + cells = attr.ib(factory=list) + foreign_key_for = attr.ib(default=None) + + +@attr.s +class Table: + id = attr.ib() + name = attr.ib() + orig_name = attr.ib() + columns = attr.ib(factory=list) + primary_keys = attr.ib(factory=list) + primary_keys_id = attr.ib(factory=list) + foreign_keys_tables = attr.ib(factory=set) + + +@attr.s +class DB: + db_id = attr.ib() + tables = attr.ib() + columns = attr.ib() + foreign_key_graph = attr.ib() + orig = attr.ib() + connection = attr.ib(default=None) + + +def _extract_column_cells(table_names, tables_content): + lst_column_cells = [table_names] + + for table_name in table_names: + table_info = tables_content.get(table_name, None) + if table_info is None: + return None + rows = table_info.get("cell", []) + if len(rows) == 0: + rows = [[] for _ in tables_content[table_name]["header"]] + lst_column_cells.extend(rows) + else: + lst_column_cells.extend(list(zip(*rows))) + + return lst_column_cells + + +def load_tables(schema_file, content_file): + """load tables from json files""" + schemas = {} + eval_foreign_key_maps = {} + + with open(schema_file) as ifs_schema, open(content_file) as ifs_content: + lst_schema = json.load(ifs_schema) + dct_content = {x["db_id"]: x for x in json.load(ifs_content)} + + for schema_dict in lst_schema: + db_id = schema_dict["db_id"] + + contents = dct_content[db_id] + lst_column_cells = _extract_column_cells(schema_dict["table_names"], contents["tables"]) + if lst_column_cells is None: + lst_column_cells = [[] for _ in schema_dict["column_names"]] + assert len(lst_column_cells) == len(schema_dict["column_names"]) + + if "table_names_original" not in schema_dict: + schema_dict["table_names_original"] = schema_dict["table_names"] + if "column_names_original" not in schema_dict: + schema_dict["column_names_original"] = schema_dict["column_names"] + tables = tuple( + Table(id=i, name=text_utils.wordseg(name), orig_name=orig_name) + for i, (name, orig_name) in enumerate( + zip(schema_dict["table_names"], schema_dict["table_names_original"]) + ) + 
) + columns = tuple( + Column( + id=i, + table=tables[table_id] if table_id >= 0 else None, + name=text_utils.wordseg(col_name), + orig_name=orig_col_name, + dtype=col_type, + # 1. drop data with length > 20 + # 2. ID is startswith item_ + cells=[ + x for x in set([str(c) for c in lst_column_cells[i]]) if len(x) <= 20 or x.startswith("item_") + ], + ) + for i, ((table_id, col_name), (_, orig_col_name), col_type) in enumerate( + zip(schema_dict["column_names"], schema_dict["column_names_original"], schema_dict["column_types"]) + ) + ) + + # Link columns to tables + for column in columns: + if column.table: + column.table.columns.append(column) + + # Register primary keys + for column_id in schema_dict["primary_keys"]: + column = columns[column_id] + column.table.primary_keys.append(column) + column.table.primary_keys_id.append(column_id) + + # Register foreign keys + foreign_key_graph = nx.DiGraph() + for source_column_id, dest_column_id in schema_dict["foreign_keys"]: + source_column = columns[source_column_id] + dest_column = columns[dest_column_id] + source_column.foreign_key_for = dest_column + columns[source_column_id].table.foreign_keys_tables.add(dest_column_id) + foreign_key_graph.add_edge( + source_column.table.id, dest_column.table.id, columns=(source_column_id, dest_column_id) + ) + foreign_key_graph.add_edge( + dest_column.table.id, source_column.table.id, columns=(dest_column_id, source_column_id) + ) + + schemas[db_id] = DB(db_id, tables, columns, foreign_key_graph, schema_dict) + # TODO + # eval_foreign_key_maps[db_id] = evaluation.build_foreign_key_map(schema_dict) + + return schemas, eval_foreign_key_maps + + +class DuSQLExample(object): + """Define struct of one DuSQL example, and its processing methods""" + + def __init__(self, json_example, db, input_encoder): + super(DuSQLExample, self).__init__() + + self.orig = json_example + self.question = json_example["question"] + self.question_id = json_example["question_id"] + self.columns = db.columns + self.tables = db.tables + self.db = db + + self.column_match_cells = self._filter_match_values(json_example["match_values"]) + + ernie_inputs = input_encoder.encode(self.question, db, self.column_match_cells) + self.token_ids = ernie_inputs.token_ids + self.sent_ids = ernie_inputs.sent_ids + self.table_indexes = ernie_inputs.table_indexes + self.column_indexes = ernie_inputs.column_indexes + self.value_indexes = ernie_inputs.value_indexes + self.values = ernie_inputs.value_list + + self.token_mapping = ernie_inputs.token_mapping + self.question_tokens = ernie_inputs.orig_question_tokens + self.candi_nums = ernie_inputs.candi_nums + self.relations = self._compute_relations() + + def _filter_match_values(self, match_values_info): + """filter by match score""" + lst_result = [] + for column_values in match_values_info: + filtered_results = [] + for value, score in column_values: + if score > g_match_score_threshold: + filtered_results.append(value) + else: # column_values should ordered by score + break + lst_result.append(filtered_results) + return lst_result + + def _compute_relations(self): + schema_linking_results = self._linking_wrapper(linking_utils.compute_schema_linking) + cell_value_linking_results = self._linking_wrapper(linking_utils.compute_cell_value_linking) + link_info_dict = {"sc_link": schema_linking_results, "cv_link": cell_value_linking_results} + + q_len = self.column_indexes[0] - 2 + c_len = len(self.columns) + t_len = len(self.tables) + total_len = q_len + c_len + t_len + relation_matrix = 
linking_utils.build_relation_matrix( + link_info_dict, total_len, q_len, c_len, list(range(c_len + 1)), list(range(t_len + 1)), self.db + ) + return relation_matrix + + def _linking_wrapper(self, fn_linking): + """wrapper for linking function, do linking and id convert""" + link_result = fn_linking(self.question_tokens, self.db) + + # convert words id to BERT word pieces id + new_result = {} + for m_name, matches in link_result.items(): + new_match = {} + for pos_str, match_type in matches.items(): + qid_str, col_tab_id_str = pos_str.split(",") + qid, col_tab_id = int(qid_str), int(col_tab_id_str) + for real_qid in self.token_mapping[qid]: + new_match[f"{real_qid},{col_tab_id}"] = match_type + new_result[m_name] = new_match + return new_result + + def __repr__(self): + """format for reviewing""" + return str(self.__dict__) + + +class DuSQLDatasetV2(paddle.io.Dataset): + """implement of DuSQL dataset for training/evaluating""" + + def __init__( + self, name, db_file, data_file, input_encoder, label_encoder, is_cached=False, schema_file=None, has_label=True + ): + super(DuSQLDatasetV2, self).__init__() + + self.name = name + self.input_encoder = input_encoder + self.label_encoder = label_encoder + self.db_schema_file = schema_file + self.has_label = has_label + self._qid2index = {} + + if is_cached: + self.db_dict, self._examples = None, None + self.load(db_file, data_file) + else: + schema_file, content_file = db_file + self.db_dict, _ = load_tables(schema_file, content_file) + self._examples = [] + match_value_file = Path(os.path.dirname(data_file)) / ("match_values_" + os.path.basename(data_file)) + if not match_value_file.exists(): + raise FileNotFoundError("match value file not found: " + str(match_value_file)) + with open(data_file) as ifs_data, open(match_value_file) as ifs_mval: + self.collate_examples(json.load(ifs_data), json.load(ifs_mval)) + + def collate_examples(self, orig_examples, match_values): + """collate examples, and append to self._examples""" + for idx, (item, m_val) in tqdm.tqdm(enumerate(zip(orig_examples, match_values))): + if "question_id" in item: + assert ( + item["question_id"] == m_val["question_id"] + ), f'data no match: {item["question_id"]} != {m_val["question_id"]}' + item["match_values"] = m_val["match_values"] + db = self.db_dict[item["db_id"]] + if not self.input_encoder.check(item, db): + logging.warning(f'check failed: db_id={item["db_id"]}, question={item["question"]}') + continue + if "question_id" not in item: + item["question_id"] = f"qid{idx:06d}" + inputs = DuSQLExample(item, db, self.input_encoder) + if "sql" not in item or type(item["sql"]) is not dict or not self.has_label: + outputs = None + else: + outputs = self.label_encoder.add_item(self.name, item["sql"], inputs.values) + self._qid2index[item["question_id"]] = len(self._examples) + self._examples.append([inputs, outputs]) + + def save(self, save_dir, save_db=True): + """save data to disk + + Args: + save_dir (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + os.makedirs(save_dir, exist_ok=True) + if save_db: + with open(Path(save_dir) / "db.pkl", "wb") as ofs: + pickle.dump(self.db_dict, ofs) + with open(Path(save_dir) / f"{self.name}.pkl", "wb") as ofs: + pickle.dump([self._examples, self._qid2index], ofs) + + def load(self, db_file, data_file): + """load data from disk""" + with open(db_file, "rb") as ifs: + self.db_dict = pickle.load(ifs) + with open(data_file, "rb") as ifs: + self._examples, self._qid2index = pickle.load(ifs) + + def get_by_qid(self, qid): + """ """ + index = 
self._qid2index[qid] + return self._examples[index] + + def __getitem__(self, idx): + """get one example""" + return self._examples[idx] + + def __len__(self): + """size of data examples""" + return len(self._examples) + + +if __name__ == "__main__": + """run simple tests""" + if len(sys.argv) != 5: + print("usage: %s schema content data grammar_file" % (sys.argv[0])) + sys.exit(1) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/ernie_input_encoder_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/ernie_input_encoder_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..f68e3da9ff0fd2d5416479bfcc32d2808fa06248 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/ernie_input_encoder_v2.py @@ -0,0 +1,293 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import sys +from collections import namedtuple + +import numpy as np +from text2sql.dataproc import BaseInputEncoder +from text2sql.utils import text_utils + +from paddlenlp.transformers import BertTokenizer, ErnieTokenizer + +ErnieInput = namedtuple( + "ErnieInput", + "token_ids sent_ids table_indexes column_indexes value_indexes value_list token_mapping orig_question_tokens candi_nums", +) + + +class ErnieInputEncoderV2(BaseInputEncoder): + """use ernie field_reader to seg, it will automatically add padding,mask,position,task,sentence and return length""" + + padding_id = 0 + truncation_type = 0 + + def __init__(self, model_config): + super(ErnieInputEncoderV2, self).__init__() + + self.config = model_config + self.enc_value_with_col = model_config.enc_value_with_col + if model_config.pretrain_model_type == "BERT": + self.tokenizer = BertTokenizer.from_pretrained(model_config.pretrain_model) + self.special_token_dict = { + "table": "[unused1]", + "column": "[unused2]", + "value": "[unused3]", + "text": "[unused11]", + "real": "[unused12]", + "number": "[unused13]", + "time": "[unused14]", + "binary": "[unused15]", + "boolean": "[unused16]", + "bool": "[unused17]", + "others": "[unused18]", + } + else: + self.tokenizer = ErnieTokenizer.from_pretrained(model_config.pretrain_model) + # low frequency token will be used as special token + # Other candidate: overchicstoretvhome + self.special_token_dict = { + "table": "blogabstract", + "column": "wx17house", + "value": "fluke62max", + "text": "googlemsn", + "real": "sputniknews", + "number": "sputniknews", + "time": "pixstyleme3c", + "binary": "pixnetfacebookyahoo", + "boolean": "pixnetfacebookyahoo", + "bool": "pixnetfacebookyahoo", + "others": "ubuntuforumwikilinuxpastechat", + } + self._need_bool_value = True if self.config.grammar_type != "nl2sql" else False + + def check(self, data, db): + if len(db.columns) > self.config.max_column_num or len(db.tables) > self.config.max_table_num: + return False + return True + + def encode(self, question, db, column_match_cells=None, candi_nums=None, col_orders=None, debug=False): + question = question.strip() + if 
self.config.num_value_col_type != "q_num": + orig_question_tokens = text_utils.wordseg(self.question) + candi_nums = list(set(["0", "1"] + text_utils.CandidateValueExtractor.extract_num_from_text(question))) + candi_nums_index = [-1] * len(candi_nums) + else: + orig_question_tokens, candi_nums, candi_nums_index = text_utils.wordseg_and_extract_num(question) + if "0" not in candi_nums: + candi_nums.append("0") + candi_nums_index.append(-1) + if "1" not in candi_nums: + candi_nums.append("1") + candi_nums_index.append(-1) + tokens, value_list, schema_indexes, token_mapping = self.tokenize( + orig_question_tokens, db, column_match_cells, candi_nums, candi_nums_index, col_orders + ) + if debug: + sys.stderr.write(json.dumps(tokens, ensure_ascii=False) + "\n") + token_ids = self.tokenizer.convert_tokens_to_ids(tokens) + + table_indexes, column_indexes, value_indexes, num_value_indexes = schema_indexes + q_len = column_indexes[0] + sent_ids = [0] * q_len + [1] * (len(token_ids) - q_len) + + value_indexes += num_value_indexes + return ErnieInput( + token_ids, + sent_ids, + table_indexes, + column_indexes, + value_indexes, + value_list, + token_mapping, + orig_question_tokens, + candi_nums, + ) + + def tokenize(self, question, db, column_match_cells=None, candi_nums=None, candi_nums_index=None, col_orders=None): + """ + Tokenize question and columns and concatenate. + final_tokens will include:Question、Schema(include non digital value)、digital value + [CLS] Q tokens [SEP] + [T] table1 [C] col1 [V] value [C] col2 ... [SEP] + [V] number [V] ... [SEP] + """ + if col_orders is None: + col_orders = np.arange(len(db.columns)) + + if type(question) is str: + q_tokens_tmp = self.tokenizer.tokenize(question) + token_idx_mapping = [[i] for i in range(len(q_tokens_tmp))] + else: + # question is tokens list + q_tokens_tmp, token_idx_mapping = self._resplit_words(question) + + final_candi_num_index = [] + if candi_nums_index is not None: + for idx in candi_nums_index: + if idx < 0: + final_candi_num_index.append(0) + else: + final_candi_num_index.append(token_idx_mapping[idx][0] + 1) + + # handle question tokens + question_tokens = ["[CLS]"] + q_tokens_tmp + final_tokens = question_tokens[: self.config.max_question_len] + ["[SEP]"] + + columns = [db.columns[i] for i in col_orders] + if column_match_cells is not None: + column_match_cells = [column_match_cells[i] for i in col_orders] + else: + column_match_cells = [None] * len(columns) + + # handle schema tokens + table_indexes = [] + column_indexes = [] + value_indexes = [] + value_list = [] + universe_value_set = set(["是", "否"]) if self._need_bool_value else set() + for idx, (column, match_cells) in enumerate(zip(columns, column_match_cells)): + if idx == 1 or idx > 1 and column.table.id != columns[idx - 1].table.id: + table_indexes.append(len(final_tokens)) + final_tokens.append(self.special_token_dict["table"]) + final_tokens += self.tokenizer.tokenize(column.table.orig_name) + + if idx == 0: + col_name = "任意列" + col_type = self.special_token_dict["text"] + else: + col_name = column.orig_name + # col_name = remove_brackets(col_name) + col_type = self.special_token_dict[column.dtype] + + column_indexes.append(len(final_tokens)) + final_tokens += [col_type] + self.tokenizer.tokenize(col_name) + + if match_cells is not None and len(match_cells) > 0: + if column.dtype in ("text", "time"): + if not self.config.predict_value: + match_cells = match_cells[:1] # the first cell used to complement semantics + for mcell in match_cells: + value_list.append(mcell) + toks 
= [self.special_token_dict["value"]] + self.tokenizer.tokenize(mcell) + if self.enc_value_with_col: + value_indexes.extend([column_indexes[-1], len(final_tokens)]) + else: + value_indexes.append(len(final_tokens)) + final_tokens += toks + elif self.config.predict_value: + for mcell in match_cells: + universe_value_set.add(mcell) + final_tokens.append("[SEP]") + + if self.config.predict_value: + for value in universe_value_set: + value_list.append(value) + toks = [self.special_token_dict["value"]] + self.tokenizer.tokenize(value) + if self.enc_value_with_col: + value_indexes.extend([0, len(final_tokens)]) + else: + value_indexes.append(len(final_tokens)) + final_tokens += toks + final_tokens.append("[SEP]") + + # handle number value tokens: condition and limit number values + num_value_indexes = [] + if candi_nums is not None and len(candi_nums) > 0: + value_list += candi_nums + for num, index in zip(candi_nums, final_candi_num_index): + if self.enc_value_with_col: + # index is the index of current number in question + num_value_indexes.extend([index, len(final_tokens)]) + elif self.config.num_value_col_type == "q_num": + num_value_indexes.append(index) + else: + num_value_indexes.append(len(final_tokens)) + final_tokens += [self.special_token_dict["value"]] + self.tokenizer.tokenize(num) + else: + # use fixed special token value/empty + if self.enc_value_with_col: + value_indexes = [0, len(final_tokens), 0, len(final_tokens) + 1] + else: + value_indexes = [len(final_tokens), len(final_tokens) + 1] + num_value_indexes = [] + value_list = ["value", "empty"] + final_tokens.extend(value_list) + final_tokens.append("[SEP]") + + # packed_sents_lens = [q_lens, column_tokens_lens, table_tokens_lens, limit_tokens_lens] + # packed_sents, packed_sents_lens = self._pack([question_tokens], + # column_tokens, + # table_tokens, + # limit_tokens, + # value_indexes=column_values_index) + + return ( + final_tokens, + value_list, + [table_indexes, column_indexes, value_indexes, num_value_indexes], + token_idx_mapping, + ) + + def _resplit_words(self, words): + """resplit words by bert_tokenizer""" + lst_new_result = [] + token_idx_mapping = [] + for idx, word in enumerate(words): + tokens = self.tokenizer.tokenize(word) + new_id_start = len(lst_new_result) + new_id_end = new_id_start + len(tokens) + lst_new_result.extend(tokens) + token_idx_mapping.append(list(range(new_id_start, new_id_end))) + return lst_new_result, token_idx_mapping + + def _pack(self, *sents_of_tokens_list, value_indexes=None): + packed_sents = [] + packed_sents_lens_all = [] + for sents_of_tokens in sents_of_tokens_list: + packed_sents_lens = [] + for tokens in sents_of_tokens: + packed_tokens = tokens + ["[SEP]"] + packed_sents += packed_tokens + packed_sents_lens.append(len(packed_tokens)) + packed_sents_lens_all.append(packed_sents_lens) + return packed_sents, packed_sents_lens_all + + +if __name__ == "__main__": + """run some simple test cases""" + if len(sys.argv) != 3: + print("usage: %s --db db_path") + sys.exit(1) + + from pathlib import Path + + from text2sql import global_config + from text2sql.dataproc.dusql_dataset import load_tables + + config = global_config.gen_config() + parser = ErnieInputEncoderV2(config) + q = "这 是 一项 测试 。 hello world !" 
+ db_path = Path(config.data.db) + db_dict, _ = load_tables(db_path / "db_schema.json", db_path / "db_content.json") + db = db_dict[list(db_dict.keys())[0]] + column_match_cells = [None] * len(db.columns) + column_match_cells[1] = ["你好", "[CLS]"] + print(q) + print([x.orig_name for x in db.columns]) + print([x.orig_name for x in db.tables]) + print(parser.encode(q, db, column_match_cells=column_match_cells, candi_nums=["1", "0", "10000000"], debug=True)) + print("*" * 100) + print(parser.encode(q.split(" "), db, candi_nums=["1", "0", "10000000"], debug=True)) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_label.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_label.py new file mode 100644 index 0000000000000000000000000000000000000000..f0805650db93ef210bebcf366c7e922f5124c1b8 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_label.py @@ -0,0 +1,440 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +g_open_value_predict = False +g_having_agg_threshold = 0.9 + + +class SQL(object): + """SQL define""" + + op_sql_dict = {0: ">", 1: "<", 2: "==", 3: "!=", 4: ">=", 5: "<="} + agg_sql_dict = {0: "", 1: "AVG", 2: "MAX", 3: "MIN", 4: "COUNT", 5: "SUM"} + conn_sql_dict = {0: "", 1: "and", 2: "or"} + order_dict = {0: "", 1: "asc", 2: "desc"} + sel_num_dict = {0: 1, 1: 2, 2: 3, 3: 4} + # cond_num_dict = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5} + cond_num_dict = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4} + group_num_dict = {0: 0, 1: 1} + group_type_dict = {0: "none", 1: "group", 2: "group_having", 3: "group_order"} + + order2id = {"": 0, "asc": 1, "desc": 2} + + num_where_ops = len(op_sql_dict) + 1 + num_agg_ops = len(agg_sql_dict) + num_cond_ops = len(conn_sql_dict) + num_order_directions = len(order_dict) + num_sel_num = len(sel_num_dict) + num_where_num = len(cond_num_dict) + num_group_num = len(group_num_dict) + num_group_type = len(group_type_dict) + + dtype_str = "text" + dtype_num = "real" + + def __init__(self, cond_conn_op: int, agg: list, sel: list, conds: list, **kwargs): + """doc""" + self.cond_conn_op = cond_conn_op + self.sel = [] + self.agg = [] + sel_agg_pairs = sorted(zip(sel, agg), key=lambda x: x[0]) + for col_id, agg_op in sel_agg_pairs: + self.sel.append(col_id) + self.agg.append(agg_op) + self.conds = list(sorted(conds, key=lambda x: x[0])) + self.order_by = list(sorted(kwargs.get("order_by", []))) + self.group_by = list(sorted(kwargs.get("group_by", []))) + self.having = list(sorted(kwargs.get("having", []))) + order_str = kwargs.get("order_direction", "").lower() + self.order_direction = self.order2id.get(order_str, 0) + limit = kwargs.get("limit", None) + self.limit = "0" if limit is None else str(limit) + self.sel_num = len(self.sel) + self.cond_num = len(self.conds) + self.group_num = len(self.group_by) + self.group_type = 0 + if len(self.group_by) > 0: + self.group_type = 1 + if len(self.having) > 0: + self.group_type = 2 + elif len(self.order_by) > 0 
and self.order_by[0][0] > 0: + self.group_type = 3 + + @classmethod + def from_dict(cls, data: dict): + """doc""" + return cls(**data) + + def keys(self): + """doc""" + return ["cond_conn_op", "sel", "agg", "conds", "order_by", "order_direction", "limit", "group_by", "having"] + + def __getitem__(self, key): + """doc""" + return getattr(self, key) + + def to_json(self): + """doc""" + return json.dumps(dict(self), ensure_ascii=False, sort_keys=True) + + def equal_all_mode(self, other): + """doc""" + return self.to_json() == other.to_json() + + def __eq__(self, other): + """doc""" + raise NotImplementedError("compare mode not set") + + def __repr__(self): + """doc""" + repr_str = "" + repr_str += "sel: {}\n".format(self.sel) + repr_str += "agg: {}\n".format([self.agg_sql_dict[a] for a in self.agg]) + repr_str += "cond_conn_op: '{}'\n".format(self.conn_sql_dict[self.cond_conn_op]) + repr_str += "conds: {}".format([[cond[0], self.op_sql_dict[cond[1]], cond[2]] for cond in self.conds]) + + # TODO: support order/group/... + + return repr_str + + def __str__(self): + """doc""" + return self.to_json() + + def _repr_html_(self): + """doc""" + return self.__repr__().replace("\n", "
") + + +def sql2label(sql, num_cols): + """encode sql""" + # because of classification task, label is from 0 + # so sel_num and cond_num should -1,and label should +1 in prediction phrase + cond_conn_op_label = sql.cond_conn_op + + sel_num_label = sql.sel_num - 1 + # the new dataset has cond_num = 0, do not -1 + cond_num_label = len(sql.conds) + len(sql.having) + sel_label = np.zeros(num_cols, dtype="int32") + sel_agg_label = np.zeros((num_cols, SQL.num_agg_ops), dtype="int32") + for col_id, agg_op in zip(sql.sel, sql.agg): + assert col_id < num_cols, f"select col_id({col_id}) >= num_cols({num_cols}): {sql}" + sel_agg_label[col_id][agg_op] = 1 + sel_label[col_id] = 1 + # len(SQL.op_sql_dict) over all op ID range,which means defaults to no OP + cond_op_label = np.ones(num_cols, dtype="int32") * len(SQL.op_sql_dict) + having_agg_label = np.zeros((num_cols, SQL.num_agg_ops), dtype="int32") + + for col_id, cond_op, _ in sql.conds: + assert col_id < num_cols, f"where col_id({col_id}) >= num_cols({num_cols}): {sql}" + cond_op_label[col_id] = cond_op + + for agg, col_id, cond_op, _ in sql.having: + assert col_id < num_cols, f"having col_id({col_id}) >= num_cols({num_cols}): {sql}" + cond_op_label[col_id] = cond_op + having_agg_label[col_id][agg] = 1 + + order_col_label = np.zeros(num_cols, dtype="int32") + order_agg_label = np.zeros((num_cols, SQL.num_agg_ops), dtype="int32") + + order_direction_label = sql.order_direction + for agg, order_col in sql.order_by: + order_col_label[order_col] = 1 + order_agg_label[order_col][agg] = 1 + + group_num_label = sql.group_num + having_num_label = len(sql.having) + group_col_label = np.zeros(num_cols, dtype="int32") + for col_id in sql.group_by: + assert col_id < num_cols, f"group_by col_id({col_id}) >= num_cols({num_cols}): {sql}" + group_col_label[col_id] = 1 + + return ( + sel_num_label, + cond_num_label, + cond_conn_op_label, + sel_agg_label, + sel_label, + cond_op_label, + order_col_label, + order_agg_label, + order_direction_label, + group_num_label, + having_num_label, + group_col_label, + having_agg_label, + ) + + +def decode( + sel_num, + sel_col, + sel_agg, + where_num, + where_conn, + where_op, + where_op_prob, + col_value, + order_direction, + order_col, + order_agg, + limit_label, + group_num, + having_num, + group_col, + having_agg, + having_agg_prob, + header_match_cells, + candi_limit_nums, +): + """decode one instance predicts to sql""" + if col_value is None: + col_value = [None] * len(where_op) + # use dict to find label number, equals to label+1 + sel_num = SQL.sel_num_dict[int(sel_num)] + sorted_sel_index = sorted(range(len(sel_col)), key=lambda i: sel_col[i], reverse=True) + sel_col = [int(col_id) for col_id in sorted_sel_index][:sel_num] + sel_agg = [int(sel_agg[col_id]) for col_id in sorted_sel_index][:sel_num] + + cond_num = SQL.cond_num_dict[int(where_num)] + where_conn = int(where_conn) + cond_probs = [] + conds = [] + for col_id, (cond_op, cond_prob, value_id) in enumerate(zip(where_op, where_op_prob, col_value)): + if cond_op < len(SQL.op_sql_dict): + cond_probs.append(cond_prob) + value = get_value_by_id(col_id, value_id, header_match_cells) + conds.append([col_id, int(cond_op), value]) + if cond_num < len(conds): + sorted_cond_index = sorted(range(len(cond_probs)), key=lambda i: cond_probs[i], reverse=True) + conds = [conds[i] for i in sorted_cond_index[:cond_num]] + + if group_num is None: + group_num = 0 + if group_num > 0: + sorted_group_index = sorted(range(len(group_col)), key=lambda i: group_col[i], reverse=True) + 
group_col = [int(col_id) for col_id in sorted_group_index[:group_num]] + else: + group_col = [] + + having = [] + if having_num is None: + having_num = 0 + if having_agg is not None and group_num > 0 and having_num > 0: + having_agg_info = [] + for idx, (col_id, _, _) in enumerate(conds): + if having_agg[col_id] > 0: + having_agg_info.append([idx, int(having_agg[col_id]), float(having_agg_prob[col_id])]) + # cond_num -= 1 + if len(having_agg_info) > 0 and having_agg_info[0][2] >= g_having_agg_threshold: + # 按 agg 概率最大排序 + having_agg_info.sort(key=lambda x: x[2], reverse=True) + idx, agg, _ = having_agg_info[0] + having = [[agg] + list(conds[idx])] + conds.pop(idx) + + order_direction = int(order_direction) if order_direction is not None else 0 + if order_direction == 0 or order_col is None or order_agg is None: + order_by = [] + limit = "0" + else: + sorted_order_index = sorted(range(len(order_col)), key=lambda i: order_col[i], reverse=True) + order_col = [int(col_id) for col_id in sorted_order_index[:1]] + order_agg = [int(order_agg[col_id]) for col_id in sorted_order_index[:1]] + order_by = [[order_agg[0], order_col[0]]] + if limit_label < len(candi_limit_nums): + limit = candi_limit_nums[limit_label] + if limit == "0": + limit = "1" + else: + limit = "1" + + return { + "sel": list(sel_col), + "sel_num": int(sel_num), + "cond_num": int(cond_num), + "agg": list(sel_agg), + "cond_conn_op": int(where_conn), + "conds": [list(cond) for cond in conds], + "order_direction": order_direction, + "order_by": list(order_by), + "limit": limit, + "having_num": int(having_num), + "group_num": int(group_num), + "group_by": list(group_col), + "having": list(having), + } + + +def get_value_by_id(col_id, value_id, header_match_cells): + """ + + Args: + col_id (TYPE): NULL + value_id (TYPE): NULL + header_match_cells (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + if value_id is None or value_id < 0: + return None + + assert col_id < len(header_match_cells) + + curr_cells = header_match_cells[col_id] + if len(curr_cells) == 0: + return "0" + if value_id >= len(curr_cells): + return curr_cells[0] + else: + return curr_cells[value_id] + + +def decode_sqls(preds, header_lens, header_values_list, limit_nums_list): + """Generate sqls from model outputs""" + fn_empty_preds = lambda: [None] * len(preds["sel_num"]) + + preds_sel_num = np.argmax(preds["sel_num"], axis=-1) + preds_sel_col = preds["sel_col"] + preds_sel_agg = np.argmax(preds["sel_agg"], axis=-1) + preds_cond_num = np.argmax(preds["cond_num"], axis=-1) + preds_where_conn = np.argmax(preds["where_conn"], axis=-1) + preds_where_op = np.argmax(preds["where_op"], axis=-1) + preds_where_op_prob = np.max(preds["where_op"], axis=-1) + + preds_order_direction = np.argmax(preds["order_direction"], axis=-1) + preds_order_col = preds["order_col"] + preds_order_agg = np.argmax(preds["order_agg"], axis=-1) + preds_limit_label = np.argmax(preds["limit_label"], axis=-1) + + preds_group_num = np.argmax(preds["group_num"], axis=-1) + preds_having_num = np.argmax(preds["having_num"], axis=-1) + preds_group_col = preds["group_col"] + preds_having_agg = np.argmax(preds["having_agg"], axis=-1) + preds_having_agg_prob = np.max(preds["having_agg"], axis=-1) + + if g_open_value_predict: + preds_col_value = np.argmax(preds["col_value"], axis=-1) + else: + preds_col_value = fn_empty_preds() + + sqls = [] + for ( + sel_num, + sel_col, + sel_agg, + where_num, + where_conn, + where_op, + where_op_prob, + col_value, + order_direction, + order_col, + order_agg, + 
limit_label, + group_num, + having_num, + group_col, + having_agg, + having_agg_prob, + header_len, + limit_nums, + ) in zip( + preds_sel_num, + preds_sel_col, + preds_sel_agg, + preds_cond_num, + preds_where_conn, + preds_where_op, + preds_where_op_prob, + preds_col_value, + preds_order_direction, + preds_order_col, + preds_order_agg, + preds_limit_label, + preds_group_num, + preds_having_num, + preds_group_col, + preds_having_agg, + preds_having_agg_prob, + header_lens, + limit_nums_list, + ): + + sel_col = sel_col[:header_len] + sel_agg = sel_agg[:header_len] + where_op = where_op[:header_len] + where_op_prob = where_op_prob[:header_len] + if g_open_value_predict: + col_value = col_value[:header_len] + order_col = order_col[:header_len] + order_agg = order_agg[:header_len] + group_col = group_col[:header_len] + having_agg = having_agg[:header_len] + + sql = decode( + sel_num, + sel_col, + sel_agg, + where_num, + where_conn, + where_op, + where_op_prob, + col_value, + order_direction, + order_col, + order_agg, + limit_label, + group_num, + having_num, + group_col, + having_agg, + having_agg_prob, + None, + limit_nums, + ) + sqls.append(sql) + + return sqls + + +if __name__ == "__main__": + """run some simple test""" + # if len(sys.argv) > 2: + # with open(sys.argv[1]) as ifs: + # gold_sqls = [SQL.from_dict(json.loads(x)["sql"]) for x in ifs] + # with open(sys.argv[2]) as ifs: + # pred_sqls = [json.loads(x) for x in ifs] + + # print(f"acc of {sys.argv[1]} vs {sys.argv[2]}: ", get_acc(gold_sqls, pred_sqls)) + # else: + # gold_sqls = [{"sel": [5], "sel_num": 1, "cond_num": 2, "agg": [0], + # "cond_conn_op": 1, "conds": [[0, 2, '123'], [1, 2, '444']], + # "order_direction": "asc", "order_by": [[0, 1]]}] + # pred_sqls = [{"sel": [1, 0], "agg": [0, 4], "cond_conn_op": 0, + # "conds": [], "having_conn_op": 0, "having": [], "order_by": [], + # "order_direction": "", "limit": None, "group_by": [20]}] + # enc_out_names = ["sel_num_label", "cond_num_label", "cond_conn_op_label", "sel_agg_label", + # "sel_label", "cond_op_label", "order_col_label", "order_agg_label", + # "order_direction_label", "group_num_label", "group_col_label", + # "having_agg_label"] + # enc_out = sql2label(SQL.from_dict(gold_sqls[0]), 8) + # for name, array in zip(enc_out_names, enc_out): + # print(name, array) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_preproc_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_preproc_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..18b9f8c972fca0064080d4f89818fc5141ea6b29 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/sql_preproc_v2.py @@ -0,0 +1,457 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
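+
+"""Decoder-side SQL preprocessing: parses gold SQL JSON into grammar ASTs,
+records the observed grammar productions, and builds the decoder vocabulary;
+results can be cached with ``save()`` and reloaded with ``load()``.
+
+Illustrative usage only (the asdl path and output directory below are
+placeholders, not defaults):
+
+    from text2sql.grammars.dusql_v2 import DuSQLLanguageV2
+
+    preproc = SQLPreproc("conf/grammar.asdl", DuSQLLanguageV2, is_cached=False)
+    preproc.add_item("train", sql_json, value_list)  # gold sql dict + candidate value list
+    preproc.save("data/label_vocabs")
+"""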
+ +import collections +import collections.abc +import json +import logging +import os +import shutil +from pathlib import Path + +import attr +from text2sql.dataproc import vocab +from text2sql.utils import serialization + + +def get_field_presence_info(ast_wrapper, node, field_infos): + """get_field_presence_info""" + present = [] + for field_info in field_infos: + field_value = node.get(field_info.name) + is_present = field_value is not None and field_value != [] + + maybe_missing = field_info.opt or field_info.seq + is_builtin_type = field_info.type in ast_wrapper.primitive_types + + if maybe_missing and is_builtin_type: + # TODO: make it possible to deal with "singleton?" + present.append(is_present and type(field_value).__name__) + elif maybe_missing and not is_builtin_type: + present.append(is_present) + elif not maybe_missing and is_builtin_type: + present.append(type(field_value).__name__) + elif not maybe_missing and not is_builtin_type: + assert is_present + present.append(True) + return tuple(present) + + +@attr.s +class DecoderSQLItem: + """DecoderSQLItem""" + + tree = attr.ib() + orig_code = attr.ib() + sql_query = attr.ib(default="") + + +class SQLPreproc(object): + """SQLPreproc""" + + def __init__( + self, + base_path, + grammar_class, + predict_value=True, + min_freq=3, + max_count=5000, + use_seq_elem_rules=False, + is_cached=False, + ): + """ + Args: + base_path (TYPE): if is_cached is False, base_path is the asdl grammar file. + if is_cached is True, base_path is path to cached directory. + grammar_class (TYPE): grammar class, like grammars.dusql.DuSQLLanguage + predict_value (TYPE): Default is True + min_freq (TYPE): Default is 3 + max_count (TYPE): Default is 5000 + use_seq_elem_rules (TYPE): Default is False + is_cached (TYPE): Default is False + + Raises: NULL + """ + self.base_path = base_path + self.predict_value = predict_value + self.vocab = None + self.all_rules = None + self.rules_mask = None + + # key: train/dev/val/test/... 
+ # value: examples + self.items = collections.defaultdict(list) + self.sum_type_constructors = collections.defaultdict(set) + self.field_presence_infos = collections.defaultdict(set) + self.seq_lengths = collections.defaultdict(set) + self.primitive_types = set() + + if not is_cached: + self.grammar = grammar_class(self.base_path) + self.ast_wrapper = self.grammar.ast_wrapper + self.vocab_builder = vocab.VocabBuilder(min_freq, max_count) + else: + self.grammar = None + self.ast_wrapper = None + self.load(grammar_class) + + self.use_seq_elem_rules = use_seq_elem_rules + if self.predict_value: + self.format_sql_value = self.transfer_sql_value + else: + self.format_sql_value = self.fix_sql_value + + def _get_val_index(self, val, value_dict): + def _float(val): + try: + return True, str(int(float(val))) + except Exception: + return False, "" + + val = str(val) + if val in value_dict: + return value_dict[val] + is_float, new_val = _float(val) + if is_float and new_val in value_dict: + return value_dict[new_val] + + new_val = val.replace(".", "") + candi = [] + for v, idx in value_dict.items(): + v = v.replace(".", "") + if v.startswith(new_val) or new_val.startswith(v): + candi.append((v, idx)) + + if len(candi) == 1: + return candi[0][1] + elif len(candi) > 1: + candi.sort(key=lambda x: len(x[0]), reverse=True) + return candi[0][1] + + return -1 + + def transfer_sql_value(self, sql_json, value_dict): + """transfer value str to int index""" + if "cond_conn_op" in sql_json: + self.transfer_simple_sql_value(sql_json, value_dict) + return + + def _trans_cond(cond): + """transfer condition value""" + val1 = cond[3] + val2 = cond[4] + if type(val1) is dict: + self.transfer_sql_value(val1, value_dict) + if val2 is not None: + val2 = self._get_val_index(val2, value_dict) + cond[4] = val2 if val2 >= 0 else 0 + return + + val1 = self._get_val_index(val1, value_dict) + if val2 is not None: + val2 = self._get_val_index(val2, value_dict) + if val1 == -1: + val1 = 0 + logging.debug("lost value: %s. candidates: %s", cond[3], ", ".join(value_dict.keys())) + logging.debug("sql is: %s", json.dumps(sql_json, ensure_ascii=False)) + if val2 == -1: + val2 = 0 + cond[3] = val1 + cond[4] = val2 + + for table_unit in sql_json["from"]["table_units"]: + if type(table_unit[1]) is dict: + self.transfer_sql_value(table_unit[1], value_dict) + + for cond in sql_json["where"][::2]: + _trans_cond(cond) + for cond in sql_json["having"][::2]: + _trans_cond(cond) + + if sql_json["limit"] is not None: + limit = str(sql_json["limit"]) + else: + limit = "0" + if limit in value_dict: + sql_json["limit"] = value_dict[limit] + else: + logging.debug("value of limit is lost: %s. 
candidates: %s", limit, ", ".join(value_dict.keys())) + sql_json["limit"] = value_dict["0"] + + if sql_json["intersect"] is not None: + self.transfer_sql_value(sql_json["intersect"], value_dict) + if sql_json["union"] is not None: + self.transfer_sql_value(sql_json["union"], value_dict) + if sql_json["except"] is not None: + self.transfer_sql_value(sql_json["except"], value_dict) + + def transfer_simple_sql_value(self, sql_json, value_dict): + for cond in sql_json["conds"]: + value = cond[2] + new_val = self._get_val_index(value, value_dict) + if new_val == -1: + new_val = 0 + cond[2] = new_val + + def fix_sql_value(self, sql_json, value_dict): + """fix sql value to 'value' token""" + + def _fix_cond_value(cond): + """transfer condition value""" + val1 = cond[3] + val2 = cond[4] + if type(val1) is dict: + self.fix_sql_value(val1, value_dict) + if val2 is not None: + val2 = self._get_val_index("value", value_dict) + cond[4] = val2 if val2 >= 0 else 0 + return + + val1 = self._get_val_index("value", value_dict) + if val2 is not None: + val2 = self._get_val_index("value", value_dict) + if val1 == -1: + val1 = 0 + logging.info("lost value: %s. candidates: %s", cond[3], ", ".join(value_dict.keys())) + logging.debug("sql is: %s", json.dumps(sql_json, ensure_ascii=False)) + if val2 == -1: + val2 = 0 + cond[3] = val1 + cond[4] = val2 + + for table_unit in sql_json["from"]["table_units"]: + if type(table_unit[1]) is dict: + self.fix_sql_value(table_unit[1], value_dict) + + for cond in sql_json["where"][::2]: + _fix_cond_value(cond) + for cond in sql_json["having"][::2]: + _fix_cond_value(cond) + + if sql_json["limit"] is not None: + limit = "value" + else: + limit = "empty" + assert limit in value_dict + sql_json["limit"] = value_dict[limit] + + if sql_json["intersect"] is not None: + self.fix_sql_value(sql_json["intersect"], value_dict) + if sql_json["union"] is not None: + self.fix_sql_value(sql_json["union"], value_dict) + if sql_json["except"] is not None: + self.fix_sql_value(sql_json["except"], value_dict) + + def add_item(self, section, sql_json, value_list): + """add an item""" + value_dict = {val: idx for idx, val in enumerate(value_list)} + self.format_sql_value(sql_json, value_dict) + + parsed = self.grammar.parse(sql_json, section) + self.ast_wrapper.verify_ast(parsed) # will raise AssertionError, if verify failed + + root = parsed + if section == "train": + for token in self._all_tokens(root): + self.vocab_builder.add_word(token) + self._record_productions(root) + + item = DecoderSQLItem(tree=root, orig_code=sql_json) + self.items[section].append(item) + return item + + def clear_items(self): + """clear items""" + self.items = collections.defaultdict(list) + + def _construct_cache_path(self, root_path): + root_path = Path(root_path) + self.vocab_path = root_path / "dec_vocab.json" + self.observed_productions_path = root_path / "observed_productions.json" + self.grammar_rules_path = root_path / "grammar_rules.json" + self.grammar_file = root_path / "grammar.asdl" + + def save(self, save_path): + """save parsed items to disk""" + os.makedirs(save_path, exist_ok=True) + self._construct_cache_path(save_path) + + self.vocab = self.vocab_builder.finish() + self.vocab.save(self.vocab_path) + # observed_productions + self.sum_type_constructors = serialization.to_dict_with_sorted_values(self.sum_type_constructors) + self.field_presence_infos = serialization.to_dict_with_sorted_values(self.field_presence_infos, key=str) + self.seq_lengths = 
serialization.to_dict_with_sorted_values(self.seq_lengths) + self.primitive_types = sorted(self.primitive_types) + with open(self.observed_productions_path, "w") as f: + json.dump( + { + "sum_type_constructors": self.sum_type_constructors, + "field_presence_infos": self.field_presence_infos, + "seq_lengths": self.seq_lengths, + "primitive_types": self.primitive_types, + }, + f, + indent=2, + sort_keys=True, + ) + + # grammar + self.all_rules, self.rules_mask = self._calculate_rules() + with open(self.grammar_rules_path, "w") as f: + json.dump( + { + "all_rules": self.all_rules, + "rules_mask": self.rules_mask, + }, + f, + indent=2, + sort_keys=True, + ) + + shutil.copy2(self.base_path, self.grammar_file) + + def load(self, grammar_class): + """load parsed items from disk""" + self._construct_cache_path(self.base_path) + + self.grammar = grammar_class(self.grammar_file) + self.ast_wrapper = self.grammar.ast_wrapper + self.vocab = vocab.Vocab.load(self.vocab_path) + + observed_productions = json.load(open(self.observed_productions_path)) + self.sum_type_constructors = observed_productions["sum_type_constructors"] + self.field_presence_infos = observed_productions["field_presence_infos"] + self.seq_lengths = observed_productions["seq_lengths"] + self.primitive_types = observed_productions["primitive_types"] + + grammar = json.load(open(self.grammar_rules_path)) + self.all_rules = serialization.tuplify(grammar["all_rules"]) + self.rules_mask = grammar["rules_mask"] + + def _record_productions(self, tree): + """_record_productions""" + queue = [(tree, False)] + while queue: + node, is_seq_elem = queue.pop() + node_type = node["_type"] + + # Rules of the form: + # expr -> Attribute | Await | BinOp | BoolOp | ... + # expr_seq_elem -> Attribute | Await | ... | Template1 | Template2 | ... + for type_name in [node_type] + node.get("_extra_types", []): + if type_name in self.ast_wrapper.constructors: + sum_type_name = self.ast_wrapper.constructor_to_sum_type[type_name] + if is_seq_elem and self.use_seq_elem_rules: + self.sum_type_constructors[sum_type_name + "_seq_elem"].add(type_name) + else: + self.sum_type_constructors[sum_type_name].add(type_name) + + # Rules of the form: + # FunctionDef + # -> identifier name, arguments args + # | identifier name, arguments args, stmt* body + # | identifier name, arguments args, expr* decorator_list + # | identifier name, arguments args, expr? returns + # ... + # | identifier name, arguments args, stmt* body, expr* decorator_list, expr returns + assert node_type in self.ast_wrapper.singular_types + field_presence_info = get_field_presence_info( + self.ast_wrapper, node, self.ast_wrapper.singular_types[node_type].fields + ) + self.field_presence_infos[node_type].add(field_presence_info) + + for field_info in self.ast_wrapper.singular_types[node_type].fields: + field_value = node.get(field_info.name, [] if field_info.seq else None) + to_enqueue = [] + if field_info.seq: + # Rules of the form: + # stmt* -> stmt + # | stmt stmt + # | stmt stmt stmt + self.seq_lengths[field_info.type + "*"].add(len(field_value)) + to_enqueue = field_value + else: + to_enqueue = [field_value] + for child in to_enqueue: + if isinstance(child, collections.abc.Mapping) and "_type" in child: + queue.append((child, field_info.seq)) + else: + self.primitive_types.add(type(child).__name__) + + def _calculate_rules(self): + """_calculate_rules""" + offset = 0 + + all_rules = [] + rules_mask = {} + + # Rules of the form: + # expr -> Attribute | Await | BinOp | BoolOp | ... 
+ # expr_seq_elem -> Attribute | Await | ... | Template1 | Template2 | ... + for parent, children in sorted(self.sum_type_constructors.items()): + assert not isinstance(children, set) + rules_mask[parent] = (offset, offset + len(children)) + offset += len(children) + all_rules += [(parent, child) for child in children] + + # Rules of the form: + # FunctionDef + # -> identifier name, arguments args + # | identifier name, arguments args, stmt* body + # | identifier name, arguments args, expr* decorator_list + # | identifier name, arguments args, expr? returns + # ... + # | identifier name, arguments args, stmt* body, expr* decorator_list, expr returns + for name, field_presence_infos in sorted(self.field_presence_infos.items()): + assert not isinstance(field_presence_infos, set) + rules_mask[name] = (offset, offset + len(field_presence_infos)) + offset += len(field_presence_infos) + all_rules += [(name, presence) for presence in field_presence_infos] + + # Rules of the form: + # stmt* -> stmt + # | stmt stmt + # | stmt stmt stmt + for seq_type_name, lengths in sorted(self.seq_lengths.items()): + assert not isinstance(lengths, set) + rules_mask[seq_type_name] = (offset, offset + len(lengths)) + offset += len(lengths) + all_rules += [(seq_type_name, i) for i in lengths] + + return tuple(all_rules), rules_mask + + def _all_tokens(self, root): + """_all_tokens""" + queue = [root] + while queue: + node = queue.pop() + type_info = self.ast_wrapper.singular_types[node["_type"]] + + for field_info in reversed(type_info.fields): + field_value = node.get(field_info.name) + if field_info.type in self.grammar.pointers: + pass + elif field_info.type in self.ast_wrapper.primitive_types: + for token in self.grammar.tokenize_field_value(field_value): + yield token + elif isinstance(field_value, (list, tuple)): + queue.extend(field_value) + elif field_value is not None: + queue.append(field_value) + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/dataproc/vocab.py b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..037399586ecd991434f542ac93952109a1c09801 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/dataproc/vocab.py @@ -0,0 +1,96 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""vocabulary utils from rat-sql: https://github.com/Microsoft/rat-sql""" + +import collections +import collections.abc +import json + +UNK = "" +BOS = "" +EOS = "" + + +class Vocab(collections.abc.Set): + def __init__(self, iterable, special_elems=(UNK, BOS, EOS)): + elements = list(special_elems) + elements.extend(iterable) + assert len(elements) == len(set(elements)) + + self.id_to_elem = {i: elem for i, elem in enumerate(elements)} + self.elem_to_id = {elem: i for i, elem in enumerate(elements)} + + def __iter__(self): + for i in range(len(self)): + yield self.id_to_elem[i] + + def __contains__(self, value): + return value in self.elem_to_id + + def __len__(self): + return len(self.elem_to_id) + + def __getitem__(self, key): + if isinstance(key, slice): + raise TypeError("Slices not supported.") + return self.id_to_elem[key] + + def index(self, value): + try: + return self.elem_to_id[value] + except KeyError: + return self.elem_to_id[UNK] + + def indices(self, values): + return [self.index(value) for value in values] + + def __hash__(self): + return id(self) + + @classmethod + def load(self, in_path): + return Vocab(json.load(open(in_path)), special_elems=()) + + def save(self, out_path): + with open(out_path, "w") as ofs: + json.dump([self.id_to_elem[i] for i in range(len(self.id_to_elem))], ofs) + + +class VocabBuilder: + def __init__(self, min_freq=None, max_count=None): + self.word_freq = collections.Counter() + self.min_freq = min_freq + self.max_count = max_count + + def add_word(self, word, count=1): + self.word_freq[word] += count + + def finish(self, *args, **kwargs): + # Select the `max_count` most frequent words. If `max_count` is None, then choose all of the words. + eligible_words_and_freqs = self.word_freq.most_common(self.max_count) + if self.min_freq is not None: + for i, (word, freq) in enumerate(eligible_words_and_freqs): + if freq < self.min_freq: + eligible_words_and_freqs = eligible_words_and_freqs[:i] + break + + return Vocab((word for word, freq in sorted(eligible_words_and_freqs)), *args, **kwargs) + + def save(self, path): + with open(path, "w") as f: + json.dump(self.word_freq, f) + + def load(self, path): + with open(path, "r") as f: + self.word_freq = collections.Counter(json.load(f)) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/global_config.py b/examples/text_to_sql/RAT-SQL/text2sql/global_config.py new file mode 100644 index 0000000000000000000000000000000000000000..e31afaf4d90ef442b297313606027226a3b479a9 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/global_config.py @@ -0,0 +1,149 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
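+
+"""Command-line and config handling for text2sql: defines the argparse parser
+and merges parsed CLI arguments over the jsonnet config file (CLI values take
+priority when both are given)."""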
+ +import argparse +import json +import logging +import os +import sys +from types import SimpleNamespace + +import _jsonnet as jsonnet + + +def define_args_parser(): + """define command-line args parser""" + + def _arg_bool(arg): + """trans arg to bool type""" + if arg is None: + return arg + + if type(arg) is not str: + return bool(arg) + + if arg.isdigit(): + return bool(int(arg)) + + if arg.lower() == "true": + return True + else: + return False + + parser = argparse.ArgumentParser(description="text2sql command-line interface") + parser.add_argument( + "-c", "--config", default="conf/text2sql.jsonnet", help="global config file path. it's priority is the lowest" + ) + + general_args = parser.add_argument_group(title="general") + general_args.add_argument( + "--mode", + type=str.lower, + default="debug", + required=False, + choices=["preproc", "train", "infer", "test", "debug"], + ) + general_args.add_argument("--batch-size", type=int) + general_args.add_argument("--beam-size", default=1, type=int) + general_args.add_argument("--use-cuda", type=_arg_bool, default=True, help="is run in cuda mode") + general_args.add_argument("--is-eval-value", type=_arg_bool, default=True, help="is evaluating value") + general_args.add_argument("--is-cloud", type=_arg_bool, default=False, help="is run in paddle cloud") + general_args.add_argument("--is-debug", type=_arg_bool, default=False, help="is run in debug mode") + + model_args = parser.add_argument_group(title="model") + model_args.add_argument("--pretrain-model", help="ernie model path for dygraph") + model_args.add_argument("--init-model-params", help="trained model params") + model_args.add_argument("--init-model-optim", help="dumped model optimizer") + model_args.add_argument("--model-name", choices=["seq2tree_v2"], help="ernie model path for dygraph") + model_args.add_argument("--grammar-type", choices=["dusql_v2", "nl2sql"], help="") + + data_args = parser.add_argument_group(title="data") + data_args.add_argument("--data-root", help="root data path. 
low priority.") + data_args.add_argument("--db", help="a tuple of pathes (schema, content) or path to dumped file") + data_args.add_argument("--db-schema", help="temp argument") + data_args.add_argument("--grammar", help="path to grammar definition file, or cached label vocabs directory") + data_args.add_argument("--train-set", help="original dataset path or dumped file path") + data_args.add_argument("--dev-set", help="original dataset path or dumped file path") + data_args.add_argument("--test-set", help="original dataset path or dumped file path") + data_args.add_argument("--eval-file", help="file to be evaluated(inferenced result)") + data_args.add_argument("--output", help="") + data_args.add_argument("--is-cached", type=_arg_bool, help="is dataset in cached format") + + train_args = parser.add_argument_group(title="train") + train_args.add_argument("--epochs", type=int) + train_args.add_argument("--learning-rate", type=float) + train_args.add_argument("--log-steps", type=int) + train_args.add_argument("--random-seed", type=int) + train_args.add_argument("--use-data-parallel", type=_arg_bool) + + return parser + + +def gen_config(arg_list=None): + """read configs from file, and updating it by command-line arguments + + Args: + config_path (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + parser = define_args_parser() + cli_args = parser.parse_args(arg_list) + if cli_args.data_root is not None: + root_path = cli_args.data_root + if cli_args.is_cached or cli_args.is_cached is None: + if cli_args.db is None: + cli_args.db = os.path.join(root_path, "db.pkl") + if cli_args.grammar is None: + cli_args.grammar = os.path.join(root_path, "label_vocabs") + if cli_args.train_set is None: + cli_args.train_set = os.path.join(root_path, "train.pkl") + if cli_args.dev_set is None: + cli_args.dev_set = os.path.join(root_path, "dev.pkl") + if cli_args.test_set is None and not cli_args.mode.startswith("train"): + cli_args.test_set = os.path.join(root_path, "test.pkl") + else: + if cli_args.db is None: + cli_args.db = [os.path.join(root_path, "db_schema.json"), os.path.join(root_path, "db_content.json")] + if cli_args.train_set is None: + cli_args.train_set = os.path.join(root_path, "train.json") + if cli_args.dev_set is None: + cli_args.dev_set = os.path.join(root_path, "dev.json") + if cli_args.test_set is None and not cli_args.mode.startswith("train"): + cli_args.test_set = os.path.join(root_path, "test.json") + + arg_groups = {} + for group in parser._action_groups: + group_dict = {arg.dest: getattr(cli_args, arg.dest, None) for arg in group._group_actions} + arg_groups[group.title] = {k: v for k, v in group_dict.items() if v is not None} + + config_file = cli_args.config + config = json.loads(jsonnet.evaluate_file(config_file), object_hook=lambda o: SimpleNamespace(**o)) + + for group, args in arg_groups.items(): + if not hasattr(config, group): + logging.debug(f"group {group} is not a module of config") + setattr(config, group, SimpleNamespace()) + config_module = getattr(config, group) + for name, value in args.items(): + setattr(config_module, name, value) + + return config + + +if __name__ == "__main__": + """run some simple test cases""" + print(gen_config(sys.argv[1:] + ["--mode", "train", "--db", "path/to/db"])) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/grammars/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/grammars/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6f0ea85344b7e0c679730356928c8749cf71cd66 --- /dev/null +++ 
b/examples/text_to_sql/RAT-SQL/text2sql/grammars/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/examples/text_to_sql/RAT-SQL/text2sql/grammars/cspider_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/grammars/cspider_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..466bc51471902242a6a73991f9a2e23fcf843301 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/grammars/cspider_v2.py @@ -0,0 +1,701 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import itertools + +import asdl +import attr +import networkx as nx +from text2sql.utils import ast_util + + +def bimap(first, second): + return {f: s for f, s in zip(first, second)}, {s: f for f, s in zip(first, second)} + + +def filter_nones(d): + return {k: v for k, v in d.items() if v is not None and v != []} + + +def join(iterable, delimiter): + it = iter(iterable) + yield next(it) + for x in it: + yield delimiter + yield x + + +def intersperse(delimiter, seq): + return itertools.islice(itertools.chain.from_iterable(zip(itertools.repeat(delimiter), seq)), 1, None) + + +class CSpiderLanguageV2: + root_type = "sql" + + def __init__( + self, + asdl_file, + output_from=True, + use_table_pointer=True, + include_literals=False, + include_columns=True, + end_with_from=True, + clause_order=None, + infer_from_conditions=True, + factorize_sketch=2, + ): + + # collect pointers and checkers + custom_primitive_type_checkers = {} + self.include_columns = include_columns + self.pointers = set(["table", "column", "value"]) + custom_primitive_type_checkers["table"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["column"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["value"] = lambda x: isinstance(x, int) + + # create ast wrapper + self.factorize_sketch = factorize_sketch + self.ast_wrapper = ast_util.ASTWrapper( + asdl.parse(asdl_file), custom_primitive_type_checkers=custom_primitive_type_checkers + ) + if not use_table_pointer: + self.ast_wrapper.singular_types["Table"].fields[0].type = "int" + if not include_columns: + col_unit_fields = self.ast_wrapper.singular_types["col_unit"].fields + assert col_unit_fields[1].name == "col_id" + del col_unit_fields[1] + + # literals of limit field + self.include_literals = include_literals + + # from field + self.output_from = output_from + self.end_with_from = end_with_from 
+ self.clause_order = clause_order + self.infer_from_conditions = infer_from_conditions + if self.clause_order: + # clause order is prioritized over configurations like end_with_from + assert factorize_sketch == 2 # TODO support other grammars + sql_fields = self.ast_wrapper.product_types["sql"].fields + letter2field = {k: v for k, v in zip("SFWGOI", sql_fields)} + new_sql_fields = [letter2field[k] for k in self.clause_order] + self.ast_wrapper.product_types["sql"].fields = new_sql_fields + else: + if not self.output_from: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + del sql_fields[1] + else: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + if self.end_with_from: + sql_fields.append(sql_fields[1]) + del sql_fields[1] + + def parse(self, code, section): + return self.parse_sql(code) + + def unparse(self, tree, db, value_list): + unparser = CSpiderUnparser(self.ast_wrapper, db, value_list, self.factorize_sketch) + return unparser.unparse_sql(tree) + + @classmethod + def tokenize_field_value(cls, field_value): + if isinstance(field_value, bytes): + field_value_str = field_value.encode("latin1") + elif isinstance(field_value, str): + field_value_str = field_value + else: + field_value_str = str(field_value) + if field_value_str[0] == '"' and field_value_str[-1] == '"': + field_value_str = field_value_str[1:-1] + # TODO: Get rid of surrounding quotes + return [field_value_str] + + def parse_val(self, val): + if isinstance(val, int): + return { + "_type": "Value", + "val_id": val, + } + elif isinstance(val, dict): + return { + "_type": "ValSql", + "s": self.parse_sql(val), + } + else: + raise ValueError(val) + + def parse_col_unit(self, col_unit): + agg_id, col_id, is_distinct = col_unit + result = { + "_type": "col_unit", + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + "is_distinct": is_distinct, + } + if self.include_columns: + result["col_id"] = col_id + return result + + def parse_val_unit(self, val_unit): + unit_op, col_unit1, col_unit2 = val_unit + result = { + "_type": self.UNIT_TYPES_F[unit_op], + "col_unit1": self.parse_col_unit(col_unit1), + } + if unit_op != 0: + result["col_unit2"] = self.parse_col_unit(col_unit2) + return result + + def parse_table_unit(self, table_unit): + table_type, value = table_unit + if table_type == "sql": + return { + "_type": "TableUnitSql", + "s": self.parse_sql(value), + } + elif table_type == "table_unit": + return { + "_type": "Table", + "table_id": value, + } + else: + raise ValueError(table_type) + + def parse_cond(self, cond, optional=False): + if optional and not cond: + return None + + if len(cond) > 1: + return { + "_type": self.LOGIC_OPERATORS_F[cond[1]], + "left": self.parse_cond(cond[:1]), + "right": self.parse_cond(cond[2:]), + } + + ((not_op, op_id, val_unit, val1, val2),) = cond + result = { + "_type": self.COND_TYPES_F[op_id], + "val_unit": self.parse_val_unit(val_unit), + "val1": self.parse_val(val1), + } + if op_id == 1: # between + result["val2"] = self.parse_val(val2) + if not_op: + result = { + "_type": "Not", + "c": result, + } + return result + + def parse_sql(self, sql, optional=False): + if optional and sql is None: + return None + if self.factorize_sketch == 0: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + "where": self.parse_cond(sql["where"], optional=True), + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "order_by": self.parse_order_by(sql["orderBy"]), + 
"having": self.parse_cond(sql["having"], optional=True), + "limit": sql["limit"] if self.include_literals else (sql["limit"] is not None), + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + } + ) + elif self.factorize_sketch == 1: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + "sql_where": filter_nones( + { + "_type": "sql_where", + "where": self.parse_cond(sql["where"], optional=True), + "sql_groupby": filter_nones( + { + "_type": "sql_groupby", + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "having": filter_nones( + { + "_type": "having", + "having": self.parse_cond(sql["having"], optional=True), + } + ), + "sql_orderby": filter_nones( + { + "_type": "sql_orderby", + "order_by": self.parse_order_by(sql["orderBy"]), + "limit": filter_nones( + { + "_type": "limit", + "limit": sql["limit"] + if self.include_literals + else (sql["limit"] is not None), + } + ), + "sql_ieu": filter_nones( + { + "_type": "sql_ieu", + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + } + ), + } + ), + } + ), + } + ), + } + ) + elif self.factorize_sketch == 2: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + "sql_where": filter_nones( + { + "_type": "sql_where", + "where": self.parse_cond(sql["where"], optional=True), + } + ), + "sql_groupby": filter_nones( + { + "_type": "sql_groupby", + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "having": self.parse_cond(sql["having"], optional=True), + } + ), + "sql_orderby": filter_nones( + { + "_type": "sql_orderby", + "order_by": self.parse_order_by(sql["orderBy"]), + "limit": sql["limit"] if sql["limit"] is not None else 0, + } + ), + "sql_ieu": filter_nones( + { + "_type": "sql_ieu", + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + } + ), + } + ) + + def parse_select(self, select): + is_distinct, aggs = select + return { + "_type": "select", + "is_distinct": is_distinct, + "aggs": [self.parse_agg(agg) for agg in aggs], + } + + def parse_agg(self, agg): + agg_id, val_unit = agg + return { + "_type": "agg", + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + "val_unit": self.parse_val_unit(val_unit), + } + + def parse_from(self, from_, infer_from_conditions=False): + return filter_nones( + { + "_type": "from", + "table_units": [self.parse_table_unit(u) for u in from_["table_units"]], + "conds": self.parse_cond(from_["conds"], optional=True) if not infer_from_conditions else None, + } + ) + + def parse_order_by(self, order_by): + if not order_by: + return None + + order, val_units = order_by + return { + "_type": "order_by", + "order": {"_type": self.ORDERS_F[order]}, + "val_units": [self.parse_val_unit(v) for v in val_units], + } + + COND_TYPES_F, COND_TYPES_B = bimap( + # ('not', 'between', '=', '>', '<', '>=', 
'<=', '!=', 'in', 'like', 'is', 'exists'), + # (None, 'Between', 'Eq', 'Gt', 'Lt', 'Ge', 'Le', 'Ne', 'In', 'Like', 'Is', 'Exists')) + range(1, 10), + ("Between", "Eq", "Gt", "Lt", "Ge", "Le", "Ne", "In", "Like"), + ) + + UNIT_TYPES_F, UNIT_TYPES_B = bimap( + # ('none', '-', '+', '*', '/'), + range(5), + ("Column", "Minus", "Plus", "Times", "Divide"), + ) + + AGG_TYPES_F, AGG_TYPES_B = bimap(range(6), ("NoneAggOp", "Max", "Min", "Count", "Sum", "Avg")) + + ORDERS_F, ORDERS_B = bimap(("asc", "desc"), ("Asc", "Desc")) + + LOGIC_OPERATORS_F, LOGIC_OPERATORS_B = bimap(("and", "or"), ("And", "Or")) + + +@attr.s +class CSpiderUnparser: + ast_wrapper = attr.ib() + schema = attr.ib() + value_list = attr.ib() + factorize_sketch = attr.ib(default=0) + + UNIT_TYPES_B = { + "Minus": "-", + "Plus": "+", + "Times": "*", + "Divide": "/", + } + COND_TYPES_B = { + "Between": "BETWEEN", + "Eq": "=", + "Gt": ">", + "Lt": "<", + "Ge": ">=", + "Le": "<=", + "Ne": "!=", + "In": "IN", + "Like": "LIKE", + } + + @classmethod + def conjoin_conds(cls, conds): + if not conds: + return None + if len(conds) == 1: + return conds[0] + return {"_type": "And", "left": conds[0], "right": cls.conjoin_conds(conds[1:])} + + @classmethod + def linearize_cond(cls, cond): + if cond["_type"] in ("And", "Or"): + conds, keywords = cls.linearize_cond(cond["right"]) + return [cond["left"]] + conds, [cond["_type"]] + keywords + else: + return [cond], [] + + def unparse_val(self, val): + if val["_type"] == "Value": + value_index = int(val["val_id"]) + if value_index >= len(self.value_list): + value_index = 0 + return f'"{self.value_list[value_index]}"' + if val["_type"] == "ValSql": + return f'({self.unparse_sql(val["s"])})' + if val["_type"] == "ColUnit": + return self.unparse_col_unit(val["c"]) + + def unparse_col_unit(self, col_unit): + if "col_id" in col_unit: + column = self.schema.columns[col_unit["col_id"]] + if column.table is None: + column_name = column.orig_name + else: + column_name = f"{column.table.orig_name}.{column.orig_name}" + else: + column_name = "some_col" + + if col_unit["is_distinct"]: + column_name = f"DISTINCT {column_name}" + agg_type = col_unit["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return column_name + else: + return f"{agg_type}({column_name})" + + def unparse_val_unit(self, val_unit): + if val_unit["_type"] == "Column": + return self.unparse_col_unit(val_unit["col_unit1"]) + col1 = self.unparse_col_unit(val_unit["col_unit1"]) + col2 = self.unparse_col_unit(val_unit["col_unit2"]) + return f'{col1} {self.UNIT_TYPES_B[val_unit["_type"]]} {col2}' + + # def unparse_table_unit(self, table_unit): + # raise NotImplementedError + + def unparse_cond(self, cond, negated=False): + if cond["_type"] == "And": + assert not negated + return f'{self.unparse_cond(cond["left"])} AND {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Or": + assert not negated + return f'{self.unparse_cond(cond["left"])} OR {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Not": + return self.unparse_cond(cond["c"], negated=True) + elif cond["_type"] == "Between": + tokens = [self.unparse_val_unit(cond["val_unit"])] + if negated: + tokens.append("NOT") + tokens += [ + "BETWEEN", + self.unparse_val(cond["val1"]), + "AND", + self.unparse_val(cond["val2"]), + ] + return " ".join(tokens) + tokens = [self.unparse_val_unit(cond["val_unit"])] + if negated: + tokens.append("NOT") + tokens += [self.COND_TYPES_B[cond["_type"]], self.unparse_val(cond["val1"])] + return " ".join(tokens) + + def refine_from(self, tree): + """ 
+ 1) Inferring tables from columns predicted + 2) Mix them with the predicted tables if any + 3) Inferring conditions based on tables + """ + + # nested query in from clause, recursively use the refinement + if "from" in tree and tree["from"]["table_units"][0]["_type"] == "TableUnitSql": + for table_unit in tree["from"]["table_units"]: + subquery_tree = table_unit["s"] + self.refine_from(subquery_tree) + return + + # get predicted tables + predicted_from_table_ids = set() + if "from" in tree: + table_unit_set = [] + for table_unit in tree["from"]["table_units"]: + if table_unit["table_id"] not in predicted_from_table_ids: + predicted_from_table_ids.add(table_unit["table_id"]) + table_unit_set.append(table_unit) + tree["from"]["table_units"] = table_unit_set # remove duplicate + + # Get all candidate columns + candidate_column_ids = set( + self.ast_wrapper.find_all_descendants_of_type(tree, "column", lambda field: field.type != "sql") + ) + candidate_columns = [self.schema.columns[i] for i in candidate_column_ids] + must_in_from_table_ids = set(column.table.id for column in candidate_columns if column.table is not None) + + # Table the union of inferred and predicted tables + all_from_table_ids = must_in_from_table_ids.union(predicted_from_table_ids) + if not all_from_table_ids: + # TODO: better heuristic e.g., tables that have exact match + all_from_table_ids = {0} + + covered_tables = set() + candidate_table_ids = sorted(all_from_table_ids) + start_table_id = candidate_table_ids[0] + conds = [] + for table_id in candidate_table_ids[1:]: + if table_id in covered_tables: + continue + try: + path = nx.shortest_path(self.schema.foreign_key_graph, source=start_table_id, target=table_id) + except (nx.NetworkXNoPath, nx.NodeNotFound): + covered_tables.add(table_id) + continue + + for source_table_id, target_table_id in zip(path, path[1:]): + if target_table_id in covered_tables: + continue + all_from_table_ids.add(target_table_id) + col1, col2 = self.schema.foreign_key_graph[source_table_id][target_table_id]["columns"] + conds.append( + { + "_type": "Eq", + "val_unit": { + "_type": "Column", + "col_unit1": { + "_type": "col_unit", + "agg_id": {"_type": "NoneAggOp"}, + "col_id": col1, + "is_distinct": False, + }, + }, + "val1": { + "_type": "ColUnit", + "c": { + "_type": "col_unit", + "agg_id": {"_type": "NoneAggOp"}, + "col_id": col2, + "is_distinct": False, + }, + }, + } + ) + table_units = [{"_type": "Table", "table_id": i} for i in sorted(all_from_table_ids)] + + tree["from"] = { + "_type": "from", + "table_units": table_units, + } + cond_node = self.conjoin_conds(conds) + if cond_node is not None: + tree["from"]["conds"] = cond_node + + def unparse_sql(self, tree): + self.refine_from(tree) + + result = [ + # select select, + self.unparse_select(tree["select"]), + # from from, + self.unparse_from(tree["from"]), + ] + + def find_subtree(_tree, name): + if self.factorize_sketch == 0: + return _tree, _tree + elif name in _tree: + if self.factorize_sketch == 1: + return _tree[name], _tree[name] + elif self.factorize_sketch == 2: + return _tree, _tree[name] + else: + raise NotImplementedError + + tree, target_tree = find_subtree(tree, "sql_where") + # cond? 
where, + if "where" in target_tree: + result += ["WHERE", self.unparse_cond(target_tree["where"])] + + tree, target_tree = find_subtree(tree, "sql_groupby") + # col_unit* group_by, + if "group_by" in target_tree: + result += ["GROUP BY", ", ".join(self.unparse_col_unit(c) for c in target_tree["group_by"])] + + tree, target_tree = find_subtree(tree, "sql_orderby") + # order_by? order_by, + if "order_by" in target_tree: + result.append(self.unparse_order_by(target_tree["order_by"])) + + tree, target_tree = find_subtree(tree, "sql_groupby") + # cond? having, + if "having" in target_tree: + result += ["HAVING", self.unparse_cond(target_tree["having"])] + + tree, target_tree = find_subtree(tree, "sql_orderby") + # int? limit, + if "limit" in target_tree: + limit_index = int(target_tree["limit"]) + limit_value = "0" + if limit_index < len(self.value_list): + limit_value = self.value_list[limit_index] + if limit_value == "value": + limit_value = "1" + if limit_value.isdigit() and limit_value != "0": + result += ["LIMIT", str(limit_value)] + + tree, target_tree = find_subtree(tree, "sql_ieu") + # sql? intersect, + if "intersect" in target_tree: + result += ["INTERSECT", self.unparse_sql(target_tree["intersect"])] + # sql? except, + if "except" in target_tree: + result += ["EXCEPT", self.unparse_sql(target_tree["except"])] + # sql? union + if "union" in target_tree: + result += ["UNION", self.unparse_sql(target_tree["union"])] + + return " ".join(result) + + def unparse_select(self, select): + tokens = ["SELECT"] + if select["is_distinct"]: + tokens.append("DISTINCT") + tokens.append(", ".join(self.unparse_agg(agg) for agg in select.get("aggs", []))) + return " ".join(tokens) + + def unparse_agg(self, agg): + unparsed_val_unit = self.unparse_val_unit(agg["val_unit"]) + agg_type = agg["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return unparsed_val_unit + else: + return f"{agg_type}({unparsed_val_unit})" + + def unparse_from(self, from_): + if "conds" in from_: + all_conds, keywords = self.linearize_cond(from_["conds"]) + else: + all_conds, keywords = [], [] + assert all(keyword == "And" for keyword in keywords) + + cond_indices_by_table = collections.defaultdict(set) + tables_involved_by_cond_idx = collections.defaultdict(set) + for i, cond in enumerate(all_conds): + for column in self.ast_wrapper.find_all_descendants_of_type(cond, "column"): + table = self.schema.columns[column].table + if table is None: + continue + cond_indices_by_table[table.id].add(i) + tables_involved_by_cond_idx[i].add(table.id) + + output_table_ids = set() + output_cond_indices = set() + tokens = ["FROM"] + for i, table_unit in enumerate(from_.get("table_units", [])): + if i > 0: + tokens += ["JOIN"] + + if table_unit["_type"] == "TableUnitSql": + tokens.append(f'({self.unparse_sql(table_unit["s"])})') + elif table_unit["_type"] == "Table": + table_id = table_unit["table_id"] + tokens += [self.schema.tables[table_id].orig_name] + output_table_ids.add(table_id) + + # Output "ON " if all tables involved in the condition have been output + conds_to_output = [] + for cond_idx in sorted(cond_indices_by_table[table_id]): + if cond_idx in output_cond_indices: + continue + if tables_involved_by_cond_idx[cond_idx] <= output_table_ids: + conds_to_output.append(all_conds[cond_idx]) + output_cond_indices.add(cond_idx) + if conds_to_output: + tokens += ["ON"] + tokens += list(intersperse("AND", (self.unparse_cond(cond) for cond in conds_to_output))) + return " ".join(tokens) + + def unparse_order_by(self, order_by): + return 
f'ORDER BY {", ".join(self.unparse_val_unit(v) for v in order_by["val_units"])} {order_by["order"]["_type"]}' diff --git a/examples/text_to_sql/RAT-SQL/text2sql/grammars/dusql_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/grammars/dusql_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..75989fd74b3e6d6f3d240d7715fa480c88f497bd --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/grammars/dusql_v2.py @@ -0,0 +1,750 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import itertools +import logging + +import asdl +import attr +import networkx as nx +from text2sql.utils import ast_util + + +def bimap(first, second): + return {f: s for f, s in zip(first, second)}, {s: f for f, s in zip(first, second)} + + +def filter_nones(d): + return {k: v for k, v in d.items() if v is not None and v != []} + + +def join(iterable, delimiter): + it = iter(iterable) + yield next(it) + for x in it: + yield delimiter + yield x + + +def intersperse(delimiter, seq): + return itertools.islice(itertools.chain.from_iterable(zip(itertools.repeat(delimiter), seq)), 1, None) + + +class DuSQLLanguageV2(object): + root_type = "sql" + + def __init__( + self, + asdl_file, + output_from=True, # < changed + use_table_pointer=True, # < changed + include_literals=False, # < changed + include_columns=True, + end_with_from=True, # < changed + clause_order=None, + infer_from_conditions=True, # < changed + factorize_sketch=2, + ): + + # collect pointers and checkers + self.pointers = set(["table", "column", "value"]) + custom_primitive_type_checkers = {} + custom_primitive_type_checkers["table"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["column"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["value"] = lambda x: isinstance(x, int) + + self.include_columns = include_columns + # create ast wrapper + self.factorize_sketch = factorize_sketch + self.ast_wrapper = ast_util.ASTWrapper( + asdl.parse(asdl_file), custom_primitive_type_checkers=custom_primitive_type_checkers + ) + + # from field + self.output_from = output_from + self.end_with_from = end_with_from + self.clause_order = clause_order + self.infer_from_conditions = infer_from_conditions + if self.clause_order: + # clause order is prioritized over configurations like end_with_from + assert factorize_sketch == 2 # TODO support other grammars + sql_fields = self.ast_wrapper.product_types["sql"].fields + letter2field = {k: v for k, v in zip("SFWGOI", sql_fields)} + new_sql_fields = [letter2field[k] for k in self.clause_order] + self.ast_wrapper.product_types["sql"].fields = new_sql_fields + else: + if not self.output_from: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + del sql_fields[1] + else: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + if self.end_with_from: + sql_fields.append(sql_fields[1]) + del sql_fields[1] + + def 
parse(self, code, section): + return self.parse_sql(code) + + def unparse(self, tree, db, value_list): + unparser = DuSQLUnparser(self.ast_wrapper, db, value_list, self.factorize_sketch) + return unparser.unparse_sql(tree) + + @classmethod + def tokenize_field_value(cls, field_value): + if isinstance(field_value, bytes): + field_value_str = field_value.encode("latin1") + elif isinstance(field_value, str): + field_value_str = field_value + else: + field_value_str = str(field_value) + if field_value_str[0] == '"' and field_value_str[-1] == '"': + field_value_str = field_value_str[1:-1] + # TODO: Get rid of surrounding quotes + return [field_value_str] + + def parse_val(self, val): + if isinstance(val, int): # Modify: add int + return { + "_type": "Value", + "val_id": val, + } + elif isinstance(val, dict): + return { + "_type": "ValSql", + "s": self.parse_sql(val), + } + else: + raise ValueError(val) + + def parse_col_unit(self, col_unit): + agg_id, col_id = col_unit[:2] + result = { + "_type": "col_unit", + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + } + if self.include_columns: + result["col_id"] = col_id + return result + + def parse_val_unit(self, val_unit): + unit_op, col_unit1, col_unit2 = val_unit + result = { + "_type": self.UNIT_TYPES_F[unit_op], + "col_unit1": self.parse_col_unit(col_unit1), + } + if unit_op != 0: + result["col_unit2"] = self.parse_col_unit(col_unit2) + if result["col_unit1"]["col_id"] == "TIME_NOW": + logging.debug("fix TIME_NOW grammar") + result["col_unit1"]["col_id"] = result["col_unit2"]["col_id"] + return result + + def parse_table_unit(self, table_unit): + table_type, value = table_unit + if table_type == "sql": + return { + "_type": "TableUnitSql", + "s": self.parse_sql(value), + } + elif table_type == "table_unit": + return { + "_type": "Table", + "table_id": value, + } + else: + raise ValueError(table_type) + + def parse_cond(self, cond, optional=False): + if optional and not cond: + return None + + if len(cond) > 1: + return { + "_type": self.LOGIC_OPERATORS_F[cond[1]], + "left": self.parse_cond(cond[:1]), + "right": self.parse_cond(cond[2:]), + } + + ((agg_id, op_id, val_unit, val1, val2),) = cond + result = { + "_type": self.COND_TYPES_F[op_id], + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + "val_unit": self.parse_val_unit(val_unit), + "val1": self.parse_val(val1), + } + if op_id == 1: # between + result["val2"] = self.parse_val(val2) + return result + + def parse_sql(self, sql, optional=False): + if optional and sql is None: + return None + if self.factorize_sketch == 0: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + "where": self.parse_cond(sql["where"], optional=True), + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "order_by": self.parse_order_by(sql["orderBy"]), + "having": self.parse_cond(sql["having"], optional=True), + "limit": sql["limit"] if self.include_literals else (sql["limit"] is not None), + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + } + ) + elif self.factorize_sketch == 1: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + "sql_where": filter_nones( + { + 
"_type": "sql_where", + "where": self.parse_cond(sql["where"], optional=True), + "sql_groupby": filter_nones( + { + "_type": "sql_groupby", + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "having": filter_nones( + { + "_type": "having", + "having": self.parse_cond(sql["having"], optional=True), + } + ), + "sql_orderby": filter_nones( + { + "_type": "sql_orderby", + "order_by": self.parse_order_by(sql["orderBy"]), + "limit": filter_nones( + { + "_type": "limit", + "limit": sql["limit"] + if self.include_literals + else (sql["limit"] is not None), + } + ), + "sql_ieu": filter_nones( + { + "_type": "sql_ieu", + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + } + ), + } + ), + } + ), + } + ), + } + ) + elif self.factorize_sketch == 2: + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["select"]), + **( + { + "from": self.parse_from(sql["from"], self.infer_from_conditions), + } + if self.output_from + else {} + ), + "sql_where": filter_nones( + { + "_type": "sql_where", + "where": self.parse_cond(sql["where"], optional=True), + } + ), + "sql_groupby": filter_nones( + { + "_type": "sql_groupby", + "group_by": [self.parse_col_unit(u) for u in sql["groupBy"]], + "having": self.parse_cond(sql["having"], optional=True), + } + ), + "sql_orderby": filter_nones( + { + "_type": "sql_orderby", + "order_by": self.parse_order_by(sql["orderBy"]), + "limit": sql["limit"] if sql["limit"] is not None else 0, + } + ), + "sql_ieu": filter_nones( + { + "_type": "sql_ieu", + "intersect": self.parse_sql(sql["intersect"], optional=True), + "except": self.parse_sql(sql["except"], optional=True), + "union": self.parse_sql(sql["union"], optional=True), + } + ), + } + ) + + def parse_select(self, select): + if type(select[0]) is bool: + aggs = select[1] + else: + aggs = select + return { + "_type": "select", + "aggs": [self.parse_agg(agg) for agg in aggs], + } + + def parse_agg(self, agg): + if len(agg) == 2: + agg_id, val_unit = agg + else: + agg_id, val_unit = 0, agg + return { + "_type": "agg", + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + "val_unit": self.parse_val_unit(val_unit), + } + + def parse_from(self, from_, infer_from_conditions=False): + return filter_nones( + { + "_type": "from", + "table_units": [self.parse_table_unit(u) for u in from_["table_units"]], + "conds": self.parse_cond(from_["conds"], optional=True) if not infer_from_conditions else None, + } + ) + + def parse_order_by(self, order_by): + if not order_by: + return None + + # DIFF: Spider&DuSQL + order, order_units = order_by + return { + "_type": "order_by", + "order": {"_type": self.ORDERS_F[order]}, + "aggs": [self.parse_agg(v) for v in order_units], + } + + COND_TYPES_F, COND_TYPES_B = bimap( + # Spider: (not, between, =, >, <, >=, <=, !=, in, like, is, exists) + # RAT: (None, Between, Eq, Gt, Lt, Ge, Le, Ne, In, Like, Is, Exists) + # DuSQL: (NotIn, Between, Eq, Gt, Lt, Ge, Le, Ne, In, Like) + # DIFF: Spider&DuSQL + range(0, 10), + ("NotIn", "Between", "Eq", "Gt", "Lt", "Ge", "Le", "Ne", "In", "Like"), + ) + + UNIT_TYPES_F, UNIT_TYPES_B = bimap( + # ('none', '-', '+', '*', '/'), + range(5), + ("Column", "Minus", "Plus", "Times", "Divide"), + ) + + AGG_TYPES_F, AGG_TYPES_B = bimap(range(6), ("NoneAggOp", "Max", "Min", "Count", "Sum", "Avg")) + + ORDERS_F, ORDERS_B = bimap(("asc", "desc"), ("Asc", "Desc")) + + LOGIC_OPERATORS_F, LOGIC_OPERATORS_B = bimap(("and", 
"or"), ("And", "Or")) + + +@attr.s +class DuSQLUnparser: + ast_wrapper = attr.ib() + schema = attr.ib() + value_list = attr.ib() + factorize_sketch = attr.ib(default=0) + + UNIT_TYPES_B = { + "Minus": "-", + "Plus": "+", + "Times": "*", + "Divide": "/", + } + COND_TYPES_B = { + "Between": "BETWEEN", + "Eq": "=", + "Gt": ">", + "Lt": "<", + "Ge": ">=", + "Le": "<=", + "Ne": "!=", + "In": "IN", + "NotIn": "NOT IN", + "Like": "LIKE", + } + + @classmethod + def conjoin_conds(cls, conds): + if not conds: + return None + if len(conds) == 1: + return conds[0] + return {"_type": "And", "left": conds[0], "right": cls.conjoin_conds(conds[1:])} + + @classmethod + def linearize_cond(cls, cond): + if cond["_type"] in ("And", "Or"): + conds, keywords = cls.linearize_cond(cond["right"]) + return [cond["left"]] + conds, [cond["_type"]] + keywords + else: + return [cond], [] + + def unparse_val(self, val): + if val["_type"] == "Value": + value_index = int(val["val_id"]) + if value_index >= len(self.value_list): + value_index = 0 + return f'"{self.value_list[value_index]}"' + if val["_type"] == "ValSql": + return f'({self.unparse_sql(val["s"])})' + if val["_type"] == "ColUnit": + column = self.schema.columns[val["col_id"]] + return f"{column.table.orig_name}.{column.orig_name}" + + def unparse_col_unit(self, col_unit, alias_table_name=None): + if "col_id" in col_unit: + column = self.schema.columns[col_unit["col_id"]] + if alias_table_name is not None: + column_name = f"{alias_table_name}.{column.orig_name}" + elif column.table is not None: + column_name = f"{column.table.orig_name}.{column.orig_name}" + else: + column_name = column.orig_name + else: + column_name = "some_col" + + agg_type = col_unit["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return column_name + else: + return f"{agg_type}({column_name})" + + def unparse_val_unit(self, val_unit, is_row_calc=False): + if val_unit["_type"] == "Column": + return self.unparse_col_unit(val_unit["col_unit1"]) + col1 = self.unparse_col_unit(val_unit["col_unit1"], alias_table_name="a" if is_row_calc else None) + col2 = self.unparse_col_unit(val_unit["col_unit2"], alias_table_name="b" if is_row_calc else None) + calc_op = self.UNIT_TYPES_B[val_unit["_type"]] + # TODO: DuSQL const col + if col1 == col2 and calc_op == "-": + col1 = "TIME_NOW" + return f"{col1} {calc_op} {col2}" + + # def unparse_table_unit(self, table_unit): + # raise NotImplementedError + + def unparse_cond(self, cond, negated=False): + """ + Args: + cond: + { + "_type": "Ne", + "agg_id": { + "_type": "NoneAggOp" + }, + "val_unit": { + "_type": "Column", + "col_unit1": { + "_type": "col_unit", + "agg_id": { + "_type": "NoneAggOp" + }, + "col_id": 11, + } + }, + "val1": { + "_type": "Value", + "val_id": 0 + } + } + """ + if cond["_type"] == "And": + assert not negated + return f'{self.unparse_cond(cond["left"])} AND {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Or": + assert not negated + return f'{self.unparse_cond(cond["left"])} OR {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Not": + return self.unparse_cond(cond["c"], negated=True) + elif cond["_type"] == "Between": + tokens = [self.unparse_val_unit(cond["val_unit"])] + if negated: + tokens.append("NOT") + tokens += [ + "BETWEEN", + self.unparse_val(cond["val1"]), + "AND", + self.unparse_val(cond["val2"]), + ] + return " ".join(tokens) + tokens = [self.unparse_val_unit(cond["val_unit"])] + if cond["agg_id"]["_type"] != "NoneAggOp": + agg = cond["agg_id"]["_type"] + tokens[0] = f"{agg}({tokens[0]})" + if 
negated: + tokens.append("NOT") + tokens += [self.COND_TYPES_B[cond["_type"]], self.unparse_val(cond["val1"])] + return " ".join(tokens) + + def refine_from(self, tree): + """ + 1) Inferring tables from columns predicted + 2) Mix them with the predicted tables if any + 3) Inferring conditions based on tables + + Returns: bool + True: 是行计算 + False: 不是行计算 + """ + + # nested query in from clause, recursively use the refinement + if "from" in tree and tree["from"]["table_units"][0]["_type"] == "TableUnitSql": + for table_unit in tree["from"]["table_units"]: + # 正常解码的结果中,FROM 中要么全是 sub-sql,要么全是普通的 table_id + if "s" not in table_unit: + logging.warning("error tree in FROM clause: %s", str(tree)) + continue + subquery_tree = table_unit["s"] + self.refine_from(subquery_tree) + return len(tree["from"]["table_units"]) == 2 # 行计算 + + # get predicted tables + predicted_from_table_ids = set() + if "from" in tree: + table_unit_set = [] + for table_unit in tree["from"]["table_units"]: + if "table_id" in table_unit and table_unit["table_id"] not in predicted_from_table_ids: + predicted_from_table_ids.add(table_unit["table_id"]) + table_unit_set.append(table_unit) + tree["from"]["table_units"] = table_unit_set # remove duplicate + + # Get all candidate columns + candidate_column_ids = set( + self.ast_wrapper.find_all_descendants_of_type(tree, "column", lambda field: field.type != "sql") + ) + candidate_columns = [self.schema.columns[i] for i in candidate_column_ids] + must_in_from_table_ids = set(column.table.id for column in candidate_columns if column.table is not None) + + # Table the union of inferred and predicted tables + all_from_table_ids = must_in_from_table_ids.union(predicted_from_table_ids) + if not all_from_table_ids: + # TODO: better heuristic e.g., tables that have exact match + all_from_table_ids = {0} + + covered_tables = set() + candidate_table_ids = sorted(all_from_table_ids) + start_table_id = candidate_table_ids[0] + conds = [] + for table_id in candidate_table_ids[1:]: + if table_id in covered_tables: + continue + try: + path = nx.shortest_path(self.schema.foreign_key_graph, source=start_table_id, target=table_id) + except (nx.NetworkXNoPath, nx.NodeNotFound): + covered_tables.add(table_id) + continue + + for source_table_id, target_table_id in zip(path, path[1:]): + if target_table_id in covered_tables: + continue + all_from_table_ids.add(target_table_id) + col1, col2 = self.schema.foreign_key_graph[source_table_id][target_table_id]["columns"] + conds.append( + { + "_type": "Eq", + "agg_id": {"_type": DuSQLLanguageV2.AGG_TYPES_F[0]}, + "val_unit": { + "_type": "Column", + "col_unit1": { + "_type": "col_unit", + "agg_id": {"_type": "NoneAggOp"}, + "col_id": col1, + }, + }, + "val1": { + "_type": "ColUnit", + "col_id": col2, + }, + } + ) + table_units = [{"_type": "Table", "table_id": i} for i in sorted(all_from_table_ids)] + + tree["from"] = { + "_type": "from", + "table_units": table_units, + } + cond_node = self.conjoin_conds(conds) + if cond_node is not None: + tree["from"]["conds"] = cond_node + return False + + def unparse_sql(self, tree): + is_row_calc = self.refine_from(tree) + + result = [ + # select select, + self.unparse_select(tree["select"], is_row_calc), + # from from, + self.unparse_from(tree["from"], is_row_calc), + ] + + def find_subtree(_tree, name): + if self.factorize_sketch == 0: + return _tree, _tree + elif name in _tree: + if self.factorize_sketch == 1: + return _tree[name], _tree[name] + elif self.factorize_sketch == 2: + return _tree, _tree[name] + else: + 
raise NotImplementedError + + tree, target_tree = find_subtree(tree, "sql_where") + # cond? where, + if "where" in target_tree: + result += ["WHERE", self.unparse_cond(target_tree["where"])] + + tree, target_tree = find_subtree(tree, "sql_groupby") + # col_unit* group_by, + if "group_by" in target_tree: + result += ["GROUP BY", ", ".join(self.unparse_col_unit(c) for c in target_tree["group_by"])] + + tree, target_tree = find_subtree(tree, "sql_orderby") + # order_by? order_by, + if "order_by" in target_tree: + result.append(self.unparse_order_by(target_tree["order_by"])) + + tree, target_tree = find_subtree(tree, "sql_groupby") + # cond? having, + if "having" in target_tree: + having_block = self.unparse_cond(target_tree["having"]).split(" ") + if having_block[0] == "*": # 没有 agg + logging.info("post process: adding count() for having statement") + having_block[0] = "count(*)" + result += ["HAVING", " ".join(having_block)] + + tree, target_tree = find_subtree(tree, "sql_orderby") + # int? limit, + if "limit" in target_tree: + limit_index = int(target_tree["limit"]) + limit_value = "0" + if limit_index < len(self.value_list): + limit_value = self.value_list[limit_index] + if limit_value == "value": + limit_value = "1" + if limit_value.isdigit() and limit_value != "0": # 0表示没有limit + result += ["LIMIT", str(limit_value)] + + tree, target_tree = find_subtree(tree, "sql_ieu") + # sql? intersect, + if "intersect" in target_tree: + result += ["INTERSECT", self.unparse_sql(target_tree["intersect"])] + # sql? except, + if "except" in target_tree: + result += ["EXCEPT", self.unparse_sql(target_tree["except"])] + # sql? union + if "union" in target_tree: + result += ["UNION", self.unparse_sql(target_tree["union"])] + + return " ".join(result) + + def unparse_select(self, select, is_row_calc=False): + tokens = ["SELECT"] + tokens.append(", ".join(self.unparse_agg(agg, is_row_calc) for agg in select.get("aggs", []))) + return " ".join(tokens) + + def unparse_agg(self, agg, is_row_calc=False): + unparsed_val_unit = self.unparse_val_unit(agg["val_unit"], is_row_calc) + agg_type = agg["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return unparsed_val_unit + else: + return f"{agg_type}({unparsed_val_unit})" + + def unparse_from(self, from_, is_row_calc=False): + if "conds" in from_: + all_conds, keywords = self.linearize_cond(from_["conds"]) + else: + all_conds, keywords = [], [] + assert all(keyword == "And" for keyword in keywords) + + cond_indices_by_table = collections.defaultdict(set) + tables_involved_by_cond_idx = collections.defaultdict(set) + for i, cond in enumerate(all_conds): + for column in self.ast_wrapper.find_all_descendants_of_type(cond, "column"): + if type(column) is dict: + column = column["col_id"] + table = self.schema.columns[column].table + if table is None: + continue + cond_indices_by_table[table.id].add(i) + tables_involved_by_cond_idx[i].add(table.id) + + output_table_ids = set() + output_cond_indices = set() + tokens = ["FROM"] + for i, table_unit in enumerate(from_.get("table_units", [])): + if i > 0: + if not is_row_calc: + tokens += ["JOIN"] + else: + tokens += [","] + + if table_unit["_type"] == "TableUnitSql": + tokens.append(f'({self.unparse_sql(table_unit["s"])})') + if is_row_calc: # 行计算SQL的别名 + tokens.append(["a", "b", "c"][i]) + elif table_unit["_type"] == "Table": + table_id = table_unit["table_id"] + tokens += [self.schema.tables[table_id].orig_name] + output_table_ids.add(table_id) + + # Output "ON " if all tables involved in the condition have been output + 
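+                # Each join condition is emitted at most once (tracked via output_cond_indices),
+                # and only after every table it references is already contained in output_table_ids.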
conds_to_output = [] + for cond_idx in sorted(cond_indices_by_table[table_id]): + if cond_idx in output_cond_indices: + continue + if tables_involved_by_cond_idx[cond_idx] <= output_table_ids: + conds_to_output.append(all_conds[cond_idx]) + output_cond_indices.add(cond_idx) + if conds_to_output: + tokens += ["ON"] + tokens += list(intersperse("AND", (self.unparse_cond(cond) for cond in conds_to_output))) + return " ".join(tokens) + + def unparse_order_by(self, order_by): + return f'ORDER BY {", ".join(self.unparse_agg(v) for v in order_by["aggs"])} {order_by["order"]["_type"]}' + + +if __name__ == "__main__": + """run some simple test cases""" + dusql_lang = DuSQLLanguageV2("conf/DuSQL.asdl") diff --git a/examples/text_to_sql/RAT-SQL/text2sql/grammars/nl2sql.py b/examples/text_to_sql/RAT-SQL/text2sql/grammars/nl2sql.py new file mode 100644 index 0000000000000000000000000000000000000000..b12592472721ab1069b58d1448a77e515f132ce9 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/grammars/nl2sql.py @@ -0,0 +1,552 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import itertools +import logging + +import asdl +import attr +import networkx as nx +from text2sql.utils import ast_util + + +def bimap(first, second): + return {f: s for f, s in zip(first, second)}, {s: f for f, s in zip(first, second)} + + +def filter_nones(d): + return {k: v for k, v in d.items() if v is not None and v != []} + + +def join(iterable, delimiter): + it = iter(iterable) + yield next(it) + for x in it: + yield delimiter + yield x + + +def intersperse(delimiter, seq): + return itertools.islice(itertools.chain.from_iterable(zip(itertools.repeat(delimiter), seq)), 1, None) + + +class NL2SQLLanguage(object): + root_type = "sql" + + def __init__( + self, + asdl_file, + output_from=True, # < changed + use_table_pointer=True, # < changed + include_literals=False, # < changed + include_columns=True, + end_with_from=True, # < changed + clause_order=None, + infer_from_conditions=True, # < changed + factorize_sketch=2, + ): + + # collect pointers and checkers + self.pointers = set(["table", "column", "value"]) + custom_primitive_type_checkers = {} + custom_primitive_type_checkers["table"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["column"] = lambda x: isinstance(x, int) + custom_primitive_type_checkers["value"] = lambda x: isinstance(x, int) + + self.include_columns = include_columns + # create ast wrapper + self.factorize_sketch = factorize_sketch + self.ast_wrapper = ast_util.ASTWrapper( + asdl.parse(asdl_file), custom_primitive_type_checkers=custom_primitive_type_checkers + ) + + # from field + self.output_from = output_from + self.end_with_from = end_with_from + self.clause_order = clause_order + self.infer_from_conditions = infer_from_conditions + if self.clause_order: + # clause order is prioritized over configurations like end_with_from + assert factorize_sketch == 2 # TODO support other grammars + sql_fields = 
self.ast_wrapper.product_types["sql"].fields + letter2field = {k: v for k, v in zip("SFWGOI", sql_fields)} + new_sql_fields = [letter2field[k] for k in self.clause_order] + self.ast_wrapper.product_types["sql"].fields = new_sql_fields + else: + if not self.output_from: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + del sql_fields[1] + else: + sql_fields = self.ast_wrapper.product_types["sql"].fields + assert sql_fields[1].name == "from" + if self.end_with_from: + sql_fields.append(sql_fields[1]) + del sql_fields[1] + + def parse(self, code, section): + return self.parse_sql(code) + + def unparse(self, tree, db, value_list): + unparser = NL2SQLUnparser(self.ast_wrapper, db, value_list, self.factorize_sketch) + return unparser.unparse_sql(tree) + + @classmethod + def tokenize_field_value(cls, field_value): + if isinstance(field_value, bytes): + field_value_str = field_value.encode("latin1") + elif isinstance(field_value, str): + field_value_str = field_value + else: + field_value_str = str(field_value) + if field_value_str[0] == '"' and field_value_str[-1] == '"': + field_value_str = field_value_str[1:-1] + # TODO: Get rid of surrounding quotes + return [field_value_str] + + def build_val_unit(self, col_id): + result = { + "_type": self.UNIT_TYPES_F[0], + "col_unit1": {"_type": "col_unit", "agg_id": {"_type": self.AGG_TYPES_F[0]}, "col_id": col_id}, + } + return result + + def parse_sql(self, sql, optional=False): + return filter_nones( + { + "_type": "sql", + "select": self.parse_select(sql["sel"], sql["agg"]), + **( + {"from": {"_type": "from", "table_units": [{"_type": "Table", "table_id": 0}]}} + if self.output_from + else {} + ), + "sql_where": filter_nones( + { + "_type": "sql_where", + "where": self.parse_cond(sql["conds"], ["", "and", "or"][sql["cond_conn_op"]]), + } + ), + "sql_groupby": filter_nones( + { + "_type": "sql_groupby", + "group_by": [], + "having": None, + } + ), + "sql_orderby": filter_nones( + { + "_type": "sql_orderby", + "order_by": None, + "limit": None, + } + ), + "sql_ieu": filter_nones( + { + "_type": "sql_ieu", + "intersect": None, + "except": None, + "union": None, + } + ), + } + ) + + def parse_select(self, select, aggs): + return { + "_type": "select", + "aggs": [self.parse_agg(col, agg) for col, agg in zip(select, aggs)], + } + + def parse_agg(self, col, agg): + return { + "_type": "agg", + "agg_id": {"_type": self.AGG_TYPES_F[agg]}, + "val_unit": self.build_val_unit(col), + } + + def parse_cond(self, conds, cond_conn_op): + if len(conds) > 1: + return { + "_type": self.LOGIC_OPERATORS_F[cond_conn_op], + "left": self.parse_cond(conds[:1], cond_conn_op), + "right": self.parse_cond(conds[1:], cond_conn_op), + } + + agg_id = 0 + col_id, op_id, val1 = conds[0] + result = { + "_type": self.COND_TYPES_F[op_id], + "agg_id": {"_type": self.AGG_TYPES_F[agg_id]}, + "val_unit": self.build_val_unit(col_id), + "val1": self.parse_val(val1), + } + return result + + def parse_val(self, val): + return { + "_type": "Value", + "val_id": val, + } + + COND_TYPES_F, COND_TYPES_B = bimap( + # Spider: (not, between, =, >, <, >=, <=, !=, in, like, is, exists) + # RAT: (None, Between, Eq, Gt, Lt, Ge, Le, Ne, In, Like, Is, Exists) + # DuSQL: (NotIn, Between, Eq, Gt, Lt, Ge, Le, Ne, In, Like) + # DIFF: Spider&DuSQL + range(0, 6), + ("Gt", "Lt", "Eq", "Ne", "Ge", "Le"), + ) + + UNIT_TYPES_F, UNIT_TYPES_B = bimap( + # ('none', '-', '+', '*', '/'), + range(5), + ("Column", "Minus", "Plus", "Times", "Divide"), + ) + + AGG_TYPES_F, 
AGG_TYPES_B = bimap(range(6), ("NoneAggOp", "Avg", "Max", "Min", "Count", "Sum")) + + ORDERS_F, ORDERS_B = bimap(("asc", "desc"), ("Asc", "Desc")) + + LOGIC_OPERATORS_F, LOGIC_OPERATORS_B = bimap(("and", "or"), ("And", "Or")) + + +@attr.s +class NL2SQLUnparser: + ast_wrapper = attr.ib() + schema = attr.ib() + value_list = attr.ib() + factorize_sketch = attr.ib(default=0) + + UNIT_TYPES_B = { + "Minus": "-", + "Plus": "+", + "Times": "*", + "Divide": "/", + } + COND_TYPES_B = { + "Between": "BETWEEN", + "Eq": "=", + "Gt": ">", + "Lt": "<", + "Ge": ">=", + "Le": "<=", + "Ne": "!=", + "In": "IN", + "NotIn": "NOT IN", + "Like": "LIKE", + } + + @classmethod + def conjoin_conds(cls, conds): + if not conds: + return None + if len(conds) == 1: + return conds[0] + return {"_type": "And", "left": conds[0], "right": cls.conjoin_conds(conds[1:])} + + @classmethod + def linearize_cond(cls, cond): + if cond["_type"] in ("And", "Or"): + conds, keywords = cls.linearize_cond(cond["right"]) + return [cond["left"]] + conds, [cond["_type"]] + keywords + else: + return [cond], [] + + def unparse_val(self, val): + if val["_type"] == "Value": + value_index = int(val["val_id"]) + if value_index >= len(self.value_list): + value_index = 0 + return f'"{self.value_list[value_index]}"' + if val["_type"] == "ValSql": + return f'({self.unparse_sql(val["s"])})' + if val["_type"] == "ColUnit": + column = self.schema.columns[val["col_id"]] + return f"{column.table.orig_name}.{column.orig_name}" + + def unparse_col_unit(self, col_unit, alias_table_name=None): + if "col_id" in col_unit: + column = self.schema.columns[col_unit["col_id"]] + if alias_table_name is not None: + column_name = f"{alias_table_name}.{column.orig_name}" + elif column.table is not None: + column_name = f"{column.table.orig_name}.{column.orig_name}" + else: + column_name = column.orig_name + else: + column_name = "some_col" + + agg_type = col_unit["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return column_name + else: + return f"{agg_type}({column_name})" + + def unparse_val_unit(self, val_unit, is_row_calc=False): + if val_unit["_type"] == "Column": + return self.unparse_col_unit(val_unit["col_unit1"]) + col1 = self.unparse_col_unit(val_unit["col_unit1"], alias_table_name="a" if is_row_calc else None) + col2 = self.unparse_col_unit(val_unit["col_unit2"], alias_table_name="b" if is_row_calc else None) + calc_op = self.UNIT_TYPES_B[val_unit["_type"]] + # TODO: DuSQL const col + if col1 == col2 and calc_op == "-": + col1 = "TIME_NOW" + return f"{col1} {calc_op} {col2}" + + # def unparse_table_unit(self, table_unit): + # raise NotImplementedError + + def unparse_cond(self, cond, negated=False): + """ + Args: + cond: + { + "_type": "Ne", + "agg_id": { + "_type": "NoneAggOp" + }, + "val_unit": { + "_type": "Column", + "col_unit1": { + "_type": "col_unit", + "agg_id": { + "_type": "NoneAggOp" + }, + "col_id": 11, + } + }, + "val1": { + "_type": "Value", + "val_id": 0 + } + } + """ + if cond["_type"] == "And": + assert not negated + return f'{self.unparse_cond(cond["left"])} AND {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Or": + assert not negated + return f'{self.unparse_cond(cond["left"])} OR {self.unparse_cond(cond["right"])}' + elif cond["_type"] == "Not": + return self.unparse_cond(cond["c"], negated=True) + elif cond["_type"] == "Between": + tokens = [self.unparse_val_unit(cond["val_unit"])] + if negated: + tokens.append("NOT") + tokens += [ + "BETWEEN", + self.unparse_val(cond["val1"]), + "AND", + self.unparse_val(cond["val2"]), + 
] + return " ".join(tokens) + tokens = [self.unparse_val_unit(cond["val_unit"])] + if cond["agg_id"]["_type"] != "NoneAggOp": + agg = cond["agg_id"]["_type"] + tokens[0] = f"{agg}({tokens[0]})" + if negated: + tokens.append("NOT") + tokens += [self.COND_TYPES_B[cond["_type"]], self.unparse_val(cond["val1"])] + return " ".join(tokens) + + def refine_from(self, tree): + """ + 1) Inferring tables from columns predicted + 2) Mix them with the predicted tables if any + 3) Inferring conditions based on tables + + Returns: bool + True: row calculation + False: not row calculation + """ + + # nested query in from clause, recursively use the refinement + if "from" in tree and tree["from"]["table_units"][0]["_type"] == "TableUnitSql": + for table_unit in tree["from"]["table_units"]: + # during natural decoding, all of FROM is sub-sql or ordinary table_id + if "s" not in table_unit: + logging.warning("error tree in FROM clause: %s", str(tree)) + continue + subquery_tree = table_unit["s"] + self.refine_from(subquery_tree) + return len(tree["from"]["table_units"]) == 2 # row calculation + + # get predicted tables + predicted_from_table_ids = set() + if "from" in tree: + table_unit_set = [] + for table_unit in tree["from"]["table_units"]: + if "table_id" in table_unit and table_unit["table_id"] not in predicted_from_table_ids: + predicted_from_table_ids.add(table_unit["table_id"]) + table_unit_set.append(table_unit) + tree["from"]["table_units"] = table_unit_set # remove duplicate + + # Get all candidate columns + candidate_column_ids = set( + self.ast_wrapper.find_all_descendants_of_type(tree, "column", lambda field: field.type != "sql") + ) + candidate_columns = [self.schema.columns[i] for i in candidate_column_ids] + must_in_from_table_ids = set(column.table.id for column in candidate_columns if column.table is not None) + + # Table the union of inferred and predicted tables + all_from_table_ids = must_in_from_table_ids.union(predicted_from_table_ids) + if not all_from_table_ids: + # TODO: better heuristic e.g., tables that have exact match + all_from_table_ids = {0} + + covered_tables = set() + candidate_table_ids = sorted(all_from_table_ids) + start_table_id = candidate_table_ids[0] + conds = [] + for table_id in candidate_table_ids[1:]: + if table_id in covered_tables: + continue + try: + path = nx.shortest_path(self.schema.foreign_key_graph, source=start_table_id, target=table_id) + except (nx.NetworkXNoPath, nx.NodeNotFound): + covered_tables.add(table_id) + continue + + for source_table_id, target_table_id in zip(path, path[1:]): + if target_table_id in covered_tables: + continue + all_from_table_ids.add(target_table_id) + col1, col2 = self.schema.foreign_key_graph[source_table_id][target_table_id]["columns"] + conds.append( + { + "_type": "Eq", + "agg_id": {"_type": NL2SQLLanguage.AGG_TYPES_F[0]}, + "val_unit": { + "_type": "Column", + "col_unit1": { + "_type": "col_unit", + "agg_id": {"_type": "NoneAggOp"}, + "col_id": col1, + }, + }, + "val1": { + "_type": "ColUnit", + "col_id": col2, + }, + } + ) + table_units = [{"_type": "Table", "table_id": i} for i in sorted(all_from_table_ids)] + + tree["from"] = { + "_type": "from", + "table_units": table_units, + } + cond_node = self.conjoin_conds(conds) + if cond_node is not None: + tree["from"]["conds"] = cond_node + return False + + def unparse_sql(self, tree): + is_row_calc = self.refine_from(tree) + + result = [ + # select select, + self.unparse_select(tree["select"], is_row_calc), + # from from, + self.unparse_from(tree["from"], is_row_calc), + 
] + + def find_subtree(_tree, name): + if self.factorize_sketch == 0: + return _tree, _tree + elif name in _tree: + if self.factorize_sketch == 1: + return _tree[name], _tree[name] + elif self.factorize_sketch == 2: + return _tree, _tree[name] + else: + raise NotImplementedError + + tree, target_tree = find_subtree(tree, "sql_where") + # cond? where, + if "where" in target_tree: + result += ["WHERE", self.unparse_cond(target_tree["where"])] + + return " ".join(result) + + def unparse_select(self, select, is_row_calc=False): + tokens = ["SELECT"] + tokens.append(", ".join(self.unparse_agg(agg, is_row_calc) for agg in select.get("aggs", []))) + return " ".join(tokens) + + def unparse_agg(self, agg, is_row_calc=False): + unparsed_val_unit = self.unparse_val_unit(agg["val_unit"], is_row_calc) + agg_type = agg["agg_id"]["_type"] + if agg_type == "NoneAggOp": + return unparsed_val_unit + else: + return f"{agg_type}({unparsed_val_unit})" + + def unparse_from(self, from_, is_row_calc=False): + if "conds" in from_: + all_conds, keywords = self.linearize_cond(from_["conds"]) + else: + all_conds, keywords = [], [] + assert all(keyword == "And" for keyword in keywords) + + cond_indices_by_table = collections.defaultdict(set) + tables_involved_by_cond_idx = collections.defaultdict(set) + for i, cond in enumerate(all_conds): + for column in self.ast_wrapper.find_all_descendants_of_type(cond, "column"): + if type(column) is dict: + column = column["col_id"] + table = self.schema.columns[column].table + if table is None: + continue + cond_indices_by_table[table.id].add(i) + tables_involved_by_cond_idx[i].add(table.id) + + output_table_ids = set() + output_cond_indices = set() + tokens = ["FROM"] + for i, table_unit in enumerate(from_.get("table_units", [])): + if i > 0: + if not is_row_calc: + tokens += ["JOIN"] + else: + tokens += [","] + + if table_unit["_type"] == "TableUnitSql": + tokens.append(f'({self.unparse_sql(table_unit["s"])})') + if is_row_calc: # 行计算SQL的别名 + tokens.append(["a", "b", "c"][i]) + elif table_unit["_type"] == "Table": + table_id = table_unit["table_id"] + tokens += [self.schema.tables[table_id].orig_name] + output_table_ids.add(table_id) + + # Output "ON " if all tables involved in the condition have been output + conds_to_output = [] + for cond_idx in sorted(cond_indices_by_table[table_id]): + if cond_idx in output_cond_indices: + continue + if tables_involved_by_cond_idx[cond_idx] <= output_table_ids: + conds_to_output.append(all_conds[cond_idx]) + output_cond_indices.add(cond_idx) + if conds_to_output: + tokens += ["ON"] + tokens += list(intersperse("AND", (self.unparse_cond(cond) for cond in conds_to_output))) + return " ".join(tokens) + + +if __name__ == "__main__": + """run some simple test cases""" + dusql_lang = NL2SQLLanguage("conf/DuSQL.asdl") diff --git a/examples/text_to_sql/RAT-SQL/text2sql/io.py b/examples/text_to_sql/RAT-SQL/text2sql/io.py new file mode 100644 index 0000000000000000000000000000000000000000..65dd2889c51579f5e8d09fd72cd7c3b19e6d79e6 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/io.py @@ -0,0 +1,45 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import logging +import os +import traceback + +import paddle + + +def init_ernie_model(model_class, model_dir): + """init ernie model from static graph checkpoint""" + with open(os.path.join(model_dir, "ernie_config.json")) as ifs: + config = json.load(ifs) + + state = paddle.static.load_program_state(os.path.join(model_dir, "params")) + ernie = model_class(config, name="") + ernie.set_dict(state, use_structured_name=False) + return ernie, config["hidden_size"] + + +def save(model, optimizer, save_path): + try: + paddle.save(model.state_dict(), save_path + ".pdparams") + paddle.save(optimizer.state_dict(), save_path + ".pdopt") + except Exception: + logging.error("save model and optimizer failed. save path: %s", save_path) + logging.error(traceback.format_exc()) + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/launch/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/launch/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..136a074e0fbc89edcfe6c812e1383bfc80cf6dbb --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/launch/__init__.py @@ -0,0 +1,17 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from . import trainer +from . import infer +from . import eval diff --git a/examples/text_to_sql/RAT-SQL/text2sql/launch/eval.py b/examples/text_to_sql/RAT-SQL/text2sql/launch/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..dd6a9806ccc437e1d1df920b2b60adfdb56c98f7 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/launch/eval.py @@ -0,0 +1,44 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +from text2sql.utils import metrics + + +def evaluate(model, dataset, infer_results, name="DuSQL", eval_value=True): + if name.lower() == "dusql": + metric = metrics.MetricDuSQLAcc(dataset, eval_value=eval_value) + else: + raise RuntimeError(f"only supports name DuSQL. 
but got {name}") + + for idx, line in enumerate(infer_results): + qid, pred_query, db_id, detail_result = line.strip().split("\t") + dct_result = json.loads(detail_result) + qid = dct_result["question_id"] + metric.update(dataset.get_by_qid(qid)[0], pred_query) + + eval_result = metric.finalize() + print("evaluating result:", json.dumps(eval_result["total_scores"], indent=4)) + with open("output/debug.json", "w") as ofs: + import random + + random.shuffle(eval_result["per_item"]) + json.dump(eval_result["per_item"], ofs, indent=4, ensure_ascii=False) + return eval_result + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/launch/infer.py b/examples/text_to_sql/RAT-SQL/text2sql/launch/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..4304c2364febf1dd2244dd893314cc997ce0dc66 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/launch/infer.py @@ -0,0 +1,135 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os + +import numpy as np +import paddle +import tqdm +from text2sql.models import beam_search, sql_beam_search + + +def inference( + model, data, output_path, beam_size=1, mode="infer", output_history=True, use_heuristic=True, model_name="seq2tree" +): + model.eval() + os.makedirs(os.path.dirname(output_path), exist_ok=True) + with paddle.no_grad(), open(output_path, "w") as ofs: + if mode == "infer": + _do_infer(model, data, beam_size, output_history, ofs, use_heuristic, model_name) + elif mode == "debug": + _debug(model, data, ofs) + + +def _do_infer(model, data, beam_size, output_history, ofs, use_heuristic=True, model_name="seq2tree"): + for i, (inputs, labels) in enumerate(tqdm.tqdm(data())): + if model_name.startswith("seq2tree"): + decoded = _infer_one(model, inputs, beam_size, output_history, use_heuristic, labels) + else: + decoded = _infer_general(model, inputs, labels) + db_id = inputs["orig_inputs"][0].db.db_id + question_id = inputs["orig_inputs"][0].question_id + question = inputs["orig_inputs"][0].question + gold_query = labels[0].orig_code if labels is not None and labels[0] is not None else "" + values = inputs["orig_inputs"][0].values + if len(decoded) == 0: + pred_query = "select *" + else: + pred_query = decoded[0]["pred_query"] + lst_output = [ + question_id, + pred_query, + db_id, + json.dumps( + { + "db_id": db_id, + "question_id": question_id, + "question": question, + "gold_query": gold_query, + "values": values, + "beams": decoded, + }, + ensure_ascii=False, + ), + ] + ofs.write("\t".join(lst_output) + "\n") + ofs.flush() + + +def _infer_one(model, inputs, beam_size, output_history=False, use_heuristic=True, labels=None): + """inference one example""" + if use_heuristic: + # TODO: from_cond should be true from non-bert model + beams = sql_beam_search.beam_search_with_heuristics( + model, inputs, beam_size=beam_size, max_steps=1000, from_cond=False + ) + else: + beams = 
beam_search.beam_search(model, inputs, beam_size=beam_size, max_steps=1000) + decoded = [] + for beam in beams: + model_output, inferred_code = beam.inference_state.finalize() + + decoded.append( + { + "pred_query": inferred_code, + "model_output": model_output, + "score": beam.score, + **( + { + "choice_history": beam.choice_history, + "score_history": beam.score_history, + } + if output_history + else {} + ), + } + ) + return decoded + + +def _infer_general(model, inputs, labels=None): + output = model(inputs) + sel_num = np.argmax(output.sel_num.numpy()).item() + # labels[0].sel_num, labels[0].sel_col + pred_sel_col = output.sel_col[0].numpy() + col_ids = list(zip(range(pred_sel_col.shape[1]), pred_sel_col.tolist()[0])) + sorted_col = sorted(col_ids, key=lambda x: x[1], reverse=True) + pred_cols = list(sorted(sorted_col[:sel_num], key=lambda x: x[0])) + gold_cols = [] + for cid, label in enumerate(labels[0].sel_col): + if label == 1: + gold_cols.append(cid) + + return {"sel_num": (sel_num, labels[0].sel_num), "sel_col": ([x[0] for x in pred_cols], gold_cols)} + + +def _debug(model, data, ofs): + for i, item in enumerate(tqdm.tqdm(data)): + ((_, history),) = model.compute_loss([item], debug=True) + ofs.write( + json.dumps( + { + "index": i, + "history": history, + } + ) + + "\n" + ) + ofs.flush() + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/launch/trainer.py b/examples/text_to_sql/RAT-SQL/text2sql/launch/trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..9c0e02f8edabcac8b763c5deb4488012b510578b --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/launch/trainer.py @@ -0,0 +1,107 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os +import traceback +from pathlib import Path + +from text2sql import io, utils +from text2sql.launch import infer + + +def log_train_step(epoch, batch, steps_loss, cost_time): + if len(steps_loss) == 0: + return + + logging.info( + f"[train] epoch {epoch}, batch {batch}. " + + f"loss is {sum(steps_loss) / len(steps_loss):.10f}. 
" + + f"cost {cost_time:.2f}s" + ) + steps_loss.clear() + + +def epoch_train(config, model, optimizer, epoch, train_data, is_debug=False): + model.train() + + total_loss = 0 + steps_loss = [] + timer = utils.Timer() + batch_id = 1 + for batch_id, (inputs, labels) in enumerate(train_data(), start=1): + loss = model(inputs, labels) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + if type(optimizer._learning_rate) is not float: + optimizer._learning_rate.step() + + total_loss += loss.item() + steps_loss.append(loss.item()) + if batch_id % config.train.log_steps == 0 or is_debug: + log_train_step(epoch, batch_id, steps_loss, timer.interval()) + log_train_step(epoch, batch_id, steps_loss, timer.interval()) + + return total_loss / batch_id + + +def _eval_during_train(model, data, epoch, output_root): + if epoch in [1, 2, 3, 4] + [6, 7, 9, 10, 11, 13, 14, 16, 17, 19] + list(range(21, 100, 2)): + return 0, epoch + model.eval() + try: + output = Path(output_root) / "infer_result" / f"{data.name}.infer_epoch{epoch:03d}.sql" + infer.inference(model, data, output) + except OSError: + traceback.print_exc() + logging.error(traceback.format_exc()) + return 0, epoch + + mean_loss = 0 + return mean_loss, epoch + + +def train(config, model, optimizer, epochs, train_data, dev_data, test_data=None): + best_acc = -1e10 + best_epoch = 0 + timer = utils.Timer() + for epoch in range(1, epochs + 1): + loss = epoch_train(config, model, optimizer, epoch, train_data, config.general.is_debug) + cost_time = timer.interval() + logging.info(f"[train] epoch {epoch}/{epochs} loss is {loss:.6f}, cost {cost_time:.2f}s.") + + dev_loss, dev_acc = _eval_during_train(model, dev_data, epoch, config.data.output) + log_str = f"[eval] dev loss {dev_loss:.6f}, acc {dev_acc:.4f}." + if test_data is not None: + test_loss, test_acc = _eval_during_train(model, test_data, epoch, config.data.output) + log_str += f" test loss {test_loss:.6f}, acc {test_acc:.4f}." + + if dev_acc > best_acc: + best_acc, best_epoch = dev_acc, epoch + save_path = os.path.join(config.data.output, f"epoch{epoch:03d}_acc{best_acc:.4f}", "model") + io.save(model, optimizer, save_path) + log_str += " got best and saved." + else: + log_str += f" best acc is {best_acc} on epoch {best_epoch}." + + cost_time = timer.interval() + log_str += f" cost [{cost_time:.2f}s]" + logging.info(log_str) + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..ccd17d06b7f18ad2dd870ee7de5c41ec3ec6ec7d --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from .enc_dec import EncDecModel diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/attention.py b/examples/text_to_sql/RAT-SQL/text2sql/models/attention.py new file mode 100644 index 0000000000000000000000000000000000000000..6f457530a6ba86f39114032d7430adc33805fb5f --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/attention.py @@ -0,0 +1,179 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math + +import numpy as np +import paddle +import paddle.nn.functional as F + + +def maybe_mask(attn, attn_mask): + if attn_mask is not None: + assert all( + a == 1 or b == 1 or a == b for a, b in zip(attn.shape[::-1], attn_mask.shape[::-1]) + ), f"Attention mask shape {attn_mask.shape} should be broadcastable with attention shape {attn.shape}" + + attn.data.masked_fill_(attn_mask, -float("inf")) + + +def attention(query, key, value, mask=None, dropout=None): + "Compute 'Scaled Dot Product Attention'" + d_k = query.shape[-1] + scores = paddle.matmul(query, key, transpose_y=True) / math.sqrt(d_k) + if mask is not None: + scores = scores.masked_fill(mask == 0, -1e9) + p_attn = F.softmax(scores, axis=-1) + if dropout is not None: + p_attn = dropout(p_attn) + # return paddle.matmul(p_attn, value), scores.squeeze(1).squeeze(1) + return paddle.matmul(p_attn, value), p_attn + + +class Attention(paddle.nn.Layer): + def __init__(self, pointer): + super().__init__() + self.pointer = pointer + self.softmax = paddle.nn.Softmax(axis=-1) + + def forward(self, query, values, attn_mask=None): + # query shape: batch x query_size + # values shape: batch x num values x value_size + + # attn_logits shape: batch x num values + attn_logits = self.pointer(query, values, attn_mask) + # attn_logits shape: batch x num values + attn = self.softmax(attn_logits) + # output shape: batch x 1 x value_size + output = paddle.bmm(attn.unsqueeze(1), values) + output = output.squeeze(1) + return output, attn + + +class ScaledDotProductPointer(paddle.nn.Layer): + def __init__(self, query_size, key_size): + super().__init__() + self.query_proj = paddle.nn.Linear(query_size, key_size) + self.temp = np.power(key_size, 0.5) + + def forward(self, query, keys, attn_mask=None): + # query shape: batch x query_size + # keys shape: batch x num keys x key_size + + # proj_query shape: batch x key_size x 1 + proj_query = self.query_proj(query).unsqueeze(2) + + # attn_logits shape: batch x num keys + attn_logits = paddle.bmm(keys, proj_query).squeeze(2) / self.temp + maybe_mask(attn_logits, attn_mask) + return attn_logits + + +class ScaledDotProductAttention(Attention): + def __init__(self, query_size, value_size): + super().__init__(ScaledDotProductPointer(query_size, value_size)) + + +class BahdanauPointer(paddle.nn.Layer): + def __init__(self, query_size, key_size, proj_size): + super().__init__() + self.compute_scores = paddle.nn.Sequential( + paddle.nn.Linear(query_size + key_size, proj_size), paddle.nn.Tanh(), paddle.nn.Linear(proj_size, 1) + ) + + def 
forward(self, query: paddle.Tensor, keys: paddle.Tensor, attn_mask=None): + # query shape: batch x query_size + # keys shape: batch x num keys x key_size + + # query_expanded shape: batch x num keys x query_size + query_expanded = query.unsqueeze(1).expand([query.shape[0], keys.shape[1], query.shape[-1]]) + + # scores shape: batch x num keys x 1 + attn_logits = self.compute_scores( + # shape: batch x num keys x query_size + key_size + paddle.concat((query_expanded, keys), axis=2) + ) + # scores shape: batch x num keys + attn_logits = attn_logits.squeeze(2) + maybe_mask(attn_logits, attn_mask) + return attn_logits + + +class BahdanauAttention(Attention): + def __init__(self, query_size, value_size, proj_size): + super().__init__(BahdanauPointer(query_size, value_size, proj_size)) + + +# Adapted from The Annotated Transformers +class MultiHeadedAttention(paddle.nn.Layer): + def __init__(self, h, query_size, value_size, dropout=0.1): + super().__init__() + assert query_size % h == 0 + assert value_size % h == 0 + + # We assume d_v always equals d_k + self.d_k = value_size // h + self.h = h + + self.linears = paddle.nn.LayerList( + [ + paddle.nn.Linear(query_size, value_size), + paddle.nn.Linear(value_size, value_size), + paddle.nn.Linear(value_size, value_size), + paddle.nn.Linear(value_size, value_size), + ] + ) + + self.attn = None + self.dropout = paddle.nn.Dropout(p=dropout) + + def forward(self, query, values, attn_mask=None): + "Implements Figure 2" + if attn_mask is not None: + # Same mask applied to all h heads. + attn_mask = attn_mask.unsqueeze(1) + nbatches = query.shape[0] + + # 1) Do all the linear projections in batch from d_model => h x d_k + query, keys, values = [ + l(x).reshape([nbatches, -1, self.h, self.d_k]).transpose([0, 2, 1, 3]) + for l, x in zip(self.linears, (query, values, values)) + ] + + # 2) Apply attention on all the projected vectors in batch. + # x, self.attn = transformer.sparse_attention( + x, self.attn = attention(query, keys, values, mask=attn_mask, dropout=self.dropout) + + # 3) "Concat" using a view and apply a final linear. + x = x.transpose([0, 2, 1, 3]).reshape([nbatches, -1, self.h * self.d_k]) + x = x.squeeze(1) + return self.linears[3](x), self.attn + + +if __name__ == "__main__": + """run some simple test cases""" + sdpp = ScaledDotProductPointer(query_size=8, key_size=16) + sdpa = ScaledDotProductAttention(query_size=8, value_size=16) + bp = BahdanauPointer(query_size=8, key_size=16, proj_size=12) + ba = BahdanauAttention(query_size=8, value_size=16, proj_size=12) + mha = MultiHeadedAttention(h=2, query_size=8, value_size=16) + + q = paddle.to_tensor(list(range(1, 9)), dtype="float32").reshape([1, 8]) + v = paddle.to_tensor(list(range(1, 17)), dtype="float32").reshape([1, 1, 16]) + + print(sdpp(q, v)) + print(sdpa(q, v)) + print(bp(q, v)) + print(ba(q, v)) + print(mha(q, v)) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/beam_search.py b/examples/text_to_sql/RAT-SQL/text2sql/models/beam_search.py new file mode 100644 index 0000000000000000000000000000000000000000..c108a96df2ec66611d33bfa388524b17366d99b4 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/beam_search.py @@ -0,0 +1,81 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import operator + +import attr + + +@attr.s +class Hypothesis: + inference_state = attr.ib() + next_choices = attr.ib() + score = attr.ib(default=0) + + choice_history = attr.ib(factory=list) + score_history = attr.ib(factory=list) + + +def beam_search(model, orig_item, preproc_item, beam_size, max_steps): + inference_state, next_choices = model.begin_inference(orig_item, preproc_item) + beam = [Hypothesis(inference_state, next_choices)] + finished = [] + + for step in range(max_steps): + # Check if all beams are finished + if len(finished) == beam_size: + break + + candidates = [] + + # For each hypothesis, get possible expansions + # Score each expansion + for hyp in beam: + candidates += [ + (hyp, choice, choice_score.item(), hyp.score + choice_score.item()) + for choice, choice_score in hyp.next_choices + ] + + # Keep the top K expansions + candidates.sort(key=operator.itemgetter(3), reverse=True) + candidates = candidates[: beam_size - len(finished)] + + # Create the new hypotheses from the expansions + beam = [] + for hyp, choice, choice_score, cum_score in candidates: + inference_state = hyp.inference_state.clone() + next_choices = inference_state.step(choice) + if next_choices is None: + finished.append( + Hypothesis( + inference_state, + None, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + ) + ) + else: + beam.append( + Hypothesis( + inference_state, + next_choices, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + ) + ) + + finished.sort(key=operator.attrgetter("score"), reverse=True) + return finished diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/enc_dec.py b/examples/text_to_sql/RAT-SQL/text2sql/models/enc_dec.py new file mode 100644 index 0000000000000000000000000000000000000000..7ae4f7d50d39b86e0e8837924e3f5fcc4103377b --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/enc_dec.py @@ -0,0 +1,63 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
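+# EncDecModel wires the relation-aware encoder (encoder_v2.Text2SQLEncoderV2) to the
+# Text2SQLDecoder from text2sql.models.sql_decoder. In training mode, forward(inputs, labels)
+# returns the mean decoding loss over the batch; in inference mode,
+# forward(inputs, db=db, is_train=False) decodes the first encoded example together with
+# its candidate values.
+#
+# Rough usage sketch (config and label_encoder construction omitted; the names below are
+# illustrative only):
+#
+#     model = EncDecModel(config, label_encoder, model_version="v2")
+#     loss = model(inputs, labels)                       # training
+#     preds = model(inputs, db=db, is_train=False)       # inference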
+ +import paddle +from paddle import nn +from text2sql.models import encoder_v2 +from text2sql.models.sql_decoder import decoder as decoder_v2 + + +class EncDecModel(nn.Layer): + """Dygraph version of BoomUp Model""" + + def __init__(self, config, label_encoder, model_version="v2"): + super(EncDecModel, self).__init__() + + self._config = config + self._model_version = model_version + + assert model_version in ("v2",), "model_version only support v2" + self.encoder = encoder_v2.Text2SQLEncoderV2(config) + self.decoder = decoder_v2.Text2SQLDecoder( + label_encoder, dropout=0.2, desc_attn="mha", use_align_mat=True, use_align_loss=True + ) + + def forward(self, inputs, labels=None, db=None, is_train=True): + if is_train: + assert labels is not None, "labels should not be None while training" + return self._train(inputs, labels) + else: + assert db is not None, "db should not be None while inferencing" + return self._inference(inputs, db) + + def _train(self, inputs, labels): + enc_results = self.encoder(inputs) + lst_loss = [] + for orig_inputs, label_info, enc_result in zip(inputs["orig_inputs"], labels, enc_results): + loss = self.decoder.compute_loss(orig_inputs, label_info, enc_result) + lst_loss.append(loss) + + return paddle.mean(paddle.stack(lst_loss, axis=0), axis=0) + + def _inference(self, inputs, db): + enc_state = self.encoder(inputs) + if self._model_version == "v1": + return self.decoder.inference(enc_state[0], db) + elif self._model_version == "v2": + return self.decoder.inference(enc_state[0], db, inputs["orig_inputs"][0].values) + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/encoder_v2.py b/examples/text_to_sql/RAT-SQL/text2sql/models/encoder_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..0e336675115644cb6a46250a958b4f2cf59bc35a --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/encoder_v2.py @@ -0,0 +1,253 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
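+# Text2SQLEncoderV2 runs the serialized question/schema input through a pretrained BERT or
+# ERNIE model, gathers the hidden states of question tokens, tables, columns and candidate
+# values (nn_utils.batch_gather_2d), and refines them with relation-aware self-attention
+# layers (RelationAwareEncoder). It returns one EncoderState per example, carrying the shared
+# memory plus the memory-to-column/table/value alignment matrices consumed by the pointer
+# decoder.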
+ +import os + +import attr +import numpy as np +import paddle +from paddle import nn +from paddle.nn import functional as F +from text2sql.models import relational_encoder, relational_transformer +from text2sql.utils import linking_utils, nn_utils, utils + +from paddlenlp.transformers import BertModel, ErnieModel, ErniePretrainedModel + + +@attr.s +class EncoderState: + """Encoder state define""" + + state = attr.ib() + cls_hidden = attr.ib() + memory = attr.ib() + question_memory = attr.ib() + schema_memory = attr.ib() + words = attr.ib() + + pointer_memories = attr.ib() + pointer_maps = attr.ib() + + m2c_align_mat = attr.ib() + m2t_align_mat = attr.ib() + m2v_align_mat = attr.ib() + + def find_word_occurrences(self, word): + """find word occurrences""" + return [i for i, w in enumerate(self.words) if w == word] + + +class Text2SQLEncoderV2(nn.Layer): + """Encoder for text2sql model""" + + def __init__(self, model_config, extra=None): + super(Text2SQLEncoderV2, self).__init__() + self.enc_value_with_col = model_config.enc_value_with_col + + self.pretrain_model_type = model_config.pretrain_model_type + if model_config.pretrain_model_type == "BERT": + PretrainModel = BertModel + vocab_file = os.path.join( + os.path.expandvars("$HOME"), + ".paddlenlp/models", + model_config.pretrain_model, + model_config.pretrain_model + "-vocab.txt", + ) + args = {"vocab_size": utils.count_file_lines(vocab_file), "type_vocab_size": 2} + self.hidden_size = 768 + elif model_config.pretrain_model_type == "ERNIE": + PretrainModel = ErnieModel + ernie_config = ErniePretrainedModel.pretrained_init_configuration[model_config.pretrain_model] + # with open(Path(model_config.pretrain_model) / + # 'ernie_config.json') as ifs: + # ernie_config = json.load(ifs) + args = {"cfg": ernie_config} + self.hidden_size = ernie_config["hidden_size"] + else: + raise RuntimeError(f"unsupported pretrain model type: {model_config.pretrain_model_type}") + + if model_config.init_model_params is None: + self.base_encoder = PretrainModel.from_pretrained(model_config.pretrain_model) + else: + self.base_encoder = PretrainModel(**args["cfg"]) + # initializer = nn.initializer.TruncatedNormal(std=0.02) + self.rel_has_value = True + self.encs_update = relational_encoder.RelationAwareEncoder( + num_layers=model_config.rat_layers, + num_heads=model_config.rat_heads, + num_relations=len(linking_utils.RELATIONS), + hidden_size=self.hidden_size, + has_value=self.rel_has_value, + ) + if not self.rel_has_value: + self.value_align = relational_transformer.RelationalPointerNet( + hidden_size=self.hidden_size, num_relations=0 + ) + + self.include_in_memory = set(["question", "column", "table", "value"]) + + def forward(self, inputs): + """modeling forward stage of encoder""" + seq_hidden, cls_hidden = self.base_encoder(inputs["src_ids"], inputs["sent_ids"]) + if self.pretrain_model_type != "ERNIE" and self.pretrain_model_type != "BERT": + cls_hidden, seq_hidden = seq_hidden, cls_hidden + + question_tokens_index = inputs["question_tokens_index"] + table_indexes = inputs["table_indexes"] + column_indexes = inputs["column_indexes"] + value_indexes = inputs["value_indexes"] + + question_encs = nn_utils.batch_gather_2d(seq_hidden, question_tokens_index) + table_encs = nn_utils.batch_gather_2d(seq_hidden, table_indexes) + column_encs = nn_utils.batch_gather_2d(seq_hidden, column_indexes) + value_encs = nn_utils.batch_gather_2d(seq_hidden, value_indexes) + if self.enc_value_with_col: + value_num = value_encs.shape[1] // 2 + value_encs = 
value_encs.reshape([value_encs.shape[0], value_num, 2, -1]).sum(axis=2) + + orig_inputs = inputs["orig_inputs"] + column_pointer_maps = [{i: [i] for i in range(len(orig_input.columns))} for orig_input in orig_inputs] + table_pointer_maps = [{i: [i] for i in range(len(orig_input.tables))} for orig_input in orig_inputs] + value_pointer_maps = [{i: [i] for i in range(len(orig_input.values))} for orig_input in orig_inputs] + + enc_results = [] + # calculate relation encoding one-by-one + for batch_idx, orig_input in enumerate(orig_inputs): + q_len = orig_input.column_indexes[0] - 2 + col_size = len(orig_input.columns) + tab_size = len(orig_input.tables) + val_size = len(orig_input.values) + + q_enc = question_encs[batch_idx][:q_len] + tab_enc = table_encs[batch_idx][:tab_size] + col_enc = column_encs[batch_idx][:col_size] + val_enc = value_encs[batch_idx][:val_size] + + c_boundary = list(range(col_size + 1)) + t_boundary = list(range(tab_size + 1)) + + v_e_input = val_enc.unsqueeze(0) if self.rel_has_value else None + (q_enc_new, c_enc_new, t_enc_new, v_enc_new), align_mat = self.encs_update.forward_unbatched( + q_enc.unsqueeze(0), + col_enc.unsqueeze(0), + tab_enc.unsqueeze(0), + c_boundary, + t_boundary, + orig_input.relations, + v_e_input, + ) + + memory = [] + if "question" in self.include_in_memory: + memory.append(q_enc_new) + if "table" in self.include_in_memory: + memory.append(t_enc_new) + if "column" in self.include_in_memory: + memory.append(c_enc_new) + if "value" in self.include_in_memory and self.rel_has_value: + memory.append(v_enc_new) + memory = paddle.concat(memory, axis=1) + if not self.rel_has_value: + v_enc_new = val_enc.unsqueeze(0) + m2v_align_mat = self.value_align(memory, v_enc_new, relations=None) + align_mat[2] = m2v_align_mat + + schema_memory = (c_enc_new, t_enc_new) + if self.rel_has_value: + schema_memory += (v_enc_new,) + + enc_results.append( + EncoderState( + state=None, + cls_hidden=cls_hidden[batch_idx], + memory=memory, + question_memory=q_enc_new, + schema_memory=paddle.concat(schema_memory, axis=1), + words=orig_input.question_tokens, + pointer_memories={ + "table": t_enc_new, + "column": c_enc_new, + "value": v_enc_new, + }, + pointer_maps={ + "column": column_pointer_maps[batch_idx], + "table": table_pointer_maps[batch_idx], + "value": value_pointer_maps[batch_idx], + }, + m2c_align_mat=align_mat[0], + m2t_align_mat=align_mat[1], + m2v_align_mat=align_mat[2], + ) + ) + + return enc_results + + def span_encoder(self, cls_hidden, seq_hidden, span_index, span_tokens_index, span_tokens_mask, proj_fn=None): + """encode spans(like headers, table names) by sequence hidden states""" + batch_size, max_col_nums, max_col_tokens = span_tokens_index.shape + hidden_size = cls_hidden.shape[-1] + + # shape = [batch, max_col, hidden_size] + span_enc1 = nn_utils.batch_gather_2d(seq_hidden, span_index) + + token_gather_index = paddle.reshape(span_tokens_index, shape=[-1, max_col_nums * max_col_tokens]) + span_tokens_enc_origin = nn_utils.batch_gather_2d(seq_hidden, token_gather_index) + + span_tokens_weight = paddle.reshape( + paddle.matmul(span_tokens_enc_origin, paddle.unsqueeze(cls_hidden, [-1])), + [-1, max_col_nums, max_col_tokens], + ) + span_tokens_weight = F.softmax(nn_utils.sequence_mask(span_tokens_weight, span_tokens_mask), axis=-1) + + span_tokens_enc_origin = paddle.reshape( + span_tokens_enc_origin, [-1, max_col_nums, max_col_tokens, hidden_size] + ) + span_enc2 = paddle.sum(paddle.multiply(span_tokens_enc_origin, span_tokens_weight.unsqueeze([-1])), axis=2) 
+ + span_enc = paddle.concat([span_enc1, span_enc2], axis=-1) + if proj_fn is not None: + span_enc = proj_fn(span_enc) + return span_enc + + +if __name__ == "__main__": + """run some simple test cases""" + inputs = { + "src_ids": paddle.to_tensor(np.array([0, 1, 2, 3, 4, 5], dtype=np.int64).reshape([1, 6])), + "sent_ids": paddle.to_tensor(np.array([0, 1, 1, 1, 1, 1], dtype=np.int64).reshape([1, 6])), + "question_tokens_index": paddle.to_tensor(list(range(1, 5)), dtype="int64").reshape([1, 4]), + "column_index": paddle.to_tensor([1, 4], dtype="int64").reshape([1, 2]), + "column_mask": paddle.to_tensor([1, 1], dtype="float32").reshape([1, 2]), + "column_tokens_index": paddle.to_tensor([1, 2, 3, 4, 5, 0], dtype="int64").reshape([1, 2, 3]), + "column_tokens_mask": paddle.to_tensor([1, 1, 1, 1, 1, 0], dtype="float32").reshape([1, 2, 3]), + "table_index": paddle.to_tensor([1, 4], dtype="int64").reshape([1, 2]), + "table_mask": paddle.to_tensor([1, 1], dtype="float32").reshape([1, 2]), + "table_tokens_index": paddle.to_tensor([1, 2, 3, 4, 5, 0], dtype="int64").reshape([1, 2, 3]), + "table_tokens_mask": paddle.to_tensor([1, 1, 1, 1, 1, 0], dtype="float32").reshape([1, 2, 3]), + "limit_nums_index": paddle.to_tensor([1, 4], dtype="int64").reshape([1, 2]), + "limit_nums_mask": paddle.to_tensor([1, 1], dtype="float32").reshape([1, 2]), + "orig_inputs": [ + { + "columns": ["a", "b"], + "tables": ["t1", "t2"], + "question_tokens": ["a", "bc", "d"], + "span_lens": [[6], [1, 1], [1, 1]], + "relations": np.arange(8 * 8).reshape(8, 8), + } + ], + } + + # model = Text2SQLEncoder() + # outputs = model(inputs) + # print(outputs) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/relational_encoder.py b/examples/text_to_sql/RAT-SQL/text2sql/models/relational_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..adc3868d8d4bdf12664422b969990ec3145d7bd1 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/relational_encoder.py @@ -0,0 +1,98 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
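+
+# Module overview (informal): RelationAwareEncoder is a thin wrapper around
+# relational_transformer.RelationalTransformerEncoder. It concatenates the
+# question/column/table (and, when available, value) encodings, applies
+# relation-aware self-attention over the joint sequence, splits the updated
+# states back into their segments, and uses RelationalPointerNet to compute the
+# memory-to-column/table/value alignment matrices.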
+ +import numpy as np +import paddle +from text2sql.models import relational_transformer + + +class RelationAwareEncoder(paddle.nn.Layer): + """Relation-aware encoder""" + + def __init__(self, num_layers, num_heads, num_relations, hidden_size, has_value=False, dropout=0.1): + super(RelationAwareEncoder, self).__init__() + + self._num_layers = num_layers + self._num_heads = num_heads + self._hidden_size = hidden_size + self._dropout = dropout + + cfg = { + "num_hidden_layers": num_layers, + "num_attention_heads": num_heads, + "num_relations": num_relations, + "hidden_size": hidden_size, + "hidden_act": "relu", + "attention_probs_dropout_prob": dropout, + "hidden_dropout_prob": dropout, + "initializer_range": 0.02, + } + self.encoder = relational_transformer.RelationalTransformerEncoder(cfg) + if not has_value: + self.align_attn = relational_transformer.RelationalPointerNet(hidden_size, num_relations) + else: + self.align_attn = relational_transformer.RelationalPointerNet(hidden_size, 0) + + def forward(self, q_enc, c_enc, t_enc, c_boundaries, t_boundaries, relations, v_enc=None): + assert q_enc.shape[0] == 1 and c_enc.shape[0] == 1 and t_enc.shape[0] == 1 + return self.forward_unbatched(q_enc, c_enc, t_enc, c_boundaries, t_boundaries, relations) + + def forward_unbatched(self, q_enc, c_enc, t_enc, c_boundaries, t_boundaries, relations, v_enc=None): + enc = paddle.concat((q_enc, c_enc, t_enc), axis=1) + # enc = enc.transpose([1, 0, 2]) + + relations_t = paddle.to_tensor(relations, dtype="int64").unsqueeze([0]) + enc_new, _, _ = self.encoder(enc, relations_t) + + # Split updated_enc again + c_base = q_enc.shape[1] + t_base = q_enc.shape[1] + c_enc.shape[1] + q_enc_new = enc_new[:, :c_base] + c_enc_new = enc_new[:, c_base:t_base] + t_enc_new = enc_new[:, t_base:] + + if v_enc is None: + m2c_align_mat = self.align_attn(enc_new, c_enc_new, relations_t[:, :, c_base:t_base]) + m2t_align_mat = self.align_attn(enc_new, t_enc_new, relations_t[:, :, t_base:]) + m2v_align_mat = None + else: + enc_new = paddle.concat((enc_new, v_enc), axis=1) + m2c_align_mat = self.align_attn(enc_new, c_enc_new, relations=None) + m2t_align_mat = self.align_attn(enc_new, t_enc_new, relations=None) + m2v_align_mat = self.align_attn(enc_new, v_enc, relations=None) + + return ([q_enc_new, c_enc_new, t_enc_new, v_enc], [m2c_align_mat, m2t_align_mat, m2v_align_mat]) + + +if __name__ == "__main__": + """run some simple test cases""" + + hidden_size = 4 + q = paddle.to_tensor(list(range(12)), dtype="float32").reshape([1, 3, hidden_size]) + c = paddle.to_tensor(list(range(8)), dtype="float32").reshape([1, 2, hidden_size]) + t = paddle.to_tensor(list(range(8)), dtype="float32").reshape([1, 2, hidden_size]) + c_bound = None + t_bound = None + relations = np.zeros([7, 7], dtype=np.int64) + relations[0, 3] = 10 + relations[0, 1] = 1 + relations[0, 2] = 2 + relations[1, 2] = 1 + relations[1, 4] = 11 + relations[3, 4] = 21 + relations[3, 5] = 31 + + model = RelationAwareEncoder(2, 2, 99, hidden_size) + outputs = model(q, c, t, c_bound, t_bound, relations) + print(outputs) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/relational_transformer.py b/examples/text_to_sql/RAT-SQL/text2sql/models/relational_transformer.py new file mode 100644 index 0000000000000000000000000000000000000000..aba8fdd11346b28624fc2c6e5a4abeaefc162fa7 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/relational_transformer.py @@ -0,0 +1,413 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math + +import paddle +import paddle.nn.functional as F +from paddle import nn + +ACT_DICT = { + "relu": nn.ReLU, + "gelu": nn.GELU, +} + + +def _build_linear(n_in, n_out, name=None, init=None): + return nn.Linear( + n_in, + n_out, + weight_attr=paddle.ParamAttr(name="%s.w_0" % name if name is not None else None, initializer=init), + bias_attr="%s.b_0" % name if name is not None else None, + ) + + +def _build_ln(n_in, name): + return nn.LayerNorm( + normalized_shape=n_in, + weight_attr=paddle.ParamAttr( + name="%s_layer_norm_scale" % name if name is not None else None, initializer=nn.initializer.Constant(1.0) + ), + bias_attr=paddle.ParamAttr( + name="%s_layer_norm_bias" % name if name is not None else None, initializer=nn.initializer.Constant(0.0) + ), + ) + + +def new_name(name, postfix): + if name is None: + ret = None + elif name == "": + ret = postfix + else: + ret = "%s_%s" % (name, postfix) + return ret + + +# Adapted from +# https://github.com/tensorflow/tensor2tensor/blob/0b156ac533ab53f65f44966381f6e147c7371eee/tensor2tensor/layers/common_attention.py +def relative_attention_logits(query, key, relation): + """relative attention logits(scores) + + Args: + query (TYPE): NULL + key (TYPE): NULL + relation (TYPE): NULL + + Returns: Tensor, shape = [batch, heads, num queries, num kvs] + + Raises: NULL + """ + # We can't reuse the same logic as tensor2tensor because we don't share relation vectors across the batch. + # In this version, relation vectors are shared across heads. + # query: [batch, heads, num queries, depth]. + # key: [batch, heads, num kvs, depth]. + # relation: [batch, num queries, num kvs, depth]. + + # qk_matmul is [batch, heads, num queries, num kvs] + qk_matmul = paddle.matmul(query, key, transpose_y=True) + if relation is None: + return qk_matmul / math.sqrt(query.shape[-1]) + + # q_t is [batch, num queries, heads, depth] + q_t = query.transpose([0, 2, 1, 3]) + # r_t is [batch, num queries, depth, num kvs] + r_t = relation.transpose([0, 1, 3, 2]) + + # [batch, num queries, heads, depth] * [batch, num queries, depth, num kvs] + # = [batch, num queries, heads, num kvs] + # For each batch and query, we have a query vector per head. + # We take its dot product with the relation vector for each kv. + # + # transposed = [batch, heads, num queries, num kvs] + qr_matmul = paddle.matmul(q_t, r_t).transpose([0, 2, 1, 3]) + + # [batch, heads, num queries, num kvs] + return (qk_matmul + qr_matmul) / math.sqrt(query.shape[-1]) + + # Sharing relation vectors across batch and heads: + # query: [batch, heads, num queries, depth]. + # key: [batch, heads, num kvs, depth]. + # relation: [num queries, num kvs, depth]. + # + # Then take + # key reshaped + # [num queries, batch * heads, depth] + # relation.transpose(-2, -1) + # [num queries, depth, num kvs] + # and multiply them together. + # + # Without sharing relation vectors across heads: + # query: [batch, heads, num queries, depth]. + # key: [batch, heads, num kvs, depth]. 
+ # relation: [batch, heads, num queries, num kvs, depth]. + # + # Then take + # key.unsqueeze(3) + # [batch, heads, num queries, 1, depth] + # relation.transpose(-2, -1) + # [batch, heads, num queries, depth, num kvs] + # and multiply them together: + # [batch, heads, num queries, 1, depth] + # * [batch, heads, num queries, depth, num kvs] + # = [batch, heads, num queries, 1, num kvs] + # and squeeze + # [batch, heads, num queries, num kvs] + + +def relative_attention_values(weight, value, relation): + """In this version, relation vectors are shared across heads. + Args: + weight: [batch, heads, num queries, num kvs]. + value: [batch, heads, num kvs, depth]. + relation: [batch, num queries, num kvs, depth]. + Returns: Tensor, shape = [batch, heads, num queries, depth] + """ + # wv_matmul is [batch, heads, num queries, depth] + wv_matmul = paddle.matmul(weight, value) + + # w_t is [batch, num queries, heads, num kvs] + w_t = weight.transpose([0, 2, 1, 3]) + # [batch, num queries, heads, num kvs] + # * [batch, num queries, num kvs, depth] + # = [batch, num queries, heads, depth] + # transposed = [batch, heads, num queries, depth] + wr_matmul = paddle.matmul(w_t, relation).transpose([0, 2, 1, 3]) + + return wv_matmul + wr_matmul + + +class RelationalAttentionLayer(nn.Layer): + def __init__(self, cfg, name=None): + super(RelationalAttentionLayer, self).__init__() + initializer = nn.initializer.TruncatedNormal(std=cfg["initializer_range"]) + d_model = cfg["hidden_size"] + n_head = cfg["num_attention_heads"] + assert d_model % n_head == 0 + d_model_q = cfg.get("query_hidden_size_per_head", d_model // n_head) * n_head + d_model_v = cfg.get("value_hidden_size_per_head", d_model // n_head) * n_head + self.n_head = n_head + self.d_key = d_model_q // n_head + self.q = _build_linear(d_model, d_model_q, new_name(name, "query_fc"), initializer) + self.k = _build_linear(d_model, d_model_q, new_name(name, "key_fc"), initializer) + self.v = _build_linear(d_model, d_model_v, new_name(name, "value_fc"), initializer) + self.o = _build_linear(d_model_v, d_model, new_name(name, "output_fc"), initializer) + self.dropout = nn.Dropout(p=cfg["attention_probs_dropout_prob"]) + + def forward(self, queries, keys, values, relation_k, relation_v, attn_bias=None, past_cache=None): + """relational attention forward. + seq_len in `shape` means num queries/keys/values of attention + + Args: + queries (TYPE): shape = [batch, seq_len, num_heads * hidden] + keys (TYPE): shape = queries.shape + values (TYPE): shape = queries.shape + relation_k (TYPE): shape = [batch, seq_len, seq_len, hidden] + relation_v (TYPE): shape = relation_k.shape + attn_bias (TYPE): used as sequence mask. 
Default is None + past_cache (TYPE): Default is None + + Returns: TODO + + Raises: NULL + """ + assert len(queries.shape) == len(keys.shape) == len(values.shape) == 3 + # bsz, q_len, q_dim = queries.shape + # bsz, k_len, k_dim = keys.shape + # bsz, v_len, v_dim = values.shape + # assert k_len == v_len + + q = self.q(queries) + k = self.k(keys) + v = self.v(values) + + cache = (k, v) + if past_cache is not None: + cached_k, cached_v = past_cache + k = paddle.concat([cached_k, k], 1) + v = paddle.concat([cached_v, v], 1) + + def _transpose(inputs): + """reshape and transpose + Args: inputs: shape = [batch, seq_len, heads * hidden] + Returns: shape = [batch, heads, seq_len, hidden] + """ + hidden_size = inputs.shape[-1] // self.n_head + outputs = inputs.reshape([0, 0, self.n_head, hidden_size]) + return outputs.transpose([0, 2, 1, 3]) + + q, k, v = [_transpose(x) for x in (q, k, v)] + + q = q.scale(self.d_key**-0.5) + scores = relative_attention_logits(q, k, relation_k) + if attn_bias is not None: + scores += attn_bias + scores = F.softmax(scores) + scores = self.dropout(scores) + + out = relative_attention_values(scores, v, relation_v) + # input: [batch, heads, seq_len, hidden] + # output: [batch, seq_len, heads * hidden] + out = out.transpose([0, 2, 1, 3]) + out = out.reshape([0, 0, out.shape[2] * out.shape[3]]) + out = self.o(out) + return out, cache + + +class RelationalPointerNet(nn.Layer): + """Pointer Netword with Relations""" + + def __init__(self, hidden_size, num_relations, init_range=0.02, name=None): + """init of class + + Args: + cfg (TYPE): NULL + + """ + super(RelationalPointerNet, self).__init__() + self.hidden_size = hidden_size + + initializer = nn.initializer.TruncatedNormal(std=init_range) + self.q = _build_linear(hidden_size, hidden_size, new_name(name, "query_fc"), initializer) + self.k = _build_linear(hidden_size, hidden_size, new_name(name, "key_fc"), initializer) + # self.dropout = nn.Dropout(p=cfg['attention_probs_dropout_prob']) + + self.relation_emb = None + if num_relations > 0: + self.relation_emb = nn.Embedding(num_relations, hidden_size) + self.scores = None + + def forward(self, queries, keys, relations, attn_bias=None): + """relational attention forward. + seq_len in `shape` means num queries/keys/values of attention + + Args: + queries (TYPE): shape = [batch, seq_len, num_heads * hidden] + keys (TYPE): shape = queries.shape + relations (TYPE): shape = [batch, seq_len, seq_len, hidden] + attn_bias (TYPE): used as sequence mask. 
Default is None + + Returns: TODO + + Raises: NULL + """ + assert len(queries.shape) == len(keys.shape) == 3 + + q = self.q(queries) + k = self.k(keys) + r = None + if relations is not None: + r = self.relation_emb(relations) + + def _transpose(inputs): + """reshape and transpose + Args: inputs: shape = [batch, seq_len, heads * hidden] + Returns: shape = [batch, heads, seq_len, hidden] + """ + # 1 代表 head 数量,此处恒为 1。 + outputs = inputs.reshape([0, 0, 1, self.hidden_size]) + return outputs.transpose([0, 2, 1, 3]) + + q = _transpose(q) + k = _transpose(k) + # q = q.scale(self.hidden_size**-0.5) + scores = relative_attention_logits(q, k, r) + if attn_bias is not None: + scores += attn_bias + + self.scores = F.softmax(scores) + return self.scores.squeeze([0, 1]) + + +class PositionwiseFeedForwardLayer(nn.Layer): + def __init__(self, cfg, name=None): + super(PositionwiseFeedForwardLayer, self).__init__() + initializer = nn.initializer.TruncatedNormal(std=cfg["initializer_range"]) + d_model = cfg["hidden_size"] + d_ffn = cfg.get("intermediate_size", 4 * d_model) + self.act = ACT_DICT[cfg["hidden_act"]]() + self.i = _build_linear( + d_model, + d_ffn, + new_name(name, "fc_0"), + initializer, + ) + self.o = _build_linear(d_ffn, d_model, new_name(name, "fc_1"), initializer) + prob = cfg.get("intermediate_dropout_prob", 0.0) + self.dropout = nn.Dropout(p=prob) + + def forward(self, inputs): + hidden = self.act(self.i(inputs)) + hidden = self.dropout(hidden) + out = self.o(hidden) + return out + + +class RelationalTransformerBlock(nn.Layer): + """A transformer block with relations""" + + def __init__(self, cfg, name=None): + super(RelationalTransformerBlock, self).__init__() + d_model = cfg["hidden_size"] + n_heads = cfg["num_attention_heads"] + self.attn = RelationalAttentionLayer(cfg, name=new_name(name, "multi_head_att")) + self.ln1 = _build_ln(d_model, name=new_name(name, "post_att")) + self.ffn = PositionwiseFeedForwardLayer(cfg, name=new_name(name, "ffn")) + self.ln2 = _build_ln(d_model, name=new_name(name, "post_ffn")) + prob = cfg.get("intermediate_dropout_prob", cfg["hidden_dropout_prob"]) + self.dropout = nn.Dropout(p=prob) + + # 假设 k/v 的 + rel_hidden = d_model // n_heads + self.relation_k_emb = nn.Embedding(cfg["num_relations"], rel_hidden) + self.relation_v_emb = nn.Embedding(cfg["num_relations"], rel_hidden) + + def forward(self, inputs, relations, attn_bias=None, past_cache=None): + relation_k = self.relation_k_emb(relations) + relation_v = self.relation_k_emb(relations) + + attn_out, cache = self.attn( + inputs, inputs, inputs, relation_k, relation_v, attn_bias, past_cache=past_cache + ) # self attn + attn_out = self.dropout(attn_out) + hidden = attn_out + inputs + hidden = self.ln1(hidden) # dropout/ add/ norm + + ffn_out = self.ffn(hidden) + ffn_out = self.dropout(ffn_out) + hidden = ffn_out + hidden + hidden = self.ln2(hidden) + return hidden, cache + + +class RelationalTransformerEncoder(nn.Layer): + def __init__(self, cfg, name=None): + super(RelationalTransformerEncoder, self).__init__() + n_layers = cfg["num_hidden_layers"] + self.block = nn.LayerList( + [RelationalTransformerBlock(cfg, new_name(name, "layer_%d" % i)) for i in range(n_layers)] + ) + + def forward(self, inputs, relations, attn_bias=None, past_cache=None): + """relational transformer encoder, forward stage of + n layers and m heads transformer blocks with relations + + Args: + inputs (TYPE): shape= [batch, seq_len, hidden] + relations (TYPE): shape = [batch, seq_len, seq_len] + attn_bias (TYPE): mask for inputs 
sequence. Default is None + past_cache (TYPE): Default is None + + Returns: (last_hidden_state, all_hidden_state_list, (cache_list_k, cache_list_v)) + + Raises: NULL + """ + if past_cache is not None: + assert isinstance( + past_cache, tuple + ), "unknown type of `past_cache`," + " expect tuple or list. got %s" % repr(type(past_cache)) + past_cache = list(zip(*past_cache)) + else: + past_cache = [None] * len(self.block) + cache_list_k, cache_list_v, hidden_list = [], [], [inputs] + + for b, p in zip(self.block, past_cache): + inputs, cache = b(inputs, relations, attn_bias=attn_bias, past_cache=p) + cache_k, cache_v = cache + cache_list_k.append(cache_k) + cache_list_v.append(cache_v) + hidden_list.append(inputs) + + return inputs, hidden_list, (cache_list_k, cache_list_v) + + +if __name__ == "__main__": + """run some simple test cases""" + cfg = { + "num_hidden_layers": 12, + "num_attention_heads": 2, + "num_relations": 99, + "hidden_size": 4, + "hidden_act": "relu", + "attention_probs_dropout_prob": 0.1, + "hidden_dropout_prob": 0.1, + "initializer_range": 0.02, + } + + model = RelationalTransformerEncoder(cfg) + print(model) + inputs = paddle.to_tensor(list(range(24)), dtype="float32").reshape([2, 3, 4]) + relations = paddle.to_tensor(list(range(18)), dtype="int64").reshape([2, 3, 3]) + hidden, _, _ = model(inputs, relations) + print(hidden) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_beam_search.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_beam_search.py new file mode 100644 index 0000000000000000000000000000000000000000..6b43d0610c4d70a89585b44569cd04799529dad0 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_beam_search.py @@ -0,0 +1,445 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
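+
+# Module overview (informal): beam-search strategies used at decoding time.
+# beam_search_with_heuristics beam-searches decoding prefixes until a FROM
+# clause must be filled, enumerates candidate completions, and filters them
+# with schema heuristics: no duplicated tables, foreign keys checked against
+# shortest paths in db.foreign_key_graph (networkx), and every table whose
+# column is mentioned must itself be selected. The *_with_oracle_* variants
+# instead force the gold columns/tables or the gold tree sketch during search.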
+ +import operator + +import attr +import networkx as nx +from text2sql.dataproc.sql_preproc_v2 import get_field_presence_info +from text2sql.models.beam_search import Hypothesis +from text2sql.models.sql_decoder.decoder import TreeState + + +@attr.s +class Hypothesis4Filtering(Hypothesis): + column_history = attr.ib(factory=list) + table_history = attr.ib(factory=list) + key_column_history = attr.ib(factory=list) + + +def beam_search_with_heuristics(model, inputs, beam_size, max_steps, from_cond=True): + """ + Find the valid FROM clause with beam search + """ + orig_inputs = inputs["orig_inputs"][0] + # inference_state, next_choices = model.inference(inputs, orig_inputs.db) + inference_state, next_choices = model(inputs, db=orig_inputs.db, is_train=False) + beam = [Hypothesis4Filtering(inference_state, next_choices)] + + cached_finished_seqs = [] # cache filtered trajectories + beam_prefix = beam + while True: + # search prefixes with beam search + prefixes2fill_from = [] + for step in range(max_steps): + if len(prefixes2fill_from) >= beam_size: + break + + candidates = [] + for hyp in beam_prefix: + if ( + hyp.inference_state.cur_item.state == hyp.inference_state.State.CHILDREN_APPLY + and hyp.inference_state.cur_item.node_type == "from" + ): + # only from not fill, save it and process in the following code + prefixes2fill_from.append(hyp) + else: + candidates += [ + (hyp, choice, choice_score.item(), hyp.score + choice_score.item()) + for choice, choice_score in hyp.next_choices + ] + candidates.sort(key=operator.itemgetter(3), reverse=True) + candidates = candidates[: beam_size - len(prefixes2fill_from)] + + # Create the new hypotheses from the expansions + beam_prefix = [] + for hyp, choice, choice_score, cum_score in candidates: + inference_state = hyp.inference_state.clone() + + # cache column choice + column_history = hyp.column_history[:] + if ( + hyp.inference_state.cur_item.state == hyp.inference_state.State.POINTER_APPLY + and hyp.inference_state.cur_item.node_type == "column" + ): + column_history = column_history + [choice] + + # get next choices + next_choices = inference_state.step(choice) + assert next_choices is not None + beam_prefix.append( + Hypothesis4Filtering( + inference_state, + next_choices, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + column_history, + ) + ) + + prefixes2fill_from.sort(key=operator.attrgetter("score"), reverse=True) + # assert len(prefixes) == beam_size + + # enumerating + beam_from = prefixes2fill_from + max_size = 6 + unfiltered_finished = [] + prefixes_unfinished = [] + for step in range(max_steps): + if len(unfiltered_finished) + len(prefixes_unfinished) > max_size: + break + + candidates = [] + for hyp in beam_from: + if ( + step > 0 + and hyp.inference_state.cur_item.state == hyp.inference_state.State.CHILDREN_APPLY + and hyp.inference_state.cur_item.node_type == "from" + ): + prefixes_unfinished.append(hyp) + else: + candidates += [ + (hyp, choice, choice_score.item(), hyp.score + choice_score.item()) + for choice, choice_score in hyp.next_choices + ] + candidates.sort(key=operator.itemgetter(3), reverse=True) + candidates = candidates[: max_size - len(prefixes_unfinished)] + + beam_from = [] + for hyp, choice, choice_score, cum_score in candidates: + inference_state = hyp.inference_state.clone() + + # cache table choice + table_history = hyp.table_history[:] + key_column_history = hyp.key_column_history[:] + if hyp.inference_state.cur_item.state == hyp.inference_state.State.POINTER_APPLY: + if 
hyp.inference_state.cur_item.node_type == "table": + table_history = table_history + [choice] + elif hyp.inference_state.cur_item.node_type == "column": + key_column_history = key_column_history + [choice] + + next_choices = inference_state.step(choice) + if next_choices is None: + unfiltered_finished.append( + Hypothesis4Filtering( + inference_state, + None, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + hyp.column_history, + table_history, + key_column_history, + ) + ) + else: + beam_from.append( + Hypothesis4Filtering( + inference_state, + next_choices, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + hyp.column_history, + table_history, + key_column_history, + ) + ) + # [END] for step in range(max_steps) + + unfiltered_finished.sort(key=operator.attrgetter("score"), reverse=True) + + # filtering + filtered_finished = [] + for hyp in unfiltered_finished: + mentioned_column_ids = set(hyp.column_history) + mentioned_key_column_ids = set(hyp.key_column_history) + mentioned_table_ids = set(hyp.table_history) + + # duplicate tables + if len(mentioned_table_ids) != len(hyp.table_history): + continue + + # the foreign key should be correctly used + # NOTE: the new version does not predict conditions in FROM clause anymore + if from_cond: + covered_tables = set() + must_include_key_columns = set() + candidate_table_ids = sorted(mentioned_table_ids) + start_table_id = candidate_table_ids[0] + for table_id in candidate_table_ids[1:]: + if table_id in covered_tables: + continue + try: + path = nx.shortest_path( + orig_inputs.db.foreign_key_graph, source=start_table_id, target=table_id + ) + except (nx.NetworkXNoPath, nx.NodeNotFound): + covered_tables.add(table_id) + continue + + for source_table_id, target_table_id in zip(path, path[1:]): + if target_table_id in covered_tables: + continue + if target_table_id not in mentioned_table_ids: + continue + col1, col2 = orig_inputs.db.foreign_key_graph[source_table_id][target_table_id]["columns"] + must_include_key_columns.add(col1) + must_include_key_columns.add(col2) + if not must_include_key_columns == mentioned_key_column_ids: + continue + + # tables whose columns are mentioned should also exist + must_table_ids = set() + for col in mentioned_column_ids: + tab_ = orig_inputs.db.columns[col].table + if tab_ is not None: + must_table_ids.add(tab_.id) + if not must_table_ids.issubset(mentioned_table_ids): + continue + + filtered_finished.append(hyp) + + filtered_finished.sort(key=operator.attrgetter("score"), reverse=True) + # filtered.sort(key=lambda x: x.score / len(x.choice_history), reverse=True) + prefixes_unfinished.sort(key=operator.attrgetter("score"), reverse=True) + # new_prefixes.sort(key=lambda x: x.score / len(x.choice_history), reverse=True) + + prefixes_, filtered_ = merge_beams(prefixes_unfinished, filtered_finished, beam_size) + + if filtered_: + cached_finished_seqs = cached_finished_seqs + filtered_ + cached_finished_seqs.sort(key=operator.attrgetter("score"), reverse=True) + + if prefixes_ and len(prefixes_[0].choice_history) < 200: + beam_prefix = prefixes_ + for hyp in beam_prefix: + hyp.table_history = [] + hyp.column_history = [] + hyp.key_column_history = [] + elif cached_finished_seqs: + return cached_finished_seqs[:beam_size] + else: + return unfiltered_finished[:beam_size] + + +# merge sorted beam +def merge_beams(beam_1, beam_2, beam_size): + if len(beam_1) == 0 or len(beam_2) == 0: + return beam_1, beam_2 + + annoated_beam_1 = [("beam_1", b) for b 
in beam_1] + annoated_beam_2 = [("beam_2", b) for b in beam_2] + merged_beams = annoated_beam_1 + annoated_beam_2 + merged_beams.sort(key=lambda x: x[1].score, reverse=True) + + ret_beam_1 = [] + ret_beam_2 = [] + for label, beam in merged_beams[:beam_size]: + if label == "beam_1": + ret_beam_1.append(beam) + else: + assert label == "beam_2" + ret_beam_2.append(beam) + return ret_beam_1, ret_beam_2 + + +def beam_search_with_oracle_column(model, inputs, preproc_item, beam_size, max_steps): + inference_state, next_choices = model.begin_inference(inputs, preproc_item) + beam = [Hypothesis(inference_state, next_choices)] + finished = [] + assert beam_size == 1 + + # identify all the cols mentioned in the gold sql + root_node = preproc_item[1].tree + + col_queue = list( + reversed([val for val in model.decoder.ast_wrapper.find_all_descendants_of_type(root_node, "column")]) + ) + tab_queue = list( + reversed([val for val in model.decoder.ast_wrapper.find_all_descendants_of_type(root_node, "table")]) + ) + col_queue_copy = col_queue[:] + tab_queue_copy = tab_queue[:] + + predict_counter = 0 + + for step in range(max_steps): + # Check if all beams are finished + if len(finished) == beam_size: + break + + # hijack the next choice using the gold col + assert len(beam) == 1 + hyp = beam[0] + if hyp.inference_state.cur_item.state == hyp.inference_state.State.POINTER_APPLY: + if hyp.inference_state.cur_item.node_type == "column" and len(col_queue) > 0: + gold_col = col_queue[0] + + flag = False + for _choice in hyp.next_choices: + if _choice[0] == gold_col: + flag = True + hyp.next_choices = [_choice] + col_queue = col_queue[1:] + break + assert flag + elif hyp.inference_state.cur_item.node_type == "table" and len(tab_queue) > 0: + gold_tab = tab_queue[0] + + flag = False + for _choice in hyp.next_choices: + if _choice[0] == gold_tab: + flag = True + hyp.next_choices = [_choice] + tab_queue = tab_queue[1:] + break + assert flag + + # for debug + if hyp.inference_state.cur_item.state == hyp.inference_state.State.POINTER_APPLY: + predict_counter += 1 + + # For each hypothesis, get possible expansions + # Score each expansion + candidates = [] + for hyp in beam: + candidates += [ + (hyp, choice, choice_score.item(), hyp.score + choice_score.item()) + for choice, choice_score in hyp.next_choices + ] + + # Keep the top K expansions + candidates.sort(key=operator.itemgetter(3), reverse=True) + candidates = candidates[: beam_size - len(finished)] + + # Create the new hypotheses from the expansions + beam = [] + for hyp, choice, choice_score, cum_score in candidates: + inference_state = hyp.inference_state.clone() + next_choices = inference_state.step(choice) + if next_choices is None: + finished.append( + Hypothesis( + inference_state, + None, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + ) + ) + else: + beam.append( + Hypothesis( + inference_state, + next_choices, + cum_score, + hyp.choice_history + [choice], + hyp.score_history + [choice_score], + ) + ) + if (len(col_queue_copy) + len(tab_queue_copy)) != predict_counter: + # print("The number of column/tables are not matched") + pass + finished.sort(key=operator.attrgetter("score"), reverse=True) + return finished + + +def beam_search_with_oracle_sketch(model, inputs, preproc_item, beam_size, max_steps): + inference_state, next_choices = model.begin_inference(inputs, preproc_item) + hyp = Hypothesis(inference_state, next_choices) + + parsed = model.decoder.preproc.grammar.parse(inputs["orig_inputs"][0].code, "val") + 
if not parsed: + return [] + + queue = [ + TreeState( + node=preproc_item[1].tree, + parent_field_type=model.decoder.preproc.grammar.root_type, + ) + ] + + while queue: + item = queue.pop() + node = item.node + parent_field_type = item.parent_field_type + + if isinstance(node, (list, tuple)): + node_type = parent_field_type + "*" + rule = (node_type, len(node)) + if rule not in model.decoder.rules_index: + return [] + rule_idx = model.decoder.rules_index[rule] + assert inference_state.cur_item.state == inference_state.State.LIST_LENGTH_APPLY + + if model.decoder.preproc.use_seq_elem_rules and parent_field_type in model.decoder.ast_wrapper.sum_types: + parent_field_type += "_seq_elem" + + for i, elem in reversed(list(enumerate(node))): + queue.append(TreeState(node=elem, parent_field_type=parent_field_type)) + + hyp = Hypothesis(inference_state, None, 0, hyp.choice_history + [rule_idx], hyp.score_history + [0]) + continue + + if parent_field_type in model.decoder.preproc.grammar.pointers: + assert inference_state.cur_item.state == inference_state.State.POINTER_APPLY + # best_choice = max(next_choices, key=lambda x: x[1]) + # node = best_choice[0] # override the node + + assert isinstance(node, int) + next_choices = inference_state.step(node) + hyp = Hypothesis(inference_state, None, 0, hyp.choice_history + [node], hyp.score_history + [0]) + continue + + if parent_field_type in model.decoder.ast_wrapper.primitive_types: + field_value_split = model.decoder.preproc.grammar.tokenize_field_value(node) + [""] + + for token in field_value_split: + next_choices = inference_state.step(token) + hyp = Hypothesis(inference_state, None, 0, hyp.choice_history + field_value_split, hyp.score_history + [0]) + continue + + type_info = model.decoder.ast_wrapper.singular_types[node["_type"]] + + if parent_field_type in model.decoder.preproc.sum_type_constructors: + # ApplyRule, like expr -> Call + rule = (parent_field_type, type_info.name) + rule_idx = model.decoder.rules_index[rule] + assert inference_state.cur_item.state == inference_state.State.SUM_TYPE_APPLY + extra_rules = [ + model.decoder.rules_index[parent_field_type, extra_type] for extra_type in node.get("_extra_types", []) + ] + next_choices = inference_state.step(rule_idx, extra_rules) + + hyp = Hypothesis(inference_state, None, 0, hyp.choice_history + [rule_idx], hyp.score_history + [0]) + + if type_info.fields: + # ApplyRule, like Call -> expr[func] expr*[args] keyword*[keywords] + # Figure out which rule needs to be applied + present = get_field_presence_info(model.decoder.ast_wrapper, node, type_info.fields) + rule = (node["_type"], tuple(present)) + rule_idx = model.decoder.rules_index[rule] + next_choices = inference_state.step(rule_idx) + + hyp = Hypothesis(inference_state, None, 0, hyp.choice_history + [rule_idx], hyp.score_history + [0]) + + # reversed so that we perform a DFS in left-to-right order + for field_info in reversed(type_info.fields): + if field_info.name not in node: + continue + queue.append(TreeState(node=node[field_info.name], parent_field_type=field_info.type)) + + return [hyp] diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6f0ea85344b7e0c679730356928c8749cf71cd66 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/align_dec_func.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/align_dec_func.py new file mode 100644 index 0000000000000000000000000000000000000000..954693a5c60cbe9c35bd4208094e9ae36a72e41c --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/align_dec_func.py @@ -0,0 +1,77 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle + + +def compute_align_loss(model, desc_enc, example): + """model: a nl2code decoder""" + # find relevant columns + root_node = example.tree + rel_cols = list(reversed([val for val in model.ast_wrapper.find_all_descendants_of_type(root_node, "column")])) + rel_tabs = list(reversed([val for val in model.ast_wrapper.find_all_descendants_of_type(root_node, "table")])) + rel_vals = np.abs( + list(reversed([val for val in model.ast_wrapper.find_all_descendants_of_type(root_node, "value")])) + ) + + rel_cols_t = paddle.to_tensor(sorted(list(set(rel_cols))), dtype="int64") + rel_tabs_t = paddle.to_tensor(sorted(list(set(rel_tabs))), dtype="int64") + rel_vals_t = paddle.to_tensor(sorted(list(set(rel_vals))), dtype="int64") + + mc_att_on_rel_col = desc_enc.m2c_align_mat.index_select(rel_cols_t, axis=1) + mc_max_rel_att = mc_att_on_rel_col.max(axis=0) + mc_max_rel_att = mc_max_rel_att.clip(min=1e-9) + + mt_att_on_rel_tab = desc_enc.m2t_align_mat.index_select(rel_tabs_t, axis=1) + mt_max_rel_att = mt_att_on_rel_tab.max(axis=0) + mt_max_rel_att = mt_max_rel_att.clip(min=1e-9) + + mv_att_on_rel_val = desc_enc.m2v_align_mat.index_select(rel_vals_t, axis=1) + mv_max_rel_att = mv_att_on_rel_val.max(axis=0) + mv_max_rel_att = mv_max_rel_att.clip(min=1e-9) + + value_loss_weight = 2.0 + align_loss = ( + -paddle.log(mc_max_rel_att).mean() + - paddle.log(mt_max_rel_att).mean() + - value_loss_weight * paddle.log(mv_max_rel_att).mean() + ) + return align_loss + + +def compute_pointer_with_align(model, node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc): + """compute_pointer_with_align""" + new_state, attention_weights = model._update_state( + node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc + ) + # output shape: batch (=1) x emb_size + output = new_state[0] + memory_pointer_logits = model.pointers[node_type](output, desc_enc.memory) + memory_pointer_probs = 
paddle.nn.functional.softmax(memory_pointer_logits, axis=1) + # pointer_logits shape: batch (=1) x num choices + if node_type == "column": + pointer_probs = paddle.mm(memory_pointer_probs, desc_enc.m2c_align_mat) + elif node_type == "table": + pointer_probs = paddle.mm(memory_pointer_probs, desc_enc.m2t_align_mat) + else: # value + pointer_probs = paddle.mm(memory_pointer_probs, desc_enc.m2v_align_mat) + pointer_probs = pointer_probs.clip(min=1e-9) + pointer_logits = paddle.log(pointer_probs) + return output, new_state, pointer_logits, attention_weights + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/decoder.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/decoder.py new file mode 100644 index 0000000000000000000000000000000000000000..56d022236f3b5ce9df34bb7e086c8d9c7956570b --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/decoder.py @@ -0,0 +1,638 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import itertools + +import attr +import paddle +import paddle.nn.functional as F +from text2sql.dataproc import sql_preproc_v2, vocab +from text2sql.models import attention +from text2sql.models.sql_decoder import align_dec_func +from text2sql.models.sql_decoder.infer_tree_traversal import InferenceTreeTraversal +from text2sql.models.sql_decoder.train_tree_traversal import TrainTreeTraversal +from text2sql.models.sql_decoder.tree_traversal import TreeTraversal + + +def maybe_stack(items, axis=None): + to_stack = [item for item in items if item is not None] + if not to_stack: + return None + elif len(to_stack) == 1: + return to_stack[0].unsqueeze(axis) + else: + return paddle.stack(to_stack, axis) + + +def accumulate_logprobs(d, keys_and_logprobs): + for key, logprob in keys_and_logprobs: + existing = d.get(key) + if existing is None: + d[key] = logprob + else: + d[key] = paddle.logsumexp(paddle.stack((logprob, existing), axis=0), axis=0) + + +@attr.s +class TreeState: + node = attr.ib() + parent_field_type = attr.ib() + + +class Text2SQLDecoder(paddle.nn.Layer): + """Decoder model""" + + Preproc = sql_preproc_v2.SQLPreproc + + def __init__( + self, + preproc, + # + rule_emb_size=128, + node_embed_size=64, + # TODO: This should be automatically inferred from encoder + enc_recurrent_size=768, + recurrent_size=512, + dropout=0.0, + desc_attn="bahdanau", + copy_pointer=None, + multi_loss_type="logsumexp", + sup_att=None, + use_align_mat=False, + use_align_loss=False, + enumerate_order=False, + loss_type="softmax", + ): + """init""" + super().__init__() + self.preproc = preproc + self.ast_wrapper = preproc.ast_wrapper + self.terminal_vocab = preproc.vocab + + self.rule_emb_size = rule_emb_size + self.node_emb_size = node_embed_size + self.enc_recurrent_size = enc_recurrent_size + self.recurrent_size = recurrent_size + + self.rules_index = {v: idx for idx, v in 
enumerate(self.preproc.all_rules)} + self.use_align_mat = use_align_mat + self.use_align_loss = use_align_loss + self.enumerate_order = enumerate_order + + if use_align_mat: + self.compute_align_loss = lambda *args: align_dec_func.compute_align_loss(self, *args) + self.compute_pointer_with_align = lambda *args: align_dec_func.compute_pointer_with_align(self, *args) + + if self.preproc.use_seq_elem_rules: + self.node_type_vocab = vocab.Vocab( + sorted(self.preproc.primitive_types) + + sorted(self.ast_wrapper.custom_primitive_types) + + sorted(self.preproc.sum_type_constructors.keys()) + + sorted(self.preproc.field_presence_infos.keys()) + + sorted(self.preproc.seq_lengths.keys()), + special_elems=(), + ) + else: + self.node_type_vocab = vocab.Vocab( + sorted(self.preproc.primitive_types) + + sorted(self.ast_wrapper.custom_primitive_types) + + sorted(self.ast_wrapper.sum_types.keys()) + + sorted(self.ast_wrapper.singular_types.keys()) + + sorted(self.preproc.seq_lengths.keys()), + special_elems=(), + ) + + self.state_update = paddle.nn.LSTMCell( + input_size=self.rule_emb_size * 2 + self.enc_recurrent_size + self.recurrent_size + self.node_emb_size, + hidden_size=self.recurrent_size, + ) + # dropout=dropout) + + self.attn_type = desc_attn + if desc_attn == "bahdanau": + self.desc_attn = attention.BahdanauAttention( + query_size=self.recurrent_size, value_size=self.enc_recurrent_size, proj_size=50 + ) + elif desc_attn == "mha": + self.desc_attn = attention.MultiHeadedAttention( + h=8, query_size=self.recurrent_size, value_size=self.enc_recurrent_size + ) + elif desc_attn == "mha-1h": + self.desc_attn = attention.MultiHeadedAttention( + h=1, query_size=self.recurrent_size, value_size=self.enc_recurrent_size + ) + elif desc_attn == "sep": + self.question_attn = attention.MultiHeadedAttention( + h=1, query_size=self.recurrent_size, value_size=self.enc_recurrent_size + ) + self.schema_attn = attention.MultiHeadedAttention( + h=1, query_size=self.recurrent_size, value_size=self.enc_recurrent_size + ) + else: + # TODO: Figure out how to get right sizes (query, value) to module + self.desc_attn = desc_attn + self.sup_att = sup_att + + self.rule_logits = paddle.nn.Sequential( + paddle.nn.Linear(self.recurrent_size, self.rule_emb_size), + paddle.nn.Tanh(), + paddle.nn.Linear(self.rule_emb_size, len(self.rules_index)), + ) + self.rule_embedding = paddle.nn.Embedding( + num_embeddings=len(self.rules_index), embedding_dim=self.rule_emb_size + ) + + self.gen_logodds = paddle.nn.Linear(self.recurrent_size, 1) + self.terminal_logits = paddle.nn.Sequential( + paddle.nn.Linear(self.recurrent_size, self.rule_emb_size), + paddle.nn.Tanh(), + paddle.nn.Linear(self.rule_emb_size, len(self.terminal_vocab)), + ) + self.terminal_embedding = paddle.nn.Embedding( + num_embeddings=len(self.terminal_vocab), embedding_dim=self.rule_emb_size + ) + if copy_pointer is None: + self.copy_pointer = attention.BahdanauPointer( + query_size=self.recurrent_size, key_size=self.enc_recurrent_size, proj_size=50 + ) + else: + # TODO: Figure out how to get right sizes (query, key) to module + self.copy_pointer = copy_pointer + if multi_loss_type == "logsumexp": + self.multi_loss_reduction = lambda logprobs: -paddle.logsumexp(logprobs, axis=1) + elif multi_loss_type == "mean": + self.multi_loss_reduction = lambda logprobs: -paddle.mean(logprobs, axis=1) + + self.pointers = {} + self.pointer_action_emb_proj = {} + for pointer_type in self.preproc.grammar.pointers: + self.pointers[pointer_type] = attention.ScaledDotProductPointer( + 
query_size=self.recurrent_size, key_size=self.enc_recurrent_size + ) + self.pointer_action_emb_proj[pointer_type] = paddle.nn.Linear(self.enc_recurrent_size, self.rule_emb_size) + setattr(self, pointer_type + "_pointer", self.pointers[pointer_type]) + setattr(self, pointer_type + "_action_emb_proj", self.pointer_action_emb_proj[pointer_type]) + + self.node_type_embedding = paddle.nn.Embedding( + num_embeddings=len(self.node_type_vocab), embedding_dim=self.node_emb_size + ) + + # TODO batching + self.zero_rule_emb = paddle.zeros([1, self.rule_emb_size]) + self.zero_recurrent_emb = paddle.zeros([1, self.recurrent_size]) + if loss_type == "softmax": + self.xent_loss = paddle.nn.CrossEntropyLoss(reduction="none") + elif loss_type == "entmax": + raise ValueError("entmax is not supported") + # self.xent_loss = entmax.entmax15_loss + elif loss_type == "sparsemax": + raise ValueError("sparsemax is not supported") + # self.xent_loss = entmax.sparsemax_loss + elif loss_type == "label_smooth": + self.xent_loss = self.label_smooth_loss + + def label_smooth_loss(self, X, target, smooth_value=0.1): + """label smooth loss""" + if self.training: + logits = paddle.log_softmax(X, axis=1) + size = X.size()[1] + one_hot = paddle.full(X.size(), smooth_value / (size - 1)).to(X.device) + one_hot.scatter_(1, target.unsqueeze(0), 1 - smooth_value) + loss = F.kl_div(logits, one_hot, reduction="batchmean") + return loss.unsqueeze(0) + else: + return paddle.nn.functional.cross_entropy(X, target, reduction="none") + + @classmethod + def _calculate_rules(cls, preproc): + """calculate rules""" + offset = 0 + + all_rules = [] + rules_mask = {} + + # Rules of the form: + # expr -> Attribute | Await | BinOp | BoolOp | ... + # expr_seq_elem -> Attribute | Await | ... | Template1 | Template2 | ... + for parent, children in sorted(preproc.sum_type_constructors.items()): + assert parent not in rules_mask + rules_mask[parent] = (offset, offset + len(children)) + offset += len(children) + all_rules += [(parent, child) for child in children] + + # Rules of the form: + # FunctionDef + # -> identifier name, arguments args + # | identifier name, arguments args, stmt* body + # | identifier name, arguments args, expr* decorator_list + # | identifier name, arguments args, expr? returns + # ... 
+ # | identifier name, arguments args, stmt* body, expr* decorator_list, expr returns + for name, field_presence_infos in sorted(preproc.field_presence_infos.items()): + assert name not in rules_mask + rules_mask[name] = (offset, offset + len(field_presence_infos)) + offset += len(field_presence_infos) + all_rules += [(name, presence) for presence in field_presence_infos] + + # Rules of the form: + # stmt* -> stmt + # | stmt stmt + # | stmt stmt stmt + for seq_type_name, lengths in sorted(preproc.seq_lengths.items()): + assert seq_type_name not in rules_mask + rules_mask[seq_type_name] = (offset, offset + len(lengths)) + offset += len(lengths) + all_rules += [(seq_type_name, i) for i in lengths] + + return all_rules, rules_mask + + def compute_loss(self, enc_input, example, desc_enc, debug=False): + """train main""" + if not (self.enumerate_order and self.training): + mle_loss = self.compute_mle_loss(enc_input, example, desc_enc, debug) + else: + mle_loss = self.compute_loss_from_all_ordering(enc_input, example, desc_enc, debug) + + if self.use_align_loss: + align_loss = self.compute_align_loss(desc_enc, example) + return mle_loss + align_loss + return mle_loss + + def compute_loss_from_all_ordering(self, enc_input, example, desc_enc, debug): + """compute loss from all ordering""" + + def get_permutations(node): + """get permutations""" + + def traverse_tree(node): + """traverse tree""" + nonlocal permutations + if isinstance(node, (list, tuple)): + p = itertools.permutations(range(len(node))) + permutations.append(list(p)) + for child in node: + traverse_tree(child) + elif isinstance(node, dict): + for node_name in node: + traverse_tree(node[node_name]) + + permutations = [] + traverse_tree(node) + return permutations + + def get_perturbed_tree(node, permutation): + """get perturbed tree""" + + def traverse_tree(node, parent_type, parent_node): + """traverse tree""" + if isinstance(node, (list, tuple)): + nonlocal permutation + p_node = [node[i] for i in permutation[0]] + parent_node[parent_type] = p_node + permutation = permutation[1:] + for child in node: + traverse_tree(child, None, None) + elif isinstance(node, dict): + for node_name in node: + traverse_tree(node[node_name], node_name, node) + + node = copy.deepcopy(node) + traverse_tree(node, None, None) + return node + + orig_tree = example.tree + permutations = get_permutations(orig_tree) + products = itertools.product(*permutations) + loss_list = [] + for product in products: + tree = get_perturbed_tree(orig_tree, product) + example.tree = tree + loss = self.compute_mle_loss(enc_input, example, desc_enc) + loss_list.append(loss) + example.tree = orig_tree + loss_v = paddle.stack(loss_list, 0) + return paddle.logsumexp(loss_v, 0) + + def compute_mle_loss(self, enc_input, example, desc_enc, debug=False): + """compute mle loss""" + traversal = TrainTreeTraversal(self, desc_enc, debug) + traversal.step(None) + # for debug + # class List(list): + # def __init__(self, *args, **kwargs): + # """ """ + # super(List, self).__init__(*args, **kwargs) + + # def append(self, *args, **kwargs): + # """ """ + # super().append(*args, **kwargs) + # print('append:', list(reversed(self))) + + # def pop(self): + # """ """ + # print('pop:', list(reversed(self))) + # item = super().pop() + # return item + # + # queue = List() + # queue.append( + # TreeState( + # node=example.tree, + # parent_field_type=self.preproc.grammar.root_type, + # )) + + queue = [ + TreeState( + node=example.tree, + parent_field_type=self.preproc.grammar.root_type, + ) + ] + while 
queue: + item = queue.pop() + node = item.node + parent_field_type = item.parent_field_type + + if isinstance(node, (list, tuple)): + node_type = parent_field_type + "*" + rule = (node_type, len(node)) + rule_idx = self.rules_index[rule] + assert traversal.cur_item.state == TreeTraversal.State.LIST_LENGTH_APPLY + traversal.step(rule_idx) + + if self.preproc.use_seq_elem_rules and parent_field_type in self.ast_wrapper.sum_types: + parent_field_type += "_seq_elem" + + for i, elem in reversed(list(enumerate(node))): + queue.append( + TreeState( + node=elem, + parent_field_type=parent_field_type, + ) + ) + continue + + if parent_field_type in self.preproc.grammar.pointers: + assert isinstance(node, int) + assert traversal.cur_item.state == TreeTraversal.State.POINTER_APPLY + pointer_map = desc_enc.pointer_maps.get(parent_field_type) + # TODO: fix -1 + if node == -1: + node = 0 + if pointer_map: + values = pointer_map[node] + traversal.step(values[0]) + else: + traversal.step(node) + continue + + if parent_field_type in self.ast_wrapper.primitive_types: + # identifier, int, string, bytes, object, singleton + # - could be bytes, str, int, float, bool, NoneType + # - terminal tokens vocabulary is created by turning everything into a string (with `str`) + # - at decoding time, cast back to str/int/float/bool + field_value_split = self.preproc.grammar.tokenize_field_value(node) + [vocab.EOS] + + for token in field_value_split: + assert traversal.cur_item.state == TreeTraversal.State.GEN_TOKEN + traversal.step(token) + continue + + type_info = self.ast_wrapper.singular_types[node["_type"]] + + if parent_field_type in self.preproc.sum_type_constructors: + # ApplyRule, like expr -> Call + rule = (parent_field_type, type_info.name) + rule_idx = self.rules_index[rule] + assert traversal.cur_item.state == TreeTraversal.State.SUM_TYPE_APPLY + extra_rules = [ + self.rules_index[parent_field_type, extra_type] for extra_type in node.get("_extra_types", []) + ] + traversal.step(rule_idx, extra_rules) + + if type_info.fields: + # ApplyRule, like Call -> expr[func] expr*[args] keyword*[keywords] + # Figure out which rule needs to be applied + present = sql_preproc_v2.get_field_presence_info(self.ast_wrapper, node, type_info.fields) + rule = (node["_type"], tuple(present)) + rule_idx = self.rules_index[rule] + assert traversal.cur_item.state == TreeTraversal.State.CHILDREN_APPLY + traversal.step(rule_idx) + + # reversed so that we perform a DFS in left-to-right order + for field_info in reversed(type_info.fields): + if field_info.name not in node: + continue + + queue.append( + TreeState( + node=node[field_info.name], + parent_field_type=field_info.type, + ) + ) + + loss = paddle.sum(paddle.stack(tuple(traversal.loss), axis=0), axis=0) + if debug: + return loss, [attr.asdict(entry) for entry in traversal.history] + else: + return loss + + def inference(self, desc_enc, db, value_list): + """infer main""" + traversal = InferenceTreeTraversal(self, desc_enc, db, value_list) + choices = traversal.step(None) + return traversal, choices + + def _desc_attention(self, prev_state, desc_enc): + """desc attention + + Args: + prev_state (tuple): (prev_hidden, prev_cell_state) + desc_enc (encoder.EncoderState) + """ + # prev_state shape: + # - h_n: batch (=1) x emb_size + # - c_n: batch (=1) x emb_size + query = prev_state[0] + if self.attn_type != "sep": + return self.desc_attn(query, desc_enc.memory, attn_mask=None) + else: + question_context, question_attention_logits = self.question_attn(query, desc_enc.question_memory) 
+            schema_context, schema_attention_logits = self.schema_attn(query, desc_enc.schema_memory)
+            return question_context + schema_context, schema_attention_logits
+
+    def _tensor(self, data, dtype=None):
+        """create a new tensor"""
+        return paddle.to_tensor(data, dtype=dtype)
+
+    def _index(self, vocab, word):
+        """get token id"""
+        return self._tensor([vocab.index(word)])
+
+    def _update_state(self, node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc):
+        """update state"""
+        # desc_context: shape = batch (=1) x emb_size
+        desc_context, attention_logits = self._desc_attention(prev_state, desc_enc)
+        # node_type_emb: shape = batch (=1) x emb_size
+        node_type_emb = self.node_type_embedding(self._index(self.node_type_vocab, node_type))
+
+        # Build the input used to update the LSTM state
+        state_input = paddle.concat(
+            (
+                prev_action_emb,  # a_{t-1}: rule_emb_size
+                desc_context,  # c_t: enc_recurrent_size
+                parent_h,  # s_{p_t}: recurrent_size
+                parent_action_emb,  # a_{p_t}: rule_emb_size
+                node_type_emb,  # n_{f-t}: node_emb_size
+            ),
+            axis=-1,
+        )
+        # state_input shape: batch (=1) x (emb_size * 5)
+        _, new_state = self.state_update(state_input, prev_state)
+        return new_state, attention_logits
+
+    def apply_rule(self, node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc):
+        """apply rule"""
+        new_state, attention_logits = self._update_state(
+            node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc
+        )
+        # output shape: batch (=1) x emb_size
+        output = new_state[0]
+        # rule_logits shape: batch (=1) x num choices
+        rule_logits = self.rule_logits(output)
+
+        return output, new_state, rule_logits
+
+    def rule_infer(self, node_type, rule_logits):
+        """rule infer"""
+        rule_logprobs = paddle.nn.functional.log_softmax(rule_logits, axis=-1)
+        rules_start, rules_end = self.preproc.rules_mask[node_type]
+
+        # TODO: Mask other probabilities first?
+ return list(zip(range(rules_start, rules_end), rule_logprobs[0, rules_start:rules_end])) + + def gen_token(self, node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc): + """gen token""" + new_state, attention_logits = self._update_state( + node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc + ) + # output shape: batch (=1) x emb_size + output = new_state[0] + + # gen_logodds shape: batch (=1) + gen_logodds = self.gen_logodds(output).squeeze(1) + + return new_state, output, gen_logodds + + def gen_token_loss(self, output, gen_logodds, token, desc_enc): + """gen token loss""" + # token_idx shape: batch (=1), LongTensor + token_idx = self._index(self.terminal_vocab, token) + # action_emb shape: batch (=1) x emb_size + + # +unk, +in desc: copy + # +unk, -in desc: gen (an unk token) + # -unk, +in desc: copy, gen + # -unk, -in desc: gen + # gen_logodds shape: batch (=1) + desc_locs = desc_enc.find_word_occurrences(token) + if desc_locs: + # copy: if the token appears in the description at least once + # copy_loc_logits shape: batch (=1) x desc length + copy_loc_logits = self.copy_pointer(output, desc_enc.memory) + copy_logprob = ( + # log p(copy | output) + # shape: batch (=1) + paddle.nn.functional.log_sigmoid(-gen_logodds) + - self.xent_loss(copy_loc_logits, self._tensor(desc_locs[0:1])) + # xent_loss: -log p(location | output) + # TODO: sum the probability of all occurrences + # shape: batch (=1) + ) + else: + copy_logprob = None + + # gen: ~(unk & in desc), equivalent to ~unk | ~in desc + if token in self.terminal_vocab or copy_logprob is None: + token_logits = self.terminal_logits(output) + # shape: + gen_logprob = ( + # log p(gen | output) + # shape: batch (=1) + paddle.nn.functional.log_sigmoid(gen_logodds) + - self.xent_loss(token_logits, token_idx) + # xent_loss: -log p(token | output) + # shape: batch (=1) + ) + else: + gen_logprob = None + + # loss should be -log p(...), so negate + loss_piece = -paddle.logsumexp(maybe_stack([copy_logprob, gen_logprob], axis=1), axis=1) + return loss_piece + + def token_infer(self, output, gen_logodds, desc_enc): + """token infer""" + # Copy tokens + # log p(copy | output) + # shape: batch (=1) + copy_logprob = paddle.nn.functional.log_sigmoid(-gen_logodds) + copy_loc_logits = self.copy_pointer(output, desc_enc.memory) + # log p(loc_i | copy, output) + # shape: batch (=1) x seq length + copy_loc_logprobs = paddle.nn.functional.log_softmax(copy_loc_logits, axis=-1) + # log p(loc_i, copy | output) + copy_loc_logprobs += copy_logprob + + log_prob_by_word = {} + # accumulate_logprobs is needed because the same word may appear + # multiple times in desc_enc.words. 
+ accumulate_logprobs(log_prob_by_word, zip(desc_enc.words, copy_loc_logprobs.squeeze(0))) + + # Generate tokens + # log p(~copy | output) + # shape: batch (=1) + gen_logprob = paddle.nn.functional.log_sigmoid(gen_logodds) + token_logits = self.terminal_logits(output) + # log p(v | ~copy, output) + # shape: batch (=1) x vocab size + token_logprobs = paddle.nn.functional.log_softmax(token_logits, axis=-1) + # log p(v, ~copy| output) + # shape: batch (=1) x vocab size + token_logprobs += gen_logprob + + accumulate_logprobs( + log_prob_by_word, + ((self.terminal_vocab[idx], token_logprobs[0, idx]) for idx in range(token_logprobs.shape[1])), + ) + + return list(log_prob_by_word.items()) + + def compute_pointer(self, node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc): + """compute pointer""" + new_state, attention_logits = self._update_state( + node_type, prev_state, prev_action_emb, parent_h, parent_action_emb, desc_enc + ) + # output shape: batch (=1) x emb_size + output = new_state[0] + # pointer_logits shape: batch (=1) x num choices + pointer_logits = self.pointers[node_type](output, desc_enc.pointer_memories[node_type]) + + return output, new_state, pointer_logits, attention_logits + + def pointer_infer(self, node_type, logits): + """pointer infer""" + logprobs = paddle.nn.functional.log_softmax(logits, axis=-1) + # TODO batching + return list(zip(range(logits.shape[1]), logprobs[0])) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/infer_tree_traversal.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/infer_tree_traversal.py new file mode 100644 index 0000000000000000000000000000000000000000..654472b58e0f26855d98495448bc1539964d7a19 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/infer_tree_traversal.py @@ -0,0 +1,216 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
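+
+# NOTE: InferenceTreeTraversal below records a sequence of TreeAction objects
+# while the decoder walks the grammar, and finalize() replays them to rebuild
+# the AST dict and unparse it into a SQL string. A rough usage sketch, assuming
+# the decoder's inference()/step()/finalize() from this codebase; the greedy
+# `pick_best` driver is only an illustration, not part of this module:
+#
+#     traversal, choices = model.inference(desc_enc, db, value_list)
+#     while choices is not None:
+#         choice = pick_best(choices)   # e.g. take the candidate with max log-prob
+#         choices = traversal.step(choice)
+#     tree, sql_query = traversal.finalize()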
+ +import attr +import pyrsistent +import paddle + +from text2sql.models.sql_decoder.tree_traversal import TreeTraversal +from text2sql.dataproc import vocab + + +class InferenceTreeTraversal(TreeTraversal): + """InferenceTreeTraversal""" + + class TreeAction: + pass + + @attr.s(frozen=True) + class SetParentField(TreeAction): + """SetParentField""" + + parent_field_name = attr.ib() + node_type = attr.ib() + node_value = attr.ib(default=None) + + @attr.s(frozen=True) + class CreateParentFieldList(TreeAction): + """CreateParentFieldList""" + + parent_field_name = attr.ib() + + @attr.s(frozen=True) + class AppendTerminalToken(TreeAction): + """AppendTerminalToken""" + + parent_field_name = attr.ib() + value = attr.ib() + + @attr.s(frozen=True) + class FinalizeTerminal(TreeAction): + """FinalizeTerminal""" + + parent_field_name = attr.ib() + terminal_type = attr.ib() + + @attr.s(frozen=True) + class NodeFinished(TreeAction): + """NodeFinished""" + + pass + + SIMPLE_TERMINAL_TYPES = { + "str": str, + "int": int, + "float": float, + "bool": lambda n: {"True": True, "False": False}.get(n, False), + } + + SIMPLE_TERMINAL_TYPES_DEFAULT = { + "str": "", + "int": 0, + "float": 0, + "bool": True, + } + + def __init__(self, model, desc_enc, db=None, value_list=None): + """__init__""" + super().__init__(model, desc_enc) + self.actions = pyrsistent.pvector() + self.db = db + self.value_list = value_list + + def clone(self): + """clone""" + super_clone = super().clone() + super_clone.actions = self.actions + super_clone.db = self.db + super_clone.value_list = self.value_list + return super_clone + + def rule_choice(self, node_type, rule_logits): + """rule_choice""" + return self.model.rule_infer(node_type, rule_logits) + + def token_choice(self, output, gen_logodds): + """token_choice""" + return self.model.token_infer(output, gen_logodds, self.desc_enc) + + def pointer_choice(self, node_type, logits, attention_logits): + """pointer_choice""" + # Group them based on pointer map + pointer_logprobs = self.model.pointer_infer(node_type, logits) + pointer_map = self.desc_enc.pointer_maps.get(node_type) + if not pointer_map: + return pointer_logprobs + + pointer_logprobs = dict(pointer_logprobs) + return [ + ( + orig_index, + paddle.logsumexp(paddle.stack(tuple(pointer_logprobs[i] for i in mapped_indices), axis=0), axis=0), + ) + for orig_index, mapped_indices in pointer_map.items() + ] + + def update_using_last_choice(self, last_choice, extra_choice_info, attention_offset): + """update_using_last_choice""" + super().update_using_last_choice(last_choice, extra_choice_info, attention_offset) + + # Record actions + # CHILDREN_INQUIRE + if self.cur_item.state == TreeTraversal.State.CHILDREN_INQUIRE: + self.actions = self.actions.append( + self.SetParentField(self.cur_item.parent_field_name, self.cur_item.node_type) + ) + type_info = self.model.ast_wrapper.singular_types[self.cur_item.node_type] + if not type_info.fields: + self.actions = self.actions.append(self.NodeFinished()) + + # LIST_LENGTH_APPLY + elif self.cur_item.state == TreeTraversal.State.LIST_LENGTH_APPLY: + self.actions = self.actions.append(self.CreateParentFieldList(self.cur_item.parent_field_name)) + + # GEN_TOKEN + elif self.cur_item.state == TreeTraversal.State.GEN_TOKEN: + if last_choice == vocab.EOS: + self.actions = self.actions.append( + self.FinalizeTerminal(self.cur_item.parent_field_name, self.cur_item.node_type) + ) + elif last_choice is not None: + self.actions = self.actions.append( + 
self.AppendTerminalToken(self.cur_item.parent_field_name, last_choice) + ) + + elif self.cur_item.state == TreeTraversal.State.POINTER_APPLY: + self.actions = self.actions.append( + self.SetParentField(self.cur_item.parent_field_name, node_type=None, node_value=last_choice) + ) + + # NODE_FINISHED + elif self.cur_item.state == TreeTraversal.State.NODE_FINISHED: + self.actions = self.actions.append(self.NodeFinished()) + + def finalize(self): + """finalize""" + root = current = None + stack = [] + for i, action in enumerate(self.actions): + if isinstance(action, self.SetParentField): + if action.node_value is None: + new_node = {"_type": action.node_type} + else: + new_node = action.node_value + + if action.parent_field_name is None: + # Initial node in tree. + assert root is None + root = current = new_node + stack.append(root) + continue + + existing_list = current.get(action.parent_field_name) + if existing_list is None: + current[action.parent_field_name] = new_node + else: + assert isinstance(existing_list, list) + current[action.parent_field_name].append(new_node) + + if action.node_value is None: + stack.append(current) + current = new_node + + elif isinstance(action, self.CreateParentFieldList): + current[action.parent_field_name] = [] + + elif isinstance(action, self.AppendTerminalToken): + tokens = current.get(action.parent_field_name) + if tokens is None: + tokens = current[action.parent_field_name] = [] + tokens.append('"' + action.value + '"') + + elif isinstance(action, self.FinalizeTerminal): + terminal = "".join(current.get(action.parent_field_name, [])) + constructor = self.SIMPLE_TERMINAL_TYPES.get(action.terminal_type) + if constructor: + try: + value = constructor(terminal) + except ValueError: + value = self.SIMPLE_TERMINAL_TYPES_DEFAULT[action.terminal_type] + elif action.terminal_type == "bytes": + value = terminal.decode("latin1") + elif action.terminal_type == "NoneType": + value = None + else: + raise ValueError(f"Unknown terminal type: {action.terminal_type}") + current[action.parent_field_name] = value + + elif isinstance(action, self.NodeFinished): + current = stack.pop() + + else: + raise ValueError(action) + + assert not stack + return root, self.model.preproc.grammar.unparse(root, self.db, self.value_list) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/train_tree_traversal.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/train_tree_traversal.py new file mode 100644 index 0000000000000000000000000000000000000000..259322bfb05e4b482f2b05596862965dcd683ea9 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/train_tree_traversal.py @@ -0,0 +1,132 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
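+
+# NOTE: TrainTreeTraversal below is the teacher-forced counterpart of
+# InferenceTreeTraversal: at every decision point it stores an XentChoicePoint
+# or TokenChoicePoint, and update_using_last_choice() turns the gold choice
+# into a loss term appended to self.loss. A rough sketch of how the decoder
+# drives it during compute_mle_loss (gold_rule_idx stands in for the gold
+# decisions read off example.tree):
+#
+#     traversal = TrainTreeTraversal(model, desc_enc)
+#     traversal.step(None)            # prime the traversal at the grammar root
+#     traversal.step(gold_rule_idx)   # then feed gold rules / tokens / pointers
+#     ...
+#     loss = paddle.sum(paddle.stack(tuple(traversal.loss), axis=0), axis=0)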
+ +import operator + +import attr +import pyrsistent +import paddle + +from text2sql.models.sql_decoder.tree_traversal import TreeTraversal + + +@attr.s +class ChoiceHistoryEntry: + """ChoiceHistoryEntry""" + + rule_left = attr.ib() + choices = attr.ib() + probs = attr.ib() + valid_choices = attr.ib() + + +class TrainTreeTraversal(TreeTraversal): + """TrainTreeTraversal""" + + @attr.s(frozen=True) + class XentChoicePoint: + """XentChoicePoint""" + + logits = attr.ib() + weight = attr.ib(default=1.0) + + def compute_loss(self, outer, idx, extra_indices): + """compute loss""" + if extra_indices: + logprobs = paddle.nn.functional.log_softmax(self.logits, axis=1) + valid_logprobs = logprobs[:, [idx] + extra_indices] + return self.weight * outer.model.multi_loss_reduction(valid_logprobs) + else: + # idx shape: batch (=1) + idx = outer.model._tensor([idx]) + # loss_piece shape: batch (=1) + loss = outer.model.xent_loss(self.logits, idx) + return self.weight * loss + + @attr.s(frozen=True) + class TokenChoicePoint: + """TokenChoicePoint""" + + lstm_output = attr.ib() + gen_logodds = attr.ib() + + def compute_loss(self, outer, token, extra_tokens): + """compute loss""" + return outer.model.gen_token_loss(self.lstm_output, self.gen_logodds, token, outer.desc_enc) + + def __init__(self, model, desc_enc, debug=False): + """__init__""" + super().__init__(model, desc_enc) + self.choice_point = None + self.loss = pyrsistent.pvector() + + self.debug = debug + self.history = pyrsistent.pvector() + + def clone(self): + """clone""" + super_clone = super().clone() + super_clone.choice_point = self.choice_point + super_clone.loss = self.loss + super_clone.debug = self.debug + super_clone.history = self.history + return super_clone + + def rule_choice(self, node_type, rule_logits): + """rule_choice""" + self.choice_point = self.XentChoicePoint(rule_logits) + if self.debug: + choices = [] + probs = [] + for rule_idx, logprob in sorted( + self.model.rule_infer(node_type, rule_logits), key=operator.itemgetter(1), reverse=True + ): + _, rule = self.model.preproc.all_rules[rule_idx] + choices.append(rule) + probs.append(logprob.exp().item()) + self.history = self.history.append(ChoiceHistoryEntry(node_type, choices, probs, None)) + + def token_choice(self, output, gen_logodds): + """token_choice""" + self.choice_point = self.TokenChoicePoint(output, gen_logodds) + + def pointer_choice(self, node_type, logits, attention_logits): + """pointer_choice""" + loss_weight = 1.0 + if node_type == "value": + loss_weight = 2.0 + self.choice_point = self.XentChoicePoint(logits, weight=loss_weight) + self.attention_choice = self.XentChoicePoint(attention_logits, weight=loss_weight) + + def update_using_last_choice(self, last_choice, extra_choice_info, attention_offset): + """update_using_last_choice""" + super().update_using_last_choice(last_choice, extra_choice_info, attention_offset) + if last_choice is None: + return + + if self.debug and isinstance(self.choice_point, self.XentChoicePoint): + valid_choice_indices = [last_choice] + ([] if extra_choice_info is None else extra_choice_info) + self.history[-1].valid_choices = [ + self.model.preproc.all_rules[rule_idx][1] for rule_idx in valid_choice_indices + ] + + self.loss = self.loss.append(self.choice_point.compute_loss(self, last_choice, extra_choice_info)) + + if attention_offset is not None and self.attention_choice is not None: + self.loss = self.loss.append( + self.attention_choice.compute_loss(self, attention_offset, extra_indices=None) + ) + + self.choice_point = None 
+ self.attention_choice = None diff --git a/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/tree_traversal.py b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/tree_traversal.py new file mode 100644 index 0000000000000000000000000000000000000000..47ec81d78b6e1c3db76bba0ef7da0dcc8372a4e5 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/models/sql_decoder/tree_traversal.py @@ -0,0 +1,429 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import enum + +import attr +import pyrsistent +from text2sql.dataproc import vocab +from text2sql.utils import nn_utils + + +@attr.s +class TreeState: + node = attr.ib() + parent_field_type = attr.ib() + + +class TreeTraversal: + class Handler: + handlers = {} + + @classmethod + def register_handler(cls, func_type): + """register_handler""" + if func_type in cls.handlers: + raise RuntimeError(f"{func_type} handler is already registered") + + def inner_func(func): + """inner_func""" + cls.handlers[func_type] = func.__name__ + return func + + return inner_func + + @attr.s(frozen=True) + class QueueItem: + item_id = attr.ib() + state = attr.ib() + node_type = attr.ib() + parent_action_emb = attr.ib() + parent_h = attr.ib() # parent hidden state + parent_field_name = attr.ib() + + def to_str(self): + """to_str""" + return f"" + + class State(enum.Enum): + """State""" + + SUM_TYPE_INQUIRE = 0 + SUM_TYPE_APPLY = 1 + CHILDREN_INQUIRE = 2 + CHILDREN_APPLY = 3 + LIST_LENGTH_INQUIRE = 4 + LIST_LENGTH_APPLY = 5 + GEN_TOKEN = 6 + POINTER_INQUIRE = 7 + POINTER_APPLY = 8 + NODE_FINISHED = 9 + + def __init__(self, model, desc_enc): + """__init__""" + if model is None: + return + + self.model = model + self.desc_enc = desc_enc + + # TODO: model.state_update.set_dropout_masks(batch_size=1) + self.recurrent_state = nn_utils.lstm_init(None, self.model.recurrent_size, 1) + self.prev_action_emb = model.zero_rule_emb + + root_type = model.preproc.grammar.root_type + if root_type in model.preproc.ast_wrapper.sum_types: + initial_state = TreeTraversal.State.SUM_TYPE_INQUIRE + else: + initial_state = TreeTraversal.State.CHILDREN_INQUIRE + + self.queue = pyrsistent.pvector() + self.cur_item = TreeTraversal.QueueItem( + item_id=0, + state=initial_state, + node_type=root_type, + parent_action_emb=self.model.zero_rule_emb, + parent_h=self.model.zero_recurrent_emb, + parent_field_name=None, + ) + self.next_item_id = 1 + + self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_apply_rule + + def clone(self): + """clone""" + other = self.__class__(None, None) + other.model = self.model + other.desc_enc = self.desc_enc + other.recurrent_state = self.recurrent_state + other.prev_action_emb = self.prev_action_emb + other.queue = self.queue + other.cur_item = self.cur_item + other.next_item_id = self.next_item_id + other.actions = self.actions + other.update_prev_action_emb = self.update_prev_action_emb + return other + + def step(self, last_choice, extra_choice_info=None, 
attention_offset=None):
+        """step"""
+        while True:
+            self.update_using_last_choice(last_choice, extra_choice_info, attention_offset)
+
+            # Dispatch to the handler registered for the current state. Handlers
+            # return (choices, continued): when continued is True the traversal
+            # advanced on its own and the loop keeps going without a new choice;
+            # otherwise the choices are returned to the caller.
+            handler_name = TreeTraversal.Handler.handlers[self.cur_item.state]
+            handler = getattr(self, handler_name)
+            choices, continued = handler(last_choice)
+            if continued:
+                last_choice = choices
+                continue
+            else:
+                return choices
+
+    def update_using_last_choice(self, last_choice, extra_choice_info, attention_offset):
+        """update_using_last_choice"""
+        if last_choice is None:
+            return
+        self.update_prev_action_emb(self, last_choice, extra_choice_info)
+
+    @classmethod
+    def _update_prev_action_emb_apply_rule(cls, self, last_choice, extra_choice_info):
+        """_update_prev_action_emb_apply_rule"""
+        rule_idx = self.model._tensor([last_choice])
+        self.prev_action_emb = self.model.rule_embedding(rule_idx)
+
+    @classmethod
+    def _update_prev_action_emb_gen_token(cls, self, last_choice, extra_choice_info):
+        """_update_prev_action_emb_gen_token"""
+        # token_idx shape: batch (=1), LongTensor
+        token_idx = self.model._index(self.model.terminal_vocab, last_choice)
+        # action_emb shape: batch (=1) x emb_size
+        self.prev_action_emb = self.model.terminal_embedding(token_idx)
+
+    @classmethod
+    def _update_prev_action_emb_pointer(cls, self, last_choice, extra_choice_info):
+        """_update_prev_action_emb_pointer"""
+        # TODO batching
+        self.prev_action_emb = self.model.pointer_action_emb_proj[self.cur_item.node_type](
+            self.desc_enc.pointer_memories[self.cur_item.node_type][:, last_choice]
+        )
+
+    def pop(self):
+        """pop"""
+        # Pop the last item off the traversal queue and make it the current item.
+        if len(self.queue) > 0:
+            self.cur_item = self.queue[-1]
+            self.queue = self.queue.delete(len(self.queue) - 1)
+            return True
+        return False
+
+    @Handler.register_handler(State.SUM_TYPE_INQUIRE)
+    def process_sum_inquire(self, last_choice):
+        """process_sum_inquire"""
+        # 1. ApplyRule, like expr -> Call
+        # a. Ask which one to choose
+        output, self.recurrent_state, rule_logits = self.model.apply_rule(
+            self.cur_item.node_type,
+            self.recurrent_state,
+            self.prev_action_emb,
+            self.cur_item.parent_h,
+            self.cur_item.parent_action_emb,
+            self.desc_enc,
+        )
+        self.cur_item = attr.evolve(self.cur_item, state=TreeTraversal.State.SUM_TYPE_APPLY, parent_h=output)
+
+        self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_apply_rule
+        choices = self.rule_choice(self.cur_item.node_type, rule_logits)
+        return choices, False
+
+    @Handler.register_handler(State.SUM_TYPE_APPLY)
+    def process_sum_apply(self, last_choice):
+        """process_sum_apply"""
+        # b. Add action, prepare for #2
+        sum_type, singular_type = self.model.preproc.all_rules[last_choice]
+        assert sum_type == self.cur_item.node_type
+
+        self.cur_item = attr.evolve(
+            self.cur_item,
+            node_type=singular_type,
+            parent_action_emb=self.prev_action_emb,
+            state=TreeTraversal.State.CHILDREN_INQUIRE,
+        )
+        return None, True
+
+    @Handler.register_handler(State.CHILDREN_INQUIRE)
+    def process_children_inquire(self, last_choice):
+        """process_children_inquire"""
+        # 2. ApplyRule, like Call -> expr[func] expr*[args] keyword*[keywords]
+        # Check if we have no children
+        type_info = self.model.ast_wrapper.singular_types[self.cur_item.node_type]
+        if not type_info.fields:
+            if self.pop():
+                last_choice = None
+                return last_choice, True
+            else:
+                return None, False
+
+        # a. Ask about presence
+        output, self.recurrent_state, rule_logits = self.model.apply_rule(
+            self.cur_item.node_type,
+            self.recurrent_state,
+            self.prev_action_emb,
+            self.cur_item.parent_h,
+            self.cur_item.parent_action_emb,
+            self.desc_enc,
+        )
+        self.cur_item = attr.evolve(self.cur_item, state=TreeTraversal.State.CHILDREN_APPLY, parent_h=output)
+
+        self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_apply_rule
+        choices = self.rule_choice(self.cur_item.node_type, rule_logits)
+        return choices, False
+
+    @Handler.register_handler(State.CHILDREN_APPLY)
+    def process_children_apply(self, last_choice):
+        """process_children_apply"""
+        # b. Create the children
+        node_type, children_presence = self.model.preproc.all_rules[last_choice]
+        assert node_type == self.cur_item.node_type
+
+        self.queue = self.queue.append(
+            TreeTraversal.QueueItem(
+                item_id=self.cur_item.item_id,
+                state=TreeTraversal.State.NODE_FINISHED,
+                node_type=None,
+                parent_action_emb=None,
+                parent_h=None,
+                parent_field_name=None,
+            )
+        )
+        for field_info, present in reversed(
+            list(
+                zip(
+                    self.model.ast_wrapper.singular_types[node_type].fields,
+                    children_presence,
+                )
+            )
+        ):
+            if not present:
+                continue
+
+            # seq field: LIST_LENGTH_INQUIRE x
+            # sum type: SUM_TYPE_INQUIRE x
+            # product type:
+            #   no children: not possible
+            #   children: CHILDREN_INQUIRE
+            # constructor type: not possible x
+            # builtin type: GEN_TOKEN x
+            child_type = field_type = field_info.type
+            if field_info.seq:
+                child_state = TreeTraversal.State.LIST_LENGTH_INQUIRE
+            elif field_type in self.model.ast_wrapper.sum_types:
+                child_state = TreeTraversal.State.SUM_TYPE_INQUIRE
+            elif field_type in self.model.ast_wrapper.product_types:
+                assert self.model.ast_wrapper.product_types[field_type].fields
+                child_state = TreeTraversal.State.CHILDREN_INQUIRE
+            elif field_type in self.model.preproc.grammar.pointers:
+                child_state = TreeTraversal.State.POINTER_INQUIRE
+            elif field_type in self.model.ast_wrapper.primitive_types:
+                child_state = TreeTraversal.State.GEN_TOKEN
+                child_type = present
+            else:
+                raise ValueError(f"Unable to handle field type {field_type}")
+
+            self.queue = self.queue.append(
+                TreeTraversal.QueueItem(
+                    item_id=self.next_item_id,
+                    state=child_state,
+                    node_type=child_type,
+                    parent_action_emb=self.prev_action_emb,
+                    parent_h=self.cur_item.parent_h,
+                    parent_field_name=field_info.name,
+                )
+            )
+            self.next_item_id += 1
+
+        # pop the last element of the queue and assign it to self.cur_item
+        advanced = self.pop()
+        assert advanced
+        last_choice = None
+        return last_choice, True
+
+    @Handler.register_handler(State.LIST_LENGTH_INQUIRE)
+    def process_list_length_inquire(self, last_choice):
+        """process_list_length_inquire"""
+        list_type = self.cur_item.node_type + "*"
+        output, self.recurrent_state, rule_logits = self.model.apply_rule(
+            list_type,
+            self.recurrent_state,
+            self.prev_action_emb,
+            self.cur_item.parent_h,
+            self.cur_item.parent_action_emb,
+            self.desc_enc,
+        )
+        self.cur_item = attr.evolve(self.cur_item, state=TreeTraversal.State.LIST_LENGTH_APPLY, parent_h=output)
+
+        self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_apply_rule
+        choices = self.rule_choice(list_type, rule_logits)
+        return choices, False
+
+    @Handler.register_handler(State.LIST_LENGTH_APPLY)
+    def process_list_length_apply(self, last_choice):
+        """process_list_length_apply"""
+        list_type, num_children = self.model.preproc.all_rules[last_choice]
+        elem_type = self.cur_item.node_type
+        assert list_type == elem_type + "*"
+
+        child_node_type = elem_type
+        if elem_type in self.model.ast_wrapper.sum_types:
+            child_state = TreeTraversal.State.SUM_TYPE_INQUIRE
+            if self.model.preproc.use_seq_elem_rules:
+                child_node_type = elem_type + "_seq_elem"
+        elif elem_type in self.model.ast_wrapper.product_types:
+            child_state = TreeTraversal.State.CHILDREN_INQUIRE
+        elif elem_type == "identifier":
+            child_state = TreeTraversal.State.GEN_TOKEN
+            child_node_type = "str"
+        elif elem_type in self.model.ast_wrapper.primitive_types:
+            raise ValueError("sequential builtin types not supported")
+        else:
+            raise ValueError(f"Unable to handle seq field type {elem_type}")
+
+        for i in range(num_children):
+            self.queue = 
self.queue.append( + TreeTraversal.QueueItem( + item_id=self.next_item_id, + state=child_state, + node_type=child_node_type, + parent_action_emb=self.prev_action_emb, + parent_h=self.cur_item.parent_h, + parent_field_name=self.cur_item.parent_field_name, + ) + ) + self.next_item_id += 1 + + advanced = self.pop() + assert advanced + last_choice = None + return last_choice, True + + @Handler.register_handler(State.GEN_TOKEN) + def process_gen_token(self, last_choice): + """process_gen_token""" + if last_choice == vocab.EOS: + if self.pop(): + last_choice = None + return last_choice, True + else: + return None, False + + self.recurrent_state, output, gen_logodds = self.model.gen_token( + self.cur_item.node_type, + self.recurrent_state, + self.prev_action_emb, + self.cur_item.parent_h, + self.cur_item.parent_action_emb, + self.desc_enc, + ) + self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_gen_token + choices = self.token_choice(output, gen_logodds) + return choices, False + + @Handler.register_handler(State.POINTER_INQUIRE) + def process_pointer_inquire(self, last_choice): + """process_pointer_inquire""" + # a. Ask which one to choose + output, self.recurrent_state, logits, attention_logits = self.model.compute_pointer_with_align( + self.cur_item.node_type, + self.recurrent_state, + self.prev_action_emb, + self.cur_item.parent_h, + self.cur_item.parent_action_emb, + self.desc_enc, + ) + self.cur_item = attr.evolve(self.cur_item, state=TreeTraversal.State.POINTER_APPLY, parent_h=output) + + self.update_prev_action_emb = TreeTraversal._update_prev_action_emb_pointer + choices = self.pointer_choice(self.cur_item.node_type, logits, attention_logits) + return choices, False + + @Handler.register_handler(State.POINTER_APPLY) + def process_pointer_apply(self, last_choice): + """process_pointer_apply""" + if self.pop(): + last_choice = None + return last_choice, True + else: + return None, False + + @Handler.register_handler(State.NODE_FINISHED) + def process_node_finished(self, last_choice): + """process_node_finished""" + if self.pop(): + last_choice = None + return last_choice, True + else: + return None, False + + def rule_choice(self, node_type, rule_logits): + """rule_choice""" + raise NotImplementedError + + def token_choice(self, output, gen_logodds): + """token_choice""" + raise NotImplementedError + + def pointer_choice(self, node_type, logits, attention_logits): + """pointer_choice""" + raise NotImplementedError diff --git a/examples/text_to_sql/RAT-SQL/text2sql/optim.py b/examples/text_to_sql/RAT-SQL/text2sql/optim.py new file mode 100644 index 0000000000000000000000000000000000000000..3ca7406376f3e80295569813af80508a1af5664a --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/optim.py @@ -0,0 +1,54 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
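+
+# NOTE: get_warmup_and_linear_decay below scales the base learning rate by
+#     min(step / warmup_steps, 1 - (step - warmup_steps) / (max_steps - warmup_steps)),
+# i.e. a linear warm-up followed by a linear decay to zero. For example, with
+# the (hypothetical) settings max_steps=10000 and warmup_steps=1000, the
+# multiplier is 0.5 at step 500, 1.0 at step 1000, 0.5 again at step 5500 and
+# 0.0 at step 10000.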
+ +import re + +import paddle + +param_name_to_exclude_from_weight_decay = re.compile(r".*layer_norm_scale|.*layer_norm_bias|.*b_0") + + +def get_warmup_and_linear_decay(max_steps, warmup_steps): + """ERNIE/demo/utils.py""" + return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps)) + + +def init_optimizer(model, config, train_steps, scale_params_lr=None): + if scale_params_lr is not None: + for model, lr_scale in scale_params_lr: + for param in model.parameters(): + param.optimize_attr["learning_rate"] *= lr_scale + + warmup_steps = int(config.warmup_proportion * train_steps) + lr_scheduler = paddle.optimizer.lr.LambdaDecay( + config.learning_rate, get_warmup_and_linear_decay(train_steps, warmup_steps) + ) + optimizer = paddle.optimizer.AdamW( + lr_scheduler, + parameters=model.parameters(), + weight_decay=config.weight_decay, + apply_decay_param_fun=lambda n: not param_name_to_exclude_from_weight_decay.match(n), + grad_clip=paddle.nn.ClipGradByGlobalNorm(config.grad_clip), + ) + return optimizer + + +if __name__ == "__main__": + """run some simple test cases""" + import types + + model = paddle.vision.models.LeNet() + config = types.SimpleNamespace(learning_rate=1e-3, warmup_proportion=0.1, weight_decay=0.2, grad_clip=1.0) + optim = init_optimizer(model, config, train_steps=10000) + print(optim) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/__init__.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6b9868224721c471baafd08878ebf0a7a84d765a --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .utils import * diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/ast_util.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/ast_util.py new file mode 100644 index 0000000000000000000000000000000000000000..2c13a7d855970b30fa8af2b5e158ec1f1956b270 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/ast_util.py @@ -0,0 +1,272 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Any, Dict, Union + +import asdl +import attr + + +class ASTWrapperVisitor(asdl.VisitorBase): + """Used by ASTWrapper to collect information. + + - put constructors in one place. + - checks that all fields have names. + - get all optional fields. 
+ """ + + def __init__(self): + # type: () -> None + super(ASTWrapperVisitor, self).__init__() + self.constructors = {} # type: Dict[str, asdl.Constructor] + self.sum_types = {} # type: Dict[str, asdl.Sum] + self.product_types = {} # type: Dict[str, asdl.Product] + self.fieldless_constructors = {} # type: Dict[str, asdl.Constructor] + + def visitModule(self, mod): + # type: (asdl.Module) -> None + for dfn in mod.dfns: + self.visit(dfn) + + def visitType(self, type_): + # type: (asdl.Type) -> None + self.visit(type_.value, str(type_.name)) + + def visitSum(self, sum_, name): + # type: (asdl.Sum, str) -> None + self.sum_types[name] = sum_ + for t in sum_.types: + self.visit(t, name) + + def visitConstructor(self, cons, _name): + # type: (asdl.Constructor, str) -> None + assert cons.name not in self.constructors + self.constructors[cons.name] = cons + if not cons.fields: + self.fieldless_constructors[cons.name] = cons + for f in cons.fields: + self.visit(f, cons.name) + + def visitField(self, field, name): + # type: (asdl.Field, str) -> None + # pylint: disable=no-self-use + if field.name is None: + raise ValueError(f"Field of type {field.type} in {name} lacks name") + + def visitProduct(self, prod, name): + # type: (asdl.Product, str) -> None + self.product_types[name] = prod + for f in prod.fields: + self.visit(f, name) + + +SingularType = Union[asdl.Constructor, asdl.Product] + + +class ASTWrapper(object): + """Provides helper methods on the ASDL AST.""" + + default_primitive_type_checkers = { + "identifier": lambda x: isinstance(x, str), + "int": lambda x: isinstance(x, int), + "string": lambda x: isinstance(x, str), + "bytes": lambda x: isinstance(x, bytes), + "object": lambda x: isinstance(x, object), + "singleton": lambda x: x is True or x is False or x is None, + } + + # pylint: disable=too-few-public-methods + + def __init__(self, ast_def, custom_primitive_type_checkers={}): + # type: (asdl.Module, str) -> None + self.ast_def = ast_def + + visitor = ASTWrapperVisitor() + visitor.visit(ast_def) + + self.constructors = visitor.constructors + self.sum_types = visitor.sum_types + self.product_types = visitor.product_types + self.seq_fragment_constructors = {} + self.primitive_type_checkers = {**self.default_primitive_type_checkers, **custom_primitive_type_checkers} + self.custom_primitive_types = set(custom_primitive_type_checkers.keys()) + self.primitive_types = set(self.primitive_type_checkers.keys()) + + # Product types and constructors: + # no need to decide upon a further type for these. 
+ self.singular_types = {} # type: Dict[str, SingularType] + self.singular_types.update(self.constructors) + self.singular_types.update(self.product_types) + + # IndexedSets for each sum type + self.sum_type_vocabs = { + name: sorted(t.name for t in sum_type.types) for name, sum_type in self.sum_types.items() + } + self.constructor_to_sum_type = { + constructor.name: name for name, sum_type in self.sum_types.items() for constructor in sum_type.types + } + self.seq_fragment_constructor_to_sum_type = { + constructor.name: name for name, sum_type in self.sum_types.items() for constructor in sum_type.types + } + self.fieldless_constructors = sorted(visitor.fieldless_constructors.keys()) + + @property + def types(self): + # type: () -> Dict[str, Union[asdl.Sum, asdl.Product]] + return self.ast_def.types + + @property + def root_type(self): + # type: () -> str + return self._root_type + + def add_sum_type(self, name, sum_type): + assert name not in self.sum_types + self.sum_types[name] = sum_type + self.types[name] = sum_type + + for type_ in sum_type.types: + self._add_constructor(name, type_) + + def add_constructors_to_sum_type(self, sum_type_name, constructors): + for constructor in constructors: + self._add_constructor(sum_type_name, constructor) + self.sum_types[sum_type_name].types += constructors + + def remove_product_type(self, product_type_name): + self.singular_types.pop(product_type_name) + self.product_types.pop(product_type_name) + self.types.pop(product_type_name) + + def add_seq_fragment_type(self, sum_type_name, constructors): + for constructor in constructors: + # TODO: Record that this constructor is a sequence fragment? + self._add_constructor(sum_type_name, constructor) + + sum_type = self.sum_types[sum_type_name] + if not hasattr(sum_type, "seq_fragment_types"): + sum_type.seq_fragment_types = [] + sum_type.seq_fragment_types += constructors + + def _add_constructor(self, sum_type_name, constructor): + assert constructor.name not in self.constructors + self.constructors[constructor.name] = constructor + assert constructor.name not in self.singular_types + self.singular_types[constructor.name] = constructor + assert constructor.name not in self.constructor_to_sum_type + self.constructor_to_sum_type[constructor.name] = sum_type_name + + if not constructor.fields: + self.fieldless_constructors.append(constructor.name) + self.fieldless_constructors.sort() + + def verify_ast(self, node, expected_type=None, field_path=(), is_seq=False): + """Checks that `node` conforms to the current ASDL.""" + if node is None: + raise ValueError(f"node is None. path: {field_path}") + if not isinstance(node, dict): + raise ValueError(f"node is type {type(node)}. path: {field_path}") + + node_type = node["_type"] # type: str + if expected_type is not None: + sum_product = self.types[expected_type] + if isinstance(sum_product, asdl.Product): + if node_type != expected_type: + raise ValueError(f"Expected type {expected_type}, but instead saw {node_type}. path: {field_path}") + elif isinstance(sum_product, asdl.Sum): + possible_names = [t.name for t in sum_product.types] + if is_seq: + possible_names += [t.name for t in getattr(sum_product, "seq_fragment_types", [])] + if node_type not in possible_names: + raise ValueError( + f'Expected one of {", ".join(possible_names)}, but instead saw {node_type}. 
path: {field_path}' + ) + + else: + raise ValueError(f"Unexpected type in ASDL: {sum_product}") + + if node_type in self.types: + # Either a product or a sum type; we want it to be a product type + sum_product = self.types[node_type] + if isinstance(sum_product, asdl.Sum): + raise ValueError(f"sum type {node_type} not allowed as node type. path: {field_path}") + fields_to_check = sum_product.fields + elif node_type in self.constructors: + fields_to_check = self.constructors[node_type].fields + else: + raise ValueError(f"Unknown node_type {node_type}. path: {field_path}") + + for field in fields_to_check: + # field.opt: + # - missing is okay + # field.seq + # - missing is okay + # - otherwise, must be list + if field.name not in node: + if field.opt or field.seq: + continue + raise ValueError(f"required field {field.name} is missing. path: {field_path}") + + if field.seq and field.name in node and not isinstance(node[field.name], (list, tuple)): # noqa: E125 + raise ValueError(f"sequential field {field.name} is not sequence. path: {field_path}") + + # Check that each item in this field has the expected type. + items = node.get(field.name, ()) if field.seq else (node.get(field.name),) + + # pylint: disable=cell-var-from-loop + if field.type in self.primitive_type_checkers: + check = self.primitive_type_checkers[field.type] + else: + # pylint: disable=line-too-long + check = lambda n: self.verify_ast( + n, field.type, field_path + (field.name,), is_seq=field.seq + ) # noqa: E731,E501 + + for item in items: + assert check(item) + return True + + def find_all_descendants_of_type(self, tree, target_type, descend_pred=lambda field: True): + queue = [tree] + while queue: + node = queue.pop() + if not isinstance(node, dict): + continue + for field_info in self.singular_types[node["_type"]].fields: + if field_info.opt and field_info.name not in node: + continue + if not descend_pred(field_info): + continue + + if field_info.seq: + values = node.get(field_info.name, []) + else: + values = [node[field_info.name]] + + if field_info.type == target_type: + for value in values: + yield value + else: + queue.extend(values) + + +# Improve this when mypy supports recursive types. +Node = Dict[str, Any] + + +@attr.s +class HoleValuePlaceholder: + id = attr.ib() + is_seq = attr.ib() + is_opt = attr.ib() diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/dusql_evaluation.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/dusql_evaluation.py new file mode 100644 index 0000000000000000000000000000000000000000..d832acb97cc7b71e4aed325ea867a159f7faa11b --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/dusql_evaluation.py @@ -0,0 +1,1298 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Calculating the exact accuracy. For select, where and others schema, it will be +seen as right if has different order. 
This script refers to https://github.com/taoyds/spider。 +""" +import copy +import json +import logging +import re +from collections import defaultdict +from io import open + +from text2sql.utils import text_utils + +""" +val: number(float)/string(str)/sql(dict) +col_unit: (agg_id, col_id) +val_unit: (unit_op, col_unit1, col_unit2) +table_unit: (table_type, col_unit/sql) +cond_unit: (not_op, cond_op, val_unit, val1, val2) +condition: [cond_unit1, 'and'/'or', cond_unit2, ...] +sql { + 'select': [(agg_id, val_unit), (agg_id, val_unit), ...] + 'from': {'table_units': [table_unit1, table_unit2, ...], 'conds': condition} + 'where': condition + 'groupBy': [col_unit1, col_unit2, ...] + 'orderBy': ('asc'/'desc', [(agg_id, val_unit), ...]) + 'having': condition + 'limit': None/number(int) + 'intersect': None/sql + 'except': None/sql + 'union': None/sql +} +""" + +CLAUSE_KEYWORDS = ("select", "from", "where", "group", "order", "limit", "intersect", "union", "except") +JOIN_KEYWORDS = ("join", "on", "as") + +COND_OPS = ("not_in", "between", "=", ">", "<", ">=", "<=", "!=", "in", "like") +UNIT_OPS = ("none", "-", "+", "*", "/") +AGG_OPS = ("none", "max", "min", "count", "sum", "avg") +TABLE_TYPE = { + "sql": "sql", + "table_unit": "table_unit", +} + +LOGIC_AND_OR = ("and", "or") +SQL_OPS = ("intersect", "union", "except") +ORDER_OPS = ("desc", "asc") + +CONST_COLUMN = set(["time_now"]) + +EXPECT_BRACKET_PRE_TOKENS = set(AGG_OPS + SQL_OPS + COND_OPS + CLAUSE_KEYWORDS + ("from", ",", "distinct")) + +g_empty_sql = { + "select": [], + "from": {"conds": [], "table_units": []}, + "where": [], + "groupBy": [], + "having": [], + "orderBy": [], + "limit": None, + "except": None, + "intersect": None, + "union": None, +} + +g_eval_value = True +g_is_nl2sql_dataset = False + + +################################# +def tokenize(string): + """ + Args: + + Returns: + """ + string = string.replace("'", '"').lower() + assert string.count('"') % 2 == 0, "Unexpected quote" + + def _extract_value(string): + """extract values in sql""" + fields = string.split('"') + for idx, tok in enumerate(fields): + if idx % 2 == 1: + fields[idx] = '"%s"' % (tok) + return fields + + def _resplit(tmp_tokens, fn_split, fn_omit): + """resplit""" + new_tokens = [] + for token in tmp_tokens: + token = token.strip() + if fn_omit(token): + new_tokens.append(token) + elif re.match(r"\d\d\d\d-\d\d(-\d\d)?", token): + new_tokens.append('"%s"' % (token)) + else: + new_tokens.extend(fn_split(token)) + return new_tokens + + tokens_tmp = _extract_value(string) + + two_bytes_op = ["==", "!=", ">=", "<=", "<>", ""] + sep1 = re.compile(r"([ \+\-\*/\(\),><;])") # 单字节运算符 + sep2 = re.compile("(" + "|".join(two_bytes_op) + ")") # 多字节运算符 + tokens_tmp = _resplit(tokens_tmp, lambda x: x.split(" "), lambda x: x.startswith('"')) + tokens_tmp = _resplit(tokens_tmp, lambda x: re.split(sep2, x), lambda x: x.startswith('"')) + tokens_tmp = _resplit(tokens_tmp, lambda x: re.split(sep1, x), lambda x: x in two_bytes_op or x.startswith('"')) + tokens = list(filter(lambda x: x.strip() not in ("", "distinct", "DISTINCT"), tokens_tmp)) + + def _post_merge(tokens): + """merge: + * col name with "(", ")" + * values with +/- + """ + idx = 1 + while idx < len(tokens): + if tokens[idx] == "(" and tokens[idx - 1] not in EXPECT_BRACKET_PRE_TOKENS: + while idx < len(tokens): + tmp_tok = tokens.pop(idx) + tokens[idx - 1] += tmp_tok + if tmp_tok == ")": + break + elif tokens[idx] in ("+", "-") and tokens[idx - 1] in COND_OPS and idx + 1 < len(tokens): + tokens[idx] += tokens[idx + 1] 
+ tokens.pop(idx + 1) + idx += 1 + else: + idx += 1 + return tokens + + tokens = _post_merge(tokens) + return tokens + + +def scan_alias(toks): + """Scan the index of 'as' and build the map for all alias""" + as_idxs = [idx for idx, tok in enumerate(toks) if tok == "as"] + alias = {} + for idx in as_idxs: + alias[toks[idx + 1]] = toks[idx - 1] + return alias + + +def get_tables_with_alias(schema, toks): + tables = scan_alias(toks) + for key in schema: + assert key not in tables, "Alias {} has the same name in table".format(key) + tables[key] = key + return tables + + +def parse_col(toks, start_idx, tables_with_alias, schema, default_tables=None): + tok = toks[start_idx] + if tok == "*": + return start_idx + 1, schema.id_map[tok] + if tok in CONST_COLUMN: + return start_idx + 1, tok + + if g_is_nl2sql_dataset: + fn_check = lambda tok: "." in tok and tok.startswith("table_") + else: + fn_check = lambda tok: "." in tok + if fn_check(tok): # if token is a composite + alias, col = tok.split(".", 1) + key = tables_with_alias[alias] + "." + col + return start_idx + 1, schema.id_map[key] + + assert default_tables is not None and len(default_tables) > 0, "Default tables should not be None or empty" + + for alias in default_tables: + table = tables_with_alias[alias] + if tok in schema.schema[table]: + key = table + "." + tok + return start_idx + 1, schema.id_map[key] + + raise RuntimeError("Error col: {} from {}".format(tok, toks)) + + +def parse_col_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + assert idx < len_ and toks[idx] == "(" + idx += 1 + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + assert idx < len_ and toks[idx] == ")" + idx += 1 + return idx, (agg_id, col_id) + + agg_id = AGG_OPS.index("none") + idx, col_id = parse_col(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + + return idx, (agg_id, col_id) + + +def parse_val_unit(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + col_unit1 = None + col_unit2 = None + unit_op = UNIT_OPS.index("none") + + idx, col_unit1 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + if idx < len_ and toks[idx] in UNIT_OPS: + unit_op = UNIT_OPS.index(toks[idx]) + idx += 1 + idx, col_unit2 = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + if unit_op in (UNIT_OPS.index("+"), UNIT_OPS.index("*")): + col_unit1, col_unit2 = sorted([col_unit1, col_unit2]) + + return idx, (unit_op, col_unit1, col_unit2) + + +def parse_table_unit(toks, start_idx, tables_with_alias, schema): + idx = start_idx + len_ = len(toks) + key = tables_with_alias[toks[idx]] + + if idx + 1 < len_ and toks[idx + 1] == "as": + idx += 3 + else: + idx += 1 + + return idx, schema.id_map[key], key + + +def parse_value(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + def _force_float(str_num): + """force float, just for debug""" + last = "" + while len(str_num) > 0: + try: + n = float(str_num) + if last == "%": + n /= 100 + return n + except 
Exception: + last = str_num[-1] + str_num = str_num[:-1] + raise ValueError("not a float number") + + if toks[idx] == "select": + idx, val = parse_sql(toks, idx, tables_with_alias, schema) + elif toks[idx].startswith('"') and toks[idx].endswith('"'): # token is a string value + val = toks[idx] + idx += 1 + else: + try: + val_str = toks[idx] + # val = float(val_str) if val_str[-1] != '%' else float(val_str[:-1]) / 100 + val = _force_float(val_str) + idx += 1 + except Exception: + end_idx = idx + while ( + end_idx < len_ + and toks[end_idx] != "," + and toks[end_idx] != ")" + and toks[end_idx] != "and" + and toks[end_idx] not in CLAUSE_KEYWORDS + and toks[end_idx] not in JOIN_KEYWORDS + ): + end_idx += 1 + + idx, val = parse_col_unit(toks[start_idx:end_idx], 0, tables_with_alias, schema, default_tables) + idx = end_idx + + if isBlock: + assert toks[idx] == ")" + idx += 1 + + return idx, val + + +def parse_condition(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + conds = [] + + while idx < len_: + agg_id = 0 + if idx < len_ and toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + + op_str = toks[idx] + if op_str == "not": + assert toks[idx + 1] == "in", '"not" must followed by "in"' + op_str = "not_in" + idx += 1 + assert idx < len_ and op_str in COND_OPS, "Error condition: idx: {}, tok: {}".format(idx, op_str) + op_id = COND_OPS.index(op_str) + idx += 1 + val1 = val2 = None + if op_id == COND_OPS.index("between"): + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + assert toks[idx].lower() == "and" + idx += 1 + idx, val2 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + else: + idx, val1 = parse_value(toks, idx, tables_with_alias, schema, default_tables) + val2 = None + + conds.append((agg_id, op_id, val_unit, val1, val2)) + + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";") or toks[idx] in JOIN_KEYWORDS): + break + + if idx < len_ and toks[idx] in LOGIC_AND_OR: + conds.append(toks[idx]) + idx += 1 # skip and/or + + return idx, conds + + +def parse_select(toks, start_idx, tables_with_alias, schema, default_tables=None): + idx = start_idx + len_ = len(toks) + + assert toks[idx] == "select", "'select' not found" + idx += 1 + val_units = [] + + while idx < len_ and toks[idx] not in CLAUSE_KEYWORDS: + agg_id = AGG_OPS.index("none") + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append((agg_id, val_unit)) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + + return idx, val_units + + +def parse_from(toks, start_idx, tables_with_alias, schema): + """ + Assume in the from clause, all table units are combined with join + """ + assert "from" in toks[start_idx:], "'from' not found" + + len_ = len(toks) + idx = toks.index("from", start_idx) + 1 + default_tables = [] + table_units = [] + conds = [] + last_table = None + + while idx < len_: + isBlock = False + if toks[idx] == "(": + isBlock = True + idx += 1 + + if toks[idx] == "select": + idx, sql = parse_sql(toks, idx, tables_with_alias, schema) + table_units.append((TABLE_TYPE["sql"], sql)) + last_table = sql["from"]["table_units"][0][1].strip("_") + else: + if idx < len_ and toks[idx] == "join": + idx += 1 # skip join + idx, table_unit, table_name = parse_table_unit(toks, idx, 
tables_with_alias, schema) + table_units.append((TABLE_TYPE["table_unit"], table_unit)) + default_tables.append(table_name) + if idx < len_ and toks[idx] == "on": + idx += 1 # skip on + idx, this_conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + if len(conds) > 0: + conds.append("and") + conds.extend(this_conds) + + if isBlock: + assert toks[idx] == ")" + idx += 1 + if idx < len_ and toks[idx] == "a": + assert last_table is not None, "last_table should be a table name string, not None" + tables_with_alias["a"] = last_table + idx += 2 + elif idx < len_ and toks[idx] == "b": + assert last_table is not None, "last_table should be a table name string, not None" + tables_with_alias["b"] = last_table + idx += 1 + if idx < len_ and (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + break + + return [idx, table_units, conds, default_tables] + + +def parse_where(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "where": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_group_by(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + col_units = [] + + if idx >= len_ or toks[idx] != "group": + return idx, col_units + + idx += 1 + assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + idx, col_unit = parse_col_unit(toks, idx, tables_with_alias, schema, default_tables) + col_units.append(col_unit) + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, col_units + + +def parse_order_by(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + val_units = [] + order_type = "asc" # default type is 'asc' + + if idx >= len_ or toks[idx] != "order": + return idx, val_units + + idx += 1 + assert toks[idx] == "by" + idx += 1 + + while idx < len_ and not (toks[idx] in CLAUSE_KEYWORDS or toks[idx] in (")", ";")): + agg_id = AGG_OPS.index("none") + if toks[idx] in AGG_OPS: + agg_id = AGG_OPS.index(toks[idx]) + idx += 1 + idx, val_unit = parse_val_unit(toks, idx, tables_with_alias, schema, default_tables) + val_units.append((agg_id, val_unit)) + if idx < len_ and toks[idx] in ORDER_OPS: + order_type = toks[idx] + idx += 1 + if idx < len_ and toks[idx] == ",": + idx += 1 # skip ',' + else: + break + + return idx, (order_type, val_units) + + +def parse_having(toks, start_idx, tables_with_alias, schema, default_tables): + idx = start_idx + len_ = len(toks) + + if idx >= len_ or toks[idx] != "having": + return idx, [] + + idx += 1 + idx, conds = parse_condition(toks, idx, tables_with_alias, schema, default_tables) + return idx, conds + + +def parse_limit(toks, start_idx): + idx = start_idx + len_ = len(toks) + + if idx < len_ and toks[idx] == "limit": + idx += 2 + limit_num = int(toks[idx - 1]) + return idx, limit_num + + return idx, None + + +def parse_sql(toks, start_idx, tables_with_alias, schema): + isBlock = False # indicate whether this is a block of sql/sub-sql + len_ = len(toks) + idx = start_idx + + sql = {} + if toks[idx] == "(": + isBlock = True + idx += 1 + + # parse from clause in order to get default tables + from_end_idx, table_units, conds, default_tables = parse_from(toks, start_idx, tables_with_alias, schema) + sql["from"] = {"table_units": table_units, "conds": conds} + # select clause + 
_, select_col_units = parse_select(toks, idx, tables_with_alias, schema, default_tables) + idx = from_end_idx + sql["select"] = select_col_units + # where clause + idx, where_conds = parse_where(toks, idx, tables_with_alias, schema, default_tables) + sql["where"] = where_conds + # group by clause + idx, group_col_units = parse_group_by(toks, idx, tables_with_alias, schema, default_tables) + sql["groupBy"] = group_col_units + # having clause + idx, having_conds = parse_having(toks, idx, tables_with_alias, schema, default_tables) + sql["having"] = having_conds + # order by clause + idx, order_col_units = parse_order_by(toks, idx, tables_with_alias, schema, default_tables) + sql["orderBy"] = order_col_units + # limit clause + idx, limit_val = parse_limit(toks, idx) + sql["limit"] = limit_val + + idx = skip_semicolon(toks, idx) + if isBlock: + assert toks[idx] == ")" + idx += 1 # skip ')' + idx = skip_semicolon(toks, idx) + + # intersect/union/except clause + for op in SQL_OPS: # initialize IUE + sql[op] = None + if idx < len_ and toks[idx] in SQL_OPS: + sql_op = toks[idx] + idx += 1 + idx, IUE_sql = parse_sql(toks, idx, tables_with_alias, schema) + sql[sql_op] = IUE_sql + return idx, sql + + +def load_data(fpath): + with open(fpath) as f: + data = json.load(f) + return data + + +def get_sql(schema, query): + toks = tokenize(query) + tables_with_alias = get_tables_with_alias(schema.schema, toks) + _, sql = parse_sql(toks, 0, tables_with_alias, schema) + + return sql + + +def skip_semicolon(toks, start_idx): + idx = start_idx + while idx < len(toks) and toks[idx] == ";": + idx += 1 + return idx + + +################################# + +g_db_schema_file = None +g_foreign_key_maps = None + + +class Evaluator(object): + """A simple evaluator""" + + def __init__(self, db_schema_file, foreign_key_maps, eval_value=True): + """init""" + self.schemas = {} + self.foreign_key_maps = foreign_key_maps + self.partial_scores = None + self.scores = {"all": {"count": 0, "exact": 0}} + global g_db_schema_file + global g_foreign_key_maps + g_db_schema_file = db_schema_file + g_foreign_key_maps = foreign_key_maps + + with open(db_schema_file) as ifs: + databases = json.load(ifs) + for db in databases: + self.schemas[db["db_id"]] = Schema(db) + is_nl2sql = all([len(x.schema) == 1 for x in self.schemas.values()]) + if is_nl2sql: + global g_is_nl2sql_dataset + g_is_nl2sql_dataset = True + + # number of failed to parse predicted sql query + self.eval_err_num = 0 + + global g_eval_value + g_eval_value = eval_value + self.eval_value = eval_value + + def _eval_exact_match(self, pred, gold): + """eval_exact_match""" + partial_scores = self.eval_partial_match(pred, gold) + self.partial_scores = partial_scores + + for _, score in partial_scores.items(): + if score["f1"] != 1: + return 0 + + gold_table_units = gold["from"]["table_units"] + pred_table_units = pred["from"]["table_units"] + if len(pred_table_units) != len(gold_table_units) or any( + map(lambda x: type(x[0][1]) != type(x[1][1]), zip(pred_table_units, gold_table_units)) # noqa: E721 + ): + return 0 + if type(gold_table_units[0][1]) is not dict: + return 1 if sorted(gold_table_units) == sorted(pred_table_units) else 0 + + # TODO: 严格考虑顺序 + def __eval_from_sql(pred_tables, gold_tables): + """eval from sql""" + for pred_table_unit, gold_table_unit in zip(pred_tables, gold_tables): + pred_table_sql = pred_table_unit[1] + gold_table_sql = gold_table_unit[1] + _, _, correct = eval_nested(pred_table_sql, gold_table_sql) + if correct == 0: + return 0 + return 1 + + 
correct = __eval_from_sql(pred_table_units, gold_table_units) + if len(gold_table_units) > 1 and correct == 0: + return __eval_from_sql(pred_table_units, list(reversed(gold_table_units))) + else: + return correct + + # if len(gold['from']['table_units']) > 0: + # gold_tables = sorted(gold['from']['table_units'], key=lambda x: str(x)) + # pred_tables = sorted(pred['from']['table_units'], key=lambda x: str(x)) + # return gold_tables == pred_tables + # return 1 + + def eval_exact_match(self, pred, gold): + """wrapper of evaluate exact match, to process + `SQL1 intersect/union SQL2` vs `SQL2 intersect/union SQL1` + """ + score = self._eval_exact_match(pred, gold) + if score == 1: + return score + + if gold["union"] is not None: + new_gold = gold["union"] + gold["union"] = None + new_gold["union"] = gold + return self._eval_exact_match(pred, new_gold) + elif gold["intersect"] is not None: + new_gold = gold["intersect"] + gold["intersect"] = None + new_gold["intersect"] = gold + return self._eval_exact_match(pred, new_gold) + else: + return 0 + + def eval_partial_match(self, pred, gold): + """eval partial match""" + res = {} + + gold_total, pred_total, cnt, cnt_wo_agg = eval_sel(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["select"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, gold_total) + res["select(no AGG)"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt, cnt_wo_agg = eval_where(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["where"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + acc, rec, f1 = get_scores(cnt_wo_agg, pred_total, gold_total) + res["where(no OP)"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_group(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["group(no Having)"] = { + "acc": acc, + "rec": rec, + "f1": f1, + "gold_total": gold_total, + "pred_total": pred_total, + } + + gold_total, pred_total, cnt = eval_having(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["group"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_order(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["order"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_and_or(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["and/or"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_IUEN(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["IUEN"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + gold_total, pred_total, cnt = eval_keywords(pred, gold) + acc, rec, f1 = get_scores(cnt, pred_total, gold_total) + res["keywords"] = {"acc": acc, "rec": rec, "f1": f1, "gold_total": gold_total, "pred_total": pred_total} + + return res + + def evaluate_one(self, db_id, gold_query, pred_query): + """evaluate one predicted result, and cache evaluating info + + Args: + db (TYPE): NULL + gold_query (TYPE): NULL + pred_query (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + 
self.scores["all"]["count"] += 1 + + schema = self.schemas[db_id] + kmap = self.foreign_key_maps[db_id] + + gold_query = gold_query.replace("==", "=") + gold_sql = get_sql(schema, gold_query) + # rebuild sql for value evaluation + g_valid_col_units = build_valid_col_units(gold_sql["from"]["table_units"], schema) + gold_sql = rebuild_sql_col(g_valid_col_units, gold_sql, kmap, self.eval_value) + + is_parse_error = False + try: + pred_sql = get_sql(schema, pred_query) + + p_valid_col_units = build_valid_col_units(pred_sql["from"]["table_units"], schema) + pred_sql = rebuild_sql_col(p_valid_col_units, pred_sql, kmap, self.eval_value) + except Exception: + # If pred_sql is not valid, then we will use an empty sql to evaluate with the correct sql + pred_sql = g_empty_sql + self.eval_err_num += 1 + is_parse_error = True + + exact_score = self.eval_exact_match(pred_sql, gold_sql) + if exact_score == 0: + logging.debug("error instance %s:\npred: %s\ngold: %s" % (db_id, pred_query, gold_query)) + self.scores["all"]["exact"] += exact_score + return { + "gold": gold_query, + "pred": pred_query, + "correct": int(exact_score), + "parse_error": int(is_parse_error), + } + + def finalize(self): + """ + Returns: TODO + + Raises: NULL + """ + self.scores["all"]["exact"] /= self.scores["all"]["count"] + return self.scores + + +class Schema(object): + """ + Simple schema which maps table&column to a unique identifier + """ + + def __init__(self, db): + """init""" + self._schema = self._build_schema(db) + self._id_map = self._map(self._schema) + + @property + def schema(self): + """_schema property""" + return self._schema + + @property + def id_map(self): + """_id_map property""" + return self._id_map + + def _build_schema(self, db): + """build schema by input db + + Args: + db (dict): NULL + + Returns: TODO + + Raises: NULL + """ + tables = [x.lower() for x in db.get("table_names_original", db["table_names"])] + dct_table2cols = defaultdict(list) + for table_id, column in db.get("column_names_original", db["column_names"]): + if table_id < 0: + continue + dct_table2cols[tables[table_id]].append(column.lower()) + return dct_table2cols + + def _map(self, schema): + """map""" + id_map = {"*": "__all__"} + for key, vals in schema.items(): + for val in vals: + id_map[key.lower() + "." + val.lower()] = "__" + key.lower() + "." 
+ val.lower() + "__" + + for key in schema: + id_map[key.lower()] = "__" + key.lower() + "__" + + return id_map + + +def get_scores(count, pred_total, gold_total): + """ + Args: + + Returns: + """ + if pred_total != gold_total: + return 0, 0, 0 + elif count == pred_total: + return 1, 1, 1 + return 0, 0, 0 + + +def eval_sel(pred, gold): + """ + Args: + + Returns: + """ + pred_sel = copy.deepcopy(pred["select"]) + gold_sel = copy.deepcopy(gold["select"]) + gold_wo_agg = [unit[1] for unit in gold_sel] + pred_total = len(pred_sel) + gold_total = len(gold_sel) + cnt = 0 + cnt_wo_agg = 0 + + for unit in pred_sel: + if unit in gold_sel: + cnt += 1 + gold_sel.remove(unit) + if unit[1] in gold_wo_agg: + cnt_wo_agg += 1 + gold_wo_agg.remove(unit[1]) + + return [gold_total, pred_total, cnt, cnt_wo_agg] + + +def eval_nested_cond(pred_cond, gold_cond): + """ + + Args: + pred_cond (TYPE): NULL + gold_cond (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + if pred_cond[:3] != gold_cond[:3] or type(pred_cond[3]) is not dict: + return 0 + + _, _, correct = eval_nested(pred_cond[3], gold_cond[3]) + if correct == 0: + return 0 + + return pred_cond[4] == gold_cond[4] + + +def eval_cond(pred, gold): + def _equal(p, g): + if str(p) == str(g): + return True + p = p.strip("\"'") if type(p) is str else p + g = g.strip("\"'") if type(g) is str else g + if text_utils.is_float(p) and text_utils.is_float(g) and float(p) == float(g): + return True + return False + + if type(gold[3]) is dict: + return eval_nested_cond(pred, gold) + + if pred[:3] != gold[:3]: + return 0 + + if _equal(pred[3], gold[3]) and _equal(pred[4], gold[4]): + return 1 + else: + return 0 + + +def eval_where(pred, gold): + pred_conds = list(sorted([unit for unit in pred["where"][::2]], key=lambda x: [str(i) for i in x])) + gold_conds = list(sorted([unit for unit in gold["where"][::2]], key=lambda x: [str(i) for i in x])) + # gold_wo_agg = [unit[2] for unit in gold_conds] + pred_total = len(pred_conds) + gold_total = len(gold_conds) + cnt = 0 + cnt_wo_agg = 0 + + for unit_p, unit_g in zip(pred_conds, gold_conds): + cnt += eval_cond(unit_p, unit_g) + + if unit_p[2] == unit_g[2]: + cnt_wo_agg += 1 + + # for unit in pred_conds: + # if unit in gold_conds: + # cnt += 1 + # gold_conds.remove(unit) + # if unit[2] in gold_wo_agg: + # cnt_wo_agg += 1 + # gold_wo_agg.remove(unit[2]) + return [gold_total, pred_total, cnt, cnt_wo_agg] + # return [gold_total, pred_total, cnt, gold_total] + + +def eval_group(pred, gold): + pred_cols = [unit[1] for unit in pred["groupBy"]] + gold_cols = [unit[1] for unit in gold["groupBy"]] + pred_total = len(pred_cols) + gold_total = len(gold_cols) + cnt = 0 + pred_cols = [pred.split(".")[1] if "." in pred else pred for pred in pred_cols] + gold_cols = [gold.split(".")[1] if "." 
in gold else gold for gold in gold_cols] + for col in pred_cols: + if col in gold_cols: + cnt += 1 + gold_cols.remove(col) + return [gold_total, pred_total, cnt] + + +def eval_having(pred, gold): + """and/or will be evaluate in other branch""" + if len(pred["having"]) != len(gold["having"]): + return [1, 1, 0] + + pred_total = len(pred["having"][::2]) + gold_total = len(gold["having"][::2]) + cnt = 0 + for pred_cond, gold_cond in zip(sorted(pred["having"][::2]), sorted(gold["having"][::2])): + if eval_cond(pred_cond, gold_cond) == 1: + cnt += 1 + + return [gold_total, pred_total, cnt] + + +def eval_order(pred, gold): + pred_total = gold_total = cnt = 0 + if len(pred["orderBy"]) > 0: + pred_total = 1 + if len(gold["orderBy"]) > 0: + gold_total = 1 + + if len(gold["orderBy"]) > 0 and pred["orderBy"] == gold["orderBy"] and pred["limit"] == gold["limit"]: + cnt = 1 + + return [gold_total, pred_total, cnt] + + +def eval_and_or(pred, gold): + def _extract(conds): + """extract condition and/or""" + op_set = set() + for i in range(1, len(conds) - 1, 2): + left = conds[i - 1][:3] + right = conds[i + 1][:3] + left, right = list(sorted([left, right])) + op_set.add(f"{left}{conds[i].lower()}{right}") + return op_set + + # eval where and/or + pred_op_set = _extract(pred["where"]) + gold_op_set = _extract(gold["where"]) + if pred_op_set != gold_op_set: + return [1, 1, 0] + + # eval having and/or + pred_op_set = _extract(pred["having"]) + gold_op_set = _extract(gold["having"]) + if pred_op_set != gold_op_set: + return [1, 1, 0] + + return [1, 1, 1] + + +def get_nestedSQL(sql): + nested = [] + for cond_unit in sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2]: + if type(cond_unit[3]) is dict: + nested.append(cond_unit[3]) + if type(cond_unit[4]) is dict: + nested.append(cond_unit[4]) + ## + for from_nest_sql in [table_unit[1] for table_unit in sql["from"]["table_units"] if table_unit[0] == "sql"]: + nested.append(from_nest_sql) + + if sql["intersect"] is not None: + nested.append(sql["intersect"]) + if sql["except"] is not None: + nested.append(sql["except"]) + if sql["union"] is not None: + nested.append(sql["union"]) + return nested + + +def eval_nested(pred, gold): + gold_total = 0 + pred_total = 0 + cnt = 0 + if pred is not None: + pred_total += 1 + if gold is not None: + gold_total += 1 + if pred is not None and gold is not None: + cnt += Evaluator(g_db_schema_file, g_foreign_key_maps, g_eval_value).eval_exact_match(pred, gold) + return [gold_total, pred_total, cnt] + + +def eval_IUEN(pred, gold): + lt1, pt1, cnt1 = eval_nested(pred["intersect"], gold["intersect"]) + lt2, pt2, cnt2 = eval_nested(pred["except"], gold["except"]) + lt3, pt3, cnt3 = eval_nested(pred["union"], gold["union"]) + gold_total = lt1 + lt2 + lt3 + pred_total = pt1 + pt2 + pt3 + cnt = cnt1 + cnt2 + cnt3 + return [gold_total, pred_total, cnt] + + +def get_keywords(sql): + res = set() + if len(sql["where"]) > 0: + res.add("where") + if len(sql["groupBy"]) > 0: + res.add("group") + if len(sql["having"]) > 0: + res.add("having") + if len(sql["orderBy"]) > 0: + res.add(sql["orderBy"][0]) + res.add("order") + if sql["limit"] is not None: + res.add("limit") + if sql["except"] is not None: + res.add("except") + if sql["union"] is not None: + res.add("union") + if sql["intersect"] is not None: + res.add("intersect") + + # or keyword + ao = sql["from"]["conds"][1::2] + sql["where"][1::2] + sql["having"][1::2] + if len([token for token in ao if token == "or"]) > 0: + res.add("or") + + # TODO + cond_units = 
sql["from"]["conds"][::2] + sql["where"][::2] + sql["having"][::2] + # not keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[0]]) > 0: + res.add("not") + + # in keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == COND_OPS.index("in")]) > 0: + res.add("in") + + # like keyword + if len([cond_unit for cond_unit in cond_units if cond_unit[1] == COND_OPS.index("like")]) > 0: + res.add("like") + + return res + + +def eval_keywords(pred, gold): + pred_keywords = get_keywords(pred) + gold_keywords = get_keywords(gold) + pred_total = len(pred_keywords) + gold_total = len(gold_keywords) + cnt = 0 + + for k in pred_keywords: + if k in gold_keywords: + cnt += 1 + return [gold_total, pred_total, cnt] + + +# Rebuild SQL functions for foreign key evaluation +def build_valid_col_units(table_units, schema): + col_ids = [table_unit[1] for table_unit in table_units if table_unit[0] == TABLE_TYPE["table_unit"]] + prefixs = [col_id[:-2] for col_id in col_ids] + valid_col_units = [] + for value in schema.id_map.values(): + if "." in value and value[: value.index(".")] in prefixs: + valid_col_units.append(value) + return valid_col_units + + +def rebuild_col_unit_col(valid_col_units, col_unit, kmap): + if col_unit is None: + return col_unit + + agg_id, col_id = col_unit + if col_id in kmap and col_id in valid_col_units: + col_id = kmap[col_id] + return agg_id, col_id + + +def rebuild_val_unit_col(valid_col_units, val_unit, kmap): + if val_unit is None: + return val_unit + + unit_op, col_unit1, col_unit2 = val_unit + col_unit1 = rebuild_col_unit_col(valid_col_units, col_unit1, kmap) + col_unit2 = rebuild_col_unit_col(valid_col_units, col_unit2, kmap) + return [unit_op, col_unit1, col_unit2] + + +def rebuild_table_unit_col(valid_col_units, table_unit, kmap, eval_value=True): + if table_unit is None: + return table_unit + + table_type, col_unit_or_sql = table_unit + if isinstance(col_unit_or_sql, dict): + col_unit_or_sql = rebuild_sql_col(valid_col_units, col_unit_or_sql, kmap, eval_value) + elif isinstance(col_unit_or_sql, tuple): # useless + col_unit_or_sql = rebuild_col_unit_col(valid_col_units, col_unit_or_sql, kmap) + return table_type, col_unit_or_sql + + +def rebuild_cond_unit_col(valid_col_units, cond_unit, kmap, eval_value): + if cond_unit is None: + return cond_unit + + not_op, op_id, val_unit, val1, val2 = cond_unit + if type(val1) is dict: + rebuild_sql_col(valid_col_units, val1, kmap, eval_value) + if not eval_value: + if type(val1) is not dict: + val1 = "1" + if type(val2) is not dict and val2 is not None: + val2 = "2" + val_unit = rebuild_val_unit_col(valid_col_units, val_unit, kmap) + return [not_op, op_id, val_unit, val1, val2] + + +def rebuild_condition_col(valid_col_units, condition, kmap, eval_value): + for idx in range(len(condition)): + if idx % 2 == 0: + condition[idx] = rebuild_cond_unit_col(valid_col_units, condition[idx], kmap, eval_value) + return condition + + +def rebuild_select_col(valid_col_units, sel, kmap): + if sel is None: + return sel + new_list = [] + for it in sel: + agg_id, val_unit = it + new_list.append((agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap))) + return new_list + + +def rebuild_from_col(valid_col_units, from_, kmap, eval_value=True): + if from_ is None: + return from_ + + fn_proc = lambda x: rebuild_table_unit_col(valid_col_units, x, kmap, eval_value) + from_["table_units"] = [fn_proc(table_unit) for table_unit in from_["table_units"]] + from_["conds"] = rebuild_condition_col(valid_col_units, from_["conds"], 
kmap, True) + return from_ + + +def rebuild_group_by_col(valid_col_units, group_by, kmap): + if group_by is None: + return group_by + + return [rebuild_col_unit_col(valid_col_units, col_unit, kmap) for col_unit in group_by] + + +def rebuild_order_by_col(valid_col_units, order_by, kmap): + if order_by is None or len(order_by) == 0: + return order_by + + direction, val_units = order_by + new_val_units = [(agg_id, rebuild_val_unit_col(valid_col_units, val_unit, kmap)) for agg_id, val_unit in val_units] + return direction, new_val_units + + +def rebuild_sql_col(valid_col_units, sql, kmap, eval_value): + if sql is None: + return sql + + sql["select"] = rebuild_select_col(valid_col_units, sql["select"], kmap) + sql["from"] = rebuild_from_col(valid_col_units, sql["from"], kmap, eval_value) + sql["where"] = rebuild_condition_col(valid_col_units, sql["where"], kmap, eval_value) + sql["groupBy"] = rebuild_group_by_col(valid_col_units, sql["groupBy"], kmap) + sql["orderBy"] = rebuild_order_by_col(valid_col_units, sql["orderBy"], kmap) + sql["having"] = rebuild_condition_col(valid_col_units, sql["having"], kmap, eval_value) + sql["intersect"] = rebuild_sql_col(valid_col_units, sql["intersect"], kmap, eval_value) + sql["except"] = rebuild_sql_col(valid_col_units, sql["except"], kmap, eval_value) + sql["union"] = rebuild_sql_col(valid_col_units, sql["union"], kmap, eval_value) + if not eval_value: + if sql["limit"] is None or int(sql["limit"]) <= 0: + sql["limit"] = 0 + else: + sql["limit"] = 1 + + return sql + + +def build_foreign_key_map(entry): + cols_orig = entry["column_names_original"] + tables_orig = entry["table_names_original"] + + # rebuild cols corresponding to idmap in Schema + cols = [] + for col_orig in cols_orig: + if col_orig[0] >= 0: + t = tables_orig[col_orig[0]] + c = col_orig[1] + cols.append("__" + t.lower() + "." + c.lower() + "__") + else: + cols.append("__all__") + + def keyset_in_list(k1, k2, k_list): + """keyset_in_list""" + for k_set in k_list: + if k1 in k_set or k2 in k_set: + return k_set + new_k_set = set() + k_list.append(new_k_set) + return new_k_set + + foreign_key_list = [] + foreign_keys = entry["foreign_keys"] + for fkey in foreign_keys: + key1, key2 = fkey + key_set = keyset_in_list(key1, key2, foreign_key_list) + key_set.add(key1) + key_set.add(key2) + + foreign_key_map = {} + for key_set in foreign_key_list: + sorted_list = sorted(list(key_set)) + midx = sorted_list[0] + for idx in sorted_list: + foreign_key_map[cols[idx]] = cols[midx] + + return foreign_key_map + + +def build_foreign_key_map_from_json(table): + with open(table) as f: + data = json.load(f) + tables = {} + for entry in data: + tables[entry["db_id"]] = build_foreign_key_map(entry) + return tables + + +if __name__ == "__main__": + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/linking_utils.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/linking_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..876f923301e0cb5df28319628e8c36d593aa5b35 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/linking_utils.py @@ -0,0 +1,558 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import itertools +import logging +import re + +import numpy as np +from text2sql.utils import text_utils + +# the max matching ngram +g_linking_ngrams_n = 5 + +STOPWORDS = set( + [ + "的", + "是", + ",", + "?", + "有", + "多少", + "哪些", + "我", + "什么", + "你", + "知道", + "啊", + "一下", + "吗", + "在", + "请问", + "或者", + "想", + "和", + "为", + "帮", + "那个", + "你好", + "这", + "了", + "并且", + "都", + "呢", + "呀", + "哪个", + "还有", + "这个", + "-", + "项目", + "我查", + "就是", + "它", + "要求", + "谁", + "了解", + "告诉", + "时候", + "个", + "能", + "那", + "人", + "问", + "中", + "可以", + "一共", + "哪", + "麻烦", + "叫", + "想要", + "《", + "》", + "分别", + ] +) + + +def clamp(value, abs_max): + """clamp value""" + value = max(-abs_max, value) + value = min(abs_max, value) + return value + + +class Relations(object): + """Docstring for Relations.""" + + def __init__( + self, + qq_max_dist=2, + cc_foreign_key=True, + cc_table_match=True, + cc_max_dist=2, + ct_foreign_key=True, + ct_table_match=True, + tc_table_match=True, + tc_foreign_key=True, + tt_max_dist=2, + tt_foreign_key=True, + merge_types=False, + sc_link=True, + cv_link=True, + ): + super(Relations, self).__init__() + + self.qq_max_dist = qq_max_dist + self.cc_foreign_key = cc_foreign_key + self.cc_table_match = cc_table_match + self.cc_max_dist = cc_max_dist + self.ct_foreign_key = ct_foreign_key + self.ct_table_match = ct_table_match + self.tc_table_match = tc_table_match + self.tc_foreign_key = tc_foreign_key + self.tt_max_dist = tt_max_dist + self.tt_foreign_key = tt_foreign_key + self.merge_types = merge_types + self.sc_link = sc_link + self.cv_link = cv_link + + self.relation_ids = {} + + def add_relation(name): + self.relation_ids[name] = len(self.relation_ids) + logging.debug("relation: %s --> %d", name, self.relation_ids[name]) + + # < TODO: add_relation('[UNK]') + + def add_rel_dist(name, max_dist): + for i in range(-max_dist, max_dist + 1): + add_relation((name, i)) + + add_rel_dist("qq_dist", qq_max_dist) + + add_relation("qc_default") + # if qc_token_match: + # add_relation('qc_token_match') + + add_relation("qt_default") + # if qt_token_match: + # add_relation('qt_token_match') + + add_relation("cq_default") + # if cq_token_match: + # add_relation('cq_token_match') + + add_relation("cc_default") + if cc_foreign_key: + add_relation("cc_foreign_key_forward") + add_relation("cc_foreign_key_backward") + if cc_table_match: + add_relation("cc_table_match") + add_rel_dist("cc_dist", cc_max_dist) + + add_relation("ct_default") + if ct_foreign_key: + add_relation("ct_foreign_key") + if ct_table_match: + add_relation("ct_primary_key") + add_relation("ct_table_match") + add_relation("ct_any_table") + + add_relation("tq_default") + # if cq_token_match: + # add_relation('tq_token_match') + + add_relation("tc_default") + if tc_table_match: + add_relation("tc_primary_key") + add_relation("tc_table_match") + add_relation("tc_any_table") + if tc_foreign_key: + add_relation("tc_foreign_key") + + add_relation("tt_default") + if tt_foreign_key: + add_relation("tt_foreign_key_forward") + add_relation("tt_foreign_key_backward") + add_relation("tt_foreign_key_both") + 
add_rel_dist("tt_dist", tt_max_dist) + + # schema linking relations + # forward_backward + if sc_link: + add_relation("qcCEM") + add_relation("cqCEM") + add_relation("qtTEM") + add_relation("tqTEM") + add_relation("qcCPM") + add_relation("cqCPM") + add_relation("qtTPM") + add_relation("tqTPM") + + if cv_link: + add_relation("qcNUMBER") + add_relation("cqNUMBER") + add_relation("qcTIME") + add_relation("cqTIME") + add_relation("qcCELLMATCH") + add_relation("cqCELLMATCH") + + if merge_types: + assert not cc_foreign_key + assert not cc_table_match + assert not ct_foreign_key + assert not ct_table_match + assert not tc_foreign_key + assert not tc_table_match + assert not tt_foreign_key + + assert cc_max_dist == qq_max_dist + assert tt_max_dist == qq_max_dist + + add_relation("xx_default") + self.relation_ids["qc_default"] = self.relation_ids["xx_default"] + self.relation_ids["qt_default"] = self.relation_ids["xx_default"] + self.relation_ids["cq_default"] = self.relation_ids["xx_default"] + self.relation_ids["cc_default"] = self.relation_ids["xx_default"] + self.relation_ids["ct_default"] = self.relation_ids["xx_default"] + self.relation_ids["tq_default"] = self.relation_ids["xx_default"] + self.relation_ids["tc_default"] = self.relation_ids["xx_default"] + self.relation_ids["tt_default"] = self.relation_ids["xx_default"] + + if sc_link: + self.relation_ids["qcCEM"] = self.relation_ids["xx_default"] + self.relation_ids["qcCPM"] = self.relation_ids["xx_default"] + self.relation_ids["qtTEM"] = self.relation_ids["xx_default"] + self.relation_ids["qtTPM"] = self.relation_ids["xx_default"] + self.relation_ids["cqCEM"] = self.relation_ids["xx_default"] + self.relation_ids["cqCPM"] = self.relation_ids["xx_default"] + self.relation_ids["tqTEM"] = self.relation_ids["xx_default"] + self.relation_ids["tqTPM"] = self.relation_ids["xx_default"] + if cv_link: + self.relation_ids["qcNUMBER"] = self.relation_ids["xx_default"] + self.relation_ids["cqNUMBER"] = self.relation_ids["xx_default"] + self.relation_ids["qcTIME"] = self.relation_ids["xx_default"] + self.relation_ids["cqTIME"] = self.relation_ids["xx_default"] + self.relation_ids["qcCELLMATCH"] = self.relation_ids["xx_default"] + self.relation_ids["cqCELLMATCH"] = self.relation_ids["xx_default"] + + for i in range(-qq_max_dist, qq_max_dist + 1): + self.relation_ids["cc_dist", i] = self.relation_ids["qq_dist", i] + self.relation_ids["tt_dist", i] = self.relation_ids["tt_dist", i] + + logging.info("relations num is: %d", len(self.relation_ids)) + + def __len__(self): + """size of relations + Returns: int + """ + return len(self.relation_ids) + + +RELATIONS = Relations() + + +# schema linking, similar to IRNet +def compute_schema_linking(tokens, db): + """schema linking""" + + def partial_match(x_list, y_list): + """check partial match""" + x_str = "".join(x_list) + y_str = "".join(y_list) + if x_str in STOPWORDS: + return False + if re.match("%s" % re.escape(x_str), y_str): + assert x_str in y_str + return True + else: + return False + + def exact_match(x_list, y_list): + """check exact match""" + x, y = x_list, y_list + if type(x) is list: + x = "".join(x) + if type(y) is list: + y = "".join(y) + return x == y + + def set_q_relation(q_match_dict, q_start, q_match_len, other_id, relation_tag, force=True): + """set match relation for question""" + for q_id in range(q_start, q_start + q_match_len): + key = f"{q_id},{other_id}" + if not force and key in q_match_dict: + continue + q_match_dict[key] = relation_tag + + columns = [x.name for x in db.columns] + 
tables = [x.name for x in db.tables] + + q_col_match = dict() + q_tab_match = dict() + + col_id2list = dict() + for col_id, col_item in enumerate(columns): + col_id2list[col_id] = col_item + + tab_id2list = dict() + for tab_id, tab_item in enumerate(tables): + tab_id2list[tab_id] = tab_item + + # 5-gram + n = g_linking_ngrams_n + while n > 0: + for i, n_gram_list in enumerate(text_utils.ngrams(tokens, n)): + if len("".join(n_gram_list).strip()) == 0: + continue + # exact match case + for col_id, col in col_id2list.items(): + if exact_match(n_gram_list, col): + set_q_relation(q_col_match, i, n, col_id, "CEM") + for tab_id, tab in tab_id2list.items(): + if exact_match(n_gram_list, tab): + set_q_relation(q_tab_match, i, n, tab_id, "TEM") + + # partial match case + for col_id, col in col_id2list.items(): + if partial_match(n_gram_list, col): + set_q_relation(q_col_match, i, n, col_id, "CPM", force=False) + for tab_id, tab in tab_id2list.items(): + if partial_match(n_gram_list, tab): + set_q_relation(q_tab_match, i, n, tab_id, "TEM", force=False) + n -= 1 + return {"q_col_match": q_col_match, "q_tab_match": q_tab_match} + + +def compute_cell_value_linking(tokens, db): + """cell-value linking""" + + def isnumber(word): + """check if input is a number""" + try: + float(word) + return True + except Exception: + return False + + def check_cell_match(word, cells): + """check if word partial/exact match one of values""" + for cell in cells: + if word in cell: + return True + return False + + num_date_match = {} + cell_match = {} + + for q_id, word in enumerate(tokens): + if len(word.strip()) == 0: + continue + if word in STOPWORDS: + continue + + num_flag = isnumber(word) + for col_id, column in enumerate(db.columns): + # word is number + if num_flag: + if column.dtype in ("number", "real", "time"): # TODO fine-grained date + rel = "NUMBER" if column.dtype == "real" else column.dtype.upper() + num_date_match[f"{q_id},{col_id}"] = rel + elif column.dtype.lower() == "binary": # binary condition should use special process + continue + elif check_cell_match(word, column.cells): + cell_match[f"{q_id},{col_id}"] = "CELLMATCH" + + cv_link = {"num_date_match": num_date_match, "cell_match": cell_match} + return cv_link + + +def _table_id(db, col): + if col == 0: + return None + else: + return db.columns[col].table.id + + +def _foreign_key_id(db, col): + foreign_col = db.columns[col].foreign_key_for + if foreign_col is None: + return None + return foreign_col.id + + +def _match_foreign_key(db, col, table): + foreign_key_id = _foreign_key_id(db, col) + if foreign_key_id is None: + return None + return table == _table_id(db, foreign_key_id) + + +def build_relation_matrix(other_links, total_length, q_length, c_length, c_boundaries, t_boundaries, db): + """build relation matrix""" + sc_link = other_links.get("sc_link", {"q_col_match": {}, "q_tab_match": {}}) + cv_link = other_links.get("cv_link", {"num_date_match": {}, "cell_match": {}}) + + # Catalogue which things are where + loc_types = {} + for i in range(q_length): + loc_types[i] = ("question",) + + c_base = q_length + for c_id, (c_start, c_end) in enumerate(zip(c_boundaries, c_boundaries[1:])): + for i in range(c_start + c_base, c_end + c_base): + loc_types[i] = ("column", c_id) + t_base = q_length + c_length + for t_id, (t_start, t_end) in enumerate(zip(t_boundaries, t_boundaries[1:])): + for i in range(t_start + t_base, t_end + t_base): + loc_types[i] = ("table", t_id) + + relations = np.zeros((total_length, total_length), dtype=np.int64) + for i, j in 
itertools.product(range(total_length), repeat=2): + + def _set_relation(name): + """set relation for position (i, j)""" + relations[i, j] = RELATIONS.relation_ids[name] + + def _get_qc_links(q_id, c_id): + """get link relation of q and col""" + coord = "%d,%d" % (q_id, c_id) + if coord in sc_link["q_col_match"]: + return sc_link["q_col_match"][coord] + elif coord in cv_link["cell_match"]: + return cv_link["cell_match"][coord] + elif coord in cv_link["num_date_match"]: + return cv_link["num_date_match"][coord] + return "_default" + + def _get_qt_links(q_id, c_id): + """get link relation of q and tab""" + coord = "%d,%d" % (q_id, c_id) + if coord in sc_link["q_tab_match"]: + return sc_link["q_tab_match"][coord] + else: + return "_default" + + try: + i_type, j_type = loc_types[i], loc_types[j] + except Exception as e: + logging.error( + f"loc_types: {loc_types}. c_boundaries: {c_boundaries}." + + f"i, j, total_length and q_length: {i}, {j}, {total_length}, {q_length}" + ) + raise e + + if i_type[0] == "question": + # relation of question-to-* + if j_type[0] == "question": # relation qq + _set_relation(("qq_dist", clamp(j - i, RELATIONS.qq_max_dist))) + elif j_type[0] == "column": # relation qc + j_real = j_type[1] + rel = _get_qc_links(i, j_real) + _set_relation("qc" + rel) + elif j_type[0] == "table": # relation qt + j_real = j_type[1] + rel = _get_qt_links(i, j_real) + _set_relation("qt" + rel) + elif i_type[0] == "column": + # relation of column-to-* + if j_type[0] == "question": # relation cq + i_real = i_type[1] + rel = _get_qc_links(j, i_real) + _set_relation("cq" + rel) + elif j_type[0] == "column": # relation cc + col1, col2 = i_type[1], j_type[1] + if col1 == col2: + _set_relation(("cc_dist", clamp(j - i, RELATIONS.cc_max_dist))) + else: + _set_relation("cc_default") + # TODO: foreign keys and table match + if RELATIONS.cc_foreign_key: + if _foreign_key_id(db, col1) == col2: + _set_relation("cc_foreign_key_forward") + if _foreign_key_id(db, col2) == col1: + _set_relation("cc_foreign_key_backward") + if RELATIONS.cc_table_match and _table_id(db, col1) == _table_id(db, col2): + _set_relation("cc_table_match") + elif j_type[0] == "table": # relation ct + col, table = i_type[1], j_type[1] + _set_relation("ct_default") + if RELATIONS.ct_foreign_key and _match_foreign_key(db, col, table): + _set_relation("ct_foreign_key") + if RELATIONS.ct_table_match: + col_table = _table_id(db, col) + if col_table == table: + if col in db.columns[col].table.primary_keys_id: + _set_relation("ct_primary_key") + else: + _set_relation("ct_table_match") + elif col_table is None: + _set_relation("ct_any_table") + elif i_type[0] == "table": + # relation of table-to-* + if j_type[0] == "question": + i_real = i_type[1] + rel = _get_qt_links(j, i_real) + _set_relation("tq" + rel) + elif j_type[0] == "column": + table, col = i_type[1], j_type[1] + _set_relation("tc_default") + + if RELATIONS.tc_foreign_key and _match_foreign_key(db, col, table): + _set_relation("tc_foreign_key") + if RELATIONS.tc_table_match: + col_table = _table_id(db, col) + if col_table == table: + if col in db.columns[col].table.primary_keys_id: + _set_relation("tc_primary_key") + else: + _set_relation("tc_table_match") + elif col_table is None: + _set_relation("tc_any_table") + elif j_type[0] == "table": + table1, table2 = i_type[1], j_type[1] + if table1 == table2: + _set_relation(("tt_dist", clamp(j - i, RELATIONS.tt_max_dist))) + else: + _set_relation("tt_default") + if RELATIONS.tt_foreign_key: + forward = table2 in 
db.tables[table1].foreign_keys_tables + backward = table1 in db.tables[table2].foreign_keys_tables + if forward and backward: + _set_relation("tt_foreign_key_both") + elif forward: + _set_relation("tt_foreign_key_forward") + elif backward: + _set_relation("tt_foreign_key_backward") + + return relations + + +if __name__ == "__main__": + """run some simple test cases""" + q = "帮 我 查 一 下 大众 帕 萨 特 的 轴距 和 能源 类型 分别 是 什么 , 叫 什么 名 ?".split(" ") + for i, tok in enumerate(q): + print(i, tok) + # header = Header(['名称', '品牌', '轴距', '能源类型'], ['text', 'text', 'real', 'text']) + # print(header.names) + # print(compute_schema_linking(q, header)) + + # q = '帮 我 查 一 下 大众 轴距 大于 10 米 的 车 能源 类型 分别 是 什么 ?'.split(' ') + # for i, tok in enumerate(q): + # print(i, tok) + # rows = [['帕萨特', '大众', '10', '汽油车'], + # ['伊兰特', '现代', '10', '汽油车'], + # ['GL8', '别克', '10', '汽油车']] + # table = Table('tid1', 'tname', 'title', header, rows) + # print(compute_cell_value_linking(q, table)) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/metrics.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/metrics.py new file mode 100644 index 0000000000000000000000000000000000000000..bc5b616a6ed2e677678210d2d4d17642e166a995 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/metrics.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import json +import os + +from text2sql.dataproc import sql_label +from text2sql.utils import dusql_evaluation, nn_utils + + +class MetricSimpleSQLAcc(object): + """SimpleSQLAccMetric.""" + + def __init__(self, eval_value=False): + """init of class + + Args: + eval_value (TYPE): Default is False + + """ + super(MetricSimpleSQLAcc, self).__init__() + + self._eval_value = eval_value + self._gold_list = [] + self._pred_list = [] + self._correctness = [] + + def update(self, labels, predicts): + preds = nn_utils.tensor2numpy(predicts) + pred_sqls = sql_label.decode_sqls(preds, labels["header_lens"], None, labels["limit_nums"]) + self._gold_list.extend(labels["gold_sqls"]) + self._pred_list.extend(pred_sqls) + + def calc(self): + conn_correct = 0 + sel_col_agg_correct = 0 + conds_correct = 0 + conds_col_correct = 0 + conds_col_op_correct = 0 + all_correct = 0 + sel_num_correct = 0 + cond_num_correct = 0 + order_correct = 0 + limit_correct = 0 + group_correct = 0 + having_correct = 0 + num_queries = len(self._gold_list) + self._correctness.clear() + for pred_sql, true_sql in zip(self._pred_list, self._gold_list): + n_correct = 0 + if pred_sql["sel_num"] == true_sql.sel_num: + sel_num_correct += 1 + if pred_sql["cond_num"] == len(true_sql.conds) + len(true_sql.having): + cond_num_correct += 1 + if pred_sql["cond_conn_op"] == true_sql.cond_conn_op: + conn_correct += 1 + n_correct += 1 + pred_aggs = set(zip(pred_sql["sel"], pred_sql["agg"])) + true_aggs = set(zip(true_sql.sel, true_sql.agg)) + if pred_aggs == true_aggs: + sel_col_agg_correct += 1 + n_correct += 1 + pred_conds = set([(cond[0], cond[1], cond[2]) for cond in pred_sql["conds"]]) + if not self._eval_value: + true_conds_tmp = [(cond[0], cond[1], None) for cond in true_sql.conds] + else: + true_conds_tmp = [(cond[0], cond[1], cond[2]) for cond in true_sql.conds] + true_conds = set(true_conds_tmp) + if pred_conds == true_conds: + conds_correct += 1 + n_correct += 1 + pred_conds_col = set([cond[0] for cond in pred_sql["conds"]]) + true_conds_col = set([cond[0] for cond in true_sql["conds"]]) + if pred_conds_col == true_conds_col: + conds_col_correct += 1 + pred_conds_col_op = set([(cond[0], cond[1]) for cond in pred_sql["conds"]]) + true_conds_col_op = set([(cond[0], cond[1]) for cond in true_sql["conds"]]) + if pred_conds_col_op == true_conds_col_op: + conds_col_op_correct += 1 + + pred_order_direc = pred_sql["order_direction"] + true_order_direc = true_sql["order_direction"] + pred_order_by = pred_sql["order_by"] + true_order_by = true_sql["order_by"] + if pred_order_direc == true_order_direc and pred_order_by == true_order_by: + n_correct += 1 + order_correct += 1 + + pred_limit = pred_sql["limit"] + true_limit = true_sql["limit"] + if pred_limit == true_limit: + n_correct += 1 + limit_correct += 1 + + pred_group_by = pred_sql["group_by"] + true_group_by = true_sql["group_by"] + if pred_group_by == true_group_by: + n_correct += 1 + group_correct += 1 + + pred_having = [list(x) for x in pred_sql["having"]] + true_having = [list(x) for x in true_sql["having"]] + if not self._eval_value: + true_having = [x[:-1] + [None] for x in true_having] + if pred_having == true_having: + n_correct += 1 + having_correct += 1 + + if n_correct == 7: + all_correct += 1 + self._correctness.append(1) + else: + self._correctness.append(0) + + self._acc = all_correct / num_queries + self._sub_task_acc = { + "sel_num": sel_num_correct / num_queries, + "sel_col_agg": sel_col_agg_correct / num_queries, + "cond_num": cond_num_correct / num_queries, + 
"cond_conn": conn_correct / num_queries, + "where_conds": conds_correct / num_queries, + "where_col": conds_col_correct / num_queries, + "where_col_op": conds_col_op_correct / num_queries, + } + self._sub_task_acc.update( + { + "order_by": order_correct / num_queries, + "limit": limit_correct / num_queries, + } + ) + self._sub_task_acc.update( + { + "group_by": group_correct / num_queries, + "having": having_correct / num_queries, + } + ) + return self._acc, self._sub_task_acc + + def save(self, save_dir, file_tag): + """ + + Args: + save_dir (TYPE): NULL + file_tag (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + if "{acc}" in file_tag or "{acc:" in file_tag: + file_tag = file_tag.format(acc=self._acc) + file_path = os.path.join(save_dir, file_tag) + if not os.path.isdir(os.path.dirname(file_path)): + os.mkdir(os.path.dirname(file_path)) + with open(file_path, "w") as ofs: + for pred, correct in zip(self._pred_list, self._correctness): + pred["correct"] = correct + ofs.write(json.dumps(pred, ensure_ascii=False) + "\n") + + def __str__(self): + """ + Returns: TODO + + Raises: NULL + """ + return f"acc {self._acc * 100:.2f}, sub tasks {self._sub_task_acc}" + + +class MetricDuSQLAcc(object): + """Acc Metric for DuSQL like dataset""" + + def __init__(self, dataset, eval_value=True): + """init""" + super(MetricDuSQLAcc, self).__init__() + self.dataset = dataset + self.eval_value = eval_value + + self.foreign_key_maps = { + db_id: dusql_evaluation.build_foreign_key_map(db.orig) for db_id, db in self.dataset.db_dict.items() + } + self.evaluator = dusql_evaluation.Evaluator( + self.dataset.db_schema_file, self.foreign_key_maps, eval_value=self.eval_value + ) + self.results = [] + + def update(self, item, inferred_code): + """update one instance""" + sql_query = item.orig["query"] if "query" in item.orig else item.orig["sql_query"] + ret_dict = self.evaluator.evaluate_one(item.db.db_id, sql_query, inferred_code) + ret_dict["db_id"] = item.orig["db_id"] + ret_dict["question"] = item.orig["question"] + self.results.append(ret_dict) + + def udpate_beams(self, item, inferred_codes, orig_question=None): + """update one instance beam""" + beam_dict = {} + if orig_question: + beam_dict["orig_question"] = orig_question + for i, code in enumerate(inferred_codes): + ret_dict = self.evaluator.evaluate_one(item.db.db_id, item.orig["query"], code) + beam_dict[i] = ret_dict + if ret_dict["exact"] is True: + break + self.results.append(beam_dict) + + def finalize(self): + """finalize""" + self.evaluator.finalize() + return {"per_item": self.results, "total_scores": self.evaluator.scores} + + +if __name__ == "__main__": + """run some simple test cases""" + pass diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/nn_utils.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/nn_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..02d04743d52ca028bcb6d3d1f31f5331b2fb76e2 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/nn_utils.py @@ -0,0 +1,199 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle +from paddle import nn + + +def build_linear(n_in, n_out, name=None, init=None): + return nn.Linear( + n_in, + n_out, + weight_attr=paddle.ParamAttr(name="%s.w_0" % name if name is not None else None, initializer=init), + bias_attr="%s.b_0" % name if name is not None else None, + ) + + +def build_layer_norm(n_in, name): + return nn.LayerNorm( + normalized_shape=n_in, + weight_attr=paddle.ParamAttr( + name="%s_layer_norm_scale" % name if name is not None else None, initializer=nn.initializer.Constant(1.0) + ), + bias_attr=paddle.ParamAttr( + name="%s_layer_norm_bias" % name if name is not None else None, initializer=nn.initializer.Constant(0.0) + ), + ) + + +def lstm_init(num_layers, hidden_size, *batch_sizes): + init_size = batch_sizes + (hidden_size,) + if num_layers is not None: + init_size = (num_layers,) + init_size + init = paddle.zeros(init_size) + return (init, init) + + +def batch_gather_2d(var, indices): + """Gather slices from var in each batch, according to corresponding + index in indices. Currently, it only support 2d Tensor. + + Args: + var (Variable): with shape [batch_size, ...] + indices (Variable): with shape [batch_size, max_len] + + Returns: Variable with shape [batch_size] + + Raises: NULL + + Examples: + var + [[1, 2, 3], + [4, 5, 6]] + indices + [[2, 0], [1, 2]] + + return + [[3, 1], [5, 6]] + + """ + if len(indices.shape) != 2: + raise ValueError( + "shape of indices error. it should be a 2-D layers. " "but got shape = %s" % (str(indices.shape),) + ) + + batch_size = paddle.shape(indices)[0] + + zero = paddle.to_tensor([0], dtype="int64") + one = paddle.to_tensor([1], dtype="int64") + end = paddle.cast(batch_size, dtype="int64") + batch_indices_1d = paddle.unsqueeze(paddle.arange(zero, end, one, dtype=indices.dtype), [1]) + + seq_len = indices.shape[1] + batch_indices = paddle.expand(batch_indices_1d, [batch_size, seq_len]) + + coord_2d = paddle.concat([paddle.unsqueeze(batch_indices, [2]), paddle.unsqueeze(indices, [2])], axis=2) + coord_2d.stop_gradient = True + coord_1d = paddle.reshape(coord_2d, shape=[-1, 2]) + output_1d = paddle.gather_nd(var, coord_1d) + output_2d = paddle.reshape(output_1d, [batch_size, seq_len, var.shape[-1]]) + return output_2d + + +def sequence_mask(seq_hidden, mask, mode="zero"): + """ + + Args: + seq_hidden (Tensor): NULL + mask (Tensor): 1 for un-mask tokens, and 0 for mask tokens. + mode (str): zero/-inf/+inf + + Returns: TODO + + Raises: NULL + """ + + while len(mask.shape) < len(seq_hidden.shape): + mask = mask.unsqueeze([-1]) + + mask = mask.cast(dtype=seq_hidden.dtype) + masked = paddle.multiply(seq_hidden, mask) + if mode == "zero": + return masked + + if mode == "-inf": + scale_size = +1e5 + elif mode == "+inf": + scale_size = -1e5 + else: + raise ValueError(f"mask mode setting error. 
expect zero/-inf/+inf, but got {mode}") + + add_mask = paddle.scale(mask - 1, scale=scale_size) + masked = paddle.add(masked, add_mask) + return masked + + +def pad_sequences(seqs, max_len, value=0.0, dtype=np.int64): + """padding sequences""" + data_max_len = 0 + format_seqs = [] + for seq in seqs: + format_seqs.append(list(seq)) + data_max_len = len(seq) if len(seq) > data_max_len else data_max_len + max_len = min(max_len, data_max_len) + padded = [] + for seq in format_seqs: + padded.append(seq[:max_len] + [value] * (max_len - len(seq))) + padded = np.array(padded) + return padded.astype(dtype) + + +def pad_sequences_for_3d(seqs, max_col, max_num, dtype=np.int64): + """padding sequences for 3d""" + padded = [] + for seq in seqs: + padded.append(np.vstack((seq, np.zeros((max_col - seq.shape[0], max_num), dtype=np.int64)))) + return np.array(padded).astype(dtype) + + +def pad_index_sequences(seqs, max_col, max_row, dtype=np.int64): + """padding sequences for column token indexes""" + padded = [] + for query in seqs: + new_cols = [] + for col in query[:max_row]: + temp_cols = col[:max_col] + [0] * (max_col - len(col)) + new_cols.append(temp_cols) + new_cols = new_cols + [[0] * max_col for _ in range(max_row - len(new_cols))] + padded.append(new_cols) + return np.array(padded).astype(dtype) + + +def tensor2numpy(inputs): + if type(inputs) in (list, tuple): + return [x.numpy() for x in inputs] + elif type(inputs) is dict: + outputs = {} + for key, value in inputs.items(): + if type(value) is paddle.Tensor: + outputs[key] = value.numpy() + else: + outputs[key] = value + return outputs + elif type(inputs) is paddle.Tensor: + return inputs.numpy() + else: + raise ValueError("only support inputs to be of type list/tuple/dict/Tensor." + f"but got {type(inputs)}") + + +if __name__ == "__main__": + """run some simple test cases""" + seq_input = paddle.to_tensor( + [ + [1, 2, 3, 4], + [5, 5, 5, 5], + ], + dtype="float32", + ) + mask = paddle.to_tensor( + [ + [1, 1, 0, 0], + [1, 1, 1, 0], + ], + dtype="float32", + ) + + print(sequence_mask(seq_input, mask, mode="zero")) + print(sequence_mask(seq_input, mask, mode="-inf")) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/serialization.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/serialization.py new file mode 100644 index 0000000000000000000000000000000000000000..1dd638d5319bff1c3cefd120efc211337e0d2c3c --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/serialization.py @@ -0,0 +1,39 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +def to_dict_with_sorted_values(d, key=None): + """to dict with sorted values""" + return {k: sorted(v, key=key) for k, v in d.items()} + + +def to_dict_with_set_values(d): + """to dict with set values""" + result = {} + for k, v in d.items(): + hashable_v = [] + for v_elem in v: + if isinstance(v_elem, list): + hashable_v.append(tuple(v_elem)) + else: + hashable_v.append(v_elem) + result[k] = set(hashable_v) + return result + + +def tuplify(x): + """tuplify""" + if not isinstance(x, (tuple, list)): + return x + return tuple(tuplify(elem) for elem in x) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/text_utils.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/text_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..8273bcb6b1d207378b55e388bcebd98463e8e751 --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/text_utils.py @@ -0,0 +1,424 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import re +from collections import defaultdict +from statistics import mean + +import cn2an +from LAC import LAC + +g_max_candi_value = 5 +g_date_patt = re.compile(r"(([0-9]{2})[0-9]{2}-)?(0?[1-9]|1[012])-[0123][0-9]") +g_date_patt2 = re.compile(r"(([0-9]{2})[0-9]{2}年)?[0-9]{1,2}月[0-9]{2}[号日]|([0-9]{2})[0-9]{2}年[0-9]{1,2}月") + +g_lac_seg = LAC(mode="seg") +g_lac_lac = LAC(mode="lac") + +wordseg = lambda sentence: g_lac_seg.run(sentence) +lac = lambda sentence: g_lac_lac.run(sentence) + +# LAC Tags +# 标签 含义 标签 含义 标签 含义 标签 含义 +# n 普通名词 f 方位名词 s 处所名词 nw 作品名 +# nz 其他专名 v 普通动词 vd 动副词 vn 名动词 +# a 形容词 ad 副形词 an 名形词 d 副词 +# m 数量词 q 量词 r 代词 p 介词 +# c 连词 u 助词 xc 其他虚词 w 标点符号 +# PER 人名 LOC 地名 ORG 机构名 TIME 时间 +g_ner_tag_mapping = { + "LOC": "LOC", + "TIME": "TIME", + "PER": "PER", + "m": "NUM", +} +EMPTY_TAG = "o" + + +def ner(sentence): + """wordseg and ner + + Args: + sentence (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + results = lac(sentence) + words = results[0] + tags_tmp = results[1] + tags = [] + for tag in tags_tmp: + tags.append(g_ner_tag_mapping.get(tag, EMPTY_TAG)) + return (words, tags) + + +def ngrams(tok_list, n): + """generate n-grams from tok_list + + Args: + tok_list (TYPE): NULL + n (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + for pos in range(len(tok_list) - n + 1): + yield tok_list[pos : pos + n] + + +def remove_brackets(s): + """Remove brackets [] () from text""" + return re.sub(r"[\(\(].*[\)\)]", "", s) + + +def is_float(value): + """is float""" + try: + float(value) + return True + except ValueError: + return False + except TypeError: + return False + + +def cn_to_an(string): + """cn to an""" + try: + return str(cn2an.cn2an(string, "normal")) + except ValueError: + return string + + +def an_to_cn(string): + """an to cn""" + try: + return str(cn2an.an2cn(string)) + except ValueError: + return string + + +def str_to_num(string): + """str to num""" + try: + float_val = float(cn_to_an(string)) + if int(float_val) == float_val: + return str(int(float_val)) + else: + 
return str(float_val) + except ValueError: + return None + + +def str_to_year(string): + """str to year""" + year = string.replace("年", "") + year = cn_to_an(year) + if is_float(year) and float(year) < 1900: + year = int(year) + 2000 + return str(year) + else: + return None + + +class CandidateValueExtractor: + """ + params: + """ + + CN_NUM = "〇一二三四五六七八九零壹贰叁肆伍陆柒捌玖貮两1234567890" + CN_UNIT = "十拾百佰千仟万萬亿億兆点." + + @classmethod + def norm_unit(cls, rows, col_id, values): + """norm unit""" + l = [] + for row in rows: + if isinstance(row[col_id], str) or row[col_id] is None: + return None + l.append(len(str(int(row[col_id])))) + mean_len = round(mean(l) + 0.5) + + new_values = set() + for value in values: + flag = False + if value.isdigit(): + str_value = str(value) + diff = len(str_value) - mean_len + if diff > 2: + tail_str = str_value[-1 * diff :] + if tail_str.count("0") == len(tail_str): + new_values.add(str_value[:mean_len]) + new_values.add(value) + flag = True + if not flag: + new_values.add(value) + return list(new_values) + + @classmethod + def search_values(cls, question, table): + """search candidate cells from table, that will be used as sql values + + Args: + question_words (list): NULL + question_tags (list): NULL + table (Table): NULL + + Returns: TODO + + Raises: NULL + """ + # 提取年份和数字 + value_in_question = cls.extract_values_from_text(question) + all_candidate = [] + for col_id in range(len(table.header)): + header = table.header[col_id] + # 提取col出现在question中的cell + # TODO 这里存在一个问题,一个text类型cell必须完全在question中出现才会被当做候选cell + value_in_column = cls.extract_values_from_column(question, table, col_id, header.type) + if header.type == "text": + candi_values = value_in_column + elif header.type == "real": + norm_unit_res = cls.norm_unit(table.rows, col_id, value_in_question) + if norm_unit_res is not None: + value_in_question = norm_unit_res + candi_values = value_in_question + if len(candi_values) >= g_max_candi_value: + candi_values = candi_values[:g_max_candi_value] + else: + st_candi_values = set(candi_values) + for v in value_in_column: + if v in st_candi_values: + continue + st_candi_values.add(v) + if len(st_candi_values) >= g_max_candi_value: + break + candi_values = list(st_candi_values) + all_candidate.append(candi_values) + return all_candidate + + # 19年 or 一九年 will be replaced to 2019年 + @classmethod + def extract_year_from_text(cls, text): + """extract year from text""" + values = [] + # FIXME trick: yrs is from 2000 + num_year_texts = re.findall(r"[0-9][0-9]年", text) + values += ["20{}".format(text[:-1]) for text in num_year_texts] + cn_year_texts = re.findall(r"[{}][{}]年".format(cls.CN_NUM, cls.CN_NUM), text) + cn_year_values = [str_to_year(text) for text in cn_year_texts] + values += [value for value in cn_year_values if value is not None] + return values + + @classmethod + def extract_date_from_text(cls, text): + """ + + Args: + text (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + date_values = [] + unmatched_spans = [] + base = 0 + while base < len(text): + res = re.search(g_date_patt, text[base:]) + if res is None: + unmatched_spans.append(text[base:]) + else: + start, end = res.span() + unmatched_spans.append(text[base:start]) + unmatched_spans.append(text[end:]) + base = end + date_values.append(text[start:end]) + return date_values, list(filter(lambda x: x.strip() != "", unmatched_spans)) + + @classmethod + def extract_num_from_text(cls, text): + """extract num from text""" + values = [] + # 1. 
all digital number + num_values = re.findall(r"[-+]?[0-9]*\.?[0-9]+", text) + values += num_values + # 2. include chinese word + cn_num_unit = cls.CN_NUM + cls.CN_UNIT + cn_num_texts = re.findall(r"[{}]*\.?[{}]+".format(cn_num_unit, cn_num_unit), text) + + cn_num_values = [str_to_num(text) for text in cn_num_texts] + values += [value for value in cn_num_values if value is not None] + # 3. both number and chinese word + cn_num_mix = re.findall(r"[0-9]*\.?[{}]+".format(cls.CN_UNIT), text) + for word in cn_num_mix: + num = re.findall(r"[-+]?[0-9]*\.?[0-9]+", word) + for n in num: + word = word.replace(n, an_to_cn(n)) + str_num = str_to_num(word) + if str_num is not None: + values.append(str_num) + return values + + @classmethod + def extract_values_from_text(cls, text): + """extract values from text""" + values = [] + values += cls.extract_year_from_text(text) + values_tmp, unmatched_spans = cls.extract_date_from_text(text) + values.extend(values_tmp) + for span in unmatched_spans: + values += cls.extract_num_from_text(span) + return list(set(values)) + + @classmethod + def extract_values_from_column(cls, question, table, col_id, col_type): + """extract values from column""" + if col_type == "text": + base_threshold = 0 + else: + base_threshold = 2 + + value_score = table.search(question, col_id) + value_score_filter = list(filter(lambda x: x[1] > base_threshold, value_score)) + if len(value_score_filter) == 0: + return [] + + value_score_filter.sort(key=lambda x: x[1], reverse=True) + # if col_type == 'text' \ + # and len(value_score_filter) > g_max_candi_value \ + # and value_score_filter[g_max_candi_value][1] == value_score_filter[0][1]: + # value_score_filter_tmp = value_score_filter[:50] + # tmp_score = value_score_filter[g_max_candi_value][1] + # select_col_values = [x[0] for x in value_score_filter_tmp if x[1] >= tmp_score] + # else: + # select_col_values = [x[0] for x in value_score_filter[:g_max_candi_value]] + select_col_values = [x[0] for x in value_score_filter[:g_max_candi_value]] + + return select_col_values + + +def re_search(patt, text): + """ + + Args: + patt (TYPE): NULL + text (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + lst_result = [] + pos = 0 + while True: + match = re.search(patt, text[pos:]) + if match is None: + break + lst_result.append((match.start() + pos, match.end() + pos)) + pos = pos + match.end() + 1 + + return lst_result + + +CN_NUM = "〇一二三四五六七八九零壹贰叁肆伍陆柒捌玖貮两1234567890" +CN_UNIT = "十拾百佰千仟万萬亿億兆点." 
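+# Module-level copies of the Chinese digit/unit alphabets used by CandidateValueExtractor;
+# the compiled patterns below reuse them to locate numeric spans in raw question text.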
+CN_NUM_UNIT = CN_NUM + CN_UNIT +PATT_NUM = re.compile(r"[-+]?[0-9]*\.?[0-9]+") +PATT_CN_NUM = re.compile(r"[{}]*\.?[{}]+".format(CN_NUM_UNIT, CN_NUM_UNIT)) +PATT_MIX_NUM = re.compile(r"[0-9]*\.?[{}]+".format(CN_UNIT)) + + +def _extract_num_span(text): + """extract number and mark their spans + + Args: + text (TYPE): NULL + + Returns: TODO + + Raises: NULL + """ + dct_start2end = defaultdict(set) + # digital number + spans = re_search(PATT_NUM, text) + for start, end in spans: + dct_start2end[start].add((end, text[start:end])) + + # chinese number + spans = re_search(PATT_CN_NUM, text) + for start, end in spans: + num = str_to_num(text[start:end]) + if num is None: + continue + dct_start2end[start].add((end, num)) + + # number, chinese + spans = re_search(PATT_MIX_NUM, text) + for start, end in spans: + orig_num = text[start:end] + for ar_num in re.findall(PATT_NUM, orig_num): + orig_num = orig_num.replace(ar_num, an_to_cn(ar_num)) + num = str_to_num(orig_num) + if num is not None: + dct_start2end[start].add((end, num)) + + lst_result = [] + for start, st_end_and_num in sorted(dct_start2end.items()): + lst_end, lst_num = list(zip(*st_end_and_num)) + end = max(lst_end) + if len(lst_result) == 0 or start > lst_result[-1][0][1]: + lst_result.append([(start, end), lst_num]) + continue + last_start, last_end = lst_result[-1][0] + if end - start > last_end - last_start: + lst_result.pop(-1) + lst_result.append([(start, end), lst_num]) + else: + pass + + return lst_result + + +def wordseg_and_extract_num(text): + lst_span_and_nums = _extract_num_span(text) + lst_words = [] + pos = 0 + lst_nums = [] + lst_nums_index = [] + for span, nums in lst_span_and_nums: + start, end = span + lst_words.extend(wordseg(text[pos:start])) + lst_nums.extend(nums) + lst_nums_index.extend([len(lst_words)] * len(nums)) + lst_words.append(text[start:end]) + pos = end + + if pos < len(text): + lst_words.extend(wordseg(text[pos:])) + + return lst_words, lst_nums, lst_nums_index + + +if __name__ == "__main__": + """run some simple test cases""" + lst_token = ["hello", ",", "I", "am", "Li", "Lei", "."] + print(list(ngrams(lst_token, 2))) + print(list(ngrams(lst_token, 4))) + + text = "123年后,你好一百万不多,2百万不少" + print(wordseg_and_extract_num(text)) diff --git a/examples/text_to_sql/RAT-SQL/text2sql/utils/utils.py b/examples/text_to_sql/RAT-SQL/text2sql/utils/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3de10ab34c618ff99d37ba0ac4d7bf0d4d94b84d --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/text2sql/utils/utils.py @@ -0,0 +1,104 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
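+# Misc helpers: a lightweight Timer for cost statistics, list/file utilities and a debug tensor printer.
+# Typical Timer usage (sketch): t = Timer(); ...; elapsed = t.check()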
+ +import logging +import time + + +class Timer(object): + """Stat Cost Time""" + + def __init__(self, msg=""): + super(Timer, self).__init__() + + self._msg = msg + self._start = time.time() + self._last = self._start + + def reset(self, only_last=False, msg=None): + """reset all setting""" + if msg is not None: + self._msg = msg + curr_time = time.time() + self._last = curr_time + if not only_last: + self._start = curr_time + + def check(self): + """check cost time from start""" + end = time.time() + cost = end - self._start + return cost + + def interval(self): + """check cost time from lst""" + end = time.time() + cost = end - self._last + self._last = end + return cost + + def ending(self): + """ending checking and log""" + cost = "%.2f" % time.time() - self._start + if self._msg == "": + log_msg = "cost time: %s" % (cost) + elif "{}" in self._msg: + log_msg = self._msg.format(cost) + else: + log_msg = self._msg + cost + + logging.info(log_msg) + + +def list_increment(lst: list, base: int): + """increment each element in list""" + for i in range(len(lst)): + lst[i] += base + return lst + + +def count_file_lines(filename): + cnt = 0 + with open(filename) as ifs: + for _ in ifs: + cnt += 1 + return cnt + + +def print_tensors(tag="*", **kwargs): + """print tensors for debugging""" + print(tag * 50) + for key, value in kwargs.items(): + print(key, ":", value) + + +if __name__ == "__main__": + """run some simple test cases""" + from boomup import data_struct + + question = "三峡碧江需要大于2的招聘数量" + table_json = { + "rows": [ + [4.0, "污水运行工", "三峡碧江公司", "渝北", 2.0, "大专及以上", "给排水/环境工程/机电及相关专业", "sxswrlzyb@163.com"], + [5.0, "污水运行工", "三峡垫江公司", "垫江", 1.0, "大专及以上", "给排水/环境工程/机电及相关专业", "sxswrlzyb@163.com"], + ], + "name": "Table_a7b5108c3b0611e98ad7f40f24344a08", + "title": "", + "header": ["岗位序号", "招聘岗位", "用人单位", "工作地点", "招聘数量", "学历要求", "专业及资格要求", "简历投递邮箱"], + "common": "", + "id": "a7b510", + "types": ["real", "text", "text", "text", "real", "text", "text", "text"], + } + table_json["header"] = data_struct.Header(table_json["header"], table_json["types"]) + table = data_struct.Table(**table_json) diff --git a/examples/text_to_sql/RAT-SQL/train.sh b/examples/text_to_sql/RAT-SQL/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..65f98c27abba52d886119885022af76f05a77ffa --- /dev/null +++ b/examples/text_to_sql/RAT-SQL/train.sh @@ -0,0 +1,37 @@ +#!/bin/bash + +if [ $# -ge 1 ] && [ "$1" == "-h" ]; then + echo "usage:" + echo " $0 trainer_num output_path [main args]" + exit 0 +fi + +trainer_num=$1 +output_path=$2 +shift 2 +if [[ $trainer_num = cuda* ]]; then + cuda_devices=`echo $trainer_num | sed 's/cuda://'` + trainer_num=`echo $cuda_devices | awk -F',' '{print NF}'` +else + cuda_devices=`python script/available_gpu.py --best $trainer_num` +fi + +WORKROOT=$(cd $(dirname $0); pwd) +cd $WORKROOT + +#### paddle #### +# 选择要使用的GPU +export CUDA_VISIBLE_DEVICES=$cuda_devices +# CPU 核数 +export CPU_NUM=$trainer_num +#### python #### +export PYTHONPATH=$WORKROOT:$WORKROOT/third:$WORKROOT/third/ERNIE:$PYTHONPATH +echo "PYTHONPATH=$PYTHONPATH" + +echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" +[ -d $output_path ] || mkdir -p $output_path +[ -f $output_path/train.log ] && mv $output_path/train.log $output_path/train.log.`date +%Y%m%d_%H%M%S` +echo "running command: ($PYTHON_BIN $@ --output $output_path)" > $output_path/train.log +python ./script/text2sql_main.py $@ --mode train --output $output_path 2>&1 | tee -a $output_path/train.log +exit $? 
+ diff --git a/examples/torch_migration/README.md b/examples/torch_migration/README.md new file mode 100644 index 0000000000000000000000000000000000000000..603f040ed11467e011fd3eb01e1aaffd5ff15650 --- /dev/null +++ b/examples/torch_migration/README.md @@ -0,0 +1,62 @@ +# BERT-SST2-Prod +Reproduction process of BERT on SST2 dataset + +# 安装说明 + +* 下载代码库 + +```shell +git clone https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/torch_migration +``` + +* 进入文件夹,安装requirements + +```shell +pip install -r requirements.txt +``` + +* 安装PaddlePaddle与PyTorch + +```shell +# CPU版本的PaddlePaddle +pip install paddlepaddle==2.2.0 -i https://mirror.baidu.com/pypi/simple +# 如果希望安装GPU版本的PaddlePaddle,可以使用下面的命令 +# pip install paddlepaddle-gpu==2.2.0.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html +# 安装PyTorch +pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html +``` + +**注意**: 本项目依赖于paddlepaddle-2.2.0版本,安装时需要注意。 + +* 验证PaddlePaddle是否安装成功 + +运行python,输入下面的命令。 + +```shell +import paddle +paddle.utils.run_check() +print(paddle.__version__) +``` + +如果输出下面的内容,则说明PaddlePaddle安装成功。 + +``` +PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now. +2.2.0 +``` + + +* 验证PyTorch是否安装成功 + +运行python,输入下面的命令,如果可以正常输出,则说明torch安装成功。 + +```shell +import torch +print(torch.__version__) +# 如果安装的是cpu版本,可以按照下面的命令确认torch是否安装成功 +# 期望输出为 tensor([1.]) +print(torch.Tensor([1.0])) +# 如果安装的是gpu版本,可以按照下面的命令确认torch是否安装成功 +# 期望输出为 tensor([1.], device='cuda:0') +print(torch.Tensor([1.0]).cuda()) +``` diff --git a/examples/torch_migration/docs/ThesisReproduction_NLP.md b/examples/torch_migration/docs/ThesisReproduction_NLP.md new file mode 100644 index 0000000000000000000000000000000000000000..eee175d34a280b06fa2b40a11c4401c4ae5182be --- /dev/null +++ b/examples/torch_migration/docs/ThesisReproduction_NLP.md @@ -0,0 +1,928 @@ +# 论文复现指南 + +## 目录 + +- [1. 总览](#1) + - [1.1 背景](#1.1) + - [1.2 前序工作](#1.2) +- [2. 整体框图](#2) + - [2.1 流程概览](#2.1) + - [2.2 reprod_log whl包](#2.2) +- [3. 论文复现理论知识及实战](#3) + - [3.1 模型结构对齐](#3.1) + - [3.2 验证/测试集数据读取对齐](#3.2) + - [3.3 评估指标对齐](#3.3) + - [3.4 损失函数对齐](#3.4) + - [3.5 优化器对齐](#3.5) + - [3.6 学习率对齐](#3.6) + - [3.7 正则化策略对齐](#3.7) + - [3.8 反向对齐](#3.8) + - [3.9 训练集数据读取对齐](#3.9) + - [3.10 网络初始化对齐](#3.10) + - [3.11 模型训练对齐](#3.11) + - [3.12 单机多卡训练](#3.12) +- [4. 论文复现注意事项与FAQ](#4) + - [4.0 通用注意事项](#4.0) + - [4.1 模型结构对齐](#4.1) + - [4.2 验证/测试集数据读取对齐](#4.2) + - [4.3 评估指标对齐](#4.3) + - [4.4 损失函数对齐](#4.4) + - [4.5 优化器对齐](#4.5) + - [4.6 学习率对齐](#4.6) + - [4.7 正则化策略对齐](#4.7) + - [4.8 反向对齐](#4.8) + - [4.9 训练集数据读取对齐](#4.9) + - [4.10 网络初始化对齐](#4.10) + - [4.11 模型训练对齐](#4.11) + + +## 1. 
总览 + + +### 1.1 背景 + +* 以深度学习为核心的人工智能技术仍在高速发展,通过论文复现,开发者可以获得 + * 学习成长:自我能力提升 + * 技术积累:对科研或工作有所帮助和启发 + * 社区荣誉:成果被开发者广泛使用 + + +### 1.2 前序工作 + +基于本指南复现论文过程中,建议开发者准备以下内容。 + +* 了解该模型输入输出格式。以BERT的情感分类任务为例,通过阅读论文与参考代码,了解到模型输入为`[batch_size, sequence_length]`的tensor,类型为`int64`,label为`[batch, ]`的label,类型为`int64`。 +* 准备好训练/验证数据集,用于模型训练与评估 +* 准备好fake input data以及label,与模型输入shape、type等保持一致,用于后续模型前向对齐。 + * 在对齐模型前向过程中,我们不需要考虑数据集模块等其他模块,此时使用fake data是将模型结构和数据部分解耦非常合适的一种方式。 + * 将fake data以文件的形式存储下来,也可以保证PaddlePaddle与参考代码的模型结构输入是完全一致的,更便于排查问题。 + * 在该步骤中,以BERT为例,生成fake data的脚本可以参考:[gen_fake_data.py](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/fake_data/gen_fake_data.py)。 +* 在特定设备(CPU/GPU)上,跑通参考代码的预测过程(前向)以及至少2轮(iteration)迭代过程,保证后续基于PaddlePaddle复现论文过程中可对比。 +* 本文档基于 `BERT-SST2-Prod` 代码以及`reprod_log` whl包进行说明与测试。如果希望体验,建议参考[BERT-SST2-Prod文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/README.md)进行安装与测试。 +* 在复现的过程中,只需要将PaddlePaddle的复现代码以及打卡日志上传至github,不能在其中添加参考代码的实现,在验收通过之后,需要删除打卡日志。建议在初期复现的时候,就将复现代码与参考代码分成2个文件夹进行管理。 + + +## 2. 整体框图 + + +### 2.1 流程概览 + +面对一篇自然语言处理的论文,复现该论文的整体流程如下图所示。 + +![图片](https://user-images.githubusercontent.com/16911935/199389647-b000a7b1-28d1-485e-8ec0-3e7e2c05884a.png) + +总共包含11个步骤。为了高效复现论文,设置了5个验收节点。如上图中黄色框所示。后续章节会详细介绍上述步骤和验收节点,具体内容安排如下: + +* 第3章:介绍11个复现步骤的理论知识、实战以及验收流程。 +* 第4章:针对复现流程过程中每个步骤可能出现的问题,本章会进行详细介绍。如果还是不能解决问题,可以提ISSUE进行讨论,提ISSUE地址:[https://github.com/PaddlePaddle/Paddle/issues/new/choose](https://github.com/PaddlePaddle/Paddle/issues/new/choose) + + +### 2.2 reprod_log whl包 + +#### 2.2.1 reprod_log工具简介 +`reprod_log`是用于论文复现赛中辅助自查和验收工具。该工具源代码地址在:[https://github.com/WenmuZhou/reprod_log](https://github.com/WenmuZhou/reprod_log)。主要功能如下: + +* 存取指定节点的输入输出tensor +* 基于文件的tensor读写 +* 2个字典的对比验证 +* 对比结果的输出与记录 + +更多API与使用方法可以参考:[reprod_log API使用说明](https://github.com/WenmuZhou/reprod_log/blob/master/README.md)。 + +#### 2.2.2 reprod_log使用demo + +下面基于代码:[https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline/reprod_log_demo](https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline/reprod_log_demo),给出如何使用该工具。 + +文件夹中包含`write_log.py`和`check_log_diff.py`文件,其中`write_log.py`中给出了`ReprodLogger`类的使用方法,`check_log_diff.py`给出了`ReprodDiffHelper`类的使用方法,依次运行两个python文件,使用下面的方式运行代码。 + +```shell +# 进入文件夹 +cd pipeline/reprod_log_demo +# 随机生成矩阵,写入文件中 +python write_log.py +# 进行文件对比,输出日志 +python check_log_diff.py +``` + +最终会输出以下内容 + +``` +[2021/11/18 09:29:31] root INFO: demo_test_1: +[2021/11/18 09:29:31] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/18 09:29:31] root INFO: demo_test_2: +[2021/11/18 09:29:31] root INFO: mean diff: check passed: False, value: 0.33387675881385803 +[2021/11/18 09:29:31] root INFO: diff check failed +``` + +可以看出:对于key为`demo_test_1`的矩阵,由于diff为0,小于设置的阈值`1e-6`,核验成功;对于key为`demo_test_2`的矩阵,由于diff为0.33,大于设置的阈值`1e-6`,核验失败。 + +#### 2.2.3 reprod_log在论文复现中应用 + +在论文复现中,基于reprod_log的结果记录模块,产出下面若干文件 +``` +log_reprod +├── forward_paddle.npy +├── forward_torch.npy # 与forward_paddle.npy作为一并核查的文件对 +├── metric_paddle.npy +├── metric_torch.npy # 与metric_paddle.npy作为一并核查的文件对 +├── loss_paddle.npy +├── loss_torch.npy # 与loss_paddle.npy作为一并核查的文件对 +├── bp_align_paddle.npy +├── bp_align_torch.npy # 与bp_align_paddle.npy作为一并核查的文件对 +├── train_align_paddle.npy +├── train_align_torch.npy # pytorch运行得到的参考评估指标 +``` + +基于reprod_log的`ReprodDiffHelper`模块,产出下面5个日志文件。 + +``` +├── forward_diff.log # forward_paddle.npy与forward_torch.npy生成的diff结果文件 +├── metric_diff.log # metric_paddle.npy与metric_torch.npy生成的diff结果文件 +├── loss_diff.log # 
loss_paddle.npy与loss_torch.npy生成的diff结果文件 +├── bp_align_diff.log # bp_align_paddle.npy与bp_align_torch.npy生成的diff结果文件 +├── train_align_diff.log # train_align_paddle.train_align_torch.npy生成的diff结果文件 +``` + +上述文件的生成代码都需要开发者进行开发,验收时需要提供上面罗列的所有文件(不需要提供产生这些文件的可运行程序)以及完整的模型训练评估程序和日志。 +BERT-SST2-Prod项目提供了基于reprod_log的5个验收点对齐验收示例,具体代码地址为:[https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline](https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline), +每个文件夹中的README.md文档提供了使用说明。 + + +## 3. 论文复现理论知识及实战 + + +### 3.1 模型结构对齐 + +对齐模型结构时,一般有3个主要步骤: + +* 网络结构代码转换 +* 权重转换 +* 模型组网正确性验证 + +下面详细介绍这3个部分。 + +#### 3.1.1 网络结构代码转换 + +**【基本流程】** + +由于PyTorch的API和PaddlePaddle的API非常相似,可以参考[PyTorch-PaddlePaddle API映射表](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/08_api_mapping/pytorch_api_mapping_cn.html) +,组网部分代码直接进行手动转换即可。 + +**【注意事项】** + +如果遇到PaddlePaddle没有的API,可以尝试用多种API来组合,也可以给PaddlePaddle团队提[ISSUE](https://github.com/PaddlePaddle/Paddle/issues),获得支持。 + +**【实战】** + +BERT网络结构的PyTorch实现: [transformers-bert](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py) + +对应转换后的PaddlePaddle实现: [paddlenlp-bert](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/bert/modeling.py) + + +#### 3.1.2 权重转换 + +**【基本流程】** + +组网代码转换完成之后,需要对模型权重进行转换,如果PyTorch repo中已经提供权重,那么可以直接下载并进行后续的转换;如果没有提供,则可以基于PyTorch代码,随机生成一个初始化权重(定义完model以后,使用`torch.save()` API保存模型权重),然后进行权重转换。 + +**【注意事项】** + +在权重转换的时候,需要注意`paddle.nn.Linear`等API的权重保存格式和名称等与PyTorch稍有diff,具体内容可以参考`4.1章节`。 + +**【实战】** + +BERT的代码转换脚本可以在这里查看:[https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/weights/torch2paddle.py](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/weights/torch2paddle.py), + +注意:运行该代码需要首先下载Huggingface的BERT预训练模型到该目录下,下载地址为:[https://huggingface.co/bert-base-uncased/blob/main/pytorch_model.bin](https://huggingface.co/bert-base-uncased/blob/main/pytorch_model.bin) + +```python +# https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/weights/torch2paddle.py + +from collections import OrderedDict + +import numpy as np +import paddle +import torch +from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM +from transformers import BertForMaskedLM as PTBertForMaskedLM + + +def convert_pytorch_checkpoint_to_paddle( + pytorch_checkpoint_path="pytorch_model.bin", + paddle_dump_path="model_state.pdparams", + version="old", ): + hf_to_paddle = { + "embeddings.LayerNorm": "embeddings.layer_norm", + "encoder.layer": "encoder.layers", + "attention.self.query": "self_attn.q_proj", + "attention.self.key": "self_attn.k_proj", + "attention.self.value": "self_attn.v_proj", + "attention.output.dense": "self_attn.out_proj", + "intermediate.dense": "linear1", + "output.dense": "linear2", + "attention.output.LayerNorm": "norm1", + "output.LayerNorm": "norm2", + "predictions.decoder.": "predictions.decoder_", + "predictions.transform.dense": "predictions.transform", + "predictions.transform.LayerNorm": "predictions.layer_norm", + } + do_not_transpose = [] + if version == "old": + hf_to_paddle.update({ + "predictions.bias": "predictions.decoder_bias", + ".gamma": ".weight", + ".beta": ".bias", + }) + do_not_transpose = do_not_transpose + ["predictions.decoder.weight"] + + pytorch_state_dict = torch.load( + pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + is_transpose = False + if k[-7:] == ".weight": + # embeddings.weight and LayerNorm.weight do not transpose + if all(d not 
in k for d in do_not_transpose): + if ".embeddings." not in k and ".LayerNorm." not in k: + if v.ndim == 2: + v = v.transpose(0, 1) + is_transpose = True + oldk = k + for hf_name, pd_name in hf_to_paddle.items(): + k = k.replace(hf_name, pd_name) + + # add prefix `bert.` + if "bert." not in k and "cls." not in k and "classifier" not in k: + k = "bert." + k + + print(f"Converting: {oldk} => {k} | is_transpose {is_transpose}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +def compare(out_torch, out_paddle): + out_torch = out_torch.detach().numpy() + out_paddle = out_paddle.detach().numpy() + assert out_torch.shape == out_paddle.shape + abs_dif = np.abs(out_torch - out_paddle) + mean_dif = np.mean(abs_dif) + max_dif = np.max(abs_dif) + min_dif = np.min(abs_dif) + print("mean_dif:{}".format(mean_dif)) + print("max_dif:{}".format(max_dif)) + print("min_dif:{}".format(min_dif)) + + +def test_forward(): + paddle.set_device("cpu") + model_torch = PTBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_paddle = PDBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_torch.eval() + model_paddle.eval() + np.random.seed(42) + x = np.random.randint( + 1, model_paddle.bert.config["vocab_size"], size=(4, 64)) + input_torch = torch.tensor(x, dtype=torch.int64) + out_torch = model_torch(input_torch)[0] + + input_paddle = paddle.to_tensor(x, dtype=paddle.int64) + out_paddle = model_paddle(input_paddle)[0] + + print("torch result shape:{}".format(out_torch.shape)) + print("paddle result shape:{}".format(out_paddle.shape)) + compare(out_torch, out_paddle) + + +if __name__ == "__main__": + convert_pytorch_checkpoint_to_paddle( + "./bert-base-uncased/pytorch_model.bin", + "./bert-base-uncased/model_state.pdparams") + test_forward() + # torch result shape:torch.Size([4, 64, 30522]) + # paddle result shape:[4, 64, 30522] + # mean_dif:1.666686512180604e-05 + # max_dif:0.00015211105346679688 + # min_dif:0.0 +``` + +运行完成之后,会在当前目录生成`model_state.pdparams`文件,即为转换后的PaddlePaddle预训练模型。 +**Tips**: 由于paddlenlp中已有转换后的bert-base-uncased模型,因此可以一键加载,程序会自动下载对应权重! + + +#### 3.1.3 模型组网正确性验证 + +**【基本流程】** + +1. 定义PyTorch模型,加载权重,固定seed,基于numpy生成随机数,转换为PyTorch可以处理的tensor,送入网络,获取输出,使用reprod_log保存结果。 +2. 定义PaddlePaddle模型,加载权重,固定seed,基于numpy生成随机数,转换为PaddlePaddle可以处理的tensor,送入网络,获取输出,使用reprod_log保存结果。 +3. 使用reprod_log排查diff,小于阈值,即可完成自测。 + +**【注意事项】** + +* 模型在前向对齐验证时,需要调用`model.eval()`方法,保证组网中的随机量被关闭,比如BatchNorm、Dropout等。 +* 给定相同的输入数据,为保证可复现性,如果有随机数生成,固定相关的随机种子。 +* 输出diff可以使用`np.mean(np.abs(o1 - o2))`进行计算,一般小于1e-6的话,可以认为前向没有问题。如果最终输出结果diff较大,可以使用二分的方法进行排查,比如说BERT,包含1个embdding层、12个transformer-block以及最后的MLM head层,那么完成模型组网和权重转换之后,如果模型输出没有对齐,可以尝试输出中间某一个transformer-block的tensor进行对比,如果相同,则向后进行排查;如果不同,则继续向前进行排查,以此类推,直到找到导致没有对齐的操作。 + +**【实战】** + +BERT模型组网正确性验证可以参考如下示例代码: +[https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline/Step1](https://github.com/JunnYu/BERT-SST2-Prod/tree/main/pipeline/Step1 + +**【验收】** + +对于待复现的项目,前向对齐验收流程如下。 + +1. 准备输入:fake data + * 使用参考代码的dataloader,生成一个batch的数据,保存下来,在前向对齐时,直接从文件中读入。 + * 固定随机数种子,生成numpy随机矩阵,转化tensor +2. 保存输出: + * PaddlePaddle/PyTorch:dict,key为tensor的name(自定义),value为tensor的值。最后将dict保存到文件中。建议命名为`forward_paddle.npy`和`forward_torch.npy`。 +3. 自测:使用reprod_log加载2个文件,使用report功能,记录结果到日志文件中,建议命名为`forward_diff_log.txt`,观察diff,二者diff小于特定的阈值即可。 +4. 提交内容:新建文件夹,将`forward_paddle.npy`、`forward_torch.npy`与`forward_diff_log.txt`文件放在文件夹中,后续的输出结果和自查日志也放在该文件夹中,一并打包上传即可。 +5. 
注意: + * PaddlePaddle与PyTorch保存的dict的key需要保持相同,否则report过程可能会提示key无法对应,从而导致report失败,之后的`【验收】`环节也是如此。 + * 如果是固定随机数种子,建议将fake data保存到dict中,方便check参考代码和PaddlePaddle的输入是否一致。 + + +### 3.2 验证/测试集数据读取对齐 + +**【基本流程】** + +对于一个数据集,一般有以下一些信息需要重点关注 + +* 数据集名称、下载地址 +* 训练集/验证集/测试集 + +PaddlePaddle中数据集相关的API为`paddle.io.Dataset`,PyTorch中对应为`torch.utils.data.Dataset`,二者功能一致,在绝大多数情况下,可以使用该类构建数据集。它是描述Dataset方法和行为的抽象类,在具体实现的时候,需要继承这个基类,实现其中的`__getitem__`和`__len__`方法。除了参考代码中相关实现,也可以参考待复现论文中的说明。 + +复现完Dataset之后,可以构建Dataloader,对数据进行组batch、批处理,送进网络进行计算。 + +`paddle.io.DataLoader`可以进行数据加载,将数据分成批数据,并提供加载过程中的采样。PyTorch对应的实现为`torch.utils.data.DataLoader`,二者在功能上一致,只是在参数方面稍有diff:(1)PaddlePaddle缺少对`pin_memory`等参数的支持;(2)PaddlePaddle增加了`use_shared_memory`参数来选择是否使用共享内存加速数据加载过程。 + +**【注意事项】** + +论文中一般会提供数据集的名称以及基本信息。复现过程中,我们在下载完数据之后,建议先检查下是否和论文中描述一致,否则可能存在的问题有: + +* 数据集版本不同,比如论文中使用了cnn_dailymail的v3.0.0版本数据集,但是我们下载的是cnn_dailymail的v1.0.0版本数据集,如果不对其进行检查,可能会导致我们最终训练的数据量等与论文中有diff +* 数据集使用方式不同,有些论文中,可能只是抽取了该数据集的子集进行方法验证,此时需要注意抽取方法,需要保证抽取出的子集完全相同。 +* 在评估指标对齐时,我们可以固定batch size,关闭Dataloader的shuffle操作。 + +构建数据集时,可以使用paddlenlp中的数据集加载方式,具体可以参考:[如何自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。对应地,PyTorch中的数据处理api可以参考:[huggingface的datasets自定义数据集](https://huggingface.co/docs/datasets/about_dataset_load.html#building-a-dataset)。对于其中之一,可以找到另一个平台的实现。 + +此外, +* 有些自定义的数据处理方法,如果不涉及到深度学习框架的部分,可以直接复用。 +* 对于特定任务中的数据预处理方法,比如说Tokenizer,如果没有现成的API可以调用,可以参考官方模型套件中的一些实现方法,比如PaddleClas、PaddleDetection、PaddleSeg等。 + +**【实战】** + +BERT模型复现过程中,数据预处理和Dataset、Dataloader的检查可以参考该文件: +[https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step2/test_data.py](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step2/test_data.py) + + +使用方法可以参考[数据检查文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step2/README.md)。 + + +### 3.3 评估指标对齐 + +**【基本流程】** + +PaddlePaddle提供了一系列Metric计算类,比如说`Accuracy`, `Auc`, `Precision`, `Recall`等,而PyTorch中,目前可以通过组合的方式实现metric计算,或者调用[huggingface-datasets](https://huggingface.co/docs/datasets/about_metrics.html?highlight=metric),在论文复现的过程中,需要注意保证对于该模块,给定相同的输入,二者输出完全一致。具体流程如下。 + +1. 构建fake数据 +1. 使用PyTorch的指标获取评估结果,使用reprod_log保存结果。 +2. 使用PaddlePaddle的指标获取评估结果,使用reprod_log保存结果。 +3. 使用reprod_log排查diff,小于阈值,即可完成自测。 + +**【注意事项】** + +在评估指标对齐之前,需要注意保证对于该模块,给定相同的输入,二者输出完全一致。 + + +**【实战】** + +评估指标对齐检查方法可以参考文档:[评估指标对齐检查方法文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step2/README.md#%E6%95%B0%E6%8D%AE%E8%AF%84%E4%BC%B0%E5%AF%B9%E9%BD%90%E6%B5%81%E7%A8%8B) + + +**【验收】** + +对于待复现的项目,评估指标对齐验收流程如下。 + +1. 输入:dataloader, model +2. 输出: + * PaddlePaddle/PyTorch:dict,key为tensor的name(自定义),value为具体评估指标的值。最后将dict使用reprod_log保存到各自的文件中,建议命名为`metric_paddle.npy`和`metric_torch.npy`。 + * 自测:使用reprod_log加载2个文件,使用report功能,记录结果到日志文件中,建议命名为`metric_diff_log.txt`,观察diff,二者diff小于特定的阈值即可。 +3. 提交内容:将`metric_paddle.npy`、`metric_torch.npy`与`metric_diff_log.txt`文件备份到`3.1节验收环节`新建的文件夹中,后续的输出结果和自查日志也放在该文件夹中,一并打包上传即可。 +4. 注意: + * 数据需要是真实数据 + * 需要检查论文是否只是抽取了验证集/测试集中的部分文件,如果是的话,则需要保证PaddlePaddle和参考代码中dataset使用的数据集一致。 + + + +### 3.4 损失函数对齐 + +**【基本流程】** + +PaddlePaddle与PyTorch均提供了很多loss function,用于模型训练,具体的API映射表可以参考:[Loss类API映射列表](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/08_api_mapping/pytorch_api_mapping_cn.html#lossapi)。以CrossEntropyLoss为例,主要区别为: +* PaddlePaddle提供了对软标签、指定softmax计算纬度的支持。 + +如果论文中使用的loss function没有指定的API,则可以尝试通过组合API的方式,实现自定义的loss function。 + +具体流程如下。 + +1. 
定义PyTorch模型,加载权重,加载fake data 和 fake label(或者固定seed,基于numpy生成随机数),转换为PyTorch可以处理的tensor,送入网络,获取loss结果,使用reprod_log保存结果。 +2. 定义PaddlePaddle模型,加载fake data 和 fake label(或者固定seed,基于numpy生成随机数),转换为PaddlePaddle可以处理的tensor,送入网络,获取loss结果,使用reprod_log保存结果。 +3. 使用reprod_log排查diff,小于阈值,即可完成自测。 + +**【注意事项】** + +* 计算loss的时候,建议设置`model.eval()`,避免模型中随机量的问题。 + +**【实战】** + +本部分可以参考文档:[https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step3/README.md](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step3/README.md)。 + +**【验收】** + +对于待复现的项目,损失函数对齐验收流程如下。 + +1. 输入:fake data & label +2. 输出: + * PaddlePaddle/PyTorch:dict,key为tensor的name(自定义),value为具体评估指标的值。最后将dict使用reprod_log保存到各自的文件中,建议命名为`loss_paddle.npy`和`loss_torch.npy`。 +3. 自测:使用reprod_log加载2个文件,使用report功能,记录结果到日志文件中,建议命名为`loss_diff_log.txt`,观察diff,二者diff小于特定的阈值即可。 +4. 提交内容:将`loss_paddle.npy`、`loss_torch.npy`与`loss_diff_log.txt`文件备份到`3.1节验收环节`新建的文件夹中,后续的输出结果和自查日志也放在该文件夹中,一并打包上传即可。 + + +### 3.5 优化器对齐 + +**【基本流程】** + +PaddlePaddle中的optimizer有`paddle.optimizer`等一系列实现,PyTorch中则有`torch.Optim`等一系列实现。 + +**【注意事项】** + +以SGD等优化器为例,PaddlePaddle与Pytorch的优化器区别主要如下。 + +* PaddlePaddle在优化器中增加了对梯度裁剪的支持,在训练GAN或者一些NLP、多模态任务中,这个用到的比较多。 +* PaddlePaddle的SGD不支持动量更新、动量衰减和Nesterov动量,这里需要使用`paddle.optimizer.Momentum` API实现这些功能。 + +**【实战】** + +本部分对齐建议对照[PaddlePaddle优化器API文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/optimizer/Overview_cn.html)与参考代码的优化器实现进行对齐,用之后的反向对齐统一验证该模块的正确性。 + + + +### 3.6 学习率对齐 + +**【基本流程】** + +* 学习率策略主要用于指定训练过程中的学习率变化曲线,这里可以将定义好的学习率策略,不断step,即可得到对应的学习率值,可以将学习率值保存在列表或者矩阵中,使用`reprod_log`工具判断二者是否对齐。 + +**【注意事项】** + +PaddlePaddle中,需要首先构建学习率策略,再传入优化器对象中;对于PyTorch,如果希望使用更丰富的学习率策略,需要先构建优化器,再传入学习率策略类API。 + +**【实战】** + +学习率复现对齐,可以参考代码:[学习率对齐验证文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step4/README.md#%E5%AD%A6%E4%B9%A0%E7%8E%87%E5%AF%B9%E9%BD%90%E9%AA%8C%E8%AF%81)。 + + +### 3.7 正则化策略对齐 + +**【基本流程】** + +L2正则化策略用于模型训练,可以防止模型对训练数据过拟合,L1正则化可以用于得到稀疏化的权重矩阵,PaddlePaddle中有`paddle.regularizer.L1Decay`与`paddle.regularizer.L2Decay` API。PyTorch中,torch.optim集成的优化器只有L2正则化方法,直接在构建optimizer的时候,传入`weight_decay`参数即可。 + +**【注意事项】** + +* PaddlePaddle的optimizer中支持L1Decat/L2Decay。 +* PyTorch的optimizer支持不同参数列表的学习率分别设置,params传入字典即可,而PaddlePaddle的2.1.0版本目前尚未支持这种行为,可以通过设置`ParamAttr`的`learning_rate`参数,来确定相对学习率倍数。PaddlePaddle的2.2.0版本中虽然实现该功能,但是模型收敛速度较慢,不建议使用。[优化器收敛速度慢](https://github.com/PaddlePaddle/Paddle/issues/36915) + +**【实战】** + +本部分对齐建议对照[PaddlePaddle正则化API文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/regularizer/L2Decay_cn.html)与参考代码的优化器实现进行对齐,用之后的反向对齐统一验证该模块的正确性。 + + +### 3.8 反向对齐 + +**【基本流程】** + +此处可以通过numpy生成假的数据和label(推荐),也可以准备固定的真实数据。具体流程如下。 + +1. 检查两个代码的训练超参数全部一致,如优化器及其超参数、学习率、LayerNorm中的eps等。 +2. 将PaddlePaddle与PyTorch网络中涉及的所有随机操作全部关闭,如dropout、drop_path等,推荐将模型设置为eval模式(`model.eval()`) +3. 加载相同的weight dict(可以通过PyTorch来存储随机的权重),将准备好的数据分别传入网络并迭代,观察二者loss是否一致(此处batch-size要一致,如果使用多个真实数据,要保证传入网络的顺序一致) +4. 
如果经过2轮以上,loss均可以对齐,则基本可以认为反向对齐。 + + +**【注意事项】** + +* 如果第一轮loss就没有对齐,则需要仔细排查一下模型前向部分。 +* 如果第二轮开始,loss开始无法对齐,则首先需要排查下超参数的差异,没问题的话,在`loss.backward()`方法之后,使用`tensor.grad`获取梯度值,二分的方法查找diff,定位出PaddlePaddle与PyTorch梯度无法对齐的API或者操作,然后进一步验证并反馈。 + +梯度的打印方法示例代码如下所示,注释掉的内容即为打印网络中所有参数的梯度shape。 + +```python + # 代码地址:https://github.com/JunnYu/BERT-SST2-Prod/blob/2c372656bb1b077b0073c50161771d9fa9d8de5a/pipeline/Step4/test_bp.py#L12 + def pd_train_some_iters(model, + criterion, + optimizer, + fake_data, + fake_label, + max_iter=2): + model = PDBertForSequenceClassification.from_pretrained("bert-base-uncased", num_classes=2) + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + model.eval() + criterion = paddle.nn.CrossEntropy() + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW(learning_rate=3e-5, parameters=model.parameters(), + weight_decay=1e-2, + epsilon=1e-6, + apply_decay_param_fun=lambda x: x in decay_params) + loss_list = [] + for idx in range(max_iter): + input_ids = paddle.to_tensor(fake_data) + labels = paddle.to_tensor(fake_label) + + output = model(input_ids) + loss = criterion(output, labels) + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_list.append(loss) + return loss_list +``` + + + + +**【实战】** + +本部分可以参考文档:[反向对齐操作文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step4/README.md#%E5%8F%8D%E5%90%91%E5%AF%B9%E9%BD%90%E6%93%8D%E4%BD%9C%E6%96%B9%E6%B3%95)。 + +**【验收】** + +对于待复现的项目,反向对齐验收流程如下。 + +1. 输入:fake data & label +2. 输出: + * PaddlePaddle/PyTorch:dict,key为tensor的name(自定义),value为具体loss的值。最后将dict使用reprod_log保存到各自的文件中,建议命名为`bp_align_paddle.npy`和`bp_align_torch.npy`。 +3. 自测:使用reprod_log加载2个文件,使用report功能,记录结果到日志文件中,建议命名为`bp_align_diff_log.txt`,观察diff,二者diff小于特定的阈值即可。 +4. 提交内容:将`bp_align_paddle.npy`、`bp_align_torch.npy`与`bp_align_diff_log.txt`文件备份到`3.1节验收环节`新建的文件夹中,后续的输出结果和自查日志也放在该文件夹中,一并打包上传即可。 +5. 
注意: + * loss需要保存至少2轮以上。 + * 在迭代的过程中,需要保证模型的batch size等超参数完全相同 + * 在迭代的过程中,需要设置`model.eval()`,使用固定的假数据,同时加载相同权重的预训练模型。 + + +### 3.9 训练集数据读取对齐 + +**【基本流程】** + +该部分内容与3.2节内容基本一致,参考PyTorch的代码,实现训练集数据读取与预处理模块即可。 + +**【注意事项】** + +该部分内容,可以参考3.8节的自测方法,将输入的`fake data & label`替换为训练的dataloader,但是需要注意的是: +* 在使用train dataloader的时候,建议设置random seed,对于PyTorch来说 + +```python +#initialize random seed +torch.manual_seed(config.SEED) +torch.cuda.manual_seed_all(config.SEED) +np.random.seed(config.SEED) +random.seed(config.SEED) +``` + +对于PaddlePaddle来说 + +```python +paddle.seed(config.SEED) +np.random.seed(config.SEED) +random.seed(config.SEED) +``` + + + +### 3.10 网络初始化对齐 + +**【基本流程】** + +* 下面给出了部分初始化API的映射表。 + +|PaddlePaddle API | PyTorch API | +|---|---| +| paddle.nn.initializer.KaimingNormal | torch.nn.init.kaiming_normal_ | +| paddle.nn.initializer.KaimingUniform | torch.nn.init.kaiming_uniform_ | +| paddle.nn.initializer.XavierNormal | torch.nn.init.xavier_normal_ | +| paddle.nn.initializer.XavierUniform | torch.nn.init.xavier_uniform_ | + +**【注意事项】** + +* 更多初始化API可以参考[PyTorch初始化API文档](https://pytorch.org/docs/stable/nn.init.html)以及[PaddlePaddle初始化API文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/Overview_cn.html#chushihuaxiangguan)。 + +**【实战】** + +本部分对齐建议对照[PaddlePaddle 初始化API文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/nn/Overview_cn.html#chushihuaxiangguan)与参考代码的初始化实现对齐。 + + +### 3.11 模型训练对齐 + +**【基本流程】** + +完成前面的步骤之后,就可以开始全量数据的训练对齐任务了。按照下面的步骤进行训练对齐。 + +1. 准备train/eval data, loader, model +2. 对model按照论文所述进行初始化(如果论文中提到加载了预训练模型,则按需加载pretrained model) +3. 加载配置,开始训练,迭代得到最终模型与评估指标,将评估指标使用reprod_log保存到文件中。 +4. 将PaddlePaddle提供的参考指标使用reprod_log提交到另一个文件中。 +5. 使用reprod_log排查diff,小于阈值,即可完成自测。 + +**【注意事项】** + +* 【强烈】建议先做完反向对齐之后再进行模型训练对齐,二者之间的不确定量包括:数据集、PaddlePaddle与参考代码在模型training mode下的区别,初始化参数。 +* 在训练对齐过程中,受到较多随机量的影响,精度有少量diff是正常的,以SST-2数据集的分类为例,diff在0.15%以内可以认为是正常的,这里可以根据不同的任务,适当调整对齐检查的阈值(`ReprodDiffHelper.report`函数中的`diff_threshold`参数)。 +* 训练过程中的波动是正常的,如果最终收敛结果不一致,可以 + * 仔细排查Dropout、BatchNorm以及其他组网模块及超参是否无误。 + * 基于参考代码随机生成一份预训练模型,转化为PaddlePaddle的模型,并使用PaddlePaddle加载训练,对比二者的收敛曲线与最终结果,排查初始化影响。 + * 使用参考代码的Dataloader生成的数据,进行模型训练,排查train dataloader的影响。 + +**【实战】** + +本部分可以参考文档:[训练对齐操作文档](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/Step5/README.md)。 + +**【验收】** + +对于待复现的项目,训练对齐验收流程如下。 + +1. 输入:train/eval dataloader, model +2. 输出: + * PaddlePaddle:dict,key为保存值的name(自定义),value为具体评估指标的值。最后将dict使用reprod_log保存到文件中,建议命名为`train_align_paddle.npy`。 + * benchmark:dict,key为保存值的name(自定义),value为论文复现赛的评估指标要求的值。最后将dict使用reprod_log保存到文件中,建议命名为`train_align_benchmark.npy`。 +3. 自测:使用reprod_log加载2个文件,使用report功能,记录结果到日志文件中,建议命名为`train_align_diff_log.txt`,观察diff,二者diff小于特定的阈值即可。 +4. 
提交内容:将`train_align_paddle.npy`、`train_align_benchmark.npy`与`train_align_diff_log.txt`文件备份到`3.1节验收环节`新建的文件夹中,最终一并打包上传即可。 + + +### 3.12 单机多卡训练 + +如果希望使用单机多卡提升训练效率,可以从以下几个过程对代码进行修改。 + +#### 3.12.1 数据读取 + +对于PaddlePaddle来说,多卡数据读取这块主要的变化在sampler + +对于单机单卡,sampler实现方式如下所示。 + +```python +train_sampler = paddle.io.RandomSampler(dataset) +train_batch_sampler = paddle.io.BatchSampler( + sampler=train_sampler, batch_size=args.batch_size) +``` + +对于单机多卡任务,sampler实现方式如下所示。 + +```python +train_batch_sampler = paddle.io.DistributedBatchSampler( + dataset=dataset, + batch_size=args.batch_size, + shuffle=True, + drop_last=False + ) +``` + +注意:在这种情况下,单机多卡的代码仍然能够以单机单卡的方式运行,因此建议以这种sampler方式进行论文复现。 + + +#### 3.12.2 多卡模型初始化 + +如果以多卡的方式运行,需要初始化并行训练环境,代码如下所示。 + +```python +if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() +``` + +在模型组网并初始化参数之后,需要使用`paddle.DataParallel()`对模型进行封装,使得模型可以通过数据并行的模式被执行。代码如下所示。 + +```python +if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) +``` + + +#### 3.12.3 模型保存、日志保存等其他模块 + +以模型保存为例,我们只需要在0号卡上保存即可,否则多个trainer同时保存的话,可能会造成写冲突,导致最终保存的模型不可用。 + + +#### 3.12.4 程序启动方式 + +对于单机单卡,启动脚本如下所示。[https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/bert](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/bert) + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --task_name SST-2 \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/ \ + --device gpu \ + --use_amp False +``` + + +对于单机多卡(示例中为4卡训练),启动脚本如下所示。 + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1,2,3" run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --task_name SST-2 \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/ \ + --device gpu \ + --use_amp False +``` + +注意:这里8卡训练时,虽然单卡的batch size没有变化(32),但是总卡的batch size相当于是单卡的8倍,因此学习率也设置为了单卡时的8倍。 + + +**【实战】** + +本部分可以参考paddlenlp库中的例子:[单机多卡训练](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/bert)。 + + +## 4. 
论文复现注意事项与FAQ + +本部分主要总结大家在论文复现赛过程中遇到的问题,如果本章内容没有能够解决你的问题,欢迎给该文档提出优化建议或者给Paddle提[ISSUE](https://github.com/PaddlePaddle/Paddle/issues/new/choose)。 + + +### 4.0 通用注意事项 + +* 需要仔细对照PaddlePaddle与参考代码的优化器参数实现,确保优化器参数严格对齐。 +* 如果遇到一些Paddle不支持的API操作,可以尝试使用替代实现进行复现。如下面的PyTorch代码,PaddlePaddle中可以通过slice + concat API的组合形式进行功能实现。同时,对于这个问题,建议优先给Paddle提[ISSUE](https://github.com/PaddlePaddle/Paddle/issues/new/choose),列出Paddle不支持的实现,开发人员会根据优先级进行开发。 + +```python +torch.stack([ + per_locations[:, 0] - per_box_regression[:, 0], + per_locations[:, 1] - per_box_regression[:, 1], + per_locations[:, 0] + per_box_regression[:, 2], + per_locations[:, 1] + per_box_regression[:, 3], +], dim=1) +``` +* 如果遇到Paddle不包含的OP或者API,比如(1) 如果是某些算法实现存在调用了外部OP,而且Paddle也不包含该OP实现;(2) 其他框架存在的API或者OP,但是Paddle中没有这些OP。此时: + * 对于Paddle资深用户来说,可以尝试使用Paddle的自定义算子功能,存在一定的代码开发量。 + * 对于初学者来说,可以给Paddle提[ISSUE](https://github.com/PaddlePaddle/Paddle/issues/new/choose),列出Paddle不支持的实现,Paddle开发人员会根据优先级进行实现。 +* PaddlePaddle与PyTorch对于不同名称的API,实现的功能可能是相同的,复现的时候注意,比如[paddle.optimizer.lr.StepDecay](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/optimizer/lr/StepDecay_cn.html#stepdecay)与[torch.optim.lr_scheduler.StepLR](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html#torch.optim.lr_scheduler.StepLR) 。 +* 对于PaddlePaddle来说,通过`paddle.set_device`函数(全局)来确定模型结构是运行在什么设备上,对于torch来说,是通过`model.to("device")` (局部)来确定模型结构的运行设备,这块在复现的时候需要注意。 + + + +### 4.1 模型结构对齐 + +#### 4.1.1 API +* 对于 `paddle.nn.Linear` 层的weight参数,PaddlePaddle与PyTorch的保存方式不同,在转换时需要进行转置,示例代码可以参考[BERT权重转换脚本](https://github.com/JunnYu/BERT-SST2-Prod/blob/main/pipeline/weights/torch2paddle.py)。 +* `torch.masked_fill`函数的功能目前可以使用`paddle.where`进行实现,可以参考:[链接](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/faq/train_cn.html#paddletorch-masked-fillapi)。 +* `pack_padded_sequence`和`pad_packed_sequence`这两个API目前PaddlePaddle中没有实现,可以直接在RNN或者LSTM的输入中传入`sequence_length`来实现等价的功能。 + + +#### 4.1.2 权重转换 + +* 在权重转换的时候,不能只关注参数的名称,比如说有些`paddle.nn.Linear`层,但是定义的变量名称为`conv`,这种也是需要进行权重转置的。 +* 权重转换时,建议同时打印 Paddle 和 PyTorch 对应权重的shape,以防止名称相似但是shape不同的参数权重转换报错。 + +#### 4.1.3 模型组网正确性验证 + +* 在论文复现的过程中,可能会遇到一些经典的模型结构,比如Transformer等,Paddle官方也提供了Transformer的实现,但是这里建议自己根据PyTorch代码重新实现一遍,一方面是对整体的模型结构更加熟悉,另一方面也保证模型结构和权重完全对齐。 +* 在复杂的网络结构中,如果前向结果对不齐,可以按照模块排查问题,比如依次获取embedding、transformer-block、mlm-head输出等,看下问题具体出现在哪个子模块,再进到子模块详细排查。 +* 网络结构对齐后,尽量使用训练好的预训练模型和真实的数据进行前向diff计算,这样更准确。 + + +### 4.2 验证/测试集数据读取对齐 + +* 需要仔细排查数据预处理,不仅包含的预处理方法相同,也需要保证预处理的流程相同,比如先padding策略不同和截断策略的不同会导致得到最终的结果是不同的。 + + +### 4.3 评估指标对齐 + +* 真实数据评估时,需要注意评估时 `paddle.io.DataLoader` 的 ``drop_last`` 参数是否打开(文档[链接](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html#dataloader)),复现代码需要与参考代码保持一致,否则最后不够batch-size的数据的评估会有diff。 +* 在识别或者检索过程中,为了加速评估过程,往往会将评估函数由CPU实现改为GPU实现,由此会带来评估函数输出的不一致。这是由于sort函数对于相同值的排序结果不同带来的。在复现的过程中,如果可以接受轻微的指标不稳定,可以使用PaddlePaddle的sort函数,如果对于指标非常敏感,同时对速度性能要求很高,可以给PaddlePaddle提[ISSUE](https://github.com/PaddlePaddle/Paddle/issues/new/choose),由研发人员高优开发。 + + + +### 4.4 损失函数对齐 + +* 部分算法的损失函数中会用到 bool 索引,这时候可以使用[paddle.where](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/where_cn.html#where) 代替。 +* `paddle.nn.CrossEntropyLoss` 默认是在最后一维(axis=-1)计算损失函数,而 `torch.nn.CrossEntropyLoss` 是在axis=1的地方计算损失函数,因此如果输入的维度大于2,这里需要保证计算的维(axis)相同,否则可能会出错。 +* 在生成模型中会遇到梯度损失,需要对模型中的算子求二次梯度,目前`MaxPooling`暂时不支持二次梯度,如果复现的过程中遇到了需要对`MaxPooling`求二次梯度的情况,可以和Paddle官方开发同学反馈,进一步确认解决方案。 +* 在保存损失函数值的时候,注意要使用`paddle.no_grad`,或者仅仅保存转换成 numpy 的数组,避免损失没有析构导致内存泄漏问题。 + +```python +# 
错误示范 +loss = celoss(pred, label) +avg_loss += loss +# 正确示范1 +loss = celoss(pred, label) +avg_loss += loss.numpy() +# 正确示范2 +loss = celoss(pred, label) +with paddle.no_grad() + avg_loss += loss +``` + + +### 4.5 优化器对齐 + +* Paddle目前支持在 ``optimizer`` 中通过设置 ``params_groups`` 的方式设置不同参数的更新方式,可以参考[代码示例](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/optimizer/optimizer.py#L107) 。 +* 有些模型训练时,会使用梯度累加策略,即累加到一定step数量之后才进行参数更新,这时在实现上需要注意对齐。 +* 在某些任务中,比如说深度学习可视化、可解释性等任务中,一般只要求模型前向过程,不需要训练,此时优化器、学习率等用于模型训练的模块对于该类论文复现是不需要的。 +* 在文本分类领域,大多数Transformer模型都采用了AdamW优化器,并且会设置weigh decay,同时部分参数设置为no weight decay,例如位置编码的参数通常设置为no weight decay,no weight decay参数设置不正确,最终会有明显的精度损失,需要特别注意。一般可以通过分析模型权重来发现该问题,分别计算官方模型和复现模型每层参数权重的平均值、方差,对每一层依次对比,有显著差异的层可能存在问题,因为在weight decay的作用下,参数权重数值会相对较小,而未正确设置no weight decay,则会造成该层参数权重数值异常偏小。 + + + +### 4.6 学习率对齐 + +* PaddlePaddle 中参数的学习率受到优化器学习率和`ParamAttr`中设置的学习率影响,因此跟踪学习率需要将二者结合进行跟踪。 +* 对于复现代码和参考代码,学习率在整个训练过程中在相同的轮数相同的iter下应该保持一致,可以通过`reprod_log`工具、打印学习率值或者可视化二者学习率的log来查看diff。 +* 有些网络的学习率策略比较细致,比如带warmup的学习率策略,这里需要保证起始学习率等参数都完全一致。 + + + +### 4.7 正则化策略对齐 + +* 在如Transformer或者少部分CNN模型中,存在一些参数不做正则化(正则化系数为0)的情况。这里需要找到这些参数并对齐取消实施正则化策略,可以参考[这里](https://github.com/PaddlePaddle/PaddleClas/blob/release%2F2.3/ppcls/arch/backbone/model_zoo/resnest.py#L72),对特定参数进行修改。 + + +### 4.8 反向对齐 + +* 反向对齐时,如果第二轮开始,loss开始无法对齐,则首先需要排查下超参数的差异,没问题的话,在`loss.backward()`方法之后,使用`tensor.grad`获取梯度值,二分的方法查找diff,定位出PaddlePaddle与PyTorch梯度无法对齐的API或者操作,然后进一步验证。第3章中给出了获取所有参数的梯度方法,如果只希望打印特定参数的梯度,可以用下面的方式。 + + +```python +import paddle + +def print_hook_fn(grad): + print(grad) + +x = paddle.to_tensor([0., 1., 2., 3.], stop_gradient=False) +h = x.register_hook(print_hook_fn) +w = x * 4 +w.backward() +# backward之后会输出下面的内容 +# Tensor(shape=[4], dtype=float32, place=CPUPlace, stop_gradient=False, +# [4., 4., 4., 4.]) +``` + + + +### 4.9 训练集数据读取对齐 + +#### 4.9.1 API + +* 在前向过程中,如果数据预处理过程运行出错,请先将 ``paddle.io.DataLoader`` 的 ``num_workers`` 参数设为0,然后根据单个进程下的报错日志定位出具体的bug。 + +#### 4.9.2 数据预处理 + + +* 如果数据处理过程中涉及到随机数生成,建议固定seed (`np.random.seed(0)`, `random.seed(0)`),查看复现代码和参考代码处理后的数据是否有diff。 +* 对文本进行tokenizer处理时,需要确定文本的截断策略,padding策略。 + + +### 4.10 网络初始化对齐 + +* 对于不同的深度学习框架,网络初始化在大多情况下,即使值的分布完全一致,也无法保证值完全一致,这里也是论文复现中不确定性比较大的地方。如果十分怀疑初始化导致的问题,建议将参考的初始化权重转成paddle模型,加载该初始化模型训练,看下收敛精度。 +* CNN对于模型初始化相对来说没有那么敏感,在迭代轮数与数据集足够的情况下,最终精度指标基本接近;而transformer系列模型对于初始化比较敏感,在transformer系列模型训练对齐过程中,建议对这一块进行重点检查。 + + + +### 4.11 模型训练对齐 + +#### 4.11.1 训练对齐通用问题 + +* 有条件的话,复现工作之前最好先基于官方代码完成训练,保证与官方指标能够对齐,并且将训练策略和训练过程中的关键指标记录保存下来,比如每个epoch的学习率、Train Loss、Eval Loss、Eval Acc等,在复现网络的训练过程中,将关键指标保存下来,这样可以将两次训练中关键指标的变化曲线绘制出来,能够很方便的进行对比。 +* 训练过程中可以对loss或者acc进行可视化,和竞品loss或者acc进行直观的对比;如果训练较大的数据集,1次完整训练的成本比较高,此时可以隔一段时间查看一下,如果精度差异比较大,建议先停掉实验,排查原因。 +* 如果训练的过程中出nan,一般是因为除0或者log0的情况, 可以着重看下几个部分: + * 如果有预训练模型的话,可以确认下是否加载正确 + * 模型结构中计算loss的部分是否有考虑到正样本为0的情况 + * 也可能是某个API的数值越界导致的,可以测试较小的输入是否还会出现nan。 +* 如果训练过程中如果出现不收敛的情况,可以 + * 简化网络和数据,实验是否收敛; + * 如果是基于原有实现进行改动,可以尝试控制变量法,每次做一个改动,逐个排查; + * 检查学习率是否过大、优化器设置是否合理,排查下weight decay是否设置正确; + * 保存不同step之间的模型参数,观察模型参数是否更新。 diff --git a/examples/torch_migration/pipeline/Step1/README.md b/examples/torch_migration/pipeline/Step1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b3db11238110a8c3d1f7479b9ee6c3e172293549 --- /dev/null +++ b/examples/torch_migration/pipeline/Step1/README.md @@ -0,0 +1,86 @@ +# 使用方法 + + +本部分内容以前向对齐为例,介绍基于`repord_log`工具对齐的检查流程。其中与`reprod_log`工具有关的部分都是需要开发者需要添加的部分。 + + +```shell +# 进入文件夹并生成torch的bert模型权重 +cd pipeline/weights/ && python torch_bert_weights.py +# 
进入文件夹并将torch的bert模型权重转换为paddle +cd pipeline/weights/ && python torch2paddle.py +# 进入文件夹并生成classifier权重 +cd pipeline/classifier_weights/ && python generate_classifier_weights.py +# 进入Step1文件夹 +cd pipeline/Step1/ +# 生成paddle的前向数据 +python pd_forward_bert.py +# 生成torch的前向数据 +python pt_forward_bert.py +# 对比生成log +python check_step1.py +``` + +具体地,以PaddlePaddle为例,`pd_forward_bert.py`的具体代码如下所示。 + +```python +import numpy as np +import paddle +from reprod_log import ReprodLogger +import sys +import os +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +config_path = CURRENT_DIR.rsplit('/', 1)[0] +sys.path.append(config_path) +from models.pd_bert import * + +# 导入reprod_log中的ReprodLogger类 +from reprod_log import ReprodLogger + +reprod_logger = ReprodLogger() + +# 组网初始化加载BertModel权重 +paddle_dump_path = '../weights/paddle_weight.pdparams' +config = BertConfig() +model = BertForSequenceClassification(config) +checkpoint = paddle.load(paddle_dump_path) +model.bert.load_dict(checkpoint) + +# 加载分类权重 +classifier_weights = paddle.load( + "../classifier_weights/paddle_classifier_weights.bin") +model.load_dict(classifier_weights) +model.eval() +# 读入fake data并转换为tensor,这里也可以固定seed在线生成fake data +fake_data = np.load("../fake_data/fake_data.npy") +fake_data = paddle.to_tensor(fake_data) +# 模型前向 +out = model(fake_data) +# 保存前向结果,对于不同的任务,需要开发者添加。 +reprod_logger.add("logits", out.cpu().detach().numpy()) +reprod_logger.save("forward_paddle.npy") +``` + +diff检查的代码可以参考:[check_step1.py](./check_step1.py),具体代码如下所示。 + +```python +# https://github.com/littletomatodonkey/AlexNet-Prod/blob/master/pipeline/Step1/check_step1.py +# 使用reprod_log排查diff +from reprod_log import ReprodDiffHelper +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("./forward_torch.npy") + paddle_info = diff_helper.load_info("./forward_paddle.npy") + diff_helper.compare_info(torch_info, paddle_info) + diff_helper.report(path="forward_diff.log") +``` + +产出日志如下,同时会将check的结果保存在`forward_diff.log`文件中。 + +``` +[2021/11/17 20:15:50] root INFO: logits: +[2021/11/17 20:15:50] root INFO: mean diff: check passed: True, value: 1.30385160446167e-07 +[2021/11/17 20:15:50] root INFO: diff check passed +``` + +平均绝对误差为1.3e-7,测试通过。 diff --git a/examples/torch_migration/pipeline/Step1/check_step1.py b/examples/torch_migration/pipeline/Step1/check_step1.py new file mode 100644 index 0000000000000000000000000000000000000000..6dbb247cf179e584d36b0009120c942cc782ba09 --- /dev/null +++ b/examples/torch_migration/pipeline/Step1/check_step1.py @@ -0,0 +1,23 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
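+# Compare the forward logits dumped by pd_forward_bert.py and pt_forward_bert.py
+# and write the diff report to forward_diff.log via reprod_log.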
+ +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("./forward_torch.npy") + paddle_info = diff_helper.load_info("./forward_paddle.npy") + + diff_helper.compare_info(torch_info, paddle_info) + diff_helper.report(path="forward_diff.log") diff --git a/examples/torch_migration/pipeline/Step1/pd_forward_bert.py b/examples/torch_migration/pipeline/Step1/pd_forward_bert.py new file mode 100644 index 0000000000000000000000000000000000000000..9bdd14f4071b0a5929b6cefadd5d832a8fc8880d --- /dev/null +++ b/examples/torch_migration/pipeline/Step1/pd_forward_bert.py @@ -0,0 +1,49 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import sys + +import numpy as np +import paddle +from reprod_log import ReprodLogger + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) + +from models.pd_bert import BertConfig, BertForSequenceClassification # noqa: E402 + +if __name__ == "__main__": + paddle.set_device("cpu") + + # def logger + reprod_logger = ReprodLogger() + + paddle_dump_path = "../weights/paddle_weight.pdparams" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = paddle.load(paddle_dump_path) + model.bert.load_dict(checkpoint) + + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + model.eval() + # read or gen fake data + + fake_data = np.load("../fake_data/fake_data.npy") + fake_data = paddle.to_tensor(fake_data) + # forward + out = model(fake_data)[0] + reprod_logger.add("logits", out.cpu().detach().numpy()) + reprod_logger.save("forward_paddle.npy") diff --git a/examples/torch_migration/pipeline/Step1/pt_forward_bert.py b/examples/torch_migration/pipeline/Step1/pt_forward_bert.py new file mode 100644 index 0000000000000000000000000000000000000000..55b7f7490caa616ad6d832c600517e57bf08c429 --- /dev/null +++ b/examples/torch_migration/pipeline/Step1/pt_forward_bert.py @@ -0,0 +1,47 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
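+# Build the reference PyTorch BERT classifier, run it on the shared fake data
+# and save the logits as forward_torch.npy for the diff check in check_step1.py.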
+import os +import sys + +import numpy as np +import torch +from reprod_log import ReprodLogger + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) + +from models.pt_bert import BertConfig, BertForSequenceClassification # noqa: E402 + +if __name__ == "__main__": + # def logger + reprod_logger = ReprodLogger() + + pytorch_dump_path = "../weights/torch_weight.bin" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = torch.load(pytorch_dump_path) + model.bert.load_state_dict(checkpoint) + + classifier_weights = torch.load("../classifier_weights/torch_classifier_weights.bin") + model.load_state_dict(classifier_weights, strict=False) + model.eval() + + # read or gen fake data + fake_data = np.load("../fake_data/fake_data.npy") + fake_data = torch.from_numpy(fake_data) + # forward + out = model(fake_data)[0] + reprod_logger.add("logits", out.cpu().detach().numpy()) + reprod_logger.save("forward_torch.npy") diff --git a/examples/torch_migration/pipeline/Step1/torch2paddle.py b/examples/torch_migration/pipeline/Step1/torch2paddle.py new file mode 100644 index 0000000000000000000000000000000000000000..4a2b4977051b97291ce45e7e49aabc1581267b7b --- /dev/null +++ b/examples/torch_migration/pipeline/Step1/torch2paddle.py @@ -0,0 +1,114 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
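+# Step1 helper: load a PyTorch BERT state dict and save it as a PaddlePaddle
+# .pdparams file (see test_forward() for an optional output comparison).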
+ +from collections import OrderedDict + +import numpy as np +import paddle +import torch +from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM +from transformers import BertForMaskedLM as PTBertForMaskedLM + + +def convert_pytorch_checkpoint_to_paddle( + pytorch_checkpoint_path="pytorch_model.bin", + paddle_dump_path="model_state.pdparams", + version="old", +): + hf_to_paddle = { + "embeddings.LayerNorm": "embeddings.layer_norm", + "encoder.layer": "encoder.layers", + "attention.self.query": "self_attn.q_proj", + "attention.self.key": "self_attn.k_proj", + "attention.self.value": "self_attn.v_proj", + "attention.output.dense": "self_attn.out_proj", + "intermediate.dense": "linear1", + "output.dense": "linear2", + "attention.output.LayerNorm": "norm1", + "output.LayerNorm": "norm2", + "predictions.decoder.": "predictions.decoder_", + "predictions.transform.dense": "predictions.transform", + "predictions.transform.LayerNorm": "predictions.layer_norm", + } + do_not_transpose = [] + if version == "old": + hf_to_paddle.update( + { + "predictions.bias": "predictions.decoder_bias", + ".gamma": ".weight", + ".beta": ".bias", + } + ) + do_not_transpose = do_not_transpose + ["predictions.decoder.weight"] + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + is_transpose = False + if k[-7:] == ".weight": + # embeddings.weight and LayerNorm.weight do not transpose + if all(d not in k for d in do_not_transpose): + if ".embeddings." not in k and ".LayerNorm." not in k: + if v.ndim == 2: + if "embeddings" not in k: + v = v.transpose(0, 1) + is_transpose = True + is_transpose = False + oldk = k + print(f"Converting: {oldk} => {k} | is_transpose {is_transpose}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +def compare(out_torch, out_paddle): + out_torch = out_torch.detach().numpy() + out_paddle = out_paddle.detach().numpy() + assert out_torch.shape == out_paddle.shape + abs_dif = np.abs(out_torch - out_paddle) + mean_dif = np.mean(abs_dif) + max_dif = np.max(abs_dif) + min_dif = np.min(abs_dif) + print("mean_dif:{}".format(mean_dif)) + print("max_dif:{}".format(max_dif)) + print("min_dif:{}".format(min_dif)) + + +def test_forward(): + paddle.set_device("cpu") + model_torch = PTBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_paddle = PDBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_torch.eval() + model_paddle.eval() + np.random.seed(42) + x = np.random.randint(1, model_paddle.bert.config["vocab_size"], size=(4, 64)) + input_torch = torch.tensor(x, dtype=torch.int64) + out_torch = model_torch(input_torch)[0] + + input_paddle = paddle.to_tensor(x, dtype=paddle.int64) + out_paddle = model_paddle(input_paddle)[0] + + print("torch result shape:{}".format(out_torch.shape)) + print("paddle result shape:{}".format(out_paddle.shape)) + compare(out_torch, out_paddle) + + +if __name__ == "__main__": + convert_pytorch_checkpoint_to_paddle("test.bin", "test_paddle.pdparams") +# test_forward() +# torch result shape:torch.Size([4, 64, 30522]) +# paddle result shape:[4, 64, 30522] +# mean_dif:1.666686512180604e-05 +# max_dif:0.00015211105346679688 +# min_dif:0.0 diff --git a/examples/torch_migration/pipeline/Step2/README.md b/examples/torch_migration/pipeline/Step2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..029761c85e47fdbfda1fa427092a1e5791b9271f --- /dev/null +++ 
b/examples/torch_migration/pipeline/Step2/README.md @@ -0,0 +1,131 @@ +# 使用方法 + +## 数据集和数据加载对齐步骤 + +* 使用下面的命令,判断数据预处理以及数据集是否构建正确。 + +```shell +python test_data.py +``` + +显示出以下内容,Dataset以及Dataloader的长度和内容diff均满足小于指定阈值,可以认为复现成功。 + +``` +[2021/11/17 20:57:06] root INFO: length: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_0_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_0_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_0_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_1_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_1_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_1_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_2_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_2_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_2_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_3_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_3_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_3_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_4_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_4_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataset_4_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_0_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_0_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_0_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_1_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_1_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_1_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_2_input_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_2_token_type_ids: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 20:57:06] root INFO: dataloader_2_labels: +[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 
20:57:06] root INFO: dataloader_3_input_ids:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: dataloader_3_token_type_ids:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: dataloader_3_labels:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: dataloader_4_input_ids:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: dataloader_4_token_type_ids:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: dataloader_4_labels:
+[2021/11/17 20:57:06] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 20:57:06] root INFO: diff check passed
+```
+
+
+## 数据评估对齐流程
+
+### 评估代码和修改内容说明
+
+PyTorch准确率评估指标使用的是huggingface的datasets库。
+
+```python
+import torch
+import numpy as np
+from datasets import load_metric
+hf_metric = load_metric("accuracy.py")
+logits = np.random.normal(0, 1, size=(64, 2)).astype("float32")
+labels = np.random.randint(0, 2, size=(64,)).astype("int64")
+hf_metric.add_batch(predictions=torch.from_numpy(logits).argmax(dim=-1), references=torch.from_numpy(labels))
+hf_accuracy = hf_metric.compute()["accuracy"]
+print(hf_accuracy)
+```
+
+对应地,PaddlePaddle评估指标代码如下:
+
+```python
+import paddle
+import numpy as np
+from paddle.metric import Accuracy
+pd_metric = Accuracy()
+pd_metric.reset()
+logits = np.random.normal(0, 1, size=(64, 2)).astype("float32")
+labels = np.random.randint(0, 2, size=(64,)).astype("int64")
+correct = pd_metric.compute(paddle.to_tensor(logits), paddle.to_tensor(labels))
+pd_metric.update(correct)
+pd_accuracy = pd_metric.accumulate()
+print(pd_accuracy)
+```
+
+### 操作步骤
+
+运行下面的命令,验证数据集评估是否正常。
+
+```shell
+# 生成paddle和pytorch指标
+python test_metric.py
+# 对比生成log
+python check_step2.py
+```
+
+最终结果输出如下,accuracy精度diff为0,小于阈值,验证通过。
+```
+[2021/11/17 21:15:05] root INFO: accuracy:
+[2021/11/17 21:15:05] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 21:15:05] root INFO: diff check passed
+
+```
diff --git a/examples/torch_migration/pipeline/Step2/accuracy.py b/examples/torch_migration/pipeline/Step2/accuracy.py
new file mode 100644
index 0000000000000000000000000000000000000000..aecc76a72f54fb6a3c71404597769806d7847466
--- /dev/null
+++ b/examples/torch_migration/pipeline/Step2/accuracy.py
@@ -0,0 +1,90 @@
+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Accuracy metric."""
+
+import datasets
+from sklearn.metrics import accuracy_score
+
+_DESCRIPTION = """
+Accuracy is the proportion of correct predictions among the total number of cases processed.
It can be computed with: +Accuracy = (TP + TN) / (TP + TN + FP + FN) +TP: True positive +TN: True negative +FP: False positive +FN: False negative +""" + +_KWARGS_DESCRIPTION = """ +Args: + predictions: Predicted labels, as returned by a model. + references: Ground truth labels. + normalize: If False, return the number of correctly classified samples. + Otherwise, return the fraction of correctly classified samples. + sample_weight: Sample weights. +Returns: + accuracy: Accuracy score. +Examples: + + >>> accuracy_metric = datasets.load_metric("accuracy") + >>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1]) + >>> print(results) + {'accuracy': 1.0} +""" + +_CITATION = """\ +@article{scikit-learn, + title={Scikit-learn: Machine Learning in {P}ython}, + author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. + and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. + and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and + Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, + journal={Journal of Machine Learning Research}, + volume={12}, + pages={2825--2830}, + year={2011} +} +""" + + +@datasets.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION) +class Accuracy(datasets.Metric): + def _info(self): + return datasets.MetricInfo( + description=_DESCRIPTION, + citation=_CITATION, + inputs_description=_KWARGS_DESCRIPTION, + features=datasets.Features( + { + "predictions": datasets.Sequence(datasets.Value("int32")), + "references": datasets.Sequence(datasets.Value("int32")), + } + if self.config_name == "multilabel" + else { + "predictions": datasets.Value("int32"), + "references": datasets.Value("int32"), + } + ), + reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"], + ) + + def _compute(self, predictions, references, normalize=True, sample_weight=None): + return { + "accuracy": accuracy_score( + references, + predictions, + normalize=normalize, + sample_weight=sample_weight, + ).item(), + } diff --git a/examples/torch_migration/pipeline/Step2/check_step2.py b/examples/torch_migration/pipeline/Step2/check_step2.py new file mode 100644 index 0000000000000000000000000000000000000000..ac74370e6a99d16c7ed03a80d60f87e51d1f66b7 --- /dev/null +++ b/examples/torch_migration/pipeline/Step2/check_step2.py @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
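+# check_step2.py loads the accuracy metrics dumped by test_metric.py for both
+# frameworks (metric_torch.npy and metric_paddle.npy), compares them with
+# reprod_log.ReprodDiffHelper, and writes the comparison report to metric_diff.log.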
+ +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("./metric_torch.npy") + paddle_info = diff_helper.load_info("./metric_paddle.npy") + + diff_helper.compare_info(torch_info, paddle_info) + + diff_helper.report(path="metric_diff.log") diff --git a/examples/torch_migration/pipeline/Step2/demo_sst2_sentence/demo.tsv b/examples/torch_migration/pipeline/Step2/demo_sst2_sentence/demo.tsv new file mode 100644 index 0000000000000000000000000000000000000000..fdc6b82affefaea1a9b623436adc4f11968cf661 --- /dev/null +++ b/examples/torch_migration/pipeline/Step2/demo_sst2_sentence/demo.tsv @@ -0,0 +1,33 @@ +sentence label +it 's a charming and often affecting journey . 1 +unflinchingly bleak and desperate 0 +allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . 1 +the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . 1 +it 's slow -- very , very slow . 0 +although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women . 1 +a sometimes tedious film . 0 +or doing last year 's taxes with your ex-wife . 0 +you do n't have to know about music to appreciate the film 's easygoing blend of comedy and romance . 1 +in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting naked on an igloo , formula 51 sank from quirky to jerky to utter turkey . 0 +the mesmerizing performances of the leads keep the film grounded and keep the audience riveted . 1 +it takes a strange kind of laziness to waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie . 0 +... the film suffers from a lack of humor ( something needed to balance out the violence ) ... 0 +we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity . 1 +even horror fans will most likely not find what they 're seeking with trouble every day ; the movie lacks both thrills and humor . 0 +a gorgeous , high-spirited musical from india that exquisitely blends music , dance , song , and high drama . 1 +the emotions are raw and will strike a nerve with anyone who 's ever had family trauma . 1 +audrey tatou has a knack for picking roles that magnify her outrageous charm , and in this literate french comedy , she 's as morning-glory exuberant as she was in amélie . 1 +... the movie is just a plain old monster . 0 +in its best moments , resembles a bad high school production of grease , without benefit of song . 0 +pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins . 0 +the iditarod lasts for days - this just felt like it did . 0 +holden caulfield did it better . 0 +a delectable and intriguing thriller filled with surprises , read my lips is an original . 1 +seldom has a movie so closely matched the spirit of a man and his work . 1 +nicks , seemingly uncertain what 's going to make people laugh , runs the gamut from stale parody to raunchy sex gags to formula romantic comedy . 0 +the action switches between past and present , but the material link is too tenuous to anchor the emotional connections that purport to span a 125-year divide . 0 +it 's an offbeat treat that pokes fun at the democratic exercise while also examining its significance for those who take part . 
1 +it 's a cookie-cutter movie , a cut-and-paste job . 0 +i had to look away - this was god awful . 0 +thanks to scott 's charismatic roger and eisenberg 's sweet nephew , roger dodger is one of the most compelling variations on in the company of men . 1 +... designed to provide a mix of smiles and tears , `` crossroads '' instead provokes a handful of unintentional howlers and numerous yawns . 0 diff --git a/examples/torch_migration/pipeline/Step2/predict.py b/examples/torch_migration/pipeline/Step2/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..6333dcd105e4b9051d8ffaaa30a7f8a7f7302e9f --- /dev/null +++ b/examples/torch_migration/pipeline/Step2/predict.py @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import sys +from functools import partial + +import paddle +import paddle.nn as nn +import pandas as pd + +from paddlenlp.datasets import load_dataset as ppnlp_load_dataset +from paddlenlp.transformers import BertTokenizer as PPNLPBertTokenizer + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) +from models.pd_bert import BertConfig, BertForSequenceClassification # noqa: E402 + + +def get_data(): + def read(data_path): + df = pd.read_csv(data_path, sep="\t") + for _, row in df.iterrows(): + yield {"sentence": row["sentence"], "labels": row["label"]} + + def convert_example(example, tokenizer, max_length=128): + # labels = [example["labels"]] + # labels = np.array([example["labels"]], dtype="int64") + example = tokenizer(example["sentence"], max_seq_len=max_length) + return example + + tokenizer = PPNLPBertTokenizer.from_pretrained("bert-base-uncased") + dataset_test = ppnlp_load_dataset(read, data_path="demo_sst2_sentence/demo.tsv", lazy=False) + trans_func = partial(convert_example, tokenizer=tokenizer, max_length=128) + + dataset_test = dataset_test.map(trans_func, lazy=False) + one_sentence = dataset_test.new_data[0] + + for k in ["input_ids", "token_type_ids"]: + one_sentence[k] = paddle.to_tensor(one_sentence[k], dtype="int64") + one_sentence[k] = paddle.unsqueeze(one_sentence[k], axis=0) + + return one_sentence + + +@paddle.no_grad() +def main(): + # 模型定义 + paddle_dump_path = "../weights/paddle_weight.pdparams" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = paddle.load(paddle_dump_path) + model.bert.load_dict(checkpoint) + + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + + model.eval() + # 要预测的句子 + data = get_data() + softmax = nn.Softmax() + # 预测的各类别的概率值 + output = softmax(model(**data)[0]).numpy() + + # 概率值最大的类别 + class_id = output.argmax() + # 对应的概率值 + prob = output[0][class_id] + print(f"class_id: {class_id}, prob: {prob}") + return output + + +if __name__ == "__main__": + main() diff --git a/examples/torch_migration/pipeline/Step2/test_data.py 
b/examples/torch_migration/pipeline/Step2/test_data.py new file mode 100644 index 0000000000000000000000000000000000000000..587e4e088dace3dd1b7db41e080a32a88291d1f3 --- /dev/null +++ b/examples/torch_migration/pipeline/Step2/test_data.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from functools import partial + +import numpy as np +import paddle +import pandas as pd +import torch +from datasets import Dataset +from reprod_log import ReprodDiffHelper, ReprodLogger +from transformers import BertTokenizer as HFBertTokenizer + +from paddlenlp.datasets import load_dataset as ppnlp_load_dataset +from paddlenlp.transformers import BertTokenizer as PPNLPBertTokenizer + + +def build_paddle_data_pipeline(): + from paddlenlp.data import DataCollatorWithPadding + + def read(data_path): + df = pd.read_csv(data_path, sep="\t") + for _, row in df.iterrows(): + yield {"sentence": row["sentence"], "labels": row["label"]} + + def convert_example(example, tokenizer, max_length=128): + labels = [example["labels"]] + example = tokenizer(example["sentence"], max_seq_len=max_length) + + example["labels"] = labels + return example + + # load tokenizer + tokenizer = PPNLPBertTokenizer.from_pretrained("bert-base-uncased") + # load data + dataset_test = ppnlp_load_dataset(read, data_path="demo_sst2_sentence/demo.tsv", lazy=False) + trans_func = partial(convert_example, tokenizer=tokenizer, max_length=128) + + # tokenize data + dataset_test = dataset_test.map(trans_func, lazy=False) + + test_sampler = paddle.io.SequenceSampler(dataset_test) + test_batch_sampler = paddle.io.BatchSampler(sampler=test_sampler, batch_size=4) + data_collator = DataCollatorWithPadding(tokenizer) + data_loader_test = paddle.io.DataLoader( + dataset_test, + batch_sampler=test_batch_sampler, + num_workers=0, + collate_fn=data_collator, + ) + + return dataset_test, data_loader_test + + +def build_torch_data_pipeline(): + from transformers import DataCollatorWithPadding + + tokenizer = HFBertTokenizer.from_pretrained("bert-base-uncased") + + def preprocess_function(examples): + result = tokenizer( + examples["sentence"], + padding=False, + max_length=128, + truncation=True, + return_token_type_ids=True, + ) + if "label" in examples: + result["labels"] = [examples["label"]] + return result + + # load data + dataset_test = Dataset.from_csv("demo_sst2_sentence/demo.tsv", sep="\t") + dataset_test = dataset_test.map( + preprocess_function, + batched=False, + remove_columns=dataset_test.column_names, + desc="Running tokenizer on dataset", + ) + dataset_test.set_format("np", columns=["input_ids", "token_type_ids", "labels"]) + test_sampler = torch.utils.data.SequentialSampler(dataset_test) + collate_fn = DataCollatorWithPadding(tokenizer) + data_loader_test = torch.utils.data.DataLoader( + dataset_test, + batch_size=4, + sampler=test_sampler, + num_workers=0, + collate_fn=collate_fn, + ) + return dataset_test, data_loader_test + + +def test_data_pipeline(): + 
diff_helper = ReprodDiffHelper() + paddle_dataset, paddle_dataloader = build_paddle_data_pipeline() + torch_dataset, torch_dataloader = build_torch_data_pipeline() + + logger_paddle_data = ReprodLogger() + logger_torch_data = ReprodLogger() + + logger_paddle_data.add("length", np.array(len(paddle_dataset))) + logger_torch_data.add("length", np.array(len(torch_dataset))) + + # random choose 5 images and check + for idx in range(5): + rnd_idx = np.random.randint(0, len(paddle_dataset)) + for k in ["input_ids", "token_type_ids", "labels"]: + + logger_paddle_data.add(f"dataset_{idx}_{k}", np.array(paddle_dataset[rnd_idx][k])) + + logger_torch_data.add(f"dataset_{idx}_{k}", np.array(torch_dataset[rnd_idx][k])) + + for idx, (paddle_batch, torch_batch) in enumerate(zip(paddle_dataloader, torch_dataloader)): + if idx >= 5: + break + for i, k in enumerate(["input_ids", "token_type_ids", "labels"]): + logger_paddle_data.add(f"dataloader_{idx}_{k}", paddle_batch[k].numpy()) + logger_torch_data.add(f"dataloader_{idx}_{k}", torch_batch[k].cpu().numpy()) + + diff_helper.compare_info(logger_paddle_data.data, logger_torch_data.data) + diff_helper.report() + + +if __name__ == "__main__": + test_data_pipeline() diff --git a/examples/torch_migration/pipeline/Step2/test_metric.py b/examples/torch_migration/pipeline/Step2/test_metric.py new file mode 100644 index 0000000000000000000000000000000000000000..2b82fd8348e53c894efe9319efb047c8245bbae5 --- /dev/null +++ b/examples/torch_migration/pipeline/Step2/test_metric.py @@ -0,0 +1,49 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
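+# test_metric.py feeds the same random logits and labels to paddle.metric.Accuracy
+# and to the HuggingFace datasets accuracy metric (accuracy.py), then saves both
+# results with ReprodLogger (metric_paddle.npy / metric_torch.npy) so that
+# check_step2.py can verify the metric alignment.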
+ +import numpy as np +import paddle +import torch +from datasets import load_metric +from paddle.metric import Accuracy +from reprod_log import ReprodLogger + + +def generate(): + pd_metric = Accuracy() + pd_metric.reset() + hf_metric = load_metric("accuracy.py") + for i in range(4): + logits = np.random.normal(0, 1, size=(64, 2)).astype("float32") + labels = np.random.randint(0, 2, size=(64,)).astype("int64") + # paddle metric + correct = pd_metric.compute(paddle.to_tensor(logits), paddle.to_tensor(labels)) + pd_metric.update(correct) + # hf metric + hf_metric.add_batch( + predictions=torch.from_numpy(logits).argmax(dim=-1), + references=torch.from_numpy(labels), + ) + pd_accuracy = pd_metric.accumulate() + hf_accuracy = hf_metric.compute()["accuracy"] + reprod_logger = ReprodLogger() + reprod_logger.add("accuracy", np.array([pd_accuracy])) + reprod_logger.save("metric_paddle.npy") + reprod_logger = ReprodLogger() + reprod_logger.add("accuracy", np.array([hf_accuracy])) + reprod_logger.save("metric_torch.npy") + + +if __name__ == "__main__": + generate() diff --git a/examples/torch_migration/pipeline/Step3/README.md b/examples/torch_migration/pipeline/Step3/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4e6e79ae1bf1f152d335517bd96df8dc87616378 --- /dev/null +++ b/examples/torch_migration/pipeline/Step3/README.md @@ -0,0 +1,67 @@ +# 使用方法 + +## 代码解析 + +以PaddlePaddle为例,下面为定义模型、计算loss并保存的代码。 + +```python +# paddle_loss.py +if __name__ == "__main__": + paddle.set_device("cpu") + + # def logger + reprod_logger = ReprodLogger() + + model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_classes=2) + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + model.eval() + + criterion = nn.CrossEntropyLoss() + + # read or gen fake data + fake_data = np.load("../fake_data/fake_data.npy") + fake_data = paddle.to_tensor(fake_data) + + fake_label = np.load("../fake_data/fake_label.npy") + fake_label = paddle.to_tensor(fake_label) + + # forward + out = model(fake_data) + + loss = criterion(out, fake_label) + # + reprod_logger.add("loss", loss.cpu().detach().numpy()) + reprod_logger.save("loss_paddle.npy") + +``` + +记录loss并保存在`loss_paddle.npy`文件中。 + + +## 操作步骤 + +* 具体操作步骤如下所示。 + + +```shell +# 生成paddle的前向loss结果 +python paddle_loss.py + +# 生成torch的前向loss结果 +python torch_loss.py + +# 对比生成log +python check_step3.py +``` + +`check_step3.py`的输出结果如下所示,同时也会保存在`loss_diff.log`文件中。 + +``` +[2021/11/17 21:27:35] root INFO: loss: +[2021/11/17 21:27:35] root INFO: mean diff: check passed: True, value: 5.960464477539063e-08 +[2021/11/17 21:27:35] root INFO: diff check passed + +``` + +diff为5.96e-8,check通过。 diff --git a/examples/torch_migration/pipeline/Step3/check_step3.py b/examples/torch_migration/pipeline/Step3/check_step3.py new file mode 100644 index 0000000000000000000000000000000000000000..546233dade0ea728512e43318790abd6742627d7 --- /dev/null +++ b/examples/torch_migration/pipeline/Step3/check_step3.py @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("./loss_torch.npy") + paddle_info = diff_helper.load_info("./loss_paddle.npy") + + diff_helper.compare_info(torch_info, paddle_info) + + diff_helper.report(path="loss_diff.log") diff --git a/examples/torch_migration/pipeline/Step3/paddle_loss.py b/examples/torch_migration/pipeline/Step3/paddle_loss.py new file mode 100644 index 0000000000000000000000000000000000000000..c791e03fc7e105232d26a37d314b273139d528e0 --- /dev/null +++ b/examples/torch_migration/pipeline/Step3/paddle_loss.py @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import sys + +import numpy as np +import paddle +import paddle.nn as nn +from reprod_log import ReprodLogger + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) + +from models.pd_bert import BertConfig, BertForSequenceClassification # noqa: E402 + +if __name__ == "__main__": + paddle.set_device("cpu") + + # def logger + reprod_logger = ReprodLogger() + + paddle_dump_path = "../weights/paddle_weight.pdparams" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = paddle.load(paddle_dump_path) + model.bert.load_dict(checkpoint) + + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + model.eval() + + criterion = nn.CrossEntropyLoss() + + # read or gen fake data + fake_data = np.load("../fake_data/fake_data.npy") + fake_data = paddle.to_tensor(fake_data) + + fake_label = np.load("../fake_data/fake_label.npy") + fake_label = paddle.to_tensor(fake_label) + + # forward + out = model(fake_data)[0] + + loss = criterion(out, fake_label) + reprod_logger.add("loss", loss.cpu().detach().numpy()) + reprod_logger.save("loss_paddle.npy") diff --git a/examples/torch_migration/pipeline/Step3/torch_loss.py b/examples/torch_migration/pipeline/Step3/torch_loss.py new file mode 100644 index 0000000000000000000000000000000000000000..fa455bb5a4d9ed6fe992a3e27e33c4bb94a5c7ea --- /dev/null +++ b/examples/torch_migration/pipeline/Step3/torch_loss.py @@ -0,0 +1,56 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import sys + +import numpy as np +import torch +import torch.nn as nn +from reprod_log import ReprodLogger + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) + +from models.pt_bert import BertConfig, BertForSequenceClassification # noqa: E402 + +if __name__ == "__main__": + + # def logger + reprod_logger = ReprodLogger() + + criterion = nn.CrossEntropyLoss() + + pytorch_dump_path = "../weights/torch_weight.bin" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = torch.load(pytorch_dump_path) + model.bert.load_state_dict(checkpoint) + + classifier_weights = torch.load("../classifier_weights/torch_classifier_weights.bin") + model.load_state_dict(classifier_weights, strict=False) + model.eval() + # read or gen fake data + fake_data = np.load("../fake_data/fake_data.npy") + fake_data = torch.from_numpy(fake_data) + + fake_label = np.load("../fake_data/fake_label.npy") + fake_label = torch.from_numpy(fake_label) + + # forward + out = model(fake_data)[0] + + loss = criterion(out, fake_label) + reprod_logger.add("loss", loss.cpu().detach().numpy()) + reprod_logger.save("loss_torch.npy") diff --git a/examples/torch_migration/pipeline/Step4/README.md b/examples/torch_migration/pipeline/Step4/README.md new file mode 100644 index 0000000000000000000000000000000000000000..695b0728a773d0c14ef6dbd90f7342a2d8afdd20 --- /dev/null +++ b/examples/torch_migration/pipeline/Step4/README.md @@ -0,0 +1,136 @@ +# 使用方法 + +### 学习率对齐验证 + +运行下面的命令,检查学习率模块设置是否正确。 + +```shell +python test_lr_scheduler.py +``` + +最终输出内容如下。 + +``` +[2021/11/17 21:44:19] root INFO: step_100_linear_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_300_linear_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_500_linear_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_700_linear_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_900_linear_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_100_cosine_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_300_cosine_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0 +[2021/11/17 21:44:19] root INFO: step_500_cosine_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: False, value: 9.35605818719964e-06 +[2021/11/17 21:44:19] root INFO: step_700_cosine_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: False, value: 1.3681476625617212e-05 +[2021/11/17 21:44:19] root INFO: step_900_cosine_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: False, value: 1.8924391285779562e-05 +[2021/11/17 21:44:19] root INFO: step_100_polynomial_lr: +[2021/11/17 21:44:19] root INFO: mean diff: check passed: 
True, value: 0.0
+[2021/11/17 21:44:19] root INFO: step_300_polynomial_lr:
+[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 21:44:19] root INFO: step_500_polynomial_lr:
+[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 21:44:19] root INFO: step_700_polynomial_lr:
+[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 21:44:19] root INFO: step_900_polynomial_lr:
+[2021/11/17 21:44:19] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 21:44:19] root INFO: diff check failed
+
+```
+
+linear和polynomial方式衰减的学习率diff为0,check通过;cosine方式衰减的学习率可能由于计算误差未通过,diff在1e-5量级。
+
+
+### 反向对齐操作方法
+
+#### 代码讲解
+
+以PaddlePaddle为例,训练流程核心代码如下所示。每个iter中输入相同的fake data与fake label,计算loss,进行梯度反传与参数更新,将loss批量返回,用于后续的验证。
+
+```python
+def pd_train_some_iters(fake_data, fake_label, max_iter=2):
+    paddle_dump_path = '../weights/paddle_weight.pdparams'
+    config = PDBertConfig()
+    model = PDBertForSequenceClassification(config)
+    checkpoint = paddle.load(paddle_dump_path)
+    model.bert.load_dict(checkpoint)
+    classifier_weights = paddle.load(
+        "../classifier_weights/paddle_classifier_weights.bin")
+    model.load_dict(classifier_weights)
+    model.eval()
+    criterion = paddle.nn.CrossEntropyLoss()
+    decay_params = [
+        p.name for n, p in model.named_parameters()
+        if not any(nd in n for nd in ["bias", "norm"])
+    ]
+    optimizer = paddle.optimizer.AdamW(
+        learning_rate=3e-5,
+        parameters=model.parameters(),
+        weight_decay=1e-2,
+        epsilon=1e-6,
+        apply_decay_param_fun=lambda x: x in decay_params)
+    loss_list = []
+    for idx in range(max_iter):
+        input_ids = paddle.to_tensor(fake_data)
+        labels = paddle.to_tensor(fake_label)
+
+        output = model(input_ids)[0]
+        loss = criterion(output, labels)
+        loss.backward()
+        optimizer.step()
+        optimizer.clear_grad()
+        loss_list.append(loss)
+    return loss_list
+```
+
+
+#### 操作方法
+
+运行下面的命令,基于fake data与fake label,依次生成若干轮loss数据并保存,使用`reprod_log`工具进行diff排查。
+
+```shell
+# 生成paddle和torch的反向对齐loss数据
+python test_bp.py
+
+# 对比生成log
+python check_step4.py
+```
+
+最终输出结果如下,同时会保存在文件`bp_align_diff.log`中。
+
+```
+[2021/11/17 22:08:30] root INFO: loss_0:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_1:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_2:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_3:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_4:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_5:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_6:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_7:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_8:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: loss_9:
+[2021/11/17 22:08:30] root INFO: mean diff: check passed: True, value: 0.0
+[2021/11/17 22:08:30] root INFO: diff check passed
+
+```
+
+前面10轮的loss diff均等于0,check通过。
diff --git a/examples/torch_migration/pipeline/Step4/check_step4.py
b/examples/torch_migration/pipeline/Step4/check_step4.py new file mode 100644 index 0000000000000000000000000000000000000000..20823116d45584038cc2b49fab1d693a66c1ae0c --- /dev/null +++ b/examples/torch_migration/pipeline/Step4/check_step4.py @@ -0,0 +1,23 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("./bp_align_torch.npy") + paddle_info = diff_helper.load_info("./bp_align_paddle.npy") + diff_helper.compare_info(torch_info, paddle_info) + + diff_helper.report(diff_threshold=1e-2, path="bp_align_diff.log") diff --git a/examples/torch_migration/pipeline/Step4/test_bp.py b/examples/torch_migration/pipeline/Step4/test_bp.py new file mode 100644 index 0000000000000000000000000000000000000000..1cf28c97bcad3cc9e17016698ea53ca44d28cb1b --- /dev/null +++ b/examples/torch_migration/pipeline/Step4/test_bp.py @@ -0,0 +1,127 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
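+# test_bp.py checks backward alignment: both models load the same converted weights,
+# run a few optimization steps on identical fake data with AdamW (lr=3e-5, weight
+# decay skipped for bias/LayerNorm parameters), and the per-step losses are dumped
+# to bp_align_torch.npy / bp_align_paddle.npy for check_step4.py to compare.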
+import os +import sys + +import numpy as np +import paddle +import torch +from reprod_log import ReprodLogger +from transformers import AdamW + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 1)[0] +sys.path.append(CONFIG_PATH) + +# isort: off + +from models.pd_bert import BertConfig as PDBertConfig # noqa: E402 +from models.pd_bert import ( # noqa: E402 + BertForSequenceClassification as PDBertForSequenceClassification, +) +from models.pt_bert import BertConfig as HFBertConfig # noqa: E402 +from models.pt_bert import ( # noqa: E402 + BertForSequenceClassification as HFBertForSequenceClassification, +) + +# isort: on + + +def pd_train_some_iters(fake_data, fake_label, max_iter=2): + paddle_dump_path = "../weights/paddle_weight.pdparams" + config = PDBertConfig() + model = PDBertForSequenceClassification(config) + checkpoint = paddle.load(paddle_dump_path) + model.bert.load_dict(checkpoint) + + classifier_weights = paddle.load("../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + model.eval() + criterion = paddle.nn.CrossEntropyLoss() + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=3e-5, + parameters=model.parameters(), + weight_decay=1e-2, + epsilon=1e-6, + apply_decay_param_fun=lambda x: x in decay_params, + ) + loss_list = [] + for idx in range(max_iter): + input_ids = paddle.to_tensor(fake_data) + labels = paddle.to_tensor(fake_label) + + output = model(input_ids)[0] + loss = criterion(output, labels) + loss.backward() + optimizer.step() + optimizer.clear_grad() + loss_list.append(loss) + return loss_list + + +def hf_train_some_iters(fake_data, fake_label, max_iter=2): + + pytorch_dump_path = "../weights/torch_weight.bin" + config = HFBertConfig() + model = HFBertForSequenceClassification(config) + checkpoint = torch.load(pytorch_dump_path) + model.bert.load_state_dict(checkpoint) + classifier_weights = torch.load("../classifier_weights/torch_classifier_weights.bin") + model.load_state_dict(classifier_weights, strict=False) + model.eval() + criterion = torch.nn.CrossEntropyLoss() + no_decay = ["bias", "LayerNorm.weight"] + optimizer_grouped_parameters = [ + { + "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], + "weight_decay": 1e-2, + }, + { + "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], + "weight_decay": 0.0, + }, + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5) + + loss_list = [] + for idx in range(max_iter): + input_ids = torch.from_numpy(fake_data) + labels = torch.from_numpy(fake_label) + + output = model(input_ids)[0] + loss = criterion(output, labels) + loss.backward() + optimizer.step() + optimizer.zero_grad() + loss_list.append(loss) + return loss_list + + +if __name__ == "__main__": + print("Start training") + paddle.set_device("cpu") + fake_data = np.load("../fake_data/fake_data.npy") + fake_label = np.load("../fake_data/fake_label.npy") + hf_reprod_logger = ReprodLogger() + hf_loss_list = hf_train_some_iters(fake_data, fake_label, 10) + for idx, loss in enumerate(hf_loss_list): + hf_reprod_logger.add(f"loss_{idx}", loss.detach().cpu().numpy()) + hf_reprod_logger.save("bp_align_torch.npy") + + pd_reprod_logger = ReprodLogger() + pd_loss_list = pd_train_some_iters(fake_data, fake_label, 10) + for idx, loss in enumerate(pd_loss_list): + 
pd_reprod_logger.add(f"loss_{idx}", loss.detach().cpu().numpy()) + pd_reprod_logger.save("bp_align_paddle.npy") diff --git a/examples/torch_migration/pipeline/Step4/test_lr_scheduler.py b/examples/torch_migration/pipeline/Step4/test_lr_scheduler.py new file mode 100644 index 0000000000000000000000000000000000000000..caab6bd1432e91f639ba0aa36c7067293bfe140b --- /dev/null +++ b/examples/torch_migration/pipeline/Step4/test_lr_scheduler.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import torch +from reprod_log import ReprodDiffHelper, ReprodLogger +from torch.optim import AdamW +from transformers.optimization import get_scheduler as get_hf_scheduler + +# define paddle scheduler +from paddlenlp.transformers import ( + CosineDecayWithWarmup, + LinearDecayWithWarmup, + PolyDecayWithWarmup, +) + +scheduler_type2cls = { + "linear": LinearDecayWithWarmup, + "cosine": CosineDecayWithWarmup, + "polynomial": PolyDecayWithWarmup, +} + + +def get_paddle_scheduler( + learning_rate, + scheduler_type, + num_warmup_steps=None, + num_training_steps=None, + **scheduler_kwargs, +): + if scheduler_type not in scheduler_type2cls.keys(): + data = " ".join(scheduler_type2cls.keys()) + raise ValueError(f"scheduler_type must be choson from {data}") + + if num_warmup_steps is None: + raise ValueError("requires `num_warmup_steps`, please provide that argument.") + + if num_training_steps is None: + raise ValueError("requires `num_training_steps`, please provide that argument.") + + return scheduler_type2cls[scheduler_type]( + learning_rate=learning_rate, + total_steps=num_training_steps, + warmup=num_warmup_steps, + **scheduler_kwargs, + ) + + +def test_lr(): + diff_helper = ReprodDiffHelper() + pd_reprod_logger = ReprodLogger() + hf_reprod_logger = ReprodLogger() + lr = 3e-5 + num_warmup_steps = 345 + num_training_steps = 1024 + milestone = [100, 300, 500, 700, 900] + for scheduler_type in ["linear", "cosine", "polynomial"]: + torch_optimizer = AdamW(torch.nn.Linear(1, 1).parameters(), lr=lr) + hf_scheduler = get_hf_scheduler( + name=scheduler_type, + optimizer=torch_optimizer, + num_warmup_steps=num_warmup_steps, + num_training_steps=num_training_steps, + ) + pd_scheduler = get_paddle_scheduler( + learning_rate=lr, + scheduler_type=scheduler_type, + num_warmup_steps=num_warmup_steps, + num_training_steps=num_training_steps, + ) + + for i in range(num_training_steps): + hf_scheduler.step() + pd_scheduler.step() + if i in milestone: + hf_reprod_logger.add( + f"step_{i}_{scheduler_type}_lr", + np.array([hf_scheduler.get_last_lr()[-1]]), + ) + pd_reprod_logger.add(f"step_{i}_{scheduler_type}_lr", np.array([pd_scheduler.get_lr()])) + + diff_helper.compare_info(hf_reprod_logger.data, pd_reprod_logger.data) + diff_helper.report() + + +if __name__ == "__main__": + test_lr() diff --git a/examples/torch_migration/pipeline/Step5/README.md b/examples/torch_migration/pipeline/Step5/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..bab96301cac7c0e808aac92475e640e4d1ac05e5 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/README.md @@ -0,0 +1,29 @@ +# 使用方法 + +首先运行下面的python代码,生成`train_align_torch.npy`和`train_align_paddle.npy`文件。 + +```python +# 运行生成paddle结果 +cd bert_paddle/ +sh train.sh +# 运行生成torch结果 +cd bert_torch/ +sh train.sh +``` + +然后运行下面的代码,运行训练脚本;之后使用`check_step5.py`进行精度diff验证。 + +```shell +# 对比生成log +python check_step5.py +``` + +这里需要注意的是,由于是精度对齐,SST-2数据集的精度diff在0.15%以内时,可以认为对齐,因此将`diff_threshold`参数修改为了`0.0015`。 + +``` +[2021/11/17 22:41:12] root INFO: acc: +[2021/11/17 22:41:12] root INFO: mean diff: check passed: True, value: 0.0011467889908256534 +[2021/11/17 22:41:12] root INFO: diff check passed +``` + +最终diff为`0.00114`,小于阈值标准,检查通过。 diff --git a/examples/torch_migration/pipeline/Step5/bert_paddle/train.py b/examples/torch_migration/pipeline/Step5/bert_paddle/train.py new file mode 100644 index 0000000000000000000000000000000000000000..425b4f47c01ae91accf921a826b2144f90e1126b --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_paddle/train.py @@ -0,0 +1,313 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import datetime +import os +import random +import sys +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +import utils +from paddle.metric import Accuracy +from paddle.optimizer import AdamW +from reprod_log import ReprodLogger + +from paddlenlp.data import Dict, Pad, Stack +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import BertTokenizer + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 2)[0] +sys.path.append(CONFIG_PATH) + +from models.pd_bert import BertConfig, BertForSequenceClassification # noqa: E402 + + +def train_one_epoch( + model, + criterion, + optimizer, + lr_scheduler, + data_loader, + epoch, + print_freq, + scaler=None, +): + model.train() + metric_logger = utils.MetricLogger(delimiter=" ") + metric_logger.add_meter("lr", utils.SmoothedValue(window_size=1, fmt="{value}")) + metric_logger.add_meter("sentence/s", utils.SmoothedValue(window_size=10, fmt="{value}")) + + header = "Epoch: [{}]".format(epoch) + for batch in metric_logger.log_every(data_loader, print_freq, header): + inputs = {"input_ids": batch[0], "token_type_ids": batch[1]} + labels = batch[2] + start_time = time.time() + with paddle.amp.auto_cast( + enable=scaler is not None, + custom_white_list=["layer_norm", "softmax", "gelu"], + ): + logits = model(**inputs)[0] + loss = criterion( + logits.reshape([-1, 2]), + labels.reshape( + [ + -1, + ] + ), + ) + + optimizer.clear_grad() + if scaler is not None: + scaler.scale(loss).backward() + scaler.step(optimizer) + scaler.update() + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + batch_size = inputs["input_ids"].shape[0] + metric_logger.update(loss=loss.item(), lr=lr_scheduler.get_lr()) + 
metric_logger.meters["sentence/s"].update(batch_size / (time.time() - start_time)) + + +def evaluate(model, criterion, data_loader, metric, print_freq=100): + model.eval() + metric.reset() + metric_logger = utils.MetricLogger(delimiter=" ") + header = "Test:" + with paddle.no_grad(): + for batch in metric_logger.log_every(data_loader, print_freq, header): + inputs = {"input_ids": batch[0], "token_type_ids": batch[1]} + labels = batch[2] + logits = model(**inputs)[0] + loss = criterion( + logits.reshape([-1, 2]), + labels.reshape( + [ + -1, + ] + ), + ) + metric_logger.update(loss=loss.item()) + correct = metric.compute(logits, labels) + metric.update(correct) + acc_global_avg = metric.accumulate() + # gather the stats from all processes + metric_logger.synchronize_between_processes() + print(" * Accuracy {acc_global_avg:.6f}".format(acc_global_avg=acc_global_avg)) + return acc_global_avg + + +def set_seed(seed=42): + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def convert_example(example, tokenizer, max_length=128): + labels = np.array([example["labels"]], dtype="int64") + example = tokenizer(example["sentence"], max_seq_len=max_length) + return { + "input_ids": example["input_ids"], + "token_type_ids": example["token_type_ids"], + "labels": labels, + } + + +def load_data(args, tokenizer): + print("Loading data") + train_ds = load_dataset("glue", args.task_name, splits="train") + validation_ds = load_dataset("glue", args.task_name, splits="dev") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_length=args.max_length) + train_ds = train_ds.map(trans_func, lazy=False) + validation_ds = validation_ds.map(trans_func, lazy=False) + + train_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=False) + validation_sampler = paddle.io.BatchSampler(validation_ds, batch_size=args.batch_size, shuffle=False) + + return train_ds, validation_ds, train_sampler, validation_sampler + + +def main(args): + if args.output_dir: + pass + # utils.mkdir(args.output_dir) + print(args) + scaler = None + # if args.fp16: + # scaler = paddle.amp.GradScaler() + paddle.set_device(args.device) + + if args.seed is not None: + set_seed(args.seed) + + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + batchify_fn = lambda samples, fn=Dict( + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), + "labels": Stack(dtype="int64"), + } + ): fn(samples) + train_dataset, validation_dataset, train_sampler, validation_sampler = load_data(args, tokenizer) + + train_data_loader = paddle.io.DataLoader( + train_dataset, + batch_sampler=train_sampler, + num_workers=args.workers, + collate_fn=batchify_fn, + ) + validation_data_loader = paddle.io.DataLoader( + validation_dataset, + batch_sampler=validation_sampler, + num_workers=args.workers, + collate_fn=batchify_fn, + ) + + print("Creating model") + paddle_dump_path = "../../weights/paddle_weight.pdparams" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = paddle.load(paddle_dump_path) + model.bert.load_dict(checkpoint) + + classifier_weights = paddle.load("../../classifier_weights/paddle_classifier_weights.bin") + model.load_dict(classifier_weights) + + print("Creating criterion") + criterion = nn.CrossEntropyLoss() + + print("Creating lr_scheduler") + lr_scheduler = utils.get_scheduler( + learning_rate=args.lr, + scheduler_type=args.lr_scheduler_type, + 
num_warmup_steps=args.num_warmup_steps, + num_training_steps=args.num_train_epochs * len(train_data_loader), + ) + + print("Creating optimizer") + # Split weights in two groups, one with weight decay and the other not. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + epsilon=1e-6, + apply_decay_param_fun=lambda x: x in decay_params, + ) + metric = Accuracy() + + if args.test_only: + evaluate(model, criterion, validation_data_loader, metric) + return + + print("Start training") + start_time = time.time() + best_accuracy = 0.0 + for epoch in range(args.num_train_epochs): + + train_one_epoch( + model, + criterion, + optimizer, + lr_scheduler, + train_data_loader, + epoch, + args.print_freq, + scaler, + ) + acc = evaluate(model, criterion, validation_data_loader, metric) + best_accuracy = max(best_accuracy, acc) + if args.output_dir: + pass + + total_time = time.time() - start_time + total_time_str = str(datetime.timedelta(seconds=int(total_time))) + print("Training time {}".format(total_time_str)) + return best_accuracy + + +def get_args_parser(add_help=True): + import argparse + + parser = argparse.ArgumentParser(description="Paddle SST-2 Classification Training", add_help=add_help) + parser.add_argument("--task_name", default="sst-2", help="the name of the glue task to train on.") + parser.add_argument( + "--model_name_or_path", + default="bert-base-uncased", + help="path to pretrained model or model identifier from huggingface.co/models.", + ) + parser.add_argument("--device", default="gpu", help="device") + parser.add_argument("--batch_size", default=32, type=int) + parser.add_argument( + "--max_length", + type=int, + default=128, + help=( + "The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated," + ), + ) + parser.add_argument("--num_train_epochs", default=3, type=int, help="number of total epochs to run") + parser.add_argument( + "--workers", + default=0, + type=int, + help="number of data loading workers (default: 16)", + ) + parser.add_argument("--lr", default=3e-5, type=float, help="initial learning rate") + parser.add_argument( + "--weight_decay", + default=1e-2, + type=float, + help="weight decay (default: 1e-2)", + dest="weight_decay", + ) + parser.add_argument( + "--lr_scheduler_type", + default="linear", + help="the scheduler type to use.", + choices=["linear", "cosine", "polynomial"], + ) + parser.add_argument( + "--num_warmup_steps", + default=0, + type=int, + help="number of steps for the warmup in the lr scheduler.", + ) + parser.add_argument("--print_freq", default=10, type=int, help="print frequency") + parser.add_argument("--output_dir", default="outputs", help="path where to save") + parser.add_argument( + "--test_only", + help="only test the model", + action="store_true", + ) + parser.add_argument("--seed", default=42, type=int, help="a seed for reproducible training.") + # Mixed precision training parameters + parser.add_argument("--fp16", action="store_true", help="whether or not mixed precision training") + + return parser + + +if __name__ == "__main__": + args = get_args_parser().parse_args() + acc = main(args) + reprod_logger = ReprodLogger() + reprod_logger.add("acc", np.array([acc])) + reprod_logger.save("train_align_paddle.npy") diff --git a/examples/torch_migration/pipeline/Step5/bert_paddle/train.sh b/examples/torch_migration/pipeline/Step5/bert_paddle/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..5c5e367f6404ac9f8df12e554ffe6ef67c308e0a --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_paddle/train.sh @@ -0,0 +1,20 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python -m paddle.distributed.launch --gpus "1" train.py \ + --model_name_or_path bert-base-uncased \ + --batch_size 128 \ + --num_warmup_steps 158 \ + --output_dir paddle_outputs + diff --git a/examples/torch_migration/pipeline/Step5/bert_paddle/utils.py b/examples/torch_migration/pipeline/Step5/bert_paddle/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..101eeca84cd4ea911a52020df0839c24cdb43cec --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_paddle/utils.py @@ -0,0 +1,213 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import datetime +import errno +import os +import time +from collections import defaultdict, deque + +import paddle + +from paddlenlp.transformers import ( + CosineDecayWithWarmup, + LinearDecayWithWarmup, + PolyDecayWithWarmup, +) + + +class SmoothedValue(object): + """Track a series of values and provide access to smoothed values over a + window or the global series average. + """ + + def __init__(self, window_size=20, fmt=None): + if fmt is None: + fmt = "{median:.4f} ({global_avg:.4f})" + self.deque = deque(maxlen=window_size) + self.total = 0.0 + self.count = 0 + self.fmt = fmt + + def update(self, value, n=1): + self.deque.append(value) + self.count += n + self.total += value * n + + def synchronize_between_processes(self): + """ + Warning: does not synchronize the deque! + """ + t = paddle.to_tensor([self.count, self.total], dtype="float64") + t = t.numpy().tolist() + self.count = int(t[0]) + self.total = t[1] + + @property + def median(self): + d = paddle.to_tensor(list(self.deque)) + return d.median().item() + + @property + def avg(self): + d = paddle.to_tensor(list(self.deque), dtype="float32") + return d.mean().item() + + @property + def global_avg(self): + return self.total / self.count + + @property + def max(self): + return max(self.deque) + + @property + def value(self): + return self.deque[-1] + + def __str__(self): + return self.fmt.format( + median=self.median, + avg=self.avg, + global_avg=self.global_avg, + max=self.max, + value=self.value, + ) + + +class MetricLogger(object): + def __init__(self, delimiter="\t"): + self.meters = defaultdict(SmoothedValue) + self.delimiter = delimiter + + def update(self, **kwargs): + for k, v in kwargs.items(): + if isinstance(v, paddle.Tensor): + v = v.item() + assert isinstance(v, (float, int)) + self.meters[k].update(v) + + def __getattr__(self, attr): + if attr in self.meters: + return self.meters[attr] + if attr in self.__dict__: + return self.__dict__[attr] + raise AttributeError("'{}' object has no attribute '{}'".format(type(self).__name__, attr)) + + def __str__(self): + loss_str = [] + for name, meter in self.meters.items(): + loss_str.append("{}: {}".format(name, str(meter))) + return self.delimiter.join(loss_str) + + def synchronize_between_processes(self): + for meter in self.meters.values(): + meter.synchronize_between_processes() + + def add_meter(self, name, meter): + self.meters[name] = meter + + def log_every(self, iterable, print_freq, header=None): + i = 0 + if not header: + header = "" + start_time = time.time() + end = time.time() + iter_time = SmoothedValue(fmt="{avg:.4f}") + data_time = SmoothedValue(fmt="{avg:.4f}") + space_fmt = ":" + str(len(str(len(iterable)))) + "d" + if paddle.device.is_compiled_with_cuda(): + log_msg = self.delimiter.join( + [ + header, + "[{0" + space_fmt + "}/{1}]", + "eta: {eta}", + "{meters}", + "time: {time}", + "data: {data}", + ] + ) + else: + log_msg = self.delimiter.join( + [ + header, + "[{0" + space_fmt + "}/{1}]", + "eta: {eta}", + "{meters}", + "time: {time}", + "data: {data}", + ] + ) + for obj in iterable: + data_time.update(time.time() - end) + yield obj + iter_time.update(time.time() - end) + if i % print_freq == 0: + eta_seconds = iter_time.global_avg * (len(iterable) - i) + eta_string = str(datetime.timedelta(seconds=int(eta_seconds))) + print( + log_msg.format( + i, + len(iterable), + eta=eta_string, + meters=str(self), + time=str(iter_time), + data=str(data_time), 
+ ) + ) + i += 1 + end = time.time() + total_time = time.time() - start_time + total_time_str = str(datetime.timedelta(seconds=int(total_time))) + print("{} Total time: {}".format(header, total_time_str)) + + +scheduler_type2cls = { + "linear": LinearDecayWithWarmup, + "cosine": CosineDecayWithWarmup, + "polynomial": PolyDecayWithWarmup, +} + + +def get_scheduler( + learning_rate, + scheduler_type, + num_warmup_steps=None, + num_training_steps=None, + **scheduler_kwargs, +): + if scheduler_type not in scheduler_type2cls.keys(): + data = " ".join(scheduler_type2cls.keys()) + raise ValueError(f"scheduler_type must be choson from {data}") + + if num_warmup_steps is None: + raise ValueError("requires `num_warmup_steps`, please provide that argument.") + + if num_training_steps is None: + raise ValueError("requires `num_training_steps`, please provide that argument.") + + return scheduler_type2cls[scheduler_type]( + learning_rate=learning_rate, + total_steps=num_training_steps, + warmup=num_warmup_steps, + **scheduler_kwargs, + ) + + +def mkdir(path): + try: + os.makedirs(path) + except OSError as e: + if e.errno != errno.EEXIST: + raise diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/accuracy.py b/examples/torch_migration/pipeline/Step5/bert_torch/accuracy.py new file mode 100644 index 0000000000000000000000000000000000000000..aecc76a72f54fb6a3c71404597769806d7847466 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_torch/accuracy.py @@ -0,0 +1,90 @@ +# coding=utf-8 +# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Accuracy metric.""" + +import datasets +from sklearn.metrics import accuracy_score + +_DESCRIPTION = """ +Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: +Accuracy = (TP + TN) / (TP + TN + FP + FN) +TP: True positive +TN: True negative +FP: False positive +FN: False negative +""" + +_KWARGS_DESCRIPTION = """ +Args: + predictions: Predicted labels, as returned by a model. + references: Ground truth labels. + normalize: If False, return the number of correctly classified samples. + Otherwise, return the fraction of correctly classified samples. + sample_weight: Sample weights. +Returns: + accuracy: Accuracy score. +Examples: + + >>> accuracy_metric = datasets.load_metric("accuracy") + >>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1]) + >>> print(results) + {'accuracy': 1.0} +""" + +_CITATION = """\ +@article{scikit-learn, + title={Scikit-learn: Machine Learning in {P}ython}, + author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. + and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. + and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and + Cournapeau, D. and Brucher, M. and Perrot, M. 
and Duchesnay, E.}, + journal={Journal of Machine Learning Research}, + volume={12}, + pages={2825--2830}, + year={2011} +} +""" + + +@datasets.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION) +class Accuracy(datasets.Metric): + def _info(self): + return datasets.MetricInfo( + description=_DESCRIPTION, + citation=_CITATION, + inputs_description=_KWARGS_DESCRIPTION, + features=datasets.Features( + { + "predictions": datasets.Sequence(datasets.Value("int32")), + "references": datasets.Sequence(datasets.Value("int32")), + } + if self.config_name == "multilabel" + else { + "predictions": datasets.Value("int32"), + "references": datasets.Value("int32"), + } + ), + reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"], + ) + + def _compute(self, predictions, references, normalize=True, sample_weight=None): + return { + "accuracy": accuracy_score( + references, + predictions, + normalize=normalize, + sample_weight=sample_weight, + ).item(), + } diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/glue.py b/examples/torch_migration/pipeline/Step5/bert_torch/glue.py new file mode 100644 index 0000000000000000000000000000000000000000..10cec84128a579c3b290eb847e720c16f03cb4c7 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_torch/glue.py @@ -0,0 +1,625 @@ +# coding=utf-8 +# Copyright 2020 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Lint as: python3 +"""The General Language Understanding Evaluation (GLUE) benchmark.""" + +import csv +import os +import textwrap + +import datasets +import numpy as np + +_GLUE_CITATION = """\ +@inproceedings{wang2019glue, + title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding}, + author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.}, + note={In the Proceedings of ICLR.}, + year={2019} +} +""" + +_GLUE_DESCRIPTION = """\ +GLUE, the General Language Understanding Evaluation benchmark +(https://gluebenchmark.com/) is a collection of resources for training, +evaluating, and analyzing natural language understanding systems. 
+ +""" + +_MRPC_DEV_IDS = "https://dl.fbaipublicfiles.com/glue/data/mrpc_dev_ids.tsv" +_MRPC_TRAIN = "https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt" +_MRPC_TEST = "https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt" + +_MNLI_BASE_KWARGS = dict( + text_features={ + "premise": "sentence1", + "hypothesis": "sentence2", + }, + label_classes=["entailment", "neutral", "contradiction"], + label_column="gold_label", + data_url="https://dl.fbaipublicfiles.com/glue/data/MNLI.zip", + data_dir="MNLI", + citation=textwrap.dedent( + """\ + @InProceedings{N18-1101, + author = "Williams, Adina + and Nangia, Nikita + and Bowman, Samuel", + title = "A Broad-Coverage Challenge Corpus for + Sentence Understanding through Inference", + booktitle = "Proceedings of the 2018 Conference of + the North American Chapter of the + Association for Computational Linguistics: + Human Language Technologies, Volume 1 (Long + Papers)", + year = "2018", + publisher = "Association for Computational Linguistics", + pages = "1112--1122", + location = "New Orleans, Louisiana", + url = "http://aclweb.org/anthology/N18-1101" + } + @article{bowman2015large, + title={A large annotated corpus for learning natural language inference}, + author={Bowman, Samuel R and Angeli, Gabor and Potts, Christopher and Manning, Christopher D}, + journal={arXiv preprint arXiv:1508.05326}, + year={2015} + }""" + ), + url="http://www.nyu.edu/projects/bowman/multinli/", +) + + +class GlueConfig(datasets.BuilderConfig): + """BuilderConfig for GLUE.""" + + def __init__( + self, + text_features, + label_column, + data_url, + data_dir, + citation, + url, + label_classes=None, + process_label=lambda x: x, + **kwargs, + ): + """BuilderConfig for GLUE. + + Args: + text_features: `dict[string, string]`, map from the name of the feature + dict for each text field to the name of the column in the tsv file + label_column: `string`, name of the column in the tsv file corresponding + to the label + data_url: `string`, url to download the zip file from + data_dir: `string`, the path to the folder containing the tsv files in the + downloaded zip + citation: `string`, citation for the data set + url: `string`, url for information about the data set + label_classes: `list[string]`, the list of classes if the label is + categorical. If not provided, then the label will be of type + `datasets.Value('float32')`. + process_label: `Function[string, any]`, function taking in the raw value + of the label and processing it to the form required by the label feature + **kwargs: keyword arguments forwarded to super. + """ + super(GlueConfig, self).__init__(version=datasets.Version("1.0.0", ""), **kwargs) + self.text_features = text_features + self.label_column = label_column + self.label_classes = label_classes + self.data_url = data_url + self.data_dir = data_dir + self.citation = citation + self.url = url + self.process_label = process_label + + +class Glue(datasets.GeneratorBasedBuilder): + """The General Language Understanding Evaluation (GLUE) benchmark.""" + + BUILDER_CONFIGS = [ + GlueConfig( + name="cola", + description=textwrap.dedent( + """\ + The Corpus of Linguistic Acceptability consists of English + acceptability judgments drawn from books and journal articles on + linguistic theory. 
Each example is a sequence of words annotated + with whether it is a grammatical English sentence.""" + ), + text_features={"sentence": "sentence"}, + label_classes=["unacceptable", "acceptable"], + label_column="is_acceptable", + data_url="https://dl.fbaipublicfiles.com/glue/data/CoLA.zip", + data_dir="CoLA", + citation=textwrap.dedent( + """\ + @article{warstadt2018neural, + title={Neural Network Acceptability Judgments}, + author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R}, + journal={arXiv preprint arXiv:1805.12471}, + year={2018} + }""" + ), + url="https://nyu-mll.github.io/CoLA/", + ), + GlueConfig( + name="sst2", + description=textwrap.dedent( + """\ + The Stanford Sentiment Treebank consists of sentences from movie reviews and + human annotations of their sentiment. The task is to predict the sentiment of a + given sentence. We use the two-way (positive/negative) class split, and use only + sentence-level labels.""" + ), + text_features={"sentence": "sentence"}, + label_classes=["negative", "positive"], + label_column="label", + data_url="https://dl.fbaipublicfiles.com/glue/data/SST-2.zip", + data_dir="SST-2", + citation=textwrap.dedent( + """\ + @inproceedings{socher2013recursive, + title={Recursive deep models for semantic compositionality over a sentiment treebank}, + author={Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D and Ng, Andrew and Potts, Christopher}, + booktitle={Proceedings of the 2013 conference on empirical methods in natural language processing}, + pages={1631--1642}, + year={2013} + }""" + ), + url="https://datasets.stanford.edu/sentiment/index.html", + ), + GlueConfig( + name="mrpc", + description=textwrap.dedent( + """\ + The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of + sentence pairs automatically extracted from online news sources, with human annotations + for whether the sentences in the pair are semantically equivalent.""" + ), # pylint: disable=line-too-long + text_features={"sentence1": "", "sentence2": ""}, + label_classes=["not_equivalent", "equivalent"], + label_column="Quality", + data_url="", # MRPC isn't hosted by GLUE. + data_dir="MRPC", + citation=textwrap.dedent( + """\ + @inproceedings{dolan2005automatically, + title={Automatically constructing a corpus of sentential paraphrases}, + author={Dolan, William B and Brockett, Chris}, + booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)}, + year={2005} + }""" + ), + url="https://www.microsoft.com/en-us/download/details.aspx?id=52398", + ), + GlueConfig( + name="qqp", + description=textwrap.dedent( + """\ + The Quora Question Pairs2 dataset is a collection of question pairs from the + community question-answering website Quora. 
The task is to determine whether a + pair of questions are semantically equivalent.""" + ), + text_features={ + "question1": "question1", + "question2": "question2", + }, + label_classes=["not_duplicate", "duplicate"], + label_column="is_duplicate", + data_url="https://dl.fbaipublicfiles.com/glue/data/QQP-clean.zip", + data_dir="QQP", + citation=textwrap.dedent( + """\ + @online{WinNT, + author = {Iyer, Shankar and Dandekar, Nikhil and Csernai, Kornel}, + title = {First Quora Dataset Release: Question Pairs}, + year = {2017}, + url = {https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs}, + urldate = {2019-04-03} + }""" + ), + url="https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs", + ), + GlueConfig( + name="stsb", + description=textwrap.dedent( + """\ + The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of + sentence pairs drawn from news headlines, video and image captions, and natural + language inference data. Each pair is human-annotated with a similarity score + from 1 to 5.""" + ), + text_features={ + "sentence1": "sentence1", + "sentence2": "sentence2", + }, + label_column="score", + data_url="https://dl.fbaipublicfiles.com/glue/data/STS-B.zip", + data_dir="STS-B", + citation=textwrap.dedent( + """\ + @article{cer2017semeval, + title={Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation}, + author={Cer, Daniel and Diab, Mona and Agirre, Eneko and Lopez-Gazpio, Inigo and Specia, Lucia}, + journal={arXiv preprint arXiv:1708.00055}, + year={2017} + }""" + ), + url="http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark", + process_label=np.float32, + ), + GlueConfig( + name="mnli", + description=textwrap.dedent( + """\ + The Multi-Genre Natural Language Inference Corpus is a crowdsourced + collection of sentence pairs with textual entailment annotations. Given a premise sentence + and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis + (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The premise sentences are + gathered from ten different sources, including transcribed speech, fiction, and government reports. + We use the standard test set, for which we obtained private labels from the authors, and evaluate + on both the matched (in-domain) and mismatched (cross-domain) section. We also use and recommend + the SNLI corpus as 550k examples of auxiliary training data.""" + ), + **_MNLI_BASE_KWARGS, + ), + GlueConfig( + name="mnli_mismatched", + description=textwrap.dedent( + """\ + The mismatched validation and test splits from MNLI. + See the "mnli" BuilderConfig for additional information.""" + ), + **_MNLI_BASE_KWARGS, + ), + GlueConfig( + name="mnli_matched", + description=textwrap.dedent( + """\ + The matched validation and test splits from MNLI. + See the "mnli" BuilderConfig for additional information.""" + ), + **_MNLI_BASE_KWARGS, + ), + GlueConfig( + name="qnli", + description=textwrap.dedent( + """\ + The Stanford Question Answering Dataset is a question-answering + dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn + from Wikipedia) contains the answer to the corresponding question (written by an annotator). We + convert the task into sentence pair classification by forming a pair between each question and each + sentence in the corresponding context, and filtering out pairs with low lexical overlap between the + question and the context sentence. 
The task is to determine whether the context sentence contains + the answer to the question. This modified version of the original task removes the requirement that + the model select the exact answer, but also removes the simplifying assumptions that the answer + is always present in the input and that lexical overlap is a reliable cue.""" + ), # pylint: disable=line-too-long + text_features={ + "question": "question", + "sentence": "sentence", + }, + label_classes=["entailment", "not_entailment"], + label_column="label", + data_url="https://dl.fbaipublicfiles.com/glue/data/QNLIv2.zip", + data_dir="QNLI", + citation=textwrap.dedent( + """\ + @article{rajpurkar2016squad, + title={Squad: 100,000+ questions for machine comprehension of text}, + author={Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy}, + journal={arXiv preprint arXiv:1606.05250}, + year={2016} + }""" + ), + url="https://rajpurkar.github.io/SQuAD-explorer/", + ), + GlueConfig( + name="rte", + description=textwrap.dedent( + """\ + The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual + entailment challenges. We combine the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim + et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009).4 Examples are + constructed based on news and Wikipedia text. We convert all datasets to a two-class split, where + for three-class datasets we collapse neutral and contradiction into not entailment, for consistency.""" + ), # pylint: disable=line-too-long + text_features={ + "sentence1": "sentence1", + "sentence2": "sentence2", + }, + label_classes=["entailment", "not_entailment"], + label_column="label", + data_url="https://dl.fbaipublicfiles.com/glue/data/RTE.zip", + data_dir="RTE", + citation=textwrap.dedent( + """\ + @inproceedings{dagan2005pascal, + title={The PASCAL recognising textual entailment challenge}, + author={Dagan, Ido and Glickman, Oren and Magnini, Bernardo}, + booktitle={Machine Learning Challenges Workshop}, + pages={177--190}, + year={2005}, + organization={Springer} + } + @inproceedings{bar2006second, + title={The second pascal recognising textual entailment challenge}, + author={Bar-Haim, Roy and Dagan, Ido and Dolan, Bill and Ferro, Lisa and Giampiccolo, Danilo and Magnini, Bernardo and Szpektor, Idan}, + booktitle={Proceedings of the second PASCAL challenges workshop on recognising textual entailment}, + volume={6}, + number={1}, + pages={6--4}, + year={2006}, + organization={Venice} + } + @inproceedings{giampiccolo2007third, + title={The third pascal recognizing textual entailment challenge}, + author={Giampiccolo, Danilo and Magnini, Bernardo and Dagan, Ido and Dolan, Bill}, + booktitle={Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing}, + pages={1--9}, + year={2007}, + organization={Association for Computational Linguistics} + } + @inproceedings{bentivogli2009fifth, + title={The Fifth PASCAL Recognizing Textual Entailment Challenge.}, + author={Bentivogli, Luisa and Clark, Peter and Dagan, Ido and Giampiccolo, Danilo}, + booktitle={TAC}, + year={2009} + }""" + ), + url="https://aclweb.org/aclwiki/Recognizing_Textual_Entailment", + ), + GlueConfig( + name="wnli", + description=textwrap.dedent( + """\ + The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task + in which a system must read a sentence with a pronoun and select the referent of that pronoun from + a list of choices. 
The examples are manually constructed to foil simple statistical methods: Each + one is contingent on contextual information provided by a single word or phrase in the sentence. + To convert the problem into sentence pair classification, we construct sentence pairs by replacing + the ambiguous pronoun with each possible referent. The task is to predict if the sentence with the + pronoun substituted is entailed by the original sentence. We use a small evaluation set consisting of + new examples derived from fiction books that was shared privately by the authors of the original + corpus. While the included training set is balanced between two classes, the test set is imbalanced + between them (65% not entailment). Also, due to a data quirk, the development set is adversarial: + hypotheses are sometimes shared between training and development examples, so if a model memorizes the + training examples, they will predict the wrong label on corresponding development set + example. As with QNLI, each example is evaluated separately, so there is not a systematic correspondence + between a model's score on this task and its score on the unconverted original task. We + call converted dataset WNLI (Winograd NLI).""" + ), + text_features={ + "sentence1": "sentence1", + "sentence2": "sentence2", + }, + label_classes=["not_entailment", "entailment"], + label_column="label", + data_url="https://dl.fbaipublicfiles.com/glue/data/WNLI.zip", + data_dir="WNLI", + citation=textwrap.dedent( + """\ + @inproceedings{levesque2012winograd, + title={The winograd schema challenge}, + author={Levesque, Hector and Davis, Ernest and Morgenstern, Leora}, + booktitle={Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning}, + year={2012} + }""" + ), + url="https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html", + ), + GlueConfig( + name="ax", + description=textwrap.dedent( + """\ + A manually-curated evaluation dataset for fine-grained analysis of + system performance on a broad range of linguistic phenomena. This + dataset evaluates sentence understanding through Natural Language + Inference (NLI) problems. Use a model trained on MulitNLI to produce + predictions for this dataset.""" + ), + text_features={ + "premise": "sentence1", + "hypothesis": "sentence2", + }, + label_classes=["entailment", "neutral", "contradiction"], + label_column="", # No label since we only have test set. + # We must use a URL shortener since the URL from GLUE is very long and + # causes issues in TFDS. + data_url="https://dl.fbaipublicfiles.com/glue/data/AX.tsv", + data_dir="", # We are downloading a tsv. + citation="", # The GLUE citation is sufficient. 
+ url="https://gluebenchmark.com/diagnostics", + ), + ] + + def _info(self): + features = {text_feature: datasets.Value("string") for text_feature in self.config.text_features.keys()} + if self.config.label_classes: + features["label"] = datasets.features.ClassLabel(names=self.config.label_classes) + else: + features["label"] = datasets.Value("float32") + features["idx"] = datasets.Value("int32") + return datasets.DatasetInfo( + description=_GLUE_DESCRIPTION, + features=datasets.Features(features), + homepage=self.config.url, + citation=self.config.citation + "\n" + _GLUE_CITATION, + ) + + def _split_generators(self, dl_manager): + if self.config.name == "ax": + data_file = dl_manager.download(self.config.data_url) + return [ + datasets.SplitGenerator( + name=datasets.Split.TEST, + gen_kwargs={ + "data_file": data_file, + "split": "test", + }, + ) + ] + + if self.config.name == "mrpc": + data_dir = None + mrpc_files = dl_manager.download( + { + "dev_ids": _MRPC_DEV_IDS, + "train": _MRPC_TRAIN, + "test": _MRPC_TEST, + } + ) + else: + dl_dir = dl_manager.download_and_extract(self.config.data_url) + data_dir = os.path.join(dl_dir, self.config.data_dir) + mrpc_files = None + train_split = datasets.SplitGenerator( + name=datasets.Split.TRAIN, + gen_kwargs={ + "data_file": os.path.join(data_dir or "", "train.tsv"), + "split": "train", + "mrpc_files": mrpc_files, + }, + ) + if self.config.name == "mnli": + return [ + train_split, + _mnli_split_generator("validation_matched", data_dir, "dev", matched=True), + _mnli_split_generator("validation_mismatched", data_dir, "dev", matched=False), + _mnli_split_generator("test_matched", data_dir, "test", matched=True), + _mnli_split_generator("test_mismatched", data_dir, "test", matched=False), + ] + elif self.config.name == "mnli_matched": + return [ + _mnli_split_generator("validation", data_dir, "dev", matched=True), + _mnli_split_generator("test", data_dir, "test", matched=True), + ] + elif self.config.name == "mnli_mismatched": + return [ + _mnli_split_generator("validation", data_dir, "dev", matched=False), + _mnli_split_generator("test", data_dir, "test", matched=False), + ] + else: + return [ + train_split, + datasets.SplitGenerator( + name=datasets.Split.VALIDATION, + gen_kwargs={ + "data_file": os.path.join(data_dir or "", "dev.tsv"), + "split": "dev", + "mrpc_files": mrpc_files, + }, + ), + datasets.SplitGenerator( + name=datasets.Split.TEST, + gen_kwargs={ + "data_file": os.path.join(data_dir or "", "test.tsv"), + "split": "test", + "mrpc_files": mrpc_files, + }, + ), + ] + + def _generate_examples(self, data_file, split, mrpc_files=None): + if self.config.name == "mrpc": + # We have to prepare the MRPC dataset from the original sources ourselves. + examples = self._generate_example_mrpc_files(mrpc_files=mrpc_files, split=split) + for example in examples: + yield example["idx"], example + else: + process_label = self.config.process_label + label_classes = self.config.label_classes + + # The train and dev files for CoLA are the only tsv files without a + # header. 
+ is_cola_non_test = self.config.name == "cola" and split != "test" + + with open(data_file, encoding="utf8") as f: + reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE) + if is_cola_non_test: + reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE) + + for n, row in enumerate(reader): + if is_cola_non_test: + row = { + "sentence": row[3], + "is_acceptable": row[1], + } + + example = {feat: row[col] for feat, col in self.config.text_features.items()} + example["idx"] = n + + if self.config.label_column in row: + label = row[self.config.label_column] + # For some tasks, the label is represented as 0 and 1 in the tsv + # files and needs to be cast to integer to work with the feature. + if label_classes and label not in label_classes: + label = int(label) if label else None + example["label"] = process_label(label) + else: + example["label"] = process_label(-1) + + # Filter out corrupted rows. + for value in example.values(): + if value is None: + break + else: + yield example["idx"], example + + def _generate_example_mrpc_files(self, mrpc_files, split): + if split == "test": + with open(mrpc_files["test"], encoding="utf8") as f: + # The first 3 bytes are the utf-8 BOM \xef\xbb\xbf, which messes with + # the Quality key. + f.seek(3) + reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE) + for n, row in enumerate(reader): + yield { + "sentence1": row["#1 String"], + "sentence2": row["#2 String"], + "label": int(row["Quality"]), + "idx": n, + } + else: + with open(mrpc_files["dev_ids"], encoding="utf8") as f: + reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE) + dev_ids = [[row[0], row[1]] for row in reader] + with open(mrpc_files["train"], encoding="utf8") as f: + # The first 3 bytes are the utf-8 BOM \xef\xbb\xbf, which messes with + # the Quality key. + f.seek(3) + reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE) + for n, row in enumerate(reader): + is_row_in_dev = [row["#1 ID"], row["#2 ID"]] in dev_ids + if is_row_in_dev == (split == "dev"): + yield { + "sentence1": row["#1 String"], + "sentence2": row["#2 String"], + "label": int(row["Quality"]), + "idx": n, + } + + +def _mnli_split_generator(name, data_dir, split, matched): + return datasets.SplitGenerator( + name=name, + gen_kwargs={ + "data_file": os.path.join(data_dir, "%s_%s.tsv" % (split, "matched" if matched else "mismatched")), + "split": split, + "mrpc_files": None, + }, + ) diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/train.py b/examples/torch_migration/pipeline/Step5/bert_torch/train.py new file mode 100644 index 0000000000000000000000000000000000000000..b236071f2a695f92428189aabfec91cab2f008b3 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_torch/train.py @@ -0,0 +1,326 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
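The local `glue.py` builder and `accuracy.py` metric above are consumed by the training scripts through the HuggingFace `datasets` loaders. Below is a minimal sketch of that round trip, assuming both scripts sit next to the caller and that a `datasets` release which still ships `load_metric` is installed; the toy accuracy value is the one from the metric's own docstring.

```python
# Illustrative only: load the local GLUE builder and accuracy metric scripts.
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("glue.py", "sst2")  # "sst2" picks the SST-2 BuilderConfig above
metric = load_metric("accuracy.py")

# Each SST-2 example carries a "sentence", an integer "label" and an "idx".
print(raw_datasets["validation"][0])

# The metric accumulates batches and reports {"accuracy": ...} on compute().
metric.add_batch(predictions=[0, 1], references=[0, 1])
print(metric.compute())  # {'accuracy': 1.0}
```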
+ +import datetime +import os +import random +import sys +import time + +import numpy as np +import torch +import torch.utils.data +import utils +from datasets import load_dataset, load_metric +from reprod_log import ReprodLogger +from torch import nn +from transformers import AdamW, BertTokenizer, DataCollatorWithPadding, get_scheduler + +CURRENT_DIR = os.path.split(os.path.abspath(__file__))[0] # 当前目录 +CONFIG_PATH = CURRENT_DIR.rsplit("/", 2)[0] +sys.path.append(CONFIG_PATH) + +from models.pt_bert import BertConfig, BertForSequenceClassification # noqa: E402 + +task_to_keys = { + "cola": ("sentence", None), + "mnli": ("premise", "hypothesis"), + "mrpc": ("sentence1", "sentence2"), + "qnli": ("question", "sentence"), + "qqp": ("question1", "question2"), + "rte": ("sentence1", "sentence2"), + "sst2": ("sentence", None), + "stsb": ("sentence1", "sentence2"), + "wnli": ("sentence1", "sentence2"), +} + + +def train_one_epoch( + model, + criterion, + optimizer, + lr_scheduler, + data_loader, + device, + epoch, + print_freq, + scaler=None, +): + model.train() + metric_logger = utils.MetricLogger(delimiter=" ") + metric_logger.add_meter("lr", utils.SmoothedValue(window_size=1, fmt="{value}")) + metric_logger.add_meter("sentence/s", utils.SmoothedValue(window_size=10, fmt="{value}")) + + header = "Epoch: [{}]".format(epoch) + for batch in metric_logger.log_every(data_loader, print_freq, header): + start_time = time.time() + batch.to(device) + labels = batch.pop("labels") + with torch.cuda.amp.autocast(enabled=scaler is not None): + logits = model(**batch)[0] + loss = criterion(logits.reshape(-1, 2), labels.reshape(-1)) + + optimizer.zero_grad() + if scaler is not None: + scaler.scale(loss).backward() + scaler.step(optimizer) + scaler.update() + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + batch_size = batch["input_ids"].shape[0] + metric_logger.update(loss=loss.item(), lr=lr_scheduler.get_last_lr()[-1]) + metric_logger.meters["sentence/s"].update(batch_size / (time.time() - start_time)) + + +def evaluate(model, criterion, data_loader, device, metric, print_freq=100): + model.eval() + metric_logger = utils.MetricLogger(delimiter=" ") + header = "Test:" + with torch.no_grad(): + for batch in metric_logger.log_every(data_loader, print_freq, header): + batch.to(device) + labels = batch.pop("labels") + logits = model(**batch)[0] + loss = criterion(logits.reshape(-1, 2), labels.reshape(-1)) + metric_logger.update(loss=loss.item()) + metric.add_batch( + predictions=logits.argmax(dim=-1), + references=labels, + ) + acc_global_avg = metric.compute()["accuracy"] + # gather the stats from all processes + metric_logger.synchronize_between_processes() + print(" * Accuracy {acc_global_avg:.6f}".format(acc_global_avg=acc_global_avg)) + return acc_global_avg + + +def set_seed(seed=42): + random.seed(seed) + np.random.seed(seed) + torch.manual_seed(seed) + torch.cuda.manual_seed_all(seed) + + +def load_data(args, tokenizer): + print("Loading data") + raw_datasets = load_dataset("glue.py", args.task_name, cache_dir=args.data_cache_dir) + sentence1_key, sentence2_key = task_to_keys[args.task_name] + + def preprocess_function(examples): + texts = ( + (examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key]) + ) + result = tokenizer(*texts, padding=False, max_length=args.max_length, truncation=True) + + if "label" in examples: + result["labels"] = examples["label"] + return result + + train_ds = raw_datasets["train"].map( + preprocess_function, + 
batched=True, + remove_columns=raw_datasets["train"].column_names, + desc="Running tokenizer on train dataset", + new_fingerprint=f"train_tokenized_dataset_{args.task_name}", + ) + validation_ds = raw_datasets["validation"].map( + preprocess_function, + batched=True, + remove_columns=raw_datasets["validation"].column_names, + desc="Running tokenizer on validation dataset", + new_fingerprint=f"validation_tokenized_dataset_{args.task_name}", + ) + train_sampler = torch.utils.data.SequentialSampler(train_ds) + validation_sampler = torch.utils.data.SequentialSampler(validation_ds) + + return train_ds, validation_ds, train_sampler, validation_sampler + + +def main(args): + if args.output_dir: + utils.mkdir(args.output_dir) + print(args) + scaler = None + if args.fp16: + scaler = torch.cuda.amp.GradScaler() + device = torch.device(args.device) + torch.backends.cudnn.benchmark = True + + if args.seed is not None: + set_seed(args.seed) + + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=(8 if args.fp16 else None)) + train_dataset, validation_dataset, train_sampler, validation_sampler = load_data(args, tokenizer) + train_data_loader = torch.utils.data.DataLoader( + train_dataset, + batch_size=args.batch_size, + sampler=train_sampler, + num_workers=args.workers, + collate_fn=data_collator, + ) + + validation_data_loader = torch.utils.data.DataLoader( + validation_dataset, + batch_size=args.batch_size, + sampler=validation_sampler, + num_workers=args.workers, + collate_fn=data_collator, + ) + + print("Creating model") + pytorch_dump_path = "../../weights/torch_weight.bin" + config = BertConfig() + model = BertForSequenceClassification(config) + checkpoint = torch.load(pytorch_dump_path) + model.bert.load_state_dict(checkpoint) + + classifier_weights = torch.load("../../classifier_weights/torch_classifier_weights.bin") + model.load_state_dict(classifier_weights, strict=False) + model.to(device) + + print("Creating criterion") + criterion = nn.CrossEntropyLoss() + + print("Creating optimizer") + # Split weights in two groups, one with weight decay and the other not. 
+ no_decay = ["bias", "LayerNorm.weight"] + optimizer_grouped_parameters = [ + { + "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], + "weight_decay": args.weight_decay, + }, + { + "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], + "weight_decay": 0.0, + }, + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=args.lr) + + print("Creating lr_scheduler") + lr_scheduler = get_scheduler( + name=args.lr_scheduler_type, + optimizer=optimizer, + num_warmup_steps=args.num_warmup_steps, + num_training_steps=args.num_train_epochs * len(train_data_loader), + ) + + metric = load_metric("accuracy.py") + if args.test_only: + evaluate(model, criterion, validation_data_loader, device=device) + return + + print("Start training") + start_time = time.time() + best_accuracy = 0.0 + for epoch in range(args.num_train_epochs): + train_one_epoch( + model, + criterion, + optimizer, + lr_scheduler, + train_data_loader, + device, + epoch, + args.print_freq, + scaler, + ) + acc = evaluate(model, criterion, validation_data_loader, device=device, metric=metric) + best_accuracy = max(best_accuracy, acc) + if args.output_dir: + pass + + total_time = time.time() - start_time + total_time_str = str(datetime.timedelta(seconds=int(total_time))) + print("Training time {}".format(total_time_str)) + return best_accuracy + + +def get_args_parser(add_help=True): + import argparse + + parser = argparse.ArgumentParser(description="PyTorch SST-2 Classification Training", add_help=add_help) + parser.add_argument("--data_cache_dir", default="data_caches", help="data cache dir.") + parser.add_argument("--task_name", default="sst2", help="the name of the glue task to train on.") + parser.add_argument( + "--model_name_or_path", + default="bert-base-uncased", + help="path to pretrained model or model identifier from huggingface.co/models.", + ) + parser.add_argument("--device", default="cuda:2", help="device") + parser.add_argument("--batch_size", default=32, type=int) + parser.add_argument( + "--max_length", + type=int, + default=128, + help=( + "The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated," + ), + ) + parser.add_argument("--num_train_epochs", default=3, type=int, help="number of total epochs to run") + parser.add_argument( + "--workers", + default=0, + type=int, + help="number of data loading workers (default: 16)", + ) + parser.add_argument("--lr", default=3e-5, type=float, help="initial learning rate") + parser.add_argument( + "--weight_decay", + default=1e-2, + type=float, + help="weight decay (default: 1e-2)", + dest="weight_decay", + ) + parser.add_argument( + "--lr_scheduler_type", + default="linear", + help="the scheduler type to use.", + choices=[ + "linear", + "cosine", + "cosine_with_restarts", + "polynomial", + "constant", + "constant_with_warmup", + ], + ) + parser.add_argument( + "--num_warmup_steps", + default=0, + type=int, + help="number of steps for the warmup in the lr scheduler.", + ) + parser.add_argument("--print_freq", default=10, type=int, help="print frequency") + parser.add_argument("--output_dir", default="outputs", help="path where to save") + parser.add_argument( + "--test_only", + help="only test the model", + action="store_true", + ) + parser.add_argument("--seed", default=42, type=int, help="a seed for reproducible training.") + # Mixed precision training parameters + parser.add_argument("--fp16", action="store_true", help="whether or not mixed precision training") + + return parser + + +if __name__ == "__main__": + args = get_args_parser().parse_args() + acc = main(args) + reprod_logger = ReprodLogger() + reprod_logger.add("acc", np.array([acc])) + reprod_logger.save("train_align_benchmark.npy") diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/train.sh b/examples/torch_migration/pipeline/Step5/bert_torch/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..1d26be50340b7850bc36f33411045499bc34a221 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_torch/train.sh @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +python train.py \ + --model_name_or_path bert-base-uncased \ + --batch_size 128 \ + --num_warmup_steps 158 \ + --output_dir bert_outputs \ \ No newline at end of file diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/utils.py b/examples/torch_migration/pipeline/Step5/bert_torch/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..a77288f7823bd9a388e97d8713d6770199f3f974 --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/bert_torch/utils.py @@ -0,0 +1,204 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import datetime +import errno +import os +import time +from collections import defaultdict, deque + +import torch + + +class SmoothedValue(object): + """Track a series of values and provide access to smoothed values over a + window or the global series average. + """ + + def __init__(self, window_size=20, fmt=None): + if fmt is None: + fmt = "{median:.4f} ({global_avg:.4f})" + self.deque = deque(maxlen=window_size) + self.total = 0.0 + self.count = 0 + self.fmt = fmt + + def update(self, value, n=1): + self.deque.append(value) + self.count += n + self.total += value * n + + def synchronize_between_processes(self): + """ + Warning: does not synchronize the deque! + """ + return + + @property + def median(self): + d = torch.tensor(list(self.deque)) + return d.median().item() + + @property + def avg(self): + d = torch.tensor(list(self.deque), dtype=torch.float32) + return d.mean().item() + + @property + def global_avg(self): + return self.total / self.count + + @property + def max(self): + return max(self.deque) + + @property + def value(self): + return self.deque[-1] + + def __str__(self): + return self.fmt.format( + median=self.median, + avg=self.avg, + global_avg=self.global_avg, + max=self.max, + value=self.value, + ) + + +class MetricLogger(object): + def __init__(self, delimiter="\t"): + self.meters = defaultdict(SmoothedValue) + self.delimiter = delimiter + + def update(self, **kwargs): + for k, v in kwargs.items(): + if isinstance(v, torch.Tensor): + v = v.item() + assert isinstance(v, (float, int)) + self.meters[k].update(v) + + def __getattr__(self, attr): + if attr in self.meters: + return self.meters[attr] + if attr in self.__dict__: + return self.__dict__[attr] + raise AttributeError("'{}' object has no attribute '{}'".format(type(self).__name__, attr)) + + def __str__(self): + loss_str = [] + for name, meter in self.meters.items(): + loss_str.append("{}: {}".format(name, str(meter))) + return self.delimiter.join(loss_str) + + def synchronize_between_processes(self): + for meter in self.meters.values(): + meter.synchronize_between_processes() + + def add_meter(self, name, meter): + self.meters[name] = meter + + def log_every(self, iterable, print_freq, header=None): + i = 0 + if not header: + header = "" + start_time = time.time() + end = time.time() + iter_time = SmoothedValue(fmt="{avg:.4f}") + data_time = SmoothedValue(fmt="{avg:.4f}") + space_fmt = ":" + str(len(str(len(iterable)))) + "d" + if torch.cuda.is_available(): + log_msg = self.delimiter.join( + [ + header, + "[{0" + space_fmt + "}/{1}]", + "eta: {eta}", + "{meters}", + "time: {time}", + "data: {data}", + "max mem: {memory:.0f}", + ] + ) + else: + log_msg = self.delimiter.join( + [ + header, + "[{0" + space_fmt + "}/{1}]", + "eta: {eta}", + "{meters}", + "time: {time}", + "data: {data}", + ] + ) + MB = 1024.0 * 1024.0 + for obj in iterable: + data_time.update(time.time() - end) + yield obj + iter_time.update(time.time() - end) + if i % print_freq == 0: + eta_seconds = iter_time.global_avg * (len(iterable) - i) + eta_string = str(datetime.timedelta(seconds=int(eta_seconds))) + if torch.cuda.is_available(): 
+ print( + log_msg.format( + i, + len(iterable), + eta=eta_string, + meters=str(self), + time=str(iter_time), + data=str(data_time), + memory=torch.cuda.max_memory_allocated() / MB, + ) + ) + else: + print( + log_msg.format( + i, + len(iterable), + eta=eta_string, + meters=str(self), + time=str(iter_time), + data=str(data_time), + ) + ) + i += 1 + end = time.time() + total_time = time.time() - start_time + total_time_str = str(datetime.timedelta(seconds=int(total_time))) + print("{} Total time: {}".format(header, total_time_str)) + + +def accuracy(output, target, topk=(1,)): + """Computes the accuracy over the k top predictions for the specified values of k""" + with torch.no_grad(): + maxk = max(topk) + batch_size = target.size(0) + + _, pred = output.topk(maxk, 1, True, True) + pred = pred.t() + correct = pred.eq(target[None]) + + res = [] + for k in topk: + correct_k = correct[:k].flatten().sum(dtype=torch.float32) + res.append(correct_k * (100.0 / batch_size)) + return res + + +def mkdir(path): + try: + os.makedirs(path) + except OSError as e: + if e.errno != errno.EEXIST: + raise diff --git a/examples/torch_migration/pipeline/Step5/check_step5.py b/examples/torch_migration/pipeline/Step5/check_step5.py new file mode 100644 index 0000000000000000000000000000000000000000..79d3556a8ae0a604cc35f622df28d76c142d554a --- /dev/null +++ b/examples/torch_migration/pipeline/Step5/check_step5.py @@ -0,0 +1,24 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + torch_info = diff_helper.load_info("bert_torch/train_align_benchmark.npy") + paddle_info = diff_helper.load_info("bert_paddle/train_align_paddle.npy") + + diff_helper.compare_info(torch_info, paddle_info) + + diff_helper.report(path="train_align_diff.log", diff_threshold=0.0025) diff --git a/examples/torch_migration/pipeline/classifier_weights/generate_classifier_weights.py b/examples/torch_migration/pipeline/classifier_weights/generate_classifier_weights.py new file mode 100644 index 0000000000000000000000000000000000000000..1c696aac3af0a44a79d65125ec5197597c0090e0 --- /dev/null +++ b/examples/torch_migration/pipeline/classifier_weights/generate_classifier_weights.py @@ -0,0 +1,37 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
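`check_step5.py` above closes the loop on training alignment: each `train.py` dumps its best accuracy with `ReprodLogger`, and the checker diffs the two dumps against a 0.0025 threshold. A compressed sketch of that round trip follows; the accuracy numbers are made up and only stand in for real runs.

```python
# Sketch of the reprod_log round trip behind Step5 (values are hypothetical).
import numpy as np
from reprod_log import ReprodDiffHelper, ReprodLogger

# What bert_torch/train.py and bert_paddle/train.py do after training finishes.
torch_logger = ReprodLogger()
torch_logger.add("acc", np.array([0.9231]))   # hypothetical benchmark accuracy
torch_logger.save("train_align_benchmark.npy")

paddle_logger = ReprodLogger()
paddle_logger.add("acc", np.array([0.9243]))  # hypothetical Paddle accuracy
paddle_logger.save("train_align_paddle.npy")

# What check_step5.py does: the check passes only if the diff stays within threshold.
helper = ReprodDiffHelper()
helper.compare_info(helper.load_info("train_align_benchmark.npy"),
                    helper.load_info("train_align_paddle.npy"))
helper.report(path="train_align_diff.log", diff_threshold=0.0025)
```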
+ +import numpy as np +import paddle +import torch + + +def generate(seed): + np.random.seed(seed) + weight = np.random.normal(0, 0.02, (768, 2)).astype("float32") + bias = np.zeros((2,)).astype("float32") + paddle_weights = { + "classifier.weight": weight, + "classifier.bias": bias, + } + torch_weights = { + "classifier.weight": torch.from_numpy(weight).t(), + "classifier.bias": torch.from_numpy(bias), + } + torch.save(torch_weights, "torch_classifier_weights.bin") + paddle.save(paddle_weights, "paddle_classifier_weights.bin") + + +if __name__ == "__main__": + generate(seed=42) diff --git a/examples/torch_migration/pipeline/fake_data/gen_fake_data.py b/examples/torch_migration/pipeline/fake_data/gen_fake_data.py new file mode 100644 index 0000000000000000000000000000000000000000..e083799c0484e456ab7ed574eb33f38a8def010b --- /dev/null +++ b/examples/torch_migration/pipeline/fake_data/gen_fake_data.py @@ -0,0 +1,26 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + + +def gen_fake_data(): + fake_data = np.random.randint(1, 30522, size=(4, 64)).astype(np.int64) + fake_label = np.array([0, 1, 1, 0]).astype(np.int64) + np.save("fake_data.npy", fake_data) + np.save("fake_label.npy", fake_label) + + +if __name__ == "__main__": + gen_fake_data() diff --git a/examples/torch_migration/pipeline/models/pd_bert.py b/examples/torch_migration/pipeline/models/pd_bert.py new file mode 100644 index 0000000000000000000000000000000000000000..3e798421241910dbb0394836b4950da5aab59f1b --- /dev/null +++ b/examples/torch_migration/pipeline/models/pd_bert.py @@ -0,0 +1,424 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
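`generate_classifier_weights.py` above saves the same classifier weights for both frameworks but transposes the PyTorch copy, because `paddle.nn.Linear` stores its weight as `(in_features, out_features)` while `torch.nn.Linear` stores `(out_features, in_features)`. A small sketch of that layout difference (illustrative only, not part of the pipeline):

```python
# Why the torch copy of the classifier weight is transposed.
import numpy as np
import paddle
import torch

weight = np.random.normal(0, 0.02, (768, 2)).astype("float32")

pd_linear = paddle.nn.Linear(768, 2)
pt_linear = torch.nn.Linear(768, 2)

print(pd_linear.weight.shape)   # [768, 2]            -> matches `weight` directly
print(pt_linear.weight.shape)   # torch.Size([2, 768]) -> needs the transpose

pd_linear.weight.set_value(paddle.to_tensor(weight))
with torch.no_grad():
    pt_linear.weight.copy_(torch.from_numpy(weight).t())
```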
+"""Paddle BERT model.""" + +import math +from typing import Optional, Tuple + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +ACT2FN = { + "relu": F.relu, + "gelu": F.gelu, + "tanh": F.tanh, + "sigmoid": F.sigmoid, +} +NEG_INF = -1e4 + + +class BertConfig: + def __init__( + self, + vocab_size: int = 30522, + hidden_size: int = 768, + num_hidden_layers: int = 12, + num_attention_heads: int = 12, + intermediate_size: int = 3072, + hidden_act: str = "gelu", + hidden_dropout_prob: float = 0.1, + attention_probs_dropout_prob: float = 0.1, + max_position_embeddings: int = 512, + type_vocab_size: int = 2, + initializer_range: float = 0.02, + pad_token_id: int = 0, + pool_act: str = "tanh", + layer_norm_eps: float = 1e-12, + output_attentions: bool = False, + output_hidden_states: bool = False, + num_labels=2, + **kwargs + ): + self.pad_token_id = pad_token_id + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.initializer_range = initializer_range + self.pool_act = pool_act + self.layer_norm_eps = layer_norm_eps + self.output_attentions = output_attentions + self.output_hidden_states = output_hidden_states + self.num_labels = num_labels + + +class BertEmbeddings(nn.Layer): + """Construct the embeddings from word, position and token_type embeddings.""" + + def __init__(self, config): + super().__init__() + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + self.LayerNorm = nn.LayerNorm(config.hidden_size, epsilon=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + self.register_buffer( + "position_ids", paddle.arange(config.max_position_embeddings, dtype="int64").reshape((1, -1)) + ) + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + ) -> paddle.Tensor: + input_shape = input_ids.shape + seq_length = input_ids.shape[1] + + if position_ids is None: + position_ids = self.position_ids[:, :seq_length] + + if token_type_ids is None: + token_type_ids = paddle.zeros(input_shape, dtype=paddle.int64) + + inputs_embeds = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + position_embeddings = self.position_embeddings(position_ids) + embeddings = inputs_embeds + token_type_embeddings + position_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class BertSelfAttention(nn.Layer): + def __init__(self, config): + super().__init__() + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = config.hidden_size // config.num_attention_heads + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + self.value = 
nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + + def transpose_for_scores(self, x: paddle.Tensor) -> paddle.Tensor: + new_x_shape = x.shape[:-1] + [self.num_attention_heads, self.attention_head_size] + x = x.reshape(new_x_shape) + return x.transpose([0, 2, 1, 3]) + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + + # compute q,k,v + query_layer = self.transpose_for_scores(self.query(hidden_states)) + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = paddle.matmul(query_layer, key_layer, transpose_y=True) + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + if attention_mask is not None: + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = F.softmax(attention_scores, axis=-1) + attention_probs = self.dropout(attention_probs) + + context_layer = paddle.matmul(attention_probs, value_layer) + + context_layer = context_layer.transpose([0, 2, 1, 3]) + new_context_layer_shape = context_layer.shape[:-2] + [ + self.all_head_size, + ] + context_layer = context_layer.reshape(new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + return outputs + + +class BertSelfOutput(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, epsilon=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: paddle.Tensor, input_tensor: paddle.Tensor) -> paddle.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertAttention(nn.Layer): + def __init__(self, config): + super().__init__() + self.self = BertSelfAttention(config) + self.output = BertSelfOutput(config) + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + self_outputs = self.self( + hidden_states, + attention_mask, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +class BertIntermediate(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states: paddle.Tensor) -> paddle.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +class BertOutput(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, epsilon=config.layer_norm_eps) + self.dropout = 
nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: paddle.Tensor, input_tensor: paddle.Tensor) -> paddle.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertLayer(nn.Layer): + def __init__(self, config): + super().__init__() + self.seq_len_dim = 1 + self.attention = BertAttention(config) + self.intermediate = BertIntermediate(config) + self.output = BertOutput(config) + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + # self attn + self_attention_outputs = self.attention( + hidden_states, + attention_mask, + output_attentions=output_attentions, + ) + attention_output = self_attention_outputs[0] + + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + # ffn + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + + outputs = (layer_output,) + outputs + + return outputs + + +class BertEncoder(nn.Layer): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.LayerList([BertLayer(config) for _ in range(config.num_hidden_layers)]) + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + output_hidden_states: Optional[bool] = False, + ) -> Tuple[paddle.Tensor]: + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + + for layer_module in self.layer: + # add hidden_states + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_outputs = layer_module( + hidden_states, + attention_mask, + output_attentions, + ) + hidden_states = layer_outputs[0] + + # add self attn + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + return tuple( + v + for v in [ + hidden_states, + all_hidden_states, + all_self_attentions, + ] + if v is not None + ) + + +class BertPooler(nn.Layer): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = ACT2FN[config.pool_act] + + def forward(self, hidden_states: paddle.Tensor) -> paddle.Tensor: + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class BertPreTrainedModel(nn.Layer): + def _init_weights(self, module): + """Initialize the weights""" + normal_init = nn.initializer.Normal(mean=0.0, std=self.config.initializer_range) + zero_init = nn.initializer.Constant(0.0) + one_init = nn.initializer.Constant(1.0) + if isinstance(module, nn.Linear): + normal_init(module.weight) + if module.bias is not None: + zero_init(module.bias) + elif isinstance(module, nn.Embedding): + normal_init(module.weight) + if module._padding_idx is not None: + with paddle.no_grad(): + module.weight[module._padding_idx] = 0 + elif isinstance(module, nn.LayerNorm): + zero_init(module.bias) + one_init(module.weight) + + +class BertModel(BertPreTrainedModel): + def __init__(self, config, add_pooling_layer=True): + super().__init__() + 
self.config = config + self.embeddings = BertEmbeddings(config) + self.encoder = BertEncoder(config) + + self.pooler = BertPooler(config) if add_pooling_layer else None + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + ) -> Tuple[paddle.Tensor]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + + if token_type_ids is None: + token_type_ids = paddle.zeros(input_ids.shape, dtype=paddle.int64) + + if attention_mask is not None: + attention_mask = (1.0 - attention_mask[:, :, None, None]) * NEG_INF + + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids, + ) + encoder_outputs = self.encoder( + embedding_output, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + ) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) if self.pooler is not None else None + + return (sequence_output, pooled_output) + encoder_outputs[1:] + + +class BertForSequenceClassification(BertPreTrainedModel): + def __init__(self, config): + super().__init__() + self.num_labels = config.num_labels + self.config = config + + self.bert = BertModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + def forward( + self, + input_ids: Optional[paddle.Tensor] = None, + attention_mask: Optional[paddle.Tensor] = None, + token_type_ids: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + ) -> Tuple[paddle.Tensor]: + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + ) + + pooled_output = outputs[1] + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + output = (logits,) + outputs[2:] + return output diff --git a/examples/torch_migration/pipeline/models/pt_bert.py b/examples/torch_migration/pipeline/models/pt_bert.py new file mode 100644 index 0000000000000000000000000000000000000000..fab4fd9dd5182040d67c79def30d5e0ce1381d8e --- /dev/null +++ b/examples/torch_migration/pipeline/models/pt_bert.py @@ -0,0 +1,420 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""PyTorch BERT model.""" + +import math +from typing import Optional, Tuple + +import torch +import torch.nn as nn +import torch.nn.functional as F + +ACT2FN = { + "relu": F.relu, + "gelu": F.gelu, + "tanh": F.tanh, + "sigmoid": F.sigmoid, +} +NEG_INF = -1e4 + + +class BertConfig: + def __init__( + self, + vocab_size: int = 30522, + hidden_size: int = 768, + num_hidden_layers: int = 12, + num_attention_heads: int = 12, + intermediate_size: int = 3072, + hidden_act: str = "gelu", + hidden_dropout_prob: float = 0.1, + attention_probs_dropout_prob: float = 0.1, + max_position_embeddings: int = 512, + type_vocab_size: int = 2, + initializer_range: float = 0.02, + pad_token_id: int = 0, + pool_act: str = "tanh", + layer_norm_eps: float = 1e-12, + output_attentions: bool = False, + output_hidden_states: bool = False, + num_labels=2, + **kwargs + ): + self.pad_token_id = pad_token_id + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.initializer_range = initializer_range + self.pool_act = pool_act + self.layer_norm_eps = layer_norm_eps + self.output_attentions = output_attentions + self.output_hidden_states = output_hidden_states + self.num_labels = num_labels + + +class BertEmbeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings.""" + + def __init__(self, config): + super().__init__() + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + ) -> torch.Tensor: + input_shape = input_ids.size() + seq_length = input_ids.shape[1] + + if position_ids is None: + position_ids = self.position_ids[:, :seq_length] + + if token_type_ids is None: + token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device) + + inputs_embeds = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + position_embeddings = self.position_embeddings(position_ids) + embeddings = inputs_embeds + token_type_embeddings + position_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class BertSelfAttention(nn.Module): + def __init__(self, config): + super().__init__() + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = config.hidden_size // config.num_attention_heads + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + 
self.value = nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + + def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor: + new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + + # compute q,k,v + query_layer = self.transpose_for_scores(self.query(hidden_states)) + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + if attention_mask is not None: + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = F.softmax(attention_scores, dim=-1) + attention_probs = self.dropout(attention_probs) + + context_layer = torch.matmul(attention_probs, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + return outputs + + +class BertSelfOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertAttention(nn.Module): + def __init__(self, config): + super().__init__() + self.self = BertSelfAttention(config) + self.output = BertSelfOutput(config) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + self_outputs = self.self( + hidden_states, + attention_mask, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +class BertIntermediate(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +class BertOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = 
nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BertLayer(nn.Module): + def __init__(self, config): + super().__init__() + self.seq_len_dim = 1 + self.attention = BertAttention(config) + self.intermediate = BertIntermediate(config) + self.output = BertOutput(config) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + # self attn + self_attention_outputs = self.attention( + hidden_states, + attention_mask, + output_attentions=output_attentions, + ) + attention_output = self_attention_outputs[0] + + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + # ffn + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + + outputs = (layer_output,) + outputs + + return outputs + + +class BertEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)]) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + output_hidden_states: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + + for layer_module in self.layer: + # add hidden_states + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_outputs = layer_module( + hidden_states, + attention_mask, + output_attentions, + ) + hidden_states = layer_outputs[0] + + # add self attn + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + return tuple( + v + for v in [ + hidden_states, + all_hidden_states, + all_self_attentions, + ] + if v is not None + ) + + +class BertPooler(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = ACT2FN[config.pool_act] + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class BertPreTrainedModel(nn.Module): + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +class BertModel(BertPreTrainedModel): + def __init__(self, 
config, add_pooling_layer=True): + super().__init__() + self.config = config + self.embeddings = BertEmbeddings(config) + self.encoder = BertEncoder(config) + + self.pooler = BertPooler(config) if add_pooling_layer else None + + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + ) -> Tuple[torch.Tensor]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + + device = input_ids.device + + if token_type_ids is None: + token_type_ids = torch.zeros(input_ids.shape, dtype=torch.long, device=device) + + if attention_mask is not None: + attention_mask = (1.0 - attention_mask[:, :, None, None]) * NEG_INF + + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids, + ) + encoder_outputs = self.encoder( + embedding_output, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + ) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) if self.pooler is not None else None + + return (sequence_output, pooled_output) + encoder_outputs[1:] + + +class BertForSequenceClassification(BertPreTrainedModel): + def __init__(self, config): + super().__init__() + self.num_labels = config.num_labels + self.config = config + + self.bert = BertModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + ) -> Tuple[torch.Tensor]: + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + ) + + pooled_output = outputs[1] + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + output = (logits,) + outputs[2:] + return output diff --git a/examples/torch_migration/pipeline/reprod_log_demo/check_log_diff.py b/examples/torch_migration/pipeline/reprod_log_demo/check_log_diff.py new file mode 100644 index 0000000000000000000000000000000000000000..40ebdf029afe3167cb3447dec5055aa5a7e45547 --- /dev/null +++ b/examples/torch_migration/pipeline/reprod_log_demo/check_log_diff.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
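+# Compare the two reprod_log result files written by write_log.py and save a per-key
+# mean-absolute-difference report (pass threshold 1e-6) to ./diff.txt.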
+ +from reprod_log import ReprodDiffHelper + +if __name__ == "__main__": + diff_helper = ReprodDiffHelper() + + info1 = diff_helper.load_info("./result_1.npy") + info2 = diff_helper.load_info("./result_2.npy") + + diff_helper.compare_info(info1, info2) + + diff_helper.report(diff_method="mean", diff_threshold=1e-6, path="./diff.txt") diff --git a/examples/torch_migration/pipeline/reprod_log_demo/write_log.py b/examples/torch_migration/pipeline/reprod_log_demo/write_log.py new file mode 100644 index 0000000000000000000000000000000000000000..b2985e3db724447a7a45db29b8667fa641a5ee2e --- /dev/null +++ b/examples/torch_migration/pipeline/reprod_log_demo/write_log.py @@ -0,0 +1,31 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from reprod_log import ReprodLogger + +if __name__ == "__main__": + reprod_log_1 = ReprodLogger() + reprod_log_2 = ReprodLogger() + + data_1 = np.random.rand(4, 64, 768).astype(np.float32) + data_2 = np.random.rand(4, 64, 768).astype(np.float32) + + reprod_log_1.add("demo_test_1", data_1) + reprod_log_1.add("demo_test_2", data_1) + reprod_log_1.save("result_1.npy") + + reprod_log_2.add("demo_test_1", data_1) + reprod_log_2.add("demo_test_2", data_2) + reprod_log_2.save("result_2.npy") diff --git a/examples/torch_migration/pipeline/weights/torch2paddle.py b/examples/torch_migration/pipeline/weights/torch2paddle.py new file mode 100644 index 0000000000000000000000000000000000000000..74511fea26e95e800c24f8118fc9c157ade56c26 --- /dev/null +++ b/examples/torch_migration/pipeline/weights/torch2paddle.py @@ -0,0 +1,115 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
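+# Convert a PyTorch BERT checkpoint (e.g. ./torch_weight.bin produced by torch_bert_weight.py)
+# into a Paddle state dict; test_forward() cross-checks the outputs of the HuggingFace and
+# PaddleNLP masked-LM models on the same random input.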
+ +from collections import OrderedDict + +import numpy as np +import paddle +import torch +from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM +from transformers import BertForMaskedLM as PTBertForMaskedLM + + +def convert_pytorch_checkpoint_to_paddle( + pytorch_checkpoint_path="pytorch_model.bin", + paddle_dump_path="model_state.pdparams", + version="old", +): + hf_to_paddle = { + "embeddings.LayerNorm": "embeddings.layer_norm", + "encoder.layer": "encoder.layers", + "attention.self.query": "self_attn.q_proj", + "attention.self.key": "self_attn.k_proj", + "attention.self.value": "self_attn.v_proj", + "attention.output.dense": "self_attn.out_proj", + "intermediate.dense": "linear1", + "output.dense": "linear2", + "attention.output.LayerNorm": "norm1", + "output.LayerNorm": "norm2", + "predictions.decoder.": "predictions.decoder_", + "predictions.transform.dense": "predictions.transform", + "predictions.transform.LayerNorm": "predictions.layer_norm", + } + do_not_transpose = [] + if version == "old": + hf_to_paddle.update( + { + "predictions.bias": "predictions.decoder_bias", + ".gamma": ".weight", + ".beta": ".bias", + } + ) + do_not_transpose = do_not_transpose + ["predictions.decoder.weight"] + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + is_transpose = False + if k[-7:] == ".weight": + # embeddings.weight and LayerNorm.weight do not transpose + if all(d not in k for d in do_not_transpose): + if ".embeddings." not in k and ".LayerNorm." not in k: + if v.ndim == 2: + if "embeddings" not in k: + v = v.transpose(0, 1) + is_transpose = True + is_transpose = False + oldk = k + # for hf_name, pd_name in hf_to_paddle.items(): + # k = k.replace(hf_name, pd_name) + + # add prefix `bert.` + if "bert." not in k and "cls." 
not in k and "classifier" not in k: + k = k + + print(f"Converting: {oldk} => {k} | is_transpose {is_transpose}") + paddle_state_dict[k] = v.data.numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +def compare(out_torch, out_paddle): + out_torch = out_torch.detach().numpy() + out_paddle = out_paddle.detach().numpy() + assert out_torch.shape == out_paddle.shape + abs_dif = np.abs(out_torch - out_paddle) + mean_dif = np.mean(abs_dif) + max_dif = np.max(abs_dif) + min_dif = np.min(abs_dif) + print("mean_dif:{}".format(mean_dif)) + print("max_dif:{}".format(max_dif)) + print("min_dif:{}".format(min_dif)) + + +def test_forward(): + paddle.set_device("cpu") + model_torch = PTBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_paddle = PDBertForMaskedLM.from_pretrained("./bert-base-uncased") + model_torch.eval() + model_paddle.eval() + np.random.seed(42) + x = np.random.randint(1, model_paddle.bert.config["vocab_size"], size=(4, 64)) + input_torch = torch.tensor(x, dtype=torch.int64) + out_torch = model_torch(input_torch)[0] + + input_paddle = paddle.to_tensor(x, dtype=paddle.int64) + out_paddle = model_paddle(input_paddle)[0] + + print("torch result shape:{}".format(out_torch.shape)) + print("paddle result shape:{}".format(out_paddle.shape)) + compare(out_torch, out_paddle) + + +if __name__ == "__main__": + convert_pytorch_checkpoint_to_paddle("./torch_weight.bin", "./paddle_weight.pdparams") diff --git a/examples/torch_migration/pipeline/weights/torch_bert_weight.py b/examples/torch_migration/pipeline/weights/torch_bert_weight.py new file mode 100644 index 0000000000000000000000000000000000000000..819229e156a57c3c15a588de43b18fde6496b97c --- /dev/null +++ b/examples/torch_migration/pipeline/weights/torch_bert_weight.py @@ -0,0 +1,21 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
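+# Download the HuggingFace bert-base-uncased weights and dump the raw state_dict to
+# ./torch_weight.bin so that torch2paddle.py can convert it to a Paddle checkpoint.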
+ +from transformers import BertModel +import torch + +hf_model = BertModel.from_pretrained("bert-base-uncased") +hf_model.eval() +PATH = "./torch_weight.bin" +torch.save(hf_model.state_dict(), PATH) diff --git a/examples/torch_migration/requirements.txt b/examples/torch_migration/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..4d3875d031562bd8503b533f5c2c63816400322e --- /dev/null +++ b/examples/torch_migration/requirements.txt @@ -0,0 +1,5 @@ +paddlepaddle-gpu==2.2.0 +torch>=1.7 +transformers +paddlenlp +git+https://github.com/WenmuZhou/reprod_log.git diff --git a/examples/word_embedding/README.md b/examples/word_embedding/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9d7eeac4b12c3d0325050131515a60fddb83381b --- /dev/null +++ b/examples/word_embedding/README.md @@ -0,0 +1,106 @@ +# Word Embedding with PaddleNLP + +## 简介 + +PaddleNLP已预置多个公开的预训练Embedding,用户可以通过使用`paddlenlp.embeddings.TokenEmbedding`接口加载预训练Embedding,从而提升训练效果。以下通过基于开源情感倾向分类数据集ChnSentiCorp的文本分类训练例子展示`paddlenlp.embeddings.TokenEmbedding`对训练提升的效果。更多的`paddlenlp.embeddings.TokenEmbedding`用法,请参考[TokenEmbedding 接口使用指南](../../docs/model_zoo/embeddings.md) 。 + + +## 快速开始 + +### 环境依赖 + +- visualdl + +安装命令:`pip install visualdl` + + +### 启动训练 + +我们以中文情感分类公开数据集ChnSentiCorp为示例数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在验证集(dev.tsv)验证。训练时会自动下载词表dict.txt,用于对数据集进行切分,构造数据样本。 + +启动训练: + +```shell +# 使用paddlenlp.embeddings.TokenEmbedding +python train.py --device='gpu' \ + --lr=5e-4 \ + --batch_size=64 \ + --epochs=20 \ + --use_token_embedding=True \ + --vdl_dir='./vdl_dir' + +# 使用paddle.nn.Embedding +python train.py --device='gpu' \ + --lr=1e-4 \ + --batch_size=64 \ + --epochs=20 \ + --use_token_embedding=False \ + --vdl_dir='./vdl_dir' +``` + +以上参数表示: + +* `device`: 选择训练设备,目前可选'gpu', 'cpu', 'xpu'。 默认为`gpu`。 +* `lr`: 学习率, 默认为5e-4。 +* `batch_size`: 运行一个batch大小,默认为64。 +* `epochs`: 训练轮次,默认为5。 +* `use_token_embedding`: 是否使用`paddlenlp.embeddings.TokenEmbedding`,默认为True。 +* `vdl_dir`: VisualDL日志目录。训练过程中的VisualDL信息会在该目录下保存。默认为`./vdl_dir` + +该脚本还提供以下参数: + +* `save_dir`: 模型保存目录。默认值为"./checkpoints/"。 +* `init_from_ckpt`: 恢复模型训练的断点路径。默认值为None,表示不恢复训练。 +* `embedding_name`: 预训练Embedding名称,默认为`w2v.baidu_encyclopedia.target.word-word.dim300`. 支持的预训练Embedding可参考[Embedding 模型汇总](../../docs/model_zoo/embeddings.md)。 + +**注意:** + +程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型在指定的`save_dir`中。训练过程中会实时保存每个epoch的模型参数,并以当前epoch值命名。如第2个Epochs,模型参数会被保存为`./checkpoints/2.pdparams`,优化器状态保存为`./checkpoints/2.pdopt`。 + +如: +```text +checkpoints/ +├── 0.pdopt +├── 0.pdparams +├── 1.pdopt +├── 1.pdparams +├── ... 
+└── final.pdparams +``` + +如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如果用户想热启第10个Epoch保存的模型,则设置 `--init_from_ckpt=./checkpoints/10`即可,程序会自动加载模型参数`./checkpoints/10.pdparams`,也会自动加载优化器状态`./checkpoints/10.pdopt`。 + + +### 启动VisualDL + +推荐使用VisualDL查看实验对比。以下为VisualDL的启动命令,其中logdir参数指定的目录需要与启动训练时指定的`vdl_dir`相同。(更多VisualDL的用法,可参考[VisualDL使用指南](https://github.com/PaddlePaddle/VisualDL#2-launch-panel)) + +``` +visualdl --logdir ./vdl_dir --port 8888 --host 0.0.0.0 +``` + +### 训练效果对比 + +在Chrome浏览器输入 `ip:8888` (ip为启动VisualDL机器的IP)。 + +以下为示例实验效果对比图,蓝色是使用`paddlenlp.embeddings.TokenEmbedding`进行的实验,绿色是使用没有加载预训练模型的Embedding进行的实验。 +可以看到,使用`paddlenlp.embeddings.TokenEmbedding`的训练,其验证acc变化趋势上升,并收敛于0.90左右,收敛后相对平稳,不容易过拟合。 +而没有使用`paddlenlp.embeddings.TokenEmbedding`的训练,其验证acc变化趋势向下,并收敛于0.86左右。从示例实验可以观察到,使用`paddlenlp.embedding.TokenEmbedding`能提升训练效果。 + +Eval Acc: + +![eval acc](https://user-images.githubusercontent.com/16698950/102076935-79ac5480-3e43-11eb-81f8-6e509c394fbf.png) + +| | Best Acc | +| ------------------------------------| ------------- | +| paddle.nn.Embedding | 0.8965 | +| paddelnlp.embeddings.TokenEmbedding | 0.9082 | + +## 致谢 +- 感谢 [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors)提供Word2Vec中文Embedding预训练模型,[GloVe Project](https://nlp.stanford.edu/projects/glove)提供的GloVe英文Embedding预训练模型,[FastText Project](https://fasttext.cc/docs/en/english-vectors.html)提供的fasttext英文预训练模型。 + +## 参考论文 +- Li, Shen, et al. "Analogical reasoning on chinese morphological and semantic relations." arXiv preprint arXiv:1805.06504 (2018). +- Qiu, Yuanyuan, et al. "Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings." Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. +- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. +- T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin. Advances in Pre-Training Distributed Word Representations diff --git a/examples/word_embedding/data.py b/examples/word_embedding/data.py new file mode 100644 index 0000000000000000000000000000000000000000..23b8b61abfd349ab22e594901961e51b3cb69380 --- /dev/null +++ b/examples/word_embedding/data.py @@ -0,0 +1,115 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
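+# Data helpers for the word embedding example: vocabulary loading, jieba / JiebaTokenizer
+# based tokenization, padding and prediction-data preprocessing used by train.py.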
+import jieba +import numpy as np + +from paddlenlp.data import JiebaTokenizer + +tokenizer = jieba + + +def set_tokenizer(vocab): + global tokenizer + if vocab is not None: + tokenizer = JiebaTokenizer(vocab=vocab) + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = {} + with open(vocab_file, "r", encoding="utf-8") as reader: + tokens = reader.readlines() + for index, token in enumerate(tokens): + token = token.rstrip("\n").split("\t")[0] + vocab[token] = index + return vocab + + +def convert_tokens_to_ids(tokens, vocab): + """Converts a token id (or a sequence of id) in a token string + (or a sequence of tokens), using the vocabulary. + """ + + ids = [] + unk_id = vocab.get("[UNK]", None) + for token in tokens: + wid = vocab.get(token, unk_id) + if wid: + ids.append(wid) + return ids + + +def convert_example(example, vocab, unk_token_id=1, is_test=False): + """ + Builds model inputs from a sequence for sequence classification tasks. + It use `jieba.cut` to tokenize text. + Args: + example(obj:`list[str]`): List of input data, containing text and label if it have label. + vocab(obj:`dict`): The vocabulary. + unk_token_id(obj:`int`, defaults to 1): The unknown token id. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + Returns: + input_ids(obj:`list[int]`): The list of token ids.s + valid_length(obj:`int`): The input sequence valid length. + label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. + """ + + input_ids = [] + for token in tokenizer.cut(example["text"]): + token_id = vocab.get(token, unk_token_id) + input_ids.append(token_id) + valid_length = np.array([len(input_ids)]) + input_ids = np.array(input_ids, dtype="int32") + if not is_test: + label = np.array(example["label"], dtype="int64") + return input_ids, valid_length, label + else: + return input_ids, valid_length + + +def pad_texts_to_max_seq_len(texts, max_seq_len, pad_token_id=0): + """ + Padded the texts to the max sequence length if the length of text is lower than it. + Unless it truncates the text. + Args: + texts(obj:`list`): Texts which contains a sequence of word ids. + max_seq_len(obj:`int`): Max sequence length. + pad_token_id(obj:`int`, optional, defaults to 0) : The pad token index. + """ + for index, text in enumerate(texts): + seq_len = len(text) + if seq_len < max_seq_len: + padded_tokens = [pad_token_id for _ in range(max_seq_len - seq_len)] + new_text = text + padded_tokens + texts[index] = new_text + elif seq_len > max_seq_len: + new_text = text[:max_seq_len] + texts[index] = new_text + + +def preprocess_prediction_data(data, vocab): + """ + It process the prediction data as the format used as training. + Args: + data (obj:`List[str]`): The prediction data whose each element is a tokenized text. + Returns: + examples (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. + A Example object contains `text`(word_ids) and `seq_len`(sequence length). + """ + examples = [] + for text in data: + tokens = " ".join(tokenizer.cut(text)).split(" ") + ids = convert_tokens_to_ids(tokens, vocab) + examples.append([ids, len(ids)]) + return examples diff --git a/examples/word_embedding/train.py b/examples/word_embedding/train.py new file mode 100644 index 0000000000000000000000000000000000000000..bd997f6eea9a1132c349c0cfa796883efeb5dd7d --- /dev/null +++ b/examples/word_embedding/train.py @@ -0,0 +1,189 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +import os.path as osp +from functools import partial + +import data +import paddle +import paddle.nn as nn + +import paddlenlp +from paddlenlp.data import Pad, Stack, Tuple, Vocab +from paddlenlp.datasets import load_dataset +from paddlenlp.embeddings import TokenEmbedding +from paddlenlp.utils.downloader import get_path_from_url + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--epochs", type=int, default=5, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", help="Select cpu, gpu, xpu devices to train model.") +parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.") +parser.add_argument("--save_dir", type=str, default='./checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--use_token_embedding", type=eval, default=True, help="Whether use pretrained embedding") +parser.add_argument("--embedding_name", type=str, default="w2v.baidu_encyclopedia.target.word-word.dim300", help="The name of pretrained embedding") +parser.add_argument("--vdl_dir", type=str, default="vdl_dir/", help="VisualDL log directory") +args = parser.parse_args() +# yapf: enable + +WORD_DICT_URL = "https://bj.bcebos.com/paddlenlp/data/dict.txt" + + +def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, pad_token_id=0): + """ + Creats dataloader. + Args: + dataset(obj:`paddle.io.Dataset`): Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. + pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + if trans_fn: + dataset = dataset.map(trans_fn, lazy=True) + + shuffle = True if mode == "train" else False + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=vocab.get("[PAD]", 0)), # input_ids + Stack(dtype="int32"), # seq len + Stack(dtype="int64"), # label + ): [data for data in fn(samples)] + + dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True, collate_fn=batchify_fn) + return dataloader + + +class BoWModel(nn.Layer): + """ + This class implements the Bag of Words Classification Network model to classify texts. + At a high level, the model starts by embedding the tokens and running them through + a word embedding. Then, we encode these representations with a `BoWEncoder`. 
+ Lastly, we take the output of the encoder to create a final representation, + which is passed through some feed-forward layers to output a logits (`output_layer`). + Args: + vocab_size (obj:`int`): The vocabulary size. + emb_dim (obj:`int`, optional, defaults to 300): The embedding dimension. + hidden_size (obj:`int`, optional, defaults to 128): The first full-connected layer hidden size. + fc_hidden_size (obj:`int`, optional, defaults to 96): The second full-connected layer hidden size. + num_classes (obj:`int`): All the labels that the data has. + """ + + def __init__( + self, + vocab_size, + num_classes, + vocab_path, + emb_dim=300, + hidden_size=128, + fc_hidden_size=96, + use_token_embedding=True, + ): + super().__init__() + if use_token_embedding: + self.embedder = TokenEmbedding(args.embedding_name, extended_vocab_path=vocab_path) + emb_dim = self.embedder.embedding_dim + else: + padding_idx = vocab_size - 1 + self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) + self.bow_encoder = paddlenlp.seq2vec.BoWEncoder(emb_dim) + self.fc1 = nn.Linear(self.bow_encoder.get_output_dim(), hidden_size) + self.fc2 = nn.Linear(hidden_size, fc_hidden_size) + self.dropout = nn.Dropout(p=0.3, axis=1) + self.output_layer = nn.Linear(fc_hidden_size, num_classes) + + def forward(self, text, seq_len=None): + # Shape: (batch_size, num_tokens, embedding_dim) + embedded_text = self.embedder(text) + + # Shape: (batch_size, embedding_dim) + summed = self.bow_encoder(embedded_text) + summed = self.dropout(summed) + encoded_text = paddle.tanh(summed) + + # Shape: (batch_size, hidden_size) + fc1_out = paddle.tanh(self.fc1(encoded_text)) + # Shape: (batch_size, fc_hidden_size) + fc2_out = paddle.tanh(self.fc2(fc1_out)) + # Shape: (batch_size, num_classes) + logits = self.output_layer(fc2_out) + return logits + + +if __name__ == "__main__": + assert args.device in ["cpu", "gpu", "xpu"], "Invalid device! Available device should be cpu, gpu, or xpu." + paddle.set_device(args.device) + + # Loads vocab. + vocab_path = "./dict.txt" + if not os.path.exists(vocab_path): + # download in current directory + get_path_from_url(WORD_DICT_URL, "./") + vocab = data.load_vocab(vocab_path) + + if "[PAD]" not in vocab: + vocab["[PAD]"] = len(vocab) + # Loads dataset. + train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"]) + + # Constructs the network. + model = BoWModel( + vocab_size=len(vocab), + num_classes=len(train_ds.label_list), + vocab_path=vocab_path, + use_token_embedding=args.use_token_embedding, + ) + if args.use_token_embedding: + vocab = model.embedder.vocab + data.set_tokenizer(vocab) + vocab = vocab.token_to_idx + else: + v = Vocab.from_dict(vocab, unk_token="[UNK]", pad_token="[PAD]") + data.set_tokenizer(v) + model = paddle.Model(model) + + # Reads data and generates mini-batches. + trans_fn = partial(data.convert_example, vocab=vocab, unk_token_id=vocab["[UNK]"], is_test=False) + train_loader = create_dataloader( + train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", pad_token_id=vocab["[PAD]"] + ) + dev_loader = create_dataloader( + dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", pad_token_id=vocab["[PAD]"] + ) + + optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) + + # Defines loss and metric. + criterion = paddle.nn.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + model.prepare(optimizer, criterion, metric) + + # Loads pre-trained parameters. 
+ if args.init_from_ckpt: + model.load(args.init_from_ckpt) + print("Loaded checkpoint from %s" % args.init_from_ckpt) + + # Starts training and evaluating. + log_dir = "use_normal_embedding" + if args.use_token_embedding: + log_dir = "use_token_embedding" + log_dir = osp.join(args.vdl_dir, log_dir) + callback = paddle.callbacks.VisualDL(log_dir=log_dir) + model.fit(train_loader, dev_loader, epochs=args.epochs, save_dir=args.save_dir, callbacks=callback) diff --git a/fast_generation/README.md b/fast_generation/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fe699a9c7271e55ad1be9960560a1510ec7806ad --- /dev/null +++ b/fast_generation/README.md @@ -0,0 +1,305 @@ +# FastGeneration + +FastGeneration是PaddleNLP v2.2版本加入的文本生成高性能加速功能,其支持GPT、OPT、BART、UnifiedTransformer等多种NLP生成类预训练模型,并且支持多种解码策略,可以用于机器翻译、文本续写、文本摘要、对话生成等多种NLG任务的GPU场景预测加速。 + +功能底层依托于[NV FasterTransformer](https://github.com/NVIDIA/FasterTransformer),该库针对标准的Transformer和GPT模型、beam search和sampling解码策略进行了性能优化。PaddleNLP FastGeneration在其之上进行了扩展,实现了更多模型和生成策略的优化支持,并将功能入口封装于`model.generate`函数。功能的开启和关闭通过传入`use_fast`参数进行控制(默认为关闭状态)。通过调用generate函数,用户可以简单的使用模型高性能推理功能。下图展示了FastGeneration的启动流程: + + +
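+Complementing the startup flow described above, the following is a minimal, self-contained sketch of this entry point (for illustration only: the model name, prompt and generation settings here are placeholder choices; the canonical example is `samples/gpt_sample.py` shown later in this README):
+```python
+import paddle
+from paddlenlp.transformers import GPTChineseTokenizer, GPTLMHeadModel
+
+# Placeholder model name; any generation model covered by FastGeneration works the same way.
+model_name = "gpt-cpm-small-cn-distill"
+tokenizer = GPTChineseTokenizer.from_pretrained(model_name)
+model = GPTLMHeadModel.from_pretrained(model_name)
+model.eval()
+
+input_ids = paddle.to_tensor([tokenizer("花间一壶酒,独酌无相亲。举杯邀明月,")["input_ids"]], dtype="int64")
+
+# use_fast=True enables FastGeneration; the first call JIT-compiles the high-performance ops,
+# and unsupported argument combinations fall back to the regular generate() automatically.
+outputs, _ = model.generate(
+    input_ids=input_ids, max_length=10, decode_strategy="greedy_search", use_fast=True)
+
+# outputs holds the generated token ids; they can be mapped back to text with
+# tokenizer.convert_ids_to_tokens().
+print(outputs)
+```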
+ +## Features + +- 全面支持生成式预训练模型。包括GPT、OPT、CodeGen、GPTJ、BART、mBART、UnifiedTransformer和UNIMO-text。 +- 支持大多数主流解码策略。包括Beam Search、Sampling、Greedy Search。以及Diverse Sibling Search、Length Penalty等子策略。 +- 解码速度快。最高可达非加速版generate函数的**18倍**。**并支持FP16混合精度计算**。 +- 易用性强。功能的入口为`model.generate`，与非加速版生成api的使用方法相同，当满足加速条件时使用jit即时编译高性能算子并用于生成，不满足则自动切换回非加速版生成api。 +- GPT、UnifiedTransformer和UNIMO-text模型支持高性能并行推理，在具备MPI和NCCL的环境中一行代码即可开启使用，允许通过多张小显存容量的 GPU 使用百亿大模型，预测速度较单卡也进一步提升。百亿模型四卡并行高性能推理速度达单卡高性能推理速度2+倍。 + +### Inference Model Support +下表为PaddleNLP FastGeneration对预训练模型和解码策略的支持情况(GPU)。 + +| Model Name | GPT2 | OPT | CodeGen| GPTJ| BART | mBART | UnifiedTransformer | +|------------------------|---------|---------| ---------| ---------|-----------------|-----------------|--------------------| +| Model Structure | Decoder | Decoder |Decoder|Decoder| Encoder-Decoder | Encoder-Decoder | Prefix-LM | +| Beam Search | ❌ | ❌ |❌|❌| ✅ | ✅ | ✅ | +| Top-K Sampling | ✅ | ✅ |✅|✅| ✅ | ✅ | ✅ | +| Top-P Sampling | ✅ | ✅ |✅|✅| ✅ | ✅ | ✅ | +| Diverse Sibling Search | ❌ | ❌ |❌|❌| ✅ | ✅ | ✅ | +| Forced Decoding | ❌ | ❌ |❌|❌| ❌ | ✅ | ❌ | +| Length Penalty | ❌ | ❌ |❌|❌| ✅ | ✅ | ✅ | +| Temperature | ✅ | ✅ |✅|✅| ✅ | ✅ | ✅ | +| Repetition Penalty | ✅ | ✅ |✅|✅| ❌ | ❌ | ❌ | + +## Performance + +FastGeneration的高性能解码相比原版generate方法加速明显，并且与竞品相比也有极大的速度优势。以下为性能对比图: + +- **batch_size = 4, out_seq_len = 32** +- Device: Tesla V100-SXM2-16GB +- CUDA version 11.2 +- cudnn version 8 +- torch version 1.10.0+cu113 +- transformers version 4.12.5 + +### **BART** (bart-base, batch_size=4, max_length=32) + +
+ +### **GPT** (gpt2, batch_size=4, max_length=32) + +
+ +### **OPT** (opt, batch_size=4, max_length=32) + +
+ +### **CodeGen:** +* 环境和超参 + - Platform: Tesla V100-SXM2-32GB + - CUDA 10.1 + - CUDNN 7.6.5 + - PaddlePaddle-gpu 2.3.1.post101 + - transformers==4.21.1 + - torch==1.11.0 + - Batch Size: 1 + - Input Length: 60 + - Output Length: 20 +
+ +- Platform: A100-40G +
+ +### **Pegasus** +* 环境和超参 + - Platform: Tesla V100-SXM2-32GB + - CUDA 10.1 + - CUDNN 7.6.5 + - PaddlePaddle-gpu 2.3.2.post101 + - transformers==4.21.1 + - torch==1.11.0 + - Batch Size: 4 + - Input Length: 60 + - Output Length: 20 + - Decode_strategy: beam search + - num_beams: 4 +
+ +更详细的性能数据请参见[这里](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/fast_generation/perf) + +## Quick Start + +### 高性能推理 + +为体现FastGeneration的易用性,我们在`samples`文件夹中内置了几个典型任务示例,下面以基于GPT模型的中文文本续写任务为例: + +```sh +python samples/gpt_sample.py +``` + +如果是第一次执行,PaddleNLP会启动即时编译([JIT Compile](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/new_op/new_custom_op_cn.html#jit-compile))自动编译高性能解码算子。 + +```sh +... +2021-11-17 13:42:56,771 - INFO - execute command: cd /10.2/hub/PaddleNLP/paddlenlp/ops/extenstions && /usr/local/bin/python FasterTransformer_setup.py build +INFO:utils.cpp_extension:execute command: cd /10.2/hub/PaddleNLP/paddlenlp/ops/extenstions && /usr/local/bin/python FasterTransformer_setup.py build +grep: warning: GREP_OPTIONS is deprecated; please use an alias or script +running build +running build_ext +-- The C compiler identification is GNU 8.2.0 +-- The CXX compiler identification is GNU 8.2.0 +-- The CUDA compiler identification is NVIDIA 10.2.89 +-- Check for working C compiler: /usr/bin/cc +-- Check for working C compiler: /usr/bin/cc -- works +-- Detecting C compiler ABI info +-- Detecting C compiler ABI info - done +-- Detecting C compile features +-- Detecting C compile features - done +-- Check for working CXX compiler: /usr +... +``` + +编译过程通常会花费几分钟的时间编译只会进行一次,之后再次使用高性能解码就不需要重新编译了,编译完成后会继续运行,可以看到生成的结果如下: + +``` +Model input: 花间一壶酒,独酌无相亲。举杯邀明月, +Result: 对影成三人。 +``` + +打开示例代码 `samples/gpt_sample.py` ,我们可以看到如下代码: + +``` +... +model = GPTLMHeadModel.from_pretrained(model_name) +... +outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search', + use_fast=True) +... +``` + +可以看到,FastGeneration的使用方法与 `model.generate()` 相同,只需传入输入tensor和解码相关参数即可,使用非常简便。如果要使用非加速版的 `model.generate()` 方法,只需传入 `use_fast=False` 即可,示例如下: + +``` +... +outputs, _ = model.generate( + input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search', use_fast=False) +... +``` + +**NOTE:** 需要注意的是,如果传入 `model.generate()` 的参数不满足高性能版本的要求。程序会做出提示并自动切换为非加速版本,例如我们在上面的例子中传入 `min_length=1` ,会得到如下提示: + +``` +... +[2021-11-17 14:21:06,132] [ WARNING] - 'min_length != 0' is not supported yet in the fast version +[2021-11-17 14:21:06,132] [ WARNING] - FastGeneration is not available, and the original version would be used instead. +... 
+``` + +关于该函数的详细介绍可以参考API文档[generate](https://paddlenlp.readthedocs.io/zh/latest/source/paddlenlp.transformers.generation_utils.html)和**Aistudio教程[文本生成任务实战:如何使用PaddleNLP实现各种解码策略](https://aistudio.baidu.com/aistudio/projectdetail/3243711?contributionType=1)。**`samples`文件夹中的其他示例的使用方法相同。 + +### 并行推理 + +FastGeneration对GPT、UnifiedTransformer和UNIMO-text模型在高性能推理的基础上还实现了模型并行功能,其中GPT支持Tensor Parallel和Layer Parallel(Pipeline Parallel)两种并行策略的组合,UnifiedTransformer和UNIMO-text支持Tensor Parallel。关于这两种并行策略的详细介绍请参考[Megatron论文](https://arxiv.org/pdf/2104.04473.pdf)。 + +并行推理当前依赖MPI([MPICH](https://www.mpich.org)、[OpenMPI](https://www.open-mpi.org)均可)和[NCCL](https://developer.nvidia.com/nccl),如需使用还请先安装依赖。在使用时,相比上面的单卡高性能加速代码中也只增加了`from_pretrained`创建加载模型之前加上`enable_ft_para()`一行。 +#### GPT 并行推理 + +GPT高性能并行推理的完整使用示例已在`gpt_mp_sample.py`中提供,按照如下方式启动即可: + +```sh +mpirun -n 4 python gpt_mp_sample.py --tensor_para_size 4 --layer_para_size 1 +``` + +其中`-n 4`指明使用的进程和GPU数,`tensor_para_size`和`tensor_para_size`分别指明Tensor Parallel和Layer Parallel各自使用的GPU数,均设置为1则进行单卡预测。另外加上`--use_fp16`以使用FP16,加上`--profile`可以进行相应设置的性能测试。其他生成相关的参数设置释义如下: +- `model_name` 指定使用的GPT模型,默认为[`gpt-cpm-larg-cn`](https://github.com/TsinghuaAI/CPM-1-Generate)。 +- `max_length` 指定生成的最大长度,默认为50。 +- `topk` 用于Top-K采样策略,采样时将只从概率最高K个token中采样,默认为1,即greedy search。 +- `topp` 用于Top-P采样策略,采样时将只从概率最高且累加概率不超过该值的token中采样,默认为1.0。 +- `temperature` 用于调整预测概率分布,默认为1.0,即保持模型原有的预测概率。 + +使用`gpt-cpm-larg-cn`(2.6B)和默认设置,在V100上4卡Tensor Parallel较单卡高性能预测速度提升约40%。 + +#### PLATO-XL 并行推理 + +PLATO-XL百亿对话预训练模型(11B UnifiedTransformer模型)高性能并行推理的完整使用示例已在`plato_xl_sample.py`中提供(当前只支持Tensor Parallel),按照如下方式启动即可: + +```shell +mpirun -n 4 python plato_xl_sample.py +``` + +参数释义基本同上。在V100上4卡Tensor Parallel高性能预测为单卡高性能预测速度的2倍。 + +## Generate Examples + +除了以上示例之外,PaddleNLP的examples中大多使用了`model.generate`的示例都可以通过调整到合适的参数使用高性能推理。具体如下: + +- [examples/dialogue/unified_transformer](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/dialogue/unified_transformer) +- [model_zoo/gpt/fast_gpt](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/gpt/fast_gpt) +- [examples/text_generation/unimo-text](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_generation/unimo-text) +- [examples/text_summarization/bart](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/text_summarization/bart) + +下面我们以基于 `Unified Transformer` 的任务型对话为例展示一下FastGeneration的加速效果: + +打开以上链接中Unified Transformer对应的example,找到README中对应预测的脚本。稍作修改如下: + +```sh +export CUDA_VISIBLE_DEVICES=0 + python infer.py \ + --model_name_or_path=unified_transformer-12L-cn-luge \ + --output_path=./predict.txt \ + --logging_steps=10 \ + --seed=2021 \ + --max_seq_len=512 \ + --max_knowledge_len=256 \ + --batch_size=4 \ + --min_dec_len=1 \ + --max_dec_len=64 \ + --num_return_sequences=1 \ + --decode_strategy=sampling \ + --top_k=5 \ + --faster + --device=gpu +``` + +由于这里只是展示性能,我们直接在 `model_name_or_path` 填入PaddleNLP预训练模型名称 `unified_transformer-12L-cn-luge` 。 + +可以看到,由于该任务为对话任务,我们为了防止模型生成过多安全回复(如:哈哈哈、不错等),保证生成结果具有更多的随机性,我们选择TopK-sampling作为解码策略,并让k=5。 + +打开 `infer.py` ,可以看到我们传入的脚本参数大多都提供给了 `model.generate()` 方法: + +``` +output = model.generate( + input_ids=input_ids, + token_type_ids=token_type_ids, + position_ids=position_ids, + attention_mask=attention_mask, + seq_len=seq_len, + max_length=args.max_dec_len, + min_length=args.min_dec_len, + decode_strategy=args.decode_strategy, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + length_penalty=args.length_penalty, 
+ early_stopping=args.early_stopping, + num_return_sequences=args.num_return_sequences, + use_fp16_decoding=args.use_fp16_decoding, + use_fast=args.faster) +``` + +运行脚本,输出结果如下: + +```sh +step 10 - 1.695s/step +step 20 - 1.432s/step +step 30 - 1.435s/step +``` + +可以看到,非加速版 `generate()` 方法的预测速度为每个step耗时1.5秒左右。 + +下面我们在启动脚本中传入 `--faster` 参数,该参数会向 `generate()` 方法传入 `use_fast=True` ,启动加速模式。同时我们需要设置 `--min_dec_len=0` ,因为FastGeneration当前还不支持该参数。新的脚本启动参数如下: + +```sh +export CUDA_VISIBLE_DEVICES=0 + python infer.py \ + --model_name_or_path=unified_transformer-12L-cn-luge \ + --output_path=./predict.txt \ + --logging_steps=10 \ + --seed=2021 \ + --max_seq_len=512 \ + --max_knowledge_len=256 \ + --batch_size=4 \ + --min_dec_len=0 \ + --max_dec_len=64 \ + --num_return_sequences=1 \ + --decode_strategy=sampling \ + --top_k=5 \ + --device=gpu \ + --faster +``` + +再次运行脚本,输出结果如下(由于我们已经编译过高性能算子,所以这里不会重新编译): + +```sh +[2021-11-23 13:38:09,200] [ DEBUG] - skipping 'FastGeneration' extension (up-to-date) build +step 10 - 0.250s/step +step 20 - 0.156s/step +step 30 - 0.141s/step +``` + +可以看到,FastGeneration的预测速度为每个step耗时0.15秒左右,相比非加速版提速超过9倍。 diff --git a/fast_generation/perf/README.md b/fast_generation/perf/README.md new file mode 100644 index 0000000000000000000000000000000000000000..242bf765ec1187d9039f4bb67e4f4af21dc83443 --- /dev/null +++ b/fast_generation/perf/README.md @@ -0,0 +1,250 @@ +# FastGeneration Performance + +以下性能数据为非加速版generate方法和FastGeneration对比数据。 + +- **测试设备:** Tesla V100-SXM2-16GB +- **Batch Size:** 4 +- **Max Length:** 32 + +## 性能数据 +*** + +CUDA 10.1, cudnn 7, gcc 82 + +torch version 1.10.0+cu102, transformers version 4.12.5 + +**BART:** + +| Model Size | Decode Strategy| FastGeneration(FP32)
(ms) | FastGeneration(FP16)<br>(ms) | HF generate<br>(ms) | Speed Up Rate<br>(Faster32/HF) | Speed Up Rate<br>(Faster16/HF) | +|-----|----|---|---|---|---|---| +|num_layers = 6<br>num_attention_heads = 12<br>hidden_size = 768<br>
(bart-base)|top_k = 1|37.53|34.01|136.89|3.65|4.02 +| |top_k = 4 |39.33|34.98|146.89|3.73|4.2 | +| |top_k = 8 |42.35|34.77|136.80|3.23|3.93| +| |top_k = 16 |40.95|35.45|148.45|3.63|4.19| +| |top_p = 0.4 |45.83|33.32|184.36|4.02|5.53| +| |num_beams = 4|44.72|37.51|242.73|5.43|6.47| +| |num_beams = 8|61.56|40.27|273.93|4.45|6.8 | +| |num_beams = 16|82.05|46.68|433.51|5.28|9.29| +|num_layers = 12
num_attention_heads = 16<br>hidden_size = 1024<br>
(bart-large)|top_k = 1|55.03|45.44|199.27|3.62|4.39| +| |top_k = 4|70.12|56.81|220.96|3.15|3.89| +| |top_k = 8|69.96|57.73|201.06|2.87|3.48| +| |top_k = 16|69.16|59.62|223.73|3.23|3.75| +| |top_p = 0.4|73.49|61.43|275.86|3.75|4.49| +| |num_beams = 4|66.44|50.71|277.61|4.18|5.47| +| |num_beams = 8|135.30|85.75|314.78|2.33|3.67| +| |num_beams = 16|168.01|100.22|441.95|2.63|4.41| + +**GPT:** + +| Model Size | Decode Strategy| FastGeneration(FP32)
(ms) | FastGeneration(FP16)<br>(ms) | HF generate<br>(ms) | Speed Up Rate<br>(Faster32/HF) | Speed Up Rate<br>(Faster16/HF) | +|-----|----|---|---|---|---|---| +|num_layers = 12<br>num_attention_heads = 12<br>hidden_size = 768<br>
(gpt2)|top_k = 1|69.29|59.20|363.93|5.25|6.15| +| |top_k = 4|68.07|60.92|391.02|5.74|6.42| +| |top_k = 8|69.16|60.45|401.18|5.80|6.64| +| |top_k = 16|73.59|62.40|401.55|5.46|6.44| +| |top_p = 0.4|95.61|76.26|429.63|4.49|5.63| +|num_layers = 24
num_attention_heads = 16
hidden_size = 1024
(gpt2-medium)|top_k = 1|127.04|95.13|726.83|5.72|7.64| +| |top_k = 4|126.74|93.95|694.53|5.48|7.39| +| |top_k = 8|128.11|94.07|743.63|5.80|7.91| +| |top_k = 16|126.78|95.00|732.96|5.78|7.72| +| |top_p = 0.4|143.36|105.40|756.12|5.27|7.17| +|num_layers = 36
num_attention_heads = 20
hidden_size = 1280
(gpt2-large)|top_k = 1|236.80|200.37|1057.94|4.47|5.28|
+| |top_k = 4|236.69|201.95|1075.17|4.54|5.32|
+| |top_k = 8|237.04|202.00|1084.60|4.58|5.37|
+| |top_k = 16|235.01|201.79|1110.75|4.73|5.5|
+| |top_p = 0.4|270.31|205.84|1111.16|4.11|5.4|
+
+**OPT:**
+
+* 模型参数
+
+| Model Name | num_layers | num_attention_heads | hidden_size |
+|------------|------------|---------------------|-------------|
+| OPT-125m | 12 | 12 | 768 |
+| OPT-350M | 24 | 16 | 1024 |
+
+transformers: 4.20.1
+
+* 性能指标数据
+
+| Model | Decoding Strategy | Faster Generation(FP32)(ms) | Faster Generation(FP16)(ms) | HF Generation(ms) | Speed Up Rate(Faster32/HF) | Speed Up Rate(Faster16/HF) |
+|:--------:|:-------------------:|:-----------------------------:|:-----------------------------:|:-------------------:|:----------------------------:|:----------------------------:|
+| opt-125m | top_k=1 | 58.39 | 48.82 | 290.14 | 4.97 | 5.94 |
+| | top_k=4 | 58.45 | 49.05 | 283.55 | 4.85 | 5.78 |
+| | top_k=8 | 59.13 | 49.32 | 284.76 | 4.82 | 5.77 |
+| | top_k=16 | 60.15 | 49.54 | 299.87 | 4.99 | 6.05 |
+| | top_p=0.4 | 75.78 | 60.72 | 335.70 | 4.43 | 5.53 |
+| opt-350m | top_k=1 | 124.49 | 90.58 | 511.46 | 4.11 | 5.65 |
+| | top_k=4 | 125.60 | 90.96 | 528.42 | 4.21 | 5.81 |
+| | top_k=8 | 125.93 | 90.96 | 523.46 | 4.16 | 5.75 |
+| | top_k=16 | 126.25 | 91.58 | 524.79 | 4.16 | 5.73 |
+| | top_p=0.4 | 142.93 | 103.68 | 600.80 | 4.20 | 5.79 |
+
+***
+
+CUDA 11.2, cudnn 8, gcc 82
+
+torch version 1.10.0+cu113, transformers version 4.12.5
+
+**BART:**
+
+| Model Size | Decode Strategy| FastGeneration(FP32)
(ms) | FastGeneration(FP16)
(ms) | HF generate
(ms) | Speed Up Rate
(Faster32/HF) | Speed Up Rate
(Faster16/HF) | +|-----|----|---|---|---|---|---| +|num_layers = 6
num_attention_heads = 12
hidden_size = 768
(bart-base)|top_k = 1|31.1|27.4|139.46|4.48|5.09 +| |top_k = 4 |32.13|29.06|149.81|4.66|5.16| +| |top_k = 8 |31.7|28.36|154.3|4.87|5.44| +| |top_k = 16 |32.93|28.66|145.85|4.43|5.09| +| |top_p = 0.4 |33.35|29.01|173.18|5.19|5.97| +| |num_beams = 4|47.55|38.02|252.71|5.31|6.65| +| |num_beams = 8|52.19|41.39|282.3|5.41|6.82| +| |num_beams = 16|67.18|45.82|441.59|6.57|9.64| +|num_layers = 12
num_attention_heads = 16
hidden_size = 1024
(bart-large)|top_k = 1|45.8|37.43|173.08|3.78|4.62| +| |top_k = 4|51.11|48.28|246.27|4.82|5.1| +| |top_k = 8|61.61|50.67|246.19|4.0|4.86| +| |top_k = 16|63.81|48.33|272.93|4.28|5.65| +| |top_p = 0.4|63.0|50.05|288.76|4.58|5.77| +| |num_beams = 4|65.54|48.58|273.84|4.18|5.64| +| |num_beams = 8|75.68|52.59|340.86|4.5|6.48| +| |num_beams = 16|102.87|62.25|477.97|4.65|7.68| + +**GPT:** + +| Model Size | Decode Strategy| FastGeneration(FP32)
(ms) | FastGeneration(FP16)
(ms) | HF generate
(ms) | Speed Up Rate
(Faster32/HF) | Speed Up Rate
(Faster16/HF) | +|-----|----|---|---|---|---|---| +|num_layers = 12
num_attention_heads = 12
hidden_size = 768
(gpt2)|top_k = 1|50.84|40.37|399.58|7.86|9.9| +| |top_k = 4|50.38|38.81|419.55|8.33|10.81| +| |top_k = 8|51.23|36.78|411.7|8.04|11.19| +| |top_k = 16|51.03|38.76|408.36|8.0|10.54| +| |top_p = 0.4|68.55|48.04|489.45|7.14|10.19| +|num_layers = 24
num_attention_heads = 16
hidden_size = 1024
(gpt2-medium)|top_k = 1|111.37|79.73|753.11|6.76|9.45| +| |top_k = 4|110.53|80.48|767.48|6.94|9.54| +| |top_k = 8|109.87|78.92|754.99|6.87|9.57| +| |top_k = 16|110.61|85.26|764.16|6.91|8.96| +| |top_p = 0.4|127.51|87.72|830.24|6.51|9.46| +|num_layers = 36
num_attention_heads = 20
hidden_size = 1280
(gpt2-large)|top_k = 1|203.76|142.85|1108.26|5.44|7.76| +| |top_k = 4|204.18|139.49|1230.63|6.03|8.82| +| |top_k = 8|204.22|139.14|1238.96|6.07|8.9| +| |top_k = 16|204.11|140.04|1148.05|5.62|8.2| +| |top_p = 0.4|222.12|150.68|1248.75|5.62|8.29| + + +**OPT:** + +* 模型参数 + +| Model Name | num_layers | num_attention_heads | hidden_size | +|------------|------------|---------------------|-------------| +| OPT-125m | 12 | 12 | 768 | +| OPT-350M | 24 | 16 | 1024 | + +transformers: 4.20.1 + +* 性能结果报表 + +| Model | Decoding Strategy | Faster Generation(FP32)(ms) | Faster Generation(FP16)(ms) | HF Generation(ms) | Speed Up Rate(Faster32/HF) | Speed Up Rate(Faster16/HF) | +|:--------:|:-------------------:|:-----------------------------:|:-----------------------------:|:-------------------:|:----------------------------:|:----------------------------:| +| opt-125m | top_k=1 | 50.57 | 42.59 | 267.95 | 5.30 | 6.29 | +| | top_k=4 | 50.88 | 40.01 | 280.95 | 5.52 | 7.02 | +| | top_k=8 | 50.91 | 43.77 | 268.54 | 5.27 | 6.14 | +| | top_k=16 | 51.08 | 42.56 | 265.40 | 5.20 | 6.24 | +| | top_p=0.4 | 69.08 | 54.59 | 330.56 | 4.78 | 6.06 | +| opt-350m | top_k=1 | 110.22 | 77.82 | 507.00 | 4.60 | 6.51 | +| | top_k=4 | 110.76 | 77.93 | 479.42 | 4.33 | 6.15 | +| | top_k=8 | 142.07 | 78.86 | 513.79 | 3.62 | 6.52 | +| | top_k=16 | 110.80 | 78.19 | 488.34 | 4.41 | 6.25 | +| | top_p=0.4 | 128.33 | 92.57 | 544.18 | 4.24 | 5.88 | + +**CodeGen:** +* 环境和超参 + +- Platform: Tesla V100-SXM2-32GB +- CUDA 10.1 +- CUDNN 7.6.5 +- PaddlePaddle-gpu 2.3.1.post101 +- transformers==4.21.1 +- torch==1.11.0 +- Batch Size: 1 +- Input Length: 60 +- Output Length: 20 + +* 模型参数 + +| Model Name | num_layers | num_attention_heads | hidden_size | +|------------|------------|---------------------|-------------| +| Salesforce/codegen-350M-mono | 20 | 16 | 1024 | +| Salesforce/codegen-2B-mono | 32 | 32 | 2560 | +| Salesforce/codegen-6B-mono | 33 | 16 | 4096 | +| Salesforce/codegen-16B-mono | 34 | 24 | 6144 | + + + +* 性能结果报表 + +| Model | Decoding Strategy | Faster Generation(FP32)(ms) | Faster Generation(FP16)(ms) | HF Generation(ms) | Speed Up Rate(Faster32/HF) | Speed Up Rate(Faster16/HF) | +|:--------:|:-------------------:|:-----------------------------:|:-----------------------------:|:-------------------:|:----------------------------:|:----------------------------:| +| Salesforce/codegen-350M-mono | top_k=1 | 57.76 | 51.35 | 709.62 | 12.29 | 13.82 | +| | top_k=4 | 57.42 | 50.88 | 639.58 | 11.14 | 12.57 | +| | top_k=8 | 57.24 | 51.67 | 685.82 | 11.98 | 13.27 | +| | top_k=16 | 57.57 | 51.61 | 686.62 | 11.93 | 13.30 | +| | top_p=0.4 | 67.26 | 57.35 | 656.12 | 9.75 | 11.44 | +| Salesforce/codegen-2B-mono| top_k=1 | 319.03 | 207.41 | 1040.71 | 3.26 | 5.02 | +| | top_k=4 | 318.98 | 207.37 | 1014.32 | 3.18 | 4.89 | +| | top_k=8 | 319.66 | 207.26 | 1084.09 | 3.39 | 5.23 | +| | top_k=16 | 320.04 | 207.74 | 1040.28 | 3.25 | 5.01 | +| | top_p=0.4 | 329.07 | 213.97 | 1055.55 | 3.21 | 4.93 | +| Salesforce/codegen-6B-mono| top_k=1 | 762.91 | 411.94 | 1384.90 | 1.82 | 3.36 | +| | top_k=4 | 762.58 | 412.79 | 1378.32 | 1.81 | 3.34 | +| | top_k=8 | 763.43 | 413.32 | 1366.45 | 1.79 | 3.31 | +| | top_k=16 | 762.79 | 413.83 | 1376.69 | 1.80 | 3.33 | +| | top_p=0.4 | 771.77 | 419.16 | 1366.49 | 1.77 | 3.26 | + + +**Pegasus:** + +| Model Size | Decode Strategy| FastGeneration(FP32)
(ms) | FastGeneration(FP16)
(ms) | HF generate
(ms) | Speed Up Rate
(Faster32/HF) | Speed Up Rate
(Faster16/HF) | +|-----|----|---|---|---|---|---| +|IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese|num_beams=2|87.41|75.47|1322.24|15.13|17.52 +| |num_beams=4 |91.55|66.47|1364.43|14.90|20.53| +| |num_beams=6 |94.55|73.25|1391.35|14.72|18.99| +| |num_beams=8 |100.48|84.82|1467.64|14.61|17.30| +|IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese|num_beams=2|120.15|94.26|1735.21|14.44|18.41| +| |num_beams=4 |126.42|99.07|1622.31|12.83|16.38| +| |num_beams=6 |142.21|99.95|1717.49|12.08|17.18| +| |num_beams=8 |158.26|104.31|1697.65|10.73|16.27| + + +## 测试方法 + +运行如下命令即可bart性能测试: + +```sh +bash run_perf_bart.sh +``` + +运行如下命令即可启动gpt性能测试: + +```sh +bash run_perf_gpt.sh +``` + +运行以上命令后,脚本会自动使用不同的模型参数进行性能测试,结果如下图所示: + +```sh +... +[2021-12-10 08:11:37,255] [ DEBUG] - skipping 'FastGeneration' extension (up-to-date) build +Namespace(decode_strategy='sampling', max_length=32, model_name_or_path='bart-base', num_beams=1, top_k=1, top_p=1.0, use_fp16_decoding=False) +Faster FP32 cost: 40.13654176145792 +PD cost: 511.413540635258 +HF cost: 138.49875444546342 +Speed up Faster FP32/PD: 12.741843671403577 +Speed up Faster FP32/HF: 3.4506897796177394 +... +... +[2021-12-10 08:13:42,858] [ DEBUG] - skipping 'FastGeneration' extension (up-to-date) build +Namespace(decode_strategy='sampling', max_length=32, model_name_or_path='bart-base', num_beams=1, top_k=1, top_p=1.0, use_fp16_decoding=True) +Faster FP16 cost: 34.004870522767305 +... +``` +可以看到,对于每组参数,脚本会先输出FP32和竞品的测试对比,再单独输出FP16的性能数据。 + +**NOTE:** 根据测试环境和机器状态的不同,以上性能测试脚本的结果可能与表中结果有所出入。 diff --git a/fast_generation/perf/bart_perf.py b/fast_generation/perf/bart_perf.py new file mode 100644 index 0000000000000000000000000000000000000000..8466dafcaaefa8bd1a3e92f763f104ef578b77ea --- /dev/null +++ b/fast_generation/perf/bart_perf.py @@ -0,0 +1,170 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from pprint import pprint + +import paddle +import torch +from transformers import BartForConditionalGeneration as hf_bart_model + +from paddlenlp.data import Pad +from paddlenlp.transformers import BartForConditionalGeneration, BartTokenizer + + +def prepare_input(tokenizer, sentences): + word_pad = Pad(tokenizer.pad_token_id, dtype="int64") + tokenized = tokenizer(sentences) + inputs = word_pad([i["input_ids"] for i in tokenized]) + input_ids = paddle.to_tensor(inputs) + return input_ids + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + default="bart-base", + type=str, + choices=["bart-base", "bart-large"], + help="The model name to specify the bart to use. Can be one of ['bart-base', 'bart-large']. ", + ) + parser.add_argument( + "--decode_strategy", + default="sampling", + type=str, + choices=["greedy_search", "beam_search", "sampling"], + help="The decoding strategy. 
Can be one of ['greedy_search', 'beam_search', 'sampling']", + ) + parser.add_argument("--num_beams", default=4, type=int, help="The parameters for beam search. ") + parser.add_argument("--top_k", default=4, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument( + "--top_p", default=1.0, type=float, help="The probability threshold to procedure topp sampling. " + ) + parser.add_argument("--max_length", default=32, type=int, help="Maximum output length. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + tokenizer = BartTokenizer.from_pretrained(args.model_name_or_path) + model = BartForConditionalGeneration.from_pretrained(args.model_name_or_path) + # Set evaluate mode + model.eval() + sentences = [ + "I love that girl, but does not me.", + "She is so that I can not help glance at .", + "Nothing's gonna my love for you.", + "Drop everything now. Meet me in the pouring . Kiss me on the sidewalk.", + ] + + input_ids = prepare_input(tokenizer, sentences) + + # Define model + model.eval() + + num_loop = 100 + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + early_stopping=True, + use_fast=True, + use_fp16_decoding=args.use_fp16_decoding, + ) + paddle.device.cuda.synchronize(place) + fast_cost = (time.perf_counter() - start) / 50 * 1000 + + if args.use_fp16_decoding: + pprint(args) + print("Fast FP16 cost:", fast_cost) + return + + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + early_stopping=True, + ) + paddle.device.cuda.synchronize(place) + pd_cost = (time.perf_counter() - start) / 50 * 1000 + + device = torch.device("cuda:0") + hf_model = hf_bart_model.from_pretrained("facebook/" + args.model_name_or_path) + hf_model.to(device) + hf_model.eval() + hf_input_ids = prepare_input(tokenizer, sentences) + hf_input_ids = torch.tensor(hf_input_ids.numpy()) + hf_input_ids = hf_input_ids.to(device) + + if args.decode_strategy == "sampling": + do_sample = True + else: + do_sample = False + with torch.no_grad(): + for i in range(num_loop): + # For warmup. 
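+            # The first 50 of the 100 iterations warm up the GPU; the timer starts at
+            # i == 50, so "HF cost" below is the average latency in ms over the last 50 calls.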
+ if 50 == i: + torch.cuda.synchronize() + start = time.perf_counter() + hf_model.generate( + hf_input_ids, + do_sample=do_sample, + max_length=args.max_length + 1, + top_k=args.top_k, + top_p=args.top_p, + num_beams=args.num_beams, + no_repeat_ngram_size=0, + length_penalty=0.0, + ) + torch.cuda.synchronize() + hf_cost = (time.perf_counter() - start) / 50 * 1000 + + pprint(args) + print("Fast FP32 cost:", fast_cost) + print("PD cost:", pd_cost) + print("HF cost:", hf_cost) + print("Speed up Fast FP32/PD:", pd_cost / fast_cost) + print("Speed up Fast FP32/HF:", hf_cost / fast_cost) + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/fast_generation/perf/codegen_perf.py b/fast_generation/perf/codegen_perf.py new file mode 100644 index 0000000000000000000000000000000000000000..8620a11336c5c48ed0d443a5654a857c8221782b --- /dev/null +++ b/fast_generation/perf/codegen_perf.py @@ -0,0 +1,175 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from pprint import pprint + +import numpy as np +import paddle +import pynvml + +from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer + +pynvml.nvmlInit() + + +def query_by_id(gpu_id=2): + handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id) + meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle) + return meminfo.used // 1024 // 1024 + + +def perf_pd(args): + start_mem = query_by_id(args.gpu_id) + place = "gpu" + place = paddle.set_device(place) + tokenizer = CodeGenTokenizer.from_pretrained(args.model_name_or_path) + model = CodeGenForCausalLM.from_pretrained(args.model_name_or_path, load_state_as_np=True) + model.eval() + load_mem = query_by_id(args.gpu_id) + + input_ids_np = [ + np.random.choice(list(tokenizer.decoder.keys())[:-1], args.input_len) for _ in range(args.batch_size) + ] + input_ids = paddle.to_tensor(input_ids_np) + + num_loop = 100 + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. 
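+            # Timing starts at the midpoint of the loop (the first half is warmup).
+            # GPU memory is re-sampled with pynvml after each generate() call, so
+            # generate_mem reports the usage observed while decoding.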
+ if num_loop // 2 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.generate_len, + min_length=args.generate_len, + decode_strategy="sampling", + top_k=args.top_k, + top_p=args.top_p, + use_fast=args.use_faster, + use_fp16_decoding=args.use_fp16_decoding, + ) + generate_mem = query_by_id(args.gpu_id) + paddle.device.cuda.synchronize(place) + pd_cost = (time.perf_counter() - start) / (num_loop - num_loop // 2) * 1000 + return pd_cost, load_mem - start_mem, generate_mem - start_mem + + +def perf_hf(args): + import torch + from transformers import CodeGenForCausalLM as hf_codegen + from transformers import CodeGenTokenizer as hf_tokenizer + + start_mem = query_by_id(args.gpu_id) + device = torch.device("cuda") + tokenizer = hf_tokenizer.from_pretrained(args.model_name_or_path) + model = hf_codegen.from_pretrained(args.model_name_or_path) + model.to(device) + model.eval() + load_mem = query_by_id(args.gpu_id) + + input_ids_np = [np.random.choice(list(tokenizer.decoder.keys()), args.input_len) for _ in range(args.batch_size)] + input_ids = torch.tensor(input_ids_np) + input_ids = input_ids.to(device) + num_loop = 100 + with torch.no_grad(): + for i in range(num_loop): + # For warmup. + if num_loop // 2 == i: + torch.cuda.synchronize() + start = time.perf_counter() + model.generate( + input_ids, + do_sample=True, + max_length=args.generate_len + input_ids.shape[-1], + min_length=args.generate_len + input_ids.shape[-1], + top_k=args.top_k, + top_p=args.top_p, + ) + generate_mem = query_by_id(args.gpu_id) + torch.cuda.synchronize() + hf_cost = (time.perf_counter() - start) / (num_loop - num_loop // 2) * 1000 + return hf_cost, load_mem - start_mem, generate_mem - start_mem + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--perf_type", + default="pd", + type=str, + choices=["pd", "pd_faster_fp32", "pd_faster_fp16", "hf"], + help="The type of perf. ", + ) + parser.add_argument( + "--model_name_or_path", + default="Salesforce/codegen-350M-mono", + type=str, + choices=[ + "Salesforce/codegen-350M-mono", + "Salesforce/codegen-2B-mono", + "Salesforce/codegen-6B-mono", + "Salesforce/codegen-16B-mono", + ], + help="The model name to specify the bart to use. ", + ) + parser.add_argument("--top_k", default=4, type=int, help="The number of candidate to procedure topk sampling. ") + parser.add_argument( + "--top_p", default=1.0, type=float, help="The probability threshold to procedure topp sampling. " + ) + parser.add_argument("--batch_size", default=1, type=int, help="The size of input batch. ") + parser.add_argument("--input_len", default=60, type=int, help="The size of model input. ") + parser.add_argument("--generate_len", default=20, type=int, help="Length of output . ") + parser.add_argument("--gpu_id", default=2, type=int, help="The id of GPU . ") + parser.add_argument( + "--use_faster", action="store_true", help="Whether to process inference using faster codegen. " + ) + + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. 
") + args = parser.parse_args() + return args + + +def do_predict(args): + try: + if args.perf_type == "pd": + args.use_faster = False + cost, load_mem, generate_mem = perf_pd(args) + elif args.perf_type == "pd_faster_fp32": + args.use_faster = True + args.use_fp16_decoding = False + cost, load_mem, generate_mem = perf_pd(args) + elif args.perf_type == "pd_faster_fp16": + args.use_faster = True + args.use_fp16_decoding = True + paddle.set_default_dtype("float16") + cost, load_mem, generate_mem = perf_pd(args) + else: + cost, load_mem, generate_mem = perf_hf(args) + pprint(args) + print( + f"CodeGenPerfResult: cost_time: {cost} ms, load_mem: {load_mem} MB, generate_mem:{generate_mem} MB, args:{args}\n" + ) + except Exception as e: + pprint(args) + print(f"CodeGenPerfResult: ERROR: {e}, args:{args}\n") + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/fast_generation/perf/gpt_perf.py b/fast_generation/perf/gpt_perf.py new file mode 100644 index 0000000000000000000000000000000000000000..87afcba682b4fd81abcbd6b60f20749cde814ce4 --- /dev/null +++ b/fast_generation/perf/gpt_perf.py @@ -0,0 +1,155 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from pprint import pprint + +import numpy as np +import paddle +import torch +from transformers import GPT2LMHeadModel as hf_gpt_model + +from paddlenlp.transformers import GPTLMHeadModel, GPTTokenizer + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + default="gpt2-en", + type=str, + choices=["gpt2-en", "gpt2-medium-en", "gpt2-large-en"], + help="The model name to specify the bart to use. Can be one of ['gpt2-en', 'gpt2-medium-en', 'gpt2-large-en']. ", + ) + parser.add_argument( + "--decode_strategy", + default="sampling", + type=str, + choices=["greedy_search", "sampling"], + help="The decoding strategy. Can be one of ['greedy_search', 'sampling']", + ) + parser.add_argument("--top_k", default=4, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument("--batch_size", default=4, type=int, help="The size of input batch. ") + parser.add_argument( + "--top_p", default=1.0, type=float, help="The probability threshold to procedure topp sampling. " + ) + parser.add_argument("--max_length", default=32, type=int, help="Maximum output length. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. 
") + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + tokenizer = GPTTokenizer.from_pretrained(args.model_name_or_path) + model = GPTLMHeadModel.from_pretrained(args.model_name_or_path) + # Set evaluate mode + model.eval() + bos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>") + eos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>") + + input_ids_np = np.array([[bos_id] for i in range(args.batch_size)]).astype("int64").reshape([args.batch_size, 1]) + input_ids = paddle.to_tensor(input_ids_np) + # Define model + num_loop = 100 + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + bos_token_id=bos_id, + eos_token_id=eos_id, + use_fast=True, + use_fp16_decoding=args.use_fp16_decoding, + ) + paddle.device.cuda.synchronize(place) + fast_cost = (time.perf_counter() - start) / 50 * 1000 + + if args.use_fp16_decoding: + pprint(args) + print("Fast FP16 cost:", fast_cost) + return + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + bos_token_id=bos_id, + eos_token_id=eos_id, + ) + paddle.device.cuda.synchronize(place) + pd_cost = (time.perf_counter() - start) / 50 * 1000 + + device = torch.device("cuda:0") + hf_model = hf_gpt_model.from_pretrained(args.model_name_or_path[:-3]) + hf_model.to(device) + hf_model.eval() + + hf_input_ids = torch.tensor(input_ids_np) + hf_input_ids = hf_input_ids.to(device) + + if args.decode_strategy == "sampling": + do_sample = True + else: + do_sample = False + with torch.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + torch.cuda.synchronize() + start = time.perf_counter() + hf_model.generate( + hf_input_ids, + do_sample=do_sample, + max_length=args.max_length + 1, + bos_token_id=bos_id, + eos_token_id=eos_id, + pad_token_id=0, + top_k=args.top_k, + top_p=args.top_p, + ) + torch.cuda.synchronize() + hf_cost = (time.perf_counter() - start) / 50 * 1000 + + pprint(args) + print("Fast FP32 cost:", fast_cost) + print("PD cost:", pd_cost) + print("HF cost:", hf_cost) + print("Speed up Fast FP32/PD:", pd_cost / fast_cost) + print("Speed up Fast FP32/HF:", hf_cost / fast_cost) + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/fast_generation/perf/opt_perf.py b/fast_generation/perf/opt_perf.py new file mode 100644 index 0000000000000000000000000000000000000000..213881fbf947c116ebc9934eefc3aa5e7c99c9c8 --- /dev/null +++ b/fast_generation/perf/opt_perf.py @@ -0,0 +1,162 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +# append project root dir to project to make it run with latest code +import sys +import time +from pprint import pprint + +import numpy as np +import paddle +import torch +from transformers.models.opt.modeling_opt import OPTForCausalLM as hf_opt_model + +from paddlenlp.transformers import GPTTokenizer, OPTForCausalLM + +sys.path.insert(0, "../../") + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", + default="facebook/opt-125m", + type=str, + choices=["facebook/opt-125m", "facebook/opt-350m", "facebook/opt-1.3b", "facebook/opt-2.7b"], + help="The model name to specify the bart to use. Can be one of ['facebook/opt-125m', 'facebook/opt-350m', 'facebook/opt-1.3b', 'facebook/opt-2.7b']. ", + ) + parser.add_argument( + "--decode_strategy", + default="greedy_search", + type=str, + choices=["greedy_search", "sampling"], + help="The decoding strategy. Can be one of ['greedy_search', 'sampling']", + ) + parser.add_argument("--top_k", default=4, type=int, help="The number of candidate to procedure beam search. ") + parser.add_argument("--batch_size", default=4, type=int, help="The size of input batch. ") + parser.add_argument( + "--top_p", default=1.0, type=float, help="The probability threshold to procedure topp sampling. " + ) + parser.add_argument("--max_length", default=32, type=int, help="Maximum output length. ") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ") + args = parser.parse_args() + return args + + +def do_predict(args): + place = "gpu" + place = paddle.set_device(place) + + tokenizer = GPTTokenizer.from_pretrained(args.model_name_or_path) + model = OPTForCausalLM.from_pretrained(args.model_name_or_path) + # Set evaluate mode + model.eval() + bos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>") + eos_id = tokenizer.convert_tokens_to_ids("<|endoftext|>") + + input_ids_np = np.array([[bos_id] for i in range(args.batch_size)]).astype("int64").reshape([args.batch_size, 1]) + input_ids = paddle.to_tensor(input_ids_np) + # Define model + num_loop = 100 + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + bos_token_id=bos_id, + eos_token_id=eos_id, + use_fast=True, + use_fp16_decoding=args.use_fp16_decoding, + ) + paddle.device.cuda.synchronize(place) + fast_cost = (time.perf_counter() - start) / 50 * 1000 + + if args.use_fp16_decoding: + pprint(args) + print("Fast FP16 cost:", fast_cost) + return + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. 
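+            # Baseline: the same generate() call without use_fast / use_fp16_decoding,
+            # timed the same way and reported as "PD cost".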
+ if 50 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.max_length, + decode_strategy=args.decode_strategy, + top_k=args.top_k, + top_p=args.top_p, + bos_token_id=bos_id, + eos_token_id=eos_id, + ) + paddle.device.cuda.synchronize(place) + pd_cost = (time.perf_counter() - start) / 50 * 1000 + + device = torch.device("cuda:0") + hf_model = hf_opt_model.from_pretrained(args.model_name_or_path) + + hf_model.to(device) + hf_model.eval() + + hf_input_ids = torch.tensor(input_ids_np) + hf_input_ids = hf_input_ids.to(device) + + if args.decode_strategy == "sampling": + do_sample = True + else: + do_sample = False + with torch.no_grad(): + for i in range(num_loop): + # For warmup. + if 50 == i: + torch.cuda.synchronize() + start = time.perf_counter() + hf_model.generate( + hf_input_ids, + do_sample=do_sample, + max_length=args.max_length + 1, + bos_token_id=bos_id, + eos_token_id=eos_id, + pad_token_id=0, + top_k=args.top_k, + top_p=args.top_p, + ) + torch.cuda.synchronize() + hf_cost = (time.perf_counter() - start) / 50 * 1000 + + pprint(args) + print("Fast FP32 cost:", fast_cost) + print("PD cost:", pd_cost) + print("HF cost:", hf_cost) + print("Speed up Fast FP32/PD:", pd_cost / fast_cost) + print("Speed up Fast FP32/HF:", hf_cost / fast_cost) + + +if __name__ == "__main__": + args = parse_args() + print(args.model_name_or_path) + do_predict(args) diff --git a/fast_generation/perf/pegasus_perf.py b/fast_generation/perf/pegasus_perf.py new file mode 100644 index 0000000000000000000000000000000000000000..ae9c6ce61b6a09abca9d7c59379bccb740530617 --- /dev/null +++ b/fast_generation/perf/pegasus_perf.py @@ -0,0 +1,168 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from pprint import pprint + +import numpy as np +import paddle +import pynvml + +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusForConditionalGeneration, +) + +pynvml.nvmlInit() + + +def query_by_id(gpu_id=2): + handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id) + meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle) + return meminfo.used // 1024 // 1024 + + +def perf_pd(args): + start_mem = query_by_id(args.gpu_id) + place = "gpu" + place = paddle.set_device(place) + tokenizer = PegasusChineseTokenizer.from_pretrained(args.model_name_or_path) + model = PegasusForConditionalGeneration.from_pretrained(args.model_name_or_path, load_state_as_np=True) + model.eval() + load_mem = query_by_id(args.gpu_id) + input_ids_np = [np.random.choice(range(len(tokenizer.vocab)), args.input_len) for _ in range(args.batch_size)] + input_ids = paddle.to_tensor(input_ids_np) + + num_loop = 100 + with paddle.no_grad(): + for i in range(num_loop): + # For warmup. 
+ if num_loop // 2 == i: + # PaddlePaddle >= 2.2 + paddle.device.cuda.synchronize(place) + start = time.perf_counter() + model.generate( + input_ids=input_ids, + max_length=args.generate_len, + min_length=args.generate_len, + decode_strategy="beam_search", + num_beams=args.num_beams, + use_fast=args.use_faster, + use_fp16_decoding=args.use_fp16_decoding, + ) + generate_mem = query_by_id(args.gpu_id) + paddle.device.cuda.synchronize(place) + pd_cost = (time.perf_counter() - start) / (num_loop - num_loop // 2) * 1000 + return pd_cost, load_mem - start_mem, generate_mem - start_mem + + +def perf_hf(args): + import torch + from tokenizers_pegasus import PegasusTokenizer as hf_tokenizer + from transformers import PegasusForConditionalGeneration as hf_pegasus + + start_mem = query_by_id(args.gpu_id) + device = torch.device("cuda") + tokenizer = hf_tokenizer.from_pretrained(args.model_name_or_path) + model = hf_pegasus.from_pretrained(args.model_name_or_path) + model.to(device) + model.eval() + load_mem = query_by_id(args.gpu_id) + + input_ids_np = [np.random.choice(range(len(tokenizer.vocab)), args.input_len) for _ in range(args.batch_size)] + input_ids = torch.tensor(input_ids_np) + input_ids = input_ids.to(device) + num_loop = 100 + with torch.no_grad(): + for i in range(num_loop): + # For warmup. + if num_loop // 2 == i: + torch.cuda.synchronize() + start = time.perf_counter() + model.generate( + input_ids, + do_sample=False, + num_beams=args.num_beams, + max_length=args.generate_len + input_ids.shape[-1], + min_length=args.generate_len + input_ids.shape[-1], + ) + generate_mem = query_by_id(args.gpu_id) + torch.cuda.synchronize() + hf_cost = (time.perf_counter() - start) / (num_loop - num_loop // 2) * 1000 + return hf_cost, load_mem - start_mem, generate_mem - start_mem + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--perf_type", + default="pd", + type=str, + choices=["pd", "pd_faster_fp32", "pd_faster_fp16", "hf"], + help="The type of perf. ", + ) + parser.add_argument( + "--model_name_or_path", + default="IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + type=str, + choices=[ + "IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese", + "IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese", + ], + help="The model name to specify the pegasus to use. ", + ) + parser.add_argument("--num_beams", default=4, type=int, help="The number of beams to procedure beam search. ") + parser.add_argument("--batch_size", default=1, type=int, help="The size of input batch. ") + parser.add_argument("--input_len", default=60, type=int, help="The size of model input. ") + parser.add_argument("--generate_len", default=20, type=int, help="Length of output . ") + parser.add_argument("--gpu_id", default=2, type=int, help="The id of GPU . ") + parser.add_argument( + "--use_faster", action="store_true", help="Whether to process inference using faster pegasus. " + ) + + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. 
") + args = parser.parse_args() + return args + + +def do_predict(args): + try: + if args.perf_type == "pd": + args.use_faster = False + cost, load_mem, generate_mem = perf_pd(args) + elif args.perf_type == "pd_faster_fp32": + args.use_faster = True + args.use_fp16_decoding = False + cost, load_mem, generate_mem = perf_pd(args) + elif args.perf_type == "pd_faster_fp16": + args.use_faster = True + args.use_fp16_decoding = True + # paddle.set_default_dtype('float16') + cost, load_mem, generate_mem = perf_pd(args) + else: + cost, load_mem, generate_mem = perf_hf(args) + pprint(args) + print( + f"PegasusPerfResult: cost_time: {cost} ms, load_mem: {load_mem} MB, generate_mem:{generate_mem} MB, args:{args}\n" + ) + except Exception as e: + pprint(args) + print(f"PegasusPerfResult: ERROR: {e}, args:{args}\n") + + +if __name__ == "__main__": + args = parse_args() + do_predict(args) diff --git a/fast_generation/perf/run_perf_bart.sh b/fast_generation/perf/run_perf_bart.sh new file mode 100644 index 0000000000000000000000000000000000000000..fa087770cb5afabe1a20cdce8acb5be6a17ca976 --- /dev/null +++ b/fast_generation/perf/run_perf_bart.sh @@ -0,0 +1,76 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=3 + +for model_name in bart-base bart-large; + do + for top_k in 1 4 8 16; + do + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --num_beams=1 \ + --top_k=$top_k \ + --top_p=1 \ + --max_length=32 + sleep 10s + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --num_beams=1 \ + --top_k=$top_k \ + --top_p=1 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --num_beams=1 \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 + sleep 10s + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --num_beams=1 \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + for num_beams in 4 8 16; + do + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=beam_search \ + --num_beams=$num_beams \ + --top_k=1 \ + --top_p=1 \ + --max_length=32 + sleep 10s + python bart_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=beam_search \ + --num_beams=$num_beams \ + --top_k=1 \ + --top_p=1 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done + done \ No newline at end of file diff --git a/fast_generation/perf/run_perf_codegen.sh b/fast_generation/perf/run_perf_codegen.sh new file mode 100644 index 0000000000000000000000000000000000000000..be4792096e2efb3df8fcafcbf4f2260d109c9539 --- /dev/null +++ b/fast_generation/perf/run_perf_codegen.sh @@ -0,0 +1,64 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +GPU_ID=1 +export CUDA_VISIBLE_DEVICES=${GPU_ID} + +for model_name in Salesforce/codegen-350M-mono Salesforce/codegen-2B-mono Salesforce/codegen-6B-mono; + do + for top_k in 1 4 8 16; + do + for input_len in 60; + do + for generate_len in 20; + do + for perf_type in pd pd_faster_fp32 pd_faster_fp16 hf; + do + echo model_name: $model_name, perf_type: $perf_type, top_k: $top_k, top_p: 1.0, input_len: $input_len, generate_len: $generate_len + python codegen_perf.py \ + --model_name_or_path=$model_name \ + --perf_type=$perf_type \ + --top_k=$top_k \ + --top_p=1.0 \ + --input_len=$input_len \ + --generate_len=$generate_len \ + --gpu_id ${GPU_ID} + sleep 3s + done + done + done + done + for top_p in 0.4; + do + for input_len in 60; + do + for generate_len in 20; + do + for perf_type in pd pd_faster_fp32 pd_faster_fp16 hf; + do + echo model_name: $model_name, perf_type: $perf_type, top_k: 0, top_p: $top_p, input_len: $input_len, generate_len: $generate_len + python codegen_perf.py \ + --model_name_or_path=$model_name \ + --perf_type=$perf_type \ + --top_k=0 \ + --top_p=$top_p \ + --input_len=$input_len \ + --generate_len=$generate_len \ + --gpu_id ${GPU_ID} + sleep 3s + done + done + done + done + done \ No newline at end of file diff --git a/fast_generation/perf/run_perf_gpt.sh b/fast_generation/perf/run_perf_gpt.sh new file mode 100644 index 0000000000000000000000000000000000000000..5363b0546af65d01737b6a1e75a2aaa5c9fe3971 --- /dev/null +++ b/fast_generation/perf/run_perf_gpt.sh @@ -0,0 +1,52 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
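+
+# Sweep over gpt2-en / gpt2-medium-en / gpt2-large-en: sampling with top_k in {1, 4, 8, 16}
+# (top_p=1) plus one top_p=0.4 run (top_k=0), each with and without --use_fp16_decoding.
+# FP32 runs print the Fast FP32 / PD / HF costs; FP16 runs print only the Fast FP16 cost.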
+ +export CUDA_VISIBLE_DEVICES=3 + +for model_name in gpt2-en gpt2-medium-en gpt2-large-en; + do + for top_k in 1 4 8 16; + do + python gpt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=$top_k \ + --top_p=1 \ + --max_length=32 + sleep 10s + python gpt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=$top_k \ + --top_p=1 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done + python gpt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 + sleep 10s + python gpt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done \ No newline at end of file diff --git a/fast_generation/perf/run_perf_opt.sh b/fast_generation/perf/run_perf_opt.sh new file mode 100644 index 0000000000000000000000000000000000000000..bc1d525c00acf95ec80e162794874e3b5b390373 --- /dev/null +++ b/fast_generation/perf/run_perf_opt.sh @@ -0,0 +1,52 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export CUDA_VISIBLE_DEVICES=3 + +for model_name in facebook/opt-125m facebook/opt-350m; + do + for top_k in 1 4 8 16; + do + python opt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=$top_k \ + --top_p=0.4 \ + --max_length=32 + sleep 10s + python opt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=$top_k \ + --top_p=0.4 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done + python opt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 + sleep 10s + python opt_perf.py \ + --model_name_or_path=$model_name \ + --decode_strategy=sampling \ + --top_k=0 \ + --top_p=0.4 \ + --max_length=32 \ + --use_fp16_decoding + sleep 10s + done diff --git a/fast_generation/perf/run_perf_pegasus.sh b/fast_generation/perf/run_perf_pegasus.sh new file mode 100644 index 0000000000000000000000000000000000000000..264c28b22c8ba45d8c104661da5237947d547b36 --- /dev/null +++ b/fast_generation/perf/run_perf_pegasus.sh @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
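+
+# Sweep over the two Randeng-Pegasus checkpoints: batch_size in {1, 4, 8, 16} x
+# num_beams in {2, 4, 6, 8}, comparing pd_faster_fp16 / pd_faster_fp32 / pd / hf.
+# Each run prints one "PegasusPerfResult" line with latency and GPU memory usage.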
+ +GPU_ID=4 +export CUDA_VISIBLE_DEVICES=${GPU_ID} + +for model_name in IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese; + do + for batch_size in 1 4 8 16; + do + for num_beams in 2 4 6 8; + do + for input_len in 60; + do + for generate_len in 20; + do + for perf_type in pd_faster_fp16 pd_faster_fp32 pd hf; + do + echo model_name: $model_name, perf_type: $perf_type, batch_size:$batch_size, num_beams: $num_beams, input_len: $input_len, generate_len: $generate_len + python pegasus_perf.py \ + --model_name_or_path=$model_name \ + --perf_type=$perf_type \ + --batch_size=$batch_size \ + --num_beams=$num_beams \ + --input_len=$input_len \ + --generate_len=$generate_len \ + --gpu_id ${GPU_ID} + sleep 3s + done + done + done + done + done + done \ No newline at end of file diff --git a/fast_generation/samples/codegen_16b_sample.py b/fast_generation/samples/codegen_16b_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..02c121645e1ca17203a1a75242d76bec2a77235e --- /dev/null +++ b/fast_generation/samples/codegen_16b_sample.py @@ -0,0 +1,38 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle + +from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer + +# Can be load on A100-40G +paddle.set_default_dtype("float16") +model_name = "Salesforce/codegen-16B-mono" + +tokenizer = CodeGenTokenizer.from_pretrained(model_name) +model = CodeGenForCausalLM.from_pretrained(model_name, load_state_as_np=True) +model.eval() + +inputs = "def hello" +input_ids = tokenizer([inputs], return_tensors="pd")["input_ids"] + +# Enable FastGeneration +outputs, _ = model.generate( + input_ids=input_ids, max_length=128, decode_strategy="greedy_search", use_fp16_decoding=True, use_fast=True +) + +result = tokenizer.decode(outputs[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]) + +print("Model input:", inputs) +print("Result:", result) diff --git a/fast_generation/samples/codegen_sample.py b/fast_generation/samples/codegen_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..77cb5c7a335e8641b951adaa0d6bc1c3589962a3 --- /dev/null +++ b/fast_generation/samples/codegen_sample.py @@ -0,0 +1,37 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
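+
+# Minimal FastGeneration sample: complete a Python snippet with the
+# Salesforce/codegen-350M-mono checkpoint using greedy search and use_fast=True.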
+ +from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer + +model_name = "Salesforce/codegen-350M-mono" + +tokenizer = CodeGenTokenizer.from_pretrained(model_name) +model = CodeGenForCausalLM.from_pretrained(model_name) +model.eval() + +inputs = "def hello" +input_ids = tokenizer([inputs], return_tensors="pd")["input_ids"] + +outputs, _ = model.generate( + input_ids=input_ids, max_length=128, decode_strategy="greedy_search", use_fp16_decoding=True, use_fast=True +) + +result = tokenizer.decode(outputs[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]) + +print("Model input:", inputs) +print("Result:", result) +# Result: _world(): +# print("Hello World") + +# hello_world() diff --git a/fast_generation/samples/gpt_mp_sample.py b/fast_generation/samples/gpt_mp_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..f2370f9b2e8fc9861e8769d64e55381d87d47869 --- /dev/null +++ b/fast_generation/samples/gpt_mp_sample.py @@ -0,0 +1,132 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from pprint import pprint + +import paddle + +from paddlenlp.ops import enable_ft_para, get_ft_para_conf +from paddlenlp.transformers import GPTChineseTokenizer, GPTLMHeadModel, GPTTokenizer + +MODEL_CLASSES = { + "gpt-cpm-large-cn": (GPTLMHeadModel, GPTChineseTokenizer), + "gpt-cpm-small-cn-distill": (GPTLMHeadModel, GPTChineseTokenizer), + "gpt2-en": (GPTLMHeadModel, GPTTokenizer), + "gpt2-medium-en": (GPTLMHeadModel, GPTTokenizer), + "gpt2-large-en": (GPTLMHeadModel, GPTTokenizer), + "gpt2-xl-en": (GPTLMHeadModel, GPTTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name", + default="gpt-cpm-large-cn", + choices=list(MODEL_CLASSES.keys()), + help="The model name to specify which gpt to use. It can be " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument("--batch_size", default=4, type=int, help="Batch size.") + parser.add_argument("--max_length", default=50, type=int, help="Maximum output length.") + parser.add_argument( + "--topk", default=1, type=int, help="The number of highest probability tokens to keep for top-k-sampling." + ) + parser.add_argument("--topp", default=1.0, type=float, help="The cumulative probability for top-p-filtering.") + parser.add_argument("--temperature", default=1.0, type=float, help="The temperature to set.") + parser.add_argument("--tensor_para_size", default=2, type=int, help="The size for tensor parallel.") + parser.add_argument("--layer_para_size", default=1, type=int, help="The size for layer parallel.") + parser.add_argument( + "--layer_para_batch_size", + default=None, + type=int, + help="The local batch size for pipeline parallel." 
"It is suggested to use `batch_size // layer_para_size`.", + ) + parser.add_argument("--use_fp16", action="store_true", help="Whether to use fp16 to predict.") + parser.add_argument("--profile", action="store_true", help="Whether to profile.") + args = parser.parse_args() + return args + + +def profile(batch_size, total_step=50, warmup_step=10, rank=0): + def _wrapper(func): + def _impl(*args, **kwargs): + for i in range(total_step): + if i == warmup_step: + paddle.device.cuda.synchronize() + start_time = time.time() + out = func(*args, **kwargs) + paddle.device.cuda.synchronize() + end_time = time.time() + if rank is None or get_ft_para_conf().rank == rank: + time_interval = end_time - start_time + num_batch = total_step - warmup_step + print("Latency: %2fs, QPS: %2f" % (time_interval / num_batch, num_batch * batch_size / time_interval)) + return out + + return _impl + + return _wrapper + + +def main(args): + if args.use_fp16: + paddle.set_default_dtype("float16") + enable_ft_para( + args.tensor_para_size, + args.layer_para_size, + args.batch_size // args.layer_para_size if args.layer_para_batch_size is None else args.layer_para_batch_size, + ) + # TODO(guosheng): Maybe device can be set in `enable_ft_para` + paddle.set_device("gpu:" + str(get_ft_para_conf().rank)) + + model_name = args.model_name + if args.profile: + MODEL_CLASSES[model_name][0].generate = profile(args.batch_size)(MODEL_CLASSES[model_name][0].generate) + tokenizer = MODEL_CLASSES[model_name][-1].from_pretrained(model_name) + model = MODEL_CLASSES[model_name][0].from_pretrained(model_name, load_state_as_np=True) + model.eval() + + # NOTE: When using prompt, open this and replace the text with what you want. + input = "花间一壶酒,独酌无相亲。举杯邀明月," + # input = '一时黛玉进了荣府,下了车。众嬷嬷引着,便往东转弯,' + # input = '爱因斯坦曾经说过:' + input_ids = tokenizer(input)["input_ids"] + # NOTE: When generating from the beginning, open this. + # input_ids = [tokenizer.eos_token_id] + input_ids = [input_ids] * args.batch_size + + inputs_ids = paddle.to_tensor(input_ids, dtype="int32") + outputs, _ = model.generate( + input_ids=inputs_ids, + max_length=args.max_length, + decode_strategy="sampling", + top_k=args.topk, + top_p=args.topp, + temperature=args.temperature, + use_fast=True, + ) + + # Only make the first process to output. + if get_ft_para_conf().rank == 0: + for i in range(len(outputs)): + result = tokenizer.convert_ids_to_string(outputs[i].numpy().tolist()) + print("Result:", result) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + main(args) diff --git a/fast_generation/samples/gpt_sample.py b/fast_generation/samples/gpt_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..e0cff0bba726b680014055e6cf4ac6db3bf6ef28 --- /dev/null +++ b/fast_generation/samples/gpt_sample.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import paddle + +from paddlenlp.transformers import GPTChineseTokenizer, GPTLMHeadModel + +model_name = "gpt-cpm-small-cn-distill" + +tokenizer = GPTChineseTokenizer.from_pretrained(model_name) +model = GPTLMHeadModel.from_pretrained(model_name) +model.eval() + +inputs = "花间一壶酒,独酌无相亲。举杯邀明月," +inputs_ids = tokenizer(inputs)["input_ids"] +inputs_ids = paddle.to_tensor(inputs_ids, dtype="int64").unsqueeze(0) + +outputs, _ = model.generate(input_ids=inputs_ids, max_length=10, decode_strategy="greedy_search", use_fast=True) + +result = tokenizer.convert_ids_to_string(outputs[0].numpy().tolist()) + +print("Model input:", inputs) +print("Result:", result) +# 对影成三人。 diff --git a/fast_generation/samples/gptj_sample.py b/fast_generation/samples/gptj_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..f335121287c87f6d88c61b863396e28811c2d6f6 --- /dev/null +++ b/fast_generation/samples/gptj_sample.py @@ -0,0 +1,42 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle + +from paddlenlp.transformers import GPTJForCausalLM, GPTJTokenizer + +paddle.set_default_dtype("float16") +model_name = "EleutherAI/gpt-j-6B" + +tokenizer = GPTJTokenizer.from_pretrained(model_name) +model = GPTJForCausalLM.from_pretrained(model_name, load_state_as_np=True) +model.eval() + +inputs = "What is PaddleNLP?" +input_ids = tokenizer([inputs], return_tensors="pd")["input_ids"] + +outputs, _ = model.generate( + input_ids=input_ids, + max_length=100, + decode_strategy="sampling", + temperature=0.8, + top_p=0.9, + use_fp16_decoding=True, + use_fast=True, +) + +result = tokenizer.decode(outputs[0]) + +print("Model input:", inputs) +print("Result:", result) diff --git a/fast_generation/samples/mbart_sample.py b/fast_generation/samples/mbart_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..e16c4e7de1764f4ba911a000b054df6505d7d9e3 --- /dev/null +++ b/fast_generation/samples/mbart_sample.py @@ -0,0 +1,58 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
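+
+# Minimal FastGeneration sample: translate an English sentence into Chinese with
+# mbart-large-50-many-to-many-mmt, forcing the target language via forced_bos_token_id
+# and decoding with beam search (num_beams=4) and use_fast=True.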
+import paddle + +from paddlenlp.transformers import MBart50Tokenizer, MBartForConditionalGeneration + +model_name = "mbart-large-50-many-to-many-mmt" + +tokenizer = MBart50Tokenizer.from_pretrained(model_name, src_lang="en_XX") +model = MBartForConditionalGeneration.from_pretrained(model_name) +model.eval() + + +def postprocess_response(seq, bos_idx, eos_idx): + """Post-process the decoded sequence.""" + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [idx for idx in seq[: eos_pos + 1] if idx != bos_idx and idx != eos_idx] + res = tokenizer.convert_ids_to_string(seq) + return res + + +bos_id = tokenizer.lang_code_to_id["zh_CN"] +eos_id = model.mbart.config["eos_token_id"] + +inputs = "PaddleNLP is a powerful NLP library with Awesome pre-trained models and easy-to-use interface, supporting wide-range of NLP tasks from research to industrial applications." +input_ids = tokenizer(inputs)["input_ids"] +input_ids = paddle.to_tensor(input_ids, dtype="int32").unsqueeze(0) + +outputs, _ = model.generate( + input_ids=input_ids, + forced_bos_token_id=bos_id, + decode_strategy="beam_search", + num_beams=4, + max_length=50, + use_fast=True, +) + +result = postprocess_response(outputs[0].numpy().tolist(), bos_id, eos_id) + +print("Model input:", inputs) + +print("Result:", result) +# PaddleNLP是一个强大的NLP库,具有超乎寻常的预训练模型和易于使用的接口,支持从研究到工业应用的广泛的NLP任务。 diff --git a/fast_generation/samples/opt_sample.py b/fast_generation/samples/opt_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..812fd6e01b8f39e423f53e1c5cd6aea32306abce --- /dev/null +++ b/fast_generation/samples/opt_sample.py @@ -0,0 +1,45 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle + +from paddlenlp.transformers import GPTTokenizer, OPTForCausalLM + +model_name = "facebook/opt-350m" + +tokenizer = GPTTokenizer.from_pretrained(model_name) +model = OPTForCausalLM.from_pretrained(model_name) +model.eval() + +inputs = """a chat between a curious human and Statue of Liberty. +Human: What is your name? +Statue: I am statue of liberty. +Human: where do you live? +Statue: New york city. +Human: how long have you lived there?。""" + +inputs_ids = tokenizer([inputs])["input_ids"] +inputs_ids = paddle.to_tensor(inputs_ids, dtype="int64") + +outputs, _ = model.generate( + input_ids=inputs_ids, + max_length=20, + decode_strategy="greedy_search", + use_fast=True, +) + +result = tokenizer.convert_ids_to_string(outputs[0].numpy().tolist()) + +print("Model input:", inputs) +print("Result:", result) diff --git a/fast_generation/samples/pegasus_sample.py b/fast_generation/samples/pegasus_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..ddbc340808b602fabb78305b32c57e82a344585d --- /dev/null +++ b/fast_generation/samples/pegasus_sample.py @@ -0,0 +1,36 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.transformers import ( + PegasusChineseTokenizer, + PegasusForConditionalGeneration, +) + +model = PegasusForConditionalGeneration.from_pretrained("IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese") +tokenizer = PegasusChineseTokenizer.from_pretrained("IDEA-CCNL/Randeng-Pegasus-238M-Summary-Chinese") +model.eval() + +inputs = "在北京冬奥会自由式滑雪女子坡面障碍技巧决赛中,中国选手谷爱凌夺得银牌。祝贺谷爱凌!今天上午,自由式滑雪女子坡面障碍技巧决赛举行。决赛分三轮进行,取选手最佳成绩排名决出奖牌。第一跳,中国选手谷爱凌获得69.90分。在12位选手中排名第三。完成动作后,谷爱凌又扮了个鬼脸,甚是可爱。第二轮中,谷爱凌在道具区第三个障碍处失误,落地时摔倒。获得16.98分。网友:摔倒了也没关系,继续加油!在第二跳失误摔倒的情况下,谷爱凌顶住压力,第三跳稳稳发挥,流畅落地!获得86.23分!此轮比赛,共12位选手参赛,谷爱凌第10位出场。网友:看比赛时我比谷爱凌紧张,加油!" +tokenized = tokenizer(inputs, return_tensors="pd") +outputs, _ = model.generate( + input_ids=tokenized["input_ids"], + decode_strategy="beam_search", + num_beams=4, + use_fp16_decoding=True, + use_fast=True, +) +result = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False) + +print("Model input:", inputs) +print("Result:", result) diff --git a/fast_generation/samples/plato_sample.py b/fast_generation/samples/plato_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..ac79e60918e49a9596e2b6a49739f58af6817fa7 --- /dev/null +++ b/fast_generation/samples/plato_sample.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + +model_name = "plato-mini" + +tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name) +model = UnifiedTransformerLMHeadModel.from_pretrained(model_name) +model.eval() + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. 
Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.sep_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +inputs = "你好啊,你今年多大了" + +inputs_ids = tokenizer.dialogue_encode( + inputs, add_start_token_as_response=True, return_tensors=True, is_split_into_words=False +) + +outputs, _ = model.generate( + input_ids=inputs_ids["input_ids"], + token_type_ids=inputs_ids["token_type_ids"], + position_ids=inputs_ids["position_ids"], + attention_mask=inputs_ids["attention_mask"], + max_length=64, + decode_strategy="sampling", + top_k=5, + use_fast=True, +) + +result = postprocess_response(outputs[0].numpy(), tokenizer) +result = "".join(result) + +print("Model input:", inputs) +print("Result:", result) +# 我今年23岁了,你今年多大了? diff --git a/fast_generation/samples/plato_xl_sample.py b/fast_generation/samples/plato_xl_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..b7c91f4fc9217215c3e71169aac8cc76e08e4b39 --- /dev/null +++ b/fast_generation/samples/plato_xl_sample.py @@ -0,0 +1,162 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +from distutils.util import strtobool +from pprint import pprint + +import paddle + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.ops import enable_ft_para, get_ft_para_conf +from paddlenlp.transformers import ( + UnifiedTransformerLMHeadModel, + UnifiedTransformerTokenizer, +) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--use_role", type=strtobool, default=True, help="Whether to use role embeddings.") + parser.add_argument( + "--position_style", + default="relative", + choices=["continuous", "relative"], + type=str, + help="The type for positional embedding. Default is relative.", + ) + parser.add_argument("--batch_size", default=1, type=int, help="Batch size.") + parser.add_argument( + "--num_return_sequences", default=1, type=int, help="The number of returned sequences for each sample." + ) + parser.add_argument("--max_out_len", default=64, type=int, help="Maximum output sequence length.") + parser.add_argument("--min_out_len", default=1, type=int, help="Minimum output sequence length.") + parser.add_argument( + "--topk", default=1, type=int, help="The number of highest probability tokens to keep for top-k-sampling." 
+ ) + parser.add_argument("--topp", default=1.0, type=float, help="The cumulative probability for top-p-filtering.") + parser.add_argument("--temperature", default=1.0, type=float, help="The temperature to set.") + parser.add_argument("--use_fp16", action="store_true", help="Whether to use fp16 to predict.") + parser.add_argument("--profile", action="store_true", help="Whether to profile.") + args = parser.parse_args() + return args + + +def profile(batch_size, total_step=50, warmup_step=10, rank=0): + def _wrapper(func): + def _impl(*args, **kwargs): + for i in range(total_step): + if i == warmup_step: + paddle.device.cuda.synchronize() + start_time = time.time() + out = func(*args, **kwargs) + paddle.device.cuda.synchronize() + end_time = time.time() + if rank is None or get_ft_para_conf().rank == rank: + time_interval = end_time - start_time + num_batch = total_step - warmup_step + print("Latency: %2fs, QPS: %2f" % (time_interval / num_batch, num_batch * batch_size / time_interval)) + return out + + return _impl + + return _wrapper + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.sep_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + response = " ".join(tokens) + return response + + +def main(args): + # For memory saving when using FastGeneration: + # If environment variable `PPFG_QKV_MEM_OPT` is set and the weights of q/k/v + # is fused, it will try to delete the original unfused weights. Note the + # rollback to original model would not be guarantee anymore when the fast + # model failed if the original weights are deleted. + os.environ["PPFG_QKV_MEM_OPT"] = "1" + if args.use_fp16: + paddle.set_default_dtype("float16") + enable_ft_para() + # TODO(guosheng): Maybe device can be set in `enable_ft_para` + paddle.set_device("gpu:" + str(get_ft_para_conf().rank)) + + if args.profile: + UnifiedTransformerLMHeadModel.generate = profile(args.batch_size)(UnifiedTransformerLMHeadModel.generate) + tokenizer = UnifiedTransformerTokenizer.from_pretrained("plato-xl") + model = UnifiedTransformerLMHeadModel.from_pretrained("plato-xl", load_state_as_np=True) + model.eval() + + history = [ + "hi , Mary ! What do you usually like to do in your spare time ?", + "well , I spend a lot of time watching movies .", + "what a confidence ! I always watch a lot of movies , too ." + "oh really , Frank ? 
What kind of movies do you like ?", + ] + inputs = [history] * args.batch_size + inputs = list( + map( + lambda history: tokenizer.dialogue_encode( + history=history, + add_start_token_as_response=True, + return_length=True, + return_role_ids=args.use_role, + position_style=args.position_style, + ), + inputs, + ) + ) + collator = DataCollatorWithPadding(tokenizer) + data = collator(inputs) + + outputs, _ = model.generate( + input_ids=data["input_ids"], + token_type_ids=data["token_type_ids"], + position_ids=data["position_ids"], + attention_mask=data["attention_mask"].cast("float32"), # TODO(guosheng): remove this cast + role_ids=data.get("role_ids", None), + seq_len=data["seq_len"], + max_length=args.max_out_len, + min_length=args.min_out_len, + decode_strategy="sampling", + top_k=args.topk, + top_p=args.topp, + temperature=args.temperature, + num_return_sequences=args.num_return_sequences, + use_fast=True, + use_fp16_decoding=args.use_fp16, + ) + + # Only make the first process to output. + if get_ft_para_conf().rank == 0: + for i in range(len(outputs)): + result = postprocess_response(outputs[i].numpy(), tokenizer) + print("Result:", result) + + +if __name__ == "__main__": + args = parse_args() + pprint(args) + main(args) diff --git a/fast_generation/samples/t5_sample.py b/fast_generation/samples/t5_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..53ad13f903c115e647411c0edb56dab752653a73 --- /dev/null +++ b/fast_generation/samples/t5_sample.py @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +from paddlenlp.transformers import T5ForConditionalGeneration, T5Tokenizer + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--max_length", default=256, type=int, help="Maximum output sequence length.") + parser.add_argument("--beam_size", default=4, type=int, help="The beam size to set.") + parser.add_argument("--use_faster", action="store_true", help="Whether to use faster to predict.") + parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 to predict.") + args = parser.parse_args() + return args + + +def predict(args): + model_name = "t5-base" + + model = T5ForConditionalGeneration.from_pretrained(model_name) + model.eval() + tokenizer = T5Tokenizer.from_pretrained(model_name) + + en_text = ' This image section from an infrared recording by the Spitzer telescope shows a "family portrait" of countless generations of stars: the oldest stars are seen as blue dots. 
' + input_ids = tokenizer.encode("translate English to French: " + en_text, return_tensors="pd")["input_ids"] + + output, _ = model.generate( + input_ids=input_ids, + num_beams=args.beam_size, + max_length=args.max_length, + decode_strategy="beam_search", + use_fast=True, # args.use_faster, + use_fp16_decoding=args.use_fp16_decoding, + ) + + translation = tokenizer.decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False) + + print("The original sentence: ", en_text) + print("The translation result: ", translation) + + +if __name__ == "__main__": + args = parse_args() + + predict(args) diff --git a/fast_generation/samples/unimo_text_sample.py b/fast_generation/samples/unimo_text_sample.py new file mode 100644 index 0000000000000000000000000000000000000000..29197be47e52b209160e9fe5c52586e9b265219b --- /dev/null +++ b/fast_generation/samples/unimo_text_sample.py @@ -0,0 +1,59 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer + +model_name = "unimo-text-1.0-lcsts-new" + +model = UNIMOLMHeadModel.from_pretrained(model_name) +model.eval() +tokenizer = UNIMOTokenizer.from_pretrained(model_name) + + +def postprocess_response(token_ids, tokenizer): + """Post-process the decoded sequence. Truncate from the first .""" + eos_pos = len(token_ids) + for i, tok_id in enumerate(token_ids): + if tok_id == tokenizer.mask_token_id: + eos_pos = i + break + token_ids = token_ids[:eos_pos] + tokens = tokenizer.convert_ids_to_tokens(token_ids) + tokens = tokenizer.merge_subword(tokens) + return tokens + + +inputs = "深度学习是人工智能的核心技术领域。百度飞桨作为中国首个自主研发、功能丰富、开源开放的产业级深度学习平台,将从多层次技术产品、产业AI人才培养和强大的生态资源支持三方面全面护航企业实现快速AI转型升级。" + +inputs_ids = tokenizer.gen_encode( + inputs, add_start_token_for_decoding=True, return_tensors=True, is_split_into_words=False +) + +outputs, _ = model.generate( + input_ids=inputs_ids["input_ids"], + token_type_ids=inputs_ids["token_type_ids"], + position_ids=inputs_ids["position_ids"], + attention_mask=inputs_ids["attention_mask"], + max_length=64, + decode_strategy="beam_search", + num_beams=2, + use_fast=True, +) + +result = postprocess_response(outputs[0].numpy(), tokenizer) +result = "".join(result) + +print("Model input:", inputs) +print("Result:", result) +# 百度飞桨:深度学习助力企业转型升级 diff --git a/fast_tokenizer/CMakeLists.txt b/fast_tokenizer/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..ce238239ae7f88c87ad301c39cef25a287c64e0e --- /dev/null +++ b/fast_tokenizer/CMakeLists.txt @@ -0,0 +1,235 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +cmake_minimum_required(VERSION 3.10) + +project(tokenizers LANGUAGES CXX C VERSION 1.0) + +option(WITH_TESTING "Compile PaddleNLP fast_tokenizer with unit testing" OFF) +option(WITH_PYTHON "Compile PaddleNLP fast_tokenizer with python interpreter" ON) +option(WITH_ICU_LITE "Compile PaddleNLP fast_tokenizer with lite icu library" OFF) +option(USE_ABI0 "Set -D_GLIBCXX_USE_CXX11_ABI to 0" OFF) + +add_definitions(-DFASTTOKENIZER_LIB) + +set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/cmake") +set (PUBLIC_DEPEND_LIBS "") + +if(NOT CMAKE_BUILD_TYPE) + set(CMAKE_BUILD_TYPE "Release" CACHE STRING + "Choose the type of build, options are: Debug Release +RelWithDebInfo MinSizeRel." + FORCE) +endif(NOT CMAKE_BUILD_TYPE) + +if(NOT WIN32) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --std=c++11") + if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "aarch64") + set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,--no-as-needed") + endif() + if (USE_ABI0) + message(STATUS "-D_GLIBCXX_USE_CXX11_ABI will be set to 0.") + add_definitions(-D_GLIBCXX_USE_CXX11_ABI=0) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=0") + else() + message(STATUS "-D_GLIBCXX_USE_CXX11_ABI will be set to 1.") + add_definitions(-D_GLIBCXX_USE_CXX11_ABI=1) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=1") + endif() +else() + set(CMAKE_CXX_STANDARD 11) +endif() + +IF(WIN32) +# Need to add flags for windows +foreach( + flag_var + CMAKE_CXX_FLAGS + CMAKE_CXX_FLAGS_DEBUG + CMAKE_CXX_FLAGS_RELEASE + CMAKE_CXX_FLAGS_MINSIZEREL + CMAKE_CXX_FLAGS_RELWITHDEBINFO + CMAKE_C_FLAGS + CMAKE_C_FLAGS_DEBUG + CMAKE_C_FLAGS_RELEASE + CMAKE_C_FLAGS_MINSIZEREL + CMAKE_C_FLAGS_RELWITHDEBINFO) + if(${flag_var} MATCHES "/MD") + string(REGEX REPLACE "/MD" "/MT" ${flag_var} "${${flag_var}}") + elseif(NOT (${flag_var} MATCHES "/MT")) + set(${flag_var} "${${flag_var}} /MT") + endif() + set(${flag_var} "${${flag_var}}" CACHE STRING "msvc compiler flags" FORCE) +endforeach() + +add_definitions("-DNOMINMAX") +# windows build turn off warnings, use parallel compiling. 
+ +foreach( + flag_var + CMAKE_CXX_FLAGS + CMAKE_CXX_FLAGS_DEBUG + CMAKE_CXX_FLAGS_RELEASE + CMAKE_CXX_FLAGS_MINSIZEREL + CMAKE_CXX_FLAGS_RELWITHDEBINFO + CMAKE_C_FLAGS + CMAKE_C_FLAGS_DEBUG + CMAKE_C_FLAGS_RELEASE + CMAKE_C_FLAGS_MINSIZEREL + CMAKE_C_FLAGS_RELWITHDEBINFO) + string(REGEX REPLACE "/W[1-4]" " /W0 " ${flag_var} "${${flag_var}}") +endforeach() + +foreach(flag_var CMAKE_CXX_FLAGS CMAKE_C_FLAGS) + set(${flag_var} "${${flag_var}} /w") +endforeach() + +# Windows Remove /Zi, /ZI for Release, MinSizeRel builds +foreach(flag_var + CMAKE_C_FLAGS CMAKE_C_FLAGS_RELEASE CMAKE_C_FLAGS_MINSIZEREL + CMAKE_CXX_FLAGS CMAKE_CXX_FLAGS_RELEASE CMAKE_CXX_FLAGS_MINSIZEREL) +if(${flag_var} MATCHES "/Z[iI]") + string(REGEX REPLACE "/Z[iI]" "" ${flag_var} "${${flag_var}}") +endif() +endforeach() + +set(CMAKE_SUPPRESS_REGENERATION ON) +set(CMAKE_STATIC_LIBRARY_PREFIX lib) + +set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /bigobj") +set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} /bigobj") +set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /bigobj") +set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /bigobj") + +if("${CMAKE_GENERATOR}" STREQUAL "Ninja") + set(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /Zc:inline") + set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} /Zc:inline") + set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /Zc:inline") + set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} /Zc:inline") +endif() + +foreach(flag_var CMAKE_SHARED_LINKER_FLAGS CMAKE_STATIC_LINKER_FLAGS + CMAKE_EXE_LINKER_FLAGS CMAKE_LINKER_FLAGS) +set(${flag_var} + "${${flag_var}} /ignore:4049 /ignore:4217 /ignore:4006 /ignore:4221") +set(${flag_var} "${${flag_var}} /NODEFAULTLIB:MSVCRT.LIB") +endforeach() + +ELSE(WIN32) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -fPIC") + IF (NOT APPLE) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ldl") + IF (NOT ANDROID) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lpthread") + ELSE() + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Os") + ENDIF() + ENDIF() + set (PUBLIC_DEPEND_LIBS ${CMAKE_DL_LIBS}) +ENDIF(WIN32) + +set(CMAKE_INSTALL_PREFIX ${PROJECT_SOURCE_DIR}) +set(TOKENIZERS_INSTALL_INCLUDE_DIR ${PROJECT_SOURCE_DIR}) + +message("CMAKE_BUILD_TYPE = " ${CMAKE_BUILD_TYPE}) +message("CMAKE_CXX_FLAGS = ${CMAKE_CXX_FLAGS}") +message("CMAKE_EXE_LINKER_FLAGS = ${CMAKE_EXE_LINKER_FLAGS}") + +# config GIT_URL with github mirrors to speed up dependent repos clone +option(GIT_URL "Git URL to clone dependent repos" ${GIT_URL}) +if(NOT GIT_URL) + set(GIT_URL "https://github.com") +endif() + +include_directories(${TOKENIZERS_INSTALL_INCLUDE_DIR}) + +include(generic) +include(third_party) + +add_subdirectory(fast_tokenizer) + +if(WITH_PYTHON) + +add_subdirectory(python) + +if (NOT APPLE AND NOT WIN32) # Linux +add_custom_target(build_tokenizers_bdist_wheel ALL + COMMAND ${PYTHON_EXECUTABLE} setup.py bdist_wheel --plat-name=manylinux1_x86_64 + COMMENT "Packing whl packages------>>>" + DEPENDS copy_python_tokenizers) +else() +add_custom_target(build_tokenizers_bdist_wheel ALL + COMMAND ${CMAKE_COMMAND} -E env ${py_env} ${PYTHON_EXECUTABLE} setup.py bdist_wheel + COMMENT "Packing whl packages------>>>" + DEPENDS copy_python_tokenizers) +endif() + +else(WITH_PYTHON) # Pack fast_tokenizer cpp lib + +set(CPP_PACKAGE_DIR ${CMAKE_BINARY_DIR}/cpp/fast_tokenizer) +add_custom_target(build_cpp_package_dir ALL + COMMAND ${CMAKE_COMMAND} -E make_directory ${CPP_PACKAGE_DIR}/lib ${CPP_PACKAGE_DIR}/include ${CPP_PACKAGE_DIR}/third_party/include ${CPP_PACKAGE_DIR}/third_party/lib + DEPENDS 
core_tokenizers) + +# copy cmake +configure_file(${PROJECT_SOURCE_DIR}/FastTokenizer.cmake.in ${PROJECT_SOURCE_DIR}/FastTokenizer.cmake @ONLY) +file(COPY ${PROJECT_SOURCE_DIR}/FastTokenizer.cmake DESTINATION ${CPP_PACKAGE_DIR}/) + +# copy headers +file(COPY ${PROJECT_SOURCE_DIR}/fast_tokenizer/ DESTINATION ${CPP_PACKAGE_DIR}/include/fast_tokenizer/ + FILES_MATCHING PATTERN "*.h" + PATTERN "test" EXCLUDE + PATTERN "demo" EXCLUDE + PATTERN "pybind" EXCLUDE) + +add_custom_target(copy_third_party_headers ALL + COMMAND ${CMAKE_COMMAND} -E copy_directory + ${GFLAGS_INCLUDE_DIR} ${ICU_INCLUDE_DIR} ${DART_INCLUDE_DIR} + ${GLOG_INCLUDE_DIR} ${JSON_INCLUDE_DIR} ${RE2_INCLUDE_DIR} + ${CPP_PACKAGE_DIR}/third_party/include + DEPENDS build_cpp_package_dir) + +# copy library +set(TOKENIZER_CORE_NAME "core_tokenizers") +set(TOKENIZER_CORE_PATH ${CMAKE_BINARY_DIR}/fast_tokenizer) +if (WIN32) + set(ICU_DLL_DIR ${CMAKE_BINARY_DIR}/third_party/icu/src/extern_icu/icu4c/bin64) + set(ICU_LIB_DIR ${CMAKE_BINARY_DIR}/third_party/icu/src/extern_icu/icu4c/lib64) + add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_CORE_PATH}/${TOKENIZER_CORE_NAME}.dll ${TOKENIZER_CORE_PATH}/${TOKENIZER_CORE_NAME}.lib ${CPP_PACKAGE_DIR}/lib + COMMAND ${CMAKE_COMMAND} -E copy ${ICU_DLL_DIR}/icudt70.dll ${ICU_DLL_DIR}/icuuc70.dll ${ICU_LIB_DIR}/icudt.lib ${ICU_LIB_DIR}/icuuc.lib ${CPP_PACKAGE_DIR}/third_party/lib + DEPENDS build_cpp_package_dir core_tokenizers) +elseif(APPLE) + set(TOKENIZER_CORE_LIBS_PATH "${TOKENIZER_CORE_PATH}/lib${TOKENIZER_CORE_NAME}.dylib") + add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_CORE_LIBS_PATH} ${CPP_PACKAGE_DIR}/lib + DEPENDS build_cpp_package_dir core_tokenizers) +elseif(ANDROID) + set(TOKENIZER_CORE_LIBS_PATH "${TOKENIZER_CORE_PATH}/lib${TOKENIZER_CORE_NAME}.so") + add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_CORE_LIBS_PATH} ${CPP_PACKAGE_DIR}/lib + COMMAND ${CMAKE_STRIP} --strip-all ${CPP_PACKAGE_DIR}/lib/lib${TOKENIZER_CORE_NAME}.so + DEPENDS build_cpp_package_dir core_tokenizers) +else() + set(TOKENIZER_CORE_LIBS_PATH "${TOKENIZER_CORE_PATH}/lib${TOKENIZER_CORE_NAME}.so") + add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_CORE_LIBS_PATH} ${CPP_PACKAGE_DIR}/lib + DEPENDS build_cpp_package_dir core_tokenizers) +endif() + +add_custom_target(create_commit_id_file ALL + COMMAND ${GIT_EXECUTABLE} log -1 --format=%H > ${CPP_PACKAGE_DIR}/commit.log + DEPENDS copy_shared_library) +endif(WITH_PYTHON) + diff --git a/fast_tokenizer/FastTokenizer.cmake.in b/fast_tokenizer/FastTokenizer.cmake.in new file mode 100644 index 0000000000000000000000000000000000000000..cf9c341867686ff40f59cb5cb392c4f02e4f5211 --- /dev/null +++ b/fast_tokenizer/FastTokenizer.cmake.in @@ -0,0 +1,44 @@ +CMAKE_MINIMUM_REQUIRED (VERSION 3.12) + +set(USE_ABI0 @USE_ABI0@) + +if (NOT WIN32) + if (USE_ABI0) + add_definitions(-D_GLIBCXX_USE_CXX11_ABI=0) + else() + add_definitions(-D_GLIBCXX_USE_CXX11_ABI=1) + endif() +endif() + + +if(NOT CMAKE_BUILD_TYPE) + set(CMAKE_BUILD_TYPE "Release" CACHE STRING + "Choose the type of build, options are: Debug Release +RelWithDebInfo MinSizeRel." 
+ FORCE) +endif(NOT CMAKE_BUILD_TYPE) + +if(NOT WIN32) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --std=c++11") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -Wno-narrowing") +else() + set(CMAKE_CXX_STANDARD 11) +endif() + +set(LIBRARY_NAME core_tokenizers) + +set(FAST_TOKENIZER_INCS "") +list(APPEND FAST_TOKENIZER_INCS ${CMAKE_CURRENT_LIST_DIR}/include) +list(APPEND FAST_TOKENIZER_INCS ${CMAKE_CURRENT_LIST_DIR}/third_party/include) + +set(FAST_TOKENIZER_LIBS "") +find_library(FTLIB ${LIBRARY_NAME} ${CMAKE_CURRENT_LIST_DIR}/lib NO_DEFAULT_PATH) +list(APPEND FAST_TOKENIZER_LIBS ${FTLIB}) + +if (WIN32) +find_library(ICUDT icudt ${CMAKE_CURRENT_LIST_DIR}/third_party/lib NO_DEFAULT_PATH) +list(APPEND FAST_TOKENIZER_LIBS ${ICUDT}) + +find_library(ICUUC icuuc ${CMAKE_CURRENT_LIST_DIR}/third_party/lib NO_DEFAULT_PATH) +list(APPEND FAST_TOKENIZER_LIBS ${ICUUC}) +endif() diff --git a/fast_tokenizer/LICENSE b/fast_tokenizer/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..d945753f93c1bade8953a38fa2141850b174b26b --- /dev/null +++ b/fast_tokenizer/LICENSE @@ -0,0 +1,203 @@ +Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/fast_tokenizer/Makefile b/fast_tokenizer/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..4823a02a3cc188ee98ce9709370d05f6e71b3cbb --- /dev/null +++ b/fast_tokenizer/Makefile @@ -0,0 +1,38 @@ +# Makefile for fast_tokenizer +# +# GitHb: https://github.com/PaddlePaddle/PaddleNLP +# Author: Paddle Team https://github.com/PaddlePaddle +# + +# Compile and test for fast_tokenizer cpp library + +.PHONY: fast_tokenizer_cpp_compile + +fast_tokenizer_cpp_compile: + mkdir -p build_cpp && cd build_cpp && \ + cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=ON -DCMAKE_BUILD_TYPE=Release && \ + make -j4 + +.PHONY: fast_tokenizer_cpp_test + +fast_tokenizer_cpp_test: + bash run_fast_tokenizer_cpp_test.sh build_cpp/fast_tokenizer/test + +# Compile and test for fast_tokenizer python library + +.PHONY: fast_tokenizer_python_install + +fast_tokenizer_python_install: + pip install numpy wheel pytest paddlepaddle .. 
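The `fast_tokenizer_python_*` targets below build the wheel and run its pytest suite; after installing that wheel, a quick import-level smoke test might look like the sketch here (it only relies on `set_thread_num`, which the FastTokenizer README later in this patch documents).

```python
# Minimal post-install smoke test for the fast_tokenizer wheel (a sketch, not
# part of the test suite): import the package and set the tokenization thread count.
import fast_tokenizer

fast_tokenizer.set_thread_num(1)
print("fast_tokenizer imported and configured")
```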
+ +.PHONY: fast_tokenizer_python_compile + +fast_tokenizer_python_compile: + mkdir -p build_py && cd build_py && \ + cmake .. -DWITH_PYTHON=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release && \ + make -j4 + +.PHONY: fast_tokenizer_python_test + +fast_tokenizer_python_test: + pip install build_py/dist/*whl && pytest build_py/python/tests diff --git a/fast_tokenizer/README.md b/fast_tokenizer/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6b8d18e6f56ff7c53e42deb2669a8d3ff7fc544b --- /dev/null +++ b/fast_tokenizer/README.md @@ -0,0 +1,150 @@ + +# ⚡ FastTokenizer:高性能文本处理库 + +------------------------------------------------------------------------------------------ + +


+ + +FastTokenizer 是一款简单易用、功能强大的跨平台高性能文本预处理库,集成业界多个常用的 Tokenizer 实现,支持不同 NLP 场景下的文本预处理功能,如文本分类、阅读理解,序列标注等。在 Python 端结合 PaddleNLP Tokenizer 模块,为用户在训练、推理阶段提供高效通用的文本预处理能力。 + +![fast-tokenizer-framework](https://user-images.githubusercontent.com/10826371/215277623-1fcd84e5-1cc7-43a8-a33c-890103cf1cc8.png) + +## 特性 + +- 高性能。底层采用 C++ 实现,并且集成高性能分词算法 `FastWordPiece` [1],其性能远高于常规 Python 实现的 Tokenizer。在文本分类任务上,FastTokenizer 对比 Python 版本 Tokenizer 加速比最高可达20倍。支持多线程加速多文本批处理分词。默认使用单线程分词。 +- 可拓展性强。用户可以通过指定不同的 `Normalizer`, `PreTokenizer`, `Model` 以及 `PostProcessor` 组件自定义 Tokenizer。可在 [FastTokenizer Pipeline](docs/pipeline/README.md) 文档了解更多关于组件的介绍以及使用方式。 +- 跨平台。FastTokenizer 可在不同的系统平台上使用,目前已支持 Windows x64,Linux x64 以及 MacOS 10.14+ 平台上使用。 +- 支持多编程语言。FastTokenizer 提供在 [C++](./docs/cpp/README.md)、[Python](./docs/python/README.md) 语言上开发的能力。 +- 包含所有预处理。覆盖绝大部分 Transformer 模型的 Tokenizer 所需要的功能,包括特殊 Tokens 的拼接、截断等。输入的原始文本经过 FastTokenizer 处理后得到的结果可直接输入到 Transformer 类模型。 + +## 快速开始 + +下面将介绍 Python 版本 FastTokenizer 的使用方式,C++ 版本的使用方式可参考 [FastTokenizer C++ 库使用教程](./docs/cpp/README.md)。 + +### 环境依赖 + +- Windows 64位系统 +- Linux x64系统 +- MacOS 10.14+系统( m1 芯片的 MacOS,需要使用 x86_64 版本的 Anaconda 作为 python 环境方可安装使用) +- Python 3.6 ~ 3.10 + +### 安装 FastTokenizer + +```python +pip install fast-tokenizer-python +``` + +### FastTokenizer 使用示例 + +- 准备词表 + +```shell +# Linux或者Mac用户可直接执行以下命令下载测试的词表,Windows 用户可在浏览器上下载到本地。 +wget https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt +``` + +- 切词示例 + +FastTokenizer 库内置 NLP 任务常用的 Tokenizer,如 ErnieFastTokenizer。下面将展示 FastTokenizer 的简单用法。 + +```python +import fast_tokenizer +from fast_tokenizer import ErnieFastTokenizer, models + +# 0.(可选)设置线程数 +fast_tokenizer.set_thread_num(1) +# 1. 加载词表 +vocab = models.WordPiece.read_file("ernie_vocab.txt") +# 2. 实例化 ErnieFastTokenizer 对象 +fast_tokenizer = ErnieFastTokenizer(vocab) +# 3. 切词 +output = fast_tokenizer.encode("我爱中国") +# 4. 输出结果 +print("ids: ", output.ids) +print("type_ids: ", output.type_ids) +print("tokens: ", output.tokens) +print("offsets: ", output.offsets) +print("attention_mask: ", output.attention_mask) + +# 5. 
示例输出 +# ids: [1, 75, 329, 12, 20, 2] +# type_ids: [0, 0, 0, 0, 0, 0] +# tokens: ['[CLS]', '我', '爱', '中', '国', '[SEP]'] +# offsets: [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (0, 0)] +# attention_mask: [1, 1, 1, 1, 1, 1] +``` + +### FastTokenizer 在 PaddleNLP Tokenizer 模块加速示例 + +PaddleNLP Tokenizer 模块可简单地应用在模型训练以及推理部署的文本预处理阶段,并通过 `AutoTokenizer.from_pretrained` 方式实例化相应的 Tokenizer 。其中 `AutoTokenizer` 默认加载得到的 Tokenizer 是常规 Python 实现的 Tokenizer,其性能会低于 C++ 实现的 FastTokenizer。为了提升 PaddleNLP Tokenizer 模块性能,目前 PaddleNLP Tokenizer 模块已经支持使用 FastTokenizer 作为 Tokenizer 的后端加速切词阶段。在现有的 Tokenizer 加载接口中,仅需添加 `use_fast=True` 这一关键词参数,其余代码保持不变,即可加载 Fast 版本的 Tokenizer,代码示例如下: + +```python +from paddlenlp.transformers import AutoTokenizer + +# 默认加载Python版本的Tokenizer +tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh') +# 打开use_fast开关,可加载Fast版本Tokenizer +fast_tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh', use_fast=True) + +text1 = tokenizer('自然语言处理') +text2 = fast_tokenizer('自然语言处理') + +print(text1) +print(text2) + +# 示例输出 +# {'input_ids': [1, 67, 187, 405, 545, 239, 38, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]} +# {'input_ids': [1, 67, 187, 405, 545, 239, 38, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]} + +``` + +目前 PaddleNLP 已支持 BERT、ERNIE、TinyBERT 以及 ERNIE-M 4种 Tokenizer 的 Fast 版本,其余模型的 Tokenizer 暂不支持 Fast 版本。 + +## FAQ + +**Q:我在 AutoTokenizer.from_pretrained 接口上已经打开 `use_fast=True` 开关,为什么文本预处理阶段性能上好像没有任何变化?** + +A:在有三种情况下,打开 `use_fast=True` 开关可能无法提升性能: + 1. 没有安装 fast_tokenizer 。若在没有安装 fast_tokenizer 库的情况下打开 `use_fast` 开关,PaddleNLP 会给出以下warning:"Can't find the fast_tokenizer package, please ensure install fast_tokenizer correctly. "。 + + 2. 加载的 Tokenizer 类型暂不支持 Fast 版本。目前支持4种 Tokenizer 的 Fast 版本,分别是 BERT、ERNIE、TinyBERT 以及 ERNIE-M Tokenizer。若加载不支持 Fast 版本的 Tokenizer 情况下打开 `use_fast` 开关,PaddleNLP 会给出以下warning:"The tokenizer XXX doesn't have the fast version. Please check the map paddlenlp.transformers.auto.tokenizer.FAST_TOKENIZER_MAPPING_NAMES to see which fast tokenizers are currently supported." + + 3. 待切词文本长度过短(如文本平均长度小于5)。这种情况下切词开销可能不是整个文本预处理的性能瓶颈,导致在使用 FastTokenizer 后仍无法提升整体性能。 + +**Q:如何使用多线程加速分词?** + +A:可以通过调用 `fast_tokenizer.set_thread_num(xxx)` 使用多线程进行分词。需要谨慎开启多线程加速分词,在以下场景下可以考虑开启多线程: + 1. CPU资源充足。若在推理阶段使用CPU进行推理,开启多线程分词可能会出现资源竞争情况,从而影响推理阶段的性能。 + + 2. 文本的批大小较大。若批大小比较小,开启多线程可能不会得到任何加速效果,并且可能会因为线程调度导致延时增长。建议批大小大于4的时候再考虑开启多线程分词。 + + 3. 文本长度较长。若文本长度较短,开启多线程可能不会得到任何加速效果,并且可能会因为线程调度导致延时增长。建议文本平均长度大于16的时候再考虑开启多线程分词。 + +**Q:Windows 上编译、运行示例出错。** 相关issue:[issues 4673](https://github.com/PaddlePaddle/PaddleNLP/issues/4673)。 + +A:FastTokenizer 支持 Linux、Windows 以及 MacOS 系统上运行,同一示例可以在不同的操作系统上运行。如果出现在其他系统编译运行没错,但在 Windows 上编译或者运行示例出错的问题,大概率是编译过程中遇到中文字符的编码问题,FastTokenizer 要求字符集必须为 UTF-8。可以参考Visual Studio的官方文档,设置源字符集为/utf-8解决:[/utf-8(将源字符集和执行字符集设置为 UTF-8)](https://learn.microsoft.com/zh-cn/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8?view=msvc-170)。 + +## 参考文献 + +- [1] Xinying Song, Alex Salcianuet al. 
"Fast WordPiece Tokenization", EMNLP, 2021 + +## 相关文档 + +[FastTokenizer Pipeline](docs/pipeline/README.md) + +[FastTokenizer 编译指南](docs/compile/README.md) + +[FastTokenizer C++ 库使用教程](./docs/cpp/README.md) + +[FastTokenizer Python 库使用教程](./docs/python/README.md) diff --git a/fast_tokenizer/cmake/ByproductsICU.cmake b/fast_tokenizer/cmake/ByproductsICU.cmake new file mode 100644 index 0000000000000000000000000000000000000000..3b68f08249bc0ca4846c05304e55ee08647bcb70 --- /dev/null +++ b/fast_tokenizer/cmake/ByproductsICU.cmake @@ -0,0 +1,54 @@ +# MIT License +# +# Copyright (c) 2018 The ViaDuck Project +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +function(GetICUByproducts ICU_PATH ICU_LIB_VAR ICU_INCLUDE_VAR ICU_BASE_NAMES_VAR) + # include directory + set(${ICU_INCLUDE_VAR} "${ICU_PATH}/include" PARENT_SCOPE) + + if (WIN32) + # windows basenames and pre/suffixes + set(ICU_LIB_BASE_NAMES dt in io tu uc) + + set(ICU_SHARED_PREFIX "lib") + set(ICU_STATIC_PREFIX "") + set(ICU_SHARED_SUFFIX ".dll.a") + set(ICU_STATIC_SUFFIX ".lib") + set(ICU_INSTALL_LIB "lib64") + else() + # unix basenames and pre/suffixes + set(ICU_LIB_BASE_NAMES i18n data uc io tu) + set(ICU_SHARED_PREFIX ${CMAKE_SHARED_LIBRARY_PREFIX}) + set(ICU_STATIC_PREFIX ${CMAKE_STATIC_LIBRARY_PREFIX}) + set(ICU_SHARED_SUFFIX ${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ICU_STATIC_SUFFIX ${CMAKE_STATIC_LIBRARY_SUFFIX}) + set(ICU_INSTALL_LIB "lib") + endif() + # add static and shared libs to the libraries variable + foreach(ICU_BASE_NAME ${ICU_LIB_BASE_NAMES}) + set(ICU_SHARED_LIB "${ICU_PATH}/${ICU_INSTALL_LIB}/${ICU_SHARED_PREFIX}icu${ICU_BASE_NAME}${ICU_SHARED_SUFFIX}") + set(ICU_STATIC_LIB "${ICU_PATH}/${ICU_INSTALL_LIB}/${ICU_STATIC_PREFIX}icu${ICU_BASE_NAME}${ICU_STATIC_SUFFIX}") + + if (ICU_STATIC) + list(APPEND ${ICU_LIB_VAR} ${ICU_STATIC_LIB}) + else() + list(APPEND ${ICU_LIB_VAR} ${ICU_SHARED_LIB}) + endif() + list(APPEND ${ICU_BASE_NAMES_VAR} ${ICU_BASE_NAME}) + endforeach() + set(${ICU_LIB_VAR} ${${ICU_LIB_VAR}} PARENT_SCOPE) + set(${ICU_BASE_NAMES_VAR} ${${ICU_BASE_NAMES_VAR}} PARENT_SCOPE) +endfunction() \ No newline at end of file diff --git a/fast_tokenizer/cmake/FindNumPy.cmake b/fast_tokenizer/cmake/FindNumPy.cmake new file mode 100644 index 0000000000000000000000000000000000000000..7815e86cb503cf91afdaba7fd4f0ec02d87132a1 --- /dev/null +++ b/fast_tokenizer/cmake/FindNumPy.cmake @@ -0,0 +1,52 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +# Find the Python NumPy package +# PYTHON_NUMPY_INCLUDE_DIR +# NUMPY_FOUND +# will be set by this script + +cmake_minimum_required(VERSION 2.6) + +if(NOT PYTHON_EXECUTABLE) + if(NumPy_FIND_QUIETLY) + find_package(PythonInterp QUIET) + else() + find_package(PythonInterp) + set(_numpy_out 1) + endif() +endif() + +if (PYTHON_EXECUTABLE) + # write a python script that finds the numpy path + file(WRITE ${PROJECT_BINARY_DIR}/FindNumpyPath.py + "try: import numpy; print(numpy.get_include())\nexcept:pass\n") + + # execute the find script + exec_program("${PYTHON_EXECUTABLE}" ${PROJECT_BINARY_DIR} + ARGS "FindNumpyPath.py" + OUTPUT_VARIABLE NUMPY_PATH) +elseif(_numpy_out) + message(STATUS "Python executable not found.") +endif(PYTHON_EXECUTABLE) + +find_path(PYTHON_NUMPY_INCLUDE_DIR numpy/arrayobject.h + HINTS "${NUMPY_PATH}" "${PYTHON_INCLUDE_PATH}") + +if(PYTHON_NUMPY_INCLUDE_DIR) + set(PYTHON_NUMPY_FOUND 1 CACHE INTERNAL "Python numpy found") +endif(PYTHON_NUMPY_INCLUDE_DIR) + +include(FindPackageHandleStandardArgs) +find_package_handle_standard_args(NumPy DEFAULT_MSG PYTHON_NUMPY_INCLUDE_DIR) diff --git a/fast_tokenizer/cmake/dummy.c.in b/fast_tokenizer/cmake/dummy.c.in new file mode 100644 index 0000000000000000000000000000000000000000..17ba4d3495eb41be61ab59425a6ddc49fa4389e3 --- /dev/null +++ b/fast_tokenizer/cmake/dummy.c.in @@ -0,0 +1,3 @@ +// Generated by @dummy_GENERATOR@. DO NOT EDIT!!! + +const char *dummy = "@dummy_CONTENT@"; \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/dart.cmake b/fast_tokenizer/cmake/external/dart.cmake new file mode 100644 index 0000000000000000000000000000000000000000..1e00807f774567db810304adc8c70f83dfea12b1 --- /dev/null +++ b/fast_tokenizer/cmake/external/dart.cmake @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +include(ExternalProject) + +set(DART_PREFIX_DIR ${THIRD_PARTY_PATH}/dart) +SET(DART_REPOSITORY ${GIT_URL}/s-yata/darts-clone.git) +SET(DART_TAG master) + +set(DART_INCLUDE_DIR ${THIRD_PARTY_PATH}/dart/src/extern_dart/include) +include_directories(${DART_INCLUDE_DIR}) + +ExternalProject_Add( + extern_dart + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${DART_REPOSITORY} + GIT_TAG ${DART_TAG} + PREFIX ${DART_PREFIX_DIR} + # If we explicitly leave the `UPDATE_COMMAND` of the ExternalProject_Add + # function in CMakeLists blank, it will cause another parameter GIT_TAG + # to be modified without triggering incremental compilation, and the + # third-party library version changes cannot be incorporated. 
+ # reference: https://cmake.org/cmake/help/latest/module/ExternalProject.html + UPDATE_COMMAND "" + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + INSTALL_COMMAND "" + TEST_COMMAND "" +) + +add_library(dart INTERFACE) + +add_dependencies(dart extern_dart) diff --git a/fast_tokenizer/cmake/external/gflags.cmake b/fast_tokenizer/cmake/external/gflags.cmake new file mode 100644 index 0000000000000000000000000000000000000000..cb7e0420045aecdcb467cac0eba872c5dbe76fb2 --- /dev/null +++ b/fast_tokenizer/cmake/external/gflags.cmake @@ -0,0 +1,121 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +INCLUDE(ExternalProject) + +SET(GFLAGS_PREFIX_DIR ${THIRD_PARTY_PATH}/gflags) +SET(GFLAGS_INSTALL_DIR ${THIRD_PARTY_PATH}/install/gflags) +SET(GFLAGS_INCLUDE_DIR "${GFLAGS_INSTALL_DIR}/include" CACHE PATH "gflags include directory." FORCE) +set(GFLAGS_REPOSITORY ${GIT_URL}/gflags/gflags.git) +set(GFLAGS_TAG "v2.2.2") +IF(WIN32) + set(GFLAGS_LIBRARIES "${GFLAGS_INSTALL_DIR}/lib/gflags_static.lib" CACHE FILEPATH "GFLAGS_LIBRARIES" FORCE) +ELSE(WIN32) + set(GFLAGS_LIBRARIES "${GFLAGS_INSTALL_DIR}/lib/libgflags.a" CACHE FILEPATH "GFLAGS_LIBRARIES" FORCE) + set(BUILD_COMMAND $(MAKE) --silent) + set(INSTALL_COMMAND $(MAKE) install) +ENDIF(WIN32) + +INCLUDE_DIRECTORIES(${GFLAGS_INCLUDE_DIR}) + +IF(ANDROID) +set(CROSS_COMPILE_CMAKE_ARGS + "-DCMAKE_SYSTEM_NAME=${CMAKE_SYSTEM_NAME}" + "-DCMAKE_SYSTEM_VERSION=${CMAKE_SYSTEM_VERSION}" + "-DCMAKE_ANDROID_ARCH_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DCMAKE_ANDROID_NDK=${CMAKE_ANDROID_NDK}" + "-DCMAKE_ANDROID_STL_TYPE=${CMAKE_ANDROID_STL_TYPE}" + "-DANDROID_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DANDROID_TOOLCHAIN=${ANDROID_TOOLCHAIN}" + "-DANDROID_STL=${CMAKE_ANDROID_STL_TYPE}" + "-DCMAKE_SYSTEM_PROCESSOR=${CMAKE_SYSTEM_PROCESSOR}" + "-DCMAKE_TOOLCHAIN_FILE=${CMAKE_ANDROID_NDK}/build/cmake/android.toolchain.cmake" + "-DCMAKE_ANDROID_NDK_TOOLCHAIN_VERSION=${CMAKE_ANDROID_NDK_TOOLCHAIN_VERSION}" + "-DANDROID_PLATFORM=android-${ANDROID_NATIVE_API_LEVEL}" + "-D__ANDROID_API__=${ANDROID_NATIVE_API_LEVEL}") + +ExternalProject_Add( + extern_gflags + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${GFLAGS_REPOSITORY} + GIT_TAG ${GFLAGS_TAG} + PREFIX ${GFLAGS_PREFIX_DIR} + UPDATE_COMMAND "" + BUILD_COMMAND ${BUILD_COMMAND} + INSTALL_COMMAND ${INSTALL_COMMAND} + CMAKE_ARGS ${CROSS_COMPILE_CMAKE_ARGS} + -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DBUILD_STATIC_LIBS=ON + -DCMAKE_INSTALL_PREFIX=${GFLAGS_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DBUILD_TESTING=OFF + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS 
-DCMAKE_INSTALL_PREFIX:PATH=${GFLAGS_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + BUILD_BYPRODUCTS ${GFLAGS_LIBRARIES} +) +ELSE() +ExternalProject_Add( + extern_gflags + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${GFLAGS_REPOSITORY} + GIT_TAG ${GFLAGS_TAG} + PREFIX ${GFLAGS_PREFIX_DIR} + UPDATE_COMMAND "" + BUILD_COMMAND ${BUILD_COMMAND} + INSTALL_COMMAND ${INSTALL_COMMAND} + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DBUILD_STATIC_LIBS=ON + -DCMAKE_INSTALL_PREFIX=${GFLAGS_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DBUILD_TESTING=OFF + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${GFLAGS_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + BUILD_BYPRODUCTS ${GFLAGS_LIBRARIES} +) +ENDIF() + +ADD_LIBRARY(gflags STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET gflags PROPERTY IMPORTED_LOCATION ${GFLAGS_LIBRARIES}) +ADD_DEPENDENCIES(gflags extern_gflags) + +# On Windows (including MinGW), the Shlwapi library is used by gflags if available. +if (WIN32) + include(CheckIncludeFileCXX) + check_include_file_cxx("shlwapi.h" HAVE_SHLWAPI) + if (HAVE_SHLWAPI) + set_property(GLOBAL PROPERTY OS_DEPENDENCY_MODULES shlwapi.lib) + endif(HAVE_SHLWAPI) +endif (WIN32) \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/glog.cmake b/fast_tokenizer/cmake/external/glog.cmake new file mode 100644 index 0000000000000000000000000000000000000000..2afc39608ce1afd99e3e17d25095ed08648f4ae8 --- /dev/null +++ b/fast_tokenizer/cmake/external/glog.cmake @@ -0,0 +1,117 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +INCLUDE(ExternalProject) + +SET(GLOG_PREFIX_DIR ${THIRD_PARTY_PATH}/glog) +SET(GLOG_INSTALL_DIR ${THIRD_PARTY_PATH}/install/glog) +SET(GLOG_INCLUDE_DIR "${GLOG_INSTALL_DIR}/include" CACHE PATH "glog include directory." FORCE) +SET(GLOG_REPOSITORY ${GIT_URL}/google/glog.git) +SET(GLOG_TAG v0.4.0) + +IF(WIN32) + SET(GLOG_LIBRARIES "${GLOG_INSTALL_DIR}/lib/glog.lib" CACHE FILEPATH "glog library." FORCE) + SET(GLOG_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4267 /wd4530") + add_definitions("/DGOOGLE_GLOG_DLL_DECL=") +ELSE(WIN32) + SET(GLOG_LIBRARIES "${GLOG_INSTALL_DIR}/lib/libglog.a" CACHE FILEPATH "glog library." 
FORCE) + SET(GLOG_CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS}) +ENDIF(WIN32) + +INCLUDE_DIRECTORIES(${GLOG_INCLUDE_DIR}) + +IF(ANDROID) +set(CROSS_COMPILE_CMAKE_ARGS + "-DCMAKE_SYSTEM_NAME=${CMAKE_SYSTEM_NAME}" + "-DCMAKE_SYSTEM_VERSION=${CMAKE_SYSTEM_VERSION}" + "-DCMAKE_ANDROID_ARCH_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DCMAKE_ANDROID_NDK=${CMAKE_ANDROID_NDK}" + "-DCMAKE_ANDROID_STL_TYPE=${CMAKE_ANDROID_STL_TYPE}" + "-DANDROID_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DANDROID_TOOLCHAIN=${ANDROID_TOOLCHAIN}" + "-DANDROID_STL=${CMAKE_ANDROID_STL_TYPE}" + "-DCMAKE_SYSTEM_PROCESSOR=${CMAKE_SYSTEM_PROCESSOR}" + "-DCMAKE_TOOLCHAIN_FILE=${CMAKE_ANDROID_NDK}/build/cmake/android.toolchain.cmake" + "-DCMAKE_ANDROID_NDK_TOOLCHAIN_VERSION=${CMAKE_ANDROID_NDK_TOOLCHAIN_VERSION}" + "-DANDROID_PLATFORM=android-${ANDROID_NATIVE_API_LEVEL}" + "-D__ANDROID_API__=${ANDROID_NATIVE_API_LEVEL}") + +ExternalProject_Add( + extern_glog + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${GLOG_REPOSITORY} + GIT_TAG ${GLOG_TAG} + DEPENDS gflags + PREFIX ${GLOG_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS ${CROSS_COMPILE_CMAKE_ARGS} + -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_CXX_FLAGS=${GLOG_CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_INSTALL_PREFIX=${GLOG_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR=${GLOG_INSTALL_DIR}/lib + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DWITH_GFLAGS=OFF + -DBUILD_TESTING=OFF + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${GLOG_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR:PATH=${GLOG_INSTALL_DIR}/lib + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + BUILD_BYPRODUCTS ${GLOG_LIBRARIES} +) +ELSE() +ExternalProject_Add( + extern_glog + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${GLOG_REPOSITORY} + GIT_TAG ${GLOG_TAG} + DEPENDS gflags + PREFIX ${GLOG_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_CXX_FLAGS=${GLOG_CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_INSTALL_PREFIX=${GLOG_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR=${GLOG_INSTALL_DIR}/lib + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DWITH_GFLAGS=OFF + -DBUILD_TESTING=OFF + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${GLOG_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR:PATH=${GLOG_INSTALL_DIR}/lib + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + BUILD_BYPRODUCTS ${GLOG_LIBRARIES} +) +ENDIF() + +ADD_LIBRARY(glog STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET glog PROPERTY IMPORTED_LOCATION ${GLOG_LIBRARIES}) +ADD_DEPENDENCIES(glog extern_glog gflags) +LINK_LIBRARIES(glog) \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/gtest.cmake b/fast_tokenizer/cmake/external/gtest.cmake new file mode 100644 index 0000000000000000000000000000000000000000..4b870934558739f43f572afc02c232ef392fa45c --- /dev/null +++ 
b/fast_tokenizer/cmake/external/gtest.cmake @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +IF(WITH_TESTING) + ENABLE_TESTING() +ENDIF() + +INCLUDE(GNUInstallDirs) +INCLUDE(ExternalProject) + +SET(GTEST_PREFIX_DIR ${THIRD_PARTY_PATH}/gtest) +SET(GTEST_INSTALL_DIR ${THIRD_PARTY_PATH}/install/gtest) +SET(GTEST_INCLUDE_DIR "${GTEST_INSTALL_DIR}/include" CACHE PATH "gtest include directory." FORCE) +set(GTEST_REPOSITORY ${GIT_URL}/google/googletest.git) +set(GTEST_TAG release-1.8.1) + +INCLUDE_DIRECTORIES(${GTEST_INCLUDE_DIR}) + +IF(WIN32) + set(GTEST_LIBRARIES + "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/gtest.lib" CACHE FILEPATH "gtest libraries." FORCE) + set(GTEST_MAIN_LIBRARIES + "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/gtest_main.lib" CACHE FILEPATH "gtest main libraries." FORCE) + string(REPLACE "/w " "" GTEST_CMAKE_C_FLAGS "${CMAKE_C_FLAGS}") + string(REPLACE "/w " "" GTEST_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}") + string(REPLACE "/W0 " "" GTEST_CMAKE_C_FLAGS "${GTEST_CMAKE_C_FLAGS}") + string(REPLACE "/W0 " "" GTEST_CMAKE_CXX_FLAGS "${GTEST_CMAKE_CXX_FLAGS}") +ELSE(WIN32) + set(GTEST_LIBRARIES + "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/libgtest.a" CACHE FILEPATH "gtest libraries." FORCE) + set(GTEST_MAIN_LIBRARIES + "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/libgtest_main.a" CACHE FILEPATH "gtest main libraries." 
FORCE) + set(GTEST_CMAKE_C_FLAGS "${CMAKE_C_FLAGS}") + set(GTEST_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}") +ENDIF(WIN32) + +ExternalProject_Add( + extern_gtest + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${GTEST_REPOSITORY} + GIT_TAG ${GTEST_TAG} + DEPENDS ${GTEST_DEPENDS} + PREFIX ${GTEST_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + -DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} + -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${GTEST_CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_INSTALL_PREFIX=${GTEST_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DBUILD_GMOCK=ON + -Dgtest_disable_pthreads=ON + -Dgtest_force_shared_crt=ON + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + ${EXTERNAL_OPTIONAL_ARGS} + CMAKE_CACHE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=${GTEST_INSTALL_DIR} + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + BUILD_BYPRODUCTS ${GTEST_LIBRARIES} + BUILD_BYPRODUCTS ${GTEST_MAIN_LIBRARIES} +) + +ADD_LIBRARY(gtest STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET gtest PROPERTY IMPORTED_LOCATION ${GTEST_LIBRARIES}) +ADD_DEPENDENCIES(gtest extern_gtest) + +ADD_LIBRARY(gtest_main STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET gtest_main PROPERTY IMPORTED_LOCATION ${GTEST_MAIN_LIBRARIES}) +ADD_DEPENDENCIES(gtest_main extern_gtest) \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/icu.cmake b/fast_tokenizer/cmake/external/icu.cmake new file mode 100644 index 0000000000000000000000000000000000000000..cd604d384ef6e4fd97ec0500aba7d632de69840c --- /dev/null +++ b/fast_tokenizer/cmake/external/icu.cmake @@ -0,0 +1,138 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+include(CMakeParseArguments) +include(ExternalProject) +include (ByproductsICU) +SET(ICU_PREFIX_DIR ${THIRD_PARTY_PATH}/icu) +SET(ICU_INSTALL_DIR ${THIRD_PARTY_PATH}/install/icu) +if(ANDROID) + set(ICU_URL_PREFIX "https://bj.bcebos.com/fastdeploy/test") + # check ABI, toolchain + if((NOT ANDROID_ABI MATCHES "armeabi-v7a") AND (NOT ANDROID_ABI MATCHES "arm64-v8a")) + message(FATAL_ERROR "FastTokenizer for Android only support armeabi-v7a, arm64-v8a now.") + endif() + if(NOT ANDROID_TOOLCHAIN MATCHES "clang") + message(FATAL_ERROR "Currently, only support clang toolchain while cross compiling FastTokenizer for Android, but found ${ANDROID_TOOLCHAIN}.") + endif() + if (WITH_ICU_LITE) + set(ICU_REPOSITORY ${ICU_URL_PREFIX}/icu-lite-android-${ANDROID_ABI}.tgz) + else() + set(ICU_REPOSITORY ${ICU_URL_PREFIX}/icu-android-${ANDROID_ABI}.tgz) + endif() +else() + SET(ICU_REPOSITORY ${GIT_URL}/unicode-org/icu.git) +endif() +SET(ICU_TAG release-70-1) +set(FIND_OR_BUILD_ICU_DIR ${CMAKE_CURRENT_LIST_DIR}) + +set(HOST_CFLAGS "${CMAKE_C_FLAGS}") +set(HOST_CXXFLAGS "${CMAKE_CXX_FLAGS}") +set(HOST_CC "${CMAKE_C_COMPILER}") +set(HOST_CXX "${CMAKE_CXX_COMPILER}") +set(HOST_LDFLAGS "${CMAKE_MODULE_LINKER_FLAGS}") + +set(HOST_ENV_CMAKE ${CMAKE_COMMAND} -E env + CC=${HOST_CC} + CXX=${HOST_CXX} + CFLAGS=${HOST_CFLAGS} + CXXFLAGS=${HOST_CXXFLAGS} + LDFLAGS=${HOST_LDFLAGS} +) + +# predict host libraries +set(ICU_STATIC TRUE) +GetICUByproducts(${ICU_INSTALL_DIR} ICU_LIBRARIES ICU_INCLUDE_DIRS ICU_BASE_NAMES) +INCLUDE_DIRECTORIES(${ICU_INCLUDE_DIRS}) + +if(WIN32) +ExternalProject_Add( + extern_icu + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${ICU_REPOSITORY} + GIT_TAG ${ICU_TAG} + GIT_PROGRESS 1 + PREFIX ${ICU_PREFIX_DIR} + UPDATE_COMMAND "" + CONFIGURE_COMMAND msbuild ..\\extern_icu\\icu4c\\source\\allinone\\allinone.sln /p:Configuration=Release /p:Platform=x64 /p:RuntimeLibrary=MT_StaticRelease /p:SkipUWP=true + BUILD_COMMAND "" + INSTALL_COMMAND ${CMAKE_COMMAND} -E copy_directory ../extern_icu/icu4c/include ${ICU_INSTALL_DIR}/include + && ${CMAKE_COMMAND} -E copy_directory ../extern_icu/icu4c/lib64 ${ICU_INSTALL_DIR}/lib64 + BUILD_BYPRODUCTS ${ICU_LIBRARIES} +) +elseif(APPLE) +ExternalProject_Add( + extern_icu + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${ICU_REPOSITORY} + GIT_TAG ${ICU_TAG} + GIT_PROGRESS 1 + PREFIX ${ICU_PREFIX_DIR} + UPDATE_COMMAND "" + CONFIGURE_COMMAND ${HOST_ENV_CMAKE} ../extern_icu/icu4c/source/runConfigureICU "MacOSX/GCC" --enable-static --disable-shared --enable-rpath + BUILD_COMMAND make -j4 + INSTALL_COMMAND make install prefix="" DESTDIR=${ICU_INSTALL_DIR} install + BUILD_BYPRODUCTS ${ICU_LIBRARIES} +) +elseif(ANDROID) +ExternalProject_Add( + extern_icu + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + URL ${ICU_REPOSITORY} + PREFIX ${ICU_PREFIX_DIR} + CONFIGURE_COMMAND "" + UPDATE_COMMAND "" + BUILD_COMMAND "" + INSTALL_COMMAND + ${CMAKE_COMMAND} -E remove_directory ${ICU_INSTALL_DIR} && + ${CMAKE_COMMAND} -E make_directory ${ICU_INSTALL_DIR} && + ${CMAKE_COMMAND} -E rename ${ICU_PREFIX_DIR}/src/extern_icu/lib/ ${ICU_INSTALL_DIR}/lib && + ${CMAKE_COMMAND} -E copy_directory ${ICU_PREFIX_DIR}/src/extern_icu/include ${ICU_INSTALL_DIR}/include + BUILD_BYPRODUCTS ${ICU_LIBRARIES} +) +else() +ExternalProject_Add( + extern_icu + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${ICU_REPOSITORY} + GIT_TAG ${ICU_TAG} + GIT_PROGRESS 1 + PREFIX ${ICU_PREFIX_DIR} + UPDATE_COMMAND "" + CONFIGURE_COMMAND ${HOST_ENV_CMAKE} 
../extern_icu/icu4c/source/runConfigureICU "Linux/gcc" --enable-static --disable-shared --enable-rpath + BUILD_COMMAND make -j4 + INSTALL_COMMAND make install prefix="" DESTDIR=${ICU_INSTALL_DIR} install + BUILD_BYPRODUCTS ${ICU_LIBRARIES} +) +endif() + +list(LENGTH ICU_LIBRARIES ICU_LIB_LEN) +MATH(EXPR ICU_LIB_LEN "${ICU_LIB_LEN}-1") + +# icui18n icudata icuuc icuio icutu +foreach(ICU_IDX RANGE ${ICU_LIB_LEN}) + list(GET ICU_LIBRARIES ${ICU_IDX} ICU_LIB) + list(GET ICU_BASE_NAMES ${ICU_IDX} ICU_BASE_NAME) + ADD_LIBRARY("icu${ICU_BASE_NAME}" STATIC IMPORTED GLOBAL) + SET_PROPERTY(TARGET "icu${ICU_BASE_NAME}" PROPERTY IMPORTED_LOCATION ${ICU_LIB}) + ADD_DEPENDENCIES("icu${ICU_BASE_NAME}" extern_icu) + list(APPEND ICU_INTERFACE_LINK_LIBRARIES "icu${ICU_BASE_NAME}") +endforeach() + +if(WIN32) +ADD_LIBRARY("icudata" ALIAS "icudt") +endif() \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/nlohmann_json.cmake b/fast_tokenizer/cmake/external/nlohmann_json.cmake new file mode 100644 index 0000000000000000000000000000000000000000..9a34c5cca503ccb9565a8fcfdaeefc9811ec20fb --- /dev/null +++ b/fast_tokenizer/cmake/external/nlohmann_json.cmake @@ -0,0 +1,46 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +include(ExternalProject) + +set(JSON_PREFIX_DIR ${THIRD_PARTY_PATH}/json) +SET(JSON_REPOSITORY ${GIT_URL}/nlohmann/json.git) +SET(JSON_TAG v3.10.5) + +set(JSON_INCLUDE_DIR ${THIRD_PARTY_PATH}/json/src/extern_json/single_include) +include_directories(${JSON_INCLUDE_DIR}) + +ExternalProject_Add( + extern_json + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${JSON_REPOSITORY} + GIT_TAG ${JSON_TAG} + GIT_PROGRESS 1 + PREFIX ${JSON_PREFIX_DIR} + # If we explicitly leave the `UPDATE_COMMAND` of the ExternalProject_Add + # function in CMakeLists blank, it will cause another parameter GIT_TAG + # to be modified without triggering incremental compilation, and the + # third-party library version changes cannot be incorporated. + # reference: https://cmake.org/cmake/help/latest/module/ExternalProject.html + UPDATE_COMMAND "" + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + INSTALL_COMMAND "" + TEST_COMMAND "" +) + +add_library(json INTERFACE) + +add_dependencies(json extern_json) diff --git a/fast_tokenizer/cmake/external/protobuf.cmake b/fast_tokenizer/cmake/external/protobuf.cmake new file mode 100644 index 0000000000000000000000000000000000000000..e5f3e19be7b52b3344538864f04b0ee646f7f685 --- /dev/null +++ b/fast_tokenizer/cmake/external/protobuf.cmake @@ -0,0 +1,295 @@ +# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +INCLUDE(ExternalProject) +# Always invoke `FIND_PACKAGE(Protobuf)` for importing function protobuf_generate_cpp +IF(NOT WIN32) +FIND_PACKAGE(Protobuf QUIET) +ENDIF(NOT WIN32) +macro(UNSET_VAR VAR_NAME) + UNSET(${VAR_NAME} CACHE) + UNSET(${VAR_NAME}) +endmacro() + +UNSET_VAR(PROTOBUF_INCLUDE_DIR) +UNSET_VAR(PROTOBUF_FOUND) +UNSET_VAR(PROTOBUF_PROTOC_EXECUTABLE) +UNSET_VAR(PROTOBUF_PROTOC_LIBRARY) +UNSET_VAR(PROTOBUF_LITE_LIBRARY) +UNSET_VAR(PROTOBUF_LIBRARY) +UNSET_VAR(PROTOBUF_INCLUDE_DIR) +UNSET_VAR(Protobuf_PROTOC_EXECUTABLE) +function(protobuf_generate_python SRCS) + # shameless copy from https://github.com/Kitware/CMake/blob/master/Modules/FindProtobuf.cmake + if(NOT ARGN) + message(SEND_ERROR "Error: PROTOBUF_GENERATE_PYTHON() called without any proto files") + return() + endif() + + if(PROTOBUF_GENERATE_CPP_APPEND_PATH) + # Create an include path for each file specified + foreach(FIL ${ARGN}) + get_filename_component(ABS_FIL ${FIL} ABSOLUTE) + get_filename_component(ABS_PATH ${ABS_FIL} PATH) + list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) + if(${_contains_already} EQUAL -1) + list(APPEND _protobuf_include_path -I ${ABS_PATH}) + endif() + endforeach() + else() + set(_protobuf_include_path -I ${CMAKE_CURRENT_SOURCE_DIR}) + endif() + if(DEFINED PROTOBUF_IMPORT_DIRS AND NOT DEFINED Protobuf_IMPORT_DIRS) + set(Protobuf_IMPORT_DIRS "${PROTOBUF_IMPORT_DIRS}") + endif() + + if(DEFINED Protobuf_IMPORT_DIRS) + foreach(DIR ${Protobuf_IMPORT_DIRS}) + get_filename_component(ABS_PATH ${DIR} ABSOLUTE) + list(FIND _protobuf_include_path ${ABS_PATH} _contains_already) + if(${_contains_already} EQUAL -1) + list(APPEND _protobuf_include_path -I ${ABS_PATH}) + endif() + endforeach() + endif() + + set(${SRCS}) + foreach(FIL ${ARGN}) + get_filename_component(ABS_FIL ${FIL} ABSOLUTE) + get_filename_component(FIL_WE ${FIL} NAME_WE) + if(NOT PROTOBUF_GENERATE_CPP_APPEND_PATH) + get_filename_component(FIL_DIR ${FIL} DIRECTORY) + if(FIL_DIR) + set(FIL_WE "${FIL_DIR}/${FIL_WE}") + endif() + endif() + list(APPEND ${SRCS} "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py") + add_custom_command( + OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}_pb2.py" + COMMAND ${PROTOBUF_PROTOC_EXECUTABLE} --python_out ${CMAKE_CURRENT_BINARY_DIR} ${_protobuf_include_path} ${ABS_FIL} + DEPENDS ${ABS_FIL} ${PROTOBUF_PROTOC_EXECUTABLE} + COMMENT "Running Python protocol buffer compiler on ${FIL}" + VERBATIM ) + endforeach() + + set(${SRCS} ${${SRCS}} PARENT_SCOPE) +endfunction() + +# Print and set the protobuf library information, +# finish this cmake process and exit from this file. +macro(PROMPT_PROTOBUF_LIB) + SET(protobuf_DEPS ${ARGN}) + + MESSAGE(STATUS "Protobuf protoc executable: ${PROTOBUF_PROTOC_EXECUTABLE}") + MESSAGE(STATUS "Protobuf-lite library: ${PROTOBUF_LITE_LIBRARY}") + MESSAGE(STATUS "Protobuf library: ${PROTOBUF_LIBRARY}") + MESSAGE(STATUS "Protoc library: ${PROTOBUF_PROTOC_LIBRARY}") + MESSAGE(STATUS "Protobuf version: ${PROTOBUF_VERSION}") + INCLUDE_DIRECTORIES(${PROTOBUF_INCLUDE_DIR}) + + # Assuming that all the protobuf libraries are of the same type. 
+ IF(${PROTOBUF_LIBRARY} MATCHES ${CMAKE_STATIC_LIBRARY_SUFFIX}) + SET(protobuf_LIBTYPE STATIC) + ELSEIF(${PROTOBUF_LIBRARY} MATCHES "${CMAKE_SHARED_LIBRARY_SUFFIX}$") + SET(protobuf_LIBTYPE SHARED) + ELSE() + MESSAGE(FATAL_ERROR "Unknown library type: ${PROTOBUF_LIBRARY}") + ENDIF() + + ADD_LIBRARY(protobuf ${protobuf_LIBTYPE} IMPORTED GLOBAL) + SET_PROPERTY(TARGET protobuf PROPERTY IMPORTED_LOCATION ${PROTOBUF_LIBRARY}) + + ADD_LIBRARY(protobuf_lite ${protobuf_LIBTYPE} IMPORTED GLOBAL) + SET_PROPERTY(TARGET protobuf_lite PROPERTY IMPORTED_LOCATION ${PROTOBUF_LITE_LIBRARY}) + + ADD_LIBRARY(libprotoc ${protobuf_LIBTYPE} IMPORTED GLOBAL) + SET_PROPERTY(TARGET libprotoc PROPERTY IMPORTED_LOCATION ${PROTOC_LIBRARY}) + + ADD_EXECUTABLE(protoc IMPORTED GLOBAL) + SET_PROPERTY(TARGET protoc PROPERTY IMPORTED_LOCATION ${PROTOBUF_PROTOC_EXECUTABLE}) + # FIND_Protobuf.cmake uses `Protobuf_PROTOC_EXECUTABLE`. + # make `protobuf_generate_cpp` happy. + SET(Protobuf_PROTOC_EXECUTABLE ${PROTOBUF_PROTOC_EXECUTABLE}) + + FOREACH(dep ${protobuf_DEPS}) + ADD_DEPENDENCIES(protobuf ${dep}) + ADD_DEPENDENCIES(protobuf_lite ${dep}) + ADD_DEPENDENCIES(libprotoc ${dep}) + ADD_DEPENDENCIES(protoc ${dep}) + ENDFOREACH() + + RETURN() +endmacro() +macro(SET_PROTOBUF_VERSION) + EXEC_PROGRAM(${PROTOBUF_PROTOC_EXECUTABLE} ARGS --version OUTPUT_VARIABLE PROTOBUF_VERSION) + STRING(REGEX MATCH "[0-9]+.[0-9]+" PROTOBUF_VERSION "${PROTOBUF_VERSION}") +endmacro() + +set(PROTOBUF_ROOT "" CACHE PATH "Folder contains protobuf") +IF (WIN32) + SET(PROTOBUF_ROOT ${THIRD_PARTY_PATH}/install/protobuf) +ENDIF(WIN32) + +if (NOT "${PROTOBUF_ROOT}" STREQUAL "") + find_path(PROTOBUF_INCLUDE_DIR google/protobuf/message.h PATHS ${PROTOBUF_ROOT}/include NO_DEFAULT_PATH) + find_library(PROTOBUF_LIBRARY protobuf libprotobuf.lib PATHS ${PROTOBUF_ROOT}/lib NO_DEFAULT_PATH) + find_library(PROTOBUF_LITE_LIBRARY protobuf-lite libprotobuf-lite.lib PATHS ${PROTOBUF_ROOT}/lib NO_DEFAULT_PATH) + find_library(PROTOBUF_PROTOC_LIBRARY protoc libprotoc.lib PATHS ${PROTOBUF_ROOT}/lib NO_DEFAULT_PATH) + find_program(PROTOBUF_PROTOC_EXECUTABLE protoc PATHS ${PROTOBUF_ROOT}/bin NO_DEFAULT_PATH) + if (PROTOBUF_INCLUDE_DIR AND PROTOBUF_LIBRARY AND PROTOBUF_LITE_LIBRARY AND PROTOBUF_PROTOC_LIBRARY AND PROTOBUF_PROTOC_EXECUTABLE) + message(STATUS "Using custom protobuf library in ${PROTOBUF_ROOT}.") + SET(PROTOBUF_FOUND true) + SET_PROTOBUF_VERSION() + PROMPT_PROTOBUF_LIB() + else() + message(WARNING "Cannot find protobuf library in ${PROTOBUF_ROOT}") + endif() +endif() + +FUNCTION(build_protobuf TARGET_NAME BUILD_FOR_HOST) + STRING(REPLACE "extern_" "" TARGET_DIR_NAME "${TARGET_NAME}") + SET(PROTOBUF_SOURCES_DIR ${THIRD_PARTY_PATH}/${TARGET_DIR_NAME}) + SET(PROTOBUF_INSTALL_DIR ${THIRD_PARTY_PATH}/install/${TARGET_DIR_NAME}) + + SET(${TARGET_NAME}_INCLUDE_DIR "${PROTOBUF_INSTALL_DIR}/include" PARENT_SCOPE) + SET(PROTOBUF_INCLUDE_DIR "${PROTOBUF_INSTALL_DIR}/include" PARENT_SCOPE) + SET(${TARGET_NAME}_LITE_LIBRARY + "${PROTOBUF_INSTALL_DIR}/lib/libprotobuf-lite${CMAKE_STATIC_LIBRARY_SUFFIX}" + PARENT_SCOPE) + SET(${TARGET_NAME}_LIBRARY + "${PROTOBUF_INSTALL_DIR}/lib/libprotobuf${CMAKE_STATIC_LIBRARY_SUFFIX}" + PARENT_SCOPE) + SET(${TARGET_NAME}_PROTOC_LIBRARY + "${PROTOBUF_INSTALL_DIR}/lib/libprotoc${CMAKE_STATIC_LIBRARY_SUFFIX}" + PARENT_SCOPE) + SET(${TARGET_NAME}_PROTOC_EXECUTABLE + "${PROTOBUF_INSTALL_DIR}/bin/protoc${CMAKE_EXECUTABLE_SUFFIX}" + PARENT_SCOPE) + + SET(PROTOBUF_REPO "https://github.com/protocolbuffers/protobuf.git") + SET(PROTOBUF_TAG 
"9f75c5aa851cd877fb0d93ccc31b8567a6706546") + SET(OPTIONAL_CACHE_ARGS "") + SET(OPTIONAL_ARGS "") + + IF(BUILD_FOR_HOST) + SET(OPTIONAL_ARGS + "-DCMAKE_C_COMPILER=${HOST_C_COMPILER}" + "-DCMAKE_CXX_COMPILER=${HOST_CXX_COMPILER}" + "-Dprotobuf_WITH_ZLIB=OFF" + "-DZLIB_ROOT:FILEPATH=${ZLIB_ROOT}") + SET(OPTIONAL_CACHE_ARGS "-DZLIB_ROOT:STRING=${ZLIB_ROOT}") + ELSE() + # protobuf have compile issue when use android stl c++_static + SET(PROTOBUF_REPO "https://github.com/tensor-tang/protobuf.git") + SET(PROTOBUF_TAG "mobile") + SET(OPTIONAL_ARGS "-Dprotobuf_WITH_ZLIB=OFF" + ${CROSS_COMPILE_CMAKE_ARGS} + "-DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}" + "-DCMAKE_C_COMPILER=${CMAKE_C_COMPILER}" + "-DCMAKE_C_FLAGS=${CMAKE_C_FLAGS}" + "-DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG}" + "-DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE}" + "-DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}" + "-DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE}" + "-DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG}") + ENDIF() + IF(WIN32) + SET(OPTIONAL_ARGS ${OPTIONAL_ARGS} "-DCMAKE_GENERATOR_PLATFORM=x64") + ENDIF() + + if(LITE_WITH_LIGHT_WEIGHT_FRAMEWORK) + ExternalProject_Add( + ${TARGET_NAME} + ${EXTERNAL_PROJECT_LOG_ARGS} + PREFIX ${PROTOBUF_SOURCES_DIR} + SOURCE_SUBDIR cmake + UPDATE_COMMAND "" + GIT_REPOSITORY ${PROTOBUF_REPO} + GIT_TAG ${PROTOBUF_TAG} + GIT_PROGRESS 1 + CMAKE_ARGS + ${OPTIONAL_ARGS} + -Dprotobuf_BUILD_TESTS=OFF + -DCMAKE_SKIP_RPATH=ON + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_INSTALL_PREFIX=${PROTOBUF_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR=lib + -DBUILD_SHARED_LIBS=OFF + CMAKE_CACHE_ARGS + -DCMAKE_INSTALL_PREFIX:PATH=${PROTOBUF_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + ${OPTIONAL_CACHE_ARGS} + ) + else() + ExternalProject_Add( + ${TARGET_NAME} + ${EXTERNAL_PROJECT_LOG_ARGS} + PREFIX ${PROTOBUF_SOURCES_DIR} + UPDATE_COMMAND "" + GIT_REPOSITORY ${PROTOBUF_REPO} + GIT_TAG ${PROTOBUF_TAG} + GIT_PROGRESS 1 + CONFIGURE_COMMAND + ${CMAKE_COMMAND} ${PROTOBUF_SOURCES_DIR}/src/${TARGET_NAME}/cmake + ${OPTIONAL_ARGS} + -Dprotobuf_BUILD_TESTS=OFF + -DCMAKE_SKIP_RPATH=ON + -DCMAKE_POSITION_INDEPENDENT_CODE=ON + -DCMAKE_BUILD_TYPE=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_INSTALL_PREFIX=${PROTOBUF_INSTALL_DIR} + -DCMAKE_INSTALL_LIBDIR=lib + -DBUILD_SHARED_LIBS=OFF + CMAKE_CACHE_ARGS + -DCMAKE_INSTALL_PREFIX:PATH=${PROTOBUF_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_VERBOSE_MAKEFILE:BOOL=OFF + -DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON + ${OPTIONAL_CACHE_ARGS} + ) + endif() +ENDFUNCTION() + +SET(PROTOBUF_VERSION 3.1.0) + +IF(LITE_WITH_LIGHT_WEIGHT_FRAMEWORK) + build_protobuf(protobuf_host TRUE) + LIST(APPEND external_project_dependencies protobuf_host) + SET(PROTOBUF_PROTOC_EXECUTABLE ${protobuf_host_PROTOC_EXECUTABLE} + CACHE FILEPATH "protobuf executable." FORCE) +ENDIF() + +IF(NOT PROTOBUF_FOUND) + build_protobuf(extern_protobuf FALSE) + + SET(PROTOBUF_INCLUDE_DIR ${extern_protobuf_INCLUDE_DIR} + CACHE PATH "protobuf include directory." FORCE) + SET(PROTOBUF_LITE_LIBRARY ${extern_protobuf_LITE_LIBRARY} + CACHE FILEPATH "protobuf lite library." FORCE) + SET(PROTOBUF_LIBRARY ${extern_protobuf_LIBRARY} + CACHE FILEPATH "protobuf library." FORCE) + SET(PROTOBUF_PROTOC_LIBRARY ${extern_protobuf_PROTOC_LIBRARY} + CACHE FILEPATH "protoc library." 
FORCE) + + IF(LITE_WITH_LIGHT_WEIGHT_FRAMEWORK) + PROMPT_PROTOBUF_LIB(protobuf_host extern_protobuf) + ELSE() + SET(PROTOBUF_PROTOC_EXECUTABLE ${extern_protobuf_PROTOC_EXECUTABLE} + CACHE FILEPATH "protobuf executable." FORCE) + PROMPT_PROTOBUF_LIB(extern_protobuf) + ENDIF() + +ENDIF(NOT PROTOBUF_FOUND) \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/pybind11.cmake b/fast_tokenizer/cmake/external/pybind11.cmake new file mode 100644 index 0000000000000000000000000000000000000000..7f5f15d3e091de0d433b45c604927ee7a11f8d82 --- /dev/null +++ b/fast_tokenizer/cmake/external/pybind11.cmake @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +include(ExternalProject) + +set(PYBIND_PREFIX_DIR ${THIRD_PARTY_PATH}/pybind) +SET(PYBIND_REPOSITORY ${GIT_URL}/pybind/pybind11.git) +SET(PYBIND_TAG v2.9.0) + +set(PYBIND_INCLUDE_DIR ${THIRD_PARTY_PATH}/pybind/src/extern_pybind/include) +include_directories(${PYBIND_INCLUDE_DIR}) + +ExternalProject_Add( + extern_pybind + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${PYBIND_REPOSITORY} + GIT_TAG ${PYBIND_TAG} + PREFIX ${PYBIND_PREFIX_DIR} + # If we explicitly leave the `UPDATE_COMMAND` of the ExternalProject_Add + # function in CMakeLists blank, it will cause another parameter GIT_TAG + # to be modified without triggering incremental compilation, and the + # third-party library version changes cannot be incorporated. + # reference: https://cmake.org/cmake/help/latest/module/ExternalProject.html + UPDATE_COMMAND "" + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + INSTALL_COMMAND "" + TEST_COMMAND "" +) + +add_library(pybind INTERFACE) + +add_dependencies(pybind extern_pybind) diff --git a/fast_tokenizer/cmake/external/python.cmake b/fast_tokenizer/cmake/external/python.cmake new file mode 100644 index 0000000000000000000000000000000000000000..81da0782893bdf98dfe30578ec96f97edb59e0da --- /dev/null +++ b/fast_tokenizer/cmake/external/python.cmake @@ -0,0 +1,74 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +INCLUDE(python_module) + +FIND_PACKAGE(PythonInterp ${PY_VERSION} REQUIRED) +FIND_PACKAGE(PythonLibs ${PY_VERSION} REQUIRED) + +if(WIN32) + execute_process(COMMAND "${PYTHON_EXECUTABLE}" "-c" +"from distutils import sysconfig as s;import sys;import struct; +print(sys.prefix); +print(s.get_config_var('LDVERSION') or s.get_config_var('VERSION')); +" + RESULT_VARIABLE _PYTHON_SUCCESS + OUTPUT_VARIABLE _PYTHON_VALUES + ERROR_VARIABLE _PYTHON_ERROR_VALUE) + + if(NOT _PYTHON_SUCCESS EQUAL 0) + set(PYTHONLIBS_FOUND FALSE) + return() + endif() + + # Convert the process output into a list + string(REGEX REPLACE ";" "\\\\;" _PYTHON_VALUES ${_PYTHON_VALUES}) + string(REGEX REPLACE "\n" ";" _PYTHON_VALUES ${_PYTHON_VALUES}) + list(GET _PYTHON_VALUES 0 PYTHON_PREFIX) + list(GET _PYTHON_VALUES 1 PYTHON_LIBRARY_SUFFIX) + + # Make sure all directory separators are '/' + string(REGEX REPLACE "\\\\" "/" PYTHON_PREFIX ${PYTHON_PREFIX}) + + set(PYTHON_LIBRARY + "${PYTHON_PREFIX}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") + + # when run in a venv, PYTHON_PREFIX points to it. But the libraries remain in the + # original python installation. They may be found relative to PYTHON_INCLUDE_DIR. + if(NOT EXISTS "${PYTHON_LIBRARY}") + get_filename_component(_PYTHON_ROOT ${PYTHON_INCLUDE_DIR} DIRECTORY) + set(PYTHON_LIBRARY + "${_PYTHON_ROOT}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") + endif() + + # raise an error if the python libs are still not found. + if(NOT EXISTS "${PYTHON_LIBRARY}") + message(FATAL_ERROR "Python libraries not found") + endif() + SET(PYTHON_LIBRARIES "${PYTHON_LIBRARY}") +endif(WIN32) + +# Fixme: Maybe find a static library. Get SHARED/STATIC by FIND_PACKAGE. +ADD_LIBRARY(python SHARED IMPORTED GLOBAL) +SET_PROPERTY(TARGET python PROPERTY IMPORTED_LOCATION ${PYTHON_LIBRARIES}) + +SET(py_env "") +IF(PYTHONINTERP_FOUND) + find_python_module(pip REQUIRED) + find_python_module(numpy REQUIRED) + find_python_module(wheel REQUIRED) + FIND_PACKAGE(NumPy REQUIRED) +ENDIF(PYTHONINTERP_FOUND) +INCLUDE_DIRECTORIES(${PYTHON_INCLUDE_DIR}) +INCLUDE_DIRECTORIES(${PYTHON_NUMPY_INCLUDE_DIR}) diff --git a/fast_tokenizer/cmake/external/re2.cmake b/fast_tokenizer/cmake/external/re2.cmake new file mode 100644 index 0000000000000000000000000000000000000000..079dfaa3182c857706d9100225ea9bb1c098439b --- /dev/null +++ b/fast_tokenizer/cmake/external/re2.cmake @@ -0,0 +1,108 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+INCLUDE(ExternalProject) + +SET(RE2_PREFIX_DIR ${THIRD_PARTY_PATH}/re2) +SET(RE2_INSTALL_DIR ${THIRD_PARTY_PATH}/install/re2) +# As we add extra features for utf8proc, we use the non-official repo +SET(RE2_REPOSITORY ${GIT_URL}/google/re2.git) +SET(RE2_TAG 2022-04-01) + +IF(WIN32) + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/re2.lib") + add_definitions(-DRE2_STATIC) +ELSEIF(APPLE) + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/libre2.a") +ELSEIF(ANDROID) + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/libre2.a") +ELSE() + IF(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "aarch64|arm64") + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/libre2.a") + ELSE() + file(READ "/etc/issue" ETC_ISSUE) + string(REGEX MATCH "Debian|Ubuntu" DIST ${ETC_ISSUE}) + IF(DIST STREQUAL "Debian") + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/libre2.a") + ELSEIF(DIST STREQUAL "Ubuntu") + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib/libre2.a") + ELSE() + SET(RE2_LIBRARIES "${RE2_INSTALL_DIR}/lib64/libre2.a") + ENDIF() + ENDIF() +ENDIF() + +SET(RE2_INCLUDE_DIR ${RE2_INSTALL_DIR}/include) +INCLUDE_DIRECTORIES(${RE2_INCLUDE_DIR}) + +IF(ANDROID) +set(CROSS_COMPILE_CMAKE_ARGS + "-DCMAKE_SYSTEM_NAME=${CMAKE_SYSTEM_NAME}" + "-DCMAKE_SYSTEM_VERSION=${CMAKE_SYSTEM_VERSION}" + "-DCMAKE_ANDROID_ARCH_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DCMAKE_ANDROID_NDK=${CMAKE_ANDROID_NDK}" + "-DCMAKE_ANDROID_STL_TYPE=${CMAKE_ANDROID_STL_TYPE}" + "-DANDROID_ABI=${CMAKE_ANDROID_ARCH_ABI}" + "-DANDROID_TOOLCHAIN=${ANDROID_TOOLCHAIN}" + "-DANDROID_STL=${CMAKE_ANDROID_STL_TYPE}" + "-DCMAKE_SYSTEM_PROCESSOR=${CMAKE_SYSTEM_PROCESSOR}" + "-DCMAKE_TOOLCHAIN_FILE=${CMAKE_ANDROID_NDK}/build/cmake/android.toolchain.cmake" + "-DCMAKE_ANDROID_NDK_TOOLCHAIN_VERSION=${CMAKE_ANDROID_NDK_TOOLCHAIN_VERSION}" + "-DANDROID_PLATFORM=android-${ANDROID_NATIVE_API_LEVEL}" + "-D__ANDROID_API__=${ANDROID_NATIVE_API_LEVEL}") + +ExternalProject_Add( + extern_re2 + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${RE2_REPOSITORY} + GIT_TAG ${RE2_TAG} + PREFIX ${RE2_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS ${CROSS_COMPILE_CMAKE_ARGS} + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_INSTALL_PREFIX:PATH=${RE2_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + BUILD_BYPRODUCTS ${RE2_LIBRARIES} +) +ELSE() +ExternalProject_Add( + extern_re2 + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${RE2_REPOSITORY} + GIT_TAG ${RE2_TAG} + PREFIX ${RE2_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE} + -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG} + -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DCMAKE_C_FLAGS_DEBUG=${CMAKE_C_FLAGS_DEBUG} + -DCMAKE_C_FLAGS_RELEASE=${CMAKE_C_FLAGS_RELEASE} + -DCMAKE_INSTALL_PREFIX:PATH=${RE2_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${THIRD_PARTY_BUILD_TYPE} + -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} + BUILD_BYPRODUCTS ${RE2_LIBRARIES} +) +ENDIF() + +ADD_LIBRARY(re2 STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET re2 PROPERTY IMPORTED_LOCATION ${RE2_LIBRARIES}) +ADD_DEPENDENCIES(re2 extern_re2) \ No newline at end of file diff --git a/fast_tokenizer/cmake/external/utf8proc.cmake b/fast_tokenizer/cmake/external/utf8proc.cmake new file mode 100644 index 
0000000000000000000000000000000000000000..460cbab819c4e5a132c5c7b798a27b366cc821ce --- /dev/null +++ b/fast_tokenizer/cmake/external/utf8proc.cmake @@ -0,0 +1,51 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +INCLUDE(ExternalProject) + +SET(UTF8PROC_PREFIX_DIR ${THIRD_PARTY_PATH}/utf8proc) +SET(UTF8PROC_INSTALL_DIR ${THIRD_PARTY_PATH}/install/utf8proc) +# As we add extra features for utf8proc, we use the non-official repo +SET(UTF8PROC_REPOSITORY ${GIT_URL}/JuliaStrings/utf8proc.git) +SET(UTF8PROC_TAG v2.6.1) + +IF(WIN32) + SET(UTF8PROC_LIBRARIES "${UTF8PROC_INSTALL_DIR}/lib/utf8proc_static.lib") + add_definitions(-DUTF8PROC_STATIC) +ELSE(WIN32) + SET(UTF8PROC_LIBRARIES "${UTF8PROC_INSTALL_DIR}/lib/libutf8proc.a") +ENDIF(WIN32) + +INCLUDE_DIRECTORIES(${UTF8PROC_INSTALL_DIR}/include) + +ExternalProject_Add( + extern_utf8proc + ${EXTERNAL_PROJECT_LOG_ARGS} + ${SHALLOW_CLONE} + GIT_REPOSITORY ${UTF8PROC_REPOSITORY} + GIT_TAG ${UTF8PROC_TAG} + PREFIX ${UTF8PROC_PREFIX_DIR} + UPDATE_COMMAND "" + CMAKE_ARGS -DCMAKE_C_FLAGS=${CMAKE_C_FLAGS} + -DBUILD_SHARED=ON + -DBUILD_STATIC=ON + -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} + -DCMAKE_INSTALL_PREFIX:PATH=${UTF8PROC_INSTALL_DIR} + -DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE} + BUILD_BYPRODUCTS ${UTF8PROC_LIBRARIES} +) + +ADD_LIBRARY(utf8proc STATIC IMPORTED GLOBAL) +SET_PROPERTY(TARGET utf8proc PROPERTY IMPORTED_LOCATION ${UTF8PROC_LIBRARIES}) +ADD_DEPENDENCIES(utf8proc extern_utf8proc) \ No newline at end of file diff --git a/fast_tokenizer/cmake/generic.cmake b/fast_tokenizer/cmake/generic.cmake new file mode 100644 index 0000000000000000000000000000000000000000..07266667383b831e1a149d9c50a6c3606381700a --- /dev/null +++ b/fast_tokenizer/cmake/generic.cmake @@ -0,0 +1,208 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +function(cc_library TARGET_NAME) + set(options STATIC static SHARED shared INTERFACE interface) + set(oneValueArgs "") + set(multiValueArgs SRCS DEPS) + cmake_parse_arguments(cc_library "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + if(WIN32) + # add libxxx.lib prefix in windows + set(${TARGET_NAME}_LIB_NAME "${CMAKE_STATIC_LIBRARY_PREFIX}${TARGET_NAME}${CMAKE_STATIC_LIBRARY_SUFFIX}" CACHE STRING "output library name for target ${TARGET_NAME}") + endif(WIN32) + if(cc_library_SRCS) + if(cc_library_SHARED OR cc_library_shared) # build *.so + add_library(${TARGET_NAME} SHARED ${cc_library_SRCS}) + elseif(cc_library_INTERFACE OR cc_library_interface) + generate_dummy_static_lib(LIB_NAME ${TARGET_NAME} FILE_PATH ${target_SRCS} GENERATOR "generic.cmake:cc_library") + else() + add_library(${TARGET_NAME} STATIC ${cc_library_SRCS}) + endif() + if(cc_library_DEPS) + # remove link to python, see notes at: + # https://github.com/pybind/pybind11/blob/master/docs/compiling.rst#building-manually + if("${cc_library_DEPS};" MATCHES "python;") + list(REMOVE_ITEM cc_library_DEPS python) + add_dependencies(${TARGET_NAME} python) + if(WIN32) + target_link_libraries(${TARGET_NAME} ${PYTHON_LIBRARIES}) + else() + target_link_libraries(${TARGET_NAME} "-Wl,-undefined,dynamic_lookup") + endif(WIN32) + endif() + target_link_libraries(${TARGET_NAME} ${cc_library_DEPS} ${PUBLIC_DEPEND_LIBS}) + endif() + # For C++ 17 filesystem + # target_link_libraries(${TARGET_NAME} stdc++fs) + + # cpplint code style + foreach(source_file ${cc_library_SRCS}) + string(REGEX REPLACE "\\.[^.]*$" "" source ${source_file}) + if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) + list(APPEND cc_library_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/${source}.h) + endif() + endforeach() + + else(cc_library_SRCS) + if(cc_library_DEPS) + list(REMOVE_DUPLICATES cc_library_DEPS) + set(dummy_FILE_PATH "${CMAKE_CURRENT_BINARY_DIR}/${TARGET_NAME}_dummy.c") + configure_file(${PROJECT_SOURCE_DIR}/cmake/dummy.c.in ${dummy_FILE_PATH} @ONLY) + if(cc_library_SHARED OR cc_library_shared) # build *.so + add_library(${TARGET_NAME} SHARED ${dummy_FILE_PATH}) + elseif(cc_library_INTERFACE OR cc_library_interface) + generate_dummy_static_lib(LIB_NAME ${TARGET_NAME} FILE_PATH ${dummy_FILE_PATH} GENERATOR "generic.cmake:cc_library") + else() + add_library(${TARGET_NAME} STATIC ${dummy_FILE_PATH}) + endif() + target_link_libraries(${TARGET_NAME} ${cc_library_DEPS}) + else() + message(FATAL_ERROR "Please specify source files or libraries in cc_library(${TARGET_NAME} ...).") + endif() + endif(cc_library_SRCS) +endfunction(cc_library) + +function(cc_test_build TARGET_NAME) + if(WITH_TESTING AND NOT "$ENV{CI_SKIP_CPP_TEST}" STREQUAL "ON") + set(oneValueArgs "") + set(multiValueArgs SRCS DEPS) + cmake_parse_arguments(cc_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + add_executable(${TARGET_NAME} ${cc_test_SRCS}) + if(WIN32) + if("${cc_test_DEPS};" MATCHES "python;") + list(REMOVE_ITEM cc_test_DEPS python) + target_link_libraries(${TARGET_NAME} ${PYTHON_LIBRARIES}) + endif() + endif(WIN32) + get_property(os_dependency_modules GLOBAL PROPERTY OS_DEPENDENCY_MODULES) + target_link_libraries(${TARGET_NAME} ${cc_test_DEPS} ${os_dependency_modules} tokenizers_gtest_main gtest glog) + add_dependencies(${TARGET_NAME} ${cc_test_DEPS} gtest) + endif() +endfunction() + +function(cc_test_run TARGET_NAME) + if(WITH_TESTING AND NOT "$ENV{CI_SKIP_CPP_TEST}" STREQUAL "ON") + set(oneValueArgs "") + set(multiValueArgs COMMAND ARGS) + 
cmake_parse_arguments(cc_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + add_test(NAME ${TARGET_NAME} + COMMAND ${cc_test_COMMAND} ${cc_test_ARGS} + WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) + # No unit test should exceed 2 minutes. + if (WIN32) + set_tests_properties(${TARGET_NAME} PROPERTIES TIMEOUT 150) + endif() + if (APPLE) + set_tests_properties(${TARGET_NAME} PROPERTIES TIMEOUT 20) + endif() + elseif(WITH_TESTING AND NOT TEST ${TARGET_NAME}) + add_test(NAME ${TARGET_NAME} COMMAND ${CMAKE_COMMAND} -E echo CI skip ${TARGET_NAME}.) + endif() +endfunction() + +function(cc_test TARGET_NAME) + # The environment variable `CI_SKIP_CPP_TEST` is used to skip the compilation + # and execution of test in CI. `CI_SKIP_CPP_TEST` is set to ON when no files + # other than *.py are modified. + if(WITH_TESTING) + set(oneValueArgs "") + set(multiValueArgs SRCS DEPS ARGS) + cmake_parse_arguments(cc_test "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + cc_test_build(${TARGET_NAME} + SRCS ${cc_test_SRCS} + DEPS ${cc_test_DEPS}) + add_test(NAME ${TARGET_NAME} COMMAND ${CMAKE_COMMAND} -E echo CI skip ${TARGET_NAME}.) + endif() +endfunction(cc_test) + +# create a dummy source file, then create a static library. +# LIB_NAME should be the static lib name. +# FILE_PATH should be the dummy source file path. +# GENERATOR should be the file name invoke this function. +# CONTENT should be some helpful info. +# example: generate_dummy_static_lib(mylib FILE_PATH /path/to/dummy.c GENERATOR mylib.cmake CONTENT "helpful info") +function(generate_dummy_static_lib) + set(options "") + set(oneValueArgs LIB_NAME FILE_PATH GENERATOR CONTENT) + set(multiValueArgs "") + cmake_parse_arguments(dummy "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN}) + if(NOT dummy_LIB_NAME) + message(FATAL_ERROR "You must provide a static lib name.") + endif() + if(NOT dummy_FILE_PATH) + set(dummy_FILE_PATH "${CMAKE_CURRENT_BINARY_DIR}/${dummy_LIB_NAME}_dummy.c") + endif() + if(NOT dummy_GENERATOR) + message(FATAL_ERROR "You must provide a generator file name.") + endif() + if(NOT dummy_CONTENT) + set(dummy_CONTENT "${dummy_LIB_NAME}_dummy.c for lib ${dummy_LIB_NAME}") + endif() + + configure_file(${PROJECT_SOURCE_DIR}/cmake/dummy.c.in ${dummy_FILE_PATH} @ONLY) + add_library(${dummy_LIB_NAME} STATIC ${dummy_FILE_PATH}) +endfunction() + +function(paddle_protobuf_generate_cpp SRCS HDRS) + if(NOT ARGN) + message( + SEND_ERROR + "Error: paddle_protobuf_generate_cpp() called without any proto files") + return() + endif() + + set(${SRCS}) + set(${HDRS}) + + foreach(FIL ${ARGN}) + get_filename_component(ABS_FIL ${FIL} ABSOLUTE) + get_filename_component(FIL_WE ${FIL} NAME_WE) + set(_protobuf_protoc_src "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}.pb.cc") + set(_protobuf_protoc_hdr "${CMAKE_CURRENT_BINARY_DIR}/${FIL_WE}.pb.h") + list(APPEND ${SRCS} "${_protobuf_protoc_src}") + list(APPEND ${HDRS} "${_protobuf_protoc_hdr}") + + add_custom_command( + OUTPUT "${_protobuf_protoc_src}" "${_protobuf_protoc_hdr}" + COMMAND ${CMAKE_COMMAND} -E make_directory "${CMAKE_CURRENT_BINARY_DIR}" + COMMAND ${PROTOBUF_PROTOC_EXECUTABLE} -I${CMAKE_CURRENT_SOURCE_DIR} + --cpp_out "${CMAKE_CURRENT_BINARY_DIR}" ${ABS_FIL} + # Set `EXTERN_PROTOBUF_DEPEND` only if need to compile `protoc.exe`. 
+ DEPENDS ${ABS_FIL} ${EXTERN_PROTOBUF_DEPEND} + COMMENT "Running C++ protocol buffer compiler on ${FIL}" + VERBATIM) + endforeach() + + set_source_files_properties(${${SRCS}} ${${HDRS}} PROPERTIES GENERATED TRUE) + set(${SRCS} + ${${SRCS}} + PARENT_SCOPE) + set(${HDRS} + ${${HDRS}} + PARENT_SCOPE) +endfunction() + +function(proto_library TARGET_NAME) + set(oneValueArgs "") + set(multiValueArgs SRCS DEPS) + cmake_parse_arguments(proto_library "${options}" "${oneValueArgs}" + "${multiValueArgs}" ${ARGN}) + set(proto_srcs) + set(proto_hdrs) + paddle_protobuf_generate_cpp(proto_srcs proto_hdrs ${proto_library_SRCS}) + cc_library( + ${TARGET_NAME} + SRCS ${proto_srcs} + DEPS ${proto_library_DEPS} protobuf) +endfunction() \ No newline at end of file diff --git a/fast_tokenizer/cmake/python_module.cmake b/fast_tokenizer/cmake/python_module.cmake new file mode 100644 index 0000000000000000000000000000000000000000..8fdccc17e9b0ca5e66c62652bbc8fe4a7065fc00 --- /dev/null +++ b/fast_tokenizer/cmake/python_module.cmake @@ -0,0 +1,54 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +function(find_python_module module) + string(TOUPPER ${module} module_upper) + if(NOT PY_${module_upper}) + if(ARGC GREATER 1 AND ARGV1 STREQUAL "REQUIRED") + set(${module}_FIND_REQUIRED TRUE) + else() + set(${module}_FIND_REQUIRED FALSE) + endif() + # A module's location is usually a directory, but for binary modules + # it's a .so file. 
+ execute_process(COMMAND "${PYTHON_EXECUTABLE}" "-c" + "import re, ${module}; print(re.compile('/__init__.py.*').sub('',${module}.__file__))" + RESULT_VARIABLE _${module}_status + OUTPUT_VARIABLE _${module}_location + ERROR_QUIET + OUTPUT_STRIP_TRAILING_WHITESPACE) + if(NOT _${module}_status) + set(PY_${module_upper} ${_${module}_location} CACHE STRING + "Location of Python module ${module}") + endif(NOT _${module}_status) + endif(NOT PY_${module_upper}) + find_package_handle_standard_args(PY_${module} DEFAULT_MSG PY_${module_upper}) + if(NOT PY_${module_upper}_FOUND AND ${module}_FIND_REQUIRED) + message(FATAL_ERROR "python module ${module} is not found") + endif() + + execute_process(COMMAND "${PYTHON_EXECUTABLE}" "-c" + "import sys, ${module}; sys.stdout.write(${module}.__version__)" + OUTPUT_VARIABLE _${module}_version + RESULT_VARIABLE _${module}_status + ERROR_QUIET + OUTPUT_STRIP_TRAILING_WHITESPACE) + if(NOT _${module}_status) + set(PY_${module_upper}_VERSION ${_${module}_version} CACHE STRING + "Version of Python module ${module}") + endif(NOT _${module}_status) + + set(PY_${module_upper}_FOUND ${PY_${module_upper}_FOUND} PARENT_SCOPE) + set(PY_${module_upper}_VERSION ${PY_${module_upper}_VERSION} PARENT_SCOPE) +endfunction(find_python_module) diff --git a/fast_tokenizer/cmake/third_party.cmake b/fast_tokenizer/cmake/third_party.cmake new file mode 100644 index 0000000000000000000000000000000000000000..ef17fc6bf3dc9c24c79ce3225a7d7d4557b1d5ae --- /dev/null +++ b/fast_tokenizer/cmake/third_party.cmake @@ -0,0 +1,32 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+include(ExternalProject)
+
+set(THIRD_PARTY_PATH "${CMAKE_BINARY_DIR}/third_party" CACHE STRING
+    "A path setting third party libraries download & build directories.")
+
+include(external/icu)
+if(WITH_TESTING)
+  include(external/gtest)
+endif()
+include(external/gflags)
+include(external/glog)
+include(external/re2)
+include(external/nlohmann_json)
+include(external/dart) # For trie
+if (WITH_PYTHON)
+  include(external/python)
+  include(external/pybind11)
+endif()
\ No newline at end of file
diff --git a/fast_tokenizer/docs/compile/README.md b/fast_tokenizer/docs/compile/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6ae8c791ae7c8c5ea946377d639612d4566bd2dd
--- /dev/null
+++ b/fast_tokenizer/docs/compile/README.md
@@ -0,0 +1,16 @@
+# FastTokenizer 编译指南
+
+本文档介绍 FastTokenizer C++ 库、Python 库以及 Android 库三种编译方式,请根据编译的目标平台参考如下文档:
+
+- [Linux & Mac 编译](./how_to_build_linux_and_mac.md)
+- [Windows 编译](./how_to_build_windows.md)
+- [Android 编译](./how_to_build_android.md)
+
+FastTokenizer 使用 CMake 编译,编译过程中各平台可用的编译选项如下表所示(表后附一个选项设置示例):
+
+| 选项 | 作用 | 备注 |
+|:---- | :--- | :--- |
+| WITH_PYTHON | 是否编译为 Python 库,默认为是,否则为 C++ 库 ||
+| WITH_TESTING | 是否编译 C++ 单测,默认为否 ||
+| WITH_ICU_LITE | 是否与 ICU Lite 依赖包联编,打开后可减小 FastTokenizer 库体积,默认为否 | 只能用于 Android 编译,打开后 FastTokenizer 库体积大小从 **32 M 减少到 7.4 M**,但只能对中英文进行分词。|
+| USE_ABI0 | 是否以 _GLIBCXX_USE_CXX11_ABI=0 编译,默认为 OFF ||
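+
+下面给出一个在 Linux 下组合上述选项、编译 C++ 库并开启单测的命令示意(选项取值仅作演示,完整流程请以各平台编译文档为准):
+
+```bash
+# 在 fast_tokenizer 的 build 目录中执行
+cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=ON -DUSE_ABI0=OFF -DCMAKE_BUILD_TYPE=Release
+make -j8
+```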
diff --git a/fast_tokenizer/docs/compile/how_to_build_android.md b/fast_tokenizer/docs/compile/how_to_build_android.md
new file mode 100644
index 0000000000000000000000000000000000000000..40c1cfe375db2b56c9d541268309d18c56599c20
--- /dev/null
+++ b/fast_tokenizer/docs/compile/how_to_build_android.md
@@ -0,0 +1,46 @@
+# Android 编译
+
+FastTokenizer 提供两种版本的 Android 库,分别是常规版本以及轻量版本。常规版本的 FastTokenizer Android 库功能齐全,可支持任意语言的分词,库体积大约为 **32 M**;轻量版本主要支持中文和英文两种语言的分词,库体积约为 **7.4 M**。开发者可以根据实际需求选择合适的版本,以下分别介绍这两种版本的编译方式。
+
+## 环境依赖
+
+- cmake >= 3.10
+- NDK >= 20
+
+## 配置 NDK
+```bash
+wget https://dl.google.com/android/repository/android-ndk-r20b-linux-x86_64.zip
+unzip android-ndk-r20b-linux-x86_64.zip # 会解压缩到 android-ndk-r20b 目录
+export NDK_ROOT=${PWD}/android-ndk-r20b
+```
+
+## 编译 C++ 库方法
+
+### 常规版本
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="arm64-v8a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang
+make -j8
+```
+
+### 轻量版本
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="arm64-v8a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang -DWITH_ICU_LITE=ON
+make -j8
+```
+
+### 库体积压缩
+
+编译后的 C++ 库位于当前目录下的 `cpp` 目录中。可以选择使用 strip 减小库体积:
+```shell
+$NDK_ROOT/toolchains/llvm/prebuilt/linux-x86_64/aarch64-linux-android/bin/strip libcore_tokenizers.so
+```
+
+更多编译选项说明参考[编译指南](./README.md)
diff --git a/fast_tokenizer/docs/compile/how_to_build_linux_and_mac.md b/fast_tokenizer/docs/compile/how_to_build_linux_and_mac.md
new file mode 100644
index 0000000000000000000000000000000000000000..cd13724aef2dfe355b3228ebfc62b00b29d96a36
--- /dev/null
+++ b/fast_tokenizer/docs/compile/how_to_build_linux_and_mac.md
@@ -0,0 +1,36 @@
+# Linux & Mac 编译
+
+## 环境依赖
+
+- cmake >= 3.10
+- gcc >= 8.2.0
+
+## 编译 C++ 库方法
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
+make -j8
+```
+
+编译后的 C++ 库位于当前目录下的 `cpp` 目录中。
+
+## 编译 Python 库方法
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+# 设置 Python 环境
+export LD_LIBRARY_PATH=/opt/_internal/cpython-3.6.0/lib/:${LD_LIBRARY_PATH}
+export PATH=/opt/_internal/cpython-3.6.0/bin/:${PATH}
+
+cmake .. -DWITH_PYTHON=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
+make -j8
+```
+
+编译后的 wheel 包位于当前目录下的 `dist` 目录中。
+
+更多编译选项说明参考[编译指南](./README.md)
diff --git a/fast_tokenizer/docs/compile/how_to_build_windows.md b/fast_tokenizer/docs/compile/how_to_build_windows.md
new file mode 100644
index 0000000000000000000000000000000000000000..4796b0418034eaf51ee65b0e712c3eb6c135b5b1
--- /dev/null
+++ b/fast_tokenizer/docs/compile/how_to_build_windows.md
@@ -0,0 +1,42 @@
+# Windows 编译
+
+## 环境依赖
+
+- cmake >= 3.10
+- VS 2019
+- ninja
+
+以上依赖安装好后,在 Windows 菜单打开 `x64 Native Tools Command Prompt for VS 2019` 命令工具即可进行下面的编译环节。
+
+## 编译 C++ 库方法
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+cmake .. -G "Ninja" -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
+ninja -j8
+```
+
+编译后的 C++ 库位于当前目录下的 `cpp` 目录中。
+
+## 编译 Python 库方法
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleNLP.git
+cd PaddleNLP/fast_tokenizer
+mkdir build && cd build
+# 需要指定 Python 库
+cmake .. -G "Ninja" -DWITH_PYTHON=ON ^
+         -DWITH_TESTING=OFF ^
+         -DCMAKE_BUILD_TYPE=Release ^
+         -DPYTHON_EXECUTABLE=C:\Python37\python.exe ^
+         -DPYTHON_INCLUDE_DIR=C:\Python37\include ^
+         -DPYTHON_LIBRARY=C:\Python37\libs\python3%%x.lib
+ninja -j8
+```
+
+编译后的 wheel 包位于当前目录下的 `dist` 目录中。
+
+更多编译选项说明参考[编译指南](./README.md)
diff --git a/fast_tokenizer/docs/cpp/README.md b/fast_tokenizer/docs/cpp/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7b8976efd4de2b7c469e786dfc58f33a3978217c
--- /dev/null
+++ b/fast_tokenizer/docs/cpp/README.md
@@ -0,0 +1,71 @@
+# FastTokenizer C++ 库使用教程
+
+## 1. 快速安装
+
+当前版本 FastTokenizer C++ 库支持不同的操作系统以及硬件平台,并为以下平台提供预编译包:
+|系统|下载地址|
+|---|---|
+|Linux-x64| [fast_tokenizer-linux-x64-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.2.tgz) |
+|Linux-aarch64| [fast_tokenizer-linux-aarch64-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-aarch64-1.0.2.tgz) |
+|Windows| [fast_tokenizer-win-x64-1.0.2.zip](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-win-x64-1.0.2.zip) |
+|MacOS-x64| [fast_tokenizer-osx-x86_64-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-x86_64-1.0.2.tgz) |
+|MacOS-arm64| [fast_tokenizer-osx-arm64-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-arm64-1.0.2.tgz) |
+|Android-arm64-v8a| [fast_tokenizer-android-arm64-v8a-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-android-arm64-v8a-1.0.2.tgz) |
+|Android-armeabi-v7a| [fast_tokenizer-android-armeabi-v7a-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-android-armeabi-v7a-1.0.2.tgz) |
+|Android-lite-arm64-v8a| [fast_tokenizer-lite-android-arm64-v8a-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-lite-android-arm64-v8a-1.0.2.tgz) |
+|Android-lite-armeabi-v7a| [fast_tokenizer-lite-android-armeabi-v7a-1.0.2.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-lite-android-armeabi-v7a-1.0.2.tgz) |
+
+### 环境依赖
+
+#### 系统环境要求
+|系统|版本|架构|
+|---|---|---|
+|Linux|Ubuntu 16.04+,CentOS 7+|x64, aarch64|
+|Windows|10+|x64|
+|MacOS| 11.4+|x64, arm64|
+|Android| - |arm64-v8a, armeabi-v7a|
+
+#### Linux,Mac 编译环境要求
+|依赖|版本|
+|---|---|
+|cmake|>=3.16|
+|gcc|>=8.2.0|
+
+#### Windows 编译环境要求
+|依赖|版本|
+|---|---|
+|cmake|>=3.16|
+|VisualStudio|2019|
+
+### 下载解压
+
+```shell
+wget -c https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.2.tgz
+
+tar xvfz fast_tokenizer-linux-x64-1.0.2.tgz
+# 解压后为 fast_tokenizer 目录
+```
+
+解压后得到 fast_tokenizer 目录,该目录的结构如下:
+
+```shell
+
+fast_tokenizer
+|__ commit.log # 编译时的 commit id
+|__ FastTokenizer.cmake # FastTokenizer CMake 文件,定义了头文件目录、动态链接库目录变量
+|__ include # FastTokenizer 的头文件目录
+|__ lib # FastTokenizer 的动态链接库目录
+|__ third_party # FastTokenizer 依赖的第三方库目录
+
+```
+
+推荐用户直接使用 CMake 方式引入 FastTokenizer 库。在 CMake 中引入 FastTokenizer 时,只需添加一行 `include(FastTokenizer.cmake)`,即可获取 FastTokenizer 预定义的 CMake 变量 `FAST_TOKENIZER_INCS` 和 `FAST_TOKENIZER_LIBS`,分别指定 FastTokenizer 的头文件目录以及动态链接库目录。
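+
+下面给出一个引入 FastTokenizer 的最简 CMakeLists.txt 片段,仅作示意:其中解压路径 `FAST_TOKENIZER_INSTALL_DIR`、目标名 `demo` 与源文件 `demo.cc` 均为假设的名称,请根据实际工程调整。
+
+```cmake
+cmake_minimum_required(VERSION 3.10)
+project(fast_tokenizer_demo CXX)
+
+# 假设 FAST_TOKENIZER_INSTALL_DIR 指向解压后的 fast_tokenizer 目录
+set(FAST_TOKENIZER_INSTALL_DIR /path/to/fast_tokenizer)
+
+# 引入 FastTokenizer.cmake 后即可使用 FAST_TOKENIZER_INCS 与 FAST_TOKENIZER_LIBS 两个变量
+include(${FAST_TOKENIZER_INSTALL_DIR}/FastTokenizer.cmake)
+include_directories(${FAST_TOKENIZER_INCS})
+
+add_executable(demo demo.cc)
+target_link_libraries(demo ${FAST_TOKENIZER_LIBS})
+```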
快速开始
+
+目前 FastTokenizer 提供了以下 C++ 使用示例。
+
+[ErnieFastTokenizer C++示例](../../examples/ernie-3.0/README.md)
+
+[ClipFastTokenizer C++示例](../../examples/clip/README.md)
diff --git a/fast_tokenizer/docs/pipeline/README.md b/fast_tokenizer/docs/pipeline/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..16da52f90ef4a9d5079cdaed2eb3674e65854cc5
--- /dev/null
+++ b/fast_tokenizer/docs/pipeline/README.md
@@ -0,0 +1,254 @@
+# FastTokenizer Pipeline
+
+当我们使用 Tokenizer 的 `Tokenizer.encode` 或者 `Tokenizer.encode_batch` 方法进行分词时,会经历如下四个阶段:Normalize、PreTokenize、Model 以及 PostProcess。针对这四个阶段,FastTokenizer 提供 Normalizer、PreTokenizer、Model 以及 PostProcessor 四个组件,分别完成各阶段所需要的工作。下面将详细介绍四大组件具体负责的工作,并通过示例介绍如何组合四个组件定义一个 Tokenizer。
+
+## Normalizer
+
+Normalizer 组件主要用于将原始字符串标准化,输出标准化的字符串,常见的标准化操作有大小写转换、半角全角转换等。FastTokenizer 所有 Normalizer 类都继承自 `normalizers.Normalizer`,命名方式均为 `normalizers.*Normalizer`。FastTokenizer 还支持将现有 Normalizer 类进行组合得到一个 Normalizer 序列,用户可以通过调用 `normalizers.SequenceNormalizer` 使用已有的 Normalizer 自定义新的 Normalizer。下面将分别展示 Python 以及 C++ 上的使用示例。
+
+### Python 示例
+
+```python
+import fast_tokenizer
+from fast_tokenizer.normalizers import LowercaseNormalizer, SequenceNormalizer, NFDNormalizer, StripAccentsNormalizer
+
+normalizer = SequenceNormalizer([NFDNormalizer(), StripAccentsNormalizer(), LowercaseNormalizer()])
+print(normalizer.normalize_str("Héllò hôw are ü?"))
+# hello how are u?
+```
+
+### C++ 示例
+
+```c++
+
+#include <iostream>
+#include "fast_tokenizer/normalizers/normalizers.h"
+using namespace paddlenlp::fast_tokenizer;
+
+int main() {
+  normalizers::NFDNormalizer n1;
+  normalizers::StripAccentsNormalizer n2;
+  normalizers::LowercaseNormalizer n3;
+  normalizers::SequenceNormalizer normalizer({&n1, &n2, &n3});
+  normalizers::NormalizedString normalized("Héllò hôw are ü?");
+  normalizer(&normalized);
+  // Expected output
+  // normalized string: hello how are u?
+  // original string: Héllò hôw are ü?
+  std::cout << "normalized string: " << normalized.GetStr() << std::endl;
+  std::cout << "original string: " << normalized.GetOrignalStr() << std::endl;
+}
+
+```
+
+## PreTokenizer
+
+PreTokenizer 组件主要使用简单的分词方法,将标准化后的字符串进行预切词,得到较大粒度的词组(word),例如按照标点、空格等方式进行切分。FastTokenizer 所有 PreTokenizer 类都继承自 `pretokenizers.PreTokenizer`,命名方式均为 `pretokenizers.*PreTokenizer`。下面将分别展示 Python 以及 C++ 上使用空格对文本进行分词的使用示例。
+
+### Python 示例
+
+```python
+import fast_tokenizer
+from fast_tokenizer.pretokenizers import WhitespacePreTokenizer
+pretokenizer = WhitespacePreTokenizer()
+print(pretokenizer.pretokenize_str("Hello! How are you? I'm fine, thank you."))
+# [('Hello!', (0, 6)), ('How', (7, 10)), ('are', (11, 14)), ('you?', (15, 19)), ("I'm", (20, 23)), ('fine,', (24, 29)), ('thank', (30, 35)), ('you.', (36, 40))]
+```
+
+### C++ 示例
+
+```c++
+
+#include <iostream>
+#include "fast_tokenizer/pretokenizers/pretokenizers.h"
+
+using namespace paddlenlp::fast_tokenizer;
+
+int main() {
+  pretokenizers::WhitespacePreTokenizer pretokenizer;
+  pretokenizers::PreTokenizedString pretokenized(
+      "Hello! How are you? 
I'm fine, thank you."); + pretokenizer(&pretokenized); + auto&& splits = pretokenized.GetSplits(true, core::OffsetType::CHAR); + for (auto&& split : splits) { + auto&& value = std::get<0>(split); + auto&& offset = std::get<1>(split); + std::cout << "(" << value << ", (" << offset.first << ", " << offset.second + << ")" + << ")" << std::endl; + } + return 0; +} + +// (Hello!, (0, 6)) +// (How, (7, 10)) +// (are, (11, 14)) +// (you?, (15, 19)) +// (I'm, (20, 23)) +// (fine,, (24, 29)) +// (thank, (30, 35)) +// (you., (36, 40)) + +``` + +## Model + +Model 组件是 FastTokenizer 核心模块,用于将粗粒度词组按照一定的算法进行切分,得到细粒度的 Token(word piece)及其对应的在词表中的 id,目前支持的切词算法包括 FastWordPiece[1]、WordPiece、BPE 以及 Unigram。其中,`FastWordPiece` 是 "Fast WordPiece Tokenization" 提出的基于`MinMaxMatch`匹配算法的一种分词算法。原有 `WordPiece` 算法的时间复杂度与序列长度为二次方关系,在对长文本进行分词操作时,时间开销比较大。而 `FastWordPiece` 算法通过 `Aho–Corasick` 算法避免 Token 失配时从头匹配,将 `WordPiece` 算法的时间复杂度降低为与序列长度的线性关系,大大提升了分词效率。下面是 `FastWordPiece` 类的初始化示例。 + +### Python 示例 + +```python +import fast_tokenizer +from fast_tokenizer.models import FastWordPiece + +# Initialize model from ernie 3.0 vocab file +model = FastWordPiece.from_file("ernie-3.0-medium-vocab.txt", with_pretokenization=True) +print(model.tokenize("我爱中国!")) +# [id: 75 value:我 offset: (0, 3), id: 329 value:爱 offset: (3, 6), id: 12 value:中 offset: (6, 9), id: 20 value:国 offset: (9, 12), id: 12046 value:! offset: (12, 13)] +``` + +### C++ 示例 + +```c++ + +#include +#include + +#include "fast_tokenizer/models/models.h" + +using namespace paddlenlp::fast_tokenizer; + +int main() { + std::string text = "我爱中国!"; + auto model = models::FastWordPiece::GetFastWordPieceFromFile( + "ernie_vocab.txt", "[UNK]", 100, "##", true); + std::vector results = model.Tokenize(text); + for (const core::Token& token : results) { + std::cout << "id: " << token.id_ << ", value: " << token.value_ + << ", offset: (" << token.offset_.first << ", " + << token.offset_.second << ")." << std::endl; + } + return 0; +} + +// id: 75, value: 我, offset: (0, 3). +// id: 329, value: 爱, offset: (3, 6). +// id: 12, value: 中, offset: (6, 9). +// id: 20, value: 国, offset: (9, 12). +// id: 12044, value: !, offset: (12, 15). +``` + +## PostProcessor + +PostProcess 组件主要执行 Transformer 类模型的文本序列的后处理逻辑,比如添加 [SEP] 等特殊 Token,并且会将前面分词得到的结果转为一个 `Encoding` 的结构体,包含 token_ids, type_ids, offset, position_ids 等模型所需要的信息。FastTokenizer 所有 PostProcessor 类都继承自 `normalizers.PostProcessor`,命名方式均为 `normalizers.*PostProcessor`。 + +## Tokenizer + +Tokenizer 对象在运行`Tokenizer.encode` 或者 `Tokenizer.encode_batch` 方法进行分词时,通过调用各个阶段组件的回调函数运行不同阶段的处理逻辑。所以我们定义 Tokenizer 对象时,需要设置各个阶段的组件。下面将通过代码示例展示如何定义 ERNIE 模型的 Tokenizer。 + +### Python 示例 + +```python + +import fast_tokenizer +from fast_tokenizer import Tokenizer +from fast_tokenizer.models import FastWordPiece +from fast_tokenizer.normalizers import BertNormalizer +from fast_tokenizer.pretokenizers import BertPreTokenizer + +# 1. Initialize model from ernie 3.0 vocab file +model = FastWordPiece.from_file("ernie-3.0-medium-vocab.txt") + +# 2. Use model to initialize a tokenizer object +tokenizer = Tokenizer(model) + +# 3. Set a normalizer +tokenizer.normalizer = BertNormalizer( + clean_text=True, + handle_chinese_chars=True, + strip_accents=True, + lowercase=True, +) + +# 4. Set a pretokenizer +tokenizer.pretokenizer = BertPreTokenizer() + +print(tokenizer.encode("我爱中国!")) + +# The Encoding content: +# ids: 75, 329, 12, 20, 12046 +# type_ids: 0, 0, 0, 0, 0 +# tokens: 我, 爱, 中, 国, ! 
+# offsets: (0, 1), (1, 2), (2, 3), (3, 4), (4, 5) +# special_tokens_mask: 0, 0, 0, 0, 0 +# attention_mask: 1, 1, 1, 1, 1 +# sequence_ranges: +``` + +针对 ERNIE、BERT 这类常见模型,FastTokenizer Python 库 已经定义好这类模型的 Tokenizer,可以通过 `from fast_tokenizer import ErnieFastTokenizer` 直接使用。 + +### C++ 示例 + +```c++ + +#include +#include + +#include "fast_tokenizer/core/tokenizer.h" +#include "fast_tokenizer/models/models.h" +#include "fast_tokenizer/normalizers/normalizers.h" +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include "fast_tokenizer/pretokenizers/pretokenizers.h" + +using namespace paddlenlp::fast_tokenizer; + +int main() { + std::vector texts{"我爱中国!"}; + core::Tokenizer tokenizer; + + // 1. Set model + auto model = models::FastWordPiece::GetFastWordPieceFromFile( + "ernie_vocab.txt", "[UNK]", 100, "##", true); + tokenizer.SetModel(model); + + // 2. Set Normalizer + normalizers::BertNormalizer normalizer( + /* clean_text = */ true, + /* handle_chinese_chars = */ true, + /* strip_accents= */ true, + /* lowercase = */ true); + tokenizer.SetNormalizer(normalizer); + + // 3. Set Pretokenizer + pretokenizers::BertPreTokenizer pretokenizer; + tokenizer.SetPreTokenizer(pretokenizer); + + // 4. Set PostProcessor + postprocessors::BertPostProcessor postprocessor; + tokenizer.SetPostProcessor(postprocessor); + + std::vector encodings; + tokenizer.EncodeBatchStrings(texts, &encodings); + + for (auto encoding : encodings) { + std::cout << encoding.DebugString() << std::endl; + } + return 0; +} + +// The Encoding content: +// ids: 101, 75, 329, 12, 20, 12044, 102 +// type_ids: 0, 0, 0, 0, 0, 0, 0 +// tokens: [CLS], 我, 爱, 中, 国, !, [SEP] +// offsets: (0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 0) +// special_tokens_mask: 1, 0, 0, 0, 0, 0, 1 +// attention_mask: 1, 1, 1, 1, 1, 1, 1 +// sequence_ranges: {0 : (1, 6) }, +``` + +针对 ERNIE、BERT 这类常见模型,FastTokenizer C++ 库 已经定义好这类模型的 Tokenizer,可以通过 `paddlenlp::fast_tokenizer::tokenizers_impl::ErnieFastTokenizer` 直接使用。 + + +## 参考文献 + +- [1] Xinying Song, Alex Salcianuet al. "Fast WordPiece Tokenization", EMNLP, 2021 diff --git a/fast_tokenizer/docs/python/README.md b/fast_tokenizer/docs/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..86720f9dcca9a4eb490396c1be0aa6e24758cfe3 --- /dev/null +++ b/fast_tokenizer/docs/python/README.md @@ -0,0 +1,16 @@ +# FastTokenizer Python 库使用教程 + +## 1. 
快速安装 + +### 环境依赖 + +- Windows 64位系统 +- Linux x64系统 +- MacOS 10.14+系统(m1芯片的MacOS,需要使用x86_64版本的Anaconda作为python环境方可安装使用) +- Python 3.6 ~ 3.10 + +### 安装 + +```shell +pip install --upgrade fast_tokenizer +``` diff --git a/fast_tokenizer/examples/clip/README.md b/fast_tokenizer/examples/clip/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/examples/clip/cpp/CMakeLists.txt b/fast_tokenizer/examples/clip/cpp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..271104d7f89caa8035677d23112694f2af9827b4 --- /dev/null +++ b/fast_tokenizer/examples/clip/cpp/CMakeLists.txt @@ -0,0 +1,28 @@ +cmake_minimum_required(VERSION 3.10) +project(cpp_fast_tokenizer_demo CXX C) +option(FAST_TOKENIZER_INSTALL_DIR "Path of downloaded fast_tokenizer sdk.") + +# Download clip vocab and merge files +set(CLIP_VOCAB_PATH ${CMAKE_CURRENT_BINARY_DIR}/clip_vocab.json) +set(CLIP_MERGES_PATH ${CMAKE_CURRENT_BINARY_DIR}/clip_merges.txt) + +if (EXISTS ${CLIP_VOCAB_PATH}) + message("The ${CLIP_VOCAB_PATH} exists already.") +else() + file(DOWNLOAD "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/vocab.json" ${CLIP_VOCAB_PATH} SHOW_PROGRESS) + message("Already download the vocab.json of clip to ${CMAKE_CURRENT_BINARY_DIR} for test.") +endif() + +if (EXISTS ${CLIP_MERGES_PATH}) + message("The ${CLIP_MERGES_PATH} exists already.") +else() + file(DOWNLOAD "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/merges.txt" ${CLIP_MERGES_PATH} SHOW_PROGRESS) + message("Already download the merges.txt of clip to ${CMAKE_CURRENT_BINARY_DIR} for test.") +endif() + +# Get FAST_TOKENIZER_INCS and FAST_TOKENIZER_LIBS +include(${FAST_TOKENIZER_INSTALL_DIR}/FastTokenizer.cmake) +include_directories(${FAST_TOKENIZER_INCS}) + +add_executable(demo ${PROJECT_SOURCE_DIR}/demo.cc) +target_link_libraries(demo ${FAST_TOKENIZER_LIBS}) diff --git a/fast_tokenizer/examples/clip/cpp/README.md b/fast_tokenizer/examples/clip/cpp/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d24c0f7fd65f17d9e2b451e984c9eeefd44078b3 --- /dev/null +++ b/fast_tokenizer/examples/clip/cpp/README.md @@ -0,0 +1,99 @@ +# ClipFastTokenizer C++ 示例 + +## 1. 快速安装 + +当前版本FastTokenizer C++库支持不同的操作系统以及硬件平台,用户可以根据实际的使用环境,从以下选择合适的预编译包: +|系统|下载地址| +|---|---| +|Linux-x64| [fast_tokenizer-linux-x64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.0.tgz) | +|Linux-aarch64| [fast_tokenizer-linux-aarch64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-aarch64-1.0.0.tgz) | +|Windows| [fast_tokenizer-win-x64-1.0.0.zip](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-win-x64-1.0.0.zip) | +|MacOS-x64| [fast_tokenizer-osx-x86_64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-x86_64-1.0.0.tgz) | +|MacOS-arm64| [fast_tokenizer-osx-arm64-1.0.0.tgz](https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-osx-arm64-1.0.0.tgz) | + +### 环境依赖 + +#### 系统环境要求 +|系统|版本| +|---|---| +|Linux|Ubuntu 16.04+,CentOS 7+| +|Windows|10| +|MacOS| 11.4+| + + +#### Linux,Mac编译环境要求 +|依赖|版本| +|---|---| +|cmake|>=16.0| +|gcc|>=8.2.0| + +#### Windows编译环境要求 +|依赖|版本| +|---|---| +|cmake|>=16.0| +|VisualStudio|2019| + +## 2. 
快速开始 + +以下以Linux平台为例, 介绍如何使用FastTokenizer C++预编译包完成demo示例编译及运行。该示例会生成一个名为`demo`的可执行文件。 + +### 2.1 下载解压 + +```shell +wget -c https://bj.bcebos.com/paddlenlp/fast_tokenizer/fast_tokenizer-linux-x64-1.0.0.tgz + +tar xvfz fast_tokenizer-linux-x64-1.0.0.tgz +# 解压后为fast_tokenizer目录 +``` + +解压后得到fast_tokenizer目录,该目录的结构如下: + +```shell + +fast_tokenizer +|__ commit.log # 编译时的commit id +|__ FastTokenizer.cmake # FastTokenizer CMake文件,定义了头文件目录、动态链接库目录变量 +|__ include # FastTokenizer的头文件目录 +|__ lib # FastTokenizer的动态链接库目录 +|__ third_party # FastTokenizer依赖的第三方库目录 + +``` + +推荐用户直接使用cmake方式引入FastTokenizer库。在CMake引入FastTokenizer时,只需添加一行 `include(FastTokenizer.cmake)`,即可获取FastTokenizer的预定义的CMake变量`FAST_TOKENIZER_INCS`和`FAST_TOKENIZER_LIBS`,分别指定FastTokenizer的头文件目录以及动态链接库目录。 + + +### 2.2 编译 + +示例提供简单的CMakeLists.txt, 用户仅需指定fast_tokenizer包的路径,即可完成编译。 + +```shell + +# 创建编译目录 +mkdir build +cd build + +# 运行cmake,通过指定fast_tokenizer包的路径,构建Makefile +cmake .. -DFAST_TOKENIZER_INSTALL_DIR=/path/to/fast_tokenizer + +# 编译 +make + +``` + +### 2.3 运行 + +```shell +./demo +``` + + +### 2.4 样例输出 + +输出包含原始文本的输入,以及分词后的ids序列结果(含padding)。 + +```shell + +text = "a photo of an astronaut riding a horse on mars" +ids = [49406, 320, 1125, 539, 550, 18376, 6765, 320, 4558, 525, 7496, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407] + +``` diff --git a/fast_tokenizer/examples/clip/cpp/demo.cc b/fast_tokenizer/examples/clip/cpp/demo.cc new file mode 100644 index 0000000000000000000000000000000000000000..0d7983b2ab55b0c14a46396f88ef51482d42b5ef --- /dev/null +++ b/fast_tokenizer/examples/clip/cpp/demo.cc @@ -0,0 +1,70 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "fast_tokenizer/tokenizers/clip_fast_tokenizer.h" +using namespace paddlenlp; + +template +std::ostream& operator<<(std::ostream& os, const std::vector vec) { + os << "["; + for (int i = 0; i < vec.size(); ++i) { + if (i == 0) { + os << vec[i]; + } else { + os << ", " << vec[i]; + } + } + os << "]"; + return os; +} + +fast_tokenizer::tokenizers_impl::ClipFastTokenizer CreateClipFastTokenizer( + const std::string& vocab_path, + const std::string& merge_path, + uint32_t max_length, + bool pad_to_max_length = true) { + fast_tokenizer::tokenizers_impl::ClipFastTokenizer tokenizer( + vocab_path, merge_path, max_length); + if (pad_to_max_length) { + tokenizer.EnablePadMethod(fast_tokenizer::core::RIGHT, + tokenizer.GetPadTokenId(), + 0, + tokenizer.GetPadToken(), + &max_length, + nullptr); + } + return tokenizer; +} + +int main() { + // 1. 
Define a clip fast tokenizer + auto tokenizer = CreateClipFastTokenizer("clip_vocab.json", + "clip_merges.txt", + /*max_length = */ 77, + /* pad_to_max_length = */ true); + // 2. Tokenize the input strings + std::vector encodings; + std::vector texts = { + "a photo of an astronaut riding a horse on mars"}; + tokenizer.EncodeBatchStrings(texts, &encodings); + + for (int i = 0; i < texts.size(); ++i) { + std::cout << "text = \"" << texts[i] << "\"" << std::endl; + std::cout << "ids = " << encodings[i].GetIds() << std::endl; + } + + return 0; +} diff --git a/fast_tokenizer/examples/clip/python/README.md b/fast_tokenizer/examples/clip/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/examples/clip/python/demo.py b/fast_tokenizer/examples/clip/python/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..97043fd7ba6885aac81cad5a49924c23c67d4d47 --- /dev/null +++ b/fast_tokenizer/examples/clip/python/demo.py @@ -0,0 +1,13 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/fast_tokenizer/examples/ernie-3.0/README.md b/fast_tokenizer/examples/ernie-3.0/README.md new file mode 100644 index 0000000000000000000000000000000000000000..db728533e676ca67382271ae705064f7b96e0119 --- /dev/null +++ b/fast_tokenizer/examples/ernie-3.0/README.md @@ -0,0 +1,21 @@ +# ErnieFastTokenizer分词示例 + +FastTokenizer库在C++、Python端提供ErnieFastTokenizer接口,用户只需传入模型相应的词表即可调用该接口,完成高效分词操作。该接口底层使用`WordPiece`算法进行分词。针对`WordPiece`算法,FastTokenizer实现了"Fast WordPiece Tokenization"提出的基于`MinMaxMatch`的`FastWordPiece`算法。原有`WordPiece`算法的时间复杂度与序列长度为二次方关系,在对长文本进行分词操作时,时间开销比较大。而`FastWordPiece`算法通过`Aho–Corasick `算法将`WordPiece`算法的时间复杂度降低为与序列长度的线性关系,大大提升了分词效率。`ErnieFastTokenizer`除了支持ERNIE模型的分词以外,还支持其他基于`WordPiece`算法分词的模型,比如`BERT`, `TinyBert`等,详细的模型列表如下: + +## 支持的模型列表 + +- ERNIE +- BERT +- TinyBERT +- ERNIE Gram +- ERNIE ViL + +## 详细分词示例文档 + +[C++ 分词示例](./cpp/README.md) + +[Python 分词示例](./python/README.md) + +## 参考文献 + +- Xinying Song, Alex Salcianuet al. 
"Fast WordPiece Tokenization", EMNLP, 2021 diff --git a/fast_tokenizer/examples/ernie-3.0/cpp/CMakeLists.txt b/fast_tokenizer/examples/ernie-3.0/cpp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..b7f349ea036e11adcb71d1f260461d20bde9839c --- /dev/null +++ b/fast_tokenizer/examples/ernie-3.0/cpp/CMakeLists.txt @@ -0,0 +1,22 @@ +cmake_minimum_required(VERSION 3.10) +project(cpp_fast_tokenizer_demo CXX C) + +option(FAST_TOKENIZER_INSTALL_DIR "Path of downloaded fast_tokenizer sdk.") + +# Download ernie vocab for demo +set(ERNIE_VOCAB_PATH ${CMAKE_CURRENT_BINARY_DIR}/ernie_vocab.txt) +if (EXISTS ${ERNIE_VOCAB_PATH}) + message(STATUS "The ${ERNIE_VOCAB_PATH} exists already.") +else() + file(DOWNLOAD "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt" ${ERNIE_VOCAB_PATH} SHOW_PROGRESS) + message(STATUS "Already download the vocab.txt of ernie to ${CMAKE_CURRENT_BINARY_DIR} for demo.") +endif() + +# Get FAST_TOKENIZER_INCS and FAST_TOKENIZER_LIBS +message(STATUS "The fast_tokenizer install dir: ${FAST_TOKENIZER_INSTALL_DIR}") +include(${FAST_TOKENIZER_INSTALL_DIR}/FastTokenizer.cmake) + +include_directories(${FAST_TOKENIZER_INCS}) + +add_executable(demo ${PROJECT_SOURCE_DIR}/demo.cc) +target_link_libraries(demo ${FAST_TOKENIZER_LIBS}) diff --git a/fast_tokenizer/examples/ernie-3.0/cpp/README.md b/fast_tokenizer/examples/ernie-3.0/cpp/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/examples/ernie-3.0/cpp/demo.cc b/fast_tokenizer/examples/ernie-3.0/cpp/demo.cc new file mode 100644 index 0000000000000000000000000000000000000000..d886791d4d8a28f15b63d72f370b2aa38811fd05 --- /dev/null +++ b/fast_tokenizer/examples/ernie-3.0/cpp/demo.cc @@ -0,0 +1,70 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "fast_tokenizer/tokenizers/ernie_fast_tokenizer.h" +using namespace paddlenlp; + +int main() { + // 1. Define a ernie fast tokenizer + fast_tokenizer::tokenizers_impl::ErnieFastTokenizer tokenizer( + "ernie_vocab.txt"); + // 2. 
Tokenize the input strings + // case 1: tokenize a single string + std::cout << "case 1: Tokenize a single string" << std::endl; + fast_tokenizer::core::Encoding encoding; + std::string single_string = + "商赢环球股份有限公司关于延期回复上海证券交易所对" + "公司2017年年度报告的事后审核问询函的公告"; + tokenizer.EncodePairStrings(single_string, &encoding); + std::cout << encoding.DebugString() << std::endl; + + // case 2: tokenize a pair of strings + std::cout << "case 2: Tokenize a pair of strings" << std::endl; + std::string text = "蚂蚁借呗等额还款可以换成先息后本吗"; + std::string text_pair = "借呗有先息到期还本吗"; + + tokenizer.EncodePairStrings(text, text_pair, &encoding); + std::cout << encoding.DebugString() << std::endl; + + // case 3: Tokenize a batch of single strings + std::cout << "case 3: Tokenize a batch of single strings" << std::endl; + std::vector encodings; + std::vector strings_list = { + "通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理?", + "凌云研发的国产两轮电动车怎么样,有什么惊喜?", + "一辆车的寿命到底多长,最多可以开多久?"}; + tokenizer.EncodeBatchStrings(strings_list, &encodings); + for (auto&& encoding : encodings) { + std::cout << encoding.DebugString() << std::endl; + } + + // case 4: Tokenize a batch of pair strings + std::cout << "case 4: Tokenize a batch of pair strings" << std::endl; + std::vector texts = { + "花呗自动从余额宝扣款,需要我自己设置吗", + "这个蚂蚁花呗能恢复正常用不", + "在经济的另一次转变中,人们发现在低地农场饲养羔羊更具成本效益,部分原因" + "是那里有更丰富、更有营养的牧场,因此湖地农场的利润变得更少。"}; + std::vector text_pairs = { + "支付宝余额会自动还花呗吗", + "我的蚂蚁花呗 怎么用不了", + "人们发现,经济的另一个转变更有营养。"}; + tokenizer.EncodeBatchStrings(texts, text_pairs, &encodings); + for (auto&& encoding : encodings) { + std::cout << encoding.DebugString() << std::endl; + } + return 0; +} \ No newline at end of file diff --git a/fast_tokenizer/examples/ernie-3.0/python/README.md b/fast_tokenizer/examples/ernie-3.0/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/examples/ernie-3.0/python/demo.py b/fast_tokenizer/examples/ernie-3.0/python/demo.py new file mode 100644 index 0000000000000000000000000000000000000000..490604e7bcc76a224d7d8f0ef5d103a3f7bb9ce2 --- /dev/null +++ b/fast_tokenizer/examples/ernie-3.0/python/demo.py @@ -0,0 +1,26 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
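+# This demo expects the ERNIE vocab file `ernie_vocab.txt` in the working directory
+# (e.g. https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt); it builds
+# an ErnieFastTokenizer from it and prints the fields of the resulting Encoding.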
+ +import fast_tokenizer +from fast_tokenizer import ErnieFastTokenizer, models + +fast_tokenizer.set_thread_num(1) +vocab = models.WordPiece.read_file("ernie_vocab.txt") +fast_tokenizer = ErnieFastTokenizer(vocab) +output = fast_tokenizer.encode("我爱中国") +print("ids: ", output.ids) +print("type_ids: ", output.type_ids) +print("tokens: ", output.tokens) +print("offsets: ", output.offsets) +print("attention_mask: ", output.attention_mask) diff --git a/fast_tokenizer/fast_tokenizer/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..a4269b2bbd640203181511cceec5a59ae83d157c --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/CMakeLists.txt @@ -0,0 +1,44 @@ +add_subdirectory(decoders) +add_subdirectory(models) +add_subdirectory(normalizers) +add_subdirectory(pretokenizers) +add_subdirectory(postprocessors) +add_subdirectory(core) +add_subdirectory(utils) +# set the relative path of shared library +if (NOT APPLE AND NOT WIN32) + set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,-rpath='$ORIGIN'") +endif() + +if (WITH_PYTHON) + add_subdirectory(pybind) + cc_library(core_tokenizers SHARED + SRCS pybind/pybind.cc tokenizers/ernie_fast_tokenizer.cc + DEPS pybind python pybind_normalizers pybind_utils + pybind_pretokenizers pybind_models pybind_decoders + pybind_postprocessors pybind_tokenizers pybind_exception + pybind_core normalizers pretokenizers core models + tokenizer added_vocabulary postprocessors json) + set_target_properties(core_tokenizers PROPERTIES PREFIX "") + if (WIN32) + set_target_properties(core_tokenizers PROPERTIES SUFFIX ".pyd") + else() + set_target_properties(core_tokenizers PROPERTIES SUFFIX ".so") + endif() + + if (APPLE) + SET(CMAKE_INSTALL_RPATH "@loader_path/core_tokenizers.so") + endif() + +else(WITH_PYTHON) + cc_library(core_tokenizers SHARED + SRCS tokenizers/ernie_fast_tokenizer.cc tokenizers/clip_fast_tokenizer.cc + DEPS normalizers pretokenizers models decoders + postprocessors core added_vocabulary tokenizer json) + + if (APPLE) + SET(CMAKE_INSTALL_RPATH "@loader_path/lib/libcore_tokenizers.dylib") + endif() +endif(WITH_PYTHON) + +add_subdirectory(test) diff --git a/fast_tokenizer/fast_tokenizer/core/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/core/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..2700f66a44018a824502a865449e23dd636b6f39 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/CMakeLists.txt @@ -0,0 +1,4 @@ +cc_library(added_vocabulary SRCS added_vocabulary.cc DEPS normalizers pretokenizers json) +cc_library(base SRCS base.cc DEPS json) +cc_library(tokenizer SRCS tokenizer.cc DEPS added_vocabulary json decoders trie models postprocessors base) +cc_library(core SRCS encoding.cc DEPS json base) diff --git a/fast_tokenizer/fast_tokenizer/core/added_vocabulary.cc b/fast_tokenizer/fast_tokenizer/core/added_vocabulary.cc new file mode 100644 index 0000000000000000000000000000000000000000..bdb05fa136b82dd28031d4d211a0e2064c54af1e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/added_vocabulary.cc @@ -0,0 +1,424 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/models/model.h" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "glog/logging.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +inline bool StartWithWord(const std::string& sequence) { + static re2::RE2 pattern("^\\w"); + return re2::RE2::FullMatch(sequence, pattern); +} + +inline bool EndWithWord(const std::string& sequence) { + static re2::RE2 pattern("\\w$"); + return re2::RE2::FullMatch(sequence, pattern); +} + +inline bool StartWithSpace(const std::string& sequence) { + static re2::RE2 pattern("^\\s*"); + return re2::RE2::FullMatch(sequence, pattern); +} + +inline bool EndWithSpace(const std::string& sequence) { + static re2::RE2 pattern("\\s*$"); + return re2::RE2::FullMatch(sequence, pattern); +} + +inline size_t GetEndSpaceIdx(const std::string& sequence) { + static re2::RE2 pattern("\\s*$"); + re2::StringPiece result_str; + pattern.Match( + sequence, 0, sequence.length(), RE2::UNANCHORED, &result_str, 1); + return result_str.data() - sequence.data(); +} + +inline size_t GetStartSpaceIdx(const std::string& sequence) { + static re2::RE2 pattern("^\\s*"); + re2::StringPiece result_str; + pattern.Match( + sequence, 0, sequence.length(), RE2::UNANCHORED, &result_str, 1); + return result_str.data() + result_str.length() - sequence.data(); +} + +inline size_t GetLeftMostSpaceFromEnd(const std::string& sequence) { + if (EndWithSpace(sequence)) { + return GetEndSpaceIdx(sequence); + } + return sequence.length(); +} + +inline size_t GetRightMostSpaceFromStart(const std::string& sequence) { + if (StartWithSpace(sequence)) { + return GetStartSpaceIdx(sequence); + } + return 0; +} + +AddedToken::AddedToken() + : content_(""), + is_single_word_(false), + use_lstrip_(false), + use_rstrip_(false), + use_normalized_(true), + is_special_(false) {} + +AddedToken::AddedToken(const std::string& content, + bool is_special, + bool single_word, + bool lstrip, + bool rstrip) + : content_(content), + is_special_(is_special), + use_normalized_(!is_special), + is_single_word_(single_word), + use_lstrip_(lstrip), + use_rstrip_(rstrip) {} + +std::string AddedToken::GetContent() const { return content_; } + +bool AddedToken::GetIsSpecial() const { return is_special_; } + +bool AddedToken::GetUseNormalized() const { return use_normalized_; } + +void AddedToken::SetIsSingleWord(bool is_single_word) { + is_single_word_ = is_single_word; +} +bool AddedToken::GetUseLStrip() const { return use_lstrip_; } + +bool AddedToken::GetUseRStrip() const { return use_rstrip_; } + +bool AddedToken::GetIsSingleWord() const { return is_single_word_; } + +void AddedToken::SetContent(const std::string& content) { content_ = content; } + +void AddedToken::SetUseLStrip(bool use_lstrip) { use_lstrip_ = use_lstrip; } + +void AddedToken::SetUseRStrip(bool use_rstrip) { use_rstrip_ = use_rstrip; } + +void AddedToken::SetUseNormalized(bool use_normalized) { + use_normalized_ = use_normalized; +} + +void AddedToken::SetIsSpecial(bool is_special) { 
is_special_ = is_special; } + +bool AddedToken::operator==(const AddedToken& other) const { + return content_ == other.content_; +} + +AddedVocabulary::AddedVocabulary() + : split_trie_({std::make_shared(""), Vocab()}), + split_normalized_trie_({std::make_shared(""), Vocab()}) {} + +size_t AddedVocabulary::GetLen() const { return vocab_.size(); } + +core::Vocab AddedVocabulary::GetVocab() const { return vocab_; } +core::Vocab& AddedVocabulary::GetMutableVocab() { return vocab_; } + +bool AddedVocabulary::TokenToId(const std::string& token, + const models::Model& model, + uint32_t* id) const { + if (vocab_.find(token) != vocab_.end()) { + *id = vocab_.at(token); + return true; + } + return model.TokenToId(token, id); +} + +bool AddedVocabulary::IdToToken(uint32_t id, + const models::Model& model, + std::string* token) const { + if (vocab_reversed_.find(id) != vocab_reversed_.end()) { + *token = vocab_reversed_.at(id).GetContent(); + return true; + } + return model.IdToToken(id, token); +} + +bool AddedVocabulary::IsSpecialToken(const std::string& token) const { + return special_tokens_set_.find(token) != special_tokens_set_.end(); +} + +size_t AddedVocabulary::AddSpecialTokens( + const std::vector& tokens, + const models::Model& model, + const normalizers::Normalizer* normalizers) { + return AddTokens(tokens, model, normalizers); +} + +size_t AddedVocabulary::AddTokens(const std::vector& tokens, + const models::Model& model, + const normalizers::Normalizer* normalizers) { + for (const auto& token : tokens) { + if (token.GetIsSpecial() && !token.GetContent().empty() && + !IsSpecialToken(token.GetContent())) { + special_tokens_.push_back(token); + special_tokens_set_.insert(token.GetContent()); + } + } + int ignored_tokens_num = 0; + for (const auto& token : tokens) { + if (token.GetContent().empty()) { + ignored_tokens_num += 1; + continue; + } + uint32_t id; + if (TokenToId(token.GetContent(), model, &id)) { + ignored_tokens_num += 1; + } else { + uint32_t new_id = model.GetVocabSize() + GetLen(); + vocab_[token.GetContent()] = new_id; + if (special_tokens_set_.count(token.GetContent()) == 0) { + added_tokens_.push_back(token); + } + id = new_id; + } + vocab_reversed_[id] = token; + } + RefreshAddedTokens(model, normalizers); + return tokens.size() - ignored_tokens_num; +} +void AddedVocabulary::RefreshAddedTokens( + const models::Model& model, const normalizers::Normalizer* normalizers) { + using TokenAndId = std::pair; + std::vector normalized, non_normalized; + for (const auto& tokens : {special_tokens_, added_tokens_}) { + for (const auto& token : tokens) { + uint32_t id; + if (TokenToId(token.GetContent(), model, &id)) { + if (token.GetUseNormalized()) { + normalized.push_back({token, id}); + } else { + non_normalized.push_back({token, id}); + } + } + } + } + Vocab ids; + std::vector tokens; + for (const auto& token_ids : non_normalized) { + tokens.push_back(token_ids.first); + ids[token_ids.first.GetContent()] = token_ids.second; + } + // Create a regex pattern + std::string pattern(""); + for (int i = 0; i < tokens.size(); ++i) { + if (i > 0) { + pattern += "|"; + } + std::string pattern_str = ""; + for (const auto& ch : tokens[i].GetContent()) { + if (ch == '[' || ch == ']') { + pattern_str.append(1, '\\'); + } + pattern_str.append(1, ch); + } + pattern += "\(" + pattern_str + "\)"; + } + // Update split_trie_ + split_trie_.first = std::make_shared(pattern); + split_trie_.second = std::move(ids); + Vocab normalized_ids; + std::vector normalized_tokens; + for (const auto& token_ids : 
normalized) { + normalized_tokens.push_back(token_ids.first); + normalized_ids[token_ids.first.GetContent()] = token_ids.second; + } + + std::string normalized_pattern(""); + for (int i = 0; i < normalized_tokens.size(); ++i) { + normalizers::NormalizedString normalized_content( + normalized_tokens[i].GetContent()); + if (normalizers != nullptr) { + (*normalizers)(&normalized_content); + } + if (i > 0) { + normalized_pattern += "|"; + } + std::string pattern_str = ""; + for (const auto& ch : normalized_content.GetStr()) { + if (ch == '[' || ch == ']') { + pattern_str.append(1, '\\'); + } + pattern_str.append(1, ch); + } + normalized_pattern += "\(" + pattern_str + "\)"; + } + split_normalized_trie_.first = std::make_shared(normalized_pattern); + split_normalized_trie_.second = std::move(normalized_ids); +} + +bool AddedVocabulary::FindMatch(const std::string& sequence, + const MatchSet& pattern, + std::vector* results) const { + if (sequence.empty()) { + return false; + } + std::vector splits; + size_t start = 0; + size_t start_offset = 0; + size_t end = sequence.length(); + re2::StringPiece result_str; + VLOG(6) << "start = " << start << ", end = " << end + << ", sequence = " << sequence + << ", pattern: " << pattern.first->pattern(); + while (pattern.first->Match( + sequence, start, end, RE2::UNANCHORED, &result_str, 1) && + result_str != "") { + VLOG(6) << "result_str: " << result_str << ", " << pattern.first->pattern(); + size_t curr_start = result_str.data() - sequence.data(); + size_t curr_end = curr_start + result_str.length(); + uint32_t id = pattern.second.at(result_str.ToString()); + AddedToken added_tokens = vocab_reversed_.at(id); + VLOG(6) << "start = " << start << ", end = " << end + << ", curr_start = " << curr_start << ", curr_end = " << curr_end; + if (added_tokens.GetIsSingleWord()) { + bool start_space = + (curr_start == 0) || !EndWithWord(sequence.substr(0, curr_start)); + bool stop_space = (curr_end >= sequence.length()) || + !StartWithWord(sequence.substr(curr_end)); + if (!start_space || !stop_space) { + // Discard not single word + start = curr_end; + continue; + } + } + if (added_tokens.GetUseLStrip()) { + auto new_start = GetEndSpaceIdx(sequence.substr(0, curr_start)); + curr_start = std::max(new_start, start_offset); + } + if (added_tokens.GetUseRStrip()) { + curr_end += GetStartSpaceIdx(sequence.substr(curr_end)); + } + if (curr_start > start_offset) { + splits.push_back({0, false, {start_offset, curr_start}}); + } + splits.push_back({id, true, {curr_start, curr_end}}); + start = curr_end; + start_offset = curr_end; + } + if (start != sequence.length()) { + splits.push_back({0, false, {start, sequence.length()}}); + } + *results = std::move(splits); + return true; +} + +bool AddedVocabulary::SplitWithIndices( + const normalizers::NormalizedString& normalized, + const MatchSet& pattern, + std::vector* split_results) const { + std::vector match_results; + bool status = FindMatch(normalized.GetStr(), pattern, &match_results); + for (auto& match_result : match_results) { + normalizers::NormalizedString slice; + auto id = std::get<0>(match_result); + auto is_not_unk = std::get<1>(match_result); + auto offsets = std::get<2>(match_result); + normalized.Slice(offsets, &slice, false); + std::vector tokens; + if (is_not_unk) { + tokens.emplace_back(core::Token{id, slice.GetStr(), {0, slice.GetLen()}}); + } + // use push_back({slice, tokens}) will raise error in windows platform. 
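+    // so construct the StringSplit in place from (slice, tokens) via emplace_back instead.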
+ split_results->emplace_back(slice, tokens); + } + return status; +} + +void AddedVocabulary::ExtractAndNormalize( + const normalizers::Normalizer* normalizers, + const std::string& sequence, + pretokenizers::PreTokenizedString* pretokenized) const { + pretokenized->SetOriginalStr(sequence); + pretokenized->Split( + [&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + this->SplitWithIndices(*normalized, this->split_trie_, string_splits); + }); + pretokenized->Split( + [&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + if (normalizers != nullptr) { + (*normalizers)(normalized); + VLOG(6) << "After normalized: " << normalized->GetStr(); + this->SplitWithIndices( + *normalized, this->split_normalized_trie_, string_splits); + } + }); +} + +const std::unordered_map& +AddedVocabulary::GetAddedTokenVocabReversed() const { + return vocab_reversed_; +} + + +void to_json(nlohmann::json& j, const AddedTokenWithId& added_token) { + j = { + {"id", added_token.id_}, + {"content", added_token.added_token_.GetContent()}, + {"single_word", added_token.added_token_.GetIsSingleWord()}, + {"lstrip", added_token.added_token_.GetUseLStrip()}, + {"rstrip", added_token.added_token_.GetUseRStrip()}, + {"normalized", added_token.added_token_.GetUseNormalized()}, + {"special", added_token.added_token_.GetIsSpecial()}, + }; +} + +void from_json(const nlohmann::json& j, AddedTokenWithId& added_token) { + j.at("id").get_to(added_token.id_); + std::string content = j.at("content").get(); + added_token.added_token_.SetContent(content); + + bool single_word = j.at("single_word").get(); + added_token.added_token_.SetIsSingleWord(single_word); + + bool lstrip = j.at("lstrip").get(); + added_token.added_token_.SetUseLStrip(lstrip); + + bool rstrip = j.at("rstrip").get(); + added_token.added_token_.SetUseRStrip(rstrip); + + bool normalized = j.at("normalized").get(); + added_token.added_token_.SetUseNormalized(normalized); + + bool special = j.at("special").get(); + added_token.added_token_.SetIsSpecial(special); +} + +void to_json(nlohmann::json& j, const AddedVocabulary& added_vocab) { + nlohmann::json jarray = nlohmann::json::array(); + for (const auto& vocab_item : added_vocab.vocab_reversed_) { + AddedTokenWithId added_token_with_id; + added_token_with_id.id_ = vocab_item.first; + added_token_with_id.added_token_ = vocab_item.second; + nlohmann::json jo = added_token_with_id; + jarray.emplace_back(jo); + } + j = jarray; +} + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/core/added_vocabulary.h b/fast_tokenizer/fast_tokenizer/core/added_vocabulary.h new file mode 100644 index 0000000000000000000000000000000000000000..a9b26f6778184c3174110fdf4344409726d2dc0e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/added_vocabulary.h @@ -0,0 +1,154 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include // For shared_ptr +#include +#include + +#include "fast_tokenizer/core/base.h" +#include "nlohmann/json.hpp" + +namespace re2 { +class RE2; +} // namespace re2 + +namespace paddlenlp { +namespace fast_tokenizer { + +namespace normalizers { +class Normalizer; +class NormalizedString; +} // namespace normalizers + +namespace models { +class Model; +} // namespace models + +namespace pretokenizers { +class PreTokenizedString; +struct StringSplit; +} // namespace pretokenizers + +namespace core { + +using MatchSet = std::pair, Vocab>; +using MatchResult = std::tuple; + +bool StartWithWord(const std::string& sequence); +bool EndWithWord(const std::string& sequence); +bool StartWithSpace(const std::string& sequence); +bool EndWithSpace(const std::string& sequence); + +class FASTTOKENIZER_DECL AddedToken { +public: + AddedToken(); + AddedToken(const std::string& content, + bool is_special = false, + bool single_word = false, + bool lstrip = false, + bool rstrip = false); + void SetIsSingleWord(bool is_single_word); + void SetUseLStrip(bool use_lstrip); + void SetUseRStrip(bool use_rstrip); + void SetUseNormalized(bool use_normalized); + void SetContent(const std::string& content); + void SetIsSpecial(bool is_special); + std::string GetContent() const; + bool GetIsSpecial() const; + bool GetUseNormalized() const; + bool GetUseLStrip() const; + bool GetUseRStrip() const; + bool GetIsSingleWord() const; + bool operator==(const AddedToken& other) const; + +private: + std::string content_; + bool is_single_word_; + bool use_lstrip_; + bool use_rstrip_; + bool use_normalized_; + bool is_special_; + friend struct AddedTokenWithId; +}; + +struct FASTTOKENIZER_DECL AddedTokenWithId { + AddedToken added_token_; + uint32_t id_; + friend void to_json(nlohmann::json& j, const AddedTokenWithId& added_token); + friend void from_json(const nlohmann::json& j, AddedTokenWithId& added_token); +}; + +class FASTTOKENIZER_DECL AddedVocabulary { +public: + AddedVocabulary(); + size_t GetLen() const; + core::Vocab& GetMutableVocab(); + core::Vocab GetVocab() const; + bool TokenToId(const std::string& token, + const models::Model& model, + uint32_t* id) const; + bool IdToToken(uint32_t id, + const models::Model& model, + std::string* token) const; + bool IsSpecialToken(const std::string& token) const; + size_t AddSpecialTokens(const std::vector& tokens, + const models::Model& model, + const normalizers::Normalizer* normalizers); + size_t AddTokens(const std::vector& tokens, + const models::Model& model, + const normalizers::Normalizer* normalizers); + void RefreshAddedTokens(const models::Model& model, + const normalizers::Normalizer* normalizers); + bool FindMatch(const std::string& sequence, + const MatchSet& pattern, + std::vector* results) const; + bool SplitWithIndices( + const normalizers::NormalizedString& normalized, + const MatchSet& pattern, + std::vector* split_results) const; + void ExtractAndNormalize( + const normalizers::Normalizer* normalizers, + const std::string& sequence, + pretokenizers::PreTokenizedString* pretokenized) const; + const std::unordered_map& GetAddedTokenVocabReversed() + const; + +private: + core::Vocab vocab_; + std::unordered_map vocab_reversed_; + std::vector added_tokens_; + std::vector special_tokens_; + std::unordered_set special_tokens_set_; + MatchSet split_trie_; + MatchSet split_normalized_trie_; + friend void to_json(nlohmann::json& j, const AddedVocabulary& added_vocab); + friend void from_json(const nlohmann::json& j, AddedVocabulary& 
added_vocab); +}; + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp + +namespace std { +template <> +class hash { +public: + size_t operator()( + const paddlenlp::fast_tokenizer::core::AddedToken& added_token) const { + return std::hash()(added_token.GetContent()); + } +}; +} diff --git a/fast_tokenizer/fast_tokenizer/core/base.cc b/fast_tokenizer/fast_tokenizer/core/base.cc new file mode 100644 index 0000000000000000000000000000000000000000..2d0bdb23d0121e40188d45c4dcfa05789eaf006f --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/base.cc @@ -0,0 +1,53 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/core/base.h" + +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +static int fast_tokenizer_thread_num = 1; + +void SetThreadNum(int thread_num) { fast_tokenizer_thread_num = thread_num; } + +int GetThreadNum() { return fast_tokenizer_thread_num; } + +void RunMultiThread(std::function func, + size_t batch_size) { + int thread_num = GetThreadNum(); + if (thread_num == 1) { + // Note(zhoushunjie): No need to create threads when + // thread_num equals to 1. + func(0, batch_size); + } else { + std::vector vectorOfThread; + size_t start_index = 0; + size_t step_index = ceil(batch_size / float(thread_num)); + + for (size_t thread_index = 0; thread_index < thread_num; thread_index++) { + vectorOfThread.emplace_back(std::thread(func, start_index, step_index)); + start_index = start_index + step_index; + } + for (size_t thread_index = 0; thread_index < thread_num; thread_index++) { + vectorOfThread[thread_index].join(); + } + } +} + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/core/base.h b/fast_tokenizer/fast_tokenizer/core/base.h new file mode 100644 index 0000000000000000000000000000000000000000..21af2d912f7cbdd2285950300328f54d41b3cedc --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/base.h @@ -0,0 +1,378 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include +#include +#include +#include +#include + +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace std { +template <> +struct hash> { + size_t operator()(const std::pair& x) const { + size_t h1 = hash()(x.first); + size_t h2 = hash()(x.second); + return h1 ^ (h2 << 1); + } +}; +} + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +enum FASTTOKENIZER_DECL OffsetType { CHAR, BYTE }; +enum FASTTOKENIZER_DECL Direction { LEFT, RIGHT }; +enum FASTTOKENIZER_DECL TruncStrategy { + LONGEST_FIRST, + ONLY_FIRST, + ONLY_SECOND +}; +enum FASTTOKENIZER_DECL PadStrategy { BATCH_LONGEST, FIXED_SIZE }; + +enum FASTTOKENIZER_DECL SplitMode { + REMOVED, + ISOLATED, + MERGED_WITH_PREVIOUS, + MERGED_WITH_NEXT, + CONTIGUOUS +}; + +NLOHMANN_JSON_SERIALIZE_ENUM(OffsetType, + { + {CHAR, "CHAR"}, {BYTE, "BYTE"}, + }); + +NLOHMANN_JSON_SERIALIZE_ENUM(Direction, + { + {LEFT, "LEFT"}, {RIGHT, "RIGHT"}, + }); + +NLOHMANN_JSON_SERIALIZE_ENUM(TruncStrategy, + { + {LONGEST_FIRST, "LONGEST_FIRST"}, + {ONLY_FIRST, "ONLY_FIRST"}, + {ONLY_SECOND, "ONLY_SECOND"}, + }); + + +NLOHMANN_JSON_SERIALIZE_ENUM(PadStrategy, + { + {BATCH_LONGEST, "BATCH_LONGEST"}, + {FIXED_SIZE, "FIXED_SIZE"}, + }); + +struct FASTTOKENIZER_DECL TruncMethod { + Direction direction_; + size_t max_len_; + TruncStrategy strategy_; + size_t stride_; + TruncMethod() + : max_len_(512), + stride_(0), + strategy_(LONGEST_FIRST), + direction_(RIGHT) {} +}; + +struct FASTTOKENIZER_DECL PadMethod { + PadStrategy strategy_; + Direction direction_; + uint32_t pad_id_; + uint32_t pad_token_type_id_; + std::string pad_token_; + uint32_t pad_len_; + uint32_t pad_to_multiple_of_; + + PadMethod() + : strategy_(BATCH_LONGEST), + direction_(RIGHT), + pad_id_(0), + pad_token_type_id_(0), + pad_token_("[PAD]"), + pad_len_(0), + pad_to_multiple_of_(0) {} +}; + +inline void to_json(nlohmann::json& j, const TruncMethod& trunc_method) { + j = { + {"strategy", trunc_method.strategy_}, + {"direction", trunc_method.direction_}, + {"max_len", trunc_method.max_len_}, + {"stride", trunc_method.stride_}, + }; +} + +inline void from_json(const nlohmann::json& j, TruncMethod& trunc_method) { + j["strategy"].get_to(trunc_method.strategy_); + j["direction"].get_to(trunc_method.direction_); + j["max_len"].get_to(trunc_method.max_len_); + j["stride"].get_to(trunc_method.stride_); +} + + +inline void to_json(nlohmann::json& j, const PadMethod& pad_method) { + j = { + {"strategy", pad_method.strategy_}, + {"direction", pad_method.direction_}, + {"pad_id", pad_method.pad_id_}, + {"pad_token_type_id", pad_method.pad_token_type_id_}, + {"pad_token", pad_method.pad_token_}, + {"pad_len", pad_method.pad_len_}, + {"pad_to_multiple_of", pad_method.pad_to_multiple_of_}, + }; +} + +inline void from_json(const nlohmann::json& j, PadMethod& pad_method) { + j["strategy"].get_to(pad_method.strategy_); + j["direction"].get_to(pad_method.direction_); + j["pad_id"].get_to(pad_method.pad_id_); + j["pad_token_type_id"].get_to(pad_method.pad_token_type_id_); + j["pad_token"].get_to(pad_method.pad_token_); + j["pad_len"].get_to(pad_method.pad_len_); + j["pad_to_multiple_of"].get_to(pad_method.pad_to_multiple_of_); +} + +using Offset = std::pair; +using Range = std::pair; +using Vocab = std::unordered_map; +using VocabList = std::vector>; +using VocabReversed = std::unordered_map; +using SortedVocabReversed = std::map; +using Pair = std::pair; +using MergeMap = std::unordered_map>; +using Merges = std::vector>; + +inline void 
to_json(nlohmann::json& j, + const SortedVocabReversed& sorted_vocab_r) { + j = nlohmann::ordered_json(); + for (const auto& item : sorted_vocab_r) { + j[item.second] = item.first; + } +} + +struct FASTTOKENIZER_DECL Token { + uint32_t id_; + std::string value_; + Offset offset_; + Token() = default; + Token(uint32_t id, const std::string& value, const Offset& offset) + : id_(id), value_(value), offset_(offset) {} +}; + +struct FASTTOKENIZER_DECL Merge { + size_t pos_; + uint32_t rank_; + uint32_t new_id_; + + bool operator==(const Merge& other) const { + return pos_ == other.pos_ && rank_ == other.rank_; + } + bool operator<(const Merge& other) const { + // Used in priority queue + // The queue will output the Merge value + // in ascending order of rank_ + if (rank_ != other.rank_) { + return rank_ > other.rank_; + } + return pos_ > other.pos_; + } +}; + +struct FASTTOKENIZER_DECL Symbol { + uint32_t ch_; // symbol id + int prev_; + int next_; + size_t len_; + + Symbol() = default; + Symbol(uint32_t ch, int prev, int next, size_t len) + : ch_(ch), prev_(prev), next_(next), len_(len) {} + // Merges the current Symbol with the other one. + // In order to update prev/next, we consider Self to be the Symbol on the + // left, + // and other to be the next one on the right. + void MergeWith(const Symbol& other, uint32_t ch) { + ch_ = ch; + next_ = other.next_; + len_ += other.len_; + } +}; + +struct FASTTOKENIZER_DECL BPEWord { + BPEWord() = default; + BPEWord(size_t capacity) { Reserve(capacity); } + void Reserve(size_t capacity) { symbols_.reserve(capacity); } + void Add(uint32_t ch, size_t byte_len) { + int len = symbols_.size(); + int next = -1; + int prev = -1; + if (len >= 1) { + symbols_.back().next_ = len; + prev = len - 1; + } + symbols_.emplace_back(ch, prev, next, byte_len); + } + + void Merge(uint32_t c1, + uint32_t c2, + uint32_t replacement, + std::vector>* changes) { + for (int i = 0; i < symbols_.size(); ++i) { + // Found a byte pair + if (symbols_[i].ch_ == c1 && i + 1 < symbols_.size() && + symbols_[i + 1].ch_ == c2) { + auto& first = symbols_[i]; + auto& second = symbols_[i + 1]; + // If there are other characters before the pair + if (i > 0) { + changes->push_back({{symbols_[i - 1].ch_, first.ch_}, -1}); + changes->push_back({{symbols_[i - 1].ch_, replacement}, 1}); + } + Symbol symbols{ + replacement, first.prev_, second.next_, first.len_ + second.len_}; + symbols_.insert(symbols_.begin() + i, symbols); + symbols_.erase(symbols_.begin() + i + 1, symbols_.begin() + i + 3); + if (i + 1 < symbols_.size()) { + changes->push_back({{second.ch_, symbols_[i + 1].ch_}, -1}); + changes->push_back({{replacement, symbols_[i + 1].ch_}, 1}); + } + } + } + } + + void MergeAll(const MergeMap& merges, const std::vector& dropout) { + std::priority_queue queue; + std::vector skip; + skip.reserve(symbols_.size()); + for (int i = 0; i < symbols_.size() - 1; ++i) { + auto& first = symbols_[i]; + auto& second = symbols_[i + 1]; + if (merges.find({first.ch_, second.ch_}) != merges.end()) { + auto new_merge_info = merges.at({first.ch_, second.ch_}); + core::Merge new_merge{static_cast(i), + new_merge_info.first, + new_merge_info.second}; + queue.push(new_merge); + } + } + std::random_device + rd; // Will be used to obtain a seed for the random number engine + std::mt19937 gen(rd()); // Standard mersenne_twister_engine seeded with + // rd() + std::uniform_real_distribution distrib(0.0, 1.0); + bool can_skip = (dropout.size() > 0); + while (!queue.empty()) { + // Can't use reference there, because 
the pop operation will change the + // top value + auto top = queue.top(); + queue.pop(); + if (can_skip && distrib(gen) < dropout[0]) { + // May dropout some merges + skip.push_back(top); + } else { + for (auto& skip_merge : skip) { + queue.push(skip_merge); + } + skip.clear(); + if (symbols_[top.pos_].len_ == 0) { + continue; + } + if (symbols_[top.pos_].next_ == -1) { + continue; + } + size_t next_pos = symbols_[top.pos_].next_; + auto& right = symbols_[next_pos]; + // Make sure we are not processing an expired queue entry + auto target_new_pair = Pair{symbols_[top.pos_].ch_, right.ch_}; + if (merges.find(target_new_pair) == merges.end() || + merges.at(target_new_pair).second != top.new_id_) { + continue; + } + // Otherwise, let's merge + symbols_[top.pos_].MergeWith(right, top.new_id_); + // Tag the right part as removed + symbols_[next_pos].len_ = 0; + // Update `prev` on the new `next` to the current pos + if (right.next_ > -1 && (right.next_ < symbols_.size())) { + symbols_[right.next_].prev_ = top.pos_; + } + // Insert the new pair formed with the previous symbol + auto& current = symbols_[top.pos_]; + if (current.prev_ >= 0) { + auto prev = current.prev_; + auto& prev_symbol = symbols_[prev]; + auto new_pair = Pair{prev_symbol.ch_, current.ch_}; + if (merges.find(new_pair) != merges.end()) { + auto new_merge = merges.at(new_pair); + queue.push({static_cast(current.prev_), + new_merge.first, + new_merge.second}); + } + } + + // Insert the new pair formed with the next symbol + size_t next = current.next_; + if (next < symbols_.size()) { + auto& next_symbol = symbols_[next]; + auto next_pair = Pair{current.ch_, next_symbol.ch_}; + if (merges.find(next_pair) != merges.end()) { + auto new_merge = merges.at(next_pair); + queue.push({top.pos_, new_merge.first, new_merge.second}); + } + } + } + } + symbols_.erase( + std::remove_if(symbols_.begin(), + symbols_.end(), + [](const Symbol& symbol) { return symbol.len_ == 0; }), + symbols_.end()); + } + + void GetChars(std::vector* result) const { + result->reserve(symbols_.size()); + for (const auto& symbol : symbols_) { + result->emplace_back(symbol.ch_); + } + } + + void GetOffset(std::vector* result) const { + result->reserve(symbols_.size()); + uint32_t pos = 0; + for (const auto& symbol : symbols_) { + result->emplace_back(pos, pos + symbol.len_); + pos += symbol.len_; + } + } + + std::vector symbols_; +}; + +FASTTOKENIZER_DECL void SetThreadNum(int thread_num); + +FASTTOKENIZER_DECL int GetThreadNum(); + +FASTTOKENIZER_DECL void RunMultiThread(std::function func, + size_t batch_size); + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/core/encoding.cc b/fast_tokenizer/fast_tokenizer/core/encoding.cc new file mode 100644 index 0000000000000000000000000000000000000000..379d21df931b9ef7803ccc23f6d6321e09cfc702 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/encoding.cc @@ -0,0 +1,669 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/core/encoding.h" +#include +#include +#include +#include +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +Encoding::Encoding(const std::vector& ids, + const std::vector& type_ids, + const std::vector& tokens, + const std::vector& words_idx, + const std::vector& offsets, + const std::vector& special_tokens_mask, + const std::vector& attention_mask, + const std::vector& overflowing, + const std::unordered_map& sequence_ranges) + : ids_(ids), + type_ids_(type_ids), + tokens_(tokens), + words_idx_(words_idx), + offsets_(offsets), + special_tokens_mask_(special_tokens_mask), + attention_mask_(attention_mask), + overflowing_(overflowing), + sequence_ranges_(sequence_ranges) {} +// Move version +Encoding::Encoding(std::vector&& ids, + std::vector&& type_ids, + std::vector&& tokens, + std::vector&& words_idx, + std::vector&& offsets, + std::vector&& special_tokens_mask, + std::vector&& attention_mask, + std::vector&& overflowing, + std::unordered_map&& sequence_ranges) + : ids_(std::move(ids)), + type_ids_(std::move(type_ids)), + tokens_(std::move(tokens)), + words_idx_(std::move(words_idx)), + offsets_(std::move(offsets)), + special_tokens_mask_(std::move(special_tokens_mask)), + attention_mask_(std::move(attention_mask)), + overflowing_(std::move(overflowing)), + sequence_ranges_(std::move(sequence_ranges)) {} + +Encoding::Encoding(uint32_t capacity) { + ids_.reserve(capacity); + type_ids_.reserve(capacity); + tokens_.reserve(capacity); + words_idx_.reserve(capacity); + offsets_.reserve(capacity); + special_tokens_mask_.reserve(capacity); + attention_mask_.reserve(capacity); +} + +Encoding::Encoding(const std::vector& tokens, uint32_t type_id) + : type_ids_(tokens.size(), type_id), + words_idx_(tokens.size(), std::numeric_limits::max()), + attention_mask_(tokens.size(), 1), + special_tokens_mask_(tokens.size(), 0) { + auto length = tokens.size(); + ids_.reserve(length); + offsets_.reserve(length); + tokens_.reserve(length); + for (const auto& token : tokens) { + ids_.push_back(token.id_); + tokens_.push_back(token.value_); + offsets_.push_back(token.offset_); + } +} + +Encoding::Encoding(Encoding&& other) + : ids_(std::move(other.ids_)), + type_ids_(std::move(other.type_ids_)), + tokens_(std::move(other.tokens_)), + words_idx_(std::move(other.words_idx_)), + offsets_(std::move(other.offsets_)), + special_tokens_mask_(std::move(other.special_tokens_mask_)), + attention_mask_(std::move(other.attention_mask_)), + overflowing_(std::move(other.overflowing_)), + sequence_ranges_(std::move(other.sequence_ranges_)) {} + +Encoding& Encoding::operator=(Encoding&& other) { + ids_ = std::move(other.ids_); + type_ids_ = std::move(other.type_ids_); + tokens_ = std::move(other.tokens_); + words_idx_ = std::move(other.words_idx_); + offsets_ = std::move(other.offsets_); + special_tokens_mask_ = std::move(other.special_tokens_mask_); + attention_mask_ = std::move(other.attention_mask_); + overflowing_ = std::move(other.overflowing_); + sequence_ranges_ = std::move(other.sequence_ranges_); + return *this; +} + +bool Encoding::IsEmpty() const { return ids_.empty(); } + +int Encoding::GetLen() const { return ids_.size(); } + +int Encoding::GetNumSequence() const { + if (sequence_ranges_.empty()) { + return 1; + } + return sequence_ranges_.size(); +} + +void Encoding::SetSequenceIds(uint32_t seq_ids) { + sequence_ranges_[seq_ids] = {0, 
GetLen()}; +} + +const std::vector& Encoding::GetTokens() const { return tokens_; } + +const std::vector& Encoding::GetWordsIdx() const { + return words_idx_; +} + +std::vector& Encoding::GetMutableWordsIdx() { return words_idx_; } + +std::vector Encoding::GetSequenceIds() const { + std::vector sequences(GetLen()); + for (uint32_t seq_id = 0; seq_id < GetNumSequence(); ++seq_id) { + Range range = sequence_ranges_.at(seq_id); + auto seq_len = range.second - range.first; + for (int i = range.first; i < range.second; ++i) { + sequences[i] = seq_id; + } + } + return sequences; +} + +const std::vector& Encoding::GetIds() const { return ids_; } + +const std::vector& Encoding::GetTypeIds() const { return type_ids_; } + +const std::vector& Encoding::GetOffsets() const { return offsets_; } + +std::vector& Encoding::GetMutableOffsets() { return offsets_; } + +const std::vector& Encoding::GetSpecialTokensMask() const { + return special_tokens_mask_; +} + +const std::vector& Encoding::GetAttentionMask() const { + return attention_mask_; +} + +const std::vector& Encoding::GetOverflowing() const { + return overflowing_; +} + +std::vector& Encoding::GetMutableOverflowing() { + return overflowing_; +} + +Range Encoding::GetSequenceRange(uint32_t seq_id) const { + return sequence_ranges_.at(seq_id); +} + +void Encoding::ProcessTokenWithOffsets( + std::function + process_token_fn) { + auto length = GetLen(); + for (int i = 0; i < length; ++i) { + process_token_fn(i, tokens_[i], &offsets_[i]); + } +} + +std::vector Encoding::TokenIdxToSequenceIds( + uint32_t token_idx) const { + std::vector seq_ids; + if (token_idx < GetLen()) { + if (sequence_ranges_.empty()) { + seq_ids.push_back(0); + } else { + for (auto iter = sequence_ranges_.begin(); iter != sequence_ranges_.end(); + ++iter) { + if (token_idx >= iter->second.first && + token_idx < iter->second.second) { + seq_ids.push_back(iter->first); + break; + } + } + } + } + return seq_ids; +} + +std::vector Encoding::WordIdxToTokensIdx(uint32_t word_idx, + uint32_t seq_id) const { + auto seq_range = sequence_ranges_.at(seq_id); + std::vector ranges; + int start = -1; + int end = -1; + for (uint32_t i = seq_range.first; i < seq_range.second; ++i) { + // -1 is the word index of special token + if (words_idx_[i] > word_idx && + words_idx_[i] != std::numeric_limits::max()) { + break; + } + if (words_idx_[i] == word_idx) { + if (start < 0 || i < start) { + start = i; + } + if (end < 0 || i >= end) { + end = i + 1; + } + } + } + if (start >= 0 && end >= 0) { + seq_range.first += start; + seq_range.second += end; + ranges.push_back(seq_range); + } + return ranges; +} + +std::vector Encoding::WordIdxToCharOffsets(uint32_t word_idx, + uint32_t seq_id) const { + std::vector offsets; + std::vector ranges = WordIdxToTokensIdx(word_idx, seq_id); + if (ranges.size() > 0) { + auto start = ranges[0].first; + auto end = ranges[0].second; + if (end > 0) { + offsets.push_back({offsets_[start].first, offsets_[end - 1].second}); + } + } + return offsets; +} + +std::vector> Encoding::TokenIdxToCharOffsets( + uint32_t token_idx) const { + std::vector> results; + auto seq_ids = TokenIdxToSequenceIds(token_idx); + if (seq_ids.size() > 0) { + results.push_back({seq_ids[0], offsets_[token_idx]}); + } + return results; +} + +std::vector> Encoding::TokenIdxToWordIdx( + uint32_t token_idx) const { + std::vector> results; + auto seq_ids = TokenIdxToSequenceIds(token_idx); + if (seq_ids.size() > 0) { + results.push_back({seq_ids[0], words_idx_[token_idx]}); + } + return results; +} + 
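+// Maps a character position inside sequence `seq_id` back to the index of the
+// token whose offset range covers it; the result is empty when the position
+// does not fall inside any token's offsets (e.g. padding tokens, whose offsets
+// are (0, 0)).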
+std::vector Encoding::CharOffsetsToTokenIdx(uint32_t char_pos, + uint32_t seq_id) const { + std::vector token_idx; + auto seq_range = sequence_ranges_.at(seq_id); + for (int i = seq_range.first; i < seq_range.second; ++i) { + if (char_pos >= offsets_[i].first && char_pos < offsets_[i].second) { + token_idx.push_back(i); + break; + } + } + return token_idx; +} + +std::vector Encoding::CharOffsetsToWordIdx(uint32_t char_pos, + uint32_t seq_id) const { + std::vector token_idx = CharOffsetsToTokenIdx(char_pos, seq_id); + std::vector word_idx; + if (token_idx.size() > 0) { + auto words_idx = TokenIdxToWordIdx(token_idx[0]); + if (words_idx.size() > 0) { + word_idx.push_back(words_idx[0].second); + } + } + return word_idx; +} + +void Encoding::Truncate(size_t max_len, size_t stride, Direction direction) { + size_t encoding_len = ids_.size(); + if (max_len < encoding_len) { + if (max_len == 0) { + *this = Encoding(0); + overflowing_.push_back(*this); + return; + } + assert(stride < max_len); + sequence_ranges_.clear(); + + size_t step_len = max_len - stride; + bool found_end = false; + std::vector part_ranges; + // Get PartRanges + if (direction == RIGHT) { + for (size_t start = 0; start < encoding_len && !found_end; + start += step_len) { + size_t stop = std::min(start + max_len, encoding_len); + found_end = (stop == encoding_len); + part_ranges.push_back({start, stop}); + } + } else { + for (size_t i = 0; i < encoding_len; i += step_len) { + size_t stop = encoding_len - i; + size_t start = (stop < max_len) ? 0 : stop - max_len; + if (start < stop && !found_end) { + found_end = (start == 0); + part_ranges.push_back({start, stop}); + } else { + break; + } + } + } + // Create new encoding + auto new_encoding_len = part_ranges[0].second - part_ranges[0].first; + Encoding new_encoding( + std::vector(ids_.begin(), ids_.begin() + new_encoding_len), + std::vector(type_ids_.begin(), + type_ids_.begin() + new_encoding_len), + std::vector(tokens_.begin(), + tokens_.begin() + new_encoding_len), + std::vector(words_idx_.begin(), + words_idx_.begin() + new_encoding_len), + std::vector(offsets_.begin(), + offsets_.begin() + new_encoding_len), + std::vector(special_tokens_mask_.begin(), + special_tokens_mask_.begin() + new_encoding_len), + std::vector(attention_mask_.begin(), + attention_mask_.begin() + new_encoding_len), + std::vector(), + std::unordered_map()); + // Set overflowing + for (size_t i = 1; i < part_ranges.size() - 1; ++i) { + auto start = part_ranges[i].first; + auto end = part_ranges[i].second; + new_encoding.overflowing_.emplace_back(Encoding( + std::vector(ids_.begin() + start, ids_.begin() + end), + std::vector(type_ids_.begin() + start, + type_ids_.begin() + end), + std::vector(tokens_.begin() + start, + tokens_.begin() + end), + std::vector(words_idx_.begin() + start, + words_idx_.begin() + end), + std::vector(offsets_.begin() + start, offsets_.begin() + end), + std::vector(special_tokens_mask_.begin() + start, + special_tokens_mask_.begin() + end), + std::vector(attention_mask_.begin() + start, + attention_mask_.begin() + end), + std::vector(), + std::unordered_map())); + } + *this = std::move(new_encoding); + } +} + + +void Encoding::MergeWith(const Encoding& pair, bool growing_offsets) { + std::vector overflowings; + + for (const auto& this_o : overflowing_) { + auto n_encoding = this_o; + n_encoding.MergeWith(pair, growing_offsets); + overflowings.emplace_back(n_encoding); + for (const auto& pair_o : pair.overflowing_) { + auto n_encoding = this_o; + n_encoding.MergeWith(pair_o, 
growing_offsets); + overflowings.emplace_back(n_encoding); + } + } + for (const auto& pair_o : pair.overflowing_) { + auto n_encoding = *this; + n_encoding.MergeWith(pair_o, growing_offsets); + overflowings.emplace_back(n_encoding); + } + + auto orignal_len = GetLen(); + for (const auto& pair_seq_range : pair.sequence_ranges_) { + sequence_ranges_.insert({pair_seq_range.first, + {pair_seq_range.second.first + orignal_len, + pair_seq_range.second.second + orignal_len}}); + } +#define EXTEND_VECTOR(member) \ + member.insert(member.end(), pair.member.begin(), pair.member.end()) + EXTEND_VECTOR(ids_); + EXTEND_VECTOR(type_ids_); + EXTEND_VECTOR(tokens_); + EXTEND_VECTOR(words_idx_); + EXTEND_VECTOR(special_tokens_mask_); + EXTEND_VECTOR(attention_mask_); +#undef EXTEND_VECTOR + // Setting offet + uint32_t starting_offset = 0; + if (growing_offsets && offsets_.size() > 0) { + starting_offset = offsets_.back().second; + } + for (const auto& pair_offset : pair.offsets_) { + offsets_.push_back({pair_offset.first + starting_offset, + pair_offset.second + starting_offset}); + } + + overflowing_ = std::move(overflowings); +} + +void Encoding::Pad(uint32_t target_length, + uint32_t pad_id, + uint32_t pad_type_id, + const std::string& pad_token, + Direction direction) { + for (auto& overflowing : overflowing_) { + overflowing.Pad(target_length, pad_id, pad_type_id, pad_token, direction); + } + // Need to be padded in this situation + if (GetLen() < target_length) { + auto pad_len = target_length - GetLen(); + if (direction == LEFT) { + ids_.insert(ids_.begin(), pad_len, pad_id); + type_ids_.insert(type_ids_.begin(), pad_len, pad_type_id); + tokens_.insert(tokens_.begin(), pad_len, pad_token); + words_idx_.insert( + words_idx_.begin(), pad_len, std::numeric_limits::max()); + attention_mask_.insert(attention_mask_.begin(), pad_len, 0); + special_tokens_mask_.insert(special_tokens_mask_.begin(), pad_len, 1); + offsets_.insert(offsets_.begin(), pad_len, {0, 0}); + } else { + ids_.insert(ids_.end(), pad_len, pad_id); + type_ids_.insert(type_ids_.end(), pad_len, pad_type_id); + tokens_.insert(tokens_.end(), pad_len, pad_token); + words_idx_.insert( + words_idx_.end(), pad_len, std::numeric_limits::max()); + attention_mask_.insert(attention_mask_.end(), pad_len, 0); + special_tokens_mask_.insert(special_tokens_mask_.end(), pad_len, 1); + offsets_.insert(offsets_.end(), pad_len, {0, 0}); + } + } +} + +// Static method +Encoding Encoding::Merge(const std::vector& encodings, + bool growing_offsets) { + Encoding merged_encoding; + for (auto& encoding : encodings) { + merged_encoding.MergeWith(encoding, growing_offsets); + } + return merged_encoding; +} + +void Encoding::SetTypeIds(const std::vector& type_ids) { + type_ids_ = type_ids; +} + +bool Encoding::operator==(const Encoding& other) const { + if (overflowing_.size() != other.overflowing_.size()) { + return false; + } + for (int i = 0; i < overflowing_.size(); ++i) { + if (!(overflowing_[i] == other.overflowing_[i])) { + return false; + } + } + return ids_ == other.ids_ && type_ids_ == other.type_ids_ && + tokens_ == other.tokens_ && words_idx_ == other.words_idx_ && + offsets_ == other.offsets_ && + special_tokens_mask_ == other.special_tokens_mask_ && + attention_mask_ == other.attention_mask_ && + sequence_ranges_ == other.sequence_ranges_; +} + +std::string Encoding::DebugString() const { + std::ostringstream oss; + oss << "The Encoding content: \n"; + oss << "ids: "; + for (int i = 0; i < ids_.size(); ++i) { + oss << ids_[i]; + if (i < ids_.size() - 1) 
{ + oss << ", "; + } + } + oss << "\n"; + + oss << "type_ids: "; + for (int i = 0; i < type_ids_.size(); ++i) { + oss << type_ids_[i]; + if (i < type_ids_.size() - 1) { + oss << ", "; + } + } + oss << "\n"; + + oss << "tokens: "; + for (int i = 0; i < tokens_.size(); ++i) { + oss << tokens_[i]; + if (i < tokens_.size() - 1) { + oss << ", "; + } + } + oss << "\n"; + + oss << "offsets: "; + for (int i = 0; i < offsets_.size(); ++i) { + oss << "(" << offsets_[i].first << ", " << offsets_[i].second << ")"; + if (i < offsets_.size() - 1) { + oss << ", "; + } + } + oss << "\n"; + + oss << "special_tokens_mask: "; + for (int i = 0; i < special_tokens_mask_.size(); ++i) { + oss << special_tokens_mask_[i]; + if (i < special_tokens_mask_.size() - 1) { + oss << ", "; + } + } + oss << "\n"; + + oss << "attention_mask: "; + for (int i = 0; i < attention_mask_.size(); ++i) { + oss << attention_mask_[i]; + if (i < attention_mask_.size() - 1) { + oss << ", "; + } + } + oss << "\n"; + + oss << "sequence_ranges: "; + for (auto iter = sequence_ranges_.begin(); iter != sequence_ranges_.end(); + ++iter) { + oss << "{" << iter->first << " : (" << iter->second.first << ", " + << iter->second.second << ") }, "; + } + return oss.str(); +} + + +bool TruncateEncodings(Encoding* encoding, + Encoding* pair_encoding, + const TruncMethod& method) { + if (method.max_len_ == 0) { + encoding->Truncate(0, method.stride_, method.direction_); + if (pair_encoding != nullptr) { + pair_encoding->Truncate(0, method.stride_, method.direction_); + } + return true; + } + size_t total_length = encoding->GetIds().size(); + if (pair_encoding != nullptr) { + total_length += pair_encoding->GetIds().size(); + } + if (total_length <= method.max_len_) { + return true; + } + auto num_of_removed_ids = total_length - method.max_len_; + + if (method.strategy_ == TruncStrategy::LONGEST_FIRST) { + if (pair_encoding == nullptr) { + encoding->Truncate(method.max_len_, method.stride_, method.direction_); + } else { + auto encoding_len = encoding->GetIds().size(); + auto pair_encoding_len = pair_encoding->GetIds().size(); + bool has_swapped = false; + if (encoding_len > pair_encoding_len) { + std::swap(encoding_len, pair_encoding_len); + has_swapped = true; + } + if (encoding_len > method.max_len_) { + pair_encoding_len = encoding_len; + } else { + pair_encoding_len = + std::max(method.max_len_ - encoding_len, encoding_len); + } + if (pair_encoding_len + encoding_len > method.max_len_) { + // In this case, make sure the encoding_len is larger than + // pair_encoding_len + encoding_len = method.max_len_ / 2; + pair_encoding_len = encoding_len + method.max_len_ % 2; + } + if (has_swapped) { + std::swap(encoding_len, pair_encoding_len); + } + encoding->Truncate(encoding_len, method.stride_, method.direction_); + pair_encoding->Truncate( + pair_encoding_len, method.stride_, method.direction_); + } + } else { + // TruncStrategy::ONLY_FIRST or TruncStrategy::ONLY_SECOND + Encoding* result = nullptr; + if (method.strategy_ == TruncStrategy::ONLY_FIRST) { + result = encoding; + } else if (method.strategy_ == TruncStrategy::ONLY_SECOND) { + if (pair_encoding == nullptr) { + // Can't truncate the pair text when it doesn't exist + return false; + } + result = pair_encoding; + } + if (result->GetIds().size() > num_of_removed_ids) { + result->Truncate(result->GetIds().size() - num_of_removed_ids, + method.stride_, + method.direction_); + } else { + // Target sequence is too short to be truncated. 
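+      // It holds no more ids than the number that must be removed, so give up.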
+ return false; + } + } + return true; +} + +void MultiThreadPadEncodings(std::vector* encodings, + const PadMethod& method, + size_t pad_length, + size_t start_index, + size_t step_index) { + auto batch_size = encodings->size(); + size_t end_index = start_index + step_index; + if (end_index > batch_size) end_index = batch_size; + for (size_t i = start_index; i < end_index; ++i) { + auto& encoding = (*encodings)[i]; + encoding.Pad(pad_length, + method.pad_id_, + method.pad_token_type_id_, + method.pad_token_, + method.direction_); + } +} +void PadEncodings(std::vector* encodings, const PadMethod& method) { + if (encodings == nullptr || encodings->empty()) { + return; + } + size_t pad_length = 0; + if (method.strategy_ == PadStrategy::BATCH_LONGEST) { + for (const auto& encoding : *encodings) { + pad_length = std::max(encoding.GetIds().size(), pad_length); + } + } else { + pad_length = method.pad_len_; + } + if (method.pad_to_multiple_of_ > 0 && + pad_length % method.pad_to_multiple_of_) { + pad_length += pad_length - pad_length % method.pad_to_multiple_of_; + } + auto batch_size = encodings->size(); + auto func = std::bind(&MultiThreadPadEncodings, + encodings, + std::ref(method), + pad_length, + std::placeholders::_1, + std::placeholders::_2); + RunMultiThread(func, batch_size); +} + + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/core/encoding.h b/fast_tokenizer/fast_tokenizer/core/encoding.h new file mode 100644 index 0000000000000000000000000000000000000000..5a9d3a41b714a044164ce4441b064d96b6f5aea6 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/encoding.h @@ -0,0 +1,135 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include +#include +#include +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/utils/utils.h" + +#include +#include +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +class FASTTOKENIZER_DECL Encoding { +public: + Encoding() = default; + Encoding(const std::vector& ids, + const std::vector& type_ids, + const std::vector& tokens, + const std::vector& words_idx, + const std::vector& offsets, + const std::vector& special_tokens_mask, + const std::vector& attention_mask, + const std::vector& overflowing, + const std::unordered_map& sequence_ranges); + // Move version + Encoding(std::vector&& ids, + std::vector&& type_ids, + std::vector&& tokens, + std::vector&& words_idx, + std::vector&& offsets, + std::vector&& special_tokens_mask, + std::vector&& attention_mask, + std::vector&& overflowing, + std::unordered_map&& sequence_ranges); + Encoding(uint32_t size); + Encoding(const std::vector& tokens, uint32_t type_id); + + Encoding(Encoding&&); + Encoding(const Encoding&) = default; + Encoding& operator=(Encoding&&); + Encoding& operator=(const Encoding&) = default; + + bool IsEmpty() const; + void SetSequenceIds(uint32_t seq_ids); + + // Getter + int GetLen() const; + int GetNumSequence() const; + const std::vector& GetTokens() const; + const std::vector& GetWordsIdx() const; + std::vector& GetMutableWordsIdx(); + std::vector GetSequenceIds() const; + const std::vector& GetIds() const; + const std::vector& GetTypeIds() const; + const std::vector& GetOffsets() const; + std::vector& GetMutableOffsets(); + const std::vector& GetSpecialTokensMask() const; + const std::vector& GetAttentionMask() const; + const std::vector& GetOverflowing() const; + std::vector& GetMutableOverflowing(); + Range GetSequenceRange(uint32_t seq_id) const; + + void ProcessTokenWithOffsets( + std::function + process_token_fn); + + // token_idx: The index of token in the sequence + std::vector TokenIdxToSequenceIds(uint32_t token_idx) const; + std::vector WordIdxToTokensIdx(uint32_t word_idx, + uint32_t seq_id) const; + std::vector WordIdxToCharOffsets(uint32_t word_idx, + uint32_t seq_id) const; + std::vector> TokenIdxToCharOffsets( + uint32_t token_idx) const; + std::vector> TokenIdxToWordIdx( + uint32_t token_idx) const; + std::vector CharOffsetsToTokenIdx(uint32_t char_pos, + uint32_t seq_id) const; + std::vector CharOffsetsToWordIdx(uint32_t char_pos, + uint32_t seq_id) const; + void Truncate(size_t max_len, size_t stride, Direction direction); + void MergeWith(const Encoding& pair, bool growing_offsets); + void Pad(uint32_t target_length, + uint32_t pad_id, + uint32_t pad_type_id, + const std::string& pad_token, + Direction direction); + // Static method + static Encoding Merge(const std::vector& encodings, + bool growing_offsets); + std::string DebugString() const; + void SetTypeIds(const std::vector& type_ids); + bool operator==(const Encoding& other) const; + +private: + std::vector ids_; + std::vector type_ids_; + std::vector tokens_; + std::vector words_idx_; + std::vector offsets_; + std::vector special_tokens_mask_; + std::vector attention_mask_; + std::vector overflowing_; + std::unordered_map sequence_ranges_; +}; + +bool FASTTOKENIZER_DECL TruncateEncodings(Encoding* encoding, + Encoding* pair_encoding, + const TruncMethod& method); +void FASTTOKENIZER_DECL PadEncodings(std::vector* encoding, + const PadMethod& method); + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git 
a/fast_tokenizer/fast_tokenizer/core/tokenizer.cc b/fast_tokenizer/fast_tokenizer/core/tokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..2b907b56936ebb5821f9796d961d52d8bf1b5f6a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/tokenizer.cc @@ -0,0 +1,876 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/core/tokenizer.h" + +#include + +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/decoders/decoders.h" +#include "fast_tokenizer/models/models.h" +#include "fast_tokenizer/normalizers/normalizers.h" +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include "fast_tokenizer/pretokenizers/pretokenizers.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace core { + +normalizers::Normalizer* Tokenizer::GetNormalizerPtr() const { + return normalizer_.get(); +} + +void Tokenizer::ReleaseNormaizer() { normalizer_ = nullptr; } + +pretokenizers::PreTokenizer* Tokenizer::GetPreTokenizer() const { + return pretokenizer_.get(); +} + +void Tokenizer::ReleasePreTokenizer() { pretokenizer_ = nullptr; } + +void Tokenizer::SetTruncMethod(const TruncMethod& trunc_method) { + trunc_method_ = trunc_method; +} + +void Tokenizer::EnableTruncMethod(size_t max_len, + size_t stride, + Direction direction, + TruncStrategy strategy) { + use_truncation_ = true; + trunc_method_.direction_ = direction; + trunc_method_.max_len_ = max_len; + trunc_method_.strategy_ = strategy; + trunc_method_.stride_ = stride; +} + +void Tokenizer::DisableTruncMethod() { use_truncation_ = false; } + +TruncMethod Tokenizer::GetTruncMethod() const { return trunc_method_; } + +void Tokenizer::SetPadMethod(const PadMethod& pad_method) { + pad_method_ = pad_method; +} + +void Tokenizer::EnablePadMethod(Direction direction, + uint32_t pad_id, + uint32_t pad_type_id, + const std::string& pad_token, + uint32_t* length, + uint32_t* pad_to_multiple_of) { + use_padding_ = true; + pad_method_.direction_ = direction; + pad_method_.pad_id_ = pad_id; + pad_method_.pad_token_type_id_ = pad_type_id; + pad_method_.pad_token_ = pad_token; + if (length != nullptr) { + pad_method_.pad_len_ = *length; + pad_method_.strategy_ = PadStrategy::FIXED_SIZE; + } else { + pad_method_.strategy_ = PadStrategy::BATCH_LONGEST; + } + if (pad_to_multiple_of != nullptr) { + pad_method_.pad_to_multiple_of_ = *pad_to_multiple_of; + } else { + pad_method_.pad_to_multiple_of_ = 0; + } +} +void Tokenizer::DisablePadMethod() { use_padding_ = false; } + +PadMethod Tokenizer::GetPadMethod() const { return pad_method_; } + +models::Model* Tokenizer::GetModelPtr() const { return model_.get(); } + +void Tokenizer::ReleasePostProcessor() { post_processor_ = nullptr; } + +postprocessors::PostProcessor* Tokenizer::GetPostProcessorPtr() const { + return post_processor_.get(); +} + +void Tokenizer::ReleaseDecoder() { 
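+  // Drop the decoder; Decode() then falls back to joining tokens with single spaces.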
decoder_ = nullptr; } + +decoders::Decoder* Tokenizer::GetDecoderPtr() const { return decoder_.get(); } + +Vocab Tokenizer::GetVocab(bool with_added_vocabulary) const { + auto vocab = model_->GetVocab(); + auto added_vocab = added_vocabulary_.GetVocab(); + if (with_added_vocabulary) { + for (const auto& vocab_item : added_vocab) { + vocab.insert(vocab_item); + } + } + return vocab; +} + +size_t Tokenizer::GetVocabSize(bool with_added_vocabulary) const { + size_t vocab_size = model_->GetVocabSize(); + if (with_added_vocabulary) { + vocab_size += added_vocabulary_.GetLen(); + } + return vocab_size; +} + +size_t Tokenizer::AddTokens(const std::vector& tokens) { + return added_vocabulary_.AddTokens(tokens, *model_, normalizer_.get()); +} + +size_t Tokenizer::AddSpecialTokens(const std::vector& tokens) { + return added_vocabulary_.AddSpecialTokens(tokens, *model_, normalizer_.get()); +} + +bool Tokenizer::TokenToId(const std::string& token, uint32_t* id) const { + return added_vocabulary_.TokenToId(token, *model_, id); +} + +bool Tokenizer::IdToToken(uint32_t id, std::string* token) const { + return added_vocabulary_.IdToToken(id, *model_, token); +} + +bool Tokenizer::DoTokenize(pretokenizers::PreTokenizedString* pretokenized, + uint32_t type_id, + const std::vector& word_idx, + OffsetType offset_type, + Encoding* encoding) const { + pretokenized->Tokenize([&](normalizers::NormalizedString* normalized) { + return this->GetModelPtr()->Tokenize(normalized->GetStr()); + }); + return pretokenized->TransformToEncoding( + word_idx, type_id, offset_type, encoding); +} + +bool Tokenizer::DoPreTokenize( + pretokenizers::PreTokenizedString* pretokenized) const { + if (pretokenizer_ != nullptr) { + (*pretokenizer_)(pretokenized); + } + return true; +} + +struct InputStringVisitor { + InputStringVisitor(const Tokenizer* tokenizer, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) + : tokenizer_(tokenizer), + type_id_(type_id), + offset_type_(offset_type), + encodings_(encodings) {} + void operator()(const std::vector& pretokenized_texts) const { + tokenizer_->EncodeSingleText( + pretokenized_texts, type_id_, offset_type_, encodings_); + } + + void operator()(const std::string& raw_text) const { + tokenizer_->EncodeSingleText(raw_text, type_id_, offset_type_, encodings_); + } + const Tokenizer* tokenizer_; + uint32_t type_id_; + OffsetType offset_type_; + Encoding* encodings_; +}; + +void Tokenizer::EncodeSingleString(const InputString& input_string, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) const { + paddlenlp::visit(InputStringVisitor(this, type_id, offset_type, encodings), + input_string); +} + +void Tokenizer::PostProcess(Encoding* encoding, + Encoding* pair_encoding, + bool add_special_tokens, + Encoding* result_encoding) const { + // 1. Trunc + if (use_truncation_) { + auto added_tokens_num = 0; + if (post_processor_ != nullptr) { + added_tokens_num = + post_processor_->AddedTokensNum(pair_encoding != nullptr); + } + if (add_special_tokens && added_tokens_num > 0) { + auto trunc_method = trunc_method_; + trunc_method.max_len_ -= added_tokens_num; + TruncateEncodings(encoding, pair_encoding, trunc_method); + } else { + TruncateEncodings(encoding, pair_encoding, trunc_method_); + } + } + // 2. Post process + if (post_processor_ == nullptr) { + postprocessors::PostProcessor::DefaultProcess( + encoding, pair_encoding, result_encoding); + } else { + (*post_processor_)( + encoding, pair_encoding, add_special_tokens, result_encoding); + } + // 3. 
Pad + if (use_padding_) { + std::vector encodings; + encodings.push_back(*result_encoding); + PadEncodings(&encodings, pad_method_); + } +} + +void Tokenizer::EncodePairStrings(const EncodeInput& encode_input, + Encoding* encodings, + bool add_special_tokens) const { + Encoding encoding; + if (encode_input.type() == typeid(InputString)) { + const auto& input_string = paddlenlp::get(encode_input); + EncodeSingleString(input_string, 0, OffsetType::CHAR, &encoding); + PostProcess(&encoding, nullptr, add_special_tokens, encodings); + } else { + Encoding pair_encoding; + const auto& input_string_pair = + paddlenlp::get>(encode_input); + EncodeSingleString(input_string_pair.first, 0, OffsetType::CHAR, &encoding); + EncodeSingleString( + input_string_pair.second, 1, OffsetType::CHAR, &pair_encoding); + PostProcess(&encoding, &pair_encoding, add_special_tokens, encodings); + } +} + +void Tokenizer::EncodePairStrings(const std::string& text, + const std::string& text_pair, + Encoding* encodings, + bool add_special_tokens) const { + Encoding encoding, pair_encoding; + EncodeSingleString(text, 0, OffsetType::CHAR, &encoding); + EncodeSingleString(text_pair, 1, OffsetType::CHAR, &pair_encoding); + PostProcess(&encoding, &pair_encoding, add_special_tokens, encodings); +} + +void Tokenizer::MultiThreadEncodeBatchStrings( + const std::vector& texts, + const std::vector& text_pairs, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const { + if (texts.size() != text_pairs.size()) { + throw std::runtime_error( + "The size of text must equal to the size of text_pair"); + } + auto batch_size = texts.size(); + size_t end_index = start_index + step_index; + if (end_index > batch_size) end_index = batch_size; + for (size_t i = start_index; i < end_index; ++i) { + EncodePairStrings( + texts[i], text_pairs[i], &(*encodings)[i], add_special_tokens); + } +} + +void Tokenizer::MultiThreadEncodeBatchStrings( + const std::vector& batch_encode_input, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const { + auto batch_size = batch_encode_input.size(); + size_t end_index = start_index + step_index; + if (end_index > batch_size) end_index = batch_size; + for (size_t i = start_index; i < end_index; ++i) { + EncodePairStrings( + batch_encode_input[i], &(*encodings)[i], add_special_tokens); + } +} + +void Tokenizer::MultiThreadEncodeBatchStrings( + const std::vector& texts, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const { + auto batch_size = texts.size(); + size_t end_index = start_index + step_index; + if (end_index > batch_size) end_index = batch_size; + for (size_t i = start_index; i < end_index; ++i) { + EncodePairStrings(texts[i], &(*encodings)[i], add_special_tokens); + } +} + +void Tokenizer::EncodeBatchStrings( + const std::vector& batch_encode_input, + std::vector* encodings, + bool add_special_tokens) const { + auto batch_size = batch_encode_input.size(); + encodings->resize(batch_size); + auto func = [&](size_t start_index, size_t step_index) { + MultiThreadEncodeBatchStrings(batch_encode_input, + encodings, + add_special_tokens, + start_index, + step_index); + }; + RunMultiThread(func, batch_size); + + if (use_padding_) { + PadEncodings(encodings, pad_method_); + } +} + +void Tokenizer::EncodeBatchStrings(const std::vector& texts, + std::vector* encodings, + bool add_special_tokens) const { + auto batch_size = texts.size(); + encodings->resize(batch_size); + auto 
func = [&](size_t start_index, size_t step_index) { + MultiThreadEncodeBatchStrings( + texts, encodings, add_special_tokens, start_index, step_index); + }; + RunMultiThread(func, batch_size); + + if (use_padding_) { + PadEncodings(encodings, pad_method_); + } +} + +void Tokenizer::EncodeBatchStrings(const std::vector& texts, + const std::vector& text_pairs, + std::vector* encodings, + bool add_special_tokens) const { + auto batch_size = texts.size(); + encodings->resize(batch_size); + auto func = [&](size_t start_index, size_t step_index) { + MultiThreadEncodeBatchStrings(texts, + text_pairs, + encodings, + add_special_tokens, + start_index, + step_index); + }; + RunMultiThread(func, batch_size); + + if (use_padding_) { + PadEncodings(encodings, pad_method_); + } +} + +void Tokenizer::EncodeSingleText( + const std::vector& pretokenized_texts, + uint32_t type_id, + OffsetType offset_type, + Encoding* encoding) const { + std::vector encodings; + for (uint32_t i = 0; i < pretokenized_texts.size(); ++i) { + encodings.emplace_back( + EncodeTextToEncoding({i}, type_id, offset_type, pretokenized_texts[i])); + } + *encoding = Encoding::Merge(encodings, false); +} + +void Tokenizer::EncodeSingleText(const std::string& raw_text, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) const { + *encodings = EncodeTextToEncoding({}, type_id, offset_type, raw_text); +} + +Encoding Tokenizer::EncodeTextToEncoding(const std::vector& word_idx, + uint32_t type_id, + OffsetType offset_type, + const std::string& text) const { + pretokenizers::PreTokenizedString pretokenized; + added_vocabulary_.ExtractAndNormalize(normalizer_.get(), text, &pretokenized); + DoPreTokenize(&pretokenized); + Encoding encoding; + DoTokenize(&pretokenized, type_id, word_idx, offset_type, &encoding); + return encoding; +} + +const AddedVocabulary& Tokenizer::GetAddedVocabulary() const { + return added_vocabulary_; +} + +void Tokenizer::Save(const std::string& path, bool pretty) const { + std::string json_str; + ToJsonStr(&json_str, pretty); + std::ofstream fout(path); + fout << json_str; +} + +void Tokenizer::ToJsonStr(std::string* json_str, bool pretty) const { + int indent = -1; + if (pretty) { + indent = 2; + } + nlohmann::json j = *this; + *json_str = j.dump(indent); +} + +Tokenizer Tokenizer::LoadFromFile(const std::string& json_path) { + std::ifstream fin(json_path); + nlohmann::json j; + fin >> j; + Tokenizer tokenizer; + j.get_to(tokenizer); + return tokenizer; +} + +Tokenizer Tokenizer::LoadFromStr(const std::string& json_str) { + auto jo = nlohmann::json::parse(json_str); + Tokenizer tokenizer; + jo.get_to(tokenizer); + return tokenizer; +} + +void Tokenizer::Decode(const std::vector& token_ids, + std::string* result, + bool skip_special_tokens) const { + // Get tokens + std::vector tokens; + std::string token; + for (int i = 0; i < token_ids.size(); ++i) { + IdToToken(token_ids[i], &token); + if (!added_vocabulary_.IsSpecialToken(token) || !skip_special_tokens) { + tokens.push_back(token); + } + } + if (decoder_ != nullptr) { + (*decoder_)(tokens, result); + } else { + for (int i = 0; i < tokens.size(); ++i) { + if (i > 0) { + *result += " "; + } + *result += tokens[i]; + } + } +} + + +void Tokenizer::MultiThreadDecodeBatch( + const std::vector>& batch_token_ids, + std::vector* results, + bool skip_special_tokens, + size_t start_index, + size_t step_index) const { + auto batch_size = batch_token_ids.size(); + size_t end_index = start_index + step_index; + if (end_index > batch_size) end_index = batch_size; + 
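+  // Each worker decodes its own contiguous slice [start_index, end_index) of the batch.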
for (size_t i = start_index; i < end_index; ++i) { + Decode(batch_token_ids[i], &(*results)[i], skip_special_tokens); + } +} + +void Tokenizer::DecodeBatch( + const std::vector>& batch_token_ids, + std::vector* results, + bool skip_special_tokens) const { + auto batch_size = batch_token_ids.size(); + results->resize(batch_size); + auto func = [&](size_t start_index, size_t step_index) { + MultiThreadDecodeBatch( + batch_token_ids, results, skip_special_tokens, start_index, step_index); + }; + RunMultiThread(func, batch_size); +} + +bool Tokenizer::GetUseTruncation() const { return use_truncation_; } + +bool Tokenizer::GetUsePadding() const { return use_padding_; } + +void to_json(nlohmann::json& j, const Tokenizer& tokenizer) { + j = { + {"added_tokens", tokenizer.added_vocabulary_}, + }; + + j["truncation"] = nullptr; + if (tokenizer.use_truncation_) { + j["truncation"] = tokenizer.trunc_method_; + } + + j["padding"] = nullptr; + if (tokenizer.use_padding_) { + j["padding"] = tokenizer.pad_method_; + } + + j["normalizer"] = nullptr; + if (tokenizer.normalizer_ != nullptr) { + if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::BertNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::ReplaceNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::StripNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::StripAccentsNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::NFCNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::NFDNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::NFKCNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::NFKDNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::NmtNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::LowercaseNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::SequenceNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } else if (typeid(*tokenizer.normalizer_.get()) == + typeid(normalizers::PrecompiledNormalizer)) { + j["normalizer"] = *dynamic_cast( + tokenizer.normalizer_.get()); + } + } + + j["pretokenizer"] = nullptr; + if (tokenizer.pretokenizer_ != nullptr) { + if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::BertPreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::MetaSpacePreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + 
typeid(pretokenizers::WhitespacePreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::WhitespaceAndPunctuationPreTokenizer)) { + j["pretokenizer"] = + *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::SequencePreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::ByteLevelPreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } else if (typeid(*tokenizer.pretokenizer_.get()) == + typeid(pretokenizers::SplitPreTokenizer)) { + j["pretokenizer"] = *dynamic_cast( + tokenizer.pretokenizer_.get()); + } + } + + j["model"] = nullptr; + if (tokenizer.model_ != nullptr) { + if (typeid(*tokenizer.model_.get()) == typeid(models::WordPiece)) { + j["model"] = *dynamic_cast(tokenizer.model_.get()); + } else if (typeid(*tokenizer.model_.get()) == + typeid(models::FastWordPiece)) { + j["model"] = + *dynamic_cast(tokenizer.model_.get()); + } else if (typeid(*tokenizer.model_.get()) == typeid(models::BPE)) { + j["model"] = *dynamic_cast(tokenizer.model_.get()); + } else if (typeid(*tokenizer.model_.get()) == typeid(models::Unigram)) { + j["model"] = *dynamic_cast(tokenizer.model_.get()); + } + } + + j["postprocessor"] = nullptr; + if (tokenizer.post_processor_ != nullptr) { + if (typeid(*tokenizer.post_processor_.get()) == + typeid(postprocessors::BertPostProcessor)) { + j["postprocessor"] = *dynamic_cast( + tokenizer.post_processor_.get()); + } else if (typeid(*tokenizer.post_processor_.get()) == + typeid(postprocessors::TemplatePostProcessor)) { + j["postprocessor"] = + *dynamic_cast( + tokenizer.post_processor_.get()); + } else if (typeid(*tokenizer.post_processor_.get()) == + typeid(postprocessors::RobertaPostProcessor)) { + j["postprocessor"] = *dynamic_cast( + tokenizer.post_processor_.get()); + } else if (typeid(*tokenizer.post_processor_.get()) == + typeid(postprocessors::ByteLevelPostProcessor)) { + j["postprocessor"] = + *dynamic_cast( + tokenizer.post_processor_.get()); + } + } + + j["decoder"] = nullptr; + if (tokenizer.decoder_ != nullptr) { + if (typeid(*tokenizer.decoder_.get()) == typeid(decoders::WordPiece)) { + j["decoder"] = + *dynamic_cast(tokenizer.decoder_.get()); + } + } +} + +void from_json(const nlohmann::json& j, Tokenizer& tokenizer) { + // deserialize normalizer_ + try { + const auto& normalizer = j.at("normalizer"); + if (!normalizer.is_null()) { + if (normalizer.at("type") == "BertNormalizer") { + normalizers::BertNormalizer bert_normalizer; + normalizer.get_to(bert_normalizer); + tokenizer.SetNormalizer(bert_normalizer); + } else if (normalizer.at("type") == "ReplaceNormalizer") { + normalizers::ReplaceNormalizer replace_normalizer; + normalizer.get_to(replace_normalizer); + tokenizer.SetNormalizer(replace_normalizer); + } else if (normalizer.at("type") == "StripNormalizer") { + normalizers::StripNormalizer strip_normalizer; + normalizer.get_to(strip_normalizer); + tokenizer.SetNormalizer(strip_normalizer); + } else if (normalizer.at("type") == "StripAccentsNormalizer") { + normalizers::StripAccentsNormalizer strip_normalizer; + normalizer.get_to(strip_normalizer); + tokenizer.SetNormalizer(strip_normalizer); + } else if (normalizer.at("type") == "NFCNormalizer") { + normalizers::NFCNormalizer unicode_normalizer; + 
normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "NFDNormalizer") { + normalizers::NFDNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "NFKCNormalizer") { + normalizers::NFKCNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "NFKDNormalizer") { + normalizers::NFKDNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "NmtNormalizer") { + normalizers::NmtNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "LowercaseNormalizer") { + normalizers::LowercaseNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "SequenceNormalizer") { + normalizers::SequenceNormalizer unicode_normalizer; + normalizer.get_to(unicode_normalizer); + tokenizer.SetNormalizer(unicode_normalizer); + } else if (normalizer.at("type") == "PrecompiledNormalizer") { + normalizers::PrecompiledNormalizer precompiled_normalizer; + normalizer.get_to(precompiled_normalizer); + tokenizer.SetNormalizer(precompiled_normalizer); + } + } + + // deserialize pretokenizer_ + nlohmann::json pretokenizer; + if (j.find("pretokenizer") == j.end()) { + pretokenizer = j.at("pre_tokenizer"); + } else { + pretokenizer = j.at("pretokenizer"); + } + if (!pretokenizer.is_null()) { + if (pretokenizer.at("type") == "BertPreTokenizer") { + pretokenizers::BertPreTokenizer bert_pretokenizer; + tokenizer.SetPreTokenizer(bert_pretokenizer); + } else if (pretokenizer.at("type") == "MetaSpacePreTokenizer") { + pretokenizers::MetaSpacePreTokenizer meta_pretokenizer; + pretokenizer.get_to(meta_pretokenizer); + tokenizer.SetPreTokenizer(meta_pretokenizer); + } else if (pretokenizer.at("type") == "WhitespacePreTokenizer") { + pretokenizers::WhitespacePreTokenizer whitespace_pretokenizer; + tokenizer.SetPreTokenizer(whitespace_pretokenizer); + } else if (pretokenizer.at("type") == + "WhitespaceAndPunctuationPreTokenizer") { + pretokenizers::WhitespaceAndPunctuationPreTokenizer + whitespace_pretokenizer; + tokenizer.SetPreTokenizer(whitespace_pretokenizer); + } else if (pretokenizer.at("type") == "SequencePreTokenizer") { + pretokenizers::SequencePreTokenizer sequence_pretokenizer; + pretokenizer.get_to(sequence_pretokenizer); + tokenizer.SetPreTokenizer(sequence_pretokenizer); + } else if (pretokenizer.at("type") == "ByteLevelPreTokenizer") { + pretokenizers::ByteLevelPreTokenizer byte_pretokenizer; + pretokenizer.get_to(byte_pretokenizer); + tokenizer.SetPreTokenizer(byte_pretokenizer); + } else if (pretokenizer.at("type") == "SplitPreTokenizer") { + pretokenizers::SplitPreTokenizer split_pretokenizer; + pretokenizer.get_to(split_pretokenizer); + tokenizer.SetPreTokenizer(split_pretokenizer); + } + } + + // deserialize model_ + const auto& model = j.at("model"); + if (!model.is_null()) { + if (model.at("type") == "WordPiece") { + models::WordPiece wordpiece; + model.get_to(wordpiece); + tokenizer.SetModel(wordpiece); + } else if (model.at("type") == "FastWordPiece") { + models::FastWordPiece wordpiece; + model.get_to(wordpiece); + tokenizer.SetModel(wordpiece); + } else if 
(model.at("type") == "BPE") { + models::BPE bpe; + model.get_to(bpe); + tokenizer.SetModel(bpe); + } else if (model.at("type") == "Unigram") { + models::Unigram unigram; + model.get_to(unigram); + tokenizer.SetModel(unigram); + } + } + + // deserialize post_processor_ + nlohmann::json post_processor; + if (j.find("postprocessor") == j.end()) { + post_processor = j.at("post_processor"); + } else { + post_processor = j.at("postprocessor"); + } + if (!post_processor.is_null()) { + if (post_processor.at("type") == "BertPostProcessor") { + postprocessors::BertPostProcessor bert_postprocessor; + post_processor.get_to(bert_postprocessor); + tokenizer.SetPostProcessor(bert_postprocessor); + } else if (post_processor.at("type") == "TemplateProcessing") { + postprocessors::TemplatePostProcessor template_postprocessor; + post_processor.get_to(template_postprocessor); + tokenizer.SetPostProcessor(template_postprocessor); + } else if (post_processor.at("type") == "RobertaPostProcessor") { + postprocessors::RobertaPostProcessor roberta_postprocessor; + post_processor.get_to(roberta_postprocessor); + tokenizer.SetPostProcessor(roberta_postprocessor); + } else if (post_processor.at("type") == "ByteLevelPostProcessor") { + postprocessors::ByteLevelPostProcessor byte_level_postprocessor; + post_processor.get_to(byte_level_postprocessor); + tokenizer.SetPostProcessor(byte_level_postprocessor); + } + } + + // deserialize trunc_method_ + const auto& trunc_method = j.at("truncation"); + if (!trunc_method.is_null()) { + tokenizer.use_truncation_ = true; + trunc_method.get_to(tokenizer.trunc_method_); + } else { + tokenizer.use_truncation_ = false; + } + + // deserialize pad_method_ + const auto& pad_method = j.at("padding"); + if (!pad_method.is_null()) { + tokenizer.use_padding_ = true; + pad_method.get_to(tokenizer.pad_method_); + } else { + tokenizer.use_padding_ = false; + } + + // deserialize added_vocabulary_ + const auto& added_tokens = j.at("added_tokens"); + core::AddedTokenWithId added_token_with_id; + std::vector tokens(added_tokens.size()); + for (int i = 0; i < added_tokens.size(); ++i) { + added_tokens[i].get_to(added_token_with_id); + tokens[i] = added_token_with_id.added_token_; + } + tokenizer.AddSpecialTokens(tokens); + + const auto& decoder = j.at("decoder"); + if (!decoder.is_null()) { + if (decoder.at("type") == "WordPiece") { + decoders::WordPiece wordpiece_decoder; + decoder.get_to(wordpiece_decoder); + tokenizer.SetDecoder(wordpiece_decoder); + } + } + + } catch (nlohmann::json::out_of_range& e) { + VLOG(0) << e.what(); + } +} +// Instantiate normalizers +template void Tokenizer::SetNormalizer(const normalizers::BertNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::LowercaseNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::NFCNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::NFKCNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::NFDNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::NFKDNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::NmtNormalizer&); +template void Tokenizer::SetNormalizer( + const normalizers::PrecompiledNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::ReplaceNormalizer&); +template void Tokenizer::SetNormalizer(const normalizers::SequenceNormalizer&); +template void Tokenizer::SetNormalizer( + const normalizers::StripAccentsNormalizer&); +template void Tokenizer::SetNormalizer(const 
normalizers::StripNormalizer&); + +// Instantiate pretokenizers +template void Tokenizer::SetPreTokenizer( + const pretokenizers::BertPreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::WhitespacePreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::WhitespaceAndPunctuationPreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::MetaSpacePreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::SequencePreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::ByteLevelPreTokenizer&); +template void Tokenizer::SetPreTokenizer( + const pretokenizers::SplitPreTokenizer&); + +// Instantiate models +template Tokenizer::Tokenizer(const models::WordPiece&); +template void Tokenizer::SetModel(const models::WordPiece&); +template Tokenizer::Tokenizer(const models::FastWordPiece&); +template void Tokenizer::SetModel(const models::FastWordPiece&); +template Tokenizer::Tokenizer(const models::BPE&); +template void Tokenizer::SetModel(const models::BPE&); +template Tokenizer::Tokenizer(const models::Unigram&); +template void Tokenizer::SetModel(const models::Unigram&); + +// Instantiate processors +template void Tokenizer::SetPostProcessor( + const postprocessors::BertPostProcessor&); +template void Tokenizer::SetPostProcessor( + const postprocessors::TemplatePostProcessor&); +template void Tokenizer::SetPostProcessor( + const postprocessors::RobertaPostProcessor&); +template void Tokenizer::SetPostProcessor( + const postprocessors::ByteLevelPostProcessor&); + +// Instantiate Decoder +template void Tokenizer::SetDecoder(const decoders::WordPiece& decoder); +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/core/tokenizer.h b/fast_tokenizer/fast_tokenizer/core/tokenizer.h new file mode 100644 index 0000000000000000000000000000000000000000..71fdf6099f64616b4ed539ee5da9ec2b80fb6b8d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/core/tokenizer.h @@ -0,0 +1,254 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once +#include // For shared_ptr +#include + +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/utils/utils.h" +#include "fast_tokenizer/utils/variant.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { + +namespace normalizers { + +class Normalizer; +class NormalizedString; + +} // namespace normalizers + +namespace pretokenizers { + +class PreTokenizer; +class PreTokenizedString; + +} // namespace pretokenizers + +namespace models { +class Model; +} // namespace models + +namespace postprocessors { +class PostProcessor; +} // namespace postprocessors + +namespace decoders { +class Decoder; +}; + +namespace core { + +class AddedVocabulary; +class Encoding; + +using InputString = paddlenlp::variant>; +using EncodeInput = + paddlenlp::variant>; + +class FASTTOKENIZER_DECL Tokenizer { +public: + Tokenizer() + : model_(nullptr), + normalizer_(nullptr), + pretokenizer_(nullptr), + post_processor_(nullptr), + decoder_(nullptr), + use_padding_(true), + use_truncation_(true) {} + template + Tokenizer(const ModelType& model) + : model_(std::make_shared(model)), + normalizer_(nullptr), + pretokenizer_(nullptr), + post_processor_(nullptr), + decoder_(nullptr), + use_padding_(true), + use_truncation_(true) {} + + template + void SetNormalizer(const NormalizerType& normalizer) { + normalizer_ = std::make_shared(normalizer); + } + void ReleaseNormaizer(); + normalizers::Normalizer* GetNormalizerPtr() const; + + template + void SetPreTokenizer(const PreTokenizerType& pretokenizer) { + pretokenizer_ = std::make_shared(pretokenizer); + } + void ReleasePreTokenizer(); + pretokenizers::PreTokenizer* GetPreTokenizer() const; + + template + void SetModel(const ModelType& model) { + model_ = std::make_shared(model); + } + models::Model* GetModelPtr() const; + + template + void SetPostProcessor(const PostProcessorType& post_processor) { + post_processor_ = std::make_shared(post_processor); + } + void ReleasePostProcessor(); + postprocessors::PostProcessor* GetPostProcessorPtr() const; + + template + void SetDecoder(const DecoderType& decoder) { + decoder_ = std::make_shared(decoder); + } + void ReleaseDecoder(); + decoders::Decoder* GetDecoderPtr() const; + + void SetTruncMethod(const TruncMethod& trunc_method); + void DisableTruncMethod(); + void EnableTruncMethod(size_t max_len, + size_t stride, + Direction direction, + TruncStrategy strategy); + TruncMethod GetTruncMethod() const; + + void SetPadMethod(const PadMethod& pad_method); + void DisablePadMethod(); + void EnablePadMethod(Direction direction, + uint32_t pad_id, + uint32_t pad_type_id, + const std::string& pad_token, + uint32_t* length, + uint32_t* pad_to_multiple_of); + PadMethod GetPadMethod() const; + + Vocab GetVocab(bool with_added_vocabulary = true) const; + size_t GetVocabSize(bool with_added_vocabulary = true) const; + bool TokenToId(const std::string& token, uint32_t* id) const; + bool IdToToken(uint32_t id, std::string* token) const; + size_t AddTokens(const std::vector& tokens); + size_t AddSpecialTokens(const std::vector& tokens); + bool DoTokenize(pretokenizers::PreTokenizedString* pretokenized, + uint32_t type_id, + const std::vector& word_idx, + OffsetType offset_type, + Encoding* encoding) const; + bool DoPreTokenize(pretokenizers::PreTokenizedString* pretokenized) const; + + void EncodeSingleString(const InputString& input_string, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) const; + void 
PostProcess(Encoding* encoding, + Encoding* pair_encoding, + bool add_special_tokens, + Encoding* result_encoding) const; + void EncodePairStrings(const EncodeInput& encode_input, + Encoding* encodings, + bool add_special_tokens = true) const; + void EncodePairStrings(const std::string& text, + const std::string& text_pair, + Encoding* encodings, + bool add_special_tokens = true) const; + + void MultiThreadEncodeBatchStrings( + const std::vector& batch_encode_input, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const; + // Tokenize the unpretokenized text. + void MultiThreadEncodeBatchStrings(const std::vector& texts, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const; + void MultiThreadEncodeBatchStrings(const std::vector& texts, + const std::vector& text_pairs, + std::vector* encodings, + bool add_special_tokens, + size_t start_index, + size_t step_index) const; + + void EncodeBatchStrings(const std::vector& batch_encode_input, + std::vector* encodings, + bool add_special_tokens = true) const; + // Tokenize the unpretokenized text. + void EncodeBatchStrings(const std::vector& texts, + std::vector* encodings, + bool add_special_tokens = true) const; + void EncodeBatchStrings(const std::vector& texts, + const std::vector& text_pairs, + std::vector* encodings, + bool add_special_tokens = true) const; + + // Encode single text which is already pretokenized. + void EncodeSingleText(const std::vector& pretokenized_texts, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) const; + // Encode single raw text + void EncodeSingleText(const std::string& raw_text, + uint32_t type_id, + OffsetType offset_type, + Encoding* encodings) const; + const AddedVocabulary& GetAddedVocabulary() const; + void Save(const std::string& json_path, bool pretty = true) const; + void ToJsonStr(std::string* json_str, bool pretty = true) const; + + // Create a tokenzier from json path + static Tokenizer LoadFromFile(const std::string& json_path); + static Tokenizer LoadFromStr(const std::string& json_str); + + bool GetUseTruncation() const; + bool GetUsePadding() const; + + // Decode: From tokens to a complete string + void Decode(const std::vector& token_ids, + std::string* result, + bool skip_special_tokens = true) const; + void MultiThreadDecodeBatch( + const std::vector>& batch_token_ids, + std::vector* results, + bool skip_special_tokens, + size_t start_index, + size_t step_index) const; + void DecodeBatch(const std::vector>& batch_token_ids, + std::vector* results, + bool skip_special_tokens = true) const; + +private: + Encoding EncodeTextToEncoding(const std::vector& word_idx, + uint32_t type_id, + OffsetType offset_type, + const std::string& text) const; + // All member of Tokenizer + std::shared_ptr normalizer_; + std::shared_ptr pretokenizer_; + std::shared_ptr model_; + std::shared_ptr post_processor_; + std::shared_ptr decoder_; + + TruncMethod trunc_method_; + PadMethod pad_method_; + AddedVocabulary added_vocabulary_; + bool use_truncation_; + bool use_padding_; + + friend void to_json(nlohmann::json& j, const Tokenizer& tokenizer); + friend void from_json(const nlohmann::json& j, Tokenizer& tokenizer); +}; + +} // namespace core +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/decoders/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/decoders/CMakeLists.txt new file mode 100644 index 
0000000000000000000000000000000000000000..d2fffc3dac6e24bd846f3e775280e08dfd6012b8 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/decoders/CMakeLists.txt @@ -0,0 +1 @@ +cc_library(decoders SRCS wordpiece.cc DEPS json utils) diff --git a/fast_tokenizer/fast_tokenizer/decoders/decoder.h b/fast_tokenizer/fast_tokenizer/decoders/decoder.h new file mode 100644 index 0000000000000000000000000000000000000000..7969e22d11f7dd5370cdb2ba4c6b0c375c7653ca --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/decoders/decoder.h @@ -0,0 +1,32 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace decoders { + +struct FASTTOKENIZER_DECL Decoder { + virtual void operator()(const std::vector tokens, + std::string* result) const = 0; +}; + +} // namespace decoders +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/decoders/decoders.h b/fast_tokenizer/fast_tokenizer/decoders/decoders.h new file mode 100644 index 0000000000000000000000000000000000000000..efc72779de9f9908b2f7f58576199ef6e349f9f0 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/decoders/decoders.h @@ -0,0 +1,18 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/decoders/decoder.h" +#include "fast_tokenizer/decoders/wordpiece.h" \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/decoders/wordpiece.cc b/fast_tokenizer/fast_tokenizer/decoders/wordpiece.cc new file mode 100644 index 0000000000000000000000000000000000000000..e81f1562d24265d8974e708a63cc9d2bfabb9d1c --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/decoders/wordpiece.cc @@ -0,0 +1,69 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/decoders/wordpiece.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace decoders { + +WordPiece::WordPiece(const std::string prefix, bool cleanup) + : prefix_(prefix), cleanup_(cleanup) {} + +void WordPiece::CleanUp(std::string* result) const { + utils::StringReplaceAll(result, " .", "."); + utils::StringReplaceAll(result, " !", "!"); + utils::StringReplaceAll(result, " ?", "?"); + utils::StringReplaceAll(result, " ,", ","); + utils::StringReplaceAll(result, " ' ", "'"); + utils::StringReplaceAll(result, " n't", "n't"); + utils::StringReplaceAll(result, " 'm", "'m"); + utils::StringReplaceAll(result, " do not", " don't"); + utils::StringReplaceAll(result, " 's", "'s"); + utils::StringReplaceAll(result, " 've", "'ve"); + utils::StringReplaceAll(result, " 're", "'re"); +} + +void WordPiece::operator()(const std::vector tokens, + std::string* result) const { + *result = ""; + for (int i = 0; i < tokens.size(); ++i) { + if (i > 0) { + *result += " "; + } + *result += tokens[i]; + } + utils::StringReplaceAll(result, " " + prefix_, ""); + if (cleanup_) { + CleanUp(result); + } +} + +void to_json(nlohmann::json& j, const WordPiece& decoder) { + j = { + {"type", "WordPiece"}, + {"cleanup", decoder.cleanup_}, + {"prefix", decoder.prefix_}, + }; +} + +void from_json(const nlohmann::json& j, WordPiece& decoder) { + j["cleanup"].get_to(decoder.cleanup_); + j["prefix"].get_to(decoder.prefix_); +} + +} // namespace decoders +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/decoders/wordpiece.h b/fast_tokenizer/fast_tokenizer/decoders/wordpiece.h new file mode 100644 index 0000000000000000000000000000000000000000..1f41b3f8b5dcf4475bc701f4d81521eb3c0bd5f5 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/decoders/wordpiece.h @@ -0,0 +1,42 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/decoders/decoder.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace decoders { + +struct FASTTOKENIZER_DECL WordPiece : public Decoder { + virtual void operator()(const std::vector tokens, + std::string* result) const; + + WordPiece(const std::string prefix = "##", bool cleanup = true); + +private: + void CleanUp(std::string* result) const; + std::string prefix_; + bool cleanup_; + + friend void to_json(nlohmann::json& j, const WordPiece& decoder); + friend void from_json(const nlohmann::json& j, WordPiece& decoder); +}; + +} // namespace decoders +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/models/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..f51706a770358988d5e72386da0eea90e363a7d2 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/CMakeLists.txt @@ -0,0 +1,3 @@ +cc_library(models + SRCS wordpiece.cc fast_wordpiece.cc bpe.cc unigram.cc + DEPS core json trie failure icuuc icudata lattice utils) diff --git a/fast_tokenizer/fast_tokenizer/models/bpe.cc b/fast_tokenizer/fast_tokenizer/models/bpe.cc new file mode 100644 index 0000000000000000000000000000000000000000..ad80d2752d6f3b115512d573c3e07e2e27eebc87 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/bpe.cc @@ -0,0 +1,349 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include + +#include "glog/logging.h" +#include "fast_tokenizer/models/bpe.h" +#include "fast_tokenizer/utils/path.h" +#include "fast_tokenizer/utils/utf8.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { +const std::string WHITESPACE = " \n\r\t\f\v"; + +void BPE::Init(const core::Merges& merges) { + if (dropout_.size() > 0) { + if (dropout_[0] > 1.0 || dropout_[0] <= 0.0) { + std::ostringstream oss; + oss << "The range of dropout rate should be (0,1], but receive " + << dropout_[0]; + throw std::runtime_error(oss.str()); + } + } + // construct vocab_r + for (auto&& item : vocab_) { + vocab_reversed_[item.second] = item.first; + } + int prefix_len = 0; + if (continuing_subword_prefix_.size() > 0) { + prefix_len += continuing_subword_prefix_[0].length(); + } + + // construct merge_map + for (int i = 0; i < merges.size(); i++) { + auto&& merge = merges[i]; + try { + auto a_id = vocab_.at(merge.first); + auto b_id = vocab_.at(merge.second); + auto new_token = merge.first + merge.second.substr(prefix_len); + auto new_id = vocab_.at(new_token); + merges_.insert({core::Pair(a_id, b_id), {i, new_id}}); + } catch (...) 
{ + std::ostringstream oss; + oss << "Can't merge token out of the vocabulary"; + throw std::runtime_error(oss.str()); + } + } + + // construct unk + if (unk_token_.size() > 0) { + try { + unk_token_id_.emplace_back(vocab_.at(unk_token_.front())); + } catch (...) { + std::ostringstream oss; + oss << "Unk token `" << unk_token_.front() + << "` not found in the vocabulary"; + throw std::runtime_error(oss.str()); + } + } +} + +BPE::BPE() : fuse_unk_(false), cache_(utils::DEFAULT_CACHE_CAPACITY) {} + +BPE::BPE(const core::Vocab& vocab, + const core::Merges& merges, + size_t cache_capacity, + const std::vector& dropout, + const std::vector& unk_token, + const std::vector& continuing_subword_prefix, + const std::vector& end_of_word_suffix, + bool fuse_unk) + : vocab_(vocab), + fuse_unk_(fuse_unk), + dropout_(dropout), + unk_token_(unk_token), + continuing_subword_prefix_(continuing_subword_prefix), + end_of_word_suffix_(end_of_word_suffix), + cache_(utils::DEFAULT_CACHE_CAPACITY) { + Init(merges); +} + +void BPE::ClearCache() { cache_.Clear(); } + +core::Vocab BPE::GetVocabFromFile(const std::string& vocab_json_path) { + std::ifstream fin(vocab_json_path); + core::Vocab vocab; + nlohmann::json j; + fin >> j; + for (nlohmann::json::iterator it = j.begin(); it != j.end(); ++it) { + vocab[it.key()] = it.value(); + } + return vocab; +} + +void BPE::ConstructMergesPair(const std::string word_line, + std::pair* result) { + auto pair_a_begin = word_line.find_first_not_of(WHITESPACE); + auto pair_a_end = word_line.find_first_of(WHITESPACE, pair_a_begin); + auto pair_b_begin = word_line.find_first_not_of(WHITESPACE, pair_a_end); + auto pair_b_end = word_line.find_first_of(WHITESPACE, pair_b_begin); + *result = {word_line.substr(pair_a_begin, pair_a_end - pair_a_begin), + word_line.substr(pair_b_begin, pair_b_end - pair_b_begin)}; +} + +core::Merges BPE::GetMergesFromFile(const std::string& merge_path) { + std::ifstream fin(merge_path); + core::Merges merges; + std::string word_str; + while (std::getline(fin, word_str)) { + if (word_str.find("#version") == 0) { + continue; + } + std::pair result; + ConstructMergesPair(word_str, &result); + merges.emplace_back(result); + } + return merges; +} + +void BPE::GetVocabAndMergesFromFile(const std::string& vocab_json_path, + const std::string& merge_path, + core::Vocab* vocab, + core::Merges* merges) { + *vocab = BPE::GetVocabFromFile(vocab_json_path); + *merges = BPE::GetMergesFromFile(merge_path); +} + +void BPE::MergeWord(const std::string& word, core::BPEWord* bpe_word) { + std::vector> unk; + bpe_word->Reserve(word.length()); + uint32_t start = 0; + while (start < word.length()) { + uint32_t content_char; + uint32_t content_char_width = + utils::UTF8ToUInt32(word.data() + start, &content_char); + content_char = utils::UTF8ToUnicode(content_char); + uint32_t end = start + content_char_width; + bool is_first = (start == 0); + bool is_last = (end >= word.length()); + std::string curr_str = word.substr(start, content_char_width); + // Add the `continuing_subword_prefix` if relevant + if (!is_first) { + if (continuing_subword_prefix_.size() > 0) { + curr_str = continuing_subword_prefix_.front() + curr_str; + } + } + // Add the `end_of_word_suffix` if relevant + if (is_last) { + if (end_of_word_suffix_.size() > 0) { + curr_str = curr_str + end_of_word_suffix_.front(); + } + } + if (vocab_.find(curr_str) != vocab_.end()) { + if (unk.size() > 0) { + bpe_word->Add(unk.front().first, unk.front().second); + unk.clear(); + } + auto id = vocab_.at(curr_str); + 
bpe_word->Add(id, content_char_width); + } else { + if (unk_token_id_.size() > 0) { + if (unk.size() == 0) { + unk.push_back({unk_token_id_.front(), content_char_width}); + } else { + if (fuse_unk_) { + unk[0] = {unk[0].first, unk[0].second + content_char_width}; + } else { + bpe_word->Add(unk[0].first, unk[0].second); + unk[0] = {unk_token_id_.front(), content_char_width}; + } + } + } + } + start = end; + } + + if (unk.size() > 0) { + bpe_word->Add(unk.front().first, unk.front().second); + } + bpe_word->MergeAll(merges_, dropout_); +} + +void BPE::WordToTokens(const core::BPEWord& bpe_word, + std::vector* tokens) { + std::vector chars; + bpe_word.GetChars(&chars); + + std::vector offsets; + bpe_word.GetOffset(&offsets); + + tokens->reserve(offsets.size()); + for (int i = 0; i < offsets.size(); ++i) { + tokens->emplace_back(chars[i], vocab_reversed_[chars[i]], offsets[i]); + } +} + +void BPE::TokenizeWithCache(const std::string& sequence, + std::vector* tokens) { + core::BPEWord bpe_word; + if (cache_.GetValue(sequence, &bpe_word)) { + WordToTokens(bpe_word, tokens); + } else { + MergeWord(sequence, &bpe_word); + WordToTokens(bpe_word, tokens); + cache_.SetValue(sequence, bpe_word); + } +} + +std::vector BPE::Tokenize(const std::string& sequence) { + std::vector tokens; + if (sequence.empty()) { + return tokens; + } + if (dropout_.size() == 0) { + TokenizeWithCache(sequence, &tokens); + return tokens; + } + core::BPEWord bpe_word; + MergeWord(sequence, &bpe_word); + WordToTokens(bpe_word, &tokens); + return tokens; +} + +bool BPE::TokenToId(const std::string& token, uint32_t* id) const { + if (vocab_.find(token) == vocab_.end()) { + return false; + } + *id = vocab_.at(token); + return true; +} + +bool BPE::IdToToken(uint32_t id, std::string* token) const { + if (vocab_reversed_.find(id) == vocab_reversed_.end()) { + return false; + } + *token = vocab_reversed_.at(id); + return true; +} + +core::Vocab BPE::GetVocab() const { return vocab_; } + +size_t BPE::GetVocabSize() const { return vocab_.size(); } + +// Return the saved voacb path and merges.txt +std::vector BPE::Save(const std::string& folder, + const std::string& filename_prefix) const { + // write vocab json + std::string vocab_path; + if (filename_prefix == "") { + vocab_path = utils::PathJoin(folder, "vocab.json"); + } else { + vocab_path = utils::PathJoin({folder, filename_prefix, "-vocab.json"}); + } + VLOG(6) << "Vocab path" << vocab_path; + core::SortedVocabReversed sorted_vocab_r(vocab_reversed_.begin(), + vocab_reversed_.end()); + nlohmann::json j = sorted_vocab_r; + std::ofstream fout(vocab_path); + fout << j.dump(); + fout.close(); + + // write merges.txt + std::string merges_path; + if (filename_prefix == "") { + merges_path = utils::PathJoin(folder, "merges.txt"); + } else { + merges_path = utils::PathJoin({folder, filename_prefix, "-merges.txt"}); + } + VLOG(6) << "Merges path" << merges_path; + std::ofstream merge_fout(merges_path); + merge_fout << "#version: 0.2\n"; + for (auto&& merge : merges_) { + merge_fout << vocab_reversed_.at(merge.first.first) << " " + << vocab_reversed_.at(merge.first.second) << "\n"; + } + merge_fout.close(); + return {vocab_path, merges_path}; +} + +void to_json(nlohmann::json& j, const BPE& model) { + std::vector> merges; + for (auto& merge : model.merges_) { + merges.push_back({merge.first, merge.second.first}); + } + std::sort(merges.begin(), + merges.end(), + [](const std::pair& a, + const std::pair& b) { + return a.second < b.second; + }); + std::vector merge_strs; + for (auto& merge : 
merges) { + std::string s = model.vocab_reversed_.at(merge.first.first) + " " + + model.vocab_reversed_.at(merge.first.second); + merge_strs.push_back(s); + } + + core::SortedVocabReversed sorted_vocab_r(model.vocab_reversed_.begin(), + model.vocab_reversed_.end()); + + j = {{"type", "BPE"}, + {"unk_token", model.unk_token_}, + {"continuing_subword_prefix", model.continuing_subword_prefix_}, + {"end_of_word_suffix", model.end_of_word_suffix_}, + {"fuse_unk", model.fuse_unk_}, + {"dropout", model.dropout_}, + {"vocab", sorted_vocab_r}, + {"merges", merge_strs}}; +} + +void from_json(const nlohmann::json& j, BPE& model) { + j["vocab"].get_to(model.vocab_); + j["unk_token"].get_to(model.unk_token_); + j["continuing_subword_prefix"].get_to(model.continuing_subword_prefix_); + j["end_of_word_suffix"].get_to(model.end_of_word_suffix_); + j["fuse_unk"].get_to(model.fuse_unk_); + j["dropout"].get_to(model.dropout_); + + std::vector merge_strs; + j["merges"].get_to(merge_strs); + + core::Merges merges; + std::pair result; + for (auto& word_line : merge_strs) { + BPE::ConstructMergesPair(word_line, &result); + merges.push_back(result); + } + model.Init(merges); +} + +} // namespace model +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/bpe.h b/fast_tokenizer/fast_tokenizer/models/bpe.h new file mode 100644 index 0000000000000000000000000000000000000000..bb8cfd08cc413b3b73aea15fdcf5c866514a11cc --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/bpe.h @@ -0,0 +1,82 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/
+
+#pragma once
+
+#include "fast_tokenizer/models/model.h"
+#include "nlohmann/json.hpp"
+#include "fast_tokenizer/utils/cache.h"
+#include "fast_tokenizer/utils/utils.h"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace models {
+
+struct FASTTOKENIZER_DECL BPE : public Model {
+  BPE();
+  BPE(const core::Vocab& vocab,
+      const core::Merges& merges,
+      size_t cache_capacity = utils::DEFAULT_CACHE_CAPACITY,
+      const std::vector<float>& dropout = {},
+      const std::vector<std::string>& unk_token = {},
+      const std::vector<std::string>& continuing_subword_prefix = {},
+      const std::vector<std::string>& end_of_word_suffix = {},
+      bool fuse_unk = false);
+  virtual std::vector<core::Token> Tokenize(
+      const std::string& sequence) override;
+  virtual bool TokenToId(const std::string& token, uint32_t* id) const override;
+  virtual bool IdToToken(uint32_t id, std::string* token) const override;
+  virtual core::Vocab GetVocab() const override;
+  virtual size_t GetVocabSize() const override;
+  // Return the saved vocab path
+  virtual std::vector<std::string> Save(
+      const std::string& folder,
+      const std::string& filename_prefix) const override;
+
+  void ClearCache();
+  static core::Vocab GetVocabFromFile(const std::string& vocab_json_path);
+  static core::Merges GetMergesFromFile(const std::string& merge_path);
+  static void GetVocabAndMergesFromFile(const std::string& vocab_json_path,
+                                        const std::string& merge_path,
+                                        core::Vocab* vocab,
+                                        core::Merges* merges);
+  static void ConstructMergesPair(const std::string word_line,
+                                  std::pair<std::string, std::string>* result);
+
+private:
+  void Init(const core::Merges& merges);
+  void MergeWord(const std::string& word, core::BPEWord* bpe_word);
+  void WordToTokens(const core::BPEWord& bpe_word,
+                    std::vector<core::Token>* tokens);
+  void TokenizeWithCache(const std::string& sequence,
+                         std::vector<core::Token>* tokens);
+  core::Vocab vocab_;
+  core::VocabReversed vocab_reversed_;
+  core::MergeMap merges_;
+
+  utils::Cache<std::string, core::BPEWord> cache_;
+  // Each of the following vectors may contain 0 or 1 element
+  std::vector<float> dropout_;
+  std::vector<std::string> unk_token_;
+  std::vector<uint32_t> unk_token_id_;
+  std::vector<std::string> continuing_subword_prefix_;
+  std::vector<std::string> end_of_word_suffix_;
+  bool fuse_unk_;
+  friend void to_json(nlohmann::json& j, const BPE& model);
+  friend void from_json(const nlohmann::json& j, BPE& model);
+};
+
+}  // namespace models
+}  // namespace fast_tokenizer
+}  // namespace paddlenlp
diff --git a/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.cc b/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.cc
new file mode 100644
index 0000000000000000000000000000000000000000..92124c6b15ae210638f3360a396bf77eed00e01c
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.cc
@@ -0,0 +1,428 @@
+// Copyright 2022 TF.Text Authors.
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+
+// http://www.apache.org/licenses/LICENSE-2.0
+
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
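+
+// FastWordPiece follows the single-pass ("LinMaxMatch") approach used by
+// TF.Text's Fast WordPiece Tokenization: vocabulary tokens are stored in a
+// Darts trie augmented with precomputed failure links and failure pops, so a
+// word is tokenized left to right without the re-scanning of the classic
+// longest-match-first WordPiece algorithm. The end-to-end variant
+// (with_pretokenization = true) additionally splits on whitespace,
+// punctuation and Chinese characters while scanning.
+//
+// Illustrative usage sketch (the vocab path and input text are placeholders,
+// not assets shipped with this patch):
+//
+//   auto model = models::FastWordPiece::GetFastWordPieceFromFile(
+//       "vocab.txt", "[UNK]", 100, "##", /*with_pretokenization=*/true);
+//   std::vector<core::Token> tokens = model.Tokenize("unaffable");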
+ +#include "fast_tokenizer/models/fast_wordpiece.h" + +#include +#include +#include +#include +#include + +#include "fast_tokenizer/models/wordpiece.h" +#include "fast_tokenizer/utils/path.h" +#include "fast_tokenizer/utils/utf8.h" +#include "fast_tokenizer/utils/utils.h" +#include "glog/logging.h" +#include "unicode/uchar.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { + +const std::string WHITESPACE = " \n\r\t\f\v"; + +void FastWordPiece::InitFailureAndTrie() { + unk_token_id_ = vocab_.at(unk_token_); + trie_.SetWithPretokenization(with_pretokenization_); + trie_.SetUNKToken(unk_token_); + trie_.SetContinuingSubwordPrefix(continuing_subword_prefix_); + failure_array_.SetWithPretokenization(with_pretokenization_); + failure_array_.InitFromVocabAndTrie( + vocab_, &trie_, unk_token_, continuing_subword_prefix_); + PrecomputeEncodeValueForSubwordPrefix(); +} + +FastWordPiece::FastWordPiece() : WordPiece(), with_pretokenization_(false) {} + +FastWordPiece::FastWordPiece(const core::Vocab& vocab, + const std::string& unk_token, + size_t max_input_chars_per_word, + const std::string& continuing_subword_prefix, + bool with_pretokenization) + : WordPiece(vocab, + unk_token, + max_input_chars_per_word, + continuing_subword_prefix), + trie_(continuing_subword_prefix, unk_token, with_pretokenization), + with_pretokenization_(with_pretokenization), + failure_array_(with_pretokenization) { + InitFailureAndTrie(); +} + +void FastWordPiece::PrecomputeEncodeValueForSubwordPrefix() { + auto subword_prefix_tokens = WordPiece::Tokenize(continuing_subword_prefix_); + encoded_value_for_subword_prefix_.reserve(subword_prefix_tokens.size()); + + for (auto& token : subword_prefix_tokens) { + utils::FailureVocabToken failure_vocab_token( + token.value_, token.id_, continuing_subword_prefix_); + int encoded_value = utils::EncodeToken( + failure_vocab_token.TokenId(), + failure_vocab_token.TokenLengthWithoutContinuingSubwordPrefix(), + failure_vocab_token.IsSuffixToken()); + encoded_value_for_subword_prefix_.push_back(encoded_value); + } +} + +bool FastWordPiece::TryFollowFailureLinkAndCollectTokens( + const std::string& sequence, + int sequence_offset_in_text, + int* curr_offset_in_sequence, + utils::Trie::TraversalCursor* node, + std::vector* tokens) const { + int curr_node_value = 0; + if (trie_.TryGetData(*node, &curr_node_value)) { + AppendTokensToOutput(sequence, + sequence_offset_in_text, + curr_offset_in_sequence, + curr_node_value, + tokens); + trie_.SetTraversalCursor( + node, failure_array_.GetFailure(node->node_id_)->failure_link_); + return true; + } + const auto& node_aux = failure_array_.GetFailure(node->node_id_); + + if (node_aux->failure_link_ == utils::kNullNode) { + // No failure_link can be followed. 
+ return false; + } + int offset = 0, length = 0; + utils::GetFailurePopsOffsetAndLength( + node_aux->failure_pops_offset_length_, &offset, &length); + for (int i = offset; i < offset + length; ++i) { + AppendTokensToOutput(sequence, + sequence_offset_in_text, + curr_offset_in_sequence, + failure_array_.GetFailurePop(i), + tokens); + } + trie_.SetTraversalCursor(node, node_aux->failure_link_); + return true; +} + +void FastWordPiece::AppendTokensToOutput( + const std::string& sequence, + int sequence_offset_in_text, + int* curr_offset_in_sequence, + int curr_node_value, + std::vector* tokens) const { + uint32_t id = utils::GetTokenIdFromEncodedValue(curr_node_value); + std::string value; + int token_substr_length = + utils::GetTokenLengthFromEncodedValue(curr_node_value); + if (*curr_offset_in_sequence == 0 && + utils::IsSuffixTokenFromEncodedValue(curr_node_value)) { + token_substr_length += continuing_subword_prefix_.size(); + } + + if (id == unk_token_id_) { + value = unk_token_; + } else { + auto c_offset = *curr_offset_in_sequence; + c_offset = (std::min)(c_offset, static_cast(sequence.length() - 1)); + value = sequence.substr(*curr_offset_in_sequence, token_substr_length); + } + + if (*curr_offset_in_sequence > 0) { + value = continuing_subword_prefix_ + value; + } + core::Offset offset = { + sequence_offset_in_text + *curr_offset_in_sequence, + sequence_offset_in_text + *curr_offset_in_sequence + token_substr_length}; + tokens->emplace_back(id, value, offset); + + *curr_offset_in_sequence += token_substr_length; +} + +void FastWordPiece::ResetOutputAppendUNK( + int sequence_offset_in_text, + int sequence_size, + int* original_num_tokens, + std::vector* tokens) const { + tokens->resize(*original_num_tokens + 1); + tokens->back() = { + unk_token_id_, + unk_token_, + {sequence_offset_in_text, sequence_offset_in_text + sequence_size}}; + (*original_num_tokens)++; +} + +bool FastWordPiece::TryHandleContinuingSubWordPrefix( + const std::string& sequence, + int sequence_offset_in_text, + const utils::Trie::TraversalCursor& curr_node, + int* original_num_tokens, + int* curr_offset_in_sequence, + std::vector* tokens) const { + if (curr_node.node_id_ != trie_.GetSuffixRoot()) { + return false; + } + int cur_num_tokens = tokens->size(); + if (cur_num_tokens != *original_num_tokens) { + return false; + } + if (encoded_value_for_subword_prefix_.size() == 1 && + utils::GetTokenIdFromEncodedValue(encoded_value_for_subword_prefix_[0]) == + unk_token_id_) { + ResetOutputAppendUNK( + sequence_offset_in_text, sequence.size(), original_num_tokens, tokens); + return true; + } + for (int encoded_token_value : encoded_value_for_subword_prefix_) { + AppendTokensToOutput(sequence, + sequence_offset_in_text, + curr_offset_in_sequence, + encoded_token_value, + tokens); + } + return true; +} + +void FastWordPiece::HandleTheRemainingStringOnTriePath( + const std::string& sequence, + int sequence_offset_in_text, + utils::Trie::TraversalCursor* curr_node, + int* original_num_tokens, + int* curr_offset_in_sequence, + std::vector* tokens) const { + if (curr_node->node_id_ == utils::Trie::kRootNodeId) { + return; + } + if (TryHandleContinuingSubWordPrefix(sequence, + sequence_offset_in_text, + *curr_node, + original_num_tokens, + curr_offset_in_sequence, + tokens)) { + *original_num_tokens = tokens->size(); + return; + } + while (curr_node->node_id_ != trie_.GetSuffixRoot() && + curr_node->node_id_ != trie_.GetPuncFailureNode()) { + if (!TryFollowFailureLinkAndCollectTokens(sequence, + sequence_offset_in_text, + 
curr_offset_in_sequence, + curr_node, + tokens)) { + ResetOutputAppendUNK(sequence_offset_in_text, + sequence.size(), + original_num_tokens, + tokens); + return; + } + } + *original_num_tokens = tokens->size(); +} + +int FastWordPiece::SkipRemainingOfWordAndTrailingWhiteSpaces( + const std::string& sequence, int* curr_idx) const { + int seq_len = sequence.length(); + uint32_t curr_unicode_char; + int end_of_word = *curr_idx; + while (*curr_idx < seq_len) { + auto chwidth = + utils::UTF8ToUInt32(sequence.data() + *curr_idx, &curr_unicode_char); + curr_unicode_char = utils::UTF8ToUnicode(curr_unicode_char); + if (u_isUWhiteSpace(curr_unicode_char)) { + *curr_idx += chwidth; + break; + } + if (utils::IsPunctuationOrChineseChar(curr_unicode_char)) { + break; + } + *curr_idx += chwidth; + end_of_word = *curr_idx; + } + return end_of_word; +} + +std::vector FastWordPiece::TokenizeWithoutPreTokenize( + const std::string& sequence) const { + VLOG(6) << "Using FastWordPiece::TokenizeWithoutPreTokenize to tokenize " + "sequence"; + if (sequence.empty()) { + return {}; + } + std::vector all_tokens; + size_t unicode_len = + utils::GetUnicodeLenFromUTF8(sequence.data(), sequence.length()); + int original_num_tokens = 0; + if (unicode_len > max_input_chars_per_word_) { + ResetOutputAppendUNK(0, sequence.size(), &original_num_tokens, &all_tokens); + } else { + int curr_offset_in_sequence = 0; + auto curr_node = trie_.CreateRootTraversalCursor(); + for (auto ch : sequence) { + while (!trie_.TryTraverseOneStep(&curr_node, ch)) { + if (!TryFollowFailureLinkAndCollectTokens(sequence, + 0, + &curr_offset_in_sequence, + &curr_node, + &all_tokens)) { + ResetOutputAppendUNK( + 0, sequence.size(), &original_num_tokens, &all_tokens); + return all_tokens; + } + } + } + HandleTheRemainingStringOnTriePath(sequence, + 0, + &curr_node, + &original_num_tokens, + &curr_offset_in_sequence, + &all_tokens); + } + if (all_tokens.size() == 0) { + ResetOutputAppendUNK(0, sequence.size(), &original_num_tokens, &all_tokens); + } + VLOG(6) << "All tokens num from TokenizeWithoutPreTokenize: " + << all_tokens.size(); + return all_tokens; +} + +std::vector FastWordPiece::TokenizeWithPreTokenize( + const std::string& sequence) const { + VLOG(6) + << "Using FastWordPiece::TokenizeWithPreTokenize to tokenize sequence"; + // Need to implement + if (sequence.empty()) { + return {}; + } + std::vector all_tokens; + int original_num_tokens = 0; + uint32_t prev_unicode_char, curr_unicode_char; + int curr_idx = 0; + int chwidth = 0; + auto seq_len = sequence.length(); + while (curr_idx < seq_len) { + int curr_offset_in_word = 0; + auto curr_node = trie_.CreateRootTraversalCursor(); + int bytes_length = 0; + int word_offset_in_sequence = curr_idx; + std::string sequence_substr = sequence.substr(curr_idx); + bool fail_to_match = false; + while (curr_idx < seq_len) { + prev_unicode_char = curr_unicode_char; + chwidth = + utils::UTF8ToUInt32(sequence.data() + curr_idx, &curr_unicode_char); + curr_unicode_char = utils::UTF8ToUnicode(curr_unicode_char); + if (bytes_length + chwidth > max_input_chars_per_word_) { + break; + } + std::string curr_substr = sequence.substr(curr_idx, chwidth); + while (!trie_.TryTraverseSeveralSteps(&curr_node, curr_substr)) { + if (!TryFollowFailureLinkAndCollectTokens(sequence_substr, + word_offset_in_sequence, + &curr_offset_in_word, + &curr_node, + &all_tokens)) { + fail_to_match = true; + break; + } + } + if (fail_to_match) { + break; + } + bytes_length += chwidth; + curr_idx += chwidth; + } + if (curr_idx >= 
seq_len) { + HandleTheRemainingStringOnTriePath(sequence_substr, + word_offset_in_sequence, + &curr_node, + &original_num_tokens, + &curr_offset_in_word, + &all_tokens); + break; + } + bool curr_unicode_char_is_space = u_isUWhiteSpace(curr_unicode_char); + if (curr_unicode_char_is_space || + utils::IsPunctuationOrChineseChar(curr_unicode_char) || + (curr_idx > 0 && + utils::IsPunctuationOrChineseChar(prev_unicode_char))) { + HandleTheRemainingStringOnTriePath( + sequence_substr.substr(0, curr_idx - word_offset_in_sequence), + word_offset_in_sequence, + &curr_node, + &original_num_tokens, + &curr_offset_in_word, + &all_tokens); + if (curr_unicode_char_is_space) { + curr_idx += chwidth; + } + continue; + } + curr_idx += chwidth; + int end_of_word = + SkipRemainingOfWordAndTrailingWhiteSpaces(sequence, &curr_idx); + ResetOutputAppendUNK(word_offset_in_sequence, + end_of_word - word_offset_in_sequence, + &original_num_tokens, + &all_tokens); + } + if (all_tokens.size() == 0) { + ResetOutputAppendUNK(0, sequence.size(), &original_num_tokens, &all_tokens); + } + VLOG(6) << "All tokens num from TokenizeWithPreTokenize: " + << all_tokens.size(); + return all_tokens; +} + +std::vector FastWordPiece::Tokenize(const std::string& sequence) { + if (!with_pretokenization_) { + return TokenizeWithoutPreTokenize(sequence); + } + return TokenizeWithPreTokenize(sequence); +} + +FastWordPiece FastWordPiece::GetFastWordPieceFromFile( + const std::string& file, + const std::string& unk_token, + size_t max_input_chars_per_word, + const std::string& continuing_subword_prefix, + bool with_pretokenization) { + auto vocab = GetVocabFromFile(file); + return FastWordPiece(vocab, + unk_token, + max_input_chars_per_word, + continuing_subword_prefix, + with_pretokenization); +} + +void to_json(nlohmann::json& j, const FastWordPiece& model) { + j = { + {"type", "FastWordPiece"}, + {"vocab", model.vocab_}, + {"unk_token", model.unk_token_}, + {"max_input_chars_per_word", model.max_input_chars_per_word_}, + {"continuing_subword_prefix", model.continuing_subword_prefix_}, + {"with_pretokenization", model.with_pretokenization_}, + }; +} + +void from_json(const nlohmann::json& j, FastWordPiece& model) { + j["vocab"].get_to(model.vocab_); + j["unk_token"].get_to(model.unk_token_); + j["max_input_chars_per_word"].get_to(model.max_input_chars_per_word_); + j["continuing_subword_prefix"].get_to(model.continuing_subword_prefix_); + j["with_pretokenization"].get_to(model.with_pretokenization_); + model.InitFailureAndTrie(); +} + +} // namespace models +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.h b/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.h new file mode 100644 index 0000000000000000000000000000000000000000..f26534ba6753cf34db90a784b2e78e7a254418dc --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/fast_wordpiece.h @@ -0,0 +1,95 @@ +// Copyright 2022 TF.Text Authors. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include "fast_tokenizer/models/model.h"
+#include "fast_tokenizer/models/wordpiece.h"
+#include "fast_tokenizer/utils/failure.h"
+#include "fast_tokenizer/utils/trie.h"
+#include "fast_tokenizer/utils/utils.h"
+#include "nlohmann/json.hpp"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace models {
+
+struct FASTTOKENIZER_DECL FastWordPiece : public WordPiece {
+  FastWordPiece();
+  FastWordPiece(const core::Vocab& vocab,
+                const std::string& unk_token = "[UNK]",
+                size_t max_input_chars_per_word = 100,
+                const std::string& continuing_subword_prefix = "##",
+                bool with_pretokenization = false);
+
+  virtual std::vector<core::Token> Tokenize(
+      const std::string& sequence) override;
+  static FastWordPiece GetFastWordPieceFromFile(
+      const std::string& file,
+      const std::string& unk_token = "[UNK]",
+      size_t max_input_chars_per_word = 100,
+      const std::string& continuing_subword_prefix = "##",
+      bool with_pretokenization = false);
+
+private:
+  void InitFailureAndTrie();
+  std::vector<core::Token> TokenizeWithoutPreTokenize(
+      const std::string& sequence) const;
+  std::vector<core::Token> TokenizeWithPreTokenize(
+      const std::string& sequence) const;
+  bool TryFollowFailureLinkAndCollectTokens(
+      const std::string& sequence,
+      int sequence_offset_in_text,
+      int* curr_offset_in_sequence,
+      utils::Trie::TraversalCursor* node,
+      std::vector<core::Token>* tokens) const;
+
+  void AppendTokensToOutput(const std::string& sequence,
+                            int sequence_offset_in_text,
+                            int* curr_offset_in_sequence,
+                            int curr_node_value,
+                            std::vector<core::Token>* tokens) const;
+  void HandleTheRemainingStringOnTriePath(
+      const std::string& sequence,
+      int sequence_offset_in_text,
+      utils::Trie::TraversalCursor* node,
+      int* original_num_tokens,
+      int* curr_offset_in_sequence,
+      std::vector<core::Token>* tokens) const;
+  bool TryHandleContinuingSubWordPrefix(
+      const std::string& sequence,
+      int sequence_offset_in_text,
+      const utils::Trie::TraversalCursor& node,
+      int* original_num_tokens,
+      int* curr_offset_in_sequence,
+      std::vector<core::Token>* tokens) const;
+  void ResetOutputAppendUNK(int sequence_offset_in_text,
+                            int sequence_size,
+                            int* original_num_tokens,
+                            std::vector<core::Token>* tokens) const;
+  int SkipRemainingOfWordAndTrailingWhiteSpaces(const std::string& sequence,
+                                                int* curr_idx) const;
+  void PrecomputeEncodeValueForSubwordPrefix();
+  utils::Trie trie_;
+  utils::FailureArray failure_array_;
+  std::vector<int> encoded_value_for_subword_prefix_;
+  friend void to_json(nlohmann::json& j, const FastWordPiece& model);
+  friend void from_json(const nlohmann::json& j, FastWordPiece& model);
+  bool with_pretokenization_;  // The end-to-end version of FastWordPiece
+};
+
+}  // namespace models
+}  // namespace fast_tokenizer
+}  // namespace paddlenlp
diff --git a/fast_tokenizer/fast_tokenizer/models/model.h b/fast_tokenizer/fast_tokenizer/models/model.h
new file mode 100644
index 0000000000000000000000000000000000000000..8a8f8daddf6f1d4c0092a477b20975a91e1b6265
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/models/model.h
@@ -0,0 +1,39 @@
+/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { + +struct FASTTOKENIZER_DECL Model { + virtual std::vector Tokenize(const std::string& tokens) = 0; + virtual bool TokenToId(const std::string& token, uint32_t* id) const = 0; + virtual bool IdToToken(uint32_t id, std::string* token) const = 0; + virtual core::Vocab GetVocab() const = 0; + virtual size_t GetVocabSize() const = 0; + // Return the saved voacb path + virtual std::vector Save( + const std::string& folder, const std::string& filename_prefix) const = 0; +}; + +} // namespace model +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/models.h b/fast_tokenizer/fast_tokenizer/models/models.h new file mode 100644 index 0000000000000000000000000000000000000000..feafdd1ae5902f4242a5cffdf16f67f5fe80f8e6 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/models.h @@ -0,0 +1,21 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/models/bpe.h" +#include "fast_tokenizer/models/fast_wordpiece.h" +#include "fast_tokenizer/models/model.h" +#include "fast_tokenizer/models/unigram.h" +#include "fast_tokenizer/models/wordpiece.h" \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/models/unigram.cc b/fast_tokenizer/fast_tokenizer/models/unigram.cc new file mode 100644 index 0000000000000000000000000000000000000000..255ee1c3ca2978cda2c198ec78f7a4117504f117 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/unigram.cc @@ -0,0 +1,436 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/models/unigram.h" +#include +#include +#include + +#include "glog/logging.h" +#include "fast_tokenizer/utils/path.h" +#include "fast_tokenizer/utils/unique_ptr.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { + +constexpr float kUnkPenalty = 10.0; + +Unigram::Unigram() { + core::VocabList vocab = {{"", 0.0}}; + std::vector unk_id = {0}; + Init(vocab, unk_id); +} + +Unigram::Unigram(const core::VocabList& vocab, + const std::vector& unk_id) { + Init(vocab, unk_id); +} + +Unigram::Unigram(const Unigram& other) { Init(other.vocab_, other.unk_id_); } + +void Unigram::Init(const core::VocabList& vocab, + const std::vector& unk_id) { + size_t n = vocab.size(); + if (unk_id.size() > 0) { + if (n == 0) { + std::ostringstream oss; + oss << "EmptyVocabulary error occurs when init unigram with unk token."; + throw std::runtime_error(oss.str()); + } else if (unk_id[0] >= n) { + std::ostringstream oss; + oss << "Unk token id is not in vocab when init unigram with unk token."; + throw std::runtime_error(oss.str()); + } + } + + vocab_ = vocab; + unk_id_ = unk_id; + + bos_id_ = n + 1; + eos_id_ = n + 2; + min_score_ = std::numeric_limits::max(); + + std::vector keys; + std::vector values; + // id = 0 is unk_id_ + for (size_t id = 0; id < n; ++id) { + size_t actual_id = id; + token_to_ids_.insert({vocab[id].first, actual_id}); + keys.push_back(vocab[id].first.c_str()); + values.push_back(actual_id); + if (vocab[id].second < min_score_) { + min_score_ = vocab[id].second; + } + } + + std::vector sorted_keys; + std::vector sorted_values; + utils::GetSortedVocab(keys, values, &sorted_keys, &sorted_values); + trie_ = utils::make_unique(); + if (trie_->build(sorted_keys.size(), + const_cast(&sorted_keys[0]), + nullptr, + &sorted_values[0]) != 0) { + std::ostringstream oss; + oss << "Cannot build double-array."; + throw std::runtime_error(oss.str()); + return; + } + // Computes the maximum number of shared prefixes in the trie. 
+ const int kMaxTrieResultsSize = 1024; + std::vector results( + kMaxTrieResultsSize); + trie_results_size_ = 0; + for (size_t id = 0; id < n; ++id) { + const int num_nodes = trie_->commonPrefixSearch(vocab[id].first.data(), + results.data(), + results.size(), + vocab[id].first.size()); + trie_results_size_ = std::max(trie_results_size_, num_nodes); + } + fuse_unk_ = true; + is_optimized_ = true; + if (trie_results_size_ == 0) { + std::ostringstream oss; + oss << "No entry is found in the trie."; + throw std::runtime_error(oss.str()); + } +} + +float Unigram::GetVocabScore(uint32_t id) const { return vocab_.at(id).second; } + +bool Unigram::TokenToId(const std::string& token, uint32_t* id) const { + if (token_to_ids_.find(token) == token_to_ids_.end()) { + return false; + } + *id = token_to_ids_.at(token); + return true; +} + +bool Unigram::IdToToken(uint32_t id, std::string* token) const { + if (id >= vocab_.size()) { + return false; + } + *token = vocab_[id].first; + return true; +} + +core::Vocab Unigram::GetVocab() const { return token_to_ids_; } + +size_t Unigram::GetVocabSize() const { return vocab_.size(); } + +std::vector Unigram::Tokenize(const std::string& sequence) { + std::vector encode_result; + Encode(sequence, &encode_result); + size_t offset = 0; + std::vector tokens; + tokens.reserve(encode_result.size()); + auto UpdateTokens = [&](const std::string& str) { + uint32_t id = 0; + if (token_to_ids_.find(str) != token_to_ids_.end()) { + id = token_to_ids_.at(str); + } else { + if (unk_id_.size() > 0) { + id = unk_id_[0]; + } + } + auto len = str.length(); + tokens.emplace_back(id, str, core::Offset{offset, offset + len}); + offset += len; + }; + + for (auto&& str : encode_result) { + // Avoid to append the filtered_token_ to encoded_result + if (str == filtered_token_) { + offset += filtered_token_.length(); + continue; + } + // Split the tokenized tokens following some regex rule + if (split_rule_ != nullptr) { + re2::StringPiece result; + int start = 0; + int end = str.length(); + while (split_rule_->Match(str, start, end, RE2::UNANCHORED, &result, 1)) { + int curr_start = result.data() - str.data(); + int res_len = result.length(); + start = curr_start + res_len; + std::string result_str(result.data(), res_len); + if (result_str == filtered_token_) { + offset += filtered_token_.length(); + continue; + } + UpdateTokens(result_str); + } + if (start == 0) { + // Hasn't been splitted + UpdateTokens(str); + } + } else { + UpdateTokens(str); + } + } + return tokens; +} + +std::vector Unigram::Save( + const std::string& folder, const std::string& filename_prefix) const { + std::string vocab_path; + if (filename_prefix == "") { + vocab_path = utils::PathJoin(folder, "unigram.json"); + } else { + vocab_path = utils::PathJoin({folder, filename_prefix, "-unigram.json"}); + } + VLOG(6) << "Vocab path" << vocab_path; + std::ofstream fout(vocab_path); + nlohmann::json j = *this; + fout << j.dump(); + fout.close(); + return {vocab_path}; +} + +void Unigram::PopulateNodes(utils::Lattice* lattice) const { + auto get_chars_length = [&lattice](int begin_pos, const char* end) { + int pos = begin_pos; + while (lattice->surface(pos) < end) ++pos; + return pos - begin_pos; + }; + + const float unk_score = min_score_ - kUnkPenalty; + + const int len = lattice->size(); + const char* end = lattice->sentence() + lattice->utf8_size(); + + // +1 just in case. 
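+  // (num_nodes returned by commonPrefixSearch is checked with CHECK_LT
+  // against this buffer's size below.)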
+ std::vector trie_results( + trie_results_size_ + 1); + + for (int begin_pos = 0; begin_pos < len; ++begin_pos) { + const char* begin = lattice->surface(begin_pos); + + // Finds all pieces which are prefix of surface(begin_pos). + const size_t num_nodes = + trie_->commonPrefixSearch(begin, + trie_results.data(), + trie_results.size(), + static_cast(end - begin)); + CHECK_LT(num_nodes, trie_results.size()); + + bool has_single_node = false; + + // Inserts pieces to the lattice. + for (size_t k = 0; k < num_nodes; ++k) { + const int length = + get_chars_length(begin_pos, begin + trie_results[k].length); + const int id = trie_results[k].value; + utils::Lattice::Node* node = lattice->Insert(begin_pos, length); + node->id = id; // the value of Trie stores vocab_id. + // User defined symbol receives extra bonus to always be selected. + node->score = vocab_[id].second; + + if (!has_single_node && node->length == 1) { + has_single_node = true; + } + } + + if (!has_single_node) { + if (unk_id_.size() > 0) { + utils::Lattice::Node* node = lattice->Insert(begin_pos, 1); + node->id = unk_id_[0]; // add UNK node. + node->score = unk_score; + } + } + } +} + +void Unigram::Encode(const std::string& normalized, + std::vector* encode_result) { + encode_result->clear(); + if (normalized.empty()) { + return; + } + if (!cache_.GetValue(normalized, encode_result)) { + if (is_optimized_) { + EncodeOptimized(normalized, encode_result); + } else { + EncodeUnoptimized(normalized, encode_result); + } + cache_.SetValue(normalized, *encode_result); + } +} + +void Unigram::EncodeOptimized(const std::string& normalized, + std::vector* encode_result) { + // Represents the last node of the best path. + struct BestPathNode { + int id = -1; // The vocab id. (maybe -1 for UNK) + float best_path_score = + 0; // The total score of the best path ending at this node. + int starts_at = + -1; // The starting position (in utf-8) of this node. The entire best + // path can be constructed by backtracking along this link. + }; + const int size = normalized.size(); + const float unk_score = min_score_ - kUnkPenalty; + // The ends are exclusive. + std::vector best_path_ends_at(size + 1); + // Generate lattice on-the-fly (not stored) and update best_path_ends_at. + int starts_at = 0; + while (starts_at < size) { + std::size_t node_pos = 0; + std::size_t key_pos = starts_at; + const auto best_path_score_till_here = + best_path_ends_at[starts_at].best_path_score; + bool has_single_node = false; + const int mblen = std::min( + utils::OneCharLen(normalized.data() + starts_at), size - starts_at); + while (key_pos < size) { + const int ret = + trie_->traverse(normalized.data(), node_pos, key_pos, key_pos + 1); + if (ret == -2) break; + if (ret >= 0) { + // Update the best path node. 
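+        // best_path_ends_at[i] holds the highest-scoring tokenization of
+        // normalized[0, i); extending it with the piece the trie just matched
+        // gives a candidate path ending at key_pos (a Viterbi-style DP).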
+ auto& target_node = best_path_ends_at[key_pos]; + const auto length = (key_pos - starts_at); + const auto score = GetVocabScore(ret); + const auto candidate_best_path_score = + score + best_path_score_till_here; + VLOG(4) << "key_pos: " << key_pos; + VLOG(4) << "score: " << score; + VLOG(4) << "best_path_score_till_here: " << best_path_score_till_here; + VLOG(4) << "starts_at: " << starts_at; + VLOG(4) << "token: " << vocab_.at(ret).first; + if (target_node.starts_at == -1 || + candidate_best_path_score > target_node.best_path_score) { + target_node.best_path_score = candidate_best_path_score; + target_node.starts_at = starts_at; + target_node.id = ret; + } + if (!has_single_node && length == mblen) { + has_single_node = true; + } + } + } + if (!has_single_node) { + auto& target_node = best_path_ends_at[starts_at + mblen]; + const auto candidate_best_path_score = + unk_score + best_path_score_till_here; + if (target_node.starts_at == -1 || + candidate_best_path_score > target_node.best_path_score) { + target_node.best_path_score = candidate_best_path_score; + target_node.starts_at = starts_at; + target_node.id = -1; + if (unk_id_.size() > 0) { + target_node.id = unk_id_[0]; + } + } + } + // Move by one unicode character. + starts_at += mblen; + } + int ends_at = size; + std::vector token; + while (ends_at > 0) { + const auto& node = best_path_ends_at[ends_at]; + auto starts_at = node.starts_at; + if (fuse_unk_ && unk_id_.size() > 0 && node.id == unk_id_[0]) { + token.push_back(normalized.substr(starts_at, ends_at - starts_at)); + } else { + if (!token.empty()) { + encode_result->push_back(""); + auto& back = encode_result->back(); + for (int i = token.size() - 1; i >= 0; --i) { + back.append(token[i]); + } + token.clear(); + } + encode_result->push_back( + normalized.substr(starts_at, ends_at - starts_at)); + } + ends_at = node.starts_at; + } + if (!token.empty()) { + encode_result->push_back(""); + auto& back = encode_result->back(); + for (int i = token.size() - 1; i >= 0; --i) { + back.append(token[i]); + } + } + std::reverse(encode_result->begin(), encode_result->end()); +} + +void Unigram::EncodeUnoptimized(const std::string& normalized, + std::vector* encode_result) { + utils::Lattice lattice; + lattice.SetSentence( + utils::simple_string_view(normalized.data(), normalized.size())); + PopulateNodes(&lattice); + if (fuse_unk_) { + std::string token; + for (const auto* node : lattice.Viterbi().first) { + if (unk_id_.size() > 0 && node->id == unk_id_[0]) { + token.append(node->piece.data(), node->piece.size()); + } else { + if (!token.empty()) { + encode_result->push_back(token); + token.clear(); + } + encode_result->push_back(std::string(node->piece.data())); + } + if (!token.empty()) { + encode_result->push_back(token); + } + } + } else { + for (const auto* node : lattice.Viterbi().first) { + encode_result->push_back(std::string(node->piece.data())); + } + } +} + +void Unigram::SetFilterToken(const std::string& filtered_token) { + filtered_token_ = filtered_token; +} + +void Unigram::SetSplitRule(const std::string& split_rule) { + split_rule_ = utils::make_unique(split_rule); +} + +void to_json(nlohmann::json& j, const Unigram& model) { + std::string split_rule = ""; + if (model.split_rule_ != nullptr) { + split_rule = model.split_rule_->pattern(); + } + j = {{"type", "Unigram"}, + {"unk_id", model.unk_id_}, + {"vocab", model.vocab_}, + {"filter_token", model.filtered_token_}, + {"split_rule", split_rule}}; +} + +void from_json(const nlohmann::json& j, Unigram& model) { + 
std::string filter_token = j.at("filter_token").get(); + std::string split_rule = j.at("split_rule").get(); + model.Init(j.at("vocab").get(), + j.at("unk_id").get>()); + if (!split_rule.empty()) { + model.SetSplitRule(split_rule); + } + model.SetFilterToken(filter_token); +} + +} // namespace model +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/unigram.h b/fast_tokenizer/fast_tokenizer/models/unigram.h new file mode 100644 index 0000000000000000000000000000000000000000..c66cbbbae5a3b3946cff4263b59c40c668d5f4b4 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/unigram.h @@ -0,0 +1,86 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/models/model.h" +#include "fast_tokenizer/utils/cache.h" +#include "fast_tokenizer/utils/lattice.h" +#include "fast_tokenizer/utils/trie.h" + +#include "darts.h" +#include "nlohmann/json.hpp" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { + +struct FASTTOKENIZER_DECL Unigram : public Model { + Unigram(); + Unigram(const core::VocabList& vocab, const std::vector& unk_id); + Unigram(const Unigram& other); + virtual bool TokenToId(const std::string& token, uint32_t* id) const override; + virtual bool IdToToken(uint32_t id, std::string* token) const override; + virtual core::Vocab GetVocab() const override; + virtual size_t GetVocabSize() const override; + virtual std::vector Tokenize( + const std::string& sequence) override; + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override; + // Set the filter token for unigram. + void SetFilterToken(const std::string& filtered_token); + // Set the special spliting rule for unigram. + void SetSplitRule(const std::string& split_rule); + +private: + float GetVocabScore(uint32_t id) const; + void Init(const core::VocabList& vocab, const std::vector& unk_id); + void PopulateNodes(utils::Lattice* lattice) const; + void Encode(const std::string& normalized, + std::vector* encode_result); + void EncodeOptimized(const std::string& normalized, + std::vector* encode_result); + void EncodeUnoptimized(const std::string& normalized, + std::vector* encode_result); + + core::Vocab token_to_ids_; + core::VocabList vocab_; + utils::Cache> cache_; + std::unique_ptr trie_; + double min_score_; + std::vector unk_id_; + size_t bos_id_; + size_t eos_id_; + bool fuse_unk_; + bool is_optimized_; + int trie_results_size_; + // Some tokenizer, such as ernie-m, may avoid to append some special + // token to final result, the unigram model doesn't filter any tokens + // by default. + std::string filtered_token_; + // For special rule of token spliting after tokenization, + // the unigram model has no spliting rule by default. + // It's useful for some cases, such as ernie-m tokenizer. 
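+  // The rule is an RE2 pattern supplied through SetSplitRule().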
+ std::unique_ptr split_rule_; + + friend void to_json(nlohmann::json& j, const Unigram& model); + friend void from_json(const nlohmann::json& j, Unigram& model); +}; + +} // namespace models +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/wordpiece.cc b/fast_tokenizer/fast_tokenizer/models/wordpiece.cc new file mode 100644 index 0000000000000000000000000000000000000000..079602d128064bda6c359482091800a99150ff1b --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/wordpiece.cc @@ -0,0 +1,294 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include +#include + +#include "fast_tokenizer/models/wordpiece.h" +#include "fast_tokenizer/utils/path.h" +#include "fast_tokenizer/utils/utf8.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { +const std::string WHITESPACE = " \n\r\t\f\v"; + +WordPiece::WordPiece() + : unk_token_("[UNK]"), + continuing_subword_prefix_("##"), + max_input_chars_per_word_(100), + unk_token_id_(0) {} + +WordPiece::WordPiece(const core::Vocab& vocab, + const std::string& unk_token, + size_t max_input_chars_per_word, + const std::string& continuing_subword_prefix, + bool handle_chinese_chars) + : vocab_(vocab), + unk_token_(unk_token), + max_input_chars_per_word_(max_input_chars_per_word), + continuing_subword_prefix_(continuing_subword_prefix), + handle_chinese_chars_(handle_chinese_chars) { + for (const auto& vocab_item : vocab) { + vocab_reversed_[vocab_item.second] = vocab_item.first; + } + unk_token_id_ = vocab.at(unk_token); +} + +// Move version +WordPiece::WordPiece(core::Vocab&& vocab, + std::string&& unk_token, + size_t max_input_chars_per_word, + std::string&& continuing_subword_prefix, + bool handle_chinese_chars) + : vocab_(std::move(vocab)), + unk_token_(std::move(unk_token)), + max_input_chars_per_word_(std::move(max_input_chars_per_word)), + continuing_subword_prefix_(std::move(continuing_subword_prefix)), + handle_chinese_chars_(handle_chinese_chars) { + for (const auto& vocab_item : vocab) { + vocab_reversed_[vocab_item.second] = vocab_item.first; + } + unk_token_id_ = vocab.at(unk_token); +} + +core::Vocab WordPiece::GetVocab() const { return vocab_; } + +size_t WordPiece::GetVocabSize() const { return vocab_.size(); } + +bool WordPiece::TokenToId(const std::string& token, uint32_t* id) const { + if (vocab_.find(token) == vocab_.end()) { + return false; + } + *id = vocab_.at(token); + return true; +} + +bool WordPiece::IdToToken(uint32_t id, std::string* token) const { + if (vocab_reversed_.find(id) == vocab_reversed_.end()) { + return false; + } + *token = vocab_reversed_.at(id); + return true; +} + +std::vector WordPiece::Save( + const std::string& folder, const std::string& filename_prefix) const { + std::string filepath; + if (filename_prefix == "") { + filepath = utils::PathJoin(folder, "vocab.txt"); + } else { + filepath = utils::PathJoin({folder, 
filename_prefix, "-vocab.txt"}); + } + VLOG(6) << "Full path" << filepath; + std::ofstream fout(filepath); + std::vector> vocab(vocab_.begin(), + vocab_.end()); + std::sort(vocab.begin(), + vocab.end(), + [](const std::pair& left, + const std::pair& right) -> bool { + return left.second < right.second; + }); + for (const auto& vocab_item : vocab) { + fout << vocab_item.first << "\n"; + } + fout.close(); + return {filepath}; +} + +static bool CheckIfStringIsAlphaNum(const std::string& str) { + return std::count_if(str.begin(), str.end(), [](char ch) { + return std::isalnum(ch) > 0; + }) == str.length(); +} + +std::vector WordPiece::Tokenize(const std::string& sequence) { + VLOG(6) << "Using WordPiece::Tokenize to tokenize sequence '" << sequence << "'"; + std::vector all_tokens; + size_t unicode_len = + utils::GetUnicodeLenFromUTF8(sequence.data(), sequence.length()); + if (unicode_len > max_input_chars_per_word_) { + all_tokens.emplace_back( + vocab_.at(unk_token_), unk_token_, core::Offset{0, sequence.length()}); + } else { + bool found_token = true; + uint32_t start = 0; + + while (start < sequence.length()) { + uint32_t end = sequence.length(); + core::Token cur_token; + bool match_cur_token = false; + while (start < end) { + std::string sub_str = sequence.substr(start, end - start); + if (start > 0 && + (handle_chinese_chars_ || CheckIfStringIsAlphaNum(sub_str))) { + sub_str = continuing_subword_prefix_ + sub_str; + } + const auto& vocab_iter = vocab_.find(sub_str); + if (vocab_iter != vocab_.end()) { + cur_token = {vocab_iter->second, sub_str, {start, end}}; + match_cur_token = true; + break; + } + // std::u32string u32sub_str = conv.from_bytes(sub_str); + // end -= utils::GetUTF8CharLen(u32sub_str.back()); + for (auto it = sub_str.rbegin(); it != sub_str.rend(); ++it) { + --end; + if (utils::IsCharBeginBoundary(*it)) { + break; + } + } + } + if (!match_cur_token) { + found_token = false; + break; + } + all_tokens.emplace_back(cur_token); + start = end; + } + + if (!found_token) { + all_tokens.clear(); + all_tokens.emplace_back(vocab_.at(unk_token_), + unk_token_, + core::Offset{0, sequence.length()}); + } + } + return all_tokens; +} + + +core::Vocab WordPiece::GetVocabFromFile(const std::string& file) { + std::ifstream fin(file); + core::Vocab vocab; + int i = 0; + constexpr int MAX_BUFFER_SIZE = 256; + char word[MAX_BUFFER_SIZE]; + while (fin.getline(word, MAX_BUFFER_SIZE)) { + std::string word_str = word; + auto leading_spaces = word_str.find_first_not_of(WHITESPACE); + if (leading_spaces != std::string::npos) { + leading_spaces = (std::min)(leading_spaces, word_str.length() - 1); + word_str = word_str.substr(leading_spaces); + } + auto trailing_spaces = word_str.find_last_not_of(WHITESPACE); + if (trailing_spaces != std::string::npos) { + word_str = word_str.substr(0, trailing_spaces + 1); + } + if (word_str != "") { + vocab[word_str] = i++; + } + } + return vocab; +} + +WordPiece WordPiece::GetWordPieceFromFile( + const std::string& file, + const std::string& unk_token, + size_t max_input_chars_per_word, + const std::string& continuing_subword_prefix) { + auto vocab = GetVocabFromFile(file); + return WordPiece( + vocab, unk_token, max_input_chars_per_word, continuing_subword_prefix); +} + +void to_json(nlohmann::json& j, const WordPiece& model) { + j = { + {"type", "WordPiece"}, + {"vocab", model.vocab_}, + {"unk_token", model.unk_token_}, + {"max_input_chars_per_word", model.max_input_chars_per_word_}, + {"continuing_subword_prefix", model.continuing_subword_prefix_}, + }; +} + 
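+// Illustrative JSON round-trip (a sketch only; assumes a "vocab.txt" file
+// exists and contains the default "[UNK]" token):
+//   WordPiece model(WordPiece::GetVocabFromFile("vocab.txt"));
+//   nlohmann::json j = model;   // serialized by to_json above
+//   WordPiece restored;
+//   from_json(j, restored);     // deserialized by from_json below
+// Note that from_json only repopulates the serialized fields; derived members
+// such as vocab_reversed_ and unk_token_id_ are not rebuilt here.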
+void from_json(const nlohmann::json& j, WordPiece& model) { + j["vocab"].get_to(model.vocab_); + j["unk_token"].get_to(model.unk_token_); + j["max_input_chars_per_word"].get_to(model.max_input_chars_per_word_); + j["continuing_subword_prefix"].get_to(model.continuing_subword_prefix_); +} + + +WordPieceConfig::WordPieceConfig() + : unk_token_("[UNK]"), + max_input_chars_per_word_(100), + continuing_subword_prefix_("##") {} + + +void WordPieceFactory::SetFiles(const std::string& files) { + config_.files_ = files; +} + +void WordPieceFactory::SetUNKToken(const std::string& unk_token) { + config_.unk_token_ = unk_token; +} + +void WordPieceFactory::SetMaxInputCharsPerWord( + size_t max_input_chars_per_word) { + config_.max_input_chars_per_word_ = max_input_chars_per_word; +} + +void WordPieceFactory::SetContinuingSubwordPrefix( + const std::string& continuing_subword_prefix) { + config_.continuing_subword_prefix_ = continuing_subword_prefix; +} + +WordPiece WordPieceFactory::CreateWordPieceModel() { + std::ifstream fin(config_.files_); + if (fin) { + GetVocabFromFiles(config_.files_); + } else { + VLOG(0) << "File " << config_.files_ + << " doesn't exist or can't be accessed."; + config_.vocab_ = core::Vocab(); + } + return WordPiece{config_.vocab_, + config_.unk_token_, + config_.max_input_chars_per_word_, + config_.continuing_subword_prefix_}; +} + +void WordPieceFactory::GetVocabFromFiles(const std::string& files) { + std::ifstream fin(files); + config_.vocab_.clear(); + int i = 0; + constexpr int MAX_BUFFER_SIZE = 256; + char word[MAX_BUFFER_SIZE]; + while (fin.getline(word, MAX_BUFFER_SIZE)) { + std::string word_str = word; + auto leading_spaces = word_str.find_first_not_of(WHITESPACE); + if (leading_spaces != std::string::npos) { + leading_spaces = (std::min)(leading_spaces, word_str.length() - 1); + word_str = word_str.substr(leading_spaces); + } + auto trailing_spaces = word_str.find_last_not_of(WHITESPACE); + if (trailing_spaces != std::string::npos) { + word_str = word_str.substr(0, trailing_spaces + 1); + } + if (word_str != "") { + config_.vocab_[word_str] = i++; + } + } +} + +} // namespace model +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/models/wordpiece.h b/fast_tokenizer/fast_tokenizer/models/wordpiece.h new file mode 100644 index 0000000000000000000000000000000000000000..956485522f25b277c665a05001f626c2bc192e3c --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/models/wordpiece.h @@ -0,0 +1,88 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/models/model.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace models { + +struct FASTTOKENIZER_DECL WordPiece : public Model { + WordPiece(); + WordPiece(const core::Vocab& vocab, + const std::string& unk_token = "[UNK]", + size_t max_input_chars_per_word = 100, + const std::string& continuing_subword_prefix = "##", + bool handle_chinese_chars = true); + // Move version + WordPiece(core::Vocab&& vocab, + std::string&& unk_token, + size_t max_input_chars_per_word, + std::string&& continuing_subword_prefix, + bool handle_chinese_chars); + virtual std::vector Tokenize( + const std::string& sequence) override; + virtual bool TokenToId(const std::string& token, uint32_t* id) const override; + virtual bool IdToToken(uint32_t id, std::string* token) const override; + virtual core::Vocab GetVocab() const override; + virtual size_t GetVocabSize() const override; + // Return the saved voacb full path + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override; + static core::Vocab GetVocabFromFile(const std::string& file); + static WordPiece GetWordPieceFromFile( + const std::string& file, + const std::string& unk_token = "[UNK]", + size_t max_input_chars_per_word = 100, + const std::string& continuing_subword_prefix = "##"); + +protected: + core::Vocab vocab_; + core::VocabReversed vocab_reversed_; + std::string unk_token_; + uint32_t unk_token_id_; + size_t max_input_chars_per_word_; + std::string continuing_subword_prefix_; + bool handle_chinese_chars_; + friend void to_json(nlohmann::json& j, const WordPiece& model); + friend void from_json(const nlohmann::json& j, WordPiece& model); +}; + +struct WordPieceConfig { + WordPieceConfig(); + std::string files_; + core::Vocab vocab_; + std::string unk_token_; + size_t max_input_chars_per_word_; + std::string continuing_subword_prefix_; +}; + + +struct WordPieceFactory { + WordPieceConfig config_; + void SetFiles(const std::string& files); + void SetUNKToken(const std::string& unk_token); + void SetMaxInputCharsPerWord(size_t max_input_chars_per_word); + void SetContinuingSubwordPrefix(const std::string& continuing_subword_prefix); + WordPiece CreateWordPieceModel(); + void GetVocabFromFiles(const std::string& files); +}; + +} // namespace models +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/normalizers/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..b44e4bbcc455f894953c592ce6e43b591dbc7eda --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/CMakeLists.txt @@ -0,0 +1,5 @@ +cc_library(normalizers + SRCS normalizer.cc unicode.cc + utils.cc strip.cc replace.cc bert.cc + precompiled.cc + DEPS re2 json sentencepiece_normalizer icuuc icudata) diff --git a/fast_tokenizer/fast_tokenizer/normalizers/bert.cc b/fast_tokenizer/fast_tokenizer/normalizers/bert.cc new file mode 100644 index 0000000000000000000000000000000000000000..f4745c449cd80f26156b459d427d4a60fd028869 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/bert.cc @@ -0,0 +1,120 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/normalizers/bert.h" + +#include +#include +#include + +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/utils.h" +#include "fast_tokenizer/utils/utils.h" +#include "glog/logging.h" +#include "unicode/uchar.h" +#include "unicode/unistr.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { +BertNormalizer::BertNormalizer(bool clean_text, + bool handle_chinese_chars, + bool strip_accents, + bool lowercase) + : clean_text_(clean_text), + handle_chinese_chars_(handle_chinese_chars), + strip_accents_(strip_accents), + lowercase_(lowercase) {} + +static bool IsControl(int ch) { + if (ch == '\t' || ch == '\n' || ch == '\r') return false; + // It means (general category "C"). + return !u_isprint(ch); +} + +void BertNormalizer::DoCleanText(NormalizedString* input) const { + (*input) + .FilterChar([](char32_t ch) -> bool { + return !(ch == 0 || ch == 0xfffd || IsControl(ch)); + }) + .MapChar([](char32_t ch) -> char32_t { + if (utils::IsWhiteSpace(ch)) { + return ' '; + } + return ch; + }); +} + +void BertNormalizer::DoHandleChineseChars(NormalizedString* input) const { + std::wstring_convert, char32_t> conv; + std::u32string u32input = conv.from_bytes(input->GetStr()); + std::u32string u32output; + std::vector changes; + u32output.reserve(u32input.length() * 3); + changes.reserve(u32input.length() * 3); + for (int i = 0; i < u32input.length(); ++i) { + if (utils::IsChineseChar(u32input[i])) { + u32output.push_back(' '); + u32output.push_back(u32input[i]); + u32output.push_back(' '); + changes.push_back(0); + changes.push_back(1); + changes.push_back(1); + } else { + u32output.push_back(u32input[i]); + changes.push_back(0); + } + } + OffsetMapping new_normalized_offset{u32output, changes}; + input->UpdateNormalized(new_normalized_offset, 0); +} +void BertNormalizer::operator()(NormalizedString* input) const { + if (clean_text_) { + DoCleanText(input); + } + if (handle_chinese_chars_) { + DoHandleChineseChars(input); + } + if (strip_accents_) { + StripAccentsNormalizer()(input); + } + if (lowercase_) { + input->Lowercase(); + } +} + +void to_json(nlohmann::json& j, const BertNormalizer& bert_normalizer) { + j = { + {"type", "BertNormalizer"}, + {"clean_text", bert_normalizer.clean_text_}, + {"handle_chinese_chars", bert_normalizer.handle_chinese_chars_}, + {"strip_accents", bert_normalizer.strip_accents_}, + {"lowercase", bert_normalizer.lowercase_}, + }; +} + +void from_json(const nlohmann::json& j, BertNormalizer& bert_normalizer) { + j.at("clean_text").get_to(bert_normalizer.clean_text_); + j.at("handle_chinese_chars").get_to(bert_normalizer.handle_chinese_chars_); + j.at("lowercase").get_to(bert_normalizer.lowercase_); + if (!j.at("strip_accents").is_null()) { + j.at("strip_accents").get_to(bert_normalizer.strip_accents_); + } else { + bert_normalizer.strip_accents_ = false; + } +} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/bert.h b/fast_tokenizer/fast_tokenizer/normalizers/bert.h new file mode 100644 index 
0000000000000000000000000000000000000000..4312bdefb01e0cce421027906321eaa906c3ccb4 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/bert.h @@ -0,0 +1,47 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include "nlohmann/json.hpp" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { +struct FASTTOKENIZER_DECL BertNormalizer : public Normalizer { + BertNormalizer(bool clean_text = true, + bool handle_chinese_chars = true, + bool strip_accents = true, + bool lowercase = true); + virtual void operator()(NormalizedString* input) const override; + BertNormalizer(const BertNormalizer&) = default; + BertNormalizer(BertNormalizer&&) = default; + +private: + bool clean_text_; + bool handle_chinese_chars_; + bool strip_accents_; + bool lowercase_; + void DoCleanText(NormalizedString* input) const; + void DoHandleChineseChars(NormalizedString* input) const; + friend void to_json(nlohmann::json& j, const BertNormalizer& bert_normalizer); + friend void from_json(const nlohmann::json& j, + BertNormalizer& bert_normalizer); +}; +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/normalizer.cc b/fast_tokenizer/fast_tokenizer/normalizers/normalizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..bc63e0845063f6260a4043baee8ee9f9f9faf71a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/normalizer.cc @@ -0,0 +1,656 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include +#include + +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utf8.h" + +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "re2/re2.h" +#include "unicode/edits.h" +#include "unicode/errorcode.h" +#include "unicode/normalizer2.h" +#include "unicode/uchar.h" +#include "unicode/utypes.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +NormalizedString::NormalizedString(const std::string& original) + : original_(original), normalized_(original), original_shift_(0) { + // calculate alignments + std::wstring_convert, char32_t> conv; + std::u32string u32normalized = conv.from_bytes(normalized_); + for (int i = 0; i < u32normalized.length(); ++i) { + auto new_normalized_char_len = utils::GetUTF8CharLen(u32normalized[i]); + uint32_t start = 0; + uint32_t end = 0; + if (i != 0) { + start = alignments_.back().second; + } + end = start + new_normalized_char_len; + for (int j = 0; j < new_normalized_char_len; ++j) { + alignments_.push_back({start, end}); + } + } +} + +NormalizedString::NormalizedString(NormalizedString&& other) + : original_(std::move(other.original_)), + normalized_(std::move(other.normalized_)), + alignments_(std::move(other.alignments_)), + original_shift_(other.original_shift_) {} + +NormalizedString& NormalizedString::operator=(NormalizedString&& other) { + original_ = std::move(other.original_); + normalized_ = std::move(other.normalized_); + alignments_ = std::move(other.alignments_); + original_shift_ = other.original_shift_; + return *this; +} + +const std::string& NormalizedString::GetStr() const { return normalized_; } + +const std::string& NormalizedString::GetOrignalStr() const { return original_; } + +uint32_t NormalizedString::GetLen() const { return normalized_.length(); } + +uint32_t NormalizedString::GetOriginalLen() const { return original_.length(); } + +core::Offset NormalizedString::GetOrginalOffset() const { + return {original_shift_, GetOriginalLen() + original_shift_}; +} + +bool NormalizedString::IsEmpty() const { return normalized_.empty(); } + +bool NormalizedString::IsOriginalEmpty() const { return original_.empty(); } + +void NormalizedString::UpdateNormalized(const OffsetMapping& new_normalized, + uint32_t initial_offset) { + UpdateNormalizedRange(new_normalized, initial_offset, {0, GetLen()}, true); +} + +void NormalizedString::UpdateNormalizedRange( + const OffsetMapping& new_normalized, + uint32_t initial_offset, + const core::Range& range, + bool origin_range) { + auto n_range = range; + if (origin_range) { + ConvertOffsets(&n_range, origin_range); + } + // Retrieve the original characters that are being replaced. This let us + // compute the change in byte sizes along the way. 
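+  // Convention for new_normalized.changes (one entry per normalized char):
+  //    0 : the char replaces exactly one char of the old normalized text;
+  //   >0 : the char is an insertion and consumes no old char;
+  //   <0 : the char additionally consumes that many extra old chars (removal).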
+ std::wstring_convert, char32_t> conv; + n_range.first = (std::min)(n_range.first, + static_cast(normalized_.length() - 1)); + std::u32string u32replaced_normalized = conv.from_bytes( + normalized_.substr(n_range.first, n_range.second - n_range.first)); + uint32_t initial_removed = 0; + // calculate initial_removed + for (int i = 0; i < initial_offset; ++i) { + size_t chwidth = utils::GetUTF8CharLen(u32replaced_normalized[i]); + initial_removed += chwidth; + } + + uint32_t offset = initial_removed + n_range.first; + std::vector alignments; + alignments.reserve(n_range.second - n_range.first); + + int replaced_normalized_idx = initial_removed; + // Calculate the new alignments + for (int i = 0; i < new_normalized.u32normalized.length(); ++i) { + auto idx = offset; + core::Range align; + int curr_changes = new_normalized.changes[i]; + if (curr_changes > 0) { + // Insert a char + if (idx < 1) { + align = {0, 0}; + } else { + align = alignments_[idx - 1]; + } + } else { + align = alignments_[idx]; + } + char32_t new_normalized_char = new_normalized.u32normalized[i]; + auto new_normalized_char_len = utils::GetUTF8CharLen(new_normalized_char); + char32_t replaced_char = -1; + if (curr_changes <= 0) { + replaced_char = u32replaced_normalized[replaced_normalized_idx++]; + } + uint32_t replaced_char_size = + (replaced_char == -1) ? 0 : utils::GetUTF8CharLen(replaced_char); + + uint32_t total_bytes_to_remove = 0; + if (curr_changes < 0) { + for (int j = 0; j < -curr_changes; ++j) { + replaced_char = u32replaced_normalized[replaced_normalized_idx++]; + total_bytes_to_remove += utils::GetUTF8CharLen(replaced_char); + } + } + offset += replaced_char_size + total_bytes_to_remove; + alignments.insert(alignments.end(), new_normalized_char_len, align); + } + // Replace the old alignments in n_range + if (n_range.second - n_range.first >= alignments.size()) { + std::memcpy(alignments_.data() + n_range.first, + alignments.data(), + alignments.size() * sizeof(core::Range)); + alignments_.erase(alignments_.begin() + n_range.first + alignments.size(), + alignments_.begin() + n_range.second); + } else { + std::vector new_alignments; + auto third_len = 0; + if (alignments_.size() > n_range.second) { + third_len = alignments_.size() - n_range.second; + } + new_alignments.resize(n_range.first + alignments.size() + third_len); + if (n_range.first > 0) { + std::copy_n(alignments_.begin(), n_range.first, new_alignments.begin()); + } + std::copy_n(alignments.begin(), + alignments.size(), + new_alignments.begin() + n_range.first); + if (third_len > 0) { + std::copy_n(alignments_.begin() + n_range.second, + third_len, + new_alignments.begin() + n_range.first + alignments.size()); + } + alignments_ = std::move(new_alignments); + } + // Unicode -> UTF8 + uint32_t normalized_utf8_size = 0; + for (auto& ch : new_normalized.u32normalized) { + normalized_utf8_size += utils::GetUTF8CharLen(ch); + } + std::vector utf8_str(normalized_utf8_size + 1); + utils::GetUTF8Str(new_normalized.u32normalized.data(), + utf8_str.data(), + new_normalized.u32normalized.length()); + + // Update normalized_ + auto normalized_iter = normalized_.begin(); + normalized_.replace(normalized_iter + n_range.first, + normalized_iter + n_range.second, + utf8_str.data(), + normalized_utf8_size); +} + +bool NormalizedString::ConvertOffsets(core::Range* range, + bool origin_range) const { + auto len_original = GetOriginalLen(); + auto len_normalized = GetLen(); + if (range->first == range->second) { + return true; + } + if (range->first > range->second) { + 
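+    // Reject inverted ranges.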
return false; + } + if (origin_range && original_.empty() && + (range->first == 0 && range->second == 0)) { + range->second = len_normalized; + return true; + } + if (!origin_range && normalized_.empty() && + (range->first == 0 && range->second == 0)) { + range->second = len_original; + return true; + } + if (origin_range) { + int start = -1; + int end = -1; + for (int i = 0; i < alignments_.size(); ++i) { + if (range->second >= alignments_[i].second) { + if (start < 0 && range->first <= alignments_[i].first) { + if (alignments_[i].first != alignments_[i].second) { + start = i; + } + } + if (range->second >= alignments_[i].second) { + end = i + 1; + } + } + } + if (start > 0 && end < 0) { + *range = {start, start}; + } else if (start < 0 && end > 0) { + *range = {end, end}; + } else if (start > 0 && end > 0) { + *range = {start, end}; + } else { + return false; + } + } else { + range->first = alignments_[range->first].first; + range->second = alignments_[range->second - 1].second; + } + return true; +} + +void NormalizedString::RunNormalization(const std::string& mode) { + icu::ErrorCode icu_error; + const icu::Normalizer2* normalizer = nullptr; + if (mode == "NFD") { + normalizer = icu::Normalizer2::getNFDInstance(icu_error); + } else if (mode == "NFKD") { + normalizer = icu::Normalizer2::getNFKDInstance(icu_error); + } else if (mode == "NFC") { + normalizer = icu::Normalizer2::getNFCInstance(icu_error); + } else if (mode == "NFKC") { + normalizer = icu::Normalizer2::getNFKCInstance(icu_error); + } + std::string normalized_result; + icu::Edits edits; + icu::StringByteSink byte_sink(&normalized_result); + normalizer->normalizeUTF8( + 0, + icu::StringPiece(normalized_.data(), normalized_.size()), + byte_sink, + &edits, + icu_error); + std::wstring_convert, char32_t> conv; + std::u32string u32new_normalized = conv.from_bytes(normalized_result); + // Set changes + std::vector changes; + changes.reserve(u32new_normalized.length()); + auto iter = edits.getFineIterator(); + int old_offset = 0; + int new_offset = 0; + // The edits record the byte level modification, so need to transform to char + // level + // using GetUnicodeLenFromUTF8 + while (iter.next(icu_error)) { + auto old_length = iter.oldLength(); + auto new_length = iter.newLength(); + auto new_unicode_len = utils::GetUnicodeLenFromUTF8( + normalized_result.data() + new_offset, new_length); + auto old_unicode_len = utils::GetUnicodeLenFromUTF8( + normalized_.data() + old_offset, old_length); + old_offset += old_length; + new_offset += new_length; + if (old_unicode_len == new_unicode_len) { + // Just replace the char + changes.insert(changes.end(), old_unicode_len, 0); + } else if (old_unicode_len < new_unicode_len) { + // Insert the char + changes.insert(changes.end(), old_unicode_len, 0); + changes.insert(changes.end(), new_unicode_len - old_unicode_len, 1); + } else /* old_length > new_length */ { + // Remove the char + if (new_unicode_len > 1) { + changes.insert(changes.end(), new_unicode_len - 1, 0); + } + changes.push_back(new_unicode_len - old_unicode_len); + } + } + OffsetMapping new_normalized_offset{u32new_normalized, changes}; + // Update normalized_ and alignments_ + UpdateNormalized(new_normalized_offset, 0); +} + +NormalizedString& NormalizedString::NFD() { + RunNormalization("NFD"); + return *this; +} + +NormalizedString& NormalizedString::NFKD() { + RunNormalization("NFKD"); + return *this; +} + +NormalizedString& NormalizedString::NFC() { + RunNormalization("NFC"); + return *this; +} + +NormalizedString& 
NormalizedString::NFKC() { + RunNormalization("NFKC"); + return *this; +} + +NormalizedString& NormalizedString::LStrip() { return LRStrip(true, false); } + +NormalizedString& NormalizedString::RStrip() { return LRStrip(false, true); } + +const std::string WHITESPACE = " \n\r\t\f\v"; + +NormalizedString& NormalizedString::LRStrip(bool left, bool right) { + uint32_t leading_spaces = 0; + uint32_t trailing_spaces = 0; + std::string new_normalized = normalized_; + if (left) { + leading_spaces = new_normalized.find_first_not_of(WHITESPACE); + if (leading_spaces != std::string::npos) { + leading_spaces = (std::min)( + leading_spaces, static_cast(new_normalized.length() - 1)); + new_normalized = new_normalized.substr(leading_spaces); + } + } + if (right) { + trailing_spaces = new_normalized.find_last_not_of(WHITESPACE); + if (trailing_spaces != std::string::npos) { + new_normalized = new_normalized.substr(0, trailing_spaces + 1); + } + } + + std::wstring_convert, char32_t> conv; + std::u32string u32new_normalized = conv.from_bytes(new_normalized); + // Set changes + std::vector changes(u32new_normalized.length(), 0); + changes.back() = -trailing_spaces; + + OffsetMapping new_normalized_offset{u32new_normalized, changes}; + // Update normalized_ and alignments_ + UpdateNormalized(new_normalized_offset, leading_spaces); + return *this; +} + +NormalizedString& NormalizedString::FilterChar( + std::function keep_char_fn) { + std::wstring_convert, char32_t> conv; + std::u32string u32new_normalized; + u32new_normalized.reserve(normalized_.length()); + uint32_t removed_start = 0; + uint32_t removed = 0; + std::vector changes; + changes.reserve(normalized_.length()); + bool has_init_ch = false; + uint32_t last_char; + uint32_t curr_char; + size_t utf8_len = 0; + while (utf8_len < normalized_.length()) { + auto chwidth = + utils::UTF8ToUInt32(normalized_.data() + utf8_len, &curr_char); + curr_char = utils::UTF8ToUnicode(curr_char); + if (keep_char_fn(curr_char)) { + if (has_init_ch) { + u32new_normalized.push_back(last_char); + changes.push_back(-removed); + } else { + has_init_ch = true; + removed_start = removed; + } + last_char = curr_char; + removed = 0; + } else { + removed += 1; + } + utf8_len += chwidth; + } + if (has_init_ch) { + u32new_normalized.push_back(last_char); + changes.push_back(-removed); + } + OffsetMapping new_normalized_offset{u32new_normalized, changes}; + // Update normalized_ and alignments_ + UpdateNormalized(new_normalized_offset, removed_start); + return *this; +} + +NormalizedString& NormalizedString::MapChar( + std::function map_char_fn) { + size_t utf8_len = 0; + std::u32string u32normalized; + uint32_t curr_char; + u32normalized.reserve(normalized_.length()); + while (utf8_len < normalized_.length()) { + auto chwidth = + utils::UTF8ToUInt32(normalized_.data() + utf8_len, &curr_char); + curr_char = utils::UTF8ToUnicode(curr_char); + curr_char = map_char_fn(curr_char); + u32normalized.push_back(curr_char); + utf8_len += chwidth; + } + std::vector changes(u32normalized.size(), 0); + UpdateNormalized({u32normalized, changes}, 0); + return *this; +} + +NormalizedString& NormalizedString::Lowercase() { + std::wstring_convert, char32_t> conv; + std::u32string u32normalized = conv.from_bytes(normalized_); + // Can cover all single char covert cases + for (int i = 0; i < u32normalized.length(); ++i) { + u32normalized[i] = u_tolower(u32normalized[i]); + } + // No need to update normalized range + normalized_ = conv.to_bytes(u32normalized); + return *this; +} + +NormalizedString& 
NormalizedString::Replace(const re2::RE2& pattern, + const std::string& content) { + re2::StringPiece result; + size_t start = 0; + size_t end = normalized_.length(); + int64_t offset = 0; + + std::u32string u32content; + u32content.reserve(content.size()); + std::vector changes; + changes.reserve(content.size()); + + size_t content_utf8_len = 0; + while (content_utf8_len < content.length()) { + uint32_t content_char; + auto content_char_width = + utils::UTF8ToUInt32(content.data() + content_utf8_len, &content_char); + content_char = utils::UTF8ToUnicode(content_char); + u32content.push_back(content_char); + changes.push_back(1); + content_utf8_len += content_char_width; + } + size_t new_len = content.length(); + + OffsetMapping new_normalized{u32content, changes}; + + while (pattern.Match(normalized_, start, end, RE2::UNANCHORED, &result, 1)) { + size_t curr_start = result.data() - normalized_.data(); + size_t old_len = result.length(); + size_t curr_end = curr_start + old_len; + size_t removed_chars = + utils::GetUnicodeLenFromUTF8(normalized_.data() + curr_start, old_len); + UpdateNormalizedRange( + new_normalized, removed_chars, {curr_start, curr_end}, false); + offset = new_len - old_len; + // update start + start = curr_end; + if (offset >= 0) { + start = curr_end + offset; + } else { + size_t uoffset = -offset; + start = (curr_end >= uoffset) ? curr_end - uoffset : 0; + } + end = normalized_.length(); + } + return *this; +} + +NormalizedString& NormalizedString::Prepend(const std::string& content) { + // Get the first unicode char of normalized + uint32_t first_char_of_normalized; + auto first_char_width = + utils::UTF8ToUInt32(normalized_.data(), &first_char_of_normalized); + first_char_of_normalized = utils::UTF8ToUnicode(first_char_of_normalized); + + std::u32string u32content; + u32content.reserve(content.length()); + std::vector changes; + changes.reserve(content.length()); + uint32_t utf8_len = 0; + while (utf8_len < content.length()) { + uint32_t content_char; + auto content_char_width = + utils::UTF8ToUInt32(content.data() + utf8_len, &content_char); + content_char = utils::UTF8ToUnicode(content_char); + u32content.push_back(content_char); + if (utf8_len == 0) { + changes.push_back(0); + } else { + changes.push_back(1); + } + utf8_len += content_char_width; + } + u32content.push_back(first_char_of_normalized); + changes.push_back(1); + UpdateNormalizedRange({u32content, changes}, 0, {0, first_char_width}, false); + return *this; +} + +bool NormalizedString::ValidateRange(const core::Range& range, + bool origin_range) const { + if (origin_range) { + return utils::IsCharBoundary(original_.data() + range.first) && + utils::IsCharBoundary(original_.data() + range.second - 1); + } + return utils::IsCharBoundary(normalized_.data() + range.first) && + utils::IsCharBoundary(normalized_.data() + range.second - 1); +} + +bool NormalizedString::Slice(core::Range range, + NormalizedString* normalized, + bool origin_range) const { + if (ValidateRange(range, origin_range)) { + core::Range normalized_range = range; + core::Range original_range = range; + if (origin_range) { + ConvertOffsets(&normalized_range, true); + } else { + ConvertOffsets(&original_range, false); + } + uint32_t n_shift = original_range.first; + + original_range.first = + (std::min)(original_range.first, + static_cast(this->original_.length() - 1)); + normalized->original_ = this->original_.substr( + original_range.first, original_range.second - original_range.first); + + normalized_range.first = + 
(std::min)(normalized_range.first, + static_cast(this->normalized_.length() - 1)); + normalized->normalized_ = this->normalized_.substr( + normalized_range.first, + normalized_range.second - normalized_range.first); + normalized->alignments_.reserve(normalized_range.second - + normalized_range.first); + for (uint32_t i = normalized_range.first; i < normalized_range.second; + ++i) { + normalized->alignments_.emplace_back( + this->alignments_[i].first - n_shift, + this->alignments_[i].second - n_shift); + } + + normalized->original_shift_ = this->original_shift_ + original_range.first; + return true; + } + return false; +} + +uint32_t NormalizedString::GetMatch( + const std::string& normalized, + const re2::RE2& pattern, + std::vector>* matches, + bool invert) const { + size_t start = 0; + size_t end = normalized.length(); + // Construct the matches whose mode is REMOVED. + re2::StringPiece result; + uint32_t reserved_num = 0; + while (pattern.Match(normalized, start, end, RE2::UNANCHORED, &result, 1)) { + size_t curr_start = result.data() - normalized.data(); + size_t curr_end = curr_start + result.length(); + if (start != curr_start) { + matches->push_back({{start, curr_start}, invert}); + if (!invert) { + ++reserved_num; + } + } + matches->push_back({{curr_start, curr_end}, !invert}); + if (invert) { + ++reserved_num; + } + start = curr_end; + } + if (start < end) { + matches->push_back({{start, end}, invert}); + if (!invert) { + ++reserved_num; + } + } + return reserved_num; +} + +uint32_t NormalizedString::GetMatch( + const std::string& normalized, + const std::function& pattern_func, + std::vector>* matches, + bool invert) const { + size_t utf8_len = 0; + size_t start = 0; + size_t curr_start = 0; + size_t curr_end = 0; + matches->reserve(normalized.length()); + uint32_t ch; + uint32_t reserved_num = 0; + while (utf8_len < normalized.length()) { + auto chwidth = utils::UTF8ToUInt32(normalized.data() + utf8_len, &ch); + ch = utils::UTF8ToUnicode(ch); + if (pattern_func(ch)) { + curr_start = utf8_len; + curr_end = curr_start + chwidth; + if (curr_start != start) { + matches->emplace_back(core::Range{start, curr_start}, invert); + if (!invert) { + ++reserved_num; + } + } + matches->emplace_back(core::Range{curr_start, curr_end}, !invert); + if (invert) { + ++reserved_num; + } + start = curr_end; + } + utf8_len += chwidth; + } + if (start < normalized.length()) { + matches->emplace_back(core::Range{start, normalized.length()}, invert); + if (!invert) { + ++reserved_num; + } + } + return reserved_num; +} + +template void NormalizedString::Split(const re2::RE2& pattern, + core::SplitMode mode, + std::vector* normalizes, + bool invert) const; +template void NormalizedString::Split( + const std::function& pattern_func, + core::SplitMode mode, + std::vector* normalizes, + bool invert) const; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/normalizer.h b/fast_tokenizer/fast_tokenizer/normalizers/normalizer.h new file mode 100644 index 0000000000000000000000000000000000000000..9a8b74e687cb0119ecc255a0f4029953f7e9e355 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/normalizer.h @@ -0,0 +1,205 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/utils/utils.h" + +namespace re2 { +class RE2; +} // namespace re2 + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +struct FASTTOKENIZER_DECL OffsetMapping { + std::u32string u32normalized; + std::vector changes; // Same size as normalized +}; + +class FASTTOKENIZER_DECL NormalizedString { +public: + NormalizedString(const std::string& original); + NormalizedString(NormalizedString&& other); + NormalizedString(const NormalizedString& other) = default; + NormalizedString& operator=(const NormalizedString& other) = default; + NormalizedString& operator=(NormalizedString&& other); + const std::string& GetStr() const; + const std::string& GetOrignalStr() const; + uint32_t GetLen() const; + uint32_t GetOriginalLen() const; + core::Offset GetOrginalOffset() const; + bool IsEmpty() const; + bool IsOriginalEmpty() const; + + // Unicode Normalization + NormalizedString& NFD(); + NormalizedString& NFKD(); + NormalizedString& NFC(); + NormalizedString& NFKC(); + + // Strip + NormalizedString& LRStrip(bool left, bool right); + NormalizedString& LStrip(); + NormalizedString& RStrip(); + + NormalizedString& FilterChar(std::function keep_char_fn); + NormalizedString& MapChar(std::function map_char_fn); + NormalizedString& Lowercase(); + NormalizedString& Replace(const re2::RE2& pattern, + const std::string& content); + NormalizedString& Prepend(const std::string& content); + bool Slice(core::Range range, + NormalizedString* normalized, + bool origin_range) const; + + void UpdateNormalized(const OffsetMapping& new_normalized, + uint32_t initial_offset); + template + void Split(const PatternType& + pattern, /* re2::RE2 or std::function */ + core::SplitMode mode, + std::vector* normalizes, + bool invert = false) const { + // Vec<(Offsets, should_remove)> + std::vector> matches; + auto normalizes_size = GetMatch(normalized_, pattern, &matches, invert); + // Convert matches + switch (mode) { + case core::SplitMode::REMOVED: + break; + case core::SplitMode::ISOLATED: { + for (auto& match : matches) { + match.second = false; + } + normalizes_size = matches.size(); + break; + } + case core::SplitMode::MERGED_WITH_PREVIOUS: { + bool previous_match = false; + std::vector> new_matches; + for (const auto& match : matches) { + auto offset = match.first; + bool curr_match = match.second; + if (curr_match && !previous_match) { + if (new_matches.size() > 0) { + new_matches.back().first.second = offset.second; + } else { + new_matches.push_back({offset, false}); + } + } else { + new_matches.push_back({offset, false}); + } + previous_match = curr_match; + } + matches = std::move(new_matches); + normalizes_size = matches.size(); + break; + } + case core::SplitMode::MERGED_WITH_NEXT: { + bool previous_match = false; + std::vector> new_matches; + for (auto it = matches.crbegin(); it != matches.crend(); ++it) { + const auto& match = *it; + auto offset = match.first; + bool curr_match = match.second; + if (curr_match && !previous_match) { + if (new_matches.size() > 0) { + 
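+              // Iterating in reverse, so back() is the segment that follows
+              // this match in forward order; extend its start so the match
+              // is merged into it.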
new_matches.back().first.first = offset.first; + } else { + new_matches.push_back({offset, false}); + } + } else { + new_matches.push_back({offset, false}); + } + previous_match = curr_match; + } + matches = std::move(new_matches); + normalizes_size = matches.size(); + std::reverse(matches.begin(), matches.end()); + break; + } + case core::SplitMode::CONTIGUOUS: { + bool previous_match = false; + std::vector> new_matches; + for (const auto& match : matches) { + auto offset = match.first; + bool curr_match = match.second; + if (curr_match == previous_match) { + if (new_matches.size() > 0) { + new_matches.back().first.second = offset.second; + } else { + new_matches.push_back({offset, false}); + } + } else { + new_matches.push_back({offset, false}); + } + previous_match = curr_match; + } + matches = std::move(new_matches); + normalizes_size = matches.size(); + break; + } + default: + break; + } + normalizes->resize(normalizes_size); + int idx = 0; + for (const auto& match : matches) { + if (!match.second) { + Slice(match.first, &(normalizes->at(idx++)), false); + } + } + } + bool ConvertOffsets(core::Range* range, bool origin_range = true) const; + NormalizedString() = default; + +private: + std::string original_; + std::string normalized_; + // In order to keep track of the offset mapping from + // original_ to normalized_ + std::vector alignments_; + uint32_t original_shift_; + + void UpdateNormalizedRange(const OffsetMapping& new_normalized, + uint32_t initial_offset, + const core::Range& range, + bool origin_range = true); + void RunNormalization(const std::string& mode); + bool ValidateRange(const core::Range& range, bool origin_range) const; + + uint32_t GetMatch(const std::string& normalized, + const re2::RE2& pattern, + std::vector>* matches, + bool invert = false) const; + + uint32_t GetMatch(const std::string& normalized, + const std::function& pattern_func, + std::vector>* matches, + bool invert = false) const; +}; + +struct FASTTOKENIZER_DECL Normalizer { + virtual void operator()(NormalizedString* mut_str) const = 0; +}; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/normalizers.h b/fast_tokenizer/fast_tokenizer/normalizers/normalizers.h new file mode 100644 index 0000000000000000000000000000000000000000..6f29e0c2eb25de0c185aeb3605ad1b7ca8663da6 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/normalizers.h @@ -0,0 +1,23 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/normalizers/precompiled.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "fast_tokenizer/normalizers/utils.h" diff --git a/fast_tokenizer/fast_tokenizer/normalizers/precompiled.cc b/fast_tokenizer/fast_tokenizer/normalizers/precompiled.cc new file mode 100644 index 0000000000000000000000000000000000000000..7d5189d30f26e0d5617fc79abc059a56626262cd --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/precompiled.cc @@ -0,0 +1,87 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/normalizers/precompiled.h" +#include +#include + +#include "glog/logging.h" +#include "fast_tokenizer/utils/unique_ptr.h" + + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +PrecompiledNormalizer::PrecompiledNormalizer( + const std::string& precompiled_charsmap) { + SetPrecompiledCharsMap(precompiled_charsmap); +} + +PrecompiledNormalizer::PrecompiledNormalizer( + const PrecompiledNormalizer& precompiled_normalizer) + : sentencepiece_normalizer_(new utils::Normalizer( + *precompiled_normalizer.sentencepiece_normalizer_.get())) {} + +static std::string GetByteFromString(const std::string& str) { + std::ostringstream oss; + oss << std::hex << std::setfill('0'); + for (int i = 0; i < str.length(); ++i) { + oss << "\\x" << std::setw(2) << (static_cast(str[i]) & 0xFF); + } + return oss.str(); +} + +void PrecompiledNormalizer::SetPrecompiledCharsMap( + const std::string& precompiled_charsmap) { + sentencepiece_normalizer_ = + utils::make_unique(precompiled_charsmap); +} + +void PrecompiledNormalizer::operator()(NormalizedString* mut_str) const { + std::string normalized; + std::vector norm_to_orig; + std::u32string u32content; + if (sentencepiece_normalizer_->Normalize(mut_str->GetStr().data(), + mut_str->GetStr().length(), + &normalized, + &norm_to_orig, + &u32content)) { + mut_str->UpdateNormalized({u32content, norm_to_orig}, 0); + } +} + +void to_json(nlohmann::json& j, + const PrecompiledNormalizer& precompiled_normalizer) { + const auto& precompiled_str = + precompiled_normalizer.sentencepiece_normalizer_ + ->GetPrecompiledCharsmap(); + std::vector bytes(precompiled_str.begin(), precompiled_str.end()); + j = {{"type", "PrecompiledNormalizer"}, {"precompiled_charsmap", bytes}}; +} + +void from_json(const nlohmann::json& j, + PrecompiledNormalizer& precompiled_normalizer) { + std::vector bytes; + j.at("precompiled_charsmap").get_to(bytes); + std::ostringstream precompiled_charsmap_oss; + for (int i = 0; i < bytes.size(); ++i) { + precompiled_charsmap_oss << static_cast(bytes[i]); + } + precompiled_normalizer.SetPrecompiledCharsMap(precompiled_charsmap_oss.str()); +} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff 
--git a/fast_tokenizer/fast_tokenizer/normalizers/precompiled.h b/fast_tokenizer/fast_tokenizer/normalizers/precompiled.h new file mode 100644 index 0000000000000000000000000000000000000000..3641952030f6706c33227292cb648d264c9bfc32 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/precompiled.h @@ -0,0 +1,44 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include "nlohmann/json.hpp" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/sentencepiece_normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { + +namespace normalizers { +struct FASTTOKENIZER_DECL PrecompiledNormalizer : public Normalizer { + PrecompiledNormalizer() = default; + explicit PrecompiledNormalizer(const std::string& precompiled_charsmap); + PrecompiledNormalizer(const PrecompiledNormalizer& precompiled_normalizer); + + virtual void operator()(NormalizedString* mut_str) const override; + void SetPrecompiledCharsMap(const std::string& precompiled_charsmap); + +private: + std::unique_ptr sentencepiece_normalizer_; + friend void to_json(nlohmann::json& j, + const PrecompiledNormalizer& precompiled_normalizer); + friend void from_json(const nlohmann::json& j, + PrecompiledNormalizer& precompiled_normalizer); +}; +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/replace.cc b/fast_tokenizer/fast_tokenizer/normalizers/replace.cc new file mode 100644 index 0000000000000000000000000000000000000000..1d7f81d09a5fd61abf1371f59681e3b9913d6ec9 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/replace.cc @@ -0,0 +1,51 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/utils/unique_ptr.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +ReplaceNormalizer::ReplaceNormalizer(const std::string& pattern, + const std::string& content) + : pattern_(new re2::RE2(pattern)), content_(content) {} + +ReplaceNormalizer::ReplaceNormalizer( + const ReplaceNormalizer& replace_normalizer) + : pattern_(new re2::RE2(replace_normalizer.pattern_->pattern())), + content_(replace_normalizer.content_) {} + +void ReplaceNormalizer::operator()(NormalizedString* input) const { + input->Replace(*pattern_, content_); +} + +void to_json(nlohmann::json& j, const ReplaceNormalizer& replace_normalizer) { + j = { + {"type", "ReplaceNormalizer"}, + {"pattern", replace_normalizer.pattern_->pattern()}, + {"content", replace_normalizer.content_}, + }; +} + +void from_json(const nlohmann::json& j, ReplaceNormalizer& replace_normalizer) { + replace_normalizer.pattern_ = + utils::make_unique(j.at("pattern").get()); + j.at("content").get_to(replace_normalizer.content_); +} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/replace.h b/fast_tokenizer/fast_tokenizer/normalizers/replace.h new file mode 100644 index 0000000000000000000000000000000000000000..76141f7669c8ad8377153d5e37d5816282fac07a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/replace.h @@ -0,0 +1,45 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include "nlohmann/json.hpp" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "re2/re2.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +struct FASTTOKENIZER_DECL ReplaceNormalizer : public Normalizer { + ReplaceNormalizer() = default; + ReplaceNormalizer(const std::string& pattern, const std::string& content); + ReplaceNormalizer(const ReplaceNormalizer& replace_normalizer); + virtual void operator()(NormalizedString* mut_str) const override; + friend void to_json(nlohmann::json& j, + const ReplaceNormalizer& replace_normalizer); + friend void from_json(const nlohmann::json& j, + ReplaceNormalizer& replace_normalizer); + +private: + std::unique_ptr pattern_; + std::string content_; +}; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/strip.cc b/fast_tokenizer/fast_tokenizer/normalizers/strip.cc new file mode 100644 index 0000000000000000000000000000000000000000..c14c23f27164b98c5fda60c73858da39f5cfc145 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/strip.cc @@ -0,0 +1,68 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/normalizers/strip.h" +#include "unicode/translit.h" +#include "unicode/unistr.h" +#include "unicode/utypes.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { +StripNormalizer::StripNormalizer(bool left /* = true*/, bool right /* = true*/) + : left_(left), right_(right) {} + +void StripNormalizer::operator()(NormalizedString* input) const { + if (left_) { + input->LStrip(); + } + if (right_) { + input->RStrip(); + } +} + +void to_json(nlohmann::json& j, const StripNormalizer& strip_normalizer) { + j = { + {"type", "StripNormalizer"}, + {"left", strip_normalizer.left_}, + {"right", strip_normalizer.right_}, + }; +} + +void from_json(const nlohmann::json& j, StripNormalizer& strip_normalizer) { + j.at("left").get_to(strip_normalizer.left_); + j.at("right").get_to(strip_normalizer.right_); +} + +void StripAccentsNormalizer::operator()(NormalizedString* input) const { + input->NFD(); + input->FilterChar([](char32_t ch) -> bool { + // equals to `unicodedata.category(char) == 'Mn'` + return u_charType(ch) != U_NON_SPACING_MARK; + }); +} + +void to_json(nlohmann::json& j, + const StripAccentsNormalizer& strip_normalizer) { + j = { + {"type", "StripAccentsNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, + StripAccentsNormalizer& strip_normalizer) {} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/strip.h b/fast_tokenizer/fast_tokenizer/normalizers/strip.h new file mode 100644 index 0000000000000000000000000000000000000000..e8af13ac36a6110867adfb559fa37ffdcc9ae58b --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/strip.h @@ -0,0 +1,51 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include + +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +struct FASTTOKENIZER_DECL StripNormalizer : public Normalizer { + StripNormalizer(bool left = true, bool right = true); + virtual void operator()(NormalizedString* input) const override; + StripNormalizer(StripNormalizer&&) = default; + StripNormalizer(const StripNormalizer&) = default; + +private: + bool left_; + bool right_; + friend void to_json(nlohmann::json& j, + const StripNormalizer& strip_normalizer); + friend void from_json(const nlohmann::json& j, + StripNormalizer& strip_normalizer); +}; + +struct FASTTOKENIZER_DECL StripAccentsNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, + const StripAccentsNormalizer& strip_normalizer); + friend void from_json(const nlohmann::json& j, + StripAccentsNormalizer& strip_normalizer); +}; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/unicode.cc b/fast_tokenizer/fast_tokenizer/normalizers/unicode.cc new file mode 100644 index 0000000000000000000000000000000000000000..5c16c0d00eae9d6b8d678328564dbaa34edd5ee5 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/unicode.cc @@ -0,0 +1,103 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include + +#include "fast_tokenizer/normalizers/unicode.h" +#include "unicode/edits.h" +#include "unicode/errorcode.h" +#include "unicode/normalizer2.h" +#include "unicode/utypes.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +void NFCNormalizer::operator()(NormalizedString* input) const { input->NFC(); } + +void to_json(nlohmann::json& j, const NFCNormalizer& normalizer) { + j = { + {"type", "NFCNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, NFCNormalizer& normalizer) {} + +void NFKCNormalizer::operator()(NormalizedString* input) const { + input->NFKC(); +} + +void to_json(nlohmann::json& j, const NFKCNormalizer& normalizer) { + j = { + {"type", "NFKCNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, NFKCNormalizer& normalizer) {} + +void NFDNormalizer::operator()(NormalizedString* input) const { input->NFD(); } + +void to_json(nlohmann::json& j, const NFDNormalizer& normalizer) { + j = { + {"type", "NFDNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, NFDNormalizer& normalizer) {} + +void NFKDNormalizer::operator()(NormalizedString* input) const { + input->NFKD(); +} + +void to_json(nlohmann::json& j, const NFKDNormalizer& normalizer) { + j = { + {"type", "NFKDNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, NFKDNormalizer& normalizer) {} + +void NmtNormalizer::operator()(NormalizedString* input) const { + input->FilterChar([](char32_t ch) -> bool { + if ((ch >= 0x0001 && ch <= 0x0008) || (ch == 0x000B) || + (ch >= 0x000E && ch <= 0x001F) || (ch == 0x007F) || (ch == 0x008F) || + (ch == 0x009F)) { + return false; + } + return true; + }); + input->MapChar([](char32_t ch) -> char32_t { + if ((ch == 0x0009) || (ch == 0x000A) || (ch == 0x000C) || (ch == 0x000D) || + (ch == 0x1680) || (ch >= 0x200B && ch <= 0x200F) || (ch == 0x2028) || + (ch == 0x2029) || (ch == 0x2581) || (ch == 0xFEFF) || (ch == 0xFFFD)) { + return ' '; + } + return ch; + }); +} + +void to_json(nlohmann::json& j, const NmtNormalizer& normalizer) { + j = { + {"type", "NmtNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, NmtNormalizer& normalizer) {} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/unicode.h b/fast_tokenizer/fast_tokenizer/normalizers/unicode.h new file mode 100644 index 0000000000000000000000000000000000000000..6bf9c4b8de42b1117163e466292af2607612aced --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/unicode.h @@ -0,0 +1,57 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +struct FASTTOKENIZER_DECL NFCNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const NFCNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, NFCNormalizer& normalizer); +}; + +struct FASTTOKENIZER_DECL NFDNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const NFDNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, NFDNormalizer& normalizer); +}; + +struct FASTTOKENIZER_DECL NFKCNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const NFKCNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, NFKCNormalizer& normalizer); +}; + +struct FASTTOKENIZER_DECL NFKDNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const NFKDNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, NFKDNormalizer& normalizer); +}; + +struct FASTTOKENIZER_DECL NmtNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const NmtNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, NmtNormalizer& normalizer); +}; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/utils.cc b/fast_tokenizer/fast_tokenizer/normalizers/utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..15f6875b87490d80c33f1f6cef411a8270798333 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/utils.cc @@ -0,0 +1,158 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/normalizers/utils.h" +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/precompiled.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "unicode/unistr.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +void SequenceNormalizer::AppendNormalizer(Normalizer* normalizer) { + std::shared_ptr normalizer_ptr; + if (typeid(*normalizer) == typeid(SequenceNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(LowercaseNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(StripNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(StripAccentsNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(NFCNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(NFDNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(NFKCNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(NFKDNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(NmtNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(ReplaceNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(BertNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } else if (typeid(*normalizer) == typeid(PrecompiledNormalizer)) { + auto cast_normalizer = dynamic_cast(normalizer); + normalizer_ptr = std::make_shared(*cast_normalizer); + } + normalizer_ptrs_.push_back(std::move(normalizer_ptr)); +} + +SequenceNormalizer::SequenceNormalizer( + const std::vector& normalizers) { + for (auto& normalizer : normalizers) { + AppendNormalizer(normalizer); + } +} + +void SequenceNormalizer::operator()(NormalizedString* input) const { + std::string result; + for (auto& normalizer : normalizer_ptrs_) { + normalizer->operator()(input); + } +} +void LowercaseNormalizer::operator()(NormalizedString* input) const { + input->Lowercase(); +} + +void to_json(nlohmann::json& j, const SequenceNormalizer& normalizer) { + nlohmann::json jlist; + for (auto& ptr : normalizer.normalizer_ptrs_) { + nlohmann::json jitem; + if (typeid(*ptr) == typeid(SequenceNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(LowercaseNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(StripNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(StripAccentsNormalizer)) { + 
jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(NFCNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(NFDNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(NFKCNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(NFKDNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(NmtNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(ReplaceNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(BertNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(PrecompiledNormalizer)) { + jitem = *dynamic_cast(ptr.get()); + } + jlist.push_back(jitem); + } + j = {{"type", "SequenceNormalizer"}, {"normalizers", jlist}}; +} + +void from_json(const nlohmann::json& j, + SequenceNormalizer& sequence_normalizer) { +#define TRY_APPEND_NORMALIZER(NORMALIZER_TYPE) \ + if (normalizer_type == #NORMALIZER_TYPE) { \ + NORMALIZER_TYPE normalizer; \ + normalizer_json.get_to(normalizer); \ + sequence_normalizer.AppendNormalizer(&normalizer); \ + } + + for (auto& normalizer_json : j.at("normalizers")) { + std::string normalizer_type; + normalizer_json.at("type").get_to(normalizer_type); + TRY_APPEND_NORMALIZER(BertNormalizer); + TRY_APPEND_NORMALIZER(PrecompiledNormalizer); + TRY_APPEND_NORMALIZER(ReplaceNormalizer); + TRY_APPEND_NORMALIZER(StripAccentsNormalizer); + TRY_APPEND_NORMALIZER(StripNormalizer); + TRY_APPEND_NORMALIZER(NFCNormalizer); + TRY_APPEND_NORMALIZER(NFKCNormalizer); + TRY_APPEND_NORMALIZER(NFDNormalizer); + TRY_APPEND_NORMALIZER(NFKDNormalizer); + TRY_APPEND_NORMALIZER(NmtNormalizer); + TRY_APPEND_NORMALIZER(LowercaseNormalizer); + TRY_APPEND_NORMALIZER(SequenceNormalizer); + } +#undef TRY_APPEND_NORMALIZER +} + +void to_json(nlohmann::json& j, const LowercaseNormalizer& normalizer) { + j = { + {"type", "LowercaseNormalizer"}, + }; +} + +void from_json(const nlohmann::json& j, LowercaseNormalizer& normalizer) {} + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/normalizers/utils.h b/fast_tokenizer/fast_tokenizer/normalizers/utils.h new file mode 100644 index 0000000000000000000000000000000000000000..94fd2c91cdf06f72f6a6e29937161ae441ecf409 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/normalizers/utils.h @@ -0,0 +1,50 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include +#include +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace normalizers { + +struct FASTTOKENIZER_DECL SequenceNormalizer : public Normalizer { + SequenceNormalizer() = default; + SequenceNormalizer(const SequenceNormalizer&) = default; + SequenceNormalizer(const std::vector& normalizers); + virtual void operator()(NormalizedString* input) const override; + void AppendNormalizer(Normalizer* normalizer); + +private: + std::vector> normalizer_ptrs_; + friend void to_json(nlohmann::json& j, const SequenceNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, + SequenceNormalizer& normalizer); +}; + +struct FASTTOKENIZER_DECL LowercaseNormalizer : public Normalizer { + virtual void operator()(NormalizedString* input) const override; + friend void to_json(nlohmann::json& j, const LowercaseNormalizer& normalizer); + friend void from_json(const nlohmann::json& j, + LowercaseNormalizer& normalizer); +}; + +} // namespace normalizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/postprocessors/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..9d4aad766e77052d84d931bd35ad4a6ac490a606 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/CMakeLists.txt @@ -0,0 +1 @@ +cc_library(postprocessors SRCS bert.cc postprocessor.cc template.cc roberta.cc byte_level.cc DEPS core json) diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/bert.cc b/fast_tokenizer/fast_tokenizer/postprocessors/bert.cc new file mode 100644 index 0000000000000000000000000000000000000000..d40067c9d837adf6f303eb517eeadb43960b9b7d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/bert.cc @@ -0,0 +1,208 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include + +#include "fast_tokenizer/core/encoding.h" +#include "glog/logging.h" +#include "fast_tokenizer/postprocessors/bert.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +BertPostProcessor::BertPostProcessor() + : sep_({"[SEP]", 102}), cls_({"[CLS]", 101}) {} +BertPostProcessor::BertPostProcessor( + const std::pair& sep, + const std::pair& cls) + : sep_(sep), cls_(cls) {} +size_t BertPostProcessor::AddedTokensNum(bool is_pair) const { + if (is_pair) { + // [CLS] A [SEP] B [SEP] + return 3; + } + // [CLS] A [SEP] + return 2; +} + +void BertPostProcessor::operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const { + if (!add_special_tokens) { + DefaultProcess(encoding, pair_encoding, result_encoding); + return; + } +// Construct the sequence as: [CLS] A [SEP] +#define CREATE_PROCESSED_ENCODING_SEQ( \ + encoding_ptr, attr, name, head_value, back_value) \ + auto encoding_##name = encoding_ptr->Get##attr(); \ + decltype(encoding_##name) name(encoding_##name.size() + 2); \ + std::copy(encoding_##name.begin(), encoding_##name.end(), name.begin() + 1); \ + name.front() = head_value; \ + name.back() = back_value + // ids + CREATE_PROCESSED_ENCODING_SEQ(encoding, Ids, ids, cls_.second, sep_.second); + // type_ids + CREATE_PROCESSED_ENCODING_SEQ(encoding, TypeIds, type_ids, 0, 0); + // tokens + CREATE_PROCESSED_ENCODING_SEQ( + encoding, Tokens, tokens, cls_.first, sep_.first); + // word_idx + CREATE_PROCESSED_ENCODING_SEQ(encoding, WordsIdx, word_idx, -1, -1); + // offsets + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_ENCODING_SEQ( + encoding, Offsets, offsets, empty_offsets, empty_offsets); + // special_tokens_mask + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + // attention_mask + std::vector attention_mask(ids.size(), 1); + // sequence_ranges + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + // overflowing + auto& overflowings = encoding->GetMutableOverflowing(); + for (auto& overflow_encoding : overflowings) { + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Ids, ids, cls_.second, sep_.second); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), TypeIds, type_ids, 0, 0); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Tokens, tokens, cls_.first, sep_.first); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), WordsIdx, word_idx, -1, -1); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Offsets, offsets, empty_offsets, empty_offsets); + + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + + std::vector attention_mask(ids.size(), 1); + + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + + overflow_encoding = std::move( + core::Encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::vector(), // No overflowing + std::move(sequence_ranges))); + } + + core::Encoding new_encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::move(overflowings), + std::move(sequence_ranges)); + if (pair_encoding != nullptr) { +#define CREATE_PROCESSED_PARI_ENCODING_SEQ( \ + encoding_ptr, attr, name, 
back_value) \ + auto encoding_##name = encoding_ptr->Get##attr(); \ + decltype(encoding_##name) name(encoding_##name.size() + 1); \ + std::copy(encoding_##name.begin(), encoding_##name.end(), name.begin()); \ + name.back() = back_value + + CREATE_PROCESSED_PARI_ENCODING_SEQ(pair_encoding, Ids, ids, sep_.second); + CREATE_PROCESSED_PARI_ENCODING_SEQ(pair_encoding, TypeIds, type_ids, 1); + CREATE_PROCESSED_PARI_ENCODING_SEQ( + pair_encoding, Tokens, tokens, sep_.first); + CREATE_PROCESSED_PARI_ENCODING_SEQ(pair_encoding, WordsIdx, word_idx, -1); + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_PARI_ENCODING_SEQ( + pair_encoding, Offsets, offsets, empty_offsets); + + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.back() = 1; + + std::vector attention_mask(ids.size(), 1); + std::unordered_map sequence_ranges; + sequence_ranges[1] = {0, ids.size() - 1}; + // overflowing + auto& overflowings = pair_encoding->GetMutableOverflowing(); + for (auto& overflow_pair_encoding : overflowings) { + CREATE_PROCESSED_PARI_ENCODING_SEQ( + (&overflow_pair_encoding), Ids, ids, sep_.second); + CREATE_PROCESSED_PARI_ENCODING_SEQ( + (&overflow_pair_encoding), TypeIds, type_ids, 1); + CREATE_PROCESSED_PARI_ENCODING_SEQ( + (&overflow_pair_encoding), Tokens, tokens, sep_.first); + CREATE_PROCESSED_PARI_ENCODING_SEQ( + (&overflow_pair_encoding), WordsIdx, word_idx, -1); + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_PARI_ENCODING_SEQ( + (&overflow_pair_encoding), Offsets, offsets, empty_offsets); + + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.back() = 1; + + std::vector attention_mask(ids.size(), 1); + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + + overflow_pair_encoding = std::move( + core::Encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::vector(), // No overflowing + std::move(sequence_ranges))); + } + + core::Encoding new_pair_encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::move(overflowings), + std::move(sequence_ranges)); + new_encoding.MergeWith(new_pair_encoding, false); + } +#undef CREATE_PROCESSED_ENCODING_SEQ +#undef CREATE_PROCESSED_PARI_ENCODING_SEQ + *result_encoding = std::move(new_encoding); +} + +void to_json(nlohmann::json& j, const BertPostProcessor& bert_postprocessor) { + j = { + {"type", "BertPostProcessor"}, + {"sep", bert_postprocessor.sep_}, + {"cls", bert_postprocessor.cls_}, + }; +} + +void from_json(const nlohmann::json& j, BertPostProcessor& bert_postprocessor) { + j["cls"].get_to(bert_postprocessor.cls_); + j["sep"].get_to(bert_postprocessor.sep_); +} + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/bert.h b/fast_tokenizer/fast_tokenizer/postprocessors/bert.h new file mode 100644 index 0000000000000000000000000000000000000000..cc8c77f9785be134f39919fe398d278a11f2f2f8 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/bert.h @@ -0,0 +1,43 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include "fast_tokenizer/postprocessors/postprocessor.h"
+#include "fast_tokenizer/utils/utils.h"
+#include "nlohmann/json.hpp"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace postprocessors {
+
+struct FASTTOKENIZER_DECL BertPostProcessor : public PostProcessor {
+  BertPostProcessor(const std::pair<std::string, uint32_t>& sep,
+                    const std::pair<std::string, uint32_t>& cls);
+  BertPostProcessor();
+  virtual size_t AddedTokensNum(bool is_pair) const override;
+  virtual void operator()(core::Encoding* encoding,
+                          core::Encoding* pair_encoding,
+                          bool add_special_tokens,
+                          core::Encoding* result_encoding) const override;
+  std::pair<std::string, uint32_t> sep_;
+  std::pair<std::string, uint32_t> cls_;
+  friend void to_json(nlohmann::json& j,
+                      const BertPostProcessor& bert_postprocessor);
+  friend void from_json(const nlohmann::json& j,
+                        BertPostProcessor& bert_postprocessor);
+};
+}  // namespace postprocessors
+}  // namespace fast_tokenizer
+}  // namespace paddlenlp
diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.cc b/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.cc
new file mode 100644
index 0000000000000000000000000000000000000000..51f0b45ecec43148b464c213d96a3570f85efeaf
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.cc
@@ -0,0 +1,74 @@
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+ +#include "fast_tokenizer/postprocessors/byte_level.h" +#include "fast_tokenizer/pretokenizers/byte_level.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +ByteLevelPostProcessor::ByteLevelPostProcessor(bool add_prefix_space, + bool trim_offsets, + bool use_regex) + : add_prefix_space_(add_prefix_space), + trim_offsets_(trim_offsets), + use_regex_(use_regex) {} + + +size_t ByteLevelPostProcessor::AddedTokensNum(bool is_pair) const { return 0; } + +void ByteLevelPostProcessor::operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const { + if (trim_offsets_) { + pretokenizers::ProcessOffsets(encoding, add_special_tokens); + for (auto& overflowing : encoding->GetMutableOverflowing()) { + pretokenizers::ProcessOffsets(&overflowing, add_special_tokens); + } + if (pair_encoding != nullptr) { + pretokenizers::ProcessOffsets(pair_encoding, add_special_tokens); + for (auto& overflowing : pair_encoding->GetMutableOverflowing()) { + pretokenizers::ProcessOffsets(&overflowing, add_special_tokens); + } + } + } + + encoding->SetSequenceIds(0); + if (pair_encoding != nullptr) { + pair_encoding->SetSequenceIds(1); + } +} + +void to_json(nlohmann::json& j, + const ByteLevelPostProcessor& byte_level_postprocessor) { + j = { + {"type", "ByteLevelPostProcessor"}, + {"add_prefix_space", byte_level_postprocessor.add_prefix_space_}, + {"trim_offsets", byte_level_postprocessor.trim_offsets_}, + {"use_regex", byte_level_postprocessor.use_regex_}, + }; +} + +void from_json(const nlohmann::json& j, + ByteLevelPostProcessor& byte_level_postprocessor) { + j["add_prefix_space"].get_to(byte_level_postprocessor.add_prefix_space_); + j["trim_offsets"].get_to(byte_level_postprocessor.trim_offsets_); + j["use_regex"].get_to(byte_level_postprocessor.use_regex_); +} + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.h b/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.h new file mode 100644 index 0000000000000000000000000000000000000000..75cdd995fec5e5c80ad9d82128bb5b8e5f0e1449 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/byte_level.h @@ -0,0 +1,46 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#pragma once + +#include "fast_tokenizer/postprocessors/postprocessor.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +struct FASTTOKENIZER_DECL ByteLevelPostProcessor : public PostProcessor { + ByteLevelPostProcessor(bool add_prefix_space = true, + bool trim_offsets = true, + bool use_regex = true); + virtual size_t AddedTokensNum(bool is_pair) const override; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override; + + friend void to_json(nlohmann::json& j, + const ByteLevelPostProcessor& byte_level_postprocessor); + friend void from_json(const nlohmann::json& j, + ByteLevelPostProcessor& byte_level_postprocessor); + bool add_prefix_space_; + bool trim_offsets_; + bool use_regex_; +}; + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.cc b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.cc new file mode 100644 index 0000000000000000000000000000000000000000..abe994beb7d8f38c257e790f55c70babf46010ee --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.cc @@ -0,0 +1,37 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/postprocessors/postprocessor.h" +#include "fast_tokenizer/core/encoding.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +void PostProcessor::DefaultProcess(core::Encoding* encoding, + core::Encoding* pair_encoding, + core::Encoding* result_encoding) { + if (pair_encoding == nullptr) { + *result_encoding = *encoding; + } else { + encoding->SetSequenceIds(0); + pair_encoding->SetSequenceIds(1); + encoding->MergeWith(*pair_encoding, false); + *result_encoding = *encoding; + } +} + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.h b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.h new file mode 100644 index 0000000000000000000000000000000000000000..34fda8377a2aadab775b61c6d38e5230467b45eb --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessor.h @@ -0,0 +1,41 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { + +namespace core { +class Encoding; +} // namespace core + +namespace postprocessors { + +struct FASTTOKENIZER_DECL PostProcessor { + virtual size_t AddedTokensNum(bool is_pair) const = 0; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const = 0; + static void DefaultProcess(core::Encoding* encoding, + core::Encoding* pair_encoding, + core::Encoding* result_encoding); +}; +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/postprocessors.h b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessors.h new file mode 100644 index 0000000000000000000000000000000000000000..9427f2478b9a487f34419e6fa394ffe84bcde0f2 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/postprocessors.h @@ -0,0 +1,21 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/postprocessors/bert.h" +#include "fast_tokenizer/postprocessors/byte_level.h" +#include "fast_tokenizer/postprocessors/postprocessor.h" +#include "fast_tokenizer/postprocessors/roberta.h" +#include "fast_tokenizer/postprocessors/template.h" diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/roberta.cc b/fast_tokenizer/fast_tokenizer/postprocessors/roberta.cc new file mode 100644 index 0000000000000000000000000000000000000000..4a468847c2220f927c89eb7117656302a3a2091e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/roberta.cc @@ -0,0 +1,244 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include + +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/postprocessors/roberta.h" +#include "fast_tokenizer/pretokenizers/byte_level.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +RobertaPostProcessor::RobertaPostProcessor( + const std::pair& sep, + const std::pair& cls, + bool trim_offsets, + bool add_prefix_space) + : sep_(sep), + cls_(cls), + trim_offsets_(trim_offsets), + add_prefix_space_(add_prefix_space) {} + +size_t RobertaPostProcessor::AddedTokensNum(bool is_pair) const { + if (is_pair) { + return 4; + } + return 2; +} + +void RobertaPostProcessor::operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const { + if (trim_offsets_) { + pretokenizers::ProcessOffsets(encoding, add_special_tokens); + for (auto& overflowing : encoding->GetMutableOverflowing()) { + pretokenizers::ProcessOffsets(&overflowing, add_special_tokens); + } + if (pair_encoding != nullptr) { + pretokenizers::ProcessOffsets(pair_encoding, add_special_tokens); + for (auto& overflowing : pair_encoding->GetMutableOverflowing()) { + pretokenizers::ProcessOffsets(&overflowing, add_special_tokens); + } + } + } + encoding->SetTypeIds(std::vector(encoding->GetLen(), 0)); + if (pair_encoding != nullptr) { + pair_encoding->SetTypeIds( + std::vector(pair_encoding->GetLen(), 0)); + } + if (!add_special_tokens) { + DefaultProcess(encoding, pair_encoding, result_encoding); + return; + } +// Construct the sequence as: [CLS] A [SEP] +#define CREATE_PROCESSED_ENCODING_SEQ( \ + encoding_ptr, attr, name, head_value, back_value) \ + auto encoding_##name = encoding_ptr->Get##attr(); \ + decltype(encoding_##name) name(encoding_##name.size() + 2); \ + std::copy(encoding_##name.begin(), encoding_##name.end(), name.begin() + 1); \ + name.front() = head_value; \ + name.back() = back_value + // ids + CREATE_PROCESSED_ENCODING_SEQ(encoding, Ids, ids, cls_.second, sep_.second); + // type_ids. Because there is no nsp task in the roberta, all type_ids are set + // to 0. 
+ std::vector type_ids(encoding->GetTypeIds().size() + 2, 0); + // tokens + CREATE_PROCESSED_ENCODING_SEQ( + encoding, Tokens, tokens, cls_.first, sep_.first); + // word_idx + CREATE_PROCESSED_ENCODING_SEQ(encoding, WordsIdx, word_idx, -1, -1); + // offsets + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_ENCODING_SEQ( + encoding, Offsets, offsets, empty_offsets, empty_offsets); + // special_tokens_mask + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + // attention_mask + std::vector attention_mask(ids.size(), 1); + // sequence_ranges + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + // overflowing + auto& overflowings = encoding->GetMutableOverflowing(); + for (auto& overflow_encoding : overflowings) { + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Ids, ids, cls_.second, sep_.second); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), TypeIds, type_ids, 0, 0); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Tokens, tokens, cls_.first, sep_.first); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), WordsIdx, word_idx, -1, -1); + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_encoding), Offsets, offsets, empty_offsets, empty_offsets); + + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + + std::vector attention_mask(ids.size(), 1); + + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + + overflow_encoding = std::move( + core::Encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::vector(), // No overflowing + std::move(sequence_ranges))); + } + + core::Encoding new_encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::move(overflowings), + std::move(sequence_ranges)); + + // Construct the sequence as: [CLS] A [SEP] [SEP] B [SEP] + if (pair_encoding != nullptr) { + // ids + CREATE_PROCESSED_ENCODING_SEQ( + pair_encoding, Ids, ids, sep_.second, sep_.second); + // type_ids. Because there is no nsp task in the roberta, all type_ids are + // set to 0. 
+ std::vector type_ids(pair_encoding->GetTypeIds().size() + 2, 0); + // tokens + CREATE_PROCESSED_ENCODING_SEQ( + pair_encoding, Tokens, tokens, sep_.first, sep_.first); + // word_idx + CREATE_PROCESSED_ENCODING_SEQ(pair_encoding, WordsIdx, word_idx, -1, -1); + // offsets + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_ENCODING_SEQ( + pair_encoding, Offsets, offsets, empty_offsets, empty_offsets); + // special_tokens_mask + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + // attention_mask + std::vector attention_mask(ids.size(), 1); + // sequence_ranges + std::unordered_map sequence_ranges; + sequence_ranges[1] = {1, ids.size() - 1}; + // overflowing + auto& overflowings = pair_encoding->GetMutableOverflowing(); + for (auto& overflow_pair_encoding : overflowings) { + // ids + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_pair_encoding), Ids, ids, sep_.second, sep_.second); + // type_ids + std::vector type_ids( + overflow_pair_encoding.GetTypeIds().size() + 2, 0); + // tokens + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_pair_encoding), Tokens, tokens, sep_.first, sep_.first); + // word_idx + CREATE_PROCESSED_ENCODING_SEQ( + (&overflow_pair_encoding), WordsIdx, word_idx, -1, -1); + // offsets + core::Offset empty_offsets = {0, 0}; + CREATE_PROCESSED_ENCODING_SEQ((&overflow_pair_encoding), + Offsets, + offsets, + empty_offsets, + empty_offsets); + // special_tokens_mask + std::vector special_tokens_mask(ids.size(), 0); + special_tokens_mask.front() = special_tokens_mask.back() = 1; + // attention_mask + std::vector attention_mask(ids.size(), 1); + // sequence_ranges + std::unordered_map sequence_ranges; + sequence_ranges[0] = {1, ids.size() - 1}; + overflow_pair_encoding = std::move( + core::Encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::vector(), // No overflowing + std::move(sequence_ranges))); + } + core::Encoding new_pair_encoding(std::move(ids), + std::move(type_ids), + std::move(tokens), + std::move(word_idx), + std::move(offsets), + std::move(special_tokens_mask), + std::move(attention_mask), + std::move(overflowings), + std::move(sequence_ranges)); + new_encoding.MergeWith(new_pair_encoding, false); + } +#undef CREATE_PROCESSED_ENCODING_SEQ + *result_encoding = std::move(new_encoding); +} + +void to_json(nlohmann::json& j, + const RobertaPostProcessor& roberta_postprocessor) { + j = { + {"type", "RobertaPostProcessor"}, + {"sep", roberta_postprocessor.sep_}, + {"cls", roberta_postprocessor.cls_}, + {"trim_offsets", roberta_postprocessor.trim_offsets_}, + {"add_prefix_space", roberta_postprocessor.add_prefix_space_}, + }; +} + +void from_json(const nlohmann::json& j, + RobertaPostProcessor& roberta_postprocessor) { + j["cls"].get_to(roberta_postprocessor.cls_); + j["sep"].get_to(roberta_postprocessor.sep_); + j["trim_offsets"].get_to(roberta_postprocessor.trim_offsets_); + j["add_prefix_space"].get_to(roberta_postprocessor.add_prefix_space_); +} + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/roberta.h b/fast_tokenizer/fast_tokenizer/postprocessors/roberta.h new file mode 100644 index 0000000000000000000000000000000000000000..0601882f1df1cc20872c63161e475bd7505ec086 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/roberta.h @@ -0,0 
+1,47 @@
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include "fast_tokenizer/postprocessors/postprocessor.h"
+#include "fast_tokenizer/utils/utils.h"
+#include "nlohmann/json.hpp"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace postprocessors {
+
+struct FASTTOKENIZER_DECL RobertaPostProcessor : public PostProcessor {
+  RobertaPostProcessor(
+      const std::pair<std::string, uint32_t>& sep = {"</s>", 2},
+      const std::pair<std::string, uint32_t>& cls = {"<s>", 0},
+      bool trim_offsets = true,
+      bool add_prefix_space = true);
+  virtual size_t AddedTokensNum(bool is_pair) const override;
+  virtual void operator()(core::Encoding* encoding,
+                          core::Encoding* pair_encoding,
+                          bool add_special_tokens,
+                          core::Encoding* result_encoding) const override;
+  std::pair<std::string, uint32_t> sep_;
+  std::pair<std::string, uint32_t> cls_;
+  bool trim_offsets_;
+  bool add_prefix_space_;
+  friend void to_json(nlohmann::json& j,
+                      const RobertaPostProcessor& roberta_postprocessor);
+  friend void from_json(const nlohmann::json& j,
+                        RobertaPostProcessor& roberta_postprocessor);
+};
+}  // namespace postprocessors
+}  // namespace fast_tokenizer
+}  // namespace paddlenlp
\ No newline at end of file
diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/template.cc b/fast_tokenizer/fast_tokenizer/postprocessors/template.cc
new file mode 100644
index 0000000000000000000000000000000000000000..28a3eb92587e8eeb1924eacdfb674ac9e60a7328
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/postprocessors/template.cc
@@ -0,0 +1,454 @@
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include <string>
+#include <vector>
+
+#include "fast_tokenizer/core/encoding.h"
+#include "fast_tokenizer/postprocessors/template.h"
+#include "glog/logging.h"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace postprocessors {
+
+void ParseIdFromString(const std::string& template_id_string,
+                       TemplatePiece* template_piece) {
+  if (template_id_string.find_first_of("$") == 0) {
+    *template_piece = TemplateSequence();
+    auto& seq = paddlenlp::get<TemplateSequence>(*template_piece);
+    std::string rest =
+        template_id_string.substr(template_id_string.find_first_not_of("$"));
+    if (rest == "" || rest == "A" || rest == "a") {
+      seq = TemplateSequence{SequenceType::SEQ_A, 0};
+    } else if (rest == "B" || rest == "b") {
+      seq = TemplateSequence{SequenceType::SEQ_B, 0};
+    } else {
+      std::string::size_type sz;
+      uint32_t type_id = std::stoul(rest, &sz);
+      if (sz == rest.length()) {
+        seq = TemplateSequence{SequenceType::SEQ_A, type_id};
+      } else {
+        throw std::runtime_error(
+            "ParseIdFromString error! The format of template piece id should "
+            "be "
+            "$A, $a, $B, $b or ${type_id}");
+      }
+    }
+  } else {
+    *template_piece = TemplateSpecialToken();
+    paddlenlp::get<TemplateSpecialToken>(*template_piece) = {
+        template_id_string, 0};
+  }
+}
+
+void SetTypeId(uint32_t type_id, TemplatePiece* template_piece) {
+  if (paddlenlp::get_if<TemplateSequence>(template_piece) != nullptr) {
+    paddlenlp::get<TemplateSequence>(*template_piece).second = type_id;
+  } else {
+    paddlenlp::get<TemplateSpecialToken>(*template_piece).second = type_id;
+  }
+}
+
+void GetTemplatePieceFromString(const std::string& template_string,
+                                TemplatePiece* template_piece) {
+  auto spliter_idx = template_string.find_first_of(":");
+  if (spliter_idx == std::string::npos) {
+    ParseIdFromString(template_string, template_piece);
+  } else {
+    std::string template_id_string = template_string.substr(0, spliter_idx);
+    std::string template_type_id_string =
+        template_string.substr(spliter_idx + 1);
+    ParseIdFromString(template_id_string, template_piece);
+
+    std::string::size_type sz;
+    uint32_t type_id = std::stoul(template_type_id_string, &sz);
+    if (sz == template_type_id_string.length()) {
+      SetTypeId(type_id, template_piece);
+    } else {
+      throw std::runtime_error(
+          "ParseTypeIdFromString error! 
The type id should be unsigned " + "integer."); + } + } +} + +void to_json(nlohmann::json& j, const TemplatePiece& template_piece) { + if (paddlenlp::get_if(&template_piece) != nullptr) { + auto& template_sequence = paddlenlp::get(template_piece); + j = { + {"Sequence", + { + {"id", template_sequence.first}, + {"type_id", template_sequence.second}, + }}, + }; + } else { + auto& template_special_token = + paddlenlp::get(template_piece); + j = { + {"SpecialToken", + { + {"id", template_special_token.first}, + {"type_id", template_special_token.second}, + }}, + }; + } +} + +void from_json(const nlohmann::json& j, TemplatePiece& template_piece) { + if (j.find("Sequence") != j.end()) { + template_piece = + TemplateSequence(j["Sequence"]["id"], j["Sequence"]["type_id"]); + } else { + template_piece = TemplateSpecialToken(j["SpecialToken"]["id"], + j["SpecialToken"]["type_id"]); + } +} + +void to_json(nlohmann::json& j, const SpecialToken& special_token) { + j = { + {"id", special_token.id_}, + {"ids", special_token.ids_}, + {"tokens", special_token.tokens_}, + }; +} + +void from_json(const nlohmann::json& j, SpecialToken& special_token) { + j["id"].get_to(special_token.id_); + j["ids"].get_to(special_token.ids_); + j["tokens"].get_to(special_token.tokens_); +} + +size_t TemplatePostProcessor::CountAdded( + Template* template_, const SpecialTokensMap& special_tokens_map) { + size_t count = 0; + for (auto& piece : template_->pieces_) { + TemplateSpecialToken* special_token = + paddlenlp::get_if(&piece); + if (special_token != nullptr) { + auto token_iter = + special_tokens_map.tokens_map_.find(special_token->first); + if (token_iter != special_tokens_map.tokens_map_.end()) { + count += token_iter->second.ids_.size(); + } + } + } + return count; +} + +void to_json(nlohmann::json& j, const Template& template_) { + for (auto& piece : template_.pieces_) { + j.push_back(piece); + } +} + +void from_json(const nlohmann::json& j, Template& template_) { + template_.pieces_.resize(j.size()); + for (int i = 0; i < j.size(); ++i) { + j[i].get_to(template_.pieces_[i]); + } +} + +void to_json(nlohmann::json& j, const SpecialTokensMap& tokens_map) { + for (auto it = tokens_map.tokens_map_.begin(); + it != tokens_map.tokens_map_.end(); + ++it) { + j[it->first] = it->second; + } +} + +void from_json(const nlohmann::json& j, SpecialTokensMap& tokens_map) { + SpecialToken special_token; + for (auto it = j.begin(); it != j.end(); ++it) { + tokens_map.tokens_map_[it.key()] = it.value().get_to(special_token); + } +} + +size_t TemplatePostProcessor::DefaultAdded(bool is_single) { + Template* target = nullptr; + if (is_single) { + target = &single_; + } else { + target = &pair_; + } + return CountAdded(target, special_tokens_map_); +} + +void TemplatePostProcessor::UpdateAddedTokensNum() { + added_single_ = DefaultAdded(true); + added_pair_ = DefaultAdded(false); +} + +void TemplatePostProcessor::UpdateSinglePieces( + const std::string& template_str) { + single_.GetPiecesFromStr(template_str); + added_single_ = DefaultAdded(true); +} + +void TemplatePostProcessor::UpdateSinglePieces( + const std::vector& pieces) { + single_.GetPiecesFromVec(pieces); + added_single_ = DefaultAdded(true); +} + +void TemplatePostProcessor::UpdatePairPieces(const std::string& template_str) { + pair_.GetPiecesFromStr(template_str); + added_pair_ = DefaultAdded(false); +} + +void TemplatePostProcessor::UpdatePairPieces( + const std::vector& pieces) { + pair_.GetPiecesFromVec(pieces); + added_pair_ = DefaultAdded(false); +} + 
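// --- Editor's note (illustrative sketch, not part of the patch itself) ------
// The parsing helpers above accept template pieces of the form "$A"/"$B"
// (sequence placeholders), optionally suffixed with ":type_id", plus literal
// special-token strings. A minimal sketch of how such a post-processor might
// be configured, assuming BERT-style special tokens and whitespace-separated
// template pieces (both are assumptions made for illustration only):
//
//   TemplatePostProcessor postprocessor;
//   // Single sequence: [CLS] A [SEP]; every piece defaults to type_id 0.
//   postprocessor.UpdateSinglePieces("[CLS] $A [SEP]");
//   // Sequence pair: the second sequence and its trailing [SEP] use type_id 1.
//   postprocessor.UpdatePairPieces("[CLS] $A:0 [SEP]:0 $B:1 [SEP]:1");
//
// The special-token strings would still need to be registered via
// SetTokensMap() (defined below) so their ids can be looked up when the
// template is applied to an encoding.
// -----------------------------------------------------------------------------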
+TemplatePostProcessor::TemplatePostProcessor() { UpdateAddedTokensNum(); } + +TemplatePostProcessor::TemplatePostProcessor( + const Template& single, + const Template& pair, + const std::vector& special_tokens_map) + : single_(single), pair_(pair), special_tokens_map_(special_tokens_map) { + UpdateAddedTokensNum(); +} + +size_t TemplatePostProcessor::AddedTokensNum(bool is_pair) const { + if (is_pair) { + return added_pair_; + } + return added_single_; +} + +void TemplatePostProcessor::SetTokensMap( + const std::vector& special_tokens) { + special_tokens_map_.SetTokensMap(special_tokens); + UpdateAddedTokensNum(); +} + +void TemplatePostProcessor::ApplyTemplate( + const Template& pieces, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const { + size_t new_size = 0; + for (auto&& piece : pieces.pieces_) { + if (paddlenlp::get_if(&piece) != nullptr) { + auto seq_type = paddlenlp::get(piece).first; + if (seq_type == SequenceType::SEQ_A) { + new_size += encoding->GetLen(); + } else { + if (pair_encoding == nullptr) { + throw std::runtime_error( + "Template expected a pair sequence, but none provided"); + } + new_size += pair_encoding->GetLen(); + } + } else { + if (add_special_tokens) { + auto&& special_token = + paddlenlp::get(piece).first; + if (special_tokens_map_.tokens_map_.find(special_token) != + special_tokens_map_.tokens_map_.end()) { + new_size += + special_tokens_map_.tokens_map_.at(special_token).ids_.size(); + } + } + } + } + std::vector ids; + ids.reserve(new_size); + std::vector type_ids; + type_ids.reserve(new_size); + std::vector tokens; + tokens.reserve(new_size); + std::vector words_idx; + words_idx.reserve(new_size); + std::vector offsets; + offsets.reserve(new_size); + std::vector special_tokens_mask; + special_tokens_mask.reserve(new_size); + std::vector attention_mask; + attention_mask.reserve(new_size); + std::unordered_map sequence_ranges; + std::vector result_overflowings; + auto& overflowings = encoding->GetMutableOverflowing(); + + core::Encoding result_overflowing_encoding; + for (auto& overflow_encoding : overflowings) { + core::Encoding encoding_copy = overflow_encoding; + core::Encoding pair_encoding_copy; + if (pair_encoding != nullptr) { + pair_encoding_copy = *pair_encoding; + ApplyTemplate(pieces, + &encoding_copy, + &pair_encoding_copy, + add_special_tokens, + &result_overflowing_encoding); + result_overflowings.push_back(result_overflowing_encoding); + for (auto& pair_overflow_encoding : + pair_encoding->GetMutableOverflowing()) { + core::Encoding pair_encoding_copy = pair_overflow_encoding; + ApplyTemplate(pieces, + &encoding_copy, + &pair_encoding_copy, + add_special_tokens, + &result_overflowing_encoding); + result_overflowings.push_back(result_overflowing_encoding); + } + } else { + ApplyTemplate(pieces, + &encoding_copy, + pair_encoding, + add_special_tokens, + &result_overflowing_encoding); + result_overflowings.push_back(result_overflowing_encoding); + } + } + if (pair_encoding != nullptr) { + for (auto& pair_overflow_encoding : + pair_encoding->GetMutableOverflowing()) { + core::Encoding encoding_copy = *encoding; + core::Encoding pair_encoding_copy = pair_overflow_encoding; + ApplyTemplate(pieces, + &encoding_copy, + &pair_encoding_copy, + add_special_tokens, + &result_overflowing_encoding); + result_overflowings.push_back(result_overflowing_encoding); + } + } + VLOG(6) << "Template pieces num: " << pieces.pieces_.size(); + for (auto& piece : pieces.pieces_) { + if 
(paddlenlp::get_if<TemplateSequence>(&piece) != nullptr) {
+      auto& template_sequence = paddlenlp::get<TemplateSequence>(piece);
+      if (template_sequence.first == SequenceType::SEQ_A) {
+        auto seq_start = ids.size();
+        auto seq_end = seq_start + encoding->GetLen();
+        sequence_ranges[0] = {seq_start, seq_end};
+        ids.insert(
+            ids.end(), encoding->GetIds().begin(), encoding->GetIds().end());
+        type_ids.insert(
+            type_ids.end(), encoding->GetLen(), template_sequence.second);
+        tokens.insert(tokens.end(),
+                      encoding->GetTokens().begin(),
+                      encoding->GetTokens().end());
+        words_idx.insert(words_idx.end(),
+                         encoding->GetWordsIdx().begin(),
+                         encoding->GetWordsIdx().end());
+        offsets.insert(offsets.end(),
+                       encoding->GetOffsets().begin(),
+                       encoding->GetOffsets().end());
+        special_tokens_mask.insert(special_tokens_mask.end(),
+                                   encoding->GetSpecialTokensMask().begin(),
+                                   encoding->GetSpecialTokensMask().end());
+        attention_mask.insert(attention_mask.end(),
+                              encoding->GetAttentionMask().begin(),
+                              encoding->GetAttentionMask().end());
+      } else if (template_sequence.first == SequenceType::SEQ_B) {
+        if (pair_encoding == nullptr) {
+          throw std::runtime_error("Missing pair sequence, checked above");
+        }
+        auto seq_start = ids.size();
+        auto seq_end = seq_start + pair_encoding->GetLen();
+        sequence_ranges[1] = {seq_start, seq_end};
+        ids.insert(ids.end(),
+                   pair_encoding->GetIds().begin(),
+                   pair_encoding->GetIds().end());
+        type_ids.insert(
+            type_ids.end(), pair_encoding->GetLen(), template_sequence.second);
+        tokens.insert(tokens.end(),
+                      pair_encoding->GetTokens().begin(),
+                      pair_encoding->GetTokens().end());
+        words_idx.insert(words_idx.end(),
+                         pair_encoding->GetWordsIdx().begin(),
+                         pair_encoding->GetWordsIdx().end());
+        offsets.insert(offsets.end(),
+                       pair_encoding->GetOffsets().begin(),
+                       pair_encoding->GetOffsets().end());
+        special_tokens_mask.insert(
+            special_tokens_mask.end(),
+            pair_encoding->GetSpecialTokensMask().begin(),
+            pair_encoding->GetSpecialTokensMask().end());
+        attention_mask.insert(attention_mask.end(),
+                              pair_encoding->GetAttentionMask().begin(),
+                              pair_encoding->GetAttentionMask().end());
+      }
+    } else {
+      auto& special_token = paddlenlp::get<TemplateSpecialToken>(piece);
+      if (add_special_tokens) {
+        const std::string& id = special_token.first;
+        uint32_t type_id = special_token.second;
+        auto& tok = special_tokens_map_.tokens_map_.at(
+            id);  // We already checked existence above
+        auto size = tok.ids_.size();
+        ids.insert(ids.end(), tok.ids_.begin(), tok.ids_.end());
+        type_ids.insert(type_ids.end(), size, type_id);
+        tokens.insert(tokens.end(), tok.tokens_.begin(), tok.tokens_.end());
+        words_idx.insert(words_idx.end(), size, -1 /* 2^32 - 1 */);
+        offsets.insert(offsets.end(), size, {0, 0});
+        special_tokens_mask.insert(special_tokens_mask.end(), size, 1);
+        attention_mask.insert(attention_mask.end(), size, 1);
+      }
+    }
+  }
+  *result_encoding = core::Encoding(std::move(ids),
+                                    std::move(type_ids),
+                                    std::move(tokens),
+                                    std::move(words_idx),
+                                    std::move(offsets),
+                                    std::move(special_tokens_mask),
+                                    std::move(attention_mask),
+                                    std::move(result_overflowings),
+                                    std::move(sequence_ranges));
+}
+
+void TemplatePostProcessor::operator()(core::Encoding* encoding,
+                                       core::Encoding* pair_encoding,
+                                       bool add_special_tokens,
+                                       core::Encoding* result_encoding) const {
+  if (pair_encoding != nullptr) {
+    ApplyTemplate(
+        pair_, encoding, pair_encoding, add_special_tokens, result_encoding);
+  } else {
+    ApplyTemplate(
+        single_, encoding, pair_encoding, add_special_tokens, result_encoding);
+  }
+}
+
+void to_json(nlohmann::json& j,
+             const 
TemplatePostProcessor& template_postprocessor) { + j = { + {"type", "TemplateProcessing"}, + {"single", template_postprocessor.single_}, + {"pair", template_postprocessor.pair_}, + {"special_tokens", template_postprocessor.special_tokens_map_}, + }; +} + +void from_json(const nlohmann::json& j, + TemplatePostProcessor& template_postprocessor) { + j["single"].get_to(template_postprocessor.single_); + j["pair"].get_to(template_postprocessor.pair_); + j["special_tokens"].get_to(template_postprocessor.special_tokens_map_); + template_postprocessor.added_single_ = + template_postprocessor.DefaultAdded(true); + template_postprocessor.added_pair_ = + template_postprocessor.DefaultAdded(false); +} + +} // namespace postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/postprocessors/template.h b/fast_tokenizer/fast_tokenizer/postprocessors/template.h new file mode 100644 index 0000000000000000000000000000000000000000..aa20de483daa1729fc7ae83d8b3fb03385dddfe0 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/postprocessors/template.h @@ -0,0 +1,191 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include + +#include "fast_tokenizer/postprocessors/postprocessor.h" +#include "fast_tokenizer/utils/utils.h" +#include "fast_tokenizer/utils/variant.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace postprocessors { + +enum FASTTOKENIZER_DECL SequenceType { SEQ_A, SEQ_B }; +NLOHMANN_JSON_SERIALIZE_ENUM(SequenceType, + { + {SEQ_A, "A"}, {SEQ_B, "B"}, + }); +// The template indicate `${Id} : ${TypeId}` +using TemplateSequence = std::pair; +using TemplateSpecialToken = std::pair; + +using TemplatePiece = + paddlenlp::variant; +void to_json(nlohmann::json& j, const TemplatePiece& template_piece); +void from_json(const nlohmann::json& j, TemplatePiece& template_piece); + +void ParseIdFromString(const std::string& template_id_string, + TemplatePiece* template_piece); +void SetTypeId(uint32_t type_id, TemplatePiece* template_piece); +void GetTemplatePieceFromString(const std::string& template_string, + TemplatePiece* template_piece); + +struct FASTTOKENIZER_DECL SpecialToken { + std::string id_; + std::vector ids_; + std::vector tokens_; + SpecialToken() = default; + SpecialToken(const std::string& id, + const std::vector& ids, + const std::vector& tokens) + : id_(id), ids_(ids), tokens_(tokens) {} + SpecialToken(const std::string& token, uint32_t id) { + id_ = token; + ids_.push_back(id); + tokens_.push_back(token); + } + friend void to_json(nlohmann::json& j, const SpecialToken& special_token); + friend void from_json(const nlohmann::json& j, SpecialToken& special_token); +}; + +struct FASTTOKENIZER_DECL Template { + std::vector pieces_; + Template() = default; + explicit Template(const std::string& template_str) { + std::vector pieces; + + // Parse the pieces + size_t start = template_str.find_first_not_of(" "); + 
size_t pos; + while ((pos = template_str.find_first_of(" ", start)) != + std::string::npos) { + pieces.push_back(template_str.substr(start, pos - start)); + start = template_str.find_first_not_of(" ", pos); + } + if (start != std::string::npos) { + pieces.push_back(template_str.substr(start)); + } + AddStringPiece(pieces); + } + + explicit Template(const std::vector& pieces) + : pieces_(pieces) {} + explicit Template(const std::vector& pieces) { + AddStringPiece(pieces); + } + + void GetPiecesFromVec(const std::vector& pieces) { + AddStringPiece(pieces); + } + + void GetPiecesFromStr(const std::string& template_str) { + std::vector pieces; + + // Parse the pieces + size_t start = template_str.find_first_not_of(" "); + size_t pos; + while ((pos = template_str.find_first_of(" ", start)) != + std::string::npos) { + pieces.push_back(template_str.substr(start, pos - start)); + start = template_str.find_first_not_of(" ", pos); + } + if (start != std::string::npos) { + pieces.push_back(template_str.substr(start)); + } + AddStringPiece(pieces); + } + + void Clean() { pieces_.clear(); } + +private: + void AddStringPiece(const std::vector& pieces) { + for (auto&& piece : pieces) { + TemplatePiece template_piece; + GetTemplatePieceFromString(piece, &template_piece); + if (paddlenlp::get_if(&template_piece)) { + pieces_.push_back(paddlenlp::get(template_piece)); + } else { + pieces_.push_back(paddlenlp::get(template_piece)); + } + } + } + + friend void to_json(nlohmann::json& j, const Template& template_); + friend void from_json(const nlohmann::json& j, Template& template_); +}; + +struct FASTTOKENIZER_DECL SpecialTokensMap { + std::unordered_map tokens_map_; + SpecialTokensMap() = default; + explicit SpecialTokensMap(const std::vector& special_tokens) { + SetTokensMap(special_tokens); + } + void SetTokensMap(const std::vector& special_tokens) { + tokens_map_.clear(); + for (const auto& special_token : special_tokens) { + tokens_map_.insert({special_token.id_, special_token}); + } + } + friend void to_json(nlohmann::json& j, const SpecialTokensMap& tokens_map); + friend void from_json(const nlohmann::json& j, SpecialTokensMap& tokens_map); +}; + +struct FASTTOKENIZER_DECL TemplatePostProcessor : public PostProcessor { + TemplatePostProcessor(); + TemplatePostProcessor(const Template&, + const Template&, + const std::vector&); + + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override; + virtual size_t AddedTokensNum(bool is_pair) const override; + + void UpdateSinglePieces(const std::string& template_str); + void UpdateSinglePieces(const std::vector& pieces); + void UpdatePairPieces(const std::string& template_str); + void UpdatePairPieces(const std::vector& pieces); + void UpdateAddedTokensNum(); + void SetTokensMap(const std::vector& special_tokens); + size_t CountAdded(Template* template_, + const SpecialTokensMap& special_tokens_map); + size_t DefaultAdded(bool is_single = true); + void ApplyTemplate(const Template& pieces, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const; + + friend void to_json(nlohmann::json& j, + const TemplatePostProcessor& template_postprocessor); + friend void from_json(const nlohmann::json& j, + TemplatePostProcessor& template_postprocessor); + + Template single_; + Template pair_; + size_t added_single_; + size_t added_pair_; + SpecialTokensMap special_tokens_map_; +}; + +} // namespace 
postprocessors +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/pretokenizers/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..2acf3a9a7a5d01fecebd53bcf64fab52fa960f25 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/CMakeLists.txt @@ -0,0 +1,4 @@ +cc_library(pretokenizers + SRCS pretokenizer.cc whitespace.cc bert.cc metaspace.cc + sequence.cc byte_level.cc split.cc whitespace_and_punctuation.cc + DEPS normalizers core json utils) diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/bert.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/bert.cc new file mode 100644 index 0000000000000000000000000000000000000000..ce45d0c316f223514a93dc6e9037aeb119511662 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/bert.cc @@ -0,0 +1,74 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/pretokenizers/bert.h" + +#include "fast_tokenizer/utils/utils.h" +#include "glog/logging.h" +#include "re2/re2.h" +#include "unicode/uchar.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +// Note (zhoushunjie): Init re2::RE2 objects cost too much, +// ensure not init in the operator() +static re2::RE2 pattern("[\\s\\p{Zs}]+"); +static re2::RE2 punc_pattern("[[:punct:]]|[\\pP]"); + +void BertPreTokenizer::operator()(PreTokenizedString* pretokenized) const { + std::vector normalized_splits; + pretokenized->Split([&normalized_splits]( + int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + // Use single character match instead of regex to improve performance + normalized->Split([](char32_t ch) -> bool { return u_isUWhiteSpace(ch); }, + core::SplitMode::REMOVED, + &normalized_splits); + for (auto&& normalize : normalized_splits) { + if (!normalize.IsEmpty()) { + string_splits->emplace_back(std::move(normalize)); + } + } + }); + normalized_splits.clear(); + pretokenized->Split([&normalized_splits]( + int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + // Use single character match instead of regex to improve performance + normalized->Split( + utils::IsPunctuation, core::SplitMode::ISOLATED, &normalized_splits); + for (auto&& normalize : normalized_splits) { + if (!normalize.IsEmpty()) { + VLOG(6) << "After pretokenized: " << normalize.GetStr(); + string_splits->emplace_back(std::move(normalize)); + } + } + }); +} + +void to_json(nlohmann::json& j, const BertPreTokenizer& bert_pre_tokenizer) { + j = { + {"type", "BertPreTokenizer"}, + }; +} + +void from_json(const nlohmann::json& j, BertPreTokenizer& bert_pre_tokenizer) {} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/bert.h b/fast_tokenizer/fast_tokenizer/pretokenizers/bert.h new file mode 100644 index 
0000000000000000000000000000000000000000..7543ac5d25f212363e0676b094f8e3e887ea9f9d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/bert.h @@ -0,0 +1,35 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL BertPreTokenizer : public PreTokenizer { + virtual void operator()(PreTokenizedString* pretokenized) const override; + friend void to_json(nlohmann::json& j, + const BertPreTokenizer& bert_pre_tokenizer); + friend void from_json(const nlohmann::json& j, + BertPreTokenizer& bert_pre_tokenizer); +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.cc new file mode 100644 index 0000000000000000000000000000000000000000..e686bce3c785386e757d25a419b8d054dbed1497 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.cc @@ -0,0 +1,150 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
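+
+// ByteLevelPreTokenizer optionally splits the input with the GPT-2 style
+// regex defined below and then remaps every byte of the UTF-8 text to a
+// printable character via utils::CreateBytesToChars(), so that later stages
+// (e.g. byte-level BPE) never have to deal with raw whitespace or control
+// bytes.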
+
+#include "fast_tokenizer/pretokenizers/byte_level.h"
+
+#include <codecvt>
+#include <locale>
+
+#include "fast_tokenizer/utils/utf8.h"
+#include "fast_tokenizer/utils/utils.h"
+#include "glog/logging.h"
+#include "re2/re2.h"
+#include "unicode/uchar.h"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace pretokenizers {
+
+static re2::RE2 pattern(
+    R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+)");
+
+static std::unordered_map<uint8_t, uint32_t> BYTES_TO_CHARS =
+    utils::CreateBytesToChars();
+
+ByteLevelPreTokenizer::ByteLevelPreTokenizer(bool add_prefix_space,
+                                             bool use_regex)
+    : add_prefix_space_(add_prefix_space), use_regex_(use_regex) {}
+
+void ByteLevelPreTokenizer::operator()(
+    PreTokenizedString* pretokenized) const {
+  std::vector<normalizers::NormalizedString> normalized_splits;
+  pretokenized->Split([&normalized_splits, this](
+                          int idx,
+                          normalizers::NormalizedString* normalized,
+                          std::vector<StringSplit>* string_splits) {
+    if (this->add_prefix_space_ && normalized->GetStr().find(' ') != 0) {
+      normalized->Prepend(" ");
+    }
+    if (this->use_regex_) {
+      normalized->Split(
+          pattern, core::SplitMode::ISOLATED, &normalized_splits);
+      for (auto&& normalize : normalized_splits) {
+        if (!normalize.IsEmpty()) {
+          string_splits->emplace_back(std::move(normalize));
+        }
+      }
+    } else {
+      string_splits->emplace_back(*normalized);
+    }
+  });
+  pretokenized->Normalize([](normalizers::NormalizedString* normalized) {
+    const std::string& str = normalized->GetStr();
+    std::u32string u32normalized;
+    std::vector<int> changes;
+    size_t utf8_len = 0;
+    uint32_t curr_char;
+    while (utf8_len < str.length()) {
+      auto chwidth = utils::UTF8ToUInt32(str.data() + utf8_len, &curr_char);
+      curr_char = utils::UTF8ToUnicode(curr_char);
+      for (int i = 0; i < chwidth; ++i) {
+        u32normalized.push_back(BYTES_TO_CHARS.at(str[i + utf8_len]));
+        if (i == 0) {
+          changes.push_back(0);
+        } else {
+          changes.push_back(1);
+        }
+      }
+      utf8_len += chwidth;
+    }
+    normalized->UpdateNormalized({u32normalized, changes}, 0);
+  });
+}
+
+void to_json(nlohmann::json& j,
+             const ByteLevelPreTokenizer& byte_pre_tokenizer) {
+  j = {
+      {"type", "ByteLevelPreTokenizer"},
+      {"add_prefix_space", byte_pre_tokenizer.add_prefix_space_},
+      {"use_regex", byte_pre_tokenizer.use_regex_},
+  };
+}
+
+void from_json(const nlohmann::json& j,
+               ByteLevelPreTokenizer& byte_pre_tokenizer) {
+  j.at("add_prefix_space").get_to(byte_pre_tokenizer.add_prefix_space_);
+  j.at("use_regex").get_to(byte_pre_tokenizer.use_regex_);
+}
+
+void ProcessOffsets(core::Encoding* encoding, bool add_prefix_space) {
+  auto process_token_fn =
+      [&](uint32_t i, const std::string& token, core::Offset* offset) -> void {
+    uint32_t leading_spaces = 0;
+    uint32_t trailing_spaces = 0;
+
+    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
+    std::u32string u32token = conv.from_bytes(token);
+    for (int i = 0; i < u32token.size(); ++i) {
+      if (utils::IsWhiteSpace(u32token[i]) ||
+          u32token[i] == BYTES_TO_CHARS.at(' ')) {
+        ++leading_spaces;
+      } else {
+        break;
+      }
+    }
+
+    for (int i = u32token.size() - 1; i >= 0; --i) {
+      if (utils::IsWhiteSpace(u32token[i]) ||
+          u32token[i] == BYTES_TO_CHARS.at(' ')) {
+        ++trailing_spaces;
+      } else {
+        break;
+      }
+    }
+
+    if (leading_spaces > 0 || trailing_spaces > 0) {
+      if (leading_spaces > 0) {
+        bool is_first = (i == 0) || (offset->first == 0);
+        if (is_first && add_prefix_space && leading_spaces == 1) {
+          leading_spaces = 0;
+        }
+        offset->first =
+            (std::min)(offset->first + leading_spaces, offset->second);
+      }
+    }
+    if (trailing_spaces > 0 && offset->second >= 
trailing_spaces) { + offset->second = + (std::max)(offset->second - trailing_spaces, offset->first); + } + }; + encoding->ProcessTokenWithOffsets(process_token_fn); +} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.h b/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.h new file mode 100644 index 0000000000000000000000000000000000000000..c06dcc373f6dd3cdefeea4272dd01c7a30b20da9 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/byte_level.h @@ -0,0 +1,44 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL ByteLevelPreTokenizer : public PreTokenizer { + ByteLevelPreTokenizer(bool add_prefix_space = true, bool use_regex = true); + virtual void operator()(PreTokenizedString* pretokenized) const override; + friend void to_json(nlohmann::json& j, + const ByteLevelPreTokenizer& byte_pre_tokenizer); + friend void from_json(const nlohmann::json& j, + ByteLevelPreTokenizer& byte_pre_tokenizer); + +private: + bool add_prefix_space_; + bool use_regex_; +}; + +void FASTTOKENIZER_DECL ProcessOffsets(core::Encoding* encoding, + bool add_prefix_space); + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.cc new file mode 100644 index 0000000000000000000000000000000000000000..f3a26001f11b395a16da257512456b169060d602 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.cc @@ -0,0 +1,87 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/pretokenizers/metaspace.h" + +#include "fast_tokenizer/utils/utf8.h" +#include "glog/logging.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +static re2::RE2 pattern(" "); + +void MetaSpacePreTokenizer::UpdateReplacementChar() { + uint32_t ch; + utils::UTF8ToUInt32(replacement_.data(), &ch); + replacement_char_ = utils::UTF8ToUnicode(ch); +} + +MetaSpacePreTokenizer::MetaSpacePreTokenizer(const std::string& replacement, + bool add_prefix_space) + : replacement_(replacement), add_prefix_space_(add_prefix_space) { + UpdateReplacementChar(); +} + +std::string MetaSpacePreTokenizer::GetReplacement() const { + return replacement_; +} + +void MetaSpacePreTokenizer::SetReplacement(const std::string& replacement) { + replacement_ = replacement; + UpdateReplacementChar(); +} + +void MetaSpacePreTokenizer::operator()(PreTokenizedString* pretokenized) const { + std::vector normalized_splits; + pretokenized->Split([&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + normalized->Replace(pattern, replacement_); + if (add_prefix_space_ && normalized->GetStr().find(replacement_) != 0) { + normalized->Prepend(replacement_); + } + normalized->Split( + [&](char32_t ch) -> bool { return ch == replacement_char_; }, + core::SplitMode::MERGED_WITH_NEXT, + &normalized_splits); + for (auto&& normalize : normalized_splits) { + if (!normalize.IsEmpty()) { + VLOG(6) << "After pretokenized: " << normalize.GetStr(); + string_splits->emplace_back(std::move(normalize)); + } + } + }); +} + +void to_json(nlohmann::json& j, + const MetaSpacePreTokenizer& meta_pretokenizer) { + j = { + {"type", "MetaSpacePreTokenizer"}, + {"replacement", meta_pretokenizer.replacement_}, + {"add_prefix_space", meta_pretokenizer.add_prefix_space_}, + }; +} + +void from_json(const nlohmann::json& j, + MetaSpacePreTokenizer& meta_pretokenizer) { + j.at("add_prefix_space").get_to(meta_pretokenizer.add_prefix_space_); + meta_pretokenizer.SetReplacement(j.at("replacement").get()); +} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.h b/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.h new file mode 100644 index 0000000000000000000000000000000000000000..4b5c20504d3a25091e72244addfea7b0c2c29dc9 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/metaspace.h @@ -0,0 +1,48 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL MetaSpacePreTokenizer : public PreTokenizer { + // Replaces white space with U+2581 (LOWER ONE EIGHT BLOCK) + MetaSpacePreTokenizer(const std::string& replacement = "\xe2\x96\x81", + bool add_prefix_space = true); + MetaSpacePreTokenizer(const MetaSpacePreTokenizer&) = default; + virtual void operator()(PreTokenizedString* pretokenized) const override; + std::string GetReplacement() const; + void SetReplacement(const std::string&); + +private: + void UpdateReplacementChar(); + std::string replacement_; + bool add_prefix_space_; + char32_t replacement_char_; + + friend void to_json(nlohmann::json& j, + const MetaSpacePreTokenizer& meta_pretokenizer); + friend void from_json(const nlohmann::json& j, + MetaSpacePreTokenizer& meta_pretokenizer); +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..47d7b5f6ed5008a8c7f26b7f394c62efdce541fa --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.cc @@ -0,0 +1,277 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" + +#include +#include +#include + +#include "fast_tokenizer/utils/unique_ptr.h" +#include "fast_tokenizer/utils/utf8.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +BytesToCharOffsetConverter::BytesToCharOffsetConverter(const std::string& seq) + : OffsetConverter(seq) { + std::wstring_convert, char32_t> conv; + std::u32string u32seq = conv.from_bytes(seq); + offset_map_.reserve(u32seq.length() * 4); + for (int i = 0; i < u32seq.length(); ++i) { + auto utf8_len = utils::GetUTF8CharLen(u32seq[i]); + for (int j = 0; j < utf8_len; ++j) { + offset_map_.push_back(i); + } + } +} + +bool BytesToCharOffsetConverter::convert(const core::Offset& offset, + core::Offset* result) const { + size_t byte_start = offset.first; + size_t byte_end = offset.second; + if (offset_map_.size() <= byte_start) { + return false; + } + auto char_start = offset_map_.at(byte_start); + auto char_end = char_start + 1; + if (offset_map_.size() > byte_end) { + char_end = offset_map_.at(byte_end); + } else if (offset_map_.size() > byte_end - 1) { + char_end = offset_map_.at(byte_end - 1) + 1; + } + *result = {char_start, char_end}; + return true; +} + + +CharToBytesOffsetConverter::CharToBytesOffsetConverter(const std::string& seq) + : OffsetConverter(seq) { + std::wstring_convert, char32_t> conv; + std::u32string u32seq = conv.from_bytes(seq); + uint32_t index = 0; + offset_map_.reserve(u32seq.length() * 4); + for (int i = 0; i < u32seq.length(); ++i) { + offset_map_.push_back(index); + auto utf8_len = fast_tokenizer::utils::GetUTF8CharLen(u32seq[i]); + index += utf8_len; + } + offset_map_.push_back(index); +} + +bool CharToBytesOffsetConverter::convert(const core::Offset& offset, + core::Offset* result) const { + size_t char_start = offset.first; + size_t char_end = offset.second; + if (offset_map_.size() <= char_start) { + return false; + } + auto byte_start = offset_map_.at(char_start); + auto byte_end = byte_start + 1; + if (offset_map_.size() > char_end) { + byte_end = offset_map_.at(char_end); + } + *result = {byte_start, byte_end}; + return true; +} + +PreTokenizedString::PreTokenizedString(const std::string& original) + : original_(original) { + splits_.emplace_back(std::move(StringSplit(original_))); +} + +PreTokenizedString::PreTokenizedString( + const normalizers::NormalizedString& normalized) + : original_(normalized.GetOrignalStr()) { + splits_.emplace_back(std::move(StringSplit(original_))); +} + +PreTokenizedString& PreTokenizedString::operator=(PreTokenizedString&& other) { + original_ = std::move(other.original_); + splits_ = std::move(other.splits_); + return *this; +} + +size_t PreTokenizedString::GetSplitsSize() const { return splits_.size(); } + +StringSplit PreTokenizedString::GetSplit(int idx) const { return splits_[idx]; } + +const std::string& PreTokenizedString::GetOriginStr() const { + return original_; +} + +void PreTokenizedString::Split( + std::function*)> split_fn) { + std::vector new_splits; + new_splits.reserve(splits_.size()); + for (int i = 0; i < splits_.size(); ++i) { + if (splits_[i].tokens_.size() > 0) { + new_splits.emplace_back(std::move(splits_[i])); + continue; + } + split_fn(i, &splits_[i].normalized_, &new_splits); + } + splits_ = std::move(new_splits); +} + +void PreTokenizedString::Normalize( + std::function normalize_fn) { + for (auto& split : splits_) { + if (split.tokens_.empty()) { + normalize_fn(&split.normalized_); + } + } +} +void 
PreTokenizedString::Tokenize( + std::function(normalizers::NormalizedString*)> + tokenize_fn) { + for (auto& split : splits_) { + if (split.tokens_.empty()) { + split.tokens_ = std::move(tokenize_fn(&split.normalized_)); + } + } +} + +bool PreTokenizedString::TransformToEncoding( + const std::vector& input_word_idx, + uint32_t type_id, + core::OffsetType offset_type, + core::Encoding* encoding) const { + if (splits_.empty()) { + *encoding = core::Encoding(); + return true; + } + for (const auto& split : splits_) { + if (split.tokens_.empty()) { + throw std::logic_error( + "The split of PreTokenizedString is empty, please call " + "PreTokenizedString::Tokenize first before transform to Encoding."); + return false; + } + } + + if (offset_type == core::OffsetType::CHAR) { + return TransformToEncodingUseConvertor( + input_word_idx, type_id, encoding); + } + return TransformToEncodingUseConvertor( + input_word_idx, type_id, encoding); +} + +template +bool PreTokenizedString::TransformToEncodingUseConvertor( + const std::vector& input_word_idx, + uint32_t type_id, + core::Encoding* encoding) const { + Convertor converter(original_); + uint32_t tokens_size = 0; + for (int i = 0; i < splits_.size(); ++i) { + tokens_size += splits_[i].tokens_.size(); + } + + std::vector token_ids(tokens_size); + std::vector tokens(tokens_size); + std::vector offsets(tokens_size); + uint32_t curr_idx = 0; + for (int i = 0; i < splits_.size(); ++i) { + const auto& split = splits_[i]; + const auto& normalized = split.normalized_; + auto offset = normalized.GetOrginalOffset(); + core::Offset tmp_offset; + bool has_set_offset = false; + for (const auto& token : split.tokens_) { + auto token_offset = token.offset_; + bool flag = normalized.ConvertOffsets(&token_offset, false); + if (flag) { + token_offset.first += offset.first; + token_offset.second += offset.first; + } + if (has_set_offset) { + offset = token_offset; + has_set_offset = true; + } + converter.convert(token_offset, &tmp_offset); + token_ids[curr_idx] = token.id_; + tokens[curr_idx] = token.value_; + offsets[curr_idx] = tmp_offset; + ++curr_idx; + } + } + // Setting words_idx + std::vector words_idx(tokens_size); + if (input_word_idx.size() == 0) { + uint32_t word_offset = 0; + for (uint32_t i = 0; i < splits_.size(); ++i) { + std::fill_n( + words_idx.begin() + word_offset, splits_[i].tokens_.size(), i); + word_offset += splits_[i].tokens_.size(); + } + } else { + std::fill(words_idx.begin(), words_idx.end(), input_word_idx[0]); + } + *encoding = std::move(core::Encoding( + std::move(token_ids), + std::vector(tokens_size, type_id), // type_ids + std::move(tokens), + std::move(words_idx), + std::move(offsets), + std::vector(tokens_size, 0), /* special_tokens_mask */ + std::vector(tokens_size, 1), /* attention_mask */ + std::vector(), /* overflowing */ + std::unordered_map() /* sequence_ranges */)); + return true; +} + +void PreTokenizedString::SetOriginalStr(const std::string& original) { + original_ = original; + splits_.clear(); + splits_.emplace_back(original_); +} + +std::vector>> +PreTokenizedString::GetSplits(bool is_original, + const core::OffsetType& offset_type) const { + std::unique_ptr converter; + if (offset_type == core::OffsetType::BYTE) { + converter = utils::make_unique(original_); + } else { + converter = utils::make_unique(original_); + } + std::vector>> + result; + uint32_t offset = 0; + for (auto&& split : splits_) { + core::Offset curr_offset, split_offset; + if (is_original) { + split_offset = split.normalized_.GetOrginalOffset(); + } 
else { + auto len = split.normalized_.GetLen(); + offset += len; + split_offset = {offset - len, offset}; + } + + // Convert to char offsets if relevant + converter->convert(split_offset, &curr_offset); + result.emplace_back(split.normalized_.GetStr(), curr_offset, split.tokens_); + } + return result; +} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.h b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.h new file mode 100644 index 0000000000000000000000000000000000000000..02abfff8949c4e2d2193aa0093f1fdf30e7e67a0 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizer.h @@ -0,0 +1,117 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include +#include + +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL StringSplit { + normalizers::NormalizedString normalized_; + std::vector tokens_; + StringSplit(normalizers::NormalizedString&& normalized) + : normalized_(std::move(normalized)) {} + StringSplit(const normalizers::NormalizedString& normalized) + : normalized_(normalized) {} + StringSplit(const normalizers::NormalizedString& normalized, + const std::vector& tokens) + : normalized_(normalized), tokens_(tokens) {} + StringSplit() = default; + StringSplit(const StringSplit& other) = default; + StringSplit(StringSplit&& other) + : tokens_(std::move(other.tokens_)), + normalized_(std::move(other.normalized_)) {} + + StringSplit& operator=(const StringSplit& other) = default; + StringSplit& operator=(StringSplit&& other) { + tokens_ = std::move(other.tokens_); + normalized_ = std::move(other.normalized_); + return *this; + } +}; + +class FASTTOKENIZER_DECL PreTokenizedString { +public: + PreTokenizedString() = default; + PreTokenizedString(const std::string& original); + PreTokenizedString(const normalizers::NormalizedString& normalized); + PreTokenizedString& operator=(PreTokenizedString&& other); + + void Split(std::function*)> split_fn); + void Normalize( + std::function normalize_fn); + // For wordpiece, bpe ...... 
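+  // Tokenize() applies the supplied callback to every split that has not been
+  // tokenized yet and stores the produced tokens on that split.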
+ void Tokenize( + std::function(normalizers::NormalizedString*)> + tokenize_fn); + bool TransformToEncoding(const std::vector& word_idx, + uint32_t type_id, + core::OffsetType offset_type, + core::Encoding* encodings) const; + template + bool TransformToEncodingUseConvertor(const std::vector& word_idx, + uint32_t type_id, + core::Encoding* encodings) const; + size_t GetSplitsSize() const; + StringSplit GetSplit(int idx) const; + const std::string& GetOriginStr() const; + void SetOriginalStr(const std::string& original); + std::vector>> + GetSplits(bool is_original, const core::OffsetType& offset_type) const; + +private: + std::string original_; + std::vector splits_; +}; + +struct FASTTOKENIZER_DECL PreTokenizer { + virtual void operator()(PreTokenizedString* pretokenized) const = 0; +}; + +struct FASTTOKENIZER_DECL OffsetConverter { + OffsetConverter(const std::string&) {} + virtual bool convert(const core::Offset&, core::Offset*) const { + return true; + } +}; + +struct FASTTOKENIZER_DECL BytesToCharOffsetConverter : public OffsetConverter { + std::vector offset_map_; + BytesToCharOffsetConverter(const std::string&); + virtual bool convert(const core::Offset&, core::Offset*) const; +}; + +struct FASTTOKENIZER_DECL CharToBytesOffsetConverter : public OffsetConverter { + std::vector offset_map_; + CharToBytesOffsetConverter(const std::string&); + virtual bool convert(const core::Offset&, core::Offset*) const; +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizers.h b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizers.h new file mode 100644 index 0000000000000000000000000000000000000000..5828ecd2dafe66b5ba34184a3f12d3799d4333e9 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/pretokenizers.h @@ -0,0 +1,24 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/pretokenizers/bert.h" +#include "fast_tokenizer/pretokenizers/byte_level.h" +#include "fast_tokenizer/pretokenizers/metaspace.h" +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/pretokenizers/sequence.h" +#include "fast_tokenizer/pretokenizers/split.h" +#include "fast_tokenizer/pretokenizers/whitespace.h" +#include "fast_tokenizer/pretokenizers/whitespace_and_punctuation.h" diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.cc new file mode 100644 index 0000000000000000000000000000000000000000..3ac87ddbc3501bb4639fa12f8fef8a197f1a5e19 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.cc @@ -0,0 +1,122 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/pretokenizers/pretokenizers.h" +#include "glog/logging.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +SequencePreTokenizer::SequencePreTokenizer( + const std::vector& pretokenizers) { + for (auto& pretokenizer : pretokenizers) { + AppendPreTokenizer(pretokenizer); + } +} + +void SequencePreTokenizer::AppendPreTokenizer(PreTokenizer* pretokenizer) { + std::shared_ptr pretokenizer_ptr; + if (typeid(*pretokenizer) == typeid(SequencePreTokenizer)) { + auto cast_pretokenizer = dynamic_cast(pretokenizer); + pretokenizer_ptr = + std::make_shared(*cast_pretokenizer); + } else if (typeid(*pretokenizer) == typeid(BertPreTokenizer)) { + auto cast_pretokenizer = dynamic_cast(pretokenizer); + pretokenizer_ptr = std::make_shared(*cast_pretokenizer); + } else if (typeid(*pretokenizer) == typeid(MetaSpacePreTokenizer)) { + auto cast_pretokenizer = dynamic_cast(pretokenizer); + pretokenizer_ptr = + std::make_shared(*cast_pretokenizer); + } else if (typeid(*pretokenizer) == typeid(WhitespacePreTokenizer)) { + auto cast_pretokenizer = + dynamic_cast(pretokenizer); + pretokenizer_ptr = + std::make_shared(*cast_pretokenizer); + } else if (typeid(*pretokenizer) == + typeid(WhitespaceAndPunctuationPreTokenizer)) { + auto cast_pretokenizer = + dynamic_cast(pretokenizer); + pretokenizer_ptr = std::make_shared( + *cast_pretokenizer); + } else if (typeid(*pretokenizer) == typeid(SplitPreTokenizer)) { + auto cast_pretokenizer = dynamic_cast(pretokenizer); + pretokenizer_ptr = std::make_shared(*cast_pretokenizer); + } else if (typeid(*pretokenizer) == typeid(ByteLevelPreTokenizer)) { + auto cast_pretokenizer = dynamic_cast(pretokenizer); + pretokenizer_ptr = + std::make_shared(*cast_pretokenizer); + } else { + VLOG(6) << "This pretokenizer is not supportted now."; + } + pretokenzer_ptrs_.push_back(pretokenizer_ptr); +} + +void SequencePreTokenizer::operator()(PreTokenizedString* pretokenized) const { + for (auto& pretokenizer : pretokenzer_ptrs_) { + pretokenizer->operator()(pretokenized); + } +} + +void to_json(nlohmann::json& j, + const SequencePreTokenizer& sequence_pretokenizer) { + nlohmann::json jlist; + for (auto& ptr : sequence_pretokenizer.pretokenzer_ptrs_) { + nlohmann::json jitem; + if (typeid(*ptr) == typeid(SequencePreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(BertPreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(MetaSpacePreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(WhitespacePreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(WhitespaceAndPunctuationPreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(SplitPreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } else if (typeid(*ptr) == typeid(ByteLevelPreTokenizer)) { + jitem = *dynamic_cast(ptr.get()); + } + jlist.push_back(jitem); + } + j = {{"type", "SequencePreTokenizer"}, {"pretokenizers", jlist}}; +} + +void from_json(const nlohmann::json& j, + SequencePreTokenizer& 
sequence_pretokenizer) { +#define TRY_APPEND_PRETOKENIZER(PRETOKENIZER_TYPE) \ + if (pretokenizer_type == #PRETOKENIZER_TYPE) { \ + PRETOKENIZER_TYPE pretokenizer; \ + pretokenizer_json.get_to(pretokenizer); \ + sequence_pretokenizer.AppendPreTokenizer(&pretokenizer); \ + } + for (auto& pretokenizer_json : j.at("pretokenizers")) { + std::string pretokenizer_type; + pretokenizer_json.at("type").get_to(pretokenizer_type); + TRY_APPEND_PRETOKENIZER(SequencePreTokenizer); + TRY_APPEND_PRETOKENIZER(WhitespacePreTokenizer); + TRY_APPEND_PRETOKENIZER(WhitespaceAndPunctuationPreTokenizer); + TRY_APPEND_PRETOKENIZER(MetaSpacePreTokenizer); + TRY_APPEND_PRETOKENIZER(BertPreTokenizer); + TRY_APPEND_PRETOKENIZER(ByteLevelPreTokenizer); + TRY_APPEND_PRETOKENIZER(SplitPreTokenizer); + } +#undef TRY_APPEND_PRETOKENIZER +} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.h b/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.h new file mode 100644 index 0000000000000000000000000000000000000000..741f8d43c08af413afb5036721208a214281d86a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/sequence.h @@ -0,0 +1,43 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" +#include "nlohmann/json.hpp" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL SequencePreTokenizer : public PreTokenizer { + SequencePreTokenizer() = default; + SequencePreTokenizer(const SequencePreTokenizer&) = default; + SequencePreTokenizer(const std::vector& pretokenizers); + virtual void operator()(PreTokenizedString* pretokenized) const override; + void AppendPreTokenizer(PreTokenizer* pretokenizer); + +private: + std::vector> pretokenzer_ptrs_; + friend void to_json(nlohmann::json& j, + const SequencePreTokenizer& sequence_pretokenizer); + friend void from_json(const nlohmann::json& j, + SequencePreTokenizer& sequence_pretokenizer); +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/split.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/split.cc new file mode 100644 index 0000000000000000000000000000000000000000..927af338b2e26c4d57b852d7b9d1053a3b53d817 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/split.cc @@ -0,0 +1,72 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/pretokenizers/split.h" + +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/normalizers/normalizer.h" +#include "fast_tokenizer/utils/unique_ptr.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +SplitPreTokenizer::SplitPreTokenizer( + const SplitPreTokenizer& split_pretokenizer) + : pattern_(new re2::RE2(split_pretokenizer.pattern_->pattern())) { + split_mode_ = split_pretokenizer.split_mode_; + invert_ = split_pretokenizer.invert_; +} + +SplitPreTokenizer::SplitPreTokenizer(const std::string& pattern, + core::SplitMode split_mode, + bool invert) + : invert_(invert), split_mode_(split_mode) { + pattern_ = utils::make_unique(pattern); +} + +void SplitPreTokenizer::operator()(PreTokenizedString* pretokenized) const { + pretokenized->Split([&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + std::vector normalized_splits; + normalized->Split(*pattern_, split_mode_, &normalized_splits, invert_); + for (auto& normalize : normalized_splits) { + string_splits->push_back(StringSplit(normalize)); + } + }); +} + + +void to_json(nlohmann::json& j, const SplitPreTokenizer& split_pretokenizer) { + j = { + {"type", "SplitPreTokenizer"}, + {"pattern", split_pretokenizer.pattern_->pattern()}, + {"split_mode", split_pretokenizer.split_mode_}, + {"invert", split_pretokenizer.invert_}, + }; +} + +void from_json(const nlohmann::json& j, SplitPreTokenizer& split_pretokenizer) { + split_pretokenizer.pattern_ = + utils::make_unique(j.at("pattern").get()); + j.at("split_mode").get_to(split_pretokenizer.split_mode_); + j.at("invert").get_to(split_pretokenizer.invert_); +} + + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/split.h b/fast_tokenizer/fast_tokenizer/pretokenizers/split.h new file mode 100644 index 0000000000000000000000000000000000000000..52796f9f4e240441b2d4fd2204bdeb77a044f525 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/split.h @@ -0,0 +1,48 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace re2 { +class RE2; +} // namespace re2 + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL SplitPreTokenizer : public PreTokenizer { + SplitPreTokenizer() = default; + SplitPreTokenizer(const std::string& pattern, + core::SplitMode split_mode, + bool invert); + SplitPreTokenizer(const SplitPreTokenizer& split_pretokenizer); + virtual void operator()(PreTokenizedString* pretokenized) const override; + friend void to_json(nlohmann::json& j, + const SplitPreTokenizer& split_pretokenizer); + friend void from_json(const nlohmann::json& j, + SplitPreTokenizer& split_pretokenizer); + +private: + bool invert_; + core::SplitMode split_mode_; + std::unique_ptr pattern_; +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.cc new file mode 100644 index 0000000000000000000000000000000000000000..6ef950870997127f299be630f439db6c1dd69c2a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.cc @@ -0,0 +1,50 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/pretokenizers/whitespace.h" + +#include "fast_tokenizer/normalizers/normalizer.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { +static re2::RE2 pattern("[\\s\\p{Zs}]+"); + +void WhitespacePreTokenizer::operator()( + PreTokenizedString* pretokenized) const { + pretokenized->Split([&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + std::vector normalized_splits; + normalized->Split(pattern, core::SplitMode::REMOVED, &normalized_splits); + for (auto& normalize : normalized_splits) { + string_splits->push_back(StringSplit(normalize)); + } + }); +} + +void to_json(nlohmann::json& j, + const WhitespacePreTokenizer& whitespace_pretokenizer) { + j = { + {"type", "WhitespacePreTokenizer"}, + }; +} + +void from_json(const nlohmann::json& j, + WhitespacePreTokenizer& whitespace_pretokenizer) {} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.h b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.h new file mode 100644 index 0000000000000000000000000000000000000000..43aa955ffc0655b97e1644f7266676d74ea0a8f2 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace.h @@ -0,0 +1,34 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL WhitespacePreTokenizer : public PreTokenizer { + virtual void operator()(PreTokenizedString* pretokenized) const override; + friend void to_json(nlohmann::json& j, + const WhitespacePreTokenizer& whitespace_pretokenizer); + friend void from_json(const nlohmann::json& j, + WhitespacePreTokenizer& whitespace_pretokenizer); +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.cc b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.cc new file mode 100644 index 0000000000000000000000000000000000000000..d54d59cdbcfc3ef2427e3a9caebcd6ed8ea055ee --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.cc @@ -0,0 +1,60 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/pretokenizers/whitespace_and_punctuation.h" + +#include "fast_tokenizer/normalizers/normalizer.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { +static re2::RE2 pattern("[\\s\\p{Zs}]+"); + +void WhitespaceAndPunctuationPreTokenizer::operator()( + PreTokenizedString* pretokenized) const { + pretokenized->Split([&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + std::vector normalized_splits; + normalized->Split(pattern, core::SplitMode::REMOVED, &normalized_splits); + for (auto& normalize : normalized_splits) { + string_splits->push_back(StringSplit(normalize)); + } + }); + pretokenized->Split([&](int idx, + normalizers::NormalizedString* normalized, + std::vector* string_splits) { + std::vector normalized_splits; + normalized->Split("\\w+", core::SplitMode::ISOLATED, &normalized_splits); + for (auto& normalize : normalized_splits) { + string_splits->push_back(StringSplit(normalize)); + } + }); +} + +void to_json( + nlohmann::json& j, + const WhitespaceAndPunctuationPreTokenizer& whitespace_pretokenizer) { + j = { + {"type", "WhitespaceAndPunctuationPreTokenizer"}, + }; +} + +void from_json(const nlohmann::json& j, + WhitespaceAndPunctuationPreTokenizer& whitespace_pretokenizer) {} + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.h b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.h new file mode 100644 index 0000000000000000000000000000000000000000..f7343052e5857ca134b473ca267b9d92de33910d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pretokenizers/whitespace_and_punctuation.h @@ -0,0 +1,37 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pretokenizers { + +struct FASTTOKENIZER_DECL WhitespaceAndPunctuationPreTokenizer + : public PreTokenizer { + virtual void operator()(PreTokenizedString* pretokenized) const override; + friend void to_json( + nlohmann::json& j, + const WhitespaceAndPunctuationPreTokenizer& whitespace_pretokenizer); + friend void from_json( + const nlohmann::json& j, + WhitespaceAndPunctuationPreTokenizer& whitespace_pretokenizer); +}; + +} // namespace pretokenizers +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/pybind/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..66e6290ddd96d94464f0ceaac85b1fde67606ed4 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/CMakeLists.txt @@ -0,0 +1,10 @@ +set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,-rpath='$ORIGIN'") +cc_library(pybind_utils SRCS utils.cc DEPS pybind python json) +cc_library(pybind_normalizers SRCS normalizers.cc DEPS pybind python json normalizers) +cc_library(pybind_pretokenizers SRCS pretokenizers.cc DEPS pybind python json pretokenizers) +cc_library(pybind_models SRCS models.cc DEPS pybind python json models) +cc_library(pybind_postprocessors SRCS postprocessors.cc DEPS pybind python core json postprocessors) +cc_library(pybind_tokenizers SRCS tokenizers.cc DEPS pybind python pybind_utils json tokenizer) +cc_library(pybind_exception SRCS exception.cc DEPS pybind python) +cc_library(pybind_decoders SRCS decoders.cc DEPS pybind python json decoders) +cc_library(pybind_core SRCS core.cc DEPS pybind python json) \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/pybind/core.cc b/fast_tokenizer/fast_tokenizer/pybind/core.cc new file mode 100644 index 0000000000000000000000000000000000000000..26cd25d9317dc19ede8d009024a6e3645c55115a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/core.cc @@ -0,0 +1,288 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include + +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/pybind/core.h" + +#include +#include + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +py::list GetWordIdx(const core::Encoding& self) { + py::list list; + for (const auto& idx : self.GetWordsIdx()) { + if (idx == static_cast(-1)) { + list.append(py::none()); + } else { + list.append(py::cast(idx)); + } + } + return list; +} + +void BindCore(pybind11::module* m) { + py::class_(*m, "Token") + .def(py::init<>()) + .def_readwrite("id", &core::Token::id_) + .def_readwrite("value", &core::Token::value_) + .def_readwrite("offset", &core::Token::offset_) + .def("__repr__", [](const core::Token& token) { + std::ostringstream oss; + oss << "id: " << token.id_ << "\tvalue:" << token.value_ + << "\toffset: (" << token.offset_.first << ", " + << token.offset_.second << ")"; + return oss.str(); + }); + py::class_(*m, "PadMethod") + .def(py::init<>()) + .def_readwrite("strategy", &core::PadMethod::strategy_) + .def_readwrite("direction", &core::PadMethod::direction_) + .def_readwrite("pad_id", &core::PadMethod::pad_id_) + .def_readwrite("pad_token_type_id", &core::PadMethod::pad_token_type_id_) + .def_readwrite("pad_token", &core::PadMethod::pad_token_) + .def_readwrite("pad_len", &core::PadMethod::pad_len_) + .def_readwrite("pad_to_multiple_of", + &core::PadMethod::pad_to_multiple_of_); + py::class_(*m, "TruncMethod") + .def(py::init<>()) + .def_readwrite("direction", &core::TruncMethod::direction_) + .def_readwrite("max_len", &core::TruncMethod::max_len_) + .def_readwrite("strategy", &core::TruncMethod::strategy_) + .def_readwrite("stride", &core::TruncMethod::stride_); + + py::enum_(*m, "OffsetType") + .value("CHAR", core::OffsetType::CHAR) + .value("BYTE", core::OffsetType::BYTE) + .export_values(); + py::enum_(*m, "Direction") + .value("LEFT", core::Direction::LEFT) + .value("RIGHT", core::Direction::RIGHT) + .export_values(); + py::enum_(*m, "TruncStrategy") + .value("LONGEST_FIRST", core::TruncStrategy::LONGEST_FIRST) + .value("ONLY_FIRST", core::TruncStrategy::ONLY_FIRST) + .value("ONLY_SECOND", core::TruncStrategy::ONLY_SECOND) + .export_values(); + py::enum_(*m, "PadStrategy") + .value("BATCH_LONGEST", core::PadStrategy::BATCH_LONGEST) + .value("FIXED_SIZE", core::PadStrategy::FIXED_SIZE) + .export_values(); + + py::enum_(*m, "SplitMode") + .value("REMOVED", core::SplitMode::REMOVED) + .value("ISOLATED", core::SplitMode::ISOLATED) + .value("MERGED_WITH_PREVIOUS", core::SplitMode::MERGED_WITH_PREVIOUS) + .value("MERGED_WITH_NEXT", core::SplitMode::MERGED_WITH_NEXT) + .value("CONTIGUOUS", core::SplitMode::CONTIGUOUS) + .export_values(); + + py::class_(*m, "Encoding") + .def(py::init&, + const std::vector&, + const std::vector&, + const std::vector&, + const std::vector&, + const std::vector&, + const std::vector&, + const std::vector&, + const std::unordered_map&>(), + py::arg("ids"), + py::arg("type_ids"), + py::arg("tokens"), + py::arg("words_idx"), + py::arg("offsets"), + py::arg("special_tokens_mask"), + py::arg("attention_mask"), + py::arg("overflowing"), + py::arg("sequence_ranges")) + .def(py::init(), py::arg("size")) + .def(py::init&, uint32_t>(), + py::arg("tokens"), + py::arg("type_id")) + .def("__str__", &core::Encoding::DebugString) + .def("__repr__", &core::Encoding::DebugString) + .def("__len__", &core::Encoding::GetLen) + 
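+          // Read-only properties below expose the encoding's parallel arrays (ids, type_ids, tokens, offsets, masks, overflowing).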
.def_property_readonly("n_sequences", &core::Encoding::GetNumSequence) + .def_property_readonly("tokens", &core::Encoding::GetTokens) + .def_property_readonly("word_ids", &GetWordIdx) + .def_property_readonly("sequence_ids", &core::Encoding::GetSequenceIds) + .def_property_readonly("ids", &core::Encoding::GetIds) + .def_property_readonly("type_ids", &core::Encoding::GetTypeIds) + .def_property_readonly("offsets", &core::Encoding::GetOffsets) + .def_property_readonly("special_tokens_mask", + &core::Encoding::GetSpecialTokensMask) + .def_property_readonly("attention_mask", + &core::Encoding::GetAttentionMask) + .def_property_readonly("overflowing", &core::Encoding::GetOverflowing) + .def("set_sequence_ids", + &core::Encoding::SetSequenceIds, + py::arg("sequence_id")) + .def("char_to_token", + [](const core::Encoding& self, + uint32_t char_pos, + uint32_t seq_id) -> py::object { + auto token_idxs = self.CharOffsetsToTokenIdx(char_pos, seq_id); + if (token_idxs.size() == 0) { + return py::none(); + } + return py::cast(token_idxs[0]); + }, + py::arg("char_pos"), + py::arg("sequence_index") = 0) + .def("char_to_word", + [](const core::Encoding& self, + uint32_t char_pos, + uint32_t seq_id) -> py::object { + auto word_idxs = self.CharOffsetsToWordIdx(char_pos, seq_id); + if (word_idxs.size() == 0) { + return py::none(); + } + return py::cast(word_idxs[0]); + }, + py::arg("char_pos"), + py::arg("sequence_index") = 0) + .def_static("merge", + &core::Encoding::Merge, + py::arg("encodings"), + py::arg("growing_offsets") = true) + .def("pad", + [](core::Encoding& self, + uint32_t length, + const std::string& direction, + uint32_t pad_id, + uint32_t pad_type_id, + const std::string& pad_token) { + core::Direction direct; + if (direction == "right") { + direct = core::Direction::RIGHT; + } else { + direct = core::Direction::LEFT; + } + self.Pad(length, pad_id, pad_type_id, pad_token, direct); + }, + py::arg("length"), + py::arg("direction") = "right", + py::arg("pad_id") = 0, + py::arg("pad_type_id") = 0, + py::arg("pad_token") = "[PAD]") + .def("token_to_chars", + [](const core::Encoding& self, uint32_t token_index) -> py::object { + auto offsets = self.TokenIdxToCharOffsets(token_index); + if (offsets.size() == 0) { + return py::none(); + } + return py::cast(offsets[0]); + }, + py::arg("token_index")) + .def("token_to_sequence", + [](const core::Encoding& self, uint32_t token_index) -> py::object { + auto seq_ids = self.TokenIdxToSequenceIds(token_index); + if (seq_ids.size() == 0) { + return py::none(); + } + return py::cast(seq_ids[0]); + }, + py::arg("token_index")) + .def("token_to_word", + [](const core::Encoding& self, uint32_t token_index) -> py::object { + auto word_idx = self.TokenIdxToWordIdx(token_index); + if (word_idx.size() == 0) { + return py::none(); + } + return py::cast(word_idx[0].second); + }, + py::arg("token_index")) + .def("word_to_chars", + [](const core::Encoding& self, + uint32_t word_index, + uint32_t sequence_index) -> py::object { + auto ranges = + self.WordIdxToCharOffsets(word_index, sequence_index); + if (ranges.size() == 0) { + return py::none(); + } + return py::cast(ranges[0]); + }, + py::arg("word_index"), + py::arg("sequence_index") = 0) + .def("word_to_tokens", + [](const core::Encoding& self, + uint32_t word_index, + uint32_t sequence_index) -> py::object { + auto ranges = self.WordIdxToTokensIdx(word_index, sequence_index); + if (ranges.size() == 0) { + return py::none(); + } + return py::cast(ranges[0]); + }, + py::arg("word_index"), + py::arg("sequence_index") 
= 0) + .def("truncate", + [](core::Encoding& self, + size_t max_length, + size_t stride, + const std::string& direction) { + core::Direction direct; + if (direction == "right") { + direct = core::Direction::RIGHT; + } else { + direct = core::Direction::LEFT; + } + self.Truncate(max_length, stride, direct); + }, + py::arg("max_length"), + py::arg("stride") = 0, + py::arg("direction") = "right"); + + py::class_(*m, "AddedToken") + .def(py::init<>()) + .def(py::init([](const std::string& content, + bool single_word, + bool lstrip, + bool rstrip, + bool normalized) { + return core::AddedToken( + content, !normalized, single_word, lstrip, rstrip); + }), + py::arg("content"), + py::arg("single_word") = false, + py::arg("lstrip") = false, + py::arg("rstrip") = false, + py::arg("normalized") = true) + .def(py::self == py::self) + .def_property_readonly("content", &core::AddedToken::GetContent) + .def_property_readonly("get_is_special", &core::AddedToken::GetIsSpecial) + .def_property_readonly( + "normalized", + [](const core::AddedToken& self) { return !self.GetUseNormalized(); }) + .def_property_readonly("lstrip", &core::AddedToken::GetUseLStrip) + .def_property_readonly("rstrip", &core::AddedToken::GetUseRStrip) + .def_property_readonly("single_word", &core::AddedToken::GetIsSingleWord); + + m->def("set_thread_num", &core::SetThreadNum); + m->def("get_thread_num", &core::GetThreadNum); +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/core.h b/fast_tokenizer/fast_tokenizer/pybind/core.h new file mode 100644 index 0000000000000000000000000000000000000000..4d42bdd00cbfe90d30ee97a0223cf1db9244aa4a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/core.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindCore(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/decoders.cc b/fast_tokenizer/fast_tokenizer/pybind/decoders.cc new file mode 100644 index 0000000000000000000000000000000000000000..0e0bdbe8728ff2fee6212931550f4c9effb30be5 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/decoders.cc @@ -0,0 +1,74 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/decoders/decoders.h" +#include +#include "fast_tokenizer/pybind/decoders.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +class PyDecoder : public decoders::Decoder { +public: + using Decoder::Decoder; + virtual void operator()(const std::vector tokens, + std::string* result) const override { + PYBIND11_OVERLOAD_PURE_NAME( + void, Decoder, "__call__", operator(), tokens, result); + } +}; + +class PyWordPieceDecoder : public decoders::WordPiece { +public: + using WordPiece::WordPiece; + virtual void operator()(const std::vector tokens, + std::string* result) const override { + PYBIND11_OVERLOAD_NAME( + void, WordPiece, "__call__", operator(), tokens, result); + } +}; + +void BindDecoders(pybind11::module* m) { + auto submodule = m->def_submodule("decoders", "The decoders module"); + py::class_(submodule, "Decoder") + .def(py::init<>()) + .def("decode", + [](const decoders::Decoder& self, + const std::vector& tokens) { + std::string result; + self(tokens, &result); + return result; + }, + py::arg("tokens")); + + py::class_(submodule, "WordPiece") + .def(py::init(), + py::arg("prefix") = "##", + py::arg("cleanup") = true) + .def("decode", + [](const decoders::Decoder& self, + const std::vector& tokens) { + std::string result; + self(tokens, &result); + return result; + }, + py::arg("tokens")); +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/decoders.h b/fast_tokenizer/fast_tokenizer/pybind/decoders.h new file mode 100644 index 0000000000000000000000000000000000000000..27be3049cf674b2d581b44edfed031bbb2df3817 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/decoders.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindDecoders(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/exception.cc b/fast_tokenizer/fast_tokenizer/pybind/exception.cc new file mode 100644 index 0000000000000000000000000000000000000000..35df7987fd54c66c0ffb4bea8086b573e58fc792 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/exception.cc @@ -0,0 +1,35 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/pybind/exception.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void ThrowExceptionToPython(std::exception_ptr p) { + static PyObject* EnforceNotMetException = + PyErr_NewException("tokenizer.EnforceNotMet", PyExc_Exception, NULL); + try { + if (p) std::rethrow_exception(p); + } catch (const std::runtime_error& e) { + PyErr_SetString(PyExc_RuntimeError, e.what()); + } +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/exception.h b/fast_tokenizer/fast_tokenizer/pybind/exception.h new file mode 100644 index 0000000000000000000000000000000000000000..49381948179f8addef9d95f0023463b3db28490e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/exception.h @@ -0,0 +1,45 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include + +#include "pybind11/pybind11.h" +#define TOKENIZERS_TRY try { +#define TOKENIZERS_CATCH_AND_THROW_RETURN_NULL \ + } \ + catch (...) { \ + ThrowExceptionToPython(std::current_exception()); \ + return nullptr; \ + } + +#define TOKENIZERS_CATCH_AND_THROW_RETURN_NEG \ + } \ + catch (...) { \ + ThrowExceptionToPython(std::current_exception()); \ + return -1; \ + } + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void ThrowExceptionToPython(std::exception_ptr p); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/models.cc b/fast_tokenizer/fast_tokenizer/pybind/models.cc new file mode 100644 index 0000000000000000000000000000000000000000..3cee6b03f3884ab913a9639001870262e106e9ae --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/models.cc @@ -0,0 +1,551 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/models/models.h" + +#include + +#include "fast_tokenizer/pybind/models.h" +#include "fast_tokenizer/pybind/utils.h" +#include "glog/logging.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +class PyModel : public models::Model { +public: + using Model::Model; + virtual std::vector Tokenize( + const std::string& tokens) override { + PYBIND11_OVERLOAD_PURE_NAME( + std::vector, Model, "tokenize", Tokenize, tokens); + } + + virtual bool TokenToId(const std::string& token, + uint32_t* id) const override { + PYBIND11_OVERLOAD_PURE_NAME( + bool, Model, "token_to_id", TokenToId, token, id); + } + + virtual bool IdToToken(uint32_t id, std::string* token) const override { + PYBIND11_OVERLOAD_PURE_NAME( + bool, Model, "id_to_token", IdToToken, id, token); + } + + virtual core::Vocab GetVocab() const override { + PYBIND11_OVERLOAD_PURE_NAME(core::Vocab, Model, "get_vocab", GetVocab); + } + + virtual size_t GetVocabSize() const override { + PYBIND11_OVERLOAD_PURE_NAME(size_t, Model, "get_vocab_size", GetVocabSize); + } + + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override { + PYBIND11_OVERLOAD_PURE_NAME( + std::vector, Model, "save", Save, folder, filename_prefix); + } +}; + +class PyWordPiece : public models::WordPiece { + using WordPiece::WordPiece; + virtual std::vector Tokenize( + const std::string& tokens) override { + PYBIND11_OVERLOAD_NAME( + std::vector, WordPiece, "tokenize", Tokenize, tokens); + } + + virtual bool TokenToId(const std::string& token, + uint32_t* id) const override { + PYBIND11_OVERLOAD_NAME( + bool, WordPiece, "token_to_id", TokenToId, token, id); + } + + virtual bool IdToToken(uint32_t id, std::string* token) const override { + PYBIND11_OVERLOAD_NAME( + bool, WordPiece, "id_to_token", IdToToken, id, token); + } + + virtual core::Vocab GetVocab() const override { + PYBIND11_OVERLOAD_NAME(core::Vocab, WordPiece, "get_vocab", GetVocab); + } + + virtual size_t GetVocabSize() const override { + PYBIND11_OVERLOAD_NAME(size_t, WordPiece, "get_vocab_size", GetVocabSize); + } + + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override { + PYBIND11_OVERLOAD_NAME(std::vector, + WordPiece, + "save", + Save, + folder, + filename_prefix); + } +}; + +class PyFastWordPiece : public models::FastWordPiece { + using FastWordPiece::FastWordPiece; + virtual std::vector Tokenize( + const std::string& tokens) override { + PYBIND11_OVERLOAD_NAME( + std::vector, FastWordPiece, "tokenize", Tokenize, tokens); + } + + virtual bool TokenToId(const std::string& token, + uint32_t* id) const override { + PYBIND11_OVERLOAD_NAME( + bool, FastWordPiece, "token_to_id", TokenToId, token, id); + } + + virtual bool IdToToken(uint32_t id, std::string* token) const override { + PYBIND11_OVERLOAD_NAME( + bool, FastWordPiece, "id_to_token", IdToToken, id, token); + } + + virtual core::Vocab GetVocab() const override { + PYBIND11_OVERLOAD_NAME(core::Vocab, FastWordPiece, "get_vocab", GetVocab); + } + + virtual size_t GetVocabSize() const override { + PYBIND11_OVERLOAD_NAME( + size_t, FastWordPiece, "get_vocab_size", GetVocabSize); + } + + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override { + PYBIND11_OVERLOAD_NAME(std::vector, + FastWordPiece, + "save", + Save, + folder, + filename_prefix); + } +}; + +class PyBPE : public models::BPE { + using BPE::BPE; + virtual 
std::vector Tokenize( + const std::string& tokens) override { + PYBIND11_OVERLOAD_NAME( + std::vector, BPE, "tokenize", Tokenize, tokens); + } + + virtual bool TokenToId(const std::string& token, + uint32_t* id) const override { + PYBIND11_OVERLOAD_NAME(bool, BPE, "token_to_id", TokenToId, token, id); + } + + virtual bool IdToToken(uint32_t id, std::string* token) const override { + PYBIND11_OVERLOAD_NAME(bool, BPE, "id_to_token", IdToToken, id, token); + } + + virtual core::Vocab GetVocab() const override { + PYBIND11_OVERLOAD_NAME(core::Vocab, BPE, "get_vocab", GetVocab); + } + + virtual size_t GetVocabSize() const override { + PYBIND11_OVERLOAD_NAME(size_t, BPE, "get_vocab_size", GetVocabSize); + } + + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override { + PYBIND11_OVERLOAD_NAME( + std::vector, BPE, "save", Save, folder, filename_prefix); + } +}; + +class PyUnigram : public models::Unigram { + using Unigram::Unigram; + virtual std::vector Tokenize( + const std::string& tokens) override { + PYBIND11_OVERLOAD_NAME( + std::vector, Unigram, "tokenize", Tokenize, tokens); + } + + virtual bool TokenToId(const std::string& token, + uint32_t* id) const override { + PYBIND11_OVERLOAD_NAME(bool, Unigram, "token_to_id", TokenToId, token, id); + } + + virtual bool IdToToken(uint32_t id, std::string* token) const override { + PYBIND11_OVERLOAD_NAME(bool, Unigram, "id_to_token", IdToToken, id, token); + } + + virtual core::Vocab GetVocab() const override { + PYBIND11_OVERLOAD_NAME(core::Vocab, Unigram, "get_vocab", GetVocab); + } + + virtual size_t GetVocabSize() const override { + PYBIND11_OVERLOAD_NAME(size_t, Unigram, "get_vocab_size", GetVocabSize); + } + + virtual std::vector Save( + const std::string& folder, + const std::string& filename_prefix) const override { + PYBIND11_OVERLOAD_NAME(std::vector, + Unigram, + "save", + Save, + folder, + filename_prefix); + } +}; + +void BindModels(pybind11::module* m) { + auto submodule = m->def_submodule("models", "The models module"); + py::class_(submodule, "Model") + .def(py::init<>()) + .def("tokenize", &models::Model::Tokenize) + .def("token_to_id", &models::Model::TokenToId) + .def("id_to_token", &models::Model::IdToToken) + .def("get_vocab", &models::Model::GetVocab) + .def("get_vocab_size", &models::Model::GetVocabSize) + .def("save", &models::Model::Save); + py::class_(submodule, "WordPiece") + .def(py::init<>()) + .def(py::init(), + py::arg("vocab"), + py::arg("unk_token") = "[UNK]", + py::arg("max_input_chars_per_word") = 100, + py::arg("continuing_subword_prefix") = "##", + py::arg("handle_chinese_chars") = true) + .def("tokenize", &models::WordPiece::Tokenize) + .def("token_to_id", + [](const models::WordPiece& wordpiece, const std::string& token) { + uint32_t id; + wordpiece.TokenToId(token, &id); + return id; + }) + .def("id_to_token", + [](const models::WordPiece& wordpiece, uint32_t id) { + std::string token; + wordpiece.IdToToken(id, &token); + return token; + }) + .def("get_vocab", &models::WordPiece::GetVocab) + .def("get_vocab_size", &models::WordPiece::GetVocabSize) + .def_static( + "read_file", &models::WordPiece::GetVocabFromFile, py::arg("vocab")) + .def_static("from_file", + &models::WordPiece::GetWordPieceFromFile, + py::arg("vocab"), + py::arg("unk_token") = "[UNK]", + py::arg("max_input_chars_per_word") = 100, + py::arg("continuing_subword_prefix") = "##") + .def( + "save", + [](const models::WordPiece& wordpiece, + const std::string& folder, + const py::object& py_obj) { 
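+            // A None prefix passed from Python is treated as an empty string before delegating to WordPiece::Save.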
+ std::string prefix = ""; + if (!py_obj.is(py::none())) { + prefix = py_obj.cast(); + } + return wordpiece.Save(folder, prefix); + }, + py::arg("folder"), + py::arg("prefix") = py::none()); + py::class_(submodule, "FastWordPiece") + .def(py::init<>()) + .def(py::init(), + py::arg("vocab"), + py::arg("unk_token") = "[UNK]", + py::arg("max_input_chars_per_word") = 100, + py::arg("continuing_subword_prefix") = "##", + py::arg("with_pretokenization") = false) + .def("tokenize", &models::FastWordPiece::Tokenize) + .def("token_to_id", + [](const models::FastWordPiece& model, const std::string& token) { + uint32_t id; + model.TokenToId(token, &id); + return id; + }) + .def("id_to_token", + [](const models::FastWordPiece& model, uint32_t id) { + std::string token; + model.IdToToken(id, &token); + return token; + }) + .def("get_vocab", &models::FastWordPiece::GetVocab) + .def("get_vocab_size", &models::FastWordPiece::GetVocabSize) + .def_static("read_file", + &models::FastWordPiece::GetVocabFromFile, + py::arg("vocab")) + .def_static("from_file", + &models::FastWordPiece::GetFastWordPieceFromFile, + py::arg("vocab"), + py::arg("unk_token") = "[UNK]", + py::arg("max_input_chars_per_word") = 100, + py::arg("continuing_subword_prefix") = "##", + py::arg("with_pretokenization") = false) + .def( + "save", + [](const models::FastWordPiece& wordpiece, + const std::string& folder, + const py::object& py_obj) { + std::string prefix = ""; + if (!py_obj.is(py::none())) { + prefix = py_obj.cast(); + } + return wordpiece.Save(folder, prefix); + }, + py::arg("folder"), + py::arg("prefix") = py::none()); + py::class_(submodule, "BPE") + .def(py::init([](const py::object& py_vocab, + const py::object& py_merges, + const py::object& py_cache_capacity, + const py::object& py_dropout, + const py::object& py_unk_token, + const py::object& py_continuing_subword_prefix, + const py::object& py_end_of_word_suffix, + const py::object& py_fuse_unk) { + core::Vocab vocab; + if (!py_vocab.is(py::none())) { + vocab = py_vocab.cast(); + } + + core::Merges merges; + if (!py_merges.is(py::none())) { + merges = py_merges.cast(); + } + + size_t cache_capacity = utils::DEFAULT_CACHE_CAPACITY; + if (!py_cache_capacity.is(py::none())) { + cache_capacity = py_cache_capacity.cast(); + } + + std::vector dropout; + if (!py_dropout.is(py::none())) { + dropout.emplace_back(py_dropout.cast()); + } + + std::vector unk_token; + if (!py_unk_token.is(py::none())) { + unk_token.emplace_back(py_unk_token.cast()); + } + + std::vector continuing_subword_prefix; + if (!py_continuing_subword_prefix.is(py::none())) { + continuing_subword_prefix.emplace_back( + py_continuing_subword_prefix.cast()); + } + + std::vector end_of_word_suffix; + if (!py_end_of_word_suffix.is(py::none())) { + end_of_word_suffix.emplace_back( + py_end_of_word_suffix.cast()); + } + + bool fuse_unk = false; + if (!py_fuse_unk.is(py::none())) { + fuse_unk = py_fuse_unk.cast(); + } + models::BPE self(vocab, + merges, + cache_capacity, + dropout, + unk_token, + continuing_subword_prefix, + end_of_word_suffix, + fuse_unk); + return self; + }), + py::arg("vocab") = py::none(), + py::arg("merges") = py::none(), + py::arg("cache_capacity") = py::none(), + py::arg("dropout") = py::none(), + py::arg("unk_token") = py::none(), + py::arg("continuing_subword_prefix") = py::none(), + py::arg("end_of_word_suffix") = py::none(), + py::arg("fuse_unk") = py::none()) + .def("tokenize", &models::BPE::Tokenize) + .def("token_to_id", + [](const models::BPE& model, const std::string& token) { + 
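+             // Look up the id for `token`; the bool success flag returned by TokenToId is not checked here.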
uint32_t id; + model.TokenToId(token, &id); + return id; + }) + .def("id_to_token", + [](const models::BPE& model, uint32_t id) { + std::string token; + model.IdToToken(id, &token); + return token; + }) + .def("get_vocab", &models::BPE::GetVocab) + .def("get_vocab_size", &models::BPE::GetVocabSize) + .def( + "save", + [](const models::BPE& bpe, + const std::string& folder, + const py::object& py_obj) { + std::string prefix = ""; + if (!py_obj.is(py::none())) { + prefix = py_obj.cast(); + } + return bpe.Save(folder, prefix); + }, + py::arg("folder"), + py::arg("prefix") = py::none()) + .def_static( + "read_file", + [](const std::string& vocab_path, const std::string& merges_path) { + core::Vocab vocab; + core::Merges merges; + models::BPE::GetVocabAndMergesFromFile( + vocab_path, merges_path, &vocab, &merges); + return py::make_tuple(vocab, merges); + }, + py::arg("vocab"), + py::arg("merges")) + .def_static( + "from_file", + [](const std::string& vocab_path, + const std::string& merges_path, + const py::kwargs& kwargs) { + core::Vocab vocab; + core::Merges merges; + models::BPE::GetVocabAndMergesFromFile( + vocab_path, merges_path, &vocab, &merges); + VLOG(6) << "In BPE from_file:"; + size_t cache_capacity = utils::DEFAULT_CACHE_CAPACITY; + if (kwargs.contains("cache_capacity")) { + cache_capacity = kwargs["cache_capacity"].cast(); + VLOG(6) << "cache_capacity = " << cache_capacity; + } + std::vector dropout; + if (kwargs.contains("dropout")) { + dropout.emplace_back(kwargs["dropout"].cast()); + VLOG(6) << "dropout = " << kwargs["dropout"].cast(); + } + + std::vector unk_token; + if (kwargs.contains("unk_token")) { + unk_token.emplace_back(kwargs["unk_token"].cast()); + VLOG(6) << "unk_token = " + << kwargs["unk_token"].cast(); + } + + std::vector continuing_subword_prefix; + if (kwargs.contains("continuing_subword_prefix")) { + continuing_subword_prefix.emplace_back( + kwargs["continuing_subword_prefix"].cast()); + VLOG(6) + << "continuing_subword_prefix = " + << kwargs["continuing_subword_prefix"].cast(); + } + + std::vector end_of_word_suffix; + if (kwargs.contains("end_of_word_suffix")) { + end_of_word_suffix.emplace_back( + kwargs["end_of_word_suffix"].cast()); + VLOG(6) << "end_of_word_suffix = " + << kwargs["end_of_word_suffix"].cast(); + } + + bool fuse_unk = false; + if (kwargs.contains("fuse_unk")) { + fuse_unk = kwargs["fuse_unk"].cast(); + VLOG(6) << "fuse_unk = " << kwargs["fuse_unk"].cast(); + } + return models::BPE(vocab, + merges, + cache_capacity, + dropout, + unk_token, + continuing_subword_prefix, + end_of_word_suffix, + fuse_unk); + }, + py::arg("vocab"), + py::arg("merges")); + py::class_(submodule, "Unigram") + .def(py::init([](const py::object& py_vocab_list, + const py::object& py_unk_token_id) { + if (py_vocab_list.is(py::none()) && + py_unk_token_id.is(py::none())) { + return models::Unigram(); + } else if (!py_vocab_list.is(py::none()) && + !py_unk_token_id.is(py::none())) { + try { + core::VocabList vocab_list = + py_vocab_list.cast(); + size_t unk_id = py_unk_token_id.cast(); + return models::Unigram(vocab_list, {unk_id}); + } catch (std::exception& e) { + VLOG(0) << "Init Unigram error:" << e.what(); + goto error; + } + } + error: + throw py::value_error( + "`vocab` and `unk_id` must be both specified"); + }), + py::arg("vocab") = py::none(), + py::arg("unk_id") = py::none()) + .def("tokenize", &models::Unigram::Tokenize) + .def("token_to_id", + [](const models::Unigram& model, const std::string& token) { + uint32_t id; + model.TokenToId(token, &id); + return 
id; + }) + .def("id_to_token", + [](const models::Unigram& model, uint32_t id) { + std::string token; + model.IdToToken(id, &token); + return token; + }) + .def("get_vocab", &models::Unigram::GetVocab) + .def("get_vocab_size", &models::Unigram::GetVocabSize) + .def("set_filter_token", + &models::Unigram::SetFilterToken, + py::arg("filter_token") = "") + .def("set_split_rule", + &models::Unigram::SetSplitRule, + py::arg("split_rule") = "") + .def( + "save", + [](const models::Unigram& unigram, + const std::string& folder, + const py::object& py_obj) { + std::string prefix = ""; + if (!py_obj.is(py::none())) { + prefix = py_obj.cast(); + } + return unigram.Save(folder, prefix); + }, + py::arg("folder"), + py::arg("prefix") = py::none()); +} +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/models.h b/fast_tokenizer/fast_tokenizer/pybind/models.h new file mode 100644 index 0000000000000000000000000000000000000000..ca675e61c61e52b52e49b369af0cfa4500066204 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/models.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindModels(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/normalizers.cc b/fast_tokenizer/fast_tokenizer/pybind/normalizers.cc new file mode 100644 index 0000000000000000000000000000000000000000..9c561b11850322a64498f63107da9066f4bc6b01 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/normalizers.cc @@ -0,0 +1,462 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/normalizers/normalizers.h" +#include +#include "fast_tokenizer/pybind/normalizers.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +class PyNormalizer : public normalizers::Normalizer { +public: + using Normalizer::Normalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_PURE_NAME( + void, Normalizer, "__call__", operator(), mut_str); + } +}; + +class PyBertNormalizer : public normalizers::BertNormalizer { +public: + using BertNormalizer::BertNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, BertNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyReplaceNormalizer : public normalizers::ReplaceNormalizer { +public: + using ReplaceNormalizer::ReplaceNormalizer; + PyReplaceNormalizer(const ReplaceNormalizer& r) : ReplaceNormalizer(r) {} + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, ReplaceNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyStripNormalizer : public normalizers::StripNormalizer { +public: + using StripNormalizer::StripNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, StripNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyStripAccentsNormalizer : public normalizers::StripAccentsNormalizer { +public: + using StripAccentsNormalizer::StripAccentsNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, StripAccentsNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyNFCNormalizer : public normalizers::NFCNormalizer { +public: + using NFCNormalizer::NFCNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, NFCNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyNFKCNormalizer : public normalizers::NFKCNormalizer { +public: + using NFKCNormalizer::NFKCNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, NFKCNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyNFDNormalizer : public normalizers::NFDNormalizer { +public: + using NFDNormalizer::NFDNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, NFDNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyNFKDNormalizer : public normalizers::NFKDNormalizer { +public: + using NFKDNormalizer::NFKDNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, NFKDNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyNmtNormalizer : public normalizers::NmtNormalizer { +public: + using NmtNormalizer::NmtNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, NmtNormalizer, "__call__", operator(), mut_str); + } +}; + +class PySequenceNormalizer : public normalizers::SequenceNormalizer { +public: + using SequenceNormalizer::SequenceNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, SequenceNormalizer, "__call__", operator(), mut_str); + } +}; 
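+// The remaining Py*Normalizer wrappers below follow the same trampoline pattern: operator() is routed to a Python-side __call__ override via PYBIND11_OVERLOAD_NAME.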
+ +class PyLowercaseNormalizer : public normalizers::LowercaseNormalizer { +public: + using LowercaseNormalizer::LowercaseNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, LowercaseNormalizer, "__call__", operator(), mut_str); + } +}; + +class PyPrecompiledNormalizer : public normalizers::PrecompiledNormalizer { +public: + using PrecompiledNormalizer::PrecompiledNormalizer; + virtual void operator()( + normalizers::NormalizedString* mut_str) const override { + PYBIND11_OVERLOAD_NAME( + void, PrecompiledNormalizer, "__call__", operator(), mut_str); + } +}; + +void BindNormalizers(pybind11::module* m) { + auto submodule = m->def_submodule("normalizers", "The normalizers module"); + py::class_(submodule, "NormalizedString") + .def(py::init()) + .def(py::init<>()) + .def("__str__", &normalizers::NormalizedString::GetStr); + py::class_(submodule, "Normalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::Normalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::Normalizer::operator()); + py::class_(submodule, + "BertNormalizer") + .def(py::init(), + py::arg("clean_text") = true, + py::arg("handle_chinese_chars") = true, + py::arg("strip_accents") = true, + py::arg("lowercase") = true) + .def(py::init([](bool clean_text, + bool handle_chinese_chars, + const py::object& strip_accents_obj, + bool lowercase) { + bool strip_accents = lowercase; + if (!strip_accents_obj.is(py::none())) { + strip_accents = strip_accents_obj.cast(); + } + return std::unique_ptr( + new normalizers::BertNormalizer(clean_text, + handle_chinese_chars, + strip_accents, + lowercase)); + }), + py::arg("clean_text") = true, + py::arg("handle_chinese_chars") = true, + py::arg("strip_accents") = true, + py::arg("lowercase") = true) + .def("normalize_str", + [](const normalizers::BertNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::BertNormalizer::operator()) + .def("__getstate__", [](const normalizers::BertNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + + py::class_( + submodule, "ReplaceNormalizer") + .def(py::init(), + py::arg("replace_normalizer")) + .def(py::init(), + py::arg("pattern"), + py::arg("content")) + .def("normalize_str", + [](const normalizers::ReplaceNormalizer& self, + const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::ReplaceNormalizer::operator()) + .def("__getstate__", [](const normalizers::ReplaceNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + + py::class_(submodule, + "StripNormalizer") + .def(py::init(), + py::arg("left") = true, + py::arg("right") = true) + .def( + "normalize_str", + [](const normalizers::StripNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::StripNormalizer::operator()) + .def("__getstate__", [](const normalizers::StripNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_( + submodule, "StripAccentsNormalizer") + 
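+      // Same binding pattern as the normalizers above: normalize_str applies the normalizer to a fresh NormalizedString, and __getstate__ serializes the normalizer to its JSON config string.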
.def(py::init<>()) + .def("normalize_str", + [](const normalizers::StripAccentsNormalizer& self, + const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::StripAccentsNormalizer::operator()) + .def("__getstate__", [](const normalizers::StripAccentsNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_(submodule, + "NFCNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::NFCNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::NFCNormalizer::operator()) + .def("__getstate__", [](const normalizers::NFCNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_(submodule, + "NFDNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::NFDNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::NFDNormalizer::operator()) + .def("__getstate__", [](const normalizers::NFDNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_(submodule, + "NFKCNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::NFKCNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::NFKCNormalizer::operator()) + .def("__getstate__", [](const normalizers::NFKCNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_(submodule, + "NFKDNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::NFKDNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::NFKDNormalizer::operator()) + .def("__getstate__", [](const normalizers::NFKDNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_(submodule, + "NmtNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::NmtNormalizer& self, const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::NmtNormalizer::operator()) + .def("__getstate__", [](const normalizers::NmtNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_( + submodule, "LowercaseNormalizer") + .def(py::init<>()) + .def("normalize_str", + [](const normalizers::LowercaseNormalizer& self, + const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::LowercaseNormalizer::operator()) + .def("__getstate__", [](const normalizers::LowercaseNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_( + submodule, "SequenceNormalizer") + .def( + py::init([](const py::list& py_list) { + normalizers::Normalizer* normalizer_ptr; + std::vector normalizers; + for (py::handle py_normalizer : py_list) { + if (pybind11::type::of(py_normalizer) + 
.is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of< + normalizers::ReplaceNormalizer>())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of< + normalizers::SequenceNormalizer>())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of< + normalizers::StripAccentsNormalizer>())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of< + normalizers::StripNormalizer>())) { + normalizer_ptr = + py_normalizer.cast(); + } else if (pybind11::type::of(py_normalizer) + .is(py::type::of< + normalizers::PrecompiledNormalizer>())) { + normalizer_ptr = + py_normalizer.cast(); + } else { + throw py::value_error( + "Type of normalizers should be one of " + "`LowercaseNormalizer`," + " `BertNormalizer`, `NFCNormalizer`, `NFKCNormalizer`, " + "`NFDNormalizer`," + " `NFKDNormalizer`, `NmtNormalizer`, `ReplaceNormalizer`, " + "`SequenceNormalizer`," + " `StripAccentsNormalizer`, `StripNormalizer`, " + "`PrecompiledNormalizer`"); + } + normalizers.push_back(normalizer_ptr); + } + return normalizers::SequenceNormalizer(normalizers); + }), + py::arg("normalizers")) + .def("normalize_str", + [](const normalizers::SequenceNormalizer& self, + const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::SequenceNormalizer::operator()) + .def("__getstate__", [](const normalizers::SequenceNormalizer& self) { + nlohmann::json j = self; + return j.dump(); + }); + py::class_( + submodule, "PrecompiledNormalizer") + .def(py::init<>()) + .def(py::init(), py::arg("precompiled_charsmap")) + .def("normalize_str", + [](const normalizers::PrecompiledNormalizer& self, + const std::string& str) { + normalizers::NormalizedString normalized(str); + self(&normalized); + return normalized.GetStr(); + }, + py::arg("sequence")) + .def("__call__", &normalizers::PrecompiledNormalizer::operator()); +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/normalizers.h b/fast_tokenizer/fast_tokenizer/pybind/normalizers.h new file mode 100644 index 0000000000000000000000000000000000000000..64cd9b6e2ed486f9ef6ba7bcc74512aba5c00b9d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/normalizers.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindNormalizers(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/postprocessors.cc b/fast_tokenizer/fast_tokenizer/pybind/postprocessors.cc new file mode 100644 index 0000000000000000000000000000000000000000..6795c125481afc1fa6aff5fd7599cf13fe94d7a6 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/postprocessors.cc @@ -0,0 +1,418 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/pybind/postprocessors.h" +#include "fast_tokenizer/pybind/utils.h" +#include "glog/logging.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +class PyPostProcessor : public postprocessors::PostProcessor { +public: + using PostProcessor::PostProcessor; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override { + PYBIND11_OVERLOAD_PURE_NAME(void, + PostProcessor, + "__call__", + operator(), + encoding, + pair_encoding, + add_special_tokens, + result_encoding); + } + virtual size_t AddedTokensNum(bool is_pair) const override { + PYBIND11_OVERLOAD_PURE_NAME(size_t, + PostProcessor, + "num_special_tokens_to_add", + AddedTokensNum, + is_pair); + } +}; + +class PyBertPostProcessor : public postprocessors::BertPostProcessor { +public: + using BertPostProcessor::BertPostProcessor; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override { + PYBIND11_OVERLOAD_NAME(void, + BertPostProcessor, + "__call__", + operator(), + encoding, + pair_encoding, + add_special_tokens, + result_encoding); + } + virtual size_t AddedTokensNum(bool is_pair) const override { + PYBIND11_OVERLOAD_NAME(size_t, + BertPostProcessor, + "num_special_tokens_to_add", + AddedTokensNum, + is_pair); + } +}; + +class PyTemplatePostProcessor : public postprocessors::TemplatePostProcessor { +public: + using TemplatePostProcessor::TemplatePostProcessor; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override { + PYBIND11_OVERLOAD_NAME(void, + TemplatePostProcessor, + "__call__", + operator(), 
+ encoding, + pair_encoding, + add_special_tokens, + result_encoding); + } + virtual size_t AddedTokensNum(bool is_pair) const override { + PYBIND11_OVERLOAD_NAME(size_t, + TemplatePostProcessor, + "num_special_tokens_to_add", + AddedTokensNum, + is_pair); + } +}; + +class PyRobertaPostProcessor : public postprocessors::RobertaPostProcessor { +public: + using RobertaPostProcessor::RobertaPostProcessor; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override { + PYBIND11_OVERLOAD_NAME(void, + RobertaPostProcessor, + "__call__", + operator(), + encoding, + pair_encoding, + add_special_tokens, + result_encoding); + } + virtual size_t AddedTokensNum(bool is_pair) const override { + PYBIND11_OVERLOAD_NAME(size_t, + RobertaPostProcessor, + "num_special_tokens_to_add", + AddedTokensNum, + is_pair); + } +}; + +class PyByteLevelPostProcessor : public postprocessors::ByteLevelPostProcessor { +public: + using ByteLevelPostProcessor::ByteLevelPostProcessor; + virtual void operator()(core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens, + core::Encoding* result_encoding) const override { + PYBIND11_OVERLOAD_NAME(void, + ByteLevelPostProcessor, + "__call__", + operator(), + encoding, + pair_encoding, + add_special_tokens, + result_encoding); + } + virtual size_t AddedTokensNum(bool is_pair) const override { + PYBIND11_OVERLOAD_NAME(size_t, + ByteLevelPostProcessor, + "num_special_tokens_to_add", + AddedTokensNum, + is_pair); + } +}; + +void BindPostProcessors(pybind11::module* m) { + auto submodule = + m->def_submodule("postprocessors", "The postprocessors module"); + py::class_(submodule, + "PostProcessor") + .def(py::init<>()) + .def("num_special_tokens_to_add", + &postprocessors::PostProcessor::AddedTokensNum, + py::arg("is_pair")) + .def("__call__", + [](const postprocessors::PostProcessor& self, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens) { + core::Encoding result_encoding; + self( + encoding, pair_encoding, add_special_tokens, &result_encoding); + return result_encoding; + }, + py::arg("encoding"), + py::arg("pair_encoding"), + py::arg("add_special_tokens")); + py::class_( + submodule, "BertPostProcessor") + .def(py::init<>()) + .def(py::init&, + const std::pair&>(), + py::arg("sep"), + py::arg("cls")) + .def("num_special_tokens_to_add", + &postprocessors::BertPostProcessor::AddedTokensNum, + py::arg("is_pair")) + .def("__call__", + [](const postprocessors::BertPostProcessor& self, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens) { + core::Encoding result_encoding; + self( + encoding, pair_encoding, add_special_tokens, &result_encoding); + return result_encoding; + }, + py::arg("encoding"), + py::arg("pair_encoding"), + py::arg("add_special_tokens")); + + // For Template Processing + py::class_(submodule, "SpecialToken") + .def(py::init<>()) + .def(py::init&, + const std::vector&>(), + py::arg("id"), + py::arg("ids"), + py::arg("tokens")) + .def(py::init(), + py::arg("token"), + py::arg("id")); + + py::class_(submodule, "Template") + .def(py::init<>()) + .def(py::init(), py::arg("template")) + .def(py::init&>(), py::arg("pieces")) + .def(py::init&>(), + py::arg("pieces")); + + py::class_( + submodule, "TemplatePostProcessor") + .def( + py::init([](const py::object& single_obj, + const py::object& pair_obj, + const py::object& special_tokens_obj) { + 
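+            // Factory lambda for TemplatePostProcessor: `single` and `pair`
+            // accept either a template string or a list of template pieces,
+            // while `special_tokens` accepts a list of (token, id) /
+            // (id, token) tuples or dicts with "id", "ids" and "tokens"
+            // keys; any other type raises py::value_error.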
postprocessors::TemplatePostProcessor self; + // Setting single + if (py::isinstance(single_obj)) { + std::vector template_piece = + CastPyArg2VectorOfStr(single_obj.ptr(), 0); + self.UpdateSinglePieces(template_piece); + } else if (py::isinstance(single_obj)) { + self.UpdateSinglePieces( + CastPyArg2AttrString(single_obj.ptr(), 0)); + } else { + throw py::value_error( + "Type of args single need to be List[str] or str."); + } + // Setting pair + if (py::isinstance(pair_obj)) { + std::vector template_piece = + CastPyArg2VectorOfStr(pair_obj.ptr(), 0); + self.UpdatePairPieces(template_piece); + } else if (py::isinstance(pair_obj)) { + self.UpdatePairPieces(CastPyArg2AttrString(pair_obj.ptr(), 0)); + } else { + throw py::value_error( + "Type of args pair need to be List[str] or str."); + } + // Setting special_tokens + if (py::isinstance(special_tokens_obj)) { + std::vector special_tokens; + for (auto& str : special_tokens_obj.cast()) { + if (py::isinstance(str)) { + auto token_tuple = str.cast(); + uint32_t id; + std::string token; + if (token_tuple.size() == 2) { + if (py::isinstance(token_tuple[0]) && + py::isinstance(token_tuple[1])) { + token = token_tuple[0].cast(); + id = token_tuple[1].cast(); + } else if (py::isinstance(token_tuple[1]) && + py::isinstance(token_tuple[0])) { + token = token_tuple[1].cast(); + id = token_tuple[0].cast(); + } else { + throw py::value_error( + "`Tuple` with both a token and its associated ID, in " + "any order"); + } + special_tokens.emplace_back(token, id); + } else { + throw py::value_error( + "Type of args special_tokens need to be " + "List[Union[Tuple[int, str], Tuple[str, int], dict]]"); + } + } else if (py::isinstance(str)) { + auto token_dict = str.cast(); + std::string id; + std::vector ids; + std::vector tokens; + if (token_dict.contains("id") && + py::isinstance(token_dict["id"])) { + id = token_dict["id"].cast(); + } else { + throw py::value_error( + "Type of args special_tokens dict need to have key 'id'" + "and the respective value should be `str`"); + } + if (token_dict.contains("ids") && + py::isinstance(token_dict["ids"])) { + for (auto py_id : token_dict["ids"].cast()) { + if (py::isinstance(py_id)) { + ids.push_back(py_id.cast()); + } else { + throw py::value_error( + "Type of args special_tokens dict need to have key " + "'ids'" + "and the respective value should be List[int]"); + } + } + } else { + throw py::value_error( + "Type of args special_tokens dict need to have key " + "'ids'" + "and the respective value should be List[int]"); + } + if (token_dict.contains("tokens") && + py::isinstance(token_dict["tokens"])) { + for (auto& py_token : + token_dict["tokens"].cast()) { + if (py::isinstance(py_token)) { + tokens.push_back(py_token.cast()); + } else { + throw py::value_error( + "Type of args special_tokens dict need to have key " + "'tokens'" + "and the respective value should be List[str]"); + } + } + } else { + throw py::value_error( + "Type of args special_tokens dict need to have key " + "'tokens'" + "and the respective value should be List[str]"); + } + special_tokens.emplace_back(id, ids, tokens); + } else { + throw py::value_error( + "Type of args special_tokens need to be " + "List[Union[Tuple[int, str], Tuple[str, int], dict]]"); + } + } + self.SetTokensMap(special_tokens); + } else { + throw py::value_error( + "Type of args special_tokens need to be " + "List[Union[Tuple[int, str], Tuple[str, int], dict]]"); + } + return self; + }), + py::arg("single"), + py::arg("pair"), + py::arg("special_tokens")) + 
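+      // The remaining TemplatePostProcessor bindings mirror the other
+      // post-processors: `num_special_tokens_to_add` forwards to
+      // AddedTokensNum, and `__call__` runs the post-processing step and
+      // returns a new core::Encoding built from `encoding` and the optional
+      // `pair_encoding`.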
.def("num_special_tokens_to_add", + &postprocessors::TemplatePostProcessor::AddedTokensNum, + py::arg("is_pair")) + .def("__call__", + [](const postprocessors::TemplatePostProcessor& self, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens) { + core::Encoding result_encoding; + self( + encoding, pair_encoding, add_special_tokens, &result_encoding); + return result_encoding; + }, + py::arg("encoding"), + py::arg("pair_encoding"), + py::arg("add_special_tokens")); + + py::class_( + submodule, "RobertaPostProcessor") + .def(py::init<>()) + .def(py::init&, + const std::pair&, + bool, + bool>(), + py::arg("sep"), + py::arg("cls"), + py::arg("trim_offsets") = true, + py::arg("add_prefix_space") = true) + .def("num_special_tokens_to_add", + &postprocessors::RobertaPostProcessor::AddedTokensNum, + py::arg("is_pair")) + .def("__call__", + [](const postprocessors::RobertaPostProcessor& self, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens) { + core::Encoding result_encoding; + self( + encoding, pair_encoding, add_special_tokens, &result_encoding); + return result_encoding; + }, + py::arg("encoding"), + py::arg("pair_encoding"), + py::arg("add_special_tokens")); + py::class_( + submodule, "ByteLevelPostProcessor") + .def(py::init(), + py::arg("add_prefix_space") = true, + py::arg("trim_offsets") = true, + py::arg("use_regex") = true) + .def("num_special_tokens_to_add", + &postprocessors::ByteLevelPostProcessor::AddedTokensNum, + py::arg("is_pair")) + .def("__call__", + [](const postprocessors::ByteLevelPostProcessor& self, + core::Encoding* encoding, + core::Encoding* pair_encoding, + bool add_special_tokens) { + core::Encoding result_encoding; + self( + encoding, pair_encoding, add_special_tokens, &result_encoding); + return result_encoding; + }, + py::arg("encoding"), + py::arg("pair_encoding"), + py::arg("add_special_tokens")); + +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/postprocessors.h b/fast_tokenizer/fast_tokenizer/pybind/postprocessors.h new file mode 100644 index 0000000000000000000000000000000000000000..b30b31a951ee6caac96a55f7686d0a781ca6c59e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/postprocessors.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindPostProcessors(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.cc b/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.cc new file mode 100644 index 0000000000000000000000000000000000000000..efb9ae77446d993992cd73629085ae598a7f1631 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.cc @@ -0,0 +1,265 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + + +#include "fast_tokenizer/pretokenizers/pretokenizers.h" + +#include + +#include "fast_tokenizer/pybind/pretokenizers.h" +#include "re2/re2.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +class PyPreTokenizer : public pretokenizers::PreTokenizer { +public: + using PreTokenizer::PreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_PURE_NAME( + void, PreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PyWhitespacePreTokenizer : public pretokenizers::WhitespacePreTokenizer { +public: + using WhitespacePreTokenizer::WhitespacePreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, WhitespacePreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PyWhitespaceAndPunctuationPreTokenizer + : public pretokenizers::WhitespaceAndPunctuationPreTokenizer { +public: + using WhitespaceAndPunctuationPreTokenizer:: + WhitespaceAndPunctuationPreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME(void, + WhitespaceAndPunctuationPreTokenizer, + "__call__", + operator(), + pretokenized); + } +}; + +class PyBertPreTokenizer : public pretokenizers::BertPreTokenizer { +public: + using BertPreTokenizer::BertPreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, BertPreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PyMetaSpacePreTokenizer : public pretokenizers::MetaSpacePreTokenizer { +public: + using MetaSpacePreTokenizer::MetaSpacePreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, MetaSpacePreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PySequencePreTokenizer : public pretokenizers::SequencePreTokenizer { +public: + using SequencePreTokenizer::SequencePreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, SequencePreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PyByteLevelPreTokenizer : public pretokenizers::ByteLevelPreTokenizer { +public: + using ByteLevelPreTokenizer::ByteLevelPreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, ByteLevelPreTokenizer, "__call__", operator(), pretokenized); + } +}; + +class PySplitPreTokenizer : public pretokenizers::SplitPreTokenizer { +public: + using SplitPreTokenizer::SplitPreTokenizer; + virtual void operator()( + pretokenizers::PreTokenizedString* pretokenized) const override { + PYBIND11_OVERLOAD_NAME( + void, SplitPreTokenizer, "__call__", operator(), pretokenized); + } +}; + 
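+// The Py*PreTokenizer classes above are pybind11 "trampoline" classes: each
+// one overrides operator() and uses PYBIND11_OVERLOAD_NAME /
+// PYBIND11_OVERLOAD_PURE_NAME so that Python subclasses overriding
+// `__call__` are dispatched correctly from C++.
+// Illustrative Python usage (assuming the `pretokenizers` submodule of the
+// compiled core_tokenizers extension has been imported; exact packaging may
+// differ):
+//   pretok = pretokenizers.WhitespacePreTokenizer()
+//   pretokenized = pretokenizers.PreTokenizedString("hello world")
+//   pretok(pretokenized)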
+void BindPreTokenizers(pybind11::module* m) { + auto sub_module = + m->def_submodule("pretokenizers", "The pretokenizers module"); + py::class_(sub_module, "StringSplit") + .def(py::init(), + py::arg("nomalized_text")) + .def(py::init&>(), + py::arg("nomalized_text"), + py::arg("tokens")) + .def_readwrite("normalized", &pretokenizers::StringSplit::normalized_) + .def_readwrite("tokens", &pretokenizers::StringSplit::tokens_); + py::class_(sub_module, + "PreTokenizedString") + .def(py::init<>()) + .def(py::init(), py::arg("raw_text")) + .def(py::init(), + py::arg("nomalized_text")) + .def("get_string_split", &pretokenizers::PreTokenizedString::GetSplit) + .def("get_string_splits_size", + &pretokenizers::PreTokenizedString::GetSplitsSize) + .def("get_original_text", + &pretokenizers::PreTokenizedString::GetOriginStr) + .def( + "get_splits", + [](const pretokenizers::PreTokenizedString& self, + const std::string& offset_referential, + const std::string& offset_type) { + bool is_original = true; + if (offset_referential != "original") { + is_original = false; + } + core::OffsetType type = core::OffsetType::CHAR; + if (offset_type != "char") { + type = core::OffsetType::BYTE; + } + return self.GetSplits(is_original, type); + }, + py::arg("offset_referential") = "original", + py::arg("offset_type") = "char") + .def("to_encoding", + [](const pretokenizers::PreTokenizedString& self, + const std::vector& word_idx, + uint32_t type_id, + core::OffsetType offset_type) { + core::Encoding encoding; + self.TransformToEncoding( + word_idx, type_id, offset_type, &encoding); + return encoding; + }); + py::class_(sub_module, + "PreTokenizer") + .def(py::init<>()) + .def("__call__", &pretokenizers::PreTokenizer::operator()); + py::class_( + sub_module, "WhitespacePreTokenizer") + .def(py::init<>()) + .def("__call__", &pretokenizers::WhitespacePreTokenizer::operator()); + py::class_( + sub_module, "WhitespaceAndPunctuationPreTokenizer") + .def(py::init<>()) + .def("__call__", + &pretokenizers::WhitespaceAndPunctuationPreTokenizer::operator()); + py::class_( + sub_module, "BertPreTokenizer") + .def(py::init<>()) + .def("__call__", &pretokenizers::BertPreTokenizer::operator()); + py::class_( + sub_module, "MetaSpacePreTokenizer") + .def(py::init(), + py::arg("replacement") = "_", + py::arg("add_prefix_space") = true) + .def("__call__", &pretokenizers::MetaSpacePreTokenizer::operator()); + py::class_( + sub_module, "SequencePreTokenizer") + .def( + py::init([](const py::list& py_list) { + pretokenizers::PreTokenizer* pretokenizer_ptr; + std::vector pretokenizers; + for (py::handle py_pretokenizer : py_list) { + if (pybind11::type::of(py_pretokenizer) + .is(py::type::of())) { + pretokenizer_ptr = + py_pretokenizer.cast(); + } else if (pybind11::type::of(py_pretokenizer) + .is(py::type::of< + pretokenizers::MetaSpacePreTokenizer>())) { + pretokenizer_ptr = + py_pretokenizer + .cast(); + } else if (pybind11::type::of(py_pretokenizer) + .is(py::type::of< + pretokenizers::SequencePreTokenizer>())) { + pretokenizer_ptr = + py_pretokenizer + .cast(); + } else if (pybind11::type::of(py_pretokenizer) + .is(py::type::of< + pretokenizers::WhitespacePreTokenizer>())) { + pretokenizer_ptr = + py_pretokenizer + .cast(); + } else if (pybind11::type::of(py_pretokenizer) + .is(py::type::of< + pretokenizers:: + WhitespaceAndPunctuationPreTokenizer>())) { + pretokenizer_ptr = py_pretokenizer.cast< + pretokenizers::WhitespaceAndPunctuationPreTokenizer*>(); + } else if (pybind11::type::of(py_pretokenizer) + .is(py::type::of< + 
pretokenizers::ByteLevelPreTokenizer>())) { + pretokenizer_ptr = + py_pretokenizer + .cast(); + } else if (py::type::of(py_pretokenizer) + .is(py::type::of< + pretokenizers::SplitPreTokenizer>())) { + pretokenizer_ptr = + py_pretokenizer.cast(); + } else { + throw py::value_error( + "Type of normalizers should be one of `BertPreTokenizer`," + " `MetaSpacePreTokenizer`, `SequencePreTokenizer`," + " `WhitespacePreTokenizer`, `ByteLevelPreTokenizer`," + " `WhitespaceAndPunctuationPreTokenizer`, " + "`SplitPreTokenizer`"); + } + pretokenizers.push_back(pretokenizer_ptr); + } + return pretokenizers::SequencePreTokenizer(pretokenizers); + }), + py::arg("pretokenizers")) + .def("__call__", &pretokenizers::SequencePreTokenizer::operator()); + py::class_( + sub_module, "ByteLevelPreTokenizer") + .def(py::init(), + py::arg("add_prefix_space") = true, + py::arg("use_regex") = true) + .def("__call__", &pretokenizers::ByteLevelPreTokenizer::operator()); + py::class_( + sub_module, "SplitPreTokenizer") + .def(py::init(), + py::arg("pattern"), + py::arg("split_mode"), + py::arg("invert")) + .def("__call__", &pretokenizers::SplitPreTokenizer::operator()); +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.h b/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.h new file mode 100644 index 0000000000000000000000000000000000000000..ffe7b1adf83d0d5e5fef7e3274b1dfb71eedc069 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/pretokenizers.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindPreTokenizers(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/pybind.cc b/fast_tokenizer/fast_tokenizer/pybind/pybind.cc new file mode 100644 index 0000000000000000000000000000000000000000..3f7b32f56d88dd1b65ae0a69199d24b284b4e6e1 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/pybind.cc @@ -0,0 +1,51 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ +#include + +#include +#include + +#include "fast_tokenizer/pybind/core.h" +#include "fast_tokenizer/pybind/decoders.h" +#include "fast_tokenizer/pybind/models.h" +#include "fast_tokenizer/pybind/normalizers.h" +#include "fast_tokenizer/pybind/postprocessors.h" +#include "fast_tokenizer/pybind/pretokenizers.h" +#include "fast_tokenizer/pybind/tokenizers.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +PYBIND11_MODULE(core_tokenizers, m) { + m.doc() = "The paddlenlp fast_tokenizer core module."; + // 1. Bind normalizers submodule + BindNormalizers(&m); + // 2. Bind pre_tokenizers submodule + BindPreTokenizers(&m); + // 3. Bind models submodule + BindModels(&m); + // 4. Bind processors submodule + BindPostProcessors(&m); + // 5. Bind tokenizers submodule + BindTokenizers(&m); + // 6. Bind core + BindCore(&m); + // 7. Bind decoder submodule + BindDecoders(&m); +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/tokenizers.cc b/fast_tokenizer/fast_tokenizer/pybind/tokenizers.cc new file mode 100644 index 0000000000000000000000000000000000000000..69902251a2ced6bc69de577213079e3fcd0ca034 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/tokenizers.cc @@ -0,0 +1,1428 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "fast_tokenizer/pybind/tokenizers.h" + +#include + +#include + +#include "fast_tokenizer/core/tokenizer.h" +#include "fast_tokenizer/decoders/decoders.h" +#include "fast_tokenizer/models/models.h" +#include "fast_tokenizer/normalizers/normalizers.h" +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include "fast_tokenizer/pretokenizers/pretokenizers.h" +#include "fast_tokenizer/pybind/exception.h" +#include "fast_tokenizer/pybind/utils.h" +#include "glog/logging.h" + +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +PyTypeObject* p_tokenizer_type; // For Tokenizer + +PyNumberMethods number_methods; +PySequenceMethods sequence_methods; +PyMappingMethods mapping_methods; + +typedef struct { + PyObject_HEAD core::Tokenizer tokenizer; + // Weak references + PyObject* weakrefs; +} TokenizerObject; + +static PyObject* TokenizerPropertiesGetNormaizer(TokenizerObject* self, + void* closure) { + py::object py_obj = py::cast(self->tokenizer.GetNormalizerPtr()); + py_obj.inc_ref(); + return py_obj.ptr(); +} + +static int TokenizerPropertiesSetNormalizer(TokenizerObject* self, + PyObject* value, + void* closure) { + TOKENIZERS_TRY + py::handle py_obj(value); + int ret = 0; + if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = + py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = + py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = + py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = + py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& normalizer = + py_obj.cast(); + self->tokenizer.SetNormalizer(normalizer); + } else if (py_obj.is(py::none())) { + self->tokenizer.ReleaseNormaizer(); + } else { + ret = 1; + throw std::runtime_error("Need to assign the object of Normalizer"); + } + return ret; + TOKENIZERS_CATCH_AND_THROW_RETURN_NEG +} + +static PyObject* TokenizerPropertiesGetPreTokenizer(TokenizerObject* self, + void* closure) { + py::object py_obj = py::cast(self->tokenizer.GetPreTokenizer()); + py_obj.inc_ref(); + return py_obj.ptr(); +} + +static int 
TokenizerPropertiesSetPreTokenizer(TokenizerObject* self, + PyObject* value, + void* closure) { + TOKENIZERS_TRY + py::handle py_obj(value); + int ret = 0; + if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of< + pretokenizers::WhitespaceAndPunctuationPreTokenizer>())) { + const auto& pretokenizer = + py_obj + .cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& pretokenizer = + py_obj.cast(); + self->tokenizer.SetPreTokenizer(pretokenizer); + } else if (py_obj.is(py::none())) { + self->tokenizer.ReleasePreTokenizer(); + } else { + ret = 1; + throw std::runtime_error("Need to assign the object of PreTokenizer"); + } + return ret; + TOKENIZERS_CATCH_AND_THROW_RETURN_NEG +} + +static PyObject* TokenizerPropertiesGetModel(TokenizerObject* self, + void* closure) { + py::object py_obj = py::cast(self->tokenizer.GetModelPtr()); + py_obj.inc_ref(); + return py_obj.ptr(); +} + +static int TokenizerPropertiesSetModel(TokenizerObject* self, + PyObject* value, + void* closure) { + TOKENIZERS_TRY + py::handle py_obj(value); + int ret = 0; + if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + self->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& model = py_obj.cast(); + self->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + self->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + self->tokenizer.SetModel(model); + } else { + ret = 1; + throw std::runtime_error("Need to assign the object of Model"); + } + return ret; + TOKENIZERS_CATCH_AND_THROW_RETURN_NEG +} + +static PyObject* TokenizerPropertiesGetPostProcessor(TokenizerObject* self, + void* closure) { + py::object py_obj = py::cast(self->tokenizer.GetPostProcessorPtr()); + py_obj.inc_ref(); + return py_obj.ptr(); +} + +static int TokenizerPropertiesSetPostProcessor(TokenizerObject* self, + PyObject* value, + void* closure) { + TOKENIZERS_TRY + py::handle py_obj(value); + int ret = 0; + if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& processor = + py_obj.cast(); + self->tokenizer.SetPostProcessor(processor); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& processor = + py_obj.cast(); + self->tokenizer.SetPostProcessor(processor); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& processor = + py_obj.cast(); + self->tokenizer.SetPostProcessor(processor); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& processor = + 
py_obj.cast(); + self->tokenizer.SetPostProcessor(processor); + } else if (py_obj.is(py::none())) { + self->tokenizer.ReleasePostProcessor(); + } else { + ret = 1; + throw std::runtime_error("Need to assign the object of PostProcessor"); + } + return ret; + TOKENIZERS_CATCH_AND_THROW_RETURN_NEG +} + +static PyObject* TokenizerPropertiesGetPadding(TokenizerObject* self, + void* closure) { + if (!self->tokenizer.GetUsePadding()) { + Py_RETURN_NONE; + } + auto pad_method = self->tokenizer.GetPadMethod(); + PyObject* py_dict = PyDict_New(); + PyDict_SetItem(py_dict, ToPyObject("pad_id"), ToPyObject(pad_method.pad_id_)); + PyDict_SetItem(py_dict, + ToPyObject("pad_token_type_id"), + ToPyObject(pad_method.pad_token_type_id_)); + PyDict_SetItem( + py_dict, ToPyObject("pad_token"), ToPyObject(pad_method.pad_token_)); + if (pad_method.pad_to_multiple_of_ > 0) { + PyDict_SetItem(py_dict, + ToPyObject("pad_to_multiple_of"), + ToPyObject(pad_method.pad_to_multiple_of_)); + } else { + Py_INCREF(Py_None); + PyDict_SetItem(py_dict, ToPyObject("pad_to_multiple_of"), Py_None); + } + + PyDict_SetItem( + py_dict, + ToPyObject("direction"), + ToPyObject((pad_method.direction_ == core::Direction::RIGHT) ? "right" + : "left")); + if (pad_method.strategy_ == core::PadStrategy::BATCH_LONGEST) { + Py_INCREF(Py_None); + PyDict_SetItem(py_dict, ToPyObject("length"), Py_None); + } else { + PyDict_SetItem( + py_dict, ToPyObject("length"), ToPyObject(pad_method.pad_len_)); + } + return py_dict; +} + +static PyObject* TokenizerPropertiesGetTruncation(TokenizerObject* self, + void* closure) { + if (!self->tokenizer.GetUseTruncation()) { + Py_RETURN_NONE; + } + auto trunc_method = self->tokenizer.GetTruncMethod(); + PyObject* py_dict = PyDict_New(); + PyDict_SetItem( + py_dict, ToPyObject("max_length"), ToPyObject(trunc_method.max_len_)); + PyDict_SetItem( + py_dict, ToPyObject("stride"), ToPyObject(trunc_method.stride_)); + PyDict_SetItem( + py_dict, + ToPyObject("direction"), + ToPyObject((trunc_method.direction_ == core::Direction::RIGHT) ? 
"right" + : "left")); + if (trunc_method.strategy_ == core::TruncStrategy::LONGEST_FIRST) { + PyDict_SetItem( + py_dict, ToPyObject("strategy"), ToPyObject("longest_first")); + } else if (trunc_method.strategy_ == core::TruncStrategy::ONLY_FIRST) { + PyDict_SetItem(py_dict, ToPyObject("strategy"), ToPyObject("only_first")); + } else if (trunc_method.strategy_ == core::TruncStrategy::ONLY_SECOND) { + PyDict_SetItem(py_dict, ToPyObject("strategy"), ToPyObject("only_second")); + } + return py_dict; +} + +static PyObject* TokenizerPropertiesGetDecoder(TokenizerObject* self, + void* closure) { + py::object py_obj = py::cast(self->tokenizer.GetDecoderPtr()); + py_obj.inc_ref(); + return py_obj.ptr(); +} + +static int TokenizerPropertiesSetDecoder(TokenizerObject* self, + PyObject* value, + void* closure) { + TOKENIZERS_TRY + py::handle py_obj(value); + int ret = 0; + if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& decoder = py_obj.cast(); + self->tokenizer.SetDecoder(decoder); + } else if (py_obj.is(py::none())) { + self->tokenizer.ReleaseDecoder(); + } else { + ret = 1; + throw std::runtime_error("Need to assign the object of Decoder"); + } + return ret; + TOKENIZERS_CATCH_AND_THROW_RETURN_NEG +} + +struct PyGetSetDef tokenizer_variable_properties[] = { + {"normalizer", + (getter)TokenizerPropertiesGetNormaizer, + (setter)TokenizerPropertiesSetNormalizer, + nullptr, + nullptr}, + {"pretokenizer", + (getter)TokenizerPropertiesGetPreTokenizer, + (setter)TokenizerPropertiesSetPreTokenizer, + nullptr, + nullptr}, + {"model", + (getter)TokenizerPropertiesGetModel, + (setter)TokenizerPropertiesSetModel, + nullptr, + nullptr}, + {"postprocessor", + (getter)TokenizerPropertiesGetPostProcessor, + (setter)TokenizerPropertiesSetPostProcessor, + nullptr, + nullptr}, + {"padding", + (getter)TokenizerPropertiesGetPadding, + nullptr, + nullptr, + nullptr}, + {"truncation", + (getter)TokenizerPropertiesGetTruncation, + nullptr, + nullptr, + nullptr}, + {"decoder", + (getter)TokenizerPropertiesGetDecoder, + (setter)TokenizerPropertiesSetDecoder, + nullptr, + nullptr}, + {nullptr, nullptr, nullptr, nullptr, nullptr}}; + +PyObject* TokenizerNew(PyTypeObject* type, PyObject* args, PyObject* kwargs) { + PyObject* obj = type->tp_alloc(type, 0); + if (obj) { + auto v = reinterpret_cast(obj); + new (&(v->tokenizer)) core::Tokenizer(); + } + return obj; +} + +static void TokenizerDealloc(TokenizerObject* self) { + if (self->weakrefs != NULL) + PyObject_ClearWeakRefs(reinterpret_cast(self)); + self->tokenizer.~Tokenizer(); + Py_TYPE(self)->tp_free(reinterpret_cast(self)); +} + +int TokenizerInit(PyObject* self, PyObject* args, PyObject* kwargs) { + bool flag_kwargs = false; + if (kwargs) flag_kwargs = true; + // all kwargs + PyObject* kw_model = NULL; + // the keywords argument + static char* kwlist[] = {const_cast("model"), NULL}; + // 'O' Store a Python object (without any conversion) in a C object pointer, + // '|' Indicates that the remaining arguments in the Python argument list are + // optional. 
+ bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_model); + std::unordered_map kws_map{{"model", kw_model}}; + + auto py_tokenizer_ptr = reinterpret_cast(self); + Py_ssize_t args_num = PyTuple_Size(args); + + if (args_num == 1) { + py::object py_obj = + py::reinterpret_borrow(PyTuple_GET_ITEM(args, 0)); + if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + py_tokenizer_ptr->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is( + py::type::of())) { + const auto& model = py_obj.cast(); + py_tokenizer_ptr->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + py_tokenizer_ptr->tokenizer.SetModel(model); + } else if (pybind11::type::of(py_obj).is(py::type::of())) { + const auto& model = py_obj.cast(); + py_tokenizer_ptr->tokenizer.SetModel(model); + } else { + std::ostringstream oss; + oss << "Expected type of arguments is `model`"; + throw std::runtime_error(oss.str()); + } + return 0; + } else if (args_num >= 1) { + std::ostringstream oss; + oss << "Expected number of arguments is 0 or 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + return 1; +} + +// def add_special_tokens(token) +static PyObject* AddSpecialTokens(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_special_tokens = NULL; + static char* kwlist[] = {const_cast("tokens"), NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords( + args, kwargs, "|O", kwlist, &kw_special_tokens); + Py_ssize_t args_num = PyTuple_Size(args); + std::string tokens; + if (args_num == (Py_ssize_t)1) { + if (PyList_Check(kw_special_tokens)) { + std::vector added_tokens; + Py_ssize_t tokens_num = PyList_GET_SIZE(kw_special_tokens); + for (Py_ssize_t i = 0; i < tokens_num; ++i) { + PyObject* obj = PyList_GetItem(kw_special_tokens, i); + if (PyUnicode_Check(obj)) { + added_tokens.push_back( + core::AddedToken(CastPyArg2AttrString(obj, 0), true)); + } else { + py::handle py_obj(obj); + if (!py::type::of(py_obj).is(py::type::of())) { + throw std::runtime_error( + "The argument of tokens should be List[Union[str, " + "AddedToken]]"); + } + auto added_token = py_obj.cast(); + added_tokens.push_back(added_token); + } + } + return ToPyObject(self->tokenizer.AddSpecialTokens(added_tokens)); + } else { + // throw error + throw std::runtime_error( + "Need to pass the string list as to argument tokens"); + } + } else { + // throw error + std::ostringstream oss; + oss << "Expected number of arguments is 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def add_tokens(token) +static PyObject* AddTokens(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_tokens = NULL; + static char* kwlist[] = {const_cast("tokens"), NULL}; + bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_tokens); + Py_ssize_t args_num = PyTuple_Size(args); + std::string tokens; + if (args_num == (Py_ssize_t)1) { + if (PyList_Check(kw_tokens)) { + std::vector added_tokens; + Py_ssize_t tokens_num = PyList_GET_SIZE(kw_tokens); + for (Py_ssize_t i = 0; i < tokens_num; ++i) { + PyObject* obj = PyList_GetItem(kw_tokens, i); + if (PyUnicode_Check(obj)) { + added_tokens.push_back( + core::AddedToken(CastPyArg2AttrString(obj, 0), true)); + } else { + py::handle py_obj(obj); + if (!py::type::of(py_obj).is(py::type::of())) { + throw 
std::runtime_error( + "The argument of tokens should be List[Union[str, " + "AddedToken]]"); + } + auto added_token = py_obj.cast(); + added_tokens.push_back(added_token); + } + } + return ToPyObject(self->tokenizer.AddTokens(added_tokens)); + } else { + throw std::runtime_error( + "Need to pass the string list as to argument tokens"); + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def enable_padding(direction="right", pad_id=0, +// pad_type_id=0, pad_token="[PAD]", +// length=None, ad_to_multiple_of=None) +static PyObject* EnablePadding(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_direction = NULL; + PyObject* kw_pad_id = NULL; + PyObject* kw_pad_type_id = NULL; + PyObject* kw_pad_token = NULL; + PyObject* kw_length = NULL; + PyObject* kw_pad_to_multiple_of = NULL; + bool flag_kwargs = false; + if (kwargs) flag_kwargs = true; + static char* kwlist[] = {const_cast("direction"), + const_cast("pad_id"), + const_cast("pad_type_id"), + const_cast("pad_token"), + const_cast("length"), + const_cast("pad_to_multiple_of"), + NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords(args, + kwargs, + "|OOOOOO", + kwlist, + &kw_direction, + &kw_pad_id, + &kw_pad_type_id, + &kw_pad_token, + &kw_length, + &kw_pad_to_multiple_of); + Py_ssize_t args_num = PyTuple_Size(args); + std::string direction = "right"; + uint32_t pad_id = 0; + uint32_t pad_type_id = 0; + std::string pad_token = "[PAD]"; + uint32_t* length_ptr = nullptr; + uint32_t* pad_to_multiple_of_ptr = nullptr; + uint32_t length = 0; + uint32_t pad_to_multiple_of = 0; + VLOG(6) << "args_num: " << args_num << ", flag_kwargs: " << flag_kwargs; + VLOG(6) << "kw_direction: " << kw_direction; + VLOG(6) << "kw_pad_id: " << kw_pad_id; + VLOG(6) << "kw_pad_type_id: " << kw_pad_type_id; + VLOG(6) << "kw_pad_token: " << kw_pad_token; + VLOG(6) << "kw_length: " << kw_length; + VLOG(6) << "kw_pad_to_multiple_of: " << kw_pad_to_multiple_of; + if (args_num >= (Py_ssize_t)0 && args_num <= (Py_ssize_t)6) { + if ((args_num == 0 && flag_kwargs && kw_direction) || (args_num >= 1)) { + direction = CastPyArg2AttrString(kw_direction, 0); + } + if ((args_num <= 1 && flag_kwargs && kw_pad_id) || (args_num >= 2)) { + pad_id = CastPyArg2AttrSize_t(kw_pad_id, 1); + } + if ((args_num <= 2 && flag_kwargs && kw_pad_type_id) || (args_num >= 3)) { + pad_type_id = CastPyArg2AttrSize_t(kw_pad_type_id, 2); + } + if ((args_num <= 3 && flag_kwargs && kw_pad_token) || (args_num >= 4)) { + pad_token = CastPyArg2AttrString(kw_pad_token, 3); + } + if ((args_num <= 4 && flag_kwargs && kw_length) || (args_num >= 5)) { + if (!(kw_length == Py_None)) { + length = CastPyArg2AttrSize_t(kw_length, 4); + length_ptr = &length; + } + } + if ((args_num <= 5 && flag_kwargs && kw_pad_to_multiple_of) || + (args_num == 6)) { + if (!(kw_pad_to_multiple_of == Py_None)) { + pad_to_multiple_of = CastPyArg2AttrSize_t(kw_pad_to_multiple_of, 5); + pad_to_multiple_of_ptr = &pad_to_multiple_of; + } + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 0 to 6, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + core::Direction pad_direction; + if (direction == "right") { + pad_direction = core::Direction::RIGHT; + } else if (direction == "left") { + pad_direction = core::Direction::LEFT; + } else { + throw std::runtime_error( + "The 
direction args should be \"right\" or \"left\""); + } + self->tokenizer.EnablePadMethod(pad_direction, + pad_id, + pad_type_id, + pad_token, + length_ptr, + pad_to_multiple_of_ptr); + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def disable_padding() +static PyObject* DisablePadding(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + Py_ssize_t args_num = PyTuple_Size(args); + if (args_num == (Py_ssize_t)0) { + self->tokenizer.DisablePadMethod(); + Py_RETURN_NONE; + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 0, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def enable_truncation(max_length, stride=0, strategy="longest_first", +// direction="right") +static PyObject* EnableTruncation(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_max_length = NULL; + PyObject* kw_stride = NULL; + PyObject* kw_strategy = NULL; + PyObject* kw_direction = NULL; + bool flag_kwargs = false; + if (kwargs) flag_kwargs = true; + static char* kwlist[] = {const_cast("max_length"), + const_cast("stride"), + const_cast("strategy"), + const_cast("direction"), + NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords(args, + kwargs, + "|OOOO", + kwlist, + &kw_max_length, + &kw_stride, + &kw_strategy, + &kw_direction); + Py_ssize_t args_num = PyTuple_Size(args); + uint32_t max_length = 0; + uint32_t stride = 0; + std::string strategy = "longest_first"; + std::string direction = "right"; + + if (args_num >= (Py_ssize_t)0 && args_num <= (Py_ssize_t)4) { + max_length = CastPyArg2AttrSize_t(kw_max_length, 0); + if ((args_num <= 1 && flag_kwargs && kw_stride) || (args_num >= 2)) { + stride = CastPyArg2AttrSize_t(kw_stride, 1); + } + if ((args_num <= 2 && flag_kwargs && kw_strategy) || (args_num >= 3)) { + strategy = CastPyArg2AttrString(kw_strategy, 2); + } + if ((args_num <= 3 && flag_kwargs && kw_direction) || (args_num >= 4)) { + direction = CastPyArg2AttrString(kw_direction, 3); + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments 1 to 4, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + + core::TruncStrategy trunc_strategy; + if (strategy == "longest_first") { + trunc_strategy = core::TruncStrategy::LONGEST_FIRST; + } else if (strategy == "only_first") { + trunc_strategy = core::TruncStrategy::ONLY_FIRST; + } else if (strategy == "only_second") { + trunc_strategy = core::TruncStrategy::ONLY_SECOND; + } else { + throw std::runtime_error( + "The strategy args should be \"longest_first\", \"only_first\" or " + "\"only_second\""); + } + core::Direction trunc_direction; + if (direction == "right") { + trunc_direction = core::Direction::RIGHT; + } else if (direction == "left") { + trunc_direction = core::Direction::LEFT; + } else { + throw std::runtime_error( + "The direction args should be \"right\" or \"left\""); + } + self->tokenizer.EnableTruncMethod( + max_length, stride, trunc_direction, trunc_strategy); + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def disable_truncation() +static PyObject* DisableTruncation(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + Py_ssize_t args_num = PyTuple_Size(args); + if (args_num == (Py_ssize_t)0) { + self->tokenizer.DisableTruncMethod(); + Py_RETURN_NONE; + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 0, but recive " << args_num; + throw 
std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def get_vocab(with_added_vocabulary=True) +static PyObject* GetVocab(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_with_added_vocabulary = NULL; + static char* kwlist[] = {const_cast("with_added_vocabulary"), NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords( + args, kwargs, "|O", kwlist, &kw_with_added_vocabulary); + Py_ssize_t args_num = PyTuple_Size(args); + bool with_added_vocabulary = true; + if (args_num == (Py_ssize_t)0) { + with_added_vocabulary = true; + py::object py_obj = + py::cast(self->tokenizer.GetVocab(with_added_vocabulary)); + py_obj.inc_ref(); + return py_obj.ptr(); + } else if (args_num == (Py_ssize_t)1) { + if (PyBool_Check(kw_with_added_vocabulary)) { + with_added_vocabulary = + CastPyArg2AttrBoolean(kw_with_added_vocabulary, 0); + py::object py_obj = + py::cast(self->tokenizer.GetVocab(with_added_vocabulary)); + py_obj.inc_ref(); + return py_obj.ptr(); + } else { + // throw error + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 0 to 1, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def get_vocab_size(with_added_vocabulary=True) +static PyObject* GetVocabSize(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_with_added_vocabulary = NULL; + static char* kwlist[] = {const_cast("with_added_vocabulary"), NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords( + args, kwargs, "|O", kwlist, &kw_with_added_vocabulary); + Py_ssize_t args_num = PyTuple_Size(args); + bool with_added_vocabulary = true; + if (args_num == (Py_ssize_t)0) { + with_added_vocabulary = true; + return ToPyObject(self->tokenizer.GetVocabSize(with_added_vocabulary)); + } else if (args_num == (Py_ssize_t)1) { + if (PyBool_Check(kw_with_added_vocabulary)) { + with_added_vocabulary = + CastPyArg2AttrBoolean(kw_with_added_vocabulary, 0); + return ToPyObject(self->tokenizer.GetVocabSize(with_added_vocabulary)); + } else { + // throw error + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 0, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def encode(sequence, pair=None, is_pretokenized=False, +// add_special_tokens=True) +static PyObject* Encode(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_sequence = NULL; + PyObject* kw_pair = NULL; + PyObject* kw_is_pretokenized = NULL; + PyObject* kw_add_special_tokens = NULL; + bool flag_kwargs = false; + if (kwargs) flag_kwargs = true; + static char* kwlist[] = {const_cast("sequence"), + const_cast("pair"), + const_cast("is_pretokenized"), + const_cast("add_special_tokens"), + NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords(args, + kwargs, + "|OOOO", + kwlist, + &kw_sequence, + &kw_pair, + &kw_is_pretokenized, + &kw_add_special_tokens); + Py_ssize_t args_num = PyTuple_Size(args); + if (args_num >= (Py_ssize_t)1 && args_num <= (Py_ssize_t)4) { + bool is_pretokenized = false; + bool add_special_tokens = true; + bool has_pair = false; + core::Encoding encoding; + core::Encoding pair_encoding; + core::Encoding result_encoding; + + if ((args_num <= 2 && flag_kwargs && kw_is_pretokenized) || + (args_num >= 3)) { + is_pretokenized = CastPyArg2AttrBoolean(kw_is_pretokenized, 2); + } + if 
((args_num <= 3 && flag_kwargs && kw_add_special_tokens) || + (args_num >= 4)) { + add_special_tokens = CastPyArg2AttrBoolean(kw_add_special_tokens, 3); + } + if (is_pretokenized) { + if (PyList_Check(kw_sequence) || PyTuple_Check(kw_sequence)) { + std::vector sequence_array = + CastPyArg2VectorOfStr(kw_sequence, 0); + std::vector pair_array; + if ((args_num <= 1 && flag_kwargs && kw_pair && kw_pair != Py_None) || + (args_num >= 2)) { + has_pair = true; + pair_array = CastPyArg2VectorOfStr(kw_pair, 1); + } + self->tokenizer.EncodeSingleString( + sequence_array, 0, core::OffsetType::CHAR, &encoding); + core::Encoding* pair_encoding_ptr = nullptr; + if (has_pair) { + self->tokenizer.EncodeSingleString( + pair_array, 0, core::OffsetType::CHAR, &pair_encoding); + pair_encoding_ptr = &pair_encoding; + } + self->tokenizer.PostProcess( + &encoding, pair_encoding_ptr, add_special_tokens, &result_encoding); + } else { + // throw error + std::ostringstream oss; + oss << "The sequence should be list of string when " + "is_pretokenized=True"; + throw std::runtime_error(oss.str()); + } + } else { + std::string sequence = CastPyArg2AttrString(kw_sequence, 0); + std::string pair; + if (((args_num <= 1 && flag_kwargs && kw_pair) || (args_num >= 2)) && + kw_pair != Py_None) { + has_pair = true; + pair = CastPyArg2AttrString(kw_pair, 1); + } + self->tokenizer.EncodeSingleString( + sequence, 0, core::OffsetType::CHAR, &encoding); + core::Encoding* pair_encoding_ptr = nullptr; + if (has_pair) { + self->tokenizer.EncodeSingleString( + pair, 1, core::OffsetType::CHAR, &pair_encoding); + pair_encoding_ptr = &pair_encoding; + } + self->tokenizer.PostProcess( + &encoding, pair_encoding_ptr, add_special_tokens, &result_encoding); + } + py::object py_obj = py::cast(result_encoding); + py_obj.inc_ref(); + return py_obj.ptr(); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 1 to 4, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def encode(input, add_special_tokens=True, is_pretokenized=False) +static PyObject* EncodeBatch(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_input = NULL; + PyObject* kw_special_tokens = NULL; + PyObject* kw_is_pretokenized = NULL; + bool flag_kwargs = false; + if (kwargs) flag_kwargs = true; + static char* kwlist[] = {const_cast("input"), + const_cast("add_special_tokens"), + const_cast("is_pretokenized"), + NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords(args, + kwargs, + "|OOO", + kwlist, + &kw_input, + &kw_special_tokens, + &kw_is_pretokenized); + bool add_special_tokens = true; + bool is_pretokenized = false; + Py_ssize_t args_num = PyTuple_Size(args); + VLOG(6) << " args_num: " << args_num << ", flag_kwargs: " << flag_kwargs + << ", flag_: " << flag_; + std::vector batch_encode_input; + if (args_num >= (Py_ssize_t)1 && args_num <= (Py_ssize_t)3) { + if ((args_num <= 1 && flag_kwargs && kw_special_tokens) || + (args_num >= 2)) { + add_special_tokens = CastPyArg2AttrBoolean(kw_special_tokens, 1); + } + if ((args_num <= 2 && kw_is_pretokenized && flag_kwargs) || args_num == 3) { + is_pretokenized = CastPyArg2AttrBoolean(kw_is_pretokenized, 2); + } + if (PyList_Check(kw_input)) { + Py_ssize_t list_size = PyList_Size(kw_input); + for (Py_ssize_t i = 0; i < list_size; ++i) { + PyObject* item = PyList_GetItem(kw_input, i); + // Has pair + if (PyTuple_Check(item) && PyTuple_Size(item) == 2) { + PyObject* text = 
PyTuple_GetItem(item, 0); + PyObject* text_pair = PyTuple_GetItem(item, 1); + // pretokenized + if (is_pretokenized) { + Py_ssize_t pretokenized_size = PyList_Size(item); + std::vector text_vec; + std::vector text_pair_vec; + for (Py_ssize_t j = 0; j < pretokenized_size; ++j) { + PyObject* py_text = PyList_GetItem(text, j); + PyObject* py_text_pair = PyList_GetItem(text_pair, j); + text_vec.push_back(CastPyArg2AttrString(py_text, 0)); + text_pair_vec.push_back(CastPyArg2AttrString(py_text_pair, 1)); + } + batch_encode_input.push_back( + std::pair{text_vec, + text_pair_vec}); + } else { + batch_encode_input.push_back( + std::pair{ + CastPyArg2AttrString(text, 0), + CastPyArg2AttrString(text_pair, 1)}); + } + } else { + // Only get text + if (is_pretokenized) { + Py_ssize_t pretokenized_size = PyList_Size(item); + std::vector str_vec(pretokenized_size); + for (Py_ssize_t j = 0; j < pretokenized_size; ++j) { + PyObject* py_text = PyList_GetItem(item, j); + str_vec[j] = CastPyArg2AttrString(py_text, 0); + } + batch_encode_input.push_back(str_vec); + } else { + batch_encode_input.push_back(CastPyArg2AttrString(item, 0)); + } + } + } + } else { + std::ostringstream oss; + oss << "Expected the type of input argument is list"; + throw std::runtime_error(oss.str()); + } + std::vector result_encodings; + self->tokenizer.EncodeBatchStrings( + batch_encode_input, &result_encodings, add_special_tokens); + py::object py_obj = py::cast(result_encodings); + py_obj.inc_ref(); + return py_obj.ptr(); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 1 to 2, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def id_to_token(id) +static PyObject* IdToToken(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + PyObject* kw_id = NULL; + static char* kwlist[] = {const_cast("id"), NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_id); + Py_ssize_t args_num = PyTuple_Size(args); + if (args_num == (Py_ssize_t)1) { + if (PyLong_Check(kw_id)) { + uint32_t id = PyLong_AsLong(kw_id); + std::string token; + if (self->tokenizer.IdToToken(id, &token)) { + return ToPyObject(token); + } + Py_RETURN_NONE; + } else { + // throw error + } + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; +} + +// def token_to_id(token) +static PyObject* TokenToId(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_token = NULL; + static char* kwlist[] = {const_cast("token"), NULL}; + bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_token); + Py_ssize_t args_num = PyTuple_Size(args); + std::string token = ""; + if (args_num == (Py_ssize_t)1) { + token = CastPyArg2AttrString(kw_token, 0); + uint32_t id; + if (self->tokenizer.TokenToId(token, &id)) { + return ToPyObject(id); + } + Py_RETURN_NONE; + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +// def num_special_tokens_to_add(is_pair) +static PyObject* NumSpecialTokensToAdd(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_is_pair = NULL; + static char* kwlist[] = {const_cast("is_pair"), NULL}; + bool flag_ = + 
PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_is_pair); + Py_ssize_t args_num = PyTuple_Size(args); + bool is_pair = false; + if (args_num == (Py_ssize_t)1) { + is_pair = CastPyArg2AttrBoolean(kw_is_pair, 0); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is 1, but recive " << args_num; + throw std::runtime_error(oss.str()); + } + auto postprocessor_ptr = self->tokenizer.GetPostProcessorPtr(); + if (postprocessor_ptr == nullptr) { + return ToPyObject(0); + } + return ToPyObject(postprocessor_ptr->AddedTokensNum(is_pair)); + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + + +static PyObject* Save(TokenizerObject* self, PyObject* args, PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_path = NULL; + PyObject* kw_pretty = NULL; + static char* kwlist[] = { + const_cast("path"), const_cast("pretty"), NULL}; + bool flag_ = PyArg_ParseTupleAndKeywords( + args, kwargs, "|OO", kwlist, &kw_path, &kw_pretty); + bool pretty = true; + Py_ssize_t args_num = PyTuple_Size(args); + if (args_num >= (Py_ssize_t)1 && args_num <= (Py_ssize_t)2) { + if (args_num == (Py_ssize_t)2) { + pretty = CastPyArg2AttrBoolean(kw_pretty, 1); + } + std::string path = CastPyArg2AttrString(kw_path, 0); + self->tokenizer.Save(path, pretty); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 1 to 2, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + Py_RETURN_NONE; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +static PyObject* ToStr(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_pretty = NULL; + static char* kwlist[] = {const_cast("pretty"), NULL}; + bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_pretty); + bool pretty = true; + Py_ssize_t args_num = PyTuple_Size(args); + std::string json_str; + if (args_num >= (Py_ssize_t)0 && args_num <= (Py_ssize_t)1) { + if (args_num == (Py_ssize_t)1) { + pretty = CastPyArg2AttrBoolean(kw_pretty, 0); + } + self->tokenizer.ToJsonStr(&json_str, pretty); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 1 to 2, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + return ToPyObject(json_str); + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +static PyObject* FromStr(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_json = NULL; + static char* kwlist[] = {const_cast("json"), NULL}; + bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_json); + Py_ssize_t args_num = PyTuple_Size(args); + std::string json_str; + core::Tokenizer tokenizer; + if (args_num == (Py_ssize_t)1) { + json_str = CastPyArg2AttrString(kw_json, 0); + tokenizer = core::Tokenizer::LoadFromStr(json_str); + } else { + std::ostringstream oss; + oss << "Expected number of arguments is from 1 to 2, but recive " + << args_num; + throw std::runtime_error(oss.str()); + } + TokenizerObject* obj = + (TokenizerObject*)TokenizerNew(p_tokenizer_type, NULL, NULL); + obj->tokenizer = tokenizer; + return (PyObject*)obj; + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +static PyObject* FromFile(TokenizerObject* self, + PyObject* args, + PyObject* kwargs) { + TOKENIZERS_TRY + PyObject* kw_path = NULL; + static char* kwlist[] = {const_cast("json"), NULL}; + bool flag_ = + PyArg_ParseTupleAndKeywords(args, kwargs, "|O", kwlist, &kw_path); + Py_ssize_t args_num = PyTuple_Size(args); + std::string path; + core::Tokenizer tokenizer; + if (args_num == (Py_ssize_t)1) { 
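+    // Same pattern as from_str above, except the tokenizer configuration is
+    // deserialized from the JSON file at `path` via
+    // core::Tokenizer::LoadFromFile before being wrapped in a new
+    // TokenizerObject.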
+    path = CastPyArg2AttrString(kw_path, 0);
+    tokenizer = core::Tokenizer::LoadFromFile(path);
+  } else {
+    std::ostringstream oss;
+    oss << "Expected number of arguments is 1, but received " << args_num;
+    throw std::runtime_error(oss.str());
+  }
+  TokenizerObject* obj =
+      (TokenizerObject*)TokenizerNew(p_tokenizer_type, NULL, NULL);
+  obj->tokenizer = tokenizer;
+  return (PyObject*)obj;
+  TOKENIZERS_CATCH_AND_THROW_RETURN_NULL
+}
+
+// def decode(self, ids, skip_special_tokens=True):
+static PyObject* Decode(TokenizerObject* self,
+                        PyObject* args,
+                        PyObject* kwargs) {
+  TOKENIZERS_TRY
+  PyObject* kw_ids = NULL;
+  PyObject* kw_skip_special_tokens = NULL;
+  bool flag_kwargs = false;
+  if (kwargs) flag_kwargs = true;
+  static char* kwlist[] = {
+      const_cast<char*>("ids"), const_cast<char*>("skip_special_tokens"), NULL};
+  bool flag_ = PyArg_ParseTupleAndKeywords(
+      args, kwargs, "|OO", kwlist, &kw_ids, &kw_skip_special_tokens);
+  bool skip_special_tokens = true;
+  Py_ssize_t args_num = PyTuple_Size(args);
+  VLOG(6) << " args_num: " << args_num << ", flag_kwargs: " << flag_kwargs
+          << ", flag_: " << flag_;
+  if (args_num >= (Py_ssize_t)1 && args_num <= (Py_ssize_t)2) {
+    if (args_num == (Py_ssize_t)2 || (flag_kwargs && kw_skip_special_tokens)) {
+      skip_special_tokens = CastPyArg2AttrBoolean(kw_skip_special_tokens, 1);
+    }
+    auto ids = CastPyArg2VectorOfInt<uint32_t>(kw_ids, 0);
+    std::string result;
+    self->tokenizer.Decode(ids, &result, skip_special_tokens);
+    return ToPyObject(result);
+  } else {
+    std::ostringstream oss;
+    oss << "Expected number of arguments is from 1 to 2, but received "
+        << args_num;
+    throw std::runtime_error(oss.str());
+  }
+  TOKENIZERS_CATCH_AND_THROW_RETURN_NULL
+}
+
+// def decode_batch(self, sequences, skip_special_tokens=True):
+static PyObject* DecodeBatch(TokenizerObject* self,
+                             PyObject* args,
+                             PyObject* kwargs) {
+  TOKENIZERS_TRY
+  PyObject* kw_sequences = NULL;
+  PyObject* kw_skip_special_tokens = NULL;
+  bool flag_kwargs = false;
+  if (kwargs) flag_kwargs = true;
+  static char* kwlist[] = {const_cast<char*>("sequences"),
+                           const_cast<char*>("skip_special_tokens"),
+                           NULL};
+  bool flag_ = PyArg_ParseTupleAndKeywords(
+      args, kwargs, "|OO", kwlist, &kw_sequences, &kw_skip_special_tokens);
+  bool skip_special_tokens = true;
+  Py_ssize_t args_num = PyTuple_Size(args);
+  VLOG(6) << " args_num: " << args_num << ", flag_kwargs: " << flag_kwargs
+          << ", flag_: " << flag_;
+  if (args_num >= (Py_ssize_t)1 && args_num <= (Py_ssize_t)2) {
+    if (args_num == (Py_ssize_t)2 || (flag_kwargs && kw_skip_special_tokens)) {
+      skip_special_tokens = CastPyArg2AttrBoolean(kw_skip_special_tokens, 1);
+    }
+    std::vector<std::vector<uint32_t>> batch_ids;
+    PyObject* item = nullptr;
+    if (PyTuple_Check(kw_sequences)) {
+      Py_ssize_t len = PyTuple_Size(kw_sequences);
+      for (Py_ssize_t i = 0; i < len; i++) {
+        item = PyTuple_GetItem(kw_sequences, i);
+        batch_ids.emplace_back(CastPyArg2VectorOfInt<uint32_t>(item, 0));
+      }
+    } else if (PyList_Check(kw_sequences)) {
+      Py_ssize_t len = PyList_Size(kw_sequences);
+      for (Py_ssize_t i = 0; i < len; i++) {
+        item = PyList_GetItem(kw_sequences, i);
+        batch_ids.emplace_back(CastPyArg2VectorOfInt<uint32_t>(item, 0));
+      }
+    } else {
+      std::ostringstream oss;
+      oss << "Argument sequences needs to be a list or tuple of int sequences";
+      throw std::runtime_error(oss.str());
+    }
+    std::vector<std::string> result;
+    self->tokenizer.DecodeBatch(batch_ids, &result, skip_special_tokens);
+    return ToPyObject(result);
+  } else {
+    std::ostringstream oss;
+    oss << "Expected number of arguments is from 1 to 2, but received "
+        << args_num;
+    throw
std::runtime_error(oss.str()); + } + TOKENIZERS_CATCH_AND_THROW_RETURN_NULL +} + +PyMethodDef tokenizer_variable_methods[] = { + {"add_special_tokens", + (PyCFunction)(void (*)(void))AddSpecialTokens, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"add_tokens", + (PyCFunction)(void (*)(void))AddTokens, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"enable_padding", + (PyCFunction)(void (*)(void))EnablePadding, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"disable_padding", + (PyCFunction)(void (*)(void))DisablePadding, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"enable_truncation", + (PyCFunction)(void (*)(void))EnableTruncation, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"disable_truncation", + (PyCFunction)(void (*)(void))DisableTruncation, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"get_vocab", + (PyCFunction)(void (*)(void))GetVocab, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"get_vocab_size", + (PyCFunction)(void (*)(void))GetVocabSize, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"encode", + (PyCFunction)(void (*)(void))Encode, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"encode_batch", + (PyCFunction)(void (*)(void))EncodeBatch, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"decode", + (PyCFunction)(void (*)(void))Decode, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"decode_batch", + (PyCFunction)(void (*)(void))DecodeBatch, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"id_to_token", + (PyCFunction)(void (*)(void))IdToToken, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"token_to_id", + (PyCFunction)(void (*)(void))TokenToId, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"num_special_tokens_to_add", + (PyCFunction)(void (*)(void))NumSpecialTokensToAdd, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"save", + (PyCFunction)(void (*)(void))Save, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"to_str", + (PyCFunction)(void (*)(void))ToStr, + METH_VARARGS | METH_KEYWORDS, + NULL}, + {"from_str", + (PyCFunction)(void (*)(void))FromStr, + METH_VARARGS | METH_KEYWORDS | METH_STATIC, + NULL}, + {"from_file", + (PyCFunction)(void (*)(void))FromFile, + METH_VARARGS | METH_KEYWORDS | METH_STATIC, + NULL}, + // TODO(zhoushunjie): Need to implement + // {"from_buffer", + // (PyCFunction)(void (*)(void))NumSpecialTokensToAdd, + // METH_VARARGS | METH_KEYWORDS | METH_STATIC, + // NULL}, + // {"from_pretrained", + // (PyCFunction)(void (*)(void))NumSpecialTokensToAdd, + // METH_VARARGS | METH_KEYWORDS | METH_STATIC, + // NULL}, + {NULL, NULL, 0, NULL}}; + +void BindTokenizers(pybind11::module* m) { + auto heap_type = reinterpret_cast( + PyType_Type.tp_alloc(&PyType_Type, 0)); + heap_type->ht_name = ToPyObject("Tokenizer"); + heap_type->ht_qualname = ToPyObject("Tokenizer"); + auto type = &heap_type->ht_type; + type->tp_name = "Tokenizer"; + type->tp_basicsize = sizeof(TokenizerObject); + type->tp_dealloc = (destructor)TokenizerDealloc; + type->tp_as_number = &number_methods; + type->tp_as_sequence = &sequence_methods; + type->tp_as_mapping = &mapping_methods; + type->tp_methods = tokenizer_variable_methods; + type->tp_getset = tokenizer_variable_properties; + type->tp_init = TokenizerInit; + type->tp_new = TokenizerNew; + Py_INCREF(&PyBaseObject_Type); + type->tp_base = reinterpret_cast(&PyBaseObject_Type); + type->tp_flags |= + Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE | Py_TPFLAGS_HEAPTYPE; +#if PY_VERSION_HEX >= 0x03050000 + type->tp_as_async = &heap_type->as_async; +#endif + p_tokenizer_type = type; + + if (PyType_Ready(type) < 0) { + throw "Init Tokenizers error in BindTokenizers(PyType_Ready)."; + return; + } + 
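+  // The type object is now fully initialized; it is registered on the module
+  // below under the name "Tokenizer". Illustrative Python usage (a sketch;
+  // the exact import path depends on how this extension module is packaged):
+  //   tokenizer = Tokenizer.from_file("tokenizer.json")
+  //   encoding = tokenizer.encode("hello world")
+  //   text = tokenizer.decode([1, 2, 3], skip_special_tokens=True)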
+ Py_INCREF(type); + if (PyModule_AddObject( + m->ptr(), "Tokenizer", reinterpret_cast(type)) < 0) { + Py_DECREF(type); + Py_DECREF(m->ptr()); + throw "Init Tokenizers error in BindTokenizers(PyModule_AddObject)."; + return; + } +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/tokenizers.h b/fast_tokenizer/fast_tokenizer/pybind/tokenizers.h new file mode 100644 index 0000000000000000000000000000000000000000..b9e45b957da8998b7b16fb2e757f148319595e43 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/tokenizers.h @@ -0,0 +1,27 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +void BindTokenizers(pybind11::module* m); + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/utils.cc b/fast_tokenizer/fast_tokenizer/pybind/utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..377aeb2ed32e3f0877bbf22f94d8f4ed0a8eb2fd --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/utils.cc @@ -0,0 +1,277 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include + +#include "fast_tokenizer/pybind/utils.h" +namespace py = pybind11; + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +PyObject* ToPyObject(bool value) { + if (value) { + Py_INCREF(Py_True); + return Py_True; + } else { + Py_INCREF(Py_False); + return Py_False; + } +} + +PyObject* ToPyObject(int value) { return PyLong_FromLong(value); } + +PyObject* ToPyObject(uint32_t value) { return PyLong_FromUnsignedLong(value); } + +PyObject* ToPyObject(int64_t value) { return PyLong_FromLongLong(value); } + +PyObject* ToPyObject(size_t value) { return PyLong_FromSize_t(value); } + +PyObject* ToPyObject(float value) { return PyLong_FromDouble(value); } + +PyObject* ToPyObject(double value) { return PyLong_FromDouble(value); } + +PyObject* ToPyObject(const char* value) { return PyUnicode_FromString(value); } + +PyObject* ToPyObject(const std::string& value) { + return PyUnicode_FromString(value.c_str()); +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, (Py_ssize_t)i, ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, (Py_ssize_t)i, ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector>& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + return result; +} + +bool CastPyArg2AttrBoolean(PyObject* obj, ssize_t arg_pos) { + if (obj == Py_None) { + return false; // To be compatible with QA integration testing. Some + // test case pass in None. 
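+                  // Any non-boolean object other than None is rejected below
+                  // with a std::runtime_error.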
+  } else if (obj == Py_True) {
+    return true;
+  } else if (obj == Py_False) {
+    return false;
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be bool, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+  }
+  return false;
+}
+
+std::string CastPyArg2AttrString(PyObject* obj, ssize_t arg_pos) {
+  if (PyUnicode_Check(obj)) {
+    Py_ssize_t size;
+    const char* data;
+    data = PyUnicode_AsUTF8AndSize(obj, &size);
+    return std::string(data, static_cast<size_t>(size));
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be str, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+    return "";
+  }
+}
+
+int CastPyArg2AttrInt(PyObject* obj, ssize_t arg_pos) {
+  if (PyLong_Check(obj) && !PyBool_Check(obj)) {
+    return static_cast<int>(PyLong_AsLong(obj));
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be int, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+    return 0;
+  }
+}
+
+int64_t CastPyArg2AttrLong(PyObject* obj, ssize_t arg_pos) {
+  if (PyLong_Check(obj) && !PyBool_Check(obj)) {
+    return (int64_t)PyLong_AsLong(obj);  // NOLINT
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be int, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+    return 0;
+  }
+}
+
+size_t CastPyArg2AttrSize_t(PyObject* obj, ssize_t arg_pos) {
+  if (PyLong_Check(obj) && !PyBool_Check(obj)) {
+    return PyLong_AsSize_t(obj);
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be int, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+    return 0;
+  }
+}
+
+float CastPyArg2AttrFloat(PyObject* obj, ssize_t arg_pos) {
+  if (PyFloat_Check(obj) || PyLong_Check(obj)) {
+    return static_cast<float>(PyFloat_AsDouble(obj));
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1 << ") must be float, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
+    return 0;
+  }
+}
+
+std::vector<std::string> CastPyArg2VectorOfStr(PyObject* obj, size_t arg_pos) {
+  std::vector<std::string> result;
+  if (PyList_Check(obj)) {
+    Py_ssize_t len = PyList_Size(obj);
+    PyObject* item = nullptr;
+    for (Py_ssize_t i = 0; i < len; i++) {
+      item = PyList_GetItem(obj, i);
+      if (PyUnicode_Check(item)) {
+        result.emplace_back(CastPyArg2AttrString(item, 0));
+      } else {
+        std::ostringstream oss;
+        oss << "argument (position " << arg_pos + 1
+            << ") must be list of str, but got "
+            << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+        throw std::runtime_error(oss.str());
+        return {};
+      }
+    }
+  } else if (PyTuple_Check(obj)) {
+    Py_ssize_t len = PyTuple_Size(obj);
+    PyObject* item = nullptr;
+    for (Py_ssize_t i = 0; i < len; i++) {
+      item = PyTuple_GetItem(obj, i);
+      if (PyUnicode_Check(item)) {
+        result.emplace_back(CastPyArg2AttrString(item, 0));
+      } else {
+        std::ostringstream oss;
+        oss << "argument (position " << arg_pos + 1
+            << ") must be list of str, but got "
+            << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+        throw std::runtime_error(oss.str());
+        return {};
+      }
+    }
+  } else if (obj == Py_None) {
+    return {};
+  } else {
+    std::ostringstream oss;
+    oss << "argument (position " << arg_pos + 1
+        << ") must be list or tuple, but got "
+        << (reinterpret_cast<PyTypeObject*>(obj->ob_type))->tp_name;
+    throw std::runtime_error(oss.str());
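+    // Unreachable after the throw above; kept so that every control path
+    // returns a value.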
+ return {}; + } + return result; +} + +bool PyObject_CheckLongOrConvertToLong(PyObject** obj) { + if ((PyLong_Check(*obj) && !PyBool_Check(*obj))) { + return true; + } + + if (std::string((reinterpret_cast((*obj)->ob_type))->tp_name) + .find("numpy") != std::string::npos) { + auto to = PyNumber_Long(*obj); + if (to) { + *obj = to; + return true; + } + } + + return false; +} + +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/pybind/utils.h b/fast_tokenizer/fast_tokenizer/pybind/utils.h new file mode 100644 index 0000000000000000000000000000000000000000..75ef55d654a571eebf5fef44c6ea6e0521835eec --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/pybind/utils.h @@ -0,0 +1,108 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include +#include + +// For Windows ssize_t +#if defined(_MSC_VER) +#include +typedef SSIZE_T ssize_t; +#endif + +namespace paddlenlp { +namespace fast_tokenizer { +namespace pybind { + +PyObject* ToPyObject(int value); +PyObject* ToPyObject(uint32_t value); +PyObject* ToPyObject(bool value); +PyObject* ToPyObject(int64_t value); +PyObject* ToPyObject(size_t value); +PyObject* ToPyObject(float value); +PyObject* ToPyObject(double value); +PyObject* ToPyObject(const char* value); +PyObject* ToPyObject(const std::string& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector>& value); +PyObject* ToPyObject(const std::vector& value); + +bool PyObject_CheckLongOrConvertToLong(PyObject** obj); +bool CastPyArg2AttrBoolean(PyObject* obj, ssize_t arg_pos); +std::string CastPyArg2AttrString(PyObject* obj, ssize_t arg_pos); +int CastPyArg2AttrInt(PyObject* obj, ssize_t arg_pos); +int64_t CastPyArg2AttrLong(PyObject* obj, ssize_t arg_pos); +size_t CastPyArg2AttrSize_t(PyObject* obj, ssize_t arg_pos); +float CastPyArg2AttrFloat(PyObject* obj, ssize_t arg_pos); +std::vector CastPyArg2VectorOfStr(PyObject* obj, size_t arg_pos); + +template +std::vector CastPyArg2VectorOfInt(PyObject* obj, size_t arg_pos) { + std::vector result; + if (PyList_Check(obj)) { + Py_ssize_t len = PyList_Size(obj); + PyObject* item = nullptr; + for (Py_ssize_t i = 0; i < len; i++) { + item = PyList_GetItem(obj, i); + if (PyObject_CheckLongOrConvertToLong(&item)) { + result.emplace_back(static_cast(PyLong_AsLong(item))); + } else { + std::ostringstream oss; + oss << "argument (position " << arg_pos + 1 + << "must be list of int, but got " + << reinterpret_cast(item->ob_type)->tp_name + << " at pos " << i; + throw oss.str(); + return {}; + } + } + } else if (PyTuple_Check(obj)) { + Py_ssize_t len = PyTuple_Size(obj); + PyObject* item = nullptr; + for (Py_ssize_t i = 0; i < len; i++) { + 
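+      // PyTuple_GetItem returns a borrowed reference, so no Py_DECREF is
+      // needed for "item".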
item = PyTuple_GetItem(obj, i); + if (PyObject_CheckLongOrConvertToLong(&item)) { + result.emplace_back(static_cast(PyLong_AsLong(item))); + } else { + std::ostringstream oss; + oss << "argument (position " << arg_pos + 1 + << "must be list of int, but got " + << reinterpret_cast(item->ob_type)->tp_name + << " at pos " << i; + throw oss.str(); + return {}; + } + } + } else if (obj == Py_None) { + return {}; + } else { + std::ostringstream oss; + oss << "argument (position " << arg_pos + 1 + << "must be list or tuple, but got " + << reinterpret_cast(obj->ob_type)->tp_name; + throw oss.str(); + return {}; + } + return result; +} +} // namespace pybind +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/test/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/test/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..380e63582b161fc491259415251ad48e1e075d73 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/CMakeLists.txt @@ -0,0 +1,58 @@ +if(WITH_TESTING) +cc_library(tokenizers_gtest_main SRCS gtest_main.cc DEPS gtest gflags) + +# Test Normalizers modules +cc_test(test_normalizer SRCS test_normalizer.cc DEPS normalizers) +cc_test(test_unicode SRCS test_unicode.cc DEPS normalizers) +cc_test(test_replace SRCS test_replace.cc DEPS normalizers) +cc_test(test_strip SRCS test_strip.cc DEPS normalizers) +cc_test(test_utils SRCS test_utils.cc DEPS normalizers) + +# Test PreTokenizers modules +cc_test(test_whitespace SRCS test_whitespace.cc DEPS pretokenizers) +cc_test(test_bert_pretokenizer SRCS test_bert_pretokenizer.cc DEPS pretokenizers) +cc_test(test_split_pretokenizer SRCS test_split_pretokenizer.cc DEPS pretokenizers) + +# Test Model +cc_test(test_wordpiece SRCS test_wordpiece.cc DEPS models) +cc_test(test_fast_wordpiece SRCS test_fast_wordpiece.cc DEPS models) + +# Download ernie vocab for test +set(ERNIE_VOCAB_PATH ${CMAKE_CURRENT_BINARY_DIR}/ernie_vocab.txt) +if (EXISTS ${ERNIE_VOCAB_PATH}) + message("The ${ERNIE_VOCAB_PATH} exists already.") +else() + file(DOWNLOAD "https://bj.bcebos.com/paddlenlp/models/transformers/ernie/vocab.txt" ${ERNIE_VOCAB_PATH} SHOW_PROGRESS) + message("Already download the vocab.txt of ernie to ${CMAKE_CURRENT_BINARY_DIR} for test.") +endif() + +# Download clip vocab and merge files +set(CLIP_VOCAB_PATH ${CMAKE_CURRENT_BINARY_DIR}/clip_vocab.json) +set(CLIP_MERGES_PATH ${CMAKE_CURRENT_BINARY_DIR}/clip_merges.txt) + +if (EXISTS ${CLIP_VOCAB_PATH}) + message("The ${CLIP_VOCAB_PATH} exists already.") +else() + file(DOWNLOAD "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/vocab.json" ${CLIP_VOCAB_PATH} SHOW_PROGRESS) + message("Already download the vocab.json of clip to ${CMAKE_CURRENT_BINARY_DIR} for test.") +endif() + +if (EXISTS ${CLIP_MERGES_PATH}) + message("The ${CLIP_MERGES_PATH} exists already.") +else() + file(DOWNLOAD "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/merges.txt" ${CLIP_MERGES_PATH} SHOW_PROGRESS) + message("Already download the merges.txt of clip to ${CMAKE_CURRENT_BINARY_DIR} for test.") +endif() + +# Test Tokenizer +cc_test(test_bert_tokenizer SRCS test_bert_tokenizer.cc DEPS normalizers pretokenizers models postprocessors tokenizer) + +# Test PostProcessor +cc_test(test_roberta_postprocessor SRCS test_roberta_postprocessor.cc DEPS normalizers pretokenizers models postprocessors tokenizer) + +if(NOT WITH_PYTHON) + cc_test(test_ernie_fast_tokenizer SRCS test_ernie_fast_tokenizer.cc DEPS normalizers 
pretokenizers models postprocessors tokenizer core_tokenizers) + cc_test(test_clip_fast_tokenizer SRCS test_clip_fast_tokenizer.cc DEPS normalizers pretokenizers models postprocessors tokenizer core_tokenizers) +endif() + +endif() diff --git a/fast_tokenizer/fast_tokenizer/test/gtest_main.cc b/fast_tokenizer/fast_tokenizer/test/gtest_main.cc new file mode 100644 index 0000000000000000000000000000000000000000..8dc400abf0bea266f46297837814e68fe8049dca --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/gtest_main.cc @@ -0,0 +1,62 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "gtest/gtest.h" +#include "gflags/gflags.h" +#include "glog/logging.h" + + +int main(int argc, char** argv) { + testing::InitGoogleTest(&argc, argv); + std::vector new_argv; + for (int i = 0; i < argc; ++i) { + new_argv.push_back(argv[i]); + } + + std::vector envs; + std::vector undefok; + + char* env_str = nullptr; + if (envs.size() > 0) { + std::string env_string = "--tryfromenv="; + for (auto t : envs) { + env_string += t + ","; + } + env_string = env_string.substr(0, env_string.length() - 1); + env_str = strdup(env_string.c_str()); + new_argv.push_back(env_str); + VLOG(1) << "gtest env_string:" << env_string; + } + + char* undefok_str = nullptr; + if (undefok.size() > 0) { + std::string undefok_string = "--undefok="; + for (auto t : undefok) { + undefok_string += t + ","; + } + undefok_string = undefok_string.substr(0, undefok_string.length() - 1); + undefok_str = strdup(undefok_string.c_str()); + new_argv.push_back(undefok_str); + VLOG(1) << "gtest undefok_string:" << undefok_string; + } + + int new_argc = static_cast(new_argv.size()); + char** new_argv_address = new_argv.data(); + ::GFLAGS_NAMESPACE::ParseCommandLineFlags( + &new_argc, &new_argv_address, false); + int ret = RUN_ALL_TESTS(); + if (env_str) free(env_str); + if (undefok_str) free(undefok_str); + return ret; +} diff --git a/fast_tokenizer/fast_tokenizer/test/test_bert_pretokenizer.cc b/fast_tokenizer/fast_tokenizer/test/test_bert_pretokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..f4be133e375dbc2f338a14d8a1145b75d0ebc2e0 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_bert_pretokenizer.cc @@ -0,0 +1,49 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include "fast_tokenizer/pretokenizers/bert.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { +TEST(pretokenizers, bert) { + std::string input = + "I \t am good\r at \nsport. I like\tfootball especially!!!"; + std::vector expected_outputs = {"I", + "am", + "good", + "at", + "sport", + ".", + "I", + "like", + "football", + "especially", + "!", + "!", + "!"}; + pretokenizers::PreTokenizedString bert_input(input); + pretokenizers::BertPreTokenizer()(&bert_input); + ASSERT_EQ(expected_outputs.size(), bert_input.GetSplitsSize()); + for (int i = 0; i < expected_outputs.size(); ++i) { + ASSERT_EQ(bert_input.GetSplit(i).normalized_.GetStr(), expected_outputs[i]); + } +} +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_bert_tokenizer.cc b/fast_tokenizer/fast_tokenizer/test/test_bert_tokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..6aac0fdf1222c889e693c003da119adf8d624f11 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_bert_tokenizer.cc @@ -0,0 +1,120 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ +#include +#include +#include +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/core/tokenizer.h" +#include "fast_tokenizer/models/wordpiece.h" +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/postprocessors/bert.h" +#include "fast_tokenizer/pretokenizers/bert.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +template +void CheckVectorEqual(const std::vector& a, const std::vector& b) { + ASSERT_EQ(a.size(), b.size()); + auto size = a.size(); + for (int i = 0; i < size; ++i) { + ASSERT_EQ(a[i], b[i]); + } +} + +TEST(tokenizer, bert_tokenizer) { + models::WordPieceFactory factory; + std::string vocab_file = "ernie_vocab.txt"; + factory.SetFiles(vocab_file); + // Declare the components of tokenizer + auto word_piece = factory.CreateWordPieceModel(); + auto normalizer = normalizers::BertNormalizer(); + auto pretokenizer = pretokenizers::BertPreTokenizer(); + auto postprocessor = + postprocessors::BertPostProcessor({"[SEP]", 2}, {"[CLS]", 1}); + core::PadMethod pad_method; + core::TruncMethod trunc_method; + + // Initialize tokenizer + core::Tokenizer tokenizer(word_piece); + tokenizer.SetNormalizer(normalizer); + tokenizer.SetPreTokenizer(pretokenizer); + tokenizer.SetPostProcessor(postprocessor); + tokenizer.SetPadMethod(pad_method); + tokenizer.SetTruncMethod(trunc_method); + std::vector special_added_tokens = { + {"[PAD]", true}, + {"[CLS]", true}, + {"[SEP]", true}, + {"[MASK]", true}, + {"[UNK]", true}, + }; + auto special_tokens_num = tokenizer.AddSpecialTokens(special_added_tokens); + + // Tokenize the sample strings + std::vector encodings(2); + tokenizer.EncodePairStrings("今天天气真好", &encodings[0]); + tokenizer.EncodePairStrings("don't know how this missed award nominations.", + &encodings[1]); + std::vector> expected_tokens = { + {"[CLS]", "今", "天", "天", "气", "真", "好", "[SEP]"}, + {"[CLS]", + "don", + "[UNK]", + "t", + "know", + "how", + "this", + "miss", + "##ed", + "award", + "no", + "##min", + "##ations", + ".", + "[SEP]"}}; + std::vector> expected_ids = { + {1, 508, 125, 125, 266, 384, 170, 2}, + {1, + 3362, + 17963, + 2052, + 3821, + 5071, + 3730, + 7574, + 9530, + 6301, + 3825, + 10189, + 11005, + 42, + 2}}; + std::vector> expected_type_ids = { + {0, 0, 0, 0, 0, 0, 0, 0}, {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}; + for (int i = 0; i < encodings.size(); ++i) { + CheckVectorEqual(expected_tokens[i], encodings[i].GetTokens()); + CheckVectorEqual(expected_ids[i], encodings[i].GetIds()); + CheckVectorEqual(expected_type_ids[i], encodings[i].GetTypeIds()); + } +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_clip_fast_tokenizer.cc b/fast_tokenizer/fast_tokenizer/test/test_clip_fast_tokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..b2604b3cc90624d9526fd1e514723f2f5b58e418 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_clip_fast_tokenizer.cc @@ -0,0 +1,50 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/tokenizers/clip_fast_tokenizer.h" + +#include "fast_tokenizer/test/utils.h" + +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(tokenizer, clip_full) { + std::string vocab_path = "clip_vocab.json"; + std::string merges_path = "clip_merges.txt"; + tokenizers_impl::ClipFastTokenizer clip_tokenizer(vocab_path, merges_path); + + core::Encoding encoding; + std::string input_text = "A\n'll 11p223RF☆ho!!to?'d'd''d of a cat"; + std::vector expected_ids = { + 49406, 320, 1342, 272, 272, 335, 273, 273, 274, 16368, 13439, 2971, + 748, 531, 13610, 323, 1896, 8445, 323, 539, 320, 2368, 49407}; + std::vector expected_tokens = { + "<|startoftext|>", "a", "'ll", "1", "1", + "p", "2", "2", "3", "rf", + "âĺĨ", "ho", "!!", "to", "?'", + "d", "'d", "''", "d", "of", + "a", "cat", "<|endoftext|>"}; + clip_tokenizer.EncodePairStrings(input_text, &encoding); + CheckVectorEqual(expected_ids, encoding.GetIds()); + CheckVectorEqual(expected_tokens, encoding.GetTokens()); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_ernie_fast_tokenizer.cc b/fast_tokenizer/fast_tokenizer/test/test_ernie_fast_tokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..f3b73f6b42d951fec698ddb11f17ecfcfd60a2ef --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_ernie_fast_tokenizer.cc @@ -0,0 +1,88 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include "fast_tokenizer/core/added_vocabulary.h" +#include "fast_tokenizer/core/base.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/core/tokenizer.h" +#include "fast_tokenizer/models/wordpiece.h" +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/postprocessors/bert.h" +#include "fast_tokenizer/pretokenizers/bert.h" +#include "fast_tokenizer/test/utils.h" +#include "fast_tokenizer/tokenizers/ernie_fast_tokenizer.h" + +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(tokenizer, ernie_fast_tokenizer) { + std::string vocab_file = "ernie_vocab.txt"; + tokenizers_impl::ErnieFastTokenizer ernie_fast_tokenizer(vocab_file); + std::vector encodings(2); + ernie_fast_tokenizer.EncodePairStrings("今天天气真好", &encodings[0]); + ernie_fast_tokenizer.EncodePairStrings( + "don't know how this missed award nominations.", &encodings[1]); + std::vector> expected_tokens = { + {"[CLS]", "今", "天", "天", "气", "真", "好", "[SEP]"}, + {"[CLS]", + "don", + "[UNK]", + "t", + "know", + "how", + "this", + "miss", + "##ed", + "award", + "no", + "##min", + "##ations", + ".", + "[SEP]"}}; + std::vector> expected_ids = { + {1, 508, 125, 125, 266, 384, 170, 2}, + {1, + 3362, + 17963, + 2052, + 3821, + 5071, + 3730, + 7574, + 9530, + 6301, + 3825, + 10189, + 11005, + 42, + 2}}; + std::vector> expected_type_ids = { + {0, 0, 0, 0, 0, 0, 0, 0}, {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}; + for (int i = 0; i < encodings.size(); ++i) { + CheckVectorEqual(expected_tokens[i], encodings[i].GetTokens()); + CheckVectorEqual(expected_ids[i], encodings[i].GetIds()); + CheckVectorEqual(expected_type_ids[i], encodings[i].GetTypeIds()); + } +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_fast_wordpiece.cc b/fast_tokenizer/fast_tokenizer/test/test_fast_wordpiece.cc new file mode 100644 index 0000000000000000000000000000000000000000..c30517ebd4a3c992526f1bb62dc096fba7b0d3a3 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_fast_wordpiece.cc @@ -0,0 +1,42 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include "fast_tokenizer/models/fast_wordpiece.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(model, fast_wordpiece_token_to_id) { + auto vocab = models::FastWordPiece::GetVocabFromFile("ernie_vocab.txt"); + models::FastWordPiece fast_wordpiece_model(vocab); + // Test tokens in vocab + for (const auto& item : vocab) { + uint32_t id; + fast_wordpiece_model.TokenToId(item.first, &id); + ASSERT_EQ(item.second, id); + } + // Test [UNK] token + uint32_t fast_wordpiece_id; + ASSERT_FALSE(fast_wordpiece_model.TokenToId("dasd", &fast_wordpiece_id)); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_normalizer.cc b/fast_tokenizer/fast_tokenizer/test/test_normalizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..97de7ea8307394c3f8ee7e789f7a8f4a70835412 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_normalizer.cc @@ -0,0 +1,55 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "gtest/gtest.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(normalizers, split) { + re2::RE2 pattern("-"); + std::string input = "The-final--countdown"; + normalizers::NormalizedString split_input(input); + auto test_split = [&pattern, &split_input]( + core::SplitMode mode, const std::vector expected_strings) { + std::vector normalizes; + split_input.Split(pattern, mode, &normalizes); + ASSERT_EQ(expected_strings.size(), normalizes.size()); + for (int i = 0; i < expected_strings.size(); ++i) { + ASSERT_EQ(expected_strings[i], normalizes[i].GetStr()); + } + }; + + test_split(core::SplitMode::REMOVED, {"The", "final", "countdown"}); + test_split(core::SplitMode::ISOLATED, + {"The", "-", "final", "-", "-", "countdown"}); + test_split(core::SplitMode::CONTIGUOUS, + {"The", "-", "final", "--", "countdown"}); + test_split(core::SplitMode::MERGED_WITH_PREVIOUS, + {"The-", "final-", "-", "countdown"}); + test_split(core::SplitMode::MERGED_WITH_NEXT, + {"The", "-final", "-", "-countdown"}); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_replace.cc b/fast_tokenizer/fast_tokenizer/test/test_replace.cc new file mode 100644 index 0000000000000000000000000000000000000000..c32a193e4e92b0ea57b89594acf9075c59b58a7e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_replace.cc @@ -0,0 +1,47 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+ +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(normalizers, replace) { + std::string input = "This is a ''test''"; + std::string expected_output = "This is a \"test\""; + normalizers::NormalizedString replace_input(input); + + normalizers::ReplaceNormalizer normalizer("''", "\""); + normalizer(&replace_input); + ASSERT_EQ(expected_output, replace_input.GetStr()); + + // Regex pattern + std::string regex_input = "This is a test"; + std::string expected_regex_output = "This is a test"; + normalizers::NormalizedString regex_replace_input(regex_input); + normalizers::ReplaceNormalizer regex_normalizer("(\\s+)", " "); + regex_normalizer(®ex_replace_input); + ASSERT_EQ(expected_regex_output, regex_replace_input.GetStr()); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_roberta_postprocessor.cc b/fast_tokenizer/fast_tokenizer/test/test_roberta_postprocessor.cc new file mode 100644 index 0000000000000000000000000000000000000000..3fabbf6af049b92e6bb9271da75e1753dde6922b --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_roberta_postprocessor.cc @@ -0,0 +1,94 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include + +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/postprocessors/roberta.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(postprocessors, roberta) { + postprocessors::RobertaPostProcessor postprocessor; + core::Encoding encoding( + {core::Token(12, "Hello", {0, 5}), core::Token(14, "there", {6, 11})}, 0); + core::Encoding pair_encoding({core::Token(15, "pair", {0, 4})}, 0); + core::Encoding result_encoding; + + core::Encoding encoding_copy = encoding; + core::Encoding pair_encoding_copy = pair_encoding; + + postprocessor(&encoding_copy, nullptr, true, &result_encoding); + uint32_t special_word_idx = std::numeric_limits::max(); + ASSERT_EQ(result_encoding, + core::Encoding({0, 12, 14, 2}, + {0, 0, 0, 0}, + {"", "Hello", "there", ""}, + std::vector(4, special_word_idx), + {{0, 0}, {0, 5}, {6, 11}, {0, 0}}, + {1, 0, 0, 1}, + {1, 1, 1, 1}, + {}, + {{0, {1, 3}}})); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(2), + std::vector(1, 0)); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(3).size(), 0); + + encoding_copy = encoding; + postprocessor(&encoding_copy, &pair_encoding_copy, true, &result_encoding); + ASSERT_EQ( + result_encoding, + core::Encoding({0, 12, 14, 2, 2, 15, 2}, + {0, 0, 0, 0, 0, 0, 0}, + {"", "Hello", "there", "", "", "pair", ""}, + std::vector(7, special_word_idx), + {{0, 0}, {0, 5}, {6, 11}, {0, 0}, {0, 0}, {0, 4}, {0, 0}}, + {1, 0, 0, 1, 1, 0, 1}, + {1, 1, 1, 1, 1, 1, 1}, + {}, + {{0, {1, 3}}, {1, {5, 6}}})); + + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(2), + std::vector(1, 0)); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(3), std::vector{}); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(4), std::vector{}); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(5), std::vector{1}); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(6), std::vector{}); + + encoding_copy = encoding; + pair_encoding_copy = pair_encoding; + postprocessor(&encoding_copy, &pair_encoding_copy, false, &result_encoding); + ASSERT_EQ(result_encoding, + core::Encoding({12, 14, 15}, + {0, 0, 0}, + {"Hello", "there", "pair"}, + std::vector(3, special_word_idx), + {{0, 5}, {6, 11}, {0, 4}}, + {0, 0, 0}, + {1, 1, 1}, + {}, + {{0, {0, 2}}, {1, {2, 3}}})); + + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(0), std::vector{0}); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(1), std::vector{0}); + ASSERT_EQ(result_encoding.TokenIdxToSequenceIds(2), std::vector{1}); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_split_pretokenizer.cc b/fast_tokenizer/fast_tokenizer/test/test_split_pretokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..89c4df06943eb90ce0cbb0da4751bad72fb85e59 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_split_pretokenizer.cc @@ -0,0 +1,110 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "fast_tokenizer/pretokenizers/split.h" +#include "glog/logging.h" +#include "gtest/gtest.h" +#include "re2/re2.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(pretokenizers, split_basic) { + std::string input = "How are you doing?"; + // All tokens' id are set to 0. + std::vector>> test_cases = + {{ + core::SplitMode::REMOVED, + std::vector{{0, "How", {0, 3}}, + {0, "are", {4, 7}}, + {0, "you", {8, 11}}, + {0, "doing", {12, 17}}, + {0, "?", {17, 18}}}, + }, + { + core::SplitMode::ISOLATED, + std::vector{{0, "How", {0, 3}}, + {0, " ", {3, 4}}, + {0, "are", {4, 7}}, + {0, " ", {7, 8}}, + {0, "you", {8, 11}}, + {0, " ", {11, 12}}, + {0, "doing", {12, 17}}, + {0, "?", {17, 18}}}, + }, + { + core::SplitMode::MERGED_WITH_PREVIOUS, + std::vector{{0, "How ", {0, 4}}, + {0, "are ", {4, 8}}, + {0, "you ", {8, 12}}, + {0, "doing", {12, 17}}, + {0, "?", {17, 18}}}, + }, + { + core::SplitMode::MERGED_WITH_NEXT, + std::vector{{0, "How", {0, 3}}, + {0, " are", {3, 7}}, + {0, " you", {7, 11}}, + {0, " doing", {11, 17}}, + {0, "?", {17, 18}}}, + }, + { + core::SplitMode::CONTIGUOUS, + std::vector{{0, "How", {0, 3}}, + {0, " ", {3, 4}}, + {0, "are", {4, 7}}, + {0, " ", {7, 8}}, + {0, "you", {8, 11}}, + {0, " ", {11, 12}}, + {0, "doing?", {12, 18}}}, + }}; + std::string pattern = R"(\w+|[^\w\s]+)"; + for (auto&& test : test_cases) { + pretokenizers::PreTokenizedString pretokenized(input); + pretokenizers::SplitPreTokenizer pretok(pattern, test.first, true); + pretok(&pretokenized); + ASSERT_EQ(test.second.size(), pretokenized.GetSplitsSize()); + for (int i = 0; i < test.second.size(); ++i) { + auto&& curr_split = pretokenized.GetSplit(i); + ASSERT_EQ(test.second[i].value_, curr_split.normalized_.GetStr()); + auto original_offset = curr_split.normalized_.GetOrginalOffset(); + ASSERT_EQ(test.second[i].offset_, original_offset); + } + } +} + +TEST(pretokenizers, split_invert) { + std::string input = "Hello Hello Hello"; + pretokenizers::PreTokenizedString pretok_str(input), + pretok_str_for_invert(input); + pretokenizers::SplitPreTokenizer pretok(" ", core::SplitMode::REMOVED, false); + pretokenizers::SplitPreTokenizer pretok_invert( + "Hello", core::SplitMode::REMOVED, true); + + pretok(&pretok_str); + pretok_invert(&pretok_str_for_invert); + + ASSERT_EQ(pretok_str.GetSplitsSize(), pretok_str_for_invert.GetSplitsSize()); + for (int i = 0; i < pretok_str.GetSplitsSize(); ++i) { + ASSERT_EQ(pretok_str.GetSplit(i).normalized_.GetStr(), + pretok_str_for_invert.GetSplit(i).normalized_.GetStr()); + } +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_strip.cc b/fast_tokenizer/fast_tokenizer/test/test_strip.cc new file mode 100644 index 0000000000000000000000000000000000000000..370ad9c37d79847b2cfef79aa364024920fe9c09 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_strip.cc @@ -0,0 +1,55 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(normalizers, strip) { + std::string input = " \t我爱中国\n\f\v"; + std::string expected_lstrip_output = "我爱中国\n\f\v"; + std::string expected_rstrip_output = " \t我爱中国"; + std::string expected_lrstrip_output = "我爱中国"; + + normalizers::NormalizedString lrstrip_input(input); + normalizers::NormalizedString lstrip_input(input); + normalizers::NormalizedString rstrip_input(input); + + normalizers::StripNormalizer lrstrip(true, true); + lrstrip(&lrstrip_input); + std::string lrstrip_output = lrstrip_input.GetStr(); + ASSERT_EQ(expected_lrstrip_output, lrstrip_output); + + normalizers::StripNormalizer lstrip(true, false); + lstrip(&lstrip_input); + std::string lstrip_output = lstrip_input.GetStr(); + ASSERT_EQ(expected_lstrip_output, lstrip_output); + + normalizers::StripNormalizer rstrip(false, true); + rstrip(&rstrip_input); + std::string rstrip_output = rstrip_input.GetStr(); + ASSERT_EQ(expected_rstrip_output, rstrip_output); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_unicode.cc b/fast_tokenizer/fast_tokenizer/test/test_unicode.cc new file mode 100644 index 0000000000000000000000000000000000000000..e4e927b13f35a4f9681539db8833a07784cb60e6 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_unicode.cc @@ -0,0 +1,61 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(normalizers, unicode) { + std::string input = "\u1e9b\u0323a\u1e9b\u0323"; + std::string expected_nfkc_output = "ṩaṩ"; + std::string expected_nfc_output = "\u1e9b\u0323a\u1e9b\u0323"; + std::string expected_nfkd_output = "\u0073\u0323\u0307a\u0073\u0323\u0307"; + std::string expected_nfd_output = "\u017f\u0323\u0307a\u017f\u0323\u0307"; + + normalizers::NFKCNormalizer nfkc; + normalizers::NormalizedString normalized_input1(input); + normalizers::NormalizedString normalized_input2(input); + normalizers::NormalizedString normalized_input3(input); + normalizers::NormalizedString normalized_input4(input); + nfkc(&normalized_input1); + std::string nfkc_output = normalized_input1.GetStr(); + ASSERT_EQ(expected_nfkc_output, nfkc_output); + + normalizers::NFCNormalizer nfc; + nfc(&normalized_input2); + std::string nfc_output = normalized_input2.GetStr(); + ASSERT_EQ(expected_nfc_output, nfc_output); + + normalizers::NFKDNormalizer nfkd; + nfkd(&normalized_input3); + std::string nfkd_output = normalized_input3.GetStr(); + ASSERT_EQ(expected_nfkd_output, nfkd_output); + + normalizers::NFDNormalizer nfd; + nfd(&normalized_input4); + std::string nfd_output = normalized_input4.GetStr(); + ASSERT_EQ(expected_nfd_output, nfd_output); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_utils.cc b/fast_tokenizer/fast_tokenizer/test/test_utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..8711f47724b42f7ed931be213edb44c171f86740 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_utils.cc @@ -0,0 +1,37 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include "fast_tokenizer/normalizers/bert.h" +#include "fast_tokenizer/normalizers/replace.h" +#include "fast_tokenizer/normalizers/strip.h" +#include "fast_tokenizer/normalizers/unicode.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(normalizers, utils) { + std::string input = "ÓÓßSSCHLOË"; + std::string expected_output = "óóßsschloë"; + normalizers::NormalizedString lower_input(input); + lower_input.Lowercase(); + ASSERT_EQ(expected_output, lower_input.GetStr()); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_whitespace.cc b/fast_tokenizer/fast_tokenizer/test/test_whitespace.cc new file mode 100644 index 0000000000000000000000000000000000000000..95a2fdee98b2cc12eefbc425a9865fbe942c483d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_whitespace.cc @@ -0,0 +1,40 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "fast_tokenizer/pretokenizers/whitespace.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(pretokenizers, whitespace) { + std::string input = "I \t am good\r at \nsport"; + std::vector expected_outputs = { + "I", "am", "good", "at", "sport"}; + pretokenizers::PreTokenizedString whitespace_input(input); + pretokenizers::WhitespacePreTokenizer()(&whitespace_input); + ASSERT_EQ(expected_outputs.size(), whitespace_input.GetSplitsSize()); + for (int i = 0; i < expected_outputs.size(); ++i) { + ASSERT_EQ(whitespace_input.GetSplit(i).normalized_.GetStr(), + expected_outputs[i]); + } +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/test_wordpiece.cc b/fast_tokenizer/fast_tokenizer/test/test_wordpiece.cc new file mode 100644 index 0000000000000000000000000000000000000000..3d7661f923193a4584d501ecb24981b86220757f --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/test_wordpiece.cc @@ -0,0 +1,96 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include "fast_tokenizer/models/wordpiece.h" +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +TEST(model, wordpiece_factory) { + models::WordPieceFactory factory; + auto check_config = [&](const std::string& filename, + size_t vocab_size, + const std::string& unk_token, + size_t max_input_chars_per_word, + const std::string& continuing_subword_prefix) { + ASSERT_EQ(filename, factory.config_.files_); + ASSERT_EQ(vocab_size, factory.config_.vocab_.size()); + ASSERT_EQ(unk_token, factory.config_.unk_token_); + ASSERT_EQ(max_input_chars_per_word, + factory.config_.max_input_chars_per_word_); + ASSERT_EQ(continuing_subword_prefix, + factory.config_.continuing_subword_prefix_); + }; + check_config("", 0, "[UNK]", 100, "##"); + + std::string vocab_file = "ernie_vocab.txt"; + factory.SetFiles(vocab_file); + factory.GetVocabFromFiles(vocab_file); + check_config(vocab_file, 17964UL, "[UNK]", 100, "##"); +} + +TEST(model, wordpiece_model) { + models::WordPieceFactory factory; + factory.SetFiles("ernie_vocab.txt"); + + auto wordpiece_model = factory.CreateWordPieceModel(); + auto check_token_id = [&](const std::string& expected_token, + uint32_t expected_id) { + std::string token; + uint32_t id; + wordpiece_model.TokenToId(expected_token, &id); + wordpiece_model.IdToToken(expected_id, &token); + ASSERT_EQ(id, expected_id); + ASSERT_EQ(token, expected_token); + }; + std::array tokens = { + "[PAD]", "[CLS]", "[SEP]", "[MASK]", ",", "的", "、", "一", "人", "有"}; + for (int i = 0; i < tokens.size(); i++) { + check_token_id(tokens[i], i); + } + // check non-exist token + uint32_t id; + ASSERT_FALSE(wordpiece_model.TokenToId("xxsada", &id)); + // check non-exist id + std::string token; + ASSERT_FALSE( + wordpiece_model.IdToToken(wordpiece_model.GetVocabSize(), &token)); + + // Check Tokenize Chinese + auto chinese_tokens = wordpiece_model.Tokenize("今天天气真好"); + auto check_token = [](const core::Token& token, + const std::string& expected_string, + uint32_t id, + core::Offset offset) { + ASSERT_EQ(token.value_, expected_string); + ASSERT_EQ(token.id_, id); + ASSERT_EQ(token.offset_, offset); + }; + check_token(chinese_tokens[0], "今", 508, {0, 3}); + check_token(chinese_tokens[1], "##天", 12172, {3, 6}); + check_token(chinese_tokens[2], "##天", 12172, {6, 9}); + check_token(chinese_tokens[3], "##气", 12311, {9, 12}); + check_token(chinese_tokens[4], "##真", 12427, {12, 15}); + check_token(chinese_tokens[5], "##好", 12217, {15, 18}); +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/test/utils.h b/fast_tokenizer/fast_tokenizer/test/utils.h new file mode 100644 index 0000000000000000000000000000000000000000..af1525ebbf1da5b4998e7d1639d5e86bbe84bd57 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/test/utils.h @@ -0,0 +1,37 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include + +#include "glog/logging.h" +#include "gtest/gtest.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tests { + +template +void CheckVectorEqual(const std::vector& a, const std::vector& b) { + ASSERT_EQ(a.size(), b.size()); + auto size = a.size(); + for (int i = 0; i < size; ++i) { + ASSERT_EQ(a[i], b[i]); + } +} + +} // namespace tests +} // namespace fast_tokenizer +} // namespace paddlenlp \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/tokenizers/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/tokenizers/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.cc b/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..929d6320cf7bb8046cd03d395799bcc447336c9e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.cc @@ -0,0 +1,138 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/tokenizers/clip_fast_tokenizer.h" +#include "fast_tokenizer/models/models.h" +#include "fast_tokenizer/normalizers/normalizers.h" +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include "fast_tokenizer/pretokenizers/pretokenizers.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tokenizers_impl { + +ClipFastTokenizer::ClipFastTokenizer( + const std::string& vocab_path, + const std::string& merges_path, + uint32_t max_length, + const std::string& unk_token, + const std::string& pad_token, + const std::string& bos_token, + const std::string& eos_token, + bool add_prefix_space, + const std::string& continuing_subword_prefix, + const std::string& end_of_word_suffix, + bool trim_offsets) { + core::Vocab vocab; + core::Merges merges; + models::BPE::GetVocabAndMergesFromFile( + vocab_path, merges_path, &vocab, &merges); + VLOG(6) << "The vocab size of ClipFastTokenizer is " << vocab.size(); + VLOG(6) << "The merges size of ClipFastTokenizer is " << merges.size(); + + models::BPE bpe(vocab, + merges, + 10000, + {}, + {unk_token}, + {continuing_subword_prefix}, + {end_of_word_suffix}, + false); + // Set tokenizer model + this->SetModel(bpe); + + // Set added tokens + std::vector added_tokens; + uint32_t id; + unk_token_ = unk_token; + if (this->TokenToId(unk_token, &id)) { + added_tokens.emplace_back(unk_token, true); + } + pad_token_ = pad_token; + if (this->TokenToId(pad_token, &id)) { + added_tokens.emplace_back(pad_token, true); + pad_token_id_ = id; + } + bos_token_ = bos_token; + if (this->TokenToId(bos_token, &id)) { + added_tokens.emplace_back(bos_token, true); + bos_token_id_ = id; + } + eos_token_ = eos_token; + if (this->TokenToId(eos_token, &id)) { + 
added_tokens.emplace_back(eos_token, true); + eos_token_id_ = id; + } + this->AddSpecialTokens(added_tokens); + + // Set normalizers + normalizers::NFCNormalizer nfc_normalizer; + normalizers::ReplaceNormalizer replace_normalizer(R"(\s+)", " "); + normalizers::LowercaseNormalizer lower_normalizer; + normalizers::SequenceNormalizer seq_normalizer; + seq_normalizer.AppendNormalizer(&nfc_normalizer); + seq_normalizer.AppendNormalizer(&replace_normalizer); + seq_normalizer.AppendNormalizer(&lower_normalizer); + this->SetNormalizer(seq_normalizer); + + // Set pretokenizers + pretokenizers::ByteLevelPreTokenizer byte_level_pretokenizer(add_prefix_space, + true); + pretokenizers::SplitPreTokenizer split_pretokenizer( + R"('s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+)", + core::SplitMode::REMOVED, + true); + pretokenizers::SequencePreTokenizer seq_pretokenizer; + seq_pretokenizer.AppendPreTokenizer(&split_pretokenizer); + seq_pretokenizer.AppendPreTokenizer(&byte_level_pretokenizer); + this->SetPreTokenizer(seq_pretokenizer); + + // Set postprocessors + postprocessors::RobertaPostProcessor roberta_postprocessor( + {eos_token, eos_token_id_}, + {bos_token, bos_token_id_}, + /* trim_offsets= */ false, + add_prefix_space); + this->SetPostProcessor(roberta_postprocessor); + + if (max_length == 0) { + this->DisableTruncMethod(); + } else { + this->EnableTruncMethod(max_length, + 0, + core::Direction::RIGHT, + core::TruncStrategy::LONGEST_FIRST); + } +} + +std::string ClipFastTokenizer::GetPadToken() const { return pad_token_; } + +uint32_t ClipFastTokenizer::GetPadTokenId() const { return pad_token_id_; } + +std::string ClipFastTokenizer::GetUNKToken() const { return unk_token_; } + +uint32_t ClipFastTokenizer::GetUNKTokenId() const { return unk_token_id_; } + +std::string ClipFastTokenizer::GetBOSToken() const { return bos_token_; } + +uint32_t ClipFastTokenizer::GetBOSTokenId() const { return bos_token_id_; } + +std::string ClipFastTokenizer::GetEOSToken() const { return eos_token_; } + +uint32_t ClipFastTokenizer::GetEOSTokenId() const { return eos_token_id_; } + +} // namespace tokenizers_impl +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.h b/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.h new file mode 100644 index 0000000000000000000000000000000000000000..c018919615563f4e8191d84c87374f88ee41de40 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/tokenizers/clip_fast_tokenizer.h @@ -0,0 +1,61 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/
+
+#pragma once
+
+#include <string>
+#include <vector>
+#include "fast_tokenizer/core/encoding.h"
+#include "fast_tokenizer/core/tokenizer.h"
+#include "fast_tokenizer/utils/utils.h"
+
+namespace paddlenlp {
+namespace fast_tokenizer {
+namespace tokenizers_impl {
+
+struct FASTTOKENIZER_DECL ClipFastTokenizer : public core::Tokenizer {
+  ClipFastTokenizer(const std::string& vocab_path,
+                    const std::string& merges_path,
+                    uint32_t max_length = 0,
+                    const std::string& unk_token = "<|endoftext|>",
+                    const std::string& pad_token = "<|endoftext|>",
+                    const std::string& bos_token = "<|startoftext|>",
+                    const std::string& eos_token = "<|endoftext|>",
+                    bool add_prefix_space = false,
+                    const std::string& continuing_subword_prefix = "",
+                    const std::string& end_of_word_suffix = "",
+                    bool trim_offsets = false);
+  std::string GetPadToken() const;
+  uint32_t GetPadTokenId() const;
+  std::string GetUNKToken() const;
+  uint32_t GetUNKTokenId() const;
+  std::string GetBOSToken() const;
+  uint32_t GetBOSTokenId() const;
+  std::string GetEOSToken() const;
+  uint32_t GetEOSTokenId() const;
+
+private:
+  std::string pad_token_;
+  uint32_t pad_token_id_;
+  std::string unk_token_;
+  uint32_t unk_token_id_;
+  std::string bos_token_;
+  uint32_t bos_token_id_;
+  std::string eos_token_;
+  uint32_t eos_token_id_;
+};
+
+} // namespace tokenizers_impl
+} // namespace fast_tokenizer
+} // namespace paddlenlp
diff --git a/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.cc b/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.cc
new file mode 100644
index 0000000000000000000000000000000000000000..2c9d3bacbd5ce93dea93ee8932addaa0227a992e
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.cc
@@ -0,0 +1,152 @@
+/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
*/ + +#include "fast_tokenizer/tokenizers/ernie_fast_tokenizer.h" +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/models/models.h" +#include "fast_tokenizer/normalizers/normalizers.h" +#include "fast_tokenizer/postprocessors/postprocessors.h" +#include "fast_tokenizer/pretokenizers/pretokenizers.h" +#include "fast_tokenizer/utils/utils.h" +#include "glog/logging.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tokenizers_impl { + +ErnieFastTokenizer::ErnieFastTokenizer(const std::string& vocab_path, + const std::string& unk_token, + const std::string& sep_token, + const std::string& cls_token, + const std::string& pad_token, + const std::string& mask_token, + bool clean_text, + bool handle_chinese_chars, + bool strip_accents, + bool lowercase, + const std::string& wordpieces_prefix, + uint32_t max_sequence_len) { + core::Vocab vocab; + utils::GetVocabFromFiles(vocab_path, &vocab); + VLOG(6) << "The vocab size of ErnieFastTokenizer is " << vocab.size(); + Init(vocab, + unk_token, + sep_token, + cls_token, + pad_token, + mask_token, + clean_text, + handle_chinese_chars, + strip_accents, + lowercase, + wordpieces_prefix, + max_sequence_len); +} + + +ErnieFastTokenizer::ErnieFastTokenizer(const core::Vocab& vocab, + const std::string& unk_token, + const std::string& sep_token, + const std::string& cls_token, + const std::string& pad_token, + const std::string& mask_token, + bool clean_text, + bool handle_chinese_chars, + bool strip_accents, + bool lowercase, + const std::string& wordpieces_prefix, + uint32_t max_sequence_len) { + Init(vocab, + unk_token, + sep_token, + cls_token, + pad_token, + mask_token, + clean_text, + handle_chinese_chars, + strip_accents, + lowercase, + wordpieces_prefix, + max_sequence_len); +} + + +void ErnieFastTokenizer::Init(const core::Vocab& vocab, + const std::string& unk_token, + const std::string& sep_token, + const std::string& cls_token, + const std::string& pad_token, + const std::string& mask_token, + bool clean_text, + bool handle_chinese_chars, + bool strip_accents, + bool lowercase, + const std::string& wordpieces_prefix, + uint32_t max_sequence_len) { + models::FastWordPiece wordpiece(vocab, + unk_token, + 100 /* max_input_chars_per_word */, + wordpieces_prefix, + true); + this->SetModel(wordpiece); + + std::vector added_tokens; + uint32_t id; + if (this->TokenToId(unk_token, &id)) { + added_tokens.emplace_back(unk_token, true); + } + if (this->TokenToId(sep_token, &id)) { + added_tokens.emplace_back(sep_token, true); + } + if (this->TokenToId(cls_token, &id)) { + added_tokens.emplace_back(cls_token, true); + } + if (this->TokenToId(pad_token, &id)) { + added_tokens.emplace_back(pad_token, true); + } + if (this->TokenToId(mask_token, &id)) { + added_tokens.emplace_back(mask_token, true); + } + this->AddSpecialTokens(added_tokens); + + + normalizers::BertNormalizer bert_normalizer( + clean_text, handle_chinese_chars, strip_accents, lowercase); + this->SetNormalizer(bert_normalizer); + + if (vocab.size() > 0) { + uint32_t sep_id, cls_id; + if (!this->TokenToId(sep_token, &sep_id)) { + throw std::invalid_argument("sep_token not found in the vocabulary"); + } + if (!this->TokenToId(cls_token, &cls_id)) { + throw std::invalid_argument("cls_token not found in the vocabulary"); + } + postprocessors::BertPostProcessor bert_postprocessor({sep_token, sep_id}, + {cls_token, cls_id}); + this->SetPostProcessor(bert_postprocessor); + } + if (max_sequence_len == 0) { + this->DisableTruncMethod(); + } else { + 
this->EnableTruncMethod(max_sequence_len, + 0, + core::Direction::RIGHT, + core::TruncStrategy::LONGEST_FIRST); + } +} + +} // namespace tokenizers_impl +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.h b/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.h new file mode 100644 index 0000000000000000000000000000000000000000..b1cf0dbe52b2bc04a7b35fdb6a8440a0f8799a08 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/tokenizers/ernie_fast_tokenizer.h @@ -0,0 +1,70 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ +#pragma once + +#include +#include +#include "fast_tokenizer/core/encoding.h" +#include "fast_tokenizer/core/tokenizer.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace tokenizers_impl { + +struct FASTTOKENIZER_DECL ErnieFastTokenizer : public core::Tokenizer { + ErnieFastTokenizer(const std::string& vocab_path, + const std::string& unk_token = "[UNK]", + const std::string& sep_token = "[SEP]", + const std::string& cls_token = "[CLS]", + const std::string& pad_token = "[PAD]", + const std::string& mask_token = "[MASK]", + bool clean_text = true, + bool handle_chinese_chars = true, + bool strip_accents = true, + bool lowercase = true, + const std::string& wordpieces_prefix = "##", + uint32_t max_sequence_len = 0); + + ErnieFastTokenizer(const core::Vocab& vocab, + const std::string& unk_token = "[UNK]", + const std::string& sep_token = "[SEP]", + const std::string& cls_token = "[CLS]", + const std::string& pad_token = "[PAD]", + const std::string& mask_token = "[MASK]", + bool clean_text = true, + bool handle_chinese_chars = true, + bool strip_accents = true, + bool lowercase = true, + const std::string& wordpieces_prefix = "##", + uint32_t max_sequence_len = 0); + +private: + void Init(const core::Vocab& vocab, + const std::string& unk_token = "[UNK]", + const std::string& sep_token = "[SEP]", + const std::string& cls_token = "[CLS]", + const std::string& pad_token = "[PAD]", + const std::string& mask_token = "[MASK]", + bool clean_text = true, + bool handle_chinese_chars = true, + bool strip_accents = true, + bool lowercase = true, + const std::string& wordpieces_prefix = "##", + uint32_t max_sequence_len = 0); +}; + +} // namespace fast_tokenizer_impl +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/CMakeLists.txt b/fast_tokenizer/fast_tokenizer/utils/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..d4ef2b1eb91f094f1254468777b2f431f5f27b29 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/CMakeLists.txt @@ -0,0 +1,5 @@ +cc_library(utils SRCS utils.cc DEPS icuuc icudata) +cc_library(trie SRCS trie.cc DEPS dart utils) +cc_library(failure SRCS failure.cc DEPS trie utils) +cc_library(sentencepiece_normalizer SRCS sentencepiece_normalizer.cc DEPS trie icuuc icudata utils) +cc_library(lattice 
SRCS lattice.cc DEPS utils) \ No newline at end of file diff --git a/fast_tokenizer/fast_tokenizer/utils/cache.h b/fast_tokenizer/fast_tokenizer/utils/cache.h new file mode 100644 index 0000000000000000000000000000000000000000..70471057239442b5febf466c932de370626fe47e --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/cache.h @@ -0,0 +1,102 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include + +#include "fast_tokenizer/utils/shared_mutex.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +static size_t DEFAULT_CACHE_CAPACITY = 10000; +typedef utils::shared_mutex RWLock; +typedef std::unique_lock WLock; +typedef utils::shared_lock RLock; + +template +struct Cache { + std::unordered_map map_; + size_t capacity_; + Cache(size_t capacity = DEFAULT_CACHE_CAPACITY) : capacity_(capacity) { + Fresh(); + } + + Cache(const Cache& other) { + RLock guard(cache_mutex_); + map_ = other.map_; + capacity_ = other.capacity_; + } + + Cache& operator=(const Cache& other) { + RLock guard(cache_mutex_); + map_ = other.map_; + capacity_ = other.capacity_; + return *this; + } + + void Fresh() { CreateCacheMap(capacity_); } + void Clear() { + WLock guard(cache_mutex_); + map_.clear(); + } + + bool GetValue(const K& key, V* value) { + // It's not guaranteed to get the value if the key is in cache + // for non-blocking read. + if (cache_mutex_.try_lock_shared()) { + if (map_.find(key) == map_.end()) { + cache_mutex_.unlock_shared(); + return false; + } + *value = map_.at(key); + cache_mutex_.unlock_shared(); + return true; + } + return false; + } + + bool SetValue(const K& key, const V& value) { + // Before trying to acquire a write lock, we check if we are already at + // capacity with a read handler. + if (cache_mutex_.try_lock_shared()) { + if (map_.size() >= capacity_) { + cache_mutex_.unlock_shared(); + return false; + } + } else { + return false; + } + if (cache_mutex_.try_lock()) { + map_.insert({key, value}); + cache_mutex_.unlock(); + return true; + } + return false; + } + +private: + void CreateCacheMap(size_t capacity) { + WLock guard(cache_mutex_); + map_ = std::unordered_map(capacity); + } + RWLock cache_mutex_; +}; + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/failure.cc b/fast_tokenizer/fast_tokenizer/utils/failure.cc new file mode 100644 index 0000000000000000000000000000000000000000..1ae50b4d334e8fb2589a6f0cb77af5479587b474 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/failure.cc @@ -0,0 +1,425 @@ +// Copyright 2022 TF.Text Authors. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include + +#include "glog/logging.h" +#include "fast_tokenizer/utils/failure.h" +#include "fast_tokenizer/utils/trie.h" +#include "fast_tokenizer/utils/utf8.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +Failure::Failure() + : failure_link_(utils::kNullNode), + failure_pops_offset_length_(utils::kNullFailurePopsList) {} + +FailureVocabToken::FailureVocabToken( + const std::string& token, + int token_id, + const std::string& continuing_subword_prefix) + : token_(token), + token_id_(token_id), + is_suffix_token_(false), + actual_token_start_offset_(0), + actual_token_unicode_len_(0), + contains_punctuation_(false) { + if (!continuing_subword_prefix.empty() && + token_ != continuing_subword_prefix && + utils::IsSuffixWord(token_, continuing_subword_prefix)) { + is_suffix_token_ = true; + actual_token_start_offset_ = continuing_subword_prefix.size(); + } + // Iterate over the Unicode chars from the token, to initialize + // contains_punctuation_ and actual_token_unicode_len_. + int token_len = token.size(); + int cur_pos = actual_token_start_offset_; + uint32_t ch; + const char* pSrc = token.c_str(); + while (cur_pos < token_len) { + uint32_t count = utils::UTF8ToUInt32(pSrc + cur_pos, &ch); + cur_pos += count; + ch = utils::UTF8ToUnicode(ch); + if (!contains_punctuation_ && utils::IsPunctuationOrChineseChar(ch)) { + contains_punctuation_ = true; + } + ++actual_token_unicode_len_; + } +} + +const std::string& FailureVocabToken::Token() const { return token_; } + +int FailureVocabToken::TokenId() const { return token_id_; } + +bool FailureVocabToken::IsSuffixToken() const { return is_suffix_token_; } + +bool FailureVocabToken::ContainsPunctuation() const { + return contains_punctuation_; +} + +int FailureVocabToken::TokenUnicodeLengthWithoutContinuingSubwordPrefix() + const { + return actual_token_unicode_len_; +} + +int FailureVocabToken::TokenLengthWithoutContinuingSubwordPrefix() const { + return token_.size() - actual_token_start_offset_; +} + +void FailureArray::BuildFailureVocab( + const std::unordered_map& vocab, + const std::string& unk_token, + const std::string& continuing_subword_prefix) { + if (vocab.size() > utils::kMaxSupportedVocabSize) { + std::ostringstream oss; + oss << "Vocab size exceeds the max supported (" + << utils::kMaxSupportedVocabSize + << "). Found vocab size: " << vocab.size(); + throw std::invalid_argument(oss.str()); + } + failure_vocab_tokens_.reserve(vocab.size()); + int unk_id = vocab.at(unk_token); + for (auto& item : vocab) { + if (item.first == continuing_subword_prefix) { + VLOG(6) + << "The empty suffix token is found in the vocabulary, which takes " + "place in token id space but will (almost) never be used in the " + "result. Consider cleaning it from the vocabulary."; + continue; + } + if (item.first.empty()) { + VLOG(6) + << "The empty string is found in the vocabulary, which takes place " + "in the token id space but will never be used in the result. 
" + "Consider cleaning it from the vocabulary."; + continue; + } + FailureVocabToken vocab_token( + item.first, item.second, continuing_subword_prefix); + if (vocab_token.TokenLengthWithoutContinuingSubwordPrefix() > + utils::kMaxVocabTokenLengthInUTF8Bytes) { + std::ostringstream oss; + oss << "Vocab token utf8 length (excluding suffix indicator) exceeds the " + "max supported (" + << utils::kMaxVocabTokenLengthInUTF8Bytes + << "). The vocab token is: " << item.first + << " with utf8 length (excluding suffix indicator): " + << vocab_token.TokenLengthWithoutContinuingSubwordPrefix(); + throw std::invalid_argument(oss.str()); + } + // Skip the word which contains punctuation but not a punctuation. + if (with_pretokenization_ && vocab_token.ContainsPunctuation() && + (vocab_token.IsSuffixToken() || + vocab_token.TokenUnicodeLengthWithoutContinuingSubwordPrefix() > 1)) { + continue; + } + failure_vocab_tokens_.emplace_back(vocab_token); + } + if (failure_vocab_tokens_.empty()) { + std::ostringstream oss; + oss << "No valid vocab tokens were found to build the trie."; + throw std::invalid_argument(oss.str()); + } + if (!continuing_subword_prefix.empty()) { + const bool suffix_token_exists = std::any_of( + failure_vocab_tokens_.begin(), + failure_vocab_tokens_.end(), + [](const FailureVocabToken& token) { return token.IsSuffixToken(); }); + if (!suffix_token_exists) { + auto new_suffix_token = continuing_subword_prefix + + std::string(1, utils::kInvalidControlChar); + failure_vocab_tokens_.emplace_back( + new_suffix_token, unk_id, continuing_subword_prefix); + } + } + if (with_pretokenization_) { + for (uint32_t cp = 1; cp <= 0x0010FFFF; ++cp) { + if (!utils::IsUnicodeChar(cp) || !utils::IsPunctuationOrChineseChar(cp)) { + continue; + } + char utf_str[5]; + utils::GetUTF8Str(reinterpret_cast(&cp), utf_str, 1); + std::string punc_str(utf_str); + if (vocab.count(punc_str) == 0) { + failure_vocab_tokens_.emplace_back( + punc_str, unk_id, continuing_subword_prefix); + } + } + failure_vocab_tokens_.emplace_back( + std::string(1, kInvalidControlChar), unk_id, continuing_subword_prefix); + } +} + +void FailureArray::CreateVocabFromFailureVocab( + const std::vector& failure_vocab_tokens, + std::unordered_map* vocab) const { + for (auto&& failure_vocab : failure_vocab_tokens) { + (*vocab)[failure_vocab.Token()] = failure_vocab.TokenId(); + } +} + +void FailureArray::InitFromVocabAndTrie( + const std::unordered_map& vocab, + Trie* trie, + const std::string& unk_token, + const std::string& continuing_subword_prefix) { + BuildFailureVocab(vocab, unk_token, continuing_subword_prefix); + + // Create Trie + std::unordered_map new_vocab; + CreateVocabFromFailureVocab(failure_vocab_tokens_, &new_vocab); + trie->SetVocab(new_vocab); + + // Create failure array + BuildFailureArray(failure_vocab_tokens_, trie); +} + +void FailureArray::RemovePunctuationTrieLink(Trie* trie) const { + auto continuing_subword_prefix = trie->GetContinuingSubwordPrefix(); + if (with_pretokenization_ && !continuing_subword_prefix.empty()) { + int cur_idx = 0; + int next_idx = 0; + uint32_t curr_char, next_char; + bool prev_node_is_root = false; + auto node = trie->CreateRootTraversalCursor(); + while (cur_idx < continuing_subword_prefix.length()) { + next_idx = cur_idx; + auto chwidth = utils::UTF8ToUInt32( + continuing_subword_prefix.data() + next_idx, &curr_char); + curr_char = utils::UTF8ToUnicode(curr_char); + next_idx = cur_idx + chwidth; + prev_node_is_root = (node.node_id_ == trie->kRootNodeId); + std::string 
cur_unicode_char(continuing_subword_prefix.data() + cur_idx, + chwidth); + if (!trie->TryTraverseSeveralSteps(&node, cur_unicode_char)) { + throw std::runtime_error( + "Cannot locate a character in suffix_indicator_. It should never " + "happen."); + } + if (IsPunctuationOrChineseChar(curr_char)) { + if (prev_node_is_root) { + cur_idx = next_idx; + auto next_chwidth = utils::UTF8ToUInt32( + continuing_subword_prefix.data() + next_idx, &next_char); + next_idx += next_chwidth; + std::string next_unicode_char( + continuing_subword_prefix.data() + cur_idx, next_chwidth); + auto child_node = node; + if (!trie->TryTraverseSeveralSteps(&child_node, next_unicode_char)) { + throw std::runtime_error( + "Cannot locate a character in suffix_indicator_. It should " + "never happen."); + } + trie->DeleteLinkFromParent(child_node.node_id_); + } else { + trie->DeleteLinkFromParent(node.node_id_); + } + break; + } + cur_idx = next_idx; + } + } +} + +// Algorithm 2 in https://arxiv.org/pdf/2012.15524.pdf +void FailureArray::BuildFailureArray( + const std::vector& failure_vocab_tokens, Trie* trie) { + std::vector> node_outgoing_edge_labels; + BuildOutgoingEdgeLabelsForTrie( + failure_vocab_tokens, trie, &node_outgoing_edge_labels); + failure_array_.resize(trie->Size()); + std::queue trie_node_queue({trie->kRootNodeId}); + if (trie->GetSuffixRoot() != trie->kRootNodeId) { + trie_node_queue.push(trie->GetSuffixRoot()); + } + while (!trie_node_queue.empty()) { + uint32_t parent_id = trie_node_queue.front(); + trie_node_queue.pop(); + std::vector outgoing_labels_sorted( + node_outgoing_edge_labels[parent_id].begin(), + node_outgoing_edge_labels[parent_id].end()); + std::sort(outgoing_labels_sorted.begin(), outgoing_labels_sorted.end()); + for (const char edge_label : outgoing_labels_sorted) { + auto child_node = trie->CreateTraversalCursor(parent_id); + if (!trie->TryTraverseOneStep(&child_node, edge_label)) { + std::ostringstream oss; + oss << "Failed to traverse to child following edge " << edge_label + << " at parent " << parent_id << "."; + throw std::runtime_error(oss.str()); + } + if (child_node.node_id_ == trie->GetSuffixRoot()) { + continue; + } + int child_data_value = -1; + // Case 1: str(v) in V + // * f(v) = trie.GetSuffixRoot() + // * F(v) = [str(v)] + if (trie->TryGetData(child_node, &child_data_value)) { + uint32_t failure_link = trie->GetSuffixRoot(); + if (node_id_is_punc_map_.count(child_node.node_id_) == 0) { + throw std::invalid_argument( + "Failed to find if an end node in the trie is a punctuation char " + "in node_id_is_punc_map_. It should never happen."); + } + if (with_pretokenization_ && + node_id_is_punc_map_.at(child_node.node_id_)) { + failure_link = trie->GetPuncFailureNode(); + } + AssignFailureLinkAndPops(child_node.node_id_, + failure_link, + {child_data_value}, + utils::kNullFailurePopsList); + trie_node_queue.push(child_node.node_id_); + continue; + } + + // Case 2: str(v) is not in V + const Failure& parent_failure = failure_array_[parent_id]; + if (parent_failure.failure_link_ != utils::kNullNode) { + std::vector one_step_pops; + auto curr_node = + trie->CreateTraversalCursor(parent_failure.failure_link_); + // Find the failure link util the failure link is root or + // the node has the outgoing label correspoding to edge_label. 
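// (Rough sketch of the walk below: starting from f(parent), follow failure
// links; each time we fall back from a node, append that node's failure pops
// to one_step_pops. When we reach a node u that can be advanced by edge_label,
// f(v) becomes goto(u, edge_label) and F(v) becomes F(parent) followed by the
// pops collected on the walk; if no such u exists, f(v) and F(v) stay empty.)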
+ while (true) { + if (trie->TryTraverseOneStep(&curr_node, edge_label)) { + AssignFailureLinkAndPops( + child_node.node_id_, + curr_node.node_id_, + one_step_pops, + parent_failure.failure_pops_offset_length_); + break; + } + const Failure& curr_node_failure = failure_array_[curr_node.node_id_]; + if (curr_node_failure.failure_link_ == utils::kNullNode) { + break; + } + GetFailurePopsAndAppendToOut( + curr_node_failure.failure_pops_offset_length_, &one_step_pops); + trie->SetTraversalCursor(&curr_node, curr_node_failure.failure_link_); + } + } + // If the failure_link of parent is root, + // * f(v) = none + // * F(v) = [] + trie_node_queue.push(child_node.node_id_); + } + } + RemovePunctuationTrieLink(trie); +} + +void FailureArray::AssignFailureLinkAndPops( + uint32_t cur_node, + uint32_t failure_link, + const std::vector& one_step_pops, + int parent_failure_pops_offset_length) { + if (failure_link == utils::kNullNode) { + return; + } + auto& curr_node_failure = failure_array_[cur_node]; + curr_node_failure.failure_link_ = failure_link; + if (one_step_pops.empty()) { + curr_node_failure.failure_pops_offset_length_ = + parent_failure_pops_offset_length; + } else { + const int offset = failure_pops_pool_.size(); + if (offset > utils::kMaxSupportedFailurePoolOffset) { + std::ostringstream oss; + oss << "Failure pops list offset is " << offset + << ", which exceeds maximum supported offset " + << utils::kMaxSupportedFailurePoolOffset + << ". The vocabulary seems to be too large to be supported."; + throw std::runtime_error(oss.str()); + } + GetFailurePopsAndAppendToOut(parent_failure_pops_offset_length, + &failure_pops_pool_); + failure_pops_pool_.insert( + failure_pops_pool_.end(), one_step_pops.begin(), one_step_pops.end()); + const int length = failure_pops_pool_.size() - offset; + if (length > utils::kMaxSupportedFailurePoolOffset) { + std::ostringstream oss; + oss << "Failure pops list size is " << length + << ", which exceeds maximum supported offset " + << utils::kMaxFailurePopsListSize; + throw std::runtime_error(oss.str()); + } + curr_node_failure.failure_pops_offset_length_ = + utils::EncodeFailurePopList(offset, length); + } +} + +void FailureArray::GetFailurePopsAndAppendToOut( + uint32_t failure_pops_offset_length, std::vector* out_failure_pops) { + if (failure_pops_offset_length == utils::kNullFailurePopsList) { + return; + } + int offset = 0, length = 0; + utils::GetFailurePopsOffsetAndLength( + failure_pops_offset_length, &offset, &length); + out_failure_pops->insert(out_failure_pops->end(), + failure_pops_pool_.begin() + offset, + failure_pops_pool_.begin() + offset + length); +} + +void FailureArray::BuildOutgoingEdgeLabelsForTrie( + const std::vector& failure_vocab_tokens, + Trie* trie, + std::vector>* node_outgoing_edge_labels) { + node_outgoing_edge_labels->resize(trie->Size()); + const std::string dummy_token = std::string(1, utils::kInvalidControlChar); + for (auto& item : failure_vocab_tokens) { + if (item.Token() != dummy_token) { + BuildOutgoingEdgeLabelsFromToken(item, trie, node_outgoing_edge_labels); + } + } +} + +void FailureArray::BuildOutgoingEdgeLabelsFromToken( + const FailureVocabToken& vocab_token, + Trie* trie, + std::vector>* node_outgoing_edge_labels) { + const std::string& token = vocab_token.Token(); + Trie::TraversalCursor curr_node; + int char_pos = 0; + trie->SetTraversalCursor(&curr_node, Trie::kRootNodeId); + while (char_pos < token.size()) { + const char edge_label = token[char_pos]; + 
(*node_outgoing_edge_labels)[curr_node.node_id_].insert(edge_label); + if (!trie->TryTraverseOneStep(&curr_node, edge_label)) { + std::ostringstream oss; + oss << "Error in traversing to child following edge `" << edge_label + << "` from the prefix `" << token.substr(0, char_pos) + << "` at parent id " << curr_node.node_id_ << ". The token is `" + << token << "`. The char position" + << " is " << char_pos << "."; + + throw std::runtime_error(oss.str()); + } + ++char_pos; + } + node_id_is_punc_map_[curr_node.node_id_] = + !vocab_token.IsSuffixToken() && vocab_token.ContainsPunctuation() && + vocab_token.TokenUnicodeLengthWithoutContinuingSubwordPrefix() == 1; +} + + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/failure.h b/fast_tokenizer/fast_tokenizer/utils/failure.h new file mode 100644 index 0000000000000000000000000000000000000000..c302f53496a8b4bd3156234635f108e50ac1e8e2 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/failure.h @@ -0,0 +1,107 @@ +// Copyright 2022 TF.Text Authors. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +class Trie; + +// Used in Fast WordPiece Model specially +struct Failure { + uint32_t failure_link_; + // Indicate the number of failure_pops + // and the offset in failure_pops_pool + uint32_t failure_pops_offset_length_; + Failure(); +}; + +class FailureVocabToken { +public: + FailureVocabToken(const std::string& token, + int token_id, + const std::string& continuing_subword_prefix); + + const std::string& Token() const; + + int TokenId() const; + bool IsSuffixToken() const; + bool ContainsPunctuation() const; + int TokenUnicodeLengthWithoutContinuingSubwordPrefix() const; + int TokenLengthWithoutContinuingSubwordPrefix() const; + +private: + std::string token_; + int token_id_; + bool is_suffix_token_; + int actual_token_start_offset_; + int actual_token_unicode_len_; + bool contains_punctuation_; +}; + +struct FailureArray { + FailureArray(bool with_pretokenization = false) + : with_pretokenization_(with_pretokenization) {} + void BuildFailureArray( + const std::vector& failure_vocab_tokens, Trie* trie); + void BuildFailureVocab(const std::unordered_map& vocab, + const std::string& unk_token, + const std::string& continuing_subword_prefix); + void InitFromVocabAndTrie( + const std::unordered_map& vocab, + Trie* trie, + const std::string& unk_token, + const std::string& continuing_subword_prefix); + const Failure* GetFailure(int idx) const { return &(failure_array_.at(idx)); } + int GetFailurePop(int idx) const { return failure_pops_pool_.at(idx); } + void SetWithPretokenization(bool with_pretokenization) { + with_pretokenization_ = with_pretokenization; + } + +private: + void BuildOutgoingEdgeLabelsForTrie( + const std::vector& failure_vocab_tokens, + Trie* trie, + 
std::vector>* node_outgoing_edge_labels); + void BuildOutgoingEdgeLabelsFromToken( + const FailureVocabToken& vocab_token, + Trie* trie, + std::vector>* node_outgoing_edge_labels); + void AssignFailureLinkAndPops(uint32_t cur_node, + uint32_t failure_link, + const std::vector& one_step_pops, + int parent_failure_pops_offset_length); + void GetFailurePopsAndAppendToOut(uint32_t failure_pops_offset_length, + std::vector* out_failure_pops); + void RemovePunctuationTrieLink(Trie* trie) const; + void CreateVocabFromFailureVocab( + const std::vector& failure_vocab_tokens, + std::unordered_map* vocab) const; + std::vector failure_array_; + std::vector failure_pops_pool_; + std::unordered_map node_id_is_punc_map_; + std::vector failure_vocab_tokens_; + bool with_pretokenization_; // The end-to-end version of FailureArray +}; + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/lattice.cc b/fast_tokenizer/fast_tokenizer/utils/lattice.cc new file mode 100644 index 0000000000000000000000000000000000000000..14447788cda5ea07e4a41129f6da4452a57e2b61 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/lattice.cc @@ -0,0 +1,546 @@ +// Copyright 2016 Google Inc. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "fast_tokenizer/utils/lattice.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "glog/logging.h" + +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +// Size of nodes pre-allocated in Lattice. +constexpr size_t kPreallocateLatticeNodeSize = 1024; + +constexpr float kEpsilon = 1e-7; + +constexpr unsigned int kDefaultSeed = static_cast(-1); +static std::atomic g_seed(kDefaultSeed); + +inline float LogSumExp(float x, float y, bool init_mode) { + if (init_mode) { + return y; + } + const float vmin = std::min(x, y); + const float vmax = std::max(x, y); + constexpr float kMinusLogEpsilon = 50; + if (vmax > vmin + kMinusLogEpsilon) { + return vmax; + } else { + return vmax + log(std::exp(static_cast(vmin - vmax)) + 1.0); + } +} + +uint32_t GetRandomGeneratorSeed() { + return g_seed == kDefaultSeed ? 
std::random_device{}() : g_seed.load(); +} + +std::mt19937 *GetRandomGenerator() { + thread_local static std::mt19937 mt(GetRandomGeneratorSeed()); + return &mt; +} + +inline float Gumbel() { + const float kEpsilon = 1e-7; + auto *mt = GetRandomGenerator(); + std::uniform_real_distribution dis(0.0, 1.0); + float noise = -std::log(-(std::log(dis(*mt) + kEpsilon))); + return noise; +} + +Lattice::Lattice() : node_allocator_(kPreallocateLatticeNodeSize) {} +Lattice::~Lattice() {} + +const std::vector &Lattice::begin_nodes(int pos) const { + return begin_nodes_[pos]; +} + +const std::vector &Lattice::end_nodes(int pos) const { + return end_nodes_[pos]; +} + +int Lattice::size() const { + // -1 because surface_ may include the EOS. + return std::max(0, surface_.size() - 1); +} + +int Lattice::utf8_size() const { return sentence_.size(); } + +const char *Lattice::sentence() const { return sentence_.data(); } + +const char *Lattice::surface(int pos) const { return surface_[pos]; } + +Lattice::Node *Lattice::bos_node() const { return end_nodes_[0][0]; } + +Lattice::Node *Lattice::eos_node() const { return begin_nodes_[size()][0]; } + +Lattice::Node *Lattice::NewNode() { + Node *node = node_allocator_.Allocate(); + node->node_id = node_allocator_.size() - 1; + return node; +} + +void Lattice::Clear() { + begin_nodes_.clear(); + end_nodes_.clear(); + sentence_ = utils::simple_string_view(""); + surface_.clear(); + node_allocator_.Free(); +} + +void Lattice::SetSentence(utils::simple_string_view sentence) { + Clear(); + + sentence_ = sentence; + surface_.reserve(sentence.size() + 1); + + while (!sentence.empty()) { + const int mblen = + std::min(utils::OneCharLen(sentence.data()), sentence.size()); + surface_.push_back(sentence.data()); + sentence.remove_prefix(mblen); + } + surface_.push_back(sentence.data()); + + const int len = size(); + begin_nodes_.resize(len + 1); + end_nodes_.resize(len + 1); + + constexpr size_t kReservedNodeSize = 16; + for (int i = 0; i <= len; ++i) { + begin_nodes_[i].reserve(kReservedNodeSize); + end_nodes_[i].reserve(kReservedNodeSize); + } + + Node *bos = NewNode(); + bos->id = -1; + bos->pos = 0; + end_nodes_[0].push_back(bos); + + Node *eos = NewNode(); + eos->id = -1; + eos->pos = len; + begin_nodes_[len].push_back(eos); +} + +Lattice::Node *Lattice::Insert(int pos, int length) { + Node *node = NewNode(); + node->pos = pos; + node->length = length; + const int utf8_length = + static_cast(surface(pos + length) - surface(pos)); + node->piece = simple_string_view(surface(pos), utf8_length); + begin_nodes_[pos].push_back(node); + end_nodes_[pos + node->length].push_back(node); + return node; +} + +Lattice::LatticePathWithScore Lattice::Viterbi() { + const int len = size(); + + for (int pos = 0; pos <= len; ++pos) { + for (Node *rnode : begin_nodes_[pos]) { + rnode->prev = nullptr; + float best_score = 0.0; + Node *best_node = nullptr; + for (Node *lnode : end_nodes_[pos]) { + const float score = lnode->backtrace_score + rnode->score; + if (best_node == nullptr || score > best_score) { + best_node = lnode; + best_score = score; + } + } + if (best_node == nullptr) { + LOG(ERROR) << "Failed to find the best path in Viterbi."; + return {}; + } + rnode->prev = best_node; + rnode->backtrace_score = best_score; + } + } + + // backtrace + std::vector results; + float score = begin_nodes(len)[0]->backtrace_score; + for (Node *node = begin_nodes_[len][0]->prev; node->prev != nullptr; + node = node->prev) { + results.push_back(node); + } + + std::reverse(results.begin(), 
results.end()); + + LatticePathWithScore retval = {results, score}; + + return retval; +} + +std::vector Lattice::ForwardAlgorithm(float inv_theta) const { + const int len = size(); + std::vector alpha(node_allocator_.size(), 0.0); + + for (int pos = 0; pos <= len; ++pos) { + for (Node *rnode : begin_nodes_[pos]) { + for (Node *lnode : end_nodes_[pos]) { + alpha[rnode->node_id] = + LogSumExp(alpha[rnode->node_id], + inv_theta * lnode->score + alpha[lnode->node_id], + lnode == end_nodes_[pos][0]); + } + } + } + + return alpha; +} + +std::vector Lattice::BackwardAlgorithm(float inv_theta) const { + const int len = size(); + std::vector beta(node_allocator_.size(), 0.0); + + for (int pos = len; pos >= 0; --pos) { + for (Node *lnode : end_nodes_[pos]) { + for (Node *rnode : begin_nodes_[pos]) { + beta[lnode->node_id] = LogSumExp(beta[lnode->node_id], + rnode->score + beta[rnode->node_id], + rnode == begin_nodes_[pos][0]); + } + } + } + + return beta; +} + +float Lattice::PopulateMarginal(float freq, + std::vector *expected) const { + if (expected == nullptr) return 0.0; + + const int len = size(); + + // alpha and beta (accumulative log prob) in Forward Backward. + // the index of alpha/beta is Node::node_id. + + const auto alpha = ForwardAlgorithm(1.0); + const auto beta = BackwardAlgorithm(1.0); + + const float Z = alpha[begin_nodes_[len][0]->node_id]; + for (int pos = 0; pos < len; ++pos) { + for (Node *node : begin_nodes_[pos]) { + if (node->id >= 0) { + // the index of |expected| is a Node::id, which is a vocabulary id. + (*expected)[node->id] += + freq * + std::exp(static_cast(alpha[node->node_id] + node->score + + beta[node->node_id] - Z)); + } + } + } + + return freq * Z; +} + +float Lattice::CalculateEntropy(float inv_theta) const { + const int len = size(); + + // alpha[node_id] is the marginal prob of sequence up to start of node + // H is entropy of sequence + // the index of alpha/H is Node::node_id. + std::vector H(node_allocator_.size(), 0.0); + + // Populate the forward marginals to get the normalising constant + const auto alpha = ForwardAlgorithm(inv_theta); + + // Now populate the forward entropies + for (int pos = 0; pos <= len; ++pos) { + for (Node *rnode : begin_nodes_[pos]) { + for (Node *lnode : end_nodes_[pos]) { + // Contribution each lnode makes = p(lnode) * (H(lnode) + log p(lnode)) + + // We have to normalise p(lnode) by the marginal contribution it makes + const float lnode_transition_prob = + ((inv_theta * lnode->score) + alpha[lnode->node_id] - + alpha[rnode->node_id]); + H[rnode->node_id] += std::exp(lnode_transition_prob) * + (H[lnode->node_id] + lnode_transition_prob); + } + } + } + + return -H[begin_nodes_[len][0]->node_id]; +} + +// The node structure to support A* algorithm in Lattice::NBest() +struct Hypothesis { + Lattice::Node *node; + Hypothesis *next; + float fx; // the priority to pop a new hypothesis from the priority queue. + float gx; // the sum of scores from EOS to the left-most node in x. +}; + +// Helper function for cloning a Hypothesis and the ones on their next paths. +// The graph structure is preserved. +// +// to_clone: the Hypothesis to clone. +// clone_map: mapping between the old pointers and the new pointers. +// allocator: allocate and own the cloned Hypothesis. +// +// Returns the cloned Hypothesis*. All Hypothesis on its "next" chain are also +// guaranteed to have been allocated via "allocator", and "clone_map" is updated +// with all new mappings. 
+Hypothesis *CloneHypAndDependents( + const Hypothesis *to_clone, + std::unordered_map *clone_map, + FreeList *allocator) { + Hypothesis *cloned = nullptr; + Hypothesis **result_callback = &cloned; + + // Iteratively clone "to_clone" and its dependencies. + // The new pointer will be written back to *result_callback. + while (to_clone != nullptr) { + // If "to_clone" has already been cloned before, we just look up the result. + auto iter = clone_map->find(to_clone); + if (iter != clone_map->end()) { + *result_callback = iter->second; + break; + } + + // Allocate a new Hypothesis and copy the values. + Hypothesis *new_hyp = allocator->Allocate(); + *new_hyp = *to_clone; + *result_callback = new_hyp; + clone_map->insert({to_clone, new_hyp}); + + // Move on to clone "to_clone->next". + to_clone = to_clone->next; + result_callback = &(new_hyp->next); + LOG(ERROR) << "Failed to find the best path in Viterbi."; + } + return cloned; +} + +std::vector Lattice::NBest(size_t nbest_size, + bool sample, + float inv_theta) { + if (nbest_size < 1) { + LOG(WARNING) << "nbest_size >= 1. Returns empty result."; + return {}; + } + + if (nbest_size == 1 && !sample) { + return {Viterbi()}; + } + + // Uses A* search to enumerate N-bests. + // Given a lattice, enumerates hypotheses (paths) from EOS. + // At each partial path x, compute f(x) as follows + // f(x) = g(x) + h(x). + // g(x): the sum of scores from EOS to the left-most node in x. + // for a complete hypothesis, g(hyp) is the score of the hypothesis. + // h(x): a heuristic that estimates the largest score from x to BOS. + // f(x): the priority to pop a new hypothesis from the priority queue. + // + // As left-to-right Viterbi search can tell the *exact* value of h(x), + // we can obtain the exact n-best results with A*. + + class HypothesisComparator { + public: + const bool operator()(Hypothesis *h1, Hypothesis *h2) { + return (h1->fx < h2->fx); + } + }; + + using Agenda = std::priority_queue, + HypothesisComparator>; + constexpr size_t kPreallocatedHypothesisSize = 512; + FreeList hypothesis_allocator(kPreallocatedHypothesisSize); + + Agenda agenda; + std::vector results; + + auto *eos = hypothesis_allocator.Allocate(); + eos->node = eos_node(); + eos->next = nullptr; + eos->gx = 0.0; + + std::vector alpha(node_allocator_.size(), 0.0); + + if (sample) { + // Run forwards algorithm to get normalising constants + alpha = ForwardAlgorithm(inv_theta); + // f(eos) = Gumbel(0), as it is the perturbed score of the entire lattice. + eos->fx = Gumbel(); + } else { + // Run Viterbi first to fill backtrace score. + Viterbi(); + eos->fx = eos->node->backtrace_score; + } + agenda.push(eos); + + int shrink_count = 0; // Number of times agenda has shrunk. For logging only. + bool printed_memory_warning = false; // For logging only. 
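// The loop below repeatedly pops the best partial hypothesis and extends it
// one lattice node to the left. A minimal driver of this API (all names are
// declared in lattice.h in this diff; ids and scores are only illustrative):
//
//   Lattice lattice;
//   lattice.SetSentence("abc");
//   Lattice::Node *piece = lattice.Insert(0, 3);  // one piece covering "abc"
//   piece->id = 7;                                 // vocabulary id
//   piece->score = -0.5f;                          // log probability
//   auto paths = lattice.NBest(/*nbest_size=*/2, /*sample=*/false,
//                              /*inv_theta=*/1.0f);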
+ while (!agenda.empty()) { + auto *top = agenda.top(); + agenda.pop(); + auto *node = top->node; + + // Reaches to BOS + if (node == bos_node()) { + results.resize(results.size() + 1); + for (auto *n = top->next; n->next != nullptr; n = n->next) { + results.back().first.push_back(n->node); + } + results.back().second = top->fx; + if (results.size() == nbest_size) { + break; + } + continue; + } + + const int end_nodes_size = end_nodes(node->pos).size(); + std::vector probs(end_nodes_size, 0.0); + std::vector perturbed_probs(end_nodes_size, 0.0); + std::vector adjusted_probs(end_nodes_size, 0.0); + const float Z = alpha[node->node_id]; + if (sample) { + float max_score = -1e8; + // Calculate the marginal and perturbed scores for stochastic search + for (int i = 0; i < end_nodes(node->pos).size(); i++) { + Node *lnode = end_nodes(node->pos)[i]; + // Calculate backwards transition score + probs[i] = + top->gx + alpha[lnode->node_id] + (inv_theta * lnode->score) - Z; + perturbed_probs[i] = probs[i] + Gumbel(); + if (perturbed_probs[i] > max_score) { + max_score = perturbed_probs[i]; + } + } + // Now constrain the sampled continuations to match the score of parent + for (int i = 0; i < adjusted_probs.size(); i++) { + // Use numerically stable version of truncated Gumbel: + // https://arxiv.org/pdf/1903.06059.pdf appendix B.3 + const float v = top->fx - perturbed_probs[i] + + std::log1p(-std::exp(perturbed_probs[i] - max_score)); + adjusted_probs[i] = top->fx - std::max(static_cast(0.0), v) - + std::log1p(std::exp(-std::abs(v))); + } + } + + // Expands new node ending at node->pos + for (int i = 0; i < end_nodes(node->pos).size(); i++) { + Node *lnode = end_nodes(node->pos)[i]; + auto *hyp = hypothesis_allocator.Allocate(); + hyp->node = lnode; + if (sample) { + hyp->gx = probs[i]; + hyp->fx = adjusted_probs[i]; + } else { + hyp->gx = lnode->score + top->gx; // just adds node->score + hyp->fx = + lnode->backtrace_score + top->gx; // backtrace_score is h(node). + } + hyp->next = top; + agenda.push(hyp); + } + + static constexpr int kOneBillion = 1000000000; // 10^9. + if (hypothesis_allocator.size() >= kOneBillion) { + if (!printed_memory_warning) { + printed_memory_warning = true; + LOG(WARNING) << "Allocator size exceeds " << kOneBillion + << " with an example of length " << this->size(); + } + } + + // When the input is too long or contains duplicated phrases, + // `agenda` will get extremely big. Here we avoid this case by + // dynamically shrinking the agenda. + constexpr int kMaxAgendaSize = 10000; + constexpr int kMinAgendaSize = 512; + if (agenda.size() >= kMaxAgendaSize) { + // Keeps the top `kMinAgendaSize` hypothesis. + Agenda new_agenda; + // Keeps the top hypothesis and the ones on their "next" paths. + FreeList new_allocator(kPreallocatedHypothesisSize); + // Map between old Hypothesis* and new Hypothesis*. + std::unordered_map clone_map; + + const int size = std::min(kMinAgendaSize, nbest_size * 10); + shrink_count++; + LOG(WARNING) << "Too big agenda size " << agenda.size() + << ". 
Shrinking (round " << shrink_count << ") down to " + << size << "."; + for (int i = 0; i < size; ++i) { + const Hypothesis *top_hyp = agenda.top(); + Hypothesis *cloned_hyp = + CloneHypAndDependents(top_hyp, &clone_map, &new_allocator); + new_agenda.push(cloned_hyp); + agenda.pop(); + } + agenda = std::move(new_agenda); + hypothesis_allocator.swap(new_allocator); + } + } + + return results; +} + +std::vector Lattice::Sample(float inv_theta) { + const int len = size(); + if (len == 0) return {}; + + std::vector alpha(node_allocator_.size(), 0.0); + + alpha = ForwardAlgorithm(inv_theta); + + auto *mt = GetRandomGenerator(); + + std::vector results; + std::vector probs; + + float Z = alpha[eos_node()->node_id]; + Node *node = eos_node(); + while (true) { + probs.clear(); + for (const Node *lnode : end_nodes_[node->pos]) { + probs.push_back(std::exp(static_cast( + alpha[lnode->node_id] + inv_theta * lnode->score - Z))); + } + std::discrete_distribution dist(probs.begin(), probs.end()); + node = end_nodes_[node->pos][dist(*mt)]; + if (node == bos_node()) break; + + Z = alpha[node->node_id]; + results.push_back(node); + } + + std::reverse(results.begin(), results.end()); + return results; +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/lattice.h b/fast_tokenizer/fast_tokenizer/utils/lattice.h new file mode 100644 index 0000000000000000000000000000000000000000..daa6523d059d46bd9d56cce4427e408ee0de76b4 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/lattice.h @@ -0,0 +1,192 @@ +// Copyright 2016 Google Inc. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include "fast_tokenizer/utils/string_view.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +// Copy from https://github.com/google/sentencepiece/blob/master/src/freelist.h +// Simple FreeList that allocates a chunk of T at once. +template +class FreeList { +public: + FreeList() = delete; + explicit FreeList(size_t chunk_size) : chunk_size_(chunk_size) {} + virtual ~FreeList() { + for (auto &chunk : freelist_) delete[] chunk; + } + + // `Free` doesn't free the object but reuse the allocated memory chunks. + void Free() { + const int size = std::min(chunk_index_ + 1, freelist_.size()); + for (int i = 0; i < size; ++i) { + T *chunk = freelist_[i]; + memset(static_cast(chunk), 0, sizeof(*chunk) * chunk_size_); + } + chunk_index_ = 0; + element_index_ = 0; + } + + // Returns the number of allocated elements. + size_t size() const { return chunk_size_ * chunk_index_ + element_index_; } + + void swap(FreeList &other) { + std::swap(freelist_, other.freelist_); + std::swap(element_index_, other.element_index_); + std::swap(chunk_index_, other.chunk_index_); + std::swap(chunk_size_, other.chunk_size_); + } + + // Returns the element as an array. 
+ T *operator[](size_t index) const { + return freelist_[index / chunk_size_] + index % chunk_size_; + } + + // Allocates new element. + T *Allocate() { + if (element_index_ >= chunk_size_) { + ++chunk_index_; + element_index_ = 0; + } + + if (chunk_index_ == freelist_.size()) { + T *chunk = new T[chunk_size_]; + memset(static_cast(chunk), 0, sizeof(*chunk) * chunk_size_); + freelist_.push_back(chunk); + } + + T *result = freelist_[chunk_index_] + element_index_; + ++element_index_; + + return result; + } + +private: + std::vector freelist_; + + // The last element is stored at freelist_[chunk_index_][element_index_] + size_t element_index_ = 0; + size_t chunk_index_ = 0; + size_t chunk_size_ = 0; // Do not modify except in swap() +}; + + +// Copy from +// https://github.com/google/sentencepiece/blob/master/src/unigram_model.h +class Lattice { +public: + Lattice(); + virtual ~Lattice(); + + struct Node { + utils::simple_string_view piece; // Sentence piece representation. + uint32_t pos; // Unicode position in the sentence. + uint32_t length; // Unicode length, not UT8 byte. + uint32_t node_id; // unique id in the current lattice. + int id; // vocab id. (maybe -1 for UNK) + float score; // logprob of this sentencepiece. + float backtrace_score; // backtrace info used in Viterbi. + Node *prev; // best previous node on Viterbi path. + + std::string DebugString() const; + }; + + // Returns bos node. + Node *bos_node() const; + + // Returns eos node. + Node *eos_node() const; + + // Returns nodes starting at |pos|. + const std::vector &begin_nodes(int pos) const; + + // Returns nodes ending at |pos|. + const std::vector &end_nodes(int pos) const; + + // Returns Unicode character length. + int size() const; + + // Returns multi-byte (utf8) length. + int utf8_size() const; + + // Returns the substring of sentence. sentence[pos:] + const char *surface(int pos) const; + + // Returns immutable sentence. The same as surface(0) + const char *sentence() const; + + // Clears the lattice. + void Clear(); + + // Sets new sentence. + void SetSentence(utils::simple_string_view sentence); + + // Inserts a new node at [pos, pos + length - 1]. + // After calling this method, The caller must set Node::score and Node::id. + Node *Insert(int pos, int length); + + using LatticePathWithScore = std::pair, float>; + + // Returns Viterbi path. All nodes must be populated in advance. + LatticePathWithScore Viterbi(); + + // Runs forwards/backwards algorithm, returns vector with normalised + // transition probs. + std::vector ForwardAlgorithm(float theta) const; + std::vector BackwardAlgorithm(float theta) const; + + // Returns n-best results. + std::vector NBest(size_t nbest_size, + bool sample, + float theta); + + // Samples one path from the lattice according to the + // generation probability (Product of piece probabilities). + // `theta` is a smoothing parameter. + std::vector Sample(float theta); + + // Calculates the entropy of the lattice. + float CalculateEntropy(float theta) const; + + // Populates marginal probability of every node in this lattice. + // |freq| is the frequency of the sentence. + // for (auto *node : all_nodes_) { + // (*expected)[node->id] += marginal_prob_of_node * freq; + // } + // Returns the log-likelihood of this sentence. + float PopulateMarginal(float freq, std::vector *expected) const; + +private: + // Returns new node. + // Lattice class has the ownership of the returned value. 
+ Node *NewNode(); + + utils::simple_string_view sentence_; + std::vector surface_; + std::vector> begin_nodes_; + std::vector> end_nodes_; + FreeList node_allocator_; +}; + + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/path.h b/fast_tokenizer/fast_tokenizer/utils/path.h new file mode 100644 index 0000000000000000000000000000000000000000..a58a00af613abf822e3af365855fe286a7d7479d --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/path.h @@ -0,0 +1,58 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include + +#ifdef _MSC_VER +#define PATH_SEP "\\" +#else +#define PATH_SEP "/" +#endif + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +inline std::string PathJoin(const std::vector& paths, + const std::string& sep = PATH_SEP) { + if (paths.size() == 1) { + return paths[0]; + } + std::string filepath = ""; + for (const auto& path : paths) { + if (filepath == "") { + filepath += path; + continue; + } + if (path[0] == sep[0] || filepath.back() == sep[0]) { + filepath += path; + } else { + filepath += sep + path; + } + } + return filepath; +} + +inline std::string PathJoin(const std::string& folder, + const std::string filename, + const std::string& sep = PATH_SEP) { + return PathJoin(std::vector{folder, filename}, sep); +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.cc b/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.cc new file mode 100644 index 0000000000000000000000000000000000000000..4a7bf9950ab5996347a3b3254c0ef75887d7b728 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.cc @@ -0,0 +1,342 @@ +// Copyright 2016 Google Inc. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
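For reference, the `Lattice` declared above is driven in three steps: set the sentence, insert candidate pieces together with their vocab ids and log-probability scores, then decode. The sketch below is illustrative only and not part of this diff; the vocab id and score values are hypothetical placeholders.

```cpp
#include "fast_tokenizer/utils/lattice.h"

using paddlenlp::fast_tokenizer::utils::Lattice;
using paddlenlp::fast_tokenizer::utils::simple_string_view;

void SegmentExample() {
  Lattice lattice;
  lattice.SetSentence(simple_string_view("hello"));  // text to segment
  // Register one candidate piece spanning the 5 characters starting at pos 0.
  // The caller is expected to fill in the vocab id and the log-prob score.
  Lattice::Node *node = lattice.Insert(0, 5);
  node->id = 42;        // hypothetical vocab id
  node->score = -1.5f;  // hypothetical log-probability
  // Best-scoring path over all inserted pieces, together with its score.
  Lattice::LatticePathWithScore best = lattice.Viterbi();
  (void)best;
}
```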
+ +#include "fast_tokenizer/utils/sentencepiece_normalizer.h" +#include +#include "fast_tokenizer/utils/unique_ptr.h" +#include "fast_tokenizer/utils/utf8.h" +#include "fast_tokenizer/utils/utils.h" + +#include "glog/logging.h" +#include "unicode/brkiter.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +PrefixMatcher::PrefixMatcher(const std::set& dic) { + if (dic.empty()) return; + std::vector key; + key.reserve(dic.size()); + for (const auto& it : dic) key.push_back(it); + trie_ = utils::make_unique(); + trie_->build(key.size(), const_cast(&key[0]), nullptr, nullptr); +} + +int PrefixMatcher::PrefixMatch(const char* w, size_t w_len, bool* found) const { + if (trie_ == nullptr) { + if (found) *found = false; + return std::min(w_len, OneCharLen(w)); + } + constexpr int kResultSize = 64; + Darts::DoubleArray::result_pair_type trie_results[kResultSize]; + const int num_nodes = + trie_->commonPrefixSearch(w, trie_results, kResultSize, w_len); + if (found) *found = (num_nodes > 0); + if (num_nodes == 0) { + return std::min(w_len, OneCharLen(w)); + } + + int mblen = 0; + for (int i = 0; i < num_nodes; ++i) { + mblen = std::max(trie_results[i].length, mblen); + } + return mblen; +} + +std::string PrefixMatcher::GlobalReplace(const char* w, + size_t w_len, + const char* out, + size_t out_len, + const char** result_w) const { + std::string result; + if (w_len > 0) { + bool found = false; + const int mblen = PrefixMatch(w, w_len, &found); + if (found) { + result.append(out, out_len); + } else { + result.append(w, mblen); + } + *result_w = w + mblen; + } + return result; +} + +Normalizer::Normalizer(const std::string& precompiled_charsmap) + : precompiled_charsmap_(precompiled_charsmap) { + Init(); +} + +Normalizer::Normalizer(const Normalizer& other) + : precompiled_charsmap_(other.precompiled_charsmap_) { + Init(); +} + + +Normalizer::~Normalizer() {} + +std::string Normalizer::GetPrecompiledCharsmap() const { + return precompiled_charsmap_; +} + +void Normalizer::Init() { + if (!precompiled_charsmap_.empty()) { +#ifdef IS_BIG_ENDIAN + DecodePrecompiledCharsMap(precompiled_charsmap_.data(), + precompiled_charsmap_.length(), + &trie_blob_, + &normalized_blob_, + &precompiled_charsmap_buffer_); +#else + DecodePrecompiledCharsMap(precompiled_charsmap_.data(), + precompiled_charsmap_.length(), + &trie_blob_, + &normalized_blob_); +#endif + // Reads the body of double array. + trie_ = utils::make_unique(); + // The second arg of set_array is not the size of blob, + // but the number of double array units. 
+ trie_->set_array(const_cast(trie_blob_.data()), + trie_blob_.size() / trie_->unit_size()); + normalized_ = normalized_blob_.data(); + } +} + +void Normalizer::DecodePrecompiledCharsMap(const char* blob, + size_t blob_size, + std::string* trie_blob, + std::string* normalized, + std::string* buffer) { + uint32_t trie_blob_size = 0; + uint32_t offset = 0; + if (blob_size <= sizeof(trie_blob_size) || + !DecodePOD(blob, sizeof(trie_blob_size), &trie_blob_size) || + trie_blob_size >= blob_size) { + throw std::runtime_error("Blob for normalization rule is broken."); + } +#ifdef IS_BIG_ENDIAN + trie_blob_size = util::Swap32(trie_blob_size); +#endif + if (trie_blob_size >= blob_size) { + throw std::runtime_error("Trie data size exceeds the input blob size."); + } + offset += sizeof(trie_blob_size); +#ifdef IS_BIG_ENDIAN + buffer->assign(blob + offset, trie_blob_size); + uint32* data = reinterpret_cast(const_cast(buffer->data())); + for (int i = 0; i < trie_blob_size / 4; ++i) data[i] = util::Swap32(data[i]); + *trie_blob = std::string(buffer->data(), trie_blob_size); +#else + *trie_blob = std::string(blob + offset, trie_blob_size); +#endif + offset += trie_blob_size; + *normalized = std::string(blob + offset, blob_size - offset); +} + +std::string Normalizer::EncodePrecompiledCharsMap( + const std::string& trie_blob, const std::string& normalized) { + // + std::string blob; + blob.append(EncodePOD(trie_blob.size())); + blob.append(trie_blob.data(), trie_blob.size()); + blob.append(normalized.data(), normalized.size()); + +#ifdef IS_BIG_ENDIAN + uint32* data = reinterpret_cast(const_cast(blob.data())); + for (int i = 0; i <= trie_blob.size() / 4; ++i) { + data[i] = util::Swap32(data[i]); + } +#endif + return blob; +} + +std::pair Normalizer::NormalizePrefix( + const char* input, size_t input_len) const { + std::pair result; + if (input_len == 0) { + return result; + } + if (matcher_ != nullptr) { + bool found = false; + const int mblen = matcher_->PrefixMatch(input, input_len, &found); + if (found) { + return std::make_pair(simple_string_view(input, input_len), mblen); + } + } + size_t longest_length = 0; + int longest_value = 0; + if (trie_ != nullptr) { + // Allocates trie_results in stack, which makes the encoding speed 36% + // fast. (38k sentences/sec => 60k sentences/sec). Builder checks that the + // result size never exceeds kMaxTrieResultsSize. This array consumes + // 0.5kByte in stack, which is less than default stack frames (16kByte). + Darts::DoubleArray::result_pair_type + trie_results[Normalizer::kMaxTrieResultsSize]; + const size_t num_nodes = trie_->commonPrefixSearch( + input, trie_results, Normalizer::kMaxTrieResultsSize, input_len); + + // Finds the longest rule. + for (size_t k = 0; k < num_nodes; ++k) { + if (longest_length == 0 || trie_results[k].length > longest_length) { + longest_length = trie_results[k].length; // length of prefix + longest_value = trie_results[k].value; // pointer to |normalized_|. + } + } + } + + if (longest_length == 0) { + size_t length = 0; + if (!IsValidDecodeUTF8(input, input + input_len, &length)) { + // Found a malformed utf8. + // The rune is set to be 0xFFFD (REPLACEMENT CHARACTER), + // which is a valid Unicode of three bytes in utf8, + // but here we only consume one byte. 
+ result.second = 1; + static const char kReplacementChar[] = "\xEF\xBF\xBD"; + result.first = simple_string_view(kReplacementChar); + } else { + result.second = length; + result.first = simple_string_view(input, length); + } + } else { + result.second = longest_length; + // No need to pass the size of normalized sentence, + // since |normalized| is delimitered by "\0". + result.first = simple_string_view(&(normalized_[longest_value])); + } + return result; +} + +bool Normalizer::Normalize(const char* input, + size_t input_len, + std::string* normalized, + std::vector* norm_to_orig, + std::u32string* u32content) const { + bool modified = false; + norm_to_orig->clear(); + normalized->clear(); + if (input_len == 0) { + return modified; + } + + // Reserves the output buffer to avoid re-allocations. + const size_t kReservedSize = input_len * 3; + normalized->reserve(kReservedSize); + norm_to_orig->reserve(kReservedSize); + if (u32content != nullptr) { + u32content->reserve(kReservedSize); + } + UErrorCode err = U_ZERO_ERROR; + std::unique_ptr iter( + icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), + err)); + UText utext = UTEXT_INITIALIZER; + utext_openUTF8(&utext, input, input_len, &err); + iter->setText(&utext, err); + int curr_pos = iter->current(); + while (iter->next() != icu::BreakIterator::DONE) { + int next_pos = iter->current(); + int curr_len = next_pos - curr_pos; + std::pair p; + if (curr_len < 6) { + p = NormalizePrefix(input + curr_pos, curr_len); + simple_string_view sp = p.first; + if (sp.data() != input + curr_pos) { + if (!sp.empty()) { + for (size_t n = 0; n < sp.size(); ++n) { + *normalized += sp.data()[n]; + } + } + Replace(sp, + simple_string_view(input + curr_pos, curr_len), + norm_to_orig, + u32content); + modified = true; + curr_pos = next_pos; + continue; + } + } + int curr_grapheme_pos = curr_pos; + while (curr_grapheme_pos < next_pos) { + uint32_t content_char; + auto content_char_width = + utils::UTF8ToUInt32(input + curr_grapheme_pos, &content_char); + content_char = utils::UTF8ToUnicode(content_char); + p = NormalizePrefix(input + curr_grapheme_pos, content_char_width); + simple_string_view sp = p.first; + if (sp.data() != input + curr_grapheme_pos) { + if (!sp.empty()) { + for (size_t n = 0; n < sp.size(); ++n) { + *normalized += sp.data()[n]; + } + } + Replace( + sp, + simple_string_view(input + curr_grapheme_pos, content_char_width), + norm_to_orig, + u32content); + modified = true; + } else { + for (int i = 0; i < sp.size(); ++i) { + *normalized += sp.data()[i]; + } + if (u32content != nullptr) { + u32content->push_back(content_char); + } + norm_to_orig->push_back(0); + } + curr_grapheme_pos += content_char_width; + } + curr_pos = next_pos; + } + utext_close(&utext); + return modified; +} + +void Normalizer::Replace(const simple_string_view& new_part, + const simple_string_view& old_part, + std::vector* changes, + std::u32string* u32content) const { + auto new_unicode_len = + GetUnicodeLenFromUTF8(new_part.data(), new_part.size()); + auto old_unicode_len = + GetUnicodeLenFromUTF8(old_part.data(), old_part.size()); + if (u32content != nullptr) { + size_t utf8_len = 0; + while (utf8_len < new_part.size()) { + uint32_t content_char; + auto content_char_width = + utils::UTF8ToUInt32(new_part.data() + utf8_len, &content_char); + content_char = utils::UTF8ToUnicode(content_char); + u32content->push_back(content_char); + utf8_len += content_char_width; + } + } + changes->insert(changes->end(), new_unicode_len, 0); + if (new_unicode_len > 
old_unicode_len) { + auto diff = new_unicode_len - old_unicode_len; + for (auto i = changes->size() - 1; i >= changes->size() - diff; --i) { + (*changes)[i] = 1; + } + } else { + changes->back() -= old_unicode_len - new_unicode_len; + } +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.h b/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.h new file mode 100644 index 0000000000000000000000000000000000000000..3a8543cc39c8d3d8816abe053f09e52848867cd5 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/sentencepiece_normalizer.h @@ -0,0 +1,114 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include + +#include "fast_tokenizer/utils/string_view.h" + +#include "darts.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +struct Cstrless { + bool operator()(const char* a, const char* b) const { + return std::strcmp(a, b) < 0; + } +}; + +class PrefixMatcher { +public: + // Initializes the PrefixMatcher with `dic`. + explicit PrefixMatcher(const std::set& dic); + + int PrefixMatch(const char* w, size_t w_len, bool* found = nullptr) const; + + std::string GlobalReplace(const char* w, + size_t w_len, + const char* out, + size_t out_len, + const char** result_w) const; + +private: + std::unique_ptr trie_; +}; + +class Normalizer { +public: + // Instantiates Normalizer with |spec|. + // |spec| should not be deleted until Normalizer is destroyed. + explicit Normalizer(const std::string& precompiled_charsmap); + Normalizer(const Normalizer& other); + virtual ~Normalizer(); + + virtual void SetPrefixMatcher(const PrefixMatcher* matcher) { + matcher_ = matcher; + } + + virtual bool Normalize(const char* input, + size_t input_len, + std::string* normalized, + std::vector* norm_to_orig, + std::u32string* u32content = nullptr) const; + std::string GetPrecompiledCharsmap() const; + +private: + void Init(); + void Replace(const simple_string_view& new_part, + const simple_string_view& old_part, + std::vector* changes, + std::u32string* u32content = nullptr) const; + std::pair NormalizePrefix(const char* input, + size_t input_len) const; + + + // // Encodes trie_blob and normalized string and return compiled blob. + static std::string EncodePrecompiledCharsMap(const std::string& trie_blob, + const std::string& normalized); + + // Decodes blob into trie_blob and normalized string. 
+ static void DecodePrecompiledCharsMap(const char* blob, + size_t blob_size, + std::string* trie_blob, + std::string* normalized, + std::string* buffer = nullptr); + + static constexpr int kMaxTrieResultsSize = 32; + + std::unique_ptr trie_; + + const char* normalized_ = nullptr; + std::string normalized_blob_; + std::string trie_blob_; + + // Prefix matcher; + const PrefixMatcher* matcher_ = nullptr; + + // Split hello world into "hello_" and "world_" instead of + // "_hello" and "_world". + const bool treat_whitespace_as_suffix_ = false; + std::string precompiled_charsmap_buffer_; + std::string precompiled_charsmap_; +}; + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/shared_mutex.h b/fast_tokenizer/fast_tokenizer/utils/shared_mutex.h new file mode 100644 index 0000000000000000000000000000000000000000..37931ff9f740f8a3a7f9449c1535bd5935c22e21 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/shared_mutex.h @@ -0,0 +1,304 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include +#include +#include +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +// The code is from http://howardhinnant.github.io/shared_mutex.cpp +// C++ 11 shared_mutex implementation +class shared_mutex { + typedef std::mutex mutex_t; + typedef std::condition_variable cond_t; + typedef unsigned count_t; + + mutex_t mut_; + cond_t gate1_; + cond_t gate2_; + count_t state_; + + static const count_t write_entered_ = 1U << (sizeof(count_t) * CHAR_BIT - 1); + static const count_t n_readers_ = ~write_entered_; + +public: + shared_mutex() : state_(0) {} + ~shared_mutex() { std::lock_guard _(mut_); } + + shared_mutex(const shared_mutex&) = delete; + shared_mutex& operator=(const shared_mutex&) = delete; + + // Exclusive ownership + + void lock() { + std::unique_lock lk(mut_); + while (state_ & write_entered_) gate1_.wait(lk); + state_ |= write_entered_; + while (state_ & n_readers_) gate2_.wait(lk); + } + bool try_lock() { + std::unique_lock lk(mut_); + if (state_ == 0) { + state_ = write_entered_; + return true; + } + return false; + } + template + bool try_lock_for(const std::chrono::duration& rel_time) { + return try_lock_until(std::chrono::steady_clock::now() + rel_time); + } + template + bool try_lock_until(const std::chrono::time_point& abs_time); + void unlock() { + std::lock_guard _(mut_); + state_ = 0; + gate1_.notify_all(); + } + + // Shared ownership + + void lock_shared() { + std::unique_lock lk(mut_); + while ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_) + gate1_.wait(lk); + count_t num_readers = (state_ & n_readers_) + 1; + state_ &= ~n_readers_; + state_ |= num_readers; + } + bool try_lock_shared() { + std::unique_lock lk(mut_); + count_t num_readers = state_ & n_readers_; + if (!(state_ & write_entered_) && num_readers != n_readers_) { + ++num_readers; + state_ &= 
~n_readers_; + state_ |= num_readers; + return true; + } + return false; + } + template + bool try_lock_shared_for(const std::chrono::duration& rel_time) { + return try_lock_shared_until(std::chrono::steady_clock::now() + rel_time); + } + template + bool try_lock_shared_until( + const std::chrono::time_point& abs_time); + void unlock_shared() { + std::lock_guard _(mut_); + count_t num_readers = (state_ & n_readers_) - 1; + state_ &= ~n_readers_; + state_ |= num_readers; + if (state_ & write_entered_) { + if (num_readers == 0) gate2_.notify_one(); + } else { + if (num_readers == n_readers_ - 1) gate1_.notify_one(); + } + } +}; + +template +bool shared_mutex::try_lock_until( + const std::chrono::time_point& abs_time) { + std::unique_lock lk(mut_); + if (state_ & write_entered_) { + while (true) { + std::cv_status status = gate1_.wait_until(lk, abs_time); + if ((state_ & write_entered_) == 0) break; + if (status == std::cv_status::timeout) return false; + } + } + state_ |= write_entered_; + if (state_ & n_readers_) { + while (true) { + std::cv_status status = gate2_.wait_until(lk, abs_time); + if ((state_ & n_readers_) == 0) break; + if (status == std::cv_status::timeout) { + state_ &= ~write_entered_; + return false; + } + } + } + return true; +} + +template +bool shared_mutex::try_lock_shared_until( + const std::chrono::time_point& abs_time) { + std::unique_lock lk(mut_); + if ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_) { + while (true) { + std::cv_status status = gate1_.wait_until(lk, abs_time); + if ((state_ & write_entered_) == 0 && (state_ & n_readers_) < n_readers_) + break; + if (status == std::cv_status::timeout) return false; + } + } + count_t num_readers = (state_ & n_readers_) + 1; + state_ &= ~n_readers_; + state_ |= num_readers; + return true; +} + +template +class shared_lock { +public: + typedef Mutex mutex_type; + +private: + mutex_type* m_; + bool owns_; + + struct __nat { + int _; + }; + +public: + shared_lock() : m_(nullptr), owns_(false) {} + + explicit shared_lock(mutex_type& m) : m_(&m), owns_(true) { + m_->lock_shared(); + } + + shared_lock(mutex_type& m, std::defer_lock_t) : m_(&m), owns_(false) {} + + shared_lock(mutex_type& m, std::try_to_lock_t) + : m_(&m), owns_(m.try_lock_shared()) {} + + shared_lock(mutex_type& m, std::adopt_lock_t) : m_(&m), owns_(true) {} + + template + shared_lock(mutex_type& m, + const std::chrono::time_point& abs_time) + : m_(&m), owns_(m.try_lock_shared_until(abs_time)) {} + template + shared_lock(mutex_type& m, const std::chrono::duration& rel_time) + : m_(&m), owns_(m.try_lock_shared_for(rel_time)) {} + + ~shared_lock() { + if (owns_) m_->unlock_shared(); + } + + shared_lock(shared_lock const&) = delete; + shared_lock& operator=(shared_lock const&) = delete; + + shared_lock(shared_lock&& sl) : m_(sl.m_), owns_(sl.owns_) { + sl.m_ = nullptr; + sl.owns_ = false; + } + + shared_lock& operator=(shared_lock&& sl) { + if (owns_) m_->unlock_shared(); + m_ = sl.m_; + owns_ = sl.owns_; + sl.m_ = nullptr; + sl.owns_ = false; + return *this; + } + + explicit shared_lock(std::unique_lock&& ul) + : m_(ul.mutex()), owns_(ul.owns_lock()) { + if (owns_) m_->unlock_and_lock_shared(); + ul.release(); + } + + void lock(); + bool try_lock(); + template + bool try_lock_for(const std::chrono::duration& rel_time) { + return try_lock_until(std::chrono::steady_clock::now() + rel_time); + } + template + bool try_lock_until(const std::chrono::time_point& abs_time); + void unlock(); + + void swap(shared_lock&& u) { + std::swap(m_, u.m_); + 
std::swap(owns_, u.owns_); + } + + mutex_type* release() { + mutex_type* r = m_; + m_ = nullptr; + owns_ = false; + return r; + } + bool owns_lock() const { return owns_; } + operator int __nat::*() const { return owns_ ? &__nat::_ : 0; } + mutex_type* mutex() const { return m_; } +}; + +template +void shared_lock::lock() { + if (m_ == nullptr) + throw std::system_error(std::error_code(EPERM, std::system_category()), + "shared_lock::lock: references null mutex"); + if (owns_) + throw std::system_error(std::error_code(EDEADLK, std::system_category()), + "shared_lock::lock: already locked"); + m_->lock_shared(); + owns_ = true; +} + +template +bool shared_lock::try_lock() { + if (m_ == nullptr) + throw std::system_error(std::error_code(EPERM, std::system_category()), + "shared_lock::try_lock: references null mutex"); + if (owns_) + throw std::system_error(std::error_code(EDEADLK, std::system_category()), + "shared_lock::try_lock: already locked"); + owns_ = m_->try_lock_shared(); + return owns_; +} + +template +template +bool shared_lock::try_lock_until( + const std::chrono::time_point& abs_time) { + if (m_ == nullptr) + throw std::system_error( + std::error_code(EPERM, std::system_category()), + "shared_lock::try_lock_until: references null mutex"); + if (owns_) + throw std::system_error(std::error_code(EDEADLK, std::system_category()), + "shared_lock::try_lock_until: already locked"); + owns_ = m_->try_lock_shared_until(abs_time); + return owns_; +} + +template +void shared_lock::unlock() { + if (!owns_) + throw std::system_error(std::error_code(EPERM, std::system_category()), + "shared_lock::unlock: not locked"); + m_->unlock_shared(); + owns_ = false; +} + +template +inline void swap(shared_lock& x, shared_lock& y) { + x.swap(y); +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/string_view.h b/fast_tokenizer/fast_tokenizer/utils/string_view.h new file mode 100644 index 0000000000000000000000000000000000000000..35cacdefed9f2f4bb14392ee3775c00ceb663bc5 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/string_view.h @@ -0,0 +1,53 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
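The `shared_mutex`/`shared_lock` pair above backports the standard reader-writer primitives to C++11 toolchains: any number of readers may hold the lock at once, while a writer waits for exclusive ownership. A small usage sketch (illustrative only, not part of this diff):

```cpp
#include <mutex>  // std::lock_guard

#include "fast_tokenizer/utils/shared_mutex.h"

namespace utils = paddlenlp::fast_tokenizer::utils;

utils::shared_mutex cache_mutex;
int cached_value = 0;

int ReadCached() {
  // Shared (reader) ownership: many threads can read concurrently.
  utils::shared_lock<utils::shared_mutex> lock(cache_mutex);
  return cached_value;
}

void UpdateCached(int v) {
  // Exclusive (writer) ownership via lock()/unlock().
  std::lock_guard<utils::shared_mutex> lock(cache_mutex);
  cached_value = v;
}
```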
+ +#pragma once +#include +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +struct simple_string_view { + const char* ptr_; + size_t offset_; + size_t size_; + explicit simple_string_view(const char* ptr = nullptr) + : ptr_(ptr), offset_(0), size_(0) { + while (ptr_ && ptr_[size_] != '\0') { + size_++; + } + } + simple_string_view(const char* ptr, size_t size) : ptr_(ptr), size_(size) {} + + const char* data() const { + if (!ptr_) { + return ptr_ + offset_; + } + return ptr_; + } + size_t size() const { return size_; } + bool empty() const { return size_ == 0; } + + void remove_prefix(size_t n) { + assert(n <= size_); + ptr_ += n; + size_ -= n; + } +}; + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/trie.cc b/fast_tokenizer/fast_tokenizer/utils/trie.cc new file mode 100644 index 0000000000000000000000000000000000000000..b063e91ff0853829fe7ec4596ca872e744cfae66 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/trie.cc @@ -0,0 +1,231 @@ +// Copyright 2022 TF.Text Authors. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include + +#include "glog/logging.h" +#include "fast_tokenizer/utils/trie.h" +#include "fast_tokenizer/utils/utf8.h" +#include "fast_tokenizer/utils/utils.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +void Trie::CreateTrie(const std::vector& keys, + const std::vector& values) { + trie_ = std::make_shared(); + trie_->build(keys.size(), + const_cast(&keys[0]), + nullptr, + const_cast(&values[0])); + const uint32_t* trie_ptr = reinterpret_cast(trie_->array()); + trie_array_ = std::vector(trie_ptr, trie_ptr + trie_->size()); +} + +int Trie::EncodeTokenId(const std::string& token, uint32_t id) const { + bool is_suffix_token = (token.rfind(continuing_subword_prefix_) == 0); + uint32_t token_length = token.length(); + if (is_suffix_token) { + token_length -= continuing_subword_prefix_.length(); + } + return EncodeToken(id, token_length, is_suffix_token); +} + +void Trie::InitTrieSuffixRoot() { + auto node = CreateRootTraversalCursor(); + if (!TryTraverseSeveralSteps(&node, continuing_subword_prefix_)) { + throw std::runtime_error( + "Cannot locate suffix_root_. This should never happen."); + } + suffix_root_ = node.node_id_; +} + +void Trie::InitTrie(const std::vector& keys, + const std::vector& values) { + std::vector sorted_keys; + std::vector sorted_values; + GetSortedVocab(keys, values, &sorted_keys, &sorted_values); + CreateTrie(sorted_keys, sorted_values); + InitTrieSuffixRoot(); + if (with_pretokenization_ && keys.size() > 0) { + auto node = CreateRootTraversalCursor(); + if (!TryTraverseSeveralSteps(&node, + std::string(1, utils::kInvalidControlChar))) { + throw std::runtime_error( + "Cannot locate the dummy node for the failure link for punctuation " + "nodes. 
This should never happen."); + } + punct_failure_link_node_ = node.node_id_; + DeleteLinkFromParent(punct_failure_link_node_); + DeleteValueOfNode(punct_failure_link_node_); + } +} + +void Trie::AddPuncVocab( + std::vector* punc_vocab, + const std::unordered_map& vocab) const { + if (with_pretokenization_) { + for (uint32_t cp = 1; cp <= 0x0010FFFF; ++cp) { + if (!utils::IsUnicodeChar(cp) || !utils::IsPunctuationOrChineseChar(cp)) { + continue; + } + char utf_str[5]; + utils::GetUTF8Str(reinterpret_cast(&cp), utf_str, 1); + std::string punc_str(utf_str); + if (vocab.count(punc_str) == 0) { + punc_vocab->push_back(punc_str); + } + } + punc_vocab->push_back(std::string(1, utils::kInvalidControlChar)); + } +} + +void Trie::SetVocab(const std::unordered_map& vocab) { + std::vector keys; + std::vector values; + for (auto&& item : vocab) { + keys.push_back(item.first.c_str()); + values.push_back(EncodeTokenId(item.first, item.second)); + } + InitTrie(keys, values); +} + +void Trie::SetVocabList(const std::vector& keys) { + std::unordered_map vocab; + for (int i = 0; i < keys.size(); ++i) { + vocab[keys[i]] = i; + } + SetVocab(vocab); +} + +Trie::Trie(const std::string& continuing_subword_prefix, + const std::string& unk_token, + bool with_pretokenization) + : trie_(nullptr), + continuing_subword_prefix_(continuing_subword_prefix), + suffix_root_(utils::kNullNode), + punct_failure_link_node_(utils::kNullNode), + unk_token_(unk_token), + with_pretokenization_(with_pretokenization) {} + +Trie::Trie(const std::unordered_map& vocab, + const std::string& continuing_subword_prefix, + const std::string& unk_token, + bool with_pretokenization) + : continuing_subword_prefix_(continuing_subword_prefix), + unk_token_(unk_token), + suffix_root_(utils::kNullNode), + punct_failure_link_node_(utils::kNullNode), + with_pretokenization_(with_pretokenization) { + SetVocab(vocab); +} + +Trie::Trie(const std::vector& keys, + const std::string& continuing_subword_prefix, + const std::string& unk_token, + bool with_pretokenization) + : continuing_subword_prefix_(continuing_subword_prefix), + unk_token_(unk_token), + suffix_root_(utils::kNullNode), + punct_failure_link_node_(utils::kNullNode), + with_pretokenization_(with_pretokenization) { + SetVocabList(keys); +} + +Trie::TraversalCursor Trie::CreateRootTraversalCursor() const { + return CreateTraversalCursor(kRootNodeId); +} + +Trie::TraversalCursor Trie::CreateTraversalCursor(uint32_t node_id) const { + return Trie::TraversalCursor(node_id, trie_array_[node_id]); +} + +void Trie::SetTraversalCursor(Trie::TraversalCursor* cursor, + uint32_t node_id) const { + cursor->node_id_ = node_id; + cursor->unit_ = trie_array_[node_id]; +} + +bool Trie::TryTraverseOneStep(Trie::TraversalCursor* cursor, + unsigned char ch) const { + const uint32_t next_node_id = cursor->node_id_ ^ Offset(cursor->unit_) ^ ch; + const uint32_t next_node_unit = trie_array_[next_node_id]; + if (Label(next_node_unit) != ch) { + return false; + } + cursor->node_id_ = next_node_id; + cursor->unit_ = next_node_unit; + return true; +} + +bool Trie::TryTraverseSeveralSteps(Trie::TraversalCursor* cursor, + const std::string& path) const { + return TryTraverseSeveralSteps(cursor, path.data(), path.size()); +} + +bool Trie::TryTraverseSeveralSteps(Trie::TraversalCursor* cursor, + const char* ptr, + int size) const { + uint32_t cur_id = cursor->node_id_; + uint32_t cur_unit = cursor->unit_; + for (; size > 0; --size, ++ptr) { + const unsigned char ch = static_cast(*ptr); + cur_id ^= Offset(cur_unit) ^ ch; + 
cur_unit = trie_array_[cur_id]; + if (Label(cur_unit) != ch) { + return false; + } + } + cursor->node_id_ = cur_id; + cursor->unit_ = cur_unit; + return true; +} + +bool Trie::TryGetData(const Trie::TraversalCursor& cursor, + int* out_data) const { + if (!HasLeaf(cursor.unit_)) { + return false; + } + const uint32_t value_unit = + trie_array_[cursor.node_id_ ^ Offset(cursor.unit_)]; + *out_data = Value(value_unit); + return true; +} + +void Trie::DeleteValueOfNode(uint32_t node_id) { + trie_array_[node_id] &= 0xFFFFFEFF; +} + +void Trie::DeleteLinkFromParent(uint32_t child_node_id) { + trie_array_[child_node_id] &= 0xFFFFFF00; +} + +void Trie::SetWithPretokenization(bool with_pretokenization) { + with_pretokenization_ = with_pretokenization; +} + +void Trie::SetUNKToken(const std::string& unk_token) { unk_token_ = unk_token; } + +void Trie::SetContinuingSubwordPrefix( + const std::string& continuing_subword_prefix) { + continuing_subword_prefix_ = continuing_subword_prefix; +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/trie.h b/fast_tokenizer/fast_tokenizer/utils/trie.h new file mode 100644 index 0000000000000000000000000000000000000000..b4c9b0cbff4d05a84326479d7b9767312308adf4 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/trie.h @@ -0,0 +1,120 @@ +// Copyright 2022 TF.Text Authors. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at + +// http://www.apache.org/licenses/LICENSE-2.0 + +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
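The `Trie` implementation above wraps a Darts double-array built from the sorted WordPiece vocabulary; a lookup walks the array byte by byte and, on a hit, reads back a value that packs the token id, token length, and suffix flag. A minimal lookup sketch follows (illustrative only, not part of this diff; the toy vocabulary and the use of `GetTokenIdFromEncodedValue` from `utils.h` are assumptions made for the example):

```cpp
#include <string>
#include <vector>

#include "fast_tokenizer/utils/trie.h"
#include "fast_tokenizer/utils/utils.h"

namespace utils = paddlenlp::fast_tokenizer::utils;

int main() {
  // Hypothetical toy vocabulary; "##" marks continuing subwords.
  std::vector<std::string> vocab = {"[UNK]", "he", "##llo", "hello"};
  utils::Trie trie(vocab);

  // Walk the double-array over the input bytes, then read the packed value.
  auto cursor = trie.CreateRootTraversalCursor();
  int encoded = 0;
  if (trie.TryTraverseSeveralSteps(&cursor, std::string("hello")) &&
      trie.TryGetData(cursor, &encoded)) {
    // The value packs (id, length, is-suffix); see EncodeToken in utils.h.
    int token_id = utils::GetTokenIdFromEncodedValue(encoded);
    (void)token_id;
  }
  return 0;
}
```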
+ +#pragma once +#include +#include +#include +#include +#include +#include +#include "darts.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +class Trie { +public: + static constexpr uint32_t kRootNodeId = 0; + + Trie(const std::string& continuing_subword_prefix = "##", + const std::string& unk_token = "[UNK]", + bool with_pretokenization = false); + Trie(const std::unordered_map& vocab, + const std::string& continuing_subword_prefix = "##", + const std::string& unk_token = "[UNK]", + bool with_pretokenization = false); + Trie(const std::vector& keys, + const std::string& continuing_subword_prefix = "##", + const std::string& unk_token = "[UNK]", + bool with_pretokenization = false); + struct TraversalCursor { + uint32_t node_id_; + uint32_t unit_; + TraversalCursor(uint32_t node_id = 0, uint32_t unit = 0) + : node_id_(node_id), unit_(unit) {} + }; + + TraversalCursor CreateRootTraversalCursor() const; + TraversalCursor CreateTraversalCursor(uint32_t node_id) const; + void SetTraversalCursor(TraversalCursor* cursor, uint32_t node_id) const; + bool TryTraverseOneStep(TraversalCursor* cursor, unsigned char ch) const; + bool TryTraverseSeveralSteps(TraversalCursor* cursor, + const std::string& path) const; + bool TryGetData(const TraversalCursor& cursor, int* out_data) const; + void SetVocab(const std::unordered_map& vocab); + void SetVocabList(const std::vector& vocab); + void SetWithPretokenization(bool with_pretokenization_); + void SetUNKToken(const std::string& unk_token); + void SetContinuingSubwordPrefix(const std::string& continuing_subword_prefix); + + uint32_t Size() const { + if (trie_.get() != nullptr) { + return trie_->size(); + } + return 0; + } + std::string GetContinuingSubwordPrefix() const { + return continuing_subword_prefix_; + } + uint32_t GetSuffixRoot() const { return suffix_root_; } + uint32_t GetPuncFailureNode() const { return punct_failure_link_node_; } + void DeleteValueOfNode(uint32_t node_id); + void DeleteLinkFromParent(uint32_t child_node_id); + +private: + void AddPuncVocab( + std::vector* punc_vocab, + const std::unordered_map& vocab) const; + void InitTrieSuffixRoot(); + void InitTrie(const std::vector& keys, + const std::vector& values); + int EncodeTokenId(const std::string& token, uint32_t id) const; + void CreateTrie(const std::vector& keys, + const std::vector& values); + + bool TryTraverseSeveralSteps(TraversalCursor* cursor, + const char* ptr, + int size) const; + + static uint32_t Offset(uint32_t unit) { + return (unit >> 10) << ((unit & 0x200) >> 6); + } + + // Returns a label associated with a node. + // A leaf node will have the MSB set and thus return an invalid label. + static uint32_t Label(uint32_t unit) { return unit & 0x800000ff; } + + // Returns whether a node has a leaf as a child. + static bool HasLeaf(uint32_t unit) { return unit & 0x100; } + + // Returns a value associated with a node. Available when a node is a leaf. 
+ static int Value(uint32_t unit) { + return static_cast(unit & 0x7fffffff); + } + + std::shared_ptr trie_; + std::vector trie_array_; + std::string continuing_subword_prefix_; + std::string unk_token_; + uint32_t suffix_root_; + uint32_t punct_failure_link_node_; + bool with_pretokenization_; +}; + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/unique_ptr.h b/fast_tokenizer/fast_tokenizer/utils/unique_ptr.h new file mode 100644 index 0000000000000000000000000000000000000000..767e203fcd2d5d2c9bddb49cf042cf6472465120 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/unique_ptr.h @@ -0,0 +1,61 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +// Trait to select overloads and return types for MakeUnique. +template +struct MakeUniqueResult { + using scalar = std::unique_ptr; +}; +template +struct MakeUniqueResult { + using array = std::unique_ptr; +}; +template +struct MakeUniqueResult { + using invalid = void; +}; + +// MakeUnique(...) is an early implementation of C++14 std::make_unique. +// It is designed to be 100% compatible with std::make_unique so that the +// eventual switchover will be a simple renaming operation. +template +typename MakeUniqueResult::scalar make_unique(Args &&... args) { // NOLINT + return std::unique_ptr( + new T(std::forward(args)...)); // NOLINT(build/c++11) +} + +// Overload for array of unknown bound. +// The allocation of arrays needs to use the array form of new, +// and cannot take element constructor arguments. +template +typename MakeUniqueResult::array make_unique(size_t n) { + return std::unique_ptr(new typename std::remove_extent::type[n]()); +} + +// Reject arrays of known bound. +template +typename MakeUniqueResult::invalid make_unique(Args &&... /* args */) = + delete; // NOLINT + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/utf8.h b/fast_tokenizer/fast_tokenizer/utils/utf8.h new file mode 100644 index 0000000000000000000000000000000000000000..dbb8c92f67329db99151649e433aa6d39c58360a --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/utf8.h @@ -0,0 +1,225 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once +#include + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +static constexpr uint32_t kUnicodeError = 0xFFFD; + +inline bool IsUnicodeNonChar(uint32_t c) { + return ((c) >= 0xfdd0 && ((c) <= 0xfdef || ((c)&0xfffe) == 0xfffe) && + (c) <= 0x10ffff); +} + +inline bool IsUnicodeChar(uint32_t c) { + return ((c) < 0xd800 || + (0xdfff < (c) && (c) <= 0x10ffff && !IsUnicodeNonChar(c))); +} + +inline uint32_t BytesInUTF8Char(uint8_t byte) { + unsigned int count = 1; + // no if-statements means no divergence + count += static_cast((byte & 0xF0) == 0xF0); + count += static_cast((byte & 0xE0) == 0xE0); + count += static_cast((byte & 0xC0) == 0xC0); + count -= static_cast((byte & 0xC0) == 0x80); + return count; +} + +inline uint32_t GetUnicodeLenFromUTF8(const char* pSrc, size_t length) { + size_t unicode_len = 0; + size_t start = 0; + while (start < length && pSrc[start] != '\0') { + size_t chwidth = BytesInUTF8Char(pSrc[start]); + start += chwidth; + ++unicode_len; + } + return unicode_len; +} + +inline uint32_t UTF8ToUInt32(const char* pSrc, uint32_t* chr) { + uint32_t chwidth = BytesInUTF8Char(static_cast(*pSrc)); + *chr = static_cast(*pSrc++) & 0xFF; + if (chwidth > 1) { + *chr = (*chr) << 8; + *chr |= (static_cast(*pSrc++) & 0xFF); // << 8; + if (chwidth > 2) { + *chr = (*chr) << 8; + *chr |= (static_cast(*pSrc++) & 0xFF); // << 16; + if (chwidth > 3) { + *chr = (*chr) << 8; + *chr |= (static_cast(*pSrc++) & 0xFF); // << 24; + } + } + } + return chwidth; +} + +inline uint32_t UTF8ToUnicode(uint32_t utf8) { + uint32_t unchr = 0; + if (utf8 < 0x00000080) { + unchr = utf8; + } else if (utf8 < 0x0000E000) { + unchr = (utf8 & 0x1F00) >> 2; + unchr |= (utf8 & 0x003F); + } else if (utf8 < 0x00F00000) { + unchr = (utf8 & 0x0F0000) >> 4; + unchr |= (utf8 & 0x003F00) >> 2; + unchr |= (utf8 & 0x00003F); + } else if (utf8 <= static_cast(0xF8000000)) { + unchr = (utf8 & 0x03000000) >> 6; + unchr |= (utf8 & 0x003F0000) >> 4; + unchr |= (utf8 & 0x00003F00) >> 2; + unchr |= (utf8 & 0x0000003F); + } + return unchr; +} + +inline bool IsCharBeginBoundary(char ch) { + return ((~ch) >> 7) || ((ch & 0xC0) == 0xC0); +} + +inline bool IsCharBoundary(const char* ch) { + return IsCharBeginBoundary(*ch) || IsCharBeginBoundary(*(ch + 1)); +} + +inline uint32_t UnicodeToUTF8(uint32_t unchr) { + uint32_t utf8 = 0; + if (unchr < 0x00000080) { + utf8 = unchr; + } else if (unchr < 0x00000800) { + utf8 = (unchr << 2) & 0x1F00; + utf8 |= (unchr & 0x3F); + utf8 |= 0x0000C080; + } else if (unchr < 0x00010000) { + utf8 = (unchr << 4) & 0x0F0000; // upper 4 bits + utf8 |= (unchr << 2) & 0x003F00; // next 6 bits + utf8 |= (unchr & 0x3F); // last 6 bits + utf8 |= 0x00E08080; + } else if (unchr < 0x00110000) { // 3-byte unicode + utf8 = (unchr << 6) & 0x07000000; // upper 3 bits + utf8 |= (unchr << 4) & 0x003F0000; // next 6 bits + utf8 |= (unchr << 2) & 0x00003F00; // next 6 bits + utf8 |= (unchr & 0x3F); // last 6 bits + utf8 |= static_cast(0xF0808080); + } + return utf8; +} + +inline uint32_t BytesInUnicodeChar(uint32_t chr) { + uint32_t count = 1; + // no if-statements means no divergence + count += static_cast((chr & static_cast(0x0000FF00)) > 0); + count += static_cast((chr & static_cast(0x00FF0000)) > 0); + count += static_cast((chr & static_cast(0xFF000000)) > 0); + return count; +} + +inline uint32_t UnicodeToUTF8Char(uint32_t chr, char* dst) { + uint32_t chwidth = BytesInUnicodeChar(chr); + for (uint32_t idx = 0; idx < chwidth; ++idx) { + dst[chwidth - idx - 1] = static_cast(chr & 
0xFF); + chr = chr >> 8; + } + return chwidth; +} + +inline uint32_t GetUTF8CharLen(uint32_t u32chr) { + return BytesInUnicodeChar(UnicodeToUTF8(u32chr)); +} + +inline void GetUTF8Str(const char32_t* unicode_str, + char* utf8_str, + size_t unicode_len) { + char dst_char[5] = {0}; + for (size_t i = 0; i < unicode_len; ++i) { + uint32_t utf8_uint32 = UnicodeToUTF8(unicode_str[i]); + uint32_t utf8_char_count = UnicodeToUTF8Char(utf8_uint32, dst_char); + dst_char[utf8_char_count] = '\0'; + memcpy(utf8_str, dst_char, utf8_char_count); + utf8_str += utf8_char_count; + } + *utf8_str = '\0'; +} + +inline void GetUnicodeStr(const char* pSrc, + char32_t* unicode_str, + size_t unicode_len) { + uint32_t curr_unicode_char; + uint32_t count = UTF8ToUInt32(pSrc, &curr_unicode_char); + curr_unicode_char = UTF8ToUnicode(curr_unicode_char); + for (size_t i = 0; i < unicode_len; ++i) { + unicode_str[i] = curr_unicode_char; + pSrc += count; + count = UTF8ToUInt32(pSrc, &curr_unicode_char); + curr_unicode_char = UTF8ToUnicode(curr_unicode_char); + } +} + +inline bool IsTrailByte(char x) { return static_cast(x) < -0x40; } + +inline bool IsValidCodepoint(char32_t c) { + return (static_cast(c) < 0xD800) || (c >= 0xE000 && c <= 0x10FFFF); +} + +// mblen sotres the number of bytes consumed after decoding. +inline uint32_t DecodeUTF8(const char* begin, const char* end, size_t* mblen) { + const size_t len = end - begin; + + if (static_cast(begin[0]) < 0x80) { + *mblen = 1; + return static_cast(begin[0]); + } else if (len >= 2 && (begin[0] & 0xE0) == 0xC0) { + const uint32_t cp = (((begin[0] & 0x1F) << 6) | ((begin[1] & 0x3F))); + if (IsTrailByte(begin[1]) && cp >= 0x0080 && IsValidCodepoint(cp)) { + *mblen = 2; + return cp; + } + } else if (len >= 3 && (begin[0] & 0xF0) == 0xE0) { + const uint32_t cp = (((begin[0] & 0x0F) << 12) | ((begin[1] & 0x3F) << 6) | + ((begin[2] & 0x3F))); + if (IsTrailByte(begin[1]) && IsTrailByte(begin[2]) && cp >= 0x0800 && + IsValidCodepoint(cp)) { + *mblen = 3; + return cp; + } + } else if (len >= 4 && (begin[0] & 0xf8) == 0xF0) { + const uint32_t cp = (((begin[0] & 0x07) << 18) | ((begin[1] & 0x3F) << 12) | + ((begin[2] & 0x3F) << 6) | ((begin[3] & 0x3F))); + if (IsTrailByte(begin[1]) && IsTrailByte(begin[2]) && + IsTrailByte(begin[3]) && cp >= 0x10000 && IsValidCodepoint(cp)) { + *mblen = 4; + return cp; + } + } + + // Invalid UTF-8. + *mblen = 1; + return kUnicodeError; +} + +inline bool IsValidDecodeUTF8(const char* begin, + const char* end, + size_t* mblen) { + const uint32_t c = DecodeUTF8(begin, end, mblen); + return c != kUnicodeError || *mblen == 3; +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/utils.cc b/fast_tokenizer/fast_tokenizer/utils/utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..dd23726c1bf457748e3cd7d79e11ec8fc116a698 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/utils.cc @@ -0,0 +1,147 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include "fast_tokenizer/utils/utils.h" + +#include "unicode/uchar.h" + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +void GetVocabFromFiles(const std::string& files, + std::unordered_map* vocab) { + const static std::string WHITESPACE = " \n\r\t\f\v"; + std::ifstream fin(files); + if (!fin.good()) { + std::cerr << "The vocab file " << files + << " seems to be unable to access" + " or non-exists, please check again. " + << std::endl; + return; + } + vocab->clear(); + int i = 0; + constexpr int MAX_BUFFER_SIZE = 256; + char word[MAX_BUFFER_SIZE]; + while (fin.getline(word, MAX_BUFFER_SIZE)) { + std::string word_str = word; + auto leading_spaces = word_str.find_first_not_of(WHITESPACE); + if (leading_spaces != std::string::npos) { + leading_spaces = (std::min)(leading_spaces, word_str.length() - 1); + word_str = word_str.substr(leading_spaces); + } + auto trailing_spaces = word_str.find_last_not_of(WHITESPACE); + if (trailing_spaces != std::string::npos) { + word_str = word_str.substr(0, trailing_spaces + 1); + } + if (word_str != "") { + (*vocab)[word_str] = i++; + } + } +} + +bool IsChineseChar(int ch) { + return ( + (ch >= 0x4E00 && ch <= 0x9FFF) || (ch >= 0x3400 && ch <= 0x4DBF) || + (ch >= 0x20000 && ch <= 0x2A6DF) || (ch >= 0x2A700 && ch <= 0x2B73F) || + (ch >= 0x2B740 && ch <= 0x2B81F) || (ch >= 0x2B820 && ch <= 0x2CEAF) || + (ch >= 0xF900 && ch <= 0xFAFF) || (ch >= 0x2F800 && ch <= 0x2FA1F)); +} + +bool IsPunctuation(int ch) { + return (ch >= 33 && ch <= 47) || (ch >= 58 && ch <= 64) || + (ch >= 91 && ch <= 96) || (ch >= 123 && ch <= 126) || u_ispunct(ch); +} + +bool IsPunctuationOrChineseChar(int ch) { + return IsChineseChar(ch) || IsPunctuation(ch); +} + +bool StringReplace(std::string* str, + const std::string& from, + const std::string& to) { + size_t start_pos = str->find(from); + if (start_pos == std::string::npos) return false; + str->replace(start_pos, from.length(), to); + return true; +} + +void StringReplaceAll(std::string* str, + const std::string& from, + const std::string& to) { + if (from.empty()) return; + size_t start_pos = 0; + while ((start_pos = str->find(from, start_pos)) != std::string::npos) { + str->replace(start_pos, from.length(), to); + start_pos += to.length(); // In case 'to' contains 'from', like replacing + // 'x' with 'yx' + } +} + +void GetSortedVocab(const std::vector& keys, + const std::vector& values, + std::vector* sorted_keys, + std::vector* sorted_values) { + // Sort the vocab + std::vector sorted_vocab_index(keys.size()); + std::iota(sorted_vocab_index.begin(), sorted_vocab_index.end(), 0); + std::sort(sorted_vocab_index.begin(), + sorted_vocab_index.end(), + [&keys](const int a, const int b) { + return std::strcmp(keys[a], keys[b]) < 0; + }); + + sorted_keys->resize(keys.size()); + sorted_values->resize(keys.size()); + for (int i = 0; i < sorted_vocab_index.size(); ++i) { + auto idx = sorted_vocab_index[i]; + (*sorted_keys)[i] = keys[idx]; + (*sorted_values)[i] = values[idx]; + } +} + +std::unordered_map CreateBytesToChars() { + std::unordered_map bytes_to_chars; + bool bytes_flag[256] = {false}; + std::vector> ranges = { + {'!', '~'}, {'\xA1', '\xAC'}, {'\xAE', '\xFF'}}; + + for (int i = 0; i < ranges.size(); ++i) { + for (uint32_t c = ranges[i].first; c <= ranges[i].second; ++c) { + bytes_to_chars.insert({c, c}); + bytes_flag[c] = true; + } + } + uint32_t n = 0; + for (uint32_t b = 0; b <= 255; ++b) { + if 
(!bytes_flag[b]) { + bytes_to_chars.insert({b, (1 << 8) + n}); + n += 1; + } + } + return bytes_to_chars; +} + +bool IsWhiteSpace(int ch) { + const std::string WHITESPACE = " \n\r\t\f\v"; + for (int i = 0; i < WHITESPACE.length(); ++i) { + if (ch == WHITESPACE[i]) return true; + } + return u_isspace(ch); +} + +} // namespace utils +} // namespace fast_tokenizer +} // namespace paddlenlp diff --git a/fast_tokenizer/fast_tokenizer/utils/utils.h b/fast_tokenizer/fast_tokenizer/utils/utils.h new file mode 100644 index 0000000000000000000000000000000000000000..6a9e418693d337223c3afb0a3f8b94aba7979357 --- /dev/null +++ b/fast_tokenizer/fast_tokenizer/utils/utils.h @@ -0,0 +1,187 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#if defined(_FREEBSD) +#include +#endif +#if !defined(__APPLE__) && !defined(_WIN32) && !defined(_FREEBSD) +#include +#if BYTE_ORDER == __BIG_ENDIAN +#define IS_BIG_ENDIAN +#endif +#endif + +#if defined(_WIN32) +#ifdef FASTTOKENIZER_LIB +#define FASTTOKENIZER_DECL __declspec(dllexport) +#else +#define FASTTOKENIZER_DECL __declspec(dllimport) +#endif // FASTTOKENIZER_LIB +#else +#define FASTTOKENIZER_DECL __attribute__((visibility("default"))) +#endif // _WIN32 + +namespace paddlenlp { +namespace fast_tokenizer { +namespace utils { + +void GetVocabFromFiles(const std::string& files, + std::unordered_map* vocab); + +bool IsChineseChar(int ch); + +bool IsPunctuation(int ch); + +bool IsPunctuationOrChineseChar(int ch); + +bool StringReplace(std::string* str, + const std::string& from, + const std::string& to); + +void StringReplaceAll(std::string* str, + const std::string& from, + const std::string& to); + +// Used in fast wordpiece model +static constexpr uint32_t kBitToIndicateSuffixToken = 30; + +static constexpr uint32_t kBitsToEncodeVocabTokenLength = 8; + +static constexpr uint32_t kMaskToEncodeVocabTokenLength = + (1 << kBitsToEncodeVocabTokenLength) - 1; + +static constexpr uint32_t kMaxVocabTokenLengthInUTF8Bytes = + (1 << kBitsToEncodeVocabTokenLength); + +static constexpr uint32_t kMaxSupportedVocabSize = + (1 << (32 - 1 - 1 - kBitsToEncodeVocabTokenLength)); + +static constexpr uint32_t kMaskToEncodeVocabTokenId = + ((1 << kBitToIndicateSuffixToken) - 1) ^ kMaskToEncodeVocabTokenLength; + +inline int EncodeToken(uint32_t token_id, + uint32_t token_length, + bool is_suffix_token) { + int encoded_value = (is_suffix_token << kBitToIndicateSuffixToken) | + (token_id << kBitsToEncodeVocabTokenLength) | + (token_length - 1); + return encoded_value; +} + +inline bool IsSuffixTokenFromEncodedValue(int token_encoded_value) { + return static_cast(token_encoded_value >> kBitToIndicateSuffixToken); +} + +// Gets the token id from the encoded value. 
+// Gets the token id from the encoded value.
+inline int GetTokenIdFromEncodedValue(int token_encoded_value) {
+  return (token_encoded_value & kMaskToEncodeVocabTokenId) >>
+         kBitsToEncodeVocabTokenLength;
+}
+
+// Gets the token length (without the suffix indicator) from the encoded value.
+inline int GetTokenLengthFromEncodedValue(int token_encoded_value) {
+  return (token_encoded_value & kMaskToEncodeVocabTokenLength) + 1;
+}
+
+static constexpr uint32_t kBitsToEncodeFailurePopsListSize =
+    kBitsToEncodeVocabTokenLength;
+
+static constexpr uint32_t kMaskToEncodeFailurePopsListSize =
+    (1 << kBitsToEncodeFailurePopsListSize) - 1;
+
+static constexpr uint32_t kMaxFailurePopsListSize =
+    (1 << kBitsToEncodeFailurePopsListSize);
+
+static constexpr uint32_t kMaxSupportedFailurePoolOffset =
+    (1 << (32 - kBitsToEncodeFailurePopsListSize)) - 1 - 1;
+
+static constexpr uint32_t kNullFailurePopsList =
+    std::numeric_limits<uint32_t>::max();
+
+inline uint32_t EncodeFailurePopList(int offset, int length) {
+  return (offset << kBitsToEncodeFailurePopsListSize) | (length - 1);
+}
+
+inline void GetFailurePopsOffsetAndLength(uint32_t offset_and_length,
+                                          int* out_offset,
+                                          int* out_length) {
+  *out_offset = offset_and_length >> kBitsToEncodeFailurePopsListSize;
+  *out_length = (offset_and_length & kMaskToEncodeFailurePopsListSize) + 1;
+}
+
+static constexpr uint32_t kNullNode = std::numeric_limits<uint32_t>::max();
+
+static constexpr uint32_t kMaxSupportedTrieSize =
+    std::numeric_limits<uint32_t>::max();
+
+// A Unicode control char that never appears in the input as it is filtered
+// during text normalization. It is used to build dummy nodes in the trie.
+static constexpr char kInvalidControlChar = 0x11;
+
+inline bool IsSuffixWord(const std::string& word,
+                         const std::string& continuing_subword_prefix) {
+  return word.rfind(continuing_subword_prefix) == 0;
+}
+
+template <typename T>
+inline bool DecodePOD(const char* str, size_t str_len, T* result) {
+  if (sizeof(*result) != str_len) {
+    return false;
+  }
+  memcpy(result, str, sizeof(T));
+  return true;
+}
+
+
+template <typename T>
+inline std::string EncodePOD(const T& value) {
+  std::string s;
+  s.resize(sizeof(T));
+  memcpy(const_cast<char*>(s.data()), &value, sizeof(T));
+  return s;
+}
+
+inline size_t OneCharLen(const char* src) {
+  return "\1\1\1\1\1\1\1\1\1\1\1\1\2\2\3\4"[(*src & 0xFF) >> 4];
+}
+
+#ifdef IS_BIG_ENDIAN
+inline uint32_t Swap32(uint32_t x) { return __builtin_bswap32(x); }
+#endif
+
+void GetSortedVocab(const std::vector<const char*>& keys,
+                    const std::vector<uint32_t>& values,
+                    std::vector<const char*>* sorted_keys,
+                    std::vector<uint32_t>* sorted_values);
+
+std::unordered_map<uint32_t, uint32_t> CreateBytesToChars();
+
+bool IsWhiteSpace(int ch);
+
+}  // namespace utils
+}  // namespace fast_tokenizer
+}  // namespace paddlenlp
diff --git a/fast_tokenizer/fast_tokenizer/utils/variant.h b/fast_tokenizer/fast_tokenizer/utils/variant.h
new file mode 100644
index 0000000000000000000000000000000000000000..3429c2ce15b3c02055c4670edee9f901b81679e3
--- /dev/null
+++ b/fast_tokenizer/fast_tokenizer/utils/variant.h
@@ -0,0 +1,2859 @@
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and +// limitations under the License. + +// Copy from +// https://github.com/mpark/variant/blob/single-header/v1.4.0/variant.hpp +// Modify the following points: +// 1. modify namespace mpark to namespace paddlenlp +// 2. add type() member function for variant class +// 3. remove the visitation implementation under the branhch with +// MPARK_CPP14_CONSTEXPR defined since lib::cpp14::array could not be converted +// to std::initializer_list in Paddle's compilation +// 4. decorate PYBIND11_HIDDEN for struct value_visitor + +// MPark.Variant +// +// Copyright Michael Park, 2015-2017 +// +// Distributed under the Boost Software License, Version 1.0. +// (See accompanying file LICENSE.md or copy at +// http://boost.org/LICENSE_1_0.txt) + +#pragma once + +// gcc >= 9 has a bug that creates a false positive warning. +// Reference: +// https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92145 +// https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89381 +#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 9 +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wdeprecated-copy" +#endif + +/* + variant synopsis + +namespace std { + + // 20.7.2, class template variant + template + class variant { + public: + + // 20.7.2.1, constructors + constexpr variant() noexcept(see below); + variant(const variant&); + variant(variant&&) noexcept(see below); + + template constexpr variant(T&&) noexcept(see below); + + template + constexpr explicit variant(in_place_type_t, Args&&...); + + template + constexpr explicit variant( + in_place_type_t, initializer_list, Args&&...); + + template + constexpr explicit variant(in_place_index_t, Args&&...); + + template + constexpr explicit variant( + in_place_index_t, initializer_list, Args&&...); + + // 20.7.2.2, destructor + ~variant(); + + // 20.7.2.3, assignment + variant& operator=(const variant&); + variant& operator=(variant&&) noexcept(see below); + + template variant& operator=(T&&) noexcept(see below); + + // 20.7.2.4, modifiers + template + T& emplace(Args&&...); + + template + T& emplace(initializer_list, Args&&...); + + template + variant_alternative& emplace(Args&&...); + + template + variant_alternative& emplace(initializer_list, Args&&...); + + // 20.7.2.5, value status + constexpr bool valueless_by_exception() const noexcept; + constexpr size_t index() const noexcept; + + // 20.7.2.6, swap + void swap(variant&) noexcept(see below); + }; + + // 20.7.3, variant helper classes + template struct variant_size; // undefined + + template + constexpr size_t variant_size_v = variant_size::value; + + template struct variant_size; + template struct variant_size; + template struct variant_size; + + template + struct variant_size>; + + template struct variant_alternative; // undefined + + template + using variant_alternative_t = typename variant_alternative::type; + + template struct variant_alternative; + template struct variant_alternative; + template struct variant_alternative; + + template + struct variant_alternative>; + + constexpr size_t variant_npos = -1; + + // 20.7.4, value access + template + constexpr bool holds_alternative(const variant&) noexcept; + + template + constexpr variant_alternative_t>& + get(variant&); + + template + constexpr variant_alternative_t>&& + get(variant&&); + + template + constexpr variant_alternative_t> const& + get(const variant&); + + template + constexpr variant_alternative_t> const&& + get(const variant&&); + + template + constexpr T& get(variant&); + + 
template + constexpr T&& get(variant&&); + + template + constexpr const T& get(const variant&); + + template + constexpr const T&& get(const variant&&); + + template + constexpr add_pointer_t>> + get_if(variant*) noexcept; + + template + constexpr add_pointer_t>> + get_if(const variant*) noexcept; + + template + constexpr add_pointer_t + get_if(variant*) noexcept; + + template + constexpr add_pointer_t + get_if(const variant*) noexcept; + + // 20.7.5, relational operators + template + constexpr bool operator==(const variant&, const variant&); + + template + constexpr bool operator!=(const variant&, const variant&); + + template + constexpr bool operator<(const variant&, const variant&); + + template + constexpr bool operator>(const variant&, const variant&); + + template + constexpr bool operator<=(const variant&, const variant&); + + template + constexpr bool operator>=(const variant&, const variant&); + + // 20.7.6, visitation + template + constexpr see below visit(Visitor&&, Variants&&...); + + // 20.7.7, class monostate + struct monostate; + + // 20.7.8, monostate relational operators + constexpr bool operator<(monostate, monostate) noexcept; + constexpr bool operator>(monostate, monostate) noexcept; + constexpr bool operator<=(monostate, monostate) noexcept; + constexpr bool operator>=(monostate, monostate) noexcept; + constexpr bool operator==(monostate, monostate) noexcept; + constexpr bool operator!=(monostate, monostate) noexcept; + + // 20.7.9, specialized algorithms + template + void swap(variant&, variant&) noexcept(see below); + + // 20.7.10, class bad_variant_access + class bad_variant_access; + + // 20.7.11, hash support + template struct hash; + template struct hash>; + template <> struct hash; + +} // namespace std + +*/ + +#include +#include +#include +#include +#include +#include +#include + +// MPark.Variant +// +// Copyright Michael Park, 2015-2017 +// +// Distributed under the Boost Software License, Version 1.0. +// (See accompanying file LICENSE.md or copy at +// http://boost.org/LICENSE_1_0.txt) + +#ifndef MPARK_CONFIG_HPP +#define MPARK_CONFIG_HPP + +// MSVC 2015 Update 3. +#if __cplusplus < 201103L && (!defined(_MSC_VER) || _MSC_FULL_VER < 190024210) +#error "MPark.Variant requires C++11 support." 
+#endif + +#ifndef __has_attribute +#define __has_attribute(x) 0 +#endif + +#ifndef __has_builtin +#define __has_builtin(x) 0 +#endif + +#ifndef __has_include +#define __has_include(x) 0 +#endif + +#ifndef __has_feature +#define __has_feature(x) 0 +#endif + +#if __has_attribute(always_inline) || defined(__GNUC__) +#define MPARK_ALWAYS_INLINE __attribute__((__always_inline__)) inline +#elif defined(_MSC_VER) +#define MPARK_ALWAYS_INLINE __forceinline +#else +#define MPARK_ALWAYS_INLINE inline +#endif + +#if __has_builtin(__builtin_addressof) || \ + (defined(__GNUC__) && __GNUC__ >= 7) || defined(_MSC_VER) +#define MPARK_BUILTIN_ADDRESSOF +#endif + +#if __has_builtin(__builtin_unreachable) || defined(__GNUC__) +#define MPARK_BUILTIN_UNREACHABLE __builtin_unreachable() +#elif defined(_MSC_VER) +#define MPARK_BUILTIN_UNREACHABLE __assume(false) +#else +#define MPARK_BUILTIN_UNREACHABLE +#endif + +#if __has_builtin(__type_pack_element) +#define MPARK_TYPE_PACK_ELEMENT +#endif + +#if defined(__cpp_constexpr) && __cpp_constexpr >= 200704 && \ + !(defined(__GNUC__) && __GNUC__ == 4 && __GNUC_MINOR__ == 9) +#define MPARK_CPP11_CONSTEXPR +#endif + +#if defined(__cpp_constexpr) && __cpp_constexpr >= 201304 +#define MPARK_CPP14_CONSTEXPR +#endif + +#if __has_feature(cxx_exceptions) || defined(__cpp_exceptions) || \ + (defined(_MSC_VER) && defined(_CPPUNWIND)) +#define MPARK_EXCEPTIONS +#endif + +#if defined(__cpp_generic_lambdas) || defined(_MSC_VER) +#define MPARK_GENERIC_LAMBDAS +#endif + +#if defined(__cpp_lib_integer_sequence) +#define MPARK_INTEGER_SEQUENCE +#endif + +#if defined(__cpp_return_type_deduction) || defined(_MSC_VER) +#define MPARK_RETURN_TYPE_DEDUCTION +#endif + +#if defined(__cpp_lib_transparent_operators) || defined(_MSC_VER) +#define MPARK_TRANSPARENT_OPERATORS +#endif + +#if defined(__cpp_variable_templates) || defined(_MSC_VER) +#define MPARK_VARIABLE_TEMPLATES +#endif + +#if !defined(__GLIBCXX__) || __has_include() // >= libstdc++-5 +#define MPARK_TRIVIALITY_TYPE_TRAITS +#define MPARK_INCOMPLETE_TYPE_TRAITS +#endif + +#endif // MPARK_CONFIG_HPP + +// MPark.Variant +// +// Copyright Michael Park, 2015-2017 +// +// Distributed under the Boost Software License, Version 1.0. +// (See accompanying file LICENSE.md or copy at +// http://boost.org/LICENSE_1_0.txt) + +#ifndef MPARK_IN_PLACE_HPP +#define MPARK_IN_PLACE_HPP + +#include + +namespace paddlenlp { + +struct in_place_t { + explicit in_place_t() = default; +}; + +template +struct in_place_index_t { + explicit in_place_index_t() = default; +}; + +template +struct in_place_type_t { + explicit in_place_type_t() = default; +}; + +#ifdef MPARK_VARIABLE_TEMPLATES +constexpr in_place_t in_place{}; + +template +constexpr in_place_index_t in_place_index{}; + +template +constexpr in_place_type_t in_place_type{}; +#endif + +} // namespace paddlenlp + +#endif // MPARK_IN_PLACE_HPP + +// MPark.Variant +// +// Copyright Michael Park, 2015-2017 +// +// Distributed under the Boost Software License, Version 1.0. +// (See accompanying file LICENSE.md or copy at +// http://boost.org/LICENSE_1_0.txt) + +#ifndef MPARK_LIB_HPP +#define MPARK_LIB_HPP + +#include +#include +#include +#include + +#define MPARK_RETURN(...) \ + noexcept(noexcept(__VA_ARGS__))->decltype(__VA_ARGS__) { return __VA_ARGS__; } + +namespace paddlenlp { +namespace lib { +template +struct identity { + using type = T; +}; + +inline namespace cpp14 { +template +struct array { + constexpr const T &operator[](std::size_t index) const { return data[index]; } + + T data[N == 0 ? 
1 : N]; +}; + +template +using add_pointer_t = typename std::add_pointer::type; + +template +using common_type_t = typename std::common_type::type; + +template +using decay_t = typename std::decay::type; + +template +using enable_if_t = typename std::enable_if::type; + +template +using remove_const_t = typename std::remove_const::type; + +template +using remove_reference_t = typename std::remove_reference::type; + +template +inline constexpr T &&forward(remove_reference_t &t) noexcept { + return static_cast(t); +} + +template +inline constexpr T &&forward(remove_reference_t &&t) noexcept { + static_assert(!std::is_lvalue_reference::value, + "can not forward an rvalue as an lvalue"); + return static_cast(t); +} + +template +inline constexpr remove_reference_t &&move(T &&t) noexcept { + return static_cast &&>(t); +} + +#ifdef MPARK_INTEGER_SEQUENCE +using std::index_sequence; +using std::index_sequence_for; +using std::integer_sequence; +using std::make_index_sequence; +#else +template +struct integer_sequence { + using value_type = T; + static constexpr std::size_t size() noexcept { return sizeof...(Is); } +}; + +template +using index_sequence = integer_sequence; + +template +struct make_index_sequence_concat; + +template +struct make_index_sequence_concat, + index_sequence> + : identity> {}; + +template +struct make_index_sequence_impl; + +template +using make_index_sequence = typename make_index_sequence_impl::type; + +template +struct make_index_sequence_impl + : make_index_sequence_concat, + make_index_sequence> {}; + +template <> +struct make_index_sequence_impl<0> : identity> {}; + +template <> +struct make_index_sequence_impl<1> : identity> {}; + +template +using index_sequence_for = make_index_sequence; +#endif + +// +#ifdef MPARK_TRANSPARENT_OPERATORS +using equal_to = std::equal_to<>; +#else +struct equal_to { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) == lib::forward(rhs)) +}; +#endif + +#ifdef MPARK_TRANSPARENT_OPERATORS +using not_equal_to = std::not_equal_to<>; +#else +struct not_equal_to { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) != lib::forward(rhs)) +}; +#endif + +#ifdef MPARK_TRANSPARENT_OPERATORS +using less = std::less<>; +#else +struct less { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) < lib::forward(rhs)) +}; +#endif + +#ifdef MPARK_TRANSPARENT_OPERATORS +using greater = std::greater<>; +#else +struct greater { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) > lib::forward(rhs)) +}; +#endif + +#ifdef MPARK_TRANSPARENT_OPERATORS +using less_equal = std::less_equal<>; +#else +struct less_equal { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) <= lib::forward(rhs)) +}; +#endif + +#ifdef MPARK_TRANSPARENT_OPERATORS +using greater_equal = std::greater_equal<>; +#else +struct greater_equal { + template + inline constexpr auto operator()(Lhs &&lhs, Rhs &&rhs) const + MPARK_RETURN(lib::forward(lhs) >= lib::forward(rhs)) +}; +#endif +} // namespace cpp14 + +inline namespace cpp17 { +// +template +using bool_constant = std::integral_constant; + +template +struct voider : identity {}; + +template +using void_t = typename voider::type; + +namespace detail { +namespace swappable { + +using std::swap; + +template +struct is_swappable { + private: + template (), 
std::declval()))> + inline static std::true_type test(int); + + template + inline static std::false_type test(...); + + public: + static constexpr bool value = decltype(test(0))::value; +}; + +template +struct is_nothrow_swappable { + static constexpr bool value = + noexcept(swap(std::declval(), std::declval())); +}; + +template +struct is_nothrow_swappable : std::false_type {}; + +} // namespace swappable +} // namespace detail + +using detail::swappable::is_swappable; + +template +using is_nothrow_swappable = + detail::swappable::is_nothrow_swappable::value, T>; + +// +namespace detail { + +template +struct is_reference_wrapper : std::false_type {}; + +template +struct is_reference_wrapper> : std::true_type {}; + +template +struct Invoke; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmf, Arg &&arg, Args &&...args) + MPARK_RETURN((lib::forward(arg).*pmf)(lib::forward(args)...)) +}; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmf, Arg &&arg, Args &&...args) + MPARK_RETURN((lib::forward(arg).get().* + pmf)(lib::forward(args)...)) +}; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmf, Arg &&arg, Args &&...args) + MPARK_RETURN(((*lib::forward(arg)).* + pmf)(lib::forward(args)...)) +}; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmo, Arg &&arg) + MPARK_RETURN(lib::forward(arg).*pmo) +}; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmo, Arg &&arg) + MPARK_RETURN(lib::forward(arg).get().*pmo) +}; + +template <> +struct Invoke { + template + inline static constexpr auto invoke(R T::*pmo, Arg &&arg) + MPARK_RETURN((*lib::forward(arg)).*pmo) +}; + +template +inline constexpr auto invoke(R T::*f, Arg &&arg, Args &&...args) + MPARK_RETURN(Invoke::value, + (std::is_base_of>::value ? 0 + : is_reference_wrapper>::value + ? 
1 + : 2)>::invoke(f, + lib::forward(arg), + lib::forward(args)...)) + +#ifdef _MSC_VER +#pragma warning(push) +#pragma warning(disable : 4100) +#endif + template + inline constexpr auto invoke(F &&f, Args &&...args) + MPARK_RETURN(lib::forward(f)(lib::forward(args)...)) +#ifdef _MSC_VER +#pragma warning(pop) +#endif +} // namespace detail + +template +inline constexpr auto invoke(F &&f, Args &&...args) + MPARK_RETURN(detail::invoke(lib::forward(f), + lib::forward(args)...)) + + namespace detail { + template + struct invoke_result {}; + + template + struct invoke_result< + void_t(), std::declval()...))>, + F, + Args...> : identity(), + std::declval()...))> {}; + +} // namespace detail + +template +using invoke_result = detail::invoke_result; + +template +using invoke_result_t = typename invoke_result::type; + +namespace detail { + +template +struct is_invocable : std::false_type {}; + +template +struct is_invocable>, F, Args...> + : std::true_type {}; + +template +struct is_invocable_r : std::false_type {}; + +template +struct is_invocable_r>, R, F, Args...> + : std::is_convertible, R> {}; + +} // namespace detail + +template +using is_invocable = detail::is_invocable; + +template +using is_invocable_r = detail::is_invocable_r; + +namespace detail { + +template +struct is_nothrow_invocable { + static constexpr bool value = + noexcept(lib::invoke(std::declval(), std::declval()...)); +}; + +template +struct is_nothrow_invocable : std::false_type {}; + +template +struct is_nothrow_invocable_r { + private: + inline static R impl() { + return lib::invoke(std::declval(), std::declval()...); + } + + public: + static constexpr bool value = noexcept(impl()); +}; + +template +struct is_nothrow_invocable_r : std::false_type {}; + +} // namespace detail + +template +using is_nothrow_invocable = + detail::is_nothrow_invocable::value, F, Args...>; + +template +using is_nothrow_invocable_r = detail:: + is_nothrow_invocable_r::value, R, F, Args...>; + +// +#ifdef MPARK_BUILTIN_ADDRESSOF +template +inline constexpr T *addressof(T &arg) noexcept { + return __builtin_addressof(arg); +} +#else +namespace detail { + +namespace has_addressof_impl { + +struct fail; + +template +inline fail operator&(T &&); + +template +inline static constexpr bool impl() { + return (std::is_class::value || std::is_union::value) && + !std::is_same()), fail>::value; +} + +} // namespace has_addressof_impl + +template +using has_addressof = bool_constant()>; + +template +inline constexpr T *addressof(T &arg, std::true_type) noexcept { + return std::addressof(arg); +} + +template +inline constexpr T *addressof(T &arg, std::false_type) noexcept { + return &arg; +} + +} // namespace detail + +template +inline constexpr T *addressof(T &arg) noexcept { + return detail::addressof(arg, detail::has_addressof{}); +} +#endif + +template +inline constexpr T *addressof(const T &&) = delete; + +} // namespace cpp17 + +template +struct remove_all_extents : identity {}; + +template +struct remove_all_extents> : remove_all_extents {}; + +template +using remove_all_extents_t = typename remove_all_extents::type; + +template +using size_constant = std::integral_constant; + +template +struct indexed_type : size_constant { + using type = T; +}; + +template +using all = std::is_same, + integer_sequence>; + +#ifdef MPARK_TYPE_PACK_ELEMENT +template +using type_pack_element_t = __type_pack_element; +#else +template +struct type_pack_element_impl { + private: + template + struct set; + + template + struct set> : indexed_type... 
{}; + + template + inline static std::enable_if impl(indexed_type); + + inline static std::enable_if impl(...); + + public: + using type = decltype(impl(set>{})); +}; + +template +using type_pack_element = typename type_pack_element_impl::type; + +template +using type_pack_element_t = typename type_pack_element::type; +#endif + +#ifdef MPARK_TRIVIALITY_TYPE_TRAITS +using std::is_trivially_copy_assignable; +using std::is_trivially_copy_constructible; +using std::is_trivially_move_assignable; +using std::is_trivially_move_constructible; +#else +template +struct is_trivially_copy_constructible + : bool_constant::value &&__has_trivial_copy( + T)> {}; + +template +struct is_trivially_move_constructible : bool_constant<__is_trivial(T)> {}; + +template +struct is_trivially_copy_assignable + : bool_constant::value &&__has_trivial_assign( + T)> {}; + +template +struct is_trivially_move_assignable : bool_constant<__is_trivial(T)> {}; +#endif + +template +struct dependent_type : T {}; + +template +struct push_back; + +template +using push_back_t = typename push_back::type; + +template +struct push_back, J> { + using type = index_sequence; +}; + +} // namespace lib +} // namespace paddlenlp + +#undef MPARK_RETURN + +#endif // MPARK_LIB_HPP + +namespace paddlenlp { + +#ifdef MPARK_RETURN_TYPE_DEDUCTION + +#define AUTO auto +#define AUTO_RETURN(...) \ + { return __VA_ARGS__; } + +#define AUTO_REFREF auto && +#define AUTO_REFREF_RETURN(...) \ + { return __VA_ARGS__; } + +#define DECLTYPE_AUTO decltype(auto) +#define DECLTYPE_AUTO_RETURN(...) \ + { return __VA_ARGS__; } + +#else + +#define AUTO auto +#define AUTO_RETURN(...) \ + ->lib::decay_t { return __VA_ARGS__; } + +#define AUTO_REFREF auto +#define AUTO_REFREF_RETURN(...) \ + ->decltype((__VA_ARGS__)) { \ + static_assert(std::is_reference::value, ""); \ + return __VA_ARGS__; \ + } + +#define DECLTYPE_AUTO auto +#define DECLTYPE_AUTO_RETURN(...) 
\ + ->decltype(__VA_ARGS__) { return __VA_ARGS__; } + +#endif + +class bad_variant_access : public std::exception { + public: + virtual const char *what() const noexcept override { + return "bad_variant_access"; + } +}; + +[[noreturn]] inline void throw_bad_variant_access() { +#ifdef MPARK_EXCEPTIONS + throw bad_variant_access{}; +#else + std::terminate(); + MPARK_BUILTIN_UNREACHABLE; +#endif +} + +template +class variant; + +template +struct variant_size; + +#ifdef MPARK_VARIABLE_TEMPLATES +template +constexpr std::size_t variant_size_v = variant_size::value; +#endif + +template +struct variant_size : variant_size {}; + +template +struct variant_size : variant_size {}; + +template +struct variant_size : variant_size {}; + +template +struct variant_size> : lib::size_constant {}; + +template +struct variant_alternative; + +template +using variant_alternative_t = typename variant_alternative::type; + +template +struct variant_alternative + : std::add_const> {}; + +template +struct variant_alternative + : std::add_volatile> {}; + +template +struct variant_alternative + : std::add_cv> {}; + +template +struct variant_alternative> { + static_assert(I < sizeof...(Ts), + "index out of bounds in `std::variant_alternative<>`"); + using type = lib::type_pack_element_t; +}; + +constexpr std::size_t variant_npos = static_cast(-1); + +namespace detail { + +constexpr std::size_t not_found = static_cast(-1); +constexpr std::size_t ambiguous = static_cast(-2); + +#ifdef MPARK_CPP14_CONSTEXPR +template +inline constexpr std::size_t find_index() { + constexpr lib::array matches = { + {std::is_same::value...}}; + std::size_t result = not_found; + for (std::size_t i = 0; i < sizeof...(Ts); ++i) { + if (matches[i]) { + if (result != not_found) { + return ambiguous; + } + result = i; + } + } + return result; +} +#else +inline constexpr std::size_t find_index_impl(std::size_t result, std::size_t) { + return result; +} + +template +inline constexpr std::size_t find_index_impl(std::size_t result, + std::size_t idx, + bool b, + Bs... bs) { + return b ? (result != not_found ? ambiguous + : find_index_impl(idx, idx + 1, bs...)) + : find_index_impl(result, idx + 1, bs...); +} + +template +inline constexpr std::size_t find_index() { + return find_index_impl(not_found, 0, std::is_same::value...); +} +#endif + +template +using find_index_sfinae_impl = + lib::enable_if_t>; + +template +using find_index_sfinae = find_index_sfinae_impl()>; + +template +struct find_index_checked_impl : lib::size_constant { + static_assert(I != not_found, "the specified type is not found."); + static_assert(I != ambiguous, "the specified type is ambiguous."); +}; + +template +using find_index_checked = find_index_checked_impl()>; + +struct valueless_t {}; + +enum class Trait { TriviallyAvailable, Available, Unavailable }; + +template + class IsTriviallyAvailable, + template + class IsAvailable> +inline constexpr Trait trait() { + return IsTriviallyAvailable::value ? Trait::TriviallyAvailable + : IsAvailable::value ? Trait::Available + : Trait::Unavailable; +} + +#ifdef MPARK_CPP14_CONSTEXPR +template +inline constexpr Trait common_trait(Traits... 
traits_) { + Trait result = Trait::TriviallyAvailable; + lib::array traits = {{traits_...}}; + for (std::size_t i = 0; i < sizeof...(Traits); ++i) { + Trait t = traits[i]; + if (static_cast(t) > static_cast(result)) { + result = t; + } + } + return result; +} +#else +inline constexpr Trait common_trait_impl(Trait result) { return result; } + +template +inline constexpr Trait common_trait_impl(Trait result, Trait t, Traits... ts) { + return static_cast(t) > static_cast(result) + ? common_trait_impl(t, ts...) + : common_trait_impl(result, ts...); +} + +template +inline constexpr Trait common_trait(Traits... ts) { + return common_trait_impl(Trait::TriviallyAvailable, ts...); +} +#endif + +template +struct traits { + static constexpr Trait copy_constructible_trait = + common_trait(trait()...); + + static constexpr Trait move_constructible_trait = + common_trait(trait()...); + + static constexpr Trait copy_assignable_trait = + common_trait(copy_constructible_trait, + trait()...); + + static constexpr Trait move_assignable_trait = + common_trait(move_constructible_trait, + trait()...); + + static constexpr Trait destructible_trait = common_trait( + trait()...); +}; + +namespace access { + +struct recursive_union { +#ifdef MPARK_RETURN_TYPE_DEDUCTION + template + inline static constexpr auto &&get_alt(V &&v, in_place_index_t<0>) { + return lib::forward(v).head_; + } + + template + inline static constexpr auto &&get_alt(V &&v, in_place_index_t) { + return get_alt(lib::forward(v).tail_, in_place_index_t{}); + } +#else + template + struct get_alt_impl { + template + inline constexpr AUTO_REFREF operator()(V &&v) const + AUTO_REFREF_RETURN(get_alt_impl{}(lib::forward(v).tail_)) + }; + + template + struct get_alt_impl<0, Dummy> { + template + inline constexpr AUTO_REFREF operator()(V &&v) const + AUTO_REFREF_RETURN(lib::forward(v).head_) + }; + + template + inline static constexpr AUTO_REFREF get_alt(V &&v, in_place_index_t) + AUTO_REFREF_RETURN(get_alt_impl{}(lib::forward(v))) +#endif +}; + +struct base { + template + inline static constexpr AUTO_REFREF get_alt(V &&v) +#ifdef _MSC_VER + AUTO_REFREF_RETURN(recursive_union::get_alt(lib::forward(v).data_, + in_place_index_t{})) +#else + AUTO_REFREF_RETURN(recursive_union::get_alt(data(lib::forward(v)), + in_place_index_t{})) +#endif +}; + +struct variant { + template + inline static constexpr AUTO_REFREF get_alt(V &&v) + AUTO_REFREF_RETURN(base::get_alt(lib::forward(v).impl_)) +}; + +} // namespace access + +namespace visitation { + +#if defined(MPARK_CPP14_CONSTEXPR) && !defined(_MSC_VER) +#define MPARK_VARIANT_SWITCH_VISIT +#endif + +struct base { + template + using dispatch_result_t = + decltype(lib::invoke(std::declval(), + access::base::get_alt<0>(std::declval())...)); + + template + struct expected { + template + inline static constexpr bool but_got() { + return std::is_same::value; + } + }; + + template + struct visit_return_type_check { + static_assert(expected::template but_got(), + "`visit` requires the visitor to have a single return type"); + + template + inline static constexpr DECLTYPE_AUTO invoke(Visitor &&visitor, + Alts &&...alts) + DECLTYPE_AUTO_RETURN(lib::invoke(lib::forward(visitor), + lib::forward(alts)...)) + }; + +#ifdef MPARK_VARIANT_SWITCH_VISIT + template + struct dispatcher; + + template + struct dispatcher { + template + MPARK_ALWAYS_INLINE static constexpr R dispatch(F &&, + typename ITs::type &&..., + Vs &&...) 
{ + MPARK_BUILTIN_UNREACHABLE; + } + + template + MPARK_ALWAYS_INLINE static constexpr R dispatch_case(F &&, Vs &&...) { + MPARK_BUILTIN_UNREACHABLE; + } + + template + MPARK_ALWAYS_INLINE static constexpr R dispatch_at(std::size_t, + F &&, + Vs &&...) { + MPARK_BUILTIN_UNREACHABLE; + } + }; + + template + struct dispatcher { + template + MPARK_ALWAYS_INLINE static constexpr R dispatch( + F &&f, typename ITs::type &&...visited_vs) { + using Expected = R; + using Actual = decltype(lib::invoke( + lib::forward(f), + access::base::get_alt( + lib::forward(visited_vs))...)); + return visit_return_type_check::invoke( + lib::forward(f), + access::base::get_alt( + lib::forward(visited_vs))...); + } + + template + MPARK_ALWAYS_INLINE static constexpr R dispatch( + F &&f, typename ITs::type &&...visited_vs, V &&v, Vs &&...vs) { +#define MPARK_DISPATCH(I) \ + dispatcher<(I < lib::decay_t::size()), \ + R, \ + ITs..., \ + lib::indexed_type>:: \ + template dispatch<0>(lib::forward(f), \ + lib::forward(visited_vs)..., \ + lib::forward(v), \ + lib::forward(vs)...) + +#define MPARK_DEFAULT(I) \ + dispatcher<(I < lib::decay_t::size()), R, ITs...>::template dispatch( \ + lib::forward(f), \ + lib::forward(visited_vs)..., \ + lib::forward(v), \ + lib::forward(vs)...) + + switch (v.index()) { + case B + 0: + return MPARK_DISPATCH(B + 0); + case B + 1: + return MPARK_DISPATCH(B + 1); + case B + 2: + return MPARK_DISPATCH(B + 2); + case B + 3: + return MPARK_DISPATCH(B + 3); + case B + 4: + return MPARK_DISPATCH(B + 4); + case B + 5: + return MPARK_DISPATCH(B + 5); + case B + 6: + return MPARK_DISPATCH(B + 6); + case B + 7: + return MPARK_DISPATCH(B + 7); + case B + 8: + return MPARK_DISPATCH(B + 8); + case B + 9: + return MPARK_DISPATCH(B + 9); + case B + 10: + return MPARK_DISPATCH(B + 10); + case B + 11: + return MPARK_DISPATCH(B + 11); + case B + 12: + return MPARK_DISPATCH(B + 12); + case B + 13: + return MPARK_DISPATCH(B + 13); + case B + 14: + return MPARK_DISPATCH(B + 14); + case B + 15: + return MPARK_DISPATCH(B + 15); + case B + 16: + return MPARK_DISPATCH(B + 16); + case B + 17: + return MPARK_DISPATCH(B + 17); + case B + 18: + return MPARK_DISPATCH(B + 18); + case B + 19: + return MPARK_DISPATCH(B + 19); + case B + 20: + return MPARK_DISPATCH(B + 20); + case B + 21: + return MPARK_DISPATCH(B + 21); + case B + 22: + return MPARK_DISPATCH(B + 22); + case B + 23: + return MPARK_DISPATCH(B + 23); + case B + 24: + return MPARK_DISPATCH(B + 24); + case B + 25: + return MPARK_DISPATCH(B + 25); + case B + 26: + return MPARK_DISPATCH(B + 26); + case B + 27: + return MPARK_DISPATCH(B + 27); + case B + 28: + return MPARK_DISPATCH(B + 28); + case B + 29: + return MPARK_DISPATCH(B + 29); + case B + 30: + return MPARK_DISPATCH(B + 30); + case B + 31: + return MPARK_DISPATCH(B + 31); + default: + return MPARK_DEFAULT(B + 32); + } + +#undef MPARK_DEFAULT +#undef MPARK_DISPATCH + } + + template + MPARK_ALWAYS_INLINE static constexpr R dispatch_case(F &&f, Vs &&...vs) { + using Expected = R; + using Actual = decltype(lib::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...)); + return visit_return_type_check::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...); + } + + template + MPARK_ALWAYS_INLINE static constexpr R dispatch_at(std::size_t index, + F &&f, + V &&v, + Vs &&...vs) { + static_assert(lib::all<(lib::decay_t::size() == + lib::decay_t::size())...>::value, + "all of the variants must be the same size."); +#define MPARK_DISPATCH_AT(I) \ + dispatcher<(I < 
lib::decay_t::size()), R>::template dispatch_case( \ + lib::forward(f), lib::forward(v), lib::forward(vs)...) + +#define MPARK_DEFAULT(I) \ + dispatcher<(I < lib::decay_t::size()), R>::template dispatch_at( \ + index, lib::forward(f), lib::forward(v), lib::forward(vs)...) + + switch (index) { + case B + 0: + return MPARK_DISPATCH_AT(B + 0); + case B + 1: + return MPARK_DISPATCH_AT(B + 1); + case B + 2: + return MPARK_DISPATCH_AT(B + 2); + case B + 3: + return MPARK_DISPATCH_AT(B + 3); + case B + 4: + return MPARK_DISPATCH_AT(B + 4); + case B + 5: + return MPARK_DISPATCH_AT(B + 5); + case B + 6: + return MPARK_DISPATCH_AT(B + 6); + case B + 7: + return MPARK_DISPATCH_AT(B + 7); + case B + 8: + return MPARK_DISPATCH_AT(B + 8); + case B + 9: + return MPARK_DISPATCH_AT(B + 9); + case B + 10: + return MPARK_DISPATCH_AT(B + 10); + case B + 11: + return MPARK_DISPATCH_AT(B + 11); + case B + 12: + return MPARK_DISPATCH_AT(B + 12); + case B + 13: + return MPARK_DISPATCH_AT(B + 13); + case B + 14: + return MPARK_DISPATCH_AT(B + 14); + case B + 15: + return MPARK_DISPATCH_AT(B + 15); + case B + 16: + return MPARK_DISPATCH_AT(B + 16); + case B + 17: + return MPARK_DISPATCH_AT(B + 17); + case B + 18: + return MPARK_DISPATCH_AT(B + 18); + case B + 19: + return MPARK_DISPATCH_AT(B + 19); + case B + 20: + return MPARK_DISPATCH_AT(B + 20); + case B + 21: + return MPARK_DISPATCH_AT(B + 21); + case B + 22: + return MPARK_DISPATCH_AT(B + 22); + case B + 23: + return MPARK_DISPATCH_AT(B + 23); + case B + 24: + return MPARK_DISPATCH_AT(B + 24); + case B + 25: + return MPARK_DISPATCH_AT(B + 25); + case B + 26: + return MPARK_DISPATCH_AT(B + 26); + case B + 27: + return MPARK_DISPATCH_AT(B + 27); + case B + 28: + return MPARK_DISPATCH_AT(B + 28); + case B + 29: + return MPARK_DISPATCH_AT(B + 29); + case B + 30: + return MPARK_DISPATCH_AT(B + 30); + case B + 31: + return MPARK_DISPATCH_AT(B + 31); + default: + return MPARK_DEFAULT(B + 32); + } + +#undef MPARK_DEFAULT +#undef MPARK_DISPATCH_AT + } + }; +#else + template + inline static constexpr const T &at(const T &elem) noexcept { + return elem; + } + + template + inline static constexpr const lib::remove_all_extents_t &at( + const lib::array &elems, std::size_t i, Is... is) noexcept { + return at(elems[i], is...); + } + + template + inline static constexpr lib::array, sizeof...(Fs) + 1> + make_farray(F &&f, Fs &&...fs) { + return {{lib::forward(f), lib::forward(fs)...}}; + } + + template + struct make_fmatrix_impl { + template + inline static constexpr dispatch_result_t dispatch(F &&f, + Vs &&...vs) { + using Expected = dispatch_result_t; + using Actual = decltype(lib::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...)); + return visit_return_type_check::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...); + } + +#ifdef MPARK_RETURN_TYPE_DEDUCTION + template + inline static constexpr auto impl(lib::index_sequence) { + return &dispatch; + } + + template + inline static constexpr auto impl(Is, + lib::index_sequence, + Ls... 
ls) { + return make_farray(impl(lib::push_back_t{}, ls...)...); + } +#else + template + struct impl; + + template + struct impl> { + inline constexpr AUTO operator()() const AUTO_RETURN(&dispatch) + }; + + template + struct impl, Ls...> { + inline constexpr AUTO operator()() const + AUTO_RETURN(make_farray(impl, Ls...>{}()...)) + }; +#endif + }; + +#ifdef MPARK_RETURN_TYPE_DEDUCTION + template + inline static constexpr auto make_fmatrix() { + return make_fmatrix_impl::impl( + lib::index_sequence<>{}, + lib::make_index_sequence::size()>{}...); + } +#else + template + inline static constexpr AUTO make_fmatrix() + AUTO_RETURN(typename make_fmatrix_impl::template impl< + lib::index_sequence<>, + lib::make_index_sequence::size()>...>{}()) +#endif + + template + struct make_fdiagonal_impl { + template + inline static constexpr dispatch_result_t dispatch(F &&f, + Vs &&...vs) { + using Expected = dispatch_result_t; + using Actual = decltype(lib::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...)); + return visit_return_type_check::invoke( + lib::forward(f), + access::base::get_alt(lib::forward(vs))...); + } + + template + inline static constexpr AUTO impl(lib::index_sequence) + AUTO_RETURN(make_farray(&dispatch...)) + }; + + template + inline static constexpr auto make_fdiagonal() + -> decltype(make_fdiagonal_impl::impl( + lib::make_index_sequence::size()>{})) { + static_assert(lib::all<(lib::decay_t::size() == + lib::decay_t::size())...>::value, + "all of the variants must be the same size."); + return make_fdiagonal_impl::impl( + lib::make_index_sequence::size()>{}); + } +#endif +}; + +#if !defined(MPARK_VARIANT_SWITCH_VISIT) && \ + (!defined(_MSC_VER) || _MSC_VER >= 1910) +template +using fmatrix_t = decltype(base::make_fmatrix()); + +template +struct fmatrix { + static constexpr fmatrix_t value = base::make_fmatrix(); +}; + +template +constexpr fmatrix_t fmatrix::value; + +template +using fdiagonal_t = decltype(base::make_fdiagonal()); + +template +struct fdiagonal { + static constexpr fdiagonal_t value = + base::make_fdiagonal(); +}; + +template +constexpr fdiagonal_t fdiagonal::value; +#endif + +struct alt { + template + inline static constexpr DECLTYPE_AUTO visit_alt(Visitor &&visitor, Vs &&...vs) +#ifdef MPARK_VARIANT_SWITCH_VISIT + DECLTYPE_AUTO_RETURN( + base::dispatcher(vs)))...>>:: + template dispatch<0>(lib::forward(visitor), + as_base(lib::forward(vs))...)) +#elif !defined(_MSC_VER) || _MSC_VER >= 1910 + DECLTYPE_AUTO_RETURN( + base::at(fmatrix(vs)))...>::value, + vs.index()...)(lib::forward(visitor), + as_base(lib::forward(vs))...)) +#else + DECLTYPE_AUTO_RETURN(base::at( + base::make_fmatrix(vs)))...>(), + vs.index()...)(lib::forward(visitor), + as_base(lib::forward(vs))...)) +#endif + + template + inline static constexpr DECLTYPE_AUTO + visit_alt_at(std::size_t index, Visitor &&visitor, Vs &&...vs) +#ifdef MPARK_VARIANT_SWITCH_VISIT + DECLTYPE_AUTO_RETURN( + base::dispatcher< + true, + base::dispatch_result_t< + Visitor, + decltype(as_base(lib::forward(vs)))...>>:: + template dispatch_at<0>(index, + lib::forward(visitor), + as_base(lib::forward(vs))...)) +#elif !defined(_MSC_VER) || _MSC_VER >= 1910 + DECLTYPE_AUTO_RETURN(base::at( + fdiagonal(vs)))...>::value, + index)(lib::forward(visitor), + as_base(lib::forward(vs))...)) +#else + DECLTYPE_AUTO_RETURN( + base::at(base::make_fdiagonal< + Visitor &&, + decltype(as_base(lib::forward(vs)))...>(), + index)(lib::forward(visitor), + as_base(lib::forward(vs))...)) +#endif +}; + +struct variant { + private: + 
template + struct visitor { + template + inline static constexpr bool does_not_handle() { + return lib::is_invocable::value; + } + }; + + template + struct visit_exhaustiveness_check { + static_assert(visitor::template does_not_handle(), + "`visit` requires the visitor to be exhaustive."); + + inline static constexpr DECLTYPE_AUTO invoke(Visitor &&visitor, + Values &&...values) + DECLTYPE_AUTO_RETURN(lib::invoke(lib::forward(visitor), + lib::forward(values)...)) + }; + + template + struct value_visitor { + Visitor &&visitor_; + + template + inline constexpr DECLTYPE_AUTO operator()(Alts &&...alts) const + DECLTYPE_AUTO_RETURN(visit_exhaustiveness_check< + Visitor, + decltype((lib::forward(alts).value))...>:: + invoke(lib::forward(visitor_), + lib::forward(alts).value...)) + }; + + template + inline static constexpr AUTO make_value_visitor(Visitor &&visitor) + AUTO_RETURN(value_visitor{lib::forward(visitor)}) + + public + : template + inline static constexpr DECLTYPE_AUTO + visit_alt(Visitor &&visitor, Vs &&...vs) + DECLTYPE_AUTO_RETURN(alt::visit_alt(lib::forward(visitor), + lib::forward(vs).impl_...)) + + template + inline static constexpr DECLTYPE_AUTO + visit_alt_at(std::size_t index, Visitor &&visitor, Vs &&...vs) + DECLTYPE_AUTO_RETURN( + alt::visit_alt_at(index, + lib::forward(visitor), + lib::forward(vs).impl_...)) + + template + inline static constexpr DECLTYPE_AUTO + visit_value(Visitor &&visitor, Vs &&...vs) DECLTYPE_AUTO_RETURN( + visit_alt(make_value_visitor(lib::forward(visitor)), + lib::forward(vs)...)) + + template + inline static constexpr DECLTYPE_AUTO + visit_value_at(std::size_t index, Visitor &&visitor, Vs &&...vs) + DECLTYPE_AUTO_RETURN( + visit_alt_at(index, + make_value_visitor(lib::forward(visitor)), + lib::forward(vs)...)) +}; + +} // namespace visitation + +template +struct alt { + using value_type = T; + +#ifdef _MSC_VER +#pragma warning(push) +#pragma warning(disable : 4244) +#endif + template + inline explicit constexpr alt(in_place_t, Args &&...args) + : value(lib::forward(args)...) {} +#ifdef _MSC_VER +#pragma warning(pop) +#endif + + T value; +}; + +template +union recursive_union; + +template +union recursive_union {}; + +#define MPARK_VARIANT_RECURSIVE_UNION(destructible_trait, destructor) \ + template \ + union recursive_union { \ + public: \ + inline explicit constexpr recursive_union(valueless_t) noexcept \ + : dummy_{} {} \ + \ + template \ + inline explicit constexpr recursive_union(in_place_index_t<0>, \ + Args &&...args) \ + : head_(in_place_t{}, lib::forward(args)...) {} \ + \ + template \ + inline explicit constexpr recursive_union(in_place_index_t, \ + Args &&...args) \ + : tail_(in_place_index_t{}, lib::forward(args)...) 
{} \ + \ + recursive_union(const recursive_union &) = default; \ + recursive_union(recursive_union &&) = default; \ + \ + destructor \ + \ + recursive_union & \ + operator=(const recursive_union &) = default; \ + recursive_union &operator=(recursive_union &&) = default; \ + \ + private: \ + char dummy_; \ + alt head_; \ + recursive_union tail_; \ + \ + friend struct access::recursive_union; \ + } + +MPARK_VARIANT_RECURSIVE_UNION(Trait::TriviallyAvailable, + ~recursive_union() = default;); +MPARK_VARIANT_RECURSIVE_UNION(Trait::Available, ~recursive_union(){}); +MPARK_VARIANT_RECURSIVE_UNION(Trait::Unavailable, ~recursive_union() = delete;); + +#undef MPARK_VARIANT_RECURSIVE_UNION + +using index_t = unsigned int; + +template +class base { + public: + inline explicit constexpr base(valueless_t tag) noexcept + : data_(tag), index_(static_cast(-1)) {} + + template + inline explicit constexpr base(in_place_index_t, Args &&...args) + : data_(in_place_index_t{}, lib::forward(args)...), index_(I) {} + + inline constexpr bool valueless_by_exception() const noexcept { + return index_ == static_cast(-1); + } + + inline constexpr std::size_t index() const noexcept { + return valueless_by_exception() ? variant_npos : index_; + } + + protected: + using data_t = recursive_union; + + friend inline constexpr base &as_base(base &b) { return b; } + friend inline constexpr const base &as_base(const base &b) { return b; } + friend inline constexpr base &&as_base(base &&b) { return lib::move(b); } + friend inline constexpr const base &&as_base(const base &&b) { + return lib::move(b); + } + + friend inline constexpr data_t &data(base &b) { return b.data_; } + friend inline constexpr const data_t &data(const base &b) { return b.data_; } + friend inline constexpr data_t &&data(base &&b) { return lib::move(b).data_; } + friend inline constexpr const data_t &&data(const base &&b) { + return lib::move(b).data_; + } + + inline static constexpr std::size_t size() { return sizeof...(Ts); } + + data_t data_; + index_t index_; + + friend struct access::base; + friend struct visitation::base; +}; + +struct dtor { +#ifdef _MSC_VER +#pragma warning(push) +#pragma warning(disable : 4100) +#endif + template + inline void operator()(Alt &alt) const noexcept { + alt.~Alt(); + } +#ifdef _MSC_VER +#pragma warning(pop) +#endif +}; + +#if !defined(_MSC_VER) || _MSC_VER >= 1910 +#define MPARK_INHERITING_CTOR(type, base) using base::base; +#else +#define MPARK_INHERITING_CTOR(type, base) \ + template \ + inline explicit constexpr type(Args &&...args) \ + : base(lib::forward(args)...) 
{} +#endif + +template +class destructor; + +#define MPARK_VARIANT_DESTRUCTOR(destructible_trait, definition, destroy) \ + template \ + class destructor, destructible_trait> \ + : public base { \ + using super = base; \ + \ + public: \ + MPARK_INHERITING_CTOR(destructor, super) \ + using super::operator=; \ + \ + destructor(const destructor &) = default; \ + destructor(destructor &&) = default; \ + definition destructor &operator=(const destructor &) = default; \ + destructor &operator=(destructor &&) = default; \ + \ + protected: \ + destroy \ + } + +MPARK_VARIANT_DESTRUCTOR( + Trait::TriviallyAvailable, ~destructor() = default; + , inline void destroy() noexcept { + this->index_ = static_cast(-1); + }); + +MPARK_VARIANT_DESTRUCTOR( + Trait::Available, + ~destructor() { destroy(); }, + inline void destroy() noexcept { + if (!this->valueless_by_exception()) { + visitation::alt::visit_alt(dtor{}, *this); + } + this->index_ = static_cast(-1); + }); + +MPARK_VARIANT_DESTRUCTOR(Trait::Unavailable, ~destructor() = delete; + , inline void destroy() noexcept = delete;); + +#undef MPARK_VARIANT_DESTRUCTOR + +template +class constructor : public destructor { + using super = destructor; + + public: + MPARK_INHERITING_CTOR(constructor, super) + using super::operator=; + + protected: +#ifndef MPARK_GENERIC_LAMBDAS + struct ctor { + template + inline void operator()(LhsAlt &lhs_alt, RhsAlt &&rhs_alt) const { + constructor::construct_alt(lhs_alt, lib::forward(rhs_alt).value); + } + }; +#endif + + template + inline static T &construct_alt(alt &a, Args &&...args) { + auto *result = ::new (static_cast(lib::addressof(a))) + alt(in_place_t{}, lib::forward(args)...); + return result->value; + } + + template + inline static void generic_construct(constructor &lhs, Rhs &&rhs) { + lhs.destroy(); + if (!rhs.valueless_by_exception()) { + visitation::alt::visit_alt_at( + rhs.index(), +#ifdef MPARK_GENERIC_LAMBDAS + [](auto &lhs_alt, auto &&rhs_alt) { + constructor::construct_alt( + lhs_alt, lib::forward(rhs_alt).value); + } +#else + ctor {} +#endif + , + lhs, + lib::forward(rhs)); + lhs.index_ = rhs.index_; + } + } +}; + +template +class move_constructor; + +#define MPARK_VARIANT_MOVE_CONSTRUCTOR(move_constructible_trait, definition) \ + template \ + class move_constructor, move_constructible_trait> \ + : public constructor> { \ + using super = constructor>; \ + \ + public: \ + MPARK_INHERITING_CTOR(move_constructor, super) \ + using super::operator=; \ + \ + move_constructor(const move_constructor &) = default; \ + definition ~move_constructor() = default; \ + move_constructor &operator=(const move_constructor &) = default; \ + move_constructor &operator=(move_constructor &&) = default; \ + } + +MPARK_VARIANT_MOVE_CONSTRUCTOR( + Trait::TriviallyAvailable, + move_constructor(move_constructor &&that) = default;); + +MPARK_VARIANT_MOVE_CONSTRUCTOR( + Trait::Available, + move_constructor(move_constructor &&that) noexcept( + lib::all::value...>::value) + : move_constructor(valueless_t{}) { + this->generic_construct(*this, lib::move(that)); + }); + +MPARK_VARIANT_MOVE_CONSTRUCTOR(Trait::Unavailable, + move_constructor(move_constructor &&) = delete;); + +#undef MPARK_VARIANT_MOVE_CONSTRUCTOR + +template +class copy_constructor; + +#define MPARK_VARIANT_COPY_CONSTRUCTOR(copy_constructible_trait, definition) \ + template \ + class copy_constructor, copy_constructible_trait> \ + : public move_constructor> { \ + using super = move_constructor>; \ + \ + public: \ + MPARK_INHERITING_CTOR(copy_constructor, super) \ + using 
super::operator=; \ + \ + definition copy_constructor(copy_constructor &&) = default; \ + ~copy_constructor() = default; \ + copy_constructor &operator=(const copy_constructor &) = default; \ + copy_constructor &operator=(copy_constructor &&) = default; \ + } + +MPARK_VARIANT_COPY_CONSTRUCTOR( + Trait::TriviallyAvailable, + copy_constructor(const copy_constructor &that) = default;); + +MPARK_VARIANT_COPY_CONSTRUCTOR( + Trait::Available, copy_constructor(const copy_constructor &that) + : copy_constructor(valueless_t{}) { + this->generic_construct(*this, that); + }); + +MPARK_VARIANT_COPY_CONSTRUCTOR( + Trait::Unavailable, copy_constructor(const copy_constructor &) = delete;); + +#undef MPARK_VARIANT_COPY_CONSTRUCTOR + +template +class assignment : public copy_constructor { + using super = copy_constructor; + + public: + MPARK_INHERITING_CTOR(assignment, super) + using super::operator=; + + template + inline /* auto & */ auto emplace(Args &&...args) + -> decltype(this->construct_alt(access::base::get_alt(*this), + lib::forward(args)...)) { + this->destroy(); + auto &result = this->construct_alt(access::base::get_alt(*this), + lib::forward(args)...); + this->index_ = I; + return result; + } + + protected: +#ifndef MPARK_GENERIC_LAMBDAS + template + struct assigner { + template + inline void operator()(ThisAlt &this_alt, ThatAlt &&that_alt) const { + self->assign_alt(this_alt, lib::forward(that_alt).value); + } + assignment *self; + }; +#endif + + template + inline void assign_alt(alt &a, Arg &&arg) { + if (this->index() == I) { +#ifdef _MSC_VER +#pragma warning(push) +#pragma warning(disable : 4244) +#endif + a.value = lib::forward(arg); +#ifdef _MSC_VER +#pragma warning(pop) +#endif + } else { + struct { + void operator()(std::true_type) const { + this_->emplace(lib::forward(arg_)); + } + void operator()(std::false_type) const { + this_->emplace(T(lib::forward(arg_))); + } + assignment *this_; + Arg &&arg_; + } impl{this, lib::forward(arg)}; + impl(lib::bool_constant < std::is_nothrow_constructible::value || + !std::is_nothrow_move_constructible::value > {}); + } + } + + template + inline void generic_assign(That &&that) { + if (this->valueless_by_exception() && that.valueless_by_exception()) { + // do nothing. 
+ } else if (that.valueless_by_exception()) { + this->destroy(); + } else { + visitation::alt::visit_alt_at( + that.index(), +#ifdef MPARK_GENERIC_LAMBDAS + [this](auto &this_alt, auto &&that_alt) { + this->assign_alt(this_alt, + lib::forward(that_alt).value); + } +#else + assigner { this } +#endif + , + *this, + lib::forward(that)); + } + } +}; + +template +class move_assignment; + +#define MPARK_VARIANT_MOVE_ASSIGNMENT(move_assignable_trait, definition) \ + template \ + class move_assignment, move_assignable_trait> \ + : public assignment> { \ + using super = assignment>; \ + \ + public: \ + MPARK_INHERITING_CTOR(move_assignment, super) \ + using super::operator=; \ + \ + move_assignment(const move_assignment &) = default; \ + move_assignment(move_assignment &&) = default; \ + ~move_assignment() = default; \ + move_assignment &operator=(const move_assignment &) = default; \ + definition \ + } + +MPARK_VARIANT_MOVE_ASSIGNMENT( + Trait::TriviallyAvailable, + move_assignment &operator=(move_assignment &&that) = default;); + +MPARK_VARIANT_MOVE_ASSIGNMENT( + Trait::Available, + move_assignment & + operator=(move_assignment &&that) noexcept( + lib::all<(std::is_nothrow_move_constructible::value && + std::is_nothrow_move_assignable::value)...>::value) { + this->generic_assign(lib::move(that)); + return *this; + }); + +MPARK_VARIANT_MOVE_ASSIGNMENT( + Trait::Unavailable, + move_assignment &operator=(move_assignment &&) = delete;); + +#undef MPARK_VARIANT_MOVE_ASSIGNMENT + +template +class copy_assignment; + +#define MPARK_VARIANT_COPY_ASSIGNMENT(copy_assignable_trait, definition) \ + template \ + class copy_assignment, copy_assignable_trait> \ + : public move_assignment> { \ + using super = move_assignment>; \ + \ + public: \ + MPARK_INHERITING_CTOR(copy_assignment, super) \ + using super::operator=; \ + \ + copy_assignment(const copy_assignment &) = default; \ + copy_assignment(copy_assignment &&) = default; \ + ~copy_assignment() = default; \ + definition copy_assignment &operator=(copy_assignment &&) = default; \ + } + +MPARK_VARIANT_COPY_ASSIGNMENT( + Trait::TriviallyAvailable, + copy_assignment &operator=(const copy_assignment &that) = default;); + +MPARK_VARIANT_COPY_ASSIGNMENT( + Trait::Available, copy_assignment &operator=(const copy_assignment &that) { + this->generic_assign(that); + return *this; + }); + +MPARK_VARIANT_COPY_ASSIGNMENT( + Trait::Unavailable, + copy_assignment &operator=(const copy_assignment &) = delete;); + +#undef MPARK_VARIANT_COPY_ASSIGNMENT + +template +class impl : public copy_assignment> { + using super = copy_assignment>; + + public: + MPARK_INHERITING_CTOR(impl, super) + using super::operator=; + + template + inline void assign(Arg &&arg) { + this->assign_alt(access::base::get_alt(*this), lib::forward(arg)); + } + + inline void swap(impl &that) { + if (this->valueless_by_exception() && that.valueless_by_exception()) { + // do nothing. 
+ } else if (this->index() == that.index()) { + visitation::alt::visit_alt_at( + this->index(), +#ifdef MPARK_GENERIC_LAMBDAS + [](auto &this_alt, auto &that_alt) { + using std::swap; + swap(this_alt.value, that_alt.value); + } +#else + swapper {} +#endif + , + *this, + that); + } else { + impl *lhs = this; + impl *rhs = lib::addressof(that); + if (lhs->move_nothrow() && !rhs->move_nothrow()) { + std::swap(lhs, rhs); + } + impl tmp(lib::move(*rhs)); +#ifdef MPARK_EXCEPTIONS + // EXTENSION: When the move construction of `lhs` into `rhs` throws + // and `tmp` is nothrow move constructible then we move `tmp` back + // into `rhs` and provide the strong exception safety guarantee. + try { + this->generic_construct(*rhs, lib::move(*lhs)); + } catch (...) { + if (tmp.move_nothrow()) { + this->generic_construct(*rhs, lib::move(tmp)); + } + throw; + } +#else + this->generic_construct(*rhs, lib::move(*lhs)); +#endif + this->generic_construct(*lhs, lib::move(tmp)); + } + } + + inline const std::type_info &type() const { + return visitation::alt::visit_alt_at( + this->index(), +#ifdef MPARK_GENERIC_LAMBDAS + [](auto &alt) -> const std::type_info & { return typeid(alt.value); } +#else + typer {} +#endif + , + *this); + } + + private: +#ifndef MPARK_GENERIC_LAMBDAS + struct swapper { + template + inline void operator()(ThisAlt &this_alt, ThatAlt &that_alt) const { + using std::swap; + swap(this_alt.value, that_alt.value); + } + }; + + struct typer { + template + inline const std::type_info &operator()(Alt &alt) const { + return typeid(alt.value); + } + }; +#endif + + inline constexpr bool move_nothrow() const { + return this->valueless_by_exception() || + lib::array{{std::is_nothrow_move_constructible< + Ts>::value...}}[this->index()]; + } +}; + +#undef MPARK_INHERITING_CTOR + +template +struct overload_leaf { + using F = lib::size_constant (*)(T); + operator F() const { return nullptr; } +}; + +template +struct overload_impl { + private: + template + struct impl; + + template + struct impl> : overload_leaf... 
{}; + + public: + using type = impl>; +}; + +template +using overload = typename overload_impl::type; + +template +using best_match = lib::invoke_result_t, T &&>; + +template +struct is_in_place_index : std::false_type {}; + +template +struct is_in_place_index> : std::true_type {}; + +template +struct is_in_place_type : std::false_type {}; + +template +struct is_in_place_type> : std::true_type {}; + +} // namespace detail + +template +class variant { + static_assert(0 < sizeof...(Ts), + "variant must consist of at least one alternative."); + + static_assert(lib::all::value...>::value, + "variant can not have an array type as an alternative."); + + static_assert(lib::all::value...>::value, + "variant can not have a reference type as an alternative."); + + static_assert(lib::all::value...>::value, + "variant can not have a void type as an alternative."); + + public: + template < + typename Front = lib::type_pack_element_t<0, Ts...>, + lib::enable_if_t::value, int> = 0> + inline constexpr variant() noexcept( + std::is_nothrow_default_constructible::value) + : impl_(in_place_index_t<0>{}) {} + + variant(const variant &) = default; + variant(variant &&) = default; + + template < + typename Arg, + typename Decayed = lib::decay_t, + lib::enable_if_t::value, int> = 0, + lib::enable_if_t::value, int> = 0, + lib::enable_if_t::value, int> = 0, + std::size_t I = detail::best_match::value, + typename T = lib::type_pack_element_t, + lib::enable_if_t::value, int> = 0> + inline constexpr variant(Arg &&arg) noexcept( + std::is_nothrow_constructible::value) + : impl_(in_place_index_t{}, lib::forward(arg)) {} + + template , + lib::enable_if_t::value, int> = 0> + inline explicit constexpr variant( + in_place_index_t, + Args &&...args) noexcept(std::is_nothrow_constructible::value) + : impl_(in_place_index_t{}, lib::forward(args)...) {} + + template < + std::size_t I, + typename Up, + typename... Args, + typename T = lib::type_pack_element_t, + lib::enable_if_t< + std::is_constructible &, Args...>::value, + int> = 0> + inline explicit constexpr variant( + in_place_index_t, + std::initializer_list il, + Args &&...args) noexcept(std:: + is_nothrow_constructible< + T, + std::initializer_list &, + Args...>::value) + : impl_(in_place_index_t{}, il, lib::forward(args)...) {} + + template ::value, + lib::enable_if_t::value, int> = 0> + inline explicit constexpr variant( + in_place_type_t, + Args &&...args) noexcept(std::is_nothrow_constructible::value) + : impl_(in_place_index_t{}, lib::forward(args)...) {} + + template < + typename T, + typename Up, + typename... Args, + std::size_t I = detail::find_index_sfinae::value, + lib::enable_if_t< + std::is_constructible &, Args...>::value, + int> = 0> + inline explicit constexpr variant( + in_place_type_t, + std::initializer_list il, + Args &&...args) noexcept(std:: + is_nothrow_constructible< + T, + std::initializer_list &, + Args...>::value) + : impl_(in_place_index_t{}, il, lib::forward(args)...) 
{} + + ~variant() = default; + + variant &operator=(const variant &) = default; + variant &operator=(variant &&) = default; + + template , variant>::value, + int> = 0, + std::size_t I = detail::best_match::value, + typename T = lib::type_pack_element_t, + lib::enable_if_t<(std::is_assignable::value && + std::is_constructible::value), + int> = 0> + inline variant &operator=(Arg &&arg) noexcept( + (std::is_nothrow_assignable::value && + std::is_nothrow_constructible::value)) { + impl_.template assign(lib::forward(arg)); + return *this; + } + + template , + lib::enable_if_t::value, int> = 0> + inline T &emplace(Args &&...args) { + return impl_.template emplace(lib::forward(args)...); + } + + template < + std::size_t I, + typename Up, + typename... Args, + typename T = lib::type_pack_element_t, + lib::enable_if_t< + std::is_constructible &, Args...>::value, + int> = 0> + inline T &emplace(std::initializer_list il, Args &&...args) { + return impl_.template emplace(il, lib::forward(args)...); + } + + template ::value, + lib::enable_if_t::value, int> = 0> + inline T &emplace(Args &&...args) { + return impl_.template emplace(lib::forward(args)...); + } + + template < + typename T, + typename Up, + typename... Args, + std::size_t I = detail::find_index_sfinae::value, + lib::enable_if_t< + std::is_constructible &, Args...>::value, + int> = 0> + inline T &emplace(std::initializer_list il, Args &&...args) { + return impl_.template emplace(il, lib::forward(args)...); + } + + inline constexpr bool valueless_by_exception() const noexcept { + return impl_.valueless_by_exception(); + } + + inline constexpr std::size_t index() const noexcept { return impl_.index(); } + + template , + Dummy>::value && + lib::dependent_type, + Dummy>::value)...>::value, + int> = 0> + inline void swap(variant &that) noexcept( + lib::all<(std::is_nothrow_move_constructible::value && + lib::is_nothrow_swappable::value)...>::value) { + impl_.swap(that.impl_); + } + + inline const std::type_info &type() const noexcept { return impl_.type(); } + + private: + detail::impl impl_; + + friend struct detail::access::variant; + friend struct detail::visitation::variant; +}; + +template +inline constexpr bool holds_alternative(const variant &v) noexcept { + return v.index() == I; +} + +template +inline constexpr bool holds_alternative(const variant &v) noexcept { + return holds_alternative::value>(v); +} + +namespace detail { +template +struct generic_get_impl { + constexpr generic_get_impl(int) noexcept {} + + constexpr AUTO_REFREF operator()(V &&v) const + AUTO_REFREF_RETURN(access::variant::get_alt(lib::forward(v)).value) +}; + +template +inline constexpr AUTO_REFREF generic_get(V &&v) + AUTO_REFREF_RETURN(generic_get_impl(holds_alternative(v) + ? 
0 + : (throw_bad_variant_access(), + 0))(lib::forward(v))) +} // namespace detail + +template +inline constexpr variant_alternative_t> &get( + variant &v) { + return detail::generic_get(v); +} + +template +inline constexpr variant_alternative_t> &&get( + variant &&v) { + return detail::generic_get(lib::move(v)); +} + +template +inline constexpr const variant_alternative_t> &get( + const variant &v) { + return detail::generic_get(v); +} + +template +inline constexpr const variant_alternative_t> &&get( + const variant &&v) { + return detail::generic_get(lib::move(v)); +} + +template +inline constexpr T &get(variant &v) { + return get::value>(v); +} + +template +inline constexpr T &&get(variant &&v) { + return get::value>(lib::move(v)); +} + +template +inline constexpr const T &get(const variant &v) { + return get::value>(v); +} + +template +inline constexpr const T &&get(const variant &&v) { + return get::value>(lib::move(v)); +} + +namespace detail { + +template +inline constexpr /* auto * */ AUTO generic_get_if(V *v) noexcept + AUTO_RETURN(v &&holds_alternative(*v) + ? lib::addressof(access::variant::get_alt(*v).value) + : nullptr) + +} // namespace detail + +template +inline constexpr lib::add_pointer_t>> +get_if(variant *v) noexcept { + return detail::generic_get_if(v); +} + +template +inline constexpr lib::add_pointer_t< + const variant_alternative_t>> +get_if(const variant *v) noexcept { + return detail::generic_get_if(v); +} + +template +inline constexpr lib::add_pointer_t get_if(variant *v) noexcept { + return get_if::value>(v); +} + +template +inline constexpr lib::add_pointer_t get_if( + const variant *v) noexcept { + return get_if::value>(v); +} + +namespace detail { +template +struct convert_to_bool { + template + inline constexpr bool operator()(Lhs &&lhs, Rhs &&rhs) const { + static_assert( + std::is_convertible, bool>::value, + "relational operators must return a type" + " implicitly convertible to bool"); + return lib::invoke(RelOp{}, lib::forward(lhs), lib::forward(rhs)); + } +}; +} // namespace detail + +template +inline constexpr bool operator==(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using equal_to = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (lhs.index() != rhs.index()) return false; + if (lhs.valueless_by_exception()) return true; + return variant::visit_value_at(lhs.index(), equal_to{}, lhs, rhs); +#else + return lhs.index() == rhs.index() && + (lhs.valueless_by_exception() || + variant::visit_value_at(lhs.index(), equal_to{}, lhs, rhs)); +#endif +} + +template +inline constexpr bool operator!=(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using not_equal_to = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (lhs.index() != rhs.index()) return true; + if (lhs.valueless_by_exception()) return false; + return variant::visit_value_at(lhs.index(), not_equal_to{}, lhs, rhs); +#else + return lhs.index() != rhs.index() || + (!lhs.valueless_by_exception() && + variant::visit_value_at(lhs.index(), not_equal_to{}, lhs, rhs)); +#endif +} + +template +inline constexpr bool operator<(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using less = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (rhs.valueless_by_exception()) return false; + if (lhs.valueless_by_exception()) return true; + if (lhs.index() < rhs.index()) return true; + if (lhs.index() > rhs.index()) return false; + return variant::visit_value_at(lhs.index(), less{}, 
lhs, rhs); +#else + return !rhs.valueless_by_exception() && + (lhs.valueless_by_exception() || lhs.index() < rhs.index() || + (lhs.index() == rhs.index() && + variant::visit_value_at(lhs.index(), less{}, lhs, rhs))); +#endif +} + +template +inline constexpr bool operator>(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using greater = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (lhs.valueless_by_exception()) return false; + if (rhs.valueless_by_exception()) return true; + if (lhs.index() > rhs.index()) return true; + if (lhs.index() < rhs.index()) return false; + return variant::visit_value_at(lhs.index(), greater{}, lhs, rhs); +#else + return !lhs.valueless_by_exception() && + (rhs.valueless_by_exception() || lhs.index() > rhs.index() || + (lhs.index() == rhs.index() && + variant::visit_value_at(lhs.index(), greater{}, lhs, rhs))); +#endif +} + +template +inline constexpr bool operator<=(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using less_equal = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (lhs.valueless_by_exception()) return true; + if (rhs.valueless_by_exception()) return false; + if (lhs.index() < rhs.index()) return true; + if (lhs.index() > rhs.index()) return false; + return variant::visit_value_at(lhs.index(), less_equal{}, lhs, rhs); +#else + return lhs.valueless_by_exception() || + (!rhs.valueless_by_exception() && + (lhs.index() < rhs.index() || + (lhs.index() == rhs.index() && + variant::visit_value_at(lhs.index(), less_equal{}, lhs, rhs)))); +#endif +} + +template +inline constexpr bool operator>=(const variant &lhs, + const variant &rhs) { + using detail::visitation::variant; + using greater_equal = detail::convert_to_bool; +#ifdef MPARK_CPP14_CONSTEXPR + if (rhs.valueless_by_exception()) return true; + if (lhs.valueless_by_exception()) return false; + if (lhs.index() > rhs.index()) return true; + if (lhs.index() < rhs.index()) return false; + return variant::visit_value_at(lhs.index(), greater_equal{}, lhs, rhs); +#else + return rhs.valueless_by_exception() || + (!lhs.valueless_by_exception() && + (lhs.index() > rhs.index() || + (lhs.index() == rhs.index() && + variant::visit_value_at(lhs.index(), greater_equal{}, lhs, rhs)))); +#endif +} + +struct monostate {}; + +inline constexpr bool operator<(monostate, monostate) noexcept { return false; } + +inline constexpr bool operator>(monostate, monostate) noexcept { return false; } + +inline constexpr bool operator<=(monostate, monostate) noexcept { return true; } + +inline constexpr bool operator>=(monostate, monostate) noexcept { return true; } + +inline constexpr bool operator==(monostate, monostate) noexcept { return true; } + +inline constexpr bool operator!=(monostate, monostate) noexcept { + return false; +} + +namespace detail { + +template +inline constexpr bool all_impl(const lib::array &bs, std::size_t idx) { + return idx >= N || (bs[idx] && all_impl(bs, idx + 1)); +} + +template +inline constexpr bool all(const lib::array &bs) { + return all_impl(bs, 0); +} + +} // namespace detail + +template +inline constexpr DECLTYPE_AUTO visit(Visitor &&visitor, Vs &&...vs) + DECLTYPE_AUTO_RETURN( + (detail::all(lib::array{ + {!vs.valueless_by_exception()...}}) + ? 
(void)0 + : throw_bad_variant_access()), + detail::visitation::variant::visit_value(lib::forward(visitor), + lib::forward(vs)...)) + + template + inline auto swap(variant &lhs, + variant &rhs) noexcept(noexcept(lhs.swap(rhs))) + -> decltype(lhs.swap(rhs)) { + lhs.swap(rhs); +} + +namespace detail { + +template +using enabled_type = T; + +namespace hash { + +template +constexpr bool meets_requirements() noexcept { + return std::is_copy_constructible::value && + std::is_move_constructible::value && + lib::is_invocable_r::value; +} + +template +constexpr bool is_enabled() noexcept { + using H = std::hash; + return meets_requirements() && + std::is_default_constructible::value && + std::is_copy_assignable::value && std::is_move_assignable::value; +} + +} // namespace hash + +} // namespace detail + +#undef AUTO +#undef AUTO_RETURN + +#undef AUTO_REFREF +#undef AUTO_REFREF_RETURN + +#undef DECLTYPE_AUTO +#undef DECLTYPE_AUTO_RETURN + +} // namespace paddlenlp + +namespace std { + +template +struct hash, + paddlenlp::lib::enable_if_t>()...>::value>>> { + using argument_type = paddlenlp::variant; + using result_type = std::size_t; + + inline result_type operator()(const argument_type &v) const { + using paddlenlp::detail::visitation::variant; + std::size_t result = + v.valueless_by_exception() + ? 299792458 // Random value chosen by the universe upon creation + : variant::visit_alt( +#ifdef MPARK_GENERIC_LAMBDAS + [](const auto &alt) { + using alt_type = paddlenlp::lib::decay_t; + using value_type = paddlenlp::lib::remove_const_t< + typename alt_type::value_type>; + return hash{}(alt.value); + } +#else + hasher {} +#endif + , + v); + return hash_combine(result, hash{}(v.index())); + } + + private: +#ifndef MPARK_GENERIC_LAMBDAS + struct hasher { + template + inline std::size_t operator()(const Alt &alt) const { + using alt_type = paddlenlp::lib::decay_t; + using value_type = + paddlenlp::lib::remove_const_t; + return hash{}(alt.value); + } + }; +#endif + + static std::size_t hash_combine(std::size_t lhs, std::size_t rhs) { + return lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2); + } +}; + +template <> +struct hash { + using argument_type = paddlenlp::monostate; + using result_type = std::size_t; + + inline result_type operator()(const argument_type &) const noexcept { + return 66740831; // return a fundamentally attractive random value. 
+ } +}; + +} // namespace std + +#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 9 +#pragma GCC diagnostic pop +#endif diff --git a/fast_tokenizer/icu_filters.json b/fast_tokenizer/icu_filters.json new file mode 100644 index 0000000000000000000000000000000000000000..42f1735d0421e8ec5e32927218790b14792285be --- /dev/null +++ b/fast_tokenizer/icu_filters.json @@ -0,0 +1,35 @@ +{ + "localeFilter": { + "filterType": "language", + "includelist": [ + "en", + "zh" + ] + }, + "featureFilters": { + "coll_tree": "exclude", + "coll_ucadata": "exclude", + + "confusables": "exclude", + + "curr_tree": "exclude", + "curr_supplemental": "exclude", + + "lang_tree": "exclude", + + "unit_tree": "exclude", + + "rbnf_tree": "exclude", + + "zone_tree": "exclude", + "zone_supplemental": "exclude", + + "stringprep": "exclude", + + "translit": "exclude", + + "unames": "exclude", + + "conversion_mappings": "exclude" + } +} diff --git a/fast_tokenizer/perf/README.md b/fast_tokenizer/perf/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7d4f4b1d33a9e499c18d7d8a718818bdcd9f0931 --- /dev/null +++ b/fast_tokenizer/perf/README.md @@ -0,0 +1,68 @@ +# 飞桨FastTokenizer性能测试 + +在PaddleNLP v2.2.0版本中PaddleNLP推出了高性能的Transformer类文本分词器,简称飞桨FastTokenizer。为了验证飞桨FastTokenizer的性能快的特点,PaddleNLP选取了业内常见的一些文本分词器进行了性能对比比较,主要进行性能参考的是HuggingFace BertTokenizer, Tensorflow-text BertTokenizer. 我们以 bert-base-chinese 模型为例进行了文本分词性能实验对比,在中文的数据下进行性能对比实验,下面是具体实验设置信息: +* [HuggingFace Tokenizers(Python)](https://github.com/huggingface/tokenizers): + +```python +from transformers import AutoTokenizer + +hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False) +``` + +* [HuggingFace Tokenizers(Rust)](https://github.com/huggingface/tokenizers): + +```python +from transformers import AutoTokenizer + +hf_tokenizer_fast = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True) +``` + +* [TensorFlow-Text](https://www.tensorflow.org/text/api_docs/python/text/BertTokenizer): + +```python +import tensorflow_text as tf_text + +# vocab 为bert-base-chinese的词汇表 +tf_tokenizer = tf_text.BertTokenizer(vocab) +``` + +* [飞桨FastTokenizer](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/experimental): + +```python +from paddlenlp.experimental import FastTokenizer + +fast_tokenizer = FastTokenizer.from_pretrained("bert-base-chinese") + +``` + + +## 环境依赖 + +* paddlepaddle >= 2.2.1 +* paddlenlp >= 2.2 +* transformers == 4.11.3 +* tokenizers == 0.10.3 +* tensorflow_text == 2.5.0 + + +```shell +pip install -r requirements.txt +``` + +## 运行 + +```shell +python perf.py +``` + +- 测试环境: + + * CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz,物理核数40 + * GPU: CUDA 10.2, CuDNN 7.6.5, 16G + +- 测试结果: + + +
+*(Figure: tokenization throughput of FastTokenizer and the other tokenizers at a fixed text length, across different batch sizes)*
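+
+A minimal sketch for reproducing a few of these data points, assuming the `--batch_size` / `--max_seq_length` flags defined in `perf.py` and the `OMP_NUM_THREADS` / `RAYON_RS_NUM_CPUS` environment variables used in `run_all_perf.sh`:
+
+```shell
+# Sketch only: sweep a few batch sizes at a fixed sequence length.
+# OMP_NUM_THREADS / RAYON_RS_NUM_CPUS bound the CPU threads, as run_all_perf.sh does.
+export OMP_NUM_THREADS=16
+export RAYON_RS_NUM_CPUS=16
+for batch_size in 1 8 64; do
+    python perf.py --batch_size $batch_size --max_seq_length 128
+done
+```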
+ +飞桨FastTokenizer与其他框架性能的对比,是在固定文本长度在不同batch size下的分词吞吐量。纵坐标是对数坐标,单位是1w tokens/秒。随着batch size的增大,飞桨FastTokenizer速度会远远超过其他同类产品的实现,尤其是在大batch文本上飞桨框架能充分发挥多核机器的优势,取得领先的速度。 diff --git a/fast_tokenizer/perf/perf.py b/fast_tokenizer/perf/perf.py new file mode 100644 index 0000000000000000000000000000000000000000..0b40060b2c71970b71209cf149e744f997d9f2fb --- /dev/null +++ b/fast_tokenizer/perf/perf.py @@ -0,0 +1,134 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import time + +import tensorflow as tf +import tensorflow_text as tf_text +from transformers import AutoTokenizer + +from paddlenlp.experimental import FastTokenizer, to_tensor +from paddlenlp.transformers import BertTokenizer + +parser = argparse.ArgumentParser() + +# yapf: disable +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size for tokenization.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of tokenization epochs to perform.") +parser.add_argument("--num_samples", default=100, type=int, help="The number of samples to be tokenized") +# yapf: enable +args = parser.parse_args() + +max_seq_length = args.max_seq_length +batch_size = args.batch_size +epochs = args.epochs +num_samples = args.num_samples +total_tokens = epochs * num_samples * max_seq_length + +text = ( + "在世界几大古代文明中,中华文明源远流长、从未中断,至今仍充满蓬勃生机与旺盛生命力,这在人类历史上是了不起的奇迹。" + "本固根深、一脉相承的历史文化是铸就这一奇迹的重要基础。先秦时期是中华文化的创生期,奠定了此后几千年中华文化发展的" + "基础。考古发现证实,早期中华文明的形成经历了从“满天星斗”到“月明星稀”再到“多元一体”的过程。在这个过程中,不同地域、" + "不同人群的文化交流交融,中华民族最早的大家庭逐渐成形,国家由此诞生,“大同”社会理想和“天下为公,选贤与能,讲信修睦”" + "的价值追求逐渐深入人心。在早期国家形成过程中,我们的先人积累了初步的国家治理经验,包括经济、政治、军事、法律、文化" + "等各个方面,最终以典章、思想的形式进行总结和传承。流传至今的夏商西周国家治理经验、春秋战国诸子百家思想,是先秦时期" + "历史文化的集中反映。秦汉至宋元时期是中华文化的发展期,中华传统文化在这个时期走向成熟并迈向新的高峰。中央集权制度的" + "形成、郡县制度的推广、官僚制度的健全,推动中国传统社会形成国家治理的基本形态,为中国传统社会的长期延续和发展提供了" + "坚实的制度和文化支撑,贯穿其中的价值主线是对“大一统”的坚定追求。与此同时,民为邦本的民本思想、以文化人的文治主张、" + "协和万邦的天下观等,也在实践中得到丰富和完善。在追求“大一统”的历史中,民族精神世代相传,民族英雄史不绝书。" +) + +data = [text[:max_seq_length]] * num_samples + +# BERT Tokenizer using PaddleNLP FastTokenizer +pp_tokenizer = FastTokenizer.from_pretrained("bert-base-chinese") + +batches = [to_tensor(data[idx : idx + batch_size]) for idx in range(0, len(data), batch_size)] + +for batch_data in batches: + input_ids, token_type_ids = pp_tokenizer(text=batch_data, max_seq_len=max_seq_length) + +start = time.time() +for _ in range(epochs): + for batch_data in batches: + input_ids, token_type_ids = pp_tokenizer(batch_data, max_seq_len=max_seq_length) +end = time.time() + +print("The throughput of paddle FastTokenizer: {:,.2f} tokens/s".format((total_tokens / (end - start)))) + +hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True) + +batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)] + +for batch_data in batches: + encoded_inputs = 
hf_tokenizer(batch_data) + +# BERT Tokenizer using HuggingFace AutoTokenizer +start = time.time() +for _ in range(epochs): + for batch_data in batches: + encoded_inputs = hf_tokenizer(batch_data) # , padding=True, truncation=True) +end = time.time() +print("The throughput of huggingface FastTokenizer: {:,.2f} tokens/s".format((total_tokens / (end - start)))) + +# BERT Tokenizer using PaddleNLP BertTokenizer +py_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese") +for batch_data in batches: + encoded_inputs = py_tokenizer(batch_data) + +start = time.time() +for _ in range(epochs): + for batch_data in batches: + encoded_inputs = py_tokenizer(batch_data) +end = time.time() +print("The throughput of paddle BertTokenizer: {:,.2f} tokens/s".format((total_tokens / (end - start)))) + +# BERT Tokenizer using HuggingFace AutoTokenizer +hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False) + +for batch_data in batches: + encoded_inputs = hf_tokenizer(batch_data) + +start = time.time() +for _ in range(epochs): + for batch_data in batches: + encoded_inputs = hf_tokenizer(batch_data) # , padding=True, truncation=True) +end = time.time() +print("The throughput of huggingface python tokenizer: {:,.2f} tokens/s".format((total_tokens / (end - start)))) + +# BERT Tokenizer using TensorFlow Text +vocab_list = list(py_tokenizer.vocab.token_to_idx.keys()) +lookup_table = tf.lookup.StaticVocabularyTable( + tf.lookup.KeyValueTensorInitializer( + keys=vocab_list, + key_dtype=tf.string, + values=tf.range(tf.size(vocab_list, out_type=tf.int64), dtype=tf.int64), + value_dtype=tf.int64, + ), + num_oov_buckets=1, +) + +tf_tokenizer = tf_text.BertTokenizer(lookup_table) + +for batch_data in batches: + input_ids = tf_tokenizer.tokenize(batch_data) + +start = time.time() +for _ in range(epochs): + for batch_data in batches: + input_ids = tf_tokenizer.tokenize(batch_data) +end = time.time() +print("The throughput of TensorFlow Text BertTokenizer: {:,.2f} tokens/s".format((total_tokens / (end - start)))) diff --git a/fast_tokenizer/perf/requirements.txt b/fast_tokenizer/perf/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..0a26c4d03ac5d967e6f7976deb52b0468a6ae3a4 --- /dev/null +++ b/fast_tokenizer/perf/requirements.txt @@ -0,0 +1,3 @@ +paddlenlp>=2.2.0 +transformer==4.11.3 +tensorflow_text==2.5.0 \ No newline at end of file diff --git a/fast_tokenizer/perf/run_all_perf.sh b/fast_tokenizer/perf/run_all_perf.sh new file mode 100644 index 0000000000000000000000000000000000000000..052de2a717cde719812a5618fe5d77a5d92835e3 --- /dev/null +++ b/fast_tokenizer/perf/run_all_perf.sh @@ -0,0 +1,27 @@ +# !/bin/sh + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +for seq_len in 32 64 128 256 512; do +for batch_size in 1 2 4 8 16 32 64; do +mkdir -p seq_len_$seq_len/batch_size_$batch_size +for thread_num in 1 2 4 8 16 32 64; do +echo "Experiment setting: thread_num=$thread_num, batch_size=$batch_size, sequence_length=$seq_len" +export OMP_NUM_THREADS=$thread_num +export RAYON_RS_NUM_CPUS=$thread_num +python perf.py --batch_size $batch_size --max_seq_length $seq_len >seq_len_$seq_len/batch_size_$batch_size/parallel$thread_num.log 2>nohup.out +done +done +done \ No newline at end of file diff --git a/fast_tokenizer/python/CMakeLists.txt b/fast_tokenizer/python/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..ab0fe0065e3b844179ea2feef38ab862b9a85c71 --- /dev/null +++ b/fast_tokenizer/python/CMakeLists.txt @@ -0,0 +1,31 @@ +# 1. Copy the python tokenizers directory to the binary directory +add_custom_target(copy_python_tokenizers ALL + COMMAND ${CMAKE_COMMAND} -E copy_directory ${CMAKE_SOURCE_DIR}/python ${CMAKE_BINARY_DIR}/python + DEPENDS core_tokenizers) + + +# 2. Copy setup.py +add_custom_target(copy_setup ALL + COMMAND ${CMAKE_COMMAND} -E copy ${CMAKE_SOURCE_DIR}/setup.py ${CMAKE_BINARY_DIR}/setup.py) + +# 3. Copy the core_tokenizers.so to python tokenizers directory +set(TOKENIZER_CORE_NAME "core_tokenizers") +set(TOKENIZER_DST_DIR ${CMAKE_BINARY_DIR}/python/fast_tokenizer) +set(TOKENIZER_SRC_DIR ${CMAKE_BINARY_DIR}/fast_tokenizer) +set(TOKENIZER_THIRD_PARTY_DIR ${CMAKE_BINARY_DIR}/third_party) + +IF(WIN32) +set(ICU_DLL_DIR ${TOKENIZER_THIRD_PARTY_DIR}/icu/src/extern_icu/icu4c/bin64) +add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_SRC_DIR}/${TOKENIZER_CORE_NAME}.pyd ${TOKENIZER_SRC_DIR}/${TOKENIZER_CORE_NAME}.lib ${TOKENIZER_DST_DIR} + COMMAND ${CMAKE_COMMAND} -E copy ${ICU_DLL_DIR}/icudt70.dll ${ICU_DLL_DIR}/icuuc70.dll ${TOKENIZER_DST_DIR}/libs + DEPENDS copy_python_tokenizers) +ELSE(WIN32) +add_custom_target(copy_shared_library ALL + COMMAND ${CMAKE_COMMAND} -E copy ${TOKENIZER_SRC_DIR}/${TOKENIZER_CORE_NAME}.so ${TOKENIZER_DST_DIR} + DEPENDS copy_python_tokenizers) +ENDIF() + +add_custom_target(create_commit_id_file ALL + COMMAND ${GIT_EXECUTABLE} log -1 --format=%H > ${TOKENIZER_DST_DIR}/commit.log + DEPENDS copy_python_tokenizers) diff --git a/fast_tokenizer/python/fast_tokenizer/__init__.py b/fast_tokenizer/python/fast_tokenizer/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..ea7b5a9570f456153e7fb030165b377f04e26987 --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/__init__.py @@ -0,0 +1,63 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +__version__ = "1.0.2" + +import os +import platform +import sys +from typing import Dict, List, Tuple, Union + +try: + current_path = os.path.abspath(os.path.dirname(__file__)) + if os.name == "nt": + third_lib_path = current_path + os.sep + "libs" + # Will load shared library from 'path' on windows + os.environ["path"] = current_path + ";" + third_lib_path + ";" + os.environ["path"] + sys.path.insert(0, third_lib_path) + # Note: from python3.8, PATH will not take effect + # https://github.com/python/cpython/pull/12302 + # Use add_dll_directory to specify dll resolution path + if sys.version_info[:2] >= (3, 8): + os.add_dll_directory(third_lib_path) +except ImportError as e: + if os.name == "nt": + executable_path = os.path.abspath(os.path.dirname(sys.executable)) + raise ImportError( + """NOTE: You may need to run \"set PATH=%s;%%PATH%%\" + if you encounters \"DLL load failed\" errors. If you have python + installed in other directory, replace \"%s\" with your own + directory. The original error is: \n %s""" + % (executable_path, executable_path, str(e)) + ) + else: + raise ImportError( + """NOTE: You may need to run \"export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH\" + if you encounters \"libmkldnn.so not found\" errors. If you have python + installed in other directory, replace \"/usr/local/lib\" with your own + directory. The original error is: \n""" + + str(e) + ) +except Exception as e: + raise e + +from . import core_tokenizers as C +from .c_wrap import * + +from . import decoders, models, normalizers, postprocessors, pretokenizers +from .tokenizers_impl import ( + ClipFastTokenizer, + ErnieFastTokenizer, + SentencePieceBPEFastTokenizer, +) diff --git a/fast_tokenizer/python/fast_tokenizer/c_wrap.py b/fast_tokenizer/python/fast_tokenizer/c_wrap.py new file mode 100644 index 0000000000000000000000000000000000000000..9645afceab8a2c4f91f550d087252b0df7ed9f7b --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/c_wrap.py @@ -0,0 +1,505 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Dict, List, Tuple, Union + +from . 
import core_tokenizers as C + +TextInputSequence = str +PreTokenizedInputSequence = Union[List[str], Tuple[str]] + +TextEncodeInput = Union[ + TextInputSequence, + Tuple[TextInputSequence, TextInputSequence], + List[TextInputSequence], +] + +PreTokenizedEncodeInput = Union[ + PreTokenizedInputSequence, + Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence], + List[PreTokenizedInputSequence], +] + +InputSequence = Union[TextInputSequence, PreTokenizedInputSequence] + +EncodeInput = Union[TextEncodeInput, PreTokenizedEncodeInput] + + +class OffsetType: + CHAR = C.OffsetType.CHAR + BYTE = C.OffsetType.BYTE + + +class Direction: + LEFT = C.Direction.LEFT + RIGHT = C.Direction.RIGHT + + +class TruncStrategy: + LONGEST_FIRST = C.TruncStrategy.LONGEST_FIRST + ONLY_FIRST = C.TruncStrategy.ONLY_FIRST + ONLY_SECOND = C.TruncStrategy.ONLY_SECOND + + +class PadStrategy: + BATCH_LONGEST = C.PadStrategy.BATCH_LONGEST + FIXED_SIZE = C.PadStrategy.FIXED_SIZE + + +class SplitMode: + REMOVED = C.SplitMode.REMOVED + ISOLATED = C.SplitMode.ISOLATED + MERGED_WITH_NEXT = C.SplitMode.MERGED_WITH_NEXT + MERGED_WITH_PREVIOUS = C.SplitMode.MERGED_WITH_PREVIOUS + CONTIGUOUS = C.SplitMode.CONTIGUOUS + + +class Token: + def __init__(self): + self._token = C.Token() + + @property + def id(self): + return self._token.id + + @id.setter + def id(self, id: int): + self._token.id = id + + @property + def value(self): + return self._token.value + + @value.setter + def value(self, value: str): + self._token.value = value + + @property + def offset(self): + return self._token.offset + + @offset.setter + def offset(self, offset: Tuple[int, int]): + self._token.offset = offset + + def __repr__(self): + return self._token.__repr__() + + +class PadMethod: + def __init__(self): + self._pad_method = C.PadMethod() + + @property + def strategy(self): + return self._pad_method.strategy + + @strategy.setter + def strategy(self, strategy: str): + """Set the strategy of PadMethod. + :param strategy: (str) The strategy of PadMethod, 'batch_longest' and 'fixed_size' are valid + :return None + """ + self._pad_method.strategy = getattr(PadStrategy, strategy.upper()) + + @property + def direction(self): + return self._pad_method.direction + + @direction.setter + def direction(self, direction: str): + """Set the direction of PadMethod. 
+ :param strategy: (str) The direction of PadMethod, 'left' and 'right' are valid + :return None + """ + self._pad_method.direction = getattr(Direction, direction.upper()) + + @property + def pad_id(self): + return self._pad_method.pad_id + + @pad_id.setter + def pad_id(self, pad_id: int): + self._pad_method.pad_id = pad_id + + @property + def pad_token_type_id(self): + return self._pad_method.pad_token_type_id + + @pad_token_type_id.setter + def pad_token_type_id(self, pad_token_type_id: int): + self._pad_method.pad_token_type_id = pad_token_type_id + + @property + def pad_token(self): + return self._pad_method.pad_token + + @pad_token.setter + def pad_token(self, pad_token: str): + self._pad_method.pad_token = pad_token + + @property + def pad_len(self): + return self._pad_method.pad_len + + @pad_len.setter + def pad_len(self, pad_len: int): + self._pad_method.pad_len = pad_len + + @property + def pad_to_multiple_of(self): + return self._pad_method.pad_to_multiple_of + + @pad_to_multiple_of.setter + def pad_to_multiple_of(self, pad_to_multiple_of): + self._pad_method.pad_to_multiple_of = pad_to_multiple_of + + +class TruncMethod: + def __init__(self): + self._trunc_method = C.TruncMethod() + + @property + def max_len(self): + return self._trunc_method.max_len + + @max_len.setter + def max_len(self, max_len: int): + self._trunc_method.max_len = max_len + + @property + def strategy(self): + return self._trunc_method.strategy + + @strategy.setter + def strategy(self, strategy: str): + """Set the strategy of TruncMethod. + :param strategy: (str) The strategy of PadMethod, 'longest_first', 'only_first' and 'only_second' are valid + :return None + """ + self._trunc_method.strategy = getattr(TruncStrategy, strategy.upper()) + + @property + def direction(self): + return self._trunc_method.direction + + @direction.setter + def direction(self, direction: str): + """Set the direction of TruncMethod. 
+ :param strategy: (str) The direction of TruncMethod, 'left' and 'right' are valid + :return None + """ + self._trunc_method.direction = getattr(Direction, direction.upper()) + + @property + def stride(self): + return self._trunc_method.stride + + @stride.setter + def stride(self, stride: int): + self._trunc_method.stride = stride + + +class AddedToken: + def __init__(self, content="", single_word=False, lstrip=False, rstrip=False, normalized=True): + self._added_token = C.AddedToken(content, single_word, lstrip, rstrip, normalized) + + @property + def content(self): + return self._added_token.content + + @property + def get_is_special(self): + return self._added_token.get_is_special + + @property + def normalized(self): + return self._added_token.normalized + + @property + def lstrip(self): + return self._added_token.lstrip + + @property + def rstrip(self): + return self._added_token.rstrip + + @property + def single_word(self): + return self._added_token.single_word + + def __eq__(self, other): + return self._added_token == other._added_token + + +class Encoding: + def __init__( + self, + ids: List[int], + type_ids: List[int], + tokens: List[str], + words_idx: List[int], + offsets: List[Tuple[int, int]], + special_tokens_mask: List[int], + attention_mask: List[int], + overflowing: List, + sequence_ranges: Dict[str, Tuple[int, int]], + ): + self._encoding = C.Encoding( + ids, + type_ids, + tokens, + words_idx, + offsets, + special_tokens_mask, + attention_mask, + overflowing, + sequence_ranges, + ) + + def __str__(self): + return str(self._encoding) + + def __repr__(self): + return self._encoding.__repr__() + + def __len__(self): + return len(self._encoding) + + @property + def n_sequences(self): + return self._encoding.n_sequences + + @property + def tokens(self): + return self._encoding.tokens + + @property + def word_ids(self): + return self._encoding.word_ids + + @property + def sequence_ids(self): + return self._encoding.sequence_ids + + @property + def ids(self): + return self._encoding.ids + + @property + def type_ids(self): + return self._encoding.type_ids + + @property + def offsets(self): + return self._encoding.offsets + + @property + def special_tokens_mask(self): + return self._encoding.special_tokens_mask + + @property + def attention_mask(self): + return self._encoding.attention_mask + + @property + def overflowing(self): + return self._encoding.overflowing + + def set_sequence_ids(self, sequence_id: int): + return self._encoding.set_sequence_ids(sequence_id) + + def char_to_token(self, char_pos, sequence_index: int = 0): + return self._encoding.char_to_token(char_pos, sequence_index) + + @staticmethod + def merge(encodings: List, growing_offsets: bool = True): + return C.Encoding.merge(encodings, growing_offsets) + + def token_to_chars(self, token_index: int): + return self._encoding.token_to_chars(token_index) + + def token_to_sequence(self, token_index: int): + return self._encoding.token_to_sequence(token_index) + + def token_to_word(self, token_index: int): + return self._encoding.token_to_word(token_index) + + def word_to_chars(self, word_index: int, sequence_index: int = 0): + return self._encoding.word_to_chars(word_index, sequence_index) + + def word_to_tokens(self, word_index: int, sequence_index: int = 0): + return self._encoding.word_to_tokens(word_index, sequence_index) + + def truncate(self, max_length: int, stride: int = 0, direction: str = "right"): + return self._encoding.truncate(max_length, stride, direction) + + def pad( + self, length: int, direction: 
str = "right", pad_id: int = 0, pad_type_id: int = 0, pad_token: str = "[PAD]" + ): + return self._encoding.pad(length, direction, pad_id, pad_type_id, pad_token) + + +class Tokenizer: + def __init__(self, model): + self._tokenizer = None + if model is not None: + self._tokenizer = C.Tokenizer(model._model) + + @property + def normalizer(self): + return self._tokenizer.normalizer + + @normalizer.setter + def normalizer(self, normalizer): + self._tokenizer.normalizer = normalizer._normalizer + + @property + def pretokenizer(self): + return self._tokenizer.pretokenizer + + @pretokenizer.setter + def pretokenizer(self, pretokenizer): + self._tokenizer.pretokenizer = pretokenizer._pretokenizer + + @property + def model(self): + return self._tokenizer.model + + @model.setter + def model(self, model): + self._tokenizer.model = model._model + + @property + def postprocessor(self): + return self._tokenizer.postprocessor + + @postprocessor.setter + def postprocessor(self, postprocessor): + self._tokenizer.postprocessor = postprocessor._postprocessor + + @property + def decoder(self): + return self._tokenizer.decoder + + @decoder.setter + def decoder(self, decoder): + self._tokenizer.decoder = decoder._decoder + + @property + def padding(self): + return self._tokenizer.padding + + @property + def truncation(self): + return self._tokenizer.truncation + + def add_special_tokens(self, tokens: List[str]): + return self._tokenizer.add_special_tokens(tokens) + + def add_tokens(self, tokens: List[str]): + return self._tokenizer.add_tokens(tokens) + + def enable_padding( + self, + direction: str = "right", + pad_id: int = 0, + pad_type_id: int = 0, + pad_token: str = "[PAD]", + length: int = None, + pad_to_multiple_of: int = None, + ): + return self._tokenizer.enable_padding(direction, pad_id, pad_type_id, pad_token, length, pad_to_multiple_of) + + def disable_padding(self): + return self._tokenizer.disable_padding() + + def enable_truncation( + self, max_length: int, stride: int = 0, strategy: str = "longest_first", direction: str = "right" + ): + return self._tokenizer.enable_truncation(max_length, stride, strategy, direction) + + def disable_truncation(self): + return self._tokenizer.disable_truncation() + + def get_vocab(self, with_added_vocabulary: bool = True): + return self._tokenizer.get_vocab(with_added_vocabulary) + + def get_vocab_size(self, with_added_vocabulary: bool = True): + return self._tokenizer.get_vocab_size(with_added_vocabulary) + + def encode( + self, + sequence: InputSequence, + pair: InputSequence = None, + is_pretokenized: bool = False, + add_special_tokens: bool = True, + ): + return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens) + + def encode_batch( + self, + input: Union[List[EncodeInput], Tuple[EncodeInput]], + add_special_tokens: bool = True, + is_pretokenized: bool = False, + ): + return self._tokenizer.encode_batch(input, add_special_tokens, is_pretokenized) + + def decode(self, ids: List[int], skip_special_tokens: bool = True): + return self._tokenizer.decode(ids, skip_special_tokens) + + def decode_batch(self, sequences: List[List[int]], skip_special_tokens: bool = True): + return self._tokenizer.decode_batch(sequences, skip_special_tokens) + + def id_to_token(self, id: int): + return self._tokenizer.id_to_token(id) + + def token_to_id(self, token: str): + return self._tokenizer.token_to_id(token) + + def num_special_tokens_to_add(self, is_pair: bool = True): + return self._tokenizer.num_special_tokens_to_add(is_pair) + + def save(self, path: 
str, pretty: bool = True): + return self._tokenizer.save(path, pretty) + + def to_str(self, pretty: bool = True): + return self._tokenizer.to_str(pretty) + + @staticmethod + def from_str(json: str): + tr = Tokenizer(None) + tr._tokenizer = C.Tokenizer.from_str(json) + return tr + + @staticmethod + def from_file(json: str): + tr = Tokenizer(None) + tr._tokenizer = C.Tokenizer.from_file(json) + return tr + + +def set_thread_num(thread_num): + """Set the number of threads for accelerating batch tokenization + :param thread_num: (int) The number of threads + :return None + """ + C.set_thread_num(thread_num) + + +def get_thread_num(): + """Get the number of tokenization threads + :return int + """ + return C.get_thread_num() diff --git a/fast_tokenizer/python/fast_tokenizer/decoders/__init__.py b/fast_tokenizer/python/fast_tokenizer/decoders/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..142e5af3345a553aa9dda2fecdea5310e330226f --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/decoders/__init__.py @@ -0,0 +1,28 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import ABC +from typing import List + +from .. import C + + +class Decoder(ABC): + def decode(self, tokens: List[str]): + return self._decoder.decode(tokens) + + +class WordPiece(Decoder): + def __init__(self, prefix: str = "##", cleanup: bool = True): + self._decoder = C.decoders.WordPiece(prefix, cleanup) diff --git a/fast_tokenizer/python/fast_tokenizer/libs/__init__.py b/fast_tokenizer/python/fast_tokenizer/libs/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..ac566ed8eb61b16f39ab0c71ab928e0d0d76284a --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/libs/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# used for setup.py.in to store the thirdparty shared libraries diff --git a/fast_tokenizer/python/fast_tokenizer/models/__init__.py b/fast_tokenizer/python/fast_tokenizer/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..567944e4d2b738edaeaaf7448b6823d72f53eabd --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/models/__init__.py @@ -0,0 +1,169 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Tuple, Union, List, Dict +from abc import ABC + +from .. import C + + +class Model(ABC): + def tokenize(self, tokens: List[str]): + return self._model.tokenize(tokens) + + def token_to_id(self, token: str): + return self._model.token_to_id(token) + + def id_to_token(self, id: int): + return self._model.id_to_token(id) + + def get_vocab(self): + return self._model.get_vocab() + + def get_vocab_size(self): + return self._model.get_vocab_size() + + def save(self, folder: str, prefix: str = None): + return self._model.save(folder, prefix) + + +class WordPiece(Model): + def __init__( + self, + vocab: Dict[str, int], + unk_token: str = "[UNK]", + max_input_chars_per_word: int = 100, + continuing_subword_prefix: str = "##", + handle_chinese_chars: bool = True, + ): + self._model = None + if vocab is not None: + self._model = C.models.WordPiece( + vocab, unk_token, max_input_chars_per_word, continuing_subword_prefix, handle_chinese_chars + ) + + @staticmethod + def read_file(vocab: str): + """Read a vocab.txt file + + :params vocab: (str) The path to a vocab.txt file + :return: Dict[str, int], The vocabulary as a dict + """ + return C.models.WordPiece.read_file(vocab) + + @staticmethod + def from_file( + vocab: str, + unk_token: str = "[UNK]", + max_input_chars_per_word: int = 100, + continuing_subword_prefix: str = "continuing_subword_prefix", + ): + """Load a WordPiece instance from vocab file. + + :param vocab: (str) The path to a vocab.txt file + :param unk_token: (str) The unknown token + :param max_input_chars_per_word: (int) The max number of char when tokenize a word + :param continuing_subword_prefix: (str) The latter subword prefix. + :return: An instance of WordPiece. + """ + wp = WordPiece(None) + wp._model = C.models.WordPiece.from_file(vocab, unk_token, max_input_chars_per_word, continuing_subword_prefix) + return wp + + +class FastWordPiece(Model): + def __init__( + self, + vocab: Dict[str, int], + unk_token: str = "[UNK]", + max_input_chars_per_word: int = 100, + continuing_subword_prefix: str = "##", + with_pretokenization: bool = False, + ): + self._model = None + if vocab is not None: + self._model = C.models.FastWordPiece( + vocab, unk_token, max_input_chars_per_word, continuing_subword_prefix, with_pretokenization + ) + + @staticmethod + def read_file(vocab: str): + """Read a vocab.txt file + + :params vocab: (str) The path to a vocab.txt file + :return: Dict[str, int], The vocabulary as a dict + """ + return C.models.FastWordPiece.read_file(vocab) + + @staticmethod + def from_file( + vocab: str, + unk_token: str = "[UNK]", + max_input_chars_per_word: int = 100, + continuing_subword_prefix: str = "continuing_subword_prefix", + with_pretokenization: bool = False, + ): + """Load a FastWordPiece instance from vocab file. + + :param vocab: (str) The path to a vocab.txt file + :param unk_token: (str) The unknown token + :param max_input_chars_per_word: (int) The max number of char when tokenize a word + :param continuing_subword_prefix: (str) The latter subword prefix. 
+ :param with_pretokenization: (bool) Whether to pretokenize sequence during the wordpiece tokenization. + If it's true, the end to end tokenization would be faster. + :return: An instance of FastWordPiece. + """ + wp = FastWordPiece(None) + wp._model = C.models.FastWordPiece.from_file( + vocab, unk_token, max_input_chars_per_word, continuing_subword_prefix, with_pretokenization + ) + return wp + + +class BPE(Model): + def __init__( + self, + vocab: Dict[str, int] = None, + merges: List[Tuple[str, str]] = None, + cache_capacity: int = None, + dropout: float = None, + unk_token: str = None, + continuing_subword_prefix: str = None, + end_of_word_suffix: str = None, + fuse_unk: bool = None, + ): + self._model = C.models.BPE( + vocab, merges, cache_capacity, dropout, unk_token, continuing_subword_prefix, end_of_word_suffix, fuse_unk + ) + + @staticmethod + def read_file(vocab: str, merges: str): + return C.models.BPE.read_file(vocab, merges) + + @staticmethod + def from_file(vocab: str, merges: str, **kwargs): + bpe = BPE() + bpe._model = C.models.BPE.from_file(vocab, merges, **kwargs) + return bpe + + +class Unigram(Model): + def __init__(self, vocab: List[Tuple[str, float]] = None, unk_id: int = None): + self._model = C.models.Unigram(vocab, unk_id) + + def set_filter_token(self, filter_token: str = ""): + return self._model.set_filter_token(filter_token) + + def set_split_rule(self, split_rule: str = ""): + return self._model.set_split_rule(split_rule) diff --git a/fast_tokenizer/python/fast_tokenizer/normalizers/__init__.py b/fast_tokenizer/python/fast_tokenizer/normalizers/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6e8fa6e45b2d4515aa96b3159b7fd252d375c7e3 --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/normalizers/__init__.py @@ -0,0 +1,103 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import ABC + +from .. 
import C + + +class NormalizedString: + def __init__(self, raw_str: str): + self._normalized = C.normalizers.NormalizedString(raw_str) + + def __str__(self): + return str(self._normalized) + + +class Normalizer(ABC): + def normalize_str(self, sequence: str): + return self._normalizer.normalize_str(sequence) + + def __call__(self, normalized: NormalizedString): + return self._normalizer(normalized._normalized) + + def __getstate__(self): + return self._normalizer.__getstate__() + + +class BertNormalizer(Normalizer): + def __init__( + self, + clean_text: bool = True, + handle_chinese_chars: bool = True, + strip_accents: bool = True, + lowercase: bool = True, + ): + self._normalizer = C.normalizers.BertNormalizer(clean_text, handle_chinese_chars, strip_accents, lowercase) + + +class ReplaceNormalizer(Normalizer): + def __init__(self, pattern: str, content: str): + self._normalizer = C.normalizers.ReplaceNormalizer(pattern, content) + + +class StripNormalizer(Normalizer): + def __init__(self, left: bool = True, right: bool = True): + self._normalizer = C.normalizers.StripNormalizer(left, right) + + +class StripAccentsNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.StripAccentsNormalizer() + + +class NFCNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.NFCNormalizer() + + +class NFDNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.NFDNormalizer() + + +class NFKCNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.NFKCNormalizer() + + +class NFKDNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.NFKDNormalizer() + + +class NmtNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.NmtNormalizer() + + +class LowercaseNormalizer(Normalizer): + def __init__(self): + self._normalizer = C.normalizers.LowercaseNormalizer() + + +class SequenceNormalizer(Normalizer): + def __init__(self, normalizer_list=[]): + normalizer_list = [normalizer._normalizer for normalizer in normalizer_list] + self._normalizer = C.normalizers.SequenceNormalizer(normalizer_list) + + +class PrecompiledNormalizer(Normalizer): + def __init__(self, precompiled_charsmap: str): + self._normalizer = C.normalizers.PrecompiledNormalizer(precompiled_charsmap) diff --git a/fast_tokenizer/python/fast_tokenizer/postprocessors/__init__.py b/fast_tokenizer/python/fast_tokenizer/postprocessors/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..496c6413b76b0fe7e68d1473e10f0bd135554adb --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/postprocessors/__init__.py @@ -0,0 +1,54 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import ABC +from typing import List, Tuple, Union + +from .. 
import C, Encoding + + +class PostProcessor(ABC): + def num_special_tokens_to_add(self, is_pair: bool = True): + return self._postprocessor.num_special_tokens_to_add(is_pair) + + def __call__(self, encoding: Encoding, pair_encoding: Encoding, add_special_tokens: bool): + return self._postprocessor(encoding, pair_encoding, add_special_tokens) + + +class BertPostProcessor(PostProcessor): + def __init__(self, sep: Tuple[str, int] = ("[SEP]", 102), cls: Tuple[str, int] = ("[CLS]", 101)): + self._postprocessor = C.postprocessors.BertPostProcessor(sep, cls) + + +class RobertaPostProcessor(PostProcessor): + def __init__( + self, + sep: Tuple[str, int] = ("", 2), + cls: Tuple[str, int] = ("", 0), + trim_offsets: bool = True, + add_prefix_space: bool = True, + ): + self._postprocessor = C.postprocessors.RobertaPostProcessor(sep, cls, trim_offsets, add_prefix_space) + + +class ByteLevelPostProcessor(PostProcessor): + def __init__(self, add_prefix_space: bool = True, trim_offsets: bool = True, use_regex: bool = True): + self._postprocessor = C.postprocessors.ByteLevelPostProcessor(add_prefix_space, trim_offsets, use_regex) + + +class TemplatePostProcessor(PostProcessor): + def __init__( + self, single: Union[str, List[str]], pair: Union[str, List[str]], special_tokens: List[Tuple[str, int]] + ): + self._postprocessor = C.postprocessors.TemplatePostProcessor(single, pair, special_tokens) diff --git a/fast_tokenizer/python/fast_tokenizer/pretokenizers/__init__.py b/fast_tokenizer/python/fast_tokenizer/pretokenizers/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6458f579ea1eac1a4f30ea7d3cc3209f43476276 --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/pretokenizers/__init__.py @@ -0,0 +1,113 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import ABC +from typing import Dict, List, Tuple, Union + +from .. 
import C, OffsetType, Token +from ..normalizers import NormalizedString + + +class StringSplit: + def __init__(self, nomalized_text: NormalizedString, tokens: List[Token]): + tokens = [token._token for token in tokens] + self._string_split = C.pretokenizers.StringSplit(nomalized_text._normalized, tokens) + + @property + def normalized(self): + return NormalizedString(self._string_split.normalized) + + @normalized.setter + def normalized(self, normalized: NormalizedString): + self._string_split.normalized = normalized._normalized + + @property + def tokens(self): + return self._string_split.tokens + + @tokens.setter + def tokens(self, tokens: List[Token]): + self._string_split.tokens = [token._token for token in tokens] + + +class PreTokenizedString: + def __init__(self, text: str): + self._pretokenized = C.pretokenizers.PreTokenizedString(text) + + def get_string_split(self, idx: int): + return self._pretokenized.get_string_split(idx) + + def get_string_splits_size(self): + return self._pretokenized.get_string_splits_size() + + def get_original_text(self): + return self._pretokenized.get_original_text() + + def get_splits(self, offset_referential: str = "original", offset_type: str = "char"): + """ + param offset_referential: "original" or "normalized" + param offset_type: "char" or "byte" + """ + return self._pretokenized.get_splits(offset_referential, offset_type) + + def to_encoding(self, word_idx: List[int], type_id: int, offset_type): + return self._pretokenized.to_encoding(word_idx, type_id, offset_type) + + +class PreTokenizer(ABC): + def __call__(self, pretokenized: PreTokenizedString): + return self._pretokenizer(pretokenized._pretokenized) + + def pretokenize_str(self, pretokenized_str: str): + pretokenized = PreTokenizedString(pretokenized_str) + self(pretokenized) + splits = pretokenized.get_splits() + result = [(s, offset) for s, offset, tokens in splits] + return result + + +class WhitespacePreTokenizer(PreTokenizer): + def __init__(self): + self._pretokenizer = C.pretokenizers.WhitespacePreTokenizer() + + +class WhitespaceAndPunctuationPreTokenizer(PreTokenizer): + def __init__(self): + self._pretokenizer = C.pretokenizers.WhitespaceAndPunctuationPreTokenizer() + + +class BertPreTokenizer(PreTokenizer): + def __init__(self): + self._pretokenizer = C.pretokenizers.BertPreTokenizer() + + +class MetaSpacePreTokenizer(PreTokenizer): + def __init__(self, replacement: str = "_", add_prefix_space: bool = True): + self._pretokenizer = C.pretokenizers.MetaSpacePreTokenizer(replacement, add_prefix_space) + + +class SequencePreTokenizer(PreTokenizer): + def __init__(self, pretokenizers: List): + pretokenizers = [pretokenizer._pretokenizer for pretokenizer in pretokenizers] + self._pretokenizer = C.pretokenizers.SequencePreTokenizer(pretokenizers) + + +class ByteLevelPreTokenizer(PreTokenizer): + def __init__(self, add_prefix_space: bool = True, use_regex: bool = True): + self._pretokenizer = C.pretokenizers.ByteLevelPreTokenizer(add_prefix_space, use_regex) + + +class SplitPreTokenizer(PreTokenizer): + def __init__(self, pattern: str, split_mode: int, invert: bool = True): + self._pretokenizer = C.pretokenizers.SplitPreTokenizer(pattern, C.SplitMode(split_mode), invert) diff --git a/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/__init__.py b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cfb9bfb4859b7d86df0f2aa600537cd901597d4a --- /dev/null +++ 
b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/__init__.py @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .base_tokenizer import BaseFastTokenizer +from .ernie import ErnieFastTokenizer +from .sentencepiece_bpe import SentencePieceBPEFastTokenizer +from .clip import ClipFastTokenizer diff --git a/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/base_tokenizer.py b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/base_tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..c6c5631d53b200ddb3dfbe0d10e2d28728e08ebe --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/base_tokenizer.py @@ -0,0 +1,169 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from fast_tokenizer import Tokenizer + +__all__ = ["BaseFastTokenizer"] + + +class BaseFastTokenizer: + def __init__(self, tokenizer_impl, parma_dict=None): + self._tokenizer = tokenizer_impl + self._parma_dict = parma_dict if parma_dict is not None else {} + + def __repr__(self): + return "Tokenizer(vocabulary_size={}, {})".format( + self._tokenizer.get_vocab_size(), + ", ".join(k + "=" + str(v) for k, v in self._parma_dict.items()), + ) + + def num_special_tokens_to_add(self, is_pair): + return self._tokenizer.num_special_tokens_to_add(is_pair) + + def get_vocab(self, with_added_tokens=True): + return self._tokenizer.get_vocab(with_added_tokens=with_added_tokens) + + def get_vocab_size(self, with_added_tokens=True): + return self._tokenizer.get_vocab_size(with_added_tokens=with_added_tokens) + + def enable_padding( + self, + direction="right", + pad_id=0, + pad_type_id=0, + pad_token="[PAD]", + pad_to_multiple_of=None, + length=None, + ): + return self._tokenizer.enable_padding( + direction=direction, + pad_to_multiple_of=pad_to_multiple_of, + pad_id=pad_id, + pad_type_id=pad_type_id, + pad_token=pad_token, + length=length, + ) + + def disable_padding(self): + self._tokenizer.disable_padding() + + @property + def padding(self): + return self._tokenizer.padding + + def enable_truncation(self, max_length, stride=0, strategy="longest_first", direction="right"): + self._tokenizer.enable_truncation(max_length, stride, strategy, direction) + + def disable_truncation(self): + self._tokenizer.disable_truncation() + + def truncation(self): + return self._tokenizer.truncation + + def add_tokens(self, tokens): + return self._tokenizer.add_tokens(tokens) + + def add_special_tokens(self, special_tokens): + return self._tokenizer.add_special_tokens(special_tokens) + + def encode( + self, + sequence, + pair=None, + is_pretokenized=False, + add_special_tokens=True, + ): + if sequence is None: + raise ValueError("encode: `sequence` can't be `None`") + return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens) + + def encode_batch(self, inputs, add_special_tokens=True, is_pretokenized=False): + if inputs is None: + raise ValueError("encode_batch: `inputs` can't be `None`") + return self._tokenizer.encode_batch(inputs, add_special_tokens, is_pretokenized) + + def decode(self, ids, skip_special_tokens=True) -> str: + if ids is None: + raise ValueError("None input is not valid. Should be a list of integers.") + + return self._tokenizer.decode(ids, skip_special_tokens=skip_special_tokens) + + def decode_batch(self, sequences, skip_special_tokens=True) -> str: + if sequences is None: + raise ValueError("None input is not valid. 
Should be list of list of integers.") + + return self._tokenizer.decode_batch(sequences, skip_special_tokens=skip_special_tokens) + + def token_to_id(self, token): + return self._tokenizer.token_to_id(token) + + def id_to_token(self, id): + return self._tokenizer.id_to_token(id) + + def post_process(self, encoding, pair=None, add_special_tokens=True): + return self._tokenizer.post_process(encoding, pair, add_special_tokens) + + @property + def model(self): + return self._tokenizer.model + + @model.setter + def model(self, model): + self._tokenizer.model = model + + @property + def normalizer(self): + return self._tokenizer.normalizer + + @normalizer.setter + def normalizer(self, normalizer): + self._tokenizer.normalizer = normalizer + + @property + def pretokenizer(self): + return self._tokenizer.pretokenizer + + @pretokenizer.setter + def pretokenizer(self, pretokenizer): + self._tokenizer.pretokenizer = pretokenizer + + @property + def postprocessor(self): + return self._tokenizer.postprocessor + + @postprocessor.setter + def postprocessor(self, postprocessor): + self._tokenizer.postprocessor = postprocessor + + @property + def decoder(self): + return self._tokenizer.decoder + + @decoder.setter + def decoder(self, decoder): + self._tokenizer.decoder = decoder + + def save(self, path, pretty=True): + self._tokenizer.save(path, pretty) + + def to_str(self, pretty=True): + return self._tokenizer.to_str(pretty) + + @staticmethod + def from_str(json): + return Tokenizer.from_str(json) + + @staticmethod + def from_file(path): + return Tokenizer.from_file(path) diff --git a/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/clip.py b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/clip.py new file mode 100644 index 0000000000000000000000000000000000000000..0add6051bc44a89c8c69ff5428c9c54da80aafeb --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/clip.py @@ -0,0 +1,101 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from .base_tokenizer import BaseFastTokenizer + +from fast_tokenizer.normalizers import NFCNormalizer, ReplaceNormalizer, LowercaseNormalizer, SequenceNormalizer +from fast_tokenizer.pretokenizers import SplitPreTokenizer, ByteLevelPreTokenizer, SequencePreTokenizer +from fast_tokenizer.models import BPE +from fast_tokenizer.postprocessors import RobertaPostProcessor +from fast_tokenizer import Tokenizer, SplitMode + +__all__ = ["ClipFastTokenizer"] + + +class ClipFastTokenizer(BaseFastTokenizer): + def __init__( + self, + vocab=None, + merges=None, + max_length=None, + unk_token="<|endoftext|>", + pad_token="<|endoftext|>", + bos_token="<|startoftext|>", + eos_token="<|endoftext|>", + add_prefix_space=False, + continuing_subword_prefix="", + end_of_word_suffix="", + trim_offsets=False, + ): + # Init Tokenizer instance using tokenization model + tokenizer = Tokenizer( + BPE( + vocab, + merges, + unk_token=unk_token, + continuing_subword_prefix=continuing_subword_prefix, + end_of_word_suffix=end_of_word_suffix, + fuse_unk=False, + ) + ) + + # Add special tokens + bos_token_id = 0 + eos_token_id = 1 + if tokenizer.token_to_id(str(unk_token)) is not None: + tokenizer.add_special_tokens([str(unk_token)]) + if tokenizer.token_to_id(str(pad_token)) is not None: + tokenizer.add_special_tokens([str(pad_token)]) + if tokenizer.token_to_id(str(bos_token)) is not None: + bos_token_id = tokenizer.token_to_id(str(bos_token)) + tokenizer.add_special_tokens([str(bos_token)]) + if tokenizer.token_to_id(str(eos_token)) is not None: + eos_token_id = tokenizer.token_to_id(str(eos_token)) + tokenizer.add_special_tokens([str(eos_token)]) + + # Set the normalizer + tokenizer.normalizer = SequenceNormalizer( + [NFCNormalizer(), ReplaceNormalizer(r"\s+", " "), LowercaseNormalizer()] + ) + + # Set the pretokenizer + tokenizer.pretokenizer = SequencePreTokenizer( + [ + SplitPreTokenizer( + r"""'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", + split_mode=SplitMode.REMOVED, + invert=True, + ), + ByteLevelPreTokenizer(add_prefix_space=False), + ] + ) + + # Set the postprocessor + tokenizer.postprocessor = RobertaPostProcessor( + sep=(eos_token, eos_token_id), cls=(bos_token, bos_token_id), trim_offsets=False, add_prefix_space=False + ) + + parameters = { + "model": "BPE", + "unk_token": unk_token, + "pad_token": pad_token, + "bos_token": bos_token, + "eos_token": eos_token, + "add_prefix_space": add_prefix_space, + "max_length": max_length, + "continuing_subword_prefix": continuing_subword_prefix, + "end_of_word_suffix": end_of_word_suffix, + "trim_offsets": trim_offsets, + } + super().__init__(tokenizer, parameters) diff --git a/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/ernie.py b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/ernie.py new file mode 100644 index 0000000000000000000000000000000000000000..6c854faa1713098b1726d3ffe56dd9de162099ac --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/ernie.py @@ -0,0 +1,110 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +from fast_tokenizer import Tokenizer, decoders +from fast_tokenizer.models import FastWordPiece, WordPiece +from fast_tokenizer.normalizers import BertNormalizer +from fast_tokenizer.postprocessors import BertPostProcessor +from fast_tokenizer.pretokenizers import BertPreTokenizer + +from .base_tokenizer import BaseFastTokenizer + +__all__ = ["ErnieFastTokenizer"] + + +class ErnieFastTokenizer(BaseFastTokenizer): + def __init__( + self, + vocab=None, + unk_token="[UNK]", + sep_token="[SEP]", + cls_token="[CLS]", + pad_token="[PAD]", + mask_token="[MASK]", + clean_text=True, + handle_chinese_chars=True, + strip_accents=True, + lowercase=True, + wordpieces_prefix="##", + max_sequence_len=None, + max_input_chars_per_word=100, + use_fast_wordpiece=False, + use_fast_wordpiece_with_pretokenization=False, + ): + tokenizer_model = WordPiece if not use_fast_wordpiece else FastWordPiece + model_kwargs = { + "unk_token": str(unk_token), + "continuing_subword_prefix": wordpieces_prefix, + "max_input_chars_per_word": max_input_chars_per_word, + } + if use_fast_wordpiece: + model_kwargs["with_pretokenization"] = use_fast_wordpiece_with_pretokenization + else: + model_kwargs["handle_chinese_chars"] = handle_chinese_chars + if vocab is not None: + tokenizer = Tokenizer(tokenizer_model(vocab, **model_kwargs)) + else: + tokenizer = Tokenizer(tokenizer_model(**model_kwargs)) + + if tokenizer.token_to_id(str(unk_token)) is not None: + tokenizer.add_special_tokens([str(unk_token)]) + if tokenizer.token_to_id(str(sep_token)) is not None: + tokenizer.add_special_tokens([str(sep_token)]) + if tokenizer.token_to_id(str(cls_token)) is not None: + tokenizer.add_special_tokens([str(cls_token)]) + if tokenizer.token_to_id(str(pad_token)) is not None: + tokenizer.add_special_tokens([str(pad_token)]) + if tokenizer.token_to_id(str(mask_token)) is not None: + tokenizer.add_special_tokens([str(mask_token)]) + + tokenizer.normalizer = BertNormalizer( + clean_text=clean_text, + handle_chinese_chars=handle_chinese_chars, + strip_accents=strip_accents, + lowercase=lowercase, + ) + if not use_fast_wordpiece or not use_fast_wordpiece_with_pretokenization: + tokenizer.pretokenizer = BertPreTokenizer() + + if vocab is not None: + sep_token_id = tokenizer.token_to_id(str(sep_token)) + if sep_token_id is None: + raise TypeError("sep_token not found in the vocabulary") + cls_token_id = tokenizer.token_to_id(str(cls_token)) + if cls_token_id is None: + raise TypeError("cls_token not found in the vocabulary") + + tokenizer.postprocessor = BertPostProcessor((str(sep_token), sep_token_id), (str(cls_token), cls_token_id)) + + tokenizer.decoder = decoders.WordPiece(prefix=wordpieces_prefix) + if max_sequence_len is None: + tokenizer.disable_truncation() + else: + tokenizer.enable_truncation(max_sequence_len) + + parameters = { + "model": "BertWordPiece", + "unk_token": unk_token, + "sep_token": sep_token, + "cls_token": cls_token, + "pad_token": pad_token, + "mask_token": mask_token, + "clean_text": clean_text, + "handle_chinese_chars": handle_chinese_chars, + "strip_accents": strip_accents, + "lowercase": lowercase, + "wordpieces_prefix": wordpieces_prefix, + } + + super().__init__(tokenizer, parameters) diff --git a/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/sentencepiece_bpe.py b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/sentencepiece_bpe.py new file mode 100644 index 
0000000000000000000000000000000000000000..3daacc6d28c9c856270bfc1da84ed12f94fb1592 --- /dev/null +++ b/fast_tokenizer/python/fast_tokenizer/tokenizers_impl/sentencepiece_bpe.py @@ -0,0 +1,56 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .base_tokenizer import BaseFastTokenizer +from fast_tokenizer.models import BPE +from fast_tokenizer.normalizers import NFKCNormalizer +from fast_tokenizer import Tokenizer +from fast_tokenizer.pretokenizers import MetaSpacePreTokenizer + +__all__ = ["SentencePieceBPEFastTokenizer"] + + +class SentencePieceBPEFastTokenizer(BaseFastTokenizer): + def __init__( + self, + vocab=None, + merges=None, + unk_token="", + replacement="▁", + add_prefix_space=True, + dropout=None, + fuse_unk=False, + ): + if vocab is not None and merges is not None: + tokenizer = Tokenizer(BPE(vocab, merges, dropout=dropout, unk_token=unk_token, fuse_unk=fuse_unk)) + else: + tokenizer = Tokenizer(BPE()) + if tokenizer.token_to_id(str(unk_token)) is not None: + tokenizer.add_special_tokens([str(unk_token)]) + tokenizer.normalizer = NFKCNormalizer() + tokenizer.pretokenizer = MetaSpacePreTokenizer(replacement=replacement, add_prefix_space=add_prefix_space) + parameters = { + "model": "SentencePieceBPE", + "unk_token": unk_token, + "replacement": replacement, + "add_prefix_space": add_prefix_space, + "dropout": dropout, + } + + super().__init__(tokenizer, parameters) + + @staticmethod + def from_file(vocab_filename, merges_filename, **kwargs): + vocab, merges = BPE.read_file(vocab_filename, merges_filename) + return SentencePieceBPEFastTokenizer(vocab, merges, **kwargs) diff --git a/fast_tokenizer/python/tests/test_byte_level_pretokenizer.py b/fast_tokenizer/python/tests/test_byte_level_pretokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..9f42234acf9bf568c87bbfd0d774c2b1cfc5350e --- /dev/null +++ b/fast_tokenizer/python/tests/test_byte_level_pretokenizer.py @@ -0,0 +1,69 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest + +from fast_tokenizer import pretokenizers + + +class TestByteLevelPreTokenizer(unittest.TestCase): + def setUp(self): + self.pretokenized = pretokenizers.PreTokenizedString("Hello my friend, how is your day going?") + + def check_equals(self, add_prefix_space, use_regex, expected_result): + bytelevel = pretokenizers.ByteLevelPreTokenizer(add_prefix_space=add_prefix_space, use_regex=use_regex) + bytelevel(self.pretokenized) + splits = self.pretokenized.get_splits() + result = [(s, offset) for s, offset, tokens in splits] + self.assertEqual(result, expected_result) + + def test_pretokenize_with_regex(self): + expected_result = [ + ("Hello", (0, 5)), + ("Ġmy", (5, 8)), + ("Ġfriend", (8, 15)), + (",", (15, 16)), + ("Ġhow", (16, 20)), + ("Ġis", (20, 23)), + ("Ġyour", (23, 28)), + ("Ġday", (28, 32)), + ("Ġgoing", (32, 38)), + ("?", (38, 39)), + ] + + self.check_equals(False, True, expected_result) + + def test_pretokenize_without_regex(self): + expected_result = [("HelloĠmyĠfriend,ĠhowĠisĠyourĠdayĠgoing?", (0, 39))] + self.check_equals(False, False, expected_result) + + def test_pretokenize_with_prefix_with_regex(self): + expected_result = [ + ("ĠHello", (0, 5)), + ("Ġmy", (5, 8)), + ("Ġfriend", (8, 15)), + (",", (15, 16)), + ("Ġhow", (16, 20)), + ("Ġis", (20, 23)), + ("Ġyour", (23, 28)), + ("Ġday", (28, 32)), + ("Ġgoing", (32, 38)), + ("?", (38, 39)), + ] + + self.check_equals(True, True, expected_result) + + def test_pretokenize_with_prefix_without_regex(self): + expected_result = [("ĠHelloĠmyĠfriend,ĠhowĠisĠyourĠdayĠgoing?", (0, 39))] + self.check_equals(True, False, expected_result) diff --git a/fast_tokenizer/python/tests/test_clip_tokenizer.py b/fast_tokenizer/python/tests/test_clip_tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..a6124b4f8c444bf7f92509aef083cf757f6f2be0 --- /dev/null +++ b/fast_tokenizer/python/tests/test_clip_tokenizer.py @@ -0,0 +1,91 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import unittest + +from fast_tokenizer import ClipFastTokenizer, models +from paddlenlp.utils.downloader import get_path_from_url + + +class TestClipFastTokenizer(unittest.TestCase): + def setUp(self): + vocab_path = os.path.join(os.getcwd(), "vocab.json") + merges_path = os.path.join(os.getcwd(), "merges.txt") + if not os.path.exists(vocab_path): + get_path_from_url( + "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/vocab.json", os.getcwd() + ) + if not os.path.exists(merges_path): + get_path_from_url( + "http://bj.bcebos.com/paddlenlp/models/community/openai/clip-vit-large-patch14/merges.txt", os.getcwd() + ) + vocab, merges = models.BPE.read_file(vocab_path, merges_path) + self.tokenizer = ClipFastTokenizer(vocab, merges) + self.expected_ids = [ + 49406, + 320, + 1342, + 272, + 272, + 335, + 273, + 273, + 274, + 16368, + 13439, + 2971, + 748, + 531, + 13610, + 323, + 1896, + 8445, + 323, + 539, + 320, + 2368, + 49407, + ] + self.expected_tokens = [ + "<|startoftext|>", + "a", + "'ll", + "1", + "1", + "p", + "2", + "2", + "3", + "rf", + "âĺĨ", + "ho", + "!!", + "to", + "?'", + "d", + "'d", + "''", + "d", + "of", + "a", + "cat", + "<|endoftext|>", + ] + self.input_text = "A\n'll 11p223RF☆ho!!to?'d'd''d of a cat" + + def test_encode(self): + result = self.tokenizer.encode(self.input_text) + self.assertEqual(result.tokens, self.expected_tokens) + self.assertEqual(result.ids, self.expected_ids) diff --git a/fast_tokenizer/python/tests/test_fast_wordpiece.py b/fast_tokenizer/python/tests/test_fast_wordpiece.py new file mode 100644 index 0000000000000000000000000000000000000000..7125aef04ace165664f2e766b2853a3229e4618f --- /dev/null +++ b/fast_tokenizer/python/tests/test_fast_wordpiece.py @@ -0,0 +1,92 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest + +from fast_tokenizer import ErnieFastTokenizer +from fast_tokenizer.models import WordPiece, FastWordPiece +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +logger.logger.setLevel("ERROR") + + +class TestWordpiece(unittest.TestCase): + def set_flag(self): + self.use_fast_wordpiece = False + self.use_fast_wordpiece_with_pretokenization = False + + def setUp(self): + self.max_seq_length = 128 + self.wordpiece_tokenizer = AutoTokenizer.from_pretrained("ernie-1.0", use_fast=True) + ernie_vocab = self.wordpiece_tokenizer.vocab + self.set_flag() + self.fast_wordpiece_tokenizer = ErnieFastTokenizer( + ernie_vocab, + max_sequence_len=self.max_seq_length, + use_fast_wordpiece=self.use_fast_wordpiece, + use_fast_wordpiece_with_pretokenization=self.use_fast_wordpiece_with_pretokenization, + ) + self.dataset = [example["sentence"] for example in load_dataset("clue", "tnews", splits=["train"])] + + def test_encode(self): + for sentence in self.dataset: + wordpiece_result = self.wordpiece_tokenizer(sentence, max_length=self.max_seq_length) + expected_input_ids = wordpiece_result["input_ids"] + expected_token_type_ids = wordpiece_result["token_type_ids"] + + fast_wordpiece_result = self.fast_wordpiece_tokenizer.encode(sentence) + actual_input_ids = fast_wordpiece_result.ids + actual_token_type_ids = fast_wordpiece_result.type_ids + self.assertEqual(expected_input_ids, actual_input_ids) + self.assertEqual(expected_token_type_ids, actual_token_type_ids) + + def test_get_offset_mapping(self): + for i, sentence in enumerate(self.dataset): + wordpiece_result = self.wordpiece_tokenizer( + sentence, max_length=self.max_seq_length, return_offsets_mapping=True + ) + expected_offset_mapping = wordpiece_result["offset_mapping"] + + fast_wordpiece_result = self.fast_wordpiece_tokenizer.encode(sentence) + actual_offset_mapping = fast_wordpiece_result.offsets + self.assertEqual(expected_offset_mapping, actual_offset_mapping) + + +class TestFastWordpiece(TestWordpiece): + def set_flag(self): + self.use_fast_wordpiece = True + self.use_fast_wordpiece_with_pretokenization = False + + +class TestFastWordpieceWithPretokenization(TestWordpiece): + def set_flag(self): + self.use_fast_wordpiece = True + self.use_fast_wordpiece_with_pretokenization = True + + +class TestFromfile(unittest.TestCase): + def setUp(self): + self.max_seq_length = 128 + t = AutoTokenizer.from_pretrained("ernie-1.0", use_fast=True) + self.vocab_file = t.init_kwargs["vocab_file"] + + def test(self): + WordPiece.from_file(self.vocab_file) + FastWordPiece.from_file(self.vocab_file) + + +if __name__ == "__main__": + unittest.main() diff --git a/fast_tokenizer/python/tests/test_tokenizer_json.py b/fast_tokenizer/python/tests/test_tokenizer_json.py new file mode 100644 index 0000000000000000000000000000000000000000..ace7f3858d02eddc6e05b6b82e07c4a889221e0c --- /dev/null +++ b/fast_tokenizer/python/tests/test_tokenizer_json.py @@ -0,0 +1,85 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import unittest + +import fast_tokenizer +from fast_tokenizer import ErnieFastTokenizer +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.log import logger + +logger.logger.setLevel("ERROR") + + +class TestTokenizerJson(unittest.TestCase): + def setUp(self): + wordpiece_tokenizer = AutoTokenizer.from_pretrained("ernie-1.0") + ernie_vocab = wordpiece_tokenizer.vocab.token_to_idx + self.fast_tokenizer = ErnieFastTokenizer(ernie_vocab) + + +class TestNormalizerJson(TestTokenizerJson): + def check_normalizer_json(self, normalizer): + self.fast_tokenizer.normalizer = normalizer + json_file = str(normalizer.__class__) + ".json" + self.fast_tokenizer.save(json_file) + tokenizer = ErnieFastTokenizer.from_file(json_file) + os.remove(json_file) + self.assertEqual(normalizer.__getstate__(), tokenizer.normalizer.__getstate__()) + + def test_replace(self): + replace_normalizer = fast_tokenizer.normalizers.ReplaceNormalizer("''", '"') + self.check_normalizer_json(replace_normalizer) + + def test_strip(self): + strip_normalizer = fast_tokenizer.normalizers.StripNormalizer(True, True) + self.check_normalizer_json(strip_normalizer) + + def test_strip_accent(self): + strip_normalizer = fast_tokenizer.normalizers.StripAccentsNormalizer() + self.check_normalizer_json(strip_normalizer) + + def test_nfc(self): + nfc_normalizer = fast_tokenizer.normalizers.NFCNormalizer() + self.check_normalizer_json(nfc_normalizer) + + def test_nfkc(self): + nfkc_normalizer = fast_tokenizer.normalizers.NFKCNormalizer() + self.check_normalizer_json(nfkc_normalizer) + + def test_nfd(self): + nfd_normalizer = fast_tokenizer.normalizers.NFDNormalizer() + self.check_normalizer_json(nfd_normalizer) + + def test_nfkd(self): + nfkd_normalizer = fast_tokenizer.normalizers.NFKDNormalizer() + self.check_normalizer_json(nfkd_normalizer) + + def test_nmt(self): + nmt_normalizer = fast_tokenizer.normalizers.NmtNormalizer() + self.check_normalizer_json(nmt_normalizer) + + def test_lowercase(self): + lowercase_normalizer = fast_tokenizer.normalizers.LowercaseNormalizer() + self.check_normalizer_json(lowercase_normalizer) + + def test_sequence(self): + lowercase_normalizer = fast_tokenizer.normalizers.LowercaseNormalizer() + sequence_normalizer = fast_tokenizer.normalizers.SequenceNormalizer(normalizer_list=[lowercase_normalizer]) + self.check_normalizer_json(sequence_normalizer) + + +if __name__ == "__main__": + unittest.main() diff --git a/fast_tokenizer/run_build_android_armv7_lib.sh b/fast_tokenizer/run_build_android_armv7_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..45303830c0c5335fe0dd73a3a621668f25a0383b --- /dev/null +++ b/fast_tokenizer/run_build_android_armv7_lib.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +mkdir build_android_armeabi_v7a +cd build_android_armeabi_v7a +cmake .. 
-DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="armeabi-v7a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang +make -j8 diff --git a/fast_tokenizer/run_build_android_armv7_lite_lib.sh b/fast_tokenizer/run_build_android_armv7_lite_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..4d09e46108e13c1f4d04152e2e6483572bb997e3 --- /dev/null +++ b/fast_tokenizer/run_build_android_armv7_lite_lib.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +mkdir build_android_armeabi_v7a_lite +cd build_android_armeabi_v7a_lite +cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="armeabi-v7a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang -DWITH_ICU_LITE=ON +make -j8 diff --git a/fast_tokenizer/run_build_android_armv8_lib.sh b/fast_tokenizer/run_build_android_armv8_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..f3905ad31a861195c98f2a37c259caab8f6a8d1c --- /dev/null +++ b/fast_tokenizer/run_build_android_armv8_lib.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +mkdir build_android_arm64_v8a +cd build_android_arm64_v8a +cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="arm64-v8a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang +make -j8 diff --git a/fast_tokenizer/run_build_android_armv8_lite_lib.sh b/fast_tokenizer/run_build_android_armv8_lite_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..60390b1102a676fff0b00360595c6ea4452136e8 --- /dev/null +++ b/fast_tokenizer/run_build_android_armv8_lite_lib.sh @@ -0,0 +1,18 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +mkdir build_android_arm64_v8a_lite +cd build_android_arm64_v8a_lite +cmake .. -DCMAKE_TOOLCHAIN_FILE=$NDK_ROOT/build/cmake/android.toolchain.cmake -DANDROID_ABI="arm64-v8a" -DANDROID_NATIVE_API_LEVEL=android-21 -DANDROID_STL=c++_shared -DWITH_TESTING=OFF -DWITH_PYTHON=OFF -DANDROID_TOOLCHAIN=clang -DWITH_ICU_LITE=ON +make -j8 diff --git a/fast_tokenizer/run_build_cpp_lib.bat b/fast_tokenizer/run_build_cpp_lib.bat new file mode 100644 index 0000000000000000000000000000000000000000..faf396a27be563a12d3b7f71ef447bc19b98e8a4 --- /dev/null +++ b/fast_tokenizer/run_build_cpp_lib.bat @@ -0,0 +1,7 @@ +if not exist build_cpp mkdir build_cpp +cd build_cpp +for /d %%G in ("*") do rmdir /s /q "%%G" +del /q * +cmake .. -G "Ninja" -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release +ninja -j20 +cd .. \ No newline at end of file diff --git a/fast_tokenizer/run_build_cpp_lib.sh b/fast_tokenizer/run_build_cpp_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..9e3ccceec6780fc7fe95cb54417e120a5bb6dd55 --- /dev/null +++ b/fast_tokenizer/run_build_cpp_lib.sh @@ -0,0 +1,36 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Can be used in linux and mac +mkdir -p build_cpp +cd build_cpp +rm -rf * +platform="$(uname -s)" +if [[ $platform == Linux* ]]; +then + core_num=`nproc` +else + core_num=`sysctl -n hw.logicalcpu` +fi +echo "Compile with $core_num cores" +cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release +make -j${core_num} + +if [[ $? == 0 ]]; +then + echo "Successfully compile." +else + echo "Fail compiling." +fi +cd .. diff --git a/fast_tokenizer/run_build_py_lib.bat b/fast_tokenizer/run_build_py_lib.bat new file mode 100644 index 0000000000000000000000000000000000000000..72bbd6824488e60ae97a1d06ee49405a81d9d631 --- /dev/null +++ b/fast_tokenizer/run_build_py_lib.bat @@ -0,0 +1,14 @@ +for %%x in (6 7 8 9 10) do ( + if not exist build_py3%%x mkdir build_py3%%x + cd build_py3%%x + for /d %%G in ("*") do rmdir /s /q "%%G" + del /q * + cmake .. -G "Ninja" -DWITH_PYTHON=ON ^ + -DWITH_TESTING=OFF ^ + -DCMAKE_BUILD_TYPE=Release ^ + -DPYTHON_EXECUTABLE=C:\Python3%%x\python.exe ^ + -DPYTHON_INCLUDE_DIR=C:\Python3%%x\include ^ + -DPYTHON_LIBRARY=C:\Python3%%x\libs\python3%%x.lib + ninja -j20 + cd .. +) diff --git a/fast_tokenizer/run_build_py_lib.sh b/fast_tokenizer/run_build_py_lib.sh new file mode 100644 index 0000000000000000000000000000000000000000..a473cd51a048b1c7ed5a9fbd382a21389049cb1d --- /dev/null +++ b/fast_tokenizer/run_build_py_lib.sh @@ -0,0 +1,44 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Can be used in linux and mac +# build python lib +mkdir -p build_py36 build_py37 build_py38 build_py39 build_py310 +for py_version in 6 7 8 9 10; +do + cd build_py3${py_version} + rm -rf * + platform="$(uname -s)" + if [[ $platform == Linux* ]]; + then + export LD_LIBRARY_PATH=/opt/_internal/cpython-3.${py_version}.0/lib/:${LD_LIBRARY_PATH} + export PATH=/opt/_internal/cpython-3.${py_version}.0/bin/:${PATH} + core_num=`nproc` + else + export LD_LIBRARY_PATH=/Users/paddle/miniconda2/envs/py3${py_version}/lib/:${LD_LIBRARY_PATH} + export PATH=/Users/paddle/miniconda2/envs/py3${py_version}/bin/:${PATH} + core_num=`sysctl -n hw.logicalcpu` + fi + echo "Compile with $core_num cores" + cmake .. -DWITH_PYTHON=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release + make -j${core_num} + if [[ $? == 0 ]]; + then + echo "Successfully compile." + else + echo "Fail compiling." + fi + cd .. +done + diff --git a/fast_tokenizer/run_fast_tokenizer_cpp_test.sh b/fast_tokenizer/run_fast_tokenizer_cpp_test.sh new file mode 100644 index 0000000000000000000000000000000000000000..02eeda082ed53f2b05a59859d822957fa2a071bf --- /dev/null +++ b/fast_tokenizer/run_fast_tokenizer_cpp_test.sh @@ -0,0 +1,24 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +cd ${PWD}/$1 + +for testcase in `ls ${TEST_DIR}`; do + if [[ "${testcase}" == "test"* ]]; then + ${PWD}/${testcase} + if [ $? -ne 0 ]; then + exit -1 + fi + fi +done diff --git a/fast_tokenizer/setup.py b/fast_tokenizer/setup.py new file mode 100644 index 0000000000000000000000000000000000000000..d172cda979caa41c55058d4d14410e7d345f2389 --- /dev/null +++ b/fast_tokenizer/setup.py @@ -0,0 +1,97 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os + +from setuptools import Distribution, setup +from setuptools.command.install import install + + +class BinaryDistribution(Distribution): + # when build the package, it will add + # platform name such as "cp37-cp37m-linux_x86_64" + def has_ext_modules(self): + return True + + +class InstallPlatlib(install): + def finalize_options(self): + install.finalize_options(self) + if self.distribution.has_ext_modules(): + self.install_lib = self.install_platlib + + +if os.name != "nt": + package_data = {"fast_tokenizer": ["core_tokenizers.so", "commit.log"]} + package_data["fast_tokenizer.libs"] = [] +else: + package_data = {"fast_tokenizer": ["core_tokenizers.pyd", "core_tokenizers.lib", "commit.log"]} + # Add icu dll + package_data["fast_tokenizer.libs"] = ["icuuc70.dll", "icudt70.dll"] + + +def get_version(): + f = open(os.path.join("python", "fast_tokenizer", "__init__.py")) + lines = f.readlines() + version = "" + for line in lines: + if line.startswith("__version__"): + version = line.split("=")[1] + version = version.strip().replace('"', "") + break + return version + + +long_description = "PaddleNLP Fast Tokenizer Library written in C++ " +setup( + name="fast-tokenizer-python", + version=get_version(), + author="PaddlePaddle Speech and Language Team", + author_email="paddlesl@baidu.com", + description=long_description, + long_description=long_description, + zip_safe=False, + url="https://github.com/PaddlePaddle/PaddleNLP/fast_tokenizer", + package_dir={"": "python"}, + packages=[ + "fast_tokenizer", + "fast_tokenizer.tokenizers_impl", + "fast_tokenizer.normalizers", + "fast_tokenizer.pretokenizers", + "fast_tokenizer.models", + "fast_tokenizer.postprocessors", + "fast_tokenizer.libs", + "fast_tokenizer.decoders", + ], + package_data=package_data, + extras_require={"test": ["pytest>=6.0"]}, + python_requires=">=3.6", + cmdclass={"install": InstallPlatlib}, + license="Apache 2.0", + distclass=BinaryDistribution, + classifiers=[ + "Development Status :: 5 - Production/Stable", + "Operating System :: OS Independent", + "Intended Audience :: Developers", + "Intended Audience :: Education", + "Intended Audience :: Science/Research", + "License :: OSI Approved :: Apache Software License", + "Programming Language :: C++", + "Programming Language :: Python :: 3.6", + "Programming Language :: Python :: 3.7", + "Programming Language :: Python :: 3.8", + "Programming Language :: Python :: 3.9", + "Programming Language :: Python :: 3.10", + ], +) diff --git a/fast_tokenizer/tools/codestyle/clang_format.hook b/fast_tokenizer/tools/codestyle/clang_format.hook new file mode 100644 index 0000000000000000000000000000000000000000..1d928216867c0ba3897d71542fea44debf8d72a0 --- /dev/null +++ b/fast_tokenizer/tools/codestyle/clang_format.hook @@ -0,0 +1,15 @@ +#!/bin/bash +set -e + +readonly VERSION="3.8" + +version=$(clang-format -version) + +if ! [[ $version == *"$VERSION"* ]]; then + echo "clang-format version check failed." + echo "a version contains '$VERSION' is needed, but get '$version'" + echo "you can install the right version, and make an soft-link to '\$PATH' env" + exit -1 +fi + +clang-format $@ diff --git a/fast_tokenizer/tools/codestyle/copyright.hook b/fast_tokenizer/tools/codestyle/copyright.hook new file mode 100644 index 0000000000000000000000000000000000000000..ecdc8a23ddeb2770b65294c8126e36cf18ca8eed --- /dev/null +++ b/fast_tokenizer/tools/codestyle/copyright.hook @@ -0,0 +1,134 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import print_function +from __future__ import unicode_literals + +import argparse +import io +import re +import sys +import os +import datetime + +COPYRIGHT = '''Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License.''' + +def _generate_copyright(comment_mark): + copyright=COPYRIGHT.split(os.linesep) + header = copyright[0].rstrip() + + p = re.search('(\d{4})', header).group(0) + now = datetime.datetime.now() + + header = header.replace(p,str(now.year)) + + ans=[comment_mark + " " + header + os.linesep] + for idx, line in enumerate(copyright[1:]): + ans.append(comment_mark + " " + line.rstrip() + os.linesep) + + return ans + +def _get_comment_mark(path): + lang_type=re.compile(r"\.(py|sh)$") + if lang_type.search(path) is not None: + return "#" + + lang_type=re.compile(r"\.(h|c|hpp|cc|cpp|cu|go|cuh|proto)$") + if lang_type.search(path) is not None: + return "//" + + return None + + +RE_ENCODE = re.compile(r"^[ \t\v]*#.*?coding[:=]", re.IGNORECASE) +RE_COPYRIGHT = re.compile(r".*Copyright \(c\) \d{4}", re.IGNORECASE) +RE_SHEBANG = re.compile(r"^[ \t\v]*#[ \t]?\!") + +def _check_copyright(path): + head=[] + try: + with open(path) as f: + head = [next(f) for x in range(4)] + except StopIteration: + pass + + for idx, line in enumerate(head): + if RE_COPYRIGHT.search(line) is not None: + return True + + return False + +def generate_copyright(path, comment_mark): + original_contents = io.open(path, encoding="utf-8").readlines() + head = original_contents[0:4] + + insert_line_no=0 + for i, line in enumerate(head): + if RE_ENCODE.search(line) or RE_SHEBANG.search(line): + insert_line_no=i+1 + + copyright = _generate_copyright(comment_mark) + if insert_line_no == 0: + new_contents = copyright + if len(original_contents) > 0 and len(original_contents[0].strip()) != 0: + new_contents.append(os.linesep) + new_contents.extend(original_contents) + else: + new_contents=original_contents[0:insert_line_no] + new_contents.append(os.linesep) + new_contents.extend(copyright) + if len(original_contents) > insert_line_no and len(original_contents[insert_line_no].strip()) != 0: + new_contents.append(os.linesep) + new_contents.extend(original_contents[insert_line_no:]) + new_contents="".join(new_contents) + + with io.open(path, 'w') as output_file: + output_file.write(new_contents) + + + +def main(argv=None): + parser = argparse.ArgumentParser( + description='Checker for copyright 
declaration.') + parser.add_argument('filenames', nargs='*', help='Filenames to check') + args = parser.parse_args(argv) + + retv = 0 + for path in args.filenames: + comment_mark = _get_comment_mark(path) + if comment_mark is None: + print("warning:Unsupported file", path, file=sys.stderr) + continue + + if _check_copyright(path): + continue + + generate_copyright(path, comment_mark) + + +if __name__ == '__main__': + exit(main()) diff --git a/fast_tokenizer/tools/codestyle/cpplint_pre_commit.hook b/fast_tokenizer/tools/codestyle/cpplint_pre_commit.hook new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/fast_tokenizer/tools/codestyle/pylint_pre_commit.hook b/fast_tokenizer/tools/codestyle/pylint_pre_commit.hook new file mode 100644 index 0000000000000000000000000000000000000000..bf2e982dd88a3bf5585b32e4d60badfe1acd9337 --- /dev/null +++ b/fast_tokenizer/tools/codestyle/pylint_pre_commit.hook @@ -0,0 +1,18 @@ +#!/bin/bash + +TOTAL_ERRORS=0 + + +DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" +export PYTHONPATH=$DIR:$PYTHONPATH + +# The trick to remove deleted files: https://stackoverflow.com/a/2413151 +for file in $(git diff --name-status | awk '$1 != "D" {print $2}'); do + pylint --disable=all --load-plugins=docstring_checker \ + --enable=doc-string-one-line,doc-string-end-with,doc-string-with-all-args,doc-string-triple-quotes,doc-string-missing,doc-string-indent-error,doc-string-with-returns,doc-string-with-raises $file; + TOTAL_ERRORS=$(expr $TOTAL_ERRORS + $?); +done + +exit $TOTAL_ERRORS +#For now, just warning: +#exit 0 diff --git "a/llama\346\250\241\345\236\213\347\273\223\346\236\204.png" "b/llama\346\250\241\345\236\213\347\273\223\346\236\204.png" new file mode 100644 index 0000000000000000000000000000000000000000..8e85936ebbb813ae2ddd5b8b8509006ea760e247 Binary files /dev/null and "b/llama\346\250\241\345\236\213\347\273\223\346\236\204.png" differ diff --git "a/llama\347\256\227\346\263\225\345\216\237\347\220\206.png" "b/llama\347\256\227\346\263\225\345\216\237\347\220\206.png" new file mode 100644 index 0000000000000000000000000000000000000000..276b6ec5d331026a9aa8dcdbe8e0a30a2feebc7a Binary files /dev/null and "b/llama\347\256\227\346\263\225\345\216\237\347\220\206.png" differ diff --git a/llm/README.md b/llm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..22e665f949f744631fa8b77578fa7a95c1e1158b --- /dev/null +++ b/llm/README.md @@ -0,0 +1,404 @@ +# 飞桨大语言模型 +大模型全流程工具基于PaddlePaddle的4D分布式并行能力旨在提供高性能、灵活易用大模型工具,可以根据自己的需求轻易来定制化百亿和千亿大模型训练,同时支持高性能的压缩推理和服务化,最终使用大模型能力提升业务效果。 + +| Model | Pretrain | SFT | LoRA | PrefixTuning | Generation | Quantization | +| --- | --- | --- | --- | --- | --- | --- | +| [LLaMA v1/v2](./llama) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | +| [ChatGLM-6B](./chatglm) | N/A | ✅ | ✅ | ✅ | ✅ | ✅ | +| [ChatGLM2-6B](./chatglm2) | N/A | ✅ | ✅ | ✅ | ✅ | ✅ | +| [Bloom](./bloom) | N/A | ✅ | ✅ | ✅ | ✅ | ✅ | +| [GPT-3](./gpt-3) | ✅ | ✅ | ✅ | WIP | ✅ | WIP | +| [OPT](./opt) | WIP | ✅ | ✅ | WIP| ✅ | WIP | +| [GLM](./glm) |N/A | ✅ | ✅ | WIP| ✅ | WIP | +| [Qwen](./qwen) |N/A | ✅ | ✅ | ✅ | ✅ | WIP | + + +# LLM全流程工具介绍 +我们提供了模型预训练、精调(SFT、LoRA、PrefixTuning)、量化、动态图推理、服务化部署全流程脚本,开发者可以根据自己的需求定制化自己的大语言模型。 + +
+ llm +
+ +
+ + LLM全流程工具流程图(上图:PaddleNLP 2.6进展 下图:最终目标) + +
+ +## 1. 环境准备 + +- PaddlePaddle >= 2.5.1 +- PaddleNLP >= 2.6.0 +- tiktoken (仅 Qwen 需要) + +## 2. 预训练 +[LLaMA v1/v2](./llama)、[GPT-3](./gpt-3) 目录中提供了模型预训练的数据准备和训练细节,后续我们将支持更多的模型预训练。 + +## 3. 精调 +目前精调统一脚本只支持[LLaMA v1/v2](./llama)、[ChatGLM-6B](./chatglm)、[ChatGLM2-6B](./chatglm2)、[Bloom](./bloom)、[OPT](./opt)、[Qwen](./qwen),其他模型精调使用详见对应模型目录。接下来我们将以**Llama 2**为例介绍如何使用统一脚本进行SFT、LoRA、Prefix Tuning。更多LoRA、Prefix Tuning请参见[PEFT文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/peft.md)。 + +### 3.1 精调训练数据格式 + +为了方便用户测试,我们也提供示例数据集[广告生成数据集](https://bj.bcebos.com/paddlenlp/datasets/examples/AdvertiseGen.tar.gz),用户也可以仿照数据集的格式制作自己的数据集进行精调。我们支持的数据格式是每行包含一个字典,每个字典包含以下字段: + +- `src` : `str, List(str)`, 模型的输入指令(instruction)、提示(prompt),模型应该执行的任务。 +- `tgt` : `str, List(str)`, 模型的输出。 + +样例数据: +``` +{"src": "类型#裙*颜色#蓝色*风格#清新*图案#蝴蝶结", "tgt": "裙身处采用立体蝴蝶结装饰辅以蓝色条带点缀,令衣身造型饱满富有层次的同时为其注入一丝甜美气息。将女孩清新娇俏的一面衬托而出。"} +... +``` + + + +### 3.2 SFT +SFT(Supervised Fine-Tuning)依托飞桨提出的[4D混合分布式并行](https://ai.baidu.com/forum/topic/show/987996)能力,支持使用Trainer API轻松切换数据并行(DP)、[张量并行(TP, Tensor Parallelism)](https://arxiv.org/abs/1909.08053)、[流水线并行(PP, Pipeline Parallelism)](https://arxiv.org/abs/1811.06965)(目前仅支持Llama)等多种分布式训练策略。 + +4D 混合并行策略如何组合?如图所示,在单机内使用通信量较大,适合使用机器内的卡间通信的张量并行(张量并行又称模型并行,MP)和分组参数切片(Sharding)的2D组合策略;训练千亿规模模型时,叠加流水线并行策略使用多台机器共同分担;同时叠加数据并行来增加并发数量,提升训练速度。 +
+ +
+ +``` +# 张量并行分布式训练(常用) +python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./llama/sft_argument.json + +# 目前ChatGLM2、OPT不支持张量并行,默认使用Sharding策略(Paddle 2.5.1支持Sharding Stage2,Sharding Stage3需要使用Paddle develop版本) +python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./chatglm2/sft_argument.json + +# 张量并行&流水线并行分布式训练(目前仅支持Llama) +python -u -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py ./llama/sft_pp_argument.json +``` + +### 3.3 LoRA + +Transformer模型中包含许多Linear层需要进行密集的矩阵乘法计算,而这些通常具有全秩(full rank)。[LoRA](https://arxiv.org/abs/2106.09685)提出冻结预训练的权重矩阵, 通过引入两个低 rank 矩阵 $AB$(图中橙色的两个矩阵) 来近似权重的更新过程 $W_0+\Delta W=W_0+B A$ , 其中 $B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}$,实验表面将输入表达随机投影到较小的子空间模型仍然可以有效地学习下游任务还可以节约大量的计算显存需求。 + + +
+ +
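+为便于理解上式中低秩矩阵 $A$、$B$ 的作用,下面给出一段示意性的 numpy 草图(仅为概念演示,`d`、`k`、`r` 的取值为假设的示例,并非 PaddleNLP LoRA API 的实际实现):
+
+```python
+# LoRA 重参数化的最小 numpy 示意:冻结 W0,只训练低秩矩阵 A、B
+import numpy as np
+
+d, k, r = 1024, 768, 8            # r 远小于 d、k,对应上文的 lora_rank
+rng = np.random.default_rng(0)
+
+W0 = rng.normal(size=(d, k))      # 预训练权重,训练中保持冻结
+A = rng.normal(size=(r, k)) * 0.01
+B = np.zeros((d, r))              # B 初始化为 0,保证初始增量 ΔW = BA = 0
+
+x = rng.normal(size=(k,))         # 单条输入特征
+y = W0 @ x + B @ (A @ x)          # 前向:W0 x + B(Ax),只需更新 A、B
+
+# 推理前可把低秩增量合并回主干权重,输出保持不变
+W_merged = W0 + B @ A
+assert np.allclose(W_merged @ x, y)
+```
+
+可以看到把 $BA$ 合并回 $W_0$ 后前向结果不变,这也是后文 3.7 节 LoRA 参数合并的依据。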
+ + +PaddleNLP LoRA API支持数据并行、张量并行等多种分布式训练策略,可以通过控制`tensor_parallel_degree` 调整并行训练策略。LoRA策略默认应用在所有Linear层,可拓展至**单机LoRA微调千亿模型**。 + + +``` +# 单卡训练 +python finetune_generation.py ./llama/lora_argument.json + +# 张量并行分布式训练(ChatGLM2、OPT不支持张量并行) +# 将lora_argument.json中tensor_parallel_degree修改为2 +python -u -m paddle.distributed.launch --gpus "0,1" finetune_generation.py ./llama/lora_argument.json +``` + + +### 3.4 Prefix Tuning + +[Prefix Tuning](https://arxiv.org/abs/2101.00190)受提示学习(Prompt learning)的影响,加入的一部分 prefix embedding 作为连续型提示进行训练。prefix embedding是由专门的 prefix encoder 网络生成的数个张量,会以 past_key_value的方式被插入到语言模型每一层的 hidden_state之前。 + +
+ +
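+下面用一段示意性的 numpy 草图说明前缀 K/V 以 past_key_value 方式拼接后如何参与单头注意力计算(张量形状与数值均为假设的示例,并非 PaddleNLP Prefix Tuning API 的实际实现):
+
+```python
+# past_key_value 式前缀参与注意力计算的概念示意
+import numpy as np
+
+rng = np.random.default_rng(0)
+seq_len, num_prefix_tokens, head_dim = 16, 4, 64
+
+# 输入序列在该层投影得到的 Q/K/V
+q = rng.normal(size=(seq_len, head_dim))
+k = rng.normal(size=(seq_len, head_dim))
+v = rng.normal(size=(seq_len, head_dim))
+
+# prefix encoder 生成的可训练前缀 K/V(主干权重冻结,仅训练这部分张量)
+prefix_k = rng.normal(size=(num_prefix_tokens, head_dim))
+prefix_v = rng.normal(size=(num_prefix_tokens, head_dim))
+
+# 前缀以 past_key_value 形式拼接在每层 K/V 之前
+k_cat = np.concatenate([prefix_k, k], axis=0)
+v_cat = np.concatenate([prefix_v, v], axis=0)
+
+scores = q @ k_cat.T / np.sqrt(head_dim)              # (seq_len, num_prefix_tokens + seq_len)
+scores -= scores.max(axis=-1, keepdims=True)
+weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
+out = weights @ v_cat                                  # 每个输入位置都能关注到前缀,前缀本身不产生输出位置
+```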
+ +PaddleNLP Prefix Tuning API支持数据并行、张量并行等多种分布式训练策略,可以通过控制`tensor_parallel_degree` 调整并行训练策略。 +``` +# 单卡训练 +python finetune_generation.py ./llama/pt_argument.json + +# 张量并行分布式训练(ChatGLM2、OPT不支持张量并行) +# 将pt_argument.json中tensor_parallel_degree修改为2 +python -u -m paddle.distributed.launch --gpus "0,1" finetune_generation.py ./llama/pt_argument.json +``` +### 3.5 精调参数介绍 +
  模型参数(ModelArgument)
+ +- `model_name_or_path`: 预训练模型名称或者本地的模型路径,用于热启模型和分词器,默认为None。每个模型**支持模型权重**详见各模型目录。 +- `lora`: 是否开启LoRA微调策略,默认为False。 +- `lora_path`: LoRA参数和配置路径,对LoRA参数进行初始化,默认为None。 +- `lora_rank`: LoRA算法中rank(秩)的值,默认为8。 +- `prefix_tuning`: 是否使用Prefix Tuning策略,默认为False。 +- `num_prefix_tokens`: Prefix Tuning策略中Prefix Token数量,默认为128。 + +
+ +
  数据参数(DataArgument)
+ +- `dataset_name_or_path`: 本地数据集目录或内置数据集名称,默认为None。脚本已适配单文件和多文件,会自己寻找`dataset_name_or_path/train.json` 或者 `dataset_name_or_path/train/*.json`作为训练集文件, 以及`dataset_name_or_path/dev.json` 或者 `dataset_name_or_path/dev/*.json`作为验证集文件。 +- `task_name`: 用于选择内置数据集中的具体任务,默认为None。 +- `eval_with_do_generation`: 在模型效果评估的时候是否调用model.generate,默认为False。设置为True时,指标为ppl, accuracy;设置为False时,指标为BLEU4/Rouge,建议将`metric_for_best_model`设为bleu4。 +- `save_generation_output`: 当`eval_with_do_generation`设为True,是否将生成结果保存在`generated_output.json`文件中,默认为False。 +- `intokens`:是否使用InToken数据流(减少Padding冗余计算,大幅提升有效Token计算效率),默认为False。当`eval_with_do_generation`设为True,评估过程不支持InToken数据流。。 +- `src_length`: 模型输入上下文最大token长度,默认为1024。 +- `max_length`:模型输入(上下文+生成内容)的最大token长度, 默认为2048。当`intokens`设为True的时候,同时也为InToken数据流模型训练输入最大长度,通常建议设为模型允许输入最大长度,同时`per_device_train_batch_size`设为1,使用`gradient_accumulation_steps`控制batch size。 +- `lazy`:设置为False则使用`MapDataset`,设置为True则使用`IterDataset`,默认为False。对于数据量较大的时候建议设为True,`IterDataset`可以避免一次性将所有数据读入内存,注意需要设置`max_steps`并且`evaluation_strategy`和`save_strategy`设为`steps` + +
+ + +
  生成参数(GenerateArgument)
+ +注:以下参数仅在`eval_with_do_generation`为True,调用model.generate()时生效。 + +- `top_k`: “采样”策略中为 top-k 过滤保留的最高概率标记的数量。默认为1,等价于贪心策略。 +- `top_p`:“采样”策略中 top-p 过滤的累积概率。默认为1.0,表示不起作用。 +
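+为帮助理解上述两个采样参数的含义,下面给出一段概念性的 numpy 草图(仅为示意,并非 model.generate() 的实际实现):
+
+```python
+# top_k / top_p 过滤的概念示意:对候选 token 的概率分布做截断后重新归一化
+import numpy as np
+
+def filter_probs(probs, top_k=1, top_p=1.0):
+    probs = probs.copy()
+    if top_k > 0:                                  # top-k:只保留概率最高的 k 个候选
+        kth = np.sort(probs)[-top_k]
+        probs[probs < kth] = 0.0
+    if top_p < 1.0:                                # top-p:按概率从大到小累加,保留累积概率首次超过 p 的最小集合
+        order = np.argsort(probs)[::-1]
+        cumulative = np.cumsum(probs[order])
+        probs[order[cumulative > top_p][1:]] = 0.0
+    return probs / probs.sum()
+
+probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
+print(filter_probs(probs, top_k=3, top_p=0.8))     # top_k=1 时只保留概率最高的候选,即贪心策略
+```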
+ +
  训练参数(TrainingArguments)
+ +以下仅介绍TrainingArguments部分常用参数,详情请参见[TrainingArguments文档](https://paddlenlp.readthedocs.io/zh/latest/trainer.html)。 + +- `output_dir`: 用于保存相关的文件目录,主要包括模型相关文件、训练过程中的checkpoint、分词器相关文件、评估的结果文件,默认为None。 +- `per_device_train_batch_size`: 训练集训练过程批处理大小,对应 micro batch size,默认为8。该参数需要根据具体的数据集来设定,该参数越大,占用显存越高,训练代价越大;反之,占用显存越小,训练速度越快。 +- `gradient_accumulation_steps`:梯度累积步数,顾名思义,就是将多次计算得到的梯度值进行累加,然后一次性进行参数更新,默认为1。等效于将原有训练batch size*gradient_accumulation_steps。 +- `per_device_eval_batch_size`: 验证集批处理大小,对应 micro batch size,默认为8。该参数越大,占用显存越高;该参数越小,占用显存越低。 +- `eval_accumulation_steps`:在将结果移动到CPU之前,累积输出张量的预测步骤数。如果如果未设置,则在移动到CPU之前,整个预测都会在GPU上累积(速度更快需要更多的显存),默认为None。 +- `num_train_epochs`:模型训练的轮次,默认为3。 +- `learning_rate`:优化器的初始学习率,默认为 5e-05。 +- `warmup_steps`: warmup的步数,默认为0。当warmup_steps>0时,会覆盖warmup_ratio的设置。 +- `logging_steps`: 日志打印的频率,仅当logging_strategy=="step"生效,默认为 500。如果希望看到较快的日志反馈或者即时的训练的速度,可以减小logging_steps。 +- `evaluation_strategy`: 评估策略,默认为no。"no":训练期间不进行评估;"steps":在每eval_steps结束进行;"epoch":在每个 epoch 结束时进行。 +- `save_strategy`: 保存策略,默认为no。"no":训练期间不进行评估;"steps":在每eval_steps结束进行;"epoch":在每个 epoch 结束时进行。 +- `fp16`: 是否需要开启FP16训练,开启FP16训练可以加速训练,默认为False。 +- `bf16`: 是否需要开启BF16训练,开启BF16训练可以加速训练,默认为False。 +- `fp16_opt_level`: 可设置O1或者O2,在 O1 级别下,在白名单中的算子将使用 float16/bfloat16 计算,在黑名单中的算子将使用 float32 计算。在 O2 级别下,模型的参数被转换为 float16/bfloat16, 如果算子的浮点型输入全是 float16/bfloat16,算子才会采用 float16/bfloat16 计算,若任意浮点型输入是 float32 类型,算子将采用 float32 计算。默认为O1。 +- `do_train`: 是否打开训练,默认为False。 +- `do_eval`: 是否打开评估,默认为False。 +- `disable_tqdm`: 是否关掉tqdm的进度条,默认为False。如果需要预估整体的训练时长,可以打开该配置,实时观察训练进度。 +- `load_best_model_at_end`: 训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用,默认为False。 +- `metric_for_best_model`: 最优模型指标,如"accuarcy"等,用于比较模型好坏,默认为None。 +- `recompute`: 重计算,暂支持full策略。开启后可降低显存以达到增大batch size的目的,默认为False。 +- `save_total_limit`: 保留checkpoint的个数,老的checkpoint会被删除,默认为None。 +- `tensor_parallel_degree`: 此参数tensor_parallel_degree表示将一层transformer结构的份数,该方法对通信开销较大, 建议 tensor_parallel_degree<=8, 尽量使用机器内部通信。默认为-1,表示不启用张量并行。 +- `pipeline_parallel_degree`: 表示划分流水线的大小.(假设该参数为4, 模型12层, 则每一个pp stage 包含3层模型) 默认值-1, 表示不启用流水线并行。 + +
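+结合上述各组参数,下面给出一个用 Python 生成示意性参数文件的草图:字段名均取自上文的参数说明,具体取值只是假设的示例,并非官方推荐配置,请按实际任务调整后再传给 `finetune_generation.py`。
+
+```python
+# 生成一个示意性的精调参数文件(字段来自上文 ModelArgument / DataArgument / TrainingArguments 说明,
+# 取值仅为示例)
+import json
+
+sft_args = {
+    "model_name_or_path": "meta-llama/Llama-2-7b-chat",
+    "dataset_name_or_path": "./data",
+    "output_dir": "./checkpoints/llama_sft_ckpts",
+    "per_device_train_batch_size": 1,
+    "gradient_accumulation_steps": 8,
+    "per_device_eval_batch_size": 8,
+    "num_train_epochs": 3,
+    "learning_rate": 5e-5,
+    "src_length": 1024,
+    "max_length": 2048,
+    "fp16": True,
+    "fp16_opt_level": "O1",
+    "do_train": True,
+    "do_eval": True,
+    "evaluation_strategy": "epoch",
+    "save_strategy": "epoch",
+    "tensor_parallel_degree": 4,
+}
+
+with open("my_sft_argument.json", "w") as f:
+    json.dump(sft_args, f, indent=4)
+# 随后即可参照上文命令,将该 json 文件路径作为 finetune_generation.py 的参数传入
+```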
+
+
+### 3.6 张量并行参数合并
+在使用张量并行(TP,Tensor Parallelism)训练的过程中,为了节省TP参数合并的时间,中间checkpoint往往会把参数存储为多个TP参数分片;可以使用我们提供的分片参数合并脚本进行合并。
+
+```
+python merge_tp_params.py \
+    --model_name_or_path ./checkpoints/llama_sft_ckpts/checkpoint-100
+```
+
  脚本参数介绍
+- `model_name_or_path`: 必须,本地的TP模型参数路径,默认为None。 +- `device`: 运行环境,默认为gpu。 +
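+
+张量并行参数合并在概念上做的事情,可以用下面的示意理解(仅为直观说明列切分权重的拼接过程,并非 `merge_tp_params.py` 的真实实现,维度与切分数均为假设):
+
+```python
+# 张量并行(列切分)权重分片与合并的直观示意
+import numpy as np
+
+tp_degree = 2
+full_weight = np.arange(16, dtype=np.float32).reshape(4, 4)
+
+shards = np.split(full_weight, tp_degree, axis=-1)   # 训练/保存时:每张卡各持有一个分片
+merged = np.concatenate(shards, axis=-1)             # 合并脚本概念上做的事情:按切分维度拼回完整权重
+
+assert np.allclose(merged, full_weight)
+```
+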
+ +### 3.7 LoRA参数合并 +为了后续的**压缩**和**静态图推理**方便,我们提供LoRA参数合并脚本,可以将LoRA参数合并到主干模型并保存相应的权重。 +``` +python merge_lora_params.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --lora_path ./checkpoints/llama_lora_ckpts +``` +
  脚本参数介绍
+ +- `model_name_or_path`: 必须,预训练模型名称或者本地的模型路径,用于热启模型和分词器,默认为None。 +- `lora_path`: LoRA参数和配置路径,对LoRA参数进行初始化,默认为None。 +- `merge_model_path`: 必须,合并参数后保存路径,默认为None。 +- `device`: 运行环境,默认为gpu。 +
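+
+LoRA 参数合并在数学上的含义可以用下面的示意理解(仅为概念说明,并非 `merge_lora_params.py` 的真实实现;维度与 scaling 取值均为假设):
+
+```python
+# LoRA 合并的数学含义示意:W' = W + scaling * (B @ A)
+import numpy as np
+
+d, r = 8, 2                        # 假设的隐层维度与 LoRA 秩(lora_rank)
+W = np.random.randn(d, d)          # 主干 Linear 权重
+A = np.random.randn(r, d)          # LoRA 低秩矩阵 A
+B = np.zeros((d, r))               # LoRA 低秩矩阵 B(训练初始为 0)
+scaling = 1.0                      # 通常为 lora_alpha / lora_rank
+
+W_merged = W + scaling * (B @ A)   # 合并后推理只需一个普通 Linear,无需额外的 LoRA 分支
+```
+
+合并后的权重即可直接用于后续的压缩与静态图推理。
+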
+ +## 4. 模型推理 + +### 4.1 动态图推理 + +```shell +# 预训练&SFT动态图模型推理 +python predictor.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --batch_size 1 \ + --data_file ./data/dev.json \ + --dtype "float16" \ + --mode "dynamic" + +# LoRA动态图模型推理 +python predictor.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --batch_size 1 \ + --data_file ./data/dev.json \ + --lora_path ./checkpoints/llama_lora_ckpts \ + --mode "dynamic" + +# Prefix Tuning动态图模型推理 +python predictor.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --batch_size 1 \ + --data_file ./data/dev.json \ + --prefix_path ./checkpoints/llama_pt_ckpts \ + --mode "dynamic" +``` + +### 4.2 静态图推理 + +```shell +# 首先需要运行一下命令将动态图导出为静态图 +# LoRA需要先合并参数,详见3.7LoRA参数合并 +# Prefix Tuning暂不支持 +python export_model.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --output_path ./inference \ + --dtype float16 + + +# 静态图模型推理 +python predictor.py \ + --model_name_or_path inference \ + --batch_size 1 \ + --data_file ./data/dev.json \ + --dtype "float16" \ + --mode "static" +``` + +### 4.3 InferenceModel 动态图推理 + +```shell +# InferenceModel 动态图推理 +# LoRA需要先合并参数,详见3.7LoRA参数合并 +# Prefix Tuning暂不支持 +python predictor.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --dtype float16 \ + --max_length 1024 \ + --mode "dynamic" \ + --inference_model +``` + +### 4.4 InferenceModel 静态图推理 + +```shell +# 首先需要运行一下命令将InferenceModel动态图导出为静态图 +# LoRA需要先合并参数,详见3.7LoRA参数合并 +# Prefix Tuning暂不支持 +python export_model.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --output_path ./inference \ + --dtype float16 \ + --inference_model + +# InferenceModel 静态图推理 +python predictor.py \ + --model_name_or_path ./inference \ + --dtype float16 \ + --max_length 1024 \ + --output_file "infer.json" \ + --mode "static" \ + --inference_model +``` + + +### 4.5 参数介绍 + +
  脚本参数介绍
+
+- `model_name_or_path`: 必须,预训练模型名称或者本地的模型路径,用于热启模型和分词器,默认为None。
+- `batch_size`: 批处理大小,默认为8。该参数越大,占用显存越高;该参数越小,占用显存越低。
+- `src_length`: 模型输入上下文的最大token长度,默认为1024。
+- `max_length`:模型输入(上下文+生成内容)的最大token长度,默认为2048。
+- `lora_path`: LoRA参数和配置路径,用于对LoRA参数进行初始化,默认为None。
+- `prefix_path`: Prefix Tuning参数和配置路径,用于对Prefix Tuning参数进行初始化,默认为None。
+- `top_k`: “采样”策略中为 top-k 过滤保留的最高概率标记的数量。默认为1,等价于贪心策略。
+- `top_p`:“采样”策略中 top-p 过滤的累积概率。默认为1.0,表示不起作用。
+- `temperature`:“采样”策略中会将输出logits除以temperature。默认为1.0,表示不起作用。
+- `data_file`:必须,待推理的json文件,默认为None。
+- `output_file`:保存推理结果的文件名,默认为output.json。
+- `device`: 运行环境,默认为gpu。
+- `dtype`: 模型参数dtype,默认为None。如果没有传入`lora_path`或`prefix_path`,则必须传入该参数。
+- `model_type`: 初始化不同类型模型,gpt-3: GPTForCausalLM;ernie-3.5-se: Ernie35ForCausalLM;默认为 None。
+- `mode`: 使用动态图或者静态图推理,可选值为[dynamic, static],默认为 dynamic。
+- `inference_model`: 是否使用InferenceModel推理,默认为 False。
+
+ +## 5. 服务化部署 + +### 5.1 环境准备 + +- python >= 3.8 +- gradio +- flask + +### 5.2 Flask & Gradio UI服务化部署 + +我们提供了一套简单易用的UI服务化部署脚本: + + +``` +python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" flask_server.py \ + --model_name_or_path meta-llama/Llama-2-7b-chat \ + --port 8010 \ + --flask_port 8011 \ + --src_length 1024 \ + --dtype "float16" +``` + +
  脚本参数介绍
+ + +- `port`: Gradio UI 服务端口号,默认8011。 +- `flask_port`: Flask服务端口号,默认8010。 +- 其他参数请参见动态图推理中参数。 + +
+
+## 6. 量化
+
+**注**:量化后模型暂不支持推理,相关开源工作正在进行中,敬请期待。
+
+量化算法可以将模型输入和模型权重用更低比特的数值表示,能够有效减少内存占用和计算开销。下面我们提供PTQ、GPTQ两种量化算法,并结合**PaddleSlim自研策略**进行量化,更多技术细节详见[量化策略详细教程](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/quant/advanced_quantization.md)。
+
+### 6.1 环境安装
+- PaddleSlim develop版本
+- PaddlePaddle develop版本
+
+### 6.2 数据准备
+
+量化中默认使用训练集作为校正(Calibration)数据集,开发集作为评估数据集。如果希望使用其他数据作为校正数据集,则在数据目录下新增`quant.json`文件,文件格式请参照精调训练数据格式(构造示例见本节末尾)。
+
+### 6.3 PTQ量化
+
+```
+python finetune_generation.py ./llama/ptq_argument.json
+```
+
+### 6.4 GPTQ量化
+
+```
+python finetune_generation.py ./llama/gptq_argument.json
+```
+
+### 6.5 量化参数介绍
+
  量化参数(QuantArgument)
+
+- `quant_type`: PTQ、QAT使用的量化类型,默认为A8W8。支持A8W8、WINT4、WINT8:A8W8指对激活(输入)进行INT8量化,同时对模型权重进行INT8量化;WINT4指仅对模型权重进行INT4量化,后续使用WeightOnly进行推理;WINT8指仅对模型权重进行INT8量化,后续使用WeightOnly进行推理。
+- `do_ptq`: 是否进行PTQ量化,默认为False。
+- `ptq_step`: PTQ量化步数,也即模型前向次数,默认为32。
+- `shift`: 是否在PTQ量化前进行[Shift策略](https://arxiv.org/abs/2304.09145),默认为False。使用Shift策略需要设`do_ptq`为True。
+- `shift_all_linears`: 是否对模型中所有Linear层应用Shift,如果为True,将会对非LayerNorm-Linear组合的Linear进行Shift,并且添加两个op,默认为False。
+- `shift_sampler`: Shift策略使用的sampler,默认为ema。可选none、ema:none指直接利用MinMax计算Shift中的零点;ema指使用指数滑动平均计算Shift中的零点。
+- `shift_step`: Shift采样步数,也即模型前向次数,默认为32。
+- `smooth`: 是否在PTQ量化前进行[SmoothQuant策略](https://arxiv.org/abs/2211.10438),默认为False。使用Smooth策略需要设`do_ptq`为True。
+- `smooth_all_linears`: 是否对模型中所有Linear层应用Smooth,如果为True,将会对非LayerNorm-Linear组合的Linear进行Smooth,并且添加两个op,默认为False。
+- `smooth_sampler`: Smooth策略使用的sampler,默认为none,可选none、multi_step。multi_step会保存多轮前向结果进行计算,需要更大的显存。
+- `smooth_step`: Smooth采样步数,也即模型前向次数,默认为32。
+- `smooth_piecewise_search`: Smooth是否进行分段搜索,默认为False。分段搜索根据数值大小将激活分成K段,对每一段进行alpha和scale的搜索。
+- `smooth_k_piece`: 使用分段搜索功能时的分段数量,默认为3。根据经验建议10B模型设置为3,100B模型设置为6。
+- `smooth_search_piece`: 使用分段搜索功能时,是否搜索分段数量,默认为False。设为True时,`smooth_k_piece`建议设为6;搜索分段数量耗时较长,如需加速Smooth过程建议关闭。
+- `do_gptq`: 是否进行GPTQ量化,GPTQ对模型进行WINT4量化,相比于普通PTQ量化精度更高,但量化时间较长。默认为False。
+- `gptq_step`: GPTQ量化步数,也即模型前向次数,默认为8。
+ + +
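+上面 A8W8 / WINT8 中“INT8 量化”的基本含义,可以用下面的对称量化示意理解(仅为概念演示,并非 PaddleSlim 的实际实现;整张量共用一个 scale 的做法是简化假设):
+
+```python
+# INT8 对称量化 / 反量化的最小示意
+import numpy as np
+
+def quantize_int8(x):
+    scale = np.abs(x).max() / 127.0                      # 简化:整张量共用一个 scale
+    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
+    return q, scale
+
+def dequantize_int8(q, scale):
+    return q.astype(np.float32) * scale
+
+w = np.random.randn(4, 4).astype(np.float32)             # 假设的一小块权重
+q, s = quantize_int8(w)
+print(np.abs(w - dequantize_int8(q, s)).max())           # 量化引入的最大误差
+```
+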
  其他参数
+ +- `per_device_train_batch_size`: 量化前向批大小,默认为8。量化过程只有模型前向,相比于普通训练需要显存较少。 + +- 更多参数详见精调参数介绍。 + +
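+
+关于 6.2 中自定义校正数据集 `quant.json` 的构造,下面给出一个从训练集中抽样的示意(路径与抽样条数均为假设,数据格式沿用精调训练数据的 `src`/`tgt` 格式):
+
+```python
+# 从 train.json 中随机抽取若干条样本作为校正集 quant.json(示意)
+import random
+
+with open("data/train.json", "r", encoding="utf-8") as f:
+    lines = f.readlines()
+
+random.seed(0)
+sampled = random.sample(lines, k=min(128, len(lines)))
+
+with open("data/quant.json", "w", encoding="utf-8") as f:
+    f.writelines(sampled)
+```
+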
diff --git a/llm/argument.py b/llm/argument.py new file mode 100644 index 0000000000000000000000000000000000000000..fff1c3ceea1a3fbdff26d846ad4829d11e1d3d43 --- /dev/null +++ b/llm/argument.py @@ -0,0 +1,114 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from dataclasses import dataclass, field + +from paddlenlp.trainer import TrainingArguments + + +@dataclass +class TrainingArguments(TrainingArguments): + benchmark: bool = field(default=False, metadata={"help": "Whether runs benchmark"}) + + +@dataclass +class DataArgument: + dataset_name_or_path: str = field(default=None, metadata={"help": "Name or path for dataset"}) + task_name: str = field(default=None, metadata={"help": "Additional name to select a more specific task."}) + intokens: bool = field(default=False, metadata={"help": "Whether to use InTokens data stream"}) + src_length: int = field(default=1024, metadata={"help": "The maximum length of source(context) tokens."}) + max_length: int = field( + default=2048, + metadata={ + "help": "The maximum length that model input tokens can have. When intokens is set to True, it's also the maximum length for InTokens data stream" + }, + ) + eval_with_do_generation: bool = field(default=False, metadata={"help": "Whether to do generation for evaluation"}) + save_generation_output: bool = field( + default=False, + metadata={"help": "Whether to save generated text to file when eval_with_do_generation set to True."}, + ) + lazy: bool = field( + default=False, + metadata={ + "help": "Weather to return `MapDataset` or an `IterDataset`.True for `IterDataset`. False for `MapDataset`." + }, + ) + + +@dataclass +class ModelArgument: + model_name_or_path: str = field( + default=None, metadata={"help": "Build-in pretrained model name or the path to local model."} + ) + use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + + # LoRA related parameters + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + lora_path: str = field(default=None, metadata={"help": "Initialize lora state dict."}) + lora_rank: int = field(default=8, metadata={"help": "Lora attention dimension"}) + + # prefix tuning related parameters + prefix_tuning: bool = field(default=False, metadata={"help": "Whether to use Prefix technique"}) + num_prefix_tokens: int = field(default=128, metadata={"help": "Number of prefix tokens"}) + + +@dataclass +class QuantArgument: + quant_type: str = field( + default="A8W8", metadata={"help": "Quantization type. 
Supported values: A8W8, WINT4,WINT8"} + ) + + # QAT related parameters + # Not Yet support + do_qat: bool = field(default=False, metadata={"help": "Whether to use QAT technique"}) + + # PTQ related parameters + do_ptq: bool = field(default=False, metadata={"help": "Whether to use PTQ"}) + ptq_step: int = field(default=32, metadata={"help": "Step for PTQ"}) + + shift: bool = field(default=False, metadata={"help": "Whether to use Shift"}) + shift_all_linears: bool = field(default=False, metadata={"help": "Whether to shift all linears"}) + shift_sampler: str = field( + default="ema", metadata={"help": "The name of shift sampler, choosen from ['ema', 'none']"} + ) + shift_step: int = field(default=32, metadata={"help": "Sample steps when shift"}) + + smooth: bool = field(default=False, metadata={"help": "Whether to use Smooth"}) + smooth_all_linears: bool = field(default=False, metadata={"help": "Whether to smooth all linears"}) + smooth_sampler: str = field( + default="none", metadata={"help": "The name of smooth sampler, choosen from ['multi_step','none']"} + ) + smooth_step: int = field(default=32, metadata={"help": "Sample steps when smooth"}) + smooth_piecewise_search: bool = field( + default=False, metadata={"help": "The number of piece in piecewise search for smooth strategy."} + ) + smooth_k_piece: int = field(default=3, metadata={"help": "Number of pieces for K-search"}) + smooth_search_piece: bool = field(default=False, metadata={"help": "Whether search k_piece when piecewise search"}) + + # GPTQ related parameters + do_gptq: bool = field(default=False, metadata={"help": "Whether to use GPTQ"}) + gptq_step: int = field(default=8, metadata={"help": "Step for GPTQ"}) + + +@dataclass +class GenerateArgument: + top_k: int = field( + default=1, + metadata={ + "help": "The number of highest probability tokens to keep for top-k-filtering in the sampling strategy" + }, + ) + top_p: float = field( + default=1.0, metadata={"help": "The cumulative probability for top-p-filtering in the sampling strategy."} + ) diff --git a/llm/benchmark.sh b/llm/benchmark.sh new file mode 100644 index 0000000000000000000000000000000000000000..1f2db3dff9dbe44e0631a65aa54a3ef9bd1a45e1 --- /dev/null +++ b/llm/benchmark.sh @@ -0,0 +1,31 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +export PYTHONPATH=$(dirname $(pwd)):$PYTHONPATH + +export FLAGS_control_flow_use_new_executor=1 +export FLAGS_new_executor_serial_run=1 +export FLAGS_allocator_strategy=naive_best_fit +export FLAGS_fraction_of_gpu_memory_to_use=0.92 + +python predictor.py \ + --model_name_or_path ./llama7b-inference_model_fp16 \ + --dtype float16 \ + --src_length 300 \ + --max_length 100 \ + --output_file "infer.json" \ + --mode "static" \ + --batch_size 1 \ + --benchmark \ + --inference_model \ No newline at end of file diff --git a/llm/bloom/README.md b/llm/bloom/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2cdeafa669686d4fd49fd91cdd56b3b814b9c175 --- /dev/null +++ b/llm/bloom/README.md @@ -0,0 +1,25 @@ +# BLOOM + +## 1.模型介绍 + + +BLOOM是一种自回归大型语言模型(LLM),在大量文本数据上训练从而生生成目标文本,同时它能够支持46种语言和13种编程语言的文本交互。BLOOM 主要基于文本生成任务训练而成,可以很好的完成文本续写任务,此外 BloomZ 系列模型加入了 Instruction Tuning。 + +**支持模型权重:** +| Model | +|----------------------------------| +| bigscience/bloom-560m | +| bigscience/bloom-560m-bf16 | +| bigscience/bloom-1b1/ | +| bigscience/bloom-3b | +| bigscience/bloom-7b1 | +| bigscience/bloomz-560m/ | +| bigscience/bloomz-1b1 | +| bigscience/bloomz-3b | +| bigscience/bloomz-7b1-mt | +| bigscience/bloomz-7b1-p3 | +| bigscience/bloomz-7b1 | +| bellegroup/belle-7b-2m | + +## 2. 模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/bloom/gptq_argument.json b/llm/bloom/gptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..6a5cb7e882a70dd13112e726ba1623ee3c9a254a --- /dev/null +++ b/llm/bloom/gptq_argument.json @@ -0,0 +1,16 @@ +{ + "model_name_or_path": "./checkpoints/bloom_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/bloom_gptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_gptq": true, + "gptq_step": 8 + } \ No newline at end of file diff --git a/llm/bloom/lora_argument.json b/llm/bloom/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..9403c788ba70eb95f522dc1de429bdb84f63ed71 --- /dev/null +++ b/llm/bloom/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "bigscience/bloomz-7b1-mt", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/bloom_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } \ No newline at end of file diff --git a/llm/bloom/pt_argument.json b/llm/bloom/pt_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..4112caffa884834fd50bf51ca9f7b7ccc436779d --- /dev/null +++ b/llm/bloom/pt_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "bigscience/bloomz-7b1-mt", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/bloom_pt_ckpts", + 
"per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true + } \ No newline at end of file diff --git a/llm/bloom/ptq_argument.json b/llm/bloom/ptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..0c96b852c6d3c3195a1be742122cb3ad0046314d --- /dev/null +++ b/llm/bloom/ptq_argument.json @@ -0,0 +1,22 @@ +{ + "model_name_or_path": "./checkpoints/bloom_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/bloom_ptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_ptq": true, + "ptq_step": 16, + "smooth": true, + "smooth_step": 16, + "smooth_all_linears": true, + "smooth_piecewise_search": true, + "smooth_k_piece": true, + "smooth_search_piece": true + } \ No newline at end of file diff --git a/llm/bloom/sft_argument.json b/llm/bloom/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..3b3b35f93c79f8ab810448c1f8bb3f282d17a995 --- /dev/null +++ b/llm/bloom/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": "bigscience/bloomz-7b1-mt", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/bloom_sft_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 4, + "pipeline_parallel_degree": 1 + } \ No newline at end of file diff --git a/llm/chatglm/README.md b/llm/chatglm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..281a7ceea61fefd9510adf005fd7eb71fde481ed --- /dev/null +++ b/llm/chatglm/README.md @@ -0,0 +1,19 @@ +# ChatGLM-6B + +## 1. 模型介绍 + +ChatGLM-6B 是一个开源的、支持中英双语问答的对话语言模型,基于 [General Language Model (GLM)](https://arxiv.org/abs/2103.10360) 架构,具有 62 亿参数。ChatGLM-6B 使用了和 ChatGLM 相同的技术,针对中文问答和对话进行了优化。经过约 1T 标识符的中英双语训练,辅以监督微调、反馈自助、人类反馈强化学习等技术的加持,62 亿参数的 ChatGLM-6B 已经能生成相当符合人类偏好的回答。 + +**支持模型权重:** + +| Model | +|----------------------------------| +| THUDM/chatglm-6b | +| THUDM/chatglm-6b-v1.1 | + +## 2. 模型协议 + +ChatGLM-6B 模型的权重的使用需要遵循[License](../../paddlenlp/transformers/chatglm/LICENSE)。 + +## 3. 
模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/chatglm/gptq_argument.json b/llm/chatglm/gptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..8b1c07742ba82d624a957621c5293ac4a507e3e4 --- /dev/null +++ b/llm/chatglm/gptq_argument.json @@ -0,0 +1,16 @@ +{ + "model_name_or_path": "./checkpoints/chatglm_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm_gptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_gptq": true, + "gptq_step": 8 + } \ No newline at end of file diff --git a/llm/chatglm/lora_argument.json b/llm/chatglm/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..5f2e017a1c8afa779994cb13ce372fd5c8f19736 --- /dev/null +++ b/llm/chatglm/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "THUDM/chatglm-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } \ No newline at end of file diff --git a/llm/chatglm/pt_argument.json b/llm/chatglm/pt_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..a34a86c0b3579f6ef268ef95bc1cc51ee1f365a3 --- /dev/null +++ b/llm/chatglm/pt_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "THUDM/chatglm-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm_pt_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true + } \ No newline at end of file diff --git a/llm/chatglm/ptq_argument.json b/llm/chatglm/ptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..63474a9e0a194d2b1100c7e2b91560e396d079a9 --- /dev/null +++ b/llm/chatglm/ptq_argument.json @@ -0,0 +1,16 @@ +{ + "model_name_or_path": "./checkpoints/llama_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_ptq_ckpts", + "do_eval": 
true, + "eval_with_do_generation": false, + "do_ptq": true, + "ptq_step": 16 + } \ No newline at end of file diff --git a/llm/chatglm/sft_argument.json b/llm/chatglm/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..5884b2f99882a68a014ee2351b09140d8ff7ac40 --- /dev/null +++ b/llm/chatglm/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": "THUDM/chatglm-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm_sft_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 4, + "pipeline_parallel_degree": 1 + } \ No newline at end of file diff --git a/llm/chatglm2/README.md b/llm/chatglm2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bbc50870529830d43d9631997aa1d8e1461fe0ad --- /dev/null +++ b/llm/chatglm2/README.md @@ -0,0 +1,19 @@ +# ChatGLM2-6B + +## 1. 模型介绍 + +ChatGLM2-6B 是开源中英双语对话模型 [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) 的第二代版本,在保留了初代模型对话流畅、部署门槛较低等众多优秀特性的基础之上,ChatGLM2-6B 引入了[FlashAttention](https://github.com/HazyResearch/flash-attention)和[Multi-Query Attention](https://arxiv.org/abs/1911.02150v1)等新特性。更详细的模型介绍见[ChatGLM2-6B GitHub](https://github.com/THUDM/ChatGLM2-6B) + +**支持模型权重:** + +| Model | +|----------------------------------| +| THUDM/chatglm2-6b | + +## 2. 模型协议 + + +ChatGLM2-6B 模型的权重的使用需要遵循[License](../../paddlenlp/transformers/chatglm_v2/LICENSE)。 + +## 3. 
模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/chatglm2/gptq_argument.json b/llm/chatglm2/gptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..9285e8b628ad652cb39b3f621b303f9e4abd0301 --- /dev/null +++ b/llm/chatglm2/gptq_argument.json @@ -0,0 +1,16 @@ +{ + "model_name_or_path": "./checkpoints/chatglm2_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm2_gptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_gptq": true, + "gptq_step": 8 + } \ No newline at end of file diff --git a/llm/chatglm2/lora_argument.json b/llm/chatglm2/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..3bf0a3bce244d6cecbe6daabd2e467a6ff7c9c42 --- /dev/null +++ b/llm/chatglm2/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "THUDM/chatglm2-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm2_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } \ No newline at end of file diff --git a/llm/chatglm2/pt_argument.json b/llm/chatglm2/pt_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..ae664a5201f21d701cf78bf5498d96afa3589806 --- /dev/null +++ b/llm/chatglm2/pt_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "THUDM/chatglm2-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm2_pt_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true + } \ No newline at end of file diff --git a/llm/chatglm2/ptq_argument.json b/llm/chatglm2/ptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..acae45bd480f0880f4617152769c3a7c4b4fed3b --- /dev/null +++ b/llm/chatglm2/ptq_argument.json @@ -0,0 +1,22 @@ +{ + "model_name_or_path": "./checkpoints/chatglm2_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": 
"./checkpoints/chatglm2_ptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_ptq": true, + "ptq_step": 16, + "smooth": true, + "smooth_step": 16, + "smooth_all_linears": true, + "smooth_piecewise_search": true, + "smooth_k_piece": true, + "smooth_search_piece": true + } \ No newline at end of file diff --git a/llm/chatglm2/sft_argument.json b/llm/chatglm2/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..e978a9de0f127396880079992b4bb4dcfb076e67 --- /dev/null +++ b/llm/chatglm2/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": "THUDM/chatglm2-6b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/chatglm2_sft_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "sharding_parallel_degree": 4, + "sharding": "stage3" + } \ No newline at end of file diff --git a/llm/data.py b/llm/data.py new file mode 100644 index 0000000000000000000000000000000000000000..3a39643c096be8c27f99d7e5aa0956f63ab39abf --- /dev/null +++ b/llm/data.py @@ -0,0 +1,131 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + +from paddlenlp.peft import LoRAModel, PrefixModelForCausalLM + + +def get_convert_example(model): + if isinstance(model, LoRAModel) or isinstance(model, PrefixModelForCausalLM): + base_model_prefix = model.model.base_model_prefix + else: + base_model_prefix = model.base_model_prefix + + if base_model_prefix == "chatglm": + return convert_example_chatglm + elif base_model_prefix in ["chatglm_v2", "llama", "bloom", "opt", "qwen"]: + return convert_example_common + else: + raise ValueError( + f"Unknown base_model_prefix: {model.base_model_prefix}. Supported base_model_prefix list: chatglm, bloom, llama." 
+ ) + + +class DataFormatError(ValueError): + pass + + +def tokenize_example(tokenizer, example, data_args): + if "src" in example and "tgt" in example: + source = example["src"][0] if isinstance(example["src"], list) else example["src"] + target = example["tgt"][0] if isinstance(example["tgt"], list) else example["tgt"] + else: + raise DataFormatError( + f"Example format is wrong, please check: {example} or rewrite tokenize_example in data.py " + ) + tokenized_source = tokenizer( + source, + max_length=data_args.src_length, + truncation=True, + truncation_side="left", + add_special_tokens=True, + ) + tgt_max_length = data_args.max_length - len(tokenized_source["input_ids"]) + tokenized_target = tokenizer( + target, + max_length=tgt_max_length, + truncation=True, + truncation_side="right", + add_special_tokens=False, + ) + + tokenized_target_input_ids = tokenized_target["input_ids"] + # Add eos_token_id at the end of sequence if the sentence is not truncated. + # Attention! In some cases(ex. ChatGLMv2), tokenized eos_token is not equal to eos_token_id. + if len(tokenized_target_input_ids) < tgt_max_length: + tokenized_target_input_ids += [tokenizer.eos_token_id] + + return tokenized_source, tokenized_target_input_ids + + +def convert_example_common(example, tokenizer, data_args, is_test=True, intokens=False): + tokenized_source, tokenized_target_input_ids = tokenize_example(tokenizer, example, data_args) + + if is_test: + return { + **tokenized_source, + "labels": tokenized_target_input_ids, + } + else: + input_ids = tokenized_source["input_ids"] + tokenized_target_input_ids + source_length = len(tokenized_source["input_ids"]) + labels = [-100] * source_length + input_ids[source_length:] + # shift input_ids and labels + input_ids, labels = input_ids[:-1], labels[1:] + seq_length = len(input_ids) + features = {"input_ids": input_ids, "labels": labels} + if "position_ids" in tokenized_source: + features["position_ids"] = list(range(seq_length)) + if intokens: + features["attention_mask"] = np.tri(seq_length, seq_length, dtype=bool) + + return features + + +def convert_example_chatglm(example, tokenizer, data_args, is_test=True, intokens=False): + + tokenized_source, tokenized_target_input_ids = tokenize_example(tokenizer, example, data_args) + if is_test: + return { + **tokenized_source, + "labels": tokenized_target_input_ids, + } + else: + input_ids = tokenized_source["input_ids"] + tokenized_target_input_ids + bos_position = len(tokenized_source["input_ids"]) - 1 + labels = [-100] * bos_position + input_ids[bos_position:] + # shift input_ids and labels + input_ids, labels = input_ids[:-1], labels[1:] + features = { + "input_ids": input_ids, + "labels": labels, + } + + if intokens: + seq_length = len(input_ids) + # attention_mask + attention_mask = np.tri(seq_length, seq_length, dtype=bool) + attention_mask[:, :bos_position] = 1 + features["attention_mask"] = attention_mask + # 2d position_ids + position_ids = np.arange(seq_length, dtype=np.int64) + block_position_ids = np.concatenate( + [ + np.zeros(bos_position, dtype=np.int64), + np.arange(1, seq_length - bos_position + 1, dtype=np.int64), + ] + ) + features["position_ids"] = np.stack([position_ids, block_position_ids], axis=0) + + return features diff --git a/llm/ernie-3.5-se/README.md b/llm/ernie-3.5-se/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ad14040deba7ec63d7ac023847b720e3a30f907e --- /dev/null +++ b/llm/ernie-3.5-se/README.md @@ -0,0 +1,203 @@ +# ERNIE-3.5-SE + +## 1. 
模型介绍 + +我们采用了Attention和FFN并行的Parallel Transformer的实现方式,将FFN和Attention层进行并行计算。通过这样的设计,我们可以把Attention和FFN需要的线形层计算进行算子融合,降低kernel调用以及通讯次数,提升并行训练的效率。并且我们发现第一层的FFN和最后一层的Attn作用不大,因此采用了“掐头去尾”策略,将底层的FFN的计算量挪到模型的顶层,在同FLOPs下效果和传统Transformer结构一致,但有更好的训练速度和吞吐。 + + + + + + + + + + +
Parallel Transformer “掐头去尾”策略
+ + +* Rope Embedding+[随机位置编码](https://aclanthology.org/2023.acl-short.161):我们采用的旋转位置编码Rope,并且为了有较好的模型外推能力,我们保留了线形层的Bias。为了提供长文外推能力,我们通过随机间隔取Position Ids,让模型能够有训短推长的能力。 + + + +* Sequence Length Warmup:通过动态调整前期训练的序列长度,提升模型的收敛效率。 + + +## 2. 预训练 + +预训练数据制作参考[此处](../../model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md) + +为了方便用户运行测试本模型,本项目提供了处理好的100k条doc的训练样本: +```shell +wget https://bj.bcebos.com/paddlenlp/models/transformers/ernie/data/ernie_openwebtext_100k_ids.npy +wget https://bj.bcebos.com/paddlenlp/models/transformers/ernie/data/ernie_openwebtext_100k_idx.npz +``` + +将所有预处理得到的文件统一放入一个文件夹中,以备训练使用: + +``` +mkdir data +mv ernie_openwebtext_100k_ids.npy ./data +mv ernie_openwebtext_100k_idx.npz ./data +``` + +使用下面脚本,即可启动 ernie-3.5-se-3b 的预训练,也可直接参考 run_trainer_stage2.sh。 +```shell +task_name="ernie35_hybrid" +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "baidu/ernie-3.5-se-3b" \ + --tokenizer_name_or_path "ernie-tokenizer" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 4096 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --use_flash_attention 1 \ + --use_fused_ln 1 \ + --bf16 \ + --fp16_opt_level "O2" \ + --scale_loss 512 \ + --learning_rate 0.0003 \ + --min_learning_rate 0.00003 \ + --lr_scheduler_type "cosine" \ + --max_steps 300000 \ + --save_steps 200 \ + --adam_beta2 0.95 \ + --weight_decay 0.1 \ + --warmup_steps 2000 \ + --max_grad_norm 1.0 \ + --logging_steps 2 \ + --dataloader_num_workers 0 \ + --sharding "stage2" \ + --sharding_parallel_degree 8 \ + --eval_steps 200 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --continue_training 0\ + --recompute 1 \ + --do_train \ + --do_eval \ + --save_total_limit 10 \ + --device "gpu" +``` +注意: +1. 需要paddle develop版本训练,需要安装`pip install tool_helpers visualdl==2.5.3`等相关缺失whl包 +2. `use_flash_attention` 需要在A100机器开启,否则loss可能不正常(很快变成0.00x,非常小不正常)。建议使用cuda11.8环境。 +3. `continue_training` 表示从现有的预训练模型加载训练,如果需要从头开始预训练模型,则设置为0。 +4. `use_fused_ln` 需要安装[此目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/gpt-3/external_ops)下的自定义OP, `python setup.py install`。如果安装后仍然找不到算子,需要额外设置PYTHONPATH +5. 当前脚本为sharding版本,需要4D并行训练(数据、sharding、张量、流水线并行)的用户,可另外调整相关参数。 + + + +## 3. 
精调 + +### SFT +```shell +python -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + finetune_generation.py \ + --output_dir "output_sft/$task_name" \ + --per_device_train_batch_size 4 \ + --gradient_accumulation_steps 2 \ + --per_device_eval_batch_size 8 \ + --model_name_or_path \ + --task_name squad \ + --num_train_epochs 2 \ + --learning_rate 3e-5 \ + --warmup_steps 30 \ + --logging_steps 1 \ + --evaluation_strategy epoch \ + --save_strategy epoch \ + --src_length 1024 \ + --tgt_length 1024 \ + --bf16 \ + --fp16_opt_level O2 \ + --do_train \ + --do_eval \ + --disable_tqdm True \ + --load_best_model_at_end True \ + --metric_for_best_model accuracy \ + --eval_with_do_generation False \ + --recompute \ + --save_total_limit 1 \ + --overwrite_output_dir \ + --sharding "stage2" \ + --sharding_parallel_degree 8 +``` + +### LoRA +```shell +python finetune_generation.py \ + --output_dir ./checkpoints/ \ + --per_device_train_batch_size 4 \ + --gradient_accumulation_steps 2 \ + --per_device_eval_batch_size 8 \ + --model_name_or_path \ + --task_name squad \ + --num_train_epochs 2 \ + --learning_rate 3e-4 \ + --warmup_steps 30 \ + --logging_steps 1 \ + --evaluation_strategy epoch \ + --save_strategy epoch \ + --src_length 1024 \ + --tgt_length 1024 \ + --bf16 \ + --fp16_opt_level O2 \ + --do_train \ + --do_eval \ + --disable_tqdm True \ + --load_best_model_at_end True \ + --metric_for_best_model accuracy \ + --eval_with_do_generation False \ + --recompute \ + --save_total_limit 1 \ + --overwrite_output_dir \ + --lora True \ + --lora_rank 8 +``` + +其中参数释义如下: + +- `model_name_or_path`: 预训练模型内置名称或者模型所在目录. +- `num_train_epochs`: 要执行的训练 epoch 总数(如果不是整数,将在停止训练之前执行最后一个 epoch +的小数部分百分比)。 +- `max_steps`: 模型训练步数。 +- `learning_rate`: 参数更新的学习率。 +- `warmup_steps`: 学习率热启的步数。 +- `eval_steps`: 模型评估的间隔步数。 +- `logging_steps`: 训练日志打印的间隔步数。 +- `save_steps`: 模型参数保存的间隔步数。 +- `save_total_limit`: 模型 checkpoint 保存的份数。 +- `output_dir`: 模型参数保存目录。 +- `src_length`: 上下文的最大输入长度,默认为128. +- `tgt_length`: 生成文本的最大长度,默认为160. +- `gradient_accumulation_steps`: 模型参数梯度累积的步数,可用于扩大 batch size。实际的 batch_size = per_device_train_batch_size * gradient_accumulation_steps。 +- `bf16`: 使用 bfloat16 精度进行模型训练和推理。 +- `fp16_opt_level`: bfloat16 精度训练模式,`O2`表示纯 bfloat16 训练。 +- `recompute`: 使用重计算策略,开启后可节省训练显存。 +- `do_train`: 是否训练模型。 +- `do_eval`: 是否评估模型。 +- `tensor_parallel_degree`: 模型并行数量。 +- `eval_with_do_generation`: 在评估的时候是否调用model.generate,默认为False。 +- `lora`: 是否使用 LoRA 技术。 +- `merge_weights`: 是否合并原始模型和 LoRA 模型的权重。 +- `lora_rank`: LoRA 算法中rank(秩)的值,默认为8。 +- `lora_path`: LoRA 参数和配置路径,对 LoRA 参数进行初始化。 +- `task_name`: 内置数据集任务名 +- `data_name`: 内置数据集名,定义数据集名必须同时定义数据集任务名 +- `dataset_path`: 自定义数据集路径。 + + +## 4. 动态图预测 + +```shell +python predict_generation.py \ + --model_name_or_path \ + --tokenizer_name_or_path ernie-tokenizer +``` diff --git a/llm/ernie-3.5-se/configuration.py b/llm/ernie-3.5-se/configuration.py new file mode 100644 index 0000000000000000000000000000000000000000..ecd6e055e7e1c26aa9d308cbafbc6eaf4c9c914b --- /dev/null +++ b/llm/ernie-3.5-se/configuration.py @@ -0,0 +1,202 @@ +# !/usr/bin/env python3 +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Ernie35 model configuration""" + +from paddlenlp.transformers.configuration_utils import PretrainedConfig +from paddlenlp.utils.log import logger + +__all__ = [ + "ERNIE_PRETRAINED_INIT_CONFIGURATION", + "Ernie35Config", + "ERNIE_PRETRAINED_RESOURCE_FILES_MAP", +] + +ERNIE_PRETRAINED_INIT_CONFIGURATION = { + "ernie/tiny-random-ernie": { + "fuse_linear": False, + "fuse_ln": False, + "hidden_size": 768, + "ignored_index": -100, + "initializer_range": 0.02, + "intermediate_size": 2048, + "max_position_embeddings": 4096, + "model_type": "ernie", + "num_attention_heads": 12, + "num_hidden_layers": 3, + "pad_token_id": 0, + "parallel_attn_hatf": True, + "enable_random_position_ids": True, + "use_progressive_seq_len": False, + "layer_norm_eps": 1e-06, + "tensor_parallel_output": True, + "tie_word_embeddings": False, + "use_bias": True, + "use_flash_attention": True, + "use_recompute": False, + "use_recompute_attn": False, + "vocab_size": 32000, + "weight_share_add_bias": True, + }, + "baidu/ernie-3.5-se-3b": { + "fuse_linear": False, + "fuse_ln": False, + "hidden_size": 3072, + "ignored_index": -100, + "initializer_range": 0.02, + "intermediate_size": 8192, + "max_position_embeddings": 32768, # 32k + "model_type": "ernie", + "num_attention_heads": 24, + "num_hidden_layers": 32, + "pad_token_id": 0, + "parallel_attn_hatf": True, + "enable_random_position_ids": True, + "use_progressive_seq_len": True, + "layer_norm_eps": 1e-06, + "tensor_parallel_output": True, + "tie_word_embeddings": False, + "use_bias": True, + "use_flash_attention": True, + "use_recompute": False, + "use_recompute_attn": False, + "vocab_size": 32000, + "weight_share_add_bias": True, + }, +} + +# Hypothetical model weights currently +ERNIE_PRETRAINED_RESOURCE_FILES_MAP = { + "model_state": {}, +} + + +class Ernie35Config(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`~Ernie35Model`]. It is used to instantiate an Ernie35 + model according to the specified arguments, defining the model architecture. Instantiating a configuration with the + defaults will yield a similar configuration to that of the Ernie35. + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + Args: + vocab_size (`int`, *optional*, defaults to 65536): + Vocabulary size of the Ernie35 model. Defines the number of different tokens that can be represented by the + `inputs_ids` passed when calling [`~Ernie35Model`]. + hidden_size (`int`, *optional*, defaults to 3072): + Dimension of the hidden representations. + intermediate_size (`int`, *optional*, defaults to 8192): + Dimension of the MLP representations. + num_hidden_layers (`int`, *optional*, defaults to 32): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 32): + Number of attention heads for each attention layer in the Transformer encoder. 
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): + The non-linear activation function (function or string) in the decoder. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + layer_norm_eps (`float`, *optional*, defaults to 1e-12): + The epsilon used by the layer normalization layers. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if `config.is_decoder=True`. + tie_word_embeddings(`bool`, *optional*, defaults to `False`): + Whether to tie weight embeddings + """ + model_type = "ernie" + attribute_map = { + "n_positions": "max_position_embeddings", + "n_embd": "hidden_size", + "n_layer": "num_hidden_layers", + "n_head": "num_attention_heads", + "n_inner": "intermediate_size", + "activation_function": "hidden_act", + } + pretrained_init_configuration = ERNIE_PRETRAINED_INIT_CONFIGURATION + + def __init__( + self, + vocab_size=65536, + hidden_size=768, + intermediate_size=11008, + max_position_embeddings=2048, + num_hidden_layers=2, + num_attention_heads=2, + initializer_range=0.02, # no use + layer_norm_eps=1e-6, + use_cache=True, + use_flash_attention=True, + use_recompute=False, + use_recompute_attn=False, + fuse_ln=False, + tensor_parallel_output=True, + pad_token_id=0, + bos_token_id=1, + eos_token_id=2, + use_bias=False, + sequence_parallel=False, + weight_share=False, # non-PP only + weight_share_add_bias=True, + fuse_linear=False, + seqlen=False, + virtual_pp_degree=1, + ignored_index=-100, + parallel_attn_hatf=True, + enable_random_position_ids=False, + use_progressive_seq_len=False, + **kwargs, + ): + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.max_position_embeddings = max_position_embeddings + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.use_cache = use_cache + self.use_recompute_attn = use_recompute_attn + if use_recompute_attn: + logger.warning("set `use_recompute_attn`=True, disabling `use_recompute`") + use_recompute = False + self.use_recompute = use_recompute + self.use_flash_attention = use_flash_attention + self.tensor_parallel_output = tensor_parallel_output + self.pad_token_id = pad_token_id + self.bos_token_id = bos_token_id + self.eos_token_id = eos_token_id + self.fuse_ln = fuse_ln + self.sequence_parallel = sequence_parallel + self.seqlen = seqlen + self.virtual_pp_degree = virtual_pp_degree + self.use_bias = use_bias + self.weight_share_add_bias = weight_share_add_bias + kwargs["tie_word_embeddings"] = weight_share + self.fuse_linear = fuse_linear + self.ignored_index = ignored_index + self.parallel_attn_hatf = parallel_attn_hatf + self.enable_random_position_ids = enable_random_position_ids + self.use_progressive_seq_len = use_progressive_seq_len + super().__init__( + pad_token_id=pad_token_id, + bos_token_id=bos_token_id, + eos_token_id=eos_token_id, + tensor_parallel_output=tensor_parallel_output, + **kwargs, + ) + if self.sequence_parallel: + assert self.seqlen, "seqlen not provided in sequence-parallel" + assert ( + self.tensor_parallel_degree > 1 + ), f"senquence-parallel only works in mp, got mp={self.tensor_parallel_degree}" diff --git a/llm/ernie-3.5-se/conversion_utils.py 
b/llm/ernie-3.5-se/conversion_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..325a1cd30d7cc57e32f65927470cb0547268d96e --- /dev/null +++ b/llm/ernie-3.5-se/conversion_utils.py @@ -0,0 +1,198 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + + +def split_qkv_gate_up_tensor_parallel_weight( + weight, tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size, num_heads +): + """ + [QKV, G, U] -> [QKV1, G1, U1], [QKV2, G2, U2] + + Only support split Column dim. + + """ + assert weight.shape[-1] == 3 * hidden_size + 2 * intermediate_size, "input weight size dismatch!" + + if "PySafeSlice" in str(type(weight)): + QKV = weight[:, 0 : 3 * hidden_size] + G = weight[:, 3 * hidden_size : 3 * hidden_size + intermediate_size] + U = weight[:, 3 * hidden_size + intermediate_size :] + + # Split QKV + block_size = 3 * hidden_size // tensor_parallel_degree + start = tensor_parallel_rank * block_size + stop = (tensor_parallel_rank + 1) * block_size + assert ( + 3 * hidden_size % tensor_parallel_degree == 0 + ), f"The choosen size {hidden_size} is not compatible with sharding on {tensor_parallel_degree} shards" + qkv = QKV[:, start:stop] + + # Split G, U + block_size = intermediate_size // tensor_parallel_degree + start = tensor_parallel_rank * block_size + stop = (tensor_parallel_rank + 1) * block_size + assert ( + intermediate_size % tensor_parallel_degree == 0 + ), f"The choosen size {intermediate_size} is not compatible with sharding on {tensor_parallel_degree} shards" + g = G[:, start:stop] + u = U[:, start:stop] + + tensor = np.concatenate([qkv, g, u], axis=-1) + return tensor + + QKV, G, U = np.split(weight, [hidden_size * 3, hidden_size * 3 + intermediate_size], axis=-1) + assert ( + weight.shape[-1] % tensor_parallel_degree == 0 + ), f"The choosen size {weight.shape[-1]} is not compatible with sharding on {tensor_parallel_degree} shards, for tensor shape {weight.shape}" + sQKV, sG, sU = [np.split(item, tensor_parallel_degree, axis=-1) for item in [QKV, G, U]] + qkv, g, u = [item[tensor_parallel_rank] for item in [sQKV, sG, sU]] + tensor = np.concatenate([qkv, g, u], axis=-1) + return tensor + + +def merge_qkv_gate_up_tensor_parallel_weight(weight_list, tensor_parallel_degree, hidden_size, intermediate_size): + """ + [QKV1, G1, U1], [QKV2, G2, U2] -> [Q, K, V, G, U] + + Only support split Column dim. 
+ + """ + bhs = hidden_size // tensor_parallel_degree + bis = intermediate_size // tensor_parallel_degree + + qkv_l, g_l, u_l = [], [], [] + for weight in weight_list: + qkv, g, u = np.split(weight, [bhs * 3, bhs * 3 + bis], axis=-1) + qkv_l.append(qkv) + g_l.append(g) + u_l.append(u) + QKV, G, U = [np.concatenate(item, axis=-1) for item in [qkv_l, g_l, u_l]] + tensor = np.concatenate([QKV, G, U], axis=-1) + return tensor + + +def qkv_gate_up_proj_split_fn(tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size, num_heads): + def fn(x): + if x is None: + return None + x = split_qkv_gate_up_tensor_parallel_weight( + x, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + hidden_size=hidden_size, + intermediate_size=intermediate_size, + num_heads=num_heads, + ) + return x + + return fn + + +def qkv_gate_up_proj_merge_fn(tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size, num_heads): + def fn(x): + if x is None: + return None + x = merge_qkv_gate_up_tensor_parallel_weight( + x, + tensor_parallel_degree=tensor_parallel_degree, + hidden_size=hidden_size, + intermediate_size=intermediate_size, + ) + return x + + return fn + + +def split_o_tensor_parallel_weight( + weight, tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size +): + """ + Only support split Row dim. + """ + assert weight.shape[0] == intermediate_size + hidden_size, "input weight size dismatch!" + if "PySafeSlice" in str(type(weight)): + A = weight[:intermediate_size] + block_size = intermediate_size // tensor_parallel_degree + start = tensor_parallel_rank * block_size + stop = (tensor_parallel_rank + 1) * block_size + assert ( + intermediate_size % tensor_parallel_degree == 0 + ), f"The choosen size {intermediate_size} is not compatible with sharding on {tensor_parallel_degree} shards" + a = A[start:stop] + + B = weight[intermediate_size:] + block_size = hidden_size // tensor_parallel_degree + start = tensor_parallel_rank * block_size + stop = (tensor_parallel_rank + 1) * block_size + assert ( + hidden_size % tensor_parallel_degree == 0 + ), f"The choosen size {hidden_size} is not compatible with sharding on {tensor_parallel_degree} shards" + b = B[start:stop] + tensor = np.concatenate([a, b], axis=0) + return tensor + + A, B = np.split(weight, [intermediate_size], axis=0) + assert ( + weight.shape[0] % tensor_parallel_degree == 0 + ), f"The choosen size {weight.shape[-1]} is not compatible with sharding on {tensor_parallel_degree} shards, for tensor shape {weight.shape}" + sA = np.split(A, tensor_parallel_degree, axis=0) + sB = np.split(B, tensor_parallel_degree, axis=0) + a, b = [item[tensor_parallel_rank] for item in [sA, sB]] + tensor = np.concatenate([a, b], axis=0) + return tensor + + +def merge_o_tensor_parallel_weight(weight_list, tensor_parallel_degree, hidden_size, intermediate_size): + bis = intermediate_size // tensor_parallel_degree + a_l, b_l = [], [] + for weight in weight_list: + a, b = np.split(weight, [bis], axis=0) + a_l.append(a) + b_l.append(b) + A, B = [np.concatenate(item, axis=0) for item in [a_l, b_l]] + tensor = np.concatenate([A, B], axis=0) + return tensor + + +def o_proj_split_fn(tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size): + def fn(x): + if x is None: + return None + x = split_o_tensor_parallel_weight( + x, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + hidden_size=hidden_size, + 
intermediate_size=intermediate_size, + ) + return x + + return fn + + +def o_proj_merge_fn(tensor_parallel_degree, tensor_parallel_rank, hidden_size, intermediate_size): + def fn(x): + if x is None: + return None + x = merge_o_tensor_parallel_weight( + x, + tensor_parallel_degree=tensor_parallel_degree, + hidden_size=hidden_size, + intermediate_size=intermediate_size, + ) + return x + + return fn diff --git a/llm/ernie-3.5-se/data.py b/llm/ernie-3.5-se/data.py new file mode 100644 index 0000000000000000000000000000000000000000..e439183c6917f8db6a43488550ab249a36db0f39 --- /dev/null +++ b/llm/ernie-3.5-se/data.py @@ -0,0 +1,199 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import json + +import paddle + +IGNORE_INDEX = -100 + +PROMPT_DICT = { + "prompt_input": ( + "Below is an instruction that describes a task, paired with an input that provides further context. " + "Write a response that appropriately completes the request.\n\n" + "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:" + ), + "prompt_no_input": ( + "Below is an instruction that describes a task. " + "Write a response that appropriately completes the request.\n\n" + "### Instruction:\n{instruction}\n\n### Response:" + ), +} + + +def reader(data_path): + with open(data_path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + yield json_line + + +def convert_example(example, tokenizer, data_args, is_test=False): + """ + Convert an example into necessary features. + """ + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + if "context" in example: + context = example["context"] + question = example["question"] + try: + answer = example["answers"][0] + except: + print(example["context"]) + print(example["question"]) + print(example["answers"]) + print(example["answer_starts"]) + print(example["is_impossible"]) + input_seq = f"answer: {answer} context: {context}" + output_seq = f"question: {question}
" + elif "instruction" in example: + input_seq = f"{example['instruction']}" + output_seq = f"{example['output']} " + elif "src" in example: + context = example["src"][0] if isinstance(example["src"], list) else example["src"] + question = example["tgt"][0] if isinstance(example["tgt"], list) else example["tgt"] + input_seq = f"{context}" + output_seq = f"{question} " + else: + raise ValueError("Please check the dataset format.") + + source_tokenized = tokenizer( + input_seq, + return_tensors="pd", + max_length=data_args.src_length, + truncation=True, + ) + + source_input_ids_len = ( + source_tokenized["input_ids"].not_equal(paddle.to_tensor(tokenizer.pad_token_id)).sum().item() + ) + + example_tokenized = tokenizer( + input_seq + output_seq, + return_tensors="pd", + max_length=data_args.src_length + data_args.tgt_length, + padding=False, + truncation=True, + ) + + input_ids = example_tokenized["input_ids"][0] + labels = copy.deepcopy(input_ids) + labels[:source_input_ids_len] = IGNORE_INDEX + + if is_test: + return dict( + input_ids=source_tokenized["input_ids"][0], + labels=labels, + ) + + # shift labels + input_ids, labels = input_ids[:-1], labels[1:] + + return dict( + input_ids=input_ids, + labels=labels, + ) + + +def custom_instruction_convert_example( + example, tokenizer, data_args, is_test=False, benchmark=False, model_max_length=512 +): + """ + Convert an example into necessary features. + """ + + if benchmark: + prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"] + + if example.get("input", "") != "": + input_seq = prompt_input.format_map(example) + else: + input_seq = prompt_no_input.format_map(example) + + output_seq = example["output"] + tokenizer.eos_token + else: + instruction = "" + input = "" + output = "" + if "instruction" in example and "output" in example: + instruction = example["instruction"] + output = example["output"] + else: + assert False, "instruction and output are not in the input dictionary." 
+ if "input" in example["input"]: + input = example["input"] + + input_seq = instruction + input + output_seq = output + tokenizer.eos_token + + # To compatible with compile training mode in benchmark, input will be pad to fix length + source_tokenized = tokenizer( + input_seq, + return_tensors="pd", + max_length=data_args.src_length if not benchmark else model_max_length, + truncation=True, + ) + + source_input_ids_len = ( + source_tokenized["input_ids"].not_equal(paddle.to_tensor(tokenizer.pad_token_id)).sum().item() + ) + + total_length = data_args.src_length + data_args.tgt_length + + example_tokenized = tokenizer( + input_seq + output_seq, + return_tensors="pd", + max_length=total_length if not benchmark else model_max_length, + truncation=True, + ) + + input_ids = example_tokenized["input_ids"][0] + labels = copy.deepcopy(input_ids) + labels[:source_input_ids_len] = IGNORE_INDEX + + if is_test: + return dict( + input_ids=source_tokenized["input_ids"][0], + labels=labels, + ) + + # shift labels + input_ids, labels = input_ids[:-1], labels[1:] + + return dict( + input_ids=input_ids, + labels=labels, + ) + + +def left_padding(inputs, pad_id, max_length=-1): + for ids in inputs: + max_length = max(max_length, len(ids)) + + def extend_max_lenth(value, max_length, to_pad_id): + return [to_pad_id] * (max_length - len(value)) + value + + def extend_filed(values, max_length, to_pad_id): + res = [] + for value in values: + res.append(extend_max_lenth(value.tolist(), max_length, to_pad_id)) + return res + + res = extend_filed(inputs, max_length, pad_id) + return paddle.to_tensor(res) diff --git a/llm/ernie-3.5-se/ernie-tokenizer/sentencepiece.bpe.model b/llm/ernie-3.5-se/ernie-tokenizer/sentencepiece.bpe.model new file mode 100644 index 0000000000000000000000000000000000000000..ec2ed09d7a2cad429ad536c1582b7f769ef171cd Binary files /dev/null and b/llm/ernie-3.5-se/ernie-tokenizer/sentencepiece.bpe.model differ diff --git a/llm/ernie-3.5-se/ernie-tokenizer/special_tokens_map.json b/llm/ernie-3.5-se/ernie-tokenizer/special_tokens_map.json new file mode 100644 index 0000000000000000000000000000000000000000..1c141cfedc26fa17771e88d0a1372d4a7387a3ec --- /dev/null +++ b/llm/ernie-3.5-se/ernie-tokenizer/special_tokens_map.json @@ -0,0 +1 @@ +{"bos_token": {"content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}, "eos_token": {"content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}, "unk_token": {"content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}} \ No newline at end of file diff --git a/llm/ernie-3.5-se/ernie-tokenizer/tokenizer_config.json b/llm/ernie-3.5-se/ernie-tokenizer/tokenizer_config.json new file mode 100644 index 0000000000000000000000000000000000000000..c12b770ded0ecebb105bfa45d039ddc79076e952 --- /dev/null +++ b/llm/ernie-3.5-se/ernie-tokenizer/tokenizer_config.json @@ -0,0 +1 @@ +{"add_bos_token": true, "add_eos_token": false, "model_max_length": 2048, "pad_token": null, "sp_model_kwargs": {}, "tokenizer_class": "Ernie35Tokenizer", "clean_up_tokenization_spaces": false, "bos_token": {"__type": "AddedToken", "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}, "eos_token": {"__type": "AddedToken", "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}, "unk_token": {"__type": "AddedToken", "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false}} diff --git 
a/llm/ernie-3.5-se/ernie_dataset.py b/llm/ernie-3.5-se/ernie_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..120b439260c183e28067a456be734dc7eb46dfe0 --- /dev/null +++ b/llm/ernie-3.5-se/ernie_dataset.py @@ -0,0 +1,958 @@ +# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. + +import hashlib +import math +import os +import time +from itertools import accumulate + +import numpy as np +import paddle +from paddle.distributed import fleet + +local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + +class FakeHCG: + def get_data_parallel_group(self): + return None + + def get_pipe_parallel_group(self): + return None + + def get_model_parallel_group(self): + return None + + +class MMapIndexedDataset(paddle.io.Dataset): + def __init__(self, path, skip_warmup=False): + super().__init__() + + self._path = path + + # All documment ids, extend as 1-D array. + + for suffix in ["_ids.npy", "_idx.npz"]: + if not os.path.isfile(path + suffix): + raise ValueError("File Not found, %s" % (path + suffix)) + + self._token_ids = np.load(path + "_ids.npy", mmap_mode="r", allow_pickle=True) + process_data = np.load(path + "_idx.npz") + self._sizes = process_data["lens"] + self._pointers = np.empty(len(self._sizes) + 1, dtype=np.int64) + self._pointers[0] = 0 + np.cumsum(self._sizes, out=self._pointers[1:]) + self._doc_idx = process_data["docs"] + + def __getstate__(self): + return self._path + + def __len__(self): + return len(self._sizes) + + # @lru_cache(maxsize=8) + def __getitem__(self, idx): + if isinstance(idx, int): + size = self._sizes[idx] + ptr = self._pointers[idx] + np_array = self._token_ids[ptr : ptr + size] + return np_array + + elif isinstance(idx, slice): + start, stop, step = idx.indices(len(self)) + if step != 1: + raise ValueError("Slices into indexed_dataset must be contiguous") + ptr = self._pointers[start] + sizes = self._sizes[idx] + offsets = list(accumulate(sizes)) + total_size = sum(sizes) + np_array = self._token_ids[ptr : ptr + total_size] + sents = np.split(np_array, offsets[:-1]) + return sents + + def get(self, idx, offset=0, length=None): + """Retrieves a single item from the dataset with the option to only + return a portion of the item. + + get(idx) is the same as [idx] but get() does not support slicing. + """ + size = self._sizes[idx] + ptr = self._pointers[idx] + + if length is None: + length = size - offset + ptr += offset + np_array = self._token_ids[ptr : ptr + length] + return np_array + + @property + def sizes(self): + return self._sizes + + @property + def doc_idx(self): + return self._doc_idx + + def get_doc_idx(self): + return self._doc_idx + + def set_doc_idx(self, doc_idx_): + self._doc_idx = doc_idx_ + + +class BlendableDataset(paddle.io.Dataset): + def __init__(self, datasets, weights, size, *, data_cache_path=None): + + self.datasets = datasets + num_datasets = len(datasets) + assert num_datasets == len(weights) + + self.size = size + + # Normalize weights. + weights = np.array(weights, dtype=np.float64) + sum_weights = np.sum(weights) + assert sum_weights > 0.0 + weights /= sum_weights + + # Build indicies. 
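+ # dataset_index[i] records which dataset global sample i is drawn from, and dataset_sample_index[i] is that sample's position within the chosen dataset; both arrays are filled in place by helpers.build_blending_indices below.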
+ def _build_indices(): + start_time = time.time() + assert num_datasets < 255 + dataset_index = np.zeros(self.size, dtype=np.uint8) + dataset_sample_index = np.zeros(self.size, dtype=np.int64) + + from tool_helpers import helpers + + helpers.build_blending_indices( + dataset_index, + dataset_sample_index, + weights, + num_datasets, + self.size, + local_rank == 0, + # paddle.distributed.get_rank() == 0, + ) + print_rank_0( + "> elapsed time for building blendable dataset indices: " + "{:.2f} (sec)".format(time.time() - start_time) + ) + return dataset_index, dataset_sample_index + + desc = "Blendable dataset\n\n" + desc += "Datasets:\n" + for dataset in datasets: + desc += dataset.desc + "\n\n" + desc += f"Weights: {weights}\n" + desc += f"Size: {size}\n" + self.desc = desc + + if data_cache_path: + desc_hash = hashlib.md5(desc.encode("utf-8")).hexdigest() + desc_path = os.path.join(data_cache_path, desc_hash + ".dsc") + index_path = os.path.join(data_cache_path, desc_hash + "_index.npy") + sample_index_path = os.path.join(data_cache_path, desc_hash + "_sample_index.npy") + cache_hit = os.path.isfile(index_path) and os.path.isfile(sample_index_path) + cache_success = True + # if paddle.distributed.get_rank() == 0 and not cache_hit: + if local_rank == 0 and not cache_hit: + print( + " > WARNING: could not find index map files for blendable" + " dataset, building indices on rank 0 ...", + flush=True, + ) + dataset_index, dataset_sample_index = _build_indices() + try: + os.makedirs(os.path.dirname(index_path), exist_ok=True) + with open(desc_path, "wt") as fd: + fd.write(desc) + np.save(index_path, dataset_index, allow_pickle=True) + np.save(sample_index_path, dataset_sample_index, allow_pickle=True) + except OSError: + print(f"There was an error trying to create the data cache directory ({data_cache_path})") + print("or a file in it. This is set with the --data-cache-path argument. Please") + print("ensure you have write access to this directory or specify one that you do have") + print("write access to.") + cache_success = False + + try: + hcg = paddle.distributed.fleet.get_hybrid_communicate_group() + except: + hcg = FakeHCG() + + counts = paddle.to_tensor([cache_success], dtype="int64") + paddle.distributed.all_reduce(counts, group=hcg.get_data_parallel_group()) + paddle.distributed.all_reduce(counts, group=hcg.get_pipeline_model_parallel_group()) + if counts[0].item() != ( + paddle.distributed.get_world_size() + // paddle.distributed.get_world_size(group=hcg.get_tensor_model_parallel_group()) + ): + print_rank_0("Data index creation unsuccessful, exiting.") + exit() + + paddle.distributed.barrier() + # Load on all ranks. 
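+ # Once rank 0 has written the cached index files, every rank memory-maps them from disk (np.load with mmap_mode="r") instead of rebuilding.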
+ print_rank_0(f"> loading blendable dataset index: {index_path}") + self.dataset_index = np.load(index_path, allow_pickle=True, mmap_mode="r") + assert self.dataset_index.size == self.size + + print_rank_0(f"> loading blendable dataset sample index: {sample_index_path}") + self.dataset_sample_index = np.load(sample_index_path, allow_pickle=True, mmap_mode="r") + assert self.dataset_sample_index.size == self.size + else: + self.dataset_index, self.dataset_sample_index = _build_indices() + + # Check size + _ = self.__getitem__(self.size - 1) + try: + _ = self.__getitem__(self.size) + raise RuntimeError("BlendedDataset size is improperly bounded") + except IndexError: + pass + print_rank_0("> size of blendable dataset: " "{} samples".format(self.size)) + + def __len__(self): + return self.size + + def __getitem__(self, idx): + dataset_idx = self.dataset_index[idx] + sample_idx = self.dataset_sample_index[idx] + return { + "dataset_idx": dataset_idx, + **self.datasets[dataset_idx][sample_idx], + } + + +def make_indexed_dataset(data_prefix, data_impl=None, skip_warmup=False): + return MMapIndexedDataset(data_prefix) + + +def get_datasets_weights_and_num_samples(data_prefix, train_valid_test_num_samples): + + # The data prefix should be in the format of: + # weight-1, data-prefix-1, weight-2, data-prefix-2, .. + assert len(data_prefix) % 2 == 0 + num_datasets = len(data_prefix) // 2 + weights = [0] * num_datasets + prefixes = [0] * num_datasets + for i in range(num_datasets): + weights[i] = float(data_prefix[2 * i]) + prefixes[i] = (data_prefix[2 * i + 1]).strip() + # Normalize weights + weight_sum = 0.0 + for weight in weights: + weight_sum += weight + assert weight_sum > 0.0 + weights = [weight / weight_sum for weight in weights] + + # Add 0.5% (the 1.005 factor) so in case the bleding dataset does + # not uniformly distribute the number of samples, we still have + # samples left to feed to the network. + datasets_train_valid_test_num_samples = [] + for weight in weights: + datasets_train_valid_test_num_samples.append( + [int(math.ceil(val * weight * 1.005)) for val in train_valid_test_num_samples] + ) + + return prefixes, weights, datasets_train_valid_test_num_samples + + +def print_rank_0(*args, **kwargs): + if paddle.distributed.get_rank() == 0: + print(*args, **kwargs) + + +def build_train_valid_test_datasets( + data_prefix, + data_impl, + splits_string, + train_valid_test_num_samples, + seq_length, + seed, + skip_warmup, + train_data_prefix=None, + valid_data_prefix=None, + test_data_prefix=None, + return_doc_ids=False, + *, + data_cache_path=None +): + """Build train, valid, and test datasets.""" + + if data_prefix: + print_rank_0("Single data path provided for train, valid & test") + + # Single dataset. + if len(data_prefix) == 1: + return _build_train_valid_test_datasets( + data_prefix[0], + data_impl, + splits_string, + train_valid_test_num_samples, + seq_length, + seed, + skip_warmup, + data_cache_path=data_cache_path, + ) + + # Blending dataset. + # Parse the values. + output = get_datasets_weights_and_num_samples(data_prefix, train_valid_test_num_samples) + prefixes, weights, datasets_train_valid_test_num_samples = output + train_num_samples, valid_num_samples, test_num_samples = map(sum, zip(*datasets_train_valid_test_num_samples)) + + # Build individual datasets. 
+ train_datasets = [] + valid_datasets = [] + test_datasets = [] + for i in range(len(prefixes)): + train_ds, valid_ds, test_ds = _build_train_valid_test_datasets( + prefixes[i], + data_impl, + splits_string, + datasets_train_valid_test_num_samples[i], + seq_length, + seed, + skip_warmup, + return_doc_ids, + data_cache_path=data_cache_path, + ) + if train_ds: + train_datasets.append(train_ds) + if valid_ds: + valid_datasets.append(valid_ds) + if test_ds: + test_datasets.append(test_ds) + + # Blend. + blending_train_dataset = None + if train_datasets: + blending_train_dataset = BlendableDataset( + train_datasets, weights, train_num_samples, data_cache_path=data_cache_path + ) + blending_valid_dataset = None + if valid_datasets: + blending_valid_dataset = BlendableDataset( + valid_datasets, weights, valid_num_samples, data_cache_path=data_cache_path + ) + blending_test_dataset = None + if test_datasets: + blending_test_dataset = BlendableDataset( + test_datasets, weights, test_num_samples, data_cache_path=data_cache_path + ) + + return (blending_train_dataset, blending_valid_dataset, blending_test_dataset) + + else: + print_rank_0("Separate data paths provided for train, valid & test. Split string will be ignored.") + + train_dataset, valid_dataset, test_dataset = None, None, None + # Single dataset. + if train_data_prefix is not None: + train_dataset = build_dataset( + "train", + train_data_prefix, + data_impl, + splits_string, + train_valid_test_num_samples[0], + seq_length, + seed, + skip_warmup, + data_cache_path=data_cache_path, + ) + + if valid_data_prefix is not None: + valid_dataset = build_dataset( + "valid", + valid_data_prefix, + data_impl, + splits_string, + train_valid_test_num_samples[1], + seq_length, + seed, + False, + data_cache_path=data_cache_path, + ) + + if test_data_prefix is not None: + test_dataset = build_dataset( + "test", + test_data_prefix, + data_impl, + splits_string, + train_valid_test_num_samples[2], + seq_length, + seed, + False, + data_cache_path=data_cache_path, + ) + + return (train_dataset, valid_dataset, test_dataset) + + +def get_train_valid_test_split_(splits_string, size): + """Get dataset splits from comma or '/' separated string list.""" + + splits = [] + if splits_string.find(",") != -1: + splits = [float(s) for s in splits_string.split(",")] + elif splits_string.find("/") != -1: + splits = [float(s) for s in splits_string.split("/")] + else: + splits = [float(splits_string)] + while len(splits) < 3: + splits.append(0.0) + splits = splits[:3] + splits_sum = sum(splits) + assert splits_sum > 0.0 + splits = [split / splits_sum for split in splits] + splits_index = [0] + for index, split in enumerate(splits): + splits_index.append(splits_index[index] + int(round(split * float(size)))) + diff = splits_index[-1] - size + for index in range(1, len(splits_index)): + splits_index[index] -= diff + assert len(splits_index) == 4 + assert splits_index[-1] == size + return splits_index + + +def _build_train_valid_test_datasets( + data_prefix, + data_impl, + splits_string, + train_valid_test_num_samples, + seq_length, + seed, + skip_warmup, + return_doc_ids=False, + *, + data_cache_path=None +): + """Build train, valid, and test datasets.""" + + # Indexed dataset. + indexed_dataset = get_indexed_dataset_(data_prefix, data_impl, skip_warmup) + + total_num_of_documents = indexed_dataset.sizes.shape[0] + splits = get_train_valid_test_split_(splits_string, total_num_of_documents) + + # Print stats about the splits. 
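+ # splits holds four cumulative document indices [0, train_end, valid_end, total], so split i owns documents in [splits[i], splits[i + 1]).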
+ print_rank_0(" > dataset split:") + + def print_split_stats(name, index): + print_rank_0(" {}:".format(name)) + print_rank_0( + " document indices in [{}, {}) total of {} " + "documents".format(splits[index], splits[index + 1], splits[index + 1] - splits[index]) + ) + + print_split_stats("train", 0) + print_split_stats("validation", 1) + print_split_stats("test", 2) + + def build_dataset(index, name): + dataset = None + if splits[index + 1] > splits[index]: + documents = np.arange(start=splits[index], stop=splits[index + 1], step=1, dtype=np.int32) + dataset = LLMDataset( + name, + data_prefix, + documents, + indexed_dataset, + splits_string, + train_valid_test_num_samples[index], + seq_length, + seed, + return_doc_ids, + data_cache_path=data_cache_path, + ) + return dataset + + train_dataset = build_dataset(0, "train") + valid_dataset = build_dataset(1, "valid") + test_dataset = build_dataset(2, "test") + + return (train_dataset, valid_dataset, test_dataset) + + +def build_dataset( + dataset_name, + data_prefix, + data_impl, + splits_string, + num_samples, + seq_length, + seed, + skip_warmup, + *, + data_cache_path=None +): + dataset = None + if len(data_prefix) == 1: + dataset = _build_dataset( + dataset_name, + data_prefix[0], + data_impl, + splits_string, + num_samples, + seq_length, + seed, + skip_warmup, + data_cache_path=data_cache_path, + ) + else: + # Blending dataset. + # Parse the values. + output = get_datasets_weights_and_num_samples(data_prefix, num_samples) + prefixes, weights, dataset_num_samples = output + num_samples = sum(dataset_num_samples) + + # Build individual datasets. + datasets = [] + for i in range(len(prefixes)): + ds = _build_dataset( + dataset_name, + prefixes[i], + data_impl, + splits_string, + dataset_num_samples[i], + seq_length, + seed, + skip_warmup, + data_cache_path=data_cache_path, + ) + if ds: + datasets.append(ds) + + if datasets: + dataset = BlendableDataset(datasets, weights, num_samples, data_cache_path=data_cache_path) + + return dataset + + +def _build_dataset( + dataset_name, + data_prefix, + data_impl, + splits_string, + num_samples, + seq_length, + seed, + skip_warmup, + *, + data_cache_path=None +): + """ + Build dataset. This method is called when individual + train, valid, test datasets are provided + """ + + # Indexed dataset. 
+ indexed_dataset = get_indexed_dataset_(data_prefix, data_impl, skip_warmup) + + total_num_of_documents = indexed_dataset.sizes.shape[0] + + print_rank_0(" {}:".format(dataset_name)) + print_rank_0( + " document indices in [0, {}) total of {} " + "documents".format(total_num_of_documents, total_num_of_documents) + ) + + documents = np.arange(start=0, stop=total_num_of_documents, step=1, dtype=np.int32) + + dataset = LLMDataset( + dataset_name, + data_prefix, + documents, + indexed_dataset, + splits_string, + num_samples, + seq_length, + seed, + data_cache_path=data_cache_path, + ) + + return dataset + + +def get_indexed_dataset_(data_prefix, data_impl, skip_warmup): + """Build indexed dataset.""" + print_rank_0(" > building dataset index ...") + + start_time = time.time() + indexed_dataset = make_indexed_dataset(data_prefix, data_impl, skip_warmup) + print_rank_0(" > finished creating indexed dataset in {:4f} " "seconds".format(time.time() - start_time)) + print_rank_0(" number of documents: {}".format(indexed_dataset.sizes.shape[0])) + + return indexed_dataset + + +class LLMDataset(paddle.io.Dataset): + def __init__( + self, + name, + data_prefix, + documents, + indexed_dataset, + splits_string, + num_samples, + seq_length, + seed, + return_doc_ids=False, + *, + data_cache_path=None + ): + + self.name = name + self.indexed_dataset = indexed_dataset + self.return_doc_ids = return_doc_ids + + # Checks + assert np.min(documents) >= 0 + assert np.max(documents) < indexed_dataset.sizes.shape[0] + + # Build index mappings. + self.doc_idx, self.sample_idx, self.shuffle_idx, self.desc, self.desc_hash = _build_index_mappings( + self.name, + data_prefix, + documents, + self.indexed_dataset.sizes, + splits_string, + num_samples, + seq_length, + seed, + data_cache_path=data_cache_path, + ) + + def __len__(self): + # -1 is due to data structure used to retieve the index: + # sample i --> [sample_idx[i], sample_idx[i+1]) + return self.sample_idx.shape[0] - 1 + + def __getitem__(self, idx): + # Get the shuffled index. + idx = self.shuffle_idx[idx] + # Start and end documents and offsets. + doc_index_f = self.sample_idx[idx][0] + doc_index_l = self.sample_idx[idx + 1][0] + offset_f = self.sample_idx[idx][1] + offset_l = self.sample_idx[idx + 1][1] + # If we are within the same document, just extract the chunk. + doc_ids = [] + if doc_index_f == doc_index_l: + doc_ids.append(self.doc_idx[doc_index_f]) + sample = self.indexed_dataset.get( + self.doc_idx[doc_index_f], offset=offset_f, length=offset_l - offset_f + 1 + ) + else: + # Otherwise, get the rest of the initial document. + doc_ids.append(self.doc_idx[doc_index_f]) + sample_list = [self.indexed_dataset.get(self.doc_idx[doc_index_f], offset=offset_f)] + # Loop over all in between documents and add the entire document. + for i in range(doc_index_f + 1, doc_index_l): + doc_ids.append(self.doc_idx[i]) + sample_list.append(self.indexed_dataset.get(self.doc_idx[i])) + # And finally add the relevant portion of last document. 
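+ # A sample that crosses document boundaries is stitched from the tail of the first document, any whole documents in between, and the first offset_l + 1 tokens of the last one.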
+ doc_ids.append(self.doc_idx[doc_index_l]) + sample_list.append(self.indexed_dataset.get(self.doc_idx[doc_index_l], length=offset_l + 1)) + sample = np.concatenate(sample_list) + + if self.return_doc_ids: # for retro preprocessing + return {"text": np.array(sample, dtype=np.int64), "doc_ids": np.array(doc_ids, dtype=np.int64)} + else: + return {"text": np.array(sample, dtype=np.int64)} + + +def _build_index_mappings( + name, data_prefix, documents, sizes, splits_string, num_samples, seq_length, seed, *, data_cache_path +): + """Build doc-idx, sample-idx, and shuffle-idx. + doc-idx: is an array (ordered) of documents to be used in training. + sample-idx: is the start document index and document offset for each + training sample. + shuffle-idx: maps the sample index into a random index into sample-idx. + """ + # Number of tokens in each epoch and number of required epochs. + tokens_per_epoch = _num_tokens(documents, sizes) + num_epochs = _num_epochs(tokens_per_epoch, seq_length, num_samples) + + # rng state + np_rng = np.random.RandomState(seed=seed) + + # Filename of the index mappings. + desc = "LLM Dataset\n\n" + desc += f"Data prefix {data_prefix}\n" + desc += f"Dataset name {name}\n" + desc += f"Number of samples {num_samples}\n" + desc += f"Sequence length {seq_length}\n" + desc += f"Random seed {seed}\n" + desc += f"Split {splits_string}\n" + desc_hash = hashlib.md5(desc.encode("utf-8")).hexdigest() + print_rank_0(desc_hash, desc) + desc_filename = desc_hash + ".dsc" + doc_idx_filename = desc_hash + "_doc_idx.npy" + sample_idx_filename = desc_hash + "_sample_idx.npy" + shuffle_idx_filename = desc_hash + "_shuffle_idx.npy" + + # Look for cache in main data dir first to avoid unnecessary + # duplication, then look in data-cache-path if specified, + # If nothing is found, use the last path looked in + build_indices = True + prefixes = [os.path.join(os.path.dirname(data_prefix), "index-cache")] + if data_cache_path is not None: + prefixes.append(data_cache_path) + for prefix in prefixes: + idx_path = { + "desc": os.path.join(prefix, desc_filename), + "doc": os.path.join(prefix, doc_idx_filename), + "sample": os.path.join(prefix, sample_idx_filename), + "shuffle": os.path.join(prefix, shuffle_idx_filename), + } + for f in idx_path.values(): + if not os.path.isfile(f): + break + else: + # Found our files! + build_indices = False + break + data_cache_dir = os.path.dirname(idx_path["desc"]) + data_cache_success = True + + # local_rank = int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + # Build the indexed mapping if not exist. + if build_indices and paddle.distributed.get_rank() == 0: + # if build_indices and local_rank == 0: + print_rank_0(" > WARNING: could not find index map files, building " "the indices on rank 0 ...") + + # For the last epoch, decide whether include the entire epoch + # in the global shuffle or not. + + # If we need only one epoch, then separating last epoch does + # not mean anything. + if num_epochs == 1: + separate_last_epoch = False + print(" > only one epoch required, setting " "separate_last_epoch to False", flush=True) + + else: + # Get the number of samples for the last epoch + num_samples_from_epochs_minus_one = ((num_epochs - 1) * tokens_per_epoch - 1) // seq_length + last_epoch_num_samples = num_samples - num_samples_from_epochs_minus_one + assert last_epoch_num_samples >= 0, "last epoch number of samples should be non-negative." 
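+ # num_samples_per_epoch is the number of full seq_length windows a single epoch of tokens can supply.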
+ num_samples_per_epoch = (tokens_per_epoch - 1) // seq_length + assert last_epoch_num_samples <= ( + num_samples_per_epoch + 1 + ), "last epoch number of samples exceeded max value." + # If we have less than 80% of the samples for the last epoch, + # seperate out the epoch and treat it differently. + # Note: the 80% number is just based on common sense and can + # be adjusted if needed. + separate_last_epoch = last_epoch_num_samples < int(0.80 * num_samples_per_epoch) + if separate_last_epoch: + string = ( + " > last epoch number of samples ({}) is smaller " + "than 80% of number of samples per epoch ({}), " + "setting separate_last_epoch to True" + ) + else: + string = ( + " > last epoch number of samples ({}) is larger " + "than 80% of number of samples per epoch ({}), " + "setting separate_last_epoch to False" + ) + print(string.format(last_epoch_num_samples, num_samples_per_epoch), flush=True) + + try: + os.makedirs(data_cache_dir, exist_ok=True) + + # description + with open(idx_path["desc"], "wt") as fd: + fd.write(desc) + + # doc-idx. + start_time = time.time() + doc_idx = _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch) + np.save(idx_path["doc"], doc_idx, allow_pickle=True) + print_rank_0( + " > elasped time to build and save doc-idx mapping " + "(seconds): {:4f}".format(time.time() - start_time) + ) + # sample-idx. + start_time = time.time() + # Use C++ implementation for speed. + # First compile and then import. + # from megatron.data import helpers + from tool_helpers import helpers + + assert doc_idx.dtype == np.int32 + assert sizes.dtype == np.int32 + sample_idx = helpers.build_sample_idx(sizes, doc_idx, seq_length, num_epochs, tokens_per_epoch) + np.save(idx_path["sample"], sample_idx, allow_pickle=True) + print_rank_0( + " > elasped time to build and save sample-idx mapping " + "(seconds): {:4f}".format(time.time() - start_time) + ) + # shuffle-idx. + start_time = time.time() + # -1 is due to data structure used to retieve the index: + # sample i --> [sample_idx[i], sample_idx[i+1]) + if separate_last_epoch: + num_samples_ = num_samples_from_epochs_minus_one + else: + num_samples_ = sample_idx.shape[0] - 1 + shuffle_idx = _build_shuffle_idx(num_samples_, sample_idx.shape[0] - 1, np_rng) + np.save(idx_path["shuffle"], shuffle_idx, allow_pickle=True) + print_rank_0( + " > elasped time to build and save shuffle-idx mapping" + " (seconds): {:4f}".format(time.time() - start_time) + ) + except OSError: + print(f"There was an error trying to create the data cache directory ({data_cache_dir})") + print('or a file in it. This defaults to a directory "index-cache" within the directory') + print("the data files are in and can be set with the --data-cache-path argument. 
Please") + print("ensure you have write access to this directory or specify one that you do have") + print("write access to.") + data_cache_success = False + + try: + hcg = fleet.get_hybrid_communicate_group() + except: + hcg = FakeHCG() + + counts = paddle.to_tensor([data_cache_success], dtype="int64") + if paddle.distributed.get_world_size() > 1: + paddle.distributed.all_reduce(counts, group=hcg.get_data_parallel_group()) + paddle.distributed.all_reduce(counts, group=hcg.get_sharding_parallel_group()) + paddle.distributed.all_reduce(counts, group=hcg.get_pipe_parallel_group()) + + if counts[0].item() != ( + paddle.distributed.get_world_size() + // max(paddle.distributed.get_world_size(group=hcg.get_model_parallel_group()), 1) + ): + print_rank_0("Data index creation unsuccessful, exiting.") + exit() + + paddle.distributed.barrier() + # Load mappings. + start_time = time.time() + print_rank_0(f" > loading doc-idx mapping from {idx_path['doc']}") + doc_idx = np.load(idx_path["doc"], allow_pickle=True, mmap_mode="r") + + print_rank_0(f" > loading sample-idx mapping from {idx_path['sample']}") + sample_idx = np.load(idx_path["sample"], allow_pickle=True, mmap_mode="r") + + print_rank_0(f" > loading shuffle-idx mapping from {idx_path['shuffle']}") + shuffle_idx = np.load(idx_path["shuffle"], allow_pickle=True, mmap_mode="r") + + print_rank_0(" loaded indexed file in {:3.3f} seconds".format(time.time() - start_time)) + print_rank_0(" total number of samples: {}".format(sample_idx.shape[0])) + print_rank_0(" total number of epochs: {}".format(num_epochs)) + + return doc_idx, sample_idx, shuffle_idx, desc, desc_hash + + +def _num_tokens(documents, sizes): + """Total number of tokens in the dataset.""" + return np.sum(sizes[documents]) + + +def _num_epochs(tokens_per_epoch, seq_length, num_samples): + """Based on number of samples and sequence lenght, calculate how many + epochs will be needed.""" + num_epochs = 0 + total_tokens = 0 + while True: + num_epochs += 1 + total_tokens += tokens_per_epoch + # -1 is because we need to retrieve seq_length + 1 token each time + # but the last token will overlap with the first token of the next + # sample except for the last sample. + if ((total_tokens - 1) // seq_length) >= num_samples: + return num_epochs + + +def _build_doc_idx(documents, num_epochs, np_rng, separate_last_epoch): + """Build an array with length = number-of-epochs * number-of-dcuments. + Each index is mapped to a corresponding document.""" + if not separate_last_epoch or num_epochs == 1: + doc_idx = np.mgrid[0:num_epochs, 0 : len(documents)][1] + doc_idx[:] = documents + doc_idx = doc_idx.reshape(-1) + doc_idx = doc_idx.astype(np.int32) + np_rng.shuffle(doc_idx) + return doc_idx + + doc_idx_first = _build_doc_idx(documents, num_epochs - 1, np_rng, False) + doc_idx_last = _build_doc_idx(documents, 1, np_rng, False) + return np.concatenate((doc_idx_first, doc_idx_last)) + + +def _build_sample_idx(sizes, doc_idx, seq_length, num_epochs, tokens_per_epoch): + """Sample index mapping is a 2D array with sizes + [number-of-samples + 1, 2] where [..., 0] contains + the index into `doc_idx` and [..., 1] is the + starting offset in that document.""" + + # Total number of samples. For -1 see comments in `_num_epochs`. + num_samples = (num_epochs * tokens_per_epoch - 1) // seq_length + sample_idx = np.zeros([num_samples + 1, 2], dtype=np.int32) + + # Index into sample_idx. + sample_index = 0 + # Index into doc_idx. + doc_idx_index = 0 + # Begining offset for each document. 
+ doc_offset = 0 + # Start with first document and no offset. + sample_idx[sample_index][0] = doc_idx_index + sample_idx[sample_index][1] = doc_offset + sample_index += 1 + while sample_index <= num_samples: + # Start with a fresh sequence. + remaining_seq_length = seq_length + 1 + while remaining_seq_length != 0: + # Get the document length. + doc_id = doc_idx[doc_idx_index] + doc_length = sizes[doc_id] - doc_offset + # And add it to the current sequence. + remaining_seq_length -= doc_length + # If we have more than a full sequence, adjust offset and set + # remaining length to zero so we return from the while loop. + # Note that -1 here is for the same reason we have -1 in + # `_num_epochs` calculations. + if remaining_seq_length <= 0: + doc_offset += remaining_seq_length + doc_length - 1 + remaining_seq_length = 0 + else: + # Otherwise, start from the begining of the next document. + doc_idx_index += 1 + doc_offset = 0 + # Record the sequence. + sample_idx[sample_index][0] = doc_idx_index + sample_idx[sample_index][1] = doc_offset + sample_index += 1 + + return sample_idx + + +def _build_shuffle_idx(num_samples, total_size, np_rng): + """Build the range [0, size) and shuffle.""" + print( + " > building shuffle index with split [0, {}) and [{}, {}) " + "...".format(num_samples, num_samples, total_size), + flush=True, + ) + + dtype_ = np.uint32 + if total_size >= (np.iinfo(np.uint32).max - 1): + dtype_ = np.int64 + + shuffle_idx_first = np.arange(start=0, stop=num_samples, step=1, dtype=dtype_) + np_rng.shuffle(shuffle_idx_first) + if num_samples == total_size: + return shuffle_idx_first + + shuffle_idx_last = np.arange(start=num_samples, stop=total_size, step=1, dtype=dtype_) + np_rng.shuffle(shuffle_idx_last) + + return np.concatenate((shuffle_idx_first, shuffle_idx_last)) diff --git a/llm/ernie-3.5-se/finetune_generation.py b/llm/ernie-3.5-se/finetune_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..0b23b9e24c953848ef714aa655b4d889ca29e367 --- /dev/null +++ b/llm/ernie-3.5-se/finetune_generation.py @@ -0,0 +1,266 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
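+# Fine-tuning entry point for ERNIE 3.5 SE: loads Ernie35ForCausalLM, optionally applies LoRA, and runs training/evaluation through Ernie35Trainer.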
+ +import os +import sys +from dataclasses import dataclass, field +from functools import partial + +import paddle +from data import convert_example, custom_instruction_convert_example, reader +from modeling import Ernie35ForCausalLM +from tokenizer import Ernie35Tokenizer +from utils import Ernie35Trainer, compute_metrics, compute_metrics_not_do_generation + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.datasets import load_dataset +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.trainer import ( + PdArgumentParser, + TrainingArguments, + get_last_checkpoint, + set_seed, +) +from paddlenlp.utils.log import logger + + +@dataclass +class DataArgument: + + data_name: str = field(default=None, metadata={"help": "The name of data."}) + task_name: str = field(default=None, metadata={"help": "The name of task."}) + dataset_path: str = field(default=None, metadata={"help": "The file name of train dataset."}) + src_length: int = field(default=512, metadata={"help": "The max length of source text."}) + tgt_length: int = field(default=256, metadata={"help": "The max length of target text."}) + + +@dataclass +class ModelArgument: + model_name_or_path: str = field( + default="baidu/ernie-3.5-se-3b", + metadata={"help": "Build-in pretrained model name or the path to local model."}, + ) + label_smoothing: float = field(default=0.1, metadata={"help": "The label smoothing parameter."}) + lr_decay_ratio: float = field(default=0.1, metadata={"help": "The ratio for learning rate decrease"}) + use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + eval_with_do_generation: bool = field( + default=False, metadata={"help": "Evaluate with generation, instead for calc loss."} + ) + benchmark: bool = field( + default=False, + metadata={"help": "Whether or not run benchmark."}, + ) + profiler_options: str = field( + default=None, + metadata={"help": "profiler_options."}, + ) + # lora + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + lora_path: str = field(default=None, metadata={"help": "Initialize lora state dict."}) + lora_rank: int = field(default=4, metadata={"help": "Lora attention dimension"}) + merge_weights: bool = field( + default=False, metadata={"help": "Merge weights of the original model and the Lora model"} + ) + + +def main(): + parser = PdArgumentParser((ModelArgument, DataArgument, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + data_args.always_pad_to_max_length = False + + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + training_args.benchmark = model_args.benchmark + training_args.tgt_length = data_args.tgt_length + + training_args.profiler_options = model_args.profiler_options + setattr(training_args, "label_smoothing", model_args.label_smoothing) + setattr(training_args, "lr_decay_ratio", model_args.lr_decay_ratio) + + paddle.set_device(training_args.device) + + set_seed(args=training_args) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + 
) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + model_class = Ernie35ForCausalLM + + # Load the pretrained language model. + model = model_class.from_pretrained( + model_args.model_name_or_path, + tensor_parallel_output=False, + tensor_parallel_degree=training_args.tensor_parallel_degree, + tensor_parallel_rank=training_args.tensor_parallel_rank, + use_flash_attention=model_args.use_flash_attention, + dtype=dtype, # todo enable set dtype to avoid additional mem usage + use_progressive_seq_len=False, + parallel_attn_hatf=True, + ) + + if model_args.lora: + if model_args.lora_path is None: + # Not yet support RowParallelLinear + target_modules = [ + ".*q_proj.*", + ".*v_proj.*", + ".*k_proj.*", + ".*gate_proj.*", + ".*up_proj.*", + ".*o_proj.*", + ".*down_proj.*", + ".*qkv_gate_up_proj.*", + ] + + lora_config = LoRAConfig( + target_modules=target_modules, + r=model_args.lora_rank, + lora_alpha=2 * model_args.lora_rank, + merge_weights=model_args.merge_weights, + tensor_parallel_degree=training_args.tensor_parallel_degree, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + else: + model = LoRAModel.from_pretrained(model=model, lora_path=model_args.lora_path) + + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + tokenizer = Ernie35Tokenizer.from_pretrained( + model_args.model_name_or_path, + padding_side="left", # Allow batch inference + ) + tokenizer.pad_token = tokenizer.unk_token + + # Load the dataset. 
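+ # Dataset selection: benchmark mode reads ./data/train.txt; otherwise a built-in dataset is picked via data_name/task_name, or train.json/dev.json are read from dataset_path.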
+ if training_args.benchmark: + train_ds = load_dataset(reader, data_path="./data/train.txt", lazy=False) + training_args.do_eval = False + data_args.always_pad_to_max_length = True + trans_func = partial( + custom_instruction_convert_example, tokenizer=tokenizer, data_args=data_args, benchmark=True + ) + elif training_args.do_train or training_args.do_eval: + if data_args.data_name is not None: + if data_args.task_name is not None: + train_ds, dev_ds = load_dataset(data_args.data_name, data_args.task_name, splits=["train", "dev"]) + else: + raise ValueError("`data_args.task_name` and `data_args.data_name` should be specified together") + elif data_args.task_name is not None: + if data_args.task_name == "squad": + train_ds, dev_ds = load_dataset("squad", splits=["train_v1", "dev_v1"]) + else: + train_ds, dev_ds = load_dataset(data_args.task_name, splits=["train", "dev"]) + elif data_args.dataset_path is not None: + train_ds = load_dataset(reader, data_path=os.path.join(data_args.dataset_path, "train.json"), lazy=False) + dev_ds = load_dataset(reader, data_path=os.path.join(data_args.dataset_path, "dev.json"), lazy=False) + else: + raise ValueError("Please set the correct data arguments(data_name, task_name, dataset_pat)") + trans_func = partial(convert_example, tokenizer=tokenizer, data_args=data_args) + + if training_args.do_train: + train_ds = train_ds.map(partial(trans_func)) + if training_args.do_eval: + # pipeline_parallel eval is the same as training. + dev_ds = dev_ds.map(partial(trans_func, is_test=model_args.eval_with_do_generation)) + + model_max_length = data_args.src_length + data_args.tgt_length if not training_args.benchmark else 512 + collate_fn = DataCollatorForSeq2Seq( + return_tensors="pd", + tokenizer=tokenizer, + max_length=model_max_length if data_args.always_pad_to_max_length else -1, + padding="max_length" if data_args.always_pad_to_max_length else True, + return_attention_mask=True, + ) + + def compute_metrics_trainer(eval_preds, tokenizer): + all_preds = [] + all_labels = [] + preds = eval_preds.predictions + preds = [x[x != -100] for x in preds] + all_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + labels = [x[x != -100] for x in eval_preds.label_ids] + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + + all_preds = [pred.strip() for pred in all_preds] + all_labels = [label.strip() for label in all_labels] + all_preds = [pred.strip("question:") for pred in all_preds] + all_labels = [label.strip("question:") for label in all_labels] + + eval_result = compute_metrics(all_preds, all_labels) + return eval_result + + compute_metrics_func = partial( + compute_metrics_trainer, + tokenizer=tokenizer, + ) + + trainer = Ernie35Trainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics_func + if model_args.eval_with_do_generation + else compute_metrics_not_do_generation, + do_generation=model_args.eval_with_do_generation, + data_collator=collate_fn, + ) + + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=last_checkpoint) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + trainer.log_metrics("train", train_result.metrics) + trainer.save_metrics("train", train_result.metrics) + trainer.save_state() + + if 
training_args.do_eval: + eval_result = trainer.evaluate() + trainer.log_metrics("test", eval_result) + + +if __name__ == "__main__": + main() diff --git a/llm/ernie-3.5-se/modeling.py b/llm/ernie-3.5-se/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..570433b994c920f877fe1a230ee7c1eb74ba1de6 --- /dev/null +++ b/llm/ernie-3.5-se/modeling.py @@ -0,0 +1,1415 @@ +# !/usr/bin/env python3 + +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Ernie35 model""" +import contextlib +import math +import os +from functools import partial +from typing import Optional, Tuple + +import numpy as np +import paddle +import paddle.nn.functional as F +from configuration import Ernie35Config +from paddle import nn +from paddle.distributed import fleet +from paddle.distributed.fleet.layers.mpu.mp_layers import ( + ColumnParallelLinear, + RowParallelLinear, + VocabParallelEmbedding, +) +from paddle.distributed.fleet.layers.mpu.random import get_rng_state_tracker +from paddle.distributed.fleet.utils import recompute +from paddle.incubate.nn.layer.fused_dropout_add import FusedDropoutAdd + +from paddlenlp.transformers.model_outputs import ( + BaseModelOutputWithPastAndCrossAttentions, + CausalLMOutputWithCrossAttentions, +) +from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model +from paddlenlp.utils.log import logger + +try: + from paddle.nn.functional.flash_attention import flash_attention + + logger.warning("Use flash attention in scaled-dot-product. 
Attention mask is deprecated") +except: + flash_attention = None + +try: + import fused_ln as fused +except ImportError: + logger.warning("fused-ln not found, run `python setup.py install` to build fused ln") + fused = None + + +ERNIE_PRETRAINED_MODEL_ARCHIVE_LIST = [] + +__all__ = [ + "Ernie35Model", + "Ernie35PretrainedModel", + "Ernie35ForCausalLM", +] + + +def get_triangle_upper_mask(x, mask=None): + if mask is not None: + return mask + # [bsz, n_head, q_len, kv_seq_len] + shape = x.shape + # [bsz, 1, q_len, kv_seq_len] + shape[1] = 1 + mask = paddle.full(shape, -np.inf, dtype=x.dtype) + mask.stop_gradient = True + mask = paddle.triu(mask, diagonal=1) + mask.stop_gradient = True + return mask + + +def parallel_matmul( + x, + y, + bias=None, + transpose_y=False, + tensor_parallel_degree=1, + tensor_parallel_output=True, + fuse_linear=False, +): + + if tensor_parallel_degree > 1 and y.is_distributed: + pg = fleet.get_hybrid_communicate_group().get_model_parallel_group() + input_parallel = paddle.distributed.collective._c_identity(x, group=pg) + if transpose_y: + logits = paddle.matmul(input_parallel, y, transpose_y=True) + if bias is not None: + logits += bias + else: + if not fuse_linear: + logits = F.linear(input_parallel, y, bias) + else: + logits = paddle.incubate.nn.functional.fused_linear(input_parallel, y, bias) # hack for 逐位对齐 + + if tensor_parallel_output: + return logits + + return paddle.distributed.collective._c_concat(logits, group=pg) + + else: + logits = paddle.matmul(x, y, transpose_y=transpose_y) + if bias is not None: + logits += bias + return logits + + +def finfo(dtype: paddle.dtype = None): + if dtype is None: + dtype = paddle.get_default_dtype() + + if dtype == paddle.bfloat16: + # Numpy do not support `np.finfo(np.uint16)`, so try to construct a finfo object to fetch min value + class BFloatFInfo: + min = -3.3895313892515355e38 + + return BFloatFInfo + if dtype == paddle.float32: + return np.finfo(np.float32) + if dtype == paddle.float16: + return np.finfo(np.float16) + if dtype == paddle.float64: + return np.finfo(np.float64) + + +def masked_fill(x, mask, value): + y = paddle.full(x.shape, value, x.dtype) + return paddle.where(mask, y, x) + + +def scaled_dot_product_attention( + query_states, key_states, value_states, attention_mask, output_attentions, config, is_causal=True +): + + bsz, q_len, num_heads, _ = paddle.shape(query_states) + head_dim = config.hidden_size // config.num_attention_heads + _, kv_seq_len, _, _ = value_states.shape + + if config.use_flash_attention and flash_attention is not None: + # Flash Attention now ignore attention mask + # Current Flash Attention doesn't support attn maskt + # Paddle Flash Attention input [ bz, seqlen, nhead, head_dim] + # Torch Flash Attention input [ bz, nhead, seqlen, head_dim] + # without past keys + attn_output, attn_weights = flash_attention( + query_states, + key_states, + value_states, + causal=is_causal and query_states.shape[1] != 1, + return_softmax=output_attentions, + ) + + attn_output = attn_output.reshape([bsz, q_len, head_dim * num_heads]) + return attn_output, attn_weights + else: + + query_states = paddle.transpose(query_states, [0, 2, 1, 3]) / math.sqrt(head_dim) + # merge with the next tranpose + key_states = paddle.transpose(key_states, [0, 2, 1, 3]) + value_states = paddle.transpose(value_states, [0, 2, 1, 3]) + + attn_weights = paddle.matmul(query_states, key_states.transpose([0, 1, 3, 2])) + + if attn_weights.shape != [bsz, num_heads, q_len, kv_seq_len]: + raise ValueError( + f"Attention 
weights should be of shape {(bsz, num_heads, q_len, kv_seq_len)}, but is" + f" {attn_weights.shape}" + ) + + if attention_mask is None: + attention_mask = get_triangle_upper_mask(attn_weights) + + attention_mask = attention_mask.reshape([bsz, 1, q_len, kv_seq_len]) + if attention_mask.shape != [bsz, 1, q_len, kv_seq_len]: + raise ValueError( + f"Attention mask should be of shape {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" + ) + + attn_weights = attention_mask + attn_weights + attn_weights = paddle.maximum( + attn_weights, paddle.to_tensor(float(finfo(query_states.dtype).min), dtype=query_states.dtype) + ) + + with paddle.amp.auto_cast(False): + attn_weights = F.softmax(attn_weights, axis=-1, dtype="float32").astype(query_states.dtype) + + attn_output = paddle.matmul(attn_weights, value_states) + attn_output = attn_output.transpose([0, 2, 1, 3]) + attn_output = attn_output.reshape([bsz, q_len, head_dim * num_heads]) + if output_attentions: + return attn_output, attn_weights + return attn_output, None + + +def _make_causal_mask(input_ids_shape, past_key_values_length, dtype): + """ + Make causal mask used for self-attention. + """ + batch_size, target_length = input_ids_shape + + mask = paddle.full((target_length, target_length), float(finfo(dtype).min)) + + mask_cond = paddle.arange(mask.shape[-1]) + mask = masked_fill(mask, mask_cond < (mask_cond + 1).reshape([mask.shape[-1], 1]), 0) + + if past_key_values_length > 0: + mask = paddle.concat([paddle.zeros([target_length, past_key_values_length]), mask], axis=-1) + + return mask[None, None, :, :].expand([batch_size, 1, target_length, target_length + past_key_values_length]) + + +def _expand_mask(mask, dtype, tgt_length): + """ + Expands attention_mask from `[batch_size, src_length]` to `[batch_size, 1, tgt_length, src_length]`. 
+ """ + if mask.ndim == 4: + expanded_mask = mask + elif mask.ndim == 3: + expanded_mask = mask[:, None, :, :] + else: + batch_size, src_length = mask.shape[0], mask.shape[-1] + tgt_length = tgt_length if tgt_length is not None else src_length + + expanded_mask = mask[:, None, None, :].expand([batch_size, 1, tgt_length, src_length]) + + inverted_mask = 1.0 - expanded_mask + return masked_fill(inverted_mask, inverted_mask.cast("bool"), float(finfo(dtype).min)) + + +class LayerNorm(nn.LayerNorm): + def __init__(self, config): + super().__init__(config.hidden_size, epsilon=config.layer_norm_eps) + + +class FusedLayerNorm(nn.Layer): + def __init__(self, config): + super().__init__() + self.config = config + self.hidden_size = config.hidden_size + self.weight = paddle.create_parameter( + shape=[self.hidden_size], + dtype=paddle.get_default_dtype(), + default_initializer=nn.initializer.Constant(1.0), + ) + self.bias = paddle.create_parameter( + shape=[self.hidden_size], + dtype=paddle.get_default_dtype(), + is_bias=True, + default_initializer=nn.initializer.Constant(0.0), + ) + self.variance_epsilon = config.layer_norm_eps + + def forward(self, hidden_states): + return fused.fused_ln(hidden_states, self.weight, self.bias, self.variance_epsilon)[0] + + +class RotaryEmbedding(nn.Layer): + def __init__(self, config, dim, max_position_embeddings=4096, base=10000): + super().__init__() + # dtype = paddle.get_default_dtype() + self.config = config + self.base = base + self.max_position_embeddings = max_position_embeddings + inv_freq = 1.0 / (base ** (paddle.cast(paddle.arange(0, dim, 2), dtype="float32") / dim)) + + # higher acc using float32 + t = paddle.arange(max_position_embeddings, dtype="float32") + freqs = paddle.einsum("i,j->ij", t, inv_freq.cast("float32")) + # Different from paper, but it uses a different permutation in order to obtain the same calculation + emb = paddle.concat([freqs, freqs], axis=-1) + + # [bs, seqlen, nhead, head_dim] + self.cos_cached = emb.cos() # [None, :, None, :] # .astype(dtype) + self.sin_cached = emb.sin() # [None, :, None, :] # .astype(dtype) + + def forward(self, x, seq_len=None, position_ids=None): + if position_ids is not None: + return self.cos_cached, self.sin_cached + start = 0 + if self.config.enable_random_position_ids: + if self.training: + np_rng = np.random.RandomState( + int(os.getenv("TRAINER_GLOBAL_STEP", "0")) + (paddle.distributed.get_rank() * 100000) + ) + pos_ids = np.array(list(np.sort(np_rng.permutation(self.max_position_embeddings)[:seq_len]))).astype( + "int64" + ) + pos_ids = paddle.to_tensor(pos_ids) + else: + if seq_len <= 4096: + times = 8 + else: + times = self.max_position_embeddings // seq_len + pos_ids = [times - 1] + pos_ids += [times] * (seq_len - 1) + pos_ids = paddle.cumsum(paddle.to_tensor(pos_ids)) + + return ( + self.cos_cached[pos_ids], + self.sin_cached[pos_ids], + ) + + return ( + self.cos_cached[start : start + seq_len, :], + self.sin_cached[start : start + seq_len, :], + ) + + @classmethod + def rotate_half(cls, x): + """Rotates half the hidden dims of the input.""" + + x1 = x[..., : x.shape[-1] // 2] + x2 = x[..., x.shape[-1] // 2 :] + return paddle.concat([-x2, x1], axis=-1) + + @classmethod + def apply_rotary_pos_emb(cls, q, k, cos, sin, offset: int = 0, position_ids=None): + if position_ids is not None: + assert offset == 0, offset + cos = F.embedding(position_ids, cos) + sin = F.embedding(position_ids, sin) + else: + cos = cos.unsqueeze(0) + sin = sin.unsqueeze(0) + cos = cos[:, offset : q.shape[1] + offset, None, :] 
+ sin = sin[:, offset : q.shape[1] + offset, None, :] + + # q_embed = (q * cos) + (rotate_half(q) * sin) + # k_embed = (k * cos) + (rotate_half(k) * sin) + + cos = paddle.cast(cos, q.dtype) + sin = paddle.cast(sin, q.dtype) + q_embed = paddle.add(paddle.multiply(q, cos), paddle.multiply(cls.rotate_half(q), sin)) + k_embed = paddle.add(paddle.multiply(k, cos), paddle.multiply(cls.rotate_half(k), sin)) + return q_embed, k_embed + + +class Ernie35MLP(nn.Layer): + def __init__(self, config): + super().__init__() + self.hidden_size = config.hidden_size + self.intermediate_size = config.intermediate_size + + if config.tensor_parallel_degree > 1: + self.gate_proj = ColumnParallelLinear( + self.hidden_size, + self.intermediate_size, + gather_output=False, + has_bias=config.use_bias, + fuse_matmul_bias=config.fuse_linear, + ) + self.up_proj = ColumnParallelLinear( + self.hidden_size, + self.intermediate_size, + gather_output=False, + has_bias=config.use_bias, + fuse_matmul_bias=config.fuse_linear, + ) + else: + self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias_attr=config.use_bias) + self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias_attr=config.use_bias) + + if config.tensor_parallel_degree > 1: + self.down_proj = RowParallelLinear( + self.intermediate_size, + self.hidden_size, + input_is_parallel=True, + has_bias=config.use_bias, + fuse_matmul_bias=config.fuse_linear, + ) + else: + self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias_attr=config.use_bias) + + def forward(self, x): + x = F.silu(self.gate_proj(x)) * self.up_proj(x) + return self.down_proj(x) + + +def rope_attn( + mix_layer, + query_states, + key_states, + value_states, + attention_mask, + position_ids, + output_attentions=False, + past_key_value=None, + use_cache=False, + rotary_emb=None, + config=None, +): + + if mix_layer is not None: + query_states, key_states, value_states = paddle.split(mix_layer, 3, axis=-1) + + kv_seq_len = key_states.shape[-3] + offset = 0 + if past_key_value is not None: + offset = past_key_value[0].shape[-3] + kv_seq_len += offset + + cos, sin = rotary_emb(value_states, seq_len=kv_seq_len, position_ids=position_ids) + + query_states, key_states = rotary_emb.apply_rotary_pos_emb( + query_states, + key_states, + cos, + sin, + position_ids=position_ids, + offset=offset if position_ids is None else 0, + ) + + if past_key_value is not None: + # reuse k, v, self_attention + key_states = paddle.concat([past_key_value[0], key_states], axis=1) + value_states = paddle.concat([past_key_value[1], value_states], axis=1) + + past_key_value = (key_states, value_states) if use_cache else None + attn_output, attn_weights = scaled_dot_product_attention( + query_states, + key_states, + value_states, + attention_mask, + output_attentions, + config=config, + ) + return attn_output, attn_weights, past_key_value + + +class Ernie35Attention(nn.Layer): + def __init__(self, config): + super().__init__() + self.hidden_size = config.hidden_size + self.num_heads = config.num_attention_heads + self.head_dim = self.hidden_size // self.num_heads + self.use_recompute_attn = config.use_recompute_attn + + if config.tensor_parallel_degree > 1: + assert ( + self.num_heads % config.tensor_parallel_degree == 0 + ), "num_heads: {self.num_heads}, tensor_parallel_degree: {config.tensor_parallel_degree}" + self.num_heads = self.num_heads // config.tensor_parallel_degree + + if config.tensor_parallel_degree > 1: + self.q_proj = ColumnParallelLinear( + self.hidden_size, + 
self.hidden_size, + has_bias=config.use_bias, + gather_output=False, + fuse_matmul_bias=config.fuse_linear, + ) + self.k_proj = ColumnParallelLinear( + self.hidden_size, + self.hidden_size, + has_bias=config.use_bias, + gather_output=False, + fuse_matmul_bias=config.fuse_linear, + ) + self.v_proj = ColumnParallelLinear( + self.hidden_size, + self.hidden_size, + has_bias=config.use_bias, + gather_output=False, + fuse_matmul_bias=config.fuse_linear, + ) + else: + self.q_proj = nn.Linear( + self.hidden_size, + self.hidden_size, + bias_attr=config.use_bias, + ) + self.k_proj = nn.Linear( + self.hidden_size, + self.hidden_size, + bias_attr=config.use_bias, + ) + self.v_proj = nn.Linear( + self.hidden_size, + self.hidden_size, + bias_attr=config.use_bias, + ) + + if config.tensor_parallel_degree > 1: + self.o_proj = RowParallelLinear( + self.hidden_size, + self.hidden_size, + has_bias=config.use_bias, + input_is_parallel=True, + fuse_matmul_bias=config.fuse_linear, + ) + else: + self.o_proj = nn.Linear( + self.hidden_size, + self.hidden_size, + bias_attr=config.use_bias, + ) + + self.rotary_emb = RotaryEmbedding( + config, + self.head_dim, + max_position_embeddings=config.max_position_embeddings, + ) + self._cast_to_low_precison = False + + self.config = config + + def forward( + self, + hidden_states, + past_key_value: Optional[Tuple[paddle.Tensor]] = None, + attention_mask: Optional[paddle.Tensor] = None, + position_ids: Optional[Tuple[paddle.Tensor]] = None, + output_attentions: bool = False, + use_cache: bool = False, + ) -> Tuple[paddle.Tensor, Optional[paddle.Tensor], Optional[Tuple[paddle.Tensor]]]: + """Input shape: Batch x Time x Channel""" + bsz, q_len, _ = hidden_states.shape + query_states = key_states = value_states = mix_layer = None + # if self.fuse_attn: + # mix_layer = self.qkv_proj(hidden_states).reshape([bsz, q_len, self.num_heads, 3 * self.head_dim]) + # else: + query_states = self.q_proj(hidden_states).reshape(shape=[bsz, q_len, self.num_heads, self.head_dim]) + key_states = self.k_proj(hidden_states).reshape(shape=[bsz, q_len, self.num_heads, self.head_dim]) + value_states = self.v_proj(hidden_states).reshape(shape=[bsz, q_len, self.num_heads, self.head_dim]) + + _rope_attn = partial(rope_attn, rotary_emb=self.rotary_emb, config=self.config) + if self.use_recompute_attn: + assert past_key_value is None, "do not use kv cache in recompute" + assert not use_cache + attn_output, attn_weights, past_key_value = recompute( + _rope_attn, + mix_layer, + query_states, + key_states, + value_states, + attention_mask, + position_ids, + output_attentions, + use_reentrant=self.config.recompute_use_reentrant, + ) + else: + attn_output, attn_weights, past_key_value = _rope_attn( + mix_layer, + query_states, + key_states, + value_states, + attention_mask, + position_ids, + output_attentions, + past_key_value, + use_cache=use_cache, + ) + attn_output = self.o_proj(attn_output) + + if not output_attentions: + attn_weights = None + + return attn_output, attn_weights, past_key_value + + +class Ernie35MLPAttention(nn.Layer): + def __init__(self, config): + super().__init__() + self.hidden_size = config.hidden_size + self.num_heads = config.num_attention_heads + self.head_dim = self.hidden_size // self.num_heads + self.intermediate_size = config.intermediate_size + + self.use_recompute_attn = config.use_recompute_attn + + if config.tensor_parallel_degree > 1: + assert ( + self.num_heads % config.tensor_parallel_degree == 0 + ), "num_heads: {self.num_heads}, tensor_parallel_degree: 
{config.tensor_parallel_degree}" + self.num_heads = self.num_heads // config.tensor_parallel_degree + self.intermediate_size_this_rank = self.intermediate_size // config.tensor_parallel_degree + else: + self.intermediate_size_this_rank = self.intermediate_size + + if config.tensor_parallel_degree > 1: + self.qkv_gate_up_proj = ColumnParallelLinear( + self.hidden_size, + self.hidden_size * 3 + self.intermediate_size * 2, + has_bias=config.use_bias, + gather_output=False, + fuse_matmul_bias=config.fuse_linear, + ) + self.o_proj = RowParallelLinear( + self.hidden_size + self.intermediate_size, + self.hidden_size, + has_bias=config.use_bias, + input_is_parallel=True, + fuse_matmul_bias=config.fuse_linear, + ) + else: + self.qkv_gate_up_proj = nn.Linear( + self.hidden_size, + self.hidden_size * 3 + self.intermediate_size * 2, + bias_attr=config.use_bias, + ) + self.o_proj = nn.Linear( + self.hidden_size + self.intermediate_size, + self.hidden_size, + bias_attr=config.use_bias, + ) + + self.rotary_emb = RotaryEmbedding( + config, + self.head_dim, + max_position_embeddings=config.max_position_embeddings, + ) + self._cast_to_low_precison = False + + self.config = config + + def forward( + self, + hidden_states, + past_key_value: Optional[Tuple[paddle.Tensor]] = None, + attention_mask: Optional[paddle.Tensor] = None, + position_ids: Optional[Tuple[paddle.Tensor]] = None, + output_attentions: bool = False, + use_cache: bool = False, + ) -> Tuple[paddle.Tensor, Optional[paddle.Tensor], Optional[Tuple[paddle.Tensor]]]: + bsz, q_len, _ = hidden_states.shape + query_states = key_states = value_states = mix_layer = None + + mix_layer = self.qkv_gate_up_proj(hidden_states) + mix_layer, up_states, gate_states = mix_layer.split( + [self.head_dim * self.num_heads * 3, self.intermediate_size_this_rank, self.intermediate_size_this_rank], + axis=-1, + ) + mix_layer = mix_layer.reshape([bsz, q_len, self.num_heads, 3 * self.head_dim]) + _rope_attn = partial(rope_attn, rotary_emb=self.rotary_emb, config=self.config) + + if self.use_recompute_attn: + assert past_key_value is None, "do not use kv cache in recompute" + attn_output, attn_weights, past_key_value = recompute( + _rope_attn, + mix_layer, + query_states, + key_states, + value_states, + attention_mask, + position_ids, + output_attentions, + use_reentrant=False, + ) + else: + attn_output, attn_weights, past_key_value = _rope_attn( + mix_layer, + query_states, + key_states, + value_states, + attention_mask, + position_ids, + output_attentions, + past_key_value, + use_cache=use_cache, + ) + + ffn_output = F.silu(up_states) * gate_states + + output_states = paddle.concat([ffn_output, attn_output], axis=-1) + output_states = self.o_proj(output_states) + + if not output_attentions: + attn_weights = None + + return output_states, attn_weights, past_key_value + + +class Ernie35DecoderLayer(nn.Layer): + def __init__(self, config, has_ffn=True, has_mha=True, parallel_attn_ffn=False): + super().__init__() + self.hidden_size = config.hidden_size + self.attn_ffn_no_mha = not parallel_attn_ffn and not has_mha + + Norm = LayerNorm + if config.fuse_ln: + Norm = FusedLayerNorm + + self.self_attn_mlp = self.self_attn = self.mlp = None + if parallel_attn_ffn: + logger.info("using parallel-attn") + self.self_attn_mlp = Ernie35MLPAttention(config) + self.input_layernorm = Norm(config) + self.residual_add1 = FusedDropoutAdd(0.0, mode="upscale_in_train") + else: + logger.info("using normal-attn") + if has_mha: + self.self_attn = Ernie35Attention(config) + self.residual_add1 = 
FusedDropoutAdd(0.0, mode="upscale_in_train") + self.input_layernorm = Norm(config) + + if has_ffn: + self.mlp = Ernie35MLP(config) + self.post_attention_layernorm = Norm(config) + self.residual_add2 = FusedDropoutAdd(0.0, mode="upscale_in_train") + + def forward( + self, + hidden_states: paddle.Tensor, + attention_mask: Optional[paddle.Tensor] = None, + position_ids: Optional[paddle.Tensor] = None, + output_attentions: Optional[bool] = False, + past_key_value: Optional[Tuple[paddle.Tensor]] = None, + use_cache: Optional[bool] = False, + ) -> Tuple[paddle.Tensor, Optional[Tuple[paddle.Tensor, paddle.Tensor]]]: + """ + Args: + hidden_states (`paddle.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)` + attention_mask (`paddle.Tensor`, *optional*): attention mask of size + `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + use_cache (`bool`, *optional*): + If set to `True`, `cache` key value states are returned and can be used to speed up decoding + (see `cache`). + cache (`Tuple(paddle.Tensor)`, *optional*): cached past key and value projection states + """ + if self.self_attn_mlp is not None: + + residual = hidden_states + hidden_states = self.input_layernorm(hidden_states) + + # Self Attention + hidden_states, self_attn_weights, present_key_value = self.self_attn_mlp( + hidden_states=hidden_states, + past_key_value=past_key_value, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + use_cache=use_cache, + ) + hidden_states = self.residual_add1(hidden_states, residual) + else: + if self.self_attn is not None: + residual = hidden_states + hidden_states = self.input_layernorm(hidden_states) + + # Self Attention + hidden_states, self_attn_weights, present_key_value = self.self_attn( + hidden_states=hidden_states, + past_key_value=past_key_value, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + use_cache=use_cache, + ) + hidden_states = self.residual_add1(hidden_states, residual) + + if self.mlp is not None: + residual = hidden_states + hidden_states = self.post_attention_layernorm(hidden_states) + hidden_states = self.mlp(hidden_states) + hidden_states = self.residual_add2(hidden_states, residual) + + if self.attn_ffn_no_mha: + present_key_value = None + + outputs = (hidden_states,) + if output_attentions: + outputs += (self_attn_weights,) + + if use_cache: + outputs += (present_key_value,) + + # remove empty tuple for pipeline parallel + if type(outputs) is tuple and len(outputs) == 1: + outputs = outputs[0] + + return outputs + + +class Ernie35PretrainedModel(PretrainedModel): + config_class = Ernie35Config + base_model_prefix = "ernie" + + @classmethod + def _get_tensor_parallel_mappings(cls, config, is_split=True): + + from conversion_utils import ( + o_proj_merge_fn, + o_proj_split_fn, + qkv_gate_up_proj_merge_fn, + qkv_gate_up_proj_split_fn, + ) + + from paddlenlp.transformers.conversion_utils import split_or_merge_func + + fn = split_or_merge_func( + is_split=is_split, + tensor_parallel_degree=config.tensor_parallel_degree, + tensor_parallel_rank=config.tensor_parallel_rank, + num_attention_heads=config.num_attention_heads, + ) + qkv_gate_up_proj_fn = qkv_gate_up_proj_split_fn if is_split else qkv_gate_up_proj_merge_fn + fuse_qkvgu_fn = 
qkv_gate_up_proj_fn( # is_column: True + tensor_parallel_degree=config.tensor_parallel_degree, + tensor_parallel_rank=config.tensor_parallel_rank, + hidden_size=config.hidden_size, + intermediate_size=config.intermediate_size, + num_heads=config.num_attention_heads, + ) + o_proj_fn = o_proj_split_fn if is_split else o_proj_merge_fn + fuse_o_proj_fn = o_proj_fn( # is_column: False + tensor_parallel_degree=config.tensor_parallel_degree, + tensor_parallel_rank=config.tensor_parallel_rank, + hidden_size=config.hidden_size, + intermediate_size=config.intermediate_size, + ) + + def get_tensor_parallel_split_mappings(num_layers, parallel_attn_hatf): + final_actions = {} + base_actions = { + # Column Linear + "layers.0.self_attn_mlp.qkv_gate_up_proj.weight": fuse_qkvgu_fn, + "lm_head.weight": partial(fn, is_column=True), + # Row Linear + "embed_tokens.weight": partial(fn, is_column=False), + "layers.0.self_attn_mlp.o_proj.weight": fuse_o_proj_fn, + } + if config.use_bias: + base_actions.update( + { + # Column Linear + "layers.0.self_attn_mlp.qkv_gate_up_proj.bias": fuse_qkvgu_fn, + "lm_head.bias": partial(fn, is_column=True), + } + ) + + start = 0 if not parallel_attn_hatf else 1 + end = num_layers if not parallel_attn_hatf else num_layers - 1 + for key, action in base_actions.items(): + if "layers.0." in key: + for i in range(start, end): + final_actions[key.replace("layers.0.", f"layers.{i}.")] = action + if "layers.0." not in key: + final_actions[key] = action + + if parallel_attn_hatf: + # Layer 0 + final_actions["layers.0.self_attn.q_proj.weight"] = partial(fn, is_column=True) + final_actions["layers.0.self_attn.k_proj.weight"] = partial(fn, is_column=True) + final_actions["layers.0.self_attn.v_proj.weight"] = partial(fn, is_column=True) + final_actions["layers.0.self_attn.o_proj.weight"] = partial(fn, is_column=False) + if config.use_bias: + final_actions["layers.0.self_attn.q_proj.bias"] = partial(fn, is_column=True) + final_actions["layers.0.self_attn.k_proj.bias"] = partial(fn, is_column=True) + final_actions["layers.0.self_attn.v_proj.bias"] = partial(fn, is_column=True) + # Layer num_layers - 1 + final_actions[f"layers.{num_layers - 1}.mlp.gate_proj.weight"] = partial(fn, is_column=True) + final_actions[f"layers.{num_layers - 1}.mlp.up_proj.weight"] = partial(fn, is_column=True) + final_actions[f"layers.{num_layers - 1}.mlp.down_proj.weight"] = partial(fn, is_column=False) + if config.use_bias: + final_actions[f"layers.{num_layers - 1}.mlp.gate_proj.bias"] = partial(fn, is_column=True) + final_actions[f"layers.{num_layers - 1}.mlp.up_proj.bias"] = partial(fn, is_column=True) + + return final_actions + + mappings = get_tensor_parallel_split_mappings(config.num_hidden_layers, config.parallel_attn_hatf) + + return mappings + + def _init_weights(self, layer): + """Initialization hook""" + if self.config.tensor_parallel_degree > 1: + rng_tracker = get_rng_state_tracker().rng_state + else: + rng_tracker = contextlib.nullcontext + + if isinstance( + layer, + ( + ColumnParallelLinear, + RowParallelLinear, + VocabParallelEmbedding, + Ernie35LMHead, + nn.Embedding, + nn.Linear, + ), + ): + + with rng_tracker(): + dtype = paddle.get_default_dtype() + # layer.weight.set_value( + # paddle.randn(layer.weight.shape, dtype=dtype).scale(self.config.initializer_range) + # ) + tmp = paddle.randn(layer.weight.shape, dtype=dtype).scale(self.config.initializer_range) + src_tensor = tmp.value().get_tensor() + layer.weight.value().get_tensor()._share_data_with(src_tensor) + # layer.weight.copy_(tmp, True) + + 
logger.info( + f'dist-init-fc: shape={layer.weight.shape}, range={self.config.initializer_range},type={type(layer)},norm={layer.weight.astype("float32").norm().item()}' + ) + + elif isinstance(layer, RotaryEmbedding): + head_dim = self.config.hidden_size // self.config.num_attention_heads + inv_freq = 1.0 / (layer.base ** (np.arange(0, head_dim, 2).astype("float32") / head_dim)) + # higher acc using float32 + t = np.arange(layer.max_position_embeddings, dtype="float32") + freqs = np.einsum("i,j->ij", t, inv_freq) + emb = np.concatenate([freqs, freqs], axis=-1) + # [bs, seqlen, nhead, head_dim] + cos_cached = np.cos(emb)[:, :] + sin_cached = np.sin(emb)[:, :] + layer.cos_cached.set_value(cos_cached) + layer.sin_cached.set_value(sin_cached) + + +@register_base_model +class Ernie35Model(Ernie35PretrainedModel): + """ + Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Ernie35DecoderLayer`] + Args: + config: Ernie35Config + """ + + def __init__(self, config: Ernie35Config): + super().__init__(config) + self.padding_idx = config.pad_token_id + self.vocab_size = config.vocab_size + self.hidden_size = config.hidden_size + self.config = config + + if config.tensor_parallel_degree > 1: + self.embed_tokens = VocabParallelEmbedding( + self.vocab_size, + self.hidden_size, + ) + else: + self.embed_tokens = nn.Embedding( + self.vocab_size, + self.hidden_size, + ) + + self.layers = nn.LayerList() + for i in range(config.num_hidden_layers): + if not config.parallel_attn_hatf: + _layer = Ernie35DecoderLayer(config, has_ffn=False, has_mha=False, parallel_attn_ffn=True) + else: + # no ffn in fisrt layer + # no mha in last layer + # Head mhA tail Ffn + if i == 0: + _layer = Ernie35DecoderLayer(config, has_ffn=False, has_mha=True, parallel_attn_ffn=False) + elif i == (config.num_hidden_layers - 1): + _layer = Ernie35DecoderLayer(config, has_ffn=True, has_mha=False, parallel_attn_ffn=False) + else: + _layer = Ernie35DecoderLayer(config, has_ffn=False, has_mha=False, parallel_attn_ffn=True) + + self.layers.append(_layer) + logger.info( + f"building layer:{i}, mha={_layer.self_attn is not None},ffn={_layer.mlp is not None},paral-mha-ffn={_layer.self_attn_mlp is not None}" + ) + + Norm = LayerNorm + if config.fuse_ln: + Norm = FusedLayerNorm + self.norm = Norm(config) + + def get_input_embeddings(self): + return self.embed_tokens + + def set_input_embeddings(self, value): + self.embed_tokens = value + + @classmethod + def _prepare_decoder_attention_mask(cls, attention_mask, input_shape, past_key_values_length, dtype): + # create causal mask + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + combined_attention_mask = None + if input_shape[-1] > 1: + combined_attention_mask = _make_causal_mask( + input_shape, past_key_values_length=past_key_values_length, dtype=dtype + ) + + if attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + expanded_attn_mask = _expand_mask(attention_mask, dtype, tgt_length=input_shape[-1]) + combined_attention_mask = ( + expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask + ) + combined_attention_mask = paddle.maximum( + combined_attention_mask.astype(dtype), paddle.to_tensor(float(finfo(dtype).min), dtype=dtype) + ) + return combined_attention_mask + + @paddle.jit.not_to_static + def recompute_training(self, layer_module, hidden_states, attention_mask, position_ids, output_attentions): + def create_custom_forward(module): + def custom_forward(*inputs): + return 
module(*inputs, output_attentions, None) + + return custom_forward + + hidden_states = recompute( + create_custom_forward(layer_module), + hidden_states, + attention_mask, + position_ids, + use_reentrant=False, + ) + return hidden_states + + def forward( + self, + input_ids=None, + position_ids=None, + attention_mask=None, + inputs_embeds=None, + use_cache=None, + past_key_values=None, + output_attentions=False, + output_hidden_states=None, + return_dict=False, + **kwargs, + ): + + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # retrieve input_ids and inputs_embeds + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time") + elif input_ids is not None: + batch_size, seq_length = input_ids.shape + elif inputs_embeds is not None: + batch_size, seq_length, _ = inputs_embeds.shape + else: + raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds") + + if past_key_values is None: + past_key_values = tuple([None] * len(self.layers)) + + seq_length_with_past = seq_length + cache_length = 0 + if past_key_values[0] is not None: + cache_length = paddle.shape(past_key_values[0][0])[1] + seq_length_with_past += cache_length + if inputs_embeds is None: + inputs_embeds = self.embed_tokens(input_ids).astype(self.embed_tokens.weight.dtype) + + # embed positions + if attention_mask is None: + attention_mask = paddle.ones((batch_size, seq_length_with_past), dtype=paddle.bool) + + attention_mask = self._prepare_decoder_attention_mask( + attention_mask, (batch_size, seq_length), cache_length, inputs_embeds.dtype + ) + hidden_states = inputs_embeds + + # decoder layers + all_hidden_states = () if output_hidden_states else None + all_self_attns = () if output_attentions else None + next_decoder_cache = () if use_cache else None + + for idx, (decoder_layer) in enumerate(self.layers): + if output_hidden_states: + all_hidden_states += (hidden_states,) + + past_key_value = past_key_values[idx] if past_key_values is not None else None + has_gradient = not hidden_states.stop_gradient + if self.config.use_recompute and has_gradient: + layer_outputs = self.recompute_training( + decoder_layer, + hidden_states, + attention_mask, + position_ids, + output_attentions, + ) + else: + layer_outputs = decoder_layer( + hidden_states, + attention_mask, + position_ids, + output_attentions, + past_key_value, + use_cache, + ) + + if isinstance(layer_outputs, (tuple, list)): + hidden_states = layer_outputs[0] + else: + hidden_states = layer_outputs + + if use_cache: + next_decoder_cache += (layer_outputs[2 if output_attentions else 1],) + + if output_attentions: + all_self_attns += (layer_outputs[1],) + + hidden_states = self.norm(hidden_states) + + # add hidden states from the last decoder layer + if output_hidden_states: + all_hidden_states += (hidden_states,) + + next_cache = next_decoder_cache if use_cache else None + + if not return_dict: + return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None) + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + 
past_key_values=next_cache, + hidden_states=all_hidden_states, + attentions=all_self_attns, + cross_attentions=None, + ) + + +class Ernie35PretrainingCriterion(paddle.nn.Layer): + """ + Criterion for Ernie35. + It calculates the final loss. + """ + + def __init__(self, config, return_tuple=True): + super(Ernie35PretrainingCriterion, self).__init__() + self.ignored_index = getattr(config, "ignored_index", -100) + self.config = config + self.return_tuple = return_tuple + self.enable_parallel_cross_entropy = config.tensor_parallel_degree > 1 and config.tensor_parallel_output + + if self.enable_parallel_cross_entropy: + logger.info("using parallel cross entroy, take care") + self.loss_func = fleet.meta_parallel.ParallelCrossEntropy() + else: + self.loss_func = paddle.nn.CrossEntropyLoss( + reduction="none", + ) + + def forward(self, prediction_scores, masked_lm_labels): + if self.enable_parallel_cross_entropy: + assert ( + prediction_scores.shape[-1] != self.config.vocab_size + ), f"enable_parallel_cross_entropy, the vocab_size should be splited: {prediction_scores.shape[-1]}, {self.config.vocab_size}" + + with paddle.amp.auto_cast(False): + masked_lm_loss = self.loss_func(prediction_scores.astype("float32"), masked_lm_labels.unsqueeze(2)) + lossmask = masked_lm_labels != self.ignored_index + if (~lossmask).all(): + loss = paddle.mean(masked_lm_loss) * 0.0 + loss_sum = masked_lm_loss.sum().detach() + else: + lossmask = lossmask.reshape([-1]).cast(paddle.float32) + masked_lm_loss = paddle.sum(masked_lm_loss.cast(paddle.float32).reshape([-1]) * lossmask) + loss = masked_lm_loss / lossmask.sum() + loss_sum = masked_lm_loss.sum().detach() + if not self.return_tuple: + if self.training: + return loss + return loss_sum + return loss, loss_sum + + +class Ernie35LMHead(nn.Layer): + def __init__(self, config): + super(Ernie35LMHead, self).__init__() + self.config = config + if config.tensor_parallel_degree > 1: + vocab_size = config.vocab_size // config.tensor_parallel_degree + else: + vocab_size = config.vocab_size + + self.weight = self.create_parameter( + shape=[vocab_size, config.hidden_size] if config.tie_word_embeddings else [config.hidden_size, vocab_size], + dtype=paddle.get_default_dtype(), + ) + if config.weight_share_add_bias and config.use_bias: + self.bias = self.create_parameter( + shape=[vocab_size], + dtype=paddle.get_default_dtype(), + ) + else: + self.bias = None + + # Must set distributed attr for Tensor Parallel ! 
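+        # When the vocab is split across tensor-parallel ranks, the weight/bias here only hold one shard,
+        # so they are flagged `is_distributed` and `split_axis` records which axis was sharded.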
+ self.weight.is_distributed = True if (vocab_size != config.vocab_size) else False + if config.weight_share_add_bias and config.use_bias: + self.bias.is_distributed = True if (vocab_size != config.vocab_size) else False + + if self.weight.is_distributed: + self.weight.split_axis = 1 + if config.weight_share_add_bias and config.use_bias and self.bias.is_distributed: + self.bias.split_axis = 0 + + def forward(self, hidden_states, tensor_parallel_output=None): + if tensor_parallel_output is None: + tensor_parallel_output = self.config.tensor_parallel_output + logits = parallel_matmul( + hidden_states, + self.weight, + bias=self.bias, + transpose_y=self.config.tie_word_embeddings, + tensor_parallel_degree=self.config.tensor_parallel_degree, + tensor_parallel_output=tensor_parallel_output, + fuse_linear=self.config.fuse_linear, + ) + return logits + + +class Ernie35ForCausalLM(Ernie35PretrainedModel): + _keys_to_ignore_on_load_missing = [r"lm_head.weight"] + + def __init__(self, config): + super().__init__(config) + # initialize-trick for big model, see https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/README.md#std-init + new_initializer_range = math.sqrt(0.3333 / config.hidden_size) + logger.info(f"change initializer-range from {config.initializer_range} to {new_initializer_range}") + config.initializer_range = new_initializer_range + self.config = config + + self.ernie = Ernie35Model(config) + self.lm_head = Ernie35LMHead(config) + self.criterion = Ernie35PretrainingCriterion(config) + + self.tie_weights() # maybe weight share + + # Initialize weights and apply final processing + def _post_init(self, original_init, *args, **kwargs): + super()._post_init(self, original_init, *args, **kwargs) + factor = 1 / math.sqrt(2 * self.config.num_hidden_layers) + logger.info(f"using post init div: factor:{factor}") + with paddle.no_grad(): + for l in self.ernie.layers: + if l.self_attn_mlp is not None: + l.self_attn_mlp.o_proj.weight.scale_(factor) + if l.self_attn is not None: + l.self_attn.o_proj.weight.scale_(factor) + if l.mlp is not None: + l.mlp.down_proj.weight.scale_(factor) + + def get_input_embeddings(self): + return self.ernie.embed_tokens + + def set_input_embeddings(self, value): + self.ernie.embed_tokens = value + + def get_output_embeddings(self): + return self.lm_head + + def set_output_embeddings(self, new_embeddings): + self.lm_head = new_embeddings + + def set_decoder(self, decoder): + self.ernie = decoder + + def get_decoder(self): + return self.ernie + + @staticmethod + def prepare_attention_mask_for_generation(input_ids, pad_token_id, eos_token_id): + is_pad_token_in_inputs_ids = (pad_token_id is not None) and paddle.any( + input_ids == pad_token_id + ).numpy().item() + is_pad_token_not_equal_to_eos_token_id = (eos_token_id is None) or ( + (eos_token_id is not None) and (pad_token_id != eos_token_id) + ) + if is_pad_token_in_inputs_ids and is_pad_token_not_equal_to_eos_token_id: + attention_mask = (input_ids != pad_token_id).astype("int64") + else: + attention_mask = paddle.ones_like(input_ids, dtype="int64") + return attention_mask + + def prepare_inputs_for_generation( + self, input_ids, use_cache=False, past_key_values=None, inputs_embeds=None, **kwargs + ): + if past_key_values: + input_ids = input_ids[:, -1:] + + attention_mask = kwargs.get("attention_mask", None) + position_ids = kwargs.get("position_ids", None) + + # if `inputs_embeds` are passed, we only want to use them in the 1st generation step + if inputs_embeds is not None and 
past_key_values is None: + model_inputs = {"inputs_embeds": inputs_embeds} + else: + model_inputs = {"input_ids": input_ids} + + model_inputs.update( + { + "past_key_values": past_key_values, + "use_cache": True, # use_cache, + "attention_mask": attention_mask, + "position_ids": position_ids, + "return_dict": True, + } + ) + return model_inputs + + @staticmethod + def update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder=False): + # update cache + if isinstance(outputs, tuple) and len(outputs) > 1 and not isinstance(outputs[1], paddle.Tensor): + model_kwargs["past_key_values"] = outputs[1] + + if isinstance(outputs, CausalLMOutputWithCrossAttentions) and "past_key_values" in outputs: + model_kwargs["past_key_values"] = outputs.past_key_values + + # update token_type_ids with last value + if "token_type_ids" in model_kwargs and model_kwargs["token_type_ids"] is not None: + token_type_ids = model_kwargs["token_type_ids"] + model_kwargs["token_type_ids"] = paddle.concat([token_type_ids, token_type_ids[:, -1:]], axis=-1) + + interval = int(os.getenv("pos_decoding_interval", "2")) + # update position_ids + if "position_ids" in model_kwargs and model_kwargs["position_ids"] is not None: + position_ids = model_kwargs["position_ids"] + model_kwargs["position_ids"] = paddle.concat([position_ids, position_ids[:, -1:] + interval], axis=-1) + + if not is_encoder_decoder: + # update attention mask + if "attention_mask" in model_kwargs: + attention_mask = model_kwargs["attention_mask"] + model_kwargs["attention_mask"] = paddle.concat( + [attention_mask, paddle.ones([attention_mask.shape[0], 1], dtype="int64")], axis=-1 + ) + # update role_ids + if "role_ids" in model_kwargs and model_kwargs["role_ids"] is not None: + role_ids = model_kwargs["role_ids"] + model_kwargs["role_ids"] = paddle.concat([role_ids, role_ids[:, -1:]], axis=-1) + + return model_kwargs + + def forward( + self, + input_ids, + position_ids=None, + attention_mask=None, + inputs_embeds=None, + labels=None, + use_cache=False, + past_key_values=None, + output_attentions=None, + output_hidden_states=None, + return_dict=False, + ignored_index=0, # no use + ): + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + def progressive_seq(x, y): + globel_step = int(os.getenv("TRAINER_GLOBAL_STEP", "0")) + if globel_step < 500: + return x[:, :512], y[:, :512] + if globel_step < 1000: + return x[:, :1024], y[:, :1024] + if globel_step < 1500: + return x[:, :2048], y[:, :2048] + return x, y + + if self.config.use_progressive_seq_len and self.training: + input_ids, labels = progressive_seq(input_ids, labels) + + outputs = self.ernie( + input_ids, + position_ids=position_ids, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + past_key_values=past_key_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + hidden_states = outputs[0] + # if labels is None,means we need full output, instead of tensor_parallel_output + # tensor_parallel_output is togather with ParallelCrossEntropy + # tensor_parallel_output = ( + # self.config.tensor_parallel_output and labels is not None and self.config.tensor_parallel_degree > 1 + # ) + + logits = self.lm_head( + 
hidden_states, + ) # tensor_parallel_output=tensor_parallel_output) + + if return_dict: + if labels is not None: + loss, _ = self.criterion(logits, labels) + else: + loss = None + return CausalLMOutputWithCrossAttentions( + loss=loss, + logits=logits, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + assert labels is not None + loss, loss_sum = self.criterion(logits, labels) + return loss, loss_sum diff --git a/llm/ernie-3.5-se/predict_generation.py b/llm/ernie-3.5-se/predict_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..4120b1862a41eb5c4869b973441e7ffd0c23dc44 --- /dev/null +++ b/llm/ernie-3.5-se/predict_generation.py @@ -0,0 +1,212 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import json +import os +import sys + +sys.path.append("../../") + +import paddle +from modeling import Ernie35Config, Ernie35ForCausalLM +from paddle.distributed import fleet +from tokenizer import Ernie35Tokenizer + + +def get_parser(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", default="baidu/ernie-3.5-se-3b", help="The directory of model.") + parser.add_argument("--tokenizer_name_or_path", default="ernie-tokenizer", help="The directory of tokenizer.") + parser.add_argument( + "--merge_tensor_parallel_path", default=None, help="The directory of model to merge tensor parallel parts." 
+ ) + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--src_length", type=int, default=50, help="the max length of source text") + parser.add_argument("--tgt_length", type=int, default=100, help="the max length of decoding length") + + parser.add_argument("--top_k", type=int, default=3, help="top_k parameter for generation") + parser.add_argument("--top_p", type=float, default=0.8, help="top_p parameter for generation") + parser.add_argument("--temperature", type=float, default=0.95, help="top_p parameter for generation") + parser.add_argument("--data_file", default=None, help="data file directory") + parser.add_argument("--predict_file", default="prediction.json", help="predict result file directory") + parser.add_argument("--device", type=str, default="gpu", help="Device") + return parser + + +def parse_arguments(): + parser = get_parser() + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args=None, tokenizer=None, model=None, **kwargs): + if args is None: + self.tokenizer = tokenizer + self.model = model + self.src_length = kwargs["src_length"] + self.tgt_length = kwargs["tgt_length"] + else: + self.tokenizer = Ernie35Tokenizer.from_pretrained(args.tokenizer_name_or_path) + self.tokenizer.pad_token = self.tokenizer.unk_token + self.batch_size = args.batch_size + self.args = args + self.src_length = self.args.src_length + self.tgt_length = self.args.tgt_length + + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = 0 + if tensor_parallel_degree > 1: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + tensor_parallel_rank = hcg.get_model_parallel_rank() + + self.rank = tensor_parallel_rank + + config = Ernie35Config.from_pretrained(args.model_name_or_path) + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = (tensor_parallel_rank,) + dtype = "float16" if config.dtype is None else config.dtype + use_flash_attn = False if dtype == "float32" else True + self.model = Ernie35ForCausalLM.from_pretrained( + args.model_name_or_path, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + load_state_as_np=True, + dtype=dtype, + use_flash_attention=use_flash_attn, + ) + print(self.model) + + self.model.eval() + + def preprocess(self, input_text): + inputs = self.tokenizer( + input_text, + padding=True, + return_tensors="np", + max_length=self.src_length, + return_attention_mask=True, + return_position_ids=True, + ) + inputs_tensor = {} + for key, value in inputs.items(): + inputs_tensor[key] = paddle.to_tensor(value) + + return inputs_tensor + + def infer(self, inputs): + if self.src_length + self.tgt_length > 4096: + self.model.pos_decoding_interval = max( + self.model.config.max_position_embeddings // (self.tgt_length + self.src_length), 1 + ) + else: + self.model.pos_decoding_interval = 8 + if self.model.pos_decoding_interval > 1 and "position_ids" in inputs: + inputs["position_ids"] = inputs["position_ids"] * 
self.model.pos_decoding_interval + ( + self.model.pos_decoding_interval - 1 + ) + + os.environ["pos_decoding_interval"] = str(self.model.pos_decoding_interval) + + if self.model.config.dtype == "float32" or self.model.config.dtype is None: + with paddle.no_grad(): + result = self.model.generate( + **inputs, + max_length=self.tgt_length, + decode_strategy="sampling", + temperature=self.args.temperature, + top_k=self.args.top_k, + top_p=self.args.top_p, + repetition_penalty=1.0, + ) + else: + white = ["flash_attn"] + with paddle.no_grad(): + with paddle.amp.auto_cast(True, custom_white_list=white, level="O2", dtype=self.model.config.dtype): + result = self.model.generate( + **inputs, + max_length=self.tgt_length, + decode_strategy="sampling", + temperature=self.args.temperature, + top_k=self.args.top_k, + top_p=self.args.top_p, + repetition_penalty=1.0, + ) + result = result[0] + return result + + def postprocess(self, infer_data): + result = [] + for x in infer_data.tolist(): + res = self.tokenizer.decode(x, skip_special_tokens=True) + result.append(res) + out_dict = {"result": result} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + paddle.set_device(args.device) + predictor = Predictor(args) + if args.data_file is None: + all_texts = [ + "The Broncos took an early lead in Super Bowl 50 and never trailed. Newton was limited by Denver's defense, ", + "Hello, it is a great day! nice to ", + "a b c d e f g ", + ] + else: + all_texts = [] + with open(args.data_file, "r", encoding="utf-8") as f: + for line in f: + example = json.loads(line) + context = example["src"][0] if isinstance(example["src"], list) else example["src"] + all_texts.append(context) + batch_texts = batchfy_text(all_texts, args.batch_size) + with open(args.predict_file, "w", encoding="utf-8") as f: + for bs, texts in enumerate(batch_texts): + outputs = predictor.predict(texts) + for text, result in zip(texts, outputs["result"]): + print("{}\n{}".format(text, result)) + out = {"src": text, "output": result} + f.write(json.dumps(out, ensure_ascii=False) + "\n") + + if args.merge_tensor_parallel_path is not None: + predictor.model.save_pretrained( + save_dir=args.merge_tensor_parallel_path, + merge_tensor_parallel=True, + ) + predictor.tokenizer.save_pretrained(args.merge_tensor_parallel_path) diff --git a/llm/ernie-3.5-se/run_pretrain.py b/llm/ernie-3.5-se/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..5e08185f7bb4c14f916a611c6e8455ad2dd65118 --- /dev/null +++ b/llm/ernie-3.5-se/run_pretrain.py @@ -0,0 +1,465 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +GPT/Ernie35 pretraining scripts. 
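+Builds the Ernie35 model, tokenizer and mmap-backed train/valid/test datasets, then drives them with the PaddleNLP Trainer.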
+""" +import math +import os +import random +import time +from dataclasses import dataclass, field +from typing import Optional + +import numpy as np +import paddle +from ernie_dataset import build_train_valid_test_datasets, print_rank_0 +from modeling import Ernie35Config, Ernie35ForCausalLM +from tokenizer import Ernie35Tokenizer + +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, + speed_metrics, +) +from paddlenlp.transformers import ( # Ernie35Config,; Ernie35ForCausalLM, + CosineAnnealingWithWarmupDecay, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie": ( + Ernie35Config, + Ernie35ForCausalLM, + ), +} + + +def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") + return fn + + return docstring_decorator + + +@dataclass +@add_start_docstrings(TrainingArguments.__doc__) +class PreTrainingArguments(TrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." + }, + ) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=1024, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. 
+ """ + + model_type: Optional[str] = field( + default="ernie", metadata={"help": "Only support for ernie pre-training for now."} + ) + model_name_or_path: str = field( + default="__internal_testing__/tiny-random-ernie", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + use_flash_attention: bool = field( + default=False, + metadata={"help": "use_flash_attention"}, + ) + use_fused_ln: bool = field( + default=False, + metadata={"help": "ernie, use_fused_ln"}, + ) + fuse_attention_qkv: bool = field( + default=True, + metadata={"help": "gpt, fuse_attention_qkv"}, + ) + recompute_granularity: str = field( + default="full", + metadata={"help": "full core_attn"}, + ) + virtual_pp_degree: int = field( + default=1, + metadata={"help": "virtual_pp_degree"}, + ) + + continue_training: bool = field( + default=False, + metadata={ + "help": "Pre-training from existing paddlenlp model weights. Default Fasle and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models." + }, + ) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, +): + + train_valid_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.dataset_world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.dataset_world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.dataset_world_size * training_args.test_iters, + ] + + print_rank_0(" > datasets target sizes (minimum size):") + print_rank_0(" train: {}".format(train_valid_test_num_samples[0])) + print_rank_0(" validation: {}".format(train_valid_test_num_samples[1])) + print_rank_0(" test: {}".format(train_valid_test_num_samples[2])) + + # Build the datasets. + train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets( + data_prefix=data_file, + data_impl="mmap", + splits_string=data_args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + seq_length=data_args.max_seq_length, + seed=training_args.seed, + skip_warmup=False, + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode") + # input_ids, loss_mask, attention_mask, position_ids, labels = data + input_ids = data["text"] + logger.info(tokenizer._decode(input_ids)) + # logger.info(tokenizer._decode(labels)) + # logger.info(tokenizer.convert_ids_to_tokens(input_ids)) + + # eod_token = tokenizer.eos_token_id + from paddlenlp.data import Stack + + def _collate_data(data, stack_fn=Stack()): + tokens_ = stack_fn(x["text"] for x in data) + # Unpack. + tokens_ = paddle.to_tensor(tokens_, dtype="int64") + labels = tokens_[:, 1:] + tokens = tokens_[:, :-1] + + # Loss mask. 
+ # loss_mask = paddle.ones(tokens.shape, dtype=paddle.float32) + # loss_mask[data == eod_token] = 0.0 + + return { + "input_ids": tokens, + # "token_type_ids": out[1], + # "attention_mask": out[2], + # "loss_mask": loss_mask, + "labels": labels, + } + + print_dataset(train_dataset[0]) + print_dataset(valid_dataset[0]) + print_dataset(test_dataset[0]) + + return train_dataset, valid_dataset, test_dataset, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... + return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and "_idx.npz" in str(f)) + ] + files = [x.replace("_idx.npz", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def set_seed(args): + if args.device == "cpu": + idx = 0 + else: + idx = paddle.distributed.get_rank() + random.seed(args.seed + idx) + np.random.seed(args.seed + idx) + paddle.seed(args.seed + idx) + + +class PretrainingTrainer(Trainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str = "eval"): + # keep eval_dataloader + eval_dataloader = getattr(self, "eval_dataloader", None) + if eval_dataloader is None: + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + # must call data loader, otherwise, it will init many times, cause OOM error. + self.eval_dataloader = eval_dataloader() + + start_time = time.time() + # Temporarily disable metric computation, we will do it in the loop here. 
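+        # The evaluation loop below is capped at `max_eval_iters=self.args.eval_iters` batches (set to 10 in main()),
+        # which keeps evaluation cost bounded during long pre-training runs.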
+ compute_metrics = self.compute_metrics + eval_loop = self.evaluation_loop + + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + # Only evaluate max_eval_iters + max_eval_iters=self.args.eval_iters, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + return output.metrics + + def _get_eval_sampler(self, eval_dataset) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + set_seed(training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + # if last_checkpoint is None and len( + # os.listdir(training_args.output_dir)) > 1: + # raise ValueError( + # f"Output directory ({training_args.output_dir}) already exists and is not empty. " + # "Use --overwrite_output_dir to overcome.") + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + config_class, model_class = MODEL_CLASSES[model_args.model_type] + + tokenizer = Ernie35Tokenizer.from_pretrained(model_args.tokenizer_name_or_path) + + config = config_class.from_pretrained(model_args.model_name_or_path) + config.max_position_embeddings = max(config.max_position_embeddings, data_args.max_seq_length) + if not model_args.continue_training: + config.vocab_size = max(config.vocab_size, ((tokenizer.vocab_size - 1) // 128 + 1) * 128) + logger.info(f"Reset vocab size to {config.vocab_size} for batter amp peformance.") + + config.use_flash_attention = model_args.use_flash_attention + config.fuse_ln = model_args.use_fused_ln + config.fuse_attention_qkv = model_args.fuse_attention_qkv + config.recompute_granularity = model_args.recompute_granularity + config.virtual_pp_degree = model_args.virtual_pp_degree + config.use_recompute = training_args.recompute + + config.tensor_parallel_degree = training_args.tensor_parallel_degree + config.tensor_parallel_rank = training_args.tensor_parallel_rank + + print("Final pre-training config:", config) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + if model_args.continue_training: + model = model_class.from_pretrained( + model_args.model_name_or_path, + config=config, + dtype=dtype, + load_state_as_np=True, + use_progressive_seq_len=True, + ) + else: + model = model_class._from_config(config, dtype=dtype) + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + if training_args.warmup_steps > 0: + warmup_steps = training_args.warmup_steps + else: + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = None + if training_args.lr_scheduler_type.value == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + elif training_args.lr_scheduler_type.value == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + + data_file = get_train_data_file(data_args) + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, training_args, data_file, tokenizer + ) + + trainer = PretrainingTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + + +if __name__ == "__main__": + main() diff --git 
a/llm/ernie-3.5-se/run_trainer_stage2.sh b/llm/ernie-3.5-se/run_trainer_stage2.sh new file mode 100644 index 0000000000000000000000000000000000000000..54d1ff7d391a16c1634b6c690d4058b2667b9631 --- /dev/null +++ b/llm/ernie-3.5-se/run_trainer_stage2.sh @@ -0,0 +1,70 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +set -x +unset CUDA_VISIBLE_DEVICES + +task_name="ernie35_hybrid" +# rm -rf output/$task_name/ +# rm -rf "output/$task_name""_log" + +data_dir="./data" + +PYTHONPATH=../../:$PYTHONPATH \ +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "baidu/ernie-3.5-se-3b" \ + --tokenizer_name_or_path "ernie-tokenizer" \ + --input_dir "${data_dir}" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 4096 \ + --per_device_train_batch_size 2 \ + --gradient_accumulation_steps 2 \ + --per_device_eval_batch_size 2 \ + --use_flash_attention 1 \ + --use_fused_ln 1 \ + --amp_master_grad 0 \ + --bf16 1 \ + --fp16_opt_level "O2" \ + --scale_loss 512 \ + --tensor_parallel_degree 1 \ + --pipeline_parallel_degree 1 \ + --virtual_pp_degree 1 \ + --learning_rate 0.0003 \ + --min_learning_rate 0.00003 \ + --lr_scheduler_type "cosine" \ + --max_steps 300000 \ + --save_steps 200 \ + --adam_beta2 0.95 \ + --weight_decay 0.1 \ + --warmup_steps 2000 \ + --max_grad_norm 1.0 \ + --logging_steps 1 \ + --dataloader_num_workers 0 \ + --eval_steps 50 \ + --report_to "visualdl" \ + --sharding "stage2" \ + --sharding_parallel_degree 8 \ + --disable_tqdm true \ + --continue_training 0 \ + --recompute 0 \ + --do_train \ + --do_eval \ + --save_total_limit 5 \ + --device "gpu" diff --git a/llm/ernie-3.5-se/tokenizer.py b/llm/ernie-3.5-se/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..0afbd2502d2ff3e2ce3ea7320e87d26c21f314a4 --- /dev/null +++ b/llm/ernie-3.5-se/tokenizer.py @@ -0,0 +1,193 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
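+
+# Ernie35Tokenizer wraps a SentencePiece (BPE) model: it loads `sentencepiece.bpe.model`,
+# pads on the left and, by default, prepends the bos token when building model inputs.
+# Minimal usage sketch (illustrative only; assumes a local `ernie-tokenizer` directory
+# containing the SentencePiece model, as used by predict_generation.py):
+#   tok = Ernie35Tokenizer.from_pretrained("ernie-tokenizer")
+#   inputs = tok("a b c d e f g", return_attention_mask=True, return_position_ids=True)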
+ +import os +from shutil import copyfile +from typing import List, Optional, Tuple + +import sentencepiece as spm + +from paddlenlp.transformers import PretrainedTokenizer +from paddlenlp.utils.log import logger + + +class Ernie35Tokenizer(PretrainedTokenizer): + model_input_names = ["input_ids", "attention_mask", "position_ids"] + resource_files_names = { + "vocab_file": "sentencepiece.bpe.model", + } + padding_side = "left" + + def __init__( + self, + vocab_file, + unk_token="", + bos_token="", + eos_token="", + add_bos_token=True, + add_eos_token=False, + sp_model_kwargs=None, + decode_with_prefix_space=False, + **kwargs + ): + self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs + super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs) + + if self.pad_token is None: + self.pad_token = self.unk_token + self.vocab_file = vocab_file + self.add_bos_token = add_bos_token + self.add_eos_token = add_eos_token + self.decode_with_prefix_space = decode_with_prefix_space + self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) + self.sp_model.Load(vocab_file) + + @property + def vocab_size(self): + """Returns vocab size""" + return self.sp_model.get_piece_size() + + @property + def bos_token_id(self) -> Optional[int]: + return self.sp_model.bos_id() + + @property + def eos_token_id(self) -> Optional[int]: + return self.sp_model.eos_id() + + def get_vocab(self): + """Returns vocab as a dict""" + vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} + vocab.update(self.added_tokens_encoder) + return vocab + + def _tokenize(self, text): + """Returns a tokenized string.""" + return self.sp_model.encode(text, out_type=str) + + def _convert_token_to_id(self, token): + """Converts a token (str) in an id using the vocab.""" + return self.sp_model.piece_to_id(token) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + token = self.sp_model.IdToPiece(index) + return token + + def convert_tokens_to_string(self, tokens): + """Converts a sequence of tokens (string) in a single string.""" + current_sub_tokens = [] + out_string = "" + prev_is_special = False + for i, token in enumerate(tokens): + # make sure that special tokens are not decoded using sentencepiece model + if token in self.all_special_tokens: + if not prev_is_special and i != 0: + out_string += " " + out_string += self.sp_model.decode(current_sub_tokens) + token + prev_is_special = True + current_sub_tokens = [] + else: + current_sub_tokens.append(token) + prev_is_special = False + out_string += self.sp_model.decode(current_sub_tokens) + return out_string + + def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]: + """ + Save the vocabulary and special tokens file to a directory. + Args: + save_directory (`str`): + The directory in which to save the vocabulary. + Returns: + `Tuple(str)`: Paths to the files saved. 
+ """ + if not os.path.isdir(save_directory): + logger.error(f"Vocabulary path ({save_directory}) should be a directory") + return + out_vocab_file = os.path.join( + save_directory, + (filename_prefix + "-" if filename_prefix else "") + self.resource_files_names["vocab_file"], + ) + + if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file): + copyfile(self.vocab_file, out_vocab_file) + elif not os.path.isfile(self.vocab_file): + with open(out_vocab_file, "wb") as fi: + content_spiece_model = self.sp_model.serialized_model_proto() + fi.write(content_spiece_model) + + return (out_vocab_file,) + + def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None): + if self.add_bos_token: + bos_token_ids = [self.bos_token_id] + else: + bos_token_ids = [] + + output = bos_token_ids + token_ids_0 + + if token_ids_1 is not None: + output = output + token_ids_1 + + if self.add_eos_token: + output = output + [self.eos_token_id] + + return output + + def get_special_tokens_mask( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False + ) -> List[int]: + """ + Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer `prepare_for_model` method. + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + already_has_special_tokens (`bool`, *optional*, defaults to `False`): + Whether or not the token list is already formatted with special tokens for the model. + Returns: + `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. + """ + if already_has_special_tokens: + return super().get_special_tokens_mask( + token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True + ) + + if token_ids_1 is None: + return [1] + ([0] * len(token_ids_0)) + [1] + return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1] + + def create_token_type_ids_from_sequences( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make + use of token type ids, therefore a list of zeros is returned. + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + Returns: + `List[int]`: List of zeros. + """ + eos = [self.eos_token_id] + + if token_ids_1 is None: + return len(token_ids_0 + eos) * [0] + return len(token_ids_0 + eos + token_ids_1 + eos) * [0] diff --git a/llm/ernie-3.5-se/utils.py b/llm/ernie-3.5-se/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..4260f80124ba92b370f54f97d1677522ea5c6442 --- /dev/null +++ b/llm/ernie-3.5-se/utils.py @@ -0,0 +1,231 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import time +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.optimizer.lr import LambdaDecay +from rouge import Rouge +from sklearn.metrics import accuracy_score + +from paddlenlp.metrics import BLEU +from paddlenlp.trainer import PrinterCallback, ProgressCallback, Trainer +from paddlenlp.trainer.integrations import TrainerCallback +from paddlenlp.utils.log import logger + + +class AverageStatistical(object): + def __init__(self): + self.reset() + + def reset(self): + self.total_cnt = 0 + self.time = 0 + + def record(self, val, cnt=1): + self.time += val + self.total_cnt += cnt + + def get_average(self): + if self.total_cnt == 0: + return 0 + + return self.time / self.total_cnt + + def get_average_per_sec(self): + if self.time == 0.0: + return 0.0 + + return float(self.total_cnt) / self.time + + def get_total_cnt(self): + return self.total_cnt + + def get_total_time(self): + return self.time + + +class BenchmarkCallback(TrainerCallback): + def __init__(self, benchmark=True, profiler_options=None): + self.benchmark = benchmark + self.profiler_options = profiler_options + + def on_train_begin(self, args, state, control, **kwargs): + assert args.gradient_accumulation_steps == 1 and not args.do_eval and not args.do_predict + if self.benchmark: + self.reader_cost_avg = AverageStatistical() + + def on_epoch_begin(self, args, state, control, **kwargs): + if self.benchmark: + self.epoch_start = time.time() + self.batch_start = time.time() + + def on_step_begin(self, args, state, control, **kwargs): + if self.benchmark: + self.reader_cost_avg.record(time.time() - self.batch_start) + + def on_step_end(self, args, state, control, **kwargs): + if self.benchmark: + self.batch_start = time.time() + if control.should_log: + self.maybe_log_save_evaluate_start = time.time() + + def on_log(self, args, state, control, logs=None, **kwargs): + if self.benchmark: + if logs is not None and "interval_steps_per_second" in logs: + self.batch_start = self.batch_start + (time.time() - self.maybe_log_save_evaluate_start) + ips = logs["interval_steps_per_second"] * args.train_batch_size + avg_batch_cost = 1 / logs["interval_steps_per_second"] + logger.info( + "global step %d / %d, loss: %f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sample/sec" + % ( + state.global_step, + state.max_steps, + logs["loss"], + self.reader_cost_avg.get_average(), + avg_batch_cost, + args.train_batch_size, + ips, + ) + ) + self.reader_cost_avg.reset() + + def on_epoch_end(self, args, state, control, **kwargs): + if self.benchmark: + train_epoch_cost = time.time() - self.epoch_start + logger.info("train epoch: %d, epoch_cost: %.5f s" % (state.epoch, train_epoch_cost)) + + +class Ernie35Trainer(Trainer): + def __init__(self, do_generation: bool, **kwargs): + super().__init__(**kwargs) + if self.args.benchmark or self.args.profiler_options is not None: + self.add_callback( + BenchmarkCallback(benchmark=self.args.benchmark, profiler_options=self.args.profiler_options) + ) + if self.args.benchmark: + if self.args.disable_tqdm: + self.pop_callback(PrinterCallback) + else: + self.pop_callback(ProgressCallback) + self.do_generation = do_generation + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) 
-> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + + if prediction_loss_only: + return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + elif not self.do_generation: + loss, logits, labels = super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + # argmax here to avoid gather all logits, which is too memory-consuming. + # keepdim in order to maintain the same shape as logits + return (loss, logits.argmax(axis=-1, keepdim=True), labels) + + model.eval() + + preds = model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_length=self.args.tgt_length, + min_length=0, + use_cache=True, + temperature=1.0, + top_k=1, + top_p=1.0, + repetition_penalty=1.0, + decode_strategy="sampling", + )[0] + all_labels = [] + for label in inputs["labels"].numpy(): + label = [x for x in label[label != self.tokenizer.pad_token_id]] + all_labels.append(label) + max_label_length = max([len(x) for x in all_labels]) + for index, labels in enumerate(all_labels): + all_labels[index] = labels + [-100] * (max_label_length - len(labels)) + + return (None, paddle.to_tensor(preds), paddle.to_tensor(all_labels)) + + def create_scheduler(self, num_training_steps: int): + num_warmup_steps = ( + self.args.warmup_steps if self.args.warmup_steps > 0 else self.args.warmup_ratio * num_training_steps + ) + + def lr_lambda(current_step: int): + if current_step < num_warmup_steps: + return float(current_step) / float(max(1, num_warmup_steps)) + else: + decay_step_ratio = (current_step - num_warmup_steps) / (num_training_steps - num_warmup_steps) + return 1.0 - (1.0 - self.args.lr_decay_ratio) * decay_step_ratio + + if self.lr_scheduler is None: + self.lr_scheduler = LambdaDecay(self.args.learning_rate, lr_lambda, last_epoch=-1) + return self.lr_scheduler + + def log(self, logs: Dict[str, float], **kwargs) -> None: + if "loss" in logs: + logs["ppl"] = np.exp(logs["loss"]) + if "eval_loss" in logs: + logs["eval_ppl"] = np.exp(logs["eval_loss"]) + + super(Ernie35Trainer, self).log(logs, **kwargs) + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + + rouge1 = round(rouge1, 4) + rouge2 = round(rouge2, 4) + rougel = round(rougel, 4) + bleu4 = round(bleu4.score(), 4) + return dict( + rouge1=rouge1, + rouge2=rouge2, + rougel=rougel, + bleu4=bleu4, + ) + + +def compute_metrics_not_do_generation(eval_preds): + flattened_preds = np.array(eval_preds.predictions).flatten() + flattened_labels = np.array(eval_preds.label_ids).flatten() + filtered_preds = flattened_preds[flattened_labels != -100] + filtered_labels = flattened_labels[flattened_labels != -100] + accuracy = accuracy_score(y_true=filtered_labels, y_pred=filtered_preds) + return { + "accuracy": accuracy, + } diff --git a/llm/export_model.py b/llm/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..167f558d763ed707e8c0527a8f29c9893ff8e9f7 --- /dev/null +++ b/llm/export_model.py @@ -0,0 +1,96 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
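To make the label-masking convention in `compute_metrics_not_do_generation` above concrete: positions labelled `-100` (prompt and padding tokens) are dropped before accuracy is computed. A small sketch with toy arrays (real `eval_preds` come from the trainer):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy predictions/labels; -100 marks positions to ignore, as in the helper above.
preds = np.array([[5, 7, 2], [9, 1, 3]]).flatten()
labels = np.array([[5, 7, -100], [9, 4, -100]]).flatten()

mask = labels != -100
print(accuracy_score(y_true=labels[mask], y_pred=preds[mask]))  # -> 0.75
```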
+from __future__ import annotations + +import os +from dataclasses import dataclass, field + +import paddle +from paddle.distributed import fleet +from predictor import ModelArgument, PredictorArgument, create_predictor +from tqdm import tqdm +from utils import generate_rank_mapping, get_infer_model_path + +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.utils.log import logger + + +@dataclass +class ExportArgument: + output_path: str = field(default=None, metadata={"help": "The output path of model."}) + + +def load_inference_model(model_path, model_name, param_name, exe): + model_abs_path = os.path.join(model_path, model_name) + param_abs_path = os.path.join(model_path, param_name) + if os.path.exists(model_abs_path) and os.path.exists(param_abs_path): + return paddle.static.io.load_inference_model(model_path, exe, model_name, param_name) + else: + return paddle.static.io.load_inference_model(model_path, exe) + + +def validate_pdmodel(model_path, model_prefix): + paddle.enable_static() + place = paddle.CUDAPlace(0) + exe = paddle.static.Executor(place) + scope = paddle.static.Scope() + + with paddle.static.scope_guard(scope): + net_program, feed_target_names, fetch_targets = paddle.static.io.load_inference_model( + os.path.join(model_path, model_prefix), exe + ) + + for block in net_program.blocks: + ops: list[paddle.framework.Operator] = block.ops + for op in tqdm(ops, desc="checking the validation of ops"): + if op.type.lower() == "print": + logger.warning(f"UNEXPECTED OP<{op.type}> which will reduce the performace of the static model") + + +def main(): + parser = PdArgumentParser((PredictorArgument, ModelArgument, ExportArgument)) + predictor_args, model_args, export_args = parser.parse_args_into_dataclasses() + + paddle.set_default_dtype(predictor_args.dtype) + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = paddle.distributed.get_rank() + if tensor_parallel_degree > 1: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + tensor_parallel_rank = hcg.get_model_parallel_rank() + + # set predictor type + predictor = create_predictor(predictor_args, model_args, tensor_parallel_degree, tensor_parallel_rank) + predictor.model.eval() + + predictor.model.to_static( + get_infer_model_path(export_args.output_path, predictor_args.model_prefix), + {"dtype": predictor_args.dtype, "export_precache": predictor_args.export_precache}, + ) + predictor.model.config.save_pretrained(export_args.output_path) + predictor.tokenizer.save_pretrained(export_args.output_path) + generate_rank_mapping(os.path.join(export_args.output_path, "rank_mapping.csv")) + + validate_pdmodel(export_args.output_path, "model") + + +if __name__ == "__main__": + main() diff --git a/llm/finetune_generation.py b/llm/finetune_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..b190352e3705f0565e40be9e2f06f684d3c3261a --- /dev/null +++ b/llm/finetune_generation.py @@ -0,0 +1,452 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import json +import os +import sys +from functools import partial + +import paddle +from argument import ( + DataArgument, + GenerateArgument, + ModelArgument, + QuantArgument, + TrainingArguments, +) +from data import get_convert_example +from utils import ( + CausalLMTrainer, + InTokensIterDatasetCallback, + compute_metrics, + get_lora_target_modules, + get_prefix_tuning_params, +) + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.datasets import InTokensIterableDataset, InTokensMapDataset, load_dataset +from paddlenlp.metrics import BLEU, Rouge1, Rouge2, RougeL +from paddlenlp.peft import LoRAConfig, LoRAModel, PrefixConfig, PrefixModelForCausalLM +from paddlenlp.trainer import PdArgumentParser, get_last_checkpoint +from paddlenlp.trainer.trainer_callback import TrainerState +from paddlenlp.transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoTokenizer, + LlamaTokenizer, +) +from paddlenlp.utils.log import logger + + +def read_local_dataset(path): + with open(path, "r", encoding="utf-8") as fp: + for line in fp: + yield json.loads(line.strip()) + + +def main(): + # Arguments + parser = PdArgumentParser((GenerateArgument, QuantArgument, ModelArgument, DataArgument, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + gen_args, quant_args, model_args, data_args, training_args = parser.parse_json_file( + json_file=os.path.abspath(sys.argv[1]) + ) + else: + gen_args, quant_args, model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + training_args.print_config(quant_args, "Quant") + training_args.print_config(gen_args, "Generation") + + if sum([quant_args.do_ptq, quant_args.do_qat, quant_args.do_gptq, training_args.do_train]) > 1: + raise ValueError( + "--do_train, --do_ptq, --do_gptq and --do_qat cannot work at the same time. Please choose only one at a time" + ) + + # Setup GPU & distributed training + paddle.set_device(training_args.device) + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+    )
+
+    # Load model
+    if training_args.fp16_opt_level == "O2":
+        if training_args.fp16:
+            dtype = "float16"
+        elif training_args.bf16:
+            dtype = "bfloat16"
+        else:
+            raise ValueError("Please specify dtype: --fp16 or --bf16")
+    else:
+        dtype = "float32"
+
+    if training_args.pipeline_parallel_degree > 1:
+        if data_args.eval_with_do_generation and training_args.do_eval:
+            raise ValueError("Please set eval_with_do_generation to false in pipeline parallel mode.")
+        from llama.modeling_pp import LlamaForCausalLMPipe
+
+        model = LlamaForCausalLMPipe.from_pretrained(
+            model_args.model_name_or_path,
+            tensor_parallel_output=False,
+            tensor_parallel_degree=training_args.tensor_parallel_degree,
+            tensor_parallel_rank=training_args.tensor_parallel_rank,
+            use_flash_attention=model_args.use_flash_attention,
+            dtype=dtype,
+        )
+    else:
+        model_config = AutoConfig.from_pretrained(
+            model_args.model_name_or_path,
+            tensor_parallel_output=False,
+            tensor_parallel_degree=training_args.tensor_parallel_degree,
+            tensor_parallel_rank=training_args.tensor_parallel_rank,
+            dtype=dtype,
+        )
+        if hasattr(model_config, "use_flash_attention"):
+            model_config.use_flash_attention = model_args.use_flash_attention
+        model = AutoModelForCausalLM.from_pretrained(
+            model_args.model_name_or_path,
+            config=model_config,
+        )
+
+    # Load tokenizer & dataset
+    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
+    if isinstance(tokenizer, LlamaTokenizer):
+        tokenizer.pad_token_id = tokenizer.eos_token_id
+
+    if data_args.dataset_name_or_path is None:
+        raise ValueError(f"Please specify dataset name or path (got {data_args.dataset_name_or_path})")
+    elif os.path.exists(os.path.join(data_args.dataset_name_or_path, "train.json")) and os.path.exists(
+        os.path.join(data_args.dataset_name_or_path, "dev.json")
+    ):
+        # train_ds, dev_ds = load_dataset(
+        #     "json",
+        #     data_files={
+        #         "train": os.path.join(data_args.dataset_name_or_path, "train.json"),
+        #         "dev": os.path.join(data_args.dataset_name_or_path, "dev.json"),
+        #     },
+        #     lazy=data_args.lazy,
+        # )
+        train_ds = load_dataset(
+            read_local_dataset,
+            path=os.path.join(data_args.dataset_name_or_path, "train.json"),
+            lazy=data_args.lazy,
+        )
+        dev_ds = load_dataset(
+            read_local_dataset,
+            path=os.path.join(data_args.dataset_name_or_path, "dev.json"),
+            lazy=data_args.lazy,
+        )
+
+    elif os.path.exists(os.path.join(data_args.dataset_name_or_path, "train")) and os.path.exists(
+        os.path.join(data_args.dataset_name_or_path, "dev")
+    ):
+        import glob
+
+        train_files = glob.glob(os.path.join(data_args.dataset_name_or_path, "train", "*.json"))
+        dev_files = glob.glob(os.path.join(data_args.dataset_name_or_path, "dev", "*.json"))
+        train_ds, dev_ds = load_dataset(
+            "json", data_files={"train": train_files, "dev": dev_files}, lazy=data_args.lazy
+        )
+    else:
+        if data_args.task_name is not None:
+            train_ds, dev_ds = load_dataset(
+                data_args.dataset_name_or_path, data_args.task_name, splits=["train", "dev"]
+            )
+        else:
+            train_ds, dev_ds = load_dataset(data_args.dataset_name_or_path, splits=["train", "dev"])
+    # TODO(ZHUI & sijunhe): Temporary implementation. Generalize this logic and move to Trainer later.
+    if training_args.resume_from_checkpoint is not None and data_args.lazy:
+        logger.info(
+            f"Loading from '{training_args.resume_from_checkpoint}' with `lazy=True`, manually skipping dataset and setting `ignore_data_skip` to True."
+ ) + training_args.ignore_data_skip = True + state = TrainerState.load_from_json(os.path.join(training_args.resume_from_checkpoint, "trainer_state.json")) + if state.trial_params is not None and "intokens_global_step" in state.trial_params: + consumed_samples = state.trial_params["intokens_global_step"] + else: + consumed_samples = ( + state.global_step + * training_args.per_device_train_batch_size + * training_args.gradient_accumulation_steps + * training_args.dataset_world_size + ) + logger.info( + f"Skipping the first {consumed_samples} samples to warmup the dataset from checkpoint '{training_args.resume_from_checkpoint}'." + ) + train_ds = train_ds.skip(consumed_samples) + + if training_args.pipeline_parallel_degree > 1: + from data import convert_example_common + + trans_func = partial(convert_example_common, tokenizer=tokenizer, data_args=data_args) + else: + trans_func = partial(get_convert_example(model), tokenizer=tokenizer, data_args=data_args) + if data_args.intokens: + if ( + model.base_model_prefix not in ["llama", "bloom", "chatglm", "chatglm_v2", "qwen"] + and training_args.pipeline_parallel_degree < 1 + ): + raise NotImplementedError( + "InTokens data stream is only implemented for LLaMA, Bloom, ChatGLM and QWen so far." + ) + train_ds = train_ds.map(partial(trans_func, is_test=False, intokens=data_args.intokens)) + eval_intokens = data_args.intokens + if data_args.intokens and data_args.eval_with_do_generation: + logger.warning( + "`intokens` conflicts with `eval_with_do_generation`. Setting intokens to False for the eval_dataset." + ) + eval_intokens = False + dev_ds = dev_ds.map(partial(trans_func, is_test=data_args.eval_with_do_generation, intokens=eval_intokens)) + if data_args.intokens: + if data_args.lazy: + intoken_dataset = InTokensIterableDataset + else: + intoken_dataset = InTokensMapDataset + logger.info("Creating InTokens Data Stream. 
This may take a few minutes.") + train_ds = intoken_dataset( + train_ds, + tokenizer=tokenizer, + max_length=data_args.max_length, + ) + if eval_intokens: + dev_ds = intoken_dataset( + dev_ds, + tokenizer=tokenizer, + max_length=data_args.max_length, + ) + + if model_args.prefix_tuning: + prefix_tuning_params = get_prefix_tuning_params(model) + prefix_config = PrefixConfig( + num_prefix_tokens=model_args.num_prefix_tokens, + num_attention_heads=prefix_tuning_params["num_attention_heads"], + num_hidden_layers=prefix_tuning_params["num_hidden_layers"], + hidden_size=prefix_tuning_params["hidden_size"], + multi_query_group_num=prefix_tuning_params["multi_query_group_num"], + dtype=dtype, + ) + model = PrefixModelForCausalLM( + model=model, + prefix_config=prefix_config, + postprocess_past_key_value=prefix_tuning_params["postprocess_past_key_value"], + ) + model.mark_only_prefix_as_trainable() + model.print_trainable_parameters() + + if model_args.lora: + if model_args.lora_path is None: + target_modules = get_lora_target_modules(model) + lora_config = LoRAConfig( + target_modules=target_modules, + r=model_args.lora_rank, + lora_alpha=2 * model_args.lora_rank, + merge_weights=False, + tensor_parallel_degree=training_args.tensor_parallel_degree, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + else: + model = LoRAModel.from_pretrained(model=model, lora_path=model_args.lora_path) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + def compute_metrics_do_generation(eval_preds): + rouge1 = Rouge1() + rouge2 = Rouge2() + rougel = RougeL() + bleu4 = BLEU(n_size=4) + + predictions = [x[x != -100].tolist() for x in eval_preds.predictions] + references = [x[x != -100].tolist() for x in eval_preds.label_ids] + + predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True, clean_up_tokenization_spaces=False) + references = tokenizer.batch_decode(references, skip_special_tokens=True, clean_up_tokenization_spaces=False) + if data_args.save_generation_output: + with open(os.path.join(training_args.output_dir, "generated_output.json"), "w", encoding="utf-8") as f: + for pred, ref in zip(predictions, references): + out = {"output": pred, "tgt": ref} + f.write(json.dumps(out, ensure_ascii=False) + "\n") + + # for pred in predictions: + rouge1_score = rouge1.score(predictions, references) + rouge2_score = rouge2.score(predictions, references) + for pred, ref in zip(predictions, references): + rougel.add_inst(pred, [ref]) + bleu4.add_inst(pred, [ref]) + return { + "rouge1": rouge1_score, + "rouge2": rouge2_score, + "rougel": rougel.score(), + "bleu4": bleu4.score(), + } + + # Create trainer + max_length = data_args.max_length if training_args.pipeline_parallel_degree > 1 else None + padding = "max_length" if training_args.pipeline_parallel_degree > 1 else True + trainer = CausalLMTrainer( + model=model, + args=training_args, + train_dataset=train_ds, + eval_dataset=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics_do_generation if data_args.eval_with_do_generation else compute_metrics, + data_collator=DataCollatorForSeq2Seq( + tokenizer=tokenizer, + max_length=max_length, + padding=padding, + max_label_length=max_length, + return_tensors="np", + ), + do_generation=data_args.eval_with_do_generation, + callbacks=[InTokensIterDatasetCallback()] if isinstance(train_ds, InTokensIterableDataset) else None, + gen_args=gen_args, + data_args=data_args, + ) + + # Train + if training_args.do_train: + checkpoint = None + if 
training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + train_result = trainer.train(resume_from_checkpoint=checkpoint) + if training_args.benchmark: + total_effective_tokens = ( + sum([len(i["input_ids"]) for i in trainer.train_dataset]) * training_args.num_train_epochs + ) + effective_tokens_per_second = total_effective_tokens / train_result.metrics["train_runtime"] + logger.info(f"Effective_Tokens_per_second: {effective_tokens_per_second} ") + logger.info("Benchmark done.") + else: + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + trainer.log_metrics("train", train_result.metrics) + trainer.save_metrics("train", train_result.metrics) + trainer.save_state() + + # QAT + if quant_args.do_qat: + if training_args.tensor_parallel_degree > 1: + raise NotImplementedError("Only support qat on single gpu.") + from quant import create_qat_model + + trainer.model = create_qat_model(quant_args, trainer.model, dtype) + train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) + trainer.save_model() + trainer.log_metrics("qat", train_result.metrics) + trainer.save_metrics("qat", train_result.metrics) + trainer.save_state() + + # PTQ + if quant_args.do_ptq: + if isinstance(model, LoRAModel): + raise NotImplementedError( + "PTQ strategy not supported for LoRA model. Please merge lora parameters to pretrain model first." + ) + from quant import apply_ptq, apply_shift, apply_smooth, get_ptq_model_config + + trainer.model.eval() + # Prepare ptq dataloader + if os.path.exists(os.path.join(data_args.dataset_name_or_path, "quant.json")): + # ptq_ds = load_dataset( + # "json", data_files=os.path.join(data_args.dataset_name_or_path, "quant.json"), lazy=data_args.lazy, + # )[0] + ptq_ds = load_dataset( + read_local_dataset, + path=os.path.join(data_args.dataset_name_or_path, "quant.json"), + lazy=data_args.lazy, + ) + ptq_ds = ptq_ds.map(partial(trans_func, is_test=False)) + else: + ptq_ds = train_ds + logger.info( + f"Not found quant.json in {data_args.dataset_name_or_path}. Set train dataset as PTQ calibration dataset." + ) + ptq_dataloader = trainer.get_ptq_dataloader(ptq_ds) + if quant_args.shift or quant_args.smooth: + ptq_model_config = get_ptq_model_config(trainer.model) + + if quant_args.shift: + apply_shift(quant_args, trainer, ptq_dataloader, ptq_model_config) + + if quant_args.smooth: + apply_smooth(quant_args, trainer, ptq_dataloader, ptq_model_config) + + apply_ptq(quant_args, trainer, ptq_dataloader) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + + if quant_args.do_gptq: + if isinstance(model, LoRAModel): + raise NotImplementedError( + "PTQ strategy not supported for LoRA model. Please merge lora parameters to pretrain model first." + ) + from quant import apply_gptq + + # Prepare ptq dataloader + if os.path.exists(os.path.join(data_args.dataset_name_or_path, "quant.json")): + # ptq_ds = load_dataset( + # "json", data_files=os.path.join(data_args.dataset_name_or_path, "quant.json"), lazy=data_args.lazy, + # )[0] + ptq_ds = load_dataset( + read_local_dataset, + path=os.path.join(data_args.dataset_name_or_path, "quant.json"), + lazy=data_args.lazy, + ) + ptq_ds = ptq_ds.map(partial(trans_func, is_test=False)) + else: + ptq_ds = train_ds + logger.info( + f"Not found quant.json in {data_args.dataset_name_or_path}. Set train dataset as PTQ calibration dataset." 
+ ) + ptq_dataloader = trainer.get_ptq_dataloader(ptq_ds) + apply_gptq(quant_args, trainer, ptq_dataloader) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + + # Evaluation dev set + if training_args.do_eval: + eval_result = trainer.evaluate(dev_ds) + trainer.log_metrics("eval", eval_result) + + # Evaluation test set + if training_args.do_predict: + # test_ds = load_dataset( + # "json", data_files=os.path.join(data_args.dataset_name_or_path, "test.json"), lazy=data_args.lazy, + # )[0] + test_ds = load_dataset( + read_local_dataset, + path=os.path.join(data_args.dataset_name_or_path, "test.json"), + lazy=data_args.lazy, + ) + test_ds = test_ds.map(partial(trans_func, is_test=data_args.eval_with_do_generation)) + eval_result = trainer.predict(test_ds).metrics + trainer.log_metrics("test", eval_result) + + +if __name__ == "__main__": + main() diff --git a/llm/flask_server.py b/llm/flask_server.py new file mode 100644 index 0000000000000000000000000000000000000000..7c2d0dc9bd183b447dbc2605105d7d87d9d8c6b1 --- /dev/null +++ b/llm/flask_server.py @@ -0,0 +1,219 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import annotations + +import json +import re +import time +from dataclasses import dataclass, field +from multiprocessing.shared_memory import SharedMemory + +from predictor import BasePredictor, ModelArgument, PredictorArgument, create_predictor + +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.utils.log import logger + + +@dataclass +class ServerArgument: + port: int = field(default=8011, metadata={"help": "The port of ui service"}) + flask_port: int = field(default=8010, metadata={"help": "The port of flask service"}) + title: str = field(default="LLM", metadata={"help": "The title of gradio"}) + + +def read_shared_memory(memory: SharedMemory): + """read content from shared memory + + Args: + memory (SharedMemory): the instance of shared Memory + """ + length = int(memory.buf[0]) * 256 + int(memory.buf[1]) + if length == 0: + return "" + + sentence = bytes(memory.buf[2 : length + 2]).decode() + return sentence + + +def write_shared_memory(memory: SharedMemory, sentence: str): + """write content into shared memory + + [0:2]: store the length of sentence + [2:]: store the content of sentence + + Args: + memory (SharedMemory): the instance of shared Memory + sentence (str): the content which must be string + """ + buffer = bytearray(memory.buf.nbytes) + data = sentence.encode("utf-8") + + buffer[0:2] = bytearray([len(data) // 256, len(data) % 256]) + buffer[2 : len(data) + 2] = data + memory.buf[:] = buffer + + +SLEEP_SECOND = 0.5 +SHARED_MEMORY_NAME = "shared_memory" + + +def create_shared_memory(name: int, rank: int): + """create shared memory between multi-process + + Args: + name (int): the name of memory block + rank (int): the rank of current process + """ + file = f"{SHARED_MEMORY_NAME}-{name}" + shared_memory = None + if rank != 0: + while True: + try: 
+ shared_memory = SharedMemory(file, size=1024 * 100) + print("success create shared_memory") + break + except FileNotFoundError: + time.sleep(0.01) + print("sleep for create shared memory") + else: + shared_memory = SharedMemory(file, create=True, size=1024 * 100) + return shared_memory + + +def enforce_stop_tokens(text, stop) -> str: + """Code by Langchain""" + """Cut off the text as soon as any stop words occur.""" + return re.split(re.escape(stop), text)[0] + + +class PredictorServer: + def __init__(self, args: ServerArgument, predictor: BasePredictor): + + self.predictor = predictor + self.args = args + + self.input_shared_memory = create_shared_memory("input", self.predictor.tensor_parallel_rank) + self.output_shared_memory = create_shared_memory("output", self.predictor.tensor_parallel_rank) + + if self.predictor.tensor_parallel_rank == 0: + write_shared_memory(self.input_shared_memory, "") + write_shared_memory(self.output_shared_memory, "") + + def predict(self, input_texts: str | list[str]): + return self.predictor.predict(input_texts) + + def start_predict(self, data): + print("start to predict under data", data) + + data = json.dumps(data, ensure_ascii=False) + write_shared_memory(self.input_shared_memory, data) + + while True: + result = read_shared_memory(self.output_shared_memory) + if result: + write_shared_memory(self.output_shared_memory, "") + return result + + else: + print("not found result, so to sleep ...") + + time.sleep(0.5) + + def start_flask_server(self): + from flask import Flask, jsonify, request + + app = Flask(__name__) + + @app.post("/api/chat") + def _server(): + data = request.get_json() + logger.info(f"Request: {json.dumps(data, indent=2, ensure_ascii=False)}") + try: + pred_seq = self.start_predict(data) + output = { + "error_code": 0, + "error_msg": "Success", + "result": {"response": {"role": "bot", "utterance": pred_seq}}, + } + except Exception as err: + logger.error(f"Server error: {err}") + output = {"error_code": 1000, "error_msg": f"Server error: {err}", "result": None} + + logger.info(f"Response: {json.dumps(output, indent=2, ensure_ascii=False)}") + return jsonify(output) + + app.run(host="0.0.0.0", port=self.args.flask_port) + + def start_ui_service(self, args): + # do not support start ui service in one command + from multiprocessing import Process + + from gradio_ui import main + + p = Process(target=main, args=(args,)) + p.daemon = True + p.start() + + +def main(args, server: PredictorServer): + from time import sleep + + while True: + sleep(0.5) + content = read_shared_memory(server.input_shared_memory) + + if content: + content = json.loads(content) + + context = content.pop("context", "") + content.pop("extra_info", None) + + generation_args = content + server.predictor.config.max_length = generation_args["max_length"] + server.predictor.config.top_p = generation_args["top_p"] + server.predictor.config.temperature = generation_args["temperature"] + server.predictor.config.top_k = generation_args["top_k"] + server.predictor.config.repetition_penalty = generation_args["repetition_penalty"] + + for key, value in generation_args.items(): + setattr(server.args, key, value) + + result = server.predict(context) + result = result[0] + if not result: + result = "invalid response" + write_shared_memory(server.output_shared_memory, result) + write_shared_memory(server.input_shared_memory, "") + + +if __name__ == "__main__": + + parser = PdArgumentParser((PredictorArgument, ModelArgument, ServerArgument)) + predictor_args, model_args, server_args = 
parser.parse_args_into_dataclasses() + predictor = create_predictor(predictor_args, model_args) + + server = PredictorServer(server_args, predictor) + + if server.predictor.tensor_parallel_rank == 0: + server.start_ui_service(server_args) + + from multiprocessing import Process + + p = Process( + target=server.start_flask_server, + ) + p.daemon = True + p.start() + + main(server_args, server) diff --git a/llm/glm/README.md b/llm/glm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..86bc69d571e6c1de3b71281e8993a10a0436d1bc --- /dev/null +++ b/llm/glm/README.md @@ -0,0 +1,102 @@ +# GLM + +## 1. 模型介绍 + +[General Language Model (GLM)](https://arxiv.org/abs/2103.10360) 是以自回归填空作为训练目标的通用语言模型,可用于各类理解和生成任务。 + +现有预训练框架包括以 BERT 为代表的自编码模型,以 GPT 为代表的自回归模型和以 T5 为代表的编码-解码模型。但这些框架均不能完全支持自然语言理解、无条件生成和条件生成这三类主要任务。为了解决这一问题,我们提出了基于自回归填空任务的通用语言模型(GLM)。GLM 使用 2D 位置编码和任意顺序预测改进了填空预训练过程,在自然语言理解任务上超越了 BERT 和 T5。同时,GLM 的预训练过程基于多种任务,填空长度和数量各不相同。在自然语言理解、无条件生成和条件生成任务上,GLM 均超过了具有相同参数规模和训练数据量的 BERT、T5 和 GPT 模型。除此之外,GLM 还以 BERT Large 1.25 倍参数量的规模取得了当前最优的效果,证明了其在不同下游任务上良好的泛化能力。 + + +**支持模型权重:** + +| Model | +|----------------------------------| +| THUDM/glm-large-chinese | +| THUDM/glm-10b-chinese | + +## 3. 模型精调 + +### SFT + +``` +python -m paddle.distributed.launch --gpus "0,1,2,3" finetune_generation.py \ +--model_name_or_path THUDM/glm-large-chinese \ +--num_train_epochs 4 \ +--learning_rate 3e-5 \ +--warmup_ratio 0.06 \ +--weight_decay 0.1 \ +--label_smoothing 0.1 \ +--save_steps 100 \ +--logging_steps 1 \ +--eval_steps 100 \ +--output_dir ./checkpoints/glm-large-chinese \ +--src_length 608 \ +--tgt_length 160 \ +--min_tgt_length 55 \ +--length_penalty 0.7 \ +--no_repeat_ngram_size 3 \ +--num_beams 5 \ +--select_topk True \ +--per_device_eval_batch_size 2 \ +--per_device_train_batch_size 2 \ +--max_grad_norm 1.0 \ +--lr_scheduler_type linear \ +--fp16 \ +--fp16_opt_level O2 \ +--recompute \ +--do_train \ +--do_eval +``` + +### 单卡LoRA微调 + +``` +python finetune_generation.py \ +--model_name_or_path THUDM/glm-large-chinese \ +--num_train_epochs 4 \ +--learning_rate 3e-5 \ +--warmup_ratio 0.06 \ +--weight_decay 0.1 \ +--label_smoothing 0.1 \ +--save_steps 100 \ +--logging_steps 1 \ +--eval_steps 100 \ +--output_dir ./checkpoints/glm-large-chinese \ +--src_length 608 \ +--tgt_length 160 \ +--min_tgt_length 55 \ +--length_penalty 0.7 \ +--no_repeat_ngram_size 3 \ +--num_beams 5 \ +--select_topk True \ +--per_device_eval_batch_size 2 \ +--per_device_train_batch_size 2 \ +--max_grad_norm 1.0 \ +--lr_scheduler_type linear \ +--fp16 \ +--fp16_opt_level O2 \ +--recompute \ +--do_train \ +--do_eval \ +--lora True +``` + +其中参数释义如下: + +- `model_name_or_path`: 预训练模型内置名称或者模型所在目录,默认为`THUDM/glm-large-chinese`。 +- `src_length`: 上下文的最大输入长度,默认为608. +- `tgt_length`: 生成文本的最大长度,默认为160. +- `min_tgt_length`: 生成文本的最小长度,默认为55. +- `length_penalty`: 生成解码时的长度惩罚因子,默认为0.7. +- `num_beams`: 搜索方向数量,默认为5。 +- `label_smoothing`: 标签平滑因子,默认为0.1. +- `lr_decay_ratio`: 学习率衰减因子,默认为0.1. +- `lora`: 是否使用LoRA技术. + + +## 3.4 动态图推理 + +``` +python predict_generation.py \ + --model_name_or_path THUDM/glm-large-chinese +``` diff --git a/llm/glm/data.py b/llm/glm/data.py new file mode 100644 index 0000000000000000000000000000000000000000..40f5f3320a64cb25c1b78c6125e4accd0835d634 --- /dev/null +++ b/llm/glm/data.py @@ -0,0 +1,67 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + + +def custom_convert_example(example, tokenizer, data_args, is_test=True): + source = None + title = None + target = None + if "source" in example and "title" in example: + source = example["source"] + if "title" in example.keys(): + title = example["title"] + elif "context" in example and "answer" in example: + source = example["context"] + if "answer" in example.keys(): + title = example["answer"] + else: + assert False, "Source and title are not in the input dictionary, nor are context and answer." + if "target" in example.keys(): + target = example["target"] + elif "question" in example.keys(): + target = example["question"] + example["text_a"] = "答案:" + title + "," + "上下文:" + source + example["text_b"] = "在已知答案的前提下,问题:" + target + inputs = tokenizer.encode(example["text_a"], max_length=data_args.src_length - 1, truncation=True) + inputs["input_ids"] = inputs["input_ids"][:-1] + [tokenizer.gmask_token_id] + inputs["input_ids"][-1:] + pad_length = data_args.src_length - len(inputs["input_ids"]) + inputs["input_ids"] = np.array([inputs["input_ids"] + [tokenizer.pad_token_id] * pad_length]) + inputs["attention_mask"] = np.array([inputs["attention_mask"] + [1] + [0] * pad_length]) + sep = inputs["input_ids"].shape[1] + inputs = tokenizer.build_inputs_for_generation( + inputs, + max_gen_length=data_args.tgt_length, + targets=" " + example["text_b"] if not is_test else None, + padding="max_length", + ) + + for input_name in inputs.keys(): + inputs[input_name] = inputs[input_name].squeeze(0) + if is_test: + inputs["position_ids"] = inputs["position_ids"][:, : inputs["input_ids"].shape[-1]] + labels = tokenizer.encode( + " " + example["text_b"], add_special_tokens=False, max_length=data_args.tgt_length - 1 + )["input_ids"] + loss_mask = [0] * sep + [1] * len(labels) + [0] * (data_args.tgt_length - len(labels)) + labels = ( + [0] * sep + + labels + + [tokenizer.eop_token_id] + + [tokenizer.pad_token_id] * (data_args.tgt_length - len(labels) - 1) + ) + inputs["label_ids"] = labels + inputs["loss_mask"] = loss_mask + return inputs diff --git a/llm/glm/finetune_generation.py b/llm/glm/finetune_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..dd536c73f87a9aff029804857c1388eacdf1f508 --- /dev/null +++ b/llm/glm/finetune_generation.py @@ -0,0 +1,189 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
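For reference, `custom_convert_example` in `llm/glm/data.py` above concatenates the answer and context into `text_a`, puts the question into `text_b`, and splices the `[gMASK]` id in just before the final special token. A small sketch with a made-up `dureader_qg`-style record (field values and token ids below are illustrative only):

```python
# Illustrative record in the context/answer/question format handled above (values are made up).
example = {"context": "贷款的基本条件是有稳定的收入来源。", "answer": "稳定的收入来源", "question": "贷款需要什么条件?"}

text_a = "答案:" + example["answer"] + "," + "上下文:" + example["context"]
text_b = "在已知答案的前提下,问题:" + example["question"]

# The [gMASK] id is inserted just before the final special token (hypothetical ids).
input_ids, gmask_token_id = [101, 202, 303, 102], 50000
input_ids = input_ids[:-1] + [gmask_token_id] + input_ids[-1:]
print(text_a, text_b, input_ids)
```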
+ +import os +import sys +from dataclasses import dataclass, field +from functools import partial + +import paddle +from data import custom_convert_example +from utils import GLMTrainer + +from paddlenlp.data import DefaultDataCollator +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import BLEU, Rouge1, Rouge2, RougeL +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, get_last_checkpoint +from paddlenlp.transformers import AutoModelForConditionalGeneration, AutoTokenizer +from paddlenlp.utils.log import logger + + +@dataclass +class DataArgument: + task_name: str = field(default="dureader_qg", metadata={"help": "The name of task."}) + src_length: int = field(default=608, metadata={"help": "The max length of source text."}) + tgt_length: int = field(default=160, metadata={"help": "The max length of target text."}) + min_tgt_length: int = field(default=55, metadata={"help": "The min length of target text."}) + length_penalty: float = field(default=0.7, metadata={"help": "The length penalty."}) + no_repeat_ngram_size: int = field(default=3, metadata={"help": "The no repeat ngram size."}) + num_beams: int = field(default=5, metadata={"help": "The number of beams."}) + select_topk: bool = field(default=True, metadata={"help": "Whether to select top k tokens for generation."}) + top_p: float = field( + default=0.0, metadata={"help": "The cumulative probability for top-p-filtering in the 'sampling' strategy."} + ) + top_k: int = field( + default=0, + metadata={ + "help": "The number of highest probability tokens to keep for top-k-filtering in the 'sampling' strategy." + }, + ) + no_block_position: bool = field(default=False) + + +@dataclass +class ModelArgument: + model_name_or_path: str = field( + default="THUDM/glm-2b", metadata={"help": "Build-in pretrained model name or the path to local model."} + ) + label_smoothing: float = field(default=0.1, metadata={"help": "The label smoothing parameter."}) + lr_decay_ratio: float = field(default=0.1, metadata={"help": "The ratio for learning rate decrease"}) + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + + +def main(): + parser = PdArgumentParser((ModelArgument, DataArgument, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + setattr(training_args, "label_smoothing", model_args.label_smoothing) + setattr(training_args, "lr_decay_ratio", model_args.lr_decay_ratio) + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. 
+ last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + dtype = None + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + # Load the pretrained language model. + model = AutoModelForConditionalGeneration.from_pretrained( + model_args.model_name_or_path, + output_predict=True, + parallel_output=True, + load_state_as_np=True, + dtype=dtype, # todo enable set dtype to avoid additional mem usage + tensor_parallel_degree=training_args.tensor_parallel_degree, + tensor_parallel_rank=training_args.tensor_parallel_rank, + ) + if model_args.lora: + # TODO: hardcode parameters for now. Change after MergedLoRA is introduced + lora_config = LoRAConfig( + target_modules=[".*query_key_value.*"], + r=4, + lora_alpha=8, + merge_weights=True, + tensor_parallel_degree=training_args.tensor_parallel_degree, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + # Load the dataset. 
+ train_ds, dev_ds = load_dataset(data_args.task_name, splits=["train", "dev"]) + trans_func = partial(custom_convert_example, tokenizer=tokenizer, data_args=data_args) + train_ds = train_ds.map(partial(trans_func, is_test=False)) + test_ds = dev_ds.map(trans_func) + + collate_fn = DefaultDataCollator() + + def compute_metrics(eval_preds): + rouge1 = Rouge1() + rouge2 = Rouge2() + rougel = RougeL() + bleu4 = BLEU(n_size=4) + predictions = [x[x != -100] for x in eval_preds.predictions] + references = [x[x != -100] for x in eval_preds.label_ids] + + # for pred in predictions: + + rouge1_score = rouge1.score(predictions, references) + rouge2_score = rouge2.score(predictions, references) + for pred, ref in zip(predictions, references): + rougel.add_inst(pred, [ref]) + bleu4.add_inst(pred, [ref]) + return { + "rouge1": rouge1_score, + "rouge2": rouge2_score, + "rougel": rougel.score(), + "bleu4": bleu4.score(), + } + + trainer = GLMTrainer( + model=model, + args=training_args, + train_dataset=train_ds, + eval_dataset=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + do_generation=True, + data_collator=collate_fn, + ) + if training_args.fp16_opt_level == "O2": + trainer.disable_autocast_context_manager() + + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=last_checkpoint) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + trainer.log_metrics("train", train_result.metrics) + trainer.save_metrics("train", train_result.metrics) + trainer.save_state() + + if training_args.do_eval: + eval_result = trainer.evaluate(test_ds) + trainer.log_metrics("test", eval_result) + + +if __name__ == "__main__": + main() diff --git a/llm/glm/predict_generation.py b/llm/glm/predict_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..7467216557a19d14059d168a71f28b7d9c731b9d --- /dev/null +++ b/llm/glm/predict_generation.py @@ -0,0 +1,153 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from paddle.distributed import fleet + +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.transformers import ( + AutoConfig, + AutoModelForConditionalGeneration, + AutoTokenizer, +) + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_name_or_path", default="THUDM/glm-large-chinese", required=True, help="The directory of model." + ) + parser.add_argument("--lora_path", default=None, help="The directory of LoRA parameters. Default to None") + parser.add_argument( + "--merge_tensor_parallel_path", default=None, help="The directory of model to merge tensor parallel parts." 
+ ) + parser.add_argument("--batch_size", type=int, default=2, help="The batch size of data.") + parser.add_argument("--src_length", type=int, default=200, help="The batch size of data.") + parser.add_argument("--tgt_length", type=int, default=20, help="The batch size of data.") + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + self.batch_size = args.batch_size + self.args = args + + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = 0 + if tensor_parallel_degree > 1: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + tensor_parallel_rank = hcg.get_model_parallel_rank() + + if self.args.lora_path is not None: + lora_config = LoRAConfig.from_pretrained(self.args.lora_path) + dtype = lora_config.dtype + else: + config = AutoConfig.from_pretrained(args.model_name_or_path) + dtype = config.dtype if config.dtype is not None else "float32" + + self.model = AutoModelForConditionalGeneration.from_pretrained( + args.model_name_or_path, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + load_state_as_np=True, + dtype=dtype, + low_cpu_mem_usage=True, + ) + if self.args.lora_path is not None: + self.model = LoRAModel.from_pretrained(self.model, self.args.lora_path) + self.model.eval() + + def preprocess(self, input_text): + input_text = [text.strip() + "[gMASK]" for text in input_text] + inputs = self.tokenizer( + input_text, + return_tensors="np", + add_special_tokens=True, + padding=True, + max_length=self.args.src_length, + truncation=True, + truncation_side="left", + ) + inputs = self.tokenizer.build_inputs_for_generation(inputs, max_gen_length=self.args.tgt_length) + inputs_tensor = {} + for key, value in inputs.items(): + inputs_tensor[key] = paddle.to_tensor(value) + return inputs_tensor + + def infer(self, inputs): + result = self.model.generate( + **inputs, + decode_strategy="sampling", + top_k=1, + max_length=self.args.tgt_length, + eos_token_id=self.tokenizer.eop_token_id, + pad_token_id=self.tokenizer.pad_token_id, + ) + result = result[0] + return result + + def postprocess(self, infer_data): + result = [] + for x in infer_data.tolist(): + res = self.tokenizer.decode(x, skip_special_tokens=True) + result.append(res) + out_dict = {"result": result} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + all_texts = [ + "答案:年基准利率4.35%,上下文:从实际看,贷款的基本条件是: 一是中国大陆居民,年龄在60岁以下; 二是有稳定的住址和工作或经营地点; 三是有稳定的收入来源; 四是无不良信用记录,贷款用途不能作为炒股,赌博等行为; 五是具有完全民事行为能力。在已知答案的前提下,问题:", + "答案:U系列,上下文:U系列是最好的,采用国际顶尖技术(由格力自主研发)双级变频压缩机,提高压缩机运转效率,制冷制热能力更强劲;1赫兹变频技术,使空调相当于一个15 W电灯泡,更加节能省电;送风面积广,风力大;生态风,净化空气。非常不错,现在国美在做活动,可以了解一下。在已知答案的前提下,问题:", + ] + batch_texts = batchfy_text(all_texts, args.batch_size) + for bs, texts in enumerate(batch_texts): + outputs = 
predictor.predict(texts) + for text, result in zip(texts, outputs["result"]): + print("{}\n{}".format(text, result)) + + if args.merge_tensor_parallel_path is not None: + predictor.model.save_pretrained( + save_dir=args.merge_tensor_parallel_path, + merge_tensor_parallel=True, + ) + predictor.tokenizer.save_pretrained(args.merge_tensor_parallel_path) diff --git a/llm/glm/utils.py b/llm/glm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d3b9e8919aa7286df303769873f98ce678a838df --- /dev/null +++ b/llm/glm/utils.py @@ -0,0 +1,79 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +import paddle.nn as nn + +from paddlenlp.trainer import Trainer + + +class GLMTrainer(Trainer): + def __init__(self, do_generation: bool, **kwargs): + super().__init__(**kwargs) + self.do_generation = do_generation + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + + if not self.do_generation: + return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + + model.eval() + with paddle.no_grad(): + tokens = model.generate( + input_ids=inputs["input_ids"], + position_ids=inputs["position_ids"], + attention_mask=inputs["attention_mask"], + decode_strategy="sampling", + top_k=1, + repetition_penalty=2.0, + bos_token_id=self.tokenizer.sop_token_id, + eos_token_id=self.tokenizer.eop_token_id, + pad_token_id=self.tokenizer.pad_token_id, + )[0] + all_preds = [] + for pred_tokens in tokens: + all_preds.append(pred_tokens[pred_tokens != self.tokenizer.pad_token_id].tolist()) + max_pred_length = max([len(x) for x in all_preds]) + for index, preds in enumerate(all_preds): + all_preds[index] = preds + [-100] * (max_pred_length - len(preds)) + + all_labels = [] + for label, mask in zip(inputs["labels"].numpy(), inputs["loss_mask"].numpy()): + label = label[mask.astype("bool")] + label = [x for x in label[label != self.tokenizer.pad_token_id]] + all_labels.append(label) + max_label_length = max([len(x) for x in all_labels]) + for index, labels in enumerate(all_labels): + all_labels[index] = labels + [-100] * (max_label_length - len(labels)) + + return (None, paddle.to_tensor(all_preds), paddle.to_tensor(all_labels)) + + def log(self, logs: Dict[str, float], **kwargs) -> None: + + if self.state.epoch is not None: + logs["epoch"] = round(self.state.epoch, 4) + + if "eval_loss" in logs: + logs["eval_ppl"] = np.exp(logs["eval_loss"]) + output = {**logs, **{"step": self.state.global_step}} + self.state.log_history.append(output) + self.control = self.callback_handler.on_log(self.args, self.state, self.control, logs, **kwargs) diff --git a/llm/gpt-3/README.md b/llm/gpt-3/README.md new file mode 100644 index 
0000000000000000000000000000000000000000..5c687502627f3072dd62d19579df421afd2b7764 --- /dev/null +++ b/llm/gpt-3/README.md @@ -0,0 +1,194 @@ +# GPT + +## 1. 模型介绍 + +GPT-3是一种预训练语言模型,它能够模拟人类语言思维和表达。GPT-3拥有巨大的参数,包含了1750亿个参数,这使得它具有强大的语言理解和生成能力。它可以完成的任务包括文本生成、文本摘要、回答问题、翻译、阅读理解等。GPT-3的预训练过程使用了大量的语料库,包括互联网上的大量文本。它通过分析这些文本,学习如何生成和理解人类语言。GPT-3在自然语言处理领域具有很高的影响力,它可以模拟人类对话和生成文本,这使得它在许多应用领域都有广泛的应用,比如智能客服、自然语言处理、游戏设计等。 + +## 2. 预训练 + +预训练数据制作参考[此处](../../model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md) + +为了方便用户运行测试本模型,本项目提供了处理好的100k条doc的训练样本: +```shell +wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy +wget https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz +``` + +将所有预处理得到的文件统一放入一个文件夹中,以备训练使用: + +``` +mkdir data +mv gpt_en_dataset_300m_ids.npy ./data +mv gpt_en_dataset_300m_idx.npz ./data +``` + +注意: +1. 需要paddle develop版本训练,需要安装`pip install tool_helpers visualdl==2.5.3`等相关缺失whl包 +2. `use_flash_attention` 需要在A100机器开启,否则loss可能不正常(很快变成0.00x,非常小不正常)。建议使用cuda11.8环境。 + +使用下面脚本,即可在gpt2-medium-en的基础上,继续训练. +```shell +task_name="gpt3_hybrid" +export PYTHONPATH="../../PaddleNLP/" +export FLAGS_cudnn_deterministic=True +log_dir="log" +rm -rf $log_dir + +python -u -m paddle.distributed.launch \ + --gpus "0" \ + --log_dir ${log_dir} \ + run_pretrain.py \ + --model_type "gpt" \ + --model_name_or_path gpt2-medium-en \ + --tokenizer_name_or_path gpt2-medium-en \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 1024 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --tensor_parallel_degree 1 \ + --pipeline_parallel_degree 1 \ + --fuse_attention_qkv 1 \ + --use_flash_attention 0 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --learning_rate 0.00001 \ + --min_learning_rate 0.000005 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 1\ + --dataloader_num_workers 1 \ + --sharding "stage2" \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --recompute 1 \ + --gradient_accumulation_steps 2 \ + --do_train \ + --do_eval \ + --device "gpu" +``` + +其中参数释义如下: + +- `model_name_or_path`: 预训练模型内置名称或者模型所在目录,默认为`gpt2-medium-en`。 +- `num_train_epochs`: 要执行的训练 epoch 总数(如果不是整数,将在停止训练之前执行最后一个 epoch +的小数部分百分比)。 +- `max_steps`: 模型训练步数。 +- `learning_rate`: 参数更新的学习率。 +- `warmup_steps`: 学习率热启的步数。 +- `eval_steps`: 模型评估的间隔步数。 +- `logging_steps`: 训练日志打印的间隔步数。 +- `save_steps`: 模型参数保存的间隔步数。 +- `output_dir`: 模型参数保存目录。 +- `src_length`: 上下文的最大输入长度,默认为128. +- `tgt_length`: 生成文本的最大长度,默认为160. +- `gradient_accumulation_steps`: 模型参数梯度累积的步数,可用于扩大 batch size。实际的 batch_size = per_device_train_batch_size * gradient_accumulation_steps。 +- `fuse_attention_qkv`:在MultiHeadAttention中使用qkv线性层融合 +- `use_flash_attention`:使用flash attention技术,注意此处需要在A100机器开启 +- `fp16`: 使用 float16 精度进行模型训练和推理。 +- `fp16_opt_level`: float16 精度训练模式,`O2`表示纯 float16 训练。 +- `recompute`: 使用重计算策略,开启后可节省训练显存。 +- `do_train`: 是否训练模型。 +- `do_eval`: 是否评估模型。 +- `tensor_parallel_degree`: 模型并行数量。 +- `lora`: 是否使用LoRA技术。 +- `task_name`: 内置数据集任务名。 + + + + +## 3. 
微调 +### SFT + +```shell +task_name="gpt3_hybrid" +export PYTHONPATH="../../PaddleNLP/" +export FLAGS_cudnn_deterministic=True +log_dir="log" +rm -rf $log_dir + +python -u -m paddle.distributed.launch \ + --gpus "0" \ + --log_dir ${log_dir} \ + finetune_generation.py \ + --model_type "gpt" \ + --model_name_or_path gpt2-medium-en \ + --output_dir "output/$task_name" \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 1 \ + --tensor_parallel_degree 1 \ + --pipeline_parallel_degree 1 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --learning_rate 0.00001 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 1\ + --dataloader_num_workers 1 \ + --sharding "stage2" \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --recompute 1 \ + --gradient_accumulation_steps 2 \ + --do_train \ + --do_eval \ + --device "gpu" +``` + +### LoRA + +```shell +task_name="gpt3_hybrid" +export PYTHONPATH="../../PaddleNLP/" +export FLAGS_cudnn_deterministic=True +log_dir="log" +rm -rf $log_dir + +python finetune_generation.py \ + --model_type "gpt" \ + --model_name_or_path gpt2-medium-en \ + --output_dir "output/$task_name" \ + --per_device_train_batch_size 2 \ + --per_device_eval_batch_size 1 \ + --tensor_parallel_degree 1 \ + --pipeline_parallel_degree 1 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --learning_rate 3e-4 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 1\ + --dataloader_num_workers 1 \ + --sharding "stage2" \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --recompute 1 \ + --gradient_accumulation_steps 2 \ + --do_train \ + --do_eval \ + --device "gpu" \ + --lora +``` + + +## 4. 动态图推理 + +```shell +python predict_generation.py +``` diff --git a/llm/gpt-3/finetune_generation.py b/llm/gpt-3/finetune_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..a8906807ad98efa71e4dd466cae3ca50fb14930c --- /dev/null +++ b/llm/gpt-3/finetune_generation.py @@ -0,0 +1,247 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
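+
+# NOTE: minimal launch sketch (illustrative only, not part of the original script); the complete
+# SFT and LoRA commands are documented in llm/gpt-3/README.md, and the paths/hyper-parameters
+# below should be adapted to your environment:
+#
+#   python -u -m paddle.distributed.launch --gpus "0" finetune_generation.py \
+#       --model_type "gpt" --model_name_or_path gpt2-medium-en \
+#       --output_dir "output/gpt3_hybrid" --fp16 --fp16_opt_level "O2" \
+#       --sharding "stage2" --do_train --do_eval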
+ +import os +import sys +from dataclasses import dataclass, field +from functools import partial + +import paddle +from modeling_pp import GPTForCausalLMPipe +from utils import ( + DataCollatorForSupervisedDataset, + GPTTrainer, + compute_metrics, + convert_example, +) + +from paddlenlp.datasets import load_dataset +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.trainer import ( + PdArgumentParser, + TrainingArguments, + get_last_checkpoint, + set_seed, +) +from paddlenlp.transformers import AutoTokenizer, GPTConfig, GPTForCausalLM +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "gpt": (GPTConfig, GPTForCausalLM), +} + + +@dataclass +class DataArgument: + task_name: str = field(default="squad", metadata={"help": "The name of task."}) + src_length: int = field(default=1024, metadata={"help": "The max length of source text."}) + tgt_length: int = field(default=142, metadata={"help": "The max length of target text."}) + generate_num: int = field(default=0, metadata={"help": "Save first k examples generation result in dev dataset"}) + + +@dataclass +class ModelArgument: + model_type: str = field( + default="gpt-cn", metadata={"help": "Build-in pretrained model from the different model type."} + ) + model_name_or_path: str = field( + default="gpt-cpm-large-cn", metadata={"help": "Build-in pretrained model name or the path to local model."} + ) + use_flash_attn: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + enable_fuse_transformer: bool = field( + default=False, + metadata={"help": "gpt, enable_fuse_transformer"}, + ) + + fuse_attention_qkv: bool = field( + default=False, + metadata={"help": "gpt, fuse_attention_qkv"}, + ) + eval_with_do_generation: bool = field( + default=True, metadata={"help": "Evaluate with generation, instead for calc loss."} + ) + lr_decay_ratio: float = field(default=0.1, metadata={"help": "The ratio for learning rate decrease"}) + # lora + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + lora_path: str = field(default=None, metadata={"help": "Initialize lora state dict."}) + lora_rank: int = field(default=8, metadata={"help": "Lora attention dimension"}) + merge_weights: bool = field( + default=False, metadata={"help": "Merge weights of the original model and the Lora model"} + ) + + +def main(): + parser = PdArgumentParser((ModelArgument, DataArgument, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + # data_args.always_pad_to_max_length = False + data_args.always_pad_to_max_length = training_args.pipeline_parallel_degree > 1 + setattr(training_args, "lr_decay_ratio", model_args.lr_decay_ratio) + + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + training_args.tgt_length = data_args.tgt_length + paddle.set_device(training_args.device) + + set_seed(args=training_args) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. 
+ last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + config_class, model_class = MODEL_CLASSES[model_args.model_type] + if training_args.pipeline_parallel_degree > 1: + model_class = GPTForCausalLMPipe + # Load the tokenizer + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + tokenizer.padding_side = "left" + + # Load and set the pretrained configuration + config = config_class.from_pretrained(model_args.model_name_or_path) + config.enable_fuse_transformer = model_args.enable_fuse_transformer + config.fuse_attention_qkv = model_args.fuse_attention_qkv + config.use_flash_attn = model_args.use_flash_attn + config.use_recompute = training_args.recompute + + config.tensor_parallel_degree = training_args.tensor_parallel_degree + config.tensor_parallel_rank = training_args.tensor_parallel_rank + config.ignore_index = tokenizer.pad_token_id + + model = model_class.from_pretrained( + model_args.model_name_or_path, + config=config, + dtype=dtype, + load_state_as_np=True, + ) + if model_args.lora: + if model_args.lora_path is None: + target_modules = [ + ".*qkv_proj.*", + ".*q_proj.*", + ".*k_proj.*", + ".*v_proj.*", + ".*linear1.*", + ".*linear2.*", + ".*out_proj.*", + ] + lora_config = LoRAConfig( + target_modules=target_modules, + r=model_args.lora_rank, + lora_alpha=2 * model_args.lora_rank, + merge_weights=model_args.merge_weights, + tensor_parallel_degree=training_args.tensor_parallel_degree, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + else: + model = LoRAModel.from_pretrained(model=model, lora_path=model_args.lora_path) + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + # Load the dataset. 
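+    # The default DataArgument.task_name is "squad" (loaded with the train_v1/dev_v1 splits).
+    # convert_example in utils.py builds an "answer: ... context: ..." prompt with a
+    # "question: ..." target, i.e. the model is fine-tuned to generate questions from
+    # (answer, context) pairs.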
+ if training_args.do_train or training_args.do_eval: + train_ds, dev_ds = load_dataset(data_args.task_name, splits=["train_v1", "dev_v1"]) + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_source_length=data_args.src_length, + max_target_length=data_args.tgt_length, + ) + + if training_args.do_train: + train_ds = train_ds.map(partial(trans_func)) + if training_args.do_eval: + is_test = model_args.eval_with_do_generation + dev_ds = dev_ds.map(partial(trans_func, is_test=is_test)) + + collate_fn = DataCollatorForSupervisedDataset( + tokenizer, max_length=1024 if data_args.always_pad_to_max_length else 0 + ) + + def compute_metrics_trainer(eval_preds, tokenizer): + all_preds = [] + all_labels = [] + preds = eval_preds.predictions + preds = [x[x != -100] for x in preds] + all_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + labels = [x[x != -100] for x in eval_preds.label_ids] + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + + all_preds = [pred.strip() for pred in all_preds] + all_labels = [label.strip() for label in all_labels] + all_preds = [pred.strip("question:") for pred in all_preds] + all_labels = [label.strip("question:") for label in all_labels] + + eval_result = compute_metrics(all_preds, all_labels) + return eval_result + + compute_metrics_func = partial( + compute_metrics_trainer, + tokenizer=tokenizer, + ) + + trainer = GPTTrainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics_func + if (model_args.eval_with_do_generation and training_args.do_eval) + else None, + do_generation=model_args.eval_with_do_generation, + data_collator=collate_fn, + ) + + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=last_checkpoint) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + trainer.log_metrics("train", train_result.metrics) + trainer.save_metrics("train", train_result.metrics) + trainer.save_state() + + if training_args.do_eval: + eval_result = trainer.evaluate() + trainer.log_metrics("test", eval_result) + + +if __name__ == "__main__": + main() diff --git a/llm/gpt-3/modeling_pp.py b/llm/gpt-3/modeling_pp.py new file mode 100644 index 0000000000000000000000000000000000000000..c6b72766e2256f775123bfeb902b2e8278478362 --- /dev/null +++ b/llm/gpt-3/modeling_pp.py @@ -0,0 +1,299 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
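+
+# Pipeline-parallel flavor of GPT: the model is flattened into a sequence of stage layers
+# (a shared GPTEmbeddingPipe, GPTDecoderLayerPipe blocks, a final LayerNormPipe, and a tied
+# embedding/logits head computed via parallel_matmul). GPTForCausalLMPipe defined here is
+# selected by run_pretrain.py and finetune_generation.py when --pipeline_parallel_degree > 1.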
+ +# pass +import paddle.distributed.fleet as fleet +import paddle.nn as nn +from paddle.distributed.fleet.meta_parallel import ( + LayerDesc, + PipelineLayer, + SharedLayerDesc, +) +from paddle.distributed.fleet.utils import recompute + +from paddlenlp.transformers import ( + GPTConfig, + GPTDecoderLayer, + GPTEmbeddings, + GPTPretrainedModel, + GPTPretrainingCriterion, + PretrainedModel, +) +from paddlenlp.transformers.gpt.modeling import parallel_matmul + + +def get_hcg(): + return fleet.get_hybrid_communicate_group() + + +def get_attr(layer, name): + if getattr(layer, name, None) is not None: + return getattr(layer, name, None) + else: + return get_attr(layer._layer, name) + + +def parse_args(args): + if isinstance(args, tuple): + if len(args) == 3: + hidden_states, attention_mask, position_ids = args + elif len(args) == 2: + hidden_states, attention_mask = args + position_ids = None + else: + hidden_states = args + attention_mask, position_ids = None, None + + if position_ids is not None: + position_ids.stop_gradient = True + + if attention_mask is not None: + attention_mask.stop_gradient = True + + return hidden_states, attention_mask, position_ids + + +def return_args(hidden_states, attention_mask=None, position_ids=None): + ret = (hidden_states,) + + if attention_mask is not None: + ret += (attention_mask.clone(),) + if position_ids is not None: + ret += (position_ids.clone(),) + if len(ret) == 1: + ret = ret[0] + + return ret + + +class GPTEmbeddingPipe(GPTEmbeddings): + """Extends GPTEmbeddings to forward attention_mask through the pipeline.""" + + @property + def embedding_weight(self): + return get_attr(self.word_embeddings, "weight") + + def forward(self, args): + input_ids, attention_mask, position_ids = parse_args(args) + input_ids.stop_gradient = True + embeddings = super().forward(input_ids=input_ids, position_ids=position_ids) + return embeddings + + +class GPTDecoderLayerPipe(GPTDecoderLayer): + def forward(self, args): + hidden_states, attention_mask, position_ids = parse_args(args) + # hidden_states = super().forward(hidden_states, tgt_mask=attention_mask) + if self.enable_recompute and self.config.recompute_granularity == "full": + hidden_states = recompute(super().forward, hidden_states, attention_mask) + else: + hidden_states = super().forward(hidden_states, tgt_mask=attention_mask) + + return return_args(hidden_states, attention_mask, position_ids) + + +class LayerNormPipe(nn.LayerNorm): + def __init__(self, config): + super(LayerNormPipe, self).__init__(config.hidden_size, epsilon=1e-05) + + def forward(self, args): + hidden_states, attention_mask, position_ids = parse_args(args) + hidden_states = super().forward(hidden_states) + return return_args(hidden_states, attention_mask, position_ids) + + +class PipelinePretrainedModel(PretrainedModel): + _sequential_layers = [] + _pipeline_name_mapping = None + + def __init__(self, config, *args, **kwargs): + super().__init__(config, *args, **kwargs) + + def add_sequential_layer(self, layer_desc, name_prefix=""): + self._sequential_layers.append({"layer": layer_desc, "name_prefix": name_prefix}) + + def get_sequential_layers(self): + return [x["layer"] for x in self._sequential_layers] + + def get_sequential_name_prefixs(self): + return {str(index): x["name_prefix"] for index, x in enumerate(self._sequential_layers)} + + def _set_pipeline_name_mapping(self, mappings=None): + if mappings is not None: + self._pipeline_name_mapping = mappings + else: + mapping = {} + state_dict_keys = list(super().state_dict().keys()) + 
first_key = state_dict_keys[0].split(".") + # if use virtual pp_degree, the prefix is like 0.0.xxx + # else it will be like 0.xxx + use_virtual_pp_degree = first_key[0].isdigit() and first_key[1].isdigit() + + prefixs = self.get_sequential_name_prefixs() + for k in state_dict_keys: + name_splited = k.split(".") + # TODO(wawltor) Fix the virtual pipeline + if use_virtual_pp_degree: + idx = str(int(name_splited[0]) + int(name_splited[1])) + single_name = [prefixs[idx]] + single_name.extend(name_splited[2:]) + else: + idx = name_splited[0] + if idx == "shared_layers": + single_name = name_splited[2:] + single_name = ["gpt.embeddings"] + single_name + elif idx.isdigit(): + single_name = [prefixs[idx]] + single_name.extend(name_splited[1:]) + else: + raise ("The mapping table had bad row, please check parameter name:{}".format(k)) + mapping[".".join(single_name)] = k + + self._pipeline_name_mapping = mapping + + return self._pipeline_name_mapping + + def _prepare_pipeline_inputs_func(self, inputs): + first_stage_keys = ["input_ids", "attention_mask"] + last_stage_keys = ["labels"] + + def get_expected_keys(inputs, keys): + ret = tuple([inputs.pop(k) for k in keys if k in inputs]) + if len(ret) == 1: + ret = ret[0] + return ret + + if type(inputs) is dict: + return [ + get_expected_keys(inputs, first_stage_keys), + get_expected_keys(inputs, last_stage_keys), + ] + + keys = list(inputs[0].keys()) + inputs_batch = {key: [data.pop(key) for data in inputs] for key in keys} + return [ + get_expected_keys(inputs_batch, first_stage_keys), + get_expected_keys(inputs_batch, last_stage_keys), + ] + + def state_dict(self, *args, **kwargs): + state_dict = super().state_dict(*args, **kwargs) + + if self._pipeline_name_mapping is None: + self._set_pipeline_name_mapping() + assert len(self._pipeline_name_mapping) > 0, "The pipeline stage must have parameters!" + pp_to_single_mapping = {v: k for k, v in self._pipeline_name_mapping.items()} + + for k in list(state_dict.keys()): + v = state_dict.pop(k) + state_dict[pp_to_single_mapping[k]] = v + + return state_dict + + def set_state_dict(self, state_dict, *args, **kwargs): + if self._pipeline_name_mapping is None: + self._set_pipeline_name_mapping() + assert len(self._pipeline_name_mapping) > 0, "The pipeline stage must have parameters!" + + for k in list(state_dict.keys()): + v = state_dict.pop(k) + if k not in self._pipeline_name_mapping: + continue + state_dict[self._pipeline_name_mapping[k]] = v + + ret = super().set_state_dict(state_dict, *args, **kwargs) + return ret + + +class GPTForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): + """LlamaForPretraining adapted for pipeline parallelism. + + The largest change is flattening the LlamaModel class so we can express it as a + sequence of layers including embedding, transformer layers, and output. + """ + + config_class = GPTConfig + + _get_tensor_parallel_mappings = GPTPretrainedModel._get_tensor_parallel_mappings + _init_weights = GPTPretrainedModel._init_weights + + # NO base_model_prefix !!!! 
+ + def __init__( + self, + config, + pp_recompute_interval=1, + ): + self.config = config + + virtual_pp_degree = getattr(self.config, "virtual_pp_degree", 1) + + hcg = get_hcg() + tensor_parallel_degree = max(hcg.get_model_parallel_world_size(), 1) + tensor_parallel_rank = max(hcg.get_model_parallel_rank(), 0) + + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + + self.add_sequential_layer( + SharedLayerDesc("gpt", GPTEmbeddingPipe, shared_weight_attr="embedding_weight", config=config), "gpt" + ) + for i in range(config.num_hidden_layers): + self.add_sequential_layer( + LayerDesc(GPTDecoderLayerPipe, config=config), + f"gpt.decoder.layers.{i}", + ) + + self.add_sequential_layer(LayerDesc(LayerNormPipe, config=config), "gpt.decoder.norm") + + def _logits_helper(embedding, output): + return parallel_matmul(output, embedding.embedding_weight, True) + + self.add_sequential_layer( + SharedLayerDesc( + "gpt", + GPTEmbeddingPipe, + forward_func=_logits_helper, + shared_weight_attr="embedding_weight", + config=config, + ), + "gpt", + ) + + recompute_interval = 0 + # if use_recompute and recompute_granularity == "full": + # assert pp_recompute_interval <= config.num_hidden_layers // ( + # virtual_pp_degree * get_hcg().topology().get_dim_size("pipe") + # ), "pp recompute interval should smaller than num layers of each pp chunk" + # recompute_interval = pp_recompute_interval + + seg_method = "layer:GPTDecoderLayer" + if config.num_hidden_layers % get_hcg().topology().get_dim_size("pipe") != 0: + seg_method = "uniform" + + PipelineLayer.__init__( + self, + layers=self.get_sequential_layers(), + loss_fn=GPTPretrainingCriterion(config), + topology=get_hcg().topology(), + seg_method=seg_method, + recompute_interval=recompute_interval, + recompute_ctx={ + "mp_group": get_hcg().get_model_parallel_group(), + "offload": False, + "partition": False, + }, + num_virtual_pipeline_stages=virtual_pp_degree, + ) + self.apply(self._init_weights) diff --git a/llm/gpt-3/predict_generation.py b/llm/gpt-3/predict_generation.py new file mode 100644 index 0000000000000000000000000000000000000000..53e1d95c22d38aa8347c3d3af27ecf680335fec0 --- /dev/null +++ b/llm/gpt-3/predict_generation.py @@ -0,0 +1,167 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
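+
+# NOTE: illustrative invocation only; the values shown are simply the argparse defaults
+# declared further down:
+#
+#   python predict_generation.py --model_type gpt2-cn --model_name_or_path gpt-cpm-large-cn \
+#       --src_length 200 --tgt_length 200
+#
+# Running the same script under paddle.distributed.launch with several GPUs enables the
+# tensor-parallel path (tensor_parallel_degree is taken from the world size below).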
+from __future__ import annotations + +import paddle +from utils import get_hcg, init_dist_env, set_seed + +from paddlenlp.transformers import ( + GPTChineseTokenizer, + GPTConfig, + GPTForCausalLM, + GPTTokenizer, +) + +MODEL_CLASSES = { + "gpt2": (GPTForCausalLM, GPTTokenizer), + "gpt2-cn": (GPTForCausalLM, GPTChineseTokenizer), +} + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_type", default="gpt2-cn", help="The directory of model.") + parser.add_argument("--model_name_or_path", default="gpt-cpm-large-cn", help="The directory of model.") + parser.add_argument("--save_onepiece_model_path", default=None, help="The directory of model.") + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--src_length", type=int, default=200, help="The batch size of data.") + parser.add_argument("--tgt_length", type=int, default=200, help="The batch size of data.") + parser.add_argument("--seed", type=int, default=20, help="the seed of parameter initialization") + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args=None, tokenizer=None, model=None, **kwargs): + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + self.tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + self.tokenizer.padding_side = "left" + self.batch_size = args.batch_size + self.args = args + self.src_length = self.args.src_length + self.tgt_length = self.args.tgt_length + + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = 0 + if tensor_parallel_degree > 1: + hcg = get_hcg() + tensor_parallel_rank = hcg.get_model_parallel_rank() + + config = GPTConfig.from_pretrained(args.model_name_or_path) + dtype = config.dtype if config.dtype is not None else "float16" + + self.model = GPTForCausalLM.from_pretrained( + args.model_name_or_path, + load_state_as_np=True, + low_cpu_mem_usage=True, + dtype=dtype, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + ) + if self.tokenizer.pad_token_id is None: + self.tokenizer.pad_token_id = self.model.config.pad_token_id + self.model.eval() + + def preprocess(self, input_text): + inputs = self.tokenizer( + input_text, + return_tensors="np", + padding=True, + max_length=self.src_length, + ) + inputs_tensor = {} + for key, value in inputs.items(): + inputs_tensor[key] = paddle.to_tensor(value) + return inputs_tensor + + def infer(self, inputs): + if self.model.config.dtype == "float32" or self.model.config.dtype is None: + with paddle.no_grad(): + result = self.model.generate( + **inputs, + max_length=self.tgt_length, + bos_token_id=self.tokenizer.bos_token_id, + eos_token_id=self.tokenizer.eol_token_id, + pad_token_id=self.tokenizer.pad_token_id, + decode_strategy="sampling", + top_k=1, + ) + else: + with paddle.no_grad(): + with paddle.amp.auto_cast(False, level="O2", dtype=self.model.config.dtype): + result = self.model.generate( + **inputs, + max_length=self.tgt_length, + bos_token_id=self.tokenizer.bos_token_id, + eos_token_id=self.tokenizer.eol_token_id, + pad_token_id=self.tokenizer.pad_token_id, + decode_strategy="sampling", + top_k=1, + ) + result = result[0] + return result + + def 
postprocess(self, infer_data): + result = [] + for x in infer_data.tolist(): + res = self.tokenizer.convert_ids_to_string(x) + result.append(res) + out_dict = {"result": result} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + def save_onepiece_model(self, save_onepiece_model_path): + self.model.save_pretrained(save_dir=save_onepiece_model_path, merge_tensor_parallel=True) + paddle.distributed.barrier() + self.tokenizer.save_pretrained(save_onepiece_model_path) + paddle.distributed.barrier() + + +def predict(): + args = parse_arguments() + + # Init the fleet config + tensor_parallel_degree = paddle.distributed.get_world_size() + if tensor_parallel_degree > 1: + init_dist_env(tensor_parallel_degree=tensor_parallel_degree, seed=args.seed) + set_seed(args.seed) + + predictor = Predictor(args) + all_texts = ["问题:中国的首都是哪里?答案:北京。\n问题:苹果的CEO是谁? 答案:", "问题:中国的首都是哪里?答案:北京。\n问题:广东的省会是哪个城市? 答案:"] + batch_texts = batchfy_text(all_texts, args.batch_size) + for bs, texts in enumerate(batch_texts): + outputs = predictor.predict(texts) + for text, result in zip(texts, outputs["result"]): + print(result) + if args.save_onepiece_model_path is not None: + predictor.save_onepiece_model(args.save_onepiece_model_path) + + +if __name__ == "__main__": + predict() diff --git a/llm/gpt-3/run_pretrain.py b/llm/gpt-3/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..e68f80eef6beea1abce1c8fef6012f9a05afa302 --- /dev/null +++ b/llm/gpt-3/run_pretrain.py @@ -0,0 +1,438 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +GPT/Llama pretraining scripts. 
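+
+Typically launched with `paddle.distributed.launch`; the pretraining section of
+llm/gpt-3/README.md contains a complete example command for gpt2-medium-en.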
+""" +import math +import os +import sys +import time +from dataclasses import dataclass, field +from typing import Optional + +import paddle +from modeling_pp import GPTForCausalLMPipe + +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, + set_seed, + speed_metrics, +) +from paddlenlp.transformers import ( + AutoTokenizer, + CosineAnnealingWithWarmupDecay, + GPTConfig, + GPTForCausalLM, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "gpt": ( + GPTConfig, + GPTForCausalLM, + ), +} + +from paddlenlp.data.causal_dataset import build_train_valid_test_datasets, print_rank_0 + + +def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") + return fn + + return docstring_decorator + + +@dataclass +@add_start_docstrings(TrainingArguments.__doc__) +class PreTrainingArguments(TrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." + }, + ) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=1024, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + + data_impl: str = field(default="mmap", metadata={"help": "The format of the preprocessed data."}) + skip_warmup: bool = field( + default=True, + metadata={"help": "Whether to skip the warmup process of mmap files."}, + ) + data_cache: str = field(default=None, metadata={"help": "The path of the cached dataset."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. 
+ """ + + model_type: Optional[str] = field(default="gpt", metadata={"help": "Only support for gpt pre-training for now."}) + model_name_or_path: str = field( + default="gpt2-medium-en", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + output_attentions: bool = field(default=False, metadata={"help": "Whether output attention weights"}) + use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + fused_linear: bool = field( + default=False, + metadata={"help": "gpt, whether to fuse linear projection"}, + ) + fuse_attention_qkv: bool = field( + default=False, + metadata={"help": "gpt, whether to fuse attention qkv"}, + ) + enable_fuse_transformer: bool = field( + default=False, + metadata={"help": "gpt, enable_fuse_transformer"}, + ) + hidden_dropout_prob: float = field(default=0.1, metadata={"help": "The hidden dropout prob."}) + attention_probs_dropout_prob: float = field(default=0.1, metadata={"help": "The attention hidden dropout prob."}) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, +): + + train_val_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.dataset_world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.dataset_world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.dataset_world_size * training_args.test_iters, + ] + + print_rank_0(" > datasets target sizes (minimum size):") + print_rank_0(" train: {}".format(train_val_test_num_samples[0])) + print_rank_0(" validation: {}".format(train_val_test_num_samples[1])) + print_rank_0(" test: {}".format(train_val_test_num_samples[2])) + + # Build the datasets. + train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets( + data_prefix=data_file, + data_impl=data_args.data_impl, + splits_string=data_args.split, + train_val_test_num_samples=train_val_test_num_samples, + seq_length=data_args.max_seq_length, + seed=training_args.seed, + skip_warmup=data_args.skip_warmup, + data_cache_path=data_args.data_cache, + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode") + # input_ids, loss_mask, attention_mask, position_ids, labels = data + input_ids = data["text"] + + logger.info(tokenizer._decode(input_ids)) + + from paddlenlp.data import Stack + + def _collate_data(data, stack_fn=Stack()): + tokens_ = stack_fn([x["text"] for x in data]) + + labels = tokens_[:, 1:] + tokens = tokens_[:, :-1] + + # Attention mask. + attention_mask = paddle.ones(tokens.shape, dtype=paddle.int64) + + return { + "input_ids": tokens, + "attention_mask": attention_mask, + "labels": labels, + } + + print_dataset(train_dataset[0]) + print_dataset(valid_dataset[0]) + print_dataset(test_dataset[0]) + + return train_dataset, valid_dataset, test_dataset, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... 
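+        # e.g. --input_dir "0.3 data/corpus_a 0.7 data/corpus_b" mixes two preprocessed datasets
+        # with sampling weights 0.3 and 0.7 (the paths here are purely illustrative).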
+ return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f))) + ] + files = [x.replace("_idx.npz", "") for x in files] + files = [x.replace(".idx", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +class PretrainingTrainer(Trainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str = "eval"): + # keep eval_dataloader + eval_dataloader = getattr(self, "eval_dataloader", None) + if eval_dataloader is None: + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + # must call data loader, otherwise, it will init many times, cause OOM error. + self.eval_dataloader = eval_dataloader() + + start_time = time.time() + # Temporarily disable metric computation, we will do it in the loop here. + compute_metrics = self.compute_metrics + eval_loop = self.evaluation_loop + + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + # Only evaluate max_eval_iters + max_eval_iters=self.args.eval_iters, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + return output.metrics + + def _get_eval_sampler(self, eval_dataset) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + if data_args.data_cache is not None: + os.makedirs(data_args.data_cache, exist_ok=True) + + set_seed(seed=training_args.seed, args=training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 
10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + config_class, model_class = MODEL_CLASSES[model_args.model_type] + + tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path) + + config = config_class.from_pretrained(model_args.model_name_or_path) + config.output_attentions = model_args.output_attentions + config.max_position_embeddings = max(config.max_position_embeddings, data_args.max_seq_length) + config.hidden_dropout_prob = model_args.hidden_dropout_prob + config.attention_probs_dropout_prob = model_args.attention_probs_dropout_prob + config.enable_fuse_transformer = model_args.enable_fuse_transformer + config.fuse_attention_qkv = model_args.fuse_attention_qkv + config.use_recompute = training_args.recompute + config.use_flash_attention = model_args.use_flash_attention + + config.tensor_parallel_degree = training_args.tensor_parallel_degree + config.tensor_parallel_rank = training_args.tensor_parallel_rank + + print("Final pre-training config:", config) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + if training_args.pipeline_parallel_degree > 1: + model_class = GPTForCausalLMPipe + + model = model_class.from_pretrained( + model_args.model_name_or_path, + config=config, + dtype=dtype, + load_state_as_np=True, + ) + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = None + if training_args.lr_scheduler_type.value == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + elif training_args.lr_scheduler_type.value == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + + data_file = get_train_data_file(data_args) + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, training_args, data_file, tokenizer + ) + + trainer = PretrainingTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + 
optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + checkpoint = None + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + + +if __name__ == "__main__": + main() diff --git a/llm/gpt-3/utils.py b/llm/gpt-3/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..fed4793d956bc252b1b6f017d74ff7d0d93be1e3 --- /dev/null +++ b/llm/gpt-3/utils.py @@ -0,0 +1,389 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import random +import re +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.nn as nn +from paddle.distributed import fleet +from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker +from paddle.optimizer.lr import LambdaDecay +from rouge import Rouge + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.metrics import BLEU +from paddlenlp.trainer import Trainer +from paddlenlp.utils.log import logger + +PREFIX_CHECKPOINT_DIR = "model_state" +_re_checkpoint = re.compile(r"^" + PREFIX_CHECKPOINT_DIR + r"\.tp(\d+)" + ".pdparams$") + + +_hcg = None + + +def set_hcg(hcg): + global _hcg + _hcg = hcg + + +def get_hcg(): + global _hcg + return _hcg + + +def set_seed(seed): + # NOTE(shenliang03): For parameter init seed: + # seed: dp/mp_undistributed_paramter/sharding is same; others is different + # For compute seed(dropout): + # global seed: only mp group is same. 
+ # local seed: all groups are different + + hcg = get_hcg() + if paddle.distributed.get_world_size() > 1: + # obtain rank message of hybrid parallel + + mp_rank = hcg.get_model_parallel_rank() + mp_size = hcg.get_model_parallel_world_size() + + pp_rank = hcg.get_stage_id() + pp_size = hcg.get_pipe_parallel_world_size() + + dp_rank = hcg.get_data_parallel_rank() + dp_size = hcg.get_data_parallel_world_size() + + sharding_rank = hcg.get_sharding_parallel_rank() + # sharding_size = hcg.get_sharding_parallel_world_size() + else: + mp_rank, mp_size = 0, 1 + pp_rank, pp_size = 0, 1 + dp_rank, dp_size = 0, 1 + sharding_rank, _ = 0, 1 + + # NOTE: the commented seeds are set only for precision validation + # seed += 100 * pp_rank + random.seed(seed + 100 * pp_rank) + np.random.seed(seed + 100 * pp_rank) + + # seed = mp_rank + + # pp_rank * (mp_size) + + # dp_rank * (mp_size * pp_size) + + # sharding_rank * (mp_size * pp_size * dp_size) + # seed offset is order to avoid conflicts with the parameter initialization seed + + seed_offset = seed + 1024 + paddle.distributed.get_world_size() + global_seed = ( + seed_offset + + pp_rank * (mp_size) + + dp_rank * (mp_size * pp_size) + + sharding_rank * (mp_size * pp_size * dp_size) + ) + + seed_offset += paddle.distributed.get_world_size() + local_seed = ( + seed_offset + + mp_rank + + pp_rank * (mp_size) + + dp_rank * (mp_size * pp_size) + + sharding_rank * (mp_size * pp_size * dp_size) + ) + + tracker = get_rng_state_tracker() + tracker.add("global_seed", global_seed) + tracker.add("local_seed", local_seed) + + paddle.seed(global_seed) + + logger.info("The global seed is set to {} and local seed is set to {}.".format(global_seed, local_seed)) + + +def create_hcg(strategy, hcg_name="HybridCommunicateGroup"): + if hcg_name == "HybridCommunicateGroup": + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + else: + dist.init_parallel_env() + hcg = eval("{}".format(hcg_name))(strategy) + + return hcg + + +def init_dist_env( + tensor_parallel_degree=1, sharding_parallel_degree=1, pipeline_parallel_degree=1, data_parallel_degree=1, seed=1 +): + + strategy = fleet.DistributedStrategy() + + def is_segment_parallel_supported(): + import inspect + + members = [name for (name, date) in inspect.getmembers(fleet.HybridCommunicateGroup)] + return "get_sep_parallel_world_size" in members + + if tensor_parallel_degree == 1 and sharding_parallel_degree == 1: + if is_segment_parallel_supported(): + order = ["pp", "dp", "sharding", "sep", "mp"] + else: + order = ["pp", "dp", "sharding", "mp"] + else: + if is_segment_parallel_supported(): + order = ["dp", "pp", "sharding", "sep", "mp"] + else: + order = ["dp", "pp", "sharding", "mp"] + + strategy.hybrid_configs = { + "dp_degree": data_parallel_degree, + "mp_degree": tensor_parallel_degree, + "pp_degree": pipeline_parallel_degree, + "sharding_degree": sharding_parallel_degree, + "order": order, + } + + # TODO(wawltor) The inference parallel do not support the pipeline mode + + """ + if pipeline_parallel_degree > 1: + if "sequence_parallel" in config.Model: + if config.Model.sequence_parallel: + assert config.Global.enable_partial_send_recv is False, ( + "if config.Distributed.pp_degree > 1 and config.Model.sequence_parallel is True, " + "config.Global.enable_partial_send_recv should be set False." 
+ ) + + strategy.pipeline_configs = { + "accumulate_steps": config.Global.local_batch_size // config.Global.micro_batch_size, + "micro_batch_size": config.Global.micro_batch_size, + "enable_partial_send_recv": config.Global.enable_partial_send_recv, + } + """ + + # set control in tensor parallel + strategy.tensor_parallel_configs = {"tensor_init_seed": seed} + + hcg = create_hcg(strategy) + set_hcg(hcg) + + +def convert_example( + example, + tokenizer, + max_source_length, + max_target_length, + is_test=False, +): + """ + Convert an example into necessary features. + """ + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + context = example["context"] + question = example["question"] + try: + answer = example["answers"][0] + except Exception: + print(example["context"]) + print(example["question"]) + print(example["answers"]) + print(example["answer_starts"]) + print(example["is_impossible"]) + + input_seq = f"answer: {answer} context: {context} " + output_seq = f"question: {question} " + + outputs = tokenizer( + output_seq, + max_length=max_target_length, + # pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=False, + return_token_type_ids=False, + ) + inputs = tokenizer( + input_seq, + max_length=max_source_length, + # pad_to_max_seq_len=True, + truncation_strategy="longest_first", + return_attention_mask=False, + return_length=False, + ) + + final = {} + for k in outputs.keys(): + final[k] = inputs[k] + outputs[k] + if k == "input_ids": + final["labels"] = [tokenizer.pad_token_id] * len(inputs["input_ids"]) + outputs[k] + if is_test: + return dict(input_ids=inputs["input_ids"], labels=outputs["input_ids"]) + + # shift inputs and labels + final["input_ids"] = final["input_ids"][:-1] + final["labels"] = final["labels"][1:] + return final + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + + rouge1 = round(rouge1, 4) + rouge2 = round(rouge2, 4) + rougel = round(rougel, 4) + bleu4 = round(bleu4.score(), 4) + return dict( + rouge1=rouge1, + rouge2=rouge2, + rougel=rougel, + bleu4=bleu4, + ) + + +class DataCollatorForSupervisedDataset(DataCollatorForSeq2Seq): + """Collate examples for supervised fine-tuning.""" + + def __call__(self, features, return_tensors=None): + # Deep copy to avoid modifying features in-place + batch = copy.deepcopy(features) + if return_tensors is None: + return_tensors = self.return_tensors + labels = [feature["labels"] for feature in batch] if "labels" in batch[0].keys() else None + # We have to pad the labels before calling `tokenizer.pad` as this method won't pad them and needs them of the + # same length to return tensors. + if labels is not None: + # Note(gongenlei): In pipeline, max_label_length = self.max_length + if self.padding == "max_length" and self.max_length is not None: + max_label_length = self.max_length + else: + max_label_length = max(len(l) for l in labels) + if self.pad_to_multiple_of is not None: + max_label_length = ( + (max_label_length + self.pad_to_multiple_of - 1) + // self.pad_to_multiple_of + * self.pad_to_multiple_of + ) + + padding_side = self.tokenizer.padding_side + for feature in batch: + remainder = [self.tokenizer.pad_token_id] * (max_label_length - len(feature["labels"])) + if isinstance(feature["labels"], list): + feature["labels"] = ( + feature["labels"] + remainder if padding_side == "right" else remainder + feature["labels"] + ) + elif padding_side == "right": + feature["labels"] = np.concatenate([feature["labels"], remainder]).astype(np.int64) + else: + feature["labels"] = np.concatenate([remainder, feature["labels"]]).astype(np.int64) + + batch = self.tokenizer.pad( + batch, + padding=self.padding, + max_length=self.max_length, + pad_to_multiple_of=self.pad_to_multiple_of, + return_tensors=return_tensors, + return_attention_mask=self.return_attention_mask, + ) + + return batch + + +class GPTTrainer(Trainer): + def __init__(self, do_generation: bool, **kwargs): + super().__init__(**kwargs) + self.do_generation = do_generation + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + + if prediction_loss_only: + return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + elif not self.do_generation: + loss, logits, labels = super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + # argmax here to avoid gather all logits, which is too memory-consuming. 
+ # keepdim in order to maintain the same shape as logits + return (loss, logits.argmax(axis=-1, keepdim=True), labels) + + model.eval() + + preds = model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"] if "attention_mask" in inputs else None, + max_length=self.args.tgt_length, + min_length=0, + use_cache=True, + temperature=1.0, + top_k=1, + top_p=1.0, + repetition_penalty=1.0, + decode_strategy="sampling", + )[0] + all_labels = [] + for label in inputs["labels"].numpy(): + label = [x for x in label[label != self.tokenizer.pad_token_id]] + all_labels.append(label) + max_label_length = max([len(x) for x in all_labels]) + for index, labels in enumerate(all_labels): + all_labels[index] = labels + [-100] * (max_label_length - len(labels)) + + return (None, paddle.to_tensor(preds), paddle.to_tensor(all_labels)) + + def create_scheduler(self, num_training_steps: int): + num_warmup_steps = ( + self.args.warmup_steps if self.args.warmup_steps > 0 else self.args.warmup_ratio * num_training_steps + ) + + def lr_lambda(current_step: int): + if current_step < num_warmup_steps: + return float(current_step) / float(max(1, num_warmup_steps)) + else: + decay_step_ratio = (current_step - num_warmup_steps) / (num_training_steps - num_warmup_steps) + return 1.0 - (1.0 - self.args.lr_decay_ratio) * decay_step_ratio + + if self.lr_scheduler is None: + self.lr_scheduler = LambdaDecay(self.args.learning_rate, lr_lambda, last_epoch=-1) + return self.lr_scheduler + + def log(self, logs: Dict[str, float], **kwargs) -> None: + if "loss" in logs: + logs["ppl"] = np.exp(logs["loss"]) + if "eval_loss" in logs: + logs["eval_ppl"] = np.exp(logs["eval_loss"]) + + super(GPTTrainer, self).log(logs, **kwargs) diff --git a/llm/gradio_ui.py b/llm/gradio_ui.py new file mode 100644 index 0000000000000000000000000000000000000000..39f52ef91e2df3c0daa76d97f2955ba6bcfccb69 --- /dev/null +++ b/llm/gradio_ui.py @@ -0,0 +1,202 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
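+
+# NOTE: thin Gradio front end: the dialogue context lives in a gr.State dict, and every user
+# turn is POSTed, together with the slider settings (top_k/top_p/temperature/repetition_penalty/
+# max_length), to the local flask service at /api/chat; the reply is appended to the chatbot view.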
+
+from __future__ import annotations
+
+import argparse
+import copy
+import json
+
+import gradio as gr
+import requests
+
+
+def setup_args():
+    """Set up command-line arguments."""
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--port", type=int, default=8073)
+    # NOTE: args.flask_port and args.title are read below; the defaults here are assumptions.
+    # Point --flask_port at the flask service that exposes /api/chat.
+    parser.add_argument("--flask_port", type=int, default=8071)
+    parser.add_argument("--title", type=str, default="LLM")
+    args = parser.parse_args()
+    return args
+
+
+def launch(args):
+    """Launch characters dialogue demo."""
+
+    def rollback(state):
+        """Rollback context."""
+        context = state.setdefault("context", [])
+        # Nothing to roll back before the first complete user/bot round.
+        if len(context) < 2:
+            return None, get_shown_context(context), context, state
+        utterance = context[-2]["utterance"]
+        context = context[:-2]
+        state["context"] = context
+        shown_context = get_shown_context(context)
+        return utterance, shown_context, context, state
+
+    def regen(state, top_k, top_p, temperature, repetition_penalty, max_length):
+        """Regenerate response."""
+        context = state.setdefault("context", [])
+        # Nothing to regenerate before the first complete user/bot round.
+        if len(context) < 2:
+            return None, get_shown_context(context), context, state
+        context.pop()
+        user_turn = context.pop()
+        return infer(user_turn["utterance"], state, top_k, top_p, temperature, repetition_penalty, max_length)
+
+    def infer(utterance, state, top_k, top_p, temperature, repetition_penalty, max_length):
+        """Model inference."""
+        utterance = utterance.strip().replace("<br>
", "\n") + context = state.setdefault("context", []) + + if not utterance: + gr.Warning("invalid inputs") + # gr.Warning("请输入有效问题") + shown_context = get_shown_context(context) + return None, shown_context, context, state + + context.append({"role": "user", "utterance": utterance}) + data = { + "context": utterance, + "top_k": top_k, + "top_p": top_p, + "temperature": temperature, + "repetition_penalty": repetition_penalty, + "max_length": max_length, + "min_length": 1, + } + result = requests.post(f"http://0.0.0.0:{args.flask_port}/api/chat", json=data).json() + bot_response = result["result"]["response"] + + # replace \n with br: https://github.com/gradio-app/gradio/issues/4344 + bot_response["utterance"] = bot_response["utterance"].replace("\n", "
") + context.append(bot_response) + shown_context = get_shown_context(context) + return None, shown_context, context, state + + def clean_context(context): + """Clean context for EB input.""" + cleaned_context = copy.deepcopy(context) + for turn in cleaned_context: + if turn["role"] == "bot": + bot_resp = turn["utterance"] + if bot_resp.startswith(""): + bot_resp = "\n".join(bot_resp.split("\n")[1:]) + turn["utterance"] = bot_resp + return cleaned_context + + def extract_eda(eb_debug_info): + """Extract EDA result from EB dispatch info.""" + eda_res = None + for item in eb_debug_info: + if item["sys"] == "EDA": + eda_output = json.loads(item["output"]) + eda_res = eda_output["result"] + break + return eda_res + + def extract_eb_input(eb_debug_info, convert_for_ar=True): + """Extract EB raw input from EB dispatch info.""" + eb_raw_input = None + for item in eb_debug_info: + if item["sys"] == "EB": + eb_output = json.loads(item["output"]) + eb_raw_input = eb_output["text_after_process"] + if convert_for_ar: + eb_raw_input = eb_raw_input.replace("[CLS]", "").replace("[SEP]", "") + break + return eb_raw_input + + def get_shown_context(context): + """Get gradio chatbot.""" + shown_context = [] + for turn_idx in range(0, len(context), 2): + shown_context.append([context[turn_idx]["utterance"], context[turn_idx + 1]["utterance"]]) + return shown_context + + with gr.Blocks(title="LLM", theme=gr.themes.Soft()) as block: + gr.Markdown(f"# {args.title}") + with gr.Row(): + with gr.Column(scale=1): + top_k = gr.Slider( + minimum=1, maximum=100, value=50, step=1, label="Top-k", info="该参数越大,模型生成结果更加随机,反之生成结果更加确定。" + ) + top_p = gr.Slider( + minimum=0, maximum=1, value=0.7, step=0.05, label="Top-p", info="该参数越大,模型生成结果更加随机,反之生成结果更加确定。" + ) + temperature = gr.Slider( + minimum=0.05, + maximum=1.5, + value=0.95, + step=0.05, + label="Temperature", + info="该参数越小,模型生成结果更加随机,反之生成结果更加确定。", + ) + repetition_penalty = gr.Slider( + minimum=0.1, + maximum=10, + value=1.0, + step=0.05, + label="Repetition Penalty", + info="该参数越大,生成结果重复的概率越低。设置 1 则不开启。", + ) + max_length = gr.Slider( + minimum=1, maximum=1024, value=50, step=1, label="Max Length", info="生成结果的最大长度。" + ) + with gr.Column(scale=4): + state = gr.State({}) + context_chatbot = gr.Chatbot(label="Context") + utt_text = gr.Textbox(placeholder="请输入...", label="Utterance") + with gr.Row(): + clear_btn = gr.Button("清空") + rollback_btn = gr.Button("撤回") + regen_btn = gr.Button("重新生成") + send_btn = gr.Button("发送") + with gr.Row(): + raw_context_json = gr.JSON(label="Raw Context") + + utt_text.submit( + infer, + inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_length], + outputs=[utt_text, context_chatbot, raw_context_json, state], + api_name="chat", + ) + clear_btn.click( + lambda _: (None, None, None, {}), + inputs=clear_btn, + outputs=[utt_text, context_chatbot, raw_context_json, state], + api_name="clear", + show_progress=False, + ) + rollback_btn.click( + rollback, + inputs=[state], + outputs=[utt_text, context_chatbot, raw_context_json, state], + show_progress=False, + ) + regen_btn.click( + regen, + inputs=[state, top_k, top_p, temperature, repetition_penalty, max_length], + outputs=[utt_text, context_chatbot, raw_context_json, state], + ) + send_btn.click( + infer, + inputs=[utt_text, state, top_k, top_p, temperature, repetition_penalty, max_length], + outputs=[utt_text, context_chatbot, raw_context_json, state], + ) + + block.queue(default_enabled=True).launch(server_name="0.0.0.0", server_port=args.port, debug=True) + + +def 
main(args): + launch(args) + + +if __name__ == "__main__": + args = setup_args() + main(args) diff --git a/llm/llama/README.md b/llm/llama/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8eeac47ea70321c6d1209434146213456a124f52 --- /dev/null +++ b/llm/llama/README.md @@ -0,0 +1,115 @@ +# LLaMA + +## 1. 模型介绍 + +**支持模型权重:** + +| Model | +| ---------------------------------| +| facebook/llama-7b | +| facebook/llama-13b | +| facebook/llama-30b | +| facebook/llama-65b | +| meta-llama/Llama-2-7b | +| meta-llama/Llama-2-7b-chat | +| meta-llama/Llama-2-13b | +| meta-llama/Llama-2-13b-chat | +| meta-llama/Llama-2-70b | +| meta-llama/Llama-2-70b-chat | +| ziqingyang/chinese-llama-7b | +| ziqingyang/chinese-llama-13b | +| ziqingyang/chinese-alpaca-7b | +| ziqingyang/chinese-alpaca-13b | +| idea-ccnl/ziya-llama-13b-v1 | +| linly-ai/chinese-llama-2-7b | +| baichuan-inc/Baichuan-7B | +| baichuan-inc/Baichuan-13B-Base | +| baichuan-inc/Baichuan-13B-Chat | +| FlagAlpha/Llama2-Chinese-7b-Chat | +| FlagAlpha/Llama2-Chinese-13b-Chat | + + + +使用方法: + +```python +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer +model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat") +tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat") +``` + +## 2. 模型协议 + +LLaMA 模型的权重的使用则需要遵循[License](../../paddlenlp/transformers/llama/LICENSE)。 + +Llama2 模型的权重的使用则需要遵循[License](../../paddlenlp/transformers/llama/Llama2.LICENSE)。 + + +## 3. 预训练 + +预训练数据制作参考[此处](../../../model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md) + +为了方便用户运行测试本模型,本项目提供了处理好的100k条doc的训练样本: +```shell +wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_ids.npy +wget https://bj.bcebos.com/paddlenlp/models/transformers/llama/data/llama_openwebtext_100k_idx.npz +``` + +将所有预处理得到的文件统一放入一个文件夹中,以备训练使用: + +``` +mkdir data +mv llama_openwebtext_100k_ids.npy ./data +mv llama_openwebtext_100k_idx.npz ./data +``` + +使用下面脚本,即可在llama-7b的基础上,继续训练. +```shell +task_name_or_path="llama_hybrid" +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name_or_path""_log" \ + run_pretrain.py \ + --model_type "llama" \ + --model_name_or_path "facebook/llama-7b" \ + --tokenizer_name_or_path "facebook/llama-7b" \ + --input_dir "./data" \ + --output_dir "output/$task_name_or_path" \ + --split 949,50,1 \ + --max_seq_length 2048 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --use_flash_attention 1 \ + --use_fused_rms_norm 0 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --learning_rate 0.00001 \ + --min_learning_rate 0.000005 \ + --lr_scheduler_type "cosine" \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 20\ + --dataloader_num_workers 1 \ + --sharding "stage2" \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --continue_training 1\ + --recompute 1 \ + --do_train \ + --do_eval \ + --device "gpu" +``` +注意: +1. 需要paddle develop版本训练,需要安装`pip install tool_helpers visualdl==2.5.3`等相关缺失whl包 +2. `use_flash_attention` 需要在A100机器开启,否则loss可能不正常(很快变成0.00x,非常小不正常)。建议使用cuda11.8环境。 +3. `continue_training` 表示从现有的预训练模型加载训练。7b模型初始loss大概为1.99x, 随机初始化模型loss从11.x左右下降。 +4. `use_fused_rms_norm` 需要安装[此目录](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/gpt-3/external_ops)下的自定义OP, `python setup.py install`。如果安装后仍然找不到算子,需要额外设置PYTHONPATH +5. 
当前脚本为sharding版本,需要4D并行训练(数据、sharding、张量、流水线并行)的用户,请参考 `run_trainer_tp4pp2.sh`脚本。 + +## 4. 模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/llama/benchmark.py b/llm/llama/benchmark.py new file mode 100644 index 0000000000000000000000000000000000000000..221220eacc319486e67bfea5e9efef7cdff8f489 --- /dev/null +++ b/llm/llama/benchmark.py @@ -0,0 +1,376 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import json +import os +import sys +from dataclasses import dataclass, field +from functools import partial + +import paddle +from benchmark_utils import ( + LlamaTrainer, + compute_metrics, + compute_metrics_not_do_generation, +) +from modeling_pp import LlamaForCausalLMPipe + +from paddlenlp.data import DataCollatorForSeq2Seq +from paddlenlp.datasets import load_dataset +from paddlenlp.peft import LoRAConfig, LoRAModel, PrefixConfig, PrefixModelForCausalLM +from paddlenlp.peft.prefix import llama_postprocess_past_key_value +from paddlenlp.trainer import ( + PdArgumentParser, + TrainingArguments, + get_last_checkpoint, + set_seed, +) +from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer +from paddlenlp.utils.log import logger + + +@dataclass +class DataArgument: + data_name: str = field(default=None, metadata={"help": "The name of data."}) + task_name_or_path: str = field(default=None, metadata={"help": "The name of task."}) + src_length: int = field(default=512, metadata={"help": "The max length of source text."}) + tgt_length: int = field(default=256, metadata={"help": "The max length of target text."}) + + +@dataclass +class ModelArgument: + model_name_or_path: str = field( + default="facebook/llama-7b", metadata={"help": "Build-in pretrained model name or the path to local model."} + ) + label_smoothing: float = field(default=0.1, metadata={"help": "The label smoothing parameter."}) + lr_decay_ratio: float = field(default=0.1, metadata={"help": "The ratio for learning rate decrease"}) + use_flash_attention: bool = field(default=False, metadata={"help": "Whether to use flash attention"}) + eval_with_do_generation: bool = field( + default=False, metadata={"help": "Evaluate with generation, instead for calc loss."} + ) + profiler_options: str = field( + default=None, + metadata={"help": "profiler_options."}, + ) + # lora + lora: bool = field(default=False, metadata={"help": "Whether to use LoRA technique"}) + lora_path: str = field(default=None, metadata={"help": "Initialize lora state dict."}) + lora_rank: int = field(default=4, metadata={"help": "Lora attention dimension"}) + merge_weights: bool = field( + default=False, metadata={"help": "Merge weights of the original model and the Lora model"} + ) + # prefix + prefix_tuning: bool = field(default=False, metadata={"help": "Whether to use Prefix technique"}) + num_prefix_tokens: int = field(default=10, metadata={"help": "Number of prefix tokens"}) + prefix_projection: bool = field(default=False, metadata={"help": "Whether to project the prefix 
tokens"}) + # qat + qat: bool = field(default=False, metadata={"help": "Whether to use QAT technique"}) + qat_type: str = field(default="A8W8", metadata={"help": "Quantization type. Supported values: A8W8, W4,A8W4"}) + + +PROMPT_DICT = { + "prompt_input": ( + "Below is an instruction that describes a task, paired with an input that provides further context. " + "Write a response that appropriately completes the request.\n\n" + "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:" + ), + "prompt_no_input": ( + "Below is an instruction that describes a task. " + "Write a response that appropriately completes the request.\n\n" + "### Instruction:\n{instruction}\n\n### Response:" + ), +} + + +def read_local_dataset(path): + with open(path, "r", encoding="utf-8") as f: + for line in f: + json_line = json.loads(line) + yield json_line + + +def custom_instruction_convert_example(example, tokenizer, data_args, is_test=False, model_max_length=512): + """ + Convert an example into necessary features. + """ + + prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"] + + if example.get("input", "") != "": + input_seq = prompt_input.format_map(example) + else: + input_seq = prompt_no_input.format_map(example) + + output_seq = example["output"] + tokenizer.eos_token + + # To compatible with compile training mode in benchmark, input will be pad to fix length + source_tokenized = tokenizer( + input_seq, + return_tensors="pd", + max_length=model_max_length, + truncation=True, + ) + + source_input_ids_len = ( + source_tokenized["input_ids"].not_equal(paddle.to_tensor(tokenizer.pad_token_id)).sum().item() + ) + + example_tokenized = tokenizer( + input_seq + output_seq, + return_tensors="pd", + max_length=model_max_length, + truncation=True, + ) + + input_ids = example_tokenized["input_ids"][0] + labels = copy.deepcopy(input_ids) + labels[:source_input_ids_len] = -100 + + if is_test: + return dict( + input_ids=source_tokenized["input_ids"][0], + labels=labels, + ) + + # shift labels + input_ids, labels = input_ids[:-1], labels[1:] + + return dict( + input_ids=input_ids, + labels=labels, + ) + + +def main(): + parser = PdArgumentParser((ModelArgument, DataArgument, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + data_args.always_pad_to_max_length = training_args.pipeline_parallel_degree > 1 + + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + training_args.tgt_length = data_args.tgt_length + + training_args.profiler_options = model_args.profiler_options + setattr(training_args, "label_smoothing", model_args.label_smoothing) + setattr(training_args, "lr_decay_ratio", model_args.lr_decay_ratio) + + paddle.set_device(training_args.device) + + set_seed(args=training_args) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. 
+ last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + model_class = AutoModelForCausalLM + if training_args.pipeline_parallel_degree > 1: + if model_args.eval_with_do_generation and training_args.do_eval: + raise ValueError("Plese set eval_with_do_generation to false in pipeline parallel mode.") + model_class = LlamaForCausalLMPipe + + # Load the pretrained language model. + model = model_class.from_pretrained( + model_args.model_name_or_path, + tensor_parallel_output=False, + tensor_parallel_degree=training_args.tensor_parallel_degree, + tensor_parallel_rank=training_args.tensor_parallel_rank, + use_flash_attention=model_args.use_flash_attention, + dtype=dtype, # todo enable set dtype to avoid additional mem usage + ) + if model_args.lora: + if model_args.lora_path is None: + # Not yet support RowParallelLinear + target_modules = [ + ".*q_proj.*", + ".*v_proj.*", + ".*k_proj.*", + ".*gate_proj.*", + ".*up_proj.*", + ".*o_proj.*", + ".*down_proj.*", + ] + + lora_config = LoRAConfig( + target_modules=target_modules, + r=model_args.lora_rank, + lora_alpha=2 * model_args.lora_rank, + merge_weights=model_args.merge_weights, + tensor_parallel_degree=training_args.tensor_parallel_degree, + dtype=dtype, + ) + model = LoRAModel(model, lora_config) + else: + model = LoRAModel.from_pretrained(model=model, lora_path=model_args.lora_path) + + model.mark_only_lora_as_trainable() + model.print_trainable_parameters() + + if model_args.qat: + from paddle import nn + from paddle.quantization import QAT, QuantConfig + + # FakeQuanterChannelWiseAbsMaxObserver not yet merge in Paddle develop + from paddle.quantization.quanters import FakeQuanterChannelWiseAbsMaxObserver + from paddle.quantization.quanters.abs_max import ( + FakeQuanterWithAbsMaxObserverLayer, + ) + from paddleslim.quant.quanters import PACTQuanter + + # from paddle.quantization.quanters import FakeQuanterWithAbsMaxObserver + from paddlenlp.peft.lora import LoRALinear + from paddlenlp.peft.lora.lora_quant_layers import QuantedLoRALinear + + q_config = QuantConfig(activation=None, weight=None) + q_config.add_qat_layer_mapping(LoRALinear, QuantedLoRALinear) + + if model_args.qat_type == "A8W8": + activation = PACTQuanter(quanter=FakeQuanterWithAbsMaxObserverLayer, init_value=20, dtype=dtype) + # activation = FakeQuanterWithAbsMaxObserver(moving_rate=0.9, bit_length=8, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=8, dtype="float32") + elif model_args.qat_type == "W4": + activation = None + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype="float32") + elif model_args.qat_type == "A8W4": + activation = PACTQuanter(quanter=FakeQuanterWithAbsMaxObserverLayer, 
init_value=20, dtype=dtype) + # activation = FakeQuanterWithAbsMaxObserver(moving_rate=0.9, bit_length=8, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype="float32") + else: + raise ValueError("qat_type should be one of ['A8W8', 'W4', 'A8W4']") + + q_config.add_type_config(LoRALinear, weight=weight, activation=activation) + q_config.add_type_config(nn.Linear, weight=weight, activation=activation) + + qat = QAT(q_config) + model = qat.quantize(model, inplace=True) + + if model_args.prefix_tuning: + prefix_config = PrefixConfig( + num_prefix_tokens=model_args.num_prefix_tokens, + num_attention_heads=model.config.n_head, + num_hidden_layers=model.config.n_layer, + hidden_size=model.config.hidden_size, + prefix_projection=model_args.prefix_projection, + prefix_projection_hidden_size=model.config.hidden_size, + dtype=dtype, + ) + model = PrefixModelForCausalLM( + model=model, + prefix_config=prefix_config, + postprocess_past_key_value=llama_postprocess_past_key_value, + ) + model.mark_only_prefix_as_trainable() + model.print_trainable_parameters() + + tokenizer = AutoTokenizer.from_pretrained( + model_args.model_name_or_path, + padding_side="left", # Allow batch inference + ) + tokenizer.pad_token = tokenizer.unk_token + + # Load the dataset. + train_ds = load_dataset(read_local_dataset, path="./data/train.txt", lazy=False) + training_args.do_eval = False + data_args.always_pad_to_max_length = True + trans_func = partial(custom_instruction_convert_example, tokenizer=tokenizer, data_args=data_args) + + train_ds = train_ds.map(partial(trans_func)) + + model_max_length = 512 + collate_fn = DataCollatorForSeq2Seq( + return_tensors="pd", + tokenizer=tokenizer, + max_length=model_max_length if data_args.always_pad_to_max_length else -1, + padding="max_length" if data_args.always_pad_to_max_length else True, + max_label_length=model_max_length if data_args.always_pad_to_max_length else None, + return_attention_mask=True, + ) + + def compute_metrics_trainer(eval_preds, tokenizer): + all_preds = [] + all_labels = [] + preds = eval_preds.predictions + preds = [x[x != -100] for x in preds] + all_preds.extend(tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + labels = [x[x != -100] for x in eval_preds.label_ids] + all_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)) + + all_preds = [pred.strip() for pred in all_preds] + all_labels = [label.strip() for label in all_labels] + all_preds = [pred.strip("question:") for pred in all_preds] + all_labels = [label.strip("question:") for label in all_labels] + + eval_result = compute_metrics(all_preds, all_labels) + return eval_result + + compute_metrics_func = partial( + compute_metrics_trainer, + tokenizer=tokenizer, + ) + + trainer = LlamaTrainer( + model=model, + args=training_args, + train_dataset=train_ds if training_args.do_train else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics_func + if model_args.eval_with_do_generation + else compute_metrics_not_do_generation, + do_generation=model_args.eval_with_do_generation, + data_collator=collate_fn, + ) + + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=last_checkpoint) + trainer.save_model(merge_tensor_parallel=training_args.tensor_parallel_degree > 1) + trainer.log_metrics("train", train_result.metrics) + trainer.save_metrics("train", train_result.metrics) + trainer.save_state() + + if training_args.do_eval: + eval_result = 
trainer.evaluate() + trainer.log_metrics("test", eval_result) + + +if __name__ == "__main__": + main() diff --git a/llm/llama/benchmark_utils.py b/llm/llama/benchmark_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..a0dbd0637c7448858311e9837521d57e0567cc29 --- /dev/null +++ b/llm/llama/benchmark_utils.py @@ -0,0 +1,227 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import time +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.optimizer.lr import LambdaDecay +from rouge import Rouge +from sklearn.metrics import accuracy_score + +from paddlenlp.metrics import BLEU +from paddlenlp.trainer import PrinterCallback, ProgressCallback, Trainer +from paddlenlp.trainer.integrations import TrainerCallback +from paddlenlp.utils.log import logger + + +class AverageStatistical(object): + def __init__(self): + self.reset() + + def reset(self): + self.total_cnt = 0 + self.time = 0 + + def record(self, val, cnt=1): + self.time += val + self.total_cnt += cnt + + def get_average(self): + if self.total_cnt == 0: + return 0 + + return self.time / self.total_cnt + + def get_average_per_sec(self): + if self.time == 0.0: + return 0.0 + + return float(self.total_cnt) / self.time + + def get_total_cnt(self): + return self.total_cnt + + def get_total_time(self): + return self.time + + +class BenchmarkCallback(TrainerCallback): + def __init__(self, benchmark=True, profiler_options=None): + self.benchmark = benchmark + self.profiler_options = profiler_options + + def on_train_begin(self, args, state, control, **kwargs): + assert args.gradient_accumulation_steps == 1 and not args.do_eval and not args.do_predict + if self.benchmark: + self.reader_cost_avg = AverageStatistical() + + def on_epoch_begin(self, args, state, control, **kwargs): + if self.benchmark: + self.epoch_start = time.time() + self.batch_start = time.time() + + def on_step_begin(self, args, state, control, **kwargs): + if self.benchmark: + self.reader_cost_avg.record(time.time() - self.batch_start) + + def on_step_end(self, args, state, control, **kwargs): + if self.benchmark: + self.batch_start = time.time() + if control.should_log: + self.maybe_log_save_evaluate_start = time.time() + + def on_log(self, args, state, control, logs=None, **kwargs): + if self.benchmark: + if logs is not None and "interval_steps_per_second" in logs: + self.batch_start = self.batch_start + (time.time() - self.maybe_log_save_evaluate_start) + ips = logs["interval_steps_per_second"] * args.train_batch_size + avg_batch_cost = 1 / logs["interval_steps_per_second"] + logger.info( + "global step %d / %d, loss: %f, avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sample/sec" + % ( + state.global_step, + state.max_steps, + logs["loss"], + self.reader_cost_avg.get_average(), + avg_batch_cost, + args.train_batch_size, + ips, + ) + ) + self.reader_cost_avg.reset() + + def 
on_epoch_end(self, args, state, control, **kwargs): + if self.benchmark: + train_epoch_cost = time.time() - self.epoch_start + logger.info("train epoch: %d, epoch_cost: %.5f s" % (state.epoch, train_epoch_cost)) + + +class LlamaTrainer(Trainer): + def __init__(self, do_generation: bool, **kwargs): + super().__init__(**kwargs) + self.add_callback(BenchmarkCallback(benchmark=True, profiler_options=self.args.profiler_options)) + if self.args.disable_tqdm: + self.pop_callback(PrinterCallback) + else: + self.pop_callback(ProgressCallback) + self.do_generation = do_generation + + def prediction_step( + self, + model: nn.Layer, + inputs: Dict[str, Union[paddle.Tensor, Any]], + prediction_loss_only: bool, + ignore_keys: Optional[List[str]] = None, + ) -> Tuple[Optional[paddle.Tensor], Optional[paddle.Tensor], Optional[paddle.Tensor]]: + + if prediction_loss_only: + return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + elif not self.do_generation: + loss, logits, labels = super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + # argmax here to avoid gather all logits, which is too memory-consuming. + # keepdim in order to maintain the same shape as logits + return (loss, logits.argmax(axis=-1, keepdim=True), labels) + + model.eval() + + preds = model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"], + max_length=self.args.tgt_length, + min_length=0, + use_cache=True, + temperature=1.0, + top_k=1, + top_p=1.0, + repetition_penalty=1.0, + decode_strategy="sampling", + )[0] + all_labels = [] + for label in inputs["labels"].numpy(): + label = [x for x in label[label != self.tokenizer.pad_token_id]] + all_labels.append(label) + max_label_length = max([len(x) for x in all_labels]) + for index, labels in enumerate(all_labels): + all_labels[index] = labels + [-100] * (max_label_length - len(labels)) + + return (None, paddle.to_tensor(preds), paddle.to_tensor(all_labels)) + + def create_scheduler(self, num_training_steps: int): + num_warmup_steps = ( + self.args.warmup_steps if self.args.warmup_steps > 0 else self.args.warmup_ratio * num_training_steps + ) + + def lr_lambda(current_step: int): + if current_step < num_warmup_steps: + return float(current_step) / float(max(1, num_warmup_steps)) + else: + decay_step_ratio = (current_step - num_warmup_steps) / (num_training_steps - num_warmup_steps) + return 1.0 - (1.0 - self.args.lr_decay_ratio) * decay_step_ratio + + if self.lr_scheduler is None: + self.lr_scheduler = LambdaDecay(self.args.learning_rate, lr_lambda, last_epoch=-1) + return self.lr_scheduler + + def log(self, logs: Dict[str, float], **kwargs) -> None: + if "loss" in logs: + logs["ppl"] = np.exp(logs["loss"]) + if "eval_loss" in logs: + logs["eval_ppl"] = np.exp(logs["eval_loss"]) + + super(LlamaTrainer, self).log(logs, **kwargs) + + +def compute_metrics(preds, targets): + assert len(preds) == len(targets), ( + "The length of pred_responses should be equal to the length of " + "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) + ) + rouge = Rouge() + bleu4 = BLEU(n_size=4) + scores = [] + for pred, target in zip(preds, targets): + try: + score = rouge.get_scores(" ".join(pred), " ".join(target)) + scores.append([score[0]["rouge-1"]["f"], score[0]["rouge-2"]["f"], score[0]["rouge-l"]["f"]]) + except ValueError: + scores.append([0, 0, 0]) + bleu4.add_inst(pred, [target]) + rouge1 = np.mean([i[0] for i in scores]) + rouge2 = np.mean([i[1] for i in scores]) + rougel = np.mean([i[2] for i in scores]) + + rouge1 = round(rouge1, 4) + rouge2 = round(rouge2, 4) + rougel = round(rougel, 4) + bleu4 = round(bleu4.score(), 4) + return dict( + rouge1=rouge1, + rouge2=rouge2, + rougel=rougel, + bleu4=bleu4, + ) + + +def compute_metrics_not_do_generation(eval_preds): + flattened_preds = np.array(eval_preds.predictions).flatten() + flattened_labels = np.array(eval_preds.label_ids).flatten() + filtered_preds = flattened_preds[flattened_labels != -100] + filtered_labels = flattened_labels[flattened_labels != -100] + accuracy = accuracy_score(y_true=filtered_labels, y_pred=filtered_preds) + return { + "accuracy": accuracy, + } diff --git a/llm/llama/fused_layers.py b/llm/llama/fused_layers.py new file mode 100644 index 0000000000000000000000000000000000000000..2196fed9445c8a98593ddf7a2975e1799fa2e1eb --- /dev/null +++ b/llm/llama/fused_layers.py @@ -0,0 +1,77 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
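+
+# This module monkey-patches paddle.nn.functional.linear (and, when the fused_gemm_epilogue kernel
+# is available, paddle.incubate.nn.functional.fused_linear) with FusedLinearWithGradAdd: its backward
+# accumulates weight/bias gradients in place via _C_ops.fused_linear_param_grad_add, removing the
+# extra elementwise add that gradient accumulation would otherwise perform for every nn.Linear.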
+import paddle +from paddle import _C_ops +from paddle.framework import core + + +def is_fused_matmul_bias_supported(): + if paddle.is_compiled_with_cuda() and not paddle.is_compiled_with_rocm() or paddle.is_compiled_with_xpu(): + return hasattr(core.eager.ops.legacy, "fused_gemm_epilogue") + else: + return False + + +if is_fused_matmul_bias_supported(): + origin_linear = paddle.incubate.nn.functional.fused_linear +else: + origin_linear = paddle.nn.functional.linear + + +class FusedLinearWithGradAdd(paddle.autograd.PyLayer): + @staticmethod + def forward(ctx, x, weight, bias=None, name=None): + y = origin_linear(x, weight, bias) + ctx.save_for_backward(x, weight, bias) + return y + + @staticmethod + def backward(ctx, y_grad): + x, weight, bias = ctx.saved_tensor() + x_grad = paddle.matmul(y_grad, weight, transpose_y=True) + + # _C_ops.fused_linear_param_grad_add(x, y_grad, dw, db, multi precision, has bias) + if bias is None: + if hasattr(weight, "main_grad"): + weight.main_grad, _ = _C_ops.fused_linear_param_grad_add( + x, y_grad, weight.main_grad, None, True, False + ) + return x_grad, None + else: + if weight.grad is not None: + weight.grad, _ = _C_ops.fused_linear_param_grad_add(x, y_grad, weight.grad, None, False, False) + return x_grad, None + else: + weight_grad, _ = _C_ops.fused_linear_param_grad_add(x, y_grad, None, None, False, False) + return x_grad, weight_grad + + if hasattr(weight, "main_grad") and hasattr(bias, "main_grad"): + weight.main_grad, bias.main_grad = _C_ops.fused_linear_param_grad_add( + x, y_grad, weight.main_grad, bias.main_grad, True + ) + return x_grad, None, None + else: + if weight.grad is not None: + assert bias.grad is not None + weight.grad, bias.grad = _C_ops.fused_linear_param_grad_add(x, y_grad, weight.grad, bias.grad, False) + return x_grad, None, None + else: + weight_grad, bias_grad = _C_ops.fused_linear_param_grad_add(x, y_grad, None, None, False) + return x_grad, weight_grad, bias_grad + + +def mock_layers(): + paddle.nn.functional.linear = FusedLinearWithGradAdd.apply + if is_fused_matmul_bias_supported(): + paddle.incubate.nn.functional.fused_linear = FusedLinearWithGradAdd.apply diff --git a/llm/llama/gptq_argument.json b/llm/llama/gptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..75944f076c2967df2d63026ed92fab40a7682c0d --- /dev/null +++ b/llm/llama/gptq_argument.json @@ -0,0 +1,16 @@ +{ + "model_name_or_path": "./checkpoints/llama_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_gptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_gptq": true, + "gptq_step": 8 + } \ No newline at end of file diff --git a/llm/llama/lora_argument.json b/llm/llama/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..95bd4029b48aec6436c7296ad005fe45507c97bf --- /dev/null +++ b/llm/llama/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "facebook/llama-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + 
"src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } \ No newline at end of file diff --git a/llm/llama/megre_tp_and_pp.py b/llm/llama/megre_tp_and_pp.py new file mode 100644 index 0000000000000000000000000000000000000000..1758ecf597107860412d7886d7de34fcfd00192b --- /dev/null +++ b/llm/llama/megre_tp_and_pp.py @@ -0,0 +1,88 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import paddle + +from paddlenlp.transformers import LlamaConfig, LlamaForCausalLM +from paddlenlp.utils.log import logger + + +def merge_pipeline_parallel(tp_degree, pp_degree, path): + tp_state_dict_list = [] + for tp in range(tp_degree): + tp_state_dict = {} + for pp in range(pp_degree): + tmp = paddle.load(os.path.join(path, f"model_state.tp{tp:0>2d}_pp{pp:0>2d}.pdparams"), return_numpy=True) + for k, v in tmp.items(): + tp_state_dict[k] = v + + tp_state_dict_list.append(tp_state_dict) + + return tp_state_dict_list + + +def merge_tensor_parallel(cls, state_dict_list, config) -> None: + """the entry of converting config and converting model file + + Args: + input_dir (str | None): the input dir which contains `pytorch_model.bin` and `config.json` file + config (PretrainedConfig): the PretrainedConfig instance of model + """ + name_action_mappings = cls._get_tensor_parallel_mappings(config, is_split=False) + state_keys_map = cls._resolve_prefix_keys(name_action_mappings.keys(), state_dict_list[0].keys()) + + for k, v in state_keys_map.items(): + name_action_mappings[v] = name_action_mappings.pop(k) + + state_dict_to_save = {} + for key in state_dict_list[0].keys(): + tensor = state_dict_list[0][key] + if key in name_action_mappings: + ret = [x[key] for x in state_dict_list] + action = name_action_mappings.pop(key) + tensor = action(ret) + + state_dict_to_save[key] = tensor + + if len(name_action_mappings) > 0: + for x in name_action_mappings.keys(): + logger.warning(f"key <{x}> need to merge tensor parallel but we can't find in model state.") + + print("Finally, we merging state dict to fellowing tensors.") + for k, v in state_dict_to_save.items(): + print(k, v.shape, v.dtype) + + return state_dict_to_save + + +def main(): + tp_degree = 2 + pp_degree = 2 + model_name_or_path = "temp_dir_to_your_ckpt" + + assert tp_degree > 1 + assert pp_degree > 1 + config = LlamaConfig.from_pretrained(model_name_or_path) + cls = LlamaForCausalLM + + tp_state_dict_list = merge_pipeline_parallel(tp_degree, pp_degree, model_name_or_path) + state_dict_to_save = merge_tensor_parallel(cls=cls, state_dict_list=tp_state_dict_list, config=config) + print("saving") + paddle.save(state_dict_to_save, os.path.join(model_name_or_path, 
"model_state.pdparams")) + + +if __name__ == "__main__": + main() diff --git a/llm/llama/modeling_pp.py b/llm/llama/modeling_pp.py new file mode 100644 index 0000000000000000000000000000000000000000..5354741fdea87b5b5c29055165d85ac4f8188c39 --- /dev/null +++ b/llm/llama/modeling_pp.py @@ -0,0 +1,278 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# pass +import paddle +import paddle.distributed.fleet as fleet +import paddle.nn as nn +from paddle.distributed.fleet.meta_parallel import LayerDesc, PipelineLayer + +from paddlenlp.transformers import PretrainedModel +from paddlenlp.transformers.llama.modeling import ( + LlamaConfig, + LlamaDecoderLayer, + LlamaLMHead, + LlamaModel, + LlamaPretrainedModel, + LlamaPretrainingCriterion, + LlamaRMSNorm, +) + + +def get_hcg(): + return fleet.get_hybrid_communicate_group() + + +def parse_args(args): + if isinstance(args, tuple): + if len(args) == 3: + hidden_states, attention_mask, position_ids = args + elif len(args) == 2: + hidden_states, attention_mask = args + position_ids = None + else: + hidden_states = args + attention_mask, position_ids = None, None + + if position_ids is not None: + position_ids.stop_gradient = True + + if attention_mask is not None: + attention_mask.stop_gradient = True + + return hidden_states, attention_mask, position_ids + + +def return_args(hidden_states, attention_mask=None, position_ids=None): + ret = (hidden_states,) + + if attention_mask is not None: + ret += (attention_mask.clone(),) + if position_ids is not None: + ret += (position_ids.clone(),) + if len(ret) == 1: + ret = ret[0] + + return ret + + +class LlamaEmbeddingPipe(nn.Layer): + """Extends LlamaEmbeddings to forward attention_mask through the pipeline.""" + + def __init__(self, config): + super(LlamaEmbeddingPipe, self).__init__() + self.sequence_parallel = config.sequence_parallel + self.hidden_size = config.hidden_size + if config.tensor_parallel_degree > 1: + self.embed_tokens = fleet.meta_parallel.VocabParallelEmbedding( + config.vocab_size, + config.hidden_size, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.XavierNormal()), + ) + else: + self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size) + + def forward(self, args): + """_summary_ + + Args: + input (_type_): _description_ + + Returns: + _type_: _description_ + """ + input_ids, attention_mask, position_ids = parse_args(args) + input_embeds = self.embed_tokens(input_ids) + if self.sequence_parallel: + from paddlenlp.transformers import ScatterOp + + # [bs, seq_len, num_head * head_dim] -> [bs * seq_len, num_head * head_dim] + bs, seq_len, hidden_size = input_embeds.shape + input_embeds = paddle.reshape_(input_embeds, [bs * seq_len, hidden_size]) + # [seq_len * bs / n, num_head * head_dim] (n is mp parallelism) + input_embeds = ScatterOp.apply(input_embeds) + + batch_size, seq_length = input_ids.shape + if attention_mask is not None: + attention_mask = LlamaModel._prepare_decoder_attention_mask( + 
attention_mask, (batch_size, seq_length), 0, input_embeds.dtype + ) + attention_mask.stop_gradient = True + + return return_args(input_embeds, attention_mask, position_ids) + + +class LlamaDecoderLayerPipe(LlamaDecoderLayer): + def forward(self, args): + hidden_states, attention_mask, position_ids = parse_args(args) + hidden_states = super().forward(hidden_states, attention_mask=attention_mask) + return return_args(hidden_states, attention_mask, position_ids) + + +class LlamaRMSNormPipe(LlamaRMSNorm): + def forward(self, args): + hidden_states, attention_mask, position_ids = parse_args(args) + return super().forward(hidden_states) + + +class PipelinePretrainedModel(PretrainedModel): + _sequential_layers = [] + _pipeline_name_mapping = None + + def __init__(self, config, *args, **kwargs): + super().__init__(config, *args, **kwargs) + + def add_sequential_layer(self, layer_desc, name_prefix=""): + self._sequential_layers.append({"layer": layer_desc, "name_prefix": name_prefix}) + + def get_sequential_layers(self): + return [x["layer"] for x in self._sequential_layers] + + def get_sequential_name_prefixs(self): + return {str(index): x["name_prefix"] for index, x in enumerate(self._sequential_layers)} + + def _set_pipeline_name_mapping(self, mappings=None): + if mappings is not None: + self._pipeline_name_mapping = mappings + else: + mapping = {} + state_dict_keys = list(super().state_dict().keys()) + first_key = state_dict_keys[0].split(".") + # if use virtual pp_degree, the prefix is like 0.0.xxx + # else it will be like 0.xxx + use_virtual_pp_degree = first_key[0].isdigit() and first_key[1].isdigit() + + prefixs = self.get_sequential_name_prefixs() + for k in state_dict_keys: + name_splited = k.split(".") + if use_virtual_pp_degree: + idx = str(int(name_splited[0]) + int(name_splited[1])) + single_name = [prefixs[idx]] + single_name.extend(name_splited[2:]) + else: + idx = name_splited[0] + single_name = [prefixs[idx]] + single_name.extend(name_splited[1:]) + mapping[".".join(single_name)] = k + + self._pipeline_name_mapping = mapping + + return self._pipeline_name_mapping + + def state_dict(self, *args, **kwargs): + state_dict = super().state_dict(*args, **kwargs) + + if self._pipeline_name_mapping is None: + self._set_pipeline_name_mapping() + assert len(self._pipeline_name_mapping) > 0, "The pipeline stage must have parameters!" + pp_to_single_mapping = {v: k for k, v in self._pipeline_name_mapping.items()} + + for k in list(state_dict.keys()): + v = state_dict.pop(k) + state_dict[pp_to_single_mapping[k]] = v + + return state_dict + + def set_state_dict(self, state_dict, *args, **kwargs): + if self._pipeline_name_mapping is None: + self._set_pipeline_name_mapping() + assert len(self._pipeline_name_mapping) > 0, "The pipeline stage must have parameters!" + + for k in list(state_dict.keys()): + v = state_dict.pop(k) + if k not in self._pipeline_name_mapping: + continue + state_dict[self._pipeline_name_mapping[k]] = v + + ret = super().set_state_dict(state_dict, *args, **kwargs) + return ret + + +class LlamaForCausalLMPipe(PipelinePretrainedModel, PipelineLayer): + """LlamaForPretraining adapted for pipeline parallelism. + + The largest change is flattening the LlamaModel class so we can express it as a + sequence of layers including embedding, transformer layers, and output. + """ + + config_class = LlamaConfig + + _get_tensor_parallel_mappings = LlamaPretrainedModel._get_tensor_parallel_mappings + _init_weights = LlamaPretrainedModel._init_weights + + # NO base_model_prefix !!!! 
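+    # (PipelinePretrainedModel above already translates between the flat pipeline parameter names and
+    # the single-card "llama.*" names, which appears to be why base_model_prefix is intentionally unset.)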
+ + def __init__( + self, + config, + # scale_qk_by_layer_num=True, + # virtual_pp_degree=4, + ): + self.config = config + + self.use_recompute = self.config.use_recompute + self.recompute_granularity = self.config.recompute_granularity + self.pp_recompute_interval = self.config.pp_recompute_interval + self.no_recompute_layers = config.no_recompute_layers if config.no_recompute_layers is not None else [] + if self.recompute_granularity == "full": + assert len(self.no_recompute_layers) == 0, "for pp with full recompute, no_recompute_layers is not support" + + # virtual_pp_degree = self.config.virtual_pp_degree + virtual_pp_degree = getattr(self.config, "virtual_pp_degree", 1) + + hcg = get_hcg() + tensor_parallel_degree = max(hcg.get_model_parallel_world_size(), 1) + tensor_parallel_rank = max(hcg.get_model_parallel_rank(), 0) + + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + + self.add_sequential_layer(LayerDesc(LlamaEmbeddingPipe, config=config), "llama") + for i in range(config.num_hidden_layers): + self.add_sequential_layer( + LayerDesc(LlamaDecoderLayerPipe, config=config, layerwise_recompute=i not in self.no_recompute_layers), + f"llama.layers.{i}", + ) + + self.add_sequential_layer(LayerDesc(LlamaRMSNormPipe, config=config), "llama.norm") + self.add_sequential_layer(LayerDesc(LlamaLMHead, config=config), "lm_head") + + recompute_interval = 0 + if self.use_recompute and self.recompute_granularity == "full": + assert self.config.pp_recompute_interval <= config.num_hidden_layers // ( + virtual_pp_degree * get_hcg().topology().get_dim_size("pipe") + ), "pp recompute interval should smaller than num layers of each pp chunk" + recompute_interval = self.config.pp_recompute_interval + + seg_method = "layer:LlamaDecoderLayer" + if config.num_hidden_layers % get_hcg().topology().get_dim_size("pipe") != 0: + seg_method = "uniform" + + PipelineLayer.__init__( + self, + layers=self.get_sequential_layers(), + loss_fn=LlamaPretrainingCriterion(config), + topology=get_hcg().topology(), + seg_method=seg_method, + recompute_interval=recompute_interval, + recompute_ctx={ + "mp_group": get_hcg().get_model_parallel_group(), + "offload": False, + "partition": False, + }, + num_virtual_pipeline_stages=virtual_pp_degree, + ) + self.apply(self._init_weights) + # DON'T init PipelinePretrainedModel + # PipelinePretrainedModel.__init__(self.super(), config=config) diff --git a/llm/llama/pt_argument.json b/llm/llama/pt_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..6c7e8343a0df34d40a4f31c91e0ad4a052878472 --- /dev/null +++ b/llm/llama/pt_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "facebook/llama-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_pt_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true + } \ No newline at end of file diff --git 
a/llm/llama/ptq_argument.json b/llm/llama/ptq_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..82cdddbbcb3e09f29cd45132f4c7e1edf80bac78 --- /dev/null +++ b/llm/llama/ptq_argument.json @@ -0,0 +1,22 @@ +{ + "model_name_or_path": "./checkpoints/llama_sft_ckpts", + "per_device_train_batch_size": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_ptq_ckpts", + "do_eval": true, + "eval_with_do_generation": false, + "do_ptq": true, + "ptq_step": 16, + "smooth": true, + "smooth_step": 16, + "smooth_all_linears": true, + "smooth_piecewise_search": true, + "smooth_k_piece": true, + "smooth_search_piece": true + } \ No newline at end of file diff --git a/llm/llama/run_pretrain.py b/llm/llama/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..4ca3efbddbbf7f42031cdebee30a3423b66cf7bd --- /dev/null +++ b/llm/llama/run_pretrain.py @@ -0,0 +1,562 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +GPT/Llama pretraining scripts. +""" +import math +import os +import random +import sys +import time +from dataclasses import dataclass, field +from typing import List, Optional + +import numpy as np +import paddle + +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, + speed_metrics, +) +from paddlenlp.transformers import ( + AutoTokenizer, + CosineAnnealingWithWarmupDecay, + LinearAnnealingWithWarmupDecay, + LlamaConfig, + LlamaForCausalLM, + register_sequence_parallel_allreduce_hooks, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "llama": ( + LlamaConfig, + LlamaForCausalLM, + ), +} + +from fused_layers import mock_layers +from modeling_pp import LlamaForCausalLMPipe + +from paddlenlp.data.causal_dataset import build_train_valid_test_datasets, print_rank_0 + + +def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") + return fn + + return docstring_decorator + + +@dataclass +@add_start_docstrings(TrainingArguments.__doc__) +class PreTrainingArguments(TrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." + }, + ) + enable_linear_fused_grad_add: bool = field( + default=False, + metadata={ + "help": "Enable fused linear grad add strategy, which will reduce elementwise add for grad accumulation in the backward of nn.Linear ." 
+ }, + ) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=1024, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + + data_impl: str = field(default="mmap", metadata={"help": "The format of the preprocessed data."}) + skip_warmup: bool = field( + default=True, + metadata={"help": "Whether to skip the warmup process of mmap files."}, + ) + data_cache: str = field(default=None, metadata={"help": "The path of the cached dataset."}) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. + """ + + model_type: Optional[str] = field( + default="llama", metadata={"help": "Only support for llama pre-training for now."} + ) + model_name_or_path: str = field( + default="__internal_testing__/tiny-random-llama", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + use_flash_attention: bool = field( + default=False, + metadata={"help": "use_flash_attention"}, + ) + use_fused_rms_norm: bool = field( + default=False, + metadata={"help": "llama, use_fused_rms_norm"}, + ) + fuse_attention_qkv: bool = field( + default=False, + metadata={"help": "whether to fuse attention qkv"}, + ) + fuse_attention_ffn: bool = field( + default=False, + metadata={"help": "whether to fuse first up and gate proj in mlp block"}, + ) + recompute_granularity: str = field( + default="full", + metadata={"help": "Choose among ['full', 'core_attn', 'full_attn']"}, + ) + virtual_pp_degree: int = field( + default=1, + metadata={"help": "virtual_pp_degree"}, + ) + continue_training: bool = field( + default=False, + metadata={ + "help": "Pre-training from existing paddlenlp model weights. Default False and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models." + }, + ) + sequence_parallel: bool = field( + default=False, + metadata={"help": "whether to use sequence parallel"}, + ) + fuse_sequence_parallel_allreduce: bool = field( + default=False, + metadata={"help": "whether to use fuse sequence parallel allreduce"}, + ) + rope_fusion_level: Optional[str] = field( + default=None, + metadata={ + "help": "The level of fusion of rope embedding. 
Can be chosen from:\n" + "(1) 'full': fuse sin cos compute and rope embedding\n" + "(2) 'core': only fuse rope embedding, will compute the sin and cos\n" + "(3) None: don't fuse any part of the rope embedding" + }, + ) + no_recompute_layers: Optional[List[int]] = field( + default=None, + metadata={"help": "Specify the full transformer layers that should not be recomputed."}, + ) + pp_recompute_interval: int = field( + default=1, + metadata={ + "help": "The interval for the number of layers at which recomputation occurs. A value of 0 indicates no recomputation. Default is 0." + }, + ) + recompute_use_reentrant: bool = field( + default=False, + metadata={"help": "recompute_use_reentrant"}, + ) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, + need_data=True, +): + train_val_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.dataset_world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.dataset_world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.dataset_world_size * training_args.test_iters, + ] + + print_rank_0(" > datasets target sizes (minimum size):") + print_rank_0(" train: {}".format(train_val_test_num_samples[0])) + print_rank_0(" validation: {}".format(train_val_test_num_samples[1])) + print_rank_0(" test: {}".format(train_val_test_num_samples[2])) + + # Build the datasets. + train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets( + data_prefix=data_file, + data_impl=data_args.data_impl, + splits_string=data_args.split, + train_val_test_num_samples=train_val_test_num_samples, + seq_length=data_args.max_seq_length, + seed=training_args.seed, + skip_warmup=data_args.skip_warmup, + data_cache_path=data_args.data_cache, + need_data=need_data, + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode.") + # input_ids, loss_mask, attention_mask, position_ids, labels = data + input_ids = data["text"] + + logger.info(tokenizer._decode(input_ids)) + + from paddlenlp.data import Stack + + def _collate_data(data, stack_fn=Stack()): + tokens_ = stack_fn([x["text"] for x in data]) + + labels = tokens_[:, 1:] + tokens = tokens_[:, :-1] + + return { + "input_ids": tokens, + "labels": labels, + } + + if need_data: + print_dataset(train_dataset[0], "train") + print_dataset(valid_dataset[0], "valid") + print_dataset(test_dataset[0], "test") + + return train_dataset, valid_dataset, test_dataset, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... 
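+ # (illustrative) --input_dir "0.5 /data/ds1_prefix 0.5 /data/ds2_prefix" mixes two preprocessed
+ # datasets; the weight/prefix list is passed through unchanged to build_train_valid_test_datasets.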
+ return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f))) + ] + files = [x.replace("_idx.npz", "") for x in files] + files = [x.replace(".idx", "") for x in files] # add + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def set_seed(args): + if args.device == "cpu": + idx = 0 + else: + idx = paddle.distributed.get_rank() + random.seed(args.seed + idx) + np.random.seed(args.seed + idx) + paddle.seed(args.seed + idx) + + +class PretrainingTrainer(Trainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str = "eval"): + # keep eval_dataloader + eval_dataloader = getattr(self, "eval_dataloader", None) + if eval_dataloader is None: + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + # must call data loader, otherwise, it will init many times, cause OOM error. + self.eval_dataloader = eval_dataloader() + + start_time = time.time() + # Temporarily disable metric computation, we will do it in the loop here. + compute_metrics = self.compute_metrics + eval_loop = self.evaluation_loop + + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + # Only evaluate max_eval_iters + max_eval_iters=self.args.eval_iters, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + return output.metrics + + def _get_eval_sampler(self, eval_dataset) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=False, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if training_args.enable_linear_fused_grad_add: + mock_layers() + + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + if data_args.data_cache is not None: + os.makedirs(data_args.data_cache, 
exist_ok=True) + + set_seed(training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + # if last_checkpoint is None and len( + # os.listdir(training_args.output_dir)) > 1: + # raise ValueError( + # f"Output directory ({training_args.output_dir}) already exists and is not empty. " + # "Use --overwrite_output_dir to overcome.") + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + config_class, model_class = MODEL_CLASSES[model_args.model_type] + + tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path) + + config = config_class.from_pretrained(model_args.model_name_or_path) + + config.seq_length = data_args.max_seq_length + # There are some technique extend RotaryEmbedding context. 
so don't change max_position_embeddings + if not model_args.continue_training: + config.max_position_embeddings = max(config.max_position_embeddings, data_args.max_seq_length) + + if not model_args.continue_training: + config.vocab_size = max(config.vocab_size, ((tokenizer.vocab_size - 1) // 128 + 1) * 128) + logger.info(f"Reset vocab size to {config.vocab_size} for batter amp peformance.") + + if model_args.no_recompute_layers is not None: + model_args.no_recompute_layers.sort() + + config.use_flash_attention = model_args.use_flash_attention + config.use_fused_rms_norm = model_args.use_fused_rms_norm + config.fuse_attention_qkv = model_args.fuse_attention_qkv + config.fuse_attention_ffn = model_args.fuse_attention_ffn + config.recompute_granularity = model_args.recompute_granularity + config.virtual_pp_degree = model_args.virtual_pp_degree + config.sequence_parallel = model_args.sequence_parallel + config.fuse_sequence_parallel_allreduce = model_args.fuse_sequence_parallel_allreduce + config.rope_fusion_level = model_args.rope_fusion_level + config.no_recompute_layers = model_args.no_recompute_layers + config.pp_recompute_interval = model_args.pp_recompute_interval + config.recompute_use_reentrant = model_args.recompute_use_reentrant + + config.use_recompute = training_args.recompute + config.tensor_parallel_degree = training_args.tensor_parallel_degree + config.tensor_parallel_rank = training_args.tensor_parallel_rank + + print("Final pre-training config:", config) + + # Set the dtype for loading model + dtype = "float32" + if training_args.fp16_opt_level == "O2": + if training_args.fp16: + dtype = "float16" + if training_args.bf16: + dtype = "bfloat16" + + if training_args.pipeline_parallel_degree > 1 and model_args.model_type == "llama": + model_class = LlamaForCausalLMPipe + + if model_args.continue_training: + model = model_class.from_pretrained( + model_args.model_name_or_path, + config=config, + dtype=dtype, + load_state_as_np=True, + ) + else: + model = model_class._from_config(config, dtype=dtype) + + if model_args.sequence_parallel: + register_sequence_parallel_allreduce_hooks( + model, training_args.gradient_accumulation_steps, model_args.fuse_sequence_parallel_allreduce + ) + + if training_args.recompute: + model.recompute_enable() + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = None + if training_args.lr_scheduler_type.value == "cosine": + lr_scheduler = CosineAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + elif training_args.lr_scheduler_type.value == "linear": + lr_scheduler = LinearAnnealingWithWarmupDecay( + max_lr=training_args.learning_rate, + min_lr=training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + last_epoch=0, + ) + + data_file = get_train_data_file(data_args) + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, + need_data=training_args.should_load_dataset, + ) + + total_effective_tokens = ( + training_args.per_device_train_batch_size + * training_args.dataset_world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps + * data_args.max_seq_length + ) + + trainer = 
PretrainingTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + + if training_args.should_load_dataset: + effective_tokens_per_second = total_effective_tokens / train_result.metrics["train_runtime"] + print(f"Effective Tokens per second: {effective_tokens_per_second:.2f}") + print(f"ips: {effective_tokens_per_second:.2f} tokens/s") + + +if __name__ == "__main__": + main() diff --git a/llm/llama/run_trainer.sh b/llm/llama/run_trainer.sh new file mode 100644 index 0000000000000000000000000000000000000000..fc23373ac3763938201a28a2ebea88a5b44b9247 --- /dev/null +++ b/llm/llama/run_trainer.sh @@ -0,0 +1,59 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -x +unset CUDA_VISIBLE_DEVICES +task_name="llama_hybrid" +rm -rf output/$task_name/ +rm -rf "output/$task_name""_log" + + +PYTHONPATH=../../:$PYTHONPATH \ +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain.py \ + --model_type "llama" \ + --model_name_or_path "facebook/llama-7b" \ + --tokenizer_name_or_path "facebook/llama-7b" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 2048 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --use_flash_attention 1 \ + --use_fused_rms_norm 0 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 1024 \ + --learning_rate 0.0001 \ + --min_learning_rate 0.00001 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 20\ + --dataloader_num_workers 1 \ + --sharding "stage2" \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --continue_training 1\ + --recompute 1 \ + --do_train \ + --do_eval \ + --device "gpu" \ + --data_impl "mmap" \ No newline at end of file diff --git a/llm/llama/run_trainer_tp4pp2.sh b/llm/llama/run_trainer_tp4pp2.sh new file mode 100644 index 0000000000000000000000000000000000000000..07b7fa1996d53c002d36ee3f6f6ff9fe1dd1d569 --- /dev/null +++ b/llm/llama/run_trainer_tp4pp2.sh @@ -0,0 +1,70 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -x +unset CUDA_VISIBLE_DEVICES + +task_name="llama_hybrid" +rm -rf output/$task_name/ +rm -rf "output/$task_name""_log" + + +PYTHONPATH=../../:$PYTHONPATH \ +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain.py \ + --model_type "llama" \ + --model_name_or_path "facebook/llama-7b" \ + --tokenizer_name_or_path "facebook/llama-7b" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 2048 \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 4 \ + --per_device_eval_batch_size 4 \ + --use_flash_attention 1 \ + --use_fused_rms_norm 0 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 512 \ + --tensor_parallel_degree 4 \ + --pipeline_parallel_degree 2 \ + --virtual_pp_degree 1 \ + --sequence_parallel 0 \ + --learning_rate 0.00001 \ + --min_learning_rate 0.000001 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 10 \ + --dataloader_num_workers 1 \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --sharding "stage1" \ + --disable_tqdm true \ + --continue_training 1 \ + --recompute 0 \ + --recompute_granularity full \ + --do_train \ + --do_eval \ + --device "gpu" \ + --distributed_dataloader 1 + # --pipeline_parallel_config "disable_partial_send_recv" # if set sequence_parallel True, please note off this line. + # reompute settings: + # --no_recompute_layers 0 1 2 3 4 5 6 7 8 9 10 ... int int + # --pp_recompute_interval 0 # A value of 0 indicates no recomputation. diff --git a/llm/llama/run_trainer_tp8.sh b/llm/llama/run_trainer_tp8.sh new file mode 100644 index 0000000000000000000000000000000000000000..3141558bf5b565f550e757e993ea91cdcb2ed9d3 --- /dev/null +++ b/llm/llama/run_trainer_tp8.sh @@ -0,0 +1,64 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
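+
+# Pretrains facebook/llama-13b with tensor_parallel_degree 8 (no pipeline parallelism) on a
+# single 8-GPU node. The RCCL/HSA environment variables below appear to target ROCm devices
+# and can be dropped on CUDA GPUs.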
+ +task_name="llama_13b" +rm -rf output/$task_name/ +rm -rf "output/$task_name""_log" + + +PYTHONPATH=../../:$PYTHONPATH \ +RCCL_NCHANNELS=8 HSA_FORCE_FINE_GRAIN_PCIE=1 python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain.py \ + --model_type "llama" \ + --model_name_or_path "facebook/llama-13b" \ + --tokenizer_name_or_path "facebook/llama-13b" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 2048 \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 2 \ + --per_device_eval_batch_size 2 \ + --use_flash_attention 0 \ + --use_fused_rms_norm 0 \ + --fp16 \ + --fp16_opt_level "O2" \ + --scale_loss 512 \ + --tensor_parallel_degree 8 \ + --learning_rate 0.00001 \ + --min_learning_rate 0.000001 \ + --max_steps 10000 \ + --save_steps 5000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 10 \ + --dataloader_num_workers 1 \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --sharding "stage1" \ + --disable_tqdm true \ + --continue_training 1 \ + --recompute 1 \ + --recompute_granularity full \ + --do_train \ + --do_eval \ + --device "gpu" \ + --distributed_dataloader 1 + # --pipeline_parallel_config "disable_partial_send_recv" # if set sequence_parallel True, please note off this line. + # reompute settings: + # --no_recompute_layers 0 1 2 3 4 5 6 7 8 9 10 ... int int + # --pp_recompute_interval 0 # A value of 0 indicates no recomputation. diff --git a/llm/llama/sft_argument.json b/llm/llama/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..8f46d4a3778d1452bd65a14e36507467eca191a1 --- /dev/null +++ b/llm/llama/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": "facebook/llama-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_sft_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 4, + "pipeline_parallel_degree": 1 + } \ No newline at end of file diff --git a/llm/llama/sft_pp_argument.json b/llm/llama/sft_pp_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..68d79ce1e606f93d90601aec21eb369057fabdb4 --- /dev/null +++ b/llm/llama/sft_pp_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "facebook/llama-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/llama_sft_ckpts", + "per_device_train_batch_size": 1, + "gradient_accumulation_steps": 8, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 256, + "max_length": 512, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + 
"recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 2, + "pipeline_parallel_degree": 2, + "pipeline_parallel_config": "disable_p2p_cache_shape" + } \ No newline at end of file diff --git a/llm/llama/tests/test_pipeline_parallel.py b/llm/llama/tests/test_pipeline_parallel.py new file mode 100644 index 0000000000000000000000000000000000000000..a3b595a92082670ebc05dd8a0619cbd3aeb7d04d --- /dev/null +++ b/llm/llama/tests/test_pipeline_parallel.py @@ -0,0 +1,151 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest + +import numpy as np +import paddle +import paddle.distributed.fleet as fleet +from modeling_pp import LlamaForCausalLMPipe +from paddle.distributed.fleet.meta_parallel.pipeline_parallel import PipelineParallel + +from paddlenlp.transformers import LlamaForCausalLM + + +class TestLlama(unittest.TestCase): + def test_pipeline_model(self): + world_size = paddle.distributed.get_world_size() + pp_degree = world_size + tp_degree = 1 + if world_size > 2: + pp_degree = 2 + assert world_size % pp_degree == 0 + tp_degree = world_size // pp_degree + + pp_degree = -1 + if pp_degree == -1: + tp_degree = world_size + pp_degree = 1 + + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tp_degree, + "pp_degree": pp_degree, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + + if pp_degree > 1: + model_class = LlamaForCausalLMPipe + else: + model_class = LlamaForCausalLM + + model_name_or_path = "./llama-7b-2l" + # model_name_or_path = "__internal_testing__/tiny-random-llama" + # hidden_size = 4096 + model = model_class.from_pretrained( + model_name_or_path, + num_attention_heads=32, + tensor_parallel_degree=tp_degree, + tensor_parallel_rank=hcg.get_model_parallel_rank(), + tensor_parallel_output=False, + # use_flash_attention=True, + ) + + model.eval() + + # for k, v in model.state_dict().items(): + # print(k, v.shape, v.dtype, v.abs().sum().item()) + # if k == "lm_head.weight": + # print(v) + + # input_ids = paddle.to_tensor([[x for x in range(100, 110)]], dtype="int64") + # labels = paddle.to_tensor([[x for x in range(101, 111)]], dtype="int64") + attention_mask = None + input_ids = paddle.load("/ssd2/zhonghui03/Datasets/PaddleNLP/examples/language_model/llama/input_ids") + labels = paddle.load("/ssd2/zhonghui03/Datasets/PaddleNLP/examples/language_model/llama/labels") + attention_mask = paddle.load( + "/ssd2/zhonghui03/Datasets/PaddleNLP/examples/language_model/llama/attention_mask" + ) + + # labels[labels < 0] = 1 + + if pp_degree > 1: + pp_model = PipelineParallel(layers=model, hcg=hcg, strategy=strategy) + ret = pp_model.eval_batch(data=[input_ids, labels], compute_loss=True) + else: + # pp_model = PipelineParallel(layers=model, hcg=hcg, strategy=strategy) + # pp_model.data = [input_ids, labels] + # ret = pp_model._forward_step(None) + ret = model(input_ids=input_ids, 
labels=labels, attention_mask=attention_mask) + ret = ret[0] + + # np.testing.assert_allclose(ret.item(), 10.49988270, atol=1e-7) + print(f"ret mp{tp_degree} pp", ret.item()) + ret_mp_pp = ret.item() + + single_model = LlamaForCausalLM.from_pretrained( + model_name_or_path, + num_attention_heads=32, + tensor_parallel_output=False, + ) + single_model.eval() + ret = single_model(input_ids=input_ids, labels=labels, attention_mask=attention_mask) + print("ret single", ret[0].item()) + print( + f"diff: {(ret[0].item()- ret_mp_pp)/ret[0].item()}", + ) + np.testing.assert_allclose(ret[0].item(), ret_mp_pp, rtol=1.5e-7) + # 15.526779174804688 + # 16.879518508911133 + + +if __name__ == "__main__": + TestLlama().test_pipeline_model() + +# 3 bugs to fix in paddlepaddle +# pp_layers.py +# def _construct_shared_comm(self): +# shared_comm = {} +# if self._topo.get_dim("pipe") == 1: +# return shared_comm + +# topology.py +# def _set_p2p_group(self): +# self.send_next_group = None +# self.send_prev_group = None +# self.recv_next_group = None +# self.recv_prev_group = None +# if self._pp_degree <= 1: +# return + +# pipeline_parallel.py +# def _load_micro_batch(self, cache_id, stage=None): +# inputs = self.data +# if stage == "fisrt": +# assert self.is_pipeline_first_stage() +# assert len(inputs) == 2, "length of input should be 2" +# return self._load_micro_batch_impl(inputs[0], cache_id) +# elif stage== "last": +# assert self.is_pipeline_last_stage() +# assert len(inputs) == 2, "length of input should be 2" +# return self._load_micro_batch_impl(inputs[1], cache_id) +# else: +# inputs = None +# +# +# CUDA_VISIBLE_DEVICES=2 PYTHONPATH=./ pytest -s -v tests/test_pipeline_parallel.py +# PYTHONPATH=/ssd2/zhonghui03/Datasets/PaddleNLP:$PYTHONPATH PYTHONPATH=$PYTHONPATH:./ python -m paddle.distributed.launch --gpus 0,1,2,3 tests/test_pipeline_parallel.py diff --git a/llm/llama/tests/test_sequence_parallel.py b/llm/llama/tests/test_sequence_parallel.py new file mode 100644 index 0000000000000000000000000000000000000000..e7cdc932bf2499dec400d82147386779074b6103 --- /dev/null +++ b/llm/llama/tests/test_sequence_parallel.py @@ -0,0 +1,119 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
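+
+# Numerical check for sequence parallelism: the same synthetic batch is fed to a
+# tensor-/pipeline-parallel Llama model with sequence_parallel=True and to a single-card
+# LlamaForCausalLM, and the two losses are compared with np.testing.assert_allclose.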
+ +import unittest + +import numpy as np +import paddle +import paddle.distributed.fleet as fleet +from modeling_pp import LlamaForCausalLMPipe +from paddle.distributed.fleet.meta_parallel.pipeline_parallel import PipelineParallel + +from paddlenlp.transformers import LlamaConfig, LlamaForCausalLM + + +class TestLlama(unittest.TestCase): + def test_sequence_model(self): + world_size = paddle.distributed.get_world_size() + pp_degree = world_size + tp_degree = 1 + + if world_size > 2: + pp_degree = 2 + assert world_size % pp_degree == 0 + tp_degree = world_size // pp_degree + + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tp_degree, + "pp_degree": pp_degree, + "sharding_degree": 1, + } + strategy.pipeline_configs = {"enable_partial_send_recv": False if pp_degree > 1 else True} + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + mp_group = hcg.get_model_parallel_group() + tensor_parallel_rank = mp_group.rank + + if pp_degree > 1: + model_class = LlamaForCausalLMPipe + else: + model_class = LlamaForCausalLM + + # model_name_or_path = "facebook/llama-7b" + model_name_or_path = "__internal_testing__/tiny-random-llama" + + seq_len = 2048 + batch_size = 2 + + config = LlamaConfig.from_pretrained(model_name_or_path) + config.seq_length = seq_len + config.use_flash_attention = False + config.use_fused_rms_norm = False + config.fuse_attention_qkv = False + config.recompute_granularity = "full" + config.virtual_pp_degree = 1 + config.use_recompute = False + + config.tensor_parallel_degree = tp_degree + config.tensor_parallel_rank = tensor_parallel_rank + config.tensor_parallel_output = False + config.sequence_parallel = True + + config.fuse_sequence_parallel_allreduce = False + + # hidden_size = 4096 + model = model_class.from_pretrained( + model_name_or_path, + config=config, + dtype="float32", + ) + + model.eval() + + input_ids = paddle.arange(100, 100 + batch_size * seq_len, dtype="int64").reshape([batch_size, seq_len]) + labels = paddle.arange(101, 101 + batch_size * seq_len, dtype="int64").reshape([batch_size, seq_len]) + + attention_mask = None + if pp_degree > 1: + pp_model = PipelineParallel(layers=model, hcg=hcg, strategy=strategy) + pp_model.accumulate_steps = batch_size # for micro_batch_size * acc_steps == batch_size + ret = pp_model.eval_batch(data=[input_ids, labels], compute_loss=True) + else: + # pp_model = PipelineParallel(layers=model, hcg=hcg, strategy=strategy) + # pp_model.data = [input_ids, labels] + # ret = pp_model._forward_step(None) + ret = model(input_ids=input_ids, labels=labels, attention_mask=attention_mask) + ret = ret[0] + + # np.testing.assert_allclose(ret.item(), 10.49988270, atol=1e-7) + print(f"ret mp{tp_degree} pp{pp_degree}", ret.item()) + ret_mp_pp = ret.item() + + single_model = LlamaForCausalLM.from_pretrained(model_name_or_path, config=config) + single_model.eval() + ret = single_model(input_ids=input_ids, labels=labels, attention_mask=attention_mask) + print("ret single", ret[0].item()) + print( + f"diff: {(ret[0].item()- ret_mp_pp)/ret[0].item()}", + ) + np.testing.assert_allclose(ret[0].item(), ret_mp_pp, rtol=1.5e-7) + + +if __name__ == "__main__": + TestLlama().test_sequence_model() + +# CUDA_VISIBLE_DEVICES=2 PYTHONPATH=./ pytest -s -v tests/test_pipeline_parallel.py +# PYTHONPATH=/ssd2/zhonghui03/Datasets/PaddleNLP:$PYTHONPATH PYTHONPATH=$PYTHONPATH:./ python -m paddle.distributed.launch --gpus 0,1,2,3 tests/test_pipeline_parallel.py diff --git 
a/llm/merge_lora_params.py b/llm/merge_lora_params.py new file mode 100644 index 0000000000000000000000000000000000000000..b2b27ffc900af1adfb815c99343498ea408e1467 --- /dev/null +++ b/llm/merge_lora_params.py @@ -0,0 +1,57 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse + +import paddle + +from paddlenlp.peft import LoRAConfig, LoRAModel +from paddlenlp.transformers import AutoModelForCausalLM + + +def parse_arguments(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", default=None, required=True, help="The directory of pretrained model.") + parser.add_argument( + "--lora_path", default=None, required=True, help="The directory of LoRA parameters. Default to None" + ) + parser.add_argument("--merge_model_path", default=None, help="The directory of merged parameters. Default to None") + parser.add_argument("--device", type=str, default="gpu", help="Device") + return parser.parse_args() + + +def merge(): + args = parse_arguments() + paddle.set_device(args.device) + lora_config = LoRAConfig.from_pretrained(args.lora_path) + dtype = lora_config.dtype + lora_config.merge_weights = True + + model = AutoModelForCausalLM.from_pretrained( + args.model_name_or_path, + dtype=dtype, + ) + model = LoRAModel.from_pretrained(model=model, lora_path=args.lora_path, lora_config=lora_config) + model.eval() + if args.merge_model_path is None: + args.merge_model_path = args.lora_path + + model_state_dict = model.model.state_dict() + for key in list(model_state_dict): + if "lora" in key: + del model_state_dict[key] + model.model.save_pretrained(args.merge_model_path, state_dict=model_state_dict) + + +if __name__ == "__main__": + merge() diff --git a/llm/merge_tp_params.py b/llm/merge_tp_params.py new file mode 100644 index 0000000000000000000000000000000000000000..6f310c1836b5930cd045b298e3b0dc5860899ee1 --- /dev/null +++ b/llm/merge_tp_params.py @@ -0,0 +1,100 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
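+
+# Merges tensor-parallel checkpoint shards (model_state.tp00.pdparams, model_state.tp01.pdparams, ...)
+# back into a single model_state.pdparams, using the merge actions returned by the model's
+# _get_tensor_parallel_mappings(config, is_split=False).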
+import importlib +import os + +import paddle + +from paddlenlp.transformers import AutoConfig +from paddlenlp.transformers.auto.modeling import MAPPING_NAMES +from paddlenlp.utils.log import logger + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_name_or_path", default=None, required=True, help="The directory of model.") + parser.add_argument("--device", type=str, default="gpu", help="Device") + return parser.parse_args() + + +def load_tp_params(tp_degree, path): + tp_state_dict_list = [] + for tp in range(tp_degree): + tp_state_dict = {} + tmp = paddle.load(os.path.join(path, f"model_state.tp{tp:0>2d}.pdparams"), return_numpy=True) + for k, v in tmp.items(): + tp_state_dict[k] = v + tp_state_dict_list.append(tp_state_dict) + + return tp_state_dict_list + + +def merge_tensor_parallel(model_class, state_dict_list, config) -> None: + """the entry of converting config and converting model file + + Args: + input_dir (str | None): the input dir which contains `pytorch_model.bin` and `config.json` file + config (PretrainedConfig): the PretrainedConfig instance of model + """ + name_action_mappings = model_class._get_tensor_parallel_mappings(config, is_split=False) + state_keys_map = model_class._resolve_prefix_keys(name_action_mappings.keys(), state_dict_list[0].keys()) + + for k, v in state_keys_map.items(): + name_action_mappings[v] = name_action_mappings.pop(k) + + state_dict_to_save = {} + for key in state_dict_list[0].keys(): + tensor = state_dict_list[0][key] + if key in name_action_mappings: + ret = [x[key] for x in state_dict_list] + action = name_action_mappings.pop(key) + tensor = action(ret) + + state_dict_to_save[key] = tensor + + if len(name_action_mappings) > 0: + for x in name_action_mappings.keys(): + logger.warning(f"key <{x}> need to merge tensor parallel but we can't find in model state.") + + logger.info("Finally, we merging state dict to fellowing tensors.") + for k, v in state_dict_to_save.items(): + logger.info(f"{k}, {v.shape}, {v.dtype}") + + return state_dict_to_save + + +def main(): + args = parse_arguments() + paddle.set_device(args.device) + config = AutoConfig.from_pretrained(args.model_name_or_path) + init_class = config["architectures"][0] + import_class = importlib.import_module(f"paddlenlp.transformers.{MAPPING_NAMES[init_class[:-11]]}.modeling") + model_class = getattr(import_class, init_class) + + if config.tensor_parallel_degree > 1: + tp_state_dict_list = load_tp_params(config.tensor_parallel_degree, args.model_name_or_path) + state_dict_to_save = merge_tensor_parallel( + model_class=model_class, state_dict_list=tp_state_dict_list, config=config + ) + + logger.info("Saving") + paddle.save(state_dict_to_save, os.path.join(args.model_name_or_path, "model_state.pdparams")) + else: + logger.info("No need to merge since config.tensor_parallel_degree <= 1.") + + +if __name__ == "__main__": + main() diff --git a/llm/opt/README.md b/llm/opt/README.md new file mode 100644 index 0000000000000000000000000000000000000000..98b3f140fbfb3e2d004edf05eb1e8a4ba16c6c0f --- /dev/null +++ b/llm/opt/README.md @@ -0,0 +1,22 @@ +# OPT + +## 1. 
模型介绍 + +[OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) 是以自回归填空作为训练目标的通用语言模型,可用于各类理解和生成任务。 + +**支持模型权重:** +| Model | +|----------------------------------| +|facebook/opt-125m| +|facebook/opt-350m | +|facebook/opt-1.3b | +|facebook/opt-2.7b | +|facebook/opt-6.7b| +|facebook/opt-13b | +|facebook/opt-30b | +|facebook/opt-66b | +|facebook/opt-iml-1.3b | +|opt-iml-max-1.3b | + +## 2. 模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/opt/lora_argument.json b/llm/opt/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..3c25fd57844001c52704947fb30bd30b656dcd9b --- /dev/null +++ b/llm/opt/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "facebook/opt-125m", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/opt_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } \ No newline at end of file diff --git a/llm/opt/sft_argument.json b/llm/opt/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..14e302fb19aee3d8c8dc19964ce6e068ea31174e --- /dev/null +++ b/llm/opt/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": "facebook/opt-125m", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/opt_sft_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "fp16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "sharding_parallel_degree": 4, + "sharding": "stage2" + } \ No newline at end of file diff --git a/llm/predictor.py b/llm/predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..dc6da46cbe733bd0b058e9715df5a707162192fc --- /dev/null +++ b/llm/predictor.py @@ -0,0 +1,827 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
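+
+# Unified inference entry point: parses PredictorArgument / ModelArgument, builds a dygraph or
+# static-graph predictor (optionally wrapping LoRA / Prefix Tuning weights, or the fused
+# high-performance inference model when inference_model is enabled), and runs batched generation.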
+from __future__ import annotations + +import json +import os +import sys +import time +from abc import abstractmethod +from dataclasses import dataclass, field + +import numpy as np +import paddle +import paddle.distributed.fleet.base.topology as tp +from paddle.distributed import fleet +from utils import ( + dybatch_preprocess, + get_alibi_slopes, + get_infer_model_path, + get_prefix_tuning_params, + load_real_time_tokens, +) + +from paddlenlp.peft import LoRAConfig, LoRAModel, PrefixConfig, PrefixModelForCausalLM +from paddlenlp.taskflow.utils import static_mode_guard +from paddlenlp.trainer import PdArgumentParser +from paddlenlp.transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoTokenizer, + LlamaTokenizer, + PretrainedModel, + PretrainedTokenizer, +) +from paddlenlp.utils.import_utils import import_module, is_paddlenlp_ops_available + + +@dataclass +class PredictorArgument: + model_name_or_path: str = field(default=None, metadata={"help": "The directory of model."}) + model_prefix: str = field(default="model", metadata={"help": "the prefix name of static model"}) + src_length: int = field(default=1024, metadata={"help": "The max length of source text."}) + max_length: int = field(default=2048, metadata={"help": "the max length for decoding."}) + top_k: int = field(default=0, metadata={"help": "top_k parameter for generation"}) + top_p: float = field(default=0.7, metadata={"help": "top_p parameter for generation"}) + temperature: float = field(default=0.95, metadata={"help": "top_p parameter for generation"}) + repetition_penalty: float = field(default=1.0, metadata={"help": "repetition penalty parameter for generation"}) + device: str = field(default="gpu", metadata={"help": "Device"}) + dtype: str = field(default=None, metadata={"help": "Model dtype"}) + lora_path: str = field(default=None, metadata={"help": "The directory of LoRA parameters. Default to None"}) + export_precache: bool = field(default=False, metadata={"help": "whether use prefix weight to do infer"}) + prefix_path: str = field( + default=None, metadata={"help": "The directory of Prefix Tuning parameters. Default to None"} + ) + decode_strategy: str = field( + default="sampling", + metadata={ + "help": "the decoding strategy of generation, which should be one of ['sampling', 'greedy_search', 'beam_search']. Default to sampling" + }, + ) + + mode: str = field( + default="dynamic", metadata={"help": "the type of predictor, it should be one of [dynamic, static]"} + ) + inference_model: bool = field(default=False, metadata={"help": "whether use InferenceModel to do generation"}) + quant_type: str = field( + default="None", + metadata={"help": "The quant type of inference model, support `weight_only_int8`, `weight_only_int4`."}, + ) + batch_size: int = field(default=1, metadata={"help": "The batch size of data."}) + benchmark: bool = field( + default=False, + metadata={ + "help": "If benchmark set as `True`, we will force model decode to max_length, which is helpful to compute throughput. 
" + }, + ) + + +@dataclass +class ModelArgument: + model_type: str = field( + default=None, + metadata={"help": "the type of the model, which can be one of ['gpt-3', 'ernie-3.5-se', 'llama-img2txt']"}, + ) + data_file: str = field(default=None, metadata={"help": "data file directory"}) + output_file: str = field(default="output.json", metadata={"help": "predict result file directory"}) + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +def init_dist_env(): + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = paddle.distributed.get_rank() + + if tensor_parallel_degree > 1: + # refer to: https://github.com/PaddlePaddle/Paddle/blob/4abea956ee852ce52791a1e08fa92ed4d3be150d/python/paddle/distributed/fleet/fleet.py#L298C23-L298C45 + hcg = tp._HYBRID_PARALLEL_GROUP + if hcg is None: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + hcg = fleet.get_hybrid_communicate_group() + + tensor_parallel_rank = hcg.get_model_parallel_rank() + return tensor_parallel_rank, tensor_parallel_degree + + +class BasePredictor: + def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = None): + self.model_config = AutoConfig.from_pretrained(config.model_name_or_path) + self.config: PredictorArgument = config + if tokenizer is None: + tokenizer = AutoTokenizer.from_pretrained(config.model_name_or_path, padding_side="left") + + self.tokenizer = tokenizer + self.return_tensors = "pd" + self.tensor_parallel_rank, self.tensor_parallel_degree = init_dist_env() + self.model_config.tensor_parallel_rank, self.model_config.tensor_parallel_degree = init_dist_env() + + def _preprocess(self, source): + tokenized_source = self.tokenizer( + source, + max_length=self.config.src_length, + truncation=True, + truncation_side="left", + return_tensors=self.return_tensors, + padding=True, + add_special_tokens=True, + ) + return tokenized_source + + @abstractmethod + def _infer(self, inputs): + raise NotImplementedError + + def _postprocess(self, predictions): + decoded_predictions = self.tokenizer.batch_decode( + predictions, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + return decoded_predictions + + def predict(self, input_texts: str | list[str]): + tokenized_source = self._preprocess(input_texts) + predictions = self._infer(tokenized_source) + decoded_predictions = self._postprocess(predictions) + return decoded_predictions + + +class DygraphPredictor(BasePredictor): + def __init__( + self, config: PredictorArgument, model: PretrainedModel = None, tokenizer: PretrainedTokenizer = None + ): + super().__init__(config, tokenizer) + self.model = model + if config.lora_path is not None: + lora_config = LoRAConfig.from_pretrained(config.lora_path) + dtype = lora_config.dtype + lora_config.merge_weights = True + elif config.prefix_path is not None: + prefix_config = PrefixConfig.from_pretrained(config.prefix_path) + dtype = prefix_config.dtype + elif config.dtype is not None: + dtype = config.dtype + else: + raise ValueError("Please specific the model dtype.") + + if self.model is None: + self.model = AutoModelForCausalLM.from_pretrained( + config.model_name_or_path, + dtype=dtype, + 
tensor_parallel_degree=self.tensor_parallel_degree, + tensor_parallel_rank=self.tensor_parallel_rank, + ) + + if config.lora_path is not None: + self.model = LoRAModel.from_pretrained( + model=self.model, lora_path=config.lora_path, lora_config=lora_config + ) + if config.prefix_path is not None: + prefix_tuning_params = get_prefix_tuning_params(self.model) + self.model = PrefixModelForCausalLM.from_pretrained( + model=self.model, + prefix_path=config.prefix_path, + postprocess_past_key_value=prefix_tuning_params["postprocess_past_key_value"], + ) + self.model.eval() + + @paddle.no_grad() + def _infer(self, inputs: dict[str, paddle.Tensor]): + result = self.model.generate( + **inputs, + max_length=self.config.max_length, + bos_token_id=self.tokenizer.bos_token_id, + eos_token_id=self.tokenizer.eos_token_id, + pad_token_id=self.tokenizer.pad_token_id, + decode_strategy=self.config.decode_strategy, + temperature=self.config.temperature, + top_k=self.config.top_k, + top_p=self.config.top_p, + repetition_penalty=self.config.repetition_penalty, + ) + result = result[0] + return result + + +class StaticGraphPredictor(BasePredictor): + def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = None): + super().__init__(config, tokenizer) + + params_path = os.path.join(self.config.model_name_or_path, self.config.model_prefix + ".pdiparams") + model_path = os.path.join(self.config.model_name_or_path, self.config.model_prefix + ".pdmodel") + inference_config = paddle.inference.Config(model_path, params_path) + + if self.config.device == "gpu": + # set GPU configs accordingly + inference_config.enable_use_gpu(100, 0) + elif self.config.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + inference_config.disable_gpu() + inference_config.disable_glog_info() + + with static_mode_guard(): + self.predictor = paddle.inference.create_predictor(inference_config) + + self.return_tensors = "np" + + def _preprocess(self, input_text: str | list[str]): + inputs = super()._preprocess(input_text) + inputs["max_length"] = np.array(self.config.max_length, dtype="int64") + + inputs["top_p"] = np.array(self.config.top_p, dtype="float32") + inputs["temperature"] = np.array(self.config.temperature, dtype="float32") + inputs["top_k"] = np.array(self.config.top_k, dtype="int64") + inputs["repetition_penalty"] = np.array(self.config.repetition_penalty, dtype="float32") + + return inputs + + def _infer(self, inputs: dict[str, np.ndarray]): + for name in self.predictor.get_input_names(): + self.predictor.get_input_handle(name).copy_from_cpu(inputs[name]) + + self.predictor.run() + output_names = self.predictor.get_output_names() + output_handle = self.predictor.get_output_handle(output_names[0]) + results = output_handle.copy_to_cpu() + # the first result is decoding_ids + decoded_ids = results.tolist() + return decoded_ids + + +class InferencePredictorMixin: + def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer): + + self.architectures = self.model_config.architectures[0].lower() + + self.dtype = config.dtype or self.model_config + self.cache_kvs = [paddle.zeros(shape, dtype=self.dtype) for shape in self.cache_kvs_shape] + self.num_layers, self.num_attention_heads, self.head_dim = ( + len(self.cache_kvs), + self.cache_kvs[0].shape[-3], + self.cache_kvs[0].shape[-1], + ) + self.total_max_length = config.src_length + config.max_length + self.pre_ids = paddle.full([config.batch_size, self.total_max_length], -1, 
dtype="int64") + if "chatglm" in self.architectures: + self.attention_mask = paddle.ones( + shape=(config.batch_size, 1, self.total_max_length, self.total_max_length), + dtype=self.dtype, + ) + self.tgt_pos = paddle.ones( + shape=[config.batch_size, 2, 1], + dtype="int64", + ) + else: + self.attention_mask = paddle.zeros( + shape=(config.batch_size, 1, self.total_max_length, self.total_max_length), + dtype=self.dtype, + ) + + self.tgt_generation_mask = paddle.zeros( + shape=[config.batch_size, 1, 1, self.total_max_length], + dtype=self.dtype, + ) + self.arange_tensor_encoder = paddle.zeros( + shape=(config.batch_size, 1, self.total_max_length), dtype=self.dtype + ) + + if config.export_precache: + if config.prefix_path: + prefix_cache = ( + paddle.to_tensor(np.load(f"{config.prefix_path}/pre_caches.npy")).astype(self.dtype).unsqueeze(2) + ) + prefix_cache = paddle.expand( + prefix_cache, + [ + self.num_layers, + 2, + config.batch_size, + self.num_attention_heads, + prefix_cache.shape[-2], + self.head_dim, + ], + ) + self.pre_caches = [item.squeeze_(0) for item in paddle.split(prefix_cache, self.num_layers, axis=0)] + else: + prefix_cache = paddle.zeros( + [self.num_layers, 2, config.batch_size, self.num_attention_heads, 128, self.head_dim], + dtype=self.dtype, + ) + self.pre_caches = [item.squeeze_(0) for item in paddle.split(prefix_cache, self.num_layers, axis=0)] + + def _postprocess(self, predictions): + if paddle.distributed.get_rank() == 0: + tokens: np.ndarray = load_real_time_tokens() + decoded_predictions = self.tokenizer.batch_decode( + tokens.tolist(), skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + return decoded_predictions + else: + return None + + def _preprocess(self, source): + self.attention_mask[:] = 0 + self.tgt_generation_mask[:] = 0 + pre_caches_length = 0 if not self.config.export_precache else self.pre_caches[0].shape[-2] + + if "chatglm" in self.architectures: + inputs = dybatch_preprocess( + self.tokenizer, + source, + self.config.src_length, + self.config.max_length, + self.architectures, + top_p=self.config.top_p, + temperature=self.config.temperature, + ) + for i in range(inputs["input_ids"].shape[0]): + length = inputs["seq_len_encoder"][i][0] + self.attention_mask[i, 0, :length, :length] = 0 + self.attention_mask[i, 0, : length - 1, length - 1] = 1 + self.tgt_generation_mask[i, 0, 0, :length] = paddle.ones(shape=[1, length], dtype=self.config.dtype) + self.tgt_pos[i, 0, 0] = paddle.to_tensor([length], dtype="int64") + + inputs["tgt_pos"] = self.tgt_pos + elif "bloom" in self.architectures: + inputs = dybatch_preprocess( + self.tokenizer, + source, + self.config.src_length, + self.config.max_length, + self.architectures, + top_p=self.config.top_p, + temperature=self.config.temperature, + ) + for i in range(inputs["input_ids"].shape[0]): + length = inputs["seq_len_encoder"][i][0] + self.attention_mask[i, :, :length, :length] = paddle.tril( + paddle.ones(shape=(length, length), dtype=self.config.dtype) + ) + self.arange_tensor_encoder[i, :, :length] = paddle.arange(length).astype(self.config.dtype) + + self.tgt_generation_mask[i, :, 0, :length] = paddle.ones(shape=[1, length], dtype=self.config.dtype) + # alibi encoder + alibi_slopes = get_alibi_slopes(self.model_config.n_head) + inputs["position_ids"] = paddle.to_tensor(alibi_slopes, dtype="float32") + + alibi = alibi_slopes[..., None] * self.arange_tensor_encoder + alibi = alibi[:, :, None, :] + + if self.model_config.tensor_parallel_degree > 1: + block_size = self.model_config.n_head // 
self.model_config.tensor_parallel_degree + alibi = alibi[ + :, + self.model_config.tensor_parallel_rank + * block_size : (self.model_config.tensor_parallel_rank + 1) + * block_size, + ] + alibi = alibi.reshape([inputs["input_ids"].shape[0], block_size, 1, self.config.max_length]) + inputs["position_ids"] = inputs["position_ids"][ + self.model_config.tensor_parallel_rank + * block_size : (self.model.config.tensor_parallel_rank + 1) + * block_size + ] + + alibi_encoder = alibi.expand( + [ + inputs["input_ids"].shape[0], + self.model_config.n_head // self.model_config.tensor_parallel_degree, + self.total_max_length, + self.total_max_length, + ] + ) + alibi_decoder = alibi.expand( + [ + inputs["input_ids"].shape[0], + self.model_config.n_head // self.model_config.tensor_parallel_degree, + 1, + self.total_max_length, + ] + ) + self.attention_mask = ( + alibi_encoder + (1 - self.attention_mask) * paddle.finfo(self.attention_mask.dtype).min + ) + self.tgt_generation_mask = ( + alibi_decoder + (1 - self.tgt_generation_mask) * paddle.finfo(self.tgt_generation_mask.dtype).min + ) + + else: + inputs = dybatch_preprocess( + self.tokenizer, + source, + self.config.src_length, + self.config.max_length, + self.architectures, + top_p=self.config.top_p, + temperature=self.config.temperature, + pre_caches_length=pre_caches_length, + ) + + for i in range(inputs["input_ids"].shape[0]): + length = inputs["seq_len_encoder"][i][0] + self.attention_mask[i, 0, :length, :length] = paddle.tril( + paddle.ones(shape=(length, length), dtype=self.config.dtype) + ) + + if pre_caches_length > 0: + if self.config.prefix_path is None: + prefix_attention_mask = paddle.zeros( + [1, length, pre_caches_length], dtype=self.attention_mask.dtype + ) + else: + prefix_attention_mask = paddle.ones( + [1, length, pre_caches_length], dtype=self.attention_mask.dtype + ) + post_attention_mask = paddle.tril( + paddle.ones(shape=(length, length), dtype=self.attention_mask.dtype) + ).unsqueeze_(axis=0) + self.attention_mask[i, 0, :length, : length + pre_caches_length] = paddle.concat( + [prefix_attention_mask, post_attention_mask], axis=2 + ) + + if self.config.prefix_path is None: + self.tgt_generation_mask[i, 0, 0, pre_caches_length : length + pre_caches_length] = paddle.ones( + shape=[1, length], dtype="float16" + ) + else: + self.tgt_generation_mask[i, 0, 0, : length + pre_caches_length] = paddle.ones( + shape=[1, length + pre_caches_length], dtype=self.config.dtype + ) + + inputs["pre_ids"] = self.pre_ids + inputs["attention_mask"] = self.attention_mask + inputs["tgt_generation_mask"] = self.tgt_generation_mask + + if pre_caches_length > 0: + if self.config.mode == "dynamic": + inputs["pre_caches"] = self.pre_caches + else: + for i in range(len(self.pre_caches)): + inputs["pre_caches_{}".format(i)] = self.pre_caches[i].numpy() + + return inputs + + +class StaticInferencePredictor(InferencePredictorMixin, BasePredictor): + def __init__( + self, + config: PredictorArgument, + cache_kvs_shape: list[list[int]], + tokenizer: PretrainedTokenizer = None, + ): + self.cache_kvs_shape = cache_kvs_shape + BasePredictor.__init__(self, config, tokenizer) + InferencePredictorMixin.__init__(self, config, tokenizer) + + self.predictor = self._create_predictor(config) + + def _create_predictor(self, predictor_args: PredictorArgument): + if not is_paddlenlp_ops_available(): + raise ValueError( + "you should install the paddlenlp ops to run inference predictor, " + "https://github.com/PaddlePaddle/PaddleNLP/blob/develop/csrc/README.md" + ) + + # register 
the custome ops + import_module("paddlenlp_ops.encode_rotary_qk") + import_module("paddlenlp_ops.get_padding_offset") + import_module("paddlenlp_ops.qkv_transpose_split") + import_module("paddlenlp_ops.rebuild_padding") + import_module("paddlenlp_ops.transpose_remove_padding") + import_module("paddlenlp_ops.write_cache_kv") + + infer_model_path = get_infer_model_path(predictor_args.model_name_or_path, predictor_args.model_prefix) + + config = paddle.inference.Config(infer_model_path + ".pdmodel", infer_model_path + ".pdiparams") + + config.switch_ir_optim(True) + device_id = int(os.environ.get("FLAGS_selected_gpus", 0)) + config.enable_use_gpu(100, device_id) + # config.disable_glog_info() + # config.enable_memory_optim() + + if self.tensor_parallel_degree > 1: + trainer_endpoints = fleet.worker_endpoints() + current_endpoint = trainer_endpoints[self.tensor_parallel_rank] + + dist_config = config.dist_config() + dist_config.set_ranks(self.tensor_parallel_degree, self.tensor_parallel_rank) + dist_config.set_endpoints(trainer_endpoints, current_endpoint) + dist_config.enable_dist_model(True) + + dist_config.set_comm_init_config(os.path.join(predictor_args.model_name_or_path, "rank_mapping.csv")) + config.set_dist_config(dist_config) + + predictor = paddle.inference.create_predictor(config) + return predictor + + @paddle.no_grad() + def _infer(self, inputs): + for k, v in inputs.items(): + input_tensor = self.predictor.get_input_handle(k) + + if "mask" in k or "position" in k: + input_tensor.share_external_data(v) + else: + if paddle.is_tensor(v): + v = v.numpy() + input_tensor.copy_from_cpu(v) + + for i in range(len(self.cache_kvs_shape)): + input_tensor = self.predictor.get_input_handle("cache_kvs_" + str(i)) + input_tensor.share_external_data(self.cache_kvs[i]) + + input_tensor = self.predictor.get_input_handle("pre_ids") + input_tensor.share_external_data(self.pre_ids) + + self.predictor.run() + + +class DygraphInferencePredictor(InferencePredictorMixin, BasePredictor): + def __init__( + self, + config: PredictorArgument, + model: PretrainedModel = None, + tokenizer: PretrainedTokenizer = None, + ): + self.cache_kvs_shape = model.get_cache_kvs_shape(model.config, config.batch_size) + BasePredictor.__init__(self, config, tokenizer) + InferencePredictorMixin.__init__(self, config, tokenizer) + self.model = model + + @paddle.no_grad() + def _infer(self, inputs: dict[str, paddle.Tensor]): + for key in inputs.keys(): + if paddle.is_tensor(inputs[key]): + continue + if isinstance(inputs[key], list): + if paddle.is_tensor(inputs[key]): + continue + inputs[key] = [paddle.to_tensor(item) for item in inputs[key]] + else: + inputs[key] = paddle.to_tensor(inputs[key]) + + inputs["cache_kvs"] = self.cache_kvs + self.model.generate( + **inputs, + ) + return None + + +def create_predictor( + predictor_args: PredictorArgument, + model_args: ModelArgument, + tensor_parallel_degree: int = 1, + tensor_parallel_rank: int = 0, +): + tokenizer = AutoTokenizer.from_pretrained(predictor_args.model_name_or_path) + # TODO(wj-Mcat): fix llama tokenzier pad_token bug + if isinstance(tokenizer, LlamaTokenizer) and not tokenizer.pad_token: + tokenizer.pad_token = tokenizer.unk_token + + # update config parameter for inference predictor + if predictor_args.decode_strategy == "greedy_search" and predictor_args.inference_model: + predictor_args.top_p = 0.0 + predictor_args.temperature = 1.0 + + tensor_parallel_rank, tensor_parallel_degree = init_dist_env() + if not predictor_args.inference_model: + if predictor_args.mode 
== "dynamic": + if model_args.model_type == "gpt-3": + sys.path.append("./gpt-3") + from modeling import GPTForCausalLM + + model = GPTForCausalLM.from_pretrained( + predictor_args.model_name_or_path, + dtype=predictor_args.dtype, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + ) + elif model_args.model_type == "ernie-3.5-se": + sys.path.append("./ernie-3.5-se") + from modeling import Ernie35ForCausalLM + + tensor_parallel_degree = paddle.distributed.get_world_size() + tensor_parallel_rank = paddle.distributed.get_rank() + model = Ernie35ForCausalLM.from_pretrained( + predictor_args.model_name_or_path, + dtype=predictor_args.dtype, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + ) + else: + model = AutoModelForCausalLM.from_pretrained( + predictor_args.model_name_or_path, + dtype=predictor_args.dtype, + tensor_parallel_degree=tensor_parallel_degree, + tensor_parallel_rank=tensor_parallel_rank, + ) + + predictor = DygraphPredictor(predictor_args, model=model, tokenizer=tokenizer) + elif predictor_args.mode == "static": + predictor = StaticGraphPredictor(predictor_args, tokenizer=tokenizer) + else: + raise ValueError("the `mode` should be one of [dynamic, static]") + else: + if predictor_args.mode == "dynamic": + # TODO(wj-Mcat): complete AutoInferenceModel & AutoPredictor + config = AutoConfig.from_pretrained(predictor_args.model_name_or_path) + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + + if "llama" in config.architectures[0].lower(): + if model_args.model_type == "llama-img2txt": + # we use llama for img2txt. + from paddlenlp.experimental.transformers import ( + LlamaForMiniGPT4InferenceModel as LlamaInferenceModel, + ) + else: + from paddlenlp.experimental.transformers import ( + LlamaForCausalLMInferenceModel as LlamaInferenceModel, + ) + + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + config.quant_bits = -1 + + if predictor_args.quant_type.startswith("weight_only_int"): + quant_bits = int(predictor_args.quant_type[-1]) + config.quant_bits = quant_bits + + model = LlamaInferenceModel.from_pretrained( + predictor_args.model_name_or_path, config=config, dtype=predictor_args.dtype + ) + model.eval() + elif "chatglm" in config.architectures[0].lower(): + from paddlenlp.experimental.transformers import ( + ChatGLMForCausalLMInferenceModel, + ) + + model = ChatGLMForCausalLMInferenceModel.from_pretrained( + predictor_args.model_name_or_path, + config=config, + dtype=predictor_args.dtype, + ) + model.eval() + elif "bloom" in config.architectures[0].lower(): + from paddlenlp.experimental.transformers import ( + BloomForCausalLMInferenceModel, + ) + + config.tensor_parallel_degree = tensor_parallel_degree + config.tensor_parallel_rank = tensor_parallel_rank + model = BloomForCausalLMInferenceModel.from_pretrained( + predictor_args.model_name_or_path, + config=config, + dtype=predictor_args.dtype, + ) + cache_kvs_shape = BloomForCausalLMInferenceModel.get_cache_kvs_shape(config, predictor_args.batch_size) + model.eval() + predictor = DygraphInferencePredictor(predictor_args, model=model, tokenizer=tokenizer) + elif predictor_args.mode == "static": + config = AutoConfig.from_pretrained(predictor_args.model_name_or_path) + if "llama" in config.architectures[0].lower(): + from paddlenlp.experimental.transformers import ( + LlamaForCausalLMInferenceModel, + ) + + 
cache_kvs_shape = LlamaForCausalLMInferenceModel.get_cache_kvs_shape(config, predictor_args.batch_size) + predictor = StaticInferencePredictor(predictor_args, cache_kvs_shape, tokenizer=tokenizer) + elif "chatglm" in config.architectures[0].lower(): + from paddlenlp.experimental.transformers import ( + ChatGLMForCausalLMInferenceModel, + ) + + cache_kvs_shape = ChatGLMForCausalLMInferenceModel.get_cache_kvs_shape( + config, predictor_args.batch_size + ) + predictor = StaticInferencePredictor(predictor_args, cache_kvs_shape, tokenizer=tokenizer) + elif "bloom" in config.architectures[0].lower(): + from paddlenlp.experimental.transformers import ( + BloomForCausalLMInferenceModel, + ) + + cache_kvs_shape = BloomForCausalLMInferenceModel.get_cache_kvs_shape(config, predictor_args.batch_size) + predictor = StaticInferencePredictor( + predictor_args, cache_kvs_shape=cache_kvs_shape, tokenizer=tokenizer + ) + else: + raise ValueError("the `mode` should be one of [dynamic, static]") + return predictor + + +def predict(): + parser = PdArgumentParser((PredictorArgument, ModelArgument)) + predictor_args, model_args = parser.parse_args_into_dataclasses() + + paddle.set_device(predictor_args.device) + paddle.set_default_dtype(predictor_args.dtype) + + tensor_parallel_degree = paddle.distributed.get_world_size() + if tensor_parallel_degree > 1: + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = { + "dp_degree": 1, + "mp_degree": tensor_parallel_degree, + "pp_degree": 1, + "sharding_degree": 1, + } + fleet.init(is_collective=True, strategy=strategy) + + predictor = create_predictor(predictor_args, model_args) + source_texts = [] + target_texts = [] + if model_args.data_file: + with open(model_args.data_file, "r", encoding="utf-8") as f: + for line in f: + example = json.loads(line) + source_texts.append(example["src"]) + target_texts.append(example["tgt"]) + else: + source_texts = ["hello world, how are you?", "你好,请问你是谁?"] + target_texts = ["", ""] + + batch_source_texts = batchfy_text(source_texts, predictor_args.batch_size) + batch_target_texts = batchfy_text(target_texts, predictor_args.batch_size) + + with open(model_args.output_file, "w", encoding="utf-8") as f: + for bs, batch_source_text in enumerate(batch_source_texts): + outputs = predictor.predict(batch_source_text) + + if predictor.tensor_parallel_rank > 0: + continue + for output, source, target in zip(outputs, batch_source_texts[bs], batch_target_texts[bs]): + print("***********Source**********") + print(source) + print("***********Target**********") + print(target) + print("***********Output**********") + print(output) + out = {"src": source, "tgt": target, "output": output} + f.write(json.dumps(out, ensure_ascii=False) + "\n") + + if predictor_args.benchmark: + benchmark(predictor, predictor_args, model_args) + + +def benchmark(predictor, predictor_args, model_args): + # Just construct a simple benchmark input. We pad input to the src_length. + test_texts = "hello world, how are you?" 
+ benchmark_texts = [test_texts + "" * predictor_args.src_length for _ in range(predictor_args.batch_size)] + + batch_benchmark_texts = batchfy_text(benchmark_texts, predictor_args.batch_size) + print("***********Start Benchmark**********") + + warmup_time = 10 + test_time = 100 + + print("***********Start Warmup**********") + for _ in range(warmup_time): + for bs, batch_source_text in enumerate(batch_benchmark_texts): + outputs = predictor.predict(batch_source_text) + + print("***********Start Speed Test**********") + start = time.perf_counter() + for _ in range(test_time): + for bs, batch_source_text in enumerate(batch_benchmark_texts): + outputs = predictor.predict(batch_source_text) + end = time.perf_counter() + + output_tokens = sum([len(output) for output in outputs]) + print( + "Input length is: {}, Output length is: {}, bs is: {}, Generate speed is: {:.3f} tokens/s(ips), QPS: {:.3f} requests/s. ".format( + predictor_args.src_length, + predictor_args.max_length, + predictor_args.batch_size, + (output_tokens / (end - start) / test_time), + (predictor_args.batch_size / (end - start) / test_time), + ) + ) + + +if __name__ == "__main__": + predict() diff --git a/llm/quant.py b/llm/quant.py new file mode 100644 index 0000000000000000000000000000000000000000..56314d6e468fe898224e9009f652996a9ae123d1 --- /dev/null +++ b/llm/quant.py @@ -0,0 +1,230 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import json +import os + +import paddle +from paddle import nn +from paddle.distributed.fleet.meta_parallel import ( + ColumnParallelLinear, + RowParallelLinear, +) +from paddle.quantization import PTQ, QAT, QuantConfig +from paddle.quantization.quanters.abs_max import FakeQuanterWithAbsMaxObserverLayer +from paddleslim.quant.advanced import ( + GPTQ, + EMASampler, + MultiStepSampler, + PieceWiseSearch, + Shift, + Smooth, +) +from paddleslim.quant.advanced.utils import find_parent_layer_and_sub_name +from paddleslim.quant.layers import ( + QuantizedColumnParallelLinear, + QuantizedRowParallelLinear, +) +from paddleslim.quant.observers import AbsMaxChannelWiseWeightObserver, AVGObserver +from paddleslim.quant.observers.abs_max_weight import ( + AbsMaxChannelWiseWeightObserverLayer, +) +from paddleslim.quant.observers.avg import AVGObserverLayer +from paddleslim.quant.quanters import PACTQuanter + +from paddlenlp.peft import PrefixModelForCausalLM +from paddlenlp.peft.lora import LoRALinear +from paddlenlp.peft.lora.lora_quant_layers import QuantedLoRALinear +from paddlenlp.utils.log import logger + + +def create_qat_model(quant_args, model, dtype): + # FakeQuanterChannelWiseAbsMaxObserver not yet merge in Paddle develop + from paddle.quantization.quanters import FakeQuanterChannelWiseAbsMaxObserver + + q_config = QuantConfig(activation=None, weight=None) + q_config.add_qat_layer_mapping(LoRALinear, QuantedLoRALinear) + if quant_args.quant_type == "A8W8": + activation = PACTQuanter(quanter=FakeQuanterWithAbsMaxObserverLayer, init_value=20, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=8, dtype="float32") + elif quant_args.quant_type == "WINT4": + activation = None + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype="float32") + elif quant_args.quant_type == "WINT8": + activation = None + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=8, dtype="float32") + elif quant_args.quant_type == "A8W4": + activation = PACTQuanter(quanter=FakeQuanterWithAbsMaxObserverLayer, init_value=20, dtype=dtype) + weight = FakeQuanterChannelWiseAbsMaxObserver(bit_length=4, dtype="float32") + else: + raise ValueError("quant_type should be one of ['A8W8', 'WINT4', 'WINT8']") + q_config.add_type_config(LoRALinear, weight=weight, activation=activation) + q_config.add_type_config(nn.Linear, weight=weight, activation=activation) + + qat = QAT(q_config) + model = qat.quantize(model, inplace=True) + return model + + +def apply_shift(quant_args, trainer, ptq_dataloader, ptq_model_config): + logger.info("***** Running Shift *****") + shift_sampler = EMASampler() if quant_args.shift_sampler == "ema" else None + shift = Shift( + model=trainer.model, + model_config=ptq_model_config, + sample_function=shift_sampler, + shift_all_linears=quant_args.shift_all_linears, + ) + + trainer.ptq_loop( + ptq_dataloader, + description="Shift", + max_eval_iters=quant_args.shift_step, + ) + shift.update_weight() + del shift, shift_sampler + logger.info("***** Shift done *****") + + +def apply_smooth(quant_args, trainer, ptq_dataloader, ptq_model_config): + logger.info("***** Running Smooth *****") + smooth_sampler = MultiStepSampler() if quant_args.smooth_sampler == "multi_step" else None + if quant_args.smooth_piecewise_search: + search_func = PieceWiseSearch( + k_piece=quant_args.smooth_k_piece, + bits_length=8, + search_piece=quant_args.smooth_search_piece, + search_alpha_min=0.2, + search_alpha_max=0.8, + search_scale_min=1.0, + search_scale_max=5.0, + 
weight_quant_method="abs_max_channel_wise", + act_quant_method="avg", + ) + else: + search_func = None + smooth = Smooth( + trainer.model, + ptq_model_config, + alpha=0.5, + smooth_all_linears=quant_args.smooth_all_linears, + sample_function=smooth_sampler, + search_function=search_func, + ) + trainer.ptq_loop( + ptq_dataloader, + description="Smooth", + max_eval_iters=quant_args.smooth_step, + ) + + smooth.update_weight() + del smooth, smooth_sampler, search_func + logger.info("***** Smooth done *****") + + +def apply_ptq(quant_args, trainer, ptq_dataloader): + logger.info("***** Running PTQ *****") + q_config = QuantConfig(activation=None, weight=None) + + if quant_args.quant_type == "A8W8": + activation = AVGObserver(quant_bits=8) + weight = AbsMaxChannelWiseWeightObserver(quant_bits=8) + elif quant_args.quant_type == "WINT4": + activation = None + weight = AbsMaxChannelWiseWeightObserver(quant_bits=4) + elif quant_args.quant_type == "WINT8": + activation = None + weight = AbsMaxChannelWiseWeightObserver(quant_bits=8) + else: + raise ValueError("quant_type should be one of ['A8W8', 'WINT4', 'WINT8']") + + q_config.add_qat_layer_mapping(ColumnParallelLinear, QuantizedColumnParallelLinear) + q_config.add_qat_layer_mapping(RowParallelLinear, QuantizedRowParallelLinear) + q_config.add_type_config( + [paddle.nn.Linear, ColumnParallelLinear, RowParallelLinear, QuantedLoRALinear], + activation=activation, + weight=weight, + ) + + ptq = PTQ(q_config) + trainer.model = ptq.quantize(trainer.model, inplace=True) + trainer.ptq_loop( + ptq_dataloader, + description="PTQ", + max_eval_iters=quant_args.ptq_step, + ) + weight_scales = {} + act_scales = {} + for cur_name, cur_layer in trainer.model.named_sublayers(): + if isinstance(cur_layer, AbsMaxChannelWiseWeightObserverLayer): + if "_observer" not in cur_name: + weight_scales[cur_name] = cur_layer.scales().numpy().tolist() + if isinstance(cur_layer, AVGObserverLayer): + if "_observer" not in cur_name: + act_scales[cur_name] = cur_layer.scales().numpy().tolist() + + weight_scales_path = os.path.join(trainer.args.output_dir, "weight_scales.json") + with open(weight_scales_path, "w") as f: + json.dump(weight_scales, f) + logger.info(f"Weight scales saved in {weight_scales_path}.") + + act_scales_path = os.path.join(trainer.args.output_dir, "act_scales.json") + with open(act_scales_path, "w") as f: + json.dump(act_scales, f) + logger.info(f"Activation scales saved in {act_scales_path}.") + + trainer.model = ptq.convert(trainer.model, inplace=True) + logger.info("***** PTQ done *****") + + +def apply_gptq(quant_args, trainer, ptq_dataloader): + logger.info("***** Running GPTQ *****") + num_layer = 0 + model = trainer.model + for cur_name, cur_layer in model.named_sublayers(): + if type(cur_layer) in [paddle.nn.Linear, ColumnParallelLinear, RowParallelLinear]: + num_layer += 1 + logger.info(f"GPTQ layer: {num_layer}, {cur_name}") + parent_layer, sub_name = find_parent_layer_and_sub_name(model, cur_name) + cur_quant_layer = GPTQ(cur_layer) + setattr(parent_layer, sub_name, cur_quant_layer) + trainer.ptq_loop( + ptq_dataloader, + description="GPTQ", + max_eval_iters=quant_args.gptq_step, + ) + cur_quant_layer.fasterquant(percdamp=0.1, groupsize=-1, actorder=True) + del cur_quant_layer + setattr(parent_layer, sub_name, cur_layer) + logger.info("***** GPTQ done *****") + + +def get_ptq_model_config(model): + if isinstance(model, PrefixModelForCausalLM): + base_model_prefix = model.model.base_model_prefix + else: + base_model_prefix = model.base_model_prefix + 
+ if base_model_prefix in ["chatglm"]: + raise NotImplementedError(f"{model} does not support Shift or Smooth.") + elif base_model_prefix == "chatglm_v2": + model_config = {"fused_qkv": False, "parallel_ffn": False, "skip_norm_list": ["rms_norm_56"]} + elif base_model_prefix == "bloom": + model_config = {"fused_qkv": True, "parallel_ffn": False} + elif base_model_prefix == "llama": + model_config = {"fused_qkv": False, "parallel_ffn": True} + else: + raise ValueError( + f"Unknown base_model_prefix: {model.base_model_prefix}. Supported base_model_prefix list: chatglm_V2, bloom, llama." + ) + return model_config diff --git a/llm/qwen/README.md b/llm/qwen/README.md new file mode 100644 index 0000000000000000000000000000000000000000..08d29823f9650f7b1e787144d4beb945e06eb310 --- /dev/null +++ b/llm/qwen/README.md @@ -0,0 +1,14 @@ +# Qwen + +## 1.模型介绍 + +[通义千问-7B(Qwen-7B)](https://arxiv.org/abs/2205.01068) 是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样,覆盖广泛,包括大量网络文本、专业书籍、代码等。 + +**支持模型权重:** +| Model | +|-------------------| +| qwen/qwen-7b | +| qwen/qwen-7b-chat | + +## 2. 模型精调 +请参考[LLM全流程工具介绍](../README.md) diff --git a/llm/qwen/lora_argument.json b/llm/qwen/lora_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..56d5b66c75755848345d0803333f80b86fc92f4c --- /dev/null +++ b/llm/qwen/lora_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "qwen/qwen-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/qwen_lora_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-04, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "bf16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "lora": true + } diff --git a/llm/qwen/pt_argument.json b/llm/qwen/pt_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..763c951176692fc8c9fd15feb833ccfb68caa8e3 --- /dev/null +++ b/llm/qwen/pt_argument.json @@ -0,0 +1,30 @@ +{ + "model_name_or_path": "qwen/qwen-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/qwen_pt_ckpts", + "per_device_train_batch_size": 4, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-02, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "bf16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 1, + "pipeline_parallel_degree": 1, + "prefix_tuning": true + } diff --git a/llm/qwen/sft_argument.json b/llm/qwen/sft_argument.json new file mode 100644 index 0000000000000000000000000000000000000000..d06801aef2be3b0d42c44332f70cfe960d950776 --- /dev/null +++ b/llm/qwen/sft_argument.json @@ -0,0 +1,29 @@ +{ + "model_name_or_path": 
"qwen/qwen-7b", + "dataset_name_or_path": "./data", + "output_dir": "./checkpoints/qwen_sft_ckpts", + "per_device_train_batch_size": 1, + "gradient_accumulation_steps": 4, + "per_device_eval_batch_size": 8, + "eval_accumulation_steps":16, + "num_train_epochs": 3, + "learning_rate": 3e-05, + "warmup_steps": 30, + "logging_steps": 1, + "evaluation_strategy": "epoch", + "save_strategy": "epoch", + "src_length": 1024, + "max_length": 2048, + "bf16": true, + "fp16_opt_level": "O2", + "do_train": true, + "do_eval": true, + "disable_tqdm": true, + "load_best_model_at_end": true, + "eval_with_do_generation": false, + "metric_for_best_model": "accuracy", + "recompute": true, + "save_total_limit": 1, + "tensor_parallel_degree": 4, + "pipeline_parallel_degree": 1 + } diff --git a/llm/utils.py b/llm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..5d439cec114512d40dfd28cd49e66719b10d5e3b --- /dev/null +++ b/llm/utils.py @@ -0,0 +1,561 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import annotations + +import glob +import math +import os +import struct +from typing import Dict, Optional + +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.distributed import fleet +from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler +from sklearn.metrics import accuracy_score + +from paddlenlp.datasets import InTokensIterableDataset +from paddlenlp.trainer import Trainer, TrainerCallback +from paddlenlp.trainer.trainer_utils import IterableDatasetShard, has_length +from paddlenlp.utils.log import logger + + +def compute_metrics(eval_preds): + + flattened_preds = np.array(eval_preds.predictions).flatten() + flattened_labels = np.array(eval_preds.label_ids).flatten() + filtered_preds = flattened_preds[flattened_labels != -100] + filtered_labels = flattened_labels[flattened_labels != -100] + accuracy = accuracy_score(y_true=filtered_labels, y_pred=filtered_preds) + return { + "accuracy": accuracy, + } + + +def get_prefix_tuning_params(model): + if model.base_model_prefix == "chatglm": + from paddlenlp.peft.prefix import chatglm_postprocess_past_key_value + + num_attention_heads = model.config.num_attention_heads + num_hidden_layers = model.config.num_hidden_layers + hidden_size = model.config.hidden_size + postprocess_past_key_value = chatglm_postprocess_past_key_value + multi_query_group_num = None + elif model.base_model_prefix == "chatglm_v2": + from paddlenlp.peft.prefix import chatglm_postprocess_past_key_value + + num_attention_heads = model.config.num_attention_heads + num_hidden_layers = model.config.num_layers + hidden_size = model.config.hidden_size + postprocess_past_key_value = chatglm_postprocess_past_key_value + multi_query_group_num = model.config.multi_query_group_num + elif model.base_model_prefix == "bloom": + from paddlenlp.peft.prefix import bloom_postprocess_past_key_value + + num_attention_heads = model.config.num_attention_heads + 
num_hidden_layers = model.config.n_layer + hidden_size = model.config.n_embed + postprocess_past_key_value = bloom_postprocess_past_key_value + multi_query_group_num = None + elif model.base_model_prefix == "llama": + from paddlenlp.peft.prefix import llama_postprocess_past_key_value + + num_attention_heads = model.config.n_head + num_hidden_layers = model.config.n_layer + hidden_size = model.config.hidden_size + postprocess_past_key_value = llama_postprocess_past_key_value + multi_query_group_num = None + elif model.base_model_prefix == "qwen": + from paddlenlp.peft.prefix import qwen_postprocess_past_key_value + + num_attention_heads = model.config.num_attention_heads + num_hidden_layers = model.config.num_hidden_layers + hidden_size = model.config.hidden_size + postprocess_past_key_value = qwen_postprocess_past_key_value + multi_query_group_num = None + else: + raise ValueError(f"Unknown base_model_prefix: {model.base_model_prefix}. ") + return dict( + num_attention_heads=num_attention_heads, + num_hidden_layers=num_hidden_layers, + hidden_size=hidden_size, + postprocess_past_key_value=postprocess_past_key_value, + multi_query_group_num=multi_query_group_num, + ) + + +def get_lora_target_modules(model): + # Not yet support RowParallelLinear + if model.base_model_prefix == "chatglm": + target_modules = [".*query_key_value.*", ".*dense.*", ".*dense_h_to_4h.*", ".*dense_4h_to_h.*"] + elif model.base_model_prefix == "chatglm_v2": + target_modules = [ + ".*query.*", + ".*key.*", + ".*value.*", + ".*dense.*", + ".*dense_h_to_4h.*", + ".*dense_4h_to_h.*", + ] + elif model.base_model_prefix == "bloom": + target_modules = [".*query_key_value.*", ".*dense.*", ".*dense_h_to_4h.*", ".*dense_4h_to_h.*"] + elif model.base_model_prefix == "llama": + target_modules = [ + ".*q_proj.*", + ".*v_proj.*", + ".*k_proj.*", + ".*o_proj.*", + ".*gate_proj.*", + ".*down_proj.*", + ".*up_proj.*", + ] + elif model.base_model_prefix == "opt": + target_modules = [ + ".*project_in.*", + ".*project_out.*", + ".*q_proj.*", + ".*k_proj.*", + ".*v_proj.*", + ".*qkv_proj.*", + ".*out_proj.*", + ".*linear1.*", + ".*linear2.*", + ] + elif model.base_model_prefix == "qwen": + target_modules = [ + ".*attn.c_attn.*", + ".*attn.c_proj.*", + ".*mlp.w1.*", + ".*mlp.w2.*", + ".*mlp.c_proj.*", + ] + else: + raise ValueError(f"Unknown base_model_prefix: {model.base_model_prefix}.") + return target_modules + + +class InTokensIterDatasetCallback(TrainerCallback): + """ + A [`TrainerCallback`] that handles early stopping. 
+ + """ + + def on_step_end(self, args, state, control, **kwargs): + train_dataloader = kwargs["train_dataloader"] + if isinstance(train_dataloader.dataset, InTokensIterableDataset): + dataset = train_dataloader.dataset + elif isinstance(train_dataloader.dataset, IterableDatasetShard) and isinstance( + train_dataloader.dataset.dataset, InTokensIterableDataset + ): + dataset = train_dataloader.dataset.dataset + else: + raise ValueError( + "Unexpected dataset format: InTokensIterDatasetCallback expectes `paddlenlp.datasets.InTokensIterableDataset`" + ) + if state.trial_params is None: + state.trial_params = {} + state.trial_params["intokens_global_step"] = dataset.intokens_global_step + + +class CausalLMTrainer(Trainer): + def __init__(self, do_generation: bool, gen_args, data_args, **kwargs): + super().__init__(**kwargs) + self.do_generation = do_generation + self.gen_args = gen_args + self.data_args = data_args + + def prediction_step( + self, + model, + inputs, + prediction_loss_only: bool, + ignore_keys=None, + ): + if prediction_loss_only: + return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + elif not self.do_generation: + loss, logits, labels = super().prediction_step(model, inputs, prediction_loss_only, ignore_keys) + # argmax here to avoid gather all logits, which is too memory-consuming. + # keepdim in order to maintain the same shape as logits + if isinstance(logits, (list, tuple)): + logits = logits[0] + return (loss, logits.argmax(axis=-1, keepdim=True), labels) + + loss = None + + model.eval() + with paddle.no_grad(): + generated_tokens = model.generate( + input_ids=inputs["input_ids"], + attention_mask=inputs["attention_mask"] if "attention_mask" in inputs else None, + position_ids=inputs["position_ids"] if "position_ids" in inputs else None, + max_length=max(self.data_args.max_length - inputs["input_ids"].shape[-1], 1), + decode_strategy="sampling", + top_k=self.gen_args.top_k, + top_p=self.gen_args.top_p, + bos_token_id=self.tokenizer.bos_token_id, + eos_token_id=self.tokenizer.eos_token_id, + pad_token_id=self.tokenizer.pad_token_id, + use_cache=True, + )[0] + all_preds = [] + for pred_tokens in generated_tokens: + pred_tokens = pred_tokens[pred_tokens != self.tokenizer.pad_token_id].tolist() + all_preds.append(pred_tokens) + max_pred_length = max([len(x) for x in all_preds]) + for index, preds in enumerate(all_preds): + all_preds[index] = preds + [-100] * (max_pred_length - len(preds)) + all_preds = paddle.to_tensor(all_preds) + + if "labels" in inputs: + all_labels = paddle.to_tensor(inputs["labels"]) + else: + all_labels = None + + return (loss, all_preds, all_labels) + + def log(self, logs: Dict[str, float], **kwargs) -> None: + if "loss" in logs: + logs["ppl"] = np.exp(logs["loss"]) + if "eval_loss" in logs: + logs["eval_ppl"] = np.exp(logs["eval_loss"]) + + super(CausalLMTrainer, self).log(logs, **kwargs) + + def get_ptq_dataloader(self, ptq_ds): + if self.args.world_size <= 1: + ptq_sampler = BatchSampler( + dataset=ptq_ds, + shuffle=True, + batch_size=self.args.per_device_train_batch_size, + drop_last=self.args.dataloader_drop_last, + ) + else: + ptq_sampler = DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=True, + num_replicas=self.args.dataset_world_size, + rank=self.args.dataset_rank, + drop_last=self.args.dataloader_drop_last, + ) + ptq_dataloader = DataLoader( + ptq_ds, + batch_sampler=ptq_sampler, + collate_fn=self.data_collator, + 
num_workers=self.args.dataloader_num_workers, + ) + return ptq_dataloader + + def ptq_loop( + self, + dataloader: DataLoader, + description: str, + max_eval_iters: Optional[int] = -1, + ): + if isinstance(dataloader, paddle.io.DataLoader): + batch_size = dataloader.batch_sampler.batch_size + else: + raise ValueError("Only support for paddle.io.DataLoader") + + if has_length(dataloader): + logger.info(f" Num examples = {self.num_examples(dataloader)}") + if max_eval_iters > 0: + logger.info(f" Total {description} steps = {max_eval_iters}") + else: + logger.info(f" Total {description} steps = {len(dataloader)}") + else: + logger.info(" Num examples: Unknown") + if max_eval_iters > 0: + logger.info(f" Total {description} steps = {max_eval_iters}") + + logger.info(f" Pre device batch size = {batch_size}") + logger.info(f" Total Batch size = {batch_size * self.args.dataset_world_size}") + self.model.eval() + with paddle.no_grad(): + for step, inputs in enumerate(dataloader): + self.prediction_step(model=self.model, inputs=inputs, prediction_loss_only=True, ignore_keys=None) + if max_eval_iters > 0 and step >= max_eval_iters - 1: + break + + +def get_infer_model_path(input_dir, model_prefix): + if dist.get_world_size() > 1: + local_rank = dist.ParallelEnv().dev_id + return os.path.join(input_dir, "rank_{}".format(local_rank), model_prefix) + else: + return os.path.join(input_dir, model_prefix) + + +def generate_rank_mapping(output_filename): + ring_id = -1 + try: + hcg = fleet.get_hybrid_communicate_group() + model_parallel_group = hcg.get_model_parallel_group() + ring_id = model_parallel_group.id + except Exception: + pass + + if ring_id == -1: + return + + world_size = dist.get_world_size() + with open(output_filename, "w") as f: + f.write("[ring_id -> ranks]\n") + f.write(",".join(map(str, [0] + list(range(world_size)))) + "\n") + f.write(",".join(map(str, [ring_id] + list(range(world_size)))) + "\n") + + f.write("[rank -> ring_ids]\n") + for i in range(world_size): + f.write("{},0,{}\n".format(i, ring_id)) + + +def deserialize_from_file(fp): + x_type = fp.read(1) + x_type_out = struct.unpack("c", x_type)[0] + # data + data_list = [] + if x_type_out == b"0": + data = fp.read(4) + data_out = struct.unpack("f", data)[0] + while data: + data_out = struct.unpack("f", data)[0] + data_list.append(data_out) + data = fp.read(4) + elif x_type_out == b"1": + data = fp.read(8) + while data: + data_out = struct.unpack("l", data)[0] + data_list.append(data_out) + data = fp.read(8) + elif x_type_out == b"2": + data = fp.read(4) + while data: + data_out = struct.unpack("i", data)[0] + data_list.append(data_out) + data = fp.read(4) + else: + print("type error") + data_arr = np.array(data_list) + return data_arr + + +def get_alibi_slopes(num_heads): + closest_power_of_2 = 2 ** math.floor(math.log2(num_heads)) + base = 2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3))) + powers = np.arange(1, 1 + closest_power_of_2) + slopes = np.power(base, powers) + + if closest_power_of_2 != num_heads: + extra_base = 2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3))) + num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2) + extra_powers = np.arange(1, 1 + 2 * num_remaining_heads, 2) + slopes = np.concatante([slopes, np.power(extra_base, extra_powers)], axis=0) + + return slopes.astype("float32") + + +def pad_batch_data(insts, pad_id=0, return_seq_len=False, pad_style="right"): + """Pad sequences to the max sequence length in batch.""" + max_len = max(map(len, insts)) + if pad_style == "left": + 
inst_data = np.array([[pad_id] * (max_len - len(inst)) + list(inst) for inst in insts]) + else: + inst_data = np.array([list(inst) + [pad_id] * (max_len - len(inst)) for inst in insts]) + + if return_seq_len: + seq_len = np.array([len(inst) for inst in insts]) + return inst_data.astype("int64").reshape([-1, max_len]), seq_len + else: + return inst_data.astype("int64").reshape([-1, max_len]) + + +def dybatch_preprocess( + tokenizer, + texts: list[str], + src_length: int, + max_length: int, + architectures: str, + top_p: float, + temperature: float, + pre_caches_length: int = 0, + benchmark: bool = False, +): + """Pre-process generation inputs.""" + if "chatglm" in architectures: + input_ids = [] + position_ids = [] + + for text in texts: + tokens = tokenizer(text, return_tensors="np", padding=True, max_length=src_length) + input_ids.append(tokens["input_ids"][0]) + position_ids.append(tokens["position_ids"][0]) + + inputs = {} + pad_token_id = tokenizer([tokenizer.pad_token], return_tensors="np")["input_ids"][0][0] + inputs["input_ids"], seq_len = pad_batch_data(input_ids, pad_id=pad_token_id, return_seq_len=True) + bs = inputs["input_ids"].shape[0] + max_len = max(map(len, input_ids)) + + inst_data_pos = [] + for i in range(len(position_ids)): + inst_data_pos.append(np.array([list(inst) + [0] * (max_len - len(inst)) for inst in position_ids[i]])) + inputs["position_ids"] = paddle.to_tensor(np.array(inst_data_pos)) + else: + input_ids = [] + position_ids = [] + if isinstance(texts, str): + texts = [texts] + + for text in texts: + tokens = tokenizer( + text, + return_tensors="np", + padding=False, + max_length=src_length, + return_attention_mask=False, + return_token_type_ids=False, + ) + input_ids.append(tokens["input_ids"][0]) + + inputs = {} + pad_token_id = tokenizer([tokenizer.pad_token], return_tensors="np")["input_ids"][0][-1] + inputs["input_ids"], seq_len = pad_batch_data(input_ids, pad_id=pad_token_id, return_seq_len=True) + bs = inputs["input_ids"].shape[0] + max_len = max(map(len, input_ids)) + + position_ids = paddle.zeros(shape=[bs, max_length + src_length], dtype="int64") + + for i in range(bs): + position_ids[i, pre_caches_length : pre_caches_length + seq_len[i]] = paddle.arange(seq_len[i]) + inputs["position_ids"] = position_ids + + tgt_ids = [input[-1:] for input in input_ids] + tgt_pos = [] + for i, valid_len in enumerate(map(len, input_ids)): + tgt_pos.append(valid_len - 1) + + step_idx = [ + 0, + ] * bs + tgt_pos = np.array(tgt_pos).astype("int64") + inputs["eos_token_id"] = ( + np.array( + [ + tokenizer.eos_token_id, + ] + * bs + ) + .reshape(-1, 1) + .astype("int64") + ) + inputs["top_p"] = ( + np.array( + [ + top_p, + ] + * bs + ) + .reshape(-1, 1) + .astype("float32") + ) + inputs["temperature"] = ( + np.array( + [ + temperature, + ] + * bs + ) + .reshape(-1, 1) + .astype("float32") + ) + inputs["seq_len_encoder"] = seq_len.astype("int32").reshape(-1, 1) + inputs["seq_len_decoder"] = (seq_len + pre_caches_length).astype("int32").reshape(-1, 1) + inputs["step_idx"] = np.array(step_idx).astype("int64").reshape(-1, 1) + inputs["tgt_ids"] = np.array(tgt_ids).astype("int64").reshape(-1, 1) + inputs["tgt_pos"] = tgt_pos.reshape(-1, 1) + inputs["max_length"] = np.array(max_length - pre_caches_length).astype("int64").reshape((-1, 1)) + inputs["min_length"] = ( + np.array( + [ + 1 + if not benchmark + else max_length + - pre_caches_length, # Note(Zhengzekang): When in benchmark mode, we need to set a fixed decode length. 
+ ] + * bs + ) + .astype("int64") + .reshape((-1, 1)) + ) + inputs["penalty_score"] = ( + np.array( + [ + 1.0, + ] + * bs + ) + .astype("float32") + .reshape((-1, 1)) + ) + inputs["frequency_score"] = ( + np.array( + [ + 0.0, + ] + * bs + ) + .astype("float32") + .reshape((-1, 1)) + ) + inputs["presence_score"] = ( + np.array( + [ + 0.0, + ] + * bs + ) + .astype("float32") + .reshape((-1, 1)) + ) + inputs["stop_flags"] = ( + np.array( + [ + 0, + ] + * bs + ) + .astype("bool") + .reshape((-1, 1)) + ) + inputs["stop_nums"] = np.array([bs]).astype("int64") + return inputs + + +def load_real_time_tokens(): + tokens = [] + files = glob.glob(os.path.join("./real_time_save.*")) + for j in range(1, len(files) + 1): + filename = "./real_time_save.temp_ids_rank_0_step_{}".format(j) + if not os.path.exists(filename): + break + fp = open(filename, "rb+") + fp.read(1) + data_list = deserialize_from_file(fp) + fp.close() + tokens.append(np.array(data_list).reshape(-1, 1)) + os.system("rm -f ./real_time_save.temp_ids_rank_*") + tokens = np.concatenate(tokens, axis=1) + return tokens diff --git a/model.properties b/model.properties new file mode 100644 index 0000000000000000000000000000000000000000..642940d8a4661878287656530612d098b34a541f --- /dev/null +++ b/model.properties @@ -0,0 +1,10 @@ +# 模型唯一标识 +modelCode=488 +# 模型名称 +modelName=llama_paddle +# 模型描述 +modelDescription=基于Paddle框架的llama-13b +# 应用场景(多个标签以英文逗号分割) +appScenario=训练,推理,医疗,教育,科研,金融 +# 框架类型(多个标签以英文逗号分割) +frameType=Paddle,PaddleNLP diff --git a/model_zoo/README.md b/model_zoo/README.md new file mode 100644 index 0000000000000000000000000000000000000000..31bc46d811cad13bc17e7b73e19734849257d5fb --- /dev/null +++ b/model_zoo/README.md @@ -0,0 +1,3 @@ +# PaddleNLP Selected Model Zoo + +本目录是飞桨PaddleNLP精选模型库,提供了高质量的预训练模型和端到端的全流程部署工具链。 diff --git a/model_zoo/bert/README.md b/model_zoo/bert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6c172be0ee0ffda8ba225f99e4511e597232865d --- /dev/null +++ b/model_zoo/bert/README.md @@ -0,0 +1,293 @@ +# BERT + +## 模型简介 + +[BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers)以[Transformer](https://arxiv.org/abs/1706.03762) 编码器为网络基本组件,使用掩码语言模型(Masked Language Model)和邻接句子预测(Next Sentence Prediction)两个任务在大规模无标注文本语料上进行预训练(pre-train),得到融合了双向内容的通用语义表示模型。以预训练产生的通用语义表示模型为基础,结合任务适配的简单输出层,微调(fine-tune)后即可应用到下游的NLP任务,效果通常也较直接在下游的任务上训练的模型更优。此前BERT即在[GLUE评测任务](https://gluebenchmark.com/tasks)上取得了SOTA的结果。 + +本项目是BERT在 Paddle 2.0上的开源实现,包含了预训练和[GLUE评测任务](https://gluebenchmark.com/tasks)上的微调代码。 + +## 快速开始 + +### 环境依赖 + +本教程除了需要安装PaddleNLP库,还需以下依赖 + +```text +h5py +``` + +### 数据准备 + +#### Pre-training数据准备 + +`create_pretraining_data.py` 是创建预训练程序所需数据的脚本。其以文本文件(使用换行符换行和空白符分隔,data目录下提供了部分示例数据)为输入,经由BERT tokenizer进行tokenize后再做生成sentence pair正负样本、掩码token等处理,最后输出hdf5格式的数据文件。使用方式如下: + +```shell +python create_pretraining_data.py \ + --input_file=data/sample_text.txt \ + --output_file=data/training_data.hdf5 \ + --bert_model=bert-base-uncased \ + --max_seq_length=128 \ + --max_predictions_per_seq=20 \ + --masked_lm_prob=0.15 \ + --random_seed=12345 \ + --dupe_factor=5 +``` + +其中参数释义如下: +- `input_file` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有`.txt`文件。 +- `output_file` 指定输出文件。 +- `bert_model` 指定使用特定BERT模型对应的tokenizer进行tokenize处理。 +- `max_seq_length` 指定最大句子长度,超过该长度将被截断,不足该长度的将会进行padding。 +- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目。 +- `masked_lm_prob` 表示每个token被mask的概率。 +- `random_seed` 指定随机种子。 +- `dupe_factor` 指定输入数据被重复处理的次数,每次处理将重新产生随机mask。 + 
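+例如,若已经把多份按换行分隔的语料整理为同一目录下的若干 `.txt` 文件(下述 `data/corpus` 目录仅为示意),可以直接把该目录作为 `--input_file` 传入,一次性处理目录中的全部文件:
+
+```shell
+python create_pretraining_data.py \
+    --input_file=data/corpus \
+    --output_file=data/corpus_training_data.hdf5 \
+    --bert_model=bert-base-uncased \
+    --max_seq_length=128 \
+    --max_predictions_per_seq=20 \
+    --masked_lm_prob=0.15 \
+    --random_seed=12345 \
+    --dupe_factor=5
+```
+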
+使用以上预训练数据生成程序可以用于处理领域垂类数据后进行二次预训练。若需要使用BERT论文中预训练使用的英文Wiki和BookCorpus数据,可以参考[这里](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT)进行处理,得到的数据可以直接接入本项目中的预训练程序使用。
+
+#### Fine-tuning数据准备
+
+##### GLUE评测任务数据
+
+GLUE评测任务所含数据集已在paddlenlp中以API形式提供,无需预先准备,使用`run_glue.py`执行微调时将会自动下载。
+
+### 执行Pre-training
+
+#### GPU训练
+```shell
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0" run_pretrain.py \
+    --model_type bert \
+    --model_name_or_path bert-base-uncased \
+    --max_predictions_per_seq 20 \
+    --batch_size 32   \
+    --learning_rate 1e-4 \
+    --weight_decay 1e-2 \
+    --adam_epsilon 1e-6 \
+    --warmup_steps 10000 \
+    --num_train_epochs 3 \
+    --input_dir data/ \
+    --output_dir pretrained_models/ \
+    --logging_steps 1 \
+    --save_steps 20000 \
+    --max_steps 1000000 \
+    --device gpu \
+    --use_amp False
+```
+ +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目,与创建预训练数据时的设置一致。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `adam_epsilon` 表示AdamW优化器中使用的epsilon值。 +- `warmup_steps` 表示动态学习率热启的step数。 +- `num_train_epochs` 表示训练轮数。 +- `input_dir` 表示输入数据的目录,该目录下所有文件名中包含training的文件将被作为训练数据。 +- `output_dir` 表示模型的保存目录。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `max_steps` 表示最大训练步数,达到`max_steps`后就提前结束。注意,我们必须设置 `max_steps`。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `use_amp` 指示是否启用自动混合精度训练。 +
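+上述命令默认仅使用 0 号 GPU。由于 `batch_size` 表示每张卡上的样本数,如需多卡加速预训练,只需调整 `--gpus` 参数即可。下面给出一个使用 0~3 号共 4 张卡的示例(卡号仅为示意,请按实际环境修改),其余参数与上面的单卡命令保持一致:
+
+```shell
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0,1,2,3" run_pretrain.py \
+    --model_type bert \
+    --model_name_or_path bert-base-uncased \
+    --max_predictions_per_seq 20 \
+    --batch_size 32 \
+    --learning_rate 1e-4 \
+    --weight_decay 1e-2 \
+    --adam_epsilon 1e-6 \
+    --warmup_steps 10000 \
+    --num_train_epochs 3 \
+    --input_dir data/ \
+    --output_dir pretrained_models/ \
+    --logging_steps 1 \
+    --save_steps 20000 \
+    --max_steps 1000000 \
+    --device gpu \
+    --use_amp False
+```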
+ +#### GPU训练(Trainer版本) +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_pretrain_trainer.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_predictions_per_seq 20 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-4 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 3 \ + --input_dir data/ \ + --output_dir pretrained_models/ \ + --logging_steps 1 \ + --save_steps 20000 \ + --max_steps 1000000 \ + --device gpu \ + --fp16 False \ + --do_train +``` + +
+#### XPU训练
+```shell
+unset FLAGS_selected_xpus
+python -m paddle.distributed.launch --xpus "0" run_pretrain.py \
+    --model_type bert \
+    --model_name_or_path bert-base-uncased \
+    --max_predictions_per_seq 20 \
+    --batch_size 32   \
+    --learning_rate 1e-4 \
+    --weight_decay 1e-2 \
+    --adam_epsilon 1e-6 \
+    --warmup_steps 10000 \
+    --num_train_epochs 3 \
+    --input_dir data/ \
+    --output_dir pretrained_models/ \
+    --logging_steps 1 \
+    --save_steps 20000 \
+    --max_steps 1000000 \
+    --device xpu \
+    --use_amp False
+```
+ +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目,与创建预训练数据时的设置一致。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `adam_epsilon` 表示AdamW优化器中使用的epsilon值。 +- `warmup_steps` 表示动态学习率热启的step数。 +- `num_train_epochs` 表示训练轮数。 +- `input_dir` 表示输入数据的目录,该目录下所有文件名中包含training的文件将被作为训练数据。 +- `output_dir` 表示模型的保存目录。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `max_steps` 表示最大训练步数,达到`max_steps`后就提前结束。注意,我们必须设置 `max_steps`。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `use_amp` 指示是否启用自动混合精度训练。 +
+ +#### XPU训练(Trainer版本) +```shell +unset FLAGS_selected_xpus +python -m paddle.distributed.launch --xpus "0" run_pretrain_trainer.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_predictions_per_seq 20 \ + --per_device_train_batch_size 32 \ + --learning_rate 1e-4 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 3 \ + --input_dir data/ \ + --output_dir pretrained_models/ \ + --logging_steps 1 \ + --save_steps 20000 \ + --max_steps 1000000 \ + --device xpu \ + --fp16 False \ + --do_train +``` +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目,与创建预训练数据时的设置一致。 +- `per_device_train_batch_size` 表示用于训练的每个 GPU 核心/CPU 的batch大小.(`int`,可选,默认为 8) +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。 +- `adam_epsilon` 表示AdamW优化器中使用的epsilon值。 +- `warmup_steps` 表示动态学习率热启的step数。 +- `num_train_epochs` 表示训练轮数。 +- `input_dir` 表示输入数据的目录,该目录下所有文件名中包含training的文件将被作为训练数据。 +- `output_dir` 表示模型的保存目录。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `max_steps` 表示最大训练步数,达到`max_steps`后就提前结束。注意,我们必须设置 `max_steps`。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `fp16` 是否使用 fp16 混合精度训练而不是 fp32 训练。(`bool`, 可选, 默认为 `False`) +- `do_train` 是否进行训练任务。(`bool`, 可选, 默认为 `False`) + +**NOTICE**: 预训练时data目录存放的是经过 `create_pretraining_data.py` 处理后的数据,因此需要通过该数据处理脚本预先处理,否则预训练将会出现报错。 + +### 执行Fine-tunning + +以GLUE中的SST-2任务为例,启动Fine-tuning的方式如下: + +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_glue_trainer.py \ + --model_name_or_path bert-base-uncased \ + --task_name SST2 \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 500 \ + --output_dir ./tmp/ \ + --device gpu \ + --fp16 False\ + --do_train \ + --do_eval +``` + +其中参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。注:`bert-base-uncased`等对应使用的预训练模型转自[huggingface/transformers](https://github.com/huggingface/transformers),具体可参考当前目录下converter中的内容。 +- `task_name` 表示Fine-tuning的任务。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `per_device_train_batch_size` 表示用于训练的每个 GPU 核心/CPU 的batch大小.(`int`,可选,默认为 8) +- `per_device_eval_batch_size` 表示用于评估的每个 GPU 核心/CPU 的batch大小.(`int`,可选,默认为 8) +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU, 'npu'表示使用华为昇腾卡。 +- `fp16` 是否使用 fp16 混合精度训练而不是 fp32 训练。(`bool`, 可选, 默认为 `False`) +- `do_train` 是否进行训练任务。(`bool`, 可选, 默认为 `False`) +- `do_eval` 是否进行评估任务。同上。(`bool`, 可选, 默认为 `False`) + +基于`bert-base-uncased`在GLUE各评测任务上Fine-tuning后,在验证集上有如下结果: + +| Task | Metric | Result | +|:-----:|:----------------------------:|:-----------------:| +| SST2 | Accuracy | 0.92660 | +| QNLI | Accuracy | 0.91707 | +| CoLA | Mattehew's corr | 0.59557 | +| MRPC | F1/Accuracy | 0.91667/0.88235 | +| STSB | Person/Spearman corr | 0.88847/0.88350 | +| QQP | Accuracy/F1 | 0.90581/0.87347 | +| MNLI | Matched acc/MisMatched acc | 0.84422/0.84825 | +| RTE | Accuracy | 0.711191 | + + +### 预测 + +在Fine-tuning完成后,我们可以使用如下方式导出希望用来预测的模型: + +```shell +python -u 
./export_model.py \ + --model_type bert \ + --model_path bert-base-uncased \ + --output_path ./infer_model/model +``` + +其中参数释义如下: +- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。 +- `model_path` 表示训练模型的保存路径,与训练时的`output_dir`一致。 +- `output_path` 表示导出预测模型文件的前缀。保存时会添加后缀(`pdiparams`,`pdiparams.info`,`pdmodel`);除此之外,还会在`output_path`包含的目录下保存tokenizer相关内容。 + +完成模型导出后,可以开始部署。`deploy/python/seq_cls_infer.py` 文件提供了python部署预测示例。可执行以下命令运行部署示例: + +```shell +python deploy/python/seq_cls_infer.py --model_dir infer_model/ --device gpu --backend paddle +``` + +运行后预测结果打印如下: + +```bash +[2023-03-02 08:30:03,877] [ INFO] - We are using to load '../../infer_model/'. +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. +Batch id: 0, example id: 0, sentence1: against shimmering cinematography that lends the setting the ethereal beauty of an asian landscape painting, label: positive, negative prob: 0.0003, positive prob: 0.9997. +Batch id: 1, example id: 0, sentence1: the situation in a well-balanced fashion, label: positive, negative prob: 0.0002, positive prob: 0.9998. +Batch id: 2, example id: 0, sentence1: at achieving the modest , crowd-pleasing goals it sets for itself, label: positive, negative prob: 0.0017, positive prob: 0.9983. +Batch id: 3, example id: 0, sentence1: so pat it makes your teeth hurt, label: negative, negative prob: 0.9986, positive prob: 0.0014. +Batch id: 4, example id: 0, sentence1: this new jangle of noise , mayhem and stupidity must be a serious contender for the title ., label: negative, negative prob: 0.9806, positive prob: 0.0194. +``` + +更多详细用法可参考 [Python 部署](deploy/python/README.md)。 + +## 扩展 + +上述的介绍是基于动态图的BERT的预训练任务和微调任务以及预测任务的实践过程,同时在我们也提供了基于PaddlePaddle Fleet API的静态图的BERT相关实践,在组网代码层面保持动静统一,在计算速度以及多机联合训练方面有着更优的性能,具体的细节可以参考 [BERT静态图](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/bert/static) 。 diff --git a/model_zoo/bert/create_pretraining_data.py b/model_zoo/bert/create_pretraining_data.py new file mode 100644 index 0000000000000000000000000000000000000000..d88859585adba2a24223bc20d8d73b1298381a9b --- /dev/null +++ b/model_zoo/bert/create_pretraining_data.py @@ -0,0 +1,473 @@ +# coding=utf-8 +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved. +# Copyright 2018 The Google AI Language Team Authors and The HugginFace Inc. team. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Create masked LM/next sentence masked_lm examples for BERT.""" +import argparse +import collections +import os +import random +from io import open + +import h5py +import numpy as np +from tqdm import tqdm + +from paddlenlp.transformers import BertTokenizer +from paddlenlp.transformers.tokenizer_utils import convert_to_unicode + + +class TrainingInstance(object): + """A single training instance (sentence pair).""" + + def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next): + self.tokens = tokens + self.segment_ids = segment_ids + self.is_random_next = is_random_next + self.masked_lm_positions = masked_lm_positions + self.masked_lm_labels = masked_lm_labels + + +def write_instance_to_example_file(instances, tokenizer, max_seq_length, max_predictions_per_seq, output_file): + """Create example files from `TrainingInstance`s.""" + + total_written = 0 + features = collections.OrderedDict() + + num_instances = len(instances) + features["input_ids"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["input_mask"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["segment_ids"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["masked_lm_positions"] = np.zeros([num_instances, max_predictions_per_seq], dtype="int32") + features["masked_lm_ids"] = np.zeros([num_instances, max_predictions_per_seq], dtype="int32") + features["next_sentence_labels"] = np.zeros(num_instances, dtype="int32") + + for inst_index, instance in enumerate(tqdm(instances)): + input_ids = tokenizer.convert_tokens_to_ids(instance.tokens) + input_mask = [1] * len(input_ids) + segment_ids = list(instance.segment_ids) + assert len(input_ids) <= max_seq_length + + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + masked_lm_positions = list(instance.masked_lm_positions) + masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels) + masked_lm_weights = [1.0] * len(masked_lm_ids) + + while len(masked_lm_positions) < max_predictions_per_seq: + masked_lm_positions.append(0) + masked_lm_ids.append(0) + masked_lm_weights.append(0.0) + + next_sentence_label = 1 if instance.is_random_next else 0 + + features["input_ids"][inst_index] = input_ids + features["input_mask"][inst_index] = input_mask + features["segment_ids"][inst_index] = segment_ids + features["masked_lm_positions"][inst_index] = masked_lm_positions + features["masked_lm_ids"][inst_index] = masked_lm_ids + features["next_sentence_labels"][inst_index] = next_sentence_label + + total_written += 1 + + print("saving data") + f = h5py.File(output_file, "w") + f.create_dataset("input_ids", data=features["input_ids"], dtype="i4", compression="gzip") + f.create_dataset("input_mask", data=features["input_mask"], dtype="i1", compression="gzip") + f.create_dataset("segment_ids", data=features["segment_ids"], dtype="i1", compression="gzip") + f.create_dataset("masked_lm_positions", data=features["masked_lm_positions"], dtype="i4", compression="gzip") + f.create_dataset("masked_lm_ids", data=features["masked_lm_ids"], dtype="i4", compression="gzip") + f.create_dataset("next_sentence_labels", data=features["next_sentence_labels"], dtype="i1", compression="gzip") + f.flush() + f.close() + + +def create_training_instances( + input_files, tokenizer, max_seq_length, dupe_factor, 
short_seq_prob, masked_lm_prob, max_predictions_per_seq, rng +): + """Create `TrainingInstance`s from raw text.""" + all_documents = [[]] + + # Input file format: + # (1) One sentence per line. These should ideally be actual sentences, not + # entire paragraphs or arbitrary spans of text. (Because we use the + # sentence boundaries for the "next sentence prediction" task). + # (2) Blank lines between documents. Document boundaries are needed so + # that the "next sentence prediction" task doesn't span between documents. + for input_file in input_files: + print("creating instance from {}".format(input_file)) + with open(input_file, "r", encoding="UTF-8") as reader: + while True: + line = convert_to_unicode(reader.readline()) + if not line: + break + line = line.strip() + + # Empty lines are used as document delimiters + if not line: + all_documents.append([]) + tokens = tokenizer.tokenize(line) + if tokens: + all_documents[-1].append(tokens) + + # Remove empty documents + all_documents = [x for x in all_documents if x] + rng.shuffle(all_documents) + + # vocab_words = list(tokenizer.vocab.keys()) + vocab_words = list(tokenizer.vocab.token_to_idx.keys()) + instances = [] + for _ in range(dupe_factor): + for document_index in range(len(all_documents)): + instances.extend( + create_instances_from_document( + all_documents, + document_index, + max_seq_length, + short_seq_prob, + masked_lm_prob, + max_predictions_per_seq, + vocab_words, + rng, + ) + ) + + rng.shuffle(instances) + return instances + + +def create_instances_from_document( + all_documents, + document_index, + max_seq_length, + short_seq_prob, + masked_lm_prob, + max_predictions_per_seq, + vocab_words, + rng, +): + """Creates `TrainingInstance`s for a single document.""" + document = all_documents[document_index] + + # Account for [CLS], [SEP], [SEP] + max_num_tokens = max_seq_length - 3 + + # We *usually* want to fill up the entire sequence since we are padding + # to `max_seq_length` anyways, so short sequences are generally wasted + # computation. However, we *sometimes* + # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter + # sequences to minimize the mismatch between pre-training and fine-tuning. + # The `target_seq_length` is just a rough target however, whereas + # `max_seq_length` is a hard limit. + target_seq_length = max_num_tokens + if rng.random() < short_seq_prob: + target_seq_length = rng.randint(2, max_num_tokens) + + # We DON'T just concatenate all of the tokens from a document into a long + # sequence and choose an arbitrary split point because this would make the + # next sentence prediction task too easy. Instead, we split the input into + # segments "A" and "B" based on the actual "sentences" provided by the user + # input. + instances = [] + current_chunk = [] + current_length = 0 + i = 0 + while i < len(document): + segment = document[i] + current_chunk.append(segment) + current_length += len(segment) + if i == len(document) - 1 or current_length >= target_seq_length: + if current_chunk: + # `a_end` is how many segments from `current_chunk` go into the `A` + # (first) sentence. 
+ a_end = 1 + if len(current_chunk) >= 2: + a_end = rng.randint(1, len(current_chunk) - 1) + + tokens_a = [] + for j in range(a_end): + tokens_a.extend(current_chunk[j]) + + tokens_b = [] + # Random next + is_random_next = False + if len(current_chunk) == 1 or rng.random() < 0.5: + is_random_next = True + target_b_length = target_seq_length - len(tokens_a) + + # This should rarely go for more than one iteration for large + # corpora. However, just to be careful, we try to make sure that + # the random document is not the same as the document + # we're processing. + for _ in range(10): + random_document_index = rng.randint(0, len(all_documents) - 1) + if random_document_index != document_index: + break + + # If picked random document is the same as the current document + if random_document_index == document_index: + is_random_next = False + + random_document = all_documents[random_document_index] + random_start = rng.randint(0, len(random_document) - 1) + for j in range(random_start, len(random_document)): + tokens_b.extend(random_document[j]) + if len(tokens_b) >= target_b_length: + break + # We didn't actually use these segments so we "put them back" so + # they don't go to waste. + num_unused_segments = len(current_chunk) - a_end + i -= num_unused_segments + # Actual next + else: + is_random_next = False + for j in range(a_end, len(current_chunk)): + tokens_b.extend(current_chunk[j]) + truncate_seq_pair(tokens_a, tokens_b, target_seq_length, rng) + + assert len(tokens_a) >= 1 + assert len(tokens_b) >= 1 + + tokens = [] + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in tokens_a: + tokens.append(token) + segment_ids.append(0) + + tokens.append("[SEP]") + segment_ids.append(0) + + for token in tokens_b: + tokens.append(token) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + (tokens, masked_lm_positions, masked_lm_labels) = create_masked_lm_predictions( + tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng + ) + instance = TrainingInstance( + tokens=tokens, + segment_ids=segment_ids, + is_random_next=is_random_next, + masked_lm_positions=masked_lm_positions, + masked_lm_labels=masked_lm_labels, + ) + instances.append(instance) + current_chunk = [] + current_length = 0 + i += 1 + + return instances + + +MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"]) + + +def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng): + """Creates the predictions for the masked LM objective.""" + + cand_indexes = [] + for (i, token) in enumerate(tokens): + if token == "[CLS]" or token == "[SEP]": + continue + cand_indexes.append(i) + + rng.shuffle(cand_indexes) + + output_tokens = list(tokens) + + num_to_predict = min(max_predictions_per_seq, max(1, int(round(len(tokens) * masked_lm_prob)))) + + masked_lms = [] + covered_indexes = set() + for index in cand_indexes: + if len(masked_lms) >= num_to_predict: + break + if index in covered_indexes: + continue + covered_indexes.add(index) + + masked_token = None + # 80% of the time, replace with [MASK] + if rng.random() < 0.8: + masked_token = "[MASK]" + else: + # 10% of the time, keep original + if rng.random() < 0.5: + masked_token = tokens[index] + # 10% of the time, replace with random word + else: + masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] + + output_tokens[index] = masked_token + + masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) + + masked_lms = sorted(masked_lms, 
key=lambda x: x.index) + + masked_lm_positions = [] + masked_lm_labels = [] + for p in masked_lms: + masked_lm_positions.append(p.index) + masked_lm_labels.append(p.label) + + return (output_tokens, masked_lm_positions, masked_lm_labels) + + +def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): + """Truncates a pair of sequences to a maximum sequence length.""" + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_num_tokens: + break + + trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b + assert len(trunc_tokens) >= 1 + + # We want to sometimes truncate from the front and sometimes from the + # back to add more randomness and avoid biases. + if rng.random() < 0.5: + del trunc_tokens[0] + else: + trunc_tokens.pop() + + +def main(): + + parser = argparse.ArgumentParser() + + parser.add_argument( + "--input_file", + default=None, + type=str, + required=True, + help="The input train corpus. can be directory with .txt files or a path to a single file", + ) + parser.add_argument( + "--output_file", + default=None, + type=str, + required=True, + help="The output file where created hdf5 formatted data will be written.", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + required=False, + help="The vocabulary the BERT model will train on. " + "Use bert_model argument would ignore this. " + "The bert_model argument is recommended.", + ) + parser.add_argument( + "--do_lower_case", + action="store_true", + default=True, + help="Whether to lower case the input text. True for uncased models, False for cased models. " + "Use bert_model argument would ignore this. The bert_model argument is recommended.", + ) + parser.add_argument( + "--bert_model", + default="bert-base-uncased", + type=str, + required=False, + help="Bert pre-trained model selected in the list: bert-base-uncased, " + "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese." + "If provided, use the pre-trained model used tokenizer to create data " + "and ignore vocab_file and do_lower_case.", + ) + + # Other parameters + # int + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after WordPiece tokenization. \n" + "Sequences longer than this will be truncated, and sequences shorter \n" + "than this will be padded.", + ) + parser.add_argument( + "--dupe_factor", + default=10, + type=int, + help="Number of times to duplicate the input data (with different masks).", + ) + parser.add_argument( + "--max_predictions_per_seq", default=20, type=int, help="Maximum number of masked LM predictions per sequence." + ) + + # floats + parser.add_argument("--masked_lm_prob", default=0.15, type=float, help="Masked LM probability.") + parser.add_argument( + "--short_seq_prob", + default=0.1, + type=float, + help="Probability to create a sequence shorter than maximum sequence length", + ) + + parser.add_argument("--random_seed", type=int, default=12345, help="random seed for initialization") + + args = parser.parse_args() + print(args) + + if args.bert_model: + tokenizer = BertTokenizer.from_pretrained(args.bert_model) + else: + assert args.vocab_file, "vocab_file must be set If bert_model is not provided." 
+ tokenizer = BertTokenizer(args.vocab_file, do_lower_case=args.do_lower_case) + + input_files = [] + if os.path.isfile(args.input_file): + input_files.append(args.input_file) + elif os.path.isdir(args.input_file): + input_files = [ + os.path.join(args.input_file, f) + for f in os.listdir(args.input_file) + if (os.path.isfile(os.path.join(args.input_file, f)) and f.endswith(".txt")) + ] + else: + raise ValueError("{} is not a valid path".format(args.input_file)) + + rng = random.Random(args.random_seed) + instances = create_training_instances( + input_files, + tokenizer, + args.max_seq_length, + args.dupe_factor, + args.short_seq_prob, + args.masked_lm_prob, + args.max_predictions_per_seq, + rng, + ) + + output_file = args.output_file + + write_instance_to_example_file( + instances, tokenizer, args.max_seq_length, args.max_predictions_per_seq, output_file + ) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/bert/data/sample_text.txt b/model_zoo/bert/data/sample_text.txt new file mode 100644 index 0000000000000000000000000000000000000000..75ec60cdb7842e023d32f95f0e16e1973eff4b71 --- /dev/null +++ b/model_zoo/bert/data/sample_text.txt @@ -0,0 +1,100 @@ +Zulfiqar A. Bhutta trained as a physician in Pakistan in the early stages of his career. +He holds titles across various organizations in diverse geographies. +Professor Bhutta is the Founding Director of the Center of Excellence in Women and Child Health & Institute for Global Child Health & Development, at the Aga Khan University South-Central Asia, East Africa & United Kingdom. +He is currently the Co-Director at the Centre for Global Child Health, at the Hospital for Sick Children and leads many projects as a Senior Scientist at the Research Institute in the Centre for Global Child Health at Sick Kids. +He holds a Professorship at the University of Toronto in the Department of Nutritional Sciences and the Division of Epidemiology, Dalla Lana School of Public Health. +Additionally, he holds concurrent professorship at the Department of Paediatrics, Aga Khan University in Karachi, Pakistan and at the Schools of Public Health of Johns Hopkins University, Tufts University, Boston University, University of Alberta and the London School of Hygiene & Tropical Medicine. +He is a designated Distinguished National Professor of the Government of Pakistan and was the Founding Chair of the National Research Ethics Committee of the Government of Pakistan from 2003-2014. +Dr. Bhutta received his MBBS from Khyber Medical College in Peshawar, Pakistan in 1977 at which time he was names "Best Graduate of the Year" and awarded the University Gold Medal for overall distinction. +His PhD work was completed at Karolinska Institute in Stockholm, Sweden in 1996. +He is a Fellow of the Royal College of Physicians (Edinburgh & London), the Royal College of Paediatrics and Child Health (London), American Academy of Paediatrics and the Pakistan Academy of Sciences. +Following the completion of his PhD Dr. Bhutta began working as House Surgeon in Obstetrics & Gynecology at the Khyber Teaching Hospital, Peshawar (April-November 1978). +He began work in paediatrics as a physician in November of 1978 in the Professorial Unit at the Institute of Child Health, Jinnah Postgraduate Medical Centre, Karachi (Pakistan). +Through 1980's he continued his work as a surgeon and paediatrician. +He undertook his first professor position in the Department of Paediatrics, The Aga Khan University Hospital, Karachi (Pakistan), from November 1987 to June 1992. 
+In 2005, Dr. Bhutta became the Chairman of the Department of Paediatrics & Child Health at the Aga Khan University & Medical Center, a position held until 2008. +Following his term as Chairman he became The Noordin Noormahomed Sheriff Professor & Founding Chair, Division of Women & Child Health, The Aga Khan University, a position he held for four years. +Dr. Bhutta currently holds the titles of co-director of the Centre for Global Child Health at the Hospital for Sick Children in Toronto, and founding director of the Centre of Excellence in Women and Child Health at the Aga Khan University. +In 2020, he was appointed founding director of the Institute for Global child Health & Development at the Aga Khan University and elected Fellow to the Royal Society, United Kingdom. +Outside of his professional responsibilities Dr. Bhutta serves on various local and international boards and committees, including a series of editorial boards. +In his various capacities Dr. Bhutta has produced a large collection of publications working with his teams at Sick Kids, AKU and international partners. +These include book reviews, chapters, 1. +"Haematological disorders" "Neonatal Jaundice" in Neonatal Vade‑Mecum, Fleming PJ, Speidel BD, Dunn PM Eds, Lloyd‑Luke Publishers, UK, 1986. +Revised 2nd Edition 1991. +2. +"Nutritional management of acute and persistent diarrhoea". +A M Molla, Bhutta Z A and  A Molla. +In McNeish A S, Mittal S K and Walker-Smith J A (eds). +Recent trends in diarrhoea and malnutrition, MAMC, Delhi, 1991, pp 37-51. +3. +"Paediatric Prescribing” in "Text book of Paediatrics for developing countries"            Arif MA, Hanif SM, Wasti SMK Eds, 1989, 2nd Edition 1996,  PPA, Karachi. +& Lahore 4. +"Innovations in neonatal care : Impact on neonatal survival in the developing world:. +Bhutta Z A  Zaidi S (Editor) 1992. +TWEL Publisher. +Karachi pp 121-131 5. +"Short course therapy in Pediatrics" Bhutta Z A& Teele D.  In Tice A D, Waldvogel F (Eds), Contemporary issues in Infectious Disease Epidemiology and Management, 1993 Gardiner Caldwell, Cheshire, pp 52 - 60. +6. +"Dietary management of persistent diarrhoea". +Bhutta Z A, Molla A M, Issani Z. +In Reflections on  Diarrhoeal Disease & Nutrition  of Children". +1993 Karachi, pp 97 - 103. +7. +"Prescribing practices amongst general practitioners (GPs) and consultant paediatricians in childhood diarrhoea.”  S.Q. +Nizami, I.A. +Khan, Bhutta Z A. +In "Reflections on Diarrhoeal Disease and Nutrition of Children". +1993 Karachi, pp  88-90. +8. +"The challenge of multidrug-resistant typhoid". +Bhutta Z A. +In Puri R K, Sachdev H P S, Choudhry P, Verma I C (Eds), Current concepts in Paediatrics, 1994. +Jaypee Publishers, New Delhi, pp 403.8. +9. +"Perinatal Care in Pakistan: Current status and trends". +In Proceedings of the Workshop in Reproductive Health. +College of Physicians and Surgeons, Pakistan, Karachi, 1995, pp 95-103. +10. +“A study of whole body protein kinetics in malnourished children with persistent diarrhoea” Bhutta Z A, Nizami SQ, Isani Z, Hardy S, Hendricks K, Young V.   Report of the second RCM coordinated Research Programme for application of stable isotope tracer methods to studies of energy metabolism in malnourished populations of developing countries. +NAHRES-30 1996 IAEA Vienna. +11. +"Pneumococcal infections in Pakistan: a country report". +In Adult Immunization in Asia, Fondation Mercel Merieux, Lyon, 1998. pp 79-82. +12. +“Factors affecting protein and aminoacid metabolism in childhood from developing countries". 
+In Child Nutrition: an international perspective. +Editors Solomons NW, Caballero B, Brown KH. +CRC Press 1998. +13. +"Protein Digestion and Bioavailability". +In Encyclopedia of Human Nutrition. +Editors: Sadler M, Strain JJ, Caballero B. +Academic Press (London), 1998 pp.1646-54. +14. +"Perinatal Care in Pakistan. +Reproductive Health: A manual for family practice and primary health care. +Bhutta Z A, Maqbool S.  College of Physicians and Surgeons, Pakistan, Karachi, 1999, pp 69-78. +15. +“Effective interventions to reduce neonatal mortality and morbidity from perinatal infection. +Bhutta ZA. +In Costello A, Manandhar D (eds). +"Improving Newborn Infant Health in Developing Countries’ 1999. +Imperial College Press, London pp.289-308. +16. +“Ambulatory management of typhoid fever”            “Risk factors and management of micronutrient deficiencies”            “Management of persistent diarrhoea in developing countries”. +In Manual of International Child Health, British Medical Journal, 2000 (in press). +17. +“The role of Cefixime in typhoid fever during childhood” in Cefixime, Adam D, Quintiliani R (Eds), Torre-Lazur-McCann, Tokyo, 2000; pp.107-112. +18. +"Micronutrients and Child Health in the Commonwealth”, Commonwealth Foundation" (UK) (2001). +19. +"Isotopic evaluation of breast milk intake, energy metabolism growth and body composition of exclusively breastfed infants in Pakistan". +Bhutta ZA, Nizami SQ, Weaver LT, Preston T. In Application of Stable Isotopes to evaluate Growth and Body Composition of Exclusively Breastfed infants, IAEA and WHO, NAHRES Report. +2000. +20. +“Typhoid Fever in Childhood: the south Asian experience”. +Ahmad K &Bhutta ZA. +In "Recent Advances in Paediatrics", Gupte S (Ed), 2000, India . +21. +“Neonatal Infections in developing countries” in  Carrera JM, Cabero L, Baraibar R (Eds). +The Perinatal Medicine of the new Millennium. \ No newline at end of file diff --git a/model_zoo/bert/deploy/python/README.md b/model_zoo/bert/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b2f9ef8727edd8bb695a9d482b99ff5119b6ce9b --- /dev/null +++ b/model_zoo/bert/deploy/python/README.md @@ -0,0 +1,139 @@ +# FastDeploy BERT 模型 Python 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy Python SDK。 + +本目录下分别提供 `seq_cls_infer.py` 快速完成在 CPU/GPU 的 GLUE 文本分类任务的 Python 部署示例。 + +## 依赖安装 + +直接执行以下命令安装部署示例的依赖。 + +```bash +# 安装 fast_tokenizer 以及 GPU 版本 fastdeploy +pip install fast-tokenizer-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html +``` + +## 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 BERT 模型在 GLUE SST-2 数据集上进行自然语言推断任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [BERT 训练文档](../../README.md)导出得到的部署模型,其模型目录为`model_zoo/bert/infer_model`(用户可按实际情况设置)。 + + +```bash +# CPU 推理 +python seq_cls_infer.py --model_dir ../../infer_model/ --device cpu --backend paddle +# GPU 推理 +python seq_cls_infer.py --model_dir ../../infer_model/ --device gpu --backend paddle +``` + +运行完成后返回的结果如下: + +```bash +[2023-03-02 08:30:03,877] [ INFO] - We are using to load '../../infer_model/'. +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. 
+Batch id: 0, example id: 0, sentence1: against shimmering cinematography that lends the setting the ethereal beauty of an asian landscape painting, label: positive, negative prob: 0.0003, positive prob: 0.9997.
+Batch id: 1, example id: 0, sentence1: the situation in a well-balanced fashion, label: positive, negative prob: 0.0002, positive prob: 0.9998.
+Batch id: 2, example id: 0, sentence1: at achieving the modest , crowd-pleasing goals it sets for itself, label: positive, negative prob: 0.0017, positive prob: 0.9983.
+Batch id: 3, example id: 0, sentence1: so pat it makes your teeth hurt, label: negative, negative prob: 0.9986, positive prob: 0.0014.
+Batch id: 4, example id: 0, sentence1: this new jangle of noise , mayhem and stupidity must be a serious contender for the title ., label: negative, negative prob: 0.9806, positive prob: 0.0194.
+```
+
+## 参数说明
+
+| 参数 | 参数说明 |
+|----------|--------------|
+| --model_dir | 指定部署模型的目录 |
+| --batch_size | 输入的 batch size,默认为 1 |
+| --max_length | 最大序列长度,默认为 128 |
+| --device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为 'cpu' |
+| --device_id | 运行设备的 id,默认为 0 |
+| --cpu_threads | 当使用 cpu 推理时,指定推理的 cpu 线程数,默认为 1 |
+| --backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为 'paddle' |
+| --use_fp16 | 是否使用 FP16 模式进行推理,使用 tensorrt 和 paddle_tensorrt 后端时可开启,默认为 False |
+| --use_fast | 是否使用 FastTokenizer 加速分词阶段,默认为 True |
+
+## FastDeploy 高阶用法
+
+FastDeploy 在 Python 端上,提供 `fastdeploy.RuntimeOption.use_xxx()` 以及 `fastdeploy.RuntimeOption.use_xxx_backend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 BERT 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 BERT 模型。
+
+符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持;
+
+| 硬件 | 硬件对应的接口 | 可用的推理引擎 | 推理引擎对应的接口 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
+|------|----------------|----------------|--------------------|--------------------------------|--------------------|
+| CPU | use_cpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| CPU | use_cpu() | ONNX Runtime | use_ort_backend() | ✅ | N/A |
+| CPU | use_cpu() | OpenVINO | use_openvino_backend() | ✅ | N/A |
+| GPU | use_gpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| GPU | use_gpu() | ONNX Runtime | use_ort_backend() | ✅ | ✅ |
+| GPU | use_gpu() | Paddle TensorRT | use_paddle_infer_backend() + paddle_infer_option.enable_trt = True | ✅ | ✅ |
+| GPU | use_gpu() | TensorRT | use_trt_backend() | ✅ | ✅ |
+| 昆仑芯 XPU | use_kunlunxin() | Paddle Lite | use_paddle_lite_backend() | N/A | ✅ |
+| 华为 昇腾 | use_ascend() | Paddle Lite | use_paddle_lite_backend() | ✅ | ✅ |
+| Graphcore IPU | use_ipu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
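+
+下面给出一个按照上表选择硬件与推理后端的最小示意(仅供参考:这里假设部署模型已导出至 `../../infer_model/`,文件前缀为 `model`,实际请按自己的模型目录调整;完整的前后处理流程见本目录的 `seq_cls_infer.py`):
+
+```python
+import fastdeploy as fd
+
+option = fd.RuntimeOption()
+# 指定导出的推理模型与参数文件
+option.set_model_path("../../infer_model/model.pdmodel", "../../infer_model/model.pdiparams")
+# 选择硬件(对应上表"硬件对应的接口"一列)
+option.use_gpu(0)
+# 选择推理引擎后端(对应上表"推理引擎对应的接口"一列)
+option.use_paddle_infer_backend()
+runtime = fd.Runtime(option)
+```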
diff --git a/model_zoo/bert/deploy/python/seq_cls_infer.py b/model_zoo/bert/deploy/python/seq_cls_infer.py new file mode 100644 index 0000000000000000000000000000000000000000..68f8a1b113573fa49be40ea5c453a5292a65e9bc --- /dev/null +++ b/model_zoo/bert/deploy/python/seq_cls_infer.py @@ -0,0 +1,159 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import distutils.util +import os + +import fastdeploy as fd +import numpy as np + +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument("--cpu_threads", type=int, default=1, help="Number of threads to predict when using cpu.") + parser.add_argument("--device_id", type=int, default=0, help="Select which gpu device to train model.") + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument( + "--use_fast", + type=distutils.util.strtobool, + default=True, + help="Whether to use fast_tokenizer to accelarate the tokenization.", + ) + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=args.use_fast) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + option.set_cpu_thread_num(args.cpu_threads) + else: + option.use_gpu(args.device_id) + if args.backend == "paddle": + 
option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.use_paddle_infer_backend() + option.paddle_infer_option.collect_trt_shape = True + option.paddle_infer_option.enable_trt = True + trt_file = os.path.join(args.model_dir, "model.trt") + option.trt_option.set_shape( + "input_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + option.trt_option.set_shape( + "token_type_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + if args.use_fp16: + option.trt_option.enable_fp16 = True + trt_file = trt_file + ".fp16" + option.trt_option.serialize_file = trt_file + return fd.Runtime(option) + + def preprocess(self, text): + data = self.tokenizer(text, max_length=self.max_length, padding=True, truncation=True) + input_ids_name = self.runtime.get_input_info(0).name + token_type_ids_name = self.runtime.get_input_info(1).name + input_map = { + input_ids_name: np.array(data["input_ids"], dtype="int64"), + token_type_ids_name: np.array(data["token_type_ids"], dtype="int64"), + } + return input_map + + def infer(self, input_map): + results = self.runtime.infer(input_map) + return results + + def postprocess(self, infer_data): + logits = np.array(infer_data[0]) + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1), "confidence": probs} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + texts_ds = [ + "against shimmering cinematography that lends the setting the ethereal beauty of an asian landscape painting", + "the situation in a well-balanced fashion", + "at achieving the modest , crowd-pleasing goals it sets for itself", + "so pat it makes your teeth hurt", + "this new jangle of noise , mayhem and stupidity must be a serious contender for the title .", + ] + label_map = {0: "negative", 1: "positive"} + batch_texts = batchfy_text(texts_ds, args.batch_size) + for bs, texts in enumerate(batch_texts): + outputs = predictor.predict(texts) + for i, sentence1 in enumerate(texts): + print( + f"Batch id: {bs}, example id: {i}, sentence1: {sentence1}, " + f"label: {label_map[outputs['label'][i]]}, negative prob: {outputs['confidence'][i][0]:.4f}, " + f"positive prob: {outputs['confidence'][i][1]:.4f}." + ) diff --git a/model_zoo/bert/export_model.py b/model_zoo/bert/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..367be76da408757c2dfeb166e670cb8b2cf22e4f --- /dev/null +++ b/model_zoo/bert/export_model.py @@ -0,0 +1,77 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from run_glue_trainer import MODEL_CLASSES + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_path", + default=None, + type=str, + required=True, + help="Path of the trained model to be exported.", + ) + parser.add_argument( + "--output_path", + default=None, + type=str, + required=True, + help="The output file prefix used to save the exported inference model.", + ) + args = parser.parse_args() + return args + + +def main(): + args = parse_args() + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # build model and load trained parameters + model = model_class.from_pretrained(args.model_path) + # switch to eval model + model.eval() + # convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ], + ) + # save converted static graph model + paddle.jit.save(model, args.output_path) + # also save tokenizer for inference usage + tokenizer = tokenizer_class.from_pretrained(args.model_path) + tokenizer.save_pretrained(os.path.dirname(args.output_path)) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/bert/run_glue_trainer.py b/model_zoo/bert/run_glue_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..e06af9e9bcac2f491514aa6dbfa6eca09efb7743 --- /dev/null +++ b/model_zoo/bert/run_glue_trainer.py @@ -0,0 +1,189 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
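+
+# This script fine-tunes a pretrained sequence classification model (e.g. BERT/ERNIE) on a GLUE
+# task with the PaddleNLP Trainer API: it loads the HuggingFace `glue` dataset, tokenizes the
+# sentence (pair) fields listed in `task_to_keys`, pads batches with `DataCollatorWithPadding`,
+# and reports the task-specific metric from `METRIC_CLASSES`.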
+ +from dataclasses import dataclass, field + +import numpy as np +import paddle +from datasets import load_dataset +from paddle.metric import Accuracy + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + BertForSequenceClassification, + BertTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, +) + +METRIC_CLASSES = { + "cola": Mcc, + "sst2": Accuracy, + "mrpc": AccuracyAndF1, + "stsb": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, + "wnli": Accuracy, +} + +task_to_keys = { + "cola": ("sentence", None), + "mnli": ("premise", "hypothesis"), + "mrpc": ("sentence1", "sentence2"), + "qnli": ("question", "sentence"), + "qqp": ("question1", "question2"), + "rte": ("sentence1", "sentence2"), + "sst2": ("sentence", None), + "stsb": ("sentence1", "sentence2"), + "wnli": ("sentence1", "sentence2"), +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), +} + + +@dataclass +class ModelArguments: + task_name: str = field( + default=None, + metadata={"help": "The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys())}, + ) + model_name_or_path: str = field( + default=None, + metadata={"help": "Path to pre-trained model or shortcut name"}, + ) + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + +def do_train(): + training_args, model_args = PdArgumentParser([TrainingArguments, ModelArguments]).parse_args_into_dataclasses() + training_args: TrainingArguments = training_args + model_args: ModelArguments = model_args + + training_args.print_config(model_args, "Model") + training_args.print_config(training_args, "Training") + + model_args.task_name = model_args.task_name.lower() + + sentence1_key, sentence2_key = task_to_keys[model_args.task_name] + + train_ds = load_dataset("glue", model_args.task_name, split="train") + columns = train_ds.column_names + is_regression = model_args.task_name == "stsb" + label_list = None + if not is_regression: + label_list = train_ds.features["label"].names + num_labels = len(label_list) + else: + num_labels = 1 + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + + def preprocess_function(examples): + # Tokenize the texts + texts = ( + (examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key]) + ) + result = tokenizer(*texts, max_length=model_args.max_seq_length, truncation=True) + if "label" in examples: + # In all cases, rename the column to labels because the model will expect that. 
+ result["labels"] = examples["label"] + return result + + train_ds = train_ds.map(preprocess_function, batched=True, remove_columns=columns) + data_collator = DataCollatorWithPadding(tokenizer) + + if model_args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", model_args.task_name, split=["validation_matched", "validation_mismatched"] + ) + dev_ds_matched = dev_ds_matched.map(preprocess_function, batched=True, remove_columns=columns) + dev_ds_mismatched = dev_ds_mismatched.map(preprocess_function, batched=True, remove_columns=columns) + dev_ds = {"matched": dev_ds_matched, "mismatched": dev_ds_mismatched} + else: + dev_ds = load_dataset("glue", model_args.task_name, split="validation") + dev_ds = dev_ds.map(preprocess_function, batched=True, remove_columns=columns) + + model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_labels=num_labels) + + def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + if is_regression: + preds = np.squeeze(preds) + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric = METRIC_CLASSES[model_args.task_name]() + result = metric.compute(preds, label) + metric.update(result) + + if isinstance(metric, AccuracyAndF1): + acc, precision, recall, f1, _ = metric.accumulate() + return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1} + elif isinstance(metric, Mcc): + mcc = metric.accumulate() + return {"mcc": mcc[0]} + elif isinstance(metric, PearsonAndSpearman): + pearson, spearman, _ = metric.accumulate() + return {"pearson": pearson, "spearman": spearman} + elif isinstance(metric, Accuracy): + acc = metric.accumulate() + return {"accuracy": acc} + + trainer = Trainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + # training + if training_args.do_train: + train_result = trainer.train() + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_eval: + if model_args.task_name == "mnli": + for _, eval_dataset in dev_ds.items(): + eval_metrics = trainer.evaluate(eval_dataset) + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", eval_metrics) + else: + eval_metrics = trainer.evaluate(dev_ds) + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", eval_metrics) + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/bert/run_pretrain.py b/model_zoo/bert/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..0c855de0e47498cd1d54c74561bfbd4f84274f2b --- /dev/null +++ b/model_zoo/bert/run_pretrain.py @@ -0,0 +1,476 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
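+
+# This script runs BERT/ERNIE pretraining (masked LM + next sentence prediction) on the HDF5
+# shards produced by create_pretraining_data.py. It supports multi-GPU training via
+# paddle.distributed, optional mixed precision (--use_amp), and dynamic-to-static conversion
+# (--to_static) for benchmarking.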
+ +import argparse +import os +import random +import sys +import time +from concurrent.futures import ThreadPoolExecutor + +import h5py +import numpy as np +import paddle +from paddle.io import DataLoader, Dataset + +from paddlenlp.data import Stack +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + BertForPretraining, + BertModel, + BertPretrainingCriterion, + BertTokenizer, + ErnieForPretraining, + ErnieModel, + ErniePretrainingCriterion, + ErnieTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils import profiler +from paddlenlp.utils.log import logger +from paddlenlp.utils.tools import TimeCostAverage + +MODEL_CLASSES = { + "bert": (BertModel, BertForPretraining, BertPretrainingCriterion, BertTokenizer), + "ernie": (ErnieModel, ErnieForPretraining, ErniePretrainingCriterion, ErnieTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + + parser.add_argument( + "--max_predictions_per_seq", default=80, type=int, help="The maximum total of masked tokens in input sequence" + ) + + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--max_steps", + default=1000000, + type=int, + help="Set total number of training steps to perform. ", + ) + parser.add_argument( + "--preprocessing_num_workers", + type=int, + default=0, + help="The number of processes to use for the preprocessing.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument( + "--device", + type=str, + default="gpu", + choices=["cpu", "gpu", "xpu", "npu"], + help="Device for selecting for the training.", + ) + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument( + "--amp_level", type=str, default="O2", choices=["O1", "O2"], help="select O1 or O2 of amp level." 
+ ) + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + parser.add_argument("--to_static", type=strtobool, default=False, help="Enable training under @to_static.") + + # For benchmark. + parser.add_argument( + "--profiler_options", + type=str, + default=None, + help='The option of profiler, which should be in format "key1=value1;key2=value2;key3=value3".', + ) + parser.add_argument( + "--fuse_transformer", + type=strtobool, + default=False, + help="Whether to use FusedTransformerEncoderLayer to replace a TransformerEncoderLayer or not.", + ) + parser.add_argument( + "--cinn", + type=strtobool, + default=False, + help="If cinn is True, we will apply @to_static to model.bert.encoder, else we will apply it to the whole model.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + random.seed(args.seed + paddle.distributed.get_rank()) + np.random.seed(args.seed + paddle.distributed.get_rank()) + paddle.seed(args.seed + paddle.distributed.get_rank()) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +def create_pretraining_dataset(input_file, max_pred_length, shared_list, args, worker_init): + train_data = PretrainingDataset(input_file=input_file, max_pred_length=max_pred_length) + # files have been sharded, no need to dispatch again + train_batch_sampler = paddle.io.BatchSampler(train_data, batch_size=args.batch_size, shuffle=True) + + # DataLoader cannot be pickled because of its place. + # If it can be pickled, use global function instead of lambda and use + # ProcessPoolExecutor instead of ThreadPoolExecutor to prefetch. + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # input_ids, segment_ids, input_mask, masked_lm_positions, + # masked_lm_labels, next_sentence_labels, mask_token_num + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + _, seq_length = out[0].shape + size = sum(len(x[3]) for x in data) + # Padding for divisibility by 8 for fp16 or int8 usage + if size % 8 != 0: + size += 8 - (size % 8) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + # mask_token_num + out.append(np.asarray([mask_token_num], dtype=np.float32)) + return out + + train_data_loader = DataLoader( + dataset=train_data, + batch_sampler=train_batch_sampler, + collate_fn=_collate_data, + num_workers=args.preprocessing_num_workers, + worker_init_fn=worker_init, + return_list=True, + ) + return train_data_loader, input_file + + +def create_input_specs(): + input_ids = paddle.static.InputSpec(name="input_ids", shape=[-1, -1], dtype="int64") + segment_ids = paddle.static.InputSpec(name="segment_ids", shape=[-1, -1], dtype="int64") + position_ids = None + input_mask = paddle.static.InputSpec(name="input_mask", shape=[-1, 1, 1, -1], dtype="float32") + masked_lm_positions = paddle.static.InputSpec(name="masked_lm_positions", shape=[-1], dtype="int32") + return [input_ids, segment_ids, position_ids, input_mask, masked_lm_positions] + + +class PretrainingDataset(Dataset): + def __init__(self, input_file, 
max_pred_length): + self.input_file = input_file + self.max_pred_length = max_pred_length + f = h5py.File(input_file, "r") + keys = [ + "input_ids", + "input_mask", + "segment_ids", + "masked_lm_positions", + "masked_lm_ids", + "next_sentence_labels", + ] + self.inputs = [np.asarray(f[key][:]) for key in keys] + f.close() + + def __len__(self): + "Denotes the total number of samples" + return len(self.inputs[0]) + + def __getitem__(self, index): + + [input_ids, input_mask, segment_ids, masked_lm_positions, masked_lm_ids, next_sentence_labels] = [ + input[index].astype(np.int64) if indice < 5 else np.asarray(input[index].astype(np.int64)) + for indice, input in enumerate(self.inputs) + ] + # TODO: whether to use reversed mask by changing 1s and 0s to be + # consistent with nv bert + input_mask = (1 - np.reshape(input_mask.astype(np.float32), [1, 1, input_mask.shape[0]])) * -1e9 + + index = self.max_pred_length + # store number of masked tokens in index + # outputs of torch.nonzero diff with that of numpy.nonzero by zip + padded_mask_indices = (masked_lm_positions == 0).nonzero()[0] + if len(padded_mask_indices) != 0: + index = padded_mask_indices[0].item() + else: + index = self.max_pred_length + # masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64) + # masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index] + masked_lm_labels = masked_lm_ids[:index] + masked_lm_positions = masked_lm_positions[:index] + # softmax_with_cross_entropy enforce last dim size equal 1 + masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1) + next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1) + + return [input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank()) + + args.model_type = args.model_type.lower() + base_class, model_class, criterion_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + if args.model_name_or_path in pretrained_models_list: + config = model_class.config_class.from_pretrained(args.model_name_or_path) + config.fuse = args.fuse_transformer + model = model_class(config) + else: + model = model_class.from_pretrained(args.model_name_or_path) + criterion = criterion_class(getattr(model, model_class.base_model_prefix).config.vocab_size) + # decorate @to_static for benchmark, skip it by default. + if args.to_static: + if args.cinn: + model.bert.encoder = paddle.jit.to_static(model.bert.encoder) + logger.info("Successfully to apply @to_static to model.bert.encoder.") + else: + specs = create_input_specs() + model = paddle.jit.to_static(model, input_spec=specs) + logger.info("Successfully to apply @to_static to the whole model with specs: {}.".format(specs)) + + # If use default last_epoch, lr of the first iteration is 0. + # Use `last_epoch = 0` to be consistent with nv bert. + num_training_steps = args.max_steps + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps, last_epoch=0) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + model = paddle.amp.decorate(models=model, level=args.amp_level, save_dtype="float32") + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + pool = ThreadPoolExecutor(1) + global_step = 0 + for epoch in range(sys.maxsize): + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if os.path.isfile(os.path.join(args.input_dir, f)) and "train" in f + ] + files.sort() + num_files = len(files) + random.Random(args.seed + epoch).shuffle(files) + f_start_id = 0 + + shared_file_list = {} + + if paddle.distributed.get_world_size() > num_files: + remainder = paddle.distributed.get_world_size() % num_files + data_file = files[ + ( + f_start_id * paddle.distributed.get_world_size() + + paddle.distributed.get_rank() + + remainder * f_start_id + ) + % num_files + ] + else: + data_file = files[ + (f_start_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + + train_data_loader, _ = create_pretraining_dataset( + data_file, args.max_predictions_per_seq, shared_file_list, args, worker_init + ) + + # TODO(guosheng): better way to process single file + single_file = True if f_start_id + 1 == len(files) else False + + for f_id in range(f_start_id, len(files)): + if not single_file and f_id == f_start_id: + continue + if paddle.distributed.get_world_size() > num_files: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank() + remainder * f_id) + % num_files + ] + else: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + + dataset_future = pool.submit( + create_pretraining_dataset, + data_file, + args.max_predictions_per_seq, + shared_file_list, + args, + worker_init, + ) + train_cost_avg = TimeCostAverage() + reader_cost_avg = TimeCostAverage() + total_samples = 0 + batch_start = time.time() + for step, batch in enumerate(train_data_loader): + train_reader_cost = time.time() - batch_start + reader_cost_avg.record(train_reader_cost) + global_step += 1 + ( + input_ids, + segment_ids, + input_mask, + masked_lm_positions, + masked_lm_labels, + next_sentence_labels, + masked_lm_scale, + ) = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu", "fused_attention", "fused_feedforward"], + level=args.amp_level, + ): + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + loss = criterion( + prediction_scores, + seq_relationship_score, + masked_lm_labels, + next_sentence_labels, + masked_lm_scale, + ) + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + total_samples += args.batch_size + train_run_cost = time.time() - batch_start + train_cost_avg.record(train_run_cost) + + # Profile for model benchmark + if args.profiler_options is not None: + profiler.add_profiler_step(args.profiler_options) + + if 
global_step % args.logging_steps == 0: + if paddle.distributed.get_rank() == 0: + logger.info( + "global step: %d, epoch: %d, batch: %d, loss: %f, " + "avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sequences/sec" + % ( + global_step, + epoch, + step, + loss, + reader_cost_avg.get_average(), + train_cost_avg.get_average(), + total_samples / args.logging_steps, + total_samples / (args.logging_steps * train_cost_avg.get_average()), + ) + ) + total_samples = 0 + train_cost_avg.reset() + reader_cost_avg.reset() + if global_step % args.save_steps == 0 or global_step >= args.max_steps: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if global_step >= args.max_steps: + del train_data_loader + return + batch_start = time.time() + + del train_data_loader + train_data_loader, data_file = dataset_future.result(timeout=None) + + +if __name__ == "__main__": + args = parse_args() + print(args) + do_train(args) diff --git a/model_zoo/bert/run_pretrain_trainer.py b/model_zoo/bert/run_pretrain_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..f5624ea3dcf795d962f6e834352b1057bbd19250 --- /dev/null +++ b/model_zoo/bert/run_pretrain_trainer.py @@ -0,0 +1,259 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
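+
+# Trainer-based variant of run_pretrain.py: it consumes the same HDF5 pretraining data and the
+# same masked LM / next sentence prediction objectives, but delegates the training loop, logging
+# and checkpointing to the PaddleNLP Trainer API.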
+ +import os +from dataclasses import dataclass, field + +import h5py +import numpy as np +import paddle +from paddle.io import Dataset + +from paddlenlp.data import Stack +from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + BertForPretraining, + BertTokenizer, + ErnieForPretraining, + ErnieTokenizer, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "bert": (BertForPretraining, BertTokenizer), + "ernie": (ErnieForPretraining, ErnieTokenizer), +} + + +@dataclass +class DataArguments: + input_dir: str = field(default=None, metadata={"help": "The input directory where the data will be read from."}) + + +@dataclass +class ModelArguments: + model_type: str = field( + default="bert", metadata={"help": "Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys())} + ) + model_name_or_path: str = field( + default=None, + metadata={ + "help": "Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ) + }, + ) + max_predictions_per_seq: int = field( + default=80, metadata={"help": "The maximum total of masked tokens in input sequence"} + ) + + to_static: strtobool = field(default=False, metadata={"help": "Enable training under @to_static."}) + profiler_options: str = field( + default=None, + metadata={"help": "Whether to use FusedTransformerEncoderLayer to replace a TransformerEncoderLayer or not."}, + ) + fuse_transformer: strtobool = field( + default=False, + metadata={"help": "Whether to use FusedTransformerEncoderLayer to replace a TransformerEncoderLayer or not."}, + ) + + +def get_train_data_file(data_args): + files = [ + os.path.join(data_args.input_dir, f) + for f in os.listdir(data_args.input_dir) + if os.path.isfile(os.path.join(data_args.input_dir, f)) and "train" in f + ] + files.sort() + num_files = len(files) + # random.Random(training_args.seed + epoch).shuffle(files) + f_start_id = 0 + + if paddle.distributed.get_world_size() > num_files: + remainder = paddle.distributed.get_world_size() % num_files + data_file = files[ + (f_start_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank() + remainder * f_start_id) + % num_files + ] + else: + data_file = files[ + (f_start_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files + ] + + # TODO(guosheng): better way to process single file + single_file = True if f_start_id + 1 == len(files) else False + + for f_id in range(f_start_id, len(files)): + if not single_file and f_id == f_start_id: + continue + if paddle.distributed.get_world_size() > num_files: + data_file = files[ + (f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank() + remainder * f_id) + % num_files + ] + else: + data_file = files[(f_id * paddle.distributed.get_world_size() + paddle.distributed.get_rank()) % num_files] + + return data_file + + +def data_collator(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # input_ids, segment_ids, input_mask, masked_lm_positions, + # masked_lm_labels, next_sentence_labels, mask_token_num + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + _, seq_length = out[0].shape + size = _ = sum(len(x[3]) for x in data) + # Padding for divisibility by 8 for fp16 or int8 usage + if size % 8 != 0: + size += 8 - (size % 8) + # masked_lm_positions + # Organize as a 1D 
tensor for gather or use gather_nd + + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -100, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + + return { + "input_ids": out[0], + "token_type_ids": out[1], + "attention_mask": out[2], + "masked_positions": out[3], + "labels": out[4], + "next_sentence_label": out[5], + } + + +def create_input_specs(): + input_ids = paddle.static.InputSpec(name="input_ids", shape=[-1, -1], dtype="int64") + segment_ids = paddle.static.InputSpec(name="segment_ids", shape=[-1, -1], dtype="int64") + position_ids = None + input_mask = paddle.static.InputSpec(name="input_mask", shape=[-1, 1, 1, -1], dtype="float32") + masked_lm_positions = paddle.static.InputSpec(name="masked_lm_positions", shape=[-1], dtype="int32") + return [input_ids, segment_ids, position_ids, input_mask, masked_lm_positions] + + +class PretrainingDataset(Dataset): + def __init__(self, input_file, max_pred_length): + self.input_file = input_file + self.max_pred_length = max_pred_length + f = h5py.File(input_file, "r") + keys = [ + "input_ids", + "input_mask", + "segment_ids", + "masked_lm_positions", + "masked_lm_ids", + "next_sentence_labels", + ] + self.inputs = [np.asarray(f[key][:]) for key in keys] + f.close() + + def __len__(self): + "Denotes the total number of samples" + return len(self.inputs[0]) + + def __getitem__(self, index): + + [input_ids, input_mask, segment_ids, masked_lm_positions, masked_lm_ids, next_sentence_labels] = [ + input[index].astype(np.int64) if indice < 5 else np.asarray(input[index].astype(np.int64)) + for indice, input in enumerate(self.inputs) + ] + # TODO: whether to use reversed mask by changing 1s and 0s to be + # consistent with nv bert + input_mask = (1 - np.reshape(input_mask.astype(np.float32), [1, 1, input_mask.shape[0]])) * -1e9 + + index = self.max_pred_length + # store number of masked tokens in index + # outputs of torch.nonzero diff with that of numpy.nonzero by zip + padded_mask_indices = (masked_lm_positions == 0).nonzero()[0] + if len(padded_mask_indices) != 0: + index = padded_mask_indices[0].item() + # mask_token_num = index + else: + index = self.max_pred_length + # mask_token_num = self.max_pred_length + # masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64) + # masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index] + masked_lm_labels = masked_lm_ids[:index] + masked_lm_positions = masked_lm_positions[:index] + # softmax_with_cross_entropy enforce last dim size equal 1 + masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1) + next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1) + + return [input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels] + + +def do_train(): + data_args, training_args, model_args = PdArgumentParser( + [DataArguments, TrainingArguments, ModelArguments] + ).parse_args_into_dataclasses() + training_args: TrainingArguments = training_args + model_args: ModelArguments = model_args + data_args: DataArguments = data_args + + training_args.print_config(data_args, "Data") + training_args.print_config(model_args, "Model") + training_args.print_config(model_args, "Training") + + model_args.model_type = model_args.model_type.lower() + model_class, tokenizer_class = 
MODEL_CLASSES[model_args.model_type] + + tokenizer = tokenizer_class.from_pretrained(model_args.model_name_or_path) + + config = model_class.config_class.from_pretrained(model_args.model_name_or_path) + config.fuse = model_args.fuse_transformer + model = model_class(config) + + data_file = get_train_data_file(data_args) + train_dataset = PretrainingDataset(input_file=data_file, max_pred_length=model_args.max_predictions_per_seq) + + # decorate @to_static for benchmark, skip it by default. + if model_args.to_static: + specs = create_input_specs() + model = paddle.jit.to_static(model, input_spec=specs) + logger.info("Successfully to apply @to_static with specs: {}".format(specs)) + + trainer = Trainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=None, + tokenizer=tokenizer, + ) + # training + if training_args.do_train: + train_result = trainer.train() + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/bert/static/README.md b/model_zoo/bert/static/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fa3d5f62b28db0703a35415f8d4a7c9f0633b324 --- /dev/null +++ b/model_zoo/bert/static/README.md @@ -0,0 +1,153 @@ +# BERT Benchmark with Fleet API +## 模型简介 + +[BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers)以[Transformer](https://arxiv.org/abs/1706.03762) 编码器为网络基本组件,使用掩码语言模型(Masked Language Model)和邻接句子预测(Next Sentence Prediction)两个任务在大规模无标注文本语料上进行预训练(pre-train),得到融合了双向内容的通用语义表示模型。以预训练产生的通用语义表示模型为基础,结合任务适配的简单输出层,微调(fine-tune)后即可应用到下游的NLP任务,效果通常也较直接在下游的任务上训练的模型更优。此前BERT即在[GLUE评测任务](https://gluebenchmark.com/tasks)上取得了SOTA的结果。 + +本项目是BERT在 Paddle 2.0上的开源实现,包含了预训练和[GLUE评测任务](https://gluebenchmark.com/tasks)上的微调代码。 + +## 快速开始 + +### 数据准备 + +#### Pre-training数据准备 + +`create_pretraining_data.py` 是创建预训练程序所需数据的脚本。其以文本文件(使用换行符换行和空白符分隔,data目录下提供了部分示例数据)为输入,经由BERT tokenizer进行tokenize后再做生成sentence pair正负样本、掩码token等处理,最后输出hdf5格式的数据文件。使用方式如下: + +```shell +python create_pretraining_data.py \ + --input_file=data/sample_text.txt \ + --output_file=data/training_data.hdf5 \ + --bert_model=bert-base-uncased \ + --max_seq_length=128 \ + --max_predictions_per_seq=20 \ + --masked_lm_prob=0.15 \ + --random_seed=12345 \ + --dupe_factor=5 +``` + +其中参数释义如下: +- `input_file` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有`.txt`文件。 +- `output_file` 指定输出文件。 +- `bert_model` 指定使用特定BERT模型对应的tokenizer进行tokenize处理。 +- `max_seq_length` 指定最大句子长度,超过该长度将被截断,不足该长度的将会进行padding。 +- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目。 +- `masked_lm_prob` 表示每个token被mask的概率。 +- `random_seed` 指定随机种子。 +- `dupe_factor` 指定输入数据被重复处理的次数,每次处理将重新产生随机mask。 + +使用以上预训练数据生成程序可以用于处理领域垂类数据后进行二次预训练。若需要使用BERT论文中预训练使用的英文Wiki和BookCorpus数据,可以参考[这里](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT)进行处理,得到的数据可以直接接入本项目中的预训练程序使用。 + +#### Fine-tuning数据准备 +Fine-tuning的数据集已经被PaddleNLP框架集成,只需要填写相应的数据集的名称,PaddleNLP会自动下载数据集,具体的使用方法可以参考 `run_glue.py` 脚本。 + +##### GLUE评测任务数据 + +GLUE评测任务所含数据集已在paddlenlp中以API形式提供,无需预先准备,使用`run_glue.py`执行微调时将会自动下载。 + +### 执行Pre-training + +#### GPU训练 +```shell +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0" run_pretrain.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_predictions_per_seq 20 \ + --batch_size 32 \ + 
    --learning_rate 1e-4 \
+    --weight_decay 1e-2 \
+    --adam_epsilon 1e-6 \
+    --warmup_steps 10000 \
+    --input_dir data/ \
+    --output_dir pretrained_models/ \
+    --logging_steps 1 \
+    --save_steps 20000 \
+    --max_steps 1000000 \
+    --device gpu \
+    --use_amp False
+```
+其中参数释义如下:
+- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。
+- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。
+- `max_predictions_per_seq` 表示每个句子中会被mask的token的最大数目,与创建预训练数据时的设置一致。
+- `batch_size` 表示每次迭代**每张卡**上的样本数目。
+- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。
+- `weight_decay` 表示AdamW优化器中使用的weight_decay的系数。
+- `adam_epsilon` 表示AdamW优化器中使用的epsilon值。
+- `warmup_steps` 表示动态学习率热启的step数。
+- `num_train_epochs` 表示训练轮数。
+- `input_dir` 表示输入数据的目录,该目录下所有文件名中包含training的文件将被作为训练数据。
+- `output_dir` 表示模型的保存目录。
+- `logging_steps` 表示日志打印间隔。
+- `save_steps` 表示模型保存及评估间隔。
+- `max_steps` 表示最大训练步数。若训练`num_train_epochs`轮包含的训练步数大于该值,则达到`max_steps`后就提前结束。
+- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。
+- `use_amp` 指示是否启用自动混合精度训练。
+**NOTICE**: 预训练时data目录存放的是经过 `create_pretraining_data.py` 处理后的数据,因此需要通过该数据处理脚本预先处理,否则预训练将会出现报错。
+
+### 执行Fine-tuning
+
+以GLUE中的SST-2任务为例,启动Fine-tuning的方式如下:
+
+```shell
+unset CUDA_VISIBLE_DEVICES
+python -m paddle.distributed.launch --gpus "0" run_glue.py \
+    --model_type bert \
+    --model_name_or_path bert-base-uncased \
+    --task_name SST-2 \
+    --max_seq_length 128 \
+    --batch_size 32 \
+    --learning_rate 2e-5 \
+    --num_train_epochs 3 \
+    --logging_steps 1 \
+    --save_steps 500 \
+    --output_dir ./tmp/ \
+    --device gpu
+```
+
+其中参数释义如下:
+- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。
+- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer。若模型相关内容保存在本地,这里也可以提供相应目录地址。注:`bert-base-uncased`等对应使用的预训练模型转自[huggingface/transformers](https://github.com/huggingface/transformers),具体可参考当前目录下converter中的内容。
+- `task_name` 表示Fine-tuning的任务。
+- `max_seq_length` 表示最大句子长度,超过该长度将被截断。
+- `batch_size` 表示每次迭代**每张卡**上的样本数目。
+- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。
+- `num_train_epochs` 表示训练轮数。
+- `logging_steps` 表示日志打印间隔。
+- `save_steps` 表示模型保存及评估间隔。
+- `output_dir` 表示模型保存路径。
+- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。
+
+基于`bert-base-uncased`在GLUE各评测任务上Fine-tuning后,在验证集上有如下结果:
+
+| Task  | Metric                       | Result      |
+|-------|------------------------------|-------------|
+| CoLA  | Matthews corr                | 59.90       |
+| SST-2 | Accuracy                     | 92.76       |
+| STS-B | Pearson/Spearman corr        | 89.12       |
+| MNLI  | matched acc./mismatched acc. | 84.45/84.62 |
+| QNLI  | acc.                         | 91.73       |
+| RTE   | acc.                         | 67.15       |
+
+### 预测
+
+Fine-tuning过程中,`run_glue.py` 会按 `save_steps` 间隔将可直接用于预测的推理模型保存到 `output_dir` 下。
+Fine-tuning完成后,可以按照如下方式,使用保存的推理模型对GLUE评测任务进行预测(基于Paddle的[Python预测API](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/05_inference_deployment/inference/python_infer_cn.html)):
+
+```shell
+python -u ./predict_glue.py \
+    --task_name SST-2 \
+    --model_type bert \
+    --model_path ./tmp/model_20/infer_model \
+    --batch_size 32 \
+    --max_seq_length 128
+```
+
+其中参数释义如下:
+- `task_name` 表示Fine-tuning的任务。
+- `model_type` 指示了模型类型,使用BERT模型时设置为bert即可。
+- `model_path` 表示预测模型文件的前缀,即 `run_glue.py` 保存推理模型时使用的路径前缀。
+- `batch_size` 表示每个预测批次的样本数目。
+- `max_seq_length` 表示最大句子长度,超过该长度将被截断。
+
+**NOTICE**: 预测脚本中的 './tmp/model_20/infer_model' 是 run_glue.py 中保存下来的模型,具体的模型路径请根据实际保存的路径设定。
diff --git a/model_zoo/bert/static/create_pretraining_data.py b/model_zoo/bert/static/create_pretraining_data.py
new file mode 100644
index 0000000000000000000000000000000000000000..09394efc722c9fb8badebf1cbc64f11711192729
--- /dev/null
+++ b/model_zoo/bert/static/create_pretraining_data.py
@@ -0,0 +1,474 @@
+# coding=utf-8
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
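The six HDF5 datasets written by `write_instance_to_example_file` below are exactly the keys that `PretrainingDataset` (in `run_pretrain.py` and `dataset.py`) reads back for training. As a quick sanity check after running the data-creation command from the README, something along these lines should work (a minimal sketch; the file name follows the README example, and the shapes assume the `max_seq_length=128` / `max_predictions_per_seq=20` settings shown there):

```python
import h5py

# Inspect the HDF5 file produced by create_pretraining_data.py.
with h5py.File("data/training_data.hdf5", "r") as f:
    for key in [
        "input_ids",             # [num_instances, max_seq_length]
        "input_mask",            # [num_instances, max_seq_length]
        "segment_ids",           # [num_instances, max_seq_length]
        "masked_lm_positions",   # [num_instances, max_predictions_per_seq]
        "masked_lm_ids",         # [num_instances, max_predictions_per_seq]
        "next_sentence_labels",  # [num_instances]
    ]:
        print(key, f[key].shape, f[key].dtype)
```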
+"""Create masked LM/next sentence masked_lm examples for BERT.""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import argparse +import collections +import os +import random +from io import open + +import h5py +import numpy as np +from tqdm import tqdm + +from paddlenlp.transformers import BertTokenizer +from paddlenlp.transformers.tokenizer_utils import convert_to_unicode + + +class TrainingInstance(object): + """A single training instance (sentence pair).""" + + def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels, is_random_next): + self.tokens = tokens + self.segment_ids = segment_ids + self.is_random_next = is_random_next + self.masked_lm_positions = masked_lm_positions + self.masked_lm_labels = masked_lm_labels + + +def write_instance_to_example_file(instances, tokenizer, max_seq_length, max_predictions_per_seq, output_file): + """Create example files from `TrainingInstance`s.""" + + total_written = 0 + features = collections.OrderedDict() + + num_instances = len(instances) + features["input_ids"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["input_mask"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["segment_ids"] = np.zeros([num_instances, max_seq_length], dtype="int32") + features["masked_lm_positions"] = np.zeros([num_instances, max_predictions_per_seq], dtype="int32") + features["masked_lm_ids"] = np.zeros([num_instances, max_predictions_per_seq], dtype="int32") + features["next_sentence_labels"] = np.zeros(num_instances, dtype="int32") + + for inst_index, instance in enumerate(tqdm(instances)): + input_ids = tokenizer.convert_tokens_to_ids(instance.tokens) + input_mask = [1] * len(input_ids) + segment_ids = list(instance.segment_ids) + assert len(input_ids) <= max_seq_length + + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + masked_lm_positions = list(instance.masked_lm_positions) + masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels) + masked_lm_weights = [1.0] * len(masked_lm_ids) + + while len(masked_lm_positions) < max_predictions_per_seq: + masked_lm_positions.append(0) + masked_lm_ids.append(0) + masked_lm_weights.append(0.0) + + next_sentence_label = 1 if instance.is_random_next else 0 + + features["input_ids"][inst_index] = input_ids + features["input_mask"][inst_index] = input_mask + features["segment_ids"][inst_index] = segment_ids + features["masked_lm_positions"][inst_index] = masked_lm_positions + features["masked_lm_ids"][inst_index] = masked_lm_ids + features["next_sentence_labels"][inst_index] = next_sentence_label + + total_written += 1 + + print("saving data") + f = h5py.File(output_file, "w") + f.create_dataset("input_ids", data=features["input_ids"], dtype="i4", compression="gzip") + f.create_dataset("input_mask", data=features["input_mask"], dtype="i1", compression="gzip") + f.create_dataset("segment_ids", data=features["segment_ids"], dtype="i1", compression="gzip") + f.create_dataset("masked_lm_positions", data=features["masked_lm_positions"], dtype="i4", compression="gzip") + f.create_dataset("masked_lm_ids", data=features["masked_lm_ids"], dtype="i4", compression="gzip") + f.create_dataset("next_sentence_labels", data=features["next_sentence_labels"], dtype="i1", compression="gzip") + f.flush() + f.close() + + +def 
create_training_instances( + input_files, tokenizer, max_seq_length, dupe_factor, short_seq_prob, masked_lm_prob, max_predictions_per_seq, rng +): + """Create `TrainingInstance`s from raw text.""" + all_documents = [[]] + + # Input file format: + # (1) One sentence per line. These should ideally be actual sentences, not + # entire paragraphs or arbitrary spans of text. (Because we use the + # sentence boundaries for the "next sentence prediction" task). + # (2) Blank lines between documents. Document boundaries are needed so + # that the "next sentence prediction" task doesn't span between documents. + for input_file in input_files: + print("creating instance from {}".format(input_file)) + with open(input_file, "r", encoding="UTF-8") as reader: + while True: + line = convert_to_unicode(reader.readline()) + if not line: + break + line = line.strip() + + # Empty lines are used as document delimiters + if not line: + all_documents.append([]) + tokens = tokenizer.tokenize(line) + if tokens: + all_documents[-1].append(tokens) + + # Remove empty documents + all_documents = [x for x in all_documents if x] + rng.shuffle(all_documents) + + # vocab_words = list(tokenizer.vocab.keys()) + vocab_words = list(tokenizer.vocab.token_to_idx.keys()) + instances = [] + for _ in range(dupe_factor): + for document_index in range(len(all_documents)): + instances.extend( + create_instances_from_document( + all_documents, + document_index, + max_seq_length, + short_seq_prob, + masked_lm_prob, + max_predictions_per_seq, + vocab_words, + rng, + ) + ) + + rng.shuffle(instances) + return instances + + +def create_instances_from_document( + all_documents, + document_index, + max_seq_length, + short_seq_prob, + masked_lm_prob, + max_predictions_per_seq, + vocab_words, + rng, +): + """Creates `TrainingInstance`s for a single document.""" + document = all_documents[document_index] + + # Account for [CLS], [SEP], [SEP] + max_num_tokens = max_seq_length - 3 + + # We *usually* want to fill up the entire sequence since we are padding + # to `max_seq_length` anyways, so short sequences are generally wasted + # computation. However, we *sometimes* + # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter + # sequences to minimize the mismatch between pre-training and fine-tuning. + # The `target_seq_length` is just a rough target however, whereas + # `max_seq_length` is a hard limit. + target_seq_length = max_num_tokens + if rng.random() < short_seq_prob: + target_seq_length = rng.randint(2, max_num_tokens) + + # We DON'T just concatenate all of the tokens from a document into a long + # sequence and choose an arbitrary split point because this would make the + # next sentence prediction task too easy. Instead, we split the input into + # segments "A" and "B" based on the actual "sentences" provided by the user + # input. + instances = [] + current_chunk = [] + current_length = 0 + i = 0 + while i < len(document): + segment = document[i] + current_chunk.append(segment) + current_length += len(segment) + if i == len(document) - 1 or current_length >= target_seq_length: + if current_chunk: + # `a_end` is how many segments from `current_chunk` go into the `A` + # (first) sentence. 
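+                # At this point `current_chunk` holds whole sentences whose combined
+                # length is roughly `target_seq_length`. The chunk is split into segment A
+                # (the first `a_end` sentences) and segment B, which is either the rest of
+                # the chunk (is_random_next=False) or sentences drawn from a different,
+                # randomly chosen document (is_random_next=True, ~50% of the time below).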
+ a_end = 1 + if len(current_chunk) >= 2: + a_end = rng.randint(1, len(current_chunk) - 1) + + tokens_a = [] + for j in range(a_end): + tokens_a.extend(current_chunk[j]) + + tokens_b = [] + # Random next + is_random_next = False + if len(current_chunk) == 1 or rng.random() < 0.5: + is_random_next = True + target_b_length = target_seq_length - len(tokens_a) + + # This should rarely go for more than one iteration for large + # corpora. However, just to be careful, we try to make sure that + # the random document is not the same as the document + # we're processing. + for _ in range(10): + random_document_index = rng.randint(0, len(all_documents) - 1) + if random_document_index != document_index: + break + + # If picked random document is the same as the current document + if random_document_index == document_index: + is_random_next = False + + random_document = all_documents[random_document_index] + random_start = rng.randint(0, len(random_document) - 1) + for j in range(random_start, len(random_document)): + tokens_b.extend(random_document[j]) + if len(tokens_b) >= target_b_length: + break + # We didn't actually use these segments so we "put them back" so + # they don't go to waste. + num_unused_segments = len(current_chunk) - a_end + i -= num_unused_segments + # Actual next + else: + is_random_next = False + for j in range(a_end, len(current_chunk)): + tokens_b.extend(current_chunk[j]) + truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng) + + assert len(tokens_a) >= 1 + assert len(tokens_b) >= 1 + + tokens = [] + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in tokens_a: + tokens.append(token) + segment_ids.append(0) + + tokens.append("[SEP]") + segment_ids.append(0) + + for token in tokens_b: + tokens.append(token) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + (tokens, masked_lm_positions, masked_lm_labels) = create_masked_lm_predictions( + tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng + ) + instance = TrainingInstance( + tokens=tokens, + segment_ids=segment_ids, + is_random_next=is_random_next, + masked_lm_positions=masked_lm_positions, + masked_lm_labels=masked_lm_labels, + ) + instances.append(instance) + current_chunk = [] + current_length = 0 + i += 1 + + return instances + + +MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"]) + + +def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng): + """Creates the predictions for the masked LM objective.""" + + cand_indexes = [] + for (i, token) in enumerate(tokens): + if token == "[CLS]" or token == "[SEP]": + continue + cand_indexes.append(i) + + rng.shuffle(cand_indexes) + + output_tokens = list(tokens) + + num_to_predict = min(max_predictions_per_seq, max(1, int(round(len(tokens) * masked_lm_prob)))) + + masked_lms = [] + covered_indexes = set() + for index in cand_indexes: + if len(masked_lms) >= num_to_predict: + break + if index in covered_indexes: + continue + covered_indexes.add(index) + + masked_token = None + # 80% of the time, replace with [MASK] + if rng.random() < 0.8: + masked_token = "[MASK]" + else: + # 10% of the time, keep original + if rng.random() < 0.5: + masked_token = tokens[index] + # 10% of the time, replace with random word + else: + masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] + + output_tokens[index] = masked_token + + masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) + + masked_lms = sorted(masked_lms, key=lambda 
x: x.index) + + masked_lm_positions = [] + masked_lm_labels = [] + for p in masked_lms: + masked_lm_positions.append(p.index) + masked_lm_labels.append(p.label) + + return (output_tokens, masked_lm_positions, masked_lm_labels) + + +def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): + """Truncates a pair of sequences to a maximum sequence length.""" + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_num_tokens: + break + + trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b + assert len(trunc_tokens) >= 1 + + # We want to sometimes truncate from the front and sometimes from the + # back to add more randomness and avoid biases. + if rng.random() < 0.5: + del trunc_tokens[0] + else: + trunc_tokens.pop() + + +def main(): + + parser = argparse.ArgumentParser() + + parser.add_argument( + "--input_file", + default=None, + type=str, + required=True, + help="The input train corpus. can be directory with .txt files or a path to a single file", + ) + parser.add_argument( + "--output_file", + default=None, + type=str, + required=True, + help="The output file where created hdf5 formatted data will be written.", + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + required=False, + help="The vocabulary the BERT model will train on. " + "Use bert_model argument would ignore this. " + "The bert_model argument is recommended.", + ) + parser.add_argument( + "--do_lower_case", + action="store_true", + default=True, + help="Whether to lower case the input text. True for uncased models, False for cased models. " + "Use bert_model argument would ignore this. The bert_model argument is recommended.", + ) + parser.add_argument( + "--bert_model", + default="bert-base-uncased", + type=str, + required=False, + help="Bert pre-trained model selected in the list: bert-base-uncased, " + "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese." + "If provided, use the pre-trained model used tokenizer to create data " + "and ignore vocab_file and do_lower_case.", + ) + + # Other parameters + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after WordPiece tokenization. \n" + "Sequences longer than this will be truncated, and sequences shorter \n" + "than this will be padded.", + ) + parser.add_argument( + "--dupe_factor", + default=10, + type=int, + help="Number of times to duplicate the input data (with different masks).", + ) + parser.add_argument( + "--max_predictions_per_seq", default=20, type=int, help="Maximum number of masked LM predictions per sequence." + ) + + # floats + parser.add_argument("--masked_lm_prob", default=0.15, type=float, help="Masked LM probability.") + parser.add_argument( + "--short_seq_prob", + default=0.1, + type=float, + help="Probability to create a sequence shorter than maximum sequence length", + ) + + parser.add_argument("--random_seed", type=int, default=12345, help="random seed for initialization") + + args = parser.parse_args() + print(args) + + if args.bert_model: + tokenizer = BertTokenizer.from_pretrained(args.bert_model) + else: + assert args.vocab_file, "vocab_file must be set If bert_model is not provided." 
+ tokenizer = BertTokenizer(args.vocab_file, do_lower_case=args.do_lower_case) + + input_files = [] + if os.path.isfile(args.input_file): + input_files.append(args.input_file) + elif os.path.isdir(args.input_file): + input_files = [ + os.path.join(args.input_file, f) + for f in os.listdir(args.input_file) + if (os.path.isfile(os.path.join(args.input_file, f)) and f.endswith(".txt")) + ] + else: + raise ValueError("{} is not a valid path".format(args.input_file)) + + rng = random.Random(args.random_seed) + instances = create_training_instances( + input_files, + tokenizer, + args.max_seq_length, + args.dupe_factor, + args.short_seq_prob, + args.masked_lm_prob, + args.max_predictions_per_seq, + rng, + ) + + output_file = args.output_file + + write_instance_to_example_file( + instances, tokenizer, args.max_seq_length, args.max_predictions_per_seq, output_file + ) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/bert/static/data/sample_text.txt b/model_zoo/bert/static/data/sample_text.txt new file mode 100644 index 0000000000000000000000000000000000000000..75ec60cdb7842e023d32f95f0e16e1973eff4b71 --- /dev/null +++ b/model_zoo/bert/static/data/sample_text.txt @@ -0,0 +1,100 @@ +Zulfiqar A. Bhutta trained as a physician in Pakistan in the early stages of his career. +He holds titles across various organizations in diverse geographies. +Professor Bhutta is the Founding Director of the Center of Excellence in Women and Child Health & Institute for Global Child Health & Development, at the Aga Khan University South-Central Asia, East Africa & United Kingdom. +He is currently the Co-Director at the Centre for Global Child Health, at the Hospital for Sick Children and leads many projects as a Senior Scientist at the Research Institute in the Centre for Global Child Health at Sick Kids. +He holds a Professorship at the University of Toronto in the Department of Nutritional Sciences and the Division of Epidemiology, Dalla Lana School of Public Health. +Additionally, he holds concurrent professorship at the Department of Paediatrics, Aga Khan University in Karachi, Pakistan and at the Schools of Public Health of Johns Hopkins University, Tufts University, Boston University, University of Alberta and the London School of Hygiene & Tropical Medicine. +He is a designated Distinguished National Professor of the Government of Pakistan and was the Founding Chair of the National Research Ethics Committee of the Government of Pakistan from 2003-2014. +Dr. Bhutta received his MBBS from Khyber Medical College in Peshawar, Pakistan in 1977 at which time he was names "Best Graduate of the Year" and awarded the University Gold Medal for overall distinction. +His PhD work was completed at Karolinska Institute in Stockholm, Sweden in 1996. +He is a Fellow of the Royal College of Physicians (Edinburgh & London), the Royal College of Paediatrics and Child Health (London), American Academy of Paediatrics and the Pakistan Academy of Sciences. +Following the completion of his PhD Dr. Bhutta began working as House Surgeon in Obstetrics & Gynecology at the Khyber Teaching Hospital, Peshawar (April-November 1978). +He began work in paediatrics as a physician in November of 1978 in the Professorial Unit at the Institute of Child Health, Jinnah Postgraduate Medical Centre, Karachi (Pakistan). +Through 1980's he continued his work as a surgeon and paediatrician. 
+He undertook his first professor position in the Department of Paediatrics, The Aga Khan University Hospital, Karachi (Pakistan), from November 1987 to June 1992. +In 2005, Dr. Bhutta became the Chairman of the Department of Paediatrics & Child Health at the Aga Khan University & Medical Center, a position held until 2008. +Following his term as Chairman he became The Noordin Noormahomed Sheriff Professor & Founding Chair, Division of Women & Child Health, The Aga Khan University, a position he held for four years. +Dr. Bhutta currently holds the titles of co-director of the Centre for Global Child Health at the Hospital for Sick Children in Toronto, and founding director of the Centre of Excellence in Women and Child Health at the Aga Khan University. +In 2020, he was appointed founding director of the Institute for Global child Health & Development at the Aga Khan University and elected Fellow to the Royal Society, United Kingdom. +Outside of his professional responsibilities Dr. Bhutta serves on various local and international boards and committees, including a series of editorial boards. +In his various capacities Dr. Bhutta has produced a large collection of publications working with his teams at Sick Kids, AKU and international partners. +These include book reviews, chapters, 1. +"Haematological disorders" "Neonatal Jaundice" in Neonatal Vade‑Mecum, Fleming PJ, Speidel BD, Dunn PM Eds, Lloyd‑Luke Publishers, UK, 1986. +Revised 2nd Edition 1991. +2. +"Nutritional management of acute and persistent diarrhoea". +A M Molla, Bhutta Z A and  A Molla. +In McNeish A S, Mittal S K and Walker-Smith J A (eds). +Recent trends in diarrhoea and malnutrition, MAMC, Delhi, 1991, pp 37-51. +3. +"Paediatric Prescribing” in "Text book of Paediatrics for developing countries"            Arif MA, Hanif SM, Wasti SMK Eds, 1989, 2nd Edition 1996,  PPA, Karachi. +& Lahore 4. +"Innovations in neonatal care : Impact on neonatal survival in the developing world:. +Bhutta Z A  Zaidi S (Editor) 1992. +TWEL Publisher. +Karachi pp 121-131 5. +"Short course therapy in Pediatrics" Bhutta Z A& Teele D.  In Tice A D, Waldvogel F (Eds), Contemporary issues in Infectious Disease Epidemiology and Management, 1993 Gardiner Caldwell, Cheshire, pp 52 - 60. +6. +"Dietary management of persistent diarrhoea". +Bhutta Z A, Molla A M, Issani Z. +In Reflections on  Diarrhoeal Disease & Nutrition  of Children". +1993 Karachi, pp 97 - 103. +7. +"Prescribing practices amongst general practitioners (GPs) and consultant paediatricians in childhood diarrhoea.”  S.Q. +Nizami, I.A. +Khan, Bhutta Z A. +In "Reflections on Diarrhoeal Disease and Nutrition of Children". +1993 Karachi, pp  88-90. +8. +"The challenge of multidrug-resistant typhoid". +Bhutta Z A. +In Puri R K, Sachdev H P S, Choudhry P, Verma I C (Eds), Current concepts in Paediatrics, 1994. +Jaypee Publishers, New Delhi, pp 403.8. +9. +"Perinatal Care in Pakistan: Current status and trends". +In Proceedings of the Workshop in Reproductive Health. +College of Physicians and Surgeons, Pakistan, Karachi, 1995, pp 95-103. +10. +“A study of whole body protein kinetics in malnourished children with persistent diarrhoea” Bhutta Z A, Nizami SQ, Isani Z, Hardy S, Hendricks K, Young V.   Report of the second RCM coordinated Research Programme for application of stable isotope tracer methods to studies of energy metabolism in malnourished populations of developing countries. +NAHRES-30 1996 IAEA Vienna. +11. +"Pneumococcal infections in Pakistan: a country report". 
+In Adult Immunization in Asia, Fondation Mercel Merieux, Lyon, 1998. pp 79-82. +12. +“Factors affecting protein and aminoacid metabolism in childhood from developing countries". +In Child Nutrition: an international perspective. +Editors Solomons NW, Caballero B, Brown KH. +CRC Press 1998. +13. +"Protein Digestion and Bioavailability". +In Encyclopedia of Human Nutrition. +Editors: Sadler M, Strain JJ, Caballero B. +Academic Press (London), 1998 pp.1646-54. +14. +"Perinatal Care in Pakistan. +Reproductive Health: A manual for family practice and primary health care. +Bhutta Z A, Maqbool S.  College of Physicians and Surgeons, Pakistan, Karachi, 1999, pp 69-78. +15. +“Effective interventions to reduce neonatal mortality and morbidity from perinatal infection. +Bhutta ZA. +In Costello A, Manandhar D (eds). +"Improving Newborn Infant Health in Developing Countries’ 1999. +Imperial College Press, London pp.289-308. +16. +“Ambulatory management of typhoid fever”            “Risk factors and management of micronutrient deficiencies”            “Management of persistent diarrhoea in developing countries”. +In Manual of International Child Health, British Medical Journal, 2000 (in press). +17. +“The role of Cefixime in typhoid fever during childhood” in Cefixime, Adam D, Quintiliani R (Eds), Torre-Lazur-McCann, Tokyo, 2000; pp.107-112. +18. +"Micronutrients and Child Health in the Commonwealth”, Commonwealth Foundation" (UK) (2001). +19. +"Isotopic evaluation of breast milk intake, energy metabolism growth and body composition of exclusively breastfed infants in Pakistan". +Bhutta ZA, Nizami SQ, Weaver LT, Preston T. In Application of Stable Isotopes to evaluate Growth and Body Composition of Exclusively Breastfed infants, IAEA and WHO, NAHRES Report. +2000. +20. +“Typhoid Fever in Childhood: the south Asian experience”. +Ahmad K &Bhutta ZA. +In "Recent Advances in Paediatrics", Gupte S (Ed), 2000, India . +21. +“Neonatal Infections in developing countries” in  Carrera JM, Cabero L, Baraibar R (Eds). +The Perinatal Medicine of the new Millennium. \ No newline at end of file diff --git a/model_zoo/bert/static/dataset.py b/model_zoo/bert/static/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..e25357ad6a62a1c276fba03f51961a7cdbf7e1b7 --- /dev/null +++ b/model_zoo/bert/static/dataset.py @@ -0,0 +1,139 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
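The `_collate_data` helper below (like the Trainer collator in `run_pretrain.py` above) flattens the per-example masked positions into a single 1-D gather index over the concatenated `[batch * seq_length]` token axis, padding the index length to a multiple of 8 so fp16/int8 kernels get friendly shapes. A minimal NumPy sketch of that indexing scheme, using made-up toy values:

```python
import numpy as np

# Toy batch: 2 examples, seq_length 8, with 2 and 1 masked positions respectively.
seq_length = 8
masked_positions = [[1, 5], [3]]      # per-example masked token positions (as in x[3])
masked_ids = [[1037, 2003], [2128]]   # per-example masked token label ids (as in x[4])

size = sum(len(p) for p in masked_positions)
if size % 8 != 0:                     # pad so the gather index length is a multiple of 8
    size += 8 - (size % 8)

positions = np.zeros(size, dtype=np.int32)
labels = np.full([size, 1], -1, dtype=np.int64)   # -1 marks padded (ignored) label slots
k = 0
for i, (pos_list, id_list) in enumerate(zip(masked_positions, masked_ids)):
    for pos, tok in zip(pos_list, id_list):
        positions[k] = i * seq_length + pos       # offset into the flattened token axis
        labels[k] = tok
        k += 1

print(positions)  # [ 1  5 11  0  0  0  0  0]; padded entries point at token 0 of example 0
```

A model can then reshape its sequence output to `[batch * seq_length, hidden]` and gather the masked-token representations with this index.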
+ +import h5py +import numpy as np +import paddle +from paddle.io import DataLoader, Dataset + +from paddlenlp.data import Stack + + +def create_pretraining_dataset(input_file, max_pred_length, args, data_holders, worker_init=None, places=None): + train_data = PretrainingDataset(input_file=input_file, max_pred_length=max_pred_length) + train_batch_sampler = paddle.io.BatchSampler(train_data, batch_size=args.batch_size, shuffle=True) + + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # input_ids, segment_ids, input_mask, masked_lm_positions, + # masked_lm_labels, next_sentence_labels, mask_token_num + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + _, seq_length = out[0].shape + size = sum(len(x[3]) for x in data) + # Padding for divisibility by 8 for fp16 or int8 usage + if size % 8 != 0: + size += 8 - (size % 8) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + # mask_token_num + out.append(np.asarray([mask_token_num], dtype=np.float32)) + if args.use_amp and args.use_pure_fp16: + # cast input_mask to fp16 + out[2] = out[2].astype(np.float16) + # cast masked_lm_scale to fp16 + out[-1] = out[-1].astype(np.float16) + return out + + train_data_loader = DataLoader( + dataset=train_data, + places=places, + feed_list=data_holders, + batch_sampler=train_batch_sampler, + collate_fn=_collate_data, + num_workers=0, + worker_init_fn=worker_init, + return_list=False, + ) + return train_data_loader, input_file + + +def create_data_holder(args): + input_ids = paddle.static.data(name="input_ids", shape=[-1, -1], dtype="int64") + segment_ids = paddle.static.data(name="segment_ids", shape=[-1, -1], dtype="int64") + input_mask = paddle.static.data(name="input_mask", shape=[-1, 1, 1, -1], dtype="float32") + masked_lm_positions = paddle.static.data(name="masked_lm_positions", shape=[-1], dtype="int32") + masked_lm_labels = paddle.static.data(name="masked_lm_labels", shape=[-1, 1], dtype="int64") + next_sentence_labels = paddle.static.data(name="next_sentence_labels", shape=[-1, 1], dtype="int64") + masked_lm_scale = paddle.static.data(name="masked_lm_scale", shape=[-1, 1], dtype="float32") + return [ + input_ids, + segment_ids, + input_mask, + masked_lm_positions, + masked_lm_labels, + next_sentence_labels, + masked_lm_scale, + ] + + +class PretrainingDataset(Dataset): + def __init__(self, input_file, max_pred_length): + self.input_file = input_file + self.max_pred_length = max_pred_length + f = h5py.File(input_file, "r") + keys = [ + "input_ids", + "input_mask", + "segment_ids", + "masked_lm_positions", + "masked_lm_ids", + "next_sentence_labels", + ] + self.inputs = [np.asarray(f[key][:]) for key in keys] + f.close() + + def __len__(self): + "Denotes the total number of samples" + return len(self.inputs[0]) + + def __getitem__(self, index): + + [input_ids, input_mask, segment_ids, masked_lm_positions, masked_lm_ids, next_sentence_labels] = [ + input[index].astype(np.int64) if indice < 5 else np.asarray(input[index].astype(np.int64)) + for indice, input in enumerate(self.inputs) + ] + # TODO: whether to use reversed mask by changing 1s and 0s to be + # consistent with nv bert + input_mask = (1 - 
np.reshape(input_mask.astype(np.float32), [1, 1, input_mask.shape[0]])) * -1e4 + + index = self.max_pred_length + # store number of masked tokens in index + # outputs of torch.nonzero diff with that of numpy.nonzero by zip + padded_mask_indices = (masked_lm_positions == 0).nonzero()[0] + if len(padded_mask_indices) != 0: + index = padded_mask_indices[0].item() + # mask_token_num = index + else: + index = self.max_pred_length + # mask_token_num = self.max_pred_length + # masked_lm_labels = np.full(input_ids.shape, -1, dtype=np.int64) + # masked_lm_labels[masked_lm_positions[:index]] = masked_lm_ids[:index] + masked_lm_labels = masked_lm_ids[:index] + masked_lm_positions = masked_lm_positions[:index] + # softmax_with_cross_entropy enforce last dim size equal 1 + masked_lm_labels = np.expand_dims(masked_lm_labels, axis=-1) + next_sentence_labels = np.expand_dims(next_sentence_labels, axis=-1) + + return [input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels] diff --git a/model_zoo/bert/static/predict_glue.py b/model_zoo/bert/static/predict_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..8095066423d81947316efaa8e9d32bfcfecc44bd --- /dev/null +++ b/model_zoo/bert/static/predict_glue.py @@ -0,0 +1,146 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from functools import partial + +import paddle +from run_glue import METRIC_CLASSES, MODEL_CLASSES, convert_example + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to perform predict, selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_path", + default=None, + type=str, + required=True, + help="The path prefix of inference model to be used.", + ) + parser.add_argument( + "--device", + default="gpu", + choices=["gpu", "cpu", "xpu"], + help="Device selected for inference.", + ) + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size for predict.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. 
Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + args = parser.parse_args() + return args + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + config = paddle.inference.Config(args.model_path + ".pdmodel", args.model_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + predictor = paddle.inference.create_predictor(config) + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + return cls(predictor, input_handles, output_handles) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def predict(self, dataset, collate_fn, batch_size=1): + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=False) + data_loader = paddle.io.DataLoader( + dataset=dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, num_workers=0, return_list=True + ) + outputs = [] + for data in data_loader: + output = self.predict_batch(data) + outputs.append(output) + return outputs + + +def main(): + args = parse_args() + + predictor = Predictor.create_predictor(args) + + args.task_name = args.task_name.lower() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + test_ds = load_dataset("glue", args.task_name, splits="test") + tokenizer = tokenizer_class.from_pretrained(os.path.dirname(args.model_path)) + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=test_ds.label_list, + max_seq_length=args.max_seq_length, + is_test=True, + ) + test_ds = test_ds.map(trans_func) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + predictor.predict(test_ds, batch_size=args.batch_size, collate_fn=batchify_fn) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/bert/static/run_glue.py b/model_zoo/bert/static/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..d6cb54dc960f00297342b8a30a293778fe0a4265 --- /dev/null +++ b/model_zoo/bert/static/run_glue.py @@ -0,0 +1,395 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.distributed.fleet as fleet +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "sts-b": PearsonAndSpearman, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="Random seed for initialization") + parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") + args = parser.parse_args() + return args + + +def create_data_holder(task_name): + """ + Define the input data holder for the glue task. + """ + input_ids = paddle.static.data(name="input_ids", shape=[-1, -1], dtype="int64") + token_type_ids = paddle.static.data(name="token_type_ids", shape=[-1, -1], dtype="int64") + if task_name == "sts-b": + label = paddle.static.data(name="label", shape=[-1, 1], dtype="float32") + else: + label = paddle.static.data(name="label", shape=[-1, 1], dtype="int64") + + return [input_ids, token_type_ids, label] + + +def reset_program_state_dict(args, model, state_dict, pretrained_state_dict): + """ + Initialize the parameter from the bert config, and set the parameter by + reseting the state dict." + """ + reset_state_dict = {} + scale = ( + model.initializer_range + if hasattr(model, "initializer_range") + else getattr(model, args.model_type).config["initializer_range"] + ) + reset_parameter_names = [] + for n, p in state_dict.items(): + if n in pretrained_state_dict: + reset_state_dict[p.name] = np.array(pretrained_state_dict[n]) + reset_parameter_names.append(n) + elif p.name in pretrained_state_dict and "bert" in n: + reset_state_dict[p.name] = np.array(pretrained_state_dict[p.name]) + reset_parameter_names.append(n) + else: + dtype_str = "float32" + if str(p.dtype) == "VarType.FP64": + dtype_str = "float64" + reset_state_dict[p.name] = np.random.normal(loc=0.0, scale=scale, size=p.shape).astype(dtype_str) + logger.info("the following parameter had reset, please check. {}".format(reset_parameter_names)) + return reset_state_dict + + +def set_seed(args): + """ + Use the same data seed(for data shuffle) for all procs to guarantee data + consistency after sharding. + """ + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def evaluate(exe, metric, loss, correct, dev_program, data_loader, phase="eval"): + """ + The evaluate process, calcluate the eval loss and metric. 
+ """ + metric.reset() + returns = [loss] + if isinstance(correct, list) or isinstance(correct, tuple): + returns.extend(list(correct)) + else: + returns.append(correct) + for batch in data_loader: + exe.run(dev_program, feed=batch, fetch_list=returns) + return_numpys = exe.run(dev_program, feed=batch, fetch_list=returns) + metric_numpy = return_numpys[1] if len(return_numpys[1:]) == 1 else return_numpys[1:] + metric.update(metric_numpy) + res = metric.accumulate() + if isinstance(metric, Mcc): + print("%s loss: %f, mcc: %s" % (phase, return_numpys[0], res[0])) + elif isinstance(metric, PearsonAndSpearman): + print( + "%s loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s" + % (phase, return_numpys[0], res[0], res[1], res[2]) + ) + else: + print("%s loss: %f, acc: %s, " % (phase, return_numpys[0], res)) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """ + Convert a glue example into necessary features. + """ + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + # Set the paddle execute environment + paddle.enable_static() + place = paddle.set_device(args.device) + fleet.init(is_collective=True) + set_seed(args) + + # Create the main_program for the training and dev_program for the validation + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + dev_program = paddle.static.Program() + + # Get the configuration of tokenizer and model + args.task_name = args.task_name.lower() + args.model_type = args.model_type.lower() + metric_class = METRIC_CLASSES[args.task_name] + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # Create the tokenizer and dataset + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + train_ds = load_dataset("glue", args.task_name, splits="train") + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + + train_ds = train_ds.map(trans_func, lazy=True) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # token_type + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + + train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + # Define the input data and create the train/dev data_loader + with paddle.static.program_guard(main_program, startup_program): + [input_ids, token_type_ids, labels] = create_data_holder(args.task_name) + + train_data_loader = DataLoader( + dataset=train_ds, + feed_list=[input_ids, token_type_ids, labels], + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=False, + ) + + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + 
+ dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + feed_list=[input_ids, token_type_ids, labels], + num_workers=0, + return_list=False, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + feed_list=[input_ids, token_type_ids, labels], + return_list=False, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + feed_list=[input_ids, token_type_ids, labels], + return_list=False, + ) + + # Create the training-forward program, and clone it for the validation + with paddle.static.program_guard(main_program, startup_program): + num_class = 1 if train_ds.label_list is None else len(train_ds.label_list) + model, pretrained_state_dict = model_class.from_pretrained(args.model_name_or_path, num_classes=num_class) + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + logits = model(input_ids, token_type_ids) + loss = loss_fct(logits, labels) + dev_program = main_program.clone(for_test=True) + + # Create the training-backward program, this pass will not be + # executed in the validation + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + with paddle.static.program_guard(main_program, startup_program): + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + optimizer = fleet.distributed_optimizer(optimizer) + optimizer.minimize(loss) + + # Create the metric pass for the validation + with paddle.static.program_guard(dev_program, startup_program): + metric = metric_class() + correct = metric.compute(logits, labels) + + # Initialize the fine-tuning parameter, we will load the parameters in + # pre-training model. And initialize the parameter which not in pre-training model + # by the normal distribution. 
+ exe = paddle.static.Executor(place) + exe.run(startup_program) + state_dict = model.state_dict() + reset_state_dict = reset_program_state_dict(args, model, state_dict, pretrained_state_dict) + paddle.static.set_program_state(main_program, reset_state_dict) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + loss_return = exe.run(main_program, feed=batch, fetch_list=[loss]) + if global_step % args.logging_steps == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss_return[0], args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + lr_scheduler.step() + if global_step % args.save_steps == 0: + # Validation pass, record the loss and metric + if args.task_name == "mnli": + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader_matched, "matched eval") + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader_mismatched, "mismatched eval") + else: + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader) + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + paddle.static.save_inference_model( + os.path.join(output_dir, "model"), [input_ids, token_type_ids], [logits], exe + ) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +if __name__ == "__main__": + args = parse_args() + assert args.device in ["cpu", "gpu", "xpu"], "Invalid device! Available device should be cpu, gpu, or xpu." + + do_train(args) diff --git a/model_zoo/bert/static/run_glue_with_sparaity.py b/model_zoo/bert/static/run_glue_with_sparaity.py new file mode 100644 index 0000000000000000000000000000000000000000..023c58b0b812a088e3358cba1329f99d2de69754 --- /dev/null +++ b/model_zoo/bert/static/run_glue_with_sparaity.py @@ -0,0 +1,408 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
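For reference, `run_glue.py` above drives AdamW with `LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps)`: the learning rate is warmed up linearly over `warmup_steps` and then decayed linearly towards zero at `num_training_steps`. The snippet below is only a rough sketch of that curve with hypothetical numbers (the SST-2 command above actually keeps the default `warmup_steps=0`), not PaddleNLP's exact implementation:

```python
def linear_decay_with_warmup(base_lr, total_steps, warmup_steps, step):
    # Linear warmup from 0 to base_lr, then linear decay back to 0 (illustrative only).
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Hypothetical run: base_lr=2e-5 (as in the SST-2 example), 6000 total steps, 600 warmup steps.
for step in (0, 300, 600, 3300, 6000):
    print(step, f"{linear_decay_with_warmup(2e-5, 6000, 600, step):.2e}")
```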
+ +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.incubate import asp +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "sts-b": PearsonAndSpearman, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="Random seed for initialization") + parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") + args = parser.parse_args() + return args + + +def create_data_holder(task_name): + """ + Define the input data holder for the glue task. 
+ """ + input_ids = paddle.static.data(name="input_ids", shape=[-1, -1], dtype="int64") + token_type_ids = paddle.static.data(name="token_type_ids", shape=[-1, -1], dtype="int64") + if task_name == "sts-b": + label = paddle.static.data(name="label", shape=[-1, 1], dtype="float32") + else: + label = paddle.static.data(name="label", shape=[-1, 1], dtype="int64") + + return [input_ids, token_type_ids, label] + + +def reset_program_state_dict(args, model, state_dict, pretrained_state_dict): + """ + Initialize the parameter from the bert config, and set the parameter by + reseting the state dict." + """ + reset_state_dict = {} + scale = ( + model.initializer_range + if hasattr(model, "initializer_range") + else getattr(model, args.model_type).config["initializer_range"] + ) + reset_parameter_names = [] + for n, p in state_dict.items(): + if n in pretrained_state_dict: + reset_state_dict[p.name] = np.array(pretrained_state_dict[n]) + reset_parameter_names.append(n) + elif p.name in pretrained_state_dict and "bert" in n: + reset_state_dict[p.name] = np.array(pretrained_state_dict[p.name]) + reset_parameter_names.append(n) + else: + dtype_str = "float32" + if str(p.dtype) == "VarType.FP64": + dtype_str = "float64" + reset_state_dict[p.name] = np.random.normal(loc=0.0, scale=scale, size=p.shape).astype(dtype_str) + logger.info("the following parameter had reset, please check. {}".format(reset_parameter_names)) + return reset_state_dict + + +def set_seed(args): + """ + Use the same data seed(for data shuffle) for all procs to guarantee data + consistency after sharding. + """ + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def evaluate(exe, metric, loss, correct, dev_program, data_loader, phase="eval"): + """ + The evaluate process, calcluate the eval loss and metric. + """ + metric.reset() + returns = [loss] + if isinstance(correct, list) or isinstance(correct, tuple): + returns.extend(list(correct)) + else: + returns.append(correct) + for batch in data_loader: + exe.run(dev_program, feed=batch, fetch_list=returns) + return_numpys = exe.run(dev_program, feed=batch, fetch_list=returns) + metric_numpy = return_numpys[1] if len(return_numpys[1:]) == 1 else return_numpys[1:] + metric.update(metric_numpy) + res = metric.accumulate() + if isinstance(metric, Mcc): + print("%s loss: %f, mcc: %s" % (phase, return_numpys[0], res[0])) + elif isinstance(metric, PearsonAndSpearman): + print( + "%s loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s" + % (phase, return_numpys[0], res[0], res[1], res[2]) + ) + else: + print("%s loss: %f, acc: %s, " % (phase, return_numpys[0], res)) + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """ + Convert a glue example into necessary features. 
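+    Single sentences and sentence pairs are both handled; for training data the label is
+    attached as a numpy array whose dtype matches the task (float32 for regression,
+    int64 otherwise).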
+ """ + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + # Set the paddle execute environment + paddle.enable_static() + place = paddle.set_device(args.device) + set_seed(args) + + # Create the main_program for the training and dev_program for the validation + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + dev_program = paddle.static.Program() + + # Get the configuration of tokenizer and model + args.task_name = args.task_name.lower() + args.model_type = args.model_type.lower() + metric_class = METRIC_CLASSES[args.task_name] + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # Create the tokenizer and dataset + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + train_ds = load_dataset("glue", args.task_name, splits="train") + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + + train_ds = train_ds.map(trans_func, lazy=True) + + def batchify_fn( + samples, + fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # token_type + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ), + ): + return fn(samples) + + train_batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + + # Define the input data and create the train/dev data_loader + with paddle.static.program_guard(main_program, startup_program): + [input_ids, token_type_ids, labels] = create_data_holder(args.task_name) + + train_data_loader = DataLoader( + dataset=train_ds, + feed_list=[input_ids, token_type_ids, labels], + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=False, + ) + + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + feed_list=[input_ids, token_type_ids, labels], + num_workers=0, + return_list=False, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + feed_list=[input_ids, token_type_ids, labels], + return_list=False, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, 
batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, + batch_sampler=dev_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + feed_list=[input_ids, token_type_ids, labels], + return_list=False, + ) + + # Create the training-forward program, and clone it for the validation + with paddle.static.program_guard(main_program, startup_program): + num_class = 1 if train_ds.label_list is None else len(train_ds.label_list) + model, pretrained_state_dict = model_class.from_pretrained(args.model_name_or_path, num_classes=num_class) + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + logits = model(input_ids, token_type_ids) + loss = loss_fct(logits, labels) + dev_program = main_program.clone(for_test=True) + + # Create the training-backward program, this pass will not be + # executed in the validation + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs + with paddle.static.program_guard(main_program, startup_program): + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + # Keep Pooler and task-specific layer dense. + # Please note, excluded_layers must be set before calling `optimizer.minimize()`. + asp.set_excluded_layers(main_program, [model.bert.pooler.dense.full_name(), model.classifier.full_name()]) + # Calling asp.decorate() to wrap minimize() in optimizer, which + # will insert necessary masking operations for ASP workflow. + optimizer = asp.decorate(optimizer) + optimizer.minimize(loss) + + # Create the metric pass for the validation + with paddle.static.program_guard(dev_program, startup_program): + metric = metric_class() + correct = metric.compute(logits, labels) + + # Initialize the fine-tuning parameter, we will load the parameters in + # pre-training model. And initialize the parameter which not in pre-training model + # by the normal distribution. 
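+    # asp.prune_model() below derives the 2:4 sparsity masks from the current weight
+    # values, so pruning is performed only after the pre-trained parameters are loaded.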
+ exe = paddle.static.Executor(place) + exe.run(startup_program) + state_dict = model.state_dict() + reset_state_dict = reset_program_state_dict(args, model, state_dict, pretrained_state_dict) + paddle.static.set_program_state(main_program, reset_state_dict) + + # Pruning model to be 2:4 sparse pattern + # Must call `exe.run(startup_program)` first before calling `asp.prune_model` + asp.prune_model(place, main_program) + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + loss_return = exe.run(main_program, feed=batch, fetch_list=[loss]) + if global_step % args.logging_steps == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" + % (global_step, epoch, step, loss_return[0], args.logging_steps / (time.time() - tic_train)) + ) + tic_train = time.time() + lr_scheduler.step() + if global_step % args.save_steps == 0: + # Validation pass, record the loss and metric + if args.task_name == "mnli": + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader_matched, "matched eval") + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader_mismatched, "mismatched eval") + else: + evaluate(exe, metric, loss, correct, dev_program, dev_data_loader) + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + paddle.static.save_inference_model( + os.path.join(output_dir, "model"), [input_ids, token_type_ids], [logits], exe + ) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +if __name__ == "__main__": + args = parse_args() + assert args.device in ["cpu", "gpu", "xpu"], "Invalid device! Available device should be cpu, gpu, or xpu." + + do_train(args) diff --git a/model_zoo/bert/static/run_pretrain.py b/model_zoo/bert/static/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..d300a5f791fc2e6146e894912fbccb65cfb8db82 --- /dev/null +++ b/model_zoo/bert/static/run_pretrain.py @@ -0,0 +1,397 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
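+
+# Static-graph BERT pre-training driven by paddle.distributed.fleet: each worker picks
+# its own shard of the pre-processed training files, and AMP / pure-fp16 as well as
+# gradient merging can be enabled through the command-line flags (see dist_optimizer).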
+ +import argparse +import os +import random +import time +from concurrent.futures import ThreadPoolExecutor + +import numpy as np +import paddle +import paddle.distributed.fleet as fleet +from dataset import create_data_holder, create_pretraining_dataset + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import ( + BertForPretraining, + BertPretrainingCriterion, + BertTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils import profiler +from paddlenlp.utils.tools import TimeCostAverage + +MODEL_CLASSES = {"bert": (BertForPretraining, BertTokenizer)} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_predictions_per_seq", default=80, type=int, help="The maximum total of masked tokens in input sequence" + ) + + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--seed", type=int, default=42, help="Random seed for initialization") + parser.add_argument("--use_amp", type=strtobool, default=False, help="Enable mixed precision training.") + parser.add_argument( + "--enable_addto", + type=strtobool, + default=False, + help="Whether to enable the addto strategy for gradient accumulation or not. This is only used for AMP training.", + ) + parser.add_argument("--scale_loss", type=float, default=2**15, help="The value of scale_loss for fp16.") + parser.add_argument("--use_pure_fp16", type=strtobool, default=False, help="Whether to use pure fp16 training.") + parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") + parser.add_argument( + "--gradient_merge_steps", + type=int, + default=1, + help="Number of merge steps before gradient update." "global_batch_size = gradient_merge_steps * batch_size.", + ) + + # For benchmark. 
+ parser.add_argument( + "--profiler_options", + type=str, + default=None, + help='The option of profiler, which should be in format "key1=value1;key2=value2;key3=value3".', + ) + parser.add_argument( + "--fuse_transformer", + type=strtobool, + default=False, + help="Whether to use FusedTransformerEncoderLayer to replace a TransformerEncoderLayer or not.", + ) + args = parser.parse_args() + return args + + +def select_dataset_file_for_each_worker(files, f_start_id, worker_num, worker_index): + """ + Spliting the train file according to the worker index. + """ + num_files = len(files) + if worker_num > num_files: + remainder = worker_num % num_files + data_file = files[(f_start_id * worker_num + worker_index + remainder * f_start_id) % num_files] + else: + data_file = files[(f_start_id * worker_num + worker_index) % num_files] + return data_file + + +def reset_program_state_dict(model, state_dict): + """ + Initialize the parameter from the bert config, and set the parameter by + reseting the state dict." + """ + scale = model.initializer_range if hasattr(model, "initializer_range") else model.bert.config.initializer_range + + new_state_dict = dict() + for n, p in state_dict.items(): + if "layer_norm" not in p.name: + dtype_str = "float32" + if str(p.dtype) == "VarType.FP64": + dtype_str = "float64" + new_state_dict[p.name] = np.random.normal(loc=0.0, scale=scale, size=p.shape).astype(dtype_str) + return new_state_dict + + +def create_strategy(args): + """ + Create build strategy and exec strategy. + """ + build_strategy = paddle.static.BuildStrategy() + exec_strategy = paddle.static.ExecutionStrategy() + + build_strategy.enable_addto = args.enable_addto + + exec_strategy.num_threads = 1 + exec_strategy.num_iteration_per_drop_scope = 10000 + return build_strategy, exec_strategy + + +def dist_optimizer(args, optimizer): + """ + Create a distributed optimizer based on a normal optimizer + """ + build_strategy, exec_strategy = create_strategy(args) + + dist_strategy = fleet.DistributedStrategy() + dist_strategy.execution_strategy = exec_strategy + dist_strategy.build_strategy = build_strategy + + dist_strategy.fuse_grad_size_in_MB = 16 + if args.use_amp: + dist_strategy.amp = True + + custom_black_list = ["lookup_table", "lookup_table_v2"] if args.use_pure_fp16 else None + dist_strategy.amp_configs = { + "custom_white_list": ["softmax", "layer_norm", "gelu"], + "init_loss_scaling": args.scale_loss, + "custom_black_list": custom_black_list, + "use_pure_fp16": args.use_pure_fp16, + } + if args.gradient_merge_steps > 1: + dist_strategy.gradient_merge = True + dist_strategy.gradient_merge_configs = {"k_steps": args.gradient_merge_steps} + + optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) + return optimizer + + +def set_seed(seed): + """ + Use the same data seed(for data shuffle) for all procs to guarantee data + consistency after sharding. + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +class WorkerInitObj(object): + "Construct the object with different seed, and the Dataloader will generate the data" + "with different seed in each worker." 
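+    # Passed as the worker init hook when the pretraining DataLoader is created, so each
+    # data-loading worker seeds numpy/random with `seed + worker_id`.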
+ + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +def do_train(args): + # Initialize the paddle and paddle fleet execute environment + paddle.enable_static() + place = paddle.set_device(args.device) + fleet.init(is_collective=True) + + worker_num = fleet.worker_num() + worker_index = fleet.worker_index() + + # Create the random seed for the worker + set_seed(args.seed) + worker_init = WorkerInitObj(args.seed + worker_index) + + # Define the input data in the static mode + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + + data_holders = create_data_holder(args) + + [ + input_ids, + segment_ids, + input_mask, + masked_lm_positions, + masked_lm_labels, + next_sentence_labels, + masked_lm_scale, + ] = data_holders + + # Define the model structure in static mode + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + config = model_class.config_class.from_pretrained(args.model_name_or_path) + if config.vocab_size % 8 != 0: + config.vocab_size += 8 - (config.vocab_size % 8) + config.fuse = args.fuse_transformer + model = model_class(config) + criterion = BertPretrainingCriterion(model.bert.config.vocab_size) + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + loss = criterion( + prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels, masked_lm_scale + ) + + # Define the dynamic learing_reate scheduler and optimizer + # BUG: train_data_loader is undefined variable here hence the noqa: F821 + num_training_steps = ( + args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_train_epochs # noqa: F821 + ) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + multi_precision=args.use_pure_fp16, + ) + + # Use the fleet api to compile the distributed optimizer + optimizer = dist_optimizer(args, optimizer) + optimizer.minimize(loss) + + # Define the Executor for running the static model + exe = paddle.static.Executor(place) + exe.run(startup_program) + state_dict = model.state_dict() + + # Use the state dict to update the parameter + reset_state_dict = reset_program_state_dict(model, state_dict) + paddle.static.set_program_state(main_program, reset_state_dict) + if args.use_amp: + optimizer.amp_init(place) + + pool = ThreadPoolExecutor(1) + global_step = 0 + epoch = 0 + while True: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if os.path.isfile(os.path.join(args.input_dir, f)) and "training" in f + ] + files.sort() + random.Random(args.seed + epoch).shuffle(files) + f_start_id = 0 + + # Select one file for each worker and create the DataLoader for the file + data_file = select_dataset_file_for_each_worker(files, f_start_id, worker_num, worker_index) + train_data_loader, _ = create_pretraining_dataset( + data_file, args.max_predictions_per_seq, args, data_holders, worker_init, paddle.static.cuda_places() + ) + + for f_id in range(f_start_id + 1, len(files)): + data_file = select_dataset_file_for_each_worker(files, f_id, worker_num, worker_index) + dataset_future = pool.submit( + create_pretraining_dataset, + data_file, + args.max_predictions_per_seq, + args, + data_holders, + worker_init, + paddle.static.cuda_places(), + ) + + train_cost_avg = TimeCostAverage() + reader_cost_avg = TimeCostAverage() + total_samples = 0 + batch_start = time.time() + for step, batch in enumerate(train_data_loader): + train_reader_cost = time.time() - batch_start + reader_cost_avg.record(train_reader_cost) + global_step += 1 + loss_return = exe.run(main_program, feed=batch, fetch_list=[loss]) + total_samples += args.batch_size + # In the new 2.0 api, must call this function to change the learning_rate + lr_scheduler.step() + train_run_cost = time.time() - batch_start + train_cost_avg.record(train_run_cost) + + # Profile for model benchmark + if args.profiler_options is not None: + profiler.add_profiler_step(args.profiler_options) + + if global_step % args.logging_steps == 0: + print( + "total step: %d, epoch: %d, batch: %d, loss: %f, " + "avg_reader_cost: %.5f sec, avg_batch_cost: %.5f sec, avg_samples: %.5f, ips: %.5f sequences/sec" + % ( + global_step, + epoch, + step, + loss_return[0], + reader_cost_avg.get_average(), + train_cost_avg.get_average(), + total_samples / args.logging_steps, + args.batch_size / (reader_cost_avg.get_average() + train_cost_avg.get_average()), + ) + ) + total_samples = 0 + train_cost_avg.reset() + reader_cost_avg.reset() + if global_step % args.save_steps == 0: + if worker_index == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model.save_model_config(output_dir) + paddle.static.save(main_program, os.path.join(output_dir, "model_state")) + tokenizer.save_pretrained(output_dir) + if global_step >= args.max_steps: + del train_data_loader + return + batch_start = time.time() + del 
train_data_loader
+                train_data_loader, data_file = dataset_future.result(timeout=None)
+        epoch += 1
+
+
+if __name__ == "__main__":
+    args = parse_args()
+    print(args)
+    do_train(args)
diff --git a/model_zoo/bert/static_ipu/README.md b/model_zoo/bert/static_ipu/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f18e209a004d0f2c1ca2c3a82c240bd58b82de77
--- /dev/null
+++ b/model_zoo/bert/static_ipu/README.md
@@ -0,0 +1,223 @@
+# Paddle-BERT with Graphcore IPUs
+
+## Overview
+
+This project enables BERT-Base pre-training and SQuAD fine-tuning using [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) on Graphcore [IPU-POD16](https://www.graphcore.ai/products/mk2/ipu-pod16).
+
+## File Structure
+
+| File              | Description                                                         |
+| ----------------- | ------------------------------------------------------------------- |
+| `README.md`       | How to run the model.                                               |
+| `run_pretrain.py` | The script to run the pre-training tasks (phase1 and phase2).       |
+| `run_squad.py`    | The script to run the SQuAD fine-tuning and validation tasks.       |
+| `modeling.py`     | The script that builds the BERT-Base model.                         |
+| `dataset_ipu.py`  | The script that loads the input data for pre-training.              |
+| `custom_ops/`     | The folder containing the custom ops used by the model.             |
+| `scripts/`        | The folder containing the scripts for running the model.            |
+
+## Dataset
+
+- Pretraining dataset
+
+  The Wikipedia dataset is used for pre-training. Please refer to the Wikipedia dataset generator provided by [Nvidia](https://github.com/NVIDIA/DeepLearningExamples.git) to generate the pre-training dataset.
+
+  The sequence lengths used in pre-training phase1 and phase2 are 128 and 384, respectively. The following steps generate the dataset.
+
+  ```bash
+  # Here we use a specific commit; the latest commit should also be fine.
+  git clone https://github.com/NVIDIA/DeepLearningExamples.git
+  cd DeepLearningExamples/PyTorch/LanguageModeling/BERT
+  git checkout 88eb3cff2f03dad85035621d041e23a14345999e
+
+  # Change the parameter `--max_seq_length 512` to `--max_seq_length 384` at line 50 and
+  # `--max_predictions_per_seq 80` to `--max_predictions_per_seq 56` at line 51.
+  vim data/create_datasets_from_start.sh
+
+  # Build the docker image
+  bash scripts/docker/build.sh
+
+  # Use NVIDIA's docker container to download and generate the hdf5 files. This may require a GPU.
+  # You can remove `--gpus $NV_VISIBLE_DEVICES` to avoid the GPU requirement.
+  bash scripts/docker/launch.sh
+
+  # Generate the dataset with wiki_only
+  bash data/create_datasets_from_start.sh wiki_only
+  ```
+
+- SQuAD v1.1 dataset
+
+  The SQuAD v1.1 dataset is downloaded automatically; you don't have to download it manually.
+
+
+## Quick Start Guide
+
+### Prepare Project Environment
+
+- Create docker image
+
+```bash
+# clone paddle repo
+git clone https://github.com/paddlepaddle/Paddle.git -b release/2.3
+cd Paddle
+
+# build docker image
+docker build -t paddlepaddle/paddle:latest-dev-ipu -f tools/dockerfile/Dockerfile.ipu .
+``` + +- Create docker container + +```bash +# clone paddlenlp repo +git clone https://github.com/paddlepaddle/paddlenlp.git +cd paddlenlp/examples/language_model/bert/static_ipu + +# create docker container +# the ipuof configuration file need to be pre-generated and mounted to docker container +# the environment variable IPUOF_CONFIG_PATH should point to the ipuof configuration file +# more information on ipuof configuration is available at https://docs.graphcore.ai/projects/vipu-admin/en/latest/cli_reference.html?highlight=ipuof#ipuof-configuration-file +docker run --ulimit memlock=-1:-1 --net=host --cap-add=IPC_LOCK \ +--device=/dev/infiniband/ --ipc=host \ +--name paddle-bert-base \ +-v ${IPUOF_CONFIG_PATH}:/ipu.conf \ +-e IPUOF_CONFIG_PATH=/ipu.conf \ +-v ${PWD}:/workdir \ +-w /home -it paddlepaddle/paddle:latest-dev-ipu bash +``` + +All of later processes are required to be executed in the container. + +- Compile and installation + +```bash +# clone paddle repo +git clone https://github.com/paddlepaddle/Paddle.git -b release/2.3 +cd Paddle + +mkdir build && cd build + +# run cmake +cmake .. -DWITH_IPU=ON -DWITH_PYTHON=ON -DPY_VERSION=3.7 -DWITH_MKL=ON \ + -DPOPLAR_DIR=/opt/poplar -DPOPART_DIR=/opt/popart -DCMAKE_BUILD_TYPE=Release + +# compile +make paddle_python -j$(nproc) + +# install paddle package +pip install -U python/dist/paddlepaddle-0.0.0-cp37-cp37m-linux_x86_64.whl + +# go to workdir +cd /workdir +``` + +### Execution + +- Run pretraining phase1 (sequence_length = 128) + +```bash +# pod16 +# takes about 11.3 hours +bash scripts/pod16/run_pretrain.sh + +# pod4 +# takes about 11.3 * 4 hours +bash scripts/pod4/run_pretrain.sh +``` + +- Run pretraining phase2 (sequence_length = 384) + +```bash +# pod16 +# takes about 3 hours +bash scripts/pod16/run_pretrain_phase2.sh + +# pod4 +# takes about 3 * 4 hours +bash scripts/pod4/run_pretrain_phase2.sh +``` + +- Run SQuAD finetune task + +```bash +# pod16 +bash scripts/pod16/run_squad.sh + +# pod4 +bash scripts/pod4/run_squad.sh +``` + +- Run SQuAD validation + +```bash +# pod16 +bash scripts/pod16/run_squad_infer.sh + +# pod4 +bash scripts/pod4/run_squad_infer.sh +``` + +#### Parameters + +- `task` The type of the NLP model. +- `input_files` The directory of the input data. +- `output_dir` The directory of the trained models. +- `is_training` Training or inference. +- `seq_len` The sequence length. +- `vocab_size` Size of the vocabulary. +- `max_predictions_per_seq` The max number of the masked token each sentence. +- `max_position_embeddings` The length of the input mask. +- `num_hidden_layers` The number of encoder layers. +- `hidden_size` The size of the hidden state of the transformer layers size. +- `ignore_index` The ignore index for the masked position. +- `hidden_dropout_prob` The dropout probability for fully connected layer in embedding and encoder +- `attention_probs_dropout_prob` The dropout probability for attention layer in encoder. +- `learning_rate` The learning rate for training. +- `weight_decay` The weight decay. +- `beta1` The Adam/Lamb beta1 value +- `beta2` The Adam/Lamb beta2 value +- `adam_epsilon` Epsilon for Adam optimizer. +- `max_steps` The max training steps. +- `warmup_steps` The warmup steps used to update learning rate with lr_schedule. +- `scale_loss` The loss scaling. 
+- `accl1_type` set accl1 type to FLOAT or FLOAT16 +- `accl2_type` set accl2 type to FLOAT or FLOAT16 +- `weight_decay_mode` decay or l2 regularization +- `optimizer_state_offchip` The store location of the optimizer tensors +- `logging_steps` The gap steps of logging. +- `save_steps` Save the paddle model every n steps. +- `epochs` the iteration of the whole dataset. +- `batch_size` total batch size (= batches_per_step \* num_replica \* grad_acc_factor \* micro_batch_size). +- `micro_batch_size` The batch size of the IPU graph. +- `batches_per_step` The number of batches per step with pipelining. +- `seed` The random seed. +- `num_ipus` The number of IPUs. +- `ipu_enable_fp16` Enable FP16 or not. +- `num_replica` The number of the graph replication. +- `enable_grad_acc` Enable gradiant accumulation or not. +- `grad_acc_factor` Update the weights every n batches. +- `available_mem_proportion` The available proportion of memory used by conv or matmul. +- `shuffle` Shuffle Dataset. +- `wandb` Enable logging to Weights and Biases. +- `enable_engine_caching` Enable engine caching or not. +- `enable_load_params` Load paddle params or not. +- `tf_checkpoint` Path to Tensorflow Checkpoint to initialise the model. + +## Result + +For a POD16 platform: + +| Task | Metric | Result | +| ------ | -------- | ------- | +| Phase1 | MLM Loss | 1.6064 | +| | NSP Loss | 0.0272 | +| | MLM Acc | 0.6689 | +| | NSP Acc | 0.9897 | +| | tput | 11700 | +| Phase2 | MLM Loss | 1.5029 | +| | NSP Loss | 0.02444 | +| | MLM Acc | 0.68555 | +| | NSP Acc | 0.99121 | +| | tput | 3470 | +| SQuAD | EM | 79.9053 | +| | F1 | 87.6396 | diff --git a/model_zoo/bert/static_ipu/custom_ops/custom_checkpointoutput.cc b/model_zoo/bert/static_ipu/custom_ops/custom_checkpointoutput.cc new file mode 100644 index 0000000000000000000000000000000000000000..edc7eec8fbf369e723e5e76e438513d874e697a8 --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/custom_checkpointoutput.cc @@ -0,0 +1,41 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/
+
+#include "paddle/extension.h"
+
+namespace {
+std::vector<std::vector<int64_t>> InferShape(std::vector<int64_t> x_shape) {
+  return {x_shape};
+}
+
+std::vector<paddle::DataType> InferDtype(paddle::DataType x_dtype) {
+  return {x_dtype};
+}
+
+std::vector<paddle::Tensor> OpForward(const paddle::Tensor &x) { return {x}; }
+
+std::vector<paddle::Tensor> OpBackward(const paddle::Tensor &x) { return {x}; }
+}
+
+PD_BUILD_OP(checkpointoutput)
+    .Inputs({"X"})
+    .Outputs({"Out"})
+    .SetInferShapeFn(PD_INFER_SHAPE(InferShape))
+    .SetInferDtypeFn(PD_INFER_DTYPE(InferDtype))
+    .SetKernelFn(PD_KERNEL(OpForward));
+
+PD_BUILD_GRAD_OP(checkpointoutput)
+    .Inputs({paddle::Grad("Out")})
+    .Outputs({paddle::Grad("X")})
+    .SetKernelFn(PD_KERNEL(OpBackward));
diff --git a/model_zoo/bert/static_ipu/custom_ops/custom_detach.cc b/model_zoo/bert/static_ipu/custom_ops/custom_detach.cc
new file mode 100644
index 0000000000000000000000000000000000000000..2796fd07d60d28e7b293990a2e007c4084ea14a6
--- /dev/null
+++ b/model_zoo/bert/static_ipu/custom_ops/custom_detach.cc
@@ -0,0 +1,42 @@
+/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/extension.h"
+
+namespace {
+std::vector<std::vector<int64_t>>
+InferShape(std::vector<int64_t> x_shape) {
+  return {x_shape};
+}
+
+std::vector<paddle::DataType> InferDtype(paddle::DataType x_dtype) {
+  return {x_dtype};
+}
+
+std::vector<paddle::Tensor> OpForward(const paddle::Tensor &x) { return {x}; }
+
+std::vector<paddle::Tensor> OpBackward(const paddle::Tensor &x) { return {x}; }
+}
+
+PD_BUILD_OP(detach)
+    .Inputs({"X"})
+    .Outputs({"Out"})
+    .SetInferShapeFn(PD_INFER_SHAPE(InferShape))
+    .SetInferDtypeFn(PD_INFER_DTYPE(InferDtype))
+    .SetKernelFn(PD_KERNEL(OpForward));
+
+PD_BUILD_GRAD_OP(detach)
+    .Inputs({paddle::Grad("Out")})
+    .Outputs({paddle::Grad("X")})
+    .SetKernelFn(PD_KERNEL(OpBackward));
diff --git a/model_zoo/bert/static_ipu/custom_ops/custom_identity.cc b/model_zoo/bert/static_ipu/custom_ops/custom_identity.cc
new file mode 100644
index 0000000000000000000000000000000000000000..1997d0e896c12a0dc20b0a19d5bd4360a4a2bde1
--- /dev/null
+++ b/model_zoo/bert/static_ipu/custom_ops/custom_identity.cc
@@ -0,0 +1,41 @@
+/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
*/ + +#include "paddle/extension.h" + +namespace { +std::vector> InferShape(std::vector x_shape) { + return {x_shape}; +} + +std::vector InferDtype(paddle::DataType x_dtype) { + return {x_dtype}; +} + +std::vector OpForward(const paddle::Tensor &x) { return {x}; } + +std::vector OpBackward(const paddle::Tensor &x) { return {x}; } +} + +PD_BUILD_OP(identity) + .Inputs({"X"}) + .Outputs({"Out"}) + .SetInferShapeFn(PD_INFER_SHAPE(InferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(InferDtype)) + .SetKernelFn(PD_KERNEL(OpForward)); + +PD_BUILD_GRAD_OP(identity) + .Inputs({paddle::Grad("Out")}) + .Outputs({paddle::Grad("X")}) + .SetKernelFn(PD_KERNEL(OpBackward)); diff --git a/model_zoo/bert/static_ipu/custom_ops/custom_nll_loss.cc b/model_zoo/bert/static_ipu/custom_ops/custom_nll_loss.cc new file mode 100644 index 0000000000000000000000000000000000000000..88112a26b7e3b245ee530b9587a5d05e9d710b26 --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/custom_nll_loss.cc @@ -0,0 +1,55 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/extension.h" + +namespace { +std::vector> +InferShape(std::vector x_shape, std::vector y_shape, + const int &reduction, const std::string &ignoreIndex, + const bool &inputIsLogProbability) { + // 0: sum, 1: mean, 2: none + if (reduction == 2) { + return {y_shape}; + } else { + return {{1}}; + } +} + +std::vector InferDtype(paddle::DataType x_dtype, + paddle::DataType y_dtype) { + return {x_dtype}; +} + +std::vector OpForward(const paddle::Tensor &x, + const paddle::Tensor &y) { + return {x}; +} + +std::vector OpBackward(const paddle::Tensor &x) { return {x}; } +} + +PD_BUILD_OP(custom_nll_loss) + .Inputs({"X", "Y"}) + .Outputs({"Out"}) + .Attrs({"reduction: int", "ignoreIndex: std::string", + "inputIsLogProbability: bool"}) + .SetInferShapeFn(PD_INFER_SHAPE(InferShape)) + .SetInferDtypeFn(PD_INFER_DTYPE(InferDtype)) + .SetKernelFn(PD_KERNEL(OpForward)); + +PD_BUILD_GRAD_OP(custom_nll_loss) + .Inputs({paddle::Grad("Out")}) + .Outputs({paddle::Grad("X")}) + .SetKernelFn(PD_KERNEL(OpBackward)); diff --git a/model_zoo/bert/static_ipu/custom_ops/custom_shape_infer.cc b/model_zoo/bert/static_ipu/custom_ops/custom_shape_infer.cc new file mode 100644 index 0000000000000000000000000000000000000000..74e144d8d7e6c715f7aded4df6ccd5423c9cdba5 --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/custom_shape_infer.cc @@ -0,0 +1,37 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include + +auto splitShapeInferenceFun = [](popart::ShapeInferenceContext &ctx) { + auto numOutputs = ctx.getNumOutputs(); + auto type = ctx.inType(0); + auto shape = ctx.inShape(0); + auto axis = ctx.getAttribute("axis"); + auto split = ctx.getAttribute>("split"); + + for (int i = 0; i < numOutputs; i++) { + shape[axis] = split.at(i); + ctx.outInfo(i) = {type, shape}; + } +}; + +#if POPART_VERSION_MAJOR == 2 +#if POPART_VERSION_MINOR == 3 +// for version 2.3, need to register a shape inference function for Split op +static popart::RegisterShapeInferenceFunction + splitRegister11(popart::Onnx::Operators::Split_11, splitShapeInferenceFun); +#endif +#endif \ No newline at end of file diff --git a/model_zoo/bert/static_ipu/custom_ops/disable_attn_dropout_bwd_pattern.cc b/model_zoo/bert/static_ipu/custom_ops/disable_attn_dropout_bwd_pattern.cc new file mode 100644 index 0000000000000000000000000000000000000000..803ae20c658bea3dce14b7d6e84a6fa71e9b33fc --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/disable_attn_dropout_bwd_pattern.cc @@ -0,0 +1,91 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "utils.cc" + +// Tests have found that disabling dropout in the backwards pass of the attention, just before the softmax, +// can improve SQuAD fine-tuning. This pattern finds that op replaces it with an identity op. +class DisableAttnDropoutBwdPattern : public popart::PreAliasPattern { +public: + bool matches(popart::Op *op) const override { + int check_levels = 2; + + if (!op->isConvertibleTo()) { + return false; + } + + // Is dropout enabled? If ratio is 0, we don't need to apply the pattern. + auto dropoutGradOp = dynamic_cast(op); + if (dropoutGradOp->getRatio() == 0.f) { + return false; + } + + // The specific attention DropoutGradOp we want to cull sits between a matmul and a softmax, + // so we'll look through producers and consumers and see if we can find them. + auto grad = op->input->tensor(popart::DropoutGradOp::getGradInIndex()); + + // The MatMulPattern converts the MatMulLhsGradOp to a MatMulOp + // There doesn't seem to be a way to check if a pattern is enabled from inside another pattern. + // The IR holds the patterns object, but it’s inaccessible for checking the status of individual patterns. + // Check both, with the most likely first. 
+ bool hasMatMulProducer = search_producers_for(grad, check_levels) != nullptr; + if (!hasMatMulProducer) { + hasMatMulProducer |= search_producers_for(grad, check_levels) != nullptr; + } + + return hasMatMulProducer && search_consumers_for(grad) != nullptr; + } + + std::vector touches(popart::Op *) const override { return {}; } + + bool apply(popart::Op *op) const override { + if (!op->isConvertibleTo()) { + return false; + } + + auto dropoutGradOp = dynamic_cast(op); + auto identityOp = makeReplacementOpInIr(popart::Onnx::Operators::Identity_1, + dropoutGradOp, + ""); + + auto inputId = dropoutGradOp->inId(popart::DropoutGradOp::getGradInIndex()); + auto outputId = dropoutGradOp->outId(popart::DropoutGradOp::getOutIndex()); + dropoutGradOp->disconnectAllInputs(); + dropoutGradOp->disconnectAllOutputs(); + dropoutGradOp->getGraph().eraseOp(dropoutGradOp->id); + + identityOp->connectInTensor(popart::IdentityOp::getInIndex(), inputId); + identityOp->connectOutTensor(popart::IdentityOp::getOutIndex(), outputId); + identityOp->setup(); + + return true; + } +}; + + +static popart::PatternCreator disableAttnDropoutBwdPatternCreator("DisableAttnDropoutBwdPattern", false); diff --git a/model_zoo/bert/static_ipu/custom_ops/tied_gather.cc b/model_zoo/bert/static_ipu/custom_ops/tied_gather.cc new file mode 100644 index 0000000000000000000000000000000000000000..2350ffd243c4c0172dc16c2f73ee9378a8853c29 --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/tied_gather.cc @@ -0,0 +1,181 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +namespace CustomOperators { + const popart::OperatorIdentifier TiedGather = {"ai.graphcore", "TiedGather", 1}; +} // namespace CustomOperators + +class TiedGatherOp; +class TiedGatherGradOp; + +class TiedGatherGradOp : public popart::GatherGradOp { +public: + TiedGatherGradOp(const popart::GatherOp &op, int64_t axis_) + : popart::GatherGradOp(op, axis_), + fwd_op(&op) {} + const popart::GatherOp *fwd_op; +}; + +class TiedGatherOp : public popart::GatherOp { +public: + TiedGatherOp(int64_t axis_, const popart::Op::Settings &settings_) + : popart::GatherOp(CustomOperators::TiedGather, axis_, settings_) {} + bool check_indices = true; + + std::unique_ptr clone() const override { + return std::make_unique(*this); + } + + std::vector> getGradOps() { + std::vector> result; + result.push_back(std::make_unique(*this, getAxis())); + result[0]->pruneable = false; + return result; + } +}; + +class TiedGatherOpx : public popart::popx::Opx { +public: + TiedGatherOpx(popart::Op *op, popart::popx::Devicex *devicex) : popart::popx::Opx(op, devicex) { + verifyOp(op, CustomOperators::TiedGather); + // We always want this to layout its inputs + inputCreatorPriority = std::numeric_limits::max(); + } + + bool createsEquiv(int, const popart::popx::Opx *, int) const final { return false; } + + std::set mustExistBeforeCreate(int) const final { return {}; } + + popart::popx::InputCreatorType getInputCreatorType(int index0) const final { + return index0 == TiedGatherOp::dataInIndex() ? popart::popx::InputCreatorType::CanCreate + : popart::popx::Opx::getInputCreatorType(index0); + } + + poplar::Tensor createInput(popart::InIndex index, + const poplar::DebugNameAndId &dnai) const final { + popart::logging::debug("TiedGather asked to create index {}: name {}", index, dnai); + if (index != TiedGatherOp::dataInIndex()) { + throw popart::error("CustomOps Error: GatherOpx::createInput Cannot create input {}", index); + } + + auto inputInfo = inInfo(TiedGatherOp::indicesInIndex()); + auto weightInfo = inInfo(TiedGatherOp::dataInIndex()); + + unsigned inputSize = inputInfo.nelms(); + unsigned inChannels = weightInfo.dim(getOp().getAxis()); + unsigned outChannels = weightInfo.nelms() / inChannels; + + std::vector lhsShape = {inputSize, inChannels}; + std::vector rhsShape = {inChannels, outChannels}; + + return poplin::createMatMulInputRHS(graph(), + popart::popx::popType(weightInfo), + lhsShape, + rhsShape, + dnai, + {}, + &dv_p->matmulCache); + } + + // Identical to popart::opx::GatherOpx::grow however: + // 1) uses popops::gather instead of popops::multislice + // 2) range checks the indices and masks those out of range + void grow(poplar::program::Sequence &prog) const final { + const auto indicesShape = inShape(TiedGatherOp::indicesInIndex()); + const auto outputShape = + popart::vXtoY(outShape(TiedGatherOp::outIndex())); + + auto op = getOp(); + unsigned axis = op.getAxis(); + auto indices = getInTensor(TiedGatherOp::indicesInIndex()); + auto data = getInTensor(TiedGatherOp::dataInIndex()); + + // If there are no indices, return an empty tensor of the appropriate + // shape + if (indices.numElements() == 0) { + auto result = graph().addVariable( + data.elementType(), outputShape, debugContext("result")); + + setOutTensor(TiedGatherOp::outIndex(), result); + } else { + // Flatten the scalar indices. 
+ auto offsets = indices.flatten(); + // reinterpret the indices as unsigned int. This assumes negative indices. + // are impossible. + offsets = offsets.reinterpret(poplar::UNSIGNED_INT); + + // Place the gather axis at the front. + data = data.dimShufflePartial({0}, {axis}); + // Store the shape for later. + auto tmp_shape = data.shape(); + // Flatten the other dimensions. + data = data.flatten(1, data.rank()); + + // Change (2) + poplar::Tensor mask; + if (op.check_indices) { + auto gather_size = data.shape()[0]; + mask = popops::lt(graph(), offsets, static_cast(gather_size), prog, debugContext("mask + tiedGatherOpxCreator(CustomOperators::TiedGather); diff --git a/model_zoo/bert/static_ipu/custom_ops/tied_gather_pattern.cc b/model_zoo/bert/static_ipu/custom_ops/tied_gather_pattern.cc new file mode 100644 index 0000000000000000000000000000000000000000..d22fcd22711e689468c816b266ed5bd0b054343d --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/tied_gather_pattern.cc @@ -0,0 +1,504 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "tied_gather.cc" +#include "utils.cc" + +using SerialiseSettings = popart::MatMulBaseOp::SerialiseSettings; + +// This pattern matches for graphs of the shape. 
+// +// Weight +// / \ +// Transpose MatMul +// | +// Indices --Gather +// +// And performs the following transformations: +// 1) Disable FullyConnectedPass on MatMul +// 2) Add Detach between the Gather and the Weight so no SGD ops are created (they will be added later by TiedGatherAccumulatePattern) +// 3) Replace Gather with TiedGather +// Resulting in: +// Weight +// / \ +// Transpose MatMul +// | +// Detach +// | +// Indices --TiedGather +// +// Conditionally, if MatMul is annotated with serialisation it will: +// 4) Replace Gather with N x TiedGather to match the serialisation on the MatMul +// Resulting in: +// For serialisation factor: 2 +// +// Weight +// / \ +// Transpose MatMul +// | +// Indices Detach +// | | | | +// | | | Slice--\ +// | Sub -|------TiedGather +// | | | +// | Slice--\ | +// Sub ---------TiedGather | +// \ | +// Add +// +namespace { +bool produced_by_transpose(popart::Tensor *t) { + return t->hasProducer() && t->getProducer()->isConvertibleTo(); +} +} + +class TiedGatherPattern : public popart::PreAliasPattern { + mutable std::map tied_op_map; +public: + bool matches(popart::Op *op) const override { + auto &ir = op->getIr(); + // Only run in the fwd pass + if (op->getIr().hasConstructedBackwards()) { + return false; + } + if (op->getIr().isTraining() && !op->getIr().getSessionOptions().enableGradientAccumulation) { + return false; + } + if (op->isConvertibleTo() && !op->isConvertibleTo()) { + if (produced_by_transpose(op->input->tensor(popart::GatherOp::dataInIndex()))) { + auto matmul = weight_consumed_by(op->input->tensor(popart::GatherOp::dataInIndex())); + if (matmul) { + tied_op_map.insert({op, matmul}); + return true; + } + } + } + return false; + } + + std::vector touches(popart::Op *) const override { return {}; } + + bool apply(popart::Op *op) const override { + auto &graph = op->getGraph(); + + auto gather = dynamic_cast(op); + auto matmul = tied_op_map[gather]; + + // (1) + matmul->setUseFullyConnectedPass(false); + + auto axis = gather->getAxis(); + auto serialisation = matmul->getSerialiseSettings(); + + auto data = gather->input->tensor(popart::GatherOp::dataInIndex()); + auto indices = gather->input->tensor(popart::GatherOp::indicesInIndex()); + auto out = gather->output->tensor(popart::GatherOp::outIndex()); + + // Disconnect "out" so it can be connected to the replacing ops. 
+ gather->disconnectAllOutputs(); + + // (2) + auto detach_up = std::make_unique( + popart::Onnx::CustomOperators::Detach_1, + popart::Op::Settings(graph, "TiedGatherDetach") + ); + auto detach = detach_up.get(); + transferBaseProperties(gather, detach); + graph.moveIntoGraph(std::move(detach_up)); + detach->connectInTensor(0, data->id); + auto detached_data_id = data->id + "/detached"; + detach->createAndConnectOutTensor(0, detached_data_id); + detach->setup(); + data = graph.getTensors().get(detached_data_id); + + std::string name = gather->name(); + if (name.empty()) { + name = std::to_string(gather->id); + } + + auto replace_with_tied_gather = [&](popart::TensorId dict, popart::TensorId ind, int64_t i, const std::string &debugContext) { + auto tied_gather_up = std::make_unique( + axis, + popart::Op::Settings(graph, debugContext)); + auto tied_gather = tied_gather_up.get(); + transferBaseProperties(gather, tied_gather); + graph.moveIntoGraph(std::move(tied_gather_up)); + + tied_gather->connectInTensor(TiedGatherOp::dataInIndex(), dict); + tied_gather->connectInTensor(TiedGatherOp::indicesInIndex(), ind); + + auto out_id = out->id; + if (i >= 0) { + out_id = debugContext + ":0"; + tied_gather->createAndConnectOutTensor(TiedGatherOp::outIndex(), out_id); + } else { + tied_gather->connectOutTensor(TiedGatherOp::outIndex(), out_id); + } + + graph.topoCons->transfer(gather, tied_gather); + + tied_gather->setup(); + + return out_id; + }; + + if (serialisation.factor <= 1 || serialisation.mode == SerialiseSettings::Mode::None) { + // (3) + replace_with_tied_gather(data->id, indices->id, -1, name); + } else { + // (4) + if (serialisation.mode != SerialiseSettings::Mode::OutputChannels) { + throw popart::error("CustomOps Error: Tied Gather Pattern only supports Serialisation::Mode::OutputChannels"); + } + + auto slice_op = [&](int64_t starts, int64_t ends, const std::string &debugContext) { + auto slice_up = std::make_unique( + popart::Onnx::AiOnnx::OpSet9::Slice, + std::vector({starts}), + std::vector({ends}), + std::vector({axis}), + popart::Op::Settings(graph, debugContext + "/slice")); + auto slice = slice_up.get(); + transferBaseProperties(gather, slice); + graph.moveIntoGraph(std::move(slice_up)); + slice->connectInTensor(popart::SliceOp::getInIndex(), data->id); + auto data_slice = debugContext + "/slice:0"; + slice->createAndConnectOutTensor(popart::SliceOp::getOutIndex(), data_slice); + slice->setup(); + return data_slice; + }; + + auto subtract_with_constant = [&](popart::Tensor *a, int64_t c, const std::string &debugContext) { + auto sub_up = std::make_unique( + popart::Onnx::Operators::Sub_7, + popart::Op::Settings(graph, debugContext + "/sub")); + auto sub = sub_up.get(); + transferBaseProperties(gather, sub); + graph.moveIntoGraph(std::move(sub_up)); + sub->connectInTensor(popart::SubtractOp::getArg0InIndex(), a->id); + // Create constant to subtract from + static unsigned i = 0; + auto sub_const_id = a->id + "_sub_const_" + std::to_string(i++); + popart::TensorInfo subInfo(a->info.dataType(), {1}); + std::vector d(1, c); + graph.getTensors().addConstInit(sub_const_id, subInfo, d.data()); + sub->connectInTensor(popart::SubtractOp::getArg1InIndex(), sub_const_id); + auto indices_sub = debugContext + "/sub:0"; + sub->createAndConnectOutTensor(popart::SubtractOp::getOutIndex(), indices_sub); + sub->setup(); + return indices_sub; + }; + + auto add_op = [&](popart::TensorId a, popart::TensorId b, popart::TensorId out, const std::string &debugContext) { + auto add_up = std::make_unique( 
+ popart::Onnx::Operators::Add_6, + popart::Op::Settings(graph, debugContext + "/add")); + auto add = add_up.get(); + transferBaseProperties(gather, add); + graph.moveIntoGraph(std::move(add_up)); + add->connectInTensor(popart::AddOp::getArg0InIndex(), a); + add->connectInTensor(popart::AddOp::getArg1InIndex(), b); + if (graph.getTensors().contains(out)) { + add->connectOutTensor(popart::AddOp::getOutIndex(), out); + } else { + add->createAndConnectOutTensor(popart::AddOp::getOutIndex(), out); + } + add->setup(); + return out; + }; + + popart::TensorId tmp_id; + for (int64_t i = 0; i < serialisation.factor; i++) { + int64_t slice_size = data->info.dim(axis) / serialisation.factor; + auto serial_name = name + "/" + std::to_string(i); + // Slice the Dictionary + auto data_slice = slice_op(i * slice_size, (i + 1) * slice_size, serial_name); + // Subtract the indices + auto indices_sub = subtract_with_constant(indices, i * slice_size, serial_name); + // Add the tied gather to the graph + auto next_id = replace_with_tied_gather(data_slice, indices_sub, i, serial_name); + + // Add the results + if (i == 0) { + tmp_id = next_id; + } else { + auto out_id = out->id; + if (i < serialisation.factor - 1) { + out_id += "_tmp" + std::to_string(i); + } + tmp_id = add_op(tmp_id, next_id, out_id, serial_name); + + // Tie the add to happen directly after the gather + graph.topoCons->insert( + graph.getTensors().get(next_id)->getProducer(), + graph.getTensors().get(tmp_id)->getProducer(), + true); + } + } + } + + gather->disconnectAllInputs(); + graph.eraseOp(gather->id); + + return true; + } +}; + +// This pattern matches for graphs of the shape. +// +// Weight +// | \ +// TiedGatherGrad MatMul +// | +// Accl - Accumulate +// +// And will perform the following transformation +// 1) Replace TiedGatherGrad with SparseAccumulate +// +// Resulting in: +// +// Weight +// | \ +// | MatMul +// | | +// | Accl - Accumulate +// | | | +// SparseAccumulate - Optimizer +// +// (--> is a topocon) + +class TiedGatherAccumulatePattern : public popart::PreAliasPattern { +public: + bool matches(popart::Op *op) const override { + // Only works with gradient accumulation + if (!op->getIr().getSessionOptions().enableGradientAccumulation) { + return false; + } + // Only run after the optimizers have been created + if (!op->getIr().hasDecomposedOptimizers()) { + return false; + } + return op->isConvertibleTo(); + } + + std::vector touches(popart::Op *) const override { return {}; } + + bool apply(popart::Op *op) const override { + auto gather_grad = dynamic_cast(op); + auto gather = gather_grad->fwd_op; + auto root_weight = get_variable(gather->input->tensor(popart::GatherOp::dataInIndex())); + + auto gather_ops = find_all_consumers(root_weight); + + auto &ir = op->getIr(); + + // Get all the Accumulate ops in the normal context + std::vector accumulate_ops; + + auto update_ops = find_all_consumers(root_weight); + if (update_ops.size() < 1) { + // OptimizerDecomposePattern has not run. + throw popart::error("CustomOps Error: Could not find update ops for weight {}", root_weight->id); + } + + for (size_t i = 0; i < update_ops.size(); i++) { + auto var_update = update_ops[i]; + + auto accum = var_update->inTensor(popart::VarUpdateWithUpdaterOp::getUpdaterInIndex()); + // Accumulate Ops in the normal fragment are Gradient Accumulation. 
+ auto accl_op = search_producers_for(accum, 10); + + if (accl_op) { + auto exists = std::find_if(accumulate_ops.begin(), accumulate_ops.end(), [&accl_op](popart::Op* op){ return op->id == accl_op->id; }); + if (exists == accumulate_ops.end()) { + accumulate_ops.push_back(accl_op); + } + } else { + popart::logging::info("CustomOps Warning: Could not find outer AccumulateOp gradient accumulation via accumulator {}.", accum->id); + } + } + + if (accumulate_ops.size() != gather_ops.size()) { + throw popart::error("CustomOps Error: The number of gather ops ({}) does not match the number of accumulate ops ({}).", gather_ops.size(), accumulate_ops.size()); + } + + // Match up gather serial index to Accumulator's matmul index. + // TODO: Find a more robust way than sorting input ids + std::sort(accumulate_ops.begin(), accumulate_ops.end(), + [](const popart::Op *l, const popart::Op *r) { + return l->input->tensor(popart::AccumulateOp::getVarToUpdateInIndex())->id.compare( + r->input->tensor(popart::AccumulateOp::getVarToUpdateInIndex())->id) < 0; + }); + std::sort(gather_ops.begin(), gather_ops.end(), + [](const popart::Op *l, const popart::Op *r) { + return l->name().compare(r->name()) < 0; + }); + + auto itr = std::find(gather_ops.begin(), gather_ops.end(), gather); + if (itr == gather_ops.end()) { + throw popart::error("CustomOps Error: Could not find {} in the consumers of {}.", gather->name(), root_weight->id); + } + + unsigned serial_index = std::distance(gather_ops.begin(), itr); + + auto dense_accl = accumulate_ops[serial_index]; + + auto accl_id = dense_accl->inId(popart::AccumulateOp::getVarToUpdateInIndex()); + auto weight_id = gather->inId(popart::GatherOp::dataInIndex()); + popart::logging::pattern::info("Using tied accumulator {} for {}", accl_id, gather->name()); + + // Transpose must be inplace so the accumulator is actually updated + accl_id = transpose_inplace(accl_id, gather_grad); + + auto &graph = op->getGraph(); + + auto accum_type = dense_accl->getAccumulationType(); + popart::Tensor *factor = dense_accl->getFactor().isConst() ? nullptr : dense_accl->inTensor(popart::SparseAccumulateOp::getFactorInIndex()); + + if (factor != nullptr && accum_type == popart::AccumulationType::Mean) { + auto inv_counter = factor->id + "_inverse"; + if (!graph.getTensors().contains(inv_counter)) { + popart::TensorInfo one_info(factor->info.dataType(), {}); + std::vector one_data(one_info.nelms(), 1); + const auto &one_id = graph.getIr().createIntermediateTensorId("one"); + graph.getTensors().addConstInit(one_id, one_info, one_data.data()); + auto inv_op = graph.createConnectedOp( + {{popart::DivOp::getArg0InIndex(), one_id}, + {popart::DivOp::getArg1InIndex(), factor->id}}, + {{popart::DivOp::getOutIndex(), inv_counter}}, + popart::Onnx::Operators::Div_7, + popart::Op::Settings(graph, "mean_accumulate_inverse")); + transferBaseProperties(gather_grad, inv_op); + + for (auto cons : factor->consumers.getOps()) { + if (cons->isConvertibleTo() && + cons->inId(popart::AccumulateOp::getVarToUpdateInIndex()) == factor->id) { + graph.topoCons->insert(cons, inv_op); + } + } + } + accum_type = popart::AccumulationType::DampenedAdd; + factor = graph.getTensor(inv_counter); + } + + // Add sparseAccumulateOp. 
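+    // The SparseAccumulateOp created below replaces TiedGatherGrad entirely:
+    // rather than materialising a dense embedding gradient, it scatters the
+    // incoming gradient rows into the (in-place transposed) tied accumulator at
+    // the gathered indices, reusing the accumulation type and factor taken from
+    // the dense AccumulateOp (adjusted above for the Mean case).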
+ auto sparse_accl_up = std::make_unique( + accum_type, + dense_accl->getFactor(), + gather_grad->getAxis(), + popart::Op::Settings(graph, "_tiedAccumulate/" + std::to_string(serial_index))); + + auto sparse_accl = sparse_accl_up.get(); + transferBaseProperties(gather_grad, sparse_accl); + graph.moveIntoGraph(std::move(sparse_accl_up)); + + // Inputs + // Accumulator + sparse_accl->connectInTensor(popart::SparseAccumulateOp::getVarToUpdateInIndex(), + accl_id); + // Gradients + sparse_accl->connectInTensor( + popart::SparseAccumulateOp::getUpdaterInIndex(), + gather_grad->inId(popart::GatherGradOp::gradInIndex())); + // Scale + if (!dense_accl->getFactor().isConst()) { + sparse_accl->connectInTensor( + // the index at which the dampening scale factor is received, + popart::SparseAccumulateOp::getFactorInIndex(), + // the name of the dampening scale factor + factor->id); + } + // Indices + sparse_accl->connectInTensor( + popart::SparseAccumulateOp::getIndicesInIndex(), + gather_grad->inId(popart::GatherGradOp::indicesInIndex())); + + // Original weight to be cloned + sparse_accl->connectInTensor( + popart::SparseAccumulateOp::getOriginalVarToUpdateInIndex(), + weight_id); + + // Transfer TopoCons + graph.topoCons->transfer(gather_grad, sparse_accl); + + // gatherGrad output that will be isolated + auto grad_Id = gather_grad->outId(TiedGatherGradOp::gradOutIndex()); + + // Remove TiedGatherGrad + gather_grad->disconnectAllInputs(); + gather_grad->disconnectAllOutputs(); + graph.eraseOp(gather_grad->id); + + // Outputs + sparse_accl->createAndConnectOutTensor( + popart::SparseAccumulateOp::getUpdatedVarOutIndex(), + sparse_accl->name() + ":0"); + + // remove the gatherGrad output + graph.getTensors().remove(grad_Id); + + // Finalise sparse op + sparse_accl->setup(); + + return true; + } + + popart::TensorId transpose_inplace(popart::TensorId tid, popart::Op *op) const { + auto &graph = op->getGraph(); + + // TransposeInplaceOp's constructor requires a transposeOp + auto outplace_up = std::make_unique( + popart::Onnx::AiOnnx::OpSet9::Transpose, + std::vector{1, 0}, + popart::Op::Settings(graph, tid + "_Transpose")); + auto transpose_up = outplace_up->getInplaceVariant(popart::Onnx::CustomOperators::TransposeInplace); + + auto transpose = transpose_up.get(); + transferBaseProperties(op, transpose); + graph.moveIntoGraph(std::move(transpose_up)); + + transpose->connectInTensor(popart::TransposeOp::getInIndex(), tid); + popart::TensorId out_id = tid + "/transposed"; + transpose->createAndConnectOutTensor(popart::TransposeOp::getOutIndex(), out_id); + + transpose->setup(); + return out_id; + } +}; + +static popart::PatternCreator TiedGatherPatternCreator("TiedGatherPattern", true); +static popart::PatternCreator TiedGatherAccumulatePatternCreator("TiedGatherAccumulatePattern", true); diff --git a/model_zoo/bert/static_ipu/custom_ops/utils.cc b/model_zoo/bert/static_ipu/custom_ops/utils.cc new file mode 100644 index 0000000000000000000000000000000000000000..b6c6570f803cd28c2c61a30040b6a66eca9af3be --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/utils.cc @@ -0,0 +1,173 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +template +static T *search_producers_for(popart::Tensor *t, int max_depth=-1) { + + // Searched as far as we can without success + if (t->tensorType() == popart::TensorType::Variable || !t->hasProducer()) { + return nullptr; + } + auto op = t->getProducer(); + if (op->isConvertibleTo() && op->settings.executionContext == Ctx) { + return dynamic_cast(op); + } + + if (op->input->n() < 1) { + return nullptr; + } + + unsigned producer_index = 0; + if (op->input->n() > 1) { + if (op->isConvertibleTo()) { + producer_index = popart::AdamUpdaterOp::getAccl1InIndex(); + } else if (op->isConvertibleTo()) { + producer_index = popart::AdamVarUpdateOp::getUpdaterInIndex(); + } else if (op->isConvertibleTo()) { + producer_index = popart::AccumulateBaseOp::getUpdaterInIndex(); + } else if (op->isConvertibleTo()) { + producer_index = popart::DropoutGradOp::getGradInIndex(); + } else if (op->isConvertibleTo()) { + // Grad Unscaling for Adam-based optimizers + producer_index = popart::MulOp::getArg0InIndex(); + } else if (op->isConvertibleTo()) { + // Replicated Tensor Sharding + producer_index = popart::ReplicatedReduceScatterOp::getInIndex(); + } else if (op->isConvertibleTo()) { + // Replicated Tensor Sharding + producer_index = popart::ReplicatedAllGatherOp::getInIndex(); + } else { + return nullptr; + } + } + + // Providing a max-search depth of -1 will remove the depth limit at the cost of potentially + // unnecessary checks. + if (max_depth > 0) { + max_depth -= 1; + if (max_depth == 0) { + return nullptr; + } + } + + return search_producers_for(op->input->tensor(producer_index), max_depth); +} + +// Finds the underlying variable by searching through producers. +static popart::Tensor *get_variable(popart::Tensor *t) { + if (t->tensorType() == popart::TensorType::Variable || t->tensorType() == popart::TensorType::Const) { + return t; + } else if (!t->hasProducer()) { + return nullptr; + } + auto op = t->getProducer(); + if (op->input->n() != 1) { + return nullptr; + } + return get_variable(op->input->tensors().front()); +} + +// Attempts to find T by searching through consumers. +template +static T *search_consumers_for(popart::Tensor *w, std::queue &q) { + for (auto consumer : w->consumers.getOps()) { + if (consumer->isConvertibleTo() && consumer->settings.executionContext == Ctx) { + return dynamic_cast(consumer); + } + + if (consumer->isConvertibleTo()) { + q.push(consumer->output->tensor(popart::DropoutGradOp::getGradInIndex())); + } + if (consumer->isConvertibleTo()) { + q.push(consumer->output->tensor( + popart::ReplicatedReduceScatterOp::getOutIndex())); + } + + // TODO: Improve this as it's too general. Most ops that have one input and one output are view changing. 
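+    // Assume a consumer with exactly one input and one output (for example a
+    // reshape, transpose or cast) only changes the view of the tensor, and
+    // continue the search through its output.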
+ if (consumer->input->n() == 1 && consumer->output->n() == 1) { + q.push(consumer->output->tensor(0)); + } + } + if (q.size() < 1) { + return nullptr; + } + w = q.front(); + q.pop(); + return search_consumers_for(w, q); +} +template +static T *search_consumers_for(popart::Tensor *w) { + std::queue q; + return search_consumers_for(w, q); +} + +template +static T *weight_consumed_by(popart::Tensor *w) { + w = get_variable(w); + if (w) { + return search_consumers_for(w); + } + return nullptr; +} + +template +static void find_all_consumers(popart::Tensor *w,std::queue &q, std::vector &result) { + for (auto consumer : w->consumers.getOps()) { + if (std::find(result.begin(), result.end(), consumer) == result.end()) { + if (consumer->isConvertibleTo() && consumer->settings.executionContext == Ctx) { + result.push_back(dynamic_cast(consumer)); + } + if (consumer->isConvertibleTo()) { + q.push(consumer->output->tensor(popart::MatMulOp::getOutIndex())); + } + if (consumer->isConvertibleTo()) { + q.push(consumer->output->tensor( + popart::ReplicatedReduceScatterOp::getOutIndex())); + } + // Most ops that have one input and one output are view changing. + if (consumer->input->n() == 1 && consumer->output->n() == 1) { + q.push(consumer->output->tensor(0)); + } + } + } + if (q.size() < 1) { + return; + } + w = q.front(); + q.pop(); + return find_all_consumers(w, q, result); +} +template +static std::vector find_all_consumers(popart::Tensor *w) { + std::queue q; + std::vector result; + find_all_consumers(w, q, result); + return result; +} diff --git a/model_zoo/bert/static_ipu/custom_ops/workarounds/prevent_const_expr_folding_op.cc b/model_zoo/bert/static_ipu/custom_ops/workarounds/prevent_const_expr_folding_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..d6482ad4e98a80e46e98395b3d70190fe4bdb69b --- /dev/null +++ b/model_zoo/bert/static_ipu/custom_ops/workarounds/prevent_const_expr_folding_op.cc @@ -0,0 +1,137 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include +#include +#include + +namespace CustomOperators +{ + const popart::OperatorIdentifier PreventConstFolding = {"ai.graphcore", "PreventConstFolding", 1}; +} // namespace CustomOperators +namespace CustomGradOperators { + const popart::OperatorIdentifier PreventConstFoldingGrad = {"ai.graphcore", "PreventConstFoldingGrad", 1}; +} // namespace CustomGradOperators + +class PreventConstFoldingOp; +class PreventConstFoldingGradOp; +class PreventConstFoldingOpx; +class PreventConstFoldingGradOpx; + +// By default, const expressions ops get folded to optimise the graph and remove unnessary ops +// at the start. However, in this case, it causes the word embedding to exist in both its +// original and transposed form. By adding this op, the constant expression folding transform +// can't fold through it, so we prevent folding after this point. 
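+//
+// Conceptual sketch of the intent (illustration only, not taken from the PopART
+// sources):
+//
+//   without this op:  word_embedding (const) -> Transpose -> MatMul
+//                     const-expression folding pre-computes Transpose(W), so the
+//                     original and the transposed weight both stay live.
+//
+//   with this op:     word_embedding -> PreventConstFolding -> Transpose -> MatMul
+//                     folding cannot pass through the custom op, so only a single
+//                     copy of the weight is kept.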
+ +class PreventConstFoldingOp : public popart::Op +{ +public: + PreventConstFoldingOp(const popart::OperatorIdentifier &_opid, const Op::Settings &settings_) + : Op(_opid, settings_) {} + + void setup() final { outInfo(0) = inInfo(0); } + + std::unique_ptr clone() const { + return std::make_unique(*this); + } + + std::vector> getGradOps() { + std::vector> upops; + upops.emplace_back(std::make_unique(*this)); + return upops; + } + + float getSubgraphValue() const final { return getLowSubgraphValue(); } +}; + +static popart::OpDefinition PreventConstFoldingOpDef({}); + +static popart::OpCreator PreventConstFoldingOpCreator( + popart::OpDefinitions({{CustomOperators::PreventConstFolding, + PreventConstFoldingOpDef}}), + [](const popart::OpCreatorInfo &oci) -> std::unique_ptr { + return std::unique_ptr( + new PreventConstFoldingOp(oci.opid, oci.settings)); + }, + true); + +class PreventConstFoldingOpx : public popart::popx::Opx { +public: + PreventConstFoldingOpx(popart::Op *op, popart::popx::Devicex *devicex) : popart::popx::Opx(op, devicex) + { verifyOp(op, CustomOperators::PreventConstFolding); } + + popart::popx::InputCreatorType getInputCreatorType(popart::InIndex) const { + return popart::popx::InputCreatorType::CanUnwind; + } + + poplar::Tensor unwindTensorLayout(poplar::Tensor tensor, popart::InIndex, popart::OutIndex) const { + return tensor; + } + + popart::view::RegMap unwindRegion(popart::InIndex, popart::OutIndex) const { + return [this](const popart::view::Region &r) { + return popart::view::Regions(1, r); + }; + } + + void grow(poplar::program::Sequence &prog) const final { + insert(outId(0), getInTensor(0)); + } +}; + +class PreventConstFoldingGradOp : public PreventConstFoldingOp +{ +public: + PreventConstFoldingGradOp(const PreventConstFoldingOp &fwdOp) + : PreventConstFoldingOp(CustomGradOperators::PreventConstFoldingGrad, fwdOp.getSettings()) {} + + PreventConstFoldingGradOp(const popart::Op::Settings &settings) + : PreventConstFoldingOp(CustomGradOperators::PreventConstFoldingGrad, settings) {} + + std::unique_ptr clone() const final { + return std::make_unique(*this); + } + + const std::vector &gradInputInfo() const { + static const std::vector inInfo = { + {0, 0, popart::GradOpInType::GradOut}}; + + return inInfo; + } + const std::map &gradOutToNonGradIn() const { + static const std::map outInfo = {{0, 0}}; + return outInfo; + } +}; + +class PreventConstFoldingGradOpx : public popart::popx::Opx { +public: + PreventConstFoldingGradOpx(popart::Op *op, popart::popx::Devicex *devicex) + : popart::popx::Opx(op, devicex) { + verifyOp(op, CustomGradOperators::PreventConstFoldingGrad); + } + + void grow(poplar::program::Sequence &prog) const final { + setOutTensor(0, getInTensor(0)); + } +}; + +static popart::popx::OpxCreator + preventConstFoldingOpxCreator(CustomOperators::PreventConstFolding); +static popart::popx::OpxCreator + preventConstFoldingGradOpxCreator(CustomGradOperators::PreventConstFoldingGrad); diff --git a/model_zoo/bert/static_ipu/dataset_ipu.py b/model_zoo/bert/static_ipu/dataset_ipu.py new file mode 100644 index 0000000000000000000000000000000000000000..4e27c5d7041c6112a7f68332bcfe37220f009584 --- /dev/null +++ b/model_zoo/bert/static_ipu/dataset_ipu.py @@ -0,0 +1,270 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import multiprocessing +import threading +from queue import Queue + +import h5py +import numpy as np +import paddle + +KEYS = ("input_ids", "input_mask", "segment_ids", "masked_lm_positions", "masked_lm_ids", "next_sentence_labels") + + +def shuffle_dict(dic, len): + idxs = np.arange(len) + np.random.shuffle(idxs) + for k, v in dic.items(): + dic[k] = v[idxs] + + +class PretrainingHDF5DataLoader: + def __init__( + self, + input_files, + max_seq_length=128, + max_mask_tokens=20, + batch_size=1, + dtype=np.int32, + shuffle=False, + pad_position_value=511, + num_workers=3, + ): + self.files = input_files + self.batch_size = batch_size + self.max_seq_length = max_seq_length + self.max_mask_tokens = max_mask_tokens + self.dtype = dtype + self.shuffle = shuffle + self.pad_position_value = pad_position_value + if shuffle: + np.random.shuffle(self.files) + + self.counter = 0 + + # get total number of samples + pool = multiprocessing.Pool(min(multiprocessing.cpu_count(), 32)) + num_samples = pool.map(self.samples_in_file, self.files) + pool.close() + pool.join() + self.total_samples = sum(num_samples) + self.len = self.total_samples // self.batch_size + assert self.len > 1, f"Batch size {self.batch_size} larger than number of samples {self.total_samples}" + + # notify feed and fetch processes/thread to stop + self.event_queue = multiprocessing.Manager().Queue(10) + + # buffer to store final data + self.feed_buffer = Queue(20) + + # number of processes to do remask + self.num_workers = num_workers + # each feed_worker has one process_buffer to use + self.process_buffers = [multiprocessing.Manager().Queue(10) for _ in range(num_workers)] + self.split_files = np.array_split(self.files, self.num_workers) + # feed_worker will load data from h5py files, and do remask process + self.feed_workers = [ + multiprocessing.Process( + target=self.fill_buffer_loop, args=(self.split_files[idx], self.process_buffers[idx]) + ) + for idx in range(self.num_workers) + ] + for p in self.feed_workers: + p.start() + + # index for which process_buffer is used each time + self.post_fetch_idx = 0 + # load final data from process_buffers + self.fetch_worker = threading.Thread(target=self.post_fetch) + self.fetch_worker.start() + + def samples_in_file(self, filename): + with h5py.File(filename, "r") as f: + data_len = f[KEYS[0]].shape[0] + return data_len + + def release(self): + self.event_queue.put("END") + while not self.feed_buffer.empty(): + self.feed_buffer.get() + for process_buffer in self.process_buffers: + while not process_buffer.empty(): + process_buffer.get() + self.fetch_worker.join() + for p in self.feed_workers: + p.join() + return + + def __len__(self): + return self.len + + def __iter__(self): + self.counter = 0 + return self + + def __next__(self): + result = self.feed_buffer.get() + self.counter += 1 + return result + + def post_fetch(self): + while True: + if not self.event_queue.empty(): + return + if not self.process_buffers[self.post_fetch_idx].empty(): + logging.debug(f"self.post_fetch_idx: {self.post_fetch_idx}") + np_feed_list = 
self.process_buffers[self.post_fetch_idx].get() + self.post_fetch_idx += 1 + if self.post_fetch_idx == self.num_workers: + self.post_fetch_idx = 0 + elif self.post_fetch_idx > self.num_workers: + raise Exception("post_fetch_idx must < num_workers") + + lod_feed_list = [] + for data in np_feed_list: + tensor = paddle.fluid.core.LoDTensor() + place = paddle.CPUPlace() + tensor.set(data, place) + lod_feed_list.append(tensor) + self.feed_buffer.put(lod_feed_list) + + def fill_buffer_loop(self, files, process_buffer): + data = None + data_index = 0 + file_index = 0 + + def multiprocess_fill_buffer(data, file_index, data_index): + if data is None: + data = self.load_one_file(files[file_index]) + file_index += 1 + data_index = 0 + + curr_batch = [] + still_required = self.batch_size + while still_required > 0: + data_batch = {k: data[k][data_index : data_index + still_required] for k in KEYS} + data_batch_len = len(data_batch[KEYS[0]]) + data_index += data_batch_len + curr_batch.append(data_batch) + curr_batch_len = sum(len(x[KEYS[0]]) for x in curr_batch) + still_required = self.batch_size - curr_batch_len + if still_required > 0: + if file_index >= len(files): + np.random.shuffle(files) + file_index = 0 + + data = self.load_one_file(files[file_index]) + file_index += 1 + data_index = 0 + if not curr_batch_len == self.batch_size: + raise Exception("data length should equal to batch_size") + + result = {} + for k in KEYS: + result[k] = np.concatenate([item[k] for item in curr_batch], axis=0) + process_buffer.put(self.do_remask(result)) + + return data, file_index, data_index + + while True: + if self.event_queue.empty(): + data, file_index, data_index = multiprocess_fill_buffer(data, file_index, data_index) + else: + return + + def do_remask(self, samples): + input_ids = samples["input_ids"] + segment_ids = samples["segment_ids"] + masked_lm_positions = samples["masked_lm_positions"] + masked_lm_ids = samples["masked_lm_ids"] + next_sentence_labels = samples["next_sentence_labels"] + masked_lm_weights = np.ones_like(masked_lm_ids, dtype=np.int32) + masked_lm_weights[masked_lm_ids == 0] = 0 + + # post process + batch_size, seq_len = input_ids.shape + formatted_pos = self.pad_position_value * np.ones_like(samples["input_ids"]) + formatted_input = np.zeros_like(input_ids) + formatted_seg = np.zeros_like(segment_ids) + formatted_mask_labels = np.zeros((batch_size, self.max_mask_tokens), dtype=masked_lm_ids.dtype) + + valid_seq_positions = [] + valid_mask_positions = masked_lm_weights == 1 + valid_mask_len = np.sum(valid_mask_positions, axis=1).reshape(-1, 1) + for i, mask_pos in enumerate(masked_lm_positions): + pos = [True] * seq_len + for mask_index, m in enumerate(mask_pos): + if mask_index < valid_mask_len[i]: + pos[m] = False + valid_seq_positions.append(np.logical_and(pos, input_ids[i] != 0)) + valid_seq_len = np.minimum( + np.sum(valid_seq_positions, axis=1) + self.max_mask_tokens, self.max_seq_length + ).reshape(-1, 1) + unmasked_len = np.minimum(np.sum(valid_seq_positions, axis=1), self.max_seq_length - self.max_mask_tokens) + for i in range(batch_size): + target_mask_indices = np.arange(valid_mask_len[i]) + target_seq_indices = self.max_mask_tokens + np.arange(unmasked_len[i]) + source_mask_indices = masked_lm_positions[i][valid_mask_positions[i]] + source_seq_indices = np.arange(seq_len)[valid_seq_positions[i]][: unmasked_len[i]] + + target_indices = np.hstack([target_mask_indices, target_seq_indices]) + source_indices = np.hstack([source_mask_indices, source_seq_indices]) + + 
formatted_pos[i, target_indices] = source_indices + formatted_input[i, target_indices] = input_ids[i, source_indices] + formatted_seg[i, target_indices] = segment_ids[i, source_indices] + formatted_mask_labels[i] = masked_lm_ids[i, : self.max_mask_tokens] + + return [ + formatted_input.astype(np.int32), + formatted_seg.astype(np.int32), + formatted_pos.astype(np.int32), + valid_mask_len.astype(np.int32), + valid_seq_len.astype(np.int32), + formatted_mask_labels.astype(np.int32), + next_sentence_labels.astype(np.int32), + ] + + def load_one_file(self, file_path): + data = self.load_hdf5(file_path) + + if self.shuffle: + shuffle_dict(data, len(data[KEYS[0]])) + + return data + + def load_hdf5(self, filename): + with h5py.File(filename, "r") as f: + data = {key: np.asarray(f[key][:]) for key in KEYS} + return data + + +if __name__ == "__main__": + import glob + + base_dir = "data_path/wikicorpus_en/" + input_files = glob.glob(f"{base_dir}/*training*.hdf5") + input_files.sort() + # print(input_files) + + seed = 1984 + np.random.seed(seed) + paddle.seed(seed) + + data_loader = PretrainingHDF5DataLoader(input_files, batch_size=65536, shuffle=True) + + for idx, batch in enumerate(data_loader): + print(f"{idx}: {batch[0].shape()}") diff --git a/model_zoo/bert/static_ipu/load_tf_ckpt.py b/model_zoo/bert/static_ipu/load_tf_ckpt.py new file mode 100644 index 0000000000000000000000000000000000000000..2837fd7a69885015b6a81dbb2fe524602ca42c92 --- /dev/null +++ b/model_zoo/bert/static_ipu/load_tf_ckpt.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
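+
+# Typical usage (illustrative sketch; the checkpoint path is an example, and the
+# `args` object is assumed to provide fields such as `task`, `num_hidden_layers`,
+# `hidden_size`, `vocab_size` and `max_position_embeddings`):
+#
+#     initializers, opt_params = load_initializers_from_tf("bert_model.ckpt", args)
+#     # `initializers` maps Paddle parameter names (e.g. "linear_0.w_0") to numpy
+#     # arrays; `opt_params` holds any mapped optimizer moment tensors.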
+ +import os +from logging import getLogger + +import numpy as np + +logger = getLogger(__name__) + + +def get_tf_mapping(args): + squad_mapping = {"cls/squad/output_weights": "linear_72.w_0", "cls/squad/output_bias": "linear_72.b_0"} + + tf_to_pdmodel = { + "bert/embeddings/word_embeddings": "ipu_bert_embeddings_0.w_0", + "bert/embeddings/position_embeddings": "embedding_0.w_0", + "bert/embeddings/token_type_embeddings": "ipu_bert_embeddings_0.w_1", + "bert/embeddings/LayerNorm/gamma": "layer_norm_0.w_0", + "bert/embeddings/LayerNorm/beta": "layer_norm_0.b_0", + } + for i in range(args.num_hidden_layers): + layer = { + f"bert/encoder/layer_{i}/attention/self/query/bias": f"bert_model_0.b_{i}", + f"bert/encoder/layer_{i}/attention/self/key/bias": f"bert_model_0.b_{i}", + f"bert/encoder/layer_{i}/attention/self/value/bias": f"bert_model_0.b_{i}", + f"bert/encoder/layer_{i}/attention/output/dense/kernel": f"linear_{i*6}.w_0", + f"bert/encoder/layer_{i}/attention/output/dense/bias": f"linear_{i*6}.b_0", + f"bert/encoder/layer_{i}/attention/output/LayerNorm/gamma": f"layer_norm_{i*4+2}.w_0", + f"bert/encoder/layer_{i}/attention/output/LayerNorm/beta": f"layer_norm_{i*4+2}.b_0", + f"bert/encoder/layer_{i}/intermediate/dense/kernel": f"linear_{i*6+2}.w_0", + f"bert/encoder/layer_{i}/intermediate/dense/bias": f"linear_{i*6+2}.b_0", + f"bert/encoder/layer_{i}/output/dense/kernel": f"linear_{i*6+3}.w_0", + f"bert/encoder/layer_{i}/output/dense/bias": f"linear_{i*6+3}.b_0", + f"bert/encoder/layer_{i}/output/LayerNorm/gamma": f"layer_norm_{(i+1)*4}.w_0", + f"bert/encoder/layer_{i}/output/LayerNorm/beta": f"layer_norm_{(i+1)*4}.b_0", + } + layer[f"bert/encoder/layer_{i}/attention/self/query/kernel"] = f"bert_model_0.w_{i*3+0}" + layer[f"bert/encoder/layer_{i}/attention/self/key/kernel"] = f"bert_model_0.w_{i*3+1}" + layer[f"bert/encoder/layer_{i}/attention/self/value/kernel"] = f"bert_model_0.w_{i*3+2}" + tf_to_pdmodel.update(**layer) + + if args.task == "PRETRAINING": + logger.error("Mapping ckpt weights is only supported in SQUAD task.") + elif args.task == "SQUAD": + tf_to_pdmodel.update(**squad_mapping) + + return tf_to_pdmodel + + +def generate_initializers(args, map_names, load_data, mapping, transform={}): + initializers = {} + initializers_param = {} + initializers_opt = {} + + qkv_tensor_range = { + "query": (0, args.hidden_size), + "key": (args.hidden_size, args.hidden_size * 2), + "value": (args.hidden_size * 2, args.hidden_size * 3), + } + + for name, array in zip(map_names, load_data): + logger.debug(f"Initialising tensor from checkpoint {name} -> {mapping[name]}") + + # config["lamb_m_dtype"] is for setting the data type for accl1 of lamb + # BERT can use FP16 for accl1 without lossing accuracy + # accl2 is always in FP32 + lamb_m_dtype = np.float32 + dtype = np.float32 + + if "moment1" in mapping[name]: + if array.dtype != lamb_m_dtype: + array = array.astype(lamb_m_dtype) + elif "moment2" in mapping[name]: + if array.dtype != np.float32: + array = array.astype(np.float32) + elif array.dtype != dtype: + array = array.astype(dtype) + + # If it's part of QKV biases, we need to handle separately as those 3 + # tensors need concatenating into one + if "bert_model_0.b" in mapping[name]: + qkv_part = name.split("/")[5] + if mapping[name] not in initializers.keys(): + qkv_shape = array.shape[0] * 3 + initializers[mapping[name]] = np.empty(qkv_shape, dtype=array.dtype) + + start_idx = qkv_tensor_range[qkv_part][0] + end_idx = qkv_tensor_range[qkv_part][1] + 
initializers[mapping[name]][start_idx:end_idx] = array + logger.debug(f"Initialising QKV_bias component {name}[{start_idx}:{end_idx}] from checkpoint") + continue + + if name in transform: + array = transform[name](array) + + padded_vocab_length = args.vocab_size + if "bert_embeddings_0.w_0" in mapping[name]: + tf_vocab_length = array.shape[0] + diff = padded_vocab_length - tf_vocab_length + # Pad or Crop the vocab. + if diff > 0: + logger.info(f"Padding the vocabulary. From {tf_vocab_length} to {padded_vocab_length}") + pad = np.zeros((diff, args.hidden_size)).astype(array.dtype) + array = np.concatenate((array, pad), axis=0) + else: + logger.warning( + f"Cropping the vocabulary may negatively effect performance. From {tf_vocab_length} to {padded_vocab_length}" + ) + array = np.array(array[:padded_vocab_length, :]) + # if args.task == "PRETRAINING": + # We use transposed weight in both pretraining and squad + array = np.transpose(array, [1, 0]) + + if "embedding_0.w_0" in mapping[name]: + max_pos, hidden_len = array.shape + if max_pos > args.max_position_embeddings: + array = array[: args.max_position_embeddings, :] + + # Otherwise just copy the positional embeddings over and over again as is done in longformer + elif max_pos < args.max_position_embeddings: + logger.warning("Not enough positional embeddings in checkpoint, copying to match length...") + array = array[np.mod(np.arange(args.max_position_embeddings), max_pos)] + + initializers[mapping[name]] = array.copy() + for k in initializers: + if "moment" in k: + initializers_opt[k] = initializers[k] + else: + initializers_param[k] = initializers[k] + return initializers_param, initializers_opt + + +# util function for load tf pretrained weight +def load_initializers_from_tf(file_path, args): + """ + Loads weights, etc. from Tensorflow files into a dictionary of Numpy Arrays. + + Can read either checkpoint files, or frozen graphs, according to the + `is_checkpoint` flag, passed in as the second argument. + """ + try: + import tensorflow as tf + except ImportError: + logger.error( + "Loading a TensorFlow model requires TensorFlow to be installed. " + "Please see https://www.tensorflow.org/install/ for installation " + "instructions." + ) + raise + + tf_path = os.path.abspath(file_path) + logger.info("Converting TensorFlow checkpoint from {}".format(tf_path)) + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + + mapping = get_tf_mapping(args) + map_names = [name for name, shape in init_vars if name in mapping.keys()] + for name in (n for n, _ in init_vars if n not in mapping.keys()): + logger.debug(f"Skipping load of {name} - Not in mapping") + + load_data = [tf.train.load_variable(tf_path, name) for name in map_names] + initializers, opt_params = generate_initializers(args, map_names, load_data, mapping) + return initializers, opt_params diff --git a/model_zoo/bert/static_ipu/modeling.py b/model_zoo/bert/static_ipu/modeling.py new file mode 100644 index 0000000000000000000000000000000000000000..b8c0503e1be51052d1a78cdfabf3cf33783a4088 --- /dev/null +++ b/model_zoo/bert/static_ipu/modeling.py @@ -0,0 +1,635 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from contextlib import ExitStack +from typing import List, NamedTuple + +import numpy as np +import paddle +import paddle.fluid +import paddle.nn as nn +import paddle.static +from paddle.nn import Layer + + +class DeviceScope(object): + def __init__(self, index, stage, name_scope=None): + self.index = index + self.stage = stage + self.name_scope = name_scope + + def __enter__(self): + self.stack = ExitStack() + self.stack.enter_context(paddle.static.ipu_shard_guard(index=self.index, stage=self.stage)) + if self.name_scope is not None: + self.stack.enter_context(paddle.static.name_scope(self.name_scope)) + return self + + def __exit__(self, *exp): + self.stack.close() + return False + + +class IpuBertConfig(NamedTuple): + """ + The configuration for BERT Model. + Args: + seq_len (int): + The sequence length. Default to `128`. + max_position_embeddings (int): + The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input + sequence. Defaults to `512`. + max_predictions_per_seq (int): + The max number of the masked token each sentence. Default to `20`. + hidden_size (int): + Dimensionality of the embedding layer, encoder layer and pooler layer. Defaults to `768`. + vocab_size (int): + Vocabulary size of `inputs_ids` in `BertModel`. Also is the vocab size of token embedding matrix. + Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `BertModel`. + num_hidden_layers (int): + Number of hidden layers in the Transformer encoder. Defaults to `12`. + available_mem_proportion (float): + The available proportion of memory used by conv or matmul. Default to `0.28`. + type_vocab_size (int): + The vocabulary size of `token_type_ids`. + Defaults to `2`. + hidden_dropout_prob (float): + The dropout probability for all fully connected layers in the embeddings and encoder. + Defaults to `0.1`. + attention_probs_dropout_prob (float): + The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. + Defaults to `0.1`. + task (str): + The type of the NLP model. + layers_per_ipu (list): + Number of attention layers executed on each IPU. 
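+        embeddings_scope / attn_scopes / ff_scopes / mlm_scope / nsp_scope (DeviceScope):
+            IPU placement scopes for the embedding, attention, feed-forward, MLM and NSP parts.
+
+    Example (illustrative sketch; the field values below are assumptions of this example):
+        .. code-block:: python
+
+            config = IpuBertConfig(
+                micro_batch_size=1,
+                seq_len=128,
+                num_hidden_layers=12,
+                layers_per_ipu=[4, 4, 4],
+                task="PRETRAINING",
+            )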
+ """ + + micro_batch_size: int = 1 + seq_len: int = 128 + max_position_embeddings: int = 512 + max_predictions_per_seq: int = 20 + hidden_size: int = 768 + vocab_size: int = 30400 + num_hidden_layers: int = 12 + available_mem_proportion: float = 0.28 + type_vocab_size: int = 2 + + hidden_dropout_prob: float = 0.1 + attention_probs_dropout_prob: float = 0.1 + + # Choices: PRETRAINING (MLM + NSP), SQUAD + task: str = "PRETRAINING" + layers_per_ipu: List = None + + embeddings_scope: DeviceScope = None + attn_scopes: DeviceScope = None + ff_scopes: DeviceScope = None + mlm_scope: DeviceScope = None + nsp_scope: DeviceScope = None + + +class IpuBertEmbeddings(Layer): + """ + Include embeddings from word, position and token_type embeddings + """ + + def __init__(self, config, custom_ops=None): + super(IpuBertEmbeddings, self).__init__() + self.config = config + self.word_embeddings_weights = self.create_parameter( + shape=[config.hidden_size, config.vocab_size], dtype="float32" + ) + self.token_embeddings_weights = self.create_parameter( + shape=[config.type_vocab_size, config.hidden_size], dtype="float32" + ) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.layer_norm = nn.LayerNorm(config.hidden_size, epsilon=0.001) + self.dropout = nn.Dropout(self.config.hidden_dropout_prob) + self.custom_ops = custom_ops + + def forward(self, indices, segments, positions): + # word embeddings + word_embeddings_weights = paddle.transpose(self.word_embeddings_weights, [1, 0]) + input_embeddings = paddle.gather(word_embeddings_weights, indices, axis=0) + + # position_embeddings + position_embeddings = self.position_embeddings(positions) + + # token_type_embeddings + token_type_embeddings = paddle.fluid.input.one_hot(segments, depth=2) + token_type_embeddings = paddle.matmul(token_type_embeddings, self.token_embeddings_weights) + + embeddings = paddle.add(input_embeddings, position_embeddings) + embeddings = paddle.add(embeddings, token_type_embeddings) + embeddings = self.layer_norm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings, self.word_embeddings_weights + + +class BertModel(Layer): + """ + The bare BERT Model transformer outputting raw hidden-states. + + This model refers to :class:`~paddlenlp.transformers.bert.BertModel`. + + Args: + config (IpuBertConfig): + configuration of bert. + custom_ops: + custom defined operators which can be found in directory `custom_ops`. 
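+
+    Example (illustrative sketch; `config` is an :class:`IpuBertConfig` and `custom_ops`
+    holds the compiled custom operators, both assumed to be prepared by the caller):
+        .. code-block:: python
+
+            model = BertModel(config, custom_ops)
+            sequence_output, word_embeddings = model(indices, segments, positions, input_mask)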
+ """ + + def __init__(self, config, custom_ops=None): + super(BertModel, self).__init__() + self.config = config + self.custom_ops = custom_ops + + qk_scale = 1 / np.sqrt(self.config.hidden_size / self.config.num_hidden_layers) + self.qk_scale_attrs = { + "name": "QK_scale", + "shape": [1], + "dtype": "float32", + "value": qk_scale, + } + self.qkv_shape = [-1, self.config.seq_len, 12, 64] + self.masks = {} + + self.embedding = IpuBertEmbeddings(self.config, custom_ops) + + def _encoder_layer_ipu_offset(self, layer_index): + encoder_index = 0 + if len(self.config.layers_per_ipu) == 1: + encoder_index = layer_index // self.config.layers_per_ipu[0] + else: + for ipu, num_layers in enumerate(self.config.layers_per_ipu): + layer_index -= num_layers + if layer_index < 0: + encoder_index = ipu + break + return encoder_index + + def should_checkpoint(self, layer_index): + encoder_index = self._encoder_layer_ipu_offset(layer_index) + if len(self.config.layers_per_ipu) == 1: + layers = self.config.layers_per_ipu[0] + layer_index -= encoder_index * layers + else: + layers = self.config.layers_per_ipu[encoder_index] + layer_index -= sum(self.config.layers_per_ipu[:encoder_index]) + return layer_index < (layers - 1) + + def forward(self, indices, segments, positions, input_mask): + r""" + The BertModel forward method, overrides the `__call__()` special method. + + Args: + indices (Tensor): + Indices of input sequence tokens in the vocabulary. They are + numerical representations of tokens that build the input sequence. + Its data type should be `int32` and it has a shape of [batch_size * sequence_length]. + segments (Tensor): + Segment token indices to indicate different portions of the inputs. + Selected in the range ``[0, type_vocab_size - 1]``. + Its data type should be `int32` and it has a shape of [batch_size * sequence_length]. + positions(Tensor): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range ``[0, + max_position_embeddings - 1]``. + Shape as `[batch_size * sequence_length]` and dtype as int32. + input_mask (Tensor, optional): + Mask used in multi-head attention to avoid performing attention on to some unwanted positions, + usually the paddings or the subsequent positions. + If the task is PRETRAINING: + input_mask[0] is the index that masking starts in the mask_tokens + input_mask[1] is the index that masking starts in the rest of the sequence + Otherwise + input_mask is the mask tensor that has -1000 in positions to be masked and 0 otherwise. + + Returns: + tuple: Returns tuple (`sequence_output`, `word_embeddings_weights`). + + With the fields: + + - `sequence_output` (Tensor): + Sequence of hidden-states at the last layer of the model. + It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. 
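+
+            - `word_embeddings_weights` (Tensor):
+                The word embedding weight matrix, returned so that the pretraining MLM head
+                can tie its output projection to the input embeddings.
+                Its data type should be float32 and its shape is [hidden_size, vocab_size].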
+ """ + + with self.config.embeddings_scope: + sequence_output, word_embeddings_weights = self.embedding(indices, segments, positions) + + if self.config.task == "PRETRAINING": + with paddle.static.ipu_shard_guard(index=0, stage=0): + input_mask[0] = self.custom_ops.detach(input_mask[0]) + input_mask[1] = self.custom_ops.detach(input_mask[1]) + + for i in range(self.config.num_hidden_layers): + # Attention + attn_scope = self.config.attn_scopes[i] + with attn_scope: + with paddle.static.name_scope(f"Layer{i}/Attention"): + layer_input = sequence_output + q = self.create_parameter( + shape=[self.config.hidden_size, self.config.hidden_size], dtype="float32" + ) + k = self.create_parameter( + shape=[self.config.hidden_size, self.config.hidden_size], dtype="float32" + ) + v = self.create_parameter( + shape=[self.config.hidden_size, self.config.hidden_size], dtype="float32" + ) + qkv = paddle.concat([q, k, v], axis=1) + qkv = paddle.matmul(sequence_output, qkv) + qkv.block.ops[-1]._set_attr("__available_memory", self.config.available_mem_proportion) + q, k, v = paddle.split( + qkv, + num_or_sections=[self.config.hidden_size, self.config.hidden_size, self.config.hidden_size], + axis=1, + ) + q = paddle.reshape(q, self.qkv_shape) + q = paddle.transpose(q, [0, 2, 1, 3]) + k = paddle.reshape(k, self.qkv_shape) + k = paddle.transpose(k, [0, 2, 3, 1]) + v = paddle.reshape(v, self.qkv_shape) + v = paddle.transpose(v, [0, 2, 1, 3]) + + # Attention calculation + with paddle.static.name_scope("Z"): + if self.config.task == "PRETRAINING": + if attn_scope.index in self.masks: + final_mask = self.masks[attn_scope.index] + else: + with paddle.static.name_scope("Mask"): + base_value = np.arange(self.config.seq_len).astype("int32") + base = paddle.fluid.layers.assign(base_value) + mmask = paddle.less_than(base, input_mask[0]) + mask_value = np.greater_equal(base_value, self.config.max_predictions_per_seq) + mask = paddle.fluid.layers.assign(mask_value) + mmask = paddle.logical_or(mmask, mask) + smask = paddle.less_than(base, input_mask[1]) + final_mask = paddle.logical_and(mmask, smask) + final_mask = paddle.cast(final_mask, "float16") + sub_attrs = { + "name": "constant_sub", + "shape": [1], + "dtype": "float32", + "value": 1, + } + mul_attrs = { + "name": "constant_mul", + "shape": [1], + "dtype": "float32", + "value": 1000, + } + final_mask = paddle.fluid.layers.elementwise_sub( + final_mask, paddle.fluid.layers.fill_constant(**sub_attrs) + ) + final_mask = paddle.fluid.layers.elementwise_mul( + final_mask, paddle.fluid.layers.fill_constant(**mul_attrs) + ) + final_mask = paddle.reshape(final_mask, [-1, 1, 1, self.config.seq_len]) + final_mask = self.custom_ops.detach(final_mask) + self.masks[attn_scope.index] = final_mask + + qk = paddle.matmul(q, k) + qk.block.ops[-1]._set_attr("__available_memory", self.config.available_mem_proportion) + qk_scale = paddle.fluid.layers.fill_constant(**self.qk_scale_attrs) + qk = paddle.fluid.layers.elementwise_mul(qk, qk_scale) + + if self.config.task == "PRETRAINING": + qk = paddle.fluid.layers.elementwise_add(qk, final_mask) + else: + # for SQUAD task, input_mask is calculated in data preprocessing + qk = paddle.fluid.layers.elementwise_add(qk, input_mask) + + qk = paddle.fluid.layers.softmax(qk) + if self.config.task == "SQUAD": + qk = paddle.fluid.layers.dropout( + qk, self.config.attention_probs_dropout_prob, dropout_implementation="upscale_in_train" + ) + qkv = paddle.matmul(qk, v) + qkv.block.ops[-1]._set_attr("__available_memory", 
self.config.available_mem_proportion) + qkv = paddle.transpose(qkv, [0, 2, 1, 3]) + qkv = paddle.reshape(qkv, [-1, self.config.hidden_size]) + + qkv_linear = nn.Linear(self.config.hidden_size, self.config.hidden_size, bias_attr=False) + qkv = qkv_linear(qkv) + qkv.block.ops[-1]._set_attr("__available_memory", self.config.available_mem_proportion) + qkv = paddle.fluid.layers.dropout( + qkv, self.config.attention_probs_dropout_prob, dropout_implementation="upscale_in_train" + ) + attention = paddle.add(layer_input, qkv) + layer_norm1 = nn.LayerNorm(self.config.hidden_size, epsilon=0.001) + attention = layer_norm1(attention) + + # FF + with self.config.ff_scopes[i]: + with paddle.static.name_scope(f"Layer{i}/FF"): + ff_linear1 = nn.Linear(self.config.hidden_size, 4 * self.config.hidden_size) + ff_linear2 = nn.Linear(4 * self.config.hidden_size, self.config.hidden_size) + with paddle.static.name_scope("1"): + ff = ff_linear1(attention) + ff.block.ops[-2]._set_attr("__available_memory", self.config.available_mem_proportion) + ff = paddle.fluid.layers.gelu(ff, approximate=True) + with paddle.static.name_scope("2"): + ff = ff_linear2(ff) + ff.block.ops[-2]._set_attr("__available_memory", self.config.available_mem_proportion) + ff = paddle.fluid.layers.dropout( + ff, self.config.attention_probs_dropout_prob, dropout_implementation="upscale_in_train" + ) + ff = paddle.add(attention, ff) + layer_norm2 = nn.LayerNorm(self.config.hidden_size, epsilon=0.001) + sequence_output = layer_norm2(ff) + + if self.should_checkpoint(i): + with paddle.static.name_scope(f"Layer{i}"): + logging.info(f"add checkpointoutput for ff_{i}") + sequence_output = self.custom_ops.checkpointoutput(sequence_output) + return sequence_output, word_embeddings_weights + + +class IpuBertForQuestionAnswering(Layer): + """ + Bert Model with a span classification head on top for extractive question-answering tasks like + SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and + `span end logits`). + + Args: + hidden_size (int): + Dimensionality of the embedding layer, encoder layer and pooler layer. Defaults to `768`. + seq_len (int): + See :class:`IpuBertConfig`. + """ + + def __init__(self, hidden_size, seq_len): + super(IpuBertForQuestionAnswering, self).__init__() + self.hidden_size = hidden_size + self.seq_len = seq_len + self.classifier = nn.Linear(hidden_size, 2) + + def forward(self, sequence_output): + r""" + The IpuBertForQuestionAnswering forward method, overrides the __call__() special method. + + Args: + sequence_output (Tensor): + See :class:`BertModel`. + + Returns: + tuple: Returns tuple (`start_logits`, `end_logits`). + + With the fields: + + - `start_logits` (Tensor): + A tensor of the input token classification logits, indicates the start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + - `end_logits` (Tensor): + A tensor of the input token classification logits, indicates the end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. 
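+
+        Example (illustrative sketch; `sequence_output` is the first output of
+        :class:`BertModel` and the constructor arguments are example values):
+            .. code-block:: python
+
+                qa_head = IpuBertForQuestionAnswering(hidden_size=768, seq_len=384)
+                start_logits, end_logits = qa_head(sequence_output)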
+ """ + logits = self.classifier(sequence_output) + + start_logits = paddle.slice(input=logits, axes=[1], starts=[0], ends=[1]) + end_logits = paddle.slice(input=logits, axes=[1], starts=[1], ends=[2]) + + start_logits = paddle.reshape(start_logits, [-1, self.seq_len]) + end_logits = paddle.reshape(end_logits, [-1, self.seq_len]) + return start_logits, end_logits + + +class IpuBertQAAccAndLoss(paddle.nn.Layer): + """ + Criterion for Question and Answering. + """ + + def __init__(self, custom_ops=None): + super(IpuBertQAAccAndLoss, self).__init__() + self.custom_ops = custom_ops + + def forward(self, start_logits, end_logits, start_labels, end_labels): + r""" + The IpuBertQAAccAndLoss forward method, overrides the __call__() special method. + + Args: + start_logits (Tensor): + See :class:`IpuBertForQuestionAnswering`. + end_logits (Tensor): + See :class:`IpuBertForQuestionAnswering`. + start_labels (Tensor): + Labels for start position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + end_labels (Tensor): + Labels for end position of the labelled span. + Its data type should be float32 and its shape is [batch_size, sequence_length]. + + """ + with paddle.static.name_scope("loss"): + start_loss = paddle.fluid.layers.softmax(start_logits) + start_loss = self.custom_ops.custom_nll_loss(start_loss, start_labels, 1, "None", False) + end_loss = paddle.fluid.layers.softmax(end_logits) + end_loss = self.custom_ops.custom_nll_loss(end_loss, end_labels, 1, "None", False) + loss = paddle.add(start_loss, end_loss) + + with paddle.static.name_scope("acc"): + start_logits = paddle.fluid.layers.argmax(start_logits, axis=1) + end_logits = paddle.fluid.layers.argmax(end_logits, axis=1) + start_equal = paddle.fluid.layers.equal(start_logits, start_labels) + end_equal = paddle.fluid.layers.equal(end_logits, end_labels) + start_equal = paddle.fluid.layers.cast(start_equal, "float32") + end_equal = paddle.fluid.layers.cast(end_equal, "float32") + start_acc = paddle.mean(start_equal) + end_acc = paddle.mean(end_equal) + + return start_acc, end_acc, loss + + +class IpuBertPretrainingMLMHeads(Layer): + """ + Perform language modeling task. + + Args: + hidden_size (int): + See :class:`IpuBertConfig`. + vocab_size (int): + See :class:`IpuBertConfig`. + max_position_embeddings (int): + See :class:`IpuBertConfig`. + max_predictions_per_seq (int): + See :class:`IpuBertConfig`. + seq_len (int): + See :class:`IpuBertConfig`. 
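+        Returns:
+            tuple: Returns tuple (`start_acc`, `end_acc`, `loss`): the accuracy of the predicted
+            start positions, the accuracy of the predicted end positions, and the sum of the
+            negative log-likelihood losses of the two predictions.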
+ """ + + def __init__(self, hidden_size, vocab_size, max_position_embeddings, max_predictions_per_seq, seq_len): + super(IpuBertPretrainingMLMHeads, self).__init__() + self.hidden_size = hidden_size + self.vocab_size = vocab_size + self.max_position_embeddings = max_position_embeddings + self.max_predictions_per_seq = max_predictions_per_seq + self.sequence_length = seq_len + self.transform = nn.Linear(hidden_size, hidden_size) + self.layer_norm = nn.LayerNorm(hidden_size, epsilon=0.001) + + def forward(self, encoders_output, word_embeddings_weights): + # cls + out = self.transform(encoders_output) + out = paddle.fluid.layers.gelu(out, approximate=True) + out = self.layer_norm(out) + + # mlm + out = paddle.reshape(out, [-1, self.sequence_length, self.hidden_size]) + out = paddle.slice(out, [1], [0], [self.max_predictions_per_seq]) + out = paddle.reshape(out, [-1, self.hidden_size]) + + # serialized matmul + out = paddle.matmul(out, word_embeddings_weights) + out.block.ops[-1]._set_attr("serialize_factor", 5) + mlm_out = paddle.reshape(out, [-1, self.max_predictions_per_seq, self.vocab_size]) + + return mlm_out + + +class IpuBertPretrainingNSPHeads(Layer): + """ + Perform next sequence classification task. + + Args: + hidden_size (int): + See :class:`IpuBertConfig`. + max_predictions_per_seq (int): + See :class:`IpuBertConfig`. + seq_len (int): + See :class:`IpuBertConfig`. + """ + + def __init__(self, hidden_size, max_predictions_per_seq, seq_len): + super(IpuBertPretrainingNSPHeads, self).__init__() + self.hidden_size = hidden_size + self.max_predictions_per_seq = max_predictions_per_seq + self.seq_len = seq_len + self.seq_relationship = nn.Linear(hidden_size, 2) + self.pooler = IpuBertPooler(hidden_size, self.seq_len, self.max_predictions_per_seq) + + def forward(self, encoders_output): + pooled_output = self.pooler(encoders_output) + nsp_out = self.seq_relationship(pooled_output) + return nsp_out + + +class IpuBertPooler(Layer): + """ + Pool the result of BertEncoder. + """ + + def __init__(self, hidden_size, sequence_length, max_predictions_per_seq, pool_act="tanh"): + super(IpuBertPooler, self).__init__() + self.dense = nn.Linear(hidden_size, hidden_size) + self.activation = nn.Tanh() + self.pool_act = pool_act + self.sequence_length = sequence_length + self.max_predictions_per_seq = max_predictions_per_seq + self.hidden_size = hidden_size + + def forward(self, hidden_states): + hidden_states = paddle.reshape(hidden_states, [-1, self.sequence_length, self.hidden_size]) + first_token_tensor = paddle.slice( + input=hidden_states, + axes=[1], + starts=[self.max_predictions_per_seq], + ends=[self.max_predictions_per_seq + 1], + ) + first_token_tensor = paddle.reshape(first_token_tensor, [-1, self.hidden_size]) + pooled_output = self.dense(first_token_tensor) + if self.pool_act == "tanh": + pooled_output = self.activation(pooled_output) + return pooled_output + + +class IpuBertPretrainingMLMAccAndLoss(Layer): + """ + Criterion for masked language modeling. 
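+
+    Computes the masked-LM accuracy over tokens whose label differs from `ignore_index`,
+    together with the masked-LM loss produced by the custom `custom_nll_loss` operator.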
+ """ + + def __init__(self, micro_batch, ignore_index, custom_ops): + super(IpuBertPretrainingMLMAccAndLoss, self).__init__() + self.micro_batch = micro_batch + self.ignore_index = ignore_index + self.custom_ops = custom_ops + + def forward(self, mlm, masked_lm_ids): + mlm_pred = paddle.fluid.layers.argmax(mlm, axis=-1) + mlm_pred = paddle.cast(mlm_pred, "int32") + with paddle.static.name_scope("Accuracy"): + mlm_label = paddle.cast(masked_lm_ids, "int32") + mlm_correct = paddle.fluid.layers.equal(mlm_pred, mlm_label) + attrs = { + "name": "mlm_mask_val", + "shape": [1], + "dtype": "int32", + "value": self.ignore_index, + } + mlm_mask_val = paddle.fluid.layers.fill_constant(**attrs) + mlm_unmask = paddle.fluid.layers.equal(mlm_label, mlm_mask_val) + mlm_mask = paddle.logical_not(mlm_unmask) + mlm_mask = paddle.cast(mlm_mask, "float32") + mlm_correct = paddle.cast(mlm_correct, "float32") + masked_mlm_correct = paddle.fluid.layers.elementwise_mul(mlm_correct, mlm_mask) + total_correct_tokens = paddle.fluid.layers.reduce_sum(masked_mlm_correct) + total_tokens = paddle.fluid.layers.reduce_sum(mlm_mask) + total_correct_tokens = paddle.cast(total_correct_tokens, "float32") + total_tokens = paddle.cast(total_tokens, "float32") + mlm_acc = paddle.fluid.layers.elementwise_div(total_correct_tokens, total_tokens) + + masked_lm_softmax = paddle.fluid.layers.softmax(mlm) + mlm_loss = self.custom_ops.custom_nll_loss(masked_lm_softmax, masked_lm_ids, 1, str(self.ignore_index), False) + + return mlm_acc, mlm_loss + + +class IpuBertPretrainingNSPAccAndLoss(Layer): + """ + Criterion for next sequence classification. + """ + + def __init__(self, micro_batch, ignore_index, custom_ops): + super(IpuBertPretrainingNSPAccAndLoss, self).__init__() + self.micro_batch = micro_batch + self.ignore_index = ignore_index + self.custom_ops = custom_ops + + def forward(self, nsp, nsp_label): + nsp_pred = paddle.fluid.layers.argmax(nsp, axis=-1) + nsp_pred = paddle.cast(nsp_pred, "int32") + with paddle.static.name_scope("Accuracy"): + nsp_label = paddle.cast(nsp_label, "int32") + nsp_correct = paddle.fluid.layers.equal(nsp_pred, nsp_label) + nsp_correct = paddle.cast(nsp_correct, "int32") + nsp_correct = paddle.fluid.layers.reduce_sum(nsp_correct) + nsp_correct = paddle.cast(nsp_correct, "float32") + attrs = { + "name": "mlm_mask_val", + "shape": [1], + "dtype": "int32", + "value": self.micro_batch, + } + nsp_total = paddle.fluid.layers.fill_constant(**attrs) + nsp_total = paddle.cast(nsp_total, "float32") + nsp_acc = paddle.fluid.layers.elementwise_div(nsp_correct, nsp_total) + + next_sentence_softmax = paddle.fluid.layers.softmax(nsp) + nsp_loss = self.custom_ops.custom_nll_loss(next_sentence_softmax, nsp_label, 1, "None", False) + + return nsp_acc, nsp_loss diff --git a/model_zoo/bert/static_ipu/requirements.txt b/model_zoo/bert/static_ipu/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..43dee67df533055d39178a88c6c7978a9029b365 --- /dev/null +++ b/model_zoo/bert/static_ipu/requirements.txt @@ -0,0 +1,8 @@ +datasets +h5py +multiprocess +numpy +paddlenlp +scipy +wandb +tqdm diff --git a/model_zoo/bert/static_ipu/run_pretrain.py b/model_zoo/bert/static_ipu/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..8c45cc6cfb4392ea51c9e2c0ad9182567c14f240 --- /dev/null +++ b/model_zoo/bert/static_ipu/run_pretrain.py @@ -0,0 +1,393 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +import os +import pickle +import random +import time + +import numpy as np +import paddle +import paddle.optimizer +import paddle.static +from dataset_ipu import PretrainingHDF5DataLoader +from modeling import ( + BertModel, + DeviceScope, + IpuBertConfig, + IpuBertPretrainingMLMAccAndLoss, + IpuBertPretrainingMLMHeads, + IpuBertPretrainingNSPAccAndLoss, + IpuBertPretrainingNSPHeads, +) +from scipy.stats import truncnorm +from utils import ProgressFunc, load_custom_ops, parse_args + +from paddlenlp.transformers import LinearDecayWithWarmup + + +def set_seed(seed): + """ + Use the same data seed(for data shuffle) for all procs to guarantee data + consistency after sharding. + """ + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def create_data_holder(args): + bs = args.micro_batch_size + indices = paddle.static.data(name="indices", shape=[bs * args.seq_len], dtype="int32") + segments = paddle.static.data(name="segments", shape=[bs * args.seq_len], dtype="int32") + positions = paddle.static.data(name="positions", shape=[bs * args.seq_len], dtype="int32") + mask_tokens_mask_idx = paddle.static.data(name="mask_tokens_mask_idx", shape=[bs, 1], dtype="int32") + sequence_mask_idx = paddle.static.data(name="sequence_mask_idx", shape=[bs, 1], dtype="int32") + masked_lm_ids = paddle.static.data(name="masked_lm_ids", shape=[bs, args.max_predictions_per_seq], dtype="int32") + next_sentence_labels = paddle.static.data(name="next_sentence_labels", shape=[bs], dtype="int32") + return [indices, segments, positions, mask_tokens_mask_idx, sequence_mask_idx, masked_lm_ids, next_sentence_labels] + + +def reset_program_state_dict(state_dict, mean=0, scale=0.02): + """ + Initialize the parameter from the bert config, and set the parameter by + reseting the state dict." 
+ """ + new_state_dict = dict() + for n, p in state_dict.items(): + if ( + n.endswith("_moment1_0") + or n.endswith("_moment2_0") + or n.endswith("_beta2_pow_acc_0") + or n.endswith("_beta1_pow_acc_0") + ): + continue + if "learning_rate" in n: + continue + + dtype_str = "float32" + if p._dtype == paddle.float64: + dtype_str = "float64" + + if "layer_norm" in n and n.endswith(".w_0"): + new_state_dict[n] = np.ones(p.shape()).astype(dtype_str) + continue + + if n.endswith(".b_0"): + new_state_dict[n] = np.zeros(p.shape()).astype(dtype_str) + else: + new_state_dict[n] = truncnorm.rvs(-2, 2, loc=mean, scale=scale, size=p.shape()).astype(dtype_str) + return new_state_dict + + +def create_ipu_strategy(args): + ipu_strategy = paddle.static.IpuStrategy() + options = { + "is_training": args.is_training, + "enable_manual_shard": True, + "enable_pipelining": True, + "batches_per_step": args.batches_per_step, + "micro_batch_size": args.micro_batch_size, + "loss_scaling": args.scale_loss, + "enable_replicated_graphs": True, + "replicated_graph_count": args.num_replica, + "num_ipus": args.num_ipus * args.num_replica, + "enable_gradient_accumulation": args.enable_grad_acc, + "accumulation_factor": args.grad_acc_factor, + "auto_recomputation": 3, + "enable_half_partial": True, + "available_memory_proportion": args.available_mem_proportion, + "enable_stochastic_rounding": True, + "max_weight_norm": 65504.0, + "default_prefetch_buffering_depth": 3, + "rearrange_anchors_on_host": False, + "enable_fp16": args.ipu_enable_fp16, + "random_seed": args.seed, + "use_no_bias_optimizer": True, + "enable_prefetch_datastreams": True, + "enable_outlining": True, + "subgraph_copying_strategy": 1, # JustInTime + "outline_threshold": 10.0, + "disable_grad_accumulation_tensor_streams": True, + "schedule_non_weight_update_gradient_consumers_early": True, + "cache_path": "paddle_cache", + "enable_floating_point_checks": False, + "accl1_type": args.accl1_type, + "accl2_type": args.accl2_type, + "weight_decay_mode": args.weight_decay_mode, + } + + if not args.optimizer_state_offchip: + options["location_optimizer"] = { + "on_chip": 1, # popart::TensorStorage::OnChip + "use_replicated_tensor_sharding": 1, # popart::ReplicatedTensorSharding::On + } + + # use popart::AccumulateOuterFragmentSchedule::OverlapMemoryOptimized + # excludedVirtualGraphs = [0] + options["accumulate_outer_fragment"] = {3: [0]} + + options["convolution_options"] = {"partialsType": "half"} + options["engine_options"] = { + "opt.useAutoloader": "true", + "target.syncReplicasIndependently": "true", + "exchange.streamBufferOverlap": "hostRearrangeOnly", + } + + options["enable_engine_caching"] = args.enable_engine_caching + + options["compilation_progress_logger"] = ProgressFunc + + ipu_strategy.set_options(options) + + # enable custom patterns + ipu_strategy.enable_pattern("DisableAttnDropoutBwdPattern") + + return ipu_strategy + + +def main(args): + paddle.enable_static() + place = paddle.set_device("ipu") + set_seed(args.seed) + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + + # The sharding of encoder layers + if args.num_hidden_layers == 12: + attn_index = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] + ff_index = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] + else: + raise Exception("Only support num_hidden_layers = 12") + + bert_config = {k: getattr(args, k) for k in IpuBertConfig._fields if hasattr(args, k)} + bert_config["embeddings_scope"] = DeviceScope(0, 0, "Embedding") + bert_config["attn_scopes"] 
= [DeviceScope(attn_index[i], attn_index[i]) for i in range(args.num_hidden_layers)] + bert_config["ff_scopes"] = [DeviceScope(ff_index[i], ff_index[i]) for i in range(args.num_hidden_layers)] + bert_config["mlm_scope"] = DeviceScope(0, args.num_ipus, "MLM") + bert_config["nsp_scope"] = DeviceScope(0, args.num_ipus, "NSP") + bert_config["layers_per_ipu"] = [4, 4, 4] + + config = IpuBertConfig(**bert_config) + + # custom_ops + custom_ops = load_custom_ops() + + # Load the training dataset + logging.info("Loading dataset") + input_files = [ + os.path.join(args.input_files, f) + for f in os.listdir(args.input_files) + if os.path.isfile(os.path.join(args.input_files, f)) and "training" in f + ] + input_files.sort() + + dataset = PretrainingHDF5DataLoader( + input_files=input_files, + max_seq_length=args.seq_len, + max_mask_tokens=args.max_predictions_per_seq, + batch_size=args.batch_size, + shuffle=args.shuffle, + ) + logging.info(f"dataset length: {len(dataset)}") + total_samples = dataset.total_samples + logging.info( + "total samples: %d, total batch_size: %d, max steps: %d" % (total_samples, args.batch_size, args.max_steps) + ) + + logging.info("Building Model") + + [ + indices, + segments, + positions, + mask_tokens_mask_idx, + sequence_mask_idx, + masked_lm_ids, + next_sentence_labels, + ] = create_data_holder(args) + + # Encoder Layers + bert_model = BertModel(config, custom_ops) + encoders, word_embedding = bert_model(indices, segments, positions, [mask_tokens_mask_idx, sequence_mask_idx]) + + # PretrainingHeads + mlm_heads = IpuBertPretrainingMLMHeads( + args.hidden_size, args.vocab_size, args.max_position_embeddings, args.max_predictions_per_seq, args.seq_len + ) + nsp_heads = IpuBertPretrainingNSPHeads(args.hidden_size, args.max_predictions_per_seq, args.seq_len) + + # AccAndLoss + nsp_criterion = IpuBertPretrainingNSPAccAndLoss(args.micro_batch_size, args.ignore_index, custom_ops) + mlm_criterion = IpuBertPretrainingMLMAccAndLoss(args.micro_batch_size, args.ignore_index, custom_ops) + + with config.nsp_scope: + nsp_out = nsp_heads(encoders) + nsp_acc, nsp_loss = nsp_criterion(nsp_out, next_sentence_labels) + + with config.mlm_scope: + mlm_out = mlm_heads(encoders, word_embedding) + ( + mlm_acc, + mlm_loss, + ) = mlm_criterion(mlm_out, masked_lm_ids) + total_loss = mlm_loss + nsp_loss + + # lr_scheduler + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, args.max_steps, args.warmup_steps) + # optimizer + optimizer = paddle.optimizer.Lamb( + learning_rate=lr_scheduler, + lamb_weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.adam_epsilon, + ) + optimizer.minimize(total_loss) + + # Static executor + exe = paddle.static.Executor(place) + exe.run(startup_program) + + # Set initial weights + state_dict = main_program.state_dict() + reset_state_dict = reset_program_state_dict(state_dict) + paddle.static.set_program_state(main_program, reset_state_dict) + + if args.enable_load_params: + logging.info(f"loading weights from: {args.load_params_path}") + if not args.load_params_path.endswith("pdparams"): + raise Exception("need pdparams file") + with open(args.load_params_path, "rb") as file: + params = pickle.load(file) + paddle.static.set_program_state(main_program, params) + + # Create ipu_strategy + ipu_strategy = create_ipu_strategy(args) + + feed_list = [ + "indices", + "segments", + "positions", + "mask_tokens_mask_idx", + "sequence_mask_idx", + "masked_lm_ids", + "next_sentence_labels", + ] + fetch_list = [mlm_acc.name, mlm_loss.name, 
nsp_acc.name, nsp_loss.name] + + # Compile program for IPU + ipu_compiler = paddle.static.IpuCompiledProgram(main_program, ipu_strategy=ipu_strategy) + logging.info("start compiling, please wait some minutes") + cur_time = time.time() + main_program = ipu_compiler.compile(feed_list, fetch_list) + time_cost = time.time() - cur_time + logging.info(f"finish compiling! time cost: {time_cost}") + + batch_start = time.time() + global_step = 0 + for batch in dataset: + global_step += 1 + epoch = global_step * args.batch_size // total_samples + read_cost = time.time() - batch_start + + feed = { + "indices": batch[0], + "segments": batch[1], + "positions": batch[2], + "mask_tokens_mask_idx": batch[3], + "sequence_mask_idx": batch[4], + "masked_lm_ids": batch[5], + "next_sentence_labels": batch[6], + } + lr_scheduler.step() + + train_start = time.time() + loss_return = exe.run(main_program, feed=feed, fetch_list=fetch_list, use_program_cache=True) + train_cost = time.time() - train_start + total_cost = time.time() - batch_start + tput = args.batch_size / total_cost + + if args.wandb: + wandb.log( + { + "epoch": epoch, + "global_step": global_step, + "loss/MLM": np.mean(loss_return[1]), + "loss/NSP": np.mean(loss_return[3]), + "accuracy/MLM": np.mean(loss_return[0]), + "accuracy/NSP": np.mean(loss_return[2]), + "latency/read": read_cost, + "latency/train": train_cost, + "latency/e2e": total_cost, + "throughput": tput, + "learning_rate": lr_scheduler(), + } + ) + + if global_step % args.logging_steps == 0: + logging.info( + { + "epoch": epoch, + "global_step": global_step, + "loss/MLM": np.mean(loss_return[1]), + "loss/NSP": np.mean(loss_return[3]), + "accuracy/MLM": np.mean(loss_return[0]), + "accuracy/NSP": np.mean(loss_return[2]), + "latency/read": read_cost, + "latency/train": train_cost, + "latency/e2e": total_cost, + "throughput": tput, + "learning_rate": lr_scheduler(), + } + ) + + if global_step % args.save_steps == 0: + ipu_compiler._backend.weights_to_host() + paddle.static.save(main_program.org_program, os.path.join(args.output_dir, "step_{}".format(global_step))) + + if global_step >= args.max_steps: + ipu_compiler._backend.weights_to_host() + paddle.static.save( + main_program.org_program, os.path.join(args.output_dir, "final_step_{}".format(global_step)) + ) + dataset.release() + del dataset + return + + batch_start = time.time() + + +if __name__ == "__main__": + args = parse_args() + + logging.basicConfig( + level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S %a" + ) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir, exist_ok=True) + + if args.wandb: + import wandb + + wandb.init(project="paddle-base-bert", settings=wandb.Settings(console="off"), name="paddle-base-bert") + wandb_config = vars(args) + wandb_config["global_batch_size"] = args.batch_size + wandb.config.update(args) + + logging.info(args) + main(args) + logging.info("program finished") diff --git a/model_zoo/bert/static_ipu/run_squad.py b/model_zoo/bert/static_ipu/run_squad.py new file mode 100644 index 0000000000000000000000000000000000000000..e3c76ed612a8351a53a79409f64432f5b374239e --- /dev/null +++ b/model_zoo/bert/static_ipu/run_squad.py @@ -0,0 +1,482 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import logging +import os +import pickle +import time +from functools import partial + +import numpy as np +import paddle +import paddle.optimizer +import paddle.static +from datasets import load_dataset +from modeling import ( + BertModel, + DeviceScope, + IpuBertConfig, + IpuBertForQuestionAnswering, + IpuBertQAAccAndLoss, +) +from paddle.io import BatchSampler, DataLoader +from run_pretrain import create_ipu_strategy, reset_program_state_dict, set_seed +from utils import load_custom_ops, parse_args + +from paddlenlp.data import Dict, Stack +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.transformers import BertTokenizer, LinearDecayWithWarmup + + +def create_data_holder(args): + bs = args.micro_batch_size + indices = paddle.static.data(name="indices", shape=[bs * args.seq_len], dtype="int32") + segments = paddle.static.data(name="segments", shape=[bs * args.seq_len], dtype="int32") + positions = paddle.static.data(name="positions", shape=[bs * args.seq_len], dtype="int32") + input_mask = paddle.static.data(name="input_mask", shape=[bs, 1, 1, args.seq_len], dtype="float32") + if not args.is_training: + return [indices, segments, positions, input_mask] + else: + start_labels = paddle.static.data(name="start_labels", shape=[bs], dtype="int32") + end_labels = paddle.static.data(name="end_labels", shape=[bs], dtype="int32") + return [indices, segments, positions, input_mask, start_labels, end_labels] + + +def prepare_train_features(examples, tokenizer, args): + # Some of the questions have lots of whitespace on the left, which is not useful and will make the + # truncation of the context fail (the tokenized question will take a lots of space). So we remove that + # left whitespace + contexts = examples["context"] + questions = examples["question"] + + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + tokenized_examples = tokenizer( + questions, + contexts, + stride=128, + max_seq_len=args.seq_len, + pad_to_max_seq_len=True, + return_position_ids=True, + return_token_type_ids=True, + return_attention_mask=True, + return_length=True, + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + tokenized_examples["input_mask"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. 
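+        # Notes on the labeling below: token_type_ids double as the sequence ids
+        # (0 = question and special tokens, 1 = context); the attention mask is turned
+        # into an additive mask of shape [1, 1, seq_len] (0 for real tokens, -1000 for
+        # padding); and any answer whose character span does not fit inside the
+        # tokenized window is mapped to the CLS index.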
+ input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + sequence_ids = tokenized_examples["token_type_ids"][i] + + # attention_mask to input_mask + input_mask = (np.asarray(tokenized_examples["attention_mask"][i]) - 1) * 1e3 + input_mask = np.expand_dims(input_mask, axis=(0, 1)) + if args.ipu_enable_fp16: + input_mask = input_mask.astype(np.float16) + else: + input_mask = input_mask.astype(np.float32) + tokenized_examples["input_mask"].append(input_mask) + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HugggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + tokenized_examples = tokenizer( + questions, + contexts, + stride=128, + max_seq_len=args.seq_len, + pad_to_max_seq_len=True, + return_position_ids=True, + return_token_type_ids=True, + return_attention_mask=True, + return_length=True, + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. 
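+    # Unlike training, each feature also keeps its example_id and an offset mapping
+    # (with non-context positions nulled out below) so that compute_prediction can map
+    # predicted token spans back to character spans in the original context.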
+ sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + tokenized_examples["input_mask"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + input_ids = tokenized_examples["input_ids"][i] + sequence_A_lengths = input_ids.index(tokenizer.sep_token_id) + 2 + sequence_B_lengths = len(input_ids) - sequence_A_lengths + sequence_ids = [0] * sequence_A_lengths + [1] * sequence_B_lengths + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + # attention_mask to input_mask + input_mask = (np.asarray(tokenized_examples["attention_mask"][i]) - 1) * 1e3 + input_mask = np.expand_dims(input_mask, axis=(0, 1)) + if args.ipu_enable_fp16: + input_mask = input_mask.astype(np.float16) + else: + input_mask = input_mask.astype(np.float32) + tokenized_examples["input_mask"].append(input_mask) + + return tokenized_examples + + +def load_squad_dataset(args): + tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") + features_fn = prepare_train_features if args.is_training else prepare_validation_features + if args.is_training: + raw_dataset = load_dataset("squad", split="train") + else: + raw_dataset = load_dataset("squad", split="validation") + column_names = raw_dataset.column_names + dataset = raw_dataset.map( + partial(features_fn, tokenizer=tokenizer, args=args), batched=True, remove_columns=column_names, num_proc=4 + ) + + bs = args.micro_batch_size * args.grad_acc_factor * args.batches_per_step * args.num_replica + args.batch_size = bs + if args.is_training: + train_batch_sampler = BatchSampler(dataset, batch_size=bs, shuffle=args.shuffle, drop_last=True) + else: + train_batch_sampler = BatchSampler(dataset, batch_size=bs, shuffle=args.shuffle, drop_last=False) + + if args.is_training: + collate_fn = lambda samples, fn=Dict( + { + "input_ids": Stack(), + "token_type_ids": Stack(), + "position_ids": Stack(), + "input_mask": Stack(), + "start_positions": Stack(), + "end_positions": Stack(), + } + ): fn(samples) + else: + collate_fn = lambda samples, fn=Dict( + {"input_ids": Stack(), "token_type_ids": Stack(), "position_ids": Stack(), "input_mask": Stack()} + ): fn(samples) + + data_loader = DataLoader( + dataset=dataset, batch_sampler=train_batch_sampler, collate_fn=collate_fn, return_list=True + ) + return raw_dataset, data_loader + + +def main(args): + paddle.enable_static() + place = paddle.set_device("ipu") + set_seed(args.seed) + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + + # The sharding of encoder layers + if args.num_hidden_layers == 12: + attn_ipu_index = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] + ff_ipu_index = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] + else: + raise Exception("Only support 
num_hidden_layers = 12") + + bert_config = {k: getattr(args, k) for k in IpuBertConfig._fields if hasattr(args, k)} + bert_config["embeddings_scope"] = DeviceScope(0, 0, "Embedding") + bert_config["attn_scopes"] = [ + DeviceScope(attn_ipu_index[i], attn_ipu_index[i]) for i in range(args.num_hidden_layers) + ] + bert_config["ff_scopes"] = [DeviceScope(ff_ipu_index[i], ff_ipu_index[i]) for i in range(args.num_hidden_layers)] + bert_config["layers_per_ipu"] = [6, 6] + + config = IpuBertConfig(**bert_config) + + # custom_ops + custom_ops = load_custom_ops() + + logging.info("building model") + + if args.is_training: + [indices, segments, positions, input_mask, start_labels, end_labels] = create_data_holder(args) + else: + [indices, segments, positions, input_mask] = create_data_holder(args) + + # Encoder Layers + bert_model = BertModel(config, custom_ops) + encoders, _ = bert_model(indices, segments, positions, input_mask) + + squad_scope = DeviceScope(args.num_ipus - 1, args.num_ipus - 1, "squad") + with squad_scope: + qa_cls = IpuBertForQuestionAnswering(args.hidden_size, args.seq_len) + start_logits, end_logits = qa_cls(encoders) + + if args.is_training: + acc_loss = IpuBertQAAccAndLoss(custom_ops) + acc0, acc1, loss = acc_loss(start_logits, end_logits, start_labels, end_labels) + + # load squad dataset + raw_dataset, data_loader = load_squad_dataset(args) + + total_samples = len(data_loader.dataset) + max_steps = total_samples // args.batch_size * args.epochs + logging.info( + "total samples: %d, total batch_size: %d, max steps: %d" % (total_samples, args.batch_size, max_steps) + ) + + if args.is_training: + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, max_steps, args.warmup_steps) + optimizer = paddle.optimizer.Adam( + learning_rate=lr_scheduler, + weight_decay=args.weight_decay, + beta1=args.beta1, + beta2=args.beta2, + epsilon=args.adam_epsilon, + ) + optimizer.minimize(loss) + + # Static executor + exe = paddle.static.Executor(place) + exe.run(startup_program) + + # Set initial weights + state_dict = main_program.state_dict() + reset_state_dict = reset_program_state_dict(state_dict) + paddle.static.set_program_state(main_program, reset_state_dict) + + if args.enable_load_params: + logging.info(f"loading weights from: {args.load_params_path}") + if not args.load_params_path.endswith("pdparams"): + raise Exception("need pdparams file") + with open(args.load_params_path, "rb") as file: + params = pickle.load(file) + # Delete mlm and nsp weights + if args.is_training and "linear_72.w_0" in params: + params.pop("linear_72.w_0") + params.pop("linear_72.b_0") + paddle.static.set_program_state(main_program, params) + + if args.tf_checkpoint: + from load_tf_ckpt import load_initializers_from_tf + + logging.info(f"loading weights from: {args.tf_checkpoint}") + initializers, _ = load_initializers_from_tf(args.tf_checkpoint, args) + paddle.static.set_program_state(main_program, initializers) + + # Create ipu_strategy + ipu_strategy = create_ipu_strategy(args) + + if args.is_training: + feed_list = ["indices", "segments", "positions", "input_mask", "start_labels", "end_labels"] + fetch_list = [loss.name, acc0.name, acc1.name] + else: + feed_list = ["indices", "segments", "positions", "input_mask"] + fetch_list = [start_logits.name, end_logits.name] + + ipu_compiler = paddle.static.IpuCompiledProgram(main_program, ipu_strategy=ipu_strategy) + logging.info("start compiling, please wait some minutes") + cur_time = time.time() + main_program = ipu_compiler.compile(feed_list, fetch_list) + 
time_cost = time.time() - cur_time + logging.info(f"finish compiling! time cost: {time_cost}") + + if args.is_training: + global_step = 0 + batch_start = time.time() + for epoch in range(args.epochs): + for batch in data_loader: + global_step += 1 + + feed = { + "indices": batch[0], + "segments": batch[1], + "positions": batch[2], + "input_mask": batch[3], + "start_labels": batch[4], + "end_labels": batch[5], + } + lr_scheduler.step() + + train_start = time.time() + outputs = exe.run(main_program, feed=feed, fetch_list=fetch_list, use_program_cache=True) + train_cost = time.time() - train_start + total_cost = time.time() - batch_start + + tput = args.batch_size / total_cost + if args.wandb: + wandb.log( + { + "epoch": epoch, + "global_step": global_step, + "loss": np.mean(outputs[0]), + "accuracy": np.mean(outputs[1:]), + "train_cost": train_cost, + "total_cost": total_cost, + "throughput": tput, + "learning_rate": lr_scheduler(), + } + ) + + if global_step % args.logging_steps == 0: + logging.info( + { + "epoch": epoch, + "global_step": global_step, + "loss": np.mean(outputs[0]), + "accuracy": np.mean(outputs[1:]), + "train_cost": train_cost, + "total_cost": total_cost, + "throughput": tput, + "learning_rate": lr_scheduler(), + } + ) + + batch_start = time.time() + + # save final state + ipu_compiler._backend.weights_to_host() + paddle.static.save(main_program.org_program, os.path.join(args.output_dir, "Final_model")) + + if not args.is_training: + all_start_logits = [] + all_end_logits = [] + for step, batch in enumerate(data_loader): + if step % args.logging_steps == 0: + logging.info(f"running step: {step}") + + real_len = np.array(batch[0]).shape[0] + # padding zeros if needed + if real_len < args.batch_size: + batch = [np.asarray(x) for x in batch] + pad0 = np.zeros([args.batch_size - real_len, args.seq_len]).astype(batch[0].dtype) + batch[0] = np.vstack((batch[0], pad0)) + batch[1] = np.vstack((batch[1], pad0)) + batch[2] = np.vstack((batch[2], pad0)) + pad1 = np.zeros([args.batch_size - real_len, 1, 1, args.seq_len]) - 1e3 + pad1 = pad1.astype(batch[3].dtype) + batch[3] = np.vstack((batch[3], pad1)) + + feed = { + "indices": batch[0], + "segments": batch[1], + "positions": batch[2], + "input_mask": batch[3], + } + start_logits, end_logits = exe.run(main_program, feed=feed, fetch_list=fetch_list) + + start_logits = start_logits.reshape([-1, args.seq_len]) + end_logits = end_logits.reshape([-1, args.seq_len]) + for idx in range(real_len): + all_start_logits.append(start_logits[idx]) + all_end_logits.append(end_logits[idx]) + + # evaluate results + all_predictions, all_nbest_json, scores_diff_json = compute_prediction( + raw_dataset, data_loader.dataset, (all_start_logits, all_end_logits) + ) + squad_evaluate( + examples=[raw_data for raw_data in raw_dataset], preds=all_predictions, na_probs=scores_diff_json + ) + # write results to file + with open("squad_prediction.json", "w", encoding="utf-8") as writer: + writer.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") + + +if __name__ == "__main__": + args = parse_args() + + logging.basicConfig( + level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S %a" + ) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir, exist_ok=True) + + if args.wandb: + import wandb + + wandb.init(project="paddle-squad", settings=wandb.Settings(console="off"), name="paddle-squad") + wandb_config = vars(args) + wandb_config["global_batch_size"] = args.batch_size + 
wandb.config.update(args) + + logging.info(args) + main(args) + logging.info("program finished") diff --git a/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain.sh b/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain.sh new file mode 100644 index 0000000000000000000000000000000000000000..cd1c5bb00f40e1aad40c614e695c42b8104d97ea --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain.sh @@ -0,0 +1,36 @@ +#!/usr/bin/env bash + +export RDMAV_FORK_SAFE=1 +python3 run_pretrain.py \ + --input_files "path_to_phase1_hdf5_dataset" \ + --output_dir pretrain_128_model \ + --seq_len 128 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 20 \ + --max_position_embeddings 512 \ + --learning_rate 0.006 \ + --weight_decay 1e-2 \ + --max_steps 7038 \ + --warmup_steps 2000 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 32 \ + --ipu_enable_fp16 True \ + --scale_loss 512 \ + --batches_per_step 1 \ + --num_replica 4 \ + --enable_grad_acc True \ + --grad_acc_factor 512 \ + --batch_size 65536 \ + --available_mem_proportion 0.28 \ + --ignore_index 0 \ + --enable_load_params False \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --save_steps 1000 diff --git a/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain_phase2.sh b/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain_phase2.sh new file mode 100644 index 0000000000000000000000000000000000000000..8458ed48b6b2528fceaa25cd64cbc52bb5d06bbe --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod16/run_pretrain_phase2.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash + +export RDMAV_FORK_SAFE=1 +python3 run_pretrain.py \ + --input_files "path_to_phase2_hdf5_dataset" \ + --output_dir pretrain_384_model \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 0.002828427125 \ + --weight_decay 1e-2 \ + --max_steps 2137 \ + --warmup_steps 274 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 8 \ + --ipu_enable_fp16 True \ + --scale_loss 128 \ + --batches_per_step 1 \ + --num_replica 4 \ + --enable_grad_acc True \ + --grad_acc_factor 512 \ + --batch_size 16384 \ + --available_mem_proportion 0.28 \ + --ignore_index 0 \ + --enable_load_params True \ + --load_params_path "./pretrain_128_model/final_step_7038.pdparams" \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --enable_engine_caching False \ + --save_steps 500 diff --git a/model_zoo/bert/static_ipu/scripts/pod16/run_squad.sh b/model_zoo/bert/static_ipu/scripts/pod16/run_squad.sh new file mode 100644 index 0000000000000000000000000000000000000000..4c36ef69d6b1a8fdffa9255b9e3f7b68b5b9ba90 --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod16/run_squad.sh @@ -0,0 +1,41 @@ +#!/usr/bin/env bash + +python3 run_squad.py \ + --output_dir squad_model \ + --task "SQUAD" \ + --is_training True \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 5.6e-05 \ + --weight_decay 0 \ + --epochs 4 \ + --warmup_steps 52 \ + --logging_steps 10 \ + --seed 42 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 2 \ + --ipu_enable_fp16 True \ + --accl1_type "FLOAT" \ + --accl2_type "FLOAT" \ + --weight_decay_mode 
"decay" \ + --scale_loss 256 \ + --optimizer_state_offchip False \ + --batches_per_step 4 \ + --num_replica 4 \ + --num_ipus 2 \ + --enable_grad_acc True \ + --grad_acc_factor 16 \ + --available_mem_proportion 0.40 \ + --ignore_index 0 \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --enable_engine_caching False \ + --enable_load_params True \ + --load_params_path "pretrain_384_model/final_step_2137.pdparams" diff --git a/model_zoo/bert/static_ipu/scripts/pod16/run_squad_infer.sh b/model_zoo/bert/static_ipu/scripts/pod16/run_squad_infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..28ffa72854436f2edaf16da927a4aa9ff6760d12 --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod16/run_squad_infer.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash + +python3 run_squad.py \ + --output_dir squad_model \ + --task "SQUAD" \ + --is_training False \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 5.6e-05 \ + --weight_decay 1e-2 \ + --epochs 4 \ + --warmup_steps 52 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 2 \ + --ipu_enable_fp16 True \ + --scale_loss 256 \ + --optimizer_state_offchip False \ + --batches_per_step 4 \ + --num_replica 4 \ + --num_ipus 2 \ + --enable_grad_acc False \ + --grad_acc_factor 1 \ + --available_mem_proportion 0.40 \ + --ignore_index 0 \ + --hidden_dropout_prob 0.0 \ + --attention_probs_dropout_prob 0.0 \ + --shuffle False \ + --wandb False \ + --enable_engine_caching False \ + --enable_load_params True \ + --load_params_path "squad_model/Final_model.pdparams" diff --git a/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain.sh b/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain.sh new file mode 100644 index 0000000000000000000000000000000000000000..299e0dc259811dad21a63f7e1619508819d07d4b --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain.sh @@ -0,0 +1,36 @@ +#!/usr/bin/env bash + +export RDMAV_FORK_SAFE=1 +python3 run_pretrain.py \ + --input_files "path_to_phase1_hdf5_dataset" \ + --output_dir pretrain_128_model \ + --seq_len 128 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 20 \ + --max_position_embeddings 512 \ + --learning_rate 0.006 \ + --weight_decay 1e-2 \ + --max_steps 7038 \ + --warmup_steps 2000 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 32 \ + --ipu_enable_fp16 True \ + --scale_loss 512 \ + --batches_per_step 1 \ + --num_replica 1 \ + --enable_grad_acc True \ + --grad_acc_factor 2048 \ + --batch_size 65536 \ + --available_mem_proportion 0.28 \ + --ignore_index 0 \ + --enable_load_params False \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --save_steps 1000 diff --git a/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain_phase2.sh b/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain_phase2.sh new file mode 100644 index 0000000000000000000000000000000000000000..89ec3ec4bab920768a2c8f9f2f4d87b8a3da434c --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod4/run_pretrain_phase2.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash + +export RDMAV_FORK_SAFE=1 +python3 run_pretrain.py \ + --input_files "path_to_phase2_hdf5_dataset" \ + --output_dir pretrain_384_model \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + 
--max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 0.002828427125 \ + --weight_decay 1e-2 \ + --max_steps 2137 \ + --warmup_steps 274 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 8 \ + --ipu_enable_fp16 True \ + --scale_loss 128 \ + --batches_per_step 1 \ + --num_replica 1 \ + --enable_grad_acc True \ + --grad_acc_factor 2048 \ + --batch_size 16384 \ + --available_mem_proportion 0.28 \ + --ignore_index 0 \ + --enable_load_params True \ + --load_params_path "./pretrain_128_model/final_step_7038.pdparams" \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --enable_engine_caching False \ + --save_steps 500 diff --git a/model_zoo/bert/static_ipu/scripts/pod4/run_squad.sh b/model_zoo/bert/static_ipu/scripts/pod4/run_squad.sh new file mode 100644 index 0000000000000000000000000000000000000000..81302949c4a8f1f69c12f8dc137c26b9466dfa3e --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod4/run_squad.sh @@ -0,0 +1,41 @@ +#!/usr/bin/env bash + +python3 run_squad.py \ + --output_dir squad_model \ + --task "SQUAD" \ + --is_training True \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 5.6e-05 \ + --weight_decay 1e-2 \ + --epochs 4 \ + --warmup_steps 30 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 2 \ + --ipu_enable_fp16 True \ + --accl1_type "FLOAT" \ + --accl2_type "FLOAT" \ + --weight_decay_mode "decay" \ + --scale_loss 256 \ + --optimizer_state_offchip True \ + --batches_per_step 4 \ + --num_replica 2 \ + --num_ipus 2 \ + --enable_grad_acc True \ + --grad_acc_factor 64 \ + --available_mem_proportion 0.40 \ + --ignore_index 0 \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --shuffle True \ + --wandb False \ + --enable_engine_caching False \ + --enable_load_params True \ + --load_params_path "pretrain_384_model/final_step_2137.pdparams" diff --git a/model_zoo/bert/static_ipu/scripts/pod4/run_squad_infer.sh b/model_zoo/bert/static_ipu/scripts/pod4/run_squad_infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..ae400c59e52839827c28183c9f12d20c3b344690 --- /dev/null +++ b/model_zoo/bert/static_ipu/scripts/pod4/run_squad_infer.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash + +python3 run_squad.py \ + --output_dir squad_model \ + --task "SQUAD" \ + --is_training False \ + --seq_len 384 \ + --hidden_size 768 \ + --vocab_size 30400 \ + --max_predictions_per_seq 56 \ + --max_position_embeddings 512 \ + --learning_rate 5.6e-05 \ + --weight_decay 1e-2 \ + --epochs 4 \ + --warmup_steps 52 \ + --logging_steps 10 \ + --seed 1984 \ + --beta1 0.9 \ + --beta2 0.999 \ + --num_hidden_layers 12 \ + --micro_batch_size 2 \ + --ipu_enable_fp16 True \ + --scale_loss 256 \ + --optimizer_state_offchip False \ + --batches_per_step 4 \ + --num_replica 2 \ + --num_ipus 2 \ + --enable_grad_acc False \ + --grad_acc_factor 1 \ + --available_mem_proportion 0.40 \ + --ignore_index 0 \ + --hidden_dropout_prob 0.0 \ + --attention_probs_dropout_prob 0.0 \ + --shuffle False \ + --wandb False \ + --enable_engine_caching False \ + --enable_load_params True \ + --load_params_path "squad_model/Final_model.pdparams" diff --git a/model_zoo/bert/static_ipu/utils.py b/model_zoo/bert/static_ipu/utils.py new file mode 100644 index 
0000000000000000000000000000000000000000..5280a3bbc014588c5d0eb7e8209df1293d707cc0 --- /dev/null +++ b/model_zoo/bert/static_ipu/utils.py @@ -0,0 +1,189 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import tqdm +from paddle.utils.cpp_extension import load + +from paddlenlp.trainer.argparser import strtobool + + +def load_custom_ops(): + cur_dir = os.path.dirname(os.path.realpath(__file__)) + custom_dir = cur_dir + "/custom_ops" + sources = [ + f"{custom_dir}/custom_shape_infer.cc", + f"{custom_dir}/custom_checkpointoutput.cc", + f"{custom_dir}/custom_detach.cc", + f"{custom_dir}/custom_identity.cc", + f"{custom_dir}/custom_nll_loss.cc", + f"{custom_dir}/tied_gather_pattern.cc", + f"{custom_dir}/tied_gather.cc", + f"{custom_dir}/disable_attn_dropout_bwd_pattern.cc", + f"{custom_dir}/workarounds/prevent_const_expr_folding_op.cc", + f"{custom_dir}/utils.cc", + ] + custom_ops = load( + name="custom_ops", + sources=sources, + extra_cxx_cflags=["-DONNX_NAMESPACE=onnx"], + build_directory=custom_dir, + ) + return custom_ops + + +class ProgressBar: + def __init__(self): + self._bar = None + self._last = 0 + + def __call__(self, progress: int, total: int): + if self._bar is None: + bar_format = "{l_bar}{bar}| {n_fmt}/{total_fmt} " + bar_format += "[{elapsed}<{remaining}]" + self._bar = tqdm.tqdm(desc="Graph compilation", total=total, bar_format=bar_format) + self._bar.update(progress - self._last) + self._last = progress + if progress == total: + self._bar.close() + self._bar = None + + +# need to set to 0 when start a new compilation +g_current_progress = 0 + + +def ProgressFunc(progress, total): + global g_current_progress + if progress != g_current_progress: + g_current_progress = progress + print(f"Graph compilation: {progress}/{total}") + + +def str_to_bool(val): + return bool(strtobool(val)) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--task", + type=str, + default="PRETRAINING", + help="task", + ) + parser.add_argument( + "--input_files", + type=str, + default="", + help="Files to load data from. 
" + "For Pretraining: Path to tfrecord files" + "For SQuAD: Path to train-v1.1.json", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=False, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument("--is_training", type=str_to_bool, default=True, help="training or inference") + # graph + parser.add_argument("--seq_len", default=128, type=int, help="The sequence length") + parser.add_argument("--vocab_size", default=30912, type=int, help="Set the size of the vocabulary") + parser.add_argument( + "--max_predictions_per_seq", default=20, type=int, help="The maximum total of masked tokens in input sequence" + ) + parser.add_argument("--max_position_embeddings", default=512, type=int, help="the length of the input mask") + parser.add_argument("--num_hidden_layers", type=int, default=None, help="Override config file if not None") + parser.add_argument( + "--hidden_size", default=768, type=int, help="Set the size of the hidden state of the transformer layers size" + ) + parser.add_argument("--ignore_index", type=int, default=-1, help="ignore mlm index") + parser.add_argument( + "--hidden_dropout_prob", + type=float, + default=0.1, + help="Set the layer dropout probability for fully connected layer in embedding and encoder", + ) + parser.add_argument( + "--attention_probs_dropout_prob", + type=float, + default=0.0, + help="Set the layer dropout probability for attention layer in encoder", + ) + # optimizer + parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--beta1", type=float, default=0.9, help="Set the Adam/Lamb beta1 value") + parser.add_argument("--beta2", type=float, default=0.999, help="Set the Adam/Lamb beta2 value") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=10, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--scale_loss", type=float, default=1.0, help="The value of scale_loss for fp16.") + parser.add_argument("--accl1_type", type=str, default="FLOAT", help="FLOAT or FLOAT16") + parser.add_argument("--accl2_type", type=str, default="FLOAT", help="FLOAT or FLOAT16") + parser.add_argument("--weight_decay_mode", type=str, default="", help="decay or l2_regularization") + parser.add_argument( + "--optimizer_state_offchip", + type=str_to_bool, + default=True, + help="Set the store location of the optimizer tensors", + ) + parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + # ipu + parser.add_argument( + "--epochs", + type=int, + default=1, + help="the iteration of the whole dataset", + ) + parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--micro_batch_size", type=int, default=1, help="micro batch size") + parser.add_argument("--batches_per_step", type=int, default=1, help="batches per step") + parser.add_argument("--seed", type=int, default=42, help="Random seed for initialization") + parser.add_argument("--num_ipus", type=int, default=4, help="Number of IPUs to use") + parser.add_argument("--ipu_enable_fp16", type=str_to_bool, default=False, help="ipu enable fp16 or not.") + parser.add_argument("--num_replica", type=int, default=1, help="number of replica") + parser.add_argument("--enable_grad_acc", type=str_to_bool, default=False, help="enable gradient accumulation") + parser.add_argument("--grad_acc_factor", type=int, default=1, help="factor of gradient accumulation") + parser.add_argument( + "--available_mem_proportion", + type=float, + default=0.0, + help="set the available memory proportion for matmul/conv", + ) + parser.add_argument("--shuffle", type=str_to_bool, nargs="?", const=True, default=False, help="Shuffle Dataset") + parser.add_argument( + "--wandb", type=str_to_bool, nargs="?", const=True, default=False, help="Enable logging to Weights and Biases." 
+ ) + parser.add_argument("--enable_load_params", type=str_to_bool, default=False, help="load params or not") + parser.add_argument("--load_params_path", type=str, help="load params path") + parser.add_argument("--tf_checkpoint", type=str, help="Path to Tensorflow Checkpoint to initialise the model.") + parser.add_argument("--enable_engine_caching", type=str_to_bool, default=True, help="enable engine caching or not") + args = parser.parse_args() + return args diff --git a/model_zoo/electra/README.md b/model_zoo/electra/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b1c7d1e6dd24dba32486ea4252a795b525c0f7b6 --- /dev/null +++ b/model_zoo/electra/README.md @@ -0,0 +1,270 @@ +# ELECTRA with PaddleNLP + +[ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) 在[BERT](https://arxiv.org/abs/1810.04805)的基础上对其预训练过程进行了改进:预训练由两部分模型网络组成,称为Generator和Discriminator,各自包含1个BERT模型。Generator的预训练使用和BERT一样的Masked Language Model(MLM)任务,但Discriminator的预训练使用Replaced Token Detection(RTD)任务(主要改进点)。预训练完成后,使用Discriminator作为精调模型,后续的Fine-tuning不再使用Generator。 +![avatar](./electra_model_brief_introduce.JPG) + +图片来源:来自[electra论文](https://openreview.net/pdf?id=r1xMH1BtvB) + +根据论文中给出的实验结果,在和BERT具有相同的模型参数、预训练计算量一样的情况下,GLUE得分比BERT明显好,small模型为79.9:75.1,Base模型为85.1:82.2,Large模型为89.0:87.2。 + +本项目是 ELECTRA 在 Paddle 2.0上的开源实现。 + +## **环境依赖** + +- jieba, 安装方式:`pip install jieba` +- colorlog, 安装方式:`pip install colorlog` +- colorama, 安装方式:`pip install colorama` +- seqeval, 安装方式:`pip install seqeval` + +## **数据准备** +### 建议的预训练数据 +论文中提到预训练需要两部分数据:Book Corpus数据 和 Wikipedia Corpus数据,均为英文文本,utf-8编码。但是当前BookCorpus数据已不再开源,可以使用其它数据替代,只要是纯英文文本数据,utf-8编码即可(例如 Gutenberg Dataset)。 +。另外,Wikipedia Corpus数据建议从[官方获取](https://www.english-corpora.org/wiki/),下面例子假设这些数据都已获取并都放在./BookCorpus/train.data 文件中,每行一句英文文本 + +### 自定义预训练数据 +支持用户自定义数据进行训练,自定义数据为文本形式,每行一句英文文本,utf-8编码,下面例子假设数据在./BookCorpus/train.data + +### Fine-tuning数据 +Fine-tuning 使用GLUE数据,这部分Paddle已提供,在执行第4章 Fine-tuning 命令时会自动下载并加载 + +### 推理数据 +可以使用GLUE test数据集(Paddle已提供,在Fine-tuning时会自动下载),或者也可以自定义,格式要求和2.2 自定义预训练数据一样,每行一句英文文本,utf-8编码 + +## **模型预训练** + +**特别注意**:预训练模型如果想要达到较好的效果,需要训练几乎全量的Book Corpus数据 和 Wikipedia Corpus数据,原始文本接近20G,建议用GPU进行预训练,最好4片GPU以上。如果资源较少,Paddle提供已经预训练好的模型进行Fine-tuning,可以直接跳转到下面:运行Fine-tuning-使用Paddle提供的预训练模型运行 Fine-tuning + +### 单机单卡 +```shell +export CUDA_VISIBLE_DEVICES="0" +export DATA_DIR=./BookCorpus/ + +python -u ./run_pretrain.py \ + --model_type electra \ + --model_name_or_path electra-small \ + --input_dir $DATA_DIR \ + --output_dir ./pretrain_model/ \ + --train_batch_size 64 \ + --learning_rate 5e-4 \ + --max_seq_length 128 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 4 \ + --logging_steps 100 \ + --save_steps 10000 \ + --max_steps -1 \ + --device gpu +``` +其中参数释义如下: +- `model_type` 表示模型类型,默认为ELECTRA模型。 +- `model_name_or_path` 如果配置1个名字,则表示预训练模型的规模,当前支持的名字为:electra-small(约1400万参数)、electra-base(约1.1亿参数)、electra-large(约3.35亿参数)。如果配置1个路径,则表示按照路径中的模型规模进行训练,这时需配置 --init_from_ckpt 参数一起使用,一般用于断点恢复训练场景。 +- `input_dir` 表示输入数据的目录,该目录下需要有1个train.data纯英文文本文件,utf-8编码。 +- `output_dir` 表示将要保存预训练模型的目录。 +- `train_batch_size` 表示 每次迭代**每张卡**上的样本数目。此例子train_batch_size=64 运行时大致需要单卡12G显存,如果实际GPU显存小于12G或大大多于12G,可适当调小/调大此配置。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断。 +- `weight_decay` 表示每次迭代中参数缩小的比例,该值乘以学习率为真正缩小的比例。 +- `adam_epsilon` 表示adam优化器中的epsilon值。 +- `warmup_steps` 
表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存间隔。 +- `max_steps` 如果配置且大于0,表示预训练最多执行的迭代数量;如果不配置或配置小于0,则根据输入数据量、train_batch_size和num_train_epochs来确定预训练迭代数量 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 + +另外还有一些额外参数不在如上命令中: +- `use_amp` 表示是否开启混合精度(float16)进行训练,默认不开启。如果在命令中加上了--use_amp,则会开启。 +- `init_from_ckpt` 表示是否从某个checkpoint继续训练(断点恢复训练),默认不开启。如果在命令中加上了--init_from_ckpt,且 --model_name_or_path 配置的是路径,则会开启从某个checkpoint继续训练。例如下面的命令从第40000步的checkpoint继续训练: +```shell +export CUDA_VISIBLE_DEVICES="0" +export DATA_DIR=./BookCorpus/ + +python -u ./run_pretrain.py \ + --model_type electra \ + --model_name_or_path ./pretrain_model/model_40000.pdparams/ \ + --input_dir $DATA_DIR \ + --output_dir ./pretrain_model/ \ + --train_batch_size 64 \ + --learning_rate 5e-4 \ + --max_seq_length 128 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 4 \ + --logging_steps 100 \ + --save_steps 10000 \ + --max_steps -1 \ + --device gpu \ + --init_from_ckpt +``` + +训练过程将按照 `logging_steps`的设置打印如下日志: + +``` +global step 100/322448, epoch: 0, loss: 46.2487393681735099, lr: 0.000100000000, speed: 0.6439 step/s +global step 200/322448, epoch: 0, loss: 45.2436411214760099, lr: 0.000200000000, speed: 0.6041 step/s +global step 300/322448, epoch: 0, loss: 43.2906827821215998, lr: 0.000300000000, speed: 0.5991 step/s +``` + +### 单机多卡 +```shell +export CUDA_VISIBLE_DEVICES="0,1,2,3" +export DATA_DIR=./BookCorpus/ + +python -u ./run_pretrain.py \ + --model_type electra \ + --model_name_or_path electra-small \ + --input_dir $DATA_DIR \ + --output_dir ./pretrain_model/ \ + --train_batch_size 64 \ + --learning_rate 5e-4 \ + --max_seq_length 128 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 4 \ + --logging_steps 100 \ + --save_steps 10000 \ + --max_steps -1 \ + --device gpu +``` +执行命令和单机单卡一样,只有环境变量CUDA_VISIBLE_DEVICES配置多个GPU-id,配置后预训练程序使用配置中的GPU-id,不会使用未配置的GPU-id + +## **Fine-tuning** +### 从预训练模型得到Fine-tuning所需模型 +由第一段简介得知,Electra Fine-tuning时只需要Discriminator部分,所以通过如下命令从预训练模型中提取出Discriminator,得到Fine-tuning所需模型 +```shell +python -u ./get_ft_model.py \ + --model_dir ./pretrain_model/model_40000.pdparams/ +``` +其中参数释义如下: +- `model_dir` 表示预训练模型所在目录,这里例子取预训练40000步的checkpoint来生成Fine-tuning所需模型,生成的用于Fine-tuning的模型也会在这个目录下。 + +此命令可多次执行,但只有第1次会生成Fine-tuning所需模型 + +**特别注意**:如果使用windows系统执行此命令,需使用**管理员**权限运行,否则会出错。Linux无此限制 + +### 运行Fine-tuning +使用./run_glue.py运行,有两种方式: +#### **使用Paddle提供的预训练模型运行 Fine-tuning** +此方式无需在本地进行预训练,即可以跳过上面 模型预训练 和 从预训练模型得到Fine-tuning所需模型 的介绍,直接运行Fine-tuning。 + +以 GLUE/SST-2 任务为例,启动 Fine-tuning 的方式如下: +```shell +export CUDA_VISIBLE_DEVICES=0,1 +export TASK_NAME=SST-2 + +python -u ./run_glue.py \ + --model_type electra \ + --model_name_or_path electra-small \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 1e-4 \ + --num_train_epochs 3 \ + --logging_steps 100 \ + --save_steps 100 \ + --output_dir ./$TASK_NAME/ \ + --device gpu +``` +其中参数释义如下: +- `model_type` 指示了模型类型,当前支持BERT、ELECTRA、ERNIE模型。 +- `model_name_or_path` 如果配置模型名(electra模型当前支持electra-small、electra-base、electra-large几种规格)则为本节介绍的方式。如果配置本地目录(例如执行get_ft_model.py 命令得到Fine-tuning所需模型,配置其所在的目录 pretrain_model/model_40000.pdparams/)则为下一节中介绍的方式。 +- `task_name` 表示 Fine-tuning 的任务,当前支持CoLA、SST-2、MRPC、STS-B、QQP、MNLI、QNLI、RTE。 +- `max_seq_length` 
表示最大句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU、NPU。若希望使用多GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 + +#### **使用本地预训练模型运行 Fine-tuning** +按照上面模型预训练的介绍,在本地运行 ELECTRA 模型的预训练后,执行get_ft_model.py命令得到Fine-tuning所需模型,然后运行 Fine-tuning。 + +以 GLUE/SST-2 任务为例,启动 Fine-tuning 的方式如下: +```shell +export CUDA_VISIBLE_DEVICES=0,1 +export TASK_NAME=SST-2 + +python -u ./run_glue.py \ + --model_type electra \ + --model_name_or_path ./pretrain_model/model_40000.pdparams/ \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 1e-4 \ + --num_train_epochs 3 \ + --logging_steps 100 \ + --save_steps 100 \ + --output_dir ./$TASK_NAME/ \ + --device gpu +``` +其中绝大部分参数和上节中一样,只有参数model_name_or_path配置了本地预训练模型的路径 + +无论使用哪种方式进行 Fine-tuning,过程将按照 `logging_steps` 和 `save_steps` 的设置打印如下格式的日志: + +``` +global step 100/6315, epoch: 0, batch: 99, rank_id: 0, loss: 0.687738, lr: 0.0000158479, speed: 3.3566 step/s +eval loss: 0.693736, acc: 0.5137614678899083, eval done total : 2.0170159339904785 s +global step 200/6315, epoch: 0, batch: 199, rank_id: 0, loss: 0.342201, lr: 0.0000316957, speed: 3.1531 step/s +eval loss: 0.715023, acc: 0.8256880733944955, eval done total : 1.9682419300079346 s +global step 300/6315, epoch: 0, batch: 299, rank_id: 0, loss: 0.516034, lr: 0.0000475436, speed: 3.1663 step/s +eval loss: 0.653879, acc: 0.8658256880733946, eval done total : 1.9738705158233643 s +global step 400/6315, epoch: 0, batch: 399, rank_id: 0, loss: 0.228789, lr: 0.0000633914, speed: 3.1512 step/s +eval loss: 0.863306, acc: 0.8600917431192661, eval done total : 1.960683822631836 s +global step 500/6315, epoch: 0, batch: 499, rank_id: 0, loss: 0.320570, lr: 0.0000792393, speed: 3.1495 step/s +eval loss: 0.732358, acc: 0.8704128440366973, eval done total : 1.9749321937561035 s +``` + +使用electra-small预训练模型进行单卡 Fine-tuning ,在验证集上有如下结果(这里各类任务的结果是运行3次取最好得到): + +| Task | Metric | Result | +|-------|------------------------------|-------------| +| CoLA | Matthews corr | 58.22 | +| SST-2 | acc. | 91.85 | +| MRPC | acc./F1 | 88.24 | +| STS-B | Pearson/Spearman corr | 87.24 | +| QQP | acc./F1 | 88.83 | +| MNLI | matched acc./mismatched acc. | 82.45 | +| QNLI | acc. | 88.61 | +| RTE | acc. 
| 66.78 | + +注:acc.是Accuracy的简称,表中Metric字段名词取自[GLUE论文](https://openreview.net/pdf?id=rJ4km2R5t7) + +## **推理部署** +运行某个GLUE任务后(还是继续以GLUE/SST-2 情感分类任务为例),想要将Fine-tuning模型导出以加速类似场景更多数据的推理,可以按照如下步骤完成推理部署 + +### 导出推理模型 +```shell +python -u ./export_model.py \ + --input_model_dir ./SST-2/sst-2_ft_model_6000.pdparams/ \ + --output_model_dir ./ \ + --model_name electra-deploy +``` +其中参数释义如下: +- `input_model_dir` 表示输入的预训练模型所在目录,这里例子取SST-2 Fine-tuning 6000步的checkpoint来导出推理模型。 +- `output_model_dir` 表示将要保存推理模型的目录,这里例子取当前路径。 +- `model_name` 表示输出推理模型的名字前缀,任意字符串均可,默认为electra-deploy。 + +例如,执行如上命令后,可以看到在output_model_dir配置的目录下,导出的推理模型包括3个文件: +| 文件 | 说明 | +|-------------------------------|----------------------------------------| +| electra-deploy.pdiparams | 模型权重文件,供推理时加载使用 | +| electra-deploy.pdiparams.info | 模型权重信息文件 | +| electra-deploy.pdmodel | 模型结构文件,供推理时加载使用 | + +### **使用Paddle Inference API进行推理** +准备好如上推理模型后,可参考[Paddle Inference API推理步骤](./deploy/python/README.md)。 + +### **使用Paddle Serving API进行推理** +上面介绍的Paddle Inference为使用本地模型推理,Paddle Serving 可以实现在服务器端部署推理模型,客户端远程通过RPC/HTTP方式发送数据进行推理,实现模型推理的服务化。准备好如上推理模型后,可参考[Paddle Serving API推理步骤](./deploy/serving/README.md)。 + +### **使用Paddle Lite API进行推理** +上面介绍的Paddle Inference和Serving主要在服务器上进行推理,而在移动设备(手机、平板等)上需要使用Paddle Lite进行推理。准备好如上推理模型后,可参考[Paddle Lite API推理步骤](./deploy/lite/README.md)。 + + +## Reference +[Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR 2020](https://openreview.net/pdf?id=r1xMH1BtvB) diff --git a/model_zoo/electra/deploy/lite/Makefile b/model_zoo/electra/deploy/lite/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..ca4be9cfc7aead414bcd32e16e1288af46a8a8b1 --- /dev/null +++ b/model_zoo/electra/deploy/lite/Makefile @@ -0,0 +1,33 @@ +ARM_ABI = arm8 +export ARM_ABI + +include ../Makefile.def + +LITE_ROOT=../../../ + +CXX_INCLUDES = $(INCLUDES) ${OPENCV_INCLUDE} -I$(LITE_ROOT)/cxx/include + +CXX_LIBS = -L$(LITE_ROOT)/cxx/lib/ -lpaddle_light_api_shared $(SYSTEM_LIBS) + +############################################################### +# How to use one of static libaray: # +# `libpaddle_api_full_bundled.a` # +# `libpaddle_api_light_bundled.a` # +############################################################### +# Note: default use lite's shared library. # +############################################################### +# 1. Comment above line using `libpaddle_light_api_shared.so` +# 2. 
Undo comment below line using `libpaddle_api_light_bundled.a` + +#CXX_LIBS = $(LITE_ROOT)/cxx/lib/libpaddle_api_light_bundled.a $(SYSTEM_LIBS) + +electra_lite: electra_lite.o + $(CC) $(SYSROOT_LINK) $(CXXFLAGS_LINK) electra_lite.o -o electra_lite $(CXX_LIBS) $(LDFLAGS) + +electra_lite.o: sentiment_classfication.cpp + $(CC) $(SYSROOT_COMPLILE) $(CXX_DEFINES) $(CXX_INCLUDES) $(CXX_FLAGS) -o electra_lite.o -c sentiment_classfication.cpp + +.PHONY: clean +clean: + rm -f electra_lite.o + rm -f electra_lite diff --git a/model_zoo/electra/deploy/lite/README.md b/model_zoo/electra/deploy/lite/README.md new file mode 100644 index 0000000000000000000000000000000000000000..00230fafacf70947ee158f9033b843674227ee69 --- /dev/null +++ b/model_zoo/electra/deploy/lite/README.md @@ -0,0 +1,162 @@ +# **ELECTRA 使用Paddle Lite API进行推理** +在移动设备(手机、平板等)上需要使用Paddle Lite进行推理。[Paddle Lite](https://github.com/PaddlePaddle/Paddle-Lite)是飞桨轻量化推理引擎,为手机、IOT端提供高效推理能力,并广泛整合跨平台硬件,为端侧部署及应用落地问题提供轻量化的部署方案。下面以Android手机(Armv7或Armv8)为例,使用Paddle Lite进行ELECTRA模型的推理。 + +## 前提条件 +准备好Inference所需模型,需要2个文件: +| 文件 | 说明 | +|-------------------------------|----------------------------------------| +| electra-deploy.pdiparams | 模型权重文件,供推理时加载使用 | +| electra-deploy.pdmodel | 模型结构文件,供推理时加载使用 | + +如何获得Inference模型?[可参考文档“导出推理模型”一节](../../README.md),下面假设这2个文件已生成,并放在当前目录下 + +## 准备硬件和系统 +- 电脑。用于保存代码和数据;编译Paddle Lite(看需要) +- 手机。Android手机(Armv7或Armv8),手机要能直接连接电脑,或者手机直连某个设备,其能连接到电脑。 + +如果在其它特殊硬件上或想要自己编译Paddle Lite预测库和优化工具,则电脑上还需准备: +- 交叉编译环境。不同开发环境的编译流程请参考对应文档。 + - [Docker](https://paddle-lite.readthedocs.io/zh/latest/source_compile/compile_env.html#docker) + - [Linux](https://paddle-lite.readthedocs.io/zh/latest/source_compile/compile_env.html#linux) + - [MAC OS](https://paddle-lite.readthedocs.io/zh/latest/source_compile/compile_env.html#mac-os) + +## 准备Paddle Lite预测库 +有两种方法: +- 直接下载。[官方预测库下载地址](https://paddle-lite.readthedocs.io/zh/latest/quick_start/release_lib.html),注意选择和手机arm系统版本匹配的,并带with_extra=ON的下载链接。 +- 编译Paddle-Lite得到预测库。**需要先准备好交叉编译环境**,然后依次执行如下命令,例如编译在 armv8 硬件上运行的预测库并打开extra op: +```shell +git clone https://github.com/PaddlePaddle/Paddle-Lite.git +cd Paddle-Lite +git checkout develop +./lite/tools/build_android.sh --arch=armv8 --with_extra=ON +``` +直接下载预测库并解压后,可以得到`inference_lite_lib.android.armv8/`文件夹,通过编译Paddle-Lite得到的预测库位于`Paddle-Lite/build.lite.android.armv8.gcc/inference_lite_lib.android.armv8/`文件夹。无论使用哪种方法,得到的预测库的文件目录结构都如下,为了方便统一说明,下面都假设预测库位于${Paddle_Lite_root}/inference_lite_lib.android.armv8/目录中: +``` +${Paddle_Lite_root}/inference_lite_lib.android.armv8/ +|-- cxx C++ 预测库和头文件 +| |-- include C++ 头文件 +| | |-- paddle_api.h +| | |-- paddle_image_preprocess.h +| | |-- paddle_lite_factory_helper.h +| | |-- paddle_place.h +| | |-- paddle_use_kernels.h +| | |-- paddle_use_ops.h +| | `-- paddle_use_passes.h +| `-- lib C++预测库 +| |-- libpaddle_api_light_bundled.a C++静态库 +| `-- libpaddle_light_api_shared.so C++动态库 +|-- java Java预测库 +| |-- jar +| | `-- PaddlePredictor.jar +| |-- so +| | `-- libpaddle_lite_jni.so +| `-- src +|-- demo C++和Java示例代码 +| |-- cxx C++ 预测库demo +| `-- java Java 预测库demo +``` + +## 准备Paddle Lite模型优化工具 +因为移动设备上对模型的要求很严格,所以需要使用Paddle Lite模型优化工具将Inference模型优化后才能将模型部署到移动设备上进行推理,模型优化的方法包括量化、子图融合、混合调度、Kernel优选等等方法。准备Paddle Lite模型优化工具也有两种方法: +- 直接下载。 +```shell +pip install paddlelite。 +``` +- 编译Paddle-Lite得到模型优化工具。**需要先准备好交叉编译环境**,然后依次执行如下命令: +```shell +# 如果准备环境时已经clone了Paddle-Lite,则不用重新clone Paddle-Lite +git clone https://github.com/PaddlePaddle/Paddle-Lite.git +cd Paddle-Lite +git checkout develop +# 启动编译 
+./lite/tools/build.sh build_optimize_tool +``` +如果是直接下载,工具可执行文件为`paddle_lite_opt`,并放在系统环境变量PATH中,所以无需进入到工具所在目录就可以直接执行;如果是编译得到,则工具可执行文件为`Paddle-Lite/build.opt/lite/api/opt`,为了后面统一说明,可将工具统一命名为`paddle_lite_opt`,并将其所处目录添加到系统环境变量PATH中,通过如下方式查看其运行选项和使用方式; +```shell +cd build.opt/lite/api/ && mv opt paddle_lite_opt +./paddle_lite_opt +``` + +## 使用Paddle Lite模型优化工具转换Inference模型 +以前提条件中准备好的Inference模型 electra-deploy.pdmodel/electra-deploy.pdiparams 为例,执行: +```shell +paddle_lite_opt \ + --model_file ./electra-deploy.pdmodel \ + --param_file ./electra-deploy.pdiparams \ + --optimize_out ./electra-deploy-lite \ + --optimize_out_type protobuf \ + --valid_targets arm \ + --record_tailoring_info false +``` +其中参数释义如下: +- `model_file` 表示推理需要加载的模型结构文件。例如前提中得到的electra-deploy.pdmodel。 +- `param_file` 表示推理需要加载的模型权重文件。例如前提中得到的electra-deploy.pdiparams。 +- `optimize_out` 表示输出的Lite模型**名字前缀**。例如配置./electra-deploy-lite,最终得到的Lite模型为./electra-deploy-lite.nb。 +- `optimize_out_type` 表示输出模型类型,目前支持两种类型:protobuf和naive_buffer,其中naive_buffer是一种更轻量级的序列化/反序列化实现。若您需要在mobile端执行模型预测,请将此选项设置为naive_buffer。默认为protobuf。 +- `valid_targets` 表示模型将要运行的硬件类型,默认为arm。目前可支持x86、arm、opencl、npu、xpu,可以同时指定多个backend(以空格分隔),Model Optimize Tool将会自动选择最佳方式。如果需要支持华为NPU(Kirin 810/990 Soc搭载的达芬奇架构NPU),应当设置为npu, arm。 +- `record_tailoring_info` 表示是否使用 根据模型裁剪库文件 功能,如使用则设置该选项为true,以记录优化后模型含有的kernel和OP信息,默认为false。 + +如上命令执行后,得到Lite模型为./electra-deploy-lite.nb + +## 预处理输入数据,并和Lite预测库、Lite模型、编译好的C++代码/配置 一起打包。 +```shell +# 假设${Paddle_Lite_root}已经配置了正确的Lite预测库路径 +python -u ./prepare.py \ + --lite_lib_path ${Paddle_Lite_root}/inference_lite_lib.android.armv8/ \ + --lite_model_file ./electra-deploy-lite.nb \ + --predict_file ./test.txt \ + --max_seq_length 128 \ + --model_name electra-small + +# 进入lite demo的工作目录 +cd ${Paddle_Lite_root}/inference_lite_lib.android.armv8/demo/cxx/electra/ +make -j && mv electra_lite debug +``` +其中prepare.py的参数释义如下: +- `lite_lib_path` 表示预测库所在目录。 +- `lite_model_file` 表示Lite模型路径。 +- `predict_file` 表示用于推理的文件数据,可以配置1个或多个文件,每个文件和预训练数据格式一样,为utf-8编码的文本数据,每行1句文本。 +- `max_seq_length` 表示输入的最大句子长度,超过该长度将被截断。 +- `model_name` 表示推理模型的类型,当前支持electra-small(约1400万参数)、electra-base(约1.1亿参数)、electra-large(约3.35亿参数)。 + +如上命令执行完后,${Paddle_Lite_root}/inference_lite_lib.android.armv8/demo/cxx/electra/文件夹下将有如下文件,只有其中的**debug目录**会传到手机: +```shell +demo/cxx/electra/ +|-- debug/ +| |--config.txt 推理配置和超参数配置 +| |--electra-deploy-lite.nb 优化后的Lite模型文件 +| |--electra_lite 编译好的在手机上执行推理的可执行文件 +| |--libpaddle_light_api_shared.so C++预测库文件 +| |--predict_input.bin 预处理好的输入数据(二进制) +| |--predict_input.txt 输入数据明文 +| |--sst2_label.txt 类别说明文件 +|-- config.txt 推理配置和超参数配置 +|-- Makefile 编译文件 +|-- sentiment_classfication.cpp 推理代码文件 +``` + +## 与目标设备连接执行推理 +如果电脑和Android手机直接连接,则在电脑上安装[ADB工具](https://developer.android.com/studio/command-line/adb),通过ADB工具来连接和操作Android设备: +```shell +# 检查是否连接上设备 +adb devices +# 将debug目录推送到设备的/data/local/tmp/electra/目录下,需事先在设备上创建 +adb push debug /data/local/tmp/electra/ +# 登录设备并打开设备上的shell +adb shell +# 准备相关环境。进入程序目录,配置好动态链接库的环境变量并给程序添加执行权限 +cd /data/local/tmp/electra/debug && export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/local/tmp/electra/debug/ && chmod +x electra_lite +# 输入数据,运行Lite推理 +./electra_lite ./config.txt +``` +如果电脑和Android手机没有直接连接,Android手机直连某个设备,则需将debug目录cp到那个设备上,并在那个设备上安装ADB工具以执行如上代码。 + +执行如上推理命令后得到如下结果,同样数据在Paddle Lite推理的结果应该和使用Inference/Serving的结果是一样的 +```shell +=== electra predict result: ./predict_input.txt=== +sentence: [CLS] uneasy mishmash of styles and genres . 
[SEP], class_id: 0(negative), logits: 2.22824 +sentence: [CLS] director rob marshall went out gunning to make a great one . [SEP], class_id: 1(positive), logits: 0.241332 +total time : 0.399562 s. +``` + +如果修改了代码,则需要先执行prepare.py,再重新编译并打包push到手机上;如果只修改输入数据,则只需要执行prepare.py并打包push到手机上,不用重新编译。 diff --git a/model_zoo/electra/deploy/lite/config.txt b/model_zoo/electra/deploy/lite/config.txt new file mode 100644 index 0000000000000000000000000000000000000000..df4c84b9775de1b3b916cc65a7dbe813644f5584 --- /dev/null +++ b/model_zoo/electra/deploy/lite/config.txt @@ -0,0 +1,6 @@ +lite_model_file ./electra-deploy-lite.nb # path of model relative to executable file +label_file ./sst2_label.txt # path of label description file +predict_file_bin ./predict_input.bin # path of data in binary +predict_file_txt ./predict_input.txt # path of data in text +predict_num 10 # number of data to predict, automatic generation and no need config +predict_length 39 # sequence length of each data, automatic generation and no need config diff --git a/model_zoo/electra/deploy/lite/prepare.py b/model_zoo/electra/deploy/lite/prepare.py new file mode 100644 index 0000000000000000000000000000000000000000..e1bc88d3086b2c82730db05cb6c20560701c917a --- /dev/null +++ b/model_zoo/electra/deploy/lite/prepare.py @@ -0,0 +1,191 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
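+
+# prepare.py packs the inputs for the Paddle Lite demo: it tokenizes the sentences
+# given via --predict_sentences/--predict_file with ElectraTokenizer, writes the
+# first batch of padded int64 token ids to <prepared_file_prefix>.bin (plus a plain
+# text copy in .txt), updates predict_num/predict_length in ./deploy/lite/config.txt,
+# and copies the optimized .nb model, label file, config, C++ sources and
+# libpaddle_light_api_shared.so into <lite_lib_path>/demo/cxx/electra/(debug/).
+# Typical invocation, as shown in the lite README:
+#   python -u ./prepare.py --lite_lib_path ${Paddle_Lite_root}/inference_lite_lib.android.armv8/ \
+#       --lite_model_file ./electra-deploy-lite.nb --predict_file ./test.txt \
+#       --max_seq_length 128 --model_name electra-small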
+ +import argparse +import fileinput +import io +import os +import shutil +import time + +import numpy as np + +from paddlenlp.transformers import ElectraTokenizer +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--lite_lib_path", type=str, required=True, default=None, help="directory of paddle lite api library" + ) + parser.add_argument("--lite_model_file", type=str, required=True, default=None, help="paddle lite model file(.nb)") + parser.add_argument("--predict_sentences", type=str, nargs="*", help="one or more sentence to predict") + parser.add_argument( + "--predict_file", type=str, nargs="*", help="one or more file which contain sentence to predict" + ) + parser.add_argument( + "--prepared_file_prefix", + type=str, + default="predict_input", + help="prepared file prefix after processing predict sentences", + ) + parser.add_argument("--batch_size", type=int, default=100000, help="batch size") + parser.add_argument("--max_seq_length", type=int, default=128, help="max length of each sequence") + parser.add_argument( + "--model_name", + type=str, + default="electra-small", + help="shortcut name selected in the list: " + + ", ".join(list(ElectraTokenizer.pretrained_init_configuration.keys())), + ) + return parser.parse_args() + + +def read_sentences(paths=[]): + sentences = [] + for sen_path in paths: + assert os.path.isfile(sen_path), "The {} isn't a valid file.".format(sen_path) + sen = read_file(sen_path) + if sen is None: + logger.info("error in loading file:{}".format(sen_path)) + continue + sentences.extend(sen) + return sentences + + +def read_file(path): + lines = [] + with open(path, encoding="utf-8") as f: + while True: + line = f.readline() + if line: + if len(line) > 0 and not line.isspace(): + lines.append(line.strip()) + else: + break + return lines + + +def get_predicted_input(predicted_data, tokenizer, max_seq_length, batch_size): + if predicted_data == [] or not isinstance(predicted_data, list): + raise TypeError("The predicted_data is inconsistent with expectations.") + + sen_ids_batch = [] + sen_words_batch = [] + sen_ids = [] + sen_words = [] + batch_num = 0 + pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token) + for sen in predicted_data: + sen_id = tokenizer(sen, max_seq_len=max_seq_length)["input_ids"] + sen_ids.append(sen_id) + sen_words.append(tokenizer.cls_token + " " + sen + " " + tokenizer.sep_token) + batch_num += 1 + if batch_num == batch_size: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + sen_ids = [] + sen_words = [] + batch_num = 0 + + if len(sen_ids) > 0: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + + return sen_ids_batch, sen_words_batch + + +def prepare_predict(args, sentences=[], paths=[]): + """ + Args: + sentences (list[str]): each string is a sentence. If sentences not paths + paths (list[str]): The paths of file which contain sentences. 
If paths not sentences + """ + + # initialize data + if sentences != [] and isinstance(sentences, list) and (paths == [] or paths is None): + predicted_data = sentences + elif (sentences == [] or sentences is None) and isinstance(paths, list) and paths != []: + predicted_data = read_sentences(paths) + else: + raise TypeError("The input data is inconsistent with expectations.") + + tokenizer = ElectraTokenizer.from_pretrained(args.model_name) + predicted_input, predicted_sens = get_predicted_input( + predicted_data, tokenizer, args.max_seq_length, args.batch_size + ) + + predicted_input_np = np.array(predicted_input) + predict_num = predicted_input_np.shape[1] + predict_length = predicted_input_np.shape[2] + predict_input_bin = args.prepared_file_prefix + ".bin" + predict_input_txt = args.prepared_file_prefix + ".txt" + predicted_input_np[0].astype(np.int64).tofile(predict_input_bin) + with io.open(predict_input_txt, "w", encoding="UTF-8") as f: + for sen_batch in predicted_sens: + for sen in sen_batch: + if len(sen.strip()) > 0: + f.write(sen.strip() + "\n") + + for line in fileinput.input("./deploy/lite/config.txt", inplace=True): + if "predict_num" in line: + newline = "predict_num " + str(predict_num) + print("%s" % newline) + elif "predict_length" in line: + newline = "predict_length " + str(predict_length) + print("%s" % newline) + else: + print("%s" % line.strip()) + + root_dir = args.lite_lib_path + "/demo/cxx/electra/" + debug_dir = args.lite_lib_path + "/demo/cxx/electra/debug/" + if not os.path.exists(debug_dir): + os.makedirs(debug_dir) + shutil.copy(args.lite_model_file, debug_dir) + shutil.copy("./deploy/lite/sst2_label.txt", debug_dir) + shutil.copy("./deploy/lite/config.txt", debug_dir) + shutil.copy(predict_input_bin, debug_dir) + shutil.copy(predict_input_txt, debug_dir) + libpaddle_light_api = os.path.join(args.lite_lib_path, "cxx/lib/libpaddle_light_api_shared.so") + shutil.copy(libpaddle_light_api, debug_dir) + + shutil.copy("./deploy/lite/config.txt", root_dir) + shutil.copy("./deploy/lite/sentiment_classfication.cpp", root_dir) + shutil.copy("./deploy/lite/Makefile", root_dir) + + +if __name__ == "__main__": + args = parse_args() + sentences = args.predict_sentences + paths = args.predict_file + start_time = time.time() + # sentences = ["The quick brown fox see over the lazy dog.", "The quick brown fox jump over tree lazy dog."] + # paths = ["../../debug/test.txt", "../../debug/test.txt.1"] + prepare_predict(args, sentences, paths) + print("prepare lite predict done, total time : %s s" % (time.time() - start_time)) diff --git a/model_zoo/electra/deploy/lite/sentiment_classfication.cpp b/model_zoo/electra/deploy/lite/sentiment_classfication.cpp new file mode 100644 index 0000000000000000000000000000000000000000..6305a5d2adc131ed46a1ceb113437e5dfe251ae2 --- /dev/null +++ b/model_zoo/electra/deploy/lite/sentiment_classfication.cpp @@ -0,0 +1,240 @@ +// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
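+
+// sentiment_classfication.cpp drives the on-device Paddle Lite demo: it reads the
+// key/value pairs in config.txt (lite_model_file, label_file, predict_file_bin/txt,
+// predict_num, predict_length), loads the optimized .nb model through MobileConfig,
+// copies the int64 token ids produced by prepare.py into the input tensor, runs the
+// predictor, and for every sentence prints the argmax class id, its label name and
+// the corresponding logit. On the device it is invoked as: ./electra_lite ./config.txt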
+ +#include +#include +#include +#include +#include +#include +#include +#include "paddle_api.h" // NOLINT + +using namespace paddle::lite_api; // NOLINT +using namespace std; + +#undef stderr +FILE *stderr = &__sF[2]; + +struct RESULT { + std::string class_name; + int class_id; + float logits; +}; + +std::vector PostProcess(const float *output_data, + int predict_num, + int predict_class, + const std::vector &word_labels) { + int predict_result[predict_num] = {0}; + float predict_logits[predict_num] = {0}; + for (int i = 0; i < predict_num; i++) { + int index = -1; + float max_score = -100.0; + for (int j = 0; j < predict_class; j++) { + float score = output_data[i * predict_class + j]; + if (score > max_score) { + max_score = score; + index = j; + } + } + predict_result[i] = index; + predict_logits[i] = max_score; + } + + std::vector results(predict_num); + for (int i = 0; i < results.size(); i++) { + results[i].class_name = "Unknown"; + if (predict_result[i] >= 0 && predict_result[i] < word_labels.size()) { + results[i].class_name = word_labels[predict_result[i]]; + } + results[i].class_id = predict_result[i]; + results[i].logits = predict_logits[i]; + } + return results; +} + +std::shared_ptr LoadModel(std::string model_file) { + MobileConfig config; + config.set_model_from_file(model_file); + + std::shared_ptr predictor = + CreatePaddlePredictor(config); + return predictor; +} + +std::vector split(const std::string &str, + const std::string &delim) { + std::vector res; + if ("" == str) return res; + char *strs = new char[str.length() + 1]; + std::strcpy(strs, str.c_str()); + + char *d = new char[delim.length() + 1]; + std::strcpy(d, delim.c_str()); + + char *p = std::strtok(strs, d); + while (p) { + string s = p; + res.push_back(s); + p = std::strtok(NULL, d); + } + + return res; +} + +std::vector ReadDict(const std::string &path) { + std::ifstream in(path); + std::string filename; + std::string line; + std::vector m_vec; + if (in) { + while (getline(in, line)) { + m_vec.push_back(line); + } + } else { + std::cout << "no such file" << std::endl; + } + return m_vec; +} + +std::map LoadConfigTxt( + const std::string &config_path) { + auto config = ReadDict(config_path); + + std::map dict; + for (int i = 0; i < config.size(); i++) { + std::vector res = split(config[i], " "); + dict[res[0]] = res[1]; + } + return dict; +} + +void PrintConfig(const std::map &config) { + std::cout << "=======PaddleClas lite demo config======" << std::endl; + for (auto iter = config.begin(); iter != config.end(); iter++) { + std::cout << iter->first << " : " << iter->second << std::endl; + } + std::cout << "=======End of PaddleClas lite demo config======" << std::endl; +} + +std::vector LoadLabels(const std::string &path) { + std::ifstream file; + std::vector labels; + file.open(path); + while (file) { + std::string line; + std::getline(file, line); + std::string::size_type pos = line.find(" "); + if (pos != std::string::npos) { + line = line.substr(pos + 1); + } + labels.push_back(line); + } + file.clear(); + file.close(); + return labels; +} + +std::vector RunModel(std::shared_ptr predictor, + const std::map &config, + double &cost_time) { + // read config + std::string label_path = config.at("label_file"); + std::string predict_file_bin = config.at("predict_file_bin"); + int predict_num = stoi(config.at("predict_num")); + int predict_length = stoi(config.at("predict_length")); + + // Load Labels + std::vector word_labels = LoadLabels(label_path); + + // Read predict data + int64_t 
predict_data[predict_num][predict_length] = {0}; + ifstream in(predict_file_bin, ios::in | ios::binary); + in.read((char *)&predict_data, sizeof predict_data); + in.close(); + + // Fill input tensor + std::unique_ptr input_tensor(std::move(predictor->GetInput(0))); + input_tensor->Resize({predict_num, predict_length}); + auto *data = input_tensor->mutable_data(); + for (int i = 0; i < predict_num; i++) { + for (int j = 0; j < predict_length; j++) { + data[i * predict_length + j] = predict_data[i][j]; + } + } + + auto start = std::chrono::system_clock::now(); + // Run predictor + predictor->Run(); + + // Get output and post process + std::unique_ptr output_tensor( + std::move(predictor->GetOutput(0))); + auto *output_data = output_tensor->data(); + auto end = std::chrono::system_clock::now(); + auto duration = + std::chrono::duration_cast(end - start); + cost_time = double(duration.count()) * + std::chrono::microseconds::period::num / + std::chrono::microseconds::period::den; + + if (output_tensor->shape().size() != 2) { + std::cerr << "[ERROR] the size of output tensor shape must equal to 2\n"; + exit(1); + } + int predict_class = int(output_tensor->shape()[1]); + + auto results = + PostProcess(output_data, predict_num, predict_class, word_labels); + + return results; +} + +int main(int argc, char **argv) { + if (argc < 2) { + std::cerr << "[ERROR] usage: " << argv[0] << " config_path\n"; + exit(1); + } + + // load config + std::string config_path = argv[1]; + auto config = LoadConfigTxt(config_path); + PrintConfig(config); + + // init predictor + std::string lite_model_file = config.at("lite_model_file"); + auto electra_predictor = LoadModel(lite_model_file); + + double elapsed_time = 0.0; + double run_time = 0; + + // run lite inference + std::vector results = RunModel(electra_predictor, config, run_time); + + // print result + std::string predict_file_txt = config.at("predict_file_txt"); + auto sentences = ReadDict(predict_file_txt); + std::cout << "=== electra predict result: " << predict_file_txt + << "===" << std::endl; + for (int i = 0; i < results.size(); i++) { + std::cout << "sentence: " << sentences[i] + << ", class_id: " << results[i].class_id << "(" + << results[i].class_name << ")" + << ", logits: " << results[i].logits << std::endl; + } + std::cout << "total time : " << run_time << " s." << std::endl; + + return 0; +} diff --git a/model_zoo/electra/deploy/lite/sst2_label.txt b/model_zoo/electra/deploy/lite/sst2_label.txt new file mode 100644 index 0000000000000000000000000000000000000000..0dcf43b8731e78e4e4dfa69d20674dcd1b9b5bb3 --- /dev/null +++ b/model_zoo/electra/deploy/lite/sst2_label.txt @@ -0,0 +1,2 @@ +0 negative +1 positive diff --git a/model_zoo/electra/deploy/python/README.md b/model_zoo/electra/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..35628a747926dea80553480aabf6e5955176bfa4 --- /dev/null +++ b/model_zoo/electra/deploy/python/README.md @@ -0,0 +1,56 @@ +# **ELECTRA 使用Paddle Inference API进行推理** + +## 前提条件 +准备好Inference所需模型,需要2个文件: +| 文件 | 说明 | +|-------------------------------|----------------------------------------| +| electra-deploy.pdiparams | 模型权重文件,供推理时加载使用 | +| electra-deploy.pdmodel | 模型结构文件,供推理时加载使用 | + +如何获得Inference模型?[可参考文档“导出推理模型”一节](../../README.md),下面假设这2个文件已生成,并放在当前目录下,有两种方法进行推理 + +## 从命令行读取输入数据进行推理 +```shell +python -u ./predict.py \ + --model_file ./electra-deploy.pdmodel \ + --params_file ./electra-deploy.pdiparams \ + --predict_sentences "uneasy mishmash of styles and genres ." 
"director rob marshall went out gunning to make a great one ." \ + --batch_size 2 \ + --max_seq_length 128 \ + --model_name electra-small +``` +其中参数释义如下: +- `model_file` 表示推理需要加载的模型结构文件。例如前提中得到的electra-deploy.pdmodel。 +- `params_file` 表示推理需要加载的模型权重文件。例如前提中得到的electra-deploy.pdiparams。 +- `predict_sentences` 表示用于推理的(句子)数据,可以配置1条或多条。如果此项配置,则predict_file不用配置。 +- `batch_size` 表示每次推理的样本数目。 +- `max_seq_length` 表示输入的最大句子长度,超过该长度将被截断。 +- `model_name` 表示推理模型的类型,当前支持electra-small(约1400万参数)、electra-base(约1.1亿参数)、electra-large(约3.35亿参数)。 + +另外还有一些额外参数不在如上命令中: +- `use_gpu` 表示是否使用GPU进行推理,默认不开启。如果在命令中加上了--use_gpu,则使用GPU进行推理。 +- `use_trt` 表示是否使用TensorRT进行推理,默认不开启。如果在命令中加上了--use_trt,且配置了--use_gpu,则使用TensorRT进行推理。前提条件:1)需提前安装TensorRT或使用[Paddle提供的TensorRT docker镜像](https://github.com/PaddlePaddle/Serving/blob/v0.5.0/doc/DOCKER_IMAGES_CN.md)。2)需根据cuda、cudnn、tensorRT和python的版本,安装[匹配版本的Paddle包](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html) + +## 从文件读取输入数据进行推理 +```shell +python -u ./predict.py \ + --model_file ./electra-deploy.pdmodel \ + --params_file ./electra-deploy.pdiparams \ + --predict_file "./sst-2.test.tsv.1" "./sst-2.test.tsv.2" \ + --batch_size 2 \ + --max_seq_length 128 \ + --model_name electra-small +``` +其中绝大部分和从命令行读取输入数据一样,这里描述不一样的参数: +- `predict_file` 表示用于推理的文件数据,可以配置1个或多个文件,每个文件和预训练数据格式一样,为utf-8编码的文本数据,每行1句文本。如果此项配置,则predict_sentences不用配置。 + +模型对每1句话分别推理出1个结果,例如下面为使用第一种方法中的命令得到的SST-2情感分类推理结果,0表示句子是负向情感,1表示句子为正向情感。因为batch_size=2,所以只有1个batch: +```shell +===== batch 0 ===== +Input sentence is : [CLS] uneasy mishmash of styles and genres . [SEP] +Output data is : 0 +Input sentence is : [CLS] director rob marshall went out gunning to make a great one . [SEP] +Output data is : 1 +inference total 2 sentences done, total time : 0.0849156379699707 s +``` +此推理结果表示:第1句话是负向情感,第2句话是正向情感。 diff --git a/model_zoo/electra/deploy/python/predict.py b/model_zoo/electra/deploy/python/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..a4c6a60ff44559967a18bdef4927f1ee1c091fed --- /dev/null +++ b/model_zoo/electra/deploy/python/predict.py @@ -0,0 +1,194 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +import time + +import numpy as np +from paddle import inference + +from paddlenlp.transformers import ElectraTokenizer +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_file", type=str, required=True, help="model filename") + parser.add_argument("--params_file", type=str, required=True, help="parameter filename") + parser.add_argument("--predict_sentences", type=str, nargs="*", help="one or more sentence to predict") + parser.add_argument( + "--predict_file", type=str, nargs="*", help="one or more file which contain sentence to predict" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batch size") + parser.add_argument("--use_gpu", action="store_true", help="whether to use gpu") + parser.add_argument("--use_trt", action="store_true", help="whether to use TensorRT") + parser.add_argument("--max_seq_length", type=int, default=128, help="max length of each sequence") + parser.add_argument( + "--model_name", + type=str, + default="electra-small", + help="shortcut name selected in the list: " + + ", ".join(list(ElectraTokenizer.pretrained_init_configuration.keys())), + ) + return parser.parse_args() + + +def read_sentences(paths=[]): + sentences = [] + for sen_path in paths: + assert os.path.isfile(sen_path), "The {} isn't a valid file.".format(sen_path) + sen = read_file(sen_path) + if sen is None: + logger.info("error in loading file:{}".format(sen_path)) + continue + sentences.extend(sen) + return sentences + + +def read_file(path): + lines = [] + with open(path, encoding="utf-8") as f: + while True: + line = f.readline() + if line: + if len(line) > 0 and not line.isspace(): + lines.append(line.strip()) + else: + break + return lines + + +def get_predicted_input(predicted_data, tokenizer, max_seq_length, batch_size): + if predicted_data == [] or not isinstance(predicted_data, list): + raise TypeError("The predicted_data is inconsistent with expectations.") + + sen_ids_batch = [] + sen_words_batch = [] + sen_ids = [] + sen_words = [] + batch_num = 0 + pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token) + for sen in predicted_data: + sen_id = tokenizer(sen, max_seq_len=max_seq_length)["input_ids"] + sen_ids.append(sen_id) + sen_words.append(tokenizer.cls_token + " " + sen + " " + tokenizer.sep_token) + batch_num += 1 + if batch_num == batch_size: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + sen_ids = [] + sen_words = [] + batch_num = 0 + + if len(sen_ids) > 0: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + + return sen_ids_batch, sen_words_batch + + +def predict(args, sentences=[], paths=[]): + """ + Args: + sentences (list[str]): each string is a sentence. If sentences not paths + paths (list[str]): The paths of file which contain sentences. If paths not sentences + Returns: + res (list(numpy.ndarray)): The result of sentence, indicate whether each word is replaced, same shape with sentences. 
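+    Note: the predicted class ids are printed batch by batch; nothing is returned.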
+ """ + + # initialize data + if sentences != [] and isinstance(sentences, list) and (paths == [] or paths is None): + predicted_data = sentences + elif (sentences == [] or sentences is None) and isinstance(paths, list) and paths != []: + predicted_data = read_sentences(paths) + else: + raise TypeError("The input data is inconsistent with expectations.") + + tokenizer = ElectraTokenizer.from_pretrained(args.model_name) + predicted_input, predicted_sens = get_predicted_input( + predicted_data, tokenizer, args.max_seq_length, args.batch_size + ) + + # config + config = inference.Config(args.model_file, args.params_file) + config.switch_use_feed_fetch_ops(False) + config.enable_memory_optim() + if args.use_gpu: + config.enable_use_gpu(1000, 0) + if args.use_trt: + config.enable_tensorrt_engine( + workspace_size=1 << 30, + max_batch_size=args.batch_size, + min_subgraph_size=5, + precision_mode=inference.PrecisionType.Float32, + use_static=False, + use_calib_mode=False, + ) + + # predictor + predictor = inference.create_predictor(config) + + start_time = time.time() + output_data = [] + count = 0 + for i, sen in enumerate(predicted_input): + sen = np.array(sen).astype("int64") + # get input name + input_names = predictor.get_input_names() + # get input pointer and copy data + input_tensor = predictor.get_input_handle(input_names[0]) + input_tensor.copy_from_cpu(sen) + + # run predictor + predictor.run() + + # get output name + output_names = predictor.get_output_names() + # get output pointer and copy data(nd.array) + output_tensor = predictor.get_output_handle(output_names[0]) + predict_data = output_tensor.copy_to_cpu() + output_res = np.argmax(predict_data, axis=1).tolist() + output_data.append(output_res) + + print("===== batch {} =====".format(i)) + for j in range(len(predicted_sens[i])): + print("Input sentence is : {}".format(predicted_sens[i][j])) + # print("Output logis is : {}".format(output_data[j])) + print("Output data is : {}".format(output_res[j])) + count += len(predicted_sens[i]) + print("inference total %s sentences done, total time : %s s" % (count, time.time() - start_time)) + + +if __name__ == "__main__": + args = parse_args() + sentences = args.predict_sentences + paths = args.predict_file + # sentences = ["The quick brown fox see over the lazy dog.", "The quick brown fox jump over tree lazy dog."] + # paths = ["../../debug/test.txt", "../../debug/test.txt.1"] + predict(args, sentences, paths) diff --git a/model_zoo/electra/deploy/serving/README.md b/model_zoo/electra/deploy/serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..1b64547e9c2463ff4207eba0a2430f63ac9394a3 --- /dev/null +++ b/model_zoo/electra/deploy/serving/README.md @@ -0,0 +1,107 @@ +# **ELECTRA 使用Paddle Serving API进行推理** +Paddle Serving 可以实现在服务器端部署推理模型,客户端远程通过RPC/HTTP方式发送数据进行推理,实现模型推理的服务化,下面以RPC方式为例进行说明。 + +## 前提条件 +准备好Inference所需模型,需要2个文件: +| 文件 | 说明 | +|-------------------------------|----------------------------------------| +| electra-deploy.pdiparams | 模型权重文件,供推理时加载使用 | +| electra-deploy.pdmodel | 模型结构文件,供推理时加载使用 | + +如何获得Inference模型?[可参考文档“导出推理模型”一节](../../README.md),下面假设这2个文件已生成,并放在当前目录下 + +## 在服务器端和客户端启动Serving的docker容器 +建议在docker容器中运行服务器端和客户端以避免一些系统依赖库问题,启动docker镜像的命令参考:[Serving readme](https://github.com/PaddlePaddle/Serving/tree/v0.5.0) + +## 在服务器端安装相关模块 +```shell +pip install paddle-serving-app paddle-serving-client paddle-serving-server paddlepaddle +``` +如果服务器端可以使用GPU进行推理,则安装server的gpu版本,安装时要注意参考服务器当前CUDA、TensorRT的版本来安装对应的版本:[Serving 
readme](https://github.com/PaddlePaddle/Serving/tree/v0.5.0) +```shell +pip install paddle-serving-app paddle-serving-client paddle-serving-server-gpu paddlepaddle-gpu +``` + +## 在客户端安装相关模块 +```shell +pip install paddle-serving-app paddle-serving-client +``` + +## 从Inference模型生成Serving的模型和配置 +以前提条件中准备好的Inference模型 electra-deploy.pdmodel/electra-deploy.pdiparams 为例: +```shell +python -u ./covert_inference_model_to_serving.py \ + --inference_model_dir ./ \ + --model_file ./electra-deploy.pdmodel \ + --params_file ./electra-deploy.pdiparams +``` +其中参数释义如下: +- `inference_model_dir` 表示Inference推理模型所在目录,这里假设为当前目录。 +- `model_file` 表示推理需要加载的模型结构文件。例如前提中得到的electra-deploy.pdmodel。 +- `params_file` 表示推理需要加载的模型权重文件。例如前提中得到的electra-deploy.pdiparams。 + +执行命令后,会在当前目录下生成2个目录:serving_server 和 serving_client。serving_server目录包含服务器端所需的模型和配置,需将其cp到服务器端容器中;serving_client目录包含客户端所需的配置,需将其cp到客户端容器中 + +## 启动server +在服务器端容器中,使用上一步得到的serving_server目录启动server +```shell +python -m paddle_serving_server_gpu.serve \ + --model ./serving_server \ + --port 8383 +``` +其中参数释义如下: +- `model` 表示server加载的模型和配置所在目录。 +- `port` 表示server开启的服务端口8383。 + +如果服务器端可以使用GPU进行推理计算,则启动服务器时可以配置server使用的GPU id +```shell +python -m paddle_serving_server_gpu.serve \ + --model ./serving_server \ + --port 8383 \ + --gpu_id 0 +``` +- `gpu_id` 表示server使用0号GPU。 + +## 启动client进行推理 +在客户端容器中,使用前面得到的serving_client目录启动client发起RPC推理请求。和使用Paddle Inference API进行推理一样,有如下两种方法: +### 从命令行读取输入数据发起推理请求 +```shell +python -u ./client.py \ + --client_config_file ./serving_client/serving_client_conf.prototxt \ + --server_ip_port 127.0.0.1:8383 \ + --predict_sentences "uneasy mishmash of styles and genres ." "director rob marshall went out gunning to make a great one ." \ + --batch_size 2 \ + --max_seq_length 128 \ + --model_name electra-small +``` +其中参数释义如下: +- `client_config_file` 表示客户端需要加载的配置文件。 +- `server_ip_port` 表示服务器端的ip和port。默认为127.0.0.1:8383。 +- `predict_sentences` 表示用于推理的(句子)数据,可以配置1条或多条。如果此项配置,则predict_file不用配置。 +- `batch_size` 表示每次推理的样本数目。 +- `max_seq_length` 表示输入的最大句子长度,超过该长度将被截断。 +- `model_name` 表示推理模型的类型,当前支持electra-small(约1400万参数)、electra-base(约1.1亿参数)、electra-large(约3.35亿参数)。 + +### 从文件读取输入数据发起推理请求 +```shell +python -u ./client.py \ + --client_config_file ./serving_client/serving_client_conf.prototxt \ + --server_ip_port 127.0.0.1:8383 \ + --predict_file "./sst-2.test.tsv.1" "./sst-2.test.tsv.2" \ + --batch_size 2 \ + --max_seq_length 128 \ + --model_name electra-small +``` +其中绝大部分和从命令行读取输入数据一样,这里描述不一样的参数: +- `predict_file` 表示用于推理的文件数据,可以配置1个或多个文件,每个文件和预训练数据格式一样,为utf-8编码的文本数据,每行1句文本。如果此项配置,则predict_sentences不用配置。 + +使用Paddle Serving API进行推理的结果和使用Inference API的结果是一样的: +```shell +===== batch 0 ===== +Input sentence is : [CLS] uneasy mishmash of styles and genres . [SEP] +Output data is : 0 +Input sentence is : [CLS] director rob marshall went out gunning to make a great one . [SEP] +Output data is : 1 +inference total 2 sentences done, total time : 4.729415416717529 s +``` +此推理结果表示:第1句话是负向情感,第2句话是正向情感。 diff --git a/model_zoo/electra/deploy/serving/client.py b/model_zoo/electra/deploy/serving/client.py new file mode 100644 index 0000000000000000000000000000000000000000..331748feee3eb29ad679e3d056042c508548da5c --- /dev/null +++ b/model_zoo/electra/deploy/serving/client.py @@ -0,0 +1,166 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time + +import numpy as np +from paddle_serving_client import Client + +from paddlenlp.transformers import ElectraTokenizer +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--client_config_file", type=str, required=True, help="client prototxt config file") + parser.add_argument("--server_ip_port", type=str, default="127.0.0.1:8383", help="server_ip:port") + parser.add_argument("--predict_sentences", type=str, nargs="*", help="one or more sentence to predict") + parser.add_argument( + "--predict_file", type=str, nargs="*", help="one or more file which contain sentence to predict" + ) + parser.add_argument("--batch_size", type=int, default=1, help="batch size") + parser.add_argument("--max_seq_length", type=int, default=128, help="max length of each sequence") + parser.add_argument( + "--model_name", + type=str, + default="electra-small", + help="shortcut name selected in the list: " + + ", ".join(list(ElectraTokenizer.pretrained_init_configuration.keys())), + ) + return parser.parse_args() + + +def read_sentences(paths=[]): + sentences = [] + for sen_path in paths: + assert os.path.isfile(sen_path), "The {} isn't a valid file.".format(sen_path) + sen = read_file(sen_path) + if sen is None: + logger.info("error in loading file:{}".format(sen_path)) + continue + sentences.extend(sen) + return sentences + + +def read_file(path): + lines = [] + with open(path, encoding="utf-8") as f: + while True: + line = f.readline() + if line: + if len(line) > 0 and not line.isspace(): + lines.append(line.strip()) + else: + break + return lines + + +def get_predicted_input(predicted_data, tokenizer, max_seq_length, batch_size): + if predicted_data == [] or not isinstance(predicted_data, list): + raise TypeError("The predicted_data is inconsistent with expectations.") + + sen_ids_batch = [] + sen_words_batch = [] + sen_ids = [] + sen_words = [] + batch_num = 0 + pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token) + for sen in predicted_data: + sen_id = tokenizer(sen, max_seq_len=max_seq_length)["input_ids"] + sen_ids.append(sen_id) + sen_words.append(tokenizer.cls_token + " " + sen + " " + tokenizer.sep_token) + batch_num += 1 + if batch_num == batch_size: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + sen_ids = [] + sen_words = [] + batch_num = 0 + + if len(sen_ids) > 0: + tmp_list = [] + max_length = max([len(i) for i in sen_ids]) + for i in sen_ids: + if len(i) < max_length: + tmp_list.append(i + (max_length - len(i)) * [pad_token_id]) + else: + tmp_list.append(i) + sen_ids_batch.append(tmp_list) + sen_words_batch.append(sen_words) + + return sen_ids_batch, sen_words_batch + + +def predict(args, sentences=[], paths=[]): + """ + Args: + sentences (list[str]): each string is a sentence. 
If have sentences then no need paths + paths (list[str]): The paths of file which contain sentences. If have paths then no need sentences + Returns: + res (list(numpy.ndarray)): The result of sentence, indicate whether each word is replaced, same shape with sentences. + """ + + # initialize client + client = Client() + client.load_client_config(args.client_config_file) + # "serving_client/serving_client_conf.prototxt") + client.connect([args.server_ip_port]) + + # initialize data + if sentences != [] and isinstance(sentences, list) and (paths == [] or paths is None): + predicted_data = sentences + elif (sentences == [] or sentences is None) and isinstance(paths, list) and paths != []: + predicted_data = read_sentences(paths) + else: + raise TypeError("The input data is inconsistent with expectations.") + + tokenizer = ElectraTokenizer.from_pretrained(args.model_name) + predicted_input, predicted_sens = get_predicted_input( + predicted_data, tokenizer, args.max_seq_length, args.batch_size + ) + + start_time = time.time() + count = 0 + for i, sen in enumerate(predicted_input): + sen = np.array(sen).astype("int64") + + fetch_map = client.predict(feed={"input_ids": sen}, fetch=["save_infer_model/scale_0.tmp_0"], batch=True) + output_data = np.array(fetch_map["save_infer_model/scale_0.tmp_0"]) + output_res = np.argmax(output_data, axis=1) + + print("===== batch {} =====".format(i)) + for j in range(len(predicted_sens[i])): + print("Input sentence is : {}".format(predicted_sens[i][j])) + # print("Output logis is : {}".format(output_data[j])) + print("Output data is : {}".format(output_res[j])) + + count += len(predicted_sens[i]) + print("inference total %s sentences done, total time : %s s" % (count, time.time() - start_time)) + + +if __name__ == "__main__": + # paddle.enable_static() + args = parse_args() + sentences = args.predict_sentences + paths = args.predict_file + predict(args, sentences, paths) diff --git a/model_zoo/electra/deploy/serving/covert_inference_model_to_serving.py b/model_zoo/electra/deploy/serving/covert_inference_model_to_serving.py new file mode 100644 index 0000000000000000000000000000000000000000..795a5aeb2281af9910313ea2c6fb3a8efa643462 --- /dev/null +++ b/model_zoo/electra/deploy/serving/covert_inference_model_to_serving.py @@ -0,0 +1,39 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
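+
+# covert_inference_model_to_serving.py converts the exported Paddle inference
+# model (--model_file/--params_file inside --inference_model_dir) into Paddle
+# Serving format via serving_io.inference_model_to_serving, producing a
+# serving_server/ directory for the server side and a serving_client/ directory
+# for the client side, then prints the model's feed and fetch names.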
+ +import argparse +import paddle +import paddle_serving_client.io as serving_io + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--inference_model_dir", type=str, required=True, help="input inference model dir") + parser.add_argument("--model_file", type=str, required=True, help="input inference model file name") + parser.add_argument("--params_file", type=str, required=True, help="input inference parameters file name") + return parser.parse_args() + + +if __name__ == "__main__": + paddle.enable_static() + args = parse_args() + feed_names, fetch_names = serving_io.inference_model_to_serving( + dirname=args.inference_model_dir, + serving_server="serving_server", + serving_client="serving_client", + model_filename=args.model_file, + params_filename=args.params_file, + ) + print("model feed_names : %s" % feed_names) + print("model fetch_names : %s" % fetch_names) diff --git a/model_zoo/electra/electra_model_brief_introduce.JPG b/model_zoo/electra/electra_model_brief_introduce.JPG new file mode 100644 index 0000000000000000000000000000000000000000..226294ad6c52bb52112e3a4c8523611c600f85d1 Binary files /dev/null and b/model_zoo/electra/electra_model_brief_introduce.JPG differ diff --git a/model_zoo/electra/export_model.py b/model_zoo/electra/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..babb7741ac0d95e8ecf8e7681cd45a2ef9344e32 --- /dev/null +++ b/model_zoo/electra/export_model.py @@ -0,0 +1,79 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
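+
+# export_model.py loads model_state.pdparams from --input_model_dir, checks by its
+# parameter names that it is a GLUE fine-tuned ElectraForSequenceClassification
+# checkpoint (pretraining or discriminator-only checkpoints are rejected), and
+# exports a static inference model named <model_name>.pdmodel/.pdiparams into
+# --output_model_dir via paddle.jit.save with a [None, None] int64 InputSpec.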
+# from collections import namedtuple +from __future__ import absolute_import, division, print_function + +import argparse +import hashlib +import os + +import paddle +from paddle.static import InputSpec + +from paddlenlp.transformers import ElectraForSequenceClassification + + +def get_md5sum(file_path): + md5sum = None + if os.path.isfile(file_path): + with open(file_path, "rb") as f: + md5_obj = hashlib.md5() + md5_obj.update(f.read()) + hash_code = md5_obj.hexdigest() + md5sum = str(hash_code).lower() + return md5sum + + +def main(): + # check and load model + input_model_file = os.path.join(args.input_model_dir, "model_state.pdparams") + print("load model to get static model : %s \nmodel md5sum : %s" % (input_model_file, get_md5sum(input_model_file))) + model_state_dict = paddle.load(input_model_file) + + if all((s.startswith("generator") or s.startswith("discriminator")) for s in model_state_dict.keys()): + print("the model : %s is electra pretrain model, we need fine-tuning model to deploy" % input_model_file) + exit(1) + elif "discriminator_predictions.dense.weight" in model_state_dict: + print("the model : %s is electra discriminator model, we need fine-tuning model to deploy" % input_model_file) + exit(1) + elif "classifier.dense.weight" in model_state_dict: + print("we are load glue fine-tuning model") + model = ElectraForSequenceClassification.from_pretrained(args.input_model_dir) + print("total model layers : ", len(model_state_dict)) + else: + print("the model file : %s may not be fine-tuning model, please check" % input_model_file) + exit(1) + + # save static model to disk + paddle.jit.save( + layer=model, + path=os.path.join(args.output_model_dir, args.model_name), + input_spec=[InputSpec(shape=[None, None], dtype="int64")], + ) + print("save electra inference model success") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--input_model_dir", required=True, default=None, help="Directory for storing Electra pretraining model" + ) + parser.add_argument( + "--output_model_dir", required=True, default=None, help="Directory for output Electra inference model" + ) + parser.add_argument( + "--model_name", default="electra-deploy", type=str, help="prefix name of output model and parameters" + ) + args, unparsed = parser.parse_known_args() + main() diff --git a/model_zoo/electra/get_ft_model.py b/model_zoo/electra/get_ft_model.py new file mode 100644 index 0000000000000000000000000000000000000000..fde0b7cb75ec3178255b5ae0343ab7da6f3236f6 --- /dev/null +++ b/model_zoo/electra/get_ft_model.py @@ -0,0 +1,82 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
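+
+# get_ft_model.py splits the ELECTRA pretraining checkpoint model_state.pdparams
+# found in --model_dir into generator and discriminator state dicts (by stripping
+# the "generator."/"discriminator." key prefixes), saves them to the files given by
+# --generator_output_file/--discriminator_output_file, renames the original file to
+# pretrain_model_state.pdparams, and symlinks model_state.pdparams to the
+# discriminator weights so the directory can be used directly for fine-tuning.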
+# from collections import namedtuple +import argparse +import hashlib +import os + +import paddle + + +def get_md5sum(file_path): + md5sum = None + if os.path.isfile(file_path): + with open(file_path, "rb") as f: + md5_obj = hashlib.md5() + md5_obj.update(f.read()) + hash_code = md5_obj.hexdigest() + md5sum = str(hash_code).lower() + return md5sum + + +def main(args): + pretraining_model = os.path.join(args.model_dir, "model_state.pdparams") + if os.path.islink(pretraining_model): + print("%s already contain fine-tuning model, pleace check" % args.model_dir) + exit(0) + print( + "load Electra pretrain model to get generator/discriminator model : %s \nmodel md5sum : %s" + % (pretraining_model, get_md5sum(pretraining_model)) + ) + # depart total_pretraining_model to generator and discriminator state_dict + total_pretraining_model = paddle.load(pretraining_model) + generator_state_dict = {} + discriminator_state_dict = {} + num_keys = 0 + for key in total_pretraining_model.keys(): + new_key = None + if "generator." in key: + new_key = key.replace("generator.", "", 1) + generator_state_dict[new_key] = total_pretraining_model[key] + if "discriminator." in key: + new_key = key.replace("discriminator.", "", 1) + discriminator_state_dict[new_key] = total_pretraining_model[key] + num_keys += 1 + print("total electra keys : ", num_keys) + print("total generator keys : ", len(generator_state_dict)) + print("total discriminator keys : ", len(discriminator_state_dict)) + + # save generator and discriminator model to disk + paddle.save(generator_state_dict, os.path.join(args.model_dir, args.generator_output_file)) + paddle.save(discriminator_state_dict, os.path.join(args.model_dir, args.discriminator_output_file)) + print("save generator and discriminator model success") + os.rename(pretraining_model, os.path.join(args.model_dir, "pretrain_model_state.pdparams")) + os.symlink(args.discriminator_output_file, os.path.join(args.model_dir, "model_state.pdparams")) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_dir", required=True, default=None, help="Directory of storing ElectraForTotalPreTraining model" + ) + parser.add_argument( + "--generator_output_file", default="generator_for_ft.pdparams", help="Electra generator model for fine-tuning" + ) + parser.add_argument( + "--discriminator_output_file", + default="discriminator_for_ft.pdparams", + help="Electra discriminator model for fine-tuning", + ) + args, unparsed = parser.parse_known_args() + main(args) diff --git a/model_zoo/electra/run_glue.py b/model_zoo/electra/run_glue.py new file mode 100644 index 0000000000000000000000000000000000000000..c5a8051a4a026e4dfc30b9a43abb40f6f24e1664 --- /dev/null +++ b/model_zoo/electra/run_glue.py @@ -0,0 +1,369 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
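+
+# run_glue.py fine-tunes BERT/ELECTRA/ERNIE sequence classification models on a
+# GLUE task: it builds train/dev DataLoaders from paddlenlp.datasets' "glue"
+# dataset, trains with AdamW and LinearDecayWithWarmup, evaluates with the metric
+# registered for the task in METRIC_CLASSES every --save_steps steps, and saves
+# checkpoints named <task>_ft_model_<step>.pdparams under --output_dir.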
+ +import argparse +import logging +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from paddle.io import DataLoader +from paddle.metric import Accuracy + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman +from paddlenlp.transformers import ( + BertForSequenceClassification, + BertTokenizer, + ElectraForSequenceClassification, + ElectraTokenizer, + ErnieForSequenceClassification, + ErnieTokenizer, + LinearDecayWithWarmup, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +METRIC_CLASSES = { + "cola": Mcc, + "sst-2": Accuracy, + "mrpc": AccuracyAndF1, + "sts-b": PearsonAndSpearman, + "qqp": AccuracyAndF1, + "mnli": Accuracy, + "qnli": Accuracy, + "rte": Accuracy, +} + +MODEL_CLASSES = { + "bert": (BertForSequenceClassification, BertTokenizer), + "electra": (ElectraForSequenceClassification, ElectraTokenizer), + "ernie": (ErnieForSequenceClassification, ErnieTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to train selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--learning_rate", default=1e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument( + "--num_train_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument( + "--warmup_steps", + default=0, + type=int, + help="Linear warmup over warmup_steps. If > 0: Override warmup_proportion", + ) + parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over total steps." + ) + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.", + ) + parser.add_argument("--seed", default=42, type=int, help="random seed for initialization") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "npu"], + help="The device to select to train the model, is must be cpu/gpu/npu.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +@paddle.no_grad() +def evaluate(model, loss_fct, metric, data_loader): + model.eval() + metric.reset() + for batch in data_loader: + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + correct = metric.compute(logits, labels) + metric.update(correct) + res = metric.accumulate() + if isinstance(metric, AccuracyAndF1): + print( + "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " + % ( + loss.numpy(), + res[0], + res[1], + res[2], + res[3], + res[4], + ), + end="", + ) + elif isinstance(metric, Mcc): + print("eval loss: %f, mcc: %s, " % (loss.numpy(), res[0]), end="") + elif isinstance(metric, PearsonAndSpearman): + print( + "eval loss: %f, pearson: %s, spearman: %s, pearson and spearman: %s, " + % (loss.numpy(), res[0], res[1], res[2]), + end="", + ) + else: + print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") + model.train() + + +def convert_example(example, tokenizer, label_list, max_seq_length=512, is_test=False): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + label = example["labels"] + label = np.array([label], dtype=label_dtype) + # Convert raw text to feature + if (int(is_test) + len(example)) == 2: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + else: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + return example["input_ids"], example["token_type_ids"], label + else: + return example["input_ids"], example["token_type_ids"] + + +def do_train(args): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + + args.task_name = args.task_name.lower() + metric_class = METRIC_CLASSES[args.task_name] + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + train_ds = load_dataset("glue", args.task_name, splits="train") + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + trans_func = partial( + convert_example, tokenizer=tokenizer, label_list=train_ds.label_list, max_seq_length=args.max_seq_length + ) + train_ds = train_ds.map(trans_func, lazy=True) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment + Stack(dtype="int64" if train_ds.label_list else "float32"), # label + ): fn(samples) + train_data_loader = DataLoader( + dataset=train_ds, batch_sampler=train_batch_sampler, 
collate_fn=batchify_fn, num_workers=0, return_list=True + ) + if args.task_name == "mnli": + dev_ds_matched, dev_ds_mismatched = load_dataset( + "glue", args.task_name, splits=["dev_matched", "dev_mismatched"] + ) + + dev_ds_matched = dev_ds_matched.map(trans_func, lazy=True) + dev_ds_mismatched = dev_ds_mismatched.map(trans_func, lazy=True) + dev_batch_sampler_matched = paddle.io.BatchSampler(dev_ds_matched, batch_size=args.batch_size, shuffle=False) + dev_data_loader_matched = DataLoader( + dataset=dev_ds_matched, + batch_sampler=dev_batch_sampler_matched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + dev_batch_sampler_mismatched = paddle.io.BatchSampler( + dev_ds_mismatched, batch_size=args.batch_size, shuffle=False + ) + dev_data_loader_mismatched = DataLoader( + dataset=dev_ds_mismatched, + batch_sampler=dev_batch_sampler_mismatched, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + else: + dev_ds = load_dataset("glue", args.task_name, splits="dev") + dev_ds = dev_ds.map(trans_func, lazy=True) + dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) + dev_data_loader = DataLoader( + dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + num_classes = 1 if train_ds.label_list is None else len(train_ds.label_list) + model = model_class.from_pretrained(args.model_name_or_path, num_classes=num_classes) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
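+    # The exclusion works by name matching: any parameter whose name contains
+    # "bias" or "norm" is left out of decay_params, and apply_decay_param_fun
+    # below makes AdamW apply weight decay only to the remaining parameters.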
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=0.9, + beta2=0.999, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + loss_fct = paddle.nn.loss.CrossEntropyLoss() if train_ds.label_list else paddle.nn.loss.MSELoss() + + metric = metric_class() + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + + input_ids, segment_ids, labels = batch + logits = model(input_ids, segment_ids) + loss = loss_fct(logits, labels) + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0 or global_step == num_training_steps: + print( + "global step %d/%d, epoch: %d, batch: %d, rank_id: %s, loss: %f, lr: %.10f, speed: %.4f step/s" + % ( + global_step, + num_training_steps, + epoch, + step, + paddle.distributed.get_rank(), + loss, + optimizer.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_step % args.save_steps == 0 or global_step == num_training_steps: + tic_eval = time.time() + if args.task_name == "mnli": + evaluate(model, loss_fct, metric, dev_data_loader_matched) + evaluate(model, loss_fct, metric, dev_data_loader_mismatched) + print("eval done total : %s s" % (time.time() - tic_eval)) + else: + evaluate(model, loss_fct, metric, dev_data_loader) + print("eval done total : %s s" % (time.time() - tic_eval)) + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join( + args.output_dir, "%s_ft_model_%d.pdparams" % (args.task_name, global_step) + ) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + # Need better way to get inner model of DataParallel + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + n_gpu = len(os.getenv("CUDA_VISIBLE_DEVICES", "").split(",")) + if args.device in "gpu" and n_gpu > 1: + paddle.distributed.spawn(do_train, args=(args,), nprocs=n_gpu) + else: + do_train(args) diff --git a/model_zoo/electra/run_pretrain.py b/model_zoo/electra/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..02e12eaf88f9ac544cae47f7957f55009b128f86 --- /dev/null +++ b/model_zoo/electra/run_pretrain.py @@ -0,0 +1,556 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import copy +import io +import json +import logging +import os +import random +import time + +import numpy as np +import paddle + +from paddlenlp.transformers import ( + ElectraForTotalPretraining, + ElectraPretrainingCriterion, + ElectraTokenizer, + LinearDecayWithWarmup, +) + +FORMAT = "%(asctime)s-%(levelname)s: %(message)s" +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + +MODEL_CLASSES = { + "electra": (ElectraForTotalPretraining, ElectraTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default="electra", + type=str, + help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", + default="electra-small", + type=str, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument("--max_seq_length", default=128, type=int, help="max length of each sequence") + parser.add_argument("--mask_prob", default=0.15, type=float, help="the probability of one word to be mask") + parser.add_argument( + "--train_batch_size", + default=96, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument( + "--eval_batch_size", + default=96, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=5e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--num_train_epochs", + default=4, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_train_epochs.", + ) + parser.add_argument("--warmup_steps", default=10000, type=int, help="Linear warmup over warmup_steps.") + + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--init_from_ckpt", + action="store_true", + help="Whether to load model checkpoint. if True, args.model_name_or_path must be dir store ckpt or will train from fresh start", + ) + parser.add_argument( + "--use_amp", action="store_true", help="Whether to use float16(Automatic Mixed Precision) to train." 
+ ) + parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") + parser.add_argument("--eager_run", type=bool, default=True, help="Use dygraph mode.") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", + ) + args = parser.parse_args() + return args + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +class BookCorpus(paddle.io.Dataset): + """ + https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html + Args: + data_path (:obj:`str`) : The dataset file path, which contains train.tsv, dev.tsv and test.tsv. + tokenizer (:obj:`class PretrainedTokenizer`) : The tokenizer to split word and convert word to id. + max_seq_length (:obj:`int`) : max length for each sequence. + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train, test or dev). + """ + + def __init__( + self, + data_path, + tokenizer, + max_seq_length, + mode="train", + ): + if mode == "train": + data_file = "train.data" + elif mode == "test": + data_file = "test.data" + else: + data_file = "dev.data" + + self.data_file = os.path.join(data_path, data_file) + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.raw_examples = self._read_file(self.data_file) + + def _read_file(self, input_file): + """ + Reads a text file. + + Args: + input_file (:obj:`str`) : The file to be read. + + Returns: + examples (:obj:`list`): All the input data. 
+ """ + if not os.path.exists(input_file): + raise RuntimeError("The file {} is not found.".format(input_file)) + else: + with io.open(input_file, "r", encoding="UTF-8") as f: + examples = [] + while True: + line = f.readline() + if line: + if len(line) > 0 and not line.isspace(): + example = self.tokenizer(line, max_seq_len=self.max_seq_length)["input_ids"] + examples.append(example) + else: + break + return examples + + def truncation_ids(self, ids, max_seq_length): + if len(ids) <= (max_seq_length - 2): + return ids + else: + return ids[: (max_seq_length - 2)] + + def __getitem__(self, idx): + return self.raw_examples[idx] + + def __len__(self): + return len(self.raw_examples) + + +class DataCollatorForElectra(object): + """ + pads, gets batch of tensors and preprocesses batches for masked language modeling + when dataloader num_worker > 0, this collator may trigger some bugs, for safe, be sure dataloader num_worker=0 + """ + + def __init__(self, tokenizer, max_seq_length, mlm=True, mlm_probability=0.15): + self.tokenizer = tokenizer + self.max_seq_length = max_seq_length + self.mlm = True + self.mlm_probability = mlm_probability + + def __call__(self, examples): + if self.mlm: + inputs, raw_inputs, labels = self.mask_tokens(examples) + return inputs, raw_inputs, labels + else: + raw_inputs, _ = self.add_special_tokens_and_set_maskprob(examples, True, self.max_seq_length) + raw_inputs = self.tensorize_batch(raw_inputs, "int64") + inputs = raw_inputs.clone().detach() + labels = raw_inputs.clone().detach() + if self.tokenizer.pad_token is not None: + pad_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.pad_token) + labels[labels == pad_token_id] = -100 + return batch, raw_inputs, labels # noqa:821 + + def tensorize_batch(self, examples, dtype): + if isinstance(examples[0], (list, tuple)): + examples = [paddle.to_tensor(e, dtype=dtype) for e in examples] + length_of_first = examples[0].shape[0] + are_tensors_same_length = all(x.shape[0] == length_of_first for x in examples) + if are_tensors_same_length: + return paddle.stack(examples, axis=0) + else: + raise ValueError("the tensor in examples not have same shape, please check input examples") + + def add_special_tokens_and_set_maskprob(self, inputs, truncation, max_seq_length): + # sep_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.sep_token) + pad_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.pad_token) + # cls_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.cls_token) + full_inputs = [] + full_maskprob = [] + max_length = 0 + for ids in inputs: + if len(ids) > max_length: + max_length = len(ids) + max_length = min(max_length, max_seq_length) + + for ids in inputs: + if len(ids) <= max_length: + padding_num = max_length - len(ids) + full_inputs.append(ids + ([pad_token_id] * padding_num)) + full_maskprob.append([0] + ([self.mlm_probability] * (len(ids) - 2)) + [0] + ([0] * padding_num)) + else: + if truncation: + full_inputs.append(ids[:max_length]) + full_maskprob.append([0] + ([self.mlm_probability] * (max_length - 2)) + [0]) + else: + full_inputs.append(ids) + full_maskprob.append([0] + ([self.mlm_probability] * (len(ids) - 2)) + [0]) + return full_inputs, full_maskprob + + def mask_tokens(self, examples): + if self.tokenizer.mask_token is None: + raise ValueError("the tokenizer does not have mask_token, please check!") + mask_token_id = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token) + + raw_inputs, probability_matrix = 
self.add_special_tokens_and_set_maskprob(examples, True, self.max_seq_length) + raw_inputs = self.tensorize_batch(raw_inputs, "int64") + probability_matrix = self.tensorize_batch(probability_matrix, "float32") + inputs = raw_inputs.clone() + labels = raw_inputs.clone() + + total_indices = paddle.bernoulli(probability_matrix).astype("bool").numpy() + labels[~total_indices] = -100 + + # 80% MASK + indices_mask = paddle.bernoulli(paddle.full(labels.shape, 0.8)).astype("bool").numpy() & total_indices + inputs[indices_mask] = mask_token_id + + # 10% Random + indices_random = ( + paddle.bernoulli(paddle.full(labels.shape, 0.5)).astype("bool").numpy() & total_indices & ~indices_mask + ) + random_words = paddle.randint(low=0, high=self.tokenizer.vocab_size, shape=labels.shape, dtype="int64") + inputs = paddle.where(paddle.to_tensor(indices_random), random_words, inputs) + + # 10% Original + return inputs, raw_inputs, labels + + +def create_dataloader(dataset, mode="train", batch_size=1, use_gpu=True, data_collator=None): + """ + Creats dataloader. + + Args: + dataset(obj:`paddle.io.Dataset`): + Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): + If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): + The sample number of a mini-batch. + use_gpu(obj:`bool`, optional, defaults to obj:`True`): + Whether to use gpu to run. + + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + + if mode == "train" and use_gpu: + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + else: + shuffle = True if mode == "train" else False + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + + return dataloader + + +def do_train(args): + paddle.enable_static() if not args.eager_run else None + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args) + # worker_init = WorkerInitObj(args.seed + paddle.distributed.get_rank()) + + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + # Loads or initializes a model. + pretrained_models = list(tokenizer_class.pretrained_init_configuration.keys()) + config = model_class.config_class.from_pretrained(args.model_name_or_path) + + if args.model_name_or_path in pretrained_models: + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + model = model_class(config) + args.init_from_ckpt = False + else: + if os.path.isdir(args.model_name_or_path) and args.init_from_ckpt: + # Load checkpoint + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + with open(os.path.join(args.model_name_or_path, "run_states.json"), "r") as f: + config_dict = json.load(f) + model_name = config_dict["model_name"] + if model_name in pretrained_models: + model = model_class.from_pretrained(args.model_name_or_path) + model.set_state_dict(paddle.load(os.path.join(args.model_name_or_path, "model_state.pdparams"))) + else: + raise ValueError( + "initialize a model from ckpt need model_name " + "in model_config_file. 
The supported model_name " + "are as follows: {}".format(tokenizer_class.pretrained_init_configuration.keys()) + ) + else: + raise ValueError( + "initialize a model need identifier or the " + "directory of storing model. if use identifier, the supported model " + "identifiers are as follows: {}, if use directory, " + "make sure set init_from_ckpt as True".format(model_class.pretrained_init_configuration.keys()) + ) + + criterion = ElectraPretrainingCriterion(config) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + # Loads dataset. + tic_load_data = time.time() + print("start load data : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + train_dataset = BookCorpus( + data_path=args.input_dir, tokenizer=tokenizer, max_seq_length=args.max_seq_length, mode="train" + ) + print("load data done, total : %s s" % (time.time() - tic_load_data)) + + # Reads data and generates mini-batches. + data_collator = DataCollatorForElectra( + tokenizer=tokenizer, max_seq_length=args.max_seq_length, mlm=True, mlm_probability=args.mask_prob + ) + + train_data_loader = create_dataloader( + train_dataset, + batch_size=args.train_batch_size, + mode="train", + use_gpu=True if args.device in "gpu" else False, + data_collator=data_collator, + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_train_epochs) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0) + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, + ) + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + + print("start train : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + trained_global_step = global_step = 0 + t_loss = paddle.to_tensor([0.0]) + log_loss = paddle.to_tensor([0.0]) + loss_list = [] + log_list = [] + tic_train = time.time() + if os.path.isdir(args.model_name_or_path) and args.init_from_ckpt: + optimizer.set_state_dict(paddle.load(os.path.join(args.model_name_or_path, "model_state.pdopt"))) + trained_global_step = global_step = config_dict["global_step"] + if trained_global_step < num_training_steps: + print( + "[ start train from checkpoint ] we have already trained %s steps, seeking next step : %s" + % (trained_global_step, trained_global_step + 1) + ) + else: + print( + "[ start train from checkpoint ] we have already trained %s steps, but total training steps is %s, please check configuration !" 
+ % (trained_global_step, num_training_steps) + ) + exit(0) + + for epoch in range(args.num_train_epochs): + for step, batch in enumerate(train_data_loader): + if trained_global_step > 0: + trained_global_step -= 1 + continue + global_step += 1 + input_ids, raw_input_ids, gen_labels = batch + if args.use_amp: + with paddle.amp.auto_cast(): + gen_logits, disc_logits, disc_labels, attention_mask = model( + input_ids=input_ids, raw_input_ids=raw_input_ids, generator_labels=gen_labels + ) + loss = criterion(gen_logits, disc_logits, gen_labels, disc_labels, attention_mask) + scaled = scaler.scale(loss) + scaled.backward() + t_loss += loss.detach() + scaler.minimize(optimizer, scaled) + else: + gen_logits, disc_logits, disc_labels, attention_mask = model( + input_ids=input_ids, raw_input_ids=raw_input_ids, generator_labels=gen_labels + ) + loss = criterion(gen_logits, disc_logits, gen_labels, disc_labels, attention_mask) + loss.backward() + t_loss += loss.detach() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + local_loss = (t_loss - log_loss) / args.logging_steps + if paddle.distributed.get_world_size() > 1: + paddle.distributed.all_gather(loss_list, local_loss) + if paddle.distributed.get_rank() == 0: + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "avg_loss: {4:.15f}, lr: {5:.10f}, speed: {6:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + float((paddle.stack(loss_list).sum() / len(loss_list)).numpy()), + optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + print(log_str) + log_list.append(log_str) + loss_list = [] + else: + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "loss: {4:.15f}, lr: {5:.10f}, speed: {6:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + float(local_loss.numpy()), + optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + print(log_str) + log_list.append(log_str) + log_loss = t_loss + tic_train = time.time() + if global_step % args.save_steps == 0: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + config_to_save = copy.deepcopy(model_to_save.discriminator.electra.config) + config_to_save.to_json_file(os.path.join(output_dir, "model_config.json")) + run_states = { + "model_name": model_name if args.init_from_ckpt else args.model_name_or_path, + "global_step": global_step, + "epoch": epoch, + "step": step, + } + with open(os.path.join(output_dir, "run_states.json"), "w") as f: + json.dump(run_states, f) + paddle.save(model.state_dict(), os.path.join(output_dir, "model_state.pdparams")) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if len(log_list) > 0: + with open(os.path.join(output_dir, "train.log"), "w") as f: + for log in log_list: + if len(log.strip()) > 0: + f.write(log.strip() + "\n") + if global_step >= num_training_steps: + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + 
print_arguments(args) + n_gpu = len(os.getenv("CUDA_VISIBLE_DEVICES", "").split(",")) + if args.device in "gpu" and n_gpu > 1: + paddle.distributed.spawn(do_train, args=(args,), nprocs=n_gpu) + else: + do_train(args) diff --git a/model_zoo/ernie-1.0/README.md b/model_zoo/ernie-1.0/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b025325ccb7f9c3b4d39180097d974731d4d6440 --- /dev/null +++ b/model_zoo/ernie-1.0/README.md @@ -0,0 +1,671 @@ +# ERNIE: Enhanced Representation through kNowledge IntEgration + +**目录** +- [1. 模型简介](#模型简介) + - [1.1 目录结构](#目录结构) + - [1.1 环境依赖](#环境依赖) +- [2. 中文预训练](#中文预训练) + - [2.1 小规模语料预训练: 14GB - CLUECorpusSmall](#CLUECorpusSmall) + - [2.2 大规模语料预训练: 400GB - CLUE & WuDao](#ERNIE-CW) + - [2.3 预训练模型贡献](#预训练模型贡献) +- [3. 下游任务微调](#下游任务微调) + - [3.1 序列分类](#序列分类) + - [3.2 Token分类](#序列分类) + - [3.3 阅读理解](#阅读理解) +- [4. 预测部署](#预测部署) +- [5. 参考文献](#参考文献) + + + + + +## 1. 模型简介 + +ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框架,它将大数据预训练与多源丰富知识相结合,通过持续学习技术,不断吸收海量文本数据中词汇、结构、语义等方面的知识,实现模型效果不断进化。 + +ERNIE在情感分析、文本匹配、自然语言推理、词法分析、阅读理解、智能问答等16个公开数据集上全面显著超越世界领先技术,在国际权威的通用语言理解评估基准GLUE上,得分首次突破90分,获得全球第一。 +相关创新成果也被国际顶级学术会议AAAI、IJCAI收录。 +同时,ERNIE在工业界得到了大规模应用,如搜索引擎、新闻推荐、广告系统、语音交互、智能客服等。 + +ERNIE 通过建模海量数据中的词、实体及实体关系,学习真实世界的语义知识。相较于 BERT 学习原始语言信号,ERNIE 直接对先验语义知识单元进行建模,增强了模型语义表示能力。 + +这里我们举个例子: +``` +Learnt by BERT :哈 [mask] 滨是 [mask] 龙江的省会,[mask] 际冰 [mask] 文化名城。 +Learnt by ERNIE:[mask] [mask] [mask] 是黑龙江的省会,国际 [mask] [mask] 文化名城。 +``` +在 BERT 模型中,我们通过『哈』与『滨』的局部共现,即可判断出『尔』字,模型没有学习与『哈尔滨』相关的任何知识。而 ERNIE 通过学习词与实体的表达,使模型能够建模出『哈尔滨』与『黑龙江』的关系,学到『哈尔滨』是 『黑龙江』的省会以及『哈尔滨』是个冰雪城市。 + + + +**项目特色** +- **中文预训练** + - 提供了完整中文预训练流程,从词表构造、数据处理、任务训练,到下游任务。 + - 提供中文Whole Word Mask,支持文本动态Mask。 +- **数据流程**, + - 数据预处理流程高效,40分钟即可完成14G ERNIE数据制作。 + - 数据稳定可复现,多数据集即插即用。 +- **分布式训练**, + - 支持多机多卡,支持混合精度、重计算、梯度累积等功能。 + + + +### 1.1 目录结构 + +整体的目录结构如下: + +```shell +./ +├── args.py 训练配置参数文件 +├── converter 静态图参数转换为动态图的脚本 +│   └── params_static_to_dygraph.py +├── finetune 下游任务finetune脚本 +│   ├── config.yml 训练参数配置文件 +│   ├── question_answering.py 阅读理解任务预处理代码 +│   ├── sequence_classification.py 序列分类任务预处理代码 +│   ├── token_classification.py TOKEN分类任务预处理代码 +│   ├── README.md 说明文档 +│   ├── run_ner.py 命名实体识别任务运行脚本 +│   ├── run_qa.py 阅读理解任务运行脚本 +│   ├── run_seq_cls.py 序列分类任务运行脚本 +│   └── utils.py +├── README.md 说明文档 +├── pretraining_introduction.md 中文预训练详细介绍文档 +├── preprocess +│   ├── docs 部分数据制作文档,包括CLUECorpusSmall,WuDaoCorpusBase +│   ├─ xxx.py 文件处理的python脚本 +│ └──README.md PaddleNLP 预训练数据流程 +├── vocab 全中文字符词表制作教程 +├── run_gb512_s1m.sh 训练启动shell脚本,batch size 512. max steps 100w +├── run_gb512_s1m_static.sh +├── run_gb512_s1m_trainer.sh +├── run_pretrain.py 训练启动python脚本 +├── run_pretrain_static.py +└── run_pretrain_trainer.py +``` + + + +### 1.2 环境依赖 + +- tool_helpers +- visualdl +- pybind11 + +安装命令 `pip install visualdl pybind11 tool_helpers` + + + +## 2. 中文预训练 + +ERNIE预训练采用的是MLM(Mask Language Model)的训练方式,采用WWM(Whole Word Mask)方式,对于完整语义单元的Token,会同时进行Mask。整体的训练损失loss是mlm_loss + sop_loss。 + +ERNIE 中文预训练更详细的介绍文档请可以参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + + +本样例为用户提供了高效的训练流程, +- **支持动态文本mask**: 用户可以根据自己的需求,灵活修改mask方式。具体可以参考修改`data_tools/dataset_utils.py`中`create_masked_lm_predictions`函数。 +- **支持自动断点训练重启恢复**。 用户可以设置`checkpoint_steps`,间隔`checkpoint_steps`数,即保留最新的checkpoint到`model_last`文件夹。重启训练时,程序默认从最新checkpoint重启训练,学习率、数据集都可以恢复到checkpoint时候的状态。 + + + + +### 2.1 小规模语料预训练: 14GB - CLUECorpusSmall +下面是使用CLUECorpusSmall 14G文本进行预训练的流程: + +
+CLUECorpusSmall 数据准备 + +#### 数据准备 +数据下载部分请参考[preprocess](./preprocess)目录,根据文档中`CLUECorpusSmall 数据集处理教程`,下载数据。下载好后: + +解压文件 +```shell +unzip comment2019zh_corpus.zip -d clue_corpus_small_14g/comment2019zh_corpus +unzip news2016zh_corpus.zip -d clue_corpus_small_14g/news2016zh_corpus +unzip webText2019zh_corpus.zip -d clue_corpus_small_14g/webText2019zh_corpus +unzip wiki2019zh_corpus.zip -d clue_corpus_small_14g/wiki2019zh_corpus +``` +将txt文件转换为jsonl格式 +``` +python preprocess/trans_to_json.py --input_path ./clue_corpus_small_14g --output_path clue_corpus_small_14g.jsonl +``` +现在我们得到了jsonl格式的数据集,下面是针对训练任务的数据集应用,此处以ernie为例。 +``` +python -u preprocess/create_pretraining_data.py \ + --model_name ernie-1.0-base-zh \ + --tokenizer_name ErnieTokenizer \ + --input_path clue_corpus_small_14g.jsonl \ + --split_sentences \ + --data_impl mmap \ + --chinese \ + --cn_whole_word_segment \ + --cn_seg_func jieba \ + --output_prefix clue_corpus_small_14g_20220104 \ + --workers 48 \ + --log_interval 10000 +``` +数据共有文档`15702702`条左右,由于分词比较耗时,大概一小时左右可以完成。在当前目录下产出训练所需数据。 +``` +clue_corpus_small_14g_20220104.bin +clue_corpus_small_14g_20220104.idx +``` + +
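+
+数据制作完成后,可以用下面的最小示意脚本快速检查分词行为是否符合预期(假设使用内置的 `ernie-1.0-base-zh` 词表,与上面 `create_pretraining_data.py` 的 `--model_name ernie-1.0-base-zh` 保持一致;该脚本仅用于检查,不属于数据制作流程):
+
+```python
+from paddlenlp.transformers import ErnieTokenizer
+
+# 与数据制作时相同的 tokenizer,保证得到的 id 与 .bin/.idx 中的一致
+tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0-base-zh")
+
+sample = "欢迎使用飞桨 PaddleNLP。"
+encoded = tokenizer(sample, max_seq_len=128)
+print(encoded["input_ids"])
+print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
+```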
+ + +
+CLUECorpusSmall 开始训练 + + +#### 开始训练 + +将制作好的数据`clue_corpus_small_14g_20220104.bin,clue_corpus_small_14g_20220104.idx`移动到input_dir中,即可开始训练。 +这里以8卡GPU训练为例任务脚本为例: +``` +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/ernie-1.0-dp8-gb512/log" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/ernie-1.0-dp8-gb512" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --micro_batch_size 64 \ + --use_amp true \ + --fp16_opt_level O2 \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --checkpoint_steps 5000 \ + --decay_steps 990000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20 \ + --num_workers 2 \ + --eval_freq 1000 \ + --device "gpu" \ + --share_folder false \ +``` + +使用8卡MLU训练示例: +``` +python -u -m paddle.distributed.launch \ + --mlus "0,1,2,3,4,5,6,7" \ + --log_dir "output/ernie-1.0-dp8-gb512/log" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/ernie-1.0-dp8-gb512" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --micro_batch_size 64 \ + --use_amp true \ + --fp16_opt_level O2 \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --checkpoint_steps 5000 \ + --decay_steps 990000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20 \ + --num_workers 2 \ + --eval_freq 1000 \ + --device "mlu" \ + --share_folder false \ +``` + +其中参数释义如下: +- `model_name_or_path` 要训练的模型或者之前训练的checkpoint。 +- `tokenizer_name_or_path` 模型词表文件所在的文件夹,或者PaddleNLP内置tokenizer的名字。 +- `continue_training` 默认false,模型从随机初始化,开始训练。如果为True,从已有的预训练权重加载,开始训练。如果为True, 训练初始loss 为2.x 是正常loss,如果未False,随机初始化,初始loss一般为10+。 +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `data_impl` 指定输入文件数据制作类型,默认为`mmap`,可指定`mmap`或`lazy`,`mmap`格式在读入数据时会建立内存映射,`lazy`格式在读入数据时直接从文件读取。 +- `output_dir` 指定输出文件。 +- `split` 划分数据集为train、valid、test的比例。整个数据集会按照这个比例划分数据。默认1/1000的数据为test,当样本数太少时,请修改此比例。 +- `max_seq_len` 输入文本序列的长度。 +- `micro_batch_size` 单卡batch size大小,比如此处单卡bs=64, 采用8卡训练`global_batch_size=64*8=512`。 +- `use_amp` 开启混合精度策略。 +- `fp16_opt_level` 混合精度策略,支持O1 自动混合精度,O2 pure fp16精度训练。 +- `max_lr` 训练学习率。 +- `min_lr` 学习率衰减的最小值。 +- `max_steps` 最大训练步数。 +- `save_steps` 保存模型间隔。默认保存地址格式为`output_dir/model_50000`(5w 步时的权重)。 +- `checkpoint_steps` 模型checkpoint间隔,用于模型断点重启训练。默认地址为`output_dir/model_last`. +- `weight_decay` 权重衰减参数。 +- `warmup_rate` 学习率warmup参数。 +- `grad_clip` 梯度裁剪范围。 +- `logging_freq` 日志输出间隔。 +- `num_workers` DataLoader采样进程,当数据输入为瓶颈时,可尝试提高采样进程数目。 +- `eval_freq` 模型评估间隔。 +- `device` 训练设备,默认为GPU。 +- `share_folder` 多机训练时,如果多机input_dir为挂载的同一个nfs网络位置,可以开启次选项,多机共享同一份数据。 + + +注: +- 训练支持断点重启,直接启动即可,程序会找到最新的checkpoint(`output_dir/model_last`),开始重启训练。请确保重启的训练配置与之前相同。 +- visualdl的日志在 `./output/ernie-1.0-dp8-gb512/train_log/xxx` 中。 +
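+
+为便于核对上面的配置,这里给出一个简单的估算示意(仅为粗略估算,实际 token 数取决于样本真实长度):
+
+```python
+# 按上面 8 卡、micro_batch_size=64、max_seq_len=512、max_steps=100w 的配置估算
+num_cards = 8
+micro_batch_size = 64
+max_seq_len = 512
+max_steps = 1_000_000
+
+global_batch_size = num_cards * micro_batch_size   # 512
+tokens_per_step = global_batch_size * max_seq_len  # 262,144(按满长估算的上限)
+total_tokens = tokens_per_step * max_steps          # 约 2.6e11
+print(global_batch_size, tokens_per_step, f"{total_tokens:.2e}")
+```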
+ + + +
+
CLUECorpusSmall 数据集训练效果

+#### CLUECorpusSmall 数据集训练效果
+
+使用上面制作好的 clue_corpus_small_14g 训练数据集,配合本训练脚本,batch_size=512, max_steps=100w,[详细训练日志](https://www.paddlepaddle.org.cn/paddle/visualdl/service/app/index?id=3fddf650db14b9319f9dc3a91dfe4ac6)。
+
+最终训练loss结果如下(训练/验证 loss 曲线图可在上面的 VisualDL 训练日志链接中查看):
+
+|Loss | Train | Validation |
+|-|-|-|
+|loss |2.59 | 2.48 |
+|lm_loss|2.48 | 2.38 |
+|sop_loss|0.11 | 0.10 |
+
+训练集 lm_loss=2.48 左右, 验证集 lm_loss=2.38 左右。
+
+使用训练好的模型参数,在下游任务中进行finetune。这里报告部分数据集上的finetune结果:
+
+CLUE评估结果:
+
+Model | Arch | CLUE AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL
+-- | -- | -- | -- | -- | -- | -- | -- | -- | --
+Metrics |   |   | Acc | Acc | Acc | Acc | Acc | Acc | Acc
+ERNIE-1.0 Base | 12L768H | 73.78 | 74.95 | 58.73 | 61.37 | 81.77 | 75.46 | 81.25 | 82.93
+ERNIE-1.0-cluecorpussmall | 12L768H | 73.24(-0.54) | 74.26 | 57.24 | 60.79 | 81.15 | 76.64 | 81.25 | 81.33
+
+注:
+- `ERNIE-1.0 Base`为官方预训练参数,采用的训练配置是batch_size=1024、steps=100w;
+- `ERNIE-1.0-cluecorpussmall`为本文档配置训练的复现版本,采用的是batch_size=512、steps=100w。
+
+ + + +### 2.2 大规模语料预训练: 400GB - CLUE & WuDao + +PaddleNLP致力于预训练开源工作,使用开源中文语料CLUE、WuDao 总共400GB,提供大规模语料训练教程,让用户可以从零开始构建,基于大规模语料,训练预训练模型。 + +[ERNIE 中文预训练介绍](./pretraining_introduction.md),从数据下载,词表制作,数据转化,模型训练,所有流程,完全开源开放,可复现。 +并训练发布开源最优的模型参数。 + +#### 数据准备 + +数据下载,数据转化部分,请参见[数据预处理文档](./preprocess/README.md), +- [CLUECorpus2020数据处理](./preprocess/docs/CLUECorpus2020.md) +- [WuDaoCorpusBase数据处理](./preprocess/docs/WuDaoCorpusBase.md) + +如果需要定制化词表,词表制作部分请参考[词表制作](./vocab/README.md)。 + + +#### 训练脚本 + +训练脚本如下 + +**环境配置** + +- PYTHONPATH 设置为当前目录(适合paddlenlp develop运行) +- 设置了一些FLAGS,包括增强报错,动态图Flag,提高矩阵乘法精度。 +- 多机情况下,可以设置`NCCL_SOCKET_IFNAME`指明NCCL使用的通信网口。 + +
+环境配置脚本 + +```shell +set -x + +# cd PaddleNLP/model_zoo/ernie-1.0 +export PYTHONPATH=$PYTHONPATH:../../ + +export FLAGS_call_stack_level=2 +# export NCCL_SOCKET_IFNAME=xgbe0 +export FLAGS_gemm_use_half_precision_compute_type=False +export FLAGS_enable_eager_mode=1 +unset CUDA_VISIBLE_DEVICES +``` +
+ +**路径配置** + +- 主要配置输入输出目录 +- 这里的`vocab_dir`如果没有使用自定义词表的话,请设置为内置的tokenizer,如`ernie-1.0-base-zh,ernie-3.0-base-zh`等。 +- 这里的 `data_dir` 设置多份数据集,用户不使用多份数据集的话,直接`data_dir="./data"`即可。 + +
+路径配置 + +```shell +trainer_id=${PADDLE_TRAINER_ID:-"0"} +task_name="0809-ernie-1.0-base-cw-dp16-gb1024" + +base_nfs="/path/to/your/nfs/mount/point" +base_dir="${base_nfs}/ernie-cw/output/${task_name}" +data_dir="5.0 ${base_nfs}/clue_oscar/clue_corpus_oscar_0630 7.0 ${base_nfs}/clue_train/clue_corpus_train_0629 12.0 ${base_nfs}/wudao_200g/wudao_200g_0703" +vocab_dir="${base_nfs}/" +``` +
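+
+其中 `data_dir` 以"权重1 前缀1 权重2 前缀2 ..."的形式配置多份数据集,训练时会按归一化后的权重混合采样(对应 `data_tools/dataset_utils.py` 中 `get_datasets_weights_and_num_samples` 与 `BlendableDataset` 的逻辑)。下面用一个简化的 Python 片段示意该字符串的解析方式,仅作说明用途:
+
+```python
+# 示意:解析 "权重1 前缀1 权重2 前缀2 ..." 形式的多数据集配置
+def parse_blend_config(data_dir):
+    fields = data_dir.split()
+    assert len(fields) % 2 == 0, "权重与数据前缀必须成对出现"
+    weights = [float(w) for w in fields[0::2]]
+    prefixes = fields[1::2]
+    total = sum(weights)
+    return [(p, w / total) for p, w in zip(prefixes, weights)]
+
+# 例如 "5.0 a 7.0 b 12.0 c" -> a 约占 20.8%,b 约占 29.2%,c 占 50%
+for prefix, ratio in parse_blend_config("5.0 data_a/prefix 7.0 data_b/prefix 12.0 data_c/prefix"):
+    print(prefix, round(ratio, 3))
+```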
+ +**启动训练**: + +对于`ernie-3.0-base-zh`我们提供了悟道的一个小规模样本的数据: +``` +mkdir data && cd data +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_200g_sample_ernie-3.0-base-zh_ids.npy +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_200g_sample_ernie-3.0-base-zh_idx.npz +cd - +``` +同时我们也提供了 `ernie-1.0-base-zh` 的悟道一个小规模样本的数据: +``` +https://paddlenlp.bj.bcebos.com/models/transformers/data_tools/wudao_200g_sample_ernie-1.0-base-zh_ids.npy +https://paddlenlp.bj.bcebos.com/models/transformers/data_tools/wudao_200g_sample_ernie-1.0-base-zh_idx.npz +``` + +可以指定`tokenizer_name_or_path=ernie-3.0-bash-zh`,`input_dir=./data` 用下面的脚本训练。 + +这里启动的是单机8卡任务,整体全局的batch_size 512 (64*8)。如果指定ips参数,进行多机运行,如 `python3 -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" --ips 192.168.1.101,192.168.1.101 ` +```shell +python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "${base_dir}/log_${trainer_id}" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-3.0-base-zh" \ + --tokenizer_name_or_path "${vocab_dir}" \ + --input_dir "${data_dir}" \ + --output_dir "${base_dir}" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --binary_head true \ + --micro_batch_size 64 \ + --use_amp true \ + --fp16_opt_level "O1" \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 4000000 \ + --save_steps 100000 \ + --checkpoint_steps 5000 \ + --decay_steps 3900000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20 \ + --num_workers 3 \ + --eval_freq 1000 \ + --device "gpu"\ + --share_folder true \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --seed 1234 \ +``` + + +其中参数释义如下: +- `model_name_or_path` 要训练的模型或者之前训练的checkpoint。 +- `tokenizer_name_or_path` 模型词表文件所在的文件夹(对于ernie,词表文件名一般命名为vocab.txt),或者PaddleNLP内置tokenizer的名字。 +- `continue_training` 默认false,模型从随机初始化,开始训练。如果为True,从已有的预训练权重加载,开始训练。如果为True, 训练初始loss 为2.x 是正常loss,如果未False,随机初始化,初始loss一般为10+。 +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `output_dir` 指定输出文件。 +- `split` 划分数据集为train、valid、test的比例。整个数据集会按照这个比例划分数据。默认`split=949,50,1`, 使用1/1000的数据为test,当样本数太少时,增大测试的样本数目。 +- `max_seq_len` 输入文本序列的长度,默认值`512`。 +- `binary_head` 是否使用SOP(Sentences Order Predicet) loss,默认为 True,使用此loss。如果用户句子语料很短,无法组合成句子对,请设置此参数为`false`。 +- `micro_batch_size` 单卡batch size大小,比如此处单卡bs=64, 采用8卡训练`global_batch_size=64*8=512`。 +- `use_amp` 开启混合精度策略。 +- `fp16_opt_level` 混合精度策略,支持O1 自动混合精度,O2 pure fp16精度训练。 +- `max_lr` 训练学习率。 +- `min_lr` 学习率衰减到最小值后,学习率将一直保持为`min_lr`。 +- `max_steps` 最大训练步数。训练不支持通过`epoch`控制,第一次制造数据index时候,日志会显示数据会被计算的epoch数,请注意查看。 +- `save_steps` 保存模型间隔。默认保存地址格式为`output_dir/model_50000`(5w 步时的权重)。 +- `checkpoint_steps` 模型checkpoint间隔,用于模型断点重启训练。默认地址为`output_dir/model_last`. +- `weight_decay` 权重衰减参数。 +- `warmup_rate` 学习率warmup参数。 +- `grad_clip` 梯度裁剪范围。 +- `logging_freq` 日志输出间隔。 +- `num_workers` DataLoader采样进程,当数据输入为瓶颈时,可尝试提高采样进程数目。 +- `eval_freq` 模型评估间隔。 +- `device` 训练设备,默认为GPU。 +- `share_folder` 多机训练时,如果多机`input_dir`为挂载的同一个nfs网络位置,可以开启次选项,多机共享同一份数据。(每次运行,会制作训练的index数据,如果为挂载的统一nfs位置,则一台机器制作数据即可,否则每台机器都需要制作) + + +

+ +

+ +接下来我们主要介绍训练流程部分的特性的简单介绍:详细参数配置介绍请参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + +- **训练网络配置方面:** + + 本小节主要针对,任务的损失函数、MASK参数等配置进行了简单介绍。 + - SOP Loss + - SOP (Sentence Order Predict) 损失,是 模型训练的常用损失。将文本中的句子顺序分为两段打乱,最后判断文本是否被打乱。可以通过设置`binary_head`开启或者关闭。 + - MASK + - MLM (Mask Language Model) 是通过随机将文本中的部分token,随机替换为`[MASK]` token,最后预测出真实的token值。ERNIE默认采用了Whole Word MASK方式,选定一些词语进行MASK。 + - *使用方法*: 用户可以设置 `masked_lm_prob` 控制mask的token占文本总token长度的比例。默认`masked_lm_prob=0.15` 随机mask 15% 的token数目。 + - Ngram MASK + - 项目还支持了n-gram mask策略,如下图所示,在 WWM 进行词语级别MASK的基础上(如此处mask掉的`[模型]`词组),n-gram 可以MASK掉连续n个词组。下面例子中,连续mask了2个词组,`【[语言][模型]】`同时进行了mask。 +

+ +

+ + - *使用方法*: 用户通过`max_ngrams`设置最大的`ngram`长度。默认`max_ngrams=3`。 + + - Dropout + - Dropout 是常用的防止过拟合策略。对于大规模数据集训练,如`ernie-3.0`系列4T文本语料,可以设置 `dropout=0`,不考虑过拟合。实际`ernie-3.0-base-zh`训练中,没有开启Dropout。 + +详细参数配置介绍请参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + + +- **训练速度方面** + + 我们支持了如下策略,加速计算过程,减小显存占用,扩大batch_size: + + - **多卡多机训练**: + - 基于飞桨Fleet分布式API,用户可以十分方便的通过数据并行的方法,将训练扩展到多机多卡。 + - **混合精度训练**: + - 部分算子使用FP16计算kernel,加速计算过程。支持AMP混合精度O1,和Pure FP16全FP训练策略O2。 + - **梯度累积训练**: + - 用户可以指定梯度累积的步数,在梯度累积的step中,减少多卡之间梯度的通信,减少更新的次数,可以扩大训练的batch_size. + - **重计算训练**: + - 通过重新计算前向的方式,减少前向网络中间变量的存储,可以显著减少显存占用, + +详细参数配置介绍请参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + + +- **训练数据流方面** + + 我们针对训练数据流扩展、混合、重启等方面做了针对性优化提升 +

+ +

+ + - **多机扩展** + - 用户可以将数据放置到 NFS 服务器上,多机同时挂载数据即可。训练数据与计算资源分离。 + - **多数据混合** + - 训练数据集支持多个文件,即插即用,设置权重,传入参数即可`input_dir="1.0 dateset_a/prefix 2.0 dataset_b/prefix"` + - **稳定可复现** + - MLM任务具有一定随机性,需要随机mask数据。本数据流通过固定每一个step数据的随机种子,实验数据流稳定可复现。 + - **快加载** + - 数据文件使用mmap读取,加载数百GB文件几乎不耗时。 + - **断点重启** + - 用户可以单独设置,checkpoints steps 参数可设置较小,重启训练默认加载最新checkpoint。 + - 断点数据自动恢复,学习率等参数也自动恢复。 + +详细参数配置介绍请参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + +- **观察评估方面** + + - **可视化日志记录** + - 日志展示为全局loss,波动小。 + - 记录混合精度,loss_scaling等信息,方便用户debug。 + - 对模型结构,配置参数,paddle版本信息进行记录,方便复现环境 + - **下游任务评估**:CLUE Benchmark搜索评估参数效果 + - 使用[批量启动-grid-search](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/benchmark/clue#%E6%89%B9%E9%87%8F%E5%90%AF%E5%8A%A8-grid-search),可以进行批量搜索任务 + - 注意,这里使用的是训练中的checkpoint进行评估,可以直接试着 评估待评估的参数为,所在的路径地址,即如 `python grid_seach.py output/ernie-base-outdir/model_100000` 之类的checkpoint地址。 + +详细介绍请参见[ERNIE 中文预训练介绍](./pretraining_introduction.md)。 + + +- **训练效果方面** + + 我们release了base、large两个模型。均取得了较好的预训练效果。 + + - **ERNIE 1.0-Base-zh-cw** 模型: + - 使用CLUE,WuDao共计400GB的语料,batch_size 1024, 训练 400w step,即可训练得到`ernie-3.0-base-zh`类似的模型效果。相关模型参数,开源为`ernie-1.0-base-zh-cw`,用户加载即可使用。使用CLUE benchmark 对最优超参数进行GradSearch搜索: + +Model                                  | Arch | CLUE AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUE WSC2020 | CSL | CMRC | CHID | C3 +-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | + Metrics |   |   | Acc | Acc | Acc | Acc | Acc | Acc | Acc | Exact/F1| Acc| Acc | Acc +ERNIE 1.0-Base-zh-cw | 12L768H | 76.47 | 76.07 | 57.86 | 59.91 | 83.41 | 79.91 | 89.91 | 83.42 | 72.88/90.78 | 84.68 | 76.98 | +ERNIE 2.0-Base-zh | 12L768H | 74.95 | 76.25 | 58.53 | 61.72 | 83.07 | 78.81 | 84.21 | 82.77 | 68.22/88.71 | 82.78 | 73.19 +ERNIE 1.0-Base-zh | 12L768H | 74.17 | 74.84 | 58.91 | 62.25 | 81.68 | 76.58 | 85.20 | 82.77 | 67.32/87.83 | 82.47 | 69.68 +- + - **ERNIE 1.0-Large-zh-cw** 模型: + + - 除了base模型外,我们还训练了放出了large模型。此模型参数采用的是词表与ernie-1.0相同,因此命名为`ernie-1.0-large-zh-cw`。使用开源语料,batch_size 512, 训练 400w step,训练去除SOP任务,只保留MLM损失: + +Model                                    | Arch | CLUE AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUE WSC2020 | CSL | CMRC | CHID | C3 +-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | +Metrics |   |   | Acc | Acc | Acc | Acc | Acc | Acc | Acc | Exact/F1 | Acc| Acc +ERNIE 1.0-Large-zh-cw | 24L1024H | 79.03 | 75.97 | 59.65 | 62.91 | 85.09 | 81.73| 93.09 | 84.53 | 74.22/91.88 | 88.57 | 84.54 +ERNIE 3.0-Xbase-zh| 20L1024H | 78.71 | 76.85 | 59.89 | 62.41 | 84.76 | 82.51 | 89.80 | 84.47 | 75.49/92.67 | 86.36 | 84.59 +RoBERTa-wwm-ext-large | 24L1024H | 76.61 | 76.00 | 59.33 | 62.02 | 83.88 | 78.81 | 90.79 | 83.67 | 70.58/89.82 | 85.72 | 75.26 + + + + +### 预训练模型贡献 +PaddleNLP为开发者提供了[community](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/community/contribute_models/contribute_awesome_pretrained_models.rst)模块,用户可以上传自己训练的模型,开源给其他用户使用。 +使用本文档给出的参数配置,在CLUECorpusSmall数据集上训练,可以得到`zhui/ernie-1.0-cluecorpussmall`参数,可直接使用。 +```python +model = AutoModelForMaskedLM.from_pretrained('zhui/ernie-1.0-cluecorpussmall') +``` + +贡献预训练模型的方法,可以参考[贡献预训练模型权重](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/community/contribute_models/contribute_awesome_pretrained_models.rst)教程。 + + + +## 3. 下游任务微调 + +使用训练中产出的checkpoint,或者paddlenlp内置的模型权重,使用本脚本,用户可以快速对当前模型效果进行评估。 + +### 运行示例 +本文档适配了三大主流下游任务,用户可以根据自己的需求,评估自己所需的数据集。 + + + +1. 
序列分类 +```shell +cd finetune +dataset="chnsenticorp_v2" +python run_seq_cls.py \ + --do_train \ + --do_eval \ + --do_predict \ + --model_name_or_path ernie-1.0-base-zh \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + + + +2. Token分类 +```shell +cd finetune +dataset="peoples_daily_ner" +python run_ner.py \ + --do_train \ + --do_eval \ + --do_predict \ + --model_name_or_path ernie-1.0-base-zh \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + + + +3. 阅读理解 +```shell +cd finetune +dataset="cmrc2018" +python run_qa.py \ + --do_train \ + --do_eval \ + --model_name_or_path ernie-1.0-base-zh \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + + + + +## 4. 预测部署 +以中文文本情感分类问题为例,介绍一下从模型finetune到部署的过程。 + +与之前的finetune参数配置稍有区别,此处加入了一些配置选项。 + +- do_export,开启模型导出功能 +- eval_steps/save_steps 评估和保存的step间隔 +- metric_for_best_model 模型效果的比较指标。(次选项生效,需要save_steps为eval_steps的倍数) +- save_total_limit 最多保存的ckpt个数。(超过限制数据时,效果更差,或更旧的ckpt将被删除) + +```shell +cd finetune +# 开始finetune训练并导出模型 +dataset="chnsenticorp_v2" +python run_seq_cls.py \ + --do_train \ + --do_eval \ + --do_predict \ + --do_export \ + --model_name_or_path ernie-1.0-base-zh \ + --dataset $dataset \ + --output_dir ./tmp/$dataset \ + --eval_steps 200 \ + --save_steps 200 \ + --metric_for_best_model "eval_accuracy" \ + --load_best_model_at_end \ + --save_total_limit 3 \ + +``` +训练完导出模型之后,可以用于部署,`deploy/seq_cls_infer.py`文件提供了python部署预测示例。可执行以下命令运行部署示例: + +```shell +python deploy/seq_cls_infer.py --model_dir tmp/chnsenticorp_v2/export/ --device cpu --backend paddle +``` + +运行后预测结果打印如下: +```text +[2023-03-01 08:25:31,352] [ INFO] - We are using to load '../tmp/chnsenticorp_v2/export/'. +WARNING: Logging before InitGoogleLogging() is written to STDERR +W0301 08:25:37.617117 58742 analysis_config.cc:958] It is detected that mkldnn and memory_optimize_pass are enabled at the same time, but they are not supported yet. Currently, memory_optimize_pass is explicitly disabled +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::CPU. +Batch id: 0, example id: 0, sentence: 这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般, label: negative, negative prob: 0.9999, positive prob: 0.0001. +Batch id: 1, example id: 0, sentence: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片!开始还怀疑是不是赠送的个别现象,可是后来发现每张DVD后面都有!真不知道生产商怎么想的,我想看的是猫和老鼠,不是米老鼠!如果厂家是想赠送的话,那就全套米老鼠和唐老鸭都赠送,只在每张DVD后面添加一集算什么??简直是画蛇添足!!, label: negative, negative prob: 0.9998, positive prob: 0.0002. +Batch id: 2, example id: 0, sentence: 还稍微重了点,可能是硬盘大的原故,还要再轻半斤就好了。其他要进一步验证。贴的几种膜气泡较多,用不了多久就要更换了,屏幕膜稍好点,但比没有要强多了。建议配赠几张膜让用用户自己贴。, label: negative, negative prob: 0.9999, positive prob: 0.0001. +...... +``` + +更多关于部署的情况可以参考[ERNIE 1.0 模型 Python 部署示例](finetune/deploy/README.md)。 + + + +## 5. 参考文献 +- [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/pdf/1904.09223.pdf) diff --git a/model_zoo/ernie-1.0/args.py b/model_zoo/ernie-1.0/args.py new file mode 100644 index 0000000000000000000000000000000000000000..9af1706f12a0f9cac8f235b0c0b35fe67b874b5f --- /dev/null +++ b/model_zoo/ernie-1.0/args.py @@ -0,0 +1,112 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle + +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.utils.log import logger + + +def parse_args(MODEL_CLASSES): + parser = argparse.ArgumentParser() + # yapf: disable + parser.add_argument("--model_type", default=None, type=str, required=True, help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), ) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], [])),) + parser.add_argument("--tokenizer_name_or_path", default=None, type=str, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], [])),) + parser.add_argument("--continue_training", default=False, type=bool, help="Pre-training from existing paddlenlp model weights. Default Fasle and model will train from scratch. If set True, the model_name_or_path argument must exist in the paddlenlp models.") + + # Train I/O config + parser.add_argument("--input_dir", default=None, type=str, required=True, help="The input directory where the data will be read from.", ) + parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the training logs and checkpoints will be written.") + parser.add_argument("--split", type=str, default='949,50,1', help="Train/valid/test data split.") + parser.add_argument("--data_impl", type=str, default='mmap', help="mmap/lazy format converted from preprocessed data.") + parser.add_argument("--binary_head", type=strtobool, default=True, help="True for NSP task.") + parser.add_argument("--max_seq_len", type=int, default=1024, help="Max sequence length.") + parser.add_argument("--micro_batch_size", default=8, type=int, help="Batch size per device for one step training.", ) + parser.add_argument("--global_batch_size", default=None, type=int, help="Global batch size for all training process. None for not check the size is valid. If we only use data parallelism, it should be device_num * micro_batch_size.") + + # Default training config + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--grad_clip", default=0.0, type=float, help="Grad clip for the parameter.") + parser.add_argument("--max_lr", default=1e-5, type=float, help="The initial max learning rate for Adam.") + parser.add_argument("--min_lr", default=5e-5, type=float, help="The initial min learning rate for Adam.") + parser.add_argument("--warmup_rate", default=0.01, type=float, help="Linear warmup over warmup_steps for learing rate.") + + # Adam optimizer config + parser.add_argument("--adam_beta1", default=0.9, type=float, help="The beta1 for Adam optimizer. The exponential decay rate for the 1st moment estimates.") + parser.add_argument("--adam_beta2", default=0.999, type=float, help="The bate2 for Adam optimizer. 
The exponential decay rate for the 2nd moment estimates.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + + # Training steps config + parser.add_argument("--num_train_epochs", default=1, type=int, help="Total number of training epochs to perform.", ) + parser.add_argument("--max_steps", default=500000, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.") + parser.add_argument("--checkpoint_steps", type=int, default=500, help="Save checkpoint every X updates steps to the model_last folder.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--decay_steps", default=360000, type=int, help="The steps use to control the learing rate. If the step > decay_steps, will use the min_lr.") + parser.add_argument("--logging_freq", type=int, default=1, help="Log every X updates steps.") + parser.add_argument("--eval_freq", type=int, default=500, help="Evaluate for every X updates steps.") + parser.add_argument("--eval_iters", type=int, default=10, help="Evaluate the model use X steps data.") + + # Config for 4D Parallelism + parser.add_argument("--use_sharding", type=strtobool, nargs='?', const=False, help="Use sharding Parallelism to training.") + parser.add_argument("--sharding_degree", type=int, default=1, help="Sharding degree. Share the parameters to many cards.") + parser.add_argument("--dp_degree", type=int, default=1, help="Data Parallelism degree.") + parser.add_argument("--mp_degree", type=int, default=1, help="Model Parallelism degree. Spliting the linear layers to many cards.") + parser.add_argument("--pp_degree", type=int, default=1, help="Pipeline Parallelism degree. Spliting the model layers to different parts.") + parser.add_argument("--use_recompute", type=strtobool, nargs='?', const=False, help="Using the recompute to save the memory.") + + # AMP config + parser.add_argument("--use_amp", type=strtobool, nargs='?', const=False, help="Enable mixed precision training.") + parser.add_argument("--fp16_opt_level", type=str, default="O2", help="Mixed precision training optimization level.") + parser.add_argument("--enable_addto", type=strtobool, nargs='?', const=True, default=True, help="Whether to enable the addto strategy for gradient accumulation or not. This is only used for AMP training.") + parser.add_argument("--scale_loss", type=float, default=32768, help="The value of scale_loss for fp16. 
This is only used for AMP training.") + parser.add_argument("--hidden_dropout_prob", type=float, default=0.1, help="The hidden dropout prob.") + parser.add_argument("--attention_probs_dropout_prob", type=float, default=0.1, help="The attention probs dropout prob.") + + # Other config + parser.add_argument("--seed", type=int, default=1234, help="Random seed for initialization.") + parser.add_argument("--num_workers", type=int, default=2, help="Num of workers for DataLoader.") + parser.add_argument("--check_accuracy", type=strtobool, nargs='?', const=False, help="Check accuracy for training process.") + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu", "xpu", "mlu", "npu"], help="select cpu, gpu, xpu, npu devices.") + parser.add_argument("--lr_decay_style", type=str, default="cosine", choices=["cosine", "none"], help="Learning rate decay style.") + parser.add_argument("--share_folder", type=strtobool, nargs='?', const=False, help="Use share folder for data dir and output dir on multi machine.") + + # Argument for bert/ernie + parser.add_argument("--masked_lm_prob", type=float, default=0.15, help="Mask token prob.") + parser.add_argument("--short_seq_prob", type=float, default=0.1, help="Short sequence prob.") + parser.add_argument("--favor_longer_ngram", type=strtobool, default=False, help="Short sequence prob.") + parser.add_argument("--max_ngrams", type=int, default=3, help="Short sequence prob.") + + # yapf: enable + + args = parser.parse_args() + + if args.tokenizer_name_or_path is None: + args.tokenizer_name_or_path = args.model_name_or_path + args.test_iters = args.eval_iters * 10 + + if args.check_accuracy: + if args.hidden_dropout_prob != 0: + args.hidden_dropout_prob = 0.0 + logger.warning("The hidden_dropout_prob should set to 0 for accuracy checking.") + if args.attention_probs_dropout_prob != 0: + args.attention_probs_dropout_prob = 0.0 + logger.warning("The attention_probs_dropout_prob should set to 0 for accuracy checking.") + if args.dp_degree * args.mp_degree * args.pp_degree * args.sharding_degree == 1: + if paddle.distributed.get_world_size() > 1: + args.dp_degree = paddle.distributed.get_world_size() + + return args diff --git a/model_zoo/ernie-1.0/converter/params_static_to_dygraph.py b/model_zoo/ernie-1.0/converter/params_static_to_dygraph.py new file mode 100644 index 0000000000000000000000000000000000000000..c86cc1fea0180e5627322cf8309ed5a5d8514533 --- /dev/null +++ b/model_zoo/ernie-1.0/converter/params_static_to_dygraph.py @@ -0,0 +1,51 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
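+
+# Converts a static-graph checkpoint (program state) saved by the static
+# pretraining scripts into dygraph .pdparams weights for the specified
+# PaddleNLP model (see the --model / --path / --output_path arguments below).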
+ +import argparse +import paddle +from paddlenlp.transformers import AutoModelForPretraining +from paddlenlp.utils.log import logger + +paddle.set_device("cpu") +parser = argparse.ArgumentParser() +parser.add_argument("--model", type=str, help="The name of pretrained weights in PaddleNLP.") +parser.add_argument("--path", type=str, help="The path of checkpoint to be loaded.") +parser.add_argument("--output_path", type=str, default=None, help="The path of checkpoint to be loaded.") +args = parser.parse_args() + + +def init_dygraph_with_static(model, static_params_path): + from paddlenlp.utils.tools import static_params_to_dygraph + + static_tensor_dict = paddle.static.load_program_state(static_params_path) + return static_params_to_dygraph(model, static_tensor_dict) + + +def main(args): + logger.info("Loading model: %s" % args.model) + model = AutoModelForPretraining.from_pretrained(args.model) + logger.info("Loading static params and trans paramters...") + model_dict = init_dygraph_with_static(model, args.path) + save_name = args.output_path + if save_name is None: + save_name = args.model + "_converted.pdparams" + if not save_name.endswith(".pdparams"): + save_name += ".pdparams" + logger.info("Saving converted params to %s" % save_name) + paddle.save(model_dict, save_name) + logger.info("New pdparams saved!") + + +if __name__ == "__main__": + main(args) diff --git a/model_zoo/ernie-1.0/data_tools/Makefile b/model_zoo/ernie-1.0/data_tools/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..e29b78584d96c405fa75f44d71567d3676e8700a --- /dev/null +++ b/model_zoo/ernie-1.0/data_tools/Makefile @@ -0,0 +1,10 @@ +CXXFLAGS += -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color +CPPFLAGS += $(shell python3 -m pybind11 --includes) +CPPFLAGS += $(shell python3-config --includes) +LIBNAME = helpers +LIBEXT = $(shell python3-config --extension-suffix) + +default: $(LIBNAME)$(LIBEXT) + +%$(LIBEXT): %.cpp + $(CXX) $(CXXFLAGS) $(CPPFLAGS) $< -o $@ diff --git a/model_zoo/ernie-1.0/data_tools/dataset_utils.py b/model_zoo/ernie-1.0/data_tools/dataset_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..0e3d66f0f8b984be70a72a9bd7216cfb1669dc5a --- /dev/null +++ b/model_zoo/ernie-1.0/data_tools/dataset_utils.py @@ -0,0 +1,840 @@ +# coding=utf-8 + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The Google AI Language Team Authors, and NVIDIA. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import math +import os +import re +import time + +import numpy as np +import paddle + +from paddlenlp.data.indexed_dataset import get_indexed_dataset_ + +# Most of the code here has been copied from: +# https://github.com/google-research/albert/blob/master/create_pretraining_data.py +# with some modifications. 
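+#
+# This module provides the ERNIE pretraining data pipeline helpers: blending
+# multiple indexed datasets by weight (BlendableDataset), building sentence
+# pair segments for the SOP task, and creating whole-word / n-gram masked LM
+# predictions (create_masked_lm_predictions).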
+ + +def get_local_rank(): + return int(os.getenv("PADDLE_RANK_IN_NODE", 0)) + + +print_rank_0 = print +COMPILED = False +DSET_TYPE_BERT = "standard_bert" +DSET_TYPE_T5 = "t5" +DSET_TYPE_ERNIE = "ernie" +DSET_TYPES = [DSET_TYPE_BERT, DSET_TYPE_T5, DSET_TYPE_ERNIE] + + +def compile_helper(): + """Compile helper function ar runtime. Make sure this + is invoked on a single process.""" + import os + import subprocess + + path = os.path.abspath(os.path.dirname(__file__)) + ret = subprocess.run(["make", "-C", path]) + if ret.returncode != 0: + print("Making C++ dataset helpers module failed, exiting.") + import sys + + sys.exit(1) + + +class BlendableDataset(paddle.io.Dataset): + """ + The BlendableDataset is a wrapper which used to mix different dataset. + + The input is a list of dataset and corresponding weights for each dataset. + """ + + def __init__(self, datasets, weights): + + self.datasets = datasets + num_datasets = len(datasets) + assert num_datasets == len(weights) + + self.size = 0 + for dataset in self.datasets: + self.size += len(dataset) + + # Normalize weights. + weights = np.array(weights, dtype=np.float64) + sum_weights = np.sum(weights) + assert sum_weights > 0.0 + weights /= sum_weights + + # Build indecies. + start_time = time.time() + assert num_datasets < 255 + self.dataset_index = np.zeros(self.size, dtype=np.uint8) + self.dataset_sample_index = np.zeros(self.size, dtype=np.int64) + + # local_rank = 0 if fleet.local_rank() is None else int(fleet.local_rank( + # )) + local_rank = get_local_rank() + + while True: + try: + try: + from tool_helpers import helpers + except Exception: + print_rank_0(" > missing tool_helpers, pip install tool_helpers please, try to compile locally.") + import data_tools.helpers as helpers + break + except Exception: + if local_rank == 0: + compile_helper() + print_rank_0("> wait for hepers to be compiled!") + time.sleep(1) + + helpers.build_blending_indices( + self.dataset_index, self.dataset_sample_index, weights, num_datasets, self.size, local_rank == 0 + ) + print_rank_0( + "> elapsed time for building blendable dataset indices: " "{:.2f} (sec)".format(time.time() - start_time) + ) + + def __len__(self): + return self.size + + def __getitem__(self, idx): + dataset_idx = self.dataset_index[idx] + sample_idx = self.dataset_sample_index[idx] + return self.datasets[dataset_idx][sample_idx] + + +def get_datasets_weights_and_num_samples(data_prefix, train_valid_test_num_samples): + + # The data prefix should be in the format of: + # weight-1, data-prefix-1, weight-2, data-prefix-2, .. + assert len(data_prefix) % 2 == 0 + num_datasets = len(data_prefix) // 2 + weights = [0] * num_datasets + prefixes = [0] * num_datasets + for i in range(num_datasets): + weights[i] = float(data_prefix[2 * i]) + prefixes[i] = (data_prefix[2 * i + 1]).strip() + # Normalize weights + weight_sum = 0.0 + for weight in weights: + weight_sum += weight + assert weight_sum > 0.0 + weights = [weight / weight_sum for weight in weights] + + # Add 0.5% (the 1.005 factor) so in case the bleding dataset does + # not uniformly distribute the number of samples, we still have + # samples left to feed to the network. 
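+    # For example, with normalized weight 0.3 and train_valid_test_num_samples
+    # of [1000, 100, 100], this dataset contributes
+    # [ceil(1000 * 0.3 * 1.005), ...] = [302, 31, 31] samples.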
+ datasets_train_valid_test_num_samples = [] + for weight in weights: + datasets_train_valid_test_num_samples.append( + [int(math.ceil(val * weight * 1.005)) for val in train_valid_test_num_samples] + ) + + return prefixes, weights, datasets_train_valid_test_num_samples + + +def get_a_and_b_segments(sample, np_rng): + """Divide sample into a and b segments.""" + + # Number of sentences in the sample. + n_sentences = len(sample) + # Make sure we always have two sentences. + assert n_sentences > 1, "make sure each sample has at least two sentences." + + # First part: + # `a_end` is how many sentences go into the `A`. + a_end = 1 + if n_sentences >= 3: + # Note that randin in numpy is exclusive. + a_end = np_rng.randint(1, n_sentences) + tokens_a = [] + for j in range(a_end): + tokens_a.extend(sample[j]) + + # Second part: + tokens_b = [] + for j in range(a_end, n_sentences): + tokens_b.extend(sample[j]) + + # Random next: + is_next_random = False + if np_rng.random() < 0.5: + is_next_random = True + tokens_a, tokens_b = tokens_b, tokens_a + + return tokens_a, tokens_b, is_next_random + + +def truncate_segments(tokens_a, tokens_b, len_a, len_b, max_num_tokens, np_rng): + """Truncates a pair of sequences to a maximum sequence length.""" + # print(len_a, len_b, max_num_tokens) + assert len_a > 0 + if len_a + len_b <= max_num_tokens: + return False + while len_a + len_b > max_num_tokens: + if len_a > len_b: + len_a -= 1 + tokens = tokens_a + else: + len_b -= 1 + tokens = tokens_b + if np_rng.random() < 0.5: + del tokens[0] + else: + tokens.pop() + return True + + +def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id): + """Merge segments A and B, add [CLS] and [SEP] and build tokentypes.""" + + tokens = [] + tokentypes = [] + # [CLS]. + tokens.append(cls_id) + tokentypes.append(0) + # Segment A. + for token in tokens_a: + tokens.append(token) + tokentypes.append(0) + # [SEP]. + tokens.append(sep_id) + tokentypes.append(0) + # Segment B. + for token in tokens_b: + tokens.append(token) + tokentypes.append(1) + if tokens_b: + # [SEP]. + tokens.append(sep_id) + tokentypes.append(1) + + return tokens, tokentypes + + +MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"]) + + +def is_start_piece(piece): + """Check if the current word piece is the starting piece (BERT).""" + # When a word has been split into + # WordPieces, the first token does not have any marker and any subsequence + # tokens are prefixed with ##. So whenever we see the ## token, we + # append it to the previous set of word indexes. + return not piece.startswith("##") + + +def create_masked_lm_predictions( + tokens, + vocab_id_list, + vocab_id_to_token_dict, + masked_lm_prob, + cls_id, + sep_id, + mask_id, + max_predictions_per_seq, + np_rng, + max_ngrams=3, + vocab_token_to_id_dict=None, + do_whole_word_mask=True, + favor_longer_ngram=False, + do_permutation=False, + geometric_dist=False, + to_chinese_char=False, + inplace_random_mask=False, + masking_style="bert", +): + """Creates the predictions for the masked LM objective. + Note: Tokens here are vocab ids and not text tokens.""" + + cand_indexes = [] + # Note(mingdachen): We create a list for recording if the piece is + # the starting piece of current token, where 1 means true, so that + # on-the-fly whole word masking is possible. 
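+    # For example, if a word was split into the WordPieces ["un", "##aff", "##able"]
+    # at positions 5, 6, 7, cand_indexes groups them into a single entry [5, 6, 7],
+    # so the whole word is masked together when do_whole_word_mask is True.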
+ token_boundary = [0] * len(tokens) + + for (i, token) in enumerate(tokens): + if token == cls_id or token == sep_id: + token_boundary[i] = 1 + continue + # Whole Word Masking means that if we mask all of the wordpieces + # corresponding to an original word. + # + # Note that Whole Word Masking does *not* change the training code + # at all -- we still predict each WordPiece independently, softmaxed + # over the entire vocabulary. + vocab_id = vocab_id_to_token_dict[token] + if do_whole_word_mask and len(cand_indexes) >= 1 and not is_start_piece(vocab_id): + cand_indexes[-1].append(i) + else: + cand_indexes.append([i]) + if is_start_piece(vocab_id_to_token_dict[token]): + token_boundary[i] = 1 + + if to_chinese_char: + # set ## chinse char to original chinese char + char_tokens = [] + assert vocab_token_to_id_dict is not None + for i, b in enumerate(token_boundary): + if b == 0: + vocab_id = vocab_id_to_token_dict[tokens[i]] + new_vocab_id = vocab_id[2:] if len(re.findall("##[\u4E00-\u9FA5]", vocab_id)) > 0 else vocab_id + char_tokens.append( + vocab_token_to_id_dict[new_vocab_id] if new_vocab_id in vocab_token_to_id_dict else token + ) + else: + char_tokens.append(tokens[i]) + output_tokens = list(char_tokens) + else: + output_tokens = list(tokens) + + masked_lm_positions = [] + masked_lm_labels = [] + + if masked_lm_prob == 0: + return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary) + + num_to_predict = min(max_predictions_per_seq, max(1, int(round(len(tokens) * masked_lm_prob)))) + + ngrams = np.arange(1, max_ngrams + 1, dtype=np.int64) + if not geometric_dist: + # Note(mingdachen): + # By default, we set the probilities to favor shorter ngram sequences. + pvals = 1.0 / np.arange(1, max_ngrams + 1) + pvals /= pvals.sum(keepdims=True) + if favor_longer_ngram: + pvals = pvals[::-1] + + ngram_indexes = [] + for idx in range(len(cand_indexes)): + ngram_index = [] + for n in ngrams: + ngram_index.append(cand_indexes[idx : idx + n]) + ngram_indexes.append(ngram_index) + + np_rng.shuffle(ngram_indexes) + + (masked_lms, masked_spans) = ([], []) + covered_indexes = set() + backup_output_tokens = list(output_tokens) + for cand_index_set in ngram_indexes: + if len(masked_lms) >= num_to_predict: + break + if not cand_index_set: + continue + # Note(mingdachen): + # Skip current piece if they are covered in lm masking or previous ngrams. + for index_set in cand_index_set[0]: + for index in index_set: + if index in covered_indexes: + continue + + if not geometric_dist: + n = np_rng.choice( + ngrams[: len(cand_index_set)], + p=pvals[: len(cand_index_set)] / pvals[: len(cand_index_set)].sum(keepdims=True), + ) + else: + # Sampling "n" from the geometric distribution and clipping it to + # the max_ngrams. Using p=0.2 default from the SpanBERT paper + # https://arxiv.org/pdf/1907.10529.pdf (Sec 3.1) + n = min(np_rng.geometric(0.2), max_ngrams) + + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # Note(mingdachen): + # Repeatedly looking for a candidate that does not exceed the + # maximum number of predictions by trying shorter ngrams. + while len(masked_lms) + len(index_set) > num_to_predict: + if n == 0: + break + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # If adding a whole-word mask would exceed the maximum number of + # predictions, then just skip this candidate. 
+ if len(masked_lms) + len(index_set) > num_to_predict: + continue + is_any_index_covered = False + for index in index_set: + if index in covered_indexes: + is_any_index_covered = True + break + if is_any_index_covered: + continue + for index in index_set: + covered_indexes.add(index) + masked_token = None + if masking_style == "bert": + # 80% of the time, replace with [MASK] + if np_rng.random() < 0.8: + masked_token = mask_id + else: + # 10% of the time, keep original + if np_rng.random() < 0.5: + masked_token = output_tokens[index] + # 10% of the time, replace with random word + else: + if inplace_random_mask: + masked_token = backup_output_tokens[np_rng.randint(0, len(output_tokens))] + else: + masked_token = vocab_id_list[np_rng.randint(0, len(vocab_id_list))] + elif masking_style == "t5": + masked_token = mask_id + else: + raise ValueError("invalid value of masking style") + + output_tokens[index] = masked_token + masked_lms.append(MaskedLmInstance(index=index, label=backup_output_tokens[index])) + + masked_spans.append( + MaskedLmInstance(index=index_set, label=[backup_output_tokens[index] for index in index_set]) + ) + + assert len(masked_lms) <= num_to_predict + np_rng.shuffle(ngram_indexes) + + select_indexes = set() + if do_permutation: + for cand_index_set in ngram_indexes: + if len(select_indexes) >= num_to_predict: + break + if not cand_index_set: + continue + # Note(mingdachen): + # Skip current piece if they are covered in lm masking or previous ngrams. + for index_set in cand_index_set[0]: + for index in index_set: + if index in covered_indexes or index in select_indexes: + continue + + n = np.random.choice( + ngrams[: len(cand_index_set)], + p=pvals[: len(cand_index_set)] / pvals[: len(cand_index_set)].sum(keepdims=True), + ) + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + + while len(select_indexes) + len(index_set) > num_to_predict: + if n == 0: + break + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # If adding a whole-word mask would exceed the maximum number of + # predictions, then just skip this candidate. + if len(select_indexes) + len(index_set) > num_to_predict: + continue + is_any_index_covered = False + for index in index_set: + if index in covered_indexes or index in select_indexes: + is_any_index_covered = True + break + if is_any_index_covered: + continue + for index in index_set: + select_indexes.add(index) + assert len(select_indexes) <= num_to_predict + + select_indexes = sorted(select_indexes) + permute_indexes = list(select_indexes) + np_rng.shuffle(permute_indexes) + orig_token = list(output_tokens) + + for src_i, tgt_i in zip(select_indexes, permute_indexes): + output_tokens[src_i] = orig_token[tgt_i] + masked_lms.append(MaskedLmInstance(index=src_i, label=orig_token[src_i])) + + masked_lms = sorted(masked_lms, key=lambda x: x.index) + # Sort the spans by the index of the first span + masked_spans = sorted(masked_spans, key=lambda x: x.index[0]) + + for p in masked_lms: + masked_lm_positions.append(p.index) + masked_lm_labels.append(p.label) + return (output_tokens, masked_lm_positions, masked_lm_labels, token_boundary, masked_spans) + + +def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions, masked_labels, pad_id, max_seq_length): + """Pad sequences and convert them to numpy.""" + + # Some checks. + num_tokens = len(tokens) + padding_length = max_seq_length - num_tokens + assert padding_length >= 0 + assert len(tokentypes) == num_tokens + assert len(masked_positions) == len(masked_labels) + + # Tokens and token types. 
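+    # For example, num_tokens = 5 and max_seq_length = 8 appends three pad_id
+    # entries, giving padding_mask_np = [1, 1, 1, 1, 1, 0, 0, 0].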
+ filler = [pad_id] * padding_length + tokens_np = np.array(tokens + filler, dtype=np.int64) + tokentypes_np = np.array(tokentypes + filler, dtype=np.int64) + + # Padding mask. + padding_mask_np = np.array([1] * num_tokens + [0] * padding_length, dtype=np.int64) + + # Lables and loss mask. + labels = [-1] * max_seq_length + loss_mask = [0] * max_seq_length + for i in range(len(masked_positions)): + assert masked_positions[i] < num_tokens + labels[masked_positions[i]] = masked_labels[i] + loss_mask[masked_positions[i]] = 1 + labels_np = np.array(labels, dtype=np.int64) + loss_mask_np = np.array(loss_mask, dtype=np.int64) + + return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np + + +def build_train_valid_test_datasets( + data_prefix, + args, + tokenizer, + splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head=False, + max_seq_length_dec=None, + dataset_type="standard_bert", +): + + if len(data_prefix) == 1: + return _build_train_valid_test_datasets( + data_prefix[0], + args, + tokenizer, + splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head, + max_seq_length_dec, + dataset_type=dataset_type, + ) + + # Blending dataset. + # Parse the values. + output = get_datasets_weights_and_num_samples(data_prefix, train_valid_test_num_samples) + prefixes, weights, datasets_train_valid_test_num_samples = output + + # Build individual datasets. + train_datasets = [] + valid_datasets = [] + test_datasets = [] + for i in range(len(prefixes)): + train_ds, valid_ds, test_ds = _build_train_valid_test_datasets( + prefixes[i], + args, + tokenizer, + splits_string, + datasets_train_valid_test_num_samples[i], + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head, + max_seq_length_dec, + dataset_type=dataset_type, + ) + if train_ds: + train_datasets.append(train_ds) + if valid_ds: + valid_datasets.append(valid_ds) + if test_ds: + test_datasets.append(test_ds) + + # Blend. + blending_train_dataset = None + if train_datasets: + blending_train_dataset = BlendableDataset(train_datasets, weights) + blending_valid_dataset = None + if valid_datasets: + blending_valid_dataset = BlendableDataset(valid_datasets, weights) + blending_test_dataset = None + if test_datasets: + blending_test_dataset = BlendableDataset(test_datasets, weights) + + return (blending_train_dataset, blending_valid_dataset, blending_test_dataset) + + +def _build_train_valid_test_datasets( + data_prefix, + args, + tokenizer, + splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head, + max_seq_length_dec, + dataset_type="standard_bert", +): + + if dataset_type not in DSET_TYPES: + raise ValueError("Invalid dataset_type: ", dataset_type) + + # Indexed dataset. + indexed_dataset = get_indexed_dataset_(data_prefix, args.data_impl, skip_warmup) + + # Get start and end indices of train/valid/train into doc-idx + # Note that doc-idx is desinged to be num-docs + 1 so we can + # easily iterate over it. + total_num_of_documents = indexed_dataset.doc_idx.shape[0] - 1 + splits = get_train_valid_test_split_(splits_string, total_num_of_documents) + + # Print stats about the splits. 
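+    # splits holds four document-index boundaries produced by
+    # get_train_valid_test_split_, e.g. splits_string "949,50,1" over 10000
+    # documents gives [0, 9490, 9990, 10000] for train/valid/test.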
+ print_rank_0(" > dataset split:") + + def print_split_stats(name, index): + print_rank_0(" {}:".format(name)) + print_rank_0( + " document indices in [{}, {}) total of {} " + "documents".format(splits[index], splits[index + 1], splits[index + 1] - splits[index]) + ) + start_index = indexed_dataset.doc_idx[splits[index]] + end_index = indexed_dataset.doc_idx[splits[index + 1]] + print_rank_0( + " sentence indices in [{}, {}) total of {} " + "sentences".format(start_index, end_index, end_index - start_index) + ) + + print_split_stats("train", 0) + print_split_stats("validation", 1) + print_split_stats("test", 2) + + def build_dataset(index, name): + # from megatron.data.bert_dataset import BertDataset + # from megatron.data.t5_dataset import T5Dataset + # from .ernie_dataset import ErnieDataset + + dataset = None + if splits[index + 1] > splits[index]: + # Get the pointer to the original doc-idx so we can set it later. + doc_idx_ptr = indexed_dataset.get_doc_idx() + # Slice the doc-idx + start_index = splits[index] + # Add +1 so we can index into the dataset to get the upper bound. + end_index = splits[index + 1] + 1 + # New doc_idx view. + indexed_dataset.set_doc_idx(doc_idx_ptr[start_index:end_index]) + # Build the dataset accordingly. + kwargs = dict( + name=name, + data_prefix=data_prefix, + num_epochs=None, + max_num_samples=train_valid_test_num_samples[index], + max_seq_length=max_seq_length, + seed=seed, + share_folder=args.share_folder, + args=args, + ) + if dataset_type == DSET_TYPE_T5: + from t5_dataset import T5Dataset + + dataset = T5Dataset( + indexed_dataset=indexed_dataset, + tokenizer=tokenizer, + masked_lm_prob=masked_lm_prob, + max_seq_length_dec=max_seq_length_dec, + short_seq_prob=short_seq_prob, + **kwargs, + ) + # elif dataset_type == DSET_TYPE_BERT: + # dataset = BertDataset( + # indexed_dataset=indexed_dataset, + # tokenizer=tokenizer, + # masked_lm_prob=masked_lm_prob, + # short_seq_prob=short_seq_prob, + # binary_head=binary_head, + # **kwargs, + # ) + elif dataset_type == DSET_TYPE_ERNIE: + from .ernie_dataset import ErnieDataset + + dataset = ErnieDataset( + indexed_dataset=indexed_dataset, + tokenizer=tokenizer, + masked_lm_prob=masked_lm_prob, + short_seq_prob=short_seq_prob, + binary_head=binary_head, + **kwargs, + ) + else: + raise NotImplementedError("Dataset type not fully implemented.") + + # Set the original pointer so dataset remains the main dataset. + indexed_dataset.set_doc_idx(doc_idx_ptr) + # Checks. 
+ assert indexed_dataset.doc_idx[0] == 0 + assert indexed_dataset.doc_idx.shape[0] == (total_num_of_documents + 1) + return dataset + + train_dataset = build_dataset(0, "train") + valid_dataset = build_dataset(1, "valid") + test_dataset = build_dataset(2, "test") + + return (train_dataset, valid_dataset, test_dataset) + + +def get_train_valid_test_split_(splits_string, size): + """Get dataset splits from comma or '/' separated string list.""" + + splits = [] + if splits_string.find(",") != -1: + splits = [float(s) for s in splits_string.split(",")] + elif splits_string.find("/") != -1: + splits = [float(s) for s in splits_string.split("/")] + else: + splits = [float(splits_string)] + while len(splits) < 3: + splits.append(0.0) + splits = splits[:3] + splits_sum = sum(splits) + assert splits_sum > 0.0 + splits = [split / splits_sum for split in splits] + splits_index = [0] + for index, split in enumerate(splits): + splits_index.append(splits_index[index] + int(round(split * float(size)))) + diff = splits_index[-1] - size + for index in range(1, len(splits_index)): + splits_index[index] -= diff + assert len(splits_index) == 4 + assert splits_index[-1] == size + return splits_index + + +def get_samples_mapping( + indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + name, + binary_head, + share_folder, +): + """Get a list that maps a sample index to a starting sentence index, end sentence index, and length""" + + if not num_epochs: + if not max_num_samples: + raise ValueError("Need to specify either max_num_samples " "or num_epochs") + num_epochs = np.iinfo(np.int32).max - 1 + if not max_num_samples: + max_num_samples = np.iinfo(np.int64).max - 1 + + # Filename of the index mapping + indexmap_filename = data_prefix + indexmap_filename += "_{}_indexmap".format(name) + if num_epochs != (np.iinfo(np.int32).max - 1): + indexmap_filename += "_{}ep".format(num_epochs) + if max_num_samples != (np.iinfo(np.int64).max - 1): + indexmap_filename += "_{}mns".format(max_num_samples) + indexmap_filename += "_{}msl".format(max_seq_length) + indexmap_filename += "_{:0.2f}ssp".format(short_seq_prob) + indexmap_filename += "_{}s".format(seed) + indexmap_filename += ".npy" + + local_rank = get_local_rank() + if share_folder: + local_rank = paddle.distributed.get_rank() + # Build the indexed mapping if not exist. + + if local_rank == 0 and not os.path.isfile(indexmap_filename): + print( + " > WARNING: could not find index map file {}, building " + "the indices on rank 0 ...".format(indexmap_filename) + ) + + # Make sure the types match the helpers input types. + assert indexed_dataset.doc_idx.dtype == np.int64 + print(indexed_dataset.sizes.dtype) + assert indexed_dataset.sizes.dtype == np.int32 + + # Build samples mapping + verbose = local_rank == 0 + start_time = time.time() + print_rank_0(" > building sapmles index mapping for {} ...".format(name)) + # First compile and then import. 
+ try: + from tool_helpers import helpers + except ModuleNotFoundError: + print_rank_0(" > missing tool_helpers, pip install tool_helpers please, try to compile locally.") + if local_rank == 0: + compile_helper() + import data_tools.helpers as helpers + + samples_mapping = helpers.build_mapping( + indexed_dataset.doc_idx, + indexed_dataset.sizes, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + verbose, + 2 if binary_head else 1, + ) + print_rank_0(" > done building sapmles index maping") + np.save(indexmap_filename, samples_mapping, allow_pickle=True) + print_rank_0(" > saved the index mapping in {}".format(indexmap_filename)) + # Make sure all the ranks have built the mapping + print_rank_0( + " > elasped time to build and save samples mapping " "(seconds): {:4f}".format(time.time() - start_time) + ) + + else: + while True: + if not os.path.isfile(indexmap_filename): + time.sleep(3) + else: + try: + np.load(indexmap_filename, allow_pickle=True, mmap_mode="r") + break + except Exception: + print("%s file is still writing or damaged, please wait a moment." % indexmap_filename) + time.sleep(3) + + # This should be a barrier but nccl barrier assumes + # device_index=rank which is not the case for model + # parallel case + if paddle.distributed.get_world_size() > 1: + if paddle.in_dynamic_mode(): + paddle.distributed.barrier() + + # Load indexed dataset. + print_rank_0(" > loading indexed mapping from {}".format(indexmap_filename)) + start_time = time.time() + samples_mapping = np.load(indexmap_filename, allow_pickle=True, mmap_mode="r") + print_rank_0(" loaded indexed file in {:3.3f} seconds".format(time.time() - start_time)) + print_rank_0(" total number of samples: {}".format(samples_mapping.shape[0])) + + return samples_mapping diff --git a/model_zoo/ernie-1.0/data_tools/ernie_dataset.py b/model_zoo/ernie-1.0/data_tools/ernie_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..935c98692a774997317431e8165dfbca9931ba14 --- /dev/null +++ b/model_zoo/ernie-1.0/data_tools/ernie_dataset.py @@ -0,0 +1,239 @@ +# Copyright (c) 2021, PadddlePaddle authors. All Rights Reserved. +# Copyright (c) 2020, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""BERT Style dataset.""" + +import copy + +import numpy as np +import paddle + +from .dataset_utils import ( + create_masked_lm_predictions, + create_tokens_and_tokentypes, + get_a_and_b_segments, + get_samples_mapping, + truncate_segments, +) + + +class ErnieDataset(paddle.io.Dataset): + def __init__( + self, + name, + tokenizer, + indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + masked_lm_prob, + max_seq_length, + short_seq_prob, + seed, + binary_head, + share_folder=False, + args=None, + ): + + # Params to store. + self.name = name + self.seed = seed + self.masked_lm_prob = masked_lm_prob + self.max_seq_length = max_seq_length + self.binary_head = binary_head + self.share_folder = share_folder + self.args = args + + # Dataset. 
+ self.indexed_dataset = indexed_dataset + + # Build the samples mapping. + self.samples_mapping = get_samples_mapping( + self.indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + self.max_seq_length - 3, # account for added tokens + short_seq_prob, + self.seed, + self.name, + self.binary_head, + self.share_folder, + ) + + # Vocab stuff. + # tokenizer = get_tokenizer() + # self.vocab_id_list = list(tokenizer.inv_vocab.keys()) + # self.vocab_id_to_token_dict = tokenizer.inv_vocab + self.vocab_id_list = list(tokenizer.vocab.idx_to_token.keys()) + self.vocab_id_to_token_dict = copy.deepcopy(tokenizer.vocab.idx_to_token) + self.vocab_token_to_id_dict = copy.deepcopy(tokenizer.vocab.token_to_idx) + + # ERNIE is chinse char level model, sometime is need + # add ## chinse char to encode and decode. + # Here we extend the vocab dict. + self.vocab_id_to_token_dict.update(tokenizer.added_tokens_decoder) + self.vocab_token_to_id_dict.update(tokenizer.added_tokens_encoder) + + self.cls_id = tokenizer.cls_token_id + self.sep_id = tokenizer.sep_token_id + self.mask_id = tokenizer.mask_token_id + self.pad_id = tokenizer.pad_token_id + + def __len__(self): + return self.samples_mapping.shape[0] + + def __getitem__(self, idx): + start_idx, end_idx, seq_length = self.samples_mapping[idx] + sample = [self.indexed_dataset[i] for i in range(start_idx, end_idx)] + + # Note that this rng state should be numpy and not python since + # python randint is inclusive whereas the numpy one is exclusive. + # We % 2**32 since numpy requres the seed to be between 0 and 2**32 - 1 + np_rng = np.random.RandomState(seed=((self.seed + idx) % 2**32)) + return build_training_sample( + sample, + seq_length, + self.max_seq_length, # needed for padding + self.vocab_id_list, + self.vocab_id_to_token_dict, + self.vocab_token_to_id_dict, + self.cls_id, + self.sep_id, + self.mask_id, + self.pad_id, + self.masked_lm_prob, + np_rng, + self.binary_head, + self.args, + ) + + +def build_training_sample( + sample, + target_seq_length, + max_seq_length, + vocab_id_list, + vocab_id_to_token_dict, + vocab_token_to_id_dict, + cls_id, + sep_id, + mask_id, + pad_id, + masked_lm_prob, + np_rng, + binary_head, + args=None, +): + """Biuld training sample. + + Arguments: + sample: A list of sentences in which each sentence is a list token ids. + target_seq_length: Desired sequence length. + max_seq_length: Maximum length of the sequence. All values are padded to + this length. + vocab_id_list: List of vocabulary ids. Used to pick a random id. + vocab_id_to_token_dict: A dictionary from vocab ids to text tokens. + vocab_token_to_id_dict: A dictionary from text tokens to vocab ids. + cls_id: Start of example id. + sep_id: Separator id. + mask_id: Mask token id. + pad_id: Padding token id. + masked_lm_prob: Probability to mask tokens. + np_rng: Random number genenrator. Note that this rng state should be + numpy and not python since python randint is inclusive for + the opper bound whereas the numpy one is exclusive. + """ + + if binary_head: + # We assume that we have at least two sentences in the sample + assert len(sample) > 1, "The sentence num should be large than 1." + assert target_seq_length <= max_seq_length + + # Divide sample into two segments (A and B). + if binary_head: + tokens_a, tokens_b, is_next_random = get_a_and_b_segments(sample, np_rng) + else: + tokens_a = [] + for j in range(len(sample)): + tokens_a.extend(sample[j]) + tokens_b = [] + is_next_random = False + + # Truncate to `target_sequence_length`. 
+ max_num_tokens = target_seq_length + truncate_segments(tokens_a, tokens_b, len(tokens_a), len(tokens_b), max_num_tokens, np_rng) + + # Build tokens and toketypes. + tokens, tokentypes = create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id) + + # Masking. + max_predictions_per_seq = masked_lm_prob * max_num_tokens + (tokens, masked_positions, masked_labels, _, _) = create_masked_lm_predictions( + tokens, + vocab_id_list, + vocab_id_to_token_dict, + masked_lm_prob, + cls_id, + sep_id, + mask_id, + max_predictions_per_seq, + np_rng, + vocab_token_to_id_dict=vocab_token_to_id_dict, + to_chinese_char=True, + inplace_random_mask=False, + favor_longer_ngram=False if args is None else args.favor_longer_ngram, + max_ngrams=3 if args is None else args.max_ngrams, + ) + + # Padding. + tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np = pad_and_convert_to_numpy( + tokens, tokentypes, masked_positions, masked_labels, pad_id, max_seq_length + ) + + return tokens_np, tokentypes_np, padding_mask_np, masked_positions, masked_labels, int(is_next_random) + + +def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions, masked_labels, pad_id, max_seq_length): + """Pad sequences and convert them to numpy.""" + + # Some checks. + num_tokens = len(tokens) + # print(num_tokens, max_seq_length) + padding_length = max_seq_length - num_tokens + assert padding_length >= 0 + assert len(tokentypes) == num_tokens + assert len(masked_positions) == len(masked_labels) + + # Tokens and token types. + filler = [pad_id] * padding_length + tokens_np = np.array(tokens + filler, dtype=np.int64) + tokentypes_np = np.array(tokentypes + filler, dtype=np.int64) + + # Padding mask. + padding_mask_np = np.array([1] * num_tokens + [0] * padding_length, dtype=np.float32) + padding_mask_np = (1 - padding_mask_np) * -1e4 + + padding_mask_np = padding_mask_np.reshape([1, 1, -1]) + # Lables and loss mask. + labels = [-1] * max_seq_length + loss_mask = [0] * max_seq_length + for i in range(len(masked_positions)): + assert masked_positions[i] < num_tokens + labels[masked_positions[i]] = masked_labels[i] + loss_mask[masked_positions[i]] = 1 + labels_np = np.array(labels, dtype=np.int64) + loss_mask_np = np.array(loss_mask, dtype=np.int64) + + return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np diff --git a/model_zoo/ernie-1.0/data_tools/helpers.cpp b/model_zoo/ernie-1.0/data_tools/helpers.cpp new file mode 100644 index 0000000000000000000000000000000000000000..07ae7ccdf6dd6ec871b623519fd8162d8eac606e --- /dev/null +++ b/model_zoo/ernie-1.0/data_tools/helpers.cpp @@ -0,0 +1,741 @@ +/* + coding=utf-8 + Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+ */ + + +/* Helper methods for fast index mapping builds */ + +#include +#include +#include +#include +#include +#include +#include +#include + +namespace py = pybind11; +using namespace std; + +const int32_t LONG_SENTENCE_LEN = 512; + +void build_blending_indices(py::array_t& dataset_index, + py::array_t& dataset_sample_index, + const py::array_t& weights, + const int32_t num_datasets, + const int64_t size, + const bool verbose) { + /* Given multiple datasets and a weighting array, build samples + such that it follows those wieghts.*/ + + if (verbose) { + std::cout << "> building indices for blendable datasets ..." << std::endl; + } + + // Get the pointer access without the checks. + auto dataset_index_ptr = dataset_index.mutable_unchecked<1>(); + auto dataset_sample_index_ptr = dataset_sample_index.mutable_unchecked<1>(); + auto weights_ptr = weights.unchecked<1>(); + + // Initialize buffer for number of samples used for each dataset. + int64_t current_samples[num_datasets]; + for (int64_t i = 0; i < num_datasets; ++i) { + current_samples[i] = 0; + } + + // For each sample: + for (int64_t sample_idx = 0; sample_idx < size; ++sample_idx) { + // Determine where the max error in sampling is happening. + auto sample_idx_double = std::max(static_cast(sample_idx), 1.0); + int64_t max_error_index = 0; + double max_error = weights_ptr[0] * sample_idx_double - + static_cast(current_samples[0]); + for (int64_t dataset_idx = 1; dataset_idx < num_datasets; ++dataset_idx) { + double error = weights_ptr[dataset_idx] * sample_idx_double - + static_cast(current_samples[dataset_idx]); + if (error > max_error) { + max_error = error; + max_error_index = dataset_idx; + } + } + + // Populate the indices. + dataset_index_ptr[sample_idx] = static_cast(max_error_index); + dataset_sample_index_ptr[sample_idx] = current_samples[max_error_index]; + + // Update the total samples. + current_samples[max_error_index] += 1; + } + + // print info + if (verbose) { + std::cout << " > sample ratios:" << std::endl; + for (int64_t dataset_idx = 0; dataset_idx < num_datasets; ++dataset_idx) { + auto ratio = static_cast(current_samples[dataset_idx]) / + static_cast(size); + std::cout << " dataset " << dataset_idx + << ", input: " << weights_ptr[dataset_idx] + << ", achieved: " << ratio << std::endl; + } + } +} + + +py::array build_sample_idx(const py::array_t& sizes_, + const py::array_t& doc_idx_, + const int32_t seq_length, + const int32_t num_epochs, + const int64_t tokens_per_epoch) { + /* Sample index (sample_idx) is used for gpt2 like dataset for which + the documents are flattened and the samples are built based on this + 1-D flatten array. It is a 2D array with sizes [number-of-samples + 1, 2] + where [..., 0] contains the index into `doc_idx` and [..., 1] is the + starting offset in that document.*/ + + // Consistency checks. + assert(seq_length > 1); + assert(num_epochs > 0); + assert(tokens_per_epoch > 1); + + // Remove bound checks. + auto sizes = sizes_.unchecked<1>(); + auto doc_idx = doc_idx_.unchecked<1>(); + + // Mapping and it's length (1D). 
+ int64_t num_samples = (num_epochs * tokens_per_epoch - 1) / seq_length; + int32_t* sample_idx = new int32_t[2 * (num_samples + 1)]; + + cout << " using:" << endl << std::flush; + cout << " number of documents: " << doc_idx_.shape(0) / num_epochs + << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " sequence length: " << seq_length << endl + << std::flush; + cout << " total number of samples: " << num_samples << endl + << std::flush; + + // Index into sample_idx. + int64_t sample_index = 0; + // Index into doc_idx. + int64_t doc_idx_index = 0; + // Beginning offset for each document. + int32_t doc_offset = 0; + // Start with first document and no offset. + sample_idx[2 * sample_index] = doc_idx_index; + sample_idx[2 * sample_index + 1] = doc_offset; + ++sample_index; + + while (sample_index <= num_samples) { + // Start with a fresh sequence. + int32_t remaining_seq_length = seq_length + 1; + while (remaining_seq_length != 0) { + // Get the document length. + auto doc_id = doc_idx[doc_idx_index]; + auto doc_length = sizes[doc_id] - doc_offset; + // And add it to the current sequence. + remaining_seq_length -= doc_length; + // If we have more than a full sequence, adjust offset and set + // remaining length to zero so we return from the while loop. + // Note that -1 here is for the same reason we have -1 in + // `_num_epochs` calculations. + if (remaining_seq_length <= 0) { + doc_offset += (remaining_seq_length + doc_length - 1); + remaining_seq_length = 0; + } else { + // Otherwise, start from the beginning of the next document. + ++doc_idx_index; + doc_offset = 0; + } + } + // Record the sequence. + sample_idx[2 * sample_index] = doc_idx_index; + sample_idx[2 * sample_index + 1] = doc_offset; + ++sample_index; + } + + // Method to deallocate memory. + py::capsule free_when_done(sample_idx, [](void* mem_) { + int32_t* mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. + const auto byte_size = sizeof(int32_t); + return py::array(std::vector{num_samples + 1, 2}, // shape + {2 * byte_size, byte_size}, // C-style contiguous strides + sample_idx, // the data pointer + free_when_done); // numpy array references +} + + +inline int32_t get_target_sample_len(const int32_t short_seq_ratio, + const int32_t max_length, + std::mt19937& rand32_gen) { + /* Training sample length. */ + if (short_seq_ratio == 0) { + return max_length; + } + const auto random_number = rand32_gen(); + if ((random_number % short_seq_ratio) == 0) { + return 2 + random_number % (max_length - 1); + } + return max_length; +} + + +template +py::array build_mapping_impl(const py::array_t& docs_, + const py::array_t& sizes_, + const int32_t num_epochs, + const uint64_t max_num_samples, + const int32_t max_seq_length, + const double short_seq_prob, + const int32_t seed, + const bool verbose, + const int32_t min_num_sent) { + /* Build a mapping of (start-index, end-index, sequence-length) where + start and end index are the indices of the sentences in the sample + and sequence-length is the target sequence length. + */ + + // Consistency checks. + assert(num_epochs > 0); + assert(max_seq_length > 1); + assert(short_seq_prob >= 0.0); + assert(short_seq_prob <= 1.0); + assert(seed > 0); + + // Remove bound checks. + auto docs = docs_.unchecked<1>(); + auto sizes = sizes_.unchecked<1>(); + + // For efficiency, convert probability to ratio. Note: rand() generates int. 
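+  // For example, short_seq_prob = 0.1 becomes short_seq_ratio = 10, so roughly
+  // one in ten samples gets a shortened target length in get_target_sample_len.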
+ int32_t short_seq_ratio = 0; + if (short_seq_prob > 0) { + short_seq_ratio = static_cast(round(1.0 / short_seq_prob)); + } + + if (verbose) { + const auto sent_start_index = docs[0]; + const auto sent_end_index = docs[docs_.shape(0) - 1]; + const auto num_sentences = sent_end_index - sent_start_index; + cout << " using:" << endl << std::flush; + cout << " number of documents: " << docs_.shape(0) - 1 + << endl + << std::flush; + cout << " sentences range: [" << sent_start_index << ", " + << sent_end_index << ")" << endl + << std::flush; + cout << " total number of sentences: " << num_sentences << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " maximum number of samples: " << max_num_samples << endl + << std::flush; + cout << " maximum sequence length: " << max_seq_length << endl + << std::flush; + cout << " minimum sentences num: " << min_num_sent << endl + << std::flush; + cout << " short sequence probability: " << short_seq_prob << endl + << std::flush; + cout << " short sequence ration (1/prob): " << short_seq_ratio << endl + << std::flush; + cout << " seed: " << seed << endl + << std::flush; + } + + // Mapping and it's length (1D). + int64_t num_samples = -1; + DocIdx* maps = NULL; + + // Perform two iterations, in the first iteration get the size + // and allocate memory and in the second iteration populate the map. + bool second = false; + for (int32_t iteration = 0; iteration < 2; ++iteration) { + // Set the seed so both iterations produce the same results. + std::mt19937 rand32_gen(seed); + + // Set the flag on second iteration. + second = (iteration == 1); + + // Counters: + uint64_t empty_docs = 0; + uint64_t one_sent_docs = 0; + uint64_t long_sent_docs = 0; + + // Current map index. + uint64_t map_index = 0; + + // For each epoch: + for (int32_t epoch = 0; epoch < num_epochs; ++epoch) { + if (map_index >= max_num_samples) { + if (verbose && (!second)) { + cout << " reached " << max_num_samples << " samples after " + << epoch << " epochs ..." << endl + << std::flush; + } + break; + } + if(epoch > 0 && map_index == 0){ + cout << endl << " No available documtment find this dataset." << endl << std::flush; + throw std::invalid_argument( + "Invalid dataset! the document should be with more than " + + std::to_string(min_num_sent) + " scentences."); + } + // For each document: + for (int32_t doc = 0; doc < (docs.shape(0) - 1); ++doc) { + // Document sentences are in [sent_index_first, sent_index_last) + const auto sent_index_first = docs[doc]; + const auto sent_index_last = docs[doc + 1]; + // At the beginning of the document previous index is the + // start index. + auto prev_start_index = sent_index_first; + + // Remaining documents. + auto num_remain_sent = sent_index_last - sent_index_first; + + // Some bookkeeping + if ((epoch == 0) && (!second)) { + if (num_remain_sent == 0) { + ++empty_docs; + } + if (num_remain_sent == 1) { + ++one_sent_docs; + } + } + + // Detect documents with long sentences. + bool contains_long_sentence = false; + if (num_remain_sent > 1) { + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + if (sizes[sent_index] > LONG_SENTENCE_LEN) { + if ((epoch == 0) && (!second)) { + ++long_sent_docs; + } + contains_long_sentence = true; + break; + } + } + } + // If we have more than two sentences. + if ((num_remain_sent >= min_num_sent) && (!contains_long_sentence)) { + // Set values. 
+ auto seq_len = int32_t{0}; + auto num_sent = int32_t{0}; + auto target_seq_len = get_target_sample_len( + short_seq_ratio, max_seq_length, rand32_gen); + + // Loop through sentences. + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + // Add the size and number of sentences. + seq_len += sizes[sent_index]; + ++num_sent; + --num_remain_sent; + + // If we have reached the target length. + // and if not only one sentence is left in the document. + // and if we have at least two sentneces. + // and if we have reached end of the document. + if (((seq_len >= target_seq_len) && (num_remain_sent > 1) && + (num_sent >= min_num_sent)) || + (num_remain_sent == 0)) { + // Check for overflow. + if ((3 * map_index + 2) > std::numeric_limits::max()) { + cout << "number of samples exceeded maximum " + << "allowed by type int64: " + << std::numeric_limits::max() << endl; + throw std::overflow_error("Number of samples"); + } + + // Populate the map. + if (second) { + const auto map_index_0 = 3 * map_index; + maps[map_index_0] = static_cast(prev_start_index); + maps[map_index_0 + 1] = static_cast(sent_index + 1); + maps[map_index_0 + 2] = static_cast(target_seq_len); + } + + // Update indices / counters. + ++map_index; + prev_start_index = sent_index + 1; + target_seq_len = get_target_sample_len( + short_seq_ratio, max_seq_length, rand32_gen); + seq_len = 0; + num_sent = 0; + } + + } // for (auto sent_index=sent_index_first; ... + } // if (num_remain_sent > 1) { + } // for (int doc=0; doc < num_docs; ++doc) { + } // for (int epoch=0; epoch < num_epochs; ++epoch) { + + if (!second) { + if (verbose) { + cout << " number of empty documents: " << empty_docs << endl + << std::flush; + cout << " number of documents with one sentence: " << one_sent_docs + << endl + << std::flush; + cout << " number of documents with long sentences: " << long_sent_docs + << endl + << std::flush; + cout << " will create mapping for " << map_index << " samples" << endl + << std::flush; + } + assert(maps == NULL); + assert(num_samples < 0); + maps = new DocIdx[3 * map_index]; + num_samples = static_cast(map_index); + } + + } // for (int iteration=0; iteration < 2; ++iteration) { + + // Shuffle. + // We need a 64 bit random number generator as we might have more + // than 2 billion samples. + std::mt19937_64 rand64_gen(seed + 1); + for (auto i = (num_samples - 1); i > 0; --i) { + const auto j = static_cast(rand64_gen() % (i + 1)); + const auto i0 = 3 * i; + const auto j0 = 3 * j; + // Swap values. + swap(maps[i0], maps[j0]); + swap(maps[i0 + 1], maps[j0 + 1]); + swap(maps[i0 + 2], maps[j0 + 2]); + } + + // Method to deallocate memory. + py::capsule free_when_done(maps, [](void* mem_) { + DocIdx* mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. + const auto byte_size = sizeof(DocIdx); + return py::array(std::vector{num_samples, 3}, // shape + {3 * byte_size, byte_size}, // C-style contiguous strides + maps, // the data pointer + free_when_done); // numpy array references +} + + +py::array build_mapping(const py::array_t& docs_, + const py::array_t& sizes_, + const int num_epochs, + const uint64_t max_num_samples, + const int max_seq_length, + const double short_seq_prob, + const int seed, + const bool verbose, + const int32_t min_num_sent) { + if (sizes_.size() > std::numeric_limits::max()) { + if (verbose) { + cout << " using uint64 for data mapping..." 
<< endl << std::flush; + } + return build_mapping_impl(docs_, + sizes_, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + verbose, + min_num_sent); + } else { + if (verbose) { + cout << " using uint32 for data mapping..." << endl << std::flush; + } + return build_mapping_impl(docs_, + sizes_, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + verbose, + min_num_sent); + } +} + +template +py::array build_blocks_mapping_impl(const py::array_t& docs_, + const py::array_t& sizes_, + const py::array_t& titles_sizes_, + const int32_t num_epochs, + const uint64_t max_num_samples, + const int32_t max_seq_length, + const int32_t seed, + const bool verbose, + const bool use_one_sent_blocks) { + /* Build a mapping of (start-index, end-index, sequence-length) where + start and end index are the indices of the sentences in the sample + and sequence-length is the target sequence length. + */ + + // Consistency checks. + assert(num_epochs > 0); + assert(max_seq_length > 1); + assert(seed > 0); + + // Remove bound checks. + auto docs = docs_.unchecked<1>(); + auto sizes = sizes_.unchecked<1>(); + auto titles_sizes = titles_sizes_.unchecked<1>(); + + if (verbose) { + const auto sent_start_index = docs[0]; + const auto sent_end_index = docs[docs_.shape(0) - 1]; + const auto num_sentences = sent_end_index - sent_start_index; + cout << " using:" << endl << std::flush; + cout << " number of documents: " << docs_.shape(0) - 1 + << endl + << std::flush; + cout << " sentences range: [" << sent_start_index << ", " + << sent_end_index << ")" << endl + << std::flush; + cout << " total number of sentences: " << num_sentences << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " maximum number of samples: " << max_num_samples << endl + << std::flush; + cout << " maximum sequence length: " << max_seq_length << endl + << std::flush; + cout << " seed: " << seed << endl + << std::flush; + } + + // Mapping and its length (1D). + int64_t num_samples = -1; + DocIdx* maps = NULL; + + // Acceptable number of sentences per block. + int min_num_sent = 2; + if (use_one_sent_blocks) { + min_num_sent = 1; + } + + // Perform two iterations, in the first iteration get the size + // and allocate memory and in the second iteration populate the map. + bool second = false; + for (int32_t iteration = 0; iteration < 2; ++iteration) { + // Set the flag on second iteration. + second = (iteration == 1); + + // Current map index. + uint64_t map_index = 0; + + uint64_t empty_docs = 0; + uint64_t one_sent_docs = 0; + uint64_t long_sent_docs = 0; + // For each epoch: + for (int32_t epoch = 0; epoch < num_epochs; ++epoch) { + // assign every block a unique id + int32_t block_id = 0; + + if (map_index >= max_num_samples) { + if (verbose && (!second)) { + cout << " reached " << max_num_samples << " samples after " + << epoch << " epochs ..." << endl + << std::flush; + } + break; + } + // For each document: + for (int32_t doc = 0; doc < (docs.shape(0) - 1); ++doc) { + // Document sentences are in [sent_index_first, sent_index_last) + const auto sent_index_first = docs[doc]; + const auto sent_index_last = docs[doc + 1]; + const auto target_seq_len = max_seq_length - titles_sizes[doc]; + + // At the beginning of the document previous index is the + // start index. + auto prev_start_index = sent_index_first; + + // Remaining documents. 
+ auto num_remain_sent = sent_index_last - sent_index_first; + + // Some bookkeeping + if ((epoch == 0) && (!second)) { + if (num_remain_sent == 0) { + ++empty_docs; + } + if (num_remain_sent == 1) { + ++one_sent_docs; + } + } + // Detect documents with long sentences. + bool contains_long_sentence = false; + if (num_remain_sent >= min_num_sent) { + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + if (sizes[sent_index] > LONG_SENTENCE_LEN) { + if ((epoch == 0) && (!second)) { + ++long_sent_docs; + } + contains_long_sentence = true; + break; + } + } + } + // If we have enough sentences and no long sentences. + if ((num_remain_sent >= min_num_sent) && (!contains_long_sentence)) { + // Set values. + auto seq_len = int32_t{0}; + auto num_sent = int32_t{0}; + + // Loop through sentences. + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + // Add the size and number of sentences. + seq_len += sizes[sent_index]; + ++num_sent; + --num_remain_sent; + + // If we have reached the target length. + // and there are an acceptable number of sentences left + // and if we have at least the minimum number of sentences. + // or if we have reached end of the document. + if (((seq_len >= target_seq_len) && + (num_remain_sent >= min_num_sent) && + (num_sent >= min_num_sent)) || + (num_remain_sent == 0)) { + // Populate the map. + if (second) { + const auto map_index_0 = 4 * map_index; + // Each sample has 4 items: the starting sentence index, ending + // sentence index, + // the index of the document from which the block comes (used + // for fetching titles) + // and the unique id of the block (used for creating block + // indexes) + + maps[map_index_0] = static_cast(prev_start_index); + maps[map_index_0 + 1] = static_cast(sent_index + 1); + maps[map_index_0 + 2] = static_cast(doc); + maps[map_index_0 + 3] = static_cast(block_id); + } + + // Update indices / counters. + ++map_index; + ++block_id; + prev_start_index = sent_index + 1; + seq_len = 0; + num_sent = 0; + } + } // for (auto sent_index=sent_index_first; ... + } // if (num_remain_sent > 1) { + } // for (int doc=0; doc < num_docs; ++doc) { + } // for (int epoch=0; epoch < num_epochs; ++epoch) { + + if (!second) { + if (verbose) { + cout << " number of empty documents: " << empty_docs << endl + << std::flush; + cout << " number of documents with one sentence: " << one_sent_docs + << endl + << std::flush; + cout << " number of documents with long sentences: " << long_sent_docs + << endl + << std::flush; + cout << " will create mapping for " << map_index << " samples" << endl + << std::flush; + } + assert(maps == NULL); + assert(num_samples < 0); + maps = new DocIdx[4 * map_index]; + num_samples = static_cast(map_index); + } + + } // for (int iteration=0; iteration < 2; ++iteration) { + + // Shuffle. + // We need a 64 bit random number generator as we might have more + // than 2 billion samples. + std::mt19937_64 rand64_gen(seed + 1); + for (auto i = (num_samples - 1); i > 0; --i) { + const auto j = static_cast(rand64_gen() % (i + 1)); + const auto i0 = 4 * i; + const auto j0 = 4 * j; + // Swap values. + swap(maps[i0], maps[j0]); + swap(maps[i0 + 1], maps[j0 + 1]); + swap(maps[i0 + 2], maps[j0 + 2]); + swap(maps[i0 + 3], maps[j0 + 3]); + } + + // Method to deallocate memory. + py::capsule free_when_done(maps, [](void* mem_) { + DocIdx* mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. 
+ const auto byte_size = sizeof(DocIdx); + return py::array(std::vector{num_samples, 4}, // shape + {4 * byte_size, byte_size}, // C-style contiguous strides + maps, // the data pointer + free_when_done); // numpy array references +} + +py::array build_blocks_mapping(const py::array_t& docs_, + const py::array_t& sizes_, + const py::array_t& titles_sizes_, + const int num_epochs, + const uint64_t max_num_samples, + const int max_seq_length, + const int seed, + const bool verbose, + const bool use_one_sent_blocks) { + if (sizes_.size() > std::numeric_limits::max()) { + if (verbose) { + cout << " using uint64 for data mapping..." << endl << std::flush; + } + return build_blocks_mapping_impl(docs_, + sizes_, + titles_sizes_, + num_epochs, + max_num_samples, + max_seq_length, + seed, + verbose, + use_one_sent_blocks); + } else { + if (verbose) { + cout << " using uint32 for data mapping..." << endl << std::flush; + } + return build_blocks_mapping_impl(docs_, + sizes_, + titles_sizes_, + num_epochs, + max_num_samples, + max_seq_length, + seed, + verbose, + use_one_sent_blocks); + } +} + +PYBIND11_MODULE(helpers, m) { + m.def("build_mapping", &build_mapping); + m.def("build_blocks_mapping", &build_blocks_mapping); + m.def("build_sample_idx", &build_sample_idx); + m.def("build_blending_indices", &build_blending_indices); +} diff --git a/model_zoo/ernie-1.0/finetune/README.md b/model_zoo/ernie-1.0/finetune/README.md new file mode 100644 index 0000000000000000000000000000000000000000..27603219d59b7f3a5b96590e3890ef1490af5a60 --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/README.md @@ -0,0 +1,71 @@ +# 预训练下游任务 finetune 示例 +使用训练中产出的checkpoint,或者paddlenlp内置的模型权重,使用本脚本,用户可以快速对当前模型效果进行评估。 + +## 运行示例 +本文档适配了三大主流下游任务,用户可以根据自己的需求,评估自己所需的数据集。 + +运行脚本示例如下: + +1. 序列分类 +```shell +dataset="chnsenticorp_v2" +python run_seq_cls.py \ + --do_train \ + --do_eval \ + --do_predict \ + --model_name_or_path ernie-1.0 \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + +2. Token分类 +```shell +dataset="peoples_daily_ner" +python run_ner.py \ + --do_train \ + --do_eval \ + --do_predict \ + --model_name_or_path ernie-1.0 \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + +3. 阅读理解 +```shell +dataset="cmrc2018" +python run_qa.py \ + --do_train \ + --do_eval \ + --model_name_or_path ernie-1.0 \ + --dataset $dataset \ + --output_dir ./tmp/$dataset +``` + +## 参数说明 + +### 传入参数 +必须参数 +- do_train、do_eval、do_predict分别表示运行训练、评估、测试数据集合。 +- do_export 导出为inference预测模型 +- model_name_or_path 表示模型权重名称,或者训练中保存的checkpoint地址 +- dataset 表示数据集名称 +- output_dir 表示运行中,一些checkpoint等参数的输出目录 + +其他可配置参数: +- per_device_train_batch_size 训练时batch大小 +- per_device_eval_batch_size 评估时batch大小 +- num_train_epochs 训练epoch数目 +- learning_rate 学习率 +- max_seq_length 最大序列长度 +- weight_decay 训练时优化器对参数衰减系数 +- logging_steps 打印日志间隔步数 +- eval_steps 评估效果间隔步数 +- max_steps 最大训练步数(可覆盖num_train_epochs) + + +### yaml文件参数 +本示例也支持用户在yaml文件中配置参数。用户可以自行修改`config.yaml`文件。 + +注意: +- 这些参数会重写传入的默认参数,以yaml文件参数为准。 +- yaml文件中的batch_size同时等价于per_device_train_batch_size,per_device_eval_batch_size diff --git a/model_zoo/ernie-1.0/finetune/config.yml b/model_zoo/ernie-1.0/finetune/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..d5d1e1082182abc6730827d8689e1fed11f96c1b --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/config.yml @@ -0,0 +1,71 @@ +# Default Args for all dataset +# You can overwrite the configs in each dataset. 
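+# For example, an entry under one of the task sections below only needs to list
+# the fields it overrides (the dataset name here is just a placeholder):
+#
+#   SequenceClassification:
+#     my_dataset:
+#       learning_rate: 0.00002
+#       batch_size: 32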
+DefaultArgs: + learning_rate: 0.00005 + num_train_epochs: 3 + batch_size: 64 + max_seq_length: 128 + weight_decay: 0.01 + logging_steps: 10 + eval_steps: 200 + minimum_eval_times: -1 + max_steps: -1 + warmup_steps: 0 + warmup_ratio: 0.1 + metric: "Accuracy" + +# Datasets which used for sequence classfication +SequenceClassification: + clue afqmc: + num_train_epochs: 4 + clue tnews: + num_train_epochs: 4 + clue iflytek: + num_train_epochs: 8 + clue ocnli: + num_train_epochs: 8 + clue cmnli: + num_train_epochs: 3 + clue wsc: + num_train_epochs: 50 + clue csl: + num_train_epochs: 10 + max_seq_length: 256 + batch_size: 32 + xnli_cn: + learning_rate: 0.0001 + num_train_epochs: 3 + batch_size: 256 + chnsenticorp_v2: + learning_rate: 0.00005 + batch_size: 16 + num_train_epochs: 8 + +# Datasets which used for token classfication +TokenClassification: + peoples_daily_ner: + learning_rate: 0.00005 + num_train_epochs: 4 + batch_size: 128 + msra_ner: + num_train_epochs: 3 + +# Datasets which used for question answersing +QuestionAnswering: + cmrc2018: + learning_rate: 0.00005 + num_train_epochs: 1 + batch_size: 32 + max_seq_length: 384 + dureader_nlp: + num_train_epochs: 1 + batch_size: 12 + max_seq_length: 384 + dureader_robust: + num_train_epochs: 1 + batch_size: 12 + max_seq_length: 384 + dlbp: + num_train_epochs: 1 + batch_size: 12 + max_seq_length: 384 diff --git a/model_zoo/ernie-1.0/finetune/deploy/README.md b/model_zoo/ernie-1.0/finetune/deploy/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d4e135253ebb5884a3fed88021a690b148d34d8b --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/deploy/README.md @@ -0,0 +1,138 @@ +# FastDeploy ERNIE 1.0 模型 Python 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy Python SDK。 + +本目录下提供 `seq_cls_infer.py` 快速完成在 CPU/GPU 的中文情感分类任务的 Python 部署示例。 + +## 依赖安装 + +直接执行以下命令安装部署示例的依赖。 + +```bash +# 安装 fast_tokenizer 以及 GPU 版本 fastdeploy +pip install fast-tokenizer-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html +``` + +## 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 1.0 模型在 ChnSenticorp 数据集上进行文本分类任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [ERNIE 1.0 训练文档](../../README.md)导出得到的部署模型,其模型目录为`model_zoo/ernie-1.0/finetune/tmp/export`(用户可按实际情况设置)。 + + +```bash +# CPU 推理 +python seq_cls_infer.py --model_dir ../tmp/chnsenticorp_v2/export/ --device cpu --backend paddle +# GPU 推理 +python seq_cls_infer.py --model_dir ../tmp/chnsenticorp_v2/export/ --device gpu --backend paddle +``` + +运行完成后返回的结果如下: + +```bash +[2023-02-26 13:38:46,370] [ INFO] - We are using to load '../../finetune/tmp/chnsenticorp_v2/export/'. +[INFO] fastdeploy/runtime/runtime.cc(266)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. +Batch id: 0, example id: 0, sentence: 这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般, label: negative, negative prob: 0.9999, positive prob: 0.0001. +Batch id: 1, example id: 0, sentence: 怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片!开始还怀疑是不是赠送的个别现象,可是后来发现每张DVD后面都有!真不知道生产商怎么想的,我想看的是猫和老鼠,不是米老鼠!如果厂家是想赠送的话,那就全套米老鼠和唐老鸭都赠送,只在每张DVD后面添加一集算什么??简直是画蛇添足!!, label: negative, negative prob: 0.9998, positive prob: 0.0002. +Batch id: 2, example id: 0, sentence: 还稍微重了点,可能是硬盘大的原故,还要再轻半斤就好了。其他要进一步验证。贴的几种膜气泡较多,用不了多久就要更换了,屏幕膜稍好点,但比没有要强多了。建议配赠几张膜让用用户自己贴。, label: negative, negative prob: 0.9999, positive prob: 0.0001. +...... 
```

## 参数说明

| 参数 | 参数说明 |
|----------|--------------|
|--model_dir | 指定部署模型的目录 |
|--batch_size | 输入的batch size,默认为 1 |
|--max_length | 最大序列长度,默认为 128 |
|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' |
|--device_id | 运行设备的id,默认为0 |
|--cpu_threads | 当使用cpu推理时,指定推理的cpu线程数,默认为1 |
|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' |
|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False |
|--use_fast | 是否使用FastTokenizer加速分词阶段,默认为True |

## FastDeploy 高阶用法

FastDeploy 在 Python 端上,提供 `fastdeploy.RuntimeOption.use_xxx()` 以及 `fastdeploy.RuntimeOption.use_xxx_backend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 ERNIE 1.0 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 ERNIE 1.0 模型。

符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持;

| 硬件 | 硬件对应的接口 | 可用的推理引擎 | 推理引擎对应的接口 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
|------|----------------|----------------|--------------------|-------------------------------|-------------------|
| CPU | use_cpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
| CPU | use_cpu() | ONNX Runtime | use_ort_backend() | ✅ | N/A |
| CPU | use_cpu() | OpenVINO | use_openvino_backend() | ✅ | N/A |
| GPU | use_gpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
| GPU | use_gpu() | ONNX Runtime | use_ort_backend() | ✅ | ✅ |
| GPU | use_gpu() | Paddle TensorRT | use_paddle_infer_backend() + paddle_infer_option.enable_trt = True | ✅ | ✅ |
| GPU | use_gpu() | TensorRT | use_trt_backend() | ✅ | ✅ |
| 昆仑芯 XPU | use_kunlunxin() | Paddle Lite | use_paddle_lite_backend() | N/A | ✅ |
| 华为 昇腾 | use_ascend() | Paddle Lite | use_paddle_lite_backend() | ❔ | ✅ |
| Graphcore IPU | use_ipu() | Paddle Inference | use_paddle_infer_backend() | ❔ | N/A |
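上表中的硬件与推理引擎接口,可以参考下面的示意代码在 Python 端自行组合(仅为草图,不是本目录脚本的固定实现;其中模型路径为假设值,请按实际导出目录修改):

```python
import fastdeploy as fd

option = fd.RuntimeOption()
# 假设部署模型已按训练文档导出到 ../tmp/chnsenticorp_v2/export/ 目录
option.set_model_path(
    "../tmp/chnsenticorp_v2/export/model.pdmodel",
    "../tmp/chnsenticorp_v2/export/model.pdiparams",
)

# 1. 选择硬件(对应表中“硬件对应的接口”一列)
option.use_gpu(0)                  # 也可改为 option.use_cpu()

# 2. 选择推理引擎后端(对应表中“推理引擎对应的接口”一列)
option.use_paddle_infer_backend()  # 也可改为 option.use_ort_backend() 等

runtime = fd.Runtime(option)
```

具体的前后处理逻辑可参考下方 `seq_cls_infer.py` 中的 `Predictor` 类。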
diff --git a/model_zoo/ernie-1.0/finetune/deploy/seq_cls_infer.py b/model_zoo/ernie-1.0/finetune/deploy/seq_cls_infer.py new file mode 100644 index 0000000000000000000000000000000000000000..5232e09505b38f13fb5e062522daea1504a3c81e --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/deploy/seq_cls_infer.py @@ -0,0 +1,155 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import distutils.util +import os + +import fastdeploy as fd +import numpy as np + +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument("--cpu_threads", type=int, default=1, help="Number of threads to predict when using cpu.") + parser.add_argument("--device_id", type=int, default=0, help="Select which gpu device to train model.") + parser.add_argument( + "--use_fast", + type=distutils.util.strtobool, + default=True, + help="Whether to use fast_tokenizer to accelarate the tokenization.", + ) + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=args.use_fast) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + option.set_cpu_thread_num(args.cpu_threads) + else: + 
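            # GPU path: bind the selected device id first, then choose the inference backend below.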
option.use_gpu(args.device_id) + if args.backend == "paddle": + option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.use_paddle_infer_backend() + option.paddle_infer_option.collect_trt_shape = True + option.paddle_infer_option.enable_trt = True + trt_file = os.path.join(args.model_dir, "model.trt") + option.trt_option.set_shape( + "input_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + option.trt_option.set_shape( + "token_type_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + if args.use_fp16: + option.trt_option.enable_fp16 = True + trt_file = trt_file + ".fp16" + option.trt_option.serialize_file = trt_file + return fd.Runtime(option) + + def preprocess(self, text): + data = self.tokenizer(text, max_length=self.max_length, padding=True, truncation=True) + input_ids_name = self.runtime.get_input_info(0).name + token_type_ids_name = self.runtime.get_input_info(1).name + input_map = { + input_ids_name: np.array(data["input_ids"], dtype="int64"), + token_type_ids_name: np.array(data["token_type_ids"], dtype="int64"), + } + return input_map + + def infer(self, input_map): + results = self.runtime.infer(input_map) + return results + + def postprocess(self, infer_data): + logits = np.array(infer_data[0]) + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1), "confidence": probs} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + test_ds = load_dataset("chnsenticorp", splits=["test"]) + texts_ds = [d["text"] for d in test_ds] + label_map = {0: "negative", 1: "positive"} + batch_texts = batchfy_text(texts_ds, args.batch_size) + for bs, texts in enumerate(batch_texts): + outputs = predictor.predict(texts) + for i, sentence1 in enumerate(texts): + print( + f"Batch id: {bs}, example id: {i}, sentence: {sentence1}, " + f"label: {label_map[outputs['label'][i]]}, negative prob: {outputs['confidence'][i][0]:.4f}, " + f"positive prob: {outputs['confidence'][i][1]:.4f}." + ) diff --git a/model_zoo/ernie-1.0/finetune/question_answering.py b/model_zoo/ernie-1.0/finetune/question_answering.py new file mode 100644 index 0000000000000000000000000000000000000000..010b07676b2e7cff2b0bc14752f79138c78743aa --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/question_answering.py @@ -0,0 +1,219 @@ +# Copyright 2020-present the HuggingFace Inc. team. +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
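# This module provides the pieces used by run_qa.py for extractive (SQuAD-style) QA:
# - QuestionAnsweringTrainer: hooks a post_process_function into evaluate()/predict(),
#   so start/end logits are decoded into text answers before metrics are computed.
# - CrossEntropyLossForSQuAD: averages the cross entropy of the start and end positions.
# - prepare_train_features / prepare_validation_features: split long contexts into
#   overlapping chunks with `doc_stride` and align answer spans to token positions.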
+ +import paddle + +from paddlenlp.trainer import PredictionOutput, Trainer + + +class QuestionAnsweringTrainer(Trainer): + def __init__(self, *args, eval_examples=None, post_process_function=None, **kwargs): + super().__init__(*args, **kwargs) + self.eval_examples = eval_examples + self.post_process_function = post_process_function + + def evaluate(self, eval_dataset=None, eval_examples=None, ignore_keys=None, metric_key_prefix: str = "eval"): + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + eval_examples = self.eval_examples if eval_examples is None else eval_examples + + # Temporarily disable metric computation, we will do it in the loop here. + compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is not None and self.compute_metrics is not None: + eval_preds = self.post_process_function(eval_examples, eval_dataset, output.predictions) + metrics = self.compute_metrics(eval_preds) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + self.log(metrics) + else: + metrics = {} + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, metrics) + return metrics + + def predict(self, predict_dataset, predict_examples, ignore_keys=None, metric_key_prefix: str = "test"): + predict_dataloader = self.get_test_dataloader(predict_dataset) + + # Temporarily disable metric computation, we will do it in the loop here. 
+ compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + predict_dataloader, + description="Prediction", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is None or self.compute_metrics is None: + return output + + predictions = self.post_process_function(predict_examples, predict_dataset, output.predictions, "predict") + metrics = self.compute_metrics(predictions) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + return PredictionOutput(predictions=predictions.predictions, label_ids=predictions.label_ids, metrics=metrics) + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +def prepare_train_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. 
+ if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HuggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_length, return_attention_mask=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. 
+ + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index and k != len(sequence_ids) - 1 else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples diff --git a/model_zoo/ernie-1.0/finetune/run_ner.py b/model_zoo/ernie-1.0/finetune/run_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..f754c83684cb84e506be5d6306431b22b717039a --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/run_ner.py @@ -0,0 +1,208 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from datasets import load_metric + +import paddlenlp +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.insert(0, os.path.abspath(".")) +from token_classification import ner_trans_fn # noqa: E402 +from utils import ALL_DATASETS, DataArguments, ModelArguments # noqa: E402 + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # set_seed(args) + data_args.dataset = data_args.dataset.strip() + if data_args.dataset not in ALL_DATASETS: + raise ValueError("Not found dataset {}".format(data_args.dataset)) + + if data_args.dataset in ALL_DATASETS: + # if you custom you hyper-parameters in yaml config, it will overwrite all args. 
+ config = ALL_DATASETS[data_args.dataset] + for args in (model_args, data_args, training_args): + for arg in vars(args): + if arg in config.keys(): + setattr(args, arg, config[arg]) + + training_args.per_device_train_batch_size = config["batch_size"] + training_args.per_device_eval_batch_size = config["batch_size"] + + dataset_config = data_args.dataset.split(" ") + raw_datasets = load_dataset( + dataset_config[0], + None if len(dataset_config) <= 1 else dataset_config[1], + ) + + label_list = getattr(raw_datasets["train"], "label_list", None) + data_args.label_list = label_list + data_args.ignore_label = -100 + data_args.no_entity_id = len(data_args.label_list) - 1 + + num_classes = 1 if raw_datasets["train"].label_list is None else len(raw_datasets["train"].label_list) + + # Define tokenizer, model, loss function. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForTokenClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + + class criterion(nn.Layer): + def __init__(self): + super(criterion, self).__init__() + self.loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=data_args.ignore_label) + + def forward(self, *args, **kwargs): + return paddle.mean(self.loss_fn(*args, **kwargs)) + + loss_fct = criterion() + + # Define dataset pre-process function + trans_fn = partial(ner_trans_fn, tokenizer=tokenizer, args=data_args) + # Define data collector + data_collator = DataCollatorForTokenClassification(tokenizer) + + # Dataset pre-process + if training_args.do_train: + train_dataset = raw_datasets["train"].map(trans_fn) + if training_args.do_eval: + eval_dataset = raw_datasets["dev"].map(trans_fn) + if training_args.do_predict: + test_dataset = raw_datasets["test"].map(trans_fn) + + # Define the metrics of tasks. 
+ # Metrics + metric = load_metric("seqeval") + + def compute_metrics(p): + predictions, labels = p + predictions = np.argmax(predictions, axis=2) + + # Remove ignored index (special tokens) + true_predictions = [ + [label_list[p] for (p, l) in zip(prediction, label) if l != -100] + for prediction, label in zip(predictions, labels) + ] + true_labels = [ + [label_list[l] for (p, l) in zip(prediction, label) if l != -100] + for prediction, label in zip(predictions, labels) + ] + results = metric.compute(predictions=true_predictions, references=true_labels) + return { + "precision": results["overall_precision"], + "recall": results["overall_recall"], + "f1": results["overall_f1"], + "accuracy": results["overall_accuracy"], + } + + trainer = Trainer( + model=model, + criterion=loss_fct, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + if test_ret.label_ids is None: + paddle.save( + test_ret.predictions, + os.path.join(training_args.output_dir, "test_results.pdtensor"), + ) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/finetune/run_qa.py b/model_zoo/ernie-1.0/finetune/run_qa.py new file mode 100644 index 0000000000000000000000000000000000000000..d09436b737e3f5afe59b72c911a8277312a6ea49 --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/run_qa.py @@ -0,0 +1,239 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
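# run_qa.py: entry point for extractive question answering fine-tuning (e.g. cmrc2018).
# Hyper-parameters from the QuestionAnswering section of config.yml override the CLI
# defaults; training, evaluation and prediction are driven by QuestionAnsweringTrainer,
# and the trained model can optionally be exported for inference with --do_export.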
+ +import os +import sys +from functools import partial + +import paddle +from datasets import load_dataset + +import paddlenlp +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.trainer import ( + EvalPrediction, + PdArgumentParser, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import AutoModelForQuestionAnswering, AutoTokenizer +from paddlenlp.utils.log import logger + +sys.path.insert(0, os.path.abspath(".")) +from question_answering import ( # noqa: E402 + CrossEntropyLossForSQuAD, + QuestionAnsweringTrainer, + prepare_train_features, + prepare_validation_features, +) +from utils import ALL_DATASETS, DataArguments, ModelArguments # noqa: E402 + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # set_seed(args) + data_args.dataset = data_args.dataset.strip() + + if data_args.dataset in ALL_DATASETS: + # if you custom you hyper-parameters in yaml config, it will overwrite all args. + config = ALL_DATASETS[data_args.dataset] + for args in (model_args, data_args, training_args): + for arg in vars(args): + if arg in config.keys(): + setattr(args, arg, config[arg]) + + training_args.per_device_train_batch_size = config["batch_size"] + training_args.per_device_eval_batch_size = config["batch_size"] + + dataset_config = data_args.dataset.split(" ") + raw_datasets = load_dataset( + dataset_config[0], None if len(dataset_config) <= 1 else dataset_config[1], cache_dir=model_args.cache_dir + ) + + label_list = getattr(raw_datasets["train"], "label_list", None) + data_args.label_list = label_list + + # Define tokenizer, model, loss function. + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForQuestionAnswering.from_pretrained(model_args.model_name_or_path) + + loss_fct = CrossEntropyLossForSQuAD() + + # Preprocessing the datasets. + # Preprocessing is slighlty different for training and evaluation. 
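    # Training data is processed with prepare_train_features (which labels the answer span),
    # while evaluation/prediction use prepare_validation_features (which keeps offset
    # mappings and example ids so predictions can be mapped back to the original context).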
+ if training_args.do_train: + column_names = raw_datasets["train"].column_names + elif training_args.do_eval: + column_names = raw_datasets["validation"].column_names + else: + column_names = raw_datasets["test"].column_names + + if training_args.do_train: + train_dataset = raw_datasets["train"] + # Create train feature from dataset + with training_args.main_process_first(desc="train dataset map pre-processing"): + # Dataset pre-process + train_dataset = train_dataset.map( + partial(prepare_train_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + + if training_args.do_eval: + eval_examples = raw_datasets["validation"] + with training_args.main_process_first(desc="evaluate dataset map pre-processing"): + eval_dataset = eval_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + if training_args.do_predict: + predict_examples = raw_datasets["test"] + with training_args.main_process_first(desc="test dataset map pre-processing"): + predict_dataset = predict_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on prediction dataset", + ) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + + # Post-processing: + def post_processing_function(examples, features, predictions, stage="eval"): + # Post-processing: we match the start logits and end logits to answers in the original context. + predictions, all_nbest_json, scores_diff_json = compute_prediction( + examples=examples, + features=features, + predictions=predictions, + n_best_size=data_args.n_best_size, + max_answer_length=data_args.max_answer_length, + null_score_diff_threshold=data_args.null_score_diff_threshold, + ) + + # # Format the result to the format the metric expects. 
+ # formatted_predictions = [{ + # "id": k, + # "prediction_text": v + # } for k, v in predictions.items()] + + references = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples] + return EvalPrediction(predictions=predictions, label_ids=references) + + def compute_metrics(p: EvalPrediction): + ret = squad_evaluate(examples=p.label_ids, preds=p.predictions, is_whitespace_splited=False) + return dict(ret) + # return metric.compute(predictions=p.predictions, references=p.label_ids) + + trainer = QuestionAnsweringTrainer( + model=model, + criterion=loss_fct, + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + eval_examples=eval_examples if training_args.do_eval else None, + data_collator=data_collator, + post_process_function=post_processing_function, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + if training_args.do_train: + # Training + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # model.set_state_dict(paddle.load("tmp/model_state.pdparams")) + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(predict_dataset, predict_examples) + trainer.log_metrics("predict", test_ret.metrics) + + if test_ret.label_ids is None: + paddle.save( + test_ret.predictions, + os.path.join(training_args.output_dir, "test_results.pdtensor"), + ) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/finetune/run_seq_cls.py b/model_zoo/ernie-1.0/finetune/run_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..2d85030c4ad6a14cb40bb7b90abd7429c7755c2a --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/run_seq_cls.py @@ -0,0 +1,185 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
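# run_seq_cls.py: entry point for sequence classification fine-tuning (e.g. chnsenticorp_v2
# or CLUE tasks). Hyper-parameters from the SequenceClassification section of config.yml
# override the CLI defaults; the fine-tuned model can be exported with --do_export.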
+ +import os +from functools import partial + +import paddle +import paddle.nn as nn +from paddle.metric import Accuracy +from sequence_classification import clue_trans_fn, seq_trans_fn +from utils import ALL_DATASETS, DataArguments, ModelArguments + +import paddlenlp +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + data_args.dataset = data_args.dataset.strip() + + if data_args.dataset in ALL_DATASETS: + # if you custom you hyper-parameters in yaml config, it will overwrite all args. + config = ALL_DATASETS[data_args.dataset] + logger.info("Over-writing training config by yaml config!") + for args in (model_args, data_args, training_args): + for arg in vars(args): + if arg in config.keys(): + setattr(args, arg, config[arg]) + + training_args.per_device_train_batch_size = config["batch_size"] + training_args.per_device_eval_batch_size = config["batch_size"] + + dataset_config = data_args.dataset.split(" ") + raw_datasets = load_dataset( + dataset_config[0], + None if len(dataset_config) <= 1 else dataset_config[1], + ) + + data_args.label_list = getattr(raw_datasets["train"], "label_list", None) + num_classes = 1 if raw_datasets["train"].label_list is None else len(raw_datasets["train"].label_list) + + # Define tokenizer, model, loss function. 
+ tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + criterion = nn.loss.CrossEntropyLoss() if data_args.label_list else nn.loss.MSELoss() + + # Define dataset pre-process function + if "clue" in data_args.dataset: + trans_fn = partial(clue_trans_fn, tokenizer=tokenizer, args=data_args) + else: + trans_fn = partial(seq_trans_fn, tokenizer=tokenizer, args=data_args) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + + # Dataset pre-process + if training_args.do_train: + train_dataset = raw_datasets["train"].map(trans_fn) + if training_args.do_eval: + eval_dataset = raw_datasets["dev"].map(trans_fn) + if training_args.do_predict: + test_dataset = raw_datasets["test"].map(trans_fn) + + # Define the metrics of tasks. + def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric = Accuracy() + metric.reset() + result = metric.compute(preds, label) + metric.update(result) + accu = metric.accumulate() + metric.reset() + return {"accuracy": accu} + + trainer = Trainer( + model=model, + criterion=criterion, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + if test_ret.label_ids is None: + paddle.save( + test_ret.predictions, + os.path.join(training_args.output_dir, "test_results.pdtensor"), + ) + + # export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + if model_args.export_model_dir is None: + model_args.export_model_dir = os.path.join(training_args.output_dir, "export") + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + trainer.tokenizer.save_pretrained(model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/finetune/sequence_classification.py b/model_zoo/ernie-1.0/finetune/sequence_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..3952520c65a3fc86525d8a55d44a70f90903df13 --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/sequence_classification.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + is_test = True + if "label" in example.keys(): + is_test = False + + if "text_b" in example.keys(): + text = example["text_a"] + text_pair = example["text_b"] + else: + text = example["text"] + text_pair = None + + encoded_inputs = tokenizer(text=text, text_pair=text_pair, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if is_test: + return { + "input_ids": input_ids, + "token_type_ids": token_type_ids, + } + else: + # label = np.array([example["label"]], dtype="int64") + label = int(example["label"]) + return {"input_ids": input_ids, "token_type_ids": token_type_ids, "labels": label} + + +# Data pre-process function for clue benchmark datatset +def convert_clue(example, label_list, tokenizer=None, max_seq_length=512, **kwargs): + """convert a glue example into necessary features""" + is_test = False + if "label" not in example.keys(): + is_test = True + + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + example["label"] = int(example["label"]) if label_dtype != "float32" else float(example["label"]) + label = example["label"] + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + + if tokenizer is None: + return example + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + if "token_type_ids" in example: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], 
"token_type_ids": example["token_type_ids"]} + + +def seq_trans_fn(example, tokenizer, args): + return convert_example( + example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + ) + + +def clue_trans_fn(example, tokenizer, args): + return convert_clue(example, tokenizer=tokenizer, label_list=args.label_list, max_seq_length=args.max_seq_length) diff --git a/model_zoo/ernie-1.0/finetune/token_classification.py b/model_zoo/ernie-1.0/finetune/token_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..2d4bd0fad78af00905c4dafc85275d876333cf86 --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/token_classification.py @@ -0,0 +1,57 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +def tokenize_and_align_labels(example, tokenizer, no_entity_id, max_seq_len=512): + if "labels" in example: + labels = example["labels"] + example = example["tokens"] + tokenized_input = tokenizer(example, is_split_into_words=True, max_seq_len=max_seq_len, return_length=False) + + # -2 for [CLS] and [SEP] + if len(tokenized_input["input_ids"]) - 2 < len(labels): + labels = labels[: len(tokenized_input["input_ids"]) - 2] + tokenized_input["labels"] = [no_entity_id] + labels + [no_entity_id] + tokenized_input["labels"] += [no_entity_id] * ( + len(tokenized_input["input_ids"]) - len(tokenized_input["labels"]) + ) + else: + if example["tokens"] == []: + tokenized_input = { + "labels": [], + "input_ids": [], + "token_type_ids": [], + } + return tokenized_input + tokenized_input = tokenizer( + example["tokens"], + max_seq_len=max_seq_len, + # We use this argument because the texts in our dataset are lists of words (with a label for each word). + is_split_into_words=True, + return_length=False, + ) + label_ids = example["ner_tags"] + if len(tokenized_input["input_ids"]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_input["input_ids"]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + + label_ids += [no_entity_id] * (len(tokenized_input["input_ids"]) - len(label_ids)) + tokenized_input["labels"] = label_ids + return tokenized_input + + +def ner_trans_fn(example, tokenizer, args): + return tokenize_and_align_labels( + example, tokenizer=tokenizer, no_entity_id=args.no_entity_id, max_seq_len=args.max_seq_length + ) diff --git a/model_zoo/ernie-1.0/finetune/utils.py b/model_zoo/ernie-1.0/finetune/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..398d144e17447d409a00c4e6deb6660db108ecff --- /dev/null +++ b/model_zoo/ernie-1.0/finetune/utils.py @@ -0,0 +1,164 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import copy +import os.path as osp +from dataclasses import dataclass, field +from typing import Optional + +import yaml + +TASKS = [ + "SequenceClassification", + "TokenClassification", + "QuestionAnswering", +] + +config = yaml.load(open(osp.join(osp.abspath("."), "./config.yml"), "r"), Loader=yaml.FullLoader) +default_args = config["DefaultArgs"] + +ALL_DATASETS = {} + +for task_type in TASKS: + task = config[task_type] + for data_name in task.keys(): + new_args = task[data_name] + new_args = {} if new_args is None else new_args + final_args = copy.deepcopy(default_args) + final_args.update(new_args) + final_args["model"] = "AutoModelFor{}".format(task_type) + ALL_DATASETS[data_name] = final_args + + +class Dict(object): + def __init__(self, fn): + assert isinstance(fn, (dict)), ( + "Input pattern not understood. The input of Dict must be a dict with key of input column name and value of collate_fn " + "Received fn=%s" % (str(fn)) + ) + + self._fn = fn + + for col_name, ele_fn in self._fn.items(): + assert callable(ele_fn), "Batchify functions must be callable! type(fn[%d]) = %s" % ( + col_name, + str(type(ele_fn)), + ) + + def __call__(self, data): + + ret = {} + if len(data) <= 0: + return ret + + for col_name, ele_fn in self._fn.items(): + # skip unused col_name, such as labels in test mode. + if col_name not in data[0].keys(): + continue + result = ele_fn([ele[col_name] for ele in data]) + ret[col_name] = result + + return ret + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + dataset: str = field(default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + # Additional configs for QA task. + doc_stride: int = field( + default=128, + metadata={"help": "When splitting up a long document into chunks, how much stride to take between chunks."}, + ) + + n_best_size: int = field( + default=20, + metadata={ + "help": "The total number of n-best predictions to generate in the nbest_predictions.json output file." + }, + ) + + max_query_length: int = field( + default=64, + metadata={"help": "Max query length."}, + ) + + max_answer_length: int = field( + default=30, + metadata={"help": "Max answer length."}, + ) + + do_lower_case: bool = field( + default=False, + metadata={ + "help": "Whether to lower case the input text. Should be True for uncased models and False for cased models." 
+ }, + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} + ) + preprocessing_num_workers: Optional[int] = field( + default=None, + metadata={"help": "The number of processes to use for the preprocessing."}, + ) + null_score_diff_threshold: float = field( + default=0.0, + metadata={ + "help": "The threshold used to select the null answer: if the best answer has a score that is less than " + "the score of the null answer minus this threshold, the null answer is selected for this example. " + "Only useful when `version_2_with_negative=True`." + }, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: str = field( + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + } + ) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + cache_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the dataset cache."}, + ) + export_model_dir: Optional[str] = field( + default=None, + metadata={"help": "Path to directory to store the exported inference model."}, + ) diff --git a/model_zoo/ernie-1.0/preprocess/README.md b/model_zoo/ernie-1.0/preprocess/README.md new file mode 100644 index 0000000000000000000000000000000000000000..aab9319c731cb7c832f374176bb0ee5b32f4ee90 --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/README.md @@ -0,0 +1,259 @@ +# PaddleNLP 预训练数据流程 + +本示例致力于打造基于PaddleNLP预训练模型的最佳实践。 + + +我们将预训练数据过程划分为以下部分 + +- 原始数据转换,原始文本转换为jsonl的json字符串格式。 +- 数据ID化,断句、分词、tokenize转化为token id格式。 +- 训练index文件生成,生成train、valid、test的每个样本索引。 +- token动态mask(可选),python 层实时mask文本。 + +本目录下主要包含一下文件: +``` +├── create_pretraining_data.py +├── merge.py +├── trans_to_json.py +├── words_segmentation.py +└── README.md +``` + +### 环境依赖 + + - tqdm + - numpy + - pybind11 + - tool_helpers + - lac (可选) + - zstandard (可选) + +安装命令`pip install tqdm numpy pybind11 tool_helpers lac zstandard`。另,部分功能需要`g++>=4.8`编译支持 + + +## 训练全流程数据Pipeline + +飞桨是自主研发、功能完备、开源开放的产业级深度学习平台,集深度学习核心训练和推理框架、基础模型库、端到端开发套件和丰富的工具组件于一体 + +|步骤|阶段                     |数据格式| 样例| +|-|-|-|-| +| 0️⃣初始状态 | -|原始数据:
**每个doc之间用空行间隔开**
- 中文,默认每句换行符,作为句子结束。
- 英文,默认使用nltk判断句子结束 | ```飞桨是功能完备、开源开放的产业级深度学习平台。```
```飞桨拥有核心训练和推理框架、基础模型库。```

```PaddleNLP是自然语言处理领域的优秀工具。``` | +|1️⃣原始数据转换
`trans_to_json.py`|预处理
输入:0️⃣初始状态
输出:jsonl|jsonl格式:每个doc对应一行json字符串| ```{"text": "飞桨是功能完备、开源开放的产业级深度学习平台。飞桨拥有..."}```
```{"text": "PaddleNLP是自然语言..."}``` +|❇️(**可选**)数据中文分词
`words_segmentation.py`|语料分词:中文WWM
输入:jsonl
输出:0️⃣初始状态| 将jsonl格式的数据,恢复成分词后的原始格式数据
| ```飞桨 是 功能 完备、开源 开放的 产业级 深度学习 平台。```
```飞桨 拥有 核心 训练和推理 框架、基础 模型库。```

```PaddleNLP 是 自然语言处理领域 的 优秀工具。``` +|2️⃣数据ID化
`create_pretrain_data.py`|预处理| bin格式:数据id化后的token id
idx格式:数据句子、文章位置索引 | - +|3️⃣训练index文件生成|训练启动|npy格式:
根据训练步数max_steps生成
train、valid、test的每个样本索引文件| - +|4️⃣token动态mask(可选)| Dataset取数据 | 无 |- + + +注意: +- **❇️(**可选**)数据中文分词** 是中文预训练做 WWM 的可选步骤 + - 当你的数据比较少时,分词耗时较少,不需要分词步骤。直接在`create_pretrain_data.py`步骤中分词即可。 + - 目的是为了提前分词,加快后续数据ID转化步骤。 + - 如果这里输入的是 jsonl格式文件,最好为多文件,`trans_to_json.py` 时候开启`no-merge`选项。 + - 当你的数据集比较大,或者需要尝试多次转换数据的时候,提前分词可以避免`create_pretrain_data.py`时每次都运行一次分词程序。 +- 转换后,需要重新进行步骤 1️⃣`原始数据转换 trans_to_json.py`,最后2️⃣`数据ID化`步骤设置`--cn_splited=True`参数。 +- 2️⃣`数据ID化`也可以在转化ID的同时,一起实现分词。不需要❇️`数据中文分词`步骤。 + + +## 数据教程汇总 + +针对目前开源的数据集,PaddleNLP提供了详细的数据教程,点击对应数据集的链接,即可开始进行数据制作: + +| 名称 | 文本类型 | 纯文本大小 | 适配模型 +|-|-|-|-| +| [CLUECorpusSmall](./docs/CLUECorpusSmall.md)| 中文 | 14GB | Llama +| [OpenWebText2](./docs/OpenWebText2.md) | 英文 | 70GB | Llama +| [WuDaoCorpus2.0 Base](./docs/WuDaoCorpusBase.md)| 中文 | 200GB | Llama +| [CLUECorpus2020](./docs/CLUECorpus2020.md)| 中文 | 200GB | Llama + +## 预训练详细准备 + +下面以ziya-llama-13b-v1预训练为例,简要介绍一下预训练的全流程。 + +### 原始数据 +首先下载样例数据: +``` +mkdir data && cd data +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/baike.txt +cd .. +``` + +### 原始数据转换 jsonl 格式 +使用`trans_to_json.py`转化为json串格式,下面是脚本的使用说明 +``` +optional arguments: + -h, --help show this help message and exit + --input_path INPUT_PATH + Path to you raw files. Folder or file path. + 必须设置,可以是文件夹或者单个文件。文件夹中的目录默认最多搜索两层子目录。 + --output_path OUTPUT_PATH + Path to save the output json files. + 必须设置,输出文件的名字。 + --json_key JSON_KEY The content key of json file. + 建议不修改,默认的key是text + --doc_spliter DOC_SPLITER + Spliter between documents. We will strip the line, if you use blank line to split doc, leave it blank. + 根据实际情况修改,默认空行作为文章换行符。 + --min_doc_length MIN_DOC_LENGTH + Minimal char of a documment. + 可选。过滤掉长度多短的文章,默认值10 + --workers WORKERS Number of worker processes to launch + 可选。多进程转化文件,适用于 input_path 中包含的文件数据较多的情况。每个文件,分配给不同worker处理 + --log_interval LOG_INTERVAL + Interval between progress updates. + 可选。此处的interval是值处理完文件个数的间隔。 + --no-merge Don't merge the file. + 可选。默认不开启这个选项,默认每个文件转换的jsonl文本,会拼接成到同一个文件。 + --no-shuffle Don't shuffle the file. + 可选。默认不开启这个选项,默认对处理完进行shuffle。 +``` +根据说明,我们使用下面简单命令,可以得到`baike_sample.jsonl`文件。此处,我们对文章所有doc进行了shuffle。 +```shell +python trans_to_json.py --input_path ./data --output_path baike_sample +``` + +```shell +#查看数据 +head -1 baike_sample.jsonl +{"text": "中国效仿西方发展工业的过程,于中华民国国民政府成立后至中日战争开战前夕已顺畅发展,尽管其间受到内外因素的多重干扰。尔后直至中日战争和国共战争的结束, +中国始有较为长期的和平发展时期。\n1980年代以来,邓小平政府宣布改革开放,开始实行社会主义市场经济并推行经济体制改革。中国大陆近年至2010年,GDP超过72000亿美元, +已经成为美国之后的世界第二经济大国,普遍认为中国是世界上发展速度最快的经济体,但是人均国民生产总值仍位于世界中等水平(第89位),并逐渐受到资源限制和贫富差距加 +大的制约。中华人民共和国省份中,广东为GDP最高的第一强省,浙江为人均收入最高的第一富省。中国大陆、香港、澳门、台湾之间的经济联系在全球化的过程中日益紧密。\n"} +``` + +### 数据ID化 +本部分,我们使用 `create_pretraining_data.py` 脚本将前面得到的 `baike_sample.jsonl` 进行tokenize id化处理。 +``` +optional arguments: + -h, --help show this help message and exit + --model_name MODEL_NAME + What model to use. + 必须设置,如:idea-ccnl/ziya-llama-13b-v1, 可以参考已有的模型名称 https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/llama/README.md + --tokenizer_name {LlamaTokenizer} + What type of tokenizer to use. + 模型对应的tokenizer, Llama模型需使用LlamaTokenizer +data input/output: + --input_path INPUT_PATH + Path to input JSON files. + 必须设置,输入文件jsonl的目录 + --output_prefix OUTPUT_PREFIX + Output prefix to store output file. + 必须设置,输出文件的名称。 + 假设名称为XXX,则会输出 XXX.bin, XXX.idx 两个文件。 + bin文件,数据id化后的token ids; idx文件,数据句子、文章位置索引。 + --data_format {JSON} Only support json format for now. One document per line. + 不需要设置。目前默认处理jsonl数据格式 + --json_key JSON_KEY For JSON format. 
Space separate listed of keys to extract from json + 文本串json的key值。同前面trans_to_json.py的json_key,默认text为key + --split_sentences Split documents into sentences. + 是否需要将文章划分成句子。一般而言,GPT不需要,BERT/ERNIE模型需要 + --data_impl {mmap,lazy} + Convert the json into mmap/lazy format. + 处理后的数据格式,可选“mmap”或“lazy”,其中“mmap”格式在读入数据时会建立内存映射,“lazy”格式在读入数据时直接从文件读取。 + +chinese words: + --chinese Is corpus need words segmentation step for chinese words. + 若设置了split_sentences,并处理中文则需要设置。 + --cn_whole_word_segment + Is corpus need words segmentation step for chinese words WWM. + 可选。是否需要WWM策略。一般而言,BERT/ERNIE模型需要,GPT不需要。 + --cn_seg_func {lac,seg,jieba} + Words segment function for chinese words. + 默认jieba,jieba速度较快,lac模型更准确,计算量高。 + --cn_splited Is chinese corpus is splited in to words. + 分词后的文本,可选。设置此选项则,cn_seg_func不起作用。 + 例如分词后文本串 "中国 效仿 西方 发展 工业 的过 程" + --cn_split_dimer CN_SPLIT_DIMER + Split dimer between chinese words. + 配合cn_splited使用,默认空格表示分词间隔。 + +common config: + --append_eos Append an token to the end of a document. + gpt类模型专用,gpt设置此选项,表示doc结束。针对tokenier中不包含eos_token情况,输出提示warning并且不添加。 + --log_interval LOG_INTERVAL + Interval between progress updates + 打印日志间隔,interval表示处理 文本行数/doc数的 间隔。 + --workers WORKERS Number of worker processes to launch + 处理文本id化的进程个数。 +``` +通过下面脚本转化,我们可以得到处理好的预训练数据,token ids:`baike_sample.bin`, 文章索引信息`baike_sample.idx`. + +* 针对 llama 模型 +```shell +python -u create_pretraining_data.py \ + --model_name "idea-ccnl/ziya-llama-13b-v1" \ + --tokenizer_name "LlamaTokenizer" \ + --input_path "baike_sample.jsonl" \ + --output_prefix "baike_sample" \ + --data_format "JSON" \ + --json_key "text" \ + --data_impl "mmap" \ + --cn_seg_func "jieba" \ + --append_eos \ + --log_interval 5 \ + --workers 40 + +``` + +* 针对 ernie 模型 +```shell +python -u create_pretraining_data.py \ + --model_name "ernie-3.0-base-zh" \ + --tokenizer_name "ErnieTokenizer" \ + --input_path "baike_sample.jsonl" \ + --output_prefix "baike_sample" \ + --data_format "JSON" \ + --json_key "text" \ + --split_sentences \ + --data_impl "mmap" \ + --chinese \ + --cn_whole_word_segment \ + --cn_seg_func "jieba" \ + --log_interval 5 \ + --workers 40 +``` +1. 如果您使用已经分好词的语料,可以设置 --cn_splited 为 True,同时指定--cn_split_dimer如空格。 +2. 使用自定义词表的话,请指定model_name为词表所在的文件夹地址。 + +若需要预处理的文件过大,该脚本所耗费的时间可能会很长。此时可以考虑将jsonl文件拆分为多个小文件,并行使用create_pretraining_data.py进行处理,得到多个.bin & .idx文件。 +之后使用如下merge脚本合并多个小的.bin & .idx文件。 +``` +python merge.py \ + --input /root/data \ + --output-prefix /root/data/merged \ + --data_impl mmap +``` +使用说明: +``` +arguments: + --input INPUT_PATH + Path to the folder where the files to be merged. + 待合并的文件所在文件夹,文件夹内各个小文件需按merge的顺序排列,如1.bin / 1.idx,2.bin / 2.idx... + --output_prefix OUTPUT_PREFIX + Output prefix to store output file. + 合并后输出文件的名称,假设名称为XXX,则会输出 XXX.bin, XXX.idx 两个文件。 + --data_impl {mmap,lazy} + Convert the json into mmap/lazy format. 
+ merge前后的数据格式,可选“mmap”或“lazy,各个待merge的文件需格式一致。”。 +``` + +### 预训练开始 +得到了处理好的训练数据,就可以开始模型的预训练了。简单将预处理好的数据,拷贝到data目录,即可开始预训练。 +```shell +mkdir data +mv ./preprocess/baike_sample* ./data +``` + +* llama预训练请参考[预训练](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/llama/README.md)。 +* ernie预训练请参考[预训练](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-1.0/pretraining_introduction.md)。 + + +代码说明: +- 动态mask相关代码实现在`./data_tools/dataset_utils.py` + 用户可以根据自己的需求,灵活修改mask方式。具体可以参考`dataset_utils.py`中`create_masked_lm_predictions`函数。 + 可以自定义的选项有do_whole_word_mask, favor_longer_ngram, do_permutation, geometric_dist等, + 可以参考[Megatron](https://github.com/NVIDIA/Megatron-LM)使用这些lm_mask策略。 + +## 参考内容 + +注: 大部分数据流程,参考自[Megatron](https://github.com/NVIDIA/Megatron-LM),特此表达感谢。 diff --git a/model_zoo/ernie-1.0/preprocess/create_pretraining_data.py b/model_zoo/ernie-1.0/preprocess/create_pretraining_data.py new file mode 100644 index 0000000000000000000000000000000000000000..ea63297936ed976313864cb6736ad1669e168628 --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/create_pretraining_data.py @@ -0,0 +1,380 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import io +import json +import multiprocessing +import os +import re +import sys +import time + +import numpy as np +from tqdm import tqdm + +import paddlenlp.transformers as tfs +from paddlenlp.data import indexed_dataset +from paddlenlp.utils.log import logger + +try: + import nltk + + nltk_available = True +except ImportError: + nltk_available = False + +from datetime import datetime + + +def print_datetime(string): + time_str = datetime.now().strftime("%Y-%m-%d %H:%M:%S") + print("[" + string + "] datetime: {} ".format(time_str)) + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--model_name", type=str, required=True, help="What model to use.") + parser.add_argument( + "--tokenizer_name", + type=str, + required=True, + choices=[ + "ErnieTokenizer", + "BertTokenizer", + "GPTTokenizer", + "GPTChineseTokenizer", + "LlamaTokenizer", + "ElectraTokenizer", + "T5Tokenizer", + ], + help="What type of tokenizer to use.", + ) + group = parser.add_argument_group(title="data input/output") + group.add_argument("--input_path", type=str, required=True, help="Path to input JSON files.") + group.add_argument("--output_prefix", type=str, required=True, help="Output prefix to store output file.") + group.add_argument( + "--data_format", + type=str, + default="text", + choices=["JSON"], + help="Only support json format for now. One document per line.", + ) + group.add_argument( + "--json_key", + type=str, + default="text", + help="For JSON format. 
Space separate listed of keys to extract from json", + ) + group.add_argument("--split_sentences", action="store_true", help="Split documents into sentences.") + + group.add_argument("--data_impl", type=str, default="mmap", choices=["lazy", "mmap"]) + + group = parser.add_argument_group(title="chinese words") + group.add_argument( + "--chinese", action="store_true", help="Is corpus need words segmentation step for chinese words." + ) + group.add_argument( + "--cn_whole_word_segment", + action="store_true", + help="Is corpus need words segmentation step for chinese words WWM.", + ) + group.add_argument( + "--cn_seg_func", + type=str, + default="jieba", + choices=["lac", "seg", "jieba"], + help="Words segment function for chinese words.", + ) + group.add_argument("--cn_splited", action="store_true", help="Is chinese corpus is splited in to words.") + group.add_argument("--cn_split_dimer", type=str, default=" ", help="Split dimer between chinese words.") + + group = parser.add_argument_group(title="common config") + group.add_argument("--append_eos", action="store_true", help="Append an token to the end of a document.") + group.add_argument("--log_interval", type=int, default=100, help="Interval between progress updates") + group.add_argument("--workers", type=int, default=1, help="Number of worker processes to launch") + group.add_argument("--max_doc_num", type=int, default=sys.maxsize, help="Number of worker processes to launch") + + args = parser.parse_args() + return args + + +def lexical_analysis_fn(): + from LAC import LAC + + lac = LAC(mode="lac") + + def process(line): + words, _ = lac.run(line) + return words + + return process + + +def chinese_segmentation_fn(): + from LAC import LAC + + lac_cws = LAC(mode="seg") + + def process(line): + words = lac_cws.run(line) + return words + + return process + + +def jieba_segmentation_fn(): + import jieba + + def process(line): + words = jieba.cut(line) + return list(words) + + return process + + +def get_whole_word_mask_tokens(tokens, words, max_word_length=6): + """ + Do whole word mask on Chinese word. + First, we do Chinese word segmentation on the sequence of tokens, which are from the WordPiece tokenization. + Then, we add the '##' mark on chinese characters which are in the middle of Chinese words. + And if the tokens are not chinese characters, we just exploit the results of WordPiece tokenization as words. + Such as, + - text line : 通过利用mercer核,将样本从输入空间映射到高维特征空间,使原来没有显现的特征突现出来,取得了很好的图像分割效果。 + - the input tokens (after WordPiece): + ['通', '过', '利', '用', 'me', '##rc', '##er', '核', ',', '将', '样', '本', '从', '输', '入', '空', '间', '映', + '射', '到', '高', '维', '特', '征', '空', '间', ',', '使', '原', '来', '没', '有', '显', '现', '的', '特', '征', + '突', '现', '出', '来', ',', '取', '得', '了', '很', '好', '的', '图', '像', '分', '割', '效', '果', '。'] + - the Chinese words (after Chinese word segmentation like jieba) + ['通过', '利用', 'mercer', '核', ',', '将', '样本', '从', '输入', '空间', '映射', '到', '高维', '特征', + '空间', ',', '使', '原来', '没有', '显现', '的', '特征', '突现', '出来', ',', '取得', '了', '很', '好', + '的', '图像', '分割', '效果', '。'] + - the output whole word mask tokens: + ['通', '##过', '利', '##用', 'me', '##rc', '##er', '核', ',', '将', '样', '##本', '从', '输', '##入', + '空', '##间', '映', '##射', '到', '高', '##维', '特', '##征', '空', '##间', ',', '使', '原', '##来', + '没', '##有', '显', '##现', '的', '特', '##征', '突', '##现', '出', '##来', ',', '取', '##得', '了', + '很', '好', '的', '图', '##像', '分', '##割', '效', '##果', '。'] + + Args: + tokens(list(str)): The sequence of tokens, which are from the WordPiece tokenization. 
+ words(list(str)): The sequence of Chinese words. + max_word_length(int, optional): + The maximum chinese character in Chinese words. It avoids too long Chinese word to be masked. + Defaults as 4. + + Returns: + new_tokens(list(str)): The new token will be done with whole word masking strategy. + + """ + + new_tokens = [] + # opt for long document + words_set = set(words) + i = 0 + while i < len(tokens): + # non-chinese character, then do word piece + if len(re.findall("[\u4E00-\u9FA5]", tokens[i])) == 0: + new_tokens.append(tokens[i]) + i += 1 + continue + + # add "##" mark on the middel tokens of Chinese words + # such as ["通过", "利用"] -> ["通", "##过", "利", "##用"] + has_add = False + for length in range(max_word_length, 0, -1): + if i + length > len(tokens): + continue + if "".join(tokens[i : i + length]) in words_set: + new_tokens.append(tokens[i]) + for l in range(1, length): + new_tokens.append("##" + tokens[i + l]) + i += length + has_add = True + break + + if not has_add: + new_tokens.append(tokens[i]) + i += 1 + return new_tokens + + +class IdentitySplitter(object): + def tokenize(self, *text): + return text + + +class NewlineSplitter: + def tokenize(self, text): + return text.split("\n") + + +class Converter(object): + def __init__(self, args): + self.args = args + + def initializer(self): + Converter.tokenizer = getattr(tfs, self.args.tokenizer_name).from_pretrained(self.args.model_name) + if self.args.cn_whole_word_segment: + # Extend chinese char vocab for ErnieTokinzer + Converter.tokenizer.extend_chinese_char() + + # Split document to sentence. + if self.args.split_sentences: + if self.args.chinese: + Converter.splitter = NewlineSplitter() + else: + if not nltk_available: + print("NLTK is not available to split sentences.") + exit() + splitter = nltk.load("tokenizers/punkt/english.pickle") + Converter.splitter = splitter + else: + Converter.splitter = IdentitySplitter() + + # Split sentence whole words mask for chinese + if self.args.cn_whole_word_segment: + if self.args.cn_splited: + Converter.segment_func = lambda text: text.split(self.args.cn_split_dimer) + else: + CHINESE_SEG_FUNC = { + "lac": lexical_analysis_fn(), + "seg": chinese_segmentation_fn(), + "jieba": jieba_segmentation_fn(), + } + Converter.segment_func = CHINESE_SEG_FUNC[self.args.cn_seg_func] + Converter.whole_word_mask = get_whole_word_mask_tokens + else: + Converter.segment_func = lambda x: x + Converter.whole_word_mask = lambda x, y: x + + def process(text): + words = Converter.segment_func(text) + # if there are two empty word, the should a split dimer in the pos + if self.args.cn_splited: + pre_dimer = False + for index, w in enumerate(words): + if pre_dimer and len(w) == 0: + words[index] = self.args.cn_split_dimer + pre_dimer = False + elif len(w) == 0: + pre_dimer = True + else: + pre_dimer = False + + tokens = Converter.tokenizer.tokenize("".join(words)) + tokens = Converter.whole_word_mask(tokens, words) + tokens = Converter.tokenizer.convert_tokens_to_ids(tokens) + return tokens + + Converter.process = process + + def encode(self, json_line): + text = json.loads(json_line)[self.args.json_key] + doc_ids = [] + for sentence in Converter.splitter.tokenize(text): + sentence_ids = Converter.process(sentence.strip()) + if len(sentence_ids) > 0: + doc_ids.append(sentence_ids) + + if len(doc_ids) > 0 and self.args.append_eos: + if Converter.tokenizer.eos_token_id is None: + logger.warning( + "{}: eos_token_id is not set, ".format(self.args.tokenizer_name) + + "please set other tokenizer " + + "or config 
eos_token_id or unset append_eos." + ) + else: + doc_ids[-1].append(Converter.tokenizer.eos_token_id) + + return doc_ids, len(text.encode("utf-8")) + + +def main(): + print_datetime("start") + args = get_args() + file_paths = [] + if os.path.isfile(args.input_path): + file_paths.append(args.input_path) + else: + for root, _, fs in os.walk(args.input_path): + for f in fs: + file_paths.append(os.path.join(root, f)) + + convert = Converter(args) + + # Try tokenizer is availiable + sample_tokenizer = getattr(tfs, args.tokenizer_name).from_pretrained(args.model_name) + if sample_tokenizer.vocab_size < 2**16 - 1: + save_dtype = np.uint16 + else: + save_dtype = np.int32 + + pool = multiprocessing.Pool(args.workers, initializer=convert.initializer) + + output_ids_files = args.output_prefix + ".bin" + output_idx_files = args.output_prefix + ".idx" + builder = indexed_dataset.make_builder(output_ids_files, args.data_impl, save_dtype) + + file_paths.sort() + + step = 0 + total_bytes_processed = 0 + startup_start = time.time() + for file_path in tqdm(file_paths): + if file_path.endswith(".zst"): + import zstandard + + cctx = zstandard.ZstdDecompressor() + fh = open(file_path, "rb") + text = io.BufferedReader(cctx.stream_reader(fh)) + elif file_path.endswith(".jsonl"): + text = open(file_path, "r", encoding="utf-8") + else: + print("Unexpected data format, skiped %s" % file_path) + continue + + encoded_docs = pool.imap(convert.encode, text, 256) + print("Processing %s" % file_path) + for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1): + step += 1 + total_bytes_processed += bytes_processed + if len(doc) == 0: + continue + + for sentence in doc: + sentence_len = len(sentence) + if sentence_len == 0: + continue + builder.add_item(sentence) + + builder.end_document() + + if step % args.log_interval == 0: + current = time.time() + elapsed = current - startup_start + mbs = total_bytes_processed / elapsed / 1024 / 1024 + print(f"Processed {step} documents", f"({step/elapsed:.2f} docs/s, {mbs:.4f} MB/s).", file=sys.stderr) + if step >= args.max_doc_num: + break + + if step >= args.max_doc_num: + break + + pool.close() + print("Saving tokens to files...") + builder.finalize(output_idx_files) + print_datetime("end") + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/preprocess/docs/CLUECorpus2020.md b/model_zoo/ernie-1.0/preprocess/docs/CLUECorpus2020.md new file mode 100644 index 0000000000000000000000000000000000000000..3c6727fab4c7d1003a8d22a0f964b579710dd989 --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/docs/CLUECorpus2020.md @@ -0,0 +1,12 @@ +## CLUECorpus2020 语料 + +| 名称 | 文本类型 | 纯文本大小 | +|-|-|-| +| CLUECorpus2020| 中文 | 200GB | + +CLUECorpus2020 过对Common Crawl的中文部分进行语料清洗得到。开源部分提供了约200G左右的语料文本,详细介绍见[官网](https://github.com/CLUEbenchmark/CLUECorpus2020#%E6%95%B0%E6%8D%AE%E4%B8%8B%E8%BD%BD),用户可以通过邮件申请下载,方式如下: + +> 数据下载 +> 申请方式: 将使用语料研究目的和用途,计划、研究机构和申请者介绍,发送到邮箱,并承诺不向第三方提供。 +> +> 邮箱: CLUEbenchmark@163.com,标题是:CLUECorpus2020 200G语料库 diff --git a/model_zoo/ernie-1.0/preprocess/docs/CLUECorpusSmall.md b/model_zoo/ernie-1.0/preprocess/docs/CLUECorpusSmall.md new file mode 100644 index 0000000000000000000000000000000000000000..6af9876968f033653cc310f5999d6c70dc3e5b9b --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/docs/CLUECorpusSmall.md @@ -0,0 +1,78 @@ +# CLUECorpusSmall + +| 名称 | 文本类型 | 纯文本大小 | +|-|-|-| +| CLUECorpusSmall| 中文 | 14GB | + +**数据集简介**:可用于语言建模、预训练或生成型任务等,数据量超过14G,近4000个定义良好的txt文件、50亿个字。主要部分来自于nlp_chinese_corpus项目 
+包含如下子语料库(总共14G语料):新闻语料[news2016zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/6bac09db4e6d4857b6d680d34447457490cb2dbdd8b8462ea1780a407f38e12b?responseContentDisposition=attachment%3B%20filename%3Dnews2016zh_corpus.zip), 社区互动语料[webText2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/83da03f7b4974871a52348b41c16c7e3b34a26d5ca644f558df8435be4de51c3?responseContentDisposition=attachment%3B%20filename%3DwebText2019zh_corpus.zip),维基百科语料[wiki2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/d7a166408d8b4ffdaf4de9cfca09f6ee1e2340260f26440a92f78134d068b28f?responseContentDisposition=attachment%3B%20filename%3Dwiki2019zh_corpus.zip),评论数据语料[comment2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/b66ddd445735408383c42322850ac4bb82faf9cc611447c2affb925443de7a6d?responseContentDisposition=attachment%3B%20filename%3Dcomment2019zh_corpus.zip)。 + +## 数据获取 + +用户可以通过官方github网页下载,https://github.com/CLUEbenchmark/CLUECorpus2020 。同时,为方便用户,我们也提供了aistudio数据集下载地址。[part1](https://aistudio.baidu.com/aistudio/datasetdetail/60598),[part2](https://aistudio.baidu.com/aistudio/datasetdetail/124357)。使用aistudio版本的数据,下载好后,可以核对md5值: +```shell +> md5sum ./* + 8a8be341ebce39cfe9524fb0b46b08c5 ./comment2019zh_corpus.zip + 4bdc2c941a7adb4a061caf273fea42b8 ./news2016zh_corpus.zip + fc582409f078b10d717caf233cc58ddd ./webText2019zh_corpus.zip + 157dacde91dcbd2e52a60af49f710fa5 ./wiki2019zh_corpus.zip +``` +解压文件 +```shell +unzip comment2019zh_corpus.zip -d clue_corpus_small_14g/comment2019zh_corpus +unzip news2016zh_corpus.zip -d clue_corpus_small_14g/news2016zh_corpus +unzip webText2019zh_corpus.zip -d clue_corpus_small_14g/webText2019zh_corpus +unzip wiki2019zh_corpus.zip -d clue_corpus_small_14g/wiki2019zh_corpus +``` +将txt文件转换为jsonl格式 +``` +python trans_to_json.py --input_path ./clue_corpus_small_14g --output_path clue_corpus_small_14g.jsonl +``` +现在我们得到了jsonl格式的数据集。 + +## 中文预训练数据制作 + +下面是针对训练任务的数据集应用。 + +* llama为例 +```shell +python -u create_pretraining_data.py \ + --model_name "idea-ccnl/ziya-llama-13b-v1" \ + --tokenizer_name "LlamaTokenizer" \ + --input_path "clue_corpus_small_14g.jsonl" \ + --output_prefix "clue_corpus_small_14g" \ + --data_format "JSON" \ + --json_key "text" \ + --data_impl "mmap" \ + --append_eos \ + --log_interval 10000 \ + --workers 48 +``` + +* ernie为例 +```shell +python -u create_pretraining_data.py \ + --model_name "ernie-3.0-base-zh" \ + --tokenizer_name "ErnieTokenizer" \ + --input_path "clue_corpus_small_14g.jsonl" \ + --output_prefix "clue_corpus_small_14g" \ + --data_format "JSON" \ + --json_key "text" \ + --split_sentences \ + --data_impl "mmap" \ + --chinese \ + --cn_whole_word_segment \ + --cn_seg_func "lac" \ + --log_interval 10000 \ + --workers 48 +``` + +- model_name 可以更换为[其他模型](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/llama/README.md)。 +- workers 表示转化的线程数目 + +数据共有文档`15702702`条左右,由于分词比较耗时,大概一小时左右可以完成。在当前目录下产出训练所需数据。 +``` +clue_corpus_small_14g.bin +clue_corpus_small_14g.idx +``` +用户可以使用此数据进行预训练任务。 diff --git a/model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md b/model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md new file mode 100644 index 0000000000000000000000000000000000000000..8e849d829e2f6919ab82d50cc9fb4a594ecb43fe --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/docs/OpenWebText2.md @@ -0,0 +1,43 @@ +# OpenWebText2 + +| 名称 | 文本类型 | 纯文本大小 | +|-|-|-| +| OpenWebText2 | 英文 | 70GB | + +## 数据获取 + +[OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/)是一个开源的英文网页文本数据集,数据来源于Reddit,经过去重、清洗、提取,最终包含800多万个文档。 
+本示例采用EleutherAI清洗好的[OpenWebText2数据](https://openwebtext2.readthedocs.io/en/latest/index.html#download-plug-and-play-version) + +下载以后通过以下命令解压: + +```shell +# wget https://mystic.the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar +wget https://paddlenlp.bj.bcebos.com/models/transformers/gpt/openwebtext2.jsonl.zst.tar +tar -xvf openwebtext2.json.zst.tar -C /path/to/openwebtext +``` + +## Llama训练数据制作 + +然后使用[proprecess](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-1.0/proprecess) 工具下的`create_pretraining_data.py`脚本进行数据集制作: +``` +python -u create_pretraining_data.py \ + --model_name meta-llama/Llama-2-7b \ + --tokenizer_name LlamaTokenizer \ + --data_format JSON \ + --input_path /path/to/openwebtext/ \ + --append_eos \ + --output_prefix llama_openwebtext \ + --workers 40 \ + --log_interval 10000 \ + --data_impl "mmap" +``` +处理时间约一个小时左右,就可以得到我们需要的`llama_openwebtext.bin`, `llama_openwebtext.idx`数据集文件。 + +将所有预处理得到的文件统一放入一个文件夹中,以备训练使用: + +``` +mkdir data +mv llama_openwebtext.bin ./data +mv llama_openwebtext.idx ./data +``` diff --git a/model_zoo/ernie-1.0/preprocess/docs/WuDaoCorpusBase.md b/model_zoo/ernie-1.0/preprocess/docs/WuDaoCorpusBase.md new file mode 100644 index 0000000000000000000000000000000000000000..580f12cd643bdc0c71ce9ed5aeb972e13c4205ab --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/docs/WuDaoCorpusBase.md @@ -0,0 +1,97 @@ +# WuDaoCorpus2.0 Base 语料 + + +| 名称 | 文本类型 | 纯文本大小 | +|-|-|-| +| WuDaoCorpus2.0 Base| 中文 | 200GB | + +WuDaoCorpora是悟道爬取的中文大规模语料。整体数量为3TB,目前开源的部分为WuDaoCorpus2.0 bases数据集,大小为200GB。 + +## 数据获取 + +**1. 下载解压** + +用户微信登录[官网](https://resource.wudaoai.cn/home),即可直接下载数据。下载好的压缩数据约 64GB。解压 +``` +unrar x WuDaoCorpus2.0_base_200G.rar +``` +**2. 语料分词** + +由于WuDao数据集比较大,分词比较耗时,这里先进行了语料分词: +```shell +python words_segmentation.py \ + --input_path ./WuDaoCorpus2.0_base_200G \ + --workers 40 \ + --data_format wudao \ + --cn_seg_func seg \ + --output_path ./wudao_lac_cut \ +``` + +注:预训练需要实现 SOP( Sentence Order Predict) 任务,在分词的同时,我们使用 简单规则 进行了文本断句。如果语料只有一句话,建议去除SOP loss,训练时设置 `binary_head=False`。 + +**3. 
转换为jsonl格式** + +文本转化完成后。我们使用 `../data_tools/trans_to_json.py`重新转换为jsonl格式(分词完毕)。 +```shell +python ./trans_to_json.py \ + --input_path ./wudao_lac_cut \ + --output_path wudao_corpus_200g.jsonl \ + --workers 40 +``` +在当前目录下产出数据`wudao_corpus_200g.jsonl`。格式如下: +``` +{"text": "主持人 : 作为 一个 曲线救国 的 路线 我们 没 办法 。\n金鑫 : 考试 和 分数 只是 一个 阶段性 的 评价 手段 , 不是 目的 , 就 像 人 活着 的 目的 不是 为了 吃饭 , 吃饭 是 为了 让 我们 活下去 , 我们 学习 的 目的 不是 为了 考试 , 不是 为了 那个 分数 , 而是 我 掌握 了 知识 , 成为 我 内在 的 能力 , 将来 我 去 创作 创造 工作 , 我能 把 它 做 得 更好 。\n主持人 : 特别感谢 金总 今天 接受 我 的 访谈 , 也 让 我 从 别的 层面 看到 了 一对一 到底 存在 的 道理 是 什么 , 并且 能 发展 那么 好 的 原因 在 哪里 。\n在 节目 后 您 谈谈 您 对 一对一 未来 的 希望 , 包括 您 对 它 未来 的 设想 是 什么 ?\n金鑫 : 一对一 个性化 教育 现在 还是 在 初级阶段 , 如果 是 四个 阶段 的话 , 现在 还是 在 第一阶段 到 第二阶段 迈进 的 , 学大 在 这方面 我们 希望 能 做 得 更 快 更 远 一些 。\n将来 个性化 教育 一定 是 能够 帮助 学生 在 成绩 上 的 提升 , 能够 更好 的 成长 , 进而 成为 对 社会 对 国家 更 有用 的 人才 , 就是 我们 的 成绩 、 成长 、 成才 。\n学大 1 对 1 教育 的 教师 团队 由 各科 优秀教师 、 考试 指导 专家 、 心理 辅导 专家 及 学习 方法 指导 专家 组成 , 同时 配备 专职 班主任 及 学习 监管 师 , 全方位 辅导 顺利 而 有序 的 运作 。\n其中 部分 教师 担任 多年 毕业班 教学 工作 , 多次 参与 中 考试 命题 研究 及 阅卷 工作 , 深谙 中 考试 精髓 , 能够 在 短 的 时间 内 引领 学生 掌握 中 考试 知识 重点 , 快速 提分 。\n■ 对于 成绩 差 的 学生 : 注重 学生 基础知识 , 力求 让 学生 在 基础 中 找 自信 , 在 自信 中 提升 ;\n注重 主观题 的 解题 方法 及 思路 , 以此 来 加强 对 基础知识 的 运用 。\n■ 对于 成绩 需要 拔高 的 学生 : 找出 学生 弱点 , 加强 基础 , 重点 提高 弱势 项目 。\n"} +{"text": "武田信玄 是 天生 的 武将 , 一生 开拓 了 八十五万 石至 九十余万 石之多 的 领地 。\n武田信玄 他 21 岁 时 流放 自己 的 父亲 武田信虎 至骏河 , 避免 父亲 传位 给 弟弟 , 从而 登上 了 第 19 代家督 之位 。\n他 将 信 浓国 ( 现 长野县 ) 纳入 控制 范围 后 , 又 与 当时 的 豪强 今井氏 、 北条 氏 结成 三国 军事同盟 , 与 上 杉谦信 在 川 中岛 前后 展开 了 五次 大战 。\n武田信玄 勇于 进攻 。\n他 连续 攻打 邻国 , 扩大 自己 势力范围 , 可称 遇神 杀神 , 遇佛 杀佛 。\n他 不仅 流放 了 自己 的 父亲 , 连 自己 的 嫡子 武田义信 因 与 他 在 战略 方向 上 相左 , 也 被 他 幽禁 于 佛寺 , 随即 被迫 自杀 。\n武田信玄 虽然 是 战国 武将 中 的 最强者 , 但 他 的 弱点 是 年龄 。\n信玄比 织田信长 年长 13 岁 , 比上 杉谦信 年长 9 岁 。\n当信 玄年 届 五十 之 时 , 信长 和 谦信 犹 在 壮年 。\n上杉谦信 而且 , 武田信玄 虽 驰骋 天下 , 却 未率 军 进过 京都 , 而 织田信长 在 永禄 十一年 ( 1568 年 ) 就 以 拥立 第 15 代 将军 足利义 昭 为名 率兵 上洛 了 。\n所谓 \" 制 京都 者 得 天下 \" , 所以 , 想要 一统天下 , 武田信玄 的 时间 很 紧迫 。\n元龟 三年 ( 1572 年 ) , 武田信玄 与 室 町 幕府 第 15 代 将军 足利义 昭 、 本愿 寺 显如 , 以及 浅井 氏 、 朝仓氏 等 反 织田信长 实力 组成 联盟 , 编织 \" 反信长 包围圈 \" 。\n同年 10 月 3 日 , 武田信玄 率领 大军 , 开始 了 第一次 上洛之行 。\n是 年 , 信玄 52 岁 , 这 也许 是 他 统一天下 的 最后 一次 机会 。\n武田信玄 所 率领 的 是 当时 战国 最强 的 3 万甲州 精兵 。\n打着 \" 风林火山 \" 的 旗帜 , 武田军 第一站 就 到达 了 织田信长 的 同盟 德川家康 所在 的 三河 远江 。\n织田信长 德川家康 的 军队 在 甲州 精兵 之前 显得 不堪一击 , 到 了 10 月 13 日 , 只来 成 、 天 方城 、 一 宫城 、 饭田 城 、 各和城 、 向 笠 城 等 城池 纷纷 被 攻陷 。\n德川家康 见势不妙 , 决定 在 浜松 城中 闭门不出 。\n但是 武田信玄 毫不 松懈 , 又 将 家康 在 远江 地区 的 重要 据点 二俣城 攻破 。\n德川家康 集合 所有 军队 共 1 万 1 千人 , 出城 与 信玄 决一死战 , 但 大败 而 还 , 险些 失 了 性命 。\n这次 战争 被 称为 \" 三方 原战 \" , 德川家康 曾经 承认 这次 战争 是 他 生平 最大 的 失败 。\n"} +``` + +## 中文预训练数据制作 + +下面是针对训练任务的数据集应用。 + +* llama为例 +```shell +python -u create_pretraining_data.py \ + --model_name "idea-ccnl/ziya-llama-13b-v1" \ + --tokenizer_name "LlamaTokenizer" \ + --input_path "wudao_corpus_200g.jsonl" \ + --output_prefix "wudao_corpus_200g" \ + --data_format "JSON" \ + --json_key "text" \ + --data_impl "mmap" \ + --cn_seg_func "jieba" \ + --cn_splited \ + --append_eos \ + --log_interval 10000 \ + --workers 48 +``` + +* ernie为例 +```shell +python -u create_pretraining_data.py \ + --model_name "ernie-3.0-base-zh" \ + --tokenizer_name "ErnieTokenizer" \ + --input_path "wudao_corpus_200g.jsonl" \ + --output_prefix "wudao_corpus_200g" \ + --data_format "JSON" \ + --json_key "text" \ + --split_sentences \ + --data_impl "mmap" \ + --chinese \ + --cn_whole_word_segment \ + --cn_seg_func "jieba" \ + --cn_splited \ + --log_interval 10000 \ + --workers 48 +``` + + +- 我们提前进行了分词,所以加上了 `cn_splited`,否则不需要使用此选项。 +- model_name 
可以更换为[其他模型](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm)。 +- workers 表示转化的线程数目 + +在当前目录下产出训练所需数据。 +``` +wudao_corpus_200g.bin +wudao_corpus_200g.idx +``` +用户可以使用此数据进行预训练任务。 diff --git a/model_zoo/ernie-1.0/preprocess/merge.py b/model_zoo/ernie-1.0/preprocess/merge.py new file mode 100644 index 0000000000000000000000000000000000000000..cad6034907068b1b0ae96cb8b0338b81ed001334 --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/merge.py @@ -0,0 +1,90 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +from datetime import datetime + +from paddlenlp.data import indexed_dataset + + +def print_datetime(string): + time_str = datetime.now().strftime("%Y-%m-%d %H:%M:%S") + print("[" + string + "] datetime: {} ".format(time_str)) + + +def main(args): + + prefixes = set() + for basename in os.listdir(args.input): + prefix, ext = os.path.splitext(basename) + + if prefix in prefixes: + continue + + if not os.path.isfile(os.path.join(args.input, basename)): + continue + + ext_pair = ".bin" if ext == ".idx" else ".idx" + assert os.path.isfile( + os.path.join(args.input, prefix) + ext_pair + ), f"ERROR: {ext_pair} file not provided for {os.path.join(args.input, prefix)}" + + prefixes.add(prefix) + + builder = None + + for prefix in sorted(prefixes): + print_datetime(f"start processing file {prefix}") + if builder is None: + dataset = indexed_dataset.make_dataset(os.path.join(args.input, prefix), args.data_impl) + + if isinstance(dataset, indexed_dataset.MMapIndexedDataset): + builder = indexed_dataset.MMapIndexedDatasetBuilder( + args.output_prefix + ".bin", dtype=dataset._index.dtype + ) + else: + builder = indexed_dataset.IndexedDatasetBuilder(args.output_prefix + ".bin", dtype=dataset.dtype) + + del dataset + print_datetime(f"start merge file {prefix}") + builder.merge_file_(os.path.join(args.input, prefix)) + print_datetime(f"end merge file {prefix}") + + print_datetime("start finalize") + builder.finalize(args.output_prefix + ".idx") + print_datetime("end finalize") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + + group = parser.add_argument_group(title="input data") + group.add_argument( + "--input", type=str, required=True, help="Path to directory containing all document files to merge" + ) + group.add_argument("--data_impl", type=str, required=True, help="data_impl") + + group = parser.add_argument_group(title="output data") + group.add_argument("--output-prefix", type=str, required=True, help="Path to binary output file without suffix") + + args = parser.parse_args() + + assert os.path.isdir(args.input), f"ERROR: {args.input} is not a directory or does not exist" + + assert os.path.isdir( + os.path.dirname(args.output_prefix) + ), f"ERROR: {os.path.dirname(args.output_prefix)} is not a directory or does not exist" + + main(args) diff --git a/model_zoo/ernie-1.0/preprocess/trans_to_json.py b/model_zoo/ernie-1.0/preprocess/trans_to_json.py new file mode 100644 index 
0000000000000000000000000000000000000000..0e8e71d77e54d565f7f8273676c09e62e7bdcf72 --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/trans_to_json.py @@ -0,0 +1,140 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import multiprocessing +import os +import shutil +import sys +import time +from functools import partial + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--input_path", type=str, required=True, help="Path to you raw files. Folder or file path.") + parser.add_argument("--output_path", type=str, required=True, help="Path to save the output json files.") + parser.add_argument("--json_key", type=str, default="text", help="The content key of json file.") + parser.add_argument( + "--doc_spliter", + type=str, + default="", + help="Spliter between documents. We will strip the line, if you use blank line to split doc, leave it blank.", + ) + parser.add_argument("--min_doc_length", type=int, default=10, help="Minimal char of a documment.") + parser.add_argument("--workers", type=int, default=1, help="Number of worker processes to launch") + parser.add_argument("--log_interval", type=int, default=1, help="Interval between progress updates.") + parser.add_argument("--no-merge", action="store_true", help="Don't merge the file.") + parser.add_argument("--no-shuffle", action="store_true", help="Don't shuffle the file.") + args = parser.parse_args() + return args + + +def raw_text_to_json(path, doc_spliter="", json_key="text", min_doc_length=10): + path = os.path.abspath(path) + if not os.path.exists(path): + print("No found file %s" % path) + return 0, None + + out_filepath = path + ".jsonl" + fout = open(out_filepath, "w", encoding="utf-8") + len_files = 0 + with open(path, "r") as f: + doc = "" + line = f.readline() + while line: + len_files += len(line) + if line.strip() == doc_spliter: + if len(doc) > min_doc_length: + fout.write(json.dumps({json_key: doc}, ensure_ascii=False) + "\n") + doc = "" + else: + doc += line + line = f.readline() + + if len(doc) > min_doc_length: + fout.write(json.dumps({json_key: doc}, ensure_ascii=False) + "\n") + doc = "" + + return len_files, out_filepath + + +def merge_file(file_paths, output_path): + if not output_path.endswith(".jsonl"): + output_path = output_path + ".jsonl" + print("Merging files into %s" % output_path) + with open(output_path, "wb") as wfd: + for f in file_paths: + if f is not None and os.path.exists(f): + with open(f, "rb") as fd: + shutil.copyfileobj(fd, wfd) + os.remove(f) + print("File save in %s" % output_path) + return output_path + + +def shuffle_file(output_path): + print("Shuffling the jsonl file...") + if os.path.exists(output_path): + os.system("shuf %s -o %s" % (output_path, output_path)) + print("File shuffled!!!") + else: + raise ValueError("File not found: %s" % output_path) + + +def main(): + args = get_args() + startup_start = time.time() + + file_paths = [] + if os.path.isfile(args.input_path): 
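+        # --input_path points at a single file: convert just this file; otherwise the directory is walked recursively below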
+ file_paths.append(args.input_path) + else: + for root, _, fs in os.walk(args.input_path): + for f in fs: + file_paths.append(os.path.join(root, f)) + + pool = multiprocessing.Pool(args.workers) + + startup_end = time.time() + proc_start = time.time() + total_bytes_processed = 0 + print("Time to startup:", startup_end - startup_start) + + trans_json = partial( + raw_text_to_json, doc_spliter=args.doc_spliter, json_key=args.json_key, min_doc_length=args.min_doc_length + ) + encoded_files = pool.imap(trans_json, file_paths, 1) + + out_paths = [] + for i, (bytes_processed, out_path) in enumerate(encoded_files, start=1): + total_bytes_processed += bytes_processed + out_paths.append(out_path) + + if i % args.log_interval == 0: + current = time.time() + elapsed = current - proc_start + mbs = total_bytes_processed / elapsed / 1024 / 1024 + print(f"Processed {i} files", f"({i/elapsed} files/s, {mbs} MB/s).", file=sys.stderr) + + if not args.no_merge: + output_path = merge_file(out_paths, args.output_path) + if not args.no_shuffle: + shuffle_file(output_path) + + +if __name__ == "__main__": + main() + # profile.run("main()", "testprof") diff --git a/model_zoo/ernie-1.0/preprocess/words_segmentation.py b/model_zoo/ernie-1.0/preprocess/words_segmentation.py new file mode 100644 index 0000000000000000000000000000000000000000..0aeb1907e0cddf33310ad163031136334d1e722e --- /dev/null +++ b/model_zoo/ernie-1.0/preprocess/words_segmentation.py @@ -0,0 +1,202 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import multiprocessing +import os +import re +import sys +import time +from functools import partial + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--input_path", type=str, required=True, help="Path to you raw files. Folder or file path.") + parser.add_argument("--output_path", type=str, default="./tmp", help="Path to save the output json files.") + parser.add_argument( + "--data_format", + type=str, + default="jsonl", + choices=["jsonl", "wudao"], + help="Path to you raw files. 
Folder or file path.", + ) + parser.add_argument( + "--cn_seg_func", + type=str, + default="jieba", + choices=["lac", "seg", "jieba"], + help="Words segment function for chinese words.", + ) + parser.add_argument("--workers", type=int, default=1, help="Number of worker processes to launch") + parser.add_argument("--log_interval", type=int, default=1, help="Interval between progress updates.") + args = parser.parse_args() + return args + + +def lexical_analysis_fn(): + from LAC import LAC + + lac = LAC(mode="lac") + + def process(line): + words, _ = lac.run(line) + return words + + return process + + +def chinese_segmentation_fn(): + from LAC import LAC + + lac_cws = LAC(mode="seg") + + def process(line): + words = lac_cws.run(line) + return words + + return process + + +def jieba_segmentation_fn(): + import jieba + + def process(line): + words = jieba.cut(line) + return list(words) + + return process + + +CHINESE_SEG_FUNC = { + "lac": lexical_analysis_fn(), + "seg": chinese_segmentation_fn(), + "jieba": jieba_segmentation_fn(), +} + + +def read_wudao(path): + print("Loading %s" % path) + with open(path, "r") as f: + try: + contents = json.load(f) + except Exception: + print("Failed to load %s" % path) + raise StopIteration + for js in contents: + yield js["content"] + + +def read_jsonl(path): + print("Loading %s" % path) + with open(path, "r") as f: + line = f.readline() + while line: + contents = json.load(f) + yield contents["text"] + line = f.readline() + + +READFILE_FUNC = { + "jsonl": read_jsonl, + "wudao": read_wudao, +} + +special_chars = ["\n", "。", "?", "?", " ", ";", ";", "!", "!"] +split_chars = ["。", "?", "?", ";", ";", "!", "!"] + + +def text_to_text(path, output_path, read_func, seg_func): + out_name = os.path.join(output_path, path[-20:]) + + print("Write into %s" % out_name) + if os.path.exists(out_name): + print("File exists %s" % out_name) + return 0, None + + seg_func = CHINESE_SEG_FUNC[seg_func] + read_func = READFILE_FUNC[read_func] + + data_len = 0 + count = 0 + with open(out_name, "w") as f: + for text in read_func(path): + # for js in contents: + count += 1 + # text = js["content"] + data_len += len(text.encode("utf-8")) + # make special char only once, + # because of those token will be treat as sentence spliter. 
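+            # i.e. first collapse repeated sentence-end marks (e.g. "。。。" -> "。"),
+            # then append "\n" after each sentence-end mark, so the text below can be split and segmented sentence by sentence.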
+ # 此处为断句逻辑 + for char in special_chars: + text = re.sub("[" + char + "]+[ ]*", char, text) + for char in split_chars: + text = text.replace(char, char + "\n") + + # 此处为分词逻辑 + final = "" + for line in text.split("\n"): + if len(line) == 0: + continue + words = seg_func(line) + final += " ".join(words) + "\n" + f.write(final + "\n") + + return data_len, None + + +def main(): + args = get_args() + startup_start = time.time() + + file_paths = [] + if os.path.isfile(args.input_path): + file_paths.append(args.input_path) + else: + for root, _, fs in os.walk(args.input_path): + for f in fs: + file_paths.append(os.path.join(root, f)) + + pool = multiprocessing.Pool(args.workers) + + startup_end = time.time() + proc_start = time.time() + total_bytes_processed = 0 + print("Time to startup:", startup_end - startup_start) + + if not os.path.exists(args.output_path): + os.makedirs(args.output_path) + + trans_func = partial( + text_to_text, output_path=args.output_path, seg_func=args.cn_seg_func, read_func=args.data_format + ) + + encoded_files = pool.imap(trans_func, file_paths, 1) + + out_paths = [] + for i, (bytes_processed, out_path) in enumerate(encoded_files, start=1): + total_bytes_processed += bytes_processed + out_paths.append(out_path) + + if i % args.log_interval == 0: + current = time.time() + elapsed = current - proc_start + mbs = total_bytes_processed / elapsed / 1024 / 1024 + print(f"Processed {i} files", f"({i/elapsed} files/s, {mbs} MB/s).", file=sys.stderr) + pool.close() + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/pretraining_introduction.md b/model_zoo/ernie-1.0/pretraining_introduction.md new file mode 100644 index 0000000000000000000000000000000000000000..b7d77c3e405e1e9ee9b6875fb9025503de8d8659 --- /dev/null +++ b/model_zoo/ernie-1.0/pretraining_introduction.md @@ -0,0 +1,615 @@ +# ERNIE 中文预训练介绍 + +ERNIE是百度提出的大规模预训练模型,曾在中文场景下取得了SOTA效果。 +PaddleNLP致力于预训练开源工作,使用开源中文语料CLUE、WuDao 总共400GB,发布大规模开源语料预训练全流程。从零开始,轻松构建预训练模型。 + +本项目,从数据下载,词表制作,数据转化,模型训练,所有流程,完全开源开放,可复现。 +并训练发布开源最优的模型参数。 + +接下来将从下面几个方面,详细介绍整个数据制作全流程,从零开始,构建一个预训练模型。 + +* [1. 数据准备](数据准备) + * [1.1 大规模中文数据](#大规模中文数据) + * [1.2 高精准中文分词](#高精准中文分词) + * [1.3 快速Token ID 转化](#快速TokenID转化) +* [2. 全字符中文词表制作](#中文词表制作) + - [2.1 分析准备](#分析准备) + - [2.2 文本字符统计](#文本字符统计) + - [2.3 英文字符词表](#英文字符词表) + - [2.4 合并词表](#合并词表) +* [3. 开始训练](#开始训练) + - [3.1 训练脚本](#训练脚本) + - [3.2 训练网络配置](#networks) + - [3.3 训练速度配置](#speed) + - [3.4 训练数据流配置](#data_pipe) + - [3.5 观察评估](#观察评估) +- [4. 训练效果](#release_models) + - [4.1 ERNIE 1.0-Base-zh-cw 模型](#ernie-1.0-base-zh-cw) + - [4.2 ERNIE 1.0-Large-zh-cw 模型](#ernie-1.0-large-zh-cw) +* [5. 参考](#references) + +全部流程介绍图如下: + +

+ +

+ + +**环境依赖** + +- tool_helpers +- visualdl +- pybind11 +- lac (可选) + +安装命令 `pip install tool_helpers visualdl pybind11 lac` + + + +## 1. 数据准备 + +数据流是预训练的非常重要的,[预处理文档](./preprocess/README.md)提供了整体的数据变动的流程示意,用户可以查看数据制作的细节文档。 + + + + +### 1.1 大规模中文数据 + +模型的根本是数据,大数据才能有望获得更好的训练效果。我们希望语料有如下特点: +- **大规模**:目前像ERNIE-3.0,GPT-3,CPM等模型,动辄数T的文本语料。而目前开源的一些中文模型,确是基于15G左右的CLUECorpus语料训练,大大限制了模型的效果, +- **开源开放**:为了让用户也可以比较容易复现整体的数据流程,采用的数据希望是**开源**的,人人可以获取的。 + +综上,我们选用的预料为 CLUECorpus2020 语料 200G, WuDaoCorpus2.0 Base 语料 200G。 + +**CLUECorpus2020 语料** + +CLUECorpus2020 是通过Common Crawl中文部分语料清洗得到。开源部分提供了约200G左右的语料文本,详细介绍见[官网](https://github.com/CLUEbenchmark/CLUECorpus2020#%E6%95%B0%E6%8D%AE%E4%B8%8B%E8%BD%BD),用户可以通过邮件申请下载。 + +**WuDaoCorpus2.0 Base 语料** + +WuDaoCorpora是悟道爬取的中文大规模语料。整体数量为3TB,目前开源的部分为WuDaoCorpus2.0 bases数据集,大小为200GB。 +用户微信登录[官网](https://resource.wudaoai.cn/home),即可直接下载数据。下载好的压缩数据约 64GB。 + + +为了方便用户测试,我们提供了少量part的WuDao数据供大家使用,(如有侵权,请联系我们删除) +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/WuDaoCorpus2.0_base_200G_sample.tar.gz +tar -xvf WuDaoCorpus2.0_base_200G_sample.tar.gz +``` +用户可以用这份数据跑完后续全程。数据量约为2GB。 + + + + +### 1.2 高精准中文分词 + +ERNIE 使用知识嵌入的方式进行预训练。文本中的知识,比如 文本的中的人名、地名、成语、短语等都是知识。如何把这知识训练融合到模型中呢?ERNIE给出的方案是对这些知识短语一起MASK,然后预测,也就是Whole Words MASK。 + +在我们数据处理层面,如何尽可能精确的从原始文本中提取知识,直接关系预训练模型的效果。我们对目前PaddleNLP常用的分词方式的有`jieba`,`lac`,`seg`进行分析。`jieba`采用HMM隐马尔可模型,`lac`是LSTM模型。 + +效果、速度对比表格如下,假设CPU使用40线程,GPU使用16卡,处理200G文本: + +| 切词方式 | 效果 | 速度 | 预估耗时 +|-|-|-|-| +| jieba | 一般 | 607 KB/s | 2.5 h | +| lac | 好 | 106 KB/s | 13.9 h +| wordtag (弃用)| 最好 | 0.94 KB/s | 159 D (GPU)| + +综合考虑分词的效果与速度,我们选择百度的LAC(seg)作为我们的文本分词工具。 + + +本文档以WuDao数据为例,对数据进行分词: + + +```shell +python ./preprocess/words_segmentation.py \ + --input_path "./WuDaoCorpus2.0_base_200G" \ + --output_path "./wudao_lac_cut" \ + --data_format "wudao" \ + --cn_seg_func "seg" \ + --workers 48 +``` + +注:预训练需要实现 SOP( Sentence Order Predict) 任务,在分词的同时,我们使用 简单规则 进行了文本断句。如果语料只有一句话,建议去除SOP loss,训练时设置 `binary_head=False`。 + +文本转化完成后。我们使用 `./preprocess/trans_to_json.py`重新转换为jsonl格式(分词完毕)。 +```shell +python ./preprocess/trans_to_json.py \ + --input_path "./wudao_lac_cut" \ + --output_path "wudao_corpus_200g_sample.jsonl" \ + --workers 40 \ + --no-shuffle +``` +使用 WuDaoCorpus2.0_base_200G_sample.tar.gz 数据可以得到jsonl文本为: +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_corpus_200g_sample.jsonl +``` +用户可以下载处理好的数据,进行tokenizer转换。 + + + + +## 1.3 快速Token ID 转化 + +预料、词表准备妥当后,我们可以开始进行最后的数据ID转化。 + +- 高效的 Multiprocessing 多进程实现 +- 使用内存BytesIO存储ID数据 + +由于转换的逻辑复杂,需要定义`class Converter`对象来进行转化处理。如果每次处理新的文本,都实例化一次class对象,速度瓶颈会在处理函数的实例化。 +我们使用了提前multiprocessing.Pool的`initializer`,对处理函数进行提前实例化,提高处理效率。 + +处理后的token id数量巨大,可以达到数百Billion,如果使用普通的数据结构,如python的list保存,会出现存储瓶颈,不仅占用空间大,list对象还需要重新分配内存空间。这里我们采用了 BytesIO 的方式,类似写入内存文件的方式,速度快,可以非常方便转化为numpy文件保存。 + +使用 Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz CPU测试,40线程,处理速度 8+MB/s,约7个小时左右,即可完成 200GB 文本转化为ID. 
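+
+下面给出一段极简的示意代码(并非本项目源码,其中 AutoTokenizer 的用法、进程数等均为示意,实际逻辑以 `create_pretraining_data.py` 为准),演示 `multiprocessing.Pool` 的 `initializer` 预实例化与 `BytesIO` 缓存 token id 的基本写法:
+
+```python
+# 示意:每个 worker 进程只在 initializer 中实例化一次 tokenizer,
+# token id 统一写入 BytesIO,最后一次性转成 numpy 数组
+import io
+import multiprocessing
+
+import numpy as np
+
+_tokenizer = None
+
+
+def _init_worker():
+    # 在 worker 启动时实例化一次,避免每条文本都重新构造处理对象
+    global _tokenizer
+    from paddlenlp.transformers import AutoTokenizer
+
+    _tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
+
+
+def _encode(line):
+    # 真实脚本中此处还包含分句、全词分词(WWM)等处理
+    return _tokenizer(line)["input_ids"]
+
+
+if __name__ == "__main__":
+    texts = ["飞桨自然语言处理", "预训练数据示例"]
+    buf = io.BytesIO()  # 用内存文件缓存 id,避免大 list 反复扩容
+    with multiprocessing.Pool(4, initializer=_init_worker) as pool:
+        for ids in pool.imap(_encode, texts, 256):
+            buf.write(np.array(ids, dtype="uint16").tobytes())
+    token_ids = np.frombuffer(buf.getvalue(), dtype="uint16")
+    print(token_ids.shape)
+```
+
+实际的数据转化,使用下面的命令即可: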
+ +```shell +python -u ./preprocess/create_pretraining_data.py \ + --model_name "ernie-3.0-base-zh" \ + --tokenizer_name "ErnieTokenizer" \ + --input_path "wudao_corpus_200g.jsonl" \ + --output_prefix "wudao_corpus_200g" \ + --split_sentences\ + --data_impl "mmap" \ + --chinese \ + --cn_splited \ + --cn_whole_word_segment \ + --workers 48 \ + --log_interval 1000 +``` + +此处需要指定词表文件进行ID转化,用户可以使用paddlenlp内置的部分词表如`ernie-1.0-base-zh,ernie-3.0-base-zh`,设置`model_name`参数为对应参数名即可。 +也可以根据自己的需求,重新开始制作词表,然后`model_name`传入词表所在的文件夹目录即可。词表制作,请参考下一章节[全字符中文词表制作](#全字符中文词表制作)。 + +转化后的数据如下,使用这份数据,即可开始ERNIE预训练: +``` +wudao_corpus_200g.bin +wudao_corpus_200g.idx +``` +同样,对于 WuDaoCorpus2.0_base_200G_sample.tar.gz 数据,使用`ernie-3.0-bash-zh`的tokenizer,可以得到数据。 +``` +mkdir data && cd data +wget https://paddlenlp.bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_corpus_200g_sample_ernie-3.0-base-zh.bin +wget https://paddlenlp.bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_corpus_200g_sample_ernie-3.0-base-zh.idx +``` + + + +### 2. 全字符中文词表制作 + +之前的 数据 id 化中,使用了已有的词表进行转化,当没有词表时,需要从头开始进行词表制作。如果你没有制作新词表的需求,请跳过此部分,直接阅读 [第三节,开始训练](#开始训练)。 + +那制作ERNIE的词表有什么特点需要注意呢?常见的方法是使用 sentencepiece 切词,使用BPE去找通用的子词串。但是,ERNIE之类的中文模型,是属于字模型,不会出现连续汉字作为子词 如`##中国`。一般是通过 BasicTokenizer,给所有中文汉字之间,添加空格,然后再去切分 子词 subword,这样每个汉字就都是独立的。 +``` +china -> ch #ina +我爱china -> 我 爱 china -> 我 爱 ch #ina +``` + +这里提供了ERNIE模型词表制作的两种方案: + +- 第一种,词表组合方案 + 1. 统计字符 + 2. 制作英文词表 + 3. 合并词表 + +- 第二种,预处理后直接生成,方案 + 1. 文本预处理(中文加空格,文本normalize) + 2. 使用sentencepeice制作词表 + +第二种方案需要对文本先使用`BasicTokenizer`切分一遍语料。 +第一种方案,自定义程度高,但存在一些局限性。本项目采用了第一种方案,详细介绍如下: + +### 2.1 分析准备 +词表大小: 这里我们考虑的因素主要有两个 +- 已有模型对照: + - ERNIE 3.0系列模型的词表,词表大小为 40000 左右。 +- 预训练数据存储占用: + - 文本token id化后,希望使用uint16表示,此时表示的最大字符为65536。 + - 同时考虑到ERNIE虽然是字模型,我们的仍然需要 `##中` 之类的中文字符表示分词信息。假设使用中文全字符20902(0x4E00-0x9FA5)个字符,那么剩余 vocab 大小不能超过 44634。 + +综上,本项目决定采用 40000 左右的 vocab 容量。 +其中: +- 中文全字符 `20902` +- 英文字符 `17000` +- 其他字符约 `2000` 左右 + + +### 2.2 文本字符统计 +首先第一步是对文本字符进行统计。字符统计的目的主要是添加常用的中文字符、特殊字符。 + +由于语料文本过大,我们随机选取 10G 左右的原始文本进行了字符统计。 +``` +python ./vocab/gen_char.py path_to_corpus.txt +``` +可以在本地文件夹得到`char_dict.pickle`字符频率文件。同时我们也提供了自己统计的词频文件,方便用户复现: +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/char_dict.pickle +``` + +### 2.3 英文字符词表 +基于字符的词频统计,使得英文字符也切割为字母,为此我们需要添加英文词表。 +英文部分,我们使用了 [WikiText](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip) 数据集,来构造词表。 +下载解压数据,使用BPE切词 +``` +wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip +unzip wikitext-103-v1.zip +python ./vocab/gen_vocab.py ./wikitext-103-raw/wiki.train.raw +``` +即可产生英文部分的词表。这里我们也提供了处理好的 vocab 方便用户验证。 +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/eng.vocab +``` + + +### 2.4 合并词表 + +目前我们得到了字符统计表,和英文字符词表。下一步,我们将词表进行合并。 + +将`char_dict.pickle`,`eng.vocab`放置到当前目录,使用下面命令 +``` +python ./vocab/merge_vocab.py +``` +即可在 当前 目录生成 vocab.txt 得到最终词表。 + +此阶段需要注意的一些问题是: +1. 对于一些日文、谚文文字字符,需要进行 normalize +2. 添加special_tokens + +### 2.5 问题遗留 +本项目采用的第一种方式,即拼接产出的词表,对连续非中、英文字符文本,会出现UNK的情况。 +如issue: [#2927](https://github.com/PaddlePaddle/PaddleNLP/issues/2927)、 [#2585](https://github.com/PaddlePaddle/PaddleNLP/issues/2585)。本项目做了两点改进: + +1. 对 Symbol 字符默认添加空格,变成独立字符 +2. 对 日文、谚文 在合并词表阶段默认添加 ## 字符。 + +虽然有上述两点修复,任然无法避免 [#2927](https://github.com/PaddlePaddle/PaddleNLP/issues/2927) 现象。 +彻底解决的话,建议使用第二种方式制作vocab文件。 + +### 2.6 方案二:预处理后直接生成 +此方案没有被采用,这里也简单说明一下具体的方案: +1. 
对语料使用 BasicTokenizer 转换 +```python +from paddlenlp.transformers import +tokenizer = BasicTokenizer() +basic_toknizer = lambda x: " ".join(tokenizer.tokenize(x)) +# 对语料使用 basic_toknizer 转换 +# 并存储为新的语料 afer_basic_toknizer_corpus.txt +``` +2. 处理转换后的语料 +```shell +python ./vocab/gen_vocab.py afer_basic_toknizer_corpus.txt +``` +对处理好的vocab文件手动替换一些` -> [PAD]`之类的special_tokens,即可产出词表。 + + +## 3. 开始训练 + +使用开源中文语料CLUE、WuDao 总共400GB,提供上面提供的大规模语料数据集制作教程。接下来,看是模型训练。 + +

+ +

+ +### 3.1 训练脚本 + +训练脚本如下。环境配置和路径配置,不是必要的,如果用户只想简单训练,可以直接跳到[继续训练](#继续训练)部分,直接训练。 + +环境配置 +- PYTHONPATH 设置为当前目录(适合paddlenlp develop运行) +- 设置了一些FLAGS,包括增强报错,动态图Flag,提高矩阵乘法精度。 +- 多机情况下,可以设置`NCCL_SOCKET_IFNAME`指明NCCL使用的通信网口。 + +
+环境配置脚本 + +```shell +set -x + +# cd PaddleNLP/model_zoo/ernie-1.0 +export PYTHONPATH=$PYTHONPATH:../../ + +export FLAGS_call_stack_level=2 +# export NCCL_SOCKET_IFNAME=xgbe0 +export FLAGS_gemm_use_half_precision_compute_type=False +export FLAGS_enable_eager_mode=1 +unset CUDA_VISIBLE_DEVICES +``` +
+ +路径配置 + +- 主要配置输入输出目录 +- 这里的`vocab_dir`如果没有使用自定义词表的话,请设置为内置的tokenizer,如`ernie-1.0-base-zh,ernie-3.0-base-zh`等。 +- 这里的 `data_dir` 设置多份数据集,用户不使用多份数据集的话,直接`data_dir="./data"`即可。 + +
+路径配置 + +```shell +trainer_id=${PADDLE_TRAINER_ID:-"0"} +task_name="0809-ernie-1.0-base-cw-dp16-gb1024" + +base_nfs="/path/to/your/nfs/mount/point" +base_dir="${base_nfs}/ernie-cw/output/${task_name}" +data_dir="5.0 ${base_nfs}/clue_oscar/clue_corpus_oscar_0630 7.0 ${base_nfs}/clue_train/clue_corpus_train_0629 12.0 ${base_nfs}/wudao_200g/wudao_200g_0703" +vocab_dir="${base_nfs}/" +``` +
+ +**启动训练**:这里启动的是单机8卡任务,整体全局的batch_size 512 (64*8)。如果指定ips参数,进行多机运行,如 `python3 -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" --ips 192.168.1.101,192.168.1.101 ` + +```shell +python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "${base_dir}/log_${trainer_id}" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-3.0-base-zh" \ + --tokenizer_name_or_path "${vocab_dir}" \ + --input_dir "${data_dir}" \ + --output_dir "${base_dir}" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --binary_head true \ + --micro_batch_size 64 \ + --use_amp true \ + --fp16_opt_level "O1" \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 4000000 \ + --save_steps 100000 \ + --checkpoint_steps 5000 \ + --decay_steps 3900000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20 \ + --num_workers 3 \ + --eval_freq 1000 \ + --device "gpu"\ + --share_folder true \ + --hidden_dropout_prob 0.1 \ + --attention_probs_dropout_prob 0.1 \ + --seed 1234 \ +``` + + +其中参数释义如下: +- `model_name_or_path` 要训练的模型或者之前训练的checkpoint。 +- `tokenizer_name_or_path` 模型词表文件所在的文件夹(对于ernie,词表文件名一般命名为vocab.txt),或者PaddleNLP内置tokenizer的名字。 +- `continue_training` 默认false,模型从随机初始化,开始训练。如果为True,从已有的预训练权重加载,开始训练。如果为True, 训练初始loss 为2.x 是正常loss,如果未False,随机初始化,初始loss一般为10+。 +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `output_dir` 指定输出文件。 +- `split` 划分数据集为train、valid、test的比例。整个数据集会按照这个比例划分数据。默认`split=949,50,1`, 使用1/1000的数据为test,当样本数太少时,增大测试的样本数目。 +- `max_seq_len` 输入文本序列的长度,默认值`512`。 +- `binary_head` 是否使用SOP(Sentences Order Predicet) loss,默认为 True,使用此loss。如果用户句子语料很短,无法组合成句子对,请设置此参数为`false`。 +- `micro_batch_size` 单卡batch size大小,比如此处单卡bs=64, 采用8卡训练`global_batch_size=64*8=512`。 +- `use_amp` 开启混合精度策略。 +- `fp16_opt_level` 混合精度策略,支持O1 自动混合精度,O2 pure fp16精度训练。 +- `max_lr` 训练学习率。 +- `min_lr` 学习率衰减到最小值后,学习率将一直保持为`min_lr`。 +- `max_steps` 最大训练步数。训练不支持通过`epoch`控制,第一次制造数据index时候,日志会显示数据会被计算的epoch数,请注意查看。 +- `save_steps` 保存模型间隔。默认保存地址格式为`output_dir/model_50000`(5w 步时的权重)。 +- `checkpoint_steps` 模型checkpoint间隔,用于模型断点重启训练。默认地址为`output_dir/model_last`. +- `weight_decay` 权重衰减参数。 +- `warmup_rate` 学习率warmup参数。 +- `grad_clip` 梯度裁剪范围。 +- `logging_freq` 日志输出间隔。 +- `num_workers` DataLoader采样进程,当数据输入为瓶颈时,可尝试提高采样进程数目。 +- `eval_freq` 模型评估间隔。 +- `device` 训练设备,默认为GPU。 +- `share_folder` 多机训练时,如果多机`input_dir`为挂载的同一个nfs网络位置,可以开启次选项,多机共享同一份数据。(每次运行,会制作训练的index数据,如果为挂载的统一nfs位置,则一台机器制作数据即可,否则每台机器都需要制作) + +继续训练 + + +很多同学的需求,是从已有的预训练参数开始,继续训练过程,这里我们使用前面教程提供的`WuDaoCorpus2.0_base_200G_sample.tar.gz`样本数据,在`ernie-3.0-base-zh`权重上继续训练。脚本如下: + +
+展开脚本 + +``` +python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/ernie_continue_training/logs" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-3.0-base-zh" \ + --tokenizer_name_or_path "ernie-3.0-base-zh" \ + --continue_training true \ + --input_dir ./data \ + --output_dir output/ernie_continue_training/ \ + --split 949,50,1 \ + --max_seq_len 512 \ + --binary_head true \ + --micro_batch_size 64 \ + --use_amp true \ + --fp16_opt_level "O1" \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 500000 \ + --save_steps 100000 \ + --checkpoint_steps 5000 \ + --decay_steps 490000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 1 \ + --num_workers 3 \ + --eval_freq 1000 \ + --device "gpu"\ + --scale_loss 1024\ + --seed 1234 \ +``` +
+ + + + +### 3.2 训练网络配置 + +本小节 + +- SOP Loss + - SOP (Sentence Order Predict) 损失,是 模型训练的常用损失。将文本中的句子顺序分为两段打乱,最后判断文本是否被打乱。下图是数据组织形式的展示: +

+ +

+ + - *使用方法*: 此开关由 `binary_head` 选项开启,`binary_head=True`添加sop loss, `binary_head=False` 关闭 sop loss。 + - **注意:如果你使用的语料文本中,只有一句话,无法分为多个句子段落,请设置 `binary_head=False`。否则,不符合要求的数据默认被删去,导致可训练的数据过小。** +- MASK + - MLM (Mask Language Model) 是通过随机将文本中的部分token,随机替换为`[MASK]` token,最后预测出真实的token值。ERNIE默认采用了Whole Word MASK方式,选定一些词语进行MASK。 + - *使用方法*: 用户可以设置 `masked_lm_prob` 控制mask的token占文本总token长度的比例。默认`masked_lm_prob=0.15` 随机mask 15% 的token数目。 + - 设置`short_seq_prob`, 控制长度小于max_seq_length的样本比例,默认值`short_seq_prob=0.1`。制作数据时候,会有相应比例的数据 最大长度会设置为 一个小于 max_seq_length 的随机值。 +- Ngram MASK + - 项目还支持了n-gram mask策略,如下图所示,在 WWM 进行词语级别MASK的基础上(如此处mask掉的`[模型]`词组),n-gram 可以MASK掉连续n个词组。下面例子中,连续mask了2个词组,`【[语言][模型]】`同时进行了mask。 +

+ +

+ + - *使用方法*: 用户通过`max_ngrams`设置最大的`ngram`长度。默认`max_ngrams=3`。 + - 注: + - ernie预训练使用的 dataset 代码文件在 `./data_tools/ernie_dataset.py` + - 数据集index生成,动态mask相关代码实现在`./data_tools/dataset_utils.py` + + - 用户可以根据自己的需求,灵活修改mask方式。具体可以参考`dataset_utils.py`中`create_masked_lm_predictions`函数。可以自定义的选项有do_whole_word_mask, favor_longer_ngram, do_permutation, geometric_dist等,可以参考[Megatron](https://github.com/NVIDIA/Megatron-LM)使用这些lm_mask策略。 + +- Dropout + - Dropout 是常用的防止过拟合策略。对于大规模数据集训练,如`ernie-3.0`系列4T文本语料,可以设置 `dropout=0`,不考虑过拟合。实际`ernie-3.0-base-zh`训练中,没有开启Dropout。 + - *使用方法*: 用户可以设置 `hidden_dropout_prob`,`attention_probs_dropout_prob`。默认值为 `0.1`。 + + + + +### 3.3 训练速度配置 + +**训练速度方面**,我们支持了如下策略,加 +速计算过程,减小显存占用,扩大batch_size: + +- **多卡多机**训练: + - 基于飞桨Fleet分布式API,用户可以十分方便的通过数据并行的方法,将训练扩展到多机多卡。 + - *使用方法*: + - 单机八卡 + ```shell + python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + run_pretrain.py + ``` + - 多机,假设机器ip为 `192.168.1.101,192.168.1.102` **注**:多台机器启动的ips参数需要顺序一致。 + ```shell + python3 -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --ips 192.168.1.101,192.168.1.102 \ + run_pretrain.py + ``` +- **混合精度**训练: + - 部分算子使用FP16计算kernel,加速计算过程。支持AMP混合精度O1,和Pure FP16全FP训练策略O2。 + - 如下图所示,使用AMP O1时,一些参数自动从fp32 cast为FP16类型计算。使用`O2` pure fp16时,模型参数为 fp16。 + - *使用方法*: 设置`use_amp=True`开启混合精度训练。设置`fp16_opt_level=O1`,切换pure_fp16请设置为`O2`。 +

+ +

+- **梯度累积**训练: + - 用户可以指定梯度累积的步数,在梯度累积的step中。 + - 减少多卡之间梯度的通信,减少更新的次数,扩大训练的batch_size. + - *使用方法*:用户设置 `gobal_batch_size`为 `micro_batch_size*卡数`的倍数,即可开启梯度累积。如:单卡bs=16,8卡,此时如果设置`gobal_batch_size=512`,则梯度累积次数为`gobal_batch_size/bs/card_num=512/16/8=4`。 +- **重计算**训练: + - 通过重新计算前向的方式,减少前向网络中间变量的存储,可以显著减少显存占用。理论上,该方式以时间换空间,但在batch size显著扩大的情况下,速度下降幅度较小。 + - 如图所示:原来训练过程中,中间变量需要常驻显存,等待反向计算。使用重计算之后,修改成了反向需要时,再重新计算一遍前向过程,生成中间变量。避免常驻显存,减小显存占用。 + - *使用方法*:用户设置`use_recompute=True`即可使用。注意使用时,可同时扩大`micro_batch_size`参数。 +

+ +

+ + + + +### 3.4 训练数据流配置 +**训练数据流方面**,我们针对训练数据流扩展、混合、重启等方面做了针对性优化提升 + +数据流 +- **多机扩展** + - 用户可以将数据放置到 NFS 服务器上,多机同时挂载数据即可。 + - 解析:当用户需要在多台机器之间,一起多机训练,或者切换到空闲的机器上训练时。由于数据集很大(数百GB),迁移不方便。训练数据与计算资源分离,是非常适合的策略。 + - *使用方法*:参考[NFS服务搭建教程](https://blog.csdn.net/eijiyey/article/details/123184529),用户将制作好的数据,放到NFS机器,然后挂载到有训练资源的其他机器训练即可。 +

+ +

+ +- **多数据混合** + - *简介*:训练数据集支持多个文件,即插即用,可设置不同数据集占比权重。上面的多机训练的架构,混合使用了四份数据集。 + - *使用方法*:传入参数即可`input_dir="1.0 dateset_a/prefix 2.0 dataset_b/prefix"` + - **注意**:如果文件夹中只有一份数据如`data/wudao_200g_0703_ids.npy data/wudao_200g_0703_idx.npz`,可以直接设置`input_dir=./data`为输入目录即可。如果需要设定多份数据集,必须写上数据集前缀,如`input_dir="1.0 data/wudao_200g_0703 1.0 data2/clue_corpus_train_0629"`。写前缀即可,不要加上后面类似`_ids.npy _idx.npz`的尾缀。 +- **稳定可复现** + - *简介*:MLM任务具有一定随机性,需要随机mask数据。本数据流通过固定每一个step数据的随机种子,实验数据流稳定可复现。 + - *使用方法*: 传入`seed`参数即可,修改参数后会重新生成 index 数据,打乱数据顺序。 +- **快加载** + - *简介*:数据文件使用mmap读取,避免直接将数据加载到内存,加载数百GB文件几乎不耗时。 +- **断点重启** + - *简介*:用户可以单独设置,`checkpoint_steps` 参数可设置较小,重启训练默认加载最新checkpoint。 + - 断点数据自动恢复,学习率等参数也自动恢复。 + - **注意:** 此`checkpoint_steps`参数仅保留最后一个`checkpoint`到`model_last`文件夹,默认每次覆盖。用户需要永久保存参数,请设置`save_steps`。建议可以设置`checkpoint_steps`为需要间隔训练半小时、一小时左右的时间,一旦环境故障,可以获取到最新的`checkpoint`。 + + +### 3.4 观察评估 + +- **训练过程观察**:VisualDL可视化日志记录 + - 日志展示为全局loss,波动小。 + - 记录混合精度,loss_scaling等信息,方便用户debug。 + - 对模型结构,配置参数,paddle版本信息进行记录,方便复现环境 + +

+ +

+ + +- **下游任务评估**:CLUE Benchmark搜索评估参数效果 + - 使用[批量启动-grid-search](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/benchmark/clue#%E6%89%B9%E9%87%8F%E5%90%AF%E5%8A%A8-grid-search),可以进行批量搜索任务 + - 注意,这里使用的是训练中的checkpoint进行评估,可以直接试着 评估待评估的参数为,所在的路径地址,即如 `python grid_seach.py output/ernie-base-outdir/model_100000` 之类的checkpoint地址。 + + + +## 4. 训练效果 + +**训练效果方面**,我们release了 base、large两个模型。均取得了较好的预训练效果。 + + + +### 4.1 ERNIE 1.0-Base-zh-cw 模型 + +使用CLUE,WuDao共计400GB的语料,batch_size 1024, 训练 400w step,即可训练得到`ernie-3.0-base-zh`类似的模型效果。相关模型参数,开源为`ernie-1.0-base-zh-cw`,用户加载即可使用。使用CLUE benchmark 对最优超参数进行GradSearch搜索: + +Model                                  | Arch | CLUE AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUE WSC2020 | CSL | CMRC | CHID | C3 +-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | + Metrics |   |   | Acc | Acc | Acc | Acc | Acc | Acc | Acc | Exact/F1| Acc| Acc +ERNIE 1.0-Base-zh-cw | 12L768H | 76.47 | 76.04 | 57.86 | 59.91 | 83.41 | 79.58 | 89.91 | 83.42 | 72.88/90.78 | 84.68 | 76.98 | +ERNIE 2.0-Base-zh | 12L768H | 74.32 | 75.65 | 58.25 | 61.64 | 82.62 | 78.71 | 81.91 | 82.33 | 66.08/87.46 | 82.78 | 73.19 +ERNIE 1.0-Base-zh | 12L768H | 74.17 | 74.84 | 58.91 | 62.25 | 81.68 | 76.58 | 85.20 | 82.77 | 67.32/87.83 | 82.47 | 69.68 + + + + +### 4.2 ERNIE 1.0-Large-zh-cw 模型 + +除了base模型外,我们还训练了large模型。命名为`ernie-1.0-large-zh-cw`。使用开源语料,batch_size 512, 训练 400w step,训练去除SOP任务,只保留MLM损失,使用CLUE benchmark 对最优超参数进行GradSearch搜索: + +Model                                    | Arch | CLUE AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUE WSC2020 | CSL | CMRC | CHID | C3 +-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | +Metrics |   |   | Acc | Acc | Acc | Acc | Acc | Acc | Acc | Exact/F1 | Acc| Acc +ERNIE 1.0-Large-zh-cw| 24L1024H | 79.03 | 75.97 | 59.65 | 62.91 | 85.09 | 81.73| 93.09 | 84.53 | 74.22/91.88 | 88.57 | 84.54 +ERNIE 3.0-Xbase-zh| 20L1024H | 78.39 | 76.16 | 59.55 | 61.87 | 84.40 | 81.73 | 88.82 | 83.60 | 75.99/93.00 | 86.78 | 84.98 +RoBERTa-wwm-ext-large | 24L1024H | 76.61 | 76.00 | 59.33 | 62.02 | 83.88 | 78.81 | 90.79 | 83.67 | 70.58/89.82 | 85.72 | 75.26 + + + +## 5. 参考文献 + +感谢CLUE,WuDao提供的开源文本语料,主要数据流部分参考自[Megatron](https://github.com/NVIDIA/Megatron-LM),参考资料: +- Xu, L., Zhang, X. and Dong, Q., 2020. CLUECorpus2020: A large-scale Chinese corpus for pre-training language model. arXiv preprint arXiv:2003.01355. +- Yuan, S., Zhao, H., Du, Z., Ding, M., Liu, X., Cen, Y., Zou, X., Yang, Z. and Tang, J., 2021. Wudaocorpora: A super large-scale chinese corpora for pre-training language models. AI Open, 2, pp.65-68. +- https://github.com/CLUEbenchmark/CLUECorpus2020 +- https://resource.wudaoai.cn +- https://github.com/NVIDIA/Megatron-LM diff --git a/model_zoo/ernie-1.0/run_gb512_s1m.sh b/model_zoo/ernie-1.0/run_gb512_s1m.sh new file mode 100644 index 0000000000000000000000000000000000000000..70719bc347e0e5972540d5e12712d089c4d23df4 --- /dev/null +++ b/model_zoo/ernie-1.0/run_gb512_s1m.sh @@ -0,0 +1,54 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +set -x +unset CUDA_VISIBLE_DEVICES + +rm -rf core.* + +# dp8 for 8 worker of data parallel +# gb512 for the global batch size is 512 = 64 * 8 +# s1m for max steps is 1 million +task_name="ernie-1.0-dp8-gb512" +rm -rf output/$task_name/log + +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name/log" \ + run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --micro_batch_size 64 \ + --use_amp true \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --checkpoint_steps 5000 \ + --decay_steps 990000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20\ + --num_workers 2 \ + --eval_freq 1000 \ + --device "gpu"\ + diff --git a/model_zoo/ernie-1.0/run_gb512_s1m_static.sh b/model_zoo/ernie-1.0/run_gb512_s1m_static.sh new file mode 100644 index 0000000000000000000000000000000000000000..23e885ca5fed215f26229a303c261958601116c2 --- /dev/null +++ b/model_zoo/ernie-1.0/run_gb512_s1m_static.sh @@ -0,0 +1,60 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -x +unset CUDA_VISIBLE_DEVICES + +rm -rf *.prototxt +rm -rf core.* +rm -rf start_sharding* +rm -rf main_sharding* + +# dp8 for 8 worker of data parallel +# gb512 for the global batch size is 512 = 64 * 8 +task_name="ernie-1.0-dp8-gb512" +rm -rf output/$task_name/log + +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name/log" \ + run_pretrain_static.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --micro_batch_size 64 \ + --sharding_degree 1\ + --dp_degree 8 \ + --use_sharding false \ + --use_amp true \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --checkpoint_steps 5000 \ + --decay_steps 990000 \ + --weight_decay 0.01\ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --num_workers 2 \ + --logging_freq 20\ + --eval_freq 1000 \ + --device "gpu" + +# NOTE: please set use_sharding=True for sharding_degree > 1 diff --git a/model_zoo/ernie-1.0/run_gb512_s1m_trainer.sh b/model_zoo/ernie-1.0/run_gb512_s1m_trainer.sh new file mode 100644 index 0000000000000000000000000000000000000000..04eddc01ad72da6477b03ff0913e4d7ab78a9952 --- /dev/null +++ b/model_zoo/ernie-1.0/run_gb512_s1m_trainer.sh @@ -0,0 +1,53 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +set -x +unset CUDA_VISIBLE_DEVICES + +# dp8 for 8 worker of data parallel +# gb512 for the global batch size is 512 = 64 * 8 +# s1m for max steps is 1 million +task_name="ernie-1.0-dp8-gb512" +rm -rf output/$task_name/ + +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3,4,5,6,7" \ + --log_dir "output/$task_name""_log" \ + run_pretrain_trainer.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_length 512 \ + --per_device_train_batch_size 64 \ + --per_device_eval_batch_size 64 \ + --fp16 \ + --fp16_opt_level "O2" \ + --learning_rate 0.0001 \ + --min_learning_rate 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 20\ + --dataloader_num_workers 4 \ + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --do_train \ + --device "gpu" diff --git a/model_zoo/ernie-1.0/run_npu_single_card.sh b/model_zoo/ernie-1.0/run_npu_single_card.sh new file mode 100644 index 0000000000000000000000000000000000000000..f8011169b62c44dda2298348657ab193154294d0 --- /dev/null +++ b/model_zoo/ernie-1.0/run_npu_single_card.sh @@ -0,0 +1,51 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +set -x + +export FLAGS_selected_npus=0 +export ASCEND_GLOBAL_LOG_LEVEL=3 +export FLAGS_allocator_strategy=naive_best_fit + +rm -rf core.* + +task_name="ernie-1.0-npu" +rm -rf output/$task_name/log + +python -u run_pretrain.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-3.0-base-zh" \ + --tokenizer_name_or_path "ernie-3.0-base-zh" \ + --input_dir "./data" \ + --data_impl "mmap" \ + --output_dir "output/$task_name" \ + --split 949,50,1 \ + --max_seq_len 512 \ + --micro_batch_size 52 \ + --use_amp true \ + --fp16_opt_level "O1" \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --checkpoint_steps 5000 \ + --decay_steps 990000 \ + --weight_decay 0.01 \ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --logging_freq 20\ + --num_workers 8 \ + --eval_freq 1000 \ + --device "npu"\ diff --git a/model_zoo/ernie-1.0/run_pretrain.py b/model_zoo/ernie-1.0/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..ae9a0ee6962cf17e793a72a6870dc7617c96242f --- /dev/null +++ b/model_zoo/ernie-1.0/run_pretrain.py @@ -0,0 +1,763 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +ERNIE-1.0 pretraining scripts. 
+""" +import contextlib +import json +import os +import random +import shutil +import sys +import time + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.distributed.fleet as fleet +import yaml +from args import parse_args +from data_tools.dataset_utils import build_train_valid_test_datasets +from paddle.distributed.fleet.utils.hybrid_parallel_util import ( + fused_allreduce_gradients, +) +from visualdl import LogWriter + +from paddlenlp.data import Stack +from paddlenlp.transformers import ( + ErnieConfig, + ErnieForMaskedLM, + ErnieForPretraining, + ErniePretrainingCriterion, + ErnieTokenizer, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie": (ErnieConfig, ErnieForPretraining, ErniePretrainingCriterion, ErnieTokenizer), +} + + +def create_pretrained_dataset( + args, + data_file, + tokenizer, + data_world_size, + data_world_rank, + max_seq_len, + places=None, + data_holders=None, + binary_head=True, + current_step=0, +): + + train_valid_test_num_samples = [ + args.global_batch_size * args.max_steps, + args.micro_batch_size * (args.max_steps // args.eval_freq + 1) * args.eval_iters * data_world_size, + args.micro_batch_size * args.test_iters * data_world_size, + ] + + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + data_prefix=data_file, + args=args, + tokenizer=tokenizer, + splits_string=args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + max_seq_length=args.max_seq_len, + masked_lm_prob=args.masked_lm_prob, + short_seq_prob=args.short_seq_prob, + seed=args.seed, + skip_warmup=True, + binary_head=binary_head, + max_seq_length_dec=None, + dataset_type="ernie", + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode") + input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels = data + if tokenizer.pad_token_id in input_ids: + input_ids = input_ids[0 : list(input_ids).index(tokenizer.pad_token_id)] + logger.info(tokenizer._decode(input_ids)) + for pos, label in zip(masked_lm_positions, masked_lm_labels): + input_ids[pos] = label + logger.info(tokenizer._decode(input_ids)) + logger.info(tokenizer.convert_ids_to_tokens(masked_lm_labels)) + + print_dataset(train_ds[0], "train") + print_dataset(valid_ds[0], "valid") + print_dataset(test_ds[0], "test") + + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. next_sentence_labels + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + out[5] = out[5].reshape([-1, 1]) + _, seq_length = out[0].shape + size = sum(len(x[3]) for x in data) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + if args.device == "npu": + # For NPU device, fixed input sentence length, in + # order to reduce the number of op compile. 
+ if size % 80 != 0: + size += 80 - (size % 80) + else: + if size % 8 != 0: + size += 8 - (size % 8) + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + + return out + + def loader(dataset, consumed_samples=0): + batch_sampler = DistributedBatchSampler( + dataset, + batch_size=args.micro_batch_size, + num_replicas=data_world_size, + rank=data_world_rank, + shuffle=False, + drop_last=True, + consumed_samples=consumed_samples, + ) + data_loader = paddle.io.DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + num_workers=args.num_workers, + worker_init_fn=None, + collate_fn=_collate_data, + return_list=False, + ) + return data_loader + + train_dl = loader(train_ds, args.global_batch_size * current_step) + valid_dl = loader( + valid_ds, args.micro_batch_size * ((current_step + 1) // args.eval_freq) * args.eval_iters * data_world_size + ) + test_dl = loader(test_ds, 0) + + return train_dl, valid_dl, test_dl + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... + return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and ("_idx.npz" in str(f) or ".idx" in str(f))) + ] + files = [x.replace("_idx.npz", "") for x in files] + files = [x.replace(".idx", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def all_gather(v): + if dist.get_world_size() <= 1: + return v.item() + ret = [] + dist.all_gather(ret, v) + output_tensors = [t if len(t.shape) > 0 else t.reshape_([-1]) for t in ret] + concat = paddle.concat(output_tensors, axis=0) + return concat.mean().item() + + +@paddle.no_grad() +def run_evaluate(data_loader, model, criterion, iter_steps, log_writer, global_step, args, task_name="valid"): + model.eval() + + if args.binary_head: + loss_global = { + "loss": paddle.to_tensor(0.0), + "lm_loss": paddle.to_tensor(0.0), + "sop_loss": paddle.to_tensor(0.0), + } + else: + loss_global = { + "loss": paddle.to_tensor(0.0), + } + + local_time = time.time() + + for eval_step, batch in enumerate(data_loader): + input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=[ + "softmax", + "layer_norm", + "gelu", + ], + custom_black_list=[ + "c_softmax_with_cross_entropy", + ], + level=args.fp16_opt_level, + ): + + if args.binary_head: + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + + lm_loss, sop_loss = criterion( + prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels + ) + loss = lm_loss + sop_loss + else: + prediction_scores = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + + loss = criterion(prediction_scores, None, masked_lm_labels) + + loss_global["loss"] += loss.detach() + 
if args.binary_head: + loss_global["lm_loss"] += lm_loss.detach() + loss_global["sop_loss"] += sop_loss.detach() + + if eval_step >= iter_steps - 1: + log_info_dict = dict() + for k, v in loss_global.items(): + log_info_dict[k] = all_gather(v) / iter_steps + v.subtract_(v) + if dist.get_rank() == 0: + log_info_dict["samples_per_second"] = ( + iter_steps * args.micro_batch_size * dist.get_world_size() / (time.time() - local_time) + ) + loss_info = ", ".join( + ["{}: {:.6f}".format(k, log_info_dict[k]) for k in log_info_dict.keys() if k.endswith("loss")] + ) + + logger.info( + "%s step %d, batch: %d, %s, ips: %.0f seqs/s" + % (task_name, global_step, iter_steps, loss_info, log_info_dict["samples_per_second"]) + ) + + for k, v in log_info_dict.items(): + log_writer.add_scalar("%s/%s" % (task_name, k), v, global_step) + + break + + model.train() + + +def set_seed(args): + if args.device == "cpu": + idx = 0 + else: + idx = paddle.distributed.get_rank() + random.seed(args.seed + idx) + np.random.seed(args.seed + idx) + paddle.seed(args.seed + idx) + + +def args_post_process(args, worker_num): + default_global_batch_size = worker_num * args.micro_batch_size + if args.global_batch_size is None: + args.global_batch_size = default_global_batch_size + + bsz_per_dp = args.global_batch_size // worker_num + micro_batch_size = args.micro_batch_size + assert ( + args.global_batch_size % micro_batch_size == 0 + ), "cannot do gradient accumulate, global_batch_size: {} micro_batch_size: {}".format( + args.global_batch_size, micro_batch_size + ) + accumulate_steps = bsz_per_dp // micro_batch_size + assert ( + accumulate_steps >= 1 + ), f"Larger global_batch_size: {args.global_batch_size} is expect, micro_batch_size is {micro_batch_size}, but only {bsz_per_dp} on each card!" + + args.eval_iters *= accumulate_steps + args.test_iters *= accumulate_steps + + args.accumulate_steps = accumulate_steps + + +def default_logdir() -> str: + """ + Same default + """ + import socket + from datetime import datetime + + current_time = datetime.now().strftime("%b%d_%H-%M-%S") + return os.path.join("runs", current_time + "_" + socket.gethostname()) + + +def do_train(args): + paddle.set_device(args.device) + + worker_index = paddle.distributed.get_rank() + worker_num = paddle.distributed.get_world_size() + + if worker_num > 1: + paddle.distributed.init_parallel_env() + + if args.dp_degree * args.sharding_degree == 1: + args.dp_degree = worker_num + args.sharding_degree = 1 + + args_post_process(args, worker_num) + + logger.info("{:20}:{}".format("paddle commit id", paddle.version.commit)) + for arg in vars(args): + logger.info("{:20}:{}".format(arg, getattr(args, arg))) + + strategy = fleet.DistributedStrategy() + strategy.hybrid_configs = {"dp_degree": args.dp_degree, "mp_degree": 1, "pp_degree": 1, "sharding_degree": 1} + + fleet.init(is_collective=True, strategy=strategy) + + # Create the random seed for the worker + set_seed(args) + + assert ( + args.dp_degree * args.sharding_degree == worker_num + ), "The product of degree num should be equal to worker_num." 
+ + # Create log write, + log_writer = None + if worker_index == 0: + log_writer = LogWriter(os.path.join(args.output_dir, default_logdir())) + + # Define the input data in the static mode + config_class, model_class, criterion_class, tokenizer_class = MODEL_CLASSES[args.model_type] + if args.binary_head is False: + model_class = ErnieForMaskedLM + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + # load config in checkpoint + global_step = 0 + checkpoint_dir = os.path.join(args.output_dir, "model_last") + if os.path.exists(checkpoint_dir): + if os.path.isfile(os.path.join(checkpoint_dir, "./config.yml")): + with open(os.path.join(checkpoint_dir, "./config.yml"), "r") as f: + step_config = yaml.load(f, Loader=yaml.FullLoader) + assert ( + step_config["global_batch_size"] == args.global_batch_size + ), "Please ensure checkpoint global batch size is the same. Folder: {}".format(checkpoint_dir) + global_step = step_config["global_step"] + + if args.model_name_or_path in pretrained_models_list and not args.continue_training: + logger.warning(f"Your model {args.model_name_or_path} is training from scratch !!!") + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + model_config["enable_recompute"] = args.use_recompute + model = model_class(config_class(**model_config)) + else: + logger.warning(f"Your model is continue training from {args.model_name_or_path}") + model = model_class.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + attention_probs_dropout_prob=args.attention_probs_dropout_prob, + enable_recompute=args.use_recompute, + ) + + criterion = criterion_class(with_nsp_loss=args.binary_head) + + if worker_index == 0: + # log the model config and args + model_config_json = json.dumps(model.config.to_dict(), ensure_ascii=False, indent=2) + log_writer.add_text("model_config", model_config_json) + args_dict = {"paddle commit id": str(paddle.version.commit)} + for arg in vars(args): + args_dict[arg] = str(getattr(args, arg)) + log_writer.add_text("args", json.dumps(args_dict, indent=2)) + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + assert args.warmup_rate <= 1.0 and args.warmup_rate >= 0.0, "warmup_rate should be in [0, 1]" + args.warmup_steps = args.warmup_rate * args.max_steps + + lr_scheduler = LinearAnnealingWithWarmupDecay( + args.max_lr, args.min_lr, warmup_step=args.warmup_steps, decay_step=args.decay_steps, last_epoch=global_step + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + decay_param = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + logger.info("Using paddle.optimizer.AdamW.") + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler if lr_scheduler is not None else args.max_lr, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_param, + multi_precision=args.use_amp, + ) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + scaler = fleet.distributed_scaler(scaler) + model = paddle.amp.decorate(models=model, level=args.fp16_opt_level) + 
else: + scaler = None + + if paddle.distributed.get_world_size() > 1: + model = fleet.distributed_model(model) + optimizer = fleet.distributed_optimizer(optimizer) + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name_or_path) + # Must extend chinese char for ErnieTokenizer + tokenizer.extend_chinese_char() + + data_file = get_train_data_file(args) + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + data_file, + tokenizer, + data_world_size=worker_num, + data_world_rank=worker_index, + max_seq_len=args.max_seq_len, + binary_head=args.binary_head, + current_step=global_step, + ) + + # load checkpoint vars + if os.path.exists(checkpoint_dir): + if os.path.isfile(os.path.join(checkpoint_dir, "./config.yml")): + logger.info("Try to load checkpoint from %s " % checkpoint_dir) + opt_path = os.path.join(checkpoint_dir, "model_state.pdopt") + params_path = os.path.join(checkpoint_dir, "model_state.pdparams") + + if os.path.exists(opt_path): + load_dict = paddle.load(params_path) + model_dict = model.state_dict() + if args.use_amp and args.fp16_opt_level == "O2": + for k, v in load_dict.items(): + if k not in model_dict: + logger.warning(f"Checkpoint have too much keys: {k}") + continue + if "layer_norm" not in model_dict[k].name: + load_dict[k] = v.astype("float16") + model.set_state_dict(load_dict) + opt_dict = paddle.load(opt_path) + optimizer.set_state_dict(opt_dict) + else: + logger.warning("No optimizer checkpoint file found in %s." % opt_path) + if scaler is not None and os.path.isfile(os.path.join(checkpoint_dir, "scaler.pdparams")): + scaler.load_state_dict(paddle.load(os.path.join(checkpoint_dir, "scaler.pdparams"), return_numpy=True)) + logger.info("Checkpoint loaded from global step: {}".format(global_step)) + + if args.binary_head: + loss_global = { + "loss": paddle.to_tensor(0.0), + "lm_loss": paddle.to_tensor(0.0), + "sop_loss": paddle.to_tensor(0.0), + } + else: + loss_global = { + "loss": paddle.to_tensor(0.0), + } + + tic_train = time.time() + while True: + # If not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. + valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + + # time count + train_reader_cost = 0.0 + train_run_cost = 0.0 + tr_loss = paddle.to_tensor(0.0) + reader_start = time.time() + + for step, batch in enumerate(train_data_loader()): + train_reader_cost += time.time() - reader_start + train_start = time.time() + + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. next_sentence_labels + + input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels = batch + + ctx_manager = contextlib.nullcontext() if sys.version_info >= (3, 7) else contextlib.suppress() + + if worker_num > 1 and (args.use_recompute or ((step + 1) % args.accumulate_steps != 0)): + # grad acc, no_sync when (step + 1) % args.accumulate_steps != 0: + # recompute, no_sync every where + # recompute + grad_acc, no_sync every where + ctx_manager = model.no_sync() + else: + ctx_manager = contextlib.nullcontext() if sys.version_info >= (3, 7) else contextlib.suppress() + + with ctx_manager: + # For NPU device, using fp16 data type to execute `dropout` NPU op + # can improve performance, which can change `Cast` CANN OP from + # AICPU operator to AICore operator. 
+ with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=[ + "softmax", + "layer_norm", + "gelu", + "dropout", + ], + custom_black_list=[ + "c_softmax_with_cross_entropy", + ], + level=args.fp16_opt_level, + ): + + # Create the model for the ernie pretrain + if args.binary_head: + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + lm_loss, sop_loss = criterion( + prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels + ) + loss = lm_loss + sop_loss + else: + prediction_scores = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + + loss = criterion(prediction_scores, None, masked_lm_labels) + + if args.accumulate_steps >= 1: + tr_loss_step = loss / args.accumulate_steps + else: + tr_loss_step = loss + + if args.use_amp: + scaler.scale(tr_loss_step).backward() + else: + tr_loss_step.backward() + + tr_loss += tr_loss_step.detach() + + loss_global["loss"] += loss.detach() + if args.binary_head: + loss_global["lm_loss"] += lm_loss.detach() + loss_global["sop_loss"] += sop_loss.detach() + + # Skip for accumulate_steps in global step + if (step + 1) % args.accumulate_steps != 0: + continue + + if worker_num > 1 and args.use_recompute: + fused_allreduce_gradients(list(model.parameters()), None) + + if args.use_amp: + scaler.minimize(optimizer, tr_loss) + else: + optimizer.step() + + if args.device == "npu": + # For NPU device, set set_to_zero to False can improve + # performance. + optimizer.clear_grad(set_to_zero=False) + else: + optimizer.clear_grad() + train_run_cost += time.time() - train_start + tr_loss.subtract_(tr_loss) + + global_step += 1 + + if global_step % args.logging_freq == 0: + log_info_dict = dict() + log_info_dict["global_step"] = global_step + for k, v in loss_global.items(): + log_info_dict[k] = all_gather(v) / args.logging_freq / args.accumulate_steps + v.subtract_(v) + if worker_index == 0: + speed = args.logging_freq / (time.time() - tic_train) + log_info_dict["learning_rate"] = lr_scheduler.get_lr() + log_info_dict["steps_per_second"] = speed + log_info_dict["samples_per_second"] = speed * args.global_batch_size + + for k, v in log_info_dict.items(): + log_writer.add_scalar("train/%s" % k, v, global_step) + + loss_info = ", ".join( + ["{}: {:.6f}".format(k, log_info_dict[k]) for k in log_info_dict.keys() if k.endswith("loss")] + ) + + common_loginfo = ( + "global step %d, %s, speed: %.2f steps/s, ips: %.2f seqs/s, learning rate: %.5e" + % ( + global_step, + loss_info, + speed, + log_info_dict["samples_per_second"], + log_info_dict["learning_rate"], + ) + ) + + addition_info = "" + if args.use_amp: + amp_info = { + "loss_scaling": scaler._scale.item(), + "incr_count": scaler._incr_count, + "decr_count": scaler._decr_count, + } + addition_info = ", ".join("%s: %.2f" % (k, v) for k, v in amp_info.items()) + for k, v in amp_info.items(): + log_writer.add_scalar("amp/%s" % k, v, global_step) + + logger.info(", ".join([common_loginfo, addition_info])) + + tic_train = time.time() + + if lr_scheduler is not None: + lr_scheduler.step() + + if global_step % args.eval_freq == 0: + # TODO, check the input data of validation + + run_evaluate( + valid_data_loader, + model, + criterion, + args.eval_iters, + log_writer, + global_step, + args, + task_name="valid", + ) + tic_train = time.time() + + def 
save_ckpt(output_dir, model, tokenizer, optimizer, scaler, args, global_step): + step_config = { + "model_name": args.model_name_or_path, + "global_step": global_step, + "global_batch_size": args.global_batch_size, + "consumed_samples": global_step * args.global_batch_size, + } + + logger.debug("saving models to {}".format(output_dir)) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + + tokenizer.save_pretrained(output_dir) + # added token is not need for downstream finetune tasks. + added_token_path = os.path.join(output_dir, "added_tokens.json") + if os.path.exists(added_token_path): + os.remove(added_token_path) + + model_to_save.save_model_config(output_dir) + model_dict = model_to_save.state_dict() + if scaler is not None: + paddle.save(scaler.state_dict(), os.path.join(output_dir, "scaler.pdparams")) + for k, v in model_dict.items(): + if v.dtype is paddle.float16: + model_dict[k] = v.astype("float32") + paddle.save(model_dict, os.path.join(output_dir, "model_state.pdparams")) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + + with open(os.path.join(output_dir, "config.yml"), "w") as f: + yaml.dump(step_config, f, encoding="utf-8", allow_unicode=True) + + if global_step % args.save_steps == 0 or global_step >= args.max_steps: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if worker_index == 0: + save_ckpt(output_dir, model, tokenizer, optimizer, scaler, args, global_step) + + if worker_num > 1: + paddle.distributed.barrier() + tic_train = time.time() + + if global_step % args.checkpoint_steps == 0: + output_dir = os.path.join(args.output_dir, "model_last") + if worker_index == 0: + if not os.path.exists(output_dir): + os.mkdir(output_dir) + output_dir_bak = os.path.join(args.output_dir, "model_last_bak") + if os.path.exists(output_dir): + if os.path.exists(output_dir_bak): + shutil.rmtree(output_dir_bak) + shutil.move(output_dir, output_dir_bak) + os.mkdir(output_dir) + save_ckpt(output_dir, model, tokenizer, optimizer, scaler, args, global_step) + + if worker_num > 1: + paddle.distributed.barrier() + + if global_step >= args.max_steps: + run_evaluate( + test_data_loader, + model, + criterion, + args.test_iters, + log_writer, + global_step, + args, + task_name="test", + ) + del train_data_loader + del valid_data_loader + del test_data_loader + return + + +if __name__ == "__main__": + config = parse_args(MODEL_CLASSES) + do_train(config) diff --git a/model_zoo/ernie-1.0/run_pretrain_static.py b/model_zoo/ernie-1.0/run_pretrain_static.py new file mode 100644 index 0000000000000000000000000000000000000000..0986e01039b2b43fc0bec60b40479da135eb4c5b --- /dev/null +++ b/model_zoo/ernie-1.0/run_pretrain_static.py @@ -0,0 +1,744 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +ERNIE pretraining scripts for paddlepaddle static graph mode. 
+""" +import collections +import json +import os +import random +import shutil +import time + +import numpy as np +import paddle +import paddle.distributed as dist +import paddle.distributed.fleet as fleet +import yaml +from args import parse_args +from data_tools.dataset_utils import build_train_valid_test_datasets +from paddle.distributed.fleet.meta_optimizers.sharding.utils import save_persistables +from visualdl import LogWriter + +from paddlenlp.data import Stack +from paddlenlp.ops import Topology, get_rng_state_tracker +from paddlenlp.transformers import ( + ErnieConfig, + ErnieForPretraining, + ErniePretrainingCriterion, + ErnieTokenizer, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie": (ErnieConfig, ErnieForPretraining, ErniePretrainingCriterion, ErnieTokenizer), +} + + +def create_pretrained_dataset( + args, + data_file, + tokenizer, + data_world_size, + data_world_rank, + max_seq_len, + places, + data_holders, + binary_head=True, + current_step=0, +): + + train_valid_test_num_samples = [ + args.global_batch_size * args.max_steps, + args.micro_batch_size * (args.max_steps // args.eval_freq + 1) * args.eval_iters * data_world_size, + args.micro_batch_size * args.test_iters * data_world_size, + ] + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + data_prefix=data_file, + args=args, + tokenizer=tokenizer, + splits_string=args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + max_seq_length=args.max_seq_len, + masked_lm_prob=args.masked_lm_prob, + short_seq_prob=args.short_seq_prob, + seed=args.seed, + skip_warmup=True, + binary_head=binary_head, + max_seq_length_dec=None, + dataset_type="ernie", + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data for {mode} mode") + input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels = data + if tokenizer.pad_token_id in input_ids: + input_ids = input_ids[0 : list(input_ids).index(tokenizer.pad_token_id)] + logger.info(tokenizer._decode(input_ids)) + for pos, label in zip(masked_lm_positions, masked_lm_labels): + input_ids[pos] = label + logger.info(tokenizer._decode(input_ids)) + logger.info(tokenizer.convert_ids_to_tokens(masked_lm_labels)) + + print_dataset(train_ds[0], "train") + print_dataset(valid_ds[0], "valid") + print_dataset(test_ds[0], "test") + + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. 
next_sentence_labels + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + out[5] = out[5].reshape([-1, 1]) + _, seq_length = out[0].shape + size = sum(len(x[3]) for x in data) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + if size % 8 != 0: + size += 8 - (size % 8) + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + + return out + + def loader(dataset, consumed_samples=0): + batch_sampler = DistributedBatchSampler( + dataset, + batch_size=args.micro_batch_size, + num_replicas=data_world_size, + rank=data_world_rank, + shuffle=False, + drop_last=True, + consumed_samples=consumed_samples, + ) + data_loader = paddle.io.DataLoader( + dataset=dataset, + places=places, + feed_list=data_holders, + batch_sampler=batch_sampler, + num_workers=args.num_workers, + worker_init_fn=None, + collate_fn=_collate_data, + return_list=False, + ) + return data_loader + + train_dl = loader(train_ds, args.global_batch_size * current_step) + valid_dl = loader( + valid_ds, args.micro_batch_size * ((current_step + 1) // args.eval_freq) * args.eval_iters * data_world_size + ) + test_dl = loader(test_ds, 0) + + return train_dl, valid_dl, test_dl + + +def create_data_holder(args=None): + input_ids = paddle.static.data(name="input_ids", shape=[-1, -1], dtype="int64") + segment_ids = paddle.static.data(name="segment_ids", shape=[-1, -1], dtype="int64") + input_mask = paddle.static.data(name="input_mask", shape=[-1, 1, 1, -1], dtype="float32") + masked_lm_positions = paddle.static.data(name="masked_lm_positions", shape=[-1], dtype="int32") + masked_lm_labels = paddle.static.data(name="masked_lm_labels", shape=[-1, 1], dtype="int64") + + next_sentence_labels = paddle.static.data(name="next_sentence_labels", shape=[-1, 1], dtype="int64") + + return [input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels] + + +def dist_optimizer(args, topo): + default_global_batch_size = topo.data_info.size * args.micro_batch_size + if args.global_batch_size is None: + args.global_batch_size = default_global_batch_size + + bsz_per_dp = args.global_batch_size // topo.data_info.size + micro_batch_size = args.micro_batch_size + assert ( + args.global_batch_size % micro_batch_size == 0 + ), "cannot do gradient accumulate, global_batch_size: {} micro_batch_size: {}".format( + args.global_batch_size, micro_batch_size + ) + accumulate_steps = bsz_per_dp // micro_batch_size + + exec_strategy = paddle.static.ExecutionStrategy() + exec_strategy.num_threads = 1 + exec_strategy.num_iteration_per_drop_scope = 10000 + + build_strategy = paddle.static.BuildStrategy() + build_strategy.enable_sequential_execution = True # for profile + # build_strategy.reduce_strategy = paddle.static.BuildStrategy.ReduceStrategy._NoReduce + build_strategy.fuse_broadcast_ops = True + build_strategy.fix_op_run_order = True + build_strategy.enable_inplace = True + build_strategy.enable_addto = args.enable_addto + + dist_strategy = fleet.DistributedStrategy() + # dist_strategy.without_graph_optimization = True + dist_strategy.execution_strategy = exec_strategy + dist_strategy.build_strategy = build_strategy + dist_strategy.nccl_comm_num = 3 + dist_strategy.fuse_grad_size_in_MB = 16 + + dist_strategy.recompute = args.use_recompute + 
dist_strategy.pipeline = args.pp_degree > 1 + + if args.pp_degree <= 1 and args.sharding_degree <= 1 and accumulate_steps > 1: + dist_strategy.gradient_merge = True + dist_strategy.gradient_merge_configs = {"k_steps": accumulate_steps} + args.eval_iters *= accumulate_steps + args.test_iters *= accumulate_steps + + if args.use_amp: + dist_strategy.amp = True + dist_strategy.amp_configs = { + "custom_white_list": [ + "softmax", + "layer_norm", + "gelu", + ], + "custom_black_list": ["c_softmax_with_cross_entropy"], + "init_loss_scaling": 32768, + "use_dynamic_loss_scaling": True, + } + if args.use_sharding: + dist_strategy.sharding = True + dist_strategy.sharding_configs = { + "segment_broadcast_MB": 32, + "sharding_degree": args.sharding_degree, + "mp_degree": args.mp_degree, + "pp_degree": args.pp_degree, + "dp_degree": args.dp_degree, + "gradient_merge_acc_step": accumulate_steps if args.sharding_degree > 1 else 1, + "optimize_offload": False, + } + if args.pp_degree > 1: + dist_strategy.pipeline_configs = { + "schedule_mode": "1F1B", + "micro_micro_batch_size": micro_batch_size, + "accumulate_steps": accumulate_steps, + } + + args.accumulate_steps = accumulate_steps + return dist_strategy + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... + return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and "_idx.npz" in str(f)) + ] + files = [x.replace("_idx.npz", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def run_evaluate( + data_loader, exe, program, iter_steps, log_writer, global_step, args, is_last, eval_fetch, task_name="valid" +): + all_ret = collections.defaultdict(list) + average_ret = collections.defaultdict(float) + + local_time = time.time() + worker_num = fleet.worker_num() + + for eval_step, batch in enumerate(data_loader): + ret = exe.run(program, feed=batch, fetch_list=list(eval_fetch.values())) + if is_last: + for k, v in zip(list(eval_fetch.keys()), ret): + all_ret[k].append(v.item()) + + if eval_step >= iter_steps - 1: + if not is_last or log_writer is None: + break + + for k in list(eval_fetch.keys()): + average_ret[k] = sum(all_ret[k]) / len(all_ret[k]) / worker_num + + speed = iter_steps / (time.time() - local_time) + speed_tokens = speed * args.micro_batch_size * args.max_seq_len * worker_num + ips = speed * args.micro_batch_size * worker_num + + loss_info = ", ".join(["{}: {:.6f}".format(k, average_ret[k]) for k in eval_fetch.keys()]) + + logger.info( + "%s step %d, batch: %d, %s, speed: %.0f tokens/s, ips: %.2f seqs/s" + % (task_name, global_step, eval_step + 1, loss_info, speed_tokens, ips) + ) + + for k in list(eval_fetch.keys()): + log_writer.add_scalar("%s/%s" % (task_name, k), average_ret[k], global_step) + + break + + +def all_reduce(v): + if fleet.worker_num() <= 1: + return v + v = v + 0 + dist.all_reduce(v) + return v + + +def default_logdir() -> str: + """ + Same default + """ + import socket + from datetime import datetime + + current_time = datetime.now().strftime("%b%d_%H-%M-%S") + return os.path.join("runs", current_time + "_" + socket.gethostname()) + + +def do_train(args): + # Initialize the paddle and paddle fleet execute environment + paddle.enable_static() + 
fleet.init(is_collective=True) + + # Create the random seed for the worker + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + get_rng_state_tracker().add("global_seed", args.seed) + get_rng_state_tracker().add("local_seed", args.seed + fleet.worker_index() + 2021) + + assert args.device in ["cpu", "gpu", "xpu", "mlu"], "Invalid device! Available device should be cpu, gpu, or xpu." + place = paddle.set_device(args.device) + + worker_num = fleet.worker_num() + worker_index = fleet.worker_index() + + assert ( + args.dp_degree * args.sharding_degree * args.mp_degree * args.pp_degree == worker_num + ), "The product of degree num should be equal to worker_num." + + topo = Topology( + device_rank=worker_index, + world_size=worker_num, + dp_degree=args.dp_degree, + pp_degree=args.pp_degree, + sharding_degree=args.sharding_degree, + mp_degree=args.mp_degree, + ) + + logger.info("The topo of hybrid parallelism:\n{}".format(topo)) + + dist_strategy = dist_optimizer(args, topo) + + # Create log write, train results show on last card of pipeline. + # Create log write, + log_writer = None + if worker_index == 0: + log_writer = LogWriter(os.path.join(args.output_dir, default_logdir())) + + # Define the input data in the static mode + config_class, model_class, criterion_class, tokenizer_class = MODEL_CLASSES[args.model_type] + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + # load config in checkpoint + global_step = 0 + checkpoint_dir = os.path.join(args.output_dir, "model_last") + if os.path.exists(checkpoint_dir): + if os.path.isfile(os.path.join(checkpoint_dir, "./config.yml")): + with open(os.path.join(checkpoint_dir, "./config.yml"), "r") as f: + step_config = yaml.load(f, Loader=yaml.FullLoader) + assert ( + step_config["global_batch_size"] == args.global_batch_size + ), "Please ensure checkpoint global batch size is the same. Folder: {}".format(checkpoint_dir) + global_step = step_config["global_step"] + + data_file = get_train_data_file(args) + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + with paddle.static.program_guard(main_program, startup_program): + data_holders = create_data_holder(args) + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. 
next_sentence_labels + + [ + input_ids, + segment_ids, + input_mask, + masked_lm_positions, + masked_lm_labels, + next_sentence_labels, + ] = data_holders + + tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name_or_path) + tokenizer.extend_chinese_char() + + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + data_file, + tokenizer, + data_world_size=topo.data_info.size, + data_world_rank=topo.data_info.rank, + max_seq_len=args.max_seq_len, + places=paddle.static.cuda_places(), + data_holders=data_holders, + binary_head=args.binary_head, + current_step=global_step, + ) + fleet.init(is_collective=True) + + if args.model_name_or_path in pretrained_models_list: + model_config = model_class.pretrained_init_configuration[args.model_name_or_path] + if model_config["vocab_size"] % 8 != 0: + model_config["vocab_size"] += 8 - (model_config["vocab_size"] % 8) + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + model = model_class(config_class(**model_config)) + else: + model, _ = model_class.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + attention_probs_dropout_prob=args.attention_probs_dropout_prob, + ) + + # Create the model for the gpt pretrain + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions, + ) + + criterion = criterion_class(with_nsp_loss=args.binary_head) + if args.binary_head: + lm_loss, sop_loss = criterion( + prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels + ) + loss = lm_loss + sop_loss + lm_loss_reduce = all_reduce(lm_loss) + sop_loss_reduce = all_reduce(sop_loss) + else: + loss = criterion(prediction_scores, seq_relationship_score, masked_lm_labels) + + loss_reduce = all_reduce(loss) + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + + lr_scheduler = LinearAnnealingWithWarmupDecay( + args.max_lr, + args.min_lr, + warmup_step=args.warmup_rate * args.max_steps, + decay_step=args.decay_steps, + last_epoch=global_step, + ) + + clip = None + if args.grad_clip > 0: + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=args.grad_clip) + + decay_param = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + logger.info("Using paddle.optimizer.AdamW.") + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + grad_clip=clip, + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_param, + ) + # alias + optimizer.apply_optimize = optimizer._apply_optimize + + # if args.use_recompute: + # dist_strategy.recompute = True + # dist_strategy.recompute_configs = { + # "checkpoints": model.ernie.checkpoints + # } + + # Use the fleet api to compile the distributed optimizer + optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) + + optimizer.minimize(loss) + logger.info(f"final strategy: {fleet._final_strategy()}") + logger.info("The training meta optimizer is/are %s" % fleet._get_applied_meta_list()) + + program_desc_dir = os.path.join(args.output_dir, "program_desc") + if not os.path.isdir(program_desc_dir): + os.mkdir(program_desc_dir) + + with open(program_desc_dir + "/main_program.txt.%d" 
% worker_index, "w") as f: + f.write(str(main_program)) + + with open(program_desc_dir + "/startup_program.txt.%d" % worker_index, "w") as f: + f.write(str(startup_program)) + + if worker_index == 0: + # log the model config and args + model_config_json = json.dumps(model.config.to_dict(), ensure_ascii=False, indent=2) + log_writer.add_text("model_config", model_config_json) + args_dict = {"paddle commit id": str(paddle.version.commit)} + for arg in vars(args): + args_dict[arg] = str(getattr(args, arg)) + log_writer.add_text("args", json.dumps(args_dict, indent=2)) + + # Define the Executor for running the static model + exe = paddle.static.Executor(place) + exe.run(startup_program) + + test_program = main_program.clone(for_test=True) + + if args.model_name_or_path not in pretrained_models_list: + logger.info("Try to load checkpoint from %s " % args.model_name_or_path) + dygrah_path = os.path.join(args.model_name_or_path, "model_state.pdparams") + static_path = os.path.join(args.model_name_or_path, "static_vars") + + flag_loaded = False + if os.path.exists(static_path): + if args.mp_degree > 1: + logger.warning("MP should init with dygraph params") + else: + logger.info("Loading parameters from %s" % static_path) + paddle.static.load(main_program, static_path, exe) + flag_loaded = True + + if not flag_loaded and os.path.exists(dygrah_path): + if args.sharding_degree > 1: + logger.warning("Sharding should init with static vars") + else: + logger.info("Loading parameters from %s" % dygrah_path) + # init_static_with_params(model, paddle.load(dygrah_path, return_numpy=True), topo, main_program) + flag_loaded = True + + if not flag_loaded: + logger.error("No checkpoint load.") + + # load checkpoint vars + if os.path.exists(checkpoint_dir): + if os.path.isfile(os.path.join(checkpoint_dir, "./config.yml")): + paddle.static.load(main_program, os.path.join(checkpoint_dir, "static_vars"), exe) + + fetch_loss_vars = collections.OrderedDict() + fetch_other_vars = collections.OrderedDict() + fetch_loss_vars["loss"] = loss_reduce + if args.binary_head: + fetch_loss_vars["lm_loss"] = lm_loss_reduce + fetch_loss_vars["sop_loss"] = sop_loss_reduce + + fetch_other_vars["learning_rate"] = main_program.global_block().vars["learning_rate_0"] + + additional_vars = collections.OrderedDict() + if args.use_amp: + for key in ["loss_scaling", "num_good_steps", "num_bad_steps"]: + additional_vars[key] = main_program.global_block().vars[key + "_0"] + + tic_train = time.time() + while True: + fetchs = [] + fetchs_keys = [] + if topo.is_last: + fetchs = list(fetch_loss_vars.values()) + list(fetch_other_vars.values()) + list(additional_vars.values()) + fetchs_keys = list(fetch_loss_vars.keys()) + list(fetch_other_vars.keys()) + list(additional_vars.keys()) + + # Bug fix, if not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. 
+ valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + + loss_res = collections.defaultdict(list) + for step, batch in enumerate(train_data_loader()): + ret = exe.run(main_program, feed=batch, fetch_list=fetchs, use_program_cache=True) + # Skip for accumulate_steps in global step + + if log_writer is not None: + for k, v in zip(fetchs_keys, ret): + if k in fetch_loss_vars: + loss_res[k].append(v.item()) + + if (step + 1) % args.accumulate_steps != 0: + continue + global_step += 1 + # In the new 2.0 api, must call this function to change the learning_rate + lr_scheduler.step() + + if global_step % args.logging_freq == 0: + if topo.is_last and log_writer is not None: + res = collections.defaultdict(float) + for k, v in zip(fetchs_keys, ret): + if k in fetch_loss_vars: + res[k] = sum(loss_res[k]) / len(loss_res[k]) / worker_num + loss_res[k] = [] + else: + res[k] = v + + speed = args.logging_freq / (time.time() - tic_train) + res["global_step"] = global_step + res["steps_per_second"] = speed + res["samples_per_second"] = speed * args.global_batch_size + + loss_info = ", ".join(["{}: {:.6f}".format(k, res[k]) for k in fetch_loss_vars.keys()]) + + common_loginfo = ( + "global step %d, %s, speed: %.2f steps/s, ips: %.2f seqs/s, learning rate: %.5e" + % ( + global_step, + loss_info, + res["steps_per_second"], + res["samples_per_second"], + res["learning_rate"], + ) + ) + additional_loginfo = ", ".join(["{}: {}".format(k, res[k]) for k in additional_vars.keys()]) + if additional_loginfo: + common_loginfo += ", " + additional_loginfo + logger.info(common_loginfo) + + for k, v in res.items(): + if k in additional_vars: + log_writer.add_scalar("amp/" + k, v, global_step) + else: + log_writer.add_scalar("train/" + k, v, global_step) + + tic_train = time.time() + + # if args.check_accuracy: + # if global_step >= args.max_steps: + # return + # else: + # continue + + if global_step % args.eval_freq == 0: + # TODO, check the input data of validation + eval_fetch = collections.OrderedDict() + # if topo.is_last: + eval_fetch["loss"] = loss_reduce + if args.binary_head: + eval_fetch["lm_loss"] = lm_loss_reduce + eval_fetch["sop_loss"] = sop_loss_reduce + + run_evaluate( + valid_data_loader, + exe, + test_program, + args.eval_iters, + log_writer, + global_step, + args, + topo.is_last, + eval_fetch, + "valid", + ) + tic_train = time.time() + + if global_step % args.save_steps == 0 or global_step >= args.max_steps: + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + logger.debug("saving models to {}".format(output_dir)) + save_persistables(exe, os.path.join(output_dir, "static_vars"), main_program) + tokenizer.save_pretrained(output_dir) + tic_train = time.time() + + if global_step % args.checkpoint_steps == 0: + output_dir = os.path.join(args.output_dir, "model_last") + if worker_index == 0: + if not os.path.exists(output_dir): + os.mkdir(output_dir) + output_dir_bak = os.path.join(args.output_dir, "model_last_bak") + if os.path.exists(output_dir): + if os.path.exists(output_dir_bak): + shutil.rmtree(output_dir_bak) + shutil.move(output_dir, output_dir_bak) + os.mkdir(output_dir) + + step_config = { + "model_name": args.model_name_or_path, + "global_step": global_step, + "global_batch_size": args.global_batch_size, + "consumed_samples": global_step * args.global_batch_size, + } + + with open(os.path.join(output_dir, "config.yml"), "w") as f: + yaml.dump(step_config, f, encoding="utf-8", allow_unicode=True) + + fleet.barrier_worker() + + logger.debug("saving 
models to {}".format(output_dir)) + if args.sharding_degree <= 1: + # Save on the first worker by default. + if worker_index == 0: + paddle.static.save(main_program, os.path.join(output_dir, "static_vars")) + else: + # Use save_persistables in sharding, but more slower + save_persistables(exe, os.path.join(output_dir, "static_vars"), main_program) + + if global_step >= args.max_steps: + eval_fetch = collections.OrderedDict() + # if topo.is_last: + eval_fetch["loss"] = loss_reduce + if args.binary_head: + eval_fetch["lm_loss"] = lm_loss_reduce + eval_fetch["sop_loss"] = sop_loss_reduce + + run_evaluate( + test_data_loader, + exe, + test_program, + args.test_iters, + log_writer, + global_step, + args, + topo.is_last, + eval_fetch, + "test", + ) + del train_data_loader + del valid_data_loader + del test_data_loader + return + + +if __name__ == "__main__": + args = parse_args(MODEL_CLASSES) + logger.info("{:20}:{}".format("paddle commit id", paddle.version.commit)) + for arg in vars(args): + logger.info("{:20}:{}".format(arg, getattr(args, arg))) + + do_train(args) diff --git a/model_zoo/ernie-1.0/run_pretrain_trainer.py b/model_zoo/ernie-1.0/run_pretrain_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..1d54eee4ffaf153b99e4fe445e8b976e2c002232 --- /dev/null +++ b/model_zoo/ernie-1.0/run_pretrain_trainer.py @@ -0,0 +1,486 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +ERNIE-1.0 pretraining scripts. +""" +import math +import os +import random +import time +from dataclasses import dataclass, field +from typing import Optional + +import numpy as np +import paddle +from data_tools.dataset_utils import build_train_valid_test_datasets + +from paddlenlp.data import Stack +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, + speed_metrics, +) +from paddlenlp.transformers import ( + ErnieConfig, + ErnieForMaskedLM, + ErnieForPretraining, + ErniePretrainingCriterion, + ErnieTokenizer, + LinearAnnealingWithWarmupDecay, +) +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie": ( + ErnieConfig, + ErnieForPretraining, + ErniePretrainingCriterion, + ErnieTokenizer, + ), +} + + +def add_start_docstrings(*docstr): + def docstring_decorator(fn): + fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") + return fn + + return docstring_decorator + + +@dataclass +@add_start_docstrings(TrainingArguments.__doc__) +class PreTrainingArguments(TrainingArguments): + min_learning_rate: float = field( + default=1e-5, + metadata={"help": "Minimum learning rate deacyed to."}, + ) + decay_steps: float = field( + default=None, + metadata={ + "help": "The steps use to control the learing rate. If the step > decay_steps, will use the min_learning_rate." 
+ }, + ) + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + split: str = field(default="949,50,1", metadata={"help": "Train/valid/test data split."}) + + max_seq_length: int = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + masked_lm_prob: float = field( + default=0.15, + metadata={"help": "Mask token prob."}, + ) + short_seq_prob: float = field( + default=0.1, + metadata={"help": "Short sequence prob."}, + ) + share_folder: bool = field( + default=False, + metadata={"help": "Use share folder for data dir and output dir on multi machine."}, + ) + favor_longer_ngram: bool = field( + default=False, + metadata={"help": "Whether to favor long ngrams"}, + ) + max_ngrams: int = field( + default=3, + metadata={"help": "Max N Grams"}, + ) + data_impl: str = field( + default="mmap", + metadata={"help": "mmap/lazy format converted from json"}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. + """ + + model_type: Optional[str] = field( + default="ernie", metadata={"help": "Only support for ernie pre-training for now."} + ) + model_name_or_path: str = field( + default="ernie-1.0", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + binary_head: Optional[bool] = field(default=True, metadata={"help": "True for NSP task."}) + hidden_dropout_prob: float = field(default=0.1, metadata={"help": "The hidden dropout prob."}) + attention_probs_dropout_prob: float = field(default=0.1, metadata={"help": "The attention probs dropout prob."}) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + + +def create_pretrained_dataset( + data_args, + training_args, + data_file, + tokenizer, + binary_head=True, +): + + train_valid_test_num_samples = [ + training_args.per_device_train_batch_size + * training_args.world_size + * training_args.max_steps + * training_args.gradient_accumulation_steps, + training_args.per_device_eval_batch_size + * training_args.world_size + * training_args.eval_iters + * (training_args.max_steps // training_args.eval_steps + 1), + training_args.per_device_eval_batch_size * training_args.world_size * training_args.test_iters, + ] + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + data_prefix=data_file, + args=data_args, + tokenizer=tokenizer, + splits_string=data_args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + max_seq_length=data_args.max_seq_length, + masked_lm_prob=data_args.masked_lm_prob, + short_seq_prob=data_args.short_seq_prob, + seed=training_args.seed, + skip_warmup=True, + binary_head=binary_head, + max_seq_length_dec=None, + dataset_type="ernie", + ) + + def print_dataset(data, mode="train"): + logger.info(f"Sample data 
for {mode} mode") + input_ids, segment_ids, input_mask, masked_lm_positions, masked_lm_labels, next_sentence_labels = data + if tokenizer.pad_token_id in input_ids: + input_ids = input_ids[0 : list(input_ids).index(tokenizer.pad_token_id)] + logger.info(tokenizer._decode(input_ids)) + for pos, label in zip(masked_lm_positions, masked_lm_labels): + input_ids[pos] = label + logger.info(tokenizer._decode(input_ids)) + logger.info(tokenizer.convert_ids_to_tokens(masked_lm_labels)) + + print_dataset(train_ds[0], "train") + print_dataset(valid_ds[0], "valid") + print_dataset(test_ds[0], "test") + + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. next_sentence_labels + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + out[5] = out[5].reshape([-1, 1]) + _, seq_length = out[0].shape + size = sum(len(x[3]) for x in data) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + if size % 8 != 0: + size += 8 - (size % 8) + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + + return { + "input_ids": out[0], + "token_type_ids": out[1], + "attention_mask": out[2], + "masked_positions": out[3], + "labels": (out[4], out[5]), + } + + return train_ds, valid_ds, test_ds, _collate_data + + +def get_train_data_file(args): + if len(args.input_dir.split()) > 1: + # weight-1 data-prefix-1 weight-2 data-prefix-2 ... + return args.input_dir.split() + else: + files = [ + os.path.join(args.input_dir, f) + for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and "_idx.npz" in str(f)) + ] + files = [x.replace("_idx.npz", "") for x in files] + + if len(files) > 1: + ret = [] + logger.info("You are using multi-dataset:") + for x in files: + ret.append(1.0) + ret.append(x) + logger.info(" > set weight of %s dataset to 1.0" % x) + return ret + + return files + + +def set_seed(args): + if args.device == "cpu": + idx = 0 + else: + idx = paddle.distributed.get_rank() + random.seed(args.seed + idx) + np.random.seed(args.seed + idx) + paddle.seed(args.seed + idx) + + +class PretrainingTrainer(Trainer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix: str = "eval"): + eval_dataloader = getattr(self, "eval_dataloader", None) + if eval_dataloader is None: + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + # must call data loader, otherwise, it will init many times, cause OOM error. + self.eval_dataloader = eval_dataloader() + + start_time = time.time() + # Temporarily disable metric computation, we will do it in the loop here. 
+ compute_metrics = self.compute_metrics + eval_loop = self.evaluation_loop + + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + # Only evaluate max_eval_iters + max_eval_iters=self.args.eval_iters, + ) + + total_batch_size = self.args.eval_batch_size * self.args.world_size + output.metrics.update( + speed_metrics( + metric_key_prefix, + start_time, + num_samples=output.num_samples, + num_steps=math.ceil(output.num_samples / total_batch_size), + ) + ) + + self.log(output.metrics) + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics) + return output.metrics + + def _get_eval_sampler(self, eval_dataset) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + eval_dataset, + batch_size=self.args.per_device_eval_batch_size, + shuffle=False, + num_replicas=self.args.world_size, + rank=self.args.process_index, + drop_last=self.args.dataloader_drop_last, + ) + + def _get_train_sampler(self) -> Optional[paddle.io.Sampler]: + return DistributedBatchSampler( + self.train_dataset, + batch_size=self.args.per_device_train_batch_size, + shuffle=False, + num_replicas=self.args.world_size, + rank=self.args.process_index, + drop_last=self.args.dataloader_drop_last, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, PreTrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.tokenizer_name_or_path is None: + model_args.tokenizer_name_or_path = model_args.model_name_or_path + + set_seed(training_args) + paddle.set_device(training_args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16 or training_args.bf16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + # if last_checkpoint is None and len( + # os.listdir(training_args.output_dir)) > 1: + # raise ValueError( + # f"Output directory ({training_args.output_dir}) already exists and is not empty. " + # "Use --overwrite_output_dir to overcome.") + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + config_class, model_class, criterion_class, tokenizer_class = MODEL_CLASSES[model_args.model_type] + + if model_args.binary_head is False: + model_class = ErnieForMaskedLM + + pretrained_models_list = list(model_class.pretrained_init_configuration.keys()) + + if model_args.model_name_or_path in pretrained_models_list: + logger.warning(f"Your model {model_args.model_name_or_path} is training from scratch !!!") + model_config = model_class.pretrained_init_configuration[model_args.model_name_or_path] + model_config["hidden_dropout_prob"] = model_args.hidden_dropout_prob + model_config["attention_probs_dropout_prob"] = model_args.attention_probs_dropout_prob + model = model_class(config_class(**model_config)) + # model_config["enable_recompute"] = args.use_recompute + else: + logger.warning(f"Your model is continue training from {model_args.model_name_or_path}") + model = model_class.from_pretrained( + model_args.model_name_or_path, + hidden_dropout_prob=model_args.hidden_dropout_prob, + attention_probs_dropout_prob=model_args.attention_probs_dropout_prob, + ) + + class CriterionWrapper(paddle.nn.Layer): + """ """ + + def __init__(self): + """CriterionWrapper""" + super(CriterionWrapper, self).__init__() + self.criterion = criterion_class(with_nsp_loss=model_args.binary_head) + + def forward(self, output, labels): + """forward function + + Args: + output (tuple): prediction_scores, seq_relationship_score + labels (tuple): masked_lm_labels, next_sentence_labels + + Returns: + Tensor: final loss. + """ + masked_lm_labels, next_sentence_labels = labels + if model_args.binary_head: + prediction_scores, seq_relationship_score = output + + lm_loss, sop_loss = self.criterion( + prediction_scores, seq_relationship_score, masked_lm_labels, next_sentence_labels + ) + + loss = lm_loss + sop_loss + + else: + prediction_scores = output + loss = self.criterion(prediction_scores, None, masked_lm_labels) + + return loss + + # Create the learning_rate sheduler and optimizer + if training_args.decay_steps is None: + training_args.decay_steps = training_args.max_steps + warmup_steps = training_args.warmup_ratio * training_args.max_steps + + lr_scheduler = LinearAnnealingWithWarmupDecay( + training_args.learning_rate, + training_args.min_learning_rate, + warmup_step=warmup_steps, + decay_step=training_args.decay_steps, + ) + + data_file = get_train_data_file(data_args) + tokenizer = tokenizer_class.from_pretrained(model_args.tokenizer_name_or_path) + tokenizer.extend_chinese_char() + + train_dataset, eval_dataset, test_dataset, data_collator = create_pretrained_dataset( + data_args, training_args, data_file, tokenizer, model_args.binary_head + ) + + trainer = PretrainingTrainer( + model=model, + criterion=CriterionWrapper(), + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + optimizers=(None, lr_scheduler), + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + 
trainer.log_metrics("test", test_ret.metrics) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-1.0/vocab/README.md b/model_zoo/ernie-1.0/vocab/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8179e8651a81af2867d3f2996c54cd9f9a0b03f6 --- /dev/null +++ b/model_zoo/ernie-1.0/vocab/README.md @@ -0,0 +1,203 @@ +# ERNIE 中文词表制作 + +ERNIE是百度提出的大规模预训练模型,曾在中文场景下取得了SOTA效果。 +PaddleNLP致力于预训练开源工作,本文档提供了ERNIE词表的制作方法。 + +预训练全部流程的整体详细介绍文档,请参考[ERNIE 中文预训练介绍](../pretraining_introduction.md)。 + +**目录** +* [1. 数据获取](#数据获取) +* [2. 全字符中文词表制作](#中文词表制作) + - [2.1 分析准备](#分析准备) + - [2.2 文本字符统计](#文本字符统计) + - [2.3 英文字符词表](#英文字符词表) + - [2.4 合并词表](#合并词表) +* [3. 词表使用](#vocab_usage) + - [3.1 转化为jsonl格式数据](#jsonl) + - [3.2 TokenID转化](#快速TokenID转化) +* [4. 参考](#ref) + + + + +## 1. 数据获取 + + +**WuDaoCorpus2.0 Base 语料** + +WuDaoCorpora是悟道爬取的中文大规模语料。整体数量为3TB,目前开源的部分为WuDaoCorpus2.0 bases数据集,大小为200GB。用户请参考[这里](../preprocess/docs/WuDaoCorpusBase.md)获取原始文本数据。 + + +**CLUECorpus2020 语料** + +CLUECorpus2020 过对Common Crawl的中文部分进行语料清洗得到。开源部分提供了约200G左右的语料文本,详细介绍见[官网](https://github.com/CLUEbenchmark/CLUECorpus2020#%E6%95%B0%E6%8D%AE%E4%B8%8B%E8%BD%BD),用户参考[这里](./preprocess/docs/CLUECorpus2020.md)获取原始文本数据。 + + + + + + +## 2. 全字符中文词表制作 + +词表的制作有两种方案: + +第一种,词表组合方案 +1. 统计字符 +2. 制作英文词表 +3. 合并词表 + +第二种,预处理后直接生成,方案 +1. 文本预处理(中文加空格,文本normalize) +2. 使用sentencepeice制作词表 + +第二种方案需要对文本先使用`BasicTokenizer`切分一遍语料。 +第一种方案,自定义程度高,但存在一些局限性。本项目采用了第一种方案,详细介绍如下: + +### 2.1 分析准备 +词表大小: 这里我们考虑的因素主要有两个 +- 已有模型对照: + - ERNIE 3.0系列模型的词表,词表大小为 40000 左右。 +- 预训练数据存储占用: + - 文本token id化后,希望使用uint16表示,此时表示的最大字符为65536。 + - 同时考虑到ERNIE虽然是字模型,我们的仍然需要 `##中` 之类的中文字符表示分词信息。假设使用中文全字符20902(0x4E00-0x9FA5)个字符,那么剩余 vocab 大小不能超过 44634。 + +综上,本项目决定采用 40000 左右的 vocab 容量。 +其中: +- 中文全字符 `20902` +- 英文字符 `17000` +- 其他字符约 `2000` 左右 + + +### 2.2 文本字符统计 +首先第一步是对文本字符进行统计。字符统计的目的主要是添加常用的中文字符、特殊字符。 + +由于语料文本过大,我们随机选取 10G 左右的原始文本进行了字符统计。 +``` +python gen_char.py path_to_corpus.txt +``` +可以在本地文件夹得到`char_dict.pickle`字符频率文件。同时我们也提供了自己统计的词频文件,方便用户复现: +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/char_dict.pickle +``` + +### 2.3 英文字符词表 +基于字符的词频统计,使得英文字符也切割为字母,为此我们需要添加英文词表。 +英文部分,我们使用了 [WikiText](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip) 数据集,来构造词表。 +下载解压数据,使用BPE切词 +``` +wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip +unzip wikitext-103-v1.zip +python gen_vocab.py ./wikitext-103-raw/wiki.train.raw +``` +即可产生英文部分的词表。这里我们也提供了处理好的 vocab 方便用户验证。 +``` +wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/eng.vocab +``` + + +### 2.4 合并词表 + +目前我们得到了字符统计表,和英文字符词表。下一步,我们将词表进行合并。 + +将`char_dict.pickle`,`eng.vocab`放置到当前目录,使用下面命令 +``` +python merge_vocab.py +``` +即可在 当前 目录生成 vocab.txt 得到最终词表。 + +此阶段需要注意的一些问题是: +1. 对于一些日文、谚文文字字符,需要进行 normalize +2. 添加special_tokens + +### 2.5 问题遗留 +本项目采用的第一种方式,即拼接产出的词表,对连续非中、英文字符文本,会出现UNK的情况。 +如issue: [#2927](https://github.com/PaddlePaddle/PaddleNLP/issues/2927)、 [#2585](https://github.com/PaddlePaddle/PaddleNLP/issues/2585)。本项目做了两点改进: + +1. 对 Symbol 字符默认添加空格,变成独立字符 +2. 对 日文、谚文 在合并词表阶段默认添加 ## 字符。 + +虽然有上述两点修复,任然无法避免 [#2927](https://github.com/PaddlePaddle/PaddleNLP/issues/2927) 现象。 +彻底解决的话,建议使用第二种方式制作vocab文件。 + +### 2.6 方案二:预处理后直接生成 +此方案没有被采用,这里也简单说明一下具体的方案: +1. 对语料使用 BasicTokenizer 转换 +```python +from paddlenlp.transformers import +tokenizer = BasicTokenizer() +basic_toknizer = lambda x: " ".join(tokenizer.tokenize(x)) +# 对语料使用 basic_toknizer 转换 +# 并存储为新的语料 afer_basic_toknizer_corpus.txt +``` +2. 
处理转换后的语料 +```shell +python gen_vocab.py afer_basic_toknizer_corpus.txt +``` +对处理好的vocab文件手动替换一些` -> [PAD]`之类的special_tokens,即可产出词表。 + + + +## 3. 词表使用 + + + +## 3.1 转化为jsonl格式数据 + +本文档以WuDao数据为例,对数据进行分词: + +```shell +python ../preprocess/words_segmentation.py \ + --input_path ./WuDaoCorpus2.0_base_200G \ + --workers 40 \ + --data_format wudao \ + --cn_seg_func seg \ + --output_path ./wudao_lac_cut \ +``` + +文本转化完成后。我们使用 `../data_tools/trans_to_json.py`重新转换为jsonl格式(分词完毕)。 +```shell +python ../preprocess/trans_to_json.py \ + --input_path ./wudao_lac_cut \ + --output_path wudao_corpus_200g_0623.jsonl \ + --workers 40 \ +``` + + + +## 3.2 Token ID 转化 + +语料、新建的词表准备妥当后,我们可以开始进行最后的数据ID转化。 + +``` +python -u ../preprocess/create_pretraining_data.py \ + --model_name /path/to/your/vocab.txt \ + --tokenizer_name ErnieTokenizer \ + --input_path wudao_corpus_200g_0623.jsonl \ + --split_sentences \ + --chinese \ + --cn_whole_word_segment \ + --cn_seg_func jieba \ + --cn_splited \ + --output_prefix wudao_corpus_200g_0623 \ + --workers 48 \ + --log_interval 10000 +``` + +- 我们提前分词好了,所以加上了 `cn_splited`,否则不需要使用此选项。 +- model_name 指定为我们准备的词表路径。也可以更换为其他 ERNIE 系列模型,如: `ernie-3.0-base-zh` +- workers 表示转化的线程数目 + +转化后的数据如下,使用这份数据,即可开始ERNIE预训练 +``` +-rw-rw-r-- 1 500 501 129G Jul 4 03:39 wudao_200g_0703_ids.npy +-rw-rw-r-- 1 500 501 6.4G Jul 4 03:39 wudao_200g_0703_idx.npz +``` + + +## 4. 参考 + +感谢CLUE,WuDao提供的开源文本语料,参考资料: +- Xu, L., Zhang, X. and Dong, Q., 2020. CLUECorpus2020: A large-scale Chinese corpus for pre-training language model. arXiv preprint arXiv:2003.01355. +- Yuan, S., Zhao, H., Du, Z., Ding, M., Liu, X., Cen, Y., Zou, X., Yang, Z. and Tang, J., 2021. Wudaocorpora: A super large-scale chinese corpora for pre-training language models. AI Open, 2, pp.65-68. +- https://github.com/CLUEbenchmark/CLUECorpus2020 +- https://resource.wudaoai.cn diff --git a/model_zoo/ernie-1.0/vocab/gen_char.py b/model_zoo/ernie-1.0/vocab/gen_char.py new file mode 100644 index 0000000000000000000000000000000000000000..fbda9900f245ae884e64990d38597d4fdb579865 --- /dev/null +++ b/model_zoo/ernie-1.0/vocab/gen_char.py @@ -0,0 +1,64 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
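+
+# gen_char.py: count character frequencies over a raw text corpus (a single file or a
+# directory of files) and dump them to char_dict.txt / char_dict.pickle, which
+# merge_vocab.py later consumes when assembling the final vocabulary.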
+ +import os +import time +import sys +import pickle +from collections import defaultdict + +input_path = sys.argv[1] +print(input_path) + +char_dict = defaultdict(int) + +file_paths = [] +if os.path.isfile(input_path): + file_paths.append(input_path) +else: + for root, _, fs in os.walk(input_path): + for f in fs: + file_paths.append(os.path.join(root, f)) + +count = 0 +s = time.time() +data_len = 0 +for file_name in file_paths: + print(f" > reading file {file_name}") + with open(file_name, "r") as f: + line = f.readline() + while line: + count += 1 + data_len += len(line.encode("utf-8")) + for char in line: + char_dict[char] += 1 + line = f.readline() + if count % 10000 == 0: + print( + f"processed doc {count}, char size: {len(char_dict)}, speed: {data_len/1024/1024/(time.time() - s)} MB/s" + ) + with open("char_dict.txt", "w") as rf: + res = sorted(char_dict.items(), key=lambda x: -x[1]) + for x in res: + k, v = x + rf.write(f"{k} {v}\n") + +with open("char_dict.txt", "w") as f: + res = sorted(char_dict.items(), key=lambda x: -x[1]) + for x in res: + k, v = x + f.write(f"{k} {v}\n") + +with open("char_dict.pickle", "wb") as f: + pickle.dump(char_dict, f) diff --git a/model_zoo/ernie-1.0/vocab/gen_vocab.py b/model_zoo/ernie-1.0/vocab/gen_vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..4df40721c411da0e07830ddf7b2bc9aa73b21b87 --- /dev/null +++ b/model_zoo/ernie-1.0/vocab/gen_vocab.py @@ -0,0 +1,26 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +import sentencepiece as spm + +input_path = sys.argv[1] +print("Generate vocabulary file for corpus: ", input_path) + +spm.SentencePieceTrainer.train( + input=input_path, + model_prefix="eng", + vocab_size=17000, + model_type="BPE", +) diff --git a/model_zoo/ernie-1.0/vocab/merge_vocab.py b/model_zoo/ernie-1.0/vocab/merge_vocab.py new file mode 100644 index 0000000000000000000000000000000000000000..15f6f6191874a87c806b28deeab05bcc5a48d33a --- /dev/null +++ b/model_zoo/ernie-1.0/vocab/merge_vocab.py @@ -0,0 +1,135 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
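+
+# merge_vocab.py: merge the character-frequency statistics (char_dict.pickle) with the
+# English BPE vocabulary (eng.vocab) into the final vocab.txt used for ERNIE pretraining,
+# normalizing characters and adding special tokens along the way.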
+ +import pickle +import re + +from paddlenlp.transformers import BasicTokenizer +from paddlenlp.transformers.tokenizer_utils import ( + _is_control, + _is_punctuation, + _is_whitespace, + is_chinese_char, + tokenize_special_chars, +) + +re_eng = re.compile("[#a-zA-Z0-9]", re.U) +re_sep = re.compile("\[[A-Z]+\]", re.U) +re_sep_eng = re.compile("\<[\/a-z]+\>", re.U) + +bt = BasicTokenizer() +normalize_chars = lambda x: "".join(bt.tokenize(x)) + + +# 20902 个中文全字符 +def chinese_char(): + return set([chr(x) for x in range(0x4E00, 0x9FA5 + 1)]) + + +# 日文 或 谚文字母 +def jk_vocab(c): + c = ord(c) + return (c >= 0x3040 and c <= 0x33FF) or (c >= 0x1100 and c <= 0x11FF) # 谚文字母 + + +# 特殊 TOKEN +def add_special_token(): + return ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"] + + +char_dict = pickle.load(open("char_dict.pickle", "rb")) +chinese_vocab = chinese_char() +final_vocab = set() +other_char = [] + + +def add_vocab(char, f): + if re_sep_eng.match(char): + # Delete tokens in eng.vocab + return + + # Add eng vocab and specical token + if re_eng.match(char) or re_sep.match(char): + if char not in final_vocab: + final_vocab.add(char) + f.write(f"{char}\n") + return + + # Add chinese char + if len(char) > 1 and char.startswith("##") and chinese_vocab(char[2]): + if char not in final_vocab: + final_vocab.add(char) + f.write(f"{char}\n") + return + + # Normalize char, 部分字符 nioe + char = normalize_chars(char) + for i, k in enumerate(char): + if _is_whitespace(k) or _is_control(k): + continue + if k not in final_vocab: + if not _is_punctuation(k) and not is_chinese_char(ord(k)) and k == tokenize_special_chars(k): + other_char.append(k) + final_vocab.add(k) + f.write(f"{k}\n") + if jk_vocab(k): + # add "##" for japanese and korean char + add_vocab("##" + k, f) + + +with open("vocab.txt", "w") as f: + for x in add_special_token(): + add_vocab(x, f) + + res = sorted(char_dict.items(), key=lambda x: -x[1]) + + # Add chinse char by freq + for x in res: + k, v = x + k = normalize_chars(k) + if k in chinese_vocab: + add_vocab(k, f) + chinese_vocab.remove(k) + + # If chinse char not in freq add it + chinese_vocab = sorted(chinese_vocab) + while len(chinese_vocab) > 0: + k = chinese_vocab.pop() + if k not in final_vocab: + f.write(f"{k}\n") + final_vocab.add(k) + + # And english vocab part + with open("eng.vocab") as ec: + line = ec.readline() + while line: + k, v = line.strip().split() + if "▁" in k: + # remove "▁" in eng vocab + k = k[1:] + elif re_sep_eng.match(k): + pass + else: + # add "##" for eng vocab + k = "##" + k + + add_vocab(k, f) + line = ec.readline() + + # Add additional tokens in corpus + # such as japanese and korean char and other symbols + for x in res: + k, v = x + if v >= 200: + add_vocab(k, f) diff --git a/model_zoo/ernie-3.0/README.md b/model_zoo/ernie-3.0/README.md new file mode 100644 index 0000000000000000000000000000000000000000..96a6452183dc0ea0ad63770070db5228f16138b5 --- /dev/null +++ b/model_zoo/ernie-3.0/README.md @@ -0,0 +1,1632 @@ +# ERNIE 3.0 轻量级模型 + + **目录** + * [模型介绍](#模型介绍) + * [在线蒸馏技术](#在线蒸馏技术) + * [模型效果](#模型效果) + * [开始运行](#开始运行) + * [环境要求](#环境要求) + * [数据准备](#数据准备) + * [模型训练](#模型训练) + * [模型预测](#模型预测) + * [模型压缩](#模型压缩) + * [环境依赖](#环境依赖) + * [模型压缩 API 使用](#模型压缩API使用) + * [压缩效果](#压缩效果) + * [精度测试](#精度测试) + * [性能测试](#性能测试) + * [CPU 性能](#CPU性能) + * [GPU 性能](#CPU性能) + * [使用 FastTokenizer 加速](#使用FastTokenizer加速) + * [部署](#部署) + * [FastDeploy 部署](#FastDeploy部署) + * [Python 部署](#Python部署) + * [C++ 部署](#C++部署) + * [服务化部署](#服务化部署) + * [Notebook教程](#Notebook教程) + * [参考文献](#参考文献) + + + +## 
模型介绍 + +本次开源的模型是文心大模型 ERNIE 3.0, 文心大模型 ERNIE 3.0 作为百亿参数知识增强的大模型,除了从海量文本数据中学习词汇、结构、语义等知识外,还从大规模知识图谱中学习。 基础上通过**在线蒸馏技术**得到的轻量级模型,模型结构与 ERNIE 2.0 保持一致,相比 ERNIE 2.0 具有更强的中文效果。 + +相关技术详解可参考文章[《解析全球最大中文单体模型鹏城-百度·文心技术细节》](https://www.jiqizhixin.com/articles/2021-12-08-9) + +### 在线蒸馏技术 + +在线蒸馏技术在模型学习的过程中周期性地将知识信号传递给若干个学生模型同时训练,从而在蒸馏阶段一次性产出多种尺寸的学生模型。相对传统蒸馏技术,该技术极大节省了因大模型额外蒸馏计算以及多个学生的重复知识传递带来的算力消耗。 + +这种新颖的蒸馏方式利用了文心大模型的规模优势,在蒸馏完成后保证了学生模型的效果和尺寸丰富性,方便不同性能需求的应用场景使用。此外,由于文心大模型的模型尺寸与学生模型差距巨大,模型蒸馏难度极大甚至容易失效。为此,通过引入了助教模型进行蒸馏的技术,利用助教作为知识传递的桥梁以缩短学生模型和大模型表达空间相距过大的问题,从而促进蒸馏效率的提升。 + +更多技术细节可以参考论文: +- [ERNIE-Tiny: A Progressive Distillation Framework for Pretrained Transformer Compression](https://arxiv.org/abs/2106.02241) +- [ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation](https://arxiv.org/abs/2112.12731) + +
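+下面用一小段代码示意最基础的 logits 蒸馏损失,仅为帮助理解蒸馏的基本形式,并非 ERNIE 3.0 在线蒸馏的实际实现;其中函数名与温度系数 `temperature` 均为假设的示例,不对应仓库中的任何实际接口:
+
+```python
+import paddle.nn.functional as F
+
+def distillation_loss(student_logits, teacher_logits, temperature=2.0):
+    # 教师 logits 经温度软化后作为软标签,学生通过 KL 散度逼近教师分布
+    teacher_probs = F.softmax(teacher_logits / temperature, axis=-1)
+    student_log_probs = F.log_softmax(student_logits / temperature, axis=-1)
+    # 乘以 temperature**2 以保持梯度量级,是知识蒸馏的常用做法
+    return F.kl_div(student_log_probs, teacher_probs, reduction="mean") * temperature**2
+```
+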

+(图:在线蒸馏技术示意图,原文此处为图片链接)

+ + + + +### 模型效果 + +本项目开源 **ERNIE 3.0 _Base_** 、**ERNIE 3.0 _Medium_** 、 **ERNIE 3.0 _Mini_** 、 **ERNIE 3.0 _Micro_** 、 **ERNIE 3.0 _Nano_** 五个模型: + +- [**ERNIE 3.0-_Base_**](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams) (_12-layer, 768-hidden, 12-heads_) +- [**ERNIE 3.0-_Medium_**](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_medium_zh.pdparams) (_6-layer, 768-hidden, 12-heads_) +- [**ERNIE 3.0-_Mini_**](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_mini_zh.pdparams) (_6-layer, 384-hidden, 12-heads_) +- [**ERNIE 3.0-_Micro_**](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_micro_zh.pdparams) (_4-layer, 384-hidden, 12-heads_) +- [**ERNIE 3.0-_Nano_**](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_nano_zh.pdparams) (_4-layer, 312-hidden, 12-heads_) + + +下面是 PaddleNLP 中轻量级中文模型的**效果-时延图**。横坐标表示在 IFLYTEK 数据集 (最大序列长度设置为 128) 上测试的延迟(latency,单位:ms),纵坐标是 CLUE 10 个任务上的平均精度(包含文本分类、文本匹配、自然语言推理、代词消歧、阅读理解等任务),其中 CMRC2018 阅读理解任务的评价指标是 Exact Match(EM),其他任务的评价指标均是 Accuracy。图中越靠**左上**的模型,精度和性能水平越高。 + +图中模型名下方标注了模型的参数量,测试环境见[性能测试](#性能测试)。 + +batch_size=32 时,CPU 下的效果-时延图(线程数 1 和 8): + + + + + + +
+ +batch_size=1 时,CPU 下的效果-时延图(线程数 1 和 8): + + + + + + +
+ +batch_size=32 和 1,预测精度为 FP16 时,GPU 下的效果-时延图: + + + + + + +
+ +从图上可看出,ERNIE 3.0 系列轻量级模型在精度和性能上的综合表现已全面领先于 UER-py、Huawei-Noah 以及 HFL 的中文模型。且当 batch_size=1、预测精度为 FP16 时,在 GPU 上宽且浅的模型的推理性能更有优势。 + +在 CLUE **验证集**上评测指标如下表所示: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Arch | Model | AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | CMRC2018 | CHID | C3 |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| 24L1024H | ERNIE 1.0-Large-cw | 79.03 | 75.97 | 59.65 | 62.91 | 85.09 | 81.73 | 93.09 | 84.53 | 74.22/91.88 | 88.57 | 84.54 |
+| 24L1024H | ERNIE 2.0-Large-zh | 76.90 | 76.23 | 59.33 | 61.91 | 83.85 | 79.93 | 89.82 | 83.23 | 70.95/90.31 | 86.78 | 78.12 |
+| 24L1024H | RoBERTa-wwm-ext-large | 76.61 | 76.00 | 59.33 | 62.02 | 83.88 | 78.81 | 90.79 | 83.67 | 70.58/89.82 | 85.72 | 75.26 |
+| 20L1024H | ERNIE 3.0-Xbase-zh | 78.39 | 76.16 | 59.55 | 61.87 | 84.40 | 81.73 | 88.82 | 83.60 | 75.99/93.00 | 86.78 | 84.98 |
+| 12L768H | **ERNIE 3.0-Base-zh** | 76.05 | 75.93 | 58.26 | 61.56 | 83.02 | 80.10 | 86.18 | 82.63 | 70.71/90.41 | 84.26 | 77.88 |
+| 12L768H | ERNIE 1.0-Base-zh-cw | 76.47 | 76.07 | 57.86 | 59.91 | 83.41 | 79.58 | 89.91 | 83.42 | 72.88/90.78 | 84.68 | 76.98 |
+| 12L768H | ERNIE-Gram-zh | 75.72 | 75.28 | 57.88 | 60.87 | 82.90 | 79.08 | 88.82 | 82.83 | 71.82/90.38 | 84.04 | 73.69 |
+| 12L768H | Langboat/Mengzi-BERT-Base | 74.69 | 75.35 | 57.76 | 61.64 | 82.41 | 77.93 | 88.16 | 82.20 | 67.04/88.35 | 83.74 | 70.70 |
+| 12L768H | ERNIE 2.0-Base-zh | 74.32 | 75.65 | 58.25 | 61.64 | 82.62 | 78.71 | 81.91 | 82.33 | 66.08/87.46 | 82.78 | 73.19 |
+| 12L768H | ERNIE 1.0-Base-zh | 74.17 | 74.84 | 58.91 | 62.25 | 81.68 | 76.58 | 85.20 | 82.77 | 67.32/87.83 | 82.47 | 69.68 |
+| 12L768H | RoBERTa-wwm-ext | 74.11 | 74.60 | 58.08 | 61.23 | 81.11 | 76.92 | 88.49 | 80.77 | 68.39/88.50 | 83.43 | 68.03 |
+| 12L768H | BERT-Base-Chinese | 72.57 | 74.63 | 57.13 | 61.29 | 80.97 | 75.22 | 81.91 | 81.90 | 65.30/86.53 | 82.01 | 65.38 |
+| 12L768H | UER/Chinese-RoBERTa-Base | 71.78 | 72.89 | 57.62 | 61.14 | 80.01 | 75.56 | 81.58 | 80.80 | 63.87/84.95 | 81.52 | 62.76 |
+| 8L512H | UER/Chinese-RoBERTa-Medium | 67.06 | 70.64 | 56.10 | 58.29 | 77.35 | 71.90 | 68.09 | 78.63 | 57.63/78.91 | 75.13 | 56.84 |
+| 6L768H | **ERNIE 3.0-Medium-zh** | 72.49 | 73.37 | 57.00 | 60.67 | 80.64 | 76.88 | 79.28 | 81.60 | 65.83/87.30 | 79.91 | 69.73 |
+| 6L768H | HFL/RBT6, Chinese | 70.06 | 73.45 | 56.82 | 59.64 | 79.36 | 73.32 | 76.64 | 80.67 | 62.72/84.77 | 78.17 | 59.85 |
+| 6L768H | TinyBERT6, Chinese | 69.62 | 72.22 | 55.70 | 54.48 | 79.12 | 74.07 | 77.63 | 80.17 | 63.03/83.75 | 77.64 | 62.11 |
+| 6L768H | RoFormerV2 Small | 68.52 | 72.47 | 56.53 | 60.72 | 76.37 | 72.95 | 75.00 | 81.07 | 62.97/83.64 | 67.66 | 59.41 |
+| 6L768H | UER/Chinese-RoBERTa-L6-H768 | 67.09 | 70.13 | 56.54 | 60.48 | 77.49 | 72.00 | 72.04 | 77.33 | 53.74/75.52 | 76.73 | 54.40 |
+| 6L384H | **ERNIE 3.0-Mini-zh** | 66.90 | 71.85 | 55.24 | 54.48 | 77.19 | 73.08 | 71.05 | 79.30 | 58.53/81.97 | 69.71 | 58.60 |
+| 4L768H | HFL/RBT4, Chinese | 67.42 | 72.41 | 56.50 | 58.95 | 77.34 | 70.78 | 71.05 | 78.23 | 59.30/81.93 | 73.18 | 56.45 |
+| 4L512H | UER/Chinese-RoBERTa-Small | 63.25 | 69.21 | 55.41 | 57.552 | 73.64 | 69.80 | 66.78 | 74.83 | 46.75/69.69 | 67.59 | 50.92 |
+| 4L384H | **ERNIE 3.0-Micro-zh** | 64.21 | 71.15 | 55.05 | 53.83 | 74.81 | 70.41 | 69.08 | 76.50 | 53.77/77.82 | 62.26 | 55.53 |
+| 4L312H | **ERNIE 3.0-Nano-zh** | 62.97 | 70.51 | 54.57 | 48.36 | 74.97 | 70.61 | 68.75 | 75.93 | 52.00/76.35 | 58.91 | 55.11 |
+| 4L312H | TinyBERT4, Chinese | 60.82 | 69.07 | 54.02 | 39.71 | 73.94 | 69.59 | 70.07 | 75.07 | 46.04/69.34 | 58.53 | 52.18 |
+| 4L256H | UER/Chinese-RoBERTa-Mini | 53.40 | 69.32 | 54.22 | 41.63 | 69.40 | 67.36 | 65.13 | 70.07 | 5.96/17.13 | 51.19 | 39.68 |
+| 3L1024H | HFL/RBTL3, Chinese | 66.63 | 71.11 | 56.14 | 59.56 | 76.41 | 71.29 | 69.74 | 76.93 | 58.50/80.90 | 71.03 | 55.56 |
+| 3L768H | HFL/RBT3, Chinese | 65.72 | 70.95 | 55.53 | 59.18 | 76.20 | 70.71 | 67.11 | 76.63 | 55.73/78.63 | 70.26 | 54.93 |
+| 2L128H | UER/Chinese-RoBERTa-Tiny | 44.45 | 69.02 | 51.47 | 20.28 | 59.95 | 57.73 | 63.82 | 67.43 | 3.08/14.33 | 23.57 | 28.12 |
+ + + + +## 代码结构 +以下是本项目代码结构 + +```text +. +├── run_seq_cls.py # 分类任务的微调脚本 +├── run_token_cls.py # 序列标注任务的微调脚本 +├── run_qa.py # 阅读理解任务的微调脚本 +├── compress_seq_cls.py # 分类任务的压缩脚本 +├── compress_token_cls.py # 序列标注任务的压缩脚本 +├── compress_qa.py # 阅读理解任务的压缩脚本 +├── utils.py # 训练工具脚本 +├── configs # 压缩配置文件夹 +│ └── default.yml # 默认配置文件 +├── deploy # 部署目录 +│ └── predictor # onnx离线部署 +│ └── infer_cpu.py +│ └── infer_gpu.py +│ └── README.md +│ └── requirements_cpu.txt +│ └── requirements_gpu.txt +│ └── simple_serving # 基于PaddleNLP SimpleServing 服务化部署 +│ └── client_qa.py +│ └── client_seq_cls.py +│ └── client_token_cls.py +│ └── README.md +│ └── server_qa.py +│ └── server_seq_cls.py +│ └── server_token_cls.py +│ └── triton_serving # 基于Triton Serving 服务化部署 +│ └── models +│ └── README.md +│ └── seq_cls_grpc_client.py +│ └── token_cls_grpc_client.py +└── README.md # 文档 + +``` + + + +## 开始运行 +下面提供以 CLUE 数据集进行模型微调相关训练、预测、部署的代码, CLUE 数据集是中文语言理解测评基准数据集,包括了文本分类、文本推理、实体抽取、问答等相关数据集。 + +### 环境要求 +- python >= 3.7 +- paddlepaddle >= 2.3 +- paddlenlp >= 2.4 +- paddleslim >= 2.4 + +### 数据准备 +此次微调数据主要是以 CLUE benchmark 数据集为主, CLUE benchmark 包括了文本分类、实体抽取、问答三大类数据集,而 CLUE benchmark 数据目前已经集成在 PaddleNLP 的 datasets 里面,可以通过下面的方式来使用数据集 + +```python +from paddlenlp.datasets import load_dataset + +# Load the clue Tnews dataset +train_ds, test_ds = load_dataset('clue', 'tnews', splits=('train', 'test')) + +``` + + +## 模型训练 + +使用 PaddleNLP 只需要一行代码可以拿到 ERNIE 3.0 系列模型,之后可以在自己的下游数据下进行微调,从而获得具体任务上效果更好的模型。 + +```python + +from paddlenlp.transformers import * + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") + +# 用于分类任务 +seq_cls_model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh") + +# 用于序列标注任务 +token_cls_model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh") + +# 用于阅读理解任务 +qa_model = AutoModelForQuestionAnswering.from_pretrained("ernie-3.0-medium-zh") + +``` + +本项目提供了针对分类(包含文本分类、文本匹配、自然语言推理、代词消歧等任务)、序列标注、阅读理解三大场景下微调的示例脚本,可分别参考 `run_seq_cls.py` 、`run_token_cls.py`、`run_qa.py` 三个脚本,启动方式如下: + +```shell +# 分类任务 +# 该脚本共支持 CLUE 中 7 个分类任务,超参不全相同,因此分类任务中的超参配置利用 config.yml 配置 +# --device 选择训练模型的硬件,可选 cpu/gpu/xpu/npu,默认为 gpu。xpu 为昆仑芯片,npu 为昇腾芯片。 +python run_seq_cls.py --model_name_or_path ernie-3.0-medium-zh --dataset afqmc --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml + +# 序列标注任务 +python run_token_cls.py --model_name_or_path ernie-3.0-medium-zh --dataset msra_ner --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml + +# 阅读理解任务 +python run_qa.py --model_name_or_path ernie-3.0-medium-zh --dataset cmrc2018 --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml +``` + + +## 模型预测 + +```shell +# 分类任务 +# 该脚本共支持 CLUE 中 7 个分类任务,超参不全相同,因此分类任务中的超参配置利用 config.yml 配置 +# --device 选择训练模型的硬件,可选 cpu/gpu/xpu/npu,默认为 gpu。xpu 为昆仑芯片,npu 为昇腾芯片。 +python run_seq_cls.py --model_name_or_path best_models/afqmc/ --dataset afqmc --output_dir ./best_models --do_predict --config=configs/default.yml + +# 序列标注任务 +python run_token_cls.py --model_name_or_path best_models/msra_ner/ --dataset msra_ner --output_dir ./best_models --do_predict --config=configs/default.yml + +# 阅读理解任务 +python run_qa.py --model_name_or_path best_models/cmrc2018/ --dataset cmrc2018 --output_dir ./best_models --do_predict --config=configs/default.yml +``` + + + + +## 模型压缩 + +尽管 ERNIE 3.0 已提供了效果不错的 6 层、4 
层轻量级模型可以微调后直接使用,但如果有模型部署上线的需求,则需要进一步压缩模型体积,可以使用这里提供的一套模型压缩方案及 API 对上一步微调后的模型进行压缩。 + + + +### 环境依赖 + +使用裁剪功能需要安装 paddleslim 包 + +```shell +pip install paddleslim +``` + + + +### 模型压缩 API 使用 + +本项目使用压缩 API 对任务数据上微调后的模型进行裁剪和量化。用户在传入模型,以及相关的压缩超参数(可选,提供默认选项)后,只需要调用一行 `compress()` 即可一键启动裁剪和量化,并自动保存压缩后的模型进行后续部署。 + +核心调用方法如下,如需跑通完整的例子可参考本目录下完整样例脚本: + +```python + +trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + criterion=criterion) + +trainer.compress() + +``` +压缩 API 可以传入的超参数可参考[文档](../../docs/compression.md)。 + +本项目提供了压缩 API 在分类(包含文本分类、文本匹配、自然语言推理、代词消歧等任务)、序列标注、阅读理解三大场景下的使用样例,可以分别参考 `compress_seq_cls.py` 、`compress_token_cls.py`、`compress_qa.py`,启动方式如下: + +```shell +# 分类任务 +# 该脚本共支持 CLUE 中 7 个分类任务,超参不全相同,因此分类任务中的超参配置利用 configs/defalut.yml 配置 +python compress_seq_cls.py --model_name_or_path best_models/afqmc/ --dataset afqmc --output_dir ./best_models/afqmc --config=configs/default.yml + +# 序列标注任务 +python compress_token_cls.py --model_name_or_path best_models/msra_ner/ --dataset msra_ner --output_dir ./best_models/msra_ner --config=configs/default.yml + +# 阅读理解任务 +python compress_qa.py --model_name_or_path best_models/cmrc2018/ --dataset cmrc2018 --output_dir ./best_models/cmrc2018 --config=configs/default.yml + +``` + + + + +### 压缩效果 + + + +#### 精度测试 + +本案例中我们对 ERNIE 3.0-Medium 模型在三类任务上微调后的模型使用压缩 API 进行压缩。压缩后精度如下: + +| Model | AVG | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | CLUEWSC2020 | CSL | CMRC2018 | MSRA_NER | +| ------------------------------- | ----- | ----- | ----- | ------- | ----- | ----- | ----------- | ----- | ----------- | ----------------- | +| ERNIE 3.0-Medium | 74.87 | 75.35 | 57.45 | 60.18 | 81.16 | 77.19 | 80.59 | 81.93 | 66.95/87.15 | 92.65/93.43/93.04 | +| ERNIE 3.0-Medium+FP16 | 74.87 | 75.32 | 57.45 | 60.22 | 81.16 | 77.22 | 80.59 | 81.90 | 66.95/87.16 | 92.65/93.45/93.05 | +| ERNIE 3.0-Medium+裁剪+FP32 | 74.70 | 75.14 | 57.31 | 60.29 | 81.25 | 77.46 | 79.93 | 81.70 | 65.92/86.43 | 93.10/93.43/93.27 | +| ERNIE 3.0-Medium+裁剪+FP16 | 74.71 | 75.21 | 57.27 | 60.29 | 81.24 | 77.56 | 79.93 | 81.73 | 65.89/86.44 | 93.10/93.43/93.27 | +| ERNIE 3.0-Medium+裁剪+量化+INT8 | 74.44 | 75.02 | 57.26 | 60.37 | 81.03 | 77.25 | 77.96 | 81.67 | 66.17/86.55 | 93.17/93.23/93.20 | +| ERNIE 3.0-Medium+量化+INT8 | 74.10 | 74.67 | 56.99 | 59.91 | 81.03 | 75.05 | 78.62 | 81.60 | 66.32/86.82 | 93.10/92.90/92.70 | + +**评价指标说明:** 其中 CLUE 分类任务(AFQMC 语义相似度、TNEWS 文本分类、IFLYTEK 长文本分类、CMNLI 自然语言推理、OCNLI 自然语言推理、CLUEWSC2020 代词消歧、CSL 论文关键词识别)的评价指标是 Accuracy,阅读理解任务 CLUE CMRC2018 的评价指标是 EM (Exact Match) / F1-Score,计算平均值时取 EM,序列标注任务 MSRA_NER 的评价指标是 Precision/Recall/F1-Score,计算平均值时取 F1-Score。 + +由表可知,`ERNIE 3.0-Medium` 模型经过裁剪和量化后,精度平均下降 0.46,其中裁剪后下降了 0.17,单独量化精度平均下降 0.77。 + + + +#### 性能测试 + +性能测试的配置如下: + +1. 数据集:TNEWS(文本分类)、MSRA_NER(序列标注)、CLUE CMRC2018(阅读理解) + +2. 计算卡:T4、CUDA11.2、CuDNN8.2 + +3. CPU 信息:Intel(R) Xeon(R) Gold 6271C CPU + +4. PaddlePaddle 版本:2.3 + +5. PaddleNLP 版本:2.3 + +6. 性能数据单位是 QPS。QPS 测试方法:固定 batch size 为 32,测试运行时间 total_time,计算 QPS = total_samples / total_time + +7. 
精度数据单位:文本分类是 Accuracy,序列标注是 F1-Score,阅读理解是 EM (Exact Match) + + + +##### CPU 性能 + +测试环境及说明如上,测试 CPU 性能时,线程数设置为12。 + +| | TNEWS 性能 | TNEWS 精度 | MSRA_NER 性能 | MSRA_NER 精度 | CMRC2018 性能 | CMRC2018 精度 | +| -------------------------- | ------------ | ------------ | ------------- | ------------- | ------------- | ------------- | +| ERNIE 3.0-Medium+FP32 | 311.95(1.0X) | 57.45 | 90.91(1.0x) | 93.04 | 33.74(1.0x) | 66.95 | +| ERNIE 3.0-Medium+INT8 | 600.35(1.9x) | 56.57(-0.88) | 141.00(1.6x) | 92.64(-0.40) | 56.51(1.7x) | 66.23(-0.72) | +| ERNIE 3.0-Medium+裁剪+FP32 | 408.65(1.3x) | 57.31(-0.14) | 122.13(1.3x) | 93.27(+0.23) | 48.47(1.4x) | 65.55(-1.40) | +| ERNIE 3.0-Medium+裁剪+INT8 | 704.42(2.3x) | 56.69(-0.76) | 215.58(2.4x) | 92.39(-0.65) | 75.23(2.2x) | 63.47(-3.48) | + + +三类任务(分类、序列标注、阅读理解)经过相同压缩过程后,加速比达到 2.3 左右。 + + + + +##### GPU 性能 + +| | TNEWS 性能 | TNEWS 精度 | MSRA_NER 性能 | MSRA_NER 精度 | CMRC2018 性能 | CMRC2018 精度 | +| -------------------------- | ------------- | ------------ | ------------- | ------------- | ------------- | ------------- | +| ERNIE 3.0-Medium+FP32 | 1123.85(1.0x) | 57.45 | 366.75(1.0x) | 93.04 | 146.84(1.0x) | 66.95 | +| ERNIE 3.0-Medium+FP16 | 2672.41(2.4x) | 57.45(0.00) | 840.11(2.3x) | 93.05(0.01) | 303.43(2.1x) | 66.95(0.00) | +| ERNIE 3.0-Medium+INT8 | 3226.26(2.9x) | 56.99(-0.46) | 889.33(2.4x) | 92.70(-0.34) | 348.84(2.4x) | 66.32(-0.63 | +| ERNIE 3.0-Medium+裁剪+FP32 | 1424.01(1.3x) | 57.31(-0.14) | 454.27(1.2x) | 93.27(+0.23) | 183.77(1.3x) | 65.92(-1.03) | +| ERNIE 3.0-Medium+裁剪+FP16 | 3577.62(3.2x) | 57.27(-0.18) | 1138.77(3.1x) | 93.27(+0.23) | 445.71(3.0x) | 65.89(-1.06) | +| ERNIE 3.0-Medium+裁剪+INT8 | 3635.48(3.2x) | 57.26(-0.19) | 1105.26(3.0x) | 93.20(+0.16) | 444.27(3.0x) | 66.17(-0.78) | + + +三类任务(分类、序列标注、阅读理解)经过裁剪 + 量化后加速比均达到 3 倍左右,所有任务上平均精度损失可控制在 0.5 以内(0.46)。 + + + +### 使用 FastTokenizer 加速 + +FastTokenizer 是飞桨提供的速度领先的文本处理算子库,集成了 Google 于 2021 年底发布的 LinMaxMatch 算法,该算法引入 Aho-Corasick 将 WordPiece 的时间复杂度从 O(N2) 优化到 O(N),已在 Google 搜索业务中大规模上线。FastTokenizer 速度显著领先,且呈现 batch_size 越大,优势越突出。例如,设置 batch_size = 64 时,FastTokenizer 切词速度比 HuggingFace 快 28 倍。 + +在 ERNIE 3.0 轻量级模型裁剪、量化基础上,当设置切词线程数为 4 时,使用 FastTokenizer 在 NVIDIA Tesla T4 环境下在 IFLYTEK (长文本分类数据集,最大序列长度为 128)数据集上性能提升了 2.39 倍,相比 BERT-Base 性能提升了 7.09 倍,在 Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz、线程数为 8 的情况下性能提升了 1.27 倍,相比 BERT-Base 性能提升了 5.13 倍。加速效果如下图所示: + + + + + + +
+ +使用 FastTokenizer 的方式非常简单,在安装 fast_tokenizer 包之后,仅需在 tokenizer 实例化时直接传入 `use_fast=True` 即可。目前已在 Linux 系统下支持 BERT、ERNIE、TinyBERT 等模型。 + +安装 fast_tokenizer 包的命令: + +```shell +pip install fast-tokenizer-python +``` + +如需设置切词线程数,需要调用`fast_tokenizer.set_thread_num`接口进行设置: + +```python +# 设置切词线程数为 4 +import fast_tokenizer +fast_tokenizer.set_thread_num(4) +``` + +调用 `from_pretrained` 时只需轻松传入一个参数 `use_fast=True`: + +```python +from paddlenlp.transformers import AutoTokenizer +AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) +``` + + + +## 部署 + +我们基于 FastDeploy 为 ERNIE 3.0 提供了多种部署方案,可以满足不同场景下的部署需求,请根据实际情况进行选择。 + + + +### FastDeploy 部署 + +⚡️[FastDeploy](https://github.com/PaddlePaddle/FastDeploy)是一款全场景、易用灵活、极致高效的AI推理部署工具,为开发者提供多硬件、多推理引擎后端的部署能力。开发者只需调用一行代码即可随意切换硬件、推理引擎后端。 + +
+(原文此处为 FastDeploy 部署能力概览配图,图片为外链)
+ +目前 ERNIE 3.0 模型已提供基于 FastDeploy 的部署示例,支持在多款硬件(CPU、GPU、昆仑芯、华为昇腾以及 Graphcore IPU)以及推理引擎后端进行部署。具体的适配的硬件以及推理引擎请参考:[FastDeploy 部署指南](./deploy/README.md) + + + +#### Python 部署 + +Python 部署请参考:[Python 部署指南](./deploy/python/README.md) + + + +#### C++ 部署 + +C++ 部署请参考:[C++ 部署指南](./deploy/cpp/README.md) + + + +### 服务化部署 + +- [FastDeploy Serving 高性能服务化部署指南](./deploy/serving/README.md) +- [PaddleNLP SimpleServing 服务化部署指南](./deploy/simple_serving/README.md) + + + + +## Notebook教程 + +- [【快速上手ERNIE 3.0】中文情感分析实战](https://aistudio.baidu.com/aistudio/projectdetail/3955163) + +- [【快速上手ERNIE 3.0】法律文本多标签分类实战](https://aistudio.baidu.com/aistudio/projectdetail/3996601) + +- [【快速上手ERNIE 3.0】中文语义匹配实战](https://aistudio.baidu.com/aistudio/projectdetail/3986803) + +- [【快速上手ERNIE 3.0】MSRA序列标注实战](https://aistudio.baidu.com/aistudio/projectdetail/3989073) + +- [【快速上手ERNIE 3.0】机器阅读理解实战](https://aistudio.baidu.com/aistudio/projectdetail/2017189) + +- [【快速上手ERNIE 3.0】对话意图识别](https://aistudio.baidu.com/aistudio/projectdetail/2017202?contributionType=1) + + +## 参考文献 + +* Sun Y, Wang S, Feng S, et al. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[J]. arXiv preprint arXiv:2107.02137, 2021. + +* Su W, Chen X, Feng S, et al. ERNIE-Tiny: A Progressive Distillation Framework for Pretrained Transformer Compression[J]. arXiv preprint arXiv:2106.02241, 2021. + +* Wang S, Sun Y, Xiang Y, et al. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[J]. arXiv preprint arXiv:2112.12731, 2021. diff --git a/model_zoo/ernie-3.0/compress_qa.py b/model_zoo/ernie-3.0/compress_qa.py new file mode 100644 index 0000000000000000000000000000000000000000..a4cf1722aad1b7172a6b17c5dc88621a49f33805 --- /dev/null +++ b/model_zoo/ernie-3.0/compress_qa.py @@ -0,0 +1,136 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
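+
+# compress_qa.py: compress (e.g. prune and quantize) a fine-tuned ErnieForQuestionAnswering
+# model on a CLUE extractive QA dataset such as cmrc2018 by calling trainer.compress()
+# with CompressionArguments.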
+ +from functools import partial + +import paddle +import paddle.nn.functional as F +from datasets import load_dataset +from utils import ( + DataArguments, + ModelArguments, + QuestionAnsweringTrainer, + load_config, + prepare_train_features, + prepare_validation_features, +) + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.squad import compute_prediction +from paddlenlp.trainer import CompressionArguments, EvalPrediction, PdArgumentParser +from paddlenlp.transformers import ErnieForQuestionAnswering, ErnieTokenizer + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, compression_args = parser.parse_args_into_dataclasses() + + # Load model and data config + model_args, data_args, compression_args = load_config( + model_args.config, "QuestionAnswering", data_args.dataset, model_args, data_args, compression_args + ) + + paddle.set_device(compression_args.device) + data_args.dataset = data_args.dataset.strip() + + # Log model and data config + compression_args.print_config(model_args, "Model") + compression_args.print_config(data_args, "Data") + + raw_datasets = load_dataset("clue", data_args.dataset) + + label_list = getattr(raw_datasets["train"], "label_list", None) + data_args.label_list = label_list + + # Define tokenizer, model, loss function. + tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + model = ErnieForQuestionAnswering.from_pretrained(model_args.model_name_or_path) + + # Preprocessing the datasets. + # Preprocessing is slighlty different for training and evaluation. + + raw_datasets = load_dataset("clue", data_args.dataset) + column_names = raw_datasets["train"].column_names + label_list = getattr(raw_datasets["train"], "label_list", None) + data_args.label_list = label_list + # Create train feature from dataset + with compression_args.main_process_first(desc="train dataset map pre-processing"): + # Dataset pre-process + train_dataset = raw_datasets["train"] + train_dataset = train_dataset.map( + partial(prepare_train_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + with compression_args.main_process_first(desc="evaluate dataset map pre-processing"): + eval_examples = raw_datasets["validation"] + eval_dataset = eval_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + + # Post-processing: + def post_processing_function(examples, features, predictions, stage="eval"): + # Post-processing: we match the start logits and end logits to answers in the original context. 
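+        # compute_prediction selects the best answer span for each example and also returns
+        # the n-best candidates and null-score differences used by SQuAD-style evaluation.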
+ predictions, all_nbest_json, scores_diff_json = compute_prediction( + examples=examples, + features=features, + predictions=predictions, + n_best_size=data_args.n_best_size, + max_answer_length=data_args.max_answer_length, + null_score_diff_threshold=data_args.null_score_diff_threshold, + ) + + references = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples] + return EvalPrediction(predictions=predictions, label_ids=references) + + def criterion(outputs, label): + start_logits, end_logits = outputs + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = F.cross_entropy(input=start_logits, label=start_position) + end_loss = F.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + trainer = QuestionAnsweringTrainer( + model=model, + args=compression_args, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + eval_examples=eval_examples, + data_collator=data_collator, + post_process_function=post_processing_function, + tokenizer=tokenizer, + criterion=criterion, + ) + + compression_args.print_config() + + trainer.compress() + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/compress_seq_cls.py b/model_zoo/ernie-3.0/compress_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..75e0aa2f0201b0bef5d551aa7ddfcc4178930a05 --- /dev/null +++ b/model_zoo/ernie-3.0/compress_seq_cls.py @@ -0,0 +1,79 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial + +import paddle +from utils import DataArguments, ModelArguments, load_config, seq_convert_example + +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import CompressionArguments, PdArgumentParser, Trainer +from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, compression_args = parser.parse_args_into_dataclasses() + + # Log model and data config + model_args, data_args, compression_args = load_config( + model_args.config, "SequenceClassification", data_args.dataset, model_args, data_args, compression_args + ) + + paddle.set_device(compression_args.device) + + data_args.dataset = data_args.dataset.strip() + + # Log model and data config + compression_args.print_config(model_args, "Model") + compression_args.print_config(data_args, "Data") + + raw_datasets = load_dataset("clue", data_args.dataset) + + data_args.label_list = getattr(raw_datasets["train"], "label_list", None) + num_classes = len(raw_datasets["train"].label_list) + + criterion = paddle.nn.CrossEntropyLoss() + # Defines tokenizer, model, loss function. 
+ tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + model = ErnieForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + + # Defines dataset pre-process function + trans_fn = partial( + seq_convert_example, tokenizer=tokenizer, label_list=data_args.label_list, max_seq_len=data_args.max_seq_length + ) + + # Defines data collector + data_collator = DataCollatorWithPadding(tokenizer) + + train_dataset = raw_datasets["train"].map(trans_fn) + eval_dataset = raw_datasets["dev"].map(trans_fn) + + trainer = Trainer( + model=model, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + criterion=criterion, + ) # Strategy`dynabert` needs arguments `criterion` + + compression_args.print_config() + + trainer.compress() + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/compress_token_cls.py b/model_zoo/ernie-3.0/compress_token_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..3271b3a41dac2f54271fb9a74f080c7d2fd5e312 --- /dev/null +++ b/model_zoo/ernie-3.0/compress_token_cls.py @@ -0,0 +1,98 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial + +import paddle +import paddle.nn as nn +from utils import DataArguments, ModelArguments, load_config, token_convert_example + +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import CompressionArguments, PdArgumentParser, Trainer +from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, CompressionArguments)) + model_args, data_args, compression_args = parser.parse_args_into_dataclasses() + + # Log model and data config + model_args, data_args, compression_args = load_config( + model_args.config, "TokenClassification", data_args.dataset, model_args, data_args, compression_args + ) + paddle.set_device(compression_args.device) + + data_args.dataset = data_args.dataset.strip() + + # Log model and data config + compression_args.print_config(model_args, "Model") + compression_args.print_config(data_args, "Data") + + raw_datasets = load_dataset(data_args.dataset) + label_list = raw_datasets["train"].label_list + data_args.label_list = label_list + data_args.ignore_label = -100 + + data_args.no_entity_id = 0 + num_classes = len(label_list) + + # Define tokenizer, model, loss function. 
+ tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + model = ErnieForTokenClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + + class criterion(nn.Layer): + def __init__(self): + super(criterion, self).__init__() + self.loss_fn = nn.CrossEntropyLoss(ignore_index=data_args.ignore_label) + + def forward(self, *args, **kwargs): + return paddle.mean(self.loss_fn(*args, **kwargs)) + + loss_fct = criterion() + + # Define dataset pre-process function + trans_fn = partial( + token_convert_example, + tokenizer=tokenizer, + no_entity_id=data_args.no_entity_id, + max_seq_length=data_args.max_seq_length, + return_length=True, + ) + + # Define data collector + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=data_args.ignore_label) + + # Dataset pre-process + train_dataset = raw_datasets["train"].map(trans_fn) + eval_dataset = raw_datasets["test"].map(trans_fn) + train_dataset.label_list = label_list + train_dataset.ignore_label = data_args.ignore_label + trainer = Trainer( + model=model, + criterion=loss_fct, + args=compression_args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + tokenizer=tokenizer, + ) + + compression_args.print_config() + + trainer.compress() + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/configs/default.yml b/model_zoo/ernie-3.0/configs/default.yml new file mode 100644 index 0000000000000000000000000000000000000000..33c543f318355061f5dca6336b2105ec89380ea6 --- /dev/null +++ b/model_zoo/ernie-3.0/configs/default.yml @@ -0,0 +1,70 @@ +# Default Args for all dataset +# You can overwrite the configs in each dataset. +DefaultArgs: + learning_rate: 0.00003 + num_train_epochs: 3 + max_seq_length: 128 + weight_decay: 0.01 +# Datasets which used for sequence classfication +SequenceClassification: + afqmc: + num_train_epochs: 1 + learning_rate: 0.00003 + max_seq_length: 128 + per_device_train_batch_size: 16 + per_device_eval_batch_size: 32 + tnews: + num_train_epochs: 3 + learning_rate: 0.00005 + max_seq_length: 128 + per_device_train_batch_size: 32 + per_device_eval_batch_size: 32 + iflytek: + num_train_epochs: 3 + learning_rate: 0.00005 + max_seq_length: 128 + per_device_train_batch_size: 16 + per_device_eval_batch_size: 16 + ocnli: + num_train_epochs: 6 + learning_rate: 0.00005 + max_seq_length: 128 + per_device_train_batch_size: 64 + per_device_eval_batch_size: 64 + cmnli: + num_train_epochs: 4 + learning_rate: 0.00002 + max_seq_length: 128 + per_device_train_batch_size: 32 + per_device_eval_batch_size: 32 + cluewsc2020: + num_train_epochs: 50 + learning_rate: 0.00003 + max_seq_length: 128 + per_device_train_batch_size: 16 + per_device_eval_batch_size: 16 + csl: + num_train_epochs: 8 + learning_rate: 0.00005 + max_seq_length: 256 + per_device_train_batch_size: 64 + per_device_eval_batch_size: 64 + +# Datasets which used for token classfication +TokenClassification: + msra_ner: + learning_rate: 0.00005 + max_seq_length: 128 + num_train_epochs: 1 + per_device_train_batch_size: 8 + per_device_eval_batch_size: 16 + +# Datasets which used for question answersing +QuestionAnswering: + cmrc2018: + learning_rate: 0.00005 + max_seq_length: 512 + num_train_epochs: 1 + per_device_train_batch_size: 8 + per_device_eval_batch_size: 12 + diff --git a/model_zoo/ernie-3.0/deploy/README.md b/model_zoo/ernie-3.0/deploy/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c08521440c3e3dfc64c0cbe47ab478d918e476fd --- 
/dev/null +++ b/model_zoo/ernie-3.0/deploy/README.md @@ -0,0 +1,119 @@ +# FastDeploy ERNIE 3.0 模型高性能部署 + +**⚡️FastDeploy** 是一款**全场景**、**易用灵活**、**极致高效**的 AI 推理部署工具,满足开发者**多硬件、多平台**的产业部署需求。开发者可以基于 FastDeploy 将训练好的预测模型在不同的硬件、不同的推理引擎后端上进行部署。目前 FastDeploy 提供多种编程语言的 SDK,包括 C++、Python 以及 Java SDK。 + +在部署 ERNIE 3.0 模型前,需要安装 FastDeploy SDK,可参考 [FastDeploy SDK安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)确认部署环境是否满足 FastDeploy 环境要求,并按照介绍安装相应的 SDK。 + +目前,ERNIE 3.0 模型支持在如下的硬件以及推理引擎进行部署。 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| 硬件 | 可用的推理引擎 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
+|------|----------------|--------------------------------|--------------------|
+| CPU | Paddle Inference | ✅ | N/A |
+| | ONNX Runtime | ✅ | N/A |
+| | OpenVINO | ✅ | N/A |
+| GPU | Paddle Inference | ✅ | N/A |
+| | ONNX Runtime | ✅ | ✅ |
+| | Paddle TensorRT | ✅ | ✅ |
+| | TensorRT | ✅ | ✅ |
+| 昆仑芯 XPU | Paddle Lite | ✅ | N/A |
+| 华为 昇腾 | Paddle Lite | ❔ | ❔ |
+| Graphcore IPU | Paddle Inference | N/A | ❔ |
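+
+上表中的硬件与推理引擎组合,在 Python 端均通过 `RuntimeOption` 进行选择。下面给出一个最小的示意片段(仅作说明,接口名称取自本目录 `python/seq_cls_infer.py`,模型路径为假设的示例值,请按实际情况替换;其余硬件的接口可参考下方 Python/C++ 部署文档):
+
+```python
+import fastdeploy as fd
+
+option = fd.RuntimeOption()
+# 1. 选择硬件:GPU;在 CPU 上部署时可改为 option.use_cpu()
+option.use_gpu()
+# 2. 选择该硬件支持的推理引擎后端,需与上表对应
+option.use_paddle_infer_backend()   # 也可选 use_ort_backend()、use_trt_backend() 等
+# 3. 指定导出的部署模型(示例路径)
+option.set_model_path("best_models/afqmc/export/model.pdmodel",
+                      "best_models/afqmc/export/model.pdiparams")
+# 4. 创建 Runtime 即可进行推理
+runtime = fd.Runtime(option)
+```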
+ +## 支持的NLP任务列表 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| 任务 Task | 部署方式 | 是否支持 |
+|-----------|----------|----------|
+| 文本分类 | Python | ✅ |
+| | C++ | ✅ |
+| | Serving | ✅ |
+| 序列标注 | Python | ✅ |
+| | C++ | ✅ |
+| | Serving | ✅ |
+ +## 详细部署文档 + +ERNIE 3.0 模型支持 Python、C++ 部署以及 Serving 服务化部署。以下是详细文档。 + +- [Python 部署](python/README.md) +- [C++ 部署](cpp/README.md) +- [Serving 部署](serving/README.md) diff --git a/model_zoo/ernie-3.0/deploy/cpp/CMakeLists.txt b/model_zoo/ernie-3.0/deploy/cpp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..db2b4305b58688e2a65394ddf3ffbc3d705fd09b --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/cpp/CMakeLists.txt @@ -0,0 +1,35 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +PROJECT(infer_demo C CXX) +CMAKE_MINIMUM_REQUIRED (VERSION 3.10) + +option(FASTDEPLOY_INSTALL_DIR "Path of downloaded fastdeploy sdk.") +include(${FASTDEPLOY_INSTALL_DIR}/FastDeploy.cmake) +include(${FASTDEPLOY_INSTALL_DIR}/utils/gflags.cmake) + +include_directories(${FASTDEPLOY_INCS}) + +add_executable(seq_cls_infer_demo ${PROJECT_SOURCE_DIR}/seq_cls_infer.cc) +add_executable(token_cls_infer_demo ${PROJECT_SOURCE_DIR}/token_cls_infer.cc) +add_dependencies(seq_cls_infer_demo gflags) +add_dependencies(token_cls_infer_demo gflags) + +if(UNIX AND (NOT APPLE) AND (NOT ANDROID)) + target_link_libraries(seq_cls_infer_demo ${FASTDEPLOY_LIBS} gflags pthread) + target_link_libraries(token_cls_infer_demo ${FASTDEPLOY_LIBS} gflags pthread) +else() + target_link_libraries(seq_cls_infer_demo ${FASTDEPLOY_LIBS} gflags) + target_link_libraries(token_cls_infer_demo ${FASTDEPLOY_LIBS} gflags) +endif() diff --git a/model_zoo/ernie-3.0/deploy/cpp/README.md b/model_zoo/ernie-3.0/deploy/cpp/README.md new file mode 100644 index 0000000000000000000000000000000000000000..39c098c95f36e8523cdcab1192cd79e42a71b5fb --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/cpp/README.md @@ -0,0 +1,272 @@ +# FastDeploy ERNIE 3.0 模型 C++ 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy C++ SDK。 + +本目录下分别提供 `seq_cls_infer.cc` 以及 `token_cls_infer.cc` 快速完成在 CPU/GPU 的文本分类任务以及序列标注任务的 C++ 部署示例。 + + +## 文本分类任务 + +### 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 3.0 Medium 模型在 CLUE Benchmark 的 [AFQMC 数据集](https://github.com/CLUEbenchmark/CLUE)上进行文本分类任务的 C++ 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端。示例中的模型是 ERNIE 3.0 在 `AFQMC 数据集`微调后导出得到的部署模型。 + +```bash +mkdir build +cd build +# 下载FastDeploy预编译库,用户可在上文提到的`FastDeploy SDK安装文档`中自行选择合适的版本使用 +wget https://bj.bcebos.com/fastdeploy/release/cpp/fastdeploy-linux-x64-x.x.x.tgz +tar xvf fastdeploy-linux-x64-x.x.x.tgz +cmake .. 
-DFASTDEPLOY_INSTALL_DIR=${PWD}/fastdeploy-linux-x64-x.x.x +make -j + +# CPU 推理 +./seq_cls_infer_demo --model_dir ../../../best_models/afqmc/export/ --device cpu --backend paddle + +# GPU 推理 +./seq_cls_infer_demo --model_dir ../../../best_models/afqmc/export/ --device gpu --backend paddle + +``` + +运行完成后返回的结果如下: +```bash +[INFO] /paddle/PaddleNLP/model_zoo/ernie-3.0/fastdeploy/cpp/seq_cls_infer.cc(103)::CreateRuntimeOption model_path = ../../../best_models/afqmc/export/model.pdmodel, param_path = ../../../best_models/afqmc/export/model.pdiparams +[INFO] fastdeploy/runtime.cc(500)::Init Runtime initialized with Backend::PDINFER in Device::CPU. +input data: 花呗收款额度限制, 收钱码,对花呗支付的金额有限制吗 +seq cls result: +label: Similar confidence: 0.509855 +----------------------------- +input data: 花呗支持高铁票支付吗, 为什么友付宝不支持花呗付款 +seq cls result: +label: Similar confidence: 0.986198 +----------------------------- +``` + +### 量化模型部署 + +该示例支持部署 Paddle INT8 新格式量化模型,仅需在`--model_dir`参数传入量化模型路径,并且在对应硬件上选择可用的推理引擎后端,即可完成量化模型部署。在 GPU 上部署量化模型时,可选后端为`paddle_tensorrt`、`tensorrt`;在CPU上部署量化模型时,可选后端为`paddle`、`onnx_runtime`。下面将展示如何使用该示例完成量化模型部署,示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md) 压缩量化后导出得到的量化模型。 + +```bash + +# 在 GPU 上使用 tensorrt 后端运行量化模型,模型目录可按照实际模型路径设置 +./seq_cls_infer_demo --model_dir ../../../best_models/afqmc/width_mult_0.75/mse16_1/ --device gpu --backend tensorrt --model_prefix int8 + +# 在 CPU 上使用paddle_inference后端,模型目录可按照实际模型路径设置 +./seq_cls_infer_demo --model_dir ../../../best_models/afqmc/width_mult_0.75/mse16_1/ --device cpu --backend paddle --model_prefix int8 + +``` + +运行完成后返回的结果如下: + +```bash +[INFO] /paddle/PaddleNLP/model_zoo/ernie-3.0/fastdeploy/cpp/seq_cls_infer.cc(67)::CreateRuntimeOption model_path = ../../../best_models/afqmc/width_mult_0.75/mse16_1/int8.pdmodel, param_path = ../../../best_models/afqmc/width_mult_0.75/mse16_1/int8.pdmodel +[INFO] fastdeploy/runtime.cc(596)::Init Runtime initialized with Backend::TRT in Device::GPU. +input data: 花呗收款额度限制, 收钱码,对花呗支付的金额有限制吗 +seq cls result: +label: Similar confidence: 0.5259 +----------------------------- +input data: 花呗支持高铁票支付吗, 为什么友付宝不支持花呗付款 +seq cls result: +label: Similar confidence: 0.985738 +----------------------------- +``` + +### 参数说明 + +`seq_cls_infer_demo` 除了以上示例的命令行参数,还支持更多命令行参数的设置。以下为各命令行参数的说明。 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录 | +|--vocab_path| 指定的模型词表路径 | +|--batch_size |最大可测的 batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--device | 运行的设备,可选范围: ['cpu', 'kunlunxin', 'gpu'],默认为'cpu' | +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | +|--model_prefix| 模型文件前缀。前缀会分别与'.pdmodel'和'.pdiparams'拼接得到模型文件名和参数文件名。默认为 'model'| + +## 序列标注任务 + +### 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 3.0 Medium 模型在 CLUE Benchmark 的 [MSRA_NER 数据集](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra)上进行序列标注任务的 C++ 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端。 + +```bash +mkdir build +cd build +# 下载FastDeploy预编译库,用户可在上文提到的`FastDeploy预编译库`中自行选择合适的版本使用 +wget https://bj.bcebos.com/fastdeploy/release/cpp/fastdeploy-linux-x64-x.x.x.tgz +tar xvf fastdeploy-linux-x64-x.x.x.tgz +cmake .. 
-DFASTDEPLOY_INSTALL_DIR=${PWD}/fastdeploy-linux-x64-x.x.x +make -j + +# CPU 推理 +./token_cls_infer_demo --model_dir ../../../best_models/msra_ner/export --device cpu --backend paddle + +# GPU 推理 +./token_cls_infer_demo --model_dir ../../../best_models/msra_ner/export --device gpu --backend paddle + +``` + +运行完成后返回的结果如下: + +```bash + +[INFO] /paddle/PaddleNLP/model_zoo/ernie-3.0/fastdeploy/cpp/token_cls_infer.cc(103)::CreateRuntimeOption model_path = ../../../best_models/msra_ner/export/model.pdmodel, param_path = ../../../best_models/msra_ner/export/model.pdiparams +[INFO] fastdeploy/runtime.cc(500)::Init Runtime initialized with Backend::PDINFER in Device::CPU. +input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。 +The model detects all entities: +entity: 北京, label: LOC, pos: [0, 1] +entity: 重庆, label: LOC, pos: [6, 7] +entity: 成都, label: LOC, pos: [12, 13] +----------------------------- +input data: 乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。 +The model detects all entities: +entity: 乔丹, label: PER, pos: [0, 1] +entity: 科比, label: PER, pos: [3, 4] +entity: 詹姆斯, label: PER, pos: [6, 8] +entity: 姚明, label: PER, pos: [10, 11] +----------------------------- + +``` + +### 量化模型部署 + +该示例支持部署 Paddle INT8 新格式量化模型,仅需在`--model_dir`参数传入量化模型路径,并且在对应硬件上选择可用的推理引擎后端,即可完成量化模型部署。在 GPU 上部署量化模型时,可选后端为`paddle_tensorrt`、`tensorrt`;在CPU上部署量化模型时,可选后端为`paddle`、`onnx_runtime`。下面将展示如何使用该示例完成量化模型部署,示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md) 压缩量化后导出得到的量化模型。 + +```bash + +# 在 GPU 上使用 tensorrt 后端运行量化模型,模型目录可按照实际模型路径设置 +./token_cls_infer_demo --model_dir ../../../best_models/msra_ner/width_mult_0.75/mse16_1/ --device gpu --backend tensorrt --model_prefix int8 + +# 在 CPU 上使用paddle_inference后端,模型目录可按照实际模型路径设置 +./token_cls_infer_demo --model_dir ../../../best_models/msra_ner/width_mult_0.75/mse16_1/ --device cpu --backend paddle --model_prefix int8 + +``` + +运行完成后返回的结果如下: + +```bash +[INFO] /paddle/PaddleNLP/model_zoo/ernie-3.0/fastdeploy/cpp/token_cls_infer.cc(103)::CreateRuntimeOption model_path = ../../../best_models/msra_ner/export/model.pdmodel, param_path = ../../../best_models/msra_ner/export/model.pdiparams +[INFO] fastdeploy/runtime.cc(500)::Init Runtime initialized with Backend::PDINFER in Device::CPU. 
+input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。 +The model detects all entities: +entity: 北京, label: LOC, pos: [0, 1] +entity: 重庆, label: LOC, pos: [6, 7] +entity: 成都, label: LOC, pos: [12, 13] +----------------------------- +input data: 乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。 +The model detects all entities: +entity: 乔丹, label: PER, pos: [0, 1] +entity: 科比, label: PER, pos: [3, 4] +entity: 詹姆斯, label: PER, pos: [6, 8] +entity: 姚明, label: PER, pos: [10, 11] +----------------------------- +``` + +### 参数说明 + +`token_cls_infer_demo` 除了以上示例的命令行参数,还支持更多命令行参数的设置。以下为各命令行参数的说明。 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录, | +|--batch_size |最大可测的 batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' | +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | + +## FastDeploy 高阶用法 + +FastDeploy 在 C++ 端上,提供 `fastdeploy::RuntimeOption::UseXXX()` 以及 `fastdeploy::RuntimeOption::UseXXXBackend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 ERNIE 3.0 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 ERNIE 3.0 模型。 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| 硬件 | 硬件对应的接口 | 可用的推理引擎 | 推理引擎对应的接口 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
+|------|----------------|----------------|--------------------|--------------------------------|--------------------|
+| CPU | UseCpu() | Paddle Inference | UsePaddleInferBackend() | ✅ | N/A |
+| | | ONNX Runtime | UseOrtBackend() | ✅ | N/A |
+| | | OpenVINO | UseOpenvinoBackend() | ✅ | N/A |
+| GPU | UseGpu() | Paddle Inference | UsePaddleInferBackend() | ✅ | N/A |
+| | | ONNX Runtime | UseOrtBackend() | ✅ | ✅ |
+| | | Paddle TensorRT | UseTrtBackend() + EnablePaddleToTrt() | ✅ | ✅ |
+| | | TensorRT | UseTrtBackend() | ✅ | ✅ |
+| 昆仑芯 XPU | UseKunlunXin() | Paddle Lite | UsePaddleLiteBackend() | ✅ | N/A |
+| 华为 昇腾 | UseAscend() | Paddle Lite | UsePaddleLiteBackend() | ❔ | ❔ |
+| Graphcore IPU | UseIpu() | Paddle Inference | UsePaddleInferBackend() | N/A | ❔ |
+ +## 相关文档 + +[ERNIE 3.0模型详细介绍](../../README.md) + +[ERNIE 3.0模型Python部署方法](../python/README.md) diff --git a/model_zoo/ernie-3.0/deploy/cpp/seq_cls_infer.cc b/model_zoo/ernie-3.0/deploy/cpp/seq_cls_infer.cc new file mode 100644 index 0000000000000000000000000000000000000000..a33337d9694441d614260dd3834cbf04d0028ee5 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/cpp/seq_cls_infer.cc @@ -0,0 +1,293 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#include +#include +#include +#include + +#include "fast_tokenizer/tokenizers/ernie_fast_tokenizer.h" +#include "fastdeploy/function/reduce.h" +#include "fastdeploy/function/softmax.h" +#include "fastdeploy/runtime.h" +#include "fastdeploy/utils/path.h" +#include "gflags/gflags.h" + +using namespace paddlenlp; +using namespace fast_tokenizer::tokenizers_impl; +#ifdef WIN32 +const char sep = '\\'; +#else +const char sep = '/'; +#endif + +DEFINE_string(model_dir, "", "Directory of the inference model."); +DEFINE_string(vocab_path, "", "Path of the vocab file."); +DEFINE_string(model_prefix, "model", "The model and params file prefix."); +DEFINE_string(device, + "cpu", + "Type of inference device, support 'cpu', 'kunlunxin' or 'gpu'."); +DEFINE_string(backend, + "paddle", + "The inference runtime backend, support: ['onnx_runtime', " + "'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt']"); +DEFINE_int32(batch_size, 1, "The batch size of data."); +DEFINE_int32(max_length, 128, "The batch size of data."); +DEFINE_bool(use_fp16, false, "Wheter to use FP16 mode."); + +void PrintUsage() { + fastdeploy::FDINFO + << "Usage: seq_cls_infer_demo --model_dir dir --device [cpu|gpu] " + "--backend " + "[onnx_runtime|paddle|openvino|tensorrt|paddle_tensorrt] " + "--batch_size size --max_length len --use_fp16 false" + << std::endl; + fastdeploy::FDINFO << "Default value of device: cpu" << std::endl; + fastdeploy::FDINFO << "Default value of backend: onnx_runtime" << std::endl; + fastdeploy::FDINFO << "Default value of batch_size: 1" << std::endl; + fastdeploy::FDINFO << "Default value of max_length: 128" << std::endl; + fastdeploy::FDINFO << "Default value of use_fp16: false" << std::endl; +} + +bool CreateRuntimeOption(fastdeploy::RuntimeOption* option) { + std::string model_path = + FLAGS_model_dir + sep + FLAGS_model_prefix + ".pdmodel"; + std::string param_path = + FLAGS_model_dir + sep + FLAGS_model_prefix + ".pdiparams"; + fastdeploy::FDINFO << "model_path = " << model_path + << ", param_path = " << param_path << std::endl; + option->SetModelPath(model_path, param_path); + if (FLAGS_device == "kunlunxin") { + option->UseKunlunXin(); + return true; + } else if (FLAGS_device == "gpu") { + option->UseGpu(); + } else if (FLAGS_device == "cpu") { + option->UseCpu(); + } else { + fastdeploy::FDERROR << "The avilable device should be one of the list " + "['cpu', 'gpu', 'kunlunxin']. 
But receive '" + << FLAGS_device << "'" << std::endl; + return false; + } + + if (FLAGS_backend == "onnx_runtime") { + option->UseOrtBackend(); + } else if (FLAGS_backend == "paddle") { + option->UsePaddleInferBackend(); + } else if (FLAGS_backend == "openvino") { + option->UseOpenVINOBackend(); + } else if (FLAGS_backend == "tensorrt" || + FLAGS_backend == "paddle_tensorrt") { + option->UseTrtBackend(); + if (FLAGS_backend == "paddle_tensorrt") { + option->EnablePaddleToTrt(); + option->EnablePaddleTrtCollectShape(); + } + std::string trt_file = FLAGS_model_dir + sep + "infer.trt"; + option->SetTrtInputShape("input_ids", + {1, 1}, + {FLAGS_batch_size, FLAGS_max_length}, + {FLAGS_batch_size, FLAGS_max_length}); + option->SetTrtInputShape("token_type_ids", + {1, 1}, + {FLAGS_batch_size, FLAGS_max_length}, + {FLAGS_batch_size, FLAGS_max_length}); + if (FLAGS_use_fp16) { + option->EnableTrtFP16(); + trt_file = trt_file + ".fp16"; + } + } else { + fastdeploy::FDERROR << "The avilable backend should be one of the list " + "['paddle', 'openvino', 'tensorrt', " + "'paddle_tensorrt']. But receive '" + << FLAGS_backend << "'" << std::endl; + return false; + } + return true; +} + +bool BatchFyTexts(const std::vector& texts, + int batch_size, + std::vector>* batch_texts) { + for (int idx = 0; idx < texts.size(); idx += batch_size) { + int rest = texts.size() - idx; + int curr_size = std::min(batch_size, rest); + std::vector batch_text(curr_size); + std::copy_n(texts.begin() + idx, curr_size, batch_text.begin()); + batch_texts->emplace_back(std::move(batch_text)); + } + return true; +} + +struct SeqClsResult { + int label; + float confidence; +}; + +struct ErnieForSequenceClassificationPredictor { + fastdeploy::Runtime runtime_; + ErnieFastTokenizer tokenizer_; + ErnieForSequenceClassificationPredictor( + const fastdeploy::RuntimeOption& option, + const ErnieFastTokenizer& tokenizer) + : tokenizer_(tokenizer) { + runtime_.Init(option); + } + + bool Preprocess(const std::vector& texts, + const std::vector& texts_pair, + std::vector* inputs) { + std::vector encodings; + std::vector text_pair_input; + // 1. Tokenize the text or (text, text_pair) + if (texts_pair.empty()) { + for (int i = 0; i < texts.size(); ++i) { + text_pair_input.emplace_back(texts[i]); + } + } else { + if (texts.size() != texts_pair.size()) { + return false; + } + for (int i = 0; i < texts.size(); ++i) { + text_pair_input.emplace_back( + std::pair(texts[i], texts_pair[i])); + } + } + tokenizer_.EncodeBatchStrings(text_pair_input, &encodings); + // 2. 
Construct the input vector tensor + // 2.1 Allocate input tensor + int64_t batch_size = texts.size(); + int64_t seq_len = 0; + if (batch_size > 0) { + seq_len = encodings[0].GetIds().size(); + } + inputs->resize(runtime_.NumInputs()); + for (int i = 0; i < runtime_.NumInputs(); ++i) { + (*inputs)[i].Allocate({batch_size, seq_len}, + fastdeploy::FDDataType::INT64, + runtime_.GetInputInfo(i).name); + } + // 2.2 Set the value of data + size_t start = 0; + int64_t* input_ids_ptr = + reinterpret_cast((*inputs)[0].MutableData()); + int64_t* type_ids_ptr = + reinterpret_cast((*inputs)[1].MutableData()); + for (int i = 0; i < encodings.size(); ++i) { + auto&& curr_input_ids = encodings[i].GetIds(); + auto&& curr_type_ids = encodings[i].GetTypeIds(); + std::copy( + curr_input_ids.begin(), curr_input_ids.end(), input_ids_ptr + start); + std::copy( + curr_type_ids.begin(), curr_type_ids.end(), type_ids_ptr + start); + start += seq_len; + } + return true; + } + + bool Postprocess(const std::vector& outputs, + std::vector* seq_cls_results) { + const auto& logits = outputs[0]; + fastdeploy::FDTensor probs; + fastdeploy::function::Softmax(logits, &probs); + + fastdeploy::FDTensor labels, confidences; + fastdeploy::function::Max(probs, &confidences, {-1}); + fastdeploy::function::ArgMax(probs, &labels, -1); + if (labels.Numel() != confidences.Numel()) { + return false; + } + + seq_cls_results->resize(labels.Numel()); + int64_t* label_ptr = reinterpret_cast(labels.Data()); + float* confidence_ptr = reinterpret_cast(confidences.Data()); + for (int i = 0; i < labels.Numel(); ++i) { + (*seq_cls_results)[i].label = label_ptr[i]; + (*seq_cls_results)[i].confidence = confidence_ptr[i]; + } + return true; + } + + bool Predict(const std::vector& texts, + const std::vector& texts_pair, + std::vector* seq_cls_results) { + std::vector inputs; + if (!Preprocess(texts, texts_pair, &inputs)) { + return false; + } + + std::vector outputs(runtime_.NumOutputs()); + runtime_.Infer(inputs, &outputs); + + if (!Postprocess(outputs, seq_cls_results)) { + return false; + } + return true; + } +}; + +void PrintResult(const std::vector& seq_cls_results, + const std::vector& data, + const std::vector& data_pair) { + static std::vector label_list{"Similar", "Not similar"}; + for (int i = 0; i < data.size(); ++i) { + std::cout << "input data: " << data[i] << ", " << data_pair[i] << std::endl; + std::cout << "seq cls result: " << std::endl; + std::cout << "label: " << label_list[seq_cls_results[i].label] + << " confidence: " << seq_cls_results[i].confidence << std::endl; + std::cout << "-----------------------------" << std::endl; + } +} + +int main(int argc, char* argv[]) { + google::ParseCommandLineFlags(&argc, &argv, true); + auto option = fastdeploy::RuntimeOption(); + if (!CreateRuntimeOption(&option)) { + PrintUsage(); + return -1; + } + + std::string vocab_path = FLAGS_vocab_path; + if (!fastdeploy::CheckFileExists(vocab_path)) { + vocab_path = fastdeploy::PathJoin(FLAGS_model_dir, "vocab.txt"); + if (!fastdeploy::CheckFileExists(vocab_path)) { + fastdeploy::FDERROR << "The path of vocab " << vocab_path + << " doesn't exist" << std::endl; + PrintUsage(); + return -1; + } + } + ErnieFastTokenizer tokenizer(vocab_path); + tokenizer.EnableTruncMethod( + FLAGS_max_length, + 0, + fast_tokenizer::core::Direction::RIGHT, + fast_tokenizer::core::TruncStrategy::LONGEST_FIRST); + + ErnieForSequenceClassificationPredictor predictor(option, tokenizer); + + std::vector seq_cls_results; + std::vector texts_ds = {"花呗收款额度限制", + "花呗支持高铁票支付吗"}; + 
std::vector texts_pair_ds = {"收钱码,对花呗支付的金额有限制吗", + "为什么友付宝不支持花呗付款"}; + std::vector> batch_texts, batch_texts_pair; + BatchFyTexts(texts_ds, FLAGS_batch_size, &batch_texts); + BatchFyTexts(texts_pair_ds, FLAGS_batch_size, &batch_texts_pair); + for (int bs = 0; bs < batch_texts.size(); ++bs) { + predictor.Predict(batch_texts[bs], batch_texts_pair[bs], &seq_cls_results); + PrintResult(seq_cls_results, batch_texts[bs], batch_texts_pair[bs]); + } + return 0; +} diff --git a/model_zoo/ernie-3.0/deploy/cpp/token_cls_infer.cc b/model_zoo/ernie-3.0/deploy/cpp/token_cls_infer.cc new file mode 100644 index 0000000000000000000000000000000000000000..5395be7e2f55196cc45afdec6b17bba572f22a2b --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/cpp/token_cls_infer.cc @@ -0,0 +1,320 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#include +#include +#include +#include + +#include "fast_tokenizer/pretokenizers/pretokenizer.h" +#include "fast_tokenizer/tokenizers/ernie_fast_tokenizer.h" +#include "fastdeploy/function/functions.h" +#include "fastdeploy/runtime.h" +#include "fastdeploy/utils/path.h" +#include "gflags/gflags.h" + +using namespace paddlenlp; +using namespace fast_tokenizer::tokenizers_impl; +#ifdef WIN32 +const char sep = '\\'; +#else +const char sep = '/'; +#endif + +DEFINE_string(model_dir, "", "Directory of the inference model."); +DEFINE_string(vocab_path, "", "Path of the vocab file."); +DEFINE_string(model_prefix, "model", "The model and params file prefix."); +DEFINE_string(device, + "cpu", + "Type of inference device, support 'cpu' or 'gpu'."); +DEFINE_string(backend, + "paddle", + "The inference runtime backend, support: ['onnx_runtime', " + "'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt']"); +DEFINE_int32(batch_size, 1, "The batch size of data."); +DEFINE_int32(max_length, 128, "The batch size of data."); +DEFINE_bool(use_fp16, false, "Wheter to use FP16 mode."); + +void PrintUsage() { + fastdeploy::FDINFO + << "Usage: seq_cls_infer_demo --model_dir dir --device [cpu|gpu] " + "--backend " + "[onnx_runtime|paddle|openvino|tensorrt|paddle_tensorrt] " + "--batch_size size --max_length len --use_fp16 false" + << std::endl; + fastdeploy::FDINFO << "Default value of device: cpu" << std::endl; + fastdeploy::FDINFO << "Default value of backend: onnx_runtime" << std::endl; + fastdeploy::FDINFO << "Default value of batch_size: 1" << std::endl; + fastdeploy::FDINFO << "Default value of max_length: 128" << std::endl; + fastdeploy::FDINFO << "Default value of use_fp16: false" << std::endl; +} + +bool CreateRuntimeOption(fastdeploy::RuntimeOption* option) { + std::string model_path = + FLAGS_model_dir + sep + FLAGS_model_prefix + ".pdmodel"; + std::string param_path = + FLAGS_model_dir + sep + FLAGS_model_prefix + ".pdiparams"; + fastdeploy::FDINFO << "model_path = " << model_path + << ", param_path = " << param_path << std::endl; + option->SetModelPath(model_path, param_path); + + if (FLAGS_device == "gpu") { + 
option->UseGpu(); + } else if (FLAGS_device == "cpu") { + option->UseCpu(); + } else if (FLAGS_device == "kunlunxin") { + option->UseKunlunXin(); + return true; + } else { + fastdeploy::FDERROR << "The avilable device should be one of the list " + "['cpu', 'gpu', 'kunlunxin']. But receive '" + << FLAGS_device << "'" << std::endl; + return false; + } + + if (FLAGS_backend == "onnx_runtime") { + option->UseOrtBackend(); + } else if (FLAGS_backend == "paddle") { + option->UsePaddleInferBackend(); + } else if (FLAGS_backend == "openvino") { + option->UseOpenVINOBackend(); + } else if (FLAGS_backend == "tensorrt" || + FLAGS_backend == "paddle_tensorrt") { + option->UseTrtBackend(); + if (FLAGS_backend == "paddle_tensorrt") { + option->EnablePaddleToTrt(); + option->EnablePaddleTrtCollectShape(); + } + std::string trt_file = FLAGS_model_dir + sep + "infer.trt"; + option->SetTrtInputShape("input_ids", + {1, 1}, + {FLAGS_batch_size, FLAGS_max_length}, + {FLAGS_batch_size, FLAGS_max_length}); + option->SetTrtInputShape("token_type_ids", + {1, 1}, + {FLAGS_batch_size, FLAGS_max_length}, + {FLAGS_batch_size, FLAGS_max_length}); + if (FLAGS_use_fp16) { + option->EnableTrtFP16(); + trt_file = trt_file + ".fp16"; + } + } else { + fastdeploy::FDERROR << "The avilable backend should be one of the list " + "['paddle', 'openvino', 'tensorrt', " + "'paddle_tensorrt']. But receive '" + << FLAGS_backend << "'" << std::endl; + return false; + } + + return true; +} + +bool BatchFyTexts(const std::vector& texts, + int batch_size, + std::vector>* batch_texts) { + for (int idx = 0; idx < texts.size(); idx += batch_size) { + int rest = texts.size() - idx; + int curr_size = std::min(batch_size, rest); + std::vector batch_text(curr_size); + std::copy_n(texts.begin() + idx, curr_size, batch_text.begin()); + batch_texts->emplace_back(std::move(batch_text)); + } + return true; +} + +struct TokenClsResult { + struct TokenResult { + std::string token_label; + std::string entity; + std::pair pos; + friend std::ostream& operator<<(std::ostream& os, + const TokenResult& result); + }; + std::vector token_results; +}; + +std::ostream& operator<<(std::ostream& os, + const typename TokenClsResult::TokenResult& result) { + os << "entity: " << result.entity << ", label: " << result.token_label + << ", pos: [" << result.pos.first << ", " << result.pos.second << "]"; + return os; +} + +void PrintResult(const std::vector& token_cls_results, + const std::vector& data) { + for (int i = 0; i < data.size(); ++i) { + std::cout << "input data: " << data[i] << std::endl; + std::cout << "The model detects all entities:" << std::endl; + auto& curr_results = token_cls_results[i]; + for (auto& token_result : curr_results.token_results) { + std::cout << token_result << std::endl; + } + std::cout << "-----------------------------" << std::endl; + } +} + +struct ErnieForTokenClassificationPredictor { + fastdeploy::Runtime runtime_; + ErnieFastTokenizer tokenizer_; + std::vector label_list_; + + ErnieForTokenClassificationPredictor( + const fastdeploy::RuntimeOption& option, + const ErnieFastTokenizer& tokenizer, + const std::vector& label_list) + : tokenizer_(tokenizer), label_list_(label_list) { + runtime_.Init(option); + } + + bool Preprocess(const std::vector& texts, + std::vector* inputs) { + std::vector encodings; + std::vector text_pair_input; + // 1. 
Tokenize the text or (text, text_pair) + for (int i = 0; i < texts.size(); ++i) { + text_pair_input.emplace_back(texts[i]); + } + tokenizer_.EncodeBatchStrings(text_pair_input, &encodings); + // 2. Construct the input vector tensor + // 2.1 Allocate input tensor + int64_t batch_size = texts.size(); + int64_t seq_len = 0; + if (batch_size > 0) { + seq_len = encodings[0].GetIds().size(); + } + inputs->resize(runtime_.NumInputs()); + for (int i = 0; i < runtime_.NumInputs(); ++i) { + (*inputs)[i].Allocate({batch_size, seq_len}, + fastdeploy::FDDataType::INT64, + runtime_.GetInputInfo(i).name); + } + // 2.2 Set the value of data + size_t start = 0; + int64_t* input_ids_ptr = + reinterpret_cast((*inputs)[0].MutableData()); + int64_t* type_ids_ptr = + reinterpret_cast((*inputs)[1].MutableData()); + for (int i = 0; i < encodings.size(); ++i) { + auto&& curr_input_ids = encodings[i].GetIds(); + auto&& curr_type_ids = encodings[i].GetTypeIds(); + std::copy( + curr_input_ids.begin(), curr_input_ids.end(), input_ids_ptr + start); + std::copy( + curr_type_ids.begin(), curr_type_ids.end(), type_ids_ptr + start); + start += seq_len; + } + return true; + } + + bool Postprocess(const std::vector& outputs, + const std::vector& texts, + std::vector* results) { + fastdeploy::FDTensor batch_preds; + auto& logits = outputs[0]; + fastdeploy::function::ArgMax(logits, &batch_preds, -1); + for (int i = 0; i < results->size(); ++i) { + fastdeploy::FDTensor preds; + fastdeploy::function::Slice(batch_preds, {0}, {i}, &preds); + int start = -1; + std::string label_name = ""; + std::vector items; + + int seq_len = preds.Shape()[0]; + + fast_tokenizer::pretokenizers::CharToBytesOffsetConverter convertor( + texts[i]); + fast_tokenizer::core::Offset curr_offset; + for (int j = 0; j < seq_len; ++j) { + int64_t label_id = (reinterpret_cast(preds.Data()))[j]; + const std::string& curr_label = label_list_[label_id]; + if ((curr_label == "O" || curr_label.find("B-") != std::string::npos) && + start >= 0) { + // Convert the unicode character offset to byte offset. 
+ convertor.convert({start, j - 1}, &curr_offset); + if (curr_offset.first >= texts[i].length()) { + break; + } + items.emplace_back(typename TokenClsResult::TokenResult{ + label_name, + texts[i].substr(curr_offset.first, + curr_offset.second - curr_offset.first), + {start, j - 2}}); + start = -1; + } + if (curr_label.find("B-") != std::string::npos) { + start = j - 1; + label_name = curr_label.substr(2); + } + } + (*results)[i].token_results = std::move(items); + } + return true; + } + bool Predict(const std::vector& texts, + std::vector* results) { + std::vector inputs; + if (!Preprocess(texts, &inputs)) { + return false; + } + + std::vector outputs(runtime_.NumOutputs()); + runtime_.Infer(inputs, &outputs); + results->resize(texts.size()); + if (!Postprocess(outputs, texts, results)) { + return false; + } + return true; + } +}; + +int main(int argc, char* argv[]) { + google::ParseCommandLineFlags(&argc, &argv, true); + auto option = fastdeploy::RuntimeOption(); + if (!CreateRuntimeOption(&option)) { + PrintUsage(); + return -1; + } + + std::string vocab_path = FLAGS_vocab_path; + if (!fastdeploy::CheckFileExists(vocab_path)) { + vocab_path = fastdeploy::PathJoin(FLAGS_model_dir, "vocab.txt"); + if (!fastdeploy::CheckFileExists(vocab_path)) { + fastdeploy::FDERROR << "The path of vocab " << vocab_path + << " doesn't exist" << std::endl; + PrintUsage(); + return -1; + } + } + uint32_t max_length = FLAGS_max_length; + ErnieFastTokenizer tokenizer(vocab_path); + tokenizer.EnableTruncMethod( + max_length, + 0, + fast_tokenizer::core::Direction::RIGHT, + fast_tokenizer::core::TruncStrategy::LONGEST_FIRST); + + std::vector label_list = { + "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"}; + ErnieForTokenClassificationPredictor predictor(option, tokenizer, label_list); + std::vector token_cls_results; + std::vector texts_ds = { + "北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。", + "乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。"}; + std::vector> batch_texts; + BatchFyTexts(texts_ds, FLAGS_batch_size, &batch_texts); + for (int bs = 0; bs < batch_texts.size(); ++bs) { + predictor.Predict(batch_texts[bs], &token_cls_results); + PrintResult(token_cls_results, batch_texts[bs]); + } + return 0; +} \ No newline at end of file diff --git a/model_zoo/ernie-3.0/deploy/python/README.md b/model_zoo/ernie-3.0/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..da1576a1fd65da0633fdaedc11b59708fabf99d5 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/python/README.md @@ -0,0 +1,260 @@ +# FastDeploy ERNIE 3.0 模型 Python 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy Python SDK。 + +本目录下分别提供 `seq_cls_infer.py` 以及 `token_cls_infer.py` 快速完成在 CPU/GPU 的文本分类任务以及序列标注任务的 Python 部署示例。 + +## 依赖安装 + +直接执行以下命令安装部署示例的依赖。 + +```bash + +# 安装fast_tokenizer以及GPU版本fastdeploy +pip install fast-tokenizer-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html + +``` + +## 文本分类任务 + +### 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 3.0 Medium 模型在 CLUE Benchmark 的 [AFQMC 数据集](https://github.com/CLUEbenchmark/CLUE)上进行文本分类任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md)导出得到的部署模型,其模型目录为`model_zoo/ernie-3.0/best_models/afqmc/export`(用户可按实际情况设置)。 + +```bash + +# CPU 推理 +python seq_cls_infer.py --model_dir ../../best_models/afqmc/export --device cpu --backend paddle + +# GPU 推理 
+python seq_cls_infer.py --model_dir ../../best_models/afqmc/export --device gpu --backend paddle + +``` + +运行完成后返回的结果如下: + +```bash + +[INFO] fastdeploy/runtime.cc(596)::Init Runtime initialized with Backend::PDINFER in Device::CPU. +Batch id:0, example id:0, sentence1:花呗收款额度限制, sentence2:收钱码,对花呗支付的金额有限制吗, label:0, similarity:0.5099 +Batch id:1, example id:0, sentence1:花呗支持高铁票支付吗, sentence2:为什么友付宝不支持花呗付款, label:0, similarity:0.9862 + +``` + +### 量化模型部署 + +该示例支持部署 Paddle INT8 新格式量化模型,仅需在`--model_dir`参数传入量化模型路径,并且在对应硬件上选择可用的推理引擎后端,即可完成量化模型部署。在 GPU 上部署量化模型时,可选后端为`paddle_tensorrt`、`tensorrt`;在CPU上部署量化模型时,可选后端为`paddle`、`onnx_runtime`。下面将展示如何使用该示例完成量化模型部署,示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md) 压缩量化后导出得到的量化模型。 + +```bash + +# 在GPU上使用 tensorrt 后端,模型目录可按照实际模型路径设置 +python seq_cls_infer.py --model_dir ../../best_models/afqmc/width_mult_0.75/mse16_1/ --device gpu --backend tensorrt --model_prefix int8 + +# 在CPU上使用paddle_inference后端,模型目录可按照实际模型路径设置 +python seq_cls_infer.py --model_dir ../../best_models/afqmc/width_mult_0.75/mse16_1/ --device cpu --backend paddle --model_prefix int8 + +``` + +运行完成后返回的结果如下: + +```bash +[INFO] fastdeploy/runtime/runtime.cc(101)::Init Runtime initialized with Backend::PDINFER in Device::GPU. +Batch id:0, example id:0, sentence1:花呗收款额度限制, sentence2:收钱码,对花呗支付的金额有限制吗, label:0, similarity:0.5224 +Batch id:1, example id:0, sentence1:花呗支持高铁票支付吗, sentence2:为什么友付宝不支持花呗付款, label:0, similarity:0.9856 +``` + + +### 参数说明 + +`seq_cls_infer.py` 除了以上示例的命令行参数,还支持更多命令行参数的设置。以下为各命令行参数的说明。 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录, | +|--batch_size |输入的batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' | +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | +|--use_fast| 是否使用FastTokenizer加速分词阶段。默认为True| + +## 序列标注任务 + +### 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 3.0 Medium 模型在 CLUE Benchmark 的[ MSRA_NER 数据集](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra)上进行序列标注任务的Python预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md)导出得到的部署模型,其模型目录为`model_zoo/ernie-3.0/best_models/msra_ner/export`(用户可按实际情况设置)。 + + +```bash + +# CPU 推理 +python token_cls_infer.py --model_dir ../../best_models/msra_ner/export/ --device cpu --backend paddle + +# GPU 推理 +python token_cls_infer.py --model_dir ../../best_models/msra_ner/export/ --device gpu --backend paddle + +``` + +运行完成后返回的结果如下: + +```bash + +[INFO] fastdeploy/runtime.cc(500)::Init Runtime initialized with Backend::PDINFER in Device::CPU. 
+input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。 +The model detects all entities: +entity: 北京 label: LOC pos: [0, 1] +entity: 重庆 label: LOC pos: [6, 7] +entity: 成都 label: LOC pos: [12, 13] +----------------------------- +input data: 乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。 +The model detects all entities: +entity: 乔丹 label: PER pos: [0, 1] +entity: 科比 label: PER pos: [3, 4] +entity: 詹姆斯 label: PER pos: [6, 8] +entity: 姚明 label: PER pos: [10, 11] +----------------------------- + +``` + +### 量化模型部署 + +该示例支持部署 Paddle INT8 新格式量化模型,仅需在`--model_dir`参数传入量化模型路径,并且在对应硬件上选择可用的推理引擎后端,即可完成量化模型部署。在 GPU 上部署量化模型时,可选后端为`paddle_tensorrt`、`tensorrt`;在CPU上部署量化模型时,可选后端为`paddle`、`onnx_runtime`。下面将展示如何使用该示例完成量化模型部署,示例中的模型是按照 [ERNIE 3.0 训练文档](../../README.md) 压缩量化后导出得到的量化模型。 + +```bash + +# 在GPU上使用 tensorrt 后端,模型目录可按照实际模型路径设置 +python token_cls_infer.py --model_dir ../../best_models/msra_ner/width_mult_0.75/mse16_1/ --device gpu --backend tensorrt --model_prefix int8 + +# 在CPU上使用paddle_inference后端,模型目录可按照实际模型路径设置 +python token_cls_infer.py --model_dir ../../best_models/msra_ner/width_mult_0.75/mse16_1/ --device cpu --backend paddle --model_prefix int8 + +``` + +运行完成后返回的结果如下: + +```bash + +[INFO] fastdeploy/runtime.cc(500)::Init Runtime initialized with Backend::PDINFER in Device::CPU. +input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。 +The model detects all entities: +entity: 北京 label: LOC pos: [0, 1] +entity: 重庆 label: LOC pos: [6, 7] +entity: 成都 label: LOC pos: [12, 13] +----------------------------- +input data: 乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。 +The model detects all entities: +entity: 乔丹 label: PER pos: [0, 1] +entity: 科比 label: PER pos: [3, 4] +entity: 詹姆斯 label: PER pos: [6, 8] +entity: 姚明 label: PER pos: [10, 11] +----------------------------- +``` + +### 参数说明 + +`token_cls_infer.py` 除了以上示例的命令行参数,还支持更多命令行参数的设置。以下为各命令行参数的说明。 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录, | +|--batch_size |输入的batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' | +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | +|--use_fast| 是否使用FastTokenizer加速分词阶段。默认为True| +|--model_prefix| 模型文件前缀。前缀会分别与'.pdmodel'和'.pdiparams'拼接得到模型文件名和参数文件名。默认为 'model'| + + +## FastDeploy 高阶用法 + +FastDeploy 在 Python 端上,提供 `fastdeploy.RuntimeOption.use_xxx()` 以及 `fastdeploy.RuntimeOption.use_xxx_backend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 ERNIE 3.0 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 ERNIE 3.0 模型。 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| 硬件 | 硬件对应的接口 | 可用的推理引擎 | 推理引擎对应的接口 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
+|------|----------------|----------------|--------------------|--------------------------------|--------------------|
+| CPU | use_cpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| | | ONNX Runtime | use_ort_backend() | ✅ | N/A |
+| | | OpenVINO | use_openvino_backend() | ✅ | N/A |
+| GPU | use_gpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| | | ONNX Runtime | use_ort_backend() | ✅ | ✅ |
+| | | Paddle TensorRT | use_trt_backend() + enable_paddle_to_trt() | ✅ | ✅ |
+| | | TensorRT | use_trt_backend() | ✅ | ✅ |
+| 昆仑芯 XPU | use_kunlunxin() | Paddle Lite | use_paddle_lite_backend() | ✅ | N/A |
+| 华为 昇腾 | use_ascend() | Paddle Lite | use_paddle_lite_backend() | ❔ | ❔ |
+| Graphcore IPU | use_ipu() | Paddle Inference | use_paddle_infer_backend() | N/A | ❔ |
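+
+例如,结合上表,在 GPU 上使用 Paddle TensorRT 后端并开启 FP16 模式时,可以按如下方式组合接口。以下片段仅为示意(接口调用与本目录 `seq_cls_infer.py` 保持一致,模型路径与输入形状为假设的示例值):
+
+```python
+import fastdeploy as fd
+
+option = fd.RuntimeOption()
+option.set_model_path("model.pdmodel", "model.pdiparams")  # 示例路径
+option.use_gpu()                          # 硬件:GPU
+option.use_trt_backend()                  # 推理引擎:TensorRT
+option.enable_paddle_to_trt()             # 叠加该接口即为 Paddle TensorRT 后端
+option.enable_paddle_trt_collect_shape()
+# TensorRT 后端需要设置动态 shape,输入名与本文示例模型一致
+option.set_trt_input_shape("input_ids", min_shape=[1, 1],
+                           opt_shape=[1, 128], max_shape=[1, 128])
+option.set_trt_input_shape("token_type_ids", min_shape=[1, 1],
+                           opt_shape=[1, 128], max_shape=[1, 128])
+option.enable_trt_fp16()                  # 开启 FP16 模式
+option.set_trt_cache_file("model.trt")    # 缓存构建好的 TensorRT 引擎,避免重复构建
+runtime = fd.Runtime(option)
+```
+
+若部署的是 Paddle INT8 新格式量化模型,只需将 `set_model_path` 指向量化模型文件(如 `int8.pdmodel` 与 `int8.pdiparams`),其余设置保持不变。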
+ +## 相关文档 + +[ERNIE 3.0模型详细介绍](../../README.md) + +[ERNIE 3.0模型C++部署方法](../cpp/README.md) diff --git a/model_zoo/ernie-3.0/deploy/python/requirements.txt b/model_zoo/ernie-3.0/deploy/python/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..08af4cf97eb1aabc6da8278fceb000b6adc04363 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/python/requirements.txt @@ -0,0 +1 @@ +fast-tokenizer-python \ No newline at end of file diff --git a/model_zoo/ernie-3.0/deploy/python/seq_cls_infer.py b/model_zoo/ernie-3.0/deploy/python/seq_cls_infer.py new file mode 100644 index 0000000000000000000000000000000000000000..a1699a3fb4762cbef58cd21fef552e3d325d900b --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/python/seq_cls_infer.py @@ -0,0 +1,154 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import distutils.util +import os + +import fastdeploy as fd +import numpy as np + +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument( + "--use_fast", + type=distutils.util.strtobool, + default=True, + help="Whether to use fast_tokenizer to accelarate the tokenization.", + ) + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=args.use_fast) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, 
args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + else: + option.use_gpu() + if args.backend == "paddle": + option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.enable_paddle_to_trt() + option.enable_paddle_trt_collect_shape() + trt_file = os.path.join(args.model_dir, "model.trt") + option.set_trt_input_shape( + "input_ids", + min_shape=[1, 1], + opt_shape=[args.batch_size, args.max_length], + max_shape=[args.batch_size, args.max_length], + ) + option.set_trt_input_shape( + "token_type_ids", + min_shape=[1, 1], + opt_shape=[args.batch_size, args.max_length], + max_shape=[args.batch_size, args.max_length], + ) + if args.use_fp16: + option.enable_trt_fp16() + trt_file = trt_file + ".fp16" + option.set_trt_cache_file(trt_file) + return fd.Runtime(option) + + def preprocess(self, text, text_pair): + data = self.tokenizer(text, text_pair, max_length=self.max_length, padding=True, truncation=True) + input_ids_name = self.runtime.get_input_info(0).name + token_type_ids_name = self.runtime.get_input_info(1).name + input_map = { + input_ids_name: np.array(data["input_ids"], dtype="int64"), + token_type_ids_name: np.array(data["token_type_ids"], dtype="int64"), + } + return input_map + + def infer(self, input_map): + results = self.runtime.infer(input_map) + return results + + def postprocess(self, infer_data): + logits = np.array(infer_data[0]) + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1), "confidence": probs.max(axis=-1)} + return out_dict + + def predict(self, texts, texts_pair=None): + input_map = self.preprocess(texts, texts_pair) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + texts_ds = ["花呗收款额度限制", "花呗支持高铁票支付吗"] + texts_pair_ds = ["收钱码,对花呗支付的金额有限制吗", "为什么友付宝不支持花呗付款"] + batch_texts = batchfy_text(texts_ds, args.batch_size) + batch_texts_pair = batchfy_text(texts_pair_ds, args.batch_size) + + for bs, (texts, texts_pair) in enumerate(zip(batch_texts, batch_texts_pair)): + outputs = predictor.predict(texts, texts_pair) + for i, (sentence1, sentence2) in enumerate(zip(texts, texts_pair)): + print( + f"Batch id:{bs}, example id:{i}, sentence1:{sentence1}, sentence2:{sentence2}, label:{outputs['label'][i]}, similarity:{outputs['confidence'][i]:.4f}" + ) diff --git a/model_zoo/ernie-3.0/deploy/python/token_cls_infer.py b/model_zoo/ernie-3.0/deploy/python/token_cls_infer.py new file mode 100644 index 0000000000000000000000000000000000000000..00a9c82333b9cc3cf1738bc56a4f63c183cc0539 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/python/token_cls_infer.py @@ -0,0 +1,187 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import distutils.util +import os + +import fastdeploy as fd +import numpy as np + +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument( + "--use_fast", + type=distutils.util.strtobool, + default=True, + help="Whether to use fast_tokenizer to accelarate the tokenization.", + ) + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class ErnieForTokenClassificationPredictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=args.use_fast) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + self.label_names = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"] + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + else: + option.use_gpu() + if args.backend == "paddle": + option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.enable_paddle_to_trt() + option.enable_paddle_trt_collect_shape() + trt_file = os.path.join(args.model_dir, "infer.trt") + option.set_trt_input_shape( + "input_ids", + min_shape=[1, 1], + opt_shape=[args.batch_size, args.max_length], + max_shape=[args.batch_size, args.max_length], + ) + option.set_trt_input_shape( + "token_type_ids", + min_shape=[1, 1], + opt_shape=[args.batch_size, args.max_length], + max_shape=[args.batch_size, 
args.max_length], + ) + if args.use_fp16: + option.enable_trt_fp16() + trt_file = trt_file + ".fp16" + option.set_trt_cache_file(trt_file) + return fd.Runtime(option) + + def preprocess(self, texts): + is_split_into_words = False + if isinstance(texts[0], list): + is_split_into_words = True + data = self.tokenizer( + texts, max_length=self.max_length, padding=True, truncation=True, is_split_into_words=is_split_into_words + ) + input_ids_name = self.runtime.get_input_info(0).name + token_type_ids_name = self.runtime.get_input_info(1).name + input_map = { + input_ids_name: np.array(data["input_ids"], dtype="int64"), + token_type_ids_name: np.array(data["token_type_ids"], dtype="int64"), + } + return input_map + + def infer(self, input_map): + results = self.runtime.infer(input_map) + return results + + def postprocess(self, infer_data, input_data): + result = np.array(infer_data[0]) + tokens_label = result.argmax(axis=-1).tolist() + value = [] + for batch, token_label in enumerate(tokens_label): + start = -1 + label_name = "" + items = [] + for i, label in enumerate(token_label): + if (self.label_names[label] == "O" or "B-" in self.label_names[label]) and start >= 0: + entity = input_data[batch][start : i - 1] + if isinstance(entity, list): + entity = "".join(entity) + if len(entity) == 0: + break + items.append( + { + "pos": [start, i - 2], + "entity": entity, + "label": label_name, + } + ) + start = -1 + if "B-" in self.label_names[label]: + start = i - 1 + label_name = self.label_names[label][2:] + value.append(items) + + out_dict = {"value": value, "tokens_label": tokens_label} + return out_dict + + def predict(self, texts): + input_map = self.preprocess(texts) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result, texts) + return output + + +def token_cls_print_ret(infer_result, input_data): + rets = infer_result["value"] + for i, ret in enumerate(rets): + print("input data:", input_data[i]) + print("The model detects all entities:") + for iterm in ret: + print("entity:", iterm["entity"], " label:", iterm["label"], " pos:", iterm["pos"]) + print("-----------------------------") + + +if __name__ == "__main__": + args = parse_arguments() + predictor = ErnieForTokenClassificationPredictor(args) + texts = ["北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。", "乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。"] + batch_data = batchfy_text(texts, args.batch_size) + for data in batch_data: + outputs = predictor.predict(data) + token_cls_print_ret(outputs, data) diff --git a/model_zoo/ernie-3.0/deploy/serving/README.md b/model_zoo/ernie-3.0/deploy/serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a8a9de1ad94e51ad95a6ed0f18471d34e71fdd3b --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/README.md @@ -0,0 +1,225 @@ +# FastDeploy ERNIE 3.0 模型 Serving 部署示例 + + +在服务化部署前,需确认 + +- 1. 
服务化镜像的软硬件环境要求和镜像拉取命令请参考 [FastDeploy 服务化部署](https://github.com/PaddlePaddle/FastDeploy/blob/develop/serving/README_CN.md) + +## 准备模型 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE 3.0 模型在 CLUE Benchmark 的 [AFQMC 数据集](https://github.com/CLUEbenchmark/CLUE)上进行文本分类任务以及 [MSRA_NER 数据集](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra)上进行序列标注任务的**服务化部署**。按照[ERNIE 3.0 训练文档](../../README.md)分别训练并导出文本分类模型以及序列标注模型,并将导出的模型移动到 models 目录下相应位置。注意:模型与参数文件必须命名为 **model.pdmodel** 和 **model.pdiparams**。 + +模型移动好之后,文本分类任务的 models 目录结构如下: + +``` +models +├── ernie_seqcls # 分类任务的 pipeline +│   ├── 1 +│   └── config.pbtxt # 通过这个文件组合前后处理和模型推理 +├── ernie_seqcls_model # 分类任务的模型推理 +│   ├── 1 +│   │   ├── model.pdiparams +│   │   └── model.pdmodel +│   └── config.pbtxt +├── ernie_seqcls_postprocess # 分类任务后处理 +│   ├── 1 +│   │   └── model.py +│   └── config.pbtxt +└── ernie_tokenizer # 预处理分词 + ├── 1 + │   └── model.py + └── config.pbtxt +``` + +序列标注任务的 models 目录结构如下: + +``` +models +├── ernie_tokencls # 序列标注任务的 pipeline +│   ├── 1 +│   └── config.pbtxt # 通过这个文件组合前后处理和模型推理 +├── ernie_tokencls_model # 序列标注任务的模型推理 +│   ├── 1 +│   │   ├── model.pdiparams +│   │   └── model.pdmodel +│   └── config.pbtxt +├── ernie_tokencls_postprocess # 序列标注任务后处理 +│   ├── 1 +│   │   └── model.py +│   └── config.pbtxt +└── ernie_tokenizer # 预处理分词 + ├── 1 + │   └── model.py + └── config.pbtxt +``` + +## 拉取并运行镜像 + +``` +# x.y.z为镜像版本号,需参照 serving 文档替换为数字 +# GPU镜像 +docker pull registry.baidubce.com/paddlepaddle/fastdeploy:x.y.z-gpu-cuda11.4-trt8.4-21.10 +# CPU镜像 +docker pull registry.baidubce.com/paddlepaddle/fastdeploy:x.y.z-cpu-only-21.10 + +# GPU 运行 +nvidia-docker run -it --net=host --name fastdeploy_server --shm-size="1g" -v /path/serving/models:/models rregistry.baidubce.com/paddlepaddle/fastdeploy:x.y.z-gpu-cuda11.4-trt8.4-21.10 bash + +# CPU 运行 +docker run -it --net=host --name fastdeploy_server --shm-size="1g" -v /path/serving/models:/models registry.baidubce.com/paddlepaddle/fastdeploy:x.y.z-cpu-only-21.10 bash +``` + +## 部署模型 + +serving 目录包含启动 pipeline 服务的配置和发送预测请求的代码,包括: + +``` +models # 服务化启动需要的模型仓库,包含模型和服务配置文件 +seq_cls_rpc_client.py # AFQMC 分类任务发送 pipeline 预测请求的脚本 +token_cls_rpc_client.py # 序列标注任务发送 pipeline 预测请求的脚本 +``` + +注意:启动服务时,Server 的每个 python 后端进程默认申请 64M 内存,默认启动的 docker 无法启动多个 python 后端节点。有两个解决方案: + +1. 启动容器时设置 shm-size 参数, 比如: docker run -it --net=host --name fastdeploy_server --shm-size="1g" -v /path/serving/models:/models registry.baidubce.com/paddlepaddle/fastdeploy:x.y.z-gpu-cuda11.4-trt8.4-21.10 bash + +2. 启动服务时设置 python 后端的 shm-default-byte-size 参数, 设置 python 后端的默认内存为10M: fastdeployserver --model-repository=/models --backend-config=python,shm-default-byte-size=10485760 + +### 分类任务 + +在容器内执行下面命令启动服务: + +``` +# 默认启动 models 下所有模型 +fastdeployserver --model-repository=/models + +# 可通过参数只启动分类任务 +fastdeployserver --model-repository=/models --model-control-mode=explicit --load-model=ernie_seqcls +``` + +输出打印如下: + +```shell + +I0209 09:15:49.314029 708 model_repository_manager.cc:1183] successfully loaded 'ernie_seqcls_model' version 1 +I0209 09:15:49.314917 708 model_repository_manager.cc:1022] loading: ernie_seqcls:1 +I0209 09:15:49.417014 708 model_repository_manager.cc:1183] successfully loaded 'ernie_seqcls' version 1 +... 
+I0209 09:15:49.417394 708 server.cc:549] ++------------+---------------------------------------------------------------+--------+ +| Backend | Path | Config | ++------------+---------------------------------------------------------------+--------+ +| python | /opt/tritonserver/backends/python/libtriton_python.so | {} | +| fastdeploy | /opt/tritonserver/backends/fastdeploy/libtriton_fastdeploy.so | {} | ++------------+---------------------------------------------------------------+--------+ + +I0209 09:15:49.417552 708 server.cc:592] ++--------------------------+---------+--------+ +| Model | Version | Status | ++--------------------------+---------+--------+ +| ernie_seqcls | 1 | READY | +| ernie_seqcls_model | 1 | READY | +| ernie_seqcls_postprocess | 1 | READY | +| ernie_seqcls_tokenizer | 1 | READY | ++--------------------------+---------+--------+ + +``` + +### 序列标注任务 + +在容器内执行下面命令启动序列标注服务: + +```shell +fastdeployserver --model-repository=/models --model-control-mode=explicit --load-model=ernie_tokencls --backend-config=python,shm-default-byte-size=10485760 +``` + +输出打印如下: + +```shell + +I0209 09:15:49.314029 708 model_repository_manager.cc:1183] successfully loaded 'ernie_tokencls_model' version 1 +I0209 09:15:49.314917 708 model_repository_manager.cc:1022] loading: ernie_tokencls:1 +I0209 09:15:49.417014 708 model_repository_manager.cc:1183] successfully loaded 'ernie_tokencls' version 1 +... +I0209 09:15:49.417394 708 server.cc:549] ++------------+---------------------------------------------------------------+--------+ +| Backend | Path | Config | ++------------+---------------------------------------------------------------+--------+ +| python | /opt/tritonserver/backends/python/libtriton_python.so | {} | +| fastdeploy | /opt/tritonserver/backends/fastdeploy/libtriton_fastdeploy.so | {} | ++------------+---------------------------------------------------------------+--------+ + +I0209 09:15:49.417552 708 server.cc:592] ++----------------------------+---------+--------+ +| Model | Version | Status | ++----------------------------+---------+--------+ +| ernie_tokencls | 1 | READY | +| ernie_tokencls_model | 1 | READY | +| ernie_tokencls_postprocess | 1 | READY | +| ernie_tokencls_tokenizer | 1 | READY | ++----------------------------+---------+--------+ + +``` + +## 客户端请求 + +客户端请求可以在本地执行脚本请求;也可以在容器中执行。 + +本地执行脚本需要先安装依赖: + +```shell + +pip install grpcio +pip install tritonclient[all] + +# 如果bash无法识别括号,可以使用如下指令安装: +pip install tritonclient\[all\] + +``` + +### 分类任务 + +注意执行客户端请求时关闭代理,并根据实际情况修改main函数中的ip地址(启动服务所在的机器) + +```shell +python seq_cls_grpc_client.py +``` + +输出打印如下: + +```shell +{'label': array([0, 0]), 'confidence': array([0.54437345, 0.98503494], dtype=float32)} +acc: 0.7224281742354032 +``` + + +### 序列标注任务 + +注意执行客户端请求时关闭代理,并根据实际情况修改main函数中的ip地址(启动服务所在的机器) + + +```shell +python token_cls_grpc_client.py +``` + +输出打印如下: + +```shell + +input data: 北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。 +The model detects all entities: +entity: 北京 label: LOC pos: [0, 1] +entity: 重庆 label: LOC pos: [6, 7] +entity: 成都 label: LOC pos: [12, 13] +input data: 乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。 +The model detects all entities: +entity: 乔丹 label: PER pos: [0, 1] +entity: 科比 label: PER pos: [3, 4] +entity: 詹姆斯 label: PER pos: [6, 8] +entity: 姚明 label: PER pos: [10, 11] + +``` + +## 配置修改 + +当前分类任务( ernie_seqcls_model/config.pbtxt )默认配置在 CPU上 运行 OpenVINO 引擎; 序列标注任务默认配置在 GPU 上运行 Paddle Inference 引擎。如果要在 CPU/GPU 或其他推理引擎上运行, 
需要修改配置,详情请参考[配置文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/serving/docs/zh_CN/model_configuration.md)。 diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls/1/README.md b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls/1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..05a9750b0b0a2885900f3fe0d2bd4a41b5ede4e8 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls/config.pbtxt @@ -0,0 +1,85 @@ +name: "ernie_seqcls" +platform: "ensemble" +max_batch_size: 64 +input [ + { + name: "TEXT" + data_type: TYPE_STRING + dims: [ 1 ] + }, + { + name: "TEXT_PAIR" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "label" + data_type: TYPE_INT64 + dims: [ 1 ] + }, + { + name: "confidence" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "ernie_seqcls_tokenizer" + model_version: 1 + input_map { + key: "INPUT_0" + value: "TEXT" + } + input_map { + key: "INPUT_1" + value: "TEXT_PAIR" + } + output_map { + key: "OUTPUT_0" + value: "tokenizer_input_ids" + } + output_map { + key: "OUTPUT_1" + value: "tokenizer_token_type_ids" + } + }, + { + model_name: "ernie_seqcls_model" + model_version: 1 + input_map { + key: "input_ids" + value: "tokenizer_input_ids" + } + input_map { + key: "token_type_ids" + value: "tokenizer_token_type_ids" + } + output_map { + # 需要按照实际模型输出进行配置。 + key: "linear_75.tmp_1" + value: "OUTPUT_2" + } + }, + { + model_name: "ernie_seqcls_postprocess" + model_version: 1 + input_map { + key: "POST_INPUT" + value: "OUTPUT_2" + } + output_map { + key: "POST_label" + value: "label" + } + output_map { + key: "POST_confidence" + value: "confidence" + } + } + ] +} + diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/1/README.md b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5eb08196c5088eb7aa903e5e8896e466b22094d7 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/1/README.md @@ -0,0 +1 @@ +本目录存放 ERNIE 3.0 模型 diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..10ea68104f384fe0bd4446ce9b85e93f5cd7399c --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_model/config.pbtxt @@ -0,0 +1,43 @@ +backend: "fastdeploy" +max_batch_size: 64 +input [ + { + name: "input_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "token_type_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] +output [ + { + name: "linear_75.tmp_1" + data_type: TYPE_FP32 + dims: [ 2 ] + } +] + +instance_group [ + { + # 创建1个实例 + count: 1 + # 使用CPU推理(KIND_CPU、KIND_GPU) + kind: KIND_GPU + } +] + +optimization { + execution_accelerators { + cpu_execution_accelerator : [ + { + # use openvino backend + name: "paddle" + parameters { key: "cpu_threads" value: "5" } + } + ] + } +} + diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/1/model.py b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/1/model.py new file mode 100644 index 
0000000000000000000000000000000000000000..d4f72200eb24f3c62b7b9cb4c1f64d7e77151d4e --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/1/model.py @@ -0,0 +1,102 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. 
The length of this list must + be the same as `requests` + """ + responses = [] + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + + max_value = np.max(data, axis=1, keepdims=True) + exp_data = np.exp(data - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + probs = probs.max(axis=-1) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], data.argmax(axis=-1)) + out_tensor2 = pb_utils.Tensor(self.output_names[1], probs) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..e3a5c61ec77f336083e5198767e7b713b1fa9857 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_postprocess/config.pbtxt @@ -0,0 +1,31 @@ +name: "ernie_seqcls_postprocess" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "POST_INPUT" + data_type: TYPE_FP32 + dims: [ 2 ] + } +] + +output [ + { + name: "POST_label" + data_type: TYPE_INT64 + dims: [ 1 ] + }, + { + name: "POST_confidence" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/1/model.py b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..6e0ab63157653a04578179db4fcae9bd841c46d4 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/1/model.py @@ -0,0 +1,110 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + +from paddlenlp.transformers import AutoTokenizer + + +class TritonPythonModel: + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. 
This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) + # You must parse model_config. JSON string is not parsed here + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + for request in requests: + texts = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + texts = texts.as_numpy() + texts = [i[0].decode("utf-8") for i in texts] + + text_pairs = pb_utils.get_input_tensor_by_name(request, self.input_names[1]) + text_pairs = text_pairs.as_numpy() + text_pairs = [i[0].decode("utf-8") for i in text_pairs] + + data = self.tokenizer(texts, text_pair=text_pairs, max_length=128, padding=True, truncation=True) + input_ids = np.array(data["input_ids"], dtype=self.output_dtype[0]) + token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) + out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. 
+ """ + print("Cleaning up...") diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..f72a61c879d976ac6eee55d312862e46d85b3d6a --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_seqcls_tokenizer/config.pbtxt @@ -0,0 +1,36 @@ +name: "ernie_seqcls_tokenizer" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT_0" + data_type: TYPE_STRING + dims: [ 1 ] + }, + { + name: "INPUT_1" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT_0" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "OUTPUT_1" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls/1/README.md b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls/1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..0936b6ef5eb646aad5ea62ba5727084eab67b05d --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls/config.pbtxt @@ -0,0 +1,66 @@ +name: "ernie_tokencls" +platform: "ensemble" +max_batch_size: 64 +input [ + { + name: "INPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +output [ + { + name: "OUTPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "ernie_tokencls_tokenizer" + model_version: 1 + input_map { + key: "INPUT_0" + value: "INPUT" + } + output_map { + key: "OUTPUT_0" + value: "tokenizer_input_ids" + } + output_map { + key: "OUTPUT_1" + value: "tokenizer_token_type_ids" + } + }, + { + model_name: "ernie_tokencls_model" + model_version: 1 + input_map { + key: "input_ids" + value: "tokenizer_input_ids" + } + input_map { + key: "token_type_ids" + value: "tokenizer_token_type_ids" + } + output_map { + # 需要按照实际模型输出进行配置。 + key: "linear_75.tmp_1" + value: "OUTPUT_2" + } + }, + { + model_name: "ernie_tokencls_postprocess" + model_version: 1 + input_map { + key: "POST_INPUT" + value: "OUTPUT_2" + } + output_map { + key: "POST_OUTPUT" + value: "OUTPUT" + } + } + ] +} diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/1/README.md b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b3ce2c1ae200527f83d571590f1c731d51644fdb --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/1/README.md @@ -0,0 +1 @@ +本目录存放ERNIE 3.0模型 diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..e87dc6d28728ae3bdd7417145f4dac0e85cc1826 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_model/config.pbtxt @@ -0,0 +1,41 @@ +backend: "fastdeploy" +max_batch_size: 64 +input [ + { + name: "input_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "token_type_ids" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] +output [ + { + # 需要按照实际模型输出进行配置。 + name: "linear_75.tmp_1" + data_type: 
TYPE_FP32 + dims: [ -1, 7 ] + } +] + +instance_group [ + { + # 创建1个实例 + count: 1 + # 使用GPU推理(KIND_CPU、KIND_GPU) + kind: KIND_GPU + } +] + +optimization { + execution_accelerators { + gpu_execution_accelerator : [ + { + name: "paddle" + } + ] + } +} diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/1/model.py b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..2e1e6720a17ce2573b35087f8e5ded5f493a87d2 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/1/model.py @@ -0,0 +1,121 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + # The label names of NER models trained by different data sets may be different + self.label_names = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"] + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. 
Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + tokens_label = data.argmax(axis=-1).tolist() + value = [] + for _, token_label in enumerate(tokens_label): + start = -1 + label_name = "" + items = [] + for i, label in enumerate(token_label): + if (self.label_names[label] == "O" or "B-" in self.label_names[label]) and start >= 0: + items.append( + { + "pos": [start, i - 2], + "label": label_name, + } + ) + start = -1 + if "B-" in self.label_names[label]: + start = i - 1 + label_name = self.label_names[label][2:] + value.append(items) + out_result = np.array(value, dtype="object") + out_tensor = pb_utils.Tensor(self.output_names[0], out_result) + inference_response = pb_utils.InferenceResponse( + output_tensors=[ + out_tensor, + ] + ) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..7760158277e04ff8769a1940af207c836a75c6f3 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_postprocess/config.pbtxt @@ -0,0 +1,26 @@ +name: "ernie_tokencls_postprocess" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "POST_INPUT" + data_type: TYPE_FP32 + dims: [ -1, 7 ] + } +] + +output [ + { + name: "POST_OUTPUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/1/model.py b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/1/model.py new file mode 100644 index 0000000000000000000000000000000000000000..66c3a947371de3c156daf45449dca8bfeacc96b6 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/1/model.py @@ -0,0 +1,105 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import numpy as np + +# triton_python_backend_utils is available in every Triton Python model. 
You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils + +from paddlenlp.transformers import AutoTokenizer + + +class TritonPythonModel: + """Your Python model must use the same class name. Every Python model + that is created must have "TritonPythonModel" as the class name. + """ + + def initialize(self, args): + """`initialize` is called only once when the model is being loaded. + Implementing `initialize` function is optional. This function allows + the model to initialize any state associated with this model. + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + self.tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True) + # You must parse model_config. JSON string is not parsed here + self.model_config = json.loads(args["model_config"]) + print("model_config:", self.model_config) + + self.input_names = [] + for input_config in self.model_config["input"]: + self.input_names.append(input_config["name"]) + print("input:", self.input_names) + + self.output_names = [] + self.output_dtype = [] + for output_config in self.model_config["output"]: + self.output_names.append(output_config["name"]) + dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + self.output_dtype.append(dtype) + print("output:", self.output_names) + + def execute(self, requests): + """`execute` must be implemented in every Python model. `execute` + function receives a list of pb_utils.InferenceRequest as the only + argument. This function is called when an inference is requested + for this model. Depending on the batching configuration (e.g. Dynamic + Batching) used, `requests` may contain multiple requests. Every + Python model, must create one pb_utils.InferenceResponse for every + pb_utils.InferenceRequest in `requests`. If there is an error, you can + set the error argument when creating a pb_utils.InferenceResponse. + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + responses = [] + for request in requests: + data = pb_utils.get_input_tensor_by_name(request, self.input_names[0]) + data = data.as_numpy() + data = [i[0].decode("utf-8") for i in data] + data = self.tokenizer(data, max_length=128, padding=True, truncation=True) + input_ids = np.array(data["input_ids"], dtype=self.output_dtype[0]) + token_type_ids = np.array(data["token_type_ids"], dtype=self.output_dtype[1]) + + out_tensor1 = pb_utils.Tensor(self.output_names[0], input_ids) + out_tensor2 = pb_utils.Tensor(self.output_names[1], token_type_ids) + inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor1, out_tensor2]) + responses.append(inference_response) + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded. + Implementing `finalize` function is optional. 
This function allows + the model to perform any necessary clean ups before exit. + """ + print("Cleaning up...") diff --git a/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/config.pbtxt b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/config.pbtxt new file mode 100644 index 0000000000000000000000000000000000000000..08c7727bf89fc801bc9fb6fb61729f927b4590f9 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/models/ernie_tokencls_tokenizer/config.pbtxt @@ -0,0 +1,31 @@ +name: "ernie_tokencls_tokenizer" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT_0" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT_0" + data_type: TYPE_INT64 + dims: [ -1 ] + }, + { + name: "OUTPUT_1" + data_type: TYPE_INT64 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/model_zoo/ernie-3.0/deploy/serving/seq_cls_grpc_client.py b/model_zoo/ernie-3.0/deploy/serving/seq_cls_grpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..deb45a7e45523f566d144e6f9fb616674a23554f --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/seq_cls_grpc_client.py @@ -0,0 +1,140 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from typing import Optional + +import numpy as np +from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput + +LOGGER = logging.getLogger("run_inference_on_triton") + + +class SyncGRPCTritonRunner: + DEFAULT_MAX_RESP_WAIT_S = 120 + + def __init__( + self, + server_url: str, + model_name: str, + model_version: str, + *, + verbose=False, + resp_wait_s: Optional[float] = None, + ): + self._server_url = server_url + self._model_name = model_name + self._model_version = model_version + self._verbose = verbose + self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s + + self._client = InferenceServerClient(self._server_url, verbose=self._verbose) + error = self._verify_triton_state(self._client) + if error: + raise RuntimeError(f"Could not communicate to Triton Server: {error}") + + LOGGER.debug( + f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " + f"are up and ready!" 
+ ) + + model_config = self._client.get_model_config(self._model_name, self._model_version) + model_metadata = self._client.get_model_metadata(self._model_name, self._model_version) + LOGGER.info(f"Model config {model_config}") + LOGGER.info(f"Model metadata {model_metadata}") + + self._inputs = {tm.name: tm for tm in model_metadata.inputs} + self._input_names = list(self._inputs) + self._outputs = {tm.name: tm for tm in model_metadata.outputs} + self._output_names = list(self._outputs) + self._outputs_req = [InferRequestedOutput(name) for name in self._outputs] + + def Run(self, inputs): + """ + Args: + inputs: list, Each value corresponds to an input name of self._input_names + Returns: + results: dict, {name : numpy.array} + """ + infer_inputs = [] + for idx, data in enumerate(inputs): + data = np.array([[x.encode("utf-8")] for x in data], dtype=np.object_) + infer_input = InferInput(self._input_names[idx], [len(data), 1], "BYTES") + infer_input.set_data_from_numpy(data) + infer_inputs.append(infer_input) + + results = self._client.infer( + model_name=self._model_name, + model_version=self._model_version, + inputs=infer_inputs, + outputs=self._outputs_req, + client_timeout=self._response_wait_t, + ) + results = {name: results.as_numpy(name) for name in self._output_names} + return results + + def _verify_triton_state(self, triton_client): + if not triton_client.is_server_live(): + return f"Triton server {self._server_url} is not live" + elif not triton_client.is_server_ready(): + return f"Triton server {self._server_url} is not ready" + elif not triton_client.is_model_ready(self._model_name, self._model_version): + return f"Model {self._model_name}:{self._model_version} is not ready" + return None + + +def test_afqmc_dataset(runner): + from paddlenlp.datasets import load_dataset + + dev_ds = load_dataset("clue", "afqmc", splits="dev") + + batches = [] + labels = [] + idx = 0 + batch_size = 32 + while idx < len(dev_ds): + texts = [] + text_pairs = [] + label = [] + for i in range(batch_size): + if idx + i >= len(dev_ds): + break + texts.append(dev_ds[idx + i]["sentence1"]) + text_pairs.append(dev_ds[idx + i]["sentence2"]) + label.append(dev_ds[idx + i]["label"]) + batches.append((texts, text_pairs)) + labels.append(np.array(label)) + idx += batch_size + + accuracy = 0 + for i, data in enumerate(batches): + ret = runner.Run(data) + accuracy += np.sum(labels[i] == ret["label"]) + print("acc:", 1.0 * accuracy / len(dev_ds)) + + +if __name__ == "__main__": + model_name = "ernie_seqcls" + model_version = "1" + url = "localhost:8001" + runner = SyncGRPCTritonRunner(url, model_name, model_version) + + # [([texts], [text_pairs])] + dataset = [(["花呗收款额度限制", "花呗支持高铁票支付吗"], ["收钱码,对花呗支付的金额有限制吗", "为什么友付宝不支持花呗付款"])] + + for batch_text_pair in dataset: + result = runner.Run(batch_text_pair) + print(result) + + test_afqmc_dataset(runner) diff --git a/model_zoo/ernie-3.0/deploy/serving/token_cls_grpc_client.py b/model_zoo/ernie-3.0/deploy/serving/token_cls_grpc_client.py new file mode 100644 index 0000000000000000000000000000000000000000..21b3b6de537f691891f4354f68d7aea259a94832 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/serving/token_cls_grpc_client.py @@ -0,0 +1,124 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import ast +import logging +from typing import Optional + +import numpy as np +from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput + +LOGGER = logging.getLogger("run_inference_on_triton") + + +class SyncGRPCTritonRunner: + DEFAULT_MAX_RESP_WAIT_S = 120 + + def __init__( + self, + server_url: str, + model_name: str, + model_version: str, + *, + verbose=False, + resp_wait_s: Optional[float] = None, + ): + self._server_url = server_url + self._model_name = model_name + self._model_version = model_version + self._verbose = verbose + self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s + + self._client = InferenceServerClient(self._server_url, verbose=self._verbose) + error = self._verify_triton_state(self._client) + if error: + raise RuntimeError(f"Could not communicate to Triton Server: {error}") + + LOGGER.debug( + f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " + f"are up and ready!" + ) + + model_config = self._client.get_model_config(self._model_name, self._model_version) + model_metadata = self._client.get_model_metadata(self._model_name, self._model_version) + LOGGER.info(f"Model config {model_config}") + LOGGER.info(f"Model metadata {model_metadata}") + + self._inputs = {tm.name: tm for tm in model_metadata.inputs} + self._input_names = list(self._inputs) + self._outputs = {tm.name: tm for tm in model_metadata.outputs} + self._output_names = list(self._outputs) + self._outputs_req = [InferRequestedOutput(name) for name in self._outputs] + + def Run(self, inputs): + """ + Args: + inputs: list, Each value corresponds to an input name of self._input_names + Returns: + results: dict, {name : numpy.array} + """ + infer_inputs = [] + for idx, data in enumerate(inputs): + data = np.array([[x.encode("utf-8")] for x in data], dtype=np.object_) + infer_input = InferInput(self._input_names[idx], [len(data), 1], "BYTES") + infer_input.set_data_from_numpy(data) + infer_inputs.append(infer_input) + + results = self._client.infer( + model_name=self._model_name, + model_version=self._model_version, + inputs=infer_inputs, + outputs=self._outputs_req, + client_timeout=self._response_wait_t, + ) + results = {name: results.as_numpy(name) for name in self._output_names} + return results + + def _verify_triton_state(self, triton_client): + if not triton_client.is_server_live(): + return f"Triton server {self._server_url} is not live" + elif not triton_client.is_server_ready(): + return f"Triton server {self._server_url} is not ready" + elif not triton_client.is_model_ready(self._model_name, self._model_version): + return f"Model {self._model_name}:{self._model_version} is not ready" + return None + + +if __name__ == "__main__": + model_name = "ernie_tokencls" + model_version = "1" + url = "localhost:8001" + runner = SyncGRPCTritonRunner(url, model_name, model_version) + dataset = [ + ["北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。", "乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。"], + ] + + for batch_input in dataset: + # input format:[input1, input2 ... 
inputn], n = len(self._input_names) + result = runner.Run([batch_input]) + for i, ret in enumerate(result["OUTPUT"]): + ret = ast.literal_eval(ret.decode("utf-8")) + print("input data:", batch_input[i]) + print("The model detects all entities:") + for iterm in ret: + entity = batch_input[i][iterm["pos"][0] : iterm["pos"][1] + 1] + if len(entity) > 0: + print( + "entity:", + entity, + " label:", + iterm["label"], + " pos:", + iterm["pos"], + ) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/README.md b/model_zoo/ernie-3.0/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b411bc58b9eafba3b2a7be436be6e656f9986346 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/README.md @@ -0,0 +1,58 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 + +## Server服务启动 +### 文本分类任务启动 +#### 启动文本分类 Server 服务 +```bash +paddlenlp server server_seq_cls:app --host 0.0.0.0 --port 8189 +``` + +#### 分类任务发送服务 +```bash +python client_seq_cls.py --dataset afqmc +``` + +### 命名实体识别任务启动 +#### 启动命名实体识别 Server 服务 +```bash +paddlenlp server server_token_cls:app --host 0.0.0.0 --port 8189 +``` + +#### 命名实体识别 Client发送服务 +```bash +python client_token_cls.py +``` + +### 问答任务启动 +#### 启动问答 Server 服务 +```bash +paddlenlp server server_qa:app --host 0.0.0.0 --port 8189 +``` + +#### 问答 Client 发送服务 +```bash +python client_qa.py +``` + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size` 参数 +```python + data = { + 'data': { + 'text': texts, + 'text_pair': text_pairs if len(text_pairs) > 0 else None + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size + } + } +``` diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/client_qa.py b/model_zoo/ernie-3.0/deploy/simple_serving/client_qa.py new file mode 100644 index 0000000000000000000000000000000000000000..d4e72ba2db4bd338fc2c8e6e2ef43deea4e099c2 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/client_qa.py @@ -0,0 +1,44 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json + +import requests + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_len", default=512, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for predicting.") +parser.add_argument("--doc_stride", default=128, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() +# yapf: disable +url = "http://0.0.0.0:8189/models/ernie_qa" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = ["研究证实,细胞减少与肺内病变程度及肺内炎性病变吸收程度密切相关。", ] + data = { + 'data': { + 'context': ['《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武器情报、奥义字或擅长攻击类型等,请至战国无双系列1.由于乡里大辅先生因故去世,不得不寻找其他声优接手。从猛将传 and Z开始。2.战国无双 编年史的原创男女主角亦有专属声优。此模式是任天堂游戏谜之村雨城改编的新增模式。本作中共有20张战场地图(不含村雨城),后来发行的猛将传再新增3张战场地图。但游戏内战役数量繁多,部分地图会有兼用的状况,战役虚实则是以光荣发行的2本「战国无双3 人物真书」内容为主,以下是相关介绍。(注:前方加☆者为猛将传新增关卡及地图。)合并本篇和猛将传的内容,村雨城模式剔除,战国史模式可直接游玩。主打两大模式「战史演武」&「争霸演武」。系列作品外传作品', '《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武器情报、奥义字或擅长攻击类型等,请至战国无双系列1.由于乡里大辅先生因故去世,不得不寻找其他声优接手。从猛将传 and Z开始。2.战国无双 编年史的原创男女主角亦有专属声优。此模式是任天堂游戏谜之村雨城改编的新增模式。本作中共有20张战场地图(不含村雨城),后来发行的猛将传再新增3张战场地图。但游戏内战役数量繁多,部分地图会有兼用的状况,战役虚实则是以光荣发行的2本「战国无双3 人物真书」内容为主,以下是相关介绍。(注:前方加☆者为猛将传新增关卡及地图。)合并本篇和猛将传的内容,村雨城模式剔除,战国史模式可直接游玩。主打两大模式「战史演武」&「争霸演武」。系列作品外传作品'], # noqa: E126 + 'question': ['《战国无双3》是由哪两个公司合作开发的?', '男女主角亦有专属声优这一模式是由谁改编的?'] # noqa: E126 + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size, + 'doc_stride': args.doc_stride, + } + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/client_seq_cls.py b/model_zoo/ernie-3.0/deploy/simple_serving/client_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..e32c950498abd9226fbe7023c636285053ab8e65 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/client_seq_cls.py @@ -0,0 +1,83 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json + +import requests + +from paddlenlp.datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--dataset", required=True, type=str, help="The dataset name for the simple seving") +parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() +# yapf: enable + +url = "http://0.0.0.0:8189/models/ernie_cls" +headers = {"Content-Type": "application/json"} + + +def seq_convert_example(example): + """convert a glue example into necessary features""" + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + return example + + +if __name__ == "__main__": + examples = load_dataset("clue", args.dataset)["dev"][:10] + texts = [] + text_pairs = [] + for example in examples: + example = seq_convert_example(example) + if "sentence" in example: + texts.append(example) + else: + texts.append(example["sentence1"]) + text_pairs.append(example["sentence2"]) + + data = { + "data": {"text": texts, "text_pair": text_pairs if len(text_pairs) > 0 else None}, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/client_token_cls.py b/model_zoo/ernie-3.0/deploy/simple_serving/client_token_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..3a07509fd5e7f33dca5c9dca3fa214b338c5eabd --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/client_token_cls.py @@ -0,0 +1,41 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import json + +import requests + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() +# yapf: disable +url = "http://0.0.0.0:8189/models/ernie_ner" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = ["北京的涮肉,重庆的火锅,成都的小吃都是极具特色的美食。", "乔丹、科比、詹姆斯和姚明都是篮球界的标志性人物。"] + data = { + 'data': { + 'text': texts + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size, + } + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/server_qa.py b/model_zoo/ernie-3.0/deploy/simple_serving/server_qa.py new file mode 100644 index 0000000000000000000000000000000000000000..a02775acb6caa7222ba28a2ef0f055a8942cc208 --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/server_qa.py @@ -0,0 +1,54 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + +from paddlenlp import SimpleServer +from paddlenlp.server import BasePostHandler, QAModelHandler + + +class QAPostHandler(BasePostHandler): + def __init__(self): + super().__init__() + + @classmethod + def process(cls, data, parameters): + start_logits = data["logits"] + end_logits = data["logits_1"] + contexts = data["data"]["context"] + questions = data["data"]["question"] + offset_mappings = data["data"]["offset_mapping"] + answers = [] + count = 0 + for start_logit, end_logit, offset_mapping in zip(start_logits, end_logits, offset_mappings): + start_position = np.argmax(np.array(start_logit)) + end_position = np.argmax(np.array(end_logit)) + start_id = offset_mapping[start_position][0] + end_id = offset_mapping[end_position][1] + answer = [] + if end_position > start_position: + answer = contexts[count][start_id:end_id] + answers.append(answer) + count += 1 + + return {"context": contexts, "question": questions, "answer": answers} + + +app = SimpleServer() +app.register( + "models/ernie_qa", + model_path="../../best_models/cmrc2018/export/", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=QAModelHandler, + post_handler=QAPostHandler, +) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/server_seq_cls.py b/model_zoo/ernie-3.0/deploy/simple_serving/server_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..837ad1793247416a3fb337742d4de84ad804cccf --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/server_seq_cls.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer +from paddlenlp.server import CustomModelHandler, MultiClassificationPostHandler + +app = SimpleServer() +app.register( + "models/ernie_cls", + model_path="../../best_models/afqmc/export/", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=CustomModelHandler, + post_handler=MultiClassificationPostHandler, +) diff --git a/model_zoo/ernie-3.0/deploy/simple_serving/server_token_cls.py b/model_zoo/ernie-3.0/deploy/simple_serving/server_token_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..57aff957a078f2410b0fdcd20e2acbc2a7c927be --- /dev/null +++ b/model_zoo/ernie-3.0/deploy/simple_serving/server_token_cls.py @@ -0,0 +1,73 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + +from paddlenlp import SimpleServer +from paddlenlp.server import BasePostHandler, TokenClsModelHandler + + +class NERPostHandler(BasePostHandler): + def __init__(self): + super().__init__() + + @classmethod + def process(cls, data, parameters): + label_list = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"] + input_datas = data["data"]["text"] + predictions = np.array(data["logits"]) + tokens_label = predictions.argmax(axis=-1) + tokens_label = tokens_label.tolist() + value = [] + for batch, token_label in enumerate(tokens_label): + start = -1 + label_name = "" + items = [] + input_data = input_datas[batch] + for i, label in enumerate(token_label): + if (label_list[label] == "O" or "B-" in label_list[label]) and start >= 0: + entity = input_data[start : i - 1] + if isinstance(entity, list): + entity = "".join(entity) + items.append( + { + "pos": [start, i - 2], + "entity": entity, + "label": label_name, + } + ) + start = -1 + if "B-" in label_list[label]: + start = i - 1 + label_name = label_list[label][2:] + if start >= 0: + items.append( + { + "pos": [start, len(token_label) - 1], + "entity": input_data[start : len(token_label) - 1], + "label": "", + } + ) + value.append(items) + out_dict = {"value": value, "tokens_label": tokens_label} + return out_dict + + +app = SimpleServer() +app.register( + "models/ernie_ner", + model_path="../../best_models/msra_ner/export/", + tokenizer_name="ernie-3.0-medium-zh", + model_handler=TokenClsModelHandler, + post_handler=NERPostHandler, +) diff --git a/model_zoo/ernie-3.0/infer.py b/model_zoo/ernie-3.0/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..0941a2db9b8a0d9204d63d25467b579b596e3b3c --- /dev/null +++ b/model_zoo/ernie-3.0/infer.py @@ -0,0 +1,521 @@ +# Copyright (c) 2022 
PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import time +from functools import partial +from multiprocessing import cpu_count + +import numpy as np +import onnxruntime as ort +import paddle +from datasets import load_dataset +from paddle import inference +from paddle.metric import Accuracy + +from paddlenlp.data import DataCollatorForTokenClassification, DataCollatorWithPadding +from paddlenlp.datasets import load_dataset as ppnlp_load_dataset +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.trainer.argparser import strtobool +from paddlenlp.transformers import AutoTokenizer + +METRIC_CLASSES = { + "afqmc": Accuracy, + "tnews": Accuracy, + "iflytek": Accuracy, + "ocnli": Accuracy, + "cmnli": Accuracy, + "cluewsc2020": Accuracy, + "csl": Accuracy, +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default="tnews", + type=str, + help="The name of the task to perform predict, selected in the list: " + ", ".join(METRIC_CLASSES.keys()), + ) + parser.add_argument( + "--model_name_or_path", default="ernie-3.0-medium-zh", type=str, help="The directory or name of model." + ) + parser.add_argument("--model_path", type=str, required=True, help="The path prefix of inference model to be used.") + parser.add_argument( + "--device", default="gpu", choices=["gpu", "cpu", "xpu", "npu"], help="Device selected for inference." + ) + parser.add_argument("--batch_size", default=32, type=int, help="Batch size for predict.") + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", + ) + parser.add_argument("--perf_warmup_steps", default=20, type=int, help="Warmup steps for performance test.") + parser.add_argument( + "--n_best_size", + default=20, + type=int, + help="The total number of n-best predictions to generate in the nbest_predictions.json output file.", + ) + parser.add_argument( + "--max_answer_length", default=50, type=int, help="Max answer length for question answering task." + ) + parser.add_argument("--shape_file", default="shape_info.txt", type=str, help="Shape info filename.") + parser.add_argument("--use_trt", action="store_true", help="Whether to use inference engin TensorRT.") + parser.add_argument("--use_lite", action="store_true", help="Whether to use inference engin PaddleLite.") + parser.add_argument("--perf", action="store_true", help="Whether to test performance.") + parser.add_argument("--collect_shape", action="store_true", help="Whether to collect shape info.") + + parser.add_argument( + "--precision", default="fp32", choices=["fp32", "fp16", "int8"], help="Precision for inference." 
+ ) + parser.add_argument( + "--num_threads", + default=cpu_count(), + type=int, + help="num_threads for cpu.", + ) + parser.add_argument( + "--enable_quantize", + action="store_true", + help="Whether to enable quantization for acceleration. Valid for both onnx and dnnl", + ) + parser.add_argument( + "--enable_bf16", + action="store_true", + help="Whether to use the bfloat16 datatype", + ) + parser.add_argument("--use_onnxruntime", type=strtobool, default=False, help="Use onnxruntime to infer or not.") + parser.add_argument( + "--debug", action="store_true", help="With debug it will save graph and model after each pass." + ) + parser.add_argument( + "--provider", + default="CPUExecutionProvider", + choices=["CPUExecutionProvider", "DnnlExecutionProvider"], + type=str, + help="Onnx ExecutionProvider with DNNL or without DNNL", + ) + parser.add_argument( + "--lazy_data_processing", + default=True, + type=bool, + help="Whether use lazy data processing", + ) + + args = parser.parse_args() + return args + + +def convert_example(example, tokenizer, label_list, is_test=False, max_seq_length=512): + """convert a glue example into necessary features""" + if not is_test: + # `label_list == None` is for regression task + # Get the label + label = np.array(example["label"], dtype="int64") + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + if "sentence" in example: + example = tokenizer(example["sentence"], max_seq_len=max_seq_length) + elif "sentence1" in example: + example = tokenizer(example["sentence1"], text_pair=example["sentence2"], max_seq_len=max_seq_length) + + if not is_test: + example["labels"] = label + return example + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + if args.use_onnxruntime: + assert args.device != "xpu", "Running ONNXRuntime on XPU is temporarily not supported." 
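+            # 说明:使用 ONNX Runtime 推理时,若 --model_path 不是 .onnx 文件,
+            # 会先用 paddle2onnx 将 Paddle 静态图模型转换为 ONNX;若指定 --enable_quantize,
+            # 还会对 ONNX 模型做动态量化,最后按 --provider 指定的 ExecutionProvider 创建推理会话。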
+ if args.model_path.count(".onnx"): + onnx_model = args.model_path + else: + import paddle2onnx + + onnx_model = paddle2onnx.command.c_paddle_to_onnx( + model_file=args.model_path + ".pdmodel", + params_file=args.model_path + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + dynamic_quantize_model = onnx_model + if args.enable_quantize: + from onnxruntime.quantization import quantize_dynamic + + float_onnx_file = "model.onnx" + with open(float_onnx_file, "wb") as f: + f.write(onnx_model) + dynamic_quantize_model = "dynamic_quantize_model.onnx" + quantize_dynamic(float_onnx_file, dynamic_quantize_model) + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = args.num_threads + sess_options.inter_op_num_threads = args.num_threads + executionprovider = args.provider + print("ExecutionProvider is: ", executionprovider) + predictor = ort.InferenceSession( + dynamic_quantize_model, sess_options=sess_options, providers=[executionprovider] + ) + input_name1 = predictor.get_inputs()[1].name + input_name2 = predictor.get_inputs()[0].name + input_handles = [input_name1, input_name2] + return cls(predictor, input_handles, []) + + config = paddle.inference.Config(args.model_path + ".pdmodel", args.model_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + cls.device = paddle.set_device("gpu") + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + config.switch_ir_optim(True) + config.enable_mkldnn() + if args.enable_bf16: + config.enable_mkldnn_bfloat16() + if args.enable_quantize: + config.enable_mkldnn_int8() + if args.debug: + config.switch_ir_debug(True) + config.set_cpu_math_library_num_threads(args.num_threads) + cls.device = paddle.set_device("cpu") + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + elif args.device == "npu": + if args.use_lite: + config.enable_lite_engine(paddle.inference.PrecisionType(0), True) + config.nnadapter().enable().set_device_names(["huawei_ascend_npu"]) + else: + config.enable_custom_device("npu") + if args.use_trt: + precision_map = { + "int8": inference.PrecisionType.Int8, + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + } + config.enable_tensorrt_engine( + workspace_size=1 << 30, + precision_mode=precision_map[args.precision], + max_batch_size=args.batch_size, + min_subgraph_size=5, + use_static=False, + use_calib_mode=False, + ) + print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled())) + + if args.collect_shape: + config.collect_shape_range_info(args.task_name + args.shape_file) + else: + config.enable_tuned_tensorrt_dynamic_shape(args.task_name + args.shape_file, True) + + config.delete_pass("embedding_eltwise_layernorm_fuse_pass") + predictor = paddle.inference.create_predictor(config) + + input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] + output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] + + return cls(predictor, input_handles, output_handles) + + def set_dynamic_shape(self, max_seq_length, batch_size): + # The dynamic shape info required by TRT is automatically generated according to max_seq_length and batch_size and stored in shape_info.txt + min_batch_size, max_batch_size, opt_batch_size = 1, batch_size, batch_size + min_seq_len, max_seq_len, opt_seq_len = 2, max_seq_length, 32 + batches = [ + [ 
+ np.zeros([min_batch_size, min_seq_len], dtype="int64"), + np.zeros([min_batch_size, min_seq_len], dtype="int64"), + ], + [ + np.zeros([max_batch_size, max_seq_len], dtype="int64"), + np.zeros([max_batch_size, max_seq_len], dtype="int64"), + ], + [ + np.zeros([opt_batch_size, opt_seq_len], dtype="int64"), + np.zeros([opt_batch_size, opt_seq_len], dtype="int64"), + ], + ] + for batch in batches: + self.predict_batch(batch) + print("Set dynamic shape finished, please close set_dynamic_shape and restart.") + exit(0) + + def predict_batch(self, data): + if len(self.output_handles) == 0: + input_dict = {} + for input_field, input_handle in zip(data, self.input_handles): + input_dict[input_handle] = input_field + result = self.predictor.run(None, input_dict) + return result + + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field) + self.predictor.run() + output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return output + + def predict(self, dataset, tokenizer, batchify_fn, args, dev_example=None, dev_ds_ori=None): + if args.collect_shape: + self.set_dynamic_shape(args.max_seq_length, args.batch_size) + if args.task_name == "cmrc2018": + dataset_removed = dataset.remove_columns(["offset_mapping", "attention_mask", "example_id"]) + sample_num = len(dataset) + batches = [] + for i in range(0, sample_num, args.batch_size): + batch_size = min(args.batch_size, sample_num - i) + batch = [dataset_removed[i + j] for j in range(batch_size)] + batches.append(batch) + else: + sample_num = len(dataset) + batches = [] + for i in range(0, sample_num, args.batch_size): + batch_size = min(args.batch_size, sample_num - i) + batch = [dataset[i + j] for j in range(batch_size)] + batches.append(batch) + if args.perf: + for i, batch in enumerate(batches): + batch = batchify_fn(batch) + input_ids, segment_ids = batch["input_ids"].numpy(), batch["token_type_ids"].numpy() + output = self.predict_batch([input_ids, segment_ids]) + if i > args.perf_warmup_steps: + break + time1 = time.time() + nums = 0 + for batch in batches: + batch = batchify_fn(batch) + input_ids, segment_ids = batch["input_ids"].numpy(), batch["token_type_ids"].numpy() + nums = nums + input_ids.shape[0] + output = self.predict_batch([input_ids, segment_ids]) + total_time = time.time() - time1 + print( + "task name: %s, sample nums: %s, time: %s, QPS: %s " + % (args.task_name, nums, total_time, nums / total_time) + ) + + else: + if args.task_name == "msra_ner": + metric = ChunkEvaluator(label_list=args.label_list) + metric.reset() + all_predictions = [] + for batch in batches: + batch = batchify_fn(batch) + input_ids, segment_ids = batch["input_ids"].numpy(), batch["token_type_ids"].numpy() + output = self.predict_batch([input_ids, segment_ids])[0] + preds = np.argmax(output, axis=2) + all_predictions.append(preds.tolist()) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute( + batch["seq_len"], paddle.to_tensor(preds), batch["labels"] + ) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + res = metric.accumulate() + print("task name: %s, (precision, recall, f1): %s, " % (args.task_name, res)) + elif args.task_name == "cmrc2018": + all_start_logits = [] + all_end_logits = [] + for batch in batches: + batch = batchify_fn(batch) + input_ids, segment_ids = batch["input_ids"].numpy(), batch["token_type_ids"].numpy() + start_logits, end_logits = self.predict_batch([input_ids, segment_ids]) + for idx in 
range(start_logits.shape[0]): + if len(all_start_logits) % 1000 == 0 and len(all_start_logits): + print("Processing example: %d" % len(all_start_logits)) + all_start_logits.append(start_logits[idx]) + all_end_logits.append(end_logits[idx]) + all_predictions, _, _ = compute_prediction( + dev_example, + dataset, + (all_start_logits, all_end_logits), + False, + args.n_best_size, + args.max_answer_length, + ) + res = squad_evaluate( + examples=[raw_data for raw_data in dev_example], preds=all_predictions, is_whitespace_splited=False + ) + print("task name: %s, EM: %s, F1: %s" % (args.task_name, res["exact"], res["f1"])) + return all_predictions + else: + all_predictions = [] + metric = METRIC_CLASSES[args.task_name]() + metric.reset() + for i, batch in enumerate(batches): + batch = batchify_fn(batch) + output = self.predict_batch([batch["input_ids"].numpy(), batch["token_type_ids"].numpy()])[0] + preds = np.argmax(output, axis=1) + all_predictions.append(preds.tolist()) + correct = metric.compute(paddle.to_tensor(output), batch["labels"]) + metric.update(correct) + res = metric.accumulate() + + print("task name: %s, acc: %s, " % (args.task_name, res)) + return all_predictions + + +def tokenize_and_align_labels(example, tokenizer, no_entity_id, max_seq_len=512): + if example["tokens"] == []: + tokenized_input = { + "labels": [], + "input_ids": [], + "token_type_ids": [], + "seq_len": 0, + "length": 0, + } + return tokenized_input + tokenized_input = tokenizer( + example["tokens"], + max_seq_len=max_seq_len, + # We use this argument because the texts in our dataset are lists of words (with a label for each word). + is_split_into_words=True, + return_length=True, + ) + label_ids = example["ner_tags"] + if len(tokenized_input["input_ids"]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_input["input_ids"]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + + label_ids += [no_entity_id] * (len(tokenized_input["input_ids"]) - len(label_ids)) + tokenized_input["labels"] = label_ids + return tokenized_input + + +def prepare_validation_features(examples, tokenizer, doc_stride, max_seq_length): + contexts = examples["context"] + questions = examples["question"] + + tokenized_examples = tokenizer( + questions, contexts, stride=doc_stride, max_seq_len=max_seq_length, return_attention_mask=True + ) + + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. 
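+        # Offsets that do not belong to the context (question tokens and the final
+        # special token) are set to None below so answer post-processing can skip them.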
+ sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index and k != len(sequence_ids) - 1 else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + +def main(): + paddle.seed(42) + args = parse_args() + + args.task_name = args.task_name.lower() + + predictor = Predictor.create_predictor(args) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + + if args.task_name == "msra_ner": + + def ner_trans_fn(example, tokenizer, max_seq_length=128, no_entity_id=0): + return tokenize_and_align_labels( + example, tokenizer=tokenizer, no_entity_id=no_entity_id, max_seq_len=max_seq_length + ) + + trans_fn = partial(ner_trans_fn, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + dev_ds = load_dataset("msra_ner", split="test") + label_list = dev_ds.features["ner_tags"].feature.names + args.label_list = label_list + + column_names = dev_ds.column_names + dev_ds = dev_ds.map(trans_fn, remove_columns=column_names) + batchify_fn = DataCollatorForTokenClassification(tokenizer) + predictor.predict(dev_ds, tokenizer, batchify_fn, args) + elif args.task_name == "cmrc2018": + dev_example = load_dataset("cmrc2018", split="validation") + column_names = dev_example.column_names + dev_ds = dev_example.map( + partial( + prepare_validation_features, tokenizer=tokenizer, doc_stride=128, max_seq_length=args.max_seq_length + ), + batched=True, + num_proc=4, + remove_columns=column_names, + load_from_cache_file=True, + desc="Running tokenizer on validation dataset", + ) + + batchify_fn = DataCollatorWithPadding(tokenizer) + predictor.predict(dev_ds, tokenizer, batchify_fn, args, dev_example) + else: + dev_ds = ppnlp_load_dataset("clue", args.task_name, splits="dev") + + trans_func = partial( + convert_example, + label_list=dev_ds.label_list, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + is_test=False, + ) + dev_ds = dev_ds.map(trans_func, lazy=args.lazy_data_processing) + if args.device == "npu": + # NOTE: Avoid CANN recompile operators for different shape inputs, which will result in very slow training. + batchify_fn = DataCollatorWithPadding(tokenizer, padding="max_length", max_length=args.max_seq_length) + else: + batchify_fn = DataCollatorWithPadding(tokenizer) + + predictor.predict(dev_ds, tokenizer, batchify_fn, args) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/run_qa.py b/model_zoo/ernie-3.0/run_qa.py new file mode 100644 index 0000000000000000000000000000000000000000..6c3b03d70c493a550bed99d081d986df0eef4dea --- /dev/null +++ b/model_zoo/ernie-3.0/run_qa.py @@ -0,0 +1,224 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
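+
+# Fine-tunes ERNIE for extractive question answering with the PaddleNLP
+# QuestionAnsweringTrainer; supports training, evaluation, prediction and
+# inference-model export.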
+ +import json +import os +from functools import partial + +import paddle +from datasets import load_dataset +from utils import ( + CrossEntropyLossForSQuAD, + DataArguments, + ModelArguments, + QuestionAnsweringTrainer, + load_config, + prepare_train_features, + prepare_validation_features, +) + +import paddlenlp +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.metrics.squad import compute_prediction, squad_evaluate +from paddlenlp.trainer import ( + EvalPrediction, + PdArgumentParser, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ErnieForQuestionAnswering, ErnieTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Load model and data config + model_args, data_args, training_args = load_config( + model_args.config, "QuestionAnswering", data_args.dataset, model_args, data_args, training_args + ) + # Print model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + data_args.dataset = data_args.dataset.strip() + training_args.output_dir = os.path.join(training_args.output_dir, data_args.dataset) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + raw_datasets = load_dataset("clue", data_args.dataset) + label_list = getattr(raw_datasets["train"], "label_list", None) + data_args.label_list = label_list + + # Define tokenizer, model, loss function. + tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + model = ErnieForQuestionAnswering.from_pretrained(model_args.model_name_or_path) + + loss_fct = CrossEntropyLossForSQuAD() + + # Preprocessing the datasets. + # Preprocessing is slighlty different for training and evaluation. 
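+    # Training maps answer character spans to token start/end positions, while
+    # evaluation keeps offset mappings and example ids so that predictions can be
+    # mapped back to spans of the original context.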
+ if training_args.do_train: + column_names = raw_datasets["train"].column_names + elif training_args.do_eval: + column_names = raw_datasets["validation"].column_names + else: + column_names = raw_datasets["validation"].column_names + + if training_args.do_train: + train_dataset = raw_datasets["train"] + # Create train feature from dataset + with training_args.main_process_first(desc="train dataset map pre-processing"): + # Dataset pre-process + train_dataset = train_dataset.map( + partial(prepare_train_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + batch_size=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + + if training_args.do_eval: + eval_examples = raw_datasets["validation"] + with training_args.main_process_first(desc="evaluate dataset map pre-processing"): + eval_dataset = eval_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + batch_size=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + if training_args.do_predict: + predict_examples = raw_datasets["validation"] + contexts = predict_examples["context"] + questions = predict_examples["question"] + with training_args.main_process_first(desc="test dataset map pre-processing"): + predict_dataset = predict_examples.map( + partial(prepare_validation_features, tokenizer=tokenizer, args=data_args), + batched=True, + num_proc=4, + batch_size=4, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on prediction dataset", + ) + + # Define data collector + data_collator = DataCollatorWithPadding(tokenizer) + + # Post-processing: + def post_processing_function(examples, features, predictions, stage="eval"): + # Post-processing: we match the start logits and end logits to answers in the original context. 
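+        # compute_prediction picks the n-best candidate spans and applies the
+        # null-answer threshold; the gold answers are kept as references for squad_evaluate.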
+ predictions, all_nbest_json, scores_diff_json = compute_prediction( + examples=examples, + features=features, + predictions=predictions, + n_best_size=data_args.n_best_size, + max_answer_length=data_args.max_answer_length, + null_score_diff_threshold=data_args.null_score_diff_threshold, + ) + + references = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples] + return EvalPrediction(predictions=predictions, label_ids=references) + + def compute_metrics(p: EvalPrediction): + ret = squad_evaluate(examples=p.label_ids, preds=p.predictions, is_whitespace_splited=False) + return dict(ret) + # return metric.compute(predictions=p.predictions, references=p.label_ids) + + trainer = QuestionAnsweringTrainer( + model=model, + criterion=loss_fct, + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + eval_examples=eval_examples if training_args.do_eval else None, + data_collator=data_collator, + post_process_function=post_processing_function, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + if training_args.do_train: + # Training + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(predict_dataset, predict_examples) + trainer.log_metrics("predict", test_ret.metrics) + + out_dict = {"answer": test_ret.predictions, "context": contexts, "question": questions} + out_file = open(os.path.join(training_args.output_dir, "test_results.json"), "w", encoding="utf8") + json.dump(out_dict, out_file, ensure_ascii=True) + + # Export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + + model_args.export_model_dir = os.path.join(model_args.export_model_dir, data_args.dataset, "export") + + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + trainer.tokenizer.save_pretrained(model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/run_seq_cls.py b/model_zoo/ernie-3.0/run_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..ada03238c9b457b317671127fb4754cfdd622b95 --- /dev/null +++ b/model_zoo/ernie-3.0/run_seq_cls.py @@ -0,0 +1,179 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import annotations + +import json +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from paddle.metric import Accuracy +from utils import DataArguments, ModelArguments, load_config, seq_convert_example + +import paddlenlp +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Log model and data config + model_args, data_args, training_args = load_config( + model_args.config, "SequenceClassification", data_args.dataset, model_args, data_args, training_args + ) + # Print model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + data_args.dataset = data_args.dataset.strip() + training_args.output_dir = os.path.join(training_args.output_dir, data_args.dataset) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + raw_datasets = load_dataset("clue", data_args.dataset) + data_args.label_list = getattr(raw_datasets["train"], "label_list", None) + num_classes = len(raw_datasets["train"].label_list) + + # Define tokenizer, model, loss function. 
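+    # Cross entropy is used for tasks with a discrete label list; MSE covers the
+    # regression case where no label list is available.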
+ model = ErnieForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + criterion = nn.loss.CrossEntropyLoss() if data_args.label_list else nn.loss.MSELoss() + + # Define dataset pre-process function + trans_fn = partial( + seq_convert_example, + tokenizer=tokenizer, + label_list=data_args.label_list, + max_seq_len=data_args.max_seq_length, + dynamic_max_length=data_args.dynamic_max_length, + ) + + # Define data collator + data_collator = DataCollatorWithPadding(tokenizer) + + # Dataset pre-process + logger.info("Data Preprocessing...") + if training_args.do_train: + train_dataset = raw_datasets["train"].map(trans_fn, lazy=training_args.lazy_data_processing) + if training_args.do_eval: + eval_dataset = raw_datasets["dev"].map(trans_fn, lazy=training_args.lazy_data_processing) + if training_args.do_predict: + test_dataset = raw_datasets["test"].map(trans_fn, lazy=training_args.lazy_data_processing) + + # Define the metrics of tasks. + def compute_metrics(p): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric = Accuracy() + metric.reset() + result = metric.compute(preds, label) + metric.update(result) + accu = metric.accumulate() + metric.reset() + return {"accuracy": accu} + + trainer = Trainer( + model=model, + criterion=criterion, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + logits = test_ret.predictions + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1).tolist(), "confidence": probs.max(axis=-1).tolist()} + out_file = open(os.path.join(training_args.output_dir, "test_results.json"), "w") + json.dump(out_dict, out_file) + + # Export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + model_args.export_model_dir = os.path.join(model_args.export_model_dir, data_args.dataset, "export") + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + trainer.tokenizer.save_pretrained(model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git 
a/model_zoo/ernie-3.0/run_token_cls.py b/model_zoo/ernie-3.0/run_token_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..f805ad307c6f796dfd371e64c27fbeb5434fd128 --- /dev/null +++ b/model_zoo/ernie-3.0/run_token_cls.py @@ -0,0 +1,230 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from datasets import load_metric +from utils import DataArguments, ModelArguments, load_config, token_convert_example + +import paddlenlp +from paddlenlp.data import DataCollatorForTokenClassification +from paddlenlp.datasets import load_dataset +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + # Log model and data config + model_args, data_args, training_args = load_config( + model_args.config, "TokenClassification", data_args.dataset, model_args, data_args, training_args + ) + + # Print model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + data_args.dataset = data_args.dataset.strip() + training_args.output_dir = os.path.join(training_args.output_dir, data_args.dataset) + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + raw_datasets = load_dataset(data_args.dataset) + label_list = raw_datasets["train"].label_list + data_args.label_list = label_list + data_args.ignore_label = -100 + data_args.no_entity_id = 0 + + num_classes = len(label_list) + + # Define tokenizer, model, loss function. 
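+    # Token positions padded with ignore_label (-100) are excluded from the loss
+    # through the ignore_index of CrossEntropyLoss.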
+ tokenizer = ErnieTokenizer.from_pretrained(model_args.model_name_or_path) + model = ErnieForTokenClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_classes) + + class criterion(nn.Layer): + def __init__(self): + super(criterion, self).__init__() + self.loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=data_args.ignore_label) + + def forward(self, *args, **kwargs): + return paddle.mean(self.loss_fn(*args, **kwargs)) + + loss_fct = criterion() + + # Define dataset pre-process function + trans_fn = partial( + token_convert_example, + tokenizer=tokenizer, + no_entity_id=data_args.no_entity_id, + max_seq_length=data_args.max_seq_length, + dynamic_max_length=data_args.dynamic_max_length, + ) + # Define data collector + data_collator = DataCollatorForTokenClassification(tokenizer, label_pad_token_id=data_args.ignore_label) + + # Dataset pre-process + logger.info("Data Preprocessing...") + if training_args.do_train: + train_dataset = raw_datasets["train"].map(trans_fn, lazy=training_args.lazy_data_processing) + if training_args.do_eval: + # The msra_ner dataset do not have the dev dataset, use the test dataset for the evaluation + eval_dataset = raw_datasets["test"].map(trans_fn, lazy=training_args.lazy_data_processing) + if training_args.do_predict: + test_dataset = raw_datasets["test"].map(trans_fn, lazy=training_args.lazy_data_processing) + + # Define the metrics of tasks. + # Metrics + metric = load_metric("seqeval") + + def compute_metrics(p): + predictions, labels = p + predictions = np.argmax(predictions, axis=2) + + # Remove ignored index (special tokens) + true_predictions = [ + [label_list[p] for (p, l) in zip(prediction, label) if l != -100] + for prediction, label in zip(predictions, labels) + ] + true_labels = [ + [label_list[l] for (p, l) in zip(prediction, label) if l != -100] + for prediction, label in zip(predictions, labels) + ] + results = metric.compute(predictions=true_predictions, references=true_labels) + return { + "precision": results["overall_precision"], + "recall": results["overall_recall"], + "f1": results["overall_f1"], + "accuracy": results["overall_accuracy"], + } + + trainer = Trainer( + model=model, + criterion=loss_fct, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + trainer.log_metrics("eval", eval_metrics) + + if training_args.do_predict: + test_ret = trainer.predict(test_dataset) + trainer.log_metrics("test", test_ret.metrics) + tokens_label = test_ret.predictions.argmax(axis=-1) + tokens_label = tokens_label.tolist() + value = [] + for batch, token_label in enumerate(tokens_label): + start = -1 + label_name = "" + items = [] + input_data = tokenizer.convert_ids_to_tokens(test_dataset[batch]["input_ids"])[1:-1] + for i, label in enumerate(token_label): + if 
(data_args.label_list[label] == "O" or "B-" in data_args.label_list[label]) and start >= 0: + entity = input_data[start : i - 1] + if isinstance(entity, list): + entity = "".join(entity) + items.append( + { + "pos": [start, i - 2], + "entity": entity, + "label": label_name, + } + ) + start = -1 + if "B-" in data_args.label_list[label]: + start = i - 1 + label_name = data_args.label_list[label][2:] + if start >= 0: + items.append( + { + "pos": [start, len(token_label) - 1], + "entity": input_data[start : len(token_label) - 1], + "label": "", + } + ) + value.append(items) + out_dict = {"value": value, "tokens_label": tokens_label} + out_file = open(os.path.join(training_args.output_dir, "test_results.json"), "w") + json.dump(out_dict, out_file, ensure_ascii=True) + + # Export inference model + if training_args.do_export: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids + ] + model_args.export_model_dir = os.path.join(model_args.export_model_dir, data_args.dataset, "export") + paddlenlp.transformers.export_model( + model=trainer.model, input_spec=input_spec, path=model_args.export_model_dir + ) + trainer.tokenizer.save_pretrained(model_args.export_model_dir) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-3.0/utils.py b/model_zoo/ernie-3.0/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..697adfd22bcfadf178fe64c8e475ed0493949c0c --- /dev/null +++ b/model_zoo/ernie-3.0/utils.py @@ -0,0 +1,556 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
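+
+# Shared helpers for the ERNIE 3.0 fine-tuning scripts: YAML config loading,
+# dynamic max-length selection, SQuAD-style feature preparation, the QA loss and
+# trainer, and example conversion for the CLUE and NER datasets.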
+ +from dataclasses import dataclass, field +from typing import List, Optional + +import paddle +import yaml + +from paddlenlp.trainer import PredictionOutput, Trainer + + +def load_config(config_file_path, task_name, dataset_name, model_args, data_args, training_args): + config = yaml.load(open(config_file_path, "r"), Loader=yaml.FullLoader) + # Set the batch size of trainer setting + + config = config[task_name][dataset_name] + for args in (model_args, data_args, training_args): + for arg in config.keys(): + if hasattr(args, arg): + setattr(args, arg, config[arg]) + return model_args, data_args, training_args + + +def get_dynamic_max_length(examples, default_max_length: int, dynamic_max_length: List[int]) -> int: + """get max_length by examples which you can change it by examples in batch""" + # if the input is a batch of examples + if isinstance(examples["input_ids"][0], list): + cur_length = max([len(i) for i in examples["input_ids"]]) + # if the input is a single example + else: + cur_length = len(examples["input_ids"]) + + max_length = default_max_length + for max_length_option in sorted(dynamic_max_length): + if cur_length <= max_length_option: + max_length = max_length_option + break + return max_length + + +def prepare_train_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. + contexts = examples["context"] + questions = examples["question"] + + if args.dynamic_max_length is not None: + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=args.max_seq_length, truncation=True + ) + max_length = get_dynamic_max_length( + examples=tokenized_examples, + default_max_length=args.max_seq_length, + dynamic_max_length=args.dynamic_max_length, + ) + # always pad to max_length + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=max_length, padding="max_length", truncation=True + ) + else: + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=args.max_seq_length, truncation=True + ) + + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + # The offset mappings will give us a map from token to character position in the original context. This will + # help us compute the start_positions and end_positions. + offset_mapping = tokenized_examples.pop("offset_mapping") + + # Let's label those examples! + tokenized_examples["start_positions"] = [] + tokenized_examples["end_positions"] = [] + + for i, offsets in enumerate(offset_mapping): + # We will label impossible answers with the index of the CLS token. + input_ids = tokenized_examples["input_ids"][i] + cls_index = input_ids.index(tokenizer.cls_token_id) + + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + + # One example can give several spans, this is the index of the example containing this span of text. 
+ sample_index = sample_mapping[i] + answers = examples["answers"][sample_index] + # If no answers are given, set the cls_index as answer. + if len(answers["answer_start"]) == 0: + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Start/end character index of the answer in the text. + start_char = answers["answer_start"][0] + end_char = start_char + len(answers["text"][0]) + + # Start token index of the current span in the text. + token_start_index = 0 + while sequence_ids[token_start_index] != 1: + token_start_index += 1 + + # End token index of the current span in the text. + token_end_index = len(input_ids) - 1 + while sequence_ids[token_end_index] != 1: + token_end_index -= 1 + token_end_index -= 1 + + # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). + if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): + tokenized_examples["start_positions"].append(cls_index) + tokenized_examples["end_positions"].append(cls_index) + else: + # Otherwise move the token_start_index and token_end_index to the two ends of the answer. + # Note: we could go after the last offset if the answer is the last word (edge case). + while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char: + token_start_index += 1 + tokenized_examples["start_positions"].append(token_start_index - 1) + while offsets[token_end_index][1] >= end_char: + token_end_index -= 1 + tokenized_examples["end_positions"].append(token_end_index + 1) + + return tokenized_examples + + +def prepare_validation_features(examples, tokenizer, args): + # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results + # in one example possible giving several features when a context is long, each of those features having a + # context that overlaps a bit the context of the previous feature. + # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is + # that HuggingFace uses ArrowTable as basic data structure, while we use list of dictionary instead. + contexts = examples["context"] + questions = examples["question"] + + if args.dynamic_max_length is not None: + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=args.max_seq_length, truncation=True + ) + max_length = get_dynamic_max_length( + examples=tokenized_examples, + default_max_length=args.max_seq_length, + dynamic_max_length=args.dynamic_max_length, + ) + # always pad to max_length + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=max_length, padding="max_length", truncation=True + ) + else: + tokenized_examples = tokenizer( + questions, contexts, stride=args.doc_stride, max_length=args.max_seq_length, truncation=True + ) + # Since one example might give us several features if it has a long context, we need a map from a feature to + # its corresponding example. This key gives us just that. + sample_mapping = tokenized_examples.pop("overflow_to_sample") + + # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the + # corresponding example_id and we will store the offset mappings. 
+ tokenized_examples["example_id"] = [] + + for i in range(len(tokenized_examples["input_ids"])): + # Grab the sequence corresponding to that example (to know what is the context and what is the question). + sequence_ids = tokenized_examples["token_type_ids"][i] + context_index = 1 + + # One example can give several spans, this is the index of the example containing this span of text. + sample_index = sample_mapping[i] + tokenized_examples["example_id"].append(examples["id"][sample_index]) + + # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token + # position is part of the context or not. + + tokenized_examples["offset_mapping"][i] = [ + (o if sequence_ids[k] == context_index and k != len(sequence_ids) - 1 else None) + for k, o in enumerate(tokenized_examples["offset_mapping"][i]) + ] + + return tokenized_examples + + +class CrossEntropyLossForSQuAD(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForSQuAD, self).__init__() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + start_position = paddle.unsqueeze(start_position, axis=-1) + end_position = paddle.unsqueeze(end_position, axis=-1) + start_loss = paddle.nn.functional.cross_entropy(input=start_logits, label=start_position) + end_loss = paddle.nn.functional.cross_entropy(input=end_logits, label=end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +class QuestionAnsweringTrainer(Trainer): + def __init__(self, *args, eval_examples=None, post_process_function=None, **kwargs): + super().__init__(*args, **kwargs) + self.eval_examples = eval_examples + self.post_process_function = post_process_function + + def evaluate(self, eval_dataset=None, eval_examples=None, ignore_keys=None, metric_key_prefix: str = "eval"): + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_dataloader = self.get_eval_dataloader(eval_dataset) + eval_examples = self.eval_examples if eval_examples is None else eval_examples + + # Temporarily disable metric computation, we will do it in the loop here. + compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + eval_dataloader, + description="Evaluation", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is not None and self.compute_metrics is not None: + eval_preds = self.post_process_function(eval_examples, eval_dataset, output.predictions) + metrics = self.compute_metrics(eval_preds) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + self.log(metrics) + else: + metrics = {} + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, metrics) + return metrics + + def predict(self, predict_dataset, predict_examples, ignore_keys=None, metric_key_prefix: str = "test"): + predict_dataloader = self.get_test_dataloader(predict_dataset) + + # Temporarily disable metric computation, we will do it in the loop here. 
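+        # The metrics are actually computed after the evaluation loop, once the
+        # post-processed predictions for the whole dataset are available.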
+ compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + predict_dataloader, + description="Prediction", + # No point gathering the predictions if there are no metrics, otherwise we defer to + # self.args.prediction_loss_only + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is None or self.compute_metrics is None: + return output + + predictions = self.post_process_function(predict_examples, predict_dataset, output.predictions, "predict") + metrics = self.compute_metrics(predictions) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + return PredictionOutput(predictions=predictions.predictions, label_ids=predictions.label_ids, metrics=metrics) + + +# Data pre-process function for clue benchmark datatset +def seq_convert_example( + example, label_list, tokenizer=None, max_seq_length=512, dynamic_max_length: Optional[List[int]] = None, **kwargs +): + """convert a glue example into necessary features""" + is_test = False + if "label" not in example.keys(): + is_test = True + + if not is_test: + # `label_list == None` is for regression task + label_dtype = "int64" if label_list else "float32" + # Get the label + example["label"] = int(example["label"]) if label_dtype != "float32" else float(example["label"]) + label = example["label"] + # Convert raw text to feature + if "keyword" in example: # CSL + sentence1 = " ".join(example["keyword"]) + example = {"sentence1": sentence1, "sentence2": example["abst"], "label": example["label"]} + elif "target" in example: # wsc + text, query, pronoun, query_idx, pronoun_idx = ( + example["text"], + example["target"]["span1_text"], + example["target"]["span2_text"], + example["target"]["span1_index"], + example["target"]["span2_index"], + ) + text_list = list(text) + assert text[pronoun_idx : (pronoun_idx + len(pronoun))] == pronoun, "pronoun: {}".format(pronoun) + assert text[query_idx : (query_idx + len(query))] == query, "query: {}".format(query) + if pronoun_idx > query_idx: + text_list.insert(query_idx, "_") + text_list.insert(query_idx + len(query) + 1, "_") + text_list.insert(pronoun_idx + 2, "[") + text_list.insert(pronoun_idx + len(pronoun) + 2 + 1, "]") + else: + text_list.insert(pronoun_idx, "[") + text_list.insert(pronoun_idx + len(pronoun) + 1, "]") + text_list.insert(query_idx + 2, "_") + text_list.insert(query_idx + len(query) + 2 + 1, "_") + text = "".join(text_list) + example["sentence"] = text + + if tokenizer is None: + return example + if "sentence" in example: + if dynamic_max_length is not None: + temp_example = tokenizer(example["sentence"], max_length=max_seq_length, truncation=True) + max_length = get_dynamic_max_length( + examples=temp_example, default_max_length=max_seq_length, dynamic_max_length=dynamic_max_length + ) + # always pad to max_length + example = tokenizer(example["sentence"], max_length=max_length, padding="max_length", truncation=True) + else: + example = tokenizer(example["sentence"], max_length=max_seq_length, truncation=True) + elif "sentence1" in example: + if dynamic_max_length is not None: + temp_example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_length=max_seq_length, + truncation=True, + ) + max_length = 
get_dynamic_max_length( + examples=temp_example, default_max_length=max_seq_length, dynamic_max_length=dynamic_max_length + ) + example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_length=max_length, + padding="max_length", + truncation=True, + ) + else: + example = tokenizer( + example["sentence1"], + text_pair=example["sentence2"], + max_length=max_seq_length, + truncation=True, + ) + + if not is_test: + if "token_type_ids" in example: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], "labels": label} + else: + return {"input_ids": example["input_ids"], "token_type_ids": example["token_type_ids"]} + + +def token_convert_example( + example, + tokenizer, + no_entity_id, + max_seq_length=512, + return_length=False, + dynamic_max_length: Optional[List[int]] = None, +): + if "labels" in example: + labels = example["labels"] + example = example["tokens"] + if dynamic_max_length is not None: + tokenized_input = tokenizer( + example, + is_split_into_words=True, + max_length=max_seq_length, + truncation=True, + return_length=return_length, + ) + max_length = get_dynamic_max_length( + examples=tokenized_input, default_max_length=max_seq_length, dynamic_max_length=dynamic_max_length + ) + # always pad to max_length + tokenized_input = tokenizer( + example, + is_split_into_words=True, + max_length=max_length, + padding="max_length", + truncation=True, + return_length=return_length, + ) + else: + tokenized_input = tokenizer( + example, + is_split_into_words=True, + max_length=max_seq_length, + truncation=True, + return_length=return_length, + ) + + # -2 for [CLS] and [SEP] + if len(tokenized_input["input_ids"]) - 2 < len(labels): + labels = labels[: len(tokenized_input["input_ids"]) - 2] + tokenized_input["labels"] = [no_entity_id] + labels + [no_entity_id] + tokenized_input["labels"] += [no_entity_id] * ( + len(tokenized_input["input_ids"]) - len(tokenized_input["labels"]) + ) + else: + if example["tokens"] == []: + if return_length: + tokenized_input = {"labels": [], "input_ids": [], "token_type_ids": [], "length": 0, "seq_len": 0} + else: + tokenized_input = {"labels": [], "input_ids": [], "token_type_ids": []} + + return tokenized_input + if dynamic_max_length is not None: + tokenized_input = tokenizer( + example["tokens"], + max_length=max_seq_length, + truncation=True, + is_split_into_words=True, + return_length=return_length, + ) + max_length = get_dynamic_max_length( + examples=tokenized_input, default_max_length=max_seq_length, dynamic_max_length=dynamic_max_length + ) + # always pad to max_length + tokenized_input = tokenizer( + example["tokens"], + max_length=max_length, + padding="max_length", + truncation=True, + is_split_into_words=True, + return_length=return_length, + ) + else: + tokenized_input = tokenizer( + example["tokens"], + max_length=max_seq_length, + truncation=True, + is_split_into_words=True, + return_length=return_length, + ) + + label_ids = example["ner_tags"] + if len(tokenized_input["input_ids"]) - 2 < len(label_ids): + label_ids = label_ids[: len(tokenized_input["input_ids"]) - 2] + label_ids = [no_entity_id] + label_ids + [no_entity_id] + + label_ids += [no_entity_id] * (len(tokenized_input["input_ids"]) - len(label_ids)) + tokenized_input["labels"] = label_ids + return tokenized_input + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. 
+ Using `PdArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + dataset: str = field(default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}) + + max_seq_length: int = field( + default=128, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + + # Additional configs for QA task. + doc_stride: int = field( + default=128, + metadata={"help": "When splitting up a long document into chunks, how much stride to take between chunks."}, + ) + + n_best_size: int = field( + default=20, + metadata={ + "help": "The total number of n-best predictions to generate in the nbest_predictions.json output file." + }, + ) + + max_query_length: int = field( + default=64, + metadata={"help": "Max query length."}, + ) + + max_answer_length: int = field( + default=30, + metadata={"help": "Max answer length."}, + ) + + dynamic_max_length: Optional[List[int]] = field( + default=None, + metadata={"help": "dynamic max length from batch, it can be array of length, eg: 16 32 64 128"}, + ) + + do_lower_case: bool = field( + default=False, + metadata={ + "help": "Whether to lower case the input text. Should be True for uncased models and False for cased models." + }, + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} + ) + preprocessing_num_workers: Optional[int] = field( + default=None, + metadata={"help": "The number of processes to use for the preprocessing."}, + ) + null_score_diff_threshold: float = field( + default=0.0, + metadata={ + "help": "The threshold used to select the null answer: if the best answer has a score that is less than " + "the score of the null answer minus this threshold, the null answer is selected for this example. " + "Only useful when `version_2_with_negative=True`." + }, + ) + + # TODO(wj-Mcat): support padding configuration: `max_length`, `longest_first` + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
+    """
+
+    model_name_or_path: str = field(
+        metadata={
+            "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html"
+        }
+    )
+    config: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    export_model_dir: Optional[str] = field(
+        default="./best_models",
+        metadata={"help": "Path to directory to store the exported inference model."},
+    )
diff --git a/model_zoo/ernie-code/README.en.md b/model_zoo/ernie-code/README.en.md
new file mode 100644
index 0000000000000000000000000000000000000000..77904c09ce4da8a937ce07af1e9381f5c76c89f7
--- /dev/null
+++ b/model_zoo/ernie-code/README.en.md
@@ -0,0 +1,76 @@
+# ERNIE-Code
+
+[ACL 2023 (Findings)](https://aclanthology.org/2023.findings-acl.676/) | [arXiv](https://arxiv.org/pdf/2212.06742) | [BibTex](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-code/README.md#bibtex) | [中文版](./README.md)
+
+![ernie-code-comp](https://github.com/KB-Ding/PaddleNLP/assets/13767887/2a550b46-a7d5-416d-b300-83cce7044be4)
+
+[ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages](https://aclanthology.org/2023.findings-acl.676.pdf)
+
+
+ERNIE-Code is a unified large language model (LLM) that connects 116 natural languages with 6 programming languages. We employ two pre-training methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation.
+
+## Quick Start
+
+This project is the PaddlePaddle implementation of ERNIE-Code, including model prediction and weight conversion. The brief directory structure and description of this example are as follows:
+
+```text
+├── README.md     # Documentation
+├── predict.py    # Forward prediction demo
+├── converter.py  # Weight conversion script
+```
+
+### Multilingual Text-to-Code / Code-to-Text
+
+This project provides a simple demo for multilingual code/text generation. The startup command is as follows:
+
+```shell
+python predict.py \
+    --input 'BadZipFileのAliasは、古い Python バージョンとの互換性のために。' \
+    --target_lang 'code' \
+    --source_prefix 'translate Japanese to Python: \n' \
+    --max_length 1024 \
+    --num_beams 3 \
+    --device 'gpu'
+```
+
+Explanation of the parameters:
+- `input`: The input sequence.
+- `target_lang`: The target language, which can be set to 'text' or 'code'.
+- `source_prefix`: The prompt.
+- `max_length`: The maximum length of input/output text.
+- `num_beams`: The number of beams to keep at each decoding step (for beam search).
+- `device`: The running device, which can be set to 'cpu' or 'gpu'.
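+
+If you prefer calling the model from Python rather than the CLI demo, the following is a minimal sketch that mirrors `predict.py`; the `ernie-code-base-L512` checkpoint name and the generation settings are simply that script's defaults, so adjust them to your setup:
+
+```python
+import paddle
+
+from paddlenlp.transformers import AutoModelForConditionalGeneration, AutoTokenizer
+
+# Same default checkpoint as predict.py.
+tokenizer = AutoTokenizer.from_pretrained("ernie-code-base-L512")
+model = AutoModelForConditionalGeneration.from_pretrained("ernie-code-base-L512")
+model.eval()
+
+# Prompt = source_prefix + input, as in the shell command above.
+prompt = "translate Japanese to Python: \n" + "BadZipFileのAliasは、古い Python バージョンとの互換性のために。"
+encoded = tokenizer(prompt, max_length=1024)
+input_ids = paddle.to_tensor([encoded["input_ids"]], dtype="int64")
+attention_mask = paddle.to_tensor([encoded["attention_mask"]], dtype="int64")
+
+# Beam-search generation with the same flags as the CLI demo.
+output_ids, _ = model.generate(
+    input_ids,
+    attention_mask=attention_mask,
+    max_length=1024,
+    num_beams=3,
+    decode_strategy="beam_search",
+)
+# For code generation, keep special tokens and spacing (see predict.py).
+print(tokenizer.batch_decode(output_ids.numpy(), skip_special_tokens=False, clean_up_tokenization_spaces=False))
+```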
+ + +### Zero-shot Examples +- Multilingual code-to-text generation (zero-shot) + +![code-to-text-examples](https://github.com/KB-Ding/PaddleNLP/assets/13767887/7dbf225e-e6be-401d-9f6c-f733e2f68f76) + +![zh_code-to-text_examples-1](https://github.com/KB-Ding/PaddleNLP/assets/13767887/2d1ba091-f43c-4f3e-95c6-0038ede9e63e) + +- Multilingual text-to-text translation (zero-shot) + +![zero-shot-mt-examples](https://github.com/KB-Ding/PaddleNLP/assets/13767887/8be1a977-fa21-4a46-86ba-136fa8276a1a) + + +## BibTeX +``` +@inproceedings{chai-etal-2023-ernie, + title = "{ERNIE}-Code: Beyond {E}nglish-Centric Cross-lingual Pretraining for Programming Languages", + author = "Chai, Yekun and + Wang, Shuohuan and + Pang, Chao and + Sun, Yu and + Tian, Hao and + Wu, Hua", + booktitle = "Findings of the Association for Computational Linguistics: ACL 2023", + month = jul, + year = "2023", + address = "Toronto, Canada", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/2023.findings-acl.676", + pages = "10628--10650", + abstract = "Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. 
We release our code and pre-trained checkpoints.", +} +``` diff --git a/model_zoo/ernie-code/README.md b/model_zoo/ernie-code/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5562d043d24004fba13e873a7365f44004d1a828 --- /dev/null +++ b/model_zoo/ernie-code/README.md @@ -0,0 +1,81 @@ +# ERNIE-Code + +[ACL 2023 (Findings)](https://aclanthology.org/2023.findings-acl.676/) | [arXiv](https://arxiv.org/pdf/2212.06742) | [BibTex](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-code/README.md#bibtex) | [English version](./README.en.md) + +![ernie-code-comp](https://github.com/KB-Ding/PaddleNLP/assets/13767887/2a550b46-a7d5-416d-b300-83cce7044be4) + +[ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages](https://aclanthology.org/2023.findings-acl.676.pdf) + + +ERNIE-Code是一个多自然语言、多编程语言的统一代码语言模型(Code LLM),支持116种自然语言和6+种编程语言。采用了两种预训练方法来进行跨语言预训练: +- Span-Corruption Language Modeling (SCLM) 从单语言的自然语言或编程语言中进行掩码语言学习; +- Pivot-based Translation Language Modeling (PTLM),将多自然语言到多编程语言的映射 规约为,以英语为枢轴(pivot)的多自然语言到英语、和英语到多编程语言的联合学习。 + +ERNIE-Code在代码智能的各种下游任务中,包括代码到多自然语言、多自然语言到代码、代码到代码、多自然语言文档翻译等任务,优于以前的多语言代码和文本模型(例如mT5 和 CodeT5),同时在多自然语言的代码摘要和文档翻译等任务上具备较好的的zero-shot prompt能力。 + +## 快速开始 + +本项目是ERNIE-Code的PaddlePaddle实现,包括模型预测和权重转换。以下是该示例的简要目录结构和说明: + +```text +├── README.md # 文档 +├── predict.py # 前向预测示例 +├── converter.py # 权重转换脚本 +``` + +### 多语言文本到代码/代码到文本 + +本项目提供了一个简单的多语言代码/文本生成的演示。启动命令如下: + +```shell +python predict.py \ + --input 'BadZipFileのAliasは、古い Python バージョンとの互換性のために。' \ + --target_lang 'code' \ + --source_prefix 'translate Japanese to Python: \n' \ + --max_length 1024 \ + --num_beams 3 \ + --device 'gpu' +``` + +配置文件中参数的解释: +- `input`:输入的文本序列。 +- `target_lang`:目标语言,可设置为'text'或'code'。 +- `source_prefix`:提示词Prompt。 +- `max_length`:输入/输出文本的最大长度。 +- `num_beams`:解码时每个时间步保留的beam大小(用于束搜索)。 +- `device`:运行设备,可设置为'cpu'或'gpu'。 + + + +### Zero-shot示例 +- 多语言代码到文本生成(zero-shot) + +![code-to-text-examples](https://github.com/KB-Ding/PaddleNLP/assets/13767887/7dbf225e-e6be-401d-9f6c-f733e2f68f76) + +![zh_code-to-text_examples-1](https://github.com/KB-Ding/PaddleNLP/assets/13767887/2d1ba091-f43c-4f3e-95c6-0038ede9e63e) + +- 计算机术语翻译(zero-shot) + +![zero-shot-mt-examples](https://github.com/KB-Ding/PaddleNLP/assets/13767887/8be1a977-fa21-4a46-86ba-136fa8276a1a) + + +## BibTeX +``` +@inproceedings{chai-etal-2023-ernie, + title = "{ERNIE}-Code: Beyond {E}nglish-Centric Cross-lingual Pretraining for Programming Languages", + author = "Chai, Yekun and + Wang, Shuohuan and + Pang, Chao and + Sun, Yu and + Tian, Hao and + Wu, Hua", + booktitle = "Findings of the Association for Computational Linguistics: ACL 2023", + month = jul, + year = "2023", + address = "Toronto, Canada", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/2023.findings-acl.676", + pages = "10628--10650", + abstract = "Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. 
We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.", +} +``` diff --git a/model_zoo/ernie-code/convert.py b/model_zoo/ernie-code/convert.py new file mode 100644 index 0000000000000000000000000000000000000000..900e8437bea71025d1f9e78184c178296609ec17 --- /dev/null +++ b/model_zoo/ernie-code/convert.py @@ -0,0 +1,68 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +from collections import OrderedDict + +dont_transpose = [ + "shared.weight", + "layer_norm.weight", + ".layer_norm.weight", + "relative_attention_bias.weight", + "embed_tokens.weight", +] + + +def convert_pytorch_checkpoint_to_paddle(pytorch_checkpoint_path, paddle_dump_path): + import paddle + import torch + + pytorch_state_dict = torch.load(pytorch_checkpoint_path, map_location="cpu") + paddle_state_dict = OrderedDict() + for k, v in pytorch_state_dict.items(): + transpose = False + + if k[-7:] == ".weight": + if not any([w in k for w in dont_transpose]): + if v.ndim == 2: + v = v.transpose(0, 1) + transpose = True + + print(f"Converting: {k} | is_transpose {transpose}") + + if k != "lm_head.weight": + k = "ErnieCode." + k + # The bf16 data of torch cannot be directly converted to paddle + paddle_state_dict[k] = paddle.to_tensor(v.to(torch.float32).numpy()).cast(paddle.bfloat16).numpy() + + paddle.save(paddle_state_dict, paddle_dump_path) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--pytorch_checkpoint_path", + default="/home/models/pytorch_model.bin", + type=str, + required=False, + help="Path to the Pytorch checkpoint path.", + ) + parser.add_argument( + "--paddle_dump_path", + default="/home/models/model_state.pdparams", + type=str, + required=False, + help="Path to the output Paddle model.", + ) + args = parser.parse_args() + convert_pytorch_checkpoint_to_paddle(args.pytorch_checkpoint_path, args.paddle_dump_path) diff --git a/model_zoo/ernie-code/predict.py b/model_zoo/ernie-code/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..95b916bae7d232e252bfed86e47c30e68762d667 --- /dev/null +++ b/model_zoo/ernie-code/predict.py @@ -0,0 +1,76 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse + +import numpy as np +import paddle + +from paddlenlp.transformers import AutoModelForConditionalGeneration, AutoTokenizer + +parser = argparse.ArgumentParser("ERNIE-CODE") +parser.add_argument( + "--model_name_or_path", + default="ernie-code-base-L512", + type=str, +) +parser.add_argument("--input", default="BadZipFileのAliasは、古い Python バージョンとの互換性のために。", type=str) +parser.add_argument("--target_lang", default="code", type=str) +parser.add_argument("--source_prefix", default="translate Japanese to Python: \n", type=str) +parser.add_argument("--max_length", type=int, default=1024) +parser.add_argument("--num_beams", type=int, default=3) +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"]) + +args = parser.parse_args() + + +def predict(): + + paddle.set_device(args.device) + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) + model = AutoModelForConditionalGeneration.from_pretrained(args.model_name_or_path) + + prefix = args.source_prefix if args.source_prefix is not None else "" + + def preprocess_function(inputs, tokenizer): + inputs = [prefix + inp for inp in inputs] + + model_inputs = tokenizer(inputs, max_length=args.max_length) + return model_inputs + + dev_dataset = [args.input] + model_inputs = preprocess_function(dev_dataset, tokenizer) + model.eval() + gen_kwargs = { + "max_length": args.max_length, + "num_beams": args.num_beams, + "decode_strategy": "beam_search", + "length_penalty": 0, + "min_length": 0, + } + generated_tokens, _ = model.generate( + paddle.to_tensor(np.array(model_inputs["input_ids"]).reshape(1, -1).astype("int64")), + attention_mask=paddle.to_tensor(np.array(model_inputs["attention_mask"]).reshape(1, -1).astype("int64")), + **gen_kwargs, + ) + if args.target_lang == "text": + decoded_preds = tokenizer.batch_decode(generated_tokens.numpy(), skip_special_tokens=True) + elif args.target_lang == "code": + decoded_preds = tokenizer.batch_decode( + generated_tokens.numpy(), skip_special_tokens=False, clean_up_tokenization_spaces=False + ) + print(decoded_preds) + + +if __name__ == "__main__": + predict() diff --git a/model_zoo/ernie-doc/README.md b/model_zoo/ernie-doc/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ca3d5a064826f925a166f43d520d7bfea33aff78 --- /dev/null +++ b/model_zoo/ernie-doc/README.md @@ -0,0 +1,215 @@ +# ERNIE-Doc: A Retrospective Long-Document Modeling Transformer + +* [模型简介](#模型简介) +* [快速开始](#快速开始) + * [环境依赖](#环境依赖) + * [通用参数释义](#通用参数释义) + * [分类任务](#分类任务) + * [阅读理解任务](#阅读理解任务) + * [语义匹配任务](#语义匹配任务) + * [序列标注任务](#序列标注任务) +* [致谢](#致谢) +* [参考论文](#参考论文) + +## 模型简介 +[ERNIE-Doc](https://arxiv.org/abs/2012.15688)是百度NLP提出的针对长文本的预训练模型。在循环Transformer机制之上,创新性地提出两阶段重复学习以及增强的循环机制,以此提高模型感受野,加强模型对长文本的理解能力。 + +本项目是 ERNIE-Doc 的 PaddlePaddle 动态图实现, 包含模型训练,模型验证等内容。以下是本例的简要目录结构及说明: + +```text +. 
+├── README.md # 文档 +├── data.py # 数据处理 +├── metrics.py # ERNIE-Doc下游任务指标 +├── model.py # 下游任务模型实现 +├── optimization.py # 优化算法 +├── run_classifier.py # 分类任务 +├── run_mcq.py # 阅读理解任务,单项选择题 +├── run_mrc.py # 抽取式阅读理解任务 +├── run_semantic_matching.py # 语义匹配任务 +└── run_sequence_labeling.py # 序列标注任务 + +``` + +## 快速开始 + +### 环境依赖 + +- nltk +- beautifulsoup4 + +安装命令:`pip install nltk==3.5 beautifulsoup4` + +初次使用时,需要下载nltk的模型,可运行以下命令(下载模型可能比较慢,请耐心等待): + +``` +python -c "import nltk; nltk.download('punkt')" +``` + +### 通用参数释义 + +- `model_name_or_path` 指示了Fine-tuning使用的具体预训练模型以及预训练时使用的tokenizer,目前支持的预训练模型有:"ernie-doc-base-zh", "ernie-doc-base-en"。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./checkpoint/model_xx/"。 +- `dataset` 表示Fine-tuning需要加载的数据集。 +- `memory_length` 表示当前的句子被截取作为下一个样本的特征的长度。 +- `max_seq_length` 表示最大句子长度,超过该长度的部分将被切分成下一个样本。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔步数。 +- `save_steps` 表示模型保存及评估间隔步数。 +- `output_dir` 表示模型保存路径。 +- `device` 表示训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `seed` 表示随机数种子。 +- `weight_decay` 表示AdamW的权重衰减系数。 +- `warmup_proportion` 表示学习率warmup系数。 +- `layerwise_decay` 表示AdamW with Layerwise decay的逐层衰减系数。 + +由于不同任务、不同数据集所设的超参数差别较大,可查看[ERNIE-Doc](https://arxiv.org/abs/2012.15688)论文附录中具体超参设定,此处不一一列举。 + +### 分类任务 + +分类任务支持多种数据集的评测,目前支持`imdb`, `iflytek`, `thucnews`, `hyp`四个数据集(有关数据集的描述可查看[PaddleNLP文本分类数据集](../../docs/data_prepare/dataset_list.md))。可通过参数`dataset`指定具体的数据集,下面以`imdb`为例子运行分类任务。 + +#### 单卡训练 + +```shell +python run_classifier.py --batch_size 8 --model_name_or_path ernie-doc-base-en + +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus "0,1" --log_dir imdb run_classifier.py --batch_size 8 --model_name_or_path ernie-doc-base-en + +``` + +在`imdb`, `iflytek`, `thucnews`, `hyp`各数据集上Fine-tuning后,在验证集上有如下结果: + +| Dataset | Model | Dev ACC | +|:---------:|:-----------------:|:----------------:| +| IMDB | ernie-doc-base-en | 0.9506 | +| THUCNews | ernie-doc-base-zh | 0.9854 | +| HYP | ernie-doc-base-en | 0.7412 | +| IFLYTEK | ernie-doc-base-zh | 0.6179 | + + +### 阅读理解任务 + +阅读理解任务支持抽取式阅读理解与单项选择题任务。 + +- 抽取式阅读理解 + +目前抽取式阅读理解支持`duredear-robust`, `drcd`,`cmrc2018`数据集。可通过参数`dataset`指定具体的数据集,下面以`dureader_robust`为例子运行抽取式阅读理解任务。 + +#### 单卡训练 + +```shell +python run_mrc.py --dataset dureader_robust --batch_size 8 --learning_rate 2.75e-4 +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus "0,1" --log_dir dureader_robust run_mrc.py --dataset dureader_robust --batch_size 8 --learning_rate 2.75e-4 +``` + +在`duredear-robust`, `drcd`, `cmrc2018`各数据集上Fine-tuning后,在验证集上有如下结果: + +| Dataset | Model | Dev EM/F1 | +|:--------------:|:-----------------:|:----------------:| +| Dureader-robust| ernie-doc-base-zh | 0.7481/0.8637 | +| DRCD | ernie-doc-base-zh | 0.8879/0.9392 | +| CMRC2018 | ernie-doc-base-zh | 0.7061/0.9004 | + + +- 单项选择题 + +[C3](https://github.com/nlpdata/c3)是首个自由形式的多选项中文机器阅读理解数据集。该数据集每个样本提供一个上下文(文章或者对话)、问题以及至多四个答案选项,要求从答案选项中选择一个正确选项。 + +目前PaddleNLP提供`C3`阅读理解单项选择题数据集,可执行以下命令运行该任务。 + +#### 单卡训练 + +```shell +python run_mcq.py --batch_size 8 + +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus "0,1" --log_dir mcq run_mcq.py --batch_size 8 + +``` + +在`C3`数据集上Fine-tuning后,在验证集上有如下结果: +| Dataset | Model | Dev/Test Acc | +|:--------------:|:-----------------:|:----------------:| +| C3 | ernie-doc-base-zh | 0.7573/0.7583 | + + +### 语义匹配任务 + +[CAIL2019 
SCM](https://github.com/china-ai-law-challenge/CAIL2019/tree/master/scm) 数据集是来自“中国裁判文书网”公开的法律文书,其中每份数据由三篇法律文书组成。对于每份数据,用`(A,B,C)`来代表该组数据,其中`(A,B,C)`均对应某一篇文书。该任务要求判别similarity(A, B)是否大于similarity(A, C)。 + +可执行以下命令运行该任务。 + +#### 单卡训练 + +```shell +python run_semantic_matching.py --batch_size 6 --learning_rate 2e-5 +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus "0,1" --log_dir cail run_semantic_matching.py --batch_size 6 --learning_rate 2e-5 +``` + +在`CAIL2019-SCM`数据集上Fine-tuning后,在验证集与测试集上有如下结果: + +| Dataset | Model | Dev/Test Acc | +|:--------------:|:-----------------:|:----------------:| +| CAIL2019-SCM | ernie-doc-base-zh | 0.6420/0.6484 | + + +### 序列标注任务 + + +MSRA-NER 数据集由微软亚研院发布,其目标是识别文本中具有特定意义的实体,主要包括人名、地名、机构名等。示例如下: + +``` +不\002久\002前\002,\002中\002国\002共\002产\002党\002召\002开\002了\002举\002世\002瞩\002目\002的\002第\002十\002五\002次\002全\002国\002代\002表\002大\002会\002。 O\002O\002O\002O\002B-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002O\002O\002O\002O\002O\002O\002O\002O\002B-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002I-ORG\002O +这\002次\002代\002表\002大\002会\002是\002在\002中\002国\002改\002革\002开\002放\002和\002社\002会\002主\002义\002现\002代\002化\002建\002设\002发\002展\002的\002关\002键\002时\002刻\002召\002开\002的\002历\002史\002性\002会\002议\002。 O\002O\002O\002O\002O\002O\002O\002O\002B-LOC\002I-LOC\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O\002O +``` + +PaddleNLP集成的数据集MSRA-NER数据集对文件格式做了调整:每一行文本、标签以特殊字符"\t"进行分隔,每个字之间以特殊字符"\002"分隔。 + +可执行以下命令运行序列标注任务。 + +#### 单卡训练 + +```shell +python run_sequence_labeling.py --batch_size 8 --learning_rate 3e-5 +``` + +#### 多卡训练 + +```shell +python -m paddle.distributed.launch --gpus "0,1" --log_dir msra_ner run_sequence_labeling.py --batch_size 8 --learning_rate 3e-5 +``` + +在`MSRA-NER`数据集上Fine-tuning后,在验证集与测试集上有如下最佳结果: + +| Dataset | Model | Precision/Recall/F1 | +|:--------------:|:-----------------:|:-----------------------:| +| MSRA-NER | ernie-doc-base-zh | 0.9288/0.9139/0.9213 | + + +## 致谢 +* 感谢[百度NLP](https://github.com/PaddlePaddle/ERNIE/tree/repro/ernie-doc)提供ERNIE-Doc开源代码的实现以及预训练模型。 + +## 参考论文 + +* Siyu Ding, Junyuan Shang et al. "ERNIE-Doc: A Retrospective Long-Document Modeling Transformer" ACL, 2021 diff --git a/model_zoo/ernie-doc/data.py b/model_zoo/ernie-doc/data.py new file mode 100644 index 0000000000000000000000000000000000000000..93a36443458aed06b19475936ff9d275f87fa53b --- /dev/null +++ b/model_zoo/ernie-doc/data.py @@ -0,0 +1,1194 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
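+
+"""Data pipeline for the ERNIE-Doc fine-tuning examples.
+
+This module provides batch iterators for the downstream tasks in this example:
+ClassifierIterator (classification), MRCIterator (extractive MRC), MCQIterator
+(multiple-choice QA), SemanticMatchingIterator and SequenceLabelingIterator.
+Long inputs are split into overlapping spans (controlled by `memory_len` /
+`doc_stride`) and padded with `pad_batch_data` before being fed to the model.
+"""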
+ +import itertools +from collections import namedtuple + +import numpy as np +from paddle.utils import try_import + +from paddlenlp.transformers import tokenize_chinese_chars +from paddlenlp.utils.log import logger + +__all__ = ["ClassifierIterator", "MRCIterator", "MCQIterator"] + + +def get_related_pos(insts, seq_len, memory_len=128): + """generate relative postion ids""" + beg = seq_len + seq_len + memory_len + r_position = [list(range(beg - 1, seq_len - 1, -1)) + list(range(0, seq_len)) for i in range(len(insts))] + return np.array(r_position).astype("int64").reshape([len(insts), beg, 1]) + + +def pad_batch_data( + insts, + insts_data_type="int64", + pad_idx=0, + final_cls=False, + pad_max_len=None, + return_pos=False, + return_input_mask=False, + return_max_len=False, + return_num_token=False, + return_seq_lens=False, +): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and attention bias. + """ + return_list = [] + if pad_max_len: + max_len = pad_max_len + else: + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + + # Input id + if final_cls: + inst_data = np.array([inst[:-1] + list([pad_idx] * (max_len - len(inst))) + [inst[-1]] for inst in insts]) + else: + inst_data = np.array([inst + list([pad_idx] * (max_len - len(inst))) for inst in insts]) + return_list += [inst_data.astype(insts_data_type).reshape([-1, max_len, 1])] + + # Position id + if return_pos: + inst_pos = np.array([list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) for inst in insts]) + + return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])] + + if return_input_mask: + # This is used to avoid attention on paddings. + if final_cls: + input_mask_data = np.array([[1] * len(inst[:-1]) + [0] * (max_len - len(inst)) + [1] for inst in insts]) + else: + input_mask_data = np.array([[1] * len(inst) + [0] * (max_len - len(inst)) for inst in insts]) + input_mask_data = np.expand_dims(input_mask_data, axis=-1) + return_list += [input_mask_data.astype("float32")] + + if return_max_len: + return_list += [max_len] + + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + + if return_seq_lens: + seq_lens_type = [-1] + seq_lens = np.array([len(inst) for inst in insts]) + return_list += [seq_lens.astype("int64").reshape(seq_lens_type)] + + return return_list if len(return_list) > 1 else return_list[0] + + +class TextPreprocessor(object): + def __call__(self, text): + raise NotImplementedError("TextPreprocessor object can't be called") + + +class ImdbTextPreprocessor(TextPreprocessor): + def __call__(self, text): + text = text.strip().replace("
<br />
", " ") + text = text.replace("\t", "") + return text + + +class HYPTextPreprocessor(TextPreprocessor): + def __init__(self): + self.bs4 = try_import("bs4") + + def __call__(self, text): + text = self.bs4.BeautifulSoup(text, "html.parser").get_text() + text = text.strip().replace("\n", "").replace("\t", "") + return text + + +class ClassifierIterator(object): + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + preprocess_text_fn=None, + ): + self.batch_size = batch_size + self.tokenizer = tokenizer + self.trainer_num = trainer_num + self.trainer_id = trainer_id + self.max_seq_length = max_seq_length + self.memory_len = memory_len + self.repeat_input = repeat_input + self.in_tokens = in_tokens + self.dataset = [data for data in dataset] + self.num_examples = None + self.mode = mode + self.shuffle = True if mode == "train" else False + if random_seed is None: + random_seed = 12345 + self.random_seed = random_seed + self.preprocess_text_fn = preprocess_text_fn + + def shuffle_sample(self): + if self.shuffle: + self.global_rng = np.random.RandomState(self.random_seed) + self.global_rng.shuffle(self.dataset) + + def _cnt_list(self, inp): + """Cnt_list""" + cnt = 0 + for lit in inp: + if lit: + cnt += 1 + return cnt + + def _convert_to_features(self, example, qid): + """ + Convert example to features fed into model + """ + if "text" in example: # imdb + text = example["text"] + elif "sentence" in example: # iflytek + text = example["sentence"] + + if self.preprocess_text_fn: + text = self.preprocess_text_fn(text) + label = example["label"] + doc_spans = [] + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + start_offset = 0 + max_tokens_for_doc = self.max_seq_length - 2 + tokens_a = self.tokenizer.tokenize(text) + while start_offset < len(tokens_a): + length = len(tokens_a) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(tokens_a): + break + start_offset += min(length, self.memory_len) + + features = [] + Feature = namedtuple("Feature", ["src_ids", "label_id", "qid", "cal_loss"]) + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = tokens_a[doc_span.start : doc_span.start + doc_span.length] + ["[SEP]"] + ["[CLS]"] + token_ids = self.tokenizer.convert_tokens_to_ids(tokens) + features.append(Feature(src_ids=token_ids, label_id=label, qid=qid, cal_loss=1)) + + if self.repeat_input: + features_repeat = features + features = list(map(lambda x: x._replace(cal_loss=0), features)) + features = features + features_repeat + return features + + def _get_samples(self, pre_batch_list, is_last=False): + if is_last: + # Pad batch + len_doc = [len(doc) for doc in pre_batch_list] + max_len_idx = len_doc.index(max(len_doc)) + dirty_sample = pre_batch_list[max_len_idx][-1]._replace(cal_loss=0) + for sample_list in pre_batch_list: + sample_list.extend([dirty_sample] * (max(len_doc) - len(sample_list))) + + samples = [] + min_len = min([len(doc) for doc in pre_batch_list]) + for cnt in range(min_len): + for batch_idx in range(self.batch_size * self.trainer_num): + sample = pre_batch_list[batch_idx][cnt] + samples.append(sample) + + for idx in range(len(pre_batch_list)): + pre_batch_list[idx] = pre_batch_list[idx][min_len:] + return samples + + def _pad_batch_records(self, batch_records, gather_idx=[]): + 
batch_token_ids = [record.src_ids for record in batch_records] + if batch_records[0].label_id is not None: + batch_labels = [record.label_id for record in batch_records] + batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + if batch_records[-1].qid is not None: + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + else: + batch_qids = np.array([]).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + # Padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + final_cls=True, + return_input_mask=True, + ) + padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64") + padded_position_ids = get_related_pos(padded_token_ids, self.max_seq_length, self.memory_len) + + return_list = [ + padded_token_ids, + padded_position_ids, + padded_task_ids, + input_mask, + batch_labels, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + return return_list + + def _prepare_batch_data(self, examples): + batch_records, max_len, gather_idx = [], 0, [] + for index, example in enumerate(examples): + max_len = max(max_len, len(example.src_ids)) + if self.in_tokens: + to_append = (len(batch_records) + 1) * max_len <= self.batch_size + else: + to_append = len(batch_records) < self.batch_size + if to_append: + batch_records.append(example) + if example.cal_loss == 1: + gather_idx.append(index % self.batch_size) + else: + yield self._pad_batch_records(batch_records, gather_idx) + batch_records, max_len = [example], len(example.src_ids) + gather_idx = [index % self.batch_size] if example.cal_loss == 1 else [] + yield self._pad_batch_records(batch_records, gather_idx) + + def _create_instances(self): + examples = self.dataset + pre_batch_list = [] + insert_idx = [] + for qid, example in enumerate(examples): + features = self._convert_to_features(example, qid) + if self._cnt_list(pre_batch_list) < self.batch_size * self.trainer_num: + if insert_idx: + pre_batch_list[insert_idx[0]] = features + insert_idx.pop(0) + else: + pre_batch_list.append(features) + if self._cnt_list(pre_batch_list) == self.batch_size * self.trainer_num: + assert self._cnt_list(pre_batch_list) == len(pre_batch_list), "the two value must be equal" + assert not insert_idx, "the insert_idx must be null" + sample_batch = self._get_samples(pre_batch_list) + + for idx, lit in enumerate(pre_batch_list): + if not lit: + insert_idx.append(idx) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + if self.mode != "train": + if self._cnt_list(pre_batch_list): + pre_batch_list += [ + [] for _ in range(self.batch_size * self.trainer_num - self._cnt_list(pre_batch_list)) + ] + sample_batch = self._get_samples(pre_batch_list, is_last=True) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + def __call__(self): + curr_id = 0 + for batch_records in self._create_instances(): + if curr_id == self.trainer_id or self.mode != "train": + yield batch_records + curr_id = (curr_id + 1) % self.trainer_num + + def get_num_examples(self): + if self.num_examples 
is None: + self.num_examples = 0 + for qid, example in enumerate(self.dataset): + self.num_examples += len(self._convert_to_features(example, qid)) + return self.num_examples + + +class MRCIterator(ClassifierIterator): + """ + Machine Reading Comprehension iterator. Only for answer extraction. + """ + + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + doc_stride=128, + max_query_length=64, + ): + super(MRCIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + preprocess_text_fn=None, + ) + self.doc_stride = doc_stride + self.max_query_length = max_query_length + self.examples = [] + self.features = [] + self.features_all = [] + self._preprocess_data() + + def shuffle_sample(self): + if self.shuffle: + self.global_rng = np.random.RandomState(self.random_seed) + self.global_rng.shuffle(self.features_all) + + def _convert_qa_to_examples(self): + Example = namedtuple( + "Example", ["qas_id", "question_text", "doc_tokens", "orig_answer_text", "start_position", "end_position"] + ) + examples = [] + for qa in self.dataset: + qas_id = qa["id"] + question_text = qa["question"] + context = qa["context"] + start_pos = None + end_pos = None + orig_answer_text = None + if self.mode == "train": + if len(qa["answers"]) != 1: + raise ValueError("For training, each question should have exactly 1 answer.") + orig_answer_text = qa["answers"][0] + answer_offset = qa["answer_starts"][0] + answer_length = len(orig_answer_text) + doc_tokens = [ + context[:answer_offset], + context[answer_offset : answer_offset + answer_length], + context[answer_offset + answer_length :], + ] + + start_pos = 1 + end_pos = 1 + + actual_text = " ".join(doc_tokens[start_pos : (end_pos + 1)]) + if orig_answer_text.islower(): + actual_text = actual_text.lower() + if actual_text.find(orig_answer_text) == -1: + logger.info("Could not find answer: '%s' vs. 
'%s'" % (actual_text, orig_answer_text)) + continue + + else: + doc_tokens = tokenize_chinese_chars(context) + + example = Example( + qas_id=qas_id, + question_text=question_text, + doc_tokens=doc_tokens, + orig_answer_text=orig_answer_text, + start_position=start_pos, + end_position=end_pos, + ) + examples.append(example) + return examples + + def _convert_example_to_feature(self, examples): + Feature = namedtuple( + "Feature", + [ + "qid", + "example_index", + "doc_span_index", + "tokens", + "token_to_orig_map", + "token_is_max_context", + "src_ids", + "start_position", + "end_position", + "cal_loss", + ], + ) + features = [] + self.features_all = [] + unique_id = 1000 + is_training = self.mode == "train" + print("total {} examples".format(len(examples)), flush=True) + for (example_index, example) in enumerate(examples): + query_tokens = self.tokenizer.tokenize(example.question_text) + if len(query_tokens) > self.max_query_length: + query_tokens = query_tokens[0 : self.max_query_length] + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + for (i, token) in enumerate(example.doc_tokens): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = self.tokenizer.tokenize(token) + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + + tok_start_position = None + tok_end_position = None + if is_training: + tok_start_position = orig_to_tok_index[example.start_position] + if example.end_position < len(example.doc_tokens) - 1: + tok_end_position = orig_to_tok_index[example.end_position + 1] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = self._improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, example.orig_answer_text + ) + + max_tokens_for_doc = self.max_seq_length - len(query_tokens) - 3 + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = [] + token_to_orig_map = {} + token_is_max_context = {} + tokens.append("[CLS]") + for i in range(doc_span.length): + split_token_index = doc_span.start + i + token_to_orig_map[i + 1] = tok_to_orig_index[split_token_index] + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[i + 1] = is_max_context + tokens += all_doc_tokens[doc_span.start : doc_span.start + doc_span.length] + tokens.append("[SEP]") + + for token in query_tokens: + tokens.append(token) + tokens.append("[SEP]") + + token_ids = self.tokenizer.convert_tokens_to_ids(tokens) + start_position = None + end_position = None + if is_training: + doc_start = doc_span.start + doc_end = doc_span.start + doc_span.length - 1 + out_of_span = False + if not (tok_start_position >= doc_start and tok_end_position <= doc_end): + out_of_span = True + if out_of_span: + start_position = 0 + end_position = 0 + else: + doc_offset = 1 # len(query_tokens) + 2 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + + feature = Feature( + qid=unique_id, + example_index=example_index, + 
doc_span_index=doc_span_index, + tokens=tokens, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + src_ids=token_ids, + start_position=start_position, + end_position=end_position, + cal_loss=1, + ) + features.append(feature) + features_each.append(feature) + if example_index % 1000 == 0: + print("processing {} examples".format(example_index), flush=True) + + unique_id += 1 + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return features + + def _preprocess_data(self): + # Construct examples + self.examples = self._convert_qa_to_examples() + # Construct features + self.features = self._convert_example_to_feature(self.examples) + + def get_num_examples(self): + if not self.features_all: + self._preprocess_data() + return len(sum(self.features_all, [])) + + def _improve_answer_span(self, doc_tokens, input_start, input_end, orig_answer_text): + """Improve answer span""" + tok_answer_text = " ".join(self.tokenizer.tokenize(orig_answer_text)) + + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = " ".join(doc_tokens[new_start : (new_end + 1)]) + if text_span == tok_answer_text: + return (new_start, new_end) + + return (input_start, input_end) + + def _check_is_max_context(self, doc_spans, cur_span_index, position): + """Check is max context""" + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + break + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + if best_span_index > cur_span_index: + return False + + return cur_span_index == best_span_index + + def _pad_batch_records(self, batch_records, gather_idx=[]): + """Pad batch data""" + batch_token_ids = [record.src_ids for record in batch_records] + + if self.mode == "train": + batch_start_position = [record.start_position for record in batch_records] + batch_end_position = [record.end_position for record in batch_records] + batch_start_position = np.array(batch_start_position).astype("int64").reshape([-1, 1]) + batch_end_position = np.array(batch_end_position).astype("int64").reshape([-1, 1]) + else: + batch_size = len(batch_token_ids) + batch_start_position = np.zeros(shape=[batch_size, 1], dtype="int64") + batch_end_position = np.zeros(shape=[batch_size, 1], dtype="int64") + + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + # padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64") + padded_position_ids = get_related_pos(padded_task_ids, 
self.max_seq_length, self.memory_len) + + return_list = [ + padded_token_ids, + padded_position_ids, + padded_task_ids, + input_mask, + batch_start_position, + batch_end_position, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + + return return_list + + def _create_instances(self): + """Generate batch records""" + pre_batch_list = [] + insert_idx = [] + for qid, features in enumerate(self.features_all): + if self._cnt_list(pre_batch_list) < self.batch_size * self.trainer_num: + if insert_idx: + pre_batch_list[insert_idx[0]] = features + insert_idx.pop(0) + else: + pre_batch_list.append(features) + if self._cnt_list(pre_batch_list) == self.batch_size * self.trainer_num: + assert self._cnt_list(pre_batch_list) == len(pre_batch_list), "the two value must be equal" + assert not insert_idx, "the insert_idx must be null" + sample_batch = self._get_samples(pre_batch_list) + + for idx, lit in enumerate(pre_batch_list): + if not lit: + insert_idx.append(idx) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + if self.mode != "train": + if self._cnt_list(pre_batch_list): + pre_batch_list += [ + [] for _ in range(self.batch_size * self.trainer_num - self._cnt_list(pre_batch_list)) + ] + sample_batch = self._get_samples(pre_batch_list, is_last=True) + for batch_records in self._prepare_batch_data(sample_batch): + yield batch_records + + +class MCQIterator(MRCIterator): + """ + Multiple choice question iterator. + """ + + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + doc_stride=128, + max_query_length=64, + choice_num=4, + ): + self.choice_num = choice_num + super(MCQIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + ) + + def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. 
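+        # For example, with max_length=7, a 6-token question and a 3-token choice,
+        # two tokens are popped from the question side, leaving 4 + 3 = 7 tokens.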
+ tokens_a = list(tokens_a) + tokens_b = list(tokens_b) + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + return tokens_a, tokens_b + + def _convert_qa_to_examples(self): + Example = namedtuple("Example", ["qas_id", "context", "question", "choice", "label"]) + examples = [] + for qas_id, qa in enumerate(self.dataset): + context = "\n".join(qa["context"]).lower() + question = qa["question"].lower() + choice = [c.lower() for c in qa["choice"]] + # pad empty choice + for k in range(len(choice), self.choice_num): + choice.append("") + label = qa["label"] + + example = Example(qas_id=qas_id, context=context, question=question, choice=choice, label=label) + examples.append(example) + return examples + + def _convert_example_to_feature(self, examples): + Feature = namedtuple("Feature", ["qid", "src_ids", "segment_ids", "label", "cal_loss"]) + features = [] + self.features_all = [] + for (ex_index, example) in enumerate(examples): + context_tokens = self.tokenizer.tokenize(example.context) + question_tokens = self.tokenizer.tokenize(example.question) + choice_tokens_lst = [self.tokenizer.tokenize(choice) for choice in example.choice] + # nums = 4 + question_choice_pairs = [ + self._truncate_seq_pair(question_tokens, choice_tokens, self.max_query_length - 2) + for choice_tokens in choice_tokens_lst + ] + total_qc_num = sum([(len(q) + len(c)) for q, c in question_choice_pairs]) + max_tokens_for_doc = self.max_seq_length - total_qc_num - 4 + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + + while start_offset < len(context_tokens): + length = len(context_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(context_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + qa_features = [] + for q_tokens, c_tokens in question_choice_pairs: + segment_tokens = ["[CLS]"] + token_type_ids = [0] + + segment_tokens += context_tokens[doc_span.start : doc_span.start + doc_span.length] + token_type_ids += [0] * doc_span.length + + segment_tokens += ["[SEP]"] + token_type_ids += [0] + + segment_tokens += q_tokens + token_type_ids += [1] * len(q_tokens) + + segment_tokens += ["[SEP]"] + token_type_ids += [1] + + segment_tokens += c_tokens + token_type_ids += [1] * len(c_tokens) + + segment_tokens += ["[SEP]"] + token_type_ids += [1] + + input_ids = self.tokenizer.convert_tokens_to_ids(segment_tokens) + feature = Feature( + qid=example.qas_id, + label=example.label, + src_ids=input_ids, + segment_ids=token_type_ids, + cal_loss=1, + ) + qa_features.append(feature) + + features.append(qa_features) + features_each.append(qa_features) + + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return features + + def _pad_batch_records(self, batch_records, gather_idx=[]): + batch_token_ids = [[record.src_ids for record in records] for records in batch_records] + if batch_records[0][0].label is not None: + batch_labels = [[record.label for record in records] for records in batch_records] + batch_labels = 
np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + batch_qids = [[record.qid for record in records] for records in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + batch_task_ids = [[record.segment_ids for record in records] for records in batch_records] + + # Padding + batch_padded_token_ids = [] + batch_input_mask = [] + batch_padded_task_ids = [] + batch_padded_position_ids = [] + batch_size = len(batch_token_ids) + for i in range(batch_size): + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids[i], + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = pad_batch_data( + batch_task_ids[i], pad_idx=self.tokenizer.pad_token_id, pad_max_len=self.max_seq_length + ) + + padded_position_ids = get_related_pos(padded_task_ids, self.max_seq_length, self.memory_len) + + batch_padded_token_ids.append(padded_token_ids) + batch_input_mask.append(input_mask) + batch_padded_task_ids.append(padded_task_ids) + batch_padded_position_ids.append(padded_position_ids) + + batch_padded_token_ids = ( + np.array(batch_padded_token_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_padded_position_ids = ( + np.array(batch_padded_position_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_padded_task_ids = ( + np.array(batch_padded_task_ids).astype("int64").reshape([batch_size * self.choice_num, -1, 1]) + ) + batch_input_mask = np.array(batch_input_mask).astype("float32").reshape([batch_size * self.choice_num, -1, 1]) + + return_list = [ + batch_padded_token_ids, + batch_padded_position_ids, + batch_padded_task_ids, + batch_input_mask, + batch_labels, + batch_qids, + batch_gather_idx, + need_cal_loss, + ] + return return_list + + def _prepare_batch_data(self, examples_list): + batch_records, max_len, gather_idx = [], 0, [] + real_batch_size = self.batch_size * self.choice_num + index = 0 + for examples in examples_list: + records = [] + gather_idx_candidate = [] + for example in examples: + if example.cal_loss == 1: + gather_idx_candidate.append(index % real_batch_size) + max_len = max(max_len, len(example.src_ids)) + records.append(example) + index += 1 + + if self.in_tokens: + to_append = (len(batch_records) + 1) * self.choice_num * max_len <= self.batch_size + else: + to_append = len(batch_records) < self.batch_size + if to_append: + batch_records.append(records) + gather_idx += gather_idx_candidate + else: + yield self._pad_batch_records(batch_records, gather_idx) + batch_records, max_len = [records], max(len(record.src_ids) for record in records) + gather_idx = gather_idx_candidate + if len(batch_records) > 0: + yield self._pad_batch_records(batch_records, gather_idx) + + def _get_samples(self, pre_batch_list, is_last=False): + if is_last: + # Pad batch + len_doc = [[len(doc) for doc in doc_list] for doc_list in pre_batch_list] + len_doc = list(itertools.chain(*len_doc)) + max_len_idx = len_doc.index(max(len_doc)) + doc_idx = max_len_idx % self.choice_num + doc_list_idx = max_len_idx // self.choice_num + dirty_sample = 
pre_batch_list[doc_list_idx][doc_idx][-1]._replace(cal_loss=0) + for sample_list in pre_batch_list: + for samples in sample_list: + samples.extend([dirty_sample] * (max(len_doc) - len(samples))) + samples = [] + min_len = min([len(doc) for doc in pre_batch_list]) + for cnt in range(min_len): + for batch_idx in range(self.batch_size * self.trainer_num): + sample = pre_batch_list[batch_idx][cnt] + samples.append(sample) + + for idx in range(len(pre_batch_list)): + pre_batch_list[idx] = pre_batch_list[idx][min_len:] + return samples + + +class SemanticMatchingIterator(MRCIterator): + def _convert_qa_to_examples(self): + Example = namedtuple("Example", ["qid", "text_a", "text_b", "text_c", "label"]) + examples = [] + for qid, qa in enumerate(self.dataset): + text_a, text_b, text_c = list( + map(lambda x: x.replace("\n", "").strip(), [qa["text_a"], qa["text_b"], qa["text_c"]]) + ) + + example = Example(qid=qid, text_a=text_a, text_b=text_b, text_c=text_c, label=qa["label"]) + examples += [example] + return examples + + def _create_tokens_and_type_id(self, text_a_tokens, text_b_tokens, start, length): + tokens = ( + ["[CLS]"] + + text_a_tokens[start : start + length] + + ["[SEP]"] + + text_b_tokens[start : start + length] + + ["[SEP]"] + ) + token_type_ids = [0] + [0] * (length + 1) + [1] * (length + 1) + return tokens, token_type_ids + + def _convert_example_to_feature(self, examples): + Feature = namedtuple( + "Feature", ["qid", "src_ids", "segment_ids", "pair_src_ids", "pair_segment_ids", "label", "cal_loss"] + ) + features = [] + self.features_all = [] + for (ex_index, example) in enumerate(examples): + text_a_tokens = self.tokenizer.tokenize(example.text_a) + text_b_tokens = self.tokenizer.tokenize(example.text_b) + text_c_tokens = self.tokenizer.tokenize(example.text_c) + a_len, b_len, c_len = list(map(lambda x: len(x), [text_a_tokens, text_b_tokens, text_c_tokens])) + + # Align 3 text + min_text_len = min([a_len, b_len, c_len]) + text_a_tokens = text_a_tokens[:min_text_len] + text_b_tokens = text_b_tokens[:min_text_len] + text_c_tokens = text_c_tokens[:min_text_len] + + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + + max_tokens_for_doc = (self.max_seq_length - 3) // 2 + + while start_offset < len(text_a_tokens): + length = len(text_a_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(text_a_tokens): + break + start_offset += min(length, self.doc_stride) + + features_each = [] + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens1, token_type_ids1 = self._create_tokens_and_type_id( + text_a_tokens, text_b_tokens, doc_span.start, doc_span.length + ) + tokens2, token_type_ids2 = self._create_tokens_and_type_id( + text_a_tokens, text_c_tokens, doc_span.start, doc_span.length + ) + + input_ids1 = self.tokenizer.convert_tokens_to_ids(tokens1) + input_ids2 = self.tokenizer.convert_tokens_to_ids(tokens2) + feature = Feature( + qid=example.qid, + label=example.label, + src_ids=input_ids1, + segment_ids=token_type_ids1, + pair_src_ids=input_ids2, + pair_segment_ids=token_type_ids2, + cal_loss=1, + ) + + features.append(feature) + features_each.append(feature) + + # Repeat + if self.repeat_input: + features_each_repeat = features_each + features_each = list(map(lambda x: x._replace(cla_loss=0), features_each)) + features_each += features_each_repeat + + self.features_all.append(features_each) + + return 
features + + def _create_pad_ids(self, batch_records, prefix=""): + src_ids = prefix + "src_ids" + segment_ids = prefix + "segment_ids" + batch_token_ids = [getattr(record, src_ids) for record in batch_records] + batch_task_ids = [getattr(record, segment_ids) for record in batch_records] + + # Padding + padded_token_ids, input_mask = pad_batch_data( + batch_token_ids, + pad_idx=self.tokenizer.pad_token_id, + pad_max_len=self.max_seq_length, + return_input_mask=True, + ) + padded_task_ids = pad_batch_data( + batch_task_ids, pad_idx=self.tokenizer.pad_token_id, pad_max_len=self.max_seq_length + ) + + padded_position_ids = get_related_pos(padded_task_ids, self.max_seq_length, self.memory_len) + + return [padded_token_ids, padded_position_ids, padded_task_ids, input_mask] + + def _pad_batch_records(self, batch_records, gather_idx=[]): + if batch_records[0].label is not None: + batch_labels = [record.label for record in batch_records] + batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + else: + batch_labels = np.array([]).astype("int64").reshape([-1, 1]) + # Qid + batch_qids = [record.qid for record in batch_records] + batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1]) + + if gather_idx: + batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([1]).astype("int64") + else: + batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1]) + need_cal_loss = np.array([0]).astype("int64") + + return_list = ( + self._create_pad_ids(batch_records) + + self._create_pad_ids(batch_records, "pair_") + + [batch_labels, batch_qids, batch_gather_idx, need_cal_loss] + ) + return return_list + + +class SequenceLabelingIterator(ClassifierIterator): + def __init__( + self, + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length=512, + memory_len=128, + repeat_input=False, + in_tokens=False, + mode="train", + random_seed=None, + no_entity_id=-1, + ): + super(SequenceLabelingIterator, self).__init__( + dataset, + batch_size, + tokenizer, + trainer_num, + trainer_id, + max_seq_length, + memory_len, + repeat_input, + in_tokens, + mode, + random_seed, + preprocess_text_fn=None, + ) + self.no_entity_id = no_entity_id + + def _convert_to_features(self, example, qid): + """ + Convert example to features fed into model + """ + tokens = example["tokens"] + label = example["labels"] + doc_spans = [] + _DocSpan = namedtuple("DocSpan", ["start", "length"]) + start_offset = 0 + max_tokens_for_doc = self.max_seq_length - 2 + while start_offset < len(tokens): + length = len(tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(tokens): + break + start_offset += min(length, self.memory_len) + + features = [] + Feature = namedtuple("Feature", ["src_ids", "label_ids", "qid", "cal_loss"]) + for (doc_span_index, doc_span) in enumerate(doc_spans): + curr_tokens = ["[CLS]"] + tokens[doc_span.start : doc_span.start + doc_span.length] + ["[SEP]"] + token_ids = self.tokenizer.convert_tokens_to_ids(curr_tokens) + label = ( + [self.no_entity_id] + label[doc_span.start : doc_span.start + doc_span.length] + [self.no_entity_id] + ) + + features.append(Feature(src_ids=token_ids, label_ids=label, qid=qid, cal_loss=1)) + + if self.repeat_input: + features_repeat = features + features = list(map(lambda x: x._replace(cal_loss=0), features)) + features = features + features_repeat 
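+        # When repeat_input is enabled, `features` now holds a loss-masked (cal_loss=0)
+        # copy of every span followed by the original spans, so each document is read
+        # twice, matching the retrospective (two-pass) feed described in ERNIE-Doc.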
+        return features
+
+    def _pad_batch_records(self, batch_records, gather_idx=[]):
+        batch_token_ids = [record.src_ids for record in batch_records]
+        batch_length = [len(record.src_ids) for record in batch_records]
+        batch_length = np.array(batch_length).astype("int64").reshape([-1, 1])
+
+        if batch_records[0].label_ids is not None:
+            batch_labels = [record.label_ids for record in batch_records]
+        else:
+            batch_labels = np.array([]).astype("int64").reshape([-1, 1])
+        # Qid
+        if batch_records[-1].qid is not None:
+            batch_qids = [record.qid for record in batch_records]
+            batch_qids = np.array(batch_qids).astype("int64").reshape([-1, 1])
+        else:
+            batch_qids = np.array([]).astype("int64").reshape([-1, 1])
+
+        if gather_idx:
+            batch_gather_idx = np.array(gather_idx).astype("int64").reshape([-1, 1])
+            need_cal_loss = np.array([1]).astype("int64")
+        else:
+            batch_gather_idx = np.array(list(range(len(batch_records)))).astype("int64").reshape([-1, 1])
+            need_cal_loss = np.array([0]).astype("int64")
+        # Padding
+        padded_token_ids, input_mask = pad_batch_data(
+            batch_token_ids,
+            pad_idx=self.tokenizer.pad_token_id,
+            pad_max_len=self.max_seq_length,
+            return_input_mask=True,
+        )
+        if batch_records[0].label_ids is not None:
+            padded_batch_labels = pad_batch_data(
+                batch_labels, pad_idx=self.no_entity_id, pad_max_len=self.max_seq_length
+            )
+        else:
+            # Fall back to the empty label array built above so that
+            # padded_batch_labels is always defined when no labels are provided.
+            padded_batch_labels = batch_labels
+        padded_task_ids = np.zeros_like(padded_token_ids, dtype="int64")
+        padded_position_ids = get_related_pos(padded_token_ids, self.max_seq_length, self.memory_len)
+
+        return_list = [
+            padded_token_ids,
+            padded_position_ids,
+            padded_task_ids,
+            input_mask,
+            padded_batch_labels,
+            batch_length,
+            batch_qids,
+            batch_gather_idx,
+            need_cal_loss,
+        ]
+        return return_list
diff --git a/model_zoo/ernie-doc/metrics.py b/model_zoo/ernie-doc/metrics.py
new file mode 100644
index 0000000000000000000000000000000000000000..41777f1d106cc1035b50297ca140a6b1a1b9ef92
--- /dev/null
+++ b/model_zoo/ernie-doc/metrics.py
@@ -0,0 +1,367 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
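+
+# This module collects the evaluation utilities used by the ERNIE-Doc fine-tuning
+# scripts: a binary F1 metric for classification, an exact-match/F1 metric for
+# Chinese machine reading comprehension, and an n-best answer-span extraction helper.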
+ +import collections +import sys + +import numpy as np +import paddle +from paddle.utils import try_import + +from paddlenlp.metrics.dureader import ( + _compute_softmax, + _get_best_indexes, + get_final_text, +) + +# Metric for ERNIE-DOCs + + +class F1(object): + def __init__(self, positive_label=1): + self.positive_label = positive_label + self.reset() + + def compute(self, preds, labels): + if isinstance(preds, paddle.Tensor): + preds = preds.numpy() + elif isinstance(preds, list): + preds = np.array(preds, dtype="float32") + if isinstance(labels, list): + labels = np.array(labels, dtype="int64") + elif isinstance(labels, paddle.Tensor): + labels = labels.numpy() + preds = np.argmax(preds, axis=1) + tp = ((preds == labels) & (labels == self.positive_label)).sum() + fn = ((preds != labels) & (labels == self.positive_label)).sum() + fp = ((preds != labels) & (preds == self.positive_label)).sum() + return tp, fp, fn + + def update(self, statistic): + tp, fp, fn = statistic + self.tp += tp + self.fp += fp + self.fn += fn + + def accumulate(self): + recall = self.tp / (self.tp + self.fn) + precision = self.tp / (self.tp + self.fp) + f1 = 2 * recall * precision / (recall + precision) + return f1 + + def reset(self): + self.tp = 0 + self.fp = 0 + self.fn = 0 + + +class EM_AND_F1(object): + def __init__(self): + self.nltk = try_import("nltk") + self.re = try_import("re") + + def _mixed_segmentation(self, in_str, rm_punc=False): + """mixed_segmentation""" + in_str = in_str.lower().strip() + segs_out = [] + temp_str = "" + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + for char in in_str: + if rm_punc and char in sp_char: + continue + pattern = "[\\u4e00-\\u9fa5]" + if self.re.search(pattern, char) or char in sp_char: + if temp_str != "": + ss = self.nltk.word_tokenize(temp_str) + segs_out.extend(ss) + temp_str = "" + segs_out.append(char) + else: + temp_str += char + + # Handling last part + if temp_str != "": + ss = self.nltk.word_tokenize(temp_str) + segs_out.extend(ss) + + return segs_out + + # Remove punctuation + def _remove_punctuation(self, in_str): + """remove_punctuation""" + in_str = in_str.lower().strip() + sp_char = [ + "-", + ":", + "_", + "*", + "^", + "/", + "\\", + "~", + "`", + "+", + "=", + ",", + "。", + ":", + "?", + "!", + "“", + "”", + ";", + "’", + "《", + "》", + "……", + "·", + "、", + "「", + "」", + "(", + ")", + "-", + "~", + "『", + "』", + ] + out_segs = [] + for char in in_str: + if char in sp_char: + continue + else: + out_segs.append(char) + return "".join(out_segs) + + # Find longest common string + def _find_lcs(self, s1, s2): + m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)] + mmax = 0 + p = 0 + for i in range(len(s1)): + for j in range(len(s2)): + if s1[i] == s2[j]: + m[i + 1][j + 1] = m[i][j] + 1 + if m[i + 1][j + 1] > mmax: + mmax = m[i + 1][j + 1] + p = i + 1 + return s1[p - mmax : p], mmax + + def _calc_f1_score(self, answers, prediction): + f1_scores = [] + for ans in answers: + ans_segs = self._mixed_segmentation(ans, rm_punc=True) + prediction_segs = self._mixed_segmentation(prediction, rm_punc=True) + lcs, lcs_len = self._find_lcs(ans_segs, prediction_segs) + if lcs_len == 0: + f1_scores.append(0) + continue + precision = 1.0 * lcs_len / len(prediction_segs) + recall = 1.0 * lcs_len / len(ans_segs) + f1 = (2 * precision * recall) / 
(precision + recall) + f1_scores.append(f1) + return max(f1_scores) + + def _calc_em_score(self, answers, prediction): + em = 0 + for ans in answers: + ans_ = self._remove_punctuation(ans) + prediction_ = self._remove_punctuation(prediction) + if ans_ == prediction_: + em = 1 + break + return em + + def __call__(self, prediction, ground_truth): + f1 = 0 + em = 0 + total_count = 0 + skip_count = 0 + for instance in ground_truth: + total_count += 1 + query_id = instance["id"] + answers = instance["answers"] + if query_id not in prediction: + sys.stderr.write("Unanswered question: {}\n".format(query_id)) + skip_count += 1 + continue + preds = str(prediction[query_id]) + f1 += self._calc_f1_score(answers, preds) + em += self._calc_em_score(answers, preds) + + f1_score = 100.0 * f1 / total_count + em_score = 100.0 * em / total_count + + avg_score = (f1_score + em_score) * 0.5 + return em_score, f1_score, avg_score, total_count + + +def compute_qa_predictions( + all_examples, all_features, all_results, n_best_size, max_answer_length, do_lower_case, tokenizer, verbose +): + """Write final predictions to the json file and log-odds of null if needed.""" + + example_index_to_features = collections.defaultdict(list) + for feature in all_features: + example_index_to_features[feature.example_index].append(feature) + + unique_id_to_result = {} + for result in all_results: + unique_id_to_result[result.unique_id] = result + + _PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", ["feature_index", "start_index", "end_index", "start_logit", "end_logit"] + ) + + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + + for (example_index, example) in enumerate(all_examples): + features = example_index_to_features[example_index] + + prelim_predictions = [] + # Keep track of the minimum score of null start+end of position 0 + for (feature_index, feature) in enumerate(features): + result = unique_id_to_result[feature.qid] + start_indexes = _get_best_indexes(result.start_logits, n_best_size) + end_indexes = _get_best_indexes(result.end_logits, n_best_size) + + for start_index in start_indexes: + for end_index in end_indexes: + # We could hypothetically create invalid predictions, e.g., predict + # that the start of the span is in the question. We throw out all + # invalid predictions. 
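+                    # A token can appear in several overlapping doc spans; token_is_max_context
+                    # marks the single span in which it has the most surrounding context, and an
+                    # answer is only allowed to start there (the usual BERT-style span filtering).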
+ if start_index >= len(feature.tokens): + continue + if end_index >= len(feature.tokens): + continue + if start_index not in feature.token_to_orig_map: + continue + if end_index not in feature.token_to_orig_map: + continue + if not feature.token_is_max_context.get(start_index, False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + _PrelimPrediction( + feature_index=feature_index, + start_index=start_index, + end_index=end_index, + start_logit=result.start_logits[start_index], + end_logit=result.end_logits[end_index], + ) + ) + + prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True) + + _NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name + "NbestPrediction", ["text", "start_logit", "end_logit"] + ) + + seen_predictions = {} + nbest = [] + for pred in prelim_predictions: + if len(nbest) >= n_best_size: + break + feature = features[pred.feature_index] + if pred.start_index > 0: # this is a non-null prediction + tok_tokens = feature.tokens[pred.start_index : (pred.end_index + 1)] + orig_doc_start = feature.token_to_orig_map[pred.start_index] + orig_doc_end = feature.token_to_orig_map[pred.end_index] + orig_tokens = example.doc_tokens[orig_doc_start : (orig_doc_end + 1)] + tok_text = " ".join(tok_tokens) + + # De-tokenize WordPieces that have been split off. + tok_text = tok_text.replace(" ##", "") + tok_text = tok_text.replace("##", "") + + # Clean whitespace + tok_text = tok_text.strip() + tok_text = " ".join(tok_text.split()) + orig_text = "".join(orig_tokens) + + final_text = get_final_text(tok_text, orig_text, tokenizer, verbose) + if final_text in seen_predictions: + continue + + seen_predictions[final_text] = True + else: + final_text = "" + seen_predictions[final_text] = True + + nbest.append(_NbestPrediction(text=final_text, start_logit=pred.start_logit, end_logit=pred.end_logit)) + + # In very rare edge cases we could have no valid predictions. So we + # just create a nonce prediction in this case to avoid failure. + if not nbest: + nbest.append(_NbestPrediction(text="empty", start_logit=0.0, end_logit=0.0)) + + total_scores = [] + for entry in nbest: + total_scores.append(entry.start_logit + entry.end_logit) + + probs = _compute_softmax(total_scores) + + nbest_json = [] + for (i, entry) in enumerate(nbest): + output = collections.OrderedDict() + output["text"] = entry.text + output["probability"] = probs[i] + output["start_logit"] = entry.start_logit + output["end_logit"] = entry.end_logit + nbest_json.append(output) + + assert len(nbest_json) >= 1 + + all_predictions[example.qas_id] = nbest_json[0]["text"] + all_nbest_json[example.qas_id] = nbest_json + return all_predictions, all_nbest_json diff --git a/model_zoo/ernie-doc/model.py b/model_zoo/ernie-doc/model.py new file mode 100644 index 0000000000000000000000000000000000000000..5fd3e622596a490fb3c1337abaed8c0189ba9390 --- /dev/null +++ b/model_zoo/ernie-doc/model.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle.nn as nn + + +class ErnieDocForTextMatching(nn.Layer): + def __init__(self, ernie_doc, num_classes=2, dropout=None): + super().__init__() + self.ernie_doc = ernie_doc + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + self.classifier = nn.Linear(ernie_doc.config["hidden_size"], num_classes) + + def forward( + self, + query_input_ids, + title_input_ids, + query_memories, + title_memories, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + ): + + _, query_pooled_output, query_mem, = self.ernie_doc( + query_input_ids, query_memories, query_token_type_ids, query_position_ids, query_attention_mask + ) + + _, title_pooled_output, title_mem = self.ernie_doc( + title_input_ids, title_memories, title_token_type_ids, title_position_ids, title_attention_mask + ) + + diff_pooled_output = query_pooled_output - title_pooled_output + diff_pooled_output = self.dropout(diff_pooled_output) + output = self.classifier(diff_pooled_output) + return output, query_mem, title_mem diff --git a/model_zoo/ernie-doc/run_classifier.py b/model_zoo/ernie-doc/run_classifier.py new file mode 100644 index 0000000000000000000000000000000000000000..a2985189f4162e54f9c04d3610cff2f4bb07084f --- /dev/null +++ b/model_zoo/ernie-doc/run_classifier.py @@ -0,0 +1,323 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import os +import random +import time +from collections import defaultdict +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import ClassifierIterator, HYPTextPreprocessor, ImdbTextPreprocessor +from metrics import F1 +from paddle.metric import Accuracy +from paddle.optimizer import AdamW + +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocBPETokenizer, + ErnieDocForSequenceClassification, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-en", help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=7e-5, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=3, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--dataset", default="imdb", choices=["imdb", "iflytek", "thucnews", "hyp"], type=str, help="The training dataset") +parser.add_argument("--layerwise_decay", default=1.0, type=float, help="Layerwise decay ratio") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.",) +args = parser.parse_args() +# fmt: on + +# tokenizer, eval_dataset, test_dataset, preprocess_text_fn, metric +# BPETokenizer for English Tasks +# ErnieDocTokenizer for Chinese Tasks + +DATASET_INFO = { + "imdb": (ErnieDocBPETokenizer, "test", "test", ImdbTextPreprocessor(), Accuracy()), + "hyp": (ErnieDocBPETokenizer, "dev", "test", HYPTextPreprocessor(), F1()), + "iflytek": (ErnieDocTokenizer, "validation", "test", None, Accuracy()), + "thucnews": (ErnieDocTokenizer, "dev", "test", None, Accuracy()), +} + + +def set_seed(args): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(args.seed) + np.random.seed(args.seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return [paddle.zeros([batch_size, memory_length, d_model], dtype="float32") for _ in range(n_layers)] + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, memories0): + model.eval() + losses = [] + # copy the memory + memories = list(memories0) + tic_train = time.time() + eval_logging_step = 500 + + probs_dict = defaultdict(list) + label_dict = dict() + for step, batch in enumerate(data_loader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, labels, qids])) + # Need to collect probs for each qid, so use softmax_with_cross_entropy + loss, probs = nn.functional.softmax_with_cross_entropy(logits, labels, return_softmax=True) + losses.append(loss.mean().numpy()) + # Shape: [B, NUM_LABELS] + np_probs = probs.numpy() + # Shape: [B, 1] + np_qids = qids.numpy() + np_labels = labels.numpy().flatten() + for i, qid in enumerate(np_qids.flatten()): + probs_dict[qid].append(np_probs[i]) + label_dict[qid] = np_labels[i] # Same qid share same label. + + if step % eval_logging_step == 0: + logger.info( + "Step %d: loss: %.5f, speed: %.5f steps/s" + % (step, np.mean(losses), eval_logging_step / (time.time() - tic_train)) + ) + tic_train = time.time() + + # Collect predicted labels + preds = [] + labels = [] + for qid, probs in probs_dict.items(): + mean_prob = np.mean(np.array(probs), axis=0) + preds.append(mean_prob) + labels.append(label_dict[qid]) + + preds = paddle.to_tensor(np.array(preds, dtype="float32")) + labels = paddle.to_tensor(np.array(labels, dtype="int64")) + + metric.update(metric.compute(preds, labels)) + acc_or_f1 = metric.accumulate() + logger.info("Eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric.__class__.__name__, acc_or_f1)) + metric.reset() + model.train() + return acc_or_f1 + + +def do_train(args): + set_seed(args) + tokenizer_class, eval_name, test_name, preprocess_text_fn, eval_metric = DATASET_INFO[args.dataset] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + if args.dataset in ["hyp", "thucnews"]: + from paddlenlp.datasets import load_dataset + + train_ds, eval_ds, test_ds = load_dataset(args.dataset, splits=["train", eval_name, test_name]) + num_classes = len(train_ds.label_list) + + else: + from datasets import load_dataset + + # Get dataset + if args.dataset == "iflytek": + + train_ds, eval_ds, test_ds = load_dataset("clue", name=args.dataset, split=["train", eval_name, test_name]) + else: + train_ds, eval_ds = load_dataset(args.dataset, split=["train", eval_name]) + test_ds = eval_ds + num_classes = train_ds.features["label"].num_classes + + # Initialize model + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + model = ErnieDocForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + + train_ds_iter = ClassifierIterator( + train_ds, + 
args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + preprocess_text_fn=preprocess_text_fn, + ) + eval_ds_iter = ClassifierIterator( + eval_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="eval", + preprocess_text_fn=preprocess_text_fn, + ) + test_ds_iter = ClassifierIterator( + test_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="test", + preprocess_text_fn=preprocess_text_fn, + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
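+    # `decay_params` is consumed by AdamW's apply_decay_param_fun below, so weight
+    # decay is only applied to the listed parameters; `simple_lr_setting` is passed
+    # as lr_ratio so that, when --layerwise_decay < 1, lower transformer layers are
+    # trained with proportionally smaller learning rates than the top layers.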
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_steps = 0 + best_acc = -1 + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + tic_train = time.time() + stop_training = False + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + global_steps += 1 + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idx, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + + logits, labels = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels])) + loss = criterion(logits, labels) * need_cal_loss + mean_loss = loss.mean() + mean_loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + # Rough acc result, not a precise acc + acc = metric.compute(logits, labels) * need_cal_loss + metric.update(acc) + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, acc:%f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + mean_loss, + metric.accumulate(), + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0: + # Evaluate + logger.info("Eval:") + eval_acc = evaluate(model, eval_metric, eval_dataloader, create_memory()) + # Save + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if eval_acc > best_acc: + logger.info("Save best model......") + best_acc = eval_acc + best_model_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(best_model_dir): + os.makedirs(best_model_dir) + model_to_save.save_pretrained(best_model_dir) + tokenizer.save_pretrained(best_model_dir) + + if args.max_steps > 0 and global_steps >= args.max_steps: + stop_training = True + break + if stop_training: + break + logger.info("Final test result:") + eval_acc = evaluate(model, eval_metric, test_dataloader, create_memory()) + + +if __name__ == "__main__": + do_train(args) diff --git a/model_zoo/ernie-doc/run_mcq.py b/model_zoo/ernie-doc/run_mcq.py new file mode 100644 index 0000000000000000000000000000000000000000..4050959fa8c756a114de22157a429210d87d834a --- /dev/null +++ b/model_zoo/ernie-doc/run_mcq.py @@ -0,0 +1,306 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from collections import defaultdict +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import MCQIterator +from paddle.metric import Accuracy +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocForSequenceClassification, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=1e-4, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=8, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--dataset", default="c3", choices=["c3"], type=str, help="The training dataset") +parser.add_argument("--layerwise_decay", default=0.8, type=float, help="Layerwise decay ratio") +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--gradient_accumulation_steps", default=4, type=int, help="Number of updates steps to accumualte before performing a backward/update pass.") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",) +args = parser.parse_args() +# fmt: on + + +DATASET_INFO = { + "c3": (ErnieDocTokenizer, "dev", "test", Accuracy()), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return [paddle.zeros([batch_size, memory_length, d_model], dtype="float32") for _ in range(n_layers)] + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, memories0, choice_num): + model.eval() + losses = [] + # Copy the memory + memories = list(memories0) + tic_train = time.time() + eval_logging_step = 500 + + probs_dict = defaultdict(list) + label_dict = dict() + for step, batch in enumerate(data_loader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels, qids = list(map(lambda x: paddle.gather(x, gather_idxs), [logits, labels, qids])) + logits = logits.reshape([-1, choice_num]) + labels = labels.reshape([-1, choice_num, 1])[:, 0] + qids = qids.reshape([-1, choice_num, 1])[:, 0] + loss, probs = nn.functional.softmax_with_cross_entropy(logits, labels, return_softmax=True) + losses.append(loss.mean().numpy()) + # Shape: [B, NUM_LABELS] + np_probs = probs.numpy() + # Shape: [B, 1] + np_qids = qids.numpy().flatten() + np_labels = labels.numpy().flatten() + for i, qid in enumerate(np_qids): + probs_dict[qid].append(np_probs[i]) + label_dict[qid] = np_labels[i] # Same qid share same label. + + if step % eval_logging_step == 0: + logger.info( + "Step %d: loss: %.5f, speed: %.5f steps/s" + % (step, np.mean(losses), eval_logging_step / (time.time() - tic_train)) + ) + tic_train = time.time() + + # Collect predicted labels + preds = [] + labels = [] + logger.info("Total {} qustion".format(len(probs_dict))) + for qid, probs in probs_dict.items(): + mean_prob = np.mean(np.array(probs), axis=0) + preds.append(mean_prob) + labels.append(label_dict[qid]) + + preds = paddle.to_tensor(np.array(preds, dtype="float32")) + labels = paddle.to_tensor(np.array(labels, dtype="int64")) + + metric.update(metric.compute(preds, labels)) + acc_or_f1 = metric.accumulate() + metric.reset() + logger.info("Eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric.__class__.__name__, acc_or_f1)) + model.train() + return acc_or_f1 + + +def do_train(args): + set_seed(args) + tokenizer_class, eval_name, test_name, eval_metric = DATASET_INFO[args.dataset] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + # Get dataset + train_ds, eval_ds, test_ds = load_dataset(args.dataset, splits=["train", eval_name, test_name]) + + num_classes = len(train_ds.label_list) + + # Initialize model + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + model = ErnieDocForSequenceClassification.from_pretrained(args.model_name_or_path, num_classes=1, cls_token_idx=0) + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + batch_size = int(args.batch_size / args.gradient_accumulation_steps) + train_ds_iter = MCQIterator( + train_ds, + batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + 
memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + choice_num=num_classes, + ) + + eval_ds_iter = MCQIterator( + eval_ds, + batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + mode="eval", + choice_num=num_classes, + ) + + test_ds_iter = MCQIterator( + test_ds, + batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + mode="test", + choice_num=num_classes, + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_steps = 1 + best_acc = -1 + create_memory = partial( + init_memory, + batch_size * num_classes, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + tic_train = time.time() + + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, labels, qids, gather_idx, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels])) + logits = logits.reshape([-1, num_classes]) + labels = labels.reshape([-1, num_classes, 1])[:, 0] + + loss = criterion(logits, labels) * need_cal_loss + loss.backward() + if step % args.gradient_accumulation_steps == 0: + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + global_steps += 1 + # Rough acc result, not a precise acc + acc = metric.compute(logits, labels) * need_cal_loss + metric.update(acc) + + if global_steps % args.logging_steps == 0 and step % args.gradient_accumulation_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, acc:%f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + loss, + metric.accumulate(), + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0 and step % args.gradient_accumulation_steps == 0: + logger.info("Eval, total {} qustion.".format(len(eval_ds))) + eval_acc = evaluate(model, eval_metric, eval_dataloader, create_memory(), num_classes) + # Save model + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if eval_acc > best_acc: + logger.info("Save best model......") + best_acc = eval_acc + best_model_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(best_model_dir): + os.makedirs(best_model_dir) + model_to_save.save_pretrained(best_model_dir) + tokenizer.save_pretrained(best_model_dir) + + if args.max_steps > 0 and global_steps >= args.max_steps: + return + logger.info("Final test result:") + eval_acc = evaluate(model, eval_metric, test_dataloader, create_memory(), num_classes) + + +if __name__ == "__main__": + do_train(args) diff --git a/model_zoo/ernie-doc/run_mrc.py b/model_zoo/ernie-doc/run_mrc.py new file mode 100644 index 0000000000000000000000000000000000000000..51687dd04dbed59ab78da127731e7c62a10eadd7 --- /dev/null +++ b/model_zoo/ernie-doc/run_mrc.py @@ 
-0,0 +1,358 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from collections import namedtuple +from functools import partial + +import numpy as np +import paddle +from data import MRCIterator +from metrics import EM_AND_F1, compute_qa_predictions +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocForQuestionAnswering, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=2.75e-4, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=5, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--layerwise_decay", default=0.8, type=float, help="Layerwise decay ratio") +parser.add_argument("--n_best_size", default=20, type=int, help="The total number of n-best predictions to generate in the nbest_predictions.json output file.") +parser.add_argument("--max_answer_length", default=100, type=int, help="Max answer length.") +parser.add_argument("--do_lower_case", action='store_false', help="Whether to lower case the input text. 
Should be True for uncased models and False for cased models.") +parser.add_argument("--verbose", action='store_true', help="Whether to output verbose log.") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout ratio of ernie_doc") +parser.add_argument("--dataset", default="dureader_robust", type=str, choices=["dureader_robust", "cmrc2018", "drcd"], help="The avaliable Q&A dataset") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.",) +args = parser.parse_args() +# fmt: on + +# eval_dataset, test_dataset, +DATASET_INFO = { + "dureader_robust": ["dev", "dev", ErnieDocTokenizer], + "cmrc2018": ["dev", "dev", ErnieDocTokenizer], + "drcd": ["dev", "test", ErnieDocTokenizer], +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return [paddle.zeros([batch_size, memory_length, d_model], dtype="float32") for _ in range(n_layers)] + + +class CrossEntropyLossForQA(paddle.nn.Layer): + def __init__(self): + super(CrossEntropyLossForQA, self).__init__() + self.criterion = paddle.nn.CrossEntropyLoss() + + def forward(self, y, label): + start_logits, end_logits = y + start_position, end_position = label + + start_loss = self.criterion(start_logits, start_position) + end_loss = self.criterion(end_logits, end_position) + loss = (start_loss + end_loss) / 2 + return loss + + +@paddle.no_grad() +def evaluate(args, model, criterion, metric, data_loader, memories0, tokenizer): + RawResult = namedtuple("RawResult", ["unique_id", "start_logits", "end_logits"]) + model.eval() + all_results = [] + + tic_start = time.time() + tic_eval = time.time() + memories = list(memories0) + + # Collect result + logger.info("The example number of eval_dataloader: {}".format(len(data_loader._batch_reader.features))) + for step, batch in enumerate(data_loader, start=1): + ( + input_ids, + position_ids, + token_type_ids, + attn_mask, + start_position, + end_position, + qids, + gather_idx, + need_cal_loss, + ) = batch + + start_logits, end_logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + + start_logits, end_logits, qids = list( + map(lambda x: paddle.gather(x, gather_idx), [start_logits, end_logits, qids]) + ) + np_qids = qids.numpy() + np_start_logits = start_logits.numpy() + np_end_logits = end_logits.numpy() + + if int(need_cal_loss.numpy()) == 1: + for idx in range(qids.shape[0]): + if len(all_results) % 1000 == 0 and len(all_results): + logger.info("Processing example: %d" % len(all_results)) + logger.info("time per 1000: {} s".format(time.time() - tic_eval)) + tic_eval = time.time() + + qid_each = int(np_qids[idx]) + start_logits_each = [float(x) for x in np_start_logits[idx].flat] + end_logits_each = [float(x) for x in np_end_logits[idx].flat] + all_results.append( + RawResult(unique_id=qid_each, start_logits=start_logits_each, end_logits=end_logits_each) + ) + + # Compute_predictions + all_predictions_eval, all_nbest_eval = compute_qa_predictions( + data_loader._batch_reader.examples, + data_loader._batch_reader.features, + all_results, + args.n_best_size, + args.max_answer_length, + args.do_lower_case, + tokenizer, + args.verbose, + ) + + EM, F1, AVG, TOTAL = metric(all_predictions_eval, data_loader._batch_reader.dataset) + + logger.info("EM: {}, F1: {}, AVG: {}, TOTAL: {}, TIME: {}".format(EM, F1, AVG, TOTAL, time.time() - tic_start)) + 
model.train() + return EM, F1, AVG + + +def do_train(args): + set_seed(args) + + DEV, TEST, TOKENIZER_CLASS = DATASET_INFO[args.dataset] + tokenizer = TOKENIZER_CLASS.from_pretrained(args.model_name_or_path) + + train_ds, eval_ds = load_dataset(args.dataset, splits=["train", DEV]) + if DEV == TEST: + test_ds = eval_ds + else: + test_ds = load_dataset(args.dataset, splits=[TEST]) + + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + + model = ErnieDocForQuestionAnswering.from_pretrained(args.model_name_or_path, dropout=args.dropout) + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + + train_ds_iter = MRCIterator( + train_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + ) + + eval_ds_iter = MRCIterator( + eval_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="eval", + random_seed=args.seed, + ) + + test_ds_iter = MRCIterator( + test_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="test", + random_seed=args.seed, + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + + test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + global_steps = 0 + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + + criterion = CrossEntropyLossForQA() + + memories = create_memory() + tic_train = time.time() + best_avg_metric = -1 + stop_training = False + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + global_steps += 1 + ( + input_ids, + position_ids, + token_type_ids, + attn_mask, + start_position, + end_position, + qids, + gather_idx, + need_cal_loss, + ) = batch + start_logits, end_logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + + start_logits, end_logits, qids, start_position, end_position = list( + map( + lambda x: paddle.gather(x, gather_idx), + [start_logits, end_logits, qids, start_position, end_position], + ) + ) + loss = criterion([start_logits, end_logits], [start_position, end_position]) * need_cal_loss + + mean_loss = loss.mean() + mean_loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + mean_loss, + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0: + # Evaluate + logger.info("Eval:") + EM, F1, AVG = evaluate( + args, model, criterion, EM_AND_F1(), eval_dataloader, create_memory(), tokenizer + ) + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if best_avg_metric < AVG: + output_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + if args.max_steps > 0 and global_steps >= args.max_steps: + stop_training = True + break + if stop_training: + break + logger.info("Test:") + evaluate(args, model, criterion, EM_AND_F1(), test_dataloader, create_memory(), tokenizer) + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + + +if __name__ == "__main__": + do_train(args) diff --git a/model_zoo/ernie-doc/run_semantic_matching.py 
b/model_zoo/ernie-doc/run_semantic_matching.py new file mode 100644 index 0000000000000000000000000000000000000000..986154f82e15f22f5d2f63b0dc9ee49503712632 --- /dev/null +++ b/model_zoo/ernie-doc/run_semantic_matching.py @@ -0,0 +1,357 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from collections import defaultdict +from functools import partial + +import numpy as np +import paddle +import paddle.nn as nn +from data import SemanticMatchingIterator +from model import ErnieDocForTextMatching +from paddle.metric import Accuracy +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocModel, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=6, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=15, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--dataset", default="cail2019_scm", choices=["cail2019_scm"], type=str, help="The training dataset") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout ratio of ernie_doc") +parser.add_argument("--layerwise_decay", default=1.0, type=float, help="Layerwise decay ratio") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",) +args = parser.parse_args() +# fmt: on + +DATASET_INFO = { + "cail2019_scm": (ErnieDocTokenizer, "dev", "test", Accuracy()), +} + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return [paddle.zeros([batch_size, memory_length, d_model], dtype="float32") for _ in range(n_layers)] + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, memories0, pair_memories0): + model.eval() + losses = [] + # copy the memory + memories = list(memories0) + pair_memories = list(pair_memories0) + tic_train = time.time() + eval_logging_step = 500 + + probs_dict = defaultdict(list) + label_dict = dict() + for step, batch in enumerate(data_loader, start=1): + ( + input_ids, + position_ids, + token_type_ids, + attn_mask, + pair_input_ids, + pair_position_ids, + pair_token_type_ids, + pair_attn_mask, + labels, + qids, + gather_idx, + need_cal_loss, + ) = batch + + logits, memories, pair_memories = model( + input_ids, + pair_input_ids, + memories, + pair_memories, + token_type_ids, + position_ids, + attn_mask, + pair_token_type_ids, + pair_position_ids, + pair_attn_mask, + ) + logits, labels, qids = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels, qids])) + # Need to collect probs for each qid, so use softmax_with_cross_entropy + loss, probs = nn.functional.softmax_with_cross_entropy(logits, labels, return_softmax=True) + losses.append(loss.mean().numpy()) + # Shape: [B, NUM_LABELS] + np_probs = probs.numpy() + # Shape: [B, 1] + np_qids = qids.numpy() + np_labels = labels.numpy().flatten() + for i, qid in enumerate(np_qids.flatten()): + probs_dict[qid].append(np_probs[i]) + label_dict[qid] = np_labels[i] # Same qid share same label. 
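+            # Spans cut from the same long example share one qid; their probabilities
+            # are collected here and averaged after the loop to form a single
+            # example-level prediction.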
+ + if step % eval_logging_step == 0: + logger.info( + "Step %d: loss: %.5f, speed: %.5f steps/s" + % (step, np.mean(losses), eval_logging_step / (time.time() - tic_train)) + ) + tic_train = time.time() + + # Collect predicted labels + preds = [] + labels = [] + for qid, probs in probs_dict.items(): + mean_prob = np.mean(np.array(probs), axis=0) + preds.append(mean_prob) + labels.append(label_dict[qid]) + + preds = paddle.to_tensor(np.array(preds, dtype="float32")) + labels = paddle.to_tensor(np.array(labels, dtype="int64")) + + metric.update(metric.compute(preds, labels)) + acc_or_f1 = metric.accumulate() + logger.info("Eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric.__class__.__name__, acc_or_f1)) + metric.reset() + model.train() + return acc_or_f1 + + +def do_train(args): + set_seed(args) + tokenizer_class, eval_name, test_name, eval_metric = DATASET_INFO[args.dataset] + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + # Get dataset + train_ds, eval_ds, test_ds = load_dataset(args.dataset, splits=["train", eval_name, test_name]) + + num_classes = len(train_ds.label_list) + + # Initialize model + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + + ernie_doc = ErnieDocModel.from_pretrained(args.model_name_or_path, cls_token_idx=0) + model = ErnieDocForTextMatching(ernie_doc, num_classes, args.dropout) + + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + + train_ds_iter = SemanticMatchingIterator( + train_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + ) + + eval_ds_iter = SemanticMatchingIterator( + eval_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + mode="eval", + ) + + test_ds_iter = SemanticMatchingIterator( + test_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + mode="test", + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform 
weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = paddle.metric.Accuracy() + + global_steps = 0 + best_acc = -1 + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + pair_memories = create_memory() + tic_train = time.time() + + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + global_steps += 1 + ( + input_ids, + position_ids, + token_type_ids, + attn_mask, + pair_input_ids, + pair_position_ids, + pair_token_type_ids, + pair_attn_mask, + labels, + qids, + gather_idx, + need_cal_loss, + ) = batch + + logits, memories, pair_memories = model( + input_ids, + pair_input_ids, + memories, + pair_memories, + token_type_ids, + position_ids, + attn_mask, + pair_token_type_ids, + pair_position_ids, + pair_attn_mask, + ) + + logits, labels = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels])) + loss = criterion(logits, labels) * need_cal_loss + mean_loss = loss.mean() + mean_loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + # Rough acc result, not a precise acc + acc = metric.compute(logits, labels) * need_cal_loss + metric.update(acc) + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, acc:%f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + mean_loss, + metric.accumulate(), + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + + if global_steps % args.save_steps == 0: + # Evaluate + logger.info("Eval:") + eval_acc = evaluate(model, eval_metric, eval_dataloader, create_memory(), create_memory()) + # Save + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + save_param_path = os.path.join(output_dir, "model_state.pdparams") + paddle.save(model_to_save.state_dict(), save_param_path) + tokenizer.save_pretrained(output_dir) + if eval_acc > best_acc: + logger.info("Save best model......") + best_acc = eval_acc + best_model_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(best_model_dir): + os.makedirs(best_model_dir) + + save_param_path = os.path.join(best_model_dir, "model_state.pdparams") + paddle.save(model_to_save.state_dict(), save_param_path) + tokenizer.save_pretrained(best_model_dir) + + if args.max_steps > 0 and global_steps >= args.max_steps: + return + logger.info("Final test result:") + eval_acc = evaluate(model, eval_metric, test_dataloader, create_memory(), create_memory()) + + +if __name__ == "__main__": + 
do_train(args) diff --git a/model_zoo/ernie-doc/run_sequence_labeling.py b/model_zoo/ernie-doc/run_sequence_labeling.py new file mode 100644 index 0000000000000000000000000000000000000000..0686f9626ac41df6e6e34e20dfdd684a58f06393 --- /dev/null +++ b/model_zoo/ernie-doc/run_sequence_labeling.py @@ -0,0 +1,300 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from collections import defaultdict +from functools import partial + +import numpy as np +import paddle +from data import SequenceLabelingIterator +from paddle.optimizer import AdamW + +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.ops.optimizer import layerwise_lr_decay +from paddlenlp.transformers import ( + ErnieDocForTokenClassification, + ErnieDocTokenizer, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser() +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--model_name_or_path", type=str, default="ernie-doc-base-zh", help="Pretraining model name or path") +parser.add_argument("--max_seq_length", type=int, default=512, help="The maximum total input sequence length after SentencePiece tokenization.") +parser.add_argument("--learning_rate", type=float, default=3e-5, help="Learning rate used to train.") +parser.add_argument("--save_steps", type=int, default=1000, help="Save checkpoint every X updates steps.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--output_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") +parser.add_argument("--epochs", type=int, default=3, help="Number of epoches for training.") +parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="Select cpu, gpu devices to train model.") +parser.add_argument("--seed", type=int, default=1, help="Random seed for initialization.") +parser.add_argument("--memory_length", type=int, default=128, help="Length of the retained previous heads.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process.") +parser.add_argument("--dataset", default="msra_ner", choices=["msra_ner"], type=str, help="The training dataset") +parser.add_argument("--layerwise_decay", default=1.0, type=float, help="Layerwise decay ratio") +parser.add_argument("--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",) +args = parser.parse_args() +# fmt: on + + +def set_seed(args): + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + + +def init_memory(batch_size, memory_length, d_model, n_layers): + return [paddle.zeros([batch_size, memory_length, d_model], dtype="float32") for _ in range(n_layers)] + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, memories0): + model.eval() + metric.reset() + avg_loss, precision, recall, f1_score = 0, 0, 0, 0 + loss_fct = paddle.nn.loss.CrossEntropyLoss() + losses = [] + # Copy the memory + memories = list(memories0) + tic_train = time.time() + eval_logging_step = 500 + labels_dict = defaultdict(list) + preds_dict = defaultdict(list) + length_dict = defaultdict(list) + + for step, batch in enumerate(data_loader, start=1): + input_ids, position_ids, token_type_ids, attn_mask, labels, lengths, qids, gather_idxs, need_cal_loss = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels, qids, lengths = list( + map(lambda x: paddle.gather(x, gather_idxs), [logits, labels, qids, lengths]) + ) + loss = loss_fct(logits, labels) + avg_loss = loss.mean() + losses.append(avg_loss) + preds = logits.argmax(axis=2) + + np_qids = qids.numpy().flatten() + for i, qid in enumerate(np_qids): + preds_dict[qid].append(preds[i]) + labels_dict[qid].append(labels[i]) + length_dict[qid].append(lengths[i]) + + if step % eval_logging_step == 0: + logger.info( + "Step %d: loss: %.5f, speed: %.5f steps/s" + % (step, np.mean(losses), eval_logging_step / (time.time() - tic_train)) + ) + tic_train = time.time() + + qids = preds_dict.keys() + for qid in qids: + preds = paddle.concat(preds_dict[qid], axis=0).unsqueeze(0) + labels = paddle.concat(labels_dict[qid], axis=0).unsqueeze(0).squeeze(-1) + length = paddle.concat(length_dict[qid], axis=0) + length = length.sum(axis=0, keepdim=True) + num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(length, preds, labels) + metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + precision, recall, f1_score = metric.accumulate() + metric.reset() + logger.info("Total {} samples.".format(len(qids))) + logger.info("eval loss: %f, precision: %f, recall: %f, f1: %f" % (avg_loss, precision, recall, f1_score)) + model.train() + return precision, recall, f1_score + + +def do_train(args): + set_seed(args) + tokenizer = ErnieDocTokenizer.from_pretrained(args.model_name_or_path) + train_ds, eval_ds = load_dataset(args.dataset, splits=["train", "test"]) + test_ds = eval_ds + + num_classes = len(train_ds.label_list) + no_entity_id = num_classes - 1 + + paddle.set_device(args.device) + trainer_num = paddle.distributed.get_world_size() + if trainer_num > 1: + paddle.distributed.init_parallel_env() + rank = paddle.distributed.get_rank() + if rank == 0: + if os.path.exists(args.model_name_or_path): + logger.info("init checkpoint from %s" % args.model_name_or_path) + model = ErnieDocForTokenClassification.from_pretrained(args.model_name_or_path, num_classes=num_classes) + model_config = model.ernie_doc.config + if trainer_num > 1: + model = paddle.DataParallel(model) + + train_ds_iter = SequenceLabelingIterator( + train_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + random_seed=args.seed, + no_entity_id=no_entity_id, + ) + eval_ds_iter = SequenceLabelingIterator( + eval_ds, + args.batch_size, 
+ tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="eval", + no_entity_id=no_entity_id, + ) + test_ds_iter = SequenceLabelingIterator( + test_ds, + args.batch_size, + tokenizer, + trainer_num, + trainer_id=rank, + memory_len=model_config["memory_len"], + max_seq_length=args.max_seq_length, + mode="test", + no_entity_id=no_entity_id, + ) + + train_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + eval_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + eval_dataloader.set_batch_generator(eval_ds_iter, paddle.get_device()) + test_dataloader = paddle.fluid.reader.DataLoader.from_generator(capacity=70, return_list=True) + test_dataloader.set_batch_generator(test_ds_iter, paddle.get_device()) + + num_training_examples = train_ds_iter.get_num_examples() + num_training_steps = args.epochs * num_training_examples // args.batch_size // trainer_num + logger.info("Device count: %d, trainer_id: %d" % (trainer_num, rank)) + logger.info("Num train examples: %d" % num_training_examples) + logger.info("Max train steps: %d" % num_training_steps) + logger.info("Num warmup steps: %d" % int(num_training_steps * args.warmup_proportion)) + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + # Construct dict + name_dict = dict() + for n, p in model.named_parameters(): + name_dict[p.name] = n + + simple_lr_setting = partial(layerwise_lr_decay, args.layerwise_decay, name_dict, model_config["num_hidden_layers"]) + + optimizer = AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + lr_ratio=simple_lr_setting, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + metric = ChunkEvaluator(label_list=train_ds.label_list) + + global_steps = 0 + + create_memory = partial( + init_memory, + args.batch_size, + args.memory_length, + model_config["hidden_size"], + model_config["num_hidden_layers"], + ) + # Copy the memory + memories = create_memory() + tic_train = time.time() + best_f1 = 0 + stop_training = False + for epoch in range(args.epochs): + train_ds_iter.shuffle_sample() + train_dataloader.set_batch_generator(train_ds_iter, paddle.get_device()) + for step, batch in enumerate(train_dataloader, start=1): + global_steps += 1 + ( + input_ids, + position_ids, + token_type_ids, + attn_mask, + labels, + lengths, + qids, + gather_idx, + need_cal_loss, + ) = batch + logits, memories = model(input_ids, memories, token_type_ids, position_ids, attn_mask) + logits, labels = list(map(lambda x: paddle.gather(x, gather_idx), [logits, labels])) + + loss = criterion(logits, labels) * need_cal_loss + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_steps % args.logging_steps == 0: + logger.info( + "train: global step %d, epoch: %d, loss: %f, lr: %f, speed: %.2f step/s" + % ( + global_steps, + epoch, + loss, + lr_scheduler.get_lr(), + args.logging_steps / (time.time() - tic_train), + ) + ) + tic_train = time.time() + if global_steps % args.save_steps == 0: + # Evaluate + 
logger.info("Eval:") + precision, recall, f1_score = evaluate(model, metric, eval_dataloader, create_memory()) + # Save + if rank == 0: + output_dir = os.path.join(args.output_dir, "model_%d" % (global_steps)) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if f1_score > best_f1: + logger.info("Save best model......") + best_f1 = f1_score + best_model_dir = os.path.join(args.output_dir, "best_model") + if not os.path.exists(best_model_dir): + os.makedirs(best_model_dir) + model_to_save.save_pretrained(best_model_dir) + tokenizer.save_pretrained(best_model_dir) + + if args.max_steps > 0 and global_steps >= args.max_steps: + stop_training = True + break + if stop_training: + break + + logger.info("Final test result:") + evaluate(model, metric, test_dataloader, create_memory()) + + +if __name__ == "__main__": + do_train(args) diff --git a/model_zoo/ernie-gen/README.md b/model_zoo/ernie-gen/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4d236818fd22338ac00e30765cf7414919de628d --- /dev/null +++ b/model_zoo/ernie-gen/README.md @@ -0,0 +1,148 @@ +# ERNIE-Gen: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation + +## 1. 简介 + +ERNIE-GEN 是面向生成任务的预训练-微调框架,首次在预训练阶段加入**span-by-span 生成任务**,让模型每次能够生成一个语义完整的片段。在预训练和微调中通过**填充式生成机制**和**噪声感知机制**来缓解曝光偏差问题。此外, ERNIE-GEN 采样**多片段-多粒度目标文本采样策略**, 增强源文本和目标文本的关联性,加强了编码器和解码器的交互。 + +![multi-flow-attention](https://github.com/PaddlePaddle/ERNIE/raw/repro/ernie-gen/.meta/multi-flow-attention.png) + +## 快速开始 + +### 环境依赖 + +- tqdm + +安装方式:`pip install tqdm` + +### 数据准备 + +在本例中,我们提供了古诗词数据集,示例数据如下: + +```text +画\002精\002禅\002室\002冷\002,\002方\002暑\002久\002徘\002徊\002。 不\002尽\002林\002端\002雪\002,\002长\002青\002石\002上\002苔\002。\002心\002闲\002对\002岩\002岫\002,\002目\002浄\002失\002尘\002埃\002。\002坐\002久\002清\002风\002至\002,\002疑\002从\002翠\002涧\002来\002。 +``` + +每行数据都是由两列组成,以制表符分隔。第一列是输入的诗句前文,第二列是输出的诗句后文,所有文字都以 `\002` 分隔。 + +完整数据集可以通过以下命令下载并解压: + +```bash +wget --no-check-certificate https://bj.bcebos.com/paddlenlp/datasets/poetry.tar.gz +tar xvf poetry.tar.gz +``` + +### 模型微调 + +#### 单卡训练 + +训练启动方式如下: + +```bash +python -u ./train.py \ + --model_name_or_path ernie-1.0 \ + --max_encode_len 24 \ + --max_decode_len 72 \ + --batch_size 48 \ + --learning_rate 2e-5 \ + --num_epochs 12 \ + --logging_steps 1 \ + --save_steps 1000 \ + --output_dir ./tmp/ \ + --device gpu \ + # --init_checkpoint ./tmp/model_10000/model_state.pdparams +``` + +参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer,支持[PaddleNLP Transformer类预训练模型](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 中的所有模型,但只有`ernie-gen-base-en, ernie-gen-large-en, ernie-gen-large-en-430g`三种模型会加载最后输出层的参数,其余模型只会加载transformer参数作热启动。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_encode_len` 表示最大输入句子长度,超过该长度将被截断。 +- `max_decode_len` 表示最大输出句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `learning_rate` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- `num_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔。 +- `save_steps` 表示模型保存及评估间隔。 +- `output_dir` 表示模型保存路径。 +- `device`: 训练使用的设备, 'gpu'表示使用GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用CPU。 +- `init_checkpoint` 表示模型加载路径,通过设置此参数可以开启增量训练。 + +训练会持续很长的时间,为此我们提供了[微调后的模型](https://bj.bcebos.com/paddlenlp/models/transformers/ernie_gen_finetuned/ernie_1.0_poetry.pdparams)。您可以下载该模型并通过`init_checkpoint`加载其参数进行增量训练、评估或预测。 + 
+#### 多卡训练 + +训练启动方式如下: + +```bash +python -m paddle.distributed.launch --gpus "0,1" ./train.py \ + --model_name_or_path ernie-1.0 \ + --max_encode_len 24 \ + --max_decode_len 72 \ + --batch_size 48 \ + --learning_rate 2e-5 \ + --num_epochs 12 \ + --logging_steps 1 \ + --save_steps 1000 \ + --output_dir ./tmp/ \ + --device gpu \ + # --init_checkpoint ./tmp/model_10000/model_state.pdparams +``` + +### 模型评估 + +通过加载训练保存的模型,可以对验证集数据进行验证,启动方式如下: + +```bash +python -u ./eval.py \ + --model_name_or_path ernie-1.0 \ + --max_encode_len 24 \ + --max_decode_len 72 \ + --batch_size 48 \ + --init_checkpoint ./tmp/model_10000/model_state.pdparams \ + --device gpu +``` + +参数释义如下: +- `model_name_or_path` 指示了某种特定配置的模型,对应有其预训练模型和预训练时使用的 tokenizer,支持[PaddleNLP Transformer类预训练模型](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer) 中的所有模型,但只有`ernie-gen-base-en, ernie-gen-large-en, ernie-gen-large-en-430g`三种模型会加载最后输出层的参数,其余模型只会加载transformer参数作热启动。若模型相关内容保存在本地,这里也可以提供相应目录地址。 +- `max_encode_len` 表示最大输入句子长度,超过该长度将被截断。 +- `max_decode_len` 表示最大输出句子长度,超过该长度将被截断。 +- `batch_size` 表示每次迭代**每张卡**上的样本数目。 +- `init_checkpoint` 表示模型加载路径。 +- `use_gpu` 表示使用GPU。 + +### 模型预测 + +对无标签数据可以启动模型预测: + +```bash +python -u ./predict.py \ + --model_name_or_path ernie-1.0 \ + --max_encode_len 24 \ + --max_decode_len 72 \ + --batch_size 48 \ + --init_checkpoint ./tmp/model_10000/model_state.pdparams \ + --device gpu +``` + + +## Citation + +您可以按下面的格式引用ERNIE-Gen论文: + +``` +@article{xiao2020ernie-gen, + title={ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation}, + author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng}, + journal={arXiv preprint arXiv:2001.11314}, + year={2020} +} +``` + +## 线上教程体验 + +我们为诗歌文本生成提供了线上教程,欢迎体验: + +* [使用PaddleNLP预训练模型ERNIE-GEN生成诗歌](https://aistudio.baidu.com/aistudio/projectdetail/1339888) + + +## Acknowledgement + +- 感谢 [chinese-poetry数据集](https://github.com/chinese-poetry/chinese-poetry) 开放的诗歌数据集 diff --git a/model_zoo/ernie-gen/decode.py b/model_zoo/ernie-gen/decode.py new file mode 100644 index 0000000000000000000000000000000000000000..0450dd46504f19e884cf2f3922f1821e268359d1 --- /dev/null +++ b/model_zoo/ernie-gen/decode.py @@ -0,0 +1,292 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
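+
+"""Decoding utilities for ERNIE-GEN: build the infilling attention bias (gen_bias),
+run greedy or beam-search infilling decoding, and post-process generated tokens."""
+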
+import re +from collections import namedtuple + +import numpy as np +import paddle +import paddle.nn as nn + + +def gen_bias(encoder_inputs, decoder_inputs, step): + decoder_bsz, decoder_seqlen = decoder_inputs.shape[:2] + encoder_bsz, encoder_seqlen = encoder_inputs.shape[:2] + attn_bias = paddle.reshape(paddle.arange(0, decoder_seqlen, 1, dtype="float32") + 1, [1, -1, 1]) + decoder_bias = paddle.cast( + (paddle.matmul(attn_bias, 1.0 / attn_bias, transpose_y=True) >= 1.0), "float32" + ) # [1, decoderlen, decoderlen] + encoder_bias = paddle.unsqueeze( + paddle.cast(paddle.ones_like(encoder_inputs), "float32"), [1] + ) # [bsz, 1, encoderlen] + encoder_bias = paddle.expand( + encoder_bias, [encoder_bsz, decoder_seqlen, encoder_seqlen] + ) # [bsz,decoderlen, encoderlen] + decoder_bias = paddle.expand( + decoder_bias, [decoder_bsz, decoder_seqlen, decoder_seqlen] + ) # [bsz, decoderlen, decoderlen] + if step > 0: + bias = paddle.concat( + [encoder_bias, paddle.ones([decoder_bsz, decoder_seqlen, step], "float32"), decoder_bias], -1 + ) + else: + bias = paddle.concat([encoder_bias, decoder_bias], -1) + return bias + + +@paddle.no_grad() +def greedy_search_infilling( + model, + token_ids, + token_type_ids, + sos_id, + eos_id, + attn_id, + pad_id, + unk_id, + vocab_size, + max_encode_len=640, + max_decode_len=100, + tgt_type_id=3, +): + _, logits, info = model(token_ids, token_type_ids) + d_batch, d_seqlen = token_ids.shape + seqlen = paddle.sum(paddle.cast(token_ids != 0, "int64"), 1, keepdim=True) + has_stopped = np.zeros([d_batch], dtype=np.bool_) + gen_seq_len = np.zeros([d_batch], dtype=np.int64) + output_ids = [] + + past_cache = info["caches"] + + cls_ids = paddle.ones([d_batch], dtype="int64") * sos_id + attn_ids = paddle.ones([d_batch], dtype="int64") * attn_id + ids = paddle.stack([cls_ids, attn_ids], -1) + for step in range(max_decode_len): + bias = gen_bias(token_ids, ids, step) + pos_ids = paddle.to_tensor(np.tile(np.array([[step, step + 1]], dtype=np.int64), [d_batch, 1])) + pos_ids += seqlen + _, logits, info = model( + ids, paddle.ones_like(ids) * tgt_type_id, pos_ids=pos_ids, attn_bias=bias, past_cache=past_cache + ) + + if logits.shape[-1] > vocab_size: + logits[:, :, vocab_size:] = 0 + logits[:, :, pad_id] = 0 + logits[:, :, unk_id] = 0 + logits[:, :, attn_id] = 0 + + gen_ids = paddle.argmax(logits, -1) + + past_cached_k, past_cached_v = past_cache + cached_k, cached_v = info["caches"] + cached_k = [paddle.concat([pk, k[:, :1, :]], 1) for pk, k in zip(past_cached_k, cached_k)] # concat cached + cached_v = [paddle.concat([pv, v[:, :1, :]], 1) for pv, v in zip(past_cached_v, cached_v)] + past_cache = (cached_k, cached_v) + + gen_ids = gen_ids[:, 1] + ids = paddle.stack([gen_ids, attn_ids], 1) + + gen_ids = gen_ids.numpy() + has_stopped |= (gen_ids == eos_id).astype(np.bool_) + gen_seq_len += 1 - has_stopped.astype(np.int64) + output_ids.append(gen_ids.tolist()) + if has_stopped.all(): + break + output_ids = np.array(output_ids).transpose([1, 0]) + return output_ids + + +BeamSearchState = namedtuple("BeamSearchState", ["log_probs", "lengths", "finished"]) +BeamSearchOutput = namedtuple("BeamSearchOutput", ["scores", "predicted_ids", "beam_parent_ids"]) + + +def log_softmax(x): + e_x = np.exp(x - np.max(x)) + return np.log(e_x / e_x.sum()) + + +def mask_prob(p, onehot_eos, finished): + is_finished = paddle.cast(paddle.reshape(finished, [-1, 1]) != 0, "float32") + p = is_finished * (1.0 - paddle.cast(onehot_eos, "float32")) * -9999.0 + (1.0 - is_finished) * p + return p + + +def 
hyp_score(log_probs, length, length_penalty): + lp = paddle.pow((5.0 + paddle.cast(length, "float32")) / 6.0, length_penalty) + return log_probs / lp + + +def beam_search_step(state, logits, eos_id, beam_width, is_first_step, length_penalty): + """logits.shape == [B*W, V]""" + _, vocab_size = logits.shape + + bsz, beam_width = state.log_probs.shape + onehot_eos = paddle.cast(nn.functional.one_hot(paddle.ones([1], "int64") * eos_id, vocab_size), "int64") # [1, V] + + probs = paddle.log(nn.functional.softmax(logits)) # [B*W, V] + probs = mask_prob(probs, onehot_eos, state.finished) # [B*W, V] + allprobs = paddle.reshape(state.log_probs, [-1, 1]) + probs # [B*W, V] + + not_finished = 1 - paddle.reshape(state.finished, [-1, 1]) # [B*W,1] + not_eos = 1 - onehot_eos + length_to_add = not_finished * not_eos # [B*W,V] + alllen = paddle.reshape(state.lengths, [-1, 1]) + length_to_add + + allprobs = paddle.reshape(allprobs, [-1, beam_width * vocab_size]) + alllen = paddle.reshape(alllen, [-1, beam_width * vocab_size]) + allscore = hyp_score(allprobs, alllen, length_penalty) + if is_first_step: + allscore = paddle.reshape(allscore, [bsz, beam_width, -1])[:, 0, :] # first step only consiter beam 0 + scores, idx = paddle.topk(allscore, k=beam_width) # [B, W] + next_beam_id = idx // vocab_size # [B, W] + next_word_id = idx % vocab_size + + gather_idx = paddle.concat([paddle.nonzero(idx != -1)[:, :1], paddle.reshape(idx, [-1, 1])], 1) + next_probs = paddle.reshape(paddle.gather_nd(allprobs, gather_idx), idx.shape) + next_len = paddle.reshape(paddle.gather_nd(alllen, gather_idx), idx.shape) + + gather_idx = paddle.concat([paddle.nonzero(next_beam_id != -1)[:, :1], paddle.reshape(next_beam_id, [-1, 1])], 1) + next_finished = paddle.reshape( + paddle.gather_nd(state.finished, gather_idx), state.finished.shape + ) # [gather new beam state according to new beam id] + + next_finished += paddle.cast(next_word_id == eos_id, "int64") + next_finished = paddle.cast(next_finished > 0, "int64") + + next_state = BeamSearchState(log_probs=next_probs, lengths=next_len, finished=next_finished) + output = BeamSearchOutput(scores=scores, predicted_ids=next_word_id, beam_parent_ids=next_beam_id) + + return output, next_state + + +@paddle.no_grad() +def beam_search_infilling( + model, + token_ids, + token_type_ids, + sos_id, + eos_id, + attn_id, + pad_id, + unk_id, + vocab_size, + max_encode_len=640, + max_decode_len=100, + beam_width=5, + tgt_type_id=3, + length_penalty=1.0, +): + _, __, info = model(token_ids, token_type_ids) + d_batch, d_seqlen = token_ids.shape + + state = BeamSearchState( + log_probs=paddle.zeros([d_batch, beam_width], "float32"), + lengths=paddle.zeros([d_batch, beam_width], "int64"), + finished=paddle.zeros([d_batch, beam_width], "int64"), + ) + outputs = [] + + def reorder_(t, parent_id): + """reorder cache according to parent beam id""" + gather_idx = paddle.nonzero(parent_id != -1)[:, 0] * beam_width + paddle.reshape(parent_id, [-1]) + t = paddle.gather(t, gather_idx) + return t + + def tile_(t, times): + _shapes = list(t.shape[1:]) + new_shape = [t.shape[0], times] + list(t.shape[1:]) + ret = paddle.reshape( + paddle.expand(paddle.unsqueeze(t, [1]), new_shape), + [ + -1, + ] + + _shapes, + ) + return ret + + cached_k, cached_v = info["caches"] + cached_k = [tile_(k, beam_width) for k in cached_k] + cached_v = [tile_(v, beam_width) for v in cached_v] + past_cache = (cached_k, cached_v) + + token_ids = tile_(token_ids, beam_width) + seqlen = paddle.sum(paddle.cast(token_ids != 0, "int64"), 1, 
keepdim=True) + # log.debug(token_ids.shape) + + cls_ids = paddle.ones([d_batch * beam_width], dtype="int64") * sos_id + attn_ids = paddle.ones([d_batch * beam_width], dtype="int64") * attn_id # SOS + ids = paddle.stack([cls_ids, attn_ids], -1) + for step in range(max_decode_len): + # log.debug('decode step %d' % step) + bias = gen_bias(token_ids, ids, step) + pos_ids = paddle.to_tensor(np.tile(np.array([[step, step + 1]], dtype=np.int64), [d_batch * beam_width, 1])) + pos_ids += seqlen + _, logits, info = model( + ids, paddle.ones_like(ids) * tgt_type_id, pos_ids=pos_ids, attn_bias=bias, past_cache=past_cache + ) + if logits.shape[-1] > vocab_size: + logits[:, :, vocab_size:] = 0 + logits[:, :, pad_id] = 0 + logits[:, :, unk_id] = 0 + logits[:, :, attn_id] = 0 + + output, state = beam_search_step( + state, + logits[:, 1], + eos_id=eos_id, + beam_width=beam_width, + is_first_step=(step == 0), + length_penalty=length_penalty, + ) + outputs.append(output) + + past_cached_k, past_cached_v = past_cache + cached_k, cached_v = info["caches"] + cached_k = [ + reorder_(paddle.concat([pk, k[:, :1, :]], 1), output.beam_parent_ids) + for pk, k in zip(past_cached_k, cached_k) + ] # concat cached + cached_v = [ + reorder_(paddle.concat([pv, v[:, :1, :]], 1), output.beam_parent_ids) + for pv, v in zip(past_cached_v, cached_v) + ] + past_cache = (cached_k, cached_v) + + pred_ids_flatten = paddle.reshape(output.predicted_ids, [d_batch * beam_width]) + ids = paddle.stack([pred_ids_flatten, attn_ids], 1) + + if state.finished.numpy().all(): + break + + final_ids = paddle.stack([o.predicted_ids for o in outputs], 0) + final_parent_ids = paddle.stack([o.beam_parent_ids for o in outputs], 0) + final_ids = nn.functional.gather_tree(final_ids, final_parent_ids)[:, :, 0] # pick best beam + final_ids = paddle.transpose(paddle.reshape(final_ids, [-1, d_batch * 1]), [1, 0]) + + return final_ids.numpy() + + +en_patten = re.compile(r"^[a-zA-Z0-9]*$") + + +def post_process(token): + if token.startswith("##"): + ret = token[2:] + elif token in ["[CLS]", "[SEP]", "[PAD]"]: + ret = "" + else: + if en_patten.match(token): + ret = " " + token + else: + ret = token + return ret diff --git a/model_zoo/ernie-gen/encode.py b/model_zoo/ernie-gen/encode.py new file mode 100644 index 0000000000000000000000000000000000000000..a1f47e1f33102619f7810d5061e1af3161ade8ab --- /dev/null +++ b/model_zoo/ernie-gen/encode.py @@ -0,0 +1,145 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
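+
+"""Feature conversion for ERNIE-GEN fine-tuning: convert_example tokenizes the
+source/target text and optionally injects noise tokens, while gen_mask and
+after_padding build the multi-flow attention masks (Fig. 3 of the ERNIE-GEN paper)."""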
+ +from copy import deepcopy + +import numpy as np + + +def convert_example( + tokenizer, + attn_id, + tgt_type_id=3, + max_encode_len=512, + max_decode_len=128, + is_test=False, + noise_prob=0.0, + use_random_noice=False, +): + def warpper(example): + """convert an example into necessary features""" + tokens = example["tokens"] + labels = example["labels"] + encoded_src = tokenizer(tokens, max_seq_len=max_encode_len, pad_to_max_seq_len=False) + src_ids, src_sids = encoded_src["input_ids"], encoded_src["token_type_ids"] + src_pids = np.arange(len(src_ids)) + + if not is_test: + encoded_tgt = tokenizer(labels, max_seq_len=max_decode_len, pad_to_max_seq_len=False) + tgt_ids, tgt_sids = encoded_tgt["input_ids"], encoded_tgt["token_type_ids"] + tgt_ids = np.array(tgt_ids).astype("int64") + tgt_sids = np.array(tgt_sids) + tgt_type_id + tgt_pids = np.arange(len(tgt_ids)) + len(src_ids) + + attn_ids = np.ones_like(tgt_ids) * attn_id + if noise_prob > 0.0: + tgt_labels = deepcopy(tgt_ids) + if use_random_noice: + noice_ids = np.random.randint(1, len(tokenizer.vocab), size=tgt_ids.shape) + else: + noice_ids = np.ones_like(tgt_ids) * tokenizer.vocab["[NOISE]"] + (pos,) = np.where(np.ones_like(tgt_ids)) + np.random.shuffle(pos) + pos = pos[: int(noise_prob * len(pos))] + tgt_ids[pos] = noice_ids[ + pos, + ] + else: + tgt_labels = tgt_ids + + return (src_ids, src_pids, src_sids, tgt_ids, tgt_pids, tgt_sids, attn_ids, tgt_labels) + + return warpper + + +def gen_mask(batch_ids, mask_type="bidi", query_len=None, pad_value=0): + if query_len is None: + query_len = batch_ids.shape[1] + if mask_type != "empty": + mask = (batch_ids != pad_value).astype(np.float32) + mask = np.tile(np.expand_dims(mask, 1), [1, query_len, 1]) + if mask_type == "causal": + assert query_len == batch_ids.shape[1] + mask = np.tril(mask) + elif mask_type == "causal_without_diag": + assert query_len == batch_ids.shape[1] + mask = np.tril(mask, -1) + elif mask_type == "diag": + assert query_len == batch_ids.shape[1] + mask = np.stack([np.diag(np.diag(m)) for m in mask], 0) + + else: + mask_type == "empty" + mask = np.zeros_like(batch_ids).astype(np.float32) + mask = np.tile(np.expand_dims(mask, 1), [1, query_len, 1]) + return mask + + +def after_padding(args): + """ + attention mask: + *** src, tgt, attn + src 00, 01, 11 + tgt 10, 11, 12 + attn 20, 21, 22 + + *** s1, s2 | t1 t2 t3| attn1 attn2 attn3 + s1 1, 1 | 0, 0, 0,| 0, 0, 0, + s2 1, 1 | 0, 0, 0,| 0, 0, 0, + - + t1 1, 1, | 1, 0, 0,| 0, 0, 0, + t2 1, 1, | 1, 1, 0,| 0, 0, 0, + t3 1, 1, | 1, 1, 1,| 0, 0, 0, + - + attn1 1, 1, | 0, 0, 0,| 1, 0, 0, + attn2 1, 1, | 1, 0, 0,| 0, 1, 0, + attn3 1, 1, | 1, 1, 0,| 0, 0, 1, + + for details, see Fig3. 
https://arxiv.org/abs/2001.11314 + """ + src_ids, src_pids, src_sids, tgt_ids, tgt_pids, tgt_sids, attn_ids, tgt_labels = args + src_len = src_ids.shape[1] + tgt_len = tgt_ids.shape[1] + mask_00 = gen_mask(src_ids, "bidi", query_len=src_len) + # mask_01 = gen_mask(tgt_ids, "empty", query_len=src_len) + # mask_02 = gen_mask(attn_ids, "empty", query_len=src_len) + + mask_10 = gen_mask(src_ids, "bidi", query_len=tgt_len) + mask_11 = gen_mask(tgt_ids, "causal", query_len=tgt_len) + # mask_12 = gen_mask(attn_ids, "empty", query_len=tgt_len) + + mask_20 = gen_mask(src_ids, "bidi", query_len=tgt_len) + mask_21 = gen_mask(tgt_ids, "causal_without_diag", query_len=tgt_len) + mask_22 = gen_mask(attn_ids, "diag", query_len=tgt_len) + + mask_src_2_src = mask_00 + mask_tgt_2_srctgt = np.concatenate([mask_10, mask_11], 2) + mask_attn_2_srctgtattn = np.concatenate([mask_20, mask_21, mask_22], 2) + + raw_tgt_labels = deepcopy(tgt_labels) + tgt_labels = tgt_labels[np.where(tgt_labels != 0)] + return ( + src_ids, + src_sids, + src_pids, + tgt_ids, + tgt_sids, + tgt_pids, + attn_ids, + mask_src_2_src, + mask_tgt_2_srctgt, + mask_attn_2_srctgtattn, + tgt_labels, + raw_tgt_labels, + ) diff --git a/model_zoo/ernie-gen/eval.py b/model_zoo/ernie-gen/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..47dd2298ac7b6d8d2031f1bb08e23b8c5da608e9 --- /dev/null +++ b/model_zoo/ernie-gen/eval.py @@ -0,0 +1,147 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
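+
+"""Evaluate a fine-tuned ERNIE-GEN model on the poetry dev set with beam-search
+infilling decoding and report ROUGE-1 / ROUGE-2 scores."""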
+ +import argparse + +import paddle +from decode import beam_search_infilling +from encode import after_padding, convert_example +from paddle.io import DataLoader +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import Rouge1, Rouge2 +from paddlenlp.transformers import ( + BertTokenizer, + ElectraTokenizer, + ErnieForGeneration, + ErnieTinyTokenizer, + ErnieTokenizer, + RobertaTokenizer, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser('seq2seq model with ERNIE-GEN') +parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(list(ErnieTokenizer.pretrained_init_configuration.keys()))) +parser.add_argument('--max_encode_len', type=int, default=24, help="The max encoding sentence length") +parser.add_argument('--max_decode_len', type=int, default=72, help="The max decoding sentence length") +parser.add_argument("--batch_size", default=50, type=int, help="Batch size per GPU/CPU for training.", ) +parser.add_argument('--beam_width', type=int, default=1, help="Beam search width") +parser.add_argument('--length_penalty', type=float, default=1.0, help="The length penalty during decoding") +parser.add_argument('--init_checkpoint', type=str, default=None, help='Checkpoint to warm start from') +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") +args = parser.parse_args() +# fmt: on + + +def evaluate(): + paddle.set_device(args.device) + + model = ErnieForGeneration.from_pretrained(args.model_name_or_path) + if "ernie-tiny" in args.model_name_or_path: + tokenizer = ErnieTinyTokenizer.from_pretrained(args.model_name_or_path) + elif "ernie" in args.model_name_or_path: + tokenizer = ErnieTokenizer.from_pretrained(args.model_name_or_path) + elif "roberta" in args.model_name_or_path or "rbt" in args.model_name_or_path: + tokenizer = RobertaTokenizer.from_pretrained(args.model_name_or_path) + elif "electra" in args.model_name_or_path: + tokenizer = ElectraTokenizer.from_pretrained(args.model_name_or_path) + else: + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + + dev_dataset = load_dataset("poetry", splits=("dev"), lazy=False) + attn_id = tokenizer.vocab["[ATTN]"] if "[ATTN]" in tokenizer.vocab else tokenizer.vocab["[MASK]"] + tgt_type_id = model.sent_emb.weight.shape[0] - 1 + + trans_func = convert_example( + tokenizer=tokenizer, + attn_id=attn_id, + tgt_type_id=tgt_type_id, + max_encode_len=args.max_encode_len, + max_decode_len=args.max_decode_len, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_pids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_sids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_pids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_sids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # attn_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_labels + ): after_padding(fn(samples)) + + dev_dataset = dev_dataset.map(trans_func) + dev_batch_sampler = paddle.io.BatchSampler(dev_dataset, batch_size=args.batch_size, shuffle=False) + data_loader = DataLoader( + dataset=dev_dataset, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, 
num_workers=0, return_list=True + ) + + rouge1 = Rouge1() + rouge2 = Rouge2() + + if args.init_checkpoint: + model_state = paddle.load(args.init_checkpoint) + model.set_state_dict(model_state) + + model.eval() + vocab = tokenizer.vocab + eos_id = vocab[tokenizer.sep_token] + sos_id = vocab[tokenizer.cls_token] + pad_id = vocab[tokenizer.pad_token] + unk_id = vocab[tokenizer.unk_token] + vocab_size = len(vocab) + evaluated_sentences_ids = [] + reference_sentences_ids = [] + logger.info("Evaluating...") + for data in tqdm(data_loader): + (src_ids, src_sids, src_pids, _, _, _, _, _, _, _, _, raw_tgt_labels) = data # never use target when infer + # Use greedy_search_infilling or beam_search_infilling to get predictions + output_ids = beam_search_infilling( + model, + src_ids, + src_sids, + eos_id=eos_id, + sos_id=sos_id, + attn_id=attn_id, + pad_id=pad_id, + unk_id=unk_id, + vocab_size=vocab_size, + max_decode_len=args.max_decode_len, + max_encode_len=args.max_encode_len, + beam_width=args.beam_width, + length_penalty=args.length_penalty, + tgt_type_id=tgt_type_id, + ) + + for ids in output_ids.tolist(): + if eos_id in ids: + ids = ids[: ids.index(eos_id)] + evaluated_sentences_ids.append(ids) + + for ids in raw_tgt_labels.numpy().tolist(): + ids = ids[: ids.index(eos_id)] + reference_sentences_ids.append(ids) + + score1 = rouge1.score(evaluated_sentences_ids, reference_sentences_ids) + score2 = rouge2.score(evaluated_sentences_ids, reference_sentences_ids) + + logger.info("Rouge-1: %.5f ,Rouge-2: %.5f" % (score1 * 100, score2 * 100)) + + +if __name__ == "__main__": + evaluate() diff --git a/model_zoo/ernie-gen/model.py b/model_zoo/ernie-gen/model.py new file mode 100644 index 0000000000000000000000000000000000000000..31e30c6e9c333f38a1659fd5576a25311b8bc0cc --- /dev/null +++ b/model_zoo/ernie-gen/model.py @@ -0,0 +1,64 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
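+
+"""StackModel runs ErnieForGeneration's source / target / attention-query passes in a
+single forward and returns only the mean loss, so every forward output participates in
+the loss as required when the model is wrapped with paddle.DataParallel."""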
+ +import paddle +import paddle.nn as nn + + +class StackModel(nn.Layer): + def __init__(self, model): + super().__init__() + self.model = model + + def forward( + self, + src_ids, + src_sids, + src_pids, + tgt_ids, + tgt_sids, + tgt_pids, + attn_ids, + mask_src_2_src, + mask_tgt_2_srctgt, + mask_attn_2_srctgtattn, + tgt_labels, + tgt_pos, + ): + _, __, info = self.model( + src_ids, sent_ids=src_sids, pos_ids=src_pids, attn_bias=mask_src_2_src, encode_only=True + ) + cached_k, cached_v = info["caches"] + _, __, info = self.model( + tgt_ids, + sent_ids=tgt_sids, + pos_ids=tgt_pids, + attn_bias=mask_tgt_2_srctgt, + past_cache=(cached_k, cached_v), + encode_only=True, + ) + cached_k2, cached_v2 = info["caches"] + past_cache_k = [paddle.concat([k, k2], 1) for k, k2 in zip(cached_k, cached_k2)] + past_cache_v = [paddle.concat([v, v2], 1) for v, v2 in zip(cached_v, cached_v2)] + loss, _, __ = self.model( + attn_ids, + sent_ids=tgt_sids, + pos_ids=tgt_pids, + attn_bias=mask_attn_2_srctgtattn, + past_cache=(past_cache_k, past_cache_v), + tgt_labels=tgt_labels, + tgt_pos=tgt_pos, + ) + loss = loss.mean() + return loss diff --git a/model_zoo/ernie-gen/predict.py b/model_zoo/ernie-gen/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..408ded91231af664a54acbde4a8f9e5784999093 --- /dev/null +++ b/model_zoo/ernie-gen/predict.py @@ -0,0 +1,137 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
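+
+"""Generate predictions with a fine-tuned ERNIE-GEN model via beam-search infilling
+decoding and print the source, target, and predicted sentences."""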
+ +import argparse + +import paddle +from decode import beam_search_infilling, post_process +from encode import after_padding, convert_example +from paddle.io import DataLoader + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ( + BertTokenizer, + ElectraTokenizer, + ErnieForGeneration, + ErnieTinyTokenizer, + ErnieTokenizer, + RobertaTokenizer, +) +from paddlenlp.utils.log import logger + +# fmt: off +parser = argparse.ArgumentParser('seq2seq model with ERNIE-GEN') +parser.add_argument("--model_name_or_path", default=None, type=str, required=True, help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(list(ErnieTokenizer.pretrained_init_configuration.keys()))) +parser.add_argument('--max_encode_len', type=int, default=24, help="The max encoding sentence length") +parser.add_argument('--max_decode_len', type=int, default=72, help="The max decoding sentence length") +parser.add_argument("--batch_size", default=50, type=int, help="Batch size per GPU/CPU for training.", ) +parser.add_argument('--beam_width', type=int, default=3, help="Beam search width") +parser.add_argument('--length_penalty', type=float, default=1.0, help="The length penalty during decoding") +parser.add_argument('--init_checkpoint', type=str, default=None, help='Checkpoint to warm start from') +parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") +# fmt: on + +args = parser.parse_args() + + +def predict(): + paddle.set_device(args.device) + + model = ErnieForGeneration.from_pretrained(args.model_name_or_path) + if "ernie-tiny" in args.model_name_or_path: + tokenizer = ErnieTinyTokenizer.from_pretrained(args.model_name_or_path) + elif "ernie" in args.model_name_or_path: + tokenizer = ErnieTokenizer.from_pretrained(args.model_name_or_path) + elif "roberta" in args.model_name_or_path or "rbt" in args.model_name_or_path: + tokenizer = RobertaTokenizer.from_pretrained(args.model_name_or_path) + elif "electra" in args.model_name_or_path: + tokenizer = ElectraTokenizer.from_pretrained(args.model_name_or_path) + else: + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + + dev_dataset = load_dataset("poetry", splits=("dev"), lazy=False) + attn_id = tokenizer.vocab["[ATTN]"] if "[ATTN]" in tokenizer.vocab else tokenizer.vocab["[MASK]"] + tgt_type_id = model.sent_emb.weight.shape[0] - 1 + + trans_func = convert_example( + tokenizer=tokenizer, + attn_id=attn_id, + tgt_type_id=tgt_type_id, + max_encode_len=args.max_encode_len, + max_decode_len=args.max_decode_len, + ) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_pids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_sids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_pids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_sids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # attn_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_labels + ): after_padding(fn(samples)) + + dev_dataset = dev_dataset.map(trans_func) + test_batch_sampler = paddle.io.BatchSampler(dev_dataset, batch_size=args.batch_size, shuffle=False) + data_loader = DataLoader( + dataset=dev_dataset, batch_sampler=test_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + if 
args.init_checkpoint: + model_state = paddle.load(args.init_checkpoint) + model.set_state_dict(model_state) + + model.eval() + vocab = tokenizer.vocab + eos_id = vocab[tokenizer.sep_token] + sos_id = vocab[tokenizer.cls_token] + pad_id = vocab[tokenizer.pad_token] + unk_id = vocab[tokenizer.unk_token] + vocab_size = len(vocab) + logger.info("Predicting...") + for data in data_loader: + (src_ids, src_sids, src_pids, _, _, _, _, _, _, _, _, raw_tgt_labels) = data # never use target when infer + # Use greedy_search_infilling or beam_search_infilling to get predictions + output_ids = beam_search_infilling( + model, + src_ids, + src_sids, + eos_id=eos_id, + sos_id=sos_id, + attn_id=attn_id, + pad_id=pad_id, + unk_id=unk_id, + vocab_size=vocab_size, + max_decode_len=args.max_decode_len, + max_encode_len=args.max_encode_len, + beam_width=args.beam_width, + length_penalty=args.length_penalty, + tgt_type_id=tgt_type_id, + ) + + for source_ids, target_ids, predict_ids in zip( + src_ids.numpy().tolist(), raw_tgt_labels.numpy().tolist(), output_ids.tolist() + ): + if eos_id in predict_ids: + predict_ids = predict_ids[: predict_ids.index(eos_id)] + source_sentence = "".join(map(post_process, vocab.to_tokens(source_ids[1 : source_ids.index(eos_id)]))) + tgt_sentence = "".join(map(post_process, vocab.to_tokens(target_ids[1 : target_ids.index(eos_id)]))) + predict_ids = "".join(map(post_process, vocab.to_tokens(predict_ids))) + print("source :%s\ntarget :%s\npredict:%s\n" % (source_sentence, tgt_sentence, predict_ids)) + + +if __name__ == "__main__": + predict() diff --git a/model_zoo/ernie-gen/train.py b/model_zoo/ernie-gen/train.py new file mode 100644 index 0000000000000000000000000000000000000000..feafba7ce36c04eaf9a03a8254f1877a5ce3686b --- /dev/null +++ b/model_zoo/ernie-gen/train.py @@ -0,0 +1,323 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
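+
+"""Fine-tune ERNIE-GEN on the poetry dataset using the infilling generation setup from
+encode.py and decode.py; evaluate with ROUGE-1 / ROUGE-2 every --save_steps steps and
+save checkpoints to --output_dir."""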
+ +import argparse +import os +import time + +import paddle +import paddle.nn as nn +from decode import beam_search_infilling, post_process +from encode import after_padding, convert_example +from model import StackModel +from paddle.io import DataLoader +from tqdm import tqdm + +from paddlenlp.data import Pad, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import Rouge1, Rouge2 +from paddlenlp.transformers import ( + BertTokenizer, + ElectraTokenizer, + ErnieForGeneration, + ErnieTinyTokenizer, + ErnieTokenizer, + LinearDecayWithWarmup, + RobertaTokenizer, +) +from paddlenlp.utils.log import logger + +parser = argparse.ArgumentParser("seq2seq model with ERNIE-GEN") +parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join(list(ErnieTokenizer.pretrained_init_configuration.keys())), +) +parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", +) +parser.add_argument("--max_encode_len", type=int, default=5, help="The max encoding sentence length") +parser.add_argument("--max_decode_len", type=int, default=5, help="The max decoding sentence length") +parser.add_argument( + "--batch_size", + default=8, + type=int, + help="Batch size per GPU/CPU for training.", +) +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.1, type=float, help="Weight decay if we apply some.") +parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") +parser.add_argument( + "--num_epochs", + default=3, + type=int, + help="Total number of training epochs to perform.", +) +parser.add_argument("--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion.") +parser.add_argument("--logging_steps", type=int, default=1, help="Log every X updates steps.") +parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") +parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu", "xpu"], + help="The device to select to train the model, is must be cpu/gpu/xpu.", +) +parser.add_argument("--beam_width", type=int, default=1, help="Beam search width.") +parser.add_argument("--noise_prob", type=float, default=0.0, help="Probability of token be repalced.") +parser.add_argument( + "--use_random_noice", + action="store_true", + help="If set, replace target tokens with random token from vocabulary, else replace with `[NOISE]`.", +) +parser.add_argument("--label_smooth", type=float, default=0.0, help="The soft label smooth rate.") +parser.add_argument("--length_penalty", type=float, default=1.0, help="The length penalty during decoding.") +parser.add_argument("--init_checkpoint", type=str, default=None, help="Checkpoint to warm start from.") +parser.add_argument("--save_dir", type=str, default=None, help="Model output directory.") +parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. 
Override num_epochs.", +) + +args = parser.parse_args() + + +def evaluate(model, data_loader, tokenizer, rouge1, rouge2, attn_id, tgt_type_id, args): + model.eval() + + vocab = tokenizer.vocab + eos_id = vocab[tokenizer.sep_token] + sos_id = vocab[tokenizer.cls_token] + pad_id = vocab[tokenizer.pad_token] + unk_id = vocab[tokenizer.unk_token] + vocab_size = len(vocab) + evaluated_sentences_ids = [] + reference_sentences_ids = [] + logger.info("Evaluating...") + for data in tqdm(data_loader): + (src_ids, src_tids, src_pids, _, _, _, _, _, _, _, _, raw_tgt_labels) = data # never use target when infer + # Use greedy_search_infilling or beam_search_infilling to get predictions + output_ids = beam_search_infilling( + model, + src_ids, + src_tids, + eos_id=eos_id, + sos_id=sos_id, + attn_id=attn_id, + pad_id=pad_id, + unk_id=unk_id, + vocab_size=vocab_size, + max_decode_len=args.max_decode_len, + max_encode_len=args.max_encode_len, + beam_width=args.beam_width, + length_penalty=args.length_penalty, + tgt_type_id=tgt_type_id, + ) + + for ids in output_ids.tolist(): + if eos_id in ids: + ids = ids[: ids.index(eos_id)] + evaluated_sentences_ids.append(ids) + + for ids in raw_tgt_labels.numpy().tolist(): + ids = ids[: ids.index(eos_id)] + reference_sentences_ids.append(ids) + + score1 = rouge1.score(evaluated_sentences_ids, reference_sentences_ids) + score2 = rouge2.score(evaluated_sentences_ids, reference_sentences_ids) + + logger.info("Rouge-1: %.5f ,Rouge-2: %.5f" % (score1 * 100, score2 * 100)) + + evaluated_sentences = [] + reference_sentences = [] + for ids in reference_sentences_ids[:5]: + reference_sentences.append("".join(map(post_process, vocab.to_tokens(ids)))) + for ids in evaluated_sentences_ids[:5]: + evaluated_sentences.append("".join(map(post_process, vocab.to_tokens(ids)))) + logger.debug(reference_sentences) + logger.debug(evaluated_sentences) + + model.train() + + +def train(): + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + model = ErnieForGeneration.from_pretrained(args.model_name_or_path) + if "ernie-tiny" in args.model_name_or_path: + tokenizer = ErnieTinyTokenizer.from_pretrained(args.model_name_or_path) + elif "ernie" in args.model_name_or_path: + tokenizer = ErnieTokenizer.from_pretrained(args.model_name_or_path) + elif "roberta" in args.model_name_or_path or "rbt" in args.model_name_or_path: + tokenizer = RobertaTokenizer.from_pretrained(args.model_name_or_path) + elif "electra" in args.model_name_or_path: + tokenizer = ElectraTokenizer.from_pretrained(args.model_name_or_path) + else: + tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) + if args.init_checkpoint: + model_state = paddle.load(args.init_checkpoint) + model.set_state_dict(model_state) + + train_dataset, dev_dataset = load_dataset("poetry", splits=("train", "dev"), lazy=False) + attn_id = tokenizer.vocab["[ATTN]"] if "[ATTN]" in tokenizer.vocab else tokenizer.vocab["[MASK]"] + tgt_type_id = model.sent_emb.weight.shape[0] - 1 + + trans_func = convert_example( + tokenizer=tokenizer, + attn_id=attn_id, + tgt_type_id=tgt_type_id, + max_encode_len=args.max_encode_len, + max_decode_len=args.max_decode_len, + noise_prob=args.noise_prob, + use_random_noice=args.use_random_noice, + ) + + train_dataset = train_dataset.map(trans_func) + train_batch_sampler = paddle.io.DistributedBatchSampler(train_dataset, batch_size=args.batch_size, shuffle=True) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, 
pad_val=tokenizer.pad_token_id), # src_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # src_pids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # src_tids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_pids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # tgt_tids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # attn_ids + Pad(axis=0, pad_val=tokenizer.pad_token_id), # tgt_labels + ): after_padding(fn(samples)) + train_data_loader = DataLoader( + dataset=train_dataset, + batch_sampler=train_batch_sampler, + collate_fn=batchify_fn, + num_workers=0, + return_list=True, + ) + + dev_dataset = dev_dataset.map(trans_func) + dev_data_loader = DataLoader( + dataset=dev_dataset, batch_size=args.batch_size, collate_fn=batchify_fn, num_workers=0, return_list=True + ) + + label_num = model.word_emb.weight.shape[0] + train_model = StackModel(model) + if paddle.distributed.get_world_size() > 1: + # All 'forward' outputs derived from the module parameters using in DataParallel + # must participate in the calculation of losses and subsequent gradient calculations. + # So we use StackModel here to make the model only output loss in its 'forward' function. + train_model = paddle.DataParallel(train_model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.num_epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=nn.ClipGradByGlobalNorm(1.0), + apply_decay_param_fun=lambda x: x in decay_params, + ) + + rouge1 = Rouge1() + rouge2 = Rouge2() + + global_step = 0 + tic_train = time.time() + for epoch in range(args.num_epochs): + for step, batch in enumerate(train_data_loader): + global_step += 1 + ( + src_ids, + src_tids, + src_pids, + tgt_ids, + tgt_tids, + tgt_pids, + attn_ids, + mask_src_2_src, + mask_tgt_2_srctgt, + mask_attn_2_srctgtattn, + tgt_labels, + _, + ) = batch + if args.label_smooth > 0.0: + tgt_labels = nn.functional.label_smooth( + nn.functional.one_hot(tgt_labels, label_num), epsilon=args.label_smooth + ) + tgt_pos = paddle.nonzero(attn_ids == attn_id) + loss = train_model( + src_ids, + src_tids, + src_pids, + tgt_ids, + tgt_tids, + tgt_pids, + attn_ids, + mask_src_2_src, + mask_tgt_2_srctgt, + mask_attn_2_srctgtattn, + tgt_labels, + tgt_pos, + ) + if global_step % args.logging_steps == 0: + if paddle.distributed.get_rank() == 0: + logger.info( + "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s, lr: %.3e" + % ( + global_step, + epoch, + step, + loss, + args.logging_steps / (time.time() - tic_train), + lr_scheduler.get_lr(), + ) + ) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if ( + global_step % args.save_steps == 0 + or global_step == num_training_steps + and paddle.distributed.get_rank() == 0 + ): + evaluate(model, dev_data_loader, tokenizer, rouge1, rouge2, attn_id, tgt_type_id, args) + output_dir = os.path.join(args.output_dir, "model_%d" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = 
model._layers if isinstance(model, paddle.DataParallel) else model + model_to_save.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + if global_step >= num_training_steps: + return + + +if __name__ == "__main__": + train() diff --git a/model_zoo/ernie-health/README.md b/model_zoo/ernie-health/README.md new file mode 100644 index 0000000000000000000000000000000000000000..af5924fffec0fb1b1d8015350976594c53a04491 --- /dev/null +++ b/model_zoo/ernie-health/README.md @@ -0,0 +1,189 @@ +# ERNIE-Health 中文医疗预训练模型 + +医疗领域存在大量的专业知识和医学术语,人类经过长时间的学习才能成为一名优秀的医生。那机器如何才能“读懂”医疗文献呢?尤其是面对电子病历、生物医疗文献中存在的大量非结构化、非标准化文本,计算机是无法直接使用、处理的。这就需要自然语言处理(Natural Language Processing,NLP)技术大展身手了。 + +## 模型介绍 + +本项目针对中文医疗语言理解任务,开源了中文医疗预训练模型 [ERNIE-Health](https://arxiv.org/pdf/2110.07244.pdf)(模型名称`ernie-health-chinese`)。 + +ERNIE-Health 依托百度文心 ERNIE 先进的知识增强预训练语言模型打造, 通过医疗知识增强技术进一步学习海量的医疗数据, 精准地掌握了专业的医学知识。ERNIE-Health 利用医疗实体掩码策略对专业术语等实体级知识学习, 学会了海量的医疗实体知识。同时,通过医疗问答匹配任务学习病患病状描述与医生专业治疗方案的对应关系,获得了医疗实体知识之间的内在联系。ERNIE-Health 共学习了 60 多万的医疗专业术语和 4000 多万的医疗专业问答数据,大幅提升了对医疗专业知识的理解和建模能力。此外,ERNIE-Health 还探索了多级语义判别预训练任务,提升了模型对医疗知识的学习效率。该模型的整体结构与 ELECTRA 相似,包括生成器和判别器两部分。 + +![Overview_of_EHealth](https://user-images.githubusercontent.com/25607475/163949632-8b34e23c-d0cd-49df-8d88-8549a253d221.png) + +更多技术细节可参考论文 +- [Building Chinese Biomedical Language Models via Multi-Level Text Discrimination](https://arxiv.org/pdf/2110.07244.pdf) + +## 模型效果 + +ERNIE-Health模型以超越人类医学专家水平的成绩登顶中文医疗信息处理权威榜单 [CBLUE](https://github.com/CBLUEbenchmark/CBLUE) 冠军, 验证了 ERNIE 在医疗行业应用的重要价值。 + +![CBLUERank](https://user-images.githubusercontent.com/25607475/160394225-04f75498-ce1a-4665-85f7-d495815eed51.png) + +相应的开源模型参数 ``ernie-health-chinese`` 在 CBLUE **验证集** 上的评测指标如下表所示: + +| Task | metric | results | results (fp16) | +| --------- | :------: | :-----: | :------------: | +| CHIP-STS | Macro-F1 | 0.88749 | 0.88555 | +| CHIP-CTC | Macro-F1 | 0.84136 | 0.83514 | +| CHIP-CDN | F1 | 0.76979 | 0.76489 | +| KUAKE-QQR | Accuracy | 0.83865 | 0.84053 | +| KUAKE-QTR | Accuracy | 0.69722 | 0.69722 | +| KUAKE-QIC | Accuracy | 0.81483 | 0.82046 | +| CMeEE | Micro-F1 | 0.66120 | 0.66026 | +| CMeIE | Micro-F1 | 0.61385 | 0.60076 | + +## 环境依赖 + +- paddlepaddle >= 2.2.0 +- paddlenlp >= 2.3.4 + +## 模型预训练 + +PaddleNLP中提供了ERNIE-Health训练好的模型参数。``ernie-health-chinese``版本为160G医疗文本数据上的训练结果,数据包括脱敏医患对话语料、医疗健康科普文章、脱敏医院电子医疗病例档案以及医学和临床病理学教材。本节提供了预训练的整体流程,可用于自定义数据的学习。 + +#### 注意: 预训练资源要求 + +- 推荐使用至少4张16G以上显存的GPU进行预训练。 +- 数据量应尽可能接近ERNIE-Health论文中训练数据的量级,以获得好的预训练模型效果。 +- 若资源有限,可以直接使用开源的ERNIE-Health模型进行微调,具体实现可参考 [CBLUE样例](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-health/cblue)。 + +#### 数据准备 + +- 数据编码:UTF-8 +- 数据格式:预训练文本数据放在同个目录下,每个文件中每行一句中文文本。 + +- 数据预处理:首先对原始文本进行分词,分词结果中非首中文字符替换为``##``前缀的字符(例如,``医疗``处理后得到``[医, ##疗]``)。接着将token转换为对应的id。最后将目录下的全部数据合并存储,token ids拼接后存储至``.npy``文件,每条样本的长度存储在``.npz``文件。 + +```shell +python preprocess.py --input_path ./raw_data/ --output_file ./data/samples --tokenize_tool lac --num_worker 8 +``` +可配置参数包括 +- ``input_path`` 为原始文本数据所在目录,该目录下包含至少一个中文文本文件,UTF-8编码。 +- ``output_file`` 为预处理后数据的存储路径及文件名(不包含后缀)。 +- ``tokenize_tool``表示分词工具,包括``lac``、``seg``和``jieba``,默认为``lac``。 +- ``logging_steps`` 表示日志打印间隔,每处理``logging_steps``个句子打印一次日志。 +- ``num_worker`` 表示使用的进程数,增加进程数可加速预处理。 + + +#### 单机单卡 + +``` +CUDA_VISIBLE_DEVICES=0 python run_pretrain.py \ + --input_dir ./data \ + --output_dir ./output \ + --learning_rate 1e-7 \ + --batch_size 10 \ + --adam_epsilon 1e-8 \ + --weight_decay 1e-2 \ + --warmup_steps 10000 \ + --max_steps 
1000000 \ + --save_steps 10000 \ + --logging_steps 1 \ + --seed 1000 \ + --use_amp +``` + +#### 单机多卡 + +``` +unset CUDA_VISIBLE_DEVICES +python -m paddle.distributed.launch --gpus "0,1,2,3" run_pretrain.py \ + --input_dir ./data \ + --output_dir ./output \ + --learning_rate 1e-7 \ + --batch_size 10 \ + --adam_epsilon 1e-8 \ + --weight_decay 1e-2 \ + --warmup_steps 10000 \ + --max_steps 1000000 \ + --save_steps 10000 \ + --logging_steps 1 \ + --seed 1000 \ + --use_amp +``` + +可配置参数包括 +- ``model_name_or_path``表示内置模型参数名(目前支持``ernie-health-chinese``),或者模型参数配置路径(这时需配置 --init_from_ckpt 参数一起使用,一般用于断点恢复训练场景。) +- ``input_dir``表示训练数据所在目录,该目录下要有``.npy``和``.npz``两个文件,格式与```preprocess.py``预处理结果相同。 +- ``output_dir``表示预训练模型参数和训练日志的保存目录。 +- ``batch_size``表示每次迭代每张卡上的样本数量。当batch_size=4时,运行时单卡约需要12G显存。如果实际GPU显存小于12G或大大多于12G,可适当调小/调大此配置。 +- ``learning_rate`` 表示基础学习率大小,将于learning rate scheduler产生的值相乘作为当前学习率。 +- ``max_seq_length`` 表示最大句子长度,超过该长度将被截断。 +- ``weight_decay`` 表示每次迭代中参数缩小的比例,该值乘以学习率为真正缩小的比例。 +- ``adam_epsilon`` 表示adam优化器中的epsilon值。 +- ``warmup_steps`` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。 +- ``num_epochs`` 表示训练轮数。 +- ``logging_steps`` 表示日志打印间隔。 +- ``save_steps`` 表示模型保存间隔。 +- ``max_steps`` 如果配置且大于0,表示预训练最多执行的迭代数量;如果不配置或配置小于0,则根据输入数据量、``batch_size``和``num_epochs``来确定预训练迭代数量。 +- ``device`` 表示使用的设备类型。默认为GPU,可以配置为CPU、GPU、XPU。若希望使用GPU训练,将其设置为GPU,同时环境变量CUDA_VISIBLE_DEVICES配置要使用的GPU id。 +- ``use_amp`` 表示是否开启混合精度(float16)进行训练,默认不开启。如果在命令中加上了--use_amp,则会开启。 +- ``init_from_ckpt`` 表示是否从某个checkpoint继续训练(断点恢复训练),默认不开启。如果在命令中加上了--init_from_ckpt,且 --model_name_or_path 配置的是路径,则会开启从某个checkpoint继续训练。 + +#### Trainer 训练版本 +本样例同时提供了Trainer版本的预训练流程,预训练重启、可视化等流程较为完备。需要从源码安装paddlenlp使用。 + +``` +unset CUDA_VISIBLE_DEVICES +task_name="eheath-pretraining" + +python -u -m paddle.distributed.launch \ + --gpus 0,1,2,3,4,5,6,7 \ + run_pretrain_trainer.py \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --max_seq_length 512 \ + --gradient_accumulation_steps 1\ + --per_device_train_batch_size 8 \ + --learning_rate 0.001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 20 \ + --dataloader_num_workers 2 \ + --device "gpu"\ + --fp16 \ + --fp16_opt_level "O1" \ + --do_train \ + --disable_tqdm True\ + --save_total_limit 10 +``` +大部分参数含义如上文所述,这里简要介绍一些新参数: + +- ``per_device_train_batch_size`` 同上文batch_size。训练时,每次迭代每张卡上的样本数目。 +- ``warmup_ratio`` 与warmup_steps类似,warmup步数占总步数的比例。 +- ``fp16`` 与`use_amp`相同,表示使用混合精度 +- ``fp16_opt_level`` 混合精度的策略。注:O2训练eHealth存在部分问题,暂时请勿使用。 +- ``save_total_limit`` 保存的ckpt数量的最大限制 + +## 微调 + +模型预训练结束后,可以对判别器进行微调以完成下游医疗任务。不同任务的模型加载方式如下: + +``` +from paddlenlp.transformers import * + +tokenizer = AutoTokenizer.from_pretrained('ernie-health-chinese') + +# 分类任务 +model = AutoModelForSequenceClassification.from_pretrained('ernie-health-chinese') +# 序列标注任务 +model = AutoModelForTokenClassification.from_pretrained('ernie-health-chinese') +# 阅读理解任务 +model = AutoModelForQuestionAnswering.from_pretrained('ernie-health-chinese') +``` + +本项目提供了在 CBLUE 数据集上的微调脚本,包括分类、实体识别和关系抽取三类任务,详细信息可参考 ``cblue``[目录](./cblue)。 + +## 部署 + +我们为ERNIE-Health微调后的模型提供了Python端部署方案,请根据实际情况进行实现。 + +详细部署流程请参考:[基于ONNXRuntime推理部署指南](./cblue/deploy/predictor/) + + +## Reference + +Wang, Quan, et al. “Building Chinese Biomedical Language Models via Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021). 
[pdf](https://arxiv.org/abs/2110.07244) diff --git a/model_zoo/ernie-health/cblue/README.md b/model_zoo/ernie-health/cblue/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ce37227d052ce5ddc86bcb89a54178bbeb4b1b65 --- /dev/null +++ b/model_zoo/ernie-health/cblue/README.md @@ -0,0 +1,130 @@ +# 使用医疗领域预训练模型Fine-tune完成中文医疗语言理解任务 + +本示例展示了中文医疗预训练模型 ERNIE-Health([Building Chinese Biomedical Language Models via Multi-Level Text Discrimination](https://arxiv.org/abs/2110.07244))如何 Fine-tune 完成中文医疗语言理解任务。 + +## 数据集介绍 + +本项目使用了中文医学语言理解测评([Chinese Biomedical Language Understanding Evaluation,CBLUE](https://github.com/CBLUEbenchmark/CBLUE))1.0 版本数据集,这是国内首个面向中文医疗文本处理的多任务榜单,涵盖了医学文本信息抽取(实体识别、关系抽取)、医学术语归一化、医学文本分类、医学句子关系判定和医学问答共5大类任务8个子任务。其数据来源分布广泛,包括医学教材、电子病历、临床试验公示以及互联网用户真实查询等。该榜单一经推出便受到了学界和业界的广泛关注,已逐渐发展成为检验AI系统中文医疗信息处理能力的“金标准”。 + +* CMeEE:中文医学命名实体识别 +* CMeIE:中文医学文本实体关系抽取 +* CHIP-CDN:临床术语标准化任务 +* CHIP-CTC:临床试验筛选标准短文本分类 +* CHIP-STS:平安医疗科技疾病问答迁移学习 +* KUAKE-QIC:医疗搜索检索词意图分类 +* KUAKE-QTR:医疗搜索查询词-页面标题相关性 +* KUAKE-QQR:医疗搜索查询词-查询词相关性 + +更多信息可参考CBLUE的[github](https://github.com/CBLUEbenchmark/CBLUE/blob/main/README_ZH.md)。其中对于临床术语标准化任务(CHIP-CDN),我们按照 ERNIE-Health 中的方法通过检索将原多分类任务转换为了二分类任务,即给定一诊断原词和一诊断标准词,要求判定后者是否是前者对应的诊断标准词。本项目提供了检索处理后的 CHIP-CDN 数据集(简写`CHIP-CDN-2C`),且构建了基于该数据集的example代码。 + +## 模型介绍 + +ERNIE-Health 模型的整体结构与 ELECTRA 相似,包括生成器和判别器两部分。 而 Fine-tune 过程只用到了判别器模块,由 12 层 Transformer 网络组成。 + +## 快速开始 + +### 代码结构说明 + +以下是本项目主要代码结构及说明: + +```text +cblue/ +├── train_classification.py # 文本分类任务训练评估脚本 +├── train_ner.py # 实体识别任务训练评估脚本 +├── train_spo.py # 关系抽取任务训练评估脚本 +├── export_model.py # 动态图导出静态图参数脚本 +└── README.md +``` + +### 依赖安装 + +```shell +pip install xlrd==1.2.0 +``` + +### 模型训练 + +我们按照任务类别划分,同时提供了8个任务的样例代码。可以运行下边的命令,在训练集上进行训练,并在**验证集**上进行验证。 + +**训练参数设置(Training setup)及结果** + +| Task | epochs | batch_size | learning_rate | max_seq_length | metric | results | results (fp16) | +| --------- | :----: | :--------: | :-----------: | :------------: | :------: | :-----: | :------------: | +| CHIP-STS | 4 | 16 | 3e-5 | 96 | Macro-F1 | 0.88749 | 0.88555 | +| CHIP-CTC | 4 | 32 | 6e-5 | 160 | Macro-F1 | 0.84136 | 0.83514 | +| CHIP-CDN | 16 | 256 | 3e-5 | 32 | F1 | 0.76979 | 0.76489 | +| KUAKE-QQR | 2 | 32 | 6e-5 | 64 | Accuracy | 0.83865 | 0.84053 | +| KUAKE-QTR | 4 | 32 | 6e-5 | 64 | Accuracy | 0.69722 | 0.69722 | +| KUAKE-QIC | 4 | 32 | 6e-5 | 128 | Accuracy | 0.81483 | 0.82046 | +| CMeEE | 2 | 32 | 6e-5 | 128 | Micro-F1 | 0.66120 | 0.66026 | +| CMeIE | 100 | 12 | 6e-5 | 300 | Micro-F1 | 0.61385 | 0.60076 | + +可支持配置的参数: + +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_length`:可选,ELECTRA模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为6e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.01。 +* `epochs`: 训练轮次,默认为3。 +* `max_steps`: 最大训练步数。若训练`epochs`轮包含的训练步数大于该值,则达到`max_steps`后就提前结束。 +* `valid_steps`: evaluate的间隔steps数,默认100。 +* `save_steps`: 保存checkpoints的间隔steps数,默认100。 +* `logging_steps`: 日志打印的间隔steps数,默认10。 +* `warmup_proption`:可选,学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.1。 +* `init_from_ckpt`:可选,模型参数路径,恢复模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. 
+* `device`: 选用什么设备进行训练,可选cpu、gpu或npu。如使用gpu训练则参数gpus指定GPU卡号。 +* `use_amp`: 是否使用混合精度训练,默认为False。 + + +#### 医疗文本分类任务 + +```shell +$ unset CUDA_VISIBLE_DEVICES +$ python -m paddle.distributed.launch --gpus '0,1,2,3' train_classification.py --dataset CHIP-CDN-2C --batch_size 256 --max_seq_length 32 --learning_rate 3e-5 --epochs 16 +``` + +其他可支持配置的参数: + +* `dataset`:可选,CHIP-CDN-2C CHIP-CTC CHIP-STS KUAKE-QIC KUAKE-QTR KUAKE-QQR,默认为KUAKE-QIC数据集。 + +#### 医疗命名实体识别任务(CMeEE) + +```shell +$ export CUDA_VISIBLE_DEVICES=0 +$ python train_ner.py --batch_size 32 --max_seq_length 128 --learning_rate 6e-5 --epochs 12 +``` + +#### 医疗关系抽取任务(CMeIE) + +```shell +$ export CUDA_VISIBLE_DEVICES=0 +$ python train_spo.py --batch_size 12 --max_seq_length 300 --learning_rate 6e-5 --epochs 100 +``` + +### 静态图模型导出 + +使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,用于部署推理等,具体代码见export_model.py。静态图参数保存在`output_path`指定路径中。 + +运行方式: +1. 分类任务静态图模型导出 +```shell +python export_model.py --train_dataset CHIP-CDN-2C --params_path=./checkpoint/model_900/ --output_path=./export +``` + +2. SPO任务静态图模型导出 +```shell +python export_model.py --train_dataset CMeIE --params_path=./checkpoint/model_900/ --output_path=./export +``` + +3. NER任务静态图模型导出 +```shell +python export_model.py --train_dataset CMeEE --params_path=./checkpoint/model_1500/ --output_path=./export +``` + +**NOTICE**: train_dataset分类任务选择填上训练数据集名称,params_path选择最好参数的模型的路径。 + +[1] CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [pdf](https://arxiv.org/abs/2106.08087) [git](https://github.com/CBLUEbenchmark/CBLUE) [web](https://tianchi.aliyun.com/specials/promotion/2021chinesemedicalnlpleaderboardchallenge) + +[2] Wang, Quan, et al. “Building Chinese Biomedical Language Models via Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021). 
[pdf](https://arxiv.org/abs/2110.07244) diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/README.md b/model_zoo/ernie-health/cblue/deploy/predictor/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4741a43385ffec8c3869c10453e6e7abe018ad28 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/README.md @@ -0,0 +1,191 @@ +# 基于ONNXRuntime推理部署指南 + +本示例以[CBLUE数据集微调](../../README.md)得到的ERNIE-Health模型为例,分别提供了文本分类任务、实体识别任务和关系抽取任务的部署代码,自定义数据集可参考实现。 +在推理部署前需将微调后的动态图模型转换导出为静态图,详细步骤见[静态图模型导出](../../README.md)。 + +**目录** + * [环境安装](#环境安装) + * [GPU部署推理样例](#gpu部署推理样例) + * [CPU部署推理样例](#cpu部署推理样例) + * [性能与精度测试](#性能与精度测试) + * [GPU精度与性能](#gpu精度与性能) + * [CPU精度与性能](#cpu精度与性能) + +## 环境安装 + +ONNX模型转换和推理部署依赖于Paddle2ONNX和ONNXRuntime。其中Paddle2ONNX支持将Paddle静态图模型转化为ONNX模型格式,算子目前稳定支持导出ONNX Opset 7~15,更多细节可参考:[Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX)。 + +#### GPU端 +请先确保机器已正确安装NVIDIA相关驱动和基础软件,确保CUDA >= 11.2,CuDNN >= 8.2,并使用以下命令安装所需依赖: +``` +python -m pip install -r requirements_gpu.tx +``` +\* 如需使用半精度(FP16)部署,请确保GPU设备的CUDA计算能力 (CUDA Compute Capability) 大于7.0,典型的设备包括V100、T4、A10、A100、GTX 20系列和30系列显卡等。 更多关于CUDA Compute Capability和精度支持情况请参考NVIDIA文档:[GPU硬件与支持精度对照表](https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-840-ea/support-matrix/index.html#hardware-precision-matrix) +#### CPU端 +请使用如下命令安装所需依赖: +``` +python -m pip install -r requirements_cpu.txt +``` +## GPU部署推理样例 + +请使用如下命令进行GPU上的部署,可用`use_fp16`开启**半精度部署推理加速**,可用`device_id`**指定GPU卡号**。 + +- 文本分类任务 + +``` +python infer_classification.py --device gpu --device_id 0 --dataset KUAKE-QIC --model_path_prefix ../../export/inference +``` + +- 实体识别任务 + +``` +python infer_ner.py --device gpu --device_id 0 --dataset CMeEE --model_path_prefix ../../export/inference +``` + +- 关系抽取任务 + +``` +python infer_spo.py --device gpu --device_id 0 --dataset CMeIE --model_path_prefix ../../export/inference +``` + +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型;默认为"ernie-health-chinese"。 +* `dataset`:CBLUE中的训练数据集。 + * `文本分类任务`:包括KUAKE-QIC, KUAKE-QQR, KUAKE-QTR, CHIP-CTC, CHIP-STS, CHIP-CDN-2C;默认为KUAKE-QIC。 + * `实体抽取任务`:默认为CMeEE。 + * `关系抽取任务`:默认为CMeIE。 +* `max_seq_length`:模型使用的最大序列长度,最大不能超过512;`关系抽取任务`默认为300,其余默认为128。 +* `use_fp16`:选择是否开启FP16进行加速,仅在`devive=gpu`时生效;默认关闭。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为200。 +* `device`: 选用什么设备进行训练,可选cpu、gpu;默认为gpu。 +* `device_id`: 选择GPU卡号;默认为0。 +* `data_file`:本地待预测数据文件;默认为None。 + +#### 本地数据集加载 +如需使用本地数据集,请指定本地待预测数据文件 `data_file`,每行一条样例,单文本输入每句一行,双文本输入以`\t`分隔符隔开。例如 + +**ctc-data.txt** +``` +在过去的6个月曾服用偏头痛预防性药物或长期服用镇痛药物者,以及有酒精依赖或药物滥用习惯者; +患有严重的冠心病、脑卒中,以及传染性疾病、精神疾病者; +活动性乙肝(包括大三阳或小三阳)或血清学指标(HBsAg或/和HBeAg或/和HBcAb)阳性者,丙肝、肺结核、巨细胞病毒、严重真菌感染或HIV感染; +... +``` + +**sts-data.txt** +``` +糖尿病能吃减肥药吗?能治愈吗?\t糖尿病为什么不能吃减肥药? +为什么慢性乙肝会急性发作\t引起隐匿性慢性乙肝的原因是什么 +标准血压是多少高血压指?低血压又指?\t半月前检查血压100/130,正常吗? +... 
+``` + +## CPU部署推理样例 + +请使用如下命令进行CPU上的部署,可用`num_threads`**调整预测线程数量**。 + +- 文本分类任务 + +``` +python infer_classification.py --device cpu --dataset KUAKE-QIC --model_path_prefix ../../export/inference +``` + +- 实体识别任务 + +``` +python infer_ner.py --device cpu --dataset CMeEE --model_path_prefix ../../export/inference +``` + +- 关系抽取任务 + +``` +python infer_spo.py --device cpu --dataset CMeIE --model_path_prefix ../../export/inference +``` + +可支持配置的参数: + +* `model_path_prefix`:必须,待推理模型路径前缀。 +* `model_name_or_path`:选择预训练模型;默认为"ernie-health-chinese"。 +* `dataset`:CBLUE中的训练数据集。 + * `文本分类任务`:包括KUAKE-QIC, KUAKE-QQR, KUAKE-QTR, CHIP-CTC, CHIP-STS, CHIP-CDN-2C;默认为KUAKE-QIC。 + * `实体抽取任务`:默认为CMeEE。 + * `关系抽取任务`:默认为CMeIE。 +* `max_seq_length`:模型使用的最大序列长度,最大不能超过512;`关系抽取任务`默认为300,其余默认为128。 +* `batch_size`:批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为200。 +* `device`: 选用什么设备进行训练,可选cpu、gpu;默认为gpu。 +* `num_threads`:cpu线程数,在`device=gpu`时影响较小;默认为cpu的物理核心数量。 +* `data_file`:本地待预测数据文件,格式见[GPU部署推理样例](#本地数据集加载)中的介绍;默认为None。 + +## 性能与精度测试 + +本节提供了在CBLUE数据集上预测的性能和精度数据,以供参考。 + +测试配置如下: + +1. 数据集 + + 使用CBLUE数据集中的开发集用于ERNIE-Health微调模型部署推理的性能与精度测试,包括 + + - 医疗搜索检索词意图分类(KUAKE-QIC)任务 + - 医疗搜索查询词-页面标题相关性(KUAKE-QTR)任务 + - 医疗搜索查询词-查询词相关性(KUAKE-QQR)任务 + - 临床试验筛选标准短文本分类(CHIP-CTC)任务 + - 平安医疗科技疾病问答迁移学习(CHIP-STS)任务 + - 临床术语标准化匹配(CHIP-CDN-2C)任务 + - 中文医学命名实体识别(CMeEE)任务 + - 中文医学文本实体关系抽取(CMeIE)任务 + +2. 物理机环境 + + 系统: CentOS Linux release 7.7.1908 (Core) + + GPU: Tesla V100-SXM2-32GB + + CPU: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz + + CUDA: 11.2 + + cuDNN: 8.1.0 + + Driver Version: 460.27.04 + + 内存: 630 GB + +3. PaddlePaddle 版本:2.3.0 + +4. PaddleNLP 版本:2.3.4 + +5. 性能数据指标:latency。latency 测试方法:固定 batch size 为 200(CHIP-CDN-2C 和 CMeIE 数据集为 20),部署运行时间 total_time,计算 latency = total_time / total_samples + + +### GPU精度与性能 + +| 数据集 | 最大文本长度 | 精度评估指标 | FP32 指标值 | FP16 指标值 | FP32 latency(ms) | FP16 latency(ms) | +| ---------- | ---------- | ---------- | ---------- | ---------- | ---------------- | ---------------- | +| KUAKE-QIC | 128 | Accuracy | 0.8046 | 0.8046 | 1.92 | 0.46 | +| KUAKE-QTR | 64 | Accuracy | 0.6886 | 0.6876 (-) | 0.92 | 0.23 | +| KUAKE-QQR | 64 | Accuracy | 0.7755 | 0.7755 | 0.61 | 0.16 | +| CHIP-CTC | 160 | Macro F1 | 0.8445 | 0.8446 (+) | 2.34 | 0.60 | +| CHIP-STS | 96 | Macro F1 | 0.8892 | 0.8892 | 1.39 | 0.35 | +| CHIP-CDN-2C | 256 | Macro F1 | 0.8921 | 0.8920 (-) | 1.58 | 0.48 | +| CMeEE | 128 | Micro F1 | 0.6469 | 0.6468 (-) | 1.90 | 0.48 | +| CMeIE | 300 | Micro F1 | 0.5903 | 0.5902 (-) | 50.32 | 41.50 | + +经过FP16转化加速比达到 1.2 ~ 4 倍左右,精度变化在 1e-4 ~ 1e-3 内。 + +### CPU精度与性能 + +测试环境及说明如上,测试 CPU 性能时,线程数设置为40。 + +| 数据集 | 最大文本长度 | 精度评估指标 | FP32 指标值 | FP32 latency(ms) | +| ---------- | ------------ | ------------ | ---------- | ---------------- | +| KUAKE-QIC | 128 | Accuracy | 0.8046 | 37.72 | +| KUAKE-QTR | 64 | Accuracy | 0.6886 | 18.40 | +| KUAKE-QQR | 64 | Accuracy | 0.7755 | 10.34 | +| CHIP-CTC | 160 | Macro F1 | 0.8445 | 47.43 | +| CHIP-STS | 96 | Macro F1 | 0.8892 | 27.67 | +| CHIP-CDN-2C | 256 | Micro F1 | 0.8921 | 26.86 | +| CMeEE | 128 | Micro F1 | 0.6469 | 37.59 | +| CMeIE | 300 | Micro F1 | 0.5902 | 213.04 | diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/infer_classification.py b/model_zoo/ernie-health/cblue/deploy/predictor/infer_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..2c4586fc9bc2946d892f731a537be286248c3b97 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/infer_classification.py @@ -0,0 +1,146 @@ +# Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import psutil +from predictor import CLSPredictor + +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used." + ) + parser.add_argument( + "--model_name_or_path", default="ernie-health-chinese", type=str, help="The directory or name of model." + ) + parser.add_argument("--dataset", default="KUAKE-QIC", type=str, help="Dataset for text classfication.") + parser.add_argument("--data_file", default=None, type=str, help="The data to predict with one sample per line.") + parser.add_argument( + "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization." + ) + parser.add_argument( + "--use_fp16", + action="store_true", + help="Whether to use fp16 inference, only takes effect when deploying on gpu.", + ) + parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for predicting.") + parser.add_argument( + "--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu." + ) + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." 
+ ) + parser.add_argument("--device_id", default=0, help="Select which gpu device to train model.") + args = parser.parse_args() + return args + + +LABEL_LIST = { + "kuake-qic": ["病情诊断", "治疗方案", "病因分析", "指标解读", "就医建议", "疾病表述", "后果表述", "注意事项", "功效作用", "医疗费用", "其他"], + "kuake-qtr": ["完全不匹配", "很少匹配,有一些参考价值", "部分匹配", "完全匹配"], + "kuake-qqr": ["B为A的语义父集,B指代范围大于A; 或者A与B语义毫无关联。", "B为A的语义子集,B指代范围小于A。", "表示A与B等价,表述完全一致。"], + "chip-ctc": [ + "成瘾行为", + "居住情况", + "年龄", + "酒精使用", + "过敏耐受", + "睡眠", + "献血", + "能力", + "依存性", + "知情同意", + "数据可及性", + "设备", + "诊断", + "饮食", + "残疾群体", + "疾病", + "教育情况", + "病例来源", + "参与其它试验", + "伦理审查", + "种族", + "锻炼", + "性别", + "健康群体", + "实验室检查", + "预期寿命", + "读写能力", + "含有多类别的语句", + "肿瘤进展", + "疾病分期", + "护理", + "口腔相关", + "器官组织状态", + "药物", + "怀孕相关", + "受体状态", + "研究者决定", + "风险评估", + "性取向", + "体征(医生检测)", + " 吸烟状况", + "特殊病人特征", + "症状(患者感受)", + "治疗或手术", + ], + "chip-sts": ["语义不同", "语义相同"], + "chip-cdn-2c": ["否", "是"], +} + +TEXT = { + "kuake-qic": ["心肌缺血如何治疗与调养呢?", "什么叫痔核脱出?什么叫外痔?"], + "kuake-qtr": [["儿童远视眼怎么恢复视力", "远视眼该如何保养才能恢复一些视力"], ["抗生素的药有哪些", "抗生素类的药物都有哪些?"]], + "kuake-qqr": [["茴香是发物吗", "茴香怎么吃?"], ["气的胃疼是怎么回事", "气到胃痛是什么原因"]], + "chip-ctc": ["(1)前牙结构发育不良:釉质发育不全、氟斑牙、四环素牙等;", "怀疑或确有酒精或药物滥用史;"], + "chip-sts": [["糖尿病能吃减肥药吗?能治愈吗?", "糖尿病为什么不能吃减肥药"], ["H型高血压的定义", "WHO对高血压的最新分类定义标准数值"]], + "chip-cdn-2c": [["1型糖尿病性植物神经病变", " 1型糖尿病肾病IV期"], ["髂腰肌囊性占位", "髂肌囊肿"]], +} + +METRIC = { + "kuake-qic": "acc", + "kuake-qtr": "acc", + "kuake-qqr": "acc", + "chip-ctc": "macro", + "chip-sts": "macro", + "chip-cdn-2c": "macro", +} + + +def main(): + args = parse_args() + + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + args.dataset = args.dataset.lower() + label_list = LABEL_LIST[args.dataset] + if args.data_file is not None: + with open(args.data_file, "r") as fp: + input_data = [x.strip().split("\t") for x in fp.readlines()] + input_data = [x[0] if len(x) == 1 else x for x in input_data] + else: + input_data = TEXT[args.dataset] + + predictor = CLSPredictor(args, label_list) + predictor.predict(input_data) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/infer_ner.py b/model_zoo/ernie-health/cblue/deploy/predictor/infer_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..afc2d2ba99fc73678d2bab67eb27beff4811238d --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/infer_ner.py @@ -0,0 +1,116 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import psutil +from predictor import NERPredictor + +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used." + ) + parser.add_argument( + "--model_name_or_path", default="ernie-health-chinese", type=str, help="The directory or name of model." 
+ ) + parser.add_argument("--dataset", default="CMeEE", type=str, help="Dataset for named entity recognition.") + parser.add_argument("--data_file", default=None, type=str, help="The data to predict with one sample per line.") + parser.add_argument( + "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization" + ) + parser.add_argument( + "--use_fp16", + action="store_true", + help="Whether to use fp16 inference, only takes effect when deploying on gpu.", + ) + parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for predicting.") + parser.add_argument( + "--num_threads", default=psutil.cpu_count(logical=False), type=int, help="Number of threads for cpu." + ) + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + parser.add_argument("--device_id", default=0, help="Select which gpu device to train model.") + args = parser.parse_args() + return args + + +LABEL_LIST = { + "cmeee": [ + [ + "B-bod", + "I-bod", + "E-bod", + "S-bod", + "B-dis", + "I-dis", + "E-dis", + "S-dis", + "B-pro", + "I-pro", + "E-pro", + "S-pro", + "B-dru", + "I-dru", + "E-dru", + "S-dru", + "B-ite", + "I-ite", + "E-ite", + "S-ite", + "B-mic", + "I-mic", + "E-mic", + "S-mic", + "B-equ", + "I-equ", + "E-equ", + "S-equ", + "B-dep", + "I-dep", + "E-dep", + "S-dep", + "O", + ], + ["B-sym", "I-sym", "E-sym", "S-sym", "O"], + ] +} + +TEXT = {"cmeee": ["研究证实,细胞减少与肺内病变程度及肺内炎性病变吸收程度密切相关。", "可为不规则发热、稽留热或弛张热,但以不规则发热为多,可能与患儿应用退热药物导致热型不规律有关。"]} + + +def main(): + args = parse_args() + + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + dataset = args.dataset.lower() + label_list = LABEL_LIST[dataset] + if args.data_file is not None: + with open(args.data_file, "r") as fp: + input_data = [x.strip() for x in fp.readlines()] + else: + input_data = TEXT[dataset] + + predictor = NERPredictor(args, label_list) + predictor.predict(input_data) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/infer_spo.py b/model_zoo/ernie-health/cblue/deploy/predictor/infer_spo.py new file mode 100644 index 0000000000000000000000000000000000000000..972eade14d753f2343488c2b541d636cca0e8814 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/infer_spo.py @@ -0,0 +1,124 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import psutil +from predictor import SPOPredictor + +from paddlenlp.utils.log import logger + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used." + ) + parser.add_argument( + "--model_name_or_path", default="ernie-health-chinese", type=str, help="The directory or name of model." 
+ ) + parser.add_argument("--dataset", default="CMeIE", type=str, help="Dataset for named entity recognition.") + parser.add_argument("--data_file", default=None, type=str, help="The data to predict with one sample per line.") + parser.add_argument( + "--max_seq_length", default=300, type=int, help="The maximum total input sequence length after tokenization." + ) + parser.add_argument( + "--use_fp16", + action="store_true", + help="Whether to use fp16 inference, only takes effect when deploying on gpu.", + ) + parser.add_argument( + "--num_threads", default=psutil.cpu_count(logical=False), type=int, help="num_threads for cpu." + ) + parser.add_argument("--batch_size", default=20, type=int, help="Batch size per GPU/CPU for predicting.") + parser.add_argument( + "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." + ) + parser.add_argument("--device_id", default=0, help="Select which gpu device to train model.") + args = parser.parse_args() + return args + + +LABEL_LIST = { + "cmeie": [ + "预防", + "阶段", + "就诊科室", + "辅助治疗", + "化疗", + "放射治疗", + "手术治疗", + "实验室检查", + "影像学检查", + "辅助检查", + "组织学检查", + "内窥镜检查", + "筛查", + "多发群体", + "发病率", + "发病年龄", + "多发地区", + "发病性别倾向", + "死亡率", + "多发季节", + "传播途径", + "并发症", + "病理分型", + "相关(导致)", + "鉴别诊断", + "相关(转化)", + "相关(症状)", + "临床表现", + "治疗后症状", + "侵及周围组织转移的症状", + "病因", + "高危因素", + "风险评估因素", + "病史", + "遗传因素", + "发病机制", + "病理生理", + "药物治疗", + "发病部位", + "转移部位", + "外侵部位", + "预后状况", + "预后生存率", + "同义词", + ] +} + +TEXT = {"cmeie": ["骶髂关节炎是明确诊断JAS的关键条件。若有肋椎关节病变会使胸部扩张度减小。", "稳定型缺血性心脏疾病@肥胖与缺乏活动也导致高血压增多。"]} + + +def main(): + args = parse_args() + + for arg_name, arg_value in vars(args).items(): + logger.info("{:20}: {}".format(arg_name, arg_value)) + + dataset = args.dataset.lower() + label_list = LABEL_LIST[dataset] + if args.data_file is not None: + with open(args.data_file, "r") as fp: + input_data = [x.strip() for x in fp.readlines()] + else: + input_data = TEXT[dataset] + + predictor = SPOPredictor(args, label_list) + predictor.predict(input_data) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/predictor.py b/model_zoo/ernie-health/cblue/deploy/predictor/predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..6e3137301cfda85e50b2150f47a6b3574b81b0f1 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/predictor.py @@ -0,0 +1,361 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import time + +import numpy as np +import onnxruntime as ort +import paddle2onnx +import six + +from paddlenlp.transformers import ( + AutoTokenizer, + normalize_chars, + tokenize_special_chars, +) +from paddlenlp.utils.log import logger + + +class InferBackend(object): + def __init__(self, model_path_prefix, device="cpu", device_id=0, use_fp16=False, num_threads=10): + + if not isinstance(device, six.string_types): + logger.error( + ">>> [InferBackend] The type of device must be string, but the type you set is: ", type(device) + ) + exit(0) + if device not in ["cpu", "gpu"]: + logger.error(">>> [InferBackend] The device must be cpu or gpu, but your device is set to:", type(device)) + exit(0) + + logger.info(">>> [InferBackend] Creating Engine ...") + + onnx_model = paddle2onnx.command.c_paddle_to_onnx( + model_file=model_path_prefix + ".pdmodel", + params_file=model_path_prefix + ".pdiparams", + opset_version=13, + enable_onnx_checker=True, + ) + infer_model_dir = model_path_prefix.rsplit("/", 1)[0] + float_onnx_file = os.path.join(infer_model_dir, "model.onnx") + with open(float_onnx_file, "wb") as f: + f.write(onnx_model) + + if device == "gpu": + logger.info(">>> [InferBackend] Use GPU to inference ...") + providers = ["CUDAExecutionProvider"] + if use_fp16: + logger.info(">>> [InferBackend] Use FP16 to inference ...") + import onnx + from onnxconverter_common import float16 + + fp16_model_file = os.path.join(infer_model_dir, "fp16_model.onnx") + onnx_model = onnx.load_model(float_onnx_file) + trans_model = float16.convert_float_to_float16(onnx_model, keep_io_types=True) + onnx.save_model(trans_model, fp16_model_file) + onnx_model = fp16_model_file + else: + logger.info(">>> [InferBackend] Use CPU to inference ...") + providers = ["CPUExecutionProvider"] + if use_fp16: + logger.warning( + ">>> [InferBackend] Ignore use_fp16 as it only " + "takes effect when deploying on gpu..." + ) + + sess_options = ort.SessionOptions() + sess_options.intra_op_num_threads = num_threads + self.predictor = ort.InferenceSession( + onnx_model, sess_options=sess_options, providers=providers, provider_options=[{"device_id": device_id}] + ) + + self.input_handles = [ + self.predictor.get_inputs()[0].name, + self.predictor.get_inputs()[1].name, + ] + + if device == "gpu": + try: + assert "CUDAExecutionProvider" in self.predictor.get_providers() + except AssertionError: + raise AssertionError( + """The environment for GPU inference is not set properly. \nA possible cause is that you had installed both onnxruntime and onnxruntime-gpu. 
\nPlease run the following commands to reinstall: \n1) pip uninstall -y onnxruntime onnxruntime-gpu \n2) pip install onnxruntime-gpu""" + ) + logger.info(">>> [InferBackend] Engine Created ...") + + def infer(self, input_dict: dict): + input_dict = {k: v for k, v in input_dict.items() if k in self.input_handles} + result = self.predictor.run(None, input_dict) + return result + + +class EHealthPredictor(object): + def __init__(self, args, label_list): + self.label_list = label_list + self._tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=True) + self._max_seq_length = args.max_seq_length + self._batch_size = args.batch_size + self.inference_backend = InferBackend( + args.model_path_prefix, args.device, args.device_id, args.use_fp16, args.num_threads + ) + + def predict(self, input_data: list): + encoded_inputs = self.preprocess(input_data) + infer_result = self.infer_batch(encoded_inputs) + result = self.postprocess(infer_result) + self.printer(result, input_data) + return result + + def _infer(self, input_dict): + infer_data = self.inference_backend.infer(input_dict) + return infer_data + + def infer_batch(self, encoded_inputs): + num_sample = len(encoded_inputs["input_ids"]) + infer_data = None + num_infer_data = None + for idx in range(0, num_sample, self._batch_size): + l, r = idx, idx + self._batch_size + keys = encoded_inputs.keys() + input_dict = {k: encoded_inputs[k][l:r] for k in keys} + results = self._infer(input_dict) + if infer_data is None: + infer_data = [[x] for x in results] + num_infer_data = len(results) + else: + for i in range(num_infer_data): + infer_data[i].append(results[i]) + for i in range(num_infer_data): + infer_data[i] = np.concatenate(infer_data[i], axis=0) + return infer_data + + def performance(self, encoded_inputs): + nums = len(encoded_inputs["input_ids"]) + start_time = time.time() + infer_result = self.infer_batch(preprocess_result) # noqa + total_time = time.time() - start_time + logger.info("sample nums: %d, time: %.2f, latency: %.2f ms" % (nums, total_time, 1000 * total_time / nums)) + + def get_text_and_label(self, dataset): + raise NotImplementedError + + def preprocess(self, input_data: list): + raise NotImplementedError + + def postprocess(self, infer_data): + raise NotImplementedError + + def printer(self, result, input_data): + raise NotImplementedError + + +class CLSPredictor(EHealthPredictor): + def preprocess(self, input_data: list): + norm_text = lambda x: tokenize_special_chars(normalize_chars(x)) + # To deal with a pair of input text. 
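+        # Sentence-pair tasks (e.g. CHIP-STS, KUAKE-QTR/QQR, CHIP-CDN-2C) provide each
+        # sample as [text_a, text_b], so the input is a list of lists and is tokenized
+        # with text/text_pair; single-text tasks (e.g. KUAKE-QIC, CHIP-CTC) provide a
+        # plain string per sample and text_pair stays None.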
+ if isinstance(input_data[0], list): + text = [norm_text(sample[0]) for sample in input_data] + text_pair = [norm_text(sample[1]) for sample in input_data] + else: + text = [norm_text(x) for x in input_data] + text_pair = None + + data = self._tokenizer( + text=text, text_pair=text_pair, max_length=self._max_seq_length, padding=True, truncation=True + ) + + encoded_inputs = { + "input_ids": np.array(data["input_ids"], dtype="int64"), + "token_type_ids": np.array(data["token_type_ids"], dtype="int64"), + } + return encoded_inputs + + def postprocess(self, infer_data): + infer_data = infer_data[0] + max_value = np.max(infer_data, axis=1, keepdims=True) + exp_data = np.exp(infer_data - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + label = probs.argmax(axis=-1) + confidence = probs.max(axis=-1) + return {"label": label, "confidence": confidence} + + def printer(self, result, input_data): + label, confidence = result["label"], result["confidence"] + for i in range(len(label)): + logger.info("input data: {}".format(input_data[i])) + logger.info("labels: {}, confidence: {}".format(self.label_list[label[i]], confidence[i])) + logger.info("-----------------------------") + + +class NERPredictor(EHealthPredictor): + """The predictor for CMeEE dataset.""" + + en_to_cn = { + "bod": "身体", + "mic": "微生物类", + "dis": "疾病", + "sym": "临床表现", + "pro": "医疗程序", + "equ": "医疗设备", + "dru": "药物", + "dep": "科室", + "ite": "医学检验项目", + } + + def _extract_chunk(self, tokens): + chunks = set() + start_idx, cur_idx = 0, 0 + while cur_idx < len(tokens): + if tokens[cur_idx][0] == "B": + start_idx = cur_idx + cur_idx += 1 + while cur_idx < len(tokens) and tokens[cur_idx][0] == "I": + if tokens[cur_idx][2:] == tokens[start_idx][2:]: + cur_idx += 1 + else: + break + if cur_idx < len(tokens) and tokens[cur_idx][0] == "E": + if tokens[cur_idx][2:] == tokens[start_idx][2:]: + chunks.add((tokens[cur_idx][2:], start_idx - 1, cur_idx)) + cur_idx += 1 + elif tokens[cur_idx][0] == "S": + chunks.add((tokens[cur_idx][2:], cur_idx - 1, cur_idx)) + cur_idx += 1 + else: + cur_idx += 1 + return list(chunks) + + def preprocess(self, infer_data): + infer_data = [[x.lower() for x in text] for text in infer_data] + data = self._tokenizer( + infer_data, max_length=self._max_seq_length, padding=True, is_split_into_words=True, truncation=True + ) + + encoded_inputs = { + "input_ids": np.array(data["input_ids"], dtype="int64"), + "token_type_ids": np.array(data["token_type_ids"], dtype="int64"), + } + return encoded_inputs + + def postprocess(self, infer_data): + tokens_oth = np.argmax(infer_data[0], axis=-1) + tokens_sym = np.argmax(infer_data[1], axis=-1) + entity = [] + for oth_ids, sym_ids in zip(tokens_oth, tokens_sym): + token_oth = [self.label_list[0][x] for x in oth_ids] + token_sym = [self.label_list[1][x] for x in sym_ids] + chunks = self._extract_chunk(token_oth) + self._extract_chunk(token_sym) + sub_entity = [] + for etype, sid, eid in chunks: + sub_entity.append({"type": self.en_to_cn[etype], "start_id": sid, "end_id": eid}) + entity.append(sub_entity) + return {"entity": entity} + + def printer(self, result, input_data): + result = result["entity"] + for i, preds in enumerate(result): + logger.info("input data: {}".format(input_data[i])) + logger.info("detected entities:") + for item in preds: + logger.info( + "* entity: {}, type: {}, position: ({}, {})".format( + input_data[i][item["start_id"] : item["end_id"]], + item["type"], + item["start_id"], + item["end_id"], + ) + ) + 
logger.info("-----------------------------") + + +class SPOPredictor(EHealthPredictor): + """The predictor for the CMeIE dataset.""" + + def predict(self, input_data: list): + encoded_inputs = self.preprocess(input_data) + lengths = encoded_inputs["attention_mask"].sum(axis=-1) + infer_result = self.infer_batch(encoded_inputs) + result = self.postprocess(infer_result, lengths) + self.printer(result, input_data) + return result + + def preprocess(self, infer_data): + infer_data = [[x.lower() for x in text] for text in infer_data] + data = self._tokenizer( + infer_data, + max_length=self._max_seq_length, + padding=True, + is_split_into_words=True, + truncation=True, + return_attention_mask=True, + ) + encoded_inputs = { + "input_ids": np.array(data["input_ids"], dtype="int64"), + "token_type_ids": np.array(data["token_type_ids"], dtype="int64"), + "attention_mask": np.array(data["attention_mask"], dtype="float32"), + } + return encoded_inputs + + def postprocess(self, infer_data, lengths): + ent_logits = np.array(infer_data[0]) + spo_logits = np.array(infer_data[1]) + ent_pred_list = [] + ent_idxs_list = [] + for idx, ent_pred in enumerate(ent_logits): + seq_len = lengths[idx] - 2 + start = np.where(ent_pred[:, 0] > 0.5)[0] + end = np.where(ent_pred[:, 1] > 0.5)[0] + ent_pred = [] + ent_idxs = {} + for x in start: + y = end[end >= x] + if (x == 0) or (x > seq_len): + continue + if len(y) > 0: + y = y[0] + if y > seq_len: + continue + ent_idxs[x] = (x - 1, y - 1) + ent_pred.append((x - 1, y - 1)) + ent_pred_list.append(ent_pred) + ent_idxs_list.append(ent_idxs) + + spo_preds = spo_logits > 0 + spo_pred_list = [[] for _ in range(len(spo_preds))] + idxs, preds, subs, objs = np.nonzero(spo_preds) + for idx, p_id, s_id, o_id in zip(idxs, preds, subs, objs): + obj = ent_idxs_list[idx].get(o_id, None) + if obj is None: + continue + sub = ent_idxs_list[idx].get(s_id, None) + if sub is None: + continue + spo_pred_list[idx].append((tuple(sub), p_id, tuple(obj))) + + return {"entity": ent_pred_list, "spo": spo_pred_list} + + def printer(self, result, input_data): + ent_pred_list, spo_pred_list = result["entity"], result["spo"] + for i, (ent, rel) in enumerate(zip(ent_pred_list, spo_pred_list)): + logger.info("input data: {}".format(input_data[i])) + logger.info("detected entities and relations:") + for sid, eid in ent: + logger.info("* entity: {}, position: ({}, {})".format(input_data[i][sid : eid + 1], sid, eid)) + for s, p, o in rel: + logger.info( + "+ spo: ({}, {}, {})".format( + input_data[i][s[0] : s[1] + 1], self.label_list[p], input_data[i][o[0] : o[1] + 1] + ) + ) + logger.info("-----------------------------") diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/requirements_cpu.txt b/model_zoo/ernie-health/cblue/deploy/predictor/requirements_cpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..645682ec79c6c8694ee9ea288af3dc3c416a4dfb --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/requirements_cpu.txt @@ -0,0 +1,2 @@ +onnxruntime==1.10.0 +psutil diff --git a/model_zoo/ernie-health/cblue/deploy/predictor/requirements_gpu.txt b/model_zoo/ernie-health/cblue/deploy/predictor/requirements_gpu.txt new file mode 100644 index 0000000000000000000000000000000000000000..2ca8b172eb7993140d6f5e2c3692a200195dd1ee --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/predictor/requirements_gpu.txt @@ -0,0 +1,4 @@ +onnxruntime-gpu==1.11.1 +onnx==1.12.0 +onnxconverter-common==1.9.0 +psutil diff --git 
a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/README.md b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..50166fe400b50fc6f3ef6e21b889ed704273ae12 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/README.md @@ -0,0 +1,60 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 +使用有SimpleServing功能的PaddleNLP版本 +```shell +pip install paddlenlp >= 2.3.6 +``` +## Server服务启动 +### 分类任务启动 +#### 启动 分类 Server 服务 +```bash +paddlenlp server server_classification:app --host 0.0.0.0 --port 8189 +``` + +#### 分类任务发送服务 +```bash +python client_classification.py --dataset chip-cdn-2c +``` + +### NER 任务启动 +#### 启动 NER Server 服务 +```bash +paddlenlp server server_ner:app --host 0.0.0.0 --port 8189 +``` + +#### NER Client发送服务 +```bash +python client_ner.py +``` + +### SPO 任务启动 +#### 启动 SPO Server 服务 +```bash +paddlenlp server server_spo:app --host 0.0.0.0 --port 8189 +``` + +#### SPO Client 发送服务 +```bash +python client_spo.py +``` + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size` 参数 +```python + data = { + 'data': { + 'text': texts, + 'text_pair': text_pairs if len(text_pairs) > 0 else None + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size + } + } +``` diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_classification.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..1993acb4b0f0456a5af6a11572baca1521dc9372 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_classification.py @@ -0,0 +1,54 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +import requests + +parser = argparse.ArgumentParser() +parser.add_argument("--dataset", required=True, type=str, help="The dataset name for the simple seving") +parser.add_argument( + "--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization." 
+) +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() + +url = "http://0.0.0.0:8189/models/cblue_cls" +headers = {"Content-Type": "application/json"} + +TEXT = { + "kuake-qic": ["心肌缺血如何治疗与调养呢?", "什么叫痔核脱出?什么叫外痔?"], + "kuake-qtr": [["儿童远视眼怎么恢复视力", "远视眼该如何保养才能恢复一些视力"], ["抗生素的药有哪些", "抗生素类的药物都有哪些?"]], + "kuake-qqr": [["茴香是发物吗", "茴香怎么吃?"], ["气的胃疼是怎么回事", "气到胃痛是什么原因"]], + "chip-ctc": ["(1)前牙结构发育不良:釉质发育不全、氟斑牙、四环素牙等;", "怀疑或确有酒精或药物滥用史;"], + "chip-sts": [["糖尿病能吃减肥药吗?能治愈吗?", "糖尿病为什么不能吃减肥药"], ["H型高血压的定义", "WHO对高血压的最新分类定义标准数值"]], + "chip-cdn-2c": [["1型糖尿病性植物神经病变", " 1型糖尿病肾病IV期"], ["髂腰肌囊性占位", "髂肌囊肿"]], +} + +if __name__ == "__main__": + args.dataset = args.dataset.lower() + input_data = TEXT[args.dataset] + texts = [] + text_pairs = [] + for data in input_data: + if len(data) == 2: + text_pairs.append(data[1]) + texts.append(data[0]) + data = { + "data": {"text": texts, "text_pair": text_pairs if len(text_pairs) > 0 else None}, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_ner.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..d3c64479ec20c2925bf64232d262ca879feb9298 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_ner.py @@ -0,0 +1,40 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +import requests + +parser = argparse.ArgumentParser() +parser.add_argument( + "--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() + +url = "http://0.0.0.0:8189/models/cblue_ner" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = ["研究证实,细胞减少与肺内病变程度及肺内炎性病变吸收程度密切相关。", "可为不规则发热、稽留热或弛张热,但以不规则发热为多,可能与患儿应用退热药物导致热型不规律有关。"] + texts = [[x.lower() for x in text] for text in texts] + data = { + "data": { + "text": texts, + }, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size, "is_split_into_words": True}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_spo.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_spo.py new file mode 100644 index 0000000000000000000000000000000000000000..38c34459d054ab57fac31ab0ef5b70c03a45ce07 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/client_spo.py @@ -0,0 +1,45 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +import requests + +parser = argparse.ArgumentParser() +parser.add_argument( + "--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() + +url = "http://0.0.0.0:8189/models/cblue_spo" +headers = {"Content-Type": "application/json"} + +if __name__ == "__main__": + texts = ["骶髂关节炎是明确诊断JAS的关键条件。若有肋椎关节病变会使胸部扩张度减小。", "稳定型缺血性心脏疾病@肥胖与缺乏活动也导致高血压增多。"] + texts = [[x.lower() for x in text] for text in texts] + data = { + "data": { + "text": texts, + }, + "parameters": { + "max_seq_len": args.max_seq_len, + "batch_size": args.batch_size, + "return_attention_mask": True, + "is_split_into_words": True, + }, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_classification.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..1c024501dd15c512b637930d924978e251451771 --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_classification.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer +from paddlenlp.server import CustomModelHandler, MultiClassificationPostHandler + +app = SimpleServer() +app.register( + "models/cblue_cls", + model_path="../../../export", + tokenizer_name="ernie-health-chinese", + model_handler=CustomModelHandler, + post_handler=MultiClassificationPostHandler, +) diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_ner.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..2b20efd1df819e2d62567f327ef80a91ff3d79cc --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_ner.py @@ -0,0 +1,129 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + +from paddlenlp import SimpleServer +from paddlenlp.server import BasePostHandler, TokenClsModelHandler + +en_to_cn = { + "bod": "身体", + "mic": "微生物类", + "dis": "疾病", + "sym": "临床表现", + "pro": "医疗程序", + "equ": "医疗设备", + "dru": "药物", + "dep": "科室", + "ite": "医学检验项目", +} + +label_list = [ + [ + "B-bod", + "I-bod", + "E-bod", + "S-bod", + "B-dis", + "I-dis", + "E-dis", + "S-dis", + "B-pro", + "I-pro", + "E-pro", + "S-pro", + "B-dru", + "I-dru", + "E-dru", + "S-dru", + "B-ite", + "I-ite", + "E-ite", + "S-ite", + "B-mic", + "I-mic", + "E-mic", + "S-mic", + "B-equ", + "I-equ", + "E-equ", + "S-equ", + "B-dep", + "I-dep", + "E-dep", + "S-dep", + "O", + ], + ["B-sym", "I-sym", "E-sym", "S-sym", "O"], +] + + +def _extract_chunk(tokens): + chunks = set() + start_idx, cur_idx = 0, 0 + while cur_idx < len(tokens): + if tokens[cur_idx][0] == "B": + start_idx = cur_idx + cur_idx += 1 + while cur_idx < len(tokens) and tokens[cur_idx][0] == "I": + if tokens[cur_idx][2:] == tokens[start_idx][2:]: + cur_idx += 1 + else: + break + if cur_idx < len(tokens) and tokens[cur_idx][0] == "E": + if tokens[cur_idx][2:] == tokens[start_idx][2:]: + chunks.add((tokens[cur_idx][2:], start_idx - 1, cur_idx)) + cur_idx += 1 + elif tokens[cur_idx][0] == "S": + chunks.add((tokens[cur_idx][2:], cur_idx - 1, cur_idx)) + cur_idx += 1 + else: + cur_idx += 1 + return list(chunks) + + +class NERPostHandler(BasePostHandler): + def __init__(self): + super().__init__() + + @classmethod + def process(cls, data, parameters): + if "logits" not in data or "logits_1" not in data: + raise ValueError( + "The output of model handler do not include the 'logits', " + " please check the model handler output. The model handler output:\n{}".format(data) + ) + tokens_oth = np.array(data["logits"]) + tokens_sym = np.array(data["logits_1"]) + tokens_oth = np.argmax(tokens_oth, axis=-1) + tokens_sym = np.argmax(tokens_sym, axis=-1) + entity = [] + for oth_ids, sym_ids in zip(tokens_oth, tokens_sym): + token_oth = [label_list[0][x] for x in oth_ids] + token_sym = [label_list[1][x] for x in sym_ids] + chunks = _extract_chunk(token_oth) + _extract_chunk(token_sym) + sub_entity = [] + for etype, sid, eid in chunks: + sub_entity.append({"type": en_to_cn[etype], "start_id": sid, "end_id": eid}) + entity.append(sub_entity) + return {"entity": entity} + + +app = SimpleServer() +app.register( + "models/cblue_ner", + model_path="../../../export_ner", + tokenizer_name="ernie-health-chinese", + model_handler=TokenClsModelHandler, + post_handler=NERPostHandler, +) diff --git a/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_spo.py b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_spo.py new file mode 100644 index 0000000000000000000000000000000000000000..1a64cdbe66aa897e2cdcdce977227ebd6440e95e --- /dev/null +++ b/model_zoo/ernie-health/cblue/deploy/serving/simple_serving/server_spo.py @@ -0,0 +1,142 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + +from paddlenlp import SimpleServer +from paddlenlp.server import BasePostHandler, TokenClsModelHandler + +label_list = [ + "预防", + "阶段", + "就诊科室", + "辅助治疗", + "化疗", + "放射治疗", + "手术治疗", + "实验室检查", + "影像学检查", + "辅助检查", + "组织学检查", + "内窥镜检查", + "筛查", + "多发群体", + "发病率", + "发病年龄", + "多发地区", + "发病性别倾向", + "死亡率", + "多发季节", + "传播途径", + "并发症", + "病理分型", + "相关(导致)", + "鉴别诊断", + "相关(转化)", + "相关(症状)", + "临床表现", + "治疗后症状", + "侵及周围组织转移的症状", + "病因", + "高危因素", + "风险评估因素", + "病史", + "遗传因素", + "发病机制", + "病理生理", + "药物治疗", + "发病部位", + "转移部位", + "外侵部位", + "预后状况", + "预后生存率", + "同义词", +] + + +class SPOPostHandler(BasePostHandler): + def __init__(self): + super().__init__() + + @classmethod + def process(cls, data, parameters): + if "logits" not in data or "logits_1" not in data: + raise ValueError( + "The output of model handler do not include the 'logits', " + " please check the model handler output. The model handler output:\n{}".format(data) + ) + lengths = np.array(data["attention_mask"], dtype="float32").sum(axis=-1) + ent_logits = np.array(data["logits"]) + spo_logits = np.array(data["logits_1"]) + ent_pred_list = [] + ent_idxs_list = [] + for idx, ent_pred in enumerate(ent_logits): + seq_len = lengths[idx] - 2 + start = np.where(ent_pred[:, 0] > 0.5)[0] + end = np.where(ent_pred[:, 1] > 0.5)[0] + ent_pred = [] + ent_idxs = {} + for x in start: + y = end[end >= x] + if (x == 0) or (x > seq_len): + continue + if len(y) > 0: + y = y[0] + if y > seq_len: + continue + ent_idxs[x] = (x - 1, y - 1) + ent_pred.append((x - 1, y - 1)) + ent_pred_list.append(ent_pred) + ent_idxs_list.append(ent_idxs) + + spo_preds = spo_logits > 0 + spo_pred_list = [[] for _ in range(len(spo_preds))] + idxs, preds, subs, objs = np.nonzero(spo_preds) + for idx, p_id, s_id, o_id in zip(idxs, preds, subs, objs): + obj = ent_idxs_list[idx].get(o_id, None) + if obj is None: + continue + sub = ent_idxs_list[idx].get(s_id, None) + if sub is None: + continue + spo_pred_list[idx].append((tuple(sub), p_id, tuple(obj))) + input_data = data["data"]["text"] + ent_list = [] + spo_list = [] + for i, (ent, rel) in enumerate(zip(ent_pred_list, spo_pred_list)): + cur_ent_list = [] + cur_spo_list = [] + for sid, eid in ent: + cur_ent_list.append("".join([str(d) for d in input_data[i][sid : eid + 1]])) + for s, p, o in rel: + cur_spo_list.append( + ( + "".join([str(d) for d in input_data[i][s[0] : s[1] + 1]]), + label_list[p], + "".join([str(d) for d in input_data[i][o[0] : o[1] + 1]]), + ) + ) + ent_list.append(cur_ent_list) + spo_list.append(cur_spo_list) + + return {"entity": ent_list, "spo": spo_list} + + +app = SimpleServer() +app.register( + "models/cblue_spo", + model_path="../../../export", + tokenizer_name="ernie-health-chinese", + model_handler=TokenClsModelHandler, + post_handler=SPOPostHandler, +) diff --git a/model_zoo/ernie-health/cblue/export_model.py b/model_zoo/ernie-health/cblue/export_model.py new file mode 100644 index 
0000000000000000000000000000000000000000..ebc71e376aa47c75444adaca8589c87c142f0fd1 --- /dev/null +++ b/model_zoo/ernie-health/cblue/export_model.py @@ -0,0 +1,88 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from model import ElectraForBinaryTokenClassification, ElectraForSPO + +from paddlenlp.transformers import ElectraForSequenceClassification + +NUM_CLASSES = { + "CHIP-CDN-2C": 2, + "CHIP-STS": 2, + "CHIP-CTC": 44, + "KUAKE-QQR": 3, + "KUAKE-QTR": 4, + "KUAKE-QIC": 11, + "CMeEE": [33, 5], + "CMeIE": 44, +} + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--train_dataset", required=True, type=str, help="The name of dataset used for training.") + parser.add_argument( + "--params_path", + type=str, + required=True, + default="./checkpoint/", + help="The path to model parameters to be loaded.", + ) + parser.add_argument( + "--output_path", type=str, default="./export", help="The path of model parameter in static graph to be saved." + ) + args = parser.parse_args() + return args + + +def main(): + args = parse_args() + + # Load the model parameters. + if args.train_dataset not in NUM_CLASSES: + raise ValueError(f"Please modify the code to fit {args.dataset}") + + if args.train_dataset == "CMeEE": + model = ElectraForBinaryTokenClassification.from_pretrained( + args.params_path, + num_classes_oth=NUM_CLASSES[args.train_dataset][0], + num_classes_sym=NUM_CLASSES[args.train_dataset][1], + ) + elif args.train_dataset == "CMeIE": + model = ElectraForSPO.from_pretrained(args.params_path, num_labels=NUM_CLASSES[args.train_dataset]) + else: + model = ElectraForSequenceClassification.from_pretrained( + args.params_path, num_labels=NUM_CLASSES[args.train_dataset] + ) + + model.eval() + + # Convert to static graph with specific input description: + # input_ids, token_type_ids + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + model = paddle.jit.to_static(model, input_spec=input_spec) + + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/cblue/model.py b/model_zoo/ernie-health/cblue/model.py new file mode 100644 index 0000000000000000000000000000000000000000..71c9d62e71ffeb724beb9d034ebd55d19ee74ab0 --- /dev/null +++ b/model_zoo/ernie-health/cblue/model.py @@ -0,0 +1,122 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
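Once `export_model.py` above has written the static graph (for example `python export_model.py --train_dataset CMeEE --params_path ./checkpoint/ --output_path ./export_ner`), the saved prefix can be reloaded for a quick sanity check. A rough sketch, assuming the CMeEE export and the default `inference` prefix; the zero-filled inputs are placeholders only:

```python
# Illustrative only: reload the static graph saved by export_model.py and run a dummy batch.
# "./export_ner/inference" matches the prefix passed to paddle.jit.save above; the zero-filled
# ids are placeholders, not meaningful tokens.
import paddle

model = paddle.jit.load("./export_ner/inference")
model.eval()
input_ids = paddle.zeros([1, 16], dtype="int64")
token_type_ids = paddle.zeros([1, 16], dtype="int64")
logits_oth, logits_sym = model(input_ids, token_type_ids)  # two heads for the CMeEE model
print(logits_oth.shape, logits_sym.shape)
```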
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn as nn + +from paddlenlp.transformers import ElectraConfig, ElectraModel, ElectraPretrainedModel + + +class ElectraForBinaryTokenClassification(ElectraPretrainedModel): + """ + Electra Model with two linear layers on top of the hidden-states output layers, + designed for token classification tasks with nesting. + + Args: + electra (:class:`ElectraModel`): + An instance of ElectraModel. + num_classes (list): + The number of classes. + dropout (float, optionl): + The dropout probability for output of Electra. + If None, use the same value as `hidden_dropout_prob' of 'ElectraModel` + instance `electra`. Defaults to None. + """ + + def __init__(self, config: ElectraConfig, num_classes_oth, num_classes_sym): + super(ElectraForBinaryTokenClassification, self).__init__(config) + self.num_classes_oth = num_classes_oth + self.num_classes_sym = num_classes_sym + self.electra = ElectraModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier_oth = nn.Linear(config.hidden_size, self.num_classes_oth) + self.classifier_sym = nn.Linear(config.hidden_size, self.num_classes_sym) + + def forward(self, input_ids=None, token_type_ids=None, position_ids=None, attention_mask=None): + sequence_output = self.electra(input_ids, token_type_ids, position_ids, attention_mask) + sequence_output = self.dropout(sequence_output) + + logits_sym = self.classifier_sym(sequence_output) + logits_oth = self.classifier_oth(sequence_output) + + return logits_oth, logits_sym + + +class MultiHeadAttentionForSPO(nn.Layer): + """ + Multi-head attention layer for SPO task. + """ + + def __init__(self, embed_dim, num_heads, scale_value=768): + super(MultiHeadAttentionForSPO, self).__init__() + self.embed_dim = embed_dim + self.num_heads = num_heads + self.scale_value = scale_value**-0.5 + self.q_proj = nn.Linear(embed_dim, embed_dim * num_heads) + self.k_proj = nn.Linear(embed_dim, embed_dim * num_heads) + + def forward(self, query, key): + q = self.q_proj(query) + k = self.k_proj(key) + q = paddle.reshape(q, shape=[0, 0, self.num_heads, self.embed_dim]) + k = paddle.reshape(k, shape=[0, 0, self.num_heads, self.embed_dim]) + q = paddle.transpose(q, perm=[0, 2, 1, 3]) + k = paddle.transpose(k, perm=[0, 2, 1, 3]) + scores = paddle.matmul(q, k, transpose_y=True) + scores = paddle.scale(scores, scale=self.scale_value) + return scores + + +class ElectraForSPO(ElectraPretrainedModel): + """ + Electra Model with a linear layer on top of the hidden-states output + layers for entity recognition, and a multi-head attention layer for + relation classification. + + Args: + electra (:class:`ElectraModel`): + An instance of ElectraModel. + num_classes (int): + The number of classes. + dropout (float, optionl): + The dropout probability for output of Electra. + If None, use the same value as `hidden_dropout_prob' of 'ElectraModel` + instance `electra`. Defaults to None. 
+ """ + + def __init__(self, config: ElectraConfig): + super(ElectraForSPO, self).__init__(config) + self.num_classes = config.num_labels + self.electra = ElectraModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, 2) + self.span_attention = MultiHeadAttentionForSPO(config.hidden_size, config.num_labels) + + def forward(self, input_ids=None, token_type_ids=None, position_ids=None, attention_mask=None): + outputs = self.electra( + input_ids, token_type_ids, position_ids, attention_mask, output_hidden_states=True, return_dict=True + ) + sequence_outputs = outputs.last_hidden_state + all_hidden_states = outputs.hidden_states + sequence_outputs = self.dropout(sequence_outputs) + ent_logits = self.classifier(sequence_outputs) + + subject_output = all_hidden_states[-2] + cls_output = paddle.unsqueeze(sequence_outputs[:, 0, :], axis=1) + subject_output = subject_output + cls_output + + output_size = self.num_classes + self.electra.config["hidden_size"] # noqa:F841 + rel_logits = self.span_attention(sequence_outputs, subject_output) + + return ent_logits, rel_logits diff --git a/model_zoo/ernie-health/cblue/train_classification.py b/model_zoo/ernie-health/cblue/train_classification.py new file mode 100644 index 0000000000000000000000000000000000000000..b7e59b2f80f0acfff2f7b717db26345966ad7374 --- /dev/null +++ b/model_zoo/ernie-health/cblue/train_classification.py @@ -0,0 +1,263 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import distutils.util +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from paddle.metric import Accuracy +from utils import LinearDecayWithWarmup, convert_example, create_dataloader + +from paddlenlp.data import Pad, Stack, Tuple +from paddlenlp.datasets import load_dataset +from paddlenlp.metrics import AccuracyAndF1, MultiLabelsMetric +from paddlenlp.transformers import ElectraForSequenceClassification, ElectraTokenizer + +METRIC_CLASSES = { + "KUAKE-QIC": Accuracy, + "KUAKE-QQR": Accuracy, + "KUAKE-QTR": Accuracy, + "CHIP-CTC": MultiLabelsMetric, + "CHIP-STS": MultiLabelsMetric, + "CHIP-CDN-2C": AccuracyAndF1, +} + +parser = argparse.ArgumentParser() +parser.add_argument( + "--dataset", + choices=["KUAKE-QIC", "KUAKE-QQR", "KUAKE-QTR", "CHIP-STS", "CHIP-CTC", "CHIP-CDN-2C"], + default="KUAKE-QIC", + type=str, + help="Dataset for sequence classfication tasks.", +) +parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu", "npu"], + default="gpu", + help="Select which device to train model, default to gpu.", +) +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs.") +parser.add_argument( + "--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. 
Override epochs." +) +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--learning_rate", default=6e-5, type=float, help="Learning rate for fine-tuning sequence classification task." +) +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay of optimizer if we apply some.") +parser.add_argument( + "--warmup_proportion", + default=0.1, + type=float, + help="Linear warmup proportion of learning rate over the training process.", +) +parser.add_argument( + "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") +parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument("--save_steps", default=100, type=int, help="The interval steps to save checkpoints.") +parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") +parser.add_argument("--use_amp", default=False, type=distutils.util.strtobool, help="Enable mixed precision training.") +parser.add_argument("--scale_loss", default=128, type=float, help="The value of scale_loss for fp16.") + +args = parser.parse_args() + + +def set_seed(seed): + """set random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and compute the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + dataloader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(obj:`paddle.nn.Layer`): It can compute the loss. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
+ """ + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, position_ids, labels = batch + logits = model(input_ids, token_type_ids, position_ids) + loss = criterion(logits, labels) + losses.append(loss.numpy()) + correct = metric.compute(logits, labels) + metric.update(correct) + if isinstance(metric, Accuracy): + metric_name = "accuracy" + result = metric.accumulate() + elif isinstance(metric, MultiLabelsMetric): + metric_name = "macro f1" + _, _, result = metric.accumulate("macro") + else: + metric_name = "micro f1" + _, _, _, result, _ = metric.accumulate() + + print("eval loss: %.5f, %s: %.5f" % (np.mean(losses), metric_name, result)) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("cblue", args.dataset, splits=["train", "dev"]) + + model = ElectraForSequenceClassification.from_pretrained( + "ernie-health-chinese", num_labels=len(train_ds.label_list) + ) + tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) + batchify_fn = lambda samples, fn=Tuple( # noqa: E731 + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + Pad(axis=0, pad_val=args.max_seq_length - 1, dtype="int64"), # position + Stack(dtype="int64"), + ): [data for data in fn(samples)] + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + state_keys = {x: x.replace("discriminator.", "") for x in state_dict.keys() if "discriminator." in x} + if len(state_keys) > 0: + state_dict = {state_keys[k]: state_dict[k] for k in state_keys.keys()} + model.set_dict(state_dict) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + args.epochs = (num_training_steps - 1) // len(train_data_loader) + 1 + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.loss.CrossEntropyLoss() + if METRIC_CLASSES[args.dataset] is Accuracy: + metric = METRIC_CLASSES[args.dataset]() + metric_name = "accuracy" + elif METRIC_CLASSES[args.dataset] is MultiLabelsMetric: + metric = METRIC_CLASSES[args.dataset](num_labels=len(train_ds.label_list)) + metric_name = "macro f1" + else: + metric = METRIC_CLASSES[args.dataset]() + metric_name = "micro f1" + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + global_step = 0 + tic_train = time.time() + total_train_time = 0 + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, position_ids, labels = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu", "tanh"], + ): + logits = model(input_ids, token_type_ids, position_ids) + loss = criterion(logits, labels) + probs = F.softmax(logits, axis=1) + correct = metric.compute(probs, labels) + metric.update(correct) + + if isinstance(metric, Accuracy): + result = metric.accumulate() + elif isinstance(metric, MultiLabelsMetric): + _, _, result = metric.accumulate("macro") + else: + _, _, _, result, _ = metric.accumulate() + + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + total_train_time += time_diff + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, %s: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, metric_name, result, args.logging_steps / time_diff) + ) + + if global_step % args.valid_steps == 0 and rank == 0: + evaluate(model, criterion, metric, dev_data_loader) + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + if paddle.distributed.get_world_size() > 1: + model._layers.save_pretrained(save_dir) + else: + model.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + if global_step >= num_training_steps: + return + tic_train = time.time() + + if rank == 0 and total_train_time > 0: + print("Speed: %.2f steps/s" % (global_step / total_train_time)) + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/ernie-health/cblue/train_ner.py b/model_zoo/ernie-health/cblue/train_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..de7e50b1f9dde76ba1b106092e7d86ad085cca81 --- /dev/null +++ b/model_zoo/ernie-health/cblue/train_ner.py @@ -0,0 +1,254 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
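All flags accepted by `train_classification.py` are defined in its argparse block above. A minimal launch sketch for a single card follows; the dataset and hyper-parameter values are examples only, not recommended settings, and multi-card runs would typically go through `python -m paddle.distributed.launch` instead:

```python
# Illustrative only: one way to launch the sequence-classification fine-tuning above.
# Flag names come from the argparse definitions in train_classification.py; the chosen
# dataset and hyper-parameters are examples.
import subprocess

subprocess.run(
    [
        "python", "train_classification.py",
        "--dataset", "KUAKE-QIC",
        "--batch_size", "32",
        "--max_seq_length", "128",
        "--learning_rate", "6e-5",
        "--epochs", "3",
        "--save_dir", "./checkpoint",
    ],
    check=True,
)
```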
+# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +from model import ElectraForBinaryTokenClassification +from utils import ( + LinearDecayWithWarmup, + NERChunkEvaluator, + convert_example_ner, + create_dataloader, +) + +from paddlenlp.data import Dict, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ElectraTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu", "npu"], + default="gpu", + help="Select which device to train model, default to gpu.", +) +parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") +parser.add_argument("--batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--learning_rate", default=6e-5, type=float, help="Learning rate for fine-tuning token classification task." +) +parser.add_argument( + "--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") +parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") +parser.add_argument("--save_steps", default=100, type=int, help="The interval steps to save checkpoints.") +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") +parser.add_argument( + "--warmup_proportion", default=0.1, type=float, help="Linear warmup proportion over the training process." +) +parser.add_argument("--use_amp", default=False, type=bool, help="Enable mixed precision training.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs.") +parser.add_argument( + "--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override epochs." 
+) +parser.add_argument("--seed", default=1000, type=int, help="Random seed.") +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument("--scale_loss", default=128, type=float, help="The value of scale_loss for fp16.") + +args = parser.parse_args() + + +def set_seed(seed): + """set random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + model.eval() + metric.reset() + losses = [] + for batch in data_loader: + input_ids, token_type_ids, position_ids, masks, label_oth, label_sym = batch + logits = model(input_ids, token_type_ids, position_ids) + + loss_mask = masks.unsqueeze(2) + loss = [(criterion(x, y.unsqueeze(2)) * loss_mask).mean() for x, y in zip(logits, [label_oth, label_sym])] + losses.append([x.numpy() for x in loss]) + + lengths = paddle.sum(masks, axis=1) + preds = [paddle.argmax(x, axis=2) for x in logits] + correct = metric.compute(lengths, preds, [label_oth, label_sym]) + metric.update(correct) + _, _, result = metric.accumulate() + loss = np.mean(losses, axis=0) + print("eval loss symptom: %.5f, loss others: %.5f, loss: %.5f, f1: %.5f" % (loss[1], loss[0], loss.sum(), result)) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("cblue", "CMeEE", splits=["train", "dev"]) + + model = ElectraForBinaryTokenClassification.from_pretrained( + "ernie-health-chinese", + num_classes_oth=len(train_ds.label_list[0]), + num_classes_sym=len(train_ds.label_list[1]), + ) + tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + + label_list = train_ds.label_list + pad_label_id = [len(label_list[0]) - 1, len(label_list[1]) - 1] + + trans_func = partial( + convert_example_ner, tokenizer=tokenizer, max_seq_length=args.max_seq_length, pad_label_id=pad_label_id + ) + + batchify_fn = lambda samples, fn=Dict( # noqa: E731 + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), + "position_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + "attention_mask": Pad(axis=0, pad_val=0, dtype="float32"), + "label_oth": Pad(axis=0, pad_val=pad_label_id[0], dtype="int64"), + "label_sym": Pad(axis=0, pad_val=pad_label_id[1], dtype="int64"), + } + ): fn(samples) + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt: + if not os.path.isfile(args.init_from_ckpt): + raise ValueError("init_from_ckpt is not a valid model filename.") + state_dict = paddle.load(args.init_from_ckpt) + state_keys = {x: x.replace("discriminator.", "") for x in state_dict.keys() if "discriminator." 
in x} + if len(state_keys) > 0: + state_dict = {state_keys[k]: state_dict[k] for k in state_keys.keys()} + model.set_dict(state_dict) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + args.epochs = (num_training_steps - 1) // len(train_data_loader) + 1 + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = paddle.nn.functional.softmax_with_cross_entropy + + metric = NERChunkEvaluator(label_list) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + + global_step = 0 + tic_train = time.time() + total_train_time = 0 + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, position_ids, masks, label_oth, label_sym = batch + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu"], + ): + logits = model(input_ids, token_type_ids, position_ids) + + loss_mask = paddle.unsqueeze(masks, 2) + losses = [ + (criterion(x, y.unsqueeze(2)) * loss_mask).mean() for x, y in zip(logits, [label_oth, label_sym]) + ] + loss = losses[0] + losses[1] + + lengths = paddle.sum(masks, axis=1) + preds = [paddle.argmax(x, axis=-1) for x in logits] + correct = metric.compute(lengths, preds, [label_oth, label_sym]) + metric.update(correct) + _, _, f1 = metric.accumulate() + + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + total_train_time += time_diff + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, loss symptom: %.5f, loss others: %.5f, f1: %.5f, speed: %.2f step/s, learning_rate: %f" + % ( + global_step, + epoch, + step, + loss, + losses[1], + losses[0], + f1, + args.logging_steps / time_diff, + lr_scheduler.get_lr(), + ) + ) + + if global_step % args.valid_steps == 0 and rank == 0: + evaluate(model, criterion, metric, dev_data_loader) + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + if paddle.distributed.get_world_size() > 1: + model._layers.save_pretrained(save_dir) + else: + model.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + if global_step >= num_training_steps: + return + tic_train = time.time() + + if rank == 0 and total_train_time > 0: + print("Speed: %.2f steps/s" % (global_step / total_train_time)) + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/ernie-health/cblue/train_spo.py b/model_zoo/ernie-health/cblue/train_spo.py new file mode 100644 index 0000000000000000000000000000000000000000..e5d8d5eb6128c0503deeaeb01d6d253c58b66e8f --- /dev/null +++ b/model_zoo/ernie-health/cblue/train_spo.py @@ -0,0 +1,300 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
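Both label heads in `train_ner.py` above use the same masked token-level loss: per-token softmax cross-entropy, zeroed on padding positions through the attention mask, then averaged. A small shape-level sketch with made-up tensors:

```python
# Illustrative only: the masked per-token loss pattern used for each head in train_ner.py.
# Shapes and values are invented for demonstration.
import paddle
import paddle.nn.functional as F

batch_size, seq_len, num_labels = 2, 5, 33
logits = paddle.randn([batch_size, seq_len, num_labels])
labels = paddle.randint(0, num_labels, [batch_size, seq_len])
masks = paddle.to_tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype="float32")

token_loss = F.softmax_with_cross_entropy(logits, labels.unsqueeze(2))  # [batch, seq_len, 1]
loss = (token_loss * masks.unsqueeze(2)).mean()  # padded positions contribute zero
print(float(loss))
```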
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import distutils.util +import os +import random +import time +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +from model import ElectraForSPO +from tqdm import tqdm +from utils import ( + LinearDecayWithWarmup, + SPOChunkEvaluator, + convert_example_spo, + create_dataloader, +) + +from paddlenlp.data import Dict, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import ElectraTokenizer + +parser = argparse.ArgumentParser() +parser.add_argument("--seed", default=1000, type=int, help="Random seed for initialization.") +parser.add_argument( + "--device", + choices=["cpu", "gpu", "xpu", "npu"], + default="gpu", + help="Select which device to train model, default to gpu.", +) +parser.add_argument("--epochs", default=100, type=int, help="Total number of training epochs.") +parser.add_argument( + "--max_steps", default=-1, type=int, help="If > 0: set total number of training steps to perform. Override epochs." +) +parser.add_argument("--batch_size", default=12, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument( + "--learning_rate", default=6e-5, type=float, help="Learning rate for fine-tuning sequence classification task." +) +parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay of optimizer if we apply some.") +parser.add_argument( + "--warmup_proportion", + default=0.1, + type=float, + help="Linear warmup proportion of learning rate over the training process.", +) +parser.add_argument( + "--max_seq_length", default=300, type=int, help="The maximum total input sequence length after tokenization." +) +parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") +parser.add_argument("--logging_steps", default=10, type=int, help="The interval steps to logging.") +parser.add_argument( + "--save_dir", + default="./checkpoint", + type=str, + help="The output directory where the model checkpoints will be written.", +) +parser.add_argument("--save_steps", default=100, type=int, help="The interval steps to save checkpoints.") +parser.add_argument("--valid_steps", default=100, type=int, help="The interval steps to evaluate model performance.") +parser.add_argument("--use_amp", default=False, type=distutils.util.strtobool, help="Enable mixed precision training.") +parser.add_argument("--scale_loss", default=128, type=float, help="The value of scale_loss for fp16.") + +args = parser.parse_args() + + +def set_seed(seed): + """set random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, criterion, metric, data_loader): + """ + Given a dataset, it evals model and compute the metric. + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + dataloader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + criterion(`paddle.nn.functional`): It can compute the loss. 
+ metric(obj:`paddle.metric.Metric`): The evaluation metric. + """ + model.eval() + metric.reset() + losses = [] + for batch in tqdm(data_loader): + input_ids, token_type_ids, position_ids, masks, ent_label, spo_label = batch + ent_mask = paddle.unsqueeze(masks, axis=2) + spo_mask = paddle.matmul(ent_mask, ent_mask, transpose_y=True) + spo_mask = paddle.unsqueeze(spo_mask, axis=1) + + logits = model(input_ids, token_type_ids, position_ids) + + ent_loss = criterion(logits[0], ent_label[0], weight=ent_mask, reduction="sum") + spo_loss = criterion(logits[1], spo_label[0], weight=spo_mask, reduction="sum") + loss = ent_loss + spo_loss + losses.append(loss.numpy()) + lengths = paddle.sum(masks, axis=-1) + correct = metric.compute(lengths, logits[0], logits[1], ent_label[1], spo_label[1]) + metric.update(correct) + results = metric.accumulate() + print( + "eval loss: %.5f, entity f1: %.5f, spo f1: %.5f" % (np.mean(losses), results["entity"][2], results["spo"][2]) + ) + model.train() + metric.reset() + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds, dev_ds = load_dataset("cblue", "CMeIE", splits=["train", "dev"]) + + model = ElectraForSPO.from_pretrained("ernie-health-chinese", num_classes=len(train_ds.label_list)) + tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + + trans_func = partial( + convert_example_spo, + tokenizer=tokenizer, + num_classes=len(train_ds.label_list), + max_seq_length=args.max_seq_length, + ) + + def batchify_fn(data): + _batchify_fn = lambda samples, fn=Dict( # noqa: E731 + { + "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + "position_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), + "attention_mask": Pad(axis=0, pad_val=0, dtype="float32"), + } + ): fn(samples) + ent_label = [x["ent_label"] for x in data] + spo_label = [x["spo_label"] for x in data] + input_ids, token_type_ids, position_ids, masks = _batchify_fn(data) + batch_size, batch_len = input_ids.shape + num_classes = len(train_ds.label_list) + # Create one-hot labels. + # + # For example, + # - text: + # [CLS], 局, 部, 皮, 肤, 感, 染, 引, 起, 的, 皮, 疹, 等, [SEP] + # + # - ent_label (obj: `list`): + # [(0, 5), (9, 10)] # ['局部皮肤感染', '皮疹'] + # + # - one_hot_ent_label: # shape (sequence_length, 2) + # [[ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], # start index + # [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]] # end index + # + # - spo_label (obj: `list`): + # [(0, 23, 9)] # [('局部皮肤感染', '相关(导致)', '皮疹')], where entities + # are encoded by their start indexes. + # + # - one_hot_spo_label: # shape (num_predicate, sequence_length, sequence_length) + # [..., + # [..., [0, ..., 1, ..., 0], ...], # for predicate '相关(导致)' + # ...] # the value at [23, 1, 10] is set as 1 + # + one_hot_ent_label = np.zeros([batch_size, batch_len, 2], dtype=np.float32) + one_hot_spo_label = np.zeros([batch_size, num_classes, batch_len, batch_len], dtype=np.float32) + for idx, ent_idxs in enumerate(ent_label): + # Shift index by 1 because input_ids start with [CLS] here. 
+ for x, y in ent_idxs: + x = x + 1 + y = y + 1 + if x > 0 and x < batch_len and y < batch_len: + one_hot_ent_label[idx, x, 0] = 1 + one_hot_ent_label[idx, y, 1] = 1 + for idx, spo_idxs in enumerate(spo_label): + for s, p, o in spo_idxs: + s_id = s[0] + 1 + o_id = o[0] + 1 + if s_id > 0 and s_id < batch_len and o_id < batch_len: + one_hot_spo_label[idx, p, s_id, o_id] = 1 + # one_hot_xxx_label are used for loss computation. + # xxx_label are used for metric computation. + ent_label = [one_hot_ent_label, ent_label] + spo_label = [one_hot_spo_label, spo_label] + return input_ids, token_type_ids, position_ids, masks, ent_label, spo_label + + train_data_loader = create_dataloader( + train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + dev_data_loader = create_dataloader( + dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func + ) + + if args.init_from_ckpt: + if not os.path.isfile(args.init_from_ckpt): + raise ValueError("init_from_ckpt is not a valid model filename.") + state_dict = paddle.load(args.init_from_ckpt) + state_keys = {x: x.replace("discriminator.", "") for x in state_dict.keys() if "discriminator." in x} + if len(state_keys) > 0: + state_dict = {state_keys[k]: state_dict[k] for k in state_keys.keys()} + model.set_dict(state_dict) + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + num_training_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epochs + args.epochs = (num_training_steps - 1) // len(train_data_loader) + 1 + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) + decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params, + ) + + criterion = F.binary_cross_entropy_with_logits + + metric = SPOChunkEvaluator(num_classes=len(train_ds.label_list)) + + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=args.scale_loss) + global_step = 0 + tic_train = time.time() + total_train_time = 0 + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + input_ids, token_type_ids, position_ids, masks, ent_label, spo_label = batch + ent_mask = paddle.unsqueeze(masks, axis=2) + spo_mask = paddle.matmul(ent_mask, ent_mask, transpose_y=True) + spo_mask = paddle.unsqueeze(spo_mask, axis=1) + + with paddle.amp.auto_cast( + args.use_amp, + custom_white_list=["layer_norm", "softmax", "gelu"], + ): + logits = model(input_ids, token_type_ids, position_ids) + ent_loss = criterion(logits[0], ent_label[0], weight=ent_mask, reduction="sum") + spo_loss = criterion(logits[1], spo_label[0], weight=spo_mask, reduction="sum") + + loss = ent_loss + spo_loss + + if args.use_amp: + scaler.scale(loss).backward() + scaler.minimize(optimizer, loss) + else: + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + global_step += 1 + if global_step % args.logging_steps == 0 and rank == 0: + time_diff = time.time() - tic_train + total_train_time += time_diff + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, " + "ent_loss: %.5f, spo_loss: %.5f, speed: %.2f steps/s" + % (global_step, epoch, step, loss, ent_loss, spo_loss, args.logging_steps / time_diff) + ) + + if global_step 
% args.valid_steps == 0 and rank == 0: + evaluate(model, criterion, metric, dev_data_loader) + + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + if paddle.distributed.get_world_size() > 1: + model._layers.save_pretrained(save_dir) + else: + model.save_pretrained(save_dir) + tokenizer.save_pretrained(save_dir) + + if global_step >= num_training_steps: + return + tic_train = time.time() + + if rank == 0 and total_train_time > 0: + print("Speed: %.2f steps/s" % (global_step / total_train_time)) + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/ernie-health/cblue/utils.py b/model_zoo/ernie-health/cblue/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..4c0bda0c63ebbd5eb07a27ef1b1aef45a055671a --- /dev/null +++ b/model_zoo/ernie-health/cblue/utils.py @@ -0,0 +1,503 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math + +import numpy as np +import paddle +from paddle.optimizer.lr import LambdaDecay + +from paddlenlp.transformers import normalize_chars, tokenize_special_chars + + +def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == "train" else False + if mode == "train": + batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) + + +class LinearDecayWithWarmup(LambdaDecay): + def __init__(self, learning_rate, total_steps, warmup, last_epoch=-1, verbose=False): + """ + Creates a learning rate scheduler, which increases learning rate linearly + from 0 to given `learning_rate`, after this warmup period learning rate + would be decreased linearly from the base learning rate to 0. + + Args: + learning_rate (float): + The base learning rate. It is a python float number. + total_steps (int): + The number of training steps. + warmup (int or float): + If int, it means the number of steps for warmup. If float, it means + the proportion of warmup in total training steps. + last_epoch (int, optional): + The index of last epoch. It can be set to restart training. If + None, it means initial learning rate. + Defaults to -1. + verbose (bool, optional): + If True, prints a message to stdout for each update. + Defaults to False. 
+ """ + + warmup_steps = warmup if isinstance(warmup, int) else int(math.floor(warmup * total_steps)) + + def lr_lambda(current_step): + if current_step < warmup_steps: + return float(current_step) / float(max(1, warmup_steps)) + return max(0.0, 1.0 - current_step / total_steps) + + super(LinearDecayWithWarmup, self).__init__(learning_rate, lr_lambda, last_epoch, verbose) + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + """ + Builds model inputs from a sequence or a pair of sequences for sequence + classification tasks by concatenating and adding special tokens. And + creates a mask from the two sequences for sequence-pair classification + tasks. + + The convention in Electra/EHealth is: + + - single sequence: + input_ids: ``[CLS] X [SEP]`` + token_type_ids: `` 0 0 0`` + position_ids: `` 0 1 2`` + + - a senquence pair: + input_ids: ``[CLS] X [SEP] Y [SEP]`` + token_type_ids: `` 0 0 0 1 1`` + position_ids: `` 0 1 2 3 4`` + + Args: + example (obj:`dict`): + A dictionary of input data, containing text and label if it has. + tokenizer (obj:`PretrainedTokenizer`): + A tokenizer inherits from :class:`paddlenlp.transformers.PretrainedTokenizer`. + Users can refer to the superclass for more information. + max_seq_length (obj:`int`): + The maximum total input sequence length after tokenization. + Sequences longer will be truncated, and the shorter will be padded. + is_test (obj:`bool`, default to `False`): + Whether the example contains label or not. + + Returns: + input_ids (obj:`list[int]`): + The list of token ids. + token_type_ids (obj:`list[int]`): + List of sequence pair mask. + position_ids (obj:`list[int]`): + List of position ids. + label(obj:`numpy.array`, data type of int64, optional): + The input label if not is_test. + """ + text_a = example["text_a"] + text_b = example.get("text_b", None) + + text_a = tokenize_special_chars(normalize_chars(text_a)) + if text_b is not None: + text_b = tokenize_special_chars(normalize_chars(text_b)) + + encoded_inputs = tokenizer(text=text_a, text_pair=text_b, max_seq_len=max_seq_length, return_position_ids=True) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + position_ids = encoded_inputs["position_ids"] + + if is_test: + return input_ids, token_type_ids, position_ids + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, position_ids, label + + +def convert_example_ner(example, tokenizer, max_seq_length=512, pad_label_id=-100, is_test=False): + """ + Builds model inputs from a sequence and creates labels for named- + entity recognition task CMeEE. + + For example, a sample should be: + + - input_ids: ``[CLS] x1 x2 [SEP] [PAD]`` + - token_type_ids: `` 0 0 0 0 0`` + - position_ids: `` 0 1 2 3 0`` + - attention_mask: `` 1 1 1 1 0`` + - label_oth: `` 32 3 32 32 32`` (optional, label ids of others) + - label_sym: `` 4 4 4 4 4`` (optional, label ids of symptom) + + Args: + example (obj:`dict`): + A dictionary of input data, containing text and label if it has. + tokenizer (obj:`PretrainedTokenizer`): + A tokenizer inherits from :class:`paddlenlp.transformers.PretrainedTokenizer`. + Users can refer to the superclass for more information. + max_seq_length (obj:`int`): + The maximum total input sequence length after tokenization. + Sequences longer will be truncated, and the shorter will be padded. + is_test (obj:`bool`, default to `False`): + Whether the example contains label or not. 
+ + Returns: + encoded_output (obj: `dict[str, list|np.array]`): + The sample dictionary including `input_ids`, `token_type_ids`, + `position_ids`, `attention_mask`, `label_oth` (optional), + `label_sym` (optional) + """ + + encoded_inputs = {} + text = example["text"] + if len(text) > max_seq_length - 2: + text = text[: max_seq_length - 2] + text = ["[CLS]"] + [x.lower() for x in text] + ["[SEP]"] + input_len = len(text) + encoded_inputs["input_ids"] = tokenizer.convert_tokens_to_ids(text) + encoded_inputs["token_type_ids"] = np.zeros(input_len) + encoded_inputs["position_ids"] = list(range(input_len)) + encoded_inputs["attention_mask"] = np.ones(input_len) + + if not is_test: + labels = example["labels"] + if input_len - 2 < len(labels[0]): + labels[0] = labels[0][: input_len - 2] + if input_len - 2 < len(labels[1]): + labels[1] = labels[1][: input_len - 2] + encoded_inputs["label_oth"] = [pad_label_id[0]] + labels[0] + [pad_label_id[0]] + encoded_inputs["label_sym"] = [pad_label_id[1]] + labels[1] + [pad_label_id[1]] + + return encoded_inputs + + +def convert_example_spo(example, tokenizer, num_classes, max_seq_length=512, is_test=False): + """ + Builds model inputs from a sequence and creates labels for SPO prediction + task CMeIE. + + For example, a sample should be: + + - input_ids: ``[CLS] x1 x2 [SEP] [PAD]`` + - token_type_ids: `` 0 0 0 0 0`` + - position_ids: `` 0 1 2 3 0`` + - attention_mask: `` 1 1 1 1 0`` + - ent_label: ``[[0 1 0 0 0], # start ids are set as 1 + [0 0 1 0 0]] # end ids are set as 1 + - spo_label: a tensor of shape [num_classes, max_batch_len, max_batch_len]. + Set [predicate_id, subject_start_id, object_start_id] as 1 + when (subject, predicate, object) exists. + + Args: + example (obj:`dict`): + A dictionary of input data, containing text and label if it has. + tokenizer (obj:`PretrainedTokenizer`): + A tokenizer inherits from :class:`paddlenlp.transformers.PretrainedTokenizer`. + Users can refer to the superclass for more information. + num_classes (obj:`int`): + The number of predicates. + max_seq_length (obj:`int`): + The maximum total input sequence length after tokenization. + Sequences longer will be truncated, and the shorter will be padded. + is_test (obj:`bool`, default to `False`): + Whether the example contains label or not. + + Returns: + encoded_output (obj: `dict[str, list|np.array]`): + The sample dictionary including `input_ids`, `token_type_ids`, + `position_ids`, `attention_mask`, `ent_label` (optional), + `spo_label` (optional) + """ + encoded_inputs = {} + text = example["text"] + if len(text) > max_seq_length - 2: + text = text[: max_seq_length - 2] + text = ["[CLS]"] + [x.lower() for x in text] + ["[SEP]"] + input_len = len(text) + encoded_inputs["input_ids"] = tokenizer.convert_tokens_to_ids(text) + encoded_inputs["token_type_ids"] = np.zeros(input_len) + encoded_inputs["position_ids"] = list(range(input_len)) + encoded_inputs["attention_mask"] = np.ones(input_len) + if not is_test: + encoded_inputs["ent_label"] = example["ent_label"] + encoded_inputs["spo_label"] = example["spo_label"] + return encoded_inputs + + +class NERChunkEvaluator(paddle.metric.Metric): + """ + NERChunkEvaluator computes the precision, recall and F1-score for chunk detection. + It is often used in sequence tagging tasks, such as Named Entity Recognition (NER). + + Args: + label_list (list): + The label list. 
+ + Note: + Difference from `paddlenlp.metric.ChunkEvaluator`: + + - `paddlenlp.metric.ChunkEvaluator` + All sequences with non-'O' labels are taken as chunks when computing num_infer. + - `NERChunkEvaluator` + Only complete sequences are taken as chunks, namely `B- I- E-` or `S-`. + """ + + def __init__(self, label_list): + super(NERChunkEvaluator, self).__init__() + self.id2label = [dict(enumerate(x)) for x in label_list] + self.num_classes = [len(x) for x in label_list] + self.num_infer = 0 + self.num_label = 0 + self.num_correct = 0 + + def compute(self, lengths, predictions, labels): + """ + Computes the prediction, recall and F1-score for chunk detection. + + Args: + lengths (Tensor): + The valid length of every sequence, a tensor with shape `[batch_size]`. + predictions (Tensor): + The predictions index, a tensor with shape `[batch_size, sequence_length]`. + labels (Tensor): + The labels index, a tensor with shape `[batch_size, sequence_length]`. + + Returns: + tuple: Returns tuple (`num_infer_chunks, num_label_chunks, num_correct_chunks`). + + With the fields: + + - `num_infer_chunks` (Tensor): The number of the inference chunks. + - `num_label_chunks` (Tensor): The number of the label chunks. + - `num_correct_chunks` (Tensor): The number of the correct chunks. + """ + assert len(predictions) == len(labels) + assert len(predictions) == len(self.id2label) + preds = [x.numpy() for x in predictions] + labels = [x.numpy() for x in labels] + + preds_chunk = set() + label_chunk = set() + for idx, (pred, label) in enumerate(zip(preds, labels)): + for i, case in enumerate(pred): + case = [self.id2label[idx][x] for x in case[: lengths[i]]] + preds_chunk |= self.extract_chunk(case, i) + for i, case in enumerate(label): + case = [self.id2label[idx][x] for x in case[: lengths[i]]] + label_chunk |= self.extract_chunk(case, i) + + num_infer = len(preds_chunk) + num_label = len(label_chunk) + num_correct = len(preds_chunk & label_chunk) + return num_infer, num_label, num_correct + + def update(self, correct): + num_infer, num_label, num_correct = correct + self.num_infer += num_infer + self.num_label += num_label + self.num_correct += num_correct + + def accumulate(self): + precision = self.num_correct / (self.num_infer + 1e-6) + recall = self.num_correct / (self.num_label + 1e-6) + f1 = 2 * precision * recall / (precision + recall + 1e-6) + return precision, recall, f1 + + def reset(self): + self.num_infer = 0 + self.num_label = 0 + self.num_correct = 0 + + def name(self): + return "precision", "recall", "f1" + + def extract_chunk(self, sequence, cid=0): + chunks = set() + + start_idx, cur_idx = 0, 0 + while cur_idx < len(sequence): + if sequence[cur_idx][0] == "B": + start_idx = cur_idx + cur_idx += 1 + while cur_idx < len(sequence) and sequence[cur_idx][0] == "I": + if sequence[cur_idx][2:] == sequence[start_idx][2:]: + cur_idx += 1 + else: + break + if cur_idx < len(sequence) and sequence[cur_idx][0] == "E": + if sequence[cur_idx][2:] == sequence[start_idx][2:]: + chunks.add((cid, sequence[cur_idx][2:], start_idx, cur_idx)) + cur_idx += 1 + elif sequence[cur_idx][0] == "S": + chunks.add((cid, sequence[cur_idx][2:], cur_idx, cur_idx)) + cur_idx += 1 + else: + cur_idx += 1 + + return chunks + + +class SPOChunkEvaluator(paddle.metric.Metric): + """ + SPOChunkEvaluator computes the precision, recall and F1-score for multiple + chunk detections, including Named Entity Recognition (NER) and SPO Prediction. + + Args: + num_classes (int): + The number of predicates. 
+ """ + + def __init__(self, num_classes=None): + super(SPOChunkEvaluator, self).__init__() + self.num_classes = num_classes + self.num_infer_ent = 0 + self.num_infer_spo = 1e-10 + self.num_label_ent = 0 + self.num_label_spo = 1e-10 + self.num_correct_ent = 0 + self.num_correct_spo = 0 + + def compute(self, lengths, ent_preds, spo_preds, ent_labels, spo_labels): + """ + Computes the prediction, recall and F1-score for NER and SPO prediction. + + Args: + lengths (Tensor): + The valid length of every sequence, a tensor with shape `[batch_size]`. + ent_preds (Tensor): + The predictions of entities. + A tensor with shape `[batch_size, sequence_length, 2]`. + `ent_preds[:, :, 0]` denotes the start indexes of entities. + `ent_preds[:, :, 1]` denotes the end indexes of entities. + spo_preds (Tensor): + The predictions of predicates between all possible entities. + A tensor with shape `[batch_size, num_classes, sequence_length, sequence_length]`. + ent_labels (list[list|tuple]): + The entity labels' indexes. A list of pair `[start_index, end_index]`. + spo_labels (list[list|tuple]): + The SPO labels' indexes. A list of triple `[[subject_start_index, subject_end_index], + predicate_id, [object_start_index, object_end_index]]`. + + Returns: + tuple: + Returns tuple (`num_infer_chunks, num_label_chunks, num_correct_chunks`). + The `ent` denotes results of NER and the `spo` denotes results of SPO prediction. + + With the fields: + + - `num_infer_chunks` (dict): The number of the inference chunks. + - `num_label_chunks` (dict): The number of the label chunks. + - `num_correct_chunks` (dict): The number of the correct chunks. + """ + ent_preds = ent_preds.numpy() + spo_preds = spo_preds.numpy() + + ent_pred_list = [] + ent_idxs_list = [] + for idx, ent_pred in enumerate(ent_preds): + seq_len = lengths[idx] - 2 + start = np.where(ent_pred[:, 0] > 0.5)[0] + end = np.where(ent_pred[:, 1] > 0.5)[0] + ent_pred = [] + ent_idxs = {} + for x in start: + y = end[end >= x] + if (x == 0) or (x > seq_len): + continue + if len(y) > 0: + y = y[0] + if y > seq_len: + continue + ent_idxs[x] = (x - 1, y - 1) + ent_pred.append((x - 1, y - 1)) + ent_pred_list.append(ent_pred) + ent_idxs_list.append(ent_idxs) + + spo_preds = spo_preds > 0 + spo_pred_list = [[] for _ in range(len(spo_preds))] + idxs, preds, subs, objs = np.nonzero(spo_preds) + for idx, p_id, s_id, o_id in zip(idxs, preds, subs, objs): + obj = ent_idxs_list[idx].get(o_id, None) + if obj is None: + continue + sub = ent_idxs_list[idx].get(s_id, None) + if sub is None: + continue + spo_pred_list[idx].append((sub, p_id, obj)) + + correct = {"ent": 0, "spo": 0} + infer = {"ent": 0, "spo": 0} + label = {"ent": 0, "spo": 0} + for ent_pred, ent_true in zip(ent_pred_list, ent_labels): + ent_true = [tuple(x) for x in ent_true] + infer["ent"] += len(set(ent_pred)) + label["ent"] += len(set(ent_true)) + correct["ent"] += len(set(ent_pred) & set(ent_true)) + + for spo_pred, spo_true in zip(spo_pred_list, spo_labels): + spo_true = [(tuple(s), p, tuple(o)) for s, p, o in spo_true] + infer["spo"] += len(set(spo_pred)) + label["spo"] += len(set(spo_true)) + correct["spo"] += len(set(spo_pred) & set(spo_true)) + + return infer, label, correct + + def update(self, corrects): + assert len(corrects) == 3 + for item in corrects: + assert isinstance(item, dict) + for value in item.values(): + if not self._is_number_or_matrix(value): + raise ValueError("The numbers must be a number(int) or a numpy ndarray.") + num_infer, num_label, num_correct = corrects + self.num_infer_ent += 
num_infer["ent"] + self.num_infer_spo += num_infer["spo"] + self.num_label_ent += num_label["ent"] + self.num_label_spo += num_label["spo"] + self.num_correct_ent += num_correct["ent"] + self.num_correct_spo += num_correct["spo"] + + def accumulate(self): + spo_precision = self.num_correct_spo / self.num_infer_spo + spo_recall = self.num_correct_spo / self.num_label_spo + spo_f1 = 2 * self.num_correct_spo / (self.num_infer_spo + self.num_label_spo) + ent_precision = self.num_correct_ent / self.num_infer_ent if self.num_infer_ent > 0 else 0.0 + ent_recall = self.num_correct_ent / self.num_label_ent if self.num_label_ent > 0 else 0.0 + ent_f1 = ( + 2 * ent_precision * ent_recall / (ent_precision + ent_recall) if (ent_precision + ent_recall) != 0 else 0.0 + ) + return {"entity": (ent_precision, ent_recall, ent_f1), "spo": (spo_precision, spo_recall, spo_f1)} + + def _is_number_or_matrix(self, var): + def _is_number_(var): + return ( + isinstance(var, int) + or isinstance(var, np.int64) + or isinstance(var, float) + or (isinstance(var, np.ndarray) and var.shape == (1,)) + ) + + return _is_number_(var) or isinstance(var, np.ndarray) + + def reset(self): + self.num_infer_ent = 0 + self.num_infer_spo = 1e-10 + self.num_label_ent = 0 + self.num_label_spo = 1e-10 + self.num_correct_ent = 0 + self.num_correct_spo = 0 + + def name(self): + return {"entity": ("precision", "recall", "f1"), "spo": ("precision", "recall", "f1")} diff --git a/model_zoo/ernie-health/dataset.py b/model_zoo/ernie-health/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..252b2e952aee484db6c6a365ee36f7e93d16ca63 --- /dev/null +++ b/model_zoo/ernie-health/dataset.py @@ -0,0 +1,204 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import numpy as np +import paddle + + +class MedicalCorpus(paddle.io.Dataset): + def __init__(self, data_path, tokenizer): + self.data_path = data_path + self.tokenizer = tokenizer + # Add ids for suffixal chinese tokens in tokenized text, e.g. '##度' in '百度'. + # It should coincide with the vocab dictionary in preprocess.py. + orig_len = len(self.tokenizer) # noqa:F841 + suffix_vocab = {} + for idx, token in enumerate(range(0x4E00, 0x9FA6)): + suffix_vocab[len(self.tokenizer) + idx] = "##" + chr(token) + self.tokenizer.added_tokens_decoder.update(suffix_vocab) + self._samples, self._global_index = self._read_data_files(data_path) + + def _get_data_files(self, data_path): + # Get all prefix of .npy/.npz files in the current and next-level directories. 
+ files = [ + os.path.join(data_path, f) + for f in os.listdir(data_path) + if (os.path.isfile(os.path.join(data_path, f)) and "_idx.npz" in str(f)) + ] + files = [x.replace("_idx.npz", "") for x in files] + return files + + def _read_data_files(self, data_path): + data_files = self._get_data_files(data_path) + samples = [] + indexes = [] + for file_id, file_name in enumerate(data_files): + + for suffix in ["_ids.npy", "_idx.npz"]: + if not os.path.isfile(file_name + suffix): + raise ValueError("File Not found, %s" % (file_name + suffix)) + + token_ids = np.load(file_name + "_ids.npy", mmap_mode="r", allow_pickle=True) + samples.append(token_ids) + + split_ids = np.load(file_name + "_idx.npz") + end_ids = np.cumsum(split_ids["lens"], dtype=np.int64) + file_ids = np.full(end_ids.shape, file_id) + split_ids = np.stack([file_ids, end_ids], axis=-1) + indexes.extend(split_ids) + indexes = np.stack(indexes, axis=0) + return samples, indexes + + def __len__(self): + return len(self._global_index) + + def __getitem__(self, index): + file_id, end_id = self._global_index[index] + start_id = 0 + if index > 0: + pre_file_id, pre_end_id = self._global_index[index - 1] + if pre_file_id == file_id: + start_id = pre_end_id + word_token_ids = self._samples[file_id][start_id:end_id] + token_ids = [] + is_suffix = np.zeros(word_token_ids.shape) + for idx, token_id in enumerate(word_token_ids): + token = self.tokenizer.convert_ids_to_tokens(int(token_id)) + if "##" in token: + token_id = self.tokenizer.convert_tokens_to_ids(token[-1]) + is_suffix[idx] = 1 + token_ids.append(token_id) + + return token_ids, is_suffix.astype(np.int64) + + +class DataCollatorForErnieHealth(object): + def __init__(self, tokenizer, mlm_prob, max_seq_length, return_dict=False): + self.tokenizer = tokenizer + self.mlm_prob = mlm_prob + self.max_seq_len = max_seq_length + self.return_dict = return_dict + self._ids = { + "cls": self.tokenizer.convert_tokens_to_ids(self.tokenizer.cls_token), + "sep": self.tokenizer.convert_tokens_to_ids(self.tokenizer.sep_token), + "pad": self.tokenizer.convert_tokens_to_ids(self.tokenizer.pad_token), + "mask": self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token), + } + + def __call__(self, data): + masked_input_ids_a, input_ids_a, labels_a = self.mask_tokens(data) + masked_input_ids_b, input_ids_b, labels_b = self.mask_tokens(data) + masked_input_ids = paddle.concat([masked_input_ids_a, masked_input_ids_b], axis=0).astype("int64") + input_ids = paddle.concat([input_ids_a, input_ids_b], axis=0) + labels = paddle.concat([labels_a, labels_b], axis=0) + if self.return_dict: + return {"input_ids": masked_input_ids, "raw_input_ids": input_ids, "generator_labels": labels} + + else: + return masked_input_ids, input_ids, labels + + def mask_tokens(self, batch_data): + + token_ids = [x[0] for x in batch_data] + is_suffix = [x[1] for x in batch_data] + + # Create probability matrix where the probability of real tokens is + # self.mlm_prob, while that of others is zero. + data = self.add_special_tokens_and_set_maskprob(token_ids, is_suffix) + token_ids, is_suffix, prob_matrix = data + token_ids = paddle.to_tensor(token_ids, dtype="int64", stop_gradient=True) + masked_token_ids = token_ids.clone() + labels = token_ids.clone() + + # Create masks for words, where '百' must be masked if '度' is masked + # for the word '百度'. 
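+        # Masks may only start at word-initial tokens; the while-loop below then copies a
+        # sampled mask onto the following '##' suffix tokens so whole words are masked together.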
+ prob_matrix = prob_matrix * (1 - is_suffix) + word_mask_index = np.random.binomial(1, prob_matrix).astype("float") + is_suffix_mask = is_suffix == 1 + word_mask_index_tmp = word_mask_index + while word_mask_index_tmp.sum() > 0: + word_mask_index_tmp = np.concatenate( + [np.zeros((word_mask_index.shape[0], 1)), word_mask_index_tmp[:, :-1]], axis=1 + ) + word_mask_index_tmp = word_mask_index_tmp * is_suffix_mask + word_mask_index += word_mask_index_tmp + word_mask_index = word_mask_index.astype("bool") + labels[~word_mask_index] = -100 + + # 80% replaced with [MASK]. + token_mask_index = paddle.bernoulli(paddle.full(labels.shape, 0.8)).astype("bool").numpy() & word_mask_index + masked_token_ids[token_mask_index] = self._ids["mask"] + + # 10% replaced with random token ids. + token_random_index = paddle.to_tensor( + paddle.bernoulli(paddle.full(labels.shape, 0.5)).astype("bool").numpy() + & word_mask_index + & ~token_mask_index + ) + random_tokens = paddle.randint(low=0, high=self.tokenizer.vocab_size, shape=labels.shape, dtype="int64") + masked_token_ids = paddle.where(token_random_index, random_tokens, masked_token_ids) + + return masked_token_ids, token_ids, labels + + def add_special_tokens_and_set_maskprob(self, token_ids, is_suffix): + batch_size = len(token_ids) + batch_token_ids = np.full((batch_size, self.max_seq_len), self._ids["pad"]) + batch_token_ids[:, 0] = self._ids["cls"] + batch_is_suffix = np.full_like(batch_token_ids, -1) + prob_matrix = np.zeros_like(batch_token_ids, dtype="float32") + + for idx in range(batch_size): + if len(token_ids[idx]) > self.max_seq_len - 2: + token_ids[idx] = token_ids[idx][: self.max_seq_len - 2] + is_suffix[idx] = is_suffix[idx][: self.max_seq_len - 2] + seq_len = len(token_ids[idx]) + batch_token_ids[idx, seq_len + 1] = self._ids["sep"] + batch_token_ids[idx, 1 : seq_len + 1] = token_ids[idx] + batch_is_suffix[idx, 1 : seq_len + 1] = is_suffix[idx] + prob_matrix[idx, 1 : seq_len + 1] = self.mlm_prob + + return batch_token_ids, batch_is_suffix, prob_matrix + + +def create_dataloader(dataset, mode="train", batch_size=1, use_gpu=True, data_collator=None): + """ + Creats dataloader. + Args: + dataset(obj:`paddle.io.Dataset`): + Dataset instance. + mode(obj:`str`, optional, defaults to obj:`train`): + If mode is 'train', it will shuffle the dataset randomly. + batch_size(obj:`int`, optional, defaults to 1): + The sample number of a mini-batch. + use_gpu(obj:`bool`, optional, defaults to obj:`True`): + Whether to use gpu to run. + Returns: + dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. + """ + + if mode == "train" and use_gpu: + sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + else: + shuffle = True if mode == "train" else False + sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) + dataloader = paddle.io.DataLoader( + dataset, batch_sampler=sampler, return_list=True, collate_fn=data_collator, num_workers=0 + ) + + return dataloader diff --git a/model_zoo/ernie-health/preprocess.py b/model_zoo/ernie-health/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..c8c6ec2c5629dfa76c69e46bf4b4b6185ece5b63 --- /dev/null +++ b/model_zoo/ernie-health/preprocess.py @@ -0,0 +1,201 @@ +# Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import io +import multiprocessing +import os +import re +import sys +import time + +import numpy as np +from tqdm import tqdm + +from paddlenlp.transformers import ElectraTokenizer + + +def parse_args(): + parser = argparse.ArgumentParser("Preprocessor for ERNIE-Health") + parser.add_argument( + "--input_path", type=str, required=True, help="The path to input text files where a sentence per line." + ) + parser.add_argument("--output_file", type=str, required=True, help="The output file path of preprocessed ids.") + parser.add_argument( + "--tokenize_tool", + type=str, + default="lac", + choices=["lac", "seg", "jieba"], + help="The tokenization tool for chinese words.", + ) + parser.add_argument("--logging_steps", type=int, default=100, help="The interval between progress updates.") + parser.add_argument("--num_worker", type=int, default=1, help="Number of worker processes to launch.") + + args = parser.parse_args() + return args + + +def lac_segmentation(): + from LAC import LAC + + tool = LAC(mode="lac") + + def process(text): + words, _ = tool.run(text) + return words + + return process + + +def seg_segmentation(): + from LAC import LAC + + tool = LAC(mode="seg") + + def process(text): + words = tool.run(text) + return words + + return process + + +def jieba_segmentation(): + import jieba + + def process(text): + words = jieba.cut(text) + return list(words) + + return process + + +SEGMENTATION_FN = {"lac": lac_segmentation(), "seg": seg_segmentation(), "jieba": jieba_segmentation()} + + +class ProcessFn(object): + def __init__(self, args): + self.args = args + + def initializer(self): + ProcessFn.tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + ProcessFn.segmenter = SEGMENTATION_FN[self.args.tokenize_tool] + # Update vocabulary with '##'-prefixed chinese characters. + # The ids should coincide with those in run_pretrain.py. + suffix_vocab = {} + for idx, token in enumerate(range(0x4E00, 0x9FA6)): + suffix_vocab["##" + chr(token)] = len(ProcessFn.tokenizer) + idx + ProcessFn.tokenizer.added_tokens_encoder.update(suffix_vocab) + + def mark_word_in_tokens(tokens, words, max_word_length=4): + word_set = set(words) + index = 0 + while index < len(tokens): + # Skip non-chinese characters. + if len(re.findall("[\u4E00-\u9FA5]", tokens[index])) == 0: + index += 1 + continue + # Find the word with maximum length and mark it. 
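+            # Greedy longest match: try spans of up to max_word_length tokens and, on a hit,
+            # prefix the non-initial characters of the matched word with '##'.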
+ find_word = False + for length in range(max_word_length, 0, -1): + if index + length > len(tokens): + continue + if "".join(tokens[index : index + length]) in word_set: + for i in range(1, length): + tokens[index + i] = "##" + tokens[index + i] + index += length + find_word = True + break + + if not find_word: + index += 1 + return tokens + + def process(text): + words = ProcessFn.segmenter(text.strip()) + tokens = ProcessFn.tokenizer.tokenize("".join(words)) + tokens = mark_word_in_tokens(tokens, words) + tokens = ProcessFn.tokenizer.convert_tokens_to_ids(tokens) + return tokens + + ProcessFn.process = process + + def encode(self, text): + token_ids = ProcessFn.process(text) + return token_ids, len(text.encode("utf-8")) + + +def main(): + args = parse_args() + + file_paths = [] + if os.path.isfile(args.input_path): + file_paths.append(args.input_path) + else: + for root, dirs, files in os.walk(args.input_path): + for file_name in files: + file_paths.append(os.path.join(root, file_name)) + file_paths.sort() + + tokenizer = ElectraTokenizer.from_pretrained("ernie-health-chinese") + save_dtype = np.uint16 if tokenizer.vocab_size < 2**16 - 1 else np.int32 + processer = ProcessFn(args) + + pool = multiprocessing.Pool(args.num_worker, initializer=processer.initializer) + + token_id_stream = io.BytesIO() + sent_len_stream = io.BytesIO() + + step = 0 + sent_count = 0 + total_bytes_processed = 0 + start_tic = time.time() + + for path in tqdm(file_paths): + text_fp = open(path, "r") + processed_text = pool.imap(processer.encode, text_fp, 256) + print("Processing %s" % path) + for i, (tokens, bytes_processed) in enumerate(processed_text, start=1): + step += 1 + total_bytes_processed += bytes_processed + + sentence_len = len(tokens) + if sentence_len == 0: + continue + sent_len_stream.write(sentence_len.to_bytes(4, byteorder="little", signed=True)) + sent_count += 1 + token_id_stream.write(np.array(tokens, dtype=save_dtype).tobytes(order="C")) + + if step % args.logging_steps == 0: + time_cost = time.time() - start_tic + mbs = total_bytes_processed / time_cost / 1024 / 1024 + print( + f"Processed {step} sentences", + f"({step/time_cost:.2f} sentences/s, {mbs:.4f} MB/s).", + file=sys.stderr, + ) + + pool.close() + print("Saving tokens to files...") + all_token_ids = np.frombuffer(token_id_stream.getbuffer(), dtype=save_dtype) + all_sent_lens = np.frombuffer(sent_len_stream.getbuffer(), dtype=np.int32) + np.save(args.output_file + "_ids.npy", all_token_ids) + np.savez(args.output_file + "_idx.npz", lens=all_sent_lens) + + print("Total sentences num: %d" % len(all_sent_lens)) + print("Total tokens num: %d" % len(all_token_ids)) + print("Average tokens per sentence: %.2f" % (len(all_token_ids) / len(all_sent_lens))) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/run_pretrain.py b/model_zoo/ernie-health/run_pretrain.py new file mode 100644 index 0000000000000000000000000000000000000000..2774a865ed9cc62f681a76a31dd6b91c5c6ff290 --- /dev/null +++ b/model_zoo/ernie-health/run_pretrain.py @@ -0,0 +1,396 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json +import os +import random +import time +from collections import defaultdict + +import numpy as np +import paddle +from dataset import DataCollatorForErnieHealth, MedicalCorpus, create_dataloader +from visualdl import LogWriter + +from paddlenlp.transformers import ( + ElectraConfig, + ElectraTokenizer, + ErnieHealthForTotalPretraining, + LinearDecayWithWarmup, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie-health": (ElectraConfig, ErnieHealthForTotalPretraining, ElectraTokenizer), +} + + +def parse_args(): + parser = argparse.ArgumentParser() + + parser.add_argument( + "--model_name_or_path", + default="ernie-health-chinese", + type=str, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values()], []) + ), + ) + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", + ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the model predictions and checkpoints will be written.", + ) + parser.add_argument("--max_seq_length", default=512, type=int, help="The max length of each sequence") + parser.add_argument( + "--mlm_prob", default=0.15, type=float, help="The probability of tokens to be sampled as masks." + ) + parser.add_argument( + "--batch_size", + default=256, + type=int, + help="Batch size per GPU/CPU for training.", + ) + parser.add_argument("--learning_rate", default=2e-4, type=float, help="The initial learning rate for Adam.") + parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + parser.add_argument( + "--num_epochs", + default=100, + type=int, + help="Total number of training epochs to perform.", + ) + parser.add_argument( + "--max_steps", + default=-1, + type=int, + help="If > 0: set total number of training steps to perform. Override num_epochs.", + ) + parser.add_argument("--warmup_steps", default=10000, type=int, help="Linear warmup over warmup_steps.") + parser.add_argument("--logging_steps", type=int, default=100, help="Log every X updates steps.") + parser.add_argument("--save_steps", type=int, default=10000, help="Save checkpoint every X updates steps.") + parser.add_argument( + "--init_from_ckpt", + action="store_true", + help="Whether to load model checkpoint. if True, args.model_name_or_path must be dir store ckpt or will train from fresh start", + ) + parser.add_argument( + "--use_amp", action="store_true", help="Whether to use float16(Automatic Mixed Precision) to train." 
+ ) + parser.add_argument("--eager_run", type=bool, default=True, help="Use dygraph mode.") + parser.add_argument( + "--device", + default="gpu", + type=str, + choices=["cpu", "gpu"], + help="The device to select to train the model, is must be cpu/gpu.", + ) + parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") + args = parser.parse_args() + return args + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. By: + # `paddle.seed(args.seed + paddle.distributed.get_rank())` + paddle.seed(seed) + + +class WorkerInitObj(object): + def __init__(self, seed): + self.seed = seed + + def __call__(self, id): + np.random.seed(seed=self.seed + id) + random.seed(self.seed + id) + + +def do_train(args): + paddle.enable_static() if not args.eager_run else None + paddle.set_device(args.device) + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + config_class, model_class, tokenizer_class = MODEL_CLASSES["ernie-health"] + + # Loads or initialize a model. + pretrained_models = list(tokenizer_class.pretrained_init_configuration.keys()) + + if os.path.isdir(args.model_name_or_path) and args.init_from_ckpt: + # Load checkpoint + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + with open(os.path.join(args.model_name_or_path, "run_states.json"), "r") as f: + config_dict = json.load(f) + model_name = config_dict["model_name"] + if model_name in pretrained_models: + model = model_class.from_pretrained(args.model_name_or_path) + model.set_state_dict(paddle.load(os.path.join(args.model_name_or_path, "model_state.pdparams"))) + else: + raise ValueError( + "initialize a model from ckpt need model_name " + "in model_config_file. The supported model_name " + "are as follows: {}".format(tokenizer_class.pretrained_init_configuration.keys()) + ) + else: + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + model_config = config_class() + model = model_class(model_config) + args.init_from_ckpt = False + + if paddle.distributed.get_world_size() > 1: + model = paddle.DataParallel(model) + + # Loads dataset. + tic_load_data = time.time() + logger.info("start load data : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + + train_dataset = MedicalCorpus(data_path=args.input_dir, tokenizer=tokenizer) + logger.info("load data done, total : %s s" % (time.time() - tic_load_data)) + + # Reads data and generates mini-batches. + data_collator = DataCollatorForErnieHealth( + tokenizer=tokenizer, max_seq_length=args.max_seq_length, mlm_prob=args.mlm_prob + ) + + train_data_loader = create_dataloader( + train_dataset, + batch_size=args.batch_size, + mode="train", + use_gpu=True if args.device in "gpu" else False, + data_collator=data_collator, + ) + + num_training_steps = args.max_steps if args.max_steps > 0 else (len(train_data_loader) * args.num_epochs) + args.num_epochs = (num_training_steps - 1) // len(train_data_loader) + 1 + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_steps) + + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
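+    # `apply_decay_param_fun` below restricts AdamW weight decay to these collected names.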
+ decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + epsilon=args.adam_epsilon, + parameters=model.parameters(), + weight_decay=args.weight_decay, + grad_clip=clip, + apply_decay_param_fun=lambda x: x in decay_params, + ) + if args.use_amp: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + + logger.info("start train : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + trained_global_step = global_step = 0 + t_loss = defaultdict(lambda: paddle.to_tensor([0.0])) + log_loss = defaultdict(lambda: paddle.to_tensor([0.0])) + loss_list = defaultdict(list) + log_list = [] + tic_train = time.time() + + if os.path.isdir(args.model_name_or_path) and args.init_from_ckpt: + optimizer.set_state_dict(paddle.load(os.path.join(args.model_name_or_path, "model_state.pdopt"))) + trained_global_step = global_step = config_dict["global_step"] + if trained_global_step < num_training_steps: + logger.info( + "[ start train from checkpoint ] we have already trained %s steps, seeking next step : %s" + % (trained_global_step, trained_global_step + 1) + ) + else: + logger.info( + "[ start train from checkpoint ] we have already trained %s steps, but total training steps is %s, please check configuration !" + % (trained_global_step, num_training_steps) + ) + exit(0) + + if paddle.distributed.get_rank() == 0: + writer = LogWriter(os.path.join(args.output_dir, "loss_log")) + + for epoch in range(args.num_epochs): + for step, batch in enumerate(train_data_loader): + if trained_global_step > 0: + trained_global_step -= 1 + continue + global_step += 1 + masked_input_ids, input_ids, gen_labels = batch + + if args.use_amp: + with paddle.amp.auto_cast(): + loss, gen_loss, rtd_loss, mts_loss, csp_loss = model( + input_ids=masked_input_ids, + raw_input_ids=input_ids, + generator_labels=gen_labels, + ) + + scaled = scaler.scale(loss) + scaled.backward() + t_loss["loss"] += loss.detach() + t_loss["gen"] += gen_loss.detach() + t_loss["rtd"] += rtd_loss.detach() + t_loss["mts"] += mts_loss.detach() + t_loss["csp"] += csp_loss.detach() + scaler.minimize(optimizer, scaled) + else: + loss, gen_loss, rtd_loss, mts_loss, csp_loss = model( + input_ids=masked_input_ids, + raw_input_ids=input_ids, + generator_labels=gen_labels, + ) + loss.backward() + t_loss["loss"] += loss.detach() + t_loss["gen"] += gen_loss.detach() + t_loss["rtd"] += rtd_loss.detach() + t_loss["mts"] += mts_loss.detach() + t_loss["csp"] += csp_loss.detach() + optimizer.step() + + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.logging_steps == 0: + local_loss = dict( + [(k, (t_loss[k] - log_loss[k]) / args.logging_steps) for k in ["loss", "gen", "rtd", "mts", "csp"]] + ) + if paddle.distributed.get_world_size() > 1: + for k in ["loss", "gen", "rtd", "mts", "csp"]: + paddle.distributed.all_gather(loss_list[k], local_loss[k]) + if paddle.distributed.get_rank() == 0: + tmp_loss = dict( + [ + (k, float((paddle.stack(loss_list[k]).sum() / len(loss_list[k])).numpy())) + for k in ["loss", "gen", "rtd", "mts", "csp"] + ] + ) + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "avg_loss: {4:.15f}, generator: {5:.15f}, rtd: {6:.15f}, multi_choice: {7:.15f}, " + "seq_contrastive: {8:.15f}, lr: {9:.10f}, speed: {10:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + tmp_loss["loss"], + tmp_loss["gen"], + tmp_loss["rtd"], + tmp_loss["mts"], + tmp_loss["csp"], + 
optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + logger.info(log_str) + log_list.append(log_str) + writer.add_scalar("generator_loss", tmp_loss["gen"], global_step) + writer.add_scalar("rtd_loss", tmp_loss["rtd"] * 50, global_step) + writer.add_scalar("mts_loss", tmp_loss["mts"] * 20, global_step) + writer.add_scalar("csp_loss", tmp_loss["csp"], global_step) + writer.add_scalar("total_loss", tmp_loss["loss"], global_step) + writer.add_scalar("lr", optimizer.get_lr(), global_step) + loss_list = defaultdict(list) + else: + local_loss = dict([(k, float(v)) for k, v in local_loss.items()]) + log_str = ( + "global step {0:d}/{1:d}, epoch: {2:d}, batch: {3:d}, " + "avg_loss: {4:.15f}, generator: {5:.15f}, rtd: {6:.15f}, multi_choice: {7:.15f}, " + "seq_contrastive_loss: {8:.15f}, lr: {9:.10f}, speed: {10:.2f} s/it" + ).format( + global_step, + num_training_steps, + epoch, + step, + local_loss["loss"], + local_loss["gen"], + local_loss["rtd"], + local_loss["mts"], + local_loss["csp"], + optimizer.get_lr(), + (time.time() - tic_train) / args.logging_steps, + ) + logger.info(log_str) + log_list.append(log_str) + loss_dict = { + "generator_loss": local_loss["gen"], + "rtd_loss": local_loss["rtd"] * 50, + "mts_loss": local_loss["mts"] * 20, + "csp_loss": local_loss["csp"], + } + for k, v in loss_dict.items(): + writer.add_scalar("loss/%s" % k, v, global_step) + writer.add_scalar("total_loss", local_loss["loss"], global_step) + writer.add_scalar("lr", optimizer.get_lr(), global_step) + log_loss = dict(t_loss) + tic_train = time.time() + + if global_step % args.save_steps == 0: + if paddle.distributed.get_rank() == 0: + output_dir = os.path.join(args.output_dir, "model_%d.pdparams" % global_step) + if not os.path.exists(output_dir): + os.makedirs(output_dir) + model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model + config_to_save = model_to_save.discriminator.electra.config.to_dict() + if "self" in config_to_save: + del config_to_save["self"] + run_states = { + "model_name": model_name if args.init_from_ckpt else args.model_name_or_path, + "global_step": global_step, + "epoch": epoch, + "step": step, + } + with open(os.path.join(output_dir, "model_config.json"), "w") as f: + json.dump(config_to_save, f) + with open(os.path.join(output_dir, "run_states.json"), "w") as f: + json.dump(run_states, f) + paddle.save(model.state_dict(), os.path.join(output_dir, "model_state.pdparams")) + tokenizer.save_pretrained(output_dir) + paddle.save(optimizer.state_dict(), os.path.join(output_dir, "model_state.pdopt")) + if len(log_list) > 0: + with open(os.path.join(output_dir, "train.log"), "w") as f: + for log in log_list: + if len(log.strip()) > 0: + f.write(log.strip() + "\n") + if global_step >= num_training_steps: + if paddle.distributed.get_rank() == 0: + writer.close() + return + + +def print_arguments(args): + """print arguments""" + print("----------- Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +if __name__ == "__main__": + args = parse_args() + print_arguments(args) + do_train(args) diff --git a/model_zoo/ernie-health/run_pretrain_trainer.py b/model_zoo/ernie-health/run_pretrain_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..98505d68958f5ba9ff9b4f55bc76bb12d6228aa1 --- /dev/null +++ b/model_zoo/ernie-health/run_pretrain_trainer.py @@ -0,0 +1,166 @@ +# Copyright (c) 2022 PaddlePaddle 
Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +from dataclasses import dataclass, field +from typing import Optional + +import paddle +from dataset import DataCollatorForErnieHealth, MedicalCorpus + +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ( + ElectraConfig, + ElectraTokenizer, + ErnieHealthForTotalPretraining, +) +from paddlenlp.utils.log import logger + +MODEL_CLASSES = { + "ernie-health": (ElectraConfig, ErnieHealthForTotalPretraining, ElectraTokenizer), +} + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and evaluating. + Using `PdArgumentParser` we can turn this class into argparse arguments to be able to + specify them on the command line. + """ + + input_dir: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + max_seq_length: int = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + masked_lm_prob: float = field( + default=0.15, + metadata={"help": "Mask token prob."}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to pre-train from. + """ + + model_type: Optional[str] = field( + default="ernie-health", metadata={"help": "Only support for ernie-health pre-training for now."} + ) + model_name_or_path: str = field( + default="ernie-health-chinese", + metadata={ + "help": "Path to pretrained model or model identifier from https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers.html" + }, + ) + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + training_args.eval_iters = 10 + training_args.test_iters = training_args.eval_iters * 10 + # training_args.recompute = True + + # Log model and data config + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, world_size: {training_args.world_size}, " + + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 1: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. 
" + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + config_class, model_class, tokenizer_class = MODEL_CLASSES["ernie-health"] + + # Loads or initialize a model. + tokenizer = tokenizer_class.from_pretrained(model_args.model_name_or_path) + + model_config = config_class() + model = model_class(model_config) + + # Loads dataset. + tic_load_data = time.time() + logger.info("start load data : %s" % (time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))) + + train_dataset = MedicalCorpus(data_path=data_args.input_dir, tokenizer=tokenizer) + logger.info("load data done, total : %s s" % (time.time() - tic_load_data)) + + # Reads data and generates mini-batches. + data_collator = DataCollatorForErnieHealth( + tokenizer=tokenizer, + max_seq_length=data_args.max_seq_length, + mlm_prob=data_args.masked_lm_prob, + return_dict=True, + ) + + trainer = Trainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=None, + tokenizer=tokenizer, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-health/run_trainer.sh b/model_zoo/ernie-health/run_trainer.sh new file mode 100644 index 0000000000000000000000000000000000000000..94b99e83471303621de611f6320024b0410e35b0 --- /dev/null +++ b/model_zoo/ernie-health/run_trainer.sh @@ -0,0 +1,45 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#set -x +unset CUDA_VISIBLE_DEVICES + +task_name="eheath-pretraining" +rm -rf output/$task_name/log + +python -u -m paddle.distributed.launch \ + --gpus 0,1,2,3,4,5,6,7 \ + run_pretrain_trainer.py \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --max_seq_length 512 \ + --gradient_accumulation_steps 1\ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --learning_rate 0.001 \ + --max_steps 1000000 \ + --save_steps 50000 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 20 \ + --dataloader_num_workers 4 \ + --device "gpu"\ + --fp16 \ + --fp16_opt_level "O1" \ + --do_train \ + --disable_tqdm True\ + --save_total_limit 10 + +# WARNING: fp16_opt_level O2 may cause ehealth pretraining fail ! 
\ No newline at end of file diff --git a/model_zoo/ernie-layout/README.md b/model_zoo/ernie-layout/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a092f0a5c048115340bd4bc71762d46e2267d9f5 --- /dev/null +++ b/model_zoo/ernie-layout/README.md @@ -0,0 +1,435 @@ +English | [简体中文](README_ch.md) + +# ERNIE-Layout + + **content** + +- [ERNIE-Layout](#ERNIE-Layout) + - [1. Model Instruction](#1) + - [2. Out-of-Box](#2) + - [HuggingFace web demo](#21) + - [Demo show](#22) + - [Taskflow](#23) + - [3. Model Performance](#3) + - [4. Fine-tuning Examples](#4) + - [4.1 Key Information Extraction](#41) + - [4.2 Document Question Answering](#42) + - [4.3 Document Image Classification](#43) + - [5. Deploy](#5) + - [5.1 Inference Model Export](#51) + - [5.2 Python Deploy](#52) + + + +## 1. Model Instruction +Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. + +[The work](http://arxiv.org/abs/2210.06155) is accepted by EMNLP 2022 (Findings). To expand the scope of commercial applications for document intelligence, we release the multilingual model of ERNIE-Layout through PaddleNLP. + +
+ +
+ + + +## 2. Out-of-Box + + + +#### HuggingFace web demo + +🧾 HuggingFace web demo is available [here](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout) + +
+ +
+ + + +#### Demo show + +- Invoice VQA + +
+ +
+ +- Poster VQA + +
+ +
+ +- WebPage VQA + +
+ +
+ + +- Table VQA + +
+ +
+ + +- Exam Paper VQA + +
+ +
+
+
+
+- English invoice VQA with multilingual (CH, EN, JP, TH, ES, RUS) prompts
+
+
+ +
+
+
+- Chinese invoice VQA with multilingual (CHS, CHT, EN, JP, DE) prompts
+
+
+ +
+ +- Demo images are available [here](https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/demo.zip) + + + +#### Taskflow + +- Input Format + +``` +[ + {"doc": "./book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]}, + {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]} +] +``` + +Default to use PaddleOCR, you can also use your own OCR result via ``word_boxes``, the data format is ``List[str, List[float, float, float, float]]``。 + +``` +[ + {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes} +] +``` + +- Support single and batch input + + - Image from http link + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence", lang="en") + >>> docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]}]) + [{'prompt': "What is the name of the author of 'The Adventure Zone: The " + 'Crystal Kingdom’?', + 'result': [{'end': 39, + 'prob': 0.99, + 'start': 22, + 'value': 'Clint McElroy. Carey Pietsch, Griffn McElroy, Travis ' + 'McElroy'}]}, + {'prompt': 'What type of book cover does The Adventure Zone: The Crystal ' + 'Kingdom have?', + 'result': [{'end': 51, 'prob': 1.0, 'start': 51, 'value': 'Paperback'}]}, + {'prompt': 'For Rage, who is the author listed as?', + 'result': [{'end': 93, 'prob': 1.0, 'start': 91, 'value': 'Bob Woodward'}]}] + ``` + + - Image from local path + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence") + >>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}])) + [{'prompt': '五百丁本次想要担任的是什么职位?', + 'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]}, + {'prompt': '五百丁是在哪里上的大学?', + 'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]}, + {'prompt': '大学学的是什么专业?', + 'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}] + ``` + +- Parameter Description + * `batch_size`: number of input of each batch, default to 1. + * `lang`: PaddleOCR language, `en` is better to English images, default to `ch`. + * `topn`: return the top n results with highest probability, default to 1. + + + + +## 3. Model Performance + +- Dataset + + | Dataset | Task | Language | Note | + | --------- | ---------- | --- | ---- | + | FUNSD | Key Information Extraction | English | - | + | XFUND-ZH | Key Information Extraction | Chinese | - | + | DocVQA-ZH | Document Question Answering | Chinese | The submission of the competition of [DocVQA-ZH](http://ailab.aiwin.org.cn/competitions/49) is now closed so we split original dataset into three parts for model evluation. There are 4,187 training images, 500 validation images, and 500 test images.| + | RVL-CDIP (sampled) | Document Image Classification | English | The RVL-CDIP dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. Because of the original dataset is large and slow for training, so we downsampling from it. The sampled dataset consist of 6,400 training images, 800 validation images, and 800 test images. | + +- Results + + | Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ------------------ | --------- | --------- | --------- | --------- | + | LayoutXLM-Base | 86.72 | **90.88** | 86.24 | 66.01 | + | ERNIE-LayoutX-Base | **89.31** | 90.29 | **88.58** | **69.57** | + +- Evaluation Methods + + - All the above tasks do the Hyper Parameter searching based on Grid Search method. The evaluation step interval of FUNSD and XFUND-ZH are both 100, metric is F1-Score. The evaluation step interval of RVL-CDIP is 2000, metric is Accuracy. The evaluation step interval of DocVQA-ZH is 10000, metric is [ANLS](https://arxiv.org/pdf/1907.00490.pdf), + + - Hyper Parameters search ranges + + | Hyper Parameters | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ----------------- | ------- | -------- | -------- | --------- | + | learning_rate | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | + | batch_size | 1, 2, 4 | 8, 16, 24 | 1, 2, 4 | 8, 16, 24 | + | warmup_ratio | - | 0, 0.05, 0.1 | - | 0, 0.05, 0.1 | + + The strategy of ``lr_scheduler_type`` for FUNSD and XFUND is constant, so warmup_ratio is excluded. + + - ``max_steps`` is applied for the fine-tuning on both FUNSD and XFUND-ZH, 10000 steps and 20000 steps respectively; ``num_train_epochs`` is set to 6 and 20 for DocVQA-ZH and RVL-CDIP respectively. + +- Best Hyper Parameter + + | Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ------------------ | ------------ | ------------ | ------------ | ----------- | + | LayoutXLM-Base | 1e-5, 2, _ | 1e-5, 8, 0.1 | 1e-5, 2, _ | 2e-5. 8, 0.1 | + | ERNIE-LayoutX-Base | 2e-5, 4, _ | 1e-5, 8, 0. | 1e-5, 4, _ | 2e-5. 8, 0.05 | + + + + +## 4. 
Fine-tuning Examples + +- Installation + +``` +pip install -r requirements.txt +``` + + + +#### 4.1 Key Information Extraction + +- FUNSD Train + +```shell +python -u run_ner.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/funsd/ \ + --dataset_name funsd \ + --do_train \ + --do_eval \ + --max_steps 10000 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern ner-bio \ + --preprocessing_num_workers 4 \ + --overwrite_cache false \ + --use_segment_box \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --learning_rate 2e-5 \ + --lr_scheduler_type constant \ + --gradient_accumulation_steps 1 \ + --seed 1000 \ + --metric_for_best_model eval_f1 \ + --greater_is_better true \ + --overwrite_output_dir +``` + +- XFUND-ZH Train + +```shell +python -u run_ner.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/xfund_zh/ \ + --dataset_name xfund_zh \ + --do_train \ + --do_eval \ + --lang "ch" \ + --max_steps 20000 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern ner-bio \ + --preprocessing_num_workers 4 \ + --overwrite_cache false \ + --use_segment_box \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --learning_rate 1e-5 \ + --lr_scheduler_type constant \ + --gradient_accumulation_steps 1 \ + --seed 1000 \ + --metric_for_best_model eval_f1 \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + +#### 4.2 Document Question Answering + +- DocVQA-ZH Train + +```shell +python3 -u run_mrc.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/docvqa_zh/ \ + --dataset_name docvqa_zh \ + --do_train \ + --do_eval \ + --lang "ch" \ + --num_train_epochs 6 \ + --lr_scheduler_type linear \ + --warmup_ratio 0.05 \ + --weight_decay 0 \ + --eval_steps 10000 \ + --save_steps 10000 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern "mrc" \ + --use_segment_box false \ + --return_entity_level_metrics false \ + --overwrite_cache false \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --learning_rate 2e-5 \ + --preprocessing_num_workers 32 \ + --save_total_limit 1 \ + --train_nshard 16 \ + --seed 1000 \ + --metric_for_best_model anls \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + +#### 4.3 Document Image Classification + +- RVL-CDIP Train + +```shell +python3 -u run_cls.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ \ + --dataset_name rvl_cdip_sampled \ + --do_train \ + --do_eval \ + --num_train_epochs 20 \ + --lr_scheduler_type linear \ + --max_seq_length 512 \ + --warmup_ratio 0.05 \ + --weight_decay 0 \ + --eval_steps 2000 \ + --save_steps 2000 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern "cls" \ + --use_segment_box \ + --return_entity_level_metrics false \ + --overwrite_cache false \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --learning_rate 1e-5 \ + --preprocessing_num_workers 32 \ + --train_nshard 16 \ + --seed 1000 \ + --metric_for_best_model acc \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + 
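+The hyper parameters above correspond to the best grid-search results reported in Section 3. As a rough illustration, a sweep over the FUNSD search ranges could be scripted as follows; this is a minimal sketch rather than an official tool, and it assumes `run_ner.py` is launched from this directory with the same flags as in the example command (bookkeeping flags are omitted for brevity).
+
+```python
+import itertools
+import subprocess
+
+# FUNSD search ranges from Section 3: learning_rate x batch_size.
+learning_rates = ["5e-6", "1e-5", "2e-5", "5e-5"]
+batch_sizes = ["1", "2", "4"]
+
+for lr, bs in itertools.product(learning_rates, batch_sizes):
+    output_dir = f"./ernie-layoutx-base-uncased/models/funsd_lr{lr}_bs{bs}/"
+    subprocess.run(
+        [
+            "python", "-u", "run_ner.py",
+            "--model_name_or_path", "ernie-layoutx-base-uncased",
+            "--dataset_name", "funsd",
+            "--output_dir", output_dir,
+            "--do_train", "--do_eval",
+            "--max_steps", "10000",
+            "--eval_steps", "100",
+            "--save_steps", "100",
+            "--pattern", "ner-bio",
+            "--use_segment_box",
+            "--per_device_train_batch_size", bs,
+            "--per_device_eval_batch_size", bs,
+            "--learning_rate", lr,
+            "--lr_scheduler_type", "constant",
+            "--metric_for_best_model", "eval_f1",
+            "--overwrite_output_dir",
+        ],
+        check=True,
+    )
+    # Compare the eval_f1 of each output_dir afterwards and keep the best run.
+```
+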
+## 5. Deploy + + + +#### 5.1 Inference Model Export + +After fine-tuning, you can also export the inference model via [Model Export Script](export_model.py), the inference model will be saved in the `output_path` you specified. + +- Export the model fine-tuned on FUNSD + +```shell +python export_model.py --task_type ner --model_path ./ernie-layoutx-base-uncased/models/funsd/ --output_path ./ner_export +``` + +- Export the model fine-tuned on DocVQA-ZH + +```shell +python export_model.py --task_type mrc --model_path ./ernie-layoutx-base-uncased/models/docvqa_zh/ --output_path ./mrc_export +``` + +- Export the model fine-tuned on RVL-CDIP(sampled) + +```shell +python export_model.py --task_type cls --model_path ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ --output_path ./cls_export +``` + +- Parameter Description + * `model_path`:the save directory of dygraph model parameters, default to "./checkpoint/"。 + * `output_path`:the save directory of static graph model parameters, default to "./export"。 + +- Directory + + ```text + export/ + ├── inference.pdiparams + ├── inference.pdiparams.info + └── inference.pdmodel + ``` + + + +#### 5.2 Python Deploy + +We provide the deploy example on Key Information Extraction, Document Question Answering and Document Image Classification, please follow the [ERNIE-Layout Python Deploy Guide](./deploy/python/README.md) + + + + +## References + +- [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](http://arxiv.org/abs/2210.06155) + +- [ICDAR 2019 Competition on Scene Text Visual Question Answering](https://arxiv.org/pdf/1907.00490.pdf) + +- [XFUND dataset](https://github.com/doc-analysis/XFUND) + +- [FUNSD dataset](https://guillaumejaume.github.io/FUNSD/) + +- [RVL-CDIP dataset](https://adamharley.com/rvl-cdip/) + +- [Competition of Insurance Document Visual Cognition Question Answering](http://ailab.aiwin.org.cn/competitions/49) diff --git a/model_zoo/ernie-layout/README_ch.md b/model_zoo/ernie-layout/README_ch.md new file mode 100644 index 0000000000000000000000000000000000000000..d755952b20cf354038403da6612b9b8898a87993 --- /dev/null +++ b/model_zoo/ernie-layout/README_ch.md @@ -0,0 +1,437 @@ +[English](README.md) | 简体中文 + +# ERNIE-Layout + + **目录** + +- [1. 模型介绍](#1) +- [2. 开箱即用](#2) + - [HuggingFace web demo](#21) + - [应用场景展示](#22) + - [Taskflow](#23) +- [3. Benchmark模型效果](#3) +- [4. 模型微调](#4) + - [4.1 文档信息抽取任务](#41) + - [4.2 文档视觉问答任务](#42) + - [4.3 文档图像分类任务](#43) +- [5. 部署](#5) + - [5.1 静态图导出](#51) + - [5.2 Python部署](#52) + + + +## 1. 模型介绍 + + +ERNIE-Layout以文心文本大模型ERNIE为底座,融合文本、图像、布局等信息进行跨模态联合建模,创新性引入布局知识增强,提出阅读顺序预测、细粒度图文匹配等自监督预训练任务,升级空间解偶注意力机制,在各数据集上效果取得大幅度提升,相关工作[ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](http://arxiv.org/abs/2210.06155)已被EMNLP 2022 Findings会议收录[1]。考虑到文档智能在多语种上商用广泛,依托PaddleNLP对外开源业界最强的多语言跨模态文档预训练模型ERNIE-Layout。 + +
+ +
+ + + +## 2. 开箱即用 + + + +#### HuggingFace web demo + +🧾 通过[Huggingface网页](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout)体验DocPrompt功能: + +
+ +
+ + + +#### 应用场景展示 + +- 发票抽取问答 + +
+ +
+ +- 海报抽取问答 + +
+ +
+ +- 网页抽取问答 + +
+ +
+ + +- 表格抽取问答 + +
+ +
+ + +- 试卷抽取问答 + +
+ +
+ + +- 英文票据多语种(中、英、日、泰、西班牙、俄语)抽取问答 + +
+ +
+ +- 中文票据多语种(中简、中繁、英、日、德语)抽取问答 + +
+ +
+ +- Demo图片可在此[下载](https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/demo.zip) + + + +#### Taskflow + +通过``paddlenlp.Taskflow``三行代码调用DocPrompt功能,具备多语言文档抽取问答能力,部分应用场景展示如下: + +- 输入格式 + +``` +[ + {"doc": "./invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}, + {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]} +] +``` + +默认使用PaddleOCR进行OCR识别,同时支持用户通过``word_boxes``传入自己的OCR结果,格式为``List[str, List[float, float, float, float]]``。 + +``` +[ + {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes} +] +``` + +- 支持单条、批量预测 + + - 支持本地图片路径输入 + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence") + >>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}])) + [{'prompt': '五百丁本次想要担任的是什么职位?', + 'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]}, + {'prompt': '五百丁是在哪里上的大学?', + 'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]}, + {'prompt': '大学学的是什么专业?', + 'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}] + ``` + + - http图片链接输入 + +
+ +
+ + ```python + >>> from pprint import pprint + >>> from paddlenlp import Taskflow + + >>> docprompt = Taskflow("document_intelligence") + >>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg", "prompt": ["发票号码是多少?", "校验码是多少?"]}])) + [{'prompt': '发票号码是多少?', + 'result': [{'end': 2, 'prob': 0.74, 'start': 2, 'value': 'No44527206'}]}, + {'prompt': '校验码是多少?', + 'result': [{'end': 233, + 'prob': 1.0, + 'start': 231, + 'value': '01107 555427109891646'}]}] + ``` + +- 可配置参数说明 + * `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。 + * `lang`:选择PaddleOCR的语言,`ch`可在中英混合的图片中使用,`en`在英文图片上的效果更好,默认为`ch`。 + * `topn`: 如果模型识别出多个结果,将返回前n个概率值最高的结果,默认为1。 + + + + +## 3. Benchmark模型效果 + +- 开源数据集介绍 + + | 数据集 | 任务类型 | 语言 | 说明 | + | --------- | ---------- | --- | ---- | + | FUNSD | 文档信息抽取 | 英文 | - | + | XFUND-ZH | 文档信息抽取 | 中文 | - | + | DocVQA-ZH | 文档视觉问答 | 中文 | [DocVQA-ZH](http://ailab.aiwin.org.cn/competitions/49)已停止榜单提交,因此我们将原始训练集进行重新划分以评估模型效果,划分后训练集包含4,187张图片,验证集包含500张图片,测试集包含500张图片。 | + | RVL-CDIP (sampled) | 文档图像分类 | 英文 | RVL-CDIP原始数据集共包含400,000张图片,由于数据集较大训练较慢,为验证文档图像分类的模型效果故进行降采样,采样后的训练集包含6,400张图片,验证集包含800张图片,测试集包含800张图片。 | + +- 评测结果 + + 在文档智能领域主流开源数据集的**验证集**上评测指标如下表所示: + + | Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ------------------ | --------- | --------- | --------- | --------- | + | LayoutXLM-Base | 86.72 | **90.88** | 86.24 | 66.01 | + | ERNIE-LayoutX-Base | **89.31** | 90.29 | **88.58** | **69.57** | + +- 具体评测方式 + + - 以上所有任务均基于Grid Search方式进行超参寻优。FUNSD和XFUND-ZH每间隔 100 steps 评估验证集效果,评价指标为F1-Score。 + RVL-CDIP每间隔2000 steps评估验证集效果,评价指标为Accuracy。DocVQA-ZH每间隔10000 steps评估验证集效果,取验证集最优效果作为表格中的汇报指标,评价指标为ANLS(计算方法参考https://arxiv.org/pdf/1907.00490.pdf)。 + + - 以上每个下游任务的超参范围如下表所示: + + | Hyper Parameters | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ----------------- | ------- | -------- | -------- | --------- | + | learning_rate | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | + | batch_size | 1, 2, 4 | 8, 16, 24 | 1, 2, 4 | 8, 16, 24 | + | warmup_ratio | - | 0, 0.05, 0.1 | - | 0, 0.05, 0.1 | + + FUNSD和XFUND-ZH使用的lr_scheduler_type策略是constant,因此不对warmup_ratio进行搜索。 + + - 文档信息抽取任务FUNSD和XFUND-ZH采用最大步数(max_steps)的微调方式,分别为10000 steps和20000 steps;文档视觉问答DocVQA-ZH的num_train_epochs为6;文档图像分类RVL-CDIP的num_train_epochs为20。 + +- 最优超参 + + 不同预训练模型在下游任务上做Grid Search之后的最优超参(learning_rate、batch_size、warmup_ratio)如下: + + | Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH | + | ------------------ | ------------ | ------------ | ------------ | ----------- | + | LayoutXLM-Base | 1e-5, 2, _ | 1e-5, 8, 0.1 | 1e-5, 2, _ | 2e-5. 8, 0.1 | + | ERNIE-LayoutX-Base | 2e-5, 4, _ | 1e-5, 8, 0. | 1e-5, 4, _ | 2e-5. 8, 0.05 | + + + + +## 4. 
模型微调 + +- 请执行以下命令进行安装项目依赖 + +``` +pip install -r requirements.txt +``` + + + +#### 4.1 文档信息抽取任务 + +- FUNSD训练 + +```shell +python -u run_ner.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/funsd/ \ + --dataset_name funsd \ + --do_train \ + --do_eval \ + --max_steps 10000 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern ner-bio \ + --preprocessing_num_workers 4 \ + --overwrite_cache false \ + --use_segment_box \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --learning_rate 2e-5 \ + --lr_scheduler_type constant \ + --gradient_accumulation_steps 1 \ + --seed 1000 \ + --metric_for_best_model eval_f1 \ + --greater_is_better true \ + --overwrite_output_dir +``` + +- XFUND-ZH训练 + +```shell +python -u run_ner.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/xfund_zh/ \ + --dataset_name xfund_zh \ + --do_train \ + --do_eval \ + --lang "ch" \ + --max_steps 20000 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern ner-bio \ + --preprocessing_num_workers 4 \ + --overwrite_cache false \ + --use_segment_box \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --learning_rate 1e-5 \ + --lr_scheduler_type constant \ + --gradient_accumulation_steps 1 \ + --seed 1000 \ + --metric_for_best_model eval_f1 \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + +#### 4.2 文档视觉问答任务 + +- DocVQA-ZH训练 + +```shell +python3 -u run_mrc.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/docvqa_zh/ \ + --dataset_name docvqa_zh \ + --do_train \ + --do_eval \ + --lang "ch" \ + --num_train_epochs 6 \ + --lr_scheduler_type linear \ + --warmup_ratio 0.05 \ + --weight_decay 0 \ + --eval_steps 10000 \ + --save_steps 10000 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern "mrc" \ + --use_segment_box false \ + --return_entity_level_metrics false \ + --overwrite_cache false \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --learning_rate 2e-5 \ + --preprocessing_num_workers 32 \ + --save_total_limit 1 \ + --train_nshard 16 \ + --seed 1000 \ + --metric_for_best_model anls \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + +#### 4.3 文档图像分类任务 + +- RVL-CDIP训练 + +```shell +python3 -u run_cls.py \ + --model_name_or_path ernie-layoutx-base-uncased \ + --output_dir ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ \ + --dataset_name rvl_cdip_sampled \ + --do_train \ + --do_eval \ + --num_train_epochs 20 \ + --lr_scheduler_type linear \ + --max_seq_length 512 \ + --warmup_ratio 0.05 \ + --weight_decay 0 \ + --eval_steps 2000 \ + --save_steps 2000 \ + --save_total_limit 1 \ + --load_best_model_at_end \ + --pattern "cls" \ + --use_segment_box \ + --return_entity_level_metrics false \ + --overwrite_cache false \ + --doc_stride 128 \ + --target_size 1000 \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 8 \ + --learning_rate 1e-5 \ + --preprocessing_num_workers 32 \ + --train_nshard 16 \ + --seed 1000 \ + --metric_for_best_model acc \ + --greater_is_better true \ + --overwrite_output_dir +``` + + + +## 5. 
部署 + + + +#### 5.1 静态图导出 + +使用动态图训练结束之后,还可以将动态图参数导出为静态图参数,静态图模型将用于**后续的推理部署工作**。具体代码见[静态图导出脚本](export_model.py),静态图参数保存在`output_path`指定路径中。运行方式: + + +- 导出在FUNSD上微调后的模型: + +```shell +python export_model.py --task_type ner --model_path ./ernie-layoutx-base-uncased/models/funsd/ --output_path ./ner_export +``` + +- 导出在DocVQA-ZH上微调后的模型: + +```shell +python export_model.py --task_type mrc --model_path ./ernie-layoutx-base-uncased/models/docvqa_zh/ --output_path ./mrc_export +``` + +- 导出在RVL-CDIP(sampled)上微调后的模型: + +```shell +python export_model.py --task_type cls --model_path ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ --output_path ./cls_export +``` + +- 可支持配置的参数: +* `model_path`:动态图训练保存的参数路径;默认为"./checkpoint/"。 +* `output_path`:静态图图保存的参数路径;默认为"./export"。 + +- 程序运行时将会自动导出模型到指定的 `output_path` 中,保存模型文件结构如下所示: + +```text +export/ +├── inference.pdiparams +├── inference.pdiparams.info +└── inference.pdmodel +``` + + + +#### 5.2 Python部署 + +导出静态图模型之后可用于部署,项目提供了文档信息抽取、文档视觉问答和文档图像分类三大场景下的使用示例,详见[ERNIE-Layout Python部署指南](./deploy/python/README_ch.md)。 + + + + +## References + +- [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](http://arxiv.org/abs/2210.06155) + +- [ICDAR 2019 Competition on Scene Text Visual Question Answering](https://arxiv.org/pdf/1907.00490.pdf) + +- [XFUND dataset](https://github.com/doc-analysis/XFUND) + +- [FUNSD dataset](https://guillaumejaume.github.io/FUNSD/) + +- [RVL-CDIP dataset](https://adamharley.com/rvl-cdip/) + +- [保险文本视觉认知问答竞赛](http://ailab.aiwin.org.cn/competitions/49) diff --git a/model_zoo/ernie-layout/data_collator.py b/model_zoo/ernie-layout/data_collator.py new file mode 100644 index 0000000000000000000000000000000000000000..bee1a06cf8162e07607c92b7a5a6635fdb914791 --- /dev/null +++ b/model_zoo/ernie-layout/data_collator.py @@ -0,0 +1,78 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Optional, Union +from dataclasses import dataclass + +from paddlenlp.transformers.tokenizer_utils_base import PretrainedTokenizerBase, PaddingStrategy + + +@dataclass +class DataCollator: + """ + Data collator that will dynamically pad the inputs received, as well as the labels. + Args: + tokenizer (:class:`~transformers.PreTrainedTokenizer` or :class:`~transformers.PreTrainedTokenizerFast`): + The tokenizer used for encoding the data. + padding (:obj:`bool`, :obj:`str` or :class:`~transformers.file_utils.PaddingStrategy`, `optional`, defaults to :obj:`True`): + Select a strategy to pad the returned sequences (according to the model's padding side and padding index) + among: + * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single + sequence if provided). + * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the + maximum acceptable input length for the model if that argument is not provided. 
+ * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of + different lengths). + max_length (:obj:`int`, `optional`): + Maximum length of the returned list and optionally padding length (see above). + pad_to_multiple_of (:obj:`int`, `optional`): + If set will pad the sequence to a multiple of the provided value. + This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= + 7.5 (Volta). + label_pad_token_id (:obj:`int`, `optional`, defaults to -100): + The id to use when padding the labels (-100 will be automatically ignore by PyTorch loss functions). + """ + + tokenizer: PretrainedTokenizerBase + padding: Union[bool, str, PaddingStrategy] = True + max_length: Optional[int] = None + label_pad_token_id: int = -100 + pad_to_multiple_of: Optional[int] = None + return_tensors: str = "np" + + def __call__(self, features): + has_labels = "labels" in features[0] + for feat in features: + feat["input_ids"] = feat["input_ids"] + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_length - len(feat["input_ids"]) + ) + feat["attention_mask"] = feat["attention_mask"] + [ + 1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token] + ] * (self.max_length - len(feat["attention_mask"])) + feat["bbox"] = feat["bbox"] + [[0, 0, 0, 0] for _ in range(self.max_length - len(feat["bbox"]))] + if has_labels and not isinstance(feat["labels"], int): + feat["labels"] = feat["labels"] + [1 * self.label_pad_token_id] * ( + self.max_length - len(feat["labels"]) + ) + + batch = self.tokenizer.pad( + features, + padding=self.padding, + max_length=self.max_length, + pad_to_multiple_of=self.pad_to_multiple_of, + # Conversion to tensors will fail if we have labels as they are not of the same length yet. + return_tensors=self.return_tensors, + ) + return batch diff --git a/model_zoo/ernie-layout/deploy/python/README.md b/model_zoo/ernie-layout/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b4d2a52589fc849a15c63648dea60d02fdaeae78 --- /dev/null +++ b/model_zoo/ernie-layout/deploy/python/README.md @@ -0,0 +1,137 @@ +English | [简体中文](README_ch.md) + +# ERNIE-Layout Python Deploy Guide + +- [1. Quick Start](#1) +- [2. Key Information Extraction Deploy](#2) +- [3. Document Question Answering Deploy](#3) +- [4. Document Image Classification Deploy](#4) +- [5. Parameter Description](#5) + + + +## 1. Quick Start + +#### Environment + +- Dependency Installation + +``` +pip install -r requirements.txt +``` + +#### Data Preparation + +- Dowload the sample images and put in ``./images`` + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/document_intelligence/images.zip && unzip images.zip +``` + + + +## 2. Key Information Extraction Deploy + +- Run + +```shell +python infer.py \ + --model_path_prefix ../../ner_export/inference \ + --task_type ner \ + --lang "en" \ + --batch_size 8 +``` + +- Output sample + +``` +[{'doc': './images/ner_sample.jpg', + 'result': [{'text': 'ATT . GEN . ADMIN . OFFICE', + 'label': 'QUESTION', + 'start': 0, + 'end': 12, + 'probability': 0.8961102192651806}, + {'text': 'Fax :', + 'label': 'QUESTION', + 'start': 13, + 'end': 14, + 'probability': 0.8005126895801068}, + {'text': '614', + 'label': 'ANSWER', + 'start': 15, + 'end': 16, + 'probability': 0.5063673730110718}, + {'text': 'Dec 10', + 'label': 'ANSWER', + 'start': 23, + 'end': 24, + 'probability': 0.6265156606943465}, + + ...... 
+ + {'text': 'NOTE', + 'label': 'QUESTION', + 'start': 179, + 'end': 179, + 'probability': 0.9810855421041412}]}] +``` + + + +## 3. Document Question Answering Deploy + +- Run + +```shell +python infer.py \ + --model_path_prefix ../../mrc_export/inference \ + --task_type mrc \ + --lang "ch" \ + --batch_size 8 +``` + +- Output sample + +``` +[{'doc': './images/mrc_sample.jpg', + 'result': [{'question': '杨小峰是什么身份?', 'answer': ['法定代表人']}, + {'question': '花了多少钱进行注册的这个公司?', 'answer': ['壹仟壹佰万元']}, + {'question': '公司的类型属于什么?', 'answer': ['有限责任公司']}, + {'question': '杨小峰的住所是在哪里?', + 'answer': ['成都市武侯区佳灵路20号九峰国际1栋16楼62号']}, + {'question': '这个公司的法定代表人叫什么?', 'answer': ['杨小峰']}, + {'question': '91510107749745776R代表的是什么?', 'answer': ['统一社会信用代码']}, + {'question': '公司在什么时候成立的?', + 'answer': ['2003年7月22日营业期限2003年7月22日']}]}] +``` + + + +## 4. Document Image Classification Deploy + +- Run + +```shell +python infer.py \ + --model_path_prefix ../../cls_export/inference \ + --lang "en" \ + --task_type cls \ + --batch_size 8 +``` + +- Output sample + +``` +[{'doc': './images/cls_sample.jpg', 'result': 'email'}] +``` + + + +## 5. Parameter Description + +- `model_path_prefix`: The file path of the Paddle model for inference, with the file prefix name。For example, the inference model file path is `./export/inference.pdiparams`, then pass `./export/inference`。 +- `batch_size`: number of input of each batch, default to 1. +- `max_seq_length`: If the OCR result exceeds the set maximum length, the OCR result will be sliced. The default is 512. +- `task_type`: choose the task type,the options are `ner`, `cls` and `mrc`。 +- `lang`: select the task language,the options are `en` and `ch`。 +- `device`: choose the device,the options are `cpu` and `gpu`。 diff --git a/model_zoo/ernie-layout/deploy/python/README_ch.md b/model_zoo/ernie-layout/deploy/python/README_ch.md new file mode 100644 index 0000000000000000000000000000000000000000..1eeb0debc1a60f810164eaf81997e3a6a9d7e8ed --- /dev/null +++ b/model_zoo/ernie-layout/deploy/python/README_ch.md @@ -0,0 +1,129 @@ +[English](README.md) | 简体中文 + +# ERNIE-Layout Python部署指南 + +本文介绍ERNIE-Layout Python部署指南,包括部署环境的准备,文档信息抽取、文档视觉问答和文档图像分类三大场景下的使用示例。 + +- [1. 开始运行](#1-开始运行) +- [2. 文档信息抽取模型推理](#2-文档信息抽取模型推理) +- [3. 文档视觉问答模型推理](#3-文档视觉问答模型推理) +- [4. 文档图像分类模型推理](#4-文档图像分类模型推理) +- [5. 更多配置](#5-更多配置) + +## 1. 开始运行 + +#### 环境要求 + +- 请执行以下命令进行安装项目依赖 + +``` +pip install -r requirements.txt +``` + +#### 数据准备 + +- 提供了少量图片数据,可用于后续章节的部署测试,下载后放在``./images``目录。 + +```shell +wget https://bj.bcebos.com/paddlenlp/datasets/document_intelligence/images.zip && unzip images.zip +``` + +## 2. 文档信息抽取模型推理 + +- 使用如下命令进行英文文档信息抽取部署 + +```shell +python infer.py \ + --model_path_prefix ../../ner_export/inference \ + --task_type ner \ + --lang "en" \ + --batch_size 8 +``` + +- 输出样例 + +``` +[{'doc': './images/ner_sample.jpg', + 'result': [{'text': 'ATT . GEN . ADMIN . OFFICE', + 'label': 'QUESTION', + 'start': 0, + 'end': 12, + 'probability': 0.8961102192651806}, + {'text': 'Fax :', + 'label': 'QUESTION', + 'start': 13, + 'end': 14, + 'probability': 0.8005126895801068}, + {'text': '614', + 'label': 'ANSWER', + 'start': 15, + 'end': 16, + 'probability': 0.5063673730110718}, + {'text': 'Dec 10', + 'label': 'ANSWER', + 'start': 23, + 'end': 24, + 'probability': 0.6265156606943465}, + + ...... + + {'text': 'NOTE', + 'label': 'QUESTION', + 'start': 179, + 'end': 179, + 'probability': 0.9810855421041412}]}] +``` + +## 3. 
文档视觉问答模型推理 + +- 使用如下命令进行中文文档视觉问答部署 + +```shell +python infer.py \ + --model_path_prefix ../../mrc_export/inference \ + --task_type mrc \ + --lang "ch" \ + --batch_size 8 +``` + +- 输出样例 + +``` +[{'doc': './images/mrc_sample.jpg', + 'result': [{'question': '杨小峰是什么身份?', 'answer': ['法定代表人']}, + {'question': '花了多少钱进行注册的这个公司?', 'answer': ['壹仟壹佰万元']}, + {'question': '公司的类型属于什么?', 'answer': ['有限责任公司']}, + {'question': '杨小峰的住所是在哪里?', + 'answer': ['成都市武侯区佳灵路20号九峰国际1栋16楼62号']}, + {'question': '这个公司的法定代表人叫什么?', 'answer': ['杨小峰']}, + {'question': '91510107749745776R代表的是什么?', 'answer': ['统一社会信用代码']}, + {'question': '公司在什么时候成立的?', + 'answer': ['2003年7月22日营业期限2003年7月22日']}]}] +``` + +## 4. 文档图像分类模型推理 + +- 使用如下命令进行英文文档图像分类部署 + +```shell +python infer.py \ + --model_path_prefix ../../cls_export/inference \ + --lang "en" \ + --task_type cls \ + --batch_size 8 +``` + +- 输出样例 + +``` +[{'doc': './images/cls_sample.jpg', 'result': 'email'}] +``` + +## 5. 更多配置 + +- `model_path_prefix`: 用于推理的Paddle模型文件路径,需加上文件前缀名称。例如模型文件路径为`./export/inference.pdiparams`,则传入`./export/inference`。 +- `batch_size`: 批处理大小,请结合机器情况进行调整,默认为16。 +- `max_seq_length`: 如果OCR的结果超过设定的最大长度则对OCR结果进行自动切分,默认为512。 +- `task_type`: 选择任务类型,可选有`ner`, `cls`和`mrc`。 +- `lang`: 选择任务的语言类型,可选有`en`, `ch`。 +- `device`: 选用什么设备进行训练,可选`cpu`或`gpu`。 diff --git a/model_zoo/ernie-layout/deploy/python/infer.py b/model_zoo/ernie-layout/deploy/python/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..686a6c50a79443ec2c7ad57db9d22b18acb437ea --- /dev/null +++ b/model_zoo/ernie-layout/deploy/python/infer.py @@ -0,0 +1,67 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +from predictor import Predictor + + +def parse_args(): + # yapf: disable + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument("--model_path_prefix", type=str, required=True, help="The path prefix of inference model to be used.") + parser.add_argument("--batch_size", default=4, type=int, help="Batch size per GPU for inference.") + parser.add_argument("--max_seq_length", default=512, type=int, help="The maximum input sequence length. 
Sequences longer than this will be split automatically.") + parser.add_argument("--task_type", default="ner", type=str, choices=["ner", "cls", "mrc"], help="Specify the task type.") + parser.add_argument("--lang", default="en", type=str, choices=["ch", "en"], help="Specify the task type.") + parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") + args = parser.parse_args() + # yapf: enable + return args + + +def main(): + args = parse_args() + if args.task_type == "mrc": + args.questions = [ + [ + "公司的类型属于什么?", + "杨小峰的住所是在哪里?", + "这个公司的法定代表人叫什么?", + "花了多少钱进行注册的这个公司?", + "公司在什么时候成立的?", + "杨小峰是什么身份?", + "91510107749745776R代表的是什么?", + ], + ] + docs = ["./images/mrc_sample.jpg"] + elif args.task_type == "cls": + docs = ["./images/cls_sample.jpg"] + elif args.task_type == "ner": + docs = ["./images/ner_sample.jpg"] + else: + raise ValueError("Unspport task type: {}".format(args.task_type)) + + predictor = Predictor(args) + + outputs = predictor.predict(docs) + import pprint + + pprint.sorted = lambda x, key=None: x + pprint.pprint(outputs) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-layout/deploy/python/predictor.py b/model_zoo/ernie-layout/deploy/python/predictor.py new file mode 100644 index 0000000000000000000000000000000000000000..5b11a4d901710d196105fb84b6f4e3b20c3553bd --- /dev/null +++ b/model_zoo/ernie-layout/deploy/python/predictor.py @@ -0,0 +1,867 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
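+
+# Deployment-side predictor for ERNIE-Layout. The pipeline implemented below is:
+# 1) run PaddleOCR on each input image, 2) convert the OCR output into an example
+# with ppocr2example, 3) build task-specific features (ner / cls / mrc),
+# 4) run the exported static-graph model through paddle.inference, and
+# 5) post-process the logits into entities, a document label, or answer spans.
+#
+# Typical usage (sketch, mirroring deploy/python/infer.py):
+#     predictor = Predictor(args)                      # args as parsed in infer.py
+#     outputs = predictor.predict(["./images/ner_sample.jpg"])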
+ +import base64 +import collections + +import cv2 +import numpy as np +import paddle +import scipy +import six +from paddleocr import PaddleOCR +from PIL import Image +from seqeval.metrics.sequence_labeling import get_entities + +from paddlenlp.transformers import AutoTokenizer +from paddlenlp.utils.image_utils import ppocr2example +from paddlenlp.utils.log import logger + + +class InferBackend(object): + def __init__(self, model_path_prefix, device="cpu"): + logger.info(">>> [InferBackend] Creating Engine ...") + config = paddle.inference.Config( + model_path_prefix + ".pdmodel", + model_path_prefix + ".pdiparams", + ) + if device == "gpu": + config.enable_use_gpu(100, 0) + config.switch_ir_optim(False) + else: + config.disable_gpu() + config.enable_mkldnn() + config.switch_use_feed_fetch_ops(False) + config.disable_glog_info() + config.enable_memory_optim() + self.predictor = paddle.inference.create_predictor(config) + self.input_names = [name for name in self.predictor.get_input_names()] + self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] + self.output_handles = [self.predictor.get_output_handle(name) for name in self.predictor.get_output_names()] + logger.info(">>> [InferBackend] Engine Created ...") + + def infer(self, input_dict: dict): + for idx, input_name in enumerate(self.input_names): + self.input_handles[idx].copy_from_cpu(input_dict[input_name]) + self.predictor.run() + outputs = [output_handle.copy_to_cpu() for output_handle in self.output_handles] + return outputs + + +class Predictor(object): + def __init__(self, args): + use_gpu = True if args.device == "gpu" else False + self.tokenizer = AutoTokenizer.from_pretrained("ernie-layoutx-base-uncased") + self.batch_size = args.batch_size + self.max_seq_length = args.max_seq_length + self.task_type = args.task_type + self.lang = args.lang + self.ocr = PaddleOCR(use_angle_cls=True, lang=self.lang, show_log=False, use_gpu=use_gpu) + + self.examples_cache = collections.defaultdict(list) + self.features_cache = collections.defaultdict(list) + self._PrelimPrediction = collections.namedtuple( + "PrelimPrediction", ["feature_index", "start_index", "end_index", "start_logit", "end_logit"] + ) + self.inference_backend = InferBackend(args.model_path_prefix, device=args.device) + if self.task_type == "ner": + self.label_dict = { + "O": 0, + "B-ANSWER": 1, + "I-ANSWER": 2, + "B-HEADER": 3, + "I-HEADER": 4, + "B-QUESTION": 5, + "I-QUESTION": 6, + } + self.preprocess = self.preprocess_ner + self.postprocess = self.postprocess_ner + elif self.task_type == "cls": + self.label_dict = { + "advertisement": 0, + "budget": 1, + "email": 2, + "file folder": 3, + "form": 4, + "handwritten": 5, + "invoice": 6, + "letter": 7, + "memo": 8, + "news article": 9, + "presentation": 10, + "questionnaire": 11, + "resume": 12, + "scientific publication": 13, + "scientific report": 14, + "specification": 15, + } + self.preprocess = self.preprocess_cls + self.postprocess = self.postprocess_cls + elif self.task_type == "mrc": + self.questions = args.questions + self.preprocess = self.preprocess_mrc + self.postprocess = self.postprocess_mrc + else: + raise ValueError("Unspport task type: {}".format(args.task_type)) + + def _check_is_max_context(self, doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span["start"] + doc_span["length"] - 1 + if 
position < doc_span["start"]: + continue + if position > end: + continue + num_left_context = position - doc_span["start"] + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span["length"] + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + return cur_span_index == best_span_index + + def _get_best_indexes(self, logits, n_best_size): + """Get the n-best logits from a list.""" + index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True) + + best_indexes = [] + for i in range(len(index_and_score)): + if i >= n_best_size: + break + best_indexes.append(index_and_score[i][0]) + return best_indexes + + def get_predictions(self, pred, label_list): + pred = scipy.special.softmax(pred, axis=-1) + pred_ids = np.argmax(pred, axis=1) + prediction_score = [pred[idx][i] for idx, i in enumerate(pred_ids)] + predictions = [label_list[i] for i in pred_ids] + return predictions, prediction_score + + def get_final_text(self, pred_text, orig_text, do_lower_case, tokenizer): + """Project the tokenized prediction back to the original text.""" + + def _strip_spaces(text): + ns_chars = [] + ns_to_s_map = collections.OrderedDict() + for (i, c) in enumerate(text): + if c == " ": + continue + ns_to_s_map[len(ns_chars)] = i + ns_chars.append(c) + ns_text = "".join(ns_chars) + return (ns_text, ns_to_s_map) + + tok_text = tokenizer.convert_tokens_to_string(tokenizer.tokenize(orig_text)) + + start_position = tok_text.find(pred_text) + if start_position == -1: + return orig_text + end_position = start_position + len(pred_text) - 1 + + (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text) + (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text) + + if len(orig_ns_text) != len(tok_ns_text): + return orig_text + + # We then project the characters in `pred_text` back to `orig_text` using + # the character-to-character alignment. 
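+        # Invert tok_ns_to_s_map so positions in tok_text can be mapped to its
+        # space-stripped form; orig_ns_to_s_map then maps them back into orig_text.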
+ tok_s_to_ns_map = {} + for (i, tok_index) in six.iteritems(tok_ns_to_s_map): + tok_s_to_ns_map[tok_index] = i + + orig_start_position = None + if start_position in tok_s_to_ns_map: + ns_start_position = tok_s_to_ns_map[start_position] + if ns_start_position in orig_ns_to_s_map: + orig_start_position = orig_ns_to_s_map[ns_start_position] + + if orig_start_position is None: + return orig_text + + orig_end_position = None + if end_position in tok_s_to_ns_map: + ns_end_position = tok_s_to_ns_map[end_position] + if ns_end_position in orig_ns_to_s_map: + orig_end_position = orig_ns_to_s_map[ns_end_position] + + if orig_end_position is None: + return orig_text + + output_text = orig_text[orig_start_position : (orig_end_position + 1)] + return output_text + + def preprocess_ner(self, examples, doc_stride=128, target_size=1000, max_size=1000): + ignore_label_id = -100 + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + all_doc_token_labels = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + orig_labels = ["O"] * len(example_text) + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + if self.lang == "ch": + sub_tokens = self.tokenizer.tokenize("&" + token)[1:] + else: + sub_tokens = self.tokenizer.tokenize(token) + label = orig_labels[i] + box = bboxes[i] + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + all_doc_token_labels.append(label) + + max_tokens_for_doc = self.max_seq_length - 2 + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + + tokens = [] + token_boxes = [] + token_label_ids = [] + token_to_orig_map = {} + token_is_max_context = {} + sentence_ids = [] + tokens.append(self.tokenizer.cls_token) + token_boxes.append(cls_token_box) + token_label_ids.append(ignore_label_id) + sentence_ids.append(0) + + for i in range(doc_span["length"]): + split_token_index = doc_span["start"] + i + token_to_orig_map[str(len(tokens))] = tok_to_orig_index[split_token_index] + + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[str(len(tokens))] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + token_label_ids.append(self.label_dict[all_doc_token_labels[split_token_index]]) + sentence_ids.append(0) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.sep_token) + token_boxes.append(sep_token_box) + token_label_ids.append(ignore_label_id) + sentence_ids.append(0) + input_mask = [1] * len(tokens) 
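+            # Pad this span out to max_seq_length: padded positions receive the pad
+            # token, attention mask 0, an all-zero bbox and the -100 ignore label, so
+            # they are excluded from the loss and dropped again in postprocess_ner.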
+ + while len(tokens) < self.max_seq_length: + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(0) + token_boxes.append(pad_token_box) + token_label_ids.append(ignore_label_id) + + input_ids = self.tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(input_ids))) + + tokenized_examples["id"].append(example_idx) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + tokenized_examples["image"].append(image) + tokenized_examples["labels"].append(token_label_ids) + tokenized_examples["token_is_max_context"].append(token_is_max_context) + tokenized_examples["token_to_orig_map"].append(token_to_orig_map) + for input_id in tokenized_examples["input_ids"]: + input_id = input_id + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(input_id) + ) + + for att_mask in tokenized_examples["attention_mask"]: + att_mask = att_mask + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(att_mask) + ) + + for bbox in tokenized_examples["bbox"]: + bbox = bbox + [[0, 0, 0, 0] for _ in range(self.max_seq_length - len(bbox))] + + for label in tokenized_examples["labels"]: + label = label + [1 * ignore_label_id] * (self.max_seq_length - len(label)) + + self.examples_cache["name"] = list(range(len(examples["text"]))) + self.examples_cache["text"] = [item for item in examples["text"]] + self.features_cache["id"] = [item for item in tokenized_examples["id"]] + self.features_cache["labels"] = [item for item in tokenized_examples["labels"]] + self.features_cache["tokens"] = [item for item in tokenized_examples["tokens"]] + self.features_cache["token_is_max_context"] = [item for item in tokenized_examples["token_is_max_context"]] + self.features_cache["token_to_orig_map"] = [item for item in tokenized_examples["token_to_orig_map"]] + return tokenized_examples + + def postprocess_ner(self, preds): + separator = "" if self.lang == "ch" else " " + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + predictions = [] + recover_preds = [] + + for eid, example_id in enumerate(self.examples_cache["name"]): + prediction_tags = [] + feature_map = example_id + features_ids = feature_id_to_features[feature_map] + gather_pred = [] + gather_label = [] + gather_tokens = [] + gather_score = [] + gather_map = [] + for idx in features_ids: + pred, label = preds[idx], self.features_cache["labels"][idx] + prediction, prediction_score = self.get_predictions(pred, list(self.label_dict.keys())) + + token_is_max_context = self.features_cache["token_is_max_context"][idx] + token_to_orig_map = self.features_cache["token_to_orig_map"][idx] + for token_idx in range(len(token_is_max_context)): + token_idx += 1 + if token_is_max_context[str(token_idx)]: + gather_tokens.append(self.features_cache["tokens"][idx][token_idx]) + gather_pred.append(prediction[token_idx]) + gather_score.append(prediction_score[token_idx]) + gather_label.append(label[token_idx]) + gather_map.append(token_to_orig_map[str(token_idx)]) + + recover_pred = [p for (p, l) 
in zip(gather_pred, gather_label) if l != -100] + + pred_entities = get_entities(recover_pred) + recover_preds.append(recover_pred) + + for item in pred_entities: + entity = self.tokenizer.convert_tokens_to_string(gather_tokens[item[1] : (item[2] + 1)]).strip() + orig_doc_start = gather_map[item[1]] + orig_doc_end = gather_map[item[2]] + orig_tokens = self.examples_cache["text"][eid][orig_doc_start : (orig_doc_end + 1)] + orig_text = separator.join(orig_tokens) + final_text = self.get_final_text(entity, orig_text, False, self.tokenizer) + final_text = final_text.replace(" ", " ") + + res = { + "text": final_text, + "label": item[0], + "start": item[1], + "end": item[2], + "probability": sum(gather_score[item[1] : item[2] + 1]) / (item[2] - item[1] + 1), + } + prediction_tags.append(res) + + predictions.append(prediction_tags) + return predictions + + def preprocess_cls(self, examples, doc_stride=128, target_size=1000, max_size=1000): + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + if self.lang == "ch": + sub_tokens = self.tokenizer.tokenize("&" + token)[1:] + else: + sub_tokens = self.tokenizer.tokenize(token) + box = bboxes[i] + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + + max_tokens_for_doc = self.max_seq_length - 2 + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for doc_span in doc_spans: + + tokens = [] + token_boxes = [] + sentence_ids = [] + tokens.append(self.tokenizer.cls_token) + token_boxes.append(cls_token_box) + sentence_ids.append(0) + + for i in range(doc_span["length"]): + split_token_index = doc_span["start"] + i + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + sentence_ids.append(0) + + tokens.append(self.tokenizer.sep_token) + token_boxes.append(sep_token_box) + sentence_ids.append(0) + input_mask = [1] * len(tokens) + + while len(tokens) < self.max_seq_length: + tokens.append(self.tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(0) + token_boxes.append(pad_token_box) + + input_ids = self.tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(input_ids))) + + tokenized_examples["id"].append(example_idx) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + 
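+                # The page image, decoded once for this document, is attached to every span.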
tokenized_examples["image"].append(image) + for input_id in tokenized_examples["input_ids"]: + input_id = input_id + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(input_id) + ) + + for att_mask in tokenized_examples["attention_mask"]: + att_mask = att_mask + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(att_mask) + ) + + for bbox in tokenized_examples["bbox"]: + bbox = bbox + [[0, 0, 0, 0] for _ in range(self.max_seq_length - len(bbox))] + + self.examples_cache["name"] = list(range(len(examples["text"]))) + self.features_cache["id"] = [item for item in tokenized_examples["id"]] + return tokenized_examples + + def postprocess_cls(self, preds): + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + predictions = [] + + for example_id in self.examples_cache["name"]: + features_ids = feature_id_to_features[example_id] + + max_rcd = [0, -1] + for idx in features_ids: + pred = preds[idx] + pred = scipy.special.softmax(pred, axis=-1) + pred_id = int(np.argmax(pred, axis=-1)) + if pred[pred_id] > max_rcd[0]: + max_rcd = [pred[pred_id], pred_id] + + predictions.append(list(self.label_dict.keys())[max_rcd[1]]) + return predictions + + def preprocess_mrc(self, examples, doc_stride=128, max_query_length=64, target_size=1000, max_size=1000): + qid = -1 + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + query_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + if self.lang == "ch": + sub_tokens = self.tokenizer.tokenize("&" + token)[1:] + else: + sub_tokens = self.tokenizer.tokenize(token) + box = bboxes[i] + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + + for question in self.questions[example_idx]: + qid += 1 + query_tokens = self.tokenizer.tokenize( + question, add_special_tokens=False, truncation=False, max_length=max_query_length + ) + + start_offset = 0 + doc_spans = [] + max_tokens_for_doc = self.max_seq_length - len(query_tokens) - 3 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + + tokens = [] + token_boxes = [] + token_to_orig_map = {} + token_is_max_context = {} + sentence_ids = [] + seg_a = 0 + seg_b = 1 + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.cls_token) + token_boxes.append(cls_token_box) + sentence_ids.append(seg_a) + + for i in range(doc_span["length"]): + 
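+                        # Copy this span's sub-tokens into the feature, remembering each
+                        # token's original word index and whether this span is its
+                        # maximum-context span (used later by postprocess_mrc).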
split_token_index = doc_span["start"] + i + token_to_orig_map[str(len(tokens))] = tok_to_orig_index[split_token_index] + + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[str(len(tokens))] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + sentence_ids.append(seg_a) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.sep_token) + token_boxes.append(sep_token_box) + sentence_ids.append(seg_a) + input_mask = [1] * len(tokens) + + while len(tokens) < self.max_seq_length - len(query_tokens) - 1: + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(seg_b) + token_boxes.append(pad_token_box) + + for token in query_tokens: + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(token) + input_mask.append(1) + sentence_ids.append(seg_b) + token_boxes.append(query_token_box) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(self.tokenizer.sep_token) + input_mask.append(1) + token_boxes.append(sep_token_box) + sentence_ids.append(seg_b) + + input_ids = self.tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(tokens) - len(query_tokens) - 1)) + list( + range(len(query_tokens) + 1) + ) + + answer_rcd = [] + start_position = -1 + end_position = -1 + + start_labels = [0] * len(input_ids) + end_labels = [0] * len(input_ids) + start_labels[start_position] = 1 + end_labels[end_position] = 1 + answer_rcd.append([start_position, end_position]) + + tokenized_examples["id"].append(example_idx) + tokenized_examples["question_id"].append(qid) + tokenized_examples["questions"].append(question) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + tokenized_examples["image"].append(image) + tokenized_examples["token_is_max_context"].append(token_is_max_context) + tokenized_examples["token_to_orig_map"].append(token_to_orig_map) + for input_id in tokenized_examples["input_ids"]: + input_id = input_id + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(input_id) + ) + + for att_mask in tokenized_examples["attention_mask"]: + att_mask = att_mask + [1 * self.tokenizer.tokens_to_ids[self.tokenizer.pad_token]] * ( + self.max_seq_length - len(att_mask) + ) + + for bbox in tokenized_examples["bbox"]: + bbox = bbox + [[0, 0, 0, 0] for _ in range(self.max_seq_length - len(bbox))] + self.examples_cache["name"] = list(range(len(examples["text"]))) + self.examples_cache["text"] = [item for item in examples["text"]] + self.features_cache["id"] = [item for item in tokenized_examples["id"]] + self.features_cache["question_id"] = [item for item in tokenized_examples["question_id"]] + self.features_cache["tokens"] = [item for item in tokenized_examples["tokens"]] + self.features_cache["questions"] = [item for item in tokenized_examples["questions"]] + self.features_cache["token_is_max_context"] = [item for item in tokenized_examples["token_is_max_context"]] 
+ self.features_cache["token_to_orig_map"] = [item for item in tokenized_examples["token_to_orig_map"]] + return tokenized_examples + + def postprocess_mrc(self, preds, max_answer_length=64, n_best_size=5): + separator = "" if self.lang == "ch" else " " + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + predictions = collections.defaultdict(lambda: collections.defaultdict(list)) + for ei, example_id in enumerate(self.examples_cache["name"]): + feature_map = example_id + features_ids = feature_id_to_features[feature_map] + prelim_predictions = [] + for idx in features_ids: + start_logits = preds[0][idx] + end_logits = preds[1][idx] + + start_indexes = self._get_best_indexes(start_logits, n_best_size) + end_indexes = self._get_best_indexes(end_logits, n_best_size) + token_is_max_context = self.features_cache["token_is_max_context"][idx] + + for start_index in start_indexes: + for end_index in end_indexes: + if not token_is_max_context.get(str(start_index), False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + self._PrelimPrediction( + feature_index=idx, + start_index=start_index, + end_index=end_index, + start_logit=start_logits[start_index], + end_logit=end_logits[end_index], + ) + ) + + prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True) + + for rcd in prelim_predictions: + + question_id = self.features_cache["question_id"][rcd.feature_index] + question = self.features_cache["questions"][rcd.feature_index] + if question_id in predictions[example_id]: + continue + + if rcd.start_index > 0: + tok_tokens = self.features_cache["tokens"][rcd.feature_index][ + rcd.start_index : (rcd.end_index + 1) + ] + orig_doc_start = self.features_cache["token_to_orig_map"][rcd.feature_index][str(rcd.start_index)] + orig_doc_end = self.features_cache["token_to_orig_map"][rcd.feature_index][str(rcd.end_index)] + orig_tokens = self.examples_cache["text"][ei][orig_doc_start : (orig_doc_end + 1)] + orig_text = separator.join(orig_tokens) + + tok_text = self.tokenizer.convert_tokens_to_string(tok_tokens).strip() + final_text = self.get_final_text(tok_text, orig_text, False, self.tokenizer) + else: + continue + if question_id in predictions[example_id]: + predictions[example_id][question_id]["answer"].append(final_text) + else: + predictions[example_id][question_id] = {"question": question, "answer": [final_text]} + formatted_predictions = [] + for v in predictions.values(): + formatted_predictions.append([{"question": qa["question"], "answer": qa["answer"]} for qa in v.values()]) + return formatted_predictions + + def infer(self, data): + return self.inference_backend.infer(data) + + def predict(self, docs): + input_data = [] + for doc in docs: + ocr_result = self.ocr.ocr(doc, cls=True) + # Compatible with paddleocr>=2.6.0.2 + ocr_result = ocr_result[0] if len(ocr_result) == 1 else ocr_result + example = ppocr2example(ocr_result, doc) + input_data.append(example) + + inputs = collections.defaultdict(list) + for data in input_data: + for k in data.keys(): + inputs[k].append(data[k]) + + preprocess_result = self.preprocess(inputs) + preds = [[], []] if self.task_type == "mrc" else [] + for idx in range(0, len(preprocess_result["id"]), self.batch_size): + l, r = idx, idx + self.batch_size + input_dict = {} + for input_name in 
self.inference_backend.input_names: + input_dict[input_name] = np.array(preprocess_result[input_name][l:r], dtype="int64") + output = self.infer(input_dict) + if self.task_type != "mrc": + preds.extend(output[0].tolist()) + else: + preds[0].extend(output[0].tolist()) + preds[1].extend(output[1].tolist()) + results = self.postprocess(preds) + formatted_results = [] + for doc, res in zip(docs, results): + formatted_result = {"doc": doc, "result": res} + formatted_results.append(formatted_result) + return formatted_results + + +def _decode_image(im_base64): + """Decode image""" + if im_base64 is not None: + image = base64.b64decode(im_base64.encode("utf-8")) + im = np.frombuffer(image, dtype="uint8") + im = cv2.imdecode(im, 1) + im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB) + return im + else: + return np.zeros([224, 224, 3], dtype=np.uint8) + + +def _resize_image( + im, + target_size=0, + interp=cv2.INTER_LINEAR, + resize_box=False, +): + """Resize the image numpy.""" + if not isinstance(im, np.ndarray): + raise TypeError("image type is not numpy.") + if len(im.shape) != 3: + raise ValueError("image is not 3-dimensional.") + im_shape = im.shape + im_size_min = np.min(im_shape[0:2]) + selected_size = target_size + if float(im_size_min) == 0: + raise ZeroDivisionError("min size of image is 0") + resize_w = selected_size + resize_h = selected_size + + im = im.astype("uint8") + im = Image.fromarray(im) + im = im.resize((int(resize_w), int(resize_h)), interp) + im = np.array(im) + return im + + +def _scale_same_as_image(boxes, width, height, target_size): + """ + Scale the bounding box of each character within maximum boundary. + """ + scale_x = target_size / width + scale_y = target_size / height + + new_boxes = [ + [ + int(max(0, min(box[0] * scale_x, target_size - 1))), + int(max(0, min(box[1] * scale_y, target_size - 1))), + int(max(0, min(box[2] * scale_x, target_size - 1))), + int(max(0, min(box[3] * scale_y, target_size - 1))), + ] + for box in boxes + ] + return new_boxes, (scale_x, scale_y) + + +def _permute(im, channel_first=True, to_bgr=False): + """Permute""" + if channel_first: + im = np.swapaxes(im, 1, 2) + im = np.swapaxes(im, 1, 0) + if to_bgr: + im = im[[2, 1, 0], :, :] + return im + + +def _str2im( + im_base64, + target_size=224, + mean=[103.530, 116.280, 123.675], + std=[57.375, 57.120, 58.395], +): + # step1: decode image + origin_im = _decode_image(im_base64) + # step2: resize image + im = _resize_image(origin_im, target_size=target_size, interp=1, resize_box=False) + return im, origin_im diff --git a/model_zoo/ernie-layout/deploy/python/requirements.txt b/model_zoo/ernie-layout/deploy/python/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..0adfb83a41a922a8329a3d598d5873081cf639da --- /dev/null +++ b/model_zoo/ernie-layout/deploy/python/requirements.txt @@ -0,0 +1 @@ +paddleocr diff --git a/model_zoo/ernie-layout/export_model.py b/model_zoo/ernie-layout/export_model.py new file mode 100644 index 0000000000000000000000000000000000000000..ea6e6e2a2cbd39fb5e80ab47ea0821365f5e95b3 --- /dev/null +++ b/model_zoo/ernie-layout/export_model.py @@ -0,0 +1,58 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import paddle +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoModelForQuestionAnswering, + AutoModelForTokenClassification, +) + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_path", type=str, required=True, default='./ernie-layoutx-base-uncased/models/funsd/1e-5_2/', help="The path to model parameters to be loaded.") +parser.add_argument("--task_type", type=str, required=True, default="ner", choices=["ner", "cls", "mrc"], help="Select the task type.") +parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + if args.task_type == "ner": + model = AutoModelForTokenClassification.from_pretrained(args.model_path) + elif args.task_type == "mrc": + model = AutoModelForQuestionAnswering.from_pretrained(args.model_path) + elif args.task_type == "cls": + model = AutoModelForSequenceClassification.from_pretrained(args.model_path) + else: + raise ValueError("Unsppoorted task type!") + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"), + paddle.static.InputSpec(shape=[None, None, None], dtype="int64", name="bbox"), + paddle.static.InputSpec(shape=[None, None, None, None], dtype="int64", name="image"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="attention_mask"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"), + paddle.static.InputSpec(shape=[None, None], dtype="int64", name="position_ids"), + ], + ) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/model_zoo/ernie-layout/finetune_args.py b/model_zoo/ernie-layout/finetune_args.py new file mode 100644 index 0000000000000000000000000000000000000000..11d0e8fa940f86f895718cdb0dead4d4066a508e --- /dev/null +++ b/model_zoo/ernie-layout/finetune_args.py @@ -0,0 +1,142 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Optional +from dataclasses import dataclass, field + + +@dataclass +class DataArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. 
+ """ + + task_name: Optional[str] = field(default="ner", metadata={"help": "The name of the task (ner, pos...)."}) + dataset_name: Optional[str] = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + dataset_config_name: Optional[str] = field( + default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."} + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} + ) + preprocessing_num_workers: Optional[int] = field( + default=None, + metadata={"help": "The number of processes to use for the preprocessing."}, + ) + max_seq_length: int = field( + default=512, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + doc_stride: int = field( + default=128, + metadata={"help": "When splitting up a long document into chunks, how much stride to take between chunks."}, + ) + target_size: int = field( + default=1024, + metadata={"help": "The maximum 2d pos size"}, + ) + pad_to_max_length: bool = field( + default=True, + metadata={ + "help": "Whether to pad all samples to model maximum sentence length. " + "If False, will pad the samples dynamically when batching to the maximum length in the batch. More " + "efficient on GPU but very bad for TPU." + }, + ) + max_train_samples: Optional[int] = field( + default=None, + metadata={ + "help": "For debugging purposes or quicker training, truncate the number of training examples to this " + "value if set." + }, + ) + max_val_samples: Optional[int] = field( + default=None, + metadata={ + "help": "For debugging purposes or quicker training, truncate the number of validation examples to this " + "value if set." + }, + ) + max_test_samples: Optional[int] = field( + default=None, + metadata={ + "help": "For debugging purposes or quicker training, truncate the number of test examples to this " + "value if set." + }, + ) + label_all_tokens: bool = field( + default=False, + metadata={ + "help": "Whether to put the label for one word on all tokens of generated by that word or just on the " + "one (in which case the other tokens will have a padding index)." + }, + ) + return_entity_level_metrics: bool = field( + default=False, + metadata={"help": "Whether to return all the entity levels during evaluation or just the overall ones."}, + ) + train_log_file: Optional[str] = field( + default=None, + metadata={"help": "train log file"}, + ) + train_nshard: Optional[int] = field( + default=1, + metadata={"help": "For big dataset, DocVQA/CORD when using ner3 pattern"}, + ) + use_segment_box: bool = field( + default=False, + metadata={"help": "Whether use segment box"}, + ) + task_type: str = field( + default="ner", + metadata={"help": "The task type"}, + ) + pattern: Optional[str] = field( + default="ner1", + metadata={"help": "The way to process input, choose from ner1, ner2, ner3"}, + ) + rst_converter: Optional[str] = field( + default=None, + metadata={"help": "The way to convert the predict result"}, + ) + lang: Optional[str] = field( + default="en", + metadata={"help": "Languge type of the dataset"}, + ) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. 
+ """ + + model_name_or_path: str = field( + metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} + ) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + cache_dir: Optional[str] = field( + default=None, + metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, + ) diff --git a/model_zoo/ernie-layout/layout_trainer.py b/model_zoo/ernie-layout/layout_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..ed2e8afc6b3dd0335e4fc83d64abed4847c89376 --- /dev/null +++ b/model_zoo/ernie-layout/layout_trainer.py @@ -0,0 +1,132 @@ +# encoding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +from typing import Dict + +from paddlenlp.trainer import Trainer + + +class LayoutTrainer(Trainer): + def __init__(self, *args, eval_examples=None, post_process_function=None, convert_fn=None, **kwargs): + super().__init__(*args, **kwargs) + self.eval_examples = eval_examples + self.post_process_function = post_process_function + self.convert_fn = convert_fn + + def save_predictions(self, split, preds, labels): + """ + Save metrics into a json file for that split, e.g. `train_results.json`. + Under distributed environment this is done only for a process with rank 0. + Args: + split (`str`): + Mode/split name: one of `train`, `eval`, `test`, `all` + To understand the metrics please read the docstring of [`~Trainer.log_metrics`]. The only difference is that raw + unformatted numbers are saved in the current method. 
+ """ + + path = os.path.join(self.args.output_dir, f"{split}_predictions.json") + with open(path, "w") as f: + json.dump(preds, f, ensure_ascii=False, indent=4, sort_keys=True) + + path = os.path.join(self.args.output_dir, f"{split}_golden_labels.json") + with open(path, "w") as f: + json.dump(labels, f, ensure_ascii=False, indent=4, sort_keys=True) + + def evaluate( + self, + eval_dataset=None, + eval_examples=None, + ignore_keys=None, + metric_key_prefix="eval", + ) -> Dict[str, float]: + + eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset + eval_examples = self.eval_examples if eval_examples is None else eval_examples + eval_dataloader = self.get_eval_dataloader(eval_dataset) + compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + eval_dataloader, + description="Evaluation", + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + metric_key_prefix=metric_key_prefix, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is not None and self.compute_metrics is not None: + pred_rst, gt_rst, eval_preds = self.post_process_function( + eval_examples, eval_dataset, output.predictions, output.label_ids + ) + self.save_predictions("eval", pred_rst, gt_rst) + metrics = self.compute_metrics(eval_preds) + if self.convert_fn is not None: + processed_metrics = self.convert_fn(pred_rst, self.args.output_dir) + if processed_metrics is not None: + metrics.update(processed_metrics) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + + self.log(metrics) + else: + metrics = {} + + self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, metrics) + return metrics + + def predict(self, predict_dataset, predict_examples, ignore_keys=None, metric_key_prefix: str = "test"): + + predict_dataloader = self.get_test_dataloader(predict_dataset) + + compute_metrics = self.compute_metrics + self.compute_metrics = None + eval_loop = self.evaluation_loop + try: + output = eval_loop( + predict_dataloader, + description="Prediction", + prediction_loss_only=True if compute_metrics is None else None, + ignore_keys=ignore_keys, + ) + finally: + self.compute_metrics = compute_metrics + + if self.post_process_function is not None and self.compute_metrics is not None: + pred_rst, gt_rst, eval_preds = self.post_process_function( + predict_examples, predict_dataset, output.predictions, output.label_ids + ) + self.save_predictions("test", pred_rst, gt_rst) + metrics = self.compute_metrics(eval_preds) + + if self.convert_fn is not None: + processed_metrics = self.convert_fn(pred_rst, self.args.output_dir) + if processed_metrics is not None: + metrics.update(processed_metrics) + + # Prefix all keys with metric_key_prefix + '_' + for key in list(metrics.keys()): + if not key.startswith(f"{metric_key_prefix}_"): + metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key) + else: + metrics = {} + return metrics diff --git a/model_zoo/ernie-layout/requirements.txt b/model_zoo/ernie-layout/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..f3e7debe27f8d7705700b9fafc5f99ad23ade721 --- /dev/null +++ b/model_zoo/ernie-layout/requirements.txt @@ -0,0 +1,2 @@ +editdistance>=0.6.0 +opencv-python>=4.6.0.66 diff --git a/model_zoo/ernie-layout/run_cls.py 
b/model_zoo/ernie-layout/run_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..9784cf05f6cf1698941626fd8709e28ab5a8a4bc --- /dev/null +++ b/model_zoo/ernie-layout/run_cls.py @@ -0,0 +1,209 @@ +# encoding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import os +from functools import partial + +import datasets +import paddle +from data_collator import DataCollator +from datasets import load_dataset +from finetune_args import DataArguments, ModelArguments +from layout_trainer import LayoutTrainer +from paddle.metric import Accuracy +from utils import PostProcessor, PreProcessor, get_label_ld + +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, get_last_checkpoint +from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + train_ds, dev_ds, test_ds = load_dataset(data_args.dataset_name, split=["train", "validation", "test"]) + + if training_args.do_train: + column_names = train_ds.column_names + elif training_args.do_eval: + column_names = dev_ds.column_names + elif training_args.do_predict: + column_names = test_ds.column_names + else: + logger.info("There is nothing to do. 
Please pass `do_train`, `do_eval` and/or `do_predict`.") + raise NotImplementedError + + label_list, label_to_id = get_label_ld(train_ds["qas"], scheme="cls") + num_labels = len(label_list) + + # Load Model and Tokenizer + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForSequenceClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_labels) + model.config["has_visual_segment_embedding"] = False + + preprocessor = PreProcessor() + postprocessor = PostProcessor() + training_args.label_names = ["labels"] + max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) + + preprocess_func = partial( + preprocessor.preprocess_cls, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=data_args.doc_stride, + label_dict=label_to_id, + max_size=data_args.target_size, + target_size=data_args.target_size, + use_segment_box=data_args.use_segment_box, + preprocessing_num_workers=data_args.preprocessing_num_workers, + ) + preprocess_func_for_valid = preprocess_func + + postprocess_func = partial(postprocessor.postprocess_cls, label_list=label_list, tokenizer=tokenizer) + + # Dataset pre-process + if training_args.do_train: + if data_args.train_nshard > 1: + logger.info(f"spliting train dataset into {data_args.train_nshard} shard") + train_shards = [] + for idx in range(data_args.train_nshard): + train_shards.append( + train_ds.shard(num_shards=data_args.train_nshard, index=idx).map( + preprocess_func, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + ) + train_dataset = datasets.concatenate_datasets(train_shards) + else: + train_dataset = train_ds.map( + preprocess_func, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + if training_args.do_eval: + eval_dataset = dev_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + if training_args.do_predict: + test_dataset = test_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + + # Data collator + data_collator = DataCollator( + tokenizer, padding="max_length", label_pad_token_id=-100, max_length=max_seq_length, return_tensors="pd" + ) + + def compute_metrics(eval_preds): + preds = paddle.to_tensor(eval_preds.predictions) + labels = paddle.to_tensor(eval_preds.label_ids) + + metric = Accuracy() + metric.reset() + correct = preds == labels + correct = paddle.cast(paddle.unsqueeze(correct, axis=-1), dtype="float32") + + metric.update(correct) + accu = metric.accumulate() + metric.reset() + return {"acc": accu} + + trainer = LayoutTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + eval_examples=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + post_process_function=postprocess_func, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if 
training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + + max_train_samples = ( + data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) + ) + metrics["train_samples"] = min(max_train_samples, len(train_dataset)) + + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + + max_val_samples = data_args.max_val_samples if data_args.max_val_samples is not None else len(eval_dataset) + metrics["eval_samples"] = min(max_val_samples, len(eval_dataset)) + + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", metrics) + + if training_args.do_predict: + postprocessor.examples_cache = collections.defaultdict(list) + postprocessor.features_cache = collections.defaultdict(list) + metrics = trainer.predict(test_dataset, test_ds) + trainer.log_metrics("test", metrics) + trainer.save_metrics("test", metrics) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-layout/run_mrc.py b/model_zoo/ernie-layout/run_mrc.py new file mode 100644 index 0000000000000000000000000000000000000000..d1f4f918c8a01fc83a77c82a492fa82fbdfcccf2 --- /dev/null +++ b/model_zoo/ernie-layout/run_mrc.py @@ -0,0 +1,242 @@ +# encoding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import os +from functools import partial + +import datasets +import paddle +from data_collator import DataCollator +from datasets import load_dataset +from finetune_args import DataArguments, ModelArguments +from layout_trainer import LayoutTrainer +from utils import PostProcessor, PreProcessor, anls_score + +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, get_last_checkpoint +from paddlenlp.transformers import AutoModelForQuestionAnswering, AutoTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. 
To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + train_ds, dev_ds, test_ds = load_dataset(data_args.dataset_name, split=["train", "validation", "test"]) + + if training_args.do_train: + column_names = train_ds.column_names + elif training_args.do_eval: + column_names = dev_ds.column_names + elif training_args.do_predict: + column_names = test_ds.column_names + else: + logger.info("There is nothing to do. Please pass `do_train`, `do_eval` and/or `do_predict`.") + raise NotImplementedError + + num_labels = 2 + + # Load Model and Tokenizer + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForQuestionAnswering.from_pretrained(model_args.model_name_or_path, num_classes=num_labels) + model.config["has_visual_segment_embedding"] = False + + preprocessor = PreProcessor() + postprocessor = PostProcessor() + training_args.label_names = ["start_positions", "end_positions"] + + max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) + + preprocess_func = partial( + preprocessor.preprocess_mrc, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=data_args.doc_stride, + max_size=data_args.target_size, + target_size=data_args.target_size, + use_segment_box=data_args.use_segment_box, + preprocessing_num_workers=data_args.preprocessing_num_workers, + is_training=True, + lang=data_args.lang, + ) + + preprocess_func_for_valid = partial( + preprocessor.preprocess_mrc, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=data_args.doc_stride, + max_size=data_args.target_size, + target_size=data_args.target_size, + use_segment_box=data_args.use_segment_box, + preprocessing_num_workers=data_args.preprocessing_num_workers, + is_training=False, + lang=data_args.lang, + ) + + postprocess_func = partial(postprocessor.postprocess_mrc, tokenizer=tokenizer, lang=data_args.lang) + + # Dataset pre-process + if training_args.do_train: + if data_args.train_nshard > 1: + logger.info(f"spliting train dataset into {data_args.train_nshard} shard") + train_shards = [] + for idx in range(data_args.train_nshard): + train_shards.append( + train_ds.shard(num_shards=data_args.train_nshard, index=idx).map( + preprocess_func, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + ) + train_dataset = datasets.concatenate_datasets(train_shards) + else: + train_dataset = train_ds.map( + preprocess_func, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + + if training_args.do_eval: + eval_dataset = dev_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + if training_args.do_predict: + test_dataset = test_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + + # Data collator + data_collator = DataCollator( + tokenizer, padding="max_length", label_pad_token_id=-100, max_length=max_seq_length, return_tensors="pd" + ) + + def compute_metrics(eval_preds): + def _convert(examples): + """Convert to evaluation data format""" + formatted_examples = [] + for example in examples: + formatted_example = {} + 
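+ # Regroup this example's annotations from a list of dicts into parallel qid/question/value lists for ANLS scoring.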
formatted_example["id"] = example["id"] + formatted_example["annotations"] = { + "qid": [], + "question": [], + "value": [], + } + for i in range(len(example["annotations"])): + formatted_example["annotations"]["qid"].append(example["annotations"][i]["qid"]) + formatted_example["annotations"]["question"].append(example["annotations"][i]["question"]) + formatted_example["annotations"]["value"].append(example["annotations"][i]["value"]) + formatted_examples.append(formatted_example) + return formatted_examples + + pred_dict = collections.defaultdict(lambda: collections.defaultdict(list)) + ref_dict = collections.defaultdict(lambda: collections.defaultdict(list)) + + preds = _convert(eval_preds.predictions) + labels = _convert(eval_preds.label_ids) + + for pred in preds: + for key, values in zip(pred["annotations"]["qid"], pred["annotations"]["value"]): + pred_dict[pred["id"]][key].extend(values) + for label in labels: + for key, values in zip(label["annotations"]["qid"], label["annotations"]["value"]): + ref_dict[label["id"]][key].extend(values) + score = anls_score(ref_dict, pred_dict) + return score + + trainer = LayoutTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + eval_examples=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + post_process_function=postprocess_func, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + + max_train_samples = ( + data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) + ) + metrics["train_samples"] = min(max_train_samples, len(train_dataset)) + + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + + max_val_samples = data_args.max_val_samples if data_args.max_val_samples is not None else len(eval_dataset) + metrics["eval_samples"] = min(max_val_samples, len(eval_dataset)) + + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", metrics) + + if training_args.do_predict: + postprocessor.examples_cache = collections.defaultdict(list) + postprocessor.features_cache = collections.defaultdict(list) + metrics = trainer.predict(test_dataset, test_ds) + trainer.log_metrics("test", metrics) + trainer.save_metrics("test", metrics) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-layout/run_ner.py b/model_zoo/ernie-layout/run_ner.py new file mode 100644 index 0000000000000000000000000000000000000000..fab6f73fd6e882917272b93977333a588cd3a149 --- /dev/null +++ b/model_zoo/ernie-layout/run_ner.py @@ -0,0 +1,214 @@ +# encoding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import os +from functools import partial + +import paddle +from data_collator import DataCollator +from datasets import load_dataset +from finetune_args import DataArguments, ModelArguments +from layout_trainer import LayoutTrainer +from seqeval.metrics import classification_report +from utils import PostProcessor, PreProcessor, get_label_ld + +from paddlenlp.trainer import PdArgumentParser, TrainingArguments, get_last_checkpoint +from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer +from paddlenlp.utils.log import logger + + +def main(): + parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + training_args.print_config(model_args, "Model") + training_args.print_config(data_args, "Data") + + paddle.set_device(training_args.device) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + train_ds, dev_ds, test_ds = load_dataset(data_args.dataset_name, split=["train", "validation", "test"]) + + if training_args.do_train: + column_names = train_ds.column_names + elif training_args.do_eval: + column_names = dev_ds.column_names + elif training_args.do_predict: + column_names = test_ds.column_names + else: + logger.info("There is nothing to do. 
Please pass `do_train`, `do_eval` and/or `do_predict`.") + raise NotImplementedError + + label_list, label_to_id = get_label_ld(train_ds["qas"], scheme=data_args.pattern.split("-")[1]) + num_labels = len(label_list) + + # Load Model and Tokenizer + if model_args.model_name_or_path == "vi-layoutxlm-base-uncased": + tokenizer = AutoTokenizer.from_pretrained("layoutxlm-base-uncased") + else: + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + model = AutoModelForTokenClassification.from_pretrained(model_args.model_name_or_path, num_classes=num_labels) + model.config["has_visual_segment_embedding"] = False + + preprocessor = PreProcessor() + postprocessor = PostProcessor() + training_args.label_names = ["labels"] + max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) + + preprocess_func = partial( + preprocessor.preprocess_ner, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=data_args.doc_stride, + label_dict=label_to_id, + max_size=data_args.target_size, + target_size=data_args.target_size, + use_segment_box=data_args.use_segment_box, + preprocessing_num_workers=data_args.preprocessing_num_workers, + scheme=data_args.pattern.split("-")[1], + lang=data_args.lang, + ) + preprocess_func_for_valid = preprocess_func + + postprocess_func = partial( + postprocessor.postprocess_ner, label_list=label_list, tokenizer=tokenizer, lang=data_args.lang + ) + + # Dataset pre-process + if training_args.do_train: + train_dataset = train_ds.map( + preprocess_func, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + if training_args.do_eval: + eval_dataset = dev_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + if training_args.do_predict: + test_dataset = test_ds.map( + preprocess_func_for_valid, + batched=True, + remove_columns=column_names, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) + + # Data collator + data_collator = DataCollator( + tokenizer, padding="max_length", label_pad_token_id=-100, max_length=max_seq_length, return_tensors="pd" + ) + + def compute_metrics(eval_preds): + preds = eval_preds.predictions + labels = eval_preds.label_ids + + report = classification_report(y_true=labels, y_pred=preds, output_dict=True) + + report.pop("macro avg") + report.pop("weighted avg") + overall_score = report.pop("micro avg") + scores = { + type_name: { + "precision": score["precision"], + "recall": score["recall"], + "f1": score["f1-score"], + "number": score["support"], + } + for type_name, score in report.items() + } + scores["overall_precision"] = overall_score["precision"] + scores["overall_recall"] = overall_score["recall"] + scores["overall_f1"] = overall_score["f1-score"] + results = { + "precision": scores["overall_precision"], + "recall": scores["overall_recall"], + "f1": scores["overall_f1"], + } + return results + + trainer = LayoutTrainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + eval_examples=dev_ds, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + post_process_function=postprocess_func, + ) + + checkpoint = None + if training_args.resume_from_checkpoint is not 
None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # Training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + + max_train_samples = ( + data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) + ) + metrics["train_samples"] = min(max_train_samples, len(train_dataset)) + + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluate and tests model + if training_args.do_eval: + eval_metrics = trainer.evaluate() + + max_val_samples = data_args.max_val_samples if data_args.max_val_samples is not None else len(eval_dataset) + metrics["eval_samples"] = min(max_val_samples, len(eval_dataset)) + + trainer.log_metrics("eval", eval_metrics) + trainer.save_metrics("eval", metrics) + + if training_args.do_predict: + postprocessor.examples_cache = collections.defaultdict(list) + postprocessor.features_cache = collections.defaultdict(list) + metrics = trainer.predict(test_dataset, test_ds) + trainer.log_metrics("test", metrics) + trainer.save_metrics("test", metrics) + + +if __name__ == "__main__": + main() diff --git a/model_zoo/ernie-layout/utils.py b/model_zoo/ernie-layout/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..2a7b32bf4941c3d45ce0a5e1532b687a4156d5e4 --- /dev/null +++ b/model_zoo/ernie-layout/utils.py @@ -0,0 +1,1091 @@ +# encoding=utf-8 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
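+# Shared utilities for the ERNIE-Layout fine-tuning scripts: image decoding and resizing, NER/MRC/CLS feature preprocessing, prediction postprocessing, and the ANLS metric.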
+ +import base64 +import collections +import hashlib +import random + +import cv2 +import datasets +import editdistance +import numpy as np +import scipy +import six +from PIL import Image +from seqeval.metrics.sequence_labeling import get_entities + +from paddlenlp.trainer import EvalPrediction + + +def _get_md5(string): + """Get md5 value for string""" + hl = hashlib.md5() + hl.update(string.encode(encoding="utf-8")) + return hl.hexdigest() + + +def _decode_image(im_base64): + """Decode image""" + if im_base64 is not None: + image = base64.b64decode(im_base64.encode("utf-8")) + im = np.frombuffer(image, dtype="uint8") + im = cv2.imdecode(im, 1) + im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB) + return im + else: + return np.zeros([224, 224, 3], dtype=np.uint8) + + +def _resize_image( + im, + target_size=0, + interp=cv2.INTER_LINEAR, + resize_box=False, +): + """Resize the image numpy.""" + if not isinstance(im, np.ndarray): + raise TypeError("image type is not numpy.") + if len(im.shape) != 3: + raise ValueError("image is not 3-dimensional.") + im_shape = im.shape + im_size_min = np.min(im_shape[0:2]) + if isinstance(target_size, list): + # Case for multi-scale training + selected_size = random.choice(target_size) + else: + selected_size = target_size + if float(im_size_min) == 0: + raise ZeroDivisionError("min size of image is 0") + resize_w = selected_size + resize_h = selected_size + + im = im.astype("uint8") + im = Image.fromarray(im) + im = im.resize((int(resize_w), int(resize_h)), interp) + im = np.array(im) + return im + + +def _scale_same_as_image(boxes, width, height, target_size): + """ + Scale the bounding box of each character within maximum boundary. + """ + scale_x = target_size / width + scale_y = target_size / height + + new_boxes = [ + [ + int(max(0, min(box[0] * scale_x, target_size - 1))), + int(max(0, min(box[1] * scale_y, target_size - 1))), + int(max(0, min(box[2] * scale_x, target_size - 1))), + int(max(0, min(box[3] * scale_y, target_size - 1))), + ] + for box in boxes + ] + return new_boxes, (scale_x, scale_y) + + +def _permute(im, channel_first=True, to_bgr=False): + """Permute""" + if channel_first: + im = np.swapaxes(im, 1, 2) + im = np.swapaxes(im, 1, 0) + if to_bgr: + im = im[[2, 1, 0], :, :] + return im + + +def _str2im( + im_base64, + target_size=224, +): + # Step1: decode image + origin_im = _decode_image(im_base64) + # Step2: resize image + im = _resize_image(origin_im, target_size=target_size, interp=1, resize_box=False) + return im, origin_im + + +def get_label_ld(qas, scheme="bio"): + if scheme == "cls": + unique_labels = set() + for qa in qas: + label_text = qa["answers"][0]["text"][0] + unique_labels.add(label_text) + + label_list = list(unique_labels) + label_list.sort() + else: + unique_keys = set() + for qa in qas: + for key in qa["question"]: + unique_keys.add(key) + key_list = list(unique_keys) + key_list.sort() + + label_list = ["O"] + for key in key_list: + if scheme == "bio": + label_list.append("B-" + key) + label_list.append("I-" + key) + elif scheme == "bioes": + label_list.append("B-" + key) + label_list.append("I-" + key) + label_list.append("E-" + key) + label_list.append("S-" + key) + else: + raise NotImplementedError + + label_dict = {l: i for i, l in enumerate(label_list)} + return label_list, label_dict + + +def anls_score(labels, predictions): + def get_anls(prediction, ground_truth): + prediction = prediction.strip().lower() + ground_truth = ground_truth.strip().lower() + iou = 1 - editdistance.eval(prediction, ground_truth) / 
max(len(prediction), len(ground_truth), 1e-5) + anls = iou if iou >= 0.5 else 0.0 + return anls + + def metric_max_over_ground_truths(metric_fn, prediction, ground_truths): + if len(ground_truths) == 0: + return 0 + scores_for_ground_truths = [] + for ground_truth in ground_truths: + score = metric_fn(prediction, ground_truth) + scores_for_ground_truths.append(score) + return max(scores_for_ground_truths) + + anls, total = 0, 0 + assert labels.keys() == predictions.keys() + for _id in labels.keys(): + assert labels[_id].keys() == predictions[_id].keys() + for question in labels[_id]: + if len(predictions[_id][question]) > 0: + prediction_text = predictions[_id][question][0] + else: + prediction_text = "" + ground_truths = labels[_id][question] + total += 1 + anls += metric_max_over_ground_truths(get_anls, prediction_text, ground_truths) + + anls = 100.0 * anls / total + return {"anls": anls} + + +class PreProcessor: + def __init__(self): + pass + + def _check_is_max_context(self, doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span["start"] + doc_span["length"] - 1 + if position < doc_span["start"]: + continue + if position > end: + continue + num_left_context = position - doc_span["start"] + num_right_context = end - position + score = min(num_left_context, num_right_context) + 0.01 * doc_span["length"] + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + return cur_span_index == best_span_index + + def preprocess_ner( + self, + examples, + tokenizer=None, + label_dict=None, + max_seq_length=512, + doc_stride=128, + target_size=1000, + max_size=1000, + other_label="O", + ignore_label_id=-100, + use_segment_box=False, + preprocessing_num_workers=1, + scheme="bio", + lang="en", + ): + """ + Adapt to NER task. 
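+ Converts character-level answer spans into token-level BIO/BIOES labels, splits long pages into overlapping windows of at most `max_seq_length` tokens using `doc_stride`, and attaches the resized page image and scaled token bounding boxes to every window.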
+ """ + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + all_doc_token_labels = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + if use_segment_box: + bboxes = examples["segment_bbox"][example_idx] + else: + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + qas = examples["qas"][example_idx] + orig_labels = [other_label] * len(example_text) + for question, answers in zip(qas["question"], qas["answers"]): + for answer_start, answer_end in zip( + answers["answer_start"], + answers["answer_end"], + ): + if scheme == "bio": + orig_labels[answer_start] = "B-" + question + orig_labels[answer_start + 1 : answer_end] = ["I-" + question] * ( + answer_end - answer_start - 1 + ) + elif scheme == "bioes": + orig_labels[answer_start] = "B-" + question + if answer_end - answer_start - 1 > 1: + orig_labels[answer_end - 1] = "E-" + question + orig_labels[answer_start + 1 : answer_end - 1] = ["I-" + question] * ( + answer_end - answer_start - 2 + ) + else: + orig_labels[answer_start] = "S-" + question + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + if lang == "ch": + sub_tokens = tokenizer.tokenize("&" + token)[1:] + else: + sub_tokens = tokenizer.tokenize(token) + label = orig_labels[i] + box = bboxes[i] + for j, sub_token in enumerate(sub_tokens): + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + if "B-" in label[:2]: + if j == 0: + all_doc_token_labels.append(label) + else: + all_doc_token_labels.append("I-" + label[2:]) + elif "E-" in label[:2]: + if len(sub_tokens) - 1 == j: + all_doc_token_labels.append("E-" + label[2:]) + else: + all_doc_token_labels.append("I-" + label[2:]) + elif "S-" in label[:2]: + if len(sub_tokens) == 1: + all_doc_token_labels.append(label) + else: + if j == 0: + all_doc_token_labels.append("B-" + label[2:]) + elif len(sub_tokens) - 1 == j: + all_doc_token_labels.append("E-" + label[2:]) + else: + all_doc_token_labels.append("I-" + label[2:]) + else: + all_doc_token_labels.append(label) + + max_tokens_for_doc = max_seq_length - 2 + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + + tokens = [] + token_boxes = [] + token_label_ids = [] + token_to_orig_map = {} + token_is_max_context = {} + sentence_ids = [] + tokens.append(tokenizer.cls_token) + token_boxes.append(cls_token_box) + token_label_ids.append(ignore_label_id) + sentence_ids.append(0) + + for i in range(doc_span["length"]): + split_token_index = doc_span["start"] + i + token_to_orig_map[str(len(tokens))] = tok_to_orig_index[split_token_index] + + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[str(len(tokens))] = 
is_max_context + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + token_label_ids.append(label_dict[all_doc_token_labels[split_token_index]]) + sentence_ids.append(0) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.sep_token) + token_boxes.append(sep_token_box) + token_label_ids.append(ignore_label_id) + sentence_ids.append(0) + input_mask = [1] * len(tokens) + + while len(tokens) < max_seq_length: + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(0) + token_boxes.append(pad_token_box) + token_label_ids.append(ignore_label_id) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(input_ids))) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(token_boxes) == max_seq_length + assert len(sentence_ids) == max_seq_length + assert len(token_label_ids) == max_seq_length + + feature_id = examples["name"][example_idx] + "__" + str(examples["page_no"][example_idx]) + tokenized_examples["id"].append(feature_id) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + tokenized_examples["image"].append(image) + # tokenized_examples["orig_image"].append(origin_image) + tokenized_examples["labels"].append(token_label_ids) + tokenized_examples["token_is_max_context"].append(token_is_max_context) + tokenized_examples["token_to_orig_map"].append(token_to_orig_map) + return tokenized_examples + + def _improve_answer_span(self, doc_tokens, input_start, input_end, tokenizer, orig_answer_text): + """Returns tokenized answer spans that better match the annotated answer.""" + + tok_answer_text = tokenizer.convert_tokens_to_string(tokenizer.tokenize(orig_answer_text)) + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = tokenizer.convert_tokens_to_string(doc_tokens[new_start : (new_end + 1)]) + if text_span == tok_answer_text: + return (new_start, new_end) + + return (input_start, input_end) + + def preprocess_mrc( + self, + examples, + tokenizer=None, + max_seq_length=512, + doc_stride=128, + max_query_length=64, + target_size=1000, + max_size=1000, + use_segment_box=False, + preprocessing_num_workers=1, + is_training=False, + lang="en", + ): + """ + Adapt to MRC task. 
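+ Builds extractive question-answering features: each window holds the document tokens first and appends the question tokens after the padding; during training, character-level answer spans are remapped to token-level start/end positions inside the window (both set to 0 when the answer falls outside it).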
+ """ + + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + query_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + if use_segment_box: + bboxes = examples["segment_bbox"][example_idx] + else: + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + if lang == "ch": + sub_tokens = tokenizer.tokenize("&" + token)[1:] + else: + sub_tokens = tokenizer.tokenize(token) + box = bboxes[i] + for j, sub_token in enumerate(sub_tokens): + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + + qas = examples["qas"][example_idx] + for qid, question, answers in zip(qas["question_id"], qas["question"], qas["answers"]): + + query_tokens = tokenizer.tokenize( + question, add_special_tokens=False, truncation=False, max_length=max_query_length + ) + + start_offset = 0 + doc_spans = [] + max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + + tokens = [] + token_boxes = [] + token_to_orig_map = {} + token_is_max_context = {} + sentence_ids = [] + seg_a = 0 + seg_b = 1 + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.cls_token) + token_boxes.append(cls_token_box) + sentence_ids.append(seg_a) + + for i in range(doc_span["length"]): + split_token_index = doc_span["start"] + i + token_to_orig_map[str(len(tokens))] = tok_to_orig_index[split_token_index] + + is_max_context = self._check_is_max_context(doc_spans, doc_span_index, split_token_index) + token_is_max_context[str(len(tokens))] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + sentence_ids.append(seg_a) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.sep_token) + token_boxes.append(sep_token_box) + sentence_ids.append(seg_a) + input_mask = [1] * len(tokens) + + while len(tokens) < max_seq_length - len(query_tokens) - 1: + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(seg_b) + token_boxes.append(pad_token_box) + + for idx, token in enumerate(query_tokens): + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(token) + input_mask.append(1) + sentence_ids.append(seg_b) + token_boxes.append(query_token_box) + + token_is_max_context[str(len(tokens))] = False + token_to_orig_map[str(len(tokens))] = -1 + tokens.append(tokenizer.sep_token) + input_mask.append(1) + 
token_boxes.append(sep_token_box) + sentence_ids.append(seg_b) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(tokens) - len(query_tokens) - 1)) + list( + range(len(query_tokens) + 1) + ) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(token_boxes) == max_seq_length + assert len(sentence_ids) == max_seq_length + + answer_rcd = [] + for answer_text, answer_start, answer_end in zip( + answers["text"], + answers["answer_start"], + answers["answer_end"], + ): + + if is_training and answer_start == -1 and answer_end == -1: + continue + + start_position = -1 + end_position = -1 + + if is_training: + + if [answer_start, answer_end] in answer_rcd: + continue + answer_rcd.append([answer_start, answer_end]) + + tok_start_position = orig_to_tok_index[answer_start] + if answer_end < len(example_text) - 1: + tok_end_position = orig_to_tok_index[answer_end] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = self._improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, tokenizer, answer_text + ) + # If the answer is outside the span, set start_position == end_position == 0 + + # For training, if our document chunk does not contain an annotation + # we throw it out, since there is nothing to predict. + doc_start = doc_span["start"] + doc_end = doc_span["start"] + doc_span["length"] - 1 + if not (tok_start_position >= doc_start and tok_end_position <= doc_end): + start_position = 0 + end_position = 0 + else: + doc_offset = 1 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + + start_labels = [0] * len(input_ids) + end_labels = [0] * len(input_ids) + start_labels[start_position] = 1 + end_labels[end_position] = 1 + answer_rcd.append([start_position, end_position]) + + feature_id = examples["name"][example_idx] + "__" + str(examples["page_no"][example_idx]) + tokenized_examples["id"].append(feature_id) + tokenized_examples["question_id"].append(qid) + tokenized_examples["questions"].append(question) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + tokenized_examples["image"].append(image) + tokenized_examples["start_positions"].append(start_position) + tokenized_examples["end_positions"].append(end_position) + tokenized_examples["start_labels"].append(start_labels) + tokenized_examples["end_labels"].append(end_labels) + tokenized_examples["token_is_max_context"].append(token_is_max_context) + tokenized_examples["token_to_orig_map"].append(token_to_orig_map) + + if not is_training: + break + return tokenized_examples + + def preprocess_cls( + self, + examples, + tokenizer=None, + label_dict=None, + max_seq_length=512, + doc_stride=128, + target_size=1000, + max_size=1000, + other_label="O", + ignore_label_id=-100, + use_segment_box=False, + preprocessing_num_workers=1, + ): + """ + Adapt to CLS task. 
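+ Builds document-classification features: every window cut from a page shares the page-level label, taken from the first answer text and mapped through `label_dict`.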
+ """ + + tokenized_examples = collections.defaultdict(list) + for example_idx, example_text in enumerate(examples["text"]): + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + all_doc_token_boxes = [] + cls_token_box = [0, 0, 0, 0] + sep_token_box = [0, 0, 0, 0] + pad_token_box = [0, 0, 0, 0] + + im_base64 = examples["image"][example_idx] + image, _ = _str2im(im_base64) + image = _permute(image, to_bgr=False) + + if use_segment_box: + bboxes = examples["segment_bbox"][example_idx] + else: + bboxes = examples["bbox"][example_idx] + bboxes, _s = _scale_same_as_image( + bboxes, + examples["width"][example_idx], + examples["height"][example_idx], + target_size, + ) + + qas = examples["qas"][example_idx] + label = label_dict[qas["answers"][0]["text"][0]] + + for (i, token) in enumerate(example_text): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = tokenizer.tokenize(token) + box = bboxes[i] + for j, sub_token in enumerate(sub_tokens): + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + all_doc_token_boxes.append(box) + + max_tokens_for_doc = max_seq_length - 2 + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append({"start": start_offset, "length": length}) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride, max_tokens_for_doc) + + for doc_span in doc_spans: + + tokens = [] + token_boxes = [] + sentence_ids = [] + tokens.append(tokenizer.cls_token) + token_boxes.append(cls_token_box) + sentence_ids.append(0) + + for i in range(doc_span["length"]): + split_token_index = doc_span["start"] + i + tokens.append(all_doc_tokens[split_token_index]) + token_boxes.append(all_doc_token_boxes[split_token_index]) + sentence_ids.append(0) + + tokens.append(tokenizer.sep_token) + token_boxes.append(sep_token_box) + sentence_ids.append(0) + input_mask = [1] * len(tokens) + + while len(tokens) < max_seq_length: + tokens.append(tokenizer.pad_token) + input_mask.append(0) + sentence_ids.append(0) + token_boxes.append(pad_token_box) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + position_ids = list(range(len(input_ids))) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(token_boxes) == max_seq_length + assert len(sentence_ids) == max_seq_length + + feature_id = examples["name"][example_idx] + "__" + str(examples["page_no"][example_idx]) + tokenized_examples["id"].append(feature_id) + tokenized_examples["tokens"].append(tokens) + tokenized_examples["input_ids"].append(input_ids) + tokenized_examples["attention_mask"].append(input_mask) + tokenized_examples["token_type_ids"].append(sentence_ids) + tokenized_examples["bbox"].append(token_boxes) + tokenized_examples["position_ids"].append(position_ids) + tokenized_examples["image"].append(image) + # tokenized_examples["orig_image"].append(origin_image) + tokenized_examples["labels"].append(label) + return tokenized_examples + + +class PostProcessor: + def __init__(self): + """init post processor""" + + self.examples_cache = collections.defaultdict(list) + self.features_cache = collections.defaultdict(list) + self._PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", ["feature_index", "start_index", "end_index", "start_logit", "end_logit"] + ) + + def get_predictions(self, pred, label_list, with_crf=False): + if not 
with_crf: + pred = scipy.special.softmax(pred, axis=-1) + pred_ids = np.argmax(pred, axis=1) + else: + pred_ids = pred + prediction_score = [pred[idx][i] for idx, i in enumerate(pred_ids)] + predictions = [label_list[i] for i in pred_ids] + return predictions, prediction_score + + def postprocess_ner( + self, + examples: datasets.Dataset, + features: datasets.Dataset, + preds, + labels, + label_list, + tokenizer=None, + with_crf=False, + lang="en", + ): + if "name" not in self.examples_cache: + self.examples_cache["name"] = [item for item in examples["name"]] + if "page_no" not in self.examples_cache: + self.examples_cache["page_no"] = [item for item in examples["page_no"]] + if "text" not in self.examples_cache: + self.examples_cache["text"] = [item for item in examples["text"]] + if "id" not in self.features_cache: + self.features_cache["id"] = [item for item in features["id"]] + if "tokens" not in self.features_cache: + self.features_cache["tokens"] = [item for item in features["tokens"]] + if "token_is_max_context" not in self.features_cache: + self.features_cache["token_is_max_context"] = [item for item in features["token_is_max_context"]] + if "token_to_orig_map" not in self.features_cache: + self.features_cache["token_to_orig_map"] = [item for item in features["token_to_orig_map"]] + separator = "" if lang == "ch" else " " + + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + references = collections.defaultdict(list) + predictions = collections.defaultdict(list) + recover_preds = [] + recover_labels = [] + + for eid, example_id in enumerate(self.examples_cache["name"]): + feature_map = example_id + "__" + str(self.examples_cache["page_no"][eid]) + features_ids = feature_id_to_features[feature_map] + gather_pred = [] + gather_label = [] + gather_tokens = [] + gather_score = [] + gather_map = [] + for idx in features_ids: + pred, label = preds[idx], labels[idx] + prediction, prediction_score = self.get_predictions(pred, label_list, with_crf=with_crf) + + token_is_max_context = self.features_cache["token_is_max_context"][idx] + token_to_orig_map = self.features_cache["token_to_orig_map"][idx] + for token_idx in range(len(token_is_max_context)): + token_idx += 1 + if token_is_max_context[str(token_idx)]: + gather_tokens.append(self.features_cache["tokens"][idx][token_idx]) + gather_pred.append(prediction[token_idx]) + gather_score.append(prediction_score[token_idx]) + gather_label.append(label[token_idx]) + gather_map.append(token_to_orig_map[str(token_idx)]) + + recover_pred = [p for (p, l) in zip(gather_pred, gather_label) if l != -100] + recover_label = [label_list[l] for l in gather_label if l != -100] + + pred_entities = get_entities(recover_pred) + gt_entities = get_entities(recover_label) + recover_preds.append(recover_pred) + recover_labels.append(recover_label) + + for item in pred_entities: + entity = tokenizer.convert_tokens_to_string(gather_tokens[item[1] : (item[2] + 1)]).strip() + orig_doc_start = gather_map[item[1]] + orig_doc_end = gather_map[item[2]] + orig_tokens = self.examples_cache["text"][eid][orig_doc_start : (orig_doc_end + 1)] + orig_text = separator.join(orig_tokens) + final_text = self.get_final_text(entity, orig_text, False, tokenizer) + predictions[example_id].append( + [ + item[0], + final_text, + sum(gather_score[item[1] : item[2] + 1]) / (item[2] - item[1] + 1), + [item[1], item[2]], + ", ".join(recover_pred[item[1] : item[2] + 1]), + 
] + ) + + for item in gt_entities: + entity = tokenizer.convert_tokens_to_string(gather_tokens[item[1] : (item[2] + 1)]).strip() + orig_doc_start = gather_map[item[1]] + orig_doc_end = gather_map[item[2]] + orig_tokens = self.examples_cache["text"][eid][orig_doc_start : (orig_doc_end + 1)] + orig_text = separator.join(orig_tokens) + final_text = self.get_final_text(entity, orig_text, False, tokenizer) + references[example_id].append( + [item[0], final_text, 1, [item[1], item[2]], ", ".join(recover_label[item[1] : item[2] + 1])] + ) + if example_id not in predictions: + predictions[example_id].append(["", "", -1, [], ""]) + + return predictions, references, EvalPrediction(predictions=recover_preds, label_ids=recover_labels) + + def _get_best_indexes(self, logits, n_best_size): + """Get the n-best logits from a list.""" + index_and_score = sorted(enumerate(logits), key=lambda x: x[1], reverse=True) + + best_indexes = [] + for i in range(len(index_and_score)): + if i >= n_best_size: + break + best_indexes.append(index_and_score[i][0]) + return best_indexes + + def get_final_text(self, pred_text, orig_text, do_lower_case, tokenizer): + """Project the tokenized prediction back to the original text.""" + + def _strip_spaces(text): + ns_chars = [] + ns_to_s_map = collections.OrderedDict() + for (i, c) in enumerate(text): + if c == " ": + continue + ns_to_s_map[len(ns_chars)] = i + ns_chars.append(c) + ns_text = "".join(ns_chars) + return (ns_text, ns_to_s_map) + + tok_text = tokenizer.convert_tokens_to_string(tokenizer.tokenize(orig_text)) + + start_position = tok_text.find(pred_text) + if start_position == -1: + return orig_text + end_position = start_position + len(pred_text) - 1 + + (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text) + (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text) + + if len(orig_ns_text) != len(tok_ns_text): + return orig_text + + # We then project the characters in `pred_text` back to `orig_text` using + # the character-to-character alignment. 
+ tok_s_to_ns_map = {} + for (i, tok_index) in six.iteritems(tok_ns_to_s_map): + tok_s_to_ns_map[tok_index] = i + + orig_start_position = None + if start_position in tok_s_to_ns_map: + ns_start_position = tok_s_to_ns_map[start_position] + if ns_start_position in orig_ns_to_s_map: + orig_start_position = orig_ns_to_s_map[ns_start_position] + + if orig_start_position is None: + return orig_text + + orig_end_position = None + if end_position in tok_s_to_ns_map: + ns_end_position = tok_s_to_ns_map[end_position] + if ns_end_position in orig_ns_to_s_map: + orig_end_position = orig_ns_to_s_map[ns_end_position] + + if orig_end_position is None: + return orig_text + + output_text = orig_text[orig_start_position : (orig_end_position + 1)] + return output_text + + def postprocess_mrc( + self, + examples: datasets.Dataset, + features: datasets.Dataset, + preds, + labels, + tokenizer, + max_answer_length=64, + n_best_size=5, + lang="en", + ): + if "name" not in self.examples_cache: + self.examples_cache["name"] = [item for item in examples["name"]] + if "page_no" not in self.examples_cache: + self.examples_cache["page_no"] = [item for item in examples["page_no"]] + if "text" not in self.examples_cache: + self.examples_cache["text"] = [item for item in examples["text"]] + if "qas" not in self.examples_cache: + self.examples_cache["qas"] = [item for item in examples["qas"]] + + if "id" not in self.features_cache: + self.features_cache["id"] = [item for item in features["id"]] + if "tokens" not in self.features_cache: + self.features_cache["tokens"] = [item for item in features["tokens"]] + if "question_id" not in self.features_cache: + self.features_cache["question_id"] = [item for item in features["question_id"]] + if "questions" not in self.features_cache: + self.features_cache["questions"] = [item for item in features["questions"]] + if "token_is_max_context" not in self.features_cache: + self.features_cache["token_is_max_context"] = [item for item in features["token_is_max_context"]] + if "token_to_orig_map" not in self.features_cache: + self.features_cache["token_to_orig_map"] = [item for item in features["token_to_orig_map"]] + + separator = "" if lang == "ch" else " " + + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + predictions, references = collections.defaultdict( + lambda: collections.defaultdict(list) + ), collections.defaultdict(lambda: collections.defaultdict(list)) + for ei, example_id in enumerate(self.examples_cache["name"]): + feature_map = example_id + "__" + str(self.examples_cache["page_no"][ei]) + features_ids = feature_id_to_features[feature_map] + prelim_predictions = [] + for i, idx in enumerate(features_ids): + + start_logits = preds[0][idx] + end_logits = preds[1][idx] + + start_indexes = self._get_best_indexes(start_logits, n_best_size) + end_indexes = self._get_best_indexes(end_logits, n_best_size) + token_is_max_context = self.features_cache["token_is_max_context"][idx] + + for start_index in start_indexes: + for end_index in end_indexes: + if not token_is_max_context.get(str(start_index), False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + self._PrelimPrediction( + feature_index=idx, + start_index=start_index, + end_index=end_index, + start_logit=start_logits[start_index], + end_logit=end_logits[end_index], + ) + ) + + 
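+ # Rank candidate spans by the sum of their start and end logits; only the best-scoring span per question is kept below.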
prelim_predictions = sorted(prelim_predictions, key=lambda x: (x.start_logit + x.end_logit), reverse=True) + + for rcd in prelim_predictions: + + question_id = self.features_cache["question_id"][rcd.feature_index] + question = self.features_cache["questions"][rcd.feature_index] + if question_id in predictions[example_id]: + continue + + if rcd.start_index > 0: + tok_tokens = self.features_cache["tokens"][rcd.feature_index][ + rcd.start_index : (rcd.end_index + 1) + ] + orig_doc_start = self.features_cache["token_to_orig_map"][rcd.feature_index][str(rcd.start_index)] + orig_doc_end = self.features_cache["token_to_orig_map"][rcd.feature_index][str(rcd.end_index)] + orig_tokens = self.examples_cache["text"][ei][orig_doc_start : (orig_doc_end + 1)] + orig_text = separator.join(orig_tokens) + + tok_text = tokenizer.convert_tokens_to_string(tok_tokens).strip() + final_text = self.get_final_text(tok_text, orig_text, False, tokenizer) + else: + continue + if question_id in predictions[example_id]: + predictions[example_id][question_id]["answers"].append(final_text) + else: + predictions[example_id][question_id] = {"question": question, "answers": [final_text]} + + for example_index, example in enumerate(examples): + eid = self.examples_cache["name"][example_index] + qas = self.examples_cache["qas"][example_index] + for question_id, question, answers in zip(qas["question_id"], qas["question"], qas["answers"]): + references[eid][question_id] = { + "question": question, + "answers": [answer_text for answer_text in answers["text"]], + } + if eid not in predictions or question_id not in predictions[eid]: + predictions[eid][question_id] = {"question": question, "answers": [""]} + + formatted_predictions = [ + { + "id": k, + "annotations": [ + {"qid": str(qid), "question": qa["question"], "value": qa["answers"]} for qid, qa in v.items() + ], + } + for k, v in predictions.items() + ] + formated_references = [ + { + "id": k, + "annotations": [ + {"qid": str(qid), "question": qa["question"], "value": qa["answers"]} for qid, qa in v.items() + ], + } + for k, v in references.items() + ] + return ( + predictions, + references, + EvalPrediction(predictions=formatted_predictions, label_ids=formated_references), + ) + + def postprocess_cls( + self, + examples: datasets.Dataset, + features: datasets.Dataset, + preds, + labels, + label_list, + tokenizer=None, + ): + if "name" not in self.examples_cache: + self.examples_cache["name"] = [item for item in examples["name"]] + if "page_no" not in self.examples_cache: + self.examples_cache["page_no"] = [item for item in examples["page_no"]] + if "id" not in self.features_cache: + self.features_cache["id"] = [item for item in features["id"]] + + feature_id_to_features = collections.defaultdict(list) + for idx, feature_id in enumerate(self.features_cache["id"]): + feature_id_to_features[feature_id].append(idx) + + references = {} + predictions = {} + recover_preds = [] + recover_labels = [] + + for eid, example_id in enumerate(self.examples_cache["name"]): + feature_map = example_id + "__" + str(self.examples_cache["page_no"][eid]) + features_ids = feature_id_to_features[feature_map] + + max_rcd = [0, -1] + for i, idx in enumerate(features_ids): + pred, label = preds[idx], labels[idx] + pred = scipy.special.softmax(pred, axis=-1) + pred_id = int(np.argmax(pred, axis=-1)) + if pred[pred_id] > max_rcd[0]: + max_rcd = [pred[pred_id], pred_id] + + recover_preds.append(max_rcd[1]) + recover_labels.append(label) + predictions[example_id] = label_list[max_rcd[1]] + 
references[example_id] = label_list[label] + return predictions, references, EvalPrediction(predictions=recover_preds, label_ids=recover_labels) diff --git a/model_zoo/ernie-m/README.md b/model_zoo/ernie-m/README.md new file mode 100644 index 0000000000000000000000000000000000000000..634f88c21deb6703bae1be6b4e3eb3f0750f2572 --- /dev/null +++ b/model_zoo/ernie-m/README.md @@ -0,0 +1,237 @@ +# ERNIE-M + +* [模型介绍](#模型介绍) +* [开始运行](#开始运行) + * [环境要求](#环境要求) + * [数据准备](#数据准备) + * [模型训练](#模型训练) + * [参数释义](#参数释义) + * [单卡训练](#单卡训练) + * [单机多卡](#单机多卡) + * [预测评估](#预测评估) + * [部署](#部署) + * [FastDeploy 部署](#FastDeploy部署) + * [Python 部署](#Python部署) + * [服务化部署](#服务化部署) +* [参考论文](#参考论文) + +## 模型介绍 + +[ERNIE-M](https://arxiv.org/abs/2012.15674) 是百度提出的一种多语言语言模型。原文提出了一种新的训练方法,让模型能够将多种语言的表示与单语语料库对齐,以克服平行语料库大小对模型性能的限制。原文的主要想法是将回译机制整合到预训练的流程中,在单语语料库上生成伪平行句对,以便学习不同语言之间的语义对齐,从而增强跨语言模型的语义建模。实验结果表明,ERNIE-M 优于现有的跨语言模型,并在各种跨语言下游任务中提供了最新的 SOTA 结果。 +原文提出两种方法建模各种语言间的对齐关系: + +- **Cross-Attention Masked Language Modeling(CAMLM)**: 该算法在少量双语语料上捕捉语言间的对齐信息。其需要在不利用源句子上下文的情况下,通过目标句子还原被掩盖的词语,使模型初步建模了语言间的对齐关系。 +- **Back-Translation masked language modeling(BTMLM)**: 该方法基于回译机制从单语语料中学习语言间的对齐关系。通过CAMLM 生成伪平行语料,然后让模型学习生成的伪平行句子,使模型可以利用单语语料更好地建模语义对齐关系。 + + +![framework](https://user-images.githubusercontent.com/40912707/201308423-bf4f0100-3ada-4bae-89d5-b07ffec1e2c0.png) + +本项目是 ERNIE-M 的 PaddlePaddle 动态图实现,包含模型训练,模型验证等内容。以下是本例的简要目录结构及说明: + +```text +. +|-- README.md # 文档 +|-- deploy # 部署目录 +| |-- predictor # onnx离线部署 +| | |-- README.md +| | |-- ernie_m_predictor.py +| | |-- inference.py +| | |-- requirements_cpu.txt +| | `-- requirements_gpu.txt +| `-- simple_serving # 基于PaddleNLP SimpleServing 服务化部署 +| |-- README.md +| |-- client_seq_cls.py +| `-- server_seq_cls.py +`-- run_classifier.py # 分类任务微调脚本 +``` + +## 开始运行 + +下面提供以XNLI数据集进行模型微调相关训练、预测、部署的代码,XNLI数据集是MNLI的子集,并且已被翻译成14种不同的语言(包含一些较低资源语言)。与MNLI一样,目标是预测文本蕴含(句子 A 是否暗示/矛盾/都不是句子 B )。 + +### 环境要求 + +python >= 3.7 +paddlepaddle >= 2.3 +paddlenlp >= 2.4.9 +paddle2onnx >= 1.0.5 + +### 数据准备 + +此次微调数据使用XNLI数据集, 可以通过下面的方式来使用数据集 + +```python +from datasets import load_dataset + +# all_languages = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"] +# load xnli dataset of english +train_ds, eval_ds, test_ds = load_dataset("xnli", "en", split=["train_ds", "validation", "test"]) +``` + +### 模型训练 + +#### 参数释义 + +- `task_type` 表示了自然语言推断任务的类型,目前支持的类型为:"cross-lingual-transfer", "translate-train-all" + ,分别表示在英文数据集上训练并在所有15种语言数据集上测试、在所有15种语言数据集上训练和测试。 +- `model_name_or_path` 指示了 Fine-tuning 使用的具体预训练模型以及预训练时使用的tokenizer,目前支持的预训练模型有:"ernie-m-base", "ernie-m-large" + 。若模型相关内容保存在本地,这里也可以提供相应目录地址,例如:"./finetuned_models"。 +- `do_train` 是否进行训练任务。 +- `do_eval` 是否进行评估任务。 +- `do_predict` 是否进行评测任务。 +- `do_export` 是否导出模型。 +- `output_dir` 表示模型保存路径。 +- `export_model_dir` 模型的导出路径。 +- `per_device_train_batch_size` 表示训练时每次迭代**每张**卡上的样本数目。 +- `per_device_eval_batch_size` 表示验证时每次迭代**每张**卡上的样本数目。 +- `max_seq_length` 表示最大句子长度,超过该长度将被截断,不足该长度的将会进行 padding。 +- `learning_rate` 表示基础学习率大小,将于 learning rate scheduler 产生的值相乘作为当前学习率。 +- `classifier_dropout` 表示模型用于分类的 dropout rate ,默认是0.1。 +- `num_train_epochs` 表示训练轮数。 +- `logging_steps` 表示日志打印间隔步数。 +- `save_steps` 表示模型保存及评估间隔步数。 +- `layerwise_decay` 表示 AdamW with Layerwise decay 的逐层衰减系数。 +- `warmup_rate` 表示学习率warmup系数。 +- `max_steps` 表示最大训练步数。若训练`num_train_epochs`轮包含的训练步数大于该值,则达到`max_steps`后就提前结束。 +- `seed` 表示随机数种子。 +- `device` 表示训练使用的设备, 'gpu'表示使用 GPU, 'xpu'表示使用百度昆仑卡, 'cpu'表示使用 CPU。 +- `fp16` 表示是否启用自动混合精度训练。 +- `scale_loss` 表示自动混合精度训练的参数。 +- 
`load_best_model_at_end` 训练结束后是否加载最优模型,通常与`metric_for_best_model`配合使用。 +- `metric_for_best_model` 最优模型指标,如`eval_accuarcy`等,用于比较模型好坏。 + +#### 单卡训练 + +`run_classifier.py`是模型微调脚本,可以使用如下命令对预训练模型进行微调训练。 + +```shell +python run_classifier.py \ + --do_train \ + --do_eval \ + --do_export \ + --task_type cross-lingual-transfer \ + --model_name_or_path ernie-m-base \ + --output_dir ./finetuned_models/ \ + --export_model_dir ./finetuned_models/ \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 16 \ + --max_seq_length 256 \ + --learning_rate 5e-5 \ + --classifier_dropout 0.1 \ + --weight_decay 0.0 \ + --layerwise_decay 0.8 \ + --save_steps 12272 \ + --eval_steps 767 \ + --num_train_epochs 5 \ + --warmup_ratio 0.1 \ + --load_best_model_at_end True \ + --metric_for_best_model eval_accuracy \ + --overwrite_output_dir +``` + +#### 单机多卡 + +同样,可以执行如下命令实现多卡训练 + +```shell +python -m paddle.distributed.launch --gpus 0,1 run_classifier.py \ + --do_train \ + --do_eval \ + --do_export \ + --task_type cross-lingual-transfer \ + --model_name_or_path ernie-m-base \ + --output_dir ./finetuned_models/ \ + --export_model_dir ./finetuned_models/ \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 16 \ + --max_seq_length 256 \ + --learning_rate 5e-5 \ + --classifier_dropout 0.1 \ + --weight_decay 0.0 \ + --layerwise_decay 0.8 \ + --save_steps 12272 \ + --eval_steps 767 \ + --num_train_epochs 5 \ + --warmup_ratio 0.1 \ + --load_best_model_at_end True \ + --metric_for_best_model eval_accuracy \ + --overwrite_output_dir \ + --remove_unused_columns False +``` + +这里设置额外的参数`--remove_unused_columns`为`False`是因为数据集中不需要的字段已经被手动去除了。 + +#### 预测评估 + +当训练完成后,可以直接加载训练保存的模型进行评估,此时`--model_name_or_path`传入训练时的`output_dir`即`./finetuned_models`。 + +```shell +python run_classifier.py \ + --do_predict \ + --task_type cross-lingual-transfer \ + --model_name_or_path ./finetuned_models \ + --output_dir ./finetuned_models +``` + +预测结果(label)和预测的置信度(confidence)将写入`./finetuned_models/test_results.json`文件。 + + +在XNLI数据集上微调 cross-lingual-transfer 类型的自然语言推断任务后,在测试集上有如下结果 +| Model | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur | Avg | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +| Cross-lingual Transfer | | | | | | | | | | | | | | | | | +| XLM | 85.0 | 78.7 | 78.9 | 77.8 | 76.6 | 77.4 | 75.3 | 72.5 | 73.1 | 76.1 | 73.2 | 76.5 | 69.6 | 68.4 | 67.3 | 75.1 | +| Unicoder | 85.1 | 79.0 | 79.4 | 77.8 | 77.2 | 77.2 | 76.3 | 72.8 | 73.5 | 76.4 | 73.6 | 76.2 | 69.4 | 69.7 | 66.7 | 75.4 | +| XLM-R | 85.8 | 79.7 | 80.7 | 78.7 | 77.5 | 79.6 | 78.1 | 74.2 | 73.8 | 76.5 | 74.6 | 76.7 | 72.4 | 66.5 | 68.3 | 76.2 | +| INFOXLM | **86.4** | **80.6** | 80.8 | 78.9 | 77.8 | 78.9 | 77.6 | 75.6 | 74.0 | 77.0 | 73.7 | 76.7 | 72.0 | 66.4 | 67.1 | 76.2 | +| **ERNIE-M** | 85.5 | 80.1 | **81.2** | **79.2** | **79.1** | **80.4** | **78.1** | **76.8** | **76.3** | **78.3** | **75.8** | **77.4** | **72.9** | **69.5** | **68.8** | **77.3** | +| XLM-R Large | 89.1 | 84.1 | 85.1 | 83.9 | 82.9 | 84.0 | 81.2 | 79.6 | 79.8 | 80.8 | 78.1 | 80.2 | 76.9 | 73.9 | 73.8 | 80.9 | +| INFOXLM Large | **89.7** | 84.5 | 85.5 | 84.1 | 83.4 | 84.2 | 81.3 | 80.9 | 80.4 | 80.8 | 78.9 | 80.9 | 77.9 | 74.8 | 73.7 | 81.4 | +| VECO Large | 88.2 | 79.2 | 83.1 | 82.9 | 81.2 | 84.2 | 82.8 | 76.2 | 80.3 | 74.3 | 77.0 | 78.4 | 71.3 | **80.4** | **79.1** | 79.9 | +| **ERNIR-M Large** | 89.3 | **85.1** | **85.7** | **84.4** | **83.7** | **84.5** | 82.0 | **81.2** | **81.2** | **81.9** | 
**79.2** | **81.0** | **78.6** | 76.2 | 75.4 | **82.0** | +| Translate-Train-All | | | | | | | | | | | | | | | | | +| XLM | 85.0 | 80.8 | 81.3 | 80.3 | 79.1 | 80.9 | 78.3 | 75.6 | 77.6 | 78.5 | 76.0 | 79.5 | 72.9 | 72.8 | 68.5 | 77.8 | +| Unicoder | 85.6 | 81.1 | 82.3 | 80.9 | 79.5 | 81.4 | 79.7 | 76.8 | 78.2 | 77.9 | 77.1 | 80.5 | 73.4 | 73.8 | 69.6 | 78.5 | +| XLM-R | 85.4 | 81.4 | 82.2 | 80.3 | 80.4 | 81.3 | 79.7 | 78.6 | 77.3 | 79.7 | 77.9 | 80.2 | 76.1 | 73.1 | 73.0 | 79.1 | +| INFOXLM | 86.1 | 82.0 | 82.8 | 81.8 | 80.9 | 82.0 | 80.2 | 79.0 | 78.8 | 80.5 | 78.3 | 80.5 | 77.4 | 73.0 | 71.6 | 79.7 | +| **ERNIE-M** | **86.2** | **82.5** | **83.8** | **82.6** | **82.4** | **83.4** | **80.2** | **80.6** | **80.5** | **81.1** | **79.2** | **80.5** | **77.7** | **75.0** | **73.3** | **80.6** | +| XLM-R Large | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | **83.7** | **81.6** | 78.0 | 78.1 | 83.6 | +| VECO Large | 88.9 | 82.4 | 86.0 | 84.7 | 85.3 | 86.2 | **85.8** | 80.1 | 83.0 | 77.2 | 80.9 | 82.8 | 75.3 | **83.1** | **83.0** | 83.0 | +| **ERNIE-M Large** | **89.5** | **86.5** | **86.9** | **86.1** | **86.0** | **86.8** | 84.1 | **83.8** | **84.1** | **84.5** | **82.1** | 83.5 | 81.1 | 79.4 | 77.9 | **84.2** | + + + +## 部署 + +我们基于 FastDeploy 为 ERNIE-M 提供了多种部署方案,可以满足不同场景下的部署需求,请根据实际情况进行选择。 + + + +### FastDeploy 部署 + +⚡️[FastDeploy](https://github.com/PaddlePaddle/FastDeploy)是一款全场景、易用灵活、极致高效的AI推理部署工具,为开发者提供多硬件、多推理引擎后端的部署能力。开发者只需调用一行代码即可随意切换硬件、推理引擎后端。 + +
+ + + +
+ +目前 ERNIE-M 模型已提供基于 FastDeploy 的部署示例,支持在多款硬件(CPU、GPU、昆仑芯、华为昇腾以及 Graphcore IPU)以及推理引擎后端进行部署。 + + + +#### Python 部署 + +Python 部署请参考:[Python 部署指南](./deploy/python/README.md) + + + +### 服务化部署 + +* [PaddleNLP SimpleServing 服务化部署指南](./deploy/simple_serving/README.md) + + +## 参考论文 + + [Ouyang X , Wang S , Pang C , et al. ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora[J]. 2020.](https://arxiv.org/abs/2012.15674) diff --git a/model_zoo/ernie-m/deploy/python/README.md b/model_zoo/ernie-m/deploy/python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..dfc55dcdcd39d081c62113b496468e3db306ba3a --- /dev/null +++ b/model_zoo/ernie-m/deploy/python/README.md @@ -0,0 +1,144 @@ +# FastDeploy ERNIE-M 模型 Python 部署示例 + +在部署前,参考 [FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)安装 FastDeploy Python SDK。 + +本目录下分别提供 `seq_cls_infer.py` 快速完成在 CPU/GPU 的文本分类任务的 Python 部署示例。 + +## 依赖安装 + +直接执行以下命令安装部署示例的依赖。 + +```bash + +# 安装 fast_tokenizer 以及 GPU 版本 fastdeploy +pip install fast-tokenizer-python fastdeploy-gpu-python -f https://www.paddlepaddle.org.cn/whl/fastdeploy.html + +``` + +## 快速开始 + +以下示例展示如何基于 FastDeploy 库完成 ERNIE-M 模型在 XNLI 数据集上进行自然语言推断任务的 Python 预测部署,可通过命令行参数`--device`以及`--backend`指定运行在不同的硬件以及推理引擎后端,并使用`--model_dir`参数指定运行的模型,具体参数设置可查看下面[参数说明](#参数说明)。示例中的模型是按照 [ERNIE-M 训练文档](../../README.md)导出得到的部署模型,其模型目录为`model_zoo/ernie-m/finetuned_models/export`(用户可按实际情况设置)。 + + +```bash + +# CPU 推理 +python seq_cls_infer.py --model_dir ../../finetuned_models/export/ --device cpu --backend paddle + +# GPU 推理 +python seq_cls_infer.py --model_dir ../../finetuned_models/export/model --device gpu --backend paddle + +``` + +运行完成后返回的结果如下: + +```bash + +[2023-02-24 08:54:42,574] [ INFO] - We are using to load 'export/'. +[INFO] fastdeploy/runtime/runtime.cc(309)::CreatePaddleBackend Runtime initialized with Backend::PDINFER in Device::GPU. +Batch id:0, example id:0, sentence1:"他们告诉我,呃,我最后会被叫到一个人那里去见面。", sentence2:"我从来没有被告知任何与任何人会面。", label:contradiction, similarity:0.9975 +Batch id:1, example id:0, sentence1:"他们告诉我,呃,我最后会被叫到一个人那里去见面。", sentence2:"我被告知将有一个人被叫进来与我见面。", label:entailment, similarity:0.9866 +Batch id:2, example id:0, sentence1:"他们告诉我,呃,我最后会被叫到一个人那里去见面。", sentence2:"那个人来得有点晚。", label:neutral, similarity:0.9921 + +``` + +## 参数说明 + +| 参数 |参数说明 | +|----------|--------------| +|--model_dir | 指定部署模型的目录, | +|--batch_size |输入的batch size,默认为 1| +|--max_length |最大序列长度,默认为 128| +|--device | 运行的设备,可选范围: ['cpu', 'gpu'],默认为'cpu' | +|--device_id | 运行设备的id。默认为0。 | +|--cpu_threads | 当使用cpu推理时,指定推理的cpu线程数,默认为1。| +|--backend | 支持的推理后端,可选范围: ['onnx_runtime', 'paddle', 'openvino', 'tensorrt', 'paddle_tensorrt'],默认为'paddle' | +|--use_fp16 | 是否使用FP16模式进行推理。使用tensorrt和paddle_tensorrt后端时可开启,默认为False | +|--use_fast| 是否使用FastTokenizer加速分词阶段。默认为True| + +## FastDeploy 高阶用法 + +FastDeploy 在 Python 端上,提供 `fastdeploy.RuntimeOption.use_xxx()` 以及 `fastdeploy.RuntimeOption.use_xxx_backend()` 接口支持开发者选择不同的硬件、不同的推理引擎进行部署。在不同的硬件上部署 ERNIE-M 模型,需要选择硬件所支持的推理引擎进行部署,下表展示如何在不同的硬件上选择可用的推理引擎部署 ERNIE-M 模型。 + +符号说明: (1) ✅: 已经支持; (2) ❔: 正在进行中; (3) N/A: 暂不支持; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| 硬件 | 硬件对应的接口 | 可用的推理引擎 | 推理引擎对应的接口 | 是否支持 Paddle 新格式量化模型 | 是否支持 FP16 模式 |
+|---|---|---|---|---|---|
+| CPU | use_cpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| CPU | use_cpu() | ONNX Runtime | use_ort_backend() | ✅ | N/A |
+| CPU | use_cpu() | OpenVINO | use_openvino_backend() | ✅ | N/A |
+| GPU | use_gpu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
+| GPU | use_gpu() | ONNX Runtime | use_ort_backend() | ✅ | ✅ |
+| GPU | use_gpu() | Paddle TensorRT | use_paddle_infer_backend() + paddle_infer_option.enable_trt = True | ✅ | ✅ |
+| GPU | use_gpu() | TensorRT | use_trt_backend() | ✅ | ✅ |
+| 昆仑芯 XPU | use_kunlunxin() | Paddle Lite | use_paddle_lite_backend() | N/A | ✅ |
+| 华为 昇腾 | use_ascend() | Paddle Lite | use_paddle_lite_backend() | ✅ | ✅ |
+| Graphcore IPU | use_ipu() | Paddle Inference | use_paddle_infer_backend() | ✅ | N/A |
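+
+下面给出一段最小的 Python 示意代码(非本目录的官方示例,模型路径仅为占位,请替换为实际导出的模型文件),演示如何按照上表组合 `RuntimeOption` 的硬件与推理引擎后端接口:
+
+```python
+import fastdeploy as fd
+
+option = fd.RuntimeOption()
+# 占位路径:请替换为实际导出的 .pdmodel / .pdiparams 文件
+option.set_model_path("model.pdmodel", "model.pdiparams")
+option.use_gpu(0)                  # 选择硬件,也可改为 option.use_cpu()、option.use_kunlunxin() 等
+option.use_paddle_infer_backend()  # 选择推理引擎后端,也可改为 use_ort_backend()、use_openvino_backend()、use_trt_backend()
+runtime = fd.Runtime(option)       # 创建 Runtime 后,即可调用 runtime.infer(...) 完成推理
+```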
diff --git a/model_zoo/ernie-m/deploy/python/seq_cls_infer.py b/model_zoo/ernie-m/deploy/python/seq_cls_infer.py new file mode 100644 index 0000000000000000000000000000000000000000..cffbbc253b48ed9298d5110e89eb7fbbaf785054 --- /dev/null +++ b/model_zoo/ernie-m/deploy/python/seq_cls_infer.py @@ -0,0 +1,150 @@ +# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import distutils.util +import os + +import fastdeploy as fd +import numpy as np + +from paddlenlp.transformers import AutoTokenizer + + +def parse_arguments(): + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--model_dir", required=True, help="The directory of model.") + parser.add_argument("--vocab_path", type=str, default="", help="The path of tokenizer vocab.") + parser.add_argument("--model_prefix", type=str, default="model", help="The model and params file prefix.") + parser.add_argument( + "--device", + type=str, + default="cpu", + choices=["gpu", "cpu"], + help="Type of inference device, support 'cpu' or 'gpu'.", + ) + parser.add_argument( + "--backend", + type=str, + default="paddle", + choices=["onnx_runtime", "paddle", "openvino", "tensorrt", "paddle_tensorrt"], + help="The inference runtime backend.", + ) + parser.add_argument("--cpu_threads", type=int, default=1, help="Number of threads to predict when using cpu.") + parser.add_argument("--device_id", type=int, default=0, help="Select which gpu device to train model.") + parser.add_argument("--batch_size", type=int, default=1, help="The batch size of data.") + parser.add_argument("--max_length", type=int, default=128, help="The max length of sequence.") + parser.add_argument("--log_interval", type=int, default=10, help="The interval of logging.") + parser.add_argument("--use_fp16", type=distutils.util.strtobool, default=False, help="Wheter to use FP16 mode") + parser.add_argument( + "--use_fast", + type=distutils.util.strtobool, + default=True, + help="Whether to use fast_tokenizer to accelarate the tokenization.", + ) + return parser.parse_args() + + +def batchfy_text(texts, batch_size): + batch_texts = [] + batch_start = 0 + while batch_start < len(texts): + batch_texts += [texts[batch_start : min(batch_start + batch_size, len(texts))]] + batch_start += batch_size + return batch_texts + + +class Predictor(object): + def __init__(self, args): + self.tokenizer = AutoTokenizer.from_pretrained(args.model_dir, use_fast=args.use_fast) + self.runtime = self.create_fd_runtime(args) + self.batch_size = args.batch_size + self.max_length = args.max_length + + def create_fd_runtime(self, args): + option = fd.RuntimeOption() + model_path = os.path.join(args.model_dir, args.model_prefix + ".pdmodel") + params_path = os.path.join(args.model_dir, args.model_prefix + ".pdiparams") + option.set_model_path(model_path, params_path) + if args.device == "cpu": + option.use_cpu() + option.set_cpu_thread_num(args.cpu_threads) + else: + option.use_gpu(args.device_id) + if args.backend == "paddle": + 
option.use_paddle_infer_backend() + elif args.backend == "onnx_runtime": + option.use_ort_backend() + elif args.backend == "openvino": + option.use_openvino_backend() + else: + option.use_trt_backend() + if args.backend == "paddle_tensorrt": + option.use_paddle_infer_backend() + option.paddle_infer_option.collect_trt_shape = True + option.paddle_infer_option.enable_trt = True + trt_file = os.path.join(args.model_dir, "model.trt") + option.trt_option.set_shape( + "input_ids", [1, 1], [args.batch_size, args.max_length], [args.batch_size, args.max_length] + ) + if args.use_fp16: + option.trt_option.enable_fp16 = True + trt_file = trt_file + ".fp16" + option.trt_option.serialize_file = trt_file + return fd.Runtime(option) + + def preprocess(self, text, text_pair): + data = self.tokenizer(text, text_pair, max_length=self.max_length, padding=True, truncation=True) + input_ids_name = self.runtime.get_input_info(0).name + input_map = { + input_ids_name: np.array(data["input_ids"], dtype="int64"), + } + return input_map + + def infer(self, input_map): + results = self.runtime.infer(input_map) + return results + + def postprocess(self, infer_data): + logits = np.array(infer_data[0]) + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1), "confidence": probs.max(axis=-1)} + return out_dict + + def predict(self, texts, texts_pair=None): + input_map = self.preprocess(texts, texts_pair) + infer_result = self.infer(input_map) + output = self.postprocess(infer_result) + return output + + +if __name__ == "__main__": + args = parse_arguments() + predictor = Predictor(args) + text = ["他们告诉我,呃,我最后会被叫到一个人那里去见面。"] * 3 + text_pair = ["我从来没有被告知任何与任何人会面。", "我被告知将有一个人被叫进来与我见面。", "那个人来得有点晚。"] + batch_texts = batchfy_text(text, args.batch_size) + batch_texts_pair = batchfy_text(text_pair, args.batch_size) + label_list = ["entailment", "neutral", "contradiction"] + + for bs, (texts, texts_pair) in enumerate(zip(batch_texts, batch_texts_pair)): + outputs = predictor.predict(texts, texts_pair) + for i, (sentence1, sentence2) in enumerate(zip(texts, texts_pair)): + print( + f'Batch id:{bs}, example id:{i}, sentence1:"{sentence1}", sentence2:"{sentence2}", ' + f"label:{label_list[outputs['label'][i]]}, confidence:{outputs['confidence'][i]:.4f}" + ) diff --git a/model_zoo/ernie-m/deploy/simple_serving/README.md b/model_zoo/ernie-m/deploy/simple_serving/README.md new file mode 100644 index 0000000000000000000000000000000000000000..30da3c4f796ad41af7ffe2b536aa397598ff280f --- /dev/null +++ b/model_zoo/ernie-m/deploy/simple_serving/README.md @@ -0,0 +1,37 @@ +# 基于PaddleNLP SimpleServing 的服务化部署 + +## 目录 +- [环境准备](#环境准备) +- [Server启动服务](#Server服务启动) +- [其他参数设置](#其他参数设置) + +## 环境准备 + +paddlenlp >= 2.5.0 + +## Server服务启动 +### 文本分类任务启动 +#### 启动文本分类 Server 服务 +```bash +paddlenlp server server_seq_cls:app --host 0.0.0.0 --port 8189 +``` + +#### 分类任务发送服务 +```bash +python client_seq_cls.py --language zh +``` + +## 其他参数设置 +可以在client端设置 `max_seq_len`, `batch_size` 参数 +```python + data = { + 'data': { + 'text': texts, + 'text_pair': text_pairs + }, + 'parameters': { + 'max_seq_len': args.max_seq_len, + 'batch_size': args.batch_size + } + } +``` diff --git a/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py b/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..5fc1de30fa040839444ea9d8bcfbb2cee696b1bd --- /dev/null 
+++ b/model_zoo/ernie-m/deploy/simple_serving/client_seq_cls.py @@ -0,0 +1,43 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import json + +import requests +from datasets import load_dataset + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--language", required=True, type=str, help="The language for the simple seving") +parser.add_argument("--max_seq_len", default=256, type=int, help="The maximum total input sequence length after tokenization.") +parser.add_argument("--batch_size", default=1, type=int, help="Batch size per GPU/CPU for predicting.") +args = parser.parse_args() +# yapf: enable + +url = "http://0.0.0.0:8189/models/ernie_m_cls" +headers = {"Content-Type": "application/json"} + + +if __name__ == "__main__": + examples = load_dataset("xnli", args.language, split="validation")[:10] + texts = [text for text in examples["premise"]] + text_pairs = [text for text in examples["hypothesis"]] + + data = { + "data": {"text": texts, "text_pair": text_pairs}, + "parameters": {"max_seq_len": args.max_seq_len, "batch_size": args.batch_size}, + } + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.text) diff --git a/model_zoo/ernie-m/deploy/simple_serving/server_seq_cls.py b/model_zoo/ernie-m/deploy/simple_serving/server_seq_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..b38202ece3d1550f0956a8623fa42cc6dc730023 --- /dev/null +++ b/model_zoo/ernie-m/deploy/simple_serving/server_seq_cls.py @@ -0,0 +1,25 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp import SimpleServer +from paddlenlp.server import ERNIEMHandler, MultiClassificationPostHandler + +app = SimpleServer() +app.register( + "models/ernie_m_cls", + model_path="../../finetuned_models/export", + tokenizer_name="ernie-m-base", + model_handler=ERNIEMHandler, + post_handler=MultiClassificationPostHandler, +) diff --git a/model_zoo/ernie-m/run_classifier.py b/model_zoo/ernie-m/run_classifier.py new file mode 100644 index 0000000000000000000000000000000000000000..0d1886c6dd6f6db588ae027bad7852f28410d834 --- /dev/null +++ b/model_zoo/ernie-m/run_classifier.py @@ -0,0 +1,322 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +import random +from dataclasses import dataclass, field +from functools import partial +from typing import Optional + +import numpy as np +import paddle +from datasets import load_dataset +from paddle.io import Dataset +from paddle.metric import Accuracy + +import paddlenlp +from paddlenlp.data import DataCollatorWithPadding +from paddlenlp.trainer import ( + PdArgumentParser, + Trainer, + TrainingArguments, + get_last_checkpoint, +) +from paddlenlp.transformers import ( + AutoModelForSequenceClassification, + AutoTokenizer, + ErnieMForSequenceClassification, +) +from paddlenlp.utils.log import logger + +all_languages = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"] +task_type_list = ["cross-lingual-transfer", "translate-train-all"] + + +@dataclass +class ModelArguments: + task_type: str = field( + default=None, + metadata={"help": "The type of the task to finetune selected in the list: " + ", ".join(task_type_list)}, + ) + model_name_or_path: str = field( + default=None, + metadata={ + "help": "Path to pre-trained model or shortcut name selected in the list: " + + ", ".join(list(ErnieMForSequenceClassification.pretrained_init_configuration.keys())) + }, + ) + max_seq_length: Optional[int] = field( + default=256, + metadata={ + "help": "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + }, + ) + classifier_dropout: Optional[float] = field(default=0.1, metadata={"help": "Dropout rate."}) + layerwise_decay: Optional[float] = field(default=0.8, metadata={"help": "Layerwise decay ratio."}) + export_model_dir: Optional[str] = field( + default="./best_models", + metadata={"help": "Path to directory to store the exported inference model."}, + ) + use_test_data: Optional[bool] = field( + default=False, metadata={"help": "Whether to use a tiny dataset for CI test."} + ) + test_data_path: Optional[str] = field(default=None, metadata={"help": "Path to tiny dataset."}) + + +def set_seed(seed): + # Use the same data seed(for data shuffle) for all procs to guarantee data + # consistency after sharding. + random.seed(seed) + np.random.seed(seed) + # Maybe different op seeds(for dropout) for different procs is better. 
By: + # `paddle.seed(seed + paddle.distributed.get_rank())` + paddle.seed(seed) + + +def convert_example(example, tokenizer, max_seq_length=256): + """convert a example into necessary features""" + # Convert raw text to feature + tokenized_example = tokenizer( + example["premise"], + text_pair=example["hypothesis"], + max_length=max_seq_length, + padding=False, + truncation=True, + return_position_ids=True, + return_attention_mask=True, + return_token_type_ids=False, + ) + return tokenized_example + + +def load_xnli_dataset(args, path, lang, split=None): + """load dataset for specificed language""" + if args.use_test_data: + if args.test_data_path is None: + raise ValueError("Should specified `test_data_path` for test datasets when `use_test_data` is True.") + data_files = { + "train": args.test_data_path, + "validation": args.test_data_path, + "test": args.test_data_path, + } + return load_dataset("json", data_files=data_files, split=split) + else: + return load_dataset(path, lang, split=split) + + +class XnliDataset(Dataset): + """ + Make all languages datasets be loaded in lazy mode. + """ + + def __init__(self, datasets): + self.datasets = datasets + # Ar language has 2000 empty data. + self.num_samples = [len(i) for i in datasets] + self.cumsum_len = np.cumsum(self.num_samples) + + def __getitem__(self, idx): + language_idx = np.argmax(self.cumsum_len > idx) + last = language_idx - 1 if language_idx > 0 else language_idx + sample_idx = idx - self.cumsum_len[last] if idx >= self.cumsum_len[last] else idx + return self.datasets[int(language_idx)][int(sample_idx)] + + def __len__(self): + return self.cumsum_len[-1] + + +def do_train(): + training_args, model_args = PdArgumentParser([TrainingArguments, ModelArguments]).parse_args_into_dataclasses() + training_args: TrainingArguments = training_args + model_args: ModelArguments = model_args + + training_args.print_config(model_args, "Model") + + paddle.set_device(training_args.device) + + set_seed(training_args.seed) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." 
+ ) + + # Dataset pre-process + tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=model_args.max_seq_length) + remove_columns = ["premise", "hypothesis"] + + def collect_all_languages_dataset(split): + all_ds = [] + for language in all_languages: + ds = load_xnli_dataset(model_args, "xnli", language, split=split) + all_ds.append(ds.map(trans_func, batched=True, remove_columns=remove_columns)) + return XnliDataset(all_ds) + + if model_args.task_type == "cross-lingual-transfer": + raw_datasets = load_xnli_dataset(model_args, "xnli", "en") + if training_args.do_train: + train_ds = raw_datasets["train"].map(trans_func, batched=True, remove_columns=remove_columns) + if training_args.do_eval: + eval_ds = raw_datasets["validation"].map(trans_func, batched=True, remove_columns=remove_columns) + if training_args.do_predict: + test_ds = raw_datasets["test"].map(trans_func, batched=True, remove_columns=remove_columns) + elif model_args.task_type == "translate-train-all": + if training_args.do_train: + train_ds = collect_all_languages_dataset("train") + if training_args.do_eval: + eval_ds = collect_all_languages_dataset("validation") + if training_args.do_predict: + test_ds = collect_all_languages_dataset("test") + else: + raise ValueError( + f"task_type should be 'cross-lingual-transfer' or 'translate-train-all' but '{model_args.task_type}' is specificed." + ) + + data_collator = DataCollatorWithPadding(tokenizer) + + num_labels = 3 + model = AutoModelForSequenceClassification.from_pretrained( + model_args.model_name_or_path, num_labels=num_labels, classifier_dropout=model_args.classifier_dropout + ) + + # Define the metrics of tasks. + def compute_metrics(p): + # Define the metrics of tasks. + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + + preds = paddle.to_tensor(preds) + label = paddle.to_tensor(p.label_ids) + + metric = Accuracy() + result = metric.compute(preds, label) + metric.update(result) + accu = metric.accumulate() + return {"accuracy": accu} + + trainer = Trainer( + model=model, + args=training_args, + data_collator=data_collator, + train_dataset=train_ds if training_args.do_train else None, + eval_dataset=eval_ds if training_args.do_eval else None, + tokenizer=tokenizer, + compute_metrics=compute_metrics, + # optimizers=[optimizer, lr_scheduler], + ) + + def using_layerwise_lr_decay(layerwise_decay, model, training_args): + """ + Generate parameter names needed to perform weight decay. + All bias and LayerNorm parameters are excluded. + """ + # params_list = [{"params": param, "learning_rate": lr * decay_ratio}, ... 
] + params_list = [] + n_layers = model.config.num_hidden_layers + for name, param in model.named_parameters(): + ratio = 1.0 + param_to_train = {"params": param, "dygraph_key_name": name} + if any(nd in name for nd in ["bias", "norm"]): + param_to_train["weight_decay"] = 0.0 + else: + param_to_train["weight_decay"] = training_args.weight_decay + + if "encoder.layers" in name: + idx = name.find("encoder.layers.") + layer = int(name[idx:].split(".")[2]) + ratio = layerwise_decay ** (n_layers - layer) + elif "embedding" in name: + ratio = layerwise_decay ** (n_layers + 1) + + param_to_train["learning_rate"] = ratio + + params_list.append(param_to_train) + return params_list + + params_to_train = using_layerwise_lr_decay(model_args.layerwise_decay, model, training_args) + + trainer.set_optimizer_grouped_parameters(params_to_train) + + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + + # training + if training_args.do_train: + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + trainer.save_model() + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluating + if training_args.do_eval: + combined = {} + for language in all_languages: + eval_ds = load_xnli_dataset(model_args, "xnli", language, split="validation") + eval_ds = eval_ds.map(trans_func, batched=True, remove_columns=remove_columns, load_from_cache_file=True) + metrics = trainer.evaluate(eval_dataset=eval_ds) + metrics = {k + f"_{language}": v for k, v in metrics.items()} + combined.update({f"eval_accuracy_{language}": metrics.get(f"eval_accuracy_{language}", 0.0)}) + trainer.log_metrics("eval", metrics) + + combined.update({"eval_accuracy_average": np.mean(list(combined.values()))}) + trainer.log_metrics("eval", combined) + trainer.save_metrics("eval", combined) + + # Predicting + if training_args.do_predict: + test_ret = trainer.predict(test_ds) + trainer.log_metrics("test", test_ret.metrics) + logits = test_ret.predictions + max_value = np.max(logits, axis=1, keepdims=True) + exp_data = np.exp(logits - max_value) + probs = exp_data / np.sum(exp_data, axis=1, keepdims=True) + out_dict = {"label": probs.argmax(axis=-1).tolist(), "confidence": probs.max(axis=-1).tolist()} + out_file = open(os.path.join(training_args.output_dir, "test_results.json"), "w") + json.dump(out_dict, out_file) + + # Export inference model + if training_args.do_export and paddle.distributed.get_rank() == 0: + # You can also load from certain checkpoint + # trainer.load_state_dict_from_checkpoint("/path/to/checkpoint/") + model_to_save = trainer.model + model_to_save = model_to_save._layers if isinstance(model_to_save, paddle.DataParallel) else model_to_save + input_spec = [ + paddle.static.InputSpec(shape=[None, None], dtype="int64"), + ] + model_args.export_model_dir = os.path.join(model_args.export_model_dir, "export") + paddlenlp.transformers.export_model( + model=model_to_save, input_spec=input_spec, path=model_args.export_model_dir + ) + trainer.tokenizer.save_pretrained(model_args.export_model_dir) + + +if __name__ == "__main__": + do_train() diff --git a/model_zoo/ernie-tiny/README.md b/model_zoo/ernie-tiny/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c0107eea0ca8950822d142b3743f4b526a46d9c0 --- /dev/null +++ b/model_zoo/ernie-tiny/README.md @@ -0,0 +1,654 @@ +# ERNIE 3.0 
Tiny: Frustratingly Simple Method to Improve Task-Agnostic Distillation Generalization + + **目录** + * [ERNIE 3.0 Tiny 介绍](#模型介绍) + * [预训练模型效果](#模型效果) + * [代码结构](#代码结构) + * [开始运行](#开始运行) + * [任务介绍](#任务介绍) + * [环境要求](#环境要求) + * [数据准备](#数据准备) + * [模型训练](#模型训练) + * [模型评估](#模型评估) + * [端上模型压缩方案🔥](#模型压缩) + * [压缩效果](#压缩效果) + * [⚡️ FastDeploy 部署](#FastDeploy部署) + * [性能结论](#压缩结论) + * [参考文献](#参考文献) + +本项目开源了 **ERNIE 3.0 Tiny** 预训练模型及 **端上语义理解压缩方案**。 + +- **ERNIE 3.0 Tiny** 百度 ERNIE 使用 ERNIE-Tiny 系列的知识蒸馏技术,将 ERNIE 3.0 Titan 大模型的能力传递给小模型,产出并开源了易于部署的 ERNIE 3.0 Tiny 系列预训练模型,刷新了中文小模型的 SOTA 成绩。在这些较少参数量的 ERNIE 3.0 Tiny 系列模型中,有一部分可以直接部署在 CPU 上。 + +- **端上语义理解压缩方案** 在语义理解任务中使用 ERNIE 3.0 Tiny 微调的基础上,我们建议进一步使用包含模型裁剪、量化训练、Embedding 量化等策略的压缩方案,在保持模型精度不降的情况下,可将模型体积减小为原来的 7.8%,达到 5.4 MB,内存占用也随之大幅减小。再经过 [⚡️FastDeploy](https://github.com/PaddlePaddle/FastDeploy) 部署工具和 [FastTokenizer](../../fast_tokenizer/README.md) 对分词阶段的加速,**端到端推理性能**也有显著提升,从而将 ERNIE 3.0 Tiny 模型成功部署至 **📱端侧**。由于端侧部署对内存占用的要求比服务端更高,因此该方案也同样适用于 🖥服务端部署。 + + + +## ERNIE 3.0 Tiny 介绍 + +百度 ERNIE 团队在 2021 年底发布了百亿级别大模型 ERNIE 3.0 和千亿级别的大模型 ERNIE 3.0 Titan。为了让大模型的能力能够真正在一线业务发挥威力,ERNIE 团队推出了 ERNIE-Tiny 系列的知识蒸馏技术,通过任务无关蒸馏的方法,产出了多个轻量级模型 ERNIE 3.0 Tiny,刷新了中文小模型的成绩,并使这些模型能够直接在 CPU 上进行预测,大大拓展了 ERNIE 模型的使用场景。 + +2023 年初,ERNIE 团队进一步开源了 ERNIE 3.0 Tiny 模型的 v2 版本,使教师模型预先**注入下游知识**并参与 **多任务训练**,大大提高了小模型在下游任务上的效果。ERNIE 3.0 Tiny v2 模型在 in-domain、out-domain、low-resource 的下游任务上比 v1 有了进一步的提升,并且 v2 还开源了 3L128H 结构的模型。 + +### 在线蒸馏技术 + +在线蒸馏技术在模型学习的过程中周期性地将知识信号传递给若干个学生模型同时训练,从而在蒸馏阶段一次性产出多种尺寸的学生模型。相对传统蒸馏技术,该技术极大节省了因大模型额外蒸馏计算以及多个学生的重复知识传递带来的算力消耗。 + +这种新颖的蒸馏方式利用了文心大模型的规模优势,在蒸馏完成后保证了学生模型的效果和尺寸丰富性,方便不同性能需求的应用场景使用。此外,由于文心大模型的模型尺寸与学生模型差距巨大,模型蒸馏难度极大甚至容易失效。为此,通过引入了助教模型进行蒸馏的技术,利用助教作为知识传递的桥梁以缩短学生模型和大模型表达空间相距过大的问题,从而促进蒸馏效率的提升。 + +

+ image +

+ +
+ +### 注入下游知识 +ERNIE 3.0 Tiny v1 通过在线蒸馏技术将预训练大模型压缩成预训练小模型,然而由于小模型在微调之前没有接触到下游任务的相关知识,导致效果和大模型仍然存在差距。因此 ERNIE 团队进一步提出 **ERNIE 3.0 Tiny v2**,通过微调教师模型,让教师模型学习到下游任务的相关知识,进而能够在蒸馏的过程中传导给学生模型。尽管学生模型完全没有见过下游数据,通过预先注入下游知识到教师模型,蒸馏得到的学生模型也能够获取到下游任务的相关知识,进而使下游任务上的效果得到提升。 + +### 多任务学习提升泛化性 +多任务学习已经被证明对增强模型泛化性有显著的效果,例如 MT-DNN、MUPPET、FLAN 等。通过对教师模型加入多下游任务微调,不但能够对教师模型注入下游知识、提高教师模型的泛化性,并且能够通过蒸馏传给学生模型,大幅度提升小模型的泛化性。具体地,我们对教师模型进行了 28 个任务的多任务微调。 + +

+ image +

+
+ +因此,ERNIE 3.0 Tiny v2 相比 ERNIE 3.0 Tiny v1 在 in-domain、out-domain、low-resource 数据上都能获得显著的提升。 + + + +## 预训练模型效果 + +本项目开源 **ERNIE 3.0 Tiny _Base_** 、**ERNIE 3.0 Tiny _Medium_** 、 **ERNIE 3.0 Tiny _Mini_** 、 **ERNIE 3.0 Tiny _Micro_** 、 **ERNIE 3.0 Tiny _Nano_**、**ERNIE 3.0 Tiny _Pico_** 六种结构的中文模型: + +- **ERNIE 3.0-Tiny-_Base_**-zh (_12-layer, 768-hidden, 12-heads_) +- **ERNIE 3.0-Tiny-_Medium_**-zh(_6-layer, 768-hidden, 12-heads_) +- **ERNIE 3.0-Tiny-_Mini_**-zh (_6-layer, 384-hidden, 12-heads_) +- **ERNIE 3.0-Tiny-_Micro_**-zh (_4-layer, 384-hidden, 12-heads_) +- **ERNIE 3.0-Tiny-_Nano_**-zh (_4-layer, 312-hidden, 12-heads_) +- **ERNIE 3.0-Tiny-_Pico_**-zh (_3-layer, 128-hidden, 2-heads_) + +其中,v2 版本开源了 6 种结构的模型,v1 版本开源了前 5 种结构的模型。 + +ERNIE 3.0 Tiny 模型可以用于文本分类、文本推理、实体抽取、问答等各种 NLU 任务中。下表是 ERNIE 3.0 Tiny 模型在 in-domain、out-domain 和 low-resource 三类数据集上的效果。其中 CLUE 指标可以通过 [PaddleNLP CLUE Benchmark](../../examples/benchmark/clue) 复现。 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Arch | Model | In-domain avg. | afqmc | tnews | iflytek | cmnli | ocnli | cluewsc2020 | csl | Out-domain avg. | CANLI | shopping_10 | Low-resource avg. | bustm_few | eprtmt_few | csldcp_few |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| 12L768H | ERNIE 3.0 Tiny-Base-v1-zh | 75.38 | 75.93 | 58.26 | 61.56 | 83.02 | 80.10 | 86.18 | 82.63 | 97.29 | 99.31 | 95.26 | 75.81 | 76.09 | 89.06 | 62.29 |
+| 12L768H | ERNIE 3.0 Tiny-Base-v2-zh | 75.93 | 77.43 | 59.11 | 61.49 | 84.56 | 81.86 | 84.54 | 82.50 | 97.30 | 99.22 | 95.38 | 79.00 | 82.50 | 89.84 | 64.65 |
+| 6L768H | ERNIE 3.0 Tiny-Medium-v1-zh | 72.78 | 73.37 | 57.00 | 60.67 | 80.64 | 76.88 | 79.28 | 81.60 | 96.99 | 99.16 | 94.82 | 72.16 | 69.06 | 85.94 | 61.48 |
+| 6L768H | ERNIE 3.0 Tiny-Medium-v2-zh | 74.25 | 75.88 | 57.86 | 61.64 | 82.89 | 80.27 | 79.93 | 81.27 | 97.22 | 99.19 | 95.24 | 78.64 | 81.41 | 90.94 | 63.58 |
+| 6L384H | ERNIE 3.0 Tiny-Mini-v1-zh | 68.88 | 71.85 | 55.24 | 54.48 | 77.19 | 73.08 | 71.05 | 79.30 | 96.27 | 98.44 | 94.10 | 66.79 | 67.34 | 82.97 | 50.07 |
+| 6L384H | ERNIE 3.0 Tiny-Mini-v2-zh | 70.49 | 74.40 | 56.20 | 55.79 | 80.17 | 76.75 | 72.37 | 77.77 | 96.69 | 98.69 | 94.68 | 72.46 | 73.75 | 88.12 | 55.50 |
+| 4L384H | ERNIE 3.0 Tiny-Micro-v1-zh | 67.26 | 71.15 | 55.05 | 53.83 | 74.81 | 70.41 | 69.08 | 76.50 | 95.76 | 97.69 | 93.83 | 65.71 | 66.25 | 83.75 | 47.12 |
+| 4L384H | ERNIE 3.0 Tiny-Micro-v2-zh | 67.98 | 72.52 | 55.45 | 54.33 | 77.81 | 74.85 | 66.45 | 74.43 | 96.47 | 98.41 | 94.52 | 69.65 | 72.50 | 84.53 | 51.93 |
+| 4L312H | ERNIE 3.0 Tiny-Nano-v1-zh | 66.24 | 70.51 | 54.57 | 48.36 | 74.97 | 70.61 | 68.75 | 75.93 | 71.16 | 51.87 | 91.35 | 53.80 | 58.59 | 81.41 | 21.40 |
+| 4L312H | ERNIE 3.0 Tiny-Nano-v2-zh | 67.77 | 72.75 | 55.38 | 48.90 | 78.01 | 74.54 | 68.42 | 76.37 | 96.34 | 98.19 | 94.48 | 68.16 | 72.34 | 87.03 | 45.10 |
+| 3L128H2A | ERNIE 3.0 Tiny-Pico-v2-zh | 57.81 | 69.35 | 52.50 | 21.05 | 65.65 | 64.03 | 63.49 | 68.60 | 74.13 | 54.97 | 93.29 | 51.25 | 62.34 | 79.84 | 11.58 |
+ +ERNIE 3.0 Tiny v2 多任务学习、在线蒸馏方案效果显著,刷新了中文小模型的 SOTA 成绩。具体对比数据见如下模型 **精度-时延** 图,横坐标表示在 Arm CPU(高通 865 芯片)上,基于 Arm v8 arch 测试(batch_size=1, seq_len=32)的推理时延(Latency,单位毫秒),纵坐标是 CLUE 10 个任务上的平均精度(包含文本分类、文本匹配、自然语言推理、代词消歧、阅读理解等任务),其中 CMRC2018 阅读理解任务的评价指标是 Exact Match(EM),其它任务的评价指标均是 Accuracy。模型名下方标注了模型的参数量。 + +

+ image +

+ + +图中越靠左上方的模型,精度和性能水平越高。可以看到 ERNIE 3.0 Tiny v2 在同等规模的开源模型中,综合实力领先其他同类型轻量级模型。与 UER/RoBERTa-Base 相比,12L768H 的 ERNIE 3.0-Base 平均精度提升了 4.5 个点,比同等规模的BERT-Base-Chinese 提升 3.7 个点;6L768H 的 ERNIE 3.0-Medium 相比 12L768H 的 UER/Chinese-RoBERTa 高 2.4,比 BERT-Base-Chinese 高 1.7,并且节省一倍运算时间;另外值得一提的是,这些小模型能够直接部署在 CPU 上。 + +使用 PaddleNLP 只需要一行代码就可以下载并获取 ERNIE 3.0 Tiny 预训练模型,之后可以用自己的下游数据下进行微调。 + +```python + +from paddlenlp.transformers import * + +tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-tiny-medium-v2-zh") + +# 用于分类任务(本项目中的意图识别任务) +seq_cls_model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-tiny-medium-v2-zh") + +# 用于序列标注任务(本项目中的槽位填充任务) +token_cls_model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-tiny-medium-v2-zh") + +# 用于阅读理解任务 +qa_model = AutoModelForQuestionAnswering.from_pretrained("ernie-3.0-tiny-medium-v2-zh") + +``` + +如果使用 v1 版本模型,只需要把 v2 替换成 v1 即可。 + + + +## 代码结构 + +以下是本项目代码结构 + +```text +. +├── run_train.py # 微调和压缩脚本 +├── run_eval.py # 评估脚本 +├── utils.py # 训练工具脚本 +├── model.py # 模型结构脚本 +├── data # 数据目录(自定义数据) +│ └── train.txt # 训练集(待用户新增) +│ └── dev.txt # 验证集(待用户新增) +│ └── intent_label.txt # 意图标签文件 +│ └── slot_label.txt # 槽位标签文件 +├── deploy # 部署目录 +│ └── README.md # Fastdeploy 部署文档 +│ └── android # 端侧部署目录 +│ └── cpp # 服务端部署目录(C++) +│ └── python # 服务端部署目录(Python) +└── README.md # 文档 +``` + + + +## 开始运行 + + + +### 任务介绍 + +本项目是使用 ERNIE 3.0 Tiny 预训练模型端侧部署方案,任务背景是车载语音场景下的口语理解(Spoken Language Understanding,SLU)。本项目包括微调、压缩和部署的全流程。 + +SLU 任务主要将用户的自然语言表达解析为结构化信息。结构化信息的解析主要包括意图识别和槽位填充两个步骤。 + +- 数据样例: + +```text +- 输入:来一首周华健的花心 +- 输出 + - 意图识别任务:music.play + - 槽位填充任务:来一首周华健花心 +``` + +在本项目中,意图识别和槽位填充任务分别被建模为文本分类和序列标注任务,二者共用一个 ERNIE 3.0 Tiny 模型,只有最后的任务层是独立的。 + +- 评价方法:单句意图和槽位被完全正确分类的准确率(Accuracy)。 + +### 环境要求 +- python >= 3.7 +- paddlepaddle >= 2.4.1 +- paddlenlp >= 2.5 +- paddleslim >= 2.4 + +### 数据准备 + +本项目使用了 [NLPCC2018 Shared Task 4](http://tcci.ccf.org.cn/conference/2018/taskdata.php) 的数据集,该数据集来源于中文真实商用车载语音任务型对话系统的对话日志。需要说明的一点是,本项目为了使压缩样例更简洁,只考虑了原任务中的意图识别和槽位填充任务,纠错数据被忽略,并且只考虑单句任务。由于公开的测试集没有标签,因此只使用了训练集,并自行分割出训练集和验证集。 + +训练集的下载地址为[链接](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata04.zip)。下载、解压后得到 `corpus.train.txt` 文件,将它移动至本项目中的 `data` 目录,再经过下面的代码按照 4:1 的比例分割出训练集和验证集,得到 `data/train.txt` 和 `data/dev.txt` 两个文件: + +```shell +cd data + +shuf corpus.train.txt > corpus.train.txt.shuf +num_lines=$(wc -l corpus.train.txt|awk '{print $1}') +head -n $[num_lines/5] corpus.train.txt.shuf > dev.txt +tail -n $[num_lines-num_lines/5] corpus.train.txt.shuf > train.txt + +``` +执行完后,data 目录应是如下结构: + +```text +├── data # 数据目录(自定义数据) +│ └── train.txt # 训练集 +│ └── dev.txt # 验证集 +│ └── intent_label.txt # 意图标签文件 +│ └── slot_label.txt # 槽位标签文件 +``` + +由于文件较小,`intent_label.txt` 和 `slot_label.txt` 文件是从 `corpus.train.txt` 文件中提取并上传 git 的,提前写入这两个文件是为了读取数据逻辑更便捷,也便于预测时后处理使用。 + + + +## 模型训练 + +本项目自定义了继承自 `ErniePretrainedModel` 的模型 `JointErnie`,使意图识别和槽位填充两个任务可以共用一个预训练模型 `ernie-3.0-tiny-nano-v2-zh`,但是各自也分别拥有最后一层独立的全连接层。模型的定义依然可以使用 `from_pretrained` API 传入使用的预训练模型和相关参数。这里也可以按照需求使用 ERNIE 3.0 Tiny 其他大小的模型,如果不知道如何选择,可以对多个大小的模型都进行训练和压缩,最后根据在硬件上的精度、时延、内存占用等指标来选择模型。 + +```python +from model import JointErnie + +model = JointErnie.from_pretrained( + pretrained_model_name_or_path="ernie-3.0-tiny-nano-v2-zh", + intent_dim=11, + slot_dim=32, +) +``` + +运行下面的脚本,使用 Trainer API 启动训练: + +```shell +mkdir output/BS64_LR5e-5_EPOCHS30 + +python run_train.py \ + --device gpu \ + --logging_steps 100 \ + --save_steps 100 \ + --eval_steps 100 \ + --model_name_or_path ernie-3.0-tiny-nano-v2-zh \ + 
--num_train_epochs 30 \ + --per_device_eval_batch_size 64 \ + --per_device_train_batch_size 64 \ + --learning_rate 5e-5 \ + --prune_embeddings \ + --max_vocab_size 6000 \ + --max_seq_length 16 \ + --output_dir output/BS64_LR5e-5_EPOCHS30 \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --intent_label_path data/intent_label.txt \ + --slot_label_path data/slot_label.txt \ + --label_names 'intent_label' 'slot_label' \ + --weight_decay 0.01 \ + --warmup_ratio 0.1 \ + --do_train \ + --do_eval \ + --do_export \ + --input_dtype "int32" \ + --disable_tqdm True \ + --overwrite_output_dir \ + --load_best_model_at_end True \ + --save_total_limit 1 \ + --metric_for_best_model eval_accuracy \ +``` + +可配置参数说明: + +* `model_name_or_path`:必须,进行微调使用的预训练模型。可选择的有 "ernie-3.0-tiny-base-v2-zh"、"ernie-3.0-tiny-medium-v2-zh"、"ernie-3.0-tiny-mini-v2-zh"、"ernie-3.0-tiny-micro-v2-zh"、"ernie-3.0-tiny-nano-v2-zh"、"ernie-3.0-tiny-pico-v2-zh"。 +* `output_dir`:必须,模型训练后保存的模型目录。 +* `prune_embeddings`:可选,模型的 embeddings 是否需要裁剪。如果设置,会按照 `max_seq_length` 以及 `max_vocab_size` 对预训练模型的 `position embeddings` 和 `word_embeddings` 参数进行裁剪,并将新的 model 和 tokenizer 保存至 `${output_dir}/pretrained_model` 下。后续的模型微调会基于 embeddings 裁剪后的模型开始。该策略主要是为了减少部署时模型的内存占用。如果对模型的内存占用要求不高,也可以不设置。 +* `max_seq_length`:最大序列长度,是指分词后样本的最大token数,本项目中是 16。如果设置了 `prune_embeddings`,那么会对模型的 `position embeddings` 根据 `max_seq_length` 的值进行裁剪。 +* `max_vocab_size`:词表裁剪后的大小。当设置 `prune_embeddings` 时,会根据词频对预训练模型的词表进行排序,并根据 `max_vocab_size` 大小进行裁剪。 +* `train_path`:必须,训练集路径 +* `dev_path`:必须,验证集路径 +* `intent_label_path`:必须,意图标签文件路径。 +* `slot_label_path`:必须,槽位标签文件路径。 +* `label_names`:训练集中标签对应的 key 名称。如果不传入,在训练时 Trainer 可能由于无法区分输入数据和标签造成错误。 +* `do_train`:是否进行微调训练,设置该参数表示进行微调训练。 +* `do_eval`:是否进行评估,设置该参数表示进行评估。 +* `do_export`:是否导出模型,设置该参数表示训练完成后导出预测模型。 +* `load_best_model_at_end`:是否在训练结尾导入最好的模型。 +* `metric_for_best_model`:选择最好模型的 metric 名称。 +* `per_device_train_batch_size`:训练集训练过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 32。 +* `per_device_eval_batch_size`:开发集评测过程批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为 32。 +* `learning_rate`:训练最大学习率。 +* `num_train_epochs`: 训练轮次,使用早停法时可以选择 100;默认为10。 +* `logging_steps`: 训练过程中日志打印的间隔 steps 数,默认100。 +* `save_steps`: 训练过程中保存模型 checkpoint 的间隔 steps 数,默认100。 +* `weight_decay`:除了所有 bias 和 LayerNorm 权重之外,应用于所有层的权重衰减数值。可选;默认为 0.0; +* `input_dtype`:模型输入张量的数据类型。默认是 `int64`。 +* `device`: 训练设备,可选择 'cpu'、'gpu' 其中的一种;默认为 'gpu'。 + + + + +## 模型评估 +- 动态图 + +使用动态图进行评估,可以直接使用 [模型训练](#模型训练) 中的评估脚本,取消设置 `--do_train` 和 `--do_export` 并保留设置 `--do_eval`,并将 `--model_name_or_path` 设置成微调后的模型路径即可。 + +- 静态图 + +如果使用静态图进行评估或者预测,可以参考脚本 `run_eval.py`,参考下面的命令启动评估: + +```shell +python run_eval.py \ + --device gpu \ + --model_name_or_path output/BS64_LR5e-5_EPOCHS30/checkpoint-7700/ \ + --infer_prefix output/BS64_LR5e-5_EPOCHS30/infer_model \ + --output_dir ./ \ + --test_path data/dev.txt \ + --intent_label_path data/intent_label.txt \ + --slot_label_path data/slot_label.txt \ + --max_seq_length 16 \ + --per_device_eval_batch_size 512 \ + --do_eval +``` + +* `model_name_or_path`:动态图模型的目录,主要用于加载 tokenizer。 +* `infer_prefix`:预测模型的路径(目录+前缀)。例如当 `infer_prefix` 为 `output/infer_model` 时,代表预测模型和参数文件分别为 `output/infer_model.pdmodel` 和 `output/infer_model.pdiparams`。 +* `test_path` :评估所用文件路径名; +* `do_eval`,是否输出评价指标的结果。如果设置,脚本会开启评估模式,最终会输出精度评价指标的值。如果不设置,则会输出模型后处理后的结果。例如: + +```text +- 输入:放一首刘德华的音乐 +- 输出: + {'intent': 'music.play', 'confidence': array([0.9984201], dtype=float32)} + {'value': [[{'slot': 'singer', 'entity': '刘德华', 'pos': [3, 5]}]]} +``` + + + +## 🔥端上模型压缩方案 + +尽管 ERNIE 3.0 
Tiny 已提供了效果不错的轻量级模型可以微调后直接使用,但在本项目中,微调后的模型体积是 69.0 MB,内存占用达到 115.72MB,部署至端侧还是存在一定困难。因此当模型有部署上线的需求,想要进一步压缩模型体积,降低推理时延,可使用本项目的 **端上语义理解压缩方案** 对上一步微调后的模型进行压缩。 + +为了方便实现,[PaddleNLP 模型压缩 API](../../docs/compression.md) 已提供了以下压缩功能,模型压缩 API 主要是基于 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 模型压缩能力,PaddleSlim 是一个专注于深度学习模型压缩的工具库,提供低比特量化、知识蒸馏、稀疏化和模型结构搜索等模型压缩策略,帮助开发者快速实现模型的小型化,欢迎大家使用。 + +端上模型压缩流程如下图所示: + +

+ image +

+
+ +在本项目中,模型压缩和模型训练共用了脚本 `run_train.py`,压缩时需设置 `--do_compress` 开启模型压缩,并取消设置 `--do_train` 关闭普通训练。模型压缩还需要设置 `--strategy` 参数,本项目中选择 `'dynabert+qat+embeddings'` 组合策略。 + +运行下面的脚本,可对上面微调后的模型进行压缩: + +```shell +python run_train.py \ + --do_compress \ + --strategy 'dynabert+qat+embeddings' \ + --num_train_epochs 10 \ + --model_name_or_path output/BS64_LR5e-5_EPOCHS30/checkpoint-6700 \ + --output_dir output/BS64_LR5e-5_EPOCHS30/ \ + --max_seq_length 16 \ + --per_device_eval_batch_size 64 \ + --per_device_train_batch_size 64 \ + --learning_rate 5e-5 \ + --train_path data/train.txt \ + --dev_path data/dev.txt \ + --intent_label_path data/intent_label.txt \ + --slot_label_path data/slot_label.txt \ + --label_names 'intent_label' 'slot_label' \ + --weight_decay 0.01 \ + --warmup_ratio 0.1 \ + --input_dtype "int32" \ + --device gpu \ + --logging_steps 100 \ + --save_steps 100 \ + --eval_steps 100 \ + --disable_tqdm True \ + --save_total_limit 1 \ + --metric_for_best_model eval_accuracy \ +``` + +可配置参数说明: + +* `strategy`:压缩策略,本案例中推荐使用`"dynabert+qat+embeddings"`,这是一个策略组合,由 `"dynabert"`、`"qat"`、`"embeddings"` 组成。其中`"dynabert"` 是一种裁剪策略,能直接对模型宽度进行裁剪,从而直接减少参数量,需要训练;`"qat"` 是一种量化方法,用于将模型中矩阵乘(底层是 matmul_v2 算子)的权重及激活值的数据类型由 FP32 转成 INT8,并使模型精度尽量保持无损,需要训练;`"embeddings"` 则代表 Embedding 量化策略,它将 Embedding API(底层是 lookup_table_v2 算子)的权重由 FP32 转成 INT8 存储,而不需要训练。由于词表参数量占比非常大,Embedding 量化能够大幅度减少模型的内存占用,但不会对时延产生正向作用。 +* `model_name_or_path`:必须,进行压缩所使用的微调模型。 +* `output_dir`:必须,模型训练或者压缩后保存的模型目录;默认为 `None` 。 +* `do_compress`:必须。压缩需要通过这个开关来打开。其他的开关`do_train` 、`do_eval`和`do_export` 在此步则不能设置。 +* `input_dtype`:模型输入张量的数据类型。默认是 `int64`。 + +其他参数同训练参数,如`learning_rate`、`num_train_epochs`、`per_device_train_batch_size` 等,是指压缩过程中的训练(`"dynabert"` 裁剪 以及 `"qat"` 量化)时所使用的参数,一般可以和微调时保持一致即可,其中 `num_train_epochs` 可比微调时略小。 + + + +### 压缩效果 + +| 模型 | 模型精度(acc.) | 模型体积(MB) | +|-----------------------------------|--------------|--------------| +| 原模型 | 82.34 | 69.0 | +| 原模型+裁剪(词表+模型宽度) | 82.11(-0.23) | 64.0(-7.2%) | +| 原模型+裁剪(词表+模型宽度)+量化(矩阵乘) | 82.21(-0.13) | 11.0(-84.1%) | +| 原模型+裁剪(词表+模型宽度)+量化(矩阵乘+Embedding) | 82.21(-0.13) | 5.4(-92.2%) | + +模型经过压缩后,精度基本无损,体积减小了 92.2%,仅有 5.4 MB。到此,算法侧的工作基本完成。 + + + +## ⚡️FastDeplopy 部署 +能够将深度学习模型部署到性能较低的端侧本身是比较困难的工作,因此在前面我们对小模型做了大量的优化,在精度不降的情况下将 69 MB 的模型压缩至 5.4 MB,但是如果想更好地满足业务上线要求,还需要有部署工具对性能有更多优化。在这里,PaddlePadde 提供了易用高效的云边端推理部署工具 ⚡️FastDeploy,它的 [Paddle Lite](https://github.com/PaddlePaddle/Paddle-Lite) 后端基于算子融合和常量折叠进行了深度模型优化,使得模型推理速度可有大幅度提升;它所集成的 FastTokenizer 库能够对分词阶段进行加速,在麒麟 985 芯片上单条文本的分词的推理时延低于 0.1 毫秒; + +因此,本项目基于 FastDeploy 部署工具,完成了 ERNIE 3.0 Tiny 端侧和服务端的高效部署,请参考 [ERNIE 3.0 Tiny 部署文档](deploy/README.md)。以下动图是 ERNIE 3.0 Tiny 意图识别、槽位填充联合模型使用 FastDeploy 部署在 Android App 上推理的效果展示: + +

+ image +

+ +想要更多了解 FastDeploy 可参考 [FastDeploy 仓库](https://github.com/PaddlePaddle/FastDeploy)。FastDeploy 是一款全场景、易用灵活、极致高效的 AI 推理部署工具,提供开箱即用的部署体验。它为 NLP 任务提供了一整套完整的部署 Pipeline,提供 ERNIE 3.0 Tiny 模型从文本预处理、推理引擎 Runtime 以及后处理三个阶段所需要的接口模块,开发者可以基于这些接口模块在云、边、端上部署各类常见的 NLP 任务,如文本分类、序列标注、信息抽取等: +- 在文本预处理阶段,FastDeploy 使用 PaddleNLP 提供的简单易用的高效分词工具 [FastTokenizer](../../fast_tokenizer/README.md) 完成文本预处理,开发者只需调用几行代码就能完成分词阶段开发。在麒麟 985 芯片上单条文本的分词延时低于 0.1 毫秒,将本项目模型部署在 GPU 上时,使用 FastTokenizer 工具可使端到端性能提升 **3.56倍**; +- 在 Runtime 阶段,FastDeploy 集成多款硬件以及推理引擎后端,开发者可以设置 `fastdeploy::RuntimeOption` 以完成在不同硬件以及使用不同的推理引擎进行部署。目前,FastDeploy 支持的后端引擎有: + - 端侧: `Paddle Lite`; + - 服务端 GPU: `Paddle Inference`、`ONNX Runtime`、`Paddle TensorRT` 以及 `TensorRT`; + - 服务端 CPU:`Paddle Inference`、`ONNX Runtime` 以及 `OpenVINO`。 +- 在后处理阶段,FastDeploy 提供了张量级别的 [数值运算模块](https://baidu-paddle.github.io/fastdeploy-api/cpp/html/namespacefastdeploy_1_1function.html), 基于该模块可以快速完成各类任务的后处理计算,如文本分类任务的 Softmax 等数值计算。 + + + +### 性能结论 + +使用 FastDeploy 将压缩后的模型部署在华为 nova 7 Pro (麒麟 985 芯片)上,选用 Paddle Lite 作为后端进行测试,得到不同推理精度下的模型效果、端到端时延(包括前后处理)、内存占用(包括加载 FastTokenizer 库)的数据如下: + +| 模型 | 模型精度(acc.) | 推理精度 | 端到端时延(ms) | 内存占用 Pss (MB) | 模型体积(MB) | +|-----------------------------------|--------------|-----------|-------------|----------------|--------------| +| 原模型 | 82.34 | FP32 | 9.90 | 115.72 | 69.0 | +| 原模型 | 82.34(-0.00) | FP16 | 6.03(1.64x) | 106.24(-8.2%) | 69.0(-0.0%) | +| 原模型+裁剪(词表+模型宽度) | 82.11(-0.23) | FP32 | 7.55(1.31x) | 59.49(-48.59%) | 64.0(-7.2%) | +| 原模型+裁剪(词表+模型宽度) | 82.11(-0.23) | FP16 | 4.68(2.12x) | 52.23(-54.87%) | 64.0(-7.2%) | +| 原模型+裁剪(词表+模型宽度)+量化(矩阵乘) | 82.21(-0.13) | FP32+INT8 | 4.57(2.17x) | 49.17(-57.51%) | 11.0(-84.1%) | +| **原模型+裁剪(词表+模型宽度)+量化(矩阵乘+Embedding)** | 82.21(-0.13) | FP32+INT8 | **4.64(2.13x)** | **43.77(-62.18%)** | **5.4(-92.2%)** | + + +**测试条件**:max_seq_length=16,batch_size=1,thread_num=1 + +模型经过压缩后,精度基本无损,体积减小了 92.2%。在以上测试条件下,端到端推理速度达到原来的 2.13 倍,内存占用减小了 62.18%。 + + + +## 参考文献 +* Liu W, Chen X, Liu J, et al. ERNIE 3.0 Tiny: Frustratingly Simple Method to Improve Task-Agnostic Distillation Generalization[J]. arXiv preprint arXiv:2301.03416, 2023. +* Su W, Chen X, Feng S, et al. ERNIE-Tiny: A Progressive Distillation Framework for Pretrained Transformer Compression[J]. arXiv preprint arXiv:2106.02241, 2021. +* Wang S, Sun Y, Xiang Y, et al. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[J]. arXiv preprint arXiv:2112.12731, 2021. +* Sun Y, Wang S, Feng S, et al. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation[J]. arXiv preprint arXiv:2107.02137, 2021. 
diff --git a/model_zoo/ernie-tiny/data/intent_label.txt b/model_zoo/ernie-tiny/data/intent_label.txt new file mode 100644 index 0000000000000000000000000000000000000000..ebaff04778ed8d27626dab99c55d9b3c678f8984 --- /dev/null +++ b/model_zoo/ernie-tiny/data/intent_label.txt @@ -0,0 +1,11 @@ +OTHERS +music.play +music.pause +music.prev +music.next +navigation.navigation +navigation.open +navigation.start_navigation +navigation.cancel_navigation +phone_call.make_a_phone_call +phone_call.cancel diff --git a/model_zoo/ernie-tiny/data/slot_label.txt b/model_zoo/ernie-tiny/data/slot_label.txt new file mode 100644 index 0000000000000000000000000000000000000000..269b7a3912813f7cb6130c86a29225c490680560 --- /dev/null +++ b/model_zoo/ernie-tiny/data/slot_label.txt @@ -0,0 +1,32 @@ +PAD +O +B-song +I-song +B-singer +I-singer +B-theme +I-theme +B-style +I-style +B-age +I-age +B-toplist +I-toplist +B-emotion +I-emotion +B-language +I-language +B-instrument +I-instrument +B-scene +I-scene +B-destination +I-destination +B-custom_destination +I-custom_destination +B-origin +I-origin +B-phone_num +I-phone_num +B-contact_name +I-contact_name diff --git a/model_zoo/ernie-tiny/deploy/README.md b/model_zoo/ernie-tiny/deploy/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0f83e2cbd791e2884ca34fa2af82105f56373ad2 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/README.md @@ -0,0 +1,59 @@ +# FastDeploy ERNIE 3.0 Tiny 模型高性能部署 + +**目录** + * [FastDeploy部署介绍](#FastDeploy部署介绍) + * [代码结构](#代码结构) + * [环境要求](#环境要求) + * [详细部署文档](#详细部署文档) + + + +## FastDeploy部署介绍 + +**⚡️FastDeploy**是一款**全场景**、**易用灵活**、**极致高效**的AI推理部署工具,满足开发者**多硬件、多平台**的产业部署需求。开发者可以基于FastDeploy将训练好的预测模型在不同的硬件、不同的操作系统以及不同的推理引擎后端上进行部署。目前FastDeploy提供多种编程语言的 SDK,包括 C++、Python 以及 Java SDK。 + +目前 ERNIE 3.0 Tiny 模型已提供基于 FastDeploy 的云边端的部署示例,在服务端上的 GPU 硬件上,支持`Paddle Inference`、`ONNX Runtime`、`Paddle TensorRT`以及`TensorRT`后端,在CPU上支持`Paddle Inference`、`ONNX Runtime`以及`OpenVINO`后端;在移动端上支持`Paddle Lite`后端。多硬件、多推理引擎后端的支持可以满足开发者不同的部署需求。 + +为了提供 ERNIE 3.0 Tiny 高性能端到端部署能力,我们使用 [FastTokenizer](../../../fast_tokenizer/README.md) 工具完成高效分词,大大提升端到端预测性能。针对 ERNIE 3.0 Tiny 模型,FastTokenizer 集成了Google提出的 [Fast WordPiece Tokenization](https://arxiv.org/pdf/2012.15524.pdf) 快速分词算法,可大幅提升分词效率。在 ERNIE 3.0 Tiny 性能测试中,使用 FastTokenizer 可提升端到端性能 **3.56倍** 。我们在 [Python部署](python/README.md) 文档更全面地展示使用 FastTokenizer 的加速能力。 + +本部署示例是车载语音场景下的口语理解(Spoken Language Understanding,SLU)任务,详细可看[ERNIE 3.0 Tiny介绍](../README.md)。 + + + + +## 代码结构 + +```text + +├── cpp +│   ├── CMakeLists.txt # CMake编译脚本 +│   ├── infer_demo.cc # C++ 部署示例代码 +│   └── README.md # C++ 部署示例文档 +├── python +│   ├── infer_demo.py # Python 部署示例代码 +│   └── README.md # Python 部署示例文档 +├── android +│   ├── README.md # Android部署文档 +│   ├── app # App示例代码 +│   ├── build.gradle +│   ├── ernie_tiny # ERNIE 3.0 Tiny JNI & Java封装代码 +│   ├── ...... 
# Android相关的工程文件及目录 +│   ├── local.properties +│   └── ui # 一些辅助用的UI代码 +└── README.md # 文档 + +``` + + + +## 环境要求 + +在部署ERNIE 3.0 Tiny模型前,需要安装FastDeploy SDK,可参考[FastDeploy SDK安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md)确认部署环境是否满足FastDeploy环境要求,并按照介绍安装相应的SDK。 + + + +## 详细部署文档 + +- [Python部署](python/README.md) +- [C++部署](cpp/README.md) +- [Android部署](android/README.md) diff --git a/model_zoo/ernie-tiny/deploy/android/.gitignore b/model_zoo/ernie-tiny/deploy/android/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..a5bc2e92563cf7f0e331638a7a7c78923bd13dfd --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/.gitignore @@ -0,0 +1,23 @@ +.DS_Store +.idea +.gradle +.cxx +cache +build +app/cache +app/libs/* +app/.cxx +app/build +app/src/main/assets/models/* +app/.gradle +app/.idea +ernie_tiny/cache +ernie_tiny/libs/fastdeploy* +ernie_tiny/.cxx +ernie_tiny/build +ernie_tiny/src/main/assets/models/* +ernie_tiny/.gradle +ernie_tiny/.idea +ui/build +ui/.gradle +ui/cache \ No newline at end of file diff --git a/model_zoo/ernie-tiny/deploy/android/README.md b/model_zoo/ernie-tiny/deploy/android/README.md new file mode 100644 index 0000000000000000000000000000000000000000..357464fa7739a4664a2d92f466fcd4cb19b47e9d --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/README.md @@ -0,0 +1,260 @@ +# FastDeploy ERNIE 3.0 Tiny 模型Android部署示例 + +本目录提供了快速完成在 Android 的车载语音场景下的口语理解(Spoken Language Understanding,SLU)任务的部署示例。 + +## 环境准备 + +1. 在本地环境安装好 Android Studio 工具,详细安装方法请见[Android Stuido 官网](https://developer.android.com/studio)。 +2. 准备一部 Android 手机,并开启 USB 调试模式。开启方法: `手机设置 -> 查找开发者选项 -> 打开开发者选项和 USB 调试模式` + +## App示例编译和使用步骤 + +1. ERNIE 3.0 Tiny Android 部署示例位于`PaddleNLP/model_zoo/ernie-tiny/deploy/android`目录 +2. 用 Android Studio 打开 ernie-tiny/deploy/android 工程 +3. 手机连接电脑,打开 USB 调试和文件传输模式,并在 Android Studio 上连接自己的手机设备(手机需要开启允许从 USB 安装软件权限) + +image + + +工程内容说明: +```bash +. +├── README.md +├── app # App示例代码 +├── build.gradle +├── ernie_tiny # ERNIE Tiny JNI & Java封装代码 +# ... # 一些和gradle相关的工程配置文件 +├── local.properties +└── ui # 一些辅助用的UI代码 +``` + +> **注意:** +>> 如果您在导入项目、编译或者运行过程中遇到 NDK 配置错误的提示,请打开 ` File > Project Structure > SDK Location`,修改 `Andriod NDK location` 为您本机配置的 NDK 所在路径。本工程默认使用的NDK版本为20. +>> 如果您是通过 Android Studio 的 SDK Tools 下载的 NDK (见本章节"环境准备"),可以直接点击下拉框选择默认路径。 +>> 还有一种 NDK 配置方法,你可以在 `local.properties` 文件中手动完成 NDK 路径配置,如下图所示 +>> 如果以上步骤仍旧无法解决 NDK 配置错误,请尝试根据 Android Studio 官方文档中的[更新 Android Gradle 插件](https://developer.android.com/studio/releases/gradle-plugin?hl=zh-cn#updating-plugin)章节,尝试更新Android Gradle plugin版本。 + + +4. 点击 Run 按钮,自动编译 APP 并安装到手机。(该过程会自动下载预编译的 FastDeploy Android 库 以及 模型文件,需要联网) + 成功后效果如下,图一:APP 安装到手机;图二: APP 打开后的效果,输入文本后点击"开始分析意图"后即会自动进行意图识别和槽位分析;图三:APP 设置选项,点击右上角的设置图片,可以设置不同选项进行体验。 + +| APP 效果 | APP 演示 | APP设置项 | + | --- | --- | --- | +| image | image | image | + +如果您使用的是较旧版本的Android Studio时遇到gradle版本兼容的问题,可以修改[build.gradle](./build.gradle) 中的gradle版本号为您当前使用的版本。 +```gradle +buildscript { + // ... + dependencies { + classpath 'com.android.tools.build:gradle:7.2.2' // 修改为您当前使用的版本 + } +} +``` + +5. 
点击 APP 右上角的设置选项,可以跳转到设置页。在设置页,您可以选择不同的模型和不同的推理精度,即是否开启 FP16 和 Int8 推理,两种推理精度只能二选一。所有模型均支持FP32推理,当设置项中的 FP16 和 Int8 选项都为 false 或 不设置 时,即使用了FP32进行推理。各模型FP16和Int8推理的支持情况为: + +|模型选项|模型名称|FP16|Int8| +|---|---|---|---| +|models/ernie-tiny|原模型|✅|-| +|models/ernie-tiny-clip|原模型+裁剪(词表+模型宽度)|✅|-| +|models/ernie-tiny-clip-qat|原模型+裁剪(词表+模型宽度)+量化(矩阵乘)|-|✅| +|models/ernie-tiny-clip-qat-embedding-int8|原模型+裁剪(词表+模型宽度)+量化(矩阵乘+Embedding)|-|✅| + + +## ERNIE Tiny Java SDK 说明和使用 + +本工程除了可以直接编译 App 体验之外,还可以编译 ERNIE 3.0 Tiny 的`Java SDK`,方便用户开箱即用。 如下图所示,编译 Java SDK 的步骤为: + - 先在 Android Studio 中打开`ernie_tiny/build.gradle`工程文件; + - 选择 Build->Make Module 'android:ernietiny'; + - 从`ernie_tiny/build/outputs/aar`目录中获取编译后得到的 SDK,即`ernie_tiny-debug.aar`. + +image + +在获取到`ernie_tiny-debug.aar`后,可将其拷贝到您自己的工程中进行使用。在 Android Studio 中的配置步骤如下: +(1)首先,将`ernie_tiny-debug.aar`拷贝到您 Android 工程的libs目录下。 +```bash +├── build.gradle +├── libs +│   └── ernie_tiny-debug.aar +├── proguard-rules.pro +└── src +``` +(2)在您的 Android 工程中的 build.gradle 引入 ERNIE 3.0 Tiny SDK,如下 +```groovy +dependencies { + implementation fileTree(include: ['ernie_tiny-debug.aar'], dir: 'libs') + implementation 'com.android.support:appcompat-v7:28.0.0' + // ... +} +``` + +### ERNIE 3.0 Tiny Java API说明如下 + +- ERNIE 3.0 Tiny `Predictor` 初始化 API: 模型初始化 API 包含两种方式,方式一是通过构造函数直接初始化;方式二是,通过调用 init 函数,在合适的程序节点进行初始化。ERNIE 3.0 Tiny Predictor 初始化参数说明如下: + - modelFile: String, paddle 格式的模型文件路径,如 infer_model.pdmodel + - paramFile: String, paddle 格式的参数文件路径,如 infer_model.pdiparams + - vocabFile: String, 词表文件,如 vocab.txt 每一行包含一个词 + - slotLabelsFile: String, 槽位标签文件,如 slots_label.txt + - intentLabelsFile: String, 意图标签文件,如 intent_label.txt 每一行包含一个标签 + - addedTokensFile: String, 额外词表文件,如 added_tokens.json,json文件 + - runtimeOption: RuntimeOption,可选参数,模型初始化 option。如果不传入该参数则会使用默认的运行时选项。 + - maxLength: 最大序列长度,默认为16 + +```java +public Predictor(); // 空构造函数,之后可以调用init初始化 +public Predictor(String modelFile, String paramsFile, String vocabFile, String slotLabelsFile, String intentLabelsFile, String addedTokensFile); +public Predictor(String modelFile, String paramsFile, String vocabFile, String slotLabelsFile, String intentLabelsFile, String addedTokensFile, RuntimeOption runtimeOption, int maxLength); +public boolean init(String modelFile, String paramsFile, String vocabFile, String slotLabelsFile, String intentLabelsFile, String addedTokensFile, RuntimeOption runtimeOption, int maxLength); +``` + +- ERNIE 3.0 Tiny `Predictor` 预测 API:Predictor 提供 predict 接口对输出的文本进行意图识别。 +```java +public IntentDetAndSlotFillResult[] predict(String[] texts); +``` + +- ERNIE 3.0 Tiny Predictor 资源释放 API: 调用 release() API 可以释放模型资源,返回 true 表示释放成功,false 表示失败;调用 initialized() 可以判断 Predictor 是否初始化成功,true 表示初始化成功,false 表示失败。 +```java +public boolean release(); // 释放native资源 +public boolean initialized(); // 检查Predictor是否初始化成功 +``` + +- RuntimeOption设置说明 + +```java +public class RuntimeOption { + public void enableLiteFp16(); // 开启fp16精度推理 + public void disableLiteFP16(); // 关闭fp16精度推理(默认关闭) + public void enableLiteInt8(); // 开启int8精度推理(需要先准备好量化模型) + public void disableLiteInt8(); // 关闭int8精度推理(默认关闭) + public void setCpuThreadNum(int threadNum); // 设置线程数 + public void setLitePowerMode(LitePowerMode mode); // 设置能耗模式 + public void setLitePowerMode(String modeStr); // 通过字符串形式设置能耗模式 +} +``` + +- 意图和槽位识别结果`IntentDetAndSlotFillResult`说明 + +```java +public class IntentDetAndSlotFillResult { + public String mStr; // 可用于debug的字符串 拼接了意图识别和槽位识别的结果 + public boolean mInitialized = false; + public IntentDetResult mIntentResult; // 
意图识别结果 + public SlotFillResult[] mSlotResult; // 槽位识别结果 + + static class IntentDetResult { + public String mIntentLabel; // 意图识别结果文本标签 + public float mIntentConfidence; // 意图识别结果置信度 + } + static class SlotFillResult { + public String mSlotLabel; // 槽位识别结果文本标签 + public String mEntity; // 槽位识别的实体 + public int[] mPos; // [2] // 在原始文本对应的位置 [start,end] + } +} +``` + +- ERNIE 3.0 Tiny `Predictor` FP32/FP16 推理示例 + +```java +import com.baidu.paddle.paddlenlp.ernie_tiny.RuntimeOption; +import com.baidu.paddle.paddlenlp.ernie_tiny.Predictor; +import com.baidu.paddle.paddlenlp.ernie_tiny.IntentDetAndSlotFillResult; +import android.app.Activity; + +// 以下为伪代码 +class TestERNIETiny extends Activity { + @Override + protected void onCreate(@Nullable Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + Predictor predictor = new Predictor(); + + // 设置模型文件和标签文件 + String modelFile = "ernie-tiny/infer_model.pdmodel"; + String paramsFile = "ernie-tiny/infer_model.pdiparams"; + String vocabFile = "ernie-tiny/vocab.txt"; + String slotLabelsFile = "ernie-tiny/slots_label.txt"; + String intentLabelsFile = "ernie-tiny/intent_label.txt"; + String addedTokensFile = "ernie-tiny/added_tokens.json"; + + // RuntimeOption 设置 + RuntimeOption option = new RuntimeOption(); + option.setCpuThreadNum(2); // 设置线程数 + option.enableLiteFp16(); // 是否开启FP16精度推理 + + // Predictor初始化 + predictor.init(modelFile, paramsFile, vocabFile, slotLabelsFile, intentLabelsFile, addedTokensFile, option, 16); + + // 进行意图识别和槽位分析 + String[] inputTexts = new String[]{"来一首周华健的花心", "播放我们都一样", "到信阳市汽车配件城"}; + + IntentDetAndSlotFillResult[] results = predictor.predict(inputTexts); + } +} +``` + +- ERNIE 3.0 Tiny `Predictor` Int8 量化模型推理示例 + +```java +import com.baidu.paddle.paddlenlp.ernie_tiny.RuntimeOption; +import com.baidu.paddle.paddlenlp.ernie_tiny.Predictor; +import com.baidu.paddle.paddlenlp.ernie_tiny.IntentDetAndSlotFillResult; +import android.app.Activity; + +// 以下为伪代码 +class TestERNIETiny extends Activity { + @Override + protected void onCreate(@Nullable Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + Predictor predictor = new Predictor(); + + // 设置模型文件和标签文件 + String modelFile = "ernie-tiny-clip-qat-embedding-int8/infer_model.pdmodel"; + String paramsFile = "ernie-tiny-clip-qat-embedding-int8/infer_model.pdiparams"; + String vocabFile = "ernie-tiny-clip-qat-embedding-int8/vocab.txt"; + String slotLabelsFile = "ernie-tiny-clip-qat-embedding-int8/slots_label.txt"; + String intentLabelsFile = "ernie-tiny-clip-qat-embedding-int8/intent_label.txt"; + String addedTokensFile = "ernie-tiny-clip-qat-embedding-int8/added_tokens.json"; + + // RuntimeOption 设置 + RuntimeOption option = new RuntimeOption(); + option.setCpuThreadNum(2); // 设置线程数 + option.enableLiteInt8(); // 开启int8精度推理(需要先准备好量化模型) + + // Predictor初始化 + predictor.init(modelFile, paramsFile, vocabFile, slotLabelsFile, intentLabelsFile, addedTokensFile, option, 16); + + // 进行意图识别和槽位分析 + String[] inputTexts = new String[]{"来一首周华健的花心", "播放我们都一样", "到信阳市汽车配件城"}; + + IntentDetAndSlotFillResult[] results = predictor.predict(inputTexts); + } +} +``` + +更详细的用法请参考 [ERNIETinyMainActivity](./app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyMainActivity.java) 中的用法 + +## 替换 App 示例中的 ERNIE 3.0 Tiny 模型 + +替换 App 示例中的模型的步骤非常简单,模型所在的位置为 `app/src/main/assets/models`。替换模型之前请确保您的模型目录中包含 vocab.txt、slots_label.txt、intent_label.txt 以及 added_token.json 等意图识别和槽位分析所需要的词表和标签文件。替换模型的步骤为: + - 将您的 ERNIE 3.0 Tiny 模型放在 `app/src/main/assets/models` 目录下; + - 修改 
`app/src/main/res/values/strings.xml` 中模型路径的默认值,如: + +```xml + +models/ernie-tiny-clip-qat-embedding-int8 +``` + +## 关于 ERNIE 3.0 Tiny JNI 的封装 + +如果您对 ERNIE 3.0 Tiny JNI 封装的实现感兴趣,可以参考 [ernie_tiny_jni/predictor_jni.cc](./ernie_tiny/src/main/cpp/ernie_tiny_jni/predictor_jni.cc), 关于如何使用 JNI 进行模型封装并和 Java 通信,可以参考 [FastDeploy/use_cpp_sdk_on_android.md](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/faq/use_cpp_sdk_on_android.md) 文档中的说明。 + +## 相关文档 + +[ERNIE 3.0 Tiny 模型详细介绍](../../README.md) + +[ERNIE 3.0 Tiny 模型C++部署方法](../cpp/README.md) + +[ERNIE 3.0 Tiny 模型Python部署方法](../python/README.md) + +[FastDeploy SDK 安装文档](https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/cn/build_and_install/download_prebuilt_libraries.md) \ No newline at end of file diff --git a/model_zoo/ernie-tiny/deploy/android/app/build.gradle b/model_zoo/ernie-tiny/deploy/android/app/build.gradle new file mode 100644 index 0000000000000000000000000000000000000000..2037c756095921513c58e3c31c307b35ded93494 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/build.gradle @@ -0,0 +1,91 @@ +apply plugin: 'com.android.application' + +android { + compileSdk 28 + + defaultConfig { + applicationId 'com.baidu.paddle.paddlenlp.app' + minSdkVersion 15 + //noinspection ExpiredTargetSdkVersion + targetSdkVersion 28 + versionCode 1 + versionName "1.0" + testInstrumentationRunner "android.support.test.runner.AndroidJUnitRunner" + } + + buildTypes { + release { + minifyEnabled false + proguardFiles getDefaultProguardFile('proguard-android-optimize.txt'), 'proguard-rules.pro' + } + } +} + +dependencies { + implementation project(path: ':ernie_tiny') + implementation project(path: ':ui') + //noinspection GradleCompatible + implementation 'com.android.support:appcompat-v7:28.0.0' + //noinspection GradleDependency + implementation 'com.android.support.constraint:constraint-layout:1.1.3' + implementation 'com.android.support:design:28.0.0' + implementation 'org.jetbrains:annotations:15.0' + //noinspection GradleDependency + testImplementation 'junit:junit:4.12' + androidTestImplementation 'com.android.support.test:runner:1.0.2' + androidTestImplementation 'com.android.support.test.espresso:espresso-core:3.0.2' +} + +def FD_MODEL = [ + [ + 'src' : 'https://bj.bcebos.com/paddlehub/fastdeploy/ernie-tiny.tgz', + 'dest': 'src/main/assets/models' + ], + [ + 'src' : 'https://bj.bcebos.com/paddlehub/fastdeploy/ernie-tiny-clip.tgz', + 'dest': 'src/main/assets/models' + ], + [ + 'src' : 'https://bj.bcebos.com/paddlehub/fastdeploy/ernie-tiny-clip-qat.tgz', + 'dest': 'src/main/assets/models' + ], + [ + 'src' : 'https://bj.bcebos.com/paddlehub/fastdeploy/ernie-tiny-clip-qat-embedding-int8.tgz', + 'dest': 'src/main/assets/models' + ] + +] + +task downloadAndExtractModels(type: DefaultTask) { + doFirst { + println "[INFO] Downloading and extracting models ..." 
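+        // Note: the actual download and extraction happen in the doLast block below; each archive
+        // listed in FD_MODEL is downloaded once into ./cache and then untarred into src/main/assets/models.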
+ } + doLast { + String cachePath = "cache" + if (!file("${cachePath}").exists()) { + mkdir "${cachePath}" + } + FD_MODEL.eachWithIndex { model, index -> + String[] modelPaths = model.src.split("/") + String modelName = modelPaths[modelPaths.length - 1] + String modelPrefix = modelName.substring(0, modelName.length() - 4) + boolean copyFiles = false + if (!file("${model.dest}/${modelPrefix}").exists()) { + // Download the target model if not exists + if (!file("${cachePath}/${modelName}").exists()) { + println "[INFO] Downloading ${model.src} -> ${cachePath}/${modelName}" + ant.get(src: model.src, dest: file("${cachePath}/${modelName}")) + } + copyFiles = true + } + if (copyFiles) { + println "[INFO] Taring ${cachePath}/${modelName} -> ${model.dest}/${modelPrefix}" + copy { from(tarTree("${cachePath}/${modelName}")) into("${model.dest}") } + } else { + println "[INFO] ${model.dest}/${modelPrefix} already exists!" + } + } + } +} + +preBuild.dependsOn downloadAndExtractModels diff --git a/model_zoo/ernie-tiny/deploy/android/app/proguard-rules.pro b/model_zoo/ernie-tiny/deploy/android/app/proguard-rules.pro new file mode 100644 index 0000000000000000000000000000000000000000..481bb434814107eb79d7a30b676d344b0df2f8ce --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/proguard-rules.pro @@ -0,0 +1,21 @@ +# Add project specific ProGuard rules here. +# You can control the set of applied configuration files using the +# proguardFiles setting in build.gradle. +# +# For more details, see +# http://developer.android.com/guide/developing/tools/proguard.html + +# If your project uses WebView with JS, uncomment the following +# and specify the fully qualified class name to the JavaScript interface +# class: +#-keepclassmembers class fqcn.of.javascript.interface.for.webview { +# public *; +#} + +# Uncomment this to preserve the line number information for +# debugging stack traces. +#-keepattributes SourceFile,LineNumberTable + +# If you keep the line number information, uncomment this to +# hide the original source file name. 
+#-renamesourcefileattribute SourceFile \ No newline at end of file diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/AndroidManifest.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/AndroidManifest.xml new file mode 100644 index 0000000000000000000000000000000000000000..08851e680b01916de42e4e2af9ca6708fba14c99 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/AndroidManifest.xml @@ -0,0 +1,30 @@ + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyMainActivity.java b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyMainActivity.java new file mode 100644 index 0000000000000000000000000000000000000000..3fb229bec15f5c8790eac7f7518be42e6f606a22 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyMainActivity.java @@ -0,0 +1,241 @@ +package com.baidu.paddle.paddlenlp.app.ernie_tiny; + +import android.Manifest; +import android.annotation.SuppressLint; +import android.app.Activity; +import android.app.AlertDialog; +import android.content.DialogInterface; +import android.content.Intent; +import android.content.SharedPreferences; +import android.content.pm.PackageManager; +import android.os.Bundle; +import android.preference.PreferenceManager; +import android.support.annotation.NonNull; +import android.support.annotation.Nullable; +import android.support.v4.app.ActivityCompat; +import android.support.v4.content.ContextCompat; +import android.util.Log; +import android.view.View; +import android.widget.Button; +import android.widget.EditText; +import android.widget.ImageButton; +import android.widget.ImageView; + +import com.baidu.paddle.paddlenlp.app.R; +import com.baidu.paddle.paddlenlp.ernie_tiny.RuntimeOption; +import com.baidu.paddle.paddlenlp.ernie_tiny.Predictor; +import com.baidu.paddle.paddlenlp.ernie_tiny.IntentDetAndSlotFillResult; +import com.baidu.paddle.paddlenlp.ui.Utils; + + +public class ERNIETinyMainActivity extends Activity implements View.OnClickListener { + private static final String TAG = ERNIETinyMainActivity.class.getSimpleName(); + private ImageView back; + private ImageButton btnSettings; + private EditText etERNIETinyInput; + private EditText etERNIETinyOutput; + private Button btnERNIETinyAnalysis; + private String[] inputTexts; + + // Call 'init' and 'release' manually later + Predictor predictor = new Predictor(); + + @Override + protected void onCreate(@Nullable Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + + setContentView(R.layout.ernie_tiny_activity_main); + + // Clear all setting items to avoid app crashing due to the incorrect settings + initSettings(); + + // Check and request WRITE_EXTERNAL_STORAGE permissions + if (!checkAllPermissions()) { + requestAllPermissions(); + } + + // Init the camera preview and UI components + initView(); + } + + @Override + protected void onResume() { + super.onResume(); + // Reload settings and re-initialize the predictor + checkAndUpdateSettings(); + } + + @Override + protected void onPause() { + super.onPause(); + } + + @Override + protected void onDestroy() { + if (predictor != null) { + predictor.release(); + } + super.onDestroy(); + } + + private void initView() { + // Back from setting page to main page + back = findViewById(R.id.iv_back); + back.setOnClickListener(this); + // Apply ERNIE Tiny predict + 
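+        // Clicking this button calls extractTextsIntentAndSlot(): the input EditText is split on
+        // sentence delimiters, predictor.predict() is run on the resulting texts, and the formatted
+        // intent/slot results are written to the output EditText.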
btnERNIETinyAnalysis = findViewById(R.id.btn_ernie_tiny_analysis); + btnERNIETinyAnalysis.setOnClickListener(this); + // ERNIE Tiny input and output texts + etERNIETinyInput = findViewById(R.id.et_ernie_tiny_input); + etERNIETinyOutput = findViewById(R.id.et_ernie_tiny_output); + // Setting page + btnSettings = findViewById(R.id.btn_settings); + btnSettings.setOnClickListener(this); + } + + @SuppressLint("NonConstantResourceId") + @Override + public void onClick(View view) { + switch (view.getId()) { + case R.id.btn_settings: + startActivity(new Intent(ERNIETinyMainActivity.this, ERNIETinySettingsActivity.class)); + break; + case R.id.iv_back: + finish(); + break; + case R.id.btn_ernie_tiny_analysis: + extractTextsIntentAndSlot(); + break; + default: + break; + } + + } + + public void extractTextsIntentAndSlot() { + if (updateInputTexts()) { + IntentDetAndSlotFillResult[] results = predictor.predict(inputTexts); + updateOutputTexts(results); + } + } + + public void updateOutputTexts(IntentDetAndSlotFillResult[] results) { + if (results == null) { + etERNIETinyOutput.setText("分析结果为空"); + return; + } + if (inputTexts == null) { + etERNIETinyOutput.setText("输入文本为空"); + return; + } + if (inputTexts.length != results.length) { + String info = "输入文本数量与分析结果数量不一致!" + + inputTexts.length + "!=" + results.length; + etERNIETinyOutput.setText(info); + return; + } + // Merge Result Str + StringBuilder resultStrBuffer = new StringBuilder(); + for (int i = 0; i < results.length; ++i) { + resultStrBuffer + .append("NO.") + .append(i) + .append(" text = ") + .append(inputTexts[i]) + .append("\n") + .append(results[i].mStr) + .append("\n"); + } + // Update output text view (EditText) + etERNIETinyOutput.setText(resultStrBuffer.toString()); + } + + public boolean updateInputTexts() { + String combinedInputText = etERNIETinyInput.getText().toString(); + if (combinedInputText == null || combinedInputText.length() == 0) { + // Use default text if no custom text + combinedInputText = getString(R.string.ERNIE_TINY_INPUT_TEXTS_DEFAULT); + } + String[] texts = combinedInputText.split("[。!!:;:;]"); + if (texts.length <= 0) { + return false; + } + for (int i = 0; i < texts.length; ++i) { + texts[i] = texts[i].trim(); + } + // Update input texts + inputTexts = texts; + return true; + } + + @SuppressLint("ApplySharedPref") + public void initSettings() { + SharedPreferences sharedPreferences = PreferenceManager.getDefaultSharedPreferences(this); + SharedPreferences.Editor editor = sharedPreferences.edit(); + editor.clear(); + editor.commit(); + ERNIETinySettingsActivity.resetSettings(); + } + + public void checkAndUpdateSettings() { + if (ERNIETinySettingsActivity.checkAndUpdateSettings(this)) { + // Clear output text first + etERNIETinyOutput.setText(""); + + // Update predictor + String realModelDir = getCacheDir() + "/" + ERNIETinySettingsActivity.modelDir; + Utils.copyDirectoryFromAssets(this, ERNIETinySettingsActivity.modelDir, realModelDir); + + String modelFile = realModelDir + "/" + "infer_model.pdmodel"; + String paramsFile = realModelDir + "/" + "infer_model.pdiparams"; + String vocabFile = realModelDir + "/" + "vocab.txt"; + String slotLabelsFile = realModelDir + "/" + "slots_label.txt"; + String intentLabelsFile = realModelDir + "/" + "intent_label.txt"; + String addedTokensFile = realModelDir + "/" + "added_tokens.json"; + RuntimeOption option = new RuntimeOption(); + option.setCpuThreadNum(ERNIETinySettingsActivity.cpuThreadNum); + option.setLitePowerMode(ERNIETinySettingsActivity.cpuPowerMode); + if 
(Boolean.parseBoolean(ERNIETinySettingsActivity.enableLiteInt8)) { + option.enableLiteInt8(); // For quantized models + } else { + // Enable FP16 if Int8 option is not ON. + if (Boolean.parseBoolean(ERNIETinySettingsActivity.enableLiteFp16)) { + option.enableLiteFp16(); + } + } + predictor.init(modelFile, paramsFile, vocabFile, slotLabelsFile, + intentLabelsFile, addedTokensFile, option, 16); + } + } + + @Override + public void onRequestPermissionsResult(int requestCode, @NonNull String[] permissions, + @NonNull int[] grantResults) { + super.onRequestPermissionsResult(requestCode, permissions, grantResults); + if (grantResults[0] != PackageManager.PERMISSION_GRANTED || grantResults[1] != PackageManager.PERMISSION_GRANTED) { + new AlertDialog.Builder(ERNIETinyMainActivity.this) + .setTitle("Permission denied") + .setMessage("Click to force quit the app, then open Settings->Apps & notifications->Target " + + "App->Permissions to grant all of the permissions.") + .setCancelable(false) + .setPositiveButton("Exit", new DialogInterface.OnClickListener() { + @Override + public void onClick(DialogInterface dialog, int which) { + ERNIETinyMainActivity.this.finish(); + } + }).show(); + } + } + + private void requestAllPermissions() { + ActivityCompat.requestPermissions( + this, new String[]{Manifest.permission.WRITE_EXTERNAL_STORAGE}, + 0); + } + + private boolean checkAllPermissions() { + return ContextCompat.checkSelfPermission(this, Manifest.permission.WRITE_EXTERNAL_STORAGE) + == PackageManager.PERMISSION_GRANTED; + } + +} diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinySettingsActivity.java b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinySettingsActivity.java new file mode 100644 index 0000000000000000000000000000000000000000..3e4c7cc394fde9db31eb83ede303b754ad775083 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinySettingsActivity.java @@ -0,0 +1,133 @@ +package com.baidu.paddle.paddlenlp.app.ernie_tiny; + +import android.annotation.SuppressLint; +import android.content.Context; +import android.content.SharedPreferences; +import android.os.Bundle; +import android.preference.ListPreference; +import android.preference.PreferenceManager; +import android.support.v7.app.ActionBar; + +import com.baidu.paddle.paddlenlp.app.R; +import com.baidu.paddle.paddlenlp.ui.view.AppCompatPreferenceActivity; + + +public class ERNIETinySettingsActivity extends AppCompatPreferenceActivity implements + SharedPreferences.OnSharedPreferenceChangeListener { + private static final String TAG = ERNIETinySettingsActivity.class.getSimpleName(); + static public String modelDir = ""; + static public int cpuThreadNum = 2; + static public String cpuPowerMode = ""; + static public String enableLiteFp16 = "false"; + static public String enableLiteInt8 = "true"; + + ListPreference lpChoosePreInstalledModel = null; + ListPreference lpCPUThreadNum = null; + ListPreference lpCPUPowerMode = null; + ListPreference lpEnableLiteFp16 = null; + ListPreference lpEnableLiteInt8 = null; + + @Override + public void onCreate(Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + addPreferencesFromResource(R.xml.ernie_tiny_settings); + ActionBar supportActionBar = getSupportActionBar(); + if (supportActionBar != null) { + supportActionBar.setDisplayHomeAsUpEnabled(true); + } + + // Setup UI components + lpChoosePreInstalledModel 
= + (ListPreference) findPreference(getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_KEY)); + lpCPUThreadNum = (ListPreference) findPreference(getString(R.string.CPU_THREAD_NUM_KEY)); + lpCPUPowerMode = (ListPreference) findPreference(getString(R.string.CPU_POWER_MODE_KEY)); + lpEnableLiteFp16 = (ListPreference) findPreference(getString(R.string.ENABLE_LITE_FP16_MODE_KEY)); + lpEnableLiteInt8 = (ListPreference) findPreference(getString(R.string.ENABLE_LITE_INT8_MODE_KEY)); + } + + @SuppressLint("ApplySharedPref") + private void reloadSettingsAndUpdateUI() { + SharedPreferences sharedPreferences = getPreferenceScreen().getSharedPreferences(); + + String model_dir = sharedPreferences.getString(getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_KEY), + getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_DEFAULT)); + + String cpu_thread_num = sharedPreferences.getString(getString(R.string.CPU_THREAD_NUM_KEY), + getString(R.string.CPU_THREAD_NUM_DEFAULT)); + String cpu_power_mode = sharedPreferences.getString(getString(R.string.CPU_POWER_MODE_KEY), + getString(R.string.CPU_POWER_MODE_DEFAULT)); + String enable_lite_fp16 = sharedPreferences.getString(getString(R.string.ENABLE_LITE_FP16_MODE_KEY), + getString(R.string.ENABLE_LITE_FP16_MODE_DEFAULT)); + String enable_lite_int8 = sharedPreferences.getString(getString(R.string.ENABLE_LITE_INT8_MODE_KEY), + getString(R.string.ENABLE_LITE_FP16_MODE_DEFAULT)); + + lpChoosePreInstalledModel.setSummary(model_dir); + lpChoosePreInstalledModel.setValue(model_dir); + lpCPUThreadNum.setValue(cpu_thread_num); + lpCPUThreadNum.setSummary(cpu_thread_num); + lpCPUPowerMode.setValue(cpu_power_mode); + lpCPUPowerMode.setSummary(cpu_power_mode); + lpEnableLiteFp16.setValue(enable_lite_fp16); + lpEnableLiteFp16.setSummary(enable_lite_fp16); + lpEnableLiteInt8.setValue(enable_lite_int8); + lpEnableLiteInt8.setSummary(enable_lite_int8); + } + + static boolean checkAndUpdateSettings(Context ctx) { + boolean settingsChanged = false; + SharedPreferences sharedPreferences = PreferenceManager.getDefaultSharedPreferences(ctx); + + String model_dir = sharedPreferences.getString(ctx.getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_KEY), + ctx.getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_DEFAULT)); + settingsChanged |= !modelDir.equalsIgnoreCase(model_dir); + modelDir = model_dir; + + String cpu_thread_num = sharedPreferences.getString(ctx.getString(R.string.CPU_THREAD_NUM_KEY), + ctx.getString(R.string.CPU_THREAD_NUM_DEFAULT)); + settingsChanged |= cpuThreadNum != Integer.parseInt(cpu_thread_num); + cpuThreadNum = Integer.parseInt(cpu_thread_num); + + String cpu_power_mode = sharedPreferences.getString(ctx.getString(R.string.CPU_POWER_MODE_KEY), + ctx.getString(R.string.CPU_POWER_MODE_DEFAULT)); + settingsChanged |= !cpuPowerMode.equalsIgnoreCase(cpu_power_mode); + cpuPowerMode = cpu_power_mode; + + String enable_lite_fp16 = sharedPreferences.getString(ctx.getString(R.string.ENABLE_LITE_FP16_MODE_KEY), + ctx.getString(R.string.ENABLE_LITE_FP16_MODE_DEFAULT)); + settingsChanged |= !enableLiteFp16.equalsIgnoreCase(enable_lite_fp16); + enableLiteFp16 = enable_lite_fp16; + + String enable_lite_int8 = sharedPreferences.getString(ctx.getString(R.string.ENABLE_LITE_INT8_MODE_KEY), + ctx.getString(R.string.ENABLE_LITE_INT8_MODE_DEFAULT)); + settingsChanged |= !enableLiteInt8.equalsIgnoreCase(enable_lite_int8); + enableLiteInt8 = enable_lite_int8; + + return settingsChanged; + } + + static void resetSettings() { + modelDir = ""; + cpuThreadNum = 2; + cpuPowerMode = ""; + enableLiteFp16 = "false"; + 
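+        // Int8 defaults to "true"; when both flags are set, Int8 takes precedence over FP16
+        // (see checkAndUpdateSettings in ERNIETinyMainActivity).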
enableLiteInt8 = "true"; + } + + @Override + protected void onResume() { + super.onResume(); + getPreferenceScreen().getSharedPreferences().registerOnSharedPreferenceChangeListener(this); + reloadSettingsAndUpdateUI(); + } + + @Override + protected void onPause() { + super.onPause(); + getPreferenceScreen().getSharedPreferences().unregisterOnSharedPreferenceChangeListener(this); + } + + @Override + public void onSharedPreferenceChanged(SharedPreferences sharedPreferences, String key) { + reloadSettingsAndUpdateUI(); + } +} diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyWelcomeActivity.java b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyWelcomeActivity.java new file mode 100644 index 0000000000000000000000000000000000000000..9d9aace3594caaf8bb5628185c31b9a813a29399 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/java/com/baidu/paddle/paddlenlp/app/ernie_tiny/ERNIETinyWelcomeActivity.java @@ -0,0 +1,30 @@ +package com.baidu.paddle.paddlenlp.app.ernie_tiny; + +import android.app.Activity; +import android.content.Intent; +import android.graphics.Color; +import android.os.Build; +import android.os.Bundle; +import android.support.annotation.Nullable; +import android.view.View; + +import com.baidu.paddle.paddlenlp.app.R; + +public class ERNIETinyWelcomeActivity extends Activity { + @Override + protected void onCreate(@Nullable Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + if (Build.VERSION.SDK_INT > Build.VERSION_CODES.LOLLIPOP) { + getWindow().getDecorView().setSystemUiVisibility(View.SYSTEM_UI_FLAG_LAYOUT_FULLSCREEN + | View.SYSTEM_UI_FLAG_LAYOUT_STABLE + ); + getWindow().setStatusBarColor(Color.TRANSPARENT); + } + setContentView(R.layout.ernie_tiny_welcome); + } + + public void startActivity(View view) { + Intent intent = new Intent(ERNIETinyWelcomeActivity.this, ERNIETinyMainActivity.class); + startActivity(intent); + } +} diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/ic_launcher_foreground.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/ic_launcher_foreground.xml new file mode 100644 index 0000000000000000000000000000000000000000..1f6bb290603d7caa16c5fb6f61bbfdc750622f5c --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/ic_launcher_foreground.xml @@ -0,0 +1,34 @@ + + + + + + + + + + + diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/round_corner_btn.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/round_corner_btn.xml new file mode 100644 index 0000000000000000000000000000000000000000..c5dcc45d56375ae8bfad057aea837a1d34c6aac2 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-v24/round_corner_btn.xml @@ -0,0 +1,10 @@ + + + + + \ No newline at end of file diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/back_btn.png b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/back_btn.png new file mode 100644 index 0000000000000000000000000000000000000000..ff121e85f5614dfd022f39627028af825a46d683 Binary files /dev/null and b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/back_btn.png differ diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/more_menu.png b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/more_menu.png new file mode 100644 
index 0000000000000000000000000000000000000000..edf9f3ccced5afeb71d9516d93ea19f26c7d9984 Binary files /dev/null and b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable-xhdpi/more_menu.png differ diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings.xml new file mode 100644 index 0000000000000000000000000000000000000000..917897b99981d18082d18a87a4ad5176ad8e8f8d --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings.xml @@ -0,0 +1,6 @@ + + + + + + diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_default.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_default.xml new file mode 100644 index 0000000000000000000000000000000000000000..e19589a97e419249eaacd05f3d75deeeada3e128 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_default.xml @@ -0,0 +1,13 @@ + + + + diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_pressed.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_pressed.xml new file mode 100644 index 0000000000000000000000000000000000000000..c4af2a042de3a8ae00ab253f889a20dedffa4874 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/btn_settings_pressed.xml @@ -0,0 +1,13 @@ + + + + diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/ic_launcher_background.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/ic_launcher_background.xml new file mode 100644 index 0000000000000000000000000000000000000000..0d025f9bf6b67c63044a36a9ff44fbc69e5c5822 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/ic_launcher_background.xml @@ -0,0 +1,170 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/main_bk.png b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/main_bk.png new file mode 100644 index 0000000000000000000000000000000000000000..1ff9457d4d74c9b486c6c5094d3396aaea417d1c Binary files /dev/null and b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/main_bk.png differ diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/paddle_logo.png b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/paddle_logo.png new file mode 100644 index 0000000000000000000000000000000000000000..bc1135abfab7aa48f29392da4bca614f688314af Binary files /dev/null and b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/drawable/paddle_logo.png differ diff --git a/model_zoo/ernie-tiny/deploy/android/app/src/main/res/layout/ernie_tiny_activity_main.xml b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/layout/ernie_tiny_activity_main.xml new file mode 100644 index 0000000000000000000000000000000000000000..27e469bb1563af7bf6817de8ba71cfb131384124 --- /dev/null +++ b/model_zoo/ernie-tiny/deploy/android/app/src/main/res/layout/ernie_tiny_activity_main.xml @@ -0,0 +1,84 @@ + + + + + + + + + + + + + + + + +